sloria / textblob

Simple, Pythonic text processing: sentiment analysis, part-of-speech tagging, noun phrase extraction, translation, and more.

Home Page: https://textblob.readthedocs.io/

License: MIT License

Python 100.00%
nlp nltk pattern python python-3 natural-language-processing

textblob's Introduction

TextBlob: Simplified Text Processing


Homepage: https://textblob.readthedocs.io/

TextBlob is a Python library for processing textual data. It provides a simple API for diving into common natural language processing (NLP) tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, and more.

from textblob import TextBlob

text = """
The titular threat of The Blob has always struck me as the ultimate movie
monster: an insatiably hungry, amoeba-like mass able to penetrate
virtually any safeguard, capable of--as a doomed doctor chillingly
describes it--"assimilating flesh on contact.
Snide comparisons to gelatin be damned, it's a concept with the most
devastating of potential consequences, not unlike the grey goo scenario
proposed by technological theorists fearful of
artificial intelligence run rampant.
"""

blob = TextBlob(text)
blob.tags  # [('The', 'DT'), ('titular', 'JJ'),
#  ('threat', 'NN'), ('of', 'IN'), ...]

blob.noun_phrases  # WordList(['titular threat', 'blob',
#            'ultimate movie monster',
#            'amoeba-like mass', ...])

for sentence in blob.sentences:
    print(sentence.sentiment.polarity)
# 0.060
# -0.341

TextBlob stands on the giant shoulders of NLTK and pattern, and plays nicely with both.

Features

  • Noun phrase extraction
  • Part-of-speech tagging
  • Sentiment analysis
  • Classification (Naive Bayes, Decision Tree)
  • Tokenization (splitting text into words and sentences)
  • Word and phrase frequencies
  • Parsing
  • n-grams
  • Word inflection (pluralization and singularization) and lemmatization
  • Spelling correction
  • Add new models or languages through extensions
  • WordNet integration

Get it now

$ pip install -U textblob
$ python -m textblob.download_corpora

Examples

See more examples at the Quickstart guide.

Documentation

Full documentation is available at https://textblob.readthedocs.io/.

Project Links

License

MIT licensed. See the bundled LICENSE file for more details.

textblob's People

Contributors

adelq, adrianlc, casatir, code-inflation, cool-rr, danong, davidnk, dependabot-preview[bot], dependabot-support, dependabot[bot], epicjhon, evandempsey, jammmo, jeffakolb, jschnurr, lragnarsson, mrchilds, nikulshr, pavelmalai, peterkeen, pre-commit-ci[bot], pyup-bot, raybellwaves, romanyankovsky, sloria, sudoguy, timgates42, tirkarthi, tylerjharden


textblob's Issues

Add parsing with NLTK

TextBlob currently uses pattern's parser. Would be nice to have a parsers module that includes various parser implementations, such as those in NLTK.
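A pluggable parsers module might look something like the sketch below. All names here are illustrative, not TextBlob's actual API: a small base class defines the interface, and pattern- or NLTK-backed implementations would subclass it.

```python
class BaseParser(object):
    """Hypothetical abstract parser: subclasses implement parse(text)."""

    def parse(self, text):
        raise NotImplementedError("Subclasses must implement parse()")


class WhitespaceParser(BaseParser):
    """Trivial stand-in implementation that normalizes whitespace.

    A real NLTKParser or PatternParser subclass would return a full
    syntactic parse instead.
    """

    def parse(self, text):
        return " ".join(text.split())


parser = WhitespaceParser()
print(parser.parse("The  quick   brown fox"))  # The quick brown fox
```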

Attempting to import textblob from within celery task results in SIGSEGV.

I'm attempting to use TextBlob within a Celery task. I've isolated the line

import textblob

as causing a SIGSEGV fault, which makes the worker process exit prematurely. The exact same code works fine when run in-process (in this case, in the Django dev server).

I would supply more of a stack trace, but there doesn't seem to be one available since it's a system-level, fatal error. The import statement is definitely where it happens, however.

If it's a misconfiguration / misuse on my end, any help is appreciated. Thanks!

translate proxy

Hi, can you add an optional proxy parameter for translate in textblob/translate.py's _get_json5?
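One way such a parameter could work is by routing requests through urllib's ProxyHandler. The following is only an illustrative sketch of that approach (the function name and signature are hypothetical, not TextBlob's actual API):

```python
import urllib.request

def build_opener_with_proxy(proxy=None):
    """Hypothetical sketch: return a urllib opener, optionally routed
    through a proxy. `proxy` is a URL such as "http://127.0.0.1:8080";
    when None, the default opener behaviour is unchanged."""
    handlers = []
    if proxy is not None:
        handlers.append(
            urllib.request.ProxyHandler({"http": proxy, "https": proxy})
        )
    return urllib.request.build_opener(*handlers)

# A translate helper could then call opener.open(req) instead of
# urllib.request.urlopen(req).
opener = build_opener_with_proxy("http://127.0.0.1:8080")
```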

Add Greedy Average Perceptron POS tagger

Hi,

I'm preparing a pull request for you, for a new POS tagger. This is the first time I've tried to contribute to someone else's project, so probably there'll be some weird teething pain stuff. Also I spend all day writing research code, so maybe parts of my style are atrocious :p.

The two main files are:

https://github.com/syllog1sm/TextBlob/blob/feature/greedy_ap_tagger/text/taggers.py
https://github.com/syllog1sm/TextBlob/blob/feature/greedy_ap_tagger/text/_perceptron.py

I'm not quite done, but it's passing tests and its numbers are much better than the taggers you currently have hooks for:

NLTKTagger: 94.0 / 3m52
PatternTagger: 93.5 / 26s
PerceptronTagger: 96.8 / 16s

Accuracy figures refer to sections 22-24 of the Wall Street Journal, a common English evaluation. There's a table of some accuracies from the literature here: http://aclweb.org/aclwiki/index.php?title=POS_Tagging_(State_of_the_art) . Speeds refer to time taken to tag the 129,654 words of input, including initialisation, on my Macbook Air.

If you check out that link, you'll see that the tagger's about 1% short of the pace for state-of-the-art accuracy. My Cython implementation has slightly better results, about 97.1, and it's a fair bit faster too. It's not very difficult to add some of the extra features to the Python implementation, or to improve its efficiency. Or we could hook in the Cython implementation, although that comes with much more baggage.

I think it's nice having the tagger in ~200 lines of pure Python though, with no dependencies. It should be fairly language independent too --- I'll run some tests to see how it does.

zipfile.BadZipfile Error encountered when attempting to access blob.noun_phrases

Having just gotten the scikit-learn package satisfied by installing the appropriate dependencies, I ran into another issue while working through the quickstart tutorial. I get this error whenever I try to access the noun_phrases of a blob. This is probably more an issue with my machine than with TextBlob, but I thought I would file it to learn why it is happening. Thanks.

>>> wiki = TextBlob("Python is a high-level, general-purpose programming language.")
>>> wiki.noun_phrases
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/text/decorators.py", line 23, in __get__
    value = obj.__dict__[self.func.__name__] = self.func(obj)
  File "/usr/local/lib/python2.7/dist-packages/text/blob.py", line 408, in noun_phrases
    for phrase in self.np_extractor.extract(self.raw)
  File "/usr/local/lib/python2.7/dist-packages/text/en/np_extractors.py", line 136, in extract
    self.train()
  File "/usr/local/lib/python2.7/dist-packages/text/decorators.py", line 33, in decorated
    return func(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/text/en/np_extractors.py", line 107, in train
    train_data = nltk.corpus.brown.tagged_sents(categories='news')
  File "/usr/local/lib/python2.7/dist-packages/text/nltk/corpus/util.py", line 95, in __getattr__
    self.__load()
  File "/usr/local/lib/python2.7/dist-packages/text/nltk/corpus/util.py", line 57, in __load
    root = nltk.data.find('corpora/%s' % self.__name)
  File "/usr/local/lib/python2.7/dist-packages/text/nltk/data.py", line 602, in find
    return find(modified_name, paths)
  File "/usr/local/lib/python2.7/dist-packages/text/nltk/data.py", line 589, in find
    return ZipFilePathPointer(p, zipentry)
  File "/usr/local/lib/python2.7/dist-packages/text/nltk/compat.py", line 187, in _decorator
    return init_func(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/text/nltk/data.py", line 446, in __init__
    zipfile = OpenOnDemandZipFile(os.path.abspath(zipfile))
  File "/usr/local/lib/python2.7/dist-packages/text/nltk/compat.py", line 187, in _decorator
    return init_func(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/text/nltk/data.py", line 931, in __init__
    zipfile.ZipFile.__init__(self, filename)
  File "/usr/lib/python2.7/zipfile.py", line 766, in __init__
    self._RealGetContents()
  File "/usr/lib/python2.7/zipfile.py", line 807, in _RealGetContents
    raise BadZipfile, "File is not a zip file"
zipfile.BadZipfile: File is not a zip file

Default POS-tag in word.lemma property

Word class has pos_tag property. When this property is not empty, it makes sense to use its value instead of default ("NOUN") in word.lemma.

I could implement this, but I have no clue how to convert a "corpus" POS tag into a WordNet POS tag. As far as I know, different corpora use different tag sets. Do you think there is a way to implement a common function to translate tags to WordNet format?
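For Penn Treebank tags (the set the default tagger emits), a common approach is to map on the tag prefix, since WordNet only distinguishes four parts of speech. A minimal sketch of such a function (not part of TextBlob):

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank POS tag to a WordNet POS constant.

    WordNet only knows nouns ('n'), verbs ('v'), adjectives ('a'),
    and adverbs ('r'); everything else falls back to noun, which is
    also the lemmatizer's usual default.
    """
    if tag.startswith("JJ"):
        return "a"   # adjective: JJ, JJR, JJS
    if tag.startswith("VB"):
        return "v"   # verb: VB, VBD, VBG, VBN, VBP, VBZ
    if tag.startswith("RB"):
        return "r"   # adverb: RB, RBR, RBS
    return "n"       # noun (default)

print(penn_to_wordnet("VBD"))  # v
```

Tag sets other than Penn Treebank would need their own mapping table, which is exactly the difficulty the issue raises.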

dealing with characters such as "&" while translating

Here is an issue I found while practising with TextBlob. The following small snippet illustrates it:

a = u"Jenner & Block LLP"
b = TextBlob(a)
c = b.translate(to="de")
print c

The output is "Jenner \u0026 Block LLP"
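The "\u0026" in that output is a literal escape sequence rather than the "&" character. As a workaround, one might decode the returned string with the unicode_escape codec; this is only a sketch of that idea, and it is safe only when the payload is ASCII (the .encode("ascii") step would fail on accented characters):

```python
# What translate() returned: a literal backslash-u escape, not "&".
result = r"Jenner \u0026 Block LLP"

# Decoding with 'unicode_escape' turns \uXXXX sequences back into
# the characters they name.
fixed = result.encode("ascii").decode("unicode_escape")
print(fixed)  # Jenner & Block LLP
```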

Develop a better system for distributing pickled models and other extra data

Currently, @syllog1sm's trontagger.pickle file is distributed using Github's Releases, which allows appending of binary files to a release.

While this is fine in the short term, there is a 5MB limit on appended files, which will likely be too small for the long term.

I am not sure of the best way to do this and I welcome suggestions.

Also, the process of installing the pickled models is very manual, i.e. saving the files to the TextBlob installation path. It would be nice to automate this in some way, possibly through a downloader module, similar to NLTK.

>>> from text.downloader import download
>>> download("trontagger")
Successfully downloaded trontagger.pickle

Error importing TextBlob due to nltk

I am running Python 3.3.2, and installed TextBlob 0.7.1. However, I am not able to import TextBlob, and it seems the problem is with nltk. Please see the error below

from text.blob import TextBlob
Traceback (most recent call last):
File "<pyshell#6>", line 1, in
from text.blob import TextBlob
File "C:\Python33\lib\site-packages\text\blob.py", line 28, in
from text.packages import nltk
File "C:\Python33\lib\site-packages\text\packages.py", line 12, in
import nltk
File "C:\Python33\lib\site-packages\text\nltk__init__.py", line 116, in
from . import ccg
ImportError: cannot import name ccg

I know TextBlob bundles an NLTK installation, but it seems to be an import error from nltk, because when I import nltk I get the same error:

import nltk
Traceback (most recent call last):
File "<pyshell#10>", line 1, in <module>
import nltk
File "C:\Python33\lib\site-packages\text\nltk\__init__.py", line 116, in <module>
from . import ccg
ImportError: cannot import name ccg

Appreciate any thoughts you might have on how I can fix this.

Cheers
Ram

since 0.7.1 having trouble with the package

On both my mac and linux machines I have the same problem with 0.7.1

from text.blob import TextBlob
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "text.py", line 5, in <module>
from text.blob import TextBlob
ImportError: No module named blob

my sys.path does not contain the textblob module

import sys
for p in sys.path:
... print p
...

/Library/Python/2.7/site-packages/ipython-2.0.0_dev-py2.7.egg
/Library/Python/2.7/site-packages/matplotlib-1.3.0-py2.7-macosx-10.8-intel.egg
/Library/Python/2.7/site-packages/numpy-1.9.0.dev_fde3dee-py2.7-macosx-10.8-x86_64.egg
/Library/Python/2.7/site-packages/pandas-0.12.0_485_g02612c3-py2.7-macosx-10.8-x86_64.egg
/Library/Python/2.7/site-packages/pymc-2.3a-py2.7-macosx-10.8-x86_64.egg
/Library/Python/2.7/site-packages/scikit_learn-0.14_git-py2.7-macosx-10.8-x86_64.egg
/Library/Python/2.7/site-packages/scipy-0.14.0.dev_4938da3-py2.7-macosx-10.8-x86_64.egg
/Library/Python/2.7/site-packages/statsmodels-0.6.0-py2.7-macosx-10.8-x86_64.egg
/Library/Python/2.7/site-packages/readline-6.2.4.1-py2.7-macosx-10.7-intel.egg
/Library/Python/2.7/site-packages/nose-1.3.0-py2.7.egg
/Library/Python/2.7/site-packages/six-1.4.1-py2.7.egg
/Library/Python/2.7/site-packages/pyparsing-1.5.7-py2.7.egg
/Library/Python/2.7/site-packages/pytz-2013.7-py2.7.egg
/Library/Python/2.7/site-packages/pyzmq-13.1.0-py2.7-macosx-10.6-intel.egg
/Library/Python/2.7/site-packages/pika-0.9.13-py2.7.egg
/Library/Python/2.7/site-packages/Jinja2-2.7.1-py2.7.egg
/Library/Python/2.7/site-packages/MarkupSafe-0.18-py2.7-macosx-10.8-intel.egg
/Library/Python/2.7/site-packages/patsy-0.2.1-py2.7.egg
/Library/Python/2.7/site-packages/Pygments-1.6-py2.7.egg
/Library/Python/2.7/site-packages/Sphinx-1.2b3-py2.7.egg
/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python27.zip
/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7
/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/plat-darwin
/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/plat-mac
/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/plat-mac/lib-scriptpackages
/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python
/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/lib-tk
/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/lib-old
/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/lib-dynload
/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/PyObjC
/Library/Python/2.7/site-packages

despite it being there. I have uninstalled and reinstalled and tried all sorts of things:

mbpdar:deaas daren$ ls /Library/Python/2.7/site-packages/te*
/Library/Python/2.7/site-packages/text:
/Library/Python/2.7/site-packages/textblob-0.7.1-py2.7.egg-info:

I've verified that __init__.py doesn't have odd characters. If I change to the /Library/Python/2.7/site-packages/text folder, I am able to import:

mbpdar:deaas daren$ cd /Library/Python/2.7/site-packages/text
mbpdar:text daren$ python
Python 2.7.2 (default, Oct 11 2012, 20:14:37)
[GCC 4.2.1 Compatible Apple Clang 4.0 (tags/Apple/clang-418.0.60)] on darwin
Type "help", "copyright", "credits" or "license" for more information.

from text.blob import TextBlob

I cannot figure out what changed that might cause this.

Thanks in advance
Daren

Train classifiers from CSV and JSON files

It would be nice to train classifiers directly from files in CSV, TSV, and JSON format.

from text.classifiers import NaiveBayesClassifier

cl = NaiveBayesClassifier("sent_train.csv", format="csv")
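Behind such an API, the CSV case could be little more than reading rows into (text, label) pairs with the stdlib csv module. A minimal sketch (the helper name is hypothetical; the resulting list is the same shape the classifier constructor already accepts):

```python
import csv
import io

def load_training_csv(fileobj):
    """Sketch: read rows of `text,label` into (text, label) tuples
    suitable for passing to NaiveBayesClassifier(train)."""
    return [(row[0], row[1]) for row in csv.reader(fileobj) if row]

# Demo with an in-memory file standing in for sent_train.csv:
sample = io.StringIO(
    "I love this sandwich.,pos\n"
    "This is an amazing place!,pos\n"
    "I do not like this restaurant.,neg\n"
)
train = load_training_csv(sample)
print(train[0])  # ('I love this sandwich.', 'pos')
```

TSV would swap in csv.reader(fileobj, delimiter="\t"), and JSON would map a list of objects to the same tuples.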

UnicodeEncodeError using TextBlob.tags on Python 2.7

Using Python 2.7.5 with TextBlob 0.6.1 on OSX, it is throwing a UnicodeEncodeError exception when using TextBlob.tags.

noun_phrases works as expected:

from __future__ import unicode_literals
from text.blob import TextBlob
text = 'Learn how to make the five classic French mother sauces: Béchamel, Tomato Sauce, Espagnole, Velouté and Hollandaise.'
blob1 = TextBlob(text)
blob1.noun_phrases
# WordList([u'learn', u'classic french mother sauces', u'b\xe9chamel', u'tomato sauce', u'espagnole', u'velout\xe9', u'hollandaise'])

(same result without the future import using u'string')

But tags does not:

blob1.tags
# UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 93: ordinal not in range(128)

FWIW, using the NLTK pos tagger (from the bundled NLTK package) does work.

import nltk
tokens = nltk.word_tokenize(text)
nltk.pos_tag(tokens)
....
#  (u'sauces', 'NNS'),
#  (u':', ':'),
#  (u'B\xe9chamel', 'JJ'),
#  (u',', ','),
#  (u'Tomato', 'NNP'),
#  (u'Sauce', 'NNP'),
#  (u',', ','),
#  (u'Espagnole', 'NNP'),
#  (u',', ','),
#  (u'Velout\xe9', 'NNP')
...

Love the library and the API you've created, very useful and easier to use than NLTK. I put the full session in gist here

Start and end indices for sentences not computed correctly

The start and end indices are not being computed correctly for blobs that have more than one sentence:

>>> from text.blob import TextBlob as tb
>>> blob = tb("Hello world. How do you do?")
>>> sent1 = blob.sentences[0]
>>> sent2 = blob.sentences[1]
>>> print blob[sent1.start:sent1.end]
Hello world.
>>> print blob[sent2.start:sent2.end]
 How do you do

The second sentence should be "How do you do?".

UnicodeDecodeError when using PatternTagger on Windows

From this StackOverflow question

It appears that text/_text.py line 339 is causing this error. en-lexicon.txt, en-context.txt, and en-entities.txt are being opened without specifying which encoding to use. Therefore, the platform default (apparently cp1252 on Windows) is used.

This might be solved by finding out how those text files are encoded and then using a Py2/3-compatible open(path, mode, encoding) function to open them.
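io.open is one such Py2/3-compatible function: it exists on both Python 2 and 3 and takes an explicit encoding argument, so the platform default (cp1252 on Windows) is never consulted. A small self-contained demonstration, using a throwaway file in place of en-lexicon.txt:

```python
import io
import os
import tempfile

# Write and re-read a file with an explicit encoding; the accented
# character survives regardless of the platform's default codec.
path = os.path.join(tempfile.mkdtemp(), "en-lexicon-demo.txt")
with io.open(path, "w", encoding="utf-8") as f:
    f.write(u"caf\xe9 NN\n")

with io.open(path, "r", encoding="utf-8") as f:
    line = f.read()

print(line.strip())  # café NN
```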

[feature] Train your own sentiment analyzer

Would be nice to have an easy way to create a custom sentiment analyzer trained on your own corpus. I welcome ideas on how to do this.

I imagine an API, something like:

>>> from text.blob import TextBlob as tb
>>> from text.sentiments import CustomAnalyzer
>>> sa = CustomAnalyzer()
>>> sa.train_from_csv("./tweets.csv")
>>> blob = tb("This pizza is amazing!", analyzer=sa)
>>> blob.sentiment
('pos',  0.7, 0.3)

Need stronger warnings about using this with Python 2

This library is very vulnerable to Python 2's character-encoding warts.

Example:

from __future__ import unicode_literals

s = 'Jimmy always takes his title of \u201cmanager\u201d too seriously \u2013 a fact not lost on his peers.'
TextBlob(s) # UnicodeEncodeError

I hacked around a bit, including importing unicode_literals in text.blob, but then IPython had issues. I don't think it's an issue with your library but I think your docs should mention it.

Preserve linebreaks after spelling correction

Spelling correction currently doesn't preserve line breaks. Instead, newlines become spaces.

>>> from text.blob import TextBlob as tb
>>> b = tb("I have\ngood speling.")
>>> print(b.correct())
I have good spelling.

It would be nice if newlines were preserved.

>>> print(b.correct())
I have\ngood spelling.
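Until correct() preserves newlines itself, one workaround is to run the corrector on each line separately, so the newlines are never part of the corrected text. A sketch, where correct_line stands in for any line-level corrector (e.g. lambda line: str(TextBlob(line).correct())):

```python
def correct_preserving_newlines(text, correct_line):
    """Workaround sketch: apply a line-level corrector to each line
    individually, then rejoin, so linebreaks survive untouched."""
    return "\n".join(correct_line(line) for line in text.split("\n"))

# Demo with a toy corrector that only knows one misspelling:
toy = lambda line: line.replace("speling", "spelling")
print(correct_preserving_newlines("I have\ngood speling.", toy))
# I have
# good spelling.
```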

Rename package while retaining backwards compatibility

text is not a particularly good package name because it can easily clash with other namespaces called "text". This has already caused problems for some users ( #35 and here).

It would be good to transition TextBlob to a new package name, while still retaining backwards compatibility until the 1.0 release.

from textblob import TextBlob
from textblob.classify import NaiveBayesClassifier
from text.blob import TextBlob
# show DeprecationWarning
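The old import path could be kept alive with a shim that warns and re-exports. The sketch below only illustrates the warning mechanism; it is not the actual contents of text/blob.py:

```python
import warnings

def _deprecated_import(old, new):
    """Sketch of the warning a backwards-compat shim module would
    emit at import time before re-exporting the new names."""
    warnings.warn(
        "{0} is deprecated; import from {1} instead".format(old, new),
        DeprecationWarning,
        stacklevel=2,
    )

# Demonstrate that importing through the shim raises the warning:
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    _deprecated_import("text.blob", "textblob")

print(caught[0].category.__name__)  # DeprecationWarning
```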

download_corpora.py refers to old package name

curl https://raw.github.com/sloria/TextBlob/master/download_corpora.py | python

mbpdar:deaas daren$ sudo curl https://raw.github.com/sloria/TextBlob/master/download_corpora.py | python
Traceback (most recent call last):
File "<stdin>", line 6, in <module>
ImportError: No module named packages

download_corpora.py should instead use

from textblob.packages import ...

repr() fails on WordList slices in Python 3

Using TextBlob 0.8.3 in a Python 3.3.2 virtual environment.

Steps to reproduce:

In [1]: import textblob

In [2]: words = 'the quick brown fox jumped over the lazy dogs'.split()

In [3]: word_list = textblob.WordList(words)

In [4]: repr(word_list[:3])

Expected output:

Out[4]: "WordList([u'the', u'quick', u'brown'])"

Returned output:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/rosspetchler/.virtualenvs/python3/lib/python3.3/site-packages/textblob/blob.py", line 219, in __repr__
    return '{cls}({lst})'.format(cls=class_name, lst=strings)
  File "/Users/rosspetchler/.virtualenvs/python3/lib/python3.3/site-packages/textblob/blob.py", line 80, in __repr__
    return repr(self.string)
  File "/Users/rosspetchler/.virtualenvs/python3/lib/python3.3/site-packages/textblob/blob.py", line 80, in __repr__
    return repr(self.string)
  File "/Users/rosspetchler/.virtualenvs/python3/lib/python3.3/site-packages/textblob/blob.py", line 80, in __repr__
    return repr(self.string)
  . . .
RuntimeError: maximum recursion depth exceeded while calling a Python object

(Note that Python 2 produces the expected output; the problem seems to be specific to Python 3.)

Default feature extractor for NaiveBayesClassifier marks all features as False

I'm trying to use the NaiveBayesClassifier to classify some text, as below:

train = [('CARD PAYMENT TO ASDA SUPERSTORE ON ', 'Supermarket'),
('MONTHLY ACCOUNT FEE', 'Bill'),
('CARD PAYMENT TO SAINSBURYS SMKT GBP RATE GBP ON ', 'Supermarket'),
('CARD PAYMENT TO ORDNANCE SURVEY GBP RATE GBP ON ', 'Eating Out'),
('CARD PAYMENT TO TEXQUEENSWAYSSTN GBP RATE GBP ON ', 'Petrol')]

c = NaiveBayesClassifier(train)

However, it doesn't seem to classify properly, and when I get it to extract the features, I find that all of the features are False:

c.extract_features('CARD PAYMENT')

{u'contains(ACCOUNT)': False,
u'contains(ASDA)': False,
u'contains(CARD)': False,
u'contains(FEE)': False,
u'contains(GBP)': False,
u'contains(MONTHLY)': False,
u'contains(ON)': False,
u'contains(ORDNANCE)': False,
u'contains(PAYMENT)': False,
u'contains(RATE)': False,
u'contains(SAINSBURYS)': False,
u'contains(SMKT)': False,
u'contains(SUPERSTORE)': False,
u'contains(SURVEY)': False,
u'contains(TEXQUEENSWAYSSTN)': False,
u'contains(TO)': False}

I assume this is a problem with the default feature extractor. When I write my own extractor, as below, it all works fine.

def extractor(doc):
    tokens = doc.split()

    features = {}

    for token in tokens:
        if token == "":
            continue
        features[token] = True

    return features
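The extractor above can be written a little more defensively; the sketch below is the same idea (one True-valued presence feature per token), just naming the features in the contains(...) style the default extractor uses:

```python
def presence_features(document):
    """Sketch of a word-presence feature extractor: one True-valued
    contains(...) feature per distinct token in the document."""
    features = {}
    for token in document.split():
        token = token.strip()
        if token:
            features["contains({0})".format(token)] = True
    return features

print(presence_features("CARD PAYMENT"))
# {'contains(CARD)': True, 'contains(PAYMENT)': True}
```

If your TextBlob version's classifier constructor accepts a feature_extractor argument, it could be plugged in as NaiveBayesClassifier(train, feature_extractor=presence_features).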

Verb conjugation

A Word::conjugate() method could be used to conjugate verbs into different forms.

Issue while downloading download_corpora.py

Hi,

While running download_corpora.py, the download gets stuck midway. Only "brown" gets downloaded; it stalls while downloading "punkt". Can you suggest a way to work around this? I have tried removing the NLTK data path and downloading again, but I still face the same problem.

Thanks

Issues with TextBlob and existing NLTK

When running from textblob.classifiers import NaiveBayesClassifier in Python 2.7.6, I get the following error:

File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/textblob/classifiers.py", line 398, in PositiveNaiveBayesClassifier
    nltk_class = nltk.classify.PositiveNaiveBayesClassifier
AttributeError: 'module' object has no attribute 'PositiveNaiveBayesClassifier'

Running import textblob does not produce any errors. I'm using Python 2.7.6 from MacPorts, I have a NLTK installed through MacPorts, and I have TextBlob installed with Pip.

However, when running the same command in Python 3.3.2, where I do not have NLTK installed, I get no error. This leads me to believe that the problem is due to an existing NLTK installation.

I apologize in advance if this issue has already been submitted or has been documented somewhere. The closest existing issue I could find is this one.

Thanks,
Nipun

Running on Google App Engine

The following error has been encountered when trying to run on Google App Engine:

File "/xxxxx/xxxxx/textblob/nltk/data.py", line 76, in
if os.path.expanduser('/') != '/':
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/posixpath.py", line 260, in expanduser
userhome = pwd.getpwuid(os.getuid()).pw_dir
KeyError: 'getpwuid(): uid not found: -1'

Empty sentences

First of all many thanks for the great library!

I am not sure if it's a bug or a feature, but I have

ValueError: substring not found

if try to convert to words the TextBlob that looks like this:

" particular research projects are included. . The ARMAP"

From the IPython error message (inserted below) it looks like TextBlob fails to create the sentence object, and the second dot after the space is the reason. I wonder whether this is intended behaviour and I have to clean my dataset first, or whether it is possible to handle such cases somehow?

ValueError                                Traceback (most recent call last)
<ipython-input-164-e15f5bc3174e> in <module>()
     12     abst = TextBlob(''.join(text_data))
     13     #split to words and singularize
---> 14     wl = abst.words.singularize()
     15     #rejoin to complete text
     16     abst_rejoin = TextBlob(' '.join(wl))

/usr/local/lib/python2.7/dist-packages/textblob/decorators.pyc in __get__(self, obj, cls)
     21         if obj is None:
     22             return self
---> 23         value = obj.__dict__[self.func.__name__] = self.func(obj)
     24         return value
     25 

/usr/local/lib/python2.7/dist-packages/textblob/blob.pyc in words(self)
    596         # blob into sentences before tokenizing to words
    597         words = []
--> 598         for sent in self.sentences:
    599             words.extend(WordTokenizer().tokenize(sent.raw, include_punc=False))
    600         return WordList(words)

/usr/local/lib/python2.7/dist-packages/textblob/decorators.pyc in __get__(self, obj, cls)
     21         if obj is None:
     22             return self
---> 23         value = obj.__dict__[self.func.__name__] = self.func(obj)
     24         return value
     25 

/usr/local/lib/python2.7/dist-packages/textblob/blob.pyc in sentences(self)
    585     def sentences(self):
    586         '''Return list of :class:`Sentence <Sentence>` objects.'''
--> 587         return self._create_sentence_objects()
    588 
    589     @cached_property

/usr/local/lib/python2.7/dist-packages/textblob/blob.pyc in _create_sentence_objects(self)
    641             # Compute the start and end indices of the sentence
    642             # within the blob
--> 643             start_index = self.raw.index(sent, char_index)
    644             char_index += len(sent)
    645             end_index = start_index + len(sent)

ValueError: substring not found
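Until the index bookkeeping handles this, one workaround sketch is to normalize stray repeated periods before constructing the blob (this is a hypothetical preprocessing step, not a TextBlob feature):

```python
import re

def collapse_stray_periods(text):
    """Workaround sketch: merge runs like '. .' or '..' that confuse
    sentence-boundary index bookkeeping into a single period."""
    return re.sub(r"\.(\s*\.)+", ".", text)

print(collapse_stray_periods(
    " particular research projects are included. . The ARMAP"
))
# particular research projects are included. The ARMAP
```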

start_index = self.raw.index(sent, char_index) ValueError: substring not found

getting a stacktrace when trying to find textblob.words

Traceback (most recent call last):
File "test.py", line 7, in <module>
print TextBlob(t).words
File "/usr/local/lib/python2.7/dist-packages/textblob/decorators.py", line 23, in __get__
value = obj.__dict__[self.func.__name__] = self.func(obj)
File "/usr/local/lib/python2.7/dist-packages/textblob/blob.py", line 598, in words
for sent in self.sentences:
File "/usr/local/lib/python2.7/dist-packages/textblob/decorators.py", line 23, in __get__
value = obj.__dict__[self.func.__name__] = self.func(obj)
File "/usr/local/lib/python2.7/dist-packages/textblob/blob.py", line 587, in sentences
return self._create_sentence_objects()
File "/usr/local/lib/python2.7/dist-packages/textblob/blob.py", line 643, in _create_sentence_objects
start_index = self.raw.index(sent, char_index)
ValueError: substring not found

See the demo program below for reproducibility (tested against 0.8):

from textblob import TextBlob

t = '''
after sixteen years francis ford copolla has again returned to his favorite project , making the third installment in the godfather-trilogy . this new film has been underrated for no reason . it is as intellectual and majestically made as copolla's pervious films . it is also more psychological , pessimistic and more tragic than the first two . the only regret is the unconvincing performance by the newcomer sofia copolla and some " unfinished " developments of some characters . the film elegantly begins with nino rota's recognizable musical score , the beautiful skyscrapers of new york and michael's voice as he is writing a letter to his children : " the only wealth in this world is children . more than all money and power on earth , you are my treasure " . the year is 1979 and michael corleone has used the time since the ending of " part ii " to make his father's dream come true - making the corleone family legitimate . michael sold all his casinos and invests only in gambling . constantly haunted by the past , his only reason to live is his children . the family has amassed unimaginable wealth , and as the film opens michael corleone ( al pacino ) is being invested with a great honor by the church . later that day , at a reception , his daughter announces a corleone family gift to the church and the charities of sicily , " a check in the amount of $100 million . " but the corleones are about to find , as others have throughout history , that you cannot buy forgiveness . sure , you can do business with evil men inside the church , for all men are fallible and capable of sin . but god does not take payoffs . the plot of the movie , concocted by coppola and mario puzo in a screenplay inspired by headlines , brings the corleone family into the inner circles of corruption in the vatican . there is a moment in " godfather iii " where michael says : " all my life i have been trying to go up in society , where everything was legal . 
but the higher i go , the crookier it becomes . . " . visually this film is as spectacular as the first two . gordon willis' rich cinematography , carmine copolla's beautiful composition and alex tavoularis' wonderful art direction could not be better . but copolla's first two godfather-films were more famous for their deep , intellectual plots , tree dimensional characters and incredible acting , than for their visual perfection . the third installment has only the plot and visuals . some characters could be much more developed and the acting , although good , never accomplishes to reach the same height of the first two films . the biggest miscasting is sofia copolla , who is so unconvincing and unemotional that she manages to ruin several scenes throughout the movie , that could have been grander and more emotional . the best performance comes unsurprisingly from al pacino , who should have got a nomination for best actor at the oscars . andy garcia is powerful as sonny's son , strong , focused and loyal . violence is natural to him . he suffers no pangs of conscience when he takes revenge on his family's behalf , and in this he is supposed to be strong in the uncomplicated way don vito corleone was . however both kay ( diane keaton ) and connie ( talia shire ) are useless . and characters like vito corleone and tom hagen are really missed . the good part is that michael is again reunited with old friends , that you remember from the first and second films . in the third film michael has become almost like his father , vito in the first film and vincenzo resembles michael when he was much younger . this parallel could be more interesting if vincenzo's character was more developed . many have pointed out that making the third film , was unnecessary . i disagree . it is a beautiful film of great importance , completing the tragic saga of the corleone family . the first film showed some horrible results of corleone's life . 
it showed michael making a choice ; the second showed a man damning himself for his choices and feeling the impact of changing times . a man desperately trying to keep his balance , focus , family and sanity , while everything is crashing all around him . the third film is a terrifying conclusion - a result of michael's life . the life he chose for himself is like quicksand - one wrong step and you are doomed . there is no turning back . and no matter how hard you try to get out of it , to free yourself , no matter how powerful and wealthy you are , you are helpless - sinking deeper and deeper till it swallows you completely . the beautifully directed last sequence is also the powerful climax of the film , when michael is sitting alone in his chair , left by everyone , surrounded by emptiness and memories of his friends and family members long dead . here he dies - alone , miserable and unforgiven
'''

print TextBlob(t).words

translate error

Because `translate` uses GET requests, translating a large block of text raises:

urllib2.HTTPError: HTTP Error 414: Request-URI Too Large

(full stack)
File "/usr/local/lib/python2.7/dist-packages/textblob/blob.py", line 516, in translate
from_lang=from_lang, to_lang=to))
File "/usr/local/lib/python2.7/dist-packages/textblob/translate.py", line 51, in translate
json5 = self.get_json5(url, host=host, type=type_)
File "/usr/local/lib/python2.7/dist-packages/textblob/translate.py", line 85, in _get_json5
r = request.urlopen(req)
File "/usr/lib/python2.7/urllib2.py", line 127, in urlopen
return _opener.open(url, data, timeout)
File "/usr/lib/python2.7/urllib2.py", line 410, in open
response = meth(req, response)
File "/usr/lib/python2.7/urllib2.py", line 523, in http_response
'http', request, response, code, msg, hdrs)
File "/usr/lib/python2.7/urllib2.py", line 448, in error
return self._call_chain(_args)
File "/usr/lib/python2.7/urllib2.py", line 382, in _call_chain
result = func(_args)
File "/usr/lib/python2.7/urllib2.py", line 531, in http_error_default
raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 414: Request-URI Too Large

This happens when trying to translate the first record of the nltk.corpus movie sentiment test set:

films adapted from comic books have had plenty of success , whether they're about superheroes ( batman , superman , spawn ) , or geared toward kids ( casper ) or the arthouse crowd ( ghost world ) , but there's never really been a comic book like from hell before . for starters , it was created by alan moore ( and eddie campbell ) , who brought the medium to a whole new level in the mid '80s with a 12-part series called the watchmen . to say moore and campbell thoroughly researched the subject of jack the ripper would be like saying michael jackson is starting to look a little odd . the book ( or " graphic novel , " if you will ) is over 500 pages long and includes nearly 30 more that consist of nothing but footnotes . in other words , don't dismiss this film because of its source . if you can get past the whole comic book thing , you might find another stumbling block in from hell's directors , albert and allen hughes . getting the hughes brothers to direct this seems almost as ludicrous as casting carrot top in , well , anything , but riddle me this : who better to direct a film that's set in the ghetto and features really violent street crime than the mad geniuses behind menace ii society ? the ghetto in question is , of course , whitechapel in 1888 london's east end . it's a filthy , sooty place where the whores ( called " unfortunates " ) are starting to get a little nervous about this mysterious psychopath who has been carving through their profession with surgical precision . when the first stiff turns up , copper peter godley ( robbie coltrane , the world is not enough ) calls in inspector frederick abberline ( johnny depp , blow ) to crack the case . abberline , a widower , has prophetic dreams he unsuccessfully tries to quell with copious amounts of absinthe and opium . 
upon arriving in whitechapel , he befriends an unfortunate named mary kelly ( heather graham , say it isn't so ) and proceeds to investigate the horribly gruesome crimes that even the police surgeon can't stomach . i don't think anyone needs to be briefed on jack the ripper , so i won't go into the particulars here , other than to say moore and campbell have a unique and interesting theory about both the identity of the killer and the reasons he chooses to slay . in the comic , they don't bother cloaking the identity of the ripper , but screenwriters terry hayes ( vertical limit ) and rafael yglesias ( les mis ? rables ) do a good job of keeping him hidden from viewers until the very end . it's funny to watch the locals blindly point the finger of blame at jews and indians because , after all , an englishman could never be capable of committing such ghastly acts . and from hell's ending had me whistling the stonecutters song from the simpsons for days ( " who holds back the electric car/who made steve guttenberg a star ? " ) . don't worry - it'll all make sense when you see it . now onto from hell's appearance : it's certainly dark and bleak enough , and it's surprising to see how much more it looks like a tim burton film than planet of the apes did ( at times , it seems like sleepy hollow 2 ) . the print i saw wasn't completely finished ( both color and music had not been finalized , so no comments about marilyn manson ) , but cinematographer peter deming ( don't say a word ) ably captures the dreariness of victorian-era london and helped make the flashy killing scenes remind me of the crazy flashbacks in twin peaks , even though the violence in the film pales in comparison to that in the black-and-white comic . oscar winner martin childs' ( shakespeare in love ) production design turns the original prague surroundings into one creepy place . 
even the acting in from hell is solid , with the dreamy depp turning in a typically strong performance and deftly handling a british accent . ians holm ( joe gould's secret ) and richardson ( 102 dalmatians ) log in great supporting roles , but the big surprise here is graham . i cringed the first time she opened her mouth , imagining her attempt at an irish accent , but it actually wasn't half bad . the film , however , is all good . 2 : 00 - r for strong violence/gore , sexuality , language and drug content
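Since the whole text is packed into the request URI, one workaround (a sketch under assumptions: the real URI limit of the remote service is unknown, so `max_chars` is a guess, and `chunk_text` is a hypothetical helper, not part of TextBlob) is to split the text into pieces small enough for individual GET requests and translate each piece separately:

```python
def chunk_text(text, max_chars=1500):
    """Split ``text`` into chunks of at most ``max_chars`` characters,
    preferring to break after sentence-ending periods so each chunk
    can be sent in its own (short) GET request.
    """
    chunks = []
    while len(text) > max_chars:
        # Look for the last sentence boundary inside the window.
        cut = text.rfind('. ', 0, max_chars)
        if cut == -1:
            cut = max_chars  # no boundary found: hard split
        else:
            cut += 1  # keep the period with the left chunk
        chunks.append(text[:cut])
        text = text[cut:].lstrip()
    if text:
        chunks.append(text)
    return chunks
```

Each chunk could then be passed to `TextBlob(chunk).translate(...)` on its own and the translated pieces joined back together.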

Installation on Python 3.2 fails ==> resolved: the documentation requires Python 3.3+

Upon installation on Python 3 (Debian Wheezy with pip-3.2) I encounter a SyntaxError:

root@lab:~# pip-3.2 install textblob
Downloading/unpacking textblob
  Downloading textblob-0.8.4.tar.gz (1.8Mb): 1.8Mb downloaded
  Running setup.py egg_info for package textblob

Requirement already satisfied (use --upgrade to upgrade): PyYAML in /usr/local/lib/python3.2/dist-packages (from textblob)
Installing collected packages: textblob
  Running setup.py install for textblob

      File "/usr/local/lib/python3.2/dist-packages/textblob/classifiers.py", line 78
        features = dict(((u'contains({0})'.format(word), (word in tokens))
                                         ^
    SyntaxError: invalid syntax

      File "/usr/local/lib/python3.2/dist-packages/textblob/_text.py", line 313
        and tokens[j] in ("'", "\"", u"", u"", "...", ".", "!", "?", ")", EOS):
                                          ^
    SyntaxError: invalid syntax

Successfully installed textblob
Cleaning up...
root@lab:~# python3
Python 3.2.3 (default, Feb 20 2013, 14:44:27) 
[GCC 4.7.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import textblob
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.2/dist-packages/textblob/__init__.py", line 9, in <module>
    from .blob import TextBlob, Word, Sentence, Blobber, WordList
  File "/usr/local/lib/python3.2/dist-packages/textblob/blob.py", line 31, in <module>
    from textblob.inflect import singularize as _singularize, pluralize as _pluralize
  File "/usr/local/lib/python3.2/dist-packages/textblob/inflect.py", line 12, in <module>
    from textblob.en.inflect import singularize, pluralize
  File "/usr/local/lib/python3.2/dist-packages/textblob/en/__init__.py", line 8, in <module>
    from textblob._text import (Parser as _Parser, Sentiment as _Sentiment, Lexicon,
  File "/usr/local/lib/python3.2/dist-packages/textblob/_text.py", line 313
    and tokens[j] in ("'", "\"", u"", u"", "...", ".", "!", "?", ")", EOS):
                                      ^
SyntaxError: invalid syntax
>>> 
root@lab:~# cat /etc/debian_version 
7.2

After removing the `u` prefixes, everything works. (Python 3.0–3.2 rejected the `u''` literal syntax; PEP 414 reinstated it in Python 3.3, which is why the documentation requires 3.3+.)

When installing, the "brown" corpus will not download

Terminal output:

MacBook-Pro:~ UserName$ curl https://raw.github.com/sloria/TextBlob/master/download_corpora.py | python
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   728  100   728    0     0   1012      0 --:--:-- --:--:-- --:--:--  1012
Downloading "brown"

After this, nothing happens. I left it for ~1 hour and came back, yet it shows no further output.

Python 2.7.5, OS X 10.9.1

Fix __repr__ with non-ascii text in Python 2

Non-ASCII characters do not display correctly in the `repr` of a TextBlob on Python 2, although `str` works fine.

>>> from text.blob import TextBlob
>>> b = TextBlob(u"问候世界")
>>> print(repr(b))
TextBlob(u'\u95ee\u5019\u4e16\u754c')
>>> print(b)
问候世界
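The underlying issue is that `repr` on Python 2 must return bytes, so unicode text ends up escape-encoded. A minimal sketch of a fix (assumption: `Blob` here is a toy stand-in, not TextBlob's actual implementation) is to build the repr as text and encode it to UTF-8 only on Python 2:

```python
import sys

class Blob(object):
    """Toy stand-in for TextBlob illustrating a repr fix sketch."""
    def __init__(self, raw):
        self.raw = raw

    def __repr__(self):
        ret = '{0}("{1}")'.format(self.__class__.__name__, self.raw)
        if sys.version_info[0] == 2:
            # Python 2: repr must return bytes, so encode instead of
            # letting unicode characters escape to \uXXXX sequences.
            return ret.encode('utf-8')
        return ret
```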

Set up extension system (e.g. for language packages)

Users should be able to install extensions to TextBlob, such as language packages and new models.

For example, to install French language support

$ pip install textblob-fr

Then to use it

>>> from text.blob import TextBlob
>>> from textblob_fr.taggers import PatternTagger
>>> b = TextBlob("Bonjour tout le monde", pos_tagger=PatternTagger())
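In this design an extension package essentially ships objects implementing TextBlob's base interfaces; a tagger exposes a `tag(text)` method returning (token, tag) pairs. A toy sketch of what such a plugin class might look like (hypothetical, not a real French tagger):

```python
class NaiveFrenchTagger(object):
    """Toy illustration of the tagger interface an extension package
    would provide: a single ``tag(text)`` method returning
    (token, tag) pairs.
    """
    def tag(self, text):
        # Naive rule: capitalized tokens are tagged NNP, others NN.
        return [(tok, 'NNP' if tok[:1].isupper() else 'NN')
                for tok in text.split()]
```

An instance of such a class could then be passed as `pos_tagger=` exactly as `PatternTagger` is above.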

TextBlob.sentences produces ValueError

I'm running Python 2.7.5 via IPython Notebook with textblob v0.8.0. When I try to access the TextBlob.sentences attribute, I (usually, depending on the length of the text) get an error:

/usr/local/lib/python2.7/site-packages/textblob/blob.py in _create_sentence_objects(self)
    641             # Compute the start and end indices of the sentence
    642             # within the blob
--> 643             start_index = self.raw.index(sent, char_index)
    644             char_index += len(sent)
    645             end_index = start_index + len(sent)

ValueError: substring not found

I've manually checked that there's nothing special about the sentences on which this line fails. Other attributes, e.g. TextBlob.tags, work just fine; TextBlob.words does not, since it depends on TextBlob.sentences. This never happened before I updated to v0.8.0, although I don't know if that's behind the error.
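A plausible cause (an assumption, not confirmed from the source): the sentence tokenizer can normalize the sentence text, e.g. collapse whitespace, after which `raw.index(sent, char_index)` cannot find the tokenized sentence verbatim in the raw text. A minimal reproduction of the failing step:

```python
raw = "First  sentence. Second one."
# Suppose the tokenizer collapsed the double space (illustrative):
sent = "First sentence."

try:
    # This mirrors the failing line in blob.py's
    # _create_sentence_objects.
    start_index = raw.index(sent, 0)
except ValueError as exc:
    error = str(exc)
```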

correct() strips some punctuation (period)

from textblob import TextBlob
t = TextBlob('''Before you embark on any of this journey, write a quick high-level test that demonstrates the slowness. You may need to introduce some minimum set of data to reproduce a significant enough slowdown. Usually a run time of second or two is good enough to get a handle on an appreciable improvement when you find it.''')
print t.correct()
Before you embark on any of this journey, write a quick high-level test that demonstrates the slowness You may need to introduce some minimum set of data to reproduce a significant enough slowdown. Usually a run time of second or two is good enough to get a handle on an appreciable improvement when you find it.

Note that the original "write a quick high-level test that demonstrates the slowness. You may" has become "write a quick high-level test that demonstrates the slowness You may", missing the period after "slowness".
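Until this is fixed, one workaround sketch is to run correction only over word tokens and splice all punctuation and whitespace back unchanged. Here `fix_word` is a hypothetical stand-in for a per-word corrector (e.g. something built on `Word(w).spellcheck()`):

```python
import re

def correct_preserving_punctuation(text, fix_word):
    """Apply ``fix_word`` to each alphabetic token while leaving
    punctuation and whitespace exactly as in the input.
    """
    return re.sub(r"[A-Za-z']+", lambda m: fix_word(m.group(0)), text)
```

For example, with `str.upper` standing in for the corrector, `correct_preserving_punctuation("the slowness. You may", str.upper)` keeps the period intact.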

I get this error even though PositiveNaiveBayesClassifier is present in classifiers.py

Traceback (most recent call last):
File "/home/newspecies/Documents/Projects/final_year_project/scripts/audio_input.py", line 1, in
from textblob.classifiers import NaiveBayesClassifier
File "/usr/local/lib/python2.7/dist-packages/textblob-0.8.4-py2.7.egg/textblob/classifiers.py", line 359, in
class PositiveNaiveBayesClassifier(NLTKClassifier):
File "/usr/local/lib/python2.7/dist-packages/textblob-0.8.4-py2.7.egg/textblob/classifiers.py", line 400, in PositiveNaiveBayesClassifier
nltk_class = nltk.classify.PositiveNaiveBayesClassifier
AttributeError: 'module' object has no attribute 'PositiveNaiveBayesClassifier'
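A likely cause (an assumption): the installed NLTK predates the release that added `nltk.classify.PositiveNaiveBayesClassifier`, so the attribute lookup at import time fails. A defensive check for the environment:

```python
import importlib

def nltk_has_positive_nb():
    """Return True if the installed NLTK exposes
    PositiveNaiveBayesClassifier; False if NLTK is missing or too old.
    """
    try:
        classify = importlib.import_module("nltk.classify")
    except ImportError:
        return False
    return hasattr(classify, "PositiveNaiveBayesClassifier")
```

If this returns False, upgrading NLTK (`pip install --upgrade nltk`) should resolve the AttributeError.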
