
nltk-trainer's Introduction

NLTK Trainer

NLTK Trainer exists to make training and evaluating NLTK objects as easy as possible.

Requirements

The scripts with default arguments have been tested for compatibility with Python 3.7 and NLTK 3.4.5. If something does not work for you, please open an issue and include the script with its arguments and the failure or exception output. To use the sklearn classifiers, you must also install scikit-learn.

If you want to use any of the corpora that come with NLTK, you should install the NLTK data.

Documentation

Documentation can be found at nltk-trainer.readthedocs.org (you can also find these documents in the docs directory). Many of the scripts are covered in Python 3 Text Processing with NLTK 3 Cookbook, and every script provides a --help option that describes all available parameters.

Using Trained Models

The trained models are pickle files that by default are put into your nltk_data directory. You can load them using nltk.data.load, for example:

import nltk.data
classifier = nltk.data.load('classifiers/movie_reviews_NaiveBayes.pickle')

You now have an NLTK classifier object you can work with.
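For example, here is a minimal sketch of classifying a new piece of text with the loaded classifier. It assumes the classifier was trained on simple word-presence (bag-of-words) features; adjust the featureset to match however you trained.

import nltk.data
from nltk.tokenize import word_tokenize

classifier = nltk.data.load('classifiers/movie_reviews_NaiveBayes.pickle')

# bag-of-words featureset: {word: True} for every token
words = word_tokenize('A gripping, beautifully shot film.')
feats = dict((word, True) for word in words)

print(classifier.classify(feats))                    # e.g. 'pos' or 'neg'
print(classifier.prob_classify(feats).prob('pos'))   # probability of 'pos'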

nltk-trainer's People

Contributors

aleksandrpanteleymonov, danielgatis, dblommesteijn, icaromedeiros, japerk, kecaps, kisabaka, lababidi, muhammad-ahmad-rolustech


nltk-trainer's Issues

Issue when using trained classifier for classification

As per the document, we can use
feats = dict([(word, True) for word in words + ngrams(words, 1)])
as the feature set to classify, but I get a TypeError when I use it:

TypeError: can only concatenate list (not "generator") to list

Could you please guide me if I am doing anything wrong.

I have a sentence as text, and I tried to create the feature vector for classification as below:

    tokens = word_tokenize(text, include_punc=False)
    tokens = functools.reduce(operator.add, [tokens if n == 1 else list(ngrams(tokens, n)) for n in [3]])

    if not isinstance(tokens, list):
        tokens = list(tokens)
    feats = dict([(word, True) for word in tokens])

    print("Classify: ", self._classifier.classify(feats))

But I always get constant pos/neg probabilities irrespective of the sentence, and the overall result is always negative.
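For reference, one possible cause and fix, as a sketch: nltk.util.ngrams returns a generator, so it cannot be concatenated to a list with +; wrapping it in list() avoids the TypeError. The helper name bag_of_ngrams below is hypothetical.

    from nltk.tokenize import word_tokenize
    from nltk.util import ngrams

    def bag_of_ngrams(text, ns=(1, 2)):
        # ngrams() returns a generator, so wrap it in list()
        # before concatenating with the token list
        tokens = word_tokenize(text)
        grams = []
        for n in ns:
            grams += tokens if n == 1 else list(ngrams(tokens, n))
        return dict((gram, True) for gram in grams)

    feats = bag_of_ngrams('good gyros, clean and friendly staff.')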

ValueError: you must specify a corpus reader

The following command works perfectly fine:
python train_chunker.py conll2002 --filename ~/nltk_data/chunkers/conll2002_chunker.pickle --classifier NaiveBayes

Then I copy ~/nltk_data/conll2002/ to ~/nltk_data/conlltest/ and run the command:
python train_chunker.py conlltest --filename ~/nltk_data/chunkers/conlltest_chunker.pickle --classifier NaiveBayes

The output is:

loading conlltest
Traceback (most recent call last):
  File "train_chunker.py", line 80, in <module>
    chunked_corpus = load_corpus_reader(args.corpus, reader=args.reader, fileids=args.fileids)
  File "/mnt/3E6227E362279F21/scriptie/external/nltk-trainer/nltk_trainer/__init__.py", line 64, in load_corpus_reader
    raise ValueError('you must specify a corpus reader')
ValueError: you must specify a corpus reader

What am I missing? My version of nltk is 3.2.5.
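For context: conll2002 is a corpus name NLTK already knows how to read, so a reader can be guessed, but conlltest is an unknown name, so load_corpus_reader gives up unless a reader class is supplied explicitly. A hedged sketch of what that could look like (the traceback shows a --reader argument exists; the exact class path and any extra reader arguments it needs are assumptions):

    python train_chunker.py conlltest --reader nltk.corpus.reader.conll.ConllChunkCorpusReader --filename ~/nltk_data/chunkers/conlltest_chunker.pickle --classifier NaiveBayes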

Memory Problem

I have a corpus of 70,000 documents (roughly 237 MB), and I keep getting hit with memory-related error messages.
I tried renting a VPS with 100 Gigs of RAM, but I got the same error messages.
Is there a way to make the process less memory-intensive?
Is it possible to break the corpus up into smaller corpora, train multiple classifiers and then combine them into one large classifier?
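One way to keep memory bounded is to skip the in-memory NLTK classifiers and stream the corpus through scikit-learn's out-of-core API instead. A minimal sketch outside nltk-trainer, assuming documents can be read in batches (iter_batches is a hypothetical generator you would write for your corpus layout):

    from sklearn.feature_extraction.text import HashingVectorizer
    from sklearn.naive_bayes import MultinomialNB

    vectorizer = HashingVectorizer(alternate_sign=False)  # stateless, fixed-size features
    clf = MultinomialNB()
    classes = ['neg', 'pos']

    # iter_batches() is hypothetical: it should yield (texts, labels) for a few
    # hundred documents at a time, so the full corpus never sits in memory
    for texts, labels in iter_batches('/path/to/corpus', batch_size=500):
        X = vectorizer.transform(texts)
        clf.partial_fit(X, labels, classes=classes)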

Improve README

I think it would be good to give a better explanation of how to use the trainer. I am trying to load a pre-trained model that made use of this library and couldn't understand how to add this dependency to my project.

Is it compatible with Python 3.7? The documentation and the README disagree on this.

After cloning the repo, what should be done?
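For reference, a minimal getting-started sketch, assuming the usual pip workflow (package versions come from requirements.txt and may need adjusting for your Python version):

    git clone https://github.com/japerk/nltk-trainer.git
    cd nltk-trainer
    pip install -r requirements.txt
    python -m nltk.downloader movie_reviews
    python train_classifier.py movie_reviews --classifier NaiveBayes

The trained pickle then loads with nltk.data.load, as shown in the Using Trained Models section above.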

Turn Python scripts into a command-line tool

Great package! Wouldn't it be nice if you could invoke the scripts with something like:

$ nltk train movie_reviews --instances paras --classifier NaiveBayes
$ nltk analyze --sort count --reverse

instead of:

$ python train_classifier.py movie_reviews --instances paras --classifier NaiveBayes
$ python analyze_tagged_corpus.py treebank --sort count --reverse

so that this is truly a command-line tool? Shouldn't be too much work using docopt and adding a console_script entry point in setup.py. What do you think?
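For illustration, a hedged sketch of the console_scripts part (the entry-point names and the nltk_trainer.cli module are hypothetical, not existing code):

    # setup.py (sketch)
    from setuptools import setup, find_packages

    setup(
        name='nltk-trainer',
        packages=find_packages(),
        entry_points={
            'console_scripts': [
                # each command dispatches into a hypothetical cli module
                'nltk-train = nltk_trainer.cli:train',
                'nltk-analyze = nltk_trainer.cli:analyze',
            ],
        },
    )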

Installation error on Ubuntu 10.10

Hi:

First of all, thank you for putting this code out. It seems to be very useful.

I installed japerk-nltk-trainer-5c0b53c on my Ubuntu 10.10 box. I did have to change this line in the requirements.txt file:
scipy>=0.7.0

The error message I'm getting is this:

dscs@lap02:~/Desktop/USC/taxonomy$ python /usr/local/bin/train_classifier.py --multi --instances sents --cat_pattern "(.+).txt"
Traceback (most recent call last):
  File "/usr/local/bin/train_classifier.py", line 5, in <module>
    pkg_resources.run_script('nltk-trainer==0.9', 'train_classifier.py')
  File "/usr/local/lib/python2.6/dist-packages/distribute-0.6.21-py2.6.egg/pkg_resources.py", line 499, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/usr/local/lib/python2.6/dist-packages/distribute-0.6.21-py2.6.egg/pkg_resources.py", line 1235, in run_script
    execfile(script_filename, namespace, namespace)
  File "/usr/local/lib/python2.6/dist-packages/nltk_trainer-0.9-py2.6.egg/EGG-INFO/scripts/train_classifier.py", line 4, in <module>
    import nltk_trainer.classification.args
  File "/usr/local/lib/python2.6/dist-packages/nltk_trainer-0.9-py2.6.egg/nltk_trainer/__init__.py", line 7, in <module>
    from nltk_trainer.tagging.readers import NumberedTaggedSentCorpusReader
ImportError: No module named tagging.readers

Any suggestions appreciated. Thanks.

Using a trained sklearn classifier results in error

After training a sklearn.BernoulliNB classifier on a corpus, I'm getting sporadic errors when trying to predict labels for features with the stored classifier:

feats = {'and': True, (',', 'clean'): True, ('clean', 'and'): True, 'good': True, ('friendly', 'staff'): True, ',': True, '.': True, 'gyros': True, 'clean': True, ('gyros', ','): True, ('good', 'gyros'): True, ('and', 'friendly'): True, 'friendly': True, ('staff', '.'): True, 'staff': True}

clf = pickle.load(open('saved_classifier.pickle'))
p = clf.prob_classify(feats)

The above works. However if:

feats = {'and': True, 'fresh': True, ('fresh', 'and'): True, 'inexpensive': True, ('and', 'inexpensive'): True}

clf.prob_classify(feats) results in a TypeError; here's the trace:

    ---------------------------------------------------------------------------
    TypeError                                 Traceback (most recent call last)
    <ipython-input-184-86c30997b740> in <module>()
    ----> 1 p = clf.prob_classify(feats)
          2 p.prob('pos')

    /Library/Python/2.7/site-packages/nltk/classify/api.pyc in prob_classify(self, featureset)
         63         """
         64         if overridden(self.batch_prob_classify):
    ---> 65             return self.batch_prob_classify([featureset])[0]
         66         else:
         67             raise NotImplementedError()

    /Library/Python/2.7/site-packages/nltk/classify/scikitlearn.pyc in batch_prob_classify(self, featuresets)
         71     def batch_prob_classify(self, featuresets):
         72         X = self._convert(featuresets)
    ---> 73         y_proba = self._clf.predict_proba(X)
         74         return [self._make_probdist(y_proba[i]) for i in xrange(len(y_proba))]
         75 

    /Library/Python/2.7/site-packages/scikit_learn-0.14_git-py2.7-macosx-10.8-intel.egg/sklearn/pipeline.pyc in predict_proba(self, X)
        154         for name, transform in self.steps[:-1]:
        155             Xt = transform.transform(Xt)
    --> 156         return self.steps[-1][-1].predict_proba(Xt)
        157 
        158     def decision_function(self, X):

    /Library/Python/2.7/site-packages/scikit_learn-0.14_git-py2.7-macosx-10.8-intel.egg/sklearn/naive_bayes.pyc in predict_proba(self, X)
         96             the model, where classes are ordered arithmetically.
         97         """
    ---> 98         return np.exp(self.predict_log_proba(X))
         99 
        100 

    /Library/Python/2.7/site-packages/scikit_learn-0.14_git-py2.7-macosx-10.8-intel.egg/sklearn/naive_bayes.pyc in predict_log_proba(self, X)
         77             in the model, where classes are ordered arithmetically.
         78         """
    ---> 79         jll = self._joint_log_likelihood(X)
         80         # normalize by P(x) = P(f_1, ..., f_n)
         81         log_prob_x = logsumexp(jll, axis=1)

    /Library/Python/2.7/site-packages/scikit_learn-0.14_git-py2.7-macosx-10.8-intel.egg/sklearn/naive_bayes.pyc in _joint_log_likelihood(self, X)
        433 
        434         if self.binarize is not None:
    --> 435             X = binarize(X, threshold=self.binarize)
        436 
        437         n_classes, n_features = self.feature_log_prob_.shape

    /Library/Python/2.7/site-packages/scikit_learn-0.14_git-py2.7-macosx-10.8-intel.egg/sklearn/preprocessing.pyc in binarize(X, threshold, copy)
        537         X.data[cond] = 1
        538         X.data[not_cond] = 0
    --> 539         X.eliminate_zeros()
        540     else:
        541         cond = X > threshold

    /Library/Python/2.7/site-packages/scipy-0.13.0.dev_c31f167_20130307-py2.7-macosx-10.8-intel.egg/scipy/sparse/compressed.pyc in eliminate_zeros(self)
        572         fn = sparsetools.csr_eliminate_zeros
        573         M,N = self._swap(self.shape)
    --> 574         fn( M, N, self.indptr, self.indices, self.data)
        575
        576         self.prune() #nnz may have changed

    /Library/Python/2.7/site-packages/scipy-0.13.0.dev_c31f167_20130307-py2.7-macosx-10.8-intel.egg/scipy/sparse/sparsetools/csr.pyc in csr_eliminate_zeros(*args)
        565     csr_eliminate_zeros(int n_row, int n_col, int Ap, int Aj, npy_clongdouble_wrapper Ax)
        566     """
    --> 567   return _csr.csr_eliminate_zeros(*args)
        568
        569 def csr_sum_duplicates(*args):

    TypeError: Array of type 'byte' required.  Array of type 'bool' given

--show-most-informative 10 ignored

Using the invocation shown on the blog, the parameter is ignored.

http://streamhacker.com/2010/10/25/training-binary-text-classifiers-nltk-trainer/

$ python train_classifier.py --algorithm NaiveBayes --instances files --fraction 0.75 --show-most-informative 10 --no-pickle movie_reviews
2 labels: ['neg', 'pos']
1500 training feats, 500 testing feats
training ['NaiveBayes'] classifier
training NaiveBayes classifier
accuracy: 0.718000
neg precision: 0.950413
neg recall: 0.460000
neg f-measure: 0.619946
pos precision: 0.643799
pos recall: 0.976000
pos f-measure: 0.775835
[z@a japerk-nltk-trainer-dc71c61]$

setup.py tries to open README.txt

When trying to install using pip, I get this error:

IOError: [Errno 2] No such file or directory: '/Users/icaro.medeiros/.virtualenvs/pylearner/src/nltk-trainer/README.txt'
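A common fix for this kind of error is to read the long description defensively in setup.py, so installs still work when the file is missing or named differently; a sketch (the actual readme filename in the repo is an assumption):

    # setup.py (sketch): fall back gracefully if the readme is missing
    import os

    def read_long_description():
        for name in ('README.rst', 'README.md', 'README.txt'):
            if os.path.exists(name):
                with open(name) as f:
                    return f.read()
        return ''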

Use with my own corpus

I have my own corpus that I want to use to classify strings as keep/reject. It's not immediately obvious how to use a corpus that's not in the nltk data directory. I'm guessing I'll need to modify the code?
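One approach that should not require code changes: expose the corpus through NLTK's CategorizedPlaintextCorpusReader, with one subdirectory per category. A minimal sketch, assuming plain-text files under keep/ and reject/ subdirectories (the paths and patterns are placeholders):

    from nltk.corpus.reader import CategorizedPlaintextCorpusReader

    # layout: my_corpus/keep/doc1.txt, my_corpus/reject/doc2.txt, ...
    reader = CategorizedPlaintextCorpusReader(
        '/path/to/my_corpus', r'.*\.txt', cat_pattern=r'(\w+)/.*')

    print(reader.categories())                   # ['keep', 'reject']
    print(reader.words(categories='keep')[:10])

The train_classifier.py script also accepts a corpus path plus a --cat_pattern argument (as in some of the commands elsewhere on this page), which may avoid even this much code.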

No Generated Pickle While Using Cross Validation

I want to pickle my trained classifier, but I don't get a pickled classifier when I use the --cross-fold option.

python train_classifier.py --algorithm NaiveBayes --instances sents --fraction 0.9 --cross-fold 10 --show-most-informative 10 /mycorpuspath/

It shows the cross-validation results, but no pickled file is written. If I remove the --cross-fold option, the pickled file is generated.

How can I get the cross-validated classifier? Thanks.
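One workaround consistent with the behavior described above: use --cross-fold only to estimate accuracy, then rerun without it to train on all the data and write the pickle (the individual fold classifiers are not combined):

    python train_classifier.py --algorithm NaiveBayes --instances sents --cross-fold 10 /mycorpuspath/
    python train_classifier.py --algorithm NaiveBayes --instances sents /mycorpuspath/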

How to use Sentiment Analysis Code

Hi ,
I am working on classifying text as positive/negative/neutral. I have seen your demo of sentiment analysis, but I want to use your model in my own code. I don't know where this code snippet is in your repo, or how to use it in my code for sentiment analysis.
Thank you
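For reference, a minimal end-to-end sketch with the movie_reviews corpus: train and pickle a classifier with train_classifier.py, then load the pickle in your own code and classify word-presence featuresets. This gives a pos/neg model; a three-way positive/negative/neutral model would need a corpus with those three categories.

    # train once from the shell:
    #   python train_classifier.py movie_reviews --classifier NaiveBayes

    import nltk.data
    from nltk.tokenize import word_tokenize

    classifier = nltk.data.load('classifiers/movie_reviews_NaiveBayes.pickle')

    def sentiment(text):
        feats = dict((word, True) for word in word_tokenize(text))
        return classifier.classify(feats)   # 'pos' or 'neg'

    print(sentiment('This movie was wonderful.'))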

Problem when saving (filename not created correctly)

I just tried training a chunker on the conll2002 dataset. I ran the following command:

python train_chunker.py conll2002 --fileids ned.train --classifier NaiveBayes

Training succeeds but upon saving the following error occurs:

Traceback (most recent call last):
  File "train_chunker.py", line 210, in <module>
    name = '%s.pickle' % '_'.join(parts)
TypeError: sequence item 1: expected string, list found

I worked around this by using the --filename flag, but thought I'd submit the bug report anyway.

Can not use sklearn as classifier

After running

python train_classifier.py movie_reviews --classifier sklearn.LinearSVC

I got this:

train_classifier.py: error: argument --classifier/--algorithm: invalid choice: 'sklearn.LinearSVC' (choose from 'NaiveBayes', 'DecisionTree', 'Maxent', 'GIS', 'IIS', 'CG', 'BFGS', 'Powell', 'LBFGSB', 'Nelder-Mead', 'MEGAM', 'TADM')

In Python, I can import sklearn with no error or warning.

How can I solve this?

Thanks!
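A quick check that often explains this symptom, assuming the script only adds the sklearn.* choices when it can import NLTK's scikit-learn wrapper: run the following with the same interpreter that runs train_classifier.py.

    import sklearn
    print(sklearn.__version__)

    # NLTK's scikit-learn wrapper; if this import fails, the sklearn.*
    # classifier choices will not appear in train_classifier.py
    from nltk.classify.scikitlearn import SklearnClassifier
    print(SklearnClassifier)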

featx.py

https://github.com/japerk/nltk-trainer
This package is not the same as the one in the book! It does not match what you have in your NLTK Cookbook: featx.py is not complete on GitHub.

Please let me know where the latest version is! Otherwise I will have to return the book because of all these issues.

-fk

python 3.x compatibility

Hi Jacob,

The documentation says the code is Python 3 compatible ("These scripts are Python 2 & 3 compatible and work with NLTK 2.0.4 and higher."), but here it is described as Python 2 only. I tried to run the code on Python 3.7 but ran into compatibility issues.

It would be very nice if you could make the code compatible with Python 3.

Thanks.

requirements.txt might install nltk 3.x

in requirements.txt, the line:
nltk>=2.0b8
might lead to nltk 3.x being installed, which is different from the requirements stated in the README (and has caused problems for my code personally).

I recommend setting an upper limit with something like this:
nltk>=2.0b8,<=2.9.9
