
nltk-trainer's Introduction

NLTK Trainer

NLTK Trainer exists to make training and evaluating NLTK objects as easy as possible.

Requirements

The scripts with default arguments have been tested for compatibility with Python 3.7 and NLTK 3.4.5. If something does not work for you, please open an issue and include the script with its arguments and the failure or exception output. To use the sklearn classifiers, you must also install scikit-learn.

If you want to use any of the corpora that come with NLTK, you should install the NLTK data.

Documentation

Documentation can be found at nltk-trainer.readthedocs.org (you can also find these documents in the docs directory). Many of the scripts are covered in Python 3 Text Processing with NLTK 3 Cookbook, and every script provides a --help option that describes all available parameters.

Using Trained Models

The trained models are pickle files that by default are put into your nltk_data directory. You can load them using nltk.data.load, for example:

import nltk.data
classifier = nltk.data.load('classifiers/movie_reviews_NaiveBayes.pickle')

You now have an NLTK classifier object you can work with.
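For example, here is a minimal sketch of classifying a new piece of text with the loaded classifier. It assumes the classifier was trained on simple word-presence (bag-of-words) features; adjust the featureset to match however you trained.

import nltk.data
from nltk.tokenize import word_tokenize

classifier = nltk.data.load('classifiers/movie_reviews_NaiveBayes.pickle')

# bag-of-words featureset: {word: True} for every token
words = word_tokenize('A gripping, beautifully shot film.')
feats = dict((word, True) for word in words)

print(classifier.classify(feats))                    # e.g. 'pos' or 'neg'
print(classifier.prob_classify(feats).prob('pos'))   # probability of 'pos'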

nltk-trainer's People

Contributors

aleksandrpanteleymonov, danielgatis, dblommesteijn, icaromedeiros, japerk, kecaps, kisabaka, lababidi, muhammad-ahmad-rolustech


nltk-trainer's Issues

Issue when using trained classifier for classification

As per the document, we can use
feats = dict([(word, True) for word in words + ngrams(words, 1)])
as the feature set to classify, but I get a TypeError when I use it:

TypeError: can only concatenate list (not "generator") to list

Could you please guide me if I am doing anything wrong.

I have a sentence as text, and I tried to create the feature vector for classification as below:

    tokens = word_tokenize(text, include_punc=False)
    tokens = functools.reduce(operator.add, [tokens if n == 1 else list(ngrams(tokens, n)) for n in [3]])

    if not isinstance(tokens, list):
        tokens = list(tokens)
    feats = dict([(word, True) for word in tokens])

    print("Classify: ", self._classifier.classify(feats))

But I always get constant pos/neg probabilities irrespective of the sentence, and the overall result is always negative.
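For reference, one possible cause and fix, as a sketch: nltk.util.ngrams returns a generator, so it cannot be concatenated to a list with +; wrapping it in list() avoids the TypeError. The helper name bag_of_ngrams below is hypothetical.

    from nltk.tokenize import word_tokenize
    from nltk.util import ngrams

    def bag_of_ngrams(text, ns=(1, 2)):
        # ngrams() returns a generator, so wrap it in list()
        # before concatenating with the token list
        tokens = word_tokenize(text)
        grams = []
        for n in ns:
            grams += tokens if n == 1 else list(ngrams(tokens, n))
        return dict((gram, True) for gram in grams)

    feats = bag_of_ngrams('good gyros, clean and friendly staff.')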

ValueError: you must specify a corpus reader

The following command works perfectly fine:
python train_chunker.py conll2002 --filename ~/nltk_data/chunkers/conll2002_chunker.pickle --classifier NaiveBayes

Then I copy ~/nltk_data/conll2002/ to ~/nltk_data/conlltest/ and run the command:
python train_chunker.py conlltest --filename ~/nltk_data/chunkers/conlltest_chunker.pickle --classifier NaiveBayes

The output is:

loading conlltest
Traceback (most recent call last):
  File "train_chunker.py", line 80, in <module>
    chunked_corpus = load_corpus_reader(args.corpus, reader=args.reader, fileids=args.fileids)
  File "/mnt/3E6227E362279F21/scriptie/external/nltk-trainer/nltk_trainer/__init__.py", line 64, in load_corpus_reader
    raise ValueError('you must specify a corpus reader')
ValueError: you must specify a corpus reader

What am I missing? My version of nltk is 3.2.5.
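For context: conll2002 is a corpus name NLTK already knows how to read, so a reader can be guessed, but conlltest is an unknown name, so load_corpus_reader gives up unless a reader class is supplied explicitly. A hedged sketch of what that could look like (the traceback shows a --reader argument exists; the exact class path and any extra reader arguments it needs are assumptions):

    python train_chunker.py conlltest --reader nltk.corpus.reader.conll.ConllChunkCorpusReader --filename ~/nltk_data/chunkers/conlltest_chunker.pickle --classifier NaiveBayes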

Memory Problem

I have a corpus of 70,000 documents (roughly 237 MB), and I keep getting hit with memory-related error messages.
I tried renting a VPS with 100 Gigs of RAM, but I got the same error messages.
Is there a way to make the process less memory-intensive?
Is it possible to break the corpus up into smaller corpora, train multiple classifiers and then combine them into one large classifier?
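One way to keep memory bounded is to skip the in-memory NLTK classifiers and stream the corpus through scikit-learn's out-of-core API instead. A minimal sketch outside nltk-trainer, assuming documents can be read in batches (iter_batches is a hypothetical generator you would write for your corpus layout):

    from sklearn.feature_extraction.text import HashingVectorizer
    from sklearn.naive_bayes import MultinomialNB

    vectorizer = HashingVectorizer(alternate_sign=False)  # stateless, fixed-size features
    clf = MultinomialNB()
    classes = ['neg', 'pos']

    # iter_batches() is hypothetical: it should yield (texts, labels) for a few
    # hundred documents at a time, so the full corpus never sits in memory
    for texts, labels in iter_batches('/path/to/corpus', batch_size=500):
        X = vectorizer.transform(texts)
        clf.partial_fit(X, labels, classes=classes)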

Improve README

I think it would be good to give a better explanation of how to use the trainer. I am trying to load a pre-trained model that made use of this library and couldn't understand how to add this dependency to my project.

Is it compatible with Python 3.7? The documentation and the README disagree on this.

After cloning the repo, what should be done?
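For reference, a minimal getting-started sketch, assuming the usual pip workflow (package versions come from requirements.txt and may need adjusting for your Python version):

    git clone https://github.com/japerk/nltk-trainer.git
    cd nltk-trainer
    pip install -r requirements.txt
    python -m nltk.downloader movie_reviews
    python train_classifier.py movie_reviews --classifier NaiveBayes

The trained pickle then loads with nltk.data.load, as shown in the Using Trained Models section above.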

Turn Python scripts into a command-line tool

Great package! Wouldn't it be nice if you could invoke the scripts with something like:

$ nltk train movie_reviews --instances paras --classifier NaiveBayes
$ nltk analyze --sort count --reverse

instead of:

$ python train_classifier.py movie_reviews --instances paras --classifier NaiveBayes
$ python analyze_tagged_corpus.py treebank --sort count --reverse

so that this is truly a command-line tool? Shouldn't be too much work using docopt and adding a console_script entry point in setup.py. What do you think?
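For illustration, a hedged sketch of the console_scripts part (the entry-point names and the nltk_trainer.cli module are hypothetical, not existing code):

    # setup.py (sketch)
    from setuptools import setup, find_packages

    setup(
        name='nltk-trainer',
        packages=find_packages(),
        entry_points={
            'console_scripts': [
                # each command dispatches into a hypothetical cli module
                'nltk-train = nltk_trainer.cli:train',
                'nltk-analyze = nltk_trainer.cli:analyze',
            ],
        },
    )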

Installation error on Ubuntu 10.10

Hi:

First of all, thank you for putting this code out. It seems to be very useful.

I installed japerk-nltk-trainer-5c0b53c on my Ubuntu 10.10 box. I did have to change this line in the requirements.txt file:
scipy>=0.7.0

The error message I'm getting is this:

dscs@lap02:~/Desktop/USC/taxonomy$ python /usr/local/bin/train_classifier.py --multi --instances sents --cat_pattern "(.+).txt"
Traceback (most recent call last):
  File "/usr/local/bin/train_classifier.py", line 5, in <module>
    pkg_resources.run_script('nltk-trainer==0.9', 'train_classifier.py')
  File "/usr/local/lib/python2.6/dist-packages/distribute-0.6.21-py2.6.egg/pkg_resources.py", line 499, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/usr/local/lib/python2.6/dist-packages/distribute-0.6.21-py2.6.egg/pkg_resources.py", line 1235, in run_script
    execfile(script_filename, namespace, namespace)
  File "/usr/local/lib/python2.6/dist-packages/nltk_trainer-0.9-py2.6.egg/EGG-INFO/scripts/train_classifier.py", line 4, in <module>
    import nltk_trainer.classification.args
  File "/usr/local/lib/python2.6/dist-packages/nltk_trainer-0.9-py2.6.egg/nltk_trainer/__init__.py", line 7, in <module>
    from nltk_trainer.tagging.readers import NumberedTaggedSentCorpusReader
ImportError: No module named tagging.readers

Any suggestions appreciated. Thanks.

Using a trained sklearn classifier results in error

After training a sklearn.BernoulliNB classifier on a corpus, I'm getting sporadic errors when trying to predict labels for features with the stored classifier:

feats = {'and': True, (',', 'clean'): True, ('clean', 'and'): True, 'good': True, ('friendly', 'staff'): True, ',': True, '.': True, 'gyros': True, 'clean': True, ('gyros', ','): True, ('good', 'gyros'): True, ('and', 'friendly'): True, 'friendly': True, ('staff', '.'): True, 'staff': True}

clf = pickle.load(open('saved_classifier.pickle'))
p = clf.prob_classify(feats)

The above works. However if:

feats = {'and': True, 'fresh': True, ('fresh', 'and'): True, 'inexpensive': True, ('and', 'inexpensive'): True}

clf.prob_classify(feats) results in a TypeError; here's the trace:

    ---------------------------------------------------------------------------
    TypeError                                 Traceback (most recent call last)
    <ipython-input-184-86c30997b740> in <module>()
    ----> 1 p = clf.prob_classify(feats)
          2 p.prob('pos')

    /Library/Python/2.7/site-packages/nltk/classify/api.pyc in prob_classify(self, featureset)
         63         """
         64         if overridden(self.batch_prob_classify):
    ---> 65             return self.batch_prob_classify([featureset])[0]
         66         else:
         67             raise NotImplementedError()

    /Library/Python/2.7/site-packages/nltk/classify/scikitlearn.pyc in batch_prob_classify(self, featuresets)
         71     def batch_prob_classify(self, featuresets):
         72         X = self._convert(featuresets)
    ---> 73         y_proba = self._clf.predict_proba(X)
         74         return [self._make_probdist(y_proba[i]) for i in xrange(len(y_proba))]
         75 

    /Library/Python/2.7/site-packages/scikit_learn-0.14_git-py2.7-macosx-10.8-intel.egg/sklearn/pipeline.pyc in predict_proba(self, X)
        154         for name, transform in self.steps[:-1]:
        155             Xt = transform.transform(Xt)
    --> 156         return self.steps[-1][-1].predict_proba(Xt)
        157 
        158     def decision_function(self, X):

    /Library/Python/2.7/site-packages/scikit_learn-0.14_git-py2.7-macosx-10.8-intel.egg/sklearn/naive_bayes.pyc in predict_proba(self, X)
         96             the model, where classes are ordered arithmetically.
         97         """
    ---> 98         return np.exp(self.predict_log_proba(X))
         99 
        100 

    /Library/Python/2.7/site-packages/scikit_learn-0.14_git-py2.7-macosx-10.8-intel.egg/sklearn/naive_bayes.pyc in predict_log_proba(self, X)
         77             in the model, where classes are ordered arithmetically.
         78         """
    ---> 79         jll = self._joint_log_likelihood(X)
         80         # normalize by P(x) = P(f_1, ..., f_n)
         81         log_prob_x = logsumexp(jll, axis=1)

    /Library/Python/2.7/site-packages/scikit_learn-0.14_git-py2.7-macosx-10.8-intel.egg/sklearn/naive_bayes.pyc in _joint_log_likelihood(self, X)
        433 
        434         if self.binarize is not None:
    --> 435             X = binarize(X, threshold=self.binarize)
        436 
        437         n_classes, n_features = self.feature_log_prob_.shape

    /Library/Python/2.7/site-packages/scikit_learn-0.14_git-py2.7-macosx-10.8-intel.egg/sklearn/preprocessing.pyc in binarize(X, threshold, copy)
        537         X.data[cond] = 1
        538         X.data[not_cond] = 0
    --> 539         X.eliminate_zeros()
        540     else:
        541         cond = X > threshold

    /Library/Python/2.7/site-packages/scipy-0.13.0.dev_c31f167_20130307-py2.7-macosx-10.8-intel.egg/scipy/sparse/compressed.pyc in eliminate_zeros(self)
        572         fn = sparsetools.csr_eliminate_zeros
        573         M,N = self._swap(self.shape)
    --> 574         fn( M, N, self.indptr, self.indices, self.data)
        575
        576         self.prune() #nnz may have changed

    /Library/Python/2.7/site-packages/scipy-0.13.0.dev_c31f167_20130307-py2.7-macosx-10.8-intel.egg/scipy/sparse/sparsetools/csr.pyc in csr_eliminate_zeros(*args)
        565     csr_eliminate_zeros(int n_row, int n_col, int Ap, int Aj, npy_clongdouble_wrapper Ax)
        566     """
    --> 567   return _csr.csr_eliminate_zeros(*args)
        568
        569 def csr_sum_duplicates(*args):

    TypeError: Array of type 'byte' required.  Array of type 'bool' given

--show-most-informative 10 ignored

Using the invocation shown on the blog, the parameter is ignored.

http://streamhacker.com/2010/10/25/training-binary-text-classifiers-nltk-trainer/

$ python train_classifier.py --algorithm NaiveBayes --instances files --fraction 0.75 --show-most-informative 10 --no-pickle movie_reviews
2 labels: ['neg', 'pos']
1500 training feats, 500 testing feats
training ['NaiveBayes'] classifier
training NaiveBayes classifier
accuracy: 0.718000
neg precision: 0.950413
neg recall: 0.460000
neg f-measure: 0.619946
pos precision: 0.643799
pos recall: 0.976000
pos f-measure: 0.775835
[z@a japerk-nltk-trainer-dc71c61]$

setup.py tries to open README.txt

When trying to install using pip, I get this error:

IOError: [Errno 2] No such file or directory: '/Users/icaro.medeiros/.virtualenvs/pylearner/src/nltk-trainer/README.txt'
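A common fix for this kind of error is to read the long description defensively in setup.py, so installs still work when the file is missing or named differently; a sketch (the actual readme filename in the repo is an assumption):

    # setup.py (sketch): fall back gracefully if the readme is missing
    import os

    def read_long_description():
        for name in ('README.rst', 'README.md', 'README.txt'):
            if os.path.exists(name):
                with open(name) as f:
                    return f.read()
        return ''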

Use with my own corpus

I have my own corpus that I want to use to classify strings as keep/reject. It's not immediately obvious how to use a corpus that's not in the nltk data directory. I'm guessing I'll need to modify the code?
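One approach that should not require code changes: expose the corpus through NLTK's CategorizedPlaintextCorpusReader, with one subdirectory per category. A minimal sketch, assuming plain-text files under keep/ and reject/ subdirectories (the paths and patterns are placeholders):

    from nltk.corpus.reader import CategorizedPlaintextCorpusReader

    # layout: my_corpus/keep/doc1.txt, my_corpus/reject/doc2.txt, ...
    reader = CategorizedPlaintextCorpusReader(
        '/path/to/my_corpus', r'.*\.txt', cat_pattern=r'(\w+)/.*')

    print(reader.categories())                   # ['keep', 'reject']
    print(reader.words(categories='keep')[:10])

The train_classifier.py script also accepts a corpus path plus a --cat_pattern argument (as in some of the commands elsewhere on this page), which may avoid even this much code.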

No Generated Pickle While Using Cross Validation

I want to pickle my trained classifier, but I don't get a pickled classifier when I use the --cross-fold option.

python train_classifier.py --algorithm NaiveBayes --instances sents --fraction 0.9 --cross-fold 10 --show-most-informative 10 /mycorpuspath/

It shows the cross-validation results, but no pickled file is written. If I remove the --cross-fold option, the pickled file is generated.

How can I get the cross-validated classifier? Thanks.
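One workaround consistent with the behavior described above: use --cross-fold only to estimate accuracy, then rerun without it to train on all the data and write the pickle (the individual fold classifiers are not combined):

    python train_classifier.py --algorithm NaiveBayes --instances sents --cross-fold 10 /mycorpuspath/
    python train_classifier.py --algorithm NaiveBayes --instances sents /mycorpuspath/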

How to use Sentiment Analysis Code

Hi ,
I am working on classifying text as positive/negative/neutral. I have seen your demo of sentiment analysis, but I want to use your model in my own code. I don't know where this code snippet is in your repo, or how to use it in my code for sentiment analysis.
Thank you
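For reference, a minimal end-to-end sketch with the movie_reviews corpus: train and pickle a classifier with train_classifier.py, then load the pickle in your own code and classify word-presence featuresets. This gives a pos/neg model; a three-way positive/negative/neutral model would need a corpus with those three categories.

    # train once from the shell:
    #   python train_classifier.py movie_reviews --classifier NaiveBayes

    import nltk.data
    from nltk.tokenize import word_tokenize

    classifier = nltk.data.load('classifiers/movie_reviews_NaiveBayes.pickle')

    def sentiment(text):
        feats = dict((word, True) for word in word_tokenize(text))
        return classifier.classify(feats)   # 'pos' or 'neg'

    print(sentiment('This movie was wonderful.'))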

Problem when saving (filename not created correctly)

I just tried training a chunker on the conll2002 dataset. I ran the following command:

python train_chunker.py conll2002 --fileids ned.train --classifier NaiveBayes

Training succeeds but upon saving the following error occurs:

Traceback (most recent call last):
  File "train_chunker.py", line 210, in <module>
    name = '%s.pickle' % '_'.join(parts)
TypeError: sequence item 1: expected string, list found

I worked around this by using the --filename flag, but thought I'd submit the bug report anyway.

Can not use sklearn as classifier

After running

python train_classifier.py movie_reviews --classifier sklearn.LinearSVC

I got this:

train_classifier.py: error: argument --classifier/--algorithm: invalid choice: 'sklearn.LinearSVC' (choose from 'NaiveBayes', 'DecisionTree', 'Maxent', 'GIS', 'IIS', 'CG', 'BFGS', 'Powell', 'LBFGSB', 'Nelder-Mead', 'MEGAM', 'TADM')

In Python, I can import sklearn with no error or warning.

How can I solve this?

Thanks!
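A quick check that often explains this symptom, assuming the script only adds the sklearn.* choices when it can import NLTK's scikit-learn wrapper: run the following with the same interpreter that runs train_classifier.py.

    import sklearn
    print(sklearn.__version__)

    # NLTK's scikit-learn wrapper; if this import fails, the sklearn.*
    # classifier choices will not appear in train_classifier.py
    from nltk.classify.scikitlearn import SklearnClassifier
    print(SklearnClassifier)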

featx.py

https://github.com/japerk/nltk-trainer
This package is not the same as the one in the book! It does not match what you have in your NLTK Cookbook: featx.py is not complete on GitHub.

Please let me know where the latest version is! Otherwise I will have to return the book because of all these issues.

-fk

python 3.x compatibility

Hi Jacob,

The documentation says the code is Python 3 compatible ("These scripts are Python 2 & 3 compatible and work with NLTK 2.0.4 and higher."), but here it is described as Python 2 only. I tried to run the code on Python 3.7 but ran into compatibility issues.

It would be very nice if you could make the code compatible with Python 3.

Thanks.

requirements.txt might install nltk 3.x

in requirements.txt, the line:
nltk>=2.0b8
might lead to nltk 3.x being installed, which is different from the requirements stated in the README (and has caused problems for my code personally).

I recommend setting an upper limit with something like this:
nltk>=2.0b8,<=2.9.9
