nltk / nltk

NLTK Source

Home Page: https://www.nltk.org

License: Apache License 2.0

Makefile 0.16% Shell 0.09% Python 98.11% Jupyter Notebook 1.12% CSS 0.01% HTML 0.50%
machine-learning natural-language-processing nlp nltk python

nltk's Introduction

Natural Language Toolkit (NLTK)


NLTK -- the Natural Language Toolkit -- is a suite of open source Python modules, data sets, and tutorials supporting research and development in Natural Language Processing. NLTK requires Python version 3.8, 3.9, 3.10, 3.11 or 3.12.
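
For example, a minimal smoke test after installing the package from PyPI with pip install nltk (the Punkt tokenizer models are downloaded on first use; output shown for illustration):

>>> import nltk
>>> nltk.download('punkt')   # one-time download of the Punkt tokenizer models
True
>>> nltk.word_tokenize("NLTK is fun.")
['NLTK', 'is', 'fun', '.']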

For documentation, please visit nltk.org.

Contributing

Do you want to contribute to NLTK development? Great! Please read CONTRIBUTING.md for more details.

See also how to contribute to NLTK.

Donate

Have you found the toolkit helpful? Please support NLTK development by donating to the project via PayPal, using the link on the NLTK homepage.

Citing

If you publish work that uses NLTK, please cite the NLTK book, as follows:

Bird, Steven, Edward Loper and Ewan Klein (2009).
Natural Language Processing with Python.  O'Reilly Media Inc.

Copyright

Copyright (C) 2001-2023 NLTK Project

For license information, see LICENSE.txt.

AUTHORS.md contains a list of everyone who has contributed to NLTK.

Redistributing

  • NLTK source code is distributed under the Apache 2.0 License.
  • NLTK documentation is distributed under the Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 United States license.
  • NLTK corpora are provided under the terms given in the README file for each corpus; all are redistributable and available for non-commercial use.
  • NLTK may be freely redistributed, subject to the provisions of these licenses.

nltk's People

Contributors

alexrudnick, alvations, bmaland, dannysepler, daviddoukhan, dhgarrette, dimazest, dmmolitor, ekaf, ewan-klein, fievelk, heatherleaf, hoontw, iliakur, jfrazee, jnothman, kmike, larsmans, longdt219, lrnzcig, muneson, naoyak, nschneid, purificant, rmalouf, sparcs, stevenbird, tomaarsen, tresoldi, xim


nltk's Issues

Punkt model has problems with acronyms

>>> sent_tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
>>> sent_tokenizer.tokenize("John saw Mary in the USA today. John saw Mary in the U.S.A. today. John saw Mary.")
['John saw Mary in the USA today.', 'John saw Mary in the U.S.A. today. John saw Mary.']

It never wants to put a break after U.S.A., even when it's at the end of a sentence.

>>> sent_tokenizer.tokenize("John saw Mary in the U.S.A. John saw Mary in
the U.S.A. today.")
['John saw Mary in the U.S.A. John saw Mary in the U.S.A. today.']

We need this working for chapter 3

Migrated from http://code.google.com/p/nltk/issues/detail?id=324


earlier comments

jmhelmhout said, at 2010-02-07T21:21:58.000Z:

First of all, thank you very much for creating the NLTK package. It helps a lot when working with ML and text.

My issue concerns the sentence boundary detection of Kiss and Strunk (Punkt).

I encountered a difficulty with a text in which the writer uses "e.g." and "i.e." a lot.

As a temporary fix, I created a domain/author-specific list of abbreviations (adding 'e.g' and 'i.e' to the abbrev_types set). Although not the best solution, it gives quick relief for texts produced in a certain domain or by a particular author (useful, e.g., for author identification).

Although this dictionary-based approach is probably not ideal, it will help prevent the issue.

Hence, it would help to make this more explicit via a public method call, so that one could add a domain-specific list to abbrev_types.
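
For illustration, a sketch of the kind of helper being suggested (hypothetical function name; it relies on the tokenizer's internal _params attribute discussed in this thread and is not an existing NLTK method):

def add_domain_abbreviations(tokenizer, abbreviations):
    # Punkt stores abbreviations lowercased and without the trailing period
    for abbrev in abbreviations:
        tokenizer._params.abbrev_types.add(abbrev.lower().rstrip('.'))

add_domain_abbreviations(sent_tokenizer, ['e.g.', 'i.e.'])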

All the best
Martin Helmhout

jan.strunk said, at 2010-05-11T14:41:58.000Z:

I really like the idea of incorporating additional lexical information (such as abbreviation lists) into a trained Punkt model.

joel.nothman said, at 2010-06-30T05:43:30.000Z:

On acronyms:

It may be because the pickled data has changed, but I can't replicate your first example, Steve.

sent_tokenizer.tokenize("John saw Mary in the USA today. John saw Mary
in the U.S.A. today. John saw Mary.")
['John saw Mary in the USA today.', 'John saw Mary in the U.S.A. today.', 'John saw Mary.']

But the issue is in the second example:

sent_tokenizer.tokenize("John saw Mary in the U.S.A. John saw Mary in the U.S.A. today.")
['John saw Mary in the U.S.A. John saw Mary in the U.S.A. today.']

This isn't actually a problem with acronyms:

sent_tokenizer.tokenize("John saw Mary in Calif. John saw Mary in Calif. today.")
['John saw Mary in Calif. John saw Mary in Calif. today.']

And it's also partially a problem with John:

sent_tokenizer.tokenize("John saw Mary in the U.S.A. He saw Mary in the U.S.A. today.")
['John saw Mary in the U.S.A.', 'He saw Mary in the U.S.A. today.']

This is just a false positive given by the algorithm.

Let's look at the learnt parameters:

  • 'u.s.a' and 'calif' are both in abbrev_types.
  • 'he' is in sent_starters, 'john' is not ('he' was collocated with a preceding '.' in training).
  • None of [('calif', 'john'), ('u.s.a', 'john'), ('calif', 'he'), ('u.s.a', 'he')] are in collocations
  • ortho_context['he'] == 46, i.e. it appears capitalised at the beginning or in the middle of a sentence, and lowercase in the middle of a sentence.
  • ortho_context['john'] == 14, i.e. it appears capitalised always.

Given this information, 'he' is likely to start a sentence after an abbreviation, while an abbreviation preceding 'John' will be indeterminately a sentence boundary because John will be capitalised in any case.

Note that Kiss and Strunk's rules for using these features mean that setting ortho_context['john'] = 46 is not sufficient to work. Adding 'john' to sent_starters is.
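
For example (a hedged sketch using the internal attributes discussed above, not a supported API; the output follows from the analysis in this comment):

>>> sent_tokenizer._params.sent_starters.add('john')
>>> sent_tokenizer.tokenize("John saw Mary in the U.S.A. John saw Mary in the U.S.A. today.")
['John saw Mary in the U.S.A.', 'John saw Mary in the U.S.A. today.']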

So as long as NLTK's sentence boundary detection uses a largely-unmodified Punkt, and so long as you train it on a corpus that starts sentences with John much less often than the NLTK book does (yay for domain adaptation problems), I don't think this issue will be fixed.

As to the other posts: Incorporating unlearned information is a great idea, as is incorporating properties gleaned from corpora with gold-standard sentence boundaries.

mmmasterluke said, at 2010-09-08T09:05:34.000Z:

I'm working on scientific text.

The tokenizer doesn't recognize things like Fig., Eq., Ref. and al. ('et al.') as abbreviations but as sentence boundaries.

I tried to solve the issue by adding items to abbrev_types as suggested in comment 1 but this has no effect (except for 'al'). Sentences are still split between Fig. and the following number.

I also tried adding ('Fig.', '##number##') to the collocations set but nothing changed.

What's the problem here?

jmhelmhout said, at 2010-09-08T11:43:31.000Z:

This helped for me. I do not know if this is applicable to the current release, because I used the following code a long time ago:

def extract_sentences(text):
    punkttt = nltk.tokenize.PunktSentenceTokenizer()
    # list of abbreviations; remove the last period from each abbreviation,
    # example list: ['i.e', 'e.g']
    abbreviations = []
    for abbrev in abbreviations:
        punkttt._params.abbrev_types.add(abbrev)
    return punkttt.sentences_from_text(text, True)

This returns a generator object; use output.next() to iterate.

Hope this helps
Martin

mmmasterluke said, at 2010-09-08T13:17:35.000Z:

Thanks for the reply. I did exactly the same in my first approach but to no avail.

I'm either missing something about how Punkt works internally or there is some technical problem.

It would be great to add patterns that cannot contain a sentence boundary (or those that must, for that matter).

joel.nothman said, at 2010-09-11T09:26:36.000Z:

Yes, I agree it would be useful to add patterns that do not contain a sentence boundary, but it's not part of the original algorithm, and so it's not been implemented (yet).

When adding to abbrev_types, make sure you enter the word (as in the example after this list):

  • lowercase;
  • without a following full-stop.
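
For instance:

sent_tokenizer._params.abbrev_types.add('fig')   # not 'Fig', and not 'fig.'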

mmmasterluke said, at 2010-09-13T09:42:11.000Z:

I forgot to add some abbrevs in lowercase. When I add 'fig' instead of 'Fig' it works. Thanks!

xfst interface

Add an interface to the Xerox Finite State Toolkit.
There are Python bindings for xfst; can we redistribute this code?

Migrated from http://code.google.com/p/nltk/issues/detail?id=73

nltk.draw.pos-concordance(), mentioned on website, does not exist

From website ( http://nltk.googlecode.com/svn/trunk/doc/howto/corpus.html#tagged-corpora ):

"Use nltk.draw.pos-concordance() to access a GUI for searching tagged corpora."

However, the command fails with "AttributeError: 'module' object has no attribute 'pos'".

Similar problem if I replace the hyphen with an underscore.

Yet concordance.py does exist:

http://code.google.com/p/nltk/source/browse/trunk/nltk/nltk/draw/concordance.py?spec=svn8391&r=7460

though the path at the top of the page is confusing, due to the double "nltk":

"svn/ trunk/ nltk/ nltk/ draw/ concordance.py"

The command nltk.draw.demo() did work, and according to the source code, it called pos_concordance:

def demo():
    pos_concordance()

But all the demo seemed to do was allow one to move graphical elements around in the window, flip a tree from vertical to horizontal arrangement, and so on. I had been hoping it would allow one to drill down on a component or otherwise perform a function other than cosmetic rearrangement. At the least, I would think it would cast light on how the four trees were created via the API.

To summarize:
(1) The website refers to what seems to be a nonworking call.
(2) When I look in the source code and try to exercise the closest thing to that call, I get less than I was expecting.

Problem in using nullary predicate (ie propositional) symbols for tableau and resolution provers

What steps will reproduce the problem?

>>> import nltk
>>> from nltk.sem import logic
>>> lp = logic.LogicParser()
>>> con = lp.parse('(P & -P)')
>>> con
<AndExpression (P & -P)>
>>> from nltk.inference import *
>>> get_prover(con, [], prover_name='Prover9').prove()
False
>>> get_prover(con, [], prover_name='tableau').prove()
>>> get_prover(con, [], prover_name='resolution').prove()

What is the expected output? What do you see instead?

We should get False, False, False.
The call to Prover9 returns False, but the other two raise errors:

File "/Users/ewan/svn/nltktrunk/nltk/nltk/test/tp-temp.doctest", line 10,
in tp-temp.doctest
Failed example:
get_prover(con, [], prover_name='tableau').prove()
Exception raised:
Traceback (most recent call last):
File
"/System/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/doctest.py",
line 1212, in __run
compileflags, 1) in test.globs
File "<doctest tp-temp.doctest[7]>", line 1, in <module>
File "/Library/Python/2.5/site-packages/nltk/inference/api.py", line
266, in prove
verbose)
File "/Library/Python/2.5/site-packages/nltk/inference/tableau.py",
line 32, in prove
result = _attempt_proof(agenda, set(), set(), debugger)
File "/Library/Python/2.5/site-packages/nltk/inference/tableau.py",
line 196, in _attempt_proof
return proof_method(current, agenda, accessible_vars, atoms, debug)
File "/Library/Python/2.5/site-packages/nltk/inference/tableau.py",
line 303, in _attempt_proof_n_and
agenda.put(-current.term.first)
File "/Library/Python/2.5/site-packages/nltk/inference/tableau.py",
line 79, in put
self.sets[self._categorize_expression(ex_to_add)].add(ex_to_add)
File "/Library/Python/2.5/site-packages/nltk/inference/tableau.py",
line 135, in _categorize_expression
return self._categorize_NegatedExpression(current)
File "/Library/Python/2.5/site-packages/nltk/inference/tableau.py",
line 166, in _categorize_NegatedExpression
raise ProverParseError()
ProverParseError


File "/Users/ewan/svn/nltktrunk/nltk/nltk/test/tp-temp.doctest", line 11,
in tp-temp.doctest
Failed example:
get_prover(con, [], prover_name='resolution').prove()
Exception raised:
Traceback (most recent call last):
File
"/System/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/doctest.py",
line 1212, in __run
compileflags, 1) in test.globs
File "<doctest tp-temp.doctest[8]>", line 1, in <module>
File
"/Library/Python/2.5/site-packages/nltk/inference/resolution.py", line 107,
in prove
verbose)
File
"/Library/Python/2.5/site-packages/nltk/inference/resolution.py", line 40,
in prove
clauses.extend(clausify(-goal))
File
"/Library/Python/2.5/site-packages/nltk/inference/resolution.py", line 418,
in clausify
for clause in _clausify(skolemize(expression)):
File
"/Library/Python/2.5/site-packages/nltk/inference/resolution.py", line 486,
in skolemize
return to_cnf(skolemize(-negated.first, univ_scope),
File
"/Library/Python/2.5/site-packages/nltk/inference/resolution.py", line 509,
in skolemize
raise ProverParseError()
ProverParseError

Please use labels and text to provide additional information.

Migrated from http://code.google.com/p/nltk/issues/detail?id=138


earlier comments

ewan.klein said, at 2008-12-15T14:08:51.000Z:

I'm now getting a different error, which is similar to the problem I had with the Cooper storage parsing, and seems to be due to the use of isinstance() tripping over the non-qualified class.

>>> from nltk.inference import *
>>> get_prover(con, [], prover_name='tableau').prove()
Traceback (most recent call last):
  File "/Users/ewan/svn/nltktrunk/", line 1, in
  File "/Library/Python/2.5/site-packages/nltk/inference/api.py", line 266, in prove
    verbose)
  File "/Library/Python/2.5/site-packages/nltk/inference/tableau.py", line 40, in prove
    agenda.put(-goal)
  File "/Users/ewan/svn/nltktrunk/", line 351, in __neg__
  File "/Library/Python/2.5/site-packages/nltk/sem/logic.py", line 957, in __init__
    assert isinstance(term, Expression), "%s is not an Expression" % term
AssertionError: (P & -P) is not an Expression
>>> get_prover(con, [], prover_name='resolution').prove()
Traceback (most recent call last):
  File "/Users/ewan/svn/nltktrunk/", line 1, in
  File "/Library/Python/2.5/site-packages/nltk/inference/resolution.py", line 107, in prove
    verbose)
  File "/Library/Python/2.5/site-packages/nltk/inference/resolution.py", line 52, in prove
    clauses.extend(clausify(-goal))
  File "/Users/ewan/svn/nltktrunk/", line 351, in __neg__
  File "/Library/Python/2.5/site-packages/nltk/sem/logic.py", line 957, in __init__
    assert isinstance(term, Expression), "%s is not an Expression" % term
AssertionError: (P & -P) is not an Expression

DHGarrette said, at 2008-12-21T10:32:49.000Z:

Fixed for Tableau prover. Still working on resolution prover.

doctests

Lots of doctests are still broken, and need joint effort to repair...

http://nltk.googlecode.com/svn/trunk/doc/howto/index.html
http://buildbot.nltk.org/

Migrated from http://code.google.com/p/nltk/issues/detail?id=212


earlier comments

StevenBird1 said, at 2009-02-02T01:11:39.000Z:

I've fixed up a bunch of the doctests and updated the site:

http://nltk.googlecode.com/svn/trunk/doc/howto/index.html

The semantics-related packages still have a lot of errors which would be good to fix
in time for the 0.9.8 release.

DHGarrette said, at 2009-02-02T02:54:27.000Z:

I updated and ran all of the semantics doctests. No errors were found. Do you have the required 3rd party tools (Prover9, Mace4, MaltParser) installed?

StevenBird1 said, at 2009-02-09T09:23:16.000Z:

Thanks for confirming that there are no errors in the doctests, just in my config. It's a pity that these third-party tools lack conventional installers. I plan to document the installation process a bit more. One source of confusion is the name of the PROVER9HOME environment variable -- it sounds like the root of a directory tree rather than the binary itself. Any objections if I change this to just PROVER9?

StevenBird1 said, at 2009-02-09T09:31:57.000Z:

Also, any objections to MALTPARSERHOME -> MALTPARSER ?

StevenBird1 said, at 2009-02-09T11:20:04.000Z:

Now that the Prover9 binary can be found, Python doesn't like it. Perhaps I got the wrong one. Which distribution should users install: the one that includes the GUI, or the commandline version?

File "/Users/sb/Documents/workspace/nltk/trunk/nltk/nltk/test/nonmonotonic.doctest",
line 135, in ../../nltk/test/nonmonotonic.doctest
Failed example:
unp.prove()
Exception raised:
Traceback (most recent call last):
File
"/System/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/doctest.py",
line 1212, in __run
compileflags, 1) in test.globs
File "<doctest ../../nltk/test/nonmonotonic.doctest[67]>", line 1, in
File "/Users/sb/Documents/workspace/nltk/trunk/nltk/nltk/inference/api.py",
line 398, in prove
self.assumptions(),
File
"/Users/sb/Documents/workspace/nltk/trunk/nltk/nltk/inference/nonmonotonic.py", line
108, in assumptions
if get_prover(newEqEx, assumptions).prove():
File "/Users/sb/Documents/workspace/nltk/trunk/nltk/nltk/inference/api.py",
line 263, in prove
verbose)
File "/Users/sb/Documents/workspace/nltk/trunk/nltk/nltk/inference/prover9.py",
line 238, in prove
verbose=verbose)
File "/Users/sb/Documents/workspace/nltk/trunk/nltk/nltk/inference/prover9.py",
line 260, in _call_prover9
return self._call(input_str, self._prover9_bin, args, verbose)
File "/Users/sb/Documents/workspace/nltk/trunk/nltk/nltk/inference/prover9.py",
line 193, in _call
stdin=subprocess.PIPE)
File
"/System/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/subprocess.py",
line 593, in init
errread, errwrite)
File
"/System/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/subprocess.py",
line 1079, in _execute_child
raise child_exception
OSError: [Errno 8] Exec format error

DHGarrette said, at 2009-02-09T13:16:04.000Z:

PROVER9HOME and MALTPARSERHOME should point to the binary directories:

export PROVER9HOME="/usr/local/bin/LADR-2008-11A/bin/"
export MALTPARSERHOME="/usr/local/bin/malt-1.2/"

I have no problem with the environment variables being renamed.

I use the command line version of Prover9. I haven't tried the nltk code with the
gui version.

StevenBird1 said, at 2009-02-10T09:58:59.000Z:

I've changed the prover9 download links to the command-line version and added some brief installation instructions here: http://www.nltk.org/download Feedback welcome.

StevenBird1 said, at 2009-02-11T12:28:14.000Z:

Still some issues with the semantics doctests, but went ahead with 0.9.8b1 release anyway. I hope to have this all fixed in time for 0.9.8.

StevenBird1 said, at 2009-07-07T07:29:53.000Z:

I've fixed several more doctest problems, but many tests still fail, including:

chat80: 4 errors (some stale API issues?)
featgram: 7 errors (inconsistent feature uppercasing in sample grammars)
corpus: 2 errors (see issue 407)
inference: too slow at line 547: ppbc.prove()

ewan.klein said, at 2009-07-11T21:42:14.000Z:

the chat80 and featgram doctests now pass. I'm getting this error with relextract, which seems to be a bug in the conll corpus reader -- possibly a Unicode issue?

File "/Users/ewan/svn/nltktrunk/nltk/nltk/test/relextract.doctest", line 255, in
relextract.doctest
Failed example:
for doc in conll2002.chunked_sents('ned.train'):
for r in relextract.extract_rels('PER', 'ORG', doc, corpus='conll2002',
pattern=VAN):
print relextract.show_clause(r, relsym="VAN")
Exception raised:
Traceback (most recent call last):
File
"/System/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/doctest.py",
line 1212, in __run
compileflags, 1) in test.globs
File "<doctest relextract.doctest[25]>", line 1, in
for doc in conll2002.chunked_sents('ned.train'):
File "/Users/ewan/svn/nltktrunk/nltk/nltk/util.py", line 844, in iterate_from
for value in self._lists[0].iterate_from(index):
File "nltk/corpus/reader/util.py", line 288, in iterate_from
File "nltk/corpus/reader/conll.py", line 193, in _read_grid_block
ValueError: Inconsistent number of columns:
U Pron O
mag V O
één V O
keer N O
raden V O
wie Pron O
hem Pron O
dàt V O
idee N O
heeft V O
ingefluisterd V O
. Punc O

StevenBird1 said, at 2009-09-19T03:03:37.000Z:

There's an issue with classify.doctest:

[Found megam: /usr/local/bin/megam]
                 test[0]        test[1]        test[2]        test[3]  
                p(x)  p(y)     p(x)  p(y)     p(x)  p(y)     p(x)  p(y)
-----------------------------------------------------------------------
     LBFGSB Error: <exceptions.ValueError instance at 0x1f33f08>
        GIS     0.16  0.84     0.46  0.54     0.41  0.59     0.76  0.24
        IIS     0.16  0.84     0.46  0.54     0.41  0.59     0.76  0.24
Nelder-Mead Error: <exceptions.ValueError instance at 0x1f33b98>
         CG Error: <exceptions.ValueError instance at 0x1f33e18>
       BFGS Error: <exceptions.ValueError instance at 0x1f33ee0>
      MEGAM     0.16  0.84     0.46  0.54     0.41  0.59     0.76  0.24
       TADM Error: <exceptions.LookupError instance at 0x1f3a238>
     Powell Error: <exceptions.ValueError instance at 0x1f33e40>

Diagnostic output from decision tree classifier

What steps will reproduce the problem?

  1. running the decision tree classifier

What is the expected output? What do you see instead?

Lots of diagnostic output of the form
best stump for 13 toks uses None err=0.6923

I think line 135 of decisiontree.py should be removed (or put inside the scope of a verbose flag)
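
A sketch of the suggested change (illustrative only; the variable names toks, feature and error are assumed, not taken from decisiontree.py):

# only emit the diagnostic when the caller asks for it
if verbose:
    print 'best stump for %d toks uses %s err=%.4f' % (len(toks), feature, error)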

Migrated from http://code.google.com/p/nltk/issues/detail?id=142


earlier comments

StevenBird1 said, at 2009-02-21T01:44:59.000Z:

Carried forward to 0.9.9

classifier API

Classifiers are trained using a factory function, e.g.:

nltk.NaiveBayesClassifier.train(train_set)

N-Gram taggers used to be trained like this too, but we changed them to do training in the constructor, e.g.:

nltk.BigramTagger(train_sents)

NEChunkParserTagger, NGramModel, FreqDist, and ConditionalFreqDist are also
initialized with data to the constructor. So how about:

nltk.NaiveBayesClassifier(train_set) ?
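
For illustration, a thin wrapper sketch of what the constructor-trained style could look like (hypothetical class, not NLTK API; it simply delegates to the existing factory method):

class NaiveBayes(object):
    def __init__(self, train_set):
        # train in the constructor, mirroring BigramTagger(train_sents)
        self._classifier = nltk.NaiveBayesClassifier.train(train_set)
    def classify(self, featureset):
        return self._classifier.classify(featureset)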

Migrated from http://code.google.com/p/nltk/issues/detail?id=316


earlier comments

[email protected] said, at 2009-04-02T18:05:16.000Z:

The BrillTaggerTrainer (and Fast-) also have train() methods; perhaps they could be cleaned up too?

trainer = nltk.BrillTaggerTrainer(basetagger, templates)
tagger = trainer.train(training_data)

How about this instead:

tagger = nltk.BrillTaggerTrainer(basetagger, templates, training_data)

StevenBird1 said, at 2009-05-04T01:11:41.000Z:

It would be good to resolve this in 0.9.9

sem.relextract demo

What steps will reproduce the problem?

  1. sem.relextract.demo()

What is the expected output? What do you see instead?

Dutch CoNLL2002: van(PER, ORG) -- raw rtuples with context:

...'')[PER: "Cornet/V d'Elzius/N"] 'is/V op/Prep dit/Pron ogenblik/N kabinetsadviseur/N van/Prep staatssecretaris/N voor/Prep' [ORG: 'Buitenlandse/N Handel/N'](''...
Traceback (most recent call last):
  File "nltk/sem/relextract.py", line 437, in <module>
    conllned()
  File "nltk/sem/relextract.py", line 401, in conllned
    for doc in conll2002.chunked_sents('ned.train'):
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/site-packages/nltk/util.py", line 869, in iterate_from
    for value in self._lists[0].iterate_from(index):
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/site-packages/nltk/corpus/reader/util.py", line 275, in iterate_from
    tokens = self.read_block(self._stream)
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/site-packages/nltk/corpus/reader/conll.py", line 189, in _read_grid_block
    % block)
ValueError: Inconsistent number of columns:
U Pron O
mag V O
één V O
keer N O
raden V O

Please use labels and text to provide additional information.

Migrated from http://code.google.com/p/nltk/issues/detail?id=158


earlier comments

StevenBird1 said, at 2009-09-19T05:25:51.000Z:

The same problem shows up in relextract.doctest

small bugs in nltk.wordnet_app

  • page_from_reference(href) - href type is 'str' in a docstring but href.word is accessed in the body of the function;
  • page_from_reference, line 789: I think it is better not to rely on w being defined there (it leaks from the list comprehension, but this will be gone in Python 3);
  • get_static_page_by_path, line 834 - the code with 'f=open(..)' is unreachable;

rst xref warnings

rst processing should display a warning when a numbered item (example,
pylisting, figure, table) is not referenced from within the file. An
approximation would be to find all such items having a manually-defined
symbolic id such as ex-foo or fig-bah, and make sure that this is cited
from within the file. Note that chapter-external references are irrelevant.

Migrated from http://code.google.com/p/nltk/issues/detail?id=304

English NER data

What data set would you like to have added to NLTK? If you have a particular corpus in mind,
please include information about availability, license, and permission for us to redistribute the
data.

There should be some English named entity data in the NLTK corpus collection. We have the
CoNLL 2002 data for Spanish and Dutch. The CoNLL 2003 data may be suitable:

> Language-Independent Named Entity Recognition at CoNLL-2003
> Notes: This dataset is a manual annotation of a subset of RCV1 (Reuters Corpus Volume 1).
> The annotation per se is available free of charge (subject to a licensing agreement) from the
> CoNLL site. The raw text of RCV1 documents must be requested from NIST (also free of charge
> and also subject to a licensing agreement).
> http://www.cnts.ua.ac.be/conll2003/ner

http://www.cs.technion.ac.il/~gabr/resources/data/ne_datasets.html

Migrated from http://code.google.com/p/nltk/issues/detail?id=109


earlier comments

ewan.klein said, at 2008-11-27T09:56:58.000Z:

The CoNLL 2003 data might be a good bet, but I haven't got round to looking at it since the procedure for merging the annotation and the raw data presupposes you have the Reuters data available as a CD.

The MUC34 data doesn't seem to be in a suitable format. It would be really nice if we
could get hold of a portion of MUC7 from LDC, since it is one of the canonical datasets.

StevenBird1 said, at 2009-02-15T05:10:00.000Z:

We have a trained model for named-entity tagging, so providing a corpus is less important now.

Language Classification data

Linguist List, Rosetta Project, and SIL all have language classification data. Can we obtain this data
and allow people to explore it programmatically, generate visualizations, etc?

Migrated from http://code.google.com/p/nltk/issues/detail?id=112

BracketParseCorpusReader's error-recovery behavior could be improved

Mike M said:

If you look at the _parse() function in BracketParseCorpusReader
(http://nltk.org/doc/api/nltk.corpus.reader.bracket_parse-
pysrc.html#BracketParseCorpusReader._parse) it writes to the standard
output whenever it encounters an unexpected situation.

Does anyone know a quick and dirty way to prevent this??

Also, wouldn't it make more sense to throw an exception here instead
of writing to stderr??

Then Edward Loper said:

The motivation behind not raising an exception here is robustness -- if
your corpus file isn't quite valid (i.e., a single paren is off) then
throwing an exception here would make it impossible to read any portion of
the corpus file. This can be problematic for parsed corpora that consist
of a single big file.

The best solution here would probably be to default to throwing an
exception; but to allow an argument to the corpus reader that specifies
whether it should try to recover from bad trees. If a bad tree is
detected, it might be better to use the warnings module than to just print
to stderr.
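
A rough sketch of that proposal (hypothetical names; parse_block stands in for the reader's actual tree-building step):

import warnings

def _parse(self, block, recover_from_errors=False):
    try:
        return parse_block(block)   # hypothetical helper that builds the Tree from the bracketed string
    except ValueError as err:
        if not recover_from_errors:
            raise                   # default: fail loudly on a malformed tree
        warnings.warn("skipping malformed tree: %s" % err)
        return None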

Migrated from http://code.google.com/p/nltk/issues/detail?id=218

Chat-80 errors

>>> val = nltk.corpus.chat80.make_valuation(concepts, read=True)
/usr/lib/python2.4/site-packages/nltk/corpus/chat80.py:569: DeprecationWarning:
Function read() has been deprecated.  Call the valuation as an initialization parameter instead
  valuation.read(pairs)

>>> val['city']['calcutta']
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
TypeError: unsubscriptable object

Migrated from http://code.google.com/p/nltk/issues/detail?id=126


earlier comments

ewan.klein said, at 2008-12-18T16:09:50.000Z:

This will work:

>>> val = chat80.make_valuation(concepts, read=True)
>>> 'calcutta' in val['city']
True
>>> [town for (town, country) in val['country_of'] if country == 'india']
['bombay', 'delhi', 'madras', 'hyderabad', 'calcutta']

But make_valuation() needs to be updated to conform to the current framework in sem.evaluate.

Treebank Tokenizer buggy

The NLTK TreebankWordTokenizer appears to have at least two bugs in it:

(a) It doesn't properly handle sentences with quotes at the end.

 Input:
``If we don't have any further suggestions as to how we should present our case, we should be able to finish this.''

Output:
['`', '`', 'If', 'we', 'do', "n't", 'have', 'any', 'further', 'suggestions', 'as', 'to', 'how', 'we', 'should', 'present', 'our', 'case', ',', 'we', 'should', 'be', 'able', 'to', 'finish', "this.''"]

(b) According to the official Penn Treebank Tokenization page at http://www.cis.upenn.edu/~treebank/tokenization.html, double quotes (") should be changed to doubled single forward- and backward- quotes (`` and ''). However, nltk.word_tokenize() (TreebankTokenizer) does not follow this rule.

Input: 
He went to the "Concert for Hope" yesterday.

Output: 
['He',  'went',  'to',  'the',  '"',  'Concert',  'for',  'Hope',  '"',  'yesterday', '.']

One of my colleagues, Michael Heilman, took the official Penn Treebank sed script (http://www.cis.upenn.edu/~treebank/tokenizer.sed) and ported it to NLTK using the Tokenizer API. It can be found here: https://gist.github.com/1506443

This script should replicate the sed script and, therefore, should not have either of these issues.

Better visualization of dependency trees

At the moment, we ignore the arc labels on dependency graphs.

Migrated from http://code.google.com/p/nltk/issues/detail?id=141


earlier comments

ewan.klein said, at 2009-01-15T20:51:03.000Z:

The only graphical representation of dependency trees that we can automatically generate is a tree representation, using words rather than grammatical categories as nodes. However, this doesn't allow arc labels (e.g. modifier, subject-of) to be represented, which is an important omission.

StevenBird1 said, at 2009-01-15T21:19:44.000Z:

N.B. nltk_contrib.fst

StevenBird1 said, at 2009-02-21T01:44:59.000Z:

Carried forward to 0.9.9

corpus reader docstrings

The corpus reader docstrings should have a reasonable amount of
information, and be consistent from one corpus to the next, cf:

From issue 201:
"nltk.corpus.verbnet.doc gives very little information, so it could be
good to describe that a little."

Migrated from http://code.google.com/p/nltk/issues/detail?id=252

TimeBank

What data set would you like to have added to NLTK? If you have a
particular corpus in mind, please include information about availability,
license, and permission for us to redistribute the data.

It would be great if at least a part of the TimeBank corpus...

http://www.cs.brandeis.edu/~jamesp/arda/time/timebank.html

...could be included with NLTK. (I see that this would require some
negotiation with the owner, though...)

Adam P.

Migrated from http://code.google.com/p/nltk/issues/detail?id=330


earlier comments

stfz65 said, at 2011-03-07T17:46:56.000Z:

A corpus reader for timebank files would also be fine.

Stefan

behaviour of extract_rels unintuitive (plus comments on section 7.1 of the book)

What version of NLTK are you using? (See nltk.__version__.) Please only submit bug reports for the current version.

0.9.8 (but this applies also to 0.9.7).

What steps will reproduce the problem? (e.g. include Python source code)

This is more or less the textbook example (from the beginning of ch.7):

IN = re.compile(r'\bin\b')
for doc in nltk.corpus.ieer.parsed_docs():
    for rel in nltk.sem.extract_rels('ORG', 'LOC', doc, pattern=IN):
        print nltk.sem.show_raw_rtuple(rel)

The above code does exactly the same when IN is defined as follows:

IN = re.compile(r'\bin\b.*')

but not the same when it's:

IN = re.compile(r'.*\bin\b')

That is, it looks like a default ".*" is implicitly added to the end of the
filler regexp IN, but not to the beginning.

What is the expected output? What do you see instead?

I find that default ".*" added implicitly to the end of my regexp surprising.

BTW1: in the original regexp in the book, and also in the IE howto, i.e.,
IN = re.compile(r'.*\bin\b(?!\b.+ing\b)'), the third "\b" is spurious (it
matches exactly the same word boundary as the second "\b" in "\bin\b").
(But I think I've already reported that.)

BTW2: a more general remark: I like other parts of chapter 7, concerning
chunking, but section 7.1 is disappointing in various respects:

  • there is no mention on how to actually do NER in NLTK,

  • relation extraction, done with the rather idiosyncratic methods
    extract_rels() and show_raw_rtuple(rel), seems convoluted to me,

  • (I have the same impression when reading section 2 of
    http://nltk.googlecode.com/svn/trunk/doc/howto/relextract.html, with the
    natural concept of the tuple (left context, NE1, filler, NE2, right
    context) produced in an apparently roundabout way from triples of pairs of
    (context, NE))

  • the following bit of code, concerning the Dutch "van", is not explained
    (the reader doesn't even know that it's about Dutch, unless (s)he spots and
    correctly parses the "ned" in "corpus='conll2002-ned'").

Migrated from http://code.google.com/p/nltk/issues/detail?id=329


earlier comments

Adam.Przepiorkowski said, at 2009-03-04T18:27:07.000Z:

P.S. Maybe it would also make sense to mention in section 7.1 of the book that there is an implicit "window = 10" argument to nltk.sem.extract_rels(), as otherwise it is surprising that this command with this regular expression produces such reasonable results.

StevenBird1 said, at 2009-03-04T19:26:13.000Z:

> I find that default ".*" added implicitly to the end of my regexp surprising.

BTW, Python's re.match() tries to match a regexp at the start of its string, as opposed to re.search(). See "Matching vs. Searching" in http://docs.python.org/library/re.html. (This doesn't mean extract_rels should behave like this.)
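
A quick illustration of that difference with the standard re module:

>>> import re
>>> IN = re.compile(r'\bin\b')
>>> bool(IN.match('in the city'))       # match() is anchored at the start of the string
True
>>> bool(IN.match('located in the city'))
False
>>> bool(IN.search('located in the city'))  # search() scans the whole string
True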

Adam.Przepiorkowski said, at 2009-03-04T20:59:37.000Z:

I see; I am new to Python, didn't know that. (In this case, maybe extract_rels should be renamed to match_rels, or so?)

StevenBird1 said, at 2009-05-03T10:20:10.000Z:

Ewan -- would you like to take action on this proposed name change, or shall we leave it?

Distribute NLTK as .msi

Describe the feature you would like us to add.

The NLTK download page currently lists the following as the Windows
distribution:

http://nltk.googlecode.com/files/nltk-0.9.7.win32.exe

This was presumably created with something like "python setup.py
bdist_wininst". As of Python 2.5, you can now instead use "python setup.py
bdist_msi" to get a nice .msi installer (which in my experience is much
nicer than the old .exe installers).

Migrated from http://code.google.com/p/nltk/issues/detail?id=219


earlier comments

StevenBird1 said, at 2009-01-29T01:36:45.000Z:

The distutils package has a method distutils.command.bdist_msi, however this doesn't appear to be accessible from setup.py (see list of supported formats below), nor is it mentioned in the distutils documentation section on creating windows installers:

http://docs.python.org/distutils/builtdist.html#creating-windows-installers

python setup.py bdist --help-formats
List of available distribution formats:
--formats=rpm RPM distribution
--formats=gztar gzip'ed tar file
--formats=bztar bzip2'ed tar file
--formats=ztar compressed tar file
--formats=tar tar file
--formats=wininst Windows executable installer
--formats=zip ZIP file

(Please reopen this if you discover some way to build an MSI package.)

steven.bethard said, at 2009-01-29T03:06:28.000Z:

It works for me. And it's in both the 2.5 and 2.6 documentation:

http://docs.python.org/dev/2.5/dist/module-distutils.command.bdistmsi.html
http://docs.python.org/distutils/apiref.html#module-distutils.command.bdist_msi

Here it is working on NLTK:

$ python setup.py bdist_msi
running bdist_msi
running build
running build_py
...
copying build\lib\nltk\book.py -> build\bdist.win32\msi\Lib\site-packages\nltk
creating build\bdist.win32\msi\Lib\site-packages\nltk\chat
copying build\lib\nltk\chat\eliza.py -> build\bdist.win32\msi\Lib\site-packages\nltk\chat
copying build\lib\nltk\chat\iesha.py -> build\bdist.win32\msi\Lib\site-packages\nltk\chat
...
Writing build\bdist.win32\msi\Lib\site-packages\nltk-0.9.6-py2.6.egg-info
removing 'build\bdist.win32\msi' (and everything under it)

$ dir /B dist
nltk-0.9.6.win32-py2.6.msi

I've filed a Python bug for the missing info in --help-formats:

http://bugs.python.org/issue5095

StevenBird1 said, at 2009-01-30T02:10:09.000Z:

Thanks for this. For Python 2.5.1 and 2.5.2, running this setup command gives me: "error: invalid command 'bdist_msi'". It looks like you're using 2.6, and that distutils.command.bdist_msi wasn't accessible via setup.py before 2.6.

steven.bethard said, at 2009-02-01T18:11:55.000Z:

That's pretty odd. I just downloaded 2.5.4 (the most recent 2.5 available), and here it is working there:

$ C:\Python25\python.exe setup.py bdist_msi
running bdist_msi
running build
running build_py
...

Seems odd that they'd introduce it in a point release. Are you running on a Windows
machine? Maybe .msi can only be created on a Windows machine?

StevenBird1 said, at 2009-02-21T01:44:59.000Z:

Carried forward to 0.9.9

StevenBird1 said, at 2009-06-21T10:46:16.000Z:

Mac Python 2.5.4 won't let me build a .msi package. Perhaps this is fixed in 2.6. Anyway, I'd like to ask for your help posting a .msi release for 2.0rc1 next week, please.

steven.bethard said, at 2009-06-21T14:51:35.000Z:

Sure, no problem.

Interface to very large corpora

Suggested by Mark Liberman:

A high-performance search (whether term-based or based on other things,
like dates) would yield a result denoting smallish segments of a very large
corpus, and this result could be interpreted by NLTK to delimit areas for
further processing.

(This could use the PyLucene API to interrogate Lucene; assuming a
text-only version of the annotated corpus was already indexed by Lucene.)

Migrated from http://code.google.com/p/nltk/issues/detail?id=268

Definitions of __str__ and __repr__ do not follow the Python recommendation

This is from the Python Language Reference:

  • object.__repr__(self)
    Called by the repr() built-in function and by string conversions (reverse quotes) to compute the “official” string representation of an object. If at all possible, this should look like a valid Python expression that could be used to recreate an object with the same value (given an appropriate environment). If this is not possible, a string of the form <...some useful description...> should be returned. The return value must be a string object. If a class defines __repr__() but not __str__(), then __repr__() is also used when an “informal” string representation of instances of that class is required.
    This is typically used for debugging, so it is important that the representation is information-rich and unambiguous.
  • object.__str__(self)
    Called by the str() built-in function and by the print statement to compute the “informal” string representation of an object. This differs from __repr__() in that it does not have to be a valid Python expression: a more convenient or concise representation may be used instead. The return value must be a string object.

(http://docs.python.org/reference/datamodel.html#object.__repr__)
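
For reference, a minimal class that follows the recommendation (illustrative only):

class Symbol(object):
    def __init__(self, name):
        self.name = name
    def __repr__(self):
        # unambiguous, ideally eval()-able
        return 'Symbol(%r)' % self.name
    def __str__(self):
        # concise, human-readable
        return self.name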


But this isn't followed in NLTK. E.g., in grammar.Nonterminal __repr__ and __str__ are the same:

    if isinstance(self._symbol, basestring):
        return '%s' % (self._symbol,)
    else:
        return '%r' % (self._symbol,)

The same goes for grammar.Production: __repr__ calls __str__, and __str__ checks whether the symbols in the LHS are Nonterminals (otherwise repr is called on the symbol).

tree.Tree has a "correct" implementation of __repr__, but __str__ calls repr on the node if the node is not a string

chart.TreeEdge and chart.LeafEdge both call repr in the definition of __str__, and __repr__ is just "[Edge: %s]", which is not a "valid Python expression that could be used to recreate an object".

I haven't looked at more files than that, but I guess there are more.


The problem with all this (apart from not following the Python standard) arises if you want to use the classes for something else. E.g., I wanted to use trees where the nodes are NOT strings or nonterminals. Suppose that I want to build a tree with dates:

>>> from datetime import date
>>> from nltk.tree import Tree
>>> xmas = date(2008, 12, 25)
>>> xmastree = Tree(xmas, ['jingle', 'jangle'])
>>> xmas
datetime.date(2008, 12, 25)
>>> xmastree
Tree(datetime.date(2008, 12, 25), ['jingle', 'jangle'])
>>> print xmas
2008-12-25
>>> print xmastree
(datetime.date(2008, 12, 25) jingle jangle)

I.e., str(xmastree) doesn't call str(xmas) when printing the node, but instead uses repr(xmas), which doesn't look good.

Migrated from http://code.google.com/p/nltk/issues/detail?id=154


earlier comments

[email protected] said, at 2008-12-17T08:09:48.000Z:

Perhaps my original example is better. I created a new class Focus, for annotating trees with focused nodes:

class Focus(object):
    def __init__(self, focus):
        self._focus = focus
    def focus(self):
        return self._focus
    def __str__(self):
        return '*%s*' % self._focus
    def __repr__(self):
        return 'Focus(%r)' % self._focus

Then we get the following:

>>> b = Focus('b')
>>> t = Tree('a', [Tree(b, ['c','d']), 'e'])
>>> b
Focus('b')
>>> t
Tree('a', [Tree(Focus('b'), ['c', 'd']), 'e'])
>>> print b
*b*
>>> print t
(a (Focus('b') c d) e)

But I want (a (b c d) e) in the last line.

StevenBird1 said, at 2008-12-18T00:28:35.000Z:

Thanks for this.

[email protected] said, at 2009-03-16T13:16:16.000Z:

Apparently Python has the same strange behaviour on containers, and this will probably NOT change:

http://www.python.org/dev/peps/pep-3140/
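
A quick illustration of that container behaviour (str() on a list falls back to repr() of its items):

>>> from datetime import date
>>> print str([date(2008, 12, 25)])
[datetime.date(2008, 12, 25)]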

Support Open section of American National Corpus

Write a corpus reader to access the content of the OANC:
http://www.anc.org/OANC/OANC-1.0.1-UTF8.zip
It would be good to avoid creating our own distribution of the data.

Migrated from http://code.google.com/p/nltk/issues/detail?id=137


earlier comments

edloper said, at 2009-02-08T04:27:30.000Z:

Mostly done with this; will try to get it checked in tomorrow.

StevenBird1 said, at 2009-02-21T01:44:59.000Z:

Carried forward to 0.9.9

tomonori.nagano said, at 2011-01-27T14:13:29.000Z:

Is this already done? I needed to use the OANC and wrote a simple CorpusReader: http://language.dyndns.org/research/ANC/

ANC has a tool (ANCTools) that "supports" the NLTK format, but the data keeps only POS information after the transformation. A better implementation would be for NLTK to read their (combined) XML format, which carries much other information such as utterances, NP/VP chunks, domain, etc.

Update corpus metadata files

Check that the license and other information in the corpus index files is correct.

Migrated from http://code.google.com/p/nltk/issues/detail?id=96

Tests are failing with python 2.7

_________________________________ [tox sdist] __________________________________
[TOX] ***creating sdist package
[TOX] /Users/kmike/svn/nltk$ /usr/local/Cellar/python/2.7.2/bin/python setup.py sdist --formats=zip --dist-dir .tox/dist >.tox/log/0.log
[TOX] ***copying new sdistfile to '/Users/kmike/.tox/distshare/nltk-2.0.1rc3.zip'
______________________________ [tox testenv:py27] ______________________________
[TOX] ***reusing existing matching virtualenv py27
[TOX] ***upgrade-installing sdist
[TOX] /Users/kmike/svn/nltk/.tox/py27/log$ ../bin/pip install --download-cache=/Users/kmike/svn/nltk/.tox/_download /Users/kmike/svn/nltk/.tox/dist/nltk-2.0.1rc3.zip -U --no-deps >13.log
[TOX] /Users/kmike/svn/nltk/nltk/test$ ../../.tox/py27/bin/python testrunner.py
***************************************************************************
File "/Users/kmike/svn/nltk/nltk/test/align.doctest", line 49, in align.doctest
Failed example:
    als.alignment = align.Alignment([(0, 0), (1, 4), (2, 1), (3, 3)])
Expected:
    Traceback (most recent call last):
        ...
    IndexError: Alignment is outside boundary of mots
Got:
    Traceback (most recent call last):
      File "/usr/local/Cellar/python/2.7.2/lib/python2.7/doctest.py", line 1254, in __run
        compileflags, 1) in test.globs
      File "<doctest align.doctest[9]>", line 1, in <module>
        als.alignment = align.Alignment([(0, 0), (1, 4), (2, 1), (3, 3)])
    AttributeError: can't set attribute

***************************************************************************
File "/Users/kmike/svn/nltk/nltk/test/align.doctest", line 53, in align.doctest
Failed example:
    als.alignment = align.Alignment([(-1, 0), (1, 2), (2, 1), (3, 3)])
Expected:
    Traceback (most recent call last):
        ...
    IndexError: Alignment is outside boundary of words
Got:
    Traceback (most recent call last):
      File "/usr/local/Cellar/python/2.7.2/lib/python2.7/doctest.py", line 1254, in __run
        compileflags, 1) in test.globs
      File "<doctest align.doctest[10]>", line 1, in <module>
        als.alignment = align.Alignment([(-1, 0), (1, 2), (2, 1), (3, 3)])
    AttributeError: can't set attribute

***************************************************************************
File "/Users/kmike/svn/nltk/nltk/test/align.doctest", line 59, in align.doctest
Failed example:
    als.alignment = align.Alignment([(1, 3), (3, 2), (0, 1), (2, 0)])
Exception raised:
    Traceback (most recent call last):
      File "/usr/local/Cellar/python/2.7.2/lib/python2.7/doctest.py", line 1254, in __run
        compileflags, 1) in test.globs
      File "<doctest align.doctest[12]>", line 1, in <module>
        als.alignment = align.Alignment([(1, 3), (3, 2), (0, 1), (2, 0)])
    AttributeError: can't set attribute

***************************************************************************
File "/Users/kmike/svn/nltk/nltk/test/align.doctest", line 60, in align.doctest
Failed example:
    als.alignment
Expected:
    Alignment([(0, 1), (1, 3), (2, 0), (3, 2)])
Got:
    Alignment([(0, 0), (1, 1), (2, 2), (3, 3)])

***************************************************************************
File "/Users/kmike/svn/nltk/nltk/test/align.doctest", line 67, in align.doctest
Failed example:
    als.alignment = [(0, 0), (1, 1), (2, 2, "boat"), (3, 3, False, (1,2))]
Exception raised:
    Traceback (most recent call last):
      File "/usr/local/Cellar/python/2.7.2/lib/python2.7/doctest.py", line 1254, in __run
        compileflags, 1) in test.globs
      File "<doctest align.doctest[14]>", line 1, in <module>
        als.alignment = [(0, 0), (1, 1), (2, 2, "boat"), (3, 3, False, (1,2))]
    AttributeError: can't set attribute

***************************************************************************
File "/Users/kmike/svn/nltk/nltk/test/align.doctest", line 68, in align.doctest
Failed example:
    als.alignment
Expected:
    Alignment([(0, 0), (1, 1), (2, 2, 'boat'), (3, 3, False, (1, 2))])
Got:
    Alignment([(0, 0), (1, 1), (2, 2), (3, 3)])

***************************************************************************
File "/Users/kmike/svn/nltk/nltk/test/align.doctest", line 70, in align.doctest
Failed example:
    als.alignment = ((0, 0), (1, 1), (2, 2), (3, 3))
Exception raised:
    Traceback (most recent call last):
      File "/usr/local/Cellar/python/2.7.2/lib/python2.7/doctest.py", line 1254, in __run
        compileflags, 1) in test.globs
      File "<doctest align.doctest[16]>", line 1, in <module>
        als.alignment = ((0, 0), (1, 1), (2, 2), (3, 3))
    AttributeError: can't set attribute

***************************************************************************
File "/Users/kmike/svn/nltk/nltk/test/align.doctest", line 86, in align.doctest
Failed example:
    em_ibm1 = align.EMIBMModel1(corpus, 1e-3)
Exception raised:
    Traceback (most recent call last):
      File "/usr/local/Cellar/python/2.7.2/lib/python2.7/doctest.py", line 1254, in __run
        compileflags, 1) in test.globs
      File "<doctest align.doctest[19]>", line 1, in <module>
        em_ibm1 = align.EMIBMModel1(corpus, 1e-3)
    AttributeError: 'module' object has no attribute 'EMIBMModel1'

***************************************************************************
File "/Users/kmike/svn/nltk/nltk/test/align.doctest", line 87, in align.doctest
Failed example:
    iterations = em_ibm1.train()
Exception raised:
    Traceback (most recent call last):
      File "/usr/local/Cellar/python/2.7.2/lib/python2.7/doctest.py", line 1254, in __run
        compileflags, 1) in test.globs
      File "<doctest align.doctest[20]>", line 1, in <module>
        iterations = em_ibm1.train()
    NameError: name 'em_ibm1' is not defined

***************************************************************************
File "/Users/kmike/svn/nltk/nltk/test/align.doctest", line 88, in align.doctest
Failed example:
    print round(em_ibm1.probabilities['the', 'das'], 1)
Exception raised:
    Traceback (most recent call last):
      File "/usr/local/Cellar/python/2.7.2/lib/python2.7/doctest.py", line 1254, in __run
        compileflags, 1) in test.globs
      File "<doctest align.doctest[21]>", line 1, in <module>
        print round(em_ibm1.probabilities['the', 'das'], 1)
    NameError: name 'em_ibm1' is not defined

***************************************************************************
File "/Users/kmike/svn/nltk/nltk/test/align.doctest", line 90, in align.doctest
Failed example:
    print round(em_ibm1.probabilities['book', 'das'], 1)
Exception raised:
    Traceback (most recent call last):
      File "/usr/local/Cellar/python/2.7.2/lib/python2.7/doctest.py", line 1254, in __run
        compileflags, 1) in test.globs
      File "<doctest align.doctest[22]>", line 1, in <module>
        print round(em_ibm1.probabilities['book', 'das'], 1)
    NameError: name 'em_ibm1' is not defined

***************************************************************************
File "/Users/kmike/svn/nltk/nltk/test/align.doctest", line 92, in align.doctest
Failed example:
    print round(em_ibm1.probabilities['house', 'das'], 1)
Exception raised:
    Traceback (most recent call last):
      File "/usr/local/Cellar/python/2.7.2/lib/python2.7/doctest.py", line 1254, in __run
        compileflags, 1) in test.globs
      File "<doctest align.doctest[23]>", line 1, in <module>
        print round(em_ibm1.probabilities['house', 'das'], 1)
    NameError: name 'em_ibm1' is not defined

***************************************************************************
File "/Users/kmike/svn/nltk/nltk/test/align.doctest", line 94, in align.doctest
Failed example:
    print round(em_ibm1.probabilities['the', 'Buch'], 1)
Exception raised:
    Traceback (most recent call last):
      File "/usr/local/Cellar/python/2.7.2/lib/python2.7/doctest.py", line 1254, in __run
        compileflags, 1) in test.globs
      File "<doctest align.doctest[24]>", line 1, in <module>
        print round(em_ibm1.probabilities['the', 'Buch'], 1)
    NameError: name 'em_ibm1' is not defined

***************************************************************************
File "/Users/kmike/svn/nltk/nltk/test/align.doctest", line 96, in align.doctest
Failed example:
    print round(em_ibm1.probabilities['book', 'Buch'], 1)
Exception raised:
    Traceback (most recent call last):
      File "/usr/local/Cellar/python/2.7.2/lib/python2.7/doctest.py", line 1254, in __run
        compileflags, 1) in test.globs
      File "<doctest align.doctest[25]>", line 1, in <module>
        print round(em_ibm1.probabilities['book', 'Buch'], 1)
    NameError: name 'em_ibm1' is not defined

***************************************************************************
File "/Users/kmike/svn/nltk/nltk/test/align.doctest", line 98, in align.doctest
Failed example:
    print round(em_ibm1.probabilities['a', 'Buch'], 1)
Exception raised:
    Traceback (most recent call last):
      File "/usr/local/Cellar/python/2.7.2/lib/python2.7/doctest.py", line 1254, in __run
        compileflags, 1) in test.globs
      File "<doctest align.doctest[26]>", line 1, in <module>
        print round(em_ibm1.probabilities['a', 'Buch'], 1)
    NameError: name 'em_ibm1' is not defined

***************************************************************************
File "/Users/kmike/svn/nltk/nltk/test/align.doctest", line 100, in align.doctest
Failed example:
    print round(em_ibm1.probabilities['book', 'ein'], 1)
Exception raised:
    Traceback (most recent call last):
      File "/usr/local/Cellar/python/2.7.2/lib/python2.7/doctest.py", line 1254, in __run
        compileflags, 1) in test.globs
      File "<doctest align.doctest[27]>", line 1, in <module>
        print round(em_ibm1.probabilities['book', 'ein'], 1)
    NameError: name 'em_ibm1' is not defined

***************************************************************************
File "/Users/kmike/svn/nltk/nltk/test/align.doctest", line 102, in align.doctest
Failed example:
    print round(em_ibm1.probabilities['a', 'ein'], 1)
Exception raised:
    Traceback (most recent call last):
      File "/usr/local/Cellar/python/2.7.2/lib/python2.7/doctest.py", line 1254, in __run
        compileflags, 1) in test.globs
      File "<doctest align.doctest[28]>", line 1, in <module>
        print round(em_ibm1.probabilities['a', 'ein'], 1)
    NameError: name 'em_ibm1' is not defined

***************************************************************************
File "/Users/kmike/svn/nltk/nltk/test/align.doctest", line 104, in align.doctest
Failed example:
    print round(em_ibm1.probabilities['the', 'Haus'], 1)
Exception raised:
    Traceback (most recent call last):
      File "/usr/local/Cellar/python/2.7.2/lib/python2.7/doctest.py", line 1254, in __run
        compileflags, 1) in test.globs
      File "<doctest align.doctest[29]>", line 1, in <module>
        print round(em_ibm1.probabilities['the', 'Haus'], 1)
    NameError: name 'em_ibm1' is not defined

***************************************************************************
File "/Users/kmike/svn/nltk/nltk/test/align.doctest", line 106, in align.doctest
Failed example:
    print round(em_ibm1.probabilities['house', 'Haus'], 1)
Exception raised:
    Traceback (most recent call last):
      File "/usr/local/Cellar/python/2.7.2/lib/python2.7/doctest.py", line 1254, in __run
        compileflags, 1) in test.globs
      File "<doctest align.doctest[30]>", line 1, in <module>
        print round(em_ibm1.probabilities['house', 'Haus'], 1)
    NameError: name 'em_ibm1' is not defined

***************************************************************************
File "/Users/kmike/svn/nltk/nltk/test/align.doctest", line 112, in align.doctest
Failed example:
    em_ibm1.aligned() # doctest: +NORMALIZE_WHITESPACE
Exception raised:
    Traceback (most recent call last):
      File "/usr/local/Cellar/python/2.7.2/lib/python2.7/doctest.py", line 1254, in __run
        compileflags, 1) in test.globs
      File "<doctest align.doctest[31]>", line 1, in <module>
        em_ibm1.aligned() # doctest: +NORMALIZE_WHITESPACE
    NameError: name 'em_ibm1' is not defined
.
***************************************************************************
File "/Users/kmike/svn/nltk/nltk/test/chat80.doctest", line 205, in chat80.doctest
Failed example:
    trees = cp.nbest_parse(query.split())
Exception raised:
    Traceback (most recent call last):
      File "/usr/local/Cellar/python/2.7.2/lib/python2.7/doctest.py", line 1254, in __run
        compileflags, 1) in test.globs
      File "<doctest chat80.doctest[21]>", line 1, in <module>
        trees = cp.nbest_parse(query.split())
    AttributeError: 'FeatureGrammar' object has no attribute 'nbest_parse'

***************************************************************************
File "/Users/kmike/svn/nltk/nltk/test/chat80.doctest", line 206, in chat80.doctest
Failed example:
    answer = trees[0].node['SEM']
Exception raised:
    Traceback (most recent call last):
      File "/usr/local/Cellar/python/2.7.2/lib/python2.7/doctest.py", line 1254, in __run
        compileflags, 1) in test.globs
      File "<doctest chat80.doctest[22]>", line 1, in <module>
        answer = trees[0].node['SEM']
    NameError: name 'trees' is not defined

***************************************************************************
File "/Users/kmike/svn/nltk/nltk/test/chat80.doctest", line 207, in chat80.doctest
Failed example:
    q = join(answer)
Exception raised:
    Traceback (most recent call last):
      File "/usr/local/Cellar/python/2.7.2/lib/python2.7/doctest.py", line 1254, in __run
        compileflags, 1) in test.globs
      File "<doctest chat80.doctest[23]>", line 1, in <module>
        q = join(answer)
      File "/usr/local/Cellar/python/2.7.2/lib/python2.7/string.py", line 318, in join
        return sep.join(words)
    TypeError: sequence item 1: expected string, int found

***************************************************************************
File "/Users/kmike/svn/nltk/nltk/test/chat80.doctest", line 208, in chat80.doctest
Failed example:
    print q
Expected:
    SELECT City FROM city_table WHERE   Country="china"
Got:
    SELECT City, Population FROM city_table WHERE Country = 'china' and Population > 1000

***************************************************************************
File "/Users/kmike/svn/nltk/nltk/test/chat80.doctest", line 211, in chat80.doctest
Failed example:
    for r in rows: print "%s" % r,
Exception raised:
    Traceback (most recent call last):
      File "/usr/local/Cellar/python/2.7.2/lib/python2.7/doctest.py", line 1254, in __run
        compileflags, 1) in test.globs
      File "<doctest chat80.doctest[26]>", line 1, in <module>
        for r in rows: print "%s" % r,
    TypeError: not all arguments converted during string formatting
.
***************************************************************************
File "/Users/kmike/svn/nltk/nltk/test/chunk.doctest", line 99, in chunk.doctest
Failed example:
    chunkscore.recall()
Expected:
    0.33333333333333331
Got:
    0.3333333333333333

***************************************************************************
File "/Users/kmike/svn/nltk/nltk/test/chunk.doctest", line 102, in chunk.doctest
Failed example:
    chunkscore.f_measure()
Expected:
    0.40000000000000002
Got:
    0.4
........
***************************************************************************
File "/Users/kmike/svn/nltk/nltk/test/japanese.doctest", line 23, in japanese.doctest
Failed example:
    type(knbc.words()[0])
Expected:
    <type 'str'>
Got:
    <type 'unicode'>

***************************************************************************
File "/Users/kmike/svn/nltk/nltk/test/japanese.doctest", line 28, in japanese.doctest
Failed example:
    type(knbc.sents()[0][0])
Expected:
    <type 'str'>
Got:
    <type 'unicode'>

***************************************************************************
File "/Users/kmike/svn/nltk/nltk/test/japanese.doctest", line 49, in japanese.doctest
Failed example:
    type(jeita.tagged_words()[0][1])
Expected:
    <type 'str'>
Got:
    <type 'unicode'>
..
***************************************************************************
File "/Users/kmike/svn/nltk/nltk/test/metrics.doctest", line 22, in metrics.doctest
Failed example:
    accuracy(reference, test)
Expected:
    0.80000000000000004
Got:
    0.8

***************************************************************************
File "/Users/kmike/svn/nltk/nltk/test/metrics.doctest", line 32, in metrics.doctest
Failed example:
    recall(reference_set, test_set)
Expected:
    0.80000000000000004
Got:
    0.8

***************************************************************************
File "/Users/kmike/svn/nltk/nltk/test/metrics.doctest", line 34, in metrics.doctest
Failed example:
    f_measure(reference_set, test_set)
Expected:
    0.88888888888888884
Got:
    0.8888888888888888

***************************************************************************
File "/Users/kmike/svn/nltk/nltk/test/metrics.doctest", line 61, in metrics.doctest
Failed example:
    jaccard_distance(s1, s2)
Expected:
    0.59999999999999998
Got:
    0.6
..
***************************************************************************
File "/Users/kmike/svn/nltk/nltk/test/parse.doctest", line 552, in parse.doctest
Failed example:
    prod.prob()
Expected:
    0.29999999999999999
Got:
    0.3
.
***************************************************************************
File "/Users/kmike/svn/nltk/nltk/test/resolution.doctest", line 163, in resolution.doctest
Failed example:
    print tp.proof()
Expected:
    [1] {-mortal(Socrates)}     A
    [2] {-man(z2), mortal(z2)}  A
    [3] {man(Socrates)}         A
    [4] {-man(Socrates)}        (1, 2)
    [5] {mortal(Socrates)}      (2, 3)
    [6] {}                      (1, 5)
    <BLANKLINE>
Got:
    [1] {-mortal(Socrates)}     A 
    [2] {-man(z2), mortal(z2)}  A 
    [3] {man(Socrates)}         A 
    [4] {-man(Socrates)}        (1, 2) 
    [5] {mortal(Socrates)}      (2, 3) 
    [6] {}                      (1, 5) 
    <BLANKLINE>

***************************************************************************
File "/Users/kmike/svn/nltk/nltk/test/resolution.doctest", line 184, in resolution.doctest
Failed example:
    print tp.proof()
Expected:
    [1] {father_of(art,john)}                  A
    [2] {father_of(bob,kim)}                   A
    [3] {-father_of(z4,z3), parent_of(z4,z3)}  A
    [4] {-parent_of(z6,john), ANSWER(z6)}      A
    [5] {parent_of(art,john)}                  (1, 3)
    [6] {parent_of(bob,kim)}                   (2, 3)
    [7] {ANSWER(z6), -father_of(z6,john)}      (3, 4)
    [8] {ANSWER(art)}                          (1, 7)
    [9] {ANSWER(art)}                          (4, 5)
    <BLANKLINE>
Got:
    [1] {father_of(art,john)}                  A 
    [2] {father_of(bob,kim)}                   A 
    [3] {-father_of(z4,z3), parent_of(z4,z3)}  A 
    [4] {-parent_of(z6,john), ANSWER(z6)}      A 
    [5] {parent_of(art,john)}                  (1, 3) 
    [6] {parent_of(bob,kim)}                   (2, 3) 
    [7] {ANSWER(z6), -father_of(z6,john)}      (3, 4) 
    [8] {ANSWER(art)}                          (1, 7) 
    [9] {ANSWER(art)}                          (4, 5) 
    <BLANKLINE>

***************************************************************************
File "/Users/kmike/svn/nltk/nltk/test/resolution.doctest", line 206, in resolution.doctest
Failed example:
    print tp.proof()
Expected:
    [ 1] {father_of(art,john)}                  A
    [ 2] {mother_of(ann,john)}                  A
    [ 3] {-father_of(z4,z3), parent_of(z4,z3)}  A
    [ 4] {-mother_of(z8,z7), parent_of(z8,z7)}  A
    [ 5] {-parent_of(z10,john), ANSWER(z10)}    A
    [ 6] {parent_of(art,john)}                  (1, 3)
    [ 7] {parent_of(ann,john)}                  (2, 4)
    [ 8] {ANSWER(z10), -father_of(z10,john)}    (3, 5)
    [ 9] {ANSWER(art)}                          (1, 8)
    [10] {ANSWER(z10), -mother_of(z10,john)}    (4, 5)
    [11] {ANSWER(ann)}                          (2, 10)
    [12] {ANSWER(art)}                          (5, 6)
    [13] {ANSWER(ann)}                          (5, 7)
    <BLANKLINE>
Got:
    [ 1] {father_of(art,john)}                  A 
    [ 2] {mother_of(ann,john)}                  A 
    [ 3] {-father_of(z4,z3), parent_of(z4,z3)}  A 
    [ 4] {-mother_of(z8,z7), parent_of(z8,z7)}  A 
    [ 5] {-parent_of(z10,john), ANSWER(z10)}    A 
    [ 6] {parent_of(art,john)}                  (1, 3) 
    [ 7] {parent_of(ann,john)}                  (2, 4) 
    [ 8] {ANSWER(z10), -father_of(z10,john)}    (3, 5) 
    [ 9] {ANSWER(art)}                          (1, 8) 
    [10] {ANSWER(z10), -mother_of(z10,john)}    (4, 5) 
    [11] {ANSWER(ann)}                          (2, 10) 
    [12] {ANSWER(art)}                          (5, 6) 
    [13] {ANSWER(ann)}                          (5, 7) 
    <BLANKLINE>
.
***************************************************************************
File "/Users/kmike/svn/nltk/nltk/test/semantics.doctest", line 663, in semantics.doctest
Failed example:
    print cs_semrep.core
Expected:
    chase(z2,z4)
Got:
    chase(z2,z3)

***************************************************************************
File "/Users/kmike/svn/nltk/nltk/test/semantics.doctest", line 665, in semantics.doctest
Failed example:
    for bo in cs_semrep.store:
        print bo
Expected:
    bo(\P.all x.(girl(x) -> P(x)),z2)
    bo(\P.exists x.(dog(x) & P(x)),z4)
Got:
    bo(\P.all x.(girl(x) -> P(x)),z2)
    bo(\P.exists x.(dog(x) & P(x)),z3)

***************************************************************************
File "/Users/kmike/svn/nltk/nltk/test/semantics.doctest", line 669, in semantics.doctest
Failed example:
    cs_semrep.s_retrieve(trace=True)
Expected:
    Permutation 1
       (\P.all x.(girl(x) -> P(x)))(\z2.chase(z2,z4))
       (\P.exists x.(dog(x) & P(x)))(\z4.all x.(girl(x) -> chase(x,z4)))
    Permutation 2
       (\P.exists x.(dog(x) & P(x)))(\z4.chase(z2,z4))
       (\P.all x.(girl(x) -> P(x)))(\z2.exists x.(dog(x) & chase(z2,x)))
Got:
    Permutation 1
       (\P.all x.(girl(x) -> P(x)))(\z2.chase(z2,z3))
       (\P.exists x.(dog(x) & P(x)))(\z3.all x.(girl(x) -> chase(x,z3)))
    Permutation 2
       (\P.exists x.(dog(x) & P(x)))(\z3.chase(z2,z3))
       (\P.all x.(girl(x) -> P(x)))(\z2.exists x.(dog(x) & chase(z2,x)))

***************************************************************************
File "/Users/kmike/svn/nltk/nltk/test/semantics.doctest", line 677, in semantics.doctest
Failed example:
    for reading in cs_semrep.readings:
        print reading
Expected:
    exists x.(dog(x) & all z3.(girl(z3) -> chase(z3,x)))
    all x.(girl(x) -> exists z4.(dog(z4) & chase(x,z4)))
Got:
    exists x.(dog(x) & all z13.(girl(z13) -> chase(z13,x)))
    all x.(girl(x) -> exists z14.(dog(z14) & chase(x,z14)))
.
***************************************************************************
File "/Users/kmike/svn/nltk/nltk/test/simple.doctest", line 33, in simple.doctest
Failed example:
    recall(reference_set, test_set)
Expected:
    0.80000000000000004
Got:
    0.8

***************************************************************************
File "/Users/kmike/svn/nltk/nltk/test/simple.doctest", line 35, in simple.doctest
Failed example:
    f_measure(reference_set, test_set)
Expected:
    0.88888888888888884
Got:
    0.8888888888888888
.
***************************************************************************
File "/Users/kmike/svn/nltk/nltk/test/sourcedstring.doctest", line 1246, in sourcedstring.doctest
Failed example:
    re.sub('better', 'worse', sent)
Expected:
    'I got worse.'
Got:
    'I got worse.'@[27:33,27:27,...,27:27,39:40]

***************************************************************************
File "/Users/kmike/svn/nltk/nltk/test/sourcedstring.doctest", line 1254, in sourcedstring.doctest
Failed example:
    re.sub('better', 'worse', sent)
Expected:
    'I got worse.'
Got:
    'I got worse.'@[27:33,27:27,...,27:27,39:40]

***************************************************************************
File "/Users/kmike/svn/nltk/nltk/test/sourcedstring.doctest", line 1448, in sourcedstring.doctest
Failed example:
    x.rpartition(y)
Expected:
    ('ascii byte string'@[0:17], ''@[17:17], ''@[17:17])
Got:
    (''@[0:0], ''@[0:0], 'ascii byte string'@[0:17])
...
***************************************************************************
File "/Users/kmike/svn/nltk/nltk/test/tokenize.doctest", line 20, in tokenize.doctest
Failed example:
    print regexp_tokenize(s2, r'[,\.\?!"]\s*', gaps=True)
Expected:
    ['Alas', 'it has not rained today', 'When', 'do you think',
     'will it rain again']
Got:
    ['Alas', 'it has not rained today', 'When', 'do you think', 'will it rain again']

***************************************************************************
File "/Users/kmike/svn/nltk/nltk/test/tokenize.doctest", line 28, in tokenize.doctest
Failed example:
    print regexp_tokenize(s3, r'</?(b|p)>', gaps=True)
Expected:
    ['Although this is ', 'not',
     ' the case here, we must not relax our vigilance!']
Got:
    ['Although this is ', 'not', ' the case here, we must not relax our vigilance!']

***************************************************************************
File "/Users/kmike/svn/nltk/nltk/test/tokenize.doctest", line 36, in tokenize.doctest
Failed example:
    print regexp_tokenize(s3, r'</?(?P<named>b|p)>', gaps=True)
Expected:
    ['Although this is ', 'not',
     ' the case here, we must not relax our vigilance!']
Got:
    ['Although this is ', 'not', ' the case here, we must not relax our vigilance!']

***************************************************************************
File "/Users/kmike/svn/nltk/nltk/test/tokenize.doctest", line 44, in tokenize.doctest
Failed example:
    print regexp_tokenize(s2, r'(h|r|l)a(s|(i|n0))', gaps=True)
Expected:
    ['A', ', it ', ' not ', 'ned today. When, do you think, will it ',
     'n again?']
Got:
    ['A', ', it ', ' not ', 'ned today. When, do you think, will it ', 'n again?']

***************************************************************************
File "/Users/kmike/svn/nltk/nltk/test/tokenize.doctest", line 50, in tokenize.doctest
Failed example:
    print regexp_tokenize(s2, r'(.)\1')
Expected:
    Traceback (most recent call last):
       ...
    ValueError: Regular expressions with back-references are
    not supported: '(.)\\1'
Got:
    Traceback (most recent call last):
      File "/usr/local/Cellar/python/2.7.2/lib/python2.7/doctest.py", line 1254, in __run
        compileflags, 1) in test.globs
      File "<doctest tokenize.doctest[12]>", line 1, in <module>
        print regexp_tokenize(s2, r'(.)\1')
      File "/Users/kmike/svn/nltk/nltk/tokenize/regexp.py", line 192, in regexp_tokenize
        tokenizer = RegexpTokenizer(pattern, gaps, discard_empty, flags)
      File "/Users/kmike/svn/nltk/nltk/tokenize/regexp.py", line 112, in __init__
        nongrouping_pattern = convert_regexp_to_nongrouping(pattern)
      File "/Users/kmike/svn/nltk/nltk/internals.py", line 44, in convert_regexp_to_nongrouping
        'are not supported: %r' % pattern)
    ValueError: Regular expressions with back-references are not supported: '(.)\\1'

***************************************************************************
File "/Users/kmike/svn/nltk/nltk/test/tokenize.doctest", line 55, in tokenize.doctest
Failed example:
    print regexp_tokenize(s2, r'(?P<foo>)(?P=foo)')
Expected:
    Traceback (most recent call last):
       ...
    ValueError: Regular expressions with back-references are
    not supported: '(?P<foo>)(?P=foo)'
Got:
    Traceback (most recent call last):
      File "/usr/local/Cellar/python/2.7.2/lib/python2.7/doctest.py", line 1254, in __run
        compileflags, 1) in test.globs
      File "<doctest tokenize.doctest[13]>", line 1, in <module>
        print regexp_tokenize(s2, r'(?P<foo>)(?P=foo)')
      File "/Users/kmike/svn/nltk/nltk/tokenize/regexp.py", line 192, in regexp_tokenize
        tokenizer = RegexpTokenizer(pattern, gaps, discard_empty, flags)
      File "/Users/kmike/svn/nltk/nltk/tokenize/regexp.py", line 112, in __init__
        nongrouping_pattern = convert_regexp_to_nongrouping(pattern)
      File "/Users/kmike/svn/nltk/nltk/internals.py", line 44, in convert_regexp_to_nongrouping
        'are not supported: %r' % pattern)
    ValueError: Regular expressions with back-references are not supported: '(?P<foo>)(?P=foo)'

***************************************************************************
File "/Users/kmike/svn/nltk/nltk/test/tokenize.doctest", line 63, in tokenize.doctest
Failed example:
    print regexp_tokenize(s, pattern=r'\.(\s+|$)', gaps=True)
Expected:
    ['Good muffins cost $3.88\nin New York',
     'Please buy me\ntwo of them', 'Thanks']
Got:
    ['Good muffins cost $3.88\nin New York', 'Please buy me\ntwo of them', 'Thanks']
..
***************************************************************************
File "/Users/kmike/svn/nltk/nltk/test/tree.doctest", line 16, in tree.doctest
Failed example:
    print tree
Expected nothing
Got:
    (s (dp (d the) (np dog)) (vp (v chased) (dp (d the) (np cat))))

***************************************************************************
File "/Users/kmike/svn/nltk/nltk/test/tree.doctest", line 24, in tree.doctest
Failed example:
    print tree[path]
Exception raised:
    Traceback (most recent call last):
      File "/usr/local/Cellar/python/2.7.2/lib/python2.7/doctest.py", line 1254, in __run
        compileflags, 1) in test.globs
      File "<doctest tree.doctest[7]>", line 1, in <module>
        print tree[path]
    NameError: name 'path' is not defined

***************************************************************************
File "/Users/kmike/svn/nltk/nltk/test/tree.doctest", line 37, in tree.doctest
Failed example:
    print tree.pprint_latex_qtree()
Expected:
    \Tree [.s
            [.np [.d THE ] [.np DOG ] ]
            [.vp [.v CHASED ] [.np [.d THE ] [.np CAT ] ] ] ]
Got:
    \Tree [.s
            [.dp [.d the ] [.np dog ] ]
            [.vp [.v chased ] [.dp [.d the ] [.np cat ] ] ] ]

***************************************************************************
File "/Users/kmike/svn/nltk/nltk/test/tree.doctest", line 692, in tree.doctest
Failed example:
    ptree[0,0].remove(make_ptree('(Q p)'))
Expected:
    Traceback (most recent call last):
      . . .
    ValueError: list.index(x): x not in list
Got:
    Traceback (most recent call last):
      File "/usr/local/Cellar/python/2.7.2/lib/python2.7/doctest.py", line 1254, in __run
        compileflags, 1) in test.globs
      File "<doctest tree.doctest[155]>", line 1, in <module>
        ptree[0,0].remove(make_ptree('(Q p)'))
      File "/Users/kmike/svn/nltk/nltk/tree.py", line 986, in remove
        index = self.index(child)
    ValueError: ParentedTree('Q', ['p']) is not in list

***************************************************************************
File "/Users/kmike/svn/nltk/nltk/test/tree.doctest", line 698, in tree.doctest
Failed example:
    ptree.remove('h');
Expected:
    Traceback (most recent call last):
      . . .
    ValueError: list.index(x): x not in list
Got:
    Traceback (most recent call last):
      File "/usr/local/Cellar/python/2.7.2/lib/python2.7/doctest.py", line 1254, in __run
        compileflags, 1) in test.globs
      File "<doctest tree.doctest[157]>", line 1, in <module>
        ptree.remove('h');
      File "/Users/kmike/svn/nltk/nltk/tree.py", line 986, in remove
        index = self.index(child)
    ValueError: 'h' is not in list

***************************************************************************
File "/Users/kmike/svn/nltk/nltk/test/tree.doctest", line 994, in tree.doctest
Failed example:
    mptree[0,0].remove(make_mptree('(Q p)'))
Expected:
    Traceback (most recent call last):
      . . .
    ValueError: list.index(x): x not in list
Got:
    Traceback (most recent call last):
      File "/usr/local/Cellar/python/2.7.2/lib/python2.7/doctest.py", line 1254, in __run
        compileflags, 1) in test.globs
      File "<doctest tree.doctest[234]>", line 1, in <module>
        mptree[0,0].remove(make_mptree('(Q p)'))
      File "/Users/kmike/svn/nltk/nltk/tree.py", line 986, in remove
        index = self.index(child)
    ValueError: MultiParentedTree('Q', ['p']) is not in list

***************************************************************************
File "/Users/kmike/svn/nltk/nltk/test/tree.doctest", line 1000, in tree.doctest
Failed example:
    mptree.remove('h');
Expected:
    Traceback (most recent call last):
      . . .
    ValueError: list.index(x): x not in list
Got:
    Traceback (most recent call last):
      File "/usr/local/Cellar/python/2.7.2/lib/python2.7/doctest.py", line 1254, in __run
        compileflags, 1) in test.globs
      File "<doctest tree.doctest[236]>", line 1, in <module>
        mptree.remove('h');
      File "/Users/kmike/svn/nltk/nltk/tree.py", line 986, in remove
        index = self.index(child)
    ValueError: 'h' is not in list
....
A test failed unexpectedly. Please report this error
to the nltk-dev mailinglist.
[TOX] ERROR: InvocationError: '../../.tox/py27/bin/python testrunner.py'
________________________________ [tox summary] _________________________________
[TOX] ERROR: py27: commands failed

Add "update" command to the downloader

Add "update" command to the downloader

Migrated from http://code.google.com/p/nltk/issues/detail?id=95


earlier comments

nmadnani said, at 2010-09-20T11:57:31.000Z:

I took a stab at adding an update function to the downloader. It only updates out of date/stale packages. Can people test and let me know whether this works fine so I can commit?

StevenBird1 said, at 2010-09-30T08:17:34.000Z:

Thanks for this (NB there have been other changes to downloader.py since the revision on which this one is based, so it will be necessary to merge the changes).

I updated cmudict.zip, then ran the update option via the download_shell. I saw a deprecation warning (object.__new__() takes no parameters), then the message: "Nothing to update", even though I have a stale version of the cmudict data installed.

(By the way, the command line interface should support an update switch.)

nmadnani said, at 2010-09-30T13:05:50.000Z:

I updated downloader.py to the latest version and merged in my changes. Although I do get the deprecation warning, I get it even with the checked-in version (when I do a 'd' and then an 'l') and only with Python 2.6. I think we have handled this issue before (see issue 390). I don't get any such warnings with Python 2.7.

Other than the deprecation warning though, I am able to update cmudict just fine. See below:

[nmadnani@hadhafang nltk-trunk] python2.6
Python 2.6.4 (r264:75706, Feb 25 2010, 00:44:18)
[GCC 4.2.1 (Apple Inc. build 5646) (dot 1)] on darwin
Type "help", "copyright", "credits" or "license" for more information.

>>> import nltk
>>> nltk.download_shell()
NLTK Downloader


d) Download   l) List    u) Update   c) Config   h) Help   q) Quit

Downloader> u
nltk/__init__.py:588: DeprecationWarning: object.__new__() takes no parameters

Will update following packages (o=ok; x=cancel)
[ ] cmudict............. The Carnegie Mellon Pronouncing Dictionary (0.6)

Identifier> o
Downloading package 'cmudict' to /Users/nmadnani/nltk_data...
Unzipping corpora/cmudict.zip.


d) Download   l) List    u) Update   c) Config   h) Help   q) Quit

Downloader>

nmadnani said, at 2011-03-28T04:24:19.000Z:

This still works fine for me. Should we work on this more?

StevenBird1 said, at 2011-10-10T22:17:27.000Z:

Please go ahead and commit your changes. No problem about the deprecation warning.
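
A rough sketch of what such an update step amounts to, assuming the Downloader API's packages(), status() and STALE constant (this is an illustration, not the committed patch):

    from nltk.downloader import Downloader

    def update_stale_packages():
        """Re-download any installed package whose local copy is out of date."""
        d = Downloader()
        stale = [pkg for pkg in d.packages() if d.status(pkg) == d.STALE]
        if not stale:
            print("Nothing to update")
            return
        for pkg in stale:
            print("Updating %s ..." % pkg.id)
            d.download(pkg.id)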

Off-the-shelf models

Add off-the-shelf machine learning models for a variety of NLP tasks, trained on a large corpus and achieving reasonably good performance. E.g.:

  • POS Tagging
  • Tokenization
  • Parsing?
  • SRL
  • Chunking
  • NE detection

Migrated from http://code.google.com/p/nltk/issues/detail?id=210


earlier comments

edloper said, at 2009-02-08T04:26:45.000Z:

Status update:

  • Tagger & tokenizer: done and checked in.
  • NE chunker: done but not checked in.
  • NP chunker: still training the model.

ewan.klein said, at 2009-02-11T09:49:43.000Z:

Is there a target date for checking in the NE chunker?

StevenBird1 said, at 2009-02-11T10:51:11.000Z:

please see r7581

StevenBird1 said, at 2009-02-21T01:44:59.000Z:

Carried forward to 0.9.9

StevenBird1 said, at 2010-03-02T08:48:19.000Z:

The code for off-the-shelf models should live in the source repository, so it can be inspected and updated. E.g. how was the maxent tagger model used by NLTK's default POS tagger (nltk.pos_tag) created?
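
For reference, the tagger and NE chunker discussed here are exposed through top-level calls; a minimal usage sketch (the corresponding nltk_data model packages must be downloaded first):

    import nltk

    tokens = nltk.word_tokenize("The tagger and chunker ship as off-the-shelf models.")
    tagged = nltk.pos_tag(tokens)    # default (maxent) POS tagger model
    print(nltk.ne_chunk(tagged))     # default NE chunker model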

character encoding problem in UDHR corpus

For two of the UDHR files, doing list(udhr.words(f)) produces the following decoding errors:

Arabic_Alarabia-Arabic 'charmap' codec can't decode byte 0xde in position 18: character maps to <undefined>
Czech_Cesky-UTF8 'utf8' codec can't decode byte 0x8a in position 1: unexpected code byte
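
A quick reproduction sketch for the two files (the udhr data package must be installed):

    from nltk.corpus import udhr

    for f in ['Arabic_Alarabia-Arabic', 'Czech_Cesky-UTF8']:
        try:
            print('%s: %d words' % (f, len(list(udhr.words(f)))))
        except UnicodeDecodeError as err:
            print('%s: %s' % (f, err))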

Migrated from http://code.google.com/p/nltk/issues/detail?id=235

nltk_data python package

Create a new Python package trunk/nltk/nltk_data, containing modules for
building the models that get distributed from trunk/nltk_data. This will
make it possible to rebuild models when the associated code is changed.
Users will also be able to see how the models were trained, and be able to
modify the parameters if necessary.

Migrated from http://code.google.com/p/nltk/issues/detail?id=301


earlier comments

StevenBird1 said, at 2009-02-20T07:25:23.000Z:

I've done this for a new tagsets package.

edloper said, at 2009-02-20T18:22:40.000Z:

This seems like a good approach; I'll migrate the code that trains some of the existing models to this directory.

Some of that code depends on having access to corpora that aren't distributed with
NLTK. E.g., currently named_entity.py has the very ugly path
"/tmp/ace.old/data/ace.dev/text" hard-coded into it. I propose that we use
"/usr/share/corpora" as a standard location for corpora that are needed to train the
models, but which we don't distribute with them (such as the ACE corpus, or the
complete treebank).

StevenBird1 said, at 2009-02-20T21:48:34.000Z:

"/usr/share/corpora" as the standard place for non-distributed corpora sounds fine.

On the output side, we might want to package up the output files, and have a Makefile
that puts them in the right place in the working copy (cf the "publish" directives in
doc/Makefile).
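
A minimal sketch of how a model-training module could pick up the proposed standard location (the environment-variable override and the ACE subpath are illustrative, not an existing convention):

    import os

    CORPORA_ROOT = os.environ.get('NLTK_TRAIN_CORPORA', '/usr/share/corpora')
    ACE_DIR = os.path.join(CORPORA_ROOT, 'ace', 'data', 'ace.dev', 'text')

    if not os.path.isdir(ACE_DIR):
        raise SystemExit('training corpus not found: %s' % ACE_DIR)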

documentation for nltk.metrics

The nltk.metrics package needs a corresponding doctest file (test/metrics.doctest).
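
A sketch of the kind of entries such a file could contain; the expected values follow from the definitions of the metrics (and agree with the metrics.doctest results in the test run above):

    >>> from nltk.metrics import accuracy, precision, recall
    >>> reference = 'DET NN VB DET JJ NN NN IN DET NN'.split()
    >>> test      = 'DET VB VB DET NN NN NN IN DET NN'.split()
    >>> accuracy(reference, test)
    0.8
    >>> reference_set, test_set = set(reference), set(test)
    >>> precision(reference_set, test_set)
    1.0
    >>> recall(reference_set, test_set)
    0.8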

Migrated from http://code.google.com/p/nltk/issues/detail?id=241


earlier comments

StevenBird1 said, at 2009-01-30T01:15:50.000Z:

Tom - the agreement package is now incorporated into the nltk.metrics package. Would you please add some doctests to nltk/test/metrics.doctest?

StevenBird1 said, at 2009-02-21T01:44:59.000Z:

Carried forward to 0.9.9

Read NE columns off of CoNLL-style files

It'd be nice to be able to read the ne columns off of CoNLL-style NER files without using the raw()
method.

I've attached a patch that allows this by allowing more than word/tag/IOB columns to be returned
from iob_sents() and iob_words().

E.g. c.iob_words(None, (c.WORDS, c.POS, c.CHUNK, c.NE)) would now return [...,('Electric', 'NNP', 'I-NP', 'I-ORG'),...] but c.iob_words() returns [...,('Electric', 'NNP', 'I-NP'),...] as before.
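
A hedged usage sketch with the attached patch applied (the root directory and file name below are placeholders; the column constants come from ConllCorpusReader):

    from nltk.corpus.reader.conll import ConllCorpusReader

    c = ConllCorpusReader('.', ['ner.txt'],
                          columntypes=('words', 'pos', 'chunk', 'ne'))
    print(c.iob_words()[:3])                                        # word/POS/chunk, as before
    print(c.iob_words(None, (c.WORDS, c.POS, c.CHUNK, c.NE))[:3])   # with the NE column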

Migrated from http://code.google.com/p/nltk/issues/detail?id=338

Brill Tagger on third-party corpora

Originally reported by starkman (sourceforge.net user: starkmanuk) on
2008-03-09

Hello all, I am having a bit of trouble whilst trying to use the Brill tagger. Below is my code:

    from nltk.corpus import treebank
    from nltk import tag
    from nltk.tag import brill
    from nltk.corpus import reader
    from nltk.corpus.reader import TaggedCorpusReader

    root = 'C:\lob'
    reader = TaggedCorpusReader(root, 'a.txt', sep='/')
    tagged_data = reader.tagged_sents()
    nn_cd_tagger = tag.RegexpTagger([(r'^-?[0-9]+(.[0-9]+)?$', 'CD'),
                                     (r'.*', 'NN')])

    # train is the proportion of data used in training; the rest is reserved
    # for testing.
    print "Loading tagged data... "
    cutoff = int(num_sents*train)
    training_data = tagged_data[:cutoff]
    gold_data = tagged_data[cutoff:num_sents]
    testing_data = [[t[0] for t in sent] for sent in gold_data]
    print "Done loading."

    # Start Unigram tagger
    print "Training unigram tagger:"
    unigram_tagger = tag.UnigramTagger(training_data,
                                       backoff=nn_cd_tagger)
    if gold_data:
        print " [accuracy: %f]" % tag.accuracy(unigram_tagger, gold_data)

    # Start Bigram tagger
    print "Training bigram tagger:"
    bigram_tagger = tag.BigramTagger(training_data,
                                     backoff=unigram_tagger)
    if gold_data:
        print " [accuracy: %f]" % tag.accuracy(bigram_tagger, gold_data)

    # Brill tagger
    templates = [
        brill.SymmetricProximateTokensTemplate(brill.ProximateTagsRule, (1,1)),
        brill.SymmetricProximateTokensTemplate(brill.ProximateTagsRule, (2,2)),
        brill.SymmetricProximateTokensTemplate(brill.ProximateTagsRule, (1,2)),
        brill.SymmetricProximateTokensTemplate(brill.ProximateTagsRule, (1,3)),
        brill.SymmetricProximateTokensTemplate(brill.ProximateWordsRule, (1,1)),
        brill.SymmetricProximateTokensTemplate(brill.ProximateWordsRule, (2,2)),
        brill.SymmetricProximateTokensTemplate(brill.ProximateWordsRule, (1,2)),
        brill.SymmetricProximateTokensTemplate(brill.ProximateWordsRule, (1,3)),
        brill.ProximateTokensTemplate(brill.ProximateTagsRule, (-1, -1), (1,1)),
        brill.ProximateTokensTemplate(brill.ProximateWordsRule, (-1, -1), (1,1)),
        ]
    trainer = brill.FastBrillTaggerTrainer(bigram_tagger, templates, trace)
    # trainer = brill.BrillTaggerTrainer(u, templates, trace)
    brill_tagger = trainer.train(training_data, max_rules, min_score)

It is of course a modification of the example Brill tagger from the API. I receive an error when it comes to computing the last line, brill_tagger = trainer.train(training_data, max_rules, min_score). This is the error:

    Traceback (most recent call last):
      File "<pyshell#1>", line 1, in <module>
        brilltagger()
      File "<pyshell#0>", line 80, in brilltagger
        brill_tagger = trainer.train(training_data, max_rules, min_score)
      File "C:\Python25\Lib\site-packages\nltk\tag\brill.py", line 869, in train
        rule = self._best_rule(train_sents, test_sents, min_score)
      File "C:\Python25\Lib\site-packages\nltk\tag\brill.py", line 1008, in _best_rule
        max_score = max(self._rules_by_score)
    ValueError: max() arg is an empty sequence

The error I think I can decipher (i.e. max() is given an empty sequence), but I am unsure as to why this is occurring. I am using sections from the LOB corpus; the part of the corpus has been modified so that NLTK can read the words and their associated tags. This seems to work, and I can print out both the word and its tag correctly, as with the other predefined corpora bundled with NLTK. Is it possible that the text I am passing to the Brill tagger simply yields no rules?

Kind Regards,
David

Migrated from http://code.google.com/p/nltk/issues/detail?id=67


earlier comments

paulbone.au said, at 2008-11-06T07:21:14.000Z:

David/starkman,

If you have a copy of the LOB corpus, particularly the a.txt file, could you provide it so I can test this against it?

Thanks.

paulbone.au said, at 2008-11-06T07:33:20.000Z:

I've committed a fix for this but am unable to test it until I have a failing test case. I'll leave the bug open.

StevenBird1 said, at 2009-01-08T23:35:27.000Z:

Wrote to starkmanuk to request data.

Error in Brill tagger docstrings

The docstring for nltk.tag.brill.SymmetricProximateTokensTemplate is not correct:

One C{ProximateTokensTemplate} is formed by passing in the
same arguments given to this class's constructor: tuples
representing intervals in which a tag may be found.  The other
C{ProximateTokensTemplate} is constructed with the negative
of all the arguments in reversed order.  For example, a
C{SymmetricProximateTokensTemplate} using the pair (-2,-1) and the
constructor C{SymmetricProximateTokensTemplate} generates the same rules as a
C{SymmetricProximateTokensTemplate} using (-2,-1) plus a second
C{SymmetricProximateTokensTemplate} using (1,2).

I think some occurrences of SymmetricProximateTokensTemplate in the text above should be
replaced, perhaps by ProximateTokensTemplate. Otherwise I don't understand it.

Also, the docstrings of ProximateTagsRule and ProximateWordsRule refer to SymmetricProximateTokensTemplate, but I think they should be referring to ProximateTokensTemplate.

Migrated from http://code.google.com/p/nltk/issues/detail?id=357

Add tracing to dependency parsers

It would be helpful, and conformant with other NLTK parsers, to add a
tracing facility to the projective and non-projective dependency parsers.

Migrated from http://code.google.com/p/nltk/issues/detail?id=148

tkinter problems

From issue 201, reported by kostik.vento:

While cfd.plot(cumulative=True) works fine, cfd.plot() takes all the processor's time (more exactly: the graph appears, but I can't resize it, nor even minimize it; the window hangs, and the only thing that helps is to forcibly close the graph window, which restarts the Python Shell). A screenshot of the problem is attached as cfd.plot.jpg.

But when I try the same code after Python Shell's restart, cfd.plot() works
correctly, so that I can resize the plot's window or minimize it and so on.
I use Python 2.5 on WinXP SP2, if it matters.

Migrated from http://code.google.com/p/nltk/issues/detail?id=250

TreebankWordTokenizer with unicode characters

It seems there is a problem in the TreebankWordTokenizer when it comes to unicode characters, as it splits any non-English word at the non-ASCII characters. Here is an example.
Statement
In Düsseldorf I took my hat off. But I can't put it back on.

NLTK
In D ü sseldorf I took my hat off .
But I ca n't put it back on .

Stanford (PennTreebank) PTBLexer
[In, Düsseldorf, I, took, my, hat, off, .]
[But, I, ca, n't, put, it, back, on, .]

http://www.cis.upenn.edu/~treebank/tokenizer.sed
In Düsseldorf I took my hat off .
But I ca n't put it back on .
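
A reproduction sketch of the behaviour reported above (Python 2-era NLTK; later versions may tokenize this correctly):

    # -*- coding: utf-8 -*-
    from nltk.tokenize import TreebankWordTokenizer

    tokenizer = TreebankWordTokenizer()
    for sent in [u"In Düsseldorf I took my hat off.", u"But I can't put it back on."]:
        print(tokenizer.tokenize(sent))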

Brill tagger API

It is hard to do anything more with the Brill tagger than run its demo. It
needs a cleaner API.

Migrated from http://code.google.com/p/nltk/issues/detail?id=80

Alter synset.tree to handle potential cycles.

In nltk.corpus.wordnet the Synset class implements a method called tree
that can take a function that helps generate a tree given a synset.

This is used by the WordNet Browser to create hypernym inheritance trees,
however the wordnet data may contain cycles. Cycles may also be present if
a user passes some function that may result in cycles.

Migrated from http://code.google.com/p/nltk/issues/detail?id=257


earlier comments

StevenBird1 said, at 2009-02-05T01:47:07.000Z:

cf issue 99

steven.bethard said, at 2011-04-04T09:14:38.000Z:

Can you give an example of where this is failing? For example, I know there's a restrain.v.01 -> inhibit.v.04 -> restrain.v.01 cycle in the online Wordnet 3.0, but I don't see any problem currently with that:

    >>> from nltk.corpus import wordnet as wn
    >>> wn.synset('restrain.v.01').tree(lambda s: s.hypernyms())
    [Synset('restrain.v.01'), [Synset('inhibit.v.04'), [Synset('suppress.v.04'), [Synset('forget.v.01')]]]]

I also tried running through all the WordNet synsets, but it didn't seem to get stuck in an infinite loop or anything:

    >>> for s in wn.all_synsets():
    ...     s.tree(lambda s: s.hypernyms())

Perhaps this was fixed when r6952 was reverted?
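
One possible shape for a cycle-safe traversal, sketched for illustration only (it cuts off a branch as soon as a synset repeats on the current path; it is not NLTK's implementation):

    from nltk.corpus import wordnet as wn

    def safe_tree(synset, rel, seen=frozenset()):
        tree = [synset]
        if synset not in seen:
            tree += [safe_tree(child, rel, seen | {synset}) for child in rel(synset)]
        return tree

    print(safe_tree(wn.synset('restrain.v.01'), lambda s: s.hypernyms()))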

third-party download locations

nltk.download and package indexing should support third-party download
locations, including nltk.ldc.upenn.edu.

This needs to be mentioned in the book as a place where people can upload
their own datasets.

Is this something we can commit to?

Migrated from http://code.google.com/p/nltk/issues/detail?id=318


earlier comments

StevenBird1 said, at 2009-05-03T10:16:27.000Z:

This is mentioned in the book, so we need to do it.

StevenBird1 said, at 2009-07-05T22:39:36.000Z:

This was deleted from the book at the last minute. However, this is the only scalable way to distribute corpora.

Note that the recent addition of NomBank 1.0 copied the existing distribution into
the NLTK repository unchanged. It might have been better just to point to an
externally hosted distribution. The XML files defining corpus collections could
contain URLs in addition to local identifiers. E.g.:

<item ref="nombank.1.0"/>

would become:

<item ref="http://nlp.cs.nyu.edu/meyers/nombank/nombank.1.0/nombank.1.0"/>

The downloader would need to skip over any corpora that couldn't be accessed for
whatever reason.

nmadnani said, at 2009-07-16T01:36:11.000Z:

I would be happy to take a stab at resolving this issue if someone isn't already.

StevenBird1 said, at 2009-07-16T01:48:09.000Z:

We should also support externally hosted collections, e.g.: http://nltk.ldc.upenn.edu/index.xml

StevenBird1 said, at 2009-07-16T01:57:53.000Z:

Please go ahead. I think tools/build_pkg_index.py needs to be modified to handle the case described in comment 2, and nltk.downloader.Downloader._url needs to be modified to support a list of URLs (no longer hard-coded, but read from a standard location, e.g. http://nltk.googlecode.com/svn/trunk/nltk_data/sites.xml)

nmadnani said, at 2009-07-16T02:51:07.000Z:

It's not clear to me exactly what functionality to add in order to support third-party downloads.

For example, how will build_index() from downloader.py (which is what's used in build_pkg_index.py) know whether or not to add a URL in the XML file? Will the XML file describing the corpus (abc.xml, indian.xml, etc.) contain this information?

How will the URLs from a standard location (e.g., the sites.xml URL you mention above) fit into the picture?

I think a brief description of how the current downloading process works and what it lacks will be very useful
to me to figure out what to do next.

StevenBird1 said, at 2009-07-16T03:32:19.000Z:

Let me expand the proposal a bit. Comment 2 is a proposal to extend the scope of the corpus XML metadata files (like abc.xml) to contain full URLs. Build_pkg_index searches for such XML files when it creates its index.xml file. If we only did this, and put the europarl_raw.xml file (http://nltk.ldc.upenn.edu/packages/corpora/europarl_raw.xml) into the repository, we'd be done.

A separate issue is to support the hosting of external collections (Comment 4) that
could in principle be maintained by others, plus a file containing the URLs for all
those collections. Let's not bother with that just yet.

nmadnani said, at 2009-07-16T03:46:07.000Z:

Okay, that makes sense to me. Here's the current abc.xml

What would the new version, one that contained a URL, look like? Something like this, maybe?

So, all I have to do in build_index() is to determine whether an XML file for a package contains such a 'ref'
attribute and if so, just use that as the "url" field instead of using the provided base-url to create the value for
the "url" field.

Does that make sense?

nmadnani said, at 2009-07-16T04:58:55.000Z:

OK, I made the trivial change in build_index() as follows:

for pkg_xml, zf, subdir in _find_packages(os.path.join(root, 'packages')):
    zipstat = os.stat(zf.filename)
    
    # Check if the XML file contains a 'ref' attribute. If it does, 
    # then use that value as the URL instead of constructing one
    url = pkg_xml.get('ref', None)
    if not url:
        url = '%s/%s/%s' % (base_url, subdir, os.path.split(zf.filename)[1])

I tested it and it works as intended. Can I go ahead and commit?

StevenBird1 said, at 2009-07-16T06:13:12.000Z:

Sure. Please also add the europarl_raw.xml file I mentioned (edit as needed), and rebuild the data index using "make pkg_index" (in trunk/nltk).

StevenBird1 said, at 2009-07-16T06:14:15.000Z:

Note that there's going to be a problem with calculating checksums for remote corpora, which means we might need to investigate my suggestion for external collections.

nmadnani said, at 2009-07-16T13:41:20.000Z:

Yes, that's a good point. In general, build_index() can't compute any of the statistics (size, unzipped size, checksum) etc. if we don't have the actual zip file.

I don't see how using external collections would help with this though. I must be missing something ...

StevenBird1 said, at 2009-12-03T01:14:13.000Z:

I think the only easy solution is to manually provide checksums in the XML file. The XML file should live with the data, which means that the index building process needs to know some external locations where NLTK corpora live, i.e. URLs of files that contain lists of NLTK corpora.

Chunker efficiency

Reported by Steven Bird, July 10, 2008

The chunk parser code may have some inefficiencies buried within it...

From Greg Aumann:

grammar_parse takes 21 seconds to parse my lexicon and chunk_parse takes
143 seconds according to these results.

I am attaching four files.

  1. parser_profile.py: parses my lexical data using both grammar_parse
    and chunk_parse then discards the result. The purpose is just to
    generate profiling data.

  2. LBDictGram.py contains the two grammars, used by parser_profile.

  3. prof_stats.py generates the summary of the profiling stats.

  4. profile_results.txt the summary of the profile written by prof_stats.py

The profiling data files written by parser_profile.py are 96 Mbytes each
so I am not mailing those. Also my data is quite large.

Greg

3053795 function calls (3053751 primitive calls) in 20.501 CPU
seconds

Ordered by: internal time, call count
List reduced from 50 to 20 due to restriction <20>

ncalls tottime percall cumtime percall filename:lineno(function)
26 3.447 0.133 20.294 0.781 data.py:96(grammar_parse)
226851 3.379 0.000 10.639 0.000 toolbox.py:79(fields)
226851 2.557 0.000 6.027 0.000 toolbox.py:45(raw_fields)
304154 2.238 0.000 3.468 0.000 sre.py:126(match)
184489 2.065 0.000 3.902 0.000 ElementTree.py:1075(start)
453650 1.231 0.000 1.231 0.000 utf_8.py:15(decode)
304232 1.230 0.000 1.232 0.000 sre.py:213(_compile)
368978 1.167 0.000 1.388 0.000 ElementTree.py:1046(_flush)
184489 0.947 0.000 2.035 0.000 ElementTree.py:1091(end)
184463 0.744 0.000 1.137 0.000 ElementTree.py:285(append)
184489 0.400 0.000 0.400 0.000 ElementTree.py:190(__init__)
184463 0.393 0.000 0.393 0.000 ElementTree.py:726(iselement)
123006 0.257 0.000 0.257 0.000 ElementTree.py:1064(data)
123006 0.222 0.000 0.222 0.000 string.py:308(join)
1 0.168 0.168 20.501 20.501 grammar_prof.py:19(grammar_parse_lbdict_files)
26 0.036 0.001 0.036 0.001 toolbox.py:29(open)
26 0.014 0.001 0.014 0.001 data.py:73(_make_parse_table)
26 0.003 0.000 0.003 0.000 toolbox.py:120(close)
9/3 0.001 0.000 0.001 0.000 sre_parse.py:374(_parse)
26 0.001 0.000 0.001 0.000 posixpath.py:56(join)

19611531 function calls (19554773 primitive calls) in 142.584 CPU
seconds

Ordered by: internal time, call count
List reduced from 441 to 20 due to restriction <20>

ncalls tottime percall cumtime percall filename:lineno(function)
2778182 17.934 0.000 17.934 0.000 sre_parse.py:773(expand_template)
196040 17.774 0.000 49.911 0.000 sre.py:136(sub)
193908 17.120 0.000 33.737 0.000 __init__.py:280(_verify)
7014711 16.677 0.000 16.677 0.000 __init__.py:272(_tag)
2778182 12.551 0.000 30.485 0.000 sre.py:259(filter)
96954 9.433 0.000 14.950 0.000 __init__.py:245(__init__)
290862 6.236 0.000 7.553 0.000 sre.py:154(split)
226851 3.493 0.000 10.802 0.000 toolbox.py:79(fields)
791852 3.402 0.000 3.696 0.000 sre.py:213(_compile)
234335 2.601 0.000 5.020 0.000 ElementTree.py:1075(start)
96954 2.581 0.000 22.632 0.000 __init__.py:316(to_chunkstruct)
226851 2.535 0.000 5.939 0.000 toolbox.py:45(raw_fields)
26 2.385 0.092 21.982 0.845 toolbox.py:135(_record_parse)
96954 2.335 0.000 108.963 0.001 regexp.py:602(parse)
304154 2.153 0.000 3.403 0.000 sre.py:126(match)
468670 2.100 0.000 2.562 0.000 ElementTree.py:1046(_flush)
185401 1.891 0.000 2.557 0.000 ElementTree.py:447(Element)
62395/7458 1.847 0.000 6.902 0.001 data.py:22(_tree2etree)
384521 1.700 0.000 2.500 0.000 ElementTree.py:285(append)
26 1.390 0.053 142.249 5.471 data.py:38(chunk_parse)
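
Profiling runs like the ones above can be reproduced with the standard-library profiler along these lines (parse_lexicon() stands in for the grammar_parse/chunk_parse driver in parser_profile.py, which is not reproduced here):

    import cProfile
    import pstats

    cProfile.run('parse_lexicon()', 'parser_profile.out')
    stats = pstats.Stats('parser_profile.out')
    stats.strip_dirs().sort_stats('time', 'calls').print_stats(20)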

Migrated from http://code.google.com/p/nltk/issues/detail?id=25


earlier comments

anand.jeyahar said, at 2011-06-02T05:07:31.000Z:

I am not able to find the profiler files attached. I searched in the nltk-dev Google group too, but couldn't find them. I would like to take a shot at this.

chunker models

Two chunker models lack an XML file, are gzipped instead of zipped, don't
unpack into a subdirectory, and have extra periods in their names. Please
see nltk_data/packages/chunkers/*.gz

Migrated from http://code.google.com/p/nltk/issues/detail?id=300


earlier comments

edloper said, at 2009-02-20T18:15:22.000Z:

These two files are pickles for nltk_contrib/coref objects (by jpf3). And at the moment, they don't actually work when you unpickle them. So I recommend that we either move them under nltk_contrib somewhere, or delete them.

StevenBird1 said, at 2009-02-20T21:01:23.000Z:

Joey -- I moved the chunk model files from nltk_data/chunkers down into nltk_data/packages/chunkers. However, they're not in the right format and not being distributed. Let us know if you need some help.

joseph.frazee said, at 2009-02-23T14:05:01.000Z:

Ok. I have the fixes made and just need to retrain on the whole Conll2k dataset. I'll have that done soon.

joseph.frazee said, at 2009-03-04T14:58:22.000Z:

The chunker refactoring is done and an NP-chunker has been retrained on the entire CoNLL 2000 dataset. All the dependencies in nltk_contrib.coref have been checked in but I'm waiting for the go ahead to check in the pickled ChunkTagger object.

joseph.frazee said, at 2009-03-09T13:55:14.000Z:

Since I wasn't sure if this would make it in, I enhanced the nltk_contrib.coref.chunk and nltk_contrib.coref.tag to allow the possibility of using models local to nltk_contrib.coref. For the time being, the demos run assuming this location. This should make it easier to ensure that the trunk code for those components is functional outside of the patches getting applied to the main nltk tree.

StevenBird1 said, at 2009-05-03T09:46:04.000Z:

Sorry for the long delay with 0.9.9. It would be good to have your updated models. Please ensure that the pickle file has the same directory structure and file naming convention as the existing chunker models, then check it in to nltk_data/packages/chunkers. Please also add the corresponding XML files.
