
CoreNLP: A Java suite of core NLP tools for tokenization, sentence segmentation, NER, parsing, coreference, sentiment analysis, etc.

Home Page: http://stanfordnlp.github.io/CoreNLP/

License: GNU General Public License v3.0

Languages: Java 98.15%, Python 0.11%, Shell 0.12%, Makefile 0.14%, Perl 0.04%, Common Lisp 0.11%, HTML 0.03%, CSS 0.01%, JavaScript 0.26%, Ruby 0.01%, Batchfile 0.01%, Lex 1.01%

Topics: natural-language-processing, nlp, nlp-parsing, named-entity-recognition, stanford-nlp

Introduction

Stanford CoreNLP


Stanford CoreNLP provides a set of natural language analysis tools written in Java. It can take raw human language text input and give the base forms of words, their parts of speech, whether they are names of companies, people, etc., normalize and interpret dates, times, and numeric quantities, mark up the structure of sentences in terms of syntactic phrases or dependencies, and indicate which noun phrases refer to the same entities. It was originally developed for English, but now also provides varying levels of support for (Modern Standard) Arabic, (mainland) Chinese, French, German, Hungarian, Italian, and Spanish.

Stanford CoreNLP is an integrated framework, which makes it very easy to apply a bunch of language analysis tools to a piece of text. Starting from plain text, you can run all the tools with just two lines of code. Its analyses provide the foundational building blocks for higher-level and domain-specific text understanding applications.

Stanford CoreNLP is a set of stable and well-tested natural language processing tools, widely used by various groups in academia, industry, and government. The tools variously use rule-based, probabilistic machine learning, and deep learning components.
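For example, a minimal pipeline in the spirit of the "two lines of code" claim (a sketch using the standard pipeline API; the annotator list is illustrative):

import java.util.Properties;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

public class PipelineDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner,parse");
        // The two essential lines: build the pipeline, then annotate the text
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
        Annotation document = new Annotation("Stanford University is located in California.");
        pipeline.annotate(document);
    }
}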

The Stanford CoreNLP code is written in Java and licensed under the GNU General Public License (v2 or later). Note that this is the full GPL, which allows many free uses, but not its use in proprietary software that you distribute to others.

Build Instructions

Several times a year we distribute a new version of the software, corresponding to a stable commit.

Between releases, you can always use the latest, under-development version of our code.

Here are some helpful instructions for using the latest code:

Provided build

Sometimes we will provide updated jars here that contain the latest version of the code.

At present, the most recent released jar is also the current version of the code, though you can always build the very latest from GitHub HEAD yourself.

Build with Ant

  1. Make sure you have Ant installed; details here: http://ant.apache.org/
  2. Compile the code with this command: cd CoreNLP ; ant
  3. Then run this command to build a jar with the latest version of the code: cd CoreNLP/classes ; jar -cf ../stanford-corenlp.jar edu
  4. This will create a new jar called stanford-corenlp.jar in the CoreNLP folder, containing the latest code.
  5. The dependencies that work with the latest code are in CoreNLP/lib and CoreNLP/liblocal, so make sure to include those in your CLASSPATH.
  6. When using the latest version of the code, make sure to download the latest versions of the corenlp-models, english-models, and english-models-kbp jars and include them in your CLASSPATH (a quick classpath check is sketched after this list). If you are processing languages other than English, also download the latest version of the models jar for each language you are interested in.
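As a quick sanity check for steps 5 and 6 (a sketch of my own, not part of the official instructions), you can test from Java whether a known model resource is visible on the classpath; the resource path below is one example and may differ between model versions:

import java.io.InputStream;

public class ClasspathCheck {
    public static void main(String[] args) {
        // Any resource known to live in a models jar works here
        String resource = "edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger";
        InputStream in = ClasspathCheck.class.getClassLoader().getResourceAsStream(resource);
        System.out.println(in != null ? "models jar found on classpath" : "models jar missing from classpath");
    }
}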

Build with Maven

  1. Make sure you have Maven installed; details here: https://maven.apache.org/
  2. If you run mvn package in the CoreNLP directory, it should run the tests and build this jar file: CoreNLP/target/stanford-corenlp-4.5.4.jar
  3. When using the latest version of the code, make sure to download the latest versions of the corenlp-models, english-extra-models, and english-kbp-models jars and include them in your CLASSPATH. If you are processing languages other than English, also download the latest version of the models jar for each language you are interested in.
  4. If you want to use Stanford CoreNLP as part of a Maven project, you need to install the models jars into your Maven repository. Below is a sample command for installing the Spanish models jar; for other languages, just change the language name in the command. To install stanford-corenlp-models-current.jar you will need to set -Dclassifier=models. Here is the sample command for Spanish: mvn install:install-file -Dfile=/location/of/stanford-spanish-corenlp-models-current.jar -DgroupId=edu.stanford.nlp -DartifactId=stanford-corenlp -Dversion=4.5.4 -Dclassifier=models-spanish -Dpackaging=jar

Models

The models jars that correspond to the latest code can be found in the table below.

Some of the larger (English) models -- like the shift-reduce parser and WikiDict -- are not distributed with our default models jar. These require downloading the English (extra) and English (kbp) jars. Resources for other languages require usage of the corresponding models jar.

The best way to get the models is to use git-lfs and clone them from Hugging Face Hub.

For instance, to get the French models, run the following commands:

# Make sure you have git-lfs installed
# (https://git-lfs.github.com/)
git lfs install

git clone https://huggingface.co/stanfordnlp/corenlp-french

The jars can also be downloaded directly from the links below or from the Hugging Face Hub pages.

Language        | Model Jar         | Last Updated
Arabic          | download (HF Hub) | 4.5.6
Chinese         | download (HF Hub) | 4.5.6
English (extra) | download (HF Hub) | 4.5.6
English (KBP)   | download (HF Hub) | 4.5.6
French          | download (HF Hub) | 4.5.6
German          | download (HF Hub) | 4.5.6
Hungarian       | download (HF Hub) | 4.5.6
Italian         | download (HF Hub) | 4.5.6
Spanish         | download (HF Hub) | 4.5.6

Thank you to Hugging Face for helping with our hosting!
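To use a downloaded language jar from Java, one approach (a sketch, assuming the French models jar is on the classpath and that it bundles a properties file following the StanfordCoreNLP-<language>.properties naming convention) is to load the language's properties file:

import java.util.Properties;
import edu.stanford.nlp.io.IOUtils;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

public class FrenchPipelineDemo {
    public static void main(String[] args) throws Exception {
        // StanfordCoreNLP-french.properties is assumed to ship inside the French models jar
        Properties props = new Properties();
        props.load(IOUtils.readerFromString("StanfordCoreNLP-french.properties"));
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
        Annotation doc = new Annotation("Le chat dort sur le canapé.");
        pipeline.annotate(doc);
    }
}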

Install by Gradle

If you are not familiar with Gradle, see the official site: https://gradle.org

Add the following to your build.gradle, referencing Maven Central:

dependencies {
    implementation 'edu.stanford.nlp:stanford-corenlp:4.5.5'
}

If you want to analyze English, also add the following:

    implementation "edu.stanford.nlp:stanford-corenlp:4.5.5:models"
    implementation "edu.stanford.nlp:stanford-corenlp:4.5.5:models-english"
    implementation "edu.stanford.nlp:stanford-corenlp:4.5.5:models-english-kbp"

If you use another version, replace "4.5.5" with the version you use.

Useful resources

You can find releases of Stanford CoreNLP on Maven Central.

You can find more explanation and documentation on the Stanford CoreNLP homepage.

For information about making contributions to Stanford CoreNLP, see the file CONTRIBUTING.md.

Questions about CoreNLP can either be posted on StackOverflow with the tag stanford-nlp, or on the mailing lists.

Contributors

acvogel, angelxuanchang, angledluffa, arunchaganty, asashour, froystig, futurulus, gangeli, hans, heatherchen, heeyounglee, j38, jcchuang, jeaneis, julwil, keenon, kno10, kpu, lmthang, malcdi, manning, melvinj, mjfang27, niloc, qipeng, rayder441, sebschu, siilats, sonalgupta, yuhaozhang


Issues

Can we construct Trees from input String?

I know StanfordNLP produces parenthesis-based output through the print() method, but does it provide any function to read that output string back in and construct a Tree?
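Assuming the current API, Tree.valueOf reads a Penn Treebank-style bracketed string back into a Tree (a minimal sketch):

import edu.stanford.nlp.trees.Tree;

public class TreeFromString {
    public static void main(String[] args) {
        // Parse a bracketed parse string back into a Tree object
        Tree tree = Tree.valueOf("(ROOT (S (NP (DT The) (NN cat)) (VP (VBD sat))))");
        tree.pennPrint(); // round-trips back to the bracketed form
    }
}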

how to build this repo

Hi! I am using ant to build the Java sources,
but I get the following error:

Buildfile: /home/drill/Downloads/CoreNLP-master/build.xml

classpath:
     [echo] core

compile:
     [echo] core
    [javac] Compiling 1 source file to /home/drill/Downloads/CoreNLP-master/classes
    [javac] /home/drill/Downloads/CoreNLP-master/test/src/edu/stanford/nlp/util/IntervalTreeTest.java:70: error: cannot find symbol
    [javac]     tree.check();
    [javac]         ^
    [javac]   symbol:   method check()
    [javac]   location: variable tree of type IntervalTree<Integer,Interval<Integer>>
    [javac] /home/drill/Downloads/CoreNLP-master/test/src/edu/stanford/nlp/util/IntervalTreeTest.java:71: error: cannot find symbol
    [javac]     tree.balance();
    [javac]         ^
    [javac]   symbol:   method balance()
    [javac]   location: variable tree of type IntervalTree<Integer,Interval<Integer>>
    [javac] /home/drill/Downloads/CoreNLP-master/test/src/edu/stanford/nlp/util/IntervalTreeTest.java:72: error: cannot find symbol
    [javac]     int height = tree.height();
    [javac]                      ^
    [javac]   symbol:   method height()
    [javac]   location: variable tree of type IntervalTree<Integer,Interval<Integer>>
    [javac] /home/drill/Downloads/CoreNLP-master/test/src/edu/stanford/nlp/util/IntervalTreeTest.java:74: error: cannot find symbol
    [javac]     tree.check();
    [javac]         ^
    [javac]   symbol:   method check()
    [javac]   location: variable tree of type IntervalTree<Integer,Interval<Integer>>
    [javac] /home/drill/Downloads/CoreNLP-master/test/src/edu/stanford/nlp/util/IntervalTreeTest.java:84: error: cannot find symbol
    [javac]     tree.clear();
    [javac]         ^
    [javac]   symbol:   method clear()
    [javac]   location: variable tree of type IntervalTree<Integer,Interval<Integer>>
    [javac] /home/drill/Downloads/CoreNLP-master/test/src/edu/stanford/nlp/util/IntervalTreeTest.java:130: error: cannot find symbol
    [javac]     Iterator<Interval<Integer>> iterator = tree.iterator();
    [javac]                                                ^
    [javac]   symbol:   method iterator()
    [javac]   location: variable tree of type IntervalTree<Integer,Interval<Integer>>
    [javac] /home/drill/Downloads/CoreNLP-master/test/src/edu/stanford/nlp/util/IntervalTreeTest.java:156: error: cannot find symbol
    [javac]     Iterator<Interval<Integer>> iterator = tree.iterator();
    [javac]                                                ^
    [javac]   symbol:   method iterator()
    [javac]   location: variable tree of type IntervalTree<Integer,Interval<Integer>>
    [javac] 7 errors

BUILD FAILED
/home/drill/Downloads/CoreNLP-master/build.xml:99: Compile failed; see the compiler error output for details.

I have googled many times and found nothing. Is there a supported command-line way to build? I hope you can provide a guide. Thanks.

Build Failure

Fresh checkout.

CoreNLP$ ant test
Buildfile: /Users/RCFischer/wkdir/JavaNLP/CoreNLP/build.xml

classpath:
     [echo] core

compile:
     [echo] core
    [javac] Compiling 1324 source files to /Users/RCFischer/wkdir/JavaNLP/CoreNLP/classes
    [javac] /Users/RCFischer/wkdir/JavaNLP/CoreNLP/src/edu/stanford/nlp/ling/tokensregex/SequenceMatchRules.java:352: error: cannot access SequencePattern
    [javac]   static public AnnotationExtractRule createTokenPatternRule(Env env, SequencePattern.PatternExpr expr, Expression result)
    [javac]                                                                       ^
    [javac]   bad class file: /Users/RCFischer/wkdir/JavaNLP/CoreNLP/classes/edu/stanford/nlp/ling/tokensregex/SequencePattern.class
    [javac]     class file contains wrong class: edu.stanford.nlp.stats.IntCounter
    [javac]     Please remove or make sure it appears in the correct subdirectory of the classpath.

BUILD FAILED
/Users/RCFischer/wkdir/JavaNLP/CoreNLP/build.xml:99: Compile failed; see the compiler error output for details.

Total time: 3 seconds

Non-string property values don't get passed to annotators

Ref: c01f31e#diff-817eb462723073c02c8fc4fd34993d18R22

Edit: The reference doesn't seem to come up right. The file and line in question are AnnotatorFactory.java, line 22:

-    for(Object key: properties.keySet()) {
-      this.properties.setProperty((String) key, properties.getProperty((String) key));
+    for (String key : properties.stringPropertyNames()) {
+      this.properties.setProperty(key, properties.getProperty(key));
     }
   }

Using stringPropertyNames() causes properties with non-string values to be excluded from the copied properties set, since they are excluded from the iterator. For example:

Properties props = new Properties();
props.put("ner.useSUTime", false);
props.put("customAnnotatorClass.stopword", "intoxicant.analytics.coreNlp.StopwordAnnotator");

from this props object, only customAnnotatorClass.stopword will be copied into the annotator, since the value of the "ner.useSUTime" prop is a non-string. The documentation for stringPropertyNames says:

[This] method returns a set of keys in this property list where the key and its corresponding value are strings, including distinct keys in the default property list if a key of the same name has not already been found from the main properties list. Properties whose key or value is not of type String are omitted.

This results in CoreNLP silently creating annotators without the properties that have been set on the Properties object.
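A minimal illustration of the behavior and a workaround (my suggestion, not an official fix): store every value as a String so stringPropertyNames() keeps the key:

import java.util.Properties;

public class PropsDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        // setProperty forces a String value, so stringPropertyNames() will include the key
        props.setProperty("ner.useSUTime", "false");
        props.setProperty("customAnnotatorClass.stopword", "intoxicant.analytics.coreNlp.StopwordAnnotator");
        System.out.println(props.stringPropertyNames()); // both keys survive the copy
    }
}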

Inconsistencies in the interfaces for constituent spans

I wrote the following up on 2013-09-16, when I was trying to understand/modify the coreference system, which was using different access methods below in different places.

There seem to be at least three types of span/index information for words and parse-tree constituents in Stanford CoreNLP, and they are all inconsistent with one another.

CoreAnnotations.IndexAnnotation
Only applies to leaves, I think.
Initialize with: Tree.indexLeaves()
1-indexed
reliable

CoreAnnotations.BeginIndexAnnotation and CoreAnnotations.EndIndexAnnotation
Applies to both nonterminals and leaves.
Initialize with: Tree.indexSpans()
0-indexed inclusive-exclusive: [start,end)
NOT RELIABLE - sometimes are null.

CoreAnnotations.SpanAnnotation [with wrapper Tree.getSpan()]
Initialize with: Tree.setSpans()
0-indexed inclusive-inclusive: [start,end]
NOT RELIABLE - sometimes is null.

I made an example of this with Stanford CoreNLP 3.2.0.
It reads in a parse tree, then prints out the above annotations at every node in the tree.
Code and output here: https://gist.github.com/brendano/7345495
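For reference, a short sketch showing how each of the three kinds of indices described above is initialized (method names as given above):

import edu.stanford.nlp.trees.Tree;

public class SpanDemo {
    public static void main(String[] args) {
        Tree tree = Tree.valueOf("(ROOT (S (NP (DT The) (NN cat)) (VP (VBD sat))))");
        tree.indexLeaves(); // IndexAnnotation on leaves, 1-indexed
        tree.indexSpans();  // Begin/EndIndexAnnotation, 0-indexed [start,end)
        tree.setSpans();    // SpanAnnotation, 0-indexed [start,end]
        System.out.println(tree.getSpan()); // wrapper over SpanAnnotation
    }
}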

3.4 Release missing classes for SR Parser

When setting parse.model=edu/stanford/nlp/models/srparser/englishSR.ser.gz and using the SR models from the site, there is a java.lang.ClassNotFoundException being thrown for edu.stanford.nlp.parser.shiftreduce.BasicFeatureFactory.

Upon inspection it looks like the class files for BasicFeatureFactory & DistsimFeatureFactory were not included in the build/jar; this renders the SR parser unusable from 3.4 (which is a bit of a pain as we use the .Net bindings).

CoNLLMentionExtractor always uses auto_conll files

Hello,

it seems that CoNLLMentionExtractor always uses auto_conll to extract entity mentions, even for gold mentions. However, gold mentions are contained in gold_conll files. This results in the metric scores always being equal to zero.
In particular, on line 75:

if (Constants.USE_CONLL_AUTO) options.setFilter(".*_auto_conll$");

Is this the correct behavior or am I missing something?

Train sentiment analyzer for a specific domain

Hello,

I am not sure if this is relevant to this forum, but I want to train the sentiment analyzer model for specific domains; right now it is pretty generic. For example, the following sentences

The room was spacious or The restaurants were a short walk from the hotel

get a sentiment of 1/5 even though they speak positively.

Any instructions on how I can achieve this would be appreciated.

Gradle Build Support

In response to Issue #23: I think Gradle would be better. I'd be happy to contribute that. A minimal build.gradle with ant.importBuild("build.xml") would enable the use as a subproject.

corenlp.war: XOMReader warnings break visualize output

java -version
java version "1.8.0_31"
Java(TM) SE Runtime Environment (build 1.8.0_31-b13)

Java HotSpot(TM) 64-Bit Server VM (build 25.31-b07, mixed mode)

jetty-runner corenlp.war
2015-02-24 16:37:55.366:INFO::main: Logging initialized @108ms
2015-02-24 16:37:55.372:INFO:oejr.Runner:main: Runner
2015-02-24 16:37:55.457:INFO:oejs.Server:main: jetty-9.2.2.v20140723
2015-02-24 16:38:01.093:WARN:oeja.AnnotationConfiguration:main: ServletContainerInitializers: detected. Class hierarchy: empty
2015-02-24 16:38:01.331:INFO:oejsh.ContextHandler:main: Started o.e.j.w.WebAppContext@606d8acf{/,file:/private/var/folders/qt/7v9m4kd572b0zw56pc3hxy5r0000gn/T/jetty-0.0.0.0-8080-corenlp.war-_-any-6993542127094007379.dir/webapp/,AVAILABLE}{file:/Users/spiliero/CoreNLP/corenlp.war}
2015-02-24 16:38:01.332:WARN:oejsh.RequestLogHandler:main: !RequestLog
2015-02-24 16:38:01.360:INFO:oejs.ServerConnector:main: Started ServerConnector@1d057a39{HTTP/1.1}{0.0.0.0:8080}
2015-02-24 16:38:01.361:INFO:oejs.Server:main: Started @6125ms
Searching for resource: StanfordCoreNLP.properties
Adding annotator tokenize
TokenizerAnnotator: No tokenizer type provided. Defaulting to PTBTokenizer.
Adding annotator ssplit
Adding annotator pos
Reading POS tagger model from edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger ... done [1.0 sec].
Adding annotator lemma
Adding annotator ner
annotators=tokenize, ssplit, pos, lemma, ner, parse, dcoref
Unknown property: |annotators|
Loading classifier from edu/stanford/nlp/models/ner/english.all.3class.distsim.crf.ser.gz ... done [5.1 sec].
Loading classifier from edu/stanford/nlp/models/ner/english.muc.7class.distsim.crf.ser.gz ... done [2.2 sec].
Loading classifier from edu/stanford/nlp/models/ner/english.conll.4class.distsim.crf.ser.gz ... done [3.6 sec].
sutime.binder.1.
Initializing JollyDayHoliday for sutime with classpath:edu/stanford/nlp/models/sutime/jollyday/Holidays_sutime.xml
Reading TokensRegex rules from edu/stanford/nlp/models/sutime/defs.sutime.txt
Reading TokensRegex rules from edu/stanford/nlp/models/sutime/english.sutime.txt
Feb 24, 2015 4:38:16 PM edu.stanford.nlp.ling.tokensregex.CoreMapExpressionExtractor appendRules
INFO: Ignoring inactive rule: null
Feb 24, 2015 4:38:16 PM edu.stanford.nlp.ling.tokensregex.CoreMapExpressionExtractor appendRules
INFO: Ignoring inactive rule: temporal-composite-8:ranges
Reading TokensRegex rules from edu/stanford/nlp/models/sutime/english.holidays.sutime.txt
Adding annotator parse
Loading parser from serialized file edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz ...done [0.5 sec].
Adding annotator dcoref
Warning: nu.xom.xslt.XOMReader: XOMReader doesn't support http://javax.xml.XMLConstants/property/accessExternalDTD
Warning: nu.xom.xslt.XOMReader: XOMReader doesn't support http://www.oracle.com/xml/jaxp/properties/entityExpansionLimit

NER annotation doesn't allow for setting SUTime rule path

When the NER annotator is used in a pipeline it doesn't seem to support passing the sutime.rules property on to the time extractors created by NumberSequenceClassifier, which leads to the Options object always being filled out with the default SUTime rules' paths.

In Java this isn't really a problem due to the classpath handling; however, I'm using the .Net bindings via IKVM and have run into a few issues with this (as such, I also don't have Java installed and thus cannot create a patch).

Writing sentiment analysis results to XML

I'm having trouble figuring out how to get the sentiment analysis tool to output an XML file when run from the command line. When I run the command provided at http://www-nlp.stanford.edu/sentiment/code.html it works fine but only outputs plain text:

Adding annotator tokenize

Adding annotator ssplit

Adding annotator parse

Loading parser from serialized file edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz ... done [1.3 sec].

Adding annotator sentiment

This is so great.

  Very positive

It was okay I guess.

  Neutral

However, if I try to run the full CoreNLP tool with the sentiment annotator, like so:

java -cp stanford-corenlp-full-2013-11-12/stanford-corenlp-3.3.0.jar:stanford-corenlp-full-2013-11-12/stanford-corenlp-3.3.0-models.jar:stanford-corenlp-full-2013-11-12/xom.jar:stanford-corenlp-full-2013-11-12/joda-time.jar:stanford-corenlp-full-2013-11-12/jollyday.jar -Xmx3g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,parse,sentiment -file  ./tweets/tweet1.txt

I get the following error:

Exception in thread "main" java.lang.NoClassDefFoundError: org/ejml/simple/SimpleBase

    at edu.stanford.nlp.pipeline.SentimentAnnotator.<init>(SentimentAnnotator.java:45)

    at edu.stanford.nlp.pipeline.StanfordCoreNLP$14.create(StanfordCoreNLP.java:845)

    at edu.stanford.nlp.pipeline.AnnotatorPool.get(AnnotatorPool.java:81)

    at edu.stanford.nlp.pipeline.StanfordCoreNLP.construct(StanfordCoreNLP.java:260)

    at edu.stanford.nlp.pipeline.StanfordCoreNLP.<init>(StanfordCoreNLP.java:127)

    at edu.stanford.nlp.pipeline.StanfordCoreNLP.<init>(StanfordCoreNLP.java:123)

    at edu.stanford.nlp.pipeline.StanfordCoreNLP.main(StanfordCoreNLP.java:1430)

Caused by: java.lang.ClassNotFoundException: org.ejml.simple.SimpleBase

    at java.net.URLClassLoader$1.run(URLClassLoader.java:202)

    at java.security.AccessController.doPrivileged(Native Method)

    at java.net.URLClassLoader.findClass(URLClassLoader.java:190)

    at java.lang.ClassLoader.loadClass(ClassLoader.java:306)

    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)

    at java.lang.ClassLoader.loadClass(ClassLoader.java:247)

    ... 7 more

If I run the command without the sentiment annotator, it works fine but of course I can't get any sentiment results.

I should also mention that I am running everything wrapped inside a Python subprocess.Popen() call, since the rest of our project is written in Python.

Inconsistencies with sentiment analysis output

I wrote some quick test code to try out the new sentiment model and noticed that there is something weird going on when using RNNCoreAnnotations.getPredictedClass().

I don't know if the sentiment analysis model included in 3.3.0 is different from the one on the live demo site (http://nlp.stanford.edu:8080/sentiment/rntnDemo.html), but in any case the short test code is:

import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.rnn.RNNCoreAnnotations;
import edu.stanford.nlp.sentiment.SentimentCoreAnnotations;
import edu.stanford.nlp.trees.Tree;
import edu.stanford.nlp.util.CoreMap;
import java.util.Properties;

public class SentimentTestAppStanfordNLP {

    private StanfordCoreNLP pipeline;

    public SentimentTestAppStanfordNLP() {
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize, ssplit, parse, sentiment");
        pipeline = new StanfordCoreNLP(props);    
    }

    private void checkSentiment(String text) {        
        Annotation annotation = pipeline.process(text);
        for (CoreMap sentence : annotation.get(CoreAnnotations.SentencesAnnotation.class)) {
            Tree tree = sentence.get(SentimentCoreAnnotations.AnnotatedTree.class);
            int sentiment = RNNCoreAnnotations.getPredictedClass(tree);
            System.out.println("Sentiment: " + sentiment + " String: " + sentence.toString());
        }        
    }

    private void doMain() throws Exception {
        checkSentiment("Radek is a really good football player");
        checkSentiment("Radek is a good football player");
        checkSentiment("Radek is an OK football player");
        checkSentiment("Radek is a bad football player");
        checkSentiment("Radek is a really bad football player");        
        System.out.println("-----------------------------");
        checkSentiment("Mark is a really good football player");
        checkSentiment("Mark is a good football player");
        checkSentiment("Mark is an OK football player");
        checkSentiment("Mark is a bad football player");
        checkSentiment("Mark is a really bad football player");        
    }

    public static void main(String[] args) {
        try {
            SentimentTestAppStanfordNLP main = new SentimentTestAppStanfordNLP();
            main.doMain();
        } catch (Exception ex) {
            ex.printStackTrace();
        }
    }
}

The output baffled me; in the "Radek" cases, RNNCoreAnnotations seemed to give almost random output, whereas in the "Mark" cases the outputs were pretty much as expected (see below). When I test these same sentences on the live demo site, the "Radek" cases are correct, unlike what CoreNLP outputs here.

Sentiment: 0 String: Radek is a really good football player
Sentiment: 1 String: Radek is a good football player
Sentiment: 2 String: Radek is an OK football player
Sentiment: 2 String: Radek is a bad football player
Sentiment: 2 String: Radek is a really bad football player
-----------------------------
Sentiment: 3 String: Mark is a really good football player
Sentiment: 3 String: Mark is a good football player
Sentiment: 2 String: Mark is an OK football player
Sentiment: 1 String: Mark is a bad football player
Sentiment: 1 String: Mark is a really bad football player

Invalid JSON output format

Hello,

When I try POS tagging with stanford-corenlp-3.5.1, I get the following fragment of output from StanfordCoreNLP's jsonPrint method.

{
"index": "5",
"word": "'s",
"lemma": "'s",
"characterOffsetBegin": "17",
"characterOffsetEnd": "19",
"pos": "POS"
}

Sample sentence: "I was the teacher's student."

It looks like the "word" and "lemma" values are escaped in a way that is invalid JSON, so JSON validation fails. Single-quote characters do not need to be escaped, according to http://json.org/.
You can check this at http://jsonlint.com/

I glanced at the code, and maybe this part is relevant to it. https://github.com/stanfordnlp/CoreNLP/blob/master/src/edu/stanford/nlp/pipeline/JSONOutputter.java#L178

I hope this is not a misunderstanding on my part; I would be happy to contribute a fix.
Regards,

Some tests do not work on Locales using "," as decimal separator.

[junit] Testcase: testToSortedString(edu.stanford.nlp.stats.CountersTest):  FAILED
[junit] null expected:<{c1[.0:a0.5:b0.]3}> but was:<{c1[,0:a0,5:b0,]3}>
[junit] junit.framework.ComparisonFailure: null expected:<{c1[.0:a0.5:b0.]3}> but was:<{c1[,0:a0,5:b0,]3}>
[junit]     at edu.stanford.nlp.stats.CountersTest.testToSortedString(CountersTest.java:250)

[junit] Testcase: testBasic(edu.stanford.nlp.util.ConfusionMatrixTest): FAILED
[junit] null expected:<...    prec=1, recall=0[.66667, spec=1, f1=0.8
[junit]               C2 = b        prec=0, recall=n/a, spec=0.]75, f1=n/a
[junit]          ...> but was:<...    prec=1, recall=0[,66667, spec=1, f1=0,8
[junit]               C2 = b        prec=0, recall=n/a, spec=0,]75, f1=n/a
[junit]          ...>
[junit] junit.framework.ComparisonFailure: null expected:<...    prec=1, recall=0[.66667, spec=1, f1=0.8
[junit]               C2 = b        prec=0, recall=n/a, spec=0.]75, f1=n/a
[junit]          ...> but was:<...    prec=1, recall=0[,66667, spec=1, f1=0,8
[junit]               C2 = b        prec=0, recall=n/a, spec=0,]75, f1=n/a
[junit]          ...>
[junit]     at edu.stanford.nlp.util.ConfusionMatrixTest.testBasic(ConfusionMatrixTest.java:41)

ShiftReduceParserQuery Throwing NPE in Pipeline

I'm very excited about the new SR parser, and I'm trying to drop it into a StanfordCoreNLP pipeline, but it's throwing an NPE when it gets to ShiftReduceParserQuery. The same code works with the PCFG parser, and I'm using version 3.5.0. The only properties I'm applying are:

props.put("annotators", "tokenize, ssplit, pos, lemma, ner, parse, dcoref");
props.put("parse.model", "edu/stanford/nlp/models/srparser/englishSR.ser.gz");
pipeline = new StanfordCoreNLP(props);

The stack trace leads to line 72 in the ShiftReduceParserQuery class:

Collection<ScoredObject<Integer>> predictedTransitions = parser.model.findHighestScoringTransitions(state, true, maxBeamSize, constraints);

I confirmed that parser.model is null here, even though the output says the model loads successfully. The relevant part of the output is below.

Adding annotator parse
Loading parser from serialized file edu/stanford/nlp/models/srparser/englishSR.ser.gz ...done [11.8 sec].
Adding annotator dcoref
java.lang.NullPointerException
    at edu.stanford.nlp.parser.shiftreduce.ShiftReduceParserQuery.parseInternal(ShiftReduceParserQuery.java:72)
    at edu.stanford.nlp.parser.shiftreduce.ShiftReduceParserQuery.parse(ShiftReduceParserQuery.java:47)
    at edu.stanford.nlp.pipeline.ParserAnnotator.doOneSentence(ParserAnnotator.java:263)
    at edu.stanford.nlp.pipeline.ParserAnnotator.doOneSentence(ParserAnnotator.java:215)
    at edu.stanford.nlp.pipeline.SentenceAnnotator.annotate(SentenceAnnotator.java:95)
    at edu.stanford.nlp.pipeline.AnnotationPipeline.annotate(AnnotationPipeline.java:68)
    at edu.stanford.nlp.pipeline.StanfordCoreNLP.annotate(StanfordCoreNLP.java:408)

Add Enum's for Part Of Speech

Right now the PoS tag is returned as (IMO) a cryptic String that doesn't really benefit the developer. If this could return an enum type representing the PoS, it would allow for more developer-friendly code.
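A hypothetical sketch of what such an enum could look like (none of this exists in CoreNLP; the tag names follow the Penn Treebank tag set, and only illustrate the request):

public enum PennTreebankPos {
    NN, NNS, NNP, NNPS, VB, VBD, VBG, VBN, VBP, VBZ, JJ, JJR, JJS, RB, DT, IN, OTHER;

    // Map a raw tag String to a typed constant, falling back to OTHER
    public static PennTreebankPos fromTag(String tag) {
        try {
            return valueOf(tag);
        } catch (IllegalArgumentException e) {
            return OTHER; // tags not modeled here (e.g., punctuation)
        }
    }
}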

EnglishFactored model is gone

Hi! I have been using the English Factored model for a long time (it's slower but seems to be more accurate than the PCFG one). I have upgraded my StanfordNLP library to 3.5.1, and it seems that edu/stanford/nlp/models/lexparser/englishFactored.ser.gz has gone missing; the only two models there are RNN and PCFG. Can I still find it somewhere?

In /StanfordCoreNLP/src/edu/stanford/nlp/sentiment/SentimentPipeline.java: remove the "-file" option; the "-fileList" option can handle a list of 1


Both -file and -fileList options are provided, which is redundant and error-prone.
Right now the file handling code for -file and -fileList is not in sync.

} else if (args[argIndex].equalsIgnoreCase("-file")) {
filename = args[argIndex + 1];
argIndex += 2;
} else if (args[argIndex].equalsIgnoreCase("-fileList")) {
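A sketch of the suggested consolidation (my own illustration, not a patch; how -fileList values are parsed here is a guess for demonstration purposes): fold -file into the list representation so both flags share one code path:

import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class FileArgsDemo {
    // Treat -file as a -fileList of one, so downstream code handles only lists
    static List<String> filenamesFor(String flag, String value) {
        if (flag.equalsIgnoreCase("-file")) {
            return Collections.singletonList(value);
        } else if (flag.equalsIgnoreCase("-fileList")) {
            return Arrays.asList(value.split(",")); // comma-splitting is an assumption
        }
        return Collections.emptyList();
    }

    public static void main(String[] args) {
        System.out.println(filenamesFor("-file", "review.txt"));
    }
}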

testTrieFindClosest failing under Java 8

When using Java 8 (early access), this test case fails.

java version "1.8.0-ea"
Java(TM) SE Runtime Environment (build 1.8.0-ea-b121)
Java HotSpot(TM) 64-Bit Server VM (build 25.0-b63, mixed mode)

Switching to Java 7, it works fine. This just FYI. I didn't investigate what exactly causes this issue and if it may be due to any remaining issue in the Java 8 preview.


    [junit] Testcase: testTrieFindClosest(edu.stanford.nlp.ling.tokensregex.matcher.TrieMapTest):   FAILED
    [junit] Expecting [([a - black - cat] -> true at (0,2),2.0), ([a - black - hat] -> true at (0,2),2.0), ([a - white - hat] -> true at (0,2),3.0), ([a - white - cat] -> true at (0,2),3.0), ([a - colored - hat] -> true at (0,2),3.0)], got [([a - black - hat] -> true at (0,2),2.0), ([a - black - cat] -> true at (0,2),2.0), ([a - colored - hat] -> true at (0,2),3.0), ([a - white - cat] -> true at (0,2),3.0), ([a - white - hat] -> true at (0,2),3.0)] 
expected:
<[([a - black - cat] -> true at (0,2),2.0), ([a - black - hat] -> true at (0,2),2.0), ([a - white - hat] -> true at (0,2),3.0), ([a - white - cat] -> true at (0,2),3.0), ([a - colored - hat] -> true at (0,2),3.0)]> 
but was:
<[([a - black - hat] -> true at (0,2),2.0), ([a - black - cat] -> true at (0,2),2.0), ([a - colored - hat] -> true at (0,2),3.0), ([a - white - cat] -> true at (0,2),3.0), ([a - white - hat] -> true at (0,2),3.0)]>

mapreduce jobs for corenlp

Hello,

My query is not an issue but more like how to achieve X using corenlp.

I need to process large amount of data and I was looking at corenlp as one of the options.

Since processing a review was taking about 4 secs (on a 2 year old macbook pro) I wanted to use map-reduce to run it over larger amount of data.

The way map-reduce jobs are run, typically each map routine gets one line of the file to process. If I call CoreNLP for each line (or, at most, each review), there is a lot of overhead, because CoreNLP has setup time and it is not efficient to set it up for each line.

So I wanted to know whether the authors have any thoughts on optimizing CoreNLP for the map-reduce paradigm, and whether there is any relevant implementation of CoreNLP in this paradigm that I can look at.

Thanks.
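One common pattern for this (a sketch of my own, not from the CoreNLP documentation) is to pay the pipeline setup cost once per JVM and reuse the same pipeline for every record a mapper processes:

import java.util.Properties;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

public class ReusablePipeline {
    // Built once per JVM (e.g., in a mapper's setup), then shared across records
    private static final StanfordCoreNLP PIPELINE;
    static {
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize,ssplit,pos");
        PIPELINE = new StanfordCoreNLP(props);
    }

    public static Annotation annotate(String line) {
        Annotation doc = new Annotation(line);
        PIPELINE.annotate(doc);
        return doc;
    }
}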

CoreNLP crashes with a "No roots in graph" RuntimeException

I have a sentence that triggers a RuntimeException, based on what I presume is a bad or unexpected parse, when trying to do dependency conversion with the getFirstRoot() call. This sounds like an NLP problem rather than a systems engineering problem, so ideally it would return null, or maybe throw a checked exception? Using the shift-reduce parser and version 3.5.1, I get the message

java.lang.RuntimeException: No roots in graph:
dep                 reln                gov                 
---                 ----                ---                 

    at edu.stanford.nlp.semgraph.SemanticGraph.getFirstRoot(SemanticGraph.java:773)

The text I'm parsing is the following. This is in JSON encoding. Sorry I don't know which sentence is causing it.

"days has elapsed after the \nreport is received. As used in this subsection--\n            ``(1) the term `legislative day means any calendar day on         which the House of Representatives is in session; and            ``(2) the terms `rule and `regulation mean a provision or         series of interrelated provisions stating a single, separable         rule of law..    (b) Report on Using Voter Communication Vouchers for Primary Elections.--The Commission shall submit to the House of Representa"

in plaintext,

days has elapsed after the 
report is received. As used in this subsection--
            ``(1) the term `legislative day means any calendar day on         which the House of Representatives is in session; and            ``(2) the terms `rule and `regulation mean a provision or         series of interrelated provisions stating a single, separable         rule of law..    (b) Report on Using Voter Communication Vouchers for Primary Elections.--The Commission shall submit to the House of Representa

Instructions to run the system locally

  1. I cloned the git repository and updated the java/javac versions to 1.7.
  2. I can compile fine (meaning when I run ant compile it says build successful).
  3. I cannot build, meaning when I run ant build I get the following error:

BUILD FAILED
Target "build" does not exist in the project "core".

  4. I tried importing this project into Eclipse (using "import existing projects", but it seems not to contain any existing project) and failed.

So although I can compile, I am not able to make any headway.

Could you please add some instructions to the readme on how to run the app locally?

SUTime sample does not work

The example of SUTime usage on the site has the following line:

pipeline.addAnnotator(new PTBTokenizerAnnotator(false));

but I cannot find the class PTBTokenizerAnnotator in the source code of version 3.5.0.

Could you please provide a correct example?

NullRef in DeterministicCorefSieve.sortMentionsForPronoun

When using the caseless pos-tagger, it is possible to trigger a null-reference exception in edu.stanford.nlp.dcoref.sievepasses.DeterministicCorefSieve.sortMentionsForPronoun when there is a dangling pronoun.

A simple repro-case using a simplified tweet that can trigger the exception:
rt @bob: I really hate fifa 2015. ya

which yields this trace:

Exception in thread "main" java.lang.RuntimeException: Error annotating C:\Users\***\Desktop\stanford-corenlp-full-2014-06-16\input.txt
        at edu.stanford.nlp.pipeline.StanfordCoreNLP$15.run(StanfordCoreNLP.java:1288)
        at edu.stanford.nlp.pipeline.StanfordCoreNLP.processFiles(StanfordCoreNLP.java:1348)
        at edu.stanford.nlp.pipeline.StanfordCoreNLP.run(StanfordCoreNLP.java:1390)
        at edu.stanford.nlp.pipeline.StanfordCoreNLP.main(StanfordCoreNLP.java:1460)
Caused by: java.lang.NullPointerException
        at edu.stanford.nlp.dcoref.sievepasses.DeterministicCorefSieve.sortMentionsForPronoun(DeterministicCorefSieve.java:482)
        at edu.stanford.nlp.dcoref.sievepasses.DeterministicCorefSieve.getOrderedAntecedents(DeterministicCorefSieve.java:464)
        at edu.stanford.nlp.dcoref.SieveCoreferenceSystem.coreference(SieveCoreferenceSystem.java:898)
        at edu.stanford.nlp.dcoref.SieveCoreferenceSystem.coref(SieveCoreferenceSystem.java:845)
        at edu.stanford.nlp.pipeline.DeterministicCorefAnnotator.annotate(DeterministicCorefAnnotator.java:121)
        at edu.stanford.nlp.pipeline.AnnotationPipeline.annotate(AnnotationPipeline.java:67)
        at edu.stanford.nlp.pipeline.StanfordCoreNLP.annotate(StanfordCoreNLP.java:848)
        at edu.stanford.nlp.pipeline.StanfordCoreNLP$15.run(StanfordCoreNLP.java:1276)
        ... 3 more

Admittedly this is not correct English in any way; however, it would be nice to see a little more robustness in the system :)

Caseless Parsers Broken

Since v3.3.1, caseless parsers are no longer supported. Is there a new model that is supported with the updated version?

Loading parser from serialized file edu/stanford/nlp/models/lexparser/englishPCFG.caseless.ser.gz ...
java.util.zip.ZipException: Not in GZIP format
at java.util.zip.GZIPInputStream.readHeader(GZIPInputStream.java:164)
at java.util.zip.GZIPInputStream.<init>(GZIPInputStream.java:78)
at java.util.zip.GZIPInputStream.<init>(GZIPInputStream.java:90)
at edu.stanford.nlp.io.IOUtils.getInputStreamFromURLOrClasspathOrFileSystem(IOUtils.java:446)
at edu.stanford.nlp.io.IOUtils.readStreamFromString(IOUtils.java:368)
at edu.stanford.nlp.parser.lexparser.LexicalizedParser.getParserFromSerializedFile(LexicalizedParser.java:606)
at edu.stanford.nlp.parser.lexparser.LexicalizedParser.getParserFromFile(LexicalizedParser.java:401)
at edu.stanford.nlp.parser.lexparser.LexicalizedParser.loadModel(LexicalizedParser.java:158)
at edu.stanford.nlp.parser.lexparser.LexicalizedParser.loadModel(LexicalizedParser.java:144)
at edu.stanford.nlp.pipeline.ParserAnnotator.loadModel(ParserAnnotator.java:187)
at edu.stanford.nlp.pipeline.ParserAnnotator.<init>(ParserAnnotator.java:113)
at edu.stanford.nlp.pipeline.StanfordCoreNLP$10.create(StanfordCoreNLP.java:732)
at edu.stanford.nlp.pipeline.AnnotatorPool.get(AnnotatorPool.java:81)
at edu.stanford.nlp.pipeline.StanfordCoreNLP.construct(StanfordCoreNLP.java:262)
at edu.stanford.nlp.pipeline.StanfordCoreNLP.<init>(StanfordCoreNLP.java:129)
at edu.stanford.nlp.pipeline.StanfordCoreNLP.<init>(StanfordCoreNLP.java:125)
at edu.stanford.nlp.sentiment.SentimentPipeline.main(SentimentPipeline.java:297)
Loading parser from text file edu/stanford/nlp/models/lexparser/englishPCFG.caseless.ser.gz ...
java.util.zip.ZipException: Not in GZIP format
at java.util.zip.GZIPInputStream.readHeader(GZIPInputStream.java:164)
at java.util.zip.GZIPInputStream.<init>(GZIPInputStream.java:78)
at java.util.zip.GZIPInputStream.<init>(GZIPInputStream.java:90)
at edu.stanford.nlp.io.IOUtils.getInputStreamFromURLOrClasspathOrFileSystem(IOUtils.java:446)
at edu.stanford.nlp.io.IOUtils.readerFromString(IOUtils.java:513)
at edu.stanford.nlp.parser.lexparser.LexicalizedParser.getParserFromTextFile(LexicalizedParser.java:540)
at edu.stanford.nlp.parser.lexparser.LexicalizedParser.getParserFromFile(LexicalizedParser.java:403)
at edu.stanford.nlp.parser.lexparser.LexicalizedParser.loadModel(LexicalizedParser.java:158)
at edu.stanford.nlp.parser.lexparser.LexicalizedParser.loadModel(LexicalizedParser.java:144)
at edu.stanford.nlp.pipeline.ParserAnnotator.loadModel(ParserAnnotator.java:187)
at edu.stanford.nlp.pipeline.ParserAnnotator.<init>(ParserAnnotator.java:113)
at edu.stanford.nlp.pipeline.StanfordCoreNLP$10.create(StanfordCoreNLP.java:732)
at edu.stanford.nlp.pipeline.AnnotatorPool.get(AnnotatorPool.java:81)
at edu.stanford.nlp.pipeline.StanfordCoreNLP.construct(StanfordCoreNLP.java:262)
at edu.stanford.nlp.pipeline.StanfordCoreNLP.<init>(StanfordCoreNLP.java:129)
at edu.stanford.nlp.pipeline.StanfordCoreNLP.<init>(StanfordCoreNLP.java:125)
at edu.stanford.nlp.sentiment.SentimentPipeline.main(SentimentPipeline.java:297)
Exception in thread "main" java.lang.NullPointerException
at edu.stanford.nlp.parser.lexparser.LexicalizedParser.loadModel(LexicalizedParser.java:160)
at edu.stanford.nlp.parser.lexparser.LexicalizedParser.loadModel(LexicalizedParser.java:144)
at edu.stanford.nlp.pipeline.ParserAnnotator.loadModel(ParserAnnotator.java:187)
at edu.stanford.nlp.pipeline.ParserAnnotator.<init>(ParserAnnotator.java:113)
at edu.stanford.nlp.pipeline.StanfordCoreNLP$10.create(StanfordCoreNLP.java:732)
at edu.stanford.nlp.pipeline.AnnotatorPool.get(AnnotatorPool.java:81)
at edu.stanford.nlp.pipeline.StanfordCoreNLP.construct(StanfordCoreNLP.java:262)
at edu.stanford.nlp.pipeline.StanfordCoreNLP.<init>(StanfordCoreNLP.java:129)
at edu.stanford.nlp.pipeline.StanfordCoreNLP.<init>(StanfordCoreNLP.java:125)
at edu.stanford.nlp.sentiment.SentimentPipeline.main(SentimentPipeline.java:297)

Program freezes

I have installed the entire Stanford library in Eclipse. When I run the following source code, the program just freezes, with the last message being "Adding annotator sentiment". It remains stuck without producing any tree or any output stating positive/negative/neutral. Can anyone kindly help me out?

SOURCE CODE:

import edu.stanford.nlp.dcoref.CorefChain;
import edu.stanford.nlp.dcoref.CorefCoreAnnotations.CorefChainAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.NamedEntityTagAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.PartOfSpeechAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.SentencesAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.TextAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.TokensAnnotation;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.semgraph.SemanticGraph;
import edu.stanford.nlp.semgraph.SemanticGraphCoreAnnotations.CollapsedCCProcessedDependenciesAnnotation;
import edu.stanford.nlp.trees.Tree;
import edu.stanford.nlp.trees.TreeCoreAnnotations.TreeAnnotation;
import edu.stanford.nlp.util.CoreMap;
import java.io.IOException;
import java.util.List;
import java.util.Map;
import java.util.Properties;


public class TagText
{
    public static void main(String[] args) throws IOException, ClassNotFoundException
    {
        // creates a StanfordCoreNLP object, with POS tagging, lemmatization, NER, parsing, and coreference resolution 
        Properties props = new Properties();
        props.put("annotators", "tokenize, ssplit, pos, lemma, ner, parse, sentiment");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        // read some text in the text variable
        String text = "European Stocks Drop as Maersk, Valeo Fall on Stake Sales";

        // create an empty Annotation just with the given text
        Annotation document = new Annotation(text);

        // run all Annotators on this text
        pipeline.annotate(document);

        // these are all the sentences in this document
        // a CoreMap is essentially a Map that uses class objects as keys and has values with custom types
        List<CoreMap> sentences = document.get(SentencesAnnotation.class);

        for(CoreMap sentence: sentences) {
          // traversing the words in the current sentence
          // a CoreLabel is a CoreMap with additional token-specific methods
          for (CoreLabel token: sentence.get(TokensAnnotation.class)) {
            // this is the text of the token
            String word = token.get(TextAnnotation.class);
            // this is the POS tag of the token
            String pos = token.get(PartOfSpeechAnnotation.class);
            // this is the NER label of the token
            String ne = token.get(NamedEntityTagAnnotation.class);       
          }

          // this is the parse tree of the current sentence
          Tree tree = sentence.get(TreeAnnotation.class);

          // this is the Stanford dependency graph of the current sentence
          SemanticGraph dependencies = sentence.get(CollapsedCCProcessedDependenciesAnnotation.class);
        }

        // This is the coreference link graph
        // Each chain stores a set of mentions that link to each other,
        // along with a method for getting the most representative mention
        // Both sentence and token offsets start at 1!
        Map<Integer, CorefChain> graph = 
          document.get(CorefChainAnnotation.class);
   }
}

THE FOLLOWING MESSAGES APPEAR ON THE ECLIPSE CONSOLE with no tree or any output stating positive/negative/neutral:

Adding annotator tokenize
Adding annotator ssplit
Adding annotator pos
Reading POS tagger model from edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger ... done [4.0 sec].
Adding annotator lemma
Adding annotator ner
Loading classifier from edu/stanford/nlp/models/ner/english.all.3class.distsim.crf.ser.gz ... done [13.5 sec].
Loading classifier from edu/stanford/nlp/models/ner/english.muc.7class.distsim.crf.ser.gz ... done [11.2 sec].
Loading classifier from edu/stanford/nlp/models/ner/english.conll.4class.distsim.crf.ser.gz ... done [9.4 sec].
Reading TokensRegex rules from edu/stanford/nlp/models/sutime/defs.sutime.txt
Reading TokensRegex rules from edu/stanford/nlp/models/sutime/english.sutime.txt
Mar 12, 2014 6:33:22 PM edu.stanford.nlp.ling.tokensregex.CoreMapExpressionExtractor appendRules
INFO: Ignoring inactive rule: null
Mar 12, 2014 6:33:22 PM edu.stanford.nlp.ling.tokensregex.CoreMapExpressionExtractor appendRules
INFO: Ignoring inactive rule: temporal-composite-8:ranges
Reading TokensRegex rules from edu/stanford/nlp/models/sutime/english.holidays.sutime.txt
Initializing JollyDayHoliday for sutime with classpath:edu/stanford/nlp/models/sutime/jollyday/Holidays_sutime.xml
Reading TokensRegex rules from edu/stanford/nlp/models/sutime/defs.sutime.txt
Reading TokensRegex rules from edu/stanford/nlp/models/sutime/english.sutime.txt
Mar 12, 2014 6:33:23 PM edu.stanford.nlp.ling.tokensregex.CoreMapExpressionExtractor appendRules
INFO: Ignoring inactive rule: null
Mar 12, 2014 6:33:23 PM edu.stanford.nlp.ling.tokensregex.CoreMapExpressionExtractor appendRules
INFO: Ignoring inactive rule: temporal-composite-8:ranges
Reading TokensRegex rules from edu/stanford/nlp/models/sutime/english.holidays.sutime.txt
Adding annotator parse
Loading parser from serialized file edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz ... done [2.8 sec].
Adding annotator sentiment

Stanford CoreNLP vs. Stanford Parser

In my project, I use Stanford CoreNLP to perform some basic operations. I also need a caseless model for parsing, so I chose "englishPCFG.caseless.ser.gz" from the Stanford Parser models. However, CoreNLP cannot read this model, so I added Stanford Parser to my project alongside CoreNLP.

But here comes the question: there are Java files with the same path (same package and same name) in both Stanford CoreNLP and Stanford Parser. Once there are slight differences between two such files, things get complicated, because I don't know which version my project will actually call. In fact, after I added Stanford Parser to my project, the original lemmatization module stopped working; an error occurred when loading the model.

Has anyone tried to add both Stanford Parser and Stanford CoreNLP to one project? Could you give me some advice on avoiding conflicts? Thanks. :-)

Access to tagset in ShiftReduceParser

It would be nice if ShiftReduceParser exposed a tagSet() method which would basically do:

return model.knownStates

Currently, I need to use reflection to access ShiftReduceParser.model and BaseModel.knownStates to extract the tag set.
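For reference, a sketch of that reflection workaround (field names "model" and "knownStates" come from the description above; the cast to Set<String> is an assumption about the field's runtime type):

import java.lang.reflect.Field;
import java.util.Set;
import edu.stanford.nlp.parser.shiftreduce.ShiftReduceParser;

public class TagSetExtractor {
    // Walk up the class hierarchy to find a declared field by name
    static Field findField(Class<?> clazz, String name) throws NoSuchFieldException {
        for (Class<?> c = clazz; c != null; c = c.getSuperclass()) {
            try {
                return c.getDeclaredField(name);
            } catch (NoSuchFieldException e) { /* keep walking up */ }
        }
        throw new NoSuchFieldException(name);
    }

    @SuppressWarnings("unchecked")
    public static Set<String> tagSet(ShiftReduceParser parser) throws Exception {
        Field modelField = findField(parser.getClass(), "model");
        modelField.setAccessible(true);
        Object model = modelField.get(parser);
        Field statesField = findField(model.getClass(), "knownStates");
        statesField.setAccessible(true);
        return (Set<String>) statesField.get(model);
    }
}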

SUTime NIGHT constant range end time is before start time

The NIGHT constant in SUTime.java currently has a range between the hours 19:00 and 5:00, giving it a duration of -14 hours.

REPRODUCE:
http://nlp.stanford.edu:8080/sutime/process
Input: "tomorrow night"
Output: tomorrow night


BUG LOCATION:
edu.stanford.nlp.time.SUTime.java:724

public static final Time NIGHT = createTemporal(StandardTemporalType.TIME_OF_DAY, "NI",
    new InexactTime(new Range(
        new InexactTime(new Partial(DateTimeFieldType.hourOfDay(), 19)),
        new InexactTime(new Partial(DateTimeFieldType.hourOfDay(), 5)))));

THE FIX SHOULD BE SOMETHING LIKE:

public static final Time NIGHT = createTemporal(StandardTemporalType.TIME_OF_DAY, "NI",
    new InexactTime(new Range(
        new InexactTime(new Partial(DateTimeFieldType.hourOfDay(), 19)),
        new InexactTime(new Partial(DateTimeFieldType.hourOfDay(), 24)))));

Pure c# port

I recently finished a port/reimplementation of the OpenNLP library in C#, and would like to do the same with StanfordNLP!

Is there any impediment (regarding the dual license) to making this port?

TokenMgrError: Lexical error at line 1, column 104. Encountered: "E" (69), after : "\\"

I have tried several input files, but I keep getting the following error. Any help will be appreciated.

Exception in thread "main" edu.stanford.nlp.ling.tokensregex.parser.TokenMgrError: Lexical error at line 1, column 104. Encountered: "E" (69), after : "\\"
at edu.stanford.nlp.ling.tokensregex.parser.TokenSequenceParserTokenManager.getNextToken(TokenSequenceParserTokenManager.java:1029)
at edu.stanford.nlp.ling.tokensregex.parser.TokenSequenceParser.jj_ntk(TokenSequenceParser.java:3353)
at edu.stanford.nlp.ling.tokensregex.parser.TokenSequenceParser.CoreMapNode(TokenSequenceParser.java:1386)
at edu.stanford.nlp.ling.tokensregex.parser.TokenSequenceParser.NodeBasic(TokenSequenceParser.java:1360)
at edu.stanford.nlp.ling.tokensregex.parser.TokenSequenceParser.NodeGroup(TokenSequenceParser.java:1327)
at edu.stanford.nlp.ling.tokensregex.parser.TokenSequenceParser.NodeDisjConj(TokenSequenceParser.java:1266)
at edu.stanford.nlp.ling.tokensregex.parser.TokenSequenceParser.BracketedNode(TokenSequenceParser.java:1127)
at edu.stanford.nlp.ling.tokensregex.parser.TokenSequenceParser.SeqRegexBasic(TokenSequenceParser.java:833)
at edu.stanford.nlp.ling.tokensregex.parser.TokenSequenceParser.SeqRegexDisjConj(TokenSequenceParser.java:1020)
at edu.stanford.nlp.ling.tokensregex.parser.TokenSequenceParser.SeqRegex(TokenSequenceParser.java:790)
at edu.stanford.nlp.ling.tokensregex.parser.TokenSequenceParser.SeqRegexWithAction(TokenSequenceParser.java:1643)
at edu.stanford.nlp.ling.tokensregex.parser.TokenSequenceParser.parseSequenceWithAction(TokenSequenceParser.java:37)
at edu.stanford.nlp.ling.tokensregex.TokenSequencePattern.compile(TokenSequencePattern.java:186)
at edu.stanford.nlp.patterns.surface.ScorePhrases.runParallelApplyPats(ScorePhrases.java:215)
at edu.stanford.nlp.patterns.surface.ScorePhrases.applyPats(ScorePhrases.java:326)
at edu.stanford.nlp.patterns.surface.ScorePhrases.learnNewPhrasesPrivate(ScorePhrases.java:397)
at edu.stanford.nlp.patterns.surface.ScorePhrases.learnNewPhrases(ScorePhrases.java:177)
at edu.stanford.nlp.patterns.surface.GetPatternsFromDataMultiClass.iterateExtractApply4Label(GetPatternsFromDataMultiClass.java:1716)
at edu.stanford.nlp.patterns.surface.GetPatternsFromDataMultiClass.iterateExtractApply(GetPatternsFromDataMultiClass.java:1591)
at edu.stanford.nlp.patterns.surface.GetPatternsFromDataMultiClass.main(GetPatternsFromDataMultiClass.java:2485)

List<Foo> or List<? extends Foo> (discussion)

I need to annotate my text with some ParserConstraints, but I need those constraints to keep track of the token sequence they are constraining. I could rebuild those sequences later using the .start and .end fields, no big deal, but since I already iterate over the sentence's token sequence to build the constraints, that would imply a kind of double iteration, which I would really like to avoid.

I was thinking about extending the ParserConstraint class, but as far as I understand Java generics, that is not possible, because the ParserAnnotations.ConstraintAnnotation class's .getType() method returns java.lang.Class<java.util.List<ParserConstraint>>.

I'm thinking that it would be great to have a covariant list, like java.util.List<? extends ParserConstraint>. What do you think? Is it feasible? Thanks.

P.S. of course, I'm available for monkey coding. :-)
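A small sketch of why the covariant type would help (the TrackedConstraint subclass is hypothetical, and the (int, int, String) super constructor is an assumption about ParserConstraint's API):

import java.util.ArrayList;
import java.util.List;
import edu.stanford.nlp.parser.lexparser.ParserConstraint;

public class CovarianceDemo {
    // Hypothetical subclass: a constraint that remembers its token sequence
    static class TrackedConstraint extends ParserConstraint {
        final List<String> tokens;
        TrackedConstraint(int start, int end, String state, List<String> tokens) {
            super(start, end, state); // assumes this convenience constructor exists
            this.tokens = tokens;
        }
    }

    public static void main(String[] args) {
        List<TrackedConstraint> mine = new ArrayList<>();
        // List<ParserConstraint> invariant = mine;        // does not compile: generics are invariant
        List<? extends ParserConstraint> covariant = mine; // compiles: covariant view
        System.out.println(covariant.size());
    }
}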

Truecaser not running because of missing Java class

Hello,

I am trying to use Stanford CoreNLP for an EAMT-funded project implemented in Java, and we would love to use the tool's truecasing functionality. However, it seems that whenever we try to run the truecasing annotator, we get the following error:

Caused by: java.lang.ClassNotFoundException: edu.stanford.nlp.sequences.TrueCasingForNISTDocumentReaderAndWriter

It seems like this is a very old problem which had been solved in the past but is now back in the newest versions. Is there any way it can be fixed?

Thanks in advance.

RuleBasedCorefMentionFinder NullPointerException with SR parser only

same crash in 3.5 release and current build

$ echo 'This waste, when mixed into the soil, can be very helpful to growing plants' > tmp.txt
$ java -mx3g -cp "./*" edu.stanford.nlp.pipeline.StanfordCoreNLP -parse.model edu/stanford/nlp/models/srparser/englishSR.ser.gz -file tmp.txt

Ready to process: 1 files, skipped 0, total 1
Processing file /Users/kevinh/Stanford/stanford-corenlp-full-2014-10-31/tmp.txt ... writing to /Users/kevinh/Stanford/stanford-corenlp-full-2014-10-31/tmp.txt.xml {
  Annotating file /Users/kevinh/Stanford/stanford-corenlp-full-2014-10-31/tmp.txt {
    RuleBasedCorefMentionFinder: Failed to find head token:
    Tree is: (ROOT (S (NP (NP (DT This) (NN waste)) (, ,) (SBAR (WHADVP (WRB when)) (S (VP (VBN mixed) (PP (IN into) (NP (DT the) (NN soil)))))) (, ,)) (VP (MD can) (VP (VB be) (ADJP (RB very) (JJ helpful) (PP (TO to) (NP (VBG growing) (NNS plants))))))))
    token = |waste|1|, approx=0
  } [0.476 seconds]
Exception in thread "main" java.lang.RuntimeException: Error annotating /Users/kevinh/Stanford/stanford-corenlp-full-2014-10-31/tmp.txt
    at edu.stanford.nlp.pipeline.StanfordCoreNLP.lambda$processFiles$15(StanfordCoreNLP.java:877)
    at edu.stanford.nlp.pipeline.StanfordCoreNLP$$Lambda$17/1526062841.run(Unknown Source)
    at edu.stanford.nlp.pipeline.StanfordCoreNLP.processFiles(StanfordCoreNLP.java:948)
    at edu.stanford.nlp.pipeline.StanfordCoreNLP.run(StanfordCoreNLP.java:990)
    at edu.stanford.nlp.pipeline.StanfordCoreNLP.main(StanfordCoreNLP.java:1060)
Caused by: java.lang.NullPointerException
    at edu.stanford.nlp.dcoref.RuleBasedCorefMentionFinder.findHead(RuleBasedCorefMentionFinder.java:276)
    at edu.stanford.nlp.dcoref.RuleBasedCorefMentionFinder.extractPredictedMentions(RuleBasedCorefMentionFinder.java:101)
    at edu.stanford.nlp.pipeline.DeterministicCorefAnnotator.annotate(DeterministicCorefAnnotator.java:107)
    at edu.stanford.nlp.pipeline.AnnotationPipeline.annotate(AnnotationPipeline.java:68)
    at edu.stanford.nlp.pipeline.StanfordCoreNLP.annotate(StanfordCoreNLP.java:410)
    at edu.stanford.nlp.pipeline.StanfordCoreNLP.lambda$processFiles$15(StanfordCoreNLP.java:865)
    ... 4 more

Problems with IndexedWord word() and value()

Hi, in the context of dkpro's StanfordCoreferenceResolver, I found the following problem:
Stanford CoreNLP (v3.4.1) seems to plan changes to IndexedWord: word() and value() both exist, but according to a comment they should be unified at some point.

Details:

StanfordCoreferenceResolver creates the collapsed dependencies this way:
ParserAnnotatorUtils.fillInParseAnnotations(false, true, gsf, sentence, treeCopy);

Dcoref's Document.java makes use of the function getNodeByWordPattern of SemanticGraph, which in turn uses w.word(). This does not seem to be set by fillInParseAnnotations.

value() is set, however, so I preliminarily fixed the problem by adding the following right after fillInParseAnnotations in StanfordCoreferenceResolver.

SemanticGraph deps = sentence.get(SemanticGraphCoreAnnotations.CollapsedDependenciesAnnotation.class);
for (IndexedWord vertex : deps.vertexSet()) {
  vertex.setWord(vertex.value());
}

The problem should be fixed in StanfordCoreNLP, however.

Custom features in NER package should use dependency injection

To add a custom feature extractor to the NER package, the path of least resistance is to modify NERFeatureFactory.java, AnnotationLookup.java, CoreAnnotations.java and SeqClassifierFlags.java. A more generic approach would use dependency injection, so third-party developers wouldn't have to touch code inside CoreNLP. If it's not already on your development roadmap, I'm happy to take on that change.

Dave
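A hypothetical sketch of what an injectable extractor contract could look like (none of these types exist in CoreNLP today; this only illustrates the proposal):

import java.util.Collection;
import java.util.List;
import edu.stanford.nlp.ling.CoreLabel;

// Third-party code would implement this and register it with the classifier,
// instead of editing NERFeatureFactory and friends.
public interface NerFeatureExtractor {
    Collection<String> extract(CoreLabel token, int position, List<CoreLabel> sentence);
}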

error message when trying to parse nonsensical datetime

I see this a lot when parsing Gigaword.

java.lang.NumberFormatException: For input string: "1438143814381434"
        at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
        at java.lang.Integer.parseInt(Integer.java:583)
        at java.lang.Integer.valueOf(Integer.java:766)
        at edu.stanford.nlp.ie.pascal.ISODateInstance.extractDay(ISODateInstance.java:1107)
        at edu.stanford.nlp.ie.pascal.ISODateInstance.extractFields(ISODateInstance.java:398)
        at edu.stanford.nlp.ie.pascal.ISODateInstance.<init>(ISODateInstance.java:82)
        at edu.stanford.nlp.ie.QuantifiableEntityNormalizer.normalizedDateString(QuantifiableEntityNormalizer.java:363)
        at edu.stanford.nlp.ie.QuantifiableEntityNormalizer.normalizedDateString(QuantifiableEntityNormalizer.java:338)
        at edu.stanford.nlp.ie.QuantifiableEntityNormalizer.processEntity(QuantifiableEntityNormalizer.java:1025)
        at edu.stanford.nlp.ie.QuantifiableEntityNormalizer.addNormalizedQuantitiesToEntities(QuantifiableEntityNormalizer.java:1374)
        at edu.stanford.nlp.ie.NERClassifierCombiner.classifyWithGlobalInformation(NERClassifierCombiner.java:133)
        at edu.stanford.nlp.ie.AbstractSequenceClassifier.classifySentenceWithGlobalInformation(AbstractSequenceClassifier.java:327)
        at edu.stanford.nlp.pipeline.NERCombinerAnnotator.doOneSentence(NERCombinerAnnotator.java:148)
        at edu.stanford.nlp.pipeline.SentenceAnnotator.annotate(SentenceAnnotator.java:95)
        at edu.stanford.nlp.pipeline.NERCombinerAnnotator.annotate(NERCombinerAnnotator.java:137)
        at edu.stanford.nlp.pipeline.AnnotationPipeline.annotate(AnnotationPipeline.java:68)
        at edu.stanford.nlp.pipeline.StanfordCoreNLP.annotate(StanfordCoreNLP.java:411)

NullPointerException on "hello world" input

I get an NPE on many (but not all) test sentences I've tried, including the text "hello world". Stack trace:

java.lang.NullPointerException
  at org.ejml.simple.SimpleMatrix.<init>(SimpleMatrix.java:158)
  at edu.stanford.nlp.rnn.RNNUtils.elementwiseApplyTanh(RNNUtils.java:175)
  at edu.stanford.nlp.sentiment.SentimentCostAndGradient.forwardPropagateTree(SentimentCostAndGradient.java:328)
  at edu.stanford.nlp.sentiment.SentimentCostAndGradient.forwardPropagateTree(SentimentCostAndGradient.java:333)
  at edu.stanford.nlp.sentiment.SentimentCostAndGradient.forwardPropagateTree(SentimentCostAndGradient.java:332)
  at edu.stanford.nlp.sentiment.SentimentCostAndGradient.forwardPropagateTree(SentimentCostAndGradient.java:333)
  at edu.stanford.nlp.sentiment.SentimentCostAndGradient.forwardPropagateTree(SentimentCostAndGradient.java:333)
  at edu.stanford.nlp.sentiment.SentimentCostAndGradient.forwardPropagateTree(SentimentCostAndGradient.java:332)
  at edu.stanford.nlp.sentiment.SentimentCostAndGradient.forwardPropagateTree(SentimentCostAndGradient.java:333)
  at edu.stanford.nlp.sentiment.SentimentCostAndGradient.forwardPropagateTree(SentimentCostAndGradient.java:333)
  at edu.stanford.nlp.sentiment.SentimentCostAndGradient.forwardPropagateTree(SentimentCostAndGradient.java:332)
  at edu.stanford.nlp.sentiment.SentimentCostAndGradient.forwardPropagateTree(SentimentCostAndGradient.java:333)
  at edu.stanford.nlp.pipeline.SentimentAnnotator.annotate(SentimentAnnotator.java:46)
  at edu.stanford.nlp.pipeline.AnnotationPipeline.annotate(AnnotationPipeline.java:67)
  at edu.stanford.nlp.pipeline.StanfordCoreNLP.annotate(StanfordCoreNLP.java:876)
