clearnlp's People

Contributors

davidbelanger, jdchoi77

clearnlp's Issues

Potentially incorrect tokenization

When writing a tokenization unit test for the ClearTK wrappers for ClearNLP, I 
found an inconsistency between OpenNLP's tokenization and ClearNLP's.

Consider the string:
String s = "\"John & Mary's dog,\" Jane thought (to herself).\n"
         + "\"What a #$%!\n"
         + "a- ``I like AT&T''.\"";

I was expecting the following tokenization as this is what our unit test for 
OpenNLP produces:
", John, &, Mary, 's, dog, ,, ", Jane, thought, (, to, herself, ), ., ", What, 
a, #, $, %, !, a, -, ``, I, like, AT&T, '', ., "

ClearNLP's output is slightly different:
", John, &, Mary, 's, dog, ,, ", Jane, thought, (, to, herself, ), ., ", What, 
a, #, $, %, !, a, -, `, `, I, like, AT, &, T, ', ', ., "

Specifically, the discrepancies are:
`` vs `, `
AT&T vs AT, &, T
'' vs ', '

Is this just a different style of tokenization or is it incorrect?  Does it 
make a difference for the parser? 

Original issue reported on code.google.com by lee.becker on 27 Oct 2012 at 6:01
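
To pin down where two tokenizations diverge, a unit test can report the first differing position. The sketch below is self-contained and does not call OpenNLP or ClearNLP. The class `TokenDiff` and its method `firstMismatch` are hypothetical names, and the token lists are the tails of the two outputs quoted in the report.

```java
import java.util.Arrays;
import java.util.List;

final class TokenDiff {
    /** Returns the index of the first differing token, or -1 if the lists are identical. */
    static int firstMismatch(List<String> expected, List<String> actual) {
        int shared = Math.min(expected.size(), actual.size());
        for (int i = 0; i < shared; i++) {
            if (!expected.get(i).equals(actual.get(i))) {
                return i;
            }
        }
        // One list is a prefix of the other: they diverge at the shorter length.
        return expected.size() == actual.size() ? -1 : shared;
    }

    public static void main(String[] args) {
        // Tails of the two tokenizations from the report, starting at "a".
        List<String> openNlp = Arrays.asList(
                "a", "-", "``", "I", "like", "AT&T", "''", ".", "\"");
        List<String> clearNlp = Arrays.asList(
                "a", "-", "`", "`", "I", "like", "AT", "&", "T", "'", "'", ".", "\"");

        int i = firstMismatch(openNlp, clearNlp);
        if (i >= 0) {
            System.out.println("first mismatch at index " + i);
        } else {
            System.out.println("token lists are identical");
        }
    }
}
```

Reporting only the first mismatch keeps the failure message short. Because one split token shifts every later index, a full alignment would be needed to list all three discrepancies.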

InputStream for EngineGetters

_This was originally posted at our forum by Lee Becker_

Would it be possible to add APIs to the factory methods in EngineGetters to 
accept InputStreams?  Currently they only accept modelFiles or dataFiles as 
Strings.  It would be useful to accept InputStreams so that the developer can 
decide whether it comes from a File, URL, or URI.  This will also assist 
integration into UIMA-based systems like ClearTK or cTAKES.

For example, these would all be useful interfaces:

static public DEPParser getDEPParser(InputStream modelInputStream)
static public Pair<POSTagger[],Double> getPOSTaggers(InputStream modelInputStream) throws Exception
static public AbstractTokenizer getTokenizer(String language, InputStream dictInputStream)

Thanks,
Lee

Original issue reported on code.google.com by [email protected] on 29 Oct 2012 at 7:02
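
The benefit of an InputStream-based factory is that the caller chooses the source. The sketch below illustrates this with stand-in types only. `StreamFactorySketch`, `Model`, and `loadModel` are hypothetical and are not part of ClearNLP. The example assumes the caller owns the stream and is responsible for closing it.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

final class StreamFactorySketch {
    /** Hypothetical stand-in for a loaded model; not a ClearNLP type. */
    static final class Model {
        final int sizeInBytes;

        Model(int sizeInBytes) {
            this.sizeInBytes = sizeInBytes;
        }
    }

    /** Hypothetical factory: consumes the stream fully but does not close it. */
    static Model loadModel(InputStream in) {
        byte[] buffer = new byte[8192];
        int total = 0;
        try {
            int n;
            while ((n = in.read(buffer)) != -1) {
                total += n;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new Model(total);
    }

    public static void main(String[] args) throws IOException {
        // The same factory would accept new FileInputStream(path), url.openStream(),
        // or SomeClass.class.getResourceAsStream(name). An in-memory stream is used
        // here so the example runs without any model file.
        byte[] fake = "fake model bytes".getBytes(StandardCharsets.UTF_8);
        try (InputStream in = new ByteArrayInputStream(fake)) {
            System.out.println(loadModel(in).sizeInBytes + " bytes read");
        }
    }
}
```

Classpath resources are the usual way to ship models inside UIMA-based systems, which is why a stream-based signature would help integration there.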

NPE in DEPNode.toString(List<DEPArc> heads)

What steps will reproduce the problem?
1. parse a sentence 'The train leaves at 5pm.' using 
EngineProcess.getDEPTree(...)
2. print the resultant DEPTree
3. experience the NPE

What is the expected output? What do you see instead?
1   The     the     DT   _             2   det     _          _
2   train   train   NN   _             3   nsubj   3:A0       _
3   leaves  leave   VBZ  pb=leave.XX   0   root    _          _
4   at      at      IN   _             3   prep    3:AM-TMP   _
5   5       0       CD   _             6   num     _          _
6   pm      pm      NN   _             4   pobj    _          _
7   .       .       .    _             3   punct   _          _

Instead, I get a NullPointerException stack trace.

What version of the product are you using? On what operating system?
1.2.1

Please provide any additional information below.

Suggested fix: add a null check before Collections.sort(...) in the following method:

    private String toString(List<DEPArc> heads)
    {
        StringBuilder build = new StringBuilder();

        Collections.sort(heads);

        for (DEPArc arc : heads)
        {
            build.append(DEPLib.DELIM_HEADS);
            build.append(arc.toString());
        }

        if (build.length() > 0)
            return build.substring(DEPLib.DELIM_HEADS.length());
        else
            return AbstractColumnReader.BLANK_COLUMN;
    }

Original issue reported on code.google.com by [email protected] on 9 Nov 2012 at 11:08
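
The suggested guard can be sketched as follows. This is a self-contained illustration, not a patch against ClearNLP. Arcs are represented as plain strings instead of `DEPArc`, and the constants `DELIM_HEADS` and `BLANK_COLUMN` are assumed stand-ins for `DEPLib.DELIM_HEADS` and `AbstractColumnReader.BLANK_COLUMN`, whose actual values may differ.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

final class HeadsToStringSketch {
    // Assumed stand-ins for DEPLib.DELIM_HEADS and AbstractColumnReader.BLANK_COLUMN.
    static final String DELIM_HEADS = ";";
    static final String BLANK_COLUMN = "_";

    /** Joins the sorted heads, or returns the blank column when there are none. */
    static String toString(List<String> heads) {
        // The null check proposed in the report; an empty list yields the same result.
        if (heads == null || heads.isEmpty()) {
            return BLANK_COLUMN;
        }

        // Sort a copy so the caller's list is left untouched.
        List<String> sorted = new ArrayList<>(heads);
        Collections.sort(sorted);
        return String.join(DELIM_HEADS, sorted);
    }

    public static void main(String[] args) {
        System.out.println(toString(null));
        System.out.println(toString(Arrays.asList("3:AM-TMP", "3:A0")));
    }
}
```

Unlike the original method, this version sorts a copy rather than the list itself. Whether ClearNLP relies on the in-place sort is not known from the report.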

MPAnalyzer gives null pointer exception

What steps will reproduce the problem?
1. Run "java com.googlecode.clearnlp.run.MPAnalyze -c input\config_en_morph.xml 
-i input\morph-sample.txt" as given in the Wiki

What is the expected output? What do you see instead?
input\morph-sample.txt.morph
java.lang.NullPointerException
        at com.googlecode.clearnlp.morphology.EnglishMPAnalyzer.getException(EnglishMPAnalyzer.java:340)
        at com.googlecode.clearnlp.morphology.EnglishMPAnalyzer.getLemmaAux(EnglishMPAnalyzer.java:306)
        at com.googlecode.clearnlp.morphology.EnglishMPAnalyzer.getLemma(EnglishMPAnalyzer.java:274)
        at com.googlecode.clearnlp.morphology.AbstractMPAnalyzer.lemmatize(AbstractMPAnalyzer.java:60)
        at com.googlecode.clearnlp.run.MPAnalyze.analyze(MPAnalyze.java:87)
        at com.googlecode.clearnlp.run.MPAnalyze.<init>(MPAnalyze.java:73)
        at com.googlecode.clearnlp.run.MPAnalyze.main(MPAnalyze.java:96)

What version of the product are you using? On what operating system?
ClearNLP version 1.3.0

Please provide any additional information below.


Original issue reported on code.google.com by [email protected] on 22 Feb 2013 at 2:20

Latest models seem to be broken

What steps will reproduce the problem?
1. Checkout the latest Git master from googlecode
2. Download the latest models (e.g., 
https://bitbucket.org/jdchoi77/models/downloads/ontonotes-en-pos-1.3.0.tgz )
3. Parse using, e.g.,
 mvn exec:java -Dexec.mainClass=com.googlecode.clearnlp.demo.DemoDEPParser  -Dexec.args="model/dictionary-1.2.0.zip model/ontonotes-en-pos-1.3.0.tgz model/ontonotes-en-dep-1.3.0.tgz src/main/resources/sample/iphone5.txt src/main/resources/sample/iphone5.txt.newparsed"

What is the expected output? What do you see instead?
Instead of parse output, we get a null pointer exception

What version of the product are you using? On what operating system?
Git master 6fb797d1ad2a49946fcf907c77045136940936e3 (version 1.3.0)

Please provide any additional information below.
Parsing works fine with the old models. It looks like the latest models are out 
of sync with the Git version.

Original issue reported on code.google.com by [email protected] on 25 Jan 2013 at 1:16

EngineGetters should throw exception instead of returning null

I was looking at the static factory methods in EngineGetter, and I noticed 
several methods like this:

    static public AbstractSegmenter getSegmenter(String language, AbstractTokenizer tokenizer)
    {
        if (language.equals(AbstractReader.LANG_EN))
            return new EnglishSegmenter(tokenizer);

        return null;
    }

It seems that instead of returning null, these methods should throw an 
IllegalArgumentException that says "the requested language is not currently 
supported".

Original issue reported on code.google.com by lee.becker on 27 Oct 2012 at 5:33
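
A minimal sketch of the proposed behavior, using stand-in types so that it runs without ClearNLP. `SegmenterFactorySketch`, `Segmenter`, and the value of `LANG_EN` are assumptions for illustration, and the real `getSegmenter` also takes a tokenizer argument that is omitted here.

```java
final class SegmenterFactorySketch {
    // Assumed value; the real constant is AbstractReader.LANG_EN.
    static final String LANG_EN = "en";

    /** Hypothetical stand-in for AbstractSegmenter. */
    interface Segmenter {}

    /** Hypothetical stand-in for EnglishSegmenter. */
    static final class EnglishSegmenter implements Segmenter {}

    static Segmenter getSegmenter(String language) {
        // Calling equals on the constant also rejects a null language without an NPE.
        if (LANG_EN.equals(language)) {
            return new EnglishSegmenter();
        }
        throw new IllegalArgumentException(
                "The requested language is not currently supported: " + language);
    }

    public static void main(String[] args) {
        System.out.println(getSegmenter("en").getClass().getSimpleName());
        try {
            getSegmenter("xx");
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Failing at the factory call reports the unsupported language directly. Returning null would surface only later, as a NullPointerException in unrelated code.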

lightweight, low-memory models would be useful for unit testing

For those of us wrapping ClearNLP in another framework, it would be useful 
to have lightweight, low-memory models to test how the ClearNLP APIs interface 
with our own code.

Original issue reported on code.google.com by lee.becker on 29 Oct 2012 at 12:16

ClearNLP Error: java.lang.NullPointerException at com.googlecode.clearnlp.tokenization.EnglishTokenizer.normalizeNonUTF8

I'm trying to find a good Semantic Role Labeling tool that I can use in my Java 
code with NetBeans.
I tried ClearNLP, and testing the installation by following this link produced 
the right output: https://code.google.com/p/clearnlp/wiki/Installation

But when I used the following code:

    /*
     * To change this license header, choose License Headers in Project Properties.
     * To change this template file, choose Tools | Templates
     * and open the template in the editor.
     */
    package stanfordposcode;


    import java.io.BufferedReader;
    import java.io.FileInputStream;
    import java.io.PrintStream;
    import java.util.List;

    import com.googlecode.clearnlp.component.AbstractComponent;
    import com.googlecode.clearnlp.dependency.DEPTree;
    import com.googlecode.clearnlp.engine.EngineGetter;
    import com.googlecode.clearnlp.nlp.NLPDecode;
    import com.googlecode.clearnlp.nlp.NLPLib;
    import com.googlecode.clearnlp.reader.AbstractReader;
    import com.googlecode.clearnlp.segmentation.AbstractSegmenter;
    import com.googlecode.clearnlp.tokenization.AbstractTokenizer;
    import com.googlecode.clearnlp.util.UTInput;
    import com.googlecode.clearnlp.util.UTOutput;


    // Import log4j classes.
    import org.apache.log4j.Logger;
    import org.apache.log4j.BasicConfigurator;

    public class SRL
    {
            final String language = AbstractReader.LANG_EN;
            static Logger logger = Logger.getLogger(SRL.class);

            public SRL(String dictFile, String posModelFile, String depModelFile, String predModelFile, String roleModelFile, String srlModelFile, String inputFile, String outputFile) throws Exception
            {
                    AbstractTokenizer tokenizer  = EngineGetter.getTokenizer(language, new FileInputStream(dictFile));
                    AbstractComponent tagger     = EngineGetter.getComponent(new FileInputStream(posModelFile) , language, NLPLib.MODE_POS);
                    AbstractComponent analyzer   = EngineGetter.getComponent(new FileInputStream(dictFile)     , language, NLPLib.MODE_MORPH);
                    AbstractComponent parser     = EngineGetter.getComponent(new FileInputStream(depModelFile) , language, NLPLib.MODE_DEP);
                    AbstractComponent identifier = EngineGetter.getComponent(new FileInputStream(predModelFile), language, NLPLib.MODE_PRED);
                    AbstractComponent classifier = EngineGetter.getComponent(new FileInputStream(roleModelFile), language, NLPLib.MODE_ROLE);
                    AbstractComponent labeler    = EngineGetter.getComponent(new FileInputStream(srlModelFile) , language, NLPLib.MODE_SRL);

                    AbstractComponent[] components = {tagger, analyzer, parser, identifier, classifier, labeler};

                    String sentence = "I'd like to meet Dr. Choi.";
                    process(tokenizer, components, sentence);
                    process(tokenizer, components, UTInput.createBufferedFileReader(inputFile), UTOutput.createPrintBufferedFileStream(outputFile));
            }

            public void process(AbstractTokenizer tokenizer, AbstractComponent[] components, String sentence)
            {
                    DEPTree tree = NLPDecode.toDEPTree(tokenizer.getTokens(sentence));

                    for (AbstractComponent component : components)
                            component.process(tree);

                    System.out.println(tree.toStringSRL()+"\n");
            }

            public void process(AbstractTokenizer tokenizer, AbstractComponent[] components, BufferedReader reader, PrintStream fout)
            {
                    AbstractSegmenter segmenter = EngineGetter.getSegmenter(language, tokenizer);
                    DEPTree tree;

                    for (List<String> tokens : segmenter.getSentences(reader))
                    {
                            tree = NLPDecode.toDEPTree(tokens);

                            for (AbstractComponent component : components)
                                    component.process(tree);

                            fout.println(tree.toStringSRL()+"\n");
                    }

                    fout.close();
            }

            public static void main(String[] args)
            {
                BasicConfigurator.configure();

                    String dictFile      = "/Users/ha/clearnlp/dictionary-1.3.1.jar"; // e.g., dictionary.zip
                    String posModelFile  = "/Users/ha/clearnlp/ontonotes-en-pos-1.3.0.jar"; // e.g., ontonotes-en-pos.tgz
                    String depModelFile  = "/Users/ha/clearnlp/ontonotes-en-dep-1.3.0.jar"; // e.g., ontonotes-en-dep.tgz
                    String predModelFile = "/Users/ha/clearnlp/ontonotes-en-pred-1.3.0.jar"; // e.g., ontonotes-en-pred.tgz
                    String roleModelFile = "/Users/ha/clearnlp/ontonotes-en-role-1.3.0.jar"; // e.g., ontonotes-en-role.tgz
                    String srlModelFile  = "/Users/ha/clearnlp/ontonotes-en-srl-1.3.0.jar"; // e.g., ontonotes-en-srl.tgz
                    String inputFile     = "/Users/ha/NetBeansProjects/StanfordPOSCode/src/stanfordposcode/input.txt";
                    String outputFile    = "/Users/ha/NetBeansProjects/StanfordPOSCode/src/stanfordposcode/output.txt";

                    try
                    {
                            new SRL(dictFile, posModelFile, depModelFile, predModelFile, roleModelFile, srlModelFile, inputFile, outputFile);
                    }
                    catch (Exception e) {e.printStackTrace();}
            }

    }

I got the following error:

    ........
    13084 [main] INFO com.googlecode.clearnlp.classification.model.StringModel  - Loading model:

    java.lang.NullPointerException
        at com.googlecode.clearnlp.tokenization.EnglishTokenizer.normalizeNonUTF8(EnglishTokenizer.java:362)
        at com.googlecode.clearnlp.tokenization.EnglishTokenizer.getTokenList(EnglishTokenizer.java:111)
        at com.googlecode.clearnlp.tokenization.AbstractTokenizer.getTokens(AbstractTokenizer.java:61)
        at stanfordposcode.SRL.process(SRL.java:54)
        at stanfordposcode.SRL.<init>(SRL.java:48)
        at stanfordposcode.SRL.main(SRL.java:95)
    BUILD SUCCESSFUL (total time: 18 seconds)


I already added all the jar files:

http://i.stack.imgur.com/cIECT.png

How can I solve this error?
And is there a better SRL tool that I can use?

Thanks in advance

Original issue reported on code.google.com by [email protected] on 15 Jan 2015 at 3:58

Adding personal file for training tokenizer

What steps will reproduce the problem?
While training the model, you are using a set of input files (abbreviations, 
compounds, etc.).
Can we use a set of our own dictionary files for training the model?
For example, we have a set of terms from the medical or legal field, and I want to tokenize each of those terms as a single token, e.g. "law maker".


Can you please suggest the correct process for this?
What is the expected output? What do you see instead?


What version of the product are you using? On what operating system?


Please provide any additional information below.


Original issue reported on code.google.com by [email protected] on 7 Oct 2013 at 1:40
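
One way to approximate this without retraining is to merge dictionary terms after tokenization. The sketch below is an assumption about a possible workaround, not ClearNLP's own mechanism. `MultiwordMerger` and `merge` are hypothetical names, matching is case-sensitive, and only two-word terms are handled.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class MultiwordMerger {
    /** Merges adjacent token pairs that form a known two-word term into one token. */
    static List<String> merge(List<String> tokens, Set<String> terms) {
        List<String> merged = new ArrayList<>();
        int i = 0;
        while (i < tokens.size()) {
            if (i + 1 < tokens.size()) {
                String pair = tokens.get(i) + " " + tokens.get(i + 1);
                if (terms.contains(pair)) {
                    merged.add(pair);
                    i += 2; // skip both halves of the merged term
                    continue;
                }
            }
            merged.add(tokens.get(i));
            i++;
        }
        return merged;
    }

    public static void main(String[] args) {
        // Hypothetical domain dictionary and tokenizer output.
        Set<String> terms = new HashSet<>(Arrays.asList("law maker"));
        List<String> tokens = Arrays.asList("The", "law", "maker", "spoke", ".");
        System.out.println(merge(tokens, terms));
    }
}
```

Merging after tokenization leaves the tokenizer and its models unchanged. Downstream components such as the POS tagger would then receive tokens containing spaces, which they may not have seen in training.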

error in tokenization

Hi,

I am using the clearNLP APIs for tokenization. I am following the example code 
given by you 
(https://github.com/clearnlp/clearnlp-demo/blob/master/src/main/java/com/clearnlp/demo/DemoNLPDecode.java), 
but I am getting this error when I try to initialize "tokenizer". Here are the 
details:

================

String text = "here goes my text. Let's see how well does it perform";
String language = AbstractReader.LANG_EN;
AbstractTokenizer clearNLPTokenizer = NLPGetter.getTokenizer(language);
String modelType  = "general-en";
List<String> tokens = clearNLPTokenizer.getTokens(text);

But I get an error on line 3:

Exception in thread "main" java.lang.UnsupportedClassVersionError: com/clearnlp/nlp/NLPGetter : Unsupported major.minor version 51.0

at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClassCond(ClassLoader.java:637)
at java.lang.ClassLoader.defineClass(ClassLoader.java:621)
at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:141)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:283)
at java.net.URLClassLoader.access$000(URLClassLoader.java:58)
at java.net.URLClassLoader$1.run(URLClassLoader.java:197)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
at java.lang.ClassLoader.loadClass(ClassLoader.java:247)

==================

I have included all the jar files provided here 
(http://clearnlp.wikispaces.com/file/detail/clearnlp-lib-2.0.2.tgz), and I 
have also included the dictionary jar.
I must be making some mistake in using clearNLP. Please help me out.

Original issue reported on code.google.com by [email protected] on 18 Mar 2014 at 12:34
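
The message "Unsupported major.minor version 51.0" means the class was compiled for Java 7 (class-file major version 51) and is being loaded by an older JVM. Running on Java 7 or later would likely resolve it, although the report does not confirm this. The sketch below shows the version arithmetic and prints what the current JVM supports. `ClassVersionSketch` and `javaReleaseFor` are hypothetical names.

```java
final class ClassVersionSketch {
    /**
     * Maps a class-file major version to the Java release that introduced it.
     * Valid from major version 49 (Java 5) onward, where release = major - 44.
     */
    static int javaReleaseFor(int classFileMajor) {
        if (classFileMajor < 49) {
            throw new IllegalArgumentException(
                    "Mapping only covers major versions 49 and above: " + classFileMajor);
        }
        return classFileMajor - 44;
    }

    public static void main(String[] args) {
        System.out.println("major version 51 requires Java " + javaReleaseFor(51));
        // Standard system properties describing the JVM running this code.
        System.out.println("running java.version       = " + System.getProperty("java.version"));
        System.out.println("running java.class.version = " + System.getProperty("java.class.version"));
    }
}
```

If the printed `java.class.version` is below 51.0, the JVM cannot load classes compiled for Java 7.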
