opennlp's Introduction

OpenNlp

OpenNlp is an open source library for Natural Language Processing (NLP). It provides a number of NLP tools in C#:

  • sentence splitter
  • tokenizer
  • part-of-speech tagger
  • chunker
  • coreference
  • named entity recognition
  • parse trees

This project started as a C# port of the Java OpenNLP tools (the initial code was retrieved from http://sharpnlp.codeplex.com). It was moved to GitHub to improve the code (add new features and fix detected bugs) and to create a NuGet package.

You can install this library via NuGet:

Install-Package OpenNLP

For use with .NET Core applications, the System.Runtime.Caching NuGet package is also required for full functionality:

Install-Package System.Runtime.Caching

Quick start

To easily test the various NLP tools, run the ToolsExample WinForms project. Below you'll find a more detailed description of the tools, with code snippets showing how to use them directly in your code. All NLP tools based on the maxent algorithm need model files to run. You'll find those files for English in Resources/Models. If you want to train your own models (to improve precision on English or to use those tools on other languages), please refer to the last section.

Sentence splitter

A sentence splitter splits a paragraph into sentences. Technically, the sentence detector computes the likelihood that a specific character ('.', '?' or '!' in the case of English) marks the end of a sentence.

var paragraph = "Mr. & Mrs. Smith is a 2005 American romantic comedy action film. The film stars Brad Pitt and Angelina Jolie as a bored upper-middle class married couple. They are surprised to learn that they are both assassins hired by competing agencies to kill each other.";
var modelPath = "path/to/EnglishSD.nbin";
var sentenceDetector = new EnglishMaximumEntropySentenceDetector(modelPath);
var sentences = sentenceDetector.SentenceDetect(paragraph);
/* 
 * sentences = ["Mr. & Mrs. Smith is a 2005 American romantic comedy action film.", 
 * "The film stars Brad Pitt and Angelina Jolie as a bored upper-middle class married couple.", 
 * "They are surprised to learn that they are both assassins hired by competing agencies to kill each other."]
 */

Tokenizer

A tokenizer breaks a text into words, symbols or other meaningful elements. The historical tokenizers are based on the maxent algorithm.

// Regular tokenizer
var modelPath = "path/to/EnglishTok.nbin";
var sentence = "- Sorry Mrs. Hudson, I'll skip the tea.";
var tokenizer = new EnglishMaximumEntropyTokenizer(modelPath);
var tokens = tokenizer.Tokenize(sentence);
// tokens = ["-", "Sorry", "Mrs.", "Hudson", ",", "I", "'ll", "skip", "the", "tea", "."]

For English, a specific rule-based tokenizer (based on regexes) was created and has better precision. This tokenizer doesn't need any model.

// English tokenizer
var tokenizer = new EnglishRuleBasedTokenizer();
var sentence = "- Sorry Mrs. Hudson, I'll skip the tea.";
var tokens = tokenizer.Tokenize(sentence);
// tokens = ["-", "Sorry", "Mrs.", "Hudson", ",", "I", "'ll", "skip", "the", "tea", "."]

Part-of-speech tagger

A part-of-speech tagger assigns a part of speech (noun, verb, etc.) to each token in a sentence.

var modelPath = "path/to/EnglishPOS.nbin";
var tagDictDir = "path/to/tagdict/directory";
var posTagger = new EnglishMaximumEntropyPosTagger(modelPath, tagDictDir);
var tokens = new[] { "-", "Sorry", "Mrs.", "Hudson", ",", "I", "'ll", "skip", "the", "tea", "." };
var pos = posTagger.Tag(tokens);
// pos = [":", "NNP", "NNP", "NNP", ".", "PRP", "MD", "VB", "DT", "NN", "."]

For the full list of part-of-speech abbreviations, please refer to the Penn Treebank Project.

Chunker

A chunker is an alternative to a full sentence parser: it gives the partial syntactic structure of a sentence (for instance, the noun/verb groups).

var modelPath = "path/to/EnglishChunk.nbin";
var chunker = new EnglishTreebankChunker(modelPath);
var tokens = new[] { "-", "Sorry", "Mrs.", "Hudson", ",", "I", "'ll", "skip", "the", "tea", "." };
var pos = new[] { ":", "NNP", "NNP", "NNP", ".", "PRP", "MD", "VB", "DT", "NN", "." };
var chunks = chunker.GetChunks(tokens, pos);
// chunks = [["NP", "- Sorry Mrs. Hudson"], [",", ","], ["NP", "I"], ["VP", "'ll skip"], ["NP", "the tea"], [".", "."]]

Coreference

Coreference resolution detects all expressions that refer to the same entity in a text.

var modelPath = "path/to/coref/dir";
var coreferenceFinder = new TreebankLinker(modelPath);
var sentences = new[] {
	"Mr. & Mrs. Smith is a 2005 American romantic comedy action film.",
	"The film stars Brad Pitt and Angelina Jolie as a bored upper-middle class married couple.",
	"They are surprised to learn that they are both assassins hired by competing agencies to kill each other."
};
var coref = coreferenceFinder.GetCoreferenceParse(sentences);
// coref = 

Named entity recognition

Named entity recognition identifies specific entities in sentences. With the current models, you can detect dates, locations, money, organizations, percentages, persons and times.

var modelPath = "path/to/namefind/dir";
var nameFinder = new EnglishNameFinder(modelPath);
var sentence = "Mr. & Mrs. Smith is a 2005 American romantic comedy action film.";
// specify which types of entities you want to detect
var models = new[] { "date", "location", "money", "organization", "percentage", "person", "time" };
var ner = nameFinder.GetNames(models, sentence);
// ner = Mr. & Mrs. <person>Smith</person> is a <date>2005</date> American romantic comedy action film.

Parse tree

A parser gives the full syntactic structure of a sentence.

var modelPath = "path/to/models/dir";
var sentence = "- Sorry Mrs. Hudson, I'll skip the tea.";
var parser = new EnglishTreebankParser(modelPath);
var parse = parser.DoParse(sentence);
// parse = (TOP (S (NP (: -) (NNP Sorry) (NNP Mrs.) (NNP Hudson)) (, ,) (NP (PRP I)) (VP (MD 'll) (VP (VB skip) (NP (DT the) (NN tea)))) (. .)))

Train your models

The provided models are general models for English. If you need these tools for other languages or for a specialized English corpus, you can train your own models. To do so, you'll need examples; for instance, for sentence detection you'll need a large number of paragraphs with the sentences appropriately delimited.

// The file with the training samples; works also with an array of files
var trainingFile  = "path/to/training/file";
// The number of iterations; no general rule for finding the best value, just try several!
var iterations = 5;
// The cut; no general rule for finding the best value, just try several!
var cut = 2;
// The characters which can mark an end of sentence
var endOfSentenceScanner = new CharactersSpecificEndOfSentenceScanner('.', '?', '!', '"', '-', '…');
// Train the model (can take some time depending on your training file size)
var model = MaximumEntropySentenceDetector.TrainModel(trainingFile, iterations, cut, endOfSentenceScanner);
// Persist the model to use it later
var outputFilePath = "path/to/persisted/model";
new BinaryGisModelWriter().Persist(model, outputFilePath);

opennlp's People

Contributors

adambeddoe, alexpoint, emmagarland, hirse, icepear-jzx, r3db, tobymnelson, vishal-bold

opennlp's Issues

German examples

Is there any site that provides example code for German and says which trained models to use?

The following site provides Java examples:
http://gromgull.net/blog/2010/01/noun-phrase-chunking-for-the-awful-german-language/

java -cp $CP opennlp.tools.lang.german.SentenceDetector models/german/sentdetect/sentenceModel.bin.gz |
java -cp $CP opennlp.tools.lang.german.Tokenizer models/german/tokenizer/tokenModel.bin.gz |
java -cp $CP -Xmx100m opennlp.tools.lang.german.PosTagger models/german/postag/posModel.bin.gz |
java -cp $CP opennlp.tools.lang.english.TreebankChunker models/german/chunking/GermanChunk.bin.gz

How could we do that using OpenNLP.NET?

Special characters

I am experiencing word splits in weird places when the sentence includes characters like commas, colons, semi-colons, and slashes.

Example:
"As a Hydra Store Visitor, I want to see the latest version."

Becomes:
"As a Hydra Store Vis i t or , I want to see the latest version."

I noticed that during TokenizePositions, the last token in the first phrase ("Visitor,") does not pass the AlphaNumeric.IsMatch(token) check because the token ends with a comma, so the model evaluator takes over and splits up the string in strange ways. Am I doing something wrong?

Please advise.
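
The behaviour described in this issue can be illustrated with a small sketch. The pattern below is only an assumed stand-in for the library's AlphaNumeric check (the real regex may differ), and SplitTrailingPunctuation is a hypothetical helper, not part of OpenNLP. It shows why "Visitor," fails a purely alphanumeric test and one way to peel off the trailing punctuation before tokenizing.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TrailingPunctuation
{
    // Assumption: a stand-in for the library's AlphaNumeric check; the real pattern may differ.
    public static readonly Regex AlphaNumeric = new Regex("^[A-Za-z0-9]+$");

    // Hypothetical helper: splits "Visitor," into "Visitor" and ",".
    // Tokens without trailing ',', ';', ':' or '/' are returned unchanged.
    public static string[] SplitTrailingPunctuation(string token)
    {
        var match = Regex.Match(token, @"^(.*[A-Za-z0-9])([,;:/]+)$");
        return match.Success
            ? new[] { match.Groups[1].Value, match.Groups[2].Value }
            : new[] { token };
    }
}
```

With this pattern, AlphaNumeric.IsMatch("Visitor") is true while AlphaNumeric.IsMatch("Visitor,") is false. For English text, the README's EnglishRuleBasedTokenizer, which needs no model, may be the simpler option.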

A number of issues

Coreference
This doesn't work unless I copy imodel_nr.nbin to my root, and even then your example returns no results.

Part-of-speech tagger
This keeps saying: Access to the path '[MYPATH]\Resources\WordNet\dict' is denied.

What is the LICENSE for OpenNLP?

Hi Alex,

I am a developer who wants to use the OpenNLP in one of my projects, but cannot find any LICENSE file in the project.

Can you kindly tell me which license it is using now? Or add a license file to the project so that other users can also distribute their software in a correct way.

And maybe a useful link here, to make the decision easier: http://choosealicense.com/

Thanks

No head rule defined for INC

Not sure what the proper fix is exactly, but for sentence fragments, occasionally I get this error - No head rule defined for INC using in INC-244

There are 2 spaces after using because this.getClass() is commented out

How to train own model for NER?

Does anyone know how to train a custom NER model? I have tried to train my own, but training runs out of memory. Does anyone know which values I should use for the cutoff and the number of iterations?

Sentence splitting not splitting as expected

Hi, I'm finding some unexpected behaviour with the default model and the SentenceDetect method, using the EnglishMaximumEntropySentenceDetector.

My first input is:

  • This is a sentence without spaces.Should be split into three?Why not?

I would expect a list of strings like:

  • "This is a sentence without spaces."
  • "Should be split into three?"
  • "Why not?"

The results come out as:

  • "This is a sentence without spaces.Should"
  • "be split into three?Why"
  • "not?"

The same happens for a second input without spaces after the punctuation:

  • This is a sentence without spaces.Should be split into three.It is split in the wrong places.

I would expect a list of strings like:

  • "This is a sentence without spaces."
  • "Should be split into three."
  • "It is split in the wrong places."

The results come out as:

  • "This is a sentence without spaces.Should"
  • "be split into three.It is split in the wrong places."

The sentence detector finds the end positions at the punctuation characters (33, 60 and 68 for the first example), but the FirstWhiteSpace method returns 40 because there is no white space directly after the punctuation, as it would expect.

Is there something I am missing that I need to specify or train for this to work? Or do I have to sanitise the input first by making sure any special character has a space after it?

If I change the input to "This is a sentence without spaces. Should be split into three? Why not?" then it works and splits as expected.

Thanks!
Emma
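
The sanitising workaround raised in this issue could look roughly like the sketch below. It is a heuristic, not a fix in the library: SentenceInputSanitizer and AddMissingSpaces are hypothetical names, and the rule (insert a space when '.', '?' or '!' is directly followed by an uppercase letter) is an assumption about what the input needs.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SentenceInputSanitizer
{
    // Zero-width match: the position between '.', '?' or '!' and an uppercase letter.
    private static readonly Regex MissingSpace = new Regex(@"(?<=[.?!])(?=\p{Lu})");

    // Inserts a single space at every such position and leaves the rest of the text untouched.
    public static string AddMissingSpaces(string text)
    {
        return MissingSpace.Replace(text, " ");
    }
}
```

The sanitised string can then be passed to SentenceDetect as usual. Note that the heuristic also touches abbreviations such as "U.S.A" or "Mr.Smith", so it is only suitable when such cases are rare in the input.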

NER new entity

Hello, this is great!!!

Can you please explain how to introduce a new entity? For example, if I am using NER and want to extract prescription names from text, do I train a new model, or how do I go about creating that entity?

Please share some guidance/snippet.

Calculating SynsetOffset or reading data file has problem in SharpWordNet

var engine = new DataFileEngine(@"C:\Users\Ozgur_\Source\Repos\OpenNlp\Resources\WordNet\dict");
var synsets = engine.GetSynsets("apple");

When these two lines of code are executed, DataFileEngine.cs line 283,
var nt = int.Parse(tokenizer.NextToken());
tries to parse "n" as an integer, because the next token after "35" in "35 n 0000 | a hamburger with melted cheese on it" is "n".

I believe the line dataFile.BaseStream.Seek(synsetOffset, SeekOrigin.Begin); at DataFileEngine.cs line 275 is miscalculating the offset of the word. Since it is a byte offset rather than a line offset, it may be calculated incorrectly.
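
To illustrate the byte-offset concern in isolation, here is a minimal sketch that does not reproduce SharpWordNet's DataFileEngine. ByteOffsetReader and ReadLineAt are hypothetical names. It seeks a StreamReader's underlying stream to a byte offset and reads the line that starts there; any change to the file's bytes, for example line endings converted from LF to CRLF, shifts every offset. Whether that is the cause of this issue is an assumption, not something the report confirms.

```csharp
using System;
using System.IO;
using System.Text;

public static class ByteOffsetReader
{
    // Reads the line that starts at the given byte offset of an ASCII file.
    public static string ReadLineAt(string path, long byteOffset)
    {
        using (var reader = new StreamReader(path, Encoding.ASCII))
        {
            reader.BaseStream.Seek(byteOffset, SeekOrigin.Begin);
            // Drop anything the reader may have buffered before the seek.
            reader.DiscardBufferedData();
            return reader.ReadLine();
        }
    }
}
```

For a file containing "first line\nsecond line\n", offset 11 yields "second line". If the same file is saved with "\r\n" line endings, offset 11 no longer points at the start of the second line, which is the kind of mismatch that would make a parser read the wrong token.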

Signed package?

Any possibility of releasing a strong name-signed version of the NuGet package? I currently get the error "Assembly generation failed -- Referenced assembly 'OpenNLP' does not have a strong name"

SharpWordNet reference error

When I download the NuGet package, the only libraries it contains are OpenNLP.dll and SharpEntropy.dll. When I try to use these in Unity, I get the error "unable to resolve reference SharpWordNet". Is a SharpWordNet.dll missing?

Error in ParseTree project under Visual Studio 2019

When I run the ParseTree project, I get an error in the DrawTree method of LithiumControl.cs on the following line:
p = new Point(graphAbstract.Root.X, graphAbstract.Root.Y);
because graphAbstract.Root == null. The error message is:
System.NullReferenceException: 'Object reference not set to an instance of an object.'
Netron.Lithium.GraphAbstract.Root.get returned null.
This does not happen when I run the project under Visual Studio 2015.

EntropyNameFinder.TrainModel Syntax for training custom model

Dear @AlexPoint

I am using the below code to train a custom model for my case. I can see the model being written but I am not sure if the syntax of the text file is correct.

I use the below format in the training file:
PANEL NAME: <NAME>MDB</NAME>
Where NAME is the named entity

var bestmodel = OpenNLP.Tools.NameFind.MaximumEntropyNameFinder.TrainModel(EntityExctractor_trainingFile, 5, 2);
SharpEntropy.IO.BinaryGisModelWriter modelwriter = new SharpEntropy.IO.BinaryGisModelWriter();
modelwriter.Persist(bestmodel, EntityExctractor_outputFilePath);

Any pointers on how to proceed?

Name mismatch

in OpenNLP/Tools/Coreference/Resolver/IsAResolver.cs
I think that in lines 51 and 58 modelName should be without the /, so only "imodel".

How to generate a Tag Dictionary?

I am using the following code to train a POS model. The question is how to generate the tag dictionary that is required later to use the model.

        var trainingFile = "..";
        // The number of iterations; no general rule for finding the best value, just try several!
        var iterations = 5;
        // The cut; no general rule for finding the best value, just try several!
        var cut = 2;
        // Train the model (can take some time depending on your training file size)
        var model = MaximumEntropyPosTagger.TrainModel(trainingFile, iterations, cut); 
        // Persist the model to use it later
        var outputFilePath = @"...";
        new BinaryGisModelWriter().Persist(model, outputFilePath);

Bug in WordNetDictionary.cs Line 78

The "N" should be a "V" on line 78

            string partOfSpeech;
            if (tag.StartsWith("N") || tag.StartsWith("n"))
            {
                partOfSpeech = "noun";
            }
            // Line 78 as currently written: the first "N" should be "V"
            else if (tag.StartsWith("N") || tag.StartsWith("v"))
            {
                partOfSpeech = "verb";
            }
            else if (tag.StartsWith("J") || tag.StartsWith("a"))
            {
                partOfSpeech = "adjective";
            }
            else if (tag.StartsWith("R") || tag.StartsWith("r"))
            {
                partOfSpeech = "adverb";
            }
            else
            {
                partOfSpeech = "noun";
            }
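
For comparison, a self-contained sketch of the mapping with the proposed correction applied might look like this. PennTagMapper and ToWordNetPartOfSpeech are hypothetical names used for illustration only; this is not the actual code of WordNetDictionary.cs.

```csharp
using System;

public static class PennTagMapper
{
    // Maps a tag to a WordNet part of speech, with "V" (not "N") selecting verbs.
    public static string ToWordNetPartOfSpeech(string tag)
    {
        if (StartsWithAny(tag, "N", "n")) return "noun";
        if (StartsWithAny(tag, "V", "v")) return "verb";
        if (StartsWithAny(tag, "J", "a")) return "adjective";
        if (StartsWithAny(tag, "R", "r")) return "adverb";
        return "noun"; // same fallback as in the snippet quoted in the issue
    }

    // Ordinal comparison avoids culture-dependent surprises with StartsWith.
    private static bool StartsWithAny(string tag, string first, string second)
    {
        return tag.StartsWith(first, StringComparison.Ordinal)
            || tag.StartsWith(second, StringComparison.Ordinal);
    }
}
```

With the correction, a Penn Treebank tag such as "VB" maps to "verb" instead of falling through to the final "noun" branch.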

.NET Standard support

I am currently using this package in an existing project (.NET Framework).
We recently moved the project to a .NET Standard environment, in which I cannot use this package.

Is it possible to support .NET Standard in this package?

Named entity recognition not working correctly on the second call

When calling nameFinder.GetNames(models, "in 2005"); a second time, the tags are missing or the whole tagging is corrupted.
The first call works fine, and a second call with a totally different input works in most cases, but not with the same or a similar sentence!

A dirty workaround is to recreate the Beam in the Find method of the MaximumEntropyNameFinder.

MaximumEntropyNameFinder.cs:
public virtual string[] Find(string[] tokens, IDictionary previousTags)
{
       //Dirty hack to fix an error for a repeated call 
       mBeam = new NameBeamSearch(this, mBeamSize, mContextGenerator, mModel, mBeamSize); 

       mBestSequence = mBeam.BestSequence(tokens, new object[]{previousTags});
       return mBestSequence.Outcomes.ToArray();
}

Maybe you can correct the issue; I do not really understand what went wrong.
