alexpoint / opennlp

Open source NLP tools (sentence splitter, tokenizer, chunker, coref, NER, parse trees, etc.) in C#

License: MIT License

C# 99.92% Roff 0.08%

opennlp's Introduction

OpenNlp

OpenNlp is an open source library for Natural Language Processing (NLP). It provides a number of NLP tools in C#:

sentence splitter
tokenizer
part-of-speech tagger
chunker
coreference
named entity recognition
parse trees

This project started as a C# port of the Java OpenNLP tools (the initial code was retrieved from http://sharpnlp.codeplex.com). It was moved to GitHub to improve the code (add new features and fix detected bugs) and to create a NuGet package.

You can install this library via NuGet:

Install-Package OpenNLP

For use with .NET Core applications, the System.Runtime.Caching NuGet package is also required for full functionality:

Install-Package System.Runtime.Caching

Quick start

To easily test the various NLP tools, run the ToolsExample WinForms project. Below you'll find a more detailed description of the tools, with code snippets showing how to use them directly in your code. All NLP tools based on the maxent algorithm need model files to run. You'll find those files for English in Resources/Models. If you want to train your own models (to improve precision on English or to use those tools on other languages), please refer to the last section.

Sentence splitter

A sentence splitter splits a paragraph into sentences. Technically, the sentence detector computes the likelihood that a specific character ('.', '?' or '!' in the case of English) marks the end of a sentence.

var paragraph = "Mr. & Mrs. Smith is a 2005 American romantic comedy action film. The film stars Brad Pitt and Angelina Jolie as a bored upper-middle class married couple. They are surprised to learn that they are both assassins hired by competing agencies to kill each other.";
var modelPath = "path/to/EnglishSD.nbin";
var sentenceDetector = new EnglishMaximumEntropySentenceDetector(modelPath);
var sentences = sentenceDetector.SentenceDetect(paragraph);
/* 
 * sentences = ["Mr. & Mrs. Smith is a 2005 American romantic comedy action film.", 
 * "The film stars Brad Pitt and Angelina Jolie as a bored upper-middle class married couple.", 
 * "They are surprised to learn that they are both assassins hired by competing agencies to kill each other."]
 */

Tokenizer

A tokenizer breaks a text into words, symbols or other meaningful elements. The historical tokenizers are based on the maxent algorithm.

// Regular tokenizer
var modelPath = "path/to/EnglishTok.nbin";
var sentence = "- Sorry Mrs. Hudson, I'll skip the tea.";
var tokenizer = new EnglishMaximumEntropyTokenizer(modelPath);
var tokens = tokenizer.Tokenize(sentence);
// tokens = ["-", "Sorry", "Mrs.", "Hudson", ",", "I", "'ll", "skip", "the", "tea", "."]

For English, a specific rule-based tokenizer (based on regexes) was created; it has better precision and doesn't need any model.

// English tokenizer
var tokenizer = new EnglishRuleBasedTokenizer();
var sentence = "- Sorry Mrs. Hudson, I'll skip the tea.";
var tokens = tokenizer.Tokenize(sentence);
// tokens = ["-", "Sorry", "Mrs.", "Hudson", ",", "I", "'ll", "skip", "the", "tea", "."]
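To illustrate the idea of rule-based tokenization, here is a minimal, self-contained sketch built on a single regular expression. It is not the implementation of EnglishRuleBasedTokenizer: the class name RegexTokenizerSketch and the short abbreviation and clitic lists are assumptions made for this example only.

```csharp
using System.Linq;
using System.Text.RegularExpressions;

public static class RegexTokenizerSketch
{
    // Alternatives are tried in order: a few known abbreviations, clitics such as 'll,
    // runs of word characters, then any single non-space symbol.
    private static readonly Regex TokenPattern = new Regex(
        @"(?:Mr|Mrs|Ms|Dr)\.|'(?:ll|re|ve|s|d|m)\b|\w+|[^\w\s]");

    public static string[] Tokenize(string sentence)
    {
        return TokenPattern.Matches(sentence)
            .Cast<Match>()
            .Select(match => match.Value)
            .ToArray();
    }
}
```

On the sample sentence "- Sorry Mrs. Hudson, I'll skip the tea." this sketch yields the same eleven tokens as listed above. It does not handle contractions such as "n't" or abbreviations outside its hard-coded list, which is where a fuller rule set or a trained model becomes necessary.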

Part-of-speech tagger

A part-of-speech tagger assigns a part of speech (noun, verb, etc.) to each token in a sentence.

var modelPath = "path/to/EnglishPOS.nbin";
var tagDictDir = "path/to/tagdict/directory";
var posTagger = new EnglishMaximumEntropyPosTagger(modelPath, tagDictDir);
var tokens = new[] { "-", "Sorry", "Mrs.", "Hudson", ",", "I", "'ll", "skip", "the", "tea", "." };
var pos = posTagger.Tag(tokens);
// pos = [":", "NNP", "NNP", "NNP", ".", "PRP", "MD", "VB", "DT", "NN", "."]

For the full list of part-of-speech abbreviations, please refer to the Penn Treebank Project.

Chunker

A chunker is an alternative to a full sentence parser; it gives the partial syntactic structure of a sentence (for instance the noun/verb groups).

var modelPath = "path/to/EnglishChunk.nbin";
var chunker = new EnglishTreebankChunker(modelPath);
var tokens = new[] { "-", "Sorry", "Mrs.", "Hudson", ",", "I", "'ll", "skip", "the", "tea", "." };
var pos = new[] { ":", "NNP", "NNP", "NNP", ".", "PRP", "MD", "VB", "DT", "NN", "." };
var chunks = chunker.GetChunks(tokens, pos);
// chunks = [["NP", "- Sorry Mrs. Hudson"], [",", ","], ["NP", "I"], ["VP", "'ll skip"], ["NP", "the tea"], [".", "."]]

Coreference

Coreference detects all expressions that refer to the same entities in a text.

var modelPath = "path/to/coref/dir";
var coreferenceFinder = new TreebankLinker(modelPath);
var sentences = new[] {
	"Mr. & Mrs. Smith is a 2005 American romantic comedy action film.",
	"The film stars Brad Pitt and Angelina Jolie as a bored upper-middle class married couple.",
	"They are surprised to learn that they are both assassins hired by competing agencies to kill each other."
};
var coref = coreferenceFinder.GetCoreferenceParse(sentences);
// coref =

Named entity recognition

Named entity recognition identifies specific entities in sentences. With the current models, you can detect persons, organizations, dates, locations, money, percentages and times.

var modelPath = "path/to/namefind/dir";
var nameFinder = new EnglishNameFinder(modelPath);
var sentence = "Mr. & Mrs. Smith is a 2005 American romantic comedy action film.";
// specify which types of entities you want to detect
var models = new[] { "date", "location", "money", "organization", "percentage", "person", "time" };
var ner = nameFinder.GetNames(models, sentence);
// ner = Mr. & Mrs. <person>Smith</person> is a <date>2005</date> American romantic comedy action film.

Parse tree

A parser gives the full syntactic structure of a sentence.

var modelPath = "path/to/models/dir";
var sentence = "- Sorry Mrs. Hudson, I'll skip the tea.";
var parser = new EnglishTreebankParser(modelPath);
var parse = parser.DoParse(sentence);
// parse = (TOP (S (NP (: -) (NNP Sorry) (NNP Mrs.) (NNP Hudson)) (, ,) (NP (PRP I)) (VP (MD 'll) (VP (VB skip) (NP (DT the) (NN tea)))) (. .)))

Train your models

The provided models are general models for English. If you need those tools for other languages or for a specialized English corpus, you can train your own models. To do so, you'll need examples; for sentence detection, for instance, you'll need a large number of paragraphs with the sentences appropriately delimited.

// The file with the training samples; works also with an array of files
var trainingFile = "path/to/training/file";
// The number of iterations; no general rule for finding the best value, just try several!
var iterations = 5;
// The cut; no general rule for finding the best value, just try several!
var cut = 2;
// The characters which can mark an end of sentence
var endOfSentenceScanner = new CharactersSpecificEndOfSentenceScanner('.', '?', '!', '"', '-', '…');
// Train the model (can take some time depending on your training file size)
var model = MaximumEntropySentenceDetector.TrainModel(trainingFile, iterations, cut, endOfSentenceScanner);
// Persist the model to use it later
var outputFilePath = "path/to/persisted/model";
new BinaryGisModelWriter().Persist(model, outputFilePath);

opennlp's Issues

German examples

Is there any site that provides example code for German and indicates which trained models to use?

The following site provides Java examples.
http://gromgull.net/blog/2010/01/noun-phrase-chunking-for-the-awful-german-language/

java -cp $CP opennlp.tools.lang.german.SentenceDetector \
    models/german/sentdetect/sentenceModel.bin.gz |
java -cp $CP opennlp.tools.lang.german.Tokenizer \
    models/german/tokenizer/tokenModel.bin.gz |
java -cp $CP -Xmx100m opennlp.tools.lang.german.PosTagger \
    models/german/postag/posModel.bin.gz |
java -cp $CP opennlp.tools.lang.english.TreebankChunker \
    models/german/chunking/GermanChunk.bin.gz

How could we do that using OpenNLP.NET?

Special characters

I am experiencing word splits in weird places when the sentence includes characters like commas, colons, semi-colons, and slashes.

Example:
"As a Hydra Store Visitor, I want to see the latest version."

Becomes:
"As a Hydra Store Vis i t or , I want to see the latest version."

I noticed that during TokenizePositions, the last token in the first phrase ("Visitor,") does not pass the AlphaNumeric.IsMatch(token) check because the token has a trailing character, i.e. a comma, so the model evaluator takes over and splits up the string in strange ways. Am I doing something wrong?

Please advise.

Using other languages' bin files

Hello! Can I use the models from http://opennlp.sourceforge.net/models-1.5/ with this library?
They are bin files for other languages; can they be used as nbin files?

A number of issues

Coreference
This doesn't work unless I copy imodel_nr.nbin to my root; even then, your example returns no results.

Part-of-speech tagger
This keeps saying: Access to the path '[MYPATH]\Resources\WordNet\dict' is denied.

What is the LICENSE for OpenNLP?

Hi Alex,

I am a developer who wants to use the OpenNLP in one of my projects, but cannot find any LICENSE file in the project.

Can you kindly tell me which license it is using now? Or add a license file to the project so that other users can also distribute their software in a correct way.

And maybe a useful link here, to make the decision easier: http://choosealicense.com/

Thanks

What version of OpenNLP is this?

Is this a port of v1.3.4 of OpenNLP? Or is 1.3.4 the version of this project itself?

Dotnet Core support?

Is it possible to port this to Dotnet Core?

No head rule defined for INC

Not sure what the proper fix is exactly, but for sentence fragments, occasionally I get this error - No head rule defined for INC using in INC-244

There are 2 spaces after using because this.getClass() is commented out

How to train own model for NER?

Does anyone know how to train a custom NER model?

Sorry, the question should be:
Does anyone know how to train a custom NER model? I have tried to train my own, but I reached a stage where it ran out of memory. Does anyone know the optimal values to configure for the cutoff and the number of iterations?

Sentence splitting not splitting as expected

Hi, I'm seeing some unexpected behaviour with the default model and the SentenceDetect method, using the EnglishMaximumEntropySentenceDetector.

My first input is:

This is a sentence without spaces.Should be split into three?Is not?

I would expect a list of strings like:

"This is a sentence without spaces."
"Should be split into three?"
"Is not?"

The results come out as:

"This is a sentence without spaces.Should"
"be split into three.It is split in the wrong places."

Same for a sentence without many spaces:

This is a sentence without spaces.Should be split into three.It is split in the wrong places.

I would expect a list of strings like:

"This is a sentence without spaces.",
"Should be split into three.",
"It is split in the wrong places."

The results come out as:

"This is a sentence without spaces.Should"
"be split into three?Why"
"not?"

The sentence finds the end positions with the characters (33, 60 and 68 for the first example), but the FirstWhiteSpace method finds 40 as there are no white spaces after the punctuation as you would expect.

Is there something I am missing that I need to specify or train for this to work? Or do I have to sanitise the input first by making sure any special character has a space after it?

If I change the input to "This is a sentence without spaces. Should be split into three? Is not?" then it works and splits as expected.

Thanks!
Emma
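The input-sanitising workaround raised in this issue can be sketched as follows. This is a hypothetical pre-processing helper, not part of the library: the class name SentenceInputSanitizer and the rule it applies (insert a space between '.', '?' or '!' and a directly following uppercase letter) are assumptions made for illustration.

```csharp
using System.Text.RegularExpressions;

public static class SentenceInputSanitizer
{
    // Matches the empty position between an end-of-sentence character
    // and an uppercase letter that follows it without any whitespace.
    private static readonly Regex MissingSpace = new Regex(@"(?<=[.?!])(?=\p{Lu})");

    public static string InsertMissingSpaces(string text)
    {
        // Each zero-width match is replaced by a single space.
        return MissingSpace.Replace(text, " ");
    }
}
```

The sanitised string can then be passed to SentenceDetect. Note the trade-off: this rule also splits acronyms such as "U.S.A" into "U. S. A", so it only suits text where such tokens are rare.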

Model Convertor failing

https://www.codeproject.com/articles/12109/statistical-parsing-of-english-sentences?display=print&fid=229482&df=90&mpp=25&sort=Position&view=Normal&spc=Relaxed&fr=101&prof=True
It fails to convert the .bin models from https://opennlp.sourceforge.net/models-1.5/

NER new entity

Hello, this is great!!!

Can you please explain how to introduce a new entity? For example, if I were using NER and wanted to extract prescription names from the text, would I train a new model, or how would I go about creating that as an entity?

Please share some guidance/snippet.

Calculating SynsetOffset or reading data file has problem in SharpWordNet

var engine = new DataFileEngine(@"C:\Users\Ozgur_\Source\Repos\OpenNlp\Resources\WordNet\dict");
var synsets = engine.GetSynsets("apple");

When these two lines of code are executed, DataFileEngine.cs line 283
var nt = int.Parse(tokenizer.NextToken()); tries to parse "n" as an integer, because the next token after "35" in "35 n 0000 | a hamburger with melted cheese on it" is "n".

I believe the line dataFile.BaseStream.Seek(synsetOffset, SeekOrigin.Begin); at DataFileEngine.cs line 275 is miscalculating the offset of the word. Since it is a byte offset rather than a line offset, it may be calculated incorrectly.
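One thing worth checking here, offered as an assumption rather than a confirmed diagnosis: when a StreamReader's underlying stream is repositioned with BaseStream.Seek, the reader keeps serving data from its internal buffer unless DiscardBufferedData is called. The sketch below shows the pattern with standard .NET calls only; ByteOffsetReader is a hypothetical helper and not a class from SharpWordNet. Another cause to rule out is data files whose line endings were converted (for example LF to CRLF), which shifts every byte offset.

```csharp
using System.IO;

public static class ByteOffsetReader
{
    // Reads the line that starts at the given byte offset of the underlying stream.
    public static string ReadLineAt(StreamReader reader, long byteOffset)
    {
        reader.BaseStream.Seek(byteOffset, SeekOrigin.Begin);
        // Without this call the reader may return stale data buffered before the seek.
        reader.DiscardBufferedData();
        return reader.ReadLine();
    }
}
```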

Signed package?

Any possibility of releasing a strong name-signed version of the NuGet package? I currently get the error "Assembly generation failed -- Referenced assembly 'OpenNLP' does not have a strong name"

SharpWordNet reference error

When I download the NuGet package, the only libraries it contains are OpenNLP.dll and SharpEntropy.dll. When I try to use these in Unity, I get the error "unable to resolve reference SharpWordNet". Is a SharpWordNet.dll missing?

Are you going to support .NET Standard in this project?

I need the library for .NET Core and Mono (Unity 3D).

Error in ParseTree project under Visual Studio 2019

When I run the ParseTree project, I get an error in the DrawTree method of LithiumControl.cs on the following line:
p = new Point(graphAbstract.Root.X, graphAbstract.Root.Y);
because graphAbstract.Root == null. I get the following error message:
< System.NullReferenceException: 'Object reference not set to an instance of an object.'
Netron.Lithium.GraphAbstract.Root.get returned null.>
This does not happen when I run the project under Visual Studio 2015.

Using other languages' models with this port

I would like to know how to use bin files from the OpenNLP official site to make it work with this port.

EntropyNameFinder.TrainModel Syntax for training custom model

Dear @AlexPoint

I am using the below code to train a custom model for my case. I can see the model being written but I am not sure if the syntax of the text file is correct.

I use the below format in the training file:
PANEL NAME: <NAME>MDB</NAME>
Where NAME is the named entity

var bestmodel = OpenNLP.Tools.NameFind.MaximumEntropyNameFinder.TrainModel(EntityExctractor_trainingFile, 5, 2);
SharpEntropy.IO.BinaryGisModelWriter modelwriter = new SharpEntropy.IO.BinaryGisModelWriter();
modelwriter.Persist(bestmodel, EntityExctractor_outputFilePath);

Any pointers on how to proceed?

Name mismatch

In OpenNLP/Tools/Coreference/Resolver/IsAResolver.cs,
I think that on lines 51 and 58 modelName should be without the /, so only "imodel".

Context constructor does not set a value for property HeadTokenTag

I believe this is an omission, occurring at OpenNlp-master\OpenNlp-master\OpenNLP\Tools\Coreference\Similarity\Context.cs.

OpenNLP Port

Hi,

I've been working for a few months on a similar port; maybe we can combine our efforts in a fresh project.

Take a look:
https://github.com/knuppe/SharpNL

How to generate a Tag Dictionary?

I am using the following code for training a POS model. The question is then how to generate the tag dictionary that is required later to use the model?

        var trainingFile = "..";
        // The number of iterations; no general rule for finding the best value, just try several!
        var iterations = 5;
        // The cut; no general rule for finding the best value, just try several!
        var cut = 2;
        // Train the model (can take some time depending on your training file size)
        var model = MaximumEntropyPosTagger.TrainModel(trainingFile, iterations, cut); 
        // Persist the model to use it later
        var outputFilePath = @"...";
        new BinaryGisModelWriter().Persist(model, outputFilePath);

Can I find tense and aspect of a sentence?

Bug in WordNetDictionary.cs Line 78

The "N" should be a "V" on line 78

            string partOfSpeech;
            if (tag.StartsWith("N") || tag.StartsWith("n"))
            {
                partOfSpeech = "noun";
            }
            else if (tag.StartsWith("N") || tag.StartsWith("v")) // line 78: the "N" here should be "V"
            {
                partOfSpeech = "verb";
            }
            else if (tag.StartsWith("J") || tag.StartsWith("a"))
            {
                partOfSpeech = "adjective";
            }
            else if (tag.StartsWith("R") || tag.StartsWith("r"))
            {
                partOfSpeech = "adverb";
            }
            else
            {
                partOfSpeech = "noun";

How to convert OpenNlp models and dictionary?

I was trying to use these files:
https://github.com/aciapetti/opennlp-italian-models/

To replace the built-in model and dictionary, I tried different files with different tools in the solution, but I didn't find a combination that works.

Any idea ?

.NET Standard support

I am currently using this package in an existing project (.NET Framework).
We recently moved the project to a .NET Standard environment, where I cannot use this package.

Is it possible to support the .NET standard in this package?

Named entity recognition not working correctly on the second call

When calling nameFinder.GetNames(models, "in 2005"); a second time, the tags are missing or the whole tagging is corrupted.
The first call works fine, and a second call with a totally different input works in most cases, but not with the same or a similar sentence!

A dirty workaround is to recreate the Beam in the Find method of the MaximumEntropyNameFinder.

MaximumEntropyNameFinder.cs:
public virtual string[] Find(string[] tokens, IDictionary previousTags)
{
       //Dirty hack to fix an error for a repeated call 
       mBeam = new NameBeamSearch(this, mBeamSize, mContextGenerator, mModel, mBeamSize); 

       mBestSequence = mBeam.BestSequence(tokens, new object[]{previousTags});
       return mBestSequence.Outcomes.ToArray();
}

Maybe you can correct the issue; I do not really understand what went wrong.

Null Reference Exception in EnglishTreebankChunker.GetChunks(string[], string[])

currentSentenceChunk will always be null unless the first chunk starts with "B-" or equals "O".
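A defensive way to group BIO-style chunk tags, sketched under assumptions: this is not the code of EnglishTreebankChunker.GetChunks, and the class BioChunkGrouper, its tuple return type and the "B-"/"I-"/"O" tag handling are hypothetical. The point it illustrates is that an "I-" tag with no matching open chunk starts a new chunk instead of dereferencing a null one.

```csharp
using System;
using System.Collections.Generic;

public static class BioChunkGrouper
{
    // Groups tokens into (Label, Text) chunks from BIO tags such as "B-NP", "I-NP" and "O".
    public static List<(string Label, string Text)> Group(string[] tokens, string[] bioTags)
    {
        if (tokens.Length != bioTags.Length)
        {
            throw new ArgumentException("tokens and bioTags must have the same length.");
        }

        var chunks = new List<(string Label, string Text)>();
        string currentLabel = null;
        var currentTokens = new List<string>();

        for (var i = 0; i < tokens.Length; i++)
        {
            var tag = bioTags[i];
            var isBegin = tag.StartsWith("B-", StringComparison.Ordinal);
            var isInside = tag.StartsWith("I-", StringComparison.Ordinal);
            var label = (isBegin || isInside) ? tag.Substring(2) : tag;

            // An "I-" tag only continues a chunk when one with the same label is open;
            // otherwise it opens a new chunk rather than relying on a missing one.
            var continuesChunk = isInside && currentTokens.Count > 0 && label == currentLabel;
            if (!continuesChunk)
            {
                if (currentTokens.Count > 0)
                {
                    chunks.Add((currentLabel, string.Join(" ", currentTokens)));
                    currentTokens.Clear();
                }
                currentLabel = label;
            }
            currentTokens.Add(tokens[i]);
        }

        // Flush the last open chunk, if any.
        if (currentTokens.Count > 0)
        {
            chunks.Add((currentLabel, string.Join(" ", currentTokens)));
        }
        return chunks;
    }
}
```

With this grouping, a tag sequence that starts with "I-NP" produces an "NP" chunk rather than an exception; in this sketch every "O" token becomes its own single-token chunk.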