nlprule

NLPRule is a library for rule-based grammatical error correction written in pure Rust with bindings for Python. Rules are sourced from LanguageTool.

from nlprule import Tokenizer, Rules, SplitOn

tokenizer = Tokenizer.load("en")
rules = Rules.load("en", tokenizer, SplitOn([".", "?", "!"]))

rules.correct("He wants that you send him an email.")
# returns: 'He wants you to send him an email.'

rules.correct("Thanks for your’s and Lucy’s help.")
# returns: 'Thanks for yours and Lucy’s help.'

rules.correct("I can due his homework.")
# returns: 'I can do his homework.'

suggestions = rules.suggest("She was not been here since Monday.")
for s in suggestions:
    print(s.start, s.end, s.replacements, s.source, s.message)
# prints:
# 4 16 ['was not', 'has not been'] WAS_BEEN.1 Did you mean was not or has not been?

My goal with this library was to create a fast, lightweight engine for running natural language rules without relying on the JVM (and its speed and memory implications) and without the extras LanguageTool provides, such as spellchecking and n-gram based error detection.

NLPRule currently supports English and German.

	Disambiguation rules	Grammar rules	LT version
English	843 (100%)	3725 (~ 85%)	5.2
German	486 (100%)	2970 (~ 90%)	5.2

NLPRule is focused on speed.

In [1]: from nlprule import Tokenizer, Rules, SplitOn
   ...: 
   ...: tokenizer = Tokenizer.load("en")
   ...: rules = Rules.load("en", tokenizer, SplitOn([".", "?", "!"]))

In [2]: %timeit rules.correct("He wants that you send him an email.")
783 µs ± 6.18 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Measured on an Intel(R) Core(TM) i5-8600K CPU @ 3.60GHz.

Usage

1. Install: pip install nlprule

2. Create a `tokenizer` and `rules` object

from nlprule import Tokenizer, Rules

tokenizer = Tokenizer.load("en") # or 'de'
rules = Rules.load("en", tokenizer) # or 'de'

The objects are downloaded on first use and cached afterwards.

3a. Correct your text

rules.correct_sentence("He wants that you send him an email.")
# returns: 'He wants you to send him an email.'

correct_sentence expects a single sentence as input.

If you want to correct an arbitrary text, pass a sentence_splitter at initialization. A sentence splitter can be any function that takes a list of texts as input and returns a list of lists of sentences. A splitter that splits on fixed characters is included in NLPRule for convenience:

from nlprule import SplitOn

rules = Rules.load("en", tokenizer, SplitOn([".", "?", "!"]))
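Because a sentence splitter is just a callable from a list of texts to a list of lists of sentences, a custom one can be written in a few lines. The following is a minimal sketch, not part of NLPRule: `regex_splitter` is a hypothetical name, the regular expression is a deliberately naive guess at sentence boundaries, and keeping trailing whitespace attached to each sentence is an assumption rather than a documented requirement.

```python
import re
from typing import List

# Hypothetical helper, not part of NLPRule.
# A sentence ends at ".", "?" or "!" followed by whitespace or the end of the
# text; whatever remains at the end of the text forms a final sentence.
_SENTENCE = re.compile(r".+?(?:[.?!]+(?:\s+|\Z)|\Z)", re.DOTALL)


def regex_splitter(texts: List[str]) -> List[List[str]]:
    """Take a list of texts and return one list of sentences per text.

    Trailing whitespace stays attached to each sentence, so
    "".join(sentences) reproduces the original text (an assumption about
    what is convenient, not a documented NLPRule requirement).
    """
    return [_SENTENCE.findall(text) for text in texts]


print(regex_splitter(["He wants that you send him an email. She was not been here since Monday."]))
```

Such a function would be passed in the same position as SplitOn, for example `Rules.load("en", tokenizer, regex_splitter)`. It will still mis-split abbreviations such as "e.g. this", which is where a trained splitter becomes worthwhile.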

Pro tip: You can use NNSplit for more robust sentence segmentation:

from nnsplit import NNSplit

splitter = NNSplit.load("en")
rules = Rules.load(
    "en",
    tokenizer,
    lambda texts: [[str(s) for s in text] for text in splitter.split(texts)],
)

If a sentence splitter is set, you can call .correct:

rules.correct("He wants that you send him an email. She was not been here since Monday.")
# returns: 'He wants you to send him an email. She was not here since Monday.'

3b. Get suggestions

suggestions = rules.suggest_sentence("She was not been here since Monday.")
for s in suggestions:
    print(s.start, s.end, s.replacements, s.source, s.message)
# prints:
# 4 16 ['was not', 'has not been'] WAS_BEEN.1 Did you mean was not or has not been?

.suggest_sentence also has a multi-sentence counterpart in .suggest.
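If you want to pick replacements yourself instead of calling the correct methods, the start, end and replacements attributes printed above are enough to patch the text by hand. The sketch below assumes that start and end are character offsets into the input and that suggestions do not overlap; `FakeSuggestion` and `apply_first_replacements` are hypothetical names used so the example runs without NLPRule installed.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FakeSuggestion:
    """Hypothetical stand-in with the attributes printed in the example above."""
    start: int
    end: int
    replacements: List[str]


def apply_first_replacements(text: str, suggestions: List[FakeSuggestion]) -> str:
    """Replace each suggested span with its first replacement.

    Assumes character offsets and non-overlapping suggestions.
    """
    # Work from right to left so earlier offsets stay valid after each edit.
    for s in sorted(suggestions, key=lambda s: s.start, reverse=True):
        if s.replacements:
            text = text[:s.start] + s.replacements[0] + text[s.end:]
    return text


text = "She was not been here since Monday."
suggestion = FakeSuggestion(start=4, end=16, replacements=["was not", "has not been"])
print(apply_first_replacements(text, [suggestion]))
# prints: She was not here since Monday.
```

Taking the first replacement is only one possible policy; a caller could just as well present all replacements to a user and apply the chosen one with the same slicing.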

Bonus: Analyze text with the `tokenizer`

NLPRule does rule- and dictionary-based part-of-speech tagging and lemmatization, as well as chunking with a model ported from OpenNLP. It is not as fancy as spaCy but could be faster, and it had to be done anyway to apply the rules, so I thought I might as well add a public API:

tokens = tokenizer.tokenize_sentence("She was not been here since Monday.")

for token in tokens:
    print(token.text, token.span, token.tags, token.lemmas, token.chunks)
# prints:
#  (0, 0) ['SENT_START'] [] []
# She (0, 3) ['PRP'] ['She', 'she'] ['B-NP-singular', 'E-NP-singular']
# was (4, 7) ['VBD'] ['be', 'was'] ['B-VP']
# not (8, 11) ['RB'] ['not'] ['I-VP']
# been (12, 16) ['VBN'] ['be', 'been'] ['I-VP']
# here (17, 21) ['RB'] ['here'] ['B-ADVP']
# since (22, 27) ['CC', 'IN', 'RB'] ['since'] ['B-PP']
# Monday (28, 34) ['NNP'] ['Monday'] ['B-NP-singular', 'E-NP-singular']
# . (34, 35) ['.', 'PCT', 'SENT_END'] ['.'] ['O']
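Judging from the output above, span appears to hold character offsets into the input sentence and the tags appear to be Penn Treebank-style; both are readings of the printed output, not documented guarantees. The sketch below copies a few attributes from that output into a hypothetical `Tok` tuple (so it runs without NLPRule), checks the span interpretation, and filters for verbs.

```python
from typing import List, NamedTuple, Tuple


class Tok(NamedTuple):
    """Hypothetical stand-in holding attributes copied from the output above."""
    text: str
    span: Tuple[int, int]
    tags: List[str]


sentence = "She was not been here since Monday."
tokens = [
    Tok("", (0, 0), ["SENT_START"]),
    Tok("She", (0, 3), ["PRP"]),
    Tok("was", (4, 7), ["VBD"]),
    Tok("not", (8, 11), ["RB"]),
    Tok("been", (12, 16), ["VBN"]),
    Tok("here", (17, 21), ["RB"]),
    Tok("since", (22, 27), ["CC", "IN", "RB"]),
    Tok("Monday", (28, 34), ["NNP"]),
    Tok(".", (34, 35), [".", "PCT", "SENT_END"]),
]

# Slicing the sentence with each span should give back the token text.
spans_match = all(sentence[t.span[0]:t.span[1]] == t.text for t in tokens)

# Keep tokens that carry at least one verb tag (VB, VBD, VBN, ...).
verbs = [t.text for t in tokens if any(tag.startswith("VB") for tag in t.tags)]

print(spans_match, verbs)
# prints: True ['was', 'been']
```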

Benchmark

NLPRule is approximately 1.7x - 2.8x faster than LanguageTool. See the benchmark issue for details.

	NLPRule time	LanguageTool time
English	1	1.7 - 2.0
German	1	2.4 - 2.8

Maintenance disclaimer

NLPRule is currently pretty bare bones in terms of API and documentation. I will definitely fix bugs, but adding new functionality (especially new languages) and improving the API and docs will depend on interest from the community.

Fixing discrepancies between NLPRule and LanguageTool behaviour will have high priority if any are found.

Acknowledgements

All credit for the rule content goes to LanguageTool who have made a Herculean effort to create high-quality grammar correction rules. This library is just a parser and reimplementation of the rule logic.

License

NLPRule is licensed under the MIT license or Apache-2.0 license, at your option.
