
This project forked from bminixhofer/nlprule


Rule-based grammatical error correction through parsing LanguageTool rules in Rust w/ bindings for Python.

License: MIT License

Rust 93.52% Python 5.58% Shell 0.89%


nlprule


NLPRule is a library for rule-based grammatical error correction written in pure Rust with bindings for Python. Rules are sourced from LanguageTool.

from nlprule import Tokenizer, Rules, SplitOn

tokenizer = Tokenizer.load("en")
rules = Rules.load("en", tokenizer, SplitOn([".", "?", "!"]))

rules.correct("He wants that you send him an email.")
# returns: 'He wants you to send him an email.'

rules.correct("Thanks for your’s and Lucy’s help.")
# returns: 'Thanks for yours and Lucy’s help.'

rules.correct("I can due his homework.")
# returns: 'I can do his homework.'

suggestions = rules.suggest("She was not been here since Monday.")
for s in suggestions:
  print(s.start, s.end, s.replacements, s.source, s.message)
# prints:
# 4 16 ['was not', 'has not been'] WAS_BEEN.1 Did you mean was not or has not been?

My goal with this library was to create a fast, lightweight engine for running natural language rules without relying on the JVM (with its speed and memory implications) and without the extras LanguageTool provides, such as spellchecking and n-gram based error detection.

NLPRule currently supports English and German.

Language | Disambiguation rules | Grammar rules | LT version
English  | 843 (100%)           | 3725 (~ 85%)  | 5.2
German   | 486 (100%)           | 2970 (~ 90%)  | 5.2

NLPRule is focused on speed.

In [1]: from nlprule import Tokenizer, Rules, SplitOn
   ...: 
   ...: tokenizer = Tokenizer.load("en")
   ...: rules = Rules.load("en", tokenizer, SplitOn([".", "?", "!"]))

In [2]: %timeit rules.correct("He wants that you send him an email.")
783 µs ± 6.18 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Using Intel(R) Core(TM) i5-8600K CPU @ 3.60GHz
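
Outside IPython, a similar measurement can be sketched with the standard library's timeit module. The helper name time_per_call_us is hypothetical and not part of NLPRule, and str.upper is only a stand-in workload so the snippet runs without NLPRule installed; pass rules.correct instead to time a real correction.

```python
import statistics
import timeit


def time_per_call_us(fn, *args, number=1000, repeat=7):
    """Return (mean, std. dev.) of the per-call time in microseconds.

    Mirrors the shape of %timeit: `repeat` runs of `number` calls each.
    """
    totals = timeit.repeat(lambda: fn(*args), number=number, repeat=repeat)
    per_call = [total / number * 1e6 for total in totals]
    return statistics.mean(per_call), statistics.stdev(per_call)


# Stand-in workload (assumption): replace str.upper with rules.correct
# once a `rules` object has been loaded.
mean_us, std_us = time_per_call_us(str.upper, "He wants that you send him an email.")
print(f"{mean_us:.2f} µs ± {std_us:.2f} µs per call")
```

The numbers depend heavily on the machine, so they are only comparable to the figure above when measured on similar hardware.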

Usage

1. Install: pip install nlprule
2. Create a `tokenizer` and `rules` object

from nlprule import Tokenizer, Rules

tokenizer = Tokenizer.load("en") # or 'de'
rules = Rules.load("en", tokenizer) # or 'de'

The objects will be downloaded the first time, then cached.

3a. Correct your text

rules.correct_sentence("He wants that you send him an email.")
# returns: 'He wants you to send him an email.'

correct_sentence expects a single sentence as input.

If you want to correct an arbitrary text, pass a sentence_splitter at initialization. A sentence splitter can be any function that takes a list of texts as input and returns a list of lists of sentences. A splitter that splits on fixed characters is included in NLPRule for convenience:

from nlprule import SplitOn

rules = Rules.load("en", tokenizer, SplitOn([".", "?", "!"]))
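
Since a sentence splitter only needs to map a list of texts to a list of lists of sentences, a custom one can be a plain function. The following is a minimal sketch; regex_splitter is a hypothetical name, not something NLPRule provides. It keeps trailing whitespace attached to each sentence so the pieces join back into the original text. Whether NLPRule requires that is not stated here, so treat it as an assumption.

```python
import re
from typing import List

# Zero-width split point right after ".", "?" or "!" followed by one
# whitespace character (splitting on empty matches needs Python 3.7+).
_BOUNDARY = re.compile(r"(?<=[.?!]\s)")


def regex_splitter(texts: List[str]) -> List[List[str]]:
    """Return one list of sentences for each input text."""
    return [[s for s in _BOUNDARY.split(text) if s] for text in texts]


text = "He wants that you send him an email. She was not been here since Monday."
print(regex_splitter([text]))
# [['He wants that you send him an email. ', 'She was not been here since Monday.']]
```

Such a function would be passed in the same position as SplitOn, for example Rules.load("en", tokenizer, regex_splitter).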

Pro tip: You can use NNSplit for more robust sentence segmentation:

from nnsplit import NNSplit

splitter = NNSplit.load("en")
rules = Rules.load(
    "en",
    tokenizer,
    lambda texts: [[str(s) for s in text] for text in splitter.split(texts)],
)

If a sentence splitter is set, you can call .correct:

rules.correct("He wants that you send him an email. She was not been here since Monday.")
# returns: 'He wants you to send him an email. She was not here since Monday.'

3b. Get suggestions

suggestions = rules.suggest_sentence("She was not been here since Monday.")
for s in suggestions:
  print(s.start, s.end, s.replacements, s.source, s.message)
# prints:
# 4 16 ['was not', 'has not been'] WAS_BEEN.1 Did you mean was not or has not been?

.suggest_sentence also has a multi-sentence counterpart in .suggest.
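
If you want to choose replacements yourself instead of calling the correct methods, the start, end and replacements fields are enough to patch the text by hand. The sketch below assumes that start and end are character offsets into the input string (which matches the printed example) and that suggestions do not overlap. FakeSuggestion is a stand-in so the snippet runs without NLPRule; it is not the library's class.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FakeSuggestion:
    """Stand-in with the fields printed above; not the NLPRule class."""
    start: int
    end: int
    replacements: List[str]


def apply_first_replacements(text: str, suggestions) -> str:
    """Apply the first replacement of every suggestion to the text."""
    # Work from right to left so that earlier offsets stay valid.
    for s in sorted(suggestions, key=lambda s: s.start, reverse=True):
        if s.replacements:
            text = text[:s.start] + s.replacements[0] + text[s.end:]
    return text


text = "She was not been here since Monday."
suggestions = [FakeSuggestion(4, 16, ["was not", "has not been"])]
print(apply_first_replacements(text, suggestions))
# She was not here since Monday.
```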

Bonus: Analyze text with the `tokenizer`

NLPRule does rule- and dictionary-based part-of-speech tagging and lemmatization, as well as chunking with a model ported from OpenNLP. It's not as fancy as spaCy, but it could be faster, and it had to be done anyway to apply the rules, so I thought I might as well add a public API:

tokens = tokenizer.tokenize_sentence("She was not been here since Monday.")

for token in tokens:
    print(token.text, token.span, token.tags, token.lemmas, token.chunks)
# prints:
#  (0, 0) ['SENT_START'] [] []
# She (0, 3) ['PRP'] ['She', 'she'] ['B-NP-singular', 'E-NP-singular']
# was (4, 7) ['VBD'] ['be', 'was'] ['B-VP']
# not (8, 11) ['RB'] ['not'] ['I-VP']
# been (12, 16) ['VBN'] ['be', 'been'] ['I-VP']
# here (17, 21) ['RB'] ['here'] ['B-ADVP']
# since (22, 27) ['CC', 'IN', 'RB'] ['since'] ['B-PP']
# Monday (28, 34) ['NNP'] ['Monday'] ['B-NP-singular', 'E-NP-singular']
# . (34, 35) ['.', 'PCT', 'SENT_END'] ['.'] ['O']
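
As a small illustration of working with this output, the sketch below copies the text, span and tags values printed above into plain tuples (so it runs without NLPRule), checks that each span slices the token text out of the sentence, and collects the tokens that have a verb tag. Reading the spans as character offsets is an assumption based on the printed values.

```python
sentence = "She was not been here since Monday."

# (text, span, tags) copied from the printed tokenizer output above.
tokens = [
    ("", (0, 0), ["SENT_START"]),
    ("She", (0, 3), ["PRP"]),
    ("was", (4, 7), ["VBD"]),
    ("not", (8, 11), ["RB"]),
    ("been", (12, 16), ["VBN"]),
    ("here", (17, 21), ["RB"]),
    ("since", (22, 27), ["CC", "IN", "RB"]),
    ("Monday", (28, 34), ["NNP"]),
    (".", (34, 35), [".", "PCT", "SENT_END"]),
]

# Each span should recover the token text from the original sentence.
spans_match = all(sentence[start:end] == text for text, (start, end), _ in tokens)

# Keep tokens where any part-of-speech tag starts with "VB".
verbs = [text for text, _, tags in tokens if any(t.startswith("VB") for t in tags)]

print(spans_match, verbs)
# True ['was', 'been']
```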

Benchmark

NLPRule is approximately 1.7x - 2.8x faster than LanguageTool. See the benchmark issue for details.

Language | NLPRule time | LanguageTool time
English  | 1            | 1.7 - 2.0
German   | 1            | 2.4 - 2.8

Maintenance disclaimer

NLPRule is currently pretty bare bones in terms of API and documentation. I will definitely fix bugs, but adding new functionality (especially new languages) and improving the API and docs will depend on interest from the community.

Fixing discrepancies between NLPRule and LanguageTool behaviour will have high priority if any are found.

Acknowledgements

All credit for the rule content goes to LanguageTool, who have made a Herculean effort to create high-quality grammar correction rules. This library is just a parser and reimplementation of the rule logic.

License

NLPRule is licensed under the MIT license or Apache-2.0 license, at your option.

Contributors

bminixhofer, jbest
