
chrisjbryant / errant


ERRor ANnotation Toolkit: Automatically extract and classify grammatical errors in parallel original and corrected sentences.

License: MIT License

Python 100.00%
automatic annotation grammatical-framework classifier evaluation

errant's Introduction

ERRANT v3.0.0

This repository contains the grammatical ERRor ANnotation Toolkit (ERRANT) described in:

Christopher Bryant, Mariano Felice, and Ted Briscoe. 2017. Automatic annotation and evaluation of error types for grammatical error correction. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Vancouver, Canada.

Mariano Felice, Christopher Bryant, and Ted Briscoe. 2016. Automatic extraction of learner errors in ESL sentences using linguistically enhanced alignments. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers. Osaka, Japan.

If you make use of this code, please cite the above papers. More information about ERRANT can be found here. In particular, see Chapter 5 for definitions of error types.

Update - 09/12/23: You can now try out ERRANT in our online demo!

Overview

The main aim of ERRANT is to automatically annotate parallel English sentences with error type information. Specifically, given an original and corrected sentence pair, ERRANT will extract the edits that transform the former to the latter and classify them according to a rule-based error type framework. This can be used to standardise parallel datasets or facilitate detailed error type evaluation. Annotated output files are in M2 format and an evaluation script is provided.

Example:

Original: This are gramamtical sentence .
Corrected: This is a grammatical sentence .
Output M2:
S This are gramamtical sentence .
A 1 2|||R:VERB:SVA|||is|||REQUIRED|||-NONE-|||0
A 2 2|||M:DET|||a|||REQUIRED|||-NONE-|||0
A 2 3|||R:SPELL|||grammatical|||REQUIRED|||-NONE-|||0
A -1 -1|||noop|||-NONE-|||REQUIRED|||-NONE-|||1

In M2 format, a line preceded by S denotes an original sentence, while a line preceded by A indicates an edit annotation. Each edit line consists of the start and end token offsets of the edit, the error type, and the tokenised correction string. The next two fields are included for historical reasons (see the CoNLL-2014 shared task) while the last field is the annotator id.

A "noop" edit is a special kind of edit that explicitly indicates an annotator/system made no changes to the original sentence. If there is only one annotator, noop edits are optional, otherwise a noop edit should be included whenever at least 1 out of n annotators considered the original sentence to be correct. This is something to be aware of when combining individual M2 files, as missing noops can affect evaluation.

Installation

Pip Install

The easiest way to install ERRANT and its dependencies is using pip. We also recommend installing it in a clean virtual environment (e.g. with venv). The latest version of ERRANT only supports Python >= 3.7.

python3 -m venv errant_env
source errant_env/bin/activate
pip install -U pip setuptools wheel
pip install errant
python3 -m spacy download en_core_web_sm

This will create and activate a new python3 environment called errant_env in the current directory. pip will then update some setup tools and install ERRANT, spaCy, rapidfuzz and spaCy's default English model in this environment. You can deactivate the environment at any time by running deactivate, but you must remember to activate it again whenever you want to use ERRANT.

BEA-2019 Shared Task

ERRANT v2.0.0 was designed to be fully compatible with the BEA-2019 Shared Task. If you want to directly compare against the results in the shared task, you may want to install ERRANT v2.0.0 as newer versions may produce slightly different scores. You can also use Codalab to evaluate anonymously on the shared task datasets. ERRANT v2.0.0 is not compatible with Python >= 3.7.

pip install errant==2.0.0

Source Install

If you prefer to install ERRANT from source, you can instead run the following commands:

git clone https://github.com/chrisjbryant/errant.git
cd errant
python3 -m venv errant_env
source errant_env/bin/activate
pip install -U pip setuptools wheel
pip install -e .
python3 -m spacy download en_core_web_sm

This will clone the ERRANT GitHub source into the current directory, build and activate a python environment inside it, and then install ERRANT and all its dependencies. If you wish to modify the ERRANT code, this is the recommended way to install it.

Usage

CLI

Three main commands are provided with ERRANT: errant_parallel, errant_m2 and errant_compare. You can run them from anywhere on the command line without having to invoke a specific python script.

  1. errant_parallel

    This is the main annotation command that takes an original text file and at least one parallel corrected text file as input, and outputs an annotated M2 file. By default, it is assumed that the original and corrected text files are word tokenised with one sentence per line.
    Example:

    errant_parallel -orig <orig_file> -cor <cor_file1> [<cor_file2> ...] -out <out_m2>
    
  2. errant_m2

    This is a variant of errant_parallel that operates on an M2 file instead of parallel text files. This makes it easier to reprocess existing M2 files. You must also specify whether you want to use gold or auto edits; i.e. -gold will only classify the existing edits, while -auto will extract and classify automatic edits. In both settings, uncorrected edits and noops are preserved.
    Example:

    errant_m2 {-auto|-gold} m2_file -out <out_m2>
    
  3. errant_compare

    This is the evaluation command that compares a hypothesis M2 file against a reference M2 file. The default behaviour evaluates the hypothesis overall in terms of span-based correction. The -cat {1,2,3} flag can be used to evaluate error types at increasing levels of granularity, while the -ds or -dt flag can be used to evaluate in terms of span-based or token-based detection (i.e. ignoring the correction). All scores are presented in terms of Precision, Recall and F-score (default: F0.5), and counts for True Positives (TP), False Positives (FP) and False Negatives (FN) are also shown.
    Examples:

    errant_compare -hyp <hyp_m2> -ref <ref_m2> 
    errant_compare -hyp <hyp_m2> -ref <ref_m2> -cat {1,2,3}
    errant_compare -hyp <hyp_m2> -ref <ref_m2> -ds
    errant_compare -hyp <hyp_m2> -ref <ref_m2> -ds -cat {1,2,3}
    

All these scripts also have additional advanced command line options which can be displayed using the -h flag.

API

As of v2.0.0, ERRANT also comes with an API.

Quick Start

import errant

annotator = errant.load('en')
orig = annotator.parse('This are gramamtical sentence .')
cor = annotator.parse('This is a grammatical sentence .')
edits = annotator.annotate(orig, cor)
for e in edits:
    print(e.o_start, e.o_end, e.o_str, e.c_start, e.c_end, e.c_str, e.type)

Loading

errant.load(lang, nlp=None)
Create an ERRANT Annotator object. The lang parameter currently only accepts 'en' for English, but we hope to extend it for other languages in the future. The optional nlp parameter can be used if you have already preloaded spacy and do not want ERRANT to load it again.

import errant
import spacy

nlp = spacy.load('en_core_web_sm') # Or en_core_web_X for other spacy models
annotator = errant.load('en', nlp)

Annotator Objects

An Annotator object is the main interface for ERRANT.

Methods

annotator.parse(string, tokenise=False)
Lemmatise, POS tag, and parse a text string with spacy. Set tokenise to True to also word tokenise with spacy. Returns a spacy Doc object.
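For example (a quick sketch; the exact tokens depend on the loaded spacy model):

import errant

annotator = errant.load('en')
# Pre-tokenised input: whitespace is treated as the token boundary (the default).
doc = annotator.parse('This is a sentence .')
# Raw input: let spacy word tokenise the string as well.
doc = annotator.parse('This is a sentence.', tokenise=True)
print([tok.text for tok in doc])  # ['This', 'is', 'a', 'sentence', '.']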

annotator.align(orig, cor, lev=False)
Align spacy-parsed original and corrected text. The default uses a linguistically-enhanced Damerau-Levenshtein alignment, but the lev flag can be used for a standard Levenshtein alignment. Returns an Alignment object.

annotator.merge(alignment, merging='rules')
Extract edits from the optimum alignment in an Alignment object. Four different merging strategies are available:

  1. rules: Use a rule-based merging strategy (default)
  2. all-split: Merge nothing: MSSDI -> M, S, S, D, I
  3. all-merge: Merge adjacent non-matches: MSSDI -> M, SSDI
  4. all-equal: Merge adjacent same-type non-matches: MSSDI -> M, SS, D, I

Returns a list of Edit objects.
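To see how the strategies differ, you can merge the same alignment several ways (an illustrative sketch; the exact edits depend on the alignment):

import errant

annotator = errant.load('en')
orig = annotator.parse('This are gramamtical sentence .')
cor = annotator.parse('This is a grammatical sentence .')
alignment = annotator.align(orig, cor)

# Compare the four strategies on the same Alignment object.
for strategy in ('rules', 'all-split', 'all-merge', 'all-equal'):
    edits = annotator.merge(alignment, merging=strategy)
    print(strategy, [(e.o_str, e.c_str) for e in edits])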

annotator.classify(edit)
Classify an edit. Sets the edit.type attribute in an Edit object and returns the same Edit object.

annotator.annotate(orig, cor, lev=False, merging='rules')
Run the full annotation pipeline to align two sequences and extract and classify the edits. Equivalent to running annotator.align, annotator.merge and annotator.classify in sequence. Returns a list of Edit objects.

import errant

annotator = errant.load('en')
orig = annotator.parse('This are gramamtical sentence .')
cor = annotator.parse('This is a grammatical sentence .')
alignment = annotator.align(orig, cor)
edits = annotator.merge(alignment)
for e in edits:
    e = annotator.classify(e)

annotator.import_edit(orig, cor, edit, min=True, old_cat=False)
Load an Edit object from a list. orig and cor must be spacy-parsed Doc objects and the edit must be of the form: [o_start, o_end, c_start, c_end(, type)]. The values must be integers that correspond to the token start and end offsets in the original and corrected Doc objects. The type value is an optional string that denotes the error type of the edit (if known). Set min to True to minimise the edit (e.g. [a b -> a c] = [b -> c]) and old_cat to True to preserve the old error type category (i.e. turn off the classifier).

import errant

annotator = errant.load('en')
orig = annotator.parse('This are gramamtical sentence .')
cor = annotator.parse('This is a grammatical sentence .')
edit = [1, 2, 1, 2, 'SVA'] # are -> is
edit = annotator.import_edit(orig, cor, edit)
print(edit.to_m2())

Alignment Objects

An Alignment object is created from two spacy-parsed text sequences.

Attributes

alignment.orig
alignment.cor
The spacy-parsed original and corrected text sequences.

alignment.cost_matrix
alignment.op_matrix
The cost matrix and operation matrix produced by the alignment.

alignment.align_seq
The first cheapest alignment between the two sequences.
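For example, to inspect an alignment directly (a small sketch; the exact tuple layout of align_seq is internal, so treat the comment as indicative):

import errant

annotator = errant.load('en')
orig = annotator.parse('This are gramamtical sentence .')
cor = annotator.parse('This is a grammatical sentence .')
alignment = annotator.align(orig, cor)
print(alignment.align_seq)  # e.g. a list of (op, o_start, o_end, c_start, c_end) tuples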

Edit Objects

An Edit object represents a transformation between two text sequences.

Attributes

edit.o_start
edit.o_end
edit.o_toks
edit.o_str
The start and end offsets, the spacy tokens, and the string for the edit in the original text.

edit.c_start
edit.c_end
edit.c_toks
edit.c_str
The start and end offsets, the spacy tokens, and the string for the edit in the corrected text.

edit.type
The error type string.

Methods

edit.to_m2(id=0)
Format the edit for an output M2 file. id is the annotator id.
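For example, to regenerate the M2 annotation lines from the quick start example above:

import errant

annotator = errant.load('en')
orig = annotator.parse('This are gramamtical sentence .')
cor = annotator.parse('This is a grammatical sentence .')
for e in annotator.annotate(orig, cor):
    print(e.to_m2(id=0))  # e.g. A 1 2|||R:VERB:SVA|||is|||REQUIRED|||-NONE-|||0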

Development for Other Languages

If you want to develop ERRANT for other languages, you should mimic the errant/en directory structure. For example, ERRANT for French should import a merger from errant.fr.merger and a classifier from errant.fr.classifier that respectively have equivalent get_rule_edits and classify methods. You will also need to add 'fr' to the list of supported languages in errant/__init__.py.
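As a rough sketch of the expected layout (the bodies below are placeholders, not a real implementation):

# errant/fr/merger.py (hypothetical skeleton)
def get_rule_edits(alignment):
    # Apply French-specific merging rules to the alignment
    # and return a list of Edit objects.
    raise NotImplementedError

# errant/fr/classifier.py (hypothetical skeleton)
def classify(edit):
    # Set edit.type according to French-specific rules and return the edit.
    raise NotImplementedError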

Contact

If you have any questions, suggestions or bug reports, you can contact the authors at:
christopher d0t bryant at cl.cam.ac.uk
mariano d0t felice at cl.cam.ac.uk

errant's People

Contributors

chrisjbryant, maxbachmann, sam-writer


errant's Issues

Expose errant_compare functionality via the API

It would be great if the functionality in the errant_compare command were available for invocation as an API call, so it could be used for things like early stopping when training GEC models.

I've looked through the compare_m2 file, and it doesn't look like much work to refactor things so that everything works the same way but it is also possible to import a function that returns a dict of the computed scores instead of printing them. If this is the kind of thing you'd be willing to accept a PR for, I'd be happy to give it a go myself sometime in the next couple of weeks. If not, it would be great if you could get to it at some point.
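Concretely, the kind of wrapper I have in mind (a rough sketch: compare_scores is a hypothetical name, and the TP/FP/FN counts would come from the existing edit-matching logic in compare_m2):

def compare_scores(tp, fp, fn, beta=0.5):
    # Compute the P/R/F-beta values that errant_compare currently prints.
    p = tp / (tp + fp) if tp + fp else 1.0
    r = tp / (tp + fn) if tp + fn else 1.0
    f = (1 + beta**2) * p * r / (beta**2 * p + r) if p + r else 0.0
    return {'tp': tp, 'fp': fp, 'fn': fn, 'precision': p, 'recall': r, 'f': f}

print(compare_scores(4, 0, 9))  # p=1.0, r~0.3077, f0.5~0.6897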

SequenceMatcher is slower than python-Levenshtein

I was adapting this code for our private use and noticed that the python-Levenshtein package is ~100x faster than SequenceMatcher's ratio.

Levenshtein.ratio(A, B) gets you the same result.
I understand that this library is more for offline benchmarking use, but it doesn't hurt to be faster 😉 .
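For reference, this is the swap I mean (python-Levenshtein is a C extension; the two ratios usually agree, though their definitions are not identical):

from difflib import SequenceMatcher
import Levenshtein  # pip install python-Levenshtein

a, b = 'gramamtical', 'grammatical'
print(SequenceMatcher(None, a, b).ratio())  # pure python, slow
print(Levenshtein.ratio(a, b))              # C implementation, much faster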

By the way, can you explain the rationale for the custom substitution cost function? Is there an example of how using it changes the chosen alignment path?

Add a test to check spacy and errant work properly

Hi!
I have found a bug which is pretty difficult to replicate: in certain cases (especially if you re-install spacy after installing errant), errant will "apparently" work, giving feedback on the sentences it corrects, but in reality it won't, and ends up not recognising most of the mistakes.

Would it be possible to add a simple test, with a few basic sentence pairs, e.g. "He go home." -> "He goes home.", on which errant is evaluated, so that after installation one can check that spacy is working?

I know that, especially with spacy 2.x, the results won't always be the same, but I still think that this kind of feedback could be useful to check that errant is working "reasonably" well together with spacy.

If that is okay, I can make a PR with a new folder and file tests/test_errant_base.py, containing 10-20 simple sentence pairs, where I check how many of the mistakes are correctly recognised by errant (see the sketch below).
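For example, something along these lines (the sentence pairs, expected types and pass threshold are all placeholders):

# tests/test_errant_base.py (proposed sketch)
import errant

PAIRS = [
    ('He go home .', 'He goes home .', 'R:VERB:SVA'),
    ('I have car .', 'I have a car .', 'M:DET'),
]

def test_errant_base():
    annotator = errant.load('en')
    hits = 0
    for orig, cor, expected in PAIRS:
        edits = annotator.annotate(annotator.parse(orig), annotator.parse(cor))
        if any(e.type == expected for e in edits):
            hits += 1
    # Allow some slack for spacy model/version differences.
    assert hits >= len(PAIRS) - 1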

Simulate Errors

Can errant simulate errors in a sentence instead of correcting it? I am trying to build a dataset with a ground-truth original transcript/text and a corresponding errorful version of it.

Is there any way to further improve the method of summarizing error types?

For some sentences, I noticed that the assigned error type is not accurate enough.

I noticed that ERRANT uses the sm-sized spacy model, so I tried replacing it with a larger model, but the improvement does not seem significant. Is there any other way to improve its accuracy?

By the way, thank you for providing this tool; it is very useful!

Handling Missing Annotations on certain sentence

I am not able to generate M2 files when annotations are missing for certain sentences for some of the annotators. Setting the annotated text equal to the original has side effects. Am I missing something?

About speed up

Hi dear author, ERRANT is such an excellent tool, and I'm very happy to see that the character-level cost in the sentence alignment function is now computed by the much faster python-Levenshtein library instead of python's native difflib.SequenceMatcher, which makes ERRANT 3x faster.
I want to know if there are other potential optimisations that could increase the speed.
Can you give me some clues? Thank you very much!

Errant parse method not working

I'm facing a problem with the parse method of errant. I know it uses the spacy tagger, and I have the English spacy model ready, so I'm not sure what the problem might be.
Thanks in advance.

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 2490: ordinal not in range(128)

Hi 😊

I've encountered a problem while using errant:

I think there is a version conflict between python and spacy, and I couldn't fix it:

Python 3.6.13 |Anaconda, Inc.| (default, Jun  4 2021, 14:25:59)
[GCC 7.5.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import errant
>>> errant.load('en')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/root/anaconda3/envs/errant200/lib/python3.6/site-packages/errant/__init__.py", line 19, in load
    classifier = import_module("errant.%s.classifier" % lang)
  File "/root/anaconda3/envs/errant200/lib/python3.6/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 994, in _gcd_import
  File "<frozen importlib._bootstrap>", line 971, in _find_and_load
  File "<frozen importlib._bootstrap>", line 955, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 665, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 678, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "/root/anaconda3/envs/errant200/lib/python3.6/site-packages/errant/en/classifier.py", line 40, in <module>

I use

  • Ubuntu 20.04
  • python 3.6.13
  • spacy 1.9.0

Thank you

Licensing concerns

Hi, thanks for the awesome library.

Is there any way to remove python-Levenshtein from the dependencies? It's licensed under GPL-2.0, which is not compatible with errant's MIT license.

Evaluating Neural Network model using Errant

Dear all,

I have read through your documentation, but I'm still confused about how to evaluate my neural network GEC model. As I understand it, I have to run my model over the test set to produce corrections, then build a new M2 file using the errant_parallel command. The last step is to use errant_compare with span-based correction to get the F0.5 score.
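In commands, my understanding is (file names here are placeholders):

errant_parallel -orig test_orig.txt -cor model_output.txt -out hyp.m2
errant_compare -hyp hyp.m2 -ref gold_reference.m2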

Is this correct?
What is the optimal way to evaluate my NN model using Errant?

Regards,

spacy tokenizer speed

There is one line of code in the parse function:
text = self.nlp.tokenizer.tokens_from_list(text.split())

Why not just use nlp.tokenizer(text) directly? That would really speed up the tokenisation process.
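For context, the two calls differ in behaviour as well as speed (a sketch assuming the old spacy 1.x/2.x API quoted above; tokens_from_list was removed in later spacy versions):

import spacy

nlp = spacy.load('en_core_web_sm')
# tokens_from_list trusts the given whitespace split, preserving the existing
# token boundaries (ERRANT assumes pre-tokenised input).
doc = nlp.tokenizer.tokens_from_list('This is a sentence .'.split())
# nlp.tokenizer retokenises the raw string with spacy's own rules instead.
doc = nlp.tokenizer('This is a sentence .')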

Install error with bdist_wheel

Some default python installations are missing the wheel package or use an older version of it.

This raises an error when installing errant: error: invalid command 'bdist_wheel'

Although ERRANT was actually installed successfully and you can ignore this error, the fix is simply to install/upgrade the wheel package in your python venv before you install errant:
pip3 install -U wheel

Not comparing the actual correction tokens between hypothesis and reference edits in compare_m2.py

  • In compare_m2.py, the edits for a coder obtained from the extract_edits() function are in the form of (start,end):category.

  • While comparing the extracted edits for the hypothesis and gold corrections in the compareEdits() function here:

    if h_edit in ref_edits.keys():

in the lines below:

# On occasion, multiple tokens at same span.
for h_cat in ref_edits[h_edit]: # Use ref dict for TP
    tp += 1
    # Each dict value [TP, FP, FN]
    if h_cat in cat_dict.keys():
        cat_dict[h_cat][0] += 1
    else:
        cat_dict[h_cat] = [1, 0, 0]
  • The edits are first being compared based on their (start,end) and then they are checked to see whether their error categories match.
  • If a hypothesis edit and a reference edit have the same (start,end) and the same error category, the pair is counted as a true positive.
  • Consider the case below:
    Source sentence: With the risk of being genetically disorder , many individuals have done the decision to undergo genetic testing .
    Hypothesis sentence: With the risk of being genetically disordered , many individuals have done the decision to undergo genetic testing .
    Gold correction: With the risk of having genetic disorders , many individuals have made the decision to undergo genetic testing .
  • In this case, the hypothesis edit is (6,7):R:NOUN:NUM and the reference edit is (6,7):R:NOUN:NUM. Their (start,end) and error categories are the same, and hence they are counted as a true positive.
  • As far as I understand, since we are not comparing the actual correction tokens 'disordered' vs 'disorders', does this inflate the number of true positives? Is there any reasoning I am missing behind comparing only the (start,end) and error category of the edits?
  • Would it be better if the corrected tokens in the hypothesis edit and the reference edit were also compared before counting a true positive?
    Thanks.

Wrong format for incorr_sentences.txt

I used errant to preprocess the Oscar Tamil Dataset.

The source m2 file looks like this.

S முன்னாள் ஜனாதிபதி மஹிந்த ராஜபக்ஷவினால் முன்னெடுக்கப்பட்ட போராட்டம் உட்பட வேலைநிறுத்த போராட்டங்களுக்கான நிதி அனுசரணையை சீனாவே வழங்கி நாட்டையும் அரசாங்கத்தையும் நெருக்கடிக்குள்ளாக்க முயல்கிறது என சமூக நலன்புரி பிரதி அமைச்சர் ரஞ்சன் ராமநாயக்க தெரிவித்தார்.
A 0 1|||R:OTHER|||முன்னாழ்|||REQUIRED|||-NONE-|||0
A 2 3|||R:OTHER|||மஹிண்த|||REQUIRED|||-NONE-|||0
A 7 8|||R:OTHER|||வேளைணிறுத்த|||REQUIRED|||-NONE-|||0
A 8 9|||R:NOUN|||போராட்டங்களுக்காண|||REQUIRED|||-NONE-|||0
A 9 10|||R:OTHER|||ணிதி|||REQUIRED|||-NONE-|||0
A 10 11|||R:NOUN|||அநுசரநையை|||REQUIRED|||-NONE-|||0
A 15 16|||R:NOUN|||ணெருக்கடிக்குல்ளாக்க|||REQUIRED|||-NONE-|||0
A 19 20|||R:OTHER|||ணலந்புரி|||REQUIRED|||-NONE-|||0
A 23 24|||R:NOUN|||ராமனாயக்க|||REQUIRED|||-NONE-|||0
A 24 25|||R:OTHER|||தெரிவித்தார்|||REQUIRED|||-NONE-|||0

The corresponding generated section of corr_sentences.txt looks like this.

S முன்னாள் ஜனாதிபதி மஹிந்த ராஜபக்ஷவினால் முன்னெடுக்கப்பட்ட போராட்டம் உட்பட வேலைநிறுத்த போராட்டங்களுக்கான நிதி அனுசரணையை சீனாவே வழங்கி நாட்டையும் அரசாங்கத்தையும் நெருக்கடிக்குள்ளாக்க முயல்கிறது என சமூக நலன்புரி பிரதி அமைச்சர் ரஞ்சன் ராமநாயக்க தெரிவித்தார்.
A 0 1|||R:OTHER|||முன்னாழ்|||REQUIRED|||-NONE-|||0
A 2 3|||R:OTHER|||மஹிண்த|||REQUIRED|||-NONE-|||0
A 7 8|||R:OTHER|||வேளைணிறுத்த|||REQUIRED|||-NONE-|||0
A 8 9|||R:NOUN|||போராட்டங்களுக்காண|||REQUIRED|||-NONE-|||0
A 9 10|||R:OTHER|||ணிதி|||REQUIRED|||-NONE-|||0
A 10 11|||R:NOUN|||அநுசரநையை|||REQUIRED|||-NONE-|||0
A 15 16|||R:NOUN|||ணெருக்கடிக்குல்ளாக்க|||REQUIRED|||-NONE-|||0
A 19 20|||R:OTHER|||ணலந்புரி|||REQUIRED|||-NONE-|||0
A 23 24|||R:NOUN|||ராமனாயக்க|||REQUIRED|||-NONE-|||0
A 24 25|||R:OTHER|||தெரிவித்தார்|||REQUIRED|||-NONE-|||0

The corresponding section of incorr_sentences.txt looks like this.

S முன்னாள் ஜனாதிபதி மஹிந்த ராஜபக்ஷவினால் முன்னெடுக்கப்பட்ட போராட்டம் உட்பட வேலைநிறுத்த போராட்டங்களுக்கான நிதி அனுசரணையை சீனாவே வழங்கி நாட்டையும் அரசாங்கத்தையும் நெருக்கடிக்குள்ளாக்க முயல்கிறது என சமூக நலன்புரி பிரதி அமைச்சர் ரஞ்சன் ராமநாயக்க தெரிவித்தார்.
0 1|||R:OTHER|||முன்னாழ்|||REQUIRED|||-NONE-|||0
A work 3|||R:OTHER|||மஹிண்த|||REQUIRED|||-NONE-|||0
7 8|||R:OTHER|||வேளைணிறுத்த|||REQUIRED|||-NONE-|||0
Badly do my 9|||R:NOUN|||போராட்டங்களுக்காண|||REQUIRED|||-NONE-|||0
9 10|||R:OTHER|||ணிதி|||REQUIRED|||-NONE-|||0
A English 10 11|||R:NOUN|||அநுசரநையை|||REQUIRED|||-NONE-|||0
up relatively 15 16|||R:NOUN|||ணெருக்கடிக்குல்ளாக்க|||REQUIRED|||-NONE-|||0
19 20|||R:OTHER|||ணலந்புரி|||REQUIRED|||-NONE-|||0
Change 23 24|||R:NOUN|||ராமனாயக்க|||REQUIRED|||-NONE-|||0
24 25|||R:OTHER|||தெரிவித்தார்|||REQUIRED|||-NONE-|||0

The first correction line doesn't start with A. The word "work" appears randomly after A in the second correction line, yet the corresponding sentence in the source file does not contain the word "work". Other lines show similar patterns.

Questions about evaluating duplicate corrections

Hi, I have a question about duplicate corrections.

errant_parallel sometimes makes duplicate corrections, e.g.

echo "If you want to actally know somebody you can spend the whole day with that person or place but if you do not , you do not even speak to that person or even go there . " > orig.txt
echo "If you want to actually know somebody , you can spend the whole day with that person or place , but if you do not , you do not even speak to that person or even go there . " > sys.txt
echo "If you want to actually get to know someone , or something , you can spend the whole day with that person , or place , and if you do not , you would n't have reason to even speak to that person , or even go there . " > ref.txt
errant_parallel -orig orig.txt -cor sys.txt -out hyp.m2
errant_parallel -orig orig.txt -cor ref.txt -out ref.m2
errant_compare -hyp hyp.m2 -ref ref.m2

(The above is line 612 of JFLEG-dev. The reference is the first annotation.)
In the above case, errant_compare shows

=========== Span-Based Correction ============
TP      FP      FN      Prec    Rec     F0.5
4       0       9       1.0     0.3077  0.6897
==============================================

However, hyp.m2 has only three corrections, so TP=4 is strange.

  • hyp.m2
S If you want to actally know somebody you can spend the whole day with that person or place but if you do not , you do not even speak to that person or even go there .
A 4 5|||R:SPELL|||actually|||REQUIRED|||-NONE-|||0
A 7 7|||M:PUNCT|||,|||REQUIRED|||-NONE-|||0
A 18 18|||M:PUNCT|||,|||REQUIRED|||-NONE-|||0

The reason for this is duplicate corrections in the reference.
Indeed, ref.m2 contains the line A 7 7|||M:PUNCT|||,|||REQUIRED|||-NONE-|||0 twice.
(I don't know why such duplication appears.)

  • ref.m2
S If you want to actally know somebody you can spend the whole day with that person or place but if you do not , you do not even speak to that person or even go there .
A 4 5|||R:SPELL|||actually|||REQUIRED|||-NONE-|||0
A 5 5|||M:VERB|||get to|||REQUIRED|||-NONE-|||0
A 6 7|||R:NOUN|||someone|||REQUIRED|||-NONE-|||0
A 7 7|||M:PUNCT|||,|||REQUIRED|||-NONE-|||0
A 7 7|||M:CONJ|||or|||REQUIRED|||-NONE-|||0
A 7 7|||M:NOUN|||something|||REQUIRED|||-NONE-|||0
A 7 7|||M:PUNCT|||,|||REQUIRED|||-NONE-|||0
A 16 16|||M:PUNCT|||,|||REQUIRED|||-NONE-|||0
A 18 18|||M:PUNCT|||,|||REQUIRED|||-NONE-|||0
A 18 19|||R:CONJ|||and|||REQUIRED|||-NONE-|||0
A 25 27|||R:OTHER|||would n't have|||REQUIRED|||-NONE-|||0
A 27 27|||M:OTHER|||reason to|||REQUIRED|||-NONE-|||0
A 32 32|||M:PUNCT|||,|||REQUIRED|||-NONE-|||0

During errant_compare, coder_dict[coder][(7, 7, ',')] has multiple values: ['M:PUNCT', 'M:PUNCT'].
This adds two counts to the evaluation score because ref_edits[h_edit] has two values (see here).

Is this expected?
Personally, I do not think it is desirable for the number of TPs to exceed the number of edits in a hypothesis.
Possible solutions would be to (a rough sketch of the first option follows the list):

  • Prevent errant.Annotator.annotate() from outputting duplicate corrections.
  • Ensure that the coder_dict variable in errant.commands.compare_m2 only holds a single value per key (currently it is a list).
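A rough sketch of the first option, deduplicating edits before they are output (it uses only the documented Edit attributes):

def dedup_edits(edits):
    # Drop exact duplicates by original span, correction string and type.
    seen, unique = set(), []
    for e in edits:
        key = (e.o_start, e.o_end, e.c_str, e.type)
        if key not in seen:
            seen.add(key)
            unique.append(e)
    return unique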

Thank you for your development of ERRANT!
(This is an aside, but I am developing an API-based errant_compare and noticed this problem because my results did not match the official results.)

U type when there is a correction

Using the most up-to-date GitHub clone, we have found that ERRANT sometimes categorises an error as unnecessary (U:) although it is a replacement (e.g. on W&I, as attached).

See for example:
S The rich people will buy a car but the poor people always need to use a bus or taxi .
A 0 2|||U:DET|||Rich|||REQUIRED|||-NONE-|||0

ABCN.dev.gold.bea19.m2.txt

OSError: [E053] Could not read meta.json from en\meta.json

Traceback (most recent call last):
  File "C:/Users/ITJaylon/Desktop/errant/errant/test.py", line 3, in <module>
    annotator = errant.load('en')
  File "C:\Users\ITJaylon\Desktop\errant\errant\__init__.py", line 16, in load
    nlp = nlp or spacy.load(lang, disable=["ner"])
  File "E:\Anaconda\envs\errant_env\lib\site-packages\spacy\__init__.py", line 30, in load
    return util.load_model(name, **overrides)
  File "E:\Anaconda\envs\errant_env\lib\site-packages\spacy\util.py", line 172, in load_model
    return load_model_from_path(Path(name), **overrides)
  File "E:\Anaconda\envs\errant_env\lib\site-packages\spacy\util.py", line 198, in load_model_from_path
    meta = get_model_meta(model_path)
  File "E:\Anaconda\envs\errant_env\lib\site-packages\spacy\util.py", line 253, in get_model_meta
    raise IOError(Errors.E053.format(path=meta_path))
OSError: [E053] Could not read meta.json from en\meta.json

Process finished with exit code 1

Running setup.py install for murmurhash ... error

Hi Chris,

Thanks for your updating new packages!

This is regarding an install error: when I install your package, both from pip and from source, it gives me the following error messages:

Installing collected packages: numpy, murmurhash, cymem, preshed, wrapt, tqdm, toolz, cytoolz, plac, six, dill, termcolor, pathlib, thinc, pip, ujson, idna, urllib3, chardet, certifi, requests, regex, webencodings, html5lib, wcwidth, ftfy, spacy, nltk, python-Levenshtein, errant
Running setup.py install for murmurhash ... error
ERROR: Command errored out with exit status 1:
command: /Users/helen/errant/errant/errant_env/bin/python3 -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/private/var/folders/1_/s5yrcv056kq02m5_6v6pv23c0000gn/T/pip-install-972fo3pe/murmurhash/setup.py'"'"'; __file__='"'"'/private/var/folders/1_/s5yrcv056kq02m5_6v6pv23c0000gn/T/pip-install-972fo3pe/murmurhash/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record /private/var/folders/1_/s5yrcv056kq02m5_6v6pv23c0000gn/T/pip-record-ysbyt_yn/install-record.txt --single-version-externally-managed --compile --install-headers /Users/helen/errant/errant/errant_env/include/site/python3.6/murmurhash
cwd: /private/var/folders/1_/s5yrcv056kq02m5_6v6pv23c0000gn/T/pip-install-972fo3pe/murmurhash/
Complete output (36 lines):
running install
running build
running build_py
creating build
creating build/lib.macosx-10.7-x86_64-3.6
creating build/lib.macosx-10.7-x86_64-3.6/murmurhash
copying murmurhash/__init__.py -> build/lib.macosx-10.7-x86_64-3.6/murmurhash
copying murmurhash/about.py -> build/lib.macosx-10.7-x86_64-3.6/murmurhash
creating build/lib.macosx-10.7-x86_64-3.6/murmurhash/tests
copying murmurhash/tests/__init__.py -> build/lib.macosx-10.7-x86_64-3.6/murmurhash/tests
copying murmurhash/tests/test_import.py -> build/lib.macosx-10.7-x86_64-3.6/murmurhash/tests
copying murmurhash/mrmr.pyx -> build/lib.macosx-10.7-x86_64-3.6/murmurhash
copying murmurhash/__init__.pxd -> build/lib.macosx-10.7-x86_64-3.6/murmurhash
copying murmurhash/mrmr.pxd -> build/lib.macosx-10.7-x86_64-3.6/murmurhash
creating build/lib.macosx-10.7-x86_64-3.6/murmurhash/include
creating build/lib.macosx-10.7-x86_64-3.6/murmurhash/include/murmurhash
copying murmurhash/include/murmurhash/MurmurHash2.h -> build/lib.macosx-10.7-x86_64-3.6/murmurhash/include/murmurhash
copying murmurhash/include/murmurhash/MurmurHash3.h -> build/lib.macosx-10.7-x86_64-3.6/murmurhash/include/murmurhash
running build_ext
building 'murmurhash.mrmr' extension
creating build/temp.macosx-10.7-x86_64-3.6
creating build/temp.macosx-10.7-x86_64-3.6/murmurhash
gcc -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I/Users/helen/anaconda3/include -arch x86_64 -I/Users/helen/anaconda3/include -arch x86_64 -I/Users/helen/anaconda3/include/python3.6m -I/private/var/folders/1_/s5yrcv056kq02m5_6v6pv23c0000gn/T/pip-install-972fo3pe/murmurhash/murmurhash/include -I/Users/helen/errant/errant/errant_env/include -I/Users/helen/anaconda3/include/python3.6m -c murmurhash/mrmr.cpp -o build/temp.macosx-10.7-x86_64-3.6/murmurhash/mrmr.o -O3 -Wno-strict-prototypes -Wno-unused-function
warning: include path for stdlibc++ headers not found; pass '-stdlib=libc++' on the command line to use the libc++ standard library instead [-Wstdlibcxx-not-found]
1 warning generated.
gcc -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I/Users/helen/anaconda3/include -arch x86_64 -I/Users/helen/anaconda3/include -arch x86_64 -I/Users/helen/anaconda3/include/python3.6m -I/private/var/folders/1_/s5yrcv056kq02m5_6v6pv23c0000gn/T/pip-install-972fo3pe/murmurhash/murmurhash/include -I/Users/helen/errant/errant/errant_env/include -I/Users/helen/anaconda3/include/python3.6m -c murmurhash/MurmurHash2.cpp -o build/temp.macosx-10.7-x86_64-3.6/murmurhash/MurmurHash2.o -O3 -Wno-strict-prototypes -Wno-unused-function
warning: include path for stdlibc++ headers not found; pass '-stdlib=libc++' on the command line to use the libc++ standard library instead [-Wstdlibcxx-not-found]
1 warning generated.
gcc -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I/Users/helen/anaconda3/include -arch x86_64 -I/Users/helen/anaconda3/include -arch x86_64 -I/Users/helen/anaconda3/include/python3.6m -I/private/var/folders/1_/s5yrcv056kq02m5_6v6pv23c0000gn/T/pip-install-972fo3pe/murmurhash/murmurhash/include -I/Users/helen/errant/errant/errant_env/include -I/Users/helen/anaconda3/include/python3.6m -c murmurhash/MurmurHash3.cpp -o build/temp.macosx-10.7-x86_64-3.6/murmurhash/MurmurHash3.o -O3 -Wno-strict-prototypes -Wno-unused-function
warning: include path for stdlibc++ headers not found; pass '-stdlib=libc++' on the command line to use the libc++ standard library instead [-Wstdlibcxx-not-found]
1 warning generated.
g++ -bundle -undefined dynamic_lookup -L/Users/helen/anaconda3/lib -arch x86_64 -L/Users/helen/anaconda3/lib -arch x86_64 -arch x86_64 build/temp.macosx-10.7-x86_64-3.6/murmurhash/mrmr.o build/temp.macosx-10.7-x86_64-3.6/murmurhash/MurmurHash2.o build/temp.macosx-10.7-x86_64-3.6/murmurhash/MurmurHash3.o -o build/lib.macosx-10.7-x86_64-3.6/murmurhash/mrmr.cpython-36m-darwin.so
clang: warning: libstdc++ is deprecated; move to libc++ with a minimum deployment target of OS X 10.9 [-Wdeprecated]
ld: library not found for -lstdc++
clang: error: linker command failed with exit code 1 (use -v to see invocation)
error: command 'g++' failed with exit status 1
----------------------------------------
ERROR: Command errored out with exit status 1: /Users/helen/errant/errant/errant_env/bin/python3 -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/private/var/folders/1_/s5yrcv056kq02m5_6v6pv23c0000gn/T/pip-install-972fo3pe/murmurhash/setup.py'"'"'; __file__='"'"'/private/var/folders/1_/s5yrcv056kq02m5_6v6pv23c0000gn/T/pip-install-972fo3pe/murmurhash/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record /private/var/folders/1_/s5yrcv056kq02m5_6v6pv23c0000gn/T/pip-record-ysbyt_yn/install-record.txt --single-version-externally-managed --compile --install-headers /Users/helen/errant/errant/errant_env/include/site/python3.6/murmurhash
Check the logs for full command output.

I searched the error on Google and tried the possible solutions, such as update my Xcode, reinstall command tool lines, it doesn't work, therefore I wonder if you could give me some advice? Thanks in advance for your help!

Edits missed for a substitute -> Delete -> Substitute sequence.

Hi.
I am running into the following error:
For the source, target pairs:

source: In the article mrom The the New York Times.
target: In the article from The New York Times.

The edit mrom -> from is missed by ERRANT. The output from ERRANT was:

["Orig: [4, 6, 'The the'], Cor: [4, 5, 'The'], Type: 'U:DET'"]

On digging a little, it seems to be an issue with all alignments of the following form:

Input: w1 w2 w3
Output: w4 w5
such that w3.lower() == w5.lower()

Alignment Sequence: S w1 -> w4, D w2 -> "", S w3 -> w5

Then the edit "w1" -> "w4" is missed, and "w2 w3" -> "w5" is generated by errant.en.merger.process_seq.
Example:

source: "In thir the"
target: "On The"
Errant Output: ["Orig: [1, 3, 'Thir the'], Cor: [1, 2, 'The'], Type: 'U:NOUN'"]
# Missing In -> On

"AttributeError: 'English' object has no attribute 'tagger'" when running the "Quick Start" API code given in README.md

How can I eliminate this error: "AttributeError: 'English' object has no attribute 'tagger'"? I tried several different models, but they didn't work.

code:

import errant

annotator = errant.load('en')
orig = annotator.parse('This are gramamtical sentence .')
cor = annotator.parse('This is a grammatical sentence .')
edits = annotator.annotate(orig, cor)
for e in edits:
    print(e.o_start, e.o_end, e.o_str, e.c_start, e.c_end, e.c_str, e.type)

spaCy >= 2.0 support

Hi Chris, thanks for the big 2.0 updates!

This is regarding the following section of the README

Note: ERRANT does not support spaCy 2 at this time. spaCy 2 POS tags are slightly different from spaCy 1 POS tags and so ERRANT rules, which were designed for spaCy 1, may not always work with spaCy 2.

Since Python can't handle having multiple versions of a given library in a single project, and we need to use features that were introduced post spacy 2.0, we currently have to keep ERRANT isolated in a separate service which we talk to over HTTP. This is not ideal. Since ERRANT now supports passing in an nlp spacy object, it seems like adding support for spacy >= 2.0 would not be bad.

Specifically, I think we could check nlp._meta['spacy_version']. Below spacy 2.0, nlp._meta doesn't exist; from 2.0 onwards, it gives us the exact spacy version. For the current purpose, just testing is_spacy_2_or_above = bool(getattr(nlp, "_meta", False)) should be enough. Then the quickest fix would be to just map the 2.0 tags to 1.9 tags if is_spacy_2_or_above.
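In code, the check we are proposing would be something like (a sketch under the assumption above about _meta):

import spacy

def is_spacy_2_or_above(nlp):
    # spacy 1.x Language objects have no ._meta attribute;
    # spacy 2.x+ store the model metadata (incl. version) there.
    return bool(getattr(nlp, '_meta', False))

nlp = spacy.load('en_core_web_sm')
print(is_spacy_2_or_above(nlp))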

Is this acceptable? If not, is there some other path to supporting spacy 2.0+? Thank you!

EDIT: we are happy to work on this, we'd just like to find an approach that you would approve.

spacy 1.9.0

Hi,
The pip install doesn't work, and neither does installing from source.
The problem is the outdated spacy dependency: pip can't build wheels for it, and it throws errors when trying to import errant. Manually updating spacy seems to solve it (I am yet to use errant extensively with this setup, so I might find it did not).
python 3.7.3
spacy 2.2.4
gcc 8.3 if relevant

Implementation issue

Dear Chris :)

I have applied ERRANT to fce.test.m2 and got unexpected results:

=========== Span-Based Correction ============
TP      FP      FN      Prec    Rec     F0.5
2       15503   4547    0.0001  0.0004  0.0002
==============================================

I followed your implementation exactly as in the documentation, as follows:

  1. I applied errant_parallel as errant_parallel -orig m2Scripts/orig_sentes.txt -cor m2Scripts/corec_sentes.txt -out m2Scripts/output.m2, with files as described on GitHub. Actually, I'm confused about the format of the parallel corrected text file.

  2. For errant_m2, I ran errant_m2 -auto m2Scripts/output.m2 -out m2Scripts/auto_output.

  3. The last step was errant_compare, run as errant_compare -hyp m2Scripts/auto_output -ref m2Scripts/fce.test.m2, and the results were:

=========== Span-Based Correction ============
TP      FP      FN      Prec    Rec     F0.5
2       15503   4547    0.0001  0.0004  0.0002
==============================================

Could you please help me fix this issue?

Kind regards
Aiman Solyman

Parallel_to_m2 is not working

Hello Chris, I'm trying to convert my parallel dataset into M2 format, so in a Colab cell I used:

import errant
!errant_parallel -orig D5-src.txt -cor D5-trg.txt -out /out_m2.m2

and the output I got is:
Loading resources... Processing parallel files...

am I doing something wrong?
Note: I am using Google Colab and my dataset is in Arabic language

Merge Casing Issue

Hello Chris,
I am working on Errant for Czech and found the following line problematic:

if start == 0 and (len(o) == 1 and c[0].text[0].isupper()) or \

The issue is that the whole condition is True even when start != 0, whenever (len(c) == 1 and o[0].text[0].isupper()) evaluates to True, because and binds more tightly than or.

In that case, the return on lines 66-67 will omit "the preceding part of combo".

I suppose that the fix is to enforce start == 0 by adding a pair of brackets:

if start == 0 and ((len(o) == 1 and c[0].text[0].isupper()) or \
                    (len(c) == 1 and o[0].text[0].isupper())):

Clarification on Spacy 2.0

In the documentation you mention that spacy 2.0 is less compatible with ERRANT. What is the nature of this incompatibility, and are there any pointers on what can be done to correct for it?

API Quickstart script not working - Please update with fix provided

Hi @chrisjbryant, the API quickstart script below is not working.

import errant

annotator = errant.load('en')
orig = annotator.parse('This are gramamtical sentence .')
cor = annotator.parse('This is a grammatical sentence .')
edits = annotator.annotate(orig, cor)
for e in edits:
    print(e.o_start, e.o_end, e.o_str, e.c_start, e.c_end, e.c_str, e.type)

Error:

OSError: [E050] Can't find model 'en'. It doesn't seem to be a shortcut link, a Python package or a valid path to a data directory.

After python3 -m spacy download en_core_web_sm, it says

Successfully installed en_core_web_sm-2.3.1
✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')
You do not have sufficient privilege to perform this operation.
✘ Couldn't link model to 'en'
Creating a symlink in spacy/data failed. Make sure you have the required
permissions and try re-running the command as admin, or use a virtualenv. You
can still import the model as a module and call its load() method, or create the
symlink manually.
C:\Users\xxx\anaconda3\envs\chat-langchain\lib\site-packages\en_core_web_sm
--> C:\Users\xxx\anaconda3\envs\chat-langchain\lib\site-packages\spacy\data\en
⚠ Download successful but linking failed
Creating a shortcut link for 'en' didn't work (maybe you don't have admin
permissions?), but you can still load the model via its full package name: nlp =
spacy.load('en_core_web_sm')

I had to update the code as below before it worked.

import errant
import spacy
import spacy.cli 

# spacy.cli.download("en_core_web_md")
nlp = spacy.load('en_core_web_md')
annotator = errant.load('en', nlp)
# annotator = errant.load('en_core_web_md')
orig = annotator.parse('This are gramamtical sentence .')
cor = annotator.parse('This is a grammatical sentence .')
edits = annotator.annotate(orig, cor)
for e in edits:
    print(e.o_start, e.o_end, e.o_str, e.c_start, e.c_end, e.c_str, e.type)
