nusnlp / esc

The official code of the "Frustratingly Easy System Combination for Grammatical Error Correction" paper

Home Page: https://aclanthology.org/2022.naacl-main.143/

License: GNU General Public License v3.0

Macaulay2 99.85% Python 0.15%
deep-learning grammatical-error-correction natural-language-processing pytorch

esc's Introduction

Frustratingly Easy System Combination for Grammatical Error Correction

This repository provides the code to easily combine Grammatical Error Correction (GEC) models to produce better predictions using only the models' outputs, as reported in this paper:

Frustratingly Easy System Combination for Grammatical Error Correction
Muhammad Reza Qorib, Seung-Hoon Na, and Hwee Tou Ng
2022 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL) (PDF)

Update

ESC can work with GRECO to produce a more accurate combination. To get the score of each edit, use the argument --score instead of --test when combining the base systems' output.

Installation

This code should be run with Python 3.6, because the ERRANT version used in the BEA-2019 shared task (v2.0.0) is not compatible with Python >= 3.7.

Install this code's dependencies by running:

pip install -r requirements.txt
python -m spacy download en
wget https://www.comp.nus.edu.sg/~nlp/sw/m2scorer.tar.gz
tar -xf m2scorer.tar.gz

Note that you may need to customize your PyTorch installation depending on your CUDA version; see the PyTorch installation instructions. The code may also work with torch < 1.9.0, as only simple PyTorch functions are used.

Reproducing the paper's result

For the CoNLL-2014 experiment, run: export EXP_DIR=conll-exp

For the BEA-2019 experiment, run: export EXP_DIR=bea-exp

  1. Get the model output:
python run.py --test --data_dir $EXP_DIR/test-text --m2_dir $EXP_DIR/test-m2 --model_path $EXP_DIR/models/paper_model.pt --vocab_path $EXP_DIR/paper_vocab.idx --output_path $EXP_DIR/outputs/test.out
  2. Evaluate the test prediction as described in the Evaluation section, replacing test_output with $EXP_DIR/outputs/test.out.

Retraining the experiments in the paper

For the CoNLL-2014 experiment, run: export EXP_DIR=conll-exp

For the BEA-2019 experiment, run: export EXP_DIR=bea-exp

  1. Run the training command:
python run.py --train --data_dir $EXP_DIR/dev-text --m2_dir $EXP_DIR/dev-m2 --model_path $EXP_DIR/models --vocab_path $EXP_DIR/vocab.idx
  2. Get the prediction on BEA-2019 Dev:
python run.py --test --data_dir $EXP_DIR/dev-text --m2_dir $EXP_DIR/dev-m2 --model_path $EXP_DIR/models/model.pt --vocab_path $EXP_DIR/vocab.idx --output_path $EXP_DIR/outputs/dev.out
  3. Get the F0.5 development score:
errant_parallel -ori $EXP_DIR/dev-text/source.txt -cor $EXP_DIR/outputs/dev.out -out $EXP_DIR/outputs/dev.m2
errant_compare -ref bea-full-valid.m2 -hyp $EXP_DIR/outputs/dev.m2
  4. Get the test prediction:
python run.py --test --data_dir $EXP_DIR/test-text --m2_dir $EXP_DIR/test-m2 --model_path $EXP_DIR/models/model.pt --vocab_path $EXP_DIR/vocab.idx --output_path $EXP_DIR/outputs/test.out
  5. Evaluate the test prediction as described in the Evaluation section, replacing test_output with $EXP_DIR/outputs/test.out.
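
For readers unfamiliar with the F0.5 development score that errant_compare reports above, the sketch below shows how the metric is computed from edit-level counts. It is an illustration only, not code from this repository: the function name f_beta and the handling of empty counts are assumptions made for the example.

```python
def f_beta(tp, fp, fn, beta=0.5):
    """Compute F_beta from true-positive, false-positive and false-negative edit counts.

    With beta = 0.5, precision is weighted more heavily than recall.
    """
    # Assumed convention: an empty denominator counts as a perfect score.
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 1.0
    if precision + recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


if __name__ == "__main__":
    # 6 correct edits, 2 spurious edits, 6 missed edits:
    # precision = 0.75, recall = 0.5
    print(round(f_beta(6, 2, 6), 4))
```

Because beta is below 1, a combination that proposes fewer but more reliable edits can score higher than one with better recall and worse precision.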

Evaluation

  • For CoNLL-2014 (requires Python 2.x):
python2 m2scorer/scripts/m2scorer.py test_output conll14st-test-corrected.m2

Combining your own systems

The simplest way is:

  • Create a new experiment directory, then go inside this directory.
  • Put your base systems' output on BEA-2019 Dev in a folder called dev-text. Please also copy the source.txt and target.txt from the bea-exp/dev-text folder to this new dev-text folder.
  • Put your base systems' output on the test set in a folder called test-text. Please also put the source sentences of the dataset you are testing with inside the folder, under the name source.txt.
  • Create the models and outputs folders. At this point, make sure your folder structure is similar to the contents of bea-exp or conll-exp, with the exception of dev-m2 and test-m2 (the code will generate these folders automatically).
  • Go back to the parent directory and follow the guide above, with $EXP_DIR replaced by your new folder name.
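
The folder layout described in the steps above can be prepared with a short script. This is a minimal sketch, not part of the repository: make_experiment_dir and missing_inputs are hypothetical helper names, and the folder name my-exp is only an example.

```python
from pathlib import Path

# Sub-folders named in the steps above. dev-m2 and test-m2 are left out
# because the README says the code generates them.
SUBDIRS = ("dev-text", "test-text", "models", "outputs")

# Files the steps above ask you to provide yourself.
REQUIRED_FILES = (
    "dev-text/source.txt",
    "dev-text/target.txt",
    "test-text/source.txt",
)


def make_experiment_dir(root):
    """Create the experiment folder skeleton and return its path."""
    exp = Path(root)
    for sub in SUBDIRS:
        (exp / sub).mkdir(parents=True, exist_ok=True)
    return exp


def missing_inputs(root):
    """List the required files that have not been copied in yet."""
    exp = Path(root)
    return [name for name in REQUIRED_FILES if not (exp / name).is_file()]


if __name__ == "__main__":
    exp = make_experiment_dir("my-exp")  # example folder name
    for name in missing_inputs(exp):
        print("still missing:", exp / name)
```

The base systems' output files still have to be copied into dev-text and test-text by hand; the script only checks for the source and target files.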

If you want to customize your experiment setup, please note:

  • The code will index all files in the --data_dir folder as base systems, except the source file (the default filename is source.txt) and the target file (the default filename is target.txt).
  • The code will only read the contents of --m2_dir, not --data_dir. The code will index the files in --data_dir and look for the file with the same basename in --m2_dir. If --m2_dir does not exist, the code will generate the directory and its contents from the contents of --data_dir. Thus, if you change the contents of --data_dir after --m2_dir was generated, please remove the corresponding file in --m2_dir or delete the whole --m2_dir.
  • The file names of the training files and the testing files have to be the same. The file names and the ordering are stored in the vocab file.
  • When you run the testing, make sure you run the prediction with the correct model and the correct vocab file. Both files depend on the base systems you are combining.
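
The file-naming rules in the notes above can be checked before training or testing. The sketch below is an assumption-laden illustration rather than the logic in run.py: list_base_systems and check_same_systems are hypothetical helpers, and sorting the names is a choice made here, whereas the real ordering is whatever the vocab file stores.

```python
from pathlib import Path


def list_base_systems(data_dir, source_name="source.txt", target_name="target.txt"):
    """Return the file names in data_dir that would count as base systems.

    Every file except the source and target files is treated as a base system.
    """
    skipped = {source_name, target_name}
    return sorted(
        p.name for p in Path(data_dir).iterdir()
        if p.is_file() and p.name not in skipped
    )


def check_same_systems(train_dir, test_dir):
    """Raise ValueError if the two folders do not hold the same base system files."""
    train = list_base_systems(train_dir)
    test = list_base_systems(test_dir)
    if train != test:
        only_train = sorted(set(train) - set(test))
        only_test = sorted(set(test) - set(train))
        raise ValueError(
            f"base systems differ: only in training {only_train}, "
            f"only in testing {only_test}"
        )
    return train
```

For example, check_same_systems("my-exp/dev-text", "my-exp/test-text") returns the shared list of base system files, or reports which names appear in only one of the two folders.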

License

The source code and models in this repository are licensed under the GNU General Public License Version 3 (see License). For commercial use of this code and models, separate commercial licensing is also available. Please contact Hwee Tou Ng ([email protected])

esc's People

Contributors

mrqorib

esc's Issues

Reproducing paper results fails

Trying to reproduce the paper results crashes the script.
Running: export EXP_DIR=conll-exp
and then: python run.py --test --data_dir $EXP_DIR/test-text --m2_dir $EXP_DIR/test-m2 --model_path $EXP_DIR/models/paper_model.pt --vocab_path $EXP_DIR/paper_vocab.idx --output_path $EXP_DIR/outputs/test.out

yields the following error:

Traceback (most recent call last):
  File "run.py", line 542, in <module>
    main(args)
  File "run.py", line 516, in main
    with open(args.output_path, 'w', encoding='utf-8') as out:
FileNotFoundError: [Errno 2] No such file or directory: 'conll-exp/outputs/test.out'

Reproducing the BEA-2019 experiments and retraining work fine.
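
The traceback points at the open(args.output_path, 'w', ...) call, which raises FileNotFoundError when the parent folder of the output file does not exist. Assuming that is the cause here (the thread does not confirm that conll-exp/outputs is missing), creating the folder before running the command should avoid the crash. A minimal sketch of the same idea in Python, where open_output is a hypothetical helper and not a function in run.py:

```python
import os


def open_output(output_path):
    """Open output_path for writing, creating its parent folder if needed."""
    parent = os.path.dirname(output_path)
    if parent:
        # exist_ok=True makes this a no-op when the folder is already there.
        os.makedirs(parent, exist_ok=True)
    return open(output_path, "w", encoding="utf-8")


if __name__ == "__main__":
    with open_output("conll-exp/outputs/test.out") as out:
        out.write("")  # placeholder write for the example
```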

Usage for different languages

Hi, I'm trying to use this GitHub repo to evaluate GEC tasks on languages other than English. I have the .m2 files, but some languages don't have tools like ERRANT. In that case, is it still possible to run the ensemble method? If so, what are the necessary files?

Thank you!!

Training data, Data Format

Hello,
I have two questions:

  1. I saw the dev-text folder, and it has more than one .txt file. I know about the source/target files (they hold the incorrect/correct pairs, right?), but what about the other files that have the same sentences with different order/grammar?

  2. From what I have found, the training phase needs the source/target files (incorrect/correct sentences), but what about the M2 format? Is it just for enhancing the model, or for something else?
