m4cit / machamp-twg-data-augmentation

Data augmentation scripts for the parser MaChAmp-TWG, written as part of my bachelor thesis titled "Data Augmentation for TWG Parsing via Syntactically Well-formed Nonsense Sentences".

License: MIT License

Scripts

1_unimorph_to_conllu.py:

Translates the original UniMorph file into the language / format of the CoNLLU file.

2_improveUnimorph.py:

Looks up each verb from the UniMorph file on dictionary.com, one after another, and classifies it as 'transitive' or 'intransitive'. This is very slow; a dictionary API is preferable if one is available.

3_improveRRG.py:

Checks the transitivity of every verb and adds it to the RRG CoNLLU file.

4_filterForTrain.py:

Filters out all unused words / lines in the RRG CoNLLU file.

generate.py:

Replaces words in the RRG CoNLLU file with randomly chosen ones from the UniMorph file.

augment.py:

Replaces original words in the training file with random ones generated by the module 'generate.py', then augments the training file with the new sentences.
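
The replace-then-augment idea described for generate.py and augment.py can be sketched roughly as follows. This is a minimal illustration and not the repository's code: the function names `replace_words` and `augment`, the (word, tag) sentence representation, the "det" tag, and the tiny lexicon are all assumptions made for the example.

```python
import random


def replace_words(sentence, lexicon, rng):
    """Return a copy of `sentence` in which each word is swapped for a random
    lexicon entry with the same tag; words whose tag is not in the lexicon
    are kept unchanged."""
    new_sentence = []
    for word, tag in sentence:
        candidates = lexicon.get(tag)
        new_word = rng.choice(candidates) if candidates else word
        new_sentence.append((new_word, tag))
    return new_sentence


def augment(sentences, lexicon, passes, seed=0):
    """Keep the original sentences and append `passes` nonsense variants of each."""
    rng = random.Random(seed)
    augmented = list(sentences)
    for _ in range(passes):
        augmented.extend(replace_words(s, lexicon, rng) for s in sentences)
    return augmented


# Hypothetical toy data: (word, tag) pairs; "det" is a made-up tag for the example.
train = [[("the", "det"), ("dog", "nS"), ("barked", "vPst")]]
lexicon = {"nS": ["table", "cloud"], "vPst": ["sang", "melted"]}

result = augment(train, lexicon, passes=1)
```

Because only words with a matching tag are swapped, the tag sequence of each new sentence is identical to that of its original, which is what keeps the nonsense sentences syntactically well-formed in this sketch.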

Requirements

  • Python 3.6 or newer
  • modules from the requirements.txt file

Installation

  1. pip install -r requirements.txt
    
  2. Place all files and folders into the main directory of MaChAmp-TWG.

Input Parameters

Word Replacement Options

--unimorph0:

Replacements from UniMorph; verb replacements are inaccurate with regard to transitivity. In place of --unimorph1, --internal, --supertag, or --original.

--unimorph1:

Replacements from UniMorph; verb replacements are accurate with regard to transitivity. In place of --unimorph0, --internal, --supertag, or --original.

--internal:

Internal word replacements. In place of --unimorph0, --unimorph1, --supertag, or --original.

--supertag:

Internal supertag word replacements. In place of --unimorph0, --unimorph1, --internal, or --original.

--original:

Augmentation with unchanged original sentences. In place of --unimorph0, --unimorph1, --internal, or --supertag.
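
Since each of the five options is used in place of the others, one way to enforce this on the command line is an argparse mutually exclusive group. The sketch below only illustrates that idea; it is an assumption about how such a check could be written, and augment.py itself may handle the options differently (its usage line lists them as independent optional flags). The helper name `build_parser` is hypothetical.

```python
import argparse


def build_parser():
    """Build a parser that accepts at most one word replacement option."""
    parser = argparse.ArgumentParser(prog="augment.py")
    mode = parser.add_mutually_exclusive_group()
    for flag in ("--unimorph0", "--unimorph1", "--internal",
                 "--supertag", "--original"):
        mode.add_argument(flag, action="store_true")
    return parser


# Parsing a single option sets that flag and leaves the others False.
args = build_parser().parse_args(["--unimorph1"])
```

With this setup, passing two of the options at once (for example `--unimorph0 --internal`) makes argparse print an error and exit instead of running.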

General Options

-h, --help

-i, --RRGinput:

(OPTIONAL) Filtered RRG file input. Default file: "rrgparbank/conllu/filtered_acc_en_conllu.conllu".

-o, --RRGoutput:

(OPTIONAL) Filtered RRG file output directory. Default directory: "rrgparbank/conllu".

-t, --tag:

Tags of the words to replace (see the available tags listed below).

-ti, --trainInput:

(OPTIONAL) train.supertags file input. Default file: "experiments/rrgparbank-en/base/train.supertags".

-to, --trainOutput:

(OPTIONAL) train.supertags file output directory. If the directory is not specified, the default directory is used and the filename changes to "new_train.supertags".

-s, --extensionSize:

Extension size of the resulting training file. Must be >= 2. A value of 2 doubles the size (number of sentences) of the base training file and therefore makes 1 run through the file; in general, the number of runs is the -s value minus 1.
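
The relationship between --extensionSize and the number of runs can be written out as a small calculation. This is a sketch under the assumption that every run adds exactly one new sentence per original sentence; the function name `augmentation_plan` and the sentence count of 1000 are hypothetical.

```python
def augmentation_plan(base_sentences, extension_size):
    """Return (runs through the file, sentences in the resulting file)."""
    if extension_size < 2:
        raise ValueError("extension size must be >= 2")
    runs = extension_size - 1
    total_sentences = base_sentences * extension_size
    return runs, total_sentences


print(augmentation_plan(1000, 2))   # (1, 2000)
print(augmentation_plan(1000, 10))  # (9, 10000)
```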

Available tags (--tag) for the replacement task (not for --supertag)

nS: Noun Singular
nP: Noun Plural

aPoss: Adjective Possessive
aCmpr: Adjective Comparative
aSup: Adjective Superlative

vPst: Verb Past Tense
vPresPart: Verb Present Tense, Participle Form
vPstPart: Verb Past Tense, Participle Form

adv (for --internal only): Adverb
advInt (for --internal only): Adverb, Pronominal type: Interrogative
advSup (for --internal only): Adverb Superlative
advCmpr (for --internal only): Adverb Comparative

noun: All nouns
adj: All adjectives
verb: All verbs
all: All available tags
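
The group tags above can be read as shorthand for sets of individual tags. The following sketch shows one way to expand them. The grouping is inferred from the tag names in this list, and the treatment of the adverb tags (included in "all" only when --internal is used) is an assumption; the names `TAG_GROUPS`, `INTERNAL_ONLY`, and `expand_tag` are hypothetical and not taken from the scripts.

```python
TAG_GROUPS = {
    "noun": ["nS", "nP"],
    "adj": ["aPoss", "aCmpr", "aSup"],
    "verb": ["vPst", "vPresPart", "vPstPart"],
}
# Adverb tags are listed as available for --internal only.
INTERNAL_ONLY = ["adv", "advInt", "advSup", "advCmpr"]


def expand_tag(tag, internal=False):
    """Return the list of individual tags that `tag` stands for."""
    base = [t for group in TAG_GROUPS.values() for t in group]
    allowed = base + INTERNAL_ONLY if internal else base
    if tag == "all":
        return allowed
    if tag in TAG_GROUPS:
        return list(TAG_GROUPS[tag])
    if tag in allowed:
        return [tag]
    raise ValueError(f"unknown or unavailable tag: {tag}")


print(expand_tag("verb"))  # ['vPst', 'vPresPart', 'vPstPart']
```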

Usage

augment.py [-h] [--unimorph0] [--unimorph1] [--internal] [--supertag] [--original]
[-i RRGINPUT] [-o RRGOUTPUT] [-t TAG] [-ti TRAININPUT] [-to TRAINOUTPUT] -s EXTENSIONSIZE

Example 1:

python augment.py --unimorph0 --tag all --extensionSize 2

or

python augment.py --unimorph0 -t all -s 2



Example 2:

python augment.py --supertag --extensionSize 10

or

python augment.py --supertag -s 10

Sources

Tatiana Bladier, Kilian Evang, Valeria Generalova, Zahra Ghane, Laura Kallmeyer, Robin Möllemann, Natalia Moors, Rainer Osswald, and Simon Petitjean. 2022. RRGparbank: A Parallel Role and Reference Grammar Treebank. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 4833–4841, Marseille, France. European Language Resources Association.

Kilian Evang, Tatiana Bladier, Laura Kallmeyer, and Simon Petitjean. 2021. Bootstrapping Role and Reference Grammar Treebanks via Universal Dependencies. In Proceedings of the Fifth Workshop on Universal Dependencies (UDW, SyntaxFest 2021), pages 30–48, Sofia, Bulgaria. Association for Computational Linguistics.

Tatiana Bladier, Jakub Waszczuk, and Laura Kallmeyer. 2020. Statistical Parsing of Tree Wrapping Grammars. In Proceedings of the 28th International Conference on Computational Linguistics, pages 6759–6766, Barcelona, Spain (Online). International Committee on Computational Linguistics.

Laura Kallmeyer, Rainer Osswald, and Robert D. Van Valin. 2013. Tree Wrapping for Role and Reference Grammar. In: Morrill, G., Nederhof, MJ. (eds) Formal Grammar. FG 2012, FG 2013. Lecture Notes in Computer Science, vol 8036. Springer, Berlin, Heidelberg.

UniMorph
