amr's Introduction

Pre- and post-processing scripts for neural sequence-to-sequence AMR parsing

This repository contains a set of scripts for pre- and post-processing in neural AMR parsing. They convert AMR files into a format that sequence-to-sequence models can handle.

The scripts can do the following things:

  • Convert AMRs to single-line format and split AMRs and sentences
  • Remove variables and wiki-links
  • Handle co-referring nodes in different ways
  • Swap AMR branches so that the surface string best matches the word order
  • Put the input files in different character-level formats
  • Restore variables and Wiki-links in the output
  • Restore the co-referring nodes
  • Remove duplicate output

Getting Started

Clone the repository to your system to get started. All Python programs are written for Python 3. Also place Smatch in this folder.

git clone https://github.com/RikVN/AMR
cd AMR
git clone https://github.com/snowblink14/smatch

Prerequisites

All requirements can be installed using pip:

pip install -r requirements.txt

Running the scripts

There are two main components of this repository: pre-processing the input and post-processing the output.

I will explain everything in more detail below, but if you want to test whether everything works, please run:

./test_pipeline.sh

This assumes that smatch is present in the main AMR folder.

Pre-processing

There are four scripts that convert the usual AMR format to a single-line format without variables and Wiki-links. The default one is var_free_amrs.py, which handles coreference by duplicating the co-referring nodes.

python var_free_amrs.py -f sample_input/sample.txt
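As a rough illustration of what "single-line format without variables and Wiki-links" means, here is a minimal sketch. It is not the code in var_free_amrs.py: the function names are hypothetical, and it does not duplicate co-referring nodes (a re-entrant variable such as `:ARG0 h` is simply left in place).

```python
import re


def to_single_line(amr_text):
    """Join a multi-line AMR into one line, skipping '#' metadata lines."""
    parts = [ln.strip() for ln in amr_text.splitlines()]
    return " ".join(p for p in parts if p and not p.startswith("#"))


def strip_wiki(amr):
    """Drop ':wiki "Some_Page"' and ':wiki -' attributes."""
    return re.sub(r'\s*:wiki\s+(?:"[^"]*"|-)', "", amr)


def strip_variables(amr):
    """Rewrite '(v / vote-01' as '(vote-01'."""
    return re.sub(r"\(\s*[^\s()/]+\s*/\s*", "(", amr)


def simplify(amr_text):
    """Hypothetical helper that chains the three steps above."""
    return strip_variables(strip_wiki(to_single_line(amr_text)))
```

Wiki-links are removed before variables so that the variable pattern never has to look inside a quoted Wiki title.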

Two other scripts handle co-reference differently, using either the Absolute Paths method or the Indexing method.

python create_coref_paths.py -f sample_input/sample.txt -p abs
python create_coref_indexing.py -f sample_input/sample.txt

The last script is similar to var_free_amrs.py, but swaps different AMR branches to best match the word order of the sentence.

This script needs the aligned AMRs as input!

python best_amr_permutation.py -f sample_alignment_input/sample.txt

By using the option --double, both the best aligned and original AMR are added to the dataset.

It is also possible to put the files in character-level format. There are options to keep POS-tags (-p) or relations such as :ARG1 and :mod (-s) as single characters. If you used the Absolute Paths or Indexing method in a previous step, please indicate this by using -c.

python char_level_AMR.py -f sample_alignment_input/sample.txt.tf
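To make the idea of a character-level format concrete, here is a minimal sketch. It is not taken from char_level_AMR.py: the function name, the `keep_relations` parameter and the use of "+" as a word-boundary marker are all assumptions made for illustration.

```python
def to_char_level(line, keep_relations=False, boundary="+"):
    """Split a line into space-separated characters.

    Original spaces become the `boundary` symbol (an assumed convention).
    With keep_relations=True, tokens starting with ':' (e.g. ':ARG1')
    are kept whole instead of being split into characters.
    """
    pieces = []
    for token in line.split():
        if keep_relations and token.startswith(":") and len(token) > 1:
            pieces.append(token)            # keep the relation as one unit
        else:
            pieces.append(" ".join(token))  # 'he' -> 'h e'
    return " {} ".format(boundary).join(pieces)
```

For example, `to_char_level("(he :mod even)", keep_relations=True)` yields `( h e + :mod + e v e n )`, which shows how keeping relations whole shortens the sequence the model has to produce.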

Post-processing

The post-processing scripts restore the variables and Wiki-links, and can also restore the co-referring nodes. There are individual scripts for each step, but they are combined in postprocess_AMRs.py.

This script first restores the variables, using a modified restoring script from Didzis Gosko. Then duplicate nodes are pruned (a common problem when parsing) and coreference is put back. The latter applies only to the duplication method; for the Absolute Paths and Indexing methods this is already done in the restoring step.

Finally, Wikipedia links are restored using Spotlight. These steps are done separately (creating .restore, .prune, .coref and .wiki files), but also together (creating .final file).

python postprocess_AMRs.py -f sample_alignment_input/sample.txt.char.tf -s sample_alignment_input/sample.sent
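The variable-restoring step mentioned above can be pictured with the following minimal sketch. It is not the modified restoring script used by postprocess_AMRs.py: the function name and the `v1`, `v2`, ... naming scheme are assumptions, and it only handles well-formed, fully bracketed input.

```python
import re
from itertools import count

# Match either a quoted string (left untouched) or an opening
# bracket followed by a concept name (captured in group 1).
_NODE_OR_STRING = re.compile(r'"[^"]*"|\(\s*([^\s()"]+)')


def restore_variables(amr):
    """Give every '(concept' node a fresh variable: '(v1 / concept'."""
    counter = count(1)

    def _add_variable(match):
        concept = match.group(1)
        if concept is None:          # quoted string, copy as-is
            return match.group(0)
        return "(v{} / {}".format(next(counter), concept)

    return _NODE_OR_STRING.sub(_add_variable, amr)
```

Matching quoted strings in the same pattern keeps brackets inside names or Wiki titles from being mistaken for nodes.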

Here -f is the file to be processed and -s is the sentence file (needed for Wikification). It is possible to use --no_wiki to skip the Wikification step. These options can also be used to process a whole folder (use -fol) in parallel, to speed up the process. Check the script for details.

The AMRs will be in one-line format, i.e. one AMR per line. If you want the more readable AMR format back, run this:

python reformat_single_amrs.py -f sample_input/sample.txt.char.tf.restore.final -e .form
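If you only need a quick look at a single one-line AMR, a minimal sketch of the re-indentation idea is shown below. It is not the code in reformat_single_amrs.py; the function name and the four-space indent are assumptions. Unlike the usual AMR layout, it puts every relation (including :wiki and :op attributes) on its own line.

```python
def indent_amr(line, indent="    "):
    """Break a one-line AMR before each relation, indented by depth."""
    chars = line.strip()
    out, depth, in_quotes = [], 0, False
    for i, ch in enumerate(chars):
        if ch == '"':
            in_quotes = not in_quotes
        if not in_quotes:
            if ch == "(":
                depth += 1
            elif ch == ")":
                depth -= 1
            elif ch == " " and chars[i + 1:i + 2] == ":":
                # Replace the space before a relation with a line break.
                out.append("\n" + indent * depth)
                continue
        out.append(ch)
    return "".join(out)
```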

Silver data

The silver data that I used in the experiments for the CLIN paper can be downloaded here. The silver data was obtained by parsing all sentences in the Groningen Meaning Bank with the parsers CAMR and JAMR. The data folder contains seven files: all CAMR and JAMR parses (1.25 million, aligned with each other) and sets of AMRs (20k, 50k, 75k, 100k, 500k) that were used in our experiments (CAMR only). For more details please see our CLIN paper.

Note that since the Groningen Meaning Bank is public domain, you can freely use these silver data sets in your own experiments. If you do, please cite our CLIN paper and the GMB paper.

Running my best model

I made the best model in the CLIN paper publicly available here. If you download it and have OpenNMT installed, you should be able to run it. Note that the input (SOURCE_FILE) must be POS-tagged and in character-level format. I also made my vocabulary files available; make sure you arrive at (more or less) the same vocabulary for your input! You can run it like this:

th translate.lua -src $SOURCE_FILE -output $OUTPUT_FILE -model $MODEL_FILE -beam_size 5 -max_sent_length 500 -replace_unk -n_best 1 -gpuid 1 -log_file $LOG_FILE -fallback_to_cpu

The output can be post-processed applying the scripts described above. If there are any issues, please let me know!

Papers

Please see the following papers for details. For general AMR parsing methods:

  • Neural Semantic Parsing by Character-based Translation: Experiments with Abstract Meaning Representations, Rik van Noord & Johan Bos, CLiN 2017 Journal. [PDF]

For coreference-specific information:

  • Dealing with Co-reference in Neural Semantic Parsing, Rik van Noord & Johan Bos, Proceedings of the 2nd Workshop on Semantic Deep Learning (SemDeep-2), Montpellier, 2017. [PDF]

About the Author

Rik van Noord, PhD student at University of Groningen, supervised by Johan Bos. Please see my personal website for more information. I'm happy to answer questions.

amr's Issues

No sample.txt.char.tf file

Hi,

Thanks for the great project. The README gives this command:

python postprocess_AMRs.py -f sample_alignment_input/sample.txt.char.tf -s sample_alignment_input/sample.sent

but there is no sample.txt.char.tf in sample_alignment_input?
Thanks in advance!

Missing roles, only colons

Sometimes a produced AMR has a missing role, leaving only a colon:

(vv29traitor / traitor
    :quant (vv29many / many
        :degree (vv29too / too))
    : (vv29compatriot / compatriot
        :poss (vv29we / we)))

P.S. Despite this, smatch converts such AMRs into triples without complaint, which leads to obscure evaluation results. See the related smatch issue.
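One way to catch such output before it reaches smatch is a simple check for a colon that is not followed by a role name. This is only a sketch with a hypothetical function name, not part of this repository or of smatch:

```python
import re

_STRING = re.compile(r'"[^"]*"')
# A colon preceded by whitespace (or the start of the text) and
# followed by whitespace, an opening bracket, or the end of the text.
_EMPTY_ROLE = re.compile(r"(?<!\S):(?=\s|\(|$)")


def has_empty_role(amr):
    """Return True if the AMR contains a bare ':' with no role name."""
    without_strings = _STRING.sub('""', amr)  # ignore colons inside quotes
    return bool(_EMPTY_ROLE.search(without_strings))
```

AMRs for which this returns True could be logged or skipped before evaluation.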

Wiki links of form "X_(Y)"

Wiki links of the form X_(Y) are printed as X_ (Y), with an unwanted extra space.

For example, produced (first) vs. gold (second):

(vv314vote-01 / vote-01 
	:ARG0 (vv314he / he) 
	:ARG1 (vv314bill / bill 
		:mod (vv314political-party / political-party :wiki "Democratic_Party_ (United_States) " 
			:name (v3141 / name :op1 "Dem")) 
		:quant (vv314several / several)) 
	:mod (vv314even / even))
# ::id DF-199-193694-586_5859.15 ::date 2013-08-29T08:01:13 ::annotator SDL-AMR-09 ::preferred
# ::snt He even voted for several Dem bills.
# ::save-date Mon Sep 16, 2013 ::file DF-199-193694-586_5859_15.txt
(v / vote-01
      :ARG0 (h / he)
      :ARG1 (b / bill
            :quant (s / several)
            :mod (p / political-party :wiki "Democratic_Party_(United_States)"
                  :name (n2 / name :op1 "Democratic" :op2 "Party")))
      :mod (e / even))
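Until this is fixed in the Wikification step, a possible workaround is to clean the :wiki values afterwards. The sketch below is an assumption about how that could be done, with a hypothetical function name; it only touches quoted values that directly follow :wiki.

```python
import re

_WIKI = re.compile(r':wiki\s+"([^"]*)"')


def fix_wiki_spacing(amr):
    """Remove stray spaces inside :wiki values such as 'X_ (Y) '."""
    def _clean(match):
        value = match.group(1).strip()           # drop leading/trailing spaces
        value = re.sub(r"_\s+\(", "_(", value)   # 'X_ (Y)' -> 'X_(Y)'
        return ':wiki "{}"'.format(value)

    return _WIKI.sub(_clean, amr)
```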

Error in post-processing

When I run the postprocess_AMRs.py as follows:
python postprocess_AMRs.py -f sample_alignment_input/sample.tf -s sample_alignment_input/sample.sent

I get the following error:
IOError: [Errno 2] No such file or directory: 'sample_alignment_input/sample.tf.restore'

I thought the postprocess file was supposed to create the sample.tf.restore file as well but it looks like it's not doing that. What am I missing here?

Note that here sample.tf refers to the predicted translation file from the OpenNMT model, and sample.sent is the actual sentence in English.
