Abstractive Summarization Guided by Latent Hierarchical Document Structure

Code and materials for the paper "Abstractive Summarization Guided by Latent Hierarchical Document Structure". Part of our code is borrowed from the fairseq implementation of BART. We suggest running the baseline first to get familiar with the whole pipeline.

Basic installations

First, install fairseq:

cd fairseq
pip install --editable ./

Then download the official bart.large checkpoint, which serves as the backbone for HierGNN-BART:

wget https://dl.fbaipublicfiles.com/fairseq/models/bart.large.tar.gz
tar -xzvf bart.large.tar.gz
rm bart.large.tar.gz

Please make sure you are using PyTorch==1.7.
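
A quick way to confirm the installed version is to compare the major and minor parts of the version string. The helper below is a minimal sketch: the name is_pytorch_1_7 is hypothetical, and it assumes that any 1.7.x build (including local builds such as 1.7.1+cu110) satisfies the requirement.

```python
def is_pytorch_1_7(version: str) -> bool:
    """Return True if a version string such as '1.7.1+cu110' is a 1.7.x release."""
    # Drop a local build suffix like '+cu110', then compare major and minor parts.
    parts = version.split("+")[0].split(".")
    return parts[:2] == ["1", "7"]


# The version strings below are illustrative inputs, not real environments.
print(is_pytorch_1_7("1.7.1+cu110"))
print(is_pytorch_1_7("1.13.0"))
```

In a real environment you would pass torch.__version__ to the helper after importing torch.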

Data

Use our data

You can download the data we used from here.

Processing the data yourself (using CNN/DailyMail as an example)

Alternatively, you can first download the original data (in which the source articles are not yet split into sentences) from here. We then use sent_tokenize from nltk to split each source article into sentences and add <cls> between sentences, with the following command:

python3 ssplit.py <input-source-file> <output-processed-file>

For example,

python3 ssplit.py cnndm-raw/train.source cnndm-ssplit/train.source
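
To illustrate what this step produces, here is a minimal sketch of the transformation. It is not the actual ssplit.py: the function names (naive_split, add_cls, process_file), the one-article-per-line input format, and the exact spacing around <cls> are all assumptions. The repository uses nltk's sent_tokenize; the sketch defaults to a naive regex splitter so that it runs without extra dependencies, and accepts any splitter function in its place.

```python
import re
from typing import Callable, List


def naive_split(text: str) -> List[str]:
    """Fallback splitter (assumption): break after '.', '!' or '?' plus whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def add_cls(article: str,
            split: Callable[[str], List[str]] = naive_split,
            sep: str = " <cls> ") -> str:
    """Split one article into sentences and join them with a <cls> marker."""
    return sep.join(split(article))


def process_file(src_path: str, out_path: str,
                 split: Callable[[str], List[str]] = naive_split) -> None:
    """Process a file with one article per line, keeping the line order."""
    with open(src_path, encoding="utf-8") as fin, \
         open(out_path, "w", encoding="utf-8") as fout:
        for line in fin:
            fout.write(add_cls(line.rstrip("\n"), split) + "\n")


print(add_cls("The cat sat. It purred! Did it sleep?"))
# The cat sat. <cls> It purred! <cls> Did it sleep?
```

To match the repository more closely, pass nltk.tokenize.sent_tokenize as the split argument once nltk and its sentence tokenizer data are installed.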

Then apply BPE to all texts in cnndm-ssplit using hie_bpe.sh:

  TASK=cnndm-ssplit
  PROG=fairseq/examples/roberta/multiprocessing_bpe_encoder.py

  # Encode every split (train, val) and side (source, target) with BPE.
  for SPLIT in train val
  do
     for LANG in source target
     do
        python "$PROG" \
           --encoder-json hie_encoder.json \
           --vocab-bpe vocab.bpe \
           --inputs "$TASK/$SPLIT.$LANG" \
           --outputs "$TASK/$SPLIT.bpe.$LANG" \
           --workers 60 \
           --keep-empty
     done
  done
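
Because --keep-empty retains empty lines, the encoded source and target files should stay line-aligned. Before binarizing, it can be worth checking this. The following is a minimal sketch, not part of the repository: the helper names count_lines and check_alignment are hypothetical, and the file layout is taken from the --outputs pattern in the script above.

```python
import os
import tempfile
from typing import Dict, Sequence


def count_lines(path: str) -> int:
    """Count lines in a text file, including empty ones."""
    with open(path, encoding="utf-8") as f:
        return sum(1 for _ in f)


def check_alignment(task_dir: str,
                    splits: Sequence[str] = ("train", "val")) -> Dict[str, int]:
    """Return {split: n_lines}; raise ValueError if source and target differ."""
    counts = {}
    for split in splits:
        n_src = count_lines(os.path.join(task_dir, f"{split}.bpe.source"))
        n_tgt = count_lines(os.path.join(task_dir, f"{split}.bpe.target"))
        if n_src != n_tgt:
            raise ValueError(f"{split}: {n_src} source lines vs {n_tgt} target lines")
        counts[split] = n_src
    return counts


# Tiny demo on throwaway files; the token ids are made up.
with tempfile.TemporaryDirectory() as d:
    for split in ("train", "val"):
        for side in ("source", "target"):
            with open(os.path.join(d, f"{split}.bpe.{side}"), "w", encoding="utf-8") as f:
                f.write("464 3797\n\n1026 13\n")
    print(check_alignment(d))
```

On the real data you would call check_alignment("cnndm-ssplit") after hie_bpe.sh finishes.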

Then binarize the dataset with hie_bin.sh to obtain the binarized data in cnndm-ssplit-bin:

  TASK=cnndm-ssplit
  DICT=checkpoints/dict.source.txt
  fairseq-preprocess \
     --source-lang "source" \
     --target-lang "target" \
     --trainpref "${TASK}/train.bpe" \
     --validpref "${TASK}/val.bpe" \
     --destdir "${TASK}-bin/" \
     --workers 60 \
     --srcdict "$DICT" \
     --tgtdict "$DICT"

Train

The command for training is:

sh hie_train.sh

Valid/Test

The command for inference is:

sh hie_test.sh

Evaluation

For evaluation, we use the ROUGE implementation from google-research, with the following command,

sh hie_eval.sh
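
For readers unfamiliar with the metric, the sketch below shows what ROUGE-N measures. It is deliberately not the google-research implementation used by hie_eval.sh: it is a simplified n-gram overlap F1 with lowercasing and whitespace tokenization, without stemming or ROUGE-L, and the function names are hypothetical. Its scores will differ from the official scorer, so use it only to build intuition.

```python
from collections import Counter
from typing import List


def ngrams(tokens: List[str], n: int) -> Counter:
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_f1(reference: str, candidate: str, n: int = 1) -> float:
    """Simplified ROUGE-N F1 based on clipped n-gram overlap."""
    ref = ngrams(reference.lower().split(), n)
    cand = ngrams(candidate.lower().split(), n)
    # Counter intersection keeps the minimum count of each shared n-gram.
    overlap = sum((ref & cand).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


reference = "the cat sat on the mat"
candidate = "the cat lay on the mat"
print(round(rouge_n_f1(reference, candidate, n=1), 4))  # 0.8333
print(round(rouge_n_f1(reference, candidate, n=2), 4))  # 0.6
```

In the example, five of six unigrams and three of five bigrams are shared, which gives the F1 values of 5/6 and 3/5.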

Released Checkpoints and Outputs

Dataset        Model         ROUGE-1  ROUGE-2  ROUGE-L  Checkpoints  Outputs
CNN/DailyMail  BART
               HierGNN-BART
XSum           BART
               HierGNN-BART
PubMed         BART
               HierGNN-BART

Citation

@inproceedings{qiu2022hiergnn,
    title={Abstractive Summarization Guided by Latent Hierarchical Document Structure},
    author={Yifu Qiu and Shay Cohen},
    booktitle={The 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP 2022)},
    year={2022}
}
