

How "Multi" is Multi-Document Summarization?

This repo contains the code for the paper How "Multi" is Multi-Document Summarization? (EMNLP 2022).

Setup

conda create --name multi_mds python=3.8 
conda activate multi_mds 
pip install -r requirements.txt 

If the setup fails on jsonnet, see this issue.

Preprocessing data

Preprocess your dataset into JSONL format, where each line includes the following fields:

  • document: a list of source documents
  • summary: a list of reference summaries
  • topic_id: instance id
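
As a minimal sketch of the format described above, the snippet below builds one instance and checks that it has the three fields. The helper `validate_record`, the example texts, and the id `topic_0` are hypothetical and not part of this repo; whether each document is stored as a plain string is also an assumption.

```python
import json

REQUIRED_FIELDS = ("document", "summary", "topic_id")


def validate_record(record):
    """Return a list of problems found in one parsed JSONL record."""
    problems = []
    for field in REQUIRED_FIELDS:
        if field not in record:
            problems.append(f"missing field: {field}")
    # Both document and summary are described as lists in the README.
    for field in ("document", "summary"):
        if field in record and not isinstance(record[field], list):
            problems.append(f"{field} must be a list")
    return problems


# Hypothetical instance; the texts and the id are invented for illustration.
example = {
    "document": ["First source document.", "Second source document."],
    "summary": ["A reference summary."],
    "topic_id": "topic_0",
}

line = json.dumps(example)  # one instance per line in the jsonl file
print(validate_record(json.loads(line)))  # prints []
```

Running a check like this over every line before the first pipeline step may catch malformed instances early.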

Compute the AAC score and curves

There are several steps for computing the AAC score:

  1. Extract the Open IE tuples from all source documents and the summary.
  2. Prepare pairs of Open IE tuples.
  3. Compute alignment scores between source and summary propositions for each topic.
  4. Greedily build the maximally covering subsets of source documents.
  5. Compute the Area Above the Curve (AAC) and save the coverage plot.

You can run a single command that computes all steps and skips those already completed (edit the paths of raw_data_dir and process_dir):

bash run.sh [preprocessed_data] [dir_path] 

Alternatively, you can run each step separately, as follows:

  1. Extract all Open IE tuples from the summary and the source documents.
export raw_data= # path to jsonl file 
export data_dir= # output dir

python extract_open_ie.py --raw_data $raw_data \
                          --data_dir $data_dir \
                          --gpu 0 

This script will create a directory $data_dir/oie with the propositions from the summary and the documents.

  2. Prepare pairs:
python prepare_oie_pairs.py --data_dir $data_dir

This script will create a file $data_dir/pairs.pickle with all possible pairs of Open IE propositions.
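
To illustrate what "all possible pairs" could mean, here is a minimal sketch that pairs every source proposition with every summary proposition of one topic. It assumes pairs are formed between source and summary propositions only; the function `all_pairs`, its input layout, and the tuple layout are hypothetical, and the actual contents of pairs.pickle may be organized differently.

```python
from itertools import product


def all_pairs(doc_props, summary_props):
    """Pair every source proposition with every summary proposition.

    doc_props: dict mapping a document index to its list of propositions.
    summary_props: list of summary propositions.
    Returns tuples (doc_index, source_prop, summary_index, summary_prop).
    """
    pairs = []
    for doc_idx, props in doc_props.items():
        for src, (sum_idx, tgt) in product(props, enumerate(summary_props)):
            pairs.append((doc_idx, src, sum_idx, tgt))
    return pairs


# Toy propositions, invented for illustration.
doc_props = {0: ["a", "b"], 1: ["c"]}
summary_props = ["x", "y"]

pairs = all_pairs(doc_props, summary_props)
print(len(pairs))  # prints 6
print(pairs[0])    # prints (0, 'a', 0, 'x')
```

The number of pairs grows with the product of the two proposition counts, which is likely why the alignment step that follows is batched and run on GPUs.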

  3. Compute alignment scores between source and summary propositions for each topic:
python get_superpal_scores.py --data_dir $data_dir \
                              --model biu-nlp/superpal \
                              --device_ids 0,1,2,3 \
                              --batch_size 64

This script will run the alignment model on $data_dir/pairs.pickle and save the results in the directory $data_dir/result_npy.
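
One way such scores might be turned into per-document coverage is sketched below. This is an assumption for illustration, not the repo's code: the function `covered_by_document`, the score layout, and the 0.5 threshold are all hypothetical, and the files in result_npy may be structured differently.

```python
def covered_by_document(scores, threshold=0.5):
    """Map each document index to the set of summary propositions it covers.

    scores: dict mapping (doc_index, summary_index) to a list of alignment
            scores, one per source proposition in that document.
    threshold: hypothetical cut-off above which a pair counts as aligned.
    """
    coverage = {}
    for (doc_idx, sum_idx), values in scores.items():
        coverage.setdefault(doc_idx, set())
        # A summary proposition counts as covered by a document if its
        # best-scoring source proposition reaches the threshold.
        if max(values, default=0.0) >= threshold:
            coverage[doc_idx].add(sum_idx)
    return coverage


# Toy scores, invented for illustration.
scores = {
    (0, 0): [0.9, 0.1],
    (0, 1): [0.2],
    (1, 1): [0.7],
    (1, 2): [0.4, 0.6],
    (2, 0): [],
}
print(covered_by_document(scores))  # prints {0: {0}, 1: {1, 2}, 2: set()}
```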

  4. Build greedy subsets of documents that maximize coverage:
python build_greedy_subsets.py --data_dir $data_dir 
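
The greedy idea behind this step can be sketched as follows, assuming each document is already mapped to the set of summary propositions it covers. The function `greedy_coverage_curve`, the nested (one document added at a time) subsets, and the tie-breaking rule are assumptions for illustration; build_greedy_subsets.py may work differently.

```python
def greedy_coverage_curve(doc_coverage, num_summary_props):
    """Greedily order documents by marginal coverage of summary propositions.

    doc_coverage: dict mapping a document index to the set of summary
                  proposition indices that document covers.
    Returns (order, curve), where curve[k - 1] is the fraction of summary
    propositions covered by the first k documents in order.
    """
    if num_summary_props <= 0:
        raise ValueError("num_summary_props must be positive")
    remaining = dict(doc_coverage)
    covered = set()
    order, curve = [], []
    while remaining:
        # Pick the document that adds the most uncovered propositions;
        # ties go to the smallest document index.
        best = min(remaining, key=lambda d: (-len(remaining[d] - covered), d))
        covered |= remaining.pop(best)
        order.append(best)
        curve.append(len(covered) / num_summary_props)
    return order, curve


# Toy coverage sets, invented for illustration.
doc_coverage = {0: {0, 1}, 1: {1, 2, 3}, 2: {0}}
order, curve = greedy_coverage_curve(doc_coverage, num_summary_props=4)
print(order, curve)  # prints [1, 0, 2] [0.75, 1.0, 1.0]
```

Greedy selection is a standard approximation for maximum coverage; it does not guarantee the optimal subset at every size.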

  5. Compute the AAC score and save the plot to $data_dir/plot.png:
python get_aac_scores.py --data_dir $data_dir
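
As a rough illustration of an "area above the curve", the sketch below averages the gap between full coverage and the coverage reached with 1, 2, ..., n documents. The function `area_above_curve` and this Riemann-sum normalization are assumptions; the definition used in the paper and in get_aac_scores.py may differ, and plotting is omitted here.

```python
def area_above_curve(curve):
    """Mean gap between full coverage (1.0) and the coverage curve.

    curve: list where curve[k - 1] is the fraction of summary propositions
           covered by the best k-document subset, for k = 1..n.
    """
    if not curve:
        raise ValueError("curve must contain at least one point")
    return sum(1.0 - c for c in curve) / len(curve)


# Toy coverage curve for a topic with three documents, invented for illustration.
curve = [0.75, 1.0, 1.0]
print(round(area_above_curve(curve), 4))  # prints 0.0833
```

Under this reading, a larger value means more documents are needed to cover the summary, while a value of 0 means a single document already covers everything.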

Citation

@inproceedings{Wolhandler2022HowI,
  title={How "Multi" is Multi-Document Summarization?},
  author={Ruben Wolhandler and Arie Cattan and Ori Ernst and Ido Dagan},
  booktitle={EMNLP},
  year={2022}
}
