
tianjianl / long-summarization


This project forked from armancohan/long-summarization


Resources for the NAACL 2018 paper "A Discourse-Aware Attention Model for Abstractive Summarization of Long Documents"

License: Apache License 2.0

Shell 0.19% Python 99.81%


A Discourse-Aware Attention Model for Abstractive Summarization of Long Documents

This repository contains data and code for the NAACL 2018 paper "A Discourse-Aware Attention Model for Abstractive Summarization of Long Documents". Please note that the code is not actively maintained.

Data

Two datasets of long, structured documents (scientific papers) are provided. The datasets are obtained from the ArXiv and PubMed OpenAccess repositories.

Get the datasets

ArXiv dataset: Download (mirror)
PubMed dataset: Download (mirror)

The datasets are rather large. You need about 5 GB of disk space to download them and about 15 GB of additional space to extract the files. Each tar file contains 4 files: train.txt, val.txt, and test.txt correspond to the training, validation, and test sets, respectively, and the vocab file is a plaintext file containing the vocabulary.
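As a rough sketch of the extraction step, the helper below unpacks one downloaded archive with the standard-library `tarfile` module and checks that the three split files named above are present. The function name `extract_dataset`, the archive path, and the destination directory are hypothetical; the internal folder layout of the archives is not documented here, so the helper searches the extracted tree rather than assuming a fixed location.

```python
import tarfile
from pathlib import Path

# Split file names as listed in this README.
EXPECTED_SPLITS = ("train.txt", "val.txt", "test.txt")


def extract_dataset(archive_path, dest_dir):
    """Extract a dataset archive and return {split file name: extracted Path}.

    Raises FileNotFoundError if any expected split file is missing.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)

    # tarfile.open() detects the compression scheme automatically.
    with tarfile.open(archive_path) as tar:
        tar.extractall(dest)

    # Locate the split files wherever they landed in the extracted tree.
    found = {}
    for path in dest.rglob("*.txt"):
        if path.name in EXPECTED_SPLITS:
            found[path.name] = path

    missing = [name for name in EXPECTED_SPLITS if name not in found]
    if missing:
        raise FileNotFoundError(f"missing split files: {missing}")
    return found
```

A call might look like `extract_dataset("arxiv.tar.gz", "data/arxiv")`, where both arguments are placeholder paths. Given the sizes mentioned above, make sure the destination has enough free space before extracting.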

Format of the data

The files are in jsonlines format, where each line is a JSON object corresponding to one scientific paper from ArXiv or PubMed. The abstract, sections, and body are all sentence tokenized. The JSON objects have the following format:

{ 
  'article_id': str,
  'abstract_text': List[str],
  'article_text': List[str],
  'section_names': List[str],
  'sections': List[List[str]]
}
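To illustrate the format, here is a minimal sketch that streams a jsonlines file one paper at a time and reports a few sizes per paper. It relies only on the field names listed above; the helper names `iter_papers` and `summarize_paper` are hypothetical and are not part of this repository's code.

```python
import json


def iter_papers(path):
    """Yield one dict per non-empty line of a jsonlines file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # tolerate blank lines
                yield json.loads(line)


def summarize_paper(paper):
    """Return basic sentence counts for one paper record."""
    return {
        "article_id": paper["article_id"],
        "num_abstract_sentences": len(paper["abstract_text"]),
        "num_body_sentences": len(paper["article_text"]),
        # Pair each section name with the number of sentences in that section.
        "section_sizes": [
            (name, len(sentences))
            for name, sentences in zip(paper["section_names"], paper["sections"])
        ],
    }
```

Because the files are large, iterating line by line keeps memory use flat. For example, `next(iter_papers("val.txt"))` would load only the first record, assuming `val.txt` is in the current directory.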

Tensorflow datasets

The dataset is also available on TensorFlow Datasets, which makes it easy to use within TensorFlow or Colab.

Code

The code is based on the pointer-generator network code by See et al. (2017). Refer to their repo for documentation about the structure of the code. You will need Python 3.6 and TensorFlow 1.5 to run the code. The code might run with later versions of TensorFlow, but this has not been tested. Check out the other dependencies in the requirements.txt file.

A small sample of the dataset is already provided in this repo. To run the code with the sample data, unzip the files in the data directory and execute the run script: ./run.sh.

To train the model on the entire dataset, first convert the jsonlines files to binary using the script scripts/json_to_bin.py, then modify the corresponding training data path in the run.sh script.

Citing

If you find this paper or repo useful, please cite:

"A Discourse-Aware Attention Model for Abstractive Summarization of Long Documents"  
Arman Cohan, Franck Dernoncourt, Doo Soon Kim, Trung Bui, Seokhwan Kim, Walter Chang, and Nazli Goharian  
NAACL-HLT 2018

Another relevant reference is the pointer-generator network by See et al. (2017):

"Get to the point: Summarization with pointer-generator networks."  
Abigail See, Peter J. Liu, and Christopher D. Manning.  
ACL (2017).
