
summarization-datasets's Introduction

summarization-datasets

Preprocessing and, in some cases, downloading of datasets for the paper "Content Selection in Deep Learning Models of Summarization."

Requires Python 3.6 or greater.

To install, run:

$ python setup.py install

If you haven't installed spaCy in your current environment before, you should also run:

python -m spacy download en
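
To confirm the model is available, you can try loading it. A minimal sketch, assuming the legacy "en" shortcut installed by the command above (spaCy 2.x; newer spaCy releases use different model names):

import spacy

nlp = spacy.load("en")  # same shortcut name used by the download command above
print([token.text for token in nlp("This is a test.")])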

It is also a good idea to set the number of OMP threads to a small value, e.g.:

export OMP_NUM_THREADS=2
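
If you would rather set this from Python, assigning the environment variable before importing any numerical libraries generally has the same effect. This is a sketch, not something the preprocessing scripts do themselves:

import os

# Must be set before numpy/torch/spacy are imported, or it may have no effect.
os.environ["OMP_NUM_THREADS"] = "2"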

CNN/DailyMail Dataset

To run:

python summarization-datasets/preprocess_cnn_dailymail.py \
    --data-dir data/

This will create the CNN/DM data in a directory data/cnn-dailymail. This dataset is quite large and will take a while to preprocess. Grab a coffee!
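
For a quick sanity check that preprocessing finished, you can count the files under the output directory. A minimal sketch, assuming only that the output lands under data/cnn-dailymail as stated above:

from pathlib import Path

root = Path("data/cnn-dailymail")
for subdir in sorted(d for d in root.rglob("*") if d.is_dir()):
    n_files = sum(1 for f in subdir.iterdir() if f.is_file())
    print(subdir, n_files)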

NYT Dataset

You must obtain the raw documents for this dataset from the LDC: https://catalog.ldc.upenn.edu/LDC2008T19. Assuming you have the original NYT tar file in a directory called raw_data, run the following:

cd raw_data
tar zxvf nyt_corpus_LDC2008T19.tgz
cd ..
python summarization-datasets/preprocess_nyt.py \
    --nyt raw_data/nyt_corpus \
    --data-dir data

This will create preprocessed NYT data in data/nyt/.

DUC Dataset

To obtain this data, first sign the release forms and email NIST (details here: https://duc.nist.gov/data.html).

You should obtain from NIST two files for the 2001/2002 data, along with a username and password. Assuming you have the NIST data in a folder called raw_data, you should have the following:

raw_data/DUC2001_Summarization_Documents.tgz
raw_data/DUC2002_Summarization_Documents.tgz

You will also need to download additional data from NIST, which you can do using a script that will be in your bin directory after installation:

$ duc2002-test-data.sh USERNAME PASSWORD raw_data

where USERNAME and PASSWORD are the credentials NIST gave you to access their website data. This should create the file raw_data/DUC2002_test_data.tar.gz.
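
Before preprocessing, it may be worth confirming the download completed and is a readable archive. A minimal check with the standard library (the path is the one created by the script above):

import tarfile

# True if the downloaded file is a readable tar archive.
print(tarfile.is_tarfile("raw_data/DUC2002_test_data.tar.gz"))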

Now run the preprocessing scripts:

python summarization-datasets/preprocess_duc_sds.py \
    --duc2001 raw_data/DUC2001_Summarization_Documents.tgz \
    --duc2002-documents raw_data/DUC2002_Summarization_Documents.tgz \
    --duc2002-summaries raw_data/DUC2002_test_data.tar.gz \
    --data-dir data

This will create preprocessed DUC data in data/duc-sds/.

Reddit Dataset

To run:

python summarization-datasets/preprocess_reddit.py \
    --data-dir data/

This will create the Reddit data in a directory data/reddit.

AMI Dataset

To run:

python summarization-datasets/preprocess_ami.py \
    --data-dir data/

This will create the AMI data in a directory data/ami.

PubMed Dataset

To run:

python summarization-datasets/preprocess_pubmed.py \
    --data-dir data/

This will create the PubMed data in a directory data/pubmed.
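
After running all of the preprocessing scripts, you can verify that each expected output directory exists. This sketch uses only the directory names stated in the sections above:

from pathlib import Path

# Output directories named in the sections above.
for name in ["cnn-dailymail", "nyt", "duc-sds", "reddit", "ami", "pubmed"]:
    d = Path("data") / name
    print(d, "ok" if d.is_dir() else "missing")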

summarization-datasets's People

Contributors

kedz

summarization-datasets's Issues

error when running python preprocess_reddit.py --data-dir data/

I ran the command "python preprocess_reddit.py --data-dir data/" to get the Reddit data, but I get the error "Cannot open exception db file for reading: C:\Users\17674\AppData\Local\Python-Eggs\Python-Eggs\Cache\rouge_papier-0.0.1-py3.6.egg-tmp\rouge_papier\rouge_data/WordNet-2.0.exc.db".
I have been stuck on this error for several days. Can you guide me to solving it?

(datannsum) C:\Users\17674\Desktop\MultiModalSummary\summarization-datasets-master>python preprocess_reddit.py --data-dir data/
1145229 / 1145229
Writing train abstracts to: data\reddit\human-abstracts\train
Writing valid abstracts to: data\reddit\human-abstracts\valid
Writing test abstracts to: data\reddit\human-abstracts\test
Writing train extracts to: data\reddit\human-extracts\train
Writing valid extracts to: data\reddit\human-extracts\valid
Writing test extracts to: data\reddit\human-extracts\test
SpawnPoolWorker-2: Ready!
SpawnPoolWorker-7: Ready!
SpawnPoolWorker-8: Ready!
SpawnPoolWorker-3: Ready!
SpawnPoolWorker-4: Ready!
SpawnPoolWorker-5: Ready!
SpawnPoolWorker-6: Ready!
SpawnPoolWorker-1: Ready!
Cannot open exception db file for reading: C:\Users\17674\AppData\Local\Python-Eggs\Python-Eggs\Cache\rouge_papier-0.0.1-py3.6.egg-tmp\rouge_papier\rouge_data/WordNet-2.0.exc.db
Cannot open exception db file for reading: C:\Users\17674\AppData\Local\Python-Eggs\Python-Eggs\Cache\rouge_papier-0.0.1-py3.6.egg-tmp\rouge_papier\rouge_data/WordNet-2.0.exc.db
Cannot open exception db file for reading: C:\Users\17674\AppData\Local\Python-Eggs\Python-Eggs\Cache\rouge_papier-0.0.1-py3.6.egg-tmp\rouge_papier\rouge_data/WordNet-2.0.exc.db
Cannot open exception db file for reading: C:\Users\17674\AppData\Local\Python-Eggs\Python-Eggs\Cache\rouge_papier-0.0.1-py3.6.egg-tmp\rouge_papier\rouge_data/WordNet-2.0.exc.db
Cannot open exception db file for reading: C:\Users\17674\AppData\Local\Python-Eggs\Python-Eggs\Cache\rouge_papier-0.0.1-py3.6.egg-tmp\rouge_papier\rouge_data/WordNet-2.0.exc.db
Cannot open exception db file for reading: C:\Users\17674\AppData\Local\Python-Eggs\Python-Eggs\Cache\rouge_papier-0.0.1-py3.6.egg-tmp\rouge_papier\rouge_data/WordNet-2.0.exc.db
Cannot open exception db file for reading: C:\Users\17674\AppData\Local\Python-Eggs\Python-Eggs\Cache\rouge_papier-0.0.1-py3.6.egg-tmp\rouge_papier\rouge_data/WordNet-2.0.exc.db
Cannot open exception db file for reading: C:\Users\17674\AppData\Local\Python-Eggs\Python-Eggs\Cache\rouge_papier-0.0.1-py3.6.egg-tmp\rouge_papier\rouge_data/WordNet-2.0.exc.db
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "C:\Users\17674\anaconda3\envs\datannsum\lib\multiprocessing\pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "C:\Users\17674\Desktop\MultiModalSummary\summarization-datasets-master\preprocess_reddit.py", line 148, in worker
    ext_labels = {"id": story_id, "labels": get_labels(example, ext_paths)}
  File "C:\Users\17674\Desktop\MultiModalSummary\summarization-datasets-master\preprocess_reddit.py", line 114, in get_labels
    remove_stopwords=True, length=75)
  File "C:\Users\17674\anaconda3\envs\datannsum\lib\site-packages\rouge_papier-0.0.1-py3.6.egg\rouge_papier\generate.py",line 15, in compute_extract
    length_unit=length_unit, remove_stopwords=remove_stopwords), None
  File "C:\Users\17674\anaconda3\envs\datannsum\lib\site-packages\rouge_papier-0.0.1-py3.6.egg\rouge_papier\generate.py",line 127, in compute_greedy_sequential_extract
    length_unit=length_unit, remove_stopwords=remove_stopwords)
  File "C:\Users\17674\anaconda3\envs\datannsum\lib\site-packages\rouge_papier-0.0.1-py3.6.egg\rouge_papier\wrapper.py", line 58, in compute_rouge
    output = check_output(" ".join(args), shell=True).decode("utf8")
  File "C:\Users\17674\anaconda3\envs\datannsum\lib\subprocess.py", line 356, in check_output
    **kwargs).stdout
  File "C:\Users\17674\anaconda3\envs\datannsum\lib\subprocess.py", line 438, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command 'perl C:\Users\17674\AppData\Local\Python-Eggs\Python-Eggs\Cache\rouge_papier-0.0.1-py3.6.egg-tmp\rouge_papier\data\ROUGE-1.5.5.pl -e C:\Users\17674\AppData\Local\Python-Eggs\Python-Eggs\Cache\rouge_papier-0.0.1-py3.6.egg-tmp\rouge_papier\rouge_data -a -n 1 -x -d -m -s -l 75 -r 1000 -f A -z SPL C:\Users\17674\AppData\Local\Temp\tmpyt4kke18\tmpcd6ql1xn' returned non-zero exit status 255.
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "preprocess_reddit.py", line 262, in <module>
    main()
  File "preprocess_reddit.py", line 221, in main
    pool)
  File "preprocess_reddit.py", line 255, in make_dataset
    for i, _ in enumerate(pool.imap(worker, story_iter()), 1):
  File "C:\Users\17674\anaconda3\envs\datannsum\lib\multiprocessing\pool.py", line 735, in next
    raise value
subprocess.CalledProcessError: Command 'perl C:\Users\17674\AppData\Local\Python-Eggs\Python-Eggs\Cache\rouge_papier-0.0.1-py3.6.egg-tmp\rouge_papier\data\ROUGE-1.5.5.pl -e C:\Users\17674\AppData\Local\Python-Eggs\Python-Eggs\Cache\rouge_papier-0.0.1-py3.6.egg-tmp\rouge_papier\rouge_data -a -n 1 -x -d -m -s -l 75 -r 1000 -f A -z SPL C:\Users\17674\AppData\Local\Temp\tmpyt4kke18\tmpcd6ql1xn' returned non-zero exit status 255.

NYT Dataset

Thanks for your excellent work.

Unfortunately, the URL for the NYT dataset seems to be broken; would you mind providing a new one?
