
summarization-datasets's Introduction

summarization-datasets

Preprocessing and, in some cases, downloading of datasets for the paper "Content Selection in Deep Learning Models of Summarization."

Requires Python 3.6 or greater.

To install, run:

$ python setup.py install

If you haven't installed spaCy in your current environment before, you should also run:

python -m spacy download en
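
To confirm the model is available, you can try loading it. A minimal sketch, assuming the legacy "en" shortcut installed by the command above (spaCy 2.x; newer spaCy releases use different model names):

import spacy

nlp = spacy.load("en")  # same shortcut name used by the download command above
print([token.text for token in nlp("This is a test.")])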

It is also a good idea to set the number of OMP threads to a small value, e.g.:

export OMP_NUM_THREADS=2
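
If you would rather set this from Python, assigning the environment variable before importing any numerical libraries generally has the same effect. This is a sketch, not something the preprocessing scripts do themselves:

import os

# Must be set before numpy/torch/spacy are imported, or it may have no effect.
os.environ["OMP_NUM_THREADS"] = "2"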

CNN/DailyMail Dataset

To run:

python summarization-datasets/preprocess_cnn_dailymail.py \
    --data-dir data/

This will create the CNN/DM data in a directory data/cnn-dailymail. This dataset is quite large and will take a while to preprocess. Grab a coffee!
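
For a quick sanity check that preprocessing finished, you can count the files under the output directory. A minimal sketch, assuming only that the output lands under data/cnn-dailymail as stated above:

from pathlib import Path

root = Path("data/cnn-dailymail")
for subdir in sorted(d for d in root.rglob("*") if d.is_dir()):
    n_files = sum(1 for f in subdir.iterdir() if f.is_file())
    print(subdir, n_files)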

NYT Dataset

You must obtain the raw documents for this dataset from the LDC: https://catalog.ldc.upenn.edu/LDC2008T19. Assuming you have the original NYT tar file in a directory called raw_data, run the following:

cd raw_data
tar zxvf nyt_corpus_LDC2008T19.tgz
cd ..
python summarization-datasets/preprocess_nyt.py \
    --nyt raw_data/nyt_corpus \
    --data-dir data

This will create preprocessed NYT data in data/nyt/.

DUC Dataset

To obtain this data, first sign the release forms and email NIST (details here: https://duc.nist.gov/data.html).

You should obtain from NIST two files for the 2001/2002 data, along with a username and password. Assuming you have the NIST data in a folder called raw_data, you should have the following:

raw_data/DUC2001_Summarization_Documents.tgz
raw_data/DUC2002_Summarization_Documents.tgz

You will also need to download additional data from NIST, which you can do using a script that will be in your bin directory after installation:

$ duc2002-test-data.sh USERNAME PASSWORD raw_data

where USERNAME and PASSWORD are the credentials NIST gave you to access their website data. This should create the file raw_data/DUC2002_test_data.tar.gz.
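
Before preprocessing, it may be worth confirming the download completed and is a readable archive. A minimal check with the standard library (the path is the one created by the script above):

import tarfile

# True if the downloaded file is a readable tar archive.
print(tarfile.is_tarfile("raw_data/DUC2002_test_data.tar.gz"))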

Now run the preprocessing scripts:

python summarization-datasets/preprocess_duc_sds.py \
    --duc2001 raw_data/DUC2001_Summarization_Documents.tgz \
    --duc2002-documents raw_data/DUC2002_Summarization_Documents.tgz \
    --duc2002-summaries raw_data/DUC2002_test_data.tar.gz \
    --data-dir data

This will create preprocessed DUC data in data/duc-sds/.

Reddit Dataset

To run:

python summarization-datasets/preprocess_reddit.py \
    --data-dir data/

This will create the Reddit data in a directory data/reddit.

AMI Dataset

To run:

python summarization-datasets/preprocess_ami.py \
    --data-dir data/

This will create the AMI data in a directory data/ami.

PubMed Dataset

To run:

python summarization-datasets/preprocess_pubmed.py \
    --data-dir data/

This will create the PubMed data in a directory data/pubmed.
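
After running all of the preprocessing scripts, you can verify that each expected output directory exists. This sketch uses only the directory names stated in the sections above:

from pathlib import Path

# Output directories named in the sections above.
for name in ["cnn-dailymail", "nyt", "duc-sds", "reddit", "ami", "pubmed"]:
    d = Path("data") / name
    print(d, "ok" if d.is_dir() else "missing")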

summarization-datasets's People

Contributors

kedz

summarization-datasets's Issues

error when running python preprocess_reddit.py --data-dir data/

I ran the command "python preprocess_reddit.py --data-dir data/" to get the Reddit data, but I get the error "Cannot open exception db file for reading: C:\Users\17674\AppData\Local\Python-Eggs\Python-Eggs\Cache\rouge_papier-0.0.1-py3.6.egg-tmp\rouge_papier\rouge_data/WordNet-2.0.exc.db".
I have been stuck on this error for several days. Can you guide me to solving it?

(datannsum) C:\Users\17674\Desktop\MultiModalSummary\summarization-datasets-master>python preprocess_reddit.py --data-dir data/
1145229 / 1145229
Writing train abstracts to: data\reddit\human-abstracts\train
Writing valid abstracts to: data\reddit\human-abstracts\valid
Writing test abstracts to: data\reddit\human-abstracts\test
Writing train extracts to: data\reddit\human-extracts\train
Writing valid extracts to: data\reddit\human-extracts\valid
Writing test extracts to: data\reddit\human-extracts\test
SpawnPoolWorker-2: Ready!
SpawnPoolWorker-7: Ready!
SpawnPoolWorker-8: Ready!
SpawnPoolWorker-3: Ready!
SpawnPoolWorker-4: Ready!
SpawnPoolWorker-5: Ready!
SpawnPoolWorker-6: Ready!
SpawnPoolWorker-1: Ready!
Cannot open exception db file for reading: C:\Users\17674\AppData\Local\Python-Eggs\Python-Eggs\Cache\rouge_papier-0.0.1-py3.6.egg-tmp\rouge_papier\rouge_data/WordNet-2.0.exc.db
Cannot open exception db file for reading: C:\Users\17674\AppData\Local\Python-Eggs\Python-Eggs\Cache\rouge_papier-0.0.1-py3.6.egg-tmp\rouge_papier\rouge_data/WordNet-2.0.exc.db
Cannot open exception db file for reading: C:\Users\17674\AppData\Local\Python-Eggs\Python-Eggs\Cache\rouge_papier-0.0.1-py3.6.egg-tmp\rouge_papier\rouge_data/WordNet-2.0.exc.db
Cannot open exception db file for reading: C:\Users\17674\AppData\Local\Python-Eggs\Python-Eggs\Cache\rouge_papier-0.0.1-py3.6.egg-tmp\rouge_papier\rouge_data/WordNet-2.0.exc.db
Cannot open exception db file for reading: C:\Users\17674\AppData\Local\Python-Eggs\Python-Eggs\Cache\rouge_papier-0.0.1-py3.6.egg-tmp\rouge_papier\rouge_data/WordNet-2.0.exc.db
Cannot open exception db file for reading: C:\Users\17674\AppData\Local\Python-Eggs\Python-Eggs\Cache\rouge_papier-0.0.1-py3.6.egg-tmp\rouge_papier\rouge_data/WordNet-2.0.exc.db
Cannot open exception db file for reading: C:\Users\17674\AppData\Local\Python-Eggs\Python-Eggs\Cache\rouge_papier-0.0.1-py3.6.egg-tmp\rouge_papier\rouge_data/WordNet-2.0.exc.db
Cannot open exception db file for reading: C:\Users\17674\AppData\Local\Python-Eggs\Python-Eggs\Cache\rouge_papier-0.0.1-py3.6.egg-tmp\rouge_papier\rouge_data/WordNet-2.0.exc.db
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "C:\Users\17674\anaconda3\envs\datannsum\lib\multiprocessing\pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "C:\Users\17674\Desktop\MultiModalSummary\summarization-datasets-master\preprocess_reddit.py", line 148, in worker
    ext_labels = {"id": story_id, "labels": get_labels(example, ext_paths)}
  File "C:\Users\17674\Desktop\MultiModalSummary\summarization-datasets-master\preprocess_reddit.py", line 114, in get_labels
    remove_stopwords=True, length=75)
  File "C:\Users\17674\anaconda3\envs\datannsum\lib\site-packages\rouge_papier-0.0.1-py3.6.egg\rouge_papier\generate.py",line 15, in compute_extract
    length_unit=length_unit, remove_stopwords=remove_stopwords), None
  File "C:\Users\17674\anaconda3\envs\datannsum\lib\site-packages\rouge_papier-0.0.1-py3.6.egg\rouge_papier\generate.py",line 127, in compute_greedy_sequential_extract
    length_unit=length_unit, remove_stopwords=remove_stopwords)
  File "C:\Users\17674\anaconda3\envs\datannsum\lib\site-packages\rouge_papier-0.0.1-py3.6.egg\rouge_papier\wrapper.py", line 58, in compute_rouge
    output = check_output(" ".join(args), shell=True).decode("utf8")
  File "C:\Users\17674\anaconda3\envs\datannsum\lib\subprocess.py", line 356, in check_output
    **kwargs).stdout
  File "C:\Users\17674\anaconda3\envs\datannsum\lib\subprocess.py", line 438, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command 'perl C:\Users\17674\AppData\Local\Python-Eggs\Python-Eggs\Cache\rouge_papier-0.0.1-py3.6.egg-tmp\rouge_papier\data\ROUGE-1.5.5.pl -e C:\Users\17674\AppData\Local\Python-Eggs\Python-Eggs\Cache\rouge_papier-0.0.1-py3.6.egg-tmp\rouge_papier\rouge_data -a -n 1 -x -d -m -s -l 75 -r 1000 -f A -z SPL C:\Users\17674\AppData\Local\Temp\tmpyt4kke18\tmpcd6ql1xn' returned non-zero exit status 255.
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "preprocess_reddit.py", line 262, in <module>
    main()
  File "preprocess_reddit.py", line 221, in main
    pool)
  File "preprocess_reddit.py", line 255, in make_dataset
    for i, _ in enumerate(pool.imap(worker, story_iter()), 1):
  File "C:\Users\17674\anaconda3\envs\datannsum\lib\multiprocessing\pool.py", line 735, in next
    raise value
subprocess.CalledProcessError: Command 'perl C:\Users\17674\AppData\Local\Python-Eggs\Python-Eggs\Cache\rouge_papier-0.0.1-py3.6.egg-tmp\rouge_papier\data\ROUGE-1.5.5.pl -e C:\Users\17674\AppData\Local\Python-Eggs\Python-Eggs\Cache\rouge_papier-0.0.1-py3.6.egg-tmp\rouge_papier\rouge_data -a -n 1 -x -d -m -s -l 75 -r 1000 -f A -z SPL C:\Users\17674\AppData\Local\Temp\tmpyt4kke18\tmpcd6ql1xn' returned non-zero exit status 255.

NYT Dataset

Thanks for your excellent work.

Unfortunately, the URL for the NYT dataset seems to be broken; would you mind providing a new one?
