alex-fabbri / multi-news

Large-scale multi-document summarization dataset and code

License: Other

Languages: Python 89.43%, Shell 4.90%, Perl 3.33%, Smalltalk 0.17%, Emacs Lisp 1.55%, JavaScript 0.08%, NewLisp 0.14%, Ruby 0.15%, Slash 0.03%, SystemVerilog 0.02%, Jupyter Notebook 0.19%
Topics: summarization, multi-news, multi-document-summarization

multi-news's Introduction

Multi-News

Data and code for the ACL 2019 paper Multi-News: a Large-Scale Multi-Document Summarization Dataset and Abstractive Hierarchical Model.

Data

Preprocessed, but not truncated, data
Preprocessed, truncated, data
Raw data (only replaced \n with "NEWLINE_CHAR" and appended "|||||" to the end of each story; see the sketch after this list for undoing these changes)
Raw data, bad retrievals removed -- removes documents that were retrieved with errors (as noted in a linked issue) and removes the "|||||" at the end of each example.
Raw data -- zipped
Tensorflow datasets
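
As a rough guide, here is a minimal sketch (not from the repo) for undoing the raw-data transformations described in the list above: each line is one example, the source articles are separated by "|||||", and original newlines were replaced with "NEWLINE_CHAR". The file name train.src is an assumption.

def restore_example(line):
    # split one raw line into its source articles and restore the newlines
    articles = [a for a in line.strip().split("|||||") if a.strip()]
    return [a.replace("NEWLINE_CHAR", "\n").strip() for a in articles]

with open("train.src") as f:
    for line in f:
        articles = restore_example(line)
        # `articles` is a list of the original news articles for this example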

Models and Summaries

Trained models
Model output

multi-news's People

Contributors

alex-fabbri, yaolu


multi-news's Issues

How many GPUs are needed?

Hi Alex,

Thank you for the code and data. I was trying to reproduce the results in the paper, but I keep running into out-of-memory issues. How many GPUs did you use in your experiments?

Thank you

OpenNMT baseline, multi-GPU

Dear Alex

I am a big fan of your work! I am playing around with your Transformer baseline (OpenNMT), but found that multi-GPU training is not implemented. Do you think things will work if I replace the onmt module?

Questions about experiment

Hi Alex, thanks for the public dataset. However, I have some questions. As mentioned in your answer to another issue, the number of source documents per example varies across the corpus, so I would like to know the details of the experiments. How many documents per example did you use during training?

Preprocessing data for OpenNMT baselines

Hi Alex, first of all, thanks for the amazing paper and dataset!

I have a question regarding preparing the data for training an OpenNMT Transformer model. I am trying to replicate the steps of first running preprocess.py and then train.py. However, when running preprocess.py I end up with one train.pt, one valid.pt, and one vocab.pt file. The vocab is of size 50004, which seems odd. Afterwards, when I run train.py, it stops at around step 1300. I suppose this is because of an insufficient vocab size?
I tried running the preprocessing step with different versions of the data, because it was unclear to me what exactly I should use in order to train a Transformer on the Multi-News dataset.

The command I use for preprocessing is:
python preprocess.py -train_src "...\train.src.cleaned" -train_tgt "...\train.tgt" -valid_src "...\val.src.cleaned" -valid_tgt "...\val.tgt" -save_data "...\newser" -src_seq_length 1000 -tgt_seq_length 1000 -src_seq_length_trunc 500 -tgt_seq_length_trunc 300 -dynamic_dict -share_vocab -shard_size 10000000

And for training:
python train.py -save_model "...\transformer_1" -data ".../newser" -copy_attn -word_vec_size 512 -rnn_size 512 -layers 4 -encoder_type transformer -decoder_type transformer -position_encoding -train_steps 50000 -warmup_steps 8000 -learning_rate 2 -decay_method noam -label_smoothing 0.1 -max_grad_norm 0 -dropout 0.2 -batch_size 4096 -optim adam -adam_beta2 0.998 -param_init 0 -batch_type tokens -normalization tokens -max_generator_batches 2 -accum_count 4 -share_embeddings -param_init_glorot -seed 777 -world_size 1 -gpu_ranks 0 -save_checkpoint_steps 500

Could you please clarify the process of training an OpenNMT model on your data a little, including which version and format of the data you used?

Details about experimental setup

Hi Alex,

Thanks a lot for sharing the code and data. I am trying to evaluate on your dataset. However, there are some details that are not mentioned in the paper. I am wondering if you could provide answers to the following questions:

  1. What is the maximum truncation length of the summary (it looks like it is 300)?
  2. Which embeddings are you using; do you use pretrained word embeddings?
  3. Do you use positional embeddings?
  4. Do you share the encoder and decoder vocabulary and embedding matrix?
  5. What is the encoder/decoder vocabulary size (or do you use a minimum frequency to filter out low-frequency words or tokens)?

Error when running copy transformer inference

Hello

I am trying to run your copy transformer baseline in OpenNMT-py-baselines. I loaded your pretrained checkpoint newser_step_20000.pt and then ran ./run_test_transformer.sh

I received the following error:

var = torch.tensor(arr, dtype=self.dtype, device=device)

Traceback (most recent call last):
  File "translate.py", line 37, in <module>
    main(opt)
  File "translate.py", line 24, in main
    attn_debug=opt.attn_debug)
  File "/Users/USER/Desktop/new_version_multi_news/Multi-News/code/OpenNMT-py-baselines/onmt/translate/translator.py", line 226, in translate
    batch_data = self.translate_batch(batch, data, fast=self.fast)
  File "/Users/USER/Desktop/new_version_multi_news/Multi-News/code/OpenNMT-py-baselines/onmt/translate/translator.py", line 329, in translate_batch
    return self._translate_batch(batch, data)
  File "/Users/USER/Desktop/new_version_multi_news/Multi-News/code/OpenNMT-py-baselines/onmt/translate/translator.py", line 555, in _translate_batch
    enc_states, memory_bank = self.model.encoder(src, src_lengths)
  File "/Users/USER/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/Users/USER/Desktop/new_version_multi_news/Multi-News/code/OpenNMT-py-baselines/onmt/encoders/transformer.py", line 102, in forward
    emb = self.embeddings(src)
  File "/Users/USER/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/Users/USER/Desktop/new_version_multi_news/Multi-News/code/OpenNMT-py-baselines/onmt/modules/embeddings.py", line 201, in forward
    source = module(source, step=step)
  File "/Users/USER/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/Users/USER/Desktop/new_version_multi_news/Multi-News/code/OpenNMT-py-baselines/onmt/modules/embeddings.py", line 39, in forward
    emb = emb + self.pe[:emb.size(0)]
RuntimeError: The size of tensor a (9734) must match the size of tensor b (5000) at non-singleton dimension 0
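
For context, a minimal sketch (not the repo's code) of why this fails: a sinusoidal positional-encoding table of the kind used in onmt/modules/embeddings.py is precomputed up to a fixed maximum length (apparently 5000 here), so an untruncated source of 9734 tokens overflows it. Truncating the source (e.g. using the truncated data release or the -src_seq_length_trunc preprocessing option shown elsewhere on this page) avoids the mismatch. The max_len and dim values below are assumptions for illustration.

import math
import torch

max_len, dim = 5000, 512
pe = torch.zeros(max_len, dim)
position = torch.arange(0, max_len).unsqueeze(1).float()
div_term = torch.exp(torch.arange(0, dim, 2).float() * -(math.log(10000.0) / dim))
pe[:, 0::2] = torch.sin(position * div_term)
pe[:, 1::2] = torch.cos(position * div_term)
pe = pe.unsqueeze(1)                       # shape (max_len, 1, dim)

emb = torch.zeros(9734, 1, dim)            # a 9734-token (untruncated) source
# emb + pe[:emb.size(0)] would try to add 9734 positions against only 5000,
# producing the size-mismatch error above, so the source must be truncated
# to at most max_len tokens before encoding.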

Cannot get the whole dataset

Hi Alex, after trying many times, I finally get train/dev/test splits of 44824/5601/5604, which differ from the numbers described in the paper, and I recorded the URLs that could not be successfully crawled. Could you please help me?

Code to highlight different sources of summary

Hey Alex,
In the paper, it is said that the content of the summary is color-coded based on the source it is taken from. Can you point me to the code block that does this?

Thanks.

Discrepancy between documents and reference

For some of the examples encountered while parsing through the raw data, there is no match between the documents and their corresponding reference summary.
For example: Topic: 16490
Documents : Focused crawls are collections of frequently-updated webcrawl data from narrow (as opposed to broad or wide) web crawls, often focused on a single domain or subdomain. ||||| This will appear next to all of your comments NEWLINE_CHAR NEWLINE_CHAR This will NOT appear anywhere on Newser |||||

Summary : – An online retailer has pulled a costume from its website that depicted Holocaust victim Anne Frank. Screenshots of the costume for sale at HalloweenCostumes.com posted to social media show a smiling girl wearing World War II-era clothing and a beret, the AP reports. The costume was quickly criticized on Twitter. Per the Arizona Republic, the description that accompanied the photo called Frank a hero and noted "we can always learn from the struggles of history." Carlos Galindo-Elvira, who leads the Anti-Defamation League's Arizona office, said on Twitter that the costume trivializes the memory of Frank, known from the diary she wrote while in hiding from the Nazis during the war. "There r better ways 2 commemorate Anne Frank," he wrote. A spokesman tweeted Sunday that the costume had been pulled from the site. He explained that the company sells costumes for activities other than Halloween, like "school projects and plays," and he apologized for any offense caused by the costume. Fun.com, based in North Mankato, Minn., runs the website.

Dataset in original formats & evaluation

Hi Alex,
I am trying to use this dataset, which seems pretty high quality. Would you consider distributing another version of the dataset via Google Drive, before tokenization and lowercasing?
I want to do some preprocessing on the dataset, but the files in Google Drive ([train/dev/test].txt.[src/tgt]) are already lowercased and tokenized, which might hurt the performance of downstream tools like StanfordNLP, entity linking tools, etc.

Also, for a fair comparison in evaluation, should we use the tokenized version or detokenize before feeding to ROUGE (will this affect the ROUGE score somehow)? Example:

kavanaugh didn ' t state

where "didn ' t" are treated as three words (a little overly fine-grained). Did you use the tokenized version of the reference summary to get the number from your paper?

Code for data analysis

Hi Alex,

I read your paper and it is very interesting.
I am interested in the analysis part, but couldn't find that code in this repo.
Could you please provide the analysis code as well?
Many thanks!

Blank entry

In the source data "Raw data, bad retrievals removed" -> "test.src.cleaned", line 4737 is blank.
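
A quick way to confirm this (a sketch; the file name is assumed to be as downloaded):

with open("test.src.cleaned") as f:
    blanks = [i for i, line in enumerate(f, start=1) if not line.strip()]
print(blanks)  # per the report above, this should include 4737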

The code for MMR-attention

Hi,

Thanks so much for releasing the code. I checked the files under the Hi-MAP folder, but I didn't find the code for the MMR attention of Hi-MAP. Could you point me to where it is?

Best,
Ye

Getting error when running on pre-trained model

Hi

Good work!!!
When I run "run_test_transformer.sh" using pre-trained model, which I have download from Google Drive, while running this one, getting error like that

File "translate.py", line 37, in
main(opt)
File "translate.py", line 19, in main
translator = build_translator(opt, report_score=True)
File "/content/drive/My Drive/summarization/SUMMPIP/Multi_news/Multi-News/code/OpenNMT-py-baselines/onmt/translate/translator.py", line 38, in build_translator
onmt.model_builder.load_test_model(opt, dummy_opt.dict)
File "/content/drive/My Drive/summarization/SUMMPIP/Multi_news/Multi-News/code/OpenNMT-py-baselines/onmt/model_builder.py", line 142, in load_test_model
model = build_base_model(model_opt, fields, use_gpu(opt), checkpoint)
File "/content/drive/My Drive/summarization/SUMMPIP/Multi_news/Multi-News/code/OpenNMT-py-baselines/onmt/model_builder.py", line 166, in build_base_model
src_embeddings = build_embeddings(model_opt, src_dict, feature_dicts)
File "/content/drive/My Drive/summarization/SUMMPIP/Multi_news/Multi-News/code/OpenNMT-py-baselines/onmt/model_builder.py", line 44, in build_embeddings
num_word_embeddings = len(word_dict)
File "/usr/local/lib/python3.7/dist-packages/torchtext/vocab.py", line 62, in len
return len(self.vocab)
File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1131, in getattr
type(self).name, name))
AttributeError: 'Vocab' object has no attribute 'vocab'

Can you help me resolve this?

Thanks

The Diversity Analysis

Dear Fabbri @Alex-Fabbri, first of all, I want to thank you for your great work.
I have two questions about the diversity analysis in your paper. I used the code from https://github.com/lil-lab/newsroom to calculate the coverage, density, and compression. Then I used seaborn.kdeplot to visualize the results, since you mentioned that you used kernel density estimation. But I found my result was different from Figure 1 in your paper; the coverage score seems much lower. My questions are:

  1. Did you divide the coverage scores by the maximum value, or use min-max normalization?
  2. Did you randomly sample or use the entire training set to calculate these three metrics?
Thank you again for your help.

My result is attached (plot: cnndm_multinews).
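
For reference, a minimal sketch of this analysis, assuming the Fragments API from the lil-lab/newsroom repo (coverage()/density() methods) and that each line of train.txt.src / train.txt.tgt is one example; this is not the authors' analysis code.

from newsroom.analyze import Fragments
import seaborn as sns
import matplotlib.pyplot as plt

coverages, densities = [], []
with open("train.txt.src") as src_f, open("train.txt.tgt") as tgt_f:
    for text, summary in zip(src_f, tgt_f):
        frag = Fragments(summary.strip(), text.strip())
        coverages.append(frag.coverage())
        densities.append(frag.density())

# Raw scores, no normalization; whether the paper normalized them is the open question.
sns.kdeplot(x=coverages, y=densities, fill=True)
plt.xlabel("Extractive fragment coverage")
plt.ylabel("Extractive fragment density")
plt.savefig("multinews_kde.png")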

Training and validation accuracy of models

Hi Alex,

I've been training the transformer model from OpenNMT on your dataset, as explained in another issue. I was wondering if I am doing anything wrong, since the training accuracy is around 44 and the validation accuracy is around 37 at step 20000. Do you know what the accuracies of the models you provided are (e.g., I can see that your transformer model was also trained for 20000 steps)?

Code for First-K results

Thanks for sharing the code and data!

Table 6 in the paper reports ROUGE results for First-3 (R1=39.41) that are higher than LexRank (R1=38.27) and TextRank (R1=38.44). This seems a bit surprising. Would it be possible to share the code for First-K?

I've been trying to reproduce the First-3 results using the pre-processed test dataset, but I'm not getting anywhere near R1=39.41 by following the method described in the paper:

For our dataset, First-k means the first k sentences from each source article will be concatenated as the summary.
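
For what it's worth, a minimal sketch of a First-k baseline under two assumptions the paper does not spell out: that source articles in each test line are separated by the "story_separator_special_tag" token seen in the preprocessed files, and that sentences are split with NLTK rather than the authors' own splitter. Either assumption could explain a different R1.

import nltk  # assumes the punkt sentence tokenizer has been downloaded

def first_k_summary(example, k=3, sep="story_separator_special_tag"):
    # take the first k sentences from each source article and concatenate them
    articles = [a.strip() for a in example.split(sep) if a.strip()]
    sentences = []
    for article in articles:
        sentences.extend(nltk.sent_tokenize(article)[:k])
    return " ".join(sentences)

with open("test.txt.src") as src_f, open("first3.hyp", "w") as out:
    for line in src_f:
        out.write(first_k_summary(line) + "\n")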

ERROR in running "run_inference_newser.sh"

Hi,
Thank you for providing the pre-trained model.
I downloaded the pre-trained model "newser-mmr" and the preprocessed, truncated test data.
When I try to use "run_inference_newser.sh" in "Hi-MAP" to get results on the test set, there is an error like this:
'''
Traceback (most recent call last):
File "translate.py", line 37, in
main(opt)
File "translate.py", line 24, in main
attn_debug=opt.attn_debug)
File "/home/tiezheng/workspace/Debias/multidoc_Summarization/Multi-News/code/Hi_MAP/onmt/translate/translator.py", line 233, in translate
batch_data = self.translate_batch(batch, data, fast=self.fast)
File "/home/tiezheng/workspace/Debias/multidoc_Summarization/Multi-News/code/Hi_MAP/onmt/translate/translator.py", line 341, in translate_batch
return self._translate_batch(batch, data)
File "/home/tiezheng/workspace/Debias/multidoc_Summarization/Multi-News/code/Hi_MAP/onmt/translate/translator.py", line 622, in _translate_batch
step=i)
File "/home/tiezheng/anaconda3/envs/multi-news/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in call
result = self.forward(*input, **kwargs)
File "/home/tiezheng/workspace/Debias/multidoc_Summarization/Multi-News/code/Hi_MAP/onmt/decoders/decoder.py", line 148, in forward
tgt, memory_bank, state, memory_lengths=memory_lengths,sent_encoder=sent_encoder,src_sents=src_sents,dec=dec)
File "/home/tiezheng/workspace/Debias/multidoc_Summarization/Multi-News/code/Hi_MAP/onmt/decoders/decoder.py", line 602, in _run_forward_pass
mmr_among_words = self._run_mmr_attention(sent_encoder, sent_decoder, src_sents,attns["copy"][0].size()[-1])
File "/home/tiezheng/workspace/Debias/multidoc_Summarization/Multi-News/code/Hi_MAP/onmt/decoders/decoder.py", line 480, in _run_mmr_attention
sim1 = torch.bmm(self.mmr_W(sent_decoder), sent.unsqueeze(2)).squeeze(2) # (2,1)
RuntimeError: invalid argument 7: equal number of batches expected at /opt/conda/conda-bld/pytorch_1556653099582/work/aten/src/THC/generic/THCTensorMathBlas.cu:488
'''
Do you know what the reason is?

Cannot reproduce output with pretrained Hi_MAP model

Hi Alex,

I'm trying to reproduce the published output with your pretrained Hi_MAP model. I use the cleaned, tokenized, and truncated test src data file you provided.

The command I use for running the model (as described in the readme and run_inference_newser.sh) is:

python translate.py -gpu 0 -batch_size 8 -beam_size 4 -model pretrained_models/newser_mmr/Feb17__step_20000.pt -src data/multi-news/test.txt.src.tokenized.fixed.cleaned.final.truncated.txt -output output/Feb17__step_20000_full.output -min_length 200 -max_length 300 -verbose -stepwise_penalty -coverage_penalty summary -beta 5 -length_penalty wu -alpha 0.9 -verbose -block_ngram_repeat 3 -ignore_when_blocking "story_separator_special_tag"

in the terminal.

Although I managed to get the model up and running on my Windows machine (Windows 10 Home, Torch 1.4.0, Python 3.7.3), I'm not able to reproduce your output.

Here is an example for my output from the model for the first two documents in the test.txt.src.tokenized.fixed.cleaned.final.truncated file:

– it ' s a new day , but that ' s what we ' re not to be a lot of your life . as for the first time , you ' re going to see it out , but they ' re now not to have a lot to do so , the new york times reports . " i ’ ve been a lot of <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank>
– if you ' re a <unk> , it ' s a good thing , but that ' s what they ' re saying : " i ' ve seen it , " the new york times reports . " i can ' t believe . " as for the <unk> , they ' ll be able to be a " <unk> , " but it was " one of the most time , " he writes , adding that it was a " one woman . " but , " this is one of those who ' ve been able to do it . " the washington post reports that the " <unk> " is a <unk> to the <unk> <unk> : " it ’ s a very way to see a <unk> <unk> <unk> , but it ’ ll be a good time for it , but i don ’ t want it . i don ' t know if it was . i ' m going to do . " the <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank> <blank>

My torch versions are:
torch==1.4.0 torchtext==0.5.1 torchvision==0.5.0

I'd really appreciate it if you could help me out here. Thank you!

What is the special token between source documents?

Hi~
I have downloaded the "Preprocessed, truncated, data".
As mentioned in the paper,

We simply introduce a special token between source documents to aid our models in detecting document-to-document relationships and leave direct modeling of this relationship...

But I didn't find any special token in the truncated data. How can I separate sentences by their source documents?

Thanks!
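
Judging from the -ignore_when_blocking "story_separator_special_tag" flag used in the inference commands elsewhere on this page, the separator appears to be that token; this is an inference, not something confirmed in the README. A sketch of splitting one example into its source documents (the file name is taken from the inference command above):

SEP = "story_separator_special_tag"

with open("test.txt.src.tokenized.fixed.cleaned.final.truncated.txt") as f:
    for line in f:
        documents = [d.strip() for d in line.split(SEP) if d.strip()]
        # `documents` now holds the individual source articles for this example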

Different PGN results

I converted the Multi-News dataset (Preprocessed, but not truncated, data) into the original PGN input (Abisee, 2017), but got much lower ROUGE scores, different from the ones mentioned in the Multi-News paper. ROUGE(1/2/3/L/SU4): 31.98/10.43/5.77/28.98/9.66

Since the CNNDM dataset works well with the original PGN model, I doubt there might be something wrong with the data conversion procedure.

@Alex-Fabbri Could you let me know how you preprocessed the Multi-News dataset for PGN? Did you add SENT_START and SENT_END as sentence tags? Or is there anything else that I need to pay attention to?

Thanks very much.
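
For reference, the original PGN pipeline (abisee's make_datafiles.py for CNN/DailyMail) wraps each abstract sentence in <s> ... </s> tags before writing the binary files. Below is a hedged sketch of converting a Multi-News target line in that style; whether the authors did exactly this is not confirmed here, and the NLTK sentence splitter is my own substitution for the original tokenization.

import nltk  # assumes the punkt sentence tokenizer has been downloaded

def to_pgn_abstract(tgt_line):
    # wrap each summary sentence in the <s> ... </s> tags PGN expects
    sentences = nltk.sent_tokenize(tgt_line.strip())
    return " ".join("<s> " + s + " </s>" for s in sentences)

with open("train.txt.tgt") as f:
    print(to_pgn_abstract(next(f)))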

Trained Model for HI_MAP

Hey Alex,
Really appreciate the work you have done!
One request: I have tried the three models that you have shared. Can you share the trained model for Hi-MAP as well?

expected scalar type Long but found Float

When running the run_inference_newser script, the following error appears:

Namespace(alpha=0.9, attn_debug=False, batch_size=8, beam_size=4, beta=5.0, block_ngram_repeat=3, coverage_penalty='summary', data_type='text', dump_beam='', dynamic_dict=False, fast=False, gpu=0, ignore_when_blocking=['story_separator_special_tag'], image_channel_size=3, length_penalty='wu', log_file='', max_length=300, max_sent_length=None, min_length=200, models=['Feb17__step_20000.pt'], n_best=1, output='drive/MyDrive/summarization/data-1/generated_output.txt', replace_unk=False, report_bleu=False, report_rouge=False, sample_rate=16000, share_vocab=False, src='drive/MyDrive/summarization/data-1/test.txt.src.tokenized.fixed.cleaned.final.truncated.txt', src_dir='', stepwise_penalty=True, tgt='drive/MyDrive/summarization/data-1/test.txt.tgt.tokenized.fixed.cleaned.final.truncated.txt', verbose=True, window='hamming', window_size=0.02, window_stride=0.01)
Namespace(accum_count=5, adagrad_accumulator_init=0.1, adam_beta1=0.9, adam_beta2=0.999, batch_size=2, batch_type='sents', bridge=True, brnn=True, cnn_kernel_width=3, context_gate=None, copy_attn=True, copy_attn_force=False, copy_loss_by_seqlength=True, coverage_attn=False, data='newser_sent_500_300/newser_sents', dec_layers=1, decay_method='', decay_steps=10000, decoder_type='rnn', dropout=0.0, enc_layers=1, encoder_type='brnn', epochs=0, exp='', exp_host='', feat_merge='concat', feat_vec_exponent=0.7, feat_vec_size=-1, fix_word_vecs_dec=False, fix_word_vecs_enc=False, generator_function='log_softmax', global_attention='mlp', global_attention_function='softmax', gpu_backend='nccl', gpu_ranks=[0], gpu_verbose_level=0, gpuid=[], heads=8, image_channel_size=3, input_feed=1, keep_checkpoint=-1, label_smoothing=0.0, lambda_coverage=1, layers=1, learning_rate=0.15, learning_rate_decay=0.5, log_file='', master_ip='localhost', master_port=10000, max_generator_batches=32, max_grad_norm=4.0, model_type='text', normalization='sents', optim='adagrad', param_init=0.1, param_init_glorot=False, position_encoding=False, pre_word_vecs_dec=None, pre_word_vecs_enc=None, report_every=50, reuse_copy_attn=True, rnn_size=512, rnn_type='LSTM', sample_rate=16000, save_checkpoint_steps=1000, save_model='model_newser_atten/Feb17_', seed=777, self_attn_type='scaled-dot', share_decoder_embeddings=False, share_embeddings=False, src_word_vec_size=128, start_decay_steps=50000, tensorboard=False, tensorboard_log_dir='runs/onmt', tgt_word_vec_size=128, train_from='', train_steps=30000, transformer_ff=2048, truncated_decoder=0, valid_batch_size=32, valid_steps=10000, warmup_steps=4000, window_size=0.02, word_vec_size=128, world_size=1)
/usr/local/lib/python3.6/dist-packages/torchtext/data/field.py:323: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
var = torch.tensor(arr, dtype=self.dtype, device=device)
/content/Multi-News/code/Hi_MAP/onmt/translate/translator.py:555: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
return torch.tensor(a, requires_grad=False)
Traceback (most recent call last):
File "Multi-News/code/Hi_MAP/translate.py", line 37, in
main(opt)
File "Multi-News/code/Hi_MAP/translate.py", line 24, in main
attn_debug=opt.attn_debug)
File "/content/Multi-News/code/Hi_MAP/onmt/translate/translator.py", line 233, in translate
batch_data = self.translate_batch(batch, data, fast=self.fast)
File "/content/Multi-News/code/Hi_MAP/onmt/translate/translator.py", line 342, in translate_batch
return self._translate_batch(batch, data)
File "/content/Multi-News/code/Hi_MAP/onmt/translate/translator.py", line 650, in _translate_batch
beam_attn.data[:, j, :memory_lengths[j]])
File "/content/Multi-News/code/Hi_MAP/onmt/translate/beam.py", line 140, in advance
self.attn.append(attn_out.index_select(0, prev_k))
RuntimeError: expected scalar type Long but found Float

Error reading .pt file after preprocessing

Thanks for the great repo @Alex-Fabbri. I followed readme.txt in Hi-MAP.

  1. I ran run_prep_newser; after that I have a list of .pt files, including newser_sents.vocab.pt.
  2. I ran run_inference_newser.sh, but I get this error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 64: invalid start byte

I found that the error comes from the function that reads data from the newser_sent_500/newser_sents.vocab.pt file:

def make_text_iterator_from_file(path):
    with codecs.open(path, "r", "utf-8") as corpus_file:
        for line in corpus_file:
            yield line

file: code/Hi_MAP/onmt/inputters/text_dataset.py

I am using the raw data from "Raw data -- zipped".
My versions:
torch 1.8.0
torchtext 0.9.0
CUDA 11.1
Many thanks for your help!!
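
A note on the likely cause (my reading, not from the repo): a .pt vocab file produced by preprocessing is a pickled object, so reading it as UTF-8 text in make_text_iterator_from_file will raise exactly this UnicodeDecodeError. The inference script should be given the plain-text test .src file; if you only want to inspect the vocab, load it with torch instead:

import torch

vocab = torch.load("newser_sent_500/newser_sents.vocab.pt")
print(type(vocab))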

data change?

Hi Alex! Thank you very much @Alex-Fabbri.
I have a question: if Transformer models are trained on other languages, such as Chinese, can the Chinese data be preprocessed directly according to your method (i.e., directly using the OpenNMT baselines' "run_preprocess.sh")?

Multi Processing failed

Hi Alex,

Thank you for uploading the data and code. I'm trying to run

run_train_newser.sh

However, it failed to find GPU 1. I ran torch.cuda.set_device(1) manually and it didn't return any error.
I'm really confused by this error. Have you encountered a similar issue? If so, how did you solve it? Thank you!

THCudaCheck FAIL file=/pytorch/torch/csrc/cuda/Module.cpp line=37 error=10 : invalid device ordinal
1
Traceback (most recent call last):
File "train.py", line 129, in
main(opt)
File "train.py", line 56, in main
p.join()
File "/home/songlin/anaconda3/envs/multi_news/lib/python3.6/multiprocessing/process.py", line 124, in join
res = self._popen.wait(timeout)
File "/home/songlin/anaconda3/envs/multi_news/lib/python3.6/multiprocessing/popen_fork.py", line 50, in wait
return self.poll(os.WNOHANG if timeout == 0.0 else 0)
File "/home/songlin/anaconda3/envs/multi_news/lib/python3.6/multiprocessing/popen_fork.py", line 28, in poll
pid, sts = os.waitpid(self.pid, flag)
File "train.py", line 116, in signal_handler
raise Exception(msg)
Exception:

-- Tracebacks above this line can probably
be ignored --

Traceback (most recent call last):
File "/home/songlin/Multi-News/code/Hi_MAP/train.py", line 74, in run
single_main(opt, device_id)
File "/home/songlin/Multi-News/code/Hi_MAP/onmt/train_single.py", line 80, in main
opt = training_opt_postprocessing(opt, device_id)
File "/home/songlin/Multi-News/code/Hi_MAP/onmt/train_single.py", line 71, in training_opt_postprocessing
torch.cuda.set_device(device_id)
File "/home/songlin/anaconda3/envs/multi_news/lib/python3.6/site-packages/torch/cuda/init.py", line 281, in set_device
torch._C._cuda_setDevice(device)
RuntimeError: cuda runtime error (10) : invalid device ordinal at /pytorch/torch/csrc/cuda/Module.cpp:37
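
A quick check worth running here (a sketch, not from the repo): the traceback shows the script asking for device ordinal 1, so if the machine only exposes one GPU the multiprocess launch will always fail this way.

import torch

print(torch.cuda.device_count())  # if this prints 1, there is no GPU 1 to bind to
# In that case, reduce the OpenNMT-py options -world_size / -gpu_ranks in the
# training command to the GPUs that actually exist, e.g. -world_size 1 -gpu_ranks 0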

About input file upload

Can you upload this processed file: /home/lily/af726/spring-2019/summarization_general/data-final/data-truncated-opennmt/train.txt.src.tokenizd.fixed.cleaned.truncated, etc.?
This file is used as an input, but was not found in the released dataset.

The data from google drive

Hello,
I am not sure whether the data from Google Drive has already been truncated, cleaned, fixed, and tokenized or not.
In run_prep_newser.sh the data file is named train.txt.src.tokenizd.fixed.cleaned.truncated, while the file I downloaded from Google Drive is named only train.txt.src.
Should I do all those steps on my own?
Thanks.
Great work by the way!!
