
matchsum's Introduction

MatchSum

Code for ACL 2020 paper: Extractive Summarization as Text Matching

Dependencies

  • Python 3.7
  • PyTorch 1.4.0
  • fastNLP 0.5.0
  • pyrouge 0.1.3
    • You should fill in your ROUGE path in metrics.py (line 20) before running our code.
  • rouge 1.0.0
    • Used in the validation phase.
  • transformers 2.5.1

All code only supports running on Linux.

Data

We have already processed the CNN/DailyMail dataset; you can download it through this link, then unzip it and move it to ./data. It contains two versions (BERT/RoBERTa) of the dataset, six files in total.

In addition, we have released five other processed datasets (WikiHow, PubMed, XSum, MultiNews, Reddit), which you can find here.

Train

We use eight Tesla V100 16G GPUs to train our model; the training time is about 30 hours. If you do not have enough GPU memory, you can reduce batch_size or candidate_num in train_matching.py, or adjust max_len in dataloader.py.

You can choose BERT or RoBERTa as the encoder of MatchSum, for example, to train a RoBERTa model, you can run the following command:

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python train_matching.py --mode=train --encoder=roberta --save_path=./roberta --gpus=0,1,2,3,4,5,6,7

Test

After completing the training process, several of the best checkpoints are stored in a folder named after the training start time, for example ./roberta/2020-04-12-09-24-51. You can run the following command to get results on the test set (only one GPU is required for testing):

CUDA_VISIBLE_DEVICES=0 python train_matching.py --mode=test --encoder=roberta --save_path=./roberta/2020-04-12-09-24-51/ --gpus=0

The ROUGE score will be printed on the screen, and the output of the model will be stored in the folder ./roberta/result.

Results on CNN/DailyMail

Test set (the average of three runs)

Model                      R-1      R-2      R-L
MatchSum (BERT-base)       44.22    20.62    40.38
MatchSum (RoBERTa-base)    44.41    20.86    40.55

Generated Summaries

The summaries generated by our models on the CNN/DM dataset can be found here. In the version we released, the result of MatchSum(BERT) is 44.26/20.58/40.40 (R-1/R-2/R-L), and the result of MatchSum(RoBERTa) is 44.45/20.88/40.60.

The summaries generated on other datasets can be found here.

Pretrained Model

Two versions of the pre-trained model on CNN/DM are available here. You can use them through torch.load. For example,

model = torch.load('MatchSum_cnndm_bert.ckpt')

Besides, the pre-trained models on other datasets can be found here.
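
Note that the released .ckpt files appear to store the whole model object rather than just a state dict (the ModuleNotFoundError and SourceChangeWarning issues below point to this), so the MatchSum class from this repository's model.py must be importable when you call torch.load. A minimal sketch, assuming you run it from the repository root; map_location is optional:

import torch

# model.py from this repo must be importable here, otherwise torch.load raises
# ModuleNotFoundError: No module named 'model'.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = torch.load('MatchSum_cnndm_bert.ckpt', map_location=device)
model.eval()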

Process Your Own Data

If you want to process your own data and get candidate summaries for each document, first you need to convert your dataset to the same jsonl format as ours, and make sure to include text and summary fields. Second, you should use BertExt or other methods to select some important sentences from each document and get an index.jsonl file (we provide an example in ./preprocess/test_cnndm.jsonl).
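
As a rough illustration of the two input files, the sketch below writes one line of each, assuming text and summary are lists of sentence strings as in the released data; the index field name sent_id is an assumption, so check ./preprocess/test_cnndm.jsonl for the exact format:

import json

# Hypothetical document: 'text' and 'summary' are lists of sentences.
doc = {
    'text': ['first sentence of the article .', 'second sentence .', 'third sentence .'],
    'summary': ['reference summary sentence .']
}
# Hypothetical index entry: positions of the sentences selected by BertExt.
# The field name 'sent_id' is an assumption; see ./preprocess/test_cnndm.jsonl.
idx = {'sent_id': [0, 2]}

with open('your_original_data.jsonl', 'w') as f:
    f.write(json.dumps(doc) + '\n')
with open('your_index.jsonl', 'w') as f:
    f.write(json.dumps(idx) + '\n')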

Then you can run the following command:

python get_candidate.py --tokenizer=bert --data_path=/path/to/your_original_data.jsonl --index_path=/path/to/your_index.jsonl --write_path=/path/to/store/your_processed_data.jsonl

Please fill in your ROUGE path in preprocess/get_candidate.py line 22 before running this command. Note that you need to adjust the number of candidate summaries and the number of sentences in each candidate summary according to your dataset. For details, see lines 89-97 in preprocess/get_candidate.py.

After processing the dataset, and before using our code to train your own model, please adjust candidate_num in train_matching.py and max_len in dataloader.py according to the number and the length of the candidate summaries in your dataset.
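
For reference, the CNN/DM setting selects 5 sentences per document and forms every 2- and 3-sentence combination, giving C(5,2) + C(5,3) = 10 + 10 = 20 candidates (which matches candidate_id shapes such as [batch, 20, 90] reported in the issues below). A quick sanity check for your own choice of these numbers; the variable names are illustrative:

from itertools import combinations

num_selected = 5            # sentences kept by BertExt per document
sizes = (2, 3)              # each candidate summary contains 2 or 3 sentences
candidate_num = sum(len(list(combinations(range(num_selected), k))) for k in sizes)
print(candidate_num)        # 20 for the CNN/DM setting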

Note

The code and data released here are for the matching model. Before the matching stage, we use BertExt to prune meaningless candidate summaries; for the implementation of BertExt, please refer to PreSumm.


matchsum's Issues

dataset

Thanks for your work. I wonder how the Reddit dataset obtains the labels used for the BertSum model. I used the original BertSum preprocessing code, but the labels I obtain differ from yours on the Reddit dataset. I would appreciate it if you could help me.

Evaluation of raw text

First of all, great work!

If I would like to evaluate my own raw text with one of the pre-trained models that you just released, how would I go about that?
Would I have to convert it into the jsonl format first, or could I somehow pass the text directly into the model?

Thanks!

Plug & Play with "model = torch.load('MatchSum_cnndm_bert.ckpt').to(device)"

Hi all,

I can load the model into a Python environment with the line model = torch.load('MatchSum_cnndm_bert.ckpt').to(device), provided your model.py file is in the same directory and device is cuda.

I want to run some forward passes on a sample document, but I am confused by your input format. For example, the code snippet below yields the following error:

import torch
import transformers
from transformers import BertTokenizer  # transformers>=3.0.0
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tok = BertTokenizer.from_pretrained("bert-base-uncased")
model = torch.load('MatchSum_cnndm_bert.ckpt').to(device)

with open("some_test_file.txt", "r") as handler:
    input_ids = tok( [handler.read()] )["input_ids"]

test_forward = model(input_ids, candidate_id=None, summary_id=None)  # Error, see below
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-11-424e3757a895> in <module>
----> 1 model.forward(torch.tensor([0]))

TypeError: forward() missing 2 required positional arguments: 'candidate_id' and 'summary_id'

Any clarification on what the jsonl headers refer to would be greatly appreciated. Specifically, how to use the plug and play line of code included in your README.md

There is no target field.

input fields after batch(if batch size is 2):
candidate_id: (1)type:torch.Tensor (2)dtype:torch.int64, (3)shape:torch.Size([2, 20, 90])
text_id: (1)type:torch.Tensor (2)dtype:torch.int64, (3)shape:torch.Size([2, 451])
summary_id: (1)type:torch.Tensor (2)dtype:torch.int64, (3)shape:torch.Size([2, 45])
There is no target field.
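
For what it is worth, here is a minimal sketch (with made-up token ids, shaped like the log above) of a call that satisfies forward(text_id, candidate_id, summary_id); it only illustrates the expected tensor shapes, not meaningful input, and assumes model was loaded as in the README:

import torch

batch_size, cand_num, cand_len = 2, 20, 90
doc_len, summ_len = 451, 45

device = next(model.parameters()).device
text_id = torch.randint(1, 1000, (batch_size, doc_len)).to(device)                   # tokenized documents
candidate_id = torch.randint(1, 1000, (batch_size, cand_num, cand_len)).to(device)   # 20 candidates per document
summary_id = torch.randint(1, 1000, (batch_size, summ_len)).to(device)               # tokenized gold summaries
out = model(text_id, candidate_id, summary_id)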

Generated Summaries

Hi, the generated summaries here contain the reference summaries (.ref files) and the model-generated summaries (.dec files). I wonder if the original text files of the articles are available. Is there a way to get the original text files associated with the summaries?

Thanks!

Are there instructions to train on another dataset?

Hi,

First of all, thank you for your work and for sharing the code. I was wondering if there are instructions to train on a custom data collection (adapting the data format? to which format?). Could you please provide an example of the dataset (besides the index file)?

A question about the key 'label' in the test dataset

Thanks for publishing your code publicly.
I wonder how you obtain the "label" in test_CNNDM_roberta.jsonl. Do you use the greedy selection algorithm mentioned in SummaRuNNer, or the BertSum prediction, to obtain it? If the former, I think it is somewhat of a cheat when using MatchSum at test time, because the candidate summaries are based on the selected labels, and the selected labels are compared with the gold summary.
If the latter, I suspect the oracle is not as high as R-1: 52.59, R-2: 31.23, R-L: 48.87, because the BertExt model cannot predict as accurately as labels obtained with the greedy selection algorithm.
Looking forward to your reply.

Segmentation fault (core dumped)

Hi maszhong,
I'm getting the following error. Can you please help me?

Training process of MatchSum !!!
Start loading datasets !!!
Finished in 0:02:05.141637
Information of dataset is:
In total 2 datasets:
train has 287084 instances.
val has 13367 instances.

Devices is:
[0, 1, 2, 3]
Segmentation fault (core dumped)

How can we open the output files you provide?

Hi, I'm trying to access the summaries and the reference texts, but the files are .dec and .ref and I don't understand how to open them.
Can someone help me? :)
Thanks!

Prepare for own data !!!

Hi, thank you for your awesome work.
Can you give more detailed instructions on personal data preparation?
More specifically, how do I convert a text file to a jsonl file with the fields that MatchSum requires?

Pre-trained Model

Hi, how do I use the pre-trained model? Where should I put the torch.load() call?

Long documents

What is the maximum possible document length when using MatchSum?
It seems that when just passing a document into a transformer, e.g. Bert, it can't handle more than 512 or 1024 tokens (depending on the size)
huggingface/transformers#4332

What were some of the longest documents you handled when testing MatchSum?

Do you think it would make sense to train a https://github.com/allenai/longformer instead?

Thanks!

100% accuracy with one candidate

I'm not sure if this is a bug or just how this algorithm works. I have my own dataset with 'text' and 'summary'. The desired summary is known from the beginning and consists of a few 'text' sentences. In this case 'candidate' is the same as 'summary', so in get_candidate I skip calculating ROUGE scores and just set indices = [sent_id]. Then when training with this data I get a loss close to 0 and 100% ROUGE on the train and val sets immediately.

Is this the desired effect, or does this algorithm always need some candidates that are not extracted exactly from 'summary' but from 'text'?

Previously I used PreSumm for the same tasks and data and got satisfactory results.

Difficulties to reproduce score End-to-end

I'm trying to reproduce the score on CNN/DM, using MatchSum, with end-to-end summarization.

Here is the code I use:

# BertExt, return 5 most salient sentences from a list of sentences
sentences = self.extractor(documents, k=5, block_trigram=False)

text_id = tokenizer.batch_encode_plus([" ".join(d) for d in documents], return_tensors="pt", pad_to_max_length=True, max_length=tokenizer.model_max_length)["input_ids"]
cand_id = []
all_summaries = []
for sen in sentences:    # Iterate over batches
    summaries = list(combinations(sen, 2))
    summaries += list(combinations(sen, 3))
    summaries = [" ".join(s) for s in summaries]    # Create summary from list of sentences
    all_summaries.append(summaries)

    cand_id.append(torch.cat([tokenizer.encode_plus(s, max_length=tokenizer.model_max_length, return_tensors="pt", pad_to_max_length=True)["input_ids"] for s in summaries]))
cand_id = torch.stack(cand_id)

text_id = text_id.to(device)
cand_id = cand_id.to(device)
scores = self.match_sum(text_id, cand_id)
_, selected_sum = scores.max(-1)    # Take the summary with the best score (semantically similar to the document)

return [summaries[selected] for summaries, selected in zip(all_summaries, selected_sum)]

My score is:

---------------------------------------------
1 ROUGE-1 Average_R: 0.48578 (95%-conf.int. 0.48274 - 0.48869)
1 ROUGE-1 Average_P: 0.36573 (95%-conf.int. 0.36313 - 0.36839)
1 ROUGE-1 Average_F: 0.39591 (95%-conf.int. 0.39383 - 0.39817)
---------------------------------------------
1 ROUGE-2 Average_R: 0.21714 (95%-conf.int. 0.21426 - 0.21986)
1 ROUGE-2 Average_P: 0.16260 (95%-conf.int. 0.16041 - 0.16501)
1 ROUGE-2 Average_F: 0.17620 (95%-conf.int. 0.17390 - 0.17840)
---------------------------------------------
1 ROUGE-L Average_R: 0.43988 (95%-conf.int. 0.43690 - 0.44274)
1 ROUGE-L Average_P: 0.33191 (95%-conf.int. 0.32945 - 0.33446)
1 ROUGE-L Average_F: 0.35889 (95%-conf.int. 0.35677 - 0.36091)

which is even lower than BertExt...

Any idea what I'm doing wrong?

AssertionError?

Hi maszhong,

I'm getting the following error. Can you please help me?

(tf) lili@melody:~/MatchSum$ CUDA_VISIBLE_DEVICES=0,1,3 python train_matching.py --mode=train --encoder=roberta --save_path=./roberta --gpus=0,1,3
Training process of MatchSum !!!
Start loading datasets !!!
Finished in 0:02:24.464620
Information of dataset is:
In total 2 datasets:
train has 287084 instances.
val has 13367 instances.

Devices is:
[0, 1, 3]
Traceback (most recent call last):
File "train_matching.py", line 150, in
train_model(args)
File "train_matching.py", line 70, in train_model
assert args.batch_size % len(devices) == 0
AssertionError

End-to-end summarization

I'm a bit confused by the forward pass of the model:

def forward(self, text_id, candidate_id, summary_id):

Why do we need the summary_ids? At inference time we don't have access to the gold summary, so how can we use MatchSum at inference time?

=> Given only the article, how can we use MatchSum to produce a summary?


As far as I understand, the pipeline would be:

  1. Use BertExt (or any other sentence classifier) to extract k=5 sentences from the article
  2. Compute all possible summaries combinations (C(5, 2) + C(5, 3))
  3. Using MatchSum, compute the cosine similarity between the original article and each summaries computed in 2.
  4. Take the summary with the best similarity.

Am I right?
If so, do we need to rewrite the MatchSum class to take text_ids and a list of cand_ids as input?
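
For reference, here is a minimal sketch of the matching step as the paper describes it: embed the document and every candidate summary with the same encoder and rank candidates by cosine similarity to the document. The helper below is illustrative, not the repository's API; attention masks are omitted for brevity, and encoder is assumed to be a transformers BERT/RoBERTa model returning (last_hidden_state, ...):

import torch
import torch.nn.functional as F

def rank_candidates(encoder, text_id, candidate_id):
    # text_id: (batch, doc_len); candidate_id: (batch, cand_num, cand_len)
    doc_emb = encoder(text_id)[0][:, 0, :]                        # [CLS] embedding of each document
    b, n, l = candidate_id.size()
    cand_emb = encoder(candidate_id.view(b * n, l))[0][:, 0, :]   # [CLS] embedding of each candidate
    cand_emb = cand_emb.view(b, n, -1)
    scores = F.cosine_similarity(cand_emb, doc_emb.unsqueeze(1), dim=-1)  # (batch, cand_num)
    return scores.argmax(dim=-1)                                  # index of the best candidate per document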

'BertConfig' object has no attribute 'return_dict'

When I run the command
CUDA_VISIBLE_DEVICES=0 python train_matching.py --mode=test --encoder=bert --save_path=./models/ --gpus=0
there is an error:
Traceback (most recent call last):
File "train_matching.py", line 152, in
test_model(args)
File "train_matching.py", line 111, in test_model
tester.test()
File "/home/gpu2/anaconda3/lib/python3.7/site-packages/fastNLP/core/tester.py", line 165, in test
pred_dict = self._data_forward(self._predict_func, batch_x)
File "/home/gpu2/anaconda3/lib/python3.7/site-packages/fastNLP/core/tester.py", line 213, in _data_forward
y = self._predict_func_wrapper(**x)
File "/home/gpu2/10T_disk/yyj/MatchSum/model.py", line 30, in forward
out = self.encoder(text_id, attention_mask=input_mask)# last layer
File "/home/gpu2/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in call
result = self.forward(*input, **kwargs)
File "/home/gpu2/anaconda3/lib/python3.7/site-packages/transformers/modeling_bert.py", line 755, in forward
return_dict = return_dict if return_dict is not None else self.config.use_return_dict
File "/home/gpu2/anaconda3/lib/python3.7/site-packages/transformers/configuration_utils.py", line 203, in use_return_dict
return self.return_dict and not self.torchscript
AttributeError: 'BertConfig' object has no attribute 'return_dict'

How can I solve it?
@maszhongming

Unable to load pretrained model checkpoint

Hi, thanks so much for sharing your code and model!

I downloaded the pretrained model you provided, but ran into this error. Do you know how I can solve this issue?

model_path = "../../MatchSum_cnndm_roberta.ckpt"        
model = torch.load(model_path)

Output:


ModuleNotFoundError Traceback (most recent call last)
in
5
6 model_path = "../../MatchSum_cnndm_roberta.ckpt"
----> 7 model = torch.load(model_path)

~\fyp\env\lib\site-packages\torch\serialization.py in load(f, map_location, pickle_module, **pickle_load_args)
527 with _open_zipfile_reader(f) as opened_zipfile:
528 return _load(opened_zipfile, map_location, pickle_module, **pickle_load_args)
--> 529 return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
530
531

~\fyp\env\lib\site-packages\torch\serialization.py in _legacy_load(f, map_location, pickle_module, **pickle_load_args)
700 unpickler = pickle_module.Unpickler(f, **pickle_load_args)
701 unpickler.persistent_load = persistent_load
--> 702 result = unpickler.load()
703
704 deserialized_storage_keys = pickle_module.load(f, **pickle_load_args)

ModuleNotFoundError: No module named 'model'

Not getting the same rouge score

Hello

For the MultiNews dataset, I took your ROUGE evaluation code and also your generated output. I am not getting the same result.

Here is your reported result:

---------------------------------------------
1 ROUGE-1 Average_R: 0.49376 (95%-conf.int. 0.49086 - 0.49702)
1 ROUGE-1 Average_P: 0.46147 (95%-conf.int. 0.45854 - 0.46440)
1 ROUGE-1 Average_F: 0.46223 (95%-conf.int. 0.45993 - 0.46440)
---------------------------------------------
1 ROUGE-2 Average_R: 0.17810 (95%-conf.int. 0.17541 - 0.18108)
1 ROUGE-2 Average_P: 0.16330 (95%-conf.int. 0.16084 - 0.16581)
1 ROUGE-2 Average_F: 0.16502 (95%-conf.int. 0.16262 - 0.16764)
---------------------------------------------
1 ROUGE-L Average_R: 0.44680 (95%-conf.int. 0.44412 - 0.44996)
1 ROUGE-L Average_P: 0.41897 (95%-conf.int. 0.41602 - 0.42185)
1 ROUGE-L Average_F: 0.41903 (95%-conf.int. 0.41680 - 0.42119)

and here is what I got:


---------------------------------------------
1 ROUGE-1 Average_R: 0.48863 (95%-conf.int. 0.48578 - 0.49186)
1 ROUGE-1 Average_P: 0.45658 (95%-conf.int. 0.45363 - 0.45953)
1 ROUGE-1 Average_F: 0.45736 (95%-conf.int. 0.45512 - 0.45955)
---------------------------------------------
1 ROUGE-2 Average_R: 0.17679 (95%-conf.int. 0.17413 - 0.17974)
1 ROUGE-2 Average_P: 0.16207 (95%-conf.int. 0.15965 - 0.16456)
1 ROUGE-2 Average_F: 0.16378 (95%-conf.int. 0.16145 - 0.16635)
---------------------------------------------
1 ROUGE-L Average_R: 0.44234 (95%-conf.int. 0.43963 - 0.44541)
1 ROUGE-L Average_P: 0.41468 (95%-conf.int. 0.41180 - 0.41758)
1 ROUGE-L Average_F: 0.41479 (95%-conf.int. 0.41250 - 0.41693)

No offense, but do you know what makes that difference?

About MarginRankingLoss

When computing the candidate loss, shouldn't it be range(1, n) instead of range(1, n-1)? Why can the candidate loss between the highest-ranked and lowest-ranked candidates be ignored?

About the triplet loss

Hi,

I have a question about the calculation of triplet loss.

For example, for a certain document, you have 4 candidate sentences, {s1,s2,s3,s4} (Rouge s1 > Rouge s2 > Rouge s3 > Rouge s4).

When calculating the triplet loss, how many terms should be summed?

In other words, the loss for this document is :

max(0, f(D, s2 ) − f(D, s1) + (2 − 1) ∗ γ2) + max(0, f(D, s3 ) − f(D, s1) + (3 − 1) ∗ γ2)+max(0, f(D, s4 ) − f(D, s1) + (4 − 1) ∗ γ2)
3 times

or

max(0, f(D, s2 ) − f(D, s1) + (2 − 1) ∗ γ2) + max(0, f(D, s3 ) − f(D, s1) + (3 − 1) ∗ γ2) + max(0, f(D, s4 ) − f(D, s1) + (4 − 1) ∗ γ2) + max(0, f(D, s3 ) − f(D, s2) + (3 − 2) ∗ γ2) + max(0, f(D, s4 ) − f(D, s2) + (4 − 2) ∗ γ2) + max(0, f(D, s4 ) − f(D, s3) + (4 − 3) ∗ γ2)
6 times

That is to say, how many positive sentences are there in this case: only 1 (s1), or 3 (s1, s2, s3)?

Thank you very much!
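
For reference, here is a sketch of the candidate-level loss as the paper's equation sums it, over every candidate pair with the margin scaled by the rank gap (so six terms for four candidates); scores is assumed to hold f(D, s_i) with candidates sorted by descending ROUGE, and the code is illustrative rather than the repository's implementation:

import torch

def candidate_ranking_loss(scores, margin):
    # scores: (batch, cand_num), column 0 = highest-ROUGE candidate
    loss = scores.new_zeros(1)
    n = scores.size(1)
    for gap in range(1, n):                 # rank difference between the two candidates in a pair
        pos = scores[:, :-gap]              # higher-ranked candidate of each pair
        neg = scores[:, gap:]               # candidate ranked 'gap' positions lower
        loss = loss + torch.clamp(neg - pos + gap * margin, min=0).mean()
    return loss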

What is the meaning of "label" in the CNN/DM dataset?

Hello, and first of all, thank you for sharing this repository.
I was wondering about the meaning of the "label" field in the dataset you have made available. To me, it represents which sentences need to be part of the final summary. However, the number of sentences within a given article is often larger than the size of the "label" list, so I was wondering what the reason for that is and how to get the actual label sentences.

Thanks a lot :)

Have the model summarize my own dataset

Hi, I want to use your pre-trained model to generate summaries for my data. Am I right in understanding that I have to process my data according to the "Process Your Own Data" section? If so,

  1. Should my .jsonl file contain all these fields: 'label', 'text', 'summary', 'ext_idx', 'indices', 'score', 'candidate_id', 'text_id', 'summary_id'?
  2. Why do I need to select some important sentences and get index.jsonl? (What I want is for the model to select summary sentences for me.)
  3. What is the index.jsonl file for?
  4. Will the summary output of the model go into my .jsonl file in the 'summary' field?

Thank you so much for any help!

Question about candidate summary

I see the implementation in 'get_candidate.py': first, we select the 5 most important sentences using BertExt; then we select any 2 or 3 of them to form a candidate summary. There are 20 candidate summaries for every data point.

But what if there are only 4 sentences in the source document, so that BertExt can select at most 4 important sentences? Then C(4,2) + C(4,3) = 6 + 4 = 10. Do we need to do anything to the data to handle this case?

Issue while using pre-trained model

Testing process of MatchSum !!!
Start loading datasets !!!
Finished in 0:00:03.919354
Information of dataset is:
In total 1 datasets:
test has 11489 instances.

Current model is MatchSum_cnndm_bert.ckpt
/usr/local/lib/python3.7/dist-packages/torch/serialization.py:593: SourceChangeWarning: source code of class 'model.MatchSum' has changed. you can retrieve the original source code by accessing the object's source attribute or set torch.nn.Module.dump_patches = True and use the patch tool to revert the changes.
warnings.warn(msg, SourceChangeWarning)
Traceback (most recent call last):
File "train_matching.py", line 152, in
test_model(args)
File "train_matching.py", line 111, in test_model
tester.test()
File "/usr/local/lib/python3.7/dist-packages/fastNLP/core/tester.py", line 170, in test
metric(pred_dict, batch_y)
File "/usr/local/lib/python3.7/dist-packages/fastNLP/core/metrics.py", line 293, in call
self.evaluate(**refined_args)
File "/content/drive/MyDrive/MatchSum-master/metrics.py", line 141, in evaluate
ext = int(torch.max(score, dim=1).indices) # batch_size = 1
ValueError: only one element tensors can be converted to Python scalars

lr=0?

Hi, in your paper you state that you use the same learning rate schedule as in the paper "Attention Is All You Need", but I cannot find any implementation of it in your code. Moreover, the learning rate in your Adam optimizer is set to 0. Can you give me a hint where all this is handled in the code?
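
For context, the schedule from "Attention Is All You Need" is usually applied by overwriting the optimizer's learning rate at every step rather than fixing it at construction time, which is consistent with initializing Adam with lr=0. A generic sketch, with warmup and scale values that are illustrative rather than the repository's settings:

def noam_lr(step, warmup=10000, scale=2e-3):
    # Inverse-square-root schedule with linear warmup ("Attention Is All You Need").
    step = max(step, 1)
    return scale * min(step ** -0.5, step * warmup ** -1.5)

# Applied once per training step, e.g.:
# for group in optimizer.param_groups:
#     group['lr'] = noam_lr(global_step)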

Constructing input data for BERT-EXT on Multi-News.

Hi, thank you for publishing your interesting research and source code.
Now I am trying to construct input data for BERT-EXT on Multi-News. I have one question.

First of all, I checked the test_multinews.jsonl in here.
Next, I created token ids to be entered into BERT-EXT from "text" using transformers.BertTokenizer. After doing that, I noticed a discrepancy between the number of sentences in the first 512 tokens and the length of "label". Why do they differ?

Perhaps you have your own preprocessing; can you give us some details?
