
Pre-Trained Models for ToD-BERT

License: BSD 2-Clause "Simplified" License

task-oriented-dialogues dialogue pretrained-models natural-language-processing natural-language-understanding bert

tod-bert's Introduction

TOD-BERT: Pre-trained Natural Language Understanding for Task-Oriented Dialogues

Authors: Chien-Sheng Wu, Steven Hoi, Richard Socher and Caiming Xiong.

EMNLP 2020. Paper: https://arxiv.org/abs/2004.06871

Introduction

The underlying difference of linguistic patterns between general text and task-oriented dialogue makes existing pre-trained language models less useful in practice. In this work, we unify nine human-human and multi-turn task-oriented dialogue datasets for language modeling. To better model dialogue behavior during pre-training, we incorporate user and system tokens into the masked language modeling. We propose a contrastive objective function to simulate the response selection task. Our pre-trained task-oriented dialogue BERT (TOD-BERT) outperforms strong baselines like BERT on four downstream task-oriented dialogue applications, including intention recognition, dialogue state tracking, dialogue act prediction, and response selection. We also show that TOD-BERT has a stronger few-shot ability that can mitigate the data scarcity problem for task-oriented dialogue.

Citation

If you use any source code, pretrained models, or datasets included in this repo in your work, please cite the following paper. The BibTeX entry is listed below:

@inproceedings{wu-etal-2020-tod,
    title = "{TOD}-{BERT}: Pre-trained Natural Language Understanding for Task-Oriented Dialogue",
    author = "Wu, Chien-Sheng  and
      Hoi, Steven C.H.  and
      Socher, Richard  and
      Xiong, Caiming",
    booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.emnlp-main.66",
    doi = "10.18653/v1/2020.emnlp-main.66",
    pages = "917--929"
}

Update

  • (2020.10.01) Added more training and inference information. Released TOD-DistilBERT.
  • (2020.07.10) Loading models from Hugging Face is now supported.
  • (2020.04.26) Pre-trained models are available.

Pretrained Models

You can easily load the pre-trained models with the Hugging Face Transformers library via the AutoModel and AutoTokenizer classes. Several pre-trained versions are supported:

  • TODBERT/TOD-BERT-MLM-V1: TOD-BERT pre-trained only using the MLM objective
  • TODBERT/TOD-BERT-JNT-V1: TOD-BERT pre-trained using both the MLM and RCL objectives
  • TODBERT/TOD-DistilBERT-JNT-V1: TOD-DistilBERT pre-trained using both the MLM and RCL objectives

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("TODBERT/TOD-BERT-JNT-V1")
tod_bert = AutoModel.from_pretrained("TODBERT/TOD-BERT-JNT-V1")

You can also download the pre-trained models and load them from a local path:

from transformers import BertConfig, BertModel, BertTokenizer

model_name_or_path = <path_to_the_downloaded_tod-bert>  # local checkpoint directory
model_class, tokenizer_class, config_class = BertModel, BertTokenizer, BertConfig
tokenizer = tokenizer_class.from_pretrained(model_name_or_path)
tod_bert = model_class.from_pretrained(model_name_or_path)

Direct Usage

Please refer to the following guide on how to use our pre-trained ToD-BERT models. Our model is built on top of the PyTorch library and the Hugging Face Transformers library. Let's do a very quick overview of the model architecture and code. Detailed examples of the model architecture can be found in the paper.

# Encode text 
input_text = "[CLS] [SYS] Hello, what can I help with you today? [USR] Find me a cheap restaurant nearby the north town."
input_tokens = tokenizer.tokenize(input_text)
story = torch.Tensor(tokenizer.convert_tokens_to_ids(input_tokens)).long()

if len(story.size()) == 1: 
    story = story.unsqueeze(0) # batch size dimension

if torch.cuda.is_available(): 
    tod_bert = tod_bert.cuda()
    story = story.cuda()

with torch.no_grad():
    input_context = {"input_ids": story, "attention_mask": (story > 0).long()}
    hiddens = tod_bert(**input_context)[0] 
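
The returned hiddens tensor has shape (batch_size, sequence_length, hidden_size). As a hedged sketch rather than an official recipe, a common way to get one vector per dialogue context is to take the hidden state at the [CLS] position:

# Continues the snippet above: pool the [CLS] position as a context-level embedding.
cls_rep = hiddens[:, 0, :]   # shape: (batch_size, hidden_size)
print(cls_rep.shape)         # e.g. torch.Size([1, 768]) for the base model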

Training and Testing

If you would like to train the model yourself, you can download the datasets from their original papers or sources, or directly download a zip file here.

The repository is currently in this structure:

.
├── image
│   └── ...
├── models
│   ├── multi_class_classifier.py
│   ├── multi_label_classifier.py
│   ├── BERT_DST_Picklist.py
│   └── dual_encoder_ranking.py
├── utils
│   ├── multiwoz
│   │   └── ...
│   ├── metrics
│   │   └── ...
│   ├── loss_function
│   │   └── ...
│   ├── dataloader_nlu.py
│   ├── dataloader_dst.py
│   ├── dataloader_dm.py
│   ├── dataloader_nlg.py
│   ├── dataloader_usdl.py
│   └── ...
├── README.md
├── evaluation_pipeline.sh
├── evaluation_ratio_pipeline.sh
├── run_tod_lm_pretraining.sh
├── main.py
└── my_tod_pretraining.py
  • Run Pretraining
❱❱❱ ./run_tod_lm_pretraining.sh 0 bert bert-base-uncased save/pretrain/ToD-BERT-MLM --only_last_turn
❱❱❱ ./run_tod_lm_pretraining.sh 0 bert bert-base-uncased save/pretrain/ToD-BERT-JNT --only_last_turn --add_rs_loss
  • Run Fine-tuning
❱❱❱ ./evaluation_pipeline.sh 0 bert bert-base-uncased save/BERT
  • Run Fine-tuning (Few-Shot)
❱❱❱ ./evaluation_ratio_pipeline.sh 0 bert bert-base-uncased save/BERT --nb_runs=3 

Report

Feel free to create an issue or send an email to the first author at [email protected].

tod-bert's People

Contributors

jasonwu0731


tod-bert's Issues

Question about output labels

Hello, I am working on re-implementing the tod-bert code to run on my own pretraining dataset and I have been getting a CUDA error that seems to be stemming from incorrect inputs to the loss function. Upon further examination it seems I might not be understanding the output labels for the responses in the RCL task. I had thought the output labels would be 1 if it is the correct response, and zero if incorrect. However, the line of code that generates the output seems to just generate an incrementing array relative to the batch size.

Specifically, at the following line:
output_labels = torch.tensor(np.arange(batch_size)).long() #.to(args.device)

For a batch size of 8 for example, the output labels would be an array [0, 1, 2, 3, 4, 5, 6, 7]. Is this to be expected? If so, how does this correspond to the positive/negative response labels needed?

Thanks in advance!
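
For readers hitting the same point, here is a minimal, self-contained sketch (not the repository's exact code) of why arange labels work: with in-batch negatives, context i's positive response is the i-th row of the response batch, so the score matrix is batch_size x batch_size and the target class for row i is simply i.

import numpy as np
import torch
import torch.nn.functional as F

batch_size, hidden_size = 8, 768
hid_cont = torch.randn(batch_size, hidden_size)  # [CLS] vectors of the dialogue contexts
hid_resp = torch.randn(batch_size, hidden_size)  # [CLS] vectors of the gold responses

# Score every context against every response in the batch: (batch_size, batch_size).
scores = torch.matmul(hid_cont, hid_resp.transpose(1, 0))

# Context i's true response sits in column i, so its label is just the index i;
# the remaining batch_size - 1 columns act as the in-batch negative samples.
output_labels = torch.tensor(np.arange(batch_size)).long()
loss_rs = F.cross_entropy(scores, output_labels)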

How to decode?

Simple question probably, but I'm new to NLP and just doing an experiment. I've gotten the output and decoded it, but every token looks like [unusedxxx] when I decode. How can I do this properly? Here's my code.

import torch
from transformers import *

tokenizer = AutoTokenizer.from_pretrained("TODBERT/TOD-BERT-JNT-V1")
tod_bert = AutoModel.from_pretrained("TODBERT/TOD-BERT-JNT-V1")

# Encode text 
input_text = "[CLS] [SYS] Hello, what can I help with you today? [USR] Find me a cheap restaurant nearby the north town."
input_tokens = tokenizer.tokenize(input_text)
story = torch.Tensor(tokenizer.convert_tokens_to_ids(input_tokens)).long()

if len(story.size()) == 1: 
    story = story.unsqueeze(0) # batch size dimension

if torch.cuda.is_available(): 
    tod_bert = tod_bert.cuda()
    story = story.cuda()

with torch.no_grad():
    input_context = {"input_ids": story, "attention_mask": (story > 0).long()}
    outputs = tod_bert(**input_context)[0]
    hiddens = outputs[0]
    print(tokenizer.decode(torch.argmax(hiddens, dim=1)))
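
As a hedged note on the snippet above: AutoModel returns contextual hidden states, not vocabulary logits, so argmax over the hidden dimension yields indices in [0, hidden_size), which the tokenizer decodes as low vocabulary ids such as [unusedXXX]. Continuing the snippet:

# outputs is the last hidden state, shape (batch_size, seq_len, hidden_size);
# it is an embedding, not a distribution over the vocabulary, so there is
# nothing meaningful to argmax-decode here.
print(outputs.shape)

# To recover the input text, decode the input ids themselves:
print(tokenizer.decode(story[0]))

# Token-level predictions would require a model with a language-modeling head,
# not the bare encoder loaded above.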

Release trained evaluation models

Hi,

I am able to run the code without any problems. So thank you for sharing it.

Since the dataset I work with is small (very small), I want to test the fine-tuned models directly on my dataset before I actually re-train them or find some other solution. I was wondering if you could release the evaluation models as well? I can always train them, but it would just be quicker to get them directly (if possible).

Thank you.

How to set dialog state tracking labels if the parts of histories are truncated

Hi, Thank you for this great repository and the paper.
I'm very impressed with your work.

I have a minor question.
How did you set the labels for dialogue state tracking when parts of the dialogue context are truncated?
If we set the maximum number of turns to a certain number, then we have to cut the preceding turns.
But if we do that, the dialogue states that were updated in those turns should not appear in the ground-truth labels, since the model cannot see those slot types and values in the truncated dialogue sequence.

For example, let us assume that the user wants to reserve a hotel room and specified the hotel name at the first turn.
Obviously, that name should be set as the value of slot type "hotel name" in dialog states.
But as the conversation goes on, the first turn will be cut out to keep the input shorter than the model's maximum size, while the dialogue state still has the hotel name as the updated value.
This looks unnatural to me, since the model cannot see that context anymore but still has to predict the slot value as an output.

I wonder how you handled this problem.
Please let me know if there is an efficient preprocessing way to handle this.
Thank you.

Question about ToD-BERT as a pipeline

Hi Jason,

It is a very nice paper, and it helps enlighten me to consider pre-training an AE model for better downstream tasks. As a matter of fact, I am wondering how you plan to apply it to a complete pipeline (a ToD bot). From what I see, the downstream tasks mentioned in the paper are actually sequential individual components inside the ToD bot. I mean that each task is related to the next task somehow, so fine-tuning on each separately seems to lose that connection. I understand this fine-tuning setup wants to show its power due to such a pre-training design. Any thoughts? I appreciate it.

Code for pre-training ToD-BERT

Hi, how can I pre-train ToD-BERT by myself? What are the hyperparameters? How to reproduce the evaluation results?
Thanks!

KeyError: 'slots'

Hi, I'm getting an error I cannot seem to get past. I've included it below:

Traceback (most recent call last):
  File "main.py", line 109, in <module>
    trn_loader = get_loader(args, "train", tokenizer, datasets, unified_meta)
  File "/content/ToD-BERT/utils/utils_general.py", line 58, in get_loader
    dataset = globals()["Dataset_"+task](data_info, tokenizer, args, unified_meta, mode, args["max_seq_length"])
  File "/content/ToD-BERT/utils/dataloader_dst.py", line 20, in __init__
    self.slots = list(unified_meta["slots"].keys())
KeyError: 'slots'

Is this to do with a specific dataset I need to include in the list of datasets to use? I do not want to use them all, just the MultiWOZ ones.

Mapping response selection output

Hi Jason,

The response selection output gives an array of 100 elements, which I believe are ranked from top 1 to top 100 (where one of them is the true response and the other 99 are responses from the same batch for other inputs, treated as negative samples).

I was wondering how to map these indices to the actual responses. That is, what part of the code maps the responses to these indices? I see there is code in the dataloader_nlg.py file; however, I also noticed that the code below never gets executed since nb_neg_sample_rs is 0. Could you please explain how to interpret the test output?

Thanks in advance!

   if self.args["nb_neg_sample_rs"] != 0 and self.mode == "train":
        if self.args["sample_negative_by_kmeans"]:
            try:
                cur_cluster = self.others["ToD_BERT_SYS_UTTR_KMEANS"][self.data["turn_sys"][index]]
                candidates = self.others["KMEANS_to_SENTS"][cur_cluster]
                nb_selected = min(self.args["nb_neg_sample_rs"], len(candidates))
                try:
                    start_pos = random.randint(0, len(candidates)-nb_selected-1)
                except:
                    start_pos = 0
                sampled_neg_resps = candidates[start_pos:start_pos+nb_selected]
            
            except:
                start_pos = random.randint(0, len(self.resp_cand_trn)-self.args["nb_neg_sample_rs"]-1)
                sampled_neg_resps = self.resp_cand_trn[start_pos:start_pos+self.args["nb_neg_sample_rs"]]  
        else:
            start_pos = random.randint(0, len(self.resp_cand_trn)-self.args["nb_neg_sample_rs"]-1)
            sampled_neg_resps = self.resp_cand_trn[start_pos:start_pos+self.args["nb_neg_sample_rs"]]
        
        neg_resp_arr, neg_resp_idx_arr = [], []
        for neg_resp in sampled_neg_resps:
            neg_resp_plain = "{} ".format(self.sys_token) + neg_resp
            neg_resp_idx = self.preprocess(neg_resp_plain)[:self.max_sys_resp_len]
            neg_resp_idx_arr.append(neg_resp_idx)
            neg_resp_arr.append(neg_resp_plain)
        
        item_info["neg_resp_idx_arr"] = neg_resp_idx_arr
        item_info["neg_resp_arr"] = neg_resp_arr

Cannot reproduce the results in paper using your provided pre-trained models

Hi, @jasonwu0731
I was fine-tuning the intent task using your provided pre-trained models, but I got a different result, shown below.
[results screenshot omitted]

I used this shell script and changed nothing (Python 3.6).

The second question is about the following commands in your code:

# ./run_tod_lm_pretraining.sh 0 bert bert-base-uncased save/pretrain/ToD-BERT-MLM --only_last_turn
# ./run_tod_lm_pretraining.sh 0 bert bert-base-uncased save/pretrain/ToD-BERT-JNT --only_last_turn --add_rs_loss

The model is pretrained with a batch size of 8 on one GPU, but in your paper there were 2 GPUs and the batch size was 32.

Number of system acts for DSTC2 and GSIM

In the paper, 19 system acts are mentioned.
But when you query the data you get 9 system acts:
len(np.unique(data_info['sys_act']))

The same applies to GSIM dataset.

Cosine similarity from tod-bert encodings

I have a response selection problem where I only want to suggest relevant responses. I wanted to use cosine similarity between context and response as a threshold to filter out irrelevant responses.
Does it make sense to use the encodings of tod-bert-jnt, or a version fine-tuned on response selection, to compute a cosine similarity score between context and response to determine relevance? If yes, what should the threshold be?
I used the tod-bert-jnt model to compute cosine similarity between a couple of context-response pairs, but the results didn't look good: the similarity score was often above 0.85 even for completely irrelevant examples.
I find it intuitive to use the fine-tuned model for cosine similarity since it uses a dot product in its loss function. I haven't trained the fine-tuned model (let me know in case it's publicly available) and was wondering whether it is worth giving it a try.
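
For reference, here is a minimal sketch of the kind of computation described above, assuming tokenizer and tod_bert are loaded as in the README snippet; it is illustrative only, not an endorsed thresholding recipe.

import torch
import torch.nn.functional as F

def cls_embedding(text):
    # Encode the text and take the hidden state at the [CLS] position as its embedding.
    ids = torch.tensor([tokenizer.convert_tokens_to_ids(tokenizer.tokenize(text))]).long()
    ids = ids.to(next(tod_bert.parameters()).device)  # match the model's device
    with torch.no_grad():
        hidden = tod_bert(input_ids=ids, attention_mask=(ids > 0).long())[0]
    return hidden[0, 0]

context = "[CLS] [SYS] Hello, what can I help with you today? [USR] Find me a cheap restaurant."
response = "[SYS] There are several cheap restaurants in the north part of town."
similarity = F.cosine_similarity(cls_embedding(context), cls_embedding(response), dim=0)
print(similarity.item())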

L2 normalization for hid_resp and hid_cont

@jasonwu0731 Thanks for your excellent work.
# Calculate RCL loss
scores = torch.matmul(hid_cont, hid_resp.transpose(1, 0))
loss_rs = xeloss(scores, resp_label)
loss += loss_rs
loss_rs = loss_rs.item()

When calculating the RS loss, is it necessary to L2-normalize hid_resp and hid_cont before the matmul, so that the [CLS] vector can serve as a sentence embedding?
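
For reference, a hypothetical variant (not the repository's implementation): L2-normalizing both sides turns the dot product into cosine similarity, usually paired with a temperature so the logits keep a useful scale.

import torch
import torch.nn.functional as F

batch_size, hidden_size = 8, 768
hid_cont = torch.randn(batch_size, hidden_size)  # context [CLS] vectors
hid_resp = torch.randn(batch_size, hidden_size)  # response [CLS] vectors

# Project both sets of [CLS] vectors onto the unit sphere before scoring.
hid_cont = F.normalize(hid_cont, p=2, dim=-1)
hid_resp = F.normalize(hid_resp, p=2, dim=-1)

temperature = 0.05  # hypothetical value; would need tuning
scores = torch.matmul(hid_cont, hid_resp.transpose(1, 0)) / temperature
loss_rs = F.cross_entropy(scores, torch.arange(batch_size))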

Input for DA classification: context and system turn

For DA classification, the input is in the format:

dialogue history utterances from system and user [SYS] [SEP] system turn [USR] user turn

where we try to predict the DAs for the turn between the [SEP] and [USR] tokens, i.e. the system turn.

I was wondering if you could explain why this format was chosen? Specifically,
why is [SEP] placed after the [SYS] token? and
why is there a following user turn after the system turn?

I haven't found much information on DA prediction using BERT, which is why any explanation would be very helpful.

BTW, I realized that if sys_first_flag is set outside the loop, then the "system" and "user" turns really are the system and user turns. But if it is set inside (like before the code update), the system and user turns get swapped, and in that case we end up predicting the DAs of the user utterance given the above input format. (Maybe that's why sys_first_flag was inside the for loop before?)

Thank you!
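
For illustration only, a hypothetical rendering of the format described in the question (the utterance text is made up; the classified turn is the one between [SEP] and the final [USR]):

# Hypothetical example of the DA-classification input described above.
input_text = (
    "[USR] i am looking for a cheap restaurant in the north "
    "[SYS] [SEP] there are two cheap places in the north , do you prefer indian or italian ? "
    "[USR] italian please"
)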

Distributed training for the RCL task

Hello, I am trying to pretrain TOD-BERT on my own dataset, but because of the size of the dataset I need to distribute training to speed up computation. It seems like distributed training is built into the MLM task, but distributing the RCL task throws an error. We have written some code to distribute the RCL task, but our training results show little to no improvement on the RS loss versus the single-GPU case. I am wondering if there is any specific reason you decided not to distribute the RCL task over multiple GPUs, or a problem you encountered, or if there is just likely a bug in our code.
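
As an editorial aside for anyone attempting this, a common pattern in distributed contrastive training (not part of this repository) is to all-gather the response embeddings across ranks so that every GPU still scores its contexts against the full set of in-batch negatives, and to offset the arange labels by the rank accordingly. A minimal sketch:

import torch
import torch.distributed as dist

def gather_responses(local_resp):
    # Hypothetical helper: collect response [CLS] vectors from every rank so each
    # GPU can contrast its contexts against all in-batch negatives.
    world_size = dist.get_world_size()
    gathered = [torch.zeros_like(local_resp) for _ in range(world_size)]
    dist.all_gather(gathered, local_resp)
    # all_gather returns tensors detached from the graph, so keep the local,
    # gradient-carrying copy in its own slot.
    gathered[dist.get_rank()] = local_resp
    return torch.cat(gathered, dim=0)

# The positive for local context i then sits at column rank * local_batch_size + i:
# labels = torch.arange(local_batch_size) + dist.get_rank() * local_batch_size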
