
Pre-Trained Models for ToD-BERT

License: BSD 2-Clause "Simplified" License

task-oriented-dialogues dialogue pretrained-models natural-language-processing natural-language-understanding bert

tod-bert's Introduction

TOD-BERT: Pre-trained Natural Language Understanding for Task-Oriented Dialogues

Authors: Chien-Sheng Wu, Steven Hoi, Richard Socher and Caiming Xiong.

EMNLP 2020. Paper: https://arxiv.org/abs/2004.06871

Introduction

The underlying difference of linguistic patterns between general text and task-oriented dialogue makes existing pre-trained language models less useful in practice. In this work, we unify nine human-human and multi-turn task-oriented dialogue datasets for language modeling. To better model dialogue behavior during pre-training, we incorporate user and system tokens into the masked language modeling. We propose a contrastive objective function to simulate the response selection task. Our pre-trained task-oriented dialogue BERT (TOD-BERT) outperforms strong baselines like BERT on four downstream task-oriented dialogue applications, including intention recognition, dialogue state tracking, dialogue act prediction, and response selection. We also show that TOD-BERT has a stronger few-shot ability that can mitigate the data scarcity problem for task-oriented dialogue.

Citation

If you use any source code, pretrained models, or datasets included in this repo in your work, please cite the following paper. The BibTeX entry is listed below:

@inproceedings{wu-etal-2020-tod,
    title = "{TOD}-{BERT}: Pre-trained Natural Language Understanding for Task-Oriented Dialogue",
    author = "Wu, Chien-Sheng  and
      Hoi, Steven C.H.  and
      Socher, Richard  and
      Xiong, Caiming",
    booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.emnlp-main.66",
    doi = "10.18653/v1/2020.emnlp-main.66",
    pages = "917--929"
}

Update

  • (2020.10.01) Added more training and inference information. Released TOD-DistilBERT.
  • (2020.07.10) Loading models from Hugging Face is now supported.
  • (2020.04.26) Pre-trained models are available.

Pretrained Models

You can easily load the pre-trained models with the Hugging Face Transformers library via the AutoModel and AutoTokenizer classes. Several pre-trained versions are supported:

  • TODBERT/TOD-BERT-MLM-V1: TOD-BERT pre-trained only using the MLM objective
  • TODBERT/TOD-BERT-JNT-V1: TOD-BERT pre-trained using both the MLM and RCL objectives
  • TODBERT/TOD-DistilBERT-JNT-V1: TOD-DistilBERT pre-trained using both the MLM and RCL objectives

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("TODBERT/TOD-BERT-JNT-V1")
tod_bert = AutoModel.from_pretrained("TODBERT/TOD-BERT-JNT-V1")

You can also download the pre-trained models and load them from a local path:

from transformers import BertConfig, BertModel, BertTokenizer

model_name_or_path = <path_to_the_downloaded_tod-bert>  # local checkpoint directory
model_class, tokenizer_class, config_class = BertModel, BertTokenizer, BertConfig
tokenizer = tokenizer_class.from_pretrained(model_name_or_path)
tod_bert = model_class.from_pretrained(model_name_or_path)

Direct Usage

Please refer to the following guide on how to use our pre-trained ToD-BERT models. Our model is built on top of the PyTorch library and the Hugging Face Transformers library. Let's do a very quick overview of the model architecture and code. Detailed examples of the model architecture can be found in the paper.

# Encode text 
input_text = "[CLS] [SYS] Hello, what can I help with you today? [USR] Find me a cheap restaurant nearby the north town."
input_tokens = tokenizer.tokenize(input_text)
story = torch.Tensor(tokenizer.convert_tokens_to_ids(input_tokens)).long()

if len(story.size()) == 1: 
    story = story.unsqueeze(0) # batch size dimension

if torch.cuda.is_available(): 
    tod_bert = tod_bert.cuda()
    story = story.cuda()

with torch.no_grad():
    input_context = {"input_ids": story, "attention_mask": (story > 0).long()}
    hiddens = tod_bert(**input_context)[0] 
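
The returned hiddens tensor has shape (batch_size, sequence_length, hidden_size). As a hedged sketch rather than an official recipe, a common way to get one vector per dialogue context is to take the hidden state at the [CLS] position:

# Continues the snippet above: pool the [CLS] position as a context-level embedding.
cls_rep = hiddens[:, 0, :]   # shape: (batch_size, hidden_size)
print(cls_rep.shape)         # e.g. torch.Size([1, 768]) for the base model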

Training and Testing

If you would like to train the model yourself, you can download the datasets from their original papers or sources, or directly download a zip file here.

The repository is currently in this structure:

.
├── image
│   └── ...
├── models
│   ├── multi_class_classifier.py
│   ├── multi_label_classifier.py
│   ├── BERT_DST_Picklist.py
│   └── dual_encoder_ranking.py
├── utils
│   ├── multiwoz
│   │   └── ...
│   ├── metrics
│   │   └── ...
│   ├── loss_function
│   │   └── ...
│   ├── dataloader_nlu.py
│   ├── dataloader_dst.py
│   ├── dataloader_dm.py
│   ├── dataloader_nlg.py
│   ├── dataloader_usdl.py
│   └── ...
├── README.md
├── evaluation_pipeline.sh
├── evaluation_ratio_pipeline.sh
├── run_tod_lm_pretraining.sh
├── main.py
└── my_tod_pretraining.py
  • Run Pretraining
❱❱❱ ./run_tod_lm_pretraining.sh 0 bert bert-base-uncased save/pretrain/ToD-BERT-MLM --only_last_turn
❱❱❱ ./run_tod_lm_pretraining.sh 0 bert bert-base-uncased save/pretrain/ToD-BERT-JNT --only_last_turn --add_rs_loss
  • Run Fine-tuning
❱❱❱ ./evaluation_pipeline.sh 0 bert bert-base-uncased save/BERT
  • Run Fine-tuning (Few-Shot)
❱❱❱ ./evaluation_ratio_pipeline.sh 0 bert bert-base-uncased save/BERT --nb_runs=3 

Report

Feel free to create an issue or send an email to the first author at [email protected].

tod-bert's People

Contributors

jasonwu0731


tod-bert's Issues

Question about output labels

Hello, I am working on re-implementing the tod-bert code to run on my own pretraining dataset and I have been getting a CUDA error that seems to be stemming from incorrect inputs to the loss function. Upon further examination it seems I might not be understanding the output labels for the responses in the RCL task. I had thought the output labels would be 1 if it is the correct response, and zero if incorrect. However, the line of code that generates the output seems to just generate an incrementing array relative to the batch size.

Specifically, at the following line:
output_labels = torch.tensor(np.arange(batch_size)).long() #.to(args.device)

For a batch size of 8 for example, the output labels would be an array [0, 1, 2, 3, 4, 5, 6, 7]. Is this to be expected? If so, how does this correspond to the positive/negative response labels needed?

Thanks in advance!
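
For readers hitting the same point, here is a minimal, self-contained sketch (not the repository's exact code) of why arange labels work: with in-batch negatives, context i's positive response is the i-th row of the response batch, so the score matrix is batch_size x batch_size and the target class for row i is simply i.

import numpy as np
import torch
import torch.nn.functional as F

batch_size, hidden_size = 8, 768
hid_cont = torch.randn(batch_size, hidden_size)  # [CLS] vectors of the dialogue contexts
hid_resp = torch.randn(batch_size, hidden_size)  # [CLS] vectors of the gold responses

# Score every context against every response in the batch: (batch_size, batch_size).
scores = torch.matmul(hid_cont, hid_resp.transpose(1, 0))

# Context i's true response sits in column i, so its label is just the index i;
# the remaining batch_size - 1 columns act as the in-batch negative samples.
output_labels = torch.tensor(np.arange(batch_size)).long()
loss_rs = F.cross_entropy(scores, output_labels)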

How to decode?

Simple question probably, but I'm new to NLP and just doing an experiment. I've gotten the output and decoded it, but every token looks like [unusedxxx] when I decode. How can I do this properly? Here's my code.

import torch
from transformers import *

tokenizer = AutoTokenizer.from_pretrained("TODBERT/TOD-BERT-JNT-V1")
tod_bert = AutoModel.from_pretrained("TODBERT/TOD-BERT-JNT-V1")

# Encode text 
input_text = "[CLS] [SYS] Hello, what can I help with you today? [USR] Find me a cheap restaurant nearby the north town."
input_tokens = tokenizer.tokenize(input_text)
story = torch.Tensor(tokenizer.convert_tokens_to_ids(input_tokens)).long()

if len(story.size()) == 1: 
    story = story.unsqueeze(0) # batch size dimension

if torch.cuda.is_available(): 
    tod_bert = tod_bert.cuda()
    story = story.cuda()

with torch.no_grad():
    input_context = {"input_ids": story, "attention_mask": (story > 0).long()}
    outputs = tod_bert(**input_context)[0]
    hiddens = outputs[0]
    print(tokenizer.decode(torch.argmax(hiddens, dim=1)))
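
As a hedged note on the snippet above: AutoModel returns contextual hidden states, not vocabulary logits, so argmax over the hidden dimension yields indices in [0, hidden_size), which the tokenizer decodes as low vocabulary ids such as [unusedXXX]. Continuing the snippet:

# outputs is the last hidden state, shape (batch_size, seq_len, hidden_size);
# it is an embedding, not a distribution over the vocabulary, so there is
# nothing meaningful to argmax-decode here.
print(outputs.shape)

# To recover the input text, decode the input ids themselves:
print(tokenizer.decode(story[0]))

# Token-level predictions would require a model with a language-modeling head,
# not the bare encoder loaded above.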

Release trained evaluation models

Hi,

I am able to run the code without any problems. So thank you for sharing it.

Since the dataset I work with is small (very small), I want to test the fine-tuned models directly on my dataset before I actually re-train them or find some other solution. I was wondering if you could release the evaluation models as well? I can always train them, but it would just be quicker to get them directly (if possible).

Thank you.

How to set dialog state tracking labels if the parts of histories are truncated

Hi, Thank you for this great repository and the paper.
I'm very impressed with your work.

I have a minor question.
How did you set the labels for dialogue state tracking when parts of the dialogue context are truncated?
If we set the maximum number of turns to a certain number, then we have to cut the preceding turns.
But if we do that, the dialogue states that were updated in those turns should not appear in the ground-truth labels, since the model cannot see those slot types and values in the truncated dialogue sequence.

For example, let us assume that the user wants to reserve a hotel room and specified the hotel name at the first turn.
Obviously, that name should be set as the value of slot type "hotel name" in dialog states.
But as the conversation goes on, the first turn will be cut out to keep the input shorter than the model's maximum size, while the dialogue state still has the hotel name as the updated value.
This looks unnatural to me, since the model cannot see that context anymore but still has to predict the slot value as an output.

I wonder how you handled this problem.
Please let me know if there is an efficient preprocessing way to handle this.
Thank you.

Question about ToD-BERT as a pipeline

Hi Jason,

It is a very nice paper, and it helps enlighten me to consider pre-training an AE model for better downstream tasks. As a matter of fact, I am wondering how you plan to apply it to a complete pipeline (a ToD bot). From what I see, the downstream tasks mentioned in the paper are actually sequential individual components inside the ToD bot. I mean that each task is related to the next task somehow, so fine-tuning on each separately seems to lose that connection. I understand this fine-tuning setup wants to show its power due to such a pre-training design. Any thoughts? I appreciate it.

Code for pre-training ToD-BERT

Hi, how can I pre-train ToD-BERT by myself? What are the hyperparameters? How to reproduce the evaluation results?
Thanks!

KeyError: 'slots'

Hi, I'm getting an error I cannot seem to get past. I've included it below:

Traceback (most recent call last):
  File "main.py", line 109, in <module>
    trn_loader = get_loader(args, "train", tokenizer, datasets, unified_meta)
  File "/content/ToD-BERT/utils/utils_general.py", line 58, in get_loader
    dataset = globals()["Dataset_"+task](data_info, tokenizer, args, unified_meta, mode, args["max_seq_length"])
  File "/content/ToD-BERT/utils/dataloader_dst.py", line 20, in __init__
    self.slots = list(unified_meta["slots"].keys())
KeyError: 'slots'

Is this to do with a specific dataset I need to include in the list of datasets to use? I do not want to use them all, just the MultiWOZ ones.

Mapping response selection output

Hi Jason,

The response selection output gives an array of 100 elements, which I believe are ranked from top 1 to top 100 (where one of them is the true response and the other 99 are responses from the same batch for other inputs, treated as negative samples).

I was wondering how to map these indices to the actual responses. That is, what part of the code maps the responses to these indices? I see there is code in the dataloader_nlg.py file; however, I also noticed that the code below never gets executed since nb_neg_sample_rs is 0. Could you please explain how to interpret the test output?

Thanks in advance!

   if self.args["nb_neg_sample_rs"] != 0 and self.mode == "train":
        if self.args["sample_negative_by_kmeans"]:
            try:
                cur_cluster = self.others["ToD_BERT_SYS_UTTR_KMEANS"][self.data["turn_sys"][index]]
                candidates = self.others["KMEANS_to_SENTS"][cur_cluster]
                nb_selected = min(self.args["nb_neg_sample_rs"], len(candidates))
                try:
                    start_pos = random.randint(0, len(candidates)-nb_selected-1)
                except:
                    start_pos = 0
                sampled_neg_resps = candidates[start_pos:start_pos+nb_selected]
            
            except:
                start_pos = random.randint(0, len(self.resp_cand_trn)-self.args["nb_neg_sample_rs"]-1)
                sampled_neg_resps = self.resp_cand_trn[start_pos:start_pos+self.args["nb_neg_sample_rs"]]  
        else:
            start_pos = random.randint(0, len(self.resp_cand_trn)-self.args["nb_neg_sample_rs"]-1)
            sampled_neg_resps = self.resp_cand_trn[start_pos:start_pos+self.args["nb_neg_sample_rs"]]
        
        neg_resp_arr, neg_resp_idx_arr = [], []
        for neg_resp in sampled_neg_resps:
            neg_resp_plain = "{} ".format(self.sys_token) + neg_resp
            neg_resp_idx = self.preprocess(neg_resp_plain)[:self.max_sys_resp_len]
            neg_resp_idx_arr.append(neg_resp_idx)
            neg_resp_arr.append(neg_resp_plain)
        
        item_info["neg_resp_idx_arr"] = neg_resp_idx_arr
        item_info["neg_resp_arr"] = neg_resp_arr

Cannot reproduce the results in paper using your provided pre-trained models

Hi, @jasonwu0731
I was fine-tuning the intent task using your provided pre-trained models, but I got a different result, shown below.
[results screenshot omitted]

I used this shell script and changed nothing (Python 3.6).

The second question is about the following commands in your code:

# ./run_tod_lm_pretraining.sh 0 bert bert-base-uncased save/pretrain/ToD-BERT-MLM --only_last_turn
# ./run_tod_lm_pretraining.sh 0 bert bert-base-uncased save/pretrain/ToD-BERT-JNT --only_last_turn --add_rs_loss

The model is pretrained with a batch size of 8 on one GPU, but in your paper there were 2 GPUs and the batch size was 32.

Number of system acts for DSTC2 and GSIM

In the paper, 19 system acts are mentioned.
But when you query the data you get 9 system acts:
len(np.unique(data_info['sys_act']))

The same applies to GSIM dataset.

Cosine similarity from tod-bert encodings

I have a response selection problem where I only want to suggest relevant responses. I wanted to use cosine similarity between context and response as a threshold to filter out irrelevant responses.
Does it make sense to use the encodings of tod-bert-jnt, or a version fine-tuned on response selection, to compute a cosine similarity score between context and response to determine relevance? If yes, what should the threshold be?
I used the tod-bert-jnt model to compute cosine similarity between a couple of context-response pairs, but the results didn't look good: the similarity score was often above 0.85 even for completely irrelevant examples.
I find it intuitive to use the fine-tuned model for cosine similarity since it uses a dot product in its loss function. I haven't trained the fine-tuned model (let me know in case it's publicly available) and was wondering whether it is worth giving it a try.
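
For reference, here is a minimal sketch of the kind of computation described above, assuming tokenizer and tod_bert are loaded as in the README snippet; it is illustrative only, not an endorsed thresholding recipe.

import torch
import torch.nn.functional as F

def cls_embedding(text):
    # Encode the text and take the hidden state at the [CLS] position as its embedding.
    ids = torch.tensor([tokenizer.convert_tokens_to_ids(tokenizer.tokenize(text))]).long()
    ids = ids.to(next(tod_bert.parameters()).device)  # match the model's device
    with torch.no_grad():
        hidden = tod_bert(input_ids=ids, attention_mask=(ids > 0).long())[0]
    return hidden[0, 0]

context = "[CLS] [SYS] Hello, what can I help with you today? [USR] Find me a cheap restaurant."
response = "[SYS] There are several cheap restaurants in the north part of town."
similarity = F.cosine_similarity(cls_embedding(context), cls_embedding(response), dim=0)
print(similarity.item())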

L2 normalization for hid_resp and hid_cont

@jasonwu0731 Thanks for your excellent work.
# Calculate RCL loss
scores = torch.matmul(hid_cont, hid_resp.transpose(1, 0))
loss_rs = xeloss(scores, resp_label)
loss += loss_rs
loss_rs = loss_rs.item()

When calculating the RS loss, is it necessary to L2-normalize hid_resp and hid_cont before the matmul, so that the [CLS] vector can serve as a sentence embedding?
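
For reference, a hypothetical variant (not the repository's implementation): L2-normalizing both sides turns the dot product into cosine similarity, usually paired with a temperature so the logits keep a useful scale.

import torch
import torch.nn.functional as F

batch_size, hidden_size = 8, 768
hid_cont = torch.randn(batch_size, hidden_size)  # context [CLS] vectors
hid_resp = torch.randn(batch_size, hidden_size)  # response [CLS] vectors

# Project both sets of [CLS] vectors onto the unit sphere before scoring.
hid_cont = F.normalize(hid_cont, p=2, dim=-1)
hid_resp = F.normalize(hid_resp, p=2, dim=-1)

temperature = 0.05  # hypothetical value; would need tuning
scores = torch.matmul(hid_cont, hid_resp.transpose(1, 0)) / temperature
loss_rs = F.cross_entropy(scores, torch.arange(batch_size))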

Input for DA classification: context and system turn

For DA classification, the input is in the format:

dialogue history utterances from system and user [SYS] [SEP] system turn [USR] user turn

where we try to predict the DAs for the turn between the [SEP] and [USR] tokens, i.e. the system turn.

I was wondering if you could explain why this format was chosen? Specifically,
why is [SEP] placed after the [SYS] token? and
why is there a following user turn after the system turn?

I haven't found much information on DA prediction using BERT, which is why any explanation would be very helpful.

BTW, I realized that if sys_first_flag is set outside the loop, then the "system" and "user" turns really are the system and user turns. But if it is set inside (like before the code update), the system and user turns get swapped, and in that case we end up predicting the DAs of the user utterance given the above input format. (Maybe that's why sys_first_flag was inside the for loop before?)

Thank you!
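
For illustration only, a hypothetical rendering of the format described in the question (the utterance text is made up; the classified turn is the one between [SEP] and the final [USR]):

# Hypothetical example of the DA-classification input described above.
input_text = (
    "[USR] i am looking for a cheap restaurant in the north "
    "[SYS] [SEP] there are two cheap places in the north , do you prefer indian or italian ? "
    "[USR] italian please"
)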

Distributed training for the RCL task

Hello, I am trying to pretrain TOD-BERT on my own dataset, but because of the size of the dataset I need to distribute training to speed up computation. It seems like distributed training is built into the MLM task, but distributing the RCL task throws an error. We have written some code to distribute the RCL task, but our training results show little to no improvement on the RS loss versus the single-GPU case. I am wondering if there is any specific reason you decided not to distribute the RCL task over multiple GPUs, or a problem you encountered, or if there is just likely a bug in our code.
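
As an editorial aside for anyone attempting this, a common pattern in distributed contrastive training (not part of this repository) is to all-gather the response embeddings across ranks so that every GPU still scores its contexts against the full set of in-batch negatives, and to offset the arange labels by the rank accordingly. A minimal sketch:

import torch
import torch.distributed as dist

def gather_responses(local_resp):
    # Hypothetical helper: collect response [CLS] vectors from every rank so each
    # GPU can contrast its contexts against all in-batch negatives.
    world_size = dist.get_world_size()
    gathered = [torch.zeros_like(local_resp) for _ in range(world_size)]
    dist.all_gather(gathered, local_resp)
    # all_gather returns tensors detached from the graph, so keep the local,
    # gradient-carrying copy in its own slot.
    gathered[dist.get_rank()] = local_resp
    return torch.cat(gathered, dim=0)

# The positive for local context i then sits at column rank * local_batch_size + i:
# labels = torch.arange(local_batch_size) + dist.get_rank() * local_batch_size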
