microsoft / DialoGPT

Large-scale pretraining for dialogue

License: MIT License

Python 97.93% Shell 0.05% Makefile 2.02%
dialogue machine-learning pytorch transformer text-generation dialogpt gpt-2 text-data data-processing

dialogpt's Introduction

A State-of-the-Art Large-scale Pretrained Response Generation Model (DialoGPT)

This project page is no longer maintained as DialoGPT is superseded by GODEL, which outperforms DialoGPT according to the results of this paper. Unless you use DialoGPT for reproducibility reasons, we highly recommend you switch to GODEL.

This repository contains the source code and trained model for a large-scale pretrained dialogue response generation model. The human evaluation results indicate that the response generated from DialoGPT is comparable to human response quality under a single-turn conversation Turing test.

The repository is based on Hugging Face's pytorch-transformers and OpenAI GPT-2, and contains the data extraction script, model training code, and pretrained small (117M), medium (345M), and large (762M) model checkpoints.

The model is trained on 147M multi-turn dialogues from Reddit discussion threads. The largest model can be trained in several hours on an 8xV100 machine (though this is not required), using distributed training and the FP16 option.

The included scripts can be used to reproduce the results of the DSTC-7 grounded dialogue generation challenge and of a 6K multi-reference dataset created from Reddit data.

Project webpage: https://www.microsoft.com/en-us/research/project/large-scale-pretraining-for-response-generation/

ArXiv paper: https://arxiv.org/abs/1911.00536

News

(Update 07/09/2022) Changes on the files.pushshift.io/reddit server caused our data generation pipeline to break. These problems have now been fixed, and the steps explained in the Data Preparation subsection below should work again. Data is generated in about 10 hours with 8 processes (-j 8), and 800GB of temporary disk space is needed.

(Update 06/23/2021) We have released a retrieval-augmented/grounded version of DialoGPT (RetGen); please check out the RetGen repo and the RetGen paper.

(Update 05/20/2021) An awesome video walkthrough on YouTube for DialoGPT by Prakhar Mishra

(Update 03/31/2021) A third-party Gradio web demo by AK391; try it out.

(Update 09/15/2020) A set of large-scale dialog ranking models has been released!

DialoGPT generation is improved by integrating with our latest dialog ranking models, DialogRPT

(Update 07/08/2020) The 6K multi-ref test set has been released!

To generate the data, please run demo.py with the data option set to 'full'; the generated 6K multi-ref test set will be located at

./data/test.refs.txt

(Update 03/10/2020) Model cards available in Huggingface Transformers!

Please check out our model cards in the Hugging Face Transformers repository. With a few lines of code it is straightforward to play with DialoGPT interactively; see the example sketch after the model links below.

small model: https://huggingface.co/microsoft/DialoGPT-small

medium model: https://huggingface.co/microsoft/DialoGPT-medium

large model: https://huggingface.co/microsoft/DialoGPT-large

(New) Ranking model: https://huggingface.co/microsoft/DialogRPT-updown
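
For example, here is a minimal interactive loop following the usage shown on the Hugging Face model cards (a sketch only; AutoModelForCausalLM is the current replacement for the AutoModelWithLMHead class used on the older cards, and the generation settings are illustrative):

# Minimal interactive chat loop with a DialoGPT checkpoint from the Hugging Face Hub.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-medium")
model = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-medium")

chat_history_ids = None
for step in range(5):
    # Encode the new user input and append the end-of-text token.
    new_user_input_ids = tokenizer.encode(
        input(">> User: ") + tokenizer.eos_token, return_tensors="pt")
    # Append the new user input to the running chat history.
    bot_input_ids = (new_user_input_ids if chat_history_ids is None
                     else torch.cat([chat_history_ids, new_user_input_ids], dim=-1))
    # Generate a response, capping the total history length at 1000 tokens.
    chat_history_ids = model.generate(
        bot_input_ids, max_length=1000, pad_token_id=tokenizer.eos_token_id)
    # Print only the newly generated tokens.
    print("DialoGPT:", tokenizer.decode(
        chat_history_ids[:, bot_input_ids.shape[-1]:][0], skip_special_tokens=True))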

(Update 01/06/2020) Some third-party decoding script implementations:

Recommended Configuration

  • Linux Ubuntu 16.04
  • GPU with at least 12G memory

DialoGPT was developed entirely on Ubuntu 16.04, and, depending on our availability, we try to provide support if you experience difficulties running the code on the same configuration. However, we are unable to provide support for other distributions or operating systems. Portions of the code may run on other UNIX flavors (macOS, Windows Subsystem for Linux, Cygwin, etc.), but we recommend using Ubuntu for the main training code.

The training code can be run on a CPU, but it will be slow; we recommend using GPUs to train and fine-tune all models. There is no minimum number of GPUs. However, when using distributed training across multiple GPUs, the speed-up versus the number of GPUs is roughly sub-linear. To simulate the same effective batch size when using fewer GPUs, use a larger gradient_accumulation_steps during model training, as sketched below.
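
A rough sketch of the bookkeeping, assuming the usual convention that the effective batch size is the per-GPU batch multiplied by the number of GPUs and by gradient_accumulation_steps (note that demo.py also divides the requested train_batch_size by gradient_accumulation_steps internally, as its log output shows):

# Sketch: keep the effective batch size constant when reducing the GPU count
# by scaling gradient_accumulation_steps up proportionally.
def effective_batch_size(per_gpu_batch: int, n_gpu: int, grad_accum_steps: int) -> int:
    return per_gpu_batch * n_gpu * grad_accum_steps

full_setup = effective_batch_size(per_gpu_batch=8, n_gpu=8, grad_accum_steps=2)   # 128
small_setup = effective_batch_size(per_gpu_batch=8, n_gpu=2, grad_accum_steps=8)  # 128
assert full_setup == small_setup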

The 117M and 345M models can be loaded on a single GPU with 12G of memory. The 762M model requires a single GPU with more than 16G of memory for efficient training. Training speed on a benchmark dataset with 50M training instances, using V100 GPUs:

n_gpu epoch time (h) token/sec
1 118 10847
2 62 20645
4 34 37647
8 18 71356

Fine-tuning from our pretrained model on a new dataset typically requires 1-2 epochs.

Setup & Installation (TL;DR)

We provide a demo script, demo.py, to ease deployment of this system. demo.py runs a pipeline of model downloading, data extraction, data preprocessing, and model training over a dummy dataset from a single command line.

Train model with Conda Environment

Please use the command lines below to clone the repository, install the requirements, and load the Conda environment (note that the NVIDIA CUDA 10.0 developer toolkit is required):

sudo apt-get install -y make wget gzip bzip2 xz-utils zstd sed
git clone https://github.com/microsoft/DialoGPT.git
cd DialoGPT
conda env create -f LSP-linux.yml -n LSP
conda activate LSP

If you run this on a platform other than Linux, please use LSP-generic.yml instead of LSP-linux.yml, but note that the generic environment is not tested on all platforms, so stability cannot be guaranteed. To use FP16 training, please install apex using the commands below:

conda activate LSP
git clone https://github.com/NVIDIA/apex
cd apex
git reset --hard 3d01e4a0a188cc8df54bc6e44cf5eb40ff6b4cc5
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" .
python3.6 demo.py

Train model with Docker environment

To start, first install Docker and nvidia-docker from their official repositories. The image environment for running the code can be loaded as below:

Nvidia-docker v2.*

$ docker run --gpus all --ipc=host --rm -it -v $PWD:/workspace --network=host icaruszyz/large-scale-training:dialogpt bash

Nvidia-docker v1.*

$ nvidia-docker run --rm -it -v $PWD:/workspace --network=host icaruszyz/large-scale-training:dialogpt bash

Inside the docker container, run

python demo.py

Pipeline details

This section explains all components in the demo.py.

Data loading

Before running demo.py, you can set DATA_FOLDER (default value ./models) in demo.py to the location where you want to download all the data and pretrained/fine-tuned models. Then simply run

python demo.py

to

  • automatically download models and data,
  • prepare the raw data into a db that is ready for the program to use,
  • generate a training script.

Note that by default demo.py uses dummy data; please specify the Reddit training data with the --data option. Three options are available: dummy, small, and full.

python demo.py --data small
python demo.py --data full

The small Reddit data is around 140MB and the full Reddit data is more than 27GB. You may want to prepare a cup of coffee when processing the full Reddit data, because it takes a long time!

To generate the 6K multi-ref test set data, please run demo.py with the data option set to 'full'; the generated data will be located at

./data/test.refs.txt

Pretrained model

The pretrained and fine-tuned models are available on Azure Blob Storage. Please run/see demo.py for details on how to download and use those models, or download them directly using the links in demo_utils.py.

Preparing data

First, use prepare4db.sh to convert a tsv data file into the format that the following script can recognize. The training data then needs to be processed into a database file with the command line below:

python prepro.py --corpus $DATA_PATH
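
For illustration, here is a hypothetical corpus written in the weighted, tab-separated layout that the generated train.tsv uses (an ID column followed by one "weight text" field per turn, as shown in the "Understanding the train.tsv file" issue further below); the exact layout your own data needs should be verified against data/dummy_data.tsv:

# Hypothetical sketch: write a small multi-turn corpus in the weighted,
# tab-separated layout and then run prepro.py on it. The conversation IDs and
# file name here are made up; verify the layout against data/dummy_data.tsv.
dialogues = [
    ("conv_0", ["how are you ?", "pretty good , thanks !"]),
    ("conv_1", ["what time is it ?", "about noon , I think ."]),
]
with open("./data/my_corpus.tsv", "w", encoding="utf-8") as f:
    for conv_id, turns in dialogues:
        # A weight of 1.0 keeps a turn as a training target; 0.0 masks it out.
        f.write(conv_id + "\t" + "\t".join(f"1.0 {t}" for t in turns) + "\n")

# Then: python prepro.py --corpus ./data/my_corpus.tsv --max_seq_len 128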

Using the training script

The training script can be used in single GPU or multiple GPU settings (distributed training across multiple GPUs within a single node):

python ./LSP_train.py  # Single GPU training
python -m torch.distributed.launch --nproc_per_node=8 ./LSP_train.py  # Training on 8 GPUs

The training script accepts several arguments to tweak the training:

Argument Type Default value Description
max_seq_length int 128 Maximum number of tokens for each training instance
train_input_file str "" Path of the training dataset in .db format
eval_input_file str "" Path of the validation set in tsv format
continue_from int 0 Resume training after the specified number of steps
fp16 boolean True Whether to use 16-bit floating point for model training
train_batch_size int 4 Batch size for training
eval_batch_size int 4 Batch size for validation
gradient_accumulation_steps int 2 Number of steps over which to accumulate gradients
learning_rate float 1e-5 Learning rate
lr_schedule str noam Learning rate schedule, chosen from [noam, noamwd, BERT, None]
num_optim_steps int 1000000 Number of training optimization steps
no_token_id boolean True If True, use all-zeros token-type embeddings

During training, two log files are updated: train_log.txt and eval_log.txt contain the model loss, perplexity, and training speed (tokens/sec) statistics for the training and dev sets.

The log files and saved model checkpoints can be found in ./models/output_model.

Model decoding

We note that even with a properly filtered Reddit dataset, our model can sometimes still generate moderately toxic/inappropriate responses. For this reason, we are unable to provide the decoding script at this time (access to the live demo and decoding script is currently by invitation only). We are still working on a controlled decoding method to prevent the system from generating toxic output. Please stay tuned.

See issue #3 and the Reddit discussions for some discussion of third-party decoding methods.

See below for some third-party decoding methods:

Models

We release 6 fine-tuned models which can be further fine-tuned on low-resource, user-customized datasets. The total number of parameters in these models ranges from 117M to 762M, in accordance with the OpenAI GPT-2 model sizes.

Model Fine-tuned from GPT-2 Trained from scratch
DialoGPT 762M model [link] [huggingface model card] [link]
DialoGPT 345M model [link] [huggingface model card] [link]
DialoGPT 117M model [link] [huggingface model card] [link]
DialoGPT 345M model (reverse, for MMI) link -
DialogRPT (new ranking models) link -

The model files can be loaded exactly like the GPT-2 model checkpoints from Hugging Face's Transformers. You can find the corresponding configuration files (merges.txt, config.json, vocab.json) in this repo under ./configs/*.

The reverse model predicts the source from the target and is used for MMI reranking, as sketched below.
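
A rough illustration of MMI reranking (a sketch only: the reverse-model path is a placeholder for the reverse 345M checkpoint linked above, which would need to be loaded into a Hugging Face GPT-2 model, possibly with some key remapping; the sampling settings are illustrative):

# Sketch of MMI reranking: generate candidates with the forward model, then pick
# the one from which the reverse model best reconstructs the source.
# "path/to/reverse-345M" is a hypothetical local path to the reverse checkpoint.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-medium")
forward_model = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-medium")
reverse_model = AutoModelForCausalLM.from_pretrained("path/to/reverse-345M")  # placeholder

source = "Does money buy happiness?"
src_ids = tokenizer.encode(source + tokenizer.eos_token, return_tensors="pt")

# Sample candidate responses from the forward model.
outs = forward_model.generate(src_ids, do_sample=True, top_k=10, num_return_sequences=8,
                              max_length=80, pad_token_id=tokenizer.eos_token_id)
candidates = [tokenizer.decode(o[src_ids.shape[-1]:], skip_special_tokens=True) for o in outs]

def reverse_nll(response: str) -> float:
    """Negative log-likelihood of the source given the response (lower is better)."""
    resp_ids = tokenizer.encode(response + tokenizer.eos_token, return_tensors="pt")
    input_ids = torch.cat([resp_ids, src_ids], dim=-1)
    labels = input_ids.clone()
    labels[:, :resp_ids.shape[-1]] = -100  # only score the source tokens
    with torch.no_grad():
        return reverse_model(input_ids, labels=labels).loss.item()

print(min(candidates, key=reverse_nll))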

The DialogRPT models are our recently proposed ranking models, used to predict human feedback (upvotes, replies) on responses. They can be used to improve DialoGPT generation quality (see our EMNLP paper for details); a minimal scoring sketch follows.
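
The sketch below scores a (context, response) pair with DialogRPT-updown, following the usage shown on the DialogRPT model cards; the end-of-text separator between context and response is taken from those cards and should be verified there:

# Sketch: score a (context, response) pair with DialogRPT-updown, which predicts
# how likely a response is to receive up-votes. Higher score = preferred candidate.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

rpt_tokenizer = AutoTokenizer.from_pretrained("microsoft/DialogRPT-updown")
rpt_model = AutoModelForSequenceClassification.from_pretrained("microsoft/DialogRPT-updown")

def updown_score(context: str, response: str) -> float:
    # Context and response are joined with the end-of-text token, as on the model card.
    ids = rpt_tokenizer.encode(context + "<|endoftext|>" + response, return_tensors="pt")
    with torch.no_grad():
        logits = rpt_model(ids, return_dict=True).logits
    return torch.sigmoid(logits).item()

print(updown_score("Does money buy happiness?", "Depends how much money you spend on it."))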

Retraining full models

Data Preparation

The first step to retrain the full models is to generate the aforementioned 27GB Reddit dataset. This involves downloading the full Reddit submission and comment dumps from https://files.pushshift.io/reddit and creating intermediate files, which overall requires 700GB of local disk space. Downloading and processing the full data takes about 1-2 days, depending on your (CPU) compute capabilities (e.g., ~24 hours with 8 cores on a recent computer). Assuming you ran the setup and installation steps above (conda activate LSP, etc.), you can create the full dataset by running either:

python demo.py --data full

or

cd reddit_extractor; SIZE=full make -j 8; cd ..

The former command calls the latter, so the two methods are equivalent. We recommend the former; the latter is mostly useful if you run into a problem or want to customize any arguments (e.g., the make command lets you build only a subset of the data). Note that the downloading phase can be error-prone, for example depending on your geolocation (firewall, etc.). If the above commands fail to generate data/train.tsv, or if that file is not anywhere close to 27GB, something went wrong. In that case, you may want to inspect reddit_extractor/wget-log and reddit_extractor/logs/*.log for any obvious error (e.g., wget unable to download from pushshift.io). If the error messages don't make sense to you, feel free to contact us, and please be sure to include any error messages gathered from these log files.

Training data statistics: the generated training tsv file should be roughly 26.8 GB uncompressed, with 146.8M training instances, 3.87B source tokens, and 2.14B target tokens (including utterance-level 0/1 weights). The resulting train.tsv file should contain 146,846,215 lines; a quick sanity check is sketched below.
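
A minimal sanity-check sketch, assuming the default ./data/train.tsv location (counting lines over a ~27GB file takes a few minutes):

# Sketch: check the generated training file against the stated size and line count.
import os

path = "./data/train.tsv"
size_gb = os.path.getsize(path) / 1024 ** 3
with open(path, encoding="utf-8") as f:
    n_lines = sum(1 for _ in f)
print(f"{size_gb:.1f} GB, {n_lines:,} lines")
# Expected: roughly 26.8 GB and 146,846,215 lines. If the file is much smaller,
# inspect reddit_extractor/wget-log and reddit_extractor/logs/*.log for errors.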

Training

We recommend generating the above data using demo.py --data full, as it (1) generates the data, (2) converts it into DB format, and (3) trains a model using python LSP_train.py. Please edit demo.py directly if you want to customize any of the hyperparameters.

Evaluations

DSTC-7 challenge

Our model achieved state-of-the-art results on the DSTC-7 Challenge response generation task.

Experiment NIST2 NIST4 BLEU2 BLEU4 METEOR ENT-4 DIST-1 DIST-2 Avg. Len
Human response 2.62 2.65 12.35% 3.13% 8.31% 10.45 16.66% 67.01% 18.8
DSTC-7 Winner 2.51 2.52 14.35% 1.83% 8.07% 9.03 10.89% 32.49% 15.1
DialoGPT 345M 2.80 2.82 14.16% 2.31% 8.51% 10.08 9.13% 39.73% 16.9
DialoGPT 345M (BS) 2.92 2.97 19.18% 6.05% 9.29% 9.57 15.73% 51.03% 14.2

where ENT represents the Entropy score and DIST represents the Distinct score. For all metrics except the average length, larger values are better.

Note that superior automatic evaluation scores compared to human responses do not necessarily imply that our model achieves human parity. Please check out our paper for a more detailed analysis.

To fine-tune the 345M DialoGPT model on the DSTC-7 challenge data on a server with 8 V100 GPUs, please run the following command line (the DSTC data can be found in the DSTC-7 repo):

python3 -m torch.distributed.launch --nproc_per_node=8 train_LSP.py --init_checkpoint ./models/medium/medium_ft.pkl --train_input_file ./data/DSTC_train.db --eval_input_file ./data/DSTC_valid.tsv --model_name_or_path ./model/medium/ --learning_rate 1e-4  --train_batch_size 64 --eval_batch_size 64 --no_token_id

The trained model can be found at DSTC medium model

Evaluation

  1. Please download the following 3rd-party packages and save them into the empty folder 3rdparty:

  2. Please follow the DSTC-7 official repo to extract the data, and put data-official-test/test.refs.txt into ./dstc/data/ folder.

  3. Run the extraction script below to produce the human response hypothesis file human.resp.txt:

    python extract_human.py
  4. Finally, to reproduce the human-hypothesis results on the DSTC dataset, please run the following command under the repo folder:

    python batch_eval.py

The evaluation results will be generated in the folder ./dstc/eval/

6K multi-ref dataset result

Automatic evaluation

We test on the 6K multi-ref dataset from Reddit. The results are summarized below.

Experiment NIST2 NIST4 BLEU2 BLEU4 METEOR ENT-4 DIST-1 DIST-2 Avg. Len
Human response 3.41 4.25 17.90% 7.48% 10.64% 11 14.50% 63.00% 13.1
DialoGPT 117M 2.39 2.41 10.54% 1.55% 7.53% 10.78 8.60% 39.90% 12.8
DialoGPT 345M 3 3.06 16.96% 4.56% 9.81% 9.13 6.80% 26.30% 12.2
DialoGPT 762M 2.84 2.9 18.66% 5.25% 9.66% 9.72 7.76% 29.93% 11.2
DialoGPT 345M (BS) 3.4 3.5 21.76% 7.92% 10.74% 10.48 12.38% 48.74% 11.3
DialoGPT 345M (w/MMI) 3.28 3.33 15.68% 3.94% 11.23% 11.25 9.39% 45.55% 17.2

Human evaluation

We further conducted human evaluations (6K examples for each method, each example evaluated by 3 human judges). The results provide strong evidence that our generation quality approaches the quality of real human responses under this non-interactive Turing test:

Relevance: Between A and B, which one is more relevant to the source prompt?

System A A Wins (%) Ties (%) B Wins (%) System B
DialoGPT 345M 2671 (45%) 513 (9%) 2816 (47%) Human responses
DialoGPT 345M 3281 (72%) 394 (9%) 882 (19%) PersonalityChat
DialoGPT 345M w/ MMI 2871 (48%) 522 (9%) 2607 (43%) Human responses

Informativeness: Between A and B, which one is more contentful and informative?

System A A Wins (%) Ties (%) B Wins (%) System B
DialoGPT 345M 2722 (45%) 234 (4%) 3044 (51%) Human responses
DialoGPT 345M 3490 (77%) 206 (5%) 861 (19%) PersonalityChat
DialoGPT 345M w/ MMI 3011 (50%) 234 (4%) 2755 (46%) Human responses

Human-likeness: Between A and B, which one do you think is more likely to have been generated by a human?

System A A Wins (%) Ties (%) B Wins (%) System B
DialoGPT 345M 2716 (45%) 263 (4%) 3021 (50%) Human responses
DialoGPT 345M 3462 (76%) 196 (4%) 899 (20%) PersonalityChat
DialoGPT 345M w/ MMI 2978 (50%) 241 (4%) 2781 (46%) Human responses

Please see full details in our arxiv paper.

Generated Examples

  • Multi-turn generation examples from an interactive environment:
Role Response
User Does money buy happiness?
Bot Depends how much money you spend on it .
User What is the best way to buy happiness ?
Bot You just have to be a millionaire by your early 20s, then you can be happy .
User This is so difficult !
Bot You have no idea how hard it is to be a millionaire and happy . There is a reason the rich have a lot of money
  • Multi-turn self-playing dialogue with user prompt
Role Response
User what is the meaning of life ?
Bot The meaning is to be with your family and friends .
Bot I'm going to guess : It means that your parents and friends have loved you . That's the meaning , I believe .
Bot I think the meaning is to make sure you're alive when you die . So you'll be with your family , friends , and yourself .
Bot It's not , but thanks for trying to help .
  • Commonsense question answering examples
User Bot
who is the first president of the United States George Washington
what is the boiling point of water? I think it's about 212 F.
which one is bigger, sun or moon? The sun .
which animal has black and white stripes? A black and white striped cat .
  • Handling context that does not appear in training examples (note that training examples are sampled from 2005 to 2017)
User Bot
The trading war between China and US is still happening . I'm sure the Chinese government and the U . S . will be fine with that as long as they keep their arms on the table and the money flowing .
who won the world cup in 2018 ? Germany , I believe .
Nvidia's Titan RTX is really good . I have this card and i love it . No complaints .
Can Porsche beat Tesla with its new Taycan EV ? I don't think so . The Tesla is more powerful than the porsche .
  • Our bot's response to philosophy questions
User Bot
What is the meaning of a good life ? I think it's that we're all connected to our past lives and the meaning of life is to live the way you want to and how you want to .
Which is more important -- being successful or being happy ? Being successful is just endless . Being happy is the only thing that matters .
Is peace the only way to stop war ? Not if the world is not willing to work with you . Peace is the ultimate goal , but peace requires more than that .
How much freedom should people have ? The right amount of freedom is the freedom to do as you please , as long as you don't hurt people or property .
  • Finally
User Bot
What do you think of the DialoGPT repo ? It's pretty good , but it's very easy to find a bug .

Please start an issue if you spot any :)

Try our system

Access to the live demo and decoding script is currently by invitation only. Please stay tuned for the full release.

Related Project

  • RetGen: https://github.com/dreasysnail/RetGen. Retrieval-augmented/grounded DialoGPT and beyond. RetGen is a joint training framework that simultaneously optimizes a dense passage retriever and a knowledge-grounded text generator in an end-to-end fashion.

  • Microsoft ICECAPS: https://github.com/microsoft/icecaps.

    As an orthogonal repository to this project, Microsoft Icecaps is an open-source toolkit (in TensorFlow) for building neural conversational systems. Icecaps provides an array of tools from recent conversation modeling and general NLP literature within a flexible paradigm that enables complex multi-task learning setups.

  • Pretrained UniLM: https://github.com/microsoft/unilm

  • MT-DNN: https://github.com/namisan/mt-dnn

  • A Chinese counterpart of DialoGPT by yangjianxin1: https://github.com/yangjianxin1/GPT2-chitchat. We are glad to see that the MMI strategy we used in DialoGPT has also improved the performance of this project as well!

Contact

Please contact [email protected] if you have any questions/suggestions. However, responses may be sporadic; please expect delays.

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.

Disclaimer

This repository aims to facilitate research in large-scale pretraining for conversational data. The toolkit contains only part of the modeling machinery needed to actually produce a model weight file usable in a running dialog. On its own, the model provides only information about the weights of various text spans; for researchers to actually use it, they will need to bring conversational data of their own and decode responses from the pretrained system. Microsoft is not responsible for any generation from third-party utilization of the pretrained system.

Citation

If you use this code in your research, you can cite our arxiv paper:

@inproceedings{zhang2019dialogpt,
    title={DialoGPT: Large-Scale Generative Pre-training for Conversational Response Generation},
    author={Yizhe Zhang and Siqi Sun and Michel Galley and Yen-Chun Chen and Chris Brockett and Xiang Gao and Jianfeng Gao and Jingjing Liu and Bill Dolan},
    year={2020},
    booktitle={ACL, system demonstration}
}

dialogpt's People

Contributors

cclauss, chenrocks, dreasysnail, golsun, intersun, kim3-sudo, mgalley, wang-tc

dialogpt's Issues

About prepro and MMI training

I have two questions about training reversed model.
The first one is about training data. I can't see an objective reason why prepro.py cuts off a big part of the training data. I just realized that almost all samples which have only one sentence in the source are cut off by the _make_feature function, more specifically by if all(w == 0 for w in ws[1:]): return None. I use the --reverse parameter when preparing data.

The second question is about validation data. If we train the forward model, it's obvious that we need something like src1<eos>src2 \t tgt, but how should it look when we train the backward model? My assumption was tgt \t src2 <eos> src1 due to inputs = list(reversed(inputs)), but the model's performance is very poor while training, and the quality on such a validation set stops increasing after a very small number of training steps.

Thanks in advance.

Extract human response from 6k multi-ref dataset

Hi,

I'm trying to reproduce the human response result in the paper and have encountered some problems.
I copied test.scored_refs.txt to the dstc/data folder and used the first column as the keys.
The eval result after running python extract_human.py and python batch_eval.py is

n_lines = 5994
NIST = [2.871, 3.246, 3.3125, 3.3229]
BLEU = [0.378, 0.1678, 0.0966, 0.0655]
METEOR = 0.10657856237003654
entropy = [6.61382916462754, 10.109370475853682, 11.032526832134234, 11.125019724262556]
diversity = [0.12143963906484984, 0.5817823864609064]
avg_len = 14.64330997664331

which is different from the paper, even the avg_len is wrong.
I'm wondering which step is wrong and how to reproduce the result.

Thanks!

No file named train.tsv.gz

Hi,
Thank you for the implementation. While running the demo.py file, I encountered an error saying "No such file or directory". Can you help with the same?

TIA.

Code stuck after logging number of parameters

I ran python demo.py after replacing dummy_data.tsv with my own small corpus. After reaching the training part, the code just froze at this point.

11/06/2019 16:23:01 - INFO - __main__ -   Generating training CMD!
11/06/2019 16:23:01 - INFO - __main__ -   If there is any problem, please copy (modify) and run command below
11/06/2019 16:23:01 - INFO - __main__ -   #########################################################################
python LSP_train.py --model_name_or_path ./models/medium --init_checkpoint ./models/medium/pytorch_model.bin --train_input_file ./data/train.128len.db --eval_input_file ./data/ss_val_data.tsv --output_dir ./models/output_model --seed 42
--max_seq_length 128 --train_batch_size 512 --gradient_accumulation_steps 8 --eval_batch_size 64 --learning_rate 1e-5 --num_optim_steps 47500 --valid_step 475 --warmup_steps 9500 --normalize_data true --fp16 false --lr_schedule noam --loss_scale 0.0 --no_token_id true --pbar true
11/06/2019 16:23:01 - INFO - __main__ -   #########################################################################
11/06/2019 16:23:02 - INFO - __main__ -   train batch size = 512, new train batch size (after gradient accumulation) = 64
11/06/2019 16:23:05 - INFO - __main__ -   CUDA available? True
11/06/2019 16:23:05 - INFO - __main__ -   Input Argument Information
11/06/2019 16:23:05 - INFO - __main__ -   model_name_or_path            ./models/medium
11/06/2019 16:23:05 - INFO - __main__ -   seed                          42
11/06/2019 16:23:05 - INFO - __main__ -   max_seq_length                128
11/06/2019 16:23:05 - INFO - __main__ -   skip_eval                     False
11/06/2019 16:23:05 - INFO - __main__ -   init_checkpoint               ./models/medium/pytorch_model.bin
11/06/2019 16:23:05 - INFO - __main__ -   train_input_file              ./data/train.128len.db
11/06/2019 16:23:05 - INFO - __main__ -   eval_input_file               ./data/ss_val_data.tsv
11/06/2019 16:23:05 - INFO - __main__ -   continue_from                 0
11/06/2019 16:23:05 - INFO - __main__ -   train_batch_size              64
11/06/2019 16:23:05 - INFO - __main__ -   gradient_accumulation_steps   8
11/06/2019 16:23:05 - INFO - __main__ -   eval_batch_size               64
11/06/2019 16:23:05 - INFO - __main__ -   learning_rate                 1e-05
11/06/2019 16:23:05 - INFO - __main__ -   num_optim_steps               47500
11/06/2019 16:23:05 - INFO - __main__ -   valid_step                    475
11/06/2019 16:23:05 - INFO - __main__ -   warmup_proportion             0.1
11/06/2019 16:23:05 - INFO - __main__ -   warmup_steps                  9500
11/06/2019 16:23:05 - INFO - __main__ -   normalize_data                True
11/06/2019 16:23:05 - INFO - __main__ -   fp16                          False
11/06/2019 16:23:05 - INFO - __main__ -   lr_schedule                   noam
11/06/2019 16:23:05 - INFO - __main__ -   loss_scale                    0.0
11/06/2019 16:23:05 - INFO - __main__ -   no_token_id                   True
11/06/2019 16:23:05 - INFO - __main__ -   output_dir                    ./models/output_model
11/06/2019 16:23:05 - INFO - __main__ -   log_dir                       None
11/06/2019 16:23:05 - INFO - __main__ -   pbar                          True
11/06/2019 16:23:05 - INFO - __main__ -   local_rank                    -1
11/06/2019 16:23:05 - INFO - __main__ -   config                        None
11/06/2019 16:23:05 - INFO - __main__ -   device                        cuda
11/06/2019 16:23:05 - INFO - __main__ -   n_gpu                         1
11/06/2019 16:23:05 - INFO - pytorch_pretrained_bert.tokenization_gpt2 -   loading vocabulary file ./models/medium/vocab.json
11/06/2019 16:23:05 - INFO - pytorch_pretrained_bert.tokenization_gpt2 -   loading merges file ./models/medium/merges.txt
11/06/2019 16:23:09 - INFO - gpt2_training.train_utils -   loading finetuned model from ./models/medium/pytorch_model.bin
11/06/2019 16:23:10 - INFO - gpt2_training.train_utils -   loading transfomer only
11/06/2019 16:23:13 - INFO - __main__ -   Number of parameter = 354823168

Nothing is logged after this point. Am I missing something or doing something wrong here?

Bot has a nasty habit of talking in the third person

Hello all,
I would like to start by thanking the DialoGPT team for open-sourcing this - it has greatly aided my work. I have been playing with the decoder provided by polakowo, and I have been working on creating a long-range memory solution for pretrained generative transformers, so I have been getting my head around the system. It works well overall, but it often either repeats what I say until I clear the chat history, or it talks about itself in the third person. Here is an example of both:
Third person issue:

Chloe, 
So, you're a cat?

Leo Cornelius,
lol no

Chloe, 
Well, I guess so.

Leo Cornelius, 
are you a cat?

Chloe, 
She shrugs

Leo Cornelius, 
xD

Chloe, 
She shrugs

Leo Cornelius,
Do you like snow?

Chloe, 
she shrugs

Leo Cornelius,
do you ever do anything other than shrug?

Chloe,
she shrugs

repeating issue:

Chloe,
It is, my son.

Leo Cornelius, 
your son?!

Chloe, 
Your son is my son

Leo Cornelius,
i have a son?

Chloe, 
Your son has a son?

Leo Cornelius, 
maybe...

Chloe, 
you have a son?

Leo Cornelius,
no

Chloe,
no you must

Leo Cornelius,
well i dont

Chloe, 
no you must

Leo Cornelius,
Do you have a son?

Chloe,
no son

Leo Cornelius, 
do you want one?

Chloe,
no he wants my son

Leo Cornelius, 
who?

Chloe, 
who wants a son

(apologies for the voluminous data)
I do understand that a well-known issue with transformers is that they tend to imitate the person they are talking to, and it has been discussed in detail here (#45). I have tried these methods, but all they achieve is stopping the bot from repeating the same thing over and over again (the bot just modifies it slightly: "i am", "they are", "they want", etc.). I was wondering if you have experimented with nucleus filtering and have recommended top-k, top-p, and temperature values? I am currently using 50, 0.95, and 0.7 (respectively).

Regarding the third-person issue: do you know of this issue? Any suggestions on fixing it? Would it be possible to modify the decoder to apply a penalty for talking in the third person? I've seen that there is a "repetition penalty" in the Hugging Face transformer decoder and was wondering if something similar (but for person) could be used?

Again, apologies for the voluminous data and the large issue. I thought putting it all in one issue would be better than lots of smaller ones.
Thanks, Leo Cornelius

dbm.error: need 'c' or 'n' flag to open new db

Got an error below when trying to run demo.py:
Traceback (most recent call last):
File "LSP_train.py", line 176, in
args.max_seq_length)
File "/source/project/DialoGPT/data_loader.py", line 114, in init
self.db = shelve.open(f'{db_name}/db', 'r')
File "/anaconda3/envs/LSP/lib/python3.6/shelve.py", line 243, in open
return DbfilenameShelf(filename, flag, protocol, writeback)
File "/anaconda3/envs/LSP/lib/python3.6/shelve.py", line 227, in init
Shelf.init(self, dbm.open(filename, flag), protocol, writeback)
File "/anaconda3/envs/LSP/lib/python3.6/dbm/init.py", line 85, in open
raise error[0]("need 'c' or 'n' flag to open new db")
dbm.error: need 'c' or 'n' flag to open new db
Exception ignored in: <bound method BucketingDataLoader.del of <data_loader.BucketingDataLoader object at 0x7f49527775f8>>
Traceback (most recent call last):
File "/DialoGPT/data_loader.py", line 151, in del
self.db.close()
AttributeError: 'BucketingDataLoader' object has no attribute 'db'

Problem for downloading data of reddit

Hi, great thanks to your contribution!

I tried to use python demo.py --data full to download the Reddit data. Since I don't want to train the model right now, I didn't use Docker. I found that the link to the data is here:
https://convaisharables.blob.core.windows.net/lsp/keys-full.tar
It seems that I can't open it even with a proxy. Do you have any other link to the Reddit data?

Sorry to bother you. Thank you very much !

when are weights in features used?

I have seen that there is a "weight" as part of the input, 1.0 or 0.0 for a sentence. But in LSP_train.py this feature is not used. Where and how is it used?

Can not find a model after training - strange

Hi, cool repo.
I fine-tune small_ft.pkl with the following training parameters in Google Colab:

!python LSP_train.py --model_name_or_path /content/DialoGPT/models/small --init_checkpoint /content/apex/DialoGPT/small_ft.pkl --train_input_file /content/apex/DialoGPT/data/train_raw_friend.128len.db --eval_input_file ./data/dummy_data.tsv --output_dir /content/DialoGPT/models/output_model --seed 42 --max_seq_length 128 --train_batch_size 16 --gradient_accumulation_steps 8 --eval_batch_size 64 --learning_rate 1e-5 --num_optim_steps 1000 --valid_step 5000 --warmup_steps 4000 --normalize_data true --fp16 false --lr_schedule noam --loss_scale 0.0 --no_token_id true --pbar true

Training finished, but when I navigate to /content/DialoGPT/models/output_model there is no actual model! I see only two files in the generated folder GPT2.1e-05.2.1gpu.2020-04-04161237, eval_log and train_log, but no large pkl file.

What am I doing wrong?

(screenshot from Google Colaboratory omitted)

How can I fine-tune the model on some subreddit?

Thank you for this model, amazing results.

I would like to take the pretrained model and fine-tune it on r/advice, such that it generates responses more similar to conversations in that subreddit.

How should I go about it?

Fine-tune with own dataset - how to multi-turn ?

When I run
python demo.py
I get advice on how to train the model with my own data:

python LSP_train.py --model_name_or_path /home/joo/Docs/LocalRepository/DialoGPT/models/small --init_checkpoint /home/joo/Docs/LocalRepository/DialoGPT/models/small/pytorch_model.bin --train_input_file /home/joo/Docs/LocalRepository/DialoGPT/data/train.128len.db --eval_input_file ./data/dummy_data.tsv --output_dir /home/joo/Docs/LocalRepository/DialoGPT/models/output_model --seed 42 --max_seq_length 128 --train_batch_size 512 --gradient_accumulation_steps 8 --eval_batch_size 64 --learning_rate 1e-5 --num_optim_steps 10000 --valid_step 5000 --warmup_steps 4000 --normalize_data true --fp16 true --lr_schedule noam --loss_scale 0.0 --no_token_id true --pbar true

But in the above setting I cannot figure out where the training dataset is specified (DialoGPT/data/train.128len.db consists of 5 files with no actual dataset) and what the right format is for a fine-tuning dataset.

According to the original Huggingface format, they use JSON with lists of PERSONALITY and the history of the conversation with candidate responses: https://github.com/huggingface/transfer-learning-conv-ai/blob/master/example_entry.py. (Additionally: how do I implement personality?)

Also, in issue #17 multi-turn dialog format is discussed, but where is Turn1 <|endoftext|> Turn2 <|endoftext|> ... TurnN

Maximum sequence length

On the README, it says that the default max_seq_length is 128. Was this the actual parameter used for training the models?

If so, does that mean that, without further fine-tuning or training a model from scratch with a longer max_seq_length, the trained models will not be able to generate proper responses for input sequences that are longer than 128, or will have the responses cut short at 128? I'm asking because it seems to be the case: if I provide a context that is too long, about 120 tokens, it only generates about 8 tokens and stops, making the output get cut short.

Understanding the train.tsv file

After running python demo.py --data small and looking at the resulting train.tsv file, I want to make sure I have the correct understanding of the format and what the float values indicate.

For example, the first two examples look like:

t3_17830,t1_c24,t1_c40	0.0 On the bright side , despite kidnapping and cruelly abandoning him , it doesn't sound like he was tortured ...	1.0 We didn't torture somebody ! USA
t3_17844,t1_c88,t1_c95	1.0 will comments dissapear if ranked low enough ? I can just see the pages with 5000 comments now ..	1.0 not yet , but we'll play around with it

From the paper, I see that there was some heavy preprocessing and filtering done, such as removing offensive and bland training instances. Are the sequences prepended with 0.0 the filtered instances that will not be used to update the weights during training? Based on my understanding of the code, the weight 0.0 ensures this by setting the language modeling labels to -1:

DialoGPT/prepro.py

Lines 108 to 110 in 18d91ce

if w == 0.0:
    lm_labels += [-1] * (len(s) + 1)
    weights += [0.0] * (len(s) + 1)

What I'm confused about is that I can't seem to find how the training process ignores the prepended identifiers on each line (e.g., t3_17830,t1_c24,t1_c40). How does this part of the training data get ignored?

multi-turn dialog format

Section 3.1 of the paper states that dialog turns of the same session are concatenated into a long text, ended by the end-of-text token.

Does this mean that there are no special tokens in between dialog turns?

How do I separate dialog turns?

MMI criterion

I am reading through the code base and paper and am trying to understand where in the code the MMI criterion is implemented and used. My guess is that during evaluation the source and target sentence are fed into the model in reverse order so that the model has to predict the source given the target. But I can't seem to see where this is implemented or where the MMI is used. Have I missed something obvious?

Thank you.

Training speed is not as stated in README

Hi! I ran the training script on 130 million training instances and I got the following training speed:

1 V100 GPU, FP16 O2, ~14k tokens/sec, ~100 hours
8 V100 GPUs, FP16 O2, ~70k tokens/sec, ~20 hours

However, in the README, the reported training speed is much faster:

(screenshot of the training-speed table from the README)

What am I missing? Please help!

LSP_train.py - Training stuck at 0/num_optimum_steps

Issue: the progress bar for training is stuck at 0/num_optim_steps.

When I clone the repository and run python demo.py, everything works fine until the training loop. Then the pbar is stuck at 0/num_optim_steps, meaning that the training does not take place. By default python demo.py uses --data = dummy_data.

Steps taken towards resolution

  • Manually ran the command generated such as python LSP_train.py \ --model_name_or_path /home/msoni/DialoGPT/models/small \ --init_checkpoint /home/msoni/DialoGPT/models/small/pytorch_model.bin \ --train_input_file /home/msoni/DialoGPT/data/train.128len.db \ --eval_input_file ./data/dummy_data.tsv \ --output_dir /home/msoni//DialoGPT/models/output_model \ --seed 42 \ --max_seq_length 128 \ --train_batch_size 8 \ --gradient_accumulation_steps 8 \ --eval_batch_size 64 \ --learning_rate 1e-05 \ --num_optim_steps 1000 \ --valid_step 500 \ --warmup_steps 100 \ --normalize_data true \ --fp16 False \ --lr_schedule noam \ --loss_scale 0.0 \ --no_token_id true \ --pbar true \ --local_rank -1

  • Altered the parameters for num_optim_steps, valid_steps, and warmup_steps.

  • Turned from_scratch = True

  • After the while loop at line 273 in LSP_train.py, I printed a statement to check whether that line executes. The code skips all lines after line number 290. This is where the skip starts:

if args.no_token_id:
    token_ids = None
loss, ppl = model(input_ids, position_ids, token_ids, label_ids)
logger.info("loss", loss)
logger.info("ppl", ppl)
logger.info("if this command print than line number 294 has run")

Is there something obvious that I am missing?

What is the forward method returning

I don't really want to generate any text for my use case; I just want the embeddings. But when I used model.forward() with dialogpt-small, it gave me a tuple with one tuple and one 5-dimensional output. Can someone please explain, as the documentation is pretty absent?

Prop/Help: How to create an interactive environment for a simple try/test

After running python demo.py, the file models/output_model/GPT2.1e-05.64.0gpu.2020-07-19022953/GP2-pretrain-step-10000.pkl was created.

Question: How can I create a simple interactive program to use GP2-pretrain-step-10000.pkl?

Something like the example at:

https://huggingface.co/microsoft/DialoGPT-small

from transformers import AutoModelWithLMHead, AutoTokenizer
import torch

tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-small")
model = AutoModelWithLMHead.from_pretrained("microsoft/DialoGPT-small")
for step in range(5):
    new_user_input_ids = tokenizer.encode(input(">> User:") + tokenizer.eos_token, return_tensors='pt')
    bot_input_ids = torch.cat([chat_history_ids, new_user_input_ids], dim=-1) if step > 0 else new_user_input_ids
    chat_history_ids = model.generate(bot_input_ids, max_length=1000, pad_token_id=tokenizer.eos_token_id)
    print("DialoGPT: {}".format(tokenizer.decode(chat_history_ids[:, bot_input_ids.shape[-1]:][0], skip_special_tokens=True)))

Decoding script invitation

Is there a way to send in requests for the decoding script?

I understand the nature of the challenges surrounding Reddit toxicity; we just want to try it out privately and test different prompts.

Validation data

Hey guys! Great work! I really appreciate it!

After reading the code, I noticed that the training data is from 12/2015 to 11/2017, while the test data is from 03/2018 to 01/2019.

Does that mean that you guys used 12/2017 to 02/2018 as the validation data?

If not, what did you use for the validation data?

Incorrect DSTC medium model link: pkl file is the small model

Hi there, thanks for sharing your amazing work on GitHub. Just wanted to point out that the link (https://convaisharables.blob.core.windows.net/lsp/DSTC/medium_ft.pkl) shared in the README for the DSTC medium model is actually the small GPT-2 version.

Its size is 351.3MB rather than the 863MB expected for medium GPT-2. When loading the pkl file, there are only 12 transformer blocks with a hidden state size of 768, rather than medium's 1024.

It would be great if you could share the corrected DSTC medium model link! Thanks!

Multiturn mode training data

Hey guys! Awesome work.

Can you please clarify: is there a reason to train the model with data that contains not only N-turn samples if I want to use the model in N-turn mode?
Does extra data with more turns (and also samples with fewer turns) help the model catch the context better, or is there no point in including such data in the training set?

'BucketingDataLoader' object has no attribute 'db'

I am running on Ubuntu 18.04 with cuda 10. I have followed Setup & Installation (TL;DR) - Train model with Conda Environment.

python3.6 demo.py
Found existing ./models folder, skip creating a new one!
11/20/2019 19:07:20 - INFO - main - Downloading models...
11/20/2019 19:07:20 - INFO - demo_utils - ./models/small/config.json exists, return!
11/20/2019 19:07:20 - INFO - demo_utils - ./models/small/vocab.json exists, return!
11/20/2019 19:07:20 - INFO - demo_utils - ./models/small/merges.txt exists, return!
11/20/2019 19:07:20 - INFO - demo_utils - ./models/small/pytorch_model.bin exists, return!
11/20/2019 19:07:20 - INFO - demo_utils - ./models/small/small_ft.pkl exists, return!
11/20/2019 19:07:20 - INFO - main - Done!

11/20/2019 19:07:20 - INFO - main - Downloading and Extracting Data...
11/20/2019 19:07:20 - INFO - main - Preparing Data...
prepro.py --corpus ./data/train.tsv --max_seq_len 128
11/20/2019 19:07:22 - INFO - main - Done!

11/20/2019 19:07:22 - INFO - main - Generating training CMD!
11/20/2019 19:07:22 - INFO - main - If there is any problem, please copy (modify) and run command below
11/20/2019 19:07:22 - INFO - main - #########################################################################
python LSP_train.py --model_name_or_path ./models/small --init_checkpoint ./models/small/pytorch_model.bin --train_input_file ./data/train.128len.db --eval_input_file ./data/dummy_data.tsv --output_dir ./models/output_model --seed 42 --max_seq_length 128 --train_batch_size 512 --gradient_accumulation_steps 8 --eval_batch_size 64 --learning_rate 1e-5 --num_optim_steps 10000 --valid_step 5000 --warmup_steps 4000 --normalize_data true --fp16 true --lr_schedule noam --loss_scale 0.0 --no_token_id true --pbar true
11/20/2019 19:07:22 - INFO - main - #########################################################################
11/20/2019 19:07:23 - INFO - main - train batch size = 512, new train batch size (after gradient accumulation) = 64
11/20/2019 19:07:23 - INFO - main - CUDA available? True
11/20/2019 19:07:23 - INFO - main - Input Argument Information
11/20/2019 19:07:23 - INFO - main - model_name_or_path ./models/small
11/20/2019 19:07:23 - INFO - main - seed 42
11/20/2019 19:07:23 - INFO - main - max_seq_length 128
11/20/2019 19:07:23 - INFO - main - skip_eval False
11/20/2019 19:07:23 - INFO - main - init_checkpoint ./models/small/pytorch_model.bin
11/20/2019 19:07:23 - INFO - main - train_input_file ./data/train.128len.db
11/20/2019 19:07:23 - INFO - main - eval_input_file ./data/dummy_data.tsv
11/20/2019 19:07:23 - INFO - main - continue_from 0
11/20/2019 19:07:23 - INFO - main - train_batch_size 64
11/20/2019 19:07:23 - INFO - main - gradient_accumulation_steps 8
11/20/2019 19:07:23 - INFO - main - eval_batch_size 64
11/20/2019 19:07:23 - INFO - main - learning_rate 1e-05
11/20/2019 19:07:23 - INFO - main - num_optim_steps 10000
11/20/2019 19:07:23 - INFO - main - valid_step 5000
11/20/2019 19:07:23 - INFO - main - warmup_proportion 0.1
11/20/2019 19:07:23 - INFO - main - warmup_steps 4000
11/20/2019 19:07:23 - INFO - main - normalize_data True
11/20/2019 19:07:23 - INFO - main - fp16 True
11/20/2019 19:07:23 - INFO - main - lr_schedule noam
11/20/2019 19:07:23 - INFO - main - loss_scale 0.0
11/20/2019 19:07:23 - INFO - main - no_token_id True
11/20/2019 19:07:23 - INFO - main - output_dir ./models/output_model
11/20/2019 19:07:23 - INFO - main - log_dir None
11/20/2019 19:07:23 - INFO - main - pbar True
11/20/2019 19:07:23 - INFO - main - local_rank -1
11/20/2019 19:07:23 - INFO - main - config None
11/20/2019 19:07:23 - INFO - main - device cuda
11/20/2019 19:07:23 - INFO - main - n_gpu 8
11/20/2019 19:07:23 - INFO - pytorch_pretrained_bert.tokenization_gpt2 - loading vocabulary file ./models/small/vocab.json
11/20/2019 19:07:23 - INFO - pytorch_pretrained_bert.tokenization_gpt2 - loading merges file ./models/small/merges.txt
Traceback (most recent call last):
File "LSP_train.py", line 176, in
args.max_seq_length)
File "/mnt/sdb/Tools/DialoGPT/data_loader.py", line 114, in init
self.db = shelve.open(f'{db_name}/db', 'r')
File "/mnt/sdb/miniconda3/envs/LSP/lib/python3.6/shelve.py", line 243, in open
return DbfilenameShelf(filename, flag, protocol, writeback)
File "/mnt/sdb/miniconda3/envs/LSP/lib/python3.6/shelve.py", line 227, in init
Shelf.init(self, dbm.open(filename, flag), protocol, writeback)
File "/mnt/sdb/miniconda3/envs/LSP/lib/python3.6/dbm/init.py", line 91, in open
"available".format(result))
dbm.error: db type is dbm.gnu, but the module is not available
Exception ignored in: <bound method BucketingDataLoader.del of <data_loader.BucketingDataLoader object at 0x7f082fdc4cc0>>
Traceback (most recent call last):
File "/mnt/sdb/Tools/DialoGPT/data_loader.py", line 151, in del
self.db.close()
AttributeError: 'BucketingDataLoader' object has no attribute 'db'
11/20/2019 19:07:23 - INFO - main - Done!

LSP_train.py failing - output_dir NoneType

line 154 fails

output_dir = join(args.output_dir,
                  'GPT2.{}.{}.{}gpu.{}'.format(args.learning_rate,
                                               args.train_batch_size, n_gpu, timestamp))

with:
Traceback (most recent call last):
File "./LSP_train.py", line 154, in
args.train_batch_size, n_gpu, timestamp))
File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/posixpath.py", line 80, in join
a = os.fspath(a)
TypeError: expected str, bytes or os.PathLike object, not NoneType

Python 3.6.7
Ubuntu 16.04

Repo not properly designed

Thank you for your contribution. I noticed a few things about your repo:

  • Variable names are vague and uninformative, which makes it hard to understand the code and adapt it.
  • There are some global variables in LSP_train.py which confuse me, first because they're not informative and don't tell me their purpose, and second because they seem to interfere with parse-args variables that might have the same purpose. Please consider using only parse-args, and explain the purpose of each one well.
  • This repository doesn't seem to contain an interact script to chat with the models.
  • You don't give information about the fine-tuned models that you're providing, e.g. what max_length they were trained on, how many epochs on the training data, etc.

Distributed training hangs indefinitely at FP16_Optimizer#step

When I run distributed training with more than one GPU, training gets stuck at the very beginning and hangs indefinitely. It is stuck in FP16_Optimizer#step (specifically at this line, where data is implicitly moved from the GPU to the CPU).

The command line hangs at this point indefinitely and makes no progress no matter how long I wait:

training: 0%| | 0/1000000 [00:00<?, ?it/s]

I see this issue regardless of which CUDA version I use (I've tried it with 10.0 and 10.1), and regardless of whether I install apex myself or use the docker image icaruszyz/large-scale-training:dialogpt.

I do not experience this issue when I run demo.py rather than using python -m torch.distributed.launch to run training (i.e. I see this issue only when I try to train on multiple GPUs, not on a single GPU). I have not tried training with full 32-bit precision because I want to limit the number of GPUs I have to use.

The fact that this issue only occurs when training with multiple GPUs, and that it occurs on a line which transfers data from the GPU to the CPU, suggests to me that there may be a race condition related to collecting data from multiple GPUs.

Training configuration:

INFO - __main__ -   Input Argument Information
INFO - __main__ -   model_name_or_path            ./configs/762M
INFO - __main__ -   seed                          42
INFO - __main__ -   max_seq_length                128
INFO - __main__ -   skip_eval                     False
INFO - __main__ -   init_checkpoint               ./models/large/large_fs.pkl
INFO - __main__ -   train_input_file              ./data/train.128len.db
INFO - __main__ -   eval_input_file               ./data/dummy_data.tsv
INFO - __main__ -   continue_from                 0
INFO - __main__ -   train_batch_size              8
INFO - __main__ -   gradient_accumulation_steps   2
INFO - __main__ -   eval_batch_size               16
INFO - __main__ -   learning_rate                 0.0001
INFO - __main__ -   num_optim_steps               1000000
INFO - __main__ -   valid_step                    10000
INFO - __main__ -   warmup_proportion             0.1
INFO - __main__ -   warmup_steps                  16000
INFO - __main__ -   normalize_data                True
INFO - __main__ -   fp16                          True
INFO - __main__ -   lr_schedule                   noam
INFO - __main__ -   loss_scale                    0
INFO - __main__ -   no_token_id                   True
INFO - __main__ -   output_dir                    models/output_model
INFO - __main__ -   log_dir                       None
INFO - __main__ -   pbar                          True
INFO - __main__ -   local_rank                    0
INFO - __main__ -   config                        None
INFO - __main__ -   device                        cuda:0
INFO - __main__ -   n_gpu                         1

This bug is preventing me from fine-tuning the large model, which requires multiple GPUs.

Has anyone else experienced this or found a workaround?

How to correctly use Multiple GPUs?

I was retraining the model on my own dataset with a single GPU.

Training Command

export CUDA_VISIBLE_DEVICES=3
python LSP_train.py --model_name_or_path ./models/small --init_checkpoint None --init_weights true --train_input_file ./data/train_opensub_qa_dialogpt.128len.db --eval_input_file ./data/val_opensub_qa.tsv --output_dir ./models/opensub_qa/ --seed 42 --max_seq_length 128 --train_batch_size 512 --gradient_accumulation_steps 32 --eval_batch_size 16 --learning_rate 5e-5 --num_optim_steps 275000 --valid_step 6875 --warmup_steps 137500 --normalize_data true --fp16 false --lr_schedule noam --loss_scale 0.0 --no_token_id true --pbar true

11/10/2019 20:36:34 - INFO - __main__ -   device                        cuda
11/10/2019 20:36:34 - INFO - __main__ -   n_gpu                         1
11/10/2019 20:36:34 - INFO - pytorch_pretrained_bert.tokenization_gpt2 -   loading vocabulary file ./models/small/vocab.json
11/10/2019 20:36:34 - INFO - pytorch_pretrained_bert.tokenization_gpt2 -   loading merges file ./models/small/merges.txt
11/10/2019 20:36:38 - INFO - gpt2_training.train_utils -   no checkpoint provided for GPT2LMHeadModel!
11/10/2019 20:36:41 - INFO - __main__ -   Reinitializing model weights for fresh training!
11/10/2019 20:36:41 - INFO - __main__ -   Number of parameter = 124439808
training:   0%|                                                                  | 178/275000 [04:30<110:38:38,  1.45s/it, tok/s: 5.0k ppl: 52973.36 epoch: 0]

In the single-GPU setting I kept the number of optim steps at 275000. The progress bar estimated the training time to be 110 hrs 38 mins, with 1.45 seconds per iteration.
To improve the speed I switched to a 4-GPU setup and reduced the optim steps to 68750, which is 275000/4. Now, surprisingly, with 4 GPUs both the total training time and the time per iteration increase.

Training Command:

export CUDA_VISIBLE_DEVICES=1,3,5,6
python LSP_train.py --model_name_or_path ./models/small --init_checkpoint None --train_input_file ./data/train_opensub_qa_dialogpt.128len.db --eval_input_file ./data/val_opensub_qa.tsv --output_dir ./models/opensub_qa/ --seed 42 --max_seq_length 128 --train_batch_size 512 --gradient_accumulation_steps 32 --eval_batch_size 16 --learning_rate 5e-5 --num_optim_steps 68750 --valid_step 1720 --warmup_steps 34375 --normalize_data true --fp16 false --lr_schedule noam --loss_scale 0.0 --no_token_id true --pbar true
11/10/2019 20:26:13 - INFO - __main__ -   device                        cuda
11/10/2019 20:26:13 - INFO - __main__ -   n_gpu                         4
11/10/2019 20:26:13 - INFO - pytorch_pretrained_bert.tokenization_gpt2 -   loading vocabulary file ./models/small/vocab.json
11/10/2019 20:26:13 - INFO - pytorch_pretrained_bert.tokenization_gpt2 -   loading merges file ./models/small/merges.txt
11/10/2019 20:26:17 - INFO - gpt2_training.train_utils -   no checkpoint provided for GPT2LMHeadModel!
11/10/2019 20:26:20 - INFO - root -   data parallel because more than one gpu
11/10/2019 20:26:20 - INFO - __main__ -   Number of parameter = 124439808
training:   0%|                                                                                                                                                                                                    | 0/68750 [00:00<?, ?it/s]/home/baheti.3/anaconda3/envs/LSP/lib/python3.6/site-packages/torch/nn/parallel/_functions.py:61: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.
  warnings.warn('Was asked to gather along dimension 0, but all '
training:   0%|                                                      | 35/68750 [07:33<240:22:07, 12.59s/it, tok/s: 0.0k ppl: 49254.43 epoch: 0]

With 4 GPUs the model is estimated to take 240 hrs 22 mins.

What am I doing wrong here? Why am I not getting the expected speedup?
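
For reference, here is the back-of-the-envelope arithmetic behind the reduced step counts used above, assuming the effective batch per optimizer step scales linearly with the number of GPUs (as it would in a DataParallel setup with a per-GPU batch size):

# Rough arithmetic behind the adjusted 4-GPU schedule above.
# Assumption: effective batch per optimizer step scales linearly with n_gpu.
single_gpu_steps = 275_000
single_gpu_warmup = 137_500
single_gpu_valid = 6_875
n_gpu = 4

multi_gpu_steps = single_gpu_steps // n_gpu    # 68750, as in the 4-GPU command
multi_gpu_warmup = single_gpu_warmup // n_gpu  # 34375
multi_gpu_valid = single_gpu_valid // n_gpu    # 1718 (rounded to 1720 above)

print(multi_gpu_steps, multi_gpu_warmup, multi_gpu_valid)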

Hyperparameters Release

Can you guys share the hyperparameters of different model sizes i.e. small, medium, and large?

DialoGPT/LSP_train.py

Lines 46 to 82 in 75a4197

parser = argparse.ArgumentParser()
parser.add_argument('--model_name_or_path', type=str,
                    help='pretrained model name or path to local checkpoint')
parser.add_argument("--seed", type=int, default=42)
parser.add_argument("--max_seq_length", type=int, default=128)
parser.add_argument("--skip_eval", action='store_true',
                    help='If true, skip evaluation.')
parser.add_argument("--init_checkpoint", type=str)
parser.add_argument("--train_input_file", type=str)
parser.add_argument("--eval_input_file", type=str)
parser.add_argument("--continue_from", type=int, default=0)
parser.add_argument("--train_batch_size", type=int, default=4,
                    help="batch size now means per GPU per step")
parser.add_argument("--gradient_accumulation_steps", type=int, default=2,
                    help="to increase effective batch size "
                         "and reduce synchronization")
parser.add_argument("--eval_batch_size", type=int, default=4)
parser.add_argument("--learning_rate", type=float, default=1e-5)
parser.add_argument("--num_optim_steps", type=int, default=1000000,
                    help="new API specifies num update steps")
parser.add_argument("--valid_step", type=int, default=10000,
                    help="how many optim steps between validations")
parser.add_argument("--warmup_proportion", type=float, default=0.1)
parser.add_argument("--warmup_steps", type=int, default=16000)
parser.add_argument("--normalize_data", type=boolean_string, default=True)
parser.add_argument("--fp16", type=boolean_string, default=True)
parser.add_argument("--lr_schedule", type=str,
                    choices=['noam', 'noamwd', 'BERT', 'None'], default='noam')
parser.add_argument("--loss_scale", type=float, default=0)
parser.add_argument("--no_token_id", type=boolean_string, default=True)
parser.add_argument("--output_dir", type=str)
parser.add_argument("--log_dir", type=str)
parser.add_argument('--pbar', type=boolean_string, default=True, help='turn on progress bar')

Large model (762M) keeps repeating the same utterance

I tried running the large model (in a Colab notebook) using the approach described in the model card from the Hugging Face library:

from transformers import AutoModelWithLMHead, AutoTokenizer
import torch


tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-large")
model = AutoModelWithLMHead.from_pretrained("microsoft/DialoGPT-large")

# Let's chat for 5 lines
for step in range(5):
    # encode the new user input, add the eos_token and return a tensor in Pytorch
    new_user_input_ids = tokenizer.encode(input(">> User:") + tokenizer.eos_token, return_tensors='pt')

    # append the new user input tokens to the chat history
    bot_input_ids = torch.cat([chat_history_ids, new_user_input_ids], dim=-1) if step > 0 else new_user_input_ids

    # generate a response while limiting the total chat history to 1000 tokens
    chat_history_ids = model.generate(bot_input_ids, max_length=1000, pad_token_id=tokenizer.eos_token_id)

    # pretty print last output tokens from bot
    print("DialoGPT: {}".format(tokenizer.decode(chat_history_ids[:, bot_input_ids.shape[-1]:][0], skip_special_tokens=True)))

Unfortunately, the output quickly starts repeating the same sentence over and over. In Examples 1 and 2 it repeats the same sentence from the beginning; in Example 3 the model starts off fine, but as the conversation progresses it begins repeating the same thing (or parts of it).

Is this intended behavior? (A possible decoding-side workaround is sketched after the examples.)

Example 1

>> User:Today, I bought some ice cream.
DialoGPT: I'm so sorry.
>> User:Why are you sorry?
DialoGPT: I'm sorry for your loss.
>> User:What loss? I didn't lose anything.
DialoGPT: I'm sorry for your loss.
>> User:I'm very confused by what you are saying.
DialoGPT: I'm sorry for your loss.
>> User:Ok, nevermind.
DialoGPT: I'm sorry for your loss.

Example 2

>> User:Can't wait to upvote this 17 different times later this week.
DialoGPT: I'm going to upvote it every time.
>> User:What are you going to upvote every time?
DialoGPT: I'm going to upvote it every time.
>> User:Uh sure, if that's what you want.
DialoGPT: I'm going to upvote it every time.
>> User:I'm very confused by what you are saying.
DialoGPT: I'm going to upvote it every time.
>> User:What's your name?
DialoGPT: I'm going to upvote it every time.

Example 3

>> User:I like to walk my dog in the park.
DialoGPT: I like to walk my dog in the park.
>> User:That's great. What's your dog breed?
DialoGPT: He's a mix of some sort.
>> User:That's funny, you don't even know what mix it is.
DialoGPT: I know what mix it is.
>> User:What is it, then?
DialoGPT: A mix of some sort.
>> User:What sort? "Sort" is not a type of dog.
DialoGPT: I know what mix it is.
>> User:That's not the issue. I'm talking about your dog's breed.
DialoGPT: I know what mix it is.
>> User:Then tell me!
DialoGPT: I know what mix it is.
>> User:What are you waiting for?
DialoGPT: I know what mix?
>> User:Are you serious?
DialoGPT: I know what?
>> User:What?
DialoGPT: I know what?
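
For what it's worth, the snippet above uses greedy decoding (the default for generate()), which commonly produces exactly this kind of repetition. Below is a hedged sketch of a sampling-based variant of the same call; do_sample, top_k, top_p and no_repeat_ngram_size are standard transformers generation options rather than anything taken from the model card, and AutoModelForCausalLM is used in place of the deprecated AutoModelWithLMHead.

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-large")
model = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-large")

# Single-turn example with sampling instead of greedy decoding.
new_user_input_ids = tokenizer.encode("Today, I bought some ice cream." + tokenizer.eos_token,
                                      return_tensors='pt')
chat_history_ids = model.generate(
    new_user_input_ids,
    max_length=1000,
    pad_token_id=tokenizer.eos_token_id,
    do_sample=True,          # sample instead of taking the argmax at each step
    top_k=50,                # consider only the 50 most likely next tokens
    top_p=0.95,              # nucleus sampling
    no_repeat_ngram_size=3,  # block verbatim 3-gram repetition
)
print("DialoGPT: {}".format(tokenizer.decode(chat_history_ids[:, new_user_input_ids.shape[-1]:][0],
                                             skip_special_tokens=True)))

Whether this fully fixes the multi-turn degeneration is a separate question, but it usually removes the verbatim loops shown above.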

Confused about the Model outputs?

Hey, I got the Hugging Face GPT-2 Large version of DialoGPT (pretrained, I assume), and I have tried asking it questions, but it doesn't seem to return anything interesting. I asked it "what is the meaning of life?" and it said "to be a good boy". I am confused about why it didn't pick up anything from Reddit. Am I missing something? Am I supposed to train it myself on the Reddit dataset to get outputs similar to what was described?

Other way to get the 27G dialogue Reddit data?

Great job! Thanks for your contributions to dialogue generation! Is there any way I can get the 27G of dialogue Reddit data (147,116,725 dialogue instances) without running demo.py?

Tokens in multi-turn setting

Hi,
thanks for making the work available and for the explanations.

From the paper I understand that a training instance is a dialogue session, made up of several dialogue turns concatenated and ended by the end-of-text token.

Based on this and on what dreasysnail says in Issue #17:

There ARE special tokens (<|endoftext|>, id=50256) between dialogue turns in multi-turn setup. Your input format should be like this:

Turn1 <|endoftext|> Turn2 <|endoftext|> ... TurnN

my question is:

are the tokens between different dialogue turns the same as the tokens separating whole dialogue sessions?

Thank you
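
For illustration only, here is a minimal sketch of assembling an input in the quoted format, assuming the Hugging Face GPT-2 tokenizer (whose eos_token is <|endoftext|>); the turns themselves are placeholders.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-small")

# Placeholder dialogue turns for illustration.
turns = [
    "Does money buy happiness?",
    "Depends how much money you spend on it.",
    "What is the best way to buy happiness?",
]

# Join the turns with the EOS token, matching the quoted format:
#   Turn1 <|endoftext|> Turn2 <|endoftext|> ... TurnN
dialogue = tokenizer.eos_token.join(turns)
input_ids = tokenizer.encode(dialogue, return_tensors='pt')
print(tokenizer.decode(input_ids[0]))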

Scoring Dialog using Hugging Face DialoGPT

Hi,

I would like to be able to do two things that I think should be straightforward, but I am having trouble figuring them out with the Hugging Face model.

  1. I would like to provide a prompt and, in addition to the response produced by the model, also get back the word-level scores/model probabilities.

  2. I would like to enter a prompt and a response and use DialoGPT to get the word-level scores/model probabilities of that response (the 'force-decoded' score).

I have tried looking through the Hugging Face documentation, but it seems like different models have different APIs, and I can't quite figure out how to do either of these.

Thanks!
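
Not an official API, but here is a minimal sketch of the second item (force-decoded, token-level log-probabilities of a given response) under the standard Hugging Face interface; the prompt and response strings are placeholders, and the EOS token is used as the turn separator as described elsewhere in this thread.

import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-medium")
model = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-medium")
model.eval()

def response_logprobs(prompt, response):
    # Prompt and response are separated by the EOS token (DialoGPT's turn separator).
    prompt_ids = tokenizer.encode(prompt + tokenizer.eos_token, return_tensors='pt')
    response_ids = tokenizer.encode(response, return_tensors='pt')
    input_ids = torch.cat([prompt_ids, response_ids], dim=-1)

    with torch.no_grad():
        logits = model(input_ids).logits              # (1, seq_len, vocab_size)

    # The token at position i is predicted from the logits at position i - 1.
    log_probs = F.log_softmax(logits[:, :-1, :], dim=-1)
    targets = input_ids[:, 1:]
    token_log_probs = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)

    # Keep only the scores of the response tokens (the "force-decoded" part).
    n_prompt = prompt_ids.shape[-1]
    return token_log_probs[0, n_prompt - 1:]

scores = response_logprobs("Does money buy happiness?",
                           "Depends how much money you spend on it.")
print(scores)               # one log-probability per response token
print(scores.sum().item())  # log-probability of the whole response

For the first item, recent versions of generate() accept return_dict_in_generate=True together with output_scores=True, which expose the per-step scores of the tokens being generated.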

Error in large model's config

Hello. The large model trained from scratch has a wrong config, resulting in the errors below:
RuntimeError: Error(s) in loading state_dict for GPT2LMHeadModel: Missing key(s) in state_dict: "transformer.h.36.ln_1.weight", "transformer.h.36.ln_1.bias", ... , "transformer.h.47.mlp.c_proj.weight", "transformer.h.47.mlp.c_proj.bias".

size mismatch for transformer.wte.weight: copying a param with shape torch.Size([50257, 1280]) from checkpoint, the shape in current model is torch.Size([50257, 1600]).
size mismatch for transformer.wpe.weight: copying a param with shape torch.Size([1024, 1280]) from checkpoint, the shape in current model is torch.Size([1024, 1600]).
size mismatch for transformer.h.0.ln_1.weight: copying a param with shape torch.Size([1280]) from checkpoint, the shape in current model is torch.Size([1600]).
size mismatch for transformer.h.0.ln_1.bias: copying a param with shape torch.Size([1280]) from checkpoint, the shape in current model is torch.Size([1600]).

This happens when loading the config and then loading the pretrained .pkl file. If I change the config to 1280 embeddings it works, but I get strange interactions: after some chatting the model starts repeating the same answer over and over. Maybe the issue is in my decoding script (it's handwritten, not yours), but I keep getting mumbles such as

S a, u m? M.

or repeating answers such as:

My favorite color is red, red, red, red, red

or

i'm from r all.
i'm from r all.
i'm from r all.

etc
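
For what it's worth, the shapes in the error above (blocks h.0 through h.35, hidden size 1280) suggest a config along the following lines; this is only a sketch, and n_head=20 is an assumption based on the standard GPT-2 large layout rather than something stated in this repository.

from transformers import GPT2Config, GPT2LMHeadModel

# Config consistent with the shapes reported above:
# 36 transformer blocks and 1280-dimensional embeddings.
# n_head=20 is an assumption (standard GPT-2 large layout).
config = GPT2Config(n_embd=1280, n_layer=36, n_head=20)
model = GPT2LMHeadModel(config)
print(sum(p.numel() for p in model.parameters()))  # roughly 770M parameters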

Path error when downloading reddit data

Hi, thanks for the great work. When downloading the Reddit data with "python demo.py --data small", it shows:
...
zcat: can't stat: data/out/train/conv/2009-11.tsv.gz (data/out/train/conv/2009-11.tsv.gz.Z): No such file or directory
zcat: can't stat: data/out/train/conv/2009-12.tsv.gz (data/out/train/conv/2009-12.tsv.gz.Z): No such file or directory
zcat: can't stat: data/out/train.tsv.gz (data/out/train.tsv.gz.Z): No such file or directory
error occurred, b"gzip: can't stat: ./train.tsv.gz (./train.tsv.gz.gz): No such file or directory\n"

Can you help fix this problem? It seems related to line 67 of demo.py. Maybe we should delete the "cd .."?

Link to pretrained files doesn't work

The link in the README ("available on azure blobstorage here.") which points to "https://convaisharables.blob.core.windows.net/lsp" produces this:

This XML file does not appear to have any style information associated with it. The document tree is shown below. <Error> <Code>ResourceNotFound</Code> <Message> The specified resource does not exist. RequestId:0fa5f607-101e-00d5-51f1-9425d1000000 Time:2019-11-06T22:25:56.0233677Z </Message> </Error>
