nouhadziri / THRED

The implementation of the paper "Augmenting Neural Response Generation with Context-Aware Topical Attention"

Home Page: https://arxiv.org/abs/1811.01063

License: MIT License

Topics: dialogue-generation, sequence-to-sequence, tensorflow, deep-learning

Introduction


This repository hosts the implementation of the paper "Augmenting Neural Response Generation with Context-Aware Topical Attention".

Topical Hierarchical Recurrent Encoder Decoder (THRED)

THRED is a multi-turn response generation system intended to produce contextual and topic-aware responses. The codebase builds on the Tensorflow NMT repository.

TL;DR Steps to create a dialogue agent using this framework:

  1. Download the Reddit Conversation Corpus from here (2.5GB download, 7.8GB after uncompressing; it contains conversation triples extracted from Reddit). Please report errors or inappropriate content in the data here.
  2. Install the dependencies using conda env create -f thred_env.yml (to use pip, see Dependencies).
  3. Train the model using the following command (pretrained models will be published soon). Note that MODEL_DIR is the directory the model will be saved into. We recommend training on at least 2 GPUs; otherwise, you can reduce the data size (by omitting conversations from the training file) and the model size (by modifying the config file).
python -m thred --mode train --config conf/thred_medium.yml --model_dir <MODEL_DIR> \
--train_data <TRAIN_DATA> --dev_data <DEV_DATA> --test_data <TEST_DATA>
  4. Chat with the trained model using:
python -m thred --mode interactive --model_dir <MODEL_DIR>

Dependencies

  • Python >= 3.5 (Recommended: 3.6)
  • Tensorflow == 1.12.0
  • Tensorflow-Hub
  • SpaCy >= 2.1.0
  • pymagnitude
  • tqdm
  • redis ¹
  • mistune ¹
  • emot ¹
  • Gensim ¹
  • prompt-toolkit ²

¹ Packages required only for parsing and cleaning the Reddit data. ² Used only for testing dialogue models in command-line interactive mode.

To install the dependencies with pip, run pip install -r requirements. For Anaconda, run conda env create -f thred_env.yml (recommended). Once the dependencies are installed, run pip install -e . to install the thred package.

Data

Our Reddit dataset, which we call the Reddit Conversation Corpus (RCC), is collected from 95 selected subreddits (listed here). We processed Reddit over a 20-month period from November 2016 until August 2018 (excluding June 2017 and July 2017; we used these two months, along with the October 2016 data, to train an LDA model). Please see here for details on how the Reddit dataset is built, including pre-processing and cleaning of the raw Reddit files. The following table summarizes the RCC data:

Corpus           | #train | #dev | #test | Download         | Download with topic words
3 turns per line | 9.2M   | 508K | 406K  | download (773MB) | download (2.5GB)
4 turns per line | 4M     | 223K | 178K  | download (442MB) | download (1.2GB)
5 turns per line | 1.8M   | 100K | 80K   | download (242MB) | download (594MB)

In the data files, each line corresponds to a single conversation in which the utterances are TAB-separated. The topic words appear after the last utterance, also separated by a TAB.
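
To make the format concrete, here is a minimal parsing sketch (not part of the codebase); it assumes the first N TAB-separated fields are the utterances and any remaining fields hold the topic words:

# Hypothetical helper, not from the THRED codebase: split one RCC line into
# utterances and (optional) topic words, assuming the TAB-separated layout above.
def parse_rcc_line(line, num_turns=3):
    fields = line.rstrip("\n").split("\t")
    utterances = fields[:num_turns]             # the conversation turns
    topic_words = " ".join(fields[num_turns:])  # empty string if the file has no topic words
    return utterances, topic_words

# Example usage with the 3-turns file (the file name is illustrative):
with open("train.3turns.txt", encoding="utf-8") as f:
    for line in f:
        utterances, topics = parse_rcc_line(line, num_turns=3)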

Note that the 3-turns/4-turns/5-turns files contain similar content, albeit with different numbers of utterances per line. They are all extracted from the same source. If you find any error or any inappropriate utterance in the data, please report your concerns here.

Embeddings

In the model config files (i.e., the YAML files in conf), the embedding type can be any of the following: glove840B, fastText, word2vec, and hub_word2vec. For handling the pre-trained embedding vectors, we leverage Pymagnitude and Tensorflow-Hub. Note that you can also use random300 (300 refers to the dimension of the embedding vectors and can be replaced by any arbitrary value) to learn the vectors during training of the response generation models. The settings related to the embedding models are provided in word_embeddings.yml.
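
As a quick illustration of how Pymagnitude serves pre-trained vectors (a sketch only; the .magnitude file name below is a placeholder, not something shipped with this repository):

from pymagnitude import Magnitude

# Placeholder path: convert or download a .magnitude file for the embedding you configured.
vectors = Magnitude("glove.840B.300d.magnitude")
print(vectors.dim)               # embedding dimensionality, e.g. 300
print(vectors.query("dialogue")) # vector lookup; OOV words get deterministic random vectors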

Train

The training configuration should be defined in a YAML file similar to Tensorflow NMT. Sample configurations for THRED and other baselines are provided here.

The implemented models are Seq2Seq, HRED, Topic Aware-Seq2Seq, and THRED.

Note that while most of the parameters are common among the different models, some models may have additional parameters (e.g., topical models have topic_words_per_utterance and boost_topic_gen_prob parameters).

To train a model, run the following command:

python main.py --mode train --config <YAML_FILE> \
--train_data <TRAIN_DATA> --dev_data <DEV_DATA> --test_data <TEST_DATA> \
--model_dir <MODEL_DIR>

Vocabulary files and Tensorflow model files are stored in <MODEL_DIR>. Training can be resumed by executing:

python main.py --mode train --model_dir <MODEL_DIR>

Test

With the following command, the model can be tested against the test dataset.

python main.py --mode test --model_dir <MODEL_DIR> --test_data <TEST_DATA>

Test parameters can be overridden at test time: beam width (--beam_width), length penalty weight (--length_penalty_weight), and sampling temperature (--sampling_temperature).
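
For example, to decode with a wider beam and a length penalty (the values below are illustrative only):

python main.py --mode test --model_dir <MODEL_DIR> --test_data <TEST_DATA> \
--beam_width 10 --length_penalty_weight 1.0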

A simple command-line interface is implemented that allows you to converse with the learned model (as in test mode, the test parameters can be overridden here too):

python main.py --mode interactive --model_dir <MODEL_DIR>

In interactive mode, a pre-trained LDA model is required to feed the inferred topic words into the model. We trained an LDA model using Gensim on a Reddit corpus collected for this purpose; it can be downloaded from here. The downloaded file should be uncompressed and passed to the program via --lda_model_dir <DIR>.
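
If you would rather train an LDA model on your own corpus, a minimal Gensim sketch looks like the following (the documents and parameter values are illustrative only):

from gensim import corpora
from gensim.models import LdaModel

# Toy tokenized documents; replace with your own pre-tokenized corpus.
docs = [["my", "legs", "feel", "weird"], ["grow", "young", "entling", "grow"]]

dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]

# Use a much larger num_topics and more passes on a real corpus.
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2, passes=5)
lda.save("lda.model")             # hypothetical output path
print(lda.show_topic(0, topn=5))  # top words of topic 0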

Citation

Please cite the following paper if you use our work in your research:

@article{dziri2018augmenting,
  title={Augmenting Neural Response Generation with Context-Aware Topical Attention},
  author={Dziri, Nouha and Kamalloo, Ehsan and Mathewson, Kory W and Zaiane, Osmar R},
  journal={arXiv preprint arXiv:1811.01063},
  year={2018}
}

Contributors

ehsk, jacksonchen1998, nouhadziri


Issues

Train-time

We used the 5-turns-per-line dataset with 2 GPUs; however, the command line reports the number of GPUs as 0, and training is very slow. We estimate it will take one month to complete training. Is this training speed normal, and how can we accelerate it?

GPU-Util is low

The GPU-Util is very low (mostly 0) while CPU utilization stays above 1000%.
I installed tensorflow-gpu==1.8 and used CUDA_VISIBLE_DEVICES to assign GPUs for training.

Relative imports error

Trying to run the topic-aware model with "python main.py", but I got the error "ValueError: attempted relative import beyond top-level package". I have checked the __init__.py in each directory, but it still doesn't work. Can you offer some hints?

Missing the diversity evaluation?

Hello,

Thanks for your code! I reran the training but only get PPL scores. I have checked the code, and it looks like the diversity evaluations (distinct-1 & distinct-2) are missing?

Thanks a lot for your early reply!

running error

Hi, thanks for your great work and clear code, which inspire me a lot.

It seems I have run into a problem. When I execute the train command, an "out of bounds" error appears, but it is not caught by the "except tf.errors.OutOfRangeError" in hierarchical_base.py.

In detail,

Traceback (most recent call last):
File "/usr/local/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"main", mod_spec)
File "/usr/local/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/1THRED/thred/main.py", line 6, in
tf.app.run(main=thred_main)
File "/usr/local/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 125, in run
_sys.exit(main(argv))
File "/1THRED/thred/main.py", line 45, in main
model.train()
File "/1THRED/thred/models/hierarchical_base.py", line 124, in train
step_result = loaded_train_model.train(train_sess)
File "/1THRED/thred/models/thred/thred_model.py", line 511, in train
self.learning_rate])
File "/usr/local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 929, in run
run_metadata_ptr)
File "/usr/local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1152, in _run
feed_dict_tensor, options, run_metadata)
File "/usr/local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1328, in _do_run
run_metadata)
File "/usr/local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1348, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: slice index 2 of dimension 0 out of bounds.
[[{{node strided_slice_2}} = StridedSlice[Index=DT_INT32, T=DT_STRING, begin_mask=0, ellipsis_mask=0, end_mask=0, new_axis_mask=0, shrink_axis_mask=1](StringSplit:1, strided_slice_2/stack, strided_slice_2/stack_1, strided_slice_2/stack_2)]]
[[node IteratorGetNext (defined at /1THRED/thred/models/thred/thred_iterators.py:153) = IteratorGetNextoutput_shapes=[[?,?], [?,?], [?], [?], [?,?], [?], [?,?], [?,?], [?]], output_types=[DT_INT32, DT_INT32, DT_INT32, DT_INT32, DT_INT32, DT_INT32, DT_INT32, DT_INT32, DT_INT32], _device="/job:localhost/replica:0/task:0/device:CPU:0"]

Could you please help me with that, thanks a lot.

Confused about the topic attention in the original paper

Sorry to bother you with this. I have read your great paper, but I have some confusion about the topic attention.

In the paper, you said:

The topic words {t1,t2,...,tn} are then linearly combined to form a fixed-length vector k. The weight values are calculated as the following:
[screenshot of the attention-weight equation from the paper]

I can hardly figure it out. Is it the same as the usual query-key-value attention? In my view, the final context-level encoder hidden state serves as the query and the word embeddings of the topic words serve as the values. But how are the weights \beta calculated?

Look forward to your reply! Thanks.

string indices must be integers

Subreddit whitelist provided with size 95
Error occurred at 11: string indices must be integers
Traceback (most recent call last):
File "corpora/reddit/reddit_parser.py", line 473, in
parse(submissions_input, params, _convert_submission_to_post)
File "corpora/reddit/reddit_parser.py", line 305, in parse
raise e
File "corpora/reddit/reddit_parser.py", line 256, in parse
normalized_text = normalize_post_text(post.text, charmap)
File "corpora/reddit/reddit_parser.py", line 136, in normalize_post_text
normalized = strip_emojis_and_emoticons(normalized).strip()
File "/home/zengpr/open/PycharmProjects/2019-01/THRED-master/util/nlp.py", line 144, in strip_emojis_and_emoticons
return _strip_emoticons(_strip_emojis(text))
File "/home/zengpr/open/PycharmProjects/2019-01/THRED-master/util/nlp.py", line 166, in _strip_emoticons
emoticon = em['value']
TypeError: string indices must be integers

Missing message-level attention?

Hello,

Thanks for releasing this codebase. I was reading your paper about the THRED model (https://arxiv.org/pdf/1811.01063.pdf), and I noticed that in the generation process you compute two different attention mechanisms: a message-level attention to generate a representation of the utterances and a context-level attention to generate the context vector, as in the classical HRED model. It looks to me that in the actual implementation the message-level attention is missing: https://github.com/nouhadziri/THRED/blob/master/models/thred/thred_model.py#L212

Is there any reason for this? Did you notice better performance with just the context-level attention?

Thanks a lot for your answer!

Alessandro

about using other dataset

Hi, thank you for your work. I have a question about the input data.
If I want to use my own data, how should I build a dataset?
Does each line correspond to a single conversation, like this: q1 \t a1 \t q2 \t a2 \t topic words?
I'm not sure whether my understanding is correct.
thanks again~

Pre-processing steps

Hi,

Thank you for open-sourcing this work. I have a few questions:

  1. What are the pre-processing steps that are applied before releasing the datasets?
  2. Do we need to apply all the pre-processing scripts mentioned in the dataset?
  3. I still see profane words in the dataset; does that mean we have to apply the profane-word filtering ourselves?

Thanks.

Test issue: ValueError: Cannot feed value of shape (x,1) for Tensor 'Placeholder:0', which has shape '(?,)'

I am trying to use the vanilla version of the model for a dialogue generation task, i.e., given a statement by speaker #1, generate a statement by speaker #2.
I have successfully trained the model, but while testing I run into the following error.
I believe the format of the test file is only the source statements without the target statements (with no tabs).

Traceback (most recent call last):
  File "/home/ydz853/data/THRED/main.py", line 51, in <module>
    tf.app.run()
  File "/home/ydz853/.conda/envs/env_1/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "/home/ydz853/data/THRED/main.py", line 47, in main
    model.test()
  File "/home/ydz853/data/THRED/models/vanilla/vanilla_wrapper.py", line 545, in test
    infer_sess.run(infer_model.iterator.initializer, feed_dict=feed_dict)
  File "/home/ydz853/.conda/envs/env_1/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 889, in run
    run_metadata_ptr)
  File "/home/ydz853/.conda/envs/env_1/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1096, in _run
    % (np_val.shape, subfeed_t.name, str(subfeed_t.get_shape())))
ValueError: Cannot feed value of shape(100,1) for Tensor 'Placeholder:0' which has shape '(?,)'

topic model for short text

Hi, I have tried to build an LDA model with Gensim using my own data of short texts (like tweets); however, the topic words do not seem as fine-grained as yours, which is a common problem of LDA on short text. Do you have any suggestions? Thanks!

train data format

Hello, I have a simple question about the input training data format. I've downloaded the 5-turns training data; as you say, each line is TAB-separated, but after the 5th sentence there are no more topic words. Is the 5th sentence the topic words? I sampled the first line of the training data as follows:
i 've never been this high i 'm going up [ 8 } \t my armed feel weird \t how 's your legs doing ? \t they feel like roots for a tree \t grow young entling grow .$
Are the topic words " grow young entling grow ."?
Thanks

Permission error occurs

When I run hred.small, an error occurs:
PermissionError: 'C:\Users\best_dev_ppl'
Could you provide a solution?

No such file or directory: 'src/redis-server': 'src/redis-server'

When I execute the command:
export PYTHONPATH="."; python3.6 corpora/reddit/reddit_dialogue.py build --text_file output/RC_2018-10.txt --db_file output/RC_2018-10_db.csv --output output1

it reports:
No such file or directory: 'src/redis-server': 'src/redis-server'

GPU-Util is low when using multiple GPUs

Hello,

I want to train on multiple GPUs, and I have tried 8, 4, and 2 GPUs. But the GPU-Util of some GPUs is low, almost 0%. An epoch on 8 GPUs takes almost 20 minutes longer than on a single GPU.

Your code sets the default number of GPUs to 4, but when I try 4 cards, one card's GPU-Util is always 0%. With 2 cards there is no 0% GPU-Util, but the GPU-Util of one of the cards is still only 20%.
This is the GPU usage when training on 4 cards:
[GPU usage screenshot]

I am not very clear about sharding. Do I need to modify the code to train on multiple GPUs and accelerate training?

Looking forward to your reply!

fastText with magnitude causes issues

When loading fastText embeddings with Magnitude, the error shown in the attached screenshot occurs. It popped up when I was running the Topic Aware model with the default configuration provided in taware_small.yml.
[error screenshot]

python main.py --mode train

Excuse me, in the following command:
python main.py --mode train --config <YAML_FILE>
--train_data <TRAIN_DATA> --dev_data <DEV_DATA> --test_data <TEST_DATA>
--model_dir <MODEL_DIR>

Does YAML_FILE mean a .yml file such as thred_medium.yml?
But where are TRAIN_DATA, DEV_DATA, and TEST_DATA? Does it mean I should download all the .bz2 files you mentioned in THRED/corpora/reddit/README.rd ("Reddit monthly data needs to be downloaded from here for comments and here for submissions")?

Model Directory while Testing

While testing, the model requires a copy of the model_dir to be present at the same path it had at training time.
If the model_dir has been moved or renamed, an error is thrown.
The moved model directory passed via --model_dir <model_dir> at test time is only used to store the outputs.

Traceback (most recent call last):
  File "main.py", line 51, in <module>
    tf.app.run()
  File "/home/ydz853/.conda/envs/env_py35/lib/python3.5/site-packages/tensorflow/python/platform/app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "main.py", line 47, in main
    model.test()
  File "/home/ydz853/papyrus/THRED/models/vanilla/vanilla_wrapper.py", line 527, in test
    ckpt = tf.train.latest_checkpoint(self.config.get_infer_model_dir())
  File "/home/ydz853/.conda/envs/env_py35/lib/python3.5/site-packages/tensorflow/python/training/saver.py", line 1722, in latest_checkpoint
    if file_io.get_matching_files(v2_path) or file_io.get_matching_files(
  File "/home/ydz853/.conda/envs/env_py35/lib/python3.5/site-packages/tensorflow/python/lib/io/file_io.py", line 333, in get_matching_files
    for single_filename in filename
  File "/home/ydz853/.conda/envs/env_py35/lib/python3.5/site-packages/tensorflow/python/framework/errors_impl.py", line 473, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.NotFoundError: /home/ydz853/model_dir_vanilla_large/best_dev_ppl; No such file or directory

NaN error when clipping gradients

Hi,
The vanilla Seq2Seq and HRED models report a "NaN tensor error" at the first training step.

The error occurs at clipped_grads, grad_norm = tf.clip_by_global_norm(self.gradients, params.max_gradient_norm) in hred_model.py.

How can I solve this problem?

P.S.

  • use embedding : random300
  • tensorflow-gpu: 1.12.1
  • 3-turn dataset
  • THRED and TA-Seq2Seq work well

The traceback:

Traceback (most recent call last):
File "/usr/local/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"main", mod_spec)
File "/usr/local/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/data/HRED/thred/main.py", line 6, in
tf.app.run(main=thred_main)
File "/usr/local/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 125, in run
_sys.exit(main(argv))
File "/data/HRED/thred/main.py", line 45, in main
model.train()
File "/data/HRED/thred/models/hierarchical_base.py", line 132, in train
step_result = loaded_train_model.train(train_sess)
File "/data/HRED/thred/models/hred/hred_model.py", line 446, in train
self.learning_rate])
File "/usr/local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 929, in run
run_metadata_ptr)
File "/usr/local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1152, in _run
feed_dict_tensor, options, run_metadata)
File "/usr/local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1328, in _do_run
run_metadata)
File "/usr/local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1348, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Found Inf or NaN global norm. : Tensor had NaN values
[[node hred_graph/VerifyFinite/CheckNumerics (defined at /data/HRED/thred/models/hred/hred_model.py:131) = CheckNumericsT=DT_FLOAT, _class=["loc:@hred_graph/VerifyFinite/control_dependency"], message="Found Inf or NaN global norm.", _device="/job:localhost/replica:0/task:0/device:GPU:0"]]
[[{{node hred_graph/clip_by_global_norm/mul/_187}} = _Recvclient_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_3642_hred_graph/clip_by_global_norm/mul", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]]

Regarding using custom dataset

I want to try your model on my own customized dataset, but I'm overwhelmed by your codebase, so a few questions:

  1. Does the dataset need to include topic words for the training procedure, only for generation, or both?
  2. If I have a customized dataset, where do I start? From my understanding I need to:
    A. Create a dialogue file like yours, in your Reddit format.
    B. Use the LDA script to create topic words and attach them to the dataset.
    C. Somehow create a vocabulary file.
    D. Then train.
    Am I missing something?
    Can you please let me know which file/script I need to use for each stage?

A dimension problem with random300 embedding in models/base.py

When I use the random300 embedding for training, a dimension error is reported during the concat operation in models/base.py, and I find that the tensor "const_embedding_matrix" is empty.

The traceback:

File "/1THRED/thred/main.py", line 45, in main
model.train()
File "/1THRED/thred/models/hierarchical_base.py", line 35, in train
train_model = _helper.create_train_model(self.config, scope)
File "/1THRED/thred/models/thred/thred_helper.py", line 45, in create_train_model
scope=scope)
File "/1THRED/thred/models/thred/thred_model.py", line 43, in init
self.init_embeddings(params.vocab_file, params.vocab_pkl, scope=scope)
File "/1THRED/thred/models/base.py", line 40, in init_embeddings
self.embeddings = tf.concat([reserved_token_embeddings, trainable_embeddings, const_embedding_matrix], 0)
File "/usr/local/lib/python3.6/site-packages/tensorflow/python/ops/array_ops.py", line 1124, in concat
return gen_array_ops.concat_v2(values=values, axis=axis, name=name)
File "/usr/local/lib/python3.6/site-packages/tensorflow/python/ops/gen_array_ops.py", line 1033, in concat_v2
"ConcatV2", values=values, axis=axis, name=name)
File "/usr/local/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/usr/local/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 488, in new_func
return func(*args, **kwargs)
File "/usr/local/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3274, in create_op
op_def=op_def)
File "/usr/local/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1792, in init
control_input_ops)
File "/usr/local/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1631, in _create_c_op
raise ValueError(str(e))

ValueError: Shape must be rank 2 but is rank 1 for 'thred_graph/embeddings/concat' (op: 'ConcatV2') with input shapes: [4,300], [40174,300], [0], [].
