Speaker Turn Modeling for Dialogue Act Classification

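This repo implements the paper "Speaker Turn Modeling for Dialogue Act Classification" (He et al., Findings of EMNLP 2021); see Citation below.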

Installation

unzip data.zip
Install PyTorch and Hugging Face Transformers.

Usage

To train the model on one of the provided datasets, run the corresponding script:

python run_swda.py
python run_mrda.py
python run_dyda.py

Hyperparameters are set directly in each of the three scripts and should be largely self-explanatory.

Train the model on other datasets

1. Create a folder data/{dataset_name}.
2. Put your train/val/test data as train.csv, val.csv, and test.csv under this folder. In each of the three files, every row represents an utterance and must have the following columns:
  1. conv_id. The ID of the conversation.
  2. speaker_id. The ID of the speaker. It should be binary and indicate the speaker's turn in the conversation. In dyadic conversations the original speaker IDs are already binary. For multi-party conversations, where speaker IDs are not binary, refer to Section 3.3 of our paper for how to make the labels binary. If speaker IDs are not available, set all values to zero.
  3. text. The text of the utterance.
  4. act. The dialogue act label.
  5. topic. The topic label. If not available, set all values to zero.
3. Create a script named run_{dataset_name}.py. You can reuse most of the parameter settings in run_swda/mrda/dyda.py. If the conversations are very long (many utterances), consider slicing them into smaller chunks by setting chunk_size to a non-zero value.
  1. Set corpus to your {dataset_name}.
  2. Set nclass to the number of dialogue act classes in your dataset.
4. Run the script:

python run_{dataset_name}.py
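The data layout described above can be sketched with pandas. This is a minimal illustration, not code from the repo: the dataset name my_dataset, the example rows, and the integer act labels are hypothetical, and the helper to_turn_labels shows just one plausible way to reduce multi-party speakers to binary labels (flip the label whenever the speaker changes). Section 3.3 of the paper remains the reference for the authors' actual procedure.

```python
import os
import tempfile

import pandas as pd

REQUIRED_COLUMNS = ["conv_id", "speaker_id", "text", "act", "topic"]


def to_turn_labels(speakers):
    """Map arbitrary speaker names to 0/1, flipping on every speaker change.

    Assumption: this toggling scheme is only an illustration and may differ
    from the procedure in Section 3.3 of the paper.
    """
    labels = []
    current = 0
    previous = None
    for speaker in speakers:
        if previous is not None and speaker != previous:
            current = 1 - current
        labels.append(current)
        previous = speaker
    return labels


# Hypothetical three-party conversation with made-up integer act labels.
rows = [
    ("c1", "alice", "Shall we start?", 0),
    ("c1", "bob", "Yes.", 1),
    ("c1", "bob", "I have the agenda.", 2),
    ("c1", "carol", "Great.", 1),
]
raw = pd.DataFrame(rows, columns=["conv_id", "speaker", "text", "act"])

# Compute binary turn labels separately for each conversation.
raw["speaker_id"] = 0
for _, group in raw.groupby("conv_id", sort=False):
    raw.loc[group.index, "speaker_id"] = to_turn_labels(group["speaker"].tolist())

raw["topic"] = 0  # no topic labels available, so use all zeros
data = raw[REQUIRED_COLUMNS]

# A temporary directory keeps the sketch side-effect free; in the repo the
# target would be data/{dataset_name} relative to the repo root.
out_dir = os.path.join(tempfile.mkdtemp(), "data", "my_dataset")
os.makedirs(out_dir, exist_ok=True)
for split in ("train", "val", "test"):
    # The same rows are reused for every split purely for illustration.
    data.to_csv(os.path.join(out_dir, f"{split}.csv"), index=False)
```

For dyadic data whose speaker IDs are already 0/1, the helper is unnecessary and the original IDs can be written to speaker_id directly.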

To obtain the best performance, you may need to try different values of batch_size, chunk_size (32, 64, 128, 192, 256, 512, etc.), lr (1e-4, 5e-5, 1e-5, etc.), and nfinetune (1, 2).
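One way to organize such a search is to enumerate the combinations up front. The sketch below only builds the grid from the values listed above; batch_size is left out because no candidate values are given for it, and how each configuration is passed to a run script is not shown, since that depends on how the script defines its parameters.

```python
from itertools import product

# Candidate values taken from the list above.
search_space = {
    "chunk_size": [32, 64, 128, 192, 256, 512],
    "lr": [1e-4, 5e-5, 1e-5],
    "nfinetune": [1, 2],
}

# Build one dict per combination of values.
names = list(search_space)
configs = [dict(zip(names, values)) for values in product(*search_space.values())]

for config in configs[:3]:
    print(config)
print(f"{len(configs)} configurations in total")
```

A full grid grows quickly, so in practice it may be cheaper to tune one parameter at a time or to sample a subset of configs.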

Test the trained model on a new dataset

1. Choose the pretraining dataset pre_corpus from {SWDA, MRDA, DyDA}. Pick the one most similar to your own dataset.
2. Train the model on the pretraining dataset using run_{pre_corpus}.py (that is, run_swda.py, run_mrda.py, or run_dyda.py).
3. Prepare your own dataset as described in Steps 1 and 2 of "Train the model on other datasets". Encode the dialogue act labels of your own dataset using the mapping shown in the top comments of dataset.py. If you don't have training and validation data, save train.csv and val.csv as empty files containing only the required columns (for example, create two empty pandas DataFrames with those column names and save them).
4. Make a copy of run_{pre_corpus}.py and change the following parameters:
  1. Set corpus to your {dataset_name}.
  2. Set mode to inference.
5. Run the new script.
6. The model's predictions (a list of predicted labels) will be saved in preds_on_new.pkl.
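The empty train.csv and val.csv placeholders, and reading the predictions back, can be sketched as follows. The function names write_empty_split and load_predictions are hypothetical helpers, not part of the repo, and the sketch assumes preds_on_new.pkl is an ordinary pickle file, as its extension suggests.

```python
import os
import pickle
import tempfile

import pandas as pd

REQUIRED_COLUMNS = ["conv_id", "speaker_id", "text", "act", "topic"]


def write_empty_split(out_dir, split):
    """Write a header-only CSV (no rows) for the given split and return its path."""
    os.makedirs(out_dir, exist_ok=True)
    path = os.path.join(out_dir, f"{split}.csv")
    pd.DataFrame(columns=REQUIRED_COLUMNS).to_csv(path, index=False)
    return path


def load_predictions(path="preds_on_new.pkl"):
    """Load the saved predictions, assuming a standard pickle file."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


# A temporary directory stands in for data/{dataset_name} in this sketch.
out_dir = os.path.join(tempfile.mkdtemp(), "data", "my_dataset")
empty_paths = [write_empty_split(out_dir, split) for split in ("train", "val")]
print(empty_paths)
```

After running the inference script, load_predictions() should return the list of predicted labels, which can then be mapped back to dialogue act names using the same mapping from dataset.py. Only unpickle files you produced yourself, since pickle can execute arbitrary code on load.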

Citation

@inproceedings{he2021speaker,
  title={Speaker Turn Modeling for Dialogue Act Classification},
  author={He, Zihao and Tavabi, Leili and Lerman, Kristina and Soleymani, Mohammad},
  booktitle={Findings of the Association for Computational Linguistics: EMNLP 2021},
  pages={2150--2157},
  year={2021}
}
