
mtn's Introduction

Multimodal Transformer Networks for End-to-End Video-Grounded Dialogue Systems

License: MIT

This is the PyTorch implementation of the paper: Multimodal Transformer Networks for End-to-End Video-Grounded Dialogue Systems. Hung Le, Doyen Sahoo, Nancy F. Chen, Steven C.H. Hoi. ACL 2019.

This code has been written using PyTorch 1.0.1. If you use the source code in this repo in your work, please cite the following paper. The bibtex is:

@inproceedings{le-etal-2019-multimodal,
    title = "Multimodal Transformer Networks for End-to-End Video-Grounded Dialogue Systems",
    author = "Le, Hung  and
      Sahoo, Doyen  and
      Chen, Nancy  and
      Hoi, Steven",
    booktitle = "Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics",
    month = jul,
    year = "2019",
    address = "Florence, Italy",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/P19-1564",
    doi = "10.18653/v1/P19-1564",
    pages = "5612--5623",
    abstract = "Developing Video-Grounded Dialogue Systems (VGDS), where a dialogue is conducted based on visual and audio aspects of a given video, is significantly more challenging than traditional image or text-grounded dialogue systems because (1) feature space of videos span across multiple picture frames, making it difficult to obtain semantic information; and (2) a dialogue agent must perceive and process information from different modalities (audio, video, caption, etc.) to obtain a comprehensive understanding. Most existing work is based on RNNs and sequence-to-sequence architectures, which are not very effective for capturing complex long-term dependencies (like in videos). To overcome this, we propose Multimodal Transformer Networks (MTN) to encode videos and incorporate information from different modalities. We also propose query-aware attention through an auto-encoder to extract query-aware features from non-text modalities. We develop a training procedure to simulate token-level decoding to improve the quality of generated responses during inference. We get state of the art performance on Dialogue System Technology Challenge 7 (DSTC7). Our model also generalizes to another multimodal visual-grounded dialogue task, and obtains promising performance.",
}

Abstract

Developing Video-Grounded Dialogue Systems (VGDS), where a dialogue is conducted based on visual and audio aspects of a given video, is significantly more challenging than traditional image or text-grounded dialogue systems because (1) feature space of videos span across multiple picture frames, making it difficult to obtain semantic information; and (2) a dialogue agent must perceive and process information from different modalities (audio, video, caption, etc.) to obtain a comprehensive understanding. Most existing work is based on RNNs and sequence-to-sequence architectures, which are not very effective for capturing complex long-term dependencies (like in videos). To overcome this, we propose Multimodal Transformer Networks (MTN) to encode videos and incorporate information from different modalities. We also propose query-aware attention through an auto-encoder to extract query-aware features from non-text modalities. We develop a training procedure to simulate token-level decoding to improve the quality of generated responses during inference. We get state of the art performance on Dialogue System Technology Challenge 7 (DSTC7). Our model also generalizes to another multimodal visual-grounded dialogue task, and obtains promising performance.


A sample dialogue from the DSTC7 Video Scene-aware Dialogue training set with 4 example video scenes.

Model Architecture

Our MTN architecture includes 3 major components: (i) encoder layers, which encode text sequences and video features; (ii) decoder layers (D), which project the target sequence and attend over multiple inputs; and (iii) Query-Aware Auto-Encoder layers (QAE), which attend over non-text modalities from query features. For simplicity, Feed Forward, Residual Connection, and Layer Normalization layers are not shown. Two types of encoders are used: text-sequence encoders (left) and video encoders (right). Text-sequence encoders are used on text inputs, i.e. dialogue history, video caption, query, and output sequence. Video encoders are used on the visual and audio features of the input video.
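As a rough illustration of the query-aware attention idea, the encoded query can attend over projected video (visual or audio) features with standard scaled dot-product attention. The sketch below is only an illustrative PyTorch module with assumed names and shapes, not the repo's actual implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QueryAwareAttention(nn.Module):
    """Illustrative sketch (not the repo's exact module): the encoded query
    attends over projected video features, yielding query-aware video
    representations, followed by a residual connection and layer norm."""
    def __init__(self, d_model):
        super().__init__()
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, query_feats, video_feats):
        # query_feats: (batch, query_len, d_model)  -- encoded user query
        # video_feats: (batch, num_frames, d_model) -- projected I3D/VGGish features
        q = self.w_q(query_feats)
        k = self.w_k(video_feats)
        v = self.w_v(video_feats)
        scores = torch.matmul(q, k.transpose(-2, -1)) / (q.size(-1) ** 0.5)
        attended = torch.matmul(F.softmax(scores, dim=-1), v)
        # residual connection + layer normalization, as in standard Transformer blocks
        return self.norm(query_feats + attended)
```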

Dataset

Download the dataset of the DSTC7 Video Scene-aware Dialogue track, including the training, validation, and test dialogues and the features of the Charades videos extracted using the VGGish and I3D models.

All the data should be saved into the data folder in the repo root.

Scripts

We created run.sh to prepare the evaluation code, train models, generate responses, and evaluate the generated responses with automatic metrics. You can run:

❱❱❱ run.sh [execution_stage] [video_features] [video_feature_names] [num_epochs] [warmup_steps] [dropout_rate]

The parameters are:

| Parameter | Description | Values |
| --- | --- | --- |
| execution_stage | Stage of execution, e.g. preparing, training, generating, evaluating | <=1: prepare evaluation code by downloading the COCO caption evaluation tool; <=2: train the models; <=3: generate responses using beam search (default); <=4: evaluate the generated responses |
| video_features | Video features extracted from pretrained models | vggish: audio features extracted using VGGish; i3d_flow: visual features extracted from the I3D model; i3d_rgb: visual features extracted from the I3D model. Features can be combined, separated by a single space, e.g. "vggish i3d_flow" |
| video_feature_names | Names of video features for saving output | Any value corresponding to the video_features input, e.g. vggish+i3d_flow |
| num_epochs | Number of training epochs | e.g. 20 |
| warmup_steps | Number of warmup steps | e.g. 9660 |
| dropout_rate | Dropout rate during training | e.g. 0.2 |

During training, the model with the best validation performance is saved; the model is evaluated by loss per token. The model output, parameters, vocabulary, and training and validation logs are saved into the exps folder.
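For reference, per-token loss here means the summed cross-entropy over non-padding target tokens divided by the number of such tokens; a minimal sketch, with the pad index and tensor shapes assumed rather than taken from the repo:

```python
import torch
import torch.nn.functional as F

def loss_per_token(logits, targets, pad_id=0):
    """Sketch: cross-entropy summed over non-padding target tokens,
    normalized by the number of such tokens.
    logits: (batch, seq_len, vocab), targets: (batch, seq_len)."""
    ce = F.cross_entropy(logits.view(-1, logits.size(-1)),
                         targets.view(-1),
                         ignore_index=pad_id,
                         reduction='sum')
    num_tokens = (targets != pad_id).sum().clamp(min=1).float()
    return ce / num_tokens
```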

Other parameters, including data-related options, model parameters, training and generating settings, are defined in the run.sh file.

Sample Dialogues

Example test dialogue responses extracted from the ground truth and generated by MTN and the baseline. For simplicity, the dialogue history is not presented and only parts of the video caption C are shown. Our model provides answers that are more accurate than the baseline, capturing a single human action or a series of actions in the videos.

Visdial

MTN can also be adapted to run on the VisDial benchmark. Please switch to the 'visdial' branch of this repo for the code. The main changes are in the data loading in the data handler.

During training, the model is trained in a generative setting using the ground-truth answer. At test time, at each dialogue turn, the model selects the best answer candidate based on its log likelihood among the answer options (refer to Section 4.4 of the paper for more details).
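A minimal sketch of this candidate-ranking step, assuming a model that returns next-token logits for a dialogue context and a shifted candidate sequence (the function signature and names are hypothetical, not the repo's API):

```python
import torch
import torch.nn.functional as F

def rank_candidates(model, dialogue_batch, candidate_token_ids, pad_id=0):
    """Sketch: score each answer candidate by its token-level log likelihood
    under the generative model and return the index of the best one."""
    scores = []
    with torch.no_grad():
        for cand in candidate_token_ids:                 # cand: (1, cand_len)
            logits = model(dialogue_batch, cand[:, :-1])  # hypothetical call
            log_probs = F.log_softmax(logits, dim=-1)
            targets = cand[:, 1:]
            token_ll = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
            mask = (targets != pad_id).float()            # ignore padding tokens
            scores.append((token_ll * mask).sum().item())
    return int(torch.tensor(scores).argmax())
```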

Note that in the current code, we use a small version of the dialogue files in the data folder and dummy features for the images during data loading/batching. Please update the code with the paths to your downloaded data files and the available features.


mtn's Issues

How to generate the lbl*.json file under data/ folder

How can I generate the following files under the data/ folder? They are different from the official files under the dstc7avsd_eval folder (e.g., test_set4DSTC7-AVSD_multiref.json).

lbl_test_set4DSTC7-AVSD.json
lbl_undiscloseonly_test_set4DSTC7-AVSD.json

Unreproducible results with 6 references

Hi there,
Thanks for your great work. I found that you provided the scores with 6 references in issue #5. Even though I can reproduce the scores with a single reference, the scores with 6 references are still unreproducible.
I simply used the provided code and hyperparameters and dropped the result file into the evaluation tools. Then I got the following scores:

./run.sh 1 "vggish i3d_flow" "vggish+i3d_flow" 20 9660 0.2

1 reference:
Bleu_4: 0.129
METEOR: 0.162
ROUGE_L: 0.359
CIDEr: 1.243

6 references:
Bleu_4: 0.344
METEOR: 0.269
ROUGE_L: 0.555
CIDEr: 1.034

In particular, I noticed that you didn't manually set a seed for PyTorch.

MTN/train.py

Lines 108 to 109 in 7649113

random.seed(args.rand_seed)
np.random.seed(args.rand_seed)

I'm not sure if this is the major factor. Could you share your hyperparameters or evaluation method?
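For reference, a fully seeded setup would also seed PyTorch itself; a minimal sketch (seeding alone still does not guarantee bit-identical results on GPU):

```python
import random
import numpy as np
import torch

def set_seed(seed):
    # Seed Python, NumPy, and PyTorch (CPU and all GPUs).
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
```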

I can't get the result reported in your paper

I used the experiment settings from your paper and the stopwords "." and ",", but I can't get the results reported in your paper.
Here is my result:
Bleu_1: 0.351
Bleu_2: 0.236
Bleu_3: 0.167
Bleu_4: 0.122
METEOR: 0.162
ROUGE_L: 0.352
CIDEr: 1.194

Could you help me run the experiments and get the results reported in your paper? Thank you.

Require gradient false in eval mode

In your code, the eval part is not wrapped in torch.no_grad(), which means gradients are computed for the validation data. That leads to unnecessary computation, so it would be better to edit that part.
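A minimal sketch of the suggested fix, with hypothetical helper names rather than the repo's actual functions:

```python
import torch

def evaluate(model, valid_loader, compute_loss):
    """Sketch: run validation with gradient tracking disabled so no
    computation graph is built. compute_loss is a hypothetical helper."""
    model.eval()
    total, count = 0.0, 0
    with torch.no_grad():
        for batch in valid_loader:
            total += compute_loss(model, batch).item()
            count += 1
    return total / max(count, 1)
```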

