The roll-videoqa from noagarcia

Knowledge-Based Video Question Answering with Unsupervised Scene Descriptions

This is the PyTorch implementation of our ROLL model for VideoQA. ROLL has been recently published at ECCV 2020. Find the technical paper here.

ROLL consists on three branches, each performing a different inspired-cognitive task:

Read branch: Dialog comprehension.
Observe branch: Visual scene reasoning.
Recall branch: Storyline recalling.

The information generated by each branch is encoded via Transformers. A modality weighting mechanism balances the output from the different modalities to predict the final answer.

Dependencies

This code runs on Python 3.6 and PyTorch 1.0.1. We recommend using Anaconda to install the dependencies.

conda create --name roll-videoqa python=3.6
conda activate roll-videoqa
conda install -c anaconda numpy pandas scikit-learn 
conda install -c conda-forge visdom tqdm
conda install pytorch==1.0.1 torchvision==0.2.2 -c pytorch
pip install pytorch-transformers

Dependencies: NumPy, pandas, PyTorch 1.0.1, torchvision, PyTorch-Transformers, scikit-learn, tqdm, and Visdom.
Optional: Beautiful Soup to download episode summaries (already provided).

Data

For data preparation, follow instructions in DATA.md.

ROLL on KnowIT VQA

Start Visdom Server. To visualize the training plots, first start the Visdom server: python -m visdom.server. Plots can be found by visiting http://localhost:8097 in a browser.

Pretrain branches. The three branches (read, observe, recall) are first pretrained independently:

# Read branch training using the subtitles
python Source/branch_read.py --dataset knowit

# Recall branch training using the video summaries
python Source/branch_recall.py --dataset knowit

# For the observe branch, the video descriptions need to be computed first.
# The descriptions will be at Data/knowit_observe/scenes_descriptions.csv
python Source/generate_scene_description.py knowit

# Observe branch training using the generated descriptions
python Source/branch_observe.py --dataset knowit

Multimodality fusion. The outputs from the branches are fused and the network is trained one last time using the modality weighting mechanism.
```
python Source/fuse_branches.py --dataset knowit
```

ROLL on TVQA+

TODO.

Citation

If you find this code useful, please cite our work:

@InProceedings{garcia2020knowledge,
   author    = {Noa Garcia and Yuta Nakashima},
   title     = {Knowledge-Based Video Question Answering with Unsupervised Scene Descriptions},
   booktitle = {Proceedings of the European Conference on Computer Vision},
   year      = {2020},
}

@InProceedings{garcia2020knowit,
   author    = {Noa Garcia and Mayu Otani and Chenhui Chu and Yuta Nakashima},
   title     = {KnowIT VQA: Answering Knowledge-Based Questions about Videos},
   booktitle = {Proceedings of the Thirty-Fourth AAAI Conference on Artificial Intelligence},
   year      = {2020},
}

TODO

TVQA+ code

noagarcia / roll-videoqa Goto Github PK

roll-videoqa's Introduction

Knowledge-Based Video Question Answering with Unsupervised Scene Descriptions

Dependencies

Data

ROLL on KnowIT VQA

ROLL on TVQA+

Citation

TODO

roll-videoqa's People

Contributors

Stargazers

Watchers

Forkers

roll-videoqa's Issues

How to visualize the scene graph？

Links for object relations are down

Links for KnowIt-VQA are down

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent