VIOLIN: A Large-Scale Dataset for Video-and-Language Inference

Data and code for CVPR 2020 paper: "VIOLIN: A Large-Scale Dataset for Video-and-Language Inference"

[Example figure: a video clip with aligned subtitles as premise, paired with an entailed (positive) and a contradicted (negative) statement.]

We introduce a new task, Video-and-Language Inference, for joint multimodal understanding of video and text. Given a video clip with aligned subtitles as premise, paired with a natural language hypothesis based on the video content, a model needs to infer whether the hypothesis is entailed or contradicted by the given video clip.

We also present a new large-scale dataset, named Violin (VIdeO-and-Language INference), for this task. It consists of 95,322 video-hypothesis pairs from 15,887 video clips, spanning over 582 hours of video (YouTube and TV shows). To address this new multimodal inference task, a model is required to possess sophisticated reasoning skills, from surface-level grounding (e.g., identifying objects and characters in the video) to in-depth commonsense reasoning (e.g., inferring causal relations of events in the video).

News

  • 2020.04.29 Baseline code released, and leaderboard will be available soon.
  • 2020.04.04 Data features, subtitles and statements released.
  • 2020.03.25 Paper released (arXiv).

Violin Dataset

  • Data Statistics

| source | #episodes | #clips | avg clip len | avg pos. statement len (words) | avg neg. statement len (words) | avg subtitle len (words) |
|---|---|---|---|---|---|---|
| Friends | 234 | 2,676 | 32.89s | 17.94 | 17.85 | 72.80 |
| Desperate Housewives | 180 | 3,466 | 32.56s | 17.79 | 17.81 | 69.19 |
| How I Met Your Mother | 207 | 1,944 | 31.64s | 18.08 | 18.06 | 76.78 |
| Modern Family | 210 | 1,917 | 32.04s | 18.52 | 18.20 | 98.50 |
| MovieClips | 5,885 | 5,885 | 40.00s | 17.79 | 17.81 | 69.20 |
| All | 6,716 | 15,887 | 35.20s | 18.10 | 18.04 | 76.40 |

Baseline Models

  • Model Overview

    [Model architecture figure]

Requirements

  • pytorch >= 1.2
  • transformers
  • h5py
  • tqdm
  • numpy
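
    The dependencies above can be installed in one step, assuming a standard pip environment (note that the pytorch requirement is published on PyPI as torch; the repo does not pin exact versions):

    ```shell
    # Install the dependencies listed above; "pytorch" is the PyPI package "torch".
    pip install "torch>=1.2" transformers h5py tqdm numpy
    ```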

Usage

  1. Download the video features, subtitles, and statements, and put them into your feature directory (passed below as --feat_dir).

  2. Finetune BERT-base on Violin's training statements, or download our finetuned BERT model.

  3. Training

    Using only subtitles

    python main.py --feat_dir [feat dir] --bert_dir [bert dir] --input_streams sub
    

    Using both subtitles and video ResNet features (pass --feat c3d to use C3D features instead)

    python main.py --feat_dir [feat dir] --bert_dir [bert dir] --input_streams sub vid --feat resnet
    
  4. Testing

    Testing a specific model

    python main.py --test --feat_dir [feat dir] --bert_dir [bert dir] --input_streams sub vid --feat c3d --model_path [model path]
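
    Since each statement is labeled as either entailed (positive) or contradicted (negative), evaluation reduces to binary accuracy. A minimal sketch of that metric, with illustrative names not taken from the repo:

    ```python
    def binary_accuracy(preds, labels):
        """Fraction of statements whose entailed/contradicted label is predicted correctly."""
        assert len(preds) == len(labels) and labels, "need equal-length, non-empty lists"
        correct = sum(int(p == y) for p, y in zip(preds, labels))
        return correct / len(labels)

    # Example: 3 of 4 statements classified correctly.
    print(binary_accuracy([1, 0, 1, 1], [1, 0, 0, 1]))  # 0.75
    ```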
    


Issues

Statement to reasoning type mapping

Hi,

Figure 3 in the paper shows the distribution of 6 reasoning types. Could you provide the mapping from a statement pair to its reasoning type, so that we can conduct deeper performance analysis for different types?

Thanks,

Shin Lee

How to make the model run with CUDA 9

Hi,
I am using PyTorch 1.3 with CUDA 9.0 and cuDNN 7.0. When I run the model I get the following error:

    Traceback (most recent call last):
      File "main.py", line 137, in <module>
        bert.to(opt.device)
      File "/home/mitr/anaconda3/envs/vqa20/lib/python3.6/site-packages/torch/nn/modules/module.py", line 432, in to
        return self._apply(convert)
      File "/home/mitr/anaconda3/envs/vqa20/lib/python3.6/site-packages/torch/nn/modules/module.py", line 208, in _apply
        module._apply(fn)
      File "/home/mitr/anaconda3/envs/vqa20/lib/python3.6/site-packages/torch/nn/modules/module.py", line 208, in _apply
        module._apply(fn)
      File "/home/mitr/anaconda3/envs/vqa20/lib/python3.6/site-packages/torch/nn/modules/module.py", line 230, in _apply
        param_applied = fn(param)
      File "/home/mitr/anaconda3/envs/vqa20/lib/python3.6/site-packages/torch/nn/modules/module.py", line 430, in convert
        return t.to(device, dtype if t.is_floating_point() else None, non_blocking)
      File "/home/mitr/anaconda3/envs/vqa20/lib/python3.6/site-packages/torch/cuda/__init__.py", line 178, in _lazy_init
        _check_driver()
      File "/home/mitr/anaconda3/envs/vqa20/lib/python3.6/site-packages/torch/cuda/__init__.py", line 108, in _check_driver
        of the CUDA driver.""".format(str(torch._C._cuda_getDriverVersion())))
    AssertionError:
    The NVIDIA driver on your system is too old (found version 9000).
    Please update your GPU driver by downloading and installing a new
    version from the URL: http://www.nvidia.com/Download/index.aspx
    Alternatively, go to: https://pytorch.org to install
    a PyTorch version that has been compiled with your version
    of the CUDA driver.

Any help would be appreciated.

config.json not found

Hi,

When trying to run python main.py --feat_dir [feat dir] --bert_dir [bert dir] --input_streams sub vid --feat resnet as given in the readme (with feat_dir and bert_dir replaced by my local paths for the C3D features and the pretrained BERT model, respectively), I encounter the error 404 Client Error: Not Found for url: https://huggingface.co/bert_output/resolve/main/config.json. Is there any other way I can get the config.json file?

RAW VIDEOS NOT FOUND

Hi, I was trying to download the raw videos from YouTube using the ids provided in the dataset. However, some videos are no longer available and nowhere to be found. How can I reproduce or follow your work when we do not have access to your raw video dataset?

e.g.,

    "gt3ntYidpvs_clip_000_040": {"file": "gt3ntYidpvs_clip_000_040", "source": "gt3ntYidpvs", "span": [0.0, 40.0], "statement": [["The lady in the black tanktop looked out the window to make sure it was safe to open it.", "The lady in the black tanktop looked out the window to make sure it was safe to go outside."], ["The lady in the blue shirt didn't want the lady in black with the gun to open the window because she was afraid something might get in.", "The lady in the blue shirt didn't want the lady in black with the gun to open the window because she was afraid something might get shot."], ["The man in white shirt put the bloody mass outside to try and keep the creatures away from himself.", "A luxury car salesman talks about the red car's performance specifications to a potential customer."]], "sub": [["one board one minute home free okay make", [120, 11629]], ["it quick", [11639, 21970]], ["you ready yeah wait what are you doing", [21980, 26230]], ["show me superiority the senator dead may", [26240, 28990]], ["drive them back sure sound like a call", [29000, 40000]]], "split": "test"}
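
Each annotation entry follows the structure shown above. A minimal sketch of parsing one record (statements trimmed to placeholders here) to recover the source video id, clip span, and statement pairs, e.g. to build a download list for a tool like yt-dlp:

```python
import json

# A trimmed copy of the sample annotation entry shown above (statements replaced
# by placeholders; the real file contains many such records keyed by clip id).
record_json = '''
{"gt3ntYidpvs_clip_000_040": {"file": "gt3ntYidpvs_clip_000_040",
 "source": "gt3ntYidpvs", "span": [0.0, 40.0],
 "statement": [["pos statement 1", "neg statement 1"],
               ["pos statement 2", "neg statement 2"],
               ["pos statement 3", "neg statement 3"]],
 "sub": [["one board one minute home free okay make", [120, 11629]]],
 "split": "test"}}
'''

annotations = json.loads(record_json)
for clip_id, entry in annotations.items():
    start, end = entry["span"]          # clip boundaries in seconds within the source video
    video_id = entry["source"]          # YouTube video id (may no longer be available)
    n_pairs = len(entry["statement"])   # each pair is [positive, negative] statement
    print(f"{video_id}: clip {start}-{end}s, {n_pairs} statement pairs, split={entry['split']}")
```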

Download links for features and finetuned BERT model not available

I noticed that the download links for the image (resnet) features, C3D features, detection features, and finetuned BERT model are currently inactive. I would greatly appreciate it if you could kindly make these data available once again, as they are integral to my current research.

About "adversarial matching"

In your paper, you use adversarial matching to collect negative statements.
I would like to know how you calculate the similarity between two statements. Are a similar model (ESIM+ELMo) and training strategy used, as in VCR?
