
jayleicn / tvretrieval


[ECCV 2020] PyTorch code for XML on TVRetrieval dataset - TVR: A Large-Scale Dataset for Video-Subtitle Moment Retrieval

Home Page: https://tvr.cs.unc.edu

License: MIT License

Python 94.58% Shell 5.42%
video-retrieval dataset pytorch tvr tvc

tvretrieval's Introduction

TVRetrieval

PyTorch implementation of Cross-modal Moment Localization (XML), an efficient method for video (subtitle) moment localization at the corpus level.

TVR: A Large-Scale Dataset for Video-Subtitle Moment Retrieval

Jie Lei, Licheng Yu, Tamara L. Berg, Mohit Bansal

We introduce TV show Retrieval (TVR), a new multimodal retrieval dataset. TVR requires systems to understand both videos and their associated subtitle (dialogue) texts, making it more realistic. The dataset contains 109K queries collected on 21.8K videos from 6 TV shows of diverse genres, where each query is associated with a tight temporal window. The queries are also labeled with query types that indicate whether each of them is more related to video or subtitle or both, allowing for in-depth analysis of the dataset and the methods built on top of it. Strict qualification and post-annotation verification tests are applied to ensure the quality of the collected data. Further, we present several baselines and a novel Cross-modal Moment Localization (XML) network for multimodal moment retrieval tasks. The proposed XML model uses a late fusion design with a novel Convolutional Start-End detector (ConvSE), surpassing baselines by a large margin and with better efficiency, providing a strong starting point for future work. We have also collected additional descriptions for each annotated moment in TVR to form a new multimodal captioning dataset with 262K captions, named TV show Caption (TVC).

TVR Task

A TVR example in the corpus-level moment retrieval task. The ground-truth moment is shown in the green box. Colors in the query indicate whether the words are related to video (blue), subtitle (magenta), or both (black). To better retrieve relevant moments from the video corpus, a system needs to comprehend both videos and subtitles.

Method - Cross-modal Moment Localization (XML)

XML is an efficient method for moment retrieval over a large video corpus. It performs video retrieval in its shallower layers and more fine-grained moment retrieval in its deeper layers. It uses a late fusion design with a novel Convolutional Start-End (ConvSE) detector, making the moment predictions efficient and accurate. The ConvSE module is inspired by edge detectors in image processing. It learns to detect start (up) and end (down) edges in the 1D query-clip similarity signals with two trainable 1D convolution filters, and is the core of XML's high accuracy and efficiency.
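The edge-detection idea behind ConvSE can be illustrated with a minimal pure-Python sketch. This is not the repository's implementation: the two filters below are fixed, hand-picked values (in XML they are trainable), and the names `conv1d_same`, `START_FILTER`, `END_FILTER` and `detect_moment` are hypothetical.

```python
def conv1d_same(signal, kernel):
    """Slide an odd-length kernel over the signal, zero-padding to keep its length."""
    half = len(kernel) // 2
    padded = [0.0] * half + list(signal) + [0.0] * half
    return [
        sum(k * padded[i + j] for j, k in enumerate(kernel))
        for i in range(len(signal))
    ]


# Assumed, hand-picked filters for illustration only (XML learns its filters).
START_FILTER = [-1.0, 1.0, 0.0]  # s[i] - s[i-1]: large where similarity rises (up edge)
END_FILTER = [0.0, 1.0, -1.0]    # s[i] - s[i+1]: large where similarity falls (down edge)


def detect_moment(similarity):
    """Return (start_idx, end_idx) of the strongest up edge and the strongest
    down edge at or after it, given per-clip query-clip similarity scores."""
    start_scores = conv1d_same(similarity, START_FILTER)
    end_scores = conv1d_same(similarity, END_FILTER)
    start = max(range(len(start_scores)), key=start_scores.__getitem__)
    end = max(range(start, len(end_scores)), key=end_scores.__getitem__)
    return start, end


# Toy similarity signal: clips 2..4 match the query well.
similarity = [0.1, 0.1, 0.9, 0.8, 0.9, 0.2, 0.1]
print(detect_moment(similarity))  # (2, 4)
```

With fixed difference filters this reduces to a classic 1D edge detector; learning the filters lets the model adapt the edge shape to the data.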

Resources

Getting started

Prerequisites

  1. Clone this repository
git clone https://github.com/jayleicn/TVRetrieval.git
cd TVRetrieval
  2. Prepare feature files

Download tvr_feature_release.tar.gz (33GB). After downloading the feature file, extract it to the data directory:

tar -xf path/to/tvr_feature_release.tar.gz -C data

You should now see tvr_feature_release under the data directory. It contains video features (ResNet, I3D) and text features (subtitle and query, from fine-tuned RoBERTa). Read the code for details on how the features are extracted: video feature extraction, text feature extraction.

  3. Install dependencies.
  • Python 3.7
  • PyTorch 1.4.0
  • Cuda 10.1
  • tensorboard
  • tqdm
  • h5py
  • easydict

The dependencies are installed with conda and pip. You need to have anaconda3 or miniconda3 installed first, then run:

conda create --name tvr --file spec-file.txt
conda activate tvr 
pip install easydict
  4. Add project root to PYTHONPATH
source setup.sh

Note that you need to do this each time you start a new session.

Training and Inference

We give examples on how to perform training and inference for our Cross-modal Moment Localization (XML) model.

  1. XML training
bash baselines/crossmodal_moment_localization/scripts/train.sh \
tvr CTX_MODE VID_FEAT_TYPE \
--exp_id EXP_ID

CTX_MODE refers to the context (video, sub, tef, etc.) we use. VID_FEAT_TYPE is the video feature type (resnet, i3d, resnet_i3d). EXP_ID is a name string for the current run.

Below is an example of training XML with video_sub (video + subtitle), where video feature is resnet_i3d (ResNet + I3D):

bash baselines/crossmodal_moment_localization/scripts/train.sh \
tvr video_sub resnet_i3d \
--exp_id test_run

This code loads all the data (~60GB) into RAM to speed up training; use --no_core_driver to disable this behavior. You can also pass --debug to test your configuration before actually training the model.

By default, the model is trained with all the losses, including video retrieval loss L^{vr} and moment localization loss L^{svmr}. To train it for only the moment localization, append --lw_neg_q 0 --lw_neg_ctx 0. To train it for only video retrieval, append --lw_st_ed 0.
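One way to read these flags is as weights in a weighted sum of loss terms. The sketch below only illustrates that reading and is not the repository's training code: the function `total_loss`, the split of the video retrieval loss into two terms, and the example loss values are all assumptions, and only the `lw_*` names are taken from the flags above.

```python
def total_loss(loss_st_ed, loss_neg_q, loss_neg_ctx, lw_st_ed, lw_neg_q, lw_neg_ctx):
    """Weighted sum of loss terms (hypothetical helper).

    loss_st_ed   -- moment localization (start/end) loss, weighted by lw_st_ed
    loss_neg_q   -- assumed video retrieval term, weighted by lw_neg_q
    loss_neg_ctx -- assumed video retrieval term, weighted by lw_neg_ctx
    """
    return lw_st_ed * loss_st_ed + lw_neg_q * loss_neg_q + lw_neg_ctx * loss_neg_ctx


# Made-up loss values, for illustration only.
losses = dict(loss_st_ed=2.0, loss_neg_q=3.0, loss_neg_ctx=5.0)

# Mirrors `--lw_neg_q 0 --lw_neg_ctx 0`: only moment localization contributes.
only_localization = total_loss(**losses, lw_st_ed=1.0, lw_neg_q=0.0, lw_neg_ctx=0.0)

# Mirrors `--lw_st_ed 0`: only the video retrieval terms contribute.
only_retrieval = total_loss(**losses, lw_st_ed=0.0, lw_neg_q=1.0, lw_neg_ctx=1.0)
```

Setting a weight to zero removes the corresponding term from the objective, which is why the two flag combinations above isolate one task each.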

Training with the above config stops at around epoch 60, which takes about 4 hours on a single 2080Ti GPU. You should get ~2.6 for VCMR R@1, IoU=0.7 on the val set. The resulting model and config are saved in the directory baselines/crossmodal_moment_localization/results/tvr-video_sub-test_run-*.
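For readers unfamiliar with the metric, the following is a minimal sketch of how a hit for "R@K, IoU=0.7" could be decided for a single query in the corpus-level setting. It assumes a prediction counts when it comes from the ground-truth video and its temporal IoU with the ground-truth window is at least the threshold; the helper names and tuple layout are hypothetical, and the evaluation code under standalone_eval remains the reference.

```python
def temporal_iou(pred, gt):
    """Intersection over union of two (start, end) windows, in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def hit_at_k(ranked_preds, gt, k, iou_thd=0.7):
    """Return 1 if any of the top-k predictions matches the ground truth, else 0.

    ranked_preds -- list of (vid_name, start, end), best first (assumed layout)
    gt           -- (vid_name, start, end)
    """
    gt_vid, gt_window = gt[0], gt[1:]
    for vid_name, start, end in ranked_preds[:k]:
        if vid_name == gt_vid and temporal_iou((start, end), gt_window) >= iou_thd:
            return 1
    return 0


gt = ("vid_a", 10.0, 20.0)
ranked_preds = [
    ("vid_b", 10.0, 20.0),  # right window, wrong video: not a hit
    ("vid_a", 11.0, 20.0),  # right video, IoU = 9 / 10 = 0.9: a hit
]
print(hit_at_k(ranked_preds, gt, k=1), hit_at_k(ranked_preds, gt, k=2))  # 0 1
```

Averaging these per-query hits over all queries gives the recall for that K and IoU threshold.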

  2. XML inference

After training, you can run inference with the saved model on the val or test_public set:

bash baselines/crossmodal_moment_localization/scripts/inference.sh MODEL_DIR_NAME SPLIT_NAME

MODEL_DIR_NAME is the name of the directory containing the saved model, e.g., tvr-video_sub-test_run-*. SPLIT_NAME can be val or test_public. By default, this code evaluates all 3 tasks (VCMR, SVMR, VR). You can change this behavior by appending an option, e.g., --tasks VCMR VR, in which case only VCMR and VR are evaluated. The generated predictions are saved in the same directory as the model. You can evaluate them by following the instructions in Evaluation and Submission.

While the default inference code shown above gives you results without non-maximum suppression (NMS), you can append an additional flag --nms_thd 0.5 to obtain results with NMS. Most likely you will observe a higher R@5 score, but lower R@{10, 100} scores. For the results reported in the paper, we do not use NMS.
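As background on what the threshold controls, here is a minimal sketch of greedy temporal NMS over the ranked moments of one video. It is a generic version of the technique, not the repository's implementation: the function names and the (start, end, score) layout are assumptions, and `max_after_nms` is used here only as an illustrative cap on the number of kept moments.

```python
def temporal_iou(a, b):
    """Intersection over union of two (start, end) windows."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def temporal_nms(preds, nms_thd=0.5, max_after_nms=100):
    """Greedy NMS: visit moments by descending score and keep a moment only if
    its IoU with every already-kept moment is at most nms_thd.

    preds -- list of (start, end, score) tuples (assumed layout)
    """
    kept = []
    for cand in sorted(preds, key=lambda p: p[2], reverse=True):
        if all(temporal_iou(cand[:2], k[:2]) <= nms_thd for k in kept):
            kept.append(cand)
            if len(kept) == max_after_nms:
                break
    return kept


preds = [
    (10.0, 20.0, 0.9),
    (11.0, 21.0, 0.8),  # IoU with the first moment is 9 / 11, so it is suppressed
    (30.0, 40.0, 0.7),
]
print(temporal_nms(preds))  # [(10.0, 20.0, 0.9), (30.0, 40.0, 0.7)]
```

Suppressing near-duplicate windows makes the top few predictions more diverse but shrinks the overall candidate list, which is one plausible reason for the R@5 versus R@{10, 100} trade-off mentioned above.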

Other baselines

In addition to the XML model, we also provide our implementations of CAL, ExCL and MEE at TVRetrieval/baselines. Their training, inference and evaluation are similar to XML's.

Evaluation and Submission

We only release ground truth for the train and val splits. To get results on the test-public split, please submit your results following the instructions here: standalone_eval/README.md

Citations

If you find this code useful for your research, please cite our paper:

@inproceedings{lei2020tvr,
  title={TVR: A Large-Scale Dataset for Video-Subtitle Moment Retrieval},
  author={Lei, Jie and Yu, Licheng and Berg, Tamara L and Bansal, Mohit},
  booktitle={ECCV},
  year={2020}
}

Acknowledgement

This research is supported by grants and awards from NSF, DARPA, ARO and Google.

This code borrows components from the following projects: transformers, TVQAplus, TVQA, MEE. We thank the authors for open-sourcing these great projects! We also thank Victor Escorcia for his kind help in explaining CAL's implementation details.

Contact

jielei [at] cs.unc.edu

tvretrieval's People

Contributors

jayleicn, linjieli222


tvretrieval's Issues

Audio Extraction

Is it possible to download the original video clips to extract the audio?

codalab ReaderError: 'utf8' codec can't decode byte #xaf: invalid start byte

When I follow the example to submit my zip of json to codalab, it gives me an error:

Traceback (most recent call last):
  File "/worker/worker.py", line 323, in run
    bundles = get_bundle(root_dir, 'run', bundle_url)
  File "/worker/worker.py", line 180, in get_bundle
    metadata[k] = get_bundle(bundle_path, k, v)
  File "/worker/worker.py", line 180, in get_bundle
    metadata[k] = get_bundle(bundle_path, k, v)
  File "/worker/worker.py", line 171, in get_bundle
    metadata = yaml.load(mf)
  File "/usr/local/lib/python2.7/dist-packages/yaml/__init__.py", line 69, in load
    loader = Loader(stream)
  File "/usr/local/lib/python2.7/dist-packages/yaml/loader.py", line 34, in __init__
    Reader.__init__(self, stream)
  File "/usr/local/lib/python2.7/dist-packages/yaml/reader.py", line 85, in __init__
    self.determine_encoding()
  File "/usr/local/lib/python2.7/dist-packages/yaml/reader.py", line 135, in determine_encoding
    self.update(1)
  File "/usr/local/lib/python2.7/dist-packages/yaml/reader.py", line 165, in update
    exc.encoding, exc.reason)
ReaderError: 'utf8' codec can't decode byte #xaf: invalid start byte
  in "/tmp/codalab/tmpOjOaDx/run/input/res/metadata", position 11

How can I fix the error?

Implementation details of the DiDeMo experiments with the XML baseline (your method)

Hi there!
Thanks for sharing your great work.
It seems you conducted experiments on the DiDeMo dataset, without using subtitle information, to check the performance of your method.
I have a couple of questions to ask you about it.

  1. clip length of the input features (in this case ResNet)
    In the main experiments in your paper, TVR features are divided and fed into the model with the clip length of 1.5 sec.
    Is it also the case with the DiDeMo dataset?
    Or did you treat the feature in a different way from the TVR dataset?

  2. how to deal with the timestamp information at the time of both training and inference (for training, also about tef)
    In the DiDeMo dataset, the moment timestamp information is given in the form of an index (0-5).
    Did you translate it into the form of seconds, i.e., (0 sec - 30 sec)?
    Or did you use the index as the timestamp information as it is?

If there is any information I missed about the didemo dataset, please also let me know.
Thank you in advance!

About data collection

Hi, in the data collection part, what automatic tool do you use to check the quality of the annotations in the automatic check step?

code location

Hi, I want to know where the ConvSE code and the Single Video Moment Retrieval loss function are, but I could not find them.

How to set up multi-GPU training?

Hi there.

I was trying to use multi-gpu for training. So I put the gpu ids in '--device_ids', baselines/crossmodal_moment_localization/config.py.

I fixed the code like below.
if opt.train_span_start_epoch != -1 and epoch_i >= opt.train_span_start_epoch:
    model.set_train_st_ed(opt.lw_st_ed)
->
model.set_train_st_ed(opt.lw_st_ed)

Then I added the following code at the top of my script:
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3,4,5"

But it is still not working. What should I do?

I have used

  • 6 GPUs : RTX 3090

Thank you for your time.

Segmentation fault while running Language Model Fine-tuning

Hi,
I've been trying to finetune the language model but I'm getting segmentation fault while doing so.
I get the same issue when I try to run the lm_finetuning_on_single_sentences.py directly, this happens when I comment out invocation of main() like this

if __name__ == "__main__":
    # main()

Inference for test_public

Hi,

When I run "bash baselines/crossmodal_moment_localization/scripts/inference.sh MODEL_DIR_NAME val" everything works as expected.

However when I run "bash baselines/crossmodal_moment_localization/scripts/inference.sh MODEL_DIR_NAME test_public" I get the following error:

2020-04-09 17:27:16.745:INFO:__main__ - CUDA enabled.
2020-04-09 17:27:16.756:INFO:__main__ - Starting inference...
2020-04-09 17:27:16.757:INFO:__main__ - Computing scores
Computing query2video scores: 100%|█████████████████████████████████████████████████| 6/6 [00:02<00:00, 2.23it/s]
2020-04-09 17:27:22.153:INFO:__main__ - Inference with full-script.
Traceback (most recent call last):
  File "baselines/crossmodal_moment_localization/inference.py", line 584, in <module>
    start_inference()
  File "baselines/crossmodal_moment_localization/inference.py", line 578, in start_inference
    tasks=opt.tasks, max_after_nms=100)
  File "baselines/crossmodal_moment_localization/inference.py", line 486, in eval_epoch
    eval_submission_raw = get_eval_res(model, eval_dataset, opt, tasks, max_after_nms=max_after_nms)
  File "baselines/crossmodal_moment_localization/inference.py", line 456, in get_eval_res
    tasks=tasks)
  File "baselines/crossmodal_moment_localization/inference.py", line 277, in compute_query2ctx_info
    eval_dataset.load_gt_vid_name_for_query(is_svmr)
  File "/home/kevin/TVRetrieval/baselines/crossmodal_moment_localization/start_end_dataset.py", line 241, in load_gt_vid_name_for_query
    assert "vid_name" in self.query_data[0]
AssertionError

I notice that "data/tvr_val_release.jsonl" has a different format than "data/tvr_test_public_release.jsonl", so I suspect this is the culprit and that the test_public file needs to be handled differently in the inference code.

P.S. kudos for all the code and clear documentation provided in this repository.
