mikewangwzhl / vidil

PyTorch code for "Language Models with Image Descriptors are Strong Few-Shot Video-Language Learners"

License: MIT License

Languages: Python 57.17%, Shell 1.24%, Dockerfile 0.02%, C++ 18.02%, Cuda 23.31%, C 0.12%, Makefile 0.02%, CSS 0.06%, HTML 0.03%
Topics: blip, clip, gpt-3, msrvtt, msvd, vatex, video-language, vision-language, youcook2, vlep

vidil's Introduction


Download Datasets & Checkpoints

  • Download the dataset annotations zip from box or google drive. Then unzip the downloaded datasets under shared_datasets/. The resulting shared_datasets folder structure is expected to be (a quick sanity-check sketch for these layouts follows this list):

    shared_datasets
    ├── README.md
    ├── MSRVTT_caption
    ├── MSRVTT_qa
    ...
    

    Then, please refer to Dataset Instruction for downloading and processing raw videos.

  • Download BLIP checkpoints:

    bash download_blip_checkpoints.sh
    
  • Download the Input & Output Examples zip from box or google drive. Unzip the folders under output_example/; the resulting output_example/ folder structure is expected to be:

    output_example
    ├── msrvtt
    ├── msvd_test
    ├── vlep_test
    └── README.md
    
  • [Update 6/17] GPT-3 Results for Video Captioning, Video Question Answering and VLEP can be downloaded here.
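
As referenced above, here is a quick sanity check that the unzipped data matches the expected layouts (a minimal sketch; the directory names are taken from the trees in this section, and the script is assumed to be run from the repository root):

    import os

    # Directory names taken from the expected layouts above.
    expected_dirs = [
        "shared_datasets/MSRVTT_caption",
        "shared_datasets/MSRVTT_qa",
        "output_example/msrvtt",
        "output_example/msvd_test",
        "output_example/vlep_test",
    ]

    for path in expected_dirs:
        status = "ok" if os.path.isdir(path) else "MISSING"
        print(f"{status:8s}{path}")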

Set Up Environment

  • launch the docker environment:

    • (1) set up the variables "CKPT" and "DATASETS" as commented in run_docker_vidil.sh
    • (2) run the docker image
      bash run_docker_vidil.sh
      
  • set up GPU devices: within the docker image, set the following environment variables to configure the GPU devices

    export N_GPU=<num of gpus>
    export CUDA_VISIBLE_DEVICES=<0,1,2...>
    
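To confirm the variables took effect, a minimal check from inside the docker image (assuming PyTorch is available, which the image provides):

    import os

    import torch

    print("N_GPU                =", os.environ.get("N_GPU"))
    print("CUDA_VISIBLE_DEVICES =", os.environ.get("CUDA_VISIBLE_DEVICES"))
    # Should match the number of devices listed in CUDA_VISIBLE_DEVICES.
    print("visible GPUs (torch) =", torch.cuda.device_count())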

Generate Video Representation & GPT-3 Prompt

  • [Update 6/15] Quick start with generated video representations: frame captions and visual tokens for the five datasets can be downloaded here if you don't want to go through the entire pipeline. You can copy the json files following the data structure described below.

The following scripts run the entire pipeline, which (1) generates frame captions; (2) generates visual tokens; and (3) generates few-shot prompts ready for GPT-3. The output folder has the following structure:

    {dataset_split}
    ├── frame_caption
    │   ├── config.yaml  # config for frame captioning
    │   ├── video_text_Cap.json  # frame captions w/o filtering
    │   └── video_text_CapFilt.json  # frame captions w/ filtering
    ├── input_prompts
    │   ├── {output_name}.jsonl  # few-shot GPT-3 input prompts, one query per line
    │   ├── {output_name}__idx_2_videoid.json  # line idx to video id
    │   ├── {output_name}__chosen_samples.json  # chosen in-context examples in the support set
    │   ...
    └── visual_tokenization_{encoder_name}
        ├── config.yaml  # config for visual tokenization
        └── visual_tokens.json  # raw visual tokens of each frame

All scripts should be run from the /src directory, i.e., the root directory after launching the docker image. The following are examples of running the pipeline with in-context example selection for several datasets. For additional notes on running the pipeline scripts, please refer to Pipeline Instruction.
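
To get a feel for what ends up in input_prompts/, the prompt jsonl and its idx_2_videoid companion can be inspected with a few lines of Python (a minimal sketch; the filename is taken from the output_example folder referenced later in this README, and the fields inside each jsonl record depend on the pipeline config):

    import json

    prompt_dir = "output_example/msrvtt/input_prompts"
    output_name = "temp_0.0_msrvtt_caption_with_in_context_selection_clip_shot_10_seed_42_N_5"

    # Line index -> video id mapping (JSON keys are strings).
    with open(f"{prompt_dir}/{output_name}__idx_2_videoid.json") as f:
        idx_2_videoid = json.load(f)

    # Peek at the first few GPT-3 queries.
    with open(f"{prompt_dir}/{output_name}.jsonl") as f:
        for idx, line in enumerate(f):
            record = json.loads(line)
            print(idx_2_videoid.get(str(idx)), str(record)[:120])
            if idx >= 2:
                break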

Standalone Pipeline for Frame Captioning and Visual Tokenization

Since the few-shot support set is sampled from the training set, the first time the pipeline is run for a given dataset we need to perform frame captioning and visual tokenization on its training split.

For <dataset> in ["msrvtt","youcook2","vatex","msvd","vlep"]:

bash pipeline/scripts/run_frame_captioning_and_visual_tokenization.sh <dataset> train <output_root>

Examples of the resulting frame caption and visual tokenization directories can be found at output_example/msrvtt/frame_caption and output_example/msrvtt/visual_tokenization_clip.
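
The command above needs to be run once per dataset; if you prefer to launch all five in sequence, a minimal sketch (assuming you are at /src inside the docker image and replace the placeholder output root with your own path):

    import subprocess

    # Placeholder: replace with your own <output_root>.
    OUTPUT_ROOT = "/path/to/output_root"

    # Frame captioning + visual tokenization on the training split of each dataset.
    for dataset in ["msrvtt", "youcook2", "vatex", "msvd", "vlep"]:
        subprocess.run(
            [
                "bash",
                "pipeline/scripts/run_frame_captioning_and_visual_tokenization.sh",
                dataset,
                "train",
                OUTPUT_ROOT,
            ],
            check=True,
        )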

Video Captioning

For <dataset> in ["msrvtt","youcook2","vatex"]:

  • (1) Run the standalone Frame Captioning and Visual Tokenization pipeline for the chosen <dataset>

  • (2) Run the pipeline for generating video captioning prompts, for <dataset> and <split> in ["train","val","test"]:

    • w/o ASR:
    bash pipeline/scripts/generate_gpt3_query_pipeline_caption_with_in_context_selection.sh <dataset> <split> <output_root> 10 42 5 caption
    
    • w/ ASR:
    bash pipeline/scripts/generate_gpt3_query_pipeline_caption_with_in_context_selection_with_asr.sh <dataset> <split> <output_root> 10 42 5 caption_asr
    

    An example of the output prompt jsonl can be found at output_example/msrvtt/input_prompts/temp_0.0_msrvtt_caption_with_in_context_selection_clip_shot_10_seed_42_N_5.jsonl.
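
Note that the repository does not ship code for sending these prompts to GPT-3 (see the issue "Code for Getting GPT-3 Response" further down this page); the sketch below is only a rough illustration using the legacy OpenAI completions client, where the engine name and the "prompt" field inside each jsonl record are assumptions, not part of this repo.

    import json
    import os

    import openai  # assumes the legacy (pre-1.0) openai client

    openai.api_key = os.environ["OPENAI_API_KEY"]

    # Example prompt file generated above; each line is one GPT-3 query.
    prompts_path = "output_example/msrvtt/input_prompts/temp_0.0_msrvtt_caption_with_in_context_selection_clip_shot_10_seed_42_N_5.jsonl"

    responses = []
    with open(prompts_path) as f:
        for line in f:
            record = json.loads(line)
            # Assumption: each record carries the prompt text under "prompt";
            # check the jsonl to confirm the actual field name.
            completion = openai.Completion.create(
                engine="text-davinci-002",  # assumption: use whichever GPT-3 engine you prefer
                prompt=record["prompt"],
                max_tokens=64,
                temperature=0.0,  # matches the temp_0.0 prefix in the filename
            )
            responses.append(completion["choices"][0]["text"])

    # The exact layout expected by utils_gpt3/process_gpt3_response.py should be
    # checked against output_example/msrvtt/gpt3_response; this only saves the raw text.
    with open("gpt3_raw_responses.json", "w") as f:
        json.dump(responses, f, indent=2)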

Video Question Answering

For <dataset> in ["msrvtt","msvd"]:

  • (1) Run the standalone Frame Captioning and Visual Tokenization pipeline for the chosen <dataset>

  • (2) Run the pipeline for generating video question answering prompts, for <dataset> and <split> in ["train","val","test"]:

    bash pipeline/scripts/generate_gpt3_query_pipeline_qa_with_in_context_selection.sh <dataset> <split> <output_root> 5 42 5 question
    

    An example of the output prompt jsonl can be found at output_example/msvd_test/input_prompts/temp_0.0_gpt3_queries_msvd_qa_clip_shot_5_seed_42.jsonl.

Video-Language Event Prediction (VLEP)

  • (1) Run the standalone Frame Captioning and Visual Tokenization pipeline for vlep

  • (2) Run the pipeline for generating VLEP prompts:

        bash pipeline/scripts/generate_gpt3_query_pipeline_vlep_with_random_context_asr_multichoice.sh <dataset> <split> <output_root> 10 42
    

    An example of the output prompt jsonl can be found at output_example/vlep_test/input_prompts/temp_0.0_vlep_test_clip_shot_10_seed_42_multichoice.jsonl.

Semi-Supervised Text-Video Retrieval

For the semi-supervised setting, we first generate pseudo labels on the training set, and then train BLIP on the pseudo-labeled dataset for retrieval.

  • (1) Generate the pseudo-labeled training set annotation json: suppose the raw GPT-3 responses are stored at <gpt3_response_dir> and the input prompts dir is at <input_prompts_dir>; run:

        python utils_gpt3/process_gpt3_response.py --gpt3_response_dir <gpt3_response_dir> --input_prompts_dir <input_prompts_dir> --output_dir <processed_response_dir>
        python utils_gpt3/gpt3_response_to_jsonl.py --dataset <dataset_name> --gpt3_processed_dir <processed_response_dir> --output_dir <pseudo_label_ann_output_dir>
    

    Examples of <gpt3_response_dir>, <input_prompts_dir>, <processed_response_dir> and <pseudo_label_ann_output_dir> can be found at output_example/msrvtt/gpt3_response, output_example/msrvtt/input_prompts, output_example/msrvtt/processed_response_dir and output_example/msrvtt/pseudo_label_ann, respectively.

  • (2) Finetune pretrained BLIP on the pseudo-labeled data: for <dataset> in ["msrvtt","vatex"], set the field train_ann_jsonl in configs/train_blip_video_retrieval_<dataset>_pseudo.yaml to the path of the output jsonl from step (1) in <pseudo_label_ann_output_dir> (a small sketch for doing this programmatically follows this list). Then run:

    bash scripts/train_caption_video.sh train_blip_video_retrieval_<dataset>_pseudo
    
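As referenced in step (2), the config edit can also be done programmatically; a minimal sketch, assuming PyYAML is available and only the train_ann_jsonl field needs to change (the pseudo-label jsonl path below is a placeholder, and note that round-tripping with PyYAML drops comments, so editing the file by hand works just as well):

    import yaml

    dataset = "msrvtt"  # or "vatex"
    config_path = f"configs/train_blip_video_retrieval_{dataset}_pseudo.yaml"
    # Placeholder: path to the jsonl produced in step (1) under <pseudo_label_ann_output_dir>.
    pseudo_label_jsonl = "/path/to/pseudo_label_ann_output_dir/pseudo_label.jsonl"

    with open(config_path) as f:
        config = yaml.safe_load(f)

    # Point training at the pseudo-labeled annotations.
    config["train_ann_jsonl"] = pseudo_label_jsonl

    with open(config_path, "w") as f:
        yaml.safe_dump(config, f)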

Evaluation

Scripts for evaluating generation results from GPT-3:

  • Video Captioning: please refer to the example written in the script for more details about the required inputs

    bash scripts/evaluation/eval_caption_from_gpt3_response.sh
    
  • Question Answering: please refer to the example written in the script for more details about the required inputs

    bash scripts/evaluation/eval_qa_from_gpt3_response.sh
    
  • VLEP:

    • (1) get the processed GPT-3 response; examples of <gpt3_response_dir>, <input_prompts_dir> and <processed_response_dir> can be found at output_example/vlep_test/gpt3_response, output_example/vlep_test/input_prompts and output_example/vlep_test/gpt3_response_processed, respectively:

          python utils_gpt3/process_gpt3_response.py --gpt3_response_dir <gpt3_response_dir> --input_prompts_dir <input_prompts_dir> --output_dir <processed_response_dir>
      
    • (2) run the following script to generate the output in the official format for CodaLab submission; an example of the output jsonl can be found at output_example/vlep_test/evaluation/temp_0.0_vlep_test_clip_shot_10_seed_42_multichoice_eval.jsonl

          python eval_vlep.py --gpt3_processed_response <processed_response_json> --output_path <output_jsonl_path>
      

Citation

@article{wang2022language,
  title={Language Models with Image Descriptors are Strong Few-Shot Video-Language Learners},
  author={Wang, Zhenhailong and Li, Manling and Xu, Ruochen and Zhou, Luowei and Lei, Jie and Lin, Xudong and Wang, Shuohang and Yang, Ziyi and Zhu, Chenguang and Hoiem, Derek and others},
  journal={arXiv preprint arXiv:2205.10747},
  year={2022}
}

Acknowledgement

The implementation of VidIL relies on resources from BLIP, ALPRO, transformers. We thank the original authors for their open-sourced code and encourage users to cite their works when applicable.

vidil's People

Contributors

mikewangwzhl

vidil's Issues

QA about the text-video retrieval result in the paper.

Hi, thanks for your great work!
I have a question about the zero-shot text-video retrieval result in the paper.
As shown in the first line of Table 5, t2v R@1 is 40.5 and R@5 is 62.8,
[screenshot of Table 5]
while in the original BLIP paper,
[screenshot from the BLIP paper]
t2v R@1 is 43.3 and R@5 is 65.6.
I think there must be some differences, e.g., in the backbone. Could you tell me why the numbers are different? Thanks!

Download dataset error

Hi,
When I'm trying to download the dataset from the box URL I'm getting the following error:

This shared file or folder link has been removed or is unavailable to you.

Any chance you can upload another link for downloading?

Custom dataset preparation

How do I prepare this for finetuning on my own dataset? I would like to get the BLIP baseline on my custom dataset then finetune with your models on my dataset. Do I only need a folder of videos, with their captions in a json? Can you link me to a sample video/json for the dataset?

Computing BLIP baseline for 4 metrics on MSRVTT caption

Hi,
According to the reported baselines of BLIP and BLIP_cap:
[screenshot of the reported baselines]

I'm trying to understand, and also to find in the code, how you computed this baseline on the 4 metrics.
According to Section 4.2 of the paper, you stitch multiple frames and compute the loss,
but I'm not sure I understand how it's done (or where it is implemented in the code).

Any help is highly appreciated!

Thanks a lot.

Something about datasets

I find that the videos in \MSRVTT_ret\train_val_videos\TrainValVideo cannot be matched with the captions in MSRVTT_caption/train_caption.jsonl.
Is that expected?

About VLEP TV

Hi @MikeWangWZHL, thanks for sharing your great work! I wonder how you obtained the TV show clips, since only YouTube clips are provided in the VLEP dataset?

Dockerfile content

Hi,
Is it possible for you to also upload the Dockerfile used to build the mikewangwzhl/vidil image in your Docker Hub registry?

Thanks,
Ilan.

Questions about the Pipeline for Frame Captioning and Visual Tokenization

When I use run_frame_captioning_and_visual_tokenization.sh to extract visual tokens and frame captions for my own dataset, I run into the following issue in run_video_CapFilt.py:

File "/extract_frame_concepts/models/med.py", line 178, in forward
attention_scores = torch.matmul(query_layer, key_layer.transpose(-1, -2))
RuntimeError: The size of tensor a (12) must match the size of tensor b (36) at non-singleton dimension 0

Is this because I did something wrong?

Code for Getting GPT-3 Response

Hi, thanks for releasing this repo. The code for invoking the GPT-3 API and getting the response from the model doesn't seem to be included. Could you please add this if possible?
