TempCompass: Do Video LLMs Really Understand Videos?

Affiliations: Peking University, The University of Hong Kong, Huawei Noah’s Ark Lab

📢 News

[2024-03-12] 🔥🔥🔥 The evaluation code is now released! Feel free to evaluate your own Video LLMs.

✨ Highlights

Diverse Temporal Aspects and Task Formats

  • TempCompass encompasses a diverse set of temporal aspects (left) and task formats (right) to comprehensively evaluate the temporal perception capability of Video LLMs.

Conflicting Videos

  • We construct conflicting videos to prevent the models from taking advantage of single-frame bias and language priors.

  • 🤔 Can your Video LLM correctly answer the following question for both videos?

    [Videos: Raw Video | Conflicting Video]

    What is happening in the video?
    A. A person drops down the pineapple
    B. A person pushes forward the pineapple
    C. A person rotates the pineapple
    D. A person picks up the pineapple

🚀 Quick Start

To begin, clone this repository and install the required packages:

git clone https://github.com/llyx97/TempCompass.git
cd TempCompass
pip install -r requirements.txt

Data Preparation

1. Task Instructions

The task instructions can be found in questions/.

2. Videos

Run the following commands. The videos will be saved to videos/.

cd utils
python download_video.py    # Download raw videos
python process_videos.py    # Construct conflicting videos

Run Inference

We use Video-LLaVA as an example to illustrate how to conduct MLLM inference on our benchmark.

Run the following commands. The prediction results will be saved to predictions/video-llava/<task_type>.

cd run_video_llava
python inference_dataset.py --task_type <task_type>    # select <task_type> from multi-choice, yes_no, caption_matching, captioning
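
To cover all four task types in one go, the command above can be wrapped in a small driver. The following is a minimal sketch, not part of the repository: the helper names `build_inference_commands` and `run_all` are hypothetical, and it assumes the script is launched from the repository root so that `run_video_llava` resolves as a relative path.

```python
import subprocess

# Task types as listed in the inference command's comment.
TASK_TYPES = ["multi-choice", "yes_no", "caption_matching", "captioning"]


def build_inference_commands(task_types=TASK_TYPES):
    """Return one argv list per task type for inference_dataset.py."""
    return [
        ["python", "inference_dataset.py", "--task_type", task_type]
        for task_type in task_types
    ]


def run_all(commands, cwd="run_video_llava", dry_run=True):
    """Print each command; execute it inside `cwd` only when dry_run is False."""
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            # check=True stops the loop on the first failing task type.
            subprocess.run(cmd, cwd=cwd, check=True)


# Dry run by default: only prints the four commands.
run_all(build_inference_commands())
```

Calling `run_all(build_inference_commands(), dry_run=False)` would launch the four inference runs sequentially.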

Run Evaluation

After obtaining the MLLM predictions, run the following commands to conduct automatic evaluation. Remember to set your own $OPENAI_API_KEY in utils/eval_utils.py.

  • Multi-Choice QA: python eval_multi_choice.py --video_llm video-llava

  • Yes/No QA: python eval_yes_no.py --video_llm video-llava

  • Caption Matching: python eval_caption_matching.py --video_llm video-llava

  • Caption Generation: python eval_captioning.py --video_llm video-llava
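
When evaluating a model other than Video-LLaVA, only the `--video_llm` value changes. The sketch below assembles the four commands for a given model name. The script names and the flag come from the list above; the dictionary keys reuse the `<task_type>` names from the inference step, and both that mapping and the helper `eval_command` are assumptions made for illustration.

```python
# Evaluation script per task type (script names from the list above;
# the task-type keys are assumed to match those used for inference).
EVAL_SCRIPTS = {
    "multi-choice": "eval_multi_choice.py",
    "yes_no": "eval_yes_no.py",
    "caption_matching": "eval_caption_matching.py",
    "captioning": "eval_captioning.py",
}


def eval_command(task_type, video_llm="video-llava"):
    """Return the argv list that evaluates `video_llm` on one task type."""
    return ["python", EVAL_SCRIPTS[task_type], "--video_llm", video_llm]


# Print the commands so they can be reviewed or pasted into a shell.
for task_type in EVAL_SCRIPTS:
    print(" ".join(eval_command(task_type)))
```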

The results of each data point will be saved to auto_eval_results/video-llava/<task_type>.json and the overall results on each temporal aspect will be printed out as follows:

{'action': 70.4, 'direction': 32.2, 'speed': 38.2, 'order': 41.4, 'attribute_change': 39.9, 'avg': 44.7}
{'fine-grained action': 54.9, 'coarse-grained action': 83.2, 'object motion': 31.7, 'camera motion': 33.7, 'absolute speed': 46.0, 'relative speed': 33.2, 'order': 41.4, 'color & light change': 39.7, 'size & shape change': 40.2, 'combined change': 35.0, 'other change': 55.6}
Match Success Rate=37.9
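
The two dictionary lines are printed in Python literal syntax, so they can be parsed back for further analysis. As a small sketch (the helper `rank_aspects` is hypothetical, not part of the repository), the following ranks the temporal aspects from the first line above, leaving out the `avg` entry:

```python
import ast

# First printed line from the example output above.
printed = (
    "{'action': 70.4, 'direction': 32.2, 'speed': 38.2, "
    "'order': 41.4, 'attribute_change': 39.9, 'avg': 44.7}"
)


def rank_aspects(line):
    """Parse a printed result dict and sort aspects by score, lowest first."""
    scores = ast.literal_eval(line)  # safe parsing of a Python literal
    scores.pop("avg", None)          # keep only the per-aspect scores
    return sorted(scores.items(), key=lambda item: item[1])


ranking = rank_aspects(printed)
print("weakest:", ranking[0])     # weakest: ('direction', 32.2)
print("strongest:", ranking[-1])  # strongest: ('action', 70.4)
```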

Data Statistics

[Figure: Distribution of Videos]

[Figure: Distribution of Task Instructions]

📊 Evaluation Results

The following figures present the results of Video-LLaVA, VideoChat2, SPHINX-v2 and the random baseline. Results for more Video LLMs and Image LLMs can be found in our paper.

[Figures: Multi-Choice | Yes/No | Caption Matching | Caption Generation]

TODOs

  • Upload scripts to collect and process videos.
  • Upload the code for automatic evaluation.
  • Upload the code for task instruction generation.

Citation

@article{liu2024tempcompass,
  title   = {TempCompass: Do Video LLMs Really Understand Videos?},
  author  = {Yuanxin Liu and Shicheng Li and Yi Liu and Yuxiang Wang and Shuhuai Ren and Lei Li and Sishuo Chen and Xu Sun and Lu Hou},
  year    = {2024},
  journal = {arXiv preprint arXiv:2403.00476}
}
