llyx97 / tempcompass

[ACL 2024 Findings] "TempCompass: Do Video LLMs Really Understand Videos?", Yuanxin Liu, Shicheng Li, Yi Liu, Yuxiang Wang, Shuhuai Ren, Lei Li, Sishuo Chen, Xu Sun, Lu Hou

Topics: evaluation, temporal-perception, video-llms

tempcompass's Introduction

Yuanxin Liu¹  Shicheng Li¹  Yi Liu¹  Yuxiang Wang¹  Shuhuai Ren¹
Lei Li²  Sishuo Chen¹  Xu Sun¹  Lu Hou³
¹Peking University  ²The University of Hong Kong  ³Huawei Noah's Ark Lab

📢 News

[2024-08-08] Results of LLaVA-Next-Video, VILA-1.5 and LongVA are added to the leaderboard.

[2024-07] 🎉🎉🎉 TempCompass is integrated into LMMs-Eval. See here for usage examples.

[2024-06-11] Result of Reka-core is added to the leaderboard.

[2024-05-25] TempCompass Leaderboard is available on HuggingFace Space 🤗.

[2024-05-16] 🎊🎊🎊 TempCompass is accepted at ACL 2024 Findings!

[2024-04-14] Evaluation result of Gemini-1.5-pro, the current SOTA Video LLM, is added.

[2024-03-23] The answer prompt is improved to better guide Video LLMs to follow the desired answer formats. The evaluation code now provides an option to disable the use of ChatGPT.

[2024-03-12] 🔥🔥🔥 The evaluation code is released now! Feel free to evaluate your own Video LLMs.

✨ Highlights

Diverse Temporal Aspects and Task Formats

  • TempCompass encompasses a diverse set of temporal aspects (left) and task formats (right) to comprehensively evaluate the temporal perception capability of Video LLMs.

Conflicting Videos

  • We construct conflicting videos to prevent the models from taking advantage of single-frame bias and language priors.

  • 🤔 Can your Video LLM correctly answer the following question for both videos?

    [Raw Video]  [Conflicting Video]

    What is happening in the video?
    A. A person drops down the pineapple
    B. A person pushes forward the pineapple
    C. A person rotates the pineapple
    D. A person picks up the pineapple

🚀 Quick Start

To begin with, clone this repository and install the required packages:

git clone https://github.com/llyx97/TempCompass.git
cd TempCompass
pip install -r requirements.txt

Data Preparation

1. Task Instructions

The task instructions can be found in questions/.

Task Instruction Generation Procedure
  1. Generate Multi-Choice QA instructions (question_gen.py).

  2. Manually validate quality and rectify.

  3. Generate task instructions for Yes/No QA (question_gen_yes_no.py), Caption Matching (question_gen_caption_match.py) and Caption Generation (question_gen_captioning.py), based on the manually rectified Multi-Choice QA instructions (see the sketch after this list).

  4. Manually validate quality and rectify.
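
For intuition, here is a minimal, purely illustrative sketch of how Yes/No items could be derived from a rectified Multi-Choice item. The function and prompt wording are hypothetical and not taken from question_gen_yes_no.py, which may work differently (e.g., by prompting an LLM).

# Purely illustrative sketch (not question_gen_yes_no.py): turn one rectified
# Multi-Choice item into Yes/No items, one per option, where only the correct
# option yields the ground-truth answer "yes".
def multi_choice_to_yes_no(question, options, correct):
    yes_no_items = []
    for letter, text in options.items():
        yes_no_items.append({
            "question": f'{question} Is this what happens: "{text}"? Please answer yes or no.',
            "answer": "yes" if letter == correct else "no",
        })
    return yes_no_items

items = multi_choice_to_yes_no(
    "What is happening in the video?",
    {"A": "A person drops down the pineapple",
     "B": "A person pushes forward the pineapple",
     "C": "A person rotates the pineapple",
     "D": "A person picks up the pineapple"},
    correct="D",
)
for item in items:
    print(item["question"], "->", item["answer"])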

2. Videos

All the processed videos can be downloaded from Google Drive or Hugging Face.

As an alternative, you can also download the raw videos and process them yourself:

Run the following commands. The videos will be saved to videos/.

cd utils
python download_video.py    # Download raw videos
python process_videos.py    # Construct conflicting videos

Note: If you encounter a MoviePy error when running the processing script, please refer to this issue.
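
As a rough, purely illustrative sketch of what constructing a "conflicting" video can involve, the snippet below reverses a clip with MoviePy. The file names are placeholders, and the actual transformations applied by process_videos.py may differ.

# Illustrative sketch only (not process_videos.py): build one kind of
# "conflicting" video by playing a raw clip backwards with MoviePy 1.x.
from moviepy.editor import VideoFileClip
import moviepy.video.fx.all as vfx

clip = VideoFileClip("videos/example_raw.mp4")        # placeholder raw video
reversed_clip = clip.fx(vfx.time_mirror)              # temporally reversed version
reversed_clip.write_videofile("videos/example_conflicting.mp4", audio=False)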

Run Inference

We use Video-LLaVA and Gemini as examples to illustrate how to conduct MLLM inference on our benchmark.

1. Video-LLaVA

Enter run_video_llava and install the environment as instructed.

Then run the following commands. The prediction results will be saved to predictions/video-llava/<task_type>.

# select <task_type> from multi-choice, yes_no, caption_matching, captioning
python inference_dataset.py --task_type <task_type>

2. Gemini

The inference script for gemini-1.5-pro is run_gemini.ipynb. It is recommended to run the script in Google Colab.

Run Evaluation

After obtaining the MLLM predictions, run the following commands to conduct automatic evaluation. Remember to set your own $OPENAI_API_KEY in utils/eval_utils.py.

  • Multi-Choice QA: python eval_multi_choice.py --video_llm video-llava

  • Yes/No QA: python eval_yes_no.py --video_llm video-llava

  • Caption Matching: python eval_caption_matching.py --video_llm video-llava

  • Caption Generation: python eval_captioning.py --video_llm video-llava

Tip 👉: Except for Caption Generation, you can set --disable_llm when running the scripts, which disables ChatGPT-based evaluation (i.e., the evaluation relies entirely on rule-based matching). This is useful when you do not want to use the ChatGPT API and your MLLM reliably follows the instruction to generate answers in the required format.
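
To make the rule-based path concrete, here is a minimal, hypothetical sketch of option matching for Multi-Choice QA. It is not the code in utils/eval_utils.py; it only illustrates the idea of parsing an option letter directly from the model reply and deferring to ChatGPT-based evaluation only when parsing fails.

import re

# Hypothetical illustration (not the repository's implementation): try to read
# an option letter straight from the model's reply; return None when rule-based
# matching fails and ChatGPT-based evaluation would be needed.
def match_option(prediction, valid_options=("A", "B", "C", "D")):
    m = re.search(r"\b([A-D])\b", prediction.strip())
    if m and m.group(1) in valid_options:
        return m.group(1)
    return None

print(match_option("Best option: (D)"))                 # -> D
print(match_option("The person picks up the pineapple."))  # -> None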

The results of each data point will be saved to auto_eval_results/video-llava/<task_type>.json and the overall results on each temporal aspect will be printed out as follows:

{'action': 76.0, 'direction': 35.2, 'speed': 35.6, 'order': 37.7, 'attribute_change': 41.0, 'avg': 45.6}
{'fine-grained action': 58.8, 'coarse-grained action': 90.3, 'object motion': 36.2, 'camera motion': 32.6, 'absolute speed': 47.6, 'relative speed': 28.0, 'order': 37.7, 'color & light change': 43.6, 'size & shape change': 39.4, 'combined change': 41.7, 'other change': 38.9}
Match Success Rate=100.0

LMMs-Eval Evaluation

Here we provide an example of how to evaluate LLaVA-Next-Video on TempCompass, using lmms-eval.

1. Clone the repo from LLaVA-Next and set up the environment

git clone https://github.com/LLaVA-VL/LLaVA-NeXT
cd LLaVA-NeXT
pip install -e .

2. Run inference and evaluation in a single command

accelerate launch --num_processes 8 --main_process_port 12345 -m lmms_eval \
    --model llavavid \
    --model_args pretrained=lmms-lab/LLaVA-NeXT-Video-32B-Qwen,conv_template=qwen_1_5,video_decode_backend=decord,max_frames_num=32,mm_spatial_pool_mode=average,mm_newline_position=grid,mm_resampler_location=after \
    --tasks tempcompass \
    --batch_size 1 \
    --log_samples \
    --log_samples_suffix llava_vid_32B \
    --output_path ./logs/

You can also evaluate the performance on each task (e.g., multi-choice) separately:

accelerate launch --num_processes 8 --main_process_port 12345 -m lmms_eval \
    --model llavavid \
    --model_args pretrained=lmms-lab/LLaVA-NeXT-Video-32B-Qwen,conv_template=qwen_1_5,video_decode_backend=decord,max_frames_num=32,mm_spatial_pool_mode=average,mm_newline_position=grid,mm_resampler_location=after \
    --tasks tempcompass_multi_choice \
    --batch_size 1 \
    --log_samples \
    --log_samples_suffix llava_vid_32B \
    --output_path ./logs/

3. Submit results to TempCompass LeaderBoard

Place the lmms-eval outputs (tempcompass_multi_choice.json, tempcompass_yes_no.json, tempcompass_caption_matching.json and tempcompass_captioning.json) into the same folder and run this script:

python merge_eval_result.py

Then submit the output file merged_result.json to the leaderboard.

Note: Currently, the evaluation results calculated by lmms-eval on specific temporal aspects might be incorrect (the average accuracy on each task is correct). To obtain the correct results, you can use this script: acc_lmms_eval.py or submit the result to our leaderboard.

📈 Data Statistics

📊 Evaluation Results

The following figures present results of Video-LLaVA, VideoChat2, SPHINX-v2, Gemini-1.5-pro and the random baseline. Results of more Video LLMs and Image LLMs can be found in our paper and the leaderboard.

[Result figures: Multi-Choice | Yes/No | Caption Matching | Caption Generation]

Answer Prompt

We update the answer prompt for Multi-Choice QA and Caption Matching from "Best Option:" to "Please directly give the best option:", which better encourages MLLMs to directly select an option. This reduces the reliance on the ChatGPT API when an MLLM follows the instruction well.
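
For illustration only (the exact templates in questions/ may differ), the change amounts to appending a more explicit answer prompt to each instruction:

# Illustrative only: how the old vs. new answer prompt might be appended to a
# Multi-Choice QA instruction (the exact templates in questions/ may differ).
question = (
    "What is happening in the video?\n"
    "A. A person drops down the pineapple\n"
    "B. A person pushes forward the pineapple\n"
    "C. A person rotates the pineapple\n"
    "D. A person picks up the pineapple\n"
)
old_answer_prompt = "Best Option:"
new_answer_prompt = "Please directly give the best option:"
print(question + new_answer_prompt)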

The success rate of rule-based matching is as follows:

Multi-Choice QA

            V-LLaVA  SPHINX-v2  LLaMA-VID  Qwen-VL-Chat  PandaGPT  Valley
old prompt  37.9     99.6       62.9       46.8          6.4       3.5
new prompt  100      100        97.0       98.5          3.9       0.4

Caption Matching

            V-LLaVA  SPHINX-v2  LLaMA-VID  Qwen-VL-Chat  PandaGPT  Valley
old prompt  76.6     89.3       44.5       91.6          30.7      11.2
new prompt  99.5     99.5       68.3       96.0          22.5      3.7

TODOs

  • Upload scripts to collect and process videos.
  • Upload the code for automatic evaluation.
  • Upload the code for task instruction generation.

Citation

@article{liu2024tempcompass,
  title   = {TempCompass: Do Video LLMs Really Understand Videos?},
  author  = {Yuanxin Liu and Shicheng Li and Yi Liu and Yuxiang Wang and Shuhuai Ren and Lei Li and Sishuo Chen and Xu Sun and Lu Hou},
  year    = {2024},
  journal = {arXiv preprint arXiv:2403.00476}
}

tempcompass's People

Contributors

henryhzy, llyx97


tempcompass's Issues

Possible errors when preprocessing the videos

Hi @llyx97, thanks for your great project.

I encounter an error when processing these videos:

OSError: MoviePy error: failed to read the first frame of video file ../videos/1100319395.mp4. That might mean that the
file is corrupted. That may also mean that you are using a deprecated version of FFMPEG. On Ubuntu/Debian for instance
the version in the repos is deprecated. Please update to a recent version from the website.
OSError: MoviePy error: failed to read the first frame of video file ../videos/1056433484.mp4. That might mean that the
file is corrupted. That may also mean that you are using a deprecated version of FFMPEG. On Ubuntu/Debian for instance
the version in the repos is deprecated. Please update to a recent version from the website.

How to fix it:

As mentioned in Zulko/moviepy#1078, comment out the following lines in /path/to/site-packages/moviepy/video/io/ffmpeg_reader.py:

def close(self):
    if self.proc:
        self.proc.terminate()
        self.proc.stdout.close()
        self.proc.stderr.close()
        self.proc.wait()
        self.proc = None
    # if hasattr(self, 'lastread'):
    #     del self.lastread

Actually, I am not sure whether this will affect the results. It would be better if you could provide a download link for the fully processed video dataset as an alternative. :)

404M ./videos
275M ./videos_before_process

Inference returns the error: AttributeError: 'LlavaConfig' object has no attribute 'X'

Hi. Thanks for sharing. I am trying to install the project and run inference. When I run

python inference_dataset.py --task_type yes_no

I encounter the following error:

Traceback (most recent call last):
  File "/home/nelson/TempCompass/run_video_llava/inference_dataset.py", line 84, in <module>
    tokenizer, model, processor, context_len = load_pretrained_model(model_path, None, model_name, load_8bit, load_4bit, device=device)
  File "/home/nelson/TempCompass/run_video_llava/llava/model/builder.py", line 142, in load_pretrained_model
    X = model.config.X
  File "/home/nelson/TempCompass/tempcompassvenv/lib/python3.10/site-packages/transformers/configuration_utils.py", line 261, in __getattribute__
    return super().__getattribute__(key)
AttributeError: 'LlavaConfig' object has no attribute 'X'

It looks like I need to install the proper transformers version. Could you share which transformers version you used for this project?

Best,

Looking forward to the task generation code, and asking for some intuitions.

Hi!! I happened to come across your paper and was very impressed by it. I want to express my gratitude for the work you've been doing. I am eagerly looking forward to the instruction data generation code that is mentioned. Could you please share when you plan to release this code? It would be really helpful for my project if it were released soon.

I would also like to ask the authors about an intuition related to your work.

"Do you think it could solve the problme that your paper tackles(vlm is weak at temporal reasoning) to better capture temporal relationships if fine-tuning like instruction tuning are performed using your dataset(or much more refined or polished temporal benchmark in the future) in terms of making well-known off-the-shelf video models(like the videochat series that try to capture temporal information) to understand temporal information ,? Or do you believe that there needs to be a further a huge leap in the architecture of off-the-shelf video models beyond the level of the training dataset?

Thanks in advance for your reply.
