
Understanding Long Videos in One Multimodal Language Model Pass

Kanchana Ranasinghe, Xiang Li, Kumara Kahatapitiya, Michael Ryoo

Paper Link | Project Page

Overview

Abstract: Large Language Models (LLMs), known to contain a strong awareness of world knowledge, have allowed recent approaches to achieve excellent performance on Long-Video Understanding benchmarks, but at high inference costs. In this work, we first propose Likelihood Selection, a simple technique that unlocks faster inference in autoregressive LLMs for multiple-choice tasks common in long-video benchmarks. In addition to faster inference, we find that the resulting models yield surprisingly good accuracy on long-video tasks, even with no video-specific information. Building on this, we inject video-specific object-centric information extracted from off-the-shelf pre-trained models and utilize natural language as a medium for information fusion. Our resulting Multimodal Video Understanding (MVU) framework demonstrates state-of-the-art performance across long-video and fine-grained action recognition benchmarks.

We propose three variants of our framework. (left-top) Just LLM uses only world knowledge with zero task-specific awareness. (left-bottom) Single Frame VLM processes an additional center frame to obtain task context but accesses no video-specific information. (right) Our complete approach, MVU, extracts three additional object-centric modalities followed by fusion in language space. LS refers to likelihood selection.

Quickstart

We provide two notebooks to explore our two modality-constrained variants. These models require only an LLM / VLM to operate and can be set up easily. Only the Python dependencies listed at the top of each notebook need to be installed, in a Python 3.8 environment. Use a separate environment for each notebook (some LLMs require the latest Hugging Face Transformers version, which is incompatible with the LLaVA version used for the VLM). All data is downloaded automatically (less than 5 MB for LLM only, approximately 100 MB for SF-VLM).

The following results on the EgoSchema dataset (500-video subset) can be replicated using our notebooks.

Method     Backbone              Acc (%)   Time (s)
LLM Only   Llama-2-7b-Chat       17.4      0.72
LLM Only   Gemma-7b-IT           45.8      1.84
LLM Only   Mistral-7B-Instruct   45.8      0.41
SF-VLM     LLaVA-v1.5-13B        55.8      1.70

Our full MVU framework requires the EgoSchema videos for inference and involves multiple pretrained models. Refer to the MVU Framework section below for usage.

Likelihood Selection

Our proposed Likelihood Selection (LS) strategy for long-video understanding tasks is a standalone function that can be incorporated into other LLM-based frameworks. A working example of LS is presented in each of the notebooks from the section above.

Given access to network logits, LS is straightforward to implement. We refer the reader to our calc_loglikelihood method in notebooks/utils.py for the PyTorch implementation. Note that when applied to a different task, this selection setup may be sensitive to the nature of the prompt and could require some handcrafting of the textual prompts used to query the model (as is common for most LLM-based setups).
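
For illustration, the core idea is to score each candidate answer by the summed log-probability of its tokens conditioned on the question prompt, then pick the highest-scoring option. Below is a minimal sketch, assuming a Hugging Face causal LM and that the prompt tokenization is a prefix of the prompt-plus-option tokenization; it is not the repository implementation (see calc_loglikelihood in notebooks/utils.py for that).

# Minimal Likelihood Selection sketch (illustrative; not the repo implementation).
import torch
from typing import List
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "mistralai/Mistral-7B-Instruct-v0.2"  # any HF causal LM works here
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.float16, device_map="auto"
)

@torch.no_grad()
def option_loglikelihood(prompt: str, option: str) -> float:
    # Summed log-probability of the option tokens, conditioned on the prompt.
    # Note: the option string may need a leading space for clean tokenization.
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + option, return_tensors="pt").input_ids.to(model.device)
    logits = model(full_ids).logits               # (1, seq_len, vocab)
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)  # position t predicts token t+1
    token_ll = log_probs.gather(-1, full_ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    return token_ll[:, prompt_len - 1:].sum().item()       # keep only option positions

def likelihood_select(prompt: str, options: List[str]) -> int:
    # One forward pass per option; returns the index of the most likely answer.
    scores = [option_loglikelihood(prompt, opt) for opt in options]
    return scores.index(max(scores))

Because each option is scored in a single forward pass rather than decoded autoregressively token by token, selection cost does not grow with the number of decoding steps, which is where the inference speedup comes from.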

Installation

  1. Clone our repository
    git clone https://github.com/kahnchana/mvu.git
    
  2. Create conda environment
    conda create -n mvu python=3.8
    conda activate mvu
    
  3. Install Python dependencies
    pip install -r requirements.txt
    

Dataset Preparation

Our main evaluations utilize three datasets: EgoSchema, NextQA, and Open X-Embodiment. We refer readers to their respective websites for dataset setup.

  1. EgoSchema
  2. NextQA
  3. Open X-Embodiment

We follow the default instructions on their websites to download these datasets. The dataset splits used for each evaluation are described in our paper.

MVU Framework

We now detail our full MVU framework, which builds on the Single Frame VLM variant.

Frame Selection

python src/model_frame_selection.py

Object Centric Modalities

We provide the pre-extracted data for each modality along with the templates used for language-based fusion. These are downloaded automatically by the following scripts.
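
As a rough illustration of language-based fusion, object-centric cues are rendered as plain text and folded into the question prompt. The template below is hypothetical (the actual templates ship with the pre-extracted data):

# Hypothetical fusion template (illustrative only; the real templates
# are distributed with the pre-extracted modality data).
from typing import List

def fuse_modalities(question: str, objects: List[str], locations: List[str]) -> str:
    obj_text = ", ".join(objects)      # e.g. "knife, cutting board, onion"
    loc_text = "; ".join(locations)    # e.g. "knife near center; onion at top-left"
    return (
        f"The video contains these objects: {obj_text}. "
        f"Their spatial locations are: {loc_text}. "
        f"Question: {question}"
    )

The fused prompt can then be scored with Likelihood Selection, as in the modality-constrained variants.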

Long Video QnA

Set the dataset name (EgoSchema or NextQA) and the data root (the directory where the dataset was downloaded); a concrete example follows the command below.

python src/model_video_infer.py --dataset $DATASET --data-root $DATA_ROOT
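
For example (the identifier string and path here are assumptions for illustration; check src/model_video_infer.py for the exact accepted values):

python src/model_video_infer.py --dataset egoschema --data-root /path/to/egoschema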

References

Our work builds on the LLaVA codebase and utilizes multiple pretrained models from HuggingFace (HF). From HF, we use three different LLMs: Mistral-7B, LLaMA-2-7B, and Gemma-7B. We also use OWL-ViT for object detection. We thank the authors and maintainers of the above codebases for their valuable open-source contributions.

Citation

If you find our work or code useful, please consider citing our paper and leaving a star on our repo.

@misc{rana2024mvu,
      title={Understanding Long Videos in One Multimodal Language Model Pass}, 
      author={Kanchana Ranasinghe and Xiang Li and Kumara Kahatapitiya and Michael Ryoo},
      year={2024},
}
