This project forked from showlab/videollm-online


VideoLLM-online: Online Video Large Language Model for Streaming Video (CVPR 2024)

License: Apache License 2.0




Homepage Demo Paper Checkpoint Data

TLDR

The first streaming video LLM. It runs at high speed (5 ~ 10 FPS on an NVIDIA 3090 GPU, 10 ~ 15 FPS on an A100 GPU) on long-form videos (10 minutes) and achieves SOTA performance in both online and offline settings.


Introduction

This is the official implementation of VideoLLM-online: Online Video Large Language Model for Streaming Video, CVPR 2024. Our paper introduces several features that set it apart from popular image/video/multimodal models:

  • Online Video Streaming: Unlike previous models that operate in offline mode (querying/responding to a full video), our model supports online interaction within a video stream. It can proactively update responses during a stream, such as recording activity changes or helping with the next steps in real time. Even GPT-4o, which is audio-driven, relies on user voice interaction with the visual scene rather than actual video streaming.

  • Cheap and Scalable Streaming Data Synthesis: Current video datasets for training multimodal LLMs are mostly offline and unsuitable for training an online video language model. Our method transforms any offline annotation into streaming dialogue data by prompting an open-source LLM. The model is trained entirely on Llama-synthesized data.

  • Parallelized Real-Time Inference: Our inference method parallelizes video encoding, LLM forwarding for video frames, and LLM response generation, arranging them asynchronously. This significantly enhances real-time performance, achieving 10-15 FPS on an A100 GPU.
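
The asynchronous arrangement described in the last bullet can be pictured as a three-stage pipeline connected by queues. The following is only a minimal sketch of that general pattern, not the repository's implementation: `encode_frame`, `forward_frame`, and `generate_response` are hypothetical stand-ins for the vision encoder, the per-frame LLM forward pass, and response generation, and the rule for when to respond is invented for illustration.

```python
import queue
import threading

_STOP = object()  # sentinel marking the end of the stream


def encode_frame(frame):
    # Hypothetical stand-in for the vision encoder.
    return f"emb({frame})"


def forward_frame(embedding):
    # Hypothetical stand-in for the LLM forward pass over one frame.
    # Returns True when the model decides it should speak at this frame.
    return embedding.endswith("3)")


def generate_response(embedding):
    # Hypothetical stand-in for autoregressive response generation.
    return f"response at {embedding}"


def stage(worker, inbox, outbox):
    """Apply `worker` to every item from `inbox` until the sentinel arrives."""
    while True:
        item = inbox.get()
        if item is _STOP:
            if outbox is not None:
                outbox.put(_STOP)  # propagate shutdown to the next stage
            break
        worker(item, outbox)


def run_pipeline(frames):
    frame_q, emb_q, gen_q = queue.Queue(), queue.Queue(), queue.Queue()
    responses = []

    def encode(frame, out):
        out.put(encode_frame(frame))

    def forward(emb, out):
        if forward_frame(emb):  # only some frames trigger a response
            out.put(emb)

    def generate(emb, out):
        responses.append(generate_response(emb))

    threads = [
        threading.Thread(target=stage, args=(encode, frame_q, emb_q)),
        threading.Thread(target=stage, args=(forward, emb_q, gen_q)),
        threading.Thread(target=stage, args=(generate, gen_q, None)),
    ]
    for t in threads:
        t.start()
    for frame in frames:
        frame_q.put(frame)
    frame_q.put(_STOP)
    for t in threads:
        t.join()
    return responses


print(run_pipeline([0, 1, 2, 3, 4, 5]))
```

Because each stage only blocks on its own queue, encoding of later frames can proceed while an earlier response is still being generated. Plain Python threads show the structure only; real throughput depends on how the GPU work is scheduled.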

Quick Start

  • (Recommended) Launch the gradio demo locally with:
python -m demo.app --resume_from_checkpoint chenjoya/videollm-online-8b-v1plus
  • (Recommended) Launch the CLI locally with:
python -m demo.cli --resume_from_checkpoint chenjoya/videollm-online-8b-v1plus
  • (Deprecated: the HF Spaces deployment is too slow) Try the online Demo.

By passing --resume_from_checkpoint chenjoya/videollm-online-8b-v1plus, the PEFT checkpoint will be automatically downloaded and applied to meta-llama/Meta-Llama-3-8B-Instruct.

Installation

Ensure you have Miniconda and Python version >= 3.10 installed, then run:

conda install -y pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia
pip install transformers accelerate deepspeed peft editdistance Levenshtein tensorboard gradio moviepy submitit
pip install flash-attn --no-build-isolation

Installing PyTorch as above also installs ffmpeg, but it is an old version that usually produces very low-quality preprocessing. Please install the latest ffmpeg as follows:

wget https://johnvansickle.com/ffmpeg/releases/ffmpeg-release-amd64-static.tar.xz
tar xvf ffmpeg-release-amd64-static.tar.xz
rm ffmpeg-release-amd64-static.tar.xz
mv ffmpeg-*-amd64-static ffmpeg
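
To confirm which ffmpeg build will actually be used, a small check such as the one below may help. It is a sketch that assumes the static build was unpacked into `./ffmpeg` as in the commands above, so the binary sits at `./ffmpeg/ffmpeg`; `ffmpeg_version` is a helper name introduced here, not part of the repository.

```python
import shutil
import subprocess


def ffmpeg_version(binary="ffmpeg"):
    """Return the first line of `<binary> -version`, or None if unavailable."""
    path = shutil.which(binary)
    if path is None:
        return None
    try:
        result = subprocess.run([path, "-version"], capture_output=True, text=True)
    except OSError:
        return None
    if result.returncode != 0 or not result.stdout:
        return None
    return result.stdout.splitlines()[0]


# Compare the binary on PATH with the static build unpacked into ./ffmpeg.
for candidate in ("ffmpeg", "./ffmpeg/ffmpeg"):
    print(candidate, "->", ffmpeg_version(candidate))
```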

If you want to try our model with audio in real-time streaming, please also clone ChatTTS:

pip install omegaconf vocos vector_quantize_pytorch cython
git clone https://github.com/2noise/ChatTTS
mv ChatTTS demo/rendering/

Training and Evaluation

  • Download the streaming dialogue data from the Data link.

  • Preprocess the video frames in a distributed manner: sample at 2 FPS and 384 resolution, then use google/siglip-large-patch16-384 to extract the CLS token together with 3x3 average-pooled spatial tokens. Please refer to the instructions under data/preprocess/.

  • Refer to the examples under scripts/
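
The "CLS token plus 3x3 average-pooled spatial tokens" layout mentioned above can be illustrated with a dependency-free sketch. It assumes a 24x24 patch grid (384 / 16 patches per side) and uses tiny 2-dimensional features so the result is easy to check by hand; the function names are hypothetical, and the actual preprocessing under data/preprocess/ operates on encoder tensors rather than Python lists.

```python
def average_pool_grid(grid, out_size):
    """Average-pool a square grid of feature vectors down to out_size x out_size.

    `grid[y][x]` is a list of floats; the result is a flat, row-major list of
    out_size * out_size pooled vectors.
    """
    side = len(grid)
    if side % out_size != 0:
        raise ValueError("grid side must be divisible by out_size")
    cell = side // out_size
    dim = len(grid[0][0])
    pooled = []
    for oy in range(out_size):
        for ox in range(out_size):
            acc = [0.0] * dim
            for y in range(oy * cell, (oy + 1) * cell):
                for x in range(ox * cell, (ox + 1) * cell):
                    for d in range(dim):
                        acc[d] += grid[y][x][d]
            pooled.append([v / (cell * cell) for v in acc])
    return pooled


def frame_tokens(cls_token, patch_grid, out_size=3):
    """One CLS token followed by out_size * out_size pooled spatial tokens."""
    return [cls_token] + average_pool_grid(patch_grid, out_size)


# Toy input: each patch feature is simply its (row, column) position.
grid = [[[float(y), float(x)] for x in range(24)] for y in range(24)]
tokens = frame_tokens([0.0, 0.0], grid)
print(len(tokens))  # 1 CLS token + 9 spatial tokens
print(tokens[1])    # mean position of the top-left 8x8 block
```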

Model Zoo

  • LLM: meta-llama/Meta-Llama-3-8B-Instruct
  • Vision Strategy:
    • Frame Encoder: google/siglip-large-patch16-384
    • Frame Tokens: CLS token + 3x3 average pooled spatial tokens
    • Frame FPS: 2 for training, 2~10 for inference
    • Frame Resolution: max resolution 384, with zero-padding to keep aspect ratio
    • Video Length: 10 minutes
  • Training Data: Ego4D Narration Stream 113K + Ego4D GoalStep Stream 21K
  • LLM: meta-llama/Meta-Llama-3-8B-Instruct
  • Vision Strategy:
    • Frame Encoder: google/siglip-large-patch16-384
    • Frame Tokens: CLS token
    • Frame FPS: 2 for training, 2~10 for inference
    • Frame Resolution: max resolution 384, with zero-padding to keep aspect ratio
    • Video Length: 60 minutes
  • Training Data: Ego4D Narration Stream 113K + Ego4D GoalStep Stream 21K
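
As a rough back-of-the-envelope check on the two configurations above, the number of visual tokens for a full-length video can be computed from the listed FPS, frame-token layout, and video length. This sketch assumes the training rate of 2 FPS and counts visual tokens only, ignoring text tokens and any separator tokens; `visual_tokens` is a name introduced here for illustration.

```python
def visual_tokens(minutes, fps, tokens_per_frame):
    """Visual tokens needed to cover a video of the given length."""
    frames = int(minutes * 60 * fps)
    return frames * tokens_per_frame


# First configuration: CLS + 3x3 pooled tokens per frame, 10-minute videos.
first = visual_tokens(minutes=10, fps=2, tokens_per_frame=1 + 3 * 3)
# Second configuration: CLS token only, 60-minute videos.
second = visual_tokens(minutes=60, fps=2, tokens_per_frame=1)

print(first, second)
```

Under these assumptions the first configuration needs 12000 visual tokens and the second 7200, which suggests how dropping the spatial tokens makes room for the longer video length.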

VideoLLM-online beyond Llama

This codebase has a simple and clean implementation. To obtain a Mistral version of VideoLLM-online, you only need to change the inherited class from Llama to Mistral. Please refer to the examples in models/live_llama.
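
The idea of swapping the inherited class can be sketched with plain Python stubs. Everything below is hypothetical and only illustrates the inheritance pattern: `LlamaBackbone` and `MistralBackbone` stand in for the corresponding Hugging Face model classes, and `LiveMixin` stands in for the streaming-specific logic; none of these names are taken from models/live_llama.

```python
class _Backbone:
    """Hypothetical stub for a causal-LM backbone (not a real library class)."""

    name = "base"

    def embed_text(self, tokens):
        return [f"{self.name}:{t}" for t in tokens]


class LlamaBackbone(_Backbone):
    name = "llama"


class MistralBackbone(_Backbone):
    name = "mistral"


class LiveMixin:
    """Hypothetical streaming logic shared by every backbone."""

    def joint_embed(self, tokens, frame_embeddings):
        # Interleaving is simplified to concatenation for illustration.
        return self.embed_text(tokens) + list(frame_embeddings)


class LiveLlama(LiveMixin, LlamaBackbone):
    pass


# Switching backbones only changes the inherited class.
class LiveMistral(LiveMixin, MistralBackbone):
    pass


print(LiveLlama().joint_embed(["hi"], ["frame0"]))
print(LiveMistral().joint_embed(["hi"], ["frame0"]))
```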

Citation

@inproceedings{videollm-online,
  author       = {Joya Chen and Zhaoyang Lv and Shiwei Wu and Kevin Qinghong Lin and Chenan Song and Difei Gao and Jia-Wei Liu and Ziteng Gao and Dongxing Mao and Mike Zheng Shou},
  title        = {VideoLLM-online: Online Video Large Language Model for Streaming Video},
  booktitle    = {CVPR},
  year         = {2024},
}
