Part of the information comes from Awesome-LLMs-for-Video-Understanding.
Name | Paper | # Videos | # Annotations | Avg. Duration | Comments |
---|---|---|---|---|---|
TGIF | TGIF: A New Dataset and Benchmark on Animated GIF Description | 100k | 120k descriptions | 3.1s | Captioning |
TGIF-QA | TGIF-QA: Toward Spatio-Temporal Reasoning in Visual Question Answering | 72k | 165k QA pairs | - | VideoQA |
Ani-GIFs | Ani-GIFs: A Benchmark Dataset for Domain Generalization of Action Recognition from GIFs | 17k | 536 classes | 2.1s | Action recognition on animated GIFs |
Vid2GIF | Video2GIF: Automatic Generation of Animated GIFs from Video | 100k | - | 5.8s | GIF generation from video |
GIFGIF | Predicting Viewer Perceived Emotions in Animated GIFs | 3.8k | 17 emotions | <=303 frames (15s) | Emotion recognition |
GIFGIF+ | GIFGIF+: Collecting Emotional Animated GIFs with Clustered Multi-Task Learning | 23k | 17 emotions | - | Emotion recognition |
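
The GIF corpora above are usually consumed as frame sequences. A minimal sketch of decoding an animated GIF into RGB frames with Pillow; the file path is a placeholder, not a file from any dataset above:

```python
from PIL import Image, ImageSequence

def gif_to_frames(path):
    """Decode an animated GIF into a list of RGB PIL images, one per frame."""
    with Image.open(path) as gif:
        return [frame.convert("RGB") for frame in ImageSequence.Iterator(gif)]

frames = gif_to_frames("example.gif")  # placeholder path
print(f"{len(frames)} frames of size {frames[0].size}")
```
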
Name | Paper | # Videos | # Annotations | Avg. Duration | Comments |
---|---|---|---|---|---|
MSR-VTT | MSR-VTT: A Large Video Description Dataset for Bridging Video and Language | 10k (200k clips) | 200k descriptions | 14s | Captioning |
MSVD | Collecting Highly Parallel Data for Paraphrase Evaluation | 2k | 85k descriptions | 4-10s | Captioning |
ActivityNet | ActivityNet: A Large-Scale Video Benchmark for Human Activity Understanding | 27k (849h) | 203 classes | 109s | Activity recognition |
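
For the video datasets, a common preprocessing step is uniform frame sampling before feeding clips to a captioning or recognition model. A minimal sketch with OpenCV; the filename is a placeholder:

```python
import cv2
import numpy as np

def sample_frames(path, num_frames=8):
    """Uniformly sample `num_frames` RGB frames from a video clip."""
    cap = cv2.VideoCapture(path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = np.linspace(0, max(total - 1, 0), num_frames).astype(int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if ok:
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    cap.release()
    return np.stack(frames)  # shape: (num_frames, H, W, 3)

clip = sample_frames("video0001.mp4")  # placeholder filename
```
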
Name | Paper | Code | Comments |
---|---|---|---|
LanguageBind | LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment | Link | - |
OpenCLIP | - | - | Open-source implementation of CLIP |
BLIP-2 | BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | Link | ViT + Q-Former + OPT/FlanT5 |
mPLUG-2 | mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video | Link | WebVid-2M |
mPLUG-Owl2 | mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration | Link | - |
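
Of the models above, BLIP-2 (ViT + Q-Former + OPT/FlanT5) ships with a Hugging Face transformers integration. A minimal captioning sketch using the published Salesforce/blip2-opt-2.7b checkpoint; the image path is a placeholder and a CUDA device is assumed:

```python
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

# "Salesforce/blip2-opt-2.7b" is one published BLIP-2 checkpoint (ViT + Q-Former + OPT).
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", torch_dtype=torch.float16
).to("cuda")

image = Image.open("example.jpg")  # placeholder path
inputs = processor(images=image, return_tensors="pt").to("cuda", torch.float16)
out = model.generate(**inputs, max_new_tokens=30)
print(processor.decode(out[0], skip_special_tokens=True))
```
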
Name | Paper | Code | Video Datasets | Comments |
---|---|---|---|---|
Video-LLaMA | Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding | Link | Pretraining: WebVid + CC3M; Finetuning: VideoChat + LLaVA + MiniGPT-4 | BLIP-2 + Vicuna/LLaMA |
Video-LLaVA | Video-LLaVA: Learning United Visual Representation by Alignment Before Projection | Link | Pretraining: WebVid + CC3M; Finetuning: Video-ChatGPT + LLaVA | LanguageBind + Vicuna |
StarVector | StarVector: Generating Scalable Vector Graphics Code from Images | Link | SVG-Fonts + SVG-Icons + SVG-Emoji + SVG-Stack | CLIP + Adapter + StarCoder |
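
Video-LLaVA (LanguageBind encoder + Vicuna) from the table above has a converted checkpoint usable through recent transformers releases. A minimal sketch, assuming a transformers version that includes the Video-LLaVA classes; the random clip stands in for frames sampled as in the OpenCV helper earlier:

```python
import numpy as np
from transformers import VideoLlavaProcessor, VideoLlavaForConditionalGeneration

# "LanguageBind/Video-LLaVA-7B-hf" is the converted checkpoint on the Hub;
# requires a transformers release that ships the Video-LLaVA classes.
model = VideoLlavaForConditionalGeneration.from_pretrained("LanguageBind/Video-LLaVA-7B-hf")
processor = VideoLlavaProcessor.from_pretrained("LanguageBind/Video-LLaVA-7B-hf")

# Stand-in for 8 uniformly sampled RGB frames, shape (frames, H, W, 3).
clip = np.random.randint(0, 255, (8, 224, 224, 3), dtype=np.uint8)
prompt = "USER: <video>\nWhat is happening in this video? ASSISTANT:"
inputs = processor(text=prompt, videos=clip, return_tensors="pt")
ids = model.generate(**inputs, max_new_tokens=60)
print(processor.batch_decode(ids, skip_special_tokens=True)[0])
```
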
Name | Paper | Code | Datasets | Comments |
---|---|---|---|---|
StepCoder | StepCoder: Improve Code Generation with Reinforcement Learning from Compiler Feedback | Link | APPS+ | Curriculum learning + reinforcement learning |
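
StepCoder's title names the core idea: reward the policy with compiler/executor feedback. A minimal, hypothetical stand-in for such a reward, scoring a candidate Python program against (stdin, expected-stdout) unit tests; this illustrates the general signal, not StepCoder's actual reward or curriculum:

```python
import os
import subprocess
import sys
import tempfile

def unit_test_reward(code: str, tests: list[tuple[str, str]], timeout: int = 5) -> float:
    """Fraction of (stdin, expected stdout) tests a candidate program passes."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    passed = 0
    for stdin, expected in tests:
        try:
            run = subprocess.run(
                [sys.executable, path], input=stdin,
                capture_output=True, text=True, timeout=timeout,
            )
            if run.returncode == 0 and run.stdout.strip() == expected.strip():
                passed += 1
        except subprocess.TimeoutExpired:
            pass  # a timeout counts as a failed test
    os.unlink(path)
    return passed / len(tests) if tests else 0.0

print(unit_test_reward("print(int(input()) * 2)", [("3", "6"), ("5", "10")]))  # 1.0
```
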