English | 简体中文
👋 join us on WeChat
The MiniSora open-source community is a community-driven initiative, organized spontaneously by community members, free of charge and free of any exploitation. The MiniSora project aims to explore the implementation path and future development direction of Sora.
- Regular roundtable discussions will be held with the Sora team and the community to explore possibilities.
- We will delve into existing technological pathways for video generation.
- GPU-Friendly: Ideally, it should have low requirements for GPU memory and GPU count; for example, trainable and inferable with compute power like 8x A100 80G cards, 8x A6000 48G cards, or an RTX 4090 24G (see the back-of-the-envelope sketch after this list).
- Training-Efficiency: It should achieve good results without requiring extensive training time.
- Inference-Efficiency: Generated videos need not be long or high-resolution; 3-10 seconds at 480p is acceptable.
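For a rough sense of what "GPU-Friendly" implies, here is a back-of-the-envelope sketch. It uses a common rule of thumb for mixed-precision AdamW training without any sharding, and it excludes activation memory, which usually dominates for video models; treat it as a heuristic, not a guarantee.

```python
def train_vram_gb(n_params: float, bytes_per_param: int = 16) -> float:
    """Rough per-GPU footprint of weights + optimizer state for
    mixed-precision AdamW: ~16 bytes/param = bf16 weights (2) +
    bf16 grads (2) + fp32 master weights (4) + fp32 Adam moments (8).
    Activations and framework overhead are excluded."""
    return n_params * bytes_per_param / 1024 ** 3

# Example: a ~675M-parameter model (DiT-XL/2 scale) needs roughly
# 10 GB of weight/optimizer state before activations, so it fits the
# cards listed above once checkpointing or sharding covers the rest.
print(f"~{train_vram_gb(675e6):.0f} GB")
```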
MiniSora-DiT: Reproducing the DiT Paper with XTuner
https://github.com/mini-sora/minisora-DiT
We are recruiting MiniSora community contributors to reproduce DiT using XTuner.

We hope community members have the following characteristics:

- Familiarity with the OpenMMLab MMEngine mechanism.
- Familiarity with DiT.

Why this reproduction:

- The author of DiT is also an author of Sora.
- XTuner has the core technology to efficiently train sequences of length 1000K.
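For contributors new to OpenMMLab, the sketch below shows the MMEngine Runner pattern that XTuner builds on: a BaseModel subclass returns a loss dict, and the Runner wires together model, dataloader, and optimizer. Everything here is a toy placeholder (random tensors, a two-layer net), not the actual MiniSora-DiT code.

```python
import torch
import torch.nn as nn
from mmengine.model import BaseModel
from mmengine.runner import Runner
from torch.utils.data import DataLoader, TensorDataset

class ToyDenoiser(BaseModel):
    """Placeholder for a DiT-style denoiser: predicts the noise in its input."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(64, 256), nn.GELU(), nn.Linear(256, 64))

    def forward(self, noisy, noise, mode='loss'):
        pred = self.net(noisy)
        if mode == 'loss':                      # Runner calls this during training
            return {'loss': nn.functional.mse_loss(pred, noise)}
        return pred

noise = torch.randn(512, 64)
noisy = torch.randn(512, 64) + noise            # toy "noisy sample"
loader = DataLoader(TensorDataset(noisy, noise), batch_size=64, shuffle=True)

runner = Runner(
    model=ToyDenoiser(),
    work_dir='./work_dir',
    train_dataloader=loader,
    optim_wrapper=dict(optimizer=dict(type='AdamW', lr=1e-4)),
    train_cfg=dict(by_epoch=True, max_epochs=2),
)
runner.train()
```

In the actual reproduction, XTuner drives this same MMEngine machinery through config files rather than inline Python.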
Speaker: MMagic Core Contributors
Live Streaming Time: 03/12 20:00
Highlights: MMagic core contributors will walk us through the Stable Diffusion 3 paper, discussing its architecture details and design principles.
Please scan the QR code with WeChat to book a live video session.
Night Talk with Sora: Video Diffusion Overview
ZhiHu Notes: A Survey on Generative Diffusion Model: An Overview of Generative Diffusion Models
- Technical Report: Video generation models as world simulators
- Latte: Latent Diffusion Transformer for Video Generation
- Stable Cascade (ICLR 24 Paper): Würstchen: An efficient architecture for large-scale text-to-image diffusion models
- Stable Diffusion 3: MM-DiT: Scaling Rectified Flow Transformers for High-Resolution Image Synthesis
- Updating...
- Diffusion Model
- Diffusion Transformer
- Baseline Video Generation Models
- Video Generation
- Dataset
- Patchifying Methods
- Long-context
- Audio Related Resource
- Consistency
- Prompt Engineering
- Security
- World Model
- Video Compression
- Existing high-quality resources
Diffusion Model

Paper | Link
--- | ---
1) Guided-Diffusion: Diffusion Models Beat GANs on Image Synthesis | NeurIPS 21 Paper, GitHub |
2) Latent Diffusion: High-Resolution Image Synthesis with Latent Diffusion Models | CVPR 22 Paper, GitHub |
3) EDM: Elucidating the Design Space of Diffusion-Based Generative Models | NeurIPS 22 Paper, GitHub |
4) DDPM: Denoising Diffusion Probabilistic Models | NeurIPS 20 Paper, GitHub |
5) DDIM: Denoising Diffusion Implicit Models | ICLR 21 Paper, GitHub |
6) Score-Based Diffusion: Score-Based Generative Modeling through Stochastic Differential Equations | ICLR 21 Paper, GitHub, Blog |
7) Stable Cascade: Würstchen: An efficient architecture for large-scale text-to-image diffusion models | ICLR 24 Paper, GitHub, Blog |
8) Diffusion Models in Vision: A Survey | TPAMI 23 Paper, GitHub |
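As a quick refresher on the objective these papers build on, here is a minimal sketch of the two formulas at the heart of DDPM (Ho et al., 2020): the closed-form forward noising q(x_t | x_0) and the simple noise-prediction loss. `model` stands for any network that takes (x_t, t); the schedule constants are the paper's linear defaults.

```python
import torch
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)             # linear beta schedule
alphas_bar = torch.cumprod(1.0 - betas, dim=0)    # cumulative alpha-bar_t

def q_sample(x0, t, eps):
    """Closed-form noising: x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps."""
    ab = alphas_bar.to(x0.device)[t].view(-1, *([1] * (x0.dim() - 1)))
    return ab.sqrt() * x0 + (1.0 - ab).sqrt() * eps

def ddpm_loss(model, x0):
    """L_simple: train the model to predict the injected noise eps."""
    t = torch.randint(0, T, (x0.shape[0],), device=x0.device)
    eps = torch.randn_like(x0)
    return F.mse_loss(model(q_sample(x0, t, eps), t), eps)
```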
Diffusion Transformer

Paper | Link
--- | ---
1) UViT: All are Worth Words: A ViT Backbone for Diffusion Models | CVPR 23 Paper, GitHub, ModelScope |
2) DiT: Scalable Diffusion Models with Transformers | ICCV 23 Paper, GitHub, Project, ModelScope |
3) SiT: Exploring Flow and Diffusion-based Generative Models with Scalable Interpolant Transformers | Paper, GitHub, ModelScope |
4) FiT: Flexible Vision Transformer for Diffusion Model | Paper, GitHub |
5) k-diffusion: Scalable High-Resolution Pixel-Space Image Synthesis with Hourglass Diffusion Transformers | Paper, GitHub |
6) Large-DiT: Large Diffusion Transformer | GitHub |
7) VisionLLaMA: A Unified LLaMA Interface for Vision Tasks | Paper, GitHub |
8) Stable Diffusion 3: MM-DiT: Scaling Rectified Flow Transformers for High-Resolution Image Synthesis | Paper, Blog |
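The architectural idea these papers share is small: a standard transformer block whose LayerNorm scale, shift, and residual gates are regressed from the conditioning (timestep/class) embedding, with that regression zero-initialized so every block starts as the identity ("adaLN-Zero" in the DiT paper). Below is a simplified sketch; the official implementation adds a SiLU before the conditioning projection and uses its own attention module.

```python
import torch
import torch.nn as nn

class DiTBlock(nn.Module):
    """Simplified DiT block with adaLN-Zero conditioning (after Peebles & Xie)."""
    def __init__(self, dim: int, heads: int):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        # Regress per-block shift/scale/gate from the conditioning embedding;
        # zero init => each block is the identity at the start of training.
        self.ada = nn.Linear(dim, 6 * dim)
        nn.init.zeros_(self.ada.weight)
        nn.init.zeros_(self.ada.bias)

    def forward(self, x, c):   # x: (B, N, D) patch tokens, c: (B, D) conditioning
        s1, b1, g1, s2, b2, g2 = self.ada(c).chunk(6, dim=-1)
        h = self.norm1(x) * (1 + s1[:, None]) + b1[:, None]
        x = x + g1[:, None] * self.attn(h, h, h, need_weights=False)[0]
        h = self.norm2(x) * (1 + s2[:, None]) + b2[:, None]
        return x + g2[:, None] * self.mlp(h)
```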
Baseline Video Generation Models

Paper | Link
--- | ---
1) ViViT: A Video Vision Transformer | ICCV 21 Paper, GitHub |
2) VideoLDM: Align your Latents: High-Resolution Video Synthesis with Latent Diffusion Models | CVPR 23 Paper |
3) DiT: Scalable Diffusion Models with Transformers | ICCV 23 Paper, Github, Project, ModelScope |
4) Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Generators | Paper, GitHub |
5) Latte: Latent Diffusion Transformer for Video Generation | Paper, GitHub, Project |
Video Generation

Paper | Link
--- | ---
1) Animatediff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning | ICLR 24 Paper, GitHub, ModelScope |
2) I2VGen-XL: High-Quality Image-to-Video Synthesis via Cascaded Diffusion Models | Paper, GitHub, ModelScope |
3) Imagen Video: High Definition Video Generation with Diffusion Models | Paper |
4) MoCoGAN: Decomposing Motion and Content for Video Generation | CVPR 18 Paper |
5) Adversarial Video Generation on Complex Datasets | Paper |
6) W.A.L.T: Photorealistic Video Generation with Diffusion Models | Paper, Project |
7) VideoGPT: Video Generation using VQ-VAE and Transformers | Paper, GitHub |
8) Video Diffusion Models | Paper, GitHub, Project |
9) MCVD: Masked Conditional Video Diffusion for Prediction, Generation, and Interpolation | NeurIPS 22 Paper, GitHub, Project, Blog |
10) VideoPoet: A Large Language Model for Zero-Shot Video Generation | Paper |
11) MAGVIT: Masked Generative Video Transformer | CVPR 23 Paper, GitHub, Project, Colab |
12) EMO: Emote Portrait Alive - Generating Expressive Portrait Videos with Audio2Video Diffusion Model under Weak Conditions | Paper, GitHub, Project |
13) SimDA: Simple Diffusion Adapter for Efficient Video Generation | Paper, GitHub, Project |
14) StableVideo: Text-driven Consistency-aware Diffusion Video Editing | ICCV 23 Paper, GitHub, Project |
15) SVD: Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets | Paper, GitHub |
16) ADD: Adversarial Diffusion Distillation | Paper, GitHub |
17) GenTron: Diffusion Transformers for Image and Video Generation | CVPR 24 Paper, Project |
18) LFDM: Conditional Image-to-Video Generation with Latent Flow Diffusion Models | CVPR 23 Paper, GitHub |
19) MotionDirector: Motion Customization of Text-to-Video Diffusion Models | Paper, GitHub |
20) TGAN-ODE: Latent Neural Differential Equations for Video Generation | Paper, GitHub |
21) VideoCrafter1: Open Diffusion Models for High-Quality Video Generation | Paper, GitHub |
22) VideoCrafter2: Overcoming Data Limitations for High-Quality Video Diffusion Models | Paper, GitHub |
23) LVDM: Latent Video Diffusion Models for High-Fidelity Long Video Generation | Paper, GitHub |
Dataset

Dataset Name - Paper | Link
--- | ---
1) Panda-70M - Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers (70M Clips, 720P, Downloadable) | CVPR 24 Paper, Github, Project |
2) InternVid-10M - InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation (10M Clips, 720P, Downloadable) | ArXiv 24 Paper, Github |
3) CelebV-Text - CelebV-Text: A Large-Scale Facial Text-Video Dataset (70K Clips, 720P, Downloadable) | CVPR 23 Paper, Github, Project |
4) HD-VG-130M - VideoFactory: Swap Attention in Spatiotemporal Diffusions for Text-to-Video Generation (130M Clips, 720P, Downloadable) | ArXiv 23 Paper, Github, Tool |
5) HD-VILA-100M - Advancing High-Resolution Video-Language Representation with Large-Scale Video Transcriptions (100M Clips, 720P, Downloadable) | CVPR 22 Paper, Github |
6) VideoCC - Learning Audio-Video Modalities from Image Captions (10.3M Clips, 720P, Downloadable) | ECCV 22 Paper, Github |
7) YT-Temporal-180M - MERLOT: Multimodal Neural Script Knowledge Models (180M Clips, 480P, Downloadable) | NeurIPS 21 Paper, Github, Project |
8) HowTo100M - HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips (136M Clips, 240P, Downloadable) | ICCV 19 Paper, Github, Project |
9) UCF101 - UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild (13K Clips, 240P, Downloadable) | CVPR 12 Paper, Project |
10) MSVD - Collecting Highly Parallel Data for Paraphrase Evaluation (122K Clips, 240P, Downloadable) | ACL 11 Paper, Project |
11) Fashion-Text2Video - A human video dataset with rich label and text annotations (600 Videos, 480P, Downloadable) | ArXiv 23 Paper, Project |
12) LAION-5B - A dataset of 5.85 billion CLIP-filtered image-text pairs, 14x bigger than LAION-400M (5B Pairs, Downloadable) | NeurIPS 22 Paper, Project |
13) ActivityNet Captions - Contains 20k videos amounting to 849 video hours with 100k total descriptions, each with its unique start and end time (20k Videos, Downloadable) | ArXiv 17 Paper, Project |
14) MSR-VTT - A large-scale video benchmark for video understanding (10k Clips, Downloadable) | CVPR 16 Paper, Project |
15) The Cityscapes Dataset - Benchmark suite and evaluation server for pixel-level, instance-level, and panoptic semantic labeling (Downloadable) | ArXiv 16 Paper, Project |
16) Youku-mPLUG - First open-source large-scale Chinese video-text dataset (Downloadable) | ArXiv 23 Paper, Project |
17) VidProM - VidProM: A Million-scale Real Prompt-Gallery Dataset for Text-to-Video Diffusion Models (6.69M, Downloadable) | ArXiv 24 Paper, Github |

Dataset Name - Paper | Link
--- | ---
1) WebVid - Large-scale text-video dataset, containing 10 million video-text pairs scraped from stock footage sites (10M video-text pairs) | ArXiv 21 Paper, Project |
Patchifying Methods

Paper | Link
--- | ---
1) ViT: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale | ICLR 21 Paper, Github |
2) MAE: Masked Autoencoders Are Scalable Vision Learners | CVPR 22 Paper, Github |
3) ViViT: A Video Vision Transformer (-) | ICCV 21 Paper, GitHub |
4) DiT: Scalable Diffusion Models with Transformers (-) | ICCV 23 Paper, GitHub, Project, ModelScope |
5) U-ViT: All are Worth Words: A ViT Backbone for Diffusion Models (-) | CVPR 23 Paper, GitHub, ModelScope |
6) FlexiViT: One Model for All Patch Sizes | Paper, Github |
7) Patch n’ Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution | Paper, Github |
8) VQ-VAE: Neural Discrete Representation Learning | Paper, Github |
9) VQ-GAN: Taming Transformers for High-Resolution Image Synthesis | CVPR 21 Paper, Github |
10) LVT: Latent Video Transformer | Paper, Github |
11) VideoGPT: Video Generation using VQ-VAE and Transformers (-) | Paper, GitHub |
12) Predicting Video with VQVAE | Paper |
13) CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers | ICLR 23 Paper, Github |
14) TATS: Long Video Generation with Time-Agnostic VQGAN and Time-Sensitive Transformer | ECCV 22 Paper, Github |
15) MAGVIT: Masked Generative Video Transformer (-) | CVPR 23 Paper, GitHub, Project, Colab |
16) MagViT2: Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation | ICLR 24 Paper, Github |
17) VideoPoet: A Large Language Model for Zero-Shot Video Generation (-) | Paper |
18) CLIP: Learning Transferable Visual Models From Natural Language Supervision | CVPR 21 Paper, Github |
19) BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation | Paper, Github |
20) BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | Paper, Github |
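The common thread in this section is turning pixels into token sequences. Below is a minimal sketch of plain ViT-style patchifying applied per frame; the "spacetime patch" approach described in Sora's technical report, and methods like NaViT and FlexiViT, are elaborations of this same reshape.

```python
import torch

def patchify(video: torch.Tensor, p: int = 16) -> torch.Tensor:
    """(B, T, C, H, W) -> (B, T * H/p * W/p, C * p * p) patch tokens."""
    b, t, c, h, w = video.shape
    assert h % p == 0 and w % p == 0, "H and W must be divisible by patch size"
    x = video.reshape(b, t, c, h // p, p, w // p, p)
    x = x.permute(0, 1, 3, 5, 2, 4, 6)       # (B, T, H/p, W/p, C, p, p)
    return x.reshape(b, t * (h // p) * (w // p), c * p * p)

tokens = patchify(torch.randn(2, 8, 3, 224, 224))
print(tokens.shape)   # torch.Size([2, 1568, 768])
```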
Long-context

Paper | Link
--- | ---
1) World Model on Million-Length Video And Language With RingAttention | Paper, GitHub |
2) Ring Attention with Blockwise Transformers for Near-Infinite Context | Paper, GitHub |
3) Extending LLMs' Context Window with 100 Samples | Paper, GitHub |
4) Efficient Streaming Language Models with Attention Sinks | ICLR 24 Paper, GitHub |
5) The What, Why, and How of Context Length Extension Techniques in Large Language Models – A Detailed Survey | Paper |
6) MovieChat: From Dense Token to Sparse Memory for Long Video Understanding | CVPR 24 Paper, GitHub, Project |
7) MemoryBank: Enhancing Large Language Models with Long-Term Memory | Paper, GitHub |
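A recurring idea in these papers is that exact attention does not require materializing the full attention matrix: keys and values can be consumed block by block while carrying a running softmax max and denominator, which is what lets Ring Attention shard near-infinite contexts across devices. Here is a single-device sketch of that online-softmax recurrence (illustrative: unbatched, single-head, no masking).

```python
import torch

def blockwise_attention(q, k, v, block: int = 128):
    """Exact softmax attention computed one key/value block at a time."""
    scale = q.shape[-1] ** -0.5
    m = torch.full((q.shape[0], 1), float('-inf'))   # running row-wise max
    den = torch.zeros(q.shape[0], 1)                 # running softmax denominator
    acc = torch.zeros_like(q)                        # running numerator @ V
    for i in range(0, k.shape[0], block):
        kb, vb = k[i:i + block], v[i:i + block]
        s = (q @ kb.T) * scale
        m_new = torch.maximum(m, s.max(dim=-1, keepdim=True).values)
        corr = torch.exp(m - m_new)                  # rescale old accumulators
        p = torch.exp(s - m_new)
        den = den * corr + p.sum(dim=-1, keepdim=True)
        acc = acc * corr + p @ vb
        m = m_new
    return acc / den

q, k, v = (torch.randn(64, 32) for _ in range(3))
ref = torch.softmax((q @ k.T) * 32 ** -0.5, dim=-1) @ v
assert torch.allclose(blockwise_attention(q, k, v), ref, atol=1e-5)
```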
Audio Related Resource

Paper | Link
--- | ---
1) Stable Audio: Fast Timing-Conditioned Latent Audio Diffusion | Paper, Github, Blog |
2) MM-Diffusion: Learning Multi-Modal Diffusion Models for Joint Audio and Video Generation | CVPR 23 Paper, GitHub |
3) Pengi: An Audio Language Model for Audio Tasks | NeurIPS 23 Paper, GitHub |
4) VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset | NeurIPS 23 Paper, GitHub |
5) Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration | Paper, GitHub |
6) NaturalSpeech: End-to-End Text to Speech Synthesis with Human-Level Quality | Paper, GitHub |
7) NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers | Paper, GitHub |
8) UniAudio: An Audio Foundation Model Toward Universal Audio Generation | Paper, GitHub |
9) Audio-Visual LLM for Video Understanding | Paper |
10) Make-An-Audio: Text-To-Audio Generation with Prompt-Enhanced Diffusion Models | ICML 23 Paper, GitHub |
11) AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head | Paper, GitHub |
12) AudioLM: a Language Modeling Approach to Audio Generation | Paper |
13) AudioGen: Textually Guided Audio Generation | ICLR 23 Paper, Project |
Consistency

Paper | Link
--- | ---
1) Layered Neural Atlases for Consistent Video Editing | TOG 21 Paper, GitHub, Project |
2) StableVideo: Text-driven Consistency-aware Diffusion Video Editing | ICCV 23 Paper, GitHub, Project |
3) CoDeF: Content Deformation Fields for Temporally Consistent Video Processing | Paper, GitHub, Project |
4) Consistency Models | ICML 23 Paper, GitHub |
5) Sora Generates Videos with Stunning Geometrical Consistency | Paper, GitHub, Project |
6) Efficient One-stage Video Object Detection by Exploiting Temporal Consistency | ECCV 22 Paper, GitHub |
7) Bootstrap Motion Forecasting With Self-Consistent Constraints | ICCV 23 Paper |
8) Enforcing Realism and Temporal Consistency for Large-Scale Video Inpainting | Paper |
9) Enhancing Multi-Camera People Tracking with Anchor-Guided Clustering and Spatio-Temporal Consistency ID Re-Assignment | CVPRW 23 Paper, GitHub |
10) Exploiting Spatial-Temporal Semantic Consistency for Video Scene Parsing | Paper |
11) Semi-Supervised Crowd Counting With Spatial Temporal Consistency and Pseudo-Label Filter | TCSVT 23 Paper |
12) Spatio-temporal Consistency and Hierarchical Matching for Multi-Target Multi-Camera Vehicle Tracking | CVPRW 19 Paper |
13) VideoDirectorGPT: Consistent Multi-scene Video Generation via LLM-Guided Planning (-) | Paper |
14) VideoDrafter: Content-Consistent Multi-Scene Video Generation with LLM (-) | Paper |
Prompt Engineering

Paper | Link
--- | ---
1) RealCompo: Dynamic Equilibrium between Realism and Compositionality Improves Text-to-Image Diffusion Models | Paper, GitHub, Project |
2) Mastering Text-to-Image Diffusion: Recaptioning, Planning, and Generating with Multimodal LLMs | Paper, GitHub |
3) LLM-grounded Diffusion: Enhancing Prompt Understanding of Text-to-Image Diffusion Models with Large Language Models | TMLR 23 Paper, GitHub |
4) LLM Blueprint: Enabling Text-to-Image Generation with Complex and Detailed Prompts | ICLR 24 Paper, GitHub |
5) Progressive Text-to-Image Diffusion with Soft Latent Direction | Paper |
6) Self-correcting LLM-controlled Diffusion Models | CVPR 24 Paper, GitHub |
7) LayoutLLM-T2I: Eliciting Layout Guidance from LLM for Text-to-Image Generation | MM 23 Paper |
8) LayoutGPT: Compositional Visual Planning and Generation with Large Language Models | NeurIPS 23 Paper, GitHub |
9) Gen4Gen: Generative Data Pipeline for Generative Multi-Concept Composition | Paper, GitHub |
10) InstructEdit: Improving Automatic Masks for Diffusion-based Image Editing With User Instructions | Paper, GitHub |
11) Controllable Text-to-Image Generation with GPT-4 | Paper |
12) LLM-grounded Video Diffusion Models | ICLR 24 Paper |
13) VideoDirectorGPT: Consistent Multi-scene Video Generation via LLM-Guided Planning | Paper |
14) FlowZero: Zero-Shot Text-to-Video Synthesis with LLM-Driven Dynamic Scene Syntax | Paper, Github, Project |
15) VideoDrafter: Content-Consistent Multi-Scene Video Generation with LLM | Paper |
16) Free-Bloom: Zero-Shot Text-to-Video Generator with LLM Director and LDM Animator | NeurIPS 23 Paper |
17) Empowering Dynamics-aware Text-to-Video Diffusion with Large Language Models | Paper |
18) MotionZero: Exploiting Motion Priors for Zero-shot Text-to-Video Generation | Paper |
19) GPT4Motion: Scripting Physical Motions in Text-to-Video Generation via Blender-Oriented GPT Planning | Paper |
Security

Paper | Link
--- | ---
1) BeaverTails: Towards Improved Safety Alignment of LLM via a Human-Preference Dataset | NeurIPS 23 Paper, Github, Project |
2) LIMA: Less Is More for Alignment | NeurIPS 23 Paper |
3) Jailbroken: How Does LLM Safety Training Fail? | NeurIPS 23 Paper |
4) Safe Latent Diffusion: Mitigating Inappropriate Degeneration in Diffusion Models | CVPR 23 Paper |
5) Stable Bias: Evaluating Societal Representations in Diffusion Models | NeurIPS 23 Paper |
6) Ablating Concepts in Text-to-Image Diffusion Models | ICCV 23 Paper, Project |
7) Diffusion art or digital forgery? investigating data replication in diffusion models | ICCV 23 Paper, Project |
8) Eternal Sunshine of the Spotless Net: Selective Forgetting in Deep Networks | ICCV 20 Paper |
9) Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks | ICML 20 Paper |
10) A pilot study of query-free adversarial attack against stable diffusion | ICCV 23 Paper |
11) Interpretable-Through-Prototypes Deepfake Detection for Diffusion Models | ICCV 23 Paper |
12) Erasing Concepts from Diffusion Models | ICCV 23 Paper, Project |
13) Threat Model-Agnostic Adversarial Defense using Diffusion Models | Paper |
14) How well can Text-to-Image Generative Models understand Ethical Natural Language Interventions? | Paper, Github |
15) Differentially Private Diffusion Models Generate Useful Synthetic Images | Paper |
16) Unsafe Diffusion: On the Generation of Unsafe Images and Hateful Memes From Text-To-Image Models | SIGSAC 23 Paper, Github |
17) Forget-Me-Not: Learning to Forget in Text-to-Image Diffusion Models | Paper, Github |
18) Unified Concept Editing in Diffusion Models | WACV 24 Paper, Project |
World Model

Paper | Link
--- | ---
1) NExT-GPT: Any-to-Any Multimodal LLM | Paper, GitHub |
Video Compression

Paper | Link
--- | ---
1) H.261: Video codec for audiovisual services at p x 64 kbit/s | Paper |
2) H.262: Information technology - Generic coding of moving pictures and associated audio information: Video | Paper |
3) H.263: Video coding for low bit rate communication | Paper |
4) H.264: Overview of the H.264/AVC video coding standard | Paper |
5) H.265: Overview of the High Efficiency Video Coding (HEVC) Standard | Paper |
6) H.266: Overview of the Versatile Video Coding (VVC) Standard and its Applications | Paper |
7) DVC: An End-to-end Deep Video Compression Framework | CVPR 19 Paper, GitHub |
8) OpenDVC: An Open Source Implementation of the DVC Video Compression Method | Paper, GitHub |
9) HLVC: Learning for Video Compression with Hierarchical Quality and Recurrent Enhancement | CVPR 20 Paper, Github |
10) RLVC: Learning for Video Compression with Recurrent Auto-Encoder and Recurrent Probability Model | J-STSP 21 Paper, Github |
11) PLVC: Perceptual Learned Video Compression with Recurrent Conditional GAN | IJCAI 22 Paper, Github |
12) ALVC: Advancing Learned Video Compression with In-loop Frame Prediction | T-CSVT 22 Paper, Github |
13) DCVC: Deep Contextual Video Compression | NeurIPS 21 Paper, Github |
14) DCVC-TCM: Temporal Context Mining for Learned Video Compression | TM 22 Paper, Github |
15) DCVC-HEM: Hybrid Spatial-Temporal Entropy Modelling for Neural Video Compression | MM 22 Paper, Github |
16) DCVC-DC: Neural Video Compression with Diverse Contexts | CVPR 23 Paper, Github |
17) DCVC-FM: Neural Video Compression with Feature Modulation | CVPR 24 Paper, Github |
18) SSF: Scale-Space Flow for End-to-End Optimized Video Compression | CVPR 20 Paper, Github |
Existing high-quality resources

Resources | Link
--- | ---
1) Datawhale - AI Video Generation Study (AI视频生成学习) | Feishu doc |
2) A Survey on Generative Diffusion Model | TKDE 24 Paper, GitHub |
3) Awesome-Video-Diffusion-Models: A Survey on Video Diffusion Models | Paper, GitHub |
4) Awesome-Text-To-Video: A Survey on Text-to-Video Generation/Synthesis | GitHub |
5) video-generation-survey: A reading list of video generation | GitHub |
6) Awesome-Video-Diffusion | GitHub |
7) Video Generation Task in Papers With Code | Task |
8) Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models | Paper, GitHub |
9) Open-Sora-Plan (PKU-YuanGroup) | GitHub |
10) State of the Art on Diffusion Models for Visual Computing | Paper |
11) Diffusion Models: A Comprehensive Survey of Methods and Applications | CSUR 24 Paper, GitHub |
12) Generate Impressive Videos with Text Instructions: A Review of OpenAI Sora, Stable Diffusion, Lumiere and Comparable | Paper |
13) On the Design Fundamentals of Diffusion Models: A Survey | Paper |
14) Efficient Diffusion Models for Vision: A Survey | Paper |
15) Text-to-Image Diffusion Models in Generative AI: A Survey | Paper |
16) Awesome-Diffusion-Transformers | GitHub, Project |
17) Open-Sora (HPC-AI Tech) | GitHub, Blog |
18) LAVIS - A Library for Language-Vision Intelligence | ACL 23 Paper, GitHub, Project |
19) OpenDiT: An Easy, Fast and Memory-Efficient System for DiT Training and Inference | GitHub |
We greatly appreciate your contributions to the MiniSora open-source community and your help in making it even better than it is now!

For more details, please refer to the Contribution Guidelines.