
vdt's Introduction

VDT

[ICLR2024] The official implementation of the paper "VDT: General-purpose Video Diffusion Transformers via Mask Modeling", by Haoyu Lu, Guoxing Yang, Nanyi Fei, Yuqi Huo, Zhiwu Lu, Ping Luo, and Mingyu Ding.

Introduction

This work introduces the Video Diffusion Transformer (VDT), which pioneers the use of transformers in diffusion-based video generation. It features transformer blocks with modularized temporal and spatial attention modules, allowing each component to be optimized separately while leveraging the rich spatio-temporal representation capacity of transformers.
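
The separation of temporal and spatial attention can be pictured roughly as follows. This is a minimal sketch with hypothetical names, assuming a recent PyTorch (batch-first attention) and tokens arranged as a batch x frames x spatial-tokens x channels grid; the official code may organize this differently:

    import torch
    import torch.nn as nn

    class VDTBlock(nn.Module):
        """One transformer block with separate temporal and spatial attention."""
        def __init__(self, dim: int, num_heads: int):
            super().__init__()
            self.norm_t = nn.LayerNorm(dim)
            self.attn_t = nn.MultiheadAttention(dim, num_heads, batch_first=True)
            self.norm_s = nn.LayerNorm(dim)
            self.attn_s = nn.MultiheadAttention(dim, num_heads, batch_first=True)
            self.norm_m = nn.LayerNorm(dim)
            self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                     nn.Linear(4 * dim, dim))

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # x: (B, T, N, D) -- batch, frames, spatial tokens per frame, channels.
            B, T, N, D = x.shape
            # Temporal attention: each spatial location attends across frames.
            t = x.permute(0, 2, 1, 3).reshape(B * N, T, D)
            h = self.norm_t(t)
            t = t + self.attn_t(h, h, h, need_weights=False)[0]
            x = t.reshape(B, N, T, D).permute(0, 2, 1, 3)
            # Spatial attention: tokens within each frame attend to each other.
            s = x.reshape(B * T, N, D)
            h = self.norm_s(s)
            s = s + self.attn_s(h, h, h, need_weights=False)[0]
            x = s.reshape(B, T, N, D)
            return x + self.mlp(self.norm_m(x))

Reshaping lets each module reuse standard attention: temporal attention treats every spatial location as a sequence over frames, while spatial attention treats every frame as a sequence over locations, so each can be analyzed and optimized on its own.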

  1. It excels at capturing temporal dependencies, producing temporally consistent video frames and even simulating the physics and dynamics of 3D objects over time.
  2. It supports flexible conditioning information, e.g., simple concatenation in the token space, effectively unifying different token lengths and modalities.
  3. Paired with our proposed spatial-temporal mask modeling mechanism, it becomes a general-purpose video diffuser covering a range of tasks, including unconditional generation, video prediction, interpolation, animation, and completion (see the sketch after this list).
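
A minimal sketch of how such a frame mask can unify these tasks (a hypothetical helper, not the repo's exact API): condition frames keep their clean latents, masked frames carry noisy latents, and a binary mask tells the model which is which.

    import torch

    def build_masked_input(latents, noisy, mask):
        """latents, noisy: (B, T, N, D); mask: (B, T), 1 = condition frame, 0 = generate."""
        m = mask[:, :, None, None].to(latents.dtype)  # broadcast over tokens and channels
        return m * latents + (1.0 - m) * noisy        # mix clean and noisy tokens

    # Different masks recover different tasks:
    B, T = 2, 16
    predict = torch.zeros(B, T); predict[:, :8] = 1   # prediction: first 8 frames given
    interp  = torch.zeros(B, T); interp[:, ::4] = 1   # interpolation: every 4th frame given
    uncond  = torch.zeros(B, T)                       # unconditional: nothing given

    latents = torch.randn(B, T, 16, 64)
    x = build_masked_input(latents, torch.randn_like(latents), predict)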

Extensive experiments on video generation, prediction, and dynamics modeling (i.e., physics-based QA) demonstrate the effectiveness of VDT in various scenarios, including autonomous driving, human action, and physics-based simulation.

Release

2024-05-05: Released the spatial-temporal mask modeling code and inference code.
2024-01-27: VDT was accepted at ICLR 2024.
2023-05-22: We proposed the Video Diffusion Transformer (VDT) and released the checkpoint and inference code.

Getting Started

  • Python 3, PyTorch>=1.8.0, and torchvision>=0.7.0 are required for the current codebase.
  • To install the other dependencies, run
    conda env create -f environment.yml
    conda activate VDT

Checkpoint

We now provide a checkpoint for Sky Time-Lapse unified generation. You can download it from here.

Inference

We provide an inference notebook for Sky Time-Lapse unified generation (prediction, backward prediction, unconditional generation, single-frame, arbitrary interpolation, spatial_temporal). To sample results, first download the checkpoint, then run inference.ipynb. Have fun!
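
Conceptually, unified sampling with a frame mask can be pictured as the loop below. This is a hedged sketch, not the notebook's actual code: the model call signature is an assumption, and the toy update stands in for a real DDPM/DDIM scheduler. The key point is that condition frames are re-imposed at every step, so one model serves all mask patterns:

    import torch

    @torch.no_grad()
    def unified_sample(model, cond_latents, mask, num_steps=50):
        """cond_latents: (B, T, N, D) clean latents; mask: (B, T), 1 = given frame."""
        m = mask[:, :, None, None].to(cond_latents.dtype)
        x = torch.randn_like(cond_latents)              # start from pure noise
        for step in reversed(range(num_steps)):
            x = m * cond_latents + (1 - m) * x          # re-impose condition frames
            t = torch.full((x.size(0),), step, device=x.device)
            eps = model(x, t)                           # predicted noise (assumed API)
            x = x - eps / num_steps                     # toy update; real code uses a scheduler
        return m * cond_latents + (1 - m) * x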

Acknowledgement

Our codebase is built on DiT, BEiT, SlotFormer, and MVCD. We thank the authors for their nicely organized code!

vdt's People

Contributors

rerv, vdt-2023


vdt's Issues

Some confusion about the code.

Is the network architecture in the provided inference code consistent with the one used during training? Are there any inconsistencies between the training and inference code?
Great job, and looking forward to your reply. Thanks in advance.

test

When running physion_sample.py, an error occurs: AttributeError: 'DiagonalGaussianDistribution' object has no attribute 'latent_dist'.
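A hedged guess at the cause, assuming the script wraps diffusers' AutoencoderKL: depending on the version or wrapper, vae.encode(x) may return either an AutoencoderKLOutput (which has .latent_dist) or a DiagonalGaussianDistribution directly, and chaining .latent_dist onto the latter raises exactly this error. A defensive pattern:

    # Hypothetical repair sketch; `vae` and `x` stand for the script's VAE and input batch.
    posterior = vae.encode(x)
    if hasattr(posterior, "latent_dist"):   # AutoencoderKLOutput wrapper
        posterior = posterior.latent_dist
    z = posterior.sample()                  # DiagonalGaussianDistribution -> latent sample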

More Physion evaluation results

Thanks for the great work! One question I have is about the evaluation results on the Physion data. In both the paper and the code, there seem to be results and a model checkpoint only for Collision. I was wondering whether VDT has been evaluated on the 7 other Physion scenarios? If so, it would be great if both the evaluation results and checkpoints could be shared. Thanks in advance!

Question about the `VDT`

Hello, I would like to know what num_frames means in VDT? Looking forward to hearing from you.

GPU compute capability

Thank you for sharing such great work with us. Can you tell me how long VDT trains on an A100 GPU? Waiting for your reply.

Physion inference with less than 8 condition frames

Congratulations on the impressive paper. When I tried running inference with your pre-trained Physion model, the results degraded significantly as the number of condition frames was reduced. For example, using 4 condition frames (rather than the default of 8) produces only noise; see the attached image.

Does this match your expectation? It seems at odds with the paper's discussion, which states "our VDT can still take any length of conditional frame as input and output consistent predicted features".

Thank you!

Edit: I see in Figure 8 that you tried using more than 8 conditional frames, but not fewer. Do you have a sense of how well forward prediction can perform with only 1 conditioning frame using VDT? Would the model need to be trained with only 1 conditioning frame?


Without hype: an analysis of the differences between VDT and Sora, and a further look ahead via Genie...

A WeChat public account post claimed there is almost no difference between VDT and Sora; drawn by that, I came to read this work.

  1. Comparing against my own technical analysis from the day OpenAI released Sora, I think that in terms of the system/algorithm pipeline and the main functional modules, the core of both is indeed Vision Transformer + Diffusion, and that is where the main similarity lies, but the differences are still substantial.
    https://github.com/yuedajiong/super-ai/blob/main/superai-20240216-sora.png
    https://github.com/yuedajiong/super-ai
    Sora's main condition is text (plus images, etc.). The captioner/recaptioning component plays a large role: it imposes strong semantic-level constraints on the video content and gives users a rich vocabulary of describable objects and object descriptions.
    Sora's generation runs after the VAE encoder, presumably with the VAE frozen; the core fitting happens in latent space, with at least roughly (20+)^3 compression from pixel space, and the patches also help reduce the problem from pixel-level consistency to the smaller patch space.
    Sora's generation is conditional, and the GPT-based prompt expansion is very useful; overall this is actually harder than unconditional generation. Accurately controlling complex content is a big challenge not only in technique but also in data volume.
    Sora organizes the latent patches as a grid; I am not sure whether some transformation is involved or whether they are simply concatenated into a grid structure. This part may help to some degree.

  2. Of course, with the arrival of Google's Genie, and measured against the ultimate vision task, "vision: 3D + dynamics + interaction + world generation", or even requirements like "free camera positions at training time; diffhash-style incremental content enrichment", going all the way remains a big challenge. In my view, even after an "organic" merger of Sora + Genie, one important problem remains: whether to give the near and middle ground of complex scenes an explicit 3D [4D/5D] representation. I think everyone is currently relaxing the ultimate constraints/requirements and trying the easy road of neural/implicit approaches; personally, I believe that is a dead end. (At the very least, even at very high cost in compute and data, it can only approximate well, never be good enough.) If explicit true 3D were represented, then we could imagine an interactive movie in which we move freely inside the scene, watch from around the protagonist, interact with them, and have the movie continue to unfold on demand.
    graphdeco-inria/gaussian-splatting#658
    In my opinion, Genie is a bigger technical breakthrough than Sora: in the end, "interactive" matters more than "visual quality".
    Clearly, this paper also realizes interaction to some degree on Collision, which I consider a big step; again, interaction ultimately matters more than visual quality. Richer generated objects, richer detail, higher resolution, less blur, and so on are more quantitative improvements. (Similar interaction-oriented work includes 2309.16237, Object Motion Guided Human Motion Synthesis.)

  3. I think that in the broad direction of "vision + generation", or even "+ [3D + interaction]", the biggest challenge for academic groups behind VDT/ViDT is the lack of massive compute and massive data: it takes a professional team of 10+ people climbing the quality curve for a year or more to produce something the public perceives as stunning. Put differently, with such resources, papers like VDT would not stop at the paper stage and could reach results similar to or even better than Sora.
    Only with large compute, the money for large data, and a necessary professional team optimizing over the long term is it possible to reach the ultimate vision task: simply put, generating interactive game scene clips comparable to UE5.

In any case, a good paper; a thumbs-up.

(Under the GFW blockade, using huggingface for things like AutoEncoderKL is really inconvenient; the authors are bullying their peers inside the wall.)

Training Code and Dataset format?

Fantastic work! Since a few months have passed and the paper has been accepted by ICLR (congrats!), would you please release the training code? Also, some instructions on how to prepare the dataset would be great!

Related: #1 (comment)
