ssm-meets-video-diffusion-models's Introduction

SSM-Meets-Video-Diffusion-Models

"SSM Meets Video Diffusion Models: Efficient Video Generation with Structured State Spaces" [Paper]

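The core idea is to replace attention-based temporal layers with structured state-space (S4D-style) layers run bidirectionally over the frame axis. A toy numpy sketch of a diagonal bidirectional SSM scan (illustrative only; the repository's S4D layer is more elaborate):

```python
import numpy as np

def ssm_scan(u, a, b, c):
    """Diagonal linear SSM: x_k = a * x_{k-1} + b * u_k, y_k = sum(c * x_k)."""
    x = np.zeros_like(a)
    ys = []
    for u_k in u:               # u: (T,) sequence
        x = a * x + b * u_k     # elementwise (diagonal) state update
        ys.append(np.sum(c * x))
    return np.array(ys)

def bidirectional_ssm(u, a, b, c):
    """Sum of a forward and a backward scan, in the spirit of a 'bi-s4d' layer."""
    fwd = ssm_scan(u, a, b, c)
    bwd = ssm_scan(u[::-1], a, b, c)[::-1]
    return fwd + bwd

# Toy example: 8 frames, 4-dimensional diagonal state.
rng = np.random.default_rng(0)
u = rng.standard_normal(8)
a = np.full(4, 0.9)   # stable diagonal state matrix
b = np.ones(4)
c = np.ones(4) / 4
y = bidirectional_ssm(u, a, b, c)
```

Unlike attention, each scan is linear in the number of frames, which is what makes the long-sequence settings in the table below (200 and 400 frames) tractable.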
Device Details

| Dataset | UCF101 | UCF101 | MineRL | MineRL | MineRL |
|---|---|---|---|---|---|
| # of Frames | 16 | 16 | 64 | 200 | 400 |
| Resolution | $32 \times 32$ | $64 \times 64$ | $32 \times 32$ | $32 \times 32$ | $32 \times 32$ |
| Training Steps | 92k | 106k | 174k | 255k | 246k |
| GPUs | V100 $\times 4$ | A100 $\times 8$ | V100 $\times 4$ | A100 $\times 8$ | A100 $\times 8$ |
| Training Time | 72 hours | 120 hours | 72 hours | 100 hours | 120 hours |

Settings

Please use `./Dockerfile` to build a Docker image, or install the Python libraries specified in that Dockerfile directly.

Run Experimental Codes

Downloading Datasets

UCF101

  1. Follow the commands in `./dl_ucf101.ipynb` to download the dataset.
  2. Specify `ucf101-all` as `--dataset` and `.` as `--folder`.

MineRL Navigate

  1. Run the following Python script:
python dl_mine_rl.py
  2. Specify `minerl` as `--dataset` and `minerl_navigate-torch` as `--folder`.
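For orientation, a training batch at the 64-frame MineRL setting (32×32 resolution) occupies 12 MiB in float32. The (batch, channels, frames, height, width) layout below is a common video-diffusion convention, assumed here rather than read from the repository:

```python
import numpy as np

# Dummy batch: 16 clips of 64 RGB frames at 32x32 resolution.
batch = np.zeros((16, 3, 64, 32, 32), dtype=np.float32)

mib = batch.nbytes / 2**20  # 12.0 MiB per batch (float32 = 4 bytes/element)
```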

Training

Flags are grouped per line: learning settings, architecture settings, temporal-layer settings, dataset settings, input/output paths, and GPU device IDs.

```bash
python train_video-diffusion.py \
  --timesteps 256 --loss_type 'l2' --train_lr 0.0003 --train_num_steps 700000 \
  --train_batch_size 16 --gradient_accumulate_every 2 --ema_decay 0.995 \
  --base_channel_size 64 --timeemb_linears 2 \
  --temporal_layer 'bi-s4d' --s4d_version 8 \
  --image_size 32 --dataset 'ucf101-all' \
  --folder 'path/to/datasets' \
  --results_folder 'path/to/save' \
  --device_ids 0 1 2 3
```
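Note that with `--train_batch_size 16` and `--gradient_accumulate_every 2`, the effective batch size per optimizer update is 32. A quick sanity check (assuming `--train_num_steps` counts optimizer steps, which is an interpretation of the flag name, not confirmed from the code):

```python
train_batch_size = 16
gradient_accumulate_every = 2
train_num_steps = 700_000

effective_batch = train_batch_size * gradient_accumulate_every  # 32
samples_seen = effective_batch * train_num_steps                # 22,400,000
```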

Sampling

The learning, architecture, temporal-layer, and dataset flags must match the training run. The remaining lines set the number of samples, the training milestone (checkpoint) to load, and the sampling device and seed.

```bash
python sample_video-diffusion.py \
  --timesteps 256 --loss_type 'l2' --train_lr 0.0003 --train_num_steps 700000 \
  --train_batch_size 16 --gradient_accumulate_every 2 --ema_decay 0.995 \
  --base_channel_size 64 --timeemb_linears 2 \
  --temporal_layer 'bi-s4d' --s4d_version 8 \
  --image_size 32 --dataset 'ucf101-all' \
  --folder 'path/to/datasets' \
  --results_folder 'path/to/save' \
  --num_samples 2500 --sample_batch_size 10 --sample_save_every 10 \
  --milestone 92 \
  --device_ids 0 --seed 0
```
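At these settings, sampling proceeds in `num_samples / sample_batch_size` batches. Treating `--sample_save_every` as a periodic-save interval is an assumption based on the flag name:

```python
import math

num_samples = 2500
sample_batch_size = 10
sample_save_every = 10

num_batches = math.ceil(num_samples / sample_batch_size)  # 250 sampling batches
num_saves = math.ceil(num_batches / sample_save_every)    # 25 periodic saves
```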

Evaluation

Evaluation reuses the training, sampling-count, and milestone flags; the commented-out flags at the end are optional evaluation settings.

```bash
python eval_video-diffusion.py \
  --timesteps 256 --loss_type 'l2' --train_lr 0.0003 --train_num_steps 700000 \
  --train_batch_size 16 --gradient_accumulate_every 2 --ema_decay 0.995 \
  --base_channel_size 64 --timeemb_linears 2 \
  --temporal_layer 'bi-s4d' --s4d_version 8 \
  --image_size 32 --dataset 'ucf101-all' \
  --folder 'path/to/datasets' \
  --results_folder 'path/to/save' \
  --num_samples 2500 --sample_batch_size 10 --sample_save_every 10 \
  --milestone 92
# Optional evaluation settings:
# --seed 0 --sample_seeds 0 1 2 3 --eval_batch_size 100
```
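When the optional `--sample_seeds 0 1 2 3` flags are used, a natural way to report a metric is its mean and standard deviation across seeds. A minimal sketch with made-up placeholder scores (not results from the paper):

```python
import numpy as np

# Hypothetical per-seed metric values (e.g., FVD); placeholders only.
per_seed_scores = np.array([310.2, 305.8, 312.5, 308.1])

mean_score = per_seed_scores.mean()
std_score = per_seed_scores.std(ddof=1)  # sample standard deviation across seeds
```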

Citation

@misc{ssmvdm2024,
      title={SSM Meets Video Diffusion Models: Efficient Video Generation with Structured State Spaces}, 
      author={Yuta Oshima and Shohei Taniguchi and Masahiro Suzuki and Yutaka Matsuo},
      year={2024},
      eprint={2403.07711},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

ssm-meets-video-diffusion-models's People

Contributors

shim0114


ssm-meets-video-diffusion-models's Issues

mamba in code

When the version is 19 or 22, does it correspond to using Mamba instead of the SSM? Thanks.

GLU or not ?

Hello,
Thank you for this implementation of your paper. If I understood correctly, version 16 is the code described in the paper, but I cannot understand why, in Unet3D, here:

```python
elif version in range(16, 19) or version == 21:
    self.s4d = S4D(dim, hidden_dim, transposed=False, output_glu=False)
    if bidirectional:
        self.s4d_rev = S4D(dim, hidden_dim, transposed=False, output_glu=False)
```

the GLU output is False. The model described in the paper uses a GLU layer, so I am a bit confused about this part of the code.

Thank you in advance,
A.B.H
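For context, a GLU gates one half of a projection's output with the sigmoid of the other half; an `output_glu` flag presumably toggles such a final gating on or off. A minimal numpy sketch of the GLU operation itself (not the repository's implementation):

```python
import numpy as np

def glu(x, axis=-1):
    """Gated Linear Unit: split x into halves (a, b), return a * sigmoid(b)."""
    a, b = np.split(x, 2, axis=axis)
    return a * (1.0 / (1.0 + np.exp(-b)))

x = np.array([1.0, 2.0, 0.0, 0.0])  # gate half is zero, so sigmoid(0) = 0.5
y = glu(x)                          # [0.5, 1.0]
```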
