
CCVS - Official PyTorch Implementation

Code for NeurIPS'21 paper CCVS: Context-aware Controllable Video Synthesis.

CCVS: Context-aware Controllable Video Synthesis
Guillaume Le Moing, Jean Ponce, Cordelia Schmid

Paper: https://arxiv.org/abs/2107.08037
Project page: https://16lemoing.github.io/ccvs

Abstract: This presentation introduces a self-supervised learning approach to the synthesis of new video clips from old ones, with several new key elements for improved spatial resolution and realism: It conditions the synthesis process on contextual information for temporal continuity and ancillary information for fine control. The prediction model is doubly autoregressive, in the latent space of an autoencoder for forecasting, and in image space for updating contextual information, which is also used to enforce spatio-temporal consistency through a learnable optical flow module. Adversarial training of the autoencoder in the appearance and temporal domains is used to further improve the realism of its output. A quantizer inserted between the encoder and the transformer in charge of forecasting future frames in latent space (and its inverse inserted between the transformer and the decoder) adds even more flexibility by affording simple mechanisms for handling multimodal ancillary information for controlling the synthesis process (e.g., a few sample frames, an audio track, a trajectory in image space) and taking into account the intrinsically uncertain nature of the future by allowing multiple predictions. Experiments with an implementation of the proposed approach give very good qualitative and quantitative results on multiple tasks and standard benchmarks.

Installation

The code has been tested with PyTorch 1.7.0 and Python 3.8.6.

To install dependencies with conda run:

conda env create -f env.yml
conda activate ccvs

To install apex run:

cd tools
git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
cd ../..
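
If building the C++ and CUDA extensions fails (for instance, because of a compiler or CUDA version mismatch), apex can instead be installed as a Python-only build. This is a fallback not mentioned in the original instructions and skips the fused CUDA kernels:

pip install -v --no-cache-dir ./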

Prepare datasets

BAIR Robot Pushing - (Repo) - (License)

Create corresponding directory:

mkdir datasets/bairhd

Download the high-resolution data from this link, put it in the new directory, and then run:

tar -xvf datasets/bairhd/softmotion_0511.tar.gz -C datasets/bairhd

Preprocess BAIR dataset for resolution 256x256:

python data/scripts/preprocess_bairhd.py --data_root datasets/bairhd --dim 256

We also provide an annotation tool, used later to estimate the (x, y) position of the arm:

python data/scripts/annotate_bairhd.py --data_root datasets/bairhd/original_frames_256 --out_dir datasets/bairhd/annotated_frames

Kinetics-600 - (Repo) - (License)

This dataset is a collection of YouTube links; the corresponding train and test videos are extracted by running:

mkdir datasets/kinetics
wget https://storage.googleapis.com/deepmind-media/Datasets/kinetics600.tar.gz -P datasets/kinetics
tar -xvf datasets/kinetics/kinetics600.tar.gz -C datasets/kinetics
python data/scripts/download_kinetics.py datasets/kinetics/kinetics600/train.csv datasets/kinetics/kinetics600/train_videos --trim
python data/scripts/download_kinetics.py datasets/kinetics/kinetics600/test.csv datasets/kinetics/kinetics600/test_videos --trim

Preprocess the dataset:

python data/scripts/preprocess_kinetics.py --src_folder datasets/kinetics/kinetics600/train_videos --out_root datasets/kinetics/preprocessed_videos --out_name train_64p_square_32t --max_vid_len 32 --resize 64 --square_crop
python data/scripts/preprocess_kinetics.py --src_folder datasets/kinetics/kinetics600/test_videos --out_root datasets/kinetics/preprocessed_videos --out_name test_64p_square_32t --max_vid_len 32 --resize 64 --square_crop

Split the data into folds and precompute metadata for faster training/testing:

python data/scripts/compute_folds_kinetics.py train 100 64p_square_32t
python data/scripts/compute_folds_kinetics.py test 40 64p_square_32t --max_per_fold 1248

AudioSet-Drums - (Repo) - (License) - (License of curated version)

Create corresponding directory:

mkdir datasets/drums

Download the data from this link and run:

unzip datasets/drums/AudioSet_Drums.zip -d datasets/drums

UCF101 - (Repo)

Create corresponding directory:

mkdir datasets/ucf101

Download the data from this link and run:

mkdir datasets/ucf101/videos
unrar e datasets/ucf101/UCF101.rar datasets/ucf101/videos

Training

BAIR Robot Pushing

First, train the frame autoencoder:

bash scripts/bairhd/train_frame_autoencoder.sh

Then, train the transformer for the different tasks (change --q_load_path in the corresponding scripts so that it points to the checkpoint of the trained autoencoder; a sketch follows the list below):

  • Video prediction
bash scripts/bairhd/train_transformer.sh
  • Point-to-point synthesis
bash scripts/bairhd/train_transformer_p2p.sh
  • State-conditioned synthesis (this requires training a state estimator first and changing the corresponding --s_load_path before training the transformer)
bash scripts/bairhd/train_state_estimator.sh
bash scripts/bairhd/train_transformer_state.sh
  • Unconditional synthesis
bash scripts/bairhd/train_transformer_unc.sh
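
As a sketch of the required change (the path below is a hypothetical placeholder, not an actual checkpoint name produced by this repository), the transformer training scripts would pass the frame autoencoder checkpoint as:

--q_load_path checkpoints/<frame_autoencoder_experiment>/<checkpoint_file>

For state-conditioned synthesis, --s_load_path is set in the same way to point to the trained state estimator checkpoint.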

Kinetics-600

The same applies, e.g., for video prediction:

bash scripts/kinetics/train_frame_autoencoder.sh
bash scripts/kinetics/train_transformer.sh

UCF101

The same applies, e.g., for video prediction:

bash scripts/ucf101/train_frame_autoencoder.sh
bash scripts/ucf101/train_transformer.sh

AudioSet-Drums

For audio-conditioned synthesis, we train two encoders (one to compress frames, the other to compress sound features) and then train the transformer:

bash scripts/drums/train_frame_autoencoder.sh
bash scripts/drums/train_stft_autoencoder.sh
bash scripts/drums/train_transformer_audio.sh

Inference

We provide checkpoints for various configurations:

Dataset            | Future prediction | Point-to-point synthesis | State-conditioned synthesis | Sound-conditioned synthesis | Unconditional synthesis | Download
BAIR Robot Pushing | ✓                 | ✓                        | ✓                           |                             | ✓                       | checkpoint
Kinetics-600       | ✓                 | ✓                        |                             |                             |                         | checkpoint
UCF101             | ✓                 |                          |                             |                             |                         | checkpoint
AudioSet-Drums     | ✓                 |                          |                             | ✓                           |                         | checkpoint

Extract checkpoints with the following command (replacing CKPT.zip with the corresponding name):

unzip CKPT.zip -d checkpoints/
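
For example, with a hypothetical archive name (use the name of the file actually downloaded):

unzip bairhd_checkpoint.zip -d checkpoints/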

Synthesize videos from the downloaded checkpoints:

BAIR Robot Pushing

bash scripts/bairhd/save_videos_state_off.sh
bash scripts/bairhd/save_videos_p2p.sh
bash scripts/bairhd/save_videos_state_on.sh
bash scripts/bairhd/save_videos_unc.sh

Kinetics-600

bash scripts/kinetics600/save_videos.sh
bash scripts/kinetics600/save_videos_p2p.sh

UCF101

bash scripts/ucf101/save_videos.sh

AudioSet-Drums

bash scripts/drums/save_videos_audio_off.sh
bash scripts/drums/save_videos_audio_on.sh

Here are some important flags (an example follows the list):

  • --vid_len: the total number of frames in synthetic videos (including conditioning frames)
  • --x_cond_len: the number of tokens corresponding to conditioning frames. In the preceding experiments one frame is represented by 64 tokens, so one can set this flag to 0 for unconditional synthesis, 64 for one input frame, 128 for two, and so on
  • --keep_state: add this flag in sound- or state-conditioned synthesis to effectively use the control (otherwise the sound / state is also predicted)
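
For example, to synthesize 16-frame videos conditioned on the first two frames (2 x 64 = 128 tokens), the flags inside the corresponding save_videos script would read as follows (a sketch; the rest of the command in the script is left unchanged):

--vid_len 16 --x_cond_len 128

For sound- or state-conditioned synthesis, one would additionally pass --keep_state so that the control signal is used rather than predicted.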

Evaluation

After inference, compute evaluation metrics with the following commands:

python tools/tf_fvd/fvd.py --exp_tag TAG
python tools/pytorch_metrics/metrics.py --exp_tag TAG

where TAG is the name of the directory (inside the results/ folder) under which videos were saved during inference. The first command computes the Fréchet video distance (FVD), and the second one computes the peak signal-to-noise ratio (PSNR) and the structural similarity index measure (SSIM). One can use the --idx flag to compute PSNR / SSIM for specific timesteps.
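
For instance, assuming --idx accepts the timestep indices to evaluate (as described above), PSNR / SSIM at a single timestep could be computed with:

python tools/pytorch_metrics/metrics.py --exp_tag TAG --idx 15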

Citation

If you find this code useful in your research, please consider citing:

@inproceedings{lemoing2021ccvs,
  title     = {{CCVS}: Context-aware Controllable Video Synthesis},
  author    = {Guillaume Le Moing and Jean Ponce and Cordelia Schmid},
  booktitle = {NeurIPS},
  year      = {2021}
}

Acknowledgments

This code borrows from StyleGAN2, minGPT, pytorch-liteflownet and VQVAE.

License

CCVS is released under the MIT license.
