lucidrains / phenaki-pytorch


License: MIT License

Topics: artificial-intelligence, attention-mechanisms, deep-learning, text-to-video, transformers, imagination-machine

phenaki-pytorch's Introduction

Phenaki - Pytorch

Implementation of Phenaki Video, which uses MaskGIT to produce text-guided videos of up to 2 minutes in length, in PyTorch. It will also combine another technique involving a token critic for potentially even better generations.

Please join us on Discord if you are interested in replicating this work in the open.

AI Coffeebreak explanation

Appreciation

  • Stability.ai for the generous sponsorship to work on cutting edge artificial intelligence research

  • 🤗 Hugging Face for their amazing transformers and accelerate libraries

  • Guillem for his ongoing contributions

  • You? If you are a great machine learning engineer and / or researcher, feel free to contribute to the frontier of open source generative AI

Install

$ pip install phenaki-pytorch

Usage

C-ViViT

import torch
from phenaki_pytorch import CViViT, CViViTTrainer

cvivit = CViViT(
    dim = 512,
    codebook_size = 65536,
    image_size = 256,
    patch_size = 32,
    temporal_patch_size = 2,
    spatial_depth = 4,
    temporal_depth = 4,
    dim_head = 64,
    heads = 8
).cuda()

trainer = CViViTTrainer(
    cvivit,
    folder = '/path/to/images/or/videos',
    batch_size = 4,
    grad_accum_every = 4,
    train_on_images = False,  # you can train on images first, before fine-tuning on video, for sample efficiency
    use_ema = False,          # recommended to be turned on (keeps an exponential moving average of the cvivit), unless you don't have enough resources
    num_train_steps = 10000
)

trainer.train()               # reconstructions and checkpoints will be saved periodically to ./results
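Before moving on to MaskGit below, it can help to estimate how many tokens the C-ViViT above produces per clip. The first frame is tokenized on its own and the remaining frames in groups of temporal_patch_size (hence the "frames + 1 leading frame" convention used elsewhere in this README); a rough back-of-the-envelope sketch under those assumptions:

image_size = 256
patch_size = 32
temporal_patch_size = 2
num_frames = 17   # 1 leading frame + 16 subsequent frames

tokens_per_slice = (image_size // patch_size) ** 2               # 8 * 8 = 64 spatial tokens per temporal slice
temporal_slices = 1 + (num_frames - 1) // temporal_patch_size    # 1 + 8 = 9
total_video_tokens = tokens_per_slice * temporal_slices          # 576, to compare against MaskGit's max_seq_len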

Phenaki

import torch
from phenaki_pytorch import CViViT, MaskGit, Phenaki

cvivit = CViViT(
    dim = 512,
    codebook_size = 65536,
    image_size = (256, 128),  # video with rectangular screen allowed
    patch_size = 32,
    temporal_patch_size = 2,
    spatial_depth = 4,
    temporal_depth = 4,
    dim_head = 64,
    heads = 8
)

cvivit.load('/path/to/trained/cvivit.pt')

maskgit = MaskGit(
    num_tokens = 5000,
    max_seq_len = 1024,
    dim = 512,
    dim_context = 768,
    depth = 6,
)

phenaki = Phenaki(
    cvivit = cvivit,
    maskgit = maskgit
).cuda()

videos = torch.randn(3, 3, 17, 256, 128).cuda() # (batch, channels, frames, height, width)
mask = torch.ones((3, 17)).bool().cuda() # [optional] (batch, frames) - allows for co-training videos of different lengths as well as video and images in the same batch

texts = [
    'a whale breaching from afar',
    'young girl blowing out candles on her birthday cake',
    'fireworks with blue and green sparkles'
]

loss = phenaki(videos, texts = texts, video_frame_mask = mask)
loss.backward()

# do the above for many steps, then ...

video = phenaki.sample(texts = 'a squirrel examines an acorn', num_frames = 17, cond_scale = 5.) # (1, 3, 17, 256, 128)

# so in the paper, they do not really achieve 2 minutes of coherent video
# at each new scene with new text conditioning, they condition on the previous K frames
# you can easily achieve this with this framework, like so

video_prime = video[:, :, -3:] # (1, 3, 3, 256, 128) # say K = 3

video_next = phenaki.sample(texts = 'a cat watches the squirrel from afar', prime_frames = video_prime, num_frames = 14) # (1, 3, 14, 256, 128)

# the total video

entire_video = torch.cat((video, video_next), dim = 2) # (1, 3, 17 + 14, 256, 128)

# and so on...

Or just import the make_video function

# ... above code

from phenaki_pytorch import make_video

entire_video, scenes = make_video(phenaki, texts = [
    'a squirrel examines an acorn buried in the snow',
    'a cat watches the squirrel from a frosted window sill',
    'zoom out to show the entire living room, with the cat residing by the window sill'
], num_frames = (17, 14, 14), prime_lengths = (5, 5))

entire_video.shape # (1, 3, 17 + 14 + 14 = 45, 256, 128)

# scenes - a list of 3 tensors, the video segment for each scene

That's it!
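The Todo list below mentions basic video manipulation code for saving a sampled tensor as a GIF. If you prefer to roll your own, here is a minimal sketch, assuming a video tensor shaped (1, channels, frames, height, width) with values in [0, 1] and using Pillow (not a dependency of this repo):

import torch
from PIL import Image

def save_video_as_gif(video, path, fps = 6):
    # video: (1, channels, frames, height, width), values assumed in [0, 1]
    frames = video[0].clamp(0., 1.).permute(1, 2, 3, 0)   # -> (frames, height, width, channels)
    frames = (frames * 255).byte().cpu().numpy()
    images = [Image.fromarray(frame) for frame in frames]
    images[0].save(path, save_all = True, append_images = images[1:], duration = int(1000 / fps), loop = 0)

save_video_as_gif(entire_video, './entire_video.gif')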

Token Critic

A new paper suggests that instead of relying on the predicted probabilities of each token as a measure of confidence, one can train an extra critic to decide what to iteratively mask during sampling. You can optionally train this critic for potentially better generations as shown below

import torch
from phenaki_pytorch import CViViT, MaskGit, TokenCritic, Phenaki

cvivit = CViViT(
    dim = 512,
    codebook_size = 65536,
    image_size = (256, 128),
    patch_size = 32,
    temporal_patch_size = 2,
    spatial_depth = 4,
    temporal_depth = 4,
    dim_head = 64,
    heads = 8
)

maskgit = MaskGit(
    num_tokens = 5000,
    max_seq_len = 1024,
    dim = 512,
    dim_context = 768,
    depth = 6,
)

# (1) define the critic

critic = TokenCritic(
    num_tokens = 5000,
    max_seq_len = 1024,
    dim = 512,
    dim_context = 768,
    depth = 6,
    has_cross_attn = True
)

trainer = Phenaki(
    maskgit = maskgit,
    cvivit = cvivit,
    critic = critic    # and then (2) pass it into Phenaki
).cuda()

texts = [
    'a whale breaching from afar',
    'young girl blowing out candles on her birthday cake',
    'fireworks with blue and green sparkles'
]

videos = torch.randn(3, 3, 3, 256, 128).cuda() # (batch, channels, frames, height, width)

loss = trainer(videos = videos, texts = texts)
loss.backward()

Or, even simpler, just reuse MaskGit itself as a Self Critic (Nijkamp et al.) by setting self_token_critic = True when initializing Phenaki

phenaki = Phenaki(
    ...,
    self_token_critic = True
)

Now your generations should be greatly improved!
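With either form of critic, sampling is the same call shown earlier; the critic only changes which tokens get re-masked at each iteration:

video = phenaki.sample(texts = 'a squirrel examines an acorn', num_frames = 17, cond_scale = 5.) # (batch, channels, frames, height, width)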

Phenaki Trainer

This repository will also endeavor to allow the researcher to train on text-to-image and then text-to-video. Similarly, for unconditional training, the researcher should be able to first train on images and then fine tune on video. Below is an example for text-to-video

import torch
from torch.utils.data import Dataset
from phenaki_pytorch import CViViT, MaskGit, Phenaki, PhenakiTrainer

cvivit = CViViT(
    dim = 512,
    codebook_size = 65536,
    image_size = 256,
    patch_size = 32,
    temporal_patch_size = 2,
    spatial_depth = 4,
    temporal_depth = 4,
    dim_head = 64,
    heads = 8
)

cvivit.load('/path/to/trained/cvivit.pt')

maskgit = MaskGit(
    num_tokens = 5000,
    max_seq_len = 1024,
    dim = 512,
    dim_context = 768,
    depth = 6,
    unconditional = False
)

phenaki = Phenaki(
    cvivit = cvivit,
    maskgit = maskgit
).cuda()

# mock text video dataset
# you will have to supply your own, returning a (<video tensor>, <caption>) tuple

class MockTextVideoDataset(Dataset):
    def __init__(
        self,
        length = 100,
        image_size = 256,
        num_frames = 17
    ):
        super().__init__()
        self.num_frames = num_frames
        self.image_size = image_size
        self.len = length

    def __len__(self):
        return self.len

    def __getitem__(self, idx):
        video = torch.randn(3, self.num_frames, self.image_size, self.image_size)
        caption = 'video caption'
        return video, caption

dataset = MockTextVideoDataset()

# pass in the dataset

trainer = PhenakiTrainer(
    phenaki = phenaki,
    batch_size = 4,
    grad_accum_every = 4,
    train_on_images = False, # if your mock dataset above returns (images, caption) pairs, set this to True
    dataset = dataset,       # pass in your dataset here
    sample_texts_file_path = '/path/to/captions.txt' # each caption should be on its own line; during sampling, captions will be randomly drawn
)

trainer.train()
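For a real dataset, the only contract is the one the mock dataset above demonstrates: __getitem__ returns a (video tensor, caption) tuple with the video shaped (channels, frames, height, width). Below is a minimal sketch that reads .mp4 files with same-named .txt captions via torchvision; the folder layout and class name are illustrative, not part of this repo:

from pathlib import Path
import torch
import torch.nn.functional as F
from torch.utils.data import Dataset
from torchvision.io import read_video

class TextVideoFolderDataset(Dataset):
    def __init__(self, folder, num_frames = 17, image_size = 256):
        super().__init__()
        self.paths = sorted(Path(folder).glob('*.mp4'))  # assumes <name>.mp4 with a matching <name>.txt caption
        self.num_frames = num_frames
        self.image_size = image_size

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        path = self.paths[idx]
        video, _, _ = read_video(str(path), pts_unit = 'sec')                    # (frames, height, width, channels), uint8
        video = video[:self.num_frames]                                          # assumes each clip has at least num_frames frames
        video = video.permute(3, 0, 1, 2).float() / 255.                         # -> (channels, frames, height, width)
        video = F.interpolate(video, size = (self.image_size, self.image_size))  # resize the spatial dimensions
        caption = path.with_suffix('.txt').read_text().strip()
        return video, caption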

Unconditional training is as follows

ex. unconditional image and video training

import torch
from phenaki_pytorch import CViViT, MaskGit, Phenaki, PhenakiTrainer

cvivit = CViViT(
    dim = 512,
    codebook_size = 65536,
    image_size = 256,
    patch_size = 32,
    temporal_patch_size = 2,
    spatial_depth = 4,
    temporal_depth = 4,
    dim_head = 64,
    heads = 8
)

cvivit.load('/path/to/trained/cvivit.pt')

maskgit = MaskGit(
    num_tokens = 5000,
    max_seq_len = 1024,
    dim = 512,
    dim_context = 768,
    depth = 6,
    unconditional = True
)

phenaki = Phenaki(
    cvivit = cvivit,
    maskgit = maskgit
).cuda()

# pass in the folder to images or video

trainer = PhenakiTrainer(
    phenaki = phenaki,
    batch_size = 4,
    grad_accum_every = 4,
    train_on_images = True,                # for the sake of this example, the folder below contains images
    dataset = '/path/to/images/or/video'
)

trainer.train()

Todo

  • pass mask probability into maskgit and auto-mask and get cross entropy loss

  • cross attention + get t5 embeddings code from imagen-pytorch and get classifier free guidance wired up

  • wire up full vqgan-vae for c-vivit, just take what is in parti-pytorch already, but make sure to use a stylegan discriminator as said in paper

  • complete token critic training code

  • complete first pass of maskgit scheduled sampling + token critic (optionally without if researcher does not want to do extra training)

  • inference code that allows for sliding time + conditioning on K past frames

  • alibi pos bias for temporal attention

  • give spatial attention the most powerful positional bias

  • make sure to use stylegan-esque discriminator

  • 3d relative positional bias for maskgit

  • make sure maskgit can also support training of images, and make sure it works on local machine

  • also build option for token critic to be conditioned with the text

  • should be able to train for text to image generation first

  • make sure critic trainer can take in cvivit and automatically pass in video patch shape for relative positional bias - make sure critic also gets optimal relative positional bias

  • training code for cvivit

  • move cvivit into own file

  • unconditional generations (both video and images)

  • wire up accelerate for multi-gpu training for both c-vivit and maskgit

  • add depthwise-convs to cvivit for position generating

  • some basic video manipulation code, allow for sampled tensor to be saved as gif

  • basic critic training code

  • add position generating dsconv to maskgit too

  • outfit customizable self attention blocks to stylegan discriminator

  • add all top of the line research for stabilizing transformers training

  • get some basic critic sampling code, show comparison of with and without critic

  • bring in concatenative token shift (temporal dimension)

  • add a DDPM upsampler, either port from imagen-pytorch or just rewrite a simple version here

  • take care of masking in maskgit

  • test maskgit + critic alone on oxford flowers dataset

  • support rectangular sized videos

  • add flash attention as an option for all transformers and cite @tridao

Citations

@article{Villegas2022PhenakiVL,
    title   = {Phenaki: Variable Length Video Generation From Open Domain Textual Description},
    author  = {Ruben Villegas and Mohammad Babaeizadeh and Pieter-Jan Kindermans and Hernan Moraldo and Han Zhang and Mohammad Taghi Saffar and Santiago Castro and Julius Kunze and D. Erhan},
    journal = {ArXiv},
    year    = {2022},
    volume  = {abs/2210.02399}
}
@article{Chang2022MaskGITMG,
    title   = {MaskGIT: Masked Generative Image Transformer},
    author  = {Huiwen Chang and Han Zhang and Lu Jiang and Ce Liu and William T. Freeman},
    journal = {2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    year    = {2022},
    pages   = {11305-11315}
}
@article{Lezama2022ImprovedMI,
    title   = {Improved Masked Image Generation with Token-Critic},
    author  = {Jos{\'e} Lezama and Huiwen Chang and Lu Jiang and Irfan Essa},
    journal = {ArXiv},
    year    = {2022},
    volume  = {abs/2209.04439}
}
@misc{ding2021cogview,
    title   = {CogView: Mastering Text-to-Image Generation via Transformers},
    author  = {Ming Ding and Zhuoyi Yang and Wenyi Hong and Wendi Zheng and Chang Zhou and Da Yin and Junyang Lin and Xu Zou and Zhou Shao and Hongxia Yang and Jie Tang},
    year    = {2021},
    eprint  = {2105.13290},
    archivePrefix = {arXiv},
    primaryClass = {cs.CV}
}
@misc{shazeer2020glu,
    title   = {GLU Variants Improve Transformer},
    author  = {Noam Shazeer},
    year    = {2020},
    url     = {https://arxiv.org/abs/2002.05202}
}
@misc{press2021ALiBi,
    title   = {Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation},
    author  = {Ofir Press and Noah A. Smith and Mike Lewis},
    year    = {2021},
    url     = {https://ofir.io/train_short_test_long.pdf}
}
@article{Liu2022SwinTV,
    title   = {Swin Transformer V2: Scaling Up Capacity and Resolution},
    author  = {Ze Liu and Han Hu and Yutong Lin and Zhuliang Yao and Zhenda Xie and Yixuan Wei and Jia Ning and Yue Cao and Zheng Zhang and Li Dong and Furu Wei and Baining Guo},
    journal = {2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    year    = {2022},
    pages   = {11999-12009}
}
@inproceedings{Nijkamp2021SCRIPTSP,
    title   = {SCRIPT: Self-Critic PreTraining of Transformers},
    author  = {Erik Nijkamp and Bo Pang and Ying Nian Wu and Caiming Xiong},
    booktitle = {North American Chapter of the Association for Computational Linguistics},
    year    = {2021}
}
@misc{kumar2023dualpatchnorm,
    doi     = {10.48550/ARXIV.2302.01327},
    url     = {https://arxiv.org/abs/2302.01327},
    author  = {Kumar, Manoj and Dehghani, Mostafa and Houlsby, Neil},
    title   = {Dual PatchNorm},
    publisher = {arXiv},
    year    = {2023},
    copyright = {Creative Commons Attribution 4.0 International}
}
@misc{gilmer2023intriguing,
    title  = {Intriguing Properties of Transformer Training Instabilities},
    author = {Justin Gilmer and Andrea Schioppa and Jeremy Cohen},
    year   = {2023},
    status = {to be published - one attention stabilization technique is circulating within Google Brain, being used by multiple teams}
}
@misc{mentzer2023finite,
    title   = {Finite Scalar Quantization: VQ-VAE Made Simple},
    author  = {Fabian Mentzer and David Minnen and Eirikur Agustsson and Michael Tschannen},
    year    = {2023},
    eprint  = {2309.15505},
    archivePrefix = {arXiv},
    primaryClass = {cs.CV}
}
@misc{yu2023language,
    title   = {Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation},
    author  = {Lijun Yu and José Lezama and Nitesh B. Gundavarapu and Luca Versari and Kihyuk Sohn and David Minnen and Yong Cheng and Agrim Gupta and Xiuye Gu and Alexander G. Hauptmann and Boqing Gong and Ming-Hsuan Yang and Irfan Essa and David A. Ross and Lu Jiang},
    year    = {2023},
    eprint  = {2310.05737},
    archivePrefix = {arXiv},
    primaryClass = {cs.CV}
}

phenaki-pytorch's People

Contributors

gmegh, lucidrains


phenaki-pytorch's Issues

In phenaki to output video_codebook_ids...

I got video_codebook_ids of shape (batch_size, t x h x w, hidden_dim) from C-ViViT. But if the video is flattened into one dimension like this, I don't think the temporal and spatial dimensions are treated separately by the transformer and the VQ.

In the paper's figure, the tokens are separated per frame, so I cannot understand why the output of C-ViViT has shape (batch_size, t*h*w, hidden_dim).

Thank you

ImportError: T5Tokenizer requires the SentencePiece library but it was not found in your environment.

When I run the given example:

import torch
from phenaki_pytorch import CViViT, MaskGit, TokenCritic, PhenakiCritic

cvivit = CViViT(
    dim = 512,
    codebook_size = 5000,
    image_size = (256, 128),
    patch_size = 32,
    temporal_patch_size = 2,
    spatial_depth = 4,
    temporal_depth = 4,
    dim_head = 64,
    heads = 8
)

maskgit = MaskGit(
    num_tokens = 5000,
    max_seq_len = 1024,
    dim = 512,
    dim_context = 768,
    depth = 6,
)

critic = TokenCritic(
    num_tokens = 5000,
    max_seq_len = 1024,
    dim = 512,
    dim_context = 768,
    depth = 6
)

critic_trainer = PhenakiCritic(
    maskgit = maskgit,
    critic = critic,
    cvivit = cvivit
)

texts = [
    'a whale breaching from afar',
    'young girl blowing out candles on her birthday cake',
    'fireworks with blue and green sparkles'
]

videos = torch.randn(3, 3, 3, 256, 128) # (batch, channels, frames, height, width)

loss = critic_trainer(videos = videos, texts = texts)
loss.backward()

I get this error:
ImportError:
T5Tokenizer requires the SentencePiece library but it was not found in your environment.

What is out of index?

I got this error at loss = F.cross_entropy(logits[mask_token_mask], x[mask_token_mask]). I traced it to line 181 of phenaki_pytorch.py.
After pos_emb(torch.arange(...)), all the tensors come back as torch.tensor objects, and then the CUDA error occurs:

CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

How can I solve it?

Compatibility issue with A100-80G with version 0.0.67

I'm not able to use the A100-80G GPU because the torch version with 0.0.67 is not compatible with the card.

I'm not sure what my previous version was, probably the one before 0.0.66.

Is there a plan to upgrade the torch version? Any workaround?

Some errors that appear on the training results

When we tried your sample model, the result was a noisy video: every frame is a noisy image. I wonder whether we need different inputs, or whether something else went wrong during training?

This is one frame of entire_video: [image attachment]

This is our resulting video: [attachment: result.mp4]

Thanks a lot

Different video sizes

While yesterday's updates allow all training videos to be rectangular, I believe there is currently no way for them to be of different sizes from each other.
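Different spatial sizes within one batch are indeed not supported, but different lengths are, via the (batch, frames) mask shown in the Phenaki usage above. A minimal collate sketch; the collate_variable_length helper is illustrative, not part of this repo:

import torch
import torch.nn.functional as F

def collate_variable_length(videos):
    # videos: list of tensors, each (channels, frames_i, height, width); spatial size must already match
    max_frames = max(v.shape[1] for v in videos)
    padded, mask = [], []
    for v in videos:
        padded.append(F.pad(v, (0, 0, 0, 0, 0, max_frames - v.shape[1])))  # pad the frame dimension
        mask.append(torch.arange(max_frames) < v.shape[1])
    return torch.stack(padded), torch.stack(mask)

videos = [torch.randn(3, 17, 256, 128), torch.randn(3, 9, 256, 128)]
videos, video_frame_mask = collate_variable_length(videos)   # (2, 3, 17, 256, 128), (2, 17)
# loss = phenaki(videos, texts = texts, video_frame_mask = video_frame_mask)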

Einops Error. Shape mismatch

I trained the C-ViViT encoder with images first, got the vae loss down to ~0.05 and disc loss to ~0.007.

Then I tried to fine-tune it on a GIF dataset, but it returned an einops shape mismatch error.

This only happens when I use a batch_size other than 1. The model keeps training properly with batch_size 1.

Are we supposed to use only 1 as batch_size? The sample code in the README has 4 as batch_size.

Save video?

How do you save or view the video created?

Unconditional Training returns errors

I'm trying to train an unconditional model with image and gif data I have, to have coherent video generated from gifs of manga panels:

cvivit = CViViT(
    dim = 512,
    codebook_size = 5000,
    image_size = 256,
    patch_size = 32,
    temporal_patch_size = 2,
    spatial_depth = 4,
    temporal_depth = 4,
    dim_head = 64,
    heads = 8
)

maskgit = MaskGit(
    num_tokens = 5000,
    max_seq_len = 1024,
    dim = 512,
    dim_context = 768,
    depth = 6,
    unconditional = True # Kept this true, otherwise it asks for text samples (I only have image data)
)

phenaki = Phenaki(
    cvivit = cvivit,
    maskgit = maskgit
).cuda()

trainer = PhenakiTrainer(
    phenaki=phenaki,
    batch_size=4,
    grad_accum_every=4,
    train_on_images=True,
    folder='../dataset/compressed_manga/'
)

trainer.train()

When training, the following error is raised:
sample_images() got an unexpected keyword argument 'num_frames'

I think the arg num_frames is being passed to the method sample_images.
Can someone confirm this is a bug? I'll submit a PR with a fix

training data

For how long, and on how many videos, should I train to get good results?
I tried training with just two 10-second videos, and the samples it saves are just noise. [attached sample: 8200]

discriminator loss goes to infinity

Hi,

I'm trying to train the cvivit on a set of 10000 images. The vae loss keeps going down, but the discriminator loss keeps rising to infinity. It's easy to fool :)

Any idea what the problem is?

cannot reproduce

I really don't think this can be reproduced:

RuntimeError: Error(s) in loading state_dict for CViViT:
Missing key(s) in state_dict: "discr.attn_blocks.3.null_kv", "discr.attn_blocks.3.q_scale", "discr.attn_blocks.3.k_scale", "discr.attn_blocks.3.norm.gamma", "discr.attn_blocks.3.norm.beta", "discr.attn_blocks.3.context_norm.gamma", "discr.attn_blocks.3.context_norm.beta", "discr.attn_blocks.3.to_q.weight", "discr.attn_blocks.3.to_kv.weight", "discr.attn_blocks.3.to_out.weight".
Unexpected key(s) in state_dict: "discr.blocks.6.conv_res.weight", "discr.blocks.6.conv_res.bias", "discr.blocks.6.net.0.weight", "discr.blocks.6.net.0.bias", "discr.blocks.6.net.2.weight", "discr.blocks.6.net.2.bias", "discr.blocks.5.downsample.1.weight", "discr.blocks.5.downsample.1.bias".
size mismatch for discr.to_logits.3.weight: copying a param with shape torch.Size([1, 8192]) from checkpoint, the shape in current model is torch.Size([1, 16384]).

Error in Sample(): Expected scalar type float but found double

When running the example code, I keep getting the following error (see below). Do you have any idea how to fix it?

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
/tmp/ipykernel_21777/4126247202.py in <module>
     44 # do the above for many steps, then ...
     45 
---> 46 video = phenaki.sample(text = 'a squirrel examines an acorn', num_frames = 17, cond_scale = 5.) # (1, 3, 17, 256, 256)
     47 
     48 # so in the paper, they do not really achieve 2 minutes of coherent video

~/Phenaki/phenaki-pytorch/phenaki_pytorch2/phenaki_pytorch2.py in inner(model, *args, **kwargs)
     36         was_training = model.training
     37         model.eval()
---> 38         out = fn(model, *args, **kwargs)
     39         model.train(was_training)
     40         return out

/opt/conda/lib/python3.7/site-packages/torch/autograd/grad_mode.py in decorate_context(*args, **kwargs)
     26         def decorate_context(*args, **kwargs):
     27             with self.__class__():
---> 28                 return func(*args, **kwargs)
     29         return cast(F, decorate_context)
     30 

~/Phenaki/phenaki-pytorch/phenaki_pytorch2/phenaki_pytorch2.py in sample(self, text, num_frames, prime_frames, cond_scale, starting_temperature, noise_K)
   1115                     scores = 1 - rearrange(scores, '... 1 -> ...')
   1116                     
-> 1117                     scores = torch.where(mask, scores, -1e4)
   1118 
   1119         if has_prime:

RuntimeError: expected scalar type float but found double

Release pretrained models?

Hello,

Is there a plan to release pretrained models to do text-to-video out of the box with no training required?

If you upload your pretrained model somewhere, I can add a script for downloading and loading the pretrained weights and open a PR.

Thanks

model.buffer is EMPTY list in EMA get_buffers_iter function

While debugging in VS Code, I found that the buffer lists of both online_model and ema_model are empty in the first iteration. Is this right? The model is C-ViViT, not a custom model. I am new to EMA, so I have difficulty understanding this code. Sorry for that!

Training the Phenaki - RuntimeError: CUDA error: device-side assert triggered

I have created a notebook and pasted the training code from the README.md file so that I can experiment with training the model, but I encounter the following error when training Phenaki. Is there anything I'm doing wrong?

The same error emerges both on Google Colab and AWS EC2 - p4d instance with 8 A100 GPUs.

Here's the error I'm getting:

<@beartype(phenaki_pytorch.phenaki_pytorch.Phenaki.forward) at 0x7fd70d61cf70> in forward(__beartype_object_94770277747520, __beartype_getrandbits, __beartype_get_violation, __beartype_conf, __beartype_func, *args, **kwargs)

[/content/phenaki-pytorch/phenaki_pytorch/phenaki_pytorch.py](https://localhost:8080/#) in forward(self, videos, texts, video_codebook_ids, video_frame_mask, text_embeds, cond_drop_prob, only_train_generator, only_train_critic)
    642         if not only_train_critic:
    643             loss = F.cross_entropy(
--> 644                 logits[mask_token_mask],
    645                 video_codebook_ids[mask_token_mask]
    646             )

RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

video preprocessing

In the Phenaki paper, they downsample the MiT dataset from 25 fps to 6 fps before video quantization.

I wonder how to get the downsampled video in preprocessing, and whether the input video is downsampled during transformer training and video generation inference.
Even if you don't upload the training and dataloader code for video, I would appreciate some advice, since you must have tried to implement it.
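For reference, resampling 25 fps to 6 fps amounts to keeping roughly every fourth frame before tokenization; a minimal index-based sketch of that preprocessing (not code from this repo):

import torch

def resample_fps(video, src_fps = 25, dst_fps = 6):
    # video: (channels, frames, height, width)
    num_src = video.shape[1]
    num_dst = max(1, int(num_src * dst_fps / src_fps))
    indices = torch.linspace(0, num_src - 1, num_dst).long()
    return video[:, indices]

video = torch.randn(3, 250, 256, 256)   # 10 seconds at 25 fps
video_6fps = resample_fps(video)        # (3, 60, 256, 256)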

One more thing: I have used your C-ViViT code for reconstruction. After getting feasible outputs, I got bad results at the very next checkpoint iteration, like below. The left one is the ground truth and the right one is the output. (I set the checkpoint interval to 3000.) [screenshot]

Could I ask what is wrong, and whether early stopping is expected to be needed when learning the tokenization?

Thank you.

Problem With Multi-GPU Training

Hello,

I have been training the c-vivit encoder and have encountered an issue when attempting to use multiple GPUs. While the encoder works well with a single GPU, I receive a RuntimeError when attempting to use multiple GPUs. Specifically, the error message is: "The size of tensor a (64) must match the size of tensor b (0) at non-singleton dimension 2." I have noticed that changing the dim_head parameter causes the size of tensor a to change as well. Could you please provide some insights into what might be causing this issue?

how to condition on text embeddings from T5X

'To train MaskGIT, we include a text conditioning in the form of T5X embeddings which are used as input through the use of cross attention with the video tokens.'
In the paper, the explanation of how to condition on text embeddings is too brief for me to understand. Could you explain it in more detail?
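Conceptually, 'conditioning through cross attention' means the video tokens act as queries while the T5 text embeddings provide the keys and values inside each MaskGit block; a bare-bones sketch of that mechanism, independent of this repo's exact implementation:

import torch
from torch import nn

class CrossAttention(nn.Module):
    def __init__(self, dim = 512, dim_context = 768, heads = 8, dim_head = 64):
        super().__init__()
        inner = heads * dim_head
        self.heads = heads
        self.to_q = nn.Linear(dim, inner, bias = False)                # queries come from the video tokens
        self.to_kv = nn.Linear(dim_context, inner * 2, bias = False)   # keys / values come from the text embeddings
        self.to_out = nn.Linear(inner, dim, bias = False)

    def forward(self, video_tokens, text_embeds):
        b, n, _ = video_tokens.shape
        h = self.heads
        q = self.to_q(video_tokens)
        k, v = self.to_kv(text_embeds).chunk(2, dim = -1)
        q, k, v = (t.reshape(b, -1, h, t.shape[-1] // h).transpose(1, 2) for t in (q, k, v))
        attn = (q @ k.transpose(-2, -1) * q.shape[-1] ** -0.5).softmax(dim = -1)
        out = (attn @ v).transpose(1, 2).reshape(b, n, -1)
        return self.to_out(out)

video_tokens = torch.randn(1, 576, 512)   # flattened video token embeddings
text_embeds = torch.randn(1, 20, 768)     # T5 text embeddings
out = CrossAttention()(video_tokens, text_embeds)   # (1, 576, 512)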

to run C-ViViT...

import torch
from phenaki_pytorch import CViViT

cvivit = CViViT(
    dim = 512,
    codebook_size = 5000,
    patch_size = 32,
    temporal_patch_size = 2,
    spatial_depth = 4,
    temporal_depth = 4,
    dim_head = 64,
    heads = 8
).cuda()

video = torch.randn(1, 3, 17, 256, 256).cuda() # (batch, channels, frames + 1 leading frame, image height, image width)

loss = cvivit(video)
loss.backward()

You wrote 17 (= frames + 1 leading frame). Does that mean the first frame should follow the other frames, like
torch.cat([videos, image])?

Running out of CUDA/GPU memory

I have a GPU with 15 GB, and it seems to run out of memory when I try to train the network on 50 videos at a time. Do you think it would be better to compute the loss video by video, instead of on all the videos at once?

There is error in ema

After one iteration of training, there is a difference between the ema_vae.online_model buffers, so I got an error in ema_vae.update(). How can I solve it?

Successfully trained the CViViT! Working on the second step

Hello @lucidrains !

Based on your great repo, we have been working on replicating phenaki for a few months. Thus we worked on the first step of training, the CViViT. Things were pretty tough to setup with our cluster, and we made quite a lot of changes from your code. Since there were too many changes and we wanted to release the first step of training only without the code for the second step, we decided to create a separate repo here instead of doing a PR here which would have been a mess.

Here's the repo: https://github.com/obvious-research/phenaki-cvivit, with the model weights release on huggingface: https://huggingface.co/obvious-research/phenaki-cvivit

The model works well, it's been trained on Webvid10M and we intend on doing the same for the second step of training.

As you seemed interested in knowing about progress based on this repo (#28), we thought it was ok to open an issue here just to contact you :) We are a collective of artists working with AI that are opening our AI research lab, and we are interested in creating image and video models and releasing them open source.

What's the best way to contact you/dm you ? We are working on the second step of training at the moment. We have access to quite a lot of computing power so we think we have everything we need to actually do it. As we are making good progress towards a full replication, it'll be nice to collaborate if needed, maybe we'll have some questions and your help would be incredible! No pressure though, it's only if this is interesting to you.

Have a great day, and thanks again for the great repo!
