snap-research / mocogan-hd
[ICLR 2021 Spotlight] A Good Image Generator Is What You Need for High-Resolution Video Synthesis
License: Other
Hi! Thank you for the project and the codebase! I noticed that for some datasets, links to the pretrained models do not work: e.g. the image generator link on FaceForensics leads to https://github.com/snap-research/MoCoGAN-HD/blob/main/pretrained_models/faceforensics-fid10.9920-snapshot-008765.pt, which does not exist (same for (Anime, VoxCeleb) and (AFHQ, VoxCeleb) cross-domain image generators). Could you please provide a link for the pretrained image generator on FaceForensics?
Hello,
As I saw in issue #5 (specifically, the comment below), I understand that DiffAugment is applied when training on the UCF-101 dataset.
Is DiffAugment applied to the FaceForensics dataset too?
Like UCF-101, which has only a small number of samples per class,
FaceForensics has only 704 training videos, and I think this is not a sufficient amount of data to train GANs.
Hi @sihyun-yu, have you tried to use the augmentation from this work?
The FID was calculated during training from StyleGAN2.
Originally posted by @alanspike in #5 (comment)
Thanks,
Hi,
Great Work!
I was using the pre-trained models for inference on the SkyTimelapse and UCF-101 datasets. However, in both cases, gray videos are generated. I have not made any changes to the code, and there are no errors or warnings. Did you face a similar issue?
Dear authors,
I want to ask how you fine-tune the generator. Taking FaceForensics
as an example, did you use all cropped frames as the fine-tuning dataset, or only several frames per identity?
Thanks a lot.
Hello
Thank you for your great work! I read the paper carefully.
I wonder how you calculate the Inception Score on UCF-101 in detail.
I read that you follow the TGAN paper for evaluating Inception Scores and use the C3D network to get the predictions.
Which weights did you use for the C3D network? Did you train it from scratch?
If not, could you tell me which C3D weights you used and how to use the network?
Specifically, in this paper the generated UCF-101 videos are 224x224, but the pre-trained C3D network at this link (https://github.com/rezoo/tgan2/releases/download/v1.0/conv3d_deepnetA_ucf.npz) was not trained with a 224x224 configuration. How did you resize and normalize the frames?
I would be very grateful if you could reply.
Thanks.
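In case it helps other readers: this is not the authors' pipeline, but a minimal NumPy sketch of one plausible preprocessing path, assuming C3D's usual 16-frame, 112x112, channels-first input. The resize method (nearest-neighbour here, for self-containedness) and the mean subtraction would need to match whatever the released checkpoint was trained with.

```python
import numpy as np

def resize_nearest(frame, size=(112, 112)):
    """Nearest-neighbour resize; a stand-in for a proper bilinear resize."""
    h, w = frame.shape[:2]
    rows = np.arange(size[0]) * h // size[0]
    cols = np.arange(size[1]) * w // size[1]
    return frame[rows][:, cols]

def preprocess_for_c3d(video, mean=None):
    """video: (T, H, W, 3) uint8 array of 224x224 generated frames.

    Returns a (3, 16, 112, 112) float32 clip, assuming C3D expects
    16 frames at 112x112, channels-first, optionally mean-subtracted.
    """
    clip = np.stack([resize_nearest(f) for f in video[:16]]).astype(np.float32)
    if mean is not None:
        clip -= mean  # per-channel mean from the C3D training data
    return clip.transpose(3, 0, 1, 2)

video = np.zeros((16, 224, 224, 3), dtype=np.uint8)
print(preprocess_for_c3d(video).shape)  # (3, 16, 112, 112)
```

Again, whether frames should be center-cropped rather than resized, and which mean statistics apply, is exactly the ambiguity this question is about.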
Hi,
I was able to run get_stats_pca.py using the pretrained image generator models you provide. Then, while I was installing a few more packages into my conda environment, get_stats_pca.py stopped running altogether.
I have tried uninstalling and reinstalling the conda environment using the requirements.txt provided in the repository. This is my command: python get_stats_pca.py --batchSize 4000 --save_pca_path pca_stats/ucf_101 --pca_iterations 250 --latent_dimension 512 --img_g_weights pretrained_checkpoints/ucf-256-fid41.6761-snapshot-006935.pt --style_gan_size 256 --gpu 0
The process just hangs forever: the GPU memory usage goes from 0 MB to 3 MB, and nothing else happens. I don't know what I could have done wrong; it was working before. As an additional step, I also set up the repository from scratch.
Any idea what might have happened?
Hi! FaceForensics contains "video starting" artifacts in the first ~0.5 seconds of many of its videos (see the gif), which might produce corresponding training artifacts. Did you remove them?
Here are random samples from FFS, cut to the first 0.5 seconds:
Also, did you account for them in any way when computing FVD?
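For what it's worth, dropping those artifact frames is straightforward once the videos are decoded into arrays. A minimal sketch, assuming ~25 fps (the actual frame rate should be read from each video):

```python
import numpy as np

def trim_start(frames, seconds=0.5, fps=25):
    """Drop the first `seconds` of a (T, H, W, C) frame array, e.g. to cut
    the "video starting" artifacts before training or FVD computation."""
    return frames[int(seconds * fps):]

frames = np.zeros((100, 64, 64, 3), dtype=np.uint8)
print(trim_start(frames).shape[0])  # 100 - 12 = 88 frames remain
```

The open question is whether the authors did something like this, both for training and for the real-video side of the FVD computation.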
Hi,
First of all, thank you for your great work!
As I read your paper,
I understand that the FVD is calculated from 2048 videos at 128x128 resolution on the UCF-101 dataset.
To evaluate your model on UCF-101, I randomly sampled 2048 real videos (random clips of 16 consecutive frames) and resized them to 128x128 resolution.
Then, I calculated the FVD between the sampled real and fake videos.
As a result, I got 625.87, which is a little lower than the distance you reported.
I think there is either some difference in how the real video samples are built compared to your implementation, or the FVD oscillates a lot due to the randomness of sampling.
Could you describe the detailed FVD evaluation process on the UCF-101 and FaceForensics datasets?
Thanks,
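For reference, my sampling procedure (my own assumption, not necessarily matching the authors') looks roughly like this; the resize to 128x128 is left to an image library of choice:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_real_clips(videos, n_clips=2048, clip_len=16):
    """videos: list of (T, H, W, 3) uint8 arrays (decoded UCF-101 videos).

    For each clip, randomly pick a video and a random window of
    `clip_len` consecutive frames, as described above.
    """
    clips = []
    for _ in range(n_clips):
        v = videos[rng.integers(len(videos))]
        start = rng.integers(v.shape[0] - clip_len + 1)
        clips.append(v[start:start + clip_len])
    return np.stack(clips)  # (n_clips, clip_len, H, W, 3)

videos = [np.zeros((40, 128, 128, 3), dtype=np.uint8) for _ in range(4)]
print(sample_real_clips(videos, n_clips=8).shape)  # (8, 16, 128, 128, 3)
```

If the official evaluation samples videos or windows differently (e.g. one clip per video, or fixed start frames), that alone could explain part of the gap.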
Hi, thanks for your great work!
I have a question about the cross-domain video discriminator.
According to your paper, you can learn to synthesize video content from one dataset A (such as Anime-Face) while taking motion from another dataset B (such as VoxCeleb). In this mode, I think the video discriminator will first learn to distinguish anime content from real-person content, rather than to distinguish meaningful motions. How do you ensure that the video discriminator is helpful during training in this mode?
Hi,
Will you release the code for ACD (average consistency distance) and FID?
Thanks
MoCoGAN-HD/train_func_cross_domain.py
Lines 245 to 247 in 27356ba
Hi,
can you give an example of how to calculate the similarity loss in Equation 3 of the paper? Thanks!
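Without speaking for the authors, a pairwise cosine-similarity term (which may or may not match Eq. 3 exactly; the reduction over frames and the sign are assumptions here) can be computed like this, where `codes` is a hypothetical (T, D) array of per-frame latent vectors:

```python
import numpy as np

def cosine_sim(a, b, eps=1e-8):
    # Cosine similarity between two 1-D vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps))

def similarity_loss(codes):
    """Mean cosine similarity between consecutive rows of `codes` (T, D).
    NOTE: a generic sketch, not necessarily the paper's exact Eq. 3."""
    sims = [cosine_sim(codes[t], codes[t + 1]) for t in range(len(codes) - 1)]
    return float(np.mean(sims))

print(similarity_loss(np.eye(3)))  # orthogonal one-hot codes -> 0.0
```

A confirmed example from the authors (what the vectors are, and whether the term is minimized or maximized) would still be valuable.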
Hi, thank you for sharing the code of your elegant work!
I have a question about the experimental setup on experiments with UCF-101 dataset.
Did you use the "train" split from the UCF-101 dataset or the whole dataset without split?
Thank you in advance!
Sincerely,
Sihyun
I have a custom dataset of face videos from the How2Sign dataset. I have the dataset in the format required by this repository. What are the steps for training on a custom dataset?
Hello! Thanks again for providing the implementation.
I am trying to retrain an "unconditional" image generator from scratch on the UCF-101 dataset using StyleGAN2, as you suggested.
Did you use specific hyperparameters to train such a model to reach the reported FID?
If so, can you share those hyperparameters?
Thanks in advance!
Sincerely,
Sihyun
Hi! Could you please tell us whether you used any truncation for the content or motion codes, or curated the samples for these generations: https://github.com/snap-research/MoCoGAN-HD#faceforensics-1 ? I used your pretrained checkpoint, PCA stats, and the pretrained G to generate samples with --n_frames_G=32
and without spatial noise, and the results feel lower quality than the ones you show in your README.md. Here are the samples I got (sorry for the external link; GitHub for some reason refuses to upload the gif even though it is under 10 MB):
https://i.imgur.com/1QRibnD.mp4
For example, the motion diversity is not that good, i.e. the heads do not "speak". Could you tell why there is such a difference?
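For anyone comparing: the standard StyleGAN-style truncation trick (which, as asked above, the README samples may or may not use; that is exactly the question) is simply pulling the latent toward its mean:

```python
import numpy as np

def truncate(w, w_avg, psi=0.7):
    """StyleGAN truncation trick: pull a latent `w` toward the mean `w_avg`.
    psi < 1 trades sample diversity for visual fidelity."""
    return w_avg + psi * (w - w_avg)

w = np.ones(4) * 2.0
print(truncate(w, np.zeros(4), psi=0.5))  # [1. 1. 1. 1.]
```

If truncation was applied to the content code when producing the README samples, that could explain the fidelity gap at the cost of diversity.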