
diff2lip's Introduction

Diff2Lip: Audio Conditioned Diffusion Models for Lip-Synchronization

This is the official repository for Diff2Lip: Audio Conditioned Diffusion Models for Lip-Synchronization accepted at WACV 2024. It includes the script to run lip-synchronization at inference time given a filelist of audio-video pairs.

Abstract ArXiv PDF Website

tl;dr

Diff2Lip: arbitrary speech + face videos → high quality lip-sync.

Applications: movies, education, virtual avatars, (eventually) video conferencing.

Results

(a) Video Source | (b) Wav2Lip | (c) PC-AVS | (d) Diff2Lip (ours)

Please find more results on our website.

Overview of our approach

  • Top: Diff2Lip uses an audio-conditioned diffusion model to generate lip-synchronized videos.
  • Bottom: Zooming in on the mouth region shows that our method generates high-quality video frames without suffering from identity loss.

[overview figure]


Setting up the environment

conda create -n diff2lip python=3.9
conda activate diff2lip
conda install -c conda-forge ffmpeg=5.0.1
pip install -r requirements.txt
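
As a quick sanity check that the environment is usable, something like the following can help (a minimal sketch; it assumes torch is pulled in by requirements.txt and that the diff2lip environment is active):

# check that ffmpeg and PyTorch (with CUDA, if available) are visible from the environment
ffmpeg -version | head -n 1
python -c "import torch; print(torch.__version__, 'CUDA available:', torch.cuda.is_available())"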

Inference

For inference on the VoxCeleb2 dataset we use the scripts/inference.sh script, which internally calls the Python scripts generate.py or generate_dist.py. Set the following variables to run inference:

  • real_video_root: set this to the base path of the directory containing the VoxCeleb2 dataset.
  • model_path: first download the Diff2Lip checkpoint from here, place it in the checkpoints directory, and set this variable to the checkpoint's path.
  • sample_path: set this to where you want your output to be written.
  • sample_mode: set this to "cross" to drive a video source with a different (or the same) audio source, or to "reconstruction" to drive the first frame of the video with the same (or a different) audio source.
  • NUM_GPUS: controls the number of GPUs to be used. If set to greater than 1, distributed generation is run.

After setting these variables in the script, inference can be run using the following command:

scripts/inference.sh
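
For reference, the variable block at the top of scripts/inference.sh might look like the following (a sketch based on the values reported by users further down this page; the dataset and output paths are placeholders for your own setup):

# set paths and arguments (paths are placeholders)
real_video_root='dataset/VoxCeleb2/vox2_test_mp4/mp4/'
model_path="checkpoints/e7.24.1.3_model260000_paper.pt"
sample_path="output_dir"
sample_mode="cross"   # or "reconstruction"
NUM_GPUS=1            # >1 runs distributed generation via generate_dist.py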

Inference on other data

For example, if you want to run on the LRW dataset, then apart from the above arguments you also need to set --is_voxceleb2=False, change the variable filelist_recon to dataset/filelists/lrw_reconstruction_relative_path.txt, and change the variable filelist_cross to dataset/filelists/lrw_cross_relative_path.txt. Each line of these filelists contains the relative paths of the audio source and the video source, separated by a space and interpreted relative to the real_video_root variable.
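
For example, a cross filelist could be created like this (the file name and clip paths are hypothetical and only illustrate the "audio-source video-source" format; both paths are resolved relative to real_video_root):

# hypothetical filelist: one "<audio source> <video source>" pair per line
cat > dataset/filelists/example_cross_relative_path.txt <<'EOF'
speaker_A/clip_001.mp4 speaker_B/clip_007.mp4
speaker_C/clip_004.mp4 speaker_C/clip_004.mp4
EOF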

For inference on a single video, set --is_voxceleb2=False and then either (1) use a filelist with only one line, or (2) set --generate_from_filelist=0 and specify the --video_path, --audio_path, and --out_path flags instead of --test_video_dir, --sample_path, and --filelist in the scripts/inference.sh script, as sketched below.
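
For the second option, a single-video invocation might look like the following sketch. It is assembled from the flags that scripts/inference.sh is shown passing elsewhere on this page; only the input and output file names are placeholders:

# single-video sketch: model/sampling flags as in scripts/inference.sh, I/O paths are placeholders
python generate.py --attention_resolutions 32,16,8 --class_cond False --learn_sigma True \
    --num_channels 128 --num_head_channels 64 --num_res_blocks 2 --resblock_updown True \
    --use_fp16 True --use_scale_shift_norm False --predict_xstart False \
    --diffusion_steps 1000 --noise_schedule linear --rescale_timesteps False \
    --sampling_input_type=gt --sampling_ref_type=gt --timestep_respacing ddim25 --use_ddim True \
    --nframes 5 --nrefer 1 --image_size 128 --sampling_batch_size=32 --face_hide_percentage 0.5 \
    --use_ref=True --use_audio=True --audio_as_style=True \
    --model_path=checkpoints/e7.24.1.3_model260000_paper.pt \
    --is_voxceleb2=False --generate_from_filelist=0 \
    --video_path=input_video.mp4 --audio_path=input_audio.wav --out_path=output.mp4 --save_orig=False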

License

Except where otherwise specified, the text and code in the Diff2Lip repository by Soumik Mukhopadhyay (soumik-kanad) are licensed under the Creative Commons Attribution-NonCommercial 4.0 International license (CC BY-NC 4.0). They can be shared and adapted provided that you credit us and do not use our work for commercial purposes.

Citation

Please cite our paper if you find our work helpful and use our code.

@InProceedings{Mukhopadhyay_2024_WACV,
    author    = {Mukhopadhyay, Soumik and Suri, Saksham and Gadde, Ravi Teja and Shrivastava, Abhinav},
    title     = {Diff2Lip: Audio Conditioned Diffusion Models for Lip-Synchronization},
    booktitle = {Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
    month     = {January},
    year      = {2024},
    pages     = {5292-5302}
}

diff2lip's People

Contributors

soumik-kanad


diff2lip's Issues

Process killed

Here are my settings:

real_video_root='dataset/VoxCeleb2/vox2_test_mp4/mp4/'
model_path="checkpoints/e7.24.1.3_model260000_paper.pt"
sample_path="output_dir"
sample_mode="cross" # or "reconstruction"
NUM_GPUS=1

GEN_FLAGS="--is_voxceleb2=False --generate_from_filelist=0 --video_path=../test.mp4 --audio_path=../test.mp3 --out_path=../ --save_orig=False --face_det_batch_size 64 --pads 0,0,0,0"

However, it doesn't seem to work. I always get:

MPI.COMM_WORLD.Get_rank() 0
os.environ["CUDA_VISIBLE_DEVICES"] 0
Logging to d2l_gen
creating model...
scripts/inference.sh: line 40: 31738 Killed

Inference in Windows

When running on Windows, both with and without CUDA, I found that the distributed backends (nccl and gloo) are not supported and an error is thrown.
Is there any workaround for that?

Error when running inference on my own video

  1. My command:

    python generate.py
    --generate_from_filelist=0
    --video_path=path\to*.mp4
    --audio_path=path\to*.mp3
    --out_path=path\to*.mp4
    --attention_resolutions 32,16,8
    --learn_sigma True
    --num_head_channels 64
    --resblock_updown True
    --use_scale_shift_norm False
    --sampling_input_type=first_frame
    --sampling_ref_type=first_frame
    --timestep_respacing ddim25
    --use_ddim True
    --sample_path=output_dir
    --nframes 5
    --nrefer 1
    --use_ref=True
    --use_audio=True
    --audio_as_style=True
    --save_orig=False
    --image_size 128

  2. The input video: [screenshot]

  3. The result video: [screenshot]

  4. What is the reason for this?

training code?

Hi, great work here. I was wondering if you plan to provide the training code? I want to train this on a different dataset.

About temporal consistency

I noticed that the temporal consistency of Diff2Lip is good. What tricks did you apply in model training? Did you simply concatenate multiple continuous frames in the batch dimension to achieve this?

Low res result for single video inference (especially for the teeth area).

Hello!

I'm having an issue with the single video inference.

Here is the input video (With the target audio):

video.mp4

Here is the output video:

output.mp4

I believe the low resolution can be fixed with a face enhancer; however, the result around the teeth is poorer than the demo in the readme.

Any idea what would cause this?

Setting diffusion steps to > 1000 introduces Gaussian noise

Hi, thank you for releasing the test code and model. I'm playing with the "inference_single_video" script and find that using more than 1000 diffusion steps introduces noticeable Gaussian noise. From another post, I saw that timestep_respacing is the flag that changes the number of denoising steps. How should I tune these?

Diffusion iterations

Hello, thank you for your great work!

As you mentioned in the paper, you used 25 denoising iterations at inference for better speed.
I'd like to change that parameter for experiments, but I can't find it in the argument parser or the model files.
Is that possible?

Thanks
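
For reference, the inference commands shown elsewhere on this page pass --timestep_respacing ddim25 together with --use_ddim True. A minimal sketch of changing the number of denoising iterations, assuming the guided-diffusion convention that "ddimN" runs N DDIM steps (the variable name below is hypothetical; in practice the flag is set inside scripts/inference.sh):

# hypothetical: change only the respacing flag, keeping everything else as in scripts/inference.sh
SAMPLE_FLAGS="--use_ddim True --timestep_respacing ddim25"   # 25 DDIM steps, as used on this page
# SAMPLE_FLAGS="--use_ddim True --timestep_respacing ddim50" # e.g. 50 DDIM steps instead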

I can't install

I can't get this installed on my Windows PC with a 3090 Ti GPU. Who will help, for a reward?

RuntimeError: "slow_conv2d_cpu" not implemented for 'Half'

Hi,

Thank you for this great work. I'm trying to run inference on a single file using inference_single_video.sh.

I get this error:

File "/home/ubuntu/liptest/generate.py", line 398, in <module>
    main()
  File "/home/ubuntu/liptest/generate.py", line 336, in main
    generate(args.video_path, args.audio_path, model, diffusion, detector,  args, out_path=args.out_path, save_orig=args.save_orig)
  File "/home/ubuntu/liptest/generate.py", line 266, in generate
    sample, img_batch, model_kwargs = sample_batch(batch, model, diffusion, args)   
  File "/home/ubuntu/liptest/generate.py", line 186, in sample_batch
    sample = sample_fn(
  File "/home/ubuntu/liptest/guided-diffusion/guided_diffusion/gaussian_diffusion.py", line 649, in ddim_sample_loop
    for sample in self.ddim_sample_loop_progressive(
  File "/home/ubuntu/liptest/guided-diffusion/guided_diffusion/gaussian_diffusion.py", line 701, in ddim_sample_loop_progressive
    out = self.ddim_sample(
  File "/home/ubuntu/liptest/guided-diffusion/guided_diffusion/gaussian_diffusion.py", line 558, in ddim_sample
    out = self.p_mean_variance(
  File "/home/ubuntu/liptest/guided-diffusion/guided_diffusion/respace.py", line 91, in p_mean_variance
    return super().p_mean_variance(self._wrap_model(model), *args, **kwargs)
  File "/home/ubuntu/liptest/guided-diffusion/guided_diffusion/gaussian_diffusion.py", line 265, in p_mean_variance
    model_output = model(x, self._scale_timesteps(t), **model_kwargs)
  File "/home/ubuntu/liptest/guided-diffusion/guided_diffusion/respace.py", line 128, in __call__
    return self.model(x, new_ts, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ubuntu/liptest/guided-diffusion/guided_diffusion/unet.py", line 1040, in forward
    a = self.audio_encoder(a)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ubuntu/liptest/guided-diffusion/guided_diffusion/unet.py", line 1263, in forward
    h= self.input_block(h, emb)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ubuntu/liptest/guided-diffusion/guided_diffusion/unet.py", line 77, in forward
    x = layer(x)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 310, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 306, in _conv_forward
    return F.conv1d(input, weight, bias, self.stride,
RuntimeError: "slow_conv2d_cpu" not implemented for 'Half'

Is this a known issue? Any idea how to resolve it?

Thanks
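
For what it's worth, "not implemented for 'Half'" errors from *_cpu kernels typically mean a half-precision model is being run on the CPU. A hedged sketch of one thing to try, assuming the script passes --use_fp16 True as the commands elsewhere on this page do (not confirmed to be the cause here):

# possible workaround (assumption): run on a CUDA device, or disable half precision
# by changing the flag passed to generate.py in the inference script:
#   --use_fp16 True   ->   --use_fp16 False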

Issue about https://github.com/cloneofsimo/lora/issues/258

Hello,

I am trying to train LoRA with cloneofsimo/lora and encountered an issue during the process. I hope you can assist me with it.

Firstly, regarding the setup of the dataset for LORA training, I am interested in understanding how you select and prepare these data. Are there any specific format requirements? Or do they need to undergo certain preprocessing steps? I did not find detailed instructions on this part in your documentation or code, and I would appreciate your guidance.

Secondly, I encountered an error while executing your inference script. The specific error message is as follows:

[error screenshot]

I have checked the relevant code and dependencies but haven't been able to resolve the issue. Is this a known problem? Or can you provide any suggestions to address this issue?

Thank you for your assistance, and I look forward to your reply.

RuntimeError: CUDA error: an illegal memory access was encountered

MPI.COMM_WORLD.Get_rank() 0
os.environ["CUDA_VISIBLE_DEVICES"] 0
Logging to d2l_gen
creating model...
Recovering from OOM error; New batch size: 32
Recovering from OOM error; New batch size: 16
Recovering from OOM error; New batch size: 8
Recovering from OOM error; New batch size: 4
Recovering from OOM error; New batch size: 2
Recovering from OOM error; New batch size: 1
Error: Image too big to run face detection on GPU /root/AVideo/test.mp4 /root/AVideo/diff2lip-main/test1.mp4
Traceback (most recent call last):
  File "/root/AVideo/diff2lip-main/generate.py", line 109, in face_detect
    predictions.extend(detector.get_detections_for_batch(np.array(images[i:i + batch_size])))
  File "/root/AVideo/diff2lip-main/face_detection/api.py", line 66, in get_detections_for_batch
    detected_faces = self.face_detector.detect_from_batch(images.copy())
  File "/root/AVideo/diff2lip-main/face_detection/detection/sfd/sfd_detector.py", line 42, in detect_from_batch
    bboxlists = batch_detect(self.face_detector, images, device=self.device)
  File "/root/AVideo/diff2lip-main/face_detection/detection/sfd/detect.py", line 65, in batch_detect
    imgs = torch.from_numpy(imgs).float().to(device)
RuntimeError: CUDA error: an illegal memory access was encountered

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/root/AVideo/diff2lip-main/generate.py", line 199, in generate
    face_det_results = face_detect(video_frames.copy(), detector, args, resize=True)
  File "/root/AVideo/diff2lip-main/generate.py", line 112, in face_detect
    raise RuntimeError('Image too big to run face detection on GPU')
RuntimeError: Image too big to run face detection on GPU

Traceback (most recent call last):
  File "/root/AVideo/diff2lip-main/generate.py", line 398, in <module>
    main()
  File "/root/AVideo/diff2lip-main/generate.py", line 336, in main
    generate(args.video_path, args.audio_path, model, diffusion, detector, args, out_path=args.out_path, save_orig=args.save_orig)
  File "/root/AVideo/diff2lip-main/generate.py", line 208, in generate
    face_det_results = face_det_results[:min_frames]
UnboundLocalError: local variable 'face_det_results' referenced before assignment

can you help

Can you clarify where to put the audio and video files? It's a bit confusing.
#!/bin/bash

#set paths and arguments
real_video_root='dataset/VoxCeleb2/vox2_test_mp4/mp4/'
model_path="D:\diff2lip-main\checkpoints/e7.24.1.3_model260000_paper.pt"
sample_path="D:\diff2lip-main\result"
sample_mode="cross" # or "reconstruction"
NUM_GPUS=1

Result Video blurry

I have encountered some issues. I input a video at 1080p resolution. After processing, the output is also 1080p, but the video is very blurry and does not achieve a high-quality result. How can I adjust this? Please assist me. Thank you.

Input: [screenshot]

Output: [screenshot]

create model failed

MPI.COMM_WORLD.Get_rank() 0
os.environ["CUDA_VISIBLE_DEVICES"] 0
Logging to d2l_gen
creating model...
Segmentation fault (core dumped)

Information about GAN loss

Hi,

Very interesting work. Thank you so much for open-sourcing the inference code. Since the training code is missing, I assume you aren't planning to release it. Can you please give more information on how the PatchGAN loss was applied during training? Did you use a pretrained model, or was it trained together in an end-to-end fashion?

The model does not match, and I do not know why

parameters:Namespace(generate_from_filelist=False, video_path='/home/stardust/download/sdk/deepleaning/video-retalking/examples/face/1.mp4', audio_path='/home/stardust/download/sdk/deepleaning/video-retalking/examples/audio/1.wav', out_path='zzz.mp4', save_orig=True, test_video_dir='test_videos', filelist='test_filelist.txt', use_fp16=False, face_hide_percentage=0.5, use_ref=False, use_audio=False, audio_as_style=False, audio_as_style_encoder_mlp=False, nframes=1, nrefer=0, image_size=64, syncnet_T=5, syncnet_mel_step_size=16, audio_frames_per_video=16, audio_dim=80, is_voxceleb2=True, video_fps=25, sample_rate=16000, mel_steps_per_sec=80.0, clip_denoised=True, sampling_batch_size=2, use_ddim=False, model_path='checkpoints/e7.15_model210000_notUsedInPaper.pt', sample_path='d2l_gen', sample_partition='', sampling_seed=None, sampling_use_gt_for_ref=False, sampling_ref_type='gt', sampling_input_type='gt', face_det_batch_size=64, pads='0,0,0,0', num_channels=128, num_res_blocks=2, num_heads=4, num_heads_upsample=-1, num_head_channels=-1, attention_resolutions='16,8', dropout=0.0, class_cond=False, use_checkpoint=False, use_scale_shift_norm=True, resblock_updown=False, learn_sigma=False, diffusion_steps=1000, noise_schedule='linear', timestep_respacing='', use_kl=False, predict_xstart=False, rescale_timesteps=False, rescale_learned_sigmas=False, loss_variation=0, audio_encoder_kwargs={})

error:File "/home/stardust/download/sdk/deepleaning/diff2lip/generate.py", line 326, in main
model.load_state_dict(
File "/home/stardust/anaconda3/envs/facial/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2153, in load_state_dict
raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for TFGModel:
Missing key(s) in state_dict: "input_blocks.3.0.op.weight", "input_blocks.3.0.op.bias", "input_blocks.4.0.skip_connection.weight", "input_blocks.4.0.skip_connection.bias", "input_blocks.6.0.op.weight", "input_blocks.6.0.op.bias", "input_blocks.9.0.op.weight", "input_blocks.9.0.op.bias", "output_blocks.2.2.conv.weight", "output_blocks.2.2.conv.bias", "output_blocks.5.2.conv.weight", "output_blocks.5.2.conv.bias", "output_blocks.8.1.conv.weight", "output_blocks.8.1.conv.bias".
Unexpected key(s) in state_dict: "audio_encoder.time_embed.0.weight", "audio_encoder.time_embed.0.bias", "audio_encoder.time_embed.2.weight", "audio_encoder.time_embed.2.bias", "audio_encoder.input_block.0.weight", "audio_encoder.input_block.0.bias", "audio_encoder.input_block.1.weight", "audio_encoder.input_block.1.bias"........

The single-video lip-sync result does not seem very good

I have a problem: I tried lip-syncing a single video, but the mouth and teeth are not satisfactory.

diff2lip_demo.mp4

I only modified inference.sh and ran it from the terminal with scripts/inference.sh.
[screenshot]

I am stuck here

Executing command: python generate.py --attention_resolutions 32,16,8 --class_cond False --learn_sigma True --num_channels 128 --num_head_channels 64 --num_res_blocks 2 --resblock_updown True --use_fp16 True --use_scale_shift_norm False --predict_xstart False --diffusion_steps 1000 --noise_schedule linear --rescale_timesteps False --sampling_seed=7 --sampling_input_type=gt --sampling_ref_type=gt --timestep_respacing ddim25 --use_ddim True --model_path=D:/diff2lip-main/checkpoints/e7.24.1.3_model260000_paper.pt --nframes 5 --nrefer 1 --image_size 128 --sampling_batch_size=32 --face_hide_percentage 0.5 --use_ref=True --use_audio=True --audio_as_style=True --generate_from_filelist 0
MPI.COMM_WORLD.Get_rank() 0
os.environ["CUDA_VISIBLE_DEVICES"] 0
Logging to d2l_gen
creating model...
-vf: No such file or directory
Unrecognized option '2'.
Error splitting the argument list: Option not found
C:\Users\ggrov\anaconda3\envs\diff2lip\lib\site-packages\librosa\util\decorators.py:88: UserWarning: PySoundFile failed. Trying audioread instead.
return f(*args, **kwargs)
Traceback (most recent call last):
  File "C:\Users\ggrov\anaconda3\envs\diff2lip\lib\site-packages\librosa\core\audio.py", line 164, in load
    y, sr_native = __soundfile_load(path, offset, duration, dtype)
  File "C:\Users\ggrov\anaconda3\envs\diff2lip\lib\site-packages\librosa\core\audio.py", line 195, in __soundfile_load
    context = sf.SoundFile(path)
  File "C:\Users\ggrov\anaconda3\envs\diff2lip\lib\site-packages\soundfile.py", line 740, in __init__
    self._file = self._open(file, mode_int, closefd)
  File "C:\Users\ggrov\anaconda3\envs\diff2lip\lib\site-packages\soundfile.py", line 1264, in _open
    _error_check(_snd.sf_error(file_ptr),
  File "C:\Users\ggrov\anaconda3\envs\diff2lip\lib\site-packages\soundfile.py", line 1455, in _error_check
    raise RuntimeError(prefix + _ffi.string(err_str).decode('utf-8', 'replace'))
RuntimeError: Error opening 'd2l_gen\temp\audio.wav': System error.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "D:\diff2lip-main\generate.py", line 398, in <module>
    main()
  File "D:\diff2lip-main\generate.py", line 336, in main
    generate(args.video_path, args.audio_path, model, diffusion, detector, args, out_path=args.out_path, save_orig=args.save_orig)
  File "D:\diff2lip-main\generate.py", line 204, in generate
    wrong_all_indiv_mels, wrong_audio_wavform = load_all_indiv_mels(audio_path, args)
  File "D:\diff2lip-main\generate.py", line 47, in load_all_indiv_mels
    wav = audio.load_wav(out_path, args.sample_rate)
  File "D:\diff2lip-main\audio\audio.py", line 10, in load_wav
    return librosa.core.load(path, sr=sr)[0]
  File "C:\Users\ggrov\anaconda3\envs\diff2lip\lib\site-packages\librosa\util\decorators.py", line 88, in inner_f
    return f(*args, **kwargs)
  File "C:\Users\ggrov\anaconda3\envs\diff2lip\lib\site-packages\librosa\core\audio.py", line 170, in load
    y, sr_native = __audioread_load(path, offset, duration, dtype)
  File "C:\Users\ggrov\anaconda3\envs\diff2lip\lib\site-packages\librosa\core\audio.py", line 226, in __audioread_load
    reader = audioread.audio_open(path)
  File "C:\Users\ggrov\anaconda3\envs\diff2lip\lib\site-packages\audioread\__init__.py", line 127, in audio_open
    return BackendClass(path)
  File "C:\Users\ggrov\anaconda3\envs\diff2lip\lib\site-packages\audioread\rawread.py", line 59, in __init__
    self._fh = open(filename, 'rb')
FileNotFoundError: [Errno 2] No such file or directory: 'd2l_gen\temp\audio.wav'
Error: Command failed with exit code 1

ValueError: Could not determine output format

Hi, I'm trying to use the model on a custom video using the script inference_single_video.sh, but I get an error when the code tries to save a video using torchvision.io.write_video in line 287 of generate.py

[rank0]:   File "/data01/user/diff2lip/generate.py", line 401, in <module>
[rank0]:     main()
[rank0]:   File "/data01/user/diff2lip/generate.py", line 339, in main
[rank0]:     generate(args.video_path, args.audio_path, model, diffusion, detector,  args, out_path=args.out_path, save_orig=args.save_orig)
[rank0]:   File "/data01/user/diff2lip/generate.py", line 287, in generate
[rank0]:     torchvision.io.write_video(
[rank0]:   File "/home/user/mambaforge/envs/diff2lip/lib/python3.9/site-packages/torchvision/io/video.py", line 90, in write_video
[rank0]:     with av.open(filename, mode="w") as container:
[rank0]:   File "av/container/core.pyx", line 429, in av.container.core.open
[rank0]:   File "av/container/core.pyx", line 213, in av.container.core.Container.__cinit__
[rank0]: ValueError: Could not determine output format

I'm using the same modules as in requirements.txt

Extremely distorted video for reconstruction and cross.

So I'm getting very poor video generation. The entire face is distorted, and sometimes I just see the mouth overlaid on the face.

I changed nothing; I just ran scripts/inference.sh for a single clear video.

I even preprocessed to 224x224 @ 25 fps (as in the paper) and it's no better. Below is an example frame from the video.

I am running the e7.24.1.3_model260000_paper.pt checkpoint.

t, the shape in current model is torch.Size([256]). size mismatch for out.2.weight: copying a param with shape torch.Size([6, 128, 3, 3]) from checkpoint, the shape in current model is torch.Size([3, 128, 3, 3]).

Parameter mismatch when using the checkpoint. I am running:

python generate.py --video_path "test.mp4" --audio_path "InputAudio/test_audio.mp3" --model_path "checkpoint/e7.24.1.3_model260000_paper.pt" --out_path "OutputVideo/output.mp4"

All the weights seem to have size mismatches when copying over from the checkpoint...
