
dinet's Issues

Problem with the provided pretrained syncnet

"Hello, may I ask whether anyone has run into issues with the pretrained syncnet provided by the DINet author, or whether it is just extremely sensitive to the dataset? I trained it on my own download of the HDTF dataset and found that the sync loss kept oscillating even on the ground-truth data."

How to fix this error HELP!

extracting: input/888.zip
Traceback (most recent call last):
File "inference.py", line 33, in
raise ('wrong video path : {}'.format(opt.source_video_path))
TypeError: exceptions must derive from BaseException
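For reference, the TypeError appears because inference.py raises a bare string, which Python 3 does not allow; the underlying problem is still the wrong --source_video_path. A minimal sketch of the intended check (the path below is only illustrative):

import os

source_video_path = 'input/888.mp4'  # illustrative; use the value passed via --source_video_path

# Raising a bare string produces "TypeError: exceptions must derive from BaseException".
# Wrapping the message in an exception class surfaces the real error message instead:
if not os.path.isfile(source_video_path):
    raise FileNotFoundError('wrong video path : {}'.format(source_video_path))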

Simplify the network structure

Amazing! Thanks for your contribution.
To simplify the network structure, can we:

  1. use 5 mouth images as reference images instead of 5 whole faces?
  2. concatenate Fref with Fs as the input to the AdaAT module?
  3. drop the alignment encoder?
  4. drop the concatenation in the inpainting part?

Landmark OpenFace CMD issue?

Hi,

great project, thanks for sharing. I wanted to use OpenFace on Linux to extract the landmarks and create a CSV for inference on a new custom video.

I tried to figure out which parameters I need, so I extracted the landmarks via this command line:

build/bin/FaceLandmarkVidMulti -f mj2.mp4.m4v -2Dfp -tracked
The CSV is written and to me it looks like the other examples:
mj2.csv

When I execute inference:
python3 inference.py --mouth_region_size=256 --source_video_path=./asserts/examples/short1.mp4 --source_openface_landmark_path=./asserts/examples/mj2.csv --driving_audio_path=./asserts/examples/mj_sound1.wav --pretrained_clip_DINet_path=./asserts/clip_training_DINet_256mouth.pth

I received:

Traceback (most recent call last):
File "inference.py", line 54, in
video_landmark_data = load_landmark_openface(opt.source_openface_landmark_path).astype(np.int)
AttributeError: 'NoneType' object has no attribute 'astype'

Am I missing a parameter in the CSV or in the OpenFace command?

Thanks @MRzzm
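For reference, load_landmark_openface presumably needs the 2D landmark columns that OpenFace writes (x_0..x_67 and y_0..y_67, often with a leading space in the header). A quick, hedged sanity check of the CSV with pandas:

import pandas as pd

df = pd.read_csv('./asserts/examples/mj2.csv')
cols = [c.strip() for c in df.columns]          # OpenFace headers may carry a leading space
expected = ['x_{}'.format(i) for i in range(68)] + ['y_{}'.format(i) for i in range(68)]
missing = [c for c in expected if c not in cols]
print('frames:', len(df))
print('missing 2D landmark columns:', missing if missing else 'none')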

Syncnet Training

I reproduced the syncnet training. How low does the syncnet loss need to be before the model can be used for clip training?
Currently mine is around 0.21-0.25.

Is this Syncnet training code correct?

from models.Syncnet import SyncNetPerception,SyncNet
from config.config import DINetTrainingOptions
from sync_batchnorm import convert_model

from torch.utils.data import DataLoader
from dataset.dataset_DINet_syncnet import DINetDataset

from utils.training_utils import get_scheduler, update_learning_rate,GANLoss

import random
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import os
import torch.nn.functional as F


if __name__ == "__main__":
    # load config
    opt = DINetTrainingOptions().parse_args()
    random.seed(opt.seed)
    np.random.seed(opt.seed)
    torch.manual_seed(opt.seed)
    torch.cuda.manual_seed(opt.seed)
    # init network
    
    net_lipsync = SyncNet(15,29,128).cuda()

    criterionMSE = nn.BCELoss().cuda()  # note: despite the variable name, this is binary cross-entropy, not MSE
    # set scheduler
    # set label of syncnet perception loss
    real_tensor = torch.tensor(1.0).cuda()  # unused in this script
    
    # setup optimizer
   # optimizer_s = optim.Adam(net_lipsync.parameters(), lr=opt.lr_g)
    optimizer_s = optim.Adamax(net_lipsync.parameters(), lr=opt.lr_g)
    
    # set scheduler
    net_s_scheduler = get_scheduler(optimizer_s, opt.non_decay, opt.decay)

    
    # load training data
    train_data = DINetDataset(opt.train_data,opt.augment_num,opt.mouth_region_size)
    training_data_loader = DataLoader(dataset=train_data,  batch_size=opt.batch_size, shuffle=True,drop_last=True,num_workers=12)
    train_data_length = len(training_data_loader)
    
    # load test data
    test_data = DINetDataset(opt.test_data,opt.augment_num,opt.mouth_region_size)
    test_data_loader = DataLoader(dataset=test_data,  batch_size=1, shuffle=True,drop_last=True,num_workers=12)
    test_data_length = len(test_data_loader)
    
    min_loss = 100
    # start train
    for epoch in range(opt.start_epoch, opt.non_decay+opt.decay+1):
        net_lipsync.train()
        for iteration, data in enumerate(training_data_loader):
            # forward
            optimizer_s.zero_grad()
            source_clip, deep_speech_full, y = data
            source_clip = torch.cat(torch.split(source_clip, 1, dim=1), 0).squeeze(1).float().cuda()
            source_clip = torch.cat(torch.split(source_clip, opt.batch_size, dim=0), 1).cuda()
            deep_speech_full = deep_speech_full.float().cuda()

            y = y.cuda()
            ## sync perception loss
            source_clip_mouth = source_clip[:, :, train_data.radius:train_data.radius + train_data.mouth_region_size,
            train_data.radius_1_4:train_data.radius_1_4 + train_data.mouth_region_size]
            sync_score = net_lipsync(source_clip_mouth, deep_speech_full)        

            loss_sync = criterionMSE(sync_score.unsqueeze(1), y)
            
            loss_sync.backward()
            optimizer_s.step()

            print(
                "===> Epoch[{}]({}/{}):  Loss_Sync: {:.4f} lr_g = {:.7f} ".format(
                    epoch, iteration, len(training_data_loader), float(loss_sync) ,
                    optimizer_s.param_groups[0]['lr']))

        update_learning_rate(net_s_scheduler, optimizer_s)

        # checkpoint
        if epoch %  opt.checkpoint == 0 :
            if not os.path.exists(opt.result_path):
                os.makedirs(opt.result_path)
            model_out_path = os.path.join(opt.result_path, 'netS_model_epoch_{}.pth'.format(epoch))
            states = {
                'epoch': epoch + 1,
                'state_dict': {'net': net_lipsync.state_dict()},
                'optimizer': {'net': optimizer_s.state_dict()}
            }
            torch.save(states, model_out_path)
            print("Checkpoint saved to {}".format(epoch))
        if epoch %  opt.stop_checkpoint == 0:
            break

Single-frame dubbed image vs. entire-sequence audio

Hello, I have a question. Each dubbed image that is generated is a single frame, but the driving audio is the entire sequence. Which part of the audio does this single-frame dubbed image correspond to?
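For reference, in pipelines of this kind each generated frame is usually driven by a short window of per-frame audio features centred on that frame's index; the sketch below only illustrates the indexing, and the window length is not necessarily DINet's exact setting:

import numpy as np

def audio_window(audio_features, frame_idx, half_win=2):
    # audio_features: (num_frames, feature_dim), one row of DeepSpeech-style features per video frame
    # half_win: illustrative context size; the real window length is fixed by the model
    num_frames = audio_features.shape[0]
    start = max(0, frame_idx - half_win)
    end = min(num_frames, frame_idx + half_win + 1)
    window = audio_features[start:end]
    # pad at clip boundaries so every frame sees the same window length
    pad_before = max(0, half_win - frame_idx)
    pad_after = max(0, frame_idx + half_win + 1 - num_frames)
    if pad_before or pad_after:
        window = np.pad(window, ((pad_before, pad_after), (0, 0)), mode='edge')
    return window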

My own audio cannot drive the model

Hi author, why do I get an error when running with my own audio? I then replaced audio_sample_rate, audio = wavfile.read(audio_path) with audio, audio_sample_rate = librosa.load(audio_path); it runs, but the mouth does not move in the result.

What do the syncnet and discriminator outputs actually mean?

In Wav2Lip these two modules directly output a single number as the result, whereas in DINet the output is a feature map of shape around (1, 1, 2, 2), and the loss is computed by comparing it against an expanded all-ones matrix. Although this may essentially be equivalent to adding an average-pooling layer, it still feels a bit awkward to me.
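For reference, the all-ones target makes this a PatchGAN-style loss over a small score map rather than a single scalar. A small sketch (not DINet's exact loss code) comparing the two formulations; the map loss equals the pooled loss plus the variance of the patch scores, so the two coincide only when every patch score is identical:

import torch
import torch.nn.functional as F

torch.manual_seed(0)
score_map = torch.rand(1, 1, 2, 2)                              # patch-style output, as described above
target = torch.ones_like(score_map)

patch_loss = F.mse_loss(score_map, target)                      # compare the whole map with an all-ones map
pooled_loss = F.mse_loss(score_map.mean(), torch.tensor(1.0))   # average-pool first, then compare

print(float(patch_loss), float(pooled_loss))                    # patch_loss >= pooled_loss; equal only for a constant map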

Is multi-stage training necessary?

May I ask whether multi-stage training is necessary for DINet, or whether it is possible to train only the final stage to save training time? I understand that multi-stage training is primarily used to improve initialization, so in theory it should be possible to train only the final stage.

Generalizing the DINet

Hi, this is a very good project, thanks for making it open source. I would like to know what changes are needed to generalize the clip network, as I can see it was trained on only about 400 videos.

RuntimeError: d.is_cuda() INTERNAL ASSERT FAILED

My training hit this error when using a single 4090 GPU:

train_DINet_frame64 ===> Epoch90: Loss_DI: 0.2199 Loss_GI: 0.3163 Loss_perception: 2.4996 Loss_g: 2.8160 lr_g = 0.0001000
Traceback (most recent call last):
File "train_DINet_frame.py", line 121, in
loss_g.backward()
File "/root/miniconda3/envs/dinet/lib/python3.7/site-packages/torch/_tensor.py", line 396, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
File "/root/miniconda3/envs/dinet/lib/python3.7/site-packages/torch/autograd/init.py", line 175, in backward
allow_unreachable=True, accumulate_grad=True) # Calls into the C++ engine to run the backward pass
RuntimeError: d.is_cuda() INTERNAL ASSERT FAILED at "../c10/cuda/impl/CUDAGuardImpl.h":30, please report a bug to PyTorch.

This error also occurs randomly during the forward pass.

It seems related to nn.DataParallel, even though I use a single GPU.
Any ideas would be much appreciated.

Continue Model Training?

Is it possible to add a feature or change some code so that we can continue training?

Currently, if training crashes or is stopped, we can't continue and have to retrain from the start of that step.
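For reference, the checkpoints written by the training scripts (see the syncnet snippet above) store the epoch plus model and optimizer state, so resuming is mostly a matter of loading them back before the epoch loop. A minimal sketch, assuming that checkpoint layout; the official scripts' key names may differ:

import torch

def resume_training(net, optimizer, checkpoint_path, device='cuda'):
    # Restores state from a checkpoint saved as
    # {'epoch': ..., 'state_dict': {'net': ...}, 'optimizer': {'net': ...}}
    checkpoint = torch.load(checkpoint_path, map_location=device)
    net.load_state_dict(checkpoint['state_dict']['net'])
    optimizer.load_state_dict(checkpoint['optimizer']['net'])
    return checkpoint['epoch']  # epoch to continue from, instead of opt.start_epoch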

Testing with my own video: the video frame count does not match the frame count of the OpenFace CSV at inference time

Hello!
When testing with my own video, the number of video frames at inference time does not match the number of frames in the CSV generated by OpenFace. How can I solve this? Is there a problem with the source video? I have tried several videos and they all behave the same way:
The video_landmark_data.shape: 249
aligning frames with driving audio
len_video_frames: 250
Traceback (most recent call last):
File "inference.py", line 67, in
raise ('video frames are misaligned with detected landmarks')
TypeError: exceptions must derive from BaseException
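For reference, when the mismatch is only a frame or two, one common workaround (not an official fix) is to trim the video frames and the landmark rows to the shorter length before the alignment step; larger gaps usually mean OpenFace dropped frames or the video has a variable frame rate:

import numpy as np

def trim_to_common_length(video_frames, landmarks):
    # video_frames: list of decoded frames; landmarks: (num_rows, ...) array from the OpenFace CSV
    n = min(len(video_frames), landmarks.shape[0])
    return video_frames[:n], landmarks[:n]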

Questions about the loss weights

The loss-weight settings in the paper differ from those in the open-source code. Is it really necessary to set the sync loss weight so low (0.1)? Can it still be effective?

Training loss convergence

I'm training, but it's difficult to judge the loss convergence, so I keep failing to get good results. For those of you who are training: how do you figure it out?

what is deepspeech used for?

I see that speech-to-text is not mentioned in the paper. Is it only used for training, or for inference too? And what purpose does it serve?

The mouth barely moves in videos generated with my own audio

I generate videos using the driving_audio_x.wav files in the official ./asserts/examples and they work fine; the mouth movements are smooth. However, when I synthesize with audio produced by TTS models such as Bark or Tortoise, the mouth of the generated talking head barely moves. Notably, if I load and re-save driving_audio_x.wav through torchaudio, its bit rate doubles, and the mouth also barely moves in the resulting video. Does anyone have any ideas?
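For reference, DeepSpeech-style feature extractors generally expect 16 kHz mono PCM WAV, so resampling the TTS output (and anything re-saved at a different rate) before inference is a reasonable first thing to try; the exact expected sample rate should be confirmed against the repo's audio-processing code:

import librosa
import soundfile as sf

# hypothetical file names; resample to 16 kHz mono 16-bit PCM before running inference
audio, sr = librosa.load('tts_output.wav', sr=16000, mono=True)
sf.write('tts_output_16k.wav', audio, 16000, subtype='PCM_16')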

Other technology than OpenFace?

Hi,

First of all, congratulations on this project. I really like it. It is not easy to find a good project like this one (I spent many hours looking for something with this quality that is easy to use and easy to train).

Do you know of another technology similar to OpenFace? I know there are many facial landmark detectors, but I am asking in case you found a better option than OpenFace in 2023.

Thank you,

David Martin

How large is the GPU memory?

Hello, thanks for your nice work! My GPU (2080 Ti) has 12 G of memory. When I run inference.py, the GPU runs out of memory. How much GPU memory do you use?

Problem when running the example commands: Traceback (most recent call last): File "inference.py", line 88, in <module> raise ('our method can not handle videos with large change of facial size!!')

Running the command:
python inference.py --mouth_region_size=256 --source_video_path=./asserts/examples/test4.mp4 --source_openface_landmark_path=./asserts/examples/test4.csv --driving_audio_path=./asserts/examples/driving_audio_4.wav --pretrained_clip_DINet_path=./asserts/clip_training_DINet_256mouth.pth
produces this error:
Traceback (most recent call last):
File "inference.py", line 88, in
raise ('our method can not handle videos with large change of facial size!!')

Running the command:
python inference.py --mouth_region_size=256 --source_video_path=./asserts/examples/test24.mp4 --source_openface_landmark_path=./asserts/examples/test24.csv --driving_audio_path=./asserts/examples/driving_audio_2.wav --pretrained_clip_DINet_path=./asserts/clip_training_DINet_256mouth.pth
produces:
loading facial landmarks from : ./asserts/examples/test24.csv
aligning frames with driving audio
Traceback (most recent call last):
File "inference.py", line 59, in
raise ('video frames are misaligned with detected landmarks')
TypeError: exceptions must derive from BaseException

How can these be solved?

DeepSpeech model: where does it come from?

As the title suggests: I would like to get rid of the TF dependency and am trying to convert the full model to ONNX. DeepSpeech is the first challenge. It is released as a black box and doesn't seem consistent with the latest official DeepSpeech releases.

Where does it come from exactly?

Wrong file for output_graph.pb

Hi, thanks for the amazing work!
When I tried to unzip asserts.zip, it reported that the output_graph.pb file in the archive is damaged. Could you please check the zip package and repair the corresponding file? Thank you so much!

CPU usage

It works perfectly on GPU. Is it possible to run it on CPU? If so, could you please add some examples?

Thanks!
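For reference, a hedged sketch of what CPU-only execution would involve: map the checkpoint to the CPU and keep the model there. The repo's scripts call .cuda() in several places, so those calls would also need removing or guarding, and inference will be considerably slower:

import torch

device = torch.device('cpu')
checkpoint = torch.load('./asserts/clip_training_DINet_256mouth.pth', map_location=device)
# model = DINet(...)                                          # constructed as in inference.py
# model.load_state_dict(checkpoint['state_dict']['net_g'])    # key name assumed, check the script
# model.to(device).eval()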

Own videos with the pretrained models?

So far, not a single custom video has worked for me.

If the *.csv was previously created with OpenFace, can the pretrained .pth model be used to run inference on a custom video?

Or does the custom video have to be included in the HDTF training set beforehand in order to use it for inference?

Question about facial size change error

I got this error: 'our method can not handle videos with large change of facial size!!'
However, my video does not have a large change in facial size. Could this be an OpenFace landmark mistake? I debugged the error; it reaches the check below and inference fails:
elif max(radius_clip) > min(radius_clip) * 1.5:
return False, None

I also noticed that OpenFace has multiple landmark model settings:
CLM, CLNF, CE-CLM
Do I have to choose one of these?
I have also attached my CSV file:
new_light_English.csv
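For reference, a rough diagnostic (not the repo's exact radius computation) is to estimate a per-frame face scale from the OpenFace 2D landmarks and check how much it varies across the clip; inference.py rejects the clip when its own radius measure satisfies max(radius_clip) > min(radius_clip) * 1.5:

import pandas as pd

df = pd.read_csv('new_light_English.csv')
df.columns = [c.strip() for c in df.columns]            # OpenFace headers may carry a leading space
ys = df[['y_{}'.format(i) for i in range(68)]].to_numpy()
scale = ys.max(axis=1) - ys.min(axis=1)                 # landmark bounding-box height per frame
print('max/min scale ratio:', scale.max() / scale.min())
# ratios well above 1.5 usually point to a tracking glitch or a real zoom/scale change in the video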

OpenFace .csv not working

Hi,

I installed OpenFace and tried these two commands on Ubuntu:

./build/bin/FeatureExtraction -f video.mp4

./build/bin/FaceLandmarkVid -f video.mp4

I tried both output CSVs. In both cases I ran the following command:

python inference.py --mouth_region_size=256 --source_video_path=/home/pc/video.mp4 --source_openface_landmark_path=/home/pc/output.csv --driving_audio_path=/home/pc/audio.wav --pretrained_clip_DINet_path=./asserts/clip_training_DINet_256mouth.pth

Unfortunately, it is not working as expected. The lips do not move.
The output video_synthetic_face.mp4 is blurry and there is no recognizable face. So I take it for granted that I am not running the correct feature extraction, or maybe I need some parameters that I don't know how to pass.

What is the right command to extract the face landmarks so that they are compatible with DINet?

Thank you!

Optimized version

Hi,

thanks for this amazing work. I have worked a bit on this project to remove the DeepSpeech dependencies, alongside some other optimization efforts. You can find this optimized version here:

https://github.com/Elsaam2y/DINet_optimized

This version improves inference latency by 50-60%. I can also open a PR in this repo if you are willing to accept external PRs.

Thanks

I need help with feature extraction

If possible, could you point me to another way to obtain the same features, for example with MediaPipe or dlib? I keep getting the error that the change in facial size cannot be handled.
If you could release the version and settings you used to produce the feature-extraction CSV file, that would help me a lot.

DeepSpeech

Hello,

many thanks for this amazing work. I was just wondering whether there is any specific reason for using this particular version of DeepSpeech?

I was thinking of retraining the model with different audio processing, for example a newer version of DeepSpeech or mel-spectrograms. However, I want to know whether you have already tried this and whether there are any objections.

Thanks
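For reference, a minimal mel-spectrogram front end as one possible replacement for the DeepSpeech features; the input path and parameters below are illustrative, not values tuned for DINet, and the features would still need to be aligned to the video frame rate:

import numpy as np
import librosa

audio, sr = librosa.load('./asserts/examples/driving_audio_1.wav', sr=16000, mono=True)  # illustrative input
mel = librosa.feature.melspectrogram(y=audio, sr=sr, n_fft=800, hop_length=200, n_mels=80)
log_mel = np.log(np.clip(mel, 1e-5, None))
print(log_mel.shape)    # (n_mels, num_audio_frames)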
