
syncnet_python's People

Contributors

joonson


syncnet_python's Issues

FileNotFoundError: [Errno 2] No such file or directory: 'data/output/pywork/example/tracks.pckl'

Hi,

I tried to test the pipeline following the instructions in README.md, but ran into this error:

Traceback (most recent call last):
  File "run_pipeline.py", line 298, in <module>
    scene = scene_detect(opt)
  File "run_pipeline.py", line 235, in scene_detect
    scene_manager.detect_scenes(frame_source=video_manager)
  File "/usr/local/lib/python3.7/dist-packages/scenedetect/scene_manager.py", line 469, in detect_scenes
    self._process_frame(self._num_frames + start_frame, frame_im)
  File "/usr/local/lib/python3.7/dist-packages/scenedetect/scene_manager.py", line 366, in _process_frame
    self._add_cuts(detector.process_frame(frame_num, frame_im))
  File "/usr/local/lib/python3.7/dist-packages/scenedetect/detectors/content_detector.py", line 100, in process_frame
    curr_hsv[i] = curr_hsv[i].astype(numpy.int32)
TypeError: 'tuple' object does not support item assignment
Model data/syncnet_v2.model loaded.
Traceback (most recent call last):
  File "run_visualise.py", line 28, in <module>
    with open(os.path.join(opt.work_dir,opt.reference,'tracks.pckl'), 'rb') as fil:
FileNotFoundError: [Errno 2] No such file or directory: 'data/output/pywork/clip1/tracks.pckl'

When I checked content_detector.py (line 92), I found the following code:

# line 92 content_detector.py
curr_hsv = cv2.split(cv2.cvtColor(frame_img, cv2.COLOR_BGR2HSV))

The cv2.split function takes an image as input and returns its color channels as a tuple. Then:

# line 100 content_detector.py
curr_hsv[i] = curr_hsv[i].astype(numpy.int32)

But as the error says, a tuple is immutable and does not support item assignment. When I cast the result of the first line to a list, the problem was fixed:

# fixed
# line 92 content_detector.py
curr_hsv = list(cv2.split(cv2.cvtColor(frame_img, cv2.COLOR_BGR2HSV)))

The scenedetect version I used was scenedetect==0.5.1. Is this bug fixed in a newer version of the scenedetect module? If so, I think requirements.txt needs to be updated.

Evaluation on Columbia dataset

Thank you very much for sharing the code and model from the paper. I have some questions about the testing method used in the paper.
The frame rate of the Columbia videos mentioned in the paper is 30 fps, while both the public model and the paper were trained at 25 fps.
1. Is the SyncNet model sensitive to the frame rate? Should the frame rates of training and testing be consistent for good performance?
2. Did you resample the Columbia videos to 25 fps during the evaluation?
Thank you for sharing some details about this evaluation.

Error to download model

Hi, thank you for your work.
I want to try your implementation, but I get a 404 error when trying to download the model.
I also have a question: on which corpus was the network trained?
Thanks

results not consistent with README.md

OS: Ubuntu 16.04

requirements.txt:
torch==0.4.0
numpy==1.14.3
scipy==1.0.1
opencv-python==3.4.0.14
python_speech_features==0.6
tensorflow==1.4
pyscenedetect==0.3.4

Results:
AV offset: 4 Min dist: 6.742 Confidence: 10.447

The numbers are off; please advise.

Is it okay that elements are removed from framefaces list while iterating over it?

In the file there is a for loop over the framefaces variable, and inside that loop framefaces gets modified (faces that have already been tracked are removed):

framefaces.remove(face)

I'm not sure whether this behavior is okay, because it may mean we lose some of the frames. At the same time, it is all executed inside a while loop, so maybe in the end no data is lost.
I might try to rewrite this if the problem really exists, but maybe it isn't one.
Do you think it is a problem? Should it be fixed?
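
For reference, a minimal standalone illustration (hypothetical data, not the repo's tracker code) of why removing items from a list while iterating over it can skip elements, plus a safer variant that iterates over a copy:

# Removing from a list while iterating skips the element that slides
# into the freed position (hypothetical values, not the repo's data).
framefaces = ["a", "b", "c", "d"]
for face in framefaces:
    if face in ("a", "b"):           # pretend these are already tracked
        framefaces.remove(face)
print(framefaces)                    # ['b', 'c', 'd'] -- "b" was never visited

# Safer: iterate over a shallow copy and mutate the original.
framefaces = ["a", "b", "c", "d"]
for face in list(framefaces):
    if face in ("a", "b"):
        framefaces.remove(face)
print(framefaces)                    # ['c', 'd']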

Different results between demo_syncnet.py and run_syncnet.py

When using python demo_syncnet.py --videofile data/example.avi --tmp_dir /path/to/temp/directory, I get the same result as in the README:
AV offset: 4 Min dist: 6.742 Confidence: 10.447

But when using run_syncnet.py, I get a different result:
AV offset: 2 Min dist: 7.093 Confidence: 9.238

Why is that?
Thank you!

File not found issue when I tried to run the demo

~/syncnet_python$ python demo_syncnet.py --videofile data/1.mp4 --tmp_dir temp_dir
Traceback (most recent call last):
  File "demo_syncnet.py", line 27, in <module>
    s.loadParameters(opt.initial_model);
  File "/home/dajun/syncnet_python/SyncNetInstance.py", line 202, in loadParameters
    loaded_state = torch.load(path, map_location=lambda storage, loc: storage);
  File "/home/dajun/anaconda3/envs/syncnet/lib/python3.6/site-packages/torch/serialization.py", line 581, in load
    with _open_file_like(f, 'rb') as opened_file:
  File "/home/dajun/anaconda3/envs/syncnet/lib/python3.6/site-packages/torch/serialization.py", line 230, in _open_file_like
    return _open_file(name_or_buffer, mode)
  File "/home/dajun/anaconda3/envs/syncnet/lib/python3.6/site-packages/torch/serialization.py", line 211, in __init__
    super(_open_file, self).__init__(open(name, mode))
FileNotFoundError: [Errno 2] No such file or directory: 'data/syncnet_v2.model'

Please help.

Visualise error

The file activesd.pckl is not being created by the previous steps in the pipeline, and when I run the visualise step it throws an error because of this.

RuntimeError: mat1 and mat2 shapes cannot be multiplied (4x278528 and 512x512)

While running run_syncnet.py I get the error below:

WARNING: Audio (3.6720s) and video (3.7200s) lengths are different.
Traceback (most recent call last):
  File "run_syncnet.py", line 40, in <module>
    offset, conf, dist = s.evaluate(opt,videofile=fname)
  File "/home/ubuntu/wav2lip_288x288/syncnet_python/SyncNetInstance.py", line 112, in evaluate
    im_out = self.__S__.forward_lip(im_in.cuda());
  File "/home/ubuntu/wav2lip_288x288/syncnet_python/SyncNetModel.py", line 108, in forward_lip
    out = self.netfclip(mid);
  File "/home/ubuntu/anaconda3/envs/wav2lip/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ubuntu/anaconda3/envs/wav2lip/lib/python3.7/site-packages/torch/nn/modules/container.py", line 139, in forward
    input = module(input)
  File "/home/ubuntu/anaconda3/envs/wav2lip/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ubuntu/anaconda3/envs/wav2lip/lib/python3.7/site-packages/torch/nn/modules/linear.py", line 114, in forward
    return F.linear(input, self.weight, self.bias)
RuntimeError: mat1 and mat2 shapes cannot be multiplied (4x278528 and 512x512)

Please let me know if anybody knows how to resolve this.

how to remove temporal lags according to the offset?

Thanks for the work!
I would like to know: after I get the offset between the audio and the video, how can I remove the temporal lag? That is, how can I make the video synced with the audio?
Looking forward to your reply!
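
One possible approach, as a rough sketch (the file names are hypothetical, the sign convention should be verified on your own data, and this is not a documented procedure of this repo): convert the reported AV offset from 25 fps video frames into seconds and shift the audio stream with ffmpeg's -itsoffset, for example:

import subprocess

# Hypothetical example: SyncNet reported "AV offset: 4" for input.mp4.
# If the offset is expressed in video frames at 25 fps, that is 4 / 25 = 0.16 s.
offset_frames = 4
fps = 25
shift_seconds = offset_frames / fps

# Shift the audio stream by shift_seconds while copying both streams.
cmd = [
    "ffmpeg", "-y",
    "-i", "input.mp4",                     # video source
    "-itsoffset", f"{shift_seconds:.3f}",  # applies to the next input
    "-i", "input.mp4",                     # audio source, shifted
    "-map", "0:v", "-map", "1:a",
    "-c", "copy",
    "output_synced.mp4",
]
subprocess.run(cmd, check=True)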

EOFError: Ran out of input

After running run_pipeline.py, I am getting this error:

Stream mapping:
  Stream #0:1 -> #0:0 (mp3 (native) -> pcm_s16le (native))
Press [q] to stop, [?] for help
Output #0, wav, to './synced_output/pyavi/0000.mp4/audio.wav':
  Metadata:
    ISFT            : Lavf57.83.100
    Stream #0:0: Audio: pcm_s16le ([1][0][0][0] / 0x0001), 16000 Hz, mono, s16, 256 kb/s
    Metadata:
      encoder         : Lavc57.107.100 pcm_s16le
size=     137kB time=00:00:04.39 bitrate= 256.1kbits/s speed= 391x
video:0kB audio:137kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: 0.055499%
[S3FD] loading with cuda
Traceback (most recent call last):
  File "./run_pipeline.py", line 294, in <module>
    faces = inference_video(opt)
  File "./run_pipeline.py", line 187, in inference_video
    DET = S3FD(device='cuda')
  File "/media/SSD/syncnet_python/detectors/s3fd/__init__.py", line 22, in __init__
    state_dict = torch.load(PATH_WEIGHT, map_location=self.device)
  File "/home/administrator/anaconda3/envs/sync1/lib/python3.8/site-packages/torch/serialization.py", line 795, in load
    return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
  File "/home/administrator/anaconda3/envs/sync1/lib/python3.8/site-packages/torch/serialization.py", line 1002, in _legacy_load
    magic_number = pickle_module.load(f, **pickle_load_args)
EOFError: Ran out of input

Multi-speaker, multi-shot, movie-like videos

Hi, @joonson

Thanks for open-sourcing this amazing work. I'm able to test the pre-trained SyncNet model on a single-speaker, single-shot video. However, when there are two or more speakers and multiple scenes, and run_pipeline.py is used, the frames are extracted into the REFERENCE folder, but pycrop is empty. The empty pycrop folder is probably why the SyncNet model loads but produces no output when run_syncnet.py is run. I came across an issue opened in this GitHub repo regarding multiple-speaker detection, and it was clarified there that the pipeline does work on multi-speaker frames. But when I run run_pipeline.py on my video, it is not able to detect multiple speakers and keep track of them across multiple scenes (pycrop stays empty). Can you please share some insight on what I might do to fix this? First of all, is it even possible to predict the AV offset using SyncNet in such a scenario, where the videos are movie-like? Thank you.

Synchronization happens only for the first few minutes of the video

I have been trying to use the repository to perform speech diarization. I tested a 5-minute-long video with one person in frame; AV synchronisation could be seen for approximately 2 minutes in the video_out file, but not for the complete 5 minutes. Please also help me understand what the numbers below mean.
AV offset: 1
Min dist: 12.508
Confidence: 0.108

Explanation of outputs

I ran your code for different audio delays:

delay = 0.00 seconds - AV offset -2, conf 0.038
delay = 0.25 seconds - AV offset -9, conf 0.048
delay = 0.50 seconds - AV offset -15, conf 0.039
delay = 0.75 seconds - AV offset -4, conf 0.022
delay = 1.00 seconds - AV offset 12, conf 0.029

  1. Clearly the delay is not being reflected with much confidence in the results. Is this a work in progress?
    FYI, the above values were for videos converted from 30fps to 25fps, which had the issue of:
    Mismatch between the number of audio and video frames. Type 'cont' to continue.

  2. I see that the example video has the full face (1.5 x dlib face_rect) as the input to the lip model. Does this mean the model will only work for faces in the LRW dataset? I am trying with non-LRW faces.

On the need of converting to .avi in run_pipeline.py...

Hi,

I have gone through the implementation, and when looking at run_pipeline.py I wanted to find out whether there is any particular benefit to converting the input video file (most likely an mp4) to .avi. From my understanding, there isn't any part of the code that would not work if the mp4 were used directly, but I presumed this conversion from mp4 to avi has some significance.

Thanks.

adding requirements.txt

Dear @joonson ,

While trying to use your software, I wasted a few hours trying to install the required dependencies using several strategies: conda, pip, etc.

Then I found a requirements.txt file in a fork of your repository that allowed me to easily install all the required software dependencies.
Here is the link to the file that worked for me, created by @KeithIMyers:
https://github.com/KeithIMyers/syncnet_python/blob/master/requirements.txt

I would suggest adding this file to your repository and updating the README.md to point to this new resource.

Kind regards,

Alignment with paper

Hello, thanks for releasing the pytorch version of the code!
I have a couple of questions about how this repo syncs with the paper (sorry for the pun).

  1. fc7 in the paper is a 256-d vector, whereas here the output feature is 1024-d (at least the pretrained model seems to be). Is this a newer/better version of this work, or am I looking in the wrong place?
  2. In SyncNetInstance.py, line 107, there is a *4 applied to the sampling of the audio. I suspect it refers to some sort of stride, but I seem to have missed the part of the paper mentioning it (perhaps it is too fundamental?). Could you explain what it is? (See also the sketch below.)
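
As a hedged aside (an assumption based on common MFCC settings, not a confirmed explanation): MFCC features are usually computed with a 10 ms hop, i.e. 100 feature frames per second, while the video runs at 25 fps, so there would be 4 audio feature frames per video frame.

# Hedged illustration: why a factor of 4 can appear when indexing audio
# features per video frame. Assumes 25 fps video and MFCC features with a
# 10 ms hop (100 MFCC frames per second); the frame index is hypothetical.
video_fps = 25
mfcc_frames_per_second = 100
stride = mfcc_frames_per_second // video_fps   # = 4
video_frame_index = 10
audio_feature_index = video_frame_index * stride
print(stride, audio_feature_index)             # 4 40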

How to use this repo?

If I am correct, after downloading the models, I have to run

python run_pipeline.py --videofile /path/to/video.mp4 --reference name_of_video --data_dir /path/to/output
python run_syncnet.py --videofile /path/to/video.mp4 --reference name_of_video --data_dir /path/to/output

and the sync-corrected output will be stored in output_dir/pyavi/video.avi

Is this correct?

Size mismatch, m1: [5*278528], m2: [512*512] at

Dear author,

Thank you for this excellent work!

I ran into a problem when running on a video. At first my CUDA memory ran out, so I reduced the batch size to 5. Then I got this error:

RuntimeError: size mismatch, m1: [5 x 278528], m2: [512 x 512]
Traceback (most recent call last):
  File "demo_syncnet.py", line 30, in <module>
    s.evaluate(opt, videofile=opt.videofile)
  File "C:\XC\syncnet_python\SyncNetInstance.py", line 112, in evaluate
    im_out = self.__S__.forward_lip(im_in.cuda());
  File "C:\XC\syncnet_python\SyncNetModel.py", line 108, in forward_lip
    out = self.netfclip(mid);
  File "C:\Users\xunyu\Anaconda3\lib\site-packages\torch\nn\modules\module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "C:\Users\xunyu\Anaconda3\lib\site-packages\torch\nn\modules\container.py", line 100, in forward
    input = module(input)
  File "C:\Users\xunyu\Anaconda3\lib\site-packages\torch\nn\modules\module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "C:\Users\xunyu\Anaconda3\lib\site-packages\torch\nn\modules\linear.py", line 87, in forward
    return F.linear(input, self.weight, self.bias)
  File "C:\Users\xunyu\Anaconda3\lib\site-packages\torch\nn\functional.py", line 1370, in linear
    ret = torch.addmm(bias, input, weight.t())
RuntimeError: size mismatch, m1: [5 x 278528], m2: [512 x 512] at C:/w/1/s/tmp_conda_3.7_100118/conda/conda-bld/pytorch_1579082551706/work/aten/src\THC/generic/THCTensorMathBlas.cu:290

Could you take a look at this issue at your convenience? Thanks a lot!

RuntimeError: cuda runtime error (10) : invalid device ordinal

OS: Ubuntu 16.04 LTS
GPU: Nvidia k80 12gb memory
Cuda: v9
conda 4.5.1
Python: 3.6
Pytorch: 0.4

After I:

  1. downloaded the model
  2. pip-installed python_speech_features

I ran:

python testSyncNet.py --video data/example.avi

Error:

THCudaCheck FAIL file=torch/csrc/cuda/Module.cpp line=32 error=10 : invalid device ordinal
Traceback (most recent call last):
  File "testSyncNet.py", line 164, in <module>
    s.loadParameters(args.initial_model);
  File "testSyncNet.py", line 150, in loadParameters
    loaded_state = torch.load(path);
  File "/home/T/anaconda3/lib/python3.6/site-packages/torch/serialization.py", line 303, in load
    return _load(f, map_location, pickle_module)
  File "/home/T/anaconda3/lib/python3.6/site-packages/torch/serialization.py", line 469, in _load
    result = unpickler.load()
  File "/home/T/anaconda3/lib/python3.6/site-packages/torch/serialization.py", line 437, in persistent_load
    data_type(size), location)
  File "/home/T/anaconda3/lib/python3.6/site-packages/torch/serialization.py", line 88, in default_restore_location
    result = fn(storage, location)
  File "/home/T/anaconda3/lib/python3.6/site-packages/torch/serialization.py", line 70, in _cuda_deserialize
    return obj.cuda(device)
  File "/home/T/anaconda3/lib/python3.6/site-packages/torch/_utils.py", line 68, in _cuda
    with torch.cuda.device(device):
  File "/home/T/anaconda3/lib/python3.6/site-packages/torch/cuda/__init__.py", line 227, in __enter__
    torch._C._cuda_setDevice(self.idx)
RuntimeError: cuda runtime error (10) : invalid device ordinal at torch/csrc/cuda/Module.cpp:32

Note: I can run TF/Keras just fine on this machine.

is RGB or BGR format used for the model

Hi, thanks for your brilliant work! I am confused about two points:

  1. Do you use RGB or BGR as input for the model? That is, should frames be used as read by OpenCV (BGR):

    while frame_num:
        frame_num += 1
        ret, image = cap.read()
        if ret == 0:
            break
        images.append(image)

    or converted to RGB first:

    while frame_num:
        frame_num += 1
        ret, image = cap.read()
        if ret == 0:
            break
        image_np = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
        images.append(image_np)

  2. are there any differences between these two steps? What I guess is 4...

    lastframe = len(images)-6

    lastframe = len(images)-4

Thanks in advance!

Error during setup and running demo

Hi, I was having trouble with the original requirements.txt due to opencv-contrib-python, so I modified it and tried to run the demo. Take a look at the screenshots below. Do you know why it's not working?

[two error screenshots attached: Screen Shot 2020-10-26 at 1 22 54 PM, Screen Shot 2020-10-26 at 1 22 42 PM]

Input preprocessing for SyncNet

Thanks for the fantastic work with SyncNet and for releasing its code!

I am currently using SyncNet (https://github.com/joonson/syncnet_python) for the evaluation of a project that I have been working on. I had a couple of questions on it and would be grateful if you could answer them:

  1. Do I need to preprocess the input video to SyncNet (beyond the preprocessing that SyncNet does internally)?

  2. Would you be able to share an example of what an input frame looks like to the SyncNet network (i.e. after all the preprocessing)?

Model taking lip video as input

Hi! Thank you for your excellent work in the paper!
As stated in your paper, your model takes lip video as input, while this repo only provides a model that takes face video as input. Could you please provide a pre-trained model that takes lip video as input? It would be very helpful for me!

loss function details

Hi,

Where is the loss defined?
What is an appropriate value for the margin?

Thanks

The cropped face region

Hi Joon,

Thanks for the excellent work. We are now using SyncNet to build a data-collection pipeline. As I went through the code, I found that it detects the face region and then scales it before cropping. The default crop_scale=0.4, which means the cropped region is 1.4 times larger than the original face region. Is this also the default behavior when training the model? Why should we apply this scaling, since there should be no useful information outside the real face? Any advice?
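
For context, a minimal sketch of what a crop_scale-style padding does to a detected face box (generic box-expansion logic with hypothetical names, not necessarily the repo's exact cropping code):

# Expand a face box by crop_scale of its size in total per axis,
# i.e. crop_scale=0.4 makes the crop 1.4x the detected box.
def expand_box(x1, y1, x2, y2, crop_scale=0.4):
    w, h = x2 - x1, y2 - y1
    pad_x, pad_y = crop_scale * w / 2, crop_scale * h / 2
    return x1 - pad_x, y1 - pad_y, x2 + pad_x, y2 + pad_y

print(expand_box(100, 100, 200, 200))  # (80.0, 80.0, 220.0, 220.0)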

python2 Vs python3

Dear @joonson ,

First of all I'd like to thank you for this software that is quite impressive!

I managed to run your code with python3 without any problems.

However, when I first read the instructions in the README.md, you suggested using Python 2.7.
I tried to launch your program using Python 2.7, but I faced a few problems:
In the scripts run_syncnet.py and run_visualise.py, you call pickle.load with two arguments: the pickled file AND an "encoding" keyword argument.
The "encoding" keyword argument is only supported in Python 3.
When I removed this encoding argument from the two scripts, the code worked fine.

So here are my suggestions:

  • You could either state in the README.md that your program requires Python 3 instead of Python 2,
  • or you could change run_syncnet.py and run_visualise.py to remove the encoding argument passed to pickle.load, which would allow the code to run with Python 2.

Code not working

Hello,

I was using your code on a video, but it gives an error when running the demo_syncnet.py file. It runs fine for example.avi but not for my video. Can you help me?

[error screenshot attached]

Lip sync error using Syncnet

Hello,
Thank you for the excellent work and publicly available code.
I am using SyncNet to find out whether there is a lip-sync error in a video. I am getting very random values of AV offset and confidence. I am using the trained weights available on the official website. Can someone elaborate on this paragraph from the paper?

Determining the lip-sync error: "To find the time offset between the audio and the video, we take a sliding-window approach. For each sample, the distance is computed between one 5-frame video feature and all audio features in the ±1 second range. The correct offset is when this distance is at a minimum. However, as Table 2 suggests, not all samples in a clip are discriminative (for example, there may be samples in which nothing is being said at that particular time), therefore multiple samples are taken for each clip, and then averaged."

I am missing something in this paragraph. How do I collect multiple samples for each clip?
I would like to know how to get proper values of the metrics (AV offset, confidence) that show how out of sync the video and audio of a sample are.
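
My current understanding of that procedure, as a rough sketch (hypothetical feature arrays and a plausible confidence measure, not the repo's exact code):

import numpy as np

# Sliding-window search sketched from the quoted paragraph.
# video_feats, audio_feats: hypothetical (N, D) arrays with one 5-frame video
# feature and one audio feature per video-frame position; at 25 fps,
# +/- 1 second corresponds to +/- 25 frame shifts.
def estimate_offset(video_feats, audio_feats, max_shift=25):
    n = len(video_feats)
    dists = np.full((n, 2 * max_shift + 1), np.nan)
    for i in range(n):                                    # one "sample" per video window
        for k, shift in enumerate(range(-max_shift, max_shift + 1)):
            j = i + shift
            if 0 <= j < len(audio_feats):
                dists[i, k] = np.linalg.norm(video_feats[i] - audio_feats[j])
    mean_dist = np.nanmean(dists, axis=0)                 # average the distance curves over samples
    offset = int(np.argmin(mean_dist)) - max_shift        # offset (in frames) at the minimum
    confidence = float(np.nanmedian(mean_dist) - np.nanmin(mean_dist))
    return offset, confidence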

Thank you

The loss does not go down when training SyncNet

Hello, thank you for the excellent work and publicly available code.
I'm trying to use the mvlrs_v1 dataset to train SyncNet, but the loss keeps oscillating:
[two training-loss plots attached]

Since most videos in mvlrs_v1 are short, I randomly shift the audio by up to 10 frames in order to generate synthetic false audio-video pairs.
1. Which dataset are you using: LRW, LRS2, or LRS3?
2. Do the false pairs have to be shifted by up to 2 s?
3. Can you show me your training log?
thank you!

issue to fix the offset

I'm trying to remove the offset of a video. The offset is -15, and I use the following command to shift the video:
ffmpeg -y -i temp.mp4 -itsoffset -0.5 -i newtemp.mp4 -ss -0.5 -t 9.2 -map 0:v -map 1:a new0.mp4
But the new video has an offset equal to 8! I don't know what the issue is; should I change anything else?
The FPS has already been changed to 25.
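
As a hedged aside (assuming the reported offset is expressed in 25 fps video frames, which may not match your setup): an offset of -15 frames would correspond to -0.6 s rather than -0.5 s.

# Hypothetical conversion of an AV offset reported in video frames to seconds,
# assuming 25 fps video.
offset_frames = -15
fps = 25
shift_seconds = offset_frames / fps
print(shift_seconds)   # -0.6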

loss function

Hello,
how do you define the loss function used in training?
I didn't find this part.

Video conversion and synchronization issue

For the command: ffmpeg -y -i %s -qscale:v 2 -async 1 -r 25 %s
you set the fps to 25. But what if the original video's fps is not 25? Does that influence the synchronization result?

The input is face image or lip image?

In the paper, it is said that the input is a lip image. But in this repo and in example.avi, the whole face is kept and processed without cropping out the mouth region. In your Keras version, you only use the lip region.
So for this pretrained model, syncnet_v2.model, what kind of input image should we use?

Demo Not Working (RuntimeError: mat1 dim 1 must match mat2 dim 0)

Dear Authors,
Firstly, thanks for releasing the codes and the models.

I could run your demo code with data/example.avi as suggested in the README.
However, when I try to run the same demo with one of my .avi files, I get the runtime error below.
Here is my video file.
Any guidance on mitigating this would be really appreciated.

Traceback (most recent call last):
  File "demo_syncnet.py", line 30, in <module>
    s.evaluate(opt, videofile=opt.videofile)
  File "/data/syncnet_python/SyncNetInstance.py", line 112, in evaluate
    im_out = self.__S__.forward_lip(im_in.cuda());
  File "/data/syncnet_python/SyncNetModel.py", line 108, in forward_lip
    out = self.netfclip(mid);
  File "/usr/local/google/home/avisek/anaconda2/envs/syncnet/lib/python3.6/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/google/home/avisek/anaconda2/envs/syncnet/lib/python3.6/site-packages/torch/nn/modules/container.py", line 117, in forward
    input = module(input)
  File "/usr/local/google/home/avisek/anaconda2/envs/syncnet/lib/python3.6/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/google/home/avisek/anaconda2/envs/syncnet/lib/python3.6/site-packages/torch/nn/modules/linear.py", line 91, in forward
    return F.linear(input, self.weight, self.bias)
  File "/usr/local/google/home/avisek/anaconda2/envs/syncnet/lib/python3.6/site-packages/torch/nn/functional.py", line 1674, in linear
    ret = torch.addmm(bias, input, weight.t())
RuntimeError: mat1 dim 1 must match mat2 dim 0

Full face or mouth region only

Hi,

Is the model uploaded here trained on full faces, or just the mouth region?
The paper says that the network takes the mouth region as input, but the testing script takes in the full face.

Thanks in advance

Unable to run demo script due to TypeError: 'Tensor' object is not callable

Hi, I am trying to run demo script on my Mac laptop.

I disabled .cuda() in the code as it's not supported on my machine, and downloaded all the required files (example.avi, weights, etc.) following download_model.sh. When I run the demo script, I get the following error:

Input #0, avi, from 'data/example.avi':
  Metadata:
    encoder         : Lavf57.83.100
  Duration: 00:00:12.68, start: 0.000000, bitrate: 701 kb/s
    Stream #0:0: Video: mpeg4 (Simple Profile) (FMP4 / 0x34504D46), yuv420p, 224x224 [SAR 1:1 DAR 1:1], 430 kb/s, 25 fps, 25 tbr, 25 tbn, 25 tbc
    Stream #0:1: Audio: pcm_s16le ([1][0][0][0] / 0x0001), 16000 Hz, mono, s16, 256 kb/s
Stream mapping:
  Stream #0:1 -> #0:0 (pcm_s16le (native) -> pcm_s16le (native))
Press [q] to stop, [?] for help
-async is forwarded to lavfi similarly to -af aresample=async=1:min_hard_comp=0.100000:first_pts=0.
Output #0, wav, to '/Users/lingjingjing/AV_sync/data/demo/audio.wav':
  Metadata:
    ISFT            : Lavf58.45.100
    Stream #0:0: Audio: pcm_s16le ([1][0][0][0] / 0x0001), 16000 Hz, mono, s16, 256 kb/s
    Metadata:
      encoder         : Lavc58.91.100 pcm_s16le
size=     396kB time=00:00:12.68 bitrate= 256.0kbits/s speed=2.53e+03x
video:0kB audio:396kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: 0.019223%
cc_batch type is: <class 'list'>
cc_in type is: <class 'torch.Tensor'>
Traceback (most recent call last):
  File "demo_syncnet.py", line 30, in <module>
    s.evaluate(opt, videofile=opt.videofile)
  File "/Users/lingjingjing/AV_sync/syncnet_python/SyncNetInstance.py", line 123, in evaluate
    cc_out = self.__S__.forward_aud(cc_in())
TypeError: 'Tensor' object is not callable

Has anyone experienced the same issue before? Has anyone successfully run the demo on their Mac?
