
evelynfan / faceformer

Stars: 778 · Watchers: 15 · Forks: 133 · Size: 8.81 MB

[CVPR 2022] FaceFormer: Speech-Driven 3D Facial Animation with Transformers

License: MIT License

Python 100.00%
computer-vision computer-graphics deep-learning facial-animation speech 3d-face 3d-models pytorch-implementation lip-animation facial-expressions

faceformer's People

Contributors

evelynfan


faceformer's Issues

Please provide training details

Hi,

I am trying to train the model from scratch on the vocaset dataset. However, I am not sure whether the training statistics are correct, since the training loss is already very small from the beginning.

"(Epoch 1, iteration 115) TRAIN LOSS: 0.0000014"

Could you please provide training details such as the starting loss, the final loss, and the number of epochs needed for the model to converge?

Thanks a lot!

BIWI

The README instructions call for the BIWI dataset.
I have requested the dataset but was unable to obtain it.
Is there a workaround for this problem? Is there an alternative dataset that does not require creating my own?
Can the demo be run without it?

Rendering is a bit slow

Hello, your work is amazing,
but rendering takes a lot of time: nearly 1 second per frame on a Tesla V100. Is there any solution?

Best wishes!

Jitter problem when testing the demo

Hello, in the vocaset demo I tested, the mouth shakes around the 8th second; do you know the reason? I tested directly with your pretrained model. Looking forward to your reply, thank you.

aixia_1_FaceTalk_170904_03276_TA_condition_FaceTalk_170904_03276_TA.mp4

Error when running the prediction demo

I wanted to test the prediction part by running the demo script. I set up the environment and data as explained in the documentation, but a code-level error occurs when I run the command below:

 python demo.py --model_name vocaset --wav_path "demo/wav/test.wav" --dataset vocaset --vertice_dim 15069 --feature_dim 64 --period 30  --fps 30  --train_subjects "FaceTalk_170728_03272_TA FaceTalk_170904_00128_TA FaceTalk_170725_00137_TA FaceTalk_170915_00223_TA FaceTalk_170811_03274_TA FaceTalk_170913_03279_TA FaceTalk_170904_03276_TA FaceTalk_170912_03278_TA" --test_subjects "FaceTalk_170809_00138_TA FaceTalk_170731_00024_TA" --condition FaceTalk_170913_03279_TA --subject FaceTalk_170809_00138_TA

The error happens in the Wav2Vec2 part and looks as follows:

Traceback (most recent call last):
  File "/home/shounan/Development/FaceFormer/demo.py", line 204, in <module>
    main()
  File "/home/shounan/Development/FaceFormer/demo.py", line 200, in main
    test_model(args)
  File "/home/xxx/Development/FaceFormer/demo.py", line 57, in test_model
    prediction = model.predict(audio_feature, template, one_hot)
  File "/home/xxx/Development/FaceFormer/faceformer.py", line 140, in predict
    hidden_states = self.audio_encoder(audio, self.dataset).last_hidden_state
  File "/home/xxx/miniconda3/envs/faceformer/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/xxx/Development/FaceFormer/wav2vec.py", line 140, in forward
    return_dict=return_dict,
  File "/home/xxx/miniconda3/envs/faceformer/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/xxx/miniconda3/envs/faceformer/lib/python3.7/site-packages/transformers/models/wav2vec2/modeling_wav2vec2.py", line 812, in forward
    position_embeddings = self.pos_conv_embed(hidden_states)
  File "/home/xxx/miniconda3/envs/faceformer/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/xxx/miniconda3/envs/faceformer/lib/python3.7/site-packages/transformers/models/wav2vec2/modeling_wav2vec2.py", line 446, in forward
    hidden_states = hidden_states.transpose(1, 2)
AttributeError: 'tuple' object has no attribute 'transpose'

So I was wondering if the authors or anyone else have had issues running the current version of the code.

Additional env info:

  • Ubuntu 20.04
  • transformers 4.17.0 (latest)
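A minimal workaround sketch, assuming the custom wav2vec.py passes the output of self.feature_projection straight to self.encoder and that newer transformers releases return a tuple from that call (both are assumptions I have not verified against 4.17.0). Alternatively, pinning transformers to the release listed in the repo's requirements avoids the issue entirely:

# Inside the copied Wav2Vec2Model.forward in wav2vec.py (hypothetical placement):
hidden_states = self.feature_projection(extract_features)
# Newer transformers return (hidden_states, extract_features) here instead of a single tensor,
# so unpack defensively before handing the result to the encoder.
if isinstance(hidden_states, tuple):
    hidden_states = hidden_states[0]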

Transformer decoder time consumption problem

Hi, I am a novice with Transformers.
In the predict() function, I noticed that the time taken by the transformer decoder inside the for loop is not stable. The average time per iteration is about 2 ms, but there are always some spikes. Do you have any idea why? Thanks!
[Screenshot: 2022-04-27 14:42]
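CUDA kernels are launched asynchronously, so wall-clock timings taken inside the loop without synchronizing mostly measure launch/queueing and can show random spikes when the queue flushes. A small timing sketch (variable names are taken from the predict() call shown in other issues; the exact placement inside the loop is an assumption):

import time
import torch

torch.cuda.synchronize()                      # flush queued GPU work before starting the timer
t0 = time.perf_counter()
vertice_out = self.transformer_decoder(vertice_input, hidden_states,
                                        tgt_mask=tgt_mask, memory_mask=memory_mask)
torch.cuda.synchronize()                      # wait for the decoder kernels to finish
print(f"decoder step: {(time.perf_counter() - t0) * 1000:.2f} ms")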

transformer

Hello, have you considered upgrading the Transformers version? When exploring models better than wav2vec, the error 'tuple' object has no attribute 'transpose' appears.

Bugs when running the demo

The command is:
python demo.py --model_name vocaset --wav_path "demo/wav/test.wav" --dataset vocaset --vertice_dim 15069 --feature_dim 64 --period 30 --fps 30 --train_subjects "FaceTalk_170728_03272_TA FaceTalk_170904_00128_TA FaceTalk_170725_00137_TA FaceTalk_170915_00223_TA FaceTalk_170811_03274_TA FaceTalk_170913_03279_TA FaceTalk_170904_03276_TA FaceTalk_170912_03278_TA" --test_subjects "FaceTalk_170809_00138_TA FaceTalk_170731_00024_TA" --condition FaceTalk_170913_03279_TA --subject FaceTalk_170809_00138_TA

output:

Some weights of the model checkpoint at facebook/wav2vec2-base-960h were not used when initializing Wav2Vec2Model: ['lm_head.bias', 'lm_head.weight']

  • This IS expected if you are initializing Wav2Vec2Model from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
  • This IS NOT expected if you are initializing Wav2Vec2Model from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
    Some weights of Wav2Vec2Model were not initialized from the model checkpoint at facebook/wav2vec2-base-960h and are newly initialized: ['wav2vec2.masked_spec_embed']
    You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
    Traceback (most recent call last):
    File "demo.py", line 204, in
    main()
    File "demo.py", line 200, in main
    test_model(args)
    File "demo.py", line 57, in test_model
    prediction = model.predict(audio_feature, template, one_hot)
    File "/evo_860/yaobin.li/workspace/FaceFormer/faceformer.py", line 140, in predict
    hidden_states = self.audio_encoder(audio, self.dataset).last_hidden_state
    File "/home/yaobin.li/soft/miniconda3/envs/wenet/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
    File "/evo_860/yaobin.li/workspace/FaceFormer/wav2vec.py", line 135, in forward
    encoder_outputs = self.encoder(
    File "/home/yaobin.li/soft/miniconda3/envs/wenet/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
    File "/home/yaobin.li/soft/miniconda3/envs/wenet/lib/python3.8/site-packages/transformers/models/wav2vec2/modeling_wav2vec2.py", line 812, in forward
    position_embeddings = self.pos_conv_embed(hidden_states)
    File "/home/yaobin.li/soft/miniconda3/envs/wenet/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
    File "/home/yaobin.li/soft/miniconda3/envs/wenet/lib/python3.8/site-packages/transformers/models/wav2vec2/modeling_wav2vec2.py", line 446, in forward
    hidden_states = hidden_states.transpose(1, 2)
    AttributeError: 'tuple' object has no attribute 'transpose'

Pyrender GPU rendering egl

Hi, is there any way to enable EGL offscreen rendering in the demo render code? I used
os.environ['PYOPENGL_PLATFORM'] = 'egl'

but got the following error message:

NotImplementedError: Platform does not define a GLUT font retrieval function

I noticed that VOCA uses EGL for rendering, and when I used VOCA's virtual environment to run FaceFormer, it still gives the same error message.
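For reference, a minimal EGL offscreen sketch, assuming PYOPENGL_PLATFORM is read when OpenGL/pyrender are first imported; if any module imports them before the variable is set, the platform selection falls back and GLUT-related errors like the one above can show up:

import os
os.environ['PYOPENGL_PLATFORM'] = 'egl'   # must be set before pyrender / PyOpenGL are imported
import pyrender

r = pyrender.OffscreenRenderer(viewport_width=800, viewport_height=800)
# ... build the scene and call r.render(scene) as render.py does ...
r.delete()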

Training problem

The resulting video has mouth-only motion when I use the pre-trained model, but when I use a model I trained myself, the result has whole-face motion. (vocaset)

Evaluation Results

Hello, I would like to know how to obtain the results and the evaluation metrics reported in the paper. Thank you for your answer.

Missing files from voca dataset website

Hi,

Thanks for the impressive work!

I was working on the demo. However, I could not find the complete data on the VOCA website.

As mentioned in #1, "please click 'Download' and the Training Data (8 GB) can be found at the bottom of the page". But after unzipping, there was only data_verts.npy; the missing files are raw_audio_fixed.pkl, templates.pkl and subj_seq_to_idx.pkl.

Could you please help check the data completeness?

Thanks again!

bugs when running demo

Traceback (most recent call last):
File "demo.py", line 205, in
main()
File "demo.py", line 201, in main
test_model(args)
File "/data2/aotenglong/anaconda3/envs/meshtalk/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
return func(*args, **kwargs)
File "demo.py", line 44, in test_model
temp = templates[args.subject]
KeyError: 'FaceTalk_170809_00138_TA'
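The KeyError means the subject passed via --subject is not a key of the loaded templates dictionary. A quick inspection sketch (the path below is a placeholder; point it at the templates file your demo command actually loads, and the latin1 encoding is only needed if the pickle was written with Python 2):

import pickle

with open("vocaset/templates.pkl", "rb") as f:    # hypothetical path
    templates = pickle.load(f, encoding="latin1")
print(sorted(templates.keys()))                   # --subject / --condition must match one of these keys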

RuntimeError: CUDA out of memory

When I train the model with my own dataset, I get "RuntimeError: CUDA out of memory". I have tried some solutions but none of them worked; my GPU has 16 GB of memory. What can I do about this?
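A hedged sketch of one memory-saving option, mixed-precision training with torch.cuda.amp; the loop shape and the model call signature are copied from the training traceback in another issue, so they may not match a modified fork exactly. Trimming very long sequences also helps, since activation memory grows with the number of frames:

import torch

scaler = torch.cuda.amp.GradScaler()
for audio, template, vertice, one_hot in loader:   # hypothetical DataLoader yielding the same batch as main.py
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():                # run the forward pass in mixed precision
        loss = model(audio, template, vertice, one_hot, criterion, teacher_forcing=False)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()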

test_subjects

Hello, I found that test_subjects is not used. What is the reason?


faceformer export to onnx failed

I want to export vocaset.pth to an ONNX model with the following changes:

First, in demo.py I added the export code:

input_names = ['audio_feature', 'template', 'one_hot']
output_names = ['vertice_out']
torch.onnx.export(model,                        # model being run
            (audio_feature, template, one_hot),                              # model input (or a tuple for multiple inputs)
            'vocaset.onnx',                # where to save the model (can be a file or file-like object)
            export_params=True,             # store the trained parameter weights inside the model file
            opset_version=11,               # the ONNX version to export the model to
            do_constant_folding=True,       # whether to execute constant folding for optimization
            input_names = input_names,        # the model's input names
            output_names = output_names,      # the model's output names
            dynamic_axes={'audio_feature' : {1 : 'audio_len'}}
            )

Second, I rewrote the forward function to be the same as the predict function in faceformer.py.
When I run demo.py, it fails with the following message:

(function ComputeConstantFolding)
[W shape_type_inference.cpp:419] Warning: Constant folding in symbolic shape inference fails: expected scalar type Long but found Float
Exception raised from data_ptr at /pytorch/build/aten/src/ATen/core/TensorMethods.cpp:5759 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7fd3b3eafa22 in /opt/conda/envs/faceformer/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x5b (0x7fd3b3eac3db in /opt/conda/envs/faceformer/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #2: long* at::Tensor::data_ptr() const + 0xde (0x7fd22608d83e in /opt/conda/envs/faceformer/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #3: torch::jit::onnx_constant_fold::runTorchSlice_opset10(torch::jit::Node const*, std::vector<at::Tensor, std::allocatorat::Tensor >&) + 0x42e (0x7fd36aa778fe in /opt/conda/envs/faceformer/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #4: torch::jit::onnx_constant_fold::runTorchBackendForOnnx(torch::jit::Node const*, std::vector<at::Tensor, std::allocatorat::Tensor >&, int) + 0x1c5 (0x7fd36aa78c45 in /opt/conda/envs/faceformer/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #5: + 0xafd7f1 (0x7fd36aab77f1 in /opt/conda/envs/faceformer/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #6: torch::jit::ONNXShapeTypeInference(torch::jit::Node*, std::map<std::string, c10::IValue, std::lessstd::string, std::allocator<std::pair<std::string const, c10::IValue> > > const&, int) + 0x906 (0x7fd36aabc666 in /opt/conda/envs/faceformer/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #7: + 0xb05414 (0x7fd36aabf414 in /opt/conda/envs/faceformer/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #8: + 0xa7c010 (0x7fd36aa36010 in /opt/conda/envs/faceformer/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #9: + 0x500f98 (0x7fd36a4baf98 in /opt/conda/envs/faceformer/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #10: _PyMethodDef_RawFastCallKeywords + 0x254 (0x564c0405c7a4 in /opt/conda/envs/faceformer/bin/python)
frame #11: + 0x17fb40 (0x564c04092b40 in /opt/conda/envs/faceformer/bin/python)
frame #12: _PyEval_EvalFrameDefault + 0x4762 (0x564c040da702 in /opt/conda/envs/faceformer/bin/python)
frame #13: _PyEval_EvalCodeWithName + 0x255 (0x564c0402be85 in /opt/conda/envs/faceformer/bin/python)
frame #14: _PyFunction_FastCallKeywords + 0x583 (0x564c0404bcd3 in /opt/conda/envs/faceformer/bin/python)
frame #15: + 0x17f9c5 (0x564c040929c5 in /opt/conda/envs/faceformer/bin/python)
frame #16: _PyEval_EvalFrameDefault + 0x1401 (0x564c040d73a1 in /opt/conda/envs/faceformer/bin/python)
frame #17: _PyEval_EvalCodeWithName + 0x255 (0x564c0402be85 in /opt/conda/envs/faceformer/bin/python)
frame #18: _PyFunction_FastCallKeywords + 0x583 (0x564c0404bcd3 in /opt/conda/envs/faceformer/bin/python)
frame #19: + 0x17f9c5 (0x564c040929c5 in /opt/conda/envs/faceformer/bin/python)
frame #20: _PyEval_EvalFrameDefault + 0x1401 (0x564c040d73a1 in /opt/conda/envs/faceformer/bin/python)
frame #21: _PyEval_EvalCodeWithName + 0x255 (0x564c0402be85 in /opt/conda/envs/faceformer/bin/python)
frame #22: _PyFunction_FastCallKeywords + 0x583 (0x564c0404bcd3 in /opt/conda/envs/faceformer/bin/python)
frame #23: + 0x17f9c5 (0x564c040929c5 in /opt/conda/envs/faceformer/bin/python)
frame #24: _PyEval_EvalFrameDefault + 0x1401 (0x564c040d73a1 in /opt/conda/envs/faceformer/bin/python)
frame #25: _PyEval_EvalCodeWithName + 0x255 (0x564c0402be85 in /opt/conda/envs/faceformer/bin/python)
frame #26: _PyFunction_FastCallKeywords + 0x521 (0x564c0404bc71 in /opt/conda/envs/faceformer/bin/python)
frame #27: + 0x17f9c5 (0x564c040929c5 in /opt/conda/envs/faceformer/bin/python)
frame #28: _PyEval_EvalFrameDefault + 0x4762 (0x564c040da702 in /opt/conda/envs/faceformer/bin/python)
frame #29: _PyEval_EvalCodeWithName + 0x255 (0x564c0402be85 in /opt/conda/envs/faceformer/bin/python)
frame #30: _PyFunction_FastCallKeywords + 0x583 (0x564c0404bcd3 in /opt/conda/envs/faceformer/bin/python)
frame #31: + 0x17f9c5 (0x564c040929c5 in /opt/conda/envs/faceformer/bin/python)
frame #32: _PyEval_EvalFrameDefault + 0x1401 (0x564c040d73a1 in /opt/conda/envs/faceformer/bin/python)
frame #33: _PyFunction_FastCallDict + 0x118 (0x564c0404acf8 in /opt/conda/envs/faceformer/bin/python)
frame #34: _PyEval_EvalFrameDefault + 0x1cb8 (0x564c040d7c58 in /opt/conda/envs/faceformer/bin/python)
frame #35: _PyEval_EvalCodeWithName + 0xdf9 (0x564c0402ca29 in /opt/conda/envs/faceformer/bin/python)
frame #36: _PyFunction_FastCallKeywords + 0x583 (0x564c0404bcd3 in /opt/conda/envs/faceformer/bin/python)
frame #37: _PyEval_EvalFrameDefault + 0x3f5 (0x564c040d6395 in /opt/conda/envs/faceformer/bin/python)
frame #38: _PyFunction_FastCallKeywords + 0x187 (0x564c0404b8d7 in /opt/conda/envs/faceformer/bin/python)
frame #39: _PyEval_EvalFrameDefault + 0x3f5 (0x564c040d6395 in /opt/conda/envs/faceformer/bin/python)
frame #40: _PyEval_EvalCodeWithName + 0x255 (0x564c0402be85 in /opt/conda/envs/faceformer/bin/python)
frame #41: PyEval_EvalCode + 0x23 (0x564c0402d273 in /opt/conda/envs/faceformer/bin/python)
frame #42: + 0x227c82 (0x564c0413ac82 in /opt/conda/envs/faceformer/bin/python)
frame #43: PyRun_FileExFlags + 0x9e (0x564c04144e1e in /opt/conda/envs/faceformer/bin/python)
frame #44: PyRun_SimpleFileExFlags + 0x1bb (0x564c0414500b in /opt/conda/envs/faceformer/bin/python)
frame #45: + 0x2330fa (0x564c041460fa in /opt/conda/envs/faceformer/bin/python)
frame #46: _Py_UnixMain + 0x3c (0x564c0414618c in /opt/conda/envs/faceformer/bin/python)
frame #47: __libc_start_main + 0xe7 (0x7fd3fae99c87 in /lib/x86_64-linux-gnu/libc.so.6)
frame #48: + 0x1d803a (0x564c040eb03a in /opt/conda/envs/faceformer/bin/python)
(function ComputeConstantFolding)
Traceback (most recent call last):
File "/workspace/FaceFormer/demo.py", line 250, in
main()
File "/workspace/FaceFormer/demo.py", line 246, in main
test_model(args)
File "/opt/conda/envs/faceformer/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
return func(*args, **kwargs)
File "/workspace/FaceFormer/demo.py", line 99, in test_model
output_names = output_names, # the model's output names
File "/opt/conda/envs/faceformer/lib/python3.7/site-packages/torch/onnx/init.py", line 280, in export
custom_opsets, enable_onnx_checker, use_external_data_format)
File "/opt/conda/envs/faceformer/lib/python3.7/site-packages/torch/onnx/utils.py", line 94, in export
use_external_data_format=use_external_data_format)
File "/opt/conda/envs/faceformer/lib/python3.7/site-packages/torch/onnx/utils.py", line 695, in _export
dynamic_axes=dynamic_axes)
File "/opt/conda/envs/faceformer/lib/python3.7/site-packages/torch/onnx/utils.py", line 502, in _model_to_graph
_export_onnx_opset_version)
RuntimeError: expected scalar type Long but found Float

Has anyone met the same problem? I need your help, thanks!
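The trace shows the failure inside the constant-folding pass (ComputeConstantFolding on a Slice node), so one thing to try, purely as an assumption rather than a verified fix, is exporting with constant folding disabled and/or a newer torch/opset:

torch.onnx.export(
    model,
    (audio_feature, template, one_hot),
    'vocaset.onnx',
    export_params=True,
    opset_version=11,
    do_constant_folding=False,    # the dtype error above is raised inside the constant-folding pass
    input_names=input_names,
    output_names=output_names,
    dynamic_axes={'audio_feature': {1: 'audio_len'}},
)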

How to add textures to the face?

Hello, I want to render a textured animation. Can I get a UV texture mapping file that matches the output? Also, is there a way to automatically transfer the texture of a custom face to the target mesh?

Error when testing my own wav file

I tested my own wav file; the command is

python demo.py --model_name vocaset --wav_path "demo/wav/fb6d70e9cd7b2bed30fa1504330180f3.wav" --dataset vocaset --vertice_dim 15069 --feature_dim 64 --period 30  --fps 30  --train_subjects "FaceTalk_170728_03272_TA FaceTalk_170904_00128_TA FaceTalk_170725_00137_TA FaceTalk_170915_00223_TA FaceTalk_170811_03274_TA FaceTalk_170913_03279_TA FaceTalk_170904_03276_TA FaceTalk_170912_03278_TA" --test_subjects "FaceTalk_170809_00138_TA FaceTalk_170731_00024_TA" --condition FaceTalk_170913_03279_TA --subject FaceTalk_170809_00138_TA

but it outputs the error below:

Some weights of Wav2Vec2Model were not initialized from the model checkpoint at facebook/wav2vec2-base-960h and are newly initialized: ['wav2vec2.masked_spec_embed']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Traceback (most recent call last):
  File "demo.py", line 204, in <module>
    main()
  File "demo.py", line 200, in main
    test_model(args)
  File "demo.py", line 57, in test_model
    prediction = model.predict(audio_feature, template, one_hot)
  File "/evo_860/yaobin.li/workspace/FaceFormer/faceformer.py", line 157, in predict
    vertice_out = self.transformer_decoder(vertice_input, hidden_states, tgt_mask=tgt_mask, memory_mask=memory_mask)
  File "/home/yaobin.li/soft/miniconda3/envs/wenet/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/yaobin.li/soft/miniconda3/envs/wenet/lib/python3.8/site-packages/torch/nn/modules/transformer.py", line 248, in forward
    output = mod(output, memory, tgt_mask=tgt_mask,
  File "/home/yaobin.li/soft/miniconda3/envs/wenet/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/yaobin.li/soft/miniconda3/envs/wenet/lib/python3.8/site-packages/torch/nn/modules/transformer.py", line 451, in forward
    x = self.norm1(x + self._sa_block(x, tgt_mask, tgt_key_padding_mask))
  File "/home/yaobin.li/soft/miniconda3/envs/wenet/lib/python3.8/site-packages/torch/nn/modules/transformer.py", line 460, in _sa_block
    x = self.self_attn(x, x, x,
  File "/home/yaobin.li/soft/miniconda3/envs/wenet/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/yaobin.li/soft/miniconda3/envs/wenet/lib/python3.8/site-packages/torch/nn/modules/activation.py", line 1003, in forward
    attn_output, attn_output_weights = F.multi_head_attention_forward(
  File "/home/yaobin.li/soft/miniconda3/envs/wenet/lib/python3.8/site-packages/torch/nn/functional.py", line 5016, in multi_head_attention_forward
    raise RuntimeError(f"The shape of the 3D attn_mask is {attn_mask.shape}, but should be {correct_3d_size}.")
RuntimeError: The shape of the 3D attn_mask is torch.Size([4, 600, 600]), but should be (4, 601, 601)

and I checked my wav file info:

ffmpeg -i ~/wks/FaceFormer/demo/wav/fb6d70e9cd7b2bed30fa1504330180f3.wav
ffmpeg version 4.3 Copyright (c) 2000-2020 the FFmpeg developers
  built with gcc 7.3.0 (crosstool-NG 1.23.0.449-a04d0)
  configuration: --prefix=/home/yaobin.li/soft/miniconda3/envs/wenet --cc=/opt/conda/conda-bld/ffmpeg_1597178665428/_build_env/bin/x86_64-conda_cos6-linux-gnu-cc --disable-doc --disable-openssl --enable-avresample --enable-gnutls --enable-hardcoded-tables --enable-libfreetype --enable-libopenh264 --enable-pic --enable-pthreads --enable-shared --disable-static --enable-version3 --enable-zlib --enable-libmp3lame
  libavutil      56. 51.100 / 56. 51.100
  libavcodec     58. 91.100 / 58. 91.100
  libavformat    58. 45.100 / 58. 45.100
  libavdevice    58. 10.100 / 58. 10.100
  libavfilter     7. 85.100 /  7. 85.100
  libavresample   4.  0.  0 /  4.  0.  0
  libswscale      5.  7.100 /  5.  7.100
  libswresample   3.  7.100 /  3.  7.100
Guessed Channel Layout for Input Stream #0.0 : mono
Input #0, wav, from '/home/yaobin.li/wks/FaceFormer/demo/wav/fb6d70e9cd7b2bed30fa1504330180f3.wav':
  Metadata:
    encoder         : Lavf58.45.100
  Duration: 00:01:37.94, bitrate: 256 kb/s
    Stream #0:0: Audio: pcm_s16le ([1][0][0][0] / 0x0001), 16000 Hz, mono, s16, 256 kb/s

So is it an audio issue?
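The 600-vs-601 mask shape suggests the model's precomputed attention bias only covers a fixed number of output frames (600 here), so audio longer than 600 / fps seconds (about 20 s at 30 fps) overruns it. A hedged workaround sketch that splits the wav into shorter chunks before running the demo; the 600-frame limit and the 16 kHz / 30 fps figures are read off this issue, not verified against the code:

import soundfile as sf

SR = 16000                                   # demo audio is 16 kHz mono (see the ffmpeg probe above)
FPS = 30                                     # the vocaset model runs at 30 fps
MAX_FRAMES = 600                             # apparent limit implied by the attn_mask error
chunk_samples = (MAX_FRAMES // FPS) * SR     # ~20 s of audio per chunk

audio, sr = sf.read("demo/wav/fb6d70e9cd7b2bed30fa1504330180f3.wav")
assert sr == SR
for i in range(0, len(audio), chunk_samples):
    sf.write(f"demo/wav/chunk_{i // chunk_samples:03d}.wav", audio[i:i + chunk_samples], SR)
# then run demo.py on each chunk and join the rendered clips afterwards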

The resulting video has no sound

python demo.py --model_name biwi --wav_path "demo/wav/test.wav" --dataset BIWI --vertice_dim 70110 --feature_dim 128 --period 25 --fps 25 --train_subjects "F2 F3 F4 M3 M4 M5" --test_subjects "F1 F5 F6 F7 F8 M1 M2 M6" --condition M3 --subject M1
Some weights of the model checkpoint at facebook/wav2vec2-base-960h were not used when initializing Wav2Vec2Model: ['lm_head.weight', 'lm_head.bias']

  • This IS expected if you are initializing Wav2Vec2Model from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
  • This IS NOT expected if you are initializing Wav2Vec2Model from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
    Some weights of Wav2Vec2Model were not initialized from the model checkpoint at facebook/wav2vec2-base-960h and are newly initialized: ['wav2vec2.masked_spec_embed']
    You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
    /home/johnren/anaconda3/lib/python3.8/site-packages/torch/_tensor.py:575: UserWarning: floor_divide is deprecated, and will be removed in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values.
    To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor'). (Triggered internally at /pytorch/aten/src/ATen/native/BinaryOps.cpp:467.)
    return torch.floor_divide(self, other)
    rendering: test
    ffmpeg version 4.2.4-1ubuntu0.1 Copyright (c) 2000-2020 the FFmpeg developers
    built with gcc 9 (Ubuntu 9.3.0-10ubuntu2)
    configuration: --prefix=/usr --extra-version=1ubuntu0.1 --toolchain=hardened --libdir=/usr/lib/x86_64-linux-gnu --incdir=/usr/include/x86_64-linux-gnu --arch=amd64 --enable-gpl --disable-stripping --enable-avresample --disable-filter=resample --enable-avisynth --enable-gnutls --enable-ladspa --enable-libaom --enable-libass --enable-libbluray --enable-libbs2b --enable-libcaca --enable-libcdio --enable-libcodec2 --enable-libflite --enable-libfontconfig --enable-libfreetype --enable-libfribidi --enable-libgme --enable-libgsm --enable-libjack --enable-libmp3lame --enable-libmysofa --enable-libopenjpeg --enable-libopenmpt --enable-libopus --enable-libpulse --enable-librsvg --enable-librubberband --enable-libshine --enable-libsnappy --enable-libsoxr --enable-libspeex --enable-libssh --enable-libtheora --enable-libtwolame --enable-libvidstab --enable-libvorbis --enable-libvpx --enable-libwavpack --enable-libwebp --enable-libx265 --enable-libxml2 --enable-libxvid --enable-libzmq --enable-libzvbi --enable-lv2 --enable-omx --enable-openal --enable-opencl --enable-opengl --enable-sdl2 --enable-libdc1394 --enable-libdrm --enable-libiec61883 --enable-nvenc --enable-chromaprint --enable-frei0r --enable-libx264 --enable-shared
    libavutil 56. 31.100 / 56. 31.100
    libavcodec 58. 54.100 / 58. 54.100
    libavformat 58. 29.100 / 58. 29.100
    libavdevice 58. 8.100 / 58. 8.100
    libavfilter 7. 57.100 / 7. 57.100
    libavresample 4. 0. 0 / 4. 0. 0
    libswscale 5. 5.100 / 5. 5.100
    libswresample 3. 5.100 / 3. 5.100
    libpostproc 55. 5.100 / 55. 5.100
    Input #0, mov,mp4,m4a,3gp,3g2,mj2, from '/home/johnren/Desktop/FaceFormer/demo/output/tmpraurad03.mp4':
    Metadata:
    major_brand : isom
    minor_version : 512
    compatible_brands: isomiso2mp41
    encoder : Lavf58.76.100
    Duration: 00:00:11.48, start: 0.000000, bitrate: 986 kb/s
    Stream #0:0(und): Video: mpeg4 (Simple Profile) (mp4v / 0x7634706D), yuv420p, 800x800 [SAR 1:1 DAR 1:1], 985 kb/s, 25 fps, 25 tbr, 12800 tbn, 25 tbc (default)
    Metadata:
    handler_name : VideoHandler
    Please use -q:a or -q:v, -qscale is ambiguous
    File 'demo/output/test_M1_condition_M3.mp4' already exists. Overwrite ? [y/N] y
    Stream mapping:
    Stream #0:0 -> #0:0 (mpeg4 (native) -> h264 (libx264))
    Press [q] to stop, [?] for help
    [libx264 @ 0x562c16b07400] using SAR=1/1
    [libx264 @ 0x562c16b07400] using cpu capabilities: MMX2 SSE2Fast SSSE3 SSE4.2 AVX FMA3 BMI2 AVX2
    [libx264 @ 0x562c16b07400] profile High, level 3.1
    [libx264 @ 0x562c16b07400] 264 - core 155 r2917 0a84d98 - H.264/MPEG-4 AVC codec - Copyleft 2003-2018 - http://www.videolan.org/x264.html - options: cabac=1 ref=3 deblock=1:0:0 analyse=0x3:0x113 me=hex subme=7 psy=1 psy_rd=1.00:0.00 mixed_ref=1 me_range=16 chroma_me=1 trellis=1 8x8dct=1 cqm=0 deadzone=21,11 fast_pskip=1 chroma_qp_offset=-2 threads=24 lookahead_threads=4 sliced_threads=0 nr=0 decimate=1 interlaced=0 bluray_compat=0 constrained_intra=0 bframes=3 b_pyramid=2 b_adapt=1 b_bias=0 direct=1 weightb=1 open_gop=0 weightp=2 keyint=250 keyint_min=25 scenecut=40 intra_refresh=0 rc_lookahead=40 rc=crf mbtree=1 crf=23.0 qcomp=0.60 qpmin=0 qpmax=69 qpstep=4 ip_ratio=1.40 aq=1:1.00
    Output #0, mp4, to 'demo/output/test_M1_condition_M3.mp4':
    Metadata:
    major_brand : isom
    minor_version : 512
    compatible_brands: isomiso2mp41
    encoder : Lavf58.29.100
    Stream #0:0(und): Video: h264 (libx264) (avc1 / 0x31637661), yuv420p, 800x800 [SAR 1:1 DAR 1:1], q=-1--1, 25 fps, 12800 tbn, 25 tbc (default)
    Metadata:
    handler_name : VideoHandler
    encoder : Lavc58.54.100 libx264
    Side data:
    cpb: bitrate max/min/avg: 0/0/0 buffer size: 0 vbv_delay: -1
    frame= 287 fps=0.0 q=-1.0 Lsize= 338kB time=00:00:11.36 bitrate= 244.1kbits/s speed=23.1x
    video:334kB audio:0kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: 1.242142%
    [libx264 @ 0x562c16b07400] frame I:2 Avg QP:16.06 size: 7228
    [libx264 @ 0x562c16b07400] frame P:78 Avg QP:21.98 size: 2355
    [libx264 @ 0x562c16b07400] frame B:207 Avg QP:24.91 size: 693
    [libx264 @ 0x562c16b07400] consecutive B-frames: 2.8% 1.4% 5.2% 90.6%
    [libx264 @ 0x562c16b07400] mb I I16..4: 29.6% 64.6% 5.9%
    [libx264 @ 0x562c16b07400] mb P I16..4: 0.9% 2.9% 0.1% P16..4: 16.0% 3.8% 1.2% 0.0% 0.0% skip:75.2%
    [libx264 @ 0x562c16b07400] mb B I16..4: 0.2% 0.5% 0.0% B16..8: 14.1% 1.0% 0.0% direct: 0.1% skip:84.2% L0:54.7% L1:43.3% BI: 2.0%
    [libx264 @ 0x562c16b07400] 8x8 transform intra:71.8% inter:76.2%
    [libx264 @ 0x562c16b07400] coded y,uvDC,uvAC intra: 33.1% 0.5% 0.0% inter: 1.3% 0.0% 0.0%
    [libx264 @ 0x562c16b07400] i16 v,h,dc,p: 52% 12% 9% 27%
    [libx264 @ 0x562c16b07400] i8 v,h,dc,ddl,ddr,vr,hd,vl,hu: 32% 17% 41% 1% 2% 2% 1% 2% 1%
    [libx264 @ 0x562c16b07400] i4 v,h,dc,ddl,ddr,vr,hd,vl,hu: 29% 31% 18% 4% 3% 3% 5% 3% 4%
    [libx264 @ 0x562c16b07400] i8c dc,h,v,p: 95% 2% 3% 0%
    [libx264 @ 0x562c16b07400] Weighted P-Frames: Y:0.0% UV:0.0%
    [libx264 @ 0x562c16b07400] ref P L0: 57.5% 4.8% 25.2% 12.5%
    [libx264 @ 0x562c16b07400] ref B L0: 78.1% 15.8% 6.2%
    [libx264 @ 0x562c16b07400] ref B L1: 92.3% 7.7%
    [libx264 @ 0x562c16b07400] kb/s:238.06
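In the log above, ffmpeg only re-encodes the silent temporary video (note "audio:0kB" in the summary), so the final file has no audio track. A hedged sketch of muxing the original wav back in afterwards; the paths are the ones from this run, the flags are standard ffmpeg, and whether the demo script is already supposed to do this step itself is not something I have verified:

import subprocess

subprocess.run([
    "ffmpeg", "-y",
    "-i", "demo/output/test_M1_condition_M3.mp4",   # silent rendered video from the run above
    "-i", "demo/wav/test.wav",                      # original speech audio
    "-c:v", "copy", "-c:a", "aac", "-shortest",
    "demo/output/test_M1_condition_M3_with_audio.mp4",
], check=True)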

Failed rendering frame

Some weights of the model checkpoint at facebook/wav2vec2-base-960h were not used when initializing Wav2Vec2Model: ['lm_head.weight', 'lm_head.bias']

  • This IS expected if you are initializing Wav2Vec2Model from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
  • This IS NOT expected if you are initializing Wav2Vec2Model from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
    Some weights of Wav2Vec2Model were not initialized from the model checkpoint at facebook/wav2vec2-base-960h and are newly initialized: ['wav2vec2.masked_spec_embed']
    You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
    /data2/mesh/FaceFormer-main/faceformer.py:22: UserWarning: floordiv is deprecated, and its behavior will change in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor').
    bias = torch.arange(start=0, end=max_seq_len, step=period).unsqueeze(1).repeat(1,period).view(-1)//(period)
    Downloading: 100%|███████████████████████████████████████| 159/159 [00:00<00:00, 77.7kB/s]
    Downloading: 100%|████████████████████████████████████████| 291/291 [00:00<00:00, 145kB/s]
    Downloading: 100%|███████████████████████████████████████| 163/163 [00:00<00:00, 62.7kB/s]
    Downloading: 100%|█████████████████████████████████████| 85.0/85.0 [00:00<00:00, 34.9kB/s]
    rendering: test
    pyrender: Failed rendering frame
    (the line above is repeated for every frame in the sequence)
    ffmpeg version 3.4.8-0ubuntu0.2 Copyright (c) 2000-2020 the FFmpeg developers
    built with gcc 7 (Ubuntu 7.5.0-3ubuntu1~18.04)

I have encountered this problem: a black-screen video is generated. I don't know how to solve it. Please help me.
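To separate rendering problems from model problems, a minimal offscreen smoke test can help: if this tiny scene also fails or comes out black, the issue is the OSMesa/EGL setup rather than FaceFormer. A sketch, assuming trimesh and pyrender are installed and that 'osmesa' is the platform you intend to use (swap in 'egl' if that is what your machine supports):

import os
os.environ['PYOPENGL_PLATFORM'] = 'osmesa'    # must be set before pyrender is imported
import numpy as np
import trimesh
import pyrender

scene = pyrender.Scene()
scene.add(pyrender.Mesh.from_trimesh(trimesh.creation.icosphere(radius=0.2)))
camera = pyrender.PerspectiveCamera(yfov=np.pi / 3.0)
cam_pose = np.eye(4)
cam_pose[2, 3] = 1.0                          # move the camera back so the sphere is in view
scene.add(camera, pose=cam_pose)
scene.add(pyrender.DirectionalLight(intensity=3.0))

r = pyrender.OffscreenRenderer(viewport_width=256, viewport_height=256)
color, depth = r.render(scene)
r.delete()
print(color.shape)                            # (256, 256, 3); save or view this array to confirm the sphere is drawn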

FLAME

Hello, I want to add other expression parameters, and I would like to know how to use FLAME as the decoder. Thank you for your answer.

Issue while training on Vocaset

Hey @EvelynFan ,
Thanks for this awesome repo.
I'm just trying to play with training on the vocaset data, so I followed the data-preparation steps and ran training with
the following command:

python main.py --dataset vocaset --vertice_dim 15069 --feature_dim 64 --period 30 --train_subjects "FaceTalk_170728_03272_TA FaceTalk_170904_00128_TA FaceTalk_170725_00137_TA FaceTalk_170915_00223_TA FaceTalk_170811_03274_TA FaceTalk_170913_03279_TA FaceTalk_170904_03276_TA FaceTalk_170912_03278_TA" --val_subjects "FaceTalk_170811_03275_TA FaceTalk_170908_03277_TA" --test_subjects "FaceTalk_170809_00138_TA FaceTalk_170731_00024_TA"

I'm getting the following error,

Some weights of the model checkpoint at facebook/wav2vec2-base-960h were not used when initializing Wav2Vec2Model: ['lm_head.bias', 'lm_head.weight']
- This IS expected if you are initializing Wav2Vec2Model from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing Wav2Vec2Model from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of Wav2Vec2Model were not initialized from the model checkpoint at facebook/wav2vec2-base-960h and are newly initialized: ['wav2vec2.masked_spec_embed']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
model parameters:  92215197
Loading data...
100%|█████████████████████████████████████████| 475/475 [03:05<00:00,  2.55it/s]
314 40 39
  0%|                                                   | 0/314 [00:00<?, ?it/s]vertice shape: torch.Size([1, 117, 15069])
vertice_input shape: torch.Size([1, 1, 64])
vertice_input shape: torch.Size([1, 1, 64])
tgt_mask: tensor([[[0.]],

        [[0.]],

        [[0.]],

        [[0.]]], device='cuda:0')
memory_mask: tensor([[False,  True,  True,  True,  True,  True,  True,  True,  True,  True,
          True,  True,  True,  True,  True,  True,  True,  True,  True,  True,
          True,  True,  True,  True,  True,  True,  True,  True,  True,  True,
          True,  True,  True,  True,  True,  True,  True,  True,  True,  True,
          True,  True,  True,  True,  True,  True,  True,  True,  True,  True,
          True,  True,  True,  True,  True,  True,  True,  True,  True,  True,
          True,  True,  True,  True,  True,  True,  True,  True,  True,  True,
          True,  True,  True,  True,  True,  True,  True,  True,  True,  True,
          True,  True,  True,  True,  True,  True,  True,  True,  True,  True,
          True,  True,  True,  True,  True,  True,  True,  True,  True,  True,
          True,  True,  True,  True,  True,  True,  True,  True,  True,  True,
          True,  True,  True,  True,  True,  True,  True]], device='cuda:0')
  0%|                                                   | 0/314 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "main.py", line 151, in <module>
    main()
  File "main.py", line 146, in main
    model = trainer(args, dataset["train"], dataset["valid"],model, optimizer, criterion, epoch=args.max_epoch)
  File "main.py", line 34, in trainer
    loss = model(audio, template,  vertice, one_hot, criterion,teacher_forcing=False)
  File "/home/ujjawal/miniconda2/envs/caffe2/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ujjawal/my_work/object_recon/FaceFormer/faceformer.py", line 135, in forward
    vertice_out = self.transformer_decoder(vertice_input, hidden_states, tgt_mask=tgt_mask, memory_mask=memory_mask)
  File "/home/ujjawal/miniconda2/envs/caffe2/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ujjawal/miniconda2/envs/caffe2/lib/python3.7/site-packages/torch/nn/modules/transformer.py", line 233, in forward
    memory_key_padding_mask=memory_key_padding_mask)
  File "/home/ujjawal/miniconda2/envs/caffe2/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ujjawal/miniconda2/envs/caffe2/lib/python3.7/site-packages/torch/nn/modules/transformer.py", line 369, in forward
    key_padding_mask=memory_key_padding_mask)[0]
  File "/home/ujjawal/miniconda2/envs/caffe2/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ujjawal/miniconda2/envs/caffe2/lib/python3.7/site-packages/torch/nn/modules/activation.py", line 845, in forward
    attn_mask=attn_mask)
  File "/home/ujjawal/miniconda2/envs/caffe2/lib/python3.7/site-packages/torch/nn/functional.py", line 3873, in multi_head_attention_forward
    raise RuntimeError('The size of the 2D attn_mask is not correct.')

If anyone has encountered this type of error while training, please suggest how to resolve it.
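A small diagnostic sketch, under the assumption that the mismatch comes from the installed PyTorch build (the traceback paths point at an older torch than the one used in the other issues) rather than from the data: print the version and the mask shapes that reach nn.MultiheadAttention, which accepts (L, S) for a 2D mask or (num_heads * N, L, S) for a 3D one:

import torch
print("torch", torch.__version__)

# Hypothetical placement: just before the transformer_decoder call in faceformer.py.
print("tgt_mask   ", tuple(tgt_mask.shape))     # e.g. (4, 1, 1) in the log above
print("memory_mask", tuple(memory_mask.shape))  # e.g. (1, 117) in the log above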

Token Alignment in Wav2Vec2.0

Can the pretrained Wav2Vec2.0 model facebook/wav2vec2-base-960h ensure alignment between input and output tokens? I find that facebook/wav2vec2-base-960h has been fine-tuned with CTC.
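For context, a small sketch showing that the wav2vec 2.0 encoder emits one feature vector roughly every 20 ms of audio regardless of the CTC head, i.e. its outputs are frame-synchronous with the waveform rather than aligned to discrete text tokens (FaceFormer then resamples these features to the motion frame rate; that last part is my reading of the code, not verified here):

import torch
from transformers import Wav2Vec2Model

model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h")
wav = torch.zeros(1, 16000)                  # 1 second of (silent) 16 kHz audio
with torch.no_grad():
    out = model(wav).last_hidden_state
print(out.shape)                             # roughly (1, 49, 768): ~50 feature frames per second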

templates.pkl for vocaset model

Amazing work! Could you please guide me to where I can find templates.pkl?
For BIWI, it is here: FaceFormer/BIWI/templates.pkl;
however, I cannot find it for vocaset.

Looking forward to hearing from you!

Regarding BIWI dataset preprocessing

Hi,

This is amazing work, and I am trying to train FaceFormer to reproduce the results on the BIWI dataset. The repo documentation says the code for preprocessing the BIWI dataset is coming soon (marked as "to do"). Will it be available soon?

However, if it comes later, could you please briefly clarify how you preprocessed the BIWI vertices? Looking at both the original dataset (which comes in .vl files) and the result files (.npy) of the model, I see that the vertex values are not in the same range. Did you apply normalization to the dataset across the captured frames for all vertices along their respective coordinates (x, y, z)?

Any pointer would be greatly appreciated. :)

Failed rendering frame

Hi,

I am facing a "Failed rendering frame" issue with the following error:
"cannot import name 'OSMesaCreateContextAttribs' from 'OpenGL.osmesa'"

I'm on CentOS, and I have installed both OpenGL and libosmesa, but the issue still appears. Could you please help take a look at it?

Getting the same hidden-state values from Wav2Vec2 for my dataset

Hey @EvelynFan ,
I tried to train the model on my custom dataset, but Wav2Vec2 produces the same hidden-state values
for all audio frames.
Here is the reference:

torch.Size([1, 88800])
hidden_states: tensor([[[-0.0847,  0.0599, -0.0042,  ...,  0.1818,  0.0301, -0.0014],
         [-0.0847,  0.0599, -0.0042,  ...,  0.1818,  0.0301, -0.0014],
         [-0.0847,  0.0599, -0.0042,  ...,  0.1818,  0.0301, -0.0014],
         ...,
         [-0.0847,  0.0599, -0.0042,  ...,  0.1818,  0.0301, -0.0014],
         [-0.0847,  0.0599, -0.0042,  ...,  0.1818,  0.0301, -0.0014],
         [-0.0847,  0.0599, -0.0042,  ...,  0.1818,  0.0301, -0.0014]]],
       device='cuda:0')

Can you suggest some way out?
Thanks.
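One thing worth checking before suspecting the model itself: whether the audio reaching Wav2Vec2 is 16 kHz and normalized the way the pretrained checkpoint expects. A hedged sketch (the file name is a placeholder, and whether your data pipeline already does this is an assumption):

import soundfile as sf
from transformers import Wav2Vec2Processor

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
audio, sr = sf.read("my_clip.wav")                 # hypothetical file from the custom dataset
print("sample rate:", sr)                          # the checkpoint expects 16000 Hz input
inputs = processor(audio, sampling_rate=16000, return_tensors="pt").input_values
print("mean/std:", inputs.mean().item(), inputs.std().item())   # should be roughly 0 / 1 after normalization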

Why batch_size = 1?

Hi, I find that batch_size = 1 (refer to the link).
Is there any reason for this? I think training might be faster with a larger batch_size.

The output video has no sound

Hi, thanks for your great work!
My output video has no sound. Do you have any idea?

python demo.py --model_name vocaset --wav_path "demo/wav/test.wav" --dataset vocaset --vertice_dim 15069 --feature_dim 64 --period 30 --fps 30 --train_subjects "FaceTalk_170728_03272_TA FaceTalk_170904_00128_TA FaceTalk_170725_00137_TA FaceTalk_170915_00223_TA FaceTalk_170811_03274_TA FaceTalk_170913_03279_TA FaceTalk_170904_03276_TA FaceTalk_170912_03278_TA" --test_subjects "FaceTalk_170809_00138_TA FaceTalk_170731_00024_TA" --condition FaceTalk_170913_03279_TA --subject FaceTalk_170809_00138_TA

transformers problem

transformers/models/wav2vec2/modeling_wav2vec2.py", line 387, in forward
    hidden_states = hidden_states.transpose(1, 2)
AttributeError: 'tuple' object has no attribute 'transpose'

Is it possible to make the code compatible with the latest transformers?

pyrender: Failed rendering frame

I have successfully run the training code on vocaset. But for both the visualization and the demo's rendering, it keeps reporting "Failed rendering frame" when running the offscreen renderer function in render.py.
I have checked the earlier issues that mention similar output, but none of them helped in my case. Even after updating PyOpenGL to 3.1.4, I still get the "Failed rendering frame" message.
Could anybody give me some other ideas about this? I guess it may be caused by a bad OSMesa installation, but on a CentOS machine I don't know how to install OSMesa successfully without using apt-get.

Evaluation result on BIWI dataset

Hi, I cannot get the number reported in the paper for BIWI Test-A, which is 5.3742 x 10^-4 mm.

The following is what I've tried:

  1. Rotate/scale/translate the raw data to align it with the templates in your repo.
    The resulting scale factors for each subject:
    {
    'F2': 179.7675,
    'F3': 185.8210,
    'F4': 185.8799,
    'M3': 184.9965,
    'M4': 186.2286,
    'M5': 201.2294,
    'F1': 173.5965,
    'F5': 182.6764,
    'F6': 186.2587,
    'F7': 180.6849,
    'F8': 180.7115,
    'M1': 188.5588,
    'M2': 192.8390,
    'M6': 189.5069
    }

  2. Manually select vertices over the lip area using Blender:

[image: lip-area vertex selection in Blender]

  3. Run the pretrained model on Test-A and save all sequences of vertices to file. Then calculate the max L2 vertex error using:
def get_lip_maxl2_err(v_hat, v, lip_inds, scale):
    """Return the max L2 error over the lip area for each frame.
        v:     [N, V, 3] tensor of ground-truth vertices
        v_hat: [M, V, 3] tensor of predicted vertices
    """
    N = min(v.shape[0], v_hat.shape[0])                               # compare only the overlapping frames
    lip_err = (v[:N, lip_inds, :] - v_hat[:N, lip_inds, :]) * scale   # scale back to original size
    max_err, max_inds = (lip_err ** 2).mean(-1).max(-1)               # per-frame max over the lip vertices
    return max_err
  4. The rendered result looks quite good, but the lip vertex error I got is 7.0980 on the val set and 8.1337 on the test set. I would like to know whether I'm doing it correctly or have missed something.
F2_e33_faceformer.mp4
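For comparison, a hedged sketch of one common reading of the lip vertex error, the maximal L2 distance over the lip vertices in each frame averaged over all frames; whether this matches the paper's exact evaluation protocol is an assumption on my part:

import numpy as np

def lip_vertex_error(v_hat, v, lip_inds, scale=1.0):
    """v, v_hat: [N, V, 3] arrays of ground-truth / predicted vertices."""
    n = min(v.shape[0], v_hat.shape[0])
    diff = (v[:n, lip_inds, :] - v_hat[:n, lip_inds, :]) * scale
    per_vertex_l2 = np.linalg.norm(diff, axis=-1)    # [n, num_lip_vertices]
    per_frame_max = per_vertex_l2.max(axis=-1)       # max over the lip region in each frame
    return per_frame_max.mean()                      # average over frames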

Input when I don't want any template or style embedding

Hey @zlinao @EvelynFan ,
Thanks for this clear code structure.
I'm trying to train this model after removing the style embedding layer, without any template.

So, in the forward pass you're using:

template = template.unsqueeze(1) # (1,1, V*3)
obj_embedding = self.obj_vector(one_hot)#(1, feature_dim)

And when using teacher_forcing, input is created like this

if teacher_forcing:
    vertice_emb = obj_embedding.unsqueeze(1) # (1,1,feature_dim)
    style_emb = vertice_emb
    vertice_input = torch.cat((template,vertice[:,:-1]), 1) # shift one position
    vertice_input = vertice_input - template

and if not using teacher_forcing, the input is created like this:

if i==0:
    vertice_emb = obj_embedding.unsqueeze(1) # (1,1,feature_dim)
    style_emb = vertice_emb
    vertice_input = self.PPE(style_emb)
    print('vertice_input shape:',vertice_input.shape)
else:
    vertice_input = self.PPE(vertice_emb)
-------------------------------------------
------------------------------------------
vertice_emb = torch.cat((vertice_emb, new_output), 1)

So, I changed these lines to use a zero vector of the appropriate dimension as the first input:

if teacher_forcing:
    first_input = torch.FloatTensor(np.zeros([1, input_dim])).unsqueeze(1).to(device=self.device)
    vertices_input = torch.cat((first_input, vertice[:,:-1]), 1) # shift one position

Since I concatenated a zero vector, there is no need to subtract anything, as you did in your case (subtracting the template).

Again, when not using teacher forcing, the input is like this:

if i==0:
    vertices_emb = torch.FloatTensor(np.zeros([1, feature_dim])).unsqueeze(1).to(device=self.device)
    style_emb = vertices_emb
    vertices_input = self.PPE(style_emb)
else:
    vertices_input = self.PPE(vertices_emb)
-------------------------------------------
------------------------------------------
vertice_emb = torch.cat((vertice_emb, new_output), 1)

The whole flow works, but the training loss gets stuck (between 0.0035 and 0.0040) after 2-3 epochs.
Also, at prediction time, the hidden states produced are the same for every frame, and hence the animation is the same
for all frames.

Please suggest what I'm missing here or anything that should be added.

Thanks again.

Does the wav have a time limit?

Great work and really clear code. Thanks for sharing again!

I tried some short wavs and FaceFormer works well, but when I input a longer one, I met this error:

RuntimeError: The shape of the 3D attn_mask is torch.Size([4, 600, 600]), but should be (4, 601, 601).

render

Hello, sorry to disturb you again. I would like to know how to add audio output after rendering with:
python render.py --dataset vocaset --vertice_dim 15069 --fps 30
