end-to-end-lipreading's People

Contributors

mpc001

end-to-end-lipreading's Issues

How to reproduce the results for the pre-trained video-only model.

Has anyone successfully reproduced the results from the provided pretrained models?
I tried a simple script to predict the mp4s of ABOUT in the training data, but it produces wildly different labels for each sample. It seems unlikely this could achieve 83.39% accuracy, so I guess there must be something wrong in the following script:

import imageio
import numpy as np
import torch
import torch.nn.functional as F
import cv2

from model import lipreading

def load_frames(path):
    # depending on your environment, this might sometimes produce 30 frames
    cap = np.array(list(imageio.get_reader(path, 'ffmpeg')))

    # take the first 29 frames and convert them to grayscale
    images = np.stack(
        [cv2.cvtColor(cap[i], cv2.COLOR_RGB2GRAY) for i in range(29)], axis=0)
    images = images[:, 84:172, 120:208] / 255.  # crop the 88x88 mouth region
    mean = 0.413621
    std = 0.1700239
    images = (images - mean) / std              # normalize with the dataset mean/std
    images = images.reshape(1, 1, 29, 88, 88)   # (batch, channel, frames, height, width)
    images = torch.tensor(images, dtype=torch.float32)
    return images

def reload_model(model, path):
    model_dict = model.state_dict()
    pretrained_dict = torch.load(path)
    model_dict.update(pretrained_dict)
    model.load_state_dict(model_dict)
    return model


# load model
m = lipreading(mode='finetuneGRU')
m = reload_model(m, 'model.pt')
m.eval()

for j in range(2, 10):
    images = load_frames(
        'lipread_mp4/ABOUT/train/ABOUT_%05d.mp4'
        % j)

    # predict: average the per-frame logits over time, then take the most likely class
    outputs = m(images)
    outputs = torch.mean(outputs, 1)
    outputs = torch.argmax(F.softmax(outputs, dim=1), dim=1)
    print(outputs)  # print the predicted label index

What is the correct format for the data set?

What is the correct format for the dataset? I may have prepared the dataset incorrectly. When I train the audio-only model, I get the error "ValueError: num_samples should be a positive integer value, but got num_samples=0".
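For reference, a quick sanity check of the expected LRW layout (lipread_mp4/<WORD>/<split>/<WORD>_xxxxx.mp4, the pattern used elsewhere in these issues); num_samples=0 usually means the loader finds no files under the path it was given:

import glob
import os

root = 'lipread_mp4'  # adjust to your --dataset path
for split in ('train', 'val', 'test'):
    n = len(glob.glob(os.path.join(root, '*', split, '*.mp4')))
    print(split, n)  # all three counts should be non-zero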

About noisy audio files

Hello, thank you for your work. I would like to ask where you got the babble noise that is added to the audio files in your work.

About running time!

Hi, I also use a single 1080 Ti for training and don't change any parameters. Why does it take around 20 minutes per 1% of an epoch on the video-only task?
Process: [ 4716/488766 (1%)] Loss: 6.3010 Acc:0.0000 Cost time: 1142s Estimated time:118110s
@mpc001

About the ResNet!

@mpc001
Thank you for your code !
I want to run your code, and I found that you wrote the ResNet34 yourself, while PyTorch provides a pretrained ResNet34.
I want to know whether there are any differences between your ResNet34 and the one provided by PyTorch.
Thank you very much!
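One way to check this empirically is to diff the parameter names and shapes of the two models; a sketch (constructing the repo model via model.lipreading as in the script above is an assumption, and since the repo wraps its ResNet inside the full lipreading model the key names will not line up one-to-one, but extra/missing layers and shape differences still show where the definitions diverge):

import torchvision
from model import lipreading

tv_resnet = torchvision.models.resnet34()
repo_model = lipreading(mode='temporalConv')

tv_params = {k: tuple(v.shape) for k, v in tv_resnet.state_dict().items()}
repo_params = {k: tuple(v.shape) for k, v in repo_model.state_dict().items()}
for key in sorted(set(tv_params) | set(repo_params)):
    if tv_params.get(key) != repo_params.get(key):
        print(key, tv_params.get(key), repo_params.get(key))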

Estimated training time

Would it be possible for the creators to provide details about the specifications of the machine used for training, and also the estimated running time per epoch on that machine?
@mpc001

pretrained models

Hi, thanks for your work. Could you please provide the pretrained models for audio and for video? What do I need in order to get them? Just a GitHub account? I already have one.

Some questions about the architecture

Hi! Thank you for your work!
I'm trying to reproduce your network architecture using Keras. I'm not sure I understand everything, so I have a few questions:

  • What is the shape of the tensor/array containing the audio waveforms given to the audio network? Is it (number of samples, waveform vector size)? If possible, can you print a summary of the model showing the output of each layer (see the hook sketch after this list)? It would help me a lot!
  • When you tried to use MFCCs, did you put something between the MFCC features and the BGRU layers, or did you feed the features to the BGRU layers directly?
  • Did you try to truncate the LRW videos to isolate the words?

Thank you!
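Regarding the per-layer summary: here is a sketch of how to print each layer's output shape from the PyTorch model in this repo using forward hooks. The (1, 1, 29, 88, 88) dummy input follows the script in the first issue above; for the audio network a waveform of shape (batch, num_samples) is an assumption, but the same helper can be run on it.

import torch
from model import lipreading

def summarize(model, dummy_input):
    # Register a forward hook on every leaf module and print its output shape.
    hooks = []
    for name, module in model.named_modules():
        if len(list(module.children())) == 0:  # leaf modules only
            def make_hook(layer_name):
                def hook(mod, inputs, output):
                    if torch.is_tensor(output):
                        print(layer_name, tuple(output.shape))
                return hook
            hooks.append(module.register_forward_hook(make_hook(name)))
    with torch.no_grad():
        model(dummy_input)
    for h in hooks:
        h.remove()

# Video-only model; the input layout is (batch, channel, frames, height, width).
summarize(lipreading(mode='finetuneGRU').eval(), torch.zeros(1, 1, 29, 88, 88))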

Overfitting?

@mpc001
Thanks for your code. I am trying to replicate it in Keras/TensorFlow, but it suffers from serious overfitting. In the N2 phase (3D CNN + ResNet + temporal conv), I fully follow your training settings (data augmentation and ROI cropping); the training accuracy is around 88%, but the validation accuracy is only about 65%. The result is far from yours (74.6%) and I don't know why, so could you tell me more training details and tricks?
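For comparison, here is a sketch of the train-time augmentation typically used on LRW mouth crops: one shared random crop plus a horizontal flip applied to the whole clip. Whether this exactly matches the authors' settings is an assumption.

import numpy as np

def augment(frames, crop=88):
    # frames: (T, H, W) grayscale mouth crops, e.g. 29 x 96 x 96
    t, h, w = frames.shape
    y = np.random.randint(0, h - crop + 1)
    x = np.random.randint(0, w - crop + 1)
    frames = frames[:, y:y + crop, x:x + crop]  # same random crop for every frame
    if np.random.rand() < 0.5:
        frames = frames[:, :, ::-1]  # horizontal flip of the whole clip
    return frames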

shape '[-1, 29, 512]' is invalid for input of size 497664

Hello there,
After creating the files via convert_video.py, I try to run the audio-only main.py and get the following issue. It seems to be something wrong with the dimensions, but I don't know how to fix it. Any ideas would be highly appreciated.
Statistics: train: 488766, val: 25000, test: 25000

Epoch 0/29
Current Learning rate: [0.0003]
Traceback (most recent call last):
  File "/home/Documents/audiovisual/audio_only/main.py", line 256, in <module>
    main()
  File "/home/Documents/audiovisual/audio_only/main.py", line 252, in main
    test_adam(args, use_gpu)
  File "/home/Documents/audiovisual/audio_only/main.py", line 230, in test_adam
    model = train_test(model, dset_loaders, criterion, epoch, 'train', optimizer, args, logger, use_gpu, save_path)
  File "/home/evialv/Documents/audiovisual/audio_only/main.py", line 146, in train_test
    outputs = model(inputs)
  File "/home/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/Documents/audiovisual/audio_only/model.py", line 156, in forward
    x = x.view(-1, self.frameLen, self.inputDim)
RuntimeError: shape '[-1, 29, 512]' is invalid for input of size 497664
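For what it's worth, the failing size factors cleanly under the README's default batch size of 36 (an assumed reading, not a confirmed diagnosis): 497664 = 36 × 27 × 512, i.e. the frontend produced 27 feature frames per clip instead of the expected 29, which would be consistent with the audio clips being shorter than the 19456 samples the model assumes.

assert 36 * 27 * 512 == 497664  # 27 frames per clip, not the expected 29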

training audiovisual net with and without pretrained models

Hello,
I have some doubts about the process of training the audiovisual model. Currently, I am following the steps indicated in the README, going from temporalConv to the backend and later finetuneGRU to train the whole network. I have some questions about the process:

Training with pre-trained models:

  1. When training using the pre-trained models, the net got stuck at this part:
    inputs = torch.cat((audio_outputs, video_outputs), dim=2)

    It requires 3D tensors to concatenate them, but the "video_outputs" and "audio_outputs" that I get are 2D tensors such as [B, 500]. What should these 3D tensors for audio and video look like? Is there code missing, or should something transform them into the shape the net requires?

    These are the tensor shapes I get after the conv1 and conv2 backends. The tensors after backend conv1 are not the same size; should they be equal?
    initial audio tensor: [B,19456]
    audio after backend conv1: [B,2048,1]
    audio after backend conv2(final): [B,500]
    initial video tensor : [B,1,29,96,96]
    video after backend conv1: [B,1024,1]
    video after backend conv2(final): [B,500]

  2. The concat_pretrained model is actually three files, named _a.pt, _b.pt and _v.pt. Should all three be merged and used as one input, along with the audio and video models, to start the training with the temporal convolutional backend? Or which one should I consider first?

Training from scratch:

  1. I started training from scratch without the pre-trained models, but only the audiovisual net, because this is the part I am interested in. Is that the correct approach? Should I also train the audio-only and video-only models from scratch first?

  2. If I also train the audio-only and video-only models, should I use the .pt files from the last phase ("finetuneGRU") of the audio and video nets as inputs for the training of the audiovisual net? Besides that, how could I get the concat_model.pt? What should the temporalConv training for the audiovisual net look like with these inputs (audio_model.pt, video_model.pt & concat_model.pt)?

  3. In the README, step ii mentions "Throw away the temporal convolutional backend, freeze the parameters of the frontend and the ResNet and train the LSTM backend". Is this not already specified somewhere in the code?

About the concatenation between the audio and visual streams

Hi, thank you for the code. I want to ask about the concatenation part,
inputs = torch.cat((audio_outputs, video_outputs), dim=2)
outputs = concat_model(inputs)
Does this mean it is being concatenated along the feature axis? Is it because the audio and visual inputs have different numbers of timesteps?
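For reference, a minimal sketch of what dim=2 implies for the shapes (the 512-dimensional per-stream features and 29 timesteps are assumptions, taken from the other issues above):

import torch

audio_outputs = torch.randn(8, 29, 512)  # (batch, timesteps, audio features), assumed sizes
video_outputs = torch.randn(8, 29, 512)  # (batch, timesteps, video features), assumed sizes

# dim=2 concatenates along the feature axis, so both streams must already share
# the same number of timesteps; only the feature dimension grows.
inputs = torch.cat((audio_outputs, video_outputs), dim=2)
print(inputs.shape)  # torch.Size([8, 29, 1024])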

About the convert_audio.py

In "data = librosa.load(filename, sr=16000)[0]", I find the data.shape is differnt from my Mac and Ubuntu. (Maybe the reason is that the video format is ".mp4"? I get the same shape in ".mov".)

On Ubuntu, data.shape is (20480,), larger than 19456. But on Ubuntu the speed of librosa.load is too slow (maybe three days to process everything; I don't know why it is so slow), so I ran convert_audio.py on my Mac (my Mac is 100 times faster than Ubuntu at running librosa.load).

On my Mac, data.shape is (18368,), smaller than 19456, so I ran "data = librosa.load(filename, sr=17000)[0][-19456:]" on my Mac instead. But using your pretrained audio model, I get an accuracy of only 94.67% rather than 97.72%.

So I have some questions.
What shape do you get from "data = librosa.load(filename, sr=16000)[0]"?
Why is data.shape different between my Mac and Ubuntu?
Why is the speed so slow on Ubuntu?
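One way to make the loaded waveform length deterministic regardless of decoder differences between platforms is to pad or truncate to the 19456 samples the model expects; a sketch (whether the repo pads at the front or the back is an assumption):

import numpy as np
import librosa

def load_fixed(filename, target_len=19456, sr=16000):
    data = librosa.load(filename, sr=sr)[0]
    if len(data) >= target_len:
        return data[-target_len:]  # keep the last target_len samples
    return np.pad(data, (target_len - len(data), 0))  # zero-pad at the front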

Possible typo [ValueError: expected 4D input (got 2D input)]

self.bnfc = nn.BatchNorm2d(num_classes)

Here, after changing "BatchNorm2d" to "BatchNorm1d", the program runs normally.

From official documentations of PyTorch:

CLASS torch.nn.BatchNorm2d(num_features, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)[SOURCE]
Applies Batch Normalization over a 4D input (a mini-batch of 2D inputs with additional channel dimension) as described in the paper Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift .

Apparently it should have been a 1D BatchNorm. The same typo can be found in the audio-visual model.
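For completeness, the shape requirements that trigger the error can be reproduced in isolation:

import torch
import torch.nn as nn

x = torch.randn(8, 500)  # (batch, num_classes): a 2D activation
print(nn.BatchNorm1d(500)(x).shape)  # works: BatchNorm1d accepts (N, C) or (N, C, L) input
nn.BatchNorm2d(500)(x)  # raises ValueError: expected 4D input (got 2D input)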

How can we get pretrained models?

Hello.

I’m doing my research on multimodal AI, such as multimodal ASR.

How can I get some pretrained models of audiovisual net?

stage accuracy

Hello, I'm trying to train the lipreading network following your advice. Could you please tell me the accuracy the model should achieve at each of the three stages?

RuntimeError: dimension out of range

Getting this error while trying to run the code

Traceback (most recent call last):
  File "main.py", line 212, in <module>
    main()
  File "main.py", line 208, in main
    test_adam(args, use_gpu)
  File "main.py", line 181, in test_adam
    train_test(model, dset_loaders, criterion, 0, 'val', optimizer, args, logger, use_gpu, save_path)
  File "main.py", line 105, in train_test
    _, preds = torch.max(F.softmax(outputs, dim=1).data, 1)
  File "/usr/local/lib/python2.7/dist-packages/torch/nn/functional.py", line 768, in softmax
    return torch._C._nn.softmax(input, dim)
RuntimeError: dimension out of range (expected to be in range of [-1, 0], but got 1)

Any ideas on how to go about debugging this?

data preparing

Thank you for your contribution!
Could you please give more details on data preparation?
Which model/code do you use to extract the mouth ROI?
Am I right that I can only run this repository after preparing the data myself?

Thank you!
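For anyone else preparing the data, here is a sketch of one common way to crop a mouth ROI with dlib's 68-point landmarks; this is a generic approach, not necessarily the one the authors used, and the predictor file shape_predictor_68_face_landmarks.dat must be downloaded separately.

import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor('shape_predictor_68_face_landmarks.dat')

def mouth_roi(gray_frame, size=96):
    faces = detector(gray_frame, 1)
    if not faces:
        return None
    landmarks = predictor(gray_frame, faces[0])
    # landmarks 48-67 are the mouth; crop a fixed-size box around their centroid
    pts = np.array([(landmarks.part(i).x, landmarks.part(i).y) for i in range(48, 68)])
    cx, cy = pts.mean(axis=0).astype(int)
    half = size // 2
    return gray_frame[cy - half:cy + half, cx - half:cx + half]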

Error in Audiovisual

Hello,
In the audiovisual code there is a concat mode (path to a pre-trained concat model); is this for the pretrained model of the audiovisual net?
Also, two references to model are declared in the code (one in main and one in concat_model) that are throwing an error. Can we return the concat model instead, since no model is declared in these functions?
Thanks in advance

pretrained model

Hi, thank you for your work. Could you provide the pretrained model?

about training the audio-only model

Hello,

I scanned the code and the paper, and I can't find the code that adds babble noise at different levels to the audio clips. Could you please tell me how to realize it?

Thank you !
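In case it helps while waiting for an answer, here is a sketch of the usual way noise is mixed in at a given SNR level; the babble noise source itself is not part of this repo, and clean and noise are assumed to be equal-length 16 kHz float waveforms.

import numpy as np

def add_noise(clean, noise, snr_db):
    # Scale the noise so the speech-to-noise power ratio matches snr_db, then mix.
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2)
    scale = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return clean + scale * noise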

Dataloader Error while running main.py

When I run (as suggested in the README): CUDA_VISIBLE_DEVICES='0' python main.py --path '' --dataset /mnt/data/rajivratn/lrw/lipread_mp4 --mode 'temporalConv' --every-frame False --batch-size 36 --lr 3e-4 --epochs 30 --test False

I am getting the following error:
RuntimeError: invalid argument 1: must be strictly positive at /pytorch/torch/lib/TH/generic/THTensorMath.c:2247

Question about temporal conv training

Thank you for your code.
I have a question about training.
In the temporal conv step, did you use a 0.0003 learning rate, no weight decay, and the learning-rate scheduler in this code?
I ask because after only 11 epochs it overfits and the accuracy is too low.
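For context, a sketch of a setup consistent with the flags in the README command quoted above (Adam, lr 3e-4, 30 epochs). The exact weight decay and scheduler the authors used are precisely what this issue asks about, so the values below are placeholders rather than a confirmed answer, and model, train_one_epoch and evaluate are hypothetical.

import torch.optim as optim

optimizer = optim.Adam(model.parameters(), lr=3e-4, weight_decay=0.0)
scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='max', factor=0.5, patience=2)

for epoch in range(30):
    train_one_epoch(model, optimizer)   # hypothetical training loop
    val_acc = evaluate(model)           # hypothetical validation pass
    scheduler.step(val_acc)             # reduce the lr when validation accuracy plateaus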
