end-to-end-lipreading's People

Contributors

mpc001

end-to-end-lipreading's Issues

How to reproduce the results for the pre-trained video-only model.

Has anyone successfully reproduced the results from the provided pretrained models?
I tried a simple script to predict the mp4s of ABOUT in the training data, but it produces wildly different labels for each sample. It seems unlikely this could achieve 83.39% accuracy, so I guess there must be something wrong in the following script:

import imageio
import numpy as np
import torch
import torch.nn.functional as F
import cv2

from model import lipreading

def load_frames(path):
    # depending on your environment, this might sometimes produce 30 frames
    cap = np.array(list(imageio.get_reader(path, 'ffmpeg')))

    # take the first 29 frames and convert them to grayscale
    images = np.stack(
        [cv2.cvtColor(cap[i], cv2.COLOR_RGB2GRAY) for i in range(29)], axis=0)
    images = images[:, 84:172, 120:208] / 255.  # crop the 88x88 mouth region
    mean = 0.413621
    std = 0.1700239
    images = (images - mean) / std              # normalize with the dataset mean/std
    images = images.reshape(1, 1, 29, 88, 88)   # (batch, channel, frames, height, width)
    images = torch.tensor(images, dtype=torch.float32)
    return images

def reload_model(model, path):
    model_dict = model.state_dict()
    pretrained_dict = torch.load(path)
    model_dict.update(pretrained_dict)
    model.load_state_dict(model_dict)
    return model


# load model
m = lipreading(mode='finetuneGRU')
m = reload_model(m, 'model.pt')
m.eval()

for j in range(2, 10):
    images = load_frames(
        'lipread_mp4/ABOUT/train/ABOUT_%05d.mp4'
        % j)

    # predict: average the per-frame logits over time, then take the most likely class
    outputs = m(images)
    outputs = torch.mean(outputs, 1)
    outputs = torch.argmax(F.softmax(outputs, dim=1), dim=1)
    print(outputs)  # print the predicted label index

What is the correct format for the data set?

What is the correct format for the dataset? I may have prepared the dataset incorrectly. When I train the audio-only model, I get the error "ValueError: num_samples should be a positive integer value, but got num_samples=0".
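For reference, a quick sanity check of the expected LRW layout (lipread_mp4/<WORD>/<split>/<WORD>_xxxxx.mp4, the pattern used elsewhere in these issues); num_samples=0 usually means the loader finds no files under the path it was given:

import glob
import os

root = 'lipread_mp4'  # adjust to your --dataset path
for split in ('train', 'val', 'test'):
    n = len(glob.glob(os.path.join(root, '*', split, '*.mp4')))
    print(split, n)  # all three counts should be non-zero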

About noisy audio files

Hello, thank you for your work. I would like to ask where you got the babble noise that is added to the audio files in your work.

About running time!

Hi, I also use a single 1080 Ti for training and don't change any parameters. Why does it take around 20 minutes per 1% of an epoch on the video-only task?
Process: [ 4716/488766 (1%)] Loss: 6.3010 Acc:0.0000 Cost time: 1142s Estimated time:118110s
@mpc001

About the ResNet!

@mpc001
Thank you for your code !
I want to run your code, and I found that you wrote the ResNet34 yourself, while PyTorch provides a pretrained ResNet34.
I want to know whether there are any differences between your ResNet34 and the one provided by PyTorch.
Thank you very much!
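One way to check this empirically is to diff the parameter names and shapes of the two models; a sketch (constructing the repo model via model.lipreading as in the script above is an assumption, and since the repo wraps its ResNet inside the full lipreading model the key names will not line up one-to-one, but extra/missing layers and shape differences still show where the definitions diverge):

import torchvision
from model import lipreading

tv_resnet = torchvision.models.resnet34()
repo_model = lipreading(mode='temporalConv')

tv_params = {k: tuple(v.shape) for k, v in tv_resnet.state_dict().items()}
repo_params = {k: tuple(v.shape) for k, v in repo_model.state_dict().items()}
for key in sorted(set(tv_params) | set(repo_params)):
    if tv_params.get(key) != repo_params.get(key):
        print(key, tv_params.get(key), repo_params.get(key))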

Estimated training time

Would it be possible for the creators to provide details about the specifications of the machine used for training, and also the estimated running time per epoch on that machine?
@mpc001

pretrained models

Hi, thanks for your work. Could you please provide the pretrained models for audio and for video? What do I need in order to get them? Just a GitHub account? I already have one.

Some questions about the architecture

Hi! Thank you for your work!
I'm trying to reproduce your network architecture using Keras. I'm not sure I understand everything, so I have a few questions:

  • What is the shape of the tensor/array containing the audio waveforms given to the audio network? Is it (number of samples, waveform vector size)? If possible, can you print a summary of the model showing the output of each layer (see the hook sketch after this list)? It would help me a lot!
  • When you tried to use MFCCs, did you put something between the MFCC features and the BGRU layers, or did you feed the features to the BGRU layers directly?
  • Did you try to truncate the LRW videos to isolate the words?

Thank you!
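Regarding the per-layer summary: here is a sketch of how to print each layer's output shape from the PyTorch model in this repo using forward hooks. The (1, 1, 29, 88, 88) dummy input follows the script in the first issue above; for the audio network a waveform of shape (batch, num_samples) is an assumption, but the same helper can be run on it.

import torch
from model import lipreading

def summarize(model, dummy_input):
    # Register a forward hook on every leaf module and print its output shape.
    hooks = []
    for name, module in model.named_modules():
        if len(list(module.children())) == 0:  # leaf modules only
            def make_hook(layer_name):
                def hook(mod, inputs, output):
                    if torch.is_tensor(output):
                        print(layer_name, tuple(output.shape))
                return hook
            hooks.append(module.register_forward_hook(make_hook(name)))
    with torch.no_grad():
        model(dummy_input)
    for h in hooks:
        h.remove()

# Video-only model; the input layout is (batch, channel, frames, height, width).
summarize(lipreading(mode='finetuneGRU').eval(), torch.zeros(1, 1, 29, 88, 88))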

Overfitting?

@mpc001
Thanks for your code. I am trying to replicate it in Keras/TensorFlow, but it suffers from serious overfitting. In the N2 phase (3D CNN + ResNet + temporal conv), I fully follow your training settings (data augmentation and ROI cropping); the training accuracy is around 88%, but the validation accuracy is only about 65%. The result is far from yours (74.6%) and I don't know why, so could you tell me more training details and tricks?
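For comparison, here is a sketch of the train-time augmentation typically used on LRW mouth crops: one shared random crop plus a horizontal flip applied to the whole clip. Whether this exactly matches the authors' settings is an assumption.

import numpy as np

def augment(frames, crop=88):
    # frames: (T, H, W) grayscale mouth crops, e.g. 29 x 96 x 96
    t, h, w = frames.shape
    y = np.random.randint(0, h - crop + 1)
    x = np.random.randint(0, w - crop + 1)
    frames = frames[:, y:y + crop, x:x + crop]  # same random crop for every frame
    if np.random.rand() < 0.5:
        frames = frames[:, :, ::-1]  # horizontal flip of the whole clip
    return frames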

shape '[-1, 29, 512]' is invalid for input of size 497664

Hello there,
After creating the files via convert_video.py, I try to run the audio-only main.py and get the following issue. It seems to be something wrong with the dimensions, but I don't know how to fix it. Any ideas would be highly appreciated.
Statistics: train: 488766, val: 25000, test: 25000

Epoch 0/29
Current Learning rate: [0.0003]
Traceback (most recent call last):
  File "/home/Documents/audiovisual/audio_only/main.py", line 256, in <module>
    main()
  File "/home/Documents/audiovisual/audio_only/main.py", line 252, in main
    test_adam(args, use_gpu)
  File "/home/Documents/audiovisual/audio_only/main.py", line 230, in test_adam
    model = train_test(model, dset_loaders, criterion, epoch, 'train', optimizer, args, logger, use_gpu, save_path)
  File "/home/evialv/Documents/audiovisual/audio_only/main.py", line 146, in train_test
    outputs = model(inputs)
  File "/home/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/Documents/audiovisual/audio_only/model.py", line 156, in forward
    x = x.view(-1, self.frameLen, self.inputDim)
RuntimeError: shape '[-1, 29, 512]' is invalid for input of size 497664
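For what it's worth, the failing size factors cleanly under the README's default batch size of 36 (an assumed reading, not a confirmed diagnosis): 497664 = 36 × 27 × 512, i.e. the frontend produced 27 feature frames per clip instead of the expected 29, which would be consistent with the audio clips being shorter than the 19456 samples the model assumes.

assert 36 * 27 * 512 == 497664  # 27 frames per clip, not the expected 29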

training audiovisual net with and without pretrained models

Hello,
I have some doubts about the process of training the audiovisual model. Currently, I am following the steps indicated in the README, going from temporalConv to the backend and later finetuneGRU to train the whole network. I have some questions about the process:

Training with pre-trained models:

  1. When training using the pre-trained models, the net got stuck at this part:
    inputs = torch.cat((audio_outputs, video_outputs), dim=2)

    It requires 3D tensors to concatenate them, but the "video_outputs" and "audio_outputs" that I get are 2D tensors such as [B, 500]. What should these 3D tensors for audio and video look like? Is there code missing, or should something transform them into the shape the net requires?

    These are the tensor shapes I get after the conv1 and conv2 backends. The tensors after backend conv1 are not the same size; should they be equal?
    initial audio tensor: [B,19456]
    audio after backend conv1: [B,2048,1]
    audio after backend conv2(final): [B,500]
    initial video tensor : [B,1,29,96,96]
    video after backend conv1: [B,1024,1]
    video after backend conv2(final): [B,500]

  2. The concat_pretrained model is actually three files, named _a.pt, _b.pt and _v.pt. Should all three be merged and used as one input, along with the audio and video models, to start the training with the temporal convolutional backend? Or which one should I consider first?

Training from scratch:

  1. I started training from scratch without the pre-trained models, but only the audiovisual net, because this is the part I am interested in. Is that the correct approach? Should I also train the audio-only and video-only models from scratch first?

  2. If I also train the audio-only and video-only models, should I use the .pt files from the last phase ("finetuneGRU") of the audio and video nets as inputs for the training of the audiovisual net? Besides that, how could I get the concat_model.pt? What should the temporalConv training for the audiovisual net look like with these inputs (audio_model.pt, video_model.pt & concat_model.pt)?

  3. In the README, step ii mentions "Throw away the temporal convolutional backend, freeze the parameters of the frontend and the ResNet and train the LSTM backend". Is this not already specified somewhere in the code?

About the concatenation between the audio and visual streams

Hi, thank you for the code. I want to ask about the concatenation part,
inputs = torch.cat((audio_outputs, video_outputs), dim=2)
outputs = concat_model(inputs)
Does this mean it is being concatenated along the feature axis? Is it because the audio and visual inputs have different numbers of timesteps?
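For reference, a minimal sketch of what dim=2 implies for the shapes (the 512-dimensional per-stream features and 29 timesteps are assumptions, taken from the other issues above):

import torch

audio_outputs = torch.randn(8, 29, 512)  # (batch, timesteps, audio features), assumed sizes
video_outputs = torch.randn(8, 29, 512)  # (batch, timesteps, video features), assumed sizes

# dim=2 concatenates along the feature axis, so both streams must already share
# the same number of timesteps; only the feature dimension grows.
inputs = torch.cat((audio_outputs, video_outputs), dim=2)
print(inputs.shape)  # torch.Size([8, 29, 1024])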

About the convert_audio.py

In "data = librosa.load(filename, sr=16000)[0]", I find the data.shape is differnt from my Mac and Ubuntu. (Maybe the reason is that the video format is ".mp4"? I get the same shape in ".mov".)

On Ubuntu, data.shape is (20480,), larger than 19456. But on Ubuntu the speed of librosa.load is too slow (maybe three days to process everything; I don't know why it is so slow), so I ran convert_audio.py on my Mac (my Mac is 100 times faster than Ubuntu at running librosa.load).

On my Mac, data.shape is (18368,), smaller than 19456, so I ran "data = librosa.load(filename, sr=17000)[0][-19456:]" on my Mac instead. But using your pretrained audio model, I get an accuracy of only 94.67% rather than 97.72%.

So I have some questions.
What shape do you get from "data = librosa.load(filename, sr=16000)[0]"?
Why is data.shape different between my Mac and Ubuntu?
Why is the speed so slow on Ubuntu?
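One way to make the loaded waveform length deterministic regardless of decoder differences between platforms is to pad or truncate to the 19456 samples the model expects; a sketch (whether the repo pads at the front or the back is an assumption):

import numpy as np
import librosa

def load_fixed(filename, target_len=19456, sr=16000):
    data = librosa.load(filename, sr=sr)[0]
    if len(data) >= target_len:
        return data[-target_len:]  # keep the last target_len samples
    return np.pad(data, (target_len - len(data), 0))  # zero-pad at the front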

Possible typo [ValueError: expected 4D input (got 2D input)]

self.bnfc = nn.BatchNorm2d(num_classes)

Here, after changing "BatchNorm2d" to "BatchNorm1d", the program runs normally.

From official documentations of PyTorch:

CLASS torch.nn.BatchNorm2d(num_features, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)[SOURCE]
Applies Batch Normalization over a 4D input (a mini-batch of 2D inputs with additional channel dimension) as described in the paper Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift .

Apparently it should have been a 1D BatchNorm. The same typo can be found in the audio-visual model.
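For completeness, the shape requirements that trigger the error can be reproduced in isolation:

import torch
import torch.nn as nn

x = torch.randn(8, 500)  # (batch, num_classes): a 2D activation
print(nn.BatchNorm1d(500)(x).shape)  # works: BatchNorm1d accepts (N, C) or (N, C, L) input
nn.BatchNorm2d(500)(x)  # raises ValueError: expected 4D input (got 2D input)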

How can we get pretrained models?

Hello.

I’m doing my research on multimodal AI, such as multimodal ASR.

How can I get some pretrained models of audiovisual net?

stage accuracy

Hello, I'm trying to train the lipreading network following your advice. Could you please tell me the accuracy the model should achieve at each of the three stages?

RuntimeError: dimension out of range

Getting this error while trying to run the code

Traceback (most recent call last):
  File "main.py", line 212, in <module>
    main()
  File "main.py", line 208, in main
    test_adam(args, use_gpu)
  File "main.py", line 181, in test_adam
    train_test(model, dset_loaders, criterion, 0, 'val', optimizer, args, logger, use_gpu, save_path)
  File "main.py", line 105, in train_test
    _, preds = torch.max(F.softmax(outputs, dim=1).data, 1)
  File "/usr/local/lib/python2.7/dist-packages/torch/nn/functional.py", line 768, in softmax
    return torch._C._nn.softmax(input, dim)
RuntimeError: dimension out of range (expected to be in range of [-1, 0], but got 1)

Any ideas on how to go about debugging this?

data preparing

Thank you for your contribution!
Could you please give more details on data preparation?
Which model/code do you use to extract the mouth ROI?
Am I right that I can only run this repository after preparing the data myself?

Thank you!
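For anyone else preparing the data, here is a sketch of one common way to crop a mouth ROI with dlib's 68-point landmarks; this is a generic approach, not necessarily the one the authors used, and the predictor file shape_predictor_68_face_landmarks.dat must be downloaded separately.

import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor('shape_predictor_68_face_landmarks.dat')

def mouth_roi(gray_frame, size=96):
    faces = detector(gray_frame, 1)
    if not faces:
        return None
    landmarks = predictor(gray_frame, faces[0])
    # landmarks 48-67 are the mouth; crop a fixed-size box around their centroid
    pts = np.array([(landmarks.part(i).x, landmarks.part(i).y) for i in range(48, 68)])
    cx, cy = pts.mean(axis=0).astype(int)
    half = size // 2
    return gray_frame[cy - half:cy + half, cx - half:cx + half]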

Error in Audiovisual

Hello,
In the audiovisual code there is a concat mode (path to a pre-trained concat model); is this for the pretrained model of the audiovisual net?
Also, two references to model are declared in the code (one in main and one in concat_model) that are throwing an error. Can we return the concat model instead, since no model is declared in these functions?
Thanks in advance

pretrained model

Hi, thank you for your work. Could you provide the pretrained model?

about training the audio-only model

Hello,

I scanned the code and the paper, and I can't find the code that adds babble noise at different levels to the audio clips. Could you please tell me how to realize it?

Thank you !
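In case it helps while waiting for an answer, here is a sketch of the usual way noise is mixed in at a given SNR level; the babble noise source itself is not part of this repo, and clean and noise are assumed to be equal-length 16 kHz float waveforms.

import numpy as np

def add_noise(clean, noise, snr_db):
    # Scale the noise so the speech-to-noise power ratio matches snr_db, then mix.
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2)
    scale = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return clean + scale * noise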

Dataloader Error while running main.py

When I run (as suggested in the README): CUDA_VISIBLE_DEVICES='0' python main.py --path '' --dataset /mnt/data/rajivratn/lrw/lipread_mp4 --mode 'temporalConv' --every-frame False --batch-size 36 --lr 3e-4 --epochs 30 --test False

I am getting the following error:
RuntimeError: invalid argument 1: must be strictly positive at /pytorch/torch/lib/TH/generic/THTensorMath.c:2247

Question about temporal conv training

Thank you for your code.
I have a question about training.
In the temporal conv step, did you use a 0.0003 learning rate, no weight decay, and the learning-rate scheduler in this code?
I ask because after only 11 epochs it overfits and the accuracy is too low.
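For context, a sketch of a setup consistent with the flags in the README command quoted above (Adam, lr 3e-4, 30 epochs). The exact weight decay and scheduler the authors used are precisely what this issue asks about, so the values below are placeholders rather than a confirmed answer, and model, train_one_epoch and evaluate are hypothetical.

import torch.optim as optim

optimizer = optim.Adam(model.parameters(), lr=3e-4, weight_decay=0.0)
scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='max', factor=0.5, patience=2)

for epoch in range(30):
    train_one_epoch(model, optimizer)   # hypothetical training loop
    val_acc = evaluate(model)           # hypothetical validation pass
    scheduler.step(val_acc)             # reduce the lr when validation accuracy plateaus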
