
Comments (5)

chevalierNoir commented on August 19, 2024

Hi,

In LRS3 pretraining, the model is trained for 400K steps. The number of finetuning steps depends on the task and is 30K for VSR and AVSR with 30 hours of labeled data. You can find these numbers in the optimization.max_update field of the corresponding configuration files. If you use 8 GPUs, you can simulate the 32-GPU pretraining by appending optimization.update_freq=[4], which makes the effective batch size 4 times larger.
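As a rough sketch of that arithmetic (the per-GPU batch value below is an arbitrary placeholder, not a value from the configs; only the ratio matters):

```python
# Sketch: why optimization.update_freq=[4] on 8 GPUs approximates
# the 32-GPU pretraining setup. Per-GPU batch size is a placeholder.
per_gpu_batch = 1000                 # e.g. frames per GPU per step (illustrative)
n_gpus, update_freq = 8, 4           # 8 GPUs, gradients accumulated over 4 steps

effective_batch = per_gpu_batch * n_gpus * update_freq
reference_batch = per_gpu_batch * 32 * 1   # 32 GPUs, no accumulation
assert effective_batch == reference_batch  # same effective batch per update

# optimization.max_update is unchanged (400K for pretraining, 30K for the
# 30h VSR/AVSR fine-tuning); each update just consumes 4x more data.
```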


chevalierNoir commented on August 19, 2024

Hi,

Regarding the inferior performance of AV-HuBERT compared to audio-HuBERT on ASR in the clean setting: we also noticed this phenomenon in our paper (last paragraph of Section 4.5) and attribute it to the hyperparameters being selected based on lip-reading rather than ASR. For Table 4, we trained an audio-HuBERT for one iteration using the clusters from the last-iteration AV-HuBERT.

  1. In our previous experiments, we noticed that setting different modality dropout values in the first 4 iterations does not have as large an impact as it does on the last iteration. We will re-check those values in our original configurations for the first 4 iterations on LRS3.
  2. mask_prob_image in the config file is the probability of each frame being masked, which equals $p\times l$ in the paper. Note that $p$ in the paper is the probability of one frame being selected as the start of a masked span (see the sketch below).
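As a rough illustration of that relation (the numeric values below are placeholders rather than the repo's actual config defaults, and overlap between masked spans is ignored):

```python
# Sketch: relating mask_prob_image (config) to p and l (paper).
# Values are illustrative placeholders, not the actual AV-HuBERT defaults.
mask_prob_image = 0.30   # probability of each frame being masked (config field)
mask_length = 5          # l: length of each masked span, in frames

# p: probability of a frame being selected as the start of a masked span.
# Ignoring span overlaps, the expected masked fraction is p * l.
p = mask_prob_image / mask_length
print(f"p = {p:.3f}, p * l = {p * mask_length:.2f}")  # p * l equals mask_prob_image
```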


li563042811 commented on August 19, 2024

Thank you for your reply. I think the reason the AV-HuBERT I trained didn't surpass the A-HuBERT in your paper may be that I used 8 A100 GPUs for pretraining and finetuning without changing batch_size or max_tokens.
In your paper you used 32 GPUs, so my training may have run for fewer epochs than yours. Could you tell me how many epochs your AV-HuBERT pretraining and finetuning run for?


li563042811 commented on August 19, 2024

Hi,
I pretrained AV-HuBERT on LRS3 with optimization.update_freq=[4] appended, still using 8 GPUs. The epoch count changed from 69 to 275. The 1st iteration took 102 hours, and after finetuning on the 30h data the clean AV WER came to 12.56, compared to 15.88 with optimization.update_freq=[1].
I didn't change any other hyperparameters.
The pretrain config file is avhubert/conf/pretrain/base_lrs3_iter1.yaml with optimization.update_freq=[4] appended.
The finetune config file is avhubert/conf/av-finetune/base_noise_pt_noise_ft_30h.yaml with optimization.update_freq=[4] appended.
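The epoch count is consistent with that change: with optimization.max_update fixed, each update consumes 4x more data, so roughly 4x more epochs are traversed. A quick sanity check using the numbers above:

```python
# Sanity check: a fixed number of updates with a 4x larger effective batch
# should traverse the dataset about 4x more times.
epochs_with_update_freq_1 = 69
update_freq = 4
print(epochs_with_update_freq_1 * update_freq)  # 276, close to the reported 275
```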

I have two questions:

  1. In your experiments, does the improvement seen in the 1st iteration also hold in the last iteration, with the same relative improvement?
  2. Could you give the WER of iterations 1-5 of your AV-HuBERT model pretrained on the 433h LRS3 data and fine-tuned on the 30h data?


chevalierNoir commented on August 19, 2024

Hi,

  1. We always used 32 GPUs for pretraining in each iteration, so we don't know the exact relative improvement of 32 GPUs over 8 GPUs, but my guess is that the gain will hold. The gain from using 32 GPUs is also observed in the original HuBERT paper (Figure 3).
  2. For AVSR/ASR, we only train the model for one iteration (the last iteration). For VSR (lip reading), we mostly use CTC fine-tuning for the first 4 iterations and use seq2seq fine-tuning only for the last iteration (where it outperforms CTC). The CTC numbers for each iteration are in the paper (first row of Table 2).

Note that for fine-tuning we always use 8 GPUs; we haven't tried 32 GPUs there.

