
yuangongnd / ast


Code for the Interspeech 2021 paper "AST: Audio Spectrogram Transformer".

License: BSD 3-Clause "New" or "Revised" License

Python 10.73% Shell 0.92% Jupyter Notebook 88.35%
pytorch audio-classification deep-learning audio representation-learning keyword-spotting speech-commands speech-classification

ast's People

Contributors

jeffc0628, yuangongnd


ast's Issues

training with custom data

Thanks a lot for your amazing work and for sharing the code. I have a small question: I have a video dataset and want to use it by extracting the audio from the videos. Do you recommend any recipe for processing the audio extracted from the videos, or would any raw MP3 work?

Thanks again
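A minimal sketch of one possible preprocessing step, assuming ffmpeg is installed on the system (it is not part of this repo): extract 16 kHz, single-channel WAVs from the videos before building the JSON datafiles. The folder names are placeholders.

import subprocess
from pathlib import Path

video_dir = Path("videos")       # placeholder: folder with the source videos
audio_dir = Path("audio_16k")    # placeholder: output folder for the extracted audio
audio_dir.mkdir(exist_ok=True)

for video in video_dir.glob("*.mp4"):
    wav_path = audio_dir / (video.stem + ".wav")
    # -ac 1 downmixes to mono, -ar 16000 resamples to 16 kHz
    subprocess.run(
        ["ffmpeg", "-y", "-i", str(video), "-ac", "1", "-ar", "16000", str(wav_path)],
        check=True,
    )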

Binarizing output for each audio label in AudioSet(527 classes)

Hi Yuan,

First of all, I would like to say a huge thanks for your great work!

It would be great if you could share more details about the output values in the README.md.

I ran demo.py and got the linear output values (positive and negative). What is the best way to binarize those output values (0: the audio label is absent, 1: the audio label is present)?

Anar Sultani
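One common way to binarize multi-label outputs like these (just a sketch, not necessarily the authors' recommendation): apply a sigmoid so each logit becomes an independent per-label probability, then threshold each class. The threshold below is illustrative and would normally be tuned per label on a validation set.

import torch

logits = torch.randn(1, 527)            # placeholder for the raw [batch, 527] output of demo.py
probs = torch.sigmoid(logits)           # per-label probabilities in [0, 1]
threshold = 0.5                         # illustrative; tune per label on held-out data
binary = (probs >= threshold).int()     # 1 = label present, 0 = label absent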

The accuracy following esc50 Recipe is very low

There must be some mistake on my side; can someone help me identify it? (Screenshot omitted.)
This is how I'm training:
!python -W ignore /content/ast/src/run.py --model ast --dataset esc50 \
  --data-train /content/data/datafiles/esc_train_data_1.json --data-val /content/data/datafiles/esc_eval_data_1.json --exp-dir /content/expdir/fold1 \
  --label-csv /content/ast/egs/esc50/data/esc_class_labels_indices.csv --n_class 50 \
  --lr 1e-5 --n-epochs 25 --batch-size 12 --save_model False \
  --freqm 24 --timem 96 --mixup 8 --bal None \
  --tstride 10 --fstride 10 --imagenet_pretrain True --audioset_pretrain True

Inference on CPU ?

Hello,
I tried to run inference on the CPU, but this error arose:

RuntimeError: module must have its parameters and buffers on device cuda:0

What steps should I take?
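A rough sketch of one way to load a checkpoint for CPU-only inference: pass map_location='cpu', strip the DataParallel "module." prefix instead of wrapping the model, and keep the model and inputs on the CPU. The checkpoint path and model parameters below are placeholders.

import torch
from src.models import ASTModel  # adjust the import to your checkout layout

device = torch.device("cpu")
sd = torch.load("path/to/checkpoint.pth", map_location=device)  # placeholder path

# Checkpoints saved through DataParallel carry a "module." prefix on every key;
# strip it so the weights load into a plain (non-DataParallel) model on the CPU.
sd = {k.replace("module.", "", 1): v for k, v in sd.items()}

model = ASTModel(label_dim=527, input_tdim=1024,
                 imagenet_pretrain=False, audioset_pretrain=False)
model.load_state_dict(sd)
model.to(device)
model.eval()

with torch.no_grad():
    fbank = torch.zeros(1, 1024, 128)   # placeholder [batch, time, mel] input
    output = torch.sigmoid(model(fbank))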

Wrong .pth name?

Hi, thanks for the awesome contribution!
I have prepared my data using your pipeline. When running the experiments, I get:

ImageNet pretraining: True, AudioSet pretraining: True
Traceback (most recent call last):
  File "../../src/run.py", line 99, in <module>
    audioset_pretrain=args.audioset_pretrain, model_size='base384')
  File "/home/user/PycharmProjects/ast/src/models/ast_models.py", line 143, in __init__
    sd = torch.load('../../pretrained_models/ast_audioset.pth', map_location=device)
  File "/home/user/anaconda3/envs/ast/lib/python3.7/site-packages/torch/serialization.py", line 579, in load
    with _open_file_like(f, 'rb') as opened_file:
  File "/home/user/anaconda3/envs/ast/lib/python3.7/site-packages/torch/serialization.py", line 230, in _open_file_like
    return _open_file(name_or_buffer, mode)
  File "/home/user/anaconda3/envs/ast/lib/python3.7/site-packages/torch/serialization.py", line 211, in __init__
    super(_open_file, self).__init__(open(name, mode))
FileNotFoundError: [Errno 2] No such file or directory: '../../pretrained_models/ast_audioset.pth'

Looking at line 143 of ast_models.py, the filename being loaded is different from the name of the downloaded model:

wget.download(audioset_mdl_url, out='../../pretrained_models/audioset_10_10_0.4593.pth')
sd = torch.load('../../pretrained_models/ast_audioset.pth', map_location=device)

Changing the filename in the torch.load call from "ast_audioset.pth" to "audioset_10_10_0.4593.pth" fixed the missing-file error.
Posting this in case someone needs it.

Process Terminated during Finetuning

I was trying to use the AudioSet pretrained model for finetuning on a very small dataset, just to test it out. At first the process would simply be killed with "Out of memory" in the log, but when I moved to a larger system, the process ran for longer before returning this error:

Traceback (most recent call last):
  File "/home/ubuntu/ast_conv/venvast/lib/python3.7/site-packages/torch/serialization.py", line 379, in save
    _save(obj, opened_zipfile, pickle_module, pickle_protocol)
  File "/home/ubuntu/ast_conv/venvast/lib/python3.7/site-packages/torch/serialization.py", line 499, in _save
    zip_file.write_record(name, storage.data_ptr(), num_bytes)
OSError: [Errno 28] No space left on device

Traceback (most recent call last):
  File "../../src/run.py", line 99, in <module>
    train(audio_model, train_loader, val_loader, args)
  File "/home/ubuntu/ast_conv/src/traintest.py", line 220, in train
    torch.save(audio_model.state_dict(), "%s/models/audio_model.%d.pth" % (exp_dir, epoch))
  File "/home/ubuntu/ast_conv/venvast/lib/python3.7/site-packages/torch/serialization.py", line 380, in save
    return
  File "/home/ubuntu/ast_conv/venvast/lib/python3.7/site-packages/torch/serialization.py", line 259, in __exit__
    self.file_like.write_end_of_file()
RuntimeError: [enforce fail at inline_container.cc:298] . unexpected pos 322619584 vs 322619472
terminate called after throwing an instance of 'c10::Error'
  what():  [enforce fail at inline_container.cc:298] . unexpected pos 322619584 vs 322619472
frame #0: c10::ThrowEnforceNotMet(char const*, int, char const*, std::string const&, void const*) + 0x47 (0x7f20ac5b47a7 in /home/ubuntu/ast_conv/venvast/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x24e10c0 (0x7f20f14190c0 in /home/ubuntu/ast_conv/venvast/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #2: <unknown function> + 0x24dc69c (0x7f20f141469c in /home/ubuntu/ast_conv/venvast/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #3: caffe2::serialize::PyTorchStreamWriter::writeRecord(std::string const&, void const*, unsigned long, bool) + 0x9a (0x7f20f1419afa in /home/ubuntu/ast_conv/venvast/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #4: caffe2::serialize::PyTorchStreamWriter::writeEndOfFile() + 0x173 (0x7f20f1419d83 in /home/ubuntu/ast_conv/venvast/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #5: caffe2::serialize::PyTorchStreamWriter::~PyTorchStreamWriter() + 0x1a5 (0x7f20f141a075 in /home/ubuntu/ast_conv/venvast/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #6: <unknown function> + 0xa7ffe3 (0x7f2103160fe3 in /home/ubuntu/ast_conv/venvast/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #7: <unknown function> + 0x4ff188 (0x7f2102be0188 in /home/ubuntu/ast_conv/venvast/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #8: <unknown function> + 0x50048e (0x7f2102be148e in /home/ubuntu/ast_conv/venvast/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #9: python() [0x5cf938]
frame #10: python() [0x52cae8]
frame #11: python() [0x52cb32]
frame #12: python() [0x52cb32]
<omitting python frames>
frame #17: python() [0x654354]
frame #19: __libc_start_main + 0xe7 (0x7f21079dcbf7 in /lib/x86_64-linux-gnu/libc.so.6)

run.sh: line 46:  1703 Aborted                 (core dumped) CUDA_CACHE_DISABLE=1 python -W ignore ../../src/run.py --model ${model} --dataset ${dataset} --data-train ${tr_data} --data-val ${val_data} --exp-dir $exp_dir --label-csv ./data/class_labels_indices.csv --n_class 3 --lr $lr --n-epochs ${epoch} --batch-size $batch_size --save_model True --freqm $freqm --timem $timem --mixup ${mixup} --bal ${bal} --tstride $tstride --fstride $fstride --imagenet_pretrain $imagenetpretrain --audioset_pretrain $audiosetpretrain > $exp_dir/log.txt

As far as I can tell, that OSError could indicate that a file-size limit was exceeded, not just that the disk filled up. I haven't changed traintest.py except for adding an elif condition for the finetuning dataset. Did you run into this error while finetuning, or do you understand what might be causing it?

data preparation

Hi Yuan, would it be possible to elaborate on how to ensure that the FLAC audio files are single-channel?
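For what it is worth, a minimal sketch of checking and downmixing files with torchaudio; this is not part of the repo's prep scripts, just one possible approach.

import torchaudio

def load_as_mono(path, target_sr=16000):
    """Load an audio file, downmix to a single channel, and resample if needed."""
    waveform, sr = torchaudio.load(path)               # waveform: [channels, samples]
    if waveform.shape[0] > 1:
        waveform = waveform.mean(dim=0, keepdim=True)  # average the channels to mono
    if sr != target_sr:
        waveform = torchaudio.transforms.Resample(sr, target_sr)(waveform)
        sr = target_sr
    return waveform, sr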

Parameters for tuning

Hello @YuanGongND. I am trying to train AST on a dataset that is very similar to Speech Commands, except that:

  • the max length of the WAVs is 64000 samples (vs 16000 in SC)
  • the test split contains very noisy samples
  1. Could you advise me which parameters I should change?
  2. I have enough resources and want to increase the accuracy. How can I do this?

missing or corrupt files when training esc-50 model

Upon trying to run the ESC-50 recipe, I come across the following error:

formats: can't open input file `/project/ast/data/ESC-50-master/audio/1-31836-A-4.wav': Input/output error
Epoch: [1][100/134]     Per Sample Total Time 0.07304   Per Sample Data Time 0.00930    Per Sample DNN Time 0.06374     Train Loss 2.7119
Traceback (most recent call last):
  File "../../src/run.py", line 99, in <module>
    train(audio_model, train_loader, val_loader, args)
  File "/project/ai-audio-classification-models/06-audio-spectrogram-transformer/src/traintest.py", line 100, in train
    for i, (audio_input, labels) in enumerate(train_loader):
  File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 521, in __next__
    data = self._next_data()
  File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1203, in _next_data
    return self._process_data(data)
  File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1229, in _process_data
    data.reraise()
  File "/opt/conda/lib/python3.7/site-packages/torch/_utils.py", line 434, in reraise
    raise exception
RuntimeError: Caught RuntimeError in DataLoader worker process 28.
Original Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/_utils/worker.py", line 287, in _worker_loop
    data = fetcher.fetch(index)
  File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 49, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 49, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/project/ai-audio-classification-models/06-audio-spectrogram-transformer/src/dataloader.py", line 180, in __getitem__
  File "/project/ai-audio-classification-models/06-audio-spectrogram-transformer/src/dataloader.py", line 101, in _wav2fbank
  File "/opt/conda/lib/python3.7/site-packages/torchaudio/backend/sox_io_backend.py", line 153, in load
    filepath, frame_offset, num_frames, normalize, channels_first, format)
RuntimeError: Error loading audio file: failed to open file /project/ast/data/ESC-50-master/audio/1-31836-A-4.wav

It's quite unclear to me how this could happen; maybe the sox command fails and the file is therefore not created?
This happens in almost every fold.
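A small sketch that scans the audio folder up front and reports every file torchaudio cannot open, which makes it easier to check whether the resampling step left truncated files; the path below is illustrative.

from pathlib import Path
import torchaudio

audio_dir = Path("/project/ast/data/ESC-50-master/audio")  # adjust to your setup

bad_files = []
for wav in sorted(audio_dir.glob("*.wav")):
    try:
        waveform, sr = torchaudio.load(str(wav))
        if waveform.numel() == 0:
            bad_files.append((wav, "empty waveform"))
    except RuntimeError as e:
        bad_files.append((wav, str(e)))

print(f"{len(bad_files)} problematic files")
for path, reason in bad_files:
    print(path, "->", reason)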

Prediction always wrong using esc50 recipe with 0.95+ accuracy after training

Thank you for this paper; it is very well written and documented. Sorry for the confusing title. I ran the esc50 recipe and it worked as expected. This is the accuracy obtained per fold:

9.50E-01
9.83E-01
9.35E-01
9.70E-01
9.43E-01
9.56E-01

I am trying to use the best model produced to manually classify some audio files (later I want to use the model on my own dataset). This is the code I am running:

torch.cuda.set_device('cuda:0')
device = torch.device("cuda:0")

pretrained_mdl_path ="/home/habashyk/virtualEnvs/ast/egs/esc50/exp/test-esc50-f10-t10-impTrue-aspTrue-b48-lr1e-5/fold2/models/best_optim_state.pth"
sd = torch.load(pretrained_mdl_path, map_location=device)

ast_mdl = ASTModel(label_dim=50,
                   fstride=10,
                   tstride=10,
                   input_fdim=audio_conf_AS["num_mel_bins"],
                   input_tdim=audio_conf_AS["target_length"],
                   imagenet_pretrain=True,
                   model_size='base384')
ast_mdl = torch.nn.DataParallel(ast_mdl)
ast_mdl.load_state_dict(sd, strict=False)
ast_mdl.cuda()
ast_mdl.eval()

Unfortunately, when I use this model, the predictions are never accurate (but come with very high probabilities):

Top 3 labels and their associated probabilities for each prediction
THIS IS BATCH 0
Wav 0: Ground truth:  dog
Label:  Cough 	Prob:  0.80712890625
Label:  Female speech, woman speaking 	Prob:  0.74560546875
Label:  Throat clearing 	Prob:  0.7314453125
Wav 1: Ground truth:  chirping_birds
Label:  Child singing 	Prob:  0.78369140625
Label:  Cough 	Prob:  0.73779296875
Label:  Sneeze 	Prob:  0.720703125
Wav 2: Ground truth:  vacuum_cleaner
Label:  Narration, monologue 	Prob:  0.67578125
Label:  Children shouting 	Prob:  0.6611328125
Label:  Baby laughter 	Prob:  0.66015625

I have used the same code with the AudioSet model and its associated .pth weights and it works fine. Any insight on this would be greatly appreciated. Please let me know if there is anything else I can provide.

Also, using audioset_pretrain=True gives the same result: high probabilities for incorrect classes.

Thank you!

Temporal organization of tokens

Hi!

To start with - great work with the model and thanks for sharing!

I already ran it for standard classification cases and it worked as expected. However, now I want to treat the network's outputs as a sequence organized along the time dimension. I have a few points/questions related to that:

  1. I noticed in your paper that you tried the 128 x 2 input patches in your ablation studies - do you have the weights saved and would you be willing to share them? Maybe despite worse results they could be useful in my case.
  2. You mentioned that the 128 x 2 patches trained better when trained purely on AudioSet. Have you considered also pretraining on ImageNet using those parameters? Was the reason for not checking this the computational complexity, or something else (e.g. you believe that it wouldn't train well on ordinary image data)?
  3. Do you see a way to use the majority of the network as it currently is (with 16 x 16 input patches) and add some layer (e.g. Conv1D) on top to make it combine outputs corresponding to specific time frames? How would you approach this?

Thanks!
Michał

Inference time mismatch errors ?

Hello,
I trained a base384-sized AST model on my own dataset. Training ran without errors, but when I tried to run inference and load from the checkpoint, this error arose:

RuntimeError: Error(s) in loading state_dict for DataParallel:
size mismatch for module.v.pos_embed: copying a param with shape torch.Size([1, 602, 768]) from checkpoint, the shape in current model is torch.Size([1, 1214, 768]).

What could be causing this error?
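For reference, the length of the positional embedding is fixed by the input_fdim/input_tdim (and stride) values passed to ASTModel, so a checkpoint can only be loaded into a model built with the same shape used at training time. A minimal sketch, with illustrative values and a placeholder path:

import torch
from src.models import ASTModel  # adjust the import to your checkout layout

# input_fdim, input_tdim, fstride and tstride must match the values used when
# the checkpoint was trained, otherwise the pos_embed sizes will not line up.
model = ASTModel(label_dim=50,          # illustrative: your number of classes
                 fstride=10, tstride=10,
                 input_fdim=128,
                 input_tdim=512,        # illustrative: the target_length used at training
                 imagenet_pretrain=False, audioset_pretrain=False,
                 model_size='base384')
model = torch.nn.DataParallel(model)

sd = torch.load("path/to/your_checkpoint.pth", map_location="cpu")  # placeholder path
model.load_state_dict(sd)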

Running AST on a downstream task.

Dear Yuan,

Thank you for creating this SOTA model for audio processing.

I want to run AST on an audio dataset. I have prepared the data in a similar manner to the ESC-50 data. I was about to run the model when I noticed that you use a dataset-specific mean and std to normalize the input. Could you please share the method you used to find these two statistics?

Regards
Saif

MixUp Waveform Length Matching

When specifying mixup > 0, the code loads two audio files and, if they are not the same length, pads or cuts waveform2 to the same shape as waveform1. There is a minor bug in the code that does this:

 if waveform1.shape[1] != waveform2.shape[1]:
        if waveform1.shape[1] > waveform2.shape[1]:
            # padding
            temp_wav = torch.zeros((1,waveform1.shape[1]))
            temp_wav[0, 0:waveform2.shape[1]] = waveform2
            waveform2 = temp_wav
        else:
            # cutting
            waveform2 = waveform2[0, 0:waveform1.shape[1]]

In the above snippet, lines 4, 5, and 9 don't work when the first dimension of the waveforms is > 1.
The following minor tweaks should help:

if waveform1.shape[1] != waveform2.shape[1]:
      if waveform1.shape[1] > waveform2.shape[1]:
          # padding
          temp_wav = torch.zeros(waveform1.shape)
          temp_wav[:, 0:waveform2.shape[1]] = waveform2
          waveform2 = temp_wav
      else:
          # cutting
          waveform2 = waveform2[:, 0:waveform1.shape[1]]

Use librosa for inference.py instead of torchaudio

Hi, I was going through the inference pipeline and wanted to know whether there is a way to replace the Kaldi fbank implementation with the librosa library. I am hoping to run it on my Jetson device, and Kaldi uses the MKL library, which is not suitable for ARM architectures.

I've tried multiple methods, but the results are not the same as Kaldi's fbank implementation. Any help would be appreciated. Thank you.

@JeffC0628 @YuanGongND
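For reference, a rough sketch of approximating the fbank features with librosa. It is only an approximation and will not match torchaudio's Kaldi-compliant output exactly (Kaldi applies pre-emphasis and a slightly different framing convention, among other things); the parameters mirror the 25 ms window / 10 ms shift / 128 mel bins used in this repo.

import numpy as np
import librosa

def librosa_fbank(wav_path, num_mel_bins=128, target_sr=16000):
    """Rough librosa stand-in for torchaudio.compliance.kaldi.fbank (not numerically identical)."""
    y, sr = librosa.load(wav_path, sr=target_sr)
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr,
        n_fft=int(0.025 * sr),        # 25 ms window
        hop_length=int(0.010 * sr),   # 10 ms shift
        win_length=int(0.025 * sr),
        window="hann",
        n_mels=num_mel_bins,
        htk=True,                     # Kaldi uses an HTK-style mel scale
        power=2.0,
    )
    return np.log(mel + 1e-10).T      # natural log, shape [frames, num_mel_bins]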

Random inference result

Hello, Dr. Yuan.

Thank you for your great work, and sorry for my very elementary question; I'm very new to audio classification.
My inference script outputs a random result (the output changes at every execution). Could you tell me what is wrong?

I checked #19 and added fbank = (fbank + 4.26) / (4.57 * 2) but the result does not change.

This is my Colab page, and I added you as an editor (if the runtime times out, the clone and pip install steps need to be rerun, which takes about 10 min).

source:

############ Load
import librosa.display
import os
import scipy
import numpy as np
import matplotlib.pyplot as plt
import torchaudio
import torch
import IPython.display as ipd

sample_freq = 16000

# Load fragment from 70s to 80s
filename = "/content/zzNdwF40ID8_short.wav"
y, sr = librosa.load(filename, sr=sample_freq, offset=70.0, duration=10.0)


print(f"Input sound shape is {y.shape}, {sr} Hz")
librosa.display.waveplot(y=y, sr=sr)
ipd.Audio(y, rate=sr, autoplay=True)


################# Show
# n_mels is number of Mel bands to generate
n_mels=128
interval = 10e-3   # frame shift: 10 ms (value in seconds), from https://arxiv.org/pdf/2104.01778.pdf
win_length = 25e-3 # window length: 25 ms (value in seconds), from https://arxiv.org/pdf/2104.01778.pdf but not used


# hop_length is the number of samples between successive frames.
hop_length=int(sample_freq * interval)


### generate fbank https://github.com/YuanGongND/ast/blob/102f0477099f83e04f6f2b30a498464b78bbaf46/src/dataloader.py#L123
waveform = torch.from_numpy( y.reshape(1, -1).astype(np.float32) ).clone().cpu() # to torch
fbank = torchaudio.compliance.kaldi.fbank(waveform, htk_compat=True, sample_frequency=sr, use_energy=False,
                                          window_type='hanning', num_mel_bins=n_mels, dither=0.0, frame_shift=10)

# normalize with dataset mean and std from https://github.com/YuanGongND/ast#use-pretrained-model-for-downstream-tasks
fbank = (fbank + 4.26) / (4.57 * 2)


# align to target_length
target_length = int((y.size/sr)/interval)
n_frames = fbank.shape[0]
p = target_length - n_frames

# cut and pad
if p > 0:
    m = torch.nn.ZeroPad2d((0, 0, 0, p))
    fbank = m(fbank)
elif p < 0:
    fbank = fbank[0:target_length, :]


plt.figure(figsize=(12, 4))
librosa.display.specshow(data=fbank.transpose(1, 0).to('cpu').detach().numpy().copy(), sr=sr, hop_length=hop_length, x_axis='time', y_axis='mel')
plt.colorbar(format='%+2.0f dB')
plt.title('Mel spectrogram')
plt.tight_layout()


# reshape
fbank = fbank.reshape(1, -1, 128)

# print
print(f"fbank shape is {fbank.shape}, mean: {fbank.mean()} std:{fbank.std()}")


######################### Infer
import os
import torch
import sys
import csv

sys.path.append(os.path.join('./ast/src'))
import models


# download pretrained model in this directory
os.environ['TORCH_HOME'] = './ast/pretrained_models'

# assume the task has 527 classes
label_dim = 527

# create a input
test_input = fbank
input_tdim = fbank.shape[1]
print(f"Input size: {test_input.shape}\n", test_input.device)

# create an AST model and infer
ast_mdl = models.ASTModel(label_dim=label_dim, input_tdim=input_tdim, imagenet_pretrain=True, audioset_pretrain=True)
ast_mdl.eval()                          
with torch.no_grad():
    test_output = ast_mdl.forward(test_input)
    test_output = torch.sigmoid(test_output)


# output should be in shape [1, 527], i.e., 1 sample, each with prediction of 527 classes.
print(f"\noutput shape is {test_output.shape}, argmax is {test_output.argmax(axis=1)}")

# open labels
if not "labels" in vars(sys.modules[__name__]):
  with open('audioset_label.txt') as f:
      reader = csv.reader(f)
      labels = [row[0] for row in reader]

# argmax
result_output = test_output.data.cpu().numpy()[0]
sorted_indexes = np.argsort(result_output)[::-1]


# Print audio tagging top probabilities
print("\nTop probabilities. Should be Music, Sonar\n-------")
for k in range(10):
    print('{}: {:.4f}'.format(np.array(labels)[sorted_indexes[k]],
                              result_output[sorted_indexes[k]]))

Clarification on the Parameters

Hey,

I'm pretty new to working with audio data for classification, so could you give some insight into the parameters/stats mentioned in steps 2-4 of the "Use Pretrained Model For Downstream Tasks" section? Specifically, a bit more clarification on how to obtain the normalization stats, and on how the parameters in steps 2 (SpecAugment and mixup rate) and 4 need to be changed for different kinds of input, or how they affect the model.

How to change the interpolation method?

Hi Yuan,
In AST, for the ablation experiment comparing different interpolation methods, one of the items is called "Reinitialize"; how is this reflected in the code?
Best Regards.

Validation loss vs Training loss in AudioSet training

Hi!

First of all, I would like to thank you for sharing your amazing work with everyone! Truly inspiring and fascinating work you shared with us.

I have a question regarding the difference between the training loss and the validation loss. The validation loss is much higher than the training loss; does that make sense? Isn't it overfitting?

I also tried to fine-tune the AudioSet-trained model on my data and it showed the same difference (with and without augmentations).

Here is an example from the logs: test-full-f10-t10-pTrue-b12-lr1e-5/log_2090852.txt:

train_loss: 0.011128
valid_loss: 0.693989

I'm still new to deep learning so maybe I'm missing something.

Thank you!

results.csv and getting labels per audio file

Hi Yuan,

Thank you for posting your project and providing ample information about its elements!

I am running the ESC-50 recipe and have been struggling to output results. Could you point me towards where the result.csv files get created in the scripts? Moreover, do you know how I could pull out labels for the sound files from the results of that recipe? I am trying to use this recipe for avian call recognition and am struggling with gathering the results.

Thank you, I appreciate any insight you can offer.

Incorrect balance variable

Hi, thanks for this great resource.

I noticed a potential typo in the wrapper script for the audioset pipeline: Lines 26 and 31 may need to be swapped

ast/egs/audioset/run.sh

Lines 24 to 35 in e038086

if [ $set == balanced ]
then
bal=none
lr=5e-5
epoch=25
tr_data=/data/sls/scratch/yuangong/aed-pc/src/enhance_label/datafiles_local/balanced_train_data_type1_2_mean.json
else
bal=bal
lr=1e-5
epoch=5
tr_data=/data/sls/scratch/yuangong/aed-pc/src/enhance_label/datafiles_local/whole_train_data.json
fi

Question about pre-training on a new dataset.

Hi,
I am trying to use the pre-trained model on my own dataset and in my own pipeline.
As recommended, I am using audioset_pretrain=True and imagenet_pretrain=True.
In the code I noticed that ASTModel is called again, which looks like it results in an infinite loop (line 129 in ast_models.py).
Below is the snippet that I am referring to.
Is this a bug, or an oversight on my part? Could you please take a look?
I am really looking forward to trying AST in my pipeline.

# now load a model that is pretrained on both ImageNet and AudioSet
        elif audioset_pretrain == True:
            if audioset_pretrain == True and imagenet_pretrain == False:
                raise ValueError('currently model pretrained on only audioset is not supported, please set imagenet_pretrain = True to use audioset pretrained model.')
            if model_size != 'base384':
                raise ValueError('currently only has base384 AudioSet pretrained model.')
            device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
            if os.path.exists('../../pretrained_models/audioset_10_10_0.4593.pth') == False:
                # this model performs 0.4593 mAP on the audioset eval set
                audioset_mdl_url = 'https://www.dropbox.com/s/cv4knew8mvbrnvq/audioset_0.4593.pth?dl=1'
                wget.download(audioset_mdl_url, out='../../pretrained_models/audioset_10_10_0.4593.pth')
            sd = torch.load('../../pretrained_models/audioset_10_10_0.4593.pth', map_location=device)
            audio_model = ASTModel(label_dim=527, fstride=10, tstride=10, input_fdim=128, input_tdim=1024, imagenet_pretrain=False, audioset_pretrain=False, model_size='base384', verbose=False)

Thanks in advance for your time.

ImageNet classifier is not terminated in Audioset pretrained models.

Hi Yuan Gong, thank you for sharing your work. It is clear and easy to run.
I am wondering about the ImageNet classifier weights; they still exist in the AudioSet pretrained models.
Do you train them?
Here is the last displayed part of the pretrained "audioset_10_10_0.4593.pth":

module.v.head.weight       torch.Size([1000, 768])
module.v.head.bias         torch.Size([1000])
module.v.head_dist.weight  torch.Size([1000, 768])
module.v.head_dist.bias    torch.Size([1000])
module.mlp_head.0.weight   torch.Size([768])
module.mlp_head.0.bias     torch.Size([768])
module.mlp_head.1.weight   torch.Size([527, 768])
module.mlp_head.1.bias     torch.Size([527])

They can be skipped by
self.v.head = nn.Identity()
self.v.head_dist = nn.Identity()

Now I want to use the pretrained AudioSet model for another task, but I am worried that eliminating this part will affect performance, although I think these heads are not connected to the final AudioSet classifier of 527 classes.

Thank you again

single audio inference for ast_model

Hi Yuan,
I have written a pretty simple script to check the tags for a single WAV file, but the result does not seem right. Could you help point out the mistake?

  import os
  import sys
  import csv
  
  import numpy as np
  import torch
  import torchaudio
  from src.models import ASTModel
  torchaudio.set_audio_backend("soundfile")       # switch backend
  basepath = os.path.dirname(os.path.dirname(sys.path[0]))
  sys.path.append(basepath)
  
  # download pretrained model in this directory
  os.environ['TORCH_HOME'] = '../pretrained_models'
  
  
  def make_features(wav_name, mel_bins, target_length=1024):
      waveform, sr = torchaudio.load(wav_name)
  
      fbank = torchaudio.compliance.kaldi.fbank(
          waveform, htk_compat=True, sample_frequency=sr, use_energy=False,
          window_type='hanning', num_mel_bins=mel_bins, dither=0.0,
          frame_shift=10)
  
      n_frames = fbank.shape[0]
      p = target_length - n_frames
      # cut and pad
      if p > 0:
          m = torch.nn.ZeroPad2d((0, 0, 0, p))
          fbank = m(fbank)
      elif p < 0:
          fbank = fbank[0:target_length, :]
  
      return fbank
  
  
  def load_label(label_csv):
      # Load label
      with open(label_csv, 'r') as f:
          reader = csv.reader(f, delimiter=',')
          lines = list(reader)
  
      labels = []
      ids = []  # Each label has a unique id such as "/m/068hy"
      for i1 in range(1, len(lines)):
          id = lines[i1][1]
          label = lines[i1][2]
          ids.append(id)
          labels.append(label)
      return labels
  
  
  if __name__ == '__main__':
  
      label_csv = './ast/egs/audioset/data/class_labels_indices.csv'
  
      # 1. make feature for predict
      wav_name = './ast/egs/audioset/data/0OxlgIitVig.wav'
      feats = make_features(wav_name, mel_bins=128)           # shape(1024, 128)
  
      # the input spectrogram here has target_length (1024) time frames
      input_tdim = feats.shape[0]
  
      # 2. load the best model and the weights
      checkpoint_path = './ast/pretrained_models/audioset_10_10_0.4593.pth'
      ast_mdl = ASTModel(label_dim=527, input_tdim=input_tdim, imagenet_pretrain=False, audioset_pretrain=False)
      print(f'[*INFO] load checkpoint: {checkpoint_path}')
      checkpoint = torch.load(checkpoint_path, map_location='cuda')
      audio_model = torch.nn.DataParallel(ast_mdl, device_ids=[0])
      audio_model.load_state_dict(checkpoint)
  
      audio_model = audio_model.to(torch.device("cuda:0"))
  
      # 3. feed the data feature to model
      feats_data = feats.expand(1, input_tdim, 128)           # reshape the feature
  
      audio_model.eval()                                      # set the eval model
      with torch.no_grad():
          output = audio_model.forward(feats_data)
          output = torch.sigmoid(output)
      result_output = output.data.cpu().numpy()[0]
  
      # 4. map the post-prob to label
      labels = load_label(label_csv)
  
      sorted_indexes = np.argsort(result_output)[::-1]
  
      # Print audio tagging top probabilities
      for k in range(10):
          print('{}: {:.4f}'.format(np.array(labels)[sorted_indexes[k]],
                                    result_output[sorted_indexes[k]]))
  
      # output should be in shape [1, 527], i.e., 1 sample with predictions for 527 classes.
      # print(result_output.shape)

and the output:
Speech: 0.1906
Music: 0.0481
Inside, small room: 0.0245
Musical instrument: 0.0100
Silence: 0.0088
Sound effect: 0.0074
Outside, rural or natural: 0.0064
Animal: 0.0058
Outside, urban or manmade: 0.0045
Inside, large room or hall: 0.0041

Error reshaping positional embedding for AudioSet pretrained model

This error only occurs when using the AudioSet pretrained model; it does not occur when using only the ImageNet pretrained model. Audio is resampled to 16 kHz. The error occurs in src/models/ast_models.py: since t_dim > 101, the else block on line 139 is triggered.

Traceback (most recent call last):
  File "train.py", line 73, in <module>
    model = VTN(**vars(cfg))
[REDACTED - model call internally]
  File "/[REDACTED]/ast_models.py", line 141, in __init__
    new_pos_embed = new_pos_embed.reshape(1, 768, num_patches).transpose(1, 2)
RuntimeError: shape '[1, 768, 120]' is invalid for input of size 221184

Parameters to the ASTModel instantiation:

label_dim: 400
input_tdim: 251
input_fdim: 64
audioset_pretrain: True

PSLA code

Hi,

Can you provide code for the model architecture in Figure 2?

Wonderful work! questions about feature size

Hi there,
Thank you for open-sourcing this implementation!
It is very inspiring to see timm work in the audio setting.

Q: I tried the pipeline with a smaller feature size, e.g. 64x400, which ends up as 39x5 patches, and AST gets stuck at 0.01 mAP.
Upsampling to your feature size of 128x1024 brought it up to 0.10 mAP. I guess your intuition is to "take advantage of" the 384x384 positional embedding (originally 576 patches), so 1212 patches would be roughly 2x the 576 patches. I am still curious whether there is a way to do this with a smaller feature dimension.

Convert mel filterbanks to wav again?

First of all, thanks for this wonderful repo! I am just curious whether it is possible to convert the mel input back to a WAV again. I am trying out a model that uses the same concept as yours as a transformer decoder input, but I am not sure whether the predicted output (also in mel form) can be converted back to a waveform. Thank you very much in advance!
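There is no exact inverse, since the phase is discarded when the filterbank is computed, but an approximate reconstruction is possible with Griffin-Lim. Below is a rough sketch using librosa's mel inversion, assuming the features are natural-log mel filterbank energies like the ones this repo feeds to AST; librosa and soundfile are assumed to be installed, and the parameters are illustrative.

import numpy as np
import librosa
import soundfile as sf

sr = 16000
fbank = np.zeros((1024, 128), dtype=np.float32)  # placeholder [frames, n_mels] log-mel features

mel = np.exp(fbank).T                            # undo the log -> [n_mels, frames] power mel
y = librosa.feature.inverse.mel_to_audio(        # Griffin-Lim reconstruction (approximate)
    mel, sr=sr,
    n_fft=int(0.025 * sr),
    hop_length=int(0.010 * sr),
    window="hann",
    power=2.0,
)
sf.write("reconstructed.wav", y, sr)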

Normalizing the train and test data

You have mentioned that if we want to use your pre-trained model, we need to take care of the input normalization. In your code, I observed that you have manually added the mean and std for each of the datasets you used. How are we supposed to calculate the mean and std of our own dataset? Do we calculate them after computing the fbank for each audio signal, or are they calculated from the raw audio? It would be great if you could provide some clarity on this.
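A sketch of one way to estimate such stats: since the normalization is applied to the fbank (the dataloader computes something like (fbank - mean) / (std * 2)), compute the fbank for every training file exactly as the dataloader does, with SpecAugment and mixup disabled, and take the mean and standard deviation over all fbank values. The paths and parameters below are placeholders.

import torch
import torchaudio

def wav_to_fbank(path, num_mel_bins=128, target_length=1024):
    """Compute the padded fbank the same way the dataloader does (no SpecAugment, no mixup)."""
    waveform, sr = torchaudio.load(path)
    fbank = torchaudio.compliance.kaldi.fbank(
        waveform, htk_compat=True, sample_frequency=sr, use_energy=False,
        window_type='hanning', num_mel_bins=num_mel_bins, dither=0.0, frame_shift=10)
    p = target_length - fbank.shape[0]
    if p > 0:
        fbank = torch.nn.ZeroPad2d((0, 0, 0, p))(fbank)
    elif p < 0:
        fbank = fbank[:target_length, :]
    return fbank

wav_files = ["path/to/train_0001.wav", "path/to/train_0002.wav"]  # placeholder: your training files
all_vals = torch.cat([wav_to_fbank(f).reshape(-1) for f in wav_files])
print("dataset mean:", all_vals.mean().item(), "dataset std:", all_vals.std().item())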

No such file or directory: './data/datafiles/esc_train_data_1.json'

Hi Yuan,
I downloaded the model and tried to test it with the ESC-50 data. I tried to run run_esc.sh, but got a "no such file" error. I downloaded master.zip, unzipped it, and put it in ./data/ESC-50-master/. I checked the run and prep scripts and haven't found any code that creates the directory or downloads files for ./data/datafiles/.
Is it a file or data that I am supposed to download, or is the file automatically generated?

Ningkun

load a trained model only for evaluation

I have a case where I want to load a model that I recently trained with target_length=512 and a 48 kHz sample rate, using the following code:

  sd = torch.load(model_path, map_location="cuda")
  audio_model = ASTModel(label_dim=84, fstride=10, tstride=10, input_fdim=128, input_tdim=512, imagenet_pretrain=False, audioset_pretrain=False, model_size='base384', verbose=False)
  audio_model = torch.nn.DataParallel(audio_model)
  audio_model.load_state_dict(sd, strict=False)

load_state_dict reports that all keys matched successfully, but the evaluation fails with an essentially random mean average precision value, which doesn't match the value seen during training.

Question regarding fbank for fine tuning

Hi Yuan,

Thank you for this great work! I am currently fine-tuning the models you produced for a project I am working on and really appreciate the opportunity you created for me. I have a question regarding the spectrograms (or fbanks) produced by the _wav2fbank function.

Currently, I am trying to prepare a dataset to match the requirements of the model, but I have stumbled upon something that grabbed my attention: you mentioned in the paper that the model accepts variable-length inputs. Taking a closer look, I found that this is due to the padding added below the fbank, which is done to fix the input dimensions fed into the model. However, when I applied this to my own data, I saw that the padding was a different color depending on the image when I converted them. Here are two examples:
d4-2 wav, d10-2 wav
Although I am aware that the values of the solid coloured areas are zeros, I worry that this indicates the same colour being attributed to a different value in different spectrograms, and I wonder how that would impact the model's understanding of colour.

My second question is regarding the use of padding specifically. In the ViT paper as well as AST, images are fed through as a collection of patches for learning. Any patches that are fully blank naturally would not add much information to the model. However, for the patches that contain both fbank content and padding, is there no effect on learning there? Also, if a specific category is relatively shorter in length than another, does the model include that audio length in its representation of that class?

Any insight on the above would be deeply appreciated.
Thanks again

Positional embedding

The paper https://arxiv.org/pdf/2012.12877v2.pdf says: "We therefore cut the first dimension and interpolate the second dimension of the 24 × 24 ViT positional embedding to 12 × 100 and use it as the positional embedding for the AST."

Does the "cut" mean taking the first 12 dimensions? In my understanding, nn.functional.interpolate always interpolates.

import torch
from torch import nn


h, w = 4, 3
pos_embed = torch.randn((1, 1, h, w))

a = nn.functional.interpolate(pos_embed, scale_factor=(2/h, 3/w), mode='bilinear')
print("position embedding:\n", pos_embed)
print("{},{}->{},{}:\n".format(h, w, 2, 3), a)

---------------------------------------------
position embedding:
 tensor([[[[-0.5638,  0.0127, -2.4190],
          [ 0.2434,  0.3804, -0.2128],
          [ 0.2813, -0.7966, -0.3580],
          [-1.2754, -0.2837,  1.6149]]]])
4,3->2,3:
 tensor([[[[-0.1602,  0.1966, -1.3159],
          [-0.4971, -0.5402,  0.6284]]]])

How to change the kernel size?

Hello @YuanGongND, I'm sorry to bother you again.

I would like to ask a question: how can I change the kernel size so as to change the number of patches? I USE the ImageNet pretrained model and do NOT USE the AudioSet pretrained model, but I have this problem.

x1 torch.Size([64, 149, 768])
self.v.pos_embed torch.Size([1, 202, 768])
Traceback (most recent call last):
  File "train.py", line 394, in <module>
    main()
  File "train.py", line 161, in main
    train_loss,train_acc = train(train_loader, model, criterion, optimizer, args.use_cuda, epoch)
  File "train.py", line 296, in train
    output = model(inputs)
  File "/root/miniconda3/envs/deepAST/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/root/miniconda3/envs/deepAST/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 159, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "/root/miniconda3/envs/deepAST/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/root/miniconda3/envs/deepAST/lib/python3.7/site-packages/torch/cuda/amp/autocast_mode.py", line 135, in decorate_autocast
    return func(*args, **kwargs)
  File "/data/source/deepAST_exp/model/ASTConcat.py", line 180, in forward
    x1 = x1 + self.v.pos_embed
RuntimeError: The size of tensor a (149) must match the size of tensor b (202) at non-singleton dimension 1

I only changed the get_shape function, like this:

def get_shape(self, fstride, tstride, input_fdim=128, input_tdim=1024, kernel_size=(8,8)):
        test_input = torch.randn(1, 1, input_fdim, input_tdim)
        test_proj = nn.Conv2d(1, self.original_embedding_dim, kernel_size=kernel_size, stride=(fstride, tstride))
        test_out = test_proj(test_input)
        f_dim = test_out.shape[2]
        t_dim = test_out.shape[3]
        return f_dim, t_dim

So, what is the correct way to do this? Looking forward to your answer.

computing the normalization stats

Hi, thank you for your great work!

I have a question regarding the difference in the 'freqm' parameter value. When computing the normalization stats (mean and std), the value is 24, but during model training it is 48. Why are the values different in these two processes?

Real-time microphone testing

Hi, I've been using your model for classification and audio analysis and it works great.
I have trained my own model and was wondering whether there is a way to test it in real time with a microphone rather than an audio file. If you could suggest a way forward, that would be great.
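A rough sketch of one way to do this with the sounddevice package (not part of this repo): record a fixed-length buffer from the microphone, turn it into an fbank the same way the dataloader does, and run it through the model. The normalization stats, label count, and model weights below are placeholders you would replace with your own.

import sounddevice as sd
import torch
import torchaudio
from src.models import ASTModel  # adjust the import to your checkout layout

sr = 16000
duration = 5  # seconds of audio per prediction window
dataset_mean, dataset_std = 0.0, 1.0  # placeholders: use your own dataset's fbank stats

# Record a mono buffer from the default microphone.
audio = sd.rec(int(duration * sr), samplerate=sr, channels=1, dtype="float32")
sd.wait()
waveform = torch.from_numpy(audio).reshape(1, -1)  # [1, samples]

fbank = torchaudio.compliance.kaldi.fbank(
    waveform, htk_compat=True, sample_frequency=sr, use_energy=False,
    window_type='hanning', num_mel_bins=128, dither=0.0, frame_shift=10)
fbank = (fbank - dataset_mean) / (dataset_std * 2)  # same normalization as at training time

# Placeholder model: build it with the shape you trained with and load your weights here.
model = ASTModel(label_dim=50, input_tdim=fbank.shape[0],
                 imagenet_pretrain=False, audioset_pretrain=False)
model.eval()

with torch.no_grad():
    probs = torch.sigmoid(model(fbank.unsqueeze(0)))
print(probs.topk(5))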
