yuangongnd / ast
Code for the Interspeech 2021 paper "AST: Audio Spectrogram Transformer".
License: BSD 3-Clause "New" or "Revised" License
Thanks a lot for your amazing work and for sharing the code. I have a small question: I have a video dataset and want to use it by extracting the audio from the videos. Do you recommend any recipe for processing the audio extracted from the videos, or would raw mp3 files work?
Thanks again
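For reference, a minimal sketch (assuming ffmpeg is installed; the file names are placeholders) for pulling a 16 kHz mono wav out of a video container, which is the format the pretrained models in this repo appear to expect:

import subprocess

def extract_audio(video_path, wav_path, sr=16000):
    # -vn drops the video stream, -ac 1 downmixes to mono,
    # -ar resamples, pcm_s16le writes a standard 16-bit wav.
    subprocess.run(
        ["ffmpeg", "-y", "-i", video_path, "-vn",
         "-acodec", "pcm_s16le", "-ac", "1", "-ar", str(sr), wav_path],
        check=True,
    )

extract_audio("clip.mp4", "clip.wav")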
Hello, I'm new to audio classification, and I want to know whether this normalization is right:
fbank = (fbank - self.norm_mean) / (self.norm_std * 2)
or should it be:
fbank = (fbank - self.norm_mean) / (self.norm_std ** 2)
Hi Yuan,
First of all, I would like to say a huge thanks for your great work!
It would be great if you could share more details about the output values in the README.md.
I ran demo.py and got the linear output values (positive and negative). I would like to know the best way to binarize those output values (0: audio label is absent, 1: audio label is present).
Anar Sultani
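In case it helps, a minimal sketch of one common way to binarize multi-label outputs (the 0.5 threshold is an arbitrary choice, not something from the repo, and usually needs tuning per label):

import torch

logits = torch.randn(1, 527)          # stand-in for the model's raw [1, num_classes] output
probs = torch.sigmoid(logits)         # map linear outputs to (0, 1)
binary = (probs > 0.5).int()          # 1 = label present, 0 = absent
print(binary.sum().item(), "labels predicted present")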
There must be some mistake from my side. Can someone help me identify it?
This is how I'm training:
!python -W ignore /content/ast/src/run.py --model ast --dataset esc50 \
  --data-train /content/data/datafiles/esc_train_data_1.json --data-val /content/data/datafiles/esc_eval_data_1.json --exp-dir /content/expdir/fold1 \
  --label-csv /content/ast/egs/esc50/data/esc_class_labels_indices.csv --n_class 50 \
  --lr 1e-5 --n-epochs 25 --batch-size 12 --save_model False \
  --freqm 24 --timem 96 --mixup 8 --bal None \
  --tstride 10 --fstride 10 --imagenet_pretrain True --audioset_pretrain True
Hello,
I tried to run inference on CPU, but this error arose:
RuntimeError: module must have its parameters and buffers on device cuda:0
What steps should be taken?
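Not an official recipe, but one sketch that can work for CPU-only inference is to skip nn.DataParallel entirely and strip the "module." prefix from the checkpoint keys (the checkpoint path is a placeholder; if load_state_dict complains about mismatched keys, double-check the input_fdim/input_tdim the checkpoint was trained with):

import torch
from src.models import ASTModel

device = torch.device("cpu")
sd = torch.load("pretrained_models/audioset_10_10_0.4593.pth", map_location=device)
# Checkpoints saved from nn.DataParallel prefix every key with "module."
sd = {k.replace("module.", "", 1): v for k, v in sd.items()}

model = ASTModel(label_dim=527, fstride=10, tstride=10,
                 input_fdim=128, input_tdim=1024,
                 imagenet_pretrain=False, audioset_pretrain=False,
                 model_size='base384')
model.load_state_dict(sd)
model.to(device).eval()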
Hi, thanks for the awesome contribution!
I have prepared my data using your pipeline. When running the experiments, I get:
ImageNet pretraining: True, AudioSet pretraining: True
Traceback (most recent call last):
File "../../src/run.py", line 99, in <module>
audioset_pretrain=args.audioset_pretrain, model_size='base384')
File "/home/user/PycharmProjects/ast/src/models/ast_models.py", line 143, in __init__
sd = torch.load('../../pretrained_models/ast_audioset.pth', map_location=device)
File "/home/user/anaconda3/envs/ast/lib/python3.7/site-packages/torch/serialization.py", line 579, in load
with _open_file_like(f, 'rb') as opened_file:
File "/home/user/anaconda3/envs/ast/lib/python3.7/site-packages/torch/serialization.py", line 230, in _open_file_like
return _open_file(name_or_buffer, mode)
File "/home/user/anaconda3/envs/ast/lib/python3.7/site-packages/torch/serialization.py", line 211, in __init__
super(_open_file, self).__init__(open(name, mode))
FileNotFoundError: [Errno 2] No such file or directory: '../../pretrained_models/ast_audioset.pth'
Looking at line 143 of ast_models.py, the file name passed to torch.load differs from the name of the downloaded model:
wget.download(audioset_mdl_url, out='../../pretrained_models/audioset_10_10_0.4593.pth')
sd = torch.load('../../pretrained_models/ast_audioset.pth', map_location=device)
Changing the name of the file from "ast_audioset.pth" to "audioset_10_10_0.4593.pth" just fixed the missing file error.
Posted this in case someone needs it.
Hi, Yuan. Is there code or a demo to test a single audio file with the trained model?
I was trying to fine-tune the AudioSet pretrained model on a very small dataset to test it out. At first the process would simply be killed with "Out of memory" in the log, but when I moved to a larger system, the process ran for longer before returning this error:
Traceback (most recent call last):
File "/home/ubuntu/ast_conv/venvast/lib/python3.7/site-packages/torch/serialization.py", line 379, in save
_save(obj, opened_zipfile, pickle_module, pickle_protocol)
File "/home/ubuntu/ast_conv/venvast/lib/python3.7/site-packages/torch/serialization.py", line 499, in _save
zip_file.write_record(name, storage.data_ptr(), num_bytes)
OSError: [Errno 28] No space left on device
Traceback (most recent call last):
File "../../src/run.py", line 99, in <module>
train(audio_model, train_loader, val_loader, args)
File "/home/ubuntu/ast_conv/src/traintest.py", line 220, in train
torch.save(audio_model.state_dict(), "%s/models/audio_model.%d.pth" % (exp_dir, epoch))
File "/home/ubuntu/ast_conv/venvast/lib/python3.7/site-packages/torch/serialization.py", line 380, in save
return
File "/home/ubuntu/ast_conv/venvast/lib/python3.7/site-packages/torch/serialization.py", line 259, in __exit__
self.file_like.write_end_of_file()
RuntimeError: [enforce fail at inline_container.cc:298] . unexpected pos 322619584 vs 322619472
terminate called after throwing an instance of 'c10::Error'
what(): [enforce fail at inline_container.cc:298] . unexpected pos 322619584 vs 322619472
frame #0: c10::ThrowEnforceNotMet(char const*, int, char const*, std::string const&, void const*) + 0x47 (0x7f20ac5b47a7 in /home/ubuntu/ast_conv/venvast/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x24e10c0 (0x7f20f14190c0 in /home/ubuntu/ast_conv/venvast/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #2: <unknown function> + 0x24dc69c (0x7f20f141469c in /home/ubuntu/ast_conv/venvast/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #3: caffe2::serialize::PyTorchStreamWriter::writeRecord(std::string const&, void const*, unsigned long, bool) + 0x9a (0x7f20f1419afa in /home/ubuntu/ast_conv/venvast/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #4: caffe2::serialize::PyTorchStreamWriter::writeEndOfFile() + 0x173 (0x7f20f1419d83 in /home/ubuntu/ast_conv/venvast/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #5: caffe2::serialize::PyTorchStreamWriter::~PyTorchStreamWriter() + 0x1a5 (0x7f20f141a075 in /home/ubuntu/ast_conv/venvast/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #6: <unknown function> + 0xa7ffe3 (0x7f2103160fe3 in /home/ubuntu/ast_conv/venvast/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #7: <unknown function> + 0x4ff188 (0x7f2102be0188 in /home/ubuntu/ast_conv/venvast/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #8: <unknown function> + 0x50048e (0x7f2102be148e in /home/ubuntu/ast_conv/venvast/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #9: python() [0x5cf938]
frame #10: python() [0x52cae8]
frame #11: python() [0x52cb32]
frame #12: python() [0x52cb32]
<omitting python frames>
frame #17: python() [0x654354]
frame #19: __libc_start_main + 0xe7 (0x7f21079dcbf7 in /lib/x86_64-linux-gnu/libc.so.6)
run.sh: line 46: 1703 Aborted (core dumped) CUDA_CACHE_DISABLE=1 python -W ignore ../../src/run.py --model ${model} --dataset ${dataset} --data-train ${tr_data} --data-val ${val_data} --exp-dir $exp_dir --label-csv ./data/class_labels_indices.csv --n_class 3 --lr $lr --n-epochs ${epoch} --batch-size $batch_size --save_model True --freqm $freqm --timem $timem --mixup ${mixup} --bal ${bal} --tstride $tstride --fstride $fstride --imagenet_pretrain $imagenetpretrain --audioset_pretrain $audiosetpretrain > $exp_dir/log.txt
As far as I can tell, that OSError could indicate that a file-size limit has been exceeded, not just that the disk is full. I haven't changed traintest.py
except for adding an elif condition for the fine-tuning dataset. Did you run into this error while fine-tuning, or do you understand what might cause it?
Line 42 in d338ce4
:param input_fdim -> :param input_tdim
Hi Yuan, would it be possible to elaborate on how to ensure that FLAC audio files are single-channel?
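As a side note, a quick sketch (using torchaudio, not something from the repo; the file names are placeholders) for checking the channel count and downmixing to mono:

import torchaudio

waveform, sr = torchaudio.load("example.flac")
if waveform.shape[0] > 1:
    # Average the channels into a single mono track.
    waveform = waveform.mean(dim=0, keepdim=True)
torchaudio.save("example_mono.flac", waveform, sr)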
Hello @YuanGongND. I am trying to train AST on a dataset, which is very similar to Speech Commands, but:
Hi, thank you for sharing the reproducible code.
I have a few questions about the details of computing the fbanks.
According to the paper, a Hamming window is used, but the following code uses a Hann window. Is the Hann window the one actually used?
https://github.com/YuanGongND/ast/blob/master/src/dataloader.py#L129
All other parameters for computing the fbanks are the defaults, right?
Thanks in advance!
Upon trying to run the ESC-50 recipe, I come across the following error:
formats: can't open input file `/project/ast/data/ESC-50-master/audio/1-31836-A-4.wav': Input/output error
Epoch: [1][100/134] Per Sample Total Time 0.07304 Per Sample Data Time 0.00930 Per Sample DNN Time 0.06374 Train Loss 2.7119
Traceback (most recent call last):
File "../../src/run.py", line 99, in <module>
train(audio_model, train_loader, val_loader, args)
File "/project/ai-audio-classification-models/06-audio-spectrogram-transformer/src/traintest.py", line 100, in train
for i, (audio_input, labels) in enumerate(train_loader):
File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 521, in __next__
data = self._next_data()
File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1203, in _next_data
return self._process_data(data)
File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1229, in _process_data
data.reraise()
File "/opt/conda/lib/python3.7/site-packages/torch/_utils.py", line 434, in reraise
raise exception
RuntimeError: Caught RuntimeError in DataLoader worker process 28.
Original Traceback (most recent call last):
File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/_utils/worker.py", line 287, in _worker_loop
data = fetcher.fetch(index)
File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 49, in fetch
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 49, in <listcomp>
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/project/ai-audio-classification-models/06-audio-spectrogram-transformer/src/dataloader.py", line 180, in __getitem__
File "/project/ai-audio-classification-models/06-audio-spectrogram-transformer/src/dataloader.py", line 101, in _wav2fbank
File "/opt/conda/lib/python3.7/site-packages/torchaudio/backend/sox_io_backend.py", line 153, in load
filepath, frame_offset, num_frames, normalize, channels_first, format)
RuntimeError: Error loading audio file: failed to open file /project/ast/data/ESC-50-master/audio/1-31836-A-4.wav
It's quite unclear to me how this could happen; maybe the sox command fails and the file is therefore not created?
This happens in almost every fold.
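One way to narrow this down (a debugging sketch, not part of the recipe) is to scan the audio directory before training and list every file torchaudio fails to open:

import glob
import torchaudio

bad_files = []
for path in glob.glob("/project/ast/data/ESC-50-master/audio/*.wav"):
    try:
        torchaudio.load(path)
    except RuntimeError:
        bad_files.append(path)

print(f"{len(bad_files)} unreadable files")
for p in bad_files:
    print(p)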
Thank you for this paper, it is very well written and documented. Sorry for the confusing title. I ran the ESC-50 recipe and it worked as expected. This is the accuracy obtained per fold:
0.950
0.983
0.935
0.970
0.943
0.956 (average)
I am trying to use the best model produced to manually classify some audio files (later I want to use the model on my own dataset). This is the code I am running:
torch.cuda.set_device('cuda:0')
device = torch.device("cuda:0")
pretrained_mdl_path = "/home/habashyk/virtualEnvs/ast/egs/esc50/exp/test-esc50-f10-t10-impTrue-aspTrue-b48-lr1e-5/fold2/models/best_optim_state.pth"
sd = torch.load(pretrained_mdl_path, map_location=device)
ast_mdl = ASTModel(label_dim=50,
                   fstride=10,
                   tstride=10,
                   input_fdim=audio_conf_AS["num_mel_bins"],
                   input_tdim=audio_conf_AS["target_length"],
                   imagenet_pretrain=True,
                   model_size='base384')
ast_mdl = torch.nn.DataParallel(ast_mdl)
ast_mdl.load_state_dict(sd, strict=False)
ast_mdl.cuda()
ast_mdl.eval()
Unfortunately, when I use this model, the predictions are never accurate (but have very high probabilities)
Top 3 labels and their associated probabilities for each prediction
THIS IS BATCH 0
Wav 0: Ground truth: dog
Label: Cough Prob: 0.80712890625
Label: Female speech, woman speaking Prob: 0.74560546875
Label: Throat clearing Prob: 0.7314453125
Wav 1: Ground truth: chirping_birds
Label: Child singing Prob: 0.78369140625
Label: Cough Prob: 0.73779296875
Label: Sneeze Prob: 0.720703125
Wav 2: Ground truth: vacuum_cleaner
Label: Narration, monologue Prob: 0.67578125
Label: Children shouting Prob: 0.6611328125
Label: Baby laughter Prob: 0.66015625
I have used the same code with the AudioSet model and its associated .pth weights and it works fine. Any insight on this would be greatly appreciated. Please let me know if there is anything else I can provide.
Also, using audioset_pretrain = True has the same result of high probabilities with incorrect classes.
Thank you!
Hi!
To start with - great work with the model and thanks for sharing!
I already ran it for standard classification cases and it worked as expected. However, now I want to treat the network's outputs as a sequence organized along the time dimension. I have a few points / questions related to that:
Thanks!
Michał
Can I use a different sampling rate, like 22 kHz, for fine-tuning?
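For reference, the released checkpoints were, as far as I understand, trained on 16 kHz audio, so if you keep 22 kHz you would at least need to recompute the normalization stats. A small sketch for resampling with torchaudio instead (the file name is a placeholder):

import torchaudio
import torchaudio.transforms as T

waveform, sr = torchaudio.load("example_22k.wav")
if sr != 16000:
    # Resample to the 16 kHz rate the pretrained models were trained on.
    waveform = T.Resample(orig_freq=sr, new_freq=16000)(waveform)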
Hello,
I trained a base384-sized AST model on my own dataset. There were no errors during training, but when I tried to run inference and load from the checkpoint, this error arose:
RuntimeError: Error(s) in loading state_dict for DataParallel:
size mismatch for module.v.pos_embed: copying a param with shape torch.Size([1, 602, 768]) from checkpoint, the shape in current model is torch.Size([1, 1214, 768]).
What could be causing this error?
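For context, that kind of mismatch usually means the ASTModel created at inference time was built with a different input_tdim / input_fdim (and hence a different number of patches) than the one used for training. A sketch of the idea; label_dim, input_tdim and the checkpoint path are placeholders for whatever you trained with:

import torch
from src.models import ASTModel

# Rebuild the model with the *same* spectrogram geometry used in training,
# then load the checkpoint; the positional-embedding size follows from
# input_fdim/input_tdim and the strides.
model = ASTModel(label_dim=10, fstride=10, tstride=10,
                 input_fdim=128, input_tdim=512,          # must match training
                 imagenet_pretrain=False, audioset_pretrain=False,
                 model_size='base384')
model = torch.nn.DataParallel(model)
sd = torch.load("path/to/your_checkpoint.pth", map_location="cpu")
model.load_state_dict(sd)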
Dear Yuan,
Thank you for creating this SOTA model for audio processing.
I want to run AST on an audio dataset. I have prepared the data in a similar manner to the data prepared for the ESC-50 dataset. I wanted to run the model, but then I noticed that you use dataset-specific mean and std to normalize the dataset. Can you please share the method you used to compute these two statistics?
Regards
Saif
When specifying mixup > 0, the code loads two audio files and, if they are not the same length, tries to bring waveform2 to the same shape as waveform1. There is a minor bug in the code that does this:
if waveform1.shape[1] != waveform2.shape[1]:
    if waveform1.shape[1] > waveform2.shape[1]:
        # padding
        temp_wav = torch.zeros((1, waveform1.shape[1]))
        temp_wav[0, 0:waveform2.shape[1]] = waveform2
        waveform2 = temp_wav
    else:
        # cutting
        waveform2 = waveform2[0, 0:waveform1.shape[1]]
In the above snippet, lines 4, 5, and 9 don't work when the first dimension of the waveforms is greater than 1.
The following minor tweaks should help:
if waveform1.shape[1] != waveform2.shape[1]:
    if waveform1.shape[1] > waveform2.shape[1]:
        # padding
        temp_wav = torch.zeros(waveform1.shape)
        temp_wav[:, 0:waveform2.shape[1]] = waveform2
        waveform2 = temp_wav
    else:
        # cutting
        waveform2 = waveform2[:, 0:waveform1.shape[1]]
Hello Yuan, I sent an email to your address ([email protected]) and hope to discuss AST with you.
Hi, I was going through the inference pipeline and wanted to know if there is a way to replace the Kaldi fbank implementation with the librosa library. I am hoping to run it on my Jetson device, and Kaldi uses the MKL library, which is not suitable for ARM architectures.
I've tried multiple methods, but the results are not the same as Kaldi's fbank implementation. Any help would be appreciated. Thank you.
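For what it's worth, a rough librosa sketch of the same 25 ms / 10 ms log-mel feature (the exact parameters here are assumptions, and it will NOT match torchaudio.compliance.kaldi.fbank bit-for-bit because pre-emphasis, windowing details and mel edge handling differ):

import numpy as np
import librosa

def librosa_fbank(wav_path, num_mel_bins=128, sr=16000):
    # Approximate log-mel filterbank features: 25 ms window (400 samples
    # at 16 kHz), 10 ms hop (160 samples), HTK-style mel filters.
    y, sr = librosa.load(wav_path, sr=sr)
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=400, hop_length=160, win_length=400,
        window="hann", n_mels=num_mel_bins, htk=True, power=2.0)
    log_mel = np.log(np.maximum(mel, 1e-10))
    return log_mel.T  # (frames, num_mel_bins), same layout as kaldi.fbank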
Hi YuanGongND, can you share the ImageNet pretrained model URL?
Hello, I didn't find the transformer or attention layers in the ASTModel. Can you help me point them out?
Hello, Dr. Yuan.
Thank you for your great work and sorry for my very elementary question, I'm very new to audio classification.
My inference script outputs a random result (the output changes at every execution). Could you tell me what is wrong?
I checked #19 and added fbank = (fbank + 4.26) / (4.57 * 2)
but the result does not change.
This is my Colab page and I added you as an editor (if the runtime times out, cloning and pip install are needed, which take about 10 min).
source:
############ Load
import librosa.display
import os
import scipy
import numpy as np
import matplotlib.pyplot as plt
import torchaudio
import torch
import IPython.display as ipd
sample_freq = 16000
# Load fragment from 70s to 80s
filename = "/content/zzNdwF40ID8_short.wav"
y, sr = librosa.load(filename, sr=sample_freq, offset=70.0, duration=10.0)
print(f"Input sound shape is {y.shape}, {sr} Hz")
librosa.display.waveplot(y=y, sr=sr)
ipd.Audio(y, rate=sr, autoplay=True)
################# Show
# n_mels is number of Mel bands to generate
n_mels=128
interval = 10e-3    # 10 ms frame shift, from https://arxiv.org/pdf/2104.01778.pdf
win_length = 25e-3  # 25 ms window, from https://arxiv.org/pdf/2104.01778.pdf (not used below)
# # hop_length is number of samples between successive frames.
hop_length=int(sample_freq * interval)
### generate fbank https://github.com/YuanGongND/ast/blob/102f0477099f83e04f6f2b30a498464b78bbaf46/src/dataloader.py#L123
waveform = torch.from_numpy( y.reshape(1, -1).astype(np.float32) ).clone().cpu() # to torch
fbank = torchaudio.compliance.kaldi.fbank(waveform, htk_compat=True, sample_frequency=sr, use_energy=False,
                                          window_type='hanning', num_mel_bins=n_mels, dither=0.0, frame_shift=10)
# normalize with dataset mean and std from https://github.com/YuanGongND/ast#use-pretrained-model-for-downstream-tasks
fbank = (fbank + 4.26) / (4.57 * 2)
# align to target_length
target_length = int((y.size/sr)/interval)
n_frames = fbank.shape[0]
p = target_length - n_frames
# cut and pad
if p > 0:
    m = torch.nn.ZeroPad2d((0, 0, 0, p))
    fbank = m(fbank)
elif p < 0:
    fbank = fbank[0:target_length, :]
plt.figure(figsize=(12, 4))
librosa.display.specshow(data=fbank.transpose(1, 0).to('cpu').detach().numpy().copy(), sr=sr, hop_length=hop_length, x_axis='time', y_axis='mel')
plt.colorbar(format='%+2.0f dB')
plt.title('Mel spectrogram')
plt.tight_layout()
# reshape
fbank = fbank.reshape(1, -1, 128)
# print
print(f"fbank shape is {fbank.shape}, mean: {fbank.mean()} std:{fbank.std()}")
######################### Infer
import os
import torch
import sys
import csv
sys.path.append(os.path.join('./ast/src'))
import models
# download pretrained model in this directory
os.environ['TORCH_HOME'] = './ast/pretrained_models'
# assume the task has 527 classes
label_dim = 527
# create a input
test_input = fbank
input_tdim = fbank.shape[1]
print(f"Input size: {test_input.shape}\n", test_input.device)
# create an AST model and infer
ast_mdl = models.ASTModel(label_dim=label_dim, input_tdim=input_tdim, imagenet_pretrain=True, audioset_pretrain=True)
ast_mdl.eval()
with torch.no_grad():
    test_output = ast_mdl.forward(test_input)
test_output = torch.sigmoid(test_output)
# output should be in shape [1, 527], i.e., 1 sample, each with prediction of 527 classes.
print(f"\noutput shape is {test_output.shape}, argmax is {test_output.argmax(axis=1)}")
# open labels
if not "labels" in vars(sys.modules[__name__]):
    with open('audioset_label.txt') as f:
        reader = csv.reader(f)
        labels = [row[0] for row in reader]
# argmax
result_output = test_output.data.cpu().numpy()[0]
sorted_indexes = np.argsort(result_output)[::-1]
# Print audio tagging top probabilities
print("\nTop probabilities. Should Music, Sonar\n-------")
for k in range(10):
    print('{}: {:.4f}'.format(np.array(labels)[sorted_indexes[k]],
                              result_output[sorted_indexes[k]]))
Hey,
I'm pretty new to working with audio data for classification, so could you give some insight into the parameters and stats mentioned in steps 2-4 of the "Use Pretrained Model For Downstream Tasks" section? Specifically, a bit more clarification on getting the normalization stats, and on how the parameters in step 2 (SpecAug and mixup rate) and step 4 need to be changed for different kinds of input, or how they affect the model.
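On the SpecAug side, freqm and timem are, as far as I can tell, the maximum widths of the frequency and time masks applied to the fbank. A sketch of the corresponding torchaudio transforms, purely for illustration (the fbank here is a random stand-in, and the values are the ones from the ESC-50 command above):

import torch
import torchaudio.transforms as T

freqm, timem = 24, 96            # example values from the ESC-50 recipe
fbank = torch.randn(512, 128)    # stand-in for a (time, n_mels) fbank

# The masking transforms expect a (channel, freq, time) spectrogram,
# so transpose first, mask, then transpose back.
spec = fbank.transpose(0, 1).unsqueeze(0)
spec = T.FrequencyMasking(freq_mask_param=freqm)(spec)
spec = T.TimeMasking(time_mask_param=timem)(spec)
fbank = spec.squeeze(0).transpose(0, 1)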
Hi Yuan,
In the AST paper, in the ablation experiment comparing different interpolation methods, one of the items is called "Reinitialize". How is this reflected in the code?
Best Regards.
Hi!
First of all, I would like to thank you for sharing your amazing work with everyone! It is truly inspiring and fascinating.
I have a question regarding the difference between the training loss and the validation loss. It seems that the validation loss is much higher than the training loss; does that make sense? Isn't it overfitting?
I also tried to fine-tune the AudioSet-trained model on my data and it showed the same difference (with and without augmentations).
Here is an example from the logs: test-full-f10-t10-pTrue-b12-lr1e-5/log_2090852.txt:
train_loss: 0.011128
valid_loss: 0.693989
I'm still new to deep learning so maybe I'm missing something.
Thank you!
Hi Yuan,
Thank you for posting your project and providing ample information about its elements!
I am running the ESC-50 recipe and I've been struggling to output results. Could you point me towards where the result.csv files get created in the scripts? Moreover, do you know how I could pull out labels for the sound files from the results of that recipe? I am trying to use this recipe for avian call recognition and am struggling with gathering the results.
Thank you, I appreciate any insight you can offer.
Hi, nice work!
Was wondering whether there exists a parameter for specifying the number of GPUs to use for training?
Hi, thanks for this great resource.
I noticed a potential typo in the wrapper script for the audioset pipeline: Lines 26 and 31 may need to be swapped
Lines 24 to 35 in e038086
Hi,
I am trying to use the pre-trained model on my own dataset and in my own pipeline.
As recommended, I am using audioset_pretrain=True and imagenet_pretrain=True.
In the code I noticed that we call ASTModel again, which looked to me like it would result in an infinite loop (line 129 in ast_models.py).
Below is the snippet that I am referring to.
Is this a bug or an oversight on my part? Can you please take a look?
I am really looking forward to trying AST in my pipeline.
# now load a model that is pretrained on both ImageNet and AudioSet
elif audioset_pretrain == True:
    if audioset_pretrain == True and imagenet_pretrain == False:
        raise ValueError('currently model pretrained on only audioset is not supported, please set imagenet_pretrain = True to use audioset pretrained model.')
    if model_size != 'base384':
        raise ValueError('currently only has base384 AudioSet pretrained model.')
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    if os.path.exists('../../pretrained_models/audioset_10_10_0.4593.pth') == False:
        # this model performs 0.4593 mAP on the audioset eval set
        audioset_mdl_url = 'https://www.dropbox.com/s/cv4knew8mvbrnvq/audioset_0.4593.pth?dl=1'
        wget.download(audioset_mdl_url, out='../../pretrained_models/audioset_10_10_0.4593.pth')
    sd = torch.load('../../pretrained_models/audioset_10_10_0.4593.pth', map_location=device)
    audio_model = ASTModel(label_dim=527, fstride=10, tstride=10, input_fdim=128, input_tdim=1024, imagenet_pretrain=False, audioset_pretrain=False, model_size='base384', verbose=False)  # <- the ASTModel call referred to above
Thanks in advance for your time.
Hi Yuan Gong, Thank you for sharing your work. It is clear and easy to run.
I am wondering about the ImageNet classifier weights; they still exist in the AudioSet pretrained model.
Do you train them?
Here is the last displayed part of the pretrained "audioset_10_10_0.4593.pth":
module.v.head.weight        torch.Size([1000, 768])
module.v.head.bias          torch.Size([1000])
module.v.head_dist.weight   torch.Size([1000, 768])
module.v.head_dist.bias     torch.Size([1000])
module.mlp_head.0.weight    torch.Size([768])
module.mlp_head.0.bias      torch.Size([768])
module.mlp_head.1.weight    torch.Size([527, 768])
module.mlp_head.1.bias      torch.Size([527])
They can be skipped by
self.v.head = nn.Identity()
self.v.head_dist = nn.Identity()
Now, I want to use the pretrained AudioSet model for another task, but I am worried that eliminating this part will affect the performance, although I think these heads are not connected to the final AudioSet classifier of 527 classes.
Thank you again
Hi Yuan,
I have written a pretty simple script to verify the tags of a single wave file, but the result does not seem right. Could you help point out the mistake?
import os
import sys
import csv
import numpy as np
import torch
import torchaudio
from src.models import ASTModel
torchaudio.set_audio_backend("soundfile") # switch backend
basepath = os.path.dirname(os.path.dirname(sys.path[0]))
sys.path.append(basepath)
# download pretrained model in this directory
os.environ['TORCH_HOME'] = '../pretrained_models'
def make_features(wav_name, mel_bins, target_length=1024):
    waveform, sr = torchaudio.load(wav_name)
    fbank = torchaudio.compliance.kaldi.fbank(
        waveform, htk_compat=True, sample_frequency=sr, use_energy=False,
        window_type='hanning', num_mel_bins=mel_bins, dither=0.0,
        frame_shift=10)
    n_frames = fbank.shape[0]
    p = target_length - n_frames
    # cut and pad
    if p > 0:
        m = torch.nn.ZeroPad2d((0, 0, 0, p))
        fbank = m(fbank)
    elif p < 0:
        fbank = fbank[0:target_length, :]
    return fbank

def load_label(label_csv):
    # Load label
    with open(label_csv, 'r') as f:
        reader = csv.reader(f, delimiter=',')
        lines = list(reader)
    labels = []
    ids = []  # Each label has a unique id such as "/m/068hy"
    for i1 in range(1, len(lines)):
        id = lines[i1][1]
        label = lines[i1][2]
        ids.append(id)
        labels.append(label)
    return labels

if __name__ == '__main__':
    label_csv = './ast/egs/audioset/data/class_labels_indices.csv'
    # 1. make feature for predict
    wav_name = './ast/egs/audioset/data/0OxlgIitVig.wav'
    feats = make_features(wav_name, mel_bins=128)  # shape(1024, 128)
    # assume each input spectrogram has 100 time frames
    input_tdim = feats.shape[0]
    # 2. load the best model and the weights
    checkpoint_path = './ast/pretrained_models/audioset_10_10_0.4593.pth'
    ast_mdl = ASTModel(label_dim=527, input_tdim=input_tdim, imagenet_pretrain=False, audioset_pretrain=False)
    print(f'[*INFO] load checkpoint: {checkpoint_path}')
    checkpoint = torch.load(checkpoint_path, map_location='cuda')
    audio_model = torch.nn.DataParallel(ast_mdl, device_ids=[0])
    audio_model.load_state_dict(checkpoint)
    audio_model = audio_model.to(torch.device("cuda:0"))
    # 3. feed the data feature to model
    feats_data = feats.expand(1, input_tdim, 128)  # reshape the feature
    audio_model.eval()  # set the eval model
    with torch.no_grad():
        output = audio_model.forward(feats_data)
        output = torch.sigmoid(output)
    result_output = output.data.cpu().numpy()[0]
    # 4. map the post-prob to label
    labels = load_label(label_csv)
    sorted_indexes = np.argsort(result_output)[::-1]
    # Print audio tagging top probabilities
    for k in range(10):
        print('{}: {:.4f}'.format(np.array(labels)[sorted_indexes[k]],
                                  result_output[sorted_indexes[k]]))
    # output should be in shape [10, 527], i.e., 10 samples, each with prediction of 527 classes.
    # print(result_output.shape)
and the output:
Speech: 0.1906
Music: 0.0481
Inside, small room: 0.0245
Musical instrument: 0.0100
Silence: 0.0088
Sound effect: 0.0074
Outside, rural or natural: 0.0064
Animal: 0.0058
Outside, urban or manmade: 0.0045
Inside, large room or hall: 0.0041
This error only occurs when using the AudioSet pretrained model; it does not occur when using only the ImageNet pretrained model. Audio is resampled to 16 kHz. The error occurs in src/models/ast_models.py: since t_dim > 101, the else block on line 139 is triggered.
Traceback (most recent call last):
File "train.py", line 73, in <module>
model = VTN(**vars(cfg))
[REDACTED - model call internally]
File "/[REDACTED]/ast_models.py", line 141, in __init__
new_pos_embed = new_pos_embed.reshape(1, 768, num_patches).transpose(1, 2)
RuntimeError: shape '[1, 768, 120]' is invalid for input of size 221184
Parameters to the ASTModel instantiation:
label_dim: 400
input_tdim: 251
input_fdim: 64
audioset_pretrain: True
Hi,
Can you provide code for the model architecture in Figure 2?
Hi, there:
Thank you for open sourcing this piece of implementation!
It is very inspiring to see timm works in the audio settings.
Q: I tried the pipeline with a smaller feature size, e.g. 64x400, ended up with 39x5 patches, and AST was stuck at 0.01 mAP.
Upsampling to your feature size of 128x1024 brought it up to 0.10 mAP. I guess your intuition is to "take advantage of" the 384x384 positional embedding (originally 576 patches), so 1212 patches would be roughly 2x the 576 patches. I am still curious whether there is a way to do this with a smaller feature dimension.
First of all, thanks for this wonderful repo! I am just curious whether it is possible to convert the mel input back to wav again. I am trying out a model that uses the same concept as yours as a transformer decoder input, but am just not sure if the predicted output (also in mel form) can be converted back to a waveform. Thank you very much in advance!
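Not something the repo provides, but as a rough sanity check a log-mel spectrogram can be approximately inverted with Griffin-Lim; the result is lossy (phase is estimated), and the sample rate and STFT parameters below are assumptions matching the 16 kHz / 25 ms / 10 ms fbank setup:

import numpy as np
import librosa

# log_mel: a (n_mels, frames) log-mel spectrogram, e.g. a model output
# transposed from (frames, n_mels); undo the log before inverting.
log_mel = np.random.randn(128, 1024)   # stand-in for a real log-mel
mel = np.exp(log_mel)
wav = librosa.feature.inverse.mel_to_audio(
    mel, sr=16000, n_fft=400, hop_length=160, window="hann", power=2.0)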
You have mentioned that if we want to use your pre-trained model, we need to take care of the input normalization. In your code, I observed that you have manually added the mean and std for each of the datasets you used. How are we supposed to calculate the mean and std of our own dataset? Do we calculate it after computing the fbank for each audio signal, or is it calculated from the raw audio waveform? It would be great if you could provide some clarity on this.
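Not an official answer, but my understanding is that the stats are computed over the fbank features (the same features the dataloader produces, without SpecAug or mixup), not over the raw waveforms. A sketch:

import torch
import torchaudio

def dataset_fbank_stats(wav_paths, num_mel_bins=128):
    # Accumulate mean/std of the log-mel fbanks over the whole training set,
    # using the same feature extraction as the dataloader.
    feats = []
    for path in wav_paths:
        waveform, sr = torchaudio.load(path)
        fbank = torchaudio.compliance.kaldi.fbank(
            waveform, htk_compat=True, sample_frequency=sr, use_energy=False,
            window_type='hanning', num_mel_bins=num_mel_bins, dither=0.0,
            frame_shift=10)
        feats.append(fbank)
    feats = torch.cat(feats, dim=0)
    return feats.mean().item(), feats.std().item()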
Hi Yuan,
I downloaded the model and tried to test it with the ESC-50 data. I tried to run run_esc.sh but got a "no such file" error. I downloaded master.zip, unzipped it, and put it in ./data/ESC-50-master/. I checked the run and prep scripts and haven't found any code that makes the directory or downloads files for ./data/datafiles/.
Is it a file or data I am supposed to download, or is the file automatically generated?
Ningkun
I have a case where I want to load a model that I recently trained with target_length=512
and a 48 kHz sample rate, using the following code:
sd = torch.load(model_path, map_location="cuda")
audio_model = ASTModel(label_dim=84, fstride=10, tstride=10, input_fdim=128, input_tdim=512, imagenet_pretrain=False, audioset_pretrain=False, model_size='base384', verbose=False)
audio_model = torch.nn.DataParallel(audio_model)
audio_model.load_state_dict(sd, strict=False)
From load_state_dict I get that all keys matched successfully, but the evaluation produces a seemingly random mean average precision value that doesn't match the value seen during training.
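One thing worth checking (a debugging sketch continuing from the snippet above, not an official answer): strict=False silently skips any key that does not match, and the corresponding layers stay randomly initialized, so it helps to print what load_state_dict returns or to load with strict=True:

# Continuing from the snippet above: surface any silently skipped keys.
result = audio_model.load_state_dict(sd, strict=False)
print("missing keys:   ", result.missing_keys)
print("unexpected keys:", result.unexpected_keys)

# Or fail loudly instead of falling back to randomly initialized layers:
audio_model.load_state_dict(sd, strict=True)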
Hi Yuan,
Thank you for this great work! I am currently fine-tuning the models you produced for a project I am working on and really appreciate the opportunity you created for me. I have a question regarding the spectrograms (or fbanks) produced by the wav2fbank function.
Currently, I am trying to prepare a dataset to match the requirements of the model, but have stumbled upon something that grabs my attention: you have mentioned in the paper that the model accepts variable-length inputs. Taking a closer look, I have found that this is due to the padding added below the fbank; this is done to fix the input dimensions into the model. However, when I applied this to my own data, I saw that the padding was of different colors depending on the image when I converted them. Here are two examples:
[two example spectrogram images]
Although I am aware that the values of the solid coloured areas are zeros, I worry that this is indicative of the same colour being attributed to a different value in different spectrograms, and of how that would impact the model's understanding of colour.
My second question is regarding the use of padding specifically. In the ViT paper as well as AST, images are fed through as a collection of patches for learning. Any patches that are fully blank naturally would not be adding much information to the model. However, for the patches that contain both fbank content and padding, is there no effect on learning there? Also, if a specific category is relatively shorter in length than another, does the model include that audio file length in its representation of that class?
Any insight on the above would be deeply appreciated.
Thanks again
The paper https://arxiv.org/pdf/2012.12877v2.pdf says: "We therefore cut the first dimension and interpolate the second dimension of the 24 × 24 ViT positional embedding to 12 × 100 and use it as the positional embedding for the AST."
Does "cut" mean taking the first 12 rows? In my understanding, nn.functional.interpolate always interpolates.
import torch
from torch import nn

h, w = 4, 3
pos_embed = torch.randn((1, 1, h, w))
a = nn.functional.interpolate(pos_embed, scale_factor=(2/h, 3/w), mode='bilinear')
print("position embedding:\n", pos_embed)
print("{},{}->{},{}:\n".format(h, w, 2, 3), a)
---------------------------------------------
position embedding:
tensor([[[[-0.5638, 0.0127, -2.4190],
[ 0.2434, 0.3804, -0.2128],
[ 0.2813, -0.7966, -0.3580],
[-1.2754, -0.2837, 1.6149]]]])
4,3->2,3:
tensor([[[[-0.1602, 0.1966, -1.3159],
[-0.4971, -0.5402, 0.6284]]]])
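For illustration only (this is my reading of the sentence quoted above, and the slice offset is an assumption, not necessarily the exact slicing the repo uses): "cut" refers to slicing one dimension of the 2-D positional-embedding grid rather than interpolating it, while nn.functional.interpolate is only used for the dimension that grows:

import torch
from torch import nn

# 24 x 24 grid of 768-dim ViT positional embeddings (ViT-384 with 16x16 patches).
pos_embed = torch.randn(1, 768, 24, 24)

# "Cut": keep 12 of the 24 rows in the frequency dimension (shown here as a
# centered slice), then interpolate only the time dimension from 24 to 100.
cut = pos_embed[:, :, 6:18, :]
new_pos_embed = nn.functional.interpolate(cut, size=(12, 100), mode='bilinear')
print(new_pos_embed.shape)  # torch.Size([1, 768, 12, 100])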
Hello @YuanGongND, I'm sorry to bother you again.
I would like to ask a question: how can I change the kernel size to change the number of patches? I USE the ImageNet pretrained model and do NOT USE the AudioSet pretrained model, but I have this problem.
x1 torch.Size([64, 149, 768])
self.v.pos_embed torch.Size([1, 202, 768])
Traceback (most recent call last):
File "train.py", line 394, in <module>
main()
File "train.py", line 161, in main
train_loss,train_acc = train(train_loader, model, criterion, optimizer, args.use_cuda, epoch)
File "train.py", line 296, in train
output = model(inputs)
File "/root/miniconda3/envs/deepAST/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/root/miniconda3/envs/deepAST/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 159, in forward
return self.module(*inputs[0], **kwargs[0])
File "/root/miniconda3/envs/deepAST/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/root/miniconda3/envs/deepAST/lib/python3.7/site-packages/torch/cuda/amp/autocast_mode.py", line 135, in decorate_autocast
return func(*args, **kwargs)
File "/data/source/deepAST_exp/model/ASTConcat.py", line 180, in forward
x1 = x1 + self.v.pos_embed
RuntimeError: The size of tensor a (149) must match the size of tensor b (202) at non-singleton dimension 1
I only changed the get_shape function, like this:
def get_shape(self, fstride, tstride, input_fdim=128, input_tdim=1024, kernel_size=(8,8)):
    test_input = torch.randn(1, 1, input_fdim, input_tdim)
    test_proj = nn.Conv2d(1, self.original_embedding_dim, kernel_size=kernel_size, stride=(fstride, tstride))
    test_out = test_proj(test_input)
    f_dim = test_out.shape[2]
    t_dim = test_out.shape[3]
    return f_dim, t_dim
So, what is the correct way to do this? Looking forward to your answer.
Hi, thank you for your great work!
I have a question regarding the different values of the 'freqm' parameter. When computing the normalization stats (mean and std), the value is 24, but during model training it is 48. Why are the values different in these two processes?
Hello Yuan,
Thank you for your excellent code. In your paper you mentioned that the AST model can support variable-length inputs, but I noticed that the following part of the code doesn't seem to support variable-length input:
Line 188 in 6f4e193
So how can the above problem be solved?
--obsidian
Hi, I've been using your model for classification and audio analysis and it works great.
I have trained my own model and was wondering if there's a way to test it in real time with a microphone rather than an audio file; if you could provide a way forward, it would be great.
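Not part of the repo, but a rough sketch of the idea using the sounddevice package (the package choice, chunk length, label_dim, target length, normalization stats and checkpoint path are all assumptions/placeholders; the feature code mirrors the fbank snippets in the issues above):

import sounddevice as sd
import torch
import torchaudio
from src.models import ASTModel

SR = 16000
CHUNK_SECONDS = 5                 # arbitrary window length
TARGET_LENGTH = 512               # must match the input_tdim your model was built with
NORM_MEAN, NORM_STD = 0.0, 1.0    # replace with your dataset's fbank stats

model = ASTModel(label_dim=50, input_fdim=128, input_tdim=TARGET_LENGTH,
                 imagenet_pretrain=False, audioset_pretrain=False)
# model.load_state_dict(torch.load("path/to/your_checkpoint.pth", map_location="cpu"))
model.eval()

def record_chunk():
    # Blocking capture of one mono chunk from the default microphone.
    audio = sd.rec(int(CHUNK_SECONDS * SR), samplerate=SR, channels=1, dtype="float32")
    sd.wait()
    return torch.from_numpy(audio).T   # shape (1, samples)

while True:
    waveform = record_chunk()
    fbank = torchaudio.compliance.kaldi.fbank(
        waveform, htk_compat=True, sample_frequency=SR, use_energy=False,
        window_type='hanning', num_mel_bins=128, dither=0.0, frame_shift=10)
    # Pad or cut to the model's expected frame count, then normalize.
    p = TARGET_LENGTH - fbank.shape[0]
    fbank = torch.nn.ZeroPad2d((0, 0, 0, p))(fbank) if p > 0 else fbank[:TARGET_LENGTH]
    fbank = (fbank - NORM_MEAN) / (NORM_STD * 2)
    with torch.no_grad():
        probs = torch.sigmoid(model(fbank.unsqueeze(0)))
    print(probs.topk(3))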