This is a demo implementation of BYOL for Audio (BYOL-A), a self-supervised learning method for general-purpose audio representation. It includes:
- Training code that can train models with arbitrary audio files.
- Evaluation code that can evaluate trained models with downstream tasks.
- Pretrained weights.
UPDATE (Dec. 2022): The v2 paper is now published in TASLP! We have updated the BibTeX accordingly.
UPDATE (Nov. 2022): New model definitions (AudioNTT2020X, AudioNTT2020Task6X) are ready. These make all layer features accessible so that a weighted sum of layer features can be used in SUPERB.
UPDATE (May 2022): We have two papers for BYOL-A.
If you find BYOL-A useful in your research, please use either of the following BibTeX entries for citation.
The former is the first paper, from IJCNN 2021 (link to IEEE Xplore), and the latter is the extended version published in TASLP (link to arXiv).
```bibtex
@inproceedings{niizumi2021byol-a,
    title={BYOL for Audio: Self-Supervised Learning for General-Purpose Audio Representation},
    author={Daisuke Niizumi and Daiki Takeuchi and Yasunori Ohishi and Noboru Harada and Kunio Kashino},
    booktitle={2021 International Joint Conference on Neural Networks (IJCNN)},
    publisher={IEEE},
    doi={10.1109/ijcnn52387.2021.9534474},
    url={http://dx.doi.org/10.1109/IJCNN52387.2021.9534474},
    year={2021},
    month={Jul}
}
```
```bibtex
@article{niizumi2023byol-a,
    title={{BYOL for Audio}: Exploring Pre-trained General-purpose Audio Representations},
    author={Niizumi, Daisuke and Takeuchi, Daiki and Ohishi, Yasunori and Harada, Noboru and Kashino, Kunio},
    journal={IEEE/ACM Transactions on Audio, Speech, and Language Processing},
    publisher={Institute of Electrical and Electronics Engineers (IEEE)},
    year={2023},
    volume={31},
    pages={137--151},
    doi={10.1109/TASLP.2022.3221007},
    url={http://dx.doi.org/10.1109/TASLP.2022.3221007},
    issn={2329-9304}
}
```
We added an augmentation block and updated the network architecture in the 2022 version (TASLP 2023):
- We introduced an extra augmentation block, Random Linear Fader (a minimal sketch appears after the notes below).
- We reduced the number of convolutional blocks from three to two and added a skip connection via a new Concat block.
- For IJCNN2021, the code has not been changed; please find the details in this README.
- For TASLP2023 (the 2022 version), 👉 please find the code in the v2 folder.
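For intuition, here is a minimal hedged sketch of the Random Linear Fader idea mentioned above. It is an illustration only, not the v2 implementation: it adds an offset that changes linearly over time frames to a log-mel spectrogram, which corresponds to a slowly rising or falling volume.

```python
import torch
import torch.nn as nn

class RandomLinearFaderSketch(nn.Module):
    """Illustrative sketch (assumed names/shapes), not the v2 code."""
    def __init__(self, gain=1.0):
        super().__init__()
        self.gain = gain

    def forward(self, lms):  # lms: (..., freq_bins, time_frames), log-mel spectrogram
        # Pick random start/end levels in [-gain, gain] and fade linearly between them.
        head, tail = (self.gain * (2 * torch.rand(2) - 1)).tolist()
        fade = torch.linspace(head, tail, lms.shape[-1], device=lms.device)
        return lms + fade  # broadcast over the frequency axis
```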
Download external source files and apply a patch. Our implementation uses the following:
- BYOL implementation: https://github.com/lucidrains/byol-pytorch/blob/master/byol_pytorch/byol_pytorch.py
- MLPClassifier for PyTorch: https://github.com/daisukelab/general-learning/blob/master/MLP/torch_mlp_clf.py
```sh
curl -O https://raw.githubusercontent.com/lucidrains/byol-pytorch/2aa84ee18fafecaf35637da4657f92619e83876d/byol_pytorch/byol_pytorch.py
patch < byol_a/byol_pytorch.diff
mv byol_pytorch.py byol_a
curl -O https://raw.githubusercontent.com/daisukelab/general-learning/7b31d31637d73e1a74aec3930793bd5175b64126/MLP/torch_mlp_clf.py
mv torch_mlp_clf.py utils
```
Install PyTorch 1.7.1, torchaudio, and the other dependencies listed in requirements.txt.
The following steps perform a downstream task evaluation in a linear-probe fashion. This is an example with SPCV2 (Speech Commands dataset v2).
Preprocess the metadata (.csv file) and audio files; the processed files will be stored under the folder `work`.

```sh
# usage: python -m utils.preprocess_ds <downstream task> <path to its dataset>
python -m utils.preprocess_ds spcv2 /path/to/speech_commands_v0.02
```
Run the evaluation. This will first convert all .wav audio to representation embeddings, train a linear-layer network, then calculate the accuracy as the result.
```sh
python evaluate.py pretrained_weights/AudioNTT2020-BYOLA-64x96d2048.pth spcv2
```
You can also run an evaluation multiple times and take the average result. The following evaluates on UrbanSound8K with a unit audio duration of 4.0 seconds, repeated 10 times.
```sh
# usage: python evaluate.py <your weight> <downstream task> <unit duration sec.> <# of iterations>
python evaluate.py pretrained_weights/AudioNTT2020-BYOLA-64x96d2048.pth us8k 4.0 10
```
Similarly, the following evaluates on NSynth (4.0 seconds long) 10 times.
```sh
python evaluate.py pretrained_weights/AudioNTT2020-BYOLA-64x96d2048.pth nsynth 4.0 10
```
This is an example of calculating a feature vector for an audio sample.
```python
from byol_a.common import *
from byol_a.augmentations import PrecomputedNorm
from byol_a.models import AudioNTT2020

device = torch.device('cuda')
cfg = load_yaml_config('config.yaml')
print(cfg)

# ** Prepare the statistics in advance **
# You need to calculate the statistics of mean and standard deviation of
# the log-mel spectrogram of your dataset.
# See calc_norm_stats in evaluate.py for your reference.
stats = [-5.4919195, 5.0389895]

# Preprocessor and normalizer.
to_melspec = torchaudio.transforms.MelSpectrogram(
    sample_rate=cfg.sample_rate,
    n_fft=cfg.n_fft,
    win_length=cfg.win_length,
    hop_length=cfg.hop_length,
    n_mels=cfg.n_mels,
    f_min=cfg.f_min,
    f_max=cfg.f_max,
)
normalizer = PrecomputedNorm(stats)

# Load pretrained weights.
model = AudioNTT2020(d=cfg.feature_d)
model.load_weight('pretrained_weights/AudioNTT2020-BYOLA-64x96d2048.pth', device)

# Load your audio file.
wav, sr = torchaudio.load('work/16k/spcv2/one/00176480_nohash_0.wav')  # a sample from SPCV2 for now
assert sr == cfg.sample_rate, "Let's convert the audio sampling rate in advance, or do it here online."

# Convert to a log-mel spectrogram, then normalize.
lms = normalizer((to_melspec(wav) + torch.finfo(torch.float).eps).log())

# Now, convert the audio to the representation.
features = model(lms.unsqueeze(0))
```
You can also train models. The following is an example of training on FSD50K.
Convert all samples to 16 kHz. This will convert all FSD50K files into the folder `work/16k/fsd50k` while preserving the folder structure.

```sh
python -m utils.convert_wav /path/to/fsd50k work/16k/fsd50k
```
Start training. This example trains with all development-set audio samples from FSD50K.

```sh
python train.py work/16k/fsd50k/FSD50K.dev_audio
```
Refer to Table VI in our paper for the performance of a model trained on FSD50K.
We include 3 pretrained weights of our encoder network.
| Method | Dim. | Filename | NSynth | US8K | VoxCeleb1 | VoxForge | SPCV2/12 | SPCV2 | Average |
|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|
| BYOL-A | 512-d | AudioNTT2020-BYOLA-64x96d512.pth | 69.1% | 78.2% | 33.4% | 83.5% | 86.5% | 88.9% | 73.3% |
| BYOL-A | 1024-d | AudioNTT2020-BYOLA-64x96d1024.pth | 72.7% | 78.2% | 38.0% | 88.5% | 90.1% | 91.4% | 76.5% |
| BYOL-A | 2048-d | AudioNTT2020-BYOLA-64x96d2048.pth | 74.1% | 79.1% | 40.1% | 90.2% | 91.0% | 92.2% | 77.8% |

This implementation is provided for your evaluation of the BYOL-A paper; see LICENSE for details.
BYOL-A is built on top of byol-pytorch, a BYOL implementation by Phil Wang (@lucidrains). We thank Phil for open-sourcing this sophisticated code.
```bibtex
@misc{wang2020byol-pytorch,
    author = {Phil Wang},
    title = {Bootstrap Your Own Latent (BYOL), in Pytorch},
    howpublished = {\url{https://github.com/lucidrains/byol-pytorch}},
    year = {2020}
}
```
- BYOL: J.-B. Grill and F. Strub and F. Altché and C. Tallec and P. H. Richemond and E. Buchatskaya and C. Doersch and B. A. Pires and Z. D. Guo and M. G. Azar and B. Piot and K. Kavukcuoglu and R. Munos and M. Valko, "Bootstrap Your Own Latent - A New Approach to Self-Supervised Learning," 2020
- BYOL-A: Daisuke Niizumi, Daiki Takeuchi, Yasunori Ohishi, Noboru Harada, and Kunio Kashino "BYOL for Audio: Self-Supervised Learning for General-Purpose Audio Representation," 2021
- FSD50K: Eduardo Fonseca and Xavier Favory and Jordi Pons and Frederic Font and Xavier Serra, “FSD50K: an Open Dataset of Human-Labeled Sound Events,” 2020.
- NSynth: Jesse Engel and Cinjon Resnick and Adam Roberts and Sander Dieleman and Mohammad Norouzi and Douglas Eck and Karen Simonyan, "Neural Audio Synthesis of Musical Notes with WaveNet Autoencoders," 2017
- US8K: Justin Salamon and Christopher Jacoby, and Juan Pablo Bello, "A Dataset and Taxonomy for Urban Sound Research," 2014
- SPCV2: Pete Warden, "Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition," 2018
- VoxCeleb1: Arsha Nagrani and Joon Son Chung and Andrew Zisserman, "VoxCeleb: A Large-Scale Speaker Identification Dataset," 2017
- VoxForge: K. MacLean, "VoxForge," 2018
byol-a's Issues
Question about comments in the train.py
https://github.com/nttcslab/byol-a/blob/master/train.py
At line 67, there is a comment about the shape of the input:

```python
# in fact, it should be (B, 1, F, T), e.g. (256, 1, 64, 96) where 64 is the number of mel bins
paired_inputs = torch.cat(paired_inputs)  # [(B,1,T,F), (B,1,T,F)] -> (2*B,1,T,F)
```

However, it is different from the description in the config.yaml file:

```yaml
# Shape of log-mel spectrogram [F, T].
shape: [64, 96]
```
Performing evaluation with only a small part of the spectrogram
Hi
Thank you for your contribution. It's really interesting work. However, I have one question regarding the downstream evaluation.
In the paper, you mentioned that "A segment of shape FxT was randomly cropped from each audio clip and encoded for linear evaluation in the downstream tasks." However, as far as I know, this procedure was not adopted in previous works. Have you tried an experiment where the complete log-mel spectrogram (without random cropping) is fed to the network during the evaluation stage? Is there any performance difference?
Thanks
Missing scaling of validation samples in evaluate.py
https://github.com/nttcslab/byol-a/blob/master/evaluate.py#L112
It also needs `X_val = scaler.transform(X_val)`, or the validation accuracy and loss will be invalid.
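For context, here is a minimal hedged sketch of the intended usage, assuming a scikit-learn StandardScaler-style scaler like the one in evaluate.py (variable names are illustrative):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Dummy embeddings standing in for the pre-computed representations.
X_train, X_val, X_test = (np.random.randn(8, 2048) for _ in range(3))

# Fit the scaler on the training embeddings only, then apply the same
# transform to every split so train/val/test share the same statistics.
scaler = StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_val = scaler.transform(X_val)   # the reportedly missing line
X_test = scaler.transform(X_test)
```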
This can be one of the reasons why we see lower performance when trying to reproduce the official results.

missing byol_pytorch.py
In the byol_a folder there is byol_pytorch.diff instead of byol_pytorch.py.
A basic question: torch.randn(): argument 'size' must be tuple of ints, but found element of type list at pos 3
```
Traceback (most recent call last):
  File "F:\IntellIDEA\PyCharm 2019.2.2\helpers\pydev\pydevd.py", line 2066, in <module>
    main()
  File "F:\IntellIDEA\PyCharm 2019.2.2\helpers\pydev\pydevd.py", line 2060, in main
    globals = debugger.run(setup['file'], None, None, is_module)
  File "F:\IntellIDEA\PyCharm 2019.2.2\helpers\pydev\pydevd.py", line 1411, in run
    return self._exec(is_module, entry_point_fn, module_name, file, globals, locals)
  File "F:\IntellIDEA\PyCharm 2019.2.2\helpers\pydev\pydevd.py", line 1418, in _exec
    pydev_imports.execfile(file, globals, locals)  # execute the script
  File "F:\IntellIDEA\PyCharm 2019.2.2\helpers\pydev\_pydev_imps\_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "E:/pythonSpace/byol-a/train.py", line 132, in <module>
    main(audio_dir=base_path + '1/', epochs=100)
  File "E:/pythonSpace/byol-a/train.py", line 112, in main
    learner = BYOLALearner(model, cfg.lr, cfg.shape,
  File "E:/pythonSpace/byol-a/train.py", line 56, in __init__
    self.learner = BYOL(model, image_size=shape, **kwargs)
  File "D:\min\envs\torch1_7_1\lib\site-packages\byol_pytorch\byol_pytorch.py", line 211, in __init__
    self.forward(torch.randn(2, 3, image_size, image_size, device=device))
TypeError: randn(): argument 'size' must be tuple of ints, but found element of type list at pos 3
```
Doubt in RunningNorm
Hi There, great repo!
I think I have misunderstood something about the RunningNorm function. The function expects the size of an epoch; however, your implementation passes the size of the entire dataset.
Is it a bug? Or is there a problem with my understanding?
Thank You!
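For reference, a hedged sketch of what a running-normalization block with an epoch-size argument typically does; this only illustrates the behavior discussed above and is not the repository's RunningNorm implementation.

```python
import torch
import torch.nn as nn

class RunningNormSketch(nn.Module):
    """Accumulates mean/std over the first `epoch_samples` inputs, then freezes them."""
    def __init__(self, epoch_samples):
        super().__init__()
        self.epoch_samples = epoch_samples
        self.seen = 0
        self.sum = 0.0
        self.sq_sum = 0.0
        self.mean, self.std = 0.0, 1.0

    def forward(self, x):
        if self.seen < self.epoch_samples:
            # Update statistics only while still inside the first epoch.
            self.seen += 1
            self.sum += x.mean().item()
            self.sq_sum += (x ** 2).mean().item()
            self.mean = self.sum / self.seen
            self.std = max(self.sq_sum / self.seen - self.mean ** 2, 1e-8) ** 0.5
        return (x - self.mean) / self.std
```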
About inference speed?
Hi, is there any inference speed evaluation?
And how do you deal with long audio in production?
Many thanks for your great work.

BYOL-A: is this independent of language?

Can we create vector representations using the pretrained model only for English, or is it language-independent?
Model parameters cannot be trainable once they become requires_grad=False
https://github.com/nttcslab/byol-a/blob/master/byol_a/models.py#L42
`if p.requires_grad` has to be removed.

Evaluation on voxforge
Hi,
Thank you so much for your contribution. This work is very interesting and your code is easy for me to follow. But one of the downstream datasets, VoxForge, is missing from preprocess_ds.py. Could you please release the code for that dataset, too?
Thank you again for your time.
Best regards
Question for reproducing results
Hi,
Thanks for sharing this great work! I tried to reproduce the results using the official guidance but I failed.
After processing the data, I ran the following commands:

```sh
CUDA_VISIBLE_DEVICES=0 python -W ignore train.py work/16k/fsd50k/FSD50K.dev_audio
cp lightning_logs/version_4/checkpoints/epoch\=99-step\=16099.ckpt AudioNTT2020-BYOLA-64x96d2048.pth
CUDA_VISIBLE_DEVICES=4 python evaluate.py AudioNTT2020-BYOLA-64x96d2048.pth spcv2
```

However, the results are far from the reported ones.
Did I miss something important? Thank you very much.
How to interpret the performance
Hi, it's great work, but how should I interpret the performance metric? For example, VoxCeleb1 is usually used for speaker verification; shouldn't we measure EER?
Finetuning of BYOL-A
Hi,
your paper is super interesting. I have a question regarding the downstream tasks. If I understand the paper correctly, you used a single linear layer for the downstream tasks, which only uses the sum of the mean and max of the representation over time as input.
Did you try to finetune BYOL-A end-to-end on the downstream tasks after pretraining? In the case of TRILL, they were able to improve performance even further by finetuning the whole model end-to-end. Is there a specific reason why this is not possible with BYOL-A?
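For illustration, here is a hedged sketch of the kind of linear probe described above (names and shapes are assumptions, not the repository's API): frame-level features are pooled by summing their temporal mean and max, then fed to a single linear layer.

```python
import torch
import torch.nn as nn

class LinearProbeSketch(nn.Module):
    """Mean+max temporal pooling followed by one linear layer (illustrative only)."""
    def __init__(self, dim, num_classes):
        super().__init__()
        self.fc = nn.Linear(dim, num_classes)

    def forward(self, frame_features):                      # (batch, time, dim)
        pooled = frame_features.mean(dim=1) + frame_features.max(dim=1).values
        return self.fc(pooled)                              # (batch, num_classes)
```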
A mistake in RunningMean
Thank you for the fascinating paper and the code to reproduce it!
I think there might be a problem in RunningMean. The current formula (the same in v1 and v2) looks like this:
$$ m_n = m_{n - 1} + \frac{a_n - m_{n - 1}}{n - 1}, $$ which is inconsistent with the correct formula listed on StackOverflow:
$$ m_n = m_{n - 1} + \frac{a_n - m_{n - 1}}{n}. $$ The problem is that self.n is incremented after the new mean is computed. Could you please either correct me if I am wrong or correct the code?
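For reference, a minimal hedged sketch of a running mean that matches the corrected formula above (increment the count before the update so the divisor is $n$); this is an illustration, not the repository's RunningMean class:

```python
class RunningMeanSketch:
    """Running mean m_n = m_{n-1} + (a_n - m_{n-1}) / n (illustrative only)."""
    def __init__(self):
        self.n = 0
        self.mu = 0.0

    def update(self, a):
        # Increment first so the divisor is n, not n - 1.
        self.n += 1
        self.mu += (a - self.mu) / self.n
        return self.mu
```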
Doubt in paper
Hi there,
Section 4, subsection A, part 1 from your paper says:
The number of frames, T, in one segment was 96 in pretraining, which corresponds to 1,014ms.
However, the previous line says the hop size used was 10 ms, so according to this, 96 frames would mean 960 ms?
Am I understanding something wrong here?
Thank You in advance!
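For reference, a hedged arithmetic check, assuming the 64 ms analysis window used alongside the 10 ms hop (win_length=1024 at 16 kHz in config.yaml): the last of the 96 frames starts 95 hops in and still spans a full window, so

$$ (96 - 1) \times 10\,\text{ms} + 64\,\text{ms} = 950\,\text{ms} + 64\,\text{ms} = 1014\,\text{ms}, $$

whereas counting 96 hops alone gives $96 \times 10\,\text{ms} = 960\,\text{ms}$.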
Random crop is not working.
Lines 80 to 82 in 60cebdc
If `len(wav) > self.unit_length`, `length_adj` will be a negative value, so `start` will be 0. If wav (before padding) is shorter than the unit length, `length_adj == 0` after padding, so `start` is always 0. Therefore it only ever crops a fixed area from 0 to `self.unit_length` (`cropped_wav == wav[0:self.unit_length]`), not a random crop. So I think line 80 should be changed to `length_adj = len(wav) - self.unit_length`.
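As an illustration of the suggested fix, here is a minimal hedged sketch of the intended pad-then-random-crop behavior (the function and variable names are assumptions, not the repository code):

```python
import random
import torch
import torch.nn.functional as F

def crop_random_unit(wav: torch.Tensor, unit_length: int) -> torch.Tensor:
    """Pad a 1-D waveform to unit_length if short, otherwise randomly crop it."""
    # Pad if the waveform is shorter than the unit length.
    length_adj = unit_length - len(wav)
    if length_adj > 0:
        half = length_adj // 2
        wav = F.pad(wav, (half, length_adj - half))
    # Random crop: the margin must be computed from the longer side,
    # i.e. len(wav) - unit_length, so that start can be non-zero.
    length_adj = len(wav) - unit_length
    start = random.randint(0, length_adj) if length_adj > 0 else 0
    return wav[start:start + unit_length]
```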