nttcslab / byol-a


BYOL for Audio: Self-Supervised Learning for General-Purpose Audio Representation

Home Page: https://arxiv.org/abs/2103.06695

License: Other


byol-a's Introduction

[Figure: key visual]

BYOL for Audio: Exploring Pre-trained General-purpose Audio Representations

This is a demo implementation of BYOL for Audio (BYOL-A), a self-supervised learning method for general-purpose audio representation. It includes:

  • Training code that can train models with arbitrary audio files.
  • Evaluation code that can evaluate trained models with downstream tasks.
  • Pretrained weights.

UPDATE (Dec. 2022): The v2 paper is now published in TASLP! We have updated the BibTeX accordingly.

UPDATE (Nov. 2022): New model definitions (AudioNTT2020X, AudioNTT2020Task6X) are ready. These make all layer features accessible so that a weighted sum of layer features can be used in SUPERB.

UPDATE (May 2022): We have two papers for BYOL-A. If you find BYOL-A useful in your research, please use either of the following BibTeX entries for citation. The former is the first paper from IJCNN2021 (LINK to IEEE Xplore), and the latter (LINK to arXiv) has since been published in TASLP.

@inproceedings{niizumi2021byol-a,
    title={BYOL for Audio: Self-Supervised Learning for General-Purpose Audio Representation},
    author={Daisuke Niizumi and Daiki Takeuchi and Yasunori Ohishi and Noboru Harada and Kunio Kashino},
    booktitle = {2021 International Joint Conference on Neural Networks (IJCNN)},
    publisher={IEEE},
    DOI={10.1109/ijcnn52387.2021.9534474},
    url={http://dx.doi.org/10.1109/IJCNN52387.2021.9534474},
    year={2021},
    month={Jul}
}
@article{niizumi2023byol-a,
    title={{BYOL for Audio}: Exploring Pre-trained General-purpose Audio Representations},
    author={Niizumi, Daisuke and Takeuchi, Daiki and Ohishi, Yasunori and Harada, Noboru and Kashino, Kunio},
    journal={IEEE/ACM Transactions on Audio, Speech, and Language Processing}, 
    publisher={Institute of Electrical and Electronics Engineers (IEEE)},
    year={2023},
    volume={31},
    pages={137–151},
    doi={10.1109/TASLP.2022.3221007},
    url={http://dx.doi.org/10.1109/TASLP.2022.3221007},
    ISSN={2329-9304}
}

What is the difference between the papers?

We added an augmentation block and updated the network architecture.

We introduced an extra augmentation block, Random Linear Fader, in the 2022 version (TASLP2023).

[Figure: augmentation pipeline changes between versions]

We reduced the number of convolutional blocks from three to two and added a skip connection at a new Concat block in the 2022 version.

[Figure: encoder network changes between versions]

  • For IJCNN2021, the code has not been changed; please find the details in this README.
  • For TASLP2023 (the 2022 version), 👉 please find the code in the v2 folder.

Getting Started

  1. Download the external source files and apply a patch. Our implementation uses the following:

    curl -O https://raw.githubusercontent.com/lucidrains/byol-pytorch/2aa84ee18fafecaf35637da4657f92619e83876d/byol_pytorch/byol_pytorch.py
    patch < byol_a/byol_pytorch.diff
    mv byol_pytorch.py byol_a
    curl -O https://raw.githubusercontent.com/daisukelab/general-learning/7b31d31637d73e1a74aec3930793bd5175b64126/MLP/torch_mlp_clf.py
    mv torch_mlp_clf.py utils
  2. Install PyTorch 1.7.1, torchaudio, and the other dependencies listed in requirements.txt.

Evaluating BYOL-A Representations

Downstream Task Evaluation

The following steps perform a downstream task evaluation in linear-probe fashion. This is an example with SPCV2, the Speech Commands dataset v2.

  1. Preprocess the metadata (.csv file) and audio files; the processed files will be stored under a folder named work.

    # usage: python -m utils.preprocess_ds <downstream task> <path to its dataset>
    python -m utils.preprocess_ds spcv2 /path/to/speech_commands_v0.02
  2. Run the evaluation. This will first convert all .wav audio to representation embeddings, then train a linear-layer classifier, and finally calculate accuracy as the result (see the sketch after this list).

    python evaluate.py pretrained_weights/AudioNTT2020-BYOLA-64x96d2048.pth spcv2
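
For reference, here is a minimal, self-contained sketch of the linear-probe idea, assuming embeddings have already been extracted into NumPy arrays. The repository's evaluate.py implements its own pipeline (using the torch_mlp_clf utilities downloaded in Getting Started), so treat this only as an illustration:

# Illustrative linear probe (not the repository's actual evaluation pipeline).
# X_train/X_test: (N, D) embedding arrays from the frozen pretrained encoder.
# y_train/y_test: integer class labels.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def linear_probe_accuracy(X_train, y_train, X_test, y_test):
    clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    clf.fit(X_train, y_train)          # only the linear classifier is trained
    return clf.score(X_test, y_test)   # accuracy on the held-out split

# Call pattern with random placeholder data:
rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(100, 2048)), rng.integers(0, 10, 100)
X_test, y_test = rng.normal(size=(20, 2048)), rng.integers(0, 10, 20)
print(linear_probe_accuracy(X_train, y_train, X_test, y_test))

Only the classifier on top of the frozen embeddings is trained, which is what makes this a linear probe.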

You can also run an evaluation multiple times and take the average result. The following evaluates on UrbanSound8K with a unit audio duration of 4.0 seconds, 10 times.

# usage: python evaluate.py <your weight> <downstream task> <unit duration sec.> <# of iteration>
python evaluate.py pretrained_weights/AudioNTT2020-BYOLA-64x96d2048.pth us8k 4.0 10

Similarly, the following evaluates on NSynth (4.0 seconds long) 10 times.

python evaluate.py pretrained_weights/AudioNTT2020-BYOLA-64x96d2048.pth nsynth 4.0 10

Evaluating Representations In Your Tasks

This is an example of calculating a feature vector for an audio sample.

from byol_a.common import *
from byol_a.augmentations import PrecomputedNorm
from byol_a.models import AudioNTT2020


device = torch.device('cuda')
cfg = load_yaml_config('config.yaml')
print(cfg)

# ** Prepare the statistics in advance **
# You need to calculate the statistics of mean and standard deviation of the log-mel spectrogram of your dataset.
# See calc_norm_stats in evaluate.py for your reference.
stats = [-5.4919195,  5.0389895]

# Preprocessor and normalizer.
to_melspec = torchaudio.transforms.MelSpectrogram(
    sample_rate=cfg.sample_rate,
    n_fft=cfg.n_fft,
    win_length=cfg.win_length,
    hop_length=cfg.hop_length,
    n_mels=cfg.n_mels,
    f_min=cfg.f_min,
    f_max=cfg.f_max,
)
normalizer = PrecomputedNorm(stats)

# Load pretrained weights.
model = AudioNTT2020(d=cfg.feature_d)
model.load_weight('pretrained_weights/AudioNTT2020-BYOLA-64x96d2048.pth', device)

# Load your audio file.
wav, sr = torchaudio.load('work/16k/spcv2/one/00176480_nohash_0.wav') # a sample from SPCV2 for now
assert sr == cfg.sample_rate, "Let's convert the audio sampling rate in advance, or do it here online."

# Convert to a log-mel spectrogram, then normalize.
lms = normalizer((to_melspec(wav) + torch.finfo(torch.float).eps).log())

# Now, convert the audio to the representation.
features = model(lms.unsqueeze(0))
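
The stats values above are only an example; you need the mean and standard deviation of the log-mel spectrogram values computed over your own dataset (see calc_norm_stats in evaluate.py for the reference implementation). A rough, hypothetical sketch of that computation, reusing the to_melspec transform defined above and assuming a list of pre-resampled .wav paths, could look like this:

import numpy as np
import torch
import torchaudio

def compute_lms_stats(wav_paths, to_melspec):
    """Return [mean, std] of log-mel spectrogram values over a dataset."""
    eps = torch.finfo(torch.float).eps
    values = []
    for path in wav_paths:
        wav, sr = torchaudio.load(path)       # audio is assumed to be pre-resampled
        lms = (to_melspec(wav) + eps).log()   # same transform as in the example above
        values.append(lms.flatten().numpy())
    values = np.concatenate(values)
    return [float(values.mean()), float(values.std())]

# stats = compute_lms_stats(list_of_wav_files, to_melspec)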

Training From Scratch

You can also train models yourself. The following is an example of training on FSD50K.

  1. Convert all samples to 16 kHz. This will convert all FSD50K files into a folder work/16k/fsd50k while preserving the folder structure.

    python -m utils.convert_wav /path/to/fsd50k work/16k/fsd50k
  2. Start training. This example trains with all development set audio samples from FSD50K.

    python train.py work/16k/fsd50k/FSD50K.dev_audio

Refer to Table VI in our paper for the performance of a model trained on FSD50K.

Pretrained Weights

We include three pretrained weights for our encoder network.

| Method | Dim.   | Filename                          | NSynth | US8K  | VoxCeleb1 | VoxForge | SPCV2/12 | SPCV2 | Average |
|--------|--------|-----------------------------------|--------|-------|-----------|----------|----------|-------|---------|
| BYOL-A | 512-d  | AudioNTT2020-BYOLA-64x96d512.pth  | 69.1%  | 78.2% | 33.4%     | 83.5%    | 86.5%    | 88.9% | 73.3%   |
| BYOL-A | 1024-d | AudioNTT2020-BYOLA-64x96d1024.pth | 72.7%  | 78.2% | 38.0%     | 88.5%    | 90.1%    | 91.4% | 76.5%   |
| BYOL-A | 2048-d | AudioNTT2020-BYOLA-64x96d2048.pth | 74.1%  | 79.1% | 40.1%     | 90.2%    | 91.0%    | 92.2% | 77.8%   |
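
To use one of the other dimensionalities, instantiate the encoder with the matching feature dimension before loading the weights. A minimal sketch mirroring the example above (here assuming the 512-d file from the table):

import torch
from byol_a.models import AudioNTT2020

device = torch.device('cpu')
# d must match the dimensionality encoded in the weight filename (512, 1024, or 2048).
model = AudioNTT2020(d=512)
model.load_weight('pretrained_weights/AudioNTT2020-BYOLA-64x96d512.pth', device)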

License

This implementation is provided for evaluating the BYOL-A paper; see LICENSE for details.

Acknowledgements

BYOL-A is built on top of byol-pytorch, a BYOL implementation by Phil Wang (@lucidrains). We thank Phil for open-sourcing this sophisticated code.


byol-a's Issues

Performing evaluation with only a small part of the spectrogram

Hi

Thank you for your contribution. It's really interesting work. However, I have one question regarding the downstream evaluation.
In the paper, you mentioned that "A segment of shape FxT was randomly cropped from each audio clip and encoded for linear evaluation in the downstream tasks."

However, as far as I know, this procedure was not adopted in previous works. Have you tried an experiment where the complete log-mel spectrogram (without random cropping) is fed to the network during evaluation? Is there any performance difference?

Thanks

A basic question: torch.randn(): argument 'size' must be tuple of ints, but found element of type list at pos 3

Traceback (most recent call last):
  File "F:\IntellIDEA\PyCharm 2019.2.2\helpers\pydev\pydevd.py", line 2066, in <module>
    main()
  File "F:\IntellIDEA\PyCharm 2019.2.2\helpers\pydev\pydevd.py", line 2060, in main
    globals = debugger.run(setup['file'], None, None, is_module)
  File "F:\IntellIDEA\PyCharm 2019.2.2\helpers\pydev\pydevd.py", line 1411, in run
    return self._exec(is_module, entry_point_fn, module_name, file, globals, locals)
  File "F:\IntellIDEA\PyCharm 2019.2.2\helpers\pydev\pydevd.py", line 1418, in _exec
    pydev_imports.execfile(file, globals, locals)  # execute the script
  File "F:\IntellIDEA\PyCharm 2019.2.2\helpers\pydev\_pydev_imps\_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "E:/pythonSpace/byol-a/train.py", line 132, in <module>
    main(audio_dir=base_path + '1/', epochs=100)
  File "E:/pythonSpace/byol-a/train.py", line 112, in main
    learner = BYOLALearner(model, cfg.lr, cfg.shape,
  File "E:/pythonSpace/byol-a/train.py", line 56, in __init__
    self.learner = BYOL(model, image_size=shape, **kwargs)
  File "D:\min\envs\torch1_7_1\lib\site-packages\byol_pytorch\byol_pytorch.py", line 211, in __init__
    self.forward(torch.randn(2, 3, image_size, image_size, device=device))
TypeError: randn(): argument 'size' must be tuple of ints, but found element of type list at pos 3

Doubt in RunningNorm

Hi There, great repo!

I think I may have misunderstood the RunningNorm function. The function expects the size of an epoch; however, your implementation passes the size of the entire dataset.

Is it a bug? Or is there a problem with my understanding?

Thank You!

About inference speed?

Hi, is there any inference speed evaluation?
And how should long audio be handled in production?
Many thanks for your great work.

Evaluation on voxforge

Hi,

Thank you so much for your contribution. This work is very interesting and your code is easy to follow. However, one of the downstream datasets, VoxForge, is missing from preprocess_ds.py. Could you please release the code for that dataset, too?

Thank you again for your time.

Best regards

Question for reproducing results

Hi,

Thanks for sharing this great work! I tried to reproduce the results following the official guidance, but failed.

After processing the data, I ran the following commands:

CUDA_VISIBLE_DEVICES=0 python -W ignore train.py work/16k/fsd50k/FSD50K.dev_audio
cp lightning_logs/version_4/checkpoints/epoch\=99-step\=16099.ckpt AudioNTT2020-BYOLA-64x96d2048.pth
CUDA_VISIBLE_DEVICES=4 python evaluate.py AudioNTT2020-BYOLA-64x96d2048.pth spcv2

However, the results are far from the reported ones.


Did I miss something important? Thank you very much.

How to interpret the performance

Hi, it's great work, but how should I interpret the performance metric? For example, VoxCeleb1 is usually used for speaker verification; shouldn't we measure EER?

Finetuning of BYOL-A

Hi,

your paper is super interesting. I have a question regarding the downstream tasks. If I understand the paper correctly, you used a single linear layer for the downstream tasks, which only takes the sum of the mean and max of the representation over time as input.

Did you try to fine-tune BYOL-A end-to-end on the downstream tasks after pretraining? In the case of TRILL, they were able to improve the performance even further by fine-tuning the whole model end-to-end. Is there a specific reason why this is not possible with BYOL-A?

A mistake in RunningMean

Thank you for the fascinating paper and the code to reproduce it!

I think there might be a problem in RunningMean. The current formula (the same in v1 and v2) looks like this:

$$ m_n = m_{n - 1} + \frac{a_n - m_{n - 1}}{n - 1}, $$

which is inconsistent with the correct formula listed on StackOverflow:

$$ m_n = m_{n - 1} + \frac{a_n - m_{n - 1}}{n}. $$

The problem is that self.n is incremented after the new mean is computed. Could you please either correct me if I am wrong or correct the code?
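
For illustration only, here is a minimal running-mean sketch that follows the second formula by incrementing the count before it is used as the divisor. This is not the repository's RunningMean class, just an example of the order of operations being discussed:

class SimpleRunningMean:
    """Illustrative running mean: m_n = m_{n-1} + (a_n - m_{n-1}) / n."""
    def __init__(self):
        self.n = 0
        self.mean = 0.0

    def update(self, x):
        self.n += 1                              # increment first, so the divisor is n
        self.mean += (x - self.mean) / self.n
        return self.mean

rm = SimpleRunningMean()
for a in [1.0, 2.0, 3.0, 4.0]:
    rm.update(a)
print(rm.mean)  # 2.5, the exact mean of the four values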

Doubt in paper

Hi there,

Section 4, subsection A, part 1 from your paper says:

 The number of frames, T, in one segment was 96 in pretraining, which corresponds to 1,014ms. 

However, the previous line says the hop size used was 10 ms, so according to this, 96 frames would correspond to 960 ms?

Am I understanding something wrong here?

Thank You in advance!
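
For context, one way the two numbers can be reconciled (an assumption based on the default 64 ms window and 10 ms hop at 16 kHz, not an official answer) is that the last frame still spans a full analysis window:

$$ (96 - 1) \times 10\,\text{ms} + 64\,\text{ms} = 950\,\text{ms} + 64\,\text{ms} = 1014\,\text{ms}. $$

That is, 960 ms counts only the hops, while 1,014 ms also accounts for the window length of the final frame.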

Random crop is not working.

byol-a/byol_a/dataset.py, lines 80 to 82 in 60cebdc:

length_adj = self.unit_length - len(wav)
start = random.randint(0, length_adj) if length_adj > 0 else 0
wav = wav[start:start + self.unit_length]

If len(wav) > self.unit_length, length_adj will be negative, so start will be 0. If wav (before padding) is shorter than the unit length, length_adj == 0 after padding, so start is again 0. Thus start is always 0, and the code only ever crops the fixed region from 0 to self.unit_length (cropped_wav == wav[0:self.unit_length]), not a random crop.

So I think line 80 should be changed to length_adj = len(wav) - self.unit_length.
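
For illustration, here is a minimal sketch of a crop that behaves as the suggested fix intends: a random start position when the waveform is longer than the unit length, and padding otherwise. The function name and padding strategy are illustrative, not the repository's exact dataset code:

import random
import torch
import torch.nn.functional as F

def random_crop_or_pad(wav: torch.Tensor, unit_length: int) -> torch.Tensor:
    """Randomly crop a 1-D waveform to unit_length, or zero-pad if shorter."""
    if len(wav) >= unit_length:
        # Any window of unit_length samples can now be selected.
        start = random.randint(0, len(wav) - unit_length)
        return wav[start:start + unit_length]
    # Too short: zero-pad to unit_length at a random offset.
    pad_total = unit_length - len(wav)
    pad_left = random.randint(0, pad_total)
    return F.pad(wav, (pad_left, pad_total - pad_left))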
