s3prl / s3prl Goto Github PK

Self-Supervised Speech Pre-training and Representation Learning Toolkit

Home Page: https://s3prl.github.io/s3prl/

License: Apache License 2.0

Python 98.00% Shell 1.95% Dockerfile 0.05%

speech-representation mockingjay representation-learning apc tera self-supervised-learning speech-pretraining vq-apc wav2vec vq-wav2vec

s3prl's Introduction

Contact

We prefer to have discussions directly on Github issue page, so that all the information is transparent to all the contributors and is auto-archived on the Github. If you wish to use email, please contact:

Please refer to the legacy citation of S3PRL and the timeline below, which justify our initiative on this project. This information is used to protect us from half-truths. We encourage to cite the individual papers most related to the function you are using to give fair credit to the developer of the function. You can find the names in the Change Log. Finally, we would like to thank our advisor, Prof. Hung-yi Lee, for his advice. The project would be impossible without his support.

If you have any question (e.g., about who came up with / developed which ideas / functions or how the project started), feel free to engage in an open and responsible conversation on the GitHub issue page, and we'll be happy to help!

Contribution (pull request)

Guideline

Starting in 2024, we will only accept new contributions in the form of new upstream models, so we can save bandwidth for developing new techniques (which will not be in S3PRL.)
S3PRL has transitioned into pure maintenance mode, ensuring the long-term maintenance of all existing functions.
Reporting bugs or the PR fixing the bugs is always welcome! Thanks!

Tutorials

Environment compatibilities

We support the following environments. The test cases are ran with tox locally and on github action:

Env	versions
os	`ubuntu-18.04`, `ubuntu-20.04`
python	`3.7`, `3.8`, `3.9`, `3.10`
pytorch	`1.8.1`, `1.9.1`, `1.10.2`, `1.11.0`, `1.12.1` , `1.13.1` , `2.0.1` , `2.1.0`

Star History

Change Log

We only list the major contributors here for conciseness. However, we are deeply grateful for all the contributions. Please see the Contributors page for the full list.

Dec 2023: Support Multi-resolution HuBERT (MR-HuBERT, see Multiresolution HuBERT)
Oct 2023: Support ESPnet pre-trained upstream models (see ESPnet HuBERT and WavLabLM)
Sep 2022: In JSALT 2022, We upgrade the codebase to support testing, documentation and a new S3PRL PyPI package for easy installation and usage for upstream models. See our online doc for more information. The package is now used by many open-source projects, including ESPNet. Contributors: Shu-wen Yang (NTU), Andy T. Liu (NTU), Heng-Jui Chang (MIT), Haibin Wu (NTU) and Xuankai Chang (CMU).
Mar 2022: Introduce SUPERB-SG, see Speech Translation by Hsiang-Sheng Tsai (NTU), Out-of-domain ASR by Heng-Jui Chang (NTU), Voice Conversion by Wen-Chin Huang (Nagoya), Speech Separation and Speech Enhancement by Zili Huang (JHU) for more info.
Mar 2022: Introduce SSL for SE/SS by Zili Huang (JHU). See SE1 and SS1 folders for more details. Note that the improved performances can be achieved by the later introduced SE2 and SS2. However, for aligning with SUPERB-SG benchmarking, please use the version 1.
Nov 2021: Introduce S3PRL-VC by Wen-Chin Huang (Nagoya), see Any-to-one for more info. We highly recommend to consider the newly released official repo of S3PRL-VC which is developed and actively maintained by Wen-Chin Huang. The standalone repo contains much more recepies for the VC experiments. In S3PRL we only include the Any-to-one recipe for reproducing the SUPERB results.
Oct 2021: Support DistilHuBERT by Heng-Jui Chang (NTU), see docs for more info.
Sep 2021: We host a challenge in AAAI workshop: The 2nd Self-supervised Learning for Audio and Speech Processing! See SUPERB official site for the challenge details and the SUPERB documentation in this toolkit!
Aug 2021: Andy T. Liu (NTU) and Shu-wen Yang (NTU) introduces the S3PRL toolkit in MLSS 2021, you can also watch it on Youtube!
Aug 2021: TERA by Andy T. Liu (NTU) is accepted to TASLP!
July 2021: We are now working on packaging s3prl and reorganizing the file structure in v0.3. Please consider using the stable v0.2.0 for now. We will test and release v0.3 before August.
June 2021: Support SUPERB: Speech processing Universal PERformance Benchmark, submitted to Interspeech 2021. Use the tag superb-interspeech2021 or v0.2.0. Contributors: Shu-wen Yang (NTU), Pohan Chi (NTU), Yist Lin (NTU), Yung-Sung Chuang (NTU), Jiatong Shi (CMU), Xuankai Chang (CMU), Wei-Cheng Tseng (NTU), Tzu-Hsien Huang (NTU) and Kushal Lakhotia (Meta).
June 2021: Support extracting multiple hidden states for all the SSL pretrained models by Shu-wen Yang (NTU).
Jan 2021: Readme updated with detailed instructions on how to use our latest version!
Dec 2020: We are migrating to a newer version for a more general, flexible, and scalable code. See the introduction below for more information! The legacy version can be accessed the tag v0.1.0.
Oct 2020: Shu-wen Yang (NTU) and Andy T. Liu (NTU) added varioius classic upstream models, including PASE+, APC, VQ-APC, NPC, wav2vec, vq-wav2vec ...etc.
Oct 2019: The birth of S3PRL! The repository was created for the Mockingjay development. Andy T. Liu (NTU), Shu-wen Yang (NTU) and Pohan Chi (NTU) implemented the pre-training scripts and several simple downstream evaluation tasks. This work was the very start of the S3PRL project which established lots of foundamental modules and coding styles. Feel free to checkout to the old commits to explore our legacy codebase!

Introduction and Usages

This is an open source toolkit called s3prl, which stands for Self-Supervised Speech Pre-training and Representation Learning. Self-supervised speech pre-trained models are called upstream in this toolkit, and are utilized in various downstream tasks.

The toolkit has three major usages:

Pretrain

Pretrain upstream models, including Mockingjay, Audio ALBERT and TERA.
Document: pretrain/README.md

Upstream

Easily load most of the existing upstream models with pretrained weights in a unified I/O interface.
Pretrained models are registered through torch.hub, which means you can use these models in your own project by one-line plug-and-play without depending on this toolkit's coding style.
Document: upstream/README.md

Downstream

Utilize upstream models in lots of downstream tasks
Benchmark upstream models with SUPERB Benchmark
Document: downstream/README.md

Here is a high-level illustration of how S3PRL might help you. We support to leverage numerous SSL representations on numerous speech processing tasks in our GitHub codebase:

We also modularize all the SSL models into a standalone PyPi package so that you can easily install it and use it without depending on our entire codebase. The following shows a simple example and you can find more details in our documentation.

Install the S3PRL package:

pip install s3prl

Use it to extract representations for your own audio:

import torch
from s3prl.nn import S3PRLUpstream

model = S3PRLUpstream("hubert")
model.eval()

with torch.no_grad():
    wavs = torch.randn(2, 16000 * 2)
    wavs_len = torch.LongTensor([16000 * 1, 16000 * 2])
    all_hs, all_hs_len = model(wavs, wavs_len)

for hs, hs_len in zip(all_hs, all_hs_len):
    assert isinstance(hs, torch.FloatTensor)
    assert isinstance(hs_len, torch.LongTensor)

    batch_size, max_seq_len, hidden_size = hs.shape
    assert hs_len.dim() == 1

With this modularization, we have achieved close integration with the general speech processing toolkit ESPNet, enabling the use of SSL models for a broader range of speech processing tasks and corpora to achieve state-of-the-art (SOTA) results (kudos to the ESPNet Team):

You can start the journey of SSL with the following entry points:

S3PRL: A simple SUPERB downstream task
ESPNet: Levearging S3PRL for ASR

Feel free to use or modify our toolkit in your research. Here is a list of papers using our toolkit. Any question, bug report or improvement suggestion is welcome through opening up a new issue.

If you find this toolkit helpful to your research, please do consider citing our papers, thanks!

Installation

Python >= 3.6
Install sox on your OS
Install s3prl: Read doc or pip install -e ".[all]"
(Optional) Some upstream models require special dependencies. If you encounter error with a specific upstream model, you can look into the README.md under each upstream folder. E.g., upstream/pase/README.md=

Reference Repositories

Pytorch, Pytorch.
Audio, Pytorch.
Kaldi, Kaldi-ASR.
Transformers, Hugging Face.
PyTorch-Kaldi, Mirco Ravanelli.
fairseq, Facebook AI Research.
CPC, Facebook AI Research.
APC, Yu-An Chung.
VQ-APC, Yu-An Chung.
NPC, Alexander-H-Liu.
End-to-end-ASR-Pytorch, Alexander-H-Liu
Mockingjay, Andy T. Liu.
ESPnet, Shinji Watanabe
speech-representations, aws lab
PASE, Santiago Pascual and Mirco Ravanelli
LibriMix, Joris Cosentino and Manuel Pariente

License

The majority of S3PRL Toolkit is licensed under the Apache License version 2.0, however all the files authored by Facebook, Inc. (which have explicit copyright statement on the top) are licensed under CC-BY-NC.

Used by

List of papers that used our toolkit (Feel free to add your own paper by making a pull request)

Self-Supervised Pretraining

Mockingjay: Unsupervised Speech Representation Learning with Deep Bidirectional Transformer Encoders (Liu et al., 2020)

@article{mockingjay,
   title={Mockingjay: Unsupervised Speech Representation Learning with Deep Bidirectional Transformer Encoders},
   ISBN={9781509066315},
   url={http://dx.doi.org/10.1109/ICASSP40776.2020.9054458},
   DOI={10.1109/icassp40776.2020.9054458},
   journal={ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
   publisher={IEEE},
   author={Liu, Andy T. and Yang, Shu-wen and Chi, Po-Han and Hsu, Po-chun and Lee, Hung-yi},
   year={2020},
   month={May}
}

TERA: Self-Supervised Learning of Transformer Encoder Representation for Speech (Liu et al., 2020)

@misc{tera,
    title={TERA: Self-Supervised Learning of Transformer Encoder Representation for Speech},
    author={Andy T. Liu and Shang-Wen Li and Hung-yi Lee},
    year={2020},
    eprint={2007.06028},
    archivePrefix={arXiv},
    primaryClass={eess.AS}
}

Audio ALBERT: A Lite BERT for Self-supervised Learning of Audio Representation (Chi et al., 2020)

@inproceedings{audio_albert,
    title={Audio ALBERT: A Lite BERT for Self-supervised Learning of Audio Representation},
    author={Po-Han Chi and Pei-Hung Chung and Tsung-Han Wu and Chun-Cheng Hsieh and Shang-Wen Li and Hung-yi Lee},
    year={2020},
    booktitle={SLT 2020},
}

Explanability

Understanding Self-Attention of Self-Supervised Audio Transformers (Yang et al., 2020)

@inproceedings{understanding_sat,
    author={Shu-wen Yang and Andy T. Liu and Hung-yi Lee},
    title={{Understanding Self-Attention of Self-Supervised Audio Transformers}},
    year=2020,
    booktitle={Proc. Interspeech 2020},
    pages={3785--3789},
    doi={10.21437/Interspeech.2020-2231},
    url={http://dx.doi.org/10.21437/Interspeech.2020-2231}
}

Adversarial Attack

Defense for Black-box Attacks on Anti-spoofing Models by Self-Supervised Learning (Wu et al., 2020), code for computing LNSR: utility/observe_lnsr.py

@inproceedings{mockingjay_defense,
    author={Haibin Wu and Andy T. Liu and Hung-yi Lee},
    title={{Defense for Black-Box Attacks on Anti-Spoofing Models by Self-Supervised Learning}},
    year=2020,
    booktitle={Proc. Interspeech 2020},
    pages={3780--3784},
    doi={10.21437/Interspeech.2020-2026},
    url={http://dx.doi.org/10.21437/Interspeech.2020-2026}
}

Adversarial Defense for Automatic Speaker Verification by Cascaded Self-Supervised Learning Models (Wu et al., 2021)

@misc{asv_ssl,
    title={Adversarial defense for automatic speaker verification by cascaded self-supervised learning models},
    author={Haibin Wu and Xu Li and Andy T. Liu and Zhiyong Wu and Helen Meng and Hung-yi Lee},
    year={2021},
    eprint={2102.07047},
    archivePrefix={arXiv},
    primaryClass={eess.AS}

Voice Conversion

S2VC: A Framework for Any-to-Any Voice Conversion with Self-Supervised Pretrained Representations (Lin et al., 2021)

@misc{s2vc,
      title={S2VC: A Framework for Any-to-Any Voice Conversion with Self-Supervised Pretrained Representations},
      author={Jheng-hao Lin and Yist Y. Lin and Chung-Ming Chien and Hung-yi Lee},
      year={2021},
      eprint={2104.02901},
      archivePrefix={arXiv},
      primaryClass={eess.AS}
}

Benchmark and Evaluation

SUPERB: Speech processing Universal PERformance Benchmark (Yang et al., 2021)

@misc{superb,
      title={SUPERB: Speech processing Universal PERformance Benchmark},
      author={Shu-wen Yang and Po-Han Chi and Yung-Sung Chuang and Cheng-I Jeff Lai and Kushal Lakhotia and Yist Y. Lin and Andy T. Liu and Jiatong Shi and Xuankai Chang and Guan-Ting Lin and Tzu-Hsien Huang and Wei-Cheng Tseng and Ko-tik Lee and Da-Rong Liu and Zili Huang and Shuyan Dong and Shang-Wen Li and Shinji Watanabe and Abdelrahman Mohamed and Hung-yi Lee},
      year={2021},
      eprint={2105.01051},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Utilizing Self-supervised Representations for MOS Prediction (Tseng et al., 2021)

@misc{ssr_mos,
    title={Utilizing Self-supervised Representations for MOS Prediction},
    author={Wei-Cheng Tseng and Chien-yu Huang and Wei-Tsung Kao and Yist Y. Lin and Hung-yi Lee},
    year={2021},
    eprint={2104.03017},
    archivePrefix={arXiv},
    primaryClass={eess.AS}
}

}

Citation

If you find this toolkit useful, please consider citing following papers.

If you use our pre-training scripts, or the downstream tasks considered in TERA and Mockingjay, please consider citing the following:

@misc{tera,
  title={TERA: Self-Supervised Learning of Transformer Encoder Representation for Speech},
  author={Andy T. Liu and Shang-Wen Li and Hung-yi Lee},
  year={2020},
  eprint={2007.06028},
  archivePrefix={arXiv},
  primaryClass={eess.AS}
}

@article{mockingjay,
   title={Mockingjay: Unsupervised Speech Representation Learning with Deep Bidirectional Transformer Encoders},
   ISBN={9781509066315},
   url={http://dx.doi.org/10.1109/ICASSP40776.2020.9054458},
   DOI={10.1109/icassp40776.2020.9054458},
   journal={ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
   publisher={IEEE},
   author={Liu, Andy T. and Yang, Shu-wen and Chi, Po-Han and Hsu, Po-chun and Lee, Hung-yi},
   year={2020},
   month={May}
}

If you use our organized upstream interface and features, or the SUPERB downstream benchmark, please consider citing the following:

@article{yang2024large,
  title={A Large-Scale Evaluation of Speech Foundation Models},
  author={Yang, Shu-wen and Chang, Heng-Jui and Huang, Zili and Liu, Andy T and Lai, Cheng-I and Wu, Haibin and Shi, Jiatong and Chang, Xuankai and Tsai, Hsiang-Sheng and Huang, Wen-Chin and others},
  journal={IEEE/ACM Transactions on Audio, Speech, and Language Processing},
  year={2024},
  publisher={IEEE}
}

@inproceedings{yang21c_interspeech,
  author={Shu-wen Yang and Po-Han Chi and Yung-Sung Chuang and Cheng-I Jeff Lai and Kushal Lakhotia and Yist Y. Lin and Andy T. Liu and Jiatong Shi and Xuankai Chang and Guan-Ting Lin and Tzu-Hsien Huang and Wei-Cheng Tseng and Ko-tik Lee and Da-Rong Liu and Zili Huang and Shuyan Dong and Shang-Wen Li and Shinji Watanabe and Abdelrahman Mohamed and Hung-yi Lee},
  title={{SUPERB: Speech Processing Universal PERformance Benchmark}},
  year=2021,
  booktitle={Proc. Interspeech 2021},
  pages={1194--1198},
  doi={10.21437/Interspeech.2021-1775}
}

s3prl's People

Contributors

Stargazers

Watchers

Forkers

voidism twistedmove entn-at henryhenrychen gblin5566 cogmeta m13kj rainmaker712 yuzhongshanyue aviasd schoolboy-ju heimanba89 moonlightsong aiedward deepchatterjeevns breezedeus xixirupan newoneincntk yistlin eyekid shiyuzh2007 jzcruiser verbubbleabc ag027592 xintaozhao0805 5iding yangyutu dendisuhubdy pengyizhou zhongerqiandan apoorv2904 jackanth sailinglqh artemiszgl lalimili6 mpauro jefflai108 skyy93 kingstorm zhangshenghust konsangbgohain amirhussein96 edosyhptra denisleonov hubeibei007 amit2020cs pjanbakhshi roxima ftshijt sunilsivadas simpleoier flyk13 xiongmaoxia frozenburst sas91 pppku albertwy chenchy sumhncku rajs1302 bhavinjethra noetits huangziliandy aarora8 amberww wgwangang lhy0807 mortyzhou-shef-bit zjc6666 ankugupt chenxinglili yanjingwangzhang whitefu gqwert123 gary109 zhouweilian aigizk zhaoforever triper1022 fengweijp sraghveet69 hehaoming aheba wszlong ljing2007 zhouzhenkun b06901052 jingriguming aliceinhunterland olegjakushkin unilight ayushexel shuchengzhang92 lingss0918 ankitshah009 albertvillanova kwangje targonaut laura-bustos tranmduc

s3prl's Issues

about the downstream evaluation

Hello, I'm sorry for asking some stupid question.

I have trained a mockingjay with the melbase config. And then I want to evaluate it on downstream task. When loading the model, it shows the error below. And I can't find the config for the transformer in the downstream.yaml.

 "Pre-trained weights NOT loaded!".
size mismatch for input_representations.spec_transform.weight: copying a param with shape torch.Size([768, 160]) from checkpoint, the shape in current model is torch.Size([768, 1]).',))

Besides, I'm a little confused that where does the speaker information comes from when evaluate the speaker recognition task? And what if I pretrain in different language, do you have some suggestions for phone classification?

Thanks a lot.

Why reconstruct linear spectrogram?

Thanks for such a good work! I have a specific question about the reconstruction goal. I am wondering, in both Audio ALBERT and Mockingjay large, why you decided to reconstruct the corresponding linear spectrogram instead of the input 160-dimension mel spectrogram? Do you have any statistics for how much performance gain we can get by reconstructing linear spectrogram instead of input mel spectrogram?

Thank you so much

CPC label

Hello, Author ! Thank you for your sharing !

I have some problems about the CPC label. I find that code in 'dataloader.py'.

assert('train-clean-100' in sets and len(sets) == 1)

Does this code mean that I can only use the 'train-clean-100' ?

In addition, I have use kaldi to align the LibriSpeech but I can not get the file 'converted_aligned_phone.txt'. How can I obtain the CPC label of other datasets ?

Would you like to answer my questions? Thank you very much !

performance about fine-tuning TERA with output layer by minimizing a CTC loss

Thanks your great work. I am a big fan of professor Hung-yi Lee, he's online course helps me a lot.

I have read the TERA paper, the ASR comparison experiments was conducted within the DNN/HMM framework.
Have you ever tested the performance about fine-tuning TERA with output layer by minimizing a CTC loss?

why dev_set and test_set not used in pre-training

Hi, I notice that the code only uses train_set during the pre-training progress. Could you plz explain why dev-test and test-set are not used? Thanks!

Phone accuracy of Mockingjay

Dear Author :
Hello ! I am a In this pretrain process, I train on train-clean-360 dataset for 500k total training steps and use parameters in 'mockingjay_libri_melBase.yaml'. The target is mel-160 features. The specific parameters are as follows: e">transformer: input_dim: 160 #int`, 39 for mfcc, 40 for fmllr, 80 for fbank, 160 for mel
# stacked consecutive features vectors to reduce the length of input sequences by this factor.
# Size of the encoder layers and the pooler layer.
# Number of hidden layers in the Transformer encoder.
# Number of attention heads for each attention layer in the Transformer encoder.
# The size of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder.
# The non-linear activation function (function or string) in the encoder and pooler. If string, "gelu", "relu" and "swish" are supported.
# The dropout probabilitiy for all fully connected layers in the embeddings, encoder, and pooler.
# The dropout ratio for the attention probabilities.
# The sttdev of the truncated_normal_initializer for initializing all weight matrices.
# The epsilon used by LayerNorm.
# mask this percentage of all spectrogram frames in each sequence at random during MAM training
# mask this amount of consecutive frames
# mask this amount of consecutive frames
# allow overlap masking
# only used when overlap is not allowed. sample a mask from each bucket in size of [sampled mask_consecutive * mask_bucket_ratio]
# mask maximum this amount of frequency bands, set to 0 for no frequency mask
# for this percentage of the time, Gaussian noise will be applied on all frames during MAM training, set to 0 for no noise
# Usage: 0,1,2,12-15 will prune headids [0,1,2,12,13,14]. headids = layerid * head_num + headid_in_layer
# Share layer weights
# maximum input length (0 for no restriction)

optimizer:
type: 'adam' # modes: ['adam', 'adamW', 'lamb']
learning_rate: "4e-4" # Learning rate for opt. "4e-4" for 'data/libri_mel160_subword5000', "2e-4" for 'data/libri_fmllr_cmvn'
loss_scale: 0 # Loss scale to improve fp16 numeric stability. Only used when apex is set to True. 0: dynamic loss scaling. positive power of 2: static loss scaling.
warmup_proportion: 0.07 # Proportion of training to perform linear rate warmup.
gradient_accumulation_steps: 1 # Number of updates steps to accumulate before performing a backward/update pass
gradient_clipping: 1.0 # Maximum gradient norm

dataloader:
n_jobs: 12 # Subprocess used for torch Dataloader
batch_size: 6 # training batch size
dev_batch_size: 6 # used for dev/test splits
max_timestep: 3000 # Max length for audio feature (0 for no restriction)

LIBRISEECH SETTINGS

data_path: 'data/libri_mel160' # Source data path, 'data/libri_mel160_subword5000' or 'data/libri_fmllr_cmvn' for different preprocessing features
target_path: 'data/libri_mel160' # Target data path for reconstruction to a different feature type, for example linear spectrograms
phone_path: 'data/libri_phone' # phone boundary label data path for the phone classification task. set to 'data/libri_phone' or 'data/cpc_phone'
train_set: ['train-clean-360'] # ['train-clean-100', 'train-clean-360', 'train-other-500'] for pre-training. ['train-clean-360'] or ['train-clean-100'] for libri phone exp or cpc phone exp, respectively.
dev_set: ['dev-clean'] #
test_set: ['test-clean'] #
train_proportion: 1.0 # Currently only effect the phone classification task, use this percent of train_set for downstream task training to demonstrate mockingjay generality

runner:

Training options

apex: False # Use APEX (see https://github.com/NVIDIA/apex for more details)
total_steps: 500000 # total steps for training, a step is a batch of update
log_step: 2500 # log training status every this amount of training steps
save_step: 10000 # save model every this amount of training steps
duo_feature: False # Use different input / output features during training
max_keep: 5 # maximum number of model ckpt to keep during training`

In this train process, I train on the all train-clean-360 and use parameters. in 'downstream.yaml'. The specific parameters are as follows:
`dataloader:
n_jobs: 6 # Subprocess used for torch Dataloader
batch_size: 6 # training batch size
dev_batch_size: 12 # used for dev/test splits
max_timestep: 0 # Max length for audio feature (0 for no restriction)

data_path: 'data/libri_mel160' # Source data path, 'data/libri_fmllr_cmvn', or 'data/libri_mfcc_cmvn', or 'data/libri_mel160_subword5000' for different preprocessing features
phone_path: 'data/libri_phone' # phone boundary label data path for the phone classification task. set to 'data/libri_phone' or 'data/cpc_phone'
libri_root: '/data/dataset/Libri/Libri/LibriSpeech/' # only used when extract features on-the-fly
train_set: ['train-clean-360'] # ['train-clean-100', 'train-clean-360', 'train-other-500'] for pre-training. ['train-clean-360'] or ['train-clean-100'] for libri phone exp or cpc phone exp, respectively.
dev_set: ['test-clean'] #
test_set: ['test-clean'] #
train_proportion: 1.0 # Currently only effect the phone classification task, use this percent of train_set for downstream task training to demonstrate mockingjay generality

runner:
learning_rate: '4e-3' # Learning rate for opt: ['4e-3' for fine-tune, '4e-3' for regualr downstream task training]
warmup_proportion: 0.07 # Proportion of training to perform linear rate warmup.
gradient_clipping: 1.0 # Maximum gradient norm
total_steps: 500000 # total steps for training, a step is a batch of update
log_step: 2000 # log training status every this amount of training steps
save_step: 2000 # save model every this amount of training steps
dev_step: 5000 # evaluate every this amount of training steps
evaluation: 'test' # can be 'dev' or 'test', show inference results and saves the best model
max_keep: 2 # maximum number of model ckpt to keep during training

model: # downstream model config, each task can have different model setting

phone_linear:
hidden_size: 0 # when linear: True, the hidden_size is ignored
drop: 0.0 # The dropout ratio, not used when linear is set to True.
linear: True # whether to make the classifier linear
layers: 1 # number of layers in the classifier, set to 2 for 1 hidden layer
concat: 1 # int, must be an odd number. Concatenation of this amount of windows to match the average size of a phoneme. Set to 1 for no concatenation, set to 9 to concat 4 previous and 4 future frames.

phone_1hidden:
hidden_size: 768 # hidden size of classifier
drop: 0.0 # The dropout ratio, not used when linear is set to True.
linear: False # whether to make the classifier linear
layers: 2 # number of layers in the classifier, set to 2 for 1 hidden layer
concat: 1 # int, must be an odd number. Concatenation of this amount of windows to match the average size of a phoneme. Set to 1 for no concatenation, set to 9 to concat 4 previous and 4 future frames.

phone_concat:
hidden_size: 0 # when linear: True, the hidden_size is ignored
drop: 0.0 # The dropout ratio, not used when linear is set to True.
linear: True # whether to make the classifier linear
layers: 1 # number of layers in the classifier, set to 2 for 1 hidden layer
concat: 9 # int, must be an odd number. Concatenation of this amount of windows to match the average size of a phoneme. Set to 1 for no concatenation, set to 9 to concat 4 previous and 4 future frames.

speaker_frame:
hidden_size: 0 # when linear: True, the hidden_size is ignored
drop: 0.0 # The dropout ratio, not used when linear is set to True.
linear: True # whether to make the classifier linear
layers: 1 # number of layers in the classifier, set to 2 for 1 hidden layer
concat: 1 # int, must be an odd number. Concatenation of this amount of windows to match the average size of a phoneme. Set to 1 for no concatenation, set to 9 to concat 4 previous and 4 future frames.

speaker_utterance:
hidden_size: 0 # when linear: True, the hidden_size is ignored
drop: 0.0 # The dropout ratio, not used when linear is set to True.
linear: True # whether to make the classifier linear
layers: 1 # number of layers in the classifier, set to 2 for 1 hidden layer
concat: 1 # int, must be an odd number. Concatenation of this amount of windows to match the average size of a phoneme. Set to 1 for no concatenation, set to 9 to concat 4 previous and 4 future frames.`

I don't know what caused this result. Would you like to help me? Thanks you very much !

How can you use speech labels with downstream tasks?

Hi, thanks for your great work for this.

I would like to test some of your approaches and want to clarify those.

When you do the pre-training, can you also use labels if we have? I don't think it is possible based on the paper since it reconstruct frames with L1 loss.
For downstream tasks, it said a model 0.1 % labeled data outperformed 100% Mel-features data.
So, when you train downstream tasks, you only train 0.1% labeled data and audio file among the whole dataset?

Thanks.

Tutorial for application on custom dataset

Is there a tutorial/guide to apply this code for a different custom dataset

About the MOCKINGJAY model

Hi Andy,
Thanks for all the amazing works!
I just have a question about the Mockingjay model. It is mentioned in the paper that a linear-scale spectrogram is also used as the output reconstruction target. Do you have any results showing the difference between linear spectrogram and melspectrogram? Also, is it possible that you provide some pre-trained models for the linear spectrogram? I think it would be more convenient to some more complex downstream tasks like deep perceptual loss.

Best Regards

Difficulties pre-training with filter bank features

Hi,

First of all thank you for a very clean code base for your work.
I am trying to pre-train a mode with filter-bank features for librispeech using the provided config upstream_fbankBase.yaml.
The only change I have made is setting

train_set:  ['train-clean-100', 'train-clean-360', 'train-other-500']

However, my loss seems to be stuck around 0.98. I was wondering if you could tell me the expected loss on filterbank features.
The pre-trained model also seems to doing bad on montreal-phone classification tasks.

Note that, I was able to pretrain using mel-160 features for same dataset and downstream task.

Thank you for your help on this.

Best Regards,
Apoorv

downstream input and output

Hi, I studied this code for days, and have some questions for it.
When training the downstream model, when I set phone_path to be cpc_phone, dataloaders will load features and labels. For example, the shape of labels is [12,1544], and for features, that is [12,1544,160].
12 is batch_size, maybe 1544 means the loaded feature is consist of 1544 phones(after zero padding?), and each 160-dim feature represents one phone. After the upstream model, it becomes a [12,1544,768] tensor which is the output of transformer.
Do I understand correctly?

label for pretrain

Hi, sorry to bother again.
I want to use part of the data with my own label during pretraining, so how to train the upstream model supervised?
Could I train it in supervised way firstly with my label, and then use the model to the downstream task with the cpc_phone label? I'm afraid the two different sets of phone labels would confuse the model.

Fail in resolving dependencies

As i pip install -r requirements.txt, I got errors

The user requested numpy==1.14.5
librosa 0.7.2 depends on numpy>=1.15.0`

The user requested numpy==1.14.5
matplotlib 2.2.3 depends on numpy>=1.7.1
numba 0.48.0 depends on numpy>=1.15`

I think change numpy==1.14.5 in requirements.txt to numpy==1.15 should fix the issue. Has anyone experienced the same problem?

Evaluating other upstream models

Dear @andi611,

First of all I would like to thank you for this work/package. It is really helpful for people doing unsupervised speech representation, I'm learning a lot from you.
I've read your paper on Mockingjay and I'm trying to compare some results of an upstream method that I have been working on against yours. To make a fair comparison I would like to use the "One layer RNN" that is described on your paper. The parameters used for speaker recognition and sentiment analysis are not clear to me. Could you confirm if you did use these ones that I found on your GDrive?

rnn:
mode: 'classification'
input_dim: 'None'
select_hidden: 'last'
sample_rate: 5
pre_linear_dims: []
hidden_size: 32
post_linear_dims: [32]
drop: 0.5

Again, Thank you for your work!

Cheers,

Questions about sample rate and data normalization

Hi, thanks for your amazing open source project! How can I use different sample rate (e.g. 44.1kHz), espectially for the online config? It seems that the current version is less flexible for the On-the-fly settings. In addition, is there any data normalization in the preprocessing stage?

Thanks a lot.

How to resume downstream training

Thank you for sharing your great work!
I have a trouble when I learn to use this toolkit. I was trying to use the pre-trained mockingjay model on librispeech dataset, and then performed fine-tune for the phone classification task. Then I noticed the total step was 500000, which was same as the step for pre-training the model, then I stopped the training and reduced the value of total step. But when I wanted to resume at the saved fine-tune model, it reported error:

[run_downstream] - getting upstream model: transformer
Traceback (most recent call last):
  File "run_downstream.py", line 222, in <module>
    main()
  File "run_downstream.py", line 202, in main
    upstream_model = get_upstream_model(args) ######### plug in your upstream pre-trained model here #########
  File "run_downstream.py", line 111, in get_upstream_model
    upstream_model = TRANSFORMER(options, args.input_dim, online_config=args.online_config)
  File "/home/speech/Self-Supervised-Speech-Pretraining-and-Representation-Learning/transformer/nn_transformer.py", line 305, in __init__
    super(TRANSFORMER, self).__init__(options, inp_dim, config, online_config)
  File "/home/speech/Self-Supervised-Speech-Pretraining-and-Representation-Learning/transformer/nn_transformer.py", line 61, in __init__
    self.model_config = TransformerConfig(self.config)
  File "/home/speech/Self-Supervised-Speech-Pretraining-and-Representation-Learning/transformer/model.py", line 27, in __init__
    self.downsample_rate = config['transformer']['downsample_rate']
KeyError: 'transformer'

The command I use was python run_downstream.py --run=phone_linear --upstream=transformer --ckpt=./result/result_transformer_cpc_phone/exp_632/states-70000.ckpt
and states-70000.ckpt is a saved ckpt during fine-tune

Training our own data with multi-GPU

Thank you for opening such good codes.

But when I use it, I can only use one GPU for model training. How can I use multiple GPUs to train a model?

what's the L1Loss value when pretrain model's training finished ?

Errors from example_finetune.py

Hi,

I found one error from example_finetune.py

Traceback (most recent call last):
exec(code_obj, self.user_global_ns, self.user_ns)
File "", line 1, in
loss = classifier(reps, torch.LongTensor([0, 1, 0]))
File "/Users/user/pt/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in call
result = self.forward(*input, **kwargs)
File "/Users/user/github/Mockingjay-Speech-Representation/downstream/model.py", line 282, in forward
loss = self.criterion(result, labels)
File "/Users/user/pt/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in call
result = self.forward(*input, **kwargs)
File "/Users/user/pt/lib/python3.6/site-packages/torch/nn/modules/loss.py", line 916, in forward
ignore_index=self.ignore_index, reduction=self.reduction)
File "/Users/user/pt/lib/python3.6/site-packages/torch/nn/functional.py", line 2021, in cross_entropy
return nll_loss(log_softmax(input, 1), target, weight, None, ignore_index, None, reduction)
File "/Users/user/pt/lib/python3.6/site-packages/torch/nn/functional.py", line 1836, in nll_loss
.format(input.size(0), target.size(0)))
ValueError: Expected input batch_size (1200) to match target batch_size (3).

It might be the simple issue, but for the reminder.

What is the actual input to the model?

It appears to be a Mel Spectrogram, but that’s not clear from the paper. Can you please explain?

Backward Compatibility to Prevent - KeyError: 'prune_headids' with Old Models

I downloaded an updated version of the Github repository and I am trying to activate an old saved model of yours from Google Drive.

When running the code below:

import torch
from runner_mockingjay import get_mockingjay_model

example_path = '/home/ec2-user/SageMaker/Mockingjay-Speech-Representation/result/result_mockingjay/mockingjay-500000.ckpt'
mockingjay = get_mockingjay_model(from_path=example_path, display_settings=True)

# A batch of spectrograms: (batch_size, seq_len, hidden_size)
spec = torch.zeros(3, 800, 160)

# reps.shape: (batch_size, num_hiddem_layers, seq_len, hidden_size)
reps = mockingjay.forward(spec=spec, all_layers=True, tile=True)

# reps.shape: (batch_size, num_hiddem_layers, seq_len // downsample_rate, hidden_size)
reps = mockingjay.forward(spec=spec, all_layers=True, tile=False)

# reps.shape: (batch_size, seq_len, hidden_size)
reps = mockingjay.forward(spec=spec, all_layers=False, tile=True)

# reps.shape: (batch_size, seq_len // downsample_rate, hidden_size)
reps = mockingjay.forward(spec=spec, all_layers=False, tile=False)

I get the error:

[SOLVER] -  Initializing Mockingjay model.
[SOLVER] -  Number of parameters: 21388800
[SOLVER] -  Load model from /home/ec2-user/SageMaker/Mockingjay-Speech-Representation/result/result_mockingjay/mockingjay-500000.ckpt
[SOLVER] -  [Mockingjay] - Loaded
[SOLVER] -  Model loading complete!
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-44-913ea2308ae0> in <module>()
      9 
     10 # reps.shape: (batch_size, num_hiddem_layers, seq_len, hidden_size)
---> 11 reps = mockingjay.forward(spec=spec, all_layers=True, tile=True)
     12 
     13 # reps.shape: (batch_size, num_hiddem_layers, seq_len // downsample_rate, hidden_size)

~/SageMaker/Mockingjay-Speech-Representation/mockingjay/solver.py in forward(self, spec, all_layers, tile, process_from_loader)
    686 
    687             head_mask = None
--> 688             prune_headids = self.config['mockingjay']['prune_headids']
    689             if prune_headids is not None:
    690                 layer_num = self.config['mockingjay']['num_hidden_layers']

KeyError: 'prune_headids'

I fixed it.
I added two lines in the runner_mockingjay.py file, in get_mockingjay_model method:

    if 'prune_headids' not in config['mockingjay']:
        config['mockingjay']['prune_headids'] = None

This is the whole method:

def get_mockingjay_model(from_path='result/result_mockingjay/mockingjay_libri_sd1337_best/mockingjay-500000.ckpt', display_settings=False):
    ''' Wrapper that loads the mockingjay model from checkpoint path '''
    print("im here - get_mockingjay_model")

    # load config and paras
    all_states = torch.load(from_path, map_location='cpu')
    config = all_states['Settings']['Config']
    paras = all_states['Settings']['Paras']
    
    if not hasattr(paras, 'multi_gpu'):
        setattr(paras, 'multi_gpu', False)

    # added this lines
    ####################################################        
    if 'prune_headids' not in config['mockingjay']:
        config['mockingjay']['prune_headids'] = None
    ####################################################

    # display checkpoint settings
    if display_settings:
        for cluster in config:
            print(cluster + ':')
            for item in config[cluster]:
                print('\t' + str(item) + ': ', config[cluster][item])
        print('paras:')
        v_paras = vars(paras)
        for item in v_paras:
            print('\t' + str(item) + ': ', v_paras[item])

    # load model with Tester
    from mockingjay.solver import Tester
    mockingjay = Tester(config, paras)
    mockingjay.set_model(inference=True, with_head=False, from_path=from_path)
    return mockingjay

The problem was that the new function parse_prune_heads is not getting called when using an existing model, only when creating a new model.
So, the config['mockingjay']['prune_headids'] was never initialized to None for old models.

Hope it helped,

Questions about preprocessing custom dataset

I am a newcomer to the audio field. I have some questions when use this project to generate the audio embedding for my multimodality model (text and audio)

I want to use Mockingjay, and run `python preprocess_any.py --feature_type=mel' but get 80 dim features, I just simply change num_mel in utility/audio.py from 80 to 160(I see this model need 160dim mel features in README), is it right?

Thanks a lot!

Questions regarding TERA

I have several questions regarding your TERA paper:

You used masked reconstruction for 80%, noisy reconstruction for 10% and clean reconstruction for 10%. Is this strategy better than 100% masked reconstruction? By how much?
Your feature extractor is a transformer and you added liGRU on top of it for finetuning. Why you used GRU instead of transformers for finetuning? Does GRU perform better?
You also mentioned in the paper that not freezing TERA in finetuning performs better. By not freezing, do you mean you don't freeze it from the beginning, or you freeze TERA for a while and then defreeze it in the later epochs?
To understand your TERA code, which files should I look into? It seems that they are wrapped quite deep?
Thx!

GoogleDrive 404

Hello,

unfortunately the Google Drive with the Training Checkpoints isnt available anymore. Could you please reup or update the link?

Thank you very much

baseline feature performance issue

Hi, thanks for the great code, really appreciate it!

When I'm running downstream evaluation written by my self, I found that mfcc feature with utterance based speaker classification can reach high accuracy. But when I use s3prl to run. I get poor result, with command:

python preprocess/preprocess_libri.py --feature_type=mfcc --delta=True --delta_delta=True
python run_downstream.py --run=speaker_utterance --upstream=baseline --config=config/downstream.yaml --input_dim=39
the accuracy is 0.03~0.05

And I made some modification:

change learning rate in downstream.yaml to 0.001
python preprocess/preprocess_libri.py --feature_type=mfcc --delta=True --delta_delta=True --apply_cmvn=False
python run_downstream.py --run=speaker_utterance --upstream=baseline --config=config/downstream.yaml --input_dim=39

and the accuracy went to 0.88, could you please check for this? Did I misunderstood anything?

Normalization for Librispeech pre-training data

Hi, thanks for your excellent work on this repository.

I am wondering if you performed any additional normalization on the Librispeech dataset before pre-training the model (e.g. Mockingjay)?

The only normalization function I found is this. But did you also perform other normalization, e.g. normalize to zero mean and unit variance per speaker?

Questions about the masking

Thanks for your great project.

From the paper Mockingjay, I noticed that the networks to reconstruct the corresponding linear spectrogram from the masked input so I assume the reconstruction loss is only computed on masked frames.
So did you try to reconstruct all the frames?

Thanks in advance.

The phone accuracy of Mockingjay

Dear Author :
Hello ! I am a Master from xi’an jiaotong university, thanks for your amazing open source project! I have executed the code of Mockingjay, but the model achieve 8%. phone accuracy in test-clean dataset and 70% in train dataset.

In this pretrain process, I train on train-clean-360 dataset for 500k total training steps and use parameters in 'mockingjay_libri_melBase.yaml'. The target is mel-160 features.

In this train process, I train on the all train-clean-360 and use parameters. in 'downstream.yaml'.

I don't know what caused this result. Would you like to help me? Thanks you very much !

Error in loading state_dict for TransformerModel

Hi thanks for the codebase! I want to use your pretrained models to encode my speech segments and feed the representations to a linear classifier for downstream tasks. However I can't seem to even load the downloaded weights:

In [1]: options = {
   ...:     'ckpt_file'     : '/mnt/sdb/Tools/Self-Supervised-Speech-Pretraining-and-Representation-Learning/pretrained-models/mockingjay/MelLargeM6-libri/states-500000.ckpt',
   ...:     'load_pretrain' : 'True',
   ...:     'no_grad'       : 'True',
   ...:     'dropout'       : 'default',
   ...:     'spec_aug'      : 'False',
   ...:     'spec_aug_prev' : 'True',
   ...:     'weighted_sum'  : 'False',
   ...:     'select_layer'  : -1,
   ...: }

In [2]: import torch
   ...: from transformer.nn_transformer import TRANSFORMER
   ...: from downstream.model import example_classifier
   ...: from downstream.solver import get_optimizer
Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex .

In [3]: transformer = TRANSFORMER(options=options, inp_dim=40)
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
/mnt/sdb/Tools/Self-Supervised-Speech-Pretraining-and-Representation-Learning/transformer/nn_transformer.py in load_model(self, state_dict)
    150                 raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
--> 151                                     self.model.__class__.__name__, '\n\t'.join(error_msgs)))
    152             print('[Transformer] - Pre-trained weights loaded!')

RuntimeError: Error(s) in loading state_dict for TransformerModel:
	size mismatch for input_representations.spec_transform.weight: copying a param with shape torch.Size([768, 480]) from checkpoint, the shape in current model is torch.Size([768, 120]).

During handling of the above exception, another exception occurred:

RuntimeError                              Traceback (most recent call last)
<ipython-input-3-1cf1c36e7053> in <module>
----> 1 transformer = TRANSFORMER(options=options, inp_dim=40)

/mnt/sdb/Tools/Self-Supervised-Speech-Pretraining-and-Representation-Learning/transformer/nn_transformer.py in __init__(self, options, inp_dim, config)
    100         load = bool(strtobool(options["load_pretrain"]))
    101         if load:
--> 102             self.load_model(all_states['Transformer'])
    103             print('[Transformer] - Number of parameters: ' + str(sum(p.numel() for p in self.model.parameters() if p.requires_grad)))
    104

/mnt/sdb/Tools/Self-Supervised-Speech-Pretraining-and-Representation-Learning/transformer/nn_transformer.py in load_model(self, state_dict)
    153
    154         except:
--> 155             raise RuntimeError('[Transformer] - Pre-trained weights NOT loaded!')
    156
    157

RuntimeError: [Transformer] - Pre-trained weights NOT loaded!

Am I missing something obvious here? Thanks.

How to implement masking segments along channel axis

Hi Andy,
I really like your new paper!
When reading the code and paper TERA, I have a question: how to implement masking segments along channel axis? Is it done when extracting acoustic features? I will appreciate if you could give me some information about that.
Thanks!

The link to download pretrained models is not working

Hello

I tried to access the link - https://github.com/andi611/Self-Supervised-Speech-Pretraining-and-Representation-Learning/blob/master/bit.ly/drive-s3prl but it is not accessible.

Can you please clarify which link do we need to access for the pretrained model?

Thanks

L1 loss value of AALBERT pre-training task

Can anyone tell me what is the approximate value of L1 Loss in the training/validation set after 500K iterations are completed during the pre-training task of AALBERT?

Question about the "upsampling method" for the extracted feature.

Hi, the function forward_fine_tune uses tile_representations to upsample the extracted features, which simply duplicates each feature frame downsample_rate times along the time axis.

I am wondering why you choose this specific upsampling method instead of other methods, such as splitting one feature frame to downsample_rate consecutive frames? Does this specific upsampling method provide the best performance?

Thanks in advance!

Resume an Interrupted Training

Dear Author, I'm sorry to bother you for some simple question.
1.When i trained my own pre-training model with our dataset , I accidentally interrupted the training. And i wanted to go on training on the checkpoint model file, but i did'nt find a way to do so.
2.Similar to question 1. When i evaluated downstream task with my own classifier on your pre-trained model--melLarge, the program broke down sometimes for my data's big size. I can't continue on my saved ckpt file using your example code givev in README.md, though i saved the total net rather than only weights.
My English is not good. Sorry!

Downstream Evaluation RuntimeError: size mismatch

Hi Andi,

Thanks for providing this code and data.

I am trying to run downstream evaluation with pre-trained model and data from S3PRL drive.

This is the command I am running:

python run_downstream.py --run=phone_linear --upstream=transformer ckpt=../S3PRL/mockingjay/MelMediumM9_libri/states-500000.ckpt

I am able to load the model:

[run_downstream] - getting upstream model: transformer
[Transformer] - Pre-trained weights loaded!
[Transformer] - Number of parameters: 42898176

But the downstream training fails:

[run_downstream] - Loading input data: ['train-clean-100'] from data/libri_fmllr_cmvn
[run_downstream] - Loading phone data: data/cpc_phone
[run_downstream] - getting train dataloader...
[Dataset] - Possible phone classes: 41, number of data: 20547
[run_downstream] - getting dev dataloader...
[Dataset] - Possible phone classes: 41, number of data: 2283
[run_downstream] - getting test dataloader...
[Dataset] - Possible phone classes: 41, number of data: 5708

I get the following error at the beginning of the training:

RuntimeError: size mismatch, m1: [3186 x 120], m2: [480 x 768] at /pytorch/aten/src/TH/generic/THTensorMath.cpp:41
Iteration: 0% 0/3425 [00:14<?, ?it/s]
0% 0/500000 [00:15<?, ?it/s]

What can be the problem?

Regards
Joram

Regarding data preparation with Librispeech

Hi, thanks for your amazing open source project! We're reading manual on feature extraction with Librispeech and we're only interested in using pre-trained librispeech model to finetune on librispeech-clean-100. Can we comment out all the other directories (like train_clean_360, train_other_500) in run.sh? It seems like doing the preprocesing on the whole 960 hours of data takes too much time.

a bug when use train-100h and cpc-phone

Could there be a little bug in the code?
I got great accuracy when I used train-100h subset to pretrain. I notice when using train-100h subset to pretrain the upstream model, it may load all of the data, but when test with cpc-phone, it use a split of the train-100h subset. The model may have already seen the test data.
I'm not sure if it is right.
Looking forward for you reply.

can i preprocess the libri_mel160_subword5000 with the preprocess_libri script?

I have the librispeech dataset, but when i download the feature from the google drive, it's slow. So I wonder can I process it with the original dataset? I'm a little confuse about the "subword" meaning for mel feature.

Fine tune Mockingjay LARGE

Mockingjay paper says "We did not fine-tune the LARGE model, as it is meant for extracting representations." Does it mean Mockingjay LARGE will have worse performance after we fine tune it?

Weights of MockingjayModel not initialized from pretrained model

Hey,

I've created my downstream solver and started training, I saw some weird behavior that I can not explain.

First loading from your checkpoint works fine:

[Dataset] - Computing pathology class...
label HC idx 0
label CV idx 1
[Dataset] - Possible pathology classes:  2
Load model from /home/ec2-user/SageMaker/Mockingjay-Speech-Representation/result/result_mockingjay/mockingjay-500000.ckpt
[Mockingjay] - Pre-trained weights loaded!
[Mockingjay] - Number of parameters: 21388800

When saving the model like:

def save_model(self, name, model_all=True, assign_name=None):

# ----------------------------------------for printing---------------------------------------------- # 
        print(assign_name)
        print(self.mockingjay.state_dict())
# --------------------------------------------------------------------------------------------------- #  
      if model_all:
            all_states = {
                'Classifier': self.classifier.state_dict(),
                'Mockingjay': self.mockingjay.state_dict(),
                'Optimizer': self.optimizer.state_dict(),
                'Global_step': self.global_step,
                'Settings': {
                    'Config': self.config,
                    'Paras': self.paras,
                },
            }
        else:
            all_states = {
                'Classifier': self.classifier.state_dict(),
                'Settings': {
                    'Config': self.config,
                    'Paras': self.paras,
                },
            }

        if assign_name is not None:
            model_path = f'{self.ckptdir}/{assign_name}.ckpt'
            torch.save(all_states, model_path)

# -----------------------------------------for printing--------------------------------------------- #
            all_states = torch.load(model_path, map_location='cpu')
            print(assign_name)
            print(all_states["Mockingjay"])
            self.mockingjay.load_state_dict(all_states["Mockingjay"])
# --------------------------------------------------------------------------------------------------- #

            return

        new_model_path = '{}/{}-{}.ckpt'.format(self.ckptdir, name, self.global_step)
        torch.save(all_states, new_model_path)
        self.models_kept.append(new_model_path)
        
        if len(self.models_kept) >= self.max_keep:
            os.remove(self.models_kept[0])
            self.models_kept.pop(0)

It is printing the model attributes like this: model.input_representations.spec_transform.weight .
with model before the actual attribute.

when I try to load like this:

    def load_model(self, inference=False):
        input_dim = self.input_dim
        
        print('Load model from {}'.format(self.ckpt_path))
        self.write_to_log('Load model from {}'.format(self.ckpt_path))
        all_states = torch.load(self.ckpt_path, map_location='cpu')
        
        # setup the mockingjay model
        options = {
            'ckpt_file' : self.ckpt_path,
            'load_pretrain' : 'True',
            'no_grad' : 'False',
            'dropout' : 'default'
        }

        self.mockingjay = MOCKINGJAY(options=options, inp_dim=160).to(self.device) 
        self.write_to_log('[Mockingjay] - Loaded')
        self.classifier = RnnClassifier(input_dim=768,
                                class_num=self.dataloader.dataset.class_num,
                                task=self.task,
                                dconfig=self.config['downstream']).to(self.device)
        
        if not inference:
            self.mockingjay.train()
            self.classifier.train()

            param_optimizer = list(self.mockingjay.named_parameters()) + list(self.classifier.named_parameters())
            self.optimizer = get_mockingjay_optimizer(params=param_optimizer, 
                                                      lr=self.learning_rate, 
                                                      warmup_proportion=self.warmup_proportion,
                                                      training_steps=self.total_steps)
        else:      
            self.classifier.load_state_dict(all_states['Classifier'])
            self.write_to_log('[Classifier] - Loaded')
            self.mockingjay.eval()
            self.classifier.eval()

        
        self.write_to_log('Model loading completed!')

It is loading with the warning:

[Dataset] - Computing pathology class...
label HC idx 0
label CV idx 1
[Dataset] - Possible pathology classes:  2
Load model from /home/ec2-user/SageMaker/Mockingjay-Speech-Representation/coronavirus/ckpt/mockingjay_corona_sd20190929_td2020-04-28/tmp.ckpt
Weights of MockingjayModel not initialized from pretrained model: ['input_representations.spec_transform.weight', 'input_representations.spec_transform.bias', 'input_representations.LayerNorm.weight', 'input_representations.LayerNorm.bias', 'encoder.layer.0.attention.self.query.weight', 'encoder.layer.0.attention.self.query.bias', 'encoder.layer.0.attention.self.key.weight', 'encoder.layer.0.attention.self.key.bias', 'encoder.layer.0.attention.self.value.weight', 'encoder.layer.0.attention.self.value.bias', 'encoder.layer.0.attention.output.dense.weight', 'encoder.layer.0.attention.output.dense.bias', 'encoder.layer.0.attention.output.LayerNorm.weight', 'encoder.layer.0.attention.output.LayerNorm.bias', 'encoder.layer.0.intermediate.dense.weight', 'encoder.layer.0.intermediate.dense.bias', 'encoder.layer.0.output.dense.weight', 'encoder.layer.0.output.dense.bias', 'encoder.layer.0.output.LayerNorm.weight', 'encoder.layer.0.output.LayerNorm.bias', 'encoder.layer.1.attention.self.query.weight', 'encoder.layer.1.attention.self.query.bias', 'encoder.layer.1.attention.self.key.weight', 'encoder.layer.1.attention.self.key.bias', 'encoder.layer.1.attention.self.value.weight', 'encoder.layer.1.attention.self.value.bias', 'encoder.layer.1.attention.output.dense.weight', 'encoder.layer.1.attention.output.dense.bias', 'encoder.layer.1.attention.output.LayerNorm.weight', 'encoder.layer.1.attention.output.LayerNorm.bias', 'encoder.layer.1.intermediate.dense.weight', 'encoder.layer.1.intermediate.dense.bias', 'encoder.layer.1.output.dense.weight', 'encoder.layer.1.output.dense.bias', 'encoder.layer.1.output.LayerNorm.weight', 'encoder.layer.1.output.LayerNorm.bias', 'encoder.layer.2.attention.self.query.weight', 'encoder.layer.2.attention.self.query.bias', 'encoder.layer.2.attention.self.key.weight', 'encoder.layer.2.attention.self.key.bias', 'encoder.layer.2.attention.self.value.weight', 'encoder.layer.2.attention.self.value.bias', 'encoder.layer.2.attention.output.dense.weight', 'encoder.layer.2.attention.output.dense.bias', 'encoder.layer.2.attention.output.LayerNorm.weight', 'encoder.layer.2.attention.output.LayerNorm.bias', 'encoder.layer.2.intermediate.dense.weight', 'encoder.layer.2.intermediate.dense.bias', 'encoder.layer.2.output.dense.weight', 'encoder.layer.2.output.dense.bias', 'encoder.layer.2.output.LayerNorm.weight', 'encoder.layer.2.output.LayerNorm.bias', 'encoder.layer.3.attention.self.query.weight', 'encoder.layer.3.attention.self.query.bias', 'encoder.layer.3.attention.self.key.weight', 'encoder.layer.3.attention.self.key.bias', 'encoder.layer.3.attention.self.value.weight', 'encoder.layer.3.attention.self.value.bias', 'encoder.layer.3.attention.output.dense.weight', 'encoder.layer.3.attention.output.dense.bias', 'encoder.layer.3.attention.output.LayerNorm.weight', 'encoder.layer.3.attention.output.LayerNorm.bias', 'encoder.layer.3.intermediate.dense.weight', 'encoder.layer.3.intermediate.dense.bias', 'encoder.layer.3.output.dense.weight', 'encoder.layer.3.output.dense.bias', 'encoder.layer.3.output.LayerNorm.weight', 'encoder.layer.3.output.LayerNorm.bias', 'encoder.layer.4.attention.self.query.weight', 'encoder.layer.4.attention.self.query.bias', 'encoder.layer.4.attention.self.key.weight', 'encoder.layer.4.attention.self.key.bias', 'encoder.layer.4.attention.self.value.weight', 'encoder.layer.4.attention.self.value.bias', 'encoder.layer.4.attention.output.dense.weight', 'encoder.layer.4.attention.output.dense.bias', 'encoder.layer.4.attention.output.LayerNorm.weight', 'encoder.layer.4.attention.output.LayerNorm.bias', 'encoder.layer.4.intermediate.dense.weight', 'encoder.layer.4.intermediate.dense.bias', 'encoder.layer.4.output.dense.weight', 'encoder.layer.4.output.dense.bias', 'encoder.layer.4.output.LayerNorm.weight', 'encoder.layer.4.output.LayerNorm.bias', 'encoder.layer.5.attention.self.query.weight', 'encoder.layer.5.attention.self.query.bias', 'encoder.layer.5.attention.self.key.weight', 'encoder.layer.5.attention.self.key.bias', 'encoder.layer.5.attention.self.value.weight', 'encoder.layer.5.attention.self.value.bias', 'encoder.layer.5.attention.output.dense.weight', 'encoder.layer.5.attention.output.dense.bias', 'encoder.layer.5.attention.output.LayerNorm.weight', 'encoder.layer.5.attention.output.LayerNorm.bias', 'encoder.layer.5.intermediate.dense.weight', 'encoder.layer.5.intermediate.dense.bias', 'encoder.layer.5.output.dense.weight', 'encoder.layer.5.output.dense.bias', 'encoder.layer.5.output.LayerNorm.weight', 'encoder.layer.5.output.LayerNorm.bias', 'encoder.layer.6.attention.self.query.weight', 'encoder.layer.6.attention.self.query.bias', 'encoder.layer.6.attention.self.key.weight', 'encoder.layer.6.attention.self.key.bias', 'encoder.layer.6.attention.self.value.weight', 'encoder.layer.6.attention.self.value.bias', 'encoder.layer.6.attention.output.dense.weight', 'encoder.layer.6.attention.output.dense.bias', 'encoder.layer.6.attention.output.LayerNorm.weight', 'encoder.layer.6.attention.output.LayerNorm.bias', 'encoder.layer.6.intermediate.dense.weight', 'encoder.layer.6.intermediate.dense.bias', 'encoder.layer.6.output.dense.weight', 'encoder.layer.6.output.dense.bias', 'encoder.layer.6.output.LayerNorm.weight', 'encoder.layer.6.output.LayerNorm.bias', 'encoder.layer.7.attention.self.query.weight', 'encoder.layer.7.attention.self.query.bias', 'encoder.layer.7.attention.self.key.weight', 'encoder.layer.7.attention.self.key.bias', 'encoder.layer.7.attention.self.value.weight', 'encoder.layer.7.attention.self.value.bias', 'encoder.layer.7.attention.output.dense.weight', 'encoder.layer.7.attention.output.dense.bias', 'encoder.layer.7.attention.output.LayerNorm.weight', 'encoder.layer.7.attention.output.LayerNorm.bias', 'encoder.layer.7.intermediate.dense.weight', 'encoder.layer.7.intermediate.dense.bias', 'encoder.layer.7.output.dense.weight', 'encoder.layer.7.output.dense.bias', 'encoder.layer.7.output.LayerNorm.weight', 'encoder.layer.7.output.LayerNorm.bias', 'encoder.layer.8.attention.self.query.weight', 'encoder.layer.8.attention.self.query.bias', 'encoder.layer.8.attention.self.key.weight', 'encoder.layer.8.attention.self.key.bias', 'encoder.layer.8.attention.self.value.weight', 'encoder.layer.8.attention.self.value.bias', 'encoder.layer.8.attention.output.dense.weight', 'encoder.layer.8.attention.output.dense.bias', 'encoder.layer.8.attention.output.LayerNorm.weight', 'encoder.layer.8.attention.output.LayerNorm.bias', 'encoder.layer.8.intermediate.dense.weight', 'encoder.layer.8.intermediate.dense.bias', 'encoder.layer.8.output.dense.weight', 'encoder.layer.8.output.dense.bias', 'encoder.layer.8.output.LayerNorm.weight', 'encoder.layer.8.output.LayerNorm.bias', 'encoder.layer.9.attention.self.query.weight', 'encoder.layer.9.attention.self.query.bias', 'encoder.layer.9.attention.self.key.weight', 'encoder.layer.9.attention.self.key.bias', 'encoder.layer.9.attention.self.value.weight', 'encoder.layer.9.attention.self.value.bias', 'encoder.layer.9.attention.output.dense.weight', 'encoder.layer.9.attention.output.dense.bias', 'encoder.layer.9.attention.output.LayerNorm.weight', 'encoder.layer.9.attention.output.LayerNorm.bias', 'encoder.layer.9.intermediate.dense.weight', 'encoder.layer.9.intermediate.dense.bias', 'encoder.layer.9.output.dense.weight', 'encoder.layer.9.output.dense.bias', 'encoder.layer.9.output.LayerNorm.weight', 'encoder.layer.9.output.LayerNorm.bias', 'encoder.layer.10.attention.self.query.weight', 'encoder.layer.10.attention.self.query.bias', 'encoder.layer.10.attention.self.key.weight', 'encoder.layer.10.attention.self.key.bias', 'encoder.layer.10.attention.self.value.weight', 'encoder.layer.10.attention.self.value.bias', 'encoder.layer.10.attention.output.dense.weight', 'encoder.layer.10.attention.output.dense.bias', 'encoder.layer.10.attention.output.LayerNorm.weight', 'encoder.layer.10.attention.output.LayerNorm.bias', 'encoder.layer.10.intermediate.dense.weight', 'encoder.layer.10.intermediate.dense.bias', 'encoder.layer.10.output.dense.weight', 'encoder.layer.10.output.dense.bias', 'encoder.layer.10.output.LayerNorm.weight', 'encoder.layer.10.output.LayerNorm.bias', 'encoder.layer.11.attention.self.query.weight', 'encoder.layer.11.attention.self.query.bias', 'encoder.layer.11.attention.self.key.weight', 'encoder.layer.11.attention.self.key.bias', 'encoder.layer.11.attention.self.value.weight', 'encoder.layer.11.attention.self.value.bias', 'encoder.layer.11.attention.output.dense.weight', 'encoder.layer.11.attention.output.dense.bias', 'encoder.layer.11.attention.output.LayerNorm.weight', 'encoder.layer.11.attention.output.LayerNorm.bias', 'encoder.layer.11.intermediate.dense.weight', 'encoder.layer.11.intermediate.dense.bias', 'encoder.layer.11.output.dense.weight', 'encoder.layer.11.output.dense.bias', 'encoder.layer.11.output.LayerNorm.weight', 'encoder.layer.11.output.LayerNorm.bias']
Weights from pretrained model not used in MockingjayModel: ['model.input_representations.spec_transform.weight', 'model.input_representations.spec_transform.bias', 'model.input_representations.LayerNorm.weight', 'model.input_representations.LayerNorm.bias', 'model.encoder.layer.0.attention.self.query.weight', 'model.encoder.layer.0.attention.self.query.bias', 'model.encoder.layer.0.attention.self.key.weight', 'model.encoder.layer.0.attention.self.key.bias', 'model.encoder.layer.0.attention.self.value.weight', 'model.encoder.layer.0.attention.self.value.bias', 'model.encoder.layer.0.attention.output.dense.weight', 'model.encoder.layer.0.attention.output.dense.bias', 'model.encoder.layer.0.attention.output.LayerNorm.weight', 'model.encoder.layer.0.attention.output.LayerNorm.bias', 'model.encoder.layer.0.intermediate.dense.weight', 'model.encoder.layer.0.intermediate.dense.bias', 'model.encoder.layer.0.output.dense.weight', 'model.encoder.layer.0.output.dense.bias', 'model.encoder.layer.0.output.LayerNorm.weight', 'model.encoder.layer.0.output.LayerNorm.bias', 'model.encoder.layer.1.attention.self.query.weight', 'model.encoder.layer.1.attention.self.query.bias', 'model.encoder.layer.1.attention.self.key.weight', 'model.encoder.layer.1.attention.self.key.bias', 'model.encoder.layer.1.attention.self.value.weight', 'model.encoder.layer.1.attention.self.value.bias', 'model.encoder.layer.1.attention.output.dense.weight', 'model.encoder.layer.1.attention.output.dense.bias', 'model.encoder.layer.1.attention.output.LayerNorm.weight', 'model.encoder.layer.1.attention.output.LayerNorm.bias', 'model.encoder.layer.1.intermediate.dense.weight', 'model.encoder.layer.1.intermediate.dense.bias', 'model.encoder.layer.1.output.dense.weight', 'model.encoder.layer.1.output.dense.bias', 'model.encoder.layer.1.output.LayerNorm.weight', 'model.encoder.layer.1.output.LayerNorm.bias', 'model.encoder.layer.2.attention.self.query.weight', 'model.encoder.layer.2.attention.self.query.bias', 'model.encoder.layer.2.attention.self.key.weight', 'model.encoder.layer.2.attention.self.key.bias', 'model.encoder.layer.2.attention.self.value.weight', 'model.encoder.layer.2.attention.self.value.bias', 'model.encoder.layer.2.attention.output.dense.weight', 'model.encoder.layer.2.attention.output.dense.bias', 'model.encoder.layer.2.attention.output.LayerNorm.weight', 'model.encoder.layer.2.attention.output.LayerNorm.bias', 'model.encoder.layer.2.intermediate.dense.weight', 'model.encoder.layer.2.intermediate.dense.bias', 'model.encoder.layer.2.output.dense.weight', 'model.encoder.layer.2.output.dense.bias', 'model.encoder.layer.2.output.LayerNorm.weight', 'model.encoder.layer.2.output.LayerNorm.bias']
[Mockingjay] - Pre-trained weights loaded!
[Mockingjay] - Number of parameters: 85425408

First question:
Why do I have 85425408 parameters, and the your original checkpoint had 21388800 parameters?

After that I understood that the model wants me to save the attributes without the model before the actual attribute.
So I changed one line in the way I save the model:

all_states = {
                'Classifier': self.classifier.state_dict(),
                'Mockingjay': self.mockingjay.model.state_dict(),  # I added the model after mockingjay
                'Optimizer': self.optimizer.state_dict(),
                'Global_step': self.global_step,
                'Settings': {
                    'Config': self.config,
                    'Paras': self.paras,
                },
            }

I added the word model after the word mockingjay.
This should save the attributes without the word model before the actual attribute.

It is printing the model attributes like this input_representations.spec_transform.weight.
Without the model word before the actual attribute.
But still.. I have a warning:

[Dataset] - Computing pathology class...
label HC idx 0
label CV idx 1
[Dataset] - Possible pathology classes:  2
Load model from /home/ec2-user/SageMaker/Mockingjay-Speech-Representation/coronavirus/ckpt/mockingjay_corona_sd20190929_td2020-04-28/tmp.ckpt
Weights of MockingjayModel not initialized from pretrained model: ['encoder.layer.3.attention.self.query.weight', 'encoder.layer.3.attention.self.query.bias', 'encoder.layer.3.attention.self.key.weight', 'encoder.layer.3.attention.self.key.bias', 'encoder.layer.3.attention.self.value.weight', 'encoder.layer.3.attention.self.value.bias', 'encoder.layer.3.attention.output.dense.weight', 'encoder.layer.3.attention.output.dense.bias', 'encoder.layer.3.attention.output.LayerNorm.weight', 'encoder.layer.3.attention.output.LayerNorm.bias', 'encoder.layer.3.intermediate.dense.weight', 'encoder.layer.3.intermediate.dense.bias', 'encoder.layer.3.output.dense.weight', 'encoder.layer.3.output.dense.bias', 'encoder.layer.3.output.LayerNorm.weight', 'encoder.layer.3.output.LayerNorm.bias', 'encoder.layer.4.attention.self.query.weight', 'encoder.layer.4.attention.self.query.bias', 'encoder.layer.4.attention.self.key.weight', 'encoder.layer.4.attention.self.key.bias', 'encoder.layer.4.attention.self.value.weight', 'encoder.layer.4.attention.self.value.bias', 'encoder.layer.4.attention.output.dense.weight', 'encoder.layer.4.attention.output.dense.bias', 'encoder.layer.4.attention.output.LayerNorm.weight', 'encoder.layer.4.attention.output.LayerNorm.bias', 'encoder.layer.4.intermediate.dense.weight', 'encoder.layer.4.intermediate.dense.bias', 'encoder.layer.4.output.dense.weight', 'encoder.layer.4.output.dense.bias', 'encoder.layer.4.output.LayerNorm.weight', 'encoder.layer.4.output.LayerNorm.bias', 'encoder.layer.5.attention.self.query.weight', 'encoder.layer.5.attention.self.query.bias', 'encoder.layer.5.attention.self.key.weight', 'encoder.layer.5.attention.self.key.bias', 'encoder.layer.5.attention.self.value.weight', 'encoder.layer.5.attention.self.value.bias', 'encoder.layer.5.attention.output.dense.weight', 'encoder.layer.5.attention.output.dense.bias', 'encoder.layer.5.attention.output.LayerNorm.weight', 'encoder.layer.5.attention.output.LayerNorm.bias', 'encoder.layer.5.intermediate.dense.weight', 'encoder.layer.5.intermediate.dense.bias', 'encoder.layer.5.output.dense.weight', 'encoder.layer.5.output.dense.bias', 'encoder.layer.5.output.LayerNorm.weight', 'encoder.layer.5.output.LayerNorm.bias', 'encoder.layer.6.attention.self.query.weight', 'encoder.layer.6.attention.self.query.bias', 'encoder.layer.6.attention.self.key.weight', 'encoder.layer.6.attention.self.key.bias', 'encoder.layer.6.attention.self.value.weight', 'encoder.layer.6.attention.self.value.bias', 'encoder.layer.6.attention.output.dense.weight', 'encoder.layer.6.attention.output.dense.bias', 'encoder.layer.6.attention.output.LayerNorm.weight', 'encoder.layer.6.attention.output.LayerNorm.bias', 'encoder.layer.6.intermediate.dense.weight', 'encoder.layer.6.intermediate.dense.bias', 'encoder.layer.6.output.dense.weight', 'encoder.layer.6.output.dense.bias', 'encoder.layer.6.output.LayerNorm.weight', 'encoder.layer.6.output.LayerNorm.bias', 'encoder.layer.7.attention.self.query.weight', 'encoder.layer.7.attention.self.query.bias', 'encoder.layer.7.attention.self.key.weight', 'encoder.layer.7.attention.self.key.bias', 'encoder.layer.7.attention.self.value.weight', 'encoder.layer.7.attention.self.value.bias', 'encoder.layer.7.attention.output.dense.weight', 'encoder.layer.7.attention.output.dense.bias', 'encoder.layer.7.attention.output.LayerNorm.weight', 'encoder.layer.7.attention.output.LayerNorm.bias', 'encoder.layer.7.intermediate.dense.weight', 'encoder.layer.7.intermediate.dense.bias', 'encoder.layer.7.output.dense.weight', 'encoder.layer.7.output.dense.bias', 'encoder.layer.7.output.LayerNorm.weight', 'encoder.layer.7.output.LayerNorm.bias', 'encoder.layer.8.attention.self.query.weight', 'encoder.layer.8.attention.self.query.bias', 'encoder.layer.8.attention.self.key.weight', 'encoder.layer.8.attention.self.key.bias', 'encoder.layer.8.attention.self.value.weight', 'encoder.layer.8.attention.self.value.bias', 'encoder.layer.8.attention.output.dense.weight', 'encoder.layer.8.attention.output.dense.bias', 'encoder.layer.8.attention.output.LayerNorm.weight', 'encoder.layer.8.attention.output.LayerNorm.bias', 'encoder.layer.8.intermediate.dense.weight', 'encoder.layer.8.intermediate.dense.bias', 'encoder.layer.8.output.dense.weight', 'encoder.layer.8.output.dense.bias', 'encoder.layer.8.output.LayerNorm.weight', 'encoder.layer.8.output.LayerNorm.bias', 'encoder.layer.9.attention.self.query.weight', 'encoder.layer.9.attention.self.query.bias', 'encoder.layer.9.attention.self.key.weight', 'encoder.layer.9.attention.self.key.bias', 'encoder.layer.9.attention.self.value.weight', 'encoder.layer.9.attention.self.value.bias', 'encoder.layer.9.attention.output.dense.weight', 'encoder.layer.9.attention.output.dense.bias', 'encoder.layer.9.attention.output.LayerNorm.weight', 'encoder.layer.9.attention.output.LayerNorm.bias', 'encoder.layer.9.intermediate.dense.weight', 'encoder.layer.9.intermediate.dense.bias', 'encoder.layer.9.output.dense.weight', 'encoder.layer.9.output.dense.bias', 'encoder.layer.9.output.LayerNorm.weight', 'encoder.layer.9.output.LayerNorm.bias', 'encoder.layer.10.attention.self.query.weight', 'encoder.layer.10.attention.self.query.bias', 'encoder.layer.10.attention.self.key.weight', 'encoder.layer.10.attention.self.key.bias', 'encoder.layer.10.attention.self.value.weight', 'encoder.layer.10.attention.self.value.bias', 'encoder.layer.10.attention.output.dense.weight', 'encoder.layer.10.attention.output.dense.bias', 'encoder.layer.10.attention.output.LayerNorm.weight', 'encoder.layer.10.attention.output.LayerNorm.bias', 'encoder.layer.10.intermediate.dense.weight', 'encoder.layer.10.intermediate.dense.bias', 'encoder.layer.10.output.dense.weight', 'encoder.layer.10.output.dense.bias', 'encoder.layer.10.output.LayerNorm.weight', 'encoder.layer.10.output.LayerNorm.bias', 'encoder.layer.11.attention.self.query.weight', 'encoder.layer.11.attention.self.query.bias', 'encoder.layer.11.attention.self.key.weight', 'encoder.layer.11.attention.self.key.bias', 'encoder.layer.11.attention.self.value.weight', 'encoder.layer.11.attention.self.value.bias', 'encoder.layer.11.attention.output.dense.weight', 'encoder.layer.11.attention.output.dense.bias', 'encoder.layer.11.attention.output.LayerNorm.weight', 'encoder.layer.11.attention.output.LayerNorm.bias', 'encoder.layer.11.intermediate.dense.weight', 'encoder.layer.11.intermediate.dense.bias', 'encoder.layer.11.output.dense.weight', 'encoder.layer.11.output.dense.bias', 'encoder.layer.11.output.LayerNorm.weight', 'encoder.layer.11.output.LayerNorm.bias']
[Mockingjay] - Pre-trained weights NOT loaded!
[Mockingjay] - Number of parameters: 85425408

Now the weights did not load at all..

Second question:
Why is it not loading the parameters?

I found a workaround for now..
I always load your original checkpoint first and just after the model is ready, I am changing it's parameters to the ones I saved.. It is not giving me any warnings..
It works just fine:

start:
[Dataset] - Computing pathology class...
label HC idx 0
label CV idx 1
[Dataset] - Possible pathology classes:  2
Load original model from /home/ec2-user/SageMaker/Mockingjay-Speech-Representation/result/result_mockingjay/mockingjay-500000.ckpt
[Mockingjay] - Pre-trained weights loaded!
[Mockingjay] - Number of parameters: 21388800

dev 1:
[Dataset] - Computing pathology class...
label HC idx 0
label CV idx 1
[Dataset] - Possible pathology classes:  2
Load original model from /home/ec2-user/SageMaker/Mockingjay-Speech-Representation/result/result_mockingjay/mockingjay-500000.ckpt
[Mockingjay] - Pre-trained weights loaded!
[Mockingjay] - Number of parameters: 21388800
Load finetuned model from /home/ec2-user/SageMaker/Mockingjay-Speech-Representation/coronavirus/ckpt/mockingjay_corona_sd20190929_td2020-04-28/tmp.ckpt

What can cause that behavior?

Thank you,

Pretrain an auto encoder using ALBERT

Can i use the ALBERT model to pretrain an autoencoder?

UnknownBackendTraceback and FileNotFoundError when using Google Colab

I tried to run the model on Google Colab.
After downloading and activating:

!git clone https://github.com/andi611/Mockingjay-Speech-Representation.git

import os
os.chdir("Mockingjay-Speech-Representation")

!pip3 install -r requirements.txt

When I am running this line from the Downstream Task Instructions:

# using spectrogram as baseline
!python3 runner_mockingjay.py --train_phone

I am getting an error:

[TerminalIPythonApp] WARNING | GUI event loop or pylab initialization failed

UnknownBackendTraceback (most recent call last)
/usr/local/lib/python3.6/dist-packages/IPython/core/interactiveshell.py in enable_matplotlib(self, gui)
   2953         # Now we must activate the gui pylab wants to use, and fix %run to take
   2954         # plot updates into account
-> 2955         self.enable_gui(gui)
   2956         self.magics_manager.registry['ExecutionMagics'].default_runner = \
   2957             pt.mpl_runner(self.safe_execfile)

/usr/local/lib/python3.6/dist-packages/IPython/terminal/interactiveshell.py in enable_gui(self, gui)
    512         if gui:
    513             self.active_eventloop, self._inputhook =\
--> 514                 get_inputhook_name_and_func(gui)
    515         else:
    516             self.active_eventloop = self._inputhook = None

/usr/local/lib/python3.6/dist-packages/IPython/terminal/pt_inputhooks/__init__.py in get_inputhook_name_and_func(gui)
     36 
     37     if gui not in backends:
---> 38         raise UnknownBackend(gui)
     39 
     40     if gui in aliases:

UnknownBackend: No event loop integration for 'inline'. Supported event loops are: qt, qt4, qt5, gtk, gtk2, gtk3, tk, wx, pyglet, glut, osx
/usr/local/lib/python3.6/dist-packages/IPython/utils/traitlets.py:5: UserWarning: IPython.utils.traitlets has moved to a top-level traitlets package.
  warn("IPython.utils.traitlets has moved to a top-level traitlets package.")
Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex .
/content/Mockingjay-Speech-Representation/utility/audio.py:19: UserWarning: 
This call to matplotlib.use() has no effect because the backend has already
been chosen; matplotlib.use() must be called *before* pylab, matplotlib.pyplot,
or matplotlib.backends is imported for the first time.

The backend was *originally* set to 'module://ipykernel.pylab.backend_inline' by the following code:
  File "runner_mockingjay.py", line 237, in <module>
    main()
  File "runner_mockingjay.py", line 110, in main
    from downstream.solver import Downstream_Trainer
  File "/content/Mockingjay-Speech-Representation/downstream/solver.py", line 24, in <module>
    from dataloader import get_Dataloader
  File "/content/Mockingjay-Speech-Representation/dataloader.py", line 25, in <module>
    from ipdb import set_trace
  File "/usr/local/lib/python3.6/dist-packages/ipdb/__init__.py", line 7, in <module>
    from ipdb.__main__ import set_trace, post_mortem, pm, run             # noqa
  File "/usr/local/lib/python3.6/dist-packages/ipdb/__main__.py", line 29, in <module>
    ipapp.initialize(['--no-term-title'])
  File "<decorator-gen-120>", line 2, in initialize
  File "/usr/local/lib/python3.6/dist-packages/traitlets/config/application.py", line 87, in catch_config_error
    return method(app, *args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/IPython/terminal/ipapp.py", line 320, in initialize
    self.init_gui_pylab()
  File "/usr/local/lib/python3.6/dist-packages/IPython/core/shellapp.py", line 213, in init_gui_pylab
    r = enable(key)
  File "/usr/local/lib/python3.6/dist-packages/IPython/core/interactiveshell.py", line 2950, in enable_matplotlib
    pt.activate_matplotlib(backend)
  File "/usr/local/lib/python3.6/dist-packages/IPython/core/pylabtools.py", line 309, in activate_matplotlib
    matplotlib.pyplot.switch_backend(backend)
  File "/usr/local/lib/python3.6/dist-packages/matplotlib/pyplot.py", line 231, in switch_backend
    matplotlib.use(newbackend, warn=False, force=True)
  File "/usr/local/lib/python3.6/dist-packages/matplotlib/__init__.py", line 1422, in use
    reload(sys.modules['matplotlib.backends'])
  File "/usr/lib/python3.6/importlib/__init__.py", line 166, in reload
    _bootstrap._exec(spec, module)
  File "/usr/local/lib/python3.6/dist-packages/matplotlib/backends/__init__.py", line 16, in <module>
    line for line in traceback.format_stack()


  matplotlib.use("Agg")
[SOLVER] -  CUDA is available!
[SOLVER] -  Loading source data from ['train-clean-360'] from data/libri_mel160_subword5000
[SOLVER] -  Loading phone data from ['train-clean-360'] from data/libri_phone
Traceback (most recent call last):
  File "runner_mockingjay.py", line 237, in <module>
    main()
  File "runner_mockingjay.py", line 114, in main
    trainer.load_data(split='train', load='phone')
  File "/content/Mockingjay-Speech-Representation/downstream/solver.py", line 97, in load_data
    **self.config['dataloader']))
  File "/content/Mockingjay-Speech-Representation/dataloader.py", line 840, in get_Dataloader
    train_proportion=train_proportion if split == 'train' else 1.0)
  File "/content/Mockingjay-Speech-Representation/dataloader.py", line 248, in __init__
    super(Mel_Phone_Dataset, self).__init__(file_path, sets, bucket_size, max_timestep, max_label_len, drop, load)
  File "/content/Mockingjay-Speech-Representation/dataloader.py", line 54, in __init__
    tables = [pd.read_csv(os.path.join(file_path, s + '.csv')) for s in sets]
  File "/content/Mockingjay-Speech-Representation/dataloader.py", line 54, in <listcomp>
    tables = [pd.read_csv(os.path.join(file_path, s + '.csv')) for s in sets]
  File "/usr/local/lib/python3.6/dist-packages/pandas/io/parsers.py", line 678, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/usr/local/lib/python3.6/dist-packages/pandas/io/parsers.py", line 440, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/usr/local/lib/python3.6/dist-packages/pandas/io/parsers.py", line 787, in __init__
    self._make_engine(self.engine)
  File "/usr/local/lib/python3.6/dist-packages/pandas/io/parsers.py", line 1014, in _make_engine
    self._engine = CParserWrapper(self.f, **self.options)
  File "/usr/local/lib/python3.6/dist-packages/pandas/io/parsers.py", line 1708, in __init__
    self._reader = parsers.TextReader(src, **kwds)
  File "pandas/_libs/parsers.pyx", line 384, in pandas._libs.parsers.TextReader.__cinit__
  File "pandas/_libs/parsers.pyx", line 695, in pandas._libs.parsers.TextReader._setup_parser_source
FileNotFoundError: File b'data/libri_mel160_subword5000/train-clean-360.csv' does not exist

If you suspect this is an IPython bug, please report it at:
    https://github.com/ipython/ipython/issues
or send an email to the mailing list at [email protected]

You can print a more detailed traceback right now with "%tb", or use "%debug"
to interactively debug it.

Extra-detailed tracebacks for bug-reporting purposes can be enabled via:
    %config Application.verbose_crash=True

There are 2 problems here.

The "UnknownBackendTraceback" at the top,
The "FileNotFoundError" at the bottom.

For both problems:

Is there a way to make it work with Google colab?
How can I load the data required for the training?

Thank you,

Does the simple network structure design of downstream task affect the results ?

Thanks for you job. I want to know does the simple network structure of downstream task affect the result? Because if you use a fine tuned network of pretrained network , it jusk like you have a more complex network than you directly feed the acoustic future input the downstream network which only have two layers. Do you have any experiment that training the network which combined ALBERT network and downstream network , and feed it with acoustic fetures from strat. Thanks.

Question about 'mask_label' in transformer.mam.process_train_MAM_data

Hi,

Thank you so much for providing us such well-organized Pretraining framework.

Recently, I got a question about the variable 'mask_label' defined in your transformer.mam.process_train_MAM_data.

mask_label = torch.zeros_like(spec_stacked, dtype=torch.uint8) if mask_proportion != 0 and mask_frequency != 0 else torch.ones_like(spec_stacked, dtype=torch.uint8)

From the code, I understand that you are targeting at the masked frames and calculating the L1 loss for those frames only. However, I am thinking maybe it is better to modify the line to

mask_label = torch.zeros_like(spec_stacked, dtype=torch.uint8) if mask_proportion != 0 or mask_frequency != 0 else torch.ones_like(spec_stacked, dtype=torch.uint8)

, since sometimes we only want to perform either time mask or frequency mask and we hope the 'mask_label' won't always be a ones matrix in such cases.

deprecated code reference

https://github.com/andi611/Self-Supervised-Speech-Pretraining-and-Representation-Learning/blob/61e9edff0e0b7f5266f038bb096265f2bc392ffa/run_downstream.py#L23

But dual transformer doesn't exist since commit b7723c.

This lead to import and other errors

Results of ASR with PyTorch Kaldi don't perform well

Thanks a lot for the great work!

I am trying to finetue the mockingjay pretrained model (fmllrBase960-F-N-K-libri)for ASR with pytorch-kaldi. I use librispeech(train-clean-100) for downstream training, and test by test-clean. The config file is "libri_transformer_liGRU_fmllr_ft.cfg". The WER of test-clean is 8.76, which seems not better than the orignal model in pytorch-kaldi. The result in pytorch-kaldi tutorial is 8.6.

Is it normal? Could you give some results for ASR with pytorch-kaldi by finetue the mockingjay? I am confused with my experiment results. Thanks you!

error training asr with pretrained aalbert

when running pytorch-kaldi for asr training using pretrained aalbert model, something went wrong like this:

$ python run_exp.py cfg/AISHELL/aishell_transformer_fmllr_ft.cfg 
- Reading config file......OK!
- Chunk creation......OK!

------------------------------ Epoch 00 / 23 ------------------------------
Training train chunk = 1 / 50
Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex .
[Transformer] - Pre-trained weights loaded!
[Transformer] - Number of parameters: 7182336
Traceback (most recent call last):
  File "run_exp.py", line 281, in <module>
    next_config_file,
  File "/home/bin.yang02/program/pytorch-kaldi/core.py", line 628, in run_nn
    forward_outs,
  File "/home/bin.yang02/program/pytorch-kaldi/utils.py", line 2334, in forward_model
    outs_dict[inp2] = outs_dict[inp2].view(max_len * batch_size, -1)
RuntimeError: shape '[416, -1]' is invalid for input of size 307200

I trained the aalbert model with the cfg file like this:

$ cat aalbert_aishell_fbank3L.yaml
transformer:
  input_dim: 40                                         # `int`, 39 for mfcc, 40 for fmllr, 80 for fbank, 160 for mel
  downsample_rate: 3                                    # stacked consecutive features vectors to reduce the length of input sequences by this factor.
  hidden_size: 768                                      # Size of the encoder layers and the pooler layer.
  num_hidden_layers: 3                                  # Number of hidden layers in the Transformer encoder.
  num_attention_heads: 12                               # Number of attention heads for each attention layer in the Transformer encoder.
  intermediate_size: 3072                               # The size of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder.
  hidden_act: "gelu"                                    # The non-linear activation function (function or string) in the encoder and pooler. If string, "gelu", "relu" and "swish" are supported.
  hidden_dropout_prob: 0.1                              # The dropout probabilitiy for all fully connected layers in the embeddings, encoder, and pooler.
  attention_probs_dropout_prob: 0.1                     # The dropout ratio for the attention probabilities.
  initializer_range: 0.02                               # The sttdev of the truncated_normal_initializer for initializing all weight matrices.
  layer_norm_eps: "1e-12"                               # The epsilon used by LayerNorm.
  mask_proportion: 0.15                                 # mask this percentage of all spectrogram frames in each sequence at random during MAM training                        
  mask_consecutive_min: 7                               # mask this amount of consecutive frames
  mask_consecutive_max: 7                               # mask this amount of consecutive frames
  mask_allow_overlap: True                              # allow overlap masking
  mask_bucket_ratio: 1.2
  mask_bucket_ratio: 1.2                                # only used when overlap is not allowed. sample a mask from each bucket in size of [sampled mask_consecutive * mask_bucket_ratio]
  mask_frequency: 16                                    # mask maximum this amount of frequency bands, set to 0 for no frequency mask
  noise_proportion: 0.15                                # for this percentage of the time, Gaussian noise will be applied on all frames during MAM training, set to 0 for no noise
  prune_headids: None                                   # Usage: 0,1,2,12-15 will prune headids [0,1,2,12,13,14]. headids = layerid * head_num + headid_in_layer
  share_layer: True                                     # Share layer weights
  max_input_length: 0                                   # maximum input length (0 for no restriction)

optimizer: 
  learning_rate: "2e-4"                                 # Learning rate for opt. "4e-4" for 'data/libri_mel160_subword5000', "2e-4" for 'data/libri_fmllr_cmvn'
  loss_scale: 0                                         # Loss scale to improve fp16 numeric stability. Only used when apex is set to True. 0: dynamic loss scaling. positive power of 2: static loss scaling.
  warmup_proportion: 0.07                               # Proportion of training to perform linear rate warmup.
  gradient_accumulation_steps: 3                        # Number of updates steps to accumulate before performing a backward/update pass
  gradient_clipping: 3.0                                # Maximum gradient norm

dataloader:
  n_jobs: 12                                            # Subprocess used for torch Dataloader
  batch_size: 12                                        # training batch size, 12 for pre-train, 6 for cpc exp
  dev_batch_size: 12                                    # used for dev/test splits
  max_timestep: 1500                                    # Max length for audio feature (0 for no restriction), 1500 for pre-train, 3000 for downstream tasks
  
  # LIBRISEECH SETTINGS
  data_path: 'data/aishell_fmllr_cmvn'                    # Source data path, 'data/libri_fmllr_cmvn', or 'data/libri_mfcc_cmvn', or 'data/libri_mel160_subword5000' for different preprocessing features
  target_path: ''                                       # Target data path, not used when `duo_deature:False`. For reconstruction to a different feature type, for example set dataset to 'libri_linear1025_subword5000'.
  phone_path: 'data/cpc_phone'                          # phone boundary label data path for the phone classification task. set to 'data/libri_phone' or 'data/cpc_phone'
  train_set: ['train']                        # ['train-clean-100', 'train-clean-360', 'train-other-500'] for pre-training. ['train-clean-360'] or ['train-clean-100'] for libri phone exp or cpc phone exp, respectively.
  dev_set: ['dev']                                #
  test_set: ['test']                              #
  train_proportion: 1.0                                 # Currently only effect the `phone classification task`, use this percent of `train_set` for downstream task training to demonstrate mockingjay generality

runner:
  # Training options
  apex: False                                           # Use APEX (see https://github.com/NVIDIA/apex for more details)
  total_steps: 200000                                   # total steps for training, a step is a batch of update
  log_step: 2500                                        # log training status every this amount of training steps
  save_step: 10000                                      # save model every this amount of training steps
  duo_feature: False                                    # Use different input / output features during training
  max_keep: 2                                           # maximum number of model ckpt to keep during training

and the asr train cfg file:

$ cat aishell_transformer_fmllr_ft.cfg
[cfg_proto]
cfg_proto=proto/global.proto
cfg_proto_chunk=proto/global_chunk.proto

[exp]
cmd=
run_nn_script=run_nn
out_folder=exp/aishell_transformer_fmllr
seed=1234
use_cuda=True
multi_gpu=False
save_gpumem=False
N_epochs_tr=24

[dataset1]
data_name=train
fea:fea_name=fmllr
    fea_lst=/home/bin.yang02/program/kaldi-scripts/aishell/s5/fmllr/train/feats.scp
    fea_opts=apply-cmvn --utt2spk=ark:/home/bin.yang02/program/kaldi-scripts/aishell/s5/fmllr/train/utt2spk  ark:/home/bin.yang02/program/kaldi-scripts/aishell/s5/fmllr/train/data/cmvn_speaker.ark ark:- ark:- | add-deltas --delta-order=0 ark:- ark:- |
    cw_left=0
    cw_right=0

    
lab:lab_name=lab_cd
    lab_folder=/home/bin.yang02/program/kaldi-scripts/aishell/s5/exp/tri5a_ali_train/
    lab_opts=ali-to-pdf 
    lab_count_file=auto
    lab_data_folder=/home/bin.yang02/program/kaldi-scripts/aishell/s5/fmllr/train/
    lab_graph=/home/bin.yang02/program/kaldi-scripts/aishell/s5/exp/tri5a/graph/

N_chunks=32
        
[dataset2]
data_name=dev
fea:fea_name=fmllr
    fea_lst=/home/bin.yang02/program/kaldi-scripts/aishell/s5/fmllr/dev/feats.scp
    fea_opts=apply-cmvn --utt2spk=ark:/home/bin.yang02/program/kaldi-scripts/aishell/s5/fmllr/dev/utt2spk  ark:/home/bin.yang02/program/kaldi-scripts/aishell/s5/fmllr/dev/data/cmvn_speaker.ark ark:- ark:- | add-deltas --delta-order=0 ark:- ark:- |
    cw_left=0
    cw_right=0


lab:lab_name=lab_cd
    lab_folder=/home/bin.yang02/program/kaldi-scripts/aishell/s5/exp/tri5a_ali_dev/
    lab_opts=ali-to-pdf 
    lab_count_file=auto
    lab_data_folder=/home/bin.yang02/program/kaldi-scripts/aishell/s5/fmllr/dev/
    lab_graph=/home/bin.yang02/program/kaldi-scripts/aishell/s5/exp/tri5a/graph/


N_chunks=32

[dataset3]
data_name=test
fea:fea_name=fmllr
    fea_lst=/home/bin.yang02/program/kaldi-scripts/aishell/s5/fmllr/test/feats.scp
    fea_opts=apply-cmvn --utt2spk=ark:/home/bin.yang02/program/kaldi-scripts/aishell/s5/fmllr/test/utt2spk  ark:/home/bin.yang02/program/kaldi-scripts/aishell/s5/fmllr/test/data/cmvn_speaker.ark ark:- ark:- | add-deltas --delta-order=0 ark:- ark:- |
    cw_left=0
    cw_right=0


lab:lab_name=lab_cd
    lab_folder=/home/bin.yang02/program/kaldi-scripts/aishell/s5/exp/tri5a_ali_test/
    lab_opts=ali-to-pdf 
    lab_count_file=auto
    lab_data_folder=/home/bin.yang02/program/kaldi-scripts/aishell/s5/fmllr/test/
    lab_graph=/home/bin.yang02/program/kaldi-scripts/aishell/s5/exp/tri5a/graph/


N_chunks=16

        
[data_use]
train_with=train
valid_with=dev
forward_with=test


[batches]
batch_size_train=16
max_seq_length_train=1000
increase_seq_length_train=True
start_seq_len_train=100
multply_factor_seq_len_train=2
batch_size_valid=8
max_seq_length_valid=1000


[architecture1]
arch_name=TRANSFORMER_AM
arch_proto=proto/Transformer.proto
arch_library=nn_transformer
arch_class=TRANSFORMER
arch_pretrain_file=none
arch_freeze=False
arch_seq_model=True

# Transformer.proto settings
ckpt_file=/home/bin.yang02/software/Self-Supervised-Speech-Pretraining-and-Representation-Learning/result/result_transformer/aalbert_fbank3L/states-200000.ckpt
load_pretrain=True
no_grad=False
dropout=default
spec_aug=True
spec_aug_prev=True
weighted_sum=False
select_layer=-1

# Optimizer Settings
arch_lr = 0.0002
arch_halving_factor = 0.5
arch_improvement_threshold = 0.001
arch_opt = rmsprop
opt_momentum = 0.0
opt_alpha = 0.95
opt_eps = 1e-8
opt_centered = False
opt_weight_decay = 0.0


[architecture2]
arch_name=MLP_layers
arch_proto=proto/MLP.proto
arch_library=neural_networks
arch_class=MLP
arch_pretrain_file=none
arch_freeze=False
arch_seq_model=False
dnn_lay=N_out_lab_cd
dnn_drop=0.0
dnn_use_laynorm_inp=False
dnn_use_batchnorm_inp=False
dnn_use_batchnorm=False
dnn_use_laynorm=False
dnn_act=softmax

arch_lr = 0.0002
arch_halving_factor=0.5
arch_improvement_threshold=0.001
arch_opt=rmsprop
opt_momentum=0.0
opt_alpha=0.95
opt_eps=1e-8
opt_centered=False
opt_weight_decay=0.0


[model]
model_proto=proto/model.proto
model:out_dnn1=compute(TRANSFORMER_AM,fmllr)
      out_dnn2=compute(MLP_layers,out_dnn1)
      loss_final=cost_nll(out_dnn2,lab_cd)
      err_final=cost_err(out_dnn2,lab_cd)


[forward]
forward_out=out_dnn2
normalize_posteriors=True
normalize_with_counts_from=lab_cd
save_out_file=False
require_decoding=True


[decoding]
decoding_script_folder=kaldi_decoding_scripts/
decoding_script=decode_dnn.sh
decoding_proto=proto/decoding.proto
min_active=200
max_active=7000
max_mem=50000000
beam=20.0
latbeam=12.0
acwt=0.10
max_arcs=-1
skip_scoring=false
scoring_script=/home/bin.yang02/software/kaldi-master/egs/librispeech/s5/local/score.sh
scoring_opts="--min-lmwt 4 --max-lmwt 23"
norm_vars=False

but when I only use simlpe mlp nn to train with the same feat/lab as input, the run_exp. py ran normally.

Is there any configs missed？

Is there any instructions how to pre-train the model?

Hi, I would like to know is there any guide for how to preprocessing or where to start.

I can see how to preprocess the downstream tasks.

Thanks!

AttributeError: 'Namespace' object has no attribute 'multi_gpu'

When using Google Colab to run the lines written in example_extract.py:
First, we clone the git:

!git clone https://github.com/andi611/Mockingjay-Speech-Representation.git

Then, we change working directory:

import os
os.chdir('drive/My Drive/MockingjayTest/Mockingjay-Speech-Representation')

Then, we run the requirements file:

!pip3 install -r requirements.txt

In the end we run the lines written in example_extract.py:

import torch
from runner_mockingjay import get_mockingjay_model

example_path = './result/mockingjay-500000.ckpt'
mockingjay = get_mockingjay_model(from_path=example_path)

# A batch of spectrograms: (batch_size, seq_len, hidden_size)
spec = torch.zeros(3, 800, 160)

# reps.shape: (batch_size, num_hiddem_layers, seq_len, hidden_size)
reps = mockingjay.forward(spec=spec, all_layers=True, tile=True)

# reps.shape: (batch_size, num_hiddem_layers, seq_len // downsample_rate, hidden_size)
reps = mockingjay.forward(spec=spec, all_layers=True, tile=False)

# reps.shape: (batch_size, seq_len, hidden_size)
reps = mockingjay.forward(spec=spec, all_layers=False, tile=True)

# reps.shape: (batch_size, seq_len // downsample_rate, hidden_size)
reps = mockingjay.forward(spec=spec, all_layers=False, tile=False)

(I uploaded the "mockingjay-500000.ckpt" file to the result directory in my drive)

The Error is:

[SOLVER] -  CUDA is available!
[SOLVER] -  Initializing Mockingjay model.
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-5-c9f0d939d27d> in <module>()
      3 
      4 example_path = './result/mockingjay-500000.ckpt'
----> 5 mockingjay = get_mockingjay_model(from_path=example_path)
      6 
      7 # A batch of spectrograms: (batch_size, seq_len, hidden_size)

1 frames
/content/drive/My Drive/MockingjayTest/Mockingjay-Speech-Representation/runner_mockingjay.py in get_mockingjay_model(from_path, display_settings)
    231     from mockingjay.solver import Tester
    232     mockingjay = Tester(config, paras)
--> 233     mockingjay.set_model(inference=True, with_head=False, from_path=from_path)
    234     return mockingjay
    235 

/content/drive/My Drive/MockingjayTest/Mockingjay-Speech-Representation/mockingjay/solver.py in set_model(self, inference, with_head, from_path, output_attention)
    107             self.mockingjay = MockingjayModel(self.model_config, self.input_dim, self.output_attention).to(self.device)
    108             print(self.paras)
--> 109             if self.paras.multi_gpu:
    110                 self.mockingjay = torch.nn.DataParallel(self.mockingjay)
    111                 self.verbose('Multi-GPU training Enabled: ' + str(torch.cuda.device_count()))

AttributeError: 'Namespace' object has no attribute 'multi_gpu'

When I tried to print the paras dictionary to see if there is an attribute named multi_gpu, I found:

Namespace(apc_path='./result/result_apc/apc_libri_sd1337_standard/apc-500000.ckpt', ckpdir='result/result_mockingjay/', ckpt='mockingjay_libri_sd1337_LinearLarge/mockingjay-500000.ckpt', config='config/mockingjay_libri_MelBase.yaml', cpu=False, dckpt='baseline_sentiment_libri_sd1337/baseline_sentiment-500000.ckpt', gpu=True, load=False, logdir='log/log_mockingjay/', name=None, no_msg=False, plot=False, run_apc=False, run_mockingjay=False, seed=1337, test_phone=False, test_sentiment=False, test_speaker=False, train=True, train_phone=False, train_sentiment=False, train_speaker=False, verbose=True, with_head=False)

There is no attribute named multi_gpu.
Is there a way to address this problem?

Does clm work in this code?

Hi, I have a question about CLM approaches.

It seems like a code below requires text_only options, but data_loader has no text_only options.

self.train_set = get_Dataloader('text',text_only=True,**data_config)
https://github.com/andi611/Mockingjay-Speech-Representation/blob/05f0315125f4201d9028f807468b4915453a6f74/asr/clm.py#L39

I am just wondering this logic is implemented or not in the current stages.