This repo contains code and materials for our NeurIPS 2023 submission. DiffS4L requires Fairseq and ProDiff, and optionally WaveNet for the baseline.
wget https://github.com/facebookresearch/fairseq/archive/refs/tags/v0.12.2.zip
unzip v0.12.2.zip
cd fairseq-0.12.2
pip install --editable ./
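A quick sanity check that the editable install is importable (the version string should match the v0.12.2 tag):
python -c "import fairseq; print(fairseq.__version__)"   # expect 0.12.2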
git clone https://github.com/Rongjiehuang/ProDiff.git
Follow the instructions in the SCTK repo to install SCTK.
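For reference, the usual SCTK build sequence from its README; the sclite binary used in the evaluation step below ends up in SCTK/bin:
git clone https://github.com/usnistgov/SCTK.git
cd SCTK
make config && make all && make check && make install && make doc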
Download the LibriSpeech data from OpenSLR.
wget https://us.openslr.org/resources/12/train-clean-100.tar.gz
wget https://us.openslr.org/resources/12/dev-clean.tar.gz
wget https://us.openslr.org/resources/12/test-clean.tar.gz
tar -xf train-clean-100.tar.gz
tar -xf dev-clean.tar.gz
tar -xf test-clean.tar.gz
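After extraction, all three splits should sit under a single LibriSpeech/ root. A quick per-split utterance count (the standard LibriSpeech counts are 28539 / 2703 / 2620):
for split in train-clean-100 dev-clean test-clean; do
  echo "${split}: $(find LibriSpeech/${split} -name '*.flac' | wc -l) files"
done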
Create manifest files train.tsv, dev.tsv and test.tsv for the train-clean-100, dev-clean and test-clean splits, respectively, using the wav2vec_manifest.py script provided by Fairseq. The script always names its output train.tsv, so the dev and test manifests are renamed after generation:
mkdir -p manifest/LibriSpeech100
python examples/wav2vec/wav2vec_manifest.py LibriSpeech/dev-clean --dest manifest/LibriSpeech100 --ext flac --valid-percent 0
mv manifest/LibriSpeech100/train.tsv manifest/LibriSpeech100/dev.tsv
python examples/wav2vec/wav2vec_manifest.py LibriSpeech/test-clean --dest manifest/LibriSpeech100 --ext flac --valid-percent 0
mv manifest/LibriSpeech100/train.tsv manifest/LibriSpeech100/test.tsv
python examples/wav2vec/wav2vec_manifest.py LibriSpeech/train-clean-100 --dest manifest/LibriSpeech100 --ext flac --valid-percent 0
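Each manifest starts with the root directory on its first line, followed by one tab-separated "relative path, number of samples" entry per audio file. For example (sample counts illustrative):
head -3 manifest/LibriSpeech100/train.tsv
# /path/to/LibriSpeech/train-clean-100
# 103/1240/103-1240-0000.flac   225360
# 103/1240/103-1240-0001.flac   255120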
Pretrain a wav2vec 2.0 model using LibriSpeech100.
cd fairseq-0.12.2/examples/wav2vec
seed=2023
WORK_DIR=$(pwd)
DATASET=LibriSpeech100
MANIFEST_DIR=$(realpath manifest/${DATASET})  # hydra changes the run directory, so task.data needs an absolute path
save_dir=outputs/wav2vec-en-${DATASET}-base-s${seed}
fairseq-hydra-train \
distributed_training.distributed_port=12345 \
distributed_training.distributed_world_size=32 \
task.data=${MANIFEST_DIR} \
checkpoint.save_dir=${save_dir} hydra.run.dir=${save_dir} \
common.fp16=True common.seed=${seed} \
+optimization.update_freq='[2]' \
--config-dir ${WORK_DIR}/config/pretraining \
--config-name wav2vec2_base_librispeech
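The recipe above assumes 32 GPUs with gradient accumulation 2, i.e. an effective 32 x 2 = 64 GPU-batches per update. On fewer GPUs, keep the product world_size x update_freq constant by adjusting the corresponding overrides, e.g. on 8 GPUs:
distributed_training.distributed_world_size=8 +optimization.update_freq='[8]'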
Follow the instructions in the HuBERT repo to obtain discrete speech representations for train-clean-100, dev-clean and test-clean using the wav2vec-en-LibriSpeech100-base-s2023 model, and name the label files train.km, dev.km and test.km. We store the three km files in manifest/LibriSpeech100/wav2vec-ls100.
mkdir -p manifest/LibriSpeech100/wav2vec-ls100
mv train.km dev.km test.km manifest/LibriSpeech100/wav2vec-ls100
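For reference, a minimal sketch of that labeling step following fairseq's examples/hubert/simple_kmeans README (script names and argument order are per that README; the layer index, single shard, cluster count and scratch paths here are illustrative assumptions, and loading a wav2vec 2.0 checkpoint through dump_hubert_feature.py should be verified against your fairseq version):
cd fairseq-0.12.2/examples/hubert/simple_kmeans
tsv_dir=manifest/LibriSpeech100                 # where train/dev/test.tsv live
ckpt_path=outputs/wav2vec-en-LibriSpeech100-base-s2023/checkpoint_best.pt
feat_dir=feats/w2v_for_km                       # scratch space for dumped features
km_path=wav2vec_ls100_km100.bin                 # fitted k-means model
lab_dir=manifest/LibriSpeech100/wav2vec-ls100
for split in train dev test; do
  python dump_hubert_feature.py ${tsv_dir} ${split} ${ckpt_path} 6 1 0 ${feat_dir}
done
python learn_kmeans.py ${feat_dir} train 1 ${km_path} 100 --percent 0.1
for split in train dev test; do
  python dump_km_label.py ${feat_dir} ${split} ${km_path} 1 0 ${lab_dir}
  mv ${lab_dir}/${split}_0_1.km ${lab_dir}/${split}.km   # nshard=1, so just rename
done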
Extract mel-spectrograms of LibriSpeech100. The ckpt_path here is not used; it is only a placeholder.
ckpt_path="wav2vec-en-LibriSpeech100-base-s2023/checkpoint_best.pt"
model_name="mel"
layer=0
sample_rate=16000
FEATURE_SCRIPT="dump_mel_feature.py"
km_dataset="LibriSpeech100"
tsv_dir="$(pwd)/manifest/${km_dataset}"
feat_dir="$(pwd)/feats/${model_name}_$(basename ${km_dataset})_l${layer}"
mkdir -p ${feat_dir}
splits=(train dev test)
nshard=8
for split in ${splits[@]}; do
  pids=()
  for rank in $(seq 0 $((nshard - 1))); do
    (
      echo "Parallel ${rank}"
      python -W ignore ${FEATURE_SCRIPT} \
        "${tsv_dir}" "${split}" \
        "${ckpt_path}" "${layer}" "${nshard}" "${rank}" "${feat_dir}" --sample_rate $sample_rate
    ) &
    pids+=($!)
  done
  i=0; for pid in "${pids[@]}"; do wait ${pid} || ((++i)); done
  [ ${i} -gt 0 ] && echo "$0: ${i} background jobs failed." && false
done
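Assuming dump_mel_feature.py follows the same output convention as the HuBERT dump scripts it mirrors, each (split, rank) pair should leave a ${split}_${rank}_${nshard}.npy / .len pair in ${feat_dir}:
ls ${feat_dir} | wc -l   # expect 3 splits x 8 shards x 2 files = 48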
Move ProDiff/egs into the ProDiff repository cloned from GitHub in Sec 1.2. librivox_en_mask0.yaml is the config for the SS/DS-speech DDPM and librivox_en_mask0.8.yaml is the config for the NC-speech DDPM. Change label_dir to the folder storing the km labels, i.e., label_dir=manifest/LibriSpeech100/wav2vec-ls100. spkemb_path, spk_map and phone_list_file can be found in manifest/LibriSpeech100.
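The label_dir edit can also be scripted; a sketch using sed, assuming label_dir appears as a top-level key in both yaml files (adjust the paths to your checkout):
cd ProDiff
sed -i "s|^label_dir:.*|label_dir: manifest/LibriSpeech100/wav2vec-ls100|" \
  egs/datasets/audio/librivox/prodiff/librivox_en_mask0.yaml \
  egs/datasets/audio/librivox/prodiff/librivox_en_mask0.8.yaml
Then train the two DDPMs: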
CUDA_VISIBLE_DEVICES=0 PYTHONPATH=. python tasks/run.py --config egs/datasets/audio/librivox/prodiff/librivox_en_mask0.8.yaml --exp_name librivox_en_mask0.8 --reset
CUDA_VISIBLE_DEVICES=0 PYTHONPATH=. python tasks/run.py --config egs/datasets/audio/librivox/prodiff/librivox_en_mask0.yaml --exp_name librivox_en_mask0 --reset
Generate SS/DS and NC synthetic audio datasets, each 10 times the size of the original LibriSpeech100 audio, using different seeds. librivox_en_mask0.yaml is the DDPM for SS/DS speech and librivox_en_mask0.8.yaml is the DDPM for NC speech. Our HiFiGAN trained on LibriSpeech100 can be downloaded [here].
pids=()
seeds=(1 2 3 4 5 6 7 8 9 10)
for seed in "${seeds[@]}"; do
  rand_speaker=True
  use_speaker=seen
  (
    vocoder='hifigan'
    vocoder_ckpt='mls_librispeech100'
    PYTHONPATH=. CUDA_VISIBLE_DEVICES=0 python tasks/run.py \
      --config egs/datasets/audio/librivox/prodiff/librivox_en_mask0.yaml \
      --exp_name librivox_en_mask0 --infer \
      --hparams="save_gt=False,test_set_name=train,num_test_samples=0,random_spk=True,seed=${seed},gen_dir_name='generate/${use_speaker}${seed}',mask_partial_probs=[1 0 0],vocoder_ckpt=${vocoder_ckpt},vocoder=${vocoder}"
    PYTHONPATH=. CUDA_VISIBLE_DEVICES=0 python tasks/run.py \
      --config egs/datasets/audio/librivox/prodiff/librivox_en_mask0.8.yaml \
      --exp_name librivox_en_mask0.8 --infer \
      --hparams="save_gt=False,test_set_name=train,num_test_samples=0,random_spk=${rand_speaker},seed=${seed},gen_dir_name='generate/${use_speaker}${seed}',mask_partial_probs=[0 0 1],vocoder_ckpt=${vocoder_ckpt},vocoder=${vocoder}"
  ) &
  pids+=($!)
done
i=0; for pid in "${pids[@]}"; do wait ${pid} || ((++i)); done
[ ${i} -gt 0 ] && echo "$0: ${i} background jobs failed." && false
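A rough sanity check on the amount of generated audio; the checkpoints/${exp_name}/generate layout is an assumption based on gen_dir_name above, so adjust to where your ProDiff run actually writes:
for exp in librivox_en_mask0 librivox_en_mask0.8; do
  echo "${exp}: $(find checkpoints/${exp}/generate -name '*.wav' | wc -l) generated wavs"
done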
Sample from the audio generated by librivox_en_mask0 for SS/DS speech and from the audio generated by librivox_en_mask0.8 for NC speech. The sampled audio, combined with the original train-clean-100 LibriSpeech data, makes up the different data compositions, e.g., 100+860+0 and 100+430+430. The dev.tsv and test.tsv are the same as the ones in LibriSpeech100.
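A minimal sketch of composing the 100+430+430 manifest, assuming the sampled synthetic audio already has manifests of its own (manifest/synthetic_ssds/train.tsv and manifest/synthetic_nc/train.tsv are hypothetical names) and that all manifests share the same root line:
mkdir -p manifest/100+430+430
cp manifest/LibriSpeech100/dev.tsv manifest/LibriSpeech100/test.tsv manifest/100+430+430/
head -1 manifest/LibriSpeech100/train.tsv > manifest/100+430+430/train.tsv
for src in manifest/LibriSpeech100/train.tsv manifest/synthetic_ssds/train.tsv manifest/synthetic_nc/train.tsv; do
  tail -n +2 ${src} >> manifest/100+430+430/train.tsv
done
Pretraining on the composed dataset then follows the same recipe as before: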
cd fairseq-0.12.2/examples/wav2vec
seed=2023
WORK_DIR=$(pwd)
DATASET="100+430+430"
MANIFEST_DIR=$(realpath manifest/${DATASET})  # hydra changes the run directory, so task.data needs an absolute path
save_dir=outputs/wav2vec-en-${DATASET}-base-s${seed}
fairseq-hydra-train \
distributed_training.distributed_port=12345 \
distributed_training.distributed_world_size=32 \
task.data=${MANIFEST_DIR} \
checkpoint.save_dir=${save_dir} hydra.run.dir=${save_dir} \
common.fp16=True common.seed=${seed} \
+optimization.update_freq='[2]' \
--config-dir ${WORK_DIR}/config/pretraining \
--config-name wav2vec2_base_librispeech
Use libri_labels.py to obtain transcripts for the dev-clean and test-clean splits. The label dictionary can be found [here].
python examples/wav2vec/libri_labels.py manifest/LibriSpeech100/dev.tsv --output-dir manifest/LibriSpeech100 --output-name dev
python examples/wav2vec/libri_labels.py manifest/LibriSpeech100/test.tsv --output-dir manifest/LibriSpeech100 --output-name test
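libri_labels.py emits letter-level transcripts (alongside word-level .wrd files); each .ltr line is space-separated characters with | marking word boundaries, e.g. (content illustrative):
head -1 manifest/LibriSpeech100/dev.ltr
# M I S T E R | Q U I L T E R | I S | T H E | A P O S T L E | ...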
Evaluate on the dev-clean and test-clean splits using CER/WER. The lexicon_ltr.lst can be downloaded [here]. The 4-gram.bin can be downloaded [here].
ckpt=/path/to/wav2vec/checkpoint
manifest=manifest/LibriSpeech100
KENLM=4-gram.bin
LEXICON=lexicon_ltr.lst
SCTK=SCTK/bin
splits=(dev test)
FAIRSEQ=fairseq-0.12.2
for split in ${splits[@]}; do
  subset=$split
  result_dir=$(dirname $ckpt)/results_raw
  mkdir -p $result_dir
  echo "--------------------- Save $subset raw results in $result_dir -----------------------"
  python $FAIRSEQ/examples/speech_recognition/infer.py \
    $manifest --task audio_finetuning \
    --nbest 1 --path $ckpt --gen-subset $subset --results-path ${result_dir} --w2l-decoder viterbi \
    --lm-weight 2 --word-score -1 --sil-weight 0 --criterion ctc --labels ltr --max-tokens 4000000 \
    --post-process letter --beam 1 \
    2> $result_dir/log-${subset}.txt
  $SCTK/sclite -r ${result_dir}/ref.units-$(basename $ckpt)-${subset}.txt trn -h ${result_dir}/hypo.units-$(basename $ckpt)-${subset}.txt trn -i wsj | tee ${result_dir}/uer-${subset}.txt
  $SCTK/sclite -r ${result_dir}/ref.word-$(basename $ckpt)-${subset}.txt trn -h ${result_dir}/hypo.word-$(basename $ckpt)-${subset}.txt trn -i wsj | tee ${result_dir}/wer-${subset}.txt
  echo "--------------------------------------------------------------------------"
  result_dir=$(dirname $ckpt)/results_kenlm
  mkdir -p $result_dir
  echo "--------------------- Save $subset kenlm results in $result_dir -----------------------"
  python $FAIRSEQ/examples/speech_recognition/infer.py \
    $manifest --task audio_finetuning \
    --nbest 1 --path $ckpt --gen-subset $subset --results-path ${result_dir} --w2l-decoder kenlm --lexicon ${LEXICON} \
    --lm-model $KENLM --lm-weight 1 --word-score -1 --sil-weight 0 --criterion ctc --labels ltr --max-tokens 4000000 \
    --post-process letter 2> $result_dir/log-${subset}.txt
  $SCTK/sclite -r ${result_dir}/ref.units-$(basename $ckpt)-${subset}.txt trn -h ${result_dir}/hypo.units-$(basename $ckpt)-${subset}.txt trn -i wsj | tee ${result_dir}/uer-${subset}.txt
  $SCTK/sclite -r ${result_dir}/ref.word-$(basename $ckpt)-${subset}.txt trn -h ${result_dir}/hypo.word-$(basename $ckpt)-${subset}.txt trn -i wsj | tee ${result_dir}/wer-${subset}.txt
  echo "--------------------------------------------------------------------------"
done
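The aggregate error rates can then be pulled from the sclite reports (the Sum/Avg row is sclite's own summary format):
grep -h "Sum/Avg" $(dirname $ckpt)/results_raw/wer-*.txt $(dirname $ckpt)/results_kenlm/wer-*.txt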
Follow the instructions in SUPERB to evaluate the pretrained SSL models.
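A hedged sketch using s3prl, which hosts the SUPERB reference implementation (the upstream/downstream names and flags follow s3prl's documentation; the experiment name is arbitrary, and the flags should be verified against your s3prl version):
git clone https://github.com/s3prl/s3prl.git
cd s3prl && pip install -e .
python3 run_downstream.py -m train -n diffs4l_ks \
  -u wav2vec2_local -k /path/to/wav2vec/checkpoint_best.pt -d speech_commands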