yuangongnd / gopt

Code for the ICASSP 2022 paper "Transformer-Based Multi-Aspect Multi-Granularity Non-native English Speaker Pronunciation Assessment".

License: BSD 3-Clause "New" or "Revised" License

Python 49.96% Shell 0.83% Jupyter Notebook 49.21%

gopt's People

Contributors

lywangpx, yuangongnd
gopt's Issues

Meaning of phoneme scores

For any utterance, do the 5 phoneme scores returned by the model represent the phoneme-level score of the whole utterance?
What should the shape of the phoneme score be? In my case, I always obtain a 1-dim numeric result for each phoneme score.

If so, why is it still considered a phoneme score rather than an utterance score, given that it is an average phone score over the utterance?
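
For reference, a minimal sketch of the shapes involved, pieced together from the inference snippets further down this page; the sys.path setup follows the tutorial below, and the exact shapes are an assumption to verify:

    import sys, os, torch
    sys.path.append(os.path.abspath('../src/'))
    from models import GOPT  # GOPT model definition from this repo

    gopt = GOPT(embed_dim=24, num_heads=1, depth=3, input_dim=84)
    feats = torch.randn(1, 50, 84)   # one utterance, padded to 50 phone positions
    phns = torch.zeros(1, 50)        # phone-ID labels; -1 would mark padding
    with torch.no_grad():
        u1, u2, u3, u4, u5, p, w1, w2, w3 = gopt(feats.float(), phns.float())
    print(u1.shape)  # (1, 1): one utterance-level score per aspect
    print(p.shape)   # (1, 50, 1): one score per padded phone position

Under this reading, p holds one score per phone slot of the utterance, which is why each individual phone score is a single number.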

How to improve speed for GOPT

Hello everyone,
I am working on a project where I need to use GOPT to evaluate a voice recording in real time. I am using Yuan Gong's tutorial to process each spoken sentence, but it is quite slow. I suspect the slowness comes from the overhead of loading the Kaldi model to extract the GOP features for every request. Any solution to this? I can only think of using PyKaldi to cache the model in memory. Thanks

A question about the first data-preparation step

Hello, when using the speechocean762 dataset, I extracted feat.scp with the default conf and then ran python3 local/extract_gop_feats.py, which raised kaldi_io.kaldi_io.UnknownVectorHeader: The header contained 'CM '. I would like to ask why this happens, and which parameters Kaldi uses to extract the features (MFCC, fbank, etc.). Below is the default 762 configuration.

--use-energy=false # use average of log energy, not energy.
--num-mel-bins=40 # similar to Google's setup.
--num-ceps=40 # there is no dimensionality reduction.
--low-freq=20 # low cutoff frequency for mel bins... this is high-bandwidth data, so
# there might be some information at the low end.
--high-freq=-400 # high cutoff frequency, relative to Nyquist of 8000 (=7600)


Problems of Generating tr_label_phn during Inference

In my own inference experiment, I noticed that the score is mainly determined not by the .wav but by the phn.
There was an extreme pattern across multiple sound files of the same word:

Word A 4 4 4 4 4 4
Word B 5 5 5 5 5 5

Even after corrupting the .wav files, the results remained the same.
Then I found a potential reason:

In gen_seq_data_phn.py, tr_label_phn or te_label_phn is generated from a phn_dict that is specific to the dataset we want to use. However, the pretrained model is based on speechocean762. When inferring on any other dataset, the model receives labels specific to the inference dataset rather than to speechocean762, causing inconsistent inference results.

The correct method is to always use the phn_dict that was generated when training on speechocean762.
I will update the inference tutorial if you think it is necessary.
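
A minimal sketch of the fix described above, reusing the gen_phn_dict logic from gen_seq_data_phn.py (quoted in full in the tutorial later on this page); the CSV path is a placeholder, and the first-column handling is an assumption about the label-file layout:

    import numpy as np

    def gen_phn_dict(label):
        # same logic as gen_seq_data_phn.py: index phones in order of first appearance
        phn_dict = {}
        phn_idx = 0
        for i in range(label.shape[0]):
            if label[i] not in phn_dict:
                phn_dict[label[i]] = phn_idx
                phn_idx += 1
        return phn_dict

    # build the dict once from the speechocean762 *training* labels...
    raw = np.loadtxt('tr_labels_phn.csv', delimiter=',', dtype=str)
    phones = raw[:, 0] if raw.ndim == 2 else raw
    so762_phn_dict = gen_phn_dict(phones)
    # ...and reuse so762_phn_dict for every inference dataset,
    # instead of regenerating a dataset-specific mapping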

Step 3 in "infer your own data" confusion

Hello, I am following your guide to infer my own data, and I am confused about step 20: where should I run the step 20 Python code?
Here is my tree from data:
raw_kaldi_gop
│   └── librispeech
│   ├── te_feats.csv
│   ├── te_keys_phn.csv
│   ├── te_keys_word.csv
│   ├── te_labels_phn.csv
│   ├── te_labels_word.csv
│   ├── tr_feats.csv
│   ├── tr_keys_phn.csv
│   ├── tr_keys_word.csv
│   ├── tr_labels_phn.csv
│   └── tr_labels_word.csv
├── README.md
├── seq_data_librispeech
│   ├── te_feat.npy
│   ├── te_label_phn.npy
│   ├── te_label_utt.npy
│   ├── te_label_word.npy
│   ├── tr_feat.npy
│   ├── tr_label_phn.npy
│   ├── tr_label_utt.npy
│   └── tr_label_word.npy
├── seq_data_paiia
│   ├── te_feat.npy
│   ├── te_label_phn.npy
│   ├── te_label_utt.npy
│   ├── te_label_word.npy
│   ├── tr_feat.npy
│   ├── tr_label_phn.npy
│   ├── tr_label_utt.npy
│   └── tr_label_word.npy
└── seq_data_paiib
├── te_feat.npy
├── te_label_phn.npy
├── te_label_utt.npy
├── te_label_word.npy
├── tr_feat.npy
├── tr_label_phn.npy
├── tr_label_utt.npy
└── tr_label_word.npy

My exp folder:
final.py

├── gopt-1e-3-3-1-25-24-gopt-librispeech-br
│   └── result_summary.csv
├── gopt-1e-3-3-1-25-24-gopt-librispeech-br-0
│   ├── models
│   │   └── best_audio_model.pth
│   ├── preds
│   │   ├── phn_pred.npy
│   │   ├── phn_target.npy
│   │   ├── utt_pred.npy
│   │   ├── utt_target.npy
│   │   ├── word_pred.npy
│   │   └── word_target.npy
│   └── result.csv
├── gopt-1e-3-3-1-25-24-gopt-librispeech-br-1
│   ├── models
│   │   └── best_audio_model.pth
│   ├── preds
│   │   ├── phn_pred.npy
│   │   ├── phn_target.npy
│   │   ├── utt_pred.npy
│   │   ├── utt_target.npy
│   │   ├── word_pred.npy
│   │   └── word_target.npy
│   └── result.csv
├── gopt-1e-3-3-1-25-24-gopt-librispeech-br-2
│   ├── models
│   │   └── best_audio_model.pth
│   ├── preds
│   │   ├── phn_pred.npy
│   │   ├── phn_target.npy
│   │   ├── utt_pred.npy
│   │   ├── utt_target.npy
│   │   ├── word_pred.npy
│   │   └── word_target.npy
│   └── result.csv
├── gopt-1e-3-3-1-25-24-gopt-librispeech-br-3
│   ├── models
│   │   └── best_audio_model.pth
│   ├── preds
│   │   ├── phn_pred.npy
│   │   ├── phn_target.npy
│   │   ├── utt_pred.npy
│   │   ├── utt_target.npy
│   │   ├── word_pred.npy
│   │   └── word_target.npy
│   └── result.csv
├── gopt-1e-3-3-1-25-24-gopt-librispeech-br-4
│   ├── models
│   │   └── best_audio_model.pth
│   ├── preds
│   │   ├── phn_pred.npy
│   │   ├── phn_target.npy
│   │   ├── utt_pred.npy
│   │   ├── utt_target.npy
│   │   ├── word_pred.npy
│   │   └── word_target.npy
│   └── result.csv
└── README.md

When testing my own data, how are the mean and variance computed?

    # normalize the input to 0 mean and unit std.
    if am=='librispeech':
        dir='seq_data_librispeech'
        norm_mean, norm_std = 3.203, 4.045
    elif am=='paiia':
        dir='seq_data_paiia'
        norm_mean, norm_std = -0.652, 9.737
    elif am=='paiib':
        dir='seq_data_paiib'
        norm_mean, norm_std = -0.516, 9.247
    else:
        raise ValueError('Acoustic Model Unrecognized.')
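
These constants look like dataset-level statistics of the GOP features. A minimal sketch of how such values might be reproduced from the training features is below; this is an assumption about the procedure, not a confirmed recipe (for instance, padded positions may have been excluded):

    import numpy as np

    tr_feat = np.load('data/seq_data_librispeech/tr_feat.npy')  # (n_utt, 50, feat_dim)
    norm_mean = float(tr_feat.mean())
    norm_std = float(tr_feat.std())
    print(norm_mean, norm_std)  # expected near 3.203 / 4.045 for the librispeech AM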

Interpreting the result

After following the inference steps, I got the values below for u1-u5, p, and w1-w3:

u1=tensor([[1.7443]]) u2=tensor([[1.5404]]) u3=tensor([[1.7297]]) u4=tensor([[1.7074]]) u5=tensor([[1.7606]]) p=tensor([[[1.1559],
[1.2266],
[1.2165],
[1.1115],
[1.1052],
[1.1074],
[1.0690],
[1.2223],
[1.0949],
[1.1671],
[1.0795],
[1.2557],
[1.0595],
[1.1116],
[1.1818],
[1.1300],
[1.2001],
[1.1101],
[1.1616],
[1.0864],
[1.1390],
[0.7162],
[0.8037],
[0.8568],
[0.8601],
[0.8054],
[0.8418],
[0.8683],
[0.7827],
[0.8825],
[0.6441],
[0.7901],
[0.7464],
[0.6433],
[0.8020],
[0.8223],
[0.7503],
[0.7563],
[0.8885],
[0.8561],
[0.8105],
[0.8625],
[0.8481],
[0.8317],
[0.8435],
[0.8590],
[0.8139],
[0.7567],
[0.8845],
[0.8129]]]) w1=tensor([[[ 0.1104],
[ 0.2297],
[ 0.2281],
[ 0.0758],
[ 0.0577],
[ 0.1400],
[-0.0202],
[ 0.1290],
[ 0.0133],
[ 0.2836],
[ 0.0878],
[ 0.3509],
[ 0.0595],
[ 0.0864],
[ 0.1327],
[ 0.0924],
[ 0.1755],
[ 0.0542],
[ 0.1502],
[ 0.0426],
[ 0.1247],
[ 0.9526],
[ 1.0063],
[ 1.0826],
[ 1.0663],
[ 0.9944],
[ 1.0674],
[ 1.1030],
[ 1.0209],
[ 1.0798],
[ 0.8870],
[ 1.0020],
[ 0.9713],
[ 0.8827],
[ 1.0125],
[ 1.0476],
[ 0.9834],
[ 0.9916],
[ 1.1105],
[ 1.0714],
[ 1.0451],
[ 1.0725],
[ 1.0760],
[ 1.0540],
[ 1.0640],
[ 1.0696],
[ 1.0384],
[ 0.9810],
[ 1.0873],
[ 1.0260]]]) w2=tensor([[[0.6134],
[0.7956],
[0.9271],
[0.6699],
[0.5889],
[0.6262],
[0.4851],
[0.6197],
[0.5322],
[0.9736],
[0.7261],
[1.0064],
[0.5336],
[0.6623],
[0.6925],
[0.6142],
[0.7239],
[0.5258],
[0.6993],
[0.5545],
[0.7373],
[0.9153],
[0.9858],
[1.0829],
[1.0741],
[1.0285],
[1.0639],
[1.0860],
[0.9937],
[1.1015],
[0.8865],
[1.0654],
[0.9615],
[0.9004],
[0.9985],
[1.0304],
[0.9705],
[0.9877],
[1.0782],
[1.0342],
[1.0029],
[1.0279],
[1.0328],
[1.0081],
[1.0391],
[1.0626],
[1.0167],
[0.9367],
[1.0728],
[1.0083]]]) w3=tensor([[[0.9717],
[1.0951],
[1.1173],
[0.9834],
[0.9371],
[0.9385],
[0.8971],
[1.0128],
[0.9022],
[1.1262],
[0.9963],
[1.1767],
[0.9003],
[0.9701],
[0.9989],
[0.9520],
[1.0238],
[0.9401],
[1.0122],
[0.9360],
[1.0347],
[1.0048],
[1.0965],
[1.1611],
[1.1419],
[1.1097],
[1.1247],
[1.1732],
[1.0983],
[1.1891],
[0.9894],
[1.1176],
[1.0471],
[0.9793],
[1.0938],
[1.1114],
[1.0798],
[1.0866],
[1.2085],
[1.1529],
[1.0992],
[1.1474],
[1.1448],
[1.1297],
[1.1249],
[1.1632],
[1.1026],
[1.0581],
[1.1813],
[1.1074]]])

Now, how do I interpret this result?
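
One hedged reading, consistent with other issues on this page (see the tensor*5 question further down): the utterance and word targets appear to be rescaled to roughly [0, 2] during training, so multiplying by 5 would recover speechocean762's 0-10 scale, while phone scores are already on the dataset's 0-2 phone-accuracy scale. The aspect order u1=accuracy, ..., u5=total is also an assumption taken from another issue below; verify both against the training code:

    import torch

    # dummy stand-ins for the tensors printed above
    u1, u2, u3, u4, u5 = (torch.tensor([[v]]) for v in
                          (1.7443, 1.5404, 1.7297, 1.7074, 1.7606))
    p = torch.full((1, 49, 1), 1.1)

    # assumed: utterance scores were divided by 5 (0-10 -> 0-2) during training
    for name, t in zip(['accuracy', 'completeness', 'fluency', 'prosodic', 'total'],
                       [u1, u2, u3, u4, u5]):
        print(name, round(t.item() * 5, 2))

    # assumed: phone scores are trained directly against 0-2 phone-accuracy labels
    print(p.squeeze(-1)[0, :5])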

What is the point of cleaning the lexicon in the dictionary librispeech-lexicon.txt?

Hi Yuan,
When doing the inference steps of your tutorial, I found your suggestion that I should replace the content of lexicon.txt in speechocean762 with librispeech-lexicon.txt, along with this example cleaning code:

with open("librispeech-lexicon.txt", 'r') as f:
    lexicon_raw = f.read()
rows = lexicon_raw.splitlines()
clean_rows = [row.split() for row in rows]
lexicon_dict_l = dict()
for row in clean_rows:
    c_row = row.copy()
    key = c_row.pop(0)
    if len(c_row) == 1:
        c_row[0] = c_row[0] + '_S'
    if len(c_row) >= 2:
        c_row[0] = c_row[0] + '_B'
        c_row[-1] = c_row[-1] + '_E'
    if len(c_row) > 2:
        for i in range(1, len(c_row) - 1):
            c_row[i] = c_row[i] + '_I'
    val = " ".join(c_row)
    lexicon_dict_l[key] = val
lexicon_dict_l

Can you please explain why we need to clean the lexicon, and what the suffixes _S, _B, and _E mean? Thank you.
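
A usage note continuing the snippet above (the two entries are hypothetical but follow the CMU lexicon format; FAN is the same example word used in the tutorial later on this page):

    print(lexicon_dict_l['FAN'])  # 'F_B AE0_I N_E'  (_B begin, _I internal, _E end)
    print(lexicon_dict_l['A'])    # 'AH0_S'          (_S singleton, a one-phone word)

These suffixes match Kaldi's word-position-dependent phone convention (begin, end, internal, singleton), which the alignment stage expects.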

Local script missing oov.int

I am replacing speechocean762 with my own dataset to perform inference.
I have fixed many things to reach stage 9.
Here, steps/align_mapped.sh complains that it expects the file oov.int to exist.
I am using the acoustic model according to the guide; that is, at the beginning of run.sh, I set the following:

librispeech_eg=../../librispeech/s5
model=$librispeech_eg/exp/chain_cleaned/tdnn_1d_sp
ivector_extractor=$librispeech_eg/exp/nnet3_cleaned/extractor
lang=$librispeech_eg/exp/chain_cleaned/tdnn_1d_sp

But oov.int does not exist in either of those folders.
Can you give me a direction on this?

About infer the data

Hi, I am doing the inference as the inference tutorial describes. This is perhaps a stupid question, but the tutorial says it only uses one wav file to infer, so that is what I did, and I got the error below from kaldi/egs/gop_speechocean762/s5/run.sh. I did manage to train locally with speechocean762 and generate GOP features from the original dataset. So what should my own dataset look like?

./run.sh
utils/validate_data_dir.sh: WARNING: you have only one speaker. This probably a bad idea. Search for the word 'bold' in http://kaldi-asr.org/doc/data_prep.html for more information.
utils/validate_data_dir.sh: Error: in data/train, utterance-ids extracted from utt2spk and features
utils/validate_data_dir.sh: differ, partial diff is:
--- /tmp/kaldi.nkqk/utts 2023-05-16 16:34:52.724637343 +0800
+++ /tmp/kaldi.nkqk/utts.feats 2023-05-16 16:34:52.800640123 +0800

Cannot infer my own data; the result belongs to the speechocean762 dataset

How did you infer your own wav file? I followed the inference instructions, but the result I got is [u1, u2, u3, u4, u5, p, w1, w2, w3] for the speechocean762 dataset (I realized this because the tensor shape of u1 is (2500, 1) while I only have 1 wav file), not for my wav file :((((

Also, the inference instructions don't mention where to put your own dataset folder, nor where to specify the path to your own dataset in the code.

Extracting features with the kaldi_io package

hello author,
Recently, while studying the model you proposed, I wanted to use your pre-trained model to infer my own data, but I have the following confusions:

  1. First, I used the following code on my machine to extract one of the wav files of speechocean762, reading via wav.scp, but an encoding error was reported. Specifically, it looks like kaldi_io opens the wav file in text mode. Do you know how to fix this?

    for key, feat in kaldi_io.read_vec_flt_scp(scp_path):
        print(key)

    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe4 in position 2:
    invalid continuation byte (arr = (binary + fd.readline().decode()).strip().split())

  2. I checked your extracted tr_feats.csv and found that each line has 85 dimensions, but the features I extracted using torchaudio.compliance.kaldi are not 85-dimensional. Also, the length of a wav file I extracted from the speechocean762 dataset is much greater than 50, while the features of all files in your code are shorter than 50. Why is that?

    specgram = torchaudio.compliance.kaldi.spectrogram(waveform)
    mfcc = torchaudio.compliance.kaldi.mfcc(waveform)
    mfcc.shape      # torch.Size([377, 257])
    specgram.shape  # torch.Size([377, 257])

  3. If I just want the model to assess the speaker's fluency, can I make some simplifications, such as not using the ASR model?

Lang Librispeech

I'm running the code and I have a problem: I don't know where to get the lang folder referenced by

lang=$librispeech_eg/data/lang

Comparison with other works

Hi! Thank you for sharing the code of your work. It might be a silly question, but I wonder why you haven't made any comparison with other strong solutions trained on the larger L2-Arctic dataset? That dataset has no sentence-level scores, but a larger set of utterances with phoneme-level annotation, and its authors claimed quite strong phoneme-accuracy results.
Also, according to the same article, adding an ASR loss increases error-detection accuracy; why did you decide not to use it as one of the heads? Thanks.

my own data length limit

Hi, dear author, when I infer my own data, the length exceeds 50 and an error occurs. I found this line in gopt.py:
self.pos_embed = nn.Parameter(torch.zeros(1, 55, self.embed_dim))
So the max length is 50? If a sentence exceeds 50 phones, must I clip it into shorter fragments?
@YuanGongND
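
A minimal sketch of the clipping workaround, assuming the 50-position padding used throughout this page; splitting at word boundaries would likely be better in practice, but fixed 50-phone windows illustrate the idea:

    import numpy as np

    def chunk_to_50(feat, phn_label, max_len=50):
        # split one long utterance (seq_len, feat_dim) into padded max_len-phone chunks
        chunks_feat, chunks_phn = [], []
        for start in range(0, feat.shape[0], max_len):
            f = feat[start:start + max_len]
            p = phn_label[start:start + max_len]
            pad = max_len - f.shape[0]
            if pad > 0:
                f = np.pad(f, ((0, pad), (0, 0)))
                p = np.pad(p, (0, pad), constant_values=-1)  # -1 marks padding
            chunks_feat.append(f)
            chunks_phn.append(p)
        return np.stack(chunks_feat), np.stack(chunks_phn)

    feat = np.random.randn(120, 84)   # hypothetical 120-phone utterance
    phn = np.random.randint(0, 40, 120)
    f, p = chunk_to_50(feat, phn)
    print(f.shape, p.shape)           # (3, 50, 84) (3, 50)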

The results of u1, u2, u3, u4, u5 are always higher than 1.5

Hi all, I used the output https://www.dropbox.com/s/zc6o1d8rqq28vci/data.zip?dl=1 and followed the inference instructions in steps_of_inference.md. But after running inference with te_label_phn.npy and te_feat.npy from the librispeech output I downloaded, the results are weird. I checked each tensor from u1 to u5, and even their minimum values are always higher than 1.5, which is very high and unexpected, because there are bad pronunciation examples inside the speechocean762 dataset.

For example, min value of u1 is 1.5474397 and max value of u1 is 1.7969123.

I used this inference code:

import torch
import sys
import os
sys.path.append(os.path.abspath('../src/'))
from models import GOPT
gopt = GOPT(embed_dim=24, num_heads=1, depth=3, input_dim=84)
gopt = torch.nn.DataParallel(gopt)
sd = torch.load('gopt_librispeech/best_audio_model.pth', map_location='cpu')
gopt.load_state_dict(sd, strict=True)

import numpy as np
input_feat = np.load("te_feat.npy")
input_phn = np.load("te_label_phn.npy")
gopt = gopt.float()
gopt.eval()
with torch.no_grad():
    t_input_feat = torch.from_numpy(input_feat[:, :, :])
    t_phn = torch.from_numpy(input_phn[:, :, 0])
    u1, u2, u3, u4, u5, p, w1, w2, w3 = gopt(t_input_feat.float(), t_phn.float())
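
One thing worth checking, continuing from the snippet above: it feeds raw GOP features to the model, while the training code quoted in the normalization issue earlier on this page standardizes the input first. If the checkpoint was trained on normalized features (an assumption), something like the following would be needed before the forward pass, using the librispeech values from that issue:

    norm_mean, norm_std = 3.203, 4.045
    t_input_feat = (t_input_feat - norm_mean) / norm_std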

How do I rate my own voice using the model trained in Google Colab?

As the question title says, I have tested the model running on the provided Google Colab. What I want to ask is how I can take my own voice and the corresponding text as input and get a tuple of [u1, u2, u3, u4, u5, p, w1, w2, w3] as output (probably via a piece of code appended to the existing Colab code). If that is possible, do I need to input the sentences spoken in the speechocean762 dataset, or can I test with anything else? I am new to automatic pronunciation assessment, so this question may be a bit silly.

Thanks a lot,
Bach.

hello

x = x + self.pos_embed

hello, I met a problem. Suppose batch_size=2 and the shape of x is (2, 4, 41), while the shape of self.pos_embed is (1, 55, 41); then x = x + self.pos_embed raises an error. Can you give me some suggestions? Thanks!!

my test example is:
a = GOPTNoPhn(41, 4, 2)
data = torch.randn(2, 4, 84)
phn = torch.tensor([[1, 4, 2, -1], [4, 2, 3, 38]])
a(data, phn)

The number of my own data

When I deleted the 762 dataset to infer my own data, as described in steps_of_inference.md, only 3 lists of data were left, and then I got the error shown in a screenshot (not preserved here).
I would like some help.
Thank you!

How can I assess my own .wav file using trained model

Hi, I followed your steps and got results and a best model.
I also tried to follow steps_of_inference.md to assess my own .wav file, and I get the tensors [u1, u2, u3, u4, u5, p, w1, w2, w3] as the result. But these tensors seem to be random or negative. How can I convert these tensors to the exact results I want, such as accuracy or total scores?
I have read issue #6 for reference, but it still does not work for me.

Thanks for your help!

Is GOPT designed to take only complete words and sentences? What about a phoneme?

Hi again :) Thanks to your response, I was able to replicate the whole process of pronunciation assessment.
I became curious whether GOPT can provide the accuracy of a phoneme separately, without being given a complete word or sentence. I understand it can take a sentence or a word and provide accuracy down to the phoneme level, but can it still generate the accuracy of a phoneme if I manually pass a text like "A" along with its correct phonetic transcription "AH"?
I wasn't sure whether it is designed to take only complete words or sentences, or whether it can also take a single phoneme separately.

Thank you very much
Best regards
Theo Seo

Where can I find the L2 data used in PAII-A and PAII-B?

Nice Work and thank you for your code!

I found that 1696 / 6591 hours of L2 data were used in training PAII-A / PAII-B. I'm impressed by those numbers. I want to ask where I can find these data. Are they public?

While running run.sh in gop_speechocean762, visualize_feats.py fails with AttributeError: 'tuple' object has no attribute 'shape'

(env) amandeep@vitubuntu:~/Desktop/kaldi-master/egs/gop_speechocean762/s5$ ./run.sh
utils/validate_data_dir.sh: WARNING: you have only one speaker. This probably a bad idea.
Search for the word 'bold' in http://kaldi-asr.org/doc/data_prep.html
for more information.
utils/validate_data_dir.sh: Successfully validated data-directory data/train
local/data_prep.sh: successfully prepared data in data/train
utils/validate_data_dir.sh: WARNING: you have only one speaker. This probably a bad idea.
Search for the word 'bold' in http://kaldi-asr.org/doc/data_prep.html
for more information.
utils/validate_data_dir.sh: Successfully validated data-directory data/test
local/data_prep.sh: successfully prepared data in data/test
steps/make_mfcc.sh --nj 1 --mfcc-config conf/mfcc_hires.conf --cmd run.pl data/train
steps/make_mfcc.sh: moving data/train/feats.scp to data/train/.backup
utils/validate_data_dir.sh: WARNING: you have only one speaker. This probably a bad idea.
Search for the word 'bold' in http://kaldi-asr.org/doc/data_prep.html
for more information.
utils/validate_data_dir.sh: Successfully validated data-directory data/train
steps/make_mfcc.sh: [info]: no segments file exists: assuming wav.scp indexed by utterance.
steps/make_mfcc.sh: Succeeded creating MFCC features for train
steps/compute_cmvn_stats.sh data/train
Succeeded creating CMVN stats for train
fix_data_dir.sh: kept all 1 utterances.
fix_data_dir.sh: old files are kept in data/train/.backup
steps/make_mfcc.sh --nj 1 --mfcc-config conf/mfcc_hires.conf --cmd run.pl data/test
steps/make_mfcc.sh: moving data/test/feats.scp to data/test/.backup
utils/validate_data_dir.sh: WARNING: you have only one speaker. This probably a bad idea.
Search for the word 'bold' in http://kaldi-asr.org/doc/data_prep.html
for more information.
utils/validate_data_dir.sh: Successfully validated data-directory data/test
steps/make_mfcc.sh: [info]: no segments file exists: assuming wav.scp indexed by utterance.
steps/make_mfcc.sh: Succeeded creating MFCC features for test
steps/compute_cmvn_stats.sh data/test
Succeeded creating CMVN stats for test
fix_data_dir.sh: kept all 1 utterances.
fix_data_dir.sh: old files are kept in data/test/.backup
steps/online/nnet2/extract_ivectors_online.sh --cmd run.pl --nj 1 data/train ../../librispeech/s5/exp/nnet3_cleaned/extractor data/train/ivectors
steps/online/nnet2/extract_ivectors_online.sh: extracting iVectors
steps/online/nnet2/extract_ivectors_online.sh: combining iVectors across jobs
steps/online/nnet2/extract_ivectors_online.sh: done extracting (online) iVectors to data/train/ivectors using the extractor in ../../librispeech/s5/exp/nnet3_cleaned/extractor.
steps/online/nnet2/extract_ivectors_online.sh --cmd run.pl --nj 1 data/test ../../librispeech/s5/exp/nnet3_cleaned/extractor data/test/ivectors
steps/online/nnet2/extract_ivectors_online.sh: extracting iVectors
steps/online/nnet2/extract_ivectors_online.sh: combining iVectors across jobs
steps/online/nnet2/extract_ivectors_online.sh: done extracting (online) iVectors to data/test/ivectors using the extractor in ../../librispeech/s5/exp/nnet3_cleaned/extractor.
steps/nnet3/compute_output.sh --cmd run.pl --nj 1 --online-ivector-dir data/train/ivectors data/train ../../librispeech/s5/exp/chain_cleaned/tdnn_1d_sp exp/probs_train
steps/nnet3/compute_output.sh: WARNING: no such file ../../librispeech/s5/exp/chain_cleaned/tdnn_1d_sp/final.raw. Trying ../../librispeech/s5/exp/chain_cleaned/tdnn_1d_sp/final.mdl instead.
steps/nnet3/compute_output.sh --cmd run.pl --nj 1 --online-ivector-dir data/test/ivectors data/test ../../librispeech/s5/exp/chain_cleaned/tdnn_1d_sp exp/probs_test
steps/nnet3/compute_output.sh: WARNING: no such file ../../librispeech/s5/exp/chain_cleaned/tdnn_1d_sp/final.raw. Trying ../../librispeech/s5/exp/chain_cleaned/tdnn_1d_sp/final.mdl instead.
Preparing phone lists
2 silence phones saved to: data/local/dict_nosp/silence_phones.txt
1 optional silence saved to: data/local/dict_nosp/optional_silence.txt
39 non-silence phones saved to: data/local/dict_nosp/nonsilence_phones.txt
5 extra triphone clustering-related questions saved to: data/local/dict_nosp/extra_questions.txt
Lexicon text file saved as: data/local/dict_nosp/lexicon.txt
utils/prepare_lang.sh --phone-symbol-table ../../librispeech/s5/data/lang_test_tgsmall/phones.txt data/local/dict_nosp data/local/lang_tmp_nosp data/lang_nosp
Checking data/local/dict_nosp/silence_phones.txt ...
--> reading data/local/dict_nosp/silence_phones.txt
--> text seems to be UTF-8 or ASCII, checking whitespaces
--> text contains only allowed whitespaces
--> data/local/dict_nosp/silence_phones.txt is OK

Checking data/local/dict_nosp/optional_silence.txt ...
--> reading data/local/dict_nosp/optional_silence.txt
--> text seems to be UTF-8 or ASCII, checking whitespaces
--> text contains only allowed whitespaces
--> data/local/dict_nosp/optional_silence.txt is OK

Checking data/local/dict_nosp/nonsilence_phones.txt ...
--> reading data/local/dict_nosp/nonsilence_phones.txt
--> text seems to be UTF-8 or ASCII, checking whitespaces
--> text contains only allowed whitespaces
--> data/local/dict_nosp/nonsilence_phones.txt is OK

Checking disjoint: silence_phones.txt, nonsilence_phones.txt
--> disjoint property is OK.

Checking data/local/dict_nosp/lexicon.txt
--> reading data/local/dict_nosp/lexicon.txt
--> text seems to be UTF-8 or ASCII, checking whitespaces
--> text contains only allowed whitespaces
--> data/local/dict_nosp/lexicon.txt is OK

Checking data/local/dict_nosp/lexiconp.txt
--> reading data/local/dict_nosp/lexiconp.txt
--> text seems to be UTF-8 or ASCII, checking whitespaces
--> text contains only allowed whitespaces
--> data/local/dict_nosp/lexiconp.txt is OK

Checking lexicon pair data/local/dict_nosp/lexicon.txt and data/local/dict_nosp/lexiconp.txt
--> lexicon pair data/local/dict_nosp/lexicon.txt and data/local/dict_nosp/lexiconp.txt match

Checking data/local/dict_nosp/extra_questions.txt ...
--> reading data/local/dict_nosp/extra_questions.txt
--> text seems to be UTF-8 or ASCII, checking whitespaces
--> text contains only allowed whitespaces
--> data/local/dict_nosp/extra_questions.txt is OK
--> SUCCESS [validating dictionary directory data/local/dict_nosp]

fstaddselfloops data/lang_nosp/phones/wdisambig_phones.int data/lang_nosp/phones/wdisambig_words.int
prepare_lang.sh: validating output directory
utils/validate_lang.pl data/lang_nosp
Checking existence of separator file
separator file data/lang_nosp/subword_separator.txt is empty or does not exist, deal in word case.
Checking data/lang_nosp/phones.txt ...
--> text seems to be UTF-8 or ASCII, checking whitespaces
--> text contains only allowed whitespaces
--> data/lang_nosp/phones.txt is OK

Checking words.txt: #0 ...
--> text seems to be UTF-8 or ASCII, checking whitespaces
--> text contains only allowed whitespaces
--> data/lang_nosp/words.txt is OK

Checking disjoint: silence.txt, nonsilence.txt, disambig.txt ...
--> silence.txt and nonsilence.txt are disjoint
--> silence.txt and disambig.txt are disjoint
--> disambig.txt and nonsilence.txt are disjoint
--> disjoint property is OK

Checking sumation: silence.txt, nonsilence.txt, disambig.txt ...
--> found no unexplainable phones in phones.txt

Checking data/lang_nosp/phones/context_indep.{txt, int, csl} ...
--> text seems to be UTF-8 or ASCII, checking whitespaces
--> text contains only allowed whitespaces
--> 10 entry/entries in data/lang_nosp/phones/context_indep.txt
--> data/lang_nosp/phones/context_indep.int corresponds to data/lang_nosp/phones/context_indep.txt
--> data/lang_nosp/phones/context_indep.csl corresponds to data/lang_nosp/phones/context_indep.txt
--> data/lang_nosp/phones/context_indep.{txt, int, csl} are OK

Checking data/lang_nosp/phones/nonsilence.{txt, int, csl} ...
--> text seems to be UTF-8 or ASCII, checking whitespaces
--> text contains only allowed whitespaces
--> 320 entry/entries in data/lang_nosp/phones/nonsilence.txt
--> data/lang_nosp/phones/nonsilence.int corresponds to data/lang_nosp/phones/nonsilence.txt
--> data/lang_nosp/phones/nonsilence.csl corresponds to data/lang_nosp/phones/nonsilence.txt
--> data/lang_nosp/phones/nonsilence.{txt, int, csl} are OK

Checking data/lang_nosp/phones/silence.{txt, int, csl} ...
--> text seems to be UTF-8 or ASCII, checking whitespaces
--> text contains only allowed whitespaces
--> 10 entry/entries in data/lang_nosp/phones/silence.txt
--> data/lang_nosp/phones/silence.int corresponds to data/lang_nosp/phones/silence.txt
--> data/lang_nosp/phones/silence.csl corresponds to data/lang_nosp/phones/silence.txt
--> data/lang_nosp/phones/silence.{txt, int, csl} are OK

Checking data/lang_nosp/phones/optional_silence.{txt, int, csl} ...
--> text seems to be UTF-8 or ASCII, checking whitespaces
--> text contains only allowed whitespaces
--> 1 entry/entries in data/lang_nosp/phones/optional_silence.txt
--> data/lang_nosp/phones/optional_silence.int corresponds to data/lang_nosp/phones/optional_silence.txt
--> data/lang_nosp/phones/optional_silence.csl corresponds to data/lang_nosp/phones/optional_silence.txt
--> data/lang_nosp/phones/optional_silence.{txt, int, csl} are OK

Checking data/lang_nosp/phones/disambig.{txt, int, csl} ...
--> text seems to be UTF-8 or ASCII, checking whitespaces
--> text contains only allowed whitespaces
--> 6 entry/entries in data/lang_nosp/phones/disambig.txt
--> data/lang_nosp/phones/disambig.int corresponds to data/lang_nosp/phones/disambig.txt
--> data/lang_nosp/phones/disambig.csl corresponds to data/lang_nosp/phones/disambig.txt
--> data/lang_nosp/phones/disambig.{txt, int, csl} are OK

Checking data/lang_nosp/phones/roots.{txt, int} ...
--> text seems to be UTF-8 or ASCII, checking whitespaces
--> text contains only allowed whitespaces
--> 41 entry/entries in data/lang_nosp/phones/roots.txt
--> data/lang_nosp/phones/roots.int corresponds to data/lang_nosp/phones/roots.txt
--> data/lang_nosp/phones/roots.{txt, int} are OK

Checking data/lang_nosp/phones/sets.{txt, int} ...
--> text seems to be UTF-8 or ASCII, checking whitespaces
--> text contains only allowed whitespaces
--> 41 entry/entries in data/lang_nosp/phones/sets.txt
--> data/lang_nosp/phones/sets.int corresponds to data/lang_nosp/phones/sets.txt
--> data/lang_nosp/phones/sets.{txt, int} are OK

Checking data/lang_nosp/phones/extra_questions.{txt, int} ...
--> text seems to be UTF-8 or ASCII, checking whitespaces
--> text contains only allowed whitespaces
--> 14 entry/entries in data/lang_nosp/phones/extra_questions.txt
--> data/lang_nosp/phones/extra_questions.int corresponds to data/lang_nosp/phones/extra_questions.txt
--> data/lang_nosp/phones/extra_questions.{txt, int} are OK

Checking data/lang_nosp/phones/word_boundary.{txt, int} ...
--> text seems to be UTF-8 or ASCII, checking whitespaces
--> text contains only allowed whitespaces
--> 330 entry/entries in data/lang_nosp/phones/word_boundary.txt
--> data/lang_nosp/phones/word_boundary.int corresponds to data/lang_nosp/phones/word_boundary.txt
--> data/lang_nosp/phones/word_boundary.{txt, int} are OK

Checking optional_silence.txt ...
--> reading data/lang_nosp/phones/optional_silence.txt
--> data/lang_nosp/phones/optional_silence.txt is OK

Checking disambiguation symbols: #0 and #1
--> data/lang_nosp/phones/disambig.txt has "#0" and "#1"
--> data/lang_nosp/phones/disambig.txt is OK

Checking topo ...

Checking word_boundary.txt: silence.txt, nonsilence.txt, disambig.txt ...
--> data/lang_nosp/phones/word_boundary.txt doesn't include disambiguation symbols
--> data/lang_nosp/phones/word_boundary.txt is the union of nonsilence.txt and silence.txt
--> data/lang_nosp/phones/word_boundary.txt is OK

Checking word-level disambiguation symbols...
--> data/lang_nosp/phones/wdisambig.txt exists (newer prepare_lang.sh)
Checking word_boundary.int and disambig.int
--> generating a 20 word/subword sequence
--> resulting phone sequence from L.fst corresponds to the word sequence
--> L.fst is OK
--> generating a 10 word/subword sequence
--> resulting phone sequence from L_disambig.fst corresponds to the word sequence
--> L_disambig.fst is OK

Checking data/lang_nosp/oov.{txt, int} ...
--> text seems to be UTF-8 or ASCII, checking whitespaces
--> text contains only allowed whitespaces
--> 1 entry/entries in data/lang_nosp/oov.txt
--> data/lang_nosp/oov.int corresponds to data/lang_nosp/oov.txt
--> data/lang_nosp/oov.{txt, int} are OK

--> data/lang_nosp/L.fst is olabel sorted
--> data/lang_nosp/L_disambig.fst is olabel sorted
--> SUCCESS [validating lang directory data/lang_nosp]
steps/align_mapped.sh --cmd run.pl --nj 1 --graphs exp/ali_train data/train exp/probs_train ../../librispeech/s5/data/lang_test_tgsmall ../../librispeech/s5/exp/chain_cleaned/tdnn_1d_sp exp/ali_train
steps/align_mapped.sh: aligning data in data/train using model from ../../librispeech/s5/exp/chain_cleaned/tdnn_1d_sp, putting alignments in exp/ali_train
steps/diagnostic/analyze_alignments.sh --cmd run.pl ../../librispeech/s5/data/lang_test_tgsmall exp/ali_train
steps/diagnostic/analyze_alignments.sh: see stats in exp/ali_train/log/analyze_alignments.log
steps/align_mapped.sh: done aligning data.
steps/align_mapped.sh --cmd run.pl --nj 1 --graphs exp/ali_test data/test exp/probs_test ../../librispeech/s5/data/lang_test_tgsmall ../../librispeech/s5/exp/chain_cleaned/tdnn_1d_sp exp/ali_test
steps/align_mapped.sh: aligning data in data/test using model from ../../librispeech/s5/exp/chain_cleaned/tdnn_1d_sp, putting alignments in exp/ali_test
steps/diagnostic/analyze_alignments.sh --cmd run.pl ../../librispeech/s5/data/lang_test_tgsmall exp/ali_test
steps/diagnostic/analyze_alignments.sh: see stats in exp/ali_test/log/analyze_alignments.log
steps/align_mapped.sh: done aligning data.
local/visualize_feats.py --phone-symbol-table data/lang_nosp/phones-pure.txt exp/gop_train/feat.scp data/local/scores.json exp/gop_train/feats.png
Traceback (most recent call last):
  File "local/visualize_feats.py", line 75, in <module>
    main()
  File "local/visualize_feats.py", line 68, in main
    features = TSNE(n_components=2).fit_transform(features)
  File "/home/amandeep/Desktop/kaldi-master/egs/gop_speechocean762/s5/env/lib/python3.8/site-packages/sklearn/manifold/_t_sne.py", line 1118, in fit_transform
    self._check_params_vs_input(X)
  File "/home/amandeep/Desktop/kaldi-master/egs/gop_speechocean762/s5/env/lib/python3.8/site-packages/sklearn/manifold/_t_sne.py", line 828, in _check_params_vs_input
    if self.perplexity >= X.shape[0]:
AttributeError: 'tuple' object has no attribute 'shape'
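
A possible local workaround, offered as an assumption rather than a confirmed fix: the traceback suggests TSNE receives a tuple instead of an array, and newer scikit-learn versions additionally require perplexity < n_samples, which matters when only one or two utterances are processed. A self-contained sketch with dummy data:

    import numpy as np
    from sklearn.manifold import TSNE

    # dummy stand-in for the per-phone feature rows that visualize_feats.py accumulates
    features = [np.random.randn(84) for _ in range(10)]

    X = np.array(features)                  # ensure an ndarray, not a tuple/list
    perplexity = min(30.0, X.shape[0] - 1)  # keep perplexity < n_samples
    embedded = TSNE(n_components=2, perplexity=perplexity).fit_transform(X)
    print(embedded.shape)                   # (10, 2)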

[Tutorial] Infer your own data

Infer your own data

This tutorial is aimed at helping people understand how to infer any data with GOPT.
There will be some restrictions on "any data"; I will explain them later.

  1. Install Kaldi and GOPT. Sometimes Kaldi can be finicky. If you meet any problems, use a Docker image of Kaldi instead.

  2. Download the original speechocean762 to your disk.

  3. Copy speechocean762 and rename the copy to your dataset’s name. In this example, I use test_dataset.

  4. There are multiple hacks we need to do to make our test_dataset runnable.

  5. First, delete the unnecessary .wav files and replace them with your own.

  6. Update all files in both test and train, including spk2age, spk2gender, spk2utt, text, utt2spk, and wav.scp. Remember, words in text need to be capitalized.

  7. Because I only have 1 wav file, all these files will be one line each. For example, I set the speaker ID to 0001 and the utt_id to test. So:

    wav.scp includes

    test	WAVE/SPEAKER0001/test.wav
    

    spk2utt includes

    0001 test
    

    You can infer how to write the rest.

  8. In resource/text-phone, delete the unnecessary lines and replace them with your own. Each line begins with <utt_id>.<n>, which represents the n-th word in your text. After it, append the corresponding phones of that word.

    To be specific, find the corresponding phones in resource/lexicon.txt. For instance, the word FAN would be F AE0 N. Add the suffix B to the first phone of each word, the suffix E to the last phone, and the suffix I to all the others.

    For example, if my text is “FAN WORKS”, the final result in text-phone is

    test.0 F_B AE0_I N_E
    test.1 W_B ER0_I K_I S_E
    

    If you don’t do this right, you will get stuck at stage 8.

  9. Download and extract all tars in https://kaldi-asr.org/models/m13

  10. In gop_speechocean762/s5/run.sh, change lines 38-42 to point to your extracted results. In my case, I use

    librispeech_eg=../../librispeech/s5
    model=$librispeech_eg/exp/chain_cleaned/tdnn_1d_sp
    ivector_extractor=$librispeech_eg/exp/nnet3_cleaned/extractor
    lang=$librispeech_eg/data/lang_test_tgsmall
    

    Also, change stage to 2 to avoid the download. Change nj to the number of examples; in my case, I set it to 1.

  11. After running run.sh, confirm that you have non-empty feat files in gop_train and gop_test. Mine look like this:

    (gopt) yifan@XXX:~/develop/kaldi/egs/gop_speechocean762/s5/exp/gop_train$ tree .
    .
    ├── feat.1.ark
    ├── feat.1.scp
    ├── feat.scp
    ├── gop.1.ark
    ├── gop.1.scp
    ├── gop.scp
    └── log
        └── compute_gop.1.log
    
  12. Execute the following from the original GOPT guide.

    kaldi_path=your_kaldi_path
    cd $gopt_path
    mkdir -p data/raw_kaldi_gop/librispeech
    cp src/extract_kaldi_gop/{extract_gop_feats.py,extract_gop_feats_word.py} ${kaldi_path}/egs/gop_speechocean762/s5/local/
    cd ${kaldi_path}/egs/gop_speechocean762/s5
    
  13. Now we need to change the original GOPT files.

  14. First, in extract_gop_feats.py, delete the continue at https://github.com/YuanGongND/gopt/blob/master/src/extract_kaldi_gop/extract_gop_feats.py#L54. (PS: "label" in this file is spelled "lable". Ummm, you can’t unsee it.)

  15. In the same file, because we do not have scores, change https://github.com/YuanGongND/gopt/blob/master/src/extract_kaldi_gop/extract_gop_feats.py#L61 to

    lables.append([ph])
    
  16. Run the edited script: python local/extract_gop_feats.py. Skip extract_gop_feats_word.py; it is not needed for inference.

  17. Continue with

    cd $gopt_path
    cp -r ${kaldi_path}/egs/gop_speechocean762/s5/gopt_feats/* data/raw_kaldi_gop/<your dataset name>
    
  18. Change another GOPT file, src/prep_data/gen_seq_data_phn.py. Because we no longer have scores, all we want is the phn. We also need to replace the hardcoded paths with <your dataset name>. You can debug it yourself; here are my edited results.

    # -*- coding: utf-8 -*-
    # @Time    : 9/19/21 11:13 PM
    # @Author  : Yuan Gong
    # @Affiliation  : Massachusetts Institute of Technology
    # @Email   : [email protected]
    # @File    : gen_seq_data_phn.py
    
    # Generate sequence phone input and label for seq2seq models from raw Kaldi GOP features.
    
    import numpy as np
    
    def load_feat(path):
        file = np.loadtxt(path, delimiter=',')
        return file
    
    def load_keys(path):
        file = np.loadtxt(path, delimiter=',', dtype=str)
        return file
    
    def load_label(path):
        file = np.loadtxt(path, delimiter=',', dtype=str)
        return file
    
    def process_label(label):
        pure_label = []
        for i in range(0, label.shape[0]):
            pure_label.append(float(label[i, 1]))
        return np.array(pure_label)
    
    def process_feat_seq(feat, keys, labels, phn_dict):
        key_set = []
        for i in range(keys.shape[0]):
            cur_key = keys[i].split('.')[0]
            key_set.append(cur_key)
    
        feat_dim = feat.shape[1] - 1
    
        utt_cnt = len(list(set(key_set)))
        print('In total utterance number : ' + str(utt_cnt))
    
        # Pad all sequence to 50 because the longest sequence of the so762 dataset is shorter than 50.
        seq_feat = np.zeros([utt_cnt, 50, feat_dim])
        # -1 means n/a, padded token
        # [utt, seq_len, 0] is the phone label, and the [utt, seq_len, 1] is the score label
        seq_label = np.zeros([utt_cnt, 50, 2]) - 1
    
        # the key format is utt_id.phn_id
        prev_utt_id = keys[0].split('.')[0]
    
        row = 0
        for i in range(feat.shape[0]):
            cur_utt_id, cur_tok_id = keys[i].split('.')[0], int(keys[i].split('.')[1])
            # if a new sequence, start a new row of the feature vector.
            if cur_utt_id != prev_utt_id:
                row += 1
                prev_utt_id = cur_utt_id
    
            # The first element is the phone label.
            seq_feat[row, cur_tok_id, :] = feat[i, 1:]
    
            # [utt, seq_len, 0] is the phone label
            print(labels)
            seq_label[row, cur_tok_id, 0] = phn_dict[labels[i]]
            # [utt, seq_len, 1] is the score label, range from 0-2
            # seq_label[row, cur_tok_id, 1] = labels[i, 1]
    
        return seq_feat, seq_label
    
    def gen_phn_dict(label):
        phn_dict = {}
        phn_idx = 0
        for i in range(label.shape[0]):
            if label[i] not in phn_dict:
                phn_dict[label[i]] = phn_idx
                phn_idx += 1
        return phn_dict
    
    # generate sequence training data
    tr_feat = load_feat('../../data/raw_kaldi_gop/test_dataset/tr_feats.csv')
    tr_keys = load_keys('../../data/raw_kaldi_gop/test_dataset/tr_keys_phn.csv')
    tr_label = load_label('../../data/raw_kaldi_gop/test_dataset/tr_labels_phn.csv')
    phn_dict = gen_phn_dict(tr_label)
    print(phn_dict)
    tr_feat, tr_label = process_feat_seq(tr_feat, tr_keys, tr_label, phn_dict)
    print(tr_feat.shape)
    print(tr_label.shape)
    np.save('../../data/seq_data_test_dataset/tr_feat.npy', tr_feat)
    np.save('../../data/seq_data_test_dataset/tr_label_phn.npy', tr_label)
    
    # generate sequence test data
    te_feat = load_feat('../../data/raw_kaldi_gop/test_dataset/te_feats.csv')
    te_keys = load_keys('../../data/raw_kaldi_gop/test_dataset/te_keys_phn.csv')
    te_label = load_label('../../data/raw_kaldi_gop/test_dataset/te_labels_phn.csv')
    te_feat, te_label = process_feat_seq(te_feat, te_keys, te_label, phn_dict)
    print(te_feat.shape)
    print(te_label.shape)
    np.save('../../data/seq_data_test_dataset/te_feat.npy', te_feat)
    np.save('../../data/seq_data_test_dataset/te_label_phn.npy', te_label)
    
  19. The last step requires you to run these lines. Skip the word and utterance scripts.

    mkdir data/seq_data_<your dataset name>
    cd src/prep_data
    python gen_seq_data_phn.py
    
  20. Finally, in gopt/data/<your dataset name>, you will find the two files needed for inference: te_feat.npy and te_label_phn.npy. But remember, te_label_phn.npy is meant to contain both the phn and the scores, and we have not generated the scores (nor do we need them). So, to do the inference, run the following.

    PS: to simplify things, my train dataset is the same as my test dataset.

    import torch
    import sys
    import os
    sys.path.append(os.path.abspath('../src/'))
    from models import GOPT
    gopt = GOPT(embed_dim=24, num_heads=1, depth=3, input_dim=84)
    # GOPT is trained with DataParallel, so it needs to be wrapped with DataParallel even if you run on a single GPU or CPU
    gopt = torch.nn.DataParallel(gopt)
    sd = torch.load('gopt_librispeech/best_audio_model.pth', map_location='cpu')
    gopt.load_state_dict(sd, strict=True)
    
    import numpy as np
    input_feat = np.load("<your_path>/te_feat.npy")
    input_phn = np.load("<your_path>/te_label_phn.npy")
    t_input_feat = torch.from_numpy(input_feat)
    t_phn = torch.from_numpy(input_phn[:,:,0])
    gopt = gopt.float()
    gopt.eval()
    with torch.no_grad():
        print(gopt(t_input_feat.float(),t_phn.float()))

Good Luck!
Restrictions: if your text contains words that are not in the lexicon, you are out of luck.

The results seem not to change

Hi, I am very impressed with the results you have achieved in this project, so I wanted to reproduce them, but I have some problems with the results.

I followed the instructions of "[Tutorial] Infer your own data" to get GOP features from Kaldi and then ran inference with the pretrained weights.
I tried many experiments with my own voice and with voices from speechocean762, but the results are almost the same (e.g., u1 (accuracy) is always around 1.7-1.8).
I also tested with an incorrect item (voice different from the transcript) and with a voice that has a bad score in speechocean762, but the results are the same as described above. I hope you can give me some advice.

And one more thing: I multiply the tensor by 5 to get the score on a ten-point scale, since the tensor values are in the range [0, 2]. Is that the right way?

Example: speaker 9604 of speechocean762. This voice has "accuracy": 3 in scores.json (a bad voice), but the result is high:

  • text-phone:
    096040025.0 DH_B EH0_I R_E
    096040025.1 W_B AH0_I Z_E
    096040025.2 N_B AH1_I TH_I IH0_I NG_E
    096040025.3 T_B UW0_E
    096040025.4 B_B IY0_E
    096040025.5 G_B EY0_I N_I D_E
    096040025.6 B_B AY0_E
    096040025.7 IH0_B T_E

  • spk2age 9604 21

  • spk2gender 9604 f

  • spk2utt 9604 096040025

  • text 096040025 THERE WAS NOTHING TO BE GAINED BY IT

  • utt2spk 096040025 9604

  • wav.scp 096040025 WAVE/SPEAKER9604/096040025.WAV

=> GOPT features (zip file): feature.zip
=> Result:
u1 = tensor([1.7430])
u2 = tensor([1.5640])
...

I want to add some words to text-phone from the CMU dict and librispeech-lexicon, but they differ from speechocean762

Hi, dear author, I want to infer some of my own data, but some words don't exist in speechocean762/resource/lexicon.txt or speechocean762/resource/text-phone, so I want to add them. But I found a problem.

For example:

In speechocean762/resource/text-phone:
000010011.0 W_B IY0_E

In speechocean762/resource/lexicon.txt:
WE W IY0

However, in every version of the CMU dict:
WE W IY1

In librispeech-lexicon.txt
WE W IY1

The CMU website says 0, 1, and 2 represent different levels of lexical stress:
0 — No stress
1 — Primary stress
2 — Secondary stress

Many stress differences appear between speechocean762 and the other dictionaries. Which one is correct?
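
For anyone merging lexicons, a hedged sketch: the speechocean762 examples above use stress digit 0 throughout, so one possible workaround when importing CMU/librispeech entries is to map stress digits 1 and 2 down to 0 so the phone symbols match the recipe's phone set. Whether collapsing stress this way is acceptable for scoring is an assumption to verify:

    import re

    def to_so762_style(pron):
        # map CMU/librispeech stress digits 1/2 down to 0, e.g. 'W IY1' -> 'W IY0'
        return ' '.join(re.sub(r'[12]$', '0', ph) for ph in pron.split())

    print(to_so762_style('W IY1'))  # W IY0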
