jzlianglu / pykaldi2
Yet another speech toolkit based on Kaldi and PyTorch
License: MIT License
Currently, we use Zip files for storage, which is actually quite convenient. But to be better integrated with the community, we should also develop an HDF5-based storage format, which stores both the data and the metadata in the same file.
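A minimal sketch of what such an HDF5 layout could look like, using h5py; the dataset and attribute names here ("utt001", "sample_rate", "snr_db") are invented placeholders, not a proposed spec:

```python
import numpy as np
import h5py

# Store the waveform and its metadata together in a single HDF5 file.
waveform = np.random.randn(16000).astype(np.float32)

with h5py.File("corpus.h5", "w") as f:
    dset = f.create_dataset("utt001", data=waveform, compression="gzip")
    dset.attrs["sample_rate"] = 16000   # metadata lives next to the data
    dset.attrs["snr_db"] = 12.5

with h5py.File("corpus.h5", "r") as f:
    audio = f["utt001"][:]
    rate = f["utt001"].attrs["sample_rate"]
```

Because attributes are attached per dataset, each utterance can carry its own metadata without any sidecar files.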
Ideally, we should monitor the statistics of the simulated data so that they match those of the target-domain data, e.g. the SNR distribution, the direct-to-reverberation ratio distribution, etc. We should have a mechanism to dynamically monitor the simulated data and adjust it to match the target.
Currently, we have a single configuration option, "simulation_prob", which controls what percentage of the source speech data simulation is applied to. We need more control over selective data simulation.
Specifically, we need to estimate the quality of each speech signal and only apply simulation to the high-quality ones. As a first step, we can use SNR estimation: estimate the SNR offline, store it as one type of metadata, and have the simulation module use this information to decide whether to apply simulation.
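The selection step could be sketched as follows; the energy-quantile SNR estimator and the 20 dB threshold are illustrative assumptions, not part of the repo:

```python
import numpy as np

def estimate_snr_db(wav, frame_len=400, quantile=0.1):
    """Crude energy-based SNR estimate: treat the lowest-energy frames as
    noise and the highest-energy frames as speech plus noise."""
    n = len(wav) // frame_len
    energies = np.square(wav[: n * frame_len]).reshape(n, frame_len).mean(axis=1)
    energies = np.sort(energies) + 1e-12
    k = max(1, int(n * quantile))
    noise = energies[:k].mean()
    speech = energies[-k:].mean()
    return 10.0 * np.log10(max(speech - noise, 1e-12) / noise)

def should_simulate(wav, snr_threshold_db=20.0):
    """Apply simulation only to utterances estimated to be clean enough."""
    return estimate_snr_db(wav) >= snr_threshold_db
```

In the proposed pipeline this estimate would be computed offline and stored as metadata; the runtime check then reduces to a threshold comparison.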
Hi,
I have a few suggestions on using LF-MMI:
I noticed you are looping over the batch, creating the supervision for each utterance and calculating the criterion separately.
You can use the MergeSupervision function I added to pykaldi in order to create the supervision for the whole batch and run the criterion only once.
Here is my collate function for the DataLoader:
import torch
import kaldi
from collections import abc as container_abcs  # from torch._six in older PyTorch

def supervision_collate(batch):
    """
    A collate function for using chain Supervision objects with a DataLoader.
    """
    elem = batch[0]
    if isinstance(elem, container_abcs.Sequence):
        transposed = zip(*batch)
        return [supervision_collate(samples) for samples in transposed]
    elif isinstance(elem, kaldi.chain.Supervision):
        if len(batch) == 1:
            return batch[0]
        return kaldi.chain.merge_supervision(batch)  # needs the MergeSupervision binding
    elif elem is None:
        return batch
    return torch.utils.data.dataloader.default_collate(batch)
To add MergeSupervision I made a pull request to pykaldi (pykaldi/pykaldi#182), but you can use my fork, which already has the change (https://github.com/yotam319/pykaldi).
Also, using phone_ali gives only a weak supervision; you should consider using lattices with phone_lattice_to_proto_supervision instead of alignment_to_proto_supervision.
And finally, you can save your supervisions as bytes and read them back again.
Here are the functions I used for this:
import kaldi
from kaldi import chain
from kaldi.fstext import StdVectorFst

def supervision_to_bytes(supervision):
    out_s = kaldi.base.io.stringstream()
    supervision.write(out_s, True)  # True = binary mode
    return out_s.to_bytes()

def supervision_from_supervision_bytes(supervision_bytes):
    in_s = kaldi.base.io.stringstream.from_str(supervision_bytes)
    supervision = kaldi.chain.Supervision()
    supervision.read(in_s, True)  # True = binary mode
    return supervision

def split_supervision(supervision, start, duration):
    sup_cut = kaldi.chain.SupervisionSplitter(supervision).get_frame_range(start, duration)
    sup_cut.fst = StdVectorFst(sup_cut.fst).rmepsilon()
    return sup_cut
def ali_phone_to_supervision_bytes(phones_durs, opt, ctx_dep, trans_model):
    """
    input:
        phones_durs: list of (phone, duration) tuples
        opt: kaldi.chain.SupervisionOptions object
        ctx_dep: from kaldi.alignment.Aligner.read_tree("exp/chain/<ref_model>/tree")
        trans_model: from kaldi.alignment.Aligner.read_model("exp/chain/<ref_model>/0.trans_mdl")
    returns: byte representation of the supervision
    """
    p_supervision = chain.alignment_to_proto_supervision_with_phones_durs(opt, phones_durs)
    supervision = chain.proto_supervision_to_supervision(ctx_dep, trans_model, p_supervision, opt.convert_to_pdfs)
    return supervision_to_bytes(supervision)
def lat_to_supervision_bytes(lat, phone_lat_mdl, phone_lat_opts,
                             supervision_opts, ctx_dep, trans_model):
    """
    input:
        lat: lattice
        phone_lat_mdl: final.mdl from the lattice folder
        phone_lat_opts: PhoneAlignLatticeOptions object
        supervision_opts: kaldi.chain.SupervisionOptions object
        ctx_dep: from kaldi.alignment.Aligner.read_tree("exp/chain/<ref_model>/tree")
        trans_model: from kaldi.alignment.Aligner.read_model("exp/chain/<ref_model>/0.trans_mdl")
    returns: byte representation of the supervision
    """
    (suc, phone_lat) = kaldi.lat.align.phone_align_lattice(lat, phone_lat_mdl, phone_lat_opts)
    assert suc
    phone_lat.topsort()
    p_supervision = chain.phone_lattice_to_proto_supervision(supervision_opts, phone_lat)
    supervision = chain.proto_supervision_to_supervision(ctx_dep, trans_model, p_supervision, supervision_opts.convert_to_pdfs)
    return supervision_to_bytes(supervision)
hope this helps :)
Hi, I have been using your Simulator functionality and found it quite useful. However, the augmented data I obtain from it has much more reverb than I expect. I am still diagnosing the problem, but is there any reason why this repo uses the equivalent of
FFTbasedConvolveSignals
https://github.com/kaldi-asr/kaldi/blob/master/src/feat/signal.cc#L50
as opposed to
FFTbasedBlockConvolveSignals
https://github.com/kaldi-asr/kaldi/blob/master/src/feat/signal.cc#L77
Kaldi applies reverb using the second one: https://github.com/kaldi-asr/kaldi/blob/master/src/featbin/wav-reverberate.cc#L96. Thanks!
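For background, the block-based variant is an overlap-add convolution. A minimal NumPy sketch of that scheme (my own illustration, not the Kaldi implementation), under the assumption that the output is trimmed to the input length as wav-reverberate's output suggests:

```python
import numpy as np

def block_convolve(signal, rir, block_len=1024):
    """Overlap-add FFT convolution: convolve each block with the RIR via FFT
    and add the overlapping tails, then trim to the input length."""
    n_fft = 1
    while n_fft < block_len + len(rir) - 1:
        n_fft *= 2  # each block's linear convolution must fit in one FFT
    H = np.fft.rfft(rir, n_fft)
    out = np.zeros(len(signal) + len(rir) - 1)
    for start in range(0, len(signal), block_len):
        block = signal[start:start + block_len]
        seg = np.fft.irfft(np.fft.rfft(block, n_fft) * H, n_fft)
        end = min(start + n_fft, len(out))
        out[start:end] += seg[: end - start]
    return out[: len(signal)]
```

Mathematically this equals the full convolution truncated to the input length, so the choice of routine should matter for memory and speed rather than the amount of reverb, which is worth keeping in mind while diagnosing.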
When I train the model with ChainObjtiveFunction, I find the chain loss computation is very slow. For example, data loading takes 0.2 s and the model forward pass 0.2 s, but the loss computation takes 8.2 s. Why is the chain loss computation so slow, and how can it be accelerated? Thanks!
Currently, the code assumes a very specific metadata format for the RIRs. We will need to define a standard format that is flexible and easy to use.
By flexible, we mean it should support a variable number of metadata fields per RIR. For example, some RIRs come with information about room size, source-to-sensor distance, azimuth angle, reverberation time, etc., while others have no metadata at all. We need to support both.
We also need to define how to store the RIR waveforms. One option is to store multiple multi-channel RIRs, measured from different source positions to the same sensor position(s), in one file, so that some of them can be used for the speech sources and some for directional noise sources.
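As a sketch of what a flexible per-RIR record could look like (all field names here are invented placeholders, not a format the repo has adopted), with optional fields simply absent:

```python
import json

# Only "rir_id" and "file" are mandatory in this sketch;
# everything else is optional and may be missing for some RIRs.
rir_meta = {
    "rir_id": "room1_pos3",
    "file": "rirs/room1.h5",         # container holding the RIR waveforms
    "channel_indices": [0, 1],       # which channels in the container
    "room_size_m": [5.0, 4.0, 3.0],  # optional
    "t60_s": 0.45,                   # optional reverberation time
    "src_to_mic_dist_m": 1.8,        # optional
    "azimuth_deg": 30.0,             # optional
}

def get_meta(record, key, default=None):
    """Optional fields are simply absent; readers must tolerate that."""
    return record.get(key, default)

line = json.dumps(rir_meta)          # one JSON object per line (JSONL)
restored = json.loads(line)
```

One JSON object per RIR keeps the format self-describing, and readers that query fields through a defaulting accessor handle metadata-free RIRs for free.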
Hi,
How does one decode with the models trained using train_transformer_ce.py? Is it possible to provide a decoding recipe or point to resources that can be used to build the recipe?
I am trying to install pykaldi2 from source and encountered the following error:
ERROR: torchvision 0.3.0 has requirement torch>=1.1.0, but you'll have torch 1.0.0 which is incompatible.
I chose 1.0.0 because of the following:
Line 3 in 5e988e5
Line 68 in 5e988e5
Are there any specific reasons to choose 1.0.0?
When I run decode.sh, I hit this problem: no module named 'kaldi.base._kaldi_error'.
Is the kaldi linked in pykaldi2 pykaldi/kaldi?
Hi, I'm implementing your paper "A Transformer with Interleaved Self-Attention and Convolution for Hybrid Acoustic Models", and I've got some questions bothering me.
In the Kaldi setup, sequence training needs a phone-level language model and a denominator FST, which is called HCP in this blog post (https://desh2608.github.io/2019-05-21-chain/). In your code, I find that the script "train_transformer_se.py" needs a directory that contains HCLG.fst.
Is the HCLG.fst needed here equivalent to the HCP built from a phone-level LM?
Hi,
I wonder whether it is appropriate to add the CE regularizer gradient (grad_xent) directly to grad in chain-model training,
as implemented as grad.add_mat(chain_opts.xent_regularize, grad_xent).
In Kaldi's chain-model recipes, e.g. aishell s5, the network architecture has two branches after layer tdnn6: one for the chain model (the "output" layer) and one for CE (the "output-xent" layer).
The derivative matrix grad is applied to "output", while grad_xent is applied to "output-xent".
If grad_xent is merged into grad, there is no prefinal-xent --> output-xent branch at all.
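For reference, the two-branch topology described above can be sketched in PyTorch as follows; layer sizes and names are illustrative, not the pykaldi2 or Kaldi model:

```python
import torch
import torch.nn as nn

class TwoHeadChainModel(nn.Module):
    """Sketch of the Kaldi chain recipe topology: a shared trunk (tdnn6 in
    the aishell recipe) feeding two separate branches, one for the chain
    output and one for the cross-entropy regularizer."""
    def __init__(self, in_dim=40, hidden=256, num_pdfs=100):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.prefinal_chain = nn.Linear(hidden, hidden)
        self.output = nn.Linear(hidden, num_pdfs)       # chain branch
        self.prefinal_xent = nn.Linear(hidden, hidden)
        self.output_xent = nn.Linear(hidden, num_pdfs)  # CE branch
    def forward(self, x):
        h = self.trunk(x)
        chain_out = self.output(torch.relu(self.prefinal_chain(h)))
        xent_out = self.output_xent(torch.relu(self.prefinal_xent(h)))
        return chain_out, xent_out

model = TwoHeadChainModel()
chain_out, xent_out = model(torch.randn(8, 40))
```

With separate heads, each gradient flows through its own prefinal layer and they only mix in the shared trunk, which is the point of the question: merging grad_xent into grad instead applies both to the single "output" branch.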
Hi,
Why is grad_input multiplied by -1.0 in the LF_MMI loss class?
grad_input *= -1.0
Code Reference:
Any explanation for this?
Thank you
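One plausible explanation (my sketch, not the authors' documented rationale): Kaldi's chain computation returns an objective to be maximized together with its derivative, while PyTorch optimizers minimize a loss, so the loss is the negated objective and the externally computed gradient must be flipped the same way:

```python
import torch

class NegatedObjective(torch.autograd.Function):
    """Wraps an 'objective' whose gradient is computed externally (as Kaldi's
    chain code does) so that minimizing the loss maximizes the objective."""
    @staticmethod
    def forward(ctx, x):
        objective = x.sum()                        # stand-in for the chain objective
        ctx.save_for_backward(torch.ones_like(x))  # externally computed d(obj)/dx
        return -objective                          # loss = -objective
    @staticmethod
    def backward(ctx, grad_output):
        (grad_input,) = ctx.saved_tensors
        grad_input = grad_input.clone()
        grad_input *= -1.0                         # same flip as in the LF_MMI loss class
        return grad_output * grad_input

x = torch.randn(4, requires_grad=True)
loss = NegatedObjective.apply(x)
loss.backward()
```

After backward, x.grad equals minus the objective's derivative, so a gradient-descent step on the loss is a gradient-ascent step on the objective.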