
jzlianglu / pykaldi2

173 stars · 13 watchers · 33 forks · 241 KB

Yet another speech toolkit based on Kaldi and PyTorch

License: MIT

Python 88.70% Dockerfile 1.10% Shell 6.64% Perl 3.55%
kaldi speech-toolkit horovod pykaldi pytorch

pykaldi2's People

Contributors

bliunlpr · jzlianglu · serhiy-shekhovtsov · singaxiong


pykaldi2's Issues

Physical data format

Currently, we use a Zip file for storage, which is actually quite convenient. But to integrate better with the community, we should also develop an HDF5-based storage format that stores both the data and its metadata in the same file.
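
A minimal sketch of what such a container could look like with h5py; the dataset and attribute names here are illustrative assumptions, not a settled schema:

import h5py
import numpy as np

def write_utterance(f, utt_id, waveform, sample_rate, transcript):
    # One dataset per utterance; metadata rides along as HDF5 attributes.
    ds = f.create_dataset(utt_id, data=waveform.astype(np.float32))
    ds.attrs["sample_rate"] = sample_rate
    ds.attrs["transcript"] = transcript

with h5py.File("corpus.h5", "w") as f:
    write_utterance(f, "utt0001", np.zeros(16000), 16000, "hello world")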

SETraining is extremely slow

Hi, I have completed the GMM stage and executed the CE training stage of a transformer based on a tri4b system. But when I try SE training on a Tesla P100 it is extremely slow, as the screenshot below shows. Is this normal?
[screenshot: training log]

Targeted simulation

In the ideal case, we should monitor the statistics of the simulated data so that they match those of the target-domain data, e.g. the SNR distribution, the direct-to-reverberation ratio distribution, etc. We should have a mechanism to dynamically monitor the simulated data and adjust it to match the target; a sketch of one piece of this idea follows.
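
A hypothetical sketch: sampling simulation SNRs from a target-domain histogram so the simulated distribution tracks the target (all names and numbers below are illustrative):

import numpy as np

def sample_snr(target_hist, bin_edges, rng=None):
    # Draw a histogram bin proportional to the target distribution,
    # then sample an SNR uniformly within that bin.
    rng = rng or np.random.default_rng()
    probs = target_hist / target_hist.sum()
    b = rng.choice(len(probs), p=probs)
    return rng.uniform(bin_edges[b], bin_edges[b + 1])

# Example: target domain concentrated around 5-15 dB SNR.
hist = np.array([1.0, 4.0, 4.0, 1.0])
edges = np.array([0.0, 5.0, 10.0, 15.0, 20.0])
snr_db = sample_snr(hist, edges)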

Selective data simulation

Currently, we have a single configuration value called "simulation_prob" that controls what percentage of the source speech receives data simulation. We need finer control over selective data simulation.

Specifically, we need to estimate the quality of the speech signal and apply simulation only to the high-quality utterances. As a first step, we can use SNR estimation: estimate it offline and store it as one type of metadata. The simulation module should then use this information to decide whether to apply simulation.
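
A hypothetical sketch of such a gate, assuming the estimated SNR is already stored in each utterance's metadata dict (field names are illustrative):

import random

def should_simulate(meta, snr_threshold_db=20.0, rng=random):
    # meta: per-utterance metadata, e.g. {"estimated_snr_db": 25.0}
    snr = meta.get("estimated_snr_db")
    if snr is None or snr < snr_threshold_db:
        return False  # skip low-quality or unmeasured utterances
    return rng.random() < meta.get("simulation_prob", 1.0)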

minibatches for LFMMI

Hi,
I have a few suggestions on using LF-MMI.
I noticed you are looping over the batch, creating the supervision, and calculating the criterion for each utterance.
You can use the MergeSupervision function I added to PyKaldi to create the supervision for the whole batch and run the criterion only once.

Here is my collate function for the data loader:

import torch
from collections import abc as container_abcs

import kaldi
from kaldi import chain

def supervision_collate(batch):
    """A collate function for using chain Supervision objects with a DataLoader."""
    elem = batch[0]
    if isinstance(elem, container_abcs.Sequence):
        # A batch of (features, supervision, ...) tuples: collate each field.
        transposed = zip(*batch)
        return [supervision_collate(samples) for samples in transposed]
    elif isinstance(elem, kaldi.chain.Supervision):
        if len(batch) == 1:
            return batch[0]
        # Merge the per-utterance supervisions into one batch supervision.
        return kaldi.chain.merge_supervision(batch)
    elif elem is None:
        return batch
    return torch.utils.data.dataloader.default_collate(batch)
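
For reference, this collate function plugs into a DataLoader like any other (my_chain_dataset is a placeholder name):

loader = torch.utils.data.DataLoader(my_chain_dataset,
                                     batch_size=16,
                                     collate_fn=supervision_collate)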

To add MergeSupervision I made a pull request to PyKaldi (pykaldi/pykaldi#182), but you can use my fork, which already has the change (https://github.com/yotam319/pykaldi).

Also, using phone_ali gives a small, single-path supervision; you should consider using lattices and phone_lattice_to_proto_supervision instead of alignment_to_proto_supervision.

And finally, you can save your supervisions as bytes and read them back again.
Here are the functions I used for doing this:

import kaldi
from kaldi import chain
from kaldi.fstext import StdVectorFst

def supervision_to_bytes(supervision):
    # Serialize a Supervision to its binary representation.
    out_s = kaldi.base.io.stringstream()
    supervision.write(out_s, True)
    return out_s.to_bytes()

def supervision_from_supervision_bytes(supervision_bytes):
    # Deserialize a Supervision from its binary representation.
    in_s = kaldi.base.io.stringstream.from_str(supervision_bytes)
    supervision = kaldi.chain.Supervision()
    supervision.read(in_s, True)
    return supervision

def split_supervision(supervision, start, duration):
    # Cut a frame range out of a supervision and clean up its FST.
    sup_cut = kaldi.chain.SupervisionSplitter(supervision).get_frame_range(start, duration)
    sup_cut.fst = StdVectorFst(sup_cut.fst).rmepsilon()
    return sup_cut

def ali_phone_to_supervision_bytes(phones_durs,
                                   opt, ctx_dep, trans_model):
    """
    input:
    phones_durs: list of (phone, duration) tuples
    opt: kaldi.chain.SupervisionOptions object
    ctx_dep: from kaldi.alignment.Aligner.read_tree("exp/chain/<ref_model>/tree")
    trans_model: from kaldi.alignment.Aligner.read_model("exp/chain/<ref_model>/0.trans_mdl")

    returns: byte representation of the supervision
    """
    p_supervision = chain.alignment_to_proto_supervision_with_phones_durs(opt, phones_durs)
    supervision = chain.proto_supervision_to_supervision(ctx_dep, trans_model, p_supervision,
                                                         opt.convert_to_pdfs)
    return supervision_to_bytes(supervision)

def lat_to_supervision_bytes(lat, phone_lat_mdl, phone_lat_opts,
                             supervision_opts, ctx_dep, trans_model):
    """
    input:
    lat: lattice
    phone_lat_mdl: final.mdl from the lattice folder
    phone_lat_opts: PhoneAlignLatticeOptions object
    supervision_opts: kaldi.chain.SupervisionOptions object
    ctx_dep: from kaldi.alignment.Aligner.read_tree("exp/chain/<ref_model>/tree")
    trans_model: from kaldi.alignment.Aligner.read_model("exp/chain/<ref_model>/0.trans_mdl")

    returns: byte representation of the supervision
    """
    (suc, phone_lat) = kaldi.lat.align.phone_align_lattice(lat, phone_lat_mdl, phone_lat_opts)
    assert suc
    phone_lat.topsort()
    p_supervision = chain.phone_lattice_to_proto_supervision(supervision_opts, phone_lat)
    supervision = chain.proto_supervision_to_supervision(ctx_dep, trans_model, p_supervision,
                                                         supervision_opts.convert_to_pdfs)
    return supervision_to_bytes(supervision)
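
These pair up as a round trip; for example, assuming sup is an existing kaldi.chain.Supervision:

sup_bytes = supervision_to_bytes(sup)                      # serialize once, store anywhere
sup_again = supervision_from_supervision_bytes(sup_bytes)  # restore later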

hope this helps :)

Convolve vs. BlockConvolve for RIR Augmentation

Hi, I have been using your Simulator functionality and found it quite useful. However, the augmented data I'm obtaining from it has far more reverb than I expect. I'm still diagnosing the problem, but is there any reason why this repo uses the equivalent of

FFTbasedConvolveSignals
https://github.com/kaldi-asr/kaldi/blob/master/src/feat/signal.cc#L50

as opposed to

FFTbasedBlockConvolveSignals
https://github.com/kaldi-asr/kaldi/blob/master/src/feat/signal.cc#L77

Kaldi applies reverb using the second one: https://github.com/kaldi-asr/kaldi/blob/master/src/featbin/wav-reverberate.cc#L96. Thanks!
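
As a rough illustration of the difference (a sketch with scipy, not the repo's or Kaldi's actual code): plain FFT convolution delays the output by the RIR's direct-path offset and appends a tail, while shifting by the RIR peak, as wav-reverberate effectively does, keeps the direct sound aligned with the dry signal:

import numpy as np
from scipy.signal import fftconvolve

def reverberate(dry, rir, align_to_peak=True):
    wet = fftconvolve(dry, rir)
    if align_to_peak:
        peak = int(np.argmax(np.abs(rir)))   # direct-path index
        wet = wet[peak:peak + len(dry)]      # undo the direct-path delay
    else:
        wet = wet[:len(dry)]                 # naive truncation
    return wet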

Why is the chain loss computation so slow?

When I trained the model with ChainObjtiveFunction, I found that the chain loss computation is very slow. For example, the data-loading time is 0.2 s and the model forward time is 0.2 s, but the loss computation takes 8.2 s. Why is the chain loss computation so slow, and how can it be accelerated? Thanks!

RIR format

Currently, the code assumes a very specific metadata format for the RIRs. We will need to define a standard format that is flexible and easy to use.

By flexible, we mean it should support a variable number of metadata fields per RIR. For example, some RIRs come with information about room size, source-to-sensor distance, azimuth angle, reverberation time, etc., while others have no metadata at all. We need to support both cases.

We also need to define how the RIR waveforms are stored. One option is to store multiple multi-channel RIRs, from different source positions to the same sensor position(s), in one file, so that some can be used for speech sources and others for directional noise sources. One possible shape for such a record is sketched below.
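
A per-RIR record sketched as a Python dict with optional fields (all field names are suggestions, not an implemented schema):

rir_record = {
    "rir_id": "room1_pos3",
    "file": "rirs/room1.wav",      # multi-channel wav holding the RIR(s)
    "channel_offset": 0,           # where this RIR starts in the file
    "num_channels": 2,
    # Optional descriptors; absent for corpora without measurements:
    "room_size_m": [5.0, 4.0, 3.0],
    "source_distance_m": 1.5,
    "azimuth_deg": 30.0,
    "t60_s": 0.45,
}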

Decoding with train_transformer_ce.py

Hi,

How does one decode with the models trained using train_transformer_ce.py? Is it possible to provide a decoding recipe or point to resources that can be used to build the recipe?

Problems about HCLG.fst needed by "train_transformer_se.py"

Hi, I'm implementing your paper "A TRANSFORMER WITH INTERLEAVED SELF-ATTENTION AND CONVOLUTION FOR HYBRID ACOUSTIC MODELS", and I've got some questions bothering me.
In the Kaldi setup, sequence training needs to create a phone-level language model and a denominator FST, which is called HCP in this blog post (https://desh2608.github.io/2019-05-21-chain/). In your code, I find that the script "train_transformer_se.py" needs a directory that contains HCLG.fst.
Is the HCLG.fst needed here the same as the HCP built from a phone-level LM?

How to use CE regularizer in chain-model training?

Hi,
I wonder whether it is appropriate to add the CE regularizer (grad_xent) directly to grad in chain-model training,
as implemented by: grad.add_mat(chain_opts.xent_regularize, grad_xent).

In Kaldi's chain-model recipes, e.g. aishell s5, the network architecture has two branches after layer tdnn6: one for the chain model (the "output" layer) and the other for CE (the "output-xent" layer).
The derivative matrix grad is applied to "output", while grad_xent is applied to "output-xent".
If grad_xent is merged into grad, there will be no prefinal-xent --> output-xent branch at all.

[image: network diagram showing the two output branches]
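
For comparison, a minimal PyTorch-style sketch of the two-branch head the Kaldi recipe uses; the module and layer names are illustrative, not pykaldi2's actual classes:

import torch.nn as nn

class TwoBranchHead(nn.Module):
    # Separate linear layers play the roles of "output" and "output-xent";
    # the chain loss drives one branch and the CE regularizer the other.
    def __init__(self, hidden_dim, num_pdfs):
        super().__init__()
        self.chain_out = nn.Linear(hidden_dim, num_pdfs)   # "output"
        self.xent_out = nn.Linear(hidden_dim, num_pdfs)    # "output-xent"

    def forward(self, h):
        return self.chain_out(h), self.xent_out(h)

# Total loss combines the two branches with the regularizer weight:
# loss = chain_loss(chain_logits) + xent_regularize * ce_loss(xent_logits)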
