
jzlianglu / pykaldi2

173 stars · 13 watchers · 33 forks · 241 KB

Yet another speech toolkit based on Kaldi and PyTorch

License: MIT

Python 88.70% Dockerfile 1.10% Shell 6.64% Perl 3.55%
kaldi speech-toolkit horovod pykaldi pytorch

pykaldi2's People

Contributors

bliunlpr · jzlianglu · serhiy-shekhovtsov · singaxiong


pykaldi2's Issues

Physical data format

Currently, we use a Zip file for storage, which is actually quite convenient. But to integrate better with the community, we should also develop an HDF5-based storage format that stores both the data and its metadata in the same file.
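
A minimal sketch of what such a container could look like with h5py; the dataset and attribute names here are illustrative assumptions, not a settled schema:

import h5py
import numpy as np

def write_utterance(f, utt_id, waveform, sample_rate, transcript):
    # One dataset per utterance; metadata rides along as HDF5 attributes.
    ds = f.create_dataset(utt_id, data=waveform.astype(np.float32))
    ds.attrs["sample_rate"] = sample_rate
    ds.attrs["transcript"] = transcript

with h5py.File("corpus.h5", "w") as f:
    write_utterance(f, "utt0001", np.zeros(16000), 16000, "hello world")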

SETraining is extremely slow

Hi, I have completed the GMM stage and executed the CE training stage of a transformer based on a tri4b system. But when I try SE training on a Tesla P100 it is extremely slow, as the screenshot below shows. Is this normal?
[screenshot: training log]

Targeted simulation

In the ideal case, we should monitor the statistics of the simulated data so that they match those of the target-domain data, e.g. the SNR distribution, the direct-to-reverberation ratio distribution, etc. We should have a mechanism to dynamically monitor the simulated data and adjust it to match the target; a sketch of one piece of this idea follows.
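
A hypothetical sketch: sampling simulation SNRs from a target-domain histogram so the simulated distribution tracks the target (all names and numbers below are illustrative):

import numpy as np

def sample_snr(target_hist, bin_edges, rng=None):
    # Draw a histogram bin proportional to the target distribution,
    # then sample an SNR uniformly within that bin.
    rng = rng or np.random.default_rng()
    probs = target_hist / target_hist.sum()
    b = rng.choice(len(probs), p=probs)
    return rng.uniform(bin_edges[b], bin_edges[b + 1])

# Example: target domain concentrated around 5-15 dB SNR.
hist = np.array([1.0, 4.0, 4.0, 1.0])
edges = np.array([0.0, 5.0, 10.0, 15.0, 20.0])
snr_db = sample_snr(hist, edges)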

Selective data simulation

Currently, we have a single configuration value called "simulation_prob" that controls what percentage of the source speech receives data simulation. We need finer control over selective data simulation.

Specifically, we need to estimate the quality of the speech signal and apply simulation only to the high-quality utterances. As a first step, we can use SNR estimation: estimate it offline and store it as one type of metadata. The simulation module should then use this information to decide whether to apply simulation.
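
A hypothetical sketch of such a gate, assuming the estimated SNR is already stored in each utterance's metadata dict (field names are illustrative):

import random

def should_simulate(meta, snr_threshold_db=20.0, rng=random):
    # meta: per-utterance metadata, e.g. {"estimated_snr_db": 25.0}
    snr = meta.get("estimated_snr_db")
    if snr is None or snr < snr_threshold_db:
        return False  # skip low-quality or unmeasured utterances
    return rng.random() < meta.get("simulation_prob", 1.0)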

minibatches for LFMMI

Hi,
I have a few suggestions on using LF-MMI.
I noticed you are looping over the batch, creating the supervision, and calculating the criterion for each utterance.
You can use the MergeSupervision function I added to PyKaldi to create the supervision for the whole batch and run the criterion only once.

Here is my collate function for the data loader:

import torch
from collections import abc as container_abcs

import kaldi
from kaldi import chain

def supervision_collate(batch):
    """A collate function for using chain Supervision objects with a DataLoader."""
    elem = batch[0]
    if isinstance(elem, container_abcs.Sequence):
        # A batch of (features, supervision, ...) tuples: collate each field.
        transposed = zip(*batch)
        return [supervision_collate(samples) for samples in transposed]
    elif isinstance(elem, kaldi.chain.Supervision):
        if len(batch) == 1:
            return batch[0]
        # Merge the per-utterance supervisions into one batch supervision.
        return kaldi.chain.merge_supervision(batch)
    elif elem is None:
        return batch
    return torch.utils.data.dataloader.default_collate(batch)
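
For reference, this collate function plugs into a DataLoader like any other (my_chain_dataset is a placeholder name):

loader = torch.utils.data.DataLoader(my_chain_dataset,
                                     batch_size=16,
                                     collate_fn=supervision_collate)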

To add MergeSupervision I made a pull request to PyKaldi (pykaldi/pykaldi#182), but you can use my fork, which already has the change (https://github.com/yotam319/pykaldi).

Also, using phone_ali gives a small, single-path supervision; you should consider using lattices and phone_lattice_to_proto_supervision instead of alignment_to_proto_supervision.

And finally, you can save your supervisions as bytes and read them back again.
Here are the functions I used for doing this:

import kaldi
from kaldi import chain
from kaldi.fstext import StdVectorFst

def supervision_to_bytes(supervision):
    # Serialize a Supervision to its binary representation.
    out_s = kaldi.base.io.stringstream()
    supervision.write(out_s, True)
    return out_s.to_bytes()

def supervision_from_supervision_bytes(supervision_bytes):
    # Deserialize a Supervision from its binary representation.
    in_s = kaldi.base.io.stringstream.from_str(supervision_bytes)
    supervision = kaldi.chain.Supervision()
    supervision.read(in_s, True)
    return supervision

def split_supervision(supervision, start, duration):
    # Cut a frame range out of a supervision and clean up its FST.
    sup_cut = kaldi.chain.SupervisionSplitter(supervision).get_frame_range(start, duration)
    sup_cut.fst = StdVectorFst(sup_cut.fst).rmepsilon()
    return sup_cut

def ali_phone_to_supervision_bytes(phones_durs,
                                   opt, ctx_dep, trans_model):
    """
    input:
    phones_durs: list of (phone, duration) tuples
    opt: kaldi.chain.SupervisionOptions object
    ctx_dep: from kaldi.alignment.Aligner.read_tree("exp/chain/<ref_model>/tree")
    trans_model: from kaldi.alignment.Aligner.read_model("exp/chain/<ref_model>/0.trans_mdl")

    returns: byte representation of the supervision
    """
    p_supervision = chain.alignment_to_proto_supervision_with_phones_durs(opt, phones_durs)
    supervision = chain.proto_supervision_to_supervision(ctx_dep, trans_model, p_supervision,
                                                         opt.convert_to_pdfs)
    return supervision_to_bytes(supervision)

def lat_to_supervision_bytes(lat, phone_lat_mdl, phone_lat_opts,
                             supervision_opts, ctx_dep, trans_model):
    """
    input:
    lat: lattice
    phone_lat_mdl: final.mdl from the lattice folder
    phone_lat_opts: PhoneAlignLatticeOptions object
    supervision_opts: kaldi.chain.SupervisionOptions object
    ctx_dep: from kaldi.alignment.Aligner.read_tree("exp/chain/<ref_model>/tree")
    trans_model: from kaldi.alignment.Aligner.read_model("exp/chain/<ref_model>/0.trans_mdl")

    returns: byte representation of the supervision
    """
    (suc, phone_lat) = kaldi.lat.align.phone_align_lattice(lat, phone_lat_mdl, phone_lat_opts)
    assert suc
    phone_lat.topsort()
    p_supervision = chain.phone_lattice_to_proto_supervision(supervision_opts, phone_lat)
    supervision = chain.proto_supervision_to_supervision(ctx_dep, trans_model, p_supervision,
                                                         supervision_opts.convert_to_pdfs)
    return supervision_to_bytes(supervision)
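
These pair up as a round trip; for example, assuming sup is an existing kaldi.chain.Supervision:

sup_bytes = supervision_to_bytes(sup)                      # serialize once, store anywhere
sup_again = supervision_from_supervision_bytes(sup_bytes)  # restore later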

hope this helps :)

Convolve vs. BlockConvolve for RIR Augmentation

Hi, I have been using your Simulator functionality and found it quite useful. However, the augmented data I'm obtaining from it has far more reverb than I expect. I'm still diagnosing the problem, but is there any reason why this repo uses the equivalent of

FFTbasedConvolveSignals
https://github.com/kaldi-asr/kaldi/blob/master/src/feat/signal.cc#L50

as opposed to

FFTbasedBlockConvolveSignals
https://github.com/kaldi-asr/kaldi/blob/master/src/feat/signal.cc#L77

Kaldi applies reverb using the second one: https://github.com/kaldi-asr/kaldi/blob/master/src/featbin/wav-reverberate.cc#L96. Thanks!
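
As a rough illustration of the difference (a sketch with scipy, not the repo's or Kaldi's actual code): plain FFT convolution delays the output by the RIR's direct-path offset and appends a tail, while shifting by the RIR peak, as wav-reverberate effectively does, keeps the direct sound aligned with the dry signal:

import numpy as np
from scipy.signal import fftconvolve

def reverberate(dry, rir, align_to_peak=True):
    wet = fftconvolve(dry, rir)
    if align_to_peak:
        peak = int(np.argmax(np.abs(rir)))   # direct-path index
        wet = wet[peak:peak + len(dry)]      # undo the direct-path delay
    else:
        wet = wet[:len(dry)]                 # naive truncation
    return wet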

Why is the chain loss computation so slow?

When I trained the model with ChainObjtiveFunction, I found that the chain loss computation is very slow. For example, the data-loading time is 0.2 s and the model forward time is 0.2 s, but the loss computation takes 8.2 s. Why is the chain loss computation so slow, and how can it be accelerated? Thanks!

RIR format

Currently, the code assumes a very specific metadata format for the RIRs. We will need to define a standard format that is flexible and easy to use.

By flexible, we mean it should support a variable number of metadata fields per RIR. For example, some RIRs come with information about room size, source-to-sensor distance, azimuth angle, reverberation time, etc., while others have no metadata at all. We need to support both cases.

We also need to define how the RIR waveforms are stored. One option is to store multiple multi-channel RIRs, from different source positions to the same sensor position(s), in one file, so that some can be used for speech sources and others for directional noise sources. One possible shape for such a record is sketched below.
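
A per-RIR record sketched as a Python dict with optional fields (all field names are suggestions, not an implemented schema):

rir_record = {
    "rir_id": "room1_pos3",
    "file": "rirs/room1.wav",      # multi-channel wav holding the RIR(s)
    "channel_offset": 0,           # where this RIR starts in the file
    "num_channels": 2,
    # Optional descriptors; absent for corpora without measurements:
    "room_size_m": [5.0, 4.0, 3.0],
    "source_distance_m": 1.5,
    "azimuth_deg": 30.0,
    "t60_s": 0.45,
}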

Decoding with train_transformer_ce.py

Hi,

How does one decode with the models trained using train_transformer_ce.py? Is it possible to provide a decoding recipe or point to resources that can be used to build the recipe?

Problems about HCLG.fst needed by "train_transformer_se.py"

Hi, I'm implementing your paper "A TRANSFORMER WITH INTERLEAVED SELF-ATTENTION AND CONVOLUTION FOR HYBRID ACOUSTIC MODELS", and I've got some questions bothering me.
In the Kaldi setup, sequence training needs to create a phone-level language model and a denominator FST, which is called HCP in this blog post (https://desh2608.github.io/2019-05-21-chain/). In your code, I find that the script "train_transformer_se.py" needs a directory that contains HCLG.fst.
Is the HCLG.fst needed here the same as the HCP built from a phone-level LM?

How to use CE regularizer in chain-model training?

Hi,
I wonder whether it is appropriate to add the CE regularizer (grad_xent) directly to grad in chain-model training,
as implemented by: grad.add_mat(chain_opts.xent_regularize, grad_xent).

In Kaldi's chain-model recipes, e.g. aishell s5, the network architecture has two branches after layer tdnn6: one for the chain model (the "output" layer) and the other for CE (the "output-xent" layer).
The derivative matrix grad is applied to "output", while grad_xent is applied to "output-xent".
If grad_xent is merged into grad, there will be no prefinal-xent --> output-xent branch at all.

[image: network diagram showing the two output branches]
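
For comparison, a minimal PyTorch-style sketch of the two-branch head the Kaldi recipe uses; the module and layer names are illustrative, not pykaldi2's actual classes:

import torch.nn as nn

class TwoBranchHead(nn.Module):
    # Separate linear layers play the roles of "output" and "output-xent";
    # the chain loss drives one branch and the CE regularizer the other.
    def __init__(self, hidden_dim, num_pdfs):
        super().__init__()
        self.chain_out = nn.Linear(hidden_dim, num_pdfs)   # "output"
        self.xent_out = nn.Linear(hidden_dim, num_pdfs)    # "output-xent"

    def forward(self, h):
        return self.chain_out(h), self.xent_out(h)

# Total loss combines the two branches with the regularizer weight:
# loss = chain_loss(chain_logits) + xent_regularize * ce_loss(xent_logits)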
