
samarth0898 / speechtranscriptionsystem


Seq2Seq attention-based speech transcription system using pyramidal Bi-LSTMs.

Jupyter Notebook 100.00%
attention-mechanism pytorch reccurent-neural-network speech-to-text

speechtranscriptionsystem's Introduction

Sequence2Sequence Speech Transcription System

This repo contains a sequence-to-sequence speech transcription system built from an encoder, a decoder, and an attention mechanism. The letters of the English alphabet are learned from the input mel-frequency cepstral coefficients (MFCCs). (This description still needs to be completed.)

Prototyping Attention Mechanism

Attention Map

Naive learning of the input-output correspondence using a key, query, value (K, Q, V) attention mechanism between the encoder and the decoder.
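
As a rough illustration of the K, Q, V idea, the sketch below computes one decoder step of dot-product attention over the encoder outputs in PyTorch. The function name `dot_product_attention`, the tensor shapes, and the length-based masking are assumptions made for this example; the notebook in this repo may project, score, or mask differently.

```python
import torch
import torch.nn.functional as F


def dot_product_attention(query, keys, values, lengths):
    """One decoder step of attention over the encoder outputs (hypothetical helper).

    query:   (batch, d)      decoder state projected to the key dimension
    keys:    (batch, T, d)   projected encoder outputs
    values:  (batch, T, dv)  projected encoder outputs
    lengths: (batch,)        number of valid encoder frames per utterance
    """
    # Similarity between the query and every encoder time step: (batch, T)
    energy = torch.bmm(keys, query.unsqueeze(2)).squeeze(2)

    # Padded frames get -inf energy, so softmax gives them zero weight.
    T = keys.size(1)
    mask = torch.arange(T).unsqueeze(0) >= lengths.unsqueeze(1)
    energy = energy.masked_fill(mask, float("-inf"))

    weights = F.softmax(energy, dim=1)                            # (batch, T)
    context = torch.bmm(weights.unsqueeze(1), values).squeeze(1)  # (batch, dv)
    return context, weights


torch.manual_seed(0)
batch, T, d, dv = 2, 5, 8, 6
context, weights = dot_product_attention(
    torch.randn(batch, d),
    torch.randn(batch, T, d),
    torch.randn(batch, T, dv),
    torch.tensor([5, 3]),  # the second utterance has only 3 valid frames
)
print(context.shape, weights.shape)
```

Collecting `weights` over all decoder steps gives a (decoder steps x encoder frames) matrix, which is what an attention map plots.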


Objectives of this work

  • Set up an encoder, decoder, and attention-based system to build the correspondence between input MFCCs and phonemes
  • Synchrony and rate principles in the encoder setup
  • Bidirectional LSTM cell / Pyramidal LSTM cell

Training regimes

  • Teacher forcing mechanism
  • Pack-padding of variable-length data
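
Both training regimes can be sketched in a few lines of PyTorch. The example below is a minimal sketch under assumed sizes (three random utterances, 40 features per frame, hidden size 16). The helper `next_decoder_input` and its `teacher_forcing_rate` parameter are hypothetical names for this illustration, not taken from the notebook.

```python
import random

import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence, pad_sequence

# Three utterances of different lengths, 40 features per frame (made-up data).
utterances = [torch.randn(n, 40) for n in (7, 4, 9)]
lengths = torch.tensor([u.size(0) for u in utterances])

# Pad to a common length, then pack so the LSTM skips the padded frames.
padded = pad_sequence(utterances, batch_first=True)  # (3, 9, 40)
packed = pack_padded_sequence(padded, lengths, batch_first=True, enforce_sorted=False)

lstm = nn.LSTM(input_size=40, hidden_size=16, batch_first=True, bidirectional=True)
packed_out, _ = lstm(packed)

# Unpack back to a padded tensor; positions past each length are zeros.
outputs, out_lengths = pad_packed_sequence(packed_out, batch_first=True)  # (3, 9, 32)


def next_decoder_input(ground_truth_token, predicted_token, teacher_forcing_rate):
    """Teacher forcing: feed the reference token with the given probability,
    otherwise feed the model's own previous prediction."""
    if random.random() < teacher_forcing_rate:
        return ground_truth_token
    return predicted_token
```

A common choice is to start with a high teacher forcing rate and lower it as training progresses, so the decoder gradually learns to recover from its own mistakes.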

Description of the input

40-dimensional log-mel filter bank features were computed every 10 ms and used as the acoustic inputs to the listener (the encoder).

The encoder is a pyramidal Bi-LSTM, which reduces the input frame rate to match the speech transcription rate; the ratio between the two is about 8:1. This model is significantly influenced by the LAS paper: Chan, William, et al. "Listen, attend and spell." arXiv preprint arXiv:1508.01211 (2015).
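
The 8:1 figure follows from stacking three pyramidal layers, each of which halves the sequence length (2 x 2 x 2 = 8). A small arithmetic sketch, assuming that an odd trailing frame is dropped at each layer (padding it is an equally valid choice) and using the hypothetical helper name `encoder_output_length`:

```python
def encoder_output_length(num_frames, num_pyramid_layers=3):
    """Number of time steps left after each pyramidal layer halves the sequence.

    Assumption: an odd trailing frame is dropped rather than padded.
    """
    length = num_frames
    for _ in range(num_pyramid_layers):
        length //= 2
    return length


frames_per_second = 1000 // 10      # one feature frame every 10 ms
num_frames = 4 * frames_per_second  # 4 seconds of audio -> 400 frames
print(encoder_output_length(num_frames))  # 400 -> 200 -> 100 -> 50
```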

[REF: B. Raj, Deep Learning, Carnegie Mellon University] The pBLSTM is a variant of the Bi-LSTM that downsamples sequences by a factor of 2 by concatenating adjacent pairs of inputs before running a conventional Bi-LSTM on the reduced-length sequence. So, given an input vector sequence X0, X1, X2, X3, ..., XN−1, the pBLSTM first concatenates adjacent pairs of vectors as [X0, X1], [X2, X3], ..., [XN−2, XN−1], and then computes a regular Bi-LSTM on the reshaped input.
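
The pairing step described above amounts to a single reshape on a batch-first tensor. This is a minimal sketch: the function name `concat_adjacent_pairs` is made up for illustration, and dropping the last frame when the length is odd is an assumption the description does not cover.

```python
import torch


def concat_adjacent_pairs(x):
    """(batch, T, feat) -> (batch, T // 2, 2 * feat) by joining frames 2i and 2i + 1."""
    batch, T, feat = x.shape
    x = x[:, : T - (T % 2), :]  # assumption: drop the trailing frame if T is odd
    return x.reshape(batch, T // 2, 2 * feat)


# Two sequences of 5 frames with 3 features each, filled with 0, 1, 2, ...
x = torch.arange(2 * 5 * 3, dtype=torch.float32).reshape(2, 5, 3)
y = concat_adjacent_pairs(x)
print(y.shape)   # batch of 2, 2 paired frames, 6 features
print(y[0, 0])   # frame X0 followed by frame X1 of the first sequence
```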

  • Initial Bi-LSTM
  • 3x Pyramidal Bi-LSTM
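
The layer stack listed above could be assembled roughly as follows. This is a sketch under assumptions: the class name `PyramidalEncoder`, the hidden size of 32, and the omission of pack-padding are simplifications for illustration and may differ from the notebook.

```python
import torch
import torch.nn as nn


class PyramidalEncoder(nn.Module):
    """Initial Bi-LSTM followed by pyramidal Bi-LSTM layers (illustrative sketch)."""

    def __init__(self, input_dim=40, hidden_dim=32, num_pyramid_layers=3):
        super().__init__()
        self.initial = nn.LSTM(input_dim, hidden_dim, batch_first=True, bidirectional=True)
        # Each pyramidal layer receives two concatenated bidirectional outputs,
        # so its input size is 2 * (2 * hidden_dim).
        self.pyramid = nn.ModuleList(
            [
                nn.LSTM(4 * hidden_dim, hidden_dim, batch_first=True, bidirectional=True)
                for _ in range(num_pyramid_layers)
            ]
        )

    def forward(self, x):
        out, _ = self.initial(x)
        for layer in self.pyramid:
            batch, T, feat = out.shape
            # Concatenate adjacent frames; an odd trailing frame is dropped (assumption).
            out = out[:, : T - (T % 2), :].reshape(batch, T // 2, 2 * feat)
            out, _ = layer(out)
        return out


encoder = PyramidalEncoder()
features = torch.randn(2, 64, 40)  # 2 utterances, 64 frames, 40 features (made-up data)
encoded = encoder(features)
print(encoded.shape)  # 64 frames reduced by a factor of 8
```

With three pyramidal layers, the 64 input frames come out as 8 encoder steps, each of width 2 * hidden_dim.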
