This project forked from georgesterpu/avsr-tf1

Audio-Visual Speech Recognition using Sequence to Sequence Models

License: GNU General Public License v3.0

Python 91.58% HTML 6.15% CSS 0.36% JavaScript 1.91%

AVSR-tf1

Audio-Visual Speech Recognition (AVSR) research system using sequence-to-sequence neural networks based on TensorFlow 1.13

About

AVSR-tf1 is an open-source research system for Speech Recognition.

Written entirely in Python, AVSR-tf1 aims to provide a simple and reproducible way of training and evaluating speech recognition models based on sequence to sequence neural networks. AVSR-tf1 can exploit both auditory and visual speech modalities, considered either independently (ASR, VSR) or jointly (AVSR).

Rather than providing dense documentation to users and contributors, the AVSR-tf1 code is designed (or strives) to be intuitive and self-explanatory, encouraging researchers and developers to understand the entire codebase and propose improvements at its lowest levels. Hence we want it to be a flexible research system rather than a black box for production.

Core functionalities

1. Extract acoustic features from audio files (librosa, TensorFlow)

  • log mel-scale spectrograms, MFCC
  • optional computation of first and second derivatives
  • optional strided frame stacking
  • write into TensorFlow-compatible format (TFRecord dataset)
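As an illustration of the strided frame stacking option, here is a minimal pure-Python sketch. It is not the repository's implementation: the function name `stack_frames`, its parameters, and the choice to drop trailing frames that do not fill a complete window are all assumptions made for this example.

```python
def stack_frames(frames, stack_size, stride):
    """Concatenate `stack_size` consecutive frames, advancing `stride` frames per step.

    `frames` is a list of equal-length feature vectors (one per time step).
    Hypothetical helper for illustration; windows that would run past the
    end of the input are dropped (an assumption, padding is another option).
    """
    if stack_size < 1 or stride < 1:
        raise ValueError("stack_size and stride must be positive")
    stacked = []
    for start in range(0, len(frames) - stack_size + 1, stride):
        window = frames[start:start + stack_size]
        # Flatten the window into a single, longer feature vector.
        stacked.append([value for frame in window for value in frame])
    return stacked


# Six toy frames with two features each (stand-ins for log mel filterbank values).
frames = [[float(t), float(t) + 0.5] for t in range(6)]
stacked = stack_frames(frames, stack_size=3, stride=2)
print(len(stacked), len(stacked[0]))
```

Stacking with a stride greater than one shortens the input sequence while widening each feature vector, which reduces the number of time steps the encoder has to process.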

2. Extract the lip region from video files (OpenFace - Tadas Baltrusaitis)

  • write into TensorFlow-compatible format (TFRecord dataset)

3. Train sequence to sequence neural networks for continuous speech recognition

  • audio-only (LAS [3])
  • visual-only (lip-reading [5])
  • audio-visual fusion
    • dual-attention decoder (WLAS [4])
    • attention-based alignment (AV-Align [6, 7])
  • flexible language units (phonemes, visemes, characters etc.)
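To give a rough intuition for attention-based audio-visual fusion, the toy sketch below lets each audio state attend over all video states and concatenates the resulting visual context to the audio state. This is only an illustration under stated assumptions, not the AV-Align code in this repository: the helper names `attend` and `fuse`, the scaled dot-product scoring, and fusion by concatenation are choices made here for brevity.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, memory):
    """Return (context, weights) for one query vector over a list of memory vectors.

    Scoring is a scaled dot product (an assumption for this sketch); the
    query and memory vectors must have the same dimensionality.
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in memory]
    weights = softmax(scores)
    context = [
        sum(w * key[d] for w, key in zip(weights, memory))
        for d in range(len(query))
    ]
    return context, weights


def fuse(audio_states, video_states):
    """Concatenate each audio state with its attention context over the video states."""
    fused = []
    for audio_state in audio_states:
        context, _ = attend(audio_state, video_states)
        fused.append(audio_state + context)  # list concatenation
    return fused


# Toy encoder outputs: 3 audio steps and 2 video steps, 2 features each.
audio_states = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
video_states = [[1.0, 0.0], [0.0, 1.0]]
fused = fuse(audio_states, video_states)
print(len(fused), len(fused[0]))
```

Because the attention weights are computed per audio step, the two streams do not need to share a frame rate: the fused sequence keeps the length of the audio sequence.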

4. Evaluate models

  • normalised Levenshtein distances
    • Character Error Rate
    • Word Error Rate
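The two metrics above can be sketched in a few lines of pure Python. This is a minimal illustration rather than the repository's evaluation code: the function names are hypothetical, and normalising by the reference length and tokenising words on whitespace are assumptions.

```python
def levenshtein(ref, hyp):
    """Edit distance between two sequences (insertions, deletions, substitutions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(ref_units, hyp_units):
    """Levenshtein distance normalised by the reference length (assumed convention)."""
    if not ref_units:
        raise ValueError("reference must not be empty")
    return levenshtein(ref_units, hyp_units) / len(ref_units)


def character_error_rate(reference, hypothesis):
    return error_rate(list(reference), list(hypothesis))


def word_error_rate(reference, hypothesis):
    return error_rate(reference.split(), hypothesis.split())


reference = "the cat sat"
hypothesis = "the cat sit"
print(character_error_rate(reference, hypothesis))
print(word_error_rate(reference, hypothesis))
```

A single wrong character costs one edit out of eleven reference characters at the character level, but one edit out of three reference words at the word level, which is why the Word Error Rate is typically the higher of the two.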

Getting started

A typical workflow is as follows:

  1. convert data into .tfrecord files
  2. train/evaluate models

Please refer to the attached examples for running audio-only, visual-only, or audio-visual speech recognition experiments.

To prepare the data, you can use the two scripts extract_faces.py and write_records_tcd.py.

Dependencies

For visual/audio-visual experiments, please compile and install OpenFace from source.

The other dependencies are popular and easy to install Python packages, so feel free to use your preferred sources.

The supported TensorFlow version for this repository is 1.13.1, and the recommended install source is: pip install tensorflow_gpu==1.13.1.

Please get in touch in case you face any issues.

Acknowledgements

We are grateful to Eugene Brevdo of Google for his remarkable help and advice during the early stages of development. In addition, we would like to thank Derek Murray, Andreas Steiner, and Khe Chai Sim for their assistance and interesting conversations, and also every TensorFlow contributor on GitHub and StackOverflow. Our work is supported by NVIDIA, which granted us a Titan Xp GPU through its academic program.

How to cite

If you use this work, please cite it as:

George Sterpu, Christian Saam, Naomi Harte. How to Teach DNNs to Pay Attention to the Visual Modality in Speech Recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2020. https://doi.org/10.1109/TASLP.2020.2980436

[bib]

@ARTICLE{Sterpu2020,
  author={G. {Sterpu} and C. {Saam} and N. {Harte}},
  journal={IEEE/ACM Transactions on Audio, Speech, and Language Processing},
  title={How to Teach DNNs to Pay Attention to the Visual Modality in Speech Recognition},
  year={2020},
  volume={},
  number={},
 pages={1-1},
}

[pdf]

or

George Sterpu, Christian Saam, and Naomi Harte. 2018. Attention-based Audio-Visual Fusion for Robust Automatic Speech Recognition. In 2018 International Conference on Multimodal Interaction (ICMI ’18), October 16–20, 2018, Boulder, CO, USA. ACM, New York, NY, USA, 5 pages. https://doi.org/10.1145/3242969.3243014

[bib]

@inproceedings{sterpu_icmi18,
  author = {George Sterpu and Christian Saam and Naomi Harte},
  title = {Attention-based Audio-Visual Fusion for Robust Automatic Speech Recognition},
  year = {2018},
  publisher = {{ACM, New York, NY, USA}},
  booktitle = {2018 International Conference on Multimodal Interaction (ICMI '18), October 16--20, 2018, Boulder, CO, USA},
  url       = {http://doi.acm.org/10.1145/3242969.3243014},
  doi       = {10.1145/3242969.3243014},
}

[pdf]

How to contribute

We are delighted to receive your feedback and help on improving AVSR-tf1. On the technical side, this could be advice or a pull request for code refactoring (we are not Python/TensorFlow experts), implementations of popular features, bug reports, performance improvements, language models, or support for computation in 16-bit precision or on Google TPU devices.

References

[1] Sequence to Sequence Learning with Neural Networks https://arxiv.org/abs/1409.3215

[2] Neural Machine Translation by Jointly Learning to Align and Translate https://arxiv.org/abs/1409.0473

[3] Listen, Attend and Spell https://arxiv.org/abs/1508.01211

[4] Lip Reading Sentences in the Wild https://arxiv.org/abs/1611.05358

[5] Can DNNs Learn to Lipread Full Sentences? https://arxiv.org/abs/1805.11685

[6] Attention-based Audio-Visual Fusion for Robust Automatic Speech Recognition https://arxiv.org/abs/1809.01728

[7] How to Teach DNNs to Pay Attention to the Visual Modality in Speech Recognition https://ieeexplore.ieee.org/document/9035650

Contributors

georgesterpu, saamc
