Coder Social home page Coder Social logo

pansori's Introduction

Pansori

Pansori is a program for creating an automatic speech recognition (ASR) corpus from online videos with audio and subtitle data.

Overview

alt text

It consists of 4 pipeline stages as shown in the diagram above: ingest, align, transform and validate.

Ingest

Online video contents consist of multiple media streams for different screen resolutions and audio-only playback; hand-transcribed subtitle information can also be retrieved if available. Pansori downloads the audio and subtitle streams from online videos as mp4 and srt files, respectively.

Align

The subtitles contain segmented text and timing information which corresponds to the audio contents of the associated video. With the timing information, it is possible to segment the audio stream to make a matching pair of audio and text fragments for an ASR corpus.

However, inaccuracies can be introduced to the segmented contents because the timing information might be determined not only by audio contents but also by scene changes in the video. In addition, they can also arise from unintentional slicing of audio stream at word boundaries in fast speeches and when substantial ambient noise such as applause is present. To fix these inaccuracies, we used finetuneas, a GUI tool to help find correct alignment between audio and text. We are currently moving to a fully automated forced alignment approach in order to further simplify this stage.

Transform

The aligned audio stream and subtitle data are then processed with the following transformations specific to data types:

  • Audio stream: segmentation, lossless compression
  • Subtitle data: normalization, punctuation removal, removal of non-speech text (such as the description of audience response or ambient noise)

Validate

Although the audio stream and subtitle data are force-aligned with each other, there are also inherent discrepancies between the two. This can come from one or more of the following: inaccurate transcriptions, ambiguous pronunciations, and non-ideal audio conditions (like ambient noise or poor recording quality). To increase the quality of the corpus, the corpus needs to be refined by filtering out inaccurate audio and subtitle pairs.

Previous approaches relied on custom ASR models for corpus validation and refinement; however, they are not easily created for many languages, especially for those without existing corpora. In Pansori, we used a new approach through a cloud-based ASR; we chose the Google Cloud Speech-to-Text API since it provides the highest quality ASR services in more than 120 languages. Cloud services make the development of corpus generation much faster and easier since we can just set up the cloud service rather than create custom ASR engines with acoustic and language models in different languages.


The program can be modified for use in videos subtitled in any language available in the Google API.

Installation

Clone repository:

$ git clone https://github.com/yc9701/pansori

Install pytube, a library for downloading YouTube videos.

$ pip install pytube

Install pysubs2, a library for editing subtitle files. *Currently, pysubs2 runs only with Python 3.6; on Python 3.7, this library does not work

$ pip install pysubs2

Install pydub, a library for manipulating audio. *Only necessary if wishing for audio playback when validating audio

$ pip install pydub

The Google Cloud Speech API is also required for validate.py (an account is required).

pansori's People

Contributors

yc9701 avatar

Watchers

James Cloos avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.