
viseme-to-video



This Python module creates video from viseme images and TTS audio output. I created this for testing the sync accuracy between synthesised audio and duration predictions extracted from FastSpeech2 hidden states.

Demo videos: mouth1_with_audio_long_2000.mp4, mouth1_with_audio_123_2000.mp4



Running viseme-to-video

To use this module, first install the dependencies by running:

pip install -r requirements.txt

The tool can then be run directly from the command line with:

python viseme_to_video.py


Repo contents

This repo contains the following resources:

image/

Two image sets:
-- speaker1/ from the Oculus developer doc 'Viseme reference'
-- mouth1/ adapted from the icSpeech guide 'Mouth positions for English pronunciation'

A different viseme image directory can be specified on the command line using the flag --im_dir.
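For example, to render video using the speaker1/ image set (the exact argument form is an assumption based on the flag name above):

python viseme_to_video.py --im_dir image/speaker1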

metadata/

24.json: A viseme metadata JSON file we produced during FastSpeech2 inference by:

  • extracting the phoneme sequence produced by the text normalisation frontend module
  • mapping this to a sequence of visemes
  • extracting hidden-state duration predictions (in frames) from FastSpeech2
  • converting durations from frames to milliseconds (see the sketch below)
  • writing this information (phoneme, viseme, duration, offset) to the metadata JSON file
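As a rough sketch of the frame-to-millisecond conversion and the resulting metadata entries, assuming typical mel-spectrogram settings (the hop length and sample rate below are illustrative placeholders, not necessarily the values used for this model):

    # Convert FastSpeech2 duration predictions (in mel frames) to milliseconds.
    # hop_length and sample_rate are assumed values; substitute those used by
    # your feature extraction / vocoder configuration.
    hop_length = 256      # samples per mel frame (assumed)
    sample_rate = 22050   # audio sample rate in Hz (assumed)

    def frames_to_ms(n_frames):
        return n_frames * hop_length / sample_rate * 1000.0

    # Each metadata entry then records something like (structure is illustrative):
    # {"phoneme": "DH", "viseme": "T", "duration": frames_to_ms(4), "offset": 0.0}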

The tool will automatically generate video for all JSON metadata files stored in the metadata/ folder.

map/

viseme_map.json: A JSON file containing mappings between the visemes in the viseme metadata files and the image filenames. Mapping visemes was necessary because the viseme set used to generate our metadata files contains upper/lower-case distinctions, which case-insensitive file systems don't support. (I.e. you can't have two files named 't.jpeg' and 'T.jpeg' stored in the same folder.)

A different mapping file can be specified on the command line using the flag --map.
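For illustration, the map can be loaded and applied like this (the example keys and filenames are hypothetical, not the actual contents of viseme_map.json):

    import json

    # Map case-sensitive viseme labels to image filenames that are safe on
    # case-insensitive file systems. Keys/values shown here are hypothetical.
    with open("map/viseme_map.json") as f:
        viseme_map = json.load(f)  # e.g. {"t": "t_lower.jpeg", "T": "t_upper.jpeg"}

    image_filename = viseme_map["T"]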

audio/

24.wav - An audio sample generated from FastSpeech2 (using kan-bayashi's ESPnet framework). This sample uses a Harvard sentence as text input (list 3, sentence 5: 'The beauty of the view stunned the young boy').

Audio can be toggled on/off with the argument --no_audio.
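Putting the options together, a silent render with a non-default image set and mapping file might look like this (argument forms are assumptions based on the flags named above):

python viseme_to_video.py --im_dir image/mouth1 --map map/viseme_map.json --no_audio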


