Speech2Face

This project implements a framework to convert speech to facial features, as described in the CVPR 2019 paper Speech2Face: Learning the Face Behind a Voice by the MIT CSAIL group.

A detailed report on the results is available as results/report.pdf. This was built as the final project for the CS 753 (Automatic Speech Recognition) course in Autumn 2019 at the Indian Institute of Technology (IIT) Bombay, India.

Usage

Folder structure of the project

The audio/video data and the code for this project are arranged in the structure below to avoid any duplication.

.
├── base.py
├── LICENSE
├── logs
│   └── ......
├── model.py
├── models
│   └── final.h5
├── preprocess
│   ├── avspeech_test.csv
│   ├── avspeech_train.csv
│   ├── clean_directory.sh
│   ├── data
│   │   ├── audios/
│   │   ├── audio_spectrograms/
│   │   ├── cropped_frames/
│   │   ├── frames/
│   │   ├── pretrained_model
│   │   │   ├── 20180402-114759
│   │   │   │   └── ......
│   │   │   └── 20180402-114759.zip
│   │   ├── speaker_video_embeddings/
│   │   └── videos/
│   ├── data_download.py
│   ├── facenet
│   ├── prepare_directory.sh
│   ├── speaker.py
│   └── video_generator.py
├── README.md
├── requirements.txt
└── results
    ├── ......
    ├── presentation.pdf
    └── report.pdf

Getting Started

These instructions will get you a copy of the project up and running on your local machine for development and testing purposes.

  1. Go to the preprocess folder and run prepare_directory.sh, then download the AVSpeech dataset. Run data_download.py to download the data from YouTube based on the AVSpeech dataset.
cd preprocess/
sh prepare_directory.sh

Download the AVSpeech dataset files (avspeech_train.csv and avspeech_test.csv) into the preprocess folder.

python3 data_download.py
usage: data_download.py [-h] [--from_id FROM_ID] [--to_id TO_ID]
                        [--low_memory LOW_MEMORY] [--sample_rate SAMPLE_RATE]
                        [--duration DURATION] [--fps FPS] [--mono MONO]
                        [--window WINDOW] [--stride STRIDE]
                        [--fft_length FFT_LENGTH] [--amp_norm AMP_NORM]
                        [--face_extraction_model FACE_EXTRACTION_MODEL]
                        [--verbose]
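
The --window, --stride, --fft_length and --amp_norm flags control how each downloaded clip is turned into a spectrogram. The exact preprocessing is implemented in data_download.py; the snippet below is only a minimal sketch of that kind of pipeline, assuming librosa is installed and using illustrative parameter values rather than the repository's defaults.

import numpy as np
import librosa

def audio_to_spectrogram(path, sample_rate=16000, duration=6.0,
                         window=0.025, stride=0.01, fft_length=512,
                         amp_norm=0.3):
    # Load the clip as mono audio at the requested sample rate.
    y, _ = librosa.load(path, sr=sample_rate, mono=True, duration=duration)
    # Short-time Fourier transform with the given window/stride (in seconds).
    stft = librosa.stft(y, n_fft=fft_length,
                        win_length=int(window * sample_rate),
                        hop_length=int(stride * sample_rate))
    # Power-law compression is one common form of amplitude normalisation;
    # whether amp_norm means exactly this here is an assumption.
    return np.abs(stft) ** amp_norm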
  2. To train, run base.py with the --train option.
python3 base.py
usage: base.py [-h] [--from_id FROM_ID] [--to_id TO_ID] [--epochs EPOCHS]
               [--start_epoch START_EPOCH] [--batchsize BATCHSIZE]
               [--num_gpu NUM_GPU] [--num_samples NUM_SAMPLES]
               [--load_model LOAD_MODEL] [--save_model SAVE_MODEL] [--train]
               [--verbose]
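
base.py trains a voice encoder (defined in model.py) that maps the audio spectrograms to the face embeddings produced by the pretrained model under preprocess/data/pretrained_model. The block below is not the repository's network; it is only an illustrative Keras sketch of that mapping, with hypothetical layer sizes, input shape and loss.

import tensorflow as tf

def build_voice_encoder(spec_shape=(257, 601, 1), embedding_dim=512):
    # Toy CNN: spectrogram in, face-embedding-sized vector out.
    inputs = tf.keras.Input(shape=spec_shape)
    x = inputs
    for filters in (32, 64, 128, 256):
        x = tf.keras.layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
        x = tf.keras.layers.MaxPooling2D(2)(x)
    x = tf.keras.layers.GlobalAveragePooling2D()(x)
    x = tf.keras.layers.Dense(512, activation="relu")(x)
    outputs = tf.keras.layers.Dense(embedding_dim)(x)
    model = tf.keras.Model(inputs, outputs)
    # Regress onto the speaker's precomputed face embedding.
    model.compile(optimizer="adam", loss="mse")
    return model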
  3. To run the code without training, download the final model final.h5 and place it in the models folder.
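
Assuming final.h5 is a standard Keras HDF5 checkpoint (if model.py defines custom layers or losses they would have to be passed via custom_objects), inference then looks roughly like this:

import numpy as np
import tensorflow as tf

model = tf.keras.models.load_model("models/final.h5", compile=False)
# Dummy input batch; the real spectrogram shape depends on the
# preprocessing parameters used in data_download.py.
spectrogram = np.zeros((1,) + model.input_shape[1:], dtype=np.float32)
face_embedding = model.predict(spectrogram)
print(face_embedding.shape)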

Results

We used face retrieval performance as the evaluation metric and were able to achieve decent accuracy. More computational power and training on the complete dataset should improve accuracy further.
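
Face retrieval here means ranking the true face embeddings by similarity to the embedding predicted from speech and checking whether the correct speaker appears among the closest matches. The repository's evaluation code is not reproduced here; the helper below is a hypothetical sketch of top-k retrieval accuracy computed with cosine similarity.

import numpy as np

def retrieval_accuracy(predicted, gallery, k=5):
    # Row i of `predicted` (speech-based embeddings) is assumed to
    # correspond to row i of `gallery` (true face embeddings).
    p = predicted / np.linalg.norm(predicted, axis=1, keepdims=True)
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    sims = p @ g.T                           # cosine similarities
    topk = np.argsort(-sims, axis=1)[:, :k]  # indices of the k best matches
    hits = (topk == np.arange(len(p))[:, None]).any(axis=1)
    return hits.mean()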

More training details, useful for reproducing these results, can be found in presentation.pdf.

Future Work

  1. Implementation of the Face Decoder model, which takes the face features predicted by the Speech2Face model as input and produces an image of the face in a canonical form (frontal-facing, with a neutral expression).
  2. The pretrained Face Decoder model used in the paper was not available; that model is based on another CVPR paper (Synthesizing Normalized Faces from Facial Identity Features).
  3. We tried implementing the model, but it requires a large amount of data to train properly, and our results were not recognizable as human faces.
  4. As the main focus of this project was the speech domain, we plan to complete this vision task in the future.

Authors

License

This project is licensed under the MIT License - see the LICENSE file for details.

References

  1. Speech2Face: Learning the Face Behind a Voice (https://arxiv.org/pdf/1905.09773.pdf)
  2. Wav2Pix: Speech-conditioned face generation using generative adversarial networks (https://arxiv.org/pdf/1903.10195.pdf)
  3. AVSpeech Dataset (https://looking-to-listen.github.io/avspeech/download.html)
