
voice-vector's Introduction

Text-independent voice vectors

Subtitle: which of the Hollywood stars is most similar to my voice?

Prologue

Everyone has their own voice. No two people have exactly the same voice, but some voices are more alike than others. This project aims to find individual voice vectors using the VoxCeleb dataset, which contains 145,379 utterances from 1,251 Hollywood stars. The voice vectors are text-independent, meaning that any pair of utterances from the same speaker has similar voice vectors regardless of what is said. In addition, the smaller the distance between two vectors, the more similar the voices are.
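As a rough illustration of the "closer vectors mean more similar voices" idea, the sketch below ranks a few reference vectors by their similarity to a query vector. The speaker names, the 3-dimensional vectors, and the choice of cosine similarity are all assumptions made for illustration; the project's actual distance metric and embedding size are not stated here.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(query, references):
    """Return (name, similarity) pairs sorted from most to least similar."""
    scored = [(name, cosine_similarity(query, vec))
              for name, vec in references.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical voice vectors; real ones would come from the trained model.
star_vectors = {
    "star_a": [0.9, 0.1, 0.0],
    "star_b": [0.1, 0.8, 0.3],
    "star_c": [0.0, 0.2, 0.9],
}
my_voice = [0.8, 0.2, 0.1]

ranking = most_similar(my_voice, star_vectors)
print(ranking[0][0])  # name of the closest reference voice
```

With real embeddings, `star_vectors` would hold one vector per speaker (for example, an average over that speaker's utterances, which is itself an assumption about how one might aggregate them).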

Architectures

The architecture is based on a classification model: each input utterance is classified as one of the Hollywood stars. The objective function is simply the cross entropy between the ground-truth speaker labels and the predictions. After training, the last layer's activation serves as the speaker's embedding.

The model architecture is structured as follows.

  1. memory cell
    • the CBHG module from Tacotron captures hidden features from sequential data.
  2. embedding
    • the memory cell's last output is projected to the size of the embedding vector.
  3. softmax
    • the embedding is used as the logits for each class.
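Steps 2 and 3, together with the cross-entropy objective, can be sketched in plain Python as below. This is a minimal sketch that takes the list above literally: the memory cell is replaced by a fixed stand-in vector, the projection weights are made up, and the embedding has one dimension per speaker because it is used directly as logits. None of these values come from the project's code.

```python
import math


def project(features, weights, bias):
    """Linear projection: one output value per row of `weights`."""
    return [sum(w * x for w, x in zip(row, features)) + b
            for row, b in zip(weights, bias)]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, label):
    """Cross entropy against a one-hot ground-truth speaker label."""
    return -math.log(probs[label])


# Stand-in for the memory cell's last output (assumed to have 4 features).
last_output = [0.5, -1.0, 2.0, 0.0]

# Hypothetical projection to a 3-dimensional embedding (3 speakers).
weights = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
]
bias = [0.0, 0.0, 0.0]

embedding = project(last_output, weights, bias)  # used directly as logits
probs = softmax(embedding)
loss = cross_entropy(probs, label=2)             # ground-truth speaker index 2
predicted = probs.index(max(probs))              # index of the predicted speaker
```

In training, the loss would be minimized over many utterances; at inference time only `embedding` is kept as the voice vector.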

Training

  • VoxCeleb dataset used.
    • 1,251 Hollywood stars' 145,379 utterances
    • gender dist.: 690 males and 561 females
    • age dist.: 136, 351, 318, 210, and 236 for 20s, 30s, 40s, 50s, and over 60s respectively.
  • text-independent
    • at each step, a speaker is selected at random.
    • for each speaker, an utterance is randomly selected and cropped, so that the input does not depend on the text.
  • loss and train accuracy
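The text-independent sampling described in the list above (random speaker, random utterance, random crop) might look roughly like the following. The function name, the toy data layout, and the handling of short utterances are assumptions for illustration and are not taken from the project's data loader.

```python
import random


def sample_training_example(utterances_by_speaker, crop_length, rng=random):
    """Pick a random speaker, one of their utterances, and a random crop of it."""
    speaker = rng.choice(sorted(utterances_by_speaker))
    utterance = rng.choice(utterances_by_speaker[speaker])
    if len(utterance) <= crop_length:
        # Too short to crop; padding to a fixed length is not shown here.
        return speaker, list(utterance)
    start = rng.randrange(len(utterance) - crop_length + 1)
    return speaker, utterance[start:start + crop_length]


# Hypothetical toy data: each utterance is a list of audio samples.
data = {
    "speaker_a": [list(range(0, 100)), list(range(100, 250))],
    "speaker_b": [list(range(1000, 1080))],
}

rng = random.Random(0)  # seeded only to make the sketch reproducible
speaker, crop = sample_training_example(data, crop_length=50, rng=rng)
```

Because the crop position changes on every draw, the model sees different words from the same speaker under the same label, which is what pushes the embedding away from depending on the text.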

Embedding

  • Common Voice dataset used for inference.
    • hundreds of thousands of English utterances from numerous voice contributors around the world.
  • evaluation accuracy

  • embedding visualization using t-SNE
    • voices cluster well by gender, even though no gender labels were used in training.
    • no clear tendency with respect to age was found.

How to run?

Requirements

  • python 2.7
  • tensorflow >= 1.1
  • numpy >= 1.11.1
  • librosa == 0.5.1
  • tensorpack == 0.8.0

Settings

  • configurations are set in two YAML files.
  • hparams/default.yaml includes default settings for signal processing, model, training, evaluation and embedding.
  • hparams/hparams.yaml is for customizing the default settings in each experiment case.
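One way to picture how the two files combine is a recursive overlay of the per-case settings on top of the defaults. The sketch below works on already-parsed dictionaries; the setting names (`sr`, `n_mels`, `batch_size`, `lr`), the idea that hparams.yaml is keyed by case name, and the merge rule itself are assumptions for illustration, not the project's confirmed behavior.

```python
import copy


def merge_settings(defaults, overrides):
    """Recursively overlay `overrides` on a copy of `defaults`."""
    merged = copy.deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_settings(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Hypothetical contents of hparams/default.yaml after parsing.
default_settings = {
    "signal": {"sr": 16000, "n_mels": 80},
    "train": {"batch_size": 32, "lr": 0.001},
}

# Hypothetical per-case section of hparams/hparams.yaml, keyed by case name.
case_settings = {
    "some_case_name": {"train": {"batch_size": 64}},
}

hparams = merge_settings(default_settings,
                         case_settings.get("some_case_name", {}))
```

Under this reading, a case only needs to list the values that differ from the defaults; everything else is inherited unchanged.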

Runnable python files

  • train.py for training.
    • run python train.py some_case_name
    • remote mode: uses the additional cores of a remote server to load and enqueue data more quickly.
      • run python train.py some_case_name -remote -port=1234 on the local server.
      • run python remote_dataflow.py some_case_name -dest_url=tcp://local-server-host:1234 -num_thread=12 on the remote server.
  • eval.py for evaluation.
    • run python eval.py some_case_name
  • embedding.py for inference and getting embedding vectors.
    • run python embedding.py some_case_name

Visualizations

  • Tensorboard
    • Scalars tab: loss, train accuracy, and eval accuracy.
    • Audio tab: audio samples of input speakers (wav) and predicted speakers (wav_pred)
    • Text tab: prediction text in the following form: 'input-speaker-name (meta) -> predicted-speaker-name (meta)'
      • ex. sample-022653 (('female', 'fifties', 'england')) -> Max_Schneider (('M', '26', 'USA'))
  • t-SNE output file
    • outputs/embedding-[some_case_name].png

Future work

  • One-shot learning with triplet loss.


voice-vector's Issues

Can't get a dependency to work

In train.py and eval.py, the line from data_load import DataLoader, AudioMeta seems to cause the following error: ModuleNotFoundError: No module named 'tensorpack.dataflow.prefetch'.

I've verified that tensorpack is installed but can't seem to get past this. I've tried this on multiple computers and get the same error each time. Has anyone encountered this?

Dataset files confusion

Which data path should be set in the path settings?
There are three variables: speaker_id, data_path, and meta_path. Please specify which path each of them should hold.
I have the VoxCeleb dataset and its metadata.

Using your model on another project

Hello,

I have a speech recognition project (written in tensorflow), and I would like to know how I could use your trained model inside my project. My initial idea is to use the layer before the last one of your model as a feature vector (that will work as a speaker identification).

Unfortunately, I am not very familiar with tensorpack, and I am having trouble with a simple task such as: (1) load your trained model in tensorflow; (2) remove the last layer; and (3) use it as a feature extractor. Is it possible to do that, or does your model need tensorpack to work?

how to get embeddings from a speaker

I want to extract embeddings from an audio file, because I want to measure voice similarity.
I'm confused by the README. Are there any simple commands for getting embeddings?

I found this in the README:

  • train.py for training.
    • run python train.py some_case_name

In the above case, what is some_case_name?

I would also appreciate some help with extracting embeddings using embedding.py.
