
Voxseg

Voxseg is a Python package for voice activity detection (VAD), i.e. speech/non-speech audio segmentation. It provides a full VAD pipeline, including a pretrained VAD model, and is based on the work presented here.

Use of this VAD may be cited as follows:

@inproceedings{cnnbilstm_vad,
    title = {A hybrid {CNN-BiLSTM} voice activity detector},
    author = {Wilkinson, N. and Niesler, T.},
    booktitle = {Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP)},
    year = {2021},
    address = {Toronto, Canada},
}

Installation

To install this package, clone the repository from GitHub to a directory of your choice and install using pip:

git clone https://github.com/NickWilkinson37/voxseg.git
pip install ./voxseg

In the future, installation directly from the package manager will be supported.

To test the installation run:

cd voxseg
python -m unittest

The test will run the full VAD pipeline on two example audio files. This pipeline includes feature extraction, VAD and evaluation. The output should include the following*:

  • A progress bar monitoring the feature extraction process, followed by a DataFrame containing normalized features and metadata.
  • A DataFrame containing model generated endpoints, indicating the starts and ends of discovered speech utterances.
  • A confusion matrix of speech vs non-speech, with the following values: TPR 0.935, FPR 0.137, FNR 0.065, TNR 0.863

*The order in which these outputs appear may vary.

Data Preparation

Before using the VAD, a number of files need to be created to specify the audio that one wishes to process. These files are the same as those used by the Kaldi toolkit. Extensive documentation on the data preparation process for Kaldi may be found here. Only the files required by the Voxseg toolkit are described here.

  1. wav.scp - this file provides the paths to the audio files one wishes to process, and assigns them a unique recording-id. It is structured as follows: <recording-id> <extended-filename>. Each entry should appear on a new line, for example:

    rec_000 wavs/some_raw_audio.wav
    rec_001 wavs/some_more_raw_audio.wav
    rec_002 wavs/yet_more_raw_audio.wav
    

    Note that the <extended-filename> may be an absolute path or relative path, except when using Docker or Singularity, where paths relative to the mount point must be used.

  2. segments - this file is optional and specifies segments within the audio file to be processed by the VAD (useful if one only wants to run the VAD on a subset of the full audio files). If this file is not present the full audio files will be processed. This file is structured as follows: <utterance-id> <recording-id> <segment-begin> <segment-end>, where <segment-begin> and <segment-end> are in seconds. Each entry should appear on a new line, for example:

    rec_000_part_1 rec_000 20.5 142.6
    rec_000_part_2 rec_000 362.1 421.0
    rec_001_part_1 rec_001 45.9 89.4
    rec_001_part_2 rec_001 97.7 130.0
    rec_001_part_3 rec_001 186.9 241.8
    rec_002_full rec_002 0.0 350.0
    

These two files should be placed in the same directory, usually named data, although any name may be used. This directory is provided as input to Voxseg's feature extraction.
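
The following is a minimal sketch of how a wav.scp could be generated automatically for a directory of WAV files. The wavs/ and data/ paths and the rec_XXX naming scheme are example choices, not something the toolkit requires:

# minimal sketch: build a wav.scp for all WAV files in a directory
# (the wavs/ and data/ paths and the rec_XXX ids are example choices)
from pathlib import Path

wav_dir = Path('wavs')
Path('data').mkdir(exist_ok=True)
with open('data/wav.scp', 'w') as f:
    for i, wav in enumerate(sorted(wav_dir.glob('*.wav'))):
        f.write(f'rec_{i:03d} {wav.resolve()}\n')

Writing absolute paths via resolve() keeps the file valid when it is read from a different working directory.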

Usage

The package may be used in a number of ways:

  1. The full VAD can be run with a single script.
  2. Smaller scripts may be called to run different parts of the pipeline separately, for example feature extraction followed by VAD. This is useful if one is tuning the parameters of the VAD and would like to avoid recomputing the features for every experiment.
  3. As a module within Python, useful if one would like to integrate parts of the system into one's own Python code.

Full VAD pipeline

This package may be used through a basic command-line interface. To run the full VAD pipeline with default settings, navigate to the voxseg directory and call:

# data_directory is Kaldi-style data directory and output_directory is destination for segments file 
python3 voxseg/main.py data_directory output_directory

To explore the available flags for changing settings navigate to the voxseg directory and call:

python3 voxseg/main.py -h

The most commonly used flags are:

  • -s: sets the speech vs non-speech decision threshold (accepts a float between 0 and 1, default is 0.5)
  • -f: adds median filtering to smooth the output (accepts an odd integer for the kernel size, default is 1)
  • -e: allows a reference directory to be given, against which the VAD output is scored (accepts a path to a Kaldi-style directory containing a ground truth segments file)
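
For example, the following invocation (argument values are illustrative) lowers the decision threshold, applies a median filter with a kernel size of 3, and scores the output against a reference directory:

# example values only: threshold 0.3, median filter kernel 3, scored against ground_truth_directory
python3 voxseg/main.py -s 0.3 -f 3 -e ground_truth_directory data_directory output_directory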

Individual scripts

To run the smaller, individual scripts, navigate to the voxseg directory and call:

# reads Kaldi-style data directory and extracts features to .h5 file in output directory
python3 voxseg/extract_feats.py data_directory output_directory
# runs the VAD and saves the output segments file in the output directory
python3 voxseg/run_cnnlstm.py -m model_path features_directory output_directory
# reads the Kaldi-style data directory used as VAD input, the VAD output directory and a directory
# containing a ground truth segments file as reference
python3 voxseg/evaluate.py vad_input_directory vad_out_directory ground_truth_directory

Module within Python

To import the module and use it within custom Python scripts/modules:

from voxseg import extract_feats, run_cnnlstm, utils
from tensorflow.keras import models

# feature extraction
data = extract_feats.prep_data('path/to/input/data') # prepares audio from Kaldi-style data directory
feats = extract_feats.extract(data) # extracts log-mel filterbank spectrogram features
normalized_feats = extract_feats.normalize(feats) # normalizes the features

# model execution
model = models.load_model('path/to/model.h5') # loads a pretrained VAD model
predicted_labels = run_cnnlstm.predict_labels(model, normalized_feats) # runs the VAD model on features
utils.save(predicted_labels, 'path/for/output/labels.h5') # saves predicted labels to .h5 file
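
To turn the predicted labels into speech segments, voxseg/main.py passes them to run_cnnlstm.decode together with the speech threshold, the speech-with-music threshold and the median filter kernel size. The sketch below mirrors that call; the threshold and kernel values are illustrative, and it assumes the DataFrame returned by predict_labels is the same object main.py passes to decode:

# sketch, following voxseg/main.py: convert the predicted labels to speech endpoints
# positional arguments after the labels are the speech threshold, the
# speech-with-music threshold and the median filter kernel size (illustrative values)
endpoints = run_cnnlstm.decode(predicted_labels, 0.5, 0.5, 1)
print(endpoints) # DataFrame of discovered speech utterances with their start and end times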

Training

A basic training script is provided in the file train.py in the root directory of the project.

To use this script, the following files are required in a Kaldi-style data directory:

  1. wav.scp - this file provides the paths to the audio files one wishes to use for training, and assigns them a unique recording-id. It is structured as follows: <recording-id> <extended-filename>. Each entry should appear on a new line, for example:

    rec_000 wavs/some_raw_audio.wav
    rec_001 wavs/some_more_raw_audio.wav
    

    Note that the <extended-filename> may be an absolute path or relative path, except when using Docker or Singularity, where paths relative to the mount point must be used.

  2. segments - this file specifies the start and end points of each labelled segment within the audio file. Note that this is different from the way this file is used when provided for decoding. This file is structured as follows: <utterance-id> <recording-id> <segment-begin> <segment-end>, where <segment-begin> and <segment-end> are in seconds. Each entry should appear on a new line, for example:

    rec_000_00 rec_000 0.0 4.3
    rec_000_01 rec_000 4.3 7.2
    rec_000_02 rec_000 7.2 14.8
    rec_000_03 rec_000 14.8 19.5
    rec_001_00 rec_001 0.0 8.5
    rec_001_01 rec_001 8.5 12.2
    rec_001_02 rec_001 12.2 16.1
    rec_001_03 rec_001 16.1 18.9
    rec_001_04 rec_001 18.9 22.0
    
  3. utt2spk - this file specifies the label attached to each segment defined within the segments file. This file is structured as follows: <utterance-id> <label>. Each entry should appear on a new line, for example:

    rec_000_00 speech
    rec_000_01 non_speech
    rec_000_02 speech
    rec_000_03 non_speech
    rec_001_00 non_speech
    rec_001_01 speech
    rec_001_02 non_speech
    rec_001_03 speech
    rec_001_04 non_speech
    

Note that the model may be trained with 2 classes ('speech', 'non_speech'), as shown in the example above, or with the 4 classes from the AVA-Speech dataset ('clean_speech', 'no_speech', 'speech_with_music', 'speech_with_noise'), as is the case for the default model used by the toolkit.
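
Before launching training, it can be worth verifying that the segments and utt2spk files agree. The sketch below is an optional check, not part of the toolkit; the data/ path is an example and the allowed label sets follow the note above:

# sketch: sanity-check a Kaldi-style training directory before running train.py
# (the data/ path is an example; the allowed label sets follow the note above)
with open('data/segments') as f:
    utt_ids = [line.split()[0] for line in f if line.strip()]
with open('data/utt2spk') as f:
    labels = dict(line.split() for line in f if line.strip())

allowed = [{'speech', 'non_speech'},
           {'clean_speech', 'no_speech', 'speech_with_music', 'speech_with_noise'}]

missing = [u for u in utt_ids if u not in labels]
assert not missing, f'utterances without labels: {missing}'
assert any(set(labels.values()) <= s for s in allowed), \
    f'unexpected labels: {set(labels.values())}'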

To use the training script with a specific validation set run:

# use -v to specify a Kaldi style data directory to be used as validation set
python3 train.py -v val_dir train_dir model_name out_dir

To use the training script with a percentage of the training data as a validation set run:

# use -s to specify a percentage of the training data to be used as a validation set
python3 train.py -s 0.1 train_dir model_name out_dir

The training script may also be used without either of these flags; however, this is not recommended, as it makes it difficult to tell whether the model is starting to overfit. When a validation set is provided, the model with the best validation accuracy is saved. When no validation set is provided, the model is saved after the final training epoch.
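
The resulting model can then be used for decoding via the -m flag of run_cnnlstm.py shown above. The filename below is only a placeholder, since the exact name of the model file that train.py writes to out_dir is not specified here:

# placeholder path: substitute the model file that train.py writes to out_dir
python3 voxseg/run_cnnlstm.py -m out_dir/model_name.h5 features_directory output_directory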

License

MIT


voxseg's Issues

2-label pre-trained model

I've run through the project and can run inference with the provided 4-label pre-trained model. Do you have a checkpoint for the 2-class speech/non-speech model? Great work, by the way.

CUDA out of memory

Hi,

I used a very large dataset and got a CUDA out of memory error.
What should I do to fix this error?

Regards,

Surasak

More than 4 classes in dataset

Hey,
This has worked great for me, but I was wondering whether there is a way to train on a different dataset (one that has 5 classes). I removed the if-else that doesn't permit datasets with lengths other than 2 or 4, but whenever I train, the validation accuracy is very low. Do you know what the advisable approach would be? Thanks.

Question

Hi, the project works well, but I can't find anything about model training. Could you share the model training code? Thanks a lot.

Prior issue

Hi Nick,

I just wanted to double-check with you whether the prior value is OK!

In the code we have the following:
prior = np.array([(1-speech_thresh) * speech_w_music_thresh, speech_thresh * speech_w_music_thresh, (1-speech_thresh) * (1-speech_w_music_thresh), (1-speech_thresh) * speech_w_music_thresh])

The first and last values seem to be the same. Should the last value be changed to (speech_thresh) * (1-speech_w_music_thresh)?

Thanks again for all the great work !

Léo

Question on the parameters speech_threshold and speech_w_music_threshold

Hi,

the repo is very easy to use and also works nicely for German speech! While playing around with your repo and after skimming through the paper, I was wondering about the functionality of the speech_threshold and speech_w_music_threshold parameters. It is clear to me that if you raise them, the activity detection becomes less sensitive for plain speech and for speech with music in the background, respectively. My question is how the thresholds are actually applied. I had a look into run_cnnlstm.py but I am not able to follow you there. This part especially is troublesome for me:

    prior = np.array([(1-speech_thresh) * speech_w_music_thresh,
                    speech_thresh * speech_w_music_thresh,
                    (1-speech_thresh) * (1-speech_w_music_thresh),
                    (1-speech_thresh) * speech_w_music_thresh])
    temp = pd.concat([_targets_to_endpoints(medfilt([0 if (j*prior).argmax() == 1 else 1 for j in i], filt), 0.32) \
                     for i in targets['predicted-targets']], ignore_index=True)

Thanks again for this very useful repo!

ValueError raised when audio file has no voice activity

First of all, thank you for all of your work. This package is proving to be very helpful.

I have come across what appears to be a bug. If I supply an audio file to Voxseg and no voice activity is identified, this ValueError is thrown:

------------------- Running VAD -------------------
2021-06-08 18:15:50.236547: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:116] None of the MLIR optimization passes are enabled (registered 2)
2021-06-08 18:15:50.236961: I tensorflow/core/platform/profile_utils/cpu_utils.cc:112] CPU Frequency: 2299965000 Hz
Traceback (most recent call last):
  File "voxseg/main.py", line 58, in <module>
    endpoints = run_cnnlstm.decode(targets, speech_thresh, speech_w_music_thresh, filt)
  File "../voxseg/env/lib/python3.8/site-packages/voxseg/run_cnnlstm.py", line 57, in decode
    ((targets['start'] * 100).astype(int)).astype(str).str.zfill(7) + '_' + \
  File "../voxseg/env/lib/python3.8/site-packages/pandas/core/generic.py", line 5874, in astype
    new_data = self._mgr.astype(dtype=dtype, copy=copy, errors=errors)
  File "../voxseg/env/lib/python3.8/site-packages/pandas/core/internals/managers.py", line 631, in astype
    return self.apply("astype", dtype=dtype, copy=copy, errors=errors)
  File "../voxseg/env/lib/python3.8/site-packages/pandas/core/internals/managers.py", line 427, in apply
    applied = getattr(b, f)(**kwargs)
  File "../voxseg/env/lib/python3.8/site-packages/pandas/core/internals/blocks.py", line 673, in astype
    values = astype_nansafe(vals1d, dtype, copy=True)
  File "../voxseg/env/lib/python3.8/site-packages/pandas/core/dtypes/cast.py", line 1074, in astype_nansafe
    return lib.astype_intsafe(arr.ravel(), dtype).reshape(arr.shape)
  File "pandas/_libs/lib.pyx", line 619, in pandas._libs.lib.astype_intsafe
ValueError: cannot convert float NaN to integer

I suspect that because no voice activity has been identified, no time points exist or they are NaN values (i.e. targets['start'] and targets['end']), causing the following code to fail:

From voxseg.run_cnnlstm.decode

    targets['utterance-id'] = targets['recording-id'].astype(str) + '_' + \
                        ((targets['start'] * 100).astype(int)).astype(str).str.zfill(7) + '_' + \
                        ((targets['end'] * 100).astype(int)).astype(str).str.zfill(7)

I have put together a workaround, but figured others will likely come across this bug at some point. I would also like to know whether this bug has a cause other than the lack of voice activity.

Many thanks!
