
Voxseg

Voxseg is a Python package for voice activity detection (VAD), i.e. speech/non-speech audio segmentation. It provides a full VAD pipeline, including a pretrained VAD model, and is based on the work presented here.

Use of this VAD may be cited as follows:

@inproceedings{cnnbilstm_vad,
    title = {A hybrid {CNN-BiLSTM} voice activity detector},
    author = {Wilkinson, N. and Niesler, T.},
    booktitle = {Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP)},
    year = {2021},
    address = {Toronto, Canada},
}

Installation

To install this package, clone the repository from GitHub to a directory of your choice and install using pip:

git clone https://github.com/NickWilkinson37/voxseg.git
pip install ./voxseg

In the future, installation directly from the package manager will be supported.

To test the installation run:

cd voxseg
python -m unittest

The test will run the full VAD pipeline on two example audio files. This pipeline includes feature extraction, VAD and evaluation. The output should include the following*:

  • A progress bar monitoring the feature extraction process, followed by a DataFrame containing normalized features and metadata.
  • A DataFrame containing model generated endpoints, indicating the starts and ends of discovered speech utterances.
  • A confusion matrix of speech vs non-speech, with the following values: TPR 0.935, FPR 0.137, FNR 0.065, TNR 0.863

*The order in which these outputs appear may vary.

Data Preparation

Before using the VAD, a number of files need to be created to specify the audio that one wishes to process. These files are the same as those used by the Kaldi toolkit. Extensive documentation on the data preparation process for Kaldi may be found here. Only the files required by the Voxseg toolkit are described here.

  1. wav.scp - this file provides the paths to the audio files one wishes to process, and assigns them a unique recording-id. It is structured as follows: <recording-id> <extended-filename>. Each entry should appear on a new line, for example:

    rec_000 wavs/some_raw_audio.wav
    rec_001 wavs/some_more_raw_audio.wav
    rec_002 wavs/yet_more_raw_audio.wav
    

    Note that the <extended-filename> may be an absolute path or relative path, except when using Docker or Singularity, where paths relative to the mount point must be used.

  2. segments - this file is optional and specifies segments within the audio file to be processed by the VAD (useful if one only wants to run the VAD on a subset of the full audio files). If this file is not present the full audio files will be processed. This file is structured as follows: <utterance-id> <recording-id> <segment-begin> <segment-end>, where <segment-begin> and <segment-end> are in seconds. Each entry should appear on a new line, for example:

    rec_000_part_1 rec_000 20.5 142.6
    rec_000_part_2 rec_000 362.1 421.0
    rec_001_part_1 rec_001 45.9 89.4
    rec_001_part_2 rec_001 97.7 130.0
    rec_001_part_3 rec_001 186.9 241.8
    rec_002_full rec_002 0.0 350.0
    

These two files should be placed in the same directory, usually named data, although any name may be used. This directory is provided as input to Voxseg's feature extraction.
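
The following is a minimal sketch of how a wav.scp could be generated automatically for a directory of WAV files. The wavs/ and data/ paths and the rec_XXX naming scheme are example choices, not something the toolkit requires:

# minimal sketch: build a wav.scp for all WAV files in a directory
# (the wavs/ and data/ paths and the rec_XXX ids are example choices)
from pathlib import Path

wav_dir = Path('wavs')
Path('data').mkdir(exist_ok=True)
with open('data/wav.scp', 'w') as f:
    for i, wav in enumerate(sorted(wav_dir.glob('*.wav'))):
        f.write(f'rec_{i:03d} {wav.resolve()}\n')

Writing absolute paths via resolve() keeps the file valid when it is read from a different working directory.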

Usage

The package may be used in a number of ways:

  1. The full VAD can be run with a single script.
  2. Smaller scripts may be called to run different parts of the pipeline separately, for example feature extraction followed by VAD. This is useful if one is tuning the parameters of the VAD and would like to avoid recomputing the features for every experiment.
  3. As a module within Python, useful if one would like to integrate parts of the system into one's own Python code.

Full VAD pipeline

This package may be used through a basic command-line interface. To run the full VAD pipeline with default settings, navigate to the voxseg directory and call:

# data_directory is Kaldi-style data directory and output_directory is destination for segments file 
python3 voxseg/main.py data_directory output_directory

To explore the available flags for changing settings navigate to the voxseg directory and call:

python3 voxseg/main.py -h

The most commonly used flags are:

  • -s: sets the speech vs non-speech decision threshold (accepts a float between 0 and 1, default is 0.5)
  • -f: adds median filtering to smooth the output (accepts an odd integer for the kernel size, default is 1)
  • -e: allows a reference directory to be given, against which the VAD output is scored (accepts a path to a Kaldi-style directory containing a ground truth segments file)
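
For example, the following invocation (argument values are illustrative) lowers the decision threshold, applies a median filter with a kernel size of 3, and scores the output against a reference directory:

# example values only: threshold 0.3, median filter kernel 3, scored against ground_truth_directory
python3 voxseg/main.py -s 0.3 -f 3 -e ground_truth_directory data_directory output_directory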

Individual scripts

To run the smaller, individual scripts, navigate to the voxseg directory and call:

# reads Kaldi-style data directory and extracts features to .h5 file in output directory
python3 voxseg/extract_feats.py data_directory output_directory
# runs the VAD and saves the output segments file in the output directory
python3 voxseg/run_cnnlstm.py -m model_path features_directory output_directory
# reads the Kaldi-style data directory used as VAD input, the VAD output directory and a directory
# containing a ground truth segments file as reference
python3 voxseg/evaluate.py vad_input_directory vad_out_directory ground_truth_directory

Module within Python

To import the module and use it within custom Python scripts/modules:

from voxseg import extract_feats, run_cnnlstm, utils
from tensorflow.keras import models

# feature extraction
data = extract_feats.prep_data('path/to/input/data') # prepares audio from Kaldi-style data directory
feats = extract_feats.extract(data) # extracts log-mel filterbank spectrogram features
normalized_feats = extract_feats.normalize(feats) # normalizes the features

# model execution
model = models.load_model('path/to/model.h5') # loads a pretrained VAD model
predicted_labels = run_cnnlstm.predict_labels(model, normalized_feats) # runs the VAD model on features
utils.save(predicted_labels, 'path/for/output/labels.h5') # saves predicted labels to .h5 file
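
To turn the predicted labels into speech segments, voxseg/main.py passes them to run_cnnlstm.decode together with the speech threshold, the speech-with-music threshold and the median filter kernel size. The sketch below mirrors that call; the threshold and kernel values are illustrative, and it assumes the DataFrame returned by predict_labels is the same object main.py passes to decode:

# sketch, following voxseg/main.py: convert the predicted labels to speech endpoints
# positional arguments after the labels are the speech threshold, the
# speech-with-music threshold and the median filter kernel size (illustrative values)
endpoints = run_cnnlstm.decode(predicted_labels, 0.5, 0.5, 1)
print(endpoints) # DataFrame of discovered speech utterances with their start and end times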

Training

A basic training script is provided in the file train.py in the root directory of the project.

To use this script, the following files are required in a Kaldi-style data directory:

  1. wav.scp - this file provides the paths to the audio files one wishes to use for training, and assigns them a unique recording-id. It is structured as follows: <recording-id> <extended-filename>. Each entry should appear on a new line, for example:

    rec_000 wavs/some_raw_audio.wav
    rec_001 wavs/some_more_raw_audio.wav
    

    Note that the <extended-filename> may be an absolute path or relative path, except when using Docker or Singularity, where paths relative to the mount point must be used.

  2. segments - this file specifies the start and end points of each labelled segment within the audio file. Note that this is different from the way this file is used when provided for decoding. This file is structured as follows: <utterance-id> <recording-id> <segment-begin> <segment-end>, where <segment-begin> and <segment-end> are in seconds. Each entry should appear on a new line, for example:

    rec_000_00 rec_000 0.0 4.3
    rec_000_01 rec_000 4.3 7.2
    rec_000_02 rec_000 7.2 14.8
    rec_000_03 rec_000 14.8 19.5
    rec_001_00 rec_001 0.0 8.5
    rec_001_01 rec_001 8.5 12.2
    rec_001_02 rec_001 12.2 16.1
    rec_001_03 rec_001 16.1 18.9
    rec_001_04 rec_001 18.9 22.0
    
  3. utt2spk - this file specifies the label attached to each segment defined within the segments file. This file is structured as follows: <utterance-id> <label>. Each entry should appear on a new line, for example:

    rec_000_00 speech
    rec_000_01 non_speech
    rec_000_02 speech
    rec_000_03 non_speech
    rec_001_00 non_speech
    rec_001_01 speech
    rec_001_02 non_speech
    rec_001_03 speech
    rec_001_04 non_speech
    

Note that the model may be trained with 2 classes ('speech', 'non_speech'), as shown in the example above, or with the 4 classes from the AVA-Speech dataset ('clean_speech', 'no_speech', 'speech_with_music', 'speech_with_noise'), as is the case for the default model used by the toolkit.
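
Before launching training, it can be worth verifying that the segments and utt2spk files agree. The sketch below is an optional check, not part of the toolkit; the data/ path is an example and the allowed label sets follow the note above:

# sketch: sanity-check a Kaldi-style training directory before running train.py
# (the data/ path is an example; the allowed label sets follow the note above)
with open('data/segments') as f:
    utt_ids = [line.split()[0] for line in f if line.strip()]
with open('data/utt2spk') as f:
    labels = dict(line.split() for line in f if line.strip())

allowed = [{'speech', 'non_speech'},
           {'clean_speech', 'no_speech', 'speech_with_music', 'speech_with_noise'}]

missing = [u for u in utt_ids if u not in labels]
assert not missing, f'utterances without labels: {missing}'
assert any(set(labels.values()) <= s for s in allowed), \
    f'unexpected labels: {set(labels.values())}'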

To use the training script with a specific validation set run:

# use -v to specify a Kaldi style data directory to be used as validation set
python3 train.py -v val_dir train_dir model_name out_dir

To use the training script with a percentage of the training data as a validation set run:

# use -s to specify a percentage of the training data to be used as a validation set
python3 train.py -s 0.1 train_dir model_name out_dir

The training script may also be used without either of these flags; however, this is not recommended, as it makes it difficult to tell whether the model is starting to overfit. When a validation set is provided, the model with the best validation accuracy is saved. When no validation set is provided, the model is saved after the final training epoch.
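
The resulting model can then be used for decoding via the -m flag of run_cnnlstm.py shown above. The filename below is only a placeholder, since the exact name of the model file that train.py writes to out_dir is not specified here:

# placeholder path: substitute the model file that train.py writes to out_dir
python3 voxseg/run_cnnlstm.py -m out_dir/model_name.h5 features_directory output_directory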

License

MIT


voxseg's Issues

2-label pre-trained model

I've run through the project and can run inference with the provided 4-label pre-trained model. Do you have a checkpoint for the 2-class speech/non-speech model? Great work, by the way.

CUDA out of memory

Hi,

I used a very large dataset and got a CUDA out of memory error.
What should I do to fix this error?

Regards,

Surasak

More than 4 classes in dataset

Hey,
This has worked great for me, but I was wondering whether there is a way to train on a different dataset (one that has 5 classes). I removed the if-else that doesn't permit datasets with lengths other than 2 or 4, but whenever I train, the validation accuracy is very low. Do you know what the advisable approach would be? Thanks.

Question

Hi, the project works well, but I can't find anything about model training. Could you share the model training code? Thanks a lot.

Prior issue

Hi Nick,

I just wanted to double-check with you whether the prior value is OK!

In the code we have the following:
prior = np.array([(1-speech_thresh) * speech_w_music_thresh, speech_thresh * speech_w_music_thresh, (1-speech_thresh) * (1-speech_w_music_thresh), (1-speech_thresh) * speech_w_music_thresh])

The first and last values seem to be the same. Should the last value be changed to (speech_thresh) * (1-speech_w_music_thresh)?

Thanks again for all the great work !

Léo

Question on the parameters speech_threshold and speech_w_music_threshold

Hi,

the repo is very easy to use and also works nicely for German speech! While playing around with your repo and after skimming through the paper, I was wondering about the functionality of the speech_threshold and speech_w_music_threshold parameters. It is clear to me that if you raise them, the activity detection becomes less sensitive for plain speech and for speech with music in the background, respectively. My question is how the thresholds are actually applied. I had a look into run_cnnlstm.py but I am not able to follow you there. This part especially is troublesome for me:

    prior = np.array([(1-speech_thresh) * speech_w_music_thresh,
                    speech_thresh * speech_w_music_thresh,
                    (1-speech_thresh) * (1-speech_w_music_thresh),
                    (1-speech_thresh) * speech_w_music_thresh])
    temp = pd.concat([_targets_to_endpoints(medfilt([0 if (j*prior).argmax() == 1 else 1 for j in i], filt), 0.32) \
                     for i in targets['predicted-targets']], ignore_index=True)

Thanks again for this very useful repo!

ValueError raised when audio file has no voice activity

First of all, thank you for all of your work. This package is proving to be very helpful.

I have come across what appears to be a bug. If I supply an audio file to Voxseg and no voice activity is identified, this ValueError is thrown:

------------------- Running VAD -------------------
2021-06-08 18:15:50.236547: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:116] None of the MLIR optimization passes are enabled (registered 2)
2021-06-08 18:15:50.236961: I tensorflow/core/platform/profile_utils/cpu_utils.cc:112] CPU Frequency: 2299965000 Hz
Traceback (most recent call last):
  File "voxseg/main.py", line 58, in <module>
    endpoints = run_cnnlstm.decode(targets, speech_thresh, speech_w_music_thresh, filt)
  File "../voxseg/env/lib/python3.8/site-packages/voxseg/run_cnnlstm.py", line 57, in decode
    ((targets['start'] * 100).astype(int)).astype(str).str.zfill(7) + '_' + \
  File "../voxseg/env/lib/python3.8/site-packages/pandas/core/generic.py", line 5874, in astype
    new_data = self._mgr.astype(dtype=dtype, copy=copy, errors=errors)
  File "../voxseg/env/lib/python3.8/site-packages/pandas/core/internals/managers.py", line 631, in astype
    return self.apply("astype", dtype=dtype, copy=copy, errors=errors)
  File "../voxseg/env/lib/python3.8/site-packages/pandas/core/internals/managers.py", line 427, in apply
    applied = getattr(b, f)(**kwargs)
  File "../voxseg/env/lib/python3.8/site-packages/pandas/core/internals/blocks.py", line 673, in astype
    values = astype_nansafe(vals1d, dtype, copy=True)
  File "../voxseg/env/lib/python3.8/site-packages/pandas/core/dtypes/cast.py", line 1074, in astype_nansafe
    return lib.astype_intsafe(arr.ravel(), dtype).reshape(arr.shape)
  File "pandas/_libs/lib.pyx", line 619, in pandas._libs.lib.astype_intsafe
ValueError: cannot convert float NaN to integer

I suspect that because no voice activity has been identified, no time points exist or they are NaN values (i.e. targets['start'] and targets['end']), causing the following code to fail:

From voxseg.run_cnnlstm.decode

    targets['utterance-id'] = targets['recording-id'].astype(str) + '_' + \
                        ((targets['start'] * 100).astype(int)).astype(str).str.zfill(7) + '_' + \
                        ((targets['end'] * 100).astype(int)).astype(str).str.zfill(7)

I have put together a workaround, but figured others will likely come across this bug at some point. I would also like to know whether this bug has a cause other than the lack of voice activity.

Many thanks!
