
speechvgg's Introduction

SpeechVGG: A deep feature extractor for speech processing

speechVGG is a deep speech feature extractor, tailored specifically for applications in representation and transfer learning in speech processing problems. The extractor adopts the classic VGG-16 architecture and is trained via the word recognition task.

We showed that the extractor can capture generalized speech-specific features in a hierarchical fashion. Importantly, the generalized representation of speech captured by the pre-trained model transfers to distinct speech processing tasks, even when they use a different dataset.

In our experiments, we showed that even relatively simple applications of the pre-trained speechVGG were capable of achieving results comparable to the state-of-the-art, presumably thanks to the knowledge transfer. For more details and full evaluation, see the original paper.

Here, we present our Python implementation of speechVGG, as introduced in (Beckmann et al., 2019) and employed for Deep Speech Inpainting in (Kegler et al., 2019) (see the demo here). Now we are going to walk you through how to use it!

Technical requirements:

Third party packages

  • Python 3.6.8
  • numpy 1.16.4
  • h5py 2.8.0
  • SoundFile 0.10.2
  • SciPy 1.2.1
  • Tensorflow 1.13.1
  • Keras 2.2.4
  • Keras-tqdm 2.0.1

Tested on Linux Ubuntu 18.04 LTS.

Use our pre-trained models and explore their example applications...

Pre-trained models are available here (all configurations considered in (Beckmann et al., 2019)). In the examples folder of this repository we show you how to apply a pre-trained speechVGG in speaker recognition and speech/music/noise classification, as introduced in (Beckmann et al., 2019).

... or train your own!

Here, we are going to show you how to use the code to train the model from scratch on the LibriSpeech dataset.

Data

Create a folder 'LibriSpeech' with the following subfolders:

LibriSpeech
    |_ word_labels
    |_ split
        |____ test-clean
        |____ test-other
        |____ dev-clean
        |____ dev-other
        |____ train-clean-100
        |____ train-clean-360
        |____ train-other-500

The word_labels folder should contain the aligned labels; it can be downloaded here.

The split folder should contain the extracted LibriSpeech datasets, which can be downloaded here.
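Before running the preprocessing step, it can help to confirm that the folders are where the scripts will look for them. The following is a minimal sketch and not part of the repository: the function name `missing_librispeech_dirs` is hypothetical, and it assumes the seven dataset folders sit inside `split`, as the tree above suggests.

```python
from pathlib import Path

# Subfolders expected under LibriSpeech/split, taken from the tree above.
EXPECTED_SPLITS = [
    "test-clean", "test-other", "dev-clean", "dev-other",
    "train-clean-100", "train-clean-360", "train-other-500",
]


def missing_librispeech_dirs(root):
    """Return the expected subfolders (relative to root) that do not exist."""
    root = Path(root).expanduser()
    expected = [root / "word_labels"]
    expected += [root / "split" / name for name in EXPECTED_SPLITS]
    return [p.relative_to(root).as_posix() for p in expected if not p.is_dir()]


if __name__ == "__main__":
    missing = missing_librispeech_dirs("~/LibriSpeech")
    print("Missing folders:", missing if missing else "none")
```

An empty list means the layout matches the tree. You only need the splits you actually plan to preprocess, so a non-empty list is not necessarily an error.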

Generate dataset

First, preprocess the data (here, LibriSpeech for example):

python preprocess.py --data ~/LibriSpeech --dest_path ~/LibriSpeechWords

Then, obtain the mean and standard deviation of the desired dataset (for normalization):

python compute_dataset_props.py --data ~/LibriSpeechWords/train-clean-100/ --output_folder data

The parameters will be saved as a dataset_props_log.h5 file. Here we attach a version obtained from the training part of the LibriSpeech data.
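To illustrate what "mean and standard deviation for normalization" means in practice, here is a minimal sketch. It is not the code in compute_dataset_props.py: the function names `dataset_mean_std` and `standardize` are hypothetical, and it assumes a single global mean and standard deviation computed over every value in the dataset.

```python
import numpy as np


def dataset_mean_std(arrays):
    """Mean and standard deviation over all values in an iterable of arrays.

    Accumulates running sums, so the whole dataset never has to be in memory.
    """
    count, total, total_sq = 0, 0.0, 0.0
    for a in arrays:
        a = np.asarray(a, dtype=np.float64)
        count += a.size
        total += a.sum()
        total_sq += np.square(a).sum()
    mean = total / count
    # Var[x] = E[x^2] - E[x]^2; clamp at 0 to guard against rounding error.
    std = np.sqrt(max(total_sq / count - mean ** 2, 0.0))
    return mean, std


def standardize(a, mean, std):
    """Shift and scale an array to zero mean and unit variance."""
    return (np.asarray(a, dtype=np.float64) - mean) / std
```

Computing the statistics on the training split only, and reusing them for the test data, keeps the test set from leaking into the normalization.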

Train

Now you can train the model using the training script:

python train.py --train ~/LibriSpeechWords/train-clean-100/ --test ~/LibriSpeechWords/test-clean/ --weight_path data --classes 1000 --augment yes 

Finally, the weights of the model will be saved in the desired directory, here 'data'. Subsequently, you can use the trained model, for example, to obtain deep feature losses (as we did in Kegler et al., 2019 & Beckmann et al., 2019).

References

  1. Beckmann, P.*, Kegler, M.*, Saltini, H., and Cernak, M. (2019) Speech-VGG: A deep feature extractor for speech processing. arXiv preprint arXiv:1910.09909.
  2. Kegler, M.*, Beckmann, P.*, and Cernak, M. (2020) Deep Speech Inpainting of Time-Frequency Masks. Proc. Interspeech 2020, 3276-3280, DOI: 10.21437/Interspeech.2020-1532. + Demo

speechvgg's People

Contributors

mkegler, bepierre


speechvgg's Issues

Error in example extract_and_classify.ipynb

First of all, thanks for sharing this great repository!

I think there is an error in examples/speech_music_noise/extract_and_classify.ipynb in the function remove_short_clips. I think that you meant to write clips=np.array(clips)[good_idxs]

def remove_short_clips(clips, length=16200):
    good_idxs = []
    for i, clip in enumerate(clips):
        audio = sf.read(clip)
        audio = audio[0]
        if len(audio)>=length:
            good_idxs.append(i)
    clips=np.array(noise_clips)[good_idxs]
    return clips

Thanks!
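The fix proposed in this issue can be sketched as follows. This is an illustration and not the notebook's code: the `read_audio` parameter is a hypothetical addition that lets the function be tried without audio files on disk, and by default it falls back to `soundfile.read`, as in the snippet above.

```python
import numpy as np


def remove_short_clips(clips, length=16200, read_audio=None):
    """Keep only the clips that contain at least `length` samples."""
    if read_audio is None:
        # Default reader: soundfile returns (samples, sample_rate).
        import soundfile as sf

        def read_audio(path):
            return sf.read(path)[0]

    good_idxs = [i for i, clip in enumerate(clips)
                 if len(read_audio(clip)) >= length]
    # Index the function's own argument, not the outer noise_clips variable.
    return np.array(clips)[good_idxs]
```

With the original line, the function silently filtered `noise_clips` from the enclosing scope whatever list was passed in, so it only behaved correctly when called on `noise_clips` itself.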

/libs/data_generator.py file has a bug

I'm working on Language Identification using SpeechVGG. There's a bug on line 149 of /libs/data_generator.py.

The code label_tmp = h5f['word_idx'][0] is incorrect, as there is no key named 'word_idx'.

The correct line would be:
label_tmp = h5f['class'][0]

Do check it once and let me know.

Thank you :)
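Which key holds the label depends on how the .h5 files were written, so one defensive option is to try several keys in turn. The helper below is a hypothetical sketch and not part of the repository; it works with any mapping-like object, including an open `h5py.File`, and the two key names are simply the ones mentioned in this issue.

```python
def read_label(h5f, keys=("word_idx", "class")):
    """Return the first element of the first dataset found under `keys`."""
    for key in keys:
        if key in h5f:
            return h5f[key][0]
    raise KeyError(
        f"none of {keys} found; available keys: {list(h5f.keys())}"
    )
```

With h5py this would be used as `with h5py.File(path, "r") as h5f: label_tmp = read_label(h5f)`. Printing `list(h5f.keys())` for one of your own files is the quickest way to see which name your preprocessing actually produced.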

Question about DataGenerator

Dear Pierre,

Thank you very much for sharing your code and datasets online.

I am trying to use your code with my datasets, but I found that the line data_tmp= np.delete(data_tmp, (128), axis=1) in this link doesn't work.

The error says the following:


IndexError                                Traceback (most recent call last)
in
----> 1 print(train_generator[1])

~/Dropbox/NETSPAR-Speech/Program Jupyter 2022Nov/data_generator_new.py in __getitem__(self, index)
     99
    100 # Generate data
--> 101 specs, labels = self.__data_generation(list_IDs_temp)
    102
    103 return specs, labels

~/Dropbox/NETSPAR-Speech/Program Jupyter 2022Nov/data_generator_new.py in __data_generation(self, list_IDs_temp)
    144
    145 #Store Labels
--> 146 data_tmp= np.delete(data_tmp, (128), axis=1) #already deleted in a previous step
    147 data[i,] = data_tmp
    148

<__array_function__ internals> in delete(*args, **kwargs)

~/.local/lib/python3.8/site-packages/numpy/lib/function_base.py in delete(arr, obj, axis)
   4371 obj = obj.item()
   4372 if (obj < -N or obj >= N):
-> 4373 raise IndexError(
   4374 "index %i is out of bounds for axis %i with "
   4375 "size %i" % (obj, axis, N))

IndexError: index 128 is out of bounds for axis 1 with size 128

Do you have an idea what might be the problem?

Thanks a lot in advance and have a great day!
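The message says that axis 1 already has size 128, so its valid indices are 0 to 127 and there is no column 128 left to delete. The call only succeeds on arrays that arrive with 129 columns. One possible workaround, sketched here under the assumption that the goal is simply to end up with 128 columns, is to trim only when the array is wider than that. The name `trim_to_width` is hypothetical and not part of the repository.

```python
import numpy as np


def trim_to_width(spec, width=128):
    """Drop trailing columns along axis 1 so at most `width` remain."""
    if spec.shape[1] > width:
        # Delete every column from index `width` onwards.
        spec = np.delete(spec, np.s_[width:], axis=1)
    return spec
```

Replacing the failing line with `data_tmp = trim_to_width(data_tmp)` would leave arrays that are already 128 columns wide untouched.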

About the pre-trained model

Hello,
I have already downloaded the pre-trained models.
How can I test a pre-trained model (.h5 file) with data?
Thank you.

Bug in test_TIMIT.py

Hello,

In examples/speaker_identification/test_TIMIT.py at line 84, we have:

data_tmp[:,:,0] = log_standardize(data_tmp[:,:,0])

Before applying log standardization, the signal needs to be padded, similar to how it is done in the data_generator.py file at line 138.

So here before line 84 in test_TIMIT.py, please add the code: data_tmp = pad_spec(data_tmp)

Let me know if it's identified correctly.

Thank you
