jakobovski / free-spoken-digit-dataset
A free audio dataset of spoken digits. An audio version of MNIST.
It would be nice to have a DOI for the dataset instead of relying on just git tags.
I would suggest using Zenodo, as it is integrated with GitHub and supports multiple versions of the same dataset.
How can I get the whole dataset, or part of it, for my project?
Hi. I was listening to the new samples from Jason, a lot of them are badly truncated at the beginning.
How to perform 1D wavelet scattering transform on this dataset?
Some files are 8-bit, like 1_nicolas_36.wav, while others are 16-bit, like 9_theo_22.wav.
It is hard to deal with two different sample formats. How can I unify them?
Currently, the dataset contains 1501 recordings instead of the 1500 described in the README. If you check the list of files in the recordings directory, you will see a file named "6_jackson_50.wav". However, since indices run from 0 to 49, the maximum possible index should have been 49, not 50.
FYI, the file encoding for speaker nicolas is 8-bit unsigned integer, whereas all other speakers use 16-bit signed integers.
sox -b 16 -e signed-int old_nicolas.wav new_nicolas.wav
does the trick.
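If you would rather do the conversion in Python than shell out to sox, here is a minimal sketch of the sample-format math (assuming the raw 8-bit samples have already been read into a NumPy array, e.g. via the standard-library wave module or scipy.io.wavfile; the function name u8_to_s16 is my own):

```python
import numpy as np

def u8_to_s16(samples_u8):
    """Convert 8-bit unsigned PCM samples to 16-bit signed PCM.

    8-bit WAV stores unsigned bytes centered at 128; 16-bit WAV stores
    signed ints centered at 0, so we re-center and then rescale.
    """
    centered = samples_u8.astype(np.int16) - 128  # now in [-128, 127]
    return centered * 256                         # scale to the 16-bit range

# Silence (128) maps to 0; the extremes map to -32768 and 32512.
print(u8_to_s16(np.array([0, 128, 255], dtype=np.uint8)))
```

The cast to int16 before subtracting avoids wrap-around in unsigned arithmetic.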
Right now the entire dataset is contained in a single directory (https://github.com/Jakobovski/free-spoken-digit-dataset/tree/master/recordings). This will not scale once the dataset becomes larger. Depending on the file system, even listing the directory contents with ls can become burdensome after around 10,000 files.
But another reason to do so is that the current layout may prevent the files from being queried using GitHub's developer API in the future. I am building an interface to the dataset that can automatically download, query and organize the dataset into training and testing sets without having to first clone the dataset using git. However, there is a limit on the number of files that can be retrieved using this API, and after this limit, the only method would be to clone the repository and retrieve the files manually.
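One possible layout change, sketched here as a hypothetical helper (shard_by_speaker and the per-speaker directory scheme are my own suggestion, not something the repo provides), would be to split recordings/ into one subdirectory per speaker, relying on the existing `<digit>_<speaker>_<index>.wav` naming convention:

```python
import tempfile
from pathlib import Path

def shard_by_speaker(recordings_dir):
    """Move each '<digit>_<speaker>_<index>.wav' file into a
    per-speaker subdirectory (a hypothetical layout, not the repo's)."""
    root = Path(recordings_dir)
    for wav in root.glob("*.wav"):
        speaker = wav.stem.split("_")[1]  # filename format: digit_speaker_index
        dest = root / speaker
        dest.mkdir(exist_ok=True)
        wav.rename(dest / wav.name)

# Demo on a throwaway directory with an empty placeholder file.
with tempfile.TemporaryDirectory() as d:
    (Path(d) / "0_jackson_0.wav").touch()
    shard_by_speaker(d)
    print((Path(d) / "jackson" / "0_jackson_0.wav").exists())  # True
```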
Regards,
Cesar
Hi! Would you consider adding the Serbian language to the dataset? I am interested in contributing my voice, and as many others as I can gather. I suppose this would also be simpler to accomplish if we could collect audio online using an automated website.
Have you considered choosing a license (Creative Commons for instance) to make explicit the conditions for copying and reuse?
I think it would be better if you added numbers like 20, 30, 40, 50, 60, 70, 80, 90, and 100, as well as 11 through 20,
so we could have all the number combinations. For example, if we used this dataset in an application to recognize numbers like 3.45, 578, or 54, the improved dataset would help a lot.
It seems that the recordings done by Jackson have only 1 audio channel (mono), but the recordings by Nicolas have 2 audio channels (stereo). I've noticed that there are no guidelines regarding how many channels the recordings should have in the main README.md at the front page of this project. As such, I would like to know whether the samples from this dataset can have samples with different audio channels, or whether there are plans to normalize the samples such that they all have just one channel. In either case, I suppose the contribution guidelines could be extended with this information.
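Until the samples are normalized upstream, consumers can downmix on their side. A minimal sketch, assuming stereo audio has been loaded as a NumPy array shaped (num_samples, num_channels) and mono as a 1-D array (to_mono is a hypothetical helper name):

```python
import numpy as np

def to_mono(audio):
    """Downmix a (num_samples, num_channels) PCM array to mono by
    averaging the channels; mono input is returned unchanged."""
    if audio.ndim == 1:
        return audio
    return audio.mean(axis=1)

stereo = np.array([[1.0, 3.0], [0.0, 2.0]])
print(to_mono(stereo))  # [2. 1.]
```

Averaging rather than dropping a channel keeps information from both channels and avoids a level jump.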
Regards,
Cesar
Thank you for your work. If I want to use this dataset in my paper, what citation format should I use?
I believe the Zenodo and TensorFlow versions of FSDD are out of sync. Links:
https://zenodo.org/record/1342401#.YWygx9nMK3I
https://www.tensorflow.org/datasets/catalog/spoken_digit
Hello,
Can I ask a simple (read: stupid) question? Can I use this model for digit recognition, for speaker recognition/identification, or for both?
Thanks in advance,
Mirko
I used this code to train a model to classify the free-spoken-digit-dataset (https://github.com/mikesmales/Udacity-ML-Capstone). The trained model's accuracy is 96%.
But prediction with the saved model fails when I test it on my own recorded voice. I recorded some digits using the Windows 10 Voice Recorder and converted the files to 8 kHz mono WAV format.
Can you provide any help? The model predicts accurately on the recordings provided within the dataset.
My Recorded Digit 3:
Original sample rate: 22050
Librosa sample rate: 22050
Original audio file minmax range: 20 to 239
Librosa audio file minmax range: -0.84375 to 0.8671875
(40, 18)
Dataset : Jackson 3:
Original sample rate: 8000
Librosa sample rate: 22050
Original audio file minmax range: -10989 to 9277
Librosa audio file minmax range: -0.35349792 to 0.28417692
(40, 21)
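Two things stand out in the dump above. First, librosa.load resamples to 22050 Hz by default; passing sr=None keeps the file's native rate, and sr=8000 would match FSDD. Second, the two MFCC matrices have different widths, (40, 18) vs (40, 21), so they must be padded or truncated to a common frame count before reaching the model. A sketch of the latter step (fix_width and target_frames=21 are hypothetical names chosen to match the shapes above):

```python
import numpy as np

def fix_width(mfcc, target_frames=21):
    """Zero-pad or truncate an MFCC matrix along the time axis so
    every example has the same shape before it reaches the model."""
    n_mfcc, n_frames = mfcc.shape
    if n_frames < target_frames:
        pad = np.zeros((n_mfcc, target_frames - n_frames))
        return np.hstack([mfcc, pad])
    return mfcc[:, :target_frames]

print(fix_width(np.ones((40, 18))).shape)  # (40, 21)
```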
I would suggest making all of the recordings a standard 1 second long, so we would have access to all 8000 samples; currently, the recordings vary greatly in length.
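Consumers can already enforce a fixed length on their side. A minimal sketch, assuming FSDD's 8 kHz sample rate and zero-padding shorter clips (pad_to_one_second is a hypothetical helper name):

```python
import numpy as np

SR = 8000  # FSDD sample rate

def pad_to_one_second(samples):
    """Zero-pad (or truncate) a recording to exactly one second,
    i.e. 8000 samples, so every example has the same length."""
    out = np.zeros(SR, dtype=samples.dtype)
    n = min(len(samples), SR)
    out[:n] = samples[:n]
    return out

print(pad_to_one_second(np.ones(3000)).shape)  # (8000,)
```

Padding at the end is the simplest choice; centering the utterance in the window is a common alternative.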