Comments (9)
from open_stt.
We are planning to share a much larger dataset based on audio-books
Please PM me (telegram), I will share a private meta-data file, from which you could extract the data you need
We are not planning to share this data publicly yet
@i7p9h9
We shared this dataset update here
It would be great if the data came with dedicated directories for each speaker e.g.
<dataset-id>/<speaker-id>/<sample-id>.wav
                          <sample-id>.txt
because it makes sense to separate speakers during training and testing. Not just for speaker recognition but also for STT tasks.
However, open_stt is an awesome dataset nevertheless. Are you planning on adding more languages?
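For illustration, a minimal sketch of how the proposed per-speaker layout would allow a speaker-disjoint train/test split. It assumes paths of the form <dataset-id>/<speaker-id>/<sample-id>.wav as suggested above; the function name `split_by_speaker`, the example paths, and the 20% test fraction are hypothetical and not part of open_stt.

```python
import random
from collections import defaultdict
from pathlib import PurePosixPath


def split_by_speaker(wav_paths, test_fraction=0.2, seed=0):
    """Split paths so that no speaker appears in both train and test."""
    by_speaker = defaultdict(list)
    for path in wav_paths:
        # Assumed layout: <dataset-id>/<speaker-id>/<sample-id>.wav,
        # so the speaker id is the name of the parent directory.
        speaker_id = PurePosixPath(path).parent.name
        by_speaker[speaker_id].append(path)

    # Shuffle speakers (not samples) with a fixed seed for reproducibility.
    speakers = sorted(by_speaker)
    random.Random(seed).shuffle(speakers)

    # Hold out at least one speaker for testing.
    n_test = max(1, round(len(speakers) * test_fraction))
    test_speakers = set(speakers[:n_test])

    train = [p for s in speakers if s not in test_speakers for p in by_speaker[s]]
    test = [p for s in speakers if s in test_speakers for p in by_speaker[s]]
    return train, test


# Hypothetical file list, for demonstration only.
paths = [
    "books/spk1/a.wav", "books/spk1/b.wav",
    "books/spk2/c.wav", "books/spk3/d.wav",
    "books/spk4/e.wav", "books/spk5/f.wav",
]
train, test = split_by_speaker(paths)
```

Splitting on speakers rather than on individual samples is what keeps a held-out speaker's voice entirely out of the training set.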
Hi!
Doing exactly this is not feasible unfortunately due to the nature of the dataset (zero money investment into annotation).
But we could share speakers privately as meta data for a very limited subset of data if this helps. Mostly books.
I see. Well, my workaround here is throwing everything uncertain into the train set and test on data which has speaker separation. E.g. the Common Voice dataset might be reliable enough.
If I may ask, what kind of word error rate (WER) did you get on the entire open_stt dataset? I am currently not too far below 40% (using ~3000h of the data) which is actually not as good as I expected it to be for so many hours of speech. :)
> Well, my workaround here is throwing everything uncertain into the train set and test on data which has speaker separation.
We have a small subset of the data (15 hours) manually annotated; we will be posting it soon.
> what kind of word error rate (WER) did you get on the entire open_stt dataset
Sorry for the late reply, but please refer to tickets #5 and #7.
Obviously these are not the best or latest models, but you can see some patterns in the distributions.
You will see that the annotation quality is not consistent across the whole dataset, so it has been / will be distilled.
There have been reports that if you use ESPnet without the badly annotated data, you get a much better result.
Weeding out the bad data will be the foremost focus of our future work.
@snakers4 no worries :)
Thanks for sharing that information. I will take a look at those issues.
Thanks for doing all this great work and providing such an easy-to-use dataset!
> I see. Well, my workaround here is throwing everything uncertain into the train set and test on data which has speaker separation. E.g. the Common Voice dataset might be reliable enough.
> If I may ask, what kind of word error rate (WER) did you get on the entire open_stt dataset? I am currently not too far below 40% (using ~3000h of the data) which is actually not as good as I expected it to be for so many hours of speech. :)
Hi Stefan, you mentioned that you have trained an ASR system on Russian Common Voice. Could you share your latest WER on it? I do not know much about the Russian language. I trained a Russian ASR system on only about 60h of Russian Common Voice data, and the WER is now about 40% with a Kaldi chain model, even with an LM built from the test-set text. Do you think that is normal? I haven't found any benchmark on Russian Common Voice. I also see that you often evaluate Russian ASR with CER; is that more common for Russian ASR? Thanks a lot!
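For readers comparing the two metrics mentioned in this thread, here is a minimal sketch of how WER and CER are commonly computed from the Levenshtein (edit) distance. The function names and the example sentences are made up for illustration; this is not the scoring code used by open_stt, Kaldi, or Common Voice, and toolkits may normalize text differently before scoring.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution (cost 0 on a match)
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate; assumes a non-empty reference."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate (spaces included); assumes a non-empty reference."""
    return edit_distance(reference, hypothesis) / len(reference)


# Made-up example: one wrong character in a word ending.
reference = "мама мыла раму"
hypothesis = "мама мыло раму"
word_error = wer(reference, hypothesis)  # 1 wrong word out of 3
char_error = cer(reference, hypothesis)  # 1 wrong character out of 14
```

In this example a single wrong ending counts as a full word error under WER but as one character under CER, which may be part of why CER is sometimes reported alongside WER for heavily inflected languages.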