License: CC BY-NC 4.0

Russian Open Speech To Text (STT/ASR) Dataset

Arguably the largest public Russian STT dataset to date:

  • ~7m utterances (1-2m with less perfect annotation, see #7);
  • ~7000 hours;
  • 855 GB (in .wav format, int16);
  • (new!) A new domain - radio;
  • (new!) A larger YouTube dataset with 1000+ additional hours;
  • (new!) A small (300 hours) YouTube dataset downloaded in maximum quality;
  • (new!) 18 hours in 3 validation sets for YouTube / books / public calls with ground truth annotation;

Prove us wrong! Open issues, collaborate, submit a PR, contribute, share your datasets! Let's make STT in Russian (and more) as open and available as CV models.

Planned releases:

  • 1000-10,000 additional hours of books;
  • Data quality distillation and improvement / annotation improvement;
  • EVEN MOAR DATA (give us your ideas where to find it!);
  • 1000+ additional hours of YouTube;
  • Some validation / test sets;
  • Plain benchmarks, "bad files";
  • Mp3 torrent;
  • Wav torrent;
  • Radio set;
  • ... and more!;

Table of contents

  • Dataset composition
  • Updates
  • Downloads
  • Annotation methodology
  • Audio normalization
  • On disk DB methodology
  • Helper functions
  • Contacts
  • Acknowledgements
  • FAQ
  • License
  • Donations

Dataset composition

| Dataset | Utterances | Hours | GB | Av. s / chars | Comment | Annotation | Quality/noise |
|---|---|---|---|---|---|---|---|
| audiobook_2 | 1,149,404 | 1,511 | 162 | 4.7s / 56 | Books | Alignment (*) | 95% / crisp |
| radio_2 | 651,645 | 1,439 | 154 | 7.95s / 110 | Radio | Alignment (*) | TBC, should be high |
| public_youtube1120 | 1,410,979 | 1,104 | 237 | 2.82s / 34 | YouTube videos | Subtitles | 95% / ~crisp |
| public_youtube700 | 759,483 | 701 | 75 | 3.3s / 43 | YouTube videos | Subtitles | 95% / ~crisp |
| tts_russian_addresses | 1,741,838 | 754 | 81 | 1.6s / 20 | Russian addresses | TTS, 4 voices | 100% / crisp |
| asr_public_phone_calls_2 | 603,797 | 601 | 66 | 3.6s / 37 | Phone calls | ASR | 70% / noisy |
| public_youtube1120_hq | 369,245 | 291 | 31 | 2.84s / 37 | YouTube videos, HQ sound | Subtitles | 95% / ~crisp |
| asr_public_phone_calls_1 | 233,868 | 211 | 23 | 3.3s / 29 | Phone calls | ASR | 70% / noisy |
| asr_public_stories_2 | 78,186 | 78 | 9 | 3.5s / 43 | Books | ASR | 80% / crisp |
| asr_public_stories_1 | 46,142 | 38 | 4 | 3.0s / 30 | Books | ASR | 80% / crisp |
| public_series_1 | 20,243 | 17 | 2 | 3.1s / 38 | YouTube videos | Subtitles | 95% / ~crisp |
| ru_RU | 5,826 | 17 | 2 | 11s / 12 | Public dataset | Alignment | 99% / crisp |
| voxforge_ru | 8,344 | 17 | 2 | 7.5s / 77 | Public dataset | Reading | 100% / crisp |
| russian_single | 3,357 | 9 | 1 | 9.3s / 102 | Public dataset | Alignment | 99% / crisp |
| asr_calls_2_val | 12,950 | 7.7 | 2 | 2.15s / 34 | Phone calls | Manual annotation | 99% / crisp |
| public_lecture_1 | 6,803 | 6 | 1 | 3.4s / 47 | Lectures | Subtitles | 95% / crisp |
| buriy_audiobooks_2_val | 7,850 | 4.9 | 1 | 2.25s / 31 | Books | Manual annotation | 99% / crisp |
| public_youtube700_val | 7,311 | 4.5 | 1 | 2.2s / 35 | YouTube videos | Manual annotation | 99% / crisp |
| Total | 7,117,271 | 6,812 | 855 | | | | |

(*) Automatic alignment

This alignment was performed using Yuri's alignment tool. Contact him if you need alignment for your own dataset.

Updates

Update 2019-06-28

New train datasets added:

  • 1,439 hours radio_2;
  • 1,104 hours public_youtube1120;
  • 291 hours public_youtube1120_hq;

New validation datasets added:

  • 8 hours asr_calls_2_val;
  • 5 hours buriy_audiobooks_2_val;
  • 5 hours public_youtube700_val;

Update 2019-05-19

Also shared a WAV version of the dataset via torrent.


Update 2019-05-13

Added the forgotten txt files to mp3 archives. Updating the torrent.

Update 2019-05-12

Torrent created and uploaded to academictorrents.

Update 2019-05-10

Quickly converted the dataset to MP3 thanks to the community! Waiting for our academictorrents account to be approved. v0.4 will boast MP3 download links.

Update 2019-05-07 Help needed!

If you want to support the project, you can:

  • Help us with hosting (create a mirror) / provide a reliable node for torrent;
  • Help us with writing some helper functions;
  • Donate (each coffee pays for several full downloads) / use our DO referral link to help;

We are converting the dataset to MP3 now. Please reach out via the contacts listed below if you would like to help.

Downloads

Via torrent

Save us a couple of bucks, download via torrent:

  • An MP3 version of the dataset (v3), to be updated;
  • A WAV version of the dataset (v5);

You can download separate files via torrent. Try several torrent clients if some do not work.

Links

Metadata file.

| Dataset | GB, wav | GB, mp3 | Wav | Mp3 | Source | Manifest |
|---|---|---|---|---|---|---|
| audiobook_2 | 162 | 21.0 | torrent | part1 | Sources from the Internet + alignment | link |
| radio_2 | 154 | 25.7 | torrent | part1 | Radio | link |
| public_youtube1120 | 237 | 32.4 | torrent | part1 | YouTube videos | link |
| asr_public_phone_calls_2 | 66 | 7.5 | torrent | part1 | Sources from the Internet + ASR | link |
| public_youtube1120_hq | 31 | 8.6 | torrent | part1 | YouTube videos | link |
| asr_public_stories_2 | 9 | 1.1 | torrent | part1 | Sources from the Internet + alignment | link |
| tts_russian_addresses_rhvoice_4voices | 80.9 | 9.9 | torrent | part1 | TTS | link |
| public_youtube700 | 75.0 | 9.6 | torrent | part1 | YouTube videos | link |
| asr_public_phone_calls_1 | 22.7 | 2.6 | torrent | part1 | Sources from the Internet + ASR | link |
| asr_public_stories_1 | 4.1 | 0.5 | torrent | part1 | Public stories | link |
| public_series_1 | 1.9 | 0.2 | torrent | part1 | Public series | link |
| ru_RU | 1.9 | 0.2 | torrent | part1 | Caito.de dataset | link |
| voxforge_ru | 1.9 | 0.2 | torrent | part1 | Voxforge dataset | link |
| russian_single | 0.9 | 0.1 | torrent | part1 | Russian single speaker dataset | link |
| asr_calls_2_val | 2 | 0.2 | torrent | part1 | Sources from the Internet | link |
| public_lecture_1 | 0.7 | 0.1 | torrent | part1 | Sources from the Internet + manual | link |
| buriy_audiobooks_2_val | 1 | 0.15 | torrent | part1 | Books + manual | link |
| public_youtube700_val | 2 | 0.13 | torrent | part1 | YouTube videos + manual | link |
| Total | 855 | 87.5 | | | | |

Download instructions

  1. Download each dataset separately:

Via wget

wget https://ru-open-stt.ams3.digitaloceanspaces.com/some_file

For multi-threaded downloads, use aria2c with the -x flag, e.g.:

aria2c -c -x5 https://ru-open-stt.ams3.digitaloceanspaces.com/some_file

If necessary, merge chunks like this:

cat ru_open_stt_v01.tar.gz_* > ru_open_stt_v01.tar.gz
  2. Download the metadata and manifests for each dataset;
  3. Merge files (where applicable), unpack and enjoy!

Check md5sum

The table below also includes checksums for deprecated files. To verify a downloaded file:

md5sum /path/to/downloaded/file
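
If you prefer to verify checksums in Python, here is a minimal sketch (the expected value is taken from the table below; the archive is assumed to sit in the current directory):

import hashlib

def md5_of(path, chunk_size=1 << 20):
    # stream the file in 1 MB chunks so large archives do not exhaust RAM
    h = hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            h.update(chunk)
    return h.hexdigest()

# expected value for public_youtube700_mp3.tar.gz, from the table below
assert md5_of('public_youtube700_mp3.tar.gz') == 'dd048e7110c0c852c353759dad8fec0f'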

| Type | md5sum | File |
|---|---|---|
| audio | f24e21c69c03062d667caf0f055244f2 | asr_public_stories_2_mp3.tar.gz |
| audio | a6f888c53d7cbded85ab51627ef57c96 | asr_public_phone_calls_1_mp3.tar.gz |
| audio | f707e34f488c62af2e3142085ff595ad | asr_public_phone_calls_2_mp3.tar.gz |
| audio | baa491ed0b526b2a989b8c4a8897429d | asr_public_stories_1_mp3.tar.gz |
| audio | 42b9c8c2e31100d6c5b972c9ac000167 | private_buriy_audiobooks_2_mp3.tar.gz |
| audio | 7a5704721012fafa115e7316e5f6e058 | public_lecture_1_mp3.tar.gz |
| audio | 16cf820330f9f8b388395d777b2331ac | public_series_1_mp3.tar.gz |
| audio | dd048e7110c0c852c353759dad8fec0f | public_youtube700_mp3.tar.gz |
| audio | 579e9d98bd159a27d3573641edee69b0 | ru_ru_mp3.tar.gz |
| audio | 177b041594684623ec7d038613e1330d | russian_single_mp3.tar.gz |
| audio | d7ce4c4116dcc655be2b466f82c98b6e | tts_russian_addresses_rhvoice_4voices_mp3.tar.gz |
| audio | 25ea6d9e249a242ecc217acc28c8077b | voxforge_ru_mp3.tar.gz |
| audio | 97cd6b56ba1eb5088bc5643dce054028 | asr_calls_2_val_mp3.tar.gz |
| audio | 69a465e218fc1f597f7b5da836952d9d | radio_2_mp3.tar.gz |
| audio | 0cc0f50db85ec4271696b4eb03a2203c | buriy_audiobooks_2_val_mp3.tar.gz |
| audio | f5d2e3d13b47e1566ba0b021f00788cf | public_youtube1120_hq_mp3.tar.gz |
| audio | 12eb78a9ab7c3d39bbe2842b8d6550ca | public_youtube1120_mp3.tar.gz |
| audio | f6b6034e1e91d9a0a5069fc9ad2ed545 | public_youtube700_val_mp3.tar.gz |
| manifest | b0ce7564ba90b121aeb13aada73a6e30 | asr_public_phone_calls_1.csv |
| manifest | 6867d14dfdec1f9e9b8ca2f1de9ceda6 | asr_public_phone_calls_2.csv |
| manifest | 0bdd77e15172e654d9a1999a86e92c7f | asr_public_stories_1.csv |
| manifest | f388013039d94dc36970547944db51c7 | asr_public_stories_2.csv |
| manifest | 3b67e27c1429593cccbf7c516c4b582d | private_buriy_audiobooks_2.csv |
| manifest | 04027c20eb3aff05f6067957ecff856b | public_lecture_1.csv |
| manifest | 89da3f1b6afcd4d4936662ceabf3033e | public_series_1.csv |
| manifest | a81dfb018c88d0ecd5194ab3d8ff6c95 | public_youtube700.csv |
| manifest | c858f020729c34ba0ab525bbb8950d0c | ru_RU.csv |
| manifest | 0275525914825dec663fd53390fdc9a0 | russian_single.csv |
| manifest | 52f406f4e30fcc8c634f992befd91beb | tts_russian_addresses_rhvoice_4voices.csv |
| audio | 7533581bb26975212817bcacb25546d0 | asr_public_stories_2.tar.gz |
| manifest | 0cdbd085ffa6dab4bfdce7c3ed31fcfe | asr_calls_2_val.csv |
| manifest | 4e0b73e0d00374482a0f2286acf314a0 | buriy_audiobooks_2_val.csv |
| manifest | 6b9ce6828a55d2741d51bc3503345db5 | public_youtube1120.csv |
| manifest | 33040a25cad99e70a81e9e54ff8c758e | public_youtube1120_hq.csv |
| manifest | 525bd20802e529dcabf9e44345a50d0b | public_youtube700_val.csv |
| manifest | 2996fe938cdfb37dc6e359e4384c9bfe | radio_2.csv |

End-to-end download scripts

You can use this script or this script with this config file. Please check the config first. You can also contribute a similar script in Python.

Annotation methodology

The dataset is compiled using open domain sources. Some audio types are annotated automatically and verified statistically / using heuristics.
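
For illustration only, here is a hypothetical sketch of such a heuristic check: keep an utterance only if its reference text and an automatic transcript are sufficiently similar. The function and the threshold are ours, not the actual pipeline:

import difflib

def similar_enough(reference, hypothesis, min_ratio=0.8):
    # rough character-level similarity in [0, 1]
    ratio = difflib.SequenceMatcher(None, reference, hypothesis).ratio()
    return ratio >= min_ratio

similar_enough('привет мир', 'привет м ир')  # True: small ASR error, keep
similar_enough('привет мир', 'пока')         # False: likely bad alignment, drop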

Audio normalization

All files are normalized for easier / faster runtime augmentations and processing as follows:

  • Converted to mono, if necessary;
  • Converted to 16 kHz sampling rate, if necessary;
  • Stored as 16-bit integers;
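
As a minimal sketch, the same normalization can be reproduced with pydub (the file names are placeholders; this is not necessarily the exact pipeline used to prepare the dataset):

from pydub import AudioSegment

sound = AudioSegment.from_file('source_audio.mp3')
sound = sound.set_channels(1)        # mono
sound = sound.set_frame_rate(16000)  # 16 kHz
sound = sound.set_sample_width(2)    # 16-bit samples
sound.export('normalized.wav', format='wav')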

On disk DB methodology

Each audio file is hashed. The hash is used to build a nested folder hierarchy (one level per hash prefix), which keeps the number of files per directory manageable and makes filesystem operations faster.

import hashlib
from pathlib import Path

# `wav` is assumed to be a 1-D int16 numpy array and `root_folder` a string
# (see the full example in the FAQ below)
target_format = 'wav'
wavb = wav.tobytes()

# hash the raw audio bytes; hash prefixes become nested folder names
f_hash = hashlib.sha1(wavb).hexdigest()

store_path = Path(root_folder,
                  f_hash[0],
                  f_hash[1:3],
                  f_hash[3:15] + '.' + target_format)
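
For example, here is a usage sketch of the scheme above for a one-second silent buffer:

import hashlib
import numpy as np
from pathlib import Path

wav = np.zeros(16000, dtype=np.int16)  # 1 s of silence at 16 kHz
f_hash = hashlib.sha1(wav.tobytes()).hexdigest()
store_path = Path('../data/ru_open_stt',
                  f_hash[0],
                  f_hash[1:3],
                  f_hash[3:15] + '.wav')
# -> ../data/ru_open_stt/<1 hex char>/<2 hex chars>/<12 hex chars>.wav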

Helper functions

Use helper functions from here for easier work with manifest files.

Read manifests

See example

from utils.open_stt_utils import read_manifest

manifest_df = read_manifest('path/to/manifest.csv')

Merge, check and save manifests

See example

from utils.open_stt_utils import (plain_merge_manifests,
                                  check_files,
                                  save_manifest)
train_manifests = [
 'path/to/manifest1.csv',
 'path/to/manifest2.csv',
]
train_manifest = plain_merge_manifests(train_manifests,
                                        MIN_DURATION=0.1,
                                        MAX_DURATION=100)
check_files(train_manifest)
save_manifest(train_manifest,
              'my_manifest.csv')

Contacts

Please contact us here or just create a GitHub issue!

Authors, in alphabetical order:

  • Anna Slizhikova;
  • Alexander Veysov;
  • Dmitry Voronin;
  • Yuri Baburov;

Acknowledgements

This repo would not be possible without these people:

  • Many thanks to akreal for helping to encode the initial bulk of the data into MP3;
  • The 18 hours of ground-truth annotated validation data are a courtesy of activebc;

Kudos!

FAQ

0. Why not MP3? MP3 encoding / decoding

Encoding

We mostly used pydub (which calls ffmpeg) to convert to MP3, omitting blank files (mostly from YouTube). We used the following parameters:

  • 16 kHz;
  • 32 kbps;
  • Mono;

Usually 128-192 kbps is enough for music at a 44.1 kHz sample rate, and 64-96 kbps is enough for speech. Here we have mono 16 kHz audio, usually with only one speaker, so 32 kbps was a good choice. We did not use other formats like .ogg because .mp3 is much more popular.

See example

from pydub import AudioSegment

# temp_path and store_mp3_path are assumed to be defined elsewhere
sound = AudioSegment.from_file(temp_path,
                               format="wav")

file_handle = sound.export(store_mp3_path,
                           format="mp3",
                           parameters=["-ar", "16000", "-ac", "1"],  # 16 kHz, mono
                           bitrate="32k")

Decoding

It is up to you, but to save space and spare CPU during training, we would suggest the following pipeline for extracting the files:

See example

# you can also use pydub, torchaudio, sox or whatever
# we ended up using scipy for speed
# this example also includes the hashing step, which is not strictly necessary
import librosa
import hashlib
import numpy as np
from pathlib import Path
from scipy.io import wavfile

def save_wav_diskdb(wav,
                    root_folder='../data/ru_open_stt/',
                    target_sr=16000):
    assert type(wav) == np.ndarray
    assert wav.dtype == np.dtype('int16')
    assert len(wav.shape)==1

    target_format = 'wav'
    wavb = wav.tobytes()

    # f_path = Path(audio_path)
    f_hash = hashlib.sha1(wavb).hexdigest()

    store_path = Path(root_folder,
                      f_hash[0],
                      f_hash[1:3],
                      f_hash[3:15]+'.'+target_format)

    store_path.parent.mkdir(parents=True,
                            exist_ok=True)

    wavfile.write(filename=str(store_path),
                  rate=target_sr,
                  data=wav)

    return str(store_path)

root_folder = '../data/'
# save to int16, mono, 16 kHz to save space
target_dtype = np.dtype('int16')
target_sr = 16000
# librosa reads mp3
wav, sr = librosa.load(source_mp3_path,
                       mono=True,
                       sr=target_sr)

# librosa converts to float32 by default
wav = (wav * 32767).astype(target_dtype) # cast to int

wav_path = save_wav_diskdb(wav,
                           root_folder=root_folder,
                           target_sr=target_sr)

Why not OGG

Even though OGG is considered better for speech and offers higher compression, we opted for a more conventional, well-known format.

1. Issues with reading files

Maybe try this approach:

See example

import numpy as np
from scipy.io import wavfile

sample_rate, sound = wavfile.read(path)

# normalize int16 PCM to [-1, 1] float32
abs_max = np.abs(sound).max()
sound = sound.astype('float32')
if abs_max > 0:
    sound *= 1 / abs_max

2. Why share such a dataset?

We are not altruists; life just is not a zero-sum game.

Consider the progress in computer vision that was made possible by:

  • Public datasets;
  • Public pre-trained models;
  • Open source frameworks;
  • Open research;

STT does not enjoy the same attention from the ML community because it is data-hungry and public datasets are lacking, especially for languages other than English. Ultimately this leads to a worse-off situation for the general community.

3. Known issues with the dataset to be fixed

  • Blank files in the YouTube dataset. Removed in the mp3 archive; metadata not cleaned yet;
  • Some files have very low volume values / crash with torchaudio;
  • It looks like scipy does not always write metadata when saving wavs (or you should save an (N, 1)-shaped file) - this can be fixed as shown above (see also the sketch below);
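
A minimal sketch of the (N, 1) workaround mentioned in the last point (the file name is a placeholder):

import numpy as np
from scipy.io import wavfile

wav = np.zeros(16000, dtype=np.int16)  # mono int16 buffer, 1 s at 16 kHz
# writing a 2-D (N, 1) array makes scipy treat it explicitly as one channel
wavfile.write('out.wav', 16000, wav.reshape(-1, 1))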

License

  • CC BY-NC 4.0; commercial usage is available after agreement with the dataset authors;
  • Except for radio_2, which is in the public domain;
  • Except for voxforge_ru, which is licensed under the GNU GPL 3.0;
  • Except for the Caito.de dataset (ru_RU), whose license is here.

Donations

Donate (each coffee pays for several full downloads) / use our DO referral link to help.

