
open_stt's Introduction


Russian Open Speech To Text (STT/ASR) Dataset

Arguably the largest public Russian STT dataset to date:

  • ~16m utterances (1-2m with less perfect annotation, see #7);
  • ~20 000 hours;
  • 2.3 TB (in .wav format, int16), 356 GB in .opus;
  • A new domain - public speech;
  • A huge Radio dataset update with 10 000+ hours;
  • (new!) Utils for working with OPUS;
  • (new!) New OPUS torrent;
  • (new!) New OPUS direct links;

Prove us wrong! Open issues, collaborate, submit a PR, contribute, share your datasets! Let's make STT in Russian (and more) as open and available as CV models.

Important - assume that ё everywhere is replaced with е.
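If you compare model output against your own reference texts, the same replacement can be applied up front. A trivial illustration (not an official preprocessing script):

def normalize_yo(text: str) -> str:
    # the dataset assumes ё is replaced with е everywhere
    return text.replace('ё', 'е').replace('Ё', 'Е')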

Planned releases:

  • Working on a new project with 3 more languages, stay tuned!

Table of contents

  • Dataset composition
  • Updates
  • Downloads
  • Annotation methodology
  • Audio normalization
  • On disk DB methodology
  • Helper functions
  • How to open opus
  • Contacts
  • Acknowledgements
  • FAQ
  • License
  • Donations

Dataset composition

| Dataset | Utterances | Hours | GB | Secs/chars | Comment | Annotation | Quality/noise |
|---|---|---|---|---|---|---|---|
| radio_v4 | 7,603,192 | 10,430 | 1,195 | 5s / 68 | Radio | Align (*) | 95% / crisp |
| public_speech | 1,700,060 | 2,709 | 301 | 6s / 79 | Public speech | Align (*) | 95% / crisp |
| audiobook_2 | 1,149,404 | 1,511 | 162 | 5s / 56 | Books | Align (*) | 95% / crisp |
| radio_2 | 651,645 | 1,439 | 154 | 8s / 110 | Radio | Align (*) | 95% / crisp |
| public_youtube1120 | 1,410,979 | 1,104 | 237 | 3s / 34 | Youtube | Subtitles | 95% / ~crisp |
| public_youtube700 | 759,483 | 701 | 75 | 3s / 43 | Youtube | Subtitles | 95% / ~crisp |
| tts_russian_addresses | 1,741,838 | 754 | 81 | 2s / 20 | Addresses | TTS 4 voices | 100% / crisp |
| asr_public_phone_calls_2 | 603,797 | 601 | 66 | 4s / 37 | Phone calls | ASR | 70% / noisy |
| public_youtube1120_hq | 369,245 | 291 | 31 | 3s / 37 | YouTube HQ | Subtitles | 95% / ~crisp |
| asr_public_phone_calls_1 | 233,868 | 211 | 23 | 3s / 29 | Phone calls | ASR | 70% / noisy |
| radio_v4_add | 92,679 | 157 | 18 | 6s / 80 | Radio | Align (*) | 95% / crisp |
| asr_public_stories_2 | 78,186 | 78 | 9 | 4s / 43 | Books | ASR | 80% / crisp |
| asr_public_stories_1 | 46,142 | 38 | 4 | 3s / 30 | Books | ASR | 80% / crisp |
| public_series_1 | 20,243 | 17 | 2 | 3s / 38 | Youtube | Subtitles | 95% / ~crisp |
| asr_calls_2_val | 12,950 | 7.7 | 2 | 2s / 34 | Phone calls | Manual annotation | 99% / crisp |
| public_lecture_1 | 6,803 | 6 | 1 | 3s / 47 | Lectures | Subtitles | 95% / crisp |
| buriy_audiobooks_2_val | 7,850 | 4.9 | 1 | 2s / 31 | Books | Manual annotation | 99% / crisp |
| public_youtube700_val | 7,311 | 4.5 | 1 | 2s / 35 | Youtube | Manual annotation | 99% / crisp |
| Total | 16,513,202 | 20,108 | 2,369 | | | | |

Updates

Update 2021-06-04

Added Zenodo direct link mirrors as well.

Update 2020-09-23

Now also seeding the torrent via aria2c. Please use aria2c to download it.

Update 2020-06-13

Now featured via Azure Open Datasets.

Update 2020-05-09

Legacy links and torrents deprecated

  • All legacy links and torrents are deprecated;
  • Please switch to new links and opus
  • Opus helpers are available in this repo

Update 2020-05-04

Opus direct links

  • Unlimited direct downloads via direct opus links

Update 2020-05-04

Migration to OPUS

  • Conversion of the whole dataset to OPUS
  • New OPUS torrent
  • Added OPUS helpers and build instructions
  • Coming soon - new unlimited direct downloads

Update 2020-02-07

Direct MP3 links temporarily deprecated.

Update 2019-11-04

New train datasets added:

  • 10,430 hours radio_v4;
  • 2,709 hours public_speech;
  • 154 hours radio_v4_add;
  • 5% sample of all new datasets with annotation.

Update 2019-06-28

New train datasets added:

- 1,439 hours radio_2;
- 1,104 hours public_youtube1120;
- 291 hours public_youtube1120_hq;

New validation datasets added:

- 8 hours asr_calls_2_val;
- 5 hours buriy_audiobooks_2_val;
- 5 hours public_youtube700_val;

Update 2019-05-19

Also shared a wav version via torrent.

Update 2019-05-13

Added the forgotten txt files to mp3 archives. Updating the torrent.

Update 2019-05-12

Torrent created and uploaded to academictorrents.

Update 2019-05-10

Quickly converted the dataset to MP3 thanks to the community! Waiting for our account for academic torrents to be approved. v0.4 will boast MP3 download links.

Update 2019-05-07 Help needed!

If you want to support the project, you can:

  • Help us with hosting (create a mirror) / provide a reliable node for torrent;
  • Help us with writing some helper functions;
  • Donate (each coffee pays for several full downloads) / use our DO referral link to help;

We are converting the dataset to MP3 now. If you would like to help, please contact us using the contacts below.

Downloads

Via torrent

  • An MP3 version of the dataset (v3) DEPRECATED;
  • A WAV version of the dataset (v5) DEPRECATED;
  • An OPUS version of the dataset (v1.01);

You can download separate files via torrent.

It looks like, due to the large chunk size, most conventional torrent clients just fail silently. This is not a problem (re-calculating the torrent would take a lot of time, and some people have downloaded it already); just use aria2c:

apt update
apt install aria2
# list the torrent files
aria2c --show-files ru_open_stt_wav_v10.torrent
# download only one file
aria2c --select-file=4 ru_open_stt_wav_v10.torrent
# for more options visit
# https://aria2.github.io/manual/en/html/aria2c.html#basic-options
# https://aria2.github.io/manual/en/html/aria2c.html#bittorrent-metalink-options
# https://aria2.github.io/manual/en/html/aria2c.html#bittorrent-specific-options

If you are using Windows, you may use the Windows Subsystem for Linux (WSL) to run these commands.

Links

| Dataset | GB, wav | GB, archive | Archive | Source | Manifest |
|---|---|---|---|---|---|
| Train | | | | | |
| radio_v4 | 1059 | 176 | opus, txt | Radio | manifest |
| public_speech | 257 | 47.4 | opus, txt | Internet + alignment | manifest |
| radio_v4_add | 15.7 | 2.8 | opus, txt | Radio | manifest |
| 5% of radio_v4 + public_speech | - | 11.4 | opus+txt mirror | - | manifest |
| audiobook_2 | 162 | 25.8 | opus+txt mirror | Internet + alignment | manifest |
| radio_2 | 154 | 24.6 | opus+txt mirror | Radio | manifest |
| public_youtube1120 | 237 | 19.0 | opus+txt mirror | YouTube videos | manifest |
| asr_public_phone_calls_2 | 66 | 9.4 | opus+txt mirror | Internet + ASR | manifest |
| public_youtube1120_hq | 31 | 4.9 | opus+txt mirror | YouTube videos | manifest |
| asr_public_stories_2 | 9 | 1.4 | opus+txt mirror | Internet + alignment | manifest |
| tts_russian_addresses_rhvoice_4voices | 80.9 | 12.9 | opus+txt mirror | TTS | manifest |
| public_youtube700 | 75.0 | 12.2 | opus+txt mirror | YouTube videos | manifest |
| asr_public_phone_calls_1 | 22.7 | 3.2 | opus+txt mirror | Internet + ASR | manifest |
| asr_public_stories_1 | 4.1 | 0.7 | opus+txt mirror | Public stories | manifest |
| public_series_1 | 1.9 | 0.3 | opus+txt mirror | Public series | manifest |
| public_lecture_1 | 0.7 | 0.1 | opus+txt mirror | Internet + manual | manifest |
| Val | | | | | |
| asr_calls_2_val | 2 | 0.8 | wav+txt mirror | Internet | manifest |
| buriy_audiobooks_2_val | 1 | 0.5 | wav+txt mirror | Books + manual | manifest |
| public_youtube700_val | 2 | 0.13 | wav+txt mirror | YouTube videos + manual | manifest |
| Total | 2,186 | 354 | | | |

Download instructions

End to end

download.sh

or

download.py with this config file. Please check the config first.

Manually

  1. Download each dataset separately:

Via wget

wget https://ru-open-stt.ams3.digitaloceanspaces.com/some_file

For multi-threaded downloads use aria2 with the -x flag, e.g.

aria2c -c -x5 https://ru-open-stt.ams3.digitaloceanspaces.com/some_file

If necessary, merge chunks like this:

cat ru_open_stt_v01.tar.gz_* > ru_open_stt_v01.tar.gz
  2. Download the metadata and manifests for each dataset;
  3. Merge files (where applicable), unpack and enjoy!

Manually (using AzCopy) (2022-03-10)

When downloading large files from Azure, the wget download may restart so often that it becomes impossible to download the largest file, archives/radio_v4_manifest.tar.gz (176 GB).

In that case you can use the AzCopy utility.

Instructions for downloading files with it are here. For the large file mentioned above, run the following command to download the file into the folder where azcopy[.exe] is located:

azcopy[.exe] copy https://azureopendatastorage.blob.core.windows.net/openstt/ru_open_stt_opus/archives/radio_v4_manifest.tar.gz radio_v4_manifest.tar.gz

Annotation methodology

The dataset is compiled using open domain sources. Some audio types are annotated automatically and verified statistically / using heuristics.

Audio normalization

All files are normalized for easier / faster runtime augmentations and processing, as follows (a minimal conversion sketch is given after this list):

  • Converted to mono, if necessary;
  • Converted to 16 kHz sampling rate, if necessary;
  • Stored as 16-bit integers;
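For illustration only, a minimal sketch of this normalization using librosa and scipy (assumed tooling; the original conversion scripts may have differed):

import numpy as np
import librosa
from scipy.io import wavfile

def normalize_audio(src_path, dst_path, target_sr=16000):
    # librosa downmixes to mono and resamples in one call, returning float32 in [-1, 1]
    audio, _ = librosa.load(src_path, sr=target_sr, mono=True)
    # store as 16-bit integers
    wavfile.write(dst_path, target_sr, (audio * 32767).astype(np.int16))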

On disk DB methodology

Each audio file is hashed. The hash is used to create a folder hierarchy for faster filesystem operations.

import hashlib
from pathlib import Path

# `wav` is an int16 numpy array with the audio samples
target_format = 'wav'
wavb = wav.tobytes()

# the SHA-1 hash of the raw bytes defines the file's location on disk
f_hash = hashlib.sha1(wavb).hexdigest()

store_path = Path(root_folder,
                  f_hash[0],
                  f_hash[1:3],
                  f_hash[3:15] + '.' + target_format)

Helper functions

Use helper functions from here for easier work with manifest files.

Read manifests

See example

from utils.open_stt_utils import read_manifest

manifest_df = read_manifest('path/to/manifest.csv')

Merge, check and save manifests

See example

from utils.open_stt_utils import (plain_merge_manifests,
                                  check_files,
                                  save_manifest)
train_manifests = [
 'path/to/manifest1.csv',
 'path/to/manifest2.csv',
]
train_manifest = plain_merge_manifests(train_manifests,
                                        MIN_DURATION=0.1,
                                        MAX_DURATION=100)
check_files(train_manifest)
save_manifest(train_manifest,
             'my_manifest.csv')

How to open opus

The most efficient way we know of to read opus files in Python, without incurring any significant overhead (i.e. launching subprocesses or using a daisy chain of libraries with sox, FFmpeg, etc.), is to use pysoundfile (a Python CFFI wrapper around libsndfile).

When this solution was being researched, the community had been waiting for a major libsndfile release for years. Opus support was implemented upstream some time ago, but had not been properly released. Therefore we opted for a custom build + monkey patching.

By the time you read / use this, there will probably be decent / proper builds of libsndfile.

Building libsndfile

apt-get update
apt-get install cmake autoconf autogen automake build-essential libasound2-dev \
libflac-dev libogg-dev libtool libvorbis-dev libopus-dev pkg-config -y

cd /usr/local/lib
git clone https://github.com/erikd/libsndfile.git
cd libsndfile
git reset --hard 49b7d61
mkdir -p build && cd build

cmake .. -DBUILD_SHARED_LIBS=ON
make && make install
cmake --build .

Patched pysoundfile wrapper

Install pysoundfile: pip install soundfile

import utils.soundfile_opus as sf

path = 'path/to/file.opus'
audio, sr = sf.read(path, dtype='int16')

Known issues

There is an upstream bug in libsndfile that prevents writing large files (90-120 s) with opus / vorbis. It will most likely be fixed by the next major libsndfile release.

Contacts

Please contact us here or just create a GitHub issue!

Authors (in alphabetical order):

  • Anna Slizhikova;
  • Alexander Veysov;
  • Diliara Nurtdinova;
  • Dmitry Voronin;

Acknowledgements

This repo would not be possible without these people:

  • Newest direct download links are a courtesy of Azure Open Datasets;
  • Many thanks to akreal for helping to encode the initial bulk of the data into mp3;
  • 18 hours of ground truth annotation datasets for validation are a courtesy of activebc;

Kudos!

FAQ

0. Why not MP3? MP3 encoding / decoding - DEPRECATED

Encoding

Mostly we used pydub (via ffmpeg) or sox (a much, much faster way) to convert to MP3. We omitted blank files (mostly from YouTube). We used the following parameters:

  • 16kHz;
  • 32 kbps;
  • Mono;

Usually 128-192 kbps is enough for music with a sampling rate of 44 kHz, and 64-96 kbps is enough for speech. But here we have mono, 16 kHz audio and usually only one speaker, so 32 kbps was a good choice. We did not use other formats like .ogg because .mp3 is much more popular.

See example `pydub`

from pydub import AudioSegment

sound = AudioSegment.from_file(temp_path, format="wav")

file_handle = sound.export(store_mp3_path,
                           format="mp3",
                           parameters=["-ar", "16000", "-ac", "1"],
                           bitrate="32k")

See example `sox`

import subprocess
cmd = 'sox "{}" -C 32.01 -c 1 "{}"'.format(
            wav_path,
            store_mp3_path)
    
res = subprocess.call([cmd], shell=True)

if res != 0:
    print('Problems with {}'.format(wav_path))

Decoding

It is up to you, but to save space and spare CPU during training, I would suggest the following pipeline to extract the files:

See example

# you can also use pydub, torchaudio, sox or whatever
# we ended up using scipy for speed
# this example also includes hashing step which is not necessary
import librosa
import hashlib
import numpy as np
from pathlib import Path
from scipy.io import wavfile

def save_wav_diskdb(wav,
                    root_folder='../data/ru_open_stt/',
                    target_sr=16000):
    assert type(wav) == np.ndarray
    assert wav.dtype == np.dtype('int16')
    assert len(wav.shape)==1

    target_format = 'wav'
    wavb = wav.tobytes()

    # f_path = Path(audio_path)
    f_hash = hashlib.sha1(wavb).hexdigest()

    store_path = Path(root_folder,
                      f_hash[0],
                      f_hash[1:3],
                      f_hash[3:15]+'.'+target_format)

    store_path.parent.mkdir(parents=True,
                            exist_ok=True)

    wavfile.write(filename=str(store_path),
                  rate=target_sr,
                  data=wav)

    return str(store_path)

root_folder = '../data/'
# save to int16, mono, 16 kHz to save space
target_dtype = np.dtype('int16')
target_sr = 16000
# librosa reads mp3
wav, sr = librosa.load(source_mp3_path,
                       mono=True,
                       sr=target_sr)

# librosa converts to float32 by default
wav = (wav * 32767).astype(target_dtype) # cast to int

wav_path = save_wav_diskdb(wav,
                           root_folder=root_folder,
                           target_sr=target_sr)

Why not OGG / Opus - DEPRECATED

Even though OGG / Opus is considered better for speech and offers higher compression, we opted for a more conventional, well-known format.

The LPCNet codec also boasts ultra-low-bitrate speech compression, but we decided to opt for a more familiar format to avoid worrying about actually losing signal in compression.

1. Issues with reading files

Maybe try this approach:

See example

import numpy as np
from scipy.io import wavfile

sample_rate, sound = wavfile.read(path)

# normalize to float32 in [-1, 1]
abs_max = np.abs(sound).max()
sound = sound.astype('float32')
if abs_max > 0:
    sound *= 1 / abs_max

2. Why share such a dataset?

We are not altruists, life just is not a zero sum game.

Consider the progress in computer vision, that was made possible by:

  • Public datasets;
  • Public pre-trained models;
  • Open source frameworks;
  • Open research;

STT does not enjoy the same attention from the ML community because it is data-hungry and public datasets are lacking, especially for languages other than English. Ultimately this leads to a worse-off situation for the general community.

3. Known issues with the dataset to be fixed

  • Speaker labels coming soon;
  • Validation sets for new domains: Radio/Public Speech will be added in next releases.

4. Why migrate to OPUS?

After extensive testing, both during training and validation, we confirmed that converting 16 kHz int16 data to OPUS at the very least does not degrade quality.

OPUS is also designed for speech; even at default compression rates it takes less space than MP3 and does not introduce artefacts.

Some people even reported quality improvements when training using OPUS.
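As a rough illustration of such a conversion (assuming the patched soundfile wrapper from this repo exposes the usual soundfile write API; the actual helpers used to build the OPUS release may differ):

import utils.soundfile_opus as sf

# read a normalized 16 kHz int16 wav and re-encode it as OGG/OPUS
audio, sr = sf.read('path/to/file.wav', dtype='int16')
sf.write('path/to/file.opus', audio, sr, format='OGG', subtype='OPUS')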

License


The dataset is released under CC BY-NC; commercial usage is available after an agreement with the dataset authors.

Donations

Donate (each coffee pays for several full downloads) or via open_collective or just use our DO referral link to help.

Commercial inquiries

Further reading

English

Chinese

Russian

open_stt's People

Contributors

akreal, buriy, crazymidnight, emilpi, snakers4


open_stt's Issues

Any more information about the structure of the folders

I have been looking at some of the smaller datasets that are available, for example public_youtube700_val and public_lecture_1. I would like to learn more about what the folder structure means.

For public_youtube700_val I see folders 0-9 and a-f with several subfolders. What do these numbers mean? Are all the audio files in each of these top-level root folders from a different YouTube video, or are they from the same video? Is there a way to determine this?

File reading problems, mp3 release

Just informing the users that, because I saved the files with scipy, different libraries (pydub, torchaudio) sometimes throw different errors, mostly revolving around the absence of metadata in the header.

Though not a problem for me, this will be fixed by an mp3 / torrent version of the dataset.
In the meanwhile, you can use this.

Several questions

Hi.

  1. Is there any gender separation, or did you not track this info?
  2. Are any of these files free from noise and, if so, can I easily track them?
  3. Which dataset has the largest number of different speakers?

Missing text files

Hi!

New mp3 datasets have no text files with annotations. At least it's true for russian_single, ttl_russian_addresses and voxforge datasets.


Total reformatting

We have run some benchmarks with the soundfile library.
So far they seem very promising.

OGG format benches

Compared time for several options on 10,000 samples:

  • wav (from nvme) with scipy.io.wavfile.read - ~3 sec;
  • ogg (from nvme) with soundfile - 57 sec - best option;
  • wav (from hdd raid) with scipy.io.wavfile.read- ~3 min;
  • ogg (from nvme) to wav using sox , then read wav with scipy - ~20 min;
  • wav (from hdd raid) with librosa >30 min (librosa normalizes);
import soundfile as sf
sound_array, sampling_rate = sf.read('test.ogg', dtype='int16')

We are considering the following for v1.2 (not soon anyway):

  • dropping wav format altogether
  • dropping mp3 format altogether
  • converting everything to ogg and using soundfile
  • maybe ditching the direct links (?)

Libraries I used / know

  • scipy.io.wavfile.read - the best option, but wavs take a lot of space. Cannot scale 10x more
  • mp3 decoding - also good, but would occupy additional CPU. All libraries are either slow or bulky (they require fiddling with temporary files, e.g. soxi - you have to be careful, as you can exhaust your nvme drive's resource)
  • torchaudio - seems to be the most bulky library; it also uses ffmpeg, which is slower than soxi
  • librosa - overall the slowest library

Motivation

  • Ogg takes 10x less space
  • Less IO heavy load (but high IOPS nevertheless)
  • Looks like pysoundfile does all this in memory
PySoundFile can read and write sound files. File reading/writing is supported through libsndfile, which is a free, cross-platform, open-source (LGPL) library for reading and writing many different sampled sound file formats that runs on many platforms including Windows, OS X, and Unix. It is accessed through CFFI, which is a foreign function interface for Python calling C code. CFFI is supported for CPython 2.6+, 3.x and PyPy 2.0+. PySoundFile represents audio data as NumPy arrays.

@buriy
Maybe it would be a good idea to do this when we do a major annotation clean-up?

Question on transcriptions

For the STT corpus, you write that some of the annotations are generated automatically and then verified.

Can you elaborate on this process?

Do you have an estimate of how accurate these generated transcripts are?

Opus files are not opus, they are vorbis

Hello!
I found a contradiction.
It seems like some or all of your audio files are encoded as an ogg/vorbis stream instead of ogg/opus.

file from public_lecture_1/0/00

ffprobe 539959b62659.opus
ffprobe version 4.4.1 Copyright (c) 2007-2021 the FFmpeg developers
  built with Apple clang version 13.0.0 (clang-1300.0.29.3)
  configuration: --prefix=/usr/local/Cellar/ffmpeg/4.4.1_3 --enable-shared --enable-pthreads --enable-version3 --cc=clang --host-cflags= --host-ldflags= --enable-ffplay --enable-gnutls --enable-gpl --enable-libaom --enable-libbluray --enable-libdav1d --enable-libmp3lame --enable-libopus --enable-librav1e --enable-librist --enable-librubberband --enable-libsnappy --enable-libsrt --enable-libtesseract --enable-libtheora --enable-libvidstab --enable-libvmaf --enable-libvorbis --enable-libvpx --enable-libwebp --enable-libx264 --enable-libx265 --enable-libxml2 --enable-libxvid --enable-lzma --enable-libfontconfig --enable-libfreetype --enable-frei0r --enable-libass --enable-libopencore-amrnb --enable-libopencore-amrwb --enable-libopenjpeg --enable-libspeex --enable-libsoxr --enable-libzmq --enable-libzimg --disable-libjack --disable-indev=jack --enable-avresample --enable-videotoolbox
  libavutil      56. 70.100 / 56. 70.100
  libavcodec     58.134.100 / 58.134.100
  libavformat    58. 76.100 / 58. 76.100
  libavdevice    58. 13.100 / 58. 13.100
  libavfilter     7.110.100 /  7.110.100
  libavresample   4.  0.  0 /  4.  0.  0
  libswscale      5.  9.100 /  5.  9.100
  libswresample   3.  9.100 /  3.  9.100
  libpostproc    55.  9.100 / 55.  9.100
Input #0, ogg, from '539959b62659.opus':
  Duration: 00:00:02.83, start: 0.000000, bitrate: 47 kb/s
  Stream #0:0: Audio: vorbis, 16000 Hz, mono, fltp, 56 kb/s
    Metadata:
      ENCODER         : libsndfile

Audio: vorbis

A real opus file should look like this:

Input #0, ogg, from 'test.opus':
  Duration: 00:00:05.01, start: 0.000000, bitrate: 65 kb/s
  Stream #0:0: Audio: opus, 48000 Hz, mono, fltp
    Metadata:
      encoder         : Lavc58.134.100 libopus

Audio: opus

Is it a bug? Or maybe I am misunderstanding something?

Thanks for the reply!

Links to YouTube videos

Hey! I'm working on creating a Russian Visual Speech Recognition dataset, so I'm very interested in whether it is possible to provide links to the YouTube videos that were used for creating the YouTube part of open_stt?

private_buriy_audiobooks_2 has no letter ё, but private_buriy_audiobooks_2_val does

Of course, I would not call this a critical issue, but it seems to me that such a peculiarity of the data should be reflected somewhere in the dataset descriptions. If possible, it would be great to get a training dataset with the letter ё.

Unexpected EOF of radio-v4 mp3 archive

Hi,

I downloaded the radio-v4 mp3 tar.gz from the v10 release. The archive was downloaded successfully, but the command

tar -xvzf radio_v4_mp3.tar.gz

ends with the error:

gzip: stdin: unexpected end of file
tar: Unexpected EOF in archive
tar: Unexpected EOF in archive
tar: Error is not recoverable: exiting now

Can't download the file

I'm trying to download the dataset via aria2c, but there have been no seeders for several days. Could someone start seeding, or suggest what else the problem might be if there actually are seeders?

Question on re-sampling

First of all, thanks for the great project!
The documentation says the files were resampled to 16 kHz if needed.

Was there any up-sampling?
For example, 8 kHz -> 16 kHz.

If so, can I get the list of upsampled corpora?

Can't download dataset

Hi. I'm trying to download the dataset on my PC via download.sh

Error: "404: The specified blob does not exist."

The Python way is also broken, and nobody is seeding the torrent.

Upcoming releases and support

We are planning new cool releases sometime in the future (with a twist you are not expecting), soon!

Also, you can now support our initiative directly using Open Collective.

Can't download via torrent

@snakers4, thank you for your work.
I found a torrent client that opens the torrent without errors (Vuze BitTorrent or Deluge), but there are no online peers :(. Or maybe this is the wrong behaviour because of 'unsupported piece size (1 GB)!'?

Problem with the torrent

When opening the torrent with uTorrent, it says "cannot load torrent_name: unsupported piece size (1.0 GB)". When opening the torrent with Transmission, the torrent is not added to the download list.

No txt files in public_speech.tar.gz

Hello, I am using your dataset in a study project. As the issue title says, I did not find the .txt transcriptions in public_speech.tar.gz, although they are listed in public_speech_manifest.csv (downloaded from the torrent). Meanwhile, all the wav files ...

Speaker IDs

Is it possible to extract speaker IDs from your dataset for use in speaker recognition tasks?

Some benchmarks on the datasets

Below I will post some of the results on the public part of the dataset, both train and validation.

Hope this will inspire the community to share their results and models.

Double letters and hyphens in the test data

The test sets contain many doubled letters, e.g.

что можно эксплуатировать людеей

and hyphens are missing, which makes the output not quite readable.

I propose the following corrections to asr_calls_2_val and public_youtube700_val:

text-youtube.txt
text-calls.txt

I can convert them to another format if needed.

Number of speakers

Hi! I created this issue because I would like to know the speaker variability in this dataset. Approximately how many training speakers are there?

Thanks in advance!

Experience using open_stt to train phone-call recognition with DeepSpeech

Situation: I am experimenting with speech recognition for phone calls. The source quality is 8 kHz (screenshot omitted). I have only about 3.5 hours of my own annotated data.

My strategy:

  • Exclude poorly annotated recordings according to public_exclude_file_v5.csv, and exclude audio shorter than 3 or longer than 10 seconds.
  • Convert everything that remains to the required format (before/after screenshots omitted).
  • Split all recordings 80% train / 20% dev. The final check is done on my own annotated data; since there is little of it, it does not participate in training at all. The language model is also built from my own data, since there are many domain-specific words. Then I train with DeepSpeech.

Results on public_youtube1120_hq

182,853 recordings excluded, 186,770 kept for training (approximately 138 hours). Training yields a dev validation loss of 69.137811. Running the model on my own data gives a loss about twice as high: 126.084709 (WER: 0.594781, CER: 0.404845).

Results on asr_public_phone_calls_2

410,507 recordings excluded, 193,290 kept for training (approximately 140 hours). Training yields a dev validation loss of 58.780545. Running the model on my own data again gives a loss about twice as high: 120.485580 (WER: 0.582977, CER: 0.405931).

Results of combining public_youtube700, public_youtube1120_hq, asr_public_phone_calls_1, asr_public_phone_calls_2 and asr_calls_2_val

After all exclusions this gives roughly 670 hours. Training yields a dev validation loss of 46.687967. Running the model on my own data gives loss: 91.705399 (WER: 0.383752, CER: 0.268075).
Here I decided to play with augmentation a bit, but contrary to my expectations it only made things worse.
Aggressive augmentation (--data_aug_features_additive=0.3 --data_aug_features_multiplicative=0.3 --augmentation_freq_and_time_masking=True --augmentation_speed_up_std=0.3 --augmentation_pitch_and_tempo_scaling=True): loss: 103.976425 (WER: 0.455561, CER: 0.306885).
Non-aggressive augmentation (--data_aug_features_additive=0.1 --data_aug_features_multiplicative=0.1 --augmentation_freq_and_time_masking=True --augmentation_speed_up_std=0.1 --augmentation_pitch_and_tempo_scaling=True): loss: 94.959457 (WER: 0.388207, CER: 0.267716). After that I continued the experiments without any augmentation.

Results of combining public_youtube700, public_youtube1120_hq, asr_public_phone_calls_1, asr_public_phone_calls_2, asr_calls_2_val + private_buriy_audiobooks_2, radio_2

After all exclusions this gives roughly 2,130 hours. Training yields a dev validation loss of 35.213670. Running the model on my own data gives loss: 83.400711 (WER: 0.358523, CER: 0.273550).

Questions about my results:

  1. Did I convert the wav files to a single format correctly? Should I check any other audio parameters? The conversion looks roughly like this:
import librosa
import soundfile as sf

sr = 8000

y, s = librosa.load(fname, sr=sr)
sf.write(fname_8k, y, sr, subtype='PCM_16')
  2. Any advice on adding more datasets from open_stt? It is confusing that after adding the huge private_buriy_audiobooks_2 and radio_2 datasets the training set grew from 670 to 2,130 hours, but model accuracy barely improved (WER 0.38 → 0.36).
  3. Any advice on augmentation? I train on open annotated data but apply the model to phone calls.

Thanks in advance.

v1.0 torrent download issues

You can download separate files via torrent. Try several torrent clients if some do not work. It looks like, due to the large chunk size, most conventional torrent clients just fail silently. This is not a problem (re-calculating the torrent takes much time, and some people have downloaded it already):

apt update
apt install aria2
# list the torrent files
aria2c --show-files ru_open_stt_wav_v10.torrent
# download only one file
aria2c --select-file=4 ru_open_stt_wav_v10.torrent
# for more options visit
# https://aria2.github.io/manual/en/html/aria2c.html#basic-options
# https://aria2.github.io/manual/en/html/aria2c.html#bittorrent-metalink-options
# https://aria2.github.io/manual/en/html/aria2c.html#bittorrent-specific-options

If you are using Windows 10, you may use the Windows Subsystem for Linux to run these commands.
If you are using older versions - please stop using them; please post your solution in this thread.
For Mac users there is no redemption; probably there is some hack to use aria2 as well.

Seeders, please come back

Can't download public_youtube1120.tar.gz from the torrent: the downloading process stopped at 107 GB and there are no seeders to download more.

Ordering of the audio files.

Hi,
This might be a stupid question, but I was trying to use the radio_2 dataset and, upon downloading it (mp3+txt), I am having trouble understanding how the audio files correlate with each other. My questions are:

  1. Are the audio files in the folders continuous? And are they alphabetically continuous?
     For example, if I am in folder 00 and I play an audio file 0d5d45dbfeba.mp3 and then the alphabetically next one, 0e4154bfc77c.mp3, is the content continuous, i.e. has the actual content been segmented into these files?
     If not, are the mp3 files ordered, and if yes, what is the ordering?

  2. Some segmentations seem to have been broken down into letters. This is normally seen at the start and end of the sentence. I am assuming this is because words have been cut. My question here is: was this transcription done by automated software?

This is an amazing dataset. I really appreciate any help on the above issues.
Regards,

Torrent announcement

Still struggling to find a headless client for Ubuntu 20 that would work without crashing.
Testing aria2c for seeding now, as it performed so well for downloading.

Download error

I tried to download the datasets. I launched download.py but got this:
"urllib.error.HTTPError: HTTP Error 404: The specified blob does not exist."
How do I fix it?

Easy download script in bash / python

It would be nice if someone from the community wrote one or a couple of scripts in bash / python that would do the following (a rough sketch of the encoding / manifest check is given after the list):

  • Download the dataset;
  • Check md5 sums;
  • Check manifests;
  • Check empty / low contrast files;
  • Check wav encoding (sr=16kHz, int16);
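A rough sketch of the last two checks (a hypothetical helper, not an official script; it assumes the first manifest column holds the wav path):

import csv
import wave
from pathlib import Path

def check_manifest(manifest_path):
    """Return (path, problem) pairs for missing or wrongly encoded wav files."""
    problems = []
    with open(manifest_path) as f:
        for row in csv.reader(f):
            wav_path = Path(row[0])  # assumed: first column is the audio path
            if not wav_path.is_file():
                problems.append((wav_path, 'missing'))
                continue
            with wave.open(str(wav_path)) as w:
                # expect 16 kHz, int16 (2 bytes per sample), mono
                if (w.getframerate(), w.getsampwidth(), w.getnchannels()) != (16000, 2, 1):
                    problems.append((wav_path, 'unexpected encoding'))
    return problems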

Files with poor annotation

I will be posting here, from time to time, lists of files to be excluded from the dataset.
Such lists are obtained by training models and sifting through files with a higher than expected CER (a toy illustration of such a filter is sketched below).
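A toy illustration of such a filter (hypothetical; it assumes the editdistance package and an in-memory list of model predictions):

import editdistance  # pip install editdistance

def cer(ref: str, hyp: str) -> float:
    # character error rate: edit distance normalized by reference length
    return editdistance.eval(ref, hyp) / max(len(ref), 1)

# toy predictions: (utterance id, reference transcript, model hypothesis)
predictions = [
    ('utt_001', 'привет мир', 'привет мир'),
    ('utt_002', 'добрый день', 'дбрый'),
]

# utterances with a suspiciously high CER are candidates for exclusion / manual review
suspicious = [utt for utt, ref, hyp in predictions if cer(ref, hyp) > 0.5]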
