laborotvspeech's People

Contributors

hfujihara, s-and-o

laborotvspeech's Issues

Too long utterances in the dev set

I found that the dev set includes some overly long utterances.
I suggest removing them from the official development data.
Specifically, I propose removing every utterance longer than 30 seconds, i.e., the top 33 utterances in the following list (the 34th line is the first one under the limit).

$ sort -k 2 -nr dump/raw/dev/utt2num_samples | head -n 34
v001_dev_RlFGUdfe 2659360
v001_dev_tVD60s3X 2603200
v001_dev_XcWLZ8Ds 2284800
v001_dev_m3w23nXF 1494400
v001_dev_yhU29kG8 1464960
v001_dev_yIo4ACN1 1390720
v001_dev_rUVXkS0Z 1346560
v001_dev_ebeV3qs8 1223840
v001_dev_PPdlAyGz 1205120
v001_dev_6JZPzq4k 1130560
v001_dev_g8vWy3VB 987040
v001_dev_BYca8AqR 982560
v001_dev_kpypOOpB 966240
v001_dev_fwWvJhdZ 939840
v001_dev_YVVAj1Ml 939520
v001_dev_m9vLDdIr 743520
v001_dev_tvigUeHB 694080
v001_dev_puL8hGoC 689120
v001_dev_eqpJcjYj 681920
v001_dev_IroOq3x7 673920
v001_dev_tbyYEkw2 562240
v001_dev_gt3MQcrd 560160
v001_dev_Xkf0hGjo 527200
v001_dev_yIkO2OlI 522560
v001_dev_xL5OF2nG 515680
v001_dev_dfcYXj1r 506880
v001_dev_YUtqHgLy 504800
v001_dev_bJ4u7YZd 500800
v001_dev_I7fBUbhG 497600
v001_dev_6lwD3FOm 487680
v001_dev_hEaNS9HA 486240
v001_dev_uGKSJThT 483200
v001_dev_5SMfe193 480160
v001_dev_g51KQbqF 479200
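
The proposed filter could be sketched as follows. This is only an illustration, not part of the recipe: it assumes the audio is sampled at 16 kHz, so that 30 seconds corresponds to 480000 samples, which is consistent with the top 33 entries above. The function name `split_by_duration` and the constants are hypothetical.

```python
from typing import Iterable, List, Tuple

# Assumption: 16 kHz audio, so 30 s == 480000 samples.
SAMPLE_RATE_HZ = 16_000
MAX_SECONDS = 30.0


def split_by_duration(
    lines: Iterable[str],
    sample_rate_hz: int = SAMPLE_RATE_HZ,
    max_seconds: float = MAX_SECONDS,
) -> Tuple[List[str], List[str]]:
    """Split utt2num_samples lines into (kept, removed) utterance IDs.

    Each line is expected to look like "<utt_id> <num_samples>".
    Utterances strictly longer than max_seconds go to `removed`.
    """
    max_samples = int(sample_rate_hz * max_seconds)
    kept: List[str] = []
    removed: List[str] = []
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        utt_id, num_samples = parts[0], int(parts[1])
        if num_samples > max_samples:
            removed.append(utt_id)
        else:
            kept.append(utt_id)
    return kept, removed


if __name__ == "__main__":
    # The last two entries from the list above, on either side of the limit.
    sample = [
        "v001_dev_5SMfe193 480160",
        "v001_dev_g51KQbqF 479200",
    ]
    kept, removed = split_by_duration(sample)
    print("kept:", kept)
    print("removed:", removed)
```

To apply it to the real file, the lines of dump/raw/dev/utt2num_samples can be passed in place of `sample`.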

Error during `local/tedx-jp/10k_data_prep.sh`

When I run local/tedx-jp/10k_data_prep.sh, it fails with the following error:

local/tedx-jp/10k_data_prep.sh: All required files exist in tedx-jp.
local/tedx-jp/10k_data_prep.sh: Creating a directory containing all valid subtitle segments.
2020-11-19 20:25:05,588 (make_raw_dir:235) INFO: done making data/tedx-jp-all_raw
fix_data_dir.sh: kept all 273 utterances.
fix_data_dir.sh: old files are kept in data/tedx-jp-all_raw_whole/.backup
patching file data/local/tedx-jp-10k_verbatim.tmp/text_content
Hunk #90 FAILED at 2657.
1 out of 320 hunks FAILED -- saving rejects to file data/local/tedx-jp-10k_verbatim.tmp/text_content.rej

It seems that the patch file does not match the downloaded subtitle text: hunk #90 (of 320) is rejected.
Could you let me know how to resolve this?

English and Music in TEDxJP 10k corpus

The TEDxJP 10k corpus contains some data that is inappropriate for evaluating Japanese speech recognition.

In the following videos, Japanese speakers talk in English, but the corpus uses subtitles that were automatically translated into Japanese.

In addition, the following videos contain music, and the corpus includes the intervals where the music plays.

It may be better to remove these videos from the corpus.
