
xinjli / allosaurus

514 stars · 25 watchers · 84 forks · 468 KB

Allosaurus is a pretrained universal phone recognizer for more than 2000 languages

License: GNU General Public License v3.0

Python 100.00%
Topics: speech, speech-recognition, pytorch, phonetics

allosaurus's People

Contributors

ajd12342, kormoczi, raotnameh, saikrishnarallabandi, steveway, willstott101, xinjli, zaidsheikh


allosaurus's Issues

Deterministic output

I noticed that there is some variability in the output from call to call. For example, I just ran the same 15 second sample 10 times and the output contained varying numbers of phones:

[197, 198, 200, 199, 196, 195, 203, 195, 198, 197]

Is it possible to configure/modify the code slightly to generate deterministic results? I'm not sure, but I suspect this has something to do with Torch.
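One likely culprit is unseeded RNG state in PyTorch. A hedged sketch of pinning it down before calling the recognizer (these are standard PyTorch calls, but whether they remove all run-to-run variability in allosaurus specifically is untested):

```python
import random

import numpy as np
import torch

def make_deterministic(seed: int = 0) -> None:
    """Seed the RNGs PyTorch code commonly touches and, where the
    installed PyTorch supports it, force deterministic kernels."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    try:
        # Available since PyTorch 1.8 (warn_only since 1.11).
        torch.use_deterministic_algorithms(True, warn_only=True)
    except (AttributeError, TypeError):
        pass  # older PyTorch: seeding alone may still help

# Two seeded draws come out identical:
make_deterministic(0)
a = torch.randn(3)
make_deterministic(0)
b = torch.randn(3)
```

On GPU, additional settings (e.g. the CUBLAS_WORKSPACE_CONFIG environment variable) may also be required for full determinism.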

Input wav file as a BytesIO object not working

Hi,
I wanted to read the speech wav from a BytesIO object, but it does not work because of the assert on line 65 of app.py (the filename must have a ".wav" extension). I tried giving the BytesIO object a filename, but that did not help either (I don't really understand why). If I comment out the line mentioned above, everything works, but I would like a more appropriate / robust solution that does not require modifying the original code.
Do you have any suggestions or advice?
Thanks!
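Until the assert is relaxed, one workaround that avoids touching app.py is to spill the BytesIO buffer into a temporary file with a ".wav" suffix and hand that path to the recognizer (the helper below is a sketch of mine, not an allosaurus API):

```python
import io
import tempfile
import wave

def bytesio_to_wav_path(buf: io.BytesIO) -> str:
    """Dump an in-memory wav into a real *.wav file so path-based
    extension checks pass; the caller should delete it afterwards."""
    tmp = tempfile.NamedTemporaryFile(suffix=".wav", delete=False)
    tmp.write(buf.getvalue())
    tmp.close()
    return tmp.name

# Build a tiny 16 kHz mono wav in memory for demonstration.
buf = io.BytesIO()
with wave.open(buf, "wb") as wf:
    wf.setnchannels(1)
    wf.setsampwidth(2)      # 16-bit PCM
    wf.setframerate(16000)
    wf.writeframes(b"\x00\x00" * 1600)  # 0.1 s of silence

path = bytesio_to_wav_path(buf)
# path ends in ".wav", so the extension assert is satisfied; it can be
# passed to the recognizer in place of the BytesIO object.
```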

Issue with using dependencies numpy with numba and panphon

Hi, when I try to run the package, I get an error from panphon stating that numpy needs to be greater than 1.20.2, but if I upgrade numpy, I get an error stating that numba only works with numpy between 1.17 and 1.20.

EDIT: I had to upgrade numba.

Loss for recognizing a part of audio

Hi,

I'm trying to recognize the audio file below with lang_id='jpn', emit=1, timestamp=True, but nothing is generated between 7.290 s and 13.170 s, a span that contains about two audio clips:
drive link of the audio file

Could you please have a look at this?

By the way, I found that the generated duration always seems to be 0.045 s. Could you please give some tips for improving it, such as modeling the transition between two phones, or distinguishing vowels and consonants?

Thank you

Phone times?

Would it be straightforward to modify Allosaurus to return the approximate times of the recognized phones?

Also, I’m a novice in this area, but for what it’s worth, very impressive tool!

Incomplete phone inventory for iso gup

Description:

The phone inventory for Kunwinjku (iso gup) is incomplete. The output of python -m allosaurus.list_phone --lang gup is:

['a', 'e', 'i', 'j', 'l', 'm', 'n', 'o', 'r', 'u', 'w', 'ŋ', 'ɭ', 'ɳ', 'ɻ', 'ʔ']

However, Phoible lists the complete inventory as:

Phoneme   Allophones        Inventory
m         m                 Gunwinggu (PH 883)
i         ɪ i               Gunwinggu (PH 883)
j         j                 Gunwinggu (PH 883)
u         ʊ u               Gunwinggu (PH 883)
a         ʌ ai au a         Gunwinggu (PH 883)
w         w                 Gunwinggu (PH 883)
n         n                 Gunwinggu (PH 883)
l         l                 Gunwinggu (PH 883)
b         p pʰ b            Gunwinggu (PH 883)
ŋ         ŋ                 Gunwinggu (PH 883)
e         ɛ æ e             Gunwinggu (PH 883)
o         ɔ ɒ o             Gunwinggu (PH 883)
ɡ         k kʰ ɡ            Gunwinggu (PH 883)
r         r                 Gunwinggu (PH 883)
ɲ         ɲ                 Gunwinggu (PH 883)
ʔ         ʔ                 Gunwinggu (PH 883)
d̪         t̪ t̪ʰ d̪            Gunwinggu (PH 883)
ɳ         ɳ                 Gunwinggu (PH 883)
ɭ         ɭ                 Gunwinggu (PH 883)
ɻ         ɻ                 Gunwinggu (PH 883)
ɖ         ɖ                 Gunwinggu (PH 883)
ɽ         ɽ                 Gunwinggu (PH 883)
ʎ         ʎ                 Gunwinggu (PH 883)
dʲ        tʲ tʲʰ dʲ         Gunwinggu (PH 883)

https://phoible.org/inventories/view/883

Expected behavior

I would expect the allosaurus model inventory for iso gup to be:

['a', 'e', 'i', 'j', 'l', 'm', 'n', 'o', 'r', 'u', 'w', 'ŋ', 'ɭ', 'ɳ', 'ɻ', 'ʔ', 'ɪ', 'ʊ', 'ʌ', 'ai', 'au',  'b', 'p', 'pʰ', 'ɛ','æ', 'ɔ', 'ɒ', 'ɡ', 'k', 'kʰ', 'ɲ', 'd̪', 't̪', 't̪ʰ', 'ɖ', 'ɽ', 'ʎ', 'dʲ', 'tʲ', 'tʲʰ']

Runtime Error

Thanks for sharing the code. When running it, I encountered the following runtime error:

Traceback (most recent call last):
  File "/opt/anaconda3/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/opt/anaconda3/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/Users/jiawen/Google Drive/WorkSpace/github/allosaurus/allosaurus/run.py", line 61, in <module>
    phones = recognizer.recognize(args.input, args.lang, args.topk)
  File "/Users/jiawen/Google Drive/WorkSpace/github/allosaurus/allosaurus/app.py", line 69, in recognize
    tensor_batch_lprobs = self.am(tensor_batch_feat, tensor_batch_feat_len)
  File "/opt/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/Users/jiawen/Google Drive/WorkSpace/github/allosaurus/allosaurus/am/allosaurus_torch.py", line 88, in forward
    hidden_pack_sequence, _ = self.blstm_layer(pack_sequence)
  File "/opt/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/opt/anaconda3/lib/python3.7/site-packages/torch/nn/modules/rnn.py", line 573, in forward
    self.num_layers, self.dropout, self.training, self.bidirectional)
RuntimeError: Expected object of scalar type Float but got scalar type Double for argument #2 'mat1' in call to _th_addmm

I used Python 3.7 and torch 1.5. It seems to be a package version problem; could you please list all your package versions?
Thanks
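This RuntimeError typically means a float64 NumPy array reached the LSTM: torch.from_numpy preserves the NumPy dtype, while nn.Module weights default to float32. A hedged sketch of the usual fix, casting the feature matrix before it becomes a tensor (the variable names here are made up, not allosaurus's):

```python
import numpy as np

# Feature extraction pipelines often yield float64 by default.
features = np.random.randn(100, 40)  # hypothetical (frames, dims) matrix
assert features.dtype == np.float64

# Cast once before torch.from_numpy(...): the resulting tensor keeps the
# NumPy dtype, and float32 matches the model's default weight dtype.
features32 = features.astype(np.float32)
```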

Phoneme Boundaries

Hi,

Thank you for putting up the code open-source.

I have a question: is it possible to add word boundaries to the recognized phoneme sequence?

For example:

Transcript for a wav file (german): schau mal hin ist das dorf noch nicht zu sehen
Phonemes Recognized: | ʃ a ʊ h m a l h ɪ n ɪ s t d a s d ɔ ə f n ɔ x n ɪ x t s u z e h ə n
Phonemes with word boundaries: * | ʃ a ʊ h* m a l * h ɪ n* * ɪ s t* * d a s* * d ɔ ə f* * n ɔ x* * n ɪ x t* * s u* * z e h ə n*

Not sure if I am missing something.

Thank you.

Phone distance metric

Thanks for all your work on allosaurus. It's a really great resource!

For comparing the similarity of two phonetic sequences, I've been using simple Jaccard distance, but it would be nice to use a metric that is sensitive to the fact that not all phoneme pairs are equally similar. Can you recommend a resource that would allow for this kind of distance metric?

Thank you!

assert wave_path.exists()

Hi, so I'm running into what should be a simple problem, but I simply can't figure out what I'm doing wrong.

I run the following command
python -m allosaurus.bin.prep_feat --path='C:\Users\maria\Allo\train'

I wanted to test with just a few samples to make sure I have everything working before using the complete dataset, but I'm stuck on this stage.

The wave txt file for the train directory contains

utt_1 C:\Users\maria\Allo\koyi\crdo-KKT_CONVERSATION_CONVERSATIONs5.wav
utt_2 C:\Users\maria\Allo\koyi\crdo-KKT_CONVERSATION_CONVERSATIONs6.wav
utt_3 C:\Users\maria\Allo\koyi\crdo-KKT_CONVERSATION_CONVERSATIONs7.wav
utt_4 C:\Users\maria\Allo\koyi\crdo-KKT_CONVERSATION_CONVERSATIONs8.wav
utt_5 C:\Users\maria\Allo\koyi\crdo-KKT_CONVERSATION_CONVERSATIONs9.wav
utt_6 C:\Users\maria\Allo\koyi\crdo-KKT_CONVERSATION_CONVERSATIONs10.wav
utt_7 C:\Users\maria\Allo\koyi\crdo-KKT_CONVERSATION_CONVERSATIONs12.wav
utt_8 C:\Users\maria\Allo\koyi\crdo-KKT_CONVERSATION_CONVERSATIONs14.wav
utt_9 C:\Users\maria\Allo\koyi\crdo-KKT_CONVERSATION_CONVERSATIONs15.wav
utt_10 C:\Users\maria\Allo\koyi\crdo-KKT_CONVERSATION_CONVERSATIONs16.wav

but I keep getting the error

Traceback (most recent call last):
File "C:\Users\maria\AppData\Local\Programs\Python\Python39\lib\runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "C:\Users\maria\AppData\Local\Programs\Python\Python39\lib\runpy.py", line 87, in _run_code
exec(code, run_globals)
File "C:\Users\maria\AppData\Local\Programs\Python\Python39\lib\site-packages\allosaurus\bin\prep_feat.py", line 57, in <module>
assert wave_path.exists(), "the path directory should contain a wave file, please check README.md for details"
AssertionError: the path directory should contain a wave file, please check README.md for details

Is my wave file just formatted incorrectly and that's why I keep getting an error about no wav files existing? Is my command line argument the reason? Thank you for any help.

Timestamps for phones?

It would be wonderful to optionally be able to retrieve the timestamps for the phonemes. Is that possible?

[edit: I see this suggestion #20 would it be possible to add this option to the code?]

support for python 3.10

I don't know why, but it does not work in Python 3.10. Will 3.10 be supported in the future?

Optimizing for Latency

Have the authors considered any approaches to reduce the latency of this approach?

I'd be interested to know whether any avenues have been pursued (e.g., distilling into a more performant architecture).

Thanks!

Unable to use the model

Hello, Thank you for the great repo.

I am unable to run it. Could you please help me fix the issue? I tried the following commands.

I am using Miniconda:

  1. Install allosaurus
    pip install allosaurus

  2. Run inference

(deepspeech_v0.7.4) [email protected]@wika:~$ python -m allosaurus.run -i deM23-44.wav
Traceback (most recent call last):
  File "/home/LTLab.lan/agarwal/miniconda3/envs/deepspeech_v0.7.4/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/LTLab.lan/agarwal/miniconda3/envs/deepspeech_v0.7.4/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/LTLab.lan/agarwal/miniconda3/envs/deepspeech_v0.7.4/lib/python3.6/site-packages/allosaurus/run.py", line 21, in <module>
    if len(get_all_models()) == 0:
  File "/home/LTLab.lan/agarwal/miniconda3/envs/deepspeech_v0.7.4/lib/python3.6/site-packages/allosaurus/model.py", line 13, in get_all_models
    assert len(models) > 0, "No models are available, you can maually download a model with download command or just run inference to download the latest one automatically"
AssertionError: No models are available, you can maually download a model with download command or just run inference to download the latest one automatically
  3. Download the model
(deepspeech_v0.7.4) [email protected]@wika:~$ python -m allosaurus.bin.download_model -m latest
downloading model  latest
from:  https://www.pyspeech.com/static/model/recognition/allosaurus/latest.tar.gz
to:    /home/LTLab.lan/agarwal/miniconda3/envs/deepspeech_v0.7.4/lib/python3.6/site-packages/allosaurus/bin/pretrained
please wait...
(deepspeech_v0.7.4) [email protected]@wika:~$
  4. Run inference again
(deepspeech_v0.7.4) [email protected]@wika:~$ python -m allosaurus.run -i deM23-44.wav
Traceback (most recent call last):
  File "/home/LTLab.lan/agarwal/miniconda3/envs/deepspeech_v0.7.4/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/LTLab.lan/agarwal/miniconda3/envs/deepspeech_v0.7.4/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/LTLab.lan/agarwal/miniconda3/envs/deepspeech_v0.7.4/lib/python3.6/site-packages/allosaurus/run.py", line 21, in <module>
    if len(get_all_models()) == 0:
  File "/home/LTLab.lan/agarwal/miniconda3/envs/deepspeech_v0.7.4/lib/python3.6/site-packages/allosaurus/model.py", line 13, in get_all_models
    assert len(models) > 0, "No models are available, you can maually download a model with download command or just run inference to download the latest one automatically"
AssertionError: No models are available, you can maually download a model with download command or just run inference to download the latest one automatically

Allosaurus function to perform phoneme recognition without having to run the library as an executable

Hi,

Currently, the only way to perform phoneme recognition with allosaurus is to run a command in a CLI-type interface with the following structure: python -m allosaurus.run [--lang <language name>] [--model <model name>] [--device_id <gpu_id>] [--output <output_file>] -i <audio file/directory>.

It would be great if there was a function within the library that can also do something similar for example
from allosaurus.app import read_recognizer, speech_recognizer
...
phoneme_seq = speech_recognizer.recognize(model_name, speech_wav_file, other_config)
...

allosaurus results for Persian language

Hi, I'm trying to use allosaurus for the Persian language, but the results are not accurate at all!

here is an example:
model.recognize("source.wav", "pes")
returned result is:
f l a l ə m a l n ɪ k a m a n a ŋ t b a ʃ p uː x t ɔ l t b a ɪ s t ɔ n ə
but it should be like:
s a l ə m m a n m ɪ t a v ə n a m f a r s ɪ s uː x b a t ɔ o n a m

The source.wav has been attached.

What should I do? How can I improve the results?

Build issue

Hi,

Looks like a great program.

However, I was having trouble building allosaurus. I am on Ubuntu 16.04 and when I do 'pip install allosaurus' I get

Building wheel for llvmlite (setup.py) ... error
ERROR: Command errored out with exit status 1:
command: /usr/bin/python -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-r4C97N/llvmlite/setup.py'"'"'; file='"'"'/tmp/pip-install-r4C97N/llvmlite/s
etup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(file);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, file, '"'"'exec'"'"'))' bdist_whee
l -d /tmp/pip-wheel-AzDXdO
cwd: /tmp/pip-install-r4C97N/llvmlite/
Complete output (7 lines):
running bdist_wheel
/usr/bin/python /tmp/pip-install-r4C97N/llvmlite/ffi/build.py
File "/tmp/pip-install-r4C97N/llvmlite/ffi/build.py", line 122
raise ValueError(msg.format(_ver_check_skip)) from e
^
SyntaxError: invalid syntax
error: command '/usr/bin/python' failed with exit status 1

ERROR: Failed building wheel for llvmlite

On the web, there were suggestions to use 'python -m pip ...' and also to install llvm. I tried both, but it didn't help.

Appreciate any help

Support Speaker Diarization

Hello,
As you can see here I've started integrating this project into Papagayo-NG:
morevnaproject-org/papagayo-ng#49
The first results from my tests seem to be very promising.
Especially the new timestamp feature is helping a lot with that.

Is it possible to add some speaker separation to this?
Papagayo-NG itself allows several speakers for one audio file.
If we could recognize which parts are spoken by a separate speaker, that would make this a really nice solution for even more animators.
I've taken a look at the topic, and it seems to be quite complex.
If this could be integrated to Allosaurus then that would be awesome of course.
If not there would be ways to get this into Papagayo-NG, we could do a separate pass over the audio.
I've taken a look and pyAudioAnalysis seems to already do that.
But that would be a big dependency addition.

How do the different phonemes sound exactly? (Preparation for fine-tuning...)

Hi,

When I use allosaurus with the eng2102 model on an English wav file, the results look quite good (although there is one issue: if there is no silence at the beginning of the wav file, some phonemes from the beginning of the speech are missing. I am still testing this and may open a separate issue on the topic later).

But when I use the universal model for a Hungarian wav file, the results are not so good (of course, I know it is not a very well known language ;-)).
So I would like to fine-tune the model. But for this, I need to create the text files about the phonemes of the sentences. As it is stated in the doc, the phones here should be restricted to the phone inventory of my target language.
The phone inventory for the Hungarian language is the following:
aː b bː c d dː d̠ d̪ d̪ː d̻ eː f fː h hː i iː j jː k kː l lː l̪ l̪ː m mː n nː n̪ n̪ː o oː p pː r rː r̪ r̪ː s sː s̪ s̻ t tː t̠ t̪ t̪ː t̻ u uː v vː w y yː z zː z̪ z̻ æ ø øː ɑ ɒ ɔ ɛ ɟ ɡ ɡː ɲ ɲː ɾ ʃ ʃː ʒ ʒː ʝ ʝː
But there are some phonemes here that I cannot identify.
Here is the explanation for the IPA signs for the Hungarian language:
https://hu.wikipedia.org/wiki/IPA_magyar_nyelvre
(unfortunately, it is in Hungarian, but the IPA signs are easy to find...)
Can you help me to understand this, or give me a link to any document, describing these phonemes?

Thanks!

Support for custom phoneme symbols beyond IPA

Hi,

Not sure if this is on the roadmap, but it would be super cool to have a way to provide a custom set of symbols to represent phonemes and their mappings to phones. Probably a function/layer to support IPA to custom phoneme set mapping should be sufficient for this requirement.

Not able to transcribe simple word what in English

The issue

I am currently trying to use Allosaurus to help a Speech Language Pathologist perform transcriptions, but I am having issues getting the application to recognize the word what, let alone longer WAV files with more complex sentences in them. Attached is the WAV file. The output I get from Allosaurus is:

~/Downloads❯ python -m allosaurus.run -i what.wav --model eng2102 --lang eng

~/Downloads❯ 

I even installed the eng2102 model.

~/Downloads❯ python -m allosaurus.bin.list_model
Available Models
- uni2005 (default)
~/Downloads❯ python -m allosaurus.bin.download_model -m eng2102
downloading model  eng2102
from:  https://github.com/xinjli/allosaurus/releases/download/v1.0/eng2102.tar.gz
to:    /home/filbot/.local/lib/python3.9/site-packages/allosaurus/pretrained
please wait...
~/Downloads❯ python -m allosaurus.bin.list_model               
Available Models
- uni2005 (default)
- eng2102

It was recorded using a Tascam DR-40X using WAV 32bit then transferred over to a Pop!_OS Linux System.

Python Version

~/Downloads❯ python -V
Python 3.9.7

Pop!_OS Version

~/Downloads❯ neofetch
filbot@pop-os
-------------
OS: Pop!_OS 21.10 x86_64
Host: Oryx Pro oryp6
Kernel: 5.15.23-76051523-generic
Uptime: 1 hour, 40 mins
Packages: 2857 (dpkg), 90 (flatpak)
Shell: zsh 5.8
Resolution: 1920x1080
DE: GNOME 40.5
WM: Mutter
WM Theme: Pop
Theme: Pop-dark [GTK2/3]
Icons: Pop [GTK2/3]
Terminal: gnome-terminal
CPU: Intel i7-10875H (16) @ 5.100GHz
GPU: Intel CometLake-H GT2 [UHD Graphics]
Memory: 3052MiB / 31977MiB

what.wav file.
what.wav.zip

The question

I feel like I'm not doing something correctly. Do I need to train allosaurus to listen for English sounds as well? I expect to see something similar to wʌt.

Checking the full language list raised IndexError

While checking the full language list with python -m allosaurus.bin.list_lang, the executable raised an IndexError. The following is the error message:

Traceback (most recent call last):
  File "/home/yist/.pyenv/versions/3.8.5/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/yist/.pyenv/versions/3.8.5/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/yist/.pyenv/versions/3.8.5/lib/python3.8/site-packages/allosaurus/bin/list_lang.py", line 13, in <module>
    model_path = get_model_path(args.model)
  File "/home/yist/.pyenv/versions/3.8.5/lib/python3.8/site-packages/allosaurus/model.py", line 27, in get_model_path
    resolved_model_name = resolve_model_name(model_name)
  File "/home/yist/.pyenv/versions/3.8.5/lib/python3.8/site-packages/allosaurus/model.py", line 76, in resolve_model_name
    return models[0].name
IndexError: list index out of range

audio file size limit?

Is there any limit on the size of the audio file? I tried some files with 6 minutes of data; they processed but didn't give any output.

Does allosaurus handle mixed speech and non-speech data?

Hi, Thanks again for a great program

I tried to run allosaurus on approximately a 15 minute TED talk and got the following error. From the same talk, I extracted a 5 second speech excerpt, and allosaurus seemed to work. Did allosaurus crash because the TED talk starts with about 12 seconds of music? Here's the error message:

python -m allosaurus.run -i ~/datasets/tedlium3-wav/NaliniNadkarni_2009.wav
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/local/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/dsitaram/Py3.7venv/allosaurus/lib/python3.7/site-packages/allosaurus/run.py", line 59, in <module>
    phones = recognizer.recognize(args.input, args.lang)
  File "/home/dsitaram/Py3.7venv/allosaurus/lib/python3.7/site-packages/allosaurus/app.py", line 56, in recognize
    tensor_batch_lprobs = self.am(tensor_batch_feat, tensor_batch_feat_len)
  File "/home/dsitaram/Py3.7venv/allosaurus/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/dsitaram/Py3.7venv/allosaurus/lib/python3.7/site-packages/allosaurus/am/allosaurus_torch.py", line 88, in forward
    hidden_pack_sequence, _ = self.blstm_layer(pack_sequence)
  File "/home/dsitaram/Py3.7venv/allosaurus/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/dsitaram/Py3.7venv/allosaurus/lib/python3.7/site-packages/torch/nn/modules/rnn.py", line 580, in forward
    self.num_layers, self.dropout, self.training, self.bidirectional)
RuntimeError: Expected object of scalar type Float but got scalar type Double for argument #2 'mat1' in call to _th_addmm

device-id argument doesn't work for different GPUs

For inference, the command is
python -m allosaurus.run [--lang <language name>] [--model <model name>] [--device_id <gpu_id>] -i <audio>

However, specifying any device ID other than 0 (like say 1) still runs the inference on GPU 0.

Currently, the following code works to run inference on a GPU other than 0, but I think the intention of the device_id argument was to specify GPU ID as well.
CUDA_VISIBLE_DEVICES=<gpu_id> python -m allosaurus.run [--lang <language name>] [--model <model name>] [--device_id 0] -i <audio>

Prior.txt file path

Hi, thank you for the nice work! Where should I put the prior.txt file for prior customization? Thanks!

Any explanation on feature window re-ordering?

Hi, I'm looking at shrinking the processing window down from the entire audio file at once.

Could you shed any light on this line?

feature = np.concatenate((np.roll(feature, 1, axis=0), feature, np.roll(feature, -1, axis=0)), axis=1)

Why does it use np.roll to move the frames to the front and to the end, and then join all three copies together to widen the sample?

I've spread out the one-liner as below to try to figure it out.

    rollup   = np.roll(feature, 1, axis=0)  # make last feature first
    rolldown = np.roll(feature, -1, axis=0) # make first feature last

    combined = np.concatenate((rollup, feature, rolldown), axis=1) # join all feature on second axis
    windowed = combined[::3, ] # removes features with overlapping samples

    return windowed

It seems to stack the overlapping features into one deeper sample and then drop the overlaps by taking every 3rd row. But why the np.roll?
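For what it's worth, the roll trick appears to build a classic ±1 frame context window: row t of the concatenated matrix becomes [frame t-1, frame t, frame t+1] side by side (with wrap-around at the edges, presumably tolerated as padding). A small demonstration:

```python
import numpy as np

feature = np.arange(12).reshape(4, 3)  # 4 frames, 3 feature dims each

# np.roll(feature, 1, axis=0) shifts rows down: row t holds frame t-1.
# np.roll(feature, -1, axis=0) shifts rows up: row t holds frame t+1.
context = np.concatenate(
    (np.roll(feature, 1, axis=0), feature, np.roll(feature, -1, axis=0)),
    axis=1,
)

# Row 1 is now frames 0, 1, 2 glued together: [0,1,2, 3,4,5, 6,7,8].
# Row 0 shows the wrap-around: it starts with the LAST frame [9,10,11].
```

So np.roll is just a cheap way to align each frame with its neighbors before stacking; the [::3] step then thins the overlapping windows out again.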

Can't download models

Hi!

I am trying to download the English model by running:

python -m allosaurus.bin.download_model -m eng2102

and I get the following error:

downloading model  eng2102
from:  https://www.pyspeech.com/static/model/recognition/allosaurus/eng2102.tar.gz
to:    /home/j/miniconda3/lib/python3.8/site-packages/allosaurus/pretrained
please wait...
Error: could not download the model

Same goes for the other model.
Is there a way to get them?

Thanks,

Suggestions for README

Maybe you could add two things to the README:

  • A link to the arXiv version of the paper at the top of the README
  • A link to the dictate.app online interface

Change the location for the downloaded model

Hi,

It would be great if there was support for modifying the location in which the latest model (and other model versions) would be downloaded into. For example python -m allosaurus.run [--lang <language name>] [--model <model name>] [--device_id <gpu_id>] [--output <output_file>] -i <audio file/directory> -mp <custom directory to save models to>

Realtime? (low-latency streaming inference)

Thanks for allosaurus, my experiments with it have been fruitful so far. Very impressive work!

I'm curious about whether the architecture of this package is suitable for operating on streaming audio at a reasonably low-latency?

I haven't dug much further than what I needed to load a file with pydub and get some output, and am happy to dig further. I thought it could be a good idea to start a conversation about this, perhaps the system and models are totally unsuitable for real-time, or perhaps it might just require a bit of engineering effort from me.

Thanks in advance

How was the training data processed?

Hello,

We're trying to evaluate allosaurus for a pronunciation trainer, but currently the results fluctuate a bit too much to be reliable. Are there any tips for getting more consistent results? How was the training data recorded, and was it processed in some way (compressor, noise reduction, etc.)? With this information we could adjust our input data and might get better results.

Peter

using 'eval' instead of 'distance' in trainer.py

Hi
Thank you for allosaurus :)
I'm trying to train a model and got this error message:
AttributeError: module 'editdistance' has no attribute 'distance'
So I replaced 'distance' with 'eval' in trainer.py, and it works well.
I'm using Python 3.7.
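For reference, the editdistance package's documented entry point is eval(a, b), which returns the plain Levenshtein distance, so this replacement is the correct fix rather than a hack. A pure-Python equivalent (my own sketch, for checking results without the package installed):

```python
def levenshtein(a, b):
    """Plain Levenshtein distance, equivalent to editdistance.eval(a, b).
    Works on strings or on lists of phone symbols."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

# levenshtein("kitten", "sitting") == 3
```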

Segmentation fault

Hi, I tried to install with pip install allosaurus and tried to run python -m allosaurus.run -i <path>/cmu_us_slt_arctic/wav/arctic_b0340.wav where I try to transcribe a 16kHz wave file from the CMU ARCTIC dataset. I got the following results:

$ python -m allosaurus.run -i <path>/cmu_us_slt_arctic/wav/arctic_b0340.wav
ð i z k w ɪ k l ɪ tʰ ə l dʒ o j z ʌ v h ɹ̩ z w ɹ̩ s ɔ ɹ s ə z ʌ v dʒ o j t ə h ɪ m
Segmentation fault (コアダンプ)

The last Japanese word means "core dump". Has anyone encountered this issue before?
FYI, I am using python 3.7.7 and my torch version is 1.5.1+cu101.

recording best practice to get best result ?

Hello
First, thanks a lot for making your work so easily available.

I'm trying to make a software to help my friends improve their French pronunciation by doing the following things :

  1. put a french sentence to read (for which I have the IPA and a native recording )
  2. let them read aloud this sentence
  3. transcribe their recording to IPA using allosaurus
  4. compare with the expected IPA and point out mistakes

I've started by playing with allosaurus to check whether it can correctly transcribe me (a French native) pronouncing some simple words, but it seems to have some trouble doing so (the result is quite approximate). I've added -l fra, which seems to improve the accuracy slightly, but not by much.

Are there any best practices regarding recording to get the best results? Is there some other way to improve the accuracy for French? (I'm a software engineer with good knowledge of Python, but not much machine learning experience.)

thanks a lot for the pointers you can give me

Incorrect command for downloading model

A minor error:
In the README, the command specified for downloading a model is
python -m allosaurus.download <model>
However, the following is what actually works:
python -m allosaurus.download -m <model>

pip download

pip reports in the terminal that allosaurus installs successfully, yet allosaurus does not show up as a known module when I import it in my coding environment. What is the fix for this?

system not deterministic

Hello. I have faced this issue a few times. It seems that the system is not deterministic: after running the model several times on the same audio file, sometimes a phone or two are replaced. The replacement seems to happen between the first and second most likely phones when the probability for the top-1 is low.

Progress Information possible

Hi,
So this is working quite well now in Papagayo-NG.
But I wanted to know if it is possible to get progress information while it is recognizing.
Because if the input files are larger it could take a while.
If not then I will likely test slicing the input files into smaller segments based on silence gaps if possible and running them in series.
So I can then show an approximate progress status.
But the slicing might likely change the result of the recognizer.
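A stdlib-only sketch of the slicing idea, using fixed-length chunks rather than silence detection (so chunk boundaries can cut through phones, as noted above; the function and file names here are made up). Each chunk is written as its own wav so it can be fed to the recognizer while reporting progress:

```python
import wave

def split_wav(path, chunk_seconds, out_pattern="chunk_{:03d}.wav"):
    """Split a wav file into fixed-length chunks; returns the chunk paths."""
    paths = []
    with wave.open(path, "rb") as wf:
        frames_per_chunk = int(wf.getframerate() * chunk_seconds)
        params = wf.getparams()
        index = 0
        while True:
            frames = wf.readframes(frames_per_chunk)
            if not frames:
                break
            out = out_pattern.format(index)
            with wave.open(out, "wb") as cf:
                cf.setparams(params)  # wave fixes nframes on close
                cf.writeframes(frames)
            paths.append(out)
            index += 1
    return paths

# Hypothetical usage: recognize chunk by chunk, reporting progress.
# chunks = split_wav("input.wav", 10.0)
# for i, p in enumerate(chunks, 1):
#     print(f"{i}/{len(chunks)} chunks done")  # then feed p to the recognizer
```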

Maximum size for inventory customization?

Hi
Is there a maximum size for inventory customization?
It does not seem to work if I have more than 230 phonemes; I keep getting this error: assert max_domain_idx != -1

Thank you so much; I really appreciate your taking the time to reply.

urllib.error.URLError: <urlopen error [Errno 60] Operation timed out>

Hi:

It seems like your pre-trained model link is dead

from:  https://www.pyspeech.com/static/model/recognition/allosaurus/latest.tar.gz
to:    /python_path/lib/python3.7/site-packages/allosaurus-0.4.2-py3.7.egg/allosaurus/pretrained
please wait...
Error: could not download the model
Traceback (most recent call last):
  File "/python_path/lib/python3.7/urllib/request.py", line 1350, in do_open
    encode_chunked=req.has_header('Transfer-encoding'))
  File "/python_path/lib/python3.7/http/client.py", line 1277, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/python_path/lib/python3.7/http/client.py", line 1323, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/python_path/lib/python3.7/http/client.py", line 1272, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/python_path/lib/python3.7/http/client.py", line 1032, in _send_output
    self.send(msg)
  File "/python_path/lib/python3.7/http/client.py", line 972, in send
    self.connect()
  File "/python_path/lib/python3.7/http/client.py", line 1439, in connect
    super().connect()
  File "/python_path/lib/python3.7/http/client.py", line 944, in connect
    (self.host,self.port), self.timeout, self.source_address)
  File "/python_path/lib/python3.7/socket.py", line 728, in create_connection
    raise err
  File "/python_path/lib/python3.7/socket.py", line 716, in create_connection
    sock.connect(sa)
TimeoutError: [Errno 60] Operation timed out

my python version: 3.7.9
allosaurus version: commit a11771dd4aa16b5162e9aae6238a58bbcac430e5
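As a workaround when the download server times out, the archive can be fetched manually and unpacked into the package's pretrained directory. This is a hedged sketch; the URL and destination path are taken from the log above and will differ per environment, and newer versions also ship a downloader module worth retrying (`python -m allosaurus.bin.download_model`, if present in your version).

```shell
# Manually fetch the pretrained model archive (URL from the log above)
curl -L -o latest.tar.gz \
  https://www.pyspeech.com/static/model/recognition/allosaurus/latest.tar.gz

# Unpack it where the installer tried to place it (adjust to your install path)
tar -xzf latest.tar.gz \
  -C /python_path/lib/python3.7/site-packages/allosaurus-0.4.2-py3.7.egg/allosaurus/pretrained
```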

Issue with shapes alignment

Hello! I was having an issue with fine-tuning the model. This is the error message I'm getting :
(screenshot of the error message attached)
I'm not sure how to proceed. Any insight would be greatly appreciated, thank you!

Phone duration is always 0.045

No matter what, the phone duration is always 0.045, which doesn't seem right, even if I say something drawn out like "Ooooooooh yeeeeees":

4.080 0.045 iː
4.320 0.045 tʲ
4.410 0.045 iː
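The fixed 0.045 likely reflects a single analysis frame rather than a measured phone length. A rough per-phone duration can instead be estimated as the gap to the next phone's start time. A minimal sketch, assuming the "start duration phone" text format shown above (`gap_durations` and its fallback to 0.045 for the final phone are assumptions, not allosaurus API):

```python
# Re-derive phone durations from the gaps between consecutive start times
# in allosaurus-style "start duration phone" timestamp output.
def gap_durations(timestamp_text, utterance_end=None):
    rows = []
    for line in timestamp_text.strip().splitlines():
        start, _dur, phone = line.split(maxsplit=2)
        rows.append((float(start), phone))
    out = []
    for i, (start, phone) in enumerate(rows):
        if i + 1 < len(rows):
            # duration = time until the next phone begins
            out.append((start, rows[i + 1][0] - start, phone))
        else:
            # last phone: fall back to the nominal frame length if no end given
            end = utterance_end if utterance_end is not None else start + 0.045
            out.append((start, end - start, phone))
    return out

for start, dur, phone in gap_durations("4.080 0.045 iː\n4.320 0.045 tʲ\n4.410 0.045 iː"):
    print(f"{start:.3f} {dur:.3f} {phone}")
```

This treats inter-phone silence as part of the preceding phone, so it overestimates durations across pauses; it is only a heuristic on top of the existing output.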

WASM support?

Hello,

This is a really amazing project. Is there some way to run it directly in the browser, without going to a server, via WASM or something similar?

Peter

Cannot open 32 bit floating audio file

Hi,

It seems like the wave package does not support 32-bit float encoding. Here is the error message:

Traceback (most recent call last):
  File "/home/jamfly/miniconda2/envs/sb/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/jamfly/miniconda2/envs/sb/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/jamfly/miniconda2/envs/sb/lib/python3.8/site-packages/allosaurus/run.py", line 71, in <module>
    phones = recognizer.recognize(args.input, args.lang, args.topk, args.emit, args.timestamp)
  File "/home/jamfly/miniconda2/envs/sb/lib/python3.8/site-packages/allosaurus/app.py", line 63, in recognize
    audio = read_audio(filename)
  File "/home/jamfly/miniconda2/envs/sb/lib/python3.8/site-packages/allosaurus/audio.py", line 17, in read_audio
    wf = wave.open(filename)
  File "/home/jamfly/miniconda2/envs/sb/lib/python3.8/wave.py", line 510, in open
    return Wave_read(f)
  File "/home/jamfly/miniconda2/envs/sb/lib/python3.8/wave.py", line 164, in __init__
    self.initfp(f)
  File "/home/jamfly/miniconda2/envs/sb/lib/python3.8/wave.py", line 144, in initfp
    self._read_fmt_chunk(chunk)
  File "/home/jamfly/miniconda2/envs/sb/lib/python3.8/wave.py", line 269, in _read_fmt_chunk
    raise Error('unknown format: %r' % (wFormatTag,))
wave.Error: unknown format: 3

Could we try to use torchaudio instead of the wave to open files?

Thank you
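Until the reader is switched to something like torchaudio, one stdlib-only workaround is to convert the file to 16-bit PCM first. This is a minimal sketch, not a full WAV parser (it assumes a plain, uncompressed IEEE-float32 file and ignores extensible formats and exotic chunk layouts):

```python
# Convert an IEEE-float32 WAV (format tag 3, which wave.open rejects)
# to 16-bit PCM that the stdlib wave module and allosaurus can read.
import struct
import wave

def float32_wav_to_pcm16(src_path, dst_path):
    with open(src_path, "rb") as f:
        riff, _size, wave_id = struct.unpack("<4sI4s", f.read(12))
        assert riff == b"RIFF" and wave_id == b"WAVE", "not a WAV file"
        fmt = data = None
        while True:                                # walk the RIFF chunks
            header = f.read(8)
            if len(header) < 8:
                break
            chunk_id, chunk_size = struct.unpack("<4sI", header)
            body = f.read(chunk_size + (chunk_size & 1))  # chunks are padded to even size
            if chunk_id == b"fmt ":
                fmt = struct.unpack("<HHIIHH", body[:16])
            elif chunk_id == b"data":
                data = body[:chunk_size]
    assert fmt is not None and data is not None, "missing fmt or data chunk"
    tag, channels, rate, _bps, _align, bits = fmt
    assert tag == 3 and bits == 32, "expected IEEE float32 samples"
    floats = struct.unpack("<%df" % (len(data) // 4), data)
    # Clamp to [-1, 1] and rescale to signed 16-bit integers.
    pcm = struct.pack("<%dh" % len(floats),
                      *(int(max(-1.0, min(1.0, x)) * 32767) for x in floats))
    with wave.open(dst_path, "wb") as wf:
        wf.setnchannels(channels)
        wf.setsampwidth(2)
        wf.setframerate(rate)
        wf.writeframes(pcm)
```

The converted file can then be passed to the recognizer as usual; the conversion is lossy only in the sense of quantizing to 16 bits.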

Update phones in a language

I just checked, and it seems the phone lists for yor and pcm are incorrect. How can I update them and potentially retrain the model so it predicts the appropriate phoneme sequence?
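The project ships inventory-customization utilities that cover the first part of this. A hedged sketch of updating a language's phone list (exact module names and flags may differ between versions, so check each command's `--help`; `yor_phones.txt` is a hypothetical file with one IPA phone per line):

```shell
# Inspect the current inventory for a language (here "yor" as an example id)
python -m allosaurus.bin.list_phone --lang yor

# Replace the inventory with a corrected phone list
python -m allosaurus.bin.update_phone --lang yor --input yor_phones.txt

# Revert to the stock inventory if needed
python -m allosaurus.bin.restore_phone --lang yor
```

Note that updating the inventory only changes which phones decoding is allowed to emit; actually improving accuracy for a language requires fine-tuning on transcribed audio, which the project supports through its adaptation workflow.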
