xinjli / allosaurus
Allosaurus is a pretrained universal phone recognizer for more than 2000 languages.
License: GNU General Public License v3.0
I noticed that there is some variability in the output from call to call. For example, I just ran the same 15 second sample 10 times and the output contained varying numbers of phones:
[197, 198, 200, 199, 196, 195, 203, 195, 198, 197]
Is it possible to configure/modify the code slightly to generate deterministic results? I'm not sure, but I suspect this has something to do with Torch.
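In case it is useful, here is a minimal sketch of what I would try first, assuming the variability comes from PyTorch's RNG state rather than from allosaurus itself (none of these names are allosaurus API):

```python
import random

import numpy as np
import torch

def make_deterministic(seed: int = 0) -> None:
    """Pin every RNG that torch-based inference might touch."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    # Newer torch versions only; raises if an op lacks a deterministic kernel.
    torch.use_deterministic_algorithms(True)

make_deterministic()
# ...then load the recognizer and run inference as usual
```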
Hi,
I wanted to read the speech wav from a BytesIO object, but it does not work because of the assert on line 65 of app.py (the filename must have a ".wav" extension). I tried giving a filename to the BytesIO object, but that did not help either (I do not really understand why). If I comment out the line mentioned above, everything works well, but I would like a more appropriate and robust solution that does not require modifying the original code.
Do you have any suggestion or advice?
Thanks!
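One possible workaround, sketched under the assumption that the BytesIO holds a complete RIFF/WAV payload: spool it to a temporary file with a ".wav" suffix so the extension assert passes, without modifying allosaurus:

```python
import tempfile

from allosaurus.app import read_recognizer

def recognize_bytesio(wav_bytes_io, lang_id='ipa'):
    """Spool an in-memory wav to a real .wav file and recognize it."""
    model = read_recognizer()
    with tempfile.NamedTemporaryFile(suffix='.wav') as tmp:
        tmp.write(wav_bytes_io.getbuffer())
        tmp.flush()
        return model.recognize(tmp.name, lang_id)
```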
Hi, when I try to run the package, I get an error from panphon stating that numpy needs to be greater than 1.20.2, but if I upgrade numpy, I get an error stating that numba only works with numpy between 1.17 and 1.20.
EDIT: Had to upgrade numba.
Hi,
I'm trying to recognize the audio file below with lang_id='jpn', emit=1, timestamp=True, but nothing is generated during 7.290~13.170s, which includes about two audio clips:
drive link of the audio file
Could you please have a look at this?
By the way, I found that the generated duration always seems to be 0.045s. Could you please give some tips for optimizing it, like considering the connection between two phones, or vowels and consonants?
Thank you
Would it be straightforward to modify Allosaurus to return the approximate times of the recognized phones?
Also, I’m a novice in this area, but for what it’s worth, very impressive tool!
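For later readers: judging by other issues in this thread, newer releases expose approximate times through a timestamp flag. A minimal sketch (the exact signature may differ across versions):

```python
from allosaurus.app import read_recognizer

model = read_recognizer()
# each output line should look like: <start_sec> <duration_sec> <phone>
print(model.recognize('sample.wav', 'eng', timestamp=True))
```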
The phone inventory for Kunwinjku (iso gup) is incomplete. The output of python -m allosaurus.list_phone --lang gup is:
['a', 'e', 'i', 'j', 'l', 'm', 'n', 'o', 'r', 'u', 'w', 'ŋ', 'ɭ', 'ɳ', 'ɻ', 'ʔ']
However, Phoible lists the complete inventory as:
allophone | description_name |
---|---|
m | m Gunwinggu (PH 883) |
i ɪ | i Gunwinggu (PH 883) |
j | j Gunwinggu (PH 883) |
u ʊ | u Gunwinggu (PH 883) |
a ʌ ai au | a Gunwinggu (PH 883) |
w | w Gunwinggu (PH 883) |
n | n Gunwinggu (PH 883) |
l | l Gunwinggu (PH 883) |
b p pʰ | b Gunwinggu (PH 883) |
ŋ | ŋ Gunwinggu (PH 883) |
e ɛ æ | e Gunwinggu (PH 883) |
o ɔ ɒ | o Gunwinggu (PH 883) |
ɡ k kʰ | ɡ Gunwinggu (PH 883) |
r | r Gunwinggu (PH 883) |
ɲ | ɲ Gunwinggu (PH 883) |
ʔ | ʔ Gunwinggu (PH 883) |
d̪ t̪ t̪ʰ | d̪ Gunwinggu (PH 883) |
ɳ | ɳ Gunwinggu (PH 883) |
ɭ | ɭ Gunwinggu (PH 883) |
ɻ | ɻ Gunwinggu (PH 883) |
ɖ | ɖ Gunwinggu (PH 883) |
ɽ | ɽ Gunwinggu (PH 883) |
ʎ | ʎ Gunwinggu (PH 883) |
dʲ tʲ tʲʰ | dʲ Gunwinggu (PH 883) |
https://phoible.org/inventories/view/883
I would expect the allosaurus model inventory for iso gup to be:
['a', 'e', 'i', 'j', 'l', 'm', 'n', 'o', 'r', 'u', 'w', 'ŋ', 'ɭ', 'ɳ', 'ɻ', 'ʔ', 'ɪ', 'ʊ', 'ʌ', 'ai', 'au', 'b', 'p', 'pʰ', 'ɛ', 'æ', 'ɔ', 'ɒ', 'ɡ', 'k', 'kʰ', 'ɲ', 'd̪', 't̪', 't̪ʰ', 'ɖ', 'ɽ', 'ʎ', 'dʲ', 'tʲ', 'tʲʰ']
@xinjli Just curious: is the output a list of phones or phonemes?
Two cases:
Thanks for sharing the code. While running it, I encountered the following runtime error:
Traceback (most recent call last):
File "/opt/anaconda3/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/opt/anaconda3/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/Users/jiawen/Google Drive/WorkSpace/github/allosaurus/allosaurus/run.py", line 61, in <module>
phones = recognizer.recognize(args.input, args.lang, args.topk)
File "/Users/jiawen/Google Drive/WorkSpace/github/allosaurus/allosaurus/app.py", line 69, in recognize
tensor_batch_lprobs = self.am(tensor_batch_feat, tensor_batch_feat_len)
File "/opt/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
result = self.forward(*input, **kwargs)
File "/Users/jiawen/Google Drive/WorkSpace/github/allosaurus/allosaurus/am/allosaurus_torch.py", line 88, in forward
hidden_pack_sequence, _ = self.blstm_layer(pack_sequence)
File "/opt/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
result = self.forward(*input, **kwargs)
File "/opt/anaconda3/lib/python3.7/site-packages/torch/nn/modules/rnn.py", line 573, in forward
self.num_layers, self.dropout, self.training, self.bidirectional)
RuntimeError: Expected object of scalar type Float but got scalar type Double for argument #2 'mat1' in call to _th_addmm
I used Python 3.7 and torch 1.5. It seems to be a package version problem; could you please list all your package versions?
Tx
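A guess at the cause, illustrated with a toy snippet rather than allosaurus's own code: the LSTM weights are float32 while the input features arrive as float64, and torch refuses to mix the two. Casting the features with .float() clears the toy version of the error:

```python
import torch

lstm = torch.nn.LSTM(input_size=4, hidden_size=8)  # float32 weights
x = torch.randn(5, 1, 4, dtype=torch.float64)      # features read in as Double
try:
    lstm(x)
except RuntimeError as e:
    print(e)  # expected ... Float but got ... Double, as in the traceback above
out, _ = lstm(x.float())                           # cast to float32: works
```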
Hi,
Thank you for putting up the code open-source.
I have a question: is it somehow possible to add word boundaries to the recognized phoneme sequence?
For example:
Transcript for a wav file (german): schau mal hin ist das dorf noch nicht zu sehen
Phonemes Recognized: | ʃ a ʊ h m a l h ɪ n ɪ s t d a s d ɔ ə f n ɔ x n ɪ x t s u z e h ə n
Phonemes with word boundaries: * | ʃ a ʊ h* m a l * h ɪ n* * ɪ s t* * d a s* * d ɔ ə f* * n ɔ x* * n ɪ x t* * s u* * z e h ə n*
Not sure if I am missing something.
Thank you.
Thanks for all your work on allosaurus. It's a really great resource!
For comparing the similarity of two phonetic sequences, I've been using simple Jaccard distance, but it would be nice to use a distance metric that is sensitive to the fact that phonemes are not equally similar. Can you recommend a resource that would allow for this kind of distance metric?
Thank you!
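One possible direction, offered as a sketch rather than a recommendation: panphon (already an allosaurus dependency) ships articulatory-feature-based edit distances that score similar phones as closer than dissimilar ones:

```python
import panphon.distance

dst = panphon.distance.Distance()
# feature-weighted edit distances: substituting [b] for [p] costs less
# than substituting a phone with very different articulatory features
print(dst.feature_edit_distance('pat', 'bad'))
print(dst.weighted_feature_edit_distance('pat', 'bad'))
```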
Hi, so I'm running into what should be a simple problem, but I simply can't figure out what I'm doing wrong.
I run the following command
python -m allosaurus.bin.prep_feat --path='C:\Users\maria\Allo\train'
I wanted to test with just a few samples to make sure I have everything working before using the complete dataset, but I'm stuck on this stage.
The wave txt file for the train directory contains
utt_1 C:\Users\maria\Allo\koyi\crdo-KKT_CONVERSATION_CONVERSATIONs5.wav
utt_2 C:\Users\maria\Allo\koyi\crdo-KKT_CONVERSATION_CONVERSATIONs6.wav
utt_3 C:\Users\maria\Allo\koyi\crdo-KKT_CONVERSATION_CONVERSATIONs7.wav
utt_4 C:\Users\maria\Allo\koyi\crdo-KKT_CONVERSATION_CONVERSATIONs8.wav
utt_5 C:\Users\maria\Allo\koyi\crdo-KKT_CONVERSATION_CONVERSATIONs9.wav
utt_6 C:\Users\maria\Allo\koyi\crdo-KKT_CONVERSATION_CONVERSATIONs10.wav
utt_7 C:\Users\maria\Allo\koyi\crdo-KKT_CONVERSATION_CONVERSATIONs12.wav
utt_8 C:\Users\maria\Allo\koyi\crdo-KKT_CONVERSATION_CONVERSATIONs14.wav
utt_9 C:\Users\maria\Allo\koyi\crdo-KKT_CONVERSATION_CONVERSATIONs15.wav
utt_10 C:\Users\maria\Allo\koyi\crdo-KKT_CONVERSATION_CONVERSATIONs16.wav
but I keep getting the error
Traceback (most recent call last):
File "C:\Users\maria\AppData\Local\Programs\Python\Python39\lib\runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "C:\Users\maria\AppData\Local\Programs\Python\Python39\lib\runpy.py", line 87, in _run_code
exec(code, run_globals)
File "C:\Users\maria\AppData\Local\Programs\Python\Python39\lib\site-packages\allosaurus\bin\prep_feat.py", line 57, in
assert wave_path.exists(), "the path directory should contain a wave file, please check README.md for details"
AssertionError: the path directory should contain a wave file, please check README.md for details
Is my wave file just formatted incorrectly, and that's why I keep getting an error about no wav files existing? Or is my command-line argument the reason? Thank you for any help.
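For anyone hitting the same assert: my reading of the error text is that prep_feat looks for a file literally named wave (no .txt extension) inside the --path directory, so a wave.txt would not be found. A quick hypothetical check:

```python
from pathlib import Path

path = Path(r'C:\Users\maria\Allo\train')
# prep_feat appears to require this exact filename (assumption based on the assert)
print((path / 'wave').exists())
```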
It would be wonderful to optionally be able to retrieve the timestamps for the phonemes. Is that possible?
[edit: I see this suggestion in #20; would it be possible to add this option to the code?]
I don't know why, but it does not work in Python 3.10. Is support for 3.10 planned for the future?
Have the authors considered any approaches to reduce the latency of this approach?
I would be interested to understand whether any avenues have been pursued (e.g., distilling into a more performant architecture).
Thanks!
Hello, Thank you for the great repo.
I am unable to run it. Could you please help me fix the issue? I tried the following commands.
I am using Miniconda:
Install allosaurus:
pip install allosaurus
Run inference:
(deepspeech_v0.7.4) [email protected]@wika:~$ python -m allosaurus.run -i deM23-44.wav
Traceback (most recent call last):
File "/home/LTLab.lan/agarwal/miniconda3/envs/deepspeech_v0.7.4/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/home/LTLab.lan/agarwal/miniconda3/envs/deepspeech_v0.7.4/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/LTLab.lan/agarwal/miniconda3/envs/deepspeech_v0.7.4/lib/python3.6/site-packages/allosaurus/run.py", line 21, in <module>
if len(get_all_models()) == 0:
File "/home/LTLab.lan/agarwal/miniconda3/envs/deepspeech_v0.7.4/lib/python3.6/site-packages/allosaurus/model.py", line 13, in get_all_models
assert len(models) > 0, "No models are available, you can maually download a model with download command or just run inference to download the latest one automatically"
AssertionError: No models are available, you can maually download a model with download command or just run inference to download the latest one automatically
(deepspeech_v0.7.4) [email protected]@wika:~$ python -m allosaurus.bin.download_model -m latest
downloading model latest
from: https://www.pyspeech.com/static/model/recognition/allosaurus/latest.tar.gz
to: /home/LTLab.lan/agarwal/miniconda3/envs/deepspeech_v0.7.4/lib/python3.6/site-packages/allosaurus/bin/pretrained
please wait...
(deepspeech_v0.7.4) [email protected]@wika:~$
(deepspeech_v0.7.4) [email protected]@wika:~$ python -m allosaurus.run -i deM23-44.wav
Traceback (most recent call last):
File "/home/LTLab.lan/agarwal/miniconda3/envs/deepspeech_v0.7.4/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/home/LTLab.lan/agarwal/miniconda3/envs/deepspeech_v0.7.4/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/LTLab.lan/agarwal/miniconda3/envs/deepspeech_v0.7.4/lib/python3.6/site-packages/allosaurus/run.py", line 21, in <module>
if len(get_all_models()) == 0:
File "/home/LTLab.lan/agarwal/miniconda3/envs/deepspeech_v0.7.4/lib/python3.6/site-packages/allosaurus/model.py", line 13, in get_all_models
assert len(models) > 0, "No models are available, you can maually download a model with download command or just run inference to download the latest one automatically"
AssertionError: No models are available, you can maually download a model with download command or just run inference to download the latest one automatically
Hi,
Currently, the only way to perform phoneme recognition with allosaurus is to run a command in a CLI-type interface with the following structure: python -m allosaurus.run [--lang <language name>] [--model <model name>] [--device_id <gpu_id>] [--output <output_file>] -i <audio file/directory>.
It would be great if there were a function within the library that could do something similar, for example:
from allosaurus.app import read_recognizer, speech_recognizer
...
phoneme_seq = speech_recognizer.recognize(model_name, speech_wav_file, other_config)
...
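For what it's worth, the README already documents a Python entry point close to this. A sketch of that existing usage (names as in the README; details may vary by version):

```python
from allosaurus.app import read_recognizer

model = read_recognizer()            # or read_recognizer('eng2102')
phoneme_seq = model.recognize('speech.wav', 'eng')
```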
Hi, I'm trying to use allosaurus for the Persian language, but the results are not accurate at all!
Here is an example:
model.recognize("source.wav", "pes")
The returned result is:
f l a l ə m a l n ɪ k a m a n a ŋ t b a ʃ p uː x t ɔ l t b a ɪ s t ɔ n ə
but it should be like:
s a l ə m m a n m ɪ t a v ə n a m f a r s ɪ s uː x b a t ɔ o n a m
The source.wav has been attached.
What should I do? How can I improve the results?
Hi,
Looks like a great program.
However, I was having trouble building allosaurus. I am on Ubuntu 16.04, and when I do 'pip install allosaurus' I get:
ERROR: Failed building wheel for llvmlite
On the web, there were suggestions to use 'python -m pip ...' and also to install llvm. I did both, but it didn't help.
I'd appreciate any help.
Hi,
It would be better if an argument could be passed to extract the embeddings instead of the phones. @xinjli
Hello,
As you can see here, I've started integrating this project into Papagayo-NG:
morevnaproject-org/papagayo-ng#49
The first results from my tests seem to be very promising.
Especially the new timestamp feature is helping a lot with that.
Is it possible to add some speaker separation to this?
Papagayo-NG itself allows several speakers for one audio file.
If we could recognize which parts are spoken by a separate speaker, that would make this a really nice solution for even more animators.
I've taken a look at the topic, and it seems to be quite complex.
If this could be integrated into Allosaurus, that would be awesome, of course.
If not, there would be ways to get this into Papagayo-NG; we could do a separate pass over the audio.
I've taken a look, and pyAudioAnalysis seems to already do that.
But that would be a big dependency addition.
Hi,
When I use allosaurus with the eng2102 model for an English wav file, the results look quite good (although there is one issue: if there is no silence at the beginning of the wav file, some phonemes from the beginning of the speech will be missing; I am still testing this, and maybe later I will open a separate issue on this topic).
But when I use the universal model for a Hungarian wav file, the results are not so good (of course, I know it is not a very well-known language ;-)).
So I would like to fine-tune the model. But for this, I need to create the text files with the phonemes of the sentences. As stated in the doc, the phones here should be restricted to the phone inventory of my target language.
The phone inventory for the Hungarian language is the following:
aː b bː c d dː d̠ d̪ d̪ː d̻ eː f fː h hː i iː j jː k kː l lː l̪ l̪ː m mː n nː n̪ n̪ː o oː p pː r rː r̪ r̪ː s sː s̪ s̻ t tː t̠ t̪ t̪ː t̻ u uː v vː w y yː z zː z̪ z̻ æ ø øː ɑ ɒ ɔ ɛ ɟ ɡ ɡː ɲ ɲː ɾ ʃ ʃː ʒ ʒː ʝ ʝː
But there are some phonemes here that I cannot recognize.
Here is an explanation of the IPA signs for the Hungarian language:
https://hu.wikipedia.org/wiki/IPA_magyar_nyelvre
(unfortunately, it is in Hungarian, but the IPA signs are easy to find...)
Can you help me understand these, or give me a link to a document describing these phonemes?
Thanks!
Hi,
Not sure if this is on the roadmap, but it would be super cool to have a way to provide a custom set of symbols to represent phonemes and their mappings to phones. A function/layer supporting IPA-to-custom-phoneme-set mapping should probably be sufficient for this requirement.
I am currently trying to use Allosaurus to help a Speech-Language Pathologist perform transcriptions, but I am having issues getting the application to recognize the word "what", let alone longer WAV files with more complex sentences in them. Attached is the WAV file. The output I get from Allosaurus is:
~/Downloads❯ python -m allosaurus.run -i what.wav --model eng2102 --lang eng
~/Downloads❯
I even installed the eng2102 model.
~/Downloads❯ python -m allosaurus.bin.list_model
Available Models
- uni2005 (default)
~/Downloads❯ python -m allosaurus.bin.download_model -m eng2102
downloading model eng2102
from: https://github.com/xinjli/allosaurus/releases/download/v1.0/eng2102.tar.gz
to: /home/filbot/.local/lib/python3.9/site-packages/allosaurus/pretrained
please wait...
~/Downloads❯ python -m allosaurus.bin.list_model
Available Models
- uni2005 (default)
- eng2102
It was recorded using a Tascam DR-40X as 32-bit WAV, then transferred over to a Pop!_OS Linux system.
Python Version
~/Downloads❯ python -V
Python 3.9.7
Pop!_OS Version
~/Downloads❯ neofetch
filbot@pop-os
OS: Pop!_OS 21.10 x86_64
Host: Oryx Pro oryp6
Kernel: 5.15.23-76051523-generic
Uptime: 1 hour, 40 mins
Packages: 2857 (dpkg), 90 (flatpak)
Shell: zsh 5.8
Resolution: 1920x1080
DE: GNOME 40.5
WM: Mutter
WM Theme: Pop
Theme: Pop-dark [GTK2/3]
Icons: Pop [GTK2/3]
Terminal: gnome-terminal
CPU: Intel i7-10875H (16) @ 5.100GHz
GPU: Intel CometLake-H GT2 [UHD Graphics]
Memory: 3052MiB / 31977MiB
Attached: what.wav.zip
I feel like I'm not doing something correctly. Do I need to train allosaurus to listen for English sounds as well? I expect to see something similar to wʌt.
While checking the full language list with python -m allosaurus.bin.list_lang, the executable raised an IndexError. The following is the error message:
Traceback (most recent call last):
File "/home/yist/.pyenv/versions/3.8.5/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/yist/.pyenv/versions/3.8.5/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/yist/.pyenv/versions/3.8.5/lib/python3.8/site-packages/allosaurus/bin/list_lang.py", line 13, in <module>
model_path = get_model_path(args.model)
File "/home/yist/.pyenv/versions/3.8.5/lib/python3.8/site-packages/allosaurus/model.py", line 27, in get_model_path
resolved_model_name = resolve_model_name(model_name)
File "/home/yist/.pyenv/versions/3.8.5/lib/python3.8/site-packages/allosaurus/model.py", line 76, in resolve_model_name
return models[0].name
IndexError: list index out of range
Is this a bug, or did I do something wrong?
Is there any limit to the size of the audio file? I tried some files with 6 minutes of data; they processed but didn't give an output.
Hi, thanks again for a great program.
I tried to run allosaurus on approximately a 15-minute TED talk and got the following error. From the same talk, I extracted a 5-second speech excerpt, and allosaurus seemed to work. Did allosaurus crash because the TED talk starts with about 12 seconds of music? Here's the error message:
python -m allosaurus.run -i ~/datasets/tedlium3-wav/NaliniNadkarni_2009.wav
Traceback (most recent call last):
File "/usr/local/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"main", mod_spec)
File "/usr/local/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/dsitaram/Py3.7venv/allosaurus/lib/python3.7/site-packages/allosaurus/run.py", line 59, in
phones = recognizer.recognize(args.input, args.lang)
File "/home/dsitaram/Py3.7venv/allosaurus/lib/python3.7/site-packages/allosaurus/app.py", line 56, in recognize
tensor_batch_lprobs = self.am(tensor_batch_feat, tensor_batch_feat_len)
File "/home/dsitaram/Py3.7venv/allosaurus/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/dsitaram/Py3.7venv/allosaurus/lib/python3.7/site-packages/allosaurus/am/allosaurus_torch.py", line 88, in forward
hidden_pack_sequence, _ = self.blstm_layer(pack_sequence)
File "/home/dsitaram/Py3.7venv/allosaurus/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/dsitaram/Py3.7venv/allosaurus/lib/python3.7/site-packages/torch/nn/modules/rnn.py", line 580, in forward
self.num_layers, self.dropout, self.training, self.bidirectional)
RuntimeError: Expected object of scalar type Float but got scalar type Double for argument #2 'mat1' in call to _th_addmm
For inference, the command is:
python -m allosaurus.run [--lang <language name>] [--model <model name>] [--device_id <gpu_id>] -i <audio>
However, specifying any device ID other than 0 (like say 1) still runs the inference on GPU 0.
Currently, the following code works to run inference on a GPU other than 0, but I think the intention of the device_id argument was to specify the GPU ID as well.
CUDA_VISIBLE_DEVICES=<gpu_id> python -m allosaurus.run [--lang <language name>] [--model <model name>] [--device_id 0] -i <audio>
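The same workaround can be applied from inside Python, sketched here under the assumption that the variable is set before torch initializes CUDA:

```python
import os

# Expose only physical GPU 1 before anything touches CUDA; torch then sees
# it as cuda:0, so the existing --device_id 0 path lands on that GPU.
os.environ['CUDA_VISIBLE_DEVICES'] = '1'

# ...import allosaurus / torch and run inference afterwards
```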
Hi, thank you for the nice work! I would like to know where I should put the prior.txt file for prior customization. Thanks!
Hi, I'm looking into processing smaller windows of audio instead of the entire file at once.
Could you shed any light on this line?
allosaurus/allosaurus/pm/utils.py
Line 19 in d9f1ada
I've spread out the one-liner as below to try to figure it out.
rollup = np.roll(feature, 1, axis=0) # make last feature first
rolldown = np.roll(feature, -1, axis=0) # make first feature last
combined = np.concatenate((rollup, feature, rolldown), axis=1) # join all feature on second axis
windowed = combined[::3, ] # removes features with overlapping samples
return windowed
It seems to stack the overlapping features into one deeper sample and then drop the overlaps by taking every 3rd item. But why the np.roll?
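Partly answering my own confusion, a tiny worked example: the two rolls supply each frame's previous and next neighbor as left/right context, and the [::3] stride keeps one stacked frame out of every three so the overlapping context is not duplicated. Note the wrap-around at the edges that np.roll introduces:

```python
import numpy as np

feature = np.arange(6).reshape(6, 1)     # 6 frames, 1-dim features
rollup = np.roll(feature, 1, axis=0)     # frame i-1 (wraps at the edge)
rolldown = np.roll(feature, -1, axis=0)  # frame i+1 (wraps at the edge)
combined = np.concatenate((rollup, feature, rolldown), axis=1)
print(combined)
# [[5 0 1]
#  [0 1 2]
#  [1 2 3]
#  [2 3 4]
#  [3 4 5]
#  [4 5 0]]
print(combined[::3])                     # keeps rows 0 and 3 only
```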
Hi!
I am trying to download the English model by running:
python -m allosaurus.bin.download_model -m eng2102
and I get the following error:
downloading model eng2102
from: https://www.pyspeech.com/static/model/recognition/allosaurus/eng2102.tar.gz
to: /home/j/miniconda3/lib/python3.8/site-packages/allosaurus/pretrained
please wait...
Error: could not download the model
Same goes for the other model.
Is there a way to get them?
Thanks,
Maybe you could add two things to the README:
Hi,
It would be great if there were support for modifying the location into which the latest model (and other model versions) is downloaded. For example: python -m allosaurus.run [--lang <language name>] [--model <model name>] [--device_id <gpu_id>] [--output <output_file>] -i <audio file/directory> -mp <custom directory to save models to>
Thanks for allosaurus, my experiments with it have been fruitful so far. Very impressive work!
I'm curious whether the architecture of this package is suitable for operating on streaming audio at reasonably low latency.
I haven't dug much further than what I needed to load a file with pydub and get some output, and I'm happy to dig further. I thought it could be a good idea to start a conversation about this; perhaps the system and models are totally unsuitable for real-time, or perhaps it might just require a bit of engineering effort from me.
Thanks in advance
Hello,
We're trying to evaluate allosaurus for a pronunciation trainer, but currently the results fluctuate a bit too much for it to be reliable. Are there any tips you have for getting more consistent results? How was the training data recorded, and was it processed in some way (compressor, noise reduction, etc.)? With this information we could adjust our input data and might get better results.
Peter
Hi
Thank you for allosaurus :)
I'm trying to train a model and I got this error message:
AttributeError: module 'editdistance' has no attribute 'distance'
So I replaced 'distance' with 'eval' in trainer.py and it works well.
I'm using Python 3.7.
Hi, I installed with pip install allosaurus and ran python -m allosaurus.run -i <path>/cmu_us_slt_arctic/wav/arctic_b0340.wav to transcribe a 16 kHz wave file from the CMU ARCTIC dataset. I got the following results:
$ python -m allosaurus.run -i <path>/cmu_us_slt_arctic/wav/arctic_b0340.wav
ð i z k w ɪ k l ɪ tʰ ə l dʒ o j z ʌ v h ɹ̩ z w ɹ̩ s ɔ ɹ s ə z ʌ v dʒ o j t ə h ɪ m
Segmentation fault (コアダンプ)
The last Japanese word means "core dump". Has anyone encountered this issue before?
FYI, I am using Python 3.7.7 and my torch version is 1.5.1+cu101.
Hello
First, thanks a lot for making your work so easily available.
I'm trying to make a piece of software to help my friends improve their French pronunciation by doing the following things:
I've started by playing with allosaurus to check whether it can correctly transcribe me (a French native) pronouncing some simple words, but it seems to have some trouble doing so (the result is quite approximate). I've added -l fra, which seems to improve the accuracy slightly, but not by much.
Are there some best practices regarding recording to get the best results? Is there some other way to improve the accuracy for French? (I'm a software engineer with good knowledge of Python but not that much of machine learning.)
Thanks a lot for any pointers you can give me.
A minor error:
In the README, the command specified for downloading a model is
python -m allosaurus.download <model>
However, the following is what actually works:
python -m allosaurus.download -m <model>
The pip download for allosaurus shows that it downloads successfully in the terminal; however, allosaurus does not show up as a known module when I import it in my coding environment. What is the fix for this?
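A common cause, offered as a guess rather than a diagnosis: pip installed into a different interpreter than the one your coding environment runs. A quick check:

```python
import sys

# The interpreter your environment actually uses; install against exactly
# this one, e.g.:  <this path> -m pip install allosaurus
print(sys.executable)
```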
Hello. I have faced this issue a few times. It seems that the system is not deterministic. After running the model several times on the same audio file, sometimes a phone or two are replaced. The replacement seems to happen between the 1st and 2nd most likely phones when the probability for the top-1 is low.
Hi,
So this is working quite well now in Papagayo-NG.
But I wanted to know whether it is possible to get progress information while it is recognizing, because with larger input files it could take a while.
If not, I will likely test slicing the input files into smaller segments based on silence gaps, if possible, and running them in series, so I can then show an approximate progress status.
But the slicing might well change the result of the recognizer.
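In case it helps, a rough sketch of the slicing idea with pydub (untested against allosaurus; the silence thresholds are guesses to tune):

```python
from pydub import AudioSegment
from pydub.silence import split_on_silence

audio = AudioSegment.from_wav('input.wav')
chunks = split_on_silence(audio, min_silence_len=300, silence_thresh=-40)
for i, chunk in enumerate(chunks):
    chunk.export(f'chunk_{i}.wav', format='wav')
    # ...run the recognizer on chunk_{i}.wav here...
    print(f'approximate progress: {i + 1}/{len(chunks)}')
```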
Hi
Is there a maximum size for inventory customization?
It seems not to work if I have more than 230 phonemes.
I keep getting this error: assert max_domain_idx != -1
Thank you so much, and I really appreciate you taking the time to reply to me.
Hi:
It seems like your pre-trained model link is dead:
from: https://www.pyspeech.com/static/model/recognition/allosaurus/latest.tar.gz
to: /python_path/lib/python3.7/site-packages/allosaurus-0.4.2-py3.7.egg/allosaurus/pretrained
please wait...
Error: could not download the model
Traceback (most recent call last):
File "/python_path/lib/python3.7/urllib/request.py", line 1350, in do_open
encode_chunked=req.has_header('Transfer-encoding'))
File "/python_path/lib/python3.7/http/client.py", line 1277, in request
self._send_request(method, url, body, headers, encode_chunked)
File "/python_path/lib/python3.7/http/client.py", line 1323, in _send_request
self.endheaders(body, encode_chunked=encode_chunked)
File "/python_path/lib/python3.7/http/client.py", line 1272, in endheaders
self._send_output(message_body, encode_chunked=encode_chunked)
File "/python_path/lib/python3.7/http/client.py", line 1032, in _send_output
self.send(msg)
File "/python_path/lib/python3.7/http/client.py", line 972, in send
self.connect()
File "/python_path/lib/python3.7/http/client.py", line 1439, in connect
super().connect()
File "/python_path/lib/python3.7/http/client.py", line 944, in connect
(self.host,self.port), self.timeout, self.source_address)
File "/python_path/lib/python3.7/socket.py", line 728, in create_connection
raise err
File "/python_path/lib/python3.7/socket.py", line 716, in create_connection
sock.connect(sa)
TimeoutError: [Errno 60] Operation timed out
my python version: 3.7.9
allosaurus version: commit a11771dd4aa16b5162e9aae6238a58bbcac430e5
No matter what, the phone duration is 0.045, which doesn't sound right, even if I say something like "Ooooooooh yeeeeees":
4.080 0.045 iː
4.320 0.045 tʲ
4.410 0.045 iː
Hello,
This is a really amazing project. Is there some way to make it run directly on a website, without going to the server, via wasm or similar?
Peter
Hi,
It seems like the wave package does not support 32-bit floating-point encoding. Here is the error message:
Traceback (most recent call last):
File "/home/jamfly/miniconda2/envs/sb/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/jamfly/miniconda2/envs/sb/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/jamfly/miniconda2/envs/sb/lib/python3.8/site-packages/allosaurus/run.py", line 71, in <module>
phones = recognizer.recognize(args.input, args.lang, args.topk, args.emit, args.timestamp)
File "/home/jamfly/miniconda2/envs/sb/lib/python3.8/site-packages/allosaurus/app.py", line 63, in recognize
audio = read_audio(filename)
File "/home/jamfly/miniconda2/envs/sb/lib/python3.8/site-packages/allosaurus/audio.py", line 17, in read_audio
wf = wave.open(filename)
File "/home/jamfly/miniconda2/envs/sb/lib/python3.8/wave.py", line 510, in open
return Wave_read(f)
File "/home/jamfly/miniconda2/envs/sb/lib/python3.8/wave.py", line 164, in __init__
self.initfp(f)
File "/home/jamfly/miniconda2/envs/sb/lib/python3.8/wave.py", line 144, in initfp
self._read_fmt_chunk(chunk)
File "/home/jamfly/miniconda2/envs/sb/lib/python3.8/wave.py", line 269, in _read_fmt_chunk
raise Error('unknown format: %r' % (wFormatTag,))
wave.Error: unknown format: 3
Could we try to use torchaudio instead of wave to open files?
Thank you
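Until something like that lands, one workaround sketch, assuming the soundfile package is acceptable: re-encode the IEEE-float wav as 16-bit PCM, which the stdlib wave module can read:

```python
import soundfile as sf

# soundfile reads the IEEE-float wav that the wave module rejects (format 3)
data, sr = sf.read('float32_input.wav')
sf.write('pcm16_input.wav', data, sr, subtype='PCM_16')
```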
I just checked, and it seems the phone lists for yor and pcm are incorrect. How can I update them and potentially retrain the model so it can predict the appropriate phoneme sequences?