yeyupiaoling / masr

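A PyTorch implementation of a streaming and non-streaming automatic speech recognition framework. It supports both online and offline recognition, currently supports the Conformer, Squeezeformer, and DeepSpeech2 models, and provides several data augmentation methods.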

License: Apache License 2.0

Python 97.18% JavaScript 1.47% HTML 1.04% CSS 0.31%
deepspeech pytorch asr deep-learning speech-recognition speech-to-text speech conformer squeezeformer

masr's Issues

Real-time speech recognition fails with a hidden-layer size mismatch

`(base) luke@luke-VirtualBox:~/MASR$ python infer_path.py --real_time_demo=True
----------- Configuration Arguments -----------
alpha: 2.2
beam_size: 300
beta: 4.3
cutoff_prob: 0.99
cutoff_top_n: 40
decoder: ctc_beam_search
feature_method: linear
is_long_audio: False
lang_model_path: lm/zh_giga.no_cna_cmn.prune01244.klm
model_path: models/deepspeech2/inference.pt
pun_model_dir: models/pun_models/
real_time_demo: 1
to_an: False
use_gpu: False
use_model: deepspeech2
use_pun: False
vocab_path: dataset/vocabulary.txt
wav_path: ./dataset/test.wav

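The paddlespeech-ctcdecoders library is missing; please install it as described in the documentation. On Windows, only ctc_greedy can be used.
[Note] Automatically switched to the ctc_greedy decoder.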

[W NNPACK.cpp:79] Could not initialize NNPACK! Reason: Unsupported hardware.
Segment result: elapsed time: 153ms, recognition result: 近, score: 60
Traceback (most recent call last):
File "infer_path.py", line 93, in
real_time_predict_demo()
File "infer_path.py", line 81, in real_time_predict_demo
predictor.predict_stream(audio_bytes=data, to_an=args.to_an, init_state_h_box=state_h, init_state_c_box=state_c)
File "/home/luke/MASR/masr/predict.py", line 209, in predict_stream
output_data, output_state_h, output_state_c = self.predictor(audio_data, audio_len, init_state_h_box, init_state_c_box)
File "/home/luke/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/home/luke/MASR/masr/model_utils/utils.py", line 34, in forward
logits, _, final_chunk_state_h_box, final_chunk_state_c_box = self.model(x, audio_len, init_state_h_box, init_state_c_box)
File "/home/luke/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/home/luke/MASR/masr/model_utils/deepspeech2/model.py", line 53, in forward
x, final_chunk_state_h_box, final_chunk_state_c_box = self.rnn(x, x_lens, init_state_h_box, init_state_c_box)
File "/home/luke/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/home/luke/MASR/masr/model_utils/deepspeech2/rnn.py", line 69, in forward
x, final_state = rnn(x, x_lens, init_state)
File "/home/luke/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/home/luke/MASR/masr/model_utils/deepspeech2/rnn.py", line 24, in forward
x, final_state = self.rnn(x, init_state) # [B, T, D]
File "/home/luke/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/home/luke/miniconda3/lib/python3.8/site-packages/torch/nn/modules/rnn.py", line 847, in forward
self.check_forward_args(input, hx, batch_sizes)
File "/home/luke/miniconda3/lib/python3.8/site-packages/torch/nn/modules/rnn.py", line 232, in check_forward_args
self.check_hidden_size(hidden, expected_hidden_size)
File "/home/luke/miniconda3/lib/python3.8/site-packages/torch/nn/modules/rnn.py", line 226, in check_hidden_size
raise RuntimeError(msg.format(expected_hidden_size, list(hx.size())))
RuntimeError: Expected hidden size (1, 1, 1024), got [5, 1, 1024]
`

The default type should be changed to float

add_arg('learning_rate', int, 5e-5, 'Initial learning rate')
add_arg('min_duration', int, 0.5, 'Minimum audio duration to keep')
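
The effect of the wrong type can be reproduced with plain argparse. This is a minimal sketch that assumes `add_arg` ultimately forwards its type to `argparse.ArgumentParser.add_argument`; the `add_arg`, `build_parser`, and `parses_ok` functions below are hypothetical stand-ins, not the project's own code. Under that assumption the non-string default passes through unconverted, but any value given on the command line is converted with `int` and rejected.

```python
import argparse


def add_arg(parser, name, arg_type, default, help_text):
    # Hypothetical stand-in for the project's add_arg helper.
    parser.add_argument("--" + name, type=arg_type, default=default, help=help_text)


def build_parser(number_type):
    parser = argparse.ArgumentParser()
    add_arg(parser, "learning_rate", number_type, 5e-5, "initial learning rate")
    add_arg(parser, "min_duration", number_type, 0.5, "minimum audio duration to keep")
    return parser


def parses_ok(parser, argv):
    # argparse reports an invalid value by exiting, so catch SystemExit.
    try:
        parser.parse_args(argv)
        return True
    except SystemExit:
        return False


# With int, a fractional value on the command line cannot be parsed.
int_ok = parses_ok(build_parser(int), ["--learning_rate=1e-4"])
# With float, the same arguments are accepted.
float_args = build_parser(float).parse_args(["--learning_rate=1e-4", "--min_duration=0.5"])
```

Because the defaults are already floats, the problem only shows up once someone overrides these options, which may be why it is easy to miss.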

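Question about training the model

Hello, is it possible to continue training from Baidu's model by adding my own domain-specific speech data?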

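C++ deployment question

Hello, thanks for sharing. I am converting the PyTorch model to libtorch, but I lack a library for the preprocessing step that extracts features. In the PyTorch version it is: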

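import librosa
import numpy as np
import torch

# n_fft, hop_length, win_length and window are settings defined elsewhere in the project.
def spectrogram(wav, normalize=True):
    # Short-time Fourier transform of the waveform
    D = librosa.stft(wav, n_fft=n_fft, hop_length=hop_length, win_length=win_length, window=window)

    # Keep the magnitude, compress it with log(1 + x), and convert to a tensor
    spec, phase = librosa.magphase(D)
    spec = np.log1p(spec)
    spec = torch.FloatTensor(spec)

    # Normalize to zero mean and unit variance over the whole utterance
    if normalize:
        spec = (spec - spec.mean()) / spec.std()

    return spec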

Is there a corresponding C++ wrapper library for librosa? Could you provide some material on deploying a speech system in C++?
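
For a port, it may help to see the same feature written without librosa, since each step then maps onto ordinary C++ loops plus an FFT call. The following is a dependency-free sketch, not the project's implementation. It assumes a periodic Hamming window, a window length equal to `n_fft`, and no centre padding (the equivalent of librosa's `center=False`), so frame counts can differ from the function above. The default `n_fft` and `hop_length` values are illustrative assumptions.

```python
import cmath
import math
import statistics


def hamming(n):
    # Periodic Hamming window of length n.
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / n) for i in range(n)]


def log_spectrogram(wav, n_fft=320, hop_length=160, normalize=True):
    """Return log(1 + |STFT|) as a list of n_fft // 2 + 1 rows, one column per frame."""
    window = hamming(n_fft)
    num_bins = n_fft // 2 + 1
    num_frames = 1 + (len(wav) - n_fft) // hop_length
    spec = [[0.0] * num_frames for _ in range(num_bins)]
    for t in range(num_frames):
        start = t * hop_length
        frame = [wav[start + i] * window[i] for i in range(n_fft)]
        for k in range(num_bins):
            # Naive DFT of one frequency bin; a C++ port would call an FFT library here.
            acc = sum(frame[i] * cmath.exp(-2j * math.pi * k * i / n_fft) for i in range(n_fft))
            spec[k][t] = math.log1p(abs(acc))
    if normalize:
        # Sample standard deviation (N - 1), matching the default of torch.Tensor.std().
        flat = [v for row in spec for v in row]
        mean = sum(flat) / len(flat)
        std = statistics.stdev(flat)
        spec = [[(v - mean) / std for v in row] for row in spec]
    return spec
```

The naive DFT is far too slow for real audio and is only there to make the arithmetic explicit; the framing, windowing, `log1p`, and normalization steps are the parts that need to be reproduced exactly for the exported model to see the same input.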

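Also, during training, with the learning rate set to 0.6 the model fails to converge and produces inf; it only trains normally after the rate is lowered.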

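Exploding gradients

Piaoling, have you trained this network on your own dataset? I often get exploding gradients during training, and they also occur with open-source datasets!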
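
One common mitigation for exploding gradients is clipping by global norm. The sketch below is framework-free and only illustrates the arithmetic; `clip_by_global_norm` is a hypothetical helper, and whether the project already clips gradients is not stated in this issue. In PyTorch the built-in equivalent is `torch.nn.utils.clip_grad_norm_`.

```python
import math


def clip_by_global_norm(grads, max_norm):
    """Scale all gradients so that their joint L2 norm is at most max_norm.

    grads is a list of flat gradient lists, one per parameter.
    Returns (clipped_grads, total_norm); clipped_grads is None when the
    norm is inf or nan, which signals the caller to skip the update.
    """
    total_norm = math.sqrt(sum(g * g for tensor in grads for g in tensor))
    if not math.isfinite(total_norm):
        return None, total_norm
    scale = min(1.0, max_norm / (total_norm + 1e-6))
    return [[g * scale for g in tensor] for tensor in grads], total_norm
```

Skipping the step on a non-finite norm keeps a single bad batch from writing inf or nan into the weights, which is a separate safeguard from the clipping itself.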

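Training parameters for the WenetSpeech dataset

Hello, for a dataset like WenetSpeech, how should training parameters such as epoch, batch, and learning rate be set, and does the scheduler need some modification? I am currently using epoch=60, batch=32, learning rate=5e-5, and I modified the scheduler slightly to step_size=3, gamma=0.9. However, between epochs 30 and 40 the model's loss stays at roughly 22 and the CER stays at around 0.3.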
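
To see what learning rate is actually in effect when the plateau appears, the schedule can be worked out by hand. This sketch assumes the scheduler follows the usual step-decay rule (as in PyTorch's `StepLR`), where the rate is multiplied by `gamma` once every `step_size` epochs; `step_lr` is a hypothetical helper written for this calculation.

```python
def step_lr(base_lr, epoch, step_size, gamma):
    """Learning rate in effect during `epoch` (0-based) under step decay."""
    return base_lr * gamma ** (epoch // step_size)


# Settings quoted in the issue: learning rate 5e-5, step_size=3, gamma=0.9.
schedule = {epoch: step_lr(5e-5, epoch, step_size=3, gamma=0.9) for epoch in (0, 30, 40, 59)}
```

Under this rule the rate at epoch 30 is 5e-5 × 0.9^10, roughly 35% of the initial value, and it keeps shrinking from there. Whether that decay contributes to the plateau cannot be determined from the issue alone.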

Model training question

I am using train.py for training, but I don't know the format of the files under dataset. Could you send me a copy?

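Spectrogram or MFCC features?

Hello, I read the code, but there is one thing I still don't understand: after the audio data is preprocessed, is the input to the neural network a spectrogram, a mel-spectrogram, or MFCC features? Thank you for clarifying!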

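Dimension error in the forward pass after choosing an audio feature method

Hello Yeyu, I ran into the following errors while running this project:
Problem description:
When I choose mfcc to process the audio, the error is as follows: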
Traceback (most recent call last):
File "infer_path.py", line 35, in
predictor = Predictor(model_path=args.model_path, vocab_path=args.vocab_path, use_model=args.use_model,
File "C:\Users\Administrator\PycharmProjects\masr\MASR\masr\predict.py", line 101, in __init__
self.predict(warmup_audio_path, to_an=False)
File "C:\Users\Administrator\PycharmProjects\masr\MASR\masr\predict.py", line 173, in predict
output_data, _, _ = self.predictor(audio_data, audio_len, init_state_h_box, init_state_c_box)
File "C:\Users\Administrator\PycharmProjects\masr\venv\lib\site-packages\torch\nn\modules\module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "C:\Users\Administrator\PycharmProjects\masr\MASR\masr\model_utils\utils.py", line 33, in forward
x = self.normalizer(audio)
File "C:\Users\Administrator\PycharmProjects\masr\venv\lib\site-packages\torch\nn\modules\module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "C:\Users\Administrator\PycharmProjects\masr\MASR\masr\model_utils\utils.py", line 19, in forward
x = (x - self.mean) / (self.std + self.eps)
RuntimeError: The size of tensor a (39) must match the size of tensor b (161) at non-singleton dimension 1

When I choose fbank to process the audio, the error is as follows:
Traceback (most recent call last):
File "infer_path.py", line 35, in
predictor = Predictor(model_path=args.model_path, vocab_path=args.vocab_path, use_model=args.use_model,
File "C:\Users\Administrator\PycharmProjects\masr\MASR\masr\predict.py", line 101, in __init__
self.predict(warmup_audio_path, to_an=False)
File "C:\Users\Administrator\PycharmProjects\masr\MASR\masr\predict.py", line 173, in predict
output_data, _, _ = self.predictor(audio_data, audio_len, init_state_h_box, init_state_c_box)
File "C:\Users\Administrator\PycharmProjects\masr\venv\lib\site-packages\torch\nn\modules\module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "C:\Users\Administrator\PycharmProjects\masr\MASR\masr\model_utils\utils.py", line 33, in forward
x = self.normalizer(audio)
File "C:\Users\Administrator\PycharmProjects\masr\venv\lib\site-packages\torch\nn\modules\module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "C:\Users\Administrator\PycharmProjects\masr\MASR\masr\model_utils\utils.py", line 19, in forward
x = (x - self.mean) / (self.std + self.eps)
RuntimeError: The size of tensor a (120) must match the size of tensor b (161) at non-singleton dimension 1

How can this be solved?
I would also like to ask: are mfcc or fbank features always better than linear ones?
I hope you can clear this up.
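
Both tracebacks fail at the same line: the normalizer holds a mean and std with 161 entries, while the incoming features have 39 or 120 values per frame. A size of 161 is consistent with a linear spectrogram computed with `n_fft=320` (320 // 2 + 1); 39 and 120 are consistent with 13 MFCCs or 40 filterbank energies stacked with their first and second deltas. Those breakdowns are inferences from the numbers, not confirmed project settings. Under those assumptions, a check like the following sketch would report the mismatch before the broadcast fails; both functions are hypothetical.

```python
def expected_feature_dim(method, n_fft=320, n_mfcc=13, n_filt=40):
    """Values per frame for each feature method; the defaults are assumptions."""
    if method == "linear":
        return n_fft // 2 + 1
    if method == "mfcc":
        return n_mfcc * 3  # static + delta + delta-delta
    if method == "fbank":
        return n_filt * 3  # static + delta + delta-delta
    raise ValueError(f"unknown feature method: {method}")


def check_normalizer(method, mean_size):
    """Raise a readable error when the stored statistics do not fit the features."""
    dim = expected_feature_dim(method)
    if dim != mean_size:
        raise ValueError(
            f"feature_method={method!r} yields {dim} values per frame, but the stored "
            f"mean/std has {mean_size}; the model and its statistics appear to have "
            f"been built with a different feature method"
        )
    return dim
```

If the statistics are stored inside the exported model, as the traceback through `self.normalizer` suggests, switching `feature_method` at inference time would require a model trained and exported with that same method.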

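Dataset annotations

Hello, if the annotations contain English letters, will they fail to be recognized in the end? I ask because your data preparation instructions say that annotations must not contain English letters, Arabic numerals, and so on.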
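
To find annotation lines that break this rule before training, a small check such as the following can be run over the transcripts. It is a sketch: `find_disallowed` is a hypothetical helper, and the pattern covers only ASCII letters and digits, since the full list of excluded symbols is not given here.

```python
import re

# ASCII letters and digits; extend the pattern if other symbols must be excluded.
_DISALLOWED = re.compile(r"[A-Za-z0-9]")


def find_disallowed(transcript):
    """Return the ASCII letters and digits found in one transcript line."""
    return _DISALLOWED.findall(transcript)
```

An empty result means the line satisfies the stated rule; a non-empty result lists the offending characters so the line can be rewritten or dropped.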

Error loading the pretrained model

self.__class__.__name__, "\n\t".join(error_msgs)))

RuntimeError: Error(s) in loading state_dict for DeepSpeech2Model:
size mismatch for output.weight: copying a param with shape torch.Size([2988, 1024]) from checkpoint, the shape in current model is torch.Size([3894, 1024]).
size mismatch for output.bias: copying a param with shape torch.Size([2988]) from checkpoint, the shape in current model is torch.Size([3894]).
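
The message says the output layer has 2988 rows in the checkpoint but 3894 in the model being built. In a CTC model the output size normally follows the vocabulary size, so a likely cause is that the vocabulary file in use differs from the one the checkpoint was trained with; that is an inference, not something the log confirms. A sketch of a pre-load check, using plain shape dictionaries and a hypothetical `find_shape_mismatches` helper:

```python
def find_shape_mismatches(checkpoint_shapes, model_shapes):
    """Return {name: (checkpoint_shape, model_shape)} for parameters whose shapes differ."""
    return {
        name: (shape, model_shapes[name])
        for name, shape in checkpoint_shapes.items()
        if name in model_shapes and model_shapes[name] != shape
    }


# Shapes copied from the error message above.
checkpoint = {"output.weight": (2988, 1024), "output.bias": (2988,)}
current = {"output.weight": (3894, 1024), "output.bias": (3894,)}
mismatches = find_shape_mismatches(checkpoint, current)
```

With PyTorch, the two dictionaries can be built from state dicts with `{k: tuple(v.shape) for k, v in state_dict.items()}`.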

About dataset

Hello, could you share the "extra-large dataset (1,600+ hours of real data) + (1,300+ hours of synthetic data)"?

gpu can't run

GPU execution requested, but not compiled with GPU support

What WER/CER and loss values are acceptable?

What WER/CER did the model shipped with the project reach while running train.py? And what loss?
I ask because nobody132's project reached a CER of 11%.

The python_speech_features package is missing

Exporting the model provided by the author produces the following error:
Traceback (most recent call last):
File "D:/gitDataset/BaiduAI_SR_sent/MASR/export_model.py", line 4, in
from masr.trainer import MASRTrainer
File "D:\gitDataset\BaiduAI_SR_sent\MASR\masr\trainer.py", line 20, in
from masr.data_utils.featurizer.audio_featurizer import AudioFeaturizer
File "D:\gitDataset\BaiduAI_SR_sent\MASR\masr\data_utils\featurizer\audio_featurizer.py", line 2, in
from python_speech_features import delta
ModuleNotFoundError: No module named 'python_speech_features'

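Running tune.py fails because mean_std.npz is missing

When I first ran this project I used only one dataset and found the results poor, so I later used all three datasets, and everything went smoothly. After training the model, the next step was to run tune.py, which reported that mean_std.npz is missing. I went back to the data preparation stage, which says that create_data.py generates mean_std.npz, but I only got mean_std.json, and config_zh.yml also mentions only mean_std.json. Which script generates the npz file, or which configuration file specifies it?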

Training loss does not converge

Why does the loss not converge when I train on my own dataset? The CER stays at 100%.
Is it because my dataset is too small?

infer_path.py elapsed time: 3432ms. Is this because of the dataset?

Finally got it running, using the one for the extra-large dataset, on a Jetson Nano 2GB:

(py36) cgisky@cgisky-jeston:~/MASR$ python infer_path.py --wav_path=./dataset/test.wav
-----------  Configuration Arguments -----------
alpha: 2.2
beam_size: 300
beta: 4.3
cutoff_prob: 0.99
cutoff_top_n: 40
decoder: ctc_beam_search
feature_method: linear
is_long_audio: False
lang_model_path: lm/zh_giga.no_cna_cmn.prune01244.klm
model_path: models/deepspeech2/inference.pt
pun_model_dir: models/pun_models/
real_time_demo: False
to_an: False
use_gpu: True
use_model: deepspeech2
use_pun: False
vocab_path: dataset/vocabulary.txt
wav_path: ./dataset/test.wav
------------------------------------------------

==================================================================
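The paddlespeech-ctcdecoders library is missing; please install it as described in the documentation. On Windows, only ctc_greedy can be used.
[Note] Automatically switched to the ctc_greedy decoder.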
==================================================================


Elapsed time: 3432ms, recognition result: 近几年不但我用书给女儿压岁也劝说亲朋不要给女儿压岁钱而改送压岁书, score: 97

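Missing vocabulary

I just downloaded the model from the download link you provided, and importing it raises an error, probably because the vocab is missing. Could you provide the vocab for the model trained on the 1,300-hour data?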

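Model training problem

Hello,
I am trying to train on a 200-hour custom dataset with batch set to 512 and the learning rate set to (5e-5)*5.66, starting from the AIshell pretrained model. After 290 epochs it has not really converged: the final loss is 30+ and the CER is only 67. What could be the cause?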
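
The factor 5.66 is close to the square root of 512/16, which would correspond to square-root scaling of the learning rate from a base batch size of 16. That base size is an inference from the number, not something stated in the issue. The sketch below only compares the two common scaling rules under that assumption; `scaled_lr` is a hypothetical helper.

```python
import math


def scaled_lr(base_lr, base_batch, batch, rule="sqrt"):
    """Scale a learning rate with batch size using the square-root or linear rule."""
    ratio = batch / base_batch
    return base_lr * (math.sqrt(ratio) if rule == "sqrt" else ratio)


# Base batch size of 16 is assumed; 5e-5 and 512 are quoted from the issue.
sqrt_lr = scaled_lr(5e-5, 16, 512, rule="sqrt")
linear_lr = scaled_lr(5e-5, 16, 512, rule="linear")
```

Under the assumed base batch size, the square-root rule gives about 2.8e-4 and the linear rule gives 1.6e-3, so the setting in the issue is the more conservative of the two.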

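GPU utilization during model training

When training on a large dataset (1000h+), why does GPU utilization start at full load and then drop off after a while? I have tried many approaches and none of them fixes it.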

The results seem very poor?

I used the first and third models directly on the long audio in dataset, and the recognition result is as follows:

Final result, elapsed time: 13005, score: 60, recognition result: ,做品名好,把尺力女争上个有的于树,瑞一直的案与值的知,他着干了,衡厂只丈爸到现止质与人工式,人
指与处,法宽大了一子也是智聘相上几乎没有时生目的,更不勇说赵处婴,她着指光华而游银色的运圈,微微反出战惊子,是是是在被方的风雪的压迫下却保持家
称,这就是白养术希被吉主通的一主,而截不是平法的,法没有普苏的自势没有有需七盘选的职之,别始不养说他不买力,如观美只专职婆说或很十举出折而言,
于止八,拿实是故中的美正,上里在积极出炉了高原上住果,罕景苹厂的罢地上扼然智率这么以朱国应传榜养术,玩照你处只之的数只是数,难到你就不想不的他
澳属由间奖部之就下着而央施医样澳然景力的首位他们家限的照兵,完岛李诱著原一点想到近养之之运人人好紧团结绿球上不尽的怀样术我然想认玩今天转环给职

Is something wrong?
