jianchang512 / stt

Voice Recognition to Text Tool / An offline, locally run speech-to-text service that outputs JSON, SRT subtitles with timestamps, or plain text

Home Page: https://v.wonyes.org

License: GNU General Public License v3.0

Python 52.16% HTML 47.68% Batchfile 0.16%

speech speech-recognition speech-to-text stt

stt's Introduction

English / Donate to this project / Discord / QQ group 905581228

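Speech-to-Text Tool

This is an offline, locally run speech-to-text tool based on the open-source faster-whisper model. It recognizes human speech in video and audio files and converts it to text, which can be output as JSON, as SRT subtitles with timestamps, or as plain text. Once self-deployed, it can stand in for OpenAI's speech recognition API or Baidu speech recognition, with accuracy roughly on par with the official OpenAI API.

After deploying or downloading, double-click start.exe and the local web page opens automatically in your browser.

Drag and drop, or click to select, the audio or video file to recognize, then choose the spoken language, the output text format, and the model (the base model is built in). Click to start recognition; when it finishes, the result is shown on the same page in the selected format.

The whole process needs no internet connection and runs entirely locally, so it can be deployed on an intranet.

faster-whisper models come in base/small/medium/large-v3 sizes, and the base model is built in. Recognition quality improves from base to large-v3, but so do the compute resources required. Download other models as needed and extract them into the models directory.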

Download links for all models

Video demo

cn-stt.mp4

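Using the precompiled Windows build / deploying from source on Linux and Mac

Click here to open the Releases page and download the precompiled files
Extract the download somewhere, for example E:/stt
Double-click start.exe and wait for the browser window to open automatically
Click the upload area on the page and pick the audio or video file to recognize in the dialog, or drag the file straight onto the upload area. Then choose the spoken language, the text output format, and the model, and click "Start recognition now". After a short wait, the text box at the bottom shows the result in the selected format
If the machine has an NVIDIA GPU and a correctly configured CUDA environment, CUDA acceleration is used automatically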

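Source deployment (Linux/Mac/Windows)

Requires Python 3.9 to 3.11
Create an empty directory, for example E:/stt, and open a cmd window in it: type cmd in the address bar and press Enter.

Use git to pull the source into the current directory: git clone git@github.com:jianchang512/stt.git .
Create a virtual environment: python -m venv venv
Activate the environment. On Windows: %cd%/venv/scripts/activate. On Linux and Mac: source ./venv/bin/activate
Install the dependencies: pip install -r requirements.txt. If it reports a version conflict, run pip install -r requirements.txt --no-deps. For CUDA acceleration, then run pip uninstall -y torch followed by pip install torch --index-url https://download.pytorch.org/whl/cu121
On Windows, extract ffmpeg.7z and put the ffmpeg.exe and ffprobe.exe it contains in the project directory. On Linux and Mac, look up how to install ffmpeg for your system
Download the model archives you need, then put the folder from each archive into the models folder in the project root
Run python start.py and wait for the local browser window to open automatically.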
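
Before running start.py, the prerequisites listed above can be checked from a short script. This is a minimal sketch and not part of the project: the helper names python_version_ok and missing_tools are made up for illustration, and the check relies only on shutil.which, so an ffmpeg.exe placed in the project folder may or may not be found depending on the platform and the directory the script is run from.

```python
import shutil
import sys


def python_version_ok(version=sys.version_info):
    """True if the interpreter is within the 3.9 to 3.11 range the README asks for."""
    return (3, 9) <= tuple(version[:2]) <= (3, 11)


def missing_tools(tools=("ffmpeg", "ffprobe")):
    """Return the tools that shutil.which cannot locate."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    print("Python version OK:", python_version_ok())
    print("Missing tools:", missing_tools() or "none")
```

If a tool is reported missing, install ffmpeg or copy the executables as described in the steps above, then run the check again.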

API

Endpoint: http://127.0.0.1:9977/api

Request method: POST

Request parameters:

language: language code, one of the following

>
> Chinese: zh
> English: en
> French: fr
> German: de
> Japanese: ja
> Korean: ko
> Russian: ru
> Spanish: es
> Thai: th
> Italian: it
> Portuguese: pt
> Vietnamese: vi
> Arabic: ar
> Turkish: tr
>

model: model name, one of the following
>
> base corresponds to models/models--Systran--faster-whisper-base
> small corresponds to models/models--Systran--faster-whisper-small
> medium corresponds to models/models--Systran--faster-whisper-medium
> large-v3 corresponds to models/models--Systran--faster-whisper-large-v3
>
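
To see which of these models are actually present before sending a request, the folder names above can be checked on disk. This is only a sketch: installed_models is a hypothetical helper, and it assumes the script runs from the project root so that "models" resolves to the project's models folder.

```python
from pathlib import Path

# Model name -> folder under models/, copied from the list above
MODEL_DIRS = {
    "base": "models--Systran--faster-whisper-base",
    "small": "models--Systran--faster-whisper-small",
    "medium": "models--Systran--faster-whisper-medium",
    "large-v3": "models--Systran--faster-whisper-large-v3",
}


def installed_models(models_root="models"):
    """Return the model names whose folder exists under models_root."""
    root = Path(models_root)
    return [name for name, folder in MODEL_DIRS.items() if (root / folder).is_dir()]


print(installed_models())
```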

response_format: format of the returned subtitles, one of text|json|srt

file: the audio or video file, uploaded as binary

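API request example

    import requests

    # Endpoint URL
    url = "http://127.0.0.1:9977/api"
    # Request parameters: file = audio/video file, language = language code,
    # model = model name, response_format = text|json|srt
    # Response: code == 0 means success, anything else is a failure;
    # msg is "ok" on success, otherwise the failure reason; data holds the recognized text
    data = {"language": "zh", "model": "base", "response_format": "json"}
    # Open the file in a with-block so it is closed once the upload finishes
    with open("C:/Users/c1/Videos/2.wav", "rb") as f:
        response = requests.post(url, timeout=600, data=data, files={"file": f})
    print(response.json())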
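
The code/msg/data convention described in the comments above can be wrapped in a small helper so that failures surface as exceptions instead of being printed and overlooked. This is a sketch under the assumption that the response is a JSON object with exactly those three fields; unwrap_result is a hypothetical name and the sample payload is made up.

```python
def unwrap_result(payload):
    """Return payload["data"] when code == 0; otherwise raise RuntimeError with msg."""
    if payload.get("code") != 0:
        raise RuntimeError(payload.get("msg", "unknown error"))
    return payload.get("data")


# Made-up payload shaped like the fields described above
print(unwrap_result({"code": 0, "msg": "ok", "data": "hello"}))
```

With the request example above, this would be called as unwrap_result(response.json()).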

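CUDA acceleration support

Installing the CUDA tools (detailed installation guide)

If your computer has an NVIDIA graphics card, first update the graphics driver to the latest version, then install the matching CUDA Toolkit and cuDNN for CUDA 11.x.

Once installation finishes, press Win + R, type cmd, and press Enter. In the window that opens, type nvcc --version and confirm that version information is displayed, similar to this screenshot

Then type nvidia-smi and confirm that it produces output showing the CUDA version number, similar to this screenshot

Then run `python testcuda.py`. If it reports success, the installation is correct; otherwise, check carefully and reinstall

The CPU is used by default. If you are sure you have an NVIDIA graphics card and the CUDA environment is configured, change `devtype=cpu` to `devtype=cuda` in set.ini and restart to enable CUDA acceleration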
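
The set.ini change can also be scripted. The sketch below assumes only what is stated above, namely that set.ini contains a devtype=... line; the rest of the file's layout is not shown here, so the function works on plain text rather than parsing the file. set_devtype is a hypothetical helper, not part of the project.

```python
import re


def set_devtype(ini_text, devtype):
    """Return ini_text with its devtype=... line set to 'cpu' or 'cuda'."""
    if devtype not in ("cpu", "cuda"):
        raise ValueError("devtype must be 'cpu' or 'cuda'")
    # Keep the key and any spacing around '=', replace only the value
    pattern = r"(?m)^([ \t]*devtype[ \t]*=[ \t]*)[^\r\n]*"
    new_text, count = re.subn(pattern, lambda m: m.group(1) + devtype, ini_text)
    if count == 0:
        raise ValueError("no devtype line found")
    return new_text


print(set_devtype("devtype=cpu\n", "cuda"))
```

To apply it to the real file, read set.ini as text, pass the contents through set_devtype, write the result back, and restart the program.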

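Notes

If you have no NVIDIA graphics card or the CUDA environment is not configured, do not use the large/large-v3 models; they may exhaust memory and freeze the machine
Chinese output is sometimes in Traditional Chinese characters
You may occasionally see a "cublasxx.dll does not exist" error. In that case, download cuBLAS (click to download cuBLAS), extract it, and copy the dll files into C:/Windows/System32
If the console shows "[W:onnxruntime:Default, onnxruntime_pybind_state.cc:1983 onnxruntime::python::CreateInferencePybindStateModule] Init provider bridge failed.", it can be ignored; it does not affect use
The CPU is used by default. If you are sure you have an NVIDIA graphics card and the CUDA environment is configured, change devtype=cpu to devtype=cuda in set.ini and restart to enable CUDA acceleration
The program crashes before processing finishes

This happens when CUDA is enabled and the CUDA environment is installed, but cuDNN was never installed and configured by hand. Install the cuDNN that matches your CUDA version. For example, if you installed CUDA 12.3, download the cuDNN for CUDA 12.x archive, extract it, and copy the 3 folders inside into the CUDA installation directory. For a detailed tutorial, see https://juejin.cn/post/7318704408727519270

If cuDNN was installed as the tutorial describes and the program still crashes, the GPU has very likely run out of video memory. Switch to the medium model instead. With less than 8 GB of video memory, avoid the large-v3 model where possible, especially for videos larger than 20 MB, or the program may crash for lack of video memory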

Acknowledgements

Other projects that this project mainly depends on


stt's Issues

Add a download button / it no longer gets stuck at 98% (please see the video)

https://drive.google.com/file/d/1Vn28cht--INSiRlKrJzWYlDvLOM8jOkf/view?usp=drivesdk

It would be great to have download buttons:
download SRT
download TXT
download JSON

Requested int8 compute type, but the target device or backend do not support efficient int8 computation.

After opening the page and clicking "Start recognition now", I get the following error:
Requested int8 compute type, but the target device or backend do not support efficient int8 computation.
How should I change the configuration?

ERROR: No matching distribution found for torch==2.1.2+cu121

Error opening output file

import requests

url = "http://127.0.0.1:9977/api"
files = {"file": open("D:/video-to-text/stt-v0.92/123.mp4", "rb")}
data={"language":"en","model":"base","response_format":"text"}
response = requests.request("POST", url, timeout=600, data=data,files=files)
print(response.json())

Error returned:
{'code': 1, 'msg': 'Error opening output file D:\video-to-text\stt-v0.92\static\tmp\123.wav.\r\nError opening output files: Invalid argument\r\n"'}

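I looked in the static/tmp directory and found that my 123.MP4 file had been copied there. I don't know why; perhaps 123.wav was never successfully extracted from the mp4.
I also have the audio for 123.mp4 as a 123.m4a file, but the system does not support that format. Is there any way to work around this?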
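
One possible workaround, not confirmed by the project, is to convert the m4a (or the mp4 itself) to WAV with ffmpeg before uploading. The sketch below assumes ffmpeg is on PATH; the helper names are hypothetical, and 16 kHz mono is chosen because that is the input Whisper models work with. Whether this avoids the error reported above has not been verified.

```python
import subprocess


def ffmpeg_to_wav_cmd(src, dst):
    """Build an ffmpeg command that converts src to a 16 kHz mono WAV file at dst."""
    # -y overwrite output, -vn drop video, -ac 1 mono, -ar 16000 sample rate
    return ["ffmpeg", "-y", "-i", src, "-vn", "-ac", "1", "-ar", "16000", dst]


def convert_to_wav(src, dst):
    """Run the conversion; raises CalledProcessError if ffmpeg exits with an error."""
    subprocess.run(ffmpeg_to_wav_cmd(src, dst), check=True)
    return dst


print(" ".join(ffmpeg_to_wav_cmd("123.m4a", "123.wav")))
```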

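After yesterday's update, the result area shows null once processing finishes

As the screenshot above shows, result is empty on the final request for the result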

Error when running on Mac

Thanks for developing this software. I get an error when running it on my Mac:
python start.py
/Users/mam94/Documents/A/To_code/audiotext/venv/lib/python3.10/site-packages/torch/nn/modules/transformer.py:20: UserWarning: Failed to initialize NumPy: No module named 'numpy' (Triggered internally at /Users/runner/work/pytorch/pytorch/pytorch/torch/csrc/utils/tensor_numpy.cpp:84.)
  device: torch.device = torch.device(torch._C.get_default_device()), # torch.device('cpu'),
Traceback (most recent call last):
  File "/Users/mam94/Documents/A/To_code/audiotext/stt/start.py", line 13, in <module>
    from stslib import cfg, tool
  File "/Users/mam94/Documents/A/To_code/audiotext/stt/stslib/cfg.py", line 40, in <module>
    sets=parse_ini()
  File "/Users/mam94/Documents/A/To_code/audiotext/stt/stslib/cfg.py", line 11, in parse_ini
    "lang":"en" if locale.getdefaultlocale()[0].split('_')[0].lower() != 'zh' else "zh",
AttributeError: 'NoneType' object has no attribute 'split'

ModuleNotFoundError: No module named 'torch'

After running python3 start.py, I get:

Traceback (most recent call last):
  File "/stt/start.py", line 5, in <module>
    import torch
ModuleNotFoundError: No module named 'torch'

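Version 0.0.92 produces hallucinated results

I had been using 0.0.91 to convert some audio, saw that there was an update, and upgraded. There are two problems:

A lot of content is not transcribed: for a 40-second recording, only three sentences came out;
The result contains hallucinations, with content that is not in the audio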

stt

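Long videos or audio files don't seem to be readable

Here is how I worked around the problem above:
1. Split the video into 10-minute segments
2. Run speech recognition on the segments one by one
3. Join the recognized text and output it
PS: If you are worried that speech at a cut point gets misrecognized, take one more clip of about 10 seconds around that point and stitch its text into the neighboring text to prevent errors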
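
The time windows for this workaround can be planned with a few lines of Python. This is a sketch of the splitting arithmetic only, with a hypothetical plan_chunks helper; cutting the media file and stitching the text are left out. It returns the 10-minute segments plus one short clip centered on each cut point.

```python
def plan_chunks(duration, chunk=600.0, seam=10.0):
    """Return (segments, seams) as lists of (start, end) times in seconds.

    segments are consecutive windows of at most `chunk` seconds;
    seams are `seam`-second windows centered on each cut between segments.
    """
    duration = float(duration)
    if duration <= 0:
        return [], []
    cuts = []
    t = float(chunk)
    while t < duration:
        cuts.append(t)
        t += chunk
    bounds = [0.0] + cuts + [duration]
    segments = list(zip(bounds[:-1], bounds[1:]))
    seams = [(max(0.0, c - seam / 2), min(duration, c + seam / 2)) for c in cuts]
    return segments, seams


segments, seams = plan_chunks(1500)
print(segments)  # [(0.0, 600.0), (600.0, 1200.0), (1200.0, 1500.0)]
print(seams)     # [(595.0, 605.0), (1195.0, 1205.0)]
```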

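Numbers are not recognized well. Is there a fix?

Numbers and mobile phone numbers read out in the recording are not recognized. Is there a good solution?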

Version 0.92 still outputs many Traditional Chinese characters

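Suggestion: add a setting so that transcription includes punctuation. Donated 100

Whisper itself supports punctuation, but this project rarely outputs any when transcribing Chinese speech. The usual approach is to add a prompt setting, which raises the probability of punctuation in the output.
See this article
Setting a prompt also fixes the Traditional Chinese output problem.
Also, if possible, I would like a speaker diarization feature; see the whisperX project
I hope the author can look into this request. I have donated 100 yuan.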
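
The prompt approach can be sketched directly against faster-whisper, whose transcribe method accepts an initial_prompt argument. This is not how the project wires its settings: the prompt wording and the transcribe_with_prompt helper are made up for illustration, and passing a model name such as "base" lets faster-whisper load or download the model itself rather than using this project's models folder.

```python
# Example prompt in Simplified Chinese with punctuation (the wording is only an illustration)
PUNCTUATION_PROMPT = "以下是普通话的句子，请使用简体中文并加上标点符号。"


def transcribe_with_prompt(audio_path, model_name="base", prompt=PUNCTUATION_PROMPT):
    """Transcribe audio_path with faster-whisper, biasing the output with an initial prompt."""
    # Imported lazily so the module loads even when faster-whisper is not installed
    from faster_whisper import WhisperModel

    model = WhisperModel(model_name, device="cpu", compute_type="int8")
    segments, _info = model.transcribe(audio_path, language="zh", initial_prompt=prompt)
    return "".join(segment.text for segment in segments)
```

The prompt only nudges the model toward punctuated Simplified Chinese; it does not guarantee it.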

Processing result is null

I tried several models, and both the Python version and the exe give the same null result

How can I use my GPU?

I'd like to know how to use the GPU instead of the CPU on 64-bit Windows, because processing on the CPU takes a long time.

Please consider adding a progress bar

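Crash after speech recognition finishes

With both pre-split and equal-split modes, the program crashes right after speech recognition finishes, just as it should be entering the translation stage. The end of the log shows:
INFO:VideoTrans:语音识别完成 / 266
INFO:qdarkstyle:QSS file successfully loaded.
INFO:qdarkstyle:Found application patches to be applied.
Nothing there looks obviously wrong, and I don't understand what is happening.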

requirements.txt is missing faster_whisper

1983 error with sstv-0.0.8

2024-01-28 12:06:00.8464605 [W:onnxruntime:Default, onnxruntime_pybind_state.cc:1983 onnxruntime::python::CreateInferencePybindStateModule] Init provider bridge failed.

Environment

windows 10 22H2

nvidia 551.23
cuda 12.3.2_546.12
cudnn-windows-x86_64-8.9.6.50_cuda12-archive

Large file is recognized but produces no result

An mp4 file of about 100 MB and roughly 1 hour long, processed with CUDA and float32: nothing is displayed after recognition finishes.
The sample mp4 file is recognized correctly.

Create voice from SRT file

Hi Jianchang, I am looking for a way to create voice audio from an SRT file, that is, to generate speech for the SRT file with OpenAI or Edge TTS voices, running on its own. Can you share how to do it? Thank you for answering this off-topic question!

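Precompiled build: API upload fails with ERROR in start: [api]error: tuple indices must be integers or slices, not str

Sorry to bother you.
With the precompiled build, the web page processes the same audio clip without problems, but calling the API from Python fails with "上传失败" (upload failed).
I'm a beginner asking sincerely: how should I deal with this?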

Can't complete the conversion: on GPU the program crashes in the last 2%

I installed the driver correctly and CUDA works. The problem occurs only with CUDA; the CPU has not shown it so far. It is unrelated to the model version: neither the base model nor the largest l3 model can complete the conversion, and the program crashes at the end. I'm using CUDA 12.3.

File is incomplete

mainworker ('File model.bin is incomplete: failed to read a value of size 4 at position 0',)("('File model.bin is incomplete: failed to read a value of size 4 at position 0',)",)
How do I fix this error?

locale.getdefaultlocale Error in cfg.py

Env

version: 0.92
platform: Mac Air M1

I cloned the project and ran it as described in the README, and it throws the following error:

Traceback (most recent call last):
  File ".../stt/start.py", line 13, in <module>
    from stslib import cfg, tool
  File ".../stt/stslib/cfg.py", line 41, in <module>
    sets=parse_ini()
  File ".../stt/stslib/cfg.py", line 11, in parse_ini
    "lang":"en" if locale.getdefaultlocale()[0].split('_')[0].lower() != 'zh' else "zh", 
AttributeError: 'NoneType' object has no attribute 'split'
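
The traceback shows that locale.getdefaultlocale() returned None as its first element, which Python documents as possible when the locale cannot be determined (for example when the relevant environment variables are unset). A defensive version of that expression could look like the sketch below. It is not the project's actual fix; lang_from_locale and detect_lang are hypothetical names, and falling back to "en" is an assumption.

```python
import locale


def lang_from_locale(code):
    """Map a locale code such as 'zh_CN' to 'zh' or 'en'; None or '' falls back to 'en'."""
    if not code:
        return "en"
    return "zh" if code.split("_")[0].lower() == "zh" else "en"


def detect_lang():
    """Same lookup as in cfg.py, but tolerant of an undetermined locale."""
    return lang_from_locale(locale.getdefaultlocale()[0])


print(lang_from_locale(None))     # en
print(lang_from_locale("zh_CN"))  # zh
```

As a workaround without touching the code, setting a locale environment variable such as LANG before starting the program may also avoid the None value.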

I want to translate to Spanish and Italian. Can you share the JSON file with the strings?

Could you provide a Docker image?

Could you provide a Docker image? That would make it easy to try out and deploy with one click

Problem when running stt

How can I solve this problem?

Plus?

Hi!
I have downloaded a whisper model file which contains multilanguages, so it recognizes languages like mine, which is Hungarian. It is not listed in the drop down list. In the "page source" section I see this image, I just can't add an extra language by editing. So this: Hungarian.

When I try to edit the layui.js file, I see chaos and when I search for text, it doesn't return any results if I type in, say, French.
Same with layui.css file.
Please help me how I can add my language. Thank you.