
jianchang512 / stt


Voice Recognition to Text Tool: an offline, locally running speech-to-text service that outputs JSON, SRT subtitles with timestamps, or plain text

Home Page: https://v.wonyes.org

License: GNU General Public License v3.0

Python 52.16% HTML 47.68% Batchfile 0.16%
speech speech-recognition speech-to-text stt

stt's Introduction

English / Donate to this project / Discord / QQ group 905581228

Speech-to-Text Tool

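This is an offline, locally running speech-to-text tool built on the open-source faster-whisper model. It recognizes human speech in video or audio files and converts it to text, output as JSON, as SRT subtitles with timestamps, or as plain text. Once self-deployed, it can replace OpenAI's speech recognition API, Baidu speech recognition, and similar services. Accuracy is roughly on par with the official OpenAI API.

After deploying or downloading, double-click start.exe; it automatically opens the local web page in your browser.

Drag and drop, or click to select, the audio or video file to recognize, then choose the spoken language, the output text format, and the model (the base model is built in). Click to start recognition; when it finishes, the result is shown on the same page in the chosen format.

The whole process needs no internet connection and runs entirely locally, so it can be deployed on an intranet.

The open-source faster-whisper models come in base/small/medium/large-v3, with base built in. Recognition quality improves from base to large-v3, but so do the computing resources required. Download other models as needed and extract them into the models directory.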

Download links for all models

Video demo

cn-stt.mp4

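Using the prebuilt Windows version / source deployment on Linux and Mac

  1. Download the prebuilt package from the Releases page

  2. Extract it somewhere, for example E:/stt

  3. Double-click start.exe and wait for the browser window to open automatically

  4. Click the upload area on the page and pick the audio or video file in the dialog, or drag the file directly onto the upload area. Then choose the spoken language, the text output format, and the model, and click "Start recognition now". After a short wait, the text box at the bottom shows the result in the chosen format

  5. If the machine has an NVIDIA GPU and the CUDA environment is configured correctly, CUDA acceleration is used automatically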

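Deploying from source (Linux/Mac/Windows)

  1. Requires Python 3.9 to 3.11

  2. Create an empty directory, for example E:/stt, and open a cmd window in it by typing cmd in the address bar and pressing Enter.

    Use git to pull the source into the current directory: git clone git@github.com:jianchang512/stt.git .

  3. Create a virtual environment: python -m venv venv

  4. Activate the environment. On Windows: %cd%/venv/scripts/activate; on Linux and Mac: source ./venv/bin/activate

  5. Install the dependencies: pip install -r requirements.txt. If a version-conflict error is reported, run pip install -r requirements.txt --no-deps instead. For CUDA acceleration, also run pip uninstall -y torch and then pip install torch --index-url https://download.pytorch.org/whl/cu121

  6. On Windows, extract ffmpeg.7z and place ffmpeg.exe and ffprobe.exe from it in the project directory. On Linux and Mac, look up how to install ffmpeg for your system

  7. Download the model archives you need, then put the folder from each archive into the models folder in the project root

  8. Run python start.py and wait for the local browser window to open automatically.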
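
Steps 1 and 6 above can be checked before running start.py. The following is a minimal sketch, not part of the project: the function names are hypothetical, and it only looks for ffmpeg and ffprobe on the PATH, so on Windows it may report them missing even when the .exe files sit in the project directory as step 6 describes.

```python
import shutil
import sys


def python_version_supported(version=None):
    """Return True if the version is in the 3.9 to 3.11 range named in step 1."""
    major, minor = (version or sys.version_info)[:2]
    return major == 3 and 9 <= minor <= 11


def missing_tools(tools=("ffmpeg", "ffprobe")):
    """Return the tools from `tools` that cannot be found on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


def preflight_report():
    """Collect human-readable problems; an empty list means nothing was found."""
    problems = []
    if not python_version_supported():
        problems.append("Python 3.9 to 3.11 is required")
    for tool in missing_tools():
        problems.append(f"{tool} was not found on the PATH")
    return problems
```

Calling `preflight_report()` inside the activated virtual environment returns a list of problems to fix before step 8.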

API

Endpoint: http://127.0.0.1:9977/api

Method: POST

Request parameters:

language: language code, one of the following

>
> Chinese: zh
> English: en
> French: fr
> German: de
> Japanese: ja
> Korean: ko
> Russian: ru
> Spanish: es
> Thai: th
> Italian: it
> Portuguese: pt
> Vietnamese: vi
> Arabic: ar
> Turkish: tr
>

model: model name, one of the following
>
> base corresponds to models/models--Systran--faster-whisper-base
> small corresponds to models/models--Systran--faster-whisper-small
> medium corresponds to models/models--Systran--faster-whisper-medium
> large-v3 corresponds to models/models--Systran--faster-whisper-large-v3
>

response_format: format of the returned subtitles, one of text|json|srt

file: the audio or video file, uploaded as binary
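
The parameter lists above can be captured in a small helper that rejects unlisted values before a request is sent. This is a minimal sketch: `build_form_data` is a hypothetical name, not part of the project, and the allowed values are only those listed above (the server itself may accept more).

```python
# Allowed values copied from the parameter lists above.
LANGUAGES = {"zh", "en", "fr", "de", "ja", "ko", "ru",
             "es", "th", "it", "pt", "vi", "ar", "tr"}
MODELS = {"base", "small", "medium", "large-v3"}
RESPONSE_FORMATS = {"text", "json", "srt"}


def build_form_data(language, model="base", response_format="text"):
    """Return the form fields for a POST to /api, or raise ValueError."""
    if language not in LANGUAGES:
        raise ValueError(f"language not in the documented list: {language!r}")
    if model not in MODELS:
        raise ValueError(f"model not in the documented list: {model!r}")
    if response_format not in RESPONSE_FORMATS:
        raise ValueError(f"unknown response_format: {response_format!r}")
    return {"language": language, "model": model,
            "response_format": response_format}
```

The returned dictionary is meant to be passed as the form data of the POST, with the media file sent separately as the `file` field.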

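API request example

    import requests

    # Endpoint of the local stt service
    url = "http://127.0.0.1:9977/api"
    # Request parameters: file is the audio/video file, language the language code,
    # model the model name, response_format one of text|json|srt.
    # Response: code == 0 means success, anything else is a failure; msg is "ok" on
    # success, otherwise the failure reason; data holds the recognized text.
    data = {"language": "zh", "model": "base", "response_format": "json"}
    # Open the file in a with block so the handle is closed after the upload
    with open("C:/Users/c1/Videos/2.wav", "rb") as f:
        response = requests.post(url, timeout=600, data=data, files={"file": f})
    print(response.json())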
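
Building on the request example, the sketch below checks the `code`/`msg`/`data` fields described in its comments and writes the result to a file. The function names and the output path handling are hypothetical; the assumption that `data` is a string for `text` and `srt` and a JSON-serializable structure for `json` is mine and is not confirmed by the README.

```python
import json


def extract_data(payload):
    """Return the data field of a decoded /api response, or raise RuntimeError."""
    if payload.get("code") != 0:
        # Any non-zero code is a failure; msg carries the reason.
        raise RuntimeError(payload.get("msg", "unknown error"))
    return payload.get("data")


def transcribe_to_file(media_path, out_path, language="zh", model="base",
                       response_format="srt",
                       url="http://127.0.0.1:9977/api"):
    """Upload media_path to the local service and save the result to out_path."""
    import requests  # third-party; imported here so extract_data works without it

    form = {"language": language, "model": model,
            "response_format": response_format}
    with open(media_path, "rb") as media:
        response = requests.post(url, data=form, files={"file": media},
                                 timeout=600)
    response.raise_for_status()
    data = extract_data(response.json())
    # Assumption: non-string results (e.g. json output) are serialized as JSON.
    text = data if isinstance(data, str) else json.dumps(data, ensure_ascii=False)
    with open(out_path, "w", encoding="utf-8") as out:
        out.write(text)
    return out_path
```

For example, `transcribe_to_file("2.wav", "2.srt")` would save SRT subtitles next to the script, provided the service is running on the default port.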

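CUDA acceleration support

Installing the CUDA tools (see the detailed installation guide)

If your computer has an NVIDIA graphics card, first update the graphics driver to the latest version, then install the matching CUDA Toolkit and cuDNN for CUDA 11.x.

After installation, press Win + R, type cmd, and press Enter. In the window that opens, run nvcc --version and confirm that version information is displayed.

Then run nvidia-smi and confirm that it prints output that includes the CUDA version number.

Then run `python testcuda.py`. If it reports success, the installation is correct; otherwise check carefully and reinstall.

The CPU is used by default. If you are sure you have an NVIDIA graphics card and the CUDA environment is configured, change `devtype=cpu` to `devtype=cuda` in set.ini and restart to enable CUDA acceleration.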
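
The set.ini change described above can also be scripted. This is a minimal sketch under the assumption that set.ini contains a plain `devtype=...` line as quoted; the rest of the file format is unknown to me, so the function only rewrites that one line and leaves everything else untouched. `set_devtype` is a hypothetical helper, not part of the project.

```python
import re

# Matches the key and "=" of a devtype line, then whatever value follows it.
_DEVTYPE_LINE = re.compile(r"^([ \t]*devtype[ \t]*=[ \t]*)[^\r\n]*", re.MULTILINE)


def set_devtype(ini_text, devtype):
    """Return ini_text with its devtype line set to 'cpu' or 'cuda'."""
    if devtype not in ("cpu", "cuda"):
        raise ValueError("devtype must be 'cpu' or 'cuda'")
    new_text, count = _DEVTYPE_LINE.subn(
        lambda match: match.group(1) + devtype, ini_text)
    if count == 0:
        raise ValueError("no devtype line found")
    return new_text
```

To apply it, read set.ini as text, pass the contents through `set_devtype(text, "cuda")`, write the result back, and restart the program.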

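Notes

  1. Without an NVIDIA graphics card or a properly configured CUDA environment, do not use the large/large-v3 models; they may exhaust memory and freeze the machine

  2. Chinese output is sometimes in Traditional characters

  3. You may sometimes hit a "cublasxx.dll does not exist" error. In that case, download cuBLAS, extract it, and copy the DLL files inside into C:/Windows/System32

  4. If the console shows "[W:onnxruntime:Default, onnxruntime_pybind_state.cc:1983 onnxruntime::python::CreateInferencePybindStateModule] Init provider bridge failed.", it can be ignored; it does not affect use

  5. The CPU is used by default. If you are sure you have an NVIDIA graphics card and the CUDA environment is configured, change devtype=cpu to devtype=cuda in set.ini and restart to enable CUDA acceleration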

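  6. The program crashes before it finishes

If CUDA is enabled and the CUDA environment is installed, but cuDNN was never installed and configured manually, this problem occurs. Install the cuDNN that matches your CUDA version. For example, with CUDA 12.3, download the cuDNN for CUDA 12.x archive, extract it, and copy the 3 folders inside into the CUDA installation directory. For a detailed tutorial, see https://juejin.cn/post/7318704408727519270

If it still crashes after cuDNN is installed per the tutorial, the most likely cause is insufficient GPU memory; switch to the medium model. With less than 8 GB of GPU memory, avoid the large-v3 model where possible, especially for videos larger than 20 MB, or the program may crash from running out of GPU memory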
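
The model advice in notes 1 and 6 can be written down as a simple rule. This is only a sketch of that guidance: `usable_models` is a hypothetical helper, and the 8 GB threshold is the figure quoted above, not a measured limit.

```python
# Model names in increasing order of quality and resource use.
MODELS = ["base", "small", "medium", "large-v3"]


def usable_models(has_cuda, vram_gb=None):
    """Return the models the notes above consider safe to try.

    has_cuda: True if an NVIDIA GPU with a working CUDA setup is available.
    vram_gb:  GPU memory in GB, or None if unknown.
    """
    if not has_cuda:
        # Note 1: without CUDA, large-v3 may exhaust memory.
        return MODELS[:-1]
    if vram_gb is not None and vram_gb < 8:
        # Note 6: with less than 8 GB of GPU memory, fall back to medium.
        return MODELS[:-1]
    return list(MODELS)
```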

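Related projects

Video translation and dubbing tool: translates subtitles and adds dubbing

Voice cloning tool: synthesizes speech in any voice

Vocal and background music separation: a minimal tool for separating vocals from background music, operated through a local web page

Acknowledgments

Other projects this project mainly depends on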

  1. https://github.com/SYSTRAN/faster-whisper
  2. https://github.com/pallets/flask
  3. https://ffmpeg.org/
  4. https://layui.dev

stt's People

Contributors

jianchang512, peng-yt


stt's Issues

Error opening output file

import requests

url = "http://127.0.0.1:9977/api"
files = {"file": open("D:/video-to-text/stt-v0.92/123.mp4", "rb")}
data={"language":"en","model":"base","response_format":"text"}
response = requests.request("POST", url, timeout=600, data=data,files=files)
print(response.json())

Returned error message:
{'code': 1, 'msg': 'Error opening output file D:\video-to-text\stt-v0.92\static\tmp\123.wav.\r\nError opening output files: Invalid argument\r\n"'}

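I looked in the static/tmp directory and found that my 123.MP4 file had been copied there. I don't know why; perhaps 123.wav was not extracted from the mp4 successfully.
I actually have a 123.m4a audio file for 123.mp4, but this system does not support that file format. Is there any way to work around this?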

Error when running on Mac

Thanks for developing this software. I get an error when running it on a Mac:
python start.py
/Users/mam94/Documents/A/To_code/audiotext/venv/lib/python3.10/site-packages/torch/nn/modules/transformer.py:20: UserWarning: Failed to initialize NumPy: No module named 'numpy' (Triggered internally at /Users/runner/work/pytorch/pytorch/pytorch/torch/csrc/utils/tensor_numpy.cpp:84.)
device: torch.device = torch.device(torch._C.get_default_device()), # torch.device('cpu'),
Traceback (most recent call last):
  File "/Users/mam94/Documents/A/To_code/audiotext/stt/start.py", line 13, in <module>
    from stslib import cfg, tool
  File "/Users/mam94/Documents/A/To_code/audiotext/stt/stslib/cfg.py", line 40, in <module>
    sets=parse_ini()
  File "/Users/mam94/Documents/A/To_code/audiotext/stt/stslib/cfg.py", line 11, in parse_ini
    "lang":"en" if locale.getdefaultlocale()[0].split('_')[0].lower() != 'zh' else "zh",
AttributeError: 'NoneType' object has no attribute 'split'

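Hallucinations in the results with version 0.0.92

I had converted some audio with 0.0.91. When I saw there was an update I upgraded, and now there are two problems:

  1. Much of the content is not transcribed; a 40 s recording produced only three sentences;
  2. The results contain hallucinations: content that is not in the audio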

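Long videos or audio files apparently cannot be read

Here is my workaround for the problem above:
1. Split the video into 10-minute segments
2. Run speech recognition on each segment in turn
3. Join the recognized text and output it
PS: If you are worried that speech at the cut points will be misrecognized, cut an additional clip of about 10 seconds around each cut point and stitch its text together with the neighboring text to prevent errors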
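
The splitting scheme in this workaround can be sketched as follows. The helper names are hypothetical; the 600 s segment length and 10 s boundary clip come from the post, centering each clip on its cut point is my assumption, and the ffmpeg options used (`-ss`, `-t`, `-vn`, `-ac`, `-ar`) are standard ones. Stitching the boundary text back into the transcript is left out because the post does not specify how.

```python
def plan_segments(total, segment=600.0):
    """Return (start, duration) pairs that cover `total` seconds."""
    if segment <= 0:
        raise ValueError("segment must be positive")
    segments, start = [], 0.0
    while start < total:
        segments.append((start, min(segment, total - start)))
        start += segment
    return segments


def plan_boundary_clips(total, segment=600.0, clip=10.0):
    """Return (start, duration) pairs for short clips centered on each cut point."""
    if segment <= 0:
        raise ValueError("segment must be positive")
    clips, cut = [], segment
    while cut < total:
        start = max(0.0, cut - clip / 2)
        clips.append((start, min(clip, total - start)))
        cut += segment
    return clips


def ffmpeg_cut_command(src, dst, start, duration):
    """Build an ffmpeg command that extracts one mono 16 kHz audio slice."""
    return ["ffmpeg", "-y", "-ss", str(start), "-t", str(duration),
            "-i", src, "-vn", "-ac", "1", "-ar", "16000", dst]
```

Each command can be run with `subprocess.run(cmd, check=True)`, and the resulting slices sent to the API one at a time.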

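Suggestion: add a setting to support punctuation in transcription. Donated 100

Whisper itself supports punctuation, but this project rarely outputs punctuation when transcribing Chinese speech. The usual approach is to add a prompt setting that raises the probability of punctuation in the output.
See this article for reference.
Setting a prompt can also solve the Traditional Chinese output problem.
Also, if possible, I would like a speaker diarization feature; see the whisperX project for reference.
I hope the author can look at this request. I have donated 100 yuan.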
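
For reference, faster-whisper's `WhisperModel.transcribe` accepts an `initial_prompt` argument, which is one way to try the prompt approach described in this request. The sketch below is not how the project is implemented: the prompt wording, the function names, and the CPU/int8 settings are all assumptions for illustration, and whether a given prompt actually improves punctuation has to be tested on real audio.

```python
# Illustrative prompt only: a punctuated sentence in Simplified Chinese.
# The wording is an assumption, not taken from the project.
PUNCTUATED_ZH_PROMPT = "以下是普通话的句子,带有标点符号。"


def transcribe_options(language, initial_prompt=None):
    """Build keyword arguments for WhisperModel.transcribe."""
    options = {"language": language}
    if initial_prompt:
        options["initial_prompt"] = initial_prompt
    return options


def transcribe_with_prompt(audio_path, model_dir, language="zh",
                           initial_prompt=PUNCTUATED_ZH_PROMPT):
    """Transcribe audio_path with a prompt and return the joined text."""
    # Third-party package; imported here so the helper above works without it.
    from faster_whisper import WhisperModel

    # device and compute_type are assumptions suited to a CPU-only machine.
    model = WhisperModel(model_dir, device="cpu", compute_type="int8")
    segments, _info = model.transcribe(
        audio_path, **transcribe_options(language, initial_prompt))
    return "".join(segment.text for segment in segments)
```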

How can I use my GPU?

I want to know how to use the GPU instead of the CPU on 64-bit Windows, because the CPU takes a lot of time.

Please consider adding a progress bar.

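Crash after speech recognition completes

With both pre-split and equal-split modes, the program crashes after speech recognition completes, just when it should enter the translation stage. The end of the log shows "INFO:VideoTrans:语音识别完成 / 266
INFO:qdarkstyle:QSS file successfully loaded.
INFO:qdarkstyle:Found application patches to be applied.", which shows no obvious problem (the first line reports that speech recognition completed), and I don't understand what is going on.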

1983 error with sstv-0.0.8

2024-01-28 12:06:00.8464605 [W:onnxruntime:Default, onnxruntime_pybind_state.cc:1983 onnxruntime::python::CreateInferencePybindStateModule] Init provider bridge failed.

Environment

windows 10 22H2

nvidia 551.23
cuda 12.3.2_546.12
cudnn-windows-x86_64-8.9.6.50_cuda12-archive

Create voice from SRT file

Hi Jianchang, I am looking for a solution to create voice audio from an SRT file, that is, to generate speech for the SRT file with an OpenAI or Edge TTS voice, running independently. Can you share how to do it? Thank you for answering this fringe question!

Can't complete the conversion; on GPU it crashes in the last 2%

I installed the driver correctly and CUDA works. The problem occurs only with CUDA; the CPU does not have this problem so far. It has nothing to do with the model version: neither the base model nor the largest one (l3) can complete the conversion, and it crashes at the end. I'm using CUDA 12.3.

Incomplete file

mainworker ('File model.bin is incomplete: failed to read a value of size 4 at position 0',)("('File model.bin is incomplete: failed to read a value of size 4 at position 0',)",)
How can I resolve this error?

locale.getdefaultlocale Error in cfg.py

Env

version: 0.92
platform: Mac Air M1

I pulled the project as described in the README and ran it; it throws the following error:

Traceback (most recent call last):
  File ".../stt/start.py", line 13, in <module>
    from stslib import cfg, tool
  File ".../stt/stslib/cfg.py", line 41, in <module>
    sets=parse_ini()
  File ".../stt/stslib/cfg.py", line 11, in parse_ini
    "lang":"en" if locale.getdefaultlocale()[0].split('_')[0].lower() != 'zh' else "zh", 
AttributeError: 'NoneType' object has no attribute 'split'
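
The traceback shows `locale.getdefaultlocale()[0]` returning `None`, which the expression on line 11 of cfg.py does not handle. A defensive version of that language check might look like the sketch below. The function names are hypothetical and this is not the project's actual fix; note also that `locale.getdefaultlocale` is deprecated in recent Python versions.

```python
import locale


def lang_from_locale(code):
    """Map a locale code such as 'zh_CN' to 'zh', and anything else to 'en'."""
    if not code:
        # No locale could be determined; fall back to English.
        return "en"
    return "zh" if code.split("_")[0].lower() == "zh" else "en"


def default_lang():
    """Detect the UI language without failing when the locale is unset."""
    try:
        code = locale.getdefaultlocale()[0]
    except (ValueError, AttributeError):
        # ValueError: unparseable locale; AttributeError: function removed.
        code = None
    return lang_from_locale(code)
```

With this, `"lang": default_lang()` would yield "en" on machines where no default locale is reported instead of raising `AttributeError`.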

Plus?

Hi!
I have downloaded a whisper model file which contains multilanguages, so it recognizes languages like mine, which is Hungarian. It is not listed in the drop down list. In the "page source" section I see this image, I just can't add an extra language by editing. So this: Hungarian.

When I try to edit the layui.js file, I see chaos and when I search for text, it doesn't return any results if I type in, say, French.
Same with layui.css file.
Please help me how I can add my language. Thank you.

model

How were these model files trained? Could you explain?
