dadangdut33 / speech-translate

A realtime speech transcription and translation application using OpenAI's Whisper and free translation APIs. Interface made using Tkinter. Code written fully in Python.

License: MIT License

Python 93.52% Tcl 5.90% PowerShell 0.25% Inno Setup 0.33%

python speech-transcription speech-translation tkinter-python translate whisper

speech-translate's Introduction

Speech Translate

Speech Translate is a practical application that combines OpenAI's Whisper ASR model with free translation APIs. It serves as a versatile tool for both real-time / live speech-to-text and speech translation, allowing the user to seamlessly convert spoken language into written text. Additionally, it has the option to import and transcribe audio / video files effortlessly.

Speech Translate aims to extend Whisper's capabilities by combining it with translation APIs, while providing a simple, easy-to-use interface that makes it a more practical application. The application is also open source, so you are welcome to contribute to the project.

Preview - Usage

Transcribe mode on detached window (English)
Translate mode on detached window (English to Indonesia)

Preview - Setting

🚀 Features
📜 Requirements
🔧 Installation
📚 More Information
🛠️ Development
💡 Contributing
License
Attribution
Other

🚀 Features

Speech-to-text and/or speech translation (transcribed text can be translated into other languages) with live input from a mic or speaker 🎙️
Customizable subtitle window for live speech to text and/or speech translation
Batch processing of audio/video files for transcription and translation, with output as .txt, .srt, .ass, .tsv, .vtt, or .json 📂
Result refinement
Result alignment
Result translation (Translate only the result.json)

📜 Requirements

Compatible OS Installation:

OS	Installation from Prebuilt binary	Installation as a Module	Installation from Git
Windows	✔️	✔️	✔️
MacOS	❌	✔️	✔️
Linux	❌	✔️	✔️

* Python 3.8 or later (3.11 is recommended) for installation as module.

Speaker input only works on Windows 8 and above. (Alternatively, you can create a loopback that captures your system audio as a virtual input, like a mic, by using one of these guides/tools: [Voicemeeter on Windows]/[YT Tutorial] - [pavucontrol on Ubuntu with PulseAudio] - [blackhole on MacOS].)
An internet connection is needed only for translation with an API and for downloading models. (If you want to go fully offline, you can set up LibreTranslate on your local machine and configure it in the app settings.)
It is recommended to have the Segoe UI font installed on your system for the best UI experience. (For OSes other than Windows, see: Ubuntu - MacOS.)
It is recommended to have a capable GPU with CUDA support (the prebuilt version uses CUDA 11.8) for faster results. Each Whisper model has different requirements; for more information, check the Whisper repository directly.

Size	Parameters	Required VRAM	Relative speed
tiny	39 M	~1 GB	~32x
base	74 M	~1 GB	~16x
small	244 M	~2 GB	~6x
medium	769 M	~5 GB	~2x
large	1550 M	~10 GB	1x

* This information is also available in the app (hover over the model selection and a tooltip shows the model info). Also note that with faster-whisper, models run significantly faster and use less VRAM; for more information, see the faster-whisper repository.
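To make the table above concrete, here is a minimal sketch that picks the largest standard Whisper model whose approximate VRAM requirement fits a given budget. The numbers are copied from the table; the function name is hypothetical and not part of the app, and real memory use varies (faster-whisper, for example, needs less).

```python
from typing import Optional

# Approximate VRAM (GB) per standard Whisper model, copied from the table above.
# Ordered smallest to largest; dicts preserve insertion order on Python 3.7+.
WHISPER_VRAM_GB = {
    "tiny": 1,
    "base": 1,
    "small": 2,
    "medium": 5,
    "large": 10,
}


def largest_model_for_vram(available_gb: float) -> Optional[str]:
    """Return the largest model whose approximate VRAM need fits, or None if none fit."""
    fitting = [name for name, need in WHISPER_VRAM_GB.items() if need <= available_gb]
    # The list keeps the dict order, so the last entry is the largest fitting model
    return fitting[-1] if fitting else None


print(largest_model_for_vram(6))  # medium
```

Treat the result as a starting point only; the table's figures are approximations.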

🔧 Installation

Important

Please read the Requirements first before installing. For more information about using the app, please check the wiki.

From Prebuilt Binary (.exe)

Note

The prebuilt binary is shipped with CUDA 11.8, so the GPU version only works on GPUs compatible with CUDA 11.8. If your GPU is not compatible, you can try installing as a module or from Git instead.

Download the latest release (there are two versions: CPU and GPU/CUDA)
Install/extract the downloaded file
Run the program
Set the settings to your liking
Enjoy!

As A Module

Note

Use Python 3.11 for best compatibility and performance

Warning

You might need to have Build tools for Visual Studio (or the equivalent of it on your OS) installed

To install as a module, use pip with one of the following commands.

Install with GPU (Cuda compatible) support:

pip install -U git+https://github.com/Dadangdut33/Speech-Translate.git --extra-index-url https://download.pytorch.org/whl/cu118

cu118 here means CUDA 11.8; you can change it to another version if needed. You can find older versions of PyTorch here or here.
CPU only:

pip install -U git+https://github.com/Dadangdut33/Speech-Translate.git

You can then run the program by typing speech-translate in your terminal/console. Alternatively, you can clone the repo and install it locally by running pip install -e . in the project directory. (Don't forget to add --extra-index-url if you want to install with GPU support.)

Notes For Installation as Module:

If you are updating from an older version, add --upgrade --force-reinstall to the end of the command. If the update does not need new dependencies, you can also add --no-deps to speed up the installation.
If you want to install from a specific branch or commit, add @branch_name or @commit_hash to the end of the URL. Example: pip install -U git+https://github.com/Dadangdut33/Speech-Translate.git@dev --extra-index-url https://download.pytorch.org/whl/cu118
The --extra-index-url here selects the CUDA version. If your device is not compatible or you need another CUDA version, you can find older versions of PyTorch here or here.

From Git

If you prefer cloning the app directly from git/github, you can follow the guide in development (wiki) or below. Doing it this way might also provide a more stable environment.

📚 More Information

Check out the wiki for more information about the app, user settings, how to use it, and more.

🛠️ Development

Note

Check the wiki for more details

Setup

Note

It is recommended to create a virtual environment, but it is not required. I use Python 3.11.6 for development, but it should work with Python 3.8 or later.

Warning

You might need to have Build tools for Visual Studio installed

Clone the repo with its submodules by running git clone --recurse-submodules https://github.com/Dadangdut33/Speech-Translate.git
cd into the project directory
Create a virtual environment by running python -m venv venv
Activate your virtual environment
Install all the dependencies needed by running pip install -r requirements.txt --extra-index-url https://download.pytorch.org/whl/cu118 if you are using GPU or pip install -r requirements.txt if you are using CPU.
Run python Run.py in root directory to run the app.

Notes:

If you forgot the --recurse-submodules flag when cloning the repository and the submodules were not cloned correctly, run git submodule update --init --recursive in the project directory to pull the needed submodules.
The --extra-index-url is needed to install the CUDA version of PyTorch; here we are using CUDA 11.8. If your device is not compatible or you need another CUDA version, you can find previous PyTorch versions in this link or this.

Running the app

You can run the app by running Run.py in the root directory. Alternatively, you can run python -m speech_translate from the root directory.

Building

Before compiling the project, make sure you have installed all the dependencies and set up PyTorch correctly. Your PyTorch build determines whether the app uses the GPU or the CPU (which is why a virtual environment for the project is recommended).

The precompiled version of this project is built using cx_Freeze; the build script is provided in build.py. This build script is currently configured only for Windows builds, but feel free to contribute if you know how to build properly for other OSes.

To compile it into an exe, run python build.py build_exe in the root directory. This produces a folder in the build directory containing the compiled project alongside an executable. After that, use Inno Setup with the provided installer.iss script to create the installer.

Compatibility

This project should be compatible with Windows (preferably Windows 10 or later) and other platforms, but I haven't tested it extensively on other platforms. If you find any bugs or issues, feel free to create an issue.

💡 Contributing

Feel free to contribute to this project by forking the repository, making your changes, and submitting a pull request. You can also contribute by creating an issue if you find a bug or have a feature request. Also, feel free to give this project a star if you like it.

License

This project is licensed under the MIT License - see the LICENSE file for details

Attribution

Sunvalley TTK Theme (used for the app theme, although I modified it a bit)
Noto Emoji for the icons used in the app

Other

Check out my other similar project, Screen Translate, a screen translator/OCR tool made possible by Tesseract.

speech-translate's People

Contributors

Stargazers

Watchers

Forkers

migggzz monicaarnaud o3t1w slepetys cafew robrita ryanhe312 lovegotobe jonmatthis hawkingwan totopo27 bouallati darthpedroo sungeunfarm uokereh techthiyanes anirudhasher heitor92 timber8205 zhubaojie 665ermirshyti duagangtech zenkynemesis rstokes81 librastalker samirpaul1 jaedukseo 5l1v3r1 serviteur yhardie jasb62 sagatsou 0xfoc-eth andupotorac mcarbel ksjpswaroop syaikhipin artchess serjik777 soliver84 ssun3 advancedai-nl tiktok186 aloproducao eltld superoldman96 honsa quyen88 felicityrooting evgeni17 mmis1000 f901107

speech-translate's Issues

The first minute of some of the audio transcriptions is missing [BUG]

The first minute of some of the audio transcriptions is missing.
This happens when I upload multiple audio files to be transcribed.
I'm not sure what causes this, since the audio quality of the first minute is as clear as the rest of the audio.
I use the CPU version on a small laptop, so it may be that the processor's limitations, combined with me running another application on the same laptop, cause information to be lost during transcription. I will test this on another, more powerful PC.
Anyhow, there is no warning or indication that there is data loss or transcription issues. Is there a way for the program to check this in real time?
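One possible answer to the question above, sketched under assumptions: if the transcription produces segments with start and end times in seconds (as Whisper's segments do), a long stretch with no segments, including the stretch before the first segment, can be flagged for review. The function below is hypothetical and not part of Speech Translate, the 30-second threshold is arbitrary, and a flagged gap can also be genuine silence.

```python
from typing import List, Optional, Tuple

# (start_seconds, end_seconds, text), similar to the start/end/text of Whisper segments
Segment = Tuple[float, float, str]


def find_gaps(
    segments: List[Segment],
    duration: Optional[float] = None,
    min_gap: float = 30.0,
) -> List[Tuple[float, float]]:
    """Return (gap_start, gap_end) spans longer than min_gap seconds that have no text.

    The stretch from 0.0 to the first segment is checked as well, which would catch
    a missing first minute; pass the audio duration to also check the end.
    """
    gaps: List[Tuple[float, float]] = []
    previous_end = 0.0
    for start, end, _text in sorted(segments):
        if start - previous_end > min_gap:
            gaps.append((previous_end, start))
        previous_end = max(previous_end, end)
    if duration is not None and duration - previous_end > min_gap:
        gaps.append((previous_end, duration))
    return gaps


segments = [(61.2, 65.0, "Hello"), (65.5, 70.0, "again")]
print(find_gaps(segments))  # [(0.0, 61.2)]
```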

[BUG] Language code is mixed up

Fix the language code

[BUG] Nonetype error on record session

Describe the bug
The program fails to record after a few seconds because of a NoneType error.

Export srt with timestamp

This is possible if transcribed from file (mp4 and stuff)
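For reference, SRT is a plain-text format in which each cue is a 1-based index, a HH:MM:SS,mmm --> HH:MM:SS,mmm time range, the text, and a blank line. A minimal sketch, assuming segments are (start, end, text) tuples in seconds; this is not the app's own exporter.

```python
from typing import Iterable, Tuple


def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments: Iterable[Tuple[float, float, str]]) -> str:
    """Build SRT text: index, time range, text, with a blank line between cues."""
    cues = []
    for index, (start, end, text) in enumerate(segments, start=1):
        cues.append(f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text.strip()}\n")
    return "\n".join(cues)


print(to_srt([(0.0, 2.5, "Hello"), (3.0, 4.25, "World")]))
```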

[BUG] can't get the input devices when the program starts

When I start the program (downloaded from the latest release), it says:
2023-08-03 14:01:35,349 ERROR - Something went wrong while trying to get the input devices (mic). (Record.py:56) [MainThread]
2023-08-03 14:01:35,350 ERROR - 'utf-8' codec can't decode byte 0xc2 in position 6: invalid continuation byte (Record.py:57)

The program should recognize my mic; it's usable in other applications and works fine in Windows.
Buzz and whisper-desktop can recognize my mic and it's useable.

OS: windows 11 22H2
cpu 7840s with 4060
a laptop with a built-in mic

Fails to download large model

I get
urllib.error.URLError: <urlopen error [WinError 10054] An existing connection was forcibly closed by the remote host>
This happens when downloading the large model; every other model works normally.
BTW, is it using the large-v2 model?
Thanks, Mr. dadangdut33. The GUI doesn't look the best, but it's good enough and easy/straightforward :)
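The error is a connection reset during the download (WinError 10054 is Windows' "connection reset by peer"). A common mitigation is to retry with exponential backoff. The sketch below is hypothetical and not how the app downloads models; it only assumes the download can be wrapped in a callable, and it restarts from scratch rather than resuming a partial file.

```python
import time
import urllib.error
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(action: Callable[[], T], attempts: int = 3, base_delay: float = 2.0) -> T:
    """Call action(), retrying on network errors with exponential backoff (2 s, 4 s, ...)."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts):
        try:
            return action()
        except (urllib.error.URLError, ConnectionError):
            # A reset connection can surface as URLError or ConnectionResetError
            time.sleep(base_delay * 2 ** (attempt - 1))
    return action()  # final attempt: let any error propagate to the caller


# Usage (model_url is a placeholder, not a real setting of the app):
# with_retries(lambda: urllib.request.urlretrieve(model_url, "large.pt"))
```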

[BUG] Hangs on stop recording

Describe the bug
Seems to start a recording just fine, but then hangs after clicking "stop".

To Reproduce
Steps to reproduce the behavior:

Install release 1.2.3 on Windows 10
Download model (tried 3)
Click Record Mic
Record a few seconds of speaking
Click Stop
Says "Stopping..." but never progresses and no indication is shown in console that stop was clicked. App hangs while using 8% of CPU and 5 GB of ram (on tiny model). Only way to stop is to force quit.

Expected behavior
Recording should stop and then audio should be processed.

Screenshots

Desktop (please complete the following information):

OS: Windows 10 Pro (build: 19044.2846)
Python Version 3.9.2

Additional context
Importing an mp3 works as expected.
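One common way to keep a Stop button from hanging is to have the recording loop poll a threading.Event and to join the worker with a timeout, so the UI can report a stuck thread instead of freezing. This is a hypothetical sketch of that general pattern; Speech Translate's recording code is not shown in this issue, and record_loop only stands in for it.

```python
import threading
import time
from typing import List


def record_loop(stop_event: threading.Event, chunks: List[bytes]) -> None:
    """Stand-in for a recording loop that checks the stop flag on every iteration."""
    while not stop_event.is_set():
        chunks.append(b"\x00" * 1024)  # placeholder for one captured audio chunk
        stop_event.wait(0.01)  # sleeps, but returns early as soon as stop is requested


stop_event = threading.Event()
chunks: List[bytes] = []
worker = threading.Thread(target=record_loop, args=(stop_event, chunks), daemon=True)
worker.start()
time.sleep(0.05)

stop_event.set()          # what a "Stop" button handler would do
worker.join(timeout=2.0)  # bounded wait instead of blocking forever
if worker.is_alive():
    print("Recording thread did not stop in time")
else:
    print(f"Stopped after {len(chunks)} chunks")
```

In a Tkinter app, even a bounded join blocks the UI for up to the timeout, so polling worker.is_alive() via after() is the gentler option.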

[BUG] Chinese Traditional is not supported on TL Engine: Google

Describe the bug
Chinese Traditional is not supported on TL Engine: Google

Screenshots

[REQ] Add option for no max sentences limit

Describe the bug
When I live-transcribe audio from the speaker while recording, the first transcribed sentence disappears after some time, so I can't export all of the transcribed text (to .txt) from the software. In short, the main window behaves like the subtitle window. This should not happen: when recording, the text transcribed from live audio should be kept in the main window from start to end so it can be exported to .txt.

To Reproduce
Steps to reproduce the behavior:

(For this setup) Open this YouTube video in your browser: https://www.youtube.com/watch?v=02E2WgRcHpo (Note: video duration is 1 min 20 sec)
Pause the YouTube video
Set the task to transcribe and the input to speaker, then start recording.
Play the YouTube video until the end. (You will see that the CPU version of Speech Translate has transcribed the video perfectly from the start, but when you reach the end of the video, the first line has been deleted.)

Expected behavior
All transcribed text must remain from the start till end in the main window, (until the user manually press the 'Clear' button.)

Desktop (please complete the following information):

OS: Edition Windows 10 Home Single Language
Version 22H2
Installed on 04/08/23
OS build 19045.3803
Experience Windows Feature Experience Pack 1000.19053.1000.0
Python Version: 3.11.5
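The requested behavior amounts to making the sentence limit optional. A hypothetical sketch (TranscriptHistory is not a class in the app) using collections.deque, where maxlen=None lets the history grow without limit and a number keeps only the most recent sentences:

```python
from collections import deque
from typing import Deque, Optional


class TranscriptHistory:
    """Keeps transcribed sentences; max_sentences=None means no limit."""

    def __init__(self, max_sentences: Optional[int] = None) -> None:
        # deque(maxlen=None) grows without bound; with a number, the oldest items drop off
        self._sentences: Deque[str] = deque(maxlen=max_sentences)

    def add(self, sentence: str) -> None:
        self._sentences.append(sentence)

    def export_text(self) -> str:
        """Join every kept sentence, one per line, e.g. for saving to .txt."""
        return "\n".join(self._sentences)


history = TranscriptHistory()  # no limit, as requested in the issue
history.add("First sentence.")
history.add("Second sentence.")
print(history.export_text())
```

A separate, limited view could still drive the subtitle window while the unlimited history feeds the export.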

[REQ] consider using static-ffmpeg so that you don't need an extra step for ffmpeg installation

https://pypi.org/project/static-ffmpeg/

[BUG] Detected as virus

Kaspersky flags the latest release as a virus

[BUG] Error when activating Debug recording

When the "Debug Recording" option is activated (see below), an error is generated when starting a recording session.

Error log is:
2024-01-01 20:35:11.178 | ERROR | record.py:944 [Thread-28 (record_session)] - TypedDict does not support instance and class checks
Traceback (most recent call last):
File "D:\Codes_Projects\Python\Speech-Translate\speech_translate\utils\audio\record.py", line 860, in record_session
File "D:\Codes_Projects\Python\Speech-Translate\speech_translate\utils\whisper\helper.py", line 57, in stablets_verbose_log
File "C:\Users\PC.pyenv\pyenv-win\versions\3.11.6\Lib\typing.py", line 3010, in subclasscheck
TypeError: TypedDict does not support instance and class checks

It seems that it comes from a call to isinstance in helper.py
record.py

helper.py
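The traceback shows that isinstance() cannot be used with a TypedDict class. The contents of helper.py are not reproduced here, so this is only a sketch of one workaround: check that the value is a dict with the expected keys instead. SegmentDict and looks_like_segment are hypothetical names.

```python
from typing import Any, TypedDict


class SegmentDict(TypedDict):  # hypothetical shape, for illustration only
    start: float
    end: float
    text: str


def looks_like_segment(value: Any) -> bool:
    """Structural check that works where isinstance(value, SegmentDict) raises."""
    return isinstance(value, dict) and all(key in value for key in SegmentDict.__annotations__)


try:
    isinstance({"start": 0.0}, SegmentDict)
except TypeError as err:
    print(f"TypeError: {err}")  # the same kind of error as in the log above

print(looks_like_segment({"start": 0.0, "end": 1.2, "text": "hi"}))  # True
```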

how to set transcription language

Is there a way to set the transcription language?

failure to run main.py

I successfully installed the environment and build.spec... is there something I missed in the README.md?
I'm using Ubuntu 20.04.

python Main.py
Traceback (most recent call last):
File "Main.py", line 27, in
from speech_translate.Globals import app_icon, app_icon_missing, app_name, fJson, gClass
File "/home/mraway/Desktop/src/Speech-Translate/speech_translate/Globals.py", line 16, in
from .utils.Json import SettingJsonHandler
File "/home/mraway/Desktop/src/Speech-Translate/speech_translate/utils/Json.py", line 7, in
from speech_translate.components.MBox import Mbox
File "/home/mraway/Desktop/src/Speech-Translate/speech_translate/components/MBox.py", line 10, in
def Mbox(title: str, text: str, style: Literal[0, 1, 2, 3], parent: Tk | None = None):
TypeError: unsupported operand type(s) for |: 'type' and 'NoneType'
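The failing line uses the PEP 604 union syntax Tk | None in an annotation. Annotations are evaluated when the function is defined, and X | None on classes only works on Python 3.10+, while Ubuntu 20.04 ships Python 3.8. A minimal sketch of one fix that keeps the signature from MBox.py: add from __future__ import annotations as the first statement so annotations are stored as strings (typing.Optional[Tk] is the other common option). The function body here is a placeholder, not the app's code.

```python
from __future__ import annotations  # must be the first statement in the module

from typing import TYPE_CHECKING, Literal

if TYPE_CHECKING:
    # Only type checkers import this; at runtime the annotation stays a string
    from tkinter import Tk


def Mbox(title: str, text: str, style: Literal[0, 1, 2, 3], parent: Tk | None = None) -> None:
    """Placeholder with the signature from MBox.py; the real body is not shown here."""
    print(f"[{title}] {text} (style={style})")


# Defining and calling the function no longer raises on Python 3.8/3.9
Mbox("Info", "Hello", 0)
```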

[BUG] AttributeError: module 'subprocess' has no attribute 'STARTUPINFO'

Describe the bug
Won't start

To Reproduce
Try to start

Expected behavior
GUI showing up

Log

$ speech-translate 
2023-12-16 06:02:25.466 | INFO    | setting.py:282 [MainThread] - Setting loaded
Traceback (most recent call last):
  File "/home/user/.local/bin/speech-translate", line 5, in <module>
    from speech_translate.__main__ import main
  File "/home/user/.local/lib/python3.10/site-packages/speech_translate/__main__.py", line 38, in <module>
    from .ui.window.main import main  # pylint: disable=wrong-import-position
  File "/home/user/.local/lib/python3.10/site-packages/speech_translate/ui/window/main.py", line 37, in <module>
    from speech_translate.ui.window.setting import SettingWindow
  File "/home/user/.local/lib/python3.10/site-packages/speech_translate/ui/window/setting.py", line 10, in <module>
    from speech_translate.ui.frame.setting.textbox import SettingTextbox
  File "/home/user/.local/lib/python3.10/site-packages/speech_translate/ui/frame/setting/textbox.py", line 4, in <module>
    from matplotlib import pyplot as plt
  File "/home/user/.local/lib/python3.10/site-packages/matplotlib/pyplot.py", line 56, in <module>
    import matplotlib.colorbar
  File "/home/user/.local/lib/python3.10/site-packages/matplotlib/colorbar.py", line 19, in <module>
    from matplotlib import _api, cbook, collections, cm, colors, contour, ticker
  File "/home/user/.local/lib/python3.10/site-packages/matplotlib/contour.py", line 14, in <module>
    from matplotlib.backend_bases import MouseButton
  File "/home/user/.local/lib/python3.10/site-packages/matplotlib/backend_bases.py", line 46, in <module>
    from matplotlib import (
  File "/home/user/.local/lib/python3.10/site-packages/matplotlib/text.py", line 16, in <module>
    from .font_manager import FontProperties
  File "/home/user/.local/lib/python3.10/site-packages/matplotlib/font_manager.py", line 1582, in <module>
    fontManager = _load_fontmanager()
  File "/home/user/.local/lib/python3.10/site-packages/matplotlib/font_manager.py", line 1576, in _load_fontmanager
    fm = FontManager()
  File "/home/user/.local/lib/python3.10/site-packages/matplotlib/font_manager.py", line 1041, in __init__
    *findSystemFonts(fontext=fontext)]:
  File "/home/user/.local/lib/python3.10/site-packages/matplotlib/font_manager.py", line 280, in findSystemFonts
    installed_fonts = _get_fontconfig_fonts()
  File "/home/user/.local/lib/python3.10/site-packages/matplotlib/font_manager.py", line 254, in _get_fontconfig_fonts
    if b'--format' not in subprocess.check_output(['fc-list', '--help']):
  File "/usr/lib/python3.10/subprocess.py", line 421, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
  File "/usr/lib/python3.10/subprocess.py", line 503, in run
    with Popen(*popenargs, **kwargs) as process:
  File "/home/user/.local/lib/python3.10/site-packages/speech_translate/__main__.py", line 26, in __init__
    kwargs['startupinfo'] = subprocess.STARTUPINFO()
AttributeError: module 'subprocess' has no attribute 'STARTUPINFO'

Desktop (please complete the following information):

OS: Linux Mint 21.2 x86_64
App Installation version: module
App / Python version: Python 3.10.12

Additional context
See https://stackoverflow.com/a/46281980
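Line 26 of __main__.py creates subprocess.STARTUPINFO(), a class that only exists on Windows, so every subprocess call (here, matplotlib running fc-list) fails on Linux. A minimal sketch of guarding the Windows-only options with sys.platform; the helper name is hypothetical, and the app's actual patch is not shown in the log.

```python
import subprocess
import sys
from typing import Any, Dict


def hidden_console_kwargs(**kwargs: Any) -> Dict[str, Any]:
    """Add Windows-only Popen options that hide the console window; no-op elsewhere."""
    if sys.platform == "win32":
        startupinfo = subprocess.STARTUPINFO()
        # wShowWindow defaults to 0 (SW_HIDE), which applies once this flag is set
        startupinfo.dwFlags |= subprocess.STARTF_USESHOWWINDOW
        kwargs["startupinfo"] = startupinfo
    return kwargs


# On Linux/macOS the kwargs come back unchanged, so calls like fc-list keep working.
result = subprocess.run(
    [sys.executable, "--version"],
    **hidden_console_kwargs(capture_output=True, text=True),
)
print(result.stdout.strip() or result.stderr.strip())
```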

Fix Character limit per line

It's entirely possible I overlooked it, but is there a way to set a custom character limit per line? I am trying to have just a single line with a max of 18 characters per sentence.

Edit*
This is for the audio transcriptions
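A sketch of a per-line character limit using the standard textwrap module; split_for_subtitles is a hypothetical helper, not an existing setting in the app. With break_long_words=False, a single word longer than the limit is kept whole on its own line, so that line can exceed the limit.

```python
import textwrap
from typing import List


def split_for_subtitles(text: str, max_chars: int = 18) -> List[str]:
    """Break text into lines of at most max_chars characters without splitting words."""
    return textwrap.wrap(text, width=max_chars, break_long_words=False)


print(split_for_subtitles("The quick brown fox jumps over the lazy dog"))
# ['The quick brown', 'fox jumps over the', 'lazy dog']
```

For a single-line display, each returned line could be shown as its own subtitle.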

[BUG] Cannot find PyTorch

Describe the bug
I'm using Python 3.8.10 on Windows with the GPU version of PyTorch installed, but it shows "None of PyTorch, TensorFlow >= 2.0, or Flax have been found".

To Reproduce
Steps to reproduce the behavior:

Run the program
On CLI it shows "None of PyTorch, TensorFlow >= 2.0, or Flax have been found"
When recording sound or importing file, it will download and use the CPU version of PyTorch

Expected behavior
It can detect PyTorch existed on my system.

Screenshots

Desktop (please complete the following information):

OS: Windows 11
Python Version 3.8.10

Additional context
I guess maybe it is because different versions of Python are used? I don't know:(
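The message "None of PyTorch, TensorFlow >= 2.0, or Flax have been found" appears to come from the Hugging Face transformers library when it cannot import any of those frameworks, which often means PyTorch was installed for a different Python interpreter than the one running the app. A small diagnostic sketch (not part of the app) to run with the same python command used to start Speech Translate:

```python
import importlib.util
import sys
from typing import Any, Dict


def torch_report() -> Dict[str, Any]:
    """Describe the running interpreter and the PyTorch build it can import, if any."""
    report: Dict[str, Any] = {
        "python": sys.executable,  # the interpreter actually running this script
        "python_version": sys.version.split()[0],
        "torch": None,  # stays None if this interpreter cannot import torch
        "torch_cuda": None,  # CUDA version the build targets (None on CPU-only builds)
        "cuda_available": False,
    }
    if importlib.util.find_spec("torch") is not None:
        import torch

        report["torch"] = torch.__version__
        report["torch_cuda"] = torch.version.cuda
        report["cuda_available"] = torch.cuda.is_available()
    return report


for key, value in torch_report().items():
    print(f"{key}: {value}")
```

If "python" points to a different installation than the one where the GPU build of PyTorch was installed, that mismatch would explain the message.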

[BUG] Error when only translating and not using whisper engine

It won't start translating and throws an error / crashes instead.

[BUG] The GUI application loads endlessly.

When you try to start the application, it loads endlessly, and one CPU thread is constantly busy. I tried both the CPU and GPU versions, tried running 1.3.0 and 1.3.1, and tried running on different computers (1: R7 4800H with RTX 2060; 2: i7-3630QM with HD 4000/GT 730M). Python 3.10.10 was installed on the computers (using the installer) and then reinstalled as 3.11.6 (also using the installer), which didn't help. FFmpeg is added, and cmd reports that both FFmpeg and Python are present. Old versions worked without any problems, even without Python installed on the computer.

I have the prebuilt binary version.

[BUG] GUI app: Standard whisper (including the new large-v3) does not work.

Problem with the GUI application: standard Whisper (including the new large-v3) does not work on either CPU or CUDA; only faster-whisper works. The computer freezes and RAM fills to 100% (even with lightweight models such as tiny.en), followed by the paging file, which even drives the NVMe SSD to 100% usage. Logs are difficult to retrieve because the computer hangs, and after forcefully killing the application the log file itself is as large as 1 GB, too big for Notepad to open. Faster-whisper works properly.

[BUG] TclError bad pad value "2.5": must be positive screen distance

Log

$ speech-translate 
2023-12-16 10:28:02.097 | INFO    | setting.py:284 [MainThread] - Setting loaded
2023-12-16 10:28:04.206 | INFO    | main.py:2081 [MainThread] - App Version: 1.3.6
2023-12-16 10:28:04.208 | INFO    | main.py:2082 [MainThread] - OS: Linux 5.15.0-89-generic #99-Ubuntu SMP Mon Oct 30 20:42:41 UTC 2023 | CPU: x86_64
2023-12-16 10:28:04.255 | INFO    | main.py:2083 [MainThread] - GPU: NVIDIA GeForce 930MX | CUDA: Detected 1 GPU(s): NVIDIA GeForce 930MX
2023-12-16 10:28:04.255 | DEBUG   | main.py:2084 [MainThread] - Sys args: ['/home/user/.local/bin/speech-translate']
2023-12-16 10:28:04.255 | DEBUG   | main.py:2085 [MainThread] - Loading UI...
2023-12-16 10:28:04.371 | INFO    | main.py:207 [MainThread] - Tray created successfully
2023-12-16 10:28:04.592 | DEBUG   | main.py:259 [MainThread] - Available Theme to use: ['default', 'sun-valley-light', 'sun-valley-dark', 'clam', 'alt', 'classic']
2023-12-16 10:28:04.593 | DEBUG   | style.py:32 [MainThread] - Setting theme: sun-valley-light
2023-12-16 10:28:04.641 | DEBUG   | style.py:60 [MainThread] - Setting custom light theme style
2023-12-16 10:28:04.781 | ERROR   | device.py:426 [MainThread] - 'PyAudio' object has no attribute 'get_default_wasapi_loopback'
Traceback (most recent call last):

  File "/home/user/.local/bin/speech-translate", line 8, in <module>
    sys.exit(main())
    │   │    └ <function main at 0x7f7dd1ed5a20>
    │   └ <built-in function exit>
    └ <module 'sys' (built-in)>
  File "/home/user/.local/lib/python3.10/site-packages/speech_translate/ui/window/main.py", line 2091, in main
    main_ui = MainWindow()
              └ <class 'speech_translate.ui.window.main.MainWindow'>
  File "/home/user/.local/lib/python3.10/site-packages/speech_translate/ui/window/main.py", line 460, in __init__
    self.menu_speaker = self.input_device_menu("speaker")
    │                   │    └ <function MainWindow.input_device_menu at 0x7f7dd1ed4430>
    │                   └ <speech_translate.ui.window.main.MainWindow object at 0x7f7dd0b2fa60>
    └ <speech_translate.ui.window.main.MainWindow object at 0x7f7dd0b2fa60>
  File "/home/user/.local/lib/python3.10/site-packages/speech_translate/ui/window/main.py", line 930, in input_device_menu
    success, default_host = get_default_dict[mode]()
                            │                └ 'speaker'
                            └ {'hostAPI': <function get_default_host_api at 0x7f7dd37057e0>, 'mic': <function get_default_input_device at 0x7f7dd37056c0>, ...
> File "/home/user/.local/lib/python3.10/site-packages/speech_translate/utils/audio/device.py", line 419, in get_default_output_device
    default_device = p.get_default_wasapi_loopback()  # type: ignore
                     └ <pyaudio.PyAudio object at 0x7f7dd0b42b30>

AttributeError: 'PyAudio' object has no attribute 'get_default_wasapi_loopback'
2023-12-16 10:28:04.783 | ERROR   | device.py:427 [MainThread] - Something went wrong while trying to get the default output device (speaker).
2023-12-16 10:28:04.811 | ERROR   | _logging.py:63 [MainThread] - Traceback (most recent call last):
2023-12-16 10:28:04.812 | ERROR   | _logging.py:63 [MainThread] - File "/home/user/.local/bin/speech-translate", line 8, in <module>
2023-12-16 10:28:04.812 | ERROR   | _logging.py:63 [MainThread] - sys.exit(main())
2023-12-16 10:28:04.812 | ERROR   | _logging.py:63 [MainThread] - File "/home/user/.local/lib/python3.10/site-packages/speech_translate/ui/window/main.py", line 2091, in main
2023-12-16 10:28:04.813 | ERROR   | _logging.py:63 [MainThread] - main_ui = MainWindow()
2023-12-16 10:28:04.813 | ERROR   | _logging.py:63 [MainThread] - File "/home/user/.local/lib/python3.10/site-packages/speech_translate/ui/window/main.py", line 494, in __init__
2023-12-16 10:28:04.813 | ERROR   | _logging.py:63 [MainThread] - self.cbtn_task_transcribe.pack(side="left", padx=5, pady=2.5, ipady=0)
2023-12-16 10:28:04.814 | ERROR   | _logging.py:63 [MainThread] - File "/usr/lib/python3.10/tkinter/__init__.py", line 2425, in pack_configure
2023-12-16 10:28:04.815 | ERROR   | _logging.py:63 [MainThread] - self.tk.call(
2023-12-16 10:28:04.815 | ERROR   | _logging.py:63 [MainThread] - _tkinter
2023-12-16 10:28:04.815 | ERROR   | _logging.py:63 [MainThread] - .
2023-12-16 10:28:04.816 | ERROR   | _logging.py:63 [MainThread] - TclError
2023-12-16 10:28:04.816 | ERROR   | _logging.py:63 [MainThread] - :
2023-12-16 10:28:04.816 | ERROR   | _logging.py:63 [MainThread] - bad pad value "2.5": must be positive screen distance

Desktop (please complete the following information):

OS: Linux Mint 21.2 x86_64
App Installation version: module
App / Python version: Python 3.10.12
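The log shows this Tk build rejecting the fractional value pady=2.5; the direct fix is to pass whole pixels (e.g. pady=2 or pady=3). If many call sites use fractions, a hypothetical helper like int_pads below could normalize them before calling pack(). Note that Python's round() rounds halves to even, so 2.5 becomes 2.

```python
from typing import Any, Dict

PAD_OPTIONS = ("padx", "pady", "ipadx", "ipady")


def int_pads(**options: Any) -> Dict[str, Any]:
    """Round fractional pad values to whole pixels; other options pass through unchanged."""
    fixed = dict(options)
    for key in PAD_OPTIONS:
        value = fixed.get(key)
        if isinstance(value, float):
            fixed[key] = int(round(value))
        elif isinstance(value, tuple):  # (before, after) pairs are also accepted by pack()
            fixed[key] = tuple(int(round(v)) for v in value)
    return fixed


print(int_pads(side="left", padx=5, pady=2.5, ipady=0))
# {'side': 'left', 'padx': 5, 'pady': 2, 'ipady': 0}
# Usage: widget.pack(**int_pads(side="left", padx=5, pady=2.5, ipady=0))
```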

[BUG] Subtitles lose sync

I have compared your version to another whisper variant called StoryToolKitAI.

I think yours does a better job of breaking up the subtitles, while StoryToolkitAI produces large paragraphs. The problem with your version is that it loses sync, possibly due to prolonged loud noise mixed with some speech (I have seen this phenomenon with other Whisper versions).

In this example, StoryToolkitAI is the large font and Speech-Translate is the small font. Go to around 2 minutes to see where your version loses sync and starts showing subtitles before the words have been said, while StoryToolkitAI stays in sync. The Translate option is used for both, as well as Large dictionary V2. Russian is the language being translated.
https://www.youtube.com/watch?v=S8e80gE8YVk

[BUG]

I imported a short English audio file to translate into Spanish, using Google as the translation engine. A few seconds into the process, I keep receiving this error: ERROR - byte indices must be integers or slices, not str

The size of tensor a (16) must match the size of tensor b (5) at non-singleton dimension 3

The size of tensor a (16) must match the size of tensor b (5) at non-singleton dimension 3
What's wrong?

[BUG]

After running the application I get issue: FileNotFoundError: [WinError 2] The system cannot find the file specified

[REQ] Add support for Large-v3 faster whisper model

Whisper has released the large-v3 model, but stable-ts does not support it yet. This is a reminder to add it once it's supported.

[BUG] RUN ERROR

Traceback (most recent call last):
File "Main.py", line 969, in
File "Main.py", line 247, in init
File "speech_translate\utils\Record.py", line 35, in getInputDevices
File "sounddevice.py", line 564, in query_devices
File "sounddevice.py", line 564, in
File "sounddevice.py", line 578, in query_devices
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc2 in position 6: invalid continuation byte
[20716] Failed to execute script 'Main' due to unhandled exception!

OS: windows 11 pro x64

Python: 3.10.6

GUI: latest GPU version
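The decoding error is raised inside sounddevice.query_devices while reading a device name. One plausible cause (an assumption, not confirmed in the issue) is a device name reported in a Windows ANSI code page instead of UTF-8. The tolerant-decoding pattern below is a sketch of how such a name could be decoded without crashing; decode_device_name is hypothetical, and applying it would require access to the raw bytes or a patch in sounddevice.

```python
import locale
from typing import Optional, Sequence


def decode_device_name(raw: bytes, encodings: Optional[Sequence[str]] = None) -> str:
    """Decode a device name, trying UTF-8 first and then other likely encodings."""
    candidates = list(encodings or ("utf-8", locale.getpreferredencoding(False), "cp1252"))
    for encoding in candidates:
        try:
            return raw.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            continue  # wrong or unknown encoding, try the next one
    # Last resort: never crash device enumeration over one bad name
    return raw.decode("utf-8", errors="replace")


print(decode_device_name("Micrófono".encode("utf-8")))  # Micrófono
```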

Add click-through option for the detached/external window

This would make it more customizable

Feature Request: Batch Uploading of audio files for processing.

I would like to check whether there is an option to batch upload a zip folder of multiple audio files for a single processing run.

On light themes, text in the context menu in the dynamic text field is not visible [BUG]

On light themes, text in the context menu in the dynamic text field is not visible.

Only white text:

On the dark theme it is OK:

[BUG] ImportError: cannot import name 'startfile' from 'os' (/usr/lib/python3.10/os.py)

Describe the bug
Won't start

To Reproduce
Try to start

Expected behavior
GUI showing up

Log

$ speech-translate 
Traceback (most recent call last):
  File "/home/user/.local/bin/speech-translate", line 5, in <module>
    from speech_translate.__main__ import main
  File "/home/user/.local/lib/python3.10/site-packages/speech_translate/__main__.py", line 38, in <module>
    from .ui.window.main import main  # pylint: disable=wrong-import-position
  File "/home/user/.local/lib/python3.10/site-packages/speech_translate/ui/window/main.py", line 23, in <module>
    from speech_translate.linker import bc, sj
  File "/home/user/.local/lib/python3.10/site-packages/speech_translate/linker.py", line 13, in <module>
    from speech_translate.utils.helper import generate_color, str_separator_to_html, wrap_result
  File "/home/user/.local/lib/python3.10/site-packages/speech_translate/utils/helper.py", line 9, in <module>
    from os import makedirs, path, startfile
ImportError: cannot import name 'startfile' from 'os' (/usr/lib/python3.10/os.py)

Desktop (please complete the following information):

OS: Linux Mint 21.2 x86_64
App Installation version: module
App / Python version: Python 3.10.12

Additional context
See https://stackoverflow.com/q/62578790
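os.startfile exists only on Windows, so from os import startfile fails at import time on Linux. A minimal cross-platform sketch (hypothetical helper names) that looks up os.startfile only when running on Windows, and otherwise uses open on macOS or xdg-open on Linux:

```python
import os
import subprocess
import sys
from typing import List, Optional


def opener_command(path: str, platform: str = sys.platform) -> Optional[List[str]]:
    """Return the command that opens path with the default app, or None on Windows."""
    if platform == "win32":
        return None  # Windows: use os.startfile instead of a command
    if platform == "darwin":
        return ["open", path]
    return ["xdg-open", path]  # most Linux desktops


def open_with_default_app(path: str) -> None:
    """Open a file or folder with the OS default handler."""
    command = opener_command(path)
    if command is None:
        # Looked up at call time, so importing this module never fails on Linux/macOS
        os.startfile(path)  # type: ignore[attr-defined]
    else:
        subprocess.Popen(command)
```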

Add window indicating the progress of each process

Add a modal window, like the one shown when downloading, while processing an imported file or recording.

[REQ] WhisperX support

First, I want to thank the author of this tool for simplifying the process of using OpenAI Whisper. Thanks to you, Fauzan, far more people are able to use the features of Whisper via a clean GUI.

As a feature request, I would love to see support added in your program for the latest enhancements added by WhisperX (https://github.com/m-bain/whisperX), which is a greatly-improved version of OpenAI's Whisper.

WhisperX, developed by a research group at the University of Oxford, is 70x faster than OpenAI Whisper, requires much less GPU memory to run the language models, has a lower word error rate, and does not have the hallucinations, drifting, and repetitions that standard Whisper is prone to. The program detects silence, can detect when there are multiple speakers and identify each one uniquely (even with overlapping voices), and produces far more accurate timestamps, down to the level of individual letters in the words.

As it processes a recording, it splits the audio into 30-second chunks and then batch-processes them simultaneously for a dramatic speed increase. This appears to differ from WhisperJAX (https://github.com/sanchit-gandhi/whisper-jax): the released version of WhisperJAX splits the audio for batch processing without proper context, so cuts sometimes occur in the middle of words, WhisperJAX ends up translating partial words, and the Word Error Rate goes up. WhisperX does not do this. It scans before splitting, properly detecting the start and end of words, so cuts happen in the spaces between them.
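The key difference described above is cutting audio only at silences. Below is a simplified sketch, not WhisperX's code (the WhisperX paper describes a VAD-based "cut & merge" step): assuming a voice activity detector has already returned speech spans, it greedily merges neighboring spans into chunks of at most 30 seconds, so chunk boundaries always fall between spans.

```python
from typing import List, Tuple

Span = Tuple[float, float]  # (start_seconds, end_seconds) of detected speech


def merge_speech_spans(spans: List[Span], max_chunk: float = 30.0) -> List[Span]:
    """Group consecutive speech spans into chunks of at most max_chunk seconds.

    Boundaries fall only between spans (in silence), never inside one. A single span
    longer than max_chunk is kept whole here; a fuller version would split it.
    """
    chunks: List[Span] = []
    for start, end in sorted(spans):
        if chunks and end - chunks[-1][0] <= max_chunk:
            # Extend the current chunk; max() guards against spans nested inside it
            chunks[-1] = (chunks[-1][0], max(chunks[-1][1], end))
        else:
            chunks.append((start, end))  # start a new chunk at a silence gap
    return chunks


print(merge_speech_spans([(0.0, 12.0), (13.0, 25.0), (26.0, 40.0), (41.0, 50.0)]))
# [(0.0, 25.0), (26.0, 50.0)]
```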

I have been reading that WhisperX does a much better job translating various languages compared to OpenAI's version, which makes me think that proceeding with the current version of Whisper I have been using is fairly pointless, because the results would be inferior to WhisperX and I would need to re-do them later.

The problem is that I have been unable to get WhisperX running properly on my machine. I don't know which version or update of which dependency broke the installation. I have reinstalled things multiple times and spent many hours troubleshooting. I know many others are experiencing similar problems. It would be great if you could add support for either WhisperX or Faster-Whisper (https://github.com/guillaumekln/faster-whisper). Faster-Whisper is not as advanced as WhisperX, but it is still an improvement over the original Whisper.

Ideally, users would have the option to choose between OpenAI's standard version and a huge improvement like WhisperX. Combining the improvements of WhisperX with your GUI would be wonderful!
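Offering a choice of engines can be sketched as a small registry. The registry maps each engine name to the Python module it needs and imports that module only on demand. Users without WhisperX or Faster-Whisper installed are therefore unaffected. This is a minimal sketch: the engine names and helpers are hypothetical. `whisper`, `faster_whisper`, and `whisperx` are the import names of the openai-whisper, faster-whisper, and WhisperX packages.

```python
import importlib
import importlib.util
from types import ModuleType
from typing import Dict, List

# Hypothetical engine names mapped to the module each package installs.
BACKENDS: Dict[str, str] = {
    "openai-whisper": "whisper",
    "faster-whisper": "faster_whisper",
    "whisperx": "whisperx",
}


def installed_backends() -> List[str]:
    """Engine names whose module is importable, checked without importing it."""
    return [name for name, module in BACKENDS.items()
            if importlib.util.find_spec(module) is not None]


def load_backend(name: str) -> ModuleType:
    """Import the module for `name` only when the user selects that engine."""
    if name not in BACKENDS:
        raise ValueError(f"unknown engine {name!r}; choose one of {sorted(BACKENDS)}")
    try:
        return importlib.import_module(BACKENDS[name])
    except ImportError as exc:
        raise RuntimeError(f"{name} is not installed: {exc}") from exc
```

A settings dropdown could list `installed_backends()`, so users only see engines that will actually load.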

More info:
https://github.com/m-bain/whisperX (WhisperX GitHub source)
https://www.slashcam.com/news/single/WhisperX--Free-audio-transcription-with-speaker-re-17704.html
https://web.archive.org/web/20230301023005/https://www.swyx.io/transcribe-podcasts-with-whisper
https://arxiv.org/abs/2303.00747

Application hangs due to FileNotFoundError: [WinError 2] The system cannot find the file specified

Only interested in transcription.
Open application.
Mode: Transcribe
Model: Base
TL Engine: Whisper or Google

Click "Import From File".
Choose an mp4 file from my drive.

Output from command prompt window:

...
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
2023-12-19 08:53:06,476 INFO - Booting up... (Main.py:959) [MainThread]
2023-12-19 08:53:06,478 DEBUG - Got console window (Main.py:962) [MainThread]
2023-12-19 08:53:08,098 INFO - Checking for update on start (About.py:108) [MainThread]
2023-12-19 08:53:08,160 INFO - Checking for update... (About.py:133) [MainThread]
2023-12-19 08:53:08,682 INFO - New version found: 1.3.7 (About.py:146) [Thread-3 (req_update_check)]
2023-12-19 08:53:44,877 INFO - Start Process (FILE) (Record.py:848) [Thread-5 (from_file)]
2023-12-19 08:53:45,199 INFO - -------------------------------------------------- (Record.py:761) [Thread-6 (multiproc_tc)]
2023-12-19 08:53:45,199 INFO - Transcribing Audio: C:/Users/zlazi/Downloads/next-gen-hacker.mp4 (Record.py:762) [Thread-6 (multiproc_tc)]
2023-12-19 08:53:45,200 DEBUG - Source Language: Auto (Record.py:771) [Thread-6 (multiproc_tc)]
Exception in thread Thread-7 (run_threaded):
Traceback (most recent call last):
File "threading.py", line 1016, in _bootstrap_inner
File "threading.py", line 953, in run
File "speech_translate\utils\Record.py", line 776, in run_threaded
File "whisper\transcribe.py", line 84, in transcribe
File "whisper\audio.py", line 111, in log_mel_spectrogram
File "whisper\audio.py", line 42, in load_audio
File "ffmpeg_run.py", line 313, in run
File "ffmpeg_run.py", line 284, in run_async
File "subprocess.py", line 971, in __init__
File "subprocess.py", line 1440, in _execute_child
FileNotFoundError: [WinError 2] The system cannot find the file specified
2023-12-19 08:54:55,037 INFO - Cancelling file import processing... (Main.py:937) [MainThread]
2023-12-19 08:54:55,115 INFO - End process (FILE) [Total time: 70.24s] (Record.py:906) [Thread-5 (from_file)]
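In this traceback, `subprocess` raises the `FileNotFoundError` while Whisper's `load_audio` is launching ffmpeg. That usually means the `ffmpeg` executable itself could not be found on `PATH`, not that the mp4 file is missing. Here is a minimal pre-flight check, with `require_executable` as a hypothetical helper.

```python
import shutil


def require_executable(name: str = "ffmpeg") -> str:
    """Return the full path of `name`, or raise a descriptive error if it is not on PATH."""
    path = shutil.which(name)
    if path is None:
        # Fail early with an actionable message instead of a bare WinError 2 deep in a thread.
        raise RuntimeError(
            f"{name} was not found on PATH; install it, add its folder to PATH, "
            "then restart the app."
        )
    return path
```

Calling `require_executable()` before starting a file import would surface the missing dependency right away, instead of leaving the app waiting on a thread that has already died.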

[BUG] Export Folder location never visually updates in Settings, despite changing to a new location

Under Speech Translate > View > Settings > Export > Export Folder

When I change the Export folder, internally, the program properly changes the Export folder (which is good), but visually, in the Export Folder section of Settings, the directory name never changes. It continues to show the default Export Folder directory that originally was there.

This is a very minor (visual) bug that does not affect the functionality of the program.

Adding more whisper commands

Discussed in #8

Originally posted by galaxea, January 15, 2023
This is so useful, and I appreciate the hard work. Is there a way to add options such as suppress tokens that can be used when transcribing a file? I like to set it to 0 so that it understands commands such as 'open quote' when running it through CMD.
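openai-whisper's `transcribe()` forwards extra keyword arguments to its decoding options. Those options include `suppress_tokens`, which takes a comma-separated string or a list of token ids. The default, `"-1"`, suppresses a built-in set of symbols. Here is a minimal sketch that turns a user-entered setting into those keyword arguments. `build_decode_options` is hypothetical and not part of Speech Translate.

```python
from typing import Any, Dict


def build_decode_options(suppress_tokens: str = "-1", **extra: Any) -> Dict[str, Any]:
    """Parse a comma-separated token-id string into kwargs for whisper's transcribe()."""
    # int() raises ValueError on anything that is not an integer token id.
    tokens = [int(t) for t in suppress_tokens.split(",") if t.strip()]
    options: Dict[str, Any] = {"suppress_tokens": tokens}
    options.update(extra)  # e.g. temperature=0.0, passed through unchanged
    return options
```

Usage would then look like `model.transcribe("audio.mp3", **build_decode_options("0"))`, where `model` comes from `whisper.load_model(...)`.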

[REQ] Please add support for Speech-to-Speech and Text-to-Speech

Dear Dadangdut33,

I hope this message finds you well. I am a great admirer of your Speech-Translate project on GitHub. Your work has significantly contributed to the field of speech processing, making it accessible and useful for many.

I am writing to kindly request the integration of a Speech-to-Speech (S2S) feature into your Speech-Translate application. This addition would significantly enhance its capabilities, allowing for a more dynamic and comprehensive user experience.

For inspiration and potential guidance, I suggest looking at the following resources:

Hugging Face's S2S translation demo (Hugging Face S2S Demo) and the detailed chapter on the same topic in the Hugging Face Audio Course.
Microsoft's SpeechT5 TTS model: SpeechT5 TTS on Hugging Face, which could offer a robust framework for the text-to-speech component.
Other related projects like NaveenGTK's TTS project and TomSchimansky's CustomTkinter, which might provide useful insights or codebases.
Your consideration of this enhancement would be greatly appreciated by many in the community who rely on and enjoy your application. Thank you for your time and all the incredible work you do.

New Play HT 2.0 turbo
https://github.com/playht

Best regards,
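Speech-to-speech can be viewed as the app's existing transcribe and translate steps followed by a text-to-speech step. Here is a minimal sketch of that composition. The three stage functions are hypothetical placeholders for, say, Whisper, a translation API, and a TTS model such as SpeechT5.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class SpeechToSpeech:
    """Chain speech recognition, translation, and text-to-speech stages."""

    transcribe: Callable[[bytes], str]        # audio -> source-language text
    translate: Callable[[str, str], str]      # (text, target language) -> translated text
    synthesize: Callable[[str, str], bytes]   # (text, target language) -> audio

    def run(self, audio: bytes, target_lang: str) -> bytes:
        text = self.transcribe(audio)
        translated = self.translate(text, target_lang)
        return self.synthesize(translated, target_lang)
```

Keeping each stage behind a plain callable would let a TTS engine be added or swapped without touching the transcription and translation code.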

Word level transcription

Discussed in #7

Originally posted by MaxHaller91, January 12, 2023
Thanks so much for putting this program out here. Is there any adjustment possible so that I get timecodes for every second, or even for every word?
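Recent releases of openai-whisper can return per-word timings when `transcribe()` is called with `word_timestamps=True`. Each segment then carries a `words` list whose entries have `word`, `start`, and `end`. Here is a minimal sketch that turns such a result into one SRT cue per word. `srt_timestamp` and `words_to_srt` are hypothetical helpers.

```python
from typing import Any, Dict


def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def words_to_srt(result: Dict[str, Any]) -> str:
    """Turn a whisper result with word timestamps into one SRT cue per word."""
    cues = []
    for segment in result["segments"]:
        for word in segment.get("words", []):
            number = len(cues) + 1
            cues.append(
                f"{number}\n{srt_timestamp(word['start'])} --> "
                f"{srt_timestamp(word['end'])}\n{word['word'].strip()}\n"
            )
    return "\n".join(cues)
```

With a model loaded via `whisper.load_model(...)`, the input would come from `model.transcribe("audio.mp3", word_timestamps=True)`.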

[BUG] Can't open and run normally

Describe the bug
2023-04-12 00:57:36,936 INFO - Console window hidden. If it is not hidden (only minimized), try changing your default windows terminal to windows cmd. (Main.py:51) [MainThread]
test
2023-04-12 00:57:36,936 INFO - Booting up... (Main.py:1050) [MainThread]
2023-04-12 00:57:37,002 DEBUG - Available Theme to use: ['vista', 'sv-light', 'sv-dark'] (Main.py:159) [MainThread]
2023-04-12 00:57:37,002 DEBUG - Setting theme: sv-dark (Style.py:28) [MainThread]
2023-04-12 00:57:37,014 DEBUG - Setting custom dark theme style (Style.py:49) [MainThread]
Traceback (most recent call last):
File "Main.py", line 1053, in <module>
File "Main.py", line 404, in __init__
File "Main.py", line 541, in onInit
File "Main.py", line 546, in cb_mic_init
File "Main.py", line 564, in label_microphone_Rclick
File "speech_translate\utils\Record.py", line 65, in getDefaultInputDevice
File "sounddevice.py", line 569, in query_devices
sounddevice.PortAudioError: Error querying device -1
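`Error querying device -1` suggests that PortAudio reported no default input device (index -1) on that machine. Here is a minimal defensive sketch. It assumes device entries shaped like those returned by `sounddevice.query_devices()`, i.e. dicts with `max_input_channels`. `pick_input_device` is hypothetical.

```python
from typing import Any, Mapping, Optional, Sequence


def pick_input_device(devices: Sequence[Mapping[str, Any]], default_input: int) -> Optional[int]:
    """Return a usable input device index, or None if no device can record."""
    # Keep the system default when it exists and actually has input channels.
    if 0 <= default_input < len(devices) and devices[default_input]["max_input_channels"] > 0:
        return default_input
    # Otherwise (e.g. default index -1) fall back to the first device with input channels.
    for index, device in enumerate(devices):
        if device["max_input_channels"] > 0:
            return index
    return None
```

With sounddevice, this could be called as `pick_input_device(sd.query_devices(), sd.default.device[0])`. If it returns `None`, the app could start with the mic option disabled instead of crashing.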

To Reproduce
Steps to reproduce the behavior:

Run SpeechTranslate.exe directly

Expected behavior
The app should open normally after downloading SpeechTranslate 1.2.2 GPU, extracting it, and running SpeechTranslate.exe directly.

Desktop:

OS: Windows 11
Python Version: 3.10.9
GPU: RTX 4090

Additional context
Interestingly, I can open the software normally on my other windows 10 computer.

Mic issue for real time transcription.

It seems like real-time transcription has an issue since pulling the latest patches with git pull?

[BUG] weird translation error

In version 1.1 everything worked; in 1.2, these errors appear in the log:

2023-03-27 16:56:10,128 INFO - Console window hidden. If it is not hidden (only minimized), try changing your default windows terminal to windows cmd. (Main.py:50) [MainThread]
2023-03-27 16:56:10,128 INFO - Booting up... (Main.py:1019) [MainThread]
2023-03-27 16:56:10,254 DEBUG - Available Theme to use: ['vista', 'sv-light', 'sv-dark'] (Main.py:157) [MainThread]
2023-03-27 16:56:10,254 DEBUG - Setting theme: sv-dark (Style.py:28) [MainThread]
2023-03-27 16:56:10,273 DEBUG - Setting custom dark theme style (Style.py:49) [MainThread]
2023-03-27 16:56:13,442 INFO - Checking for update on start (About.py:100) [MainThread]
2023-03-27 16:56:14,118 INFO - Checking for update... (About.py:125) [MainThread]
2023-03-27 16:56:14,431 INFO - No update available (About.py:145) [Thread-5 (req_update_check)]
2023-03-27 16:56:35,352 INFO - Start Process (FILE) (Record.py:1015) [Thread-6 (from_file)]
2023-03-27 16:56:35,399 DEBUG - Translating... (Record.py:701) [Thread-7 (cancellable_tl)]
2023-03-27 16:56:35,400 DEBUG - Translating with whisper (Record.py:714) [Thread-7 (cancellable_tl)]
2023-03-27 16:56:35,406 DEBUG - Source Language: polish (Record.py:715) [Thread-7 (cancellable_tl)]
2023-03-27 16:58:48,499 ERROR - string indices must be integers (Record.py:828) [Thread-7 (cancellable_tl)]
Traceback (most recent call last):
File "speech_translate\utils\Record.py", line 759, in cancellable_tl
File "speech_translate\utils\Helper_Whisper.py", line 44, in whisper_result_to_srt
TypeError: string indices must be integers
TL Wait (133.34s)
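In Python, `string indices must be integers` is raised when code indexes a `str` with a string key. This happens, for example, with `seg["start"]` after iterating over a result dict (which yields its keys), or after receiving plain text where a Whisper result dict was expected. The report does not show `Helper_Whisper.py`, so the actual cause is unknown. A minimal guard like this hypothetical `result_segments` would at least turn the failure into a descriptive error.

```python
from typing import Any, Dict, List


def result_segments(result: Any) -> List[Dict[str, Any]]:
    """Return the segment list of a whisper result, with a clear error for wrong input."""
    if isinstance(result, str):
        raise TypeError("expected a whisper result dict with 'segments', got plain text")
    segments = result.get("segments") if isinstance(result, dict) else None
    if not isinstance(segments, list):
        raise TypeError(f"expected a dict with a 'segments' list, got {type(result).__name__}")
    return segments
```

An SRT writer would then loop over `result_segments(result)` rather than over `result` itself.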

???

For some reason, it is not starting.

The program does not open.
Version: 1.3.7 (CPU only)

CPU: 11th Gen Intel(R) Core(TM) i3-1125G4 @ 2.00GHz 2.00 GHz
RAM: 16.0 GB (15.8 GB usable)
OS: Windows 11 Pro, version 23H2 (OS build 22631.2861)

[REQ] Copy to clipboard button

Is your feature request related to a problem? Please describe.
I usually want to place the text in another window, so I have to select the entire text and copy it manually.

Describe the solution you'd like
It would be great to have a button that copies the current text in the box to the clipboard.
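Tkinter exposes the system clipboard through `clipboard_clear()` and `clipboard_append()` on any widget, so a copy button needs only a few lines. Here is a minimal sketch. `copy_text_to_clipboard` is a hypothetical helper, and `root` and the Text widget stand in for the app's own widgets.

```python
def copy_text_to_clipboard(root, text_widget) -> str:
    """Copy the full contents of a tkinter Text widget to the system clipboard."""
    text = text_widget.get("1.0", "end-1c")  # "end-1c" drops the trailing newline Tk adds
    root.clipboard_clear()
    root.clipboard_append(text)
    root.update()  # flush pending Tk events; often recommended so the copy persists
    return text
```

It could be wired to a button with `ttk.Button(root, text="Copy", command=lambda: copy_text_to_clipboard(root, textbox))`.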

FileNotFoundError: [WinError 2] The system cannot find the file specified

...
I am getting the above error when importing files.

When importing multiple audio files, the output file names are mixed up [BUG]

Describe the bug
When importing multiple audio files, the output files are named incorrectly. The output file for one audio file includes the name of another audio file from the list.
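The report doesn't show the naming code, so the cause is unknown. One common way to get this symptom is building each name from a variable shared across loop iterations or threads. Here is a minimal sketch that derives every output name only from its own input path. The function names and the `{stem}_{task}{ext}` pattern are assumptions.

```python
from pathlib import Path
from typing import Iterable, List


def output_path_for(input_file: str, export_dir: str, task: str, ext: str) -> Path:
    """Build an export path from this input file's own name only."""
    stem = Path(input_file).stem  # "talk.mp3" -> "talk"
    return Path(export_dir) / f"{stem}_{task}{ext}"


def output_paths(inputs: Iterable[str], export_dir: str, task: str, ext: str) -> List[Path]:
    # Each name is computed from its own input, so no state leaks between files.
    return [output_path_for(f, export_dir, task, ext) for f in inputs]
```

Passing the computed path into each worker, rather than having workers read a shared "current file" variable, keeps the names correct even when files are processed concurrently.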

Release 1.3 (CPU only) can't find the "small" model

I tried release 1.3 (CPU only) and ran into this issue.

I followed the instructions to download the model, and the download succeeded, but the issue above keeps coming up. It seems the app can't find the "small" model. Is there any solution?
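By default, openai-whisper saves downloaded checkpoints as `.pt` files under `~/.cache/whisper` (or `$XDG_CACHE_HOME/whisper`), unless a different `download_root` is passed to `load_model`. The file name comes from the download URL, so an alias such as `large` may be stored under a versioned name like `large-v2.pt`. Here is a small diagnostic sketch that assumes that default location. The function names are hypothetical.

```python
import os
from pathlib import Path
from typing import List


def whisper_cache_dir() -> Path:
    """Default folder where openai-whisper stores downloaded checkpoints (assumed)."""
    default = os.path.join(os.path.expanduser("~"), ".cache")
    return Path(os.getenv("XDG_CACHE_HOME", default)) / "whisper"


def downloaded_models(cache_dir: Path) -> List[str]:
    """List checkpoint names (file stems) found in `cache_dir`, e.g. ["base", "small"]."""
    if not cache_dir.is_dir():
        return []
    return sorted(p.stem for p in cache_dir.glob("*.pt"))
```

Running `print(whisper_cache_dir(), downloaded_models(whisper_cache_dir()))` shows whether the download landed where the app looks for it.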

[REQ] Please add support for whisper.cpp; it would make transcription much faster

Is your feature request related to a problem? Please describe.
I'm always frustrated when the model is so slow; not everyone has the fastest GPU. I want to recommend this app to someone I know who is blind and owns a Windows laptop with 2 GB of VRAM on its graphics card. They are a software engineer and willing to learn new things. Because they can't read, though, they have a hard time using their PC and working properly. With this, they might be able to transcribe videos and learn better with the help of text readers.

Describe the solution you'd like
I just want you to keep the app the same and add support for running whisper.cpp models on the CPU. Users could then choose to download whisper.cpp models from this repo: https://github.com/bilalazhar72/whisper.cpp/tree/master/models

Describe alternatives you've considered
They show it running on an iPhone 13 and doing a good job; the model seems very fast and accurate. There are currently no alternatives like this on Windows. Since you already have everything set up, I would just like two features. The first is loading a model version, letting the user speak, and outputting the text (even long text). The second is loading any model size and transcribing a given mp3 file. This C++ implementation of the Whisper models would make sure the model can run on any hardware, and people would start using this app for that reason as well.

Additional context
Here is just the video of them running it on an iPhone 13 with the model working flawlessly. I am not good at coding since I am a beginner. Still, it seems like everything is linked to one file and can be run easily as well.
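whisper.cpp ships a command-line program that takes a ggml model with `-m`, an input file with `-f`, and a language with `-l`. It can also write subtitles with `-osrt`. The executable is named `main` in older builds and `whisper-cli` in newer ones, and at least older builds expect 16 kHz WAV input. Here is a minimal sketch of driving it from Python. `whisper_cpp_command` is a hypothetical helper, and all paths are placeholders.

```python
from typing import List


def whisper_cpp_command(binary: str, model_path: str, audio_path: str,
                        language: str = "auto", write_srt: bool = True) -> List[str]:
    """Build the argument list for a whisper.cpp CLI run."""
    cmd = [binary, "-m", model_path, "-f", audio_path, "-l", language]
    if write_srt:
        cmd.append("-osrt")  # write an .srt file next to the input
    return cmd
```

It would be executed with something like `subprocess.run(whisper_cpp_command("./whisper-cli", "models/ggml-small.bin", "talk.wav"), check=True)`. The app would still need to convert other formats such as mp3 to WAV first, for example with ffmpeg.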

Implement real time transcription

Create the methods
Connect it with translation
Create the optimal settings and connect them to the settings
Settings interface and implementation
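The first task above can be sketched as a buffer that collects microphone samples and hands fixed-length chunks to the transcriber. 16 kHz is the sample rate Whisper models expect. `AudioChunker` and the 5-second default are assumptions for illustration, not the app's settings.

```python
from typing import List, Sequence


class AudioChunker:
    """Collect streamed samples and emit fixed-length chunks for transcription."""

    def __init__(self, sample_rate: int = 16000, chunk_seconds: float = 5.0) -> None:
        self.chunk_size = int(sample_rate * chunk_seconds)
        self._buffer: List[float] = []

    def feed(self, samples: Sequence[float]) -> List[List[float]]:
        """Add samples from the mic callback; return any chunks that are now complete."""
        self._buffer.extend(samples)
        chunks = []
        while len(self._buffer) >= self.chunk_size:
            chunks.append(self._buffer[: self.chunk_size])
            del self._buffer[: self.chunk_size]
        return chunks

    def flush(self) -> List[float]:
        """Return whatever is left when recording stops."""
        rest, self._buffer = self._buffer, []
        return rest
```

Each returned chunk could be passed to the transcriber and then to the translation step. The chunk length is a natural candidate for the settings interface, since it trades latency against context.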

[BUG] FileNotFoundError: [WinError 2] The system cannot find the file specified.

2023-01-31 11:15:17,328 INFO - Booting up... (Main.py:959) [MainThread]
2023-01-31 11:15:17,328 DEBUG - Got console window (Main.py:962) [MainThread]
2023-01-31 11:15:19,949 INFO - Checking for update on start (About.py:108) [MainThread]
2023-01-31 11:15:20,074 INFO - Checking for update... (About.py:133) [MainThread]
2023-01-31 11:15:20,684 INFO - No update available (About.py:153) [Thread-3 (req_update_check)]
2023-01-31 11:15:40,683 INFO - Start Process (FILE) (Record.py:848) [Thread-4 (from_file)]
2023-01-31 11:15:41,853 INFO - -------------------------------------------------- (Record.py:761) [Thread-5 (multiproc_tc)]
2023-01-31 11:15:41,853 INFO - Transcribing Audio: D:/cicada3301/小说/艾斯13/nu/艾斯13.wav (Record.py:762) [Thread-5 (multiproc_tc)]
2023-01-31 11:15:41,853 DEBUG - Source Language: Auto (Record.py:771) [Thread-5 (multiproc_tc)]
Exception in thread Thread-6 (run_threaded):
Traceback (most recent call last):
File "threading.py", line 1016, in _bootstrap_inner
File "threading.py", line 953, in run
File "speech_translate\utils\Record.py", line 776, in run_threaded
File "whisper\transcribe.py", line 84, in transcribe
File "whisper\audio.py", line 111, in log_mel_spectrogram
File "whisper\audio.py", line 42, in load_audio
File "ffmpeg_run.py", line 313, in run
File "ffmpeg_run.py", line 284, in run_async
File "subprocess.py", line 971, in __init__
File "subprocess.py", line 1440, in _execute_child
FileNotFoundError: [WinError 2] The system cannot find the file specified.

[BUG] The software cannot see the large model

The software cannot see the large model, so I deleted it. The software then downloads it successfully but still cannot see it, and it keeps asking me to download the large model.
Screen capture:
https://www.youtube.com/watch?v=3zzrkz5oeUU

Downloading the other models works fine.
