
AllTalk is based on the Coqui TTS engine, similar to the Coqui_tts extension for Text generation webUI; however, it supports a variety of advanced features, such as a settings page, low VRAM support, DeepSpeed, a narrator, model finetuning, custom models, and wav file maintenance. It can also be used with 3rd party software via JSON calls.

License: GNU Affero General Public License v3.0


AllTalk TTS

For those interested, the AllTalk v2 BETA is out. See here

AllTalk V1 Below

AllTalk is an updated version of the Coqui_tts extension for Text Generation web UI. Features include:

  • Can be run as a standalone application or as part of:
    • Text-generation-webui link
    • SillyTavern link
    • KoboldCPP link
  • Simple setup utility for Windows & Linux.
  • API Suite and 3rd party support via JSON calls: AllTalk can be driven by 3rd party applications via JSON calls (see the example after this list).
  • Model Finetuning: Train the model specifically on a voice of your choosing for better reproduction.
  • Local/Custom models: Use any of the XTTSv2 models (API Local and XTTSv2 Local).
  • Bulk TTS Generator/Editor: Generate hours of TTS into one big file or have something read back to you. Demo
  • DeepSpeed: A 2-3x performance boost generating TTS. Screenshot
  • Low VRAM mode: Great for people with small GPU memory or if your VRAM is filled by your LLM.
  • Custom Start-up Settings: Adjust your default start-up settings. Screenshot
  • Narrator: Use different voices for the main character and narration. Example Narration
  • Optional wav file maintenance: Configurable deletion of old output wav files. Screenshot
  • Documentation: Fully documented with a built in webpage. Screenshot
  • Clear Console output: Clear command line output for any warnings or issues.
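
As an illustrative example of driving AllTalk from a 3rd party application via a JSON call, the Python sketch below POSTs text to a locally running AllTalk server. The endpoint path, port and field names are assumptions for illustration only; treat AllTalk's built-in documentation page as the authoritative API reference.

# Illustrative sketch only: the endpoint path and form fields below are
# assumptions; check AllTalk's built-in documentation for the actual API.
import requests

response = requests.post(
    "http://127.0.0.1:7851/api/tts-generate",    # default local address/port
    data={
        "text_input": "Hello, this is a test.",  # the text to synthesise
        "character_voice_gen": "female_01.wav",  # hypothetical voice sample name
        "narrator_enabled": "false",
        "language": "en",
        "output_file_name": "myoutput",
    },
    timeout=60,
)
response.raise_for_status()
print(response.json())  # typically reports where the generated wav was saved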

🟦 Screenshots



πŸ› οΈ About this project & me

AllTalk is a labour of love that has been developed, supported and sustained in my personal free time. As a solo enthusiast (not a business or team), my resources are inherently limited. This project has been one of my passions, but I must balance it with other commitments.

To manage AllTalk sustainably, I prioritize support requests based on their overall impact and the number of users affected. I encourage you to utilize the comprehensive documentation and engage with the AllTalk community discussion area. These resources often provide immediate answers and foster a supportive user network.

Should your inquiry extend beyond the documentation, especially if it concerns a bug or feature request, I assure you I’ll offer my best support as my schedule permits. However, please be prepared for varying response times, reflective of the personal dedication I bring to AllTalk. Your understanding and patience in this regard are greatly appreciated.

It's important to note that I am not the developer of any TTS models utilized by AllTalk, nor do I claim to be an expert on them, including understanding all their nuances, issues, and quirks. For specific TTS model concerns, I’ve provided links to the original developers in the Help section for direct assistance.

Thank you for your continued support and understanding.


💖 Showing Your Support

If AllTalk has been helpful to you, consider showing your support through a donation on my Ko-fi page. Your support is greatly appreciated and helps ensure the continued development and improvement of AllTalk.


🟩 Quick Setup (Text-generation-webui & Standalone Installation)

Quick setup scripts are available for users on Windows 10/11 and Linux. Instructional videos for both setup processes are linked below.

  • Ensure that Git is installed on your system as it is required for cloning the repository. If you do not have Git installed, visit Git's official website to download and install it.
  • Windows users must install C++ development tools for Python to compile Python packages. Detailed information and a link to these tools can be found in the help section Windows & Python requirements for compiling packages.
QUICK SETUP - Text-Generation-webui

For a step-by-step video guide, click here.

To set up AllTalk within Text-generation-webui, follow either method:

  1. Download AllTalk Setup:

    • Via Terminal/Console (Recommended):
      • cd \text-generation-webui\extensions\
      • git clone https://github.com/erew123/alltalk_tts
    • Via Releases Page (cannot be automatically updated after install as it's not linked to GitHub):
      • Download the latest alltalk_tts.zip from Releases and extract it to \text-generation-webui\extensions\alltalk_tts\.
  2. Start Python Environment:

    • In the text-generation-webui folder, start the environment with the appropriate command:

      • Windows: cmd_windows.bat
      • Linux: ./cmd_linux.sh

      If you're unfamiliar with Python environments and wish to learn more, consider reviewing Understanding Python Environments Simplified in the Help section.

  3. Run AllTalk Setup Script:

    • Navigate to the AllTalk directory and execute the setup script:
      • cd extensions
      • cd alltalk_tts
      • Windows: atsetup.bat
      • Linux: ./atsetup.sh
  4. Install Requirements:

    • Follow the on-screen instructions to install the necessary requirements. It's recommended to test AllTalk's functionality before installing DeepSpeed.

Note: Always activate the Text-generation-webui Python environment before making any adjustments or using Fine-tuning. Additional instructions for Fine-tuning and DeepSpeed can be found within the setup utility and on this documentation page.

QUICK SETUP - Standalone Installation

For a step-by-step video guide, click here.

To perform a Standalone installation of AllTalk:

  1. Get AllTalk Setup:

    • Via Terminal/Console (Recommended):
      • Navigate to your preferred directory: cd C:\myfiles\
      • Clone the AllTalk repository: git clone https://github.com/erew123/alltalk_tts
    • Via Releases Page (cannot be automatically updated after install as it's not linked to GitHub):
      • Download alltalk_tts.zip from Releases and extract it to your chosen directory, for example, C:\myfiles\alltalk_tts\.
  2. Start AllTalk Setup:

    • Open a terminal/command prompt, move to the AllTalk directory, and run the setup script:
      • cd alltalk_tts
      • Windows: atsetup.bat
      • Linux: ./atsetup.sh
  3. Follow the Setup Prompts:

    • Select Standalone Installation and then Option 1, and follow any on-screen instructions to install the required files. DeepSpeed is automatically installed on Windows-based systems, but will only work on Nvidia GPUs. Linux users will have to follow the DeepSpeed installation instructions.

If you're unfamiliar with Python environments and wish to learn more, consider reviewing Understanding Python Environments Simplified in the Help section.

Important: Do not use spaces in your folder path (e.g. avoid /my folder-is-this/alltalk_tts-main) as this causes issues with Python & Conda.

Refer to 🟩 Other installation notes for further details, including information on additional voices, changing IP, character card notes etc.

If you wish to understand AllTalk's start-up screen, please read Understanding the AllTalk start-up screen in the Help section.


🟩 Docker Builds and Google Colab

While an AllTalk Docker build exists, it's important to note that this version is based on an earlier iteration of AllTalk and was set up by a third party. At some point, my goal is to deepen my understanding of Docker and its compatibility with AllTalk. This exploration may lead to significant updates to AllTalk to ensure a seamless Docker experience. However, as of now, the Docker build should be considered a BETA version and isn't directly supported by me.

As for Google Colab, there is partial compatibility with AllTalk, though with some quirks. I am currently investigating these issues and figuring out the necessary adjustments to enhance the integration. Until I can ensure a smooth experience, I won't be officially releasing any Google Colab implementations of AllTalk.


🟩 Manual Installation - As part of Text generation web UI (inc. macOSX)

MANUAL INSTALLATION - Text-Generation-webui

Manual Installation for Text Generation Web UI

If you're using a Mac or prefer a manual installation for any other reason, please follow the steps below. This guide is compatible with the current release of Text Generation Web UI as of December 2023. Consider updating your installation if it's been a while, update instructions here.

  • For a visual guide on the installation process, watch this video.
  1. Navigate to Text Generation Web UI Folder:

    • Open a terminal window and move to your Text Generation Web UI directory with:
      • cd text-generation-webui
  2. Activate Text Generation Web UI Python Environment:

    • Start the appropriate Python environment for your OS using one of the following commands:

      • For Windows: cmd_windows.bat
      • For Linux: ./cmd_linux.sh
      • For macOS: cmd_macos.sh
      • For WSL: cmd_wsl.bat
    • Loading the Text Generation Web UI's Python environment is crucial. If unsure about what a loaded Python environment should look like, refer to this image and video guide.

    If you're unfamiliar with Python environments and wish to learn more, consider reviewing Understanding Python Environments Simplified in the Help section.

  3. Move to Extensions Folder:

    • cd extensions
  4. Clone the AllTalk TTS Repository:

    • git clone https://github.com/erew123/alltalk_tts
  5. Navigate to the AllTalk TTS Folder:

    • cd alltalk_tts
  6. Install Required Dependencies:

    • Install dependencies for your machine type:
      • For Windows: pip install -r system\requirements\requirements_textgen.txt
      • For Linux/Mac: pip install -r system/requirements/requirements_textgen.txt
  7. Optional DeepSpeed Installation:

  • If you're using an Nvidia graphics card on Linux or Windows and wish to install DeepSpeed, follow the instructions here.
  • Recommendation: Start Text Generation Web UI and ensure AllTalk functions correctly before installing DeepSpeed.
  8. Start Text Generation Web UI:
  • Return to the main Text Generation Web UI folder using cd .. (repeat as necessary).

    • Start Text Generation Web UI with the appropriate command for your OS:
      • For Windows: start_windows.bat
      • For Linux: ./start_linux.sh
      • For macOS: start_macos.sh
      • For WSL: start_wsl.bat
  • Load the AllTalk extension in the Text Generation Web UI session tab.

  • For any updates to AllTalk or for tasks like Finetuning, always activate the Text Generation Web UI Python environment first.

Refer to 🟩 Other installation notes for further details, including information on additional voices, changing IP, character card notes etc.

🟩 Manual Installation - As a Standalone Application

MANUAL INSTALLATION - Run AllTalk as a Standalone with Text-generation-webui

Running AllTalk as a Standalone Application alongside Text Generation Web UI

If you have AllTalk installed as an extension of Text Generation Web UI but wish to run it as a standalone application, follow these steps:

  1. Activate Text Generation Web UI Python Environment:

    • Use the appropriate command for your operating system to load the Python environment:
      • Windows: cmd_windows.bat
      • Linux: ./cmd_linux.sh
      • macOS: cmd_macos.sh
      • WSL: cmd_wsl.bat
  2. Navigate to the AllTalk Directory:

    • Move to the AllTalk folder with the following commands:
      • cd extensions
      • cd alltalk_tts
  3. Start AllTalk:

    • Run AllTalk with the command:
      • python script.py

    There are no additional steps required to run AllTalk as a standalone application from this point.

MANUAL INSTALLATION - Custom Install of AllTalk

Custom Installation of AllTalk

Support for custom Python environments is limited. Please read the Custom Python environments Limitations Notice below this section.

To run AllTalk as a standalone application with a custom Python environment, ensure you install AllTalk's requirements into the environment of your choice. The instructions provided are generalized due to the variety of potential Python environments.

  • Python Compatibility: The TTS engine requires Python 3.9.x to 3.11.x. AllTalk is tested with Python 3.11.x. See TTS Engine details.
  • Path Names: Avoid spaces in path names as this can cause issues.
  • Custom Python Environments: If encountering issues potentially related to a custom environment, consider testing AllTalk with the quick setup standalone method that builds its own environment.

Quick Overview of Python Environments

If you're unfamiliar with Python environments and wish to learn more, consider reviewing Understanding Python Environments Simplified in the Help section.

Building a Custom Python Environment with Miniconda

  1. Install Miniconda:

    • Download and install Miniconda for your OS from the Miniconda Website. Use the Anaconda Prompt from the Start Menu or Application Launcher to access the base Conda environment.
  2. Clone AllTalk Repository:

    • Navigate to your desired folder (e.g., c:\myfiles\) and clone the AllTalk repository:
      • git clone https://github.com/erew123/alltalk_tts
    • Move to the AllTalk folder with the following command:
      • cd alltalk_tts
  3. Create Conda Environment:

    • Create a Conda environment named alltalkenv with Python 3.11.5:
      • conda create --name alltalkenv python=3.11.5
    • Activate the new environment:
      • conda activate alltalkenv
  4. Install Requirements:

    • Install dependencies based on your machine type:
      • For Windows: pip install -r system\requirements\requirements_standalone.txt
      • For Linux/Mac: pip install -r system/requirements/requirements_standalone.txt
  5. Start AllTalk:

    • Run AllTalk with the following:
      • python script.py

Note: For updates, DeepSpeed installations, or other modifications, always activate the alltalkenv Conda environment first. Custom scripts or batch files can simplify launching AllTalk.

🟩 Custom Python environments Limitations Notice: Given the vast array of Python environments and custom configurations out there, it's challenging for me to guarantee comprehensive support for each unique setup. AllTalk leverages a wide range of scripts and libraries, many of which are developed and maintained outside of my control. As a result, these components might not always behave as expected in every custom Python environment. I'll do my best to assist where I can, but please understand that my ability to help with issues stemming from these external factors may be limited.


🟩 Other installation notes

On first start-up, AllTalk will download the Coqui XTTSv2 2.0.2 model to its models folder (1.8GB of space required). Check the command prompt/terminal window if you want to know what it's doing. After it says "Model Loaded", the Text generation webUI is usually available on its IP address a few seconds later, for you to connect to in your browser. If you are running a headless system and need to change the IP, please see the Help with problems section down below.

Once the extension is loaded, please find all documentation and settings on the link provided in the interface (as shown in the screenshot below).

Where to find voices: https://aiartes.com/voiceai, https://commons.wikimedia.org/, or interviews on YouTube, etc. Instructions on how to cut down and prepare a voice sample are within the built-in documentation.

Please read the note below about start-up times, and also the note about ensuring your character cards are set up correctly.

Some extra voices for AllTalk are downloadable here and here

🟩 Changing AllTalk's IP address & Accessing AllTalk over your Network


AllTalk is coded to start on 127.0.0.1, meaning that it will ONLY be accessible to the local computer it is running on. If you want to make AllTalk available to other systems on your network, you will need to change its IP address to your network card/computer's current IP address. There are two ways to change the IP address:

  1. Start AllTalk and edit the IP address within its web interface under "AllTalk Startup Settings".
  2. You can edit the confignew.json file in a text editor and change "ip_address": "127.0.0.1", to the IP address of your choosing.

So, for example, if your computer's network card was on IP address 192.168.0.20, you would change AllTalk's setting to 192.168.0.20 and then restart AllTalk. You will need to ensure your machine stays on this IP address each time it is restarted, by setting your machine to have a static IP address.
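
If you would rather script the change than hand-edit the file, a minimal sketch like the one below (assuming it is run from inside the alltalk_tts folder) updates the setting without risking the JSON formatting:

# Minimal sketch: update the bound IP address in confignew.json without
# disturbing the rest of the file. Run from inside the alltalk_tts folder.
import json

with open("confignew.json", "r", encoding="utf-8") as f:
    config = json.load(f)

config["ip_address"] = "192.168.0.20"  # your machine's actual network IP

with open("confignew.json", "w", encoding="utf-8") as f:
    json.dump(config, f, indent=4)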

🟩 Text-generation-webui & Stable-Diffusion Plugin - Load Order & stripped text


The Stable Diffusion plugin for Text-generation-webui strips out some of the text, which is passed to Stable Diffusion for image/scene generation. Because this text is stripped, it's important to consider the load order of the plugins to get the result you want. Let's assume the AI has just generated the following message: *He walks into the room with a smile on his face and says* Hello how are you?. The load order changes what text reaches AllTalk for generation, e.g.:

SD Plugin loaded before AllTalk - Only Hello how are you? is sent to AllTalk, with *He walks into the room with a smile on his face and says* being sent over to SD for image generation. Narration of the scene is not possible.

AllTalk loaded before SD Plugin - The full *He walks into the room with a smile on his face and says* Hello how are you? is sent to AllTalk, with *He walks into the room with a smile on his face and says* also being sent over to SD for image generation.

The load order can be changed within Text-generation-webui's settings.yaml file or cmd_flags.txt (depending on how you are managing your extensions).


🟩 A note on Character Cards & Greeting Messages


Messages intended for the Narrator should be enclosed in asterisks * and those for the character inside quotation marks ". However, AI systems often deviate from these rules, resulting in text that is neither in quotes nor asterisks. Sometimes, text may appear with only a single asterisk, and AI models may vary their formatting mid-conversation. For example, they might use asterisks initially and then switch to unmarked text. A properly formatted line should look like this:

"Hey! I'm so excited to finally meet you. I've heard so many great things about you and I'm eager to pick your brain about computers." *She walked across the room and picked up her cup of coffee*

Most narrator/character systems switch voices upon encountering an asterisk or quotation marks, which is somewhat effective. AllTalk has undergone several revisions in its sentence splitting and identification methods. While some irregularities and AI deviations in message formatting are inevitable, any line beginning or ending with an asterisk should now be recognized as Narrator dialogue. Lines enclosed in double quotes are identified as Character dialogue. For any other text, you can choose how AllTalk handles it: whether it should be interpreted as Character or Narrator dialogue (most AI systems tend to lean more towards one format when generating text not enclosed in quotes or asterisks).

With improvements to the splitter/processor, I'm confident it's functioning well. You can monitor what AllTalk identifies as Narrator lines on the command line and adjust its behavior if needed (Text Not Inside - Function).
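
To illustrate the rules described above (this is not AllTalk's actual splitter, which is considerably more involved), a simplified per-line classifier might look like this:

# Simplified illustration of the narrator/character rules described above.
# AllTalk's real splitter is more involved; this only sketches the idea.
def classify_line(line, text_not_inside="narrator"):
    line = line.strip()
    if line.startswith("*") or line.endswith("*"):
        return "narrator"         # asterisk-marked text -> narrator voice
    if line.startswith('"') and line.endswith('"'):
        return "character"        # double-quoted text -> character voice
    return text_not_inside        # unmarked text -> user-configured fallback

print(classify_line('"Hey! Great to finally meet you."'))  # character
print(classify_line("*She walked across the room*"))       # narrator
print(classify_line("Some unmarked text"))                 # narrator (default)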

🟩 I want to know more about the XTTS AI model used


Currently the XTTS model is the main model used by AllTalk for TTS generation. If you want to know more details about the XTTS model, its capabilities or its technical features, you can look at the links to the original developers provided in the Help section.


🟪 Updating

Maintaining the latest version of your setup ensures access to new features and improvements. Below are the steps to update your installation, whether you're using Text-Generation-webui or running as a Standalone Application.

NOTE Future updates will be handled by using the atsetup utility.

NOTE If you have an install prior to 28th March 2024 that you are updating, perform the git pull instructions below, then run the atsetup utility and select option 1 in either the Standalone or Text-generation-webui menu (whichever matches your system).

UPDATING - Text-Generation-webui

The update process closely mirrors the installation steps. Follow these to ensure your setup remains current:

  1. Open a Command Prompt/Terminal:

    • Navigate to your Text-Generation-webui folder with:
      • cd text-generation-webui
  2. Start the Python Environment:

    • Activate the Python environment tailored for your operating system. Use the appropriate command from below based on your OS:
      • Windows: cmd_windows.bat
      • Linux: ./cmd_linux.sh
      • macOS: cmd_macos.sh
      • WSL (Windows Subsystem for Linux): cmd_wsl.bat

    If you're unfamiliar with Python environments and wish to learn more, consider reviewing Understanding Python Environments Simplified in the Help section.

  3. Navigate to the AllTalk TTS Folder:

    • Move into your extensions and then the alltalk_tts directory:
      • cd extensions/alltalk_tts
  4. Update the Repository:

    • Fetch the latest updates from the repository with:
      • git pull
  5. Install Updated Requirements:

    • Depending on your machine's OS, install the required dependencies using pip:
      • For Windows Machines:
        • pip install -r system\requirements\requirements_textgen.txt
      • For Linux/Mac:
        • pip install -r system/requirements/requirements_textgen.txt
  6. DeepSpeed Requirements:

    • If Text-gen-webui is using a new version of PyTorch, you may need to uninstall and update your DeepSpeed version.
    • Use AllTalk's diagnostics or start-up menu to identify your version of PyTorch.

UPDATING - Standalone Application

If you installed from a ZIP file, you cannot use a git pull to update, as noted in the Quick Setup instructions.

For Standalone Application users, here's how to update your setup:

  1. Open a Command Prompt/Terminal:

    • Navigate to your AllTalk folder with:
      • cd alltalk_tts
  2. Access the Python Environment:

    • In a command prompt or terminal window, navigate to your alltalk_tts directory and start the Python environment:
      • Windows:
        • start_environment.bat
      • Linux/macOS:
        • ./start_environment.sh

If you're unfamiliar with Python environments and wish to learn more, consider reviewing Understanding Python Environments Simplified in the Help section.

  3. Pull the Latest Updates:

    • Retrieve the latest changes from the repository with:
      • git pull
  4. Install Updated Requirements:

    • Depending on your machine's OS, install the required dependencies using pip:
      • For Windows Machines:
        • pip install -r system\requirements\requirements_standalone.txt
      • For Linux/Mac:
        • pip install -r system/requirements/requirements_standalone.txt

🟪 Resolving Update Issues

If you encounter problems during or after an update, following these steps can help resolve the issue by refreshing your installation while preserving your data:

RESOLVING - Updates

The process involves renaming your existing alltalk_tts directory, setting up a fresh instance, and then migrating your data:

  1. Rename Existing Directory:

    • First, rename your current alltalk_tts folder to keep it safe e.g. alltalk_tts.old. This preserves any existing data.
  2. Follow the Quick Setup instructions:

    • You will now follow the Quick Setup instructions, performing the git clone https://github.com/erew123/alltalk_tts to pull down a new copy of AllTalk and install the requirements.

      If you're not familiar with Python environments, see Understanding Python Environments Simplified in the Help section for more info.

  3. Migrate Your Data:

    • Before starting AllTalk, transfer the models, voices and outputs folders, and also confignew.json, from alltalk_tts.old to the new alltalk_tts directory. This action preserves your voice history and prevents the need to re-download the model.
  4. Launch AllTalk:
    • You're now ready to launch AllTalk and check it works correctly.
  5. Final Step:
    • Once you've verified that everything is working as expected and you're satisfied with the setup, feel free to delete the alltalk_tts.old directory to free up space.

🔵🟢 DeepSpeed Installation Options

DeepSpeed requires an Nvidia Graphics card

🔵 Linux Installation

DeepSpeed requires access to the Nvidia CUDA Development Toolkit to compile on a Linux system. It's important to note that this toolkit is distinct from, and unrelated to, your graphics card driver or the CUDA version the Python environment uses.

Linux DeepSpeed - Text-generation-webui

DeepSpeed Installation for Text generation webUI

  1. Nvidia CUDA Development Toolkit Installation:

    • The toolkit is crucial for DeepSpeed to compile/build for your version of Linux and requires around 3GB of disk space.
    • Install using your package manager (Recommended) e.g. CUDA Toolkit 11.8 or download directly from Nvidia CUDA Toolkit Archive (choose 11.8 or 12.1 for Linux).
  2. Open a Terminal Console:

    • After Nvidia CUDA Development Toolkit installation, access your terminal console.
  3. Install libaio-dev:

    • Use your Linux distribution's package manager.

      • sudo apt install libaio-dev for Debian-based systems
      • sudo yum install libaio-devel for RPM-based systems.
  4. Navigate to Text generation webUI Folder:

    • Change directory to your Text generation webUI folder with cd text-generation-webui.
  5. Activate Text generation webUI Custom Conda Environment:

    • Run ./cmd_linux.sh to start the environment.

    If you're unfamiliar with Python environments and wish to learn more, consider reviewing Understanding Python Environments Simplified in the Help section.

  6. Set CUDA_HOME Environment Variable:

    • DeepSpeed locates the Nvidia toolkit using the CUDA_HOME environment variable.
    • You will only set this temporarily as Text generation webUI sets up its own CUDA_HOME environment each time you use ./cmd_linux.sh or ./start_linux.sh
  7. Temporarily Configuring CUDA_HOME:

    • When the Text generation webUI Python environment is active (step 5), set CUDA_HOME.

      • export CUDA_HOME=/usr/local/cuda
      • export PATH=${CUDA_HOME}/bin:${PATH}
      • export LD_LIBRARY_PATH=${CUDA_HOME}/lib64:$LD_LIBRARY_PATH
    • You can confirm the path is set correctly and working by running nvcc --version, which should report Cuda compilation tools, release 11.8.

    • Incorrect path settings may lead to errors. If you encounter path issues or receive errors like [Errno 2] No such file or directory when you run the next step, confirm the path correctness or adjust as necessary.

  8. DeepSpeed Installation:

    • Install DeepSpeed using pip install deepspeed.
  9. Troubleshooting:

    • Troubleshooting steps for DeepSpeed installation can be located down below.
    • NOTE: You DO NOT need to set Text-generation-webUI's --deepspeed setting for AllTalk to be able to use DeepSpeed. These are two completely separate things and incorrectly setting that on Text-generation-webUI may cause other complications.
Linux DeepSpeed - Standalone Installation

DeepSpeed Installation for Standalone AllTalk

  1. Nvidia CUDA Development Toolkit Installation:

    • The toolkit is crucial for DeepSpeed to compile/build for your version of Linux and requires around 3GB of disk space.
    • Install using your package manager (Recommended) e.g. CUDA Toolkit 11.8 or download directly from Nvidia CUDA Toolkit Archive (choose 11.8 or 12.1 for Linux).
  2. Open a Terminal Console:

    • After Nvidia CUDA Development Toolkit installation, access your terminal console.
  3. Install libaio-dev:

    • Use your Linux distribution's package manager.

      • sudo apt install libaio-dev for Debian-based systems
      • sudo yum install libaio-devel for RPM-based systems.
  4. Navigate to AllTalk TTS Folder:

    • Change directory to your AllTalk TTS folder with cd alltalk_tts.
  5. Activate AllTalk Custom Conda Environment:

    • Run ./start_environment.sh to start the AllTalk Python environment.
    • This command will start the custom Python environment that was installed with ./atsetup.sh.

    If you're unfamiliar with Python environments and wish to learn more, consider reviewing Understanding Python Environments Simplified in the Help section.

  6. Set CUDA_HOME Environment Variable:

    • The DeepSpeed installation routine locates the Nvidia toolkit using the CUDA_HOME environment variable. This can be set temporarily for a session or permanently, depending on other requirements you may have for other Python/System environments.
    • For temporary use, proceed to step 8. For a permanent solution, see Conda's manual on setting environment variables.
  7. (Optional) Permanent CUDA_HOME Setup:

    • If you choose to set CUDA_HOME permanently, follow the instructions in the provided Conda manual link above.
  8. Configuring CUDA_HOME:

    • When your Python environment is active (step 5), set CUDA_HOME.

      • export CUDA_HOME=/usr/local/cuda
      • export PATH=${CUDA_HOME}/bin:${PATH}
      • export LD_LIBRARY_PATH=${CUDA_HOME}/lib64:$LD_LIBRARY_PATH
    • You can confirm the path is set correctly and working by running nvcc --version, which should report Cuda compilation tools, release 11.8.

    • Incorrect path settings may lead to errors. If you encounter path issues or receive errors like [Errno 2] No such file or directory when you run the next step, confirm the path correctness or adjust as necessary.

  9. DeepSpeed Installation:

    • Install DeepSpeed using pip install deepspeed.
  10. Starting AllTalk TTS WebUI:

    • Launch the AllTalk TTS interface with ./start_alltalk.sh and enable DeepSpeed.

Troubleshooting

  • If setting CUDA_HOME results in path duplication errors (e.g., .../bin/bin/nvcc), you can correct this by unsetting CUDA_HOME with unset CUDA_HOME and then adding the correct path to your system's PATH variable.
  • Always verify paths and compatibility with other CUDA-dependent applications to avoid conflicts.
  • If you have multiple versions of the Nvidia CUDA Development Toolkit installed, you will have to specify the version number in step 8 for the CUDA_HOME path.
  • If it becomes necessary to uninstall DeepSpeed, you can do so by starting the Python environment and then running pip uninstall deepspeed.

🟢 Windows Installation

You have two options for how to set up DeepSpeed on Windows: pre-compiled wheel files for specific Python, CUDA and PyTorch builds, or manually compiling DeepSpeed.

Windows DeepSpeed - Pre-Compiled Wheels (Quick and Easy)

DeepSpeed Installation with Pre-compiled Wheels

  1. Introduction to Pre-compiled Wheels:

    • The atsetup.bat utility simplifies the installation of DeepSpeed by automatically downloading and installing pre-compiled wheel files. These files are tailored for specific versions of Python, CUDA, and PyTorch, ensuring compatibility with both the Standalone Installation and a standard build of Text-generation-webui.
  2. Manual Installation of Pre-compiled Wheels:

    • If needed, pre-compiled DeepSpeed wheel files that I have built are available on the Releases Page. You can manually install or uninstall these wheels using the following commands:
      • Installation: pip install {deep-speed-wheel-file-name-here}
      • Uninstallation: pip uninstall deepspeed
  3. Using atsetup.bat for Simplified Management:

    • For those running the Standalone Installation or a standard build of Text-generation-webui, the atsetup.bat utility offers the simplest and most efficient way to manage DeepSpeed installations on Windows.
Windows DeepSpeed - Manual Compilation

Manual DeepSpeed Wheel Compilation

  1. Preparation for Manual Compilation:

    • Manual compilation of DeepSpeed wheels is an advanced process that requires:
      • 1-2 hours of your time for initial setup and compilation.
      • 6-10GB of disk space on your computer.
      • A solid technical understanding of Windows environments and Python.
  2. Understanding Wheel Compatibility:

    • A compiled DeepSpeed wheel is uniquely tied to the specific versions of Python, PyTorch, and CUDA used during its compilation. If any of these versions are changed, you will need to compile a new DeepSpeed wheel to ensure compatibility.
  3. DeepSpeed Compilation Resources:

    • Myself and @S95Sedan have worked to simplify the compilation process. @S95Sedan has notably improved the process for later versions of DeepSpeed, ensuring ease of build on Windows.
    • Because @S95Sedan is now maintaining the instructions for compiling DeepSpeed on Windows, please visit @S95Sedan's DeepSpeed GitHub page.

🆘 Support Requests, Troubleshooting & Feature requests

I'm thrilled to see the enthusiasm and engagement with AllTalk! Your feedback and questions are invaluable, helping to make this project even better. To ensure everyone gets the help they need efficiently, please consider the following before submitting a support request:

Consult the Documentation: A comprehensive guide and FAQ sections (below) are available to help you navigate AllTalk. Many common questions and troubleshooting steps are covered here.

Search Past Discussions: Your issue or question might already have been addressed in the discussions area or closed issues. Please use the search function to see if there's an existing solution or advice that applies to your situation.

Bug Reports: If you've encountered what you believe is a bug, please first check the Updates & Bug Fixes List to see if it's a known issue or one that's already been resolved. If not, I encourage you to report it by raising a bug report in the Issues section, providing as much detail as possible to help identify and fix the issue.

Feature Requests: The current Feature request list can be found here. I love hearing your ideas for new features! While I can't promise to implement every suggestion, I do consider all feedback carefully. Please share your thoughts in the Discussions area or via a Feature Request in the Issues section.


🟨 Help with problems

    🔄 Minor updates/bug fixes list can be found here

🟨 How to make a diagnostics report file

If you are on a Windows or Linux machine, you should be able to use the atsetup.bat or ./atsetup.sh utility to create a diagnostics file. If you are unable to use the atsetup utility, please follow the instructions below.

Manually making a diagnostics report file
  1. Open a command prompt window and start the Python environment. Depending on your setup (Text-generation-webui or Standalone AllTalk), the steps to start the Python environment vary:
  • For Text-generation-webui Users:

    • Navigate to the Text-generation-webui directory:
      • cd text-generation-webui
    • Start the Python environment suitable for your OS:
      • Windows: cmd_windows.bat
      • Linux: ./cmd_linux.sh
      • macOS: cmd_macos.sh
      • WSL (Windows Subsystem for Linux): cmd_wsl.bat
    • Move into the AllTalk directory:
      • cd extensions/alltalk_tts
  • For Standalone AllTalk Users:

    • Navigate to the alltalk_tts folder:
      • cd alltalk_tts
    • Start the Python environment:
      • Windows: start_environment.bat
      • Linux: ./start_environment.sh

    If you're unfamiliar with Python environments and wish to learn more, consider reviewing Understanding Python Environments Simplified in the Help section.

  2. Run the diagnostics and select the requirements file name you installed AllTalk with:

    • python diagnostics.py
  3. You will see an on-screen output showing your environment settings, requested file versions vs. what's installed, and details of your graphics card (if Nvidia). This will also create a file called diagnostics.log in the alltalk_tts folder, which you can upload if you need to create a support ticket on here.


Installation and Setup Issues

🟨 Understanding Python Environments Simplified

Think of Python environments like different rooms in your house, each designed for a specific purpose. Just as you wouldn't cook in the bathroom or sleep in the kitchen, different Python applications need their own "spaces" or environments because they have unique requirements. Sometimes, these requirements can clash with those of other applications (imagine trying to cook a meal in a bathroom!). To avoid this, you can create separate Python environments.

Why Separate Environments?

Separate environments, like separate rooms, keep everything organized and prevent conflicts. For instance, one Python application might need a specific version of a library or dependency, while another requires a different version. Just as you wouldn't store kitchen utensils in the bathroom, you wouldn't want these conflicting requirements to interfere with each other. Each environment is tailored and customized for its application, ensuring it has everything it needs without disrupting others.

How It Works in Practice:

Standalone AllTalk Installation: When you install AllTalk standalone, it's akin to adding a new room to your house specifically designed for your AllTalk activities. The setup process, using the atsetup utility, constructs this custom "room" (Python environment alltalk_environment) with all the necessary tools and furnishings (libraries and dependencies) that AllTalk needs to function smoothly, without meddling with the rest of your "house" (computer system). The AllTalk environment is started each time you run start_alltalk or start_environment within the AllTalk folder.

Text-generation-webui Installation: Similarly, installing Text-generation-webui is like setting up another specialized room. Upon installation, it automatically creates its own tailored environment, equipped with everything required for text generation, ensuring a seamless and conflict-free operation. The Text-generation-webui environment is started each time you run start_*your-os-version* or cmd_*your-os-version* within the Text-generation-webui folder.

Managing Environments:

Just as you might renovate a room or bring in new furniture, you can also update or modify Python environments as needed. Tools like Conda or venv make it easy to manage these environments, allowing you to create, duplicate, activate, or delete them much like how you might manage different rooms in your house for comfort and functionality.

Once you're in the right environment, by activating it, installing or updating dependencies (the tools and furniture of your Python application) is straightforward. Using pip, a package installer for Python, you can easily add what you need. For example, to install all required dependencies listed in a requirements.txt file, you'd use:

pip install -r requirements.txt

This command tells pip to read the list of required packages and versions from the requirements.txt file and install them in the current environment, ensuring your application has everything it needs to operate. It's like having a shopping list for outfitting a room and ensuring you have all the right items delivered and set up.

Remember, just as it's important to use the right tools for tasks in different rooms of your house, it's crucial to manage your Python environments and dependencies properly to ensure your applications run as intended.

How do I know if I am in a Python environment?:

When a Python environment starts up, it changes the command prompt to show the Python environment that is currently running within that terminal/console.


🟨 Windows & Python requirements for compiling packages (ERROR: Could not build wheels for TTS)

ERROR: Microsoft Visual C++ 14.0 or greater is required or ERROR: Could not build wheels for TTS. or ModuleNotFoundError: No module named 'TTS'

Python requires that you install C++ development tools on Windows. This is detailed on the Python site here. You would need to install MSVCv142 - VS 2019 C++ x64/x86 build tools and Windows 10/11 SDK from the C++ Build tools section.

You can get hold of the Community edition here; during installation, select C++ Build tools and then MSVCv142 - VS 2019 C++ x64/x86 build tools and Windows 10/11 SDK.


🟨 Standalone Install - start_{youros}.xx opens and closes instantly and AllTalk doesn't start

This is more than likely caused by having a space in your folder path, e.g. c:\program files\alltalk_tts. In this circumstance you would be best moving the folder to a path without a space, e.g. c:\myfiles\alltalk_tts. You will have to delete the alltalk_environment folder and start_alltalk.bat or start_alltalk.sh, and then re-run atsetup to re-create the environment and startup files.

🟨 I think AllTalk's requirements file has installed something another extension doesn't like

I've paid very close attention to not impacting what Text-generation-webui requests on a factory install. This is one of the requirements for submitting an extension to Text-generation-webui. If you want to look at a comparison of a factory-fresh text-generation-webui's installed packages (with CUDA 12.1, though AllTalk's requirements were set on CUDA 11.8), you can find that comparison here. This comparison shows that AllTalk requests the same package version numbers as Text-generation-webui, or even lower version numbers (meaning AllTalk will not update them to a later version). What other extensions do, I can't really account for.

I will note that the TTS engine downgrades Pandas to 1.5.3, though it's unlikely to cause any issues. You can upgrade it back to the text-generation-webui default (December 2023) with pip install pandas==2.1.4 when inside the Python environment. I have noticed no ill effects from it being a lower or higher version, as far as AllTalk goes. This is also the same behaviour as the Coqui_tts extension that comes with Text-generation-webui.

Other people are reporting issues with extensions not starting with errors about Pydantic, e.g. pydantic.errors.PydanticImportError: BaseSettings has been moved to the pydantic-settings package. See https://docs.pydantic.dev/2.5/migration/#basesettings-has-moved-to-pydantic-settings for more details.

I'm not sure if the Pydantic version has been recently updated by the Text-generation-webui installer, but this is nothing to do with AllTalk. The other extension you are having an issue with needs to be updated to work with Pydantic 2.5.x. AllTalk was updated in mid-December to work with 2.5.x. I am not specifically condoning doing this, as it may have other knock-on effects, but within the text-gen Python environment you can use pip install pydantic==2.5.0 or pip install pydantic==1.10.13 to change the version of Pydantic installed.

🟨 I am having problems getting AllTalk to start after changing settings or making a custom setup/model setup.

I would suggest following Problems Updating and, if you still have issues after that, you can raise an issue here.

Networking and Access Issues

🟨 I cannot access AllTalk from another machine on my Network

You will need to change the IP address within AllTalk's settings from 127.0.0.1, which only allows access from the local machine it's installed on. To do this, please see Changing AllTalk's IP address & Accessing AllTalk over your Network at the top of this page.

You may also need to allow access through your firewall or Antivirus package to AllTalk.

🟨 I am running a Headless system and need to change the IP Address manually as I cannot reach the config page

To do this you can edit the confignew.json file within the alltalk_tts folder. Look for "ip_address": "127.0.0.1", and change the 127.0.0.1 to your chosen IP address, then save the file and start AllTalk.

When doing this, be careful not to impact the formatting of the JSON file (the scripted approach shown in Changing AllTalk's IP address & Accessing AllTalk over your Network avoids this risk). Worst case, you can re-download a fresh copy of confignew.json from this website and that will put you back to a factory setting.

Configuration and Usage Issues

🟨 I activated DeepSpeed in the settings page, but I didn't install DeepSpeed yet and now I have issues starting up

You can either follow Problems Updating and fresh-install your config, or you can edit the confignew.json file within the alltalk_tts folder. Look for "deepspeed_activate": true, and change the word true to false ("deepspeed_activate": false,), then save the file and try starting again.

If you want to use DeepSpeed, you need an Nvidia Graphics card and to install DeepSpeed on your system. Instructions are here

🟨 I am having problems updating/some other issue where it won't start up/I'm sure this is a bug

Please see Problems Updating. If that doesn't help, you can raise a ticket here. It would be handy to have any log files from the console where your error is being shown. I can only loosely support custom-built Python environments and give general pointers. Please create a diagnostics.log report file to submit with a support request.

Also, is your text-generation-webui up to date? Instructions here.

🟨 I see some red "asyncio" messages

As far as I am aware, these are to do with the Chrome browser and the Gradio interface of text-generation-webui in some way. I raised an issue about this on the text-generation-webui here, where you can see that AllTalk is not loaded and the messages persist. Either way, this is more a warning than an actual issue, so it shouldn't affect any functionality of either AllTalk or text-generation-webui; they are more just an annoyance.

Startup, Performance and Compatibility Issues

🟨 Understanding the AllTalk start-up screen

The AllTalk start-up screen provides various bits of information about the detected Python environment and errors.


Config file check

  • Sometimes I need to add/remove something to/from your existing configuration file settings. Obviously, I don't want to impact your existing settings, however any new features may need these settings to be created before AllTalk starts up. I've added extra code that checks alltalk_tts/system/config/at_configupdate.json and alltalk_tts/system/config/at_configdowngrade.json, either adding or removing items to your configuration as necessary. If a change is made, you will be notified and a backup of the previous configuration file will be created in the alltalk_tts folder.

AllTalk startup Mode

  • Informational. This will state whether AllTalk has detected it is running as part of Text-generation-webui or as a Standalone Application.

WAV file deletion

  • If you have set deletion of old generated WAV files, this will state the time frame after which they are purged.

DeepSpeed version

  • The version of DeepSpeed that is installed/detected. This will not tell you if the version of DeepSpeed is compiled for your Python, PyTorch or CUDA version. It's important to remember that DeepSpeed has to be compiled for the exact version of Python, PyTorch and CUDA that you are using, so please ensure you have the correct DeepSpeed version installed if necessary.

Model is available

  • AllTalk is checking whether your model files exist. This is not a validity check of the actual model files; they can still be corrupted. If files are missing, AllTalk will attempt to download them from Huggingface; however, if Huggingface has an outage/issue or your internet connection has issues, it's possible corrupted or incomplete files will be downloaded. Please read RuntimeError: PytorchStreamReader failed reading zip archive: failed finding central directory if you need to confirm your model files are ok.

Current Python Version

  • Informational. Literally tells you the version of Python running in your Python environment.

Current PyTorch Version

  • Informational. Tells you the version of PyTorch running in your Python environment. If you have an Nvidia card, you should be running a CUDA-based version of PyTorch, indicated by a +cuXXX suffix after the PyTorch version, e.g. 2.2.2+cu121 would be PyTorch version 2.2.2 with CUDA 12.1 extensions. If you don't have the PyTorch CUDA extensions installed, but you do have an Nvidia card, you may need to re-install PyTorch.

Current CUDA Version

  • Informational. This is linked to the Current PyTorch Version, as detailed above.

Current TTS Version

  • Informational. The current version of the TTS engine that is running.

AllTalk Github updated

  • As long as you have an internet connection, this will tell you the last time AllTalk was updated on Github. It checks the commit list to see when the last commit was made; as such, this could be simply a documentation update, a bug fix or new features. It's simply there as a guide to let you know the last time something was changed on AllTalk's Github.

TTS Subprocess

  • When AllTalk reaches this stage, the subprocess that loads in the AI model is starting. This is most likely where an error could occur with loading the TTS model, just after the documentation message.

AllTalk Settings & Documentation: http://x.x.x.x

  • The link where you can reach AllTalk's built-in settings and documentation page. The TTS model will be loading immediately after this is displayed.

🟨 AllTalk is only loading into CPU, but I have an Nvidia GPU so it should be loading into CUDA

This is caused by PyTorch (Torch) not having the CUDA extensions installed (you can check by running the diagnostics). Typically this happens (on Standalone installations) because when the setup routine goes to install PyTorch with CUDA, it looks in the PIP cache, and if a previous application has downloaded a version of PyTorch that doesn't have CUDA extensions, the PIP installer doesn't recognise this fact and just uses the cached version for installation. To resolve this:

  1. On the atsetup utility, on the Standalone menu select to Purge the PIP cache. This will remove cached packages from the PIP cache, meaning it will have to download fresh copies.
  2. As we need to force the upgrade to the Python environment, the easiest way to do this will be to use atsetup to Delete AllTalk's custom Python environment. This means it will have to rebuild the Python environment. Note, you may have to run this step twice, as it has to exit the current Python environment, then you have to re-load atsetup and select Delete AllTalk's custom Python environment again.
  3. You can now use atsetup to Install AllTalk as a Standalone Application which will download fresh copies of everything and re-install the Python environment.
  4. Once this is done, you can check if CUDA is now working with the diagnostics, or by starting AllTalk and checking the model loads into CUDA (see the quick check below).
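
To quickly confirm whether the PyTorch build inside your Python environment actually has CUDA support, you can run a short check like this from inside the activated environment (standard PyTorch calls):

# Quick check of PyTorch CUDA support, run inside the Python environment.
import torch

print("PyTorch version:", torch.__version__)       # e.g. 2.2.2+cu121 for a CUDA build
print("CUDA available :", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU detected  :", torch.cuda.get_device_name(0))
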
🟨 RuntimeError: PytorchStreamReader failed reading zip archive: failed finding central directory

This error message is caused by the model being corrupted or damaged in some way. This error can occur if Huggingface, where the model is downloaded from, has an error when the model is downloaded, or potentially from internet issues occurring while the model is downloaded on first start-up.

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: PytorchStreamReader failed reading zip archive: failed finding central directory

ERROR: Application startup failed. Exiting.
[AllTalk Startup] Warning TTS Subprocess has NOT started up yet, Will keep trying for 120 seconds maximum. Please wait.

To resolve this, first look in your alltalk_tts/models/xttsv2_2.0.2 (or whichever) model folder and confirm that the file sizes are correct.


You can delete one or more suspect files and a factory fresh copy of that file or files will be downloaded on next start-up of AllTalk.
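
If you want a quick way to list the files and their sizes before deciding which to delete, a small sketch such as this works (run from the alltalk_tts folder; adjust the folder name to your model version):

# List the model files and their sizes so they can be compared against
# the expected values. Adjust the folder name to your model version.
from pathlib import Path

model_dir = Path("models/xttsv2_2.0.2")  # relative to the alltalk_tts folder
for file in sorted(model_dir.iterdir()):
    print(f"{file.name:30} {file.stat().st_size:>14,} bytes")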

🟨 RuntimeError: Found no NVIDIA driver on your system.

This error message is caused by DeepSpeed being enabled when you do not have an Nvidia GPU. To resolve this, edit confignew.json and change "deepspeed_activate": true, to "deepspeed_activate": false, then restart AllTalk.

  File "C:\alltalk_tts\alltalk_environment\env\Lib\site-packages\torch\cuda\__init__.py", line 302, in _lazy_init
    torch._C._cuda_init()
RuntimeError: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx

ERROR:    Application startup failed. Exiting.
[AllTalk Startup] Warning TTS Subprocess has NOT started up yet, Will keep trying for 120 seconds maximum. Please wait.
🟨 raise RuntimeError("PyTorch version mismatch! DeepSpeed ops were compiled and installed.

This error message is caused by having DeepSpeed enabled, but with a version of DeepSpeed installed that was compiled for a different version of Python, PyTorch or CUDA (or any mix of those). You will need to start your Python environment and run pip uninstall deepspeed to remove DeepSpeed from your Python environment, and then install the correct version of DeepSpeed.

raise RuntimeError("PyTorch version mismatch! DeepSpeed ops were compiled and installed 
RuntimeError: PyTorch version mismatch! DeepSpeed ops were compiled and installed with a different version than what is being used at runtime. Please re-install DeepSpeed or switch torch versions. Install torch version=2.1, Runtime torch version=2.2
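
Before reinstalling, you can confirm exactly which versions are in your environment and compare them against those named in the error message (both packages expose a __version__ attribute):

# Check the installed PyTorch and DeepSpeed versions from inside the
# Python environment, to compare against the error message above.
import torch
import deepspeed

print("PyTorch  :", torch.__version__)
print("DeepSpeed:", deepspeed.__version__)
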
🟨 Warning TTS Subprocess has NOT started up yet, Will keep trying for 120 seconds maximum. Please wait. It times out after 120 seconds.
When the subprocess is starting, two things are occurring:

A) It's trying to load the voice model into your graphics card's VRAM (assuming you have an Nvidia graphics card; otherwise it's your system RAM).
B) It's trying to start up the mini-webserver and send the "ready" signal back to the main process.

Before giving other possibilities a go, note that some people with old machines are finding their start-up times are very slow (2-3 minutes). I've extended the allowed time within the script from 1 minute to 2 minutes. If you have an older machine and wish to try extending this further, you can do so by editing script.py and changing startup_wait_time = 120 (120 seconds, aka 2 minutes) at the top of the script.py file to a larger value, e.g. startup_wait_time = 240 (240 seconds, aka 4 minutes).

Note: If you need to create a support ticket, please create a diagnostics.log report file to submit with a support request. Details on doing this are above.

Other possibilities for this issue are:

  1. You are starting AllTalk in both your CMD FLAG.txt and settings.yaml file. You would have manually edited the CMD FLAG.txt, and the settings.yaml is the one you change and save in the session tab of text-generation-webui (you can Save UI defaults to settings.yaml). Please only have one of those two starting up AllTalk.

  2. You are not starting text-generation-webui with its normal Python environment. Please start it with start_{your OS version} as detailed here (start_windows.bat, ./start_linux.sh, start_macos.sh or start_wsl.bat) OR (cmd_windows.bat, ./cmd_linux.sh, cmd_macos.sh or cmd_wsl.bat, and then python server.py).

  3. You have installed the wrong version of DeepSpeed on your system, for the wrong version of Python/Text-generation-webui. You can go to your text-generation-webui folder in a terminal/command prompt, run the correct cmd version for your OS (cmd_windows.bat, ./cmd_linux.sh, cmd_macos.sh or cmd_wsl.bat), and then type pip uninstall deepspeed, then try loading it again. If that works, please see here for the correct instructions for installing DeepSpeed.

  4. You have an old version of text-generation-webui (pre-Dec 2023). I have not tested on older versions of text-generation-webui, so cannot confirm viability on older versions. For instructions on updating text-generation-webui, please look here (update_linux.sh, update_windows.bat, update_macos.sh, or update_wsl.bat).

  5. You already have something running on port 7851 on your computer, so the mini-webserver can't start on that port. You can change this port number by editing the confignew.json file and changing "port_number": "7851" to "port_number": "7602", or any port number you wish that isn't reserved. Only change the number and save the file; do not change the formatting of the document. This will at least discount that you have something else clashing on the same port number (see the port-check sketch after this list).

  6. You have antivirus/firewalling that is blocking that port from being accessed. If you had to do something to allow text-generation-webui through your antivirus/firewall, you will have to do that for this too.

  7. You have quite old graphics drivers and may need to update them.

  8. Something within text-generation-webui is not playing nicely for some reason. You can go to your text-generation-webui folder in a terminal/command prompt, run the correct cmd version for your OS (cmd_windows.bat, ./cmd_linux.sh, cmd_macos.sh or cmd_wsl.bat), then type python extensions\alltalk_tts\script.py and see if AllTalk starts up correctly. If it does, then something else is interfering.

  9. Something else is already loaded into your VRAM, or there is a crashed Python process. Either check your task manager for erroneous Python processes, or restart your machine and try again.

  10. You are running DeepSpeed on a Linux machine and, although you are starting with ./start_linux.sh, AllTalk is failing to start there. This is because text-generation-webui will overwrite some environment variables when it loads its Python environment. To see if this is the problem: from a terminal, go into your text-generation-webui folder and run ./cmd_linux.sh, then set your environment variable again, e.g. export CUDA_HOME=/usr/local/cuda (this may vary depending on your OS, but this is the standard one for Linux, assuming you have installed the CUDA toolkit), then run python server.py and see if it starts up. If you want to set the environment variable permanently you can do so; I have not managed to write full instructions yet, but the conda guide is here.

  11. You have built yourself a custom Python environment and something is funky with it. This is very hard to diagnose as it's not a standard environment. You may want to update text-generation-webui and reinstall its requirements file (whichever one you use that comes with text-generation-webui).

🟨 I have multiple GPUs and I have problems running Finetuning

Finetuning pulls in various other scripts, and some of those scripts can have issues with multiple Nvidia GPUs being present. Until the people who created those other scripts fix up their code, there is a workaround to temporarily tell your system to only use one of your Nvidia GPUs. To do this:

  • Windows - You will start the script with set CUDA_VISIBLE_DEVICES=0 && python finetune.py
    After you have completed training, you can reset back with set CUDA_VISIBLE_DEVICES=

  • Linux - You will start the script with CUDA_VISIBLE_DEVICES=0 python finetune.py
    After you have completed training, you can reset back with unset CUDA_VISIBLE_DEVICES

Rebooting your system will also unset this. The setting is only applied temporarily.

Depending on which of your Nvidia GPUs is the more powerful one, you can change the 0 to 1, or to whichever of your GPUs is the most powerful.
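If you would rather pin the GPU from Python than from the shell, a minimal wrapper sketch (a hypothetical helper script of your own, not part of AllTalk) would be:

# hypothetical wrapper script: pin finetuning to a single GPU
import os
import subprocess

env = dict(os.environ, CUDA_VISIBLE_DEVICES="0")  # "0" = first GPU, "1" = second, etc.
subprocess.run(["python", "finetune.py"], env=env, check=True)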

🟨 Firefox - Streaming Audio doesn't work on Firefox

This is a long-standing issue with Mozilla & Firefox and one I am unable to resolve, as Mozilla have not resolved the issue with Firefox. The solution is to use another web browser if you want to use streaming audio. For details of my prior investigation, please look at this ticket

🟨 Hindi Support - Not working or issues

Hindi support does not officially exist according to Coqui. I've added limited Hindi support at this time; however, it only works with the API TTS method and I'm sure there will be issues. ticket

Application Specific Issues

🟨 SillyTavern - I changed my IP address and now SillyTavern wont connect with AllTalk
SillyTavern checks the IP address when loading extensions, saving the IP to its configuration only if the check succeeds. For whatever reason, SillyTavern's checks don't always allow changing its IP address a second time.

To manually change the IP address:

  1. Navigate to the SillyTavern Public folder located at /sillytavern/public/.
  2. Open the settings.json file.
  3. Look for the AllTalk section and find the provider_endpoint entry.
  4. Replace localhost with your desired IP address, for example, 192.168.1.64.
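For illustration only (the exact value and formatting in your settings.json may differ), the edited entry would end up looking something like "provider_endpoint": "http://192.168.1.64:7851", where it previously referenced localhost.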


TTS Generation Issues & Questions

🟨 XTTS - Does the XTTS AI Model Support Emotion Control or Singing?

No, the XTTS AI model does not currently support direct control over emotions or singing capabilities. While XTTS infuses generated speech with a degree of emotional intonation based on the context of the text, users cannot explicitly control this aspect. It's worth noting that regenerating the same line of TTS may yield slightly different emotional inflections, but there is no way to directly control it with XTTS.

🟨 XTTS - Skips, repeats or pronunciation Issues

Firstly, it's important to clarify that the development and maintenance of the XTTS AI models and core scripts are handled by Coqui, with additional scripts and libraries from entities like Hugging Face, among many other Python scripts and libraries used by AllTalk.

AllTalk is designed to be a straightforward interface that simplifies setup and interaction with AI TTS models like XTTS. Currently, AllTalk supports the XTTS model, with plans to include more models in the future. Please understand that the deep inner workings of XTTS, including reasons why it may skip, repeat, or mispronounce, along with 3rd party scripts and libraries utilized, are ultimately outside my control.

Although I ensure the text processed through AllTalk is accurately relayed to the XTTS speech generation process, and I have aimed to mitigate as many issues as possible, skips, repeats and bad pronunciation can still occur.

Certain aspects I have not been able to investigate, due to my own time limitations, are:

  • The impact of DeepSpeed on TTS quality. Is this more likely to cause skips or repetition?
  • Comparative performance between different XTTS model versions (e.g., 2.0.3 vs. 2.0.2) regarding audio quality and consistency.

From my experience and anecdotally gained knowledge:

  • Lower quality voice samples tend to produce more anomalies in generated speech.
  • Model finetuning with high-quality voice samples significantly reduces such issues, enhancing overall speech quality.
  • Unusual/excessive punctuation causes issues, e.g. asterisks (*), hashes (#), brackets ( ), etc. AllTalk will filter many of these out.

So, for example, the female_01.wav file that is provided with AllTalk is a studio-quality voice sample which the XTTS model was trained on. Typically you will find it unlikely that anomalies occur with TTS generation when using this voice sample. Hence good quality samples, and finetuning, generally improve results with XTTS.

If you wish to try out the XTTS version 2.0.3 model and see if it works better, you can download it from here, replacing all the files within your /alltalk_tts/models/xttsv2_2.0.2 folder. It is on my list both to test version 2.0.3 further and to build a more flexible TTS model downloader that will accommodate not only other XTTS models, but also other TTS engines. If you try the XTTS version 2.0.3 model and glean any insights, please let me know.


⚫ Finetuning a model

If you have a voice that the model doesn't quite reproduce correctly, or indeed you just want to improve the reproduced voice, then finetuning is a way to train your "XTTSv2 local" model (stored in /alltalk_tts/models/xxxxx/) on a specific voice. For this you will need:

  • An Nvidia graphics card. (Please see the help section note if you have multiple Nvidia GPUs.) Preferably 12GB+ VRAM on Windows; minimum 16GB VRAM on Linux.
  • 18GB of disk space free (most of this is used temporarily)
  • At least 2 minutes of good quality speech from your chosen speaker in mp3, wav or flac format, in one or more files (I have tested with up to 20 minutes worth of audio).
  • As a side note, many people seem to find that the Whisper v2 model (used in Step 1) gives better results at generating training datasets, so you may prefer to try that as opposed to the Whisper v3 model.

⚫ How will this work/How complicated is it?

Everything has been done to make this as simple as possible. At its simplest, you can literally just download a large chunk of audio from an interview and tell the finetuning to strip through it, find the spoken parts and build your dataset. You can click 4 buttons, then copy a few files, and you are done. At its more complicated end you will clean up the audio a little beforehand, but it's still only 4 buttons and copying a few files.

⚫ The audio you will use

I would suggest that if it's in an interview format, you cut out the interviewer speaking in Audacity or your chosen audio editing package. You don't have to worry about being perfect with your cuts; the finetuning Step 1 will go and find spoken audio and cut it out for you. If there is music over the spoken parts, for best quality you would cut out those parts, though it's not 100% necessary. As always, try to avoid bad quality audio with noises in it (humming sounds, hiss etc.). You can try something like Audioenhancer to clean up noisier audio. There is no need to down-sample any of the audio; all of that is handled for you. Just give the finetuning some good quality audio to work with.

⚫ Can I Finetune a model more than once on more than one voice

Yes you can. You would do these as multiple finetuning runs, but it's absolutely possible and fine to do. Finetuning the XTTS model does not restrict it to only being able to reproduce the one voice you trained it on. Finetuning is generally nudging the model in a direction to learn the ability to sound a bit more like a voice it has not heard before.

⚫ A note about anonymous training Telemetry information & disabling it

Portions of Coqui's TTS trainer scripts gather anonymous training information, which you can disable. Their statement on this is listed here. If you start AllTalk Finetuning with start_finetuning.bat or ./start_finetuning.sh, telemetry will be disabled. If you want to disable it manually, please expand the below:

Manually disable telemetry

Before starting finetuning, run the following in your terminal/command prompt:

  • On Windows by typing set TRAINER_TELEMETRY=0
  • On Linux & Mac by typing export TRAINER_TELEMETRY=0

Do this before you start finetune.py. You will then be able to finetune offline, and no anonymous training data will be sent.

⚫ Prerequisites for Fine-tuning with Nvidia CUDA Development Toolkit 11.8

All the requirements for Finetuning will be installed by using the atsetup utility and installing your correct requirements (Standalone or for Text-generation-webui). The legacy manual instructions are stored below; however, these shouldn't be required.

Legacy manual instructions for installing Nvidia CUDA Development Toolkit 11.8
  • To perform fine-tuning, a specific portion of the Nvidia CUDA Development Toolkit v11.8 must be installed. This is crucial for step 1 of fine-tuning. The objective is to minimize the installation footprint by installing only the essential components.
  • The Nvidia CUDA Development Toolkit v11.8 operates independently from your graphics card drivers and the CUDA version utilized by your Python environment.
  • This installation process aims to keep the download and install size as minimal as possible; however, a full install of the toolkit requires 3GB of disk space.
  • When running Finetuning, it will require up to 20GB of temporary disk space, so please ensure you have this space available and preferably use an SSD or NVMe drive.
  1. Download the Toolkit:

    • Obtain the network install version of the Nvidia CUDA Development Toolkit 11.8 from Nvidia's Archive.
  2. Run the Installer:

    • Choose Custom (Advanced) installation.
    • Deselect all options initially.
    • Select the following components:
      • CUDA > Development > Compiler > nvcc
      • CUDA > Development > Libraries > CUBLAS (both development and runtime)
  3. Configure Environment Search Path:

    • It's essential that nvcc and CUDA 11.8 library files are discoverable in your environment's search path. Adjustments can be reverted post-fine-tuning if desired.

      For Windows:

      • Edit the Path environment variable to include C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.8\bin.
      • Add CUDA_HOME and set its path to C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.8.

      For Linux:

      • The path may vary by Linux distribution. Here's a generic setup:
        • export CUDA_HOME=/usr/local/cuda

        • export PATH=${CUDA_HOME}/bin:${PATH}

        • export LD_LIBRARY_PATH=${CUDA_HOME}/lib64:$LD_LIBRARY_PATH

        • Consider adding these to your ~/.bashrc for permanence, or apply temporarily for the current session by running the above commands each time you start your Python environment.

      Note: If using Text-generation-webui, it's best to set these temporarily.

  4. Verify Installation:

    • Open a new terminal/command prompt to refresh the search paths.
    • In a terminal or command prompt, execute nvcc --version.
    • Success is indicated by a response of Cuda compilation tools, release 11.8. Specifically, ensure it is version 11.8.
  5. Troubleshooting:

    • If the correct version isn't reported, recheck your environment path settings for accuracy and potential conflicts with other CUDA versions.

Additional Note on Torch and Torchaudio:

  • Ensure Torch and Torchaudio are CUDA-enabled (any version), which is separate from the CUDA Toolkit installation. CUDA 11.8 corresponds to cu118 and CUDA 12.1 to cu121 in AllTalk diagnostics.
  • Failure to install CUDA for Torch and Torchaudio will result in Step 2 of fine-tuning failing. These requirements are distinct from the CUDA Toolkit installation, so avoid conflating the two.
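A quick way to confirm your Torch build is CUDA-enabled is a minimal check run inside the same Python environment, for example:

import torch

print(torch.__version__)          # e.g. "2.1.0+cu118" - a cu118 suffix indicates CUDA 11.8
print(torch.cuda.is_available())  # should print True, or Step 2 of fine-tuning will fail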

⚫ Starting Fine-tuning

NOTE: Ensure AllTalk has been launched at least once after any updates to download necessary files for fine-tuning.

  1. Close Resource-Intensive Applications:

    • Terminate any applications that are using your GPU/VRAM to ensure enough resources for fine-tuning.
  2. Organize Voice Samples:

    • Place your audio samples into the following directory: /alltalk_tts/finetune/put-voice-samples-in-here/

Depending on your setup (Text-generation-webui or Standalone AllTalk), the steps to start the Python environment vary:

  • For Standalone AllTalk Users:

    • Navigate to the alltalk_tts folder:
      • cd alltalk_tts
    • Start the Python environment:
      • Windows: start_finetune.bat
      • Linux: ./start_finetune.sh
  • For Text-generation-webui Users:

    • Navigate to the Text-generation-webui directory:
      • cd text-generation-webui
    • Start the Python environment suitable for your OS:
      • Windows: cmd_windows.bat
      • Linux: ./cmd_linux.sh
      • macOS: cmd_macos.sh
      • WSL (Windows Subsystem for Linux): cmd_wsl.bat
    • Move into the AllTalk directory:
      • cd extensions/alltalk_tts
    • Linux users only: additionally run this command:
       export LD_LIBRARY_PATH=`python3 -c 'import os; import nvidia.cublas.lib; import nvidia.cudnn.lib; print(os.path.dirname(nvidia.cublas.lib.__file__) + ":" + os.path.dirname(nvidia.cudnn.lib.__file__))'`
      
    • Start the fine-tuning process with the command:
      • python finetune.py

    If you're unfamiliar with Python environments and wish to learn more, consider reviewing Understanding Python Environments Simplified in the Help section.

  3. Pre-Flight Checklist:

    • Go through the pre-flight checklist to ensure readiness. Address any issues flagged as "Fail".
  4. Post Fine-tuning Actions:

    • Upon completing fine-tuning, the final tab will guide you on managing your files and relocating your newly trained model to the appropriate directory.

These steps guide you through the initial preparations, starting the Python environment based on your setup, and the fine-tuning process itself. Ensure all prerequisites are met to facilitate a smooth fine-tuning experience.

⚫ How many epochs etc. is the right amount?

In finetuning, the suggested/recommended amount of epochs, batch size, evaluation percent etc. is already set. However, there is no absolutely correct answer to what the settings should be; it all depends on what you are doing.

  • If you just want to train a normal human voice that is in an existing language, the base settings will work fine for most people's needs. You may choose to increase the epochs up to maybe 20, or run a second round of training if needed.
  • If you were training an entirely new language, you would need a huge amount of training data, and it requires around 1000 epochs (based on what I can find around the internet from people who have tried this).
  • If you are training a cartoon style voice in an existing language, it may need well upwards of 40 epochs until it can reproduce that voice with some success.

There are no absolutely correct settings, as there are too many variables, ranging from the amount of samples you are using (5 minutes worth? 4 hours worth?) to how similar the samples are to what the AI model already understands, and so on. Coqui, who originally trained the model, usually say something along the lines of: once you've trained it X amount, if it sounds good then you are done, and if it doesn't, train it more.

⚫ Evaluation Data Percentage

In the process of finetuning, it's crucial to balance the data used for training the model against the data reserved for evaluating its performance. Typically, a portion of the dataset is set aside as an 'evaluation set' to assess the model's capabilities in dealing with unseen data. On Step 1 of finetuning you have the option to adjust this evaluation data percentage, offering more control over your model training process.

Why Adjust the Evaluation Percentage?

Adjusting the evaluation percentage can be beneficial in scenarios with limited voice samples. When dealing with a smaller dataset, allocating a slightly larger portion to training could enhance the model's ability to learn from these scarce samples. Conversely, with abundant data, a higher evaluation percentage might be more appropriate to rigorously test the model's performance. There are currently no absolutely optimal split percentages as it varies by dataset.
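For example, if Step 1 produced 200 sentence clips and you use a 15% evaluation split, roughly 170 clips would be used to train the model and 30 would be held back to evaluate it.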

  • Default Setting: The default evaluation percentage is set at 15%, which is a balanced choice for most datasets.

  • Adjustable Range: Users can adjust this percentage, but it's generally recommended to keep it between 5% and 30%.

    • Lower Bound: A minimum of 5% ensures that there's enough data to evaluate model performance.
    • Upper Bound: It's suggested not to exceed 30% for evaluation, to avoid limiting the amount of data available for training.
  • Understanding the Impact: Before adjusting this setting, it's important to understand its impact on model training and evaluation. Incorrect adjustments can lead to suboptimal model performance.

  • Gradual Adjustments: For those unfamiliar with the process, we recommend reading up on training data and training sets, then making small, incremental changes and observing their effects.

  • Data Quality: Regardless of the split, the quality of the audio data is paramount. Ensure that your datasets are built from good quality audio with enough data within them.

⚫ Using a Finetuned model in Text-generation-webui

At the end of the finetune process, you will have an option to "Compact and move model to /trainedmodel/". This will compact the raw training file and move it to /model/trainedmodel/. When AllTalk starts up within Text-generation-webui, if it finds a model in this location, a new loader will appear in the interface for XTTSv2 FT and you can use this to load your finetuned model.

Be careful not to train a new model from the base model and then overwrite your current /model/trainedmodel/ if you want a separately trained model. This is why there is an OPTION B, to move your just-trained model to /models/lastfinetuned/.

⚫ Training one model with multiple voices

At the end of the finetune process, you will have an option to "Compact and move model to /trainedmodel/". This will compact the raw training file and move it to /model/trainedmodel/. This model will become available when you start up finetuning, where you will have a choice to train either the Base Model or the Existing finetuned model (the one in /model/trainedmodel/). You can use this to keep further training this model with additional voices, then copying it back to /model/trainedmodel/ at the end of training.

⚫ Do I need to keep the raw training data/model?

If you've compacted and moved your model, it's highly unlikely you would want to keep that data; however, the choice is there to keep it if you wish. It will be between 5-10GB in size, so most people will want to delete it.

⚫ I have deeper questions about training the XTTS model, where can I find more information?

If you have deeper questions about the XTTS model, its capabilities, the training process etc., or anything that's not covered within the above text or the interface of finetune.py, please use the following links to research Coqui's documentation on the XTTS model.


⬜ AllTalk TTS Generator

AllTalk TTS Generator is the solution for converting large volumes of text into speech using the voice of your choice. Whether you're creating audio content or just want to hear text read aloud, the TTS Generator is equipped to handle it all efficiently. Please see here for a quick demo

The link to open the TTS generator can be found on the built-in Settings and Documentation page.

DeepSpeed is highly recommended to speed up generation. Low VRAM is best turned off and your LLM model unloaded from your GPU VRAM (unload your model). No Playback will reduce memory overhead on very large generations (15,000 words or more). Splitting Export to Wav into smaller groups will also reduce memory overhead at the point of exporting your wav files (so it is good for low memory systems).

⬜ Estimated Throughput

This will vary by system for a multitude of reasons, however, while generating a 58,000 word document to TTS, with DeepSpeed enabled, LowVram disabled, splitting size 2 and on an Nvidia RTX 4070, throughput was around 1,000 words per minute. Meaning, this took 1 hour to generate the TTS. Exporting to combined wavs took about 2-3 minutes total.

⬜ Quick Start

  • Text Input: Enter the text you wish to convert into speech in the 'Text Input' box.
  • Generate TTS: Hit this to start the text-to-speech conversion.
  • Pause/Resume: Used to pause and resume the playback of the initial generation of wavs or the stream.
  • Stop Playback: This will stop the current audio playing back. It does not stop the text from being generated, however. Once you have sent text off to be generated, either as a stream or as wav file generation, the TTS server will remain busy until this process has completed. As such, think carefully about how much you want to send to the server. If you are generating wav files and populating the queue, you can generate one lot of text to speech, then input your next lot of text and it will continue adding to the list.

⬜ Customization and Preferences

  • Character Voice: Choose the voice that will read your text.
  • Language: Select the language of your text.
  • Chunk Sizes: Decide the size of text chunks for generation. Smaller sizes are recommended for better TTS quality.

⬜ Interface and Accessibility

  • Dark/Light Mode: Switch between themes for your visual comfort.
  • Word Count and Generation Queue: Keep track of the word count and the generation progress.

⬜ TTS Generation Modes

  • Wav Chunks: Perfect for creating audio books, or anything you want to keep long term. Breaks down your text into manageable wav files and queues them up. Generation begins automatically, and playback will start after a few chunks have been prepared ahead. You can set the volume to 0 if you don't want to hear playback. With Wav chunks, you can edit and/or regenerate portions of the TTS as needed.
  • Streaming: For immediate playback without the ability to save. Ideal for on-the-fly speech generation and listening. This will not generate wav files and it will play back through your browser. You cannot stop the server generating the TTS once it has been sent.

    With wav chunks you can either play back "In Browser", which is the web page you are on, or "On Server", which is through the console/terminal where AllTalk is running from, or choose "No Playback". Only generation "In Browser" can play back smoothly and populate the Generated TTS List. Setting the Volume will affect the volume level played back both "In Browser" and "On Server".

    For generating large amounts of TTS, it's recommended to select the No Playback option. This setting minimizes the memory usage in your web browser by avoiding the loading and playing of audio files directly within the browser, which is particularly beneficial for handling extensive audio generations. The definition of large will vary depending on your system RAM availability (will update when I have more information as to guidelines). Once the audio is generated, you can export your list to JSON (for safety) and use the Play List option to play back your audio.

⬜ Playback and List Management

  • Playback Controls: Utilize 'Play List' to start from the beginning or 'Stop Playback' to halt at any time.
  • Custom Start: Jump into your list at a specific ID to hear a particular section.
  • Regeneration and Editing: If a chunk isn't quite right, you can opt to regenerate it or edit the text directly. Click off the text to save changes and hit regenerate for the specific line.
  • Export/Import List: Save your TTS list as a JSON file or import one. Note: existing wav files are needed for playback. Exporting is handy if you want to take your files away into another program and have a list of which wav is which, or if you keep your audio files but want to come back at a later date, edit one or two lines, regenerate the speech and re-combine the wavs into one new long wav.

⬜ Exporting Your Audio

  • Export to WAV: Combine all generated TTS from the list into one single WAV file for easy download and distribution. It's always recommended to export your list to a JSON before exporting, so that you have a backup should something go wrong. You can simply re-import the list and try exporting again.

    When exporting, there is a file size limit of 1GB and as such you have the option to choose how many files to include in each block of audio exported. 600 is just on the limit of 1GB, depending on the average file size, so 500 or less is a good amount to work with. You can combine the generated files after if you wish, in Audacity or similar.

    Additionally, lower export batches will lower the memory requirements, so if your system is low on memory (maybe 8 or 16GB system), you can use smaller export batches to keep the memory requirement down.

⬜ Exporting Subtitles (SRT file)

  • Export SRT: This will scan through all wav files in your list and generate a subtitles file that will match your exported wav file.

⬜ Analyzing generated TTS for errors

  • Analyze TTS: This will scan through all wav files, comparing each ID's original text with the TTS generated for that ID, and then flag up inconsistencies. It's important to understand this is a best-effort process and not 100% perfect. For example:

    • Your text may have the word their and the automated routine that listens to your generated TTS interprets the word as there, aka a spelling difference.
    • Your text may have Examples are: (note the colon) and the automated routine that listens to your generated TTS interprets it as "Examples are" (note no colon, as you cannot sound out a colon in TTS), aka a punctuation difference.
    • Your text may have There are 100 items and the automated routine that listens to your generated TTS interprets the word as There are one hundred items, aka numbers vs the number written out in words.
    • There will be other examples such as double quotes. As I say, please remember this is a best effort to help you identify issues.

As such, there is a % Accuracy setting. This uses a couple of methods to try to find things that are similar. For example, taking the their and there example from above, it would identify that they both sound the same, so even if the text says their and the AI listening to the generated TTS interprets the word as there, it will realise that both sound the same/are similar, so there is no need to flag that as an error. However, there are limits to this, and some things may slip through or get picked up when you would prefer them not to be flagged.

The higher the accuracy you choose, the more things it will flag up; however, you may get more unwanted detections. The lower the accuracy, the fewer detections. Based on my few tests, accuracy settings between 96 and 98 seem to generally give the best results. However, I would highly recommend you test out a small 10-20 line text with the Analyze TTS button to get a feel for how it responds to different settings, as well as the things it flags up.

You will be able to see the IDs and text (original and as interpreted) by looking at the terminal/command prompt window.

The Analyze TTS feature uses the Whisper Large-v2 AI model, which will download on first use if necessary. This will require about 2.5GB of disk space and could take a few minutes to download, depending on your internet connection.

You can use this feature on systems that do not have an Nvidia GPU, however, unless you have a very powerful CPU, expect it to be slow.

⬜ Tricks to get the model to say things correctly

Sometimes the AI model won't say something the way that you want it to. It could be because it's a new word, an acronym, or just something it's not good at for whatever reason. There are some tricks you can use to improve the chances of it saying something correctly.

Adding pauses
You can use semi-colons ";" and colons ":" to create a pause, similar to a period "." which can be helpful with some splitting issues.

Acronyms
Not all acronyms are going to be pronounced correctly. Let's work with the word ChatGPT. We know it is pronounced "Chat G P T", but when presented to the model, it doesn't know how to break it down correctly. So, there are a few ways we could get it to break out "Chat" and the G, P and T, e.g.

Chat G P T. Chat G,P,T. Chat G.P.T. Chat G-P-T. Chat gee pee tea

All bar the last one are using ways within the English language to split out "Chat" into one word being pronounced and then split the G, P and T into individual letters. The final example, which is to use phonetics, will sound perfectly fine but clearly would look wrong as far as human-readable text goes. The phonetics method is very useful in edge cases where pronunciation is difficult.

⬜ Notes on Usage

  • For seamless TTS generation, it's advised to keep text chunks under 250 characters, which you can control with the Chunk sizes.
  • Generated audio can be played back from the list, which also highlights the currently playing chunk.
  • The TTS Generator remembers your settings, so you can pick up where you left off even after refreshing the page.

🟠 API Suite and JSON-CURL

🟠Overview

The Text-to-Speech (TTS) Generation API allows you to generate speech from text input using various configuration options. This API supports both character and narrator voices, providing flexibility for creating dynamic and engaging audio content.

🟠 Ready Endpoint

Check if the Text-to-Speech (TTS) service is ready to accept requests.

  • URL: http://127.0.0.1:7851/api/ready
    - Method: GET

    curl -X GET "http://127.0.0.1:7851/api/ready"

    Response: Ready
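If you are scripting against the API, this endpoint is handy for waiting out the subprocess start-up described in the Help section. A minimal Python sketch (using the requests library and the default address) might look like:

import time
import requests

def wait_until_ready(base_url="http://127.0.0.1:7851", timeout=120):
    # Poll /api/ready every 2 seconds until the server replies "Ready" or we give up.
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            if requests.get(f"{base_url}/api/ready", timeout=5).text.strip() == "Ready":
                return True
        except requests.RequestException:
            pass  # server not accepting connections yet
        time.sleep(2)
    return False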

🟠 Voices List Endpoint

Retrieve a list of available voices for generating speech.

  • URL: http://127.0.0.1:7851/api/voices
    - Method: GET

    curl -X GET "http://127.0.0.1:7851/api/voices"

    JSON return: {"voices": ["voice1.wav", "voice2.wav", "voice3.wav"]}

🟠 Current Settings Endpoint

Retrieve the current settings and status of the TTS server.

  • URL: http://127.0.0.1:7851/api/currentsettings
    - Method: GET

    curl -X GET "http://127.0.0.1:7851/api/currentsettings"

    JSON return: {"models_available":[{"name":"Coqui","model_name":"API TTS"},{"name":"Coqui","model_name":"API Local"},{"name":"Coqui","model_name":"XTTSv2 Local"}],"current_model_loaded":"XTTSv2 Local","deepspeed_available":true,"deepspeed_status":true,"low_vram_status":true,"finetuned_model":false}

    name & model_name = listing the currently available models.
    current_model_loaded = what model is currently loaded into VRAM.
    deepspeed_available = was DeepSpeed detected on startup and available to be activated.
    deepspeed_status = If DeepSpeed was detected, is it currently activated.
    low_vram_status = Is Low VRAM currently enabled.
    finetuned_model = Was a finetuned model detected. (XTTSv2 FT).
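As an example of how these fields can drive a script, here is a small Python sketch (requests library, default address assumed) that reads the current settings and enables DeepSpeed via the endpoint described further down, but only if it was detected and is not already active:

import requests

BASE = "http://127.0.0.1:7851"
settings = requests.get(f"{BASE}/api/currentsettings").json()
print("Currently loaded model:", settings["current_model_loaded"])

# Only toggle DeepSpeed on if it was detected at startup and is not already active.
if settings["deepspeed_available"] and not settings["deepspeed_status"]:
    r = requests.post(f"{BASE}/api/deepspeed", params={"new_deepspeed_value": "True"})
    print(r.json())  # expected: {"status": "deepspeed-success"}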

🟠 Preview Voice Endpoint

Generate a preview of a specified voice with hardcoded settings.

  • URL: http://127.0.0.1:7851/api/previewvoice/
    - Method: POST
    - Content-Type: application/x-www-form-urlencoded

    curl -X POST "http://127.0.0.1:7851/api/previewvoice/" -F "voice=female_01.wav"

    Replace female_01.wav with the name of the voice sample you want to hear.

    JSON return: {"status": "generate-success", "output_file_path": "/path/to/outputs/api_preview_voice.wav", "output_file_url": "http://127.0.0.1:7851/audio/api_preview_voice.wav"}

🟠 Switching Model Endpoint

  • URL: http://127.0.0.1:7851/api/reload
    - Method: POST

    curl -X POST "http://127.0.0.1:7851/api/reload?tts_method=API%20Local"
    curl -X POST "http://127.0.0.1:7851/api/reload?tts_method=API%20TTS"
    curl -X POST "http://127.0.0.1:7851/api/reload?tts_method=XTTSv2%20Local"

    Switch between the 3 models respectively.

    curl -X POST "http://127.0.0.1:7851/api/reload?tts_method=XTTSv2%20FT"

    If you have a finetuned model in /models/trainedmodel/ (will error otherwise)

    JSON return {"status": "model-success"}

🟠 Switch DeepSpeed Endpoint

  • URL: http://127.0.0.1:7851/api/deepspeed
    - Method: POST

    curl -X POST "http://127.0.0.1:7851/api/deepspeed?new_deepspeed_value=True"

    Replace True with False to disable DeepSpeed mode.

    JSON return {"status": "deepspeed-success"}

🟠 Switching Low VRAM Endpoint

  • URL: http://127.0.0.1:7851/api/lowvramsetting
    - Method: POST

    curl -X POST "http://127.0.0.1:7851/api/lowvramsetting?new_low_vram_value=True"

    Replace True with False to disable Low VRAM mode.

    JSON return {"status": "lowvram-success"}

🟠 TTS Generation Endpoint (Standard Generation)

Streaming endpoint details are further down the page.

  • URL: http://127.0.0.1:7851/api/tts-generate
    - Method: POST
    - Content-Type: application/x-www-form-urlencoded

🟠 Example command lines (Standard Generation)

Standard TTS generation supports narration and will generate a wav file/blob. Standard TTS speech example (standard text), generating a time-stamped file:

curl -X POST "http://127.0.0.1:7851/api/tts-generate" -d "text_input=All of this is text spoken by the character. This is text not inside quotes, though that doesn't matter in the slightest" -d "text_filtering=standard" -d "character_voice_gen=female_01.wav" -d "narrator_enabled=false" -d "narrator_voice_gen=male_01.wav" -d "text_not_inside=character" -d "language=en" -d "output_file_name=myoutputfile" -d "output_file_timestamp=true" -d "autoplay=true" -d "autoplay_volume=0.8"

Narrator example (standard text), generating a time-stamped file:

curl -X POST "http://127.0.0.1:7851/api/tts-generate" -d "text_input=*This is text spoken by the narrator* \"This is text spoken by the character\". This is text not inside quotes." -d "text_filtering=standard" -d "character_voice_gen=female_01.wav" -d "narrator_enabled=true" -d "narrator_voice_gen=male_01.wav" -d "text_not_inside=character" -d "language=en" -d "output_file_name=myoutputfile" -d "output_file_timestamp=true" -d "autoplay=true" -d "autoplay_volume=0.8"

Note that if the text to be generated contains double quotes, you will need to escape them with \" (please see the narrator example above).
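The same request can be sent from Python; a minimal sketch using the requests library (default address, narrator disabled) would be:

import requests

payload = {
    "text_input": "All of this is text spoken by the character.",
    "text_filtering": "standard",
    "character_voice_gen": "female_01.wav",
    "narrator_enabled": "false",
    "narrator_voice_gen": "male_01.wav",
    "text_not_inside": "character",
    "language": "en",
    "output_file_name": "myoutputfile",
    "output_file_timestamp": "true",
    "autoplay": "false",
    "autoplay_volume": "0.8",
}
response = requests.post("http://127.0.0.1:7851/api/tts-generate", data=payload).json()
print(response["output_file_url"])  # URL for browser playback of the generated wav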

🟠 Request Parameters

🟠 text_input: The text you want the TTS engine to produce. Use escaped double quotes for character speech and asterisks for narrator speech if using the narrator function. Example:

-d "text_input=*This is text spoken by the narrator* \"This is text spoken by the character\". This is text not inside quotes."

🟠 text_filtering: Filter for text. Options:

  • none No filtering. Whatever is sent will go over to the TTS engine as raw text, which may result in some odd sounds with some special characters.
  • standard Human-readable text and a basic level of filtering, just to clean up some special characters.
  • html HTML content, where you are using HTML entities like &quot;

-d "text_filtering=none"
-d "text_filtering=standard"
-d "text_filtering=html"

Example:

  • Standard Example: *This is text spoken by the narrator* "This is text spoken by the character" This is text not inside quotes.
  • HTML Example: *This is text spoken by the narrator* &quot;This is text spoken by the character&quot; This is text not inside quotes.
  • None: Will just pass through whatever characters/text you send.

🟠 character_voice_gen: The WAV file name for the character's voice.

-d "character_voice_gen=female_01.wav"

🟠 narrator_enabled: Enable or disable the narrator function. If true, minimum text filtering is set to standard. Anything between double quotes is considered the character's speech, and anything between asterisks is considered the narrator's speech.

-d "narrator_enabled=true"
-d "narrator_enabled=false"

🟠 narrator_voice_gen: The WAV file name for the narrator's voice.

-d "narrator_voice_gen=male_01.wav"

🟠 text_not_inside: Specify the handling of lines not inside double quotes or asterisks, for the narrator feature. Options:

  • character: Treat as character speech.
  • narrator: Treat as narrator speech.

-d "text_not_inside=character"
-d "text_not_inside=narrator"

🟠 language: Choose the language for TTS. Options:

ar Arabic
zh-cn Chinese (Simplified)
cs Czech
nl Dutch
en English
fr French
de German
hi Hindi (Please see this re Hindi support, which is very limited #178)
hu Hungarian
it Italian
ja Japanese
ko Korean
pl Polish
pt Portuguese
ru Russian
es Spanish
tr Turkish

-d "language=en"

🟠 output_file_name: The name of the output file (excluding the .wav extension).

-d "output_file_name=myoutputfile"

🟠 output_file_timestamp: Add a timestamp to the output file name. If true, each file will have a unique timestamp; otherwise, the same file name will be overwritten each time you generate TTS.

-d "output_file_timestamp=true"
-d "output_file_timestamp=false"

🟠 autoplay: Enable or disable playing the generated TTS to your standard sound output device at time of TTS generation.

-d "autoplay=true"
-d "autoplay=false"

🟠 autoplay_volume: Set the autoplay volume. Should be between 0.1 and 1.0. Needs to be specified in the JSON request even if autoplay is false.

-d "autoplay_volume=0.8"

🟠 TTS Generation Response

The API returns a JSON object with the following properties:

  • status Indicates whether the generation was successful (generate-success) or failed (generate-failure).
  • output_file_path The on-disk location of the generated WAV file.
  • output_file_url The HTTP location for accessing the generated WAV file for browser playback.
  • output_cache_url The HTTP location for accessing the generated WAV file as a pushed download.

Example JSON TTS Generation Response:

{"status":"generate-success","output_file_path":"C:\\text-generation-webui\\extensions\\alltalk_tts\\outputs\\myoutputfile_1704141936.wav","output_file_url":"http://127.0.0.1:7851/audio/myoutputfile_1704141936.wav","output_cache_url":"http://127.0.0.1:7851/audiocache/myoutputfile_1704141936.wav"}

🟠 TTS Generation Endpoint (Streaming Generation)

Streaming TTS generation does NOT support Narration and will generate an audio stream. Streaming TTS speech JavaScript Example:

  • URL: http://localhost:7851/api/tts-generate-streaming
    - Method: POST
    - Content-Type: application/x-www-form-urlencoded

// Example parameters
const text = "Here is some text";
const voice = "female_01.wav";
const language = "en";
const outputFile = "stream_output.wav";
// Encode the text for URL
const encodedText = encodeURIComponent(text);
// Create the streaming URL
const streamingUrl = `http://localhost:7851/api/tts-generate-streaming?text=${encodedText}&voice=${voice}&language=${language}&output_file=${outputFile}`;
// Create and play the audio element
const audioElement = new Audio(streamingUrl);
audioElement.play(); // Play the audio stream directly
  • Text (text): This is the actual text you want to convert to speech. It should be a string and must be URL-encoded to ensure that special characters (like spaces and punctuation) are correctly transmitted in the URL. Example: Hello World becomes Hello%20World when URL-encoded.
  • Voice (voice): This parameter specifies the voice type to be used for the TTS. The value should match one of the available voice options in AllTalk's voices folder. This is a string representing the file, like female_01.wav.
  • Language (language): This setting determines the language in which the text should be spoken. A two-letter language code (like en for English, fr for French, etc.).
  • Output File (output_file): This parameter names the output file where the audio will be streamed. It should be a string representing the file name, such as stream_output.wav. AllTalk will not save this as a file in its outputs folder.
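If you are not playing the stream in a browser, a minimal Python sketch (requests library, following the documented POST method and parameters) that saves the incoming stream to disk would be:

import requests

params = {
    "text": "Here is some text",       # requests URL-encodes parameter values for you
    "voice": "female_01.wav",
    "language": "en",
    "output_file": "stream_output.wav",
}
with requests.post("http://localhost:7851/api/tts-generate-streaming",
                   params=params, stream=True) as r:
    r.raise_for_status()
    with open("stream_output.wav", "wb") as f:
        for chunk in r.iter_content(chunk_size=8192):
            f.write(chunk)  # write audio chunks as they arrive instead of buffering everything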

πŸ”΄ Future to-do list

  • I am maintaining a list of things people request here
  • Possibly add some additional TTS engines (TBD).
  • Have a break!

alltalk_tts's People

Contributors

arthurwolf, danielwburch, dpkirchner, erew123, johnbenac, josh-xt, nicobubulle, q5sys, rbruels, s95sedan, ytt246


alltalk_tts's Issues

float16 error in finetune.py

Attempting to run finetune.py directly (CLI in the text-gen environment, etc., via python finetune.py), it starts fine, but upon running step 1, regardless of Whisper model (v2 or v3, etc.), I get the following traceback (entire CLI dump):

Running on local URL:  http://127.0.0.1:7052

To create a public link, set `share=True` in `launch()`.
[FINETUNE] Part of AllTalk https://github.com/erew123/alltalk_tts/
[FINETUNE] Coqui Public Model License
[FINETUNE] https://coqui.ai/cpml.txt
[FINETUNE] οΏ½[94mStarting Step 1οΏ½[0m - Preparing Audio/Generating the dataset
[FINETUNE] Updated lang.txt with the target language.
[FINETUNE] Loading Whisper Model: large-v2
Traceback (most recent call last):
  File "C:\MODELS\text-generation-webui\extensions\alltalk_tts\finetune.py", line 987, in preprocess_dataset
    train_meta, eval_meta, audio_total_size = format_audio_list(target_language=language, whisper_model=whisper_model ,out_path=out_path, gradio_progress=progress)
                                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\MODELS\text-generation-webui\extensions\alltalk_tts\finetune.py", line 112, in format_audio_list
    asr_model = WhisperModel(whisper_model, device=device, compute_type="float16")
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\MODELS\text-generation-webui\installer_files\env\Lib\site-packages\faster_whisper\transcribe.py", line 130, in __init__
    self.model = ctranslate2.models.Whisper(
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: Requested float16 compute type, but the target device or backend do not support efficient float16 computation.

This was running a week ago, but now I get this error and I'm not sure why. I've deleted and recreated alltalk_tts via a fresh git clone, double-checked the requirements installed, fully updated text-gen AND then reran the alltalk requirements (to replace pandas, etc.), and the error persists. nvcc --version shows Cuda compilation tools, release 11.8, V11.8.89. I've removed and redownloaded the faster-whisper models themselves from the huggingface hub cache, etc. The problem persists if I run through the text-gen UI as well.

I feel like I'm overlooking something simple but I'm not sure what more to try.

diagnostics.log is attached.

Finetuning: Processed sentence clips are truncated sometimes

This is not a terribly helpful issue yet, but I wanted to open it for input/tracking.

When running finetune.py, the script processes the sound clips through Whisper to get the text transcript and timecodes of the words within them. It then separates the sentences out (by punctuation) and then saves the individual sentences as separate audio clips.

This generally works (and is rather magical), but I'm noticing that a lot of those clips are truncated, and almost always at the end of the clip. It's usually the last syllable or so of the audio clip (though the transcription of the clip in metadata_eval.csv is full and correct).

In my first totally unscientific experiment, I think this is affecting my final output (a lot of my TTS sentences seem to get weirdly truncated) but I haven't completely verified that.

I have read a number of filed issues saying that Whisper can be inaccurate with its timestamps (especially at the sub-second level), so it's possible that the times for the word start and word end aren't accurate, which would then propagate to this clipping functionality:

word_end = min((word.end + next_word_start) / 2, word.end + buffer)

Will add more to this issue as I discover more.

Unable to install, it tries to build something which fails due to missing 'basetsd.h'

Please generate a diagnostics report and upload the "diagnostics.log".

This is what the diagnostic returns:

Error importing module: No module named 'importlib_metadata'

Please ensure you started the Text-generation-webUI Python environment with either
cmd_linux.sh, cmd_windows.bat, cmd_macos.sh, or cmd_wsl.bat
from the text-generation-webui directory, and then try running the diagnostics again.

   There was an error running diagnostics. Have you correctly started your
   Text-generation-webui Python environment with cmd_windows.bat?

Describe the bug
Trying to install this. First it crashed because I didn't have VS build tools, so I installed the version it said. But now it crashes and I can't figure out how to fix it. Why does it even need to build something? Thanks in advance for any help!

To Reproduce
I just installed following QUICK SETUP - Text-Generation-webui for windows


Text/logs

"C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\BIN\x86_amd64\cl.exe" /c /nologo /O2 /W3 /GL /DNDEBUG /MD -IC:\Users\User\AppData\Local\Temp\pip-build-env-dc_xmk8t\overlay\Lib\site-packages\numpy\core\include -IC:\Users\User\AppData\Local\Programs\Python\Python310\include -IC:\Users\User\AppData\Local\Programs\Python\Python310\Include "-IC:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\INCLUDE" "-IC:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\ATLMFC\INCLUDE" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt" /TcTTS/tts/utils/monotonic_align/core.c /Fobuild\temp.win-amd64-cpython-310\Release\TTS/tts/utils/monotonic_align/core.obj
    core.c
    c:\users\user\appdata\local\programs\python\python310\include\pyconfig.h(200): fatal error C1083: Cannot open include file: 'basetsd.h': No such file or directory
    error: command 'C:\\Program Files (x86)\\Microsoft Visual Studio 14.0\\VC\\BIN\\x86_amd64\\cl.exe' failed with exit code 2
    [end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building wheel for TTS
Failed to build TTS
ERROR: Could not build wheels for TTS, which is required to install pyproject.toml-based projects

Desktop (please complete the following information):
AllTalk was updated: [approx. date]
Custom Python environment: [yes/no give details if yes]
Text-generation-webUI was updated: Maybe 10 days ago


Hungarian characters bug

Describe the bug
I generate text in standalone mode, but it doesn't like the Hungarian ő and ű, it just skips them. For example, here is a sentence that contains all the Hungarian accented characters:
Árvíztűrő tükörfúrógép.
or this:
ÁRVÍZTŰRŐ TÜKÖRFÚRÓGÉP

That's all I see in the console:
Árvíztr tükârfúrógép
or this:
ÁRVÍZTR TÜKÖRFÚRÓGÉP
And the voice reads the latter without ő and ű

Please fix this!

start_alltalk.bat window closes instantly on boot

Hey, it's my first time writing a bug report since I don't use GitHub very much, so please bear with me, but I didn't see anyone else saying they have this issue and I really want to try this out!

Like the title says, after installing AllTalk with atsetup.bat, when I go to try and boot it, the window closes almost instantly without any warnings or error messages... I've tried to run the installer 5 times in 2 different places on my PC, but every time it's the same issue. I also tried atsetup.sh with Git Bash and got no further. I'm trying to use the Standalone version for SillyTavernAI, if that info helps.

diagnostics.log

Queue with a different instance and different voice

can we have a different instance of the program outputting to a different folder, so we can do different voices in a queue please?

the program would finish one text chunk in one voice, then the next one in the queue would do a different voice that we specify, outputting to a different folder (can be automatic) so that our outputs are not mixed. and then we would get multiple options to combine to wav.

like what HandBrake has with a batch queue process

i want to input two large texts and have 2 outputs instead of 1

thank you very much

Finetuning: AssertionError if the filename of the input sound clip in the `/finetune` folder has a number in it.

EDIT: This is not the root cause. There's something else causing it

Platform
Windows 10

Commit
50aa6de

Describe the bug
The metadata_train.csv and metadata_eval.csv files do not get populated with the filenames from /finetune/tmp-trn/wavs, preventing the training workflow from continuing.

/finetune/put-voice-samples-in-here/boy2_dirty.wav = Fail
/finetune/put-voice-samples-in-here/boya_dirty.wav = Works

To Reproduce
Steps to reproduce the behaviour:

  1. Add a number to the filename of the sample file.
  2. Run "Step 1" of the finetune trainer
  3. If "Step 2" of the finetune trainer is run, an error will occur due to the empty csvs

Screenshots
1 Screenshot 2024-01-09 173708
2 Screenshot 2024-01-09 173915

Text/logs

(test_env) P:\voice\alltalk_tts>python finetune.py
Running on local URL:  http://127.0.0.1:7052
[FINETUNE] HTTP Request: GET http://127.0.0.1:7052/startup-events "HTTP/1.1 200 OK"
[FINETUNE] HTTP Request: HEAD http://127.0.0.1:7052/ "HTTP/1.1 200 OK"

To create a public link, set `share=True` in `launch()`.
[FINETUNE] HTTP Request: GET https://api.gradio.app/pkg-version "HTTP/1.1 200 OK"
[FINETUNE] HTTP Request: GET https://checkip.amazonaws.com/ "HTTP/1.1 200 OK"
[FINETUNE] HTTP Request: GET https://checkip.amazonaws.com/ "HTTP/1.1 200 OK"
[FINETUNE] HTTP Request: POST https://api.gradio.app/gradio-initiated-analytics/ "HTTP/1.1 200 OK"
[FINETUNE] HTTP Request: POST https://api.gradio.app/gradio-launched-telemetry/ "HTTP/1.1 200 OK"
[FINETUNE] Part of AllTalk https://github.com/erew123/alltalk_tts/
[FINETUNE] Coqui Public Model License
[FINETUNE] https://coqui.ai/cpml.txt
[FINETUNE] Starting Step 1 - Preparing Audio/Generating the dataset
[FINETUNE] Updated lang.txt with the target language.
[FINETUNE] Loading Whisper Model: large-v3
[FINETUNE] Current working file: P:\voice\alltalk_tts\finetune\put-voice-samples-in-here\boy2_dirty.wav
[FINETUNE] Processing audio with duration 04:21.649
[FINETUNE] VAD filter removed 00:11.792 of audio
[FINETUNE] Train CSV: P:\voice\alltalk_tts\finetune\tmp-trn\metadata_train.csv
[FINETUNE] Eval CSV: P:\voice\alltalk_tts\finetune\tmp-trn\metadata_eval.csv
[FINETUNE] Audio Total: 261.6491458333333
[FINETUNE] Dataset Generated. Move to Step 2
[FINETUNE] Starting Step 2 - Fine-tuning the XTTS Encoder
[FINETUNE] Starting finetuning on Base Model
>> DVAE weights restored from: P:\voice\alltalk_tts\models\xttsv2_2.0.2\dvae.pth
Traceback (most recent call last):
  File "P:\voice\alltalk_tts\finetune.py", line 1091, in train_model
    config_path, original_xtts_checkpoint, vocab_file, exp_path, speaker_wav = train_gpt(language, num_epochs, batch_size, grad_acumm, train_csv, eval_csv, output_path=str(output_path), max_audio_length=max_audio_length)
  File "P:\voice\alltalk_tts\finetune.py", line 422, in train_gpt
    train_samples, eval_samples = load_tts_samples(
  File "P:\voice\alltalk_tts\test_env\lib\site-packages\TTS\tts\datasets\__init__.py", line 121, in load_tts_samples
    assert len(meta_data_train) > 0, f" [!] No training samples found in {root_path}/{meta_file_train}"
AssertionError:  [!] No training samples found in P:\voice\alltalk_tts\finetune\tmp-trn/P:\voice\alltalk_tts\finetune\tmp-trn\metadata_train.csv

ImportError: cannot import name 'SampleOutput' from 'transformers.generation.utils'

Please generate a diagnostics report and upload the "diagnostics.log".

https://github.com/erew123/alltalk_tts/tree/main?#-how-to-make-a-diagnostics-report-file
diagnostics.log

Describe the bug
I ran the shell script on a fresh new OS, and when I start the app I get this error.

To Reproduce
Steps to reproduce the behaviour:
Fresh install: run ./atsetup.sh and select standalone.

Text/logs
./start_alltalk.sh
[AllTalk Startup] Running script.py in standalone mode
[AllTalk Startup] Coqui Public Model License
[AllTalk Startup] https://coqui.ai/cpml.txt
[AllTalk Startup] Old output wav file deletion is set to disabled.
[AllTalk Startup] Checking Model is Downloaded.
[AllTalk Startup] TTS version installed: 0.21.3
[AllTalk Startup] TTS version is up to date.
[AllTalk Startup] All required files are present.
[AllTalk Startup] TTS Subprocess starting
[AllTalk Startup]
[AllTalk Startup] AllTalk Settings & Documentation: http://127.0.0.1:7851
[AllTalk Startup]
Traceback (most recent call last):
  File "/home/ai/alltalk_tts/tts_server.py", line 7, in <module>
    from TTS.tts.configs.xtts_config import XttsConfig
  File "/home/ai/alltalk_tts/alltalk_environment/env/lib/python3.11/site-packages/TTS/tts/configs/xtts_config.py", line 5, in <module>
    from TTS.tts.models.xtts import XttsArgs, XttsAudioConfig
  File "/home/ai/alltalk_tts/alltalk_environment/env/lib/python3.11/site-packages/TTS/tts/models/xtts.py", line 12, in <module>
    from TTS.tts.layers.xtts.stream_generator import init_stream_support
  File "/home/ai/alltalk_tts/alltalk_environment/env/lib/python3.11/site-packages/TTS/tts/layers/xtts/stream_generator.py", line 24, in <module>
    from transformers.generation.utils import GenerateOutput, SampleOutput, logger
ImportError: cannot import name 'SampleOutput' from 'transformers.generation.utils' (/home/ai/alltalk_tts/alltalk_environment/env/lib/python3.11/site-packages/transformers/generation/utils.py)

requirements mismatch?

text-generation-webui installed torch 2.1.2 with CUDA 121 for me. The nvidia requirements file shows it installing for torch 2.1.0 and CUDA 118. I tried changing this, but it still didn't let me boot up AllTalk, even after trying everything in the troubleshooting section.

Fine-tuning: Library cublas64_11.dll is not found or cannot be loaded

diagnostics.log
Hi! (fluecured, btw) I have attached a "diagnostics.log" for fine-tuning.

Describe the bug
After installing requirements_finetune.txt and CUDA 11.8 Toolkit as explained, updating the Windows "Path" variable, and ensuring "nvcc --version" returned the proper version globally and in the "cmd_windows.bat" text-generation-webui environment, I commenced training with a 22:10 sample. The requisite training file(s) downloaded, and step 1 began.

After some time processing a temp wav file, the process stopped with "RuntimeError: Library cublas64_11.dll is not found or cannot be loaded". Searching the system, I located one "cublas64.dll", but it is a part of the stable-diffusion-webui default installation in a separate environment. (I was surprised the dll was not in the Toolkit files since I remember selecting "CUBLAS" in the NVIDIA installer.)

To Reproduce
This may be an error of configuration, requirements, or understanding, and reproducing it might be as easy as temporarily removing that dll.

Text/logs

[FINETUNE] Current working file: F:\text-generation-webui\extensions\alltalk_tts\finetune\put-voice-samples-in-here\Voice Sample.wav
[FINETUNE] Processing audio with duration 22:10.801
[FINETUNE] VAD filter removed 00:00.000 of audio
Traceback (most recent call last):
  File "F:\text-generation-webui\extensions\alltalk_tts\finetune.py", line 731, in preprocess_dataset
    train_meta, eval_meta, audio_total_size = format_audio_list(target_language=language, whisper_model=whisper_model ,out_path=out_path, gradio_progress=progress)
                                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "F:\text-generation-webui\extensions\alltalk_tts\finetune.py", line 180, in format_audio_list
    segments = list(segments)
               ^^^^^^^^^^^^^^
  File "F:\text-generation-webui\installer_files\env\Lib\site-packages\faster_whisper\transcribe.py", line 941, in restore_speech_timestamps
    for segment in segments:
  File "F:\text-generation-webui\installer_files\env\Lib\site-packages\faster_whisper\transcribe.py", line 445, in generate_segments
    encoder_output = self.encode(segment)
                     ^^^^^^^^^^^^^^^^^^^^
  File "F:\text-generation-webui\installer_files\env\Lib\site-packages\faster_whisper\transcribe.py", line 629, in encode
    return self.model.encode(features, to_cpu=to_cpu)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Library cublas64_11.dll is not found or cannot be loaded

Desktop (please complete the following information):
AllTalk was updated: 12-25-2023
Custom Python environment: Using text-generation-webui one-click installer environment
Text-generation-webUI was updated: 12-24-2023

Additional context
Thanks, and happy holidays!

Edit: I do see a "F:\text-generation-webui\installer_files\env\Lib\site-packages\torch\lib\cublas64_12.dll" in the Oobabooga files. Could the recent update have moved ahead by a version?

Disable Coqui analytics / Cannot finetune offline

Please generate a diagnostics report and upload the "diagnostics.log".

https://github.com/erew123/alltalk_tts/tree/main?#-how-to-make-a-diagnostics-report-file

Describe the bug
Coqui trainer package uses tracking/analytics and cannot work offline.

To Reproduce

  • Deny Internet access to python
  • Run finetune step 2

Screenshots
N/A

Text/logs

Traceback (most recent call last):
  File "O:\ai\tests\alltalk\alltalk_tts\installer_files\env\Lib\site-packages\requests\adapters.py", line 486, in send
    resp = conn.urlopen(
           ^^^^^^^^^^^^^
  File "O:\ai\tests\alltalk\alltalk_tts\installer_files\env\Lib\site-packages\urllib3\connectionpool.py", line 844, in urlopen
    retries = retries.increment(
              ^^^^^^^^^^^^^^^^^^
  File "O:\ai\tests\alltalk\alltalk_tts\installer_files\env\Lib\site-packages\urllib3\util\retry.py", line 515, in increment
    raise MaxRetryError(_pool, url, reason) from reason  # type: ignore[arg-type]
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='coqui.gateway.scarf.sh', port=443): Max retries exceeded with url: /trainer/training_run (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x0000029092034490>: Failed to establish a new connection: [WinError 10013] An attempt was made to access a socket in a way forbidden by its access permissions'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "O:\ai\tests\alltalk\alltalk_tts\finetune.py", line 817, in train_model
    config_path, original_xtts_checkpoint, vocab_file, exp_path, speaker_wav = train_gpt(language, num_epochs, batch_size, grad_acumm, train_csv, eval_csv, output_path=str(output_path), max_audio_length=max_audio_length)
                                                                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "O:\ai\tests\alltalk\alltalk_tts\finetune.py", line 394, in train_gpt
    trainer = Trainer(
              ^^^^^^^^
  File "O:\ai\tests\alltalk\alltalk_tts\installer_files\env\Lib\site-packages\trainer\trainer.py", line 583, in __init__
    ping_training_run()
  File "O:\ai\tests\alltalk\alltalk_tts\installer_files\env\Lib\site-packages\trainer\analytics.py", line 12, in ping_training_run
    _ = requests.get(URL, timeout=5)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "O:\ai\tests\alltalk\alltalk_tts\installer_files\env\Lib\site-packages\requests\api.py", line 73, in get
    return request("get", url, params=params, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "O:\ai\tests\alltalk\alltalk_tts\installer_files\env\Lib\site-packages\requests\api.py", line 59, in request
    return session.request(method=method, url=url, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "O:\ai\tests\alltalk\alltalk_tts\installer_files\env\Lib\site-packages\requests\sessions.py", line 589, in request
    resp = self.send(prep, **send_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "O:\ai\tests\alltalk\alltalk_tts\installer_files\env\Lib\site-packages\requests\sessions.py", line 703, in send
    r = adapter.send(request, **kwargs)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "O:\ai\tests\alltalk\alltalk_tts\installer_files\env\Lib\site-packages\requests\adapters.py", line 519, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPSConnectionPool(host='coqui.gateway.scarf.sh', port=443): Max retries exceeded with url: /trainer/training_run (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x0000029092034490>: Failed to establish a new connection: [WinError 10013] An attempt was made to access a socket in a way forbidden by its access permissions'))

Desktop (please complete the following information):
AllTalk was updated: 2023-12-24 da04454
Custom Python environment: Yes, Miniconda environment made with Oobabooga start_windows.bat, Python 3.11.
Text-generation-webUI was updated: N/A

Additional context
It's possible to start the training by commenting out this line, but there is probably a better/cleaner way to get rid of the analytics call.
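For anyone else hitting this offline, a minimal workaround sketch, assuming only what the traceback above shows (trainer/trainer.py calling ping_training_run from trainer/analytics.py) — run this before the Trainer is constructed:

import trainer.analytics
import trainer.trainer

# Replace the analytics ping with a no-op so offline runs don't raise
# ConnectionError. trainer.py may have imported the name into its own
# namespace, so patch it in both places to be safe.
trainer.analytics.ping_training_run = lambda: None
trainer.trainer.ping_training_run = lambda: None

This avoids editing the installed package, but it is a guess at a cleaner shape, not something AllTalk ships.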

Deepspeed on standalone app?

This may not be possible, since the docs only reference how to do it with text-generation-webui.
I already have libaio-devel installed (I'm on Rocky Linux).
I've got alltalk_tts installed and it works just like it's supposed to. I'm now attempting to test it out with DeepSpeed.

Here's how I attempted to get it running on the standalone app.

  1. I navigate to the directory that alltalk_tts is in.
  2. I activate my conda environment.
  3. I run pip install deepspeed
  4. I get the following error:
(alltalkenv) [q5@apollo alltalk_tts]$ pip install deepspeed
Collecting deepspeed
  Downloading deepspeed-0.12.6.tar.gz (1.2 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.2/1.2 MB 10.8 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... error
  error: subprocess-exited-with-error
  
  × python setup.py egg_info did not run successfully.
  │ exit code: 1
  ╰─> [9 lines of output]
      Traceback (most recent call last):
        File "<string>", line 2, in <module>
        File "<pip-setuptools-caller>", line 34, in <module>
        File "/tmp/pip-install-khqhn5sx/deepspeed_93b84e6a32f94f3495d451600d2ab13b/setup.py", line 100, in <module>
          cuda_major_ver, cuda_minor_ver = installed_cuda_version()
                                           ^^^^^^^^^^^^^^^^^^^^^^^^
        File "/tmp/pip-install-khqhn5sx/deepspeed_93b84e6a32f94f3495d451600d2ab13b/op_builder/builder.py", line 50, in installed_cuda_version
          raise MissingCUDAException("CUDA_HOME does not exist, unable to compile CUDA op(s)")
      op_builder.builder.MissingCUDAException: CUDA_HOME does not exist, unable to compile CUDA op(s)
      [end of output]
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

note: This is an issue with the package mentioned above, not pip.

You mentioned that textgen-webui overwrites the env variables, but since I'm using your native UI, will I still have that problem?
Since alltalk_tts works properly, I wouldn't think there's an issue with the env variables for CUDA.

Am I missing something obvious?
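The pip error itself is DeepSpeed failing to find a CUDA toolkit at build time rather than anything AllTalk-specific. A hedged sketch of the usual check/fix, with an illustrative toolkit path (adjust to wherever your CUDA toolkit actually lives):

import os
import subprocess
import sys

# DeepSpeed compiles CUDA ops during install and needs CUDA_HOME to point at a
# full toolkit (including nvcc); conda envs often ship only runtime libraries.
os.environ["CUDA_HOME"] = "/usr/local/cuda"  # assumption: toolkit installed here
os.environ["PATH"] = os.environ["CUDA_HOME"] + "/bin:" + os.environ["PATH"]

subprocess.run(["nvcc", "--version"], check=True)  # verify nvcc is reachable first
subprocess.run([sys.executable, "-m", "pip", "install", "deepspeed"], check=True)

AllTalk running fine on CUDA only proves the runtime libraries are present; building DeepSpeed additionally needs the compiler, which is why CUDA_HOME matters here.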

finetuning Step 2 missing file mel_stats.pth

When running Step 2 of the finetuning it errors saying "mel_stats.pth" is missing:
Step 1 ran fine though.

[FINETUNE] Starting Step 2 - Fine-tuning the XTTS Encoder
Traceback (most recent call last):
  File "E:\oobabooga\text-generation-webui-snapshot-2023-11-19\extensions\alltalk_tts\finetune.py", line 817, in train_model
    config_path, original_xtts_checkpoint, vocab_file, exp_path, speaker_wav = train_gpt(language, num_epochs, batch_size, grad_acumm, train_csv, eval_csv, output_path=str(output_path), max_audio_length=max_audio_length)
                                                                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "E:\oobabooga\text-generation-webui-snapshot-2023-11-19\extensions\alltalk_tts\finetune.py", line 384, in train_gpt
    model = GPTTrainer.init_from_config(config)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "E:\oobabooga\text-generation-webui-snapshot-2023-11-19\installer_files\env\Lib\site-packages\TTS\tts\layers\xtts\trainer\gpt_trainer.py", line 500, in init_from_config
    return GPTTrainer(config)
           ^^^^^^^^^^^^^^^^^^
  File "E:\oobabooga\text-generation-webui-snapshot-2023-11-19\installer_files\env\Lib\site-packages\TTS\tts\layers\xtts\trainer\gpt_trainer.py", line 88, in __init__
    self.xtts.mel_stats = load_fsspec(self.args.mel_norm_file)
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "E:\oobabooga\text-generation-webui-snapshot-2023-11-19\installer_files\env\Lib\site-packages\TTS\utils\io.py", line 46, in load_fsspec
    with fsspec.open(
  File "E:\oobabooga\text-generation-webui-snapshot-2023-11-19\installer_files\env\Lib\site-packages\fsspec\core.py", line 100, in __enter__
    f = self.fs.open(self.path, mode=mode)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "E:\oobabooga\text-generation-webui-snapshot-2023-11-19\installer_files\env\Lib\site-packages\fsspec\implementations\cached.py", line 417, in <lambda>
    return lambda *args, **kw: getattr(type(self), item).__get__(self)(
                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "E:\oobabooga\text-generation-webui-snapshot-2023-11-19\installer_files\env\Lib\site-packages\fsspec\spec.py", line 1307, in open
    f = self._open(
        ^^^^^^^^^^^
  File "E:\oobabooga\text-generation-webui-snapshot-2023-11-19\installer_files\env\Lib\site-packages\fsspec\implementations\cached.py", line 417, in <lambda>
    return lambda *args, **kw: getattr(type(self), item).__get__(self)(
                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "E:\oobabooga\text-generation-webui-snapshot-2023-11-19\installer_files\env\Lib\site-packages\fsspec\implementations\cached.py", line 646, in _open
    fn = self._make_local_details(path)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "E:\oobabooga\text-generation-webui-snapshot-2023-11-19\installer_files\env\Lib\site-packages\fsspec\implementations\cached.py", line 417, in <lambda>
    return lambda *args, **kw: getattr(type(self), item).__get__(self)(
                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "E:\oobabooga\text-generation-webui-snapshot-2023-11-19\installer_files\env\Lib\site-packages\fsspec\implementations\cached.py", line 570, in _make_local_details
    "uid": self.fs.ukey(path),
           ^^^^^^^^^^^^^^^^^^
  File "E:\oobabooga\text-generation-webui-snapshot-2023-11-19\installer_files\env\Lib\site-packages\fsspec\spec.py", line 1346, in ukey
    return sha256(str(self.info(path)).encode()).hexdigest()
                      ^^^^^^^^^^^^^^^
  File "E:\oobabooga\text-generation-webui-snapshot-2023-11-19\installer_files\env\Lib\site-packages\fsspec\implementations\local.py", line 83, in info
    out = os.stat(path, follow_symlinks=False)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
FileNotFoundError: [WinError 2] The system cannot find the file specified: 'E:/oobabooga/text-generation-webui-snapshot-2023-11-19/extensions/alltalk_tts/models/xttsv2_2.0.2/mel_stats.pth'

Permit the models to specify the voices to be used.

Having multiple characters speaking would currently require manual changing of the voice before generation.

Allow [tags] that the model can be made aware of to select the next voice, e.g. [Fry.wav] Stephen: Because this would be pretty cool.
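A rough sketch of how such tags could be parsed, purely illustrative (nothing like this exists in AllTalk today; the tag format just follows the example above):

import re

def split_by_voice_tags(text: str, default_voice: str = "narrator.wav"):
    # Split generated text into (voice, segment) pairs on [something.wav] markers.
    segments, voice, pos = [], default_voice, 0
    for m in re.finditer(r"\[([^\]]+\.wav)\]", text):
        if m.start() > pos:
            segments.append((voice, text[pos:m.start()].strip()))
        voice, pos = m.group(1), m.end()
    if pos < len(text):
        segments.append((voice, text[pos:].strip()))
    return [(v, t) for v, t in segments if t]

# split_by_voice_tags("[Fry.wav] Stephen: Because this would be pretty cool.")
# -> [("Fry.wav", "Stephen: Because this would be pretty cool.")]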

Unable to get guide to work on EndeavourOS, ModuleNotFoundError: No module named 'lazy_loader'

diagnostics.log
Please generate a diagnostics report and upload the "diagnostics.log".

https://github.com/erew123/alltalk_tts/tree/main?#-how-to-make-a-diagnostics-report-file

Describe the bug
After attempting to install (before the DeepSpeed steps, so that package is not installed yet), with a fresh text-generation-webui, I cannot get this extension to load.

To Reproduce
Attempt to install text-generation-webui from a fresh git clone today, followed by a fresh clone of this new repo.

Text/logs

21:56:24-932363 INFO     Saved /home/korodarn/Apps/text-generation-webui/settings.yaml.                                                                                                                                                                                                                                                                                                     
Closing server running on port: 7860
21:56:34-009812 INFO     Loading the extension "gallery"                                                                                                                                                                                                                                                                                                                                    
21:56:34-010813 INFO     Loading the extension "alltalk_tts"                                                                                                                                                                                                                                                                                                                                
21:56:34-017786 ERROR    Failed to load the extension "alltalk_tts".                                                                                                                                                                                                                                                                                                                        
Traceback (most recent call last):
  File "/home/korodarn/Apps/text-generation-webui/extensions/alltalk_tts/script.py", line 37, in <module>
    from TTS.api import TTS
  File "/home/korodarn/Apps/text-generation-webui/installer_files/env/lib/python3.11/site-packages/TTS/api.py", line 9, in <module>
    from TTS.utils.audio.numpy_transforms import save_wav
  File "/home/korodarn/Apps/text-generation-webui/installer_files/env/lib/python3.11/site-packages/TTS/utils/audio/__init__.py", line 1, in <module>
    from TTS.utils.audio.processor import AudioProcessor
  File "/home/korodarn/Apps/text-generation-webui/installer_files/env/lib/python3.11/site-packages/TTS/utils/audio/processor.py", line 4, in <module>
    import librosa
  File "/home/korodarn/Apps/text-generation-webui/installer_files/env/lib/python3.11/site-packages/librosa/__init__.py", line 212, in <module>
    import lazy_loader as lazy
ModuleNotFoundError: No module named 'lazy_loader'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/korodarn/Apps/text-generation-webui/modules/extensions.py", line 37, in load_extensions
    exec(f"import extensions.{name}.script")
  File "<string>", line 1, in <module>
  File "/home/korodarn/Apps/text-generation-webui/extensions/alltalk_tts/script.py", line 40, in <module>
    logger.error(
    ^^^^^^
NameError: name 'logger' is not defined
Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.

Desktop (please complete the following information):
AllTalk was updated: today 12/24
Custom Python environment: I'm using the start_linux.sh installer
Text-generation-webUI was updated: today 12/24

OS: EndeavourOS Linux x86_64
Kernel: 6.6.8-arch1-1
Uptime: 42 mins
Packages: 1726 (pacman), 15 (flatpak)
Shell: zsh 5.9
Resolution: 3440x1440
DE: Hyprland
WM: sway
Theme: Breeze-Dark [GTK2/3]
Icons: Breeze-Noir-White-Blue [GTK2/3]
Terminal: kitty
CPU: AMD Ryzen 7 5800X (16) @ 3.800GHz
GPU: NVIDIA GeForce RTX 3080
Memory: 4374MiB / 32002MiB

Additional context
There also seems to be a pandas related conflict from TTS with superbooga/chroma, but at least one windows user reports having this extension working in tandem with that.

ImportError: cannot import name 'field_validator' from 'pydantic'

Please generate a diagnostics report and upload the "diagnostics.log".

Describe the bug
Apparently, "field_validator" does not exist in the installed Pydantic version (it was introduced in Pydantic V2) and couldn't be imported in the tts_server.py file.

To Reproduce
Fresh install of the AllTalk extension, while loading the extension

Screenshots
Line 650
Line 25

Text/logs
Running on local URL: http://127.0.0.1:7860

To create a public link, set share=True in launch().
Closing server running on port: 7860
2023-12-24 17:31:30 INFO:Loading the extension "gallery"...
2023-12-24 17:31:30 INFO:Loading the extension "alltalk_tts"...
[AllTalk Startup] Coqui Public Model License
[AllTalk Startup] https://coqui.ai/cpml.txt
[AllTalk Startup] Old output wav file deletion is set to disabled.
[AllTalk Startup] Checking Model is Downloaded.
[AllTalk Startup] TTS version installed: 0.21.3
[AllTalk Startup] TTS version is up to date.
[AllTalk Startup] All required files are present.
[AllTalk Startup] TTS Subprocess starting
[AllTalk Startup] Readme available here: http://127.0.0.1:7851
Traceback (most recent call last):
  File "C:\Programs\text-generation-webui\extensions\alltalk_tts\tts_server.py", line 25, in <module>
    from pydantic import field_validator
ImportError: cannot import name 'field_validator' from 'pydantic' (C:\programs\text-generation-webui\installer_files\env\Lib\site-packages\pydantic\__init__.cp311-win_amd64.pyd)
[AllTalk Startup] Warning TTS Subprocess has NOT started up yet, Will keep trying for 120 seconds maximum. Please wait.
[... the warning above repeats until the 120 second maximum is reached ...]
[AllTalk Shutdown] Received Ctrl+C, terminating subprocess
2023-12-24 17:33:51 ERROR:Failed to load the extension "alltalk_tts".
[AllTalk Shutdown] Received Ctrl+C, terminating subprocess
[AllTalk Shutdown] Received Ctrl+C, terminating subprocess

Desktop (please complete the following information):
AllTalk was updated: 12/24/23
Custom Python environment: Python version = 3.10.9
Text-generation-webUI was updated:

Additional context
Was able to solve the issue by going into the tts_server.py file in the alltalk_tts extension folder and changing "field_validator" on lines 25 and 650 to "validator". I saved the file, restarted Ooba and the extension, and AllTalk loaded successfully.

FYI: Not sure if this is relevant, but when I initially ran "pip install -r requirements_nvidia.txt", it installed pydantic version 2.5.1. I uninstalled it and installed the 2.5.0 version instead, then applied the solution above.
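A version-tolerant alternative to renaming the calls, sketched here on the assumption that only the import needs to change (field_validator exists in Pydantic v2, validator in v1):

try:
    from pydantic import field_validator  # Pydantic v2
except ImportError:
    from pydantic import validator as field_validator  # Pydantic v1 fallback

Note that v1's validator and v2's field_validator are not fully argument-compatible, so this only helps for the simple decorator uses.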

🔄 ImportError: cannot import name 'SampleOutput' from 'transformers.generation.utils'

EDIT: This has now been resolved with an update to Transformers (version 4.37.1, 24th Jan 2024).

A new version of Transformers, 4.37, has been released (https://github.com/huggingface/transformers/releases) which causes a load/import problem: ImportError: cannot import name 'SampleOutput' from 'transformers.generation.utils'. At this time I'm unsure if this is a bug in their code, as I cannot find any breaking changes currently. I have forced pip install transformers==4.36.2 in the requirements files.

If you experience this, you can load your python environment that AllTalk runs within and run pip install transformers==4.36.2 to force a downgrade.

I have raised an issue with the Transformers developers to ask them for help/guidance or to look into this issue.

fixed!! TTS Generator doesn't work on standalone AllTalk in Linux

Fixed by changing playback from "in browser" to "no playback".

Export to wav is still broken; I reinstalled today on Linux, and it is broken on both Windows and Linux.

Please generate a diagnostics report and upload the "diagnostics.log".

diagnostics.log

When I click generate in the generator, the wheel flashes and it says 3 out of 3 completed, but nothing is generated and nothing new appears in the console.

To Reproduce
put any text into the generator tts

Desktop (please complete the following information):
AllTalk was updated: newest version 19th jan
Custom Python environment: [default]
Text-generation-webUI was updated: not using, using standalone

AllTalk TTS Generator - Incomplete audio and special characters removed.

diagnostics.log

Describe the bug
AllTalk TTS Generator: It altered the text so certain words didn't sound correct, and it didn't finish processing all the text given.

To Reproduce
I tried to make a complete audiobook: I pasted the complete text (around 23,500 words), selected a custom voice wav file for the voice (the same one I use all the time for AllTalk), and left all other settings at default. First I noticed that it removed all of ' "" ... etc., which alters the way TTS reads them, for example "I'm" to "Im". Then it stopped processing at around chunk number 1981 of 2351, with no error on the terminal. Finally I tried consolidating all the generated files using "export to wav", but it didn't do anything.

Text/logs
Sorry, I closed the terminal before thinking of submitting a report.

Desktop (please complete the following information):
AllTalk was updated: 06/01/2024 version [AllTalk 1.8d]
Custom Python environment: [yes/no give details if yes]
Text-generation-webUI was updated: 20/12/2023

Additional context
Sorry if I make a mistake reporting or writing about my issue, but English is not my native tongue.

Training writes to disk as checkpoints frequently. Option to use RAM instead.

I have a very fast SSD, but it has a limited number of writes over its lifespan. I have 64GB of RAM barely being used during training, and I'm sure this number will only grow over time.

Please consider adding an option to store the temporary checkpoints that are currently being written to disk in RAM instead, so hard disk access is kept to a minimum.
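Until there is a proper option, a hedged workaround sketch (Linux only; paths are illustrative): point the finetune output directory at a tmpfs mount such as /dev/shm so intermediate checkpoints stay in RAM, then copy the final model to disk once:

import shutil
from pathlib import Path

ram_dir = Path("/dev/shm/alltalk_finetune")      # tmpfs-backed, lives in RAM
final_dir = Path("finetune/out")                 # real on-disk destination

ram_dir.mkdir(parents=True, exist_ok=True)
# ... run training with output_path=str(ram_dir) ...
shutil.copytree(ram_dir, final_dir, dirs_exist_ok=True)  # single write at the end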

This is an awesome project, btw.

Real Time Streaming

Is it possible, at exec of the TTS cmd, to stream the results in chunks to something like a Temp_stream.wav file that will be playable immediately after exec, while it is still being created?

So for example, if I am transcribing 100 words and it takes 4 seconds, but I want to play the file in real time: at the point of exec of the TTS command I want the audio to start playing, so you can essentially do real time.

Piper does this - https://github.com/rhasspy/piper
echo 'This sentence is spoken first. This sentence is synthesized while the first sentence is spoken.' |
./piper --model en_US-lessac-medium.onnx --output-raw |
aplay -r 22050 -f S16_LE -t raw -

But the models on Coqui_tts are better, just longer to exec; if we could stream, that wouldn't matter and we could do real time.
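For what it's worth, the Coqui TTS package already exposes chunked streaming for XTTSv2 via inference_stream, which is the building block this request describes. A hedged sketch (the model path, reference wav, and naive blocking playback are all illustrative):

import sounddevice as sd
from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts

config = XttsConfig()
config.load_json("models/xttsv2_2.0.2/config.json")
model = Xtts.init_from_config(config)
model.load_checkpoint(config, checkpoint_dir="models/xttsv2_2.0.2")
model.cuda()

# Build speaker conditioning from a reference wav, then play chunks as they arrive.
gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(audio_path=["voices/voice.wav"])
for chunk in model.inference_stream(
    "This sentence is spoken first. This sentence is synthesized while the first is spoken.",
    "en",
    gpt_cond_latent,
    speaker_embedding,
):
    sd.play(chunk.squeeze().cpu().numpy(), samplerate=24000, blocking=True)  # XTTSv2 outputs 24 kHz audio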

Move tts_chunk_player.html from /static to a template render

Is your feature request related to a problem? Please describe.

When attempting to run AllTalk on a remote host which has the GPUs in it, tts_chunk_player/tts_chunk_player.html does not render properly, as it is hard-coded to point at 127.0.0.1.

I have worked around this problem by adding this in the script definition:

const baseURL = "http://<host>:<port>";

and then changing each of the references to use baseURL, like this:

const response = await fetch(`${baseURL}/api/tts-generate`, ...)

This works but is not ideal.

Describe the solution you'd like

I was going to write a PR for this but then realized that this page is being served via fastAPIs static files which does not support passing in parameters. A possible solution is to attempt to read and parse the confignew.json again in the static file but this can lead to a duplication of effort.

It may be best to move this from a static app.mount to the same template.render like the admin.html page uses.
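A hedged sketch of what that change could look like, assuming FastAPI with Jinja2Templates (the route and variable names here are illustrative, not AllTalk's actual code):

from fastapi import FastAPI, Request
from fastapi.templating import Jinja2Templates

app = FastAPI()
templates = Jinja2Templates(directory="templates")

@app.get("/tts_chunk_player")
async def tts_chunk_player(request: Request):
    # The template can interpolate the real host/port instead of a hard-coded
    # 127.0.0.1 baked into a statically mounted HTML file.
    base_url = f"http://{request.url.hostname}:{request.url.port}"
    return templates.TemplateResponse(
        "tts_chunk_player.html", {"request": request, "base_url": base_url}
    )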

Additionally, it might be good to note in the docs (which are already quite good, btw) that if you want to serve on anything other than localhost, you should edit the confignew.json.

Finally, it may be worth noting that other projects using Gradio will set the listen address to 0.0.0.0, which does not work in this case due to the way certain rendered files behave.

Additional context

This idea/solution will not fix the links in the getting started page which also still point to local host. If you have a preferred method I am happy to investigate that path and make an initial PR

Bat Directly closing

Please generate a diagnostics report and upload the "diagnostics.log".

https://github.com/erew123/alltalk_tts/tree/main?#-how-to-make-a-diagnostics-report-file

Describe the bug

G:\Ai\Alltalk\alltalk_tts>start_alltalk.bat
Traceback (most recent call last):
  File "G:\Ai\Alltalk\alltalk_tts\script.py", line 47, in <module>
    from TTS.api import TTS
ModuleNotFoundError: No module named 'TTS'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "G:\Ai\Alltalk\alltalk_tts\script.py", line 50, in <module>
    logger.error(
    ^^^^^^
NameError: name 'logger' is not defined

To Reproduce
Starting alltalk.bat

Screenshots
If applicable, add screenshots to help explain your problem.

Text/logs

INFO:root:NOTE requirements.txt not found. Skipping version checks.
INFO:root:OS Version: Windows 10.0.19045
INFO:root:Note: Windows 11 will list as build is 10.x.22xxx
INFO:root:Torch Version: 2.1.0+cu121
INFO:root:System RAM: 19.98 GB available out of 31.93 GB total
INFO:root:CUDA_HOME: N/A
INFO:root:Port Status: Port 7851 is available.
INFO:root:Python Version: 3.11.7
INFO:root:Python Version Info: sys.version_info(major=3, minor=11, micro=7, releaselevel='final', serial=0)
INFO:root:Python Executable: G:\Ai\AlltalkShit\alltalk_tts\alltalk_environment\env\python.exe
INFO:root:Python Virtual Environment: N/A (Should be N/A when in Text-generation-webui Conda Python environment)
INFO:root:Conda Environment: G:\Ai\Alltalk\alltalk_tts\alltalk_environment\env
INFO:root:
Python Search Path:
INFO:root: G:\Ai\Alltalk\alltalk_tts
INFO:root: G:\Ai\Alltalk\alltalk_tts\alltalk_environment\env\python311.zip
INFO:root: G:\Ai\Alltalk\alltalk_tts\alltalk_environment\env\DLLs
INFO:root: G:\Ai\Alltalk\alltalk_tts\alltalk_environment\env\Lib
INFO:root: G:\Ai\Alltalk\alltalk_tts\alltalk_environment\env
INFO:root: G:\Ai\Alltalk\alltalk_tts\alltalk_environment\env\Lib\site-packages
INFO:root:
OS PATH Environment Variable:
INFO:root: G:\Ai\Alltalk\alltalk_tts\alltalk_environment\env
INFO:root: G:\Ai\Alltalk\alltalk_tts\alltalk_environment\env\Library\mingw-w64\bin
INFO:root: G:\Ai\Alltalk\alltalk_tts\alltalk_environment\env\Library\usr\bin
INFO:root: G:\Ai\Alltalk\alltalk_tts\alltalk_environment\env\Library\bin
INFO:root: G:\Ai\Alltalk\alltalk_tts\alltalk_environment\env\Scripts
INFO:root: G:\Ai\Alltalk\alltalk_tts\alltalk_environment\env\bin
INFO:root: G:\Ai\Alltalk\alltalk_tts\alltalk_environment\conda\condabin
INFO:root: C:\Program Files (x86)\Common Files\Intel\Shared Libraries\redist\intel64\compiler
INFO:root: C:\ProgramData\Oracle\Java\javapath
INFO:root: C:\Program Files (x86)\Common Files\Oracle\Java\javapath
INFO:root: C:\WINDOWS\system32
INFO:root: C:\WINDOWS
INFO:root: C:\WINDOWS\System32\Wbem
INFO:root: C:\WINDOWS\System32\WindowsPowerShell\v1.0
INFO:root: C:\WINDOWS\System32\OpenSSH
INFO:root: C:\Program Files (x86)\GtkSharp\2.12\bin
INFO:root: C:\Program Files\Intel\WiFi\bin
INFO:root: C:\Program Files\Common Files\Intel\WirelessCommon
INFO:root: C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common
INFO:root: C:\Program Files\NVIDIA Corporation\NVIDIA NvDLISR
INFO:root: C:\WINDOWS\system32
INFO:root: C:\WINDOWS
INFO:root: C:\WINDOWS\System32\Wbem
INFO:root: C:\WINDOWS\System32\WindowsPowerShell\v1.0
INFO:root: C:\WINDOWS\System32\OpenSSH
INFO:root: C:\Program Files\dotnet
INFO:root: D:\Node
INFO:root: G:\Git\cmd
INFO:root: C:\Users\Agando\AppData\Local\Microsoft\WindowsApps
INFO:root: C:\Program Files\Intel\WiFi\bin
INFO:root: C:\Program Files\Common Files\Intel\WirelessCommon
INFO:root: ÔÇ£C:\ffmpeg\bin
INFO:root: C:\Users\Agando\AppData\Roaming\npm
INFO:root: C:\WINDOWS\system32
INFO:root:GPU Information:
Mon Jan 29 21:10:33 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 537.13 Driver Version: 537.13 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name TCC/WDDM | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 2060 WDDM | 00000000:2B:00.0 On | N/A |
| 0% 43C P0 31W / 160W | 1230MiB / 6144MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 6224 C+G C:\Windows\explorer.exe N/A |
| 0 N/A N/A 6316 C+G ...CBS_cw5n1h2txyewy\TextInputHost.exe N/A |
| 0 N/A N/A 11420 C+G ...2txyewy\StartMenuExperienceHost.exe N/A |
| 0 N/A N/A 12940 C+G ...wekyb3d8bbwe\XboxGameBarWidgets.exe N/A |
| 0 N/A N/A 14624 C+G ....Search_cw5n1h2txyewy\SearchApp.exe N/A |
| 0 N/A N/A 15480 C+G D:\Programme\Steam\steam.exe N/A |
| 0 N/A N/A 15812 C+G ...ekyb3d8bbwe\PhoneExperienceHost.exe N/A |
| 0 N/A N/A 17004 C+G ...crosoft\Edge\Application\msedge.exe N/A |
| 0 N/A N/A 17524 C+G ...oogle\Chrome\Application\chrome.exe N/A |
| 0 N/A N/A 17864 C+G D:\elgato\StreamDeck\StreamDeck.exe N/A |
| 0 N/A N/A 18160 C+G ...\cef\cef.win7x64\steamwebhelper.exe N/A |
| 0 N/A N/A 18188 C+G G:\Corsair\iCUE.exe N/A |
| 0 N/A N/A 18324 C+G ...cal\Microsoft\OneDrive\OneDrive.exe N/A |
| 0 N/A N/A 20192 C+G ...siveControlPanel\SystemSettings.exe N/A |
| 0 N/A N/A 21808 C+G ...12.0_x64__8wekyb3d8bbwe\GameBar.exe N/A |
| 0 N/A N/A 25356 C+G ...5n1h2txyewy\ShellExperienceHost.exe N/A |
+---------------------------------------------------------------------------------------+

INFO:root:Package Versions:
INFO:root:annotated-types>= 0.6.0
INFO:root:anyio>= 3.7.1
INFO:root:certifi>= 2022.12.7
INFO:root:cffi>= 1.16.0
INFO:root:charset-normalizer>= 2.1.1
INFO:root:click>= 8.1.7
INFO:root:colorama>= 0.4.6
INFO:root:cutlet>= 0.3.0
INFO:root:deepspeed>= 0.12.7+d058d4b
INFO:root:fastapi>= 0.104.1
INFO:root:filelock>= 3.9.0
INFO:root:fsspec>= 2023.12.2
INFO:root:fugashi>= 1.3.0
INFO:root:h11>= 0.14.0
INFO:root:hjson>= 3.1.0
INFO:root:huggingface-hub>= 0.20.3
INFO:root:idna>= 3.4
INFO:root:importlib-metadata>= 4.8.1
INFO:root:jaconv>= 0.3.4
INFO:root:Jinja2>= 3.1.2
INFO:root:MarkupSafe>= 2.1.3
INFO:root:mojimoji>= 0.0.13
INFO:root:mpmath>= 1.3.0
INFO:root:networkx>= 3.0
INFO:root:ninja>= 1.11.1.1
INFO:root:numpy>= 1.24.4
INFO:root:packaging>= 23.2
INFO:root:Pillow>= 9.3.0
INFO:root:pip>= 23.3.1
INFO:root:psutil>= 5.9.8
INFO:root:pycparser>= 2.21
INFO:root:pydantic>= 1.10.13
INFO:root:pydantic_core>= 2.16.1
INFO:root:pynvml>= 11.5.0
INFO:root:python-multipart>= 0.0.6
INFO:root:PyYAML>= 6.0.1
INFO:root:py-cpuinfo>= 9.0.0
INFO:root:regex>= 2023.12.25
INFO:root:requests>= 2.31.0
INFO:root:safetensors>= 0.4.2
INFO:root:setuptools>= 68.2.2
INFO:root:sniffio>= 1.3.0
INFO:root:sounddevice>= 0.4.6
INFO:root:soundfile>= 0.12.1
INFO:root:starlette>= 0.27.0
INFO:root:sympy>= 1.12
INFO:root:tokenizers>= 0.15.1
INFO:root:torch>= 2.1.0+cu121
INFO:root:torchaudio>= 2.1.0+cu121
INFO:root:torchvision>= 0.16.0+cu121
INFO:root:tqdm>= 4.66.1
INFO:root:transformers>= 4.36.2
INFO:root:typing_extensions>= 4.8.0
INFO:root:unidic-lite>= 1.0.8
INFO:root:urllib3>= 1.26.13
INFO:root:uvicorn>= 0.24.0.post1
INFO:root:wheel>= 0.41.2
INFO:root:zipp>= 3.17.0

Desktop (please complete the following information):
AllTalk was updated: [approx. date]
Custom Python environment: [yes/no give details if yes]
Text-generation-webUI was updated: [approx. date]

Additional context
Add any other context about the problem here.

I tried to install it for the 4th time, still the same issue. Thanks in advance, keep up the great work ^^

Russian language is not working

Diagnostics
diagnostics.log

Describe the bug
When I was on version 1.7, everything worked great. When I updated to 1.8, the Russian language stopped working. It reads punctuation marks but not letters. English and other languages work fine.

To Reproduce
I have already reinstalled and updated the extension several times. I launched the extension in text-generation-webui and as a standalone app.
Nothing helps.

Screenshots
image

Text/logs
In Screenshots

Desktop (please complete the following information):
AllTalk was updated: Today (04.01.2024)
Custom Python environment: No.
Text-generation-webUI was updated: Today (04.01.2024)

Additional context

API returned incorrect path url.

Describe the bug
"output_file_path" and "output_file_url" returned incorrect path to the output (without '_combined') wav file when using the API.

image

Returned JSON:

{"status": "generate-success", "output_file_path": "C:\\text-generation-webui-main\\extensions\\alltalk_tts\\outputs\\1110117857783713802_1_combined.wav", "output_file_url": "http://192.168.0.99:7851/audio/1110117857783713802_1.wav", "output_cache_url": "http://192.168.0.99:7851/audiocache/1110117857783713802_1.wav"}

Real file on disk
1110117857783713802_1_combined.wav
URL links to:
1110117857783713802_1.wav

To Reproduce
image

Unsure if this was intended.
I looked into the code here:
image
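A hedged sketch of the apparent fix, built from the JSON shown above (variable names are illustrative, not AllTalk's actual code): derive the URLs from the same final filename as output_file_path, so "_combined" is not lost:

from pathlib import Path

host, port = "192.168.0.99", 7851
final_path = Path(r"C:\text-generation-webui-main\extensions\alltalk_tts\outputs\1110117857783713802_1_combined.wav")

response = {
    "status": "generate-success",
    "output_file_path": str(final_path),
    # Reuse the final filename so the URLs point at the file that really exists.
    "output_file_url": f"http://{host}:{port}/audio/{final_path.name}",
    "output_cache_url": f"http://{host}:{port}/audiocache/{final_path.name}",
}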

Numpy version lock breaks install

Just tried out the extension (great work!), and found that the locked numpy version broke the dependency tree:

INFO: pip is looking at multiple versions of tts to determine which version is compatible with other requirements. This could take a while.
Collecting TTS>=0.21.3 (from -r requirements_nvidia.txt (line 4))
  Downloading https://pip/index/tts/TTS-0.21.3-cp310-cp310-manylinux1_x86_64.whl (942 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 942.6/942.6 kB 696.6 MB/s eta 0:00:00
ERROR: Cannot install -r requirements_nvidia.txt (line 4) and numpy>=1.24.4 because these package versions have conflicting dependencies.

The conflict is caused by:
    The user requested numpy>=1.24.4
    tts 0.22.0 depends on numpy==1.22.0; python_version <= "3.10"
    The user requested numpy>=1.24.4
    tts 0.21.3 depends on numpy==1.22.0; python_version <= "3.10"

To fix this you could try to:
1. loosen the range of package versions you've specified
2. remove package versions to allow pip attempt to solve the dependency conflict

Removing the >=1.24.4 from the requirements_nvidia.txt however fixed the issue:

(venv) root@textgen:/app/extensions/alltalk_tts# sed -i 's/numpy>=1.24.4/numpy/g' requirements_nvidia.txt

(venv) root@textgen:/app/extensions/alltalk_tts# pip install -r requirements_nvidia.txt
WARNING: The directory '/root/.cache/pip' or its parent directory is not owned or is not writable by the current user. The cache has been disabled. Check the permissions and owner of that directory. If executing pip with sudo, you should use sudo's -H flag.
Looking in indexes: https://pip/index/
Requirement already satisfied: numpy in /app/venv/lib/python3.10/site-packages (from -r requirements_nvidia.txt (line 1)) (1.22.0)
....
Installing collected packages: zipp, pandas, importlib-metadata, TTS
  Attempting uninstall: pandas
    Found existing installation: pandas 2.0.3
    Uninstalling pandas-2.0.3:
      Successfully uninstalled pandas-2.0.3
  Attempting uninstall: TTS
    Found existing installation: TTS 0.20.6
    Uninstalling TTS-0.20.6:
      Successfully uninstalled TTS-0.20.6
Successfully installed TTS-0.22.0 importlib-metadata-7.0.0 pandas-1.5.3 zipp-3.17.0

pydantic>=2.5.0 not compatible with deepspeed & extension load fail

On a Windows 10 machine.
Since installing the finetune requirements, the extension no longer works correctly.
Previously, I had pydantic 1.10.13 installed, but with the new requirement I get this:

E:\oobabooga\text-generation-webui-snapshot-2023-11-19\installer_files\env\Lib\site-packages\pydantic\_internal\_config.py:321: UserWarning: Valid config keys have changed in V2:

  • 'allow_population_by_field_name' has been renamed to 'populate_by_name'
  • 'validate_all' has been renamed to 'validate_default'
  warnings.warn(message, UserWarning)
E:\oobabooga\text-generation-webui-snapshot-2023-11-19\installer_files\env\Lib\site-packages\pydantic\_internal\_fields.py:149: UserWarning: Field "model_persistence_threshold" has conflict with protected namespace "model_".

You may be able to resolve this warning by setting model_config['protected_namespaces'] = ().
  warnings.warn(
2023-12-24 19:05:51 ERROR:Failed to load the extension "alltalk_tts".
Traceback (most recent call last):
  File "E:\oobabooga\text-generation-webui-snapshot-2023-11-19\modules\extensions.py", line 36, in load_extensions
    exec(f"import extensions.{name}.script")
  File "<string>", line 1, in <module>
  File "E:\oobabooga\text-generation-webui-snapshot-2023-11-19\extensions\alltalk_tts\script.py", line 50, in <module>
    import deepspeed
  File "E:\oobabooga\text-generation-webui-snapshot-2023-11-19\installer_files\env\Lib\site-packages\deepspeed\__init__.py", line 22, in <module>
    from . import module_inject
  File "E:\oobabooga\text-generation-webui-snapshot-2023-11-19\installer_files\env\Lib\site-packages\deepspeed\module_inject\__init__.py", line 6, in <module>
    from .replace_module import replace_transformer_layer, revert_transformer_layer, ReplaceWithTensorSlicing, GroupQuantizer, generic_injection
  File "E:\oobabooga\text-generation-webui-snapshot-2023-11-19\installer_files\env\Lib\site-packages\deepspeed\module_inject\replace_module.py", line 580, in <module>
    from ..pipe import PipelineModule
  File "E:\oobabooga\text-generation-webui-snapshot-2023-11-19\installer_files\env\Lib\site-packages\deepspeed\pipe\__init__.py", line 6, in <module>
    from ..runtime.pipe import PipelineModule, LayerSpec, TiedLayerSpec
  File "E:\oobabooga\text-generation-webui-snapshot-2023-11-19\installer_files\env\Lib\site-packages\deepspeed\runtime\pipe\__init__.py", line 6, in <module>
    from .module import PipelineModule, LayerSpec, TiedLayerSpec
  File "E:\oobabooga\text-generation-webui-snapshot-2023-11-19\installer_files\env\Lib\site-packages\deepspeed\runtime\pipe\module.py", line 19, in <module>
    from ..activation_checkpointing import checkpointing
  File "E:\oobabooga\text-generation-webui-snapshot-2023-11-19\installer_files\env\Lib\site-packages\deepspeed\runtime\activation_checkpointing\checkpointing.py", line 26, in <module>
    from deepspeed.runtime.config import DeepSpeedConfig
  File "E:\oobabooga\text-generation-webui-snapshot-2023-11-19\installer_files\env\Lib\site-packages\deepspeed\runtime\config.py", line 29, in <module>
    from .zero.config import get_zero_config, ZeroStageEnum
  File "E:\oobabooga\text-generation-webui-snapshot-2023-11-19\installer_files\env\Lib\site-packages\deepspeed\runtime\zero\__init__.py", line 6, in <module>
    from .partition_parameters import ZeroParamType
  File "E:\oobabooga\text-generation-webui-snapshot-2023-11-19\installer_files\env\Lib\site-packages\deepspeed\runtime\zero\partition_parameters.py", line 723, in <module>
    class Init(InsertPostInitMethodToModuleSubClasses):
  File "E:\oobabooga\text-generation-webui-snapshot-2023-11-19\installer_files\env\Lib\site-packages\deepspeed\runtime\zero\partition_parameters.py", line 725, in Init
    param_persistence_threshold = get_config_default(DeepSpeedZeroConfig, "param_persistence_threshold")
                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "E:\oobabooga\text-generation-webui-snapshot-2023-11-19\installer_files\env\Lib\site-packages\deepspeed\runtime\config_utils.py", line 116, in get_config_default
    field_name).required, f"'{field_name}' is a required field and does not have a default value"
    ^^^^^^^^
AttributeError: 'FieldInfo' object has no attribute 'required'

If I remove pydantic>=2.5.0 and instead do pydantic==1.10.13, I get:

[AllTalk Startup] TTS Subprocess starting
[AllTalk Startup] Readme available here: http://127.0.0.1:7851
Traceback (most recent call last):
  File "E:\oobabooga\text-generation-webui-snapshot-2023-11-19\extensions\alltalk_tts\tts_server.py", line 25, in <module>
    from pydantic import field_validator
ImportError: cannot import name 'field_validator' from 'pydantic' (E:\oobabooga\text-generation-webui-snapshot-2023-11-19\installer_files\env\Lib\site-packages\pydantic\__init__.cp311-win_amd64.pyd)
[AllTalk Startup] Warning TTS Subprocess has NOT started up yet, Will keep trying for 120 seconds maximum. Please wait.
[AllTalk Startup] Warning TTS Subprocess has NOT started up yet, Will keep trying for 120 seconds maximum. Please wait.

After having run the finetuning successfully, I moved the model files as written in the guide and then changed the cuda path back to what it was previously. So I'm not sure what broke.

Use Whisper to check the generated audio in AllTalk TTS Generator.

Is your feature request related to a problem? Please describe.
I primarily use AllTalk TTS Generator for large quantities of text, e.g. audiobook generation (for personal use). Given the generator has the ability to regenerate a single chunk of text when the audio has problems, it's relatively easy to fix misreadings; what's not easy is to track down which chunk has issues without listening to all the generated audio.

Describe the solution you'd like
I don't know if this is out of scope for this project but using Whisper it's possible to check the audio for issues.

Describe alternatives you've considered
For example I created an audiobook that resulted in a json file with 5558 entries, but when it finished I only have 5002 wav files. An example of two entries that illustrates why:

{
"id": 5552,
"fileUrl": "http://127.0.0.1:7851/audio/TTS_1705084624.wav",
"text": "Bonifatius appears in the color art, too.",
"characterVoice": "voice.wav",
"language": "en"
},
{
"id": 5553,
"fileUrl": "http://127.0.0.1:7851/audio/TTS_1705084624.wav",
"text": "It’s raining grandfathers!",
"characterVoice": "voice.wav",
"language": "en"
},

  • Notice that both entries point to the same wav file.

Additional context
What I did is kind of a hack solution: I used Whisper to transcribe each wav file (I had to make batches of 1000 files at a time or the Gradio Whisper UI I used would spit out an error), then I used a script to check every text entry in the json file.

script:
compare.txt

That script lets me know, within a certain threshold (specified when I run the command), which lines I need to regenerate. I decided to use a less-than-80% similarity threshold, and it found 807 entries with issues:

example:
ID: 5552, JSON Text: bonifatius appears in the color art, too.
TXT Content: it's raining grandfathers.

Differences: Similarity: 26%

I decided to check for those kinds of discrepancies because Whisper is not perfect and it has trouble with names, for example.
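A condensed sketch of that check, assuming faster-whisper (already an AllTalk finetuning dependency) and the JSON layout shown above; the file locations and the 80% threshold are illustrative:

import json
from difflib import SequenceMatcher
from faster_whisper import WhisperModel

model = WhisperModel("base.en")  # a small model is usually enough for a sanity check

def similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

entries = json.load(open("audiobook.json", encoding="utf-8"))
for entry in entries:
    wav_name = entry["fileUrl"].split("/audio/")[-1]        # local filename from the URL
    segments, _ = model.transcribe(f"outputs/{wav_name}")
    transcript = " ".join(seg.text for seg in segments)
    if similarity(entry["text"], transcript) < 0.80:
        print(f"ID {entry['id']}: regenerate ({transcript!r})")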

Updating to 1.8b from a previous version.

Apologies, but I left a mistake in the last gitignore file and forgot to exclude the finetune directory from updates. If you have performed finetuning and want to update and not re-setup the whole of AllTalk, you will need to open the gitignore file and replace its contents with:

voices/*.*
models/*.*
outputs/*.*
finetune/*.*
config.json
confignew.json
models.json
diagnostics.log

gitignore can be found in /alltalk_tts/.gitignore

You would open it with Notepad or a text editor, select everything in there, delete it and copy the above inside.

After you have done that, you should be able to follow the standard instructions and perform a git pull along with running the requirements file again. The instructions for doing this are here

Option to set training data/testing data percentages.

Currently the software automatically divides the audio samples provided into a dataset, and then allocates some of that dataset for training and some of it for testing.

The problem here is that, while sub-optimal, the reality is that the number of voice samples of the target often isn't very high (e.g. a character from media without many minutes of spoken dialogue).

In such cases, it would be useful to be able to throw more samples at training, rather than testing. This will likely cause problems in training, of course, but might be the better option in rare cases.
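For reference, a hedged sketch of where such an option could plug in: recent Coqui TTS releases let load_tts_samples take an eval_split_size argument (default around 1%), so exposing it from the finetune UI would likely be enough. The config_dataset variable below stands for whatever dataset config finetune.py already builds:

from TTS.tts.datasets import load_tts_samples

train_samples, eval_samples = load_tts_samples(
    config_dataset,          # dataset config as finetune.py already constructs it
    eval_split=True,
    eval_split_size=0.05,    # hold out 5% for eval instead of the default
)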

batch queue please?

With the DeepSpeed optimisation, we can actually do multiple books in one night!

If you have some spare time, a batch queue would allow us to do so!

Another fantastic reason to do so: somewhere in the process, the max length of a wav is <17h. After that, the output shows 17h but is 4.5h max. Maybe it's a limit of ffmpeg, or something else like a limit of the wav file or its header; I couldn't find it though.

Thank you very much.

No module named 'TTS' in standalone mode

diagnostics.log

Describe the bug
I installed the app in standalone mode, in venv, so that everything installs cleanly. When I start it, I get this error:

Text/logs

(venv) I:\alltalk_tts>python script.py
[AllTalk Startup] Running script.py in standalone mode
[AllTalk Startup] Coqui Public Model License
[AllTalk Startup] https://coqui.ai/cpml.txt
[AllTalk Startup] Old output wav file deletion is set to disabled.
[AllTalk Startup] Checking Model is Downloaded.
[AllTalk Startup] Warning TTS is not installed.
[AllTalk Startup] All required files are present.
[AllTalk Startup] TTS Subprocess starting
[AllTalk Startup] Readme available here: http://127.0.0.1:7851
C:\Users\Mykee\AppData\Local\Programs\Python\Python310\lib\site-packages\torchaudio\backend\utils.py:74: UserWarning: No audio backend is available.
  warnings.warn("No audio backend is available.")
Traceback (most recent call last):
  File "I:\alltalk_tts\tts_server.py", line 7, in <module>
    from TTS.tts.configs.xtts_config import XttsConfig
ModuleNotFoundError: No module named 'TTS'

It seems that the script does not pick up the modules from the venv, am I right?
The log shows that TTS is installed, but there is still an error. I also ran this command, but it did not help:
pip install -r requirements_nvidia.txt
CUDA is on the machine according to the log, but I see torch 2.1.2+cpu mode in the log. Why?

I had to install TTS outside of venv and still got this: ModuleNotFoundError: No module named 'sounddevice'. After installing this, the program started, so venv is useless if the program doesn't work from there.
After installing the modules outside the venv, it still only sees cpu for TTS:

[2024-01-04 23:30:59,491] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-01-04 23:30:59,678] torch.distributed.elastic.multiprocessing.redirects: [WARNING] NOTE: Redirects are currently not supported in Windows or MacOs.
[AllTalk Startup] Running script.py in standalone mode
[AllTalk Startup] Coqui Public Model License
[AllTalk Startup] https://coqui.ai/cpml.txt
[AllTalk Startup] Old output wav file deletion is set to disabled.
[AllTalk Startup] Checking Model is Downloaded.
[AllTalk Startup] TTS version installed: 0.21.3
[AllTalk Startup] TTS version is up to date.
[AllTalk Startup] All required files are present.
[AllTalk Startup] TTS Subprocess starting
[AllTalk Startup] Readme available here: http://127.0.0.1:7602
[2024-01-04 23:31:05,053] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-01-04 23:31:05,100] torch.distributed.elastic.multiprocessing.redirects: [WARNING] NOTE: Redirects are currently not supported in Windows or MacOs.
[AllTalk Startup] DeepSpeed Detected
[AllTalk Startup] Activate DeepSpeed in AllTalk  settings
[AllTalk Model] XTTSv2 Local Loading xttsv2_2.0.2 into cpu
[AllTalk Model] Model Loaded in 6.34 seconds.

Here is second log:
diagnostics2.log
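A quick, hedged sanity check for the "into cpu" symptom above: a torch build whose version string ends in "+cpu" will never see the GPU, no matter what CUDA is on the machine.

import sys
import torch

print(sys.executable)            # confirms which interpreter/venv is actually running
print(torch.__version__)         # a "+cpu" suffix means a CPU-only wheel is installed
print(torch.cuda.is_available()) # should be True before XTTS can load onto the GPU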

Desktop (please complete the following information):
AllTalk was updated: 2024-01-04
Custom Python environment: [yes/no give details if yes] yes, see log: 3.10.11
Text-generation-webUI was updated: (standalone install, not an extension).

Feature request: model switching

With finetuning now a possibility, would it be possible to hot-reload a model / select one without needing to restart Ooba?
Something like a dropdown, to select one of our finetuned models.
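For context, a hedged sketch of what a hot-reload would do under the hood, using the same Coqui loading calls the extension already relies on (directory names are illustrative):

from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts

def load_xtts(model_dir: str) -> Xtts:
    # Load an XTTSv2 checkpoint (base or finetuned) from a directory.
    config = XttsConfig()
    config.load_json(f"{model_dir}/config.json")
    model = Xtts.init_from_config(config)
    model.load_checkpoint(config, checkpoint_dir=model_dir)
    return model.cuda()

model = load_xtts("models/xttsv2_2.0.2")
model = load_xtts("models/myfinetunedmodel")  # a dropdown handler could swap like this, no restart needed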

Finetuning uses internet access for some reason?

image

Finetuning models periodically sends bulk network traffic, and it actually errors out if the internet is disconnected when instigating training. Why would local finetuning require an internet connection?

Additionally, training writes to disk very frequently. I have 64GB of RAM; I request that it be used for checkpoints, with a final, singular write at the end of training, instead of thrashing my SSD.

Base Packages installed by Text-generation-webui (Dec 2023)

Below is a list of the base installed packages if you fresh installed Text-generation-webui (based on Dec 24th 2023 as a Nvidia 12.1 install)

🟩 TLDR summary

AllTalk will install 2x or 3x things that are not natively installed by Text-generation-webUI and its native Coqui_tts extension:

importlib-metadata>=4.8.1 - Used to confirm your packages are up to date when AllTalk starts up
soundfile>=0.12.1         - Combines multiple wav files into one file for the narrator to function
faster-whisper>=0.10.0    - If you use finetuning and ONLY if you install the requirements_finetuning.txt

>= means greater than OR equal to this version number, i.e. AllTalk is asking for a minimum of this version number or greater. AllTalk is NOT requesting to downgrade to earlier versions. Please scroll down for a side-by-side comparison of Text-generation-webui's factory-installed package versions vs AllTalk's requested packages.

AllTalk will NOT bump versions of packages beyond the base requirements set by Text-generation-webui, bar Pandas, which is forced by the TTS engine; however, this is the same as what the native Coqui_tts extension does when it installs TTS from its requirements file. Details here. See the next section for an explanation of Pandas and upgrading it again (if necessary).

Putting this simply, if AllTalk specifies numpy>=1.24.4 (which is the Text-generation-webUI factory default) and you install numpy>=1.29.1, AllTalk will simply go "ok, that's a greater version than I am asking for (>=), so I won't change anything or do anything, as it satisfies my requirement needs".

🟩 Here are the things that AllTalk requests to be installed. There are 3x unique packages

All package versions AllTalk is requesting are the same as Text-generation-webui's factory-installed/default package versions. In some cases Text-generation-webui will install a higher version, and AllTalk will NOT downgrade or change those packages. AllTalk only has minimum requested versions that match or are lower than Text-generation-webui's requirements.

importlib-metadata>=4.8.1   (Unique to AllTalk)
soundfile>=0.12.1           (Unique to AllTalk)
TTS>=0.21.3                 (Unique to AllTalk & Text-gen-webui's native Coqui_tts extension)
fastapi>=0.104.1
Jinja2>=3.1.2
numpy>=1.24.4
packaging>=23.2
pydantic>=1.10.13
requests>=2.31.0
torch>=2.1.0+cu118
torchaudio>=2.1.0+cu118
tqdm>=4.66.1
uvicorn>=0.24.0.post1

TTS downgrades Pandas to 1.5.3, though it appears fine to upgrade it again with pip install pandas==2.1.4
I do not know of a way to do this automatically within the one requirements file, as it causes a dependency conflict.

Pandas is a data analysis/manipulation library: it handles tabular data (spreadsheet-like rows and columns) inside Python. The TTS engine depends on it for things such as reading training metadata files, where each row pairs an audio clip with its transcript.
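A small, hypothetical example of that kind of tabular work (the file name and column layout here are illustrative only):

    import pandas as pd

    # Read a pipe-separated metadata file pairing audio clips with text.
    df = pd.read_csv("metadata.csv", sep="|", names=["audio_file", "text"])
    print(df.head())        # first few rows
    print(len(df), "rows")  # how many clips are listed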

🟩 And finetuning:

These are only installed when you install requirements_finetuning.txt. All package versions AllTalk is requesting are the same as Text-generation-webui's factory-installed/default package versions. In some cases Text-generation-webui will install a higher version, and AllTalk will NOT downgrade or change those packages. AllTalk only has minimum requested versions that match, or are lower than, Text-generation-webui's requirements.

faster-whisper>=0.10.0   (Unique to AllTalk)
gradio>=3.50.2
torch>=2.1.0+cu118
torchaudio>=2.1.0+cu118
TTS==0.21.3              (Force TTS 0.21.3 as 0.22.0 has an issue with file paths on finetuning)
tqdm>=4.66.0
pandas>=1.5.0

🟩 Text-generation-webui's base packages on a factory fresh install (side-by-side comparison)

absl-py==2.0.0
accelerate==0.25.0
aiofiles==23.2.1
aiohttp==3.9.1
aiosignal==1.3.1
altair==5.2.0
annotated-types==0.6.0
antlr4-python3-runtime==4.9.3
anyio==3.7.1
appdirs==1.4.4
asttokens==2.4.1
attributedict==0.3.0
attrs==23.1.0
auto-gptq @ https://github.com/jllllll/AutoGPTQ/releases/download/v0.6.0/auto_gptq-0.6.0+cu121-cp311-cp311-linux_x86_64.whl#sha256=80f44157c636a38ea12e0820ec681966310dfaa34b00724a176cd4c097b856d6
autoawq==0.1.7
beautifulsoup4==4.12.2
bitsandbytes==0.41.1
blessings==1.7
blinker==1.7.0
cachetools==5.3.2
certifi==2022.12.7
cffi==1.16.0
chardet==5.2.0
charset-normalizer==2.1.1
click==8.1.7
codecov==2.1.13
colorama==0.4.6
coloredlogs==15.0.1
colour-runner==0.1.1
contourpy==1.2.0
coverage==7.3.4
cramjam==2.7.0
ctransformers @ https://github.com/jllllll/ctransformers-cuBLAS-wheels/releases/download/AVX2/ctransformers-0.2.27+cu121-py3-none-any.whl#sha256=9be6bfa8ac9feb5b2d4c98fbf5ac90394bbfa5c406313f8161dca67b28333e51
cycler==0.12.1
DataProperty==1.0.1
datasets==2.16.0
decorator==5.1.1
deep-translator==1.9.2
deepdiff==6.7.1
dill==0.3.7
diskcache==5.6.3
distlib==0.3.8
docker-pycreds==0.4.0
docopt==0.6.2
einops==0.7.0
evaluate==0.4.1
executing==2.0.1
exllama @ https://github.com/jllllll/exllama/releases/download/0.0.18/exllama-0.0.18+cu121-cp311-cp311-linux_x86_64.whl#sha256=a56d4281a16bc1e03ebfa82c5333f5b623c5f983de58d358cb2960cd6cbd8b03
exllamav2 @ https://github.com/turboderp/exllamav2/releases/download/v0.0.11/exllamav2-0.0.11+cu121-cp311-cp311-linux_x86_64.whl#sha256=9a36893f577ba058c7b8add08090a22a48921471c79764cac5ec4d298435a0ee
fastapi==0.105.0 AllTalk requests minimum of fastapi>=0.104.1
fastparquet==2023.10.1
ffmpeg==1.4
ffmpy==0.3.1
filelock==3.13.1
flash-attn @ https://github.com/Dao-AILab/flash-attention/releases/download/v2.3.4/flash_attn-2.3.4+cu122torch2.1cxx11abiFALSE-cp311-cp311-linux_x86_64.whl#sha256=35f3335cc9b3c533e622fa5ea85908502f7aa558646523f6def10d1b53ca82e0
Flask==3.0.0
flask-cloudflared==0.0.14
fonttools==4.47.0
frozenlist==1.4.1
fsspec==2023.10.0
gekko==1.0.6
gitdb==4.0.11
GitPython==3.1.40
google-auth==2.25.2
google-auth-oauthlib==1.2.0
gptq-for-llama @ https://github.com/jllllll/GPTQ-for-LLaMa-CUDA/releases/download/0.1.1/gptq_for_llama-0.1.1+cu121-cp311-cp311-linux_x86_64.whl#sha256=b6b0ce1b3b2568dff3c21d31956a82552e4eb6950c2a1f626767f9288ebc36d7
gradio==3.50.2 Finetuning requests minimum of gradio>=3.50.2
gradio_client==0.6.1
grpcio==1.60.0
h11==0.14.0
hqq==0.1.1.post1
httpcore==1.0.2
httpx==0.26.0
huggingface-hub==0.20.1
humanfriendly==10.0
idna==3.4
importlib-resources==6.1.1
inspecta==0.1.3
ipython==8.19.0
itsdangerous==2.1.2
jedi==0.19.1
Jinja2==3.1.2 AllTalk requests minimum of Jinja2>=3.1.2
joblib==1.3.2
jsonlines==4.0.0
jsonschema==4.20.0
jsonschema-specifications==2023.11.2
kiwisolver==1.4.5
llama_cpp_python @ https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/cpu/llama_cpp_python-0.2.24+cpuavx2-cp311-cp311-manylinux_2_31_x86_64.whl#sha256=73f93f750d4af6ba2f9d5bc0d2f46778a9d85934c1816c9ff1908750c4c477d7
llama_cpp_python_cuda @ https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda-0.2.24+cu121-cp311-cp311-manylinux_2_31_x86_64.whl#sha256=da6b5accd73a040e6640a0c5aae55fc1ed99c0a6b2950ca6122ecbf30fcb7b4d
llama_cpp_python_cuda_tensorcores @ https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda_tensorcores-0.2.24+cu121-cp311-cp311-manylinux_2_31_x86_64.whl#sha256=de1c9111a9a43e83da40244aacd423a5095fa1a6a25fb5607da79c4caf76e328
llvmlite==0.41.1
lm_eval==0.4.0
lxml==4.9.4
Markdown==3.5.1
markdown-it-py==3.0.0
MarkupSafe==2.1.3
matplotlib==3.8.2
matplotlib-inline==0.1.6
mbstrdecoder==1.1.3
mdurl==0.1.2
more-itertools==10.1.0
mpmath==1.3.0
multidict==6.0.4
multiprocess==0.70.15
networkx==3.0
ngrok==0.12.1
ninja==1.11.1.1
nltk==3.8.1
num2words==0.5.13
numba==0.58.1
numexpr==2.8.8
numpy==1.24.4 AllTalk requests minimum of numpy>=1.24.4
oauthlib==3.2.2
omegaconf==2.3.0
openai-whisper==20231117
optimum==1.16.1
ordered-set==4.1.0
orjson==3.9.10
packaging==23.2 AllTalk requests minimum of packaging>=23.2
pandas==2.1.4 - TTS downgrades Pandas to 1.5.3, though it appears fine to upgrade it again with pip install pandas==2.1.4
parso==0.8.3
pathvalidate==3.2.0
peft==0.7.1
pexpect==4.9.0
Pillow==10.1.0
platformdirs==4.1.0
pluggy==1.3.0
portalocker==2.8.2
prompt-toolkit==3.0.43
protobuf==4.23.4
psutil==5.9.7
ptyprocess==0.7.0
pure-eval==0.2.2
py-cpuinfo==9.0.0
pyarrow==14.0.2
pyarrow-hotfix==0.6
pyasn1==0.5.1
pyasn1-modules==0.3.0
pybind11==2.11.1
pycparser==2.21
pydantic==2.5.3 AllTalk requests minimum of pydantic>=1.10.13
pydantic_core==2.14.6
pydub==0.25.1
Pygments==2.17.2
pyparsing==3.1.1
pyproject-api==1.6.1
pytablewriter==1.2.0
python-dateutil==2.8.2
python-multipart==0.0.6
pytz==2023.3.post1
PyYAML==6.0.1
referencing==0.32.0
regex==2023.12.25
requests==2.31.0 AllTalk requests minimum of requests>=2.31.0
requests-oauthlib==1.3.1
responses==0.18.0
rich==13.7.0
rootpath==0.1.1
rouge==1.0.1
rouge-score==0.1.2
rpds-py==0.15.2
rsa==4.9
sacrebleu==2.4.0
safetensors==0.4.1
scikit-learn==1.3.2
scipy==1.11.4
semantic-version==2.10.0
sentencepiece==0.1.99
sentry-sdk==1.39.1
setproctitle==1.3.3
six==1.16.0
smmap==5.0.1
sniffio==1.3.0
soundfile==0.12.1
soupsieve==2.5
SpeechRecognition==3.10.0
sqlitedict==2.1.0
sse-starlette==1.6.5
stack-data==0.6.3
starlette==0.27.0
sympy==1.12
tabledata==1.3.3
tabulate==0.9.0
tcolorpy==0.1.4
tensorboard==2.15.1
tensorboard-data-server==0.7.2
termcolor==2.4.0
texttable==1.7.0
threadpoolctl==3.2.0
tiktoken==0.5.2
timm==0.9.12
tokenizers==0.15.0
toml==0.10.2
toolz==0.12.0
torch==2.1.2+cu121 AllTalk requests minimum of torch>=2.1.0+cu118 - CUDA is not requested by the "requirements_other.txt"
torchaudio==2.1.2+cu121 AllTalk requests minimum torchaudio>=2.1.0+cu118 - CUDA is not requested by the "requirements_other.txt"
torchvision==0.16.2+cu121
tox==4.11.4
tqdm==4.66.1 AllTalk requests minimum of tqdm>=4.66.1
tqdm-multiprocess==0.0.11
traitlets==5.14.0
transformers==4.36.2
triton==2.1.0
typepy==1.3.2
typing_extensions==4.9.0
tzdata==2023.3
urllib3==1.26.13
uvicorn==0.25.0 AllTalk requests minimum of uvicorn>=0.24.0.post1
virtualenv==20.25.0
wandb==0.16.1
wcwidth==0.2.12
websockets==11.0.3
Werkzeug==3.0.1
xxhash==3.4.1
yarl==1.9.4
zstandard==0.22.0

Deepspeed CMD or Activation (minor) Issue

For starters - AllTalk and DeepSpeed are working perfectly fine, as intended.

I've added the following line to settings.yaml in order to auto-start DeepSpeed, and this is also working:

alltalk_tts-deepspeed_activate: True

Now, onto what is not working as expected... please close this if you deem this outside the scope of your project, and if so I apologize for taking up your time.

I maintain a Discord bot that integrates with textgen-webui, and it applies any settings defined in settings.yaml, or added by the bot script, directly to the module shared.settings (it imports the shared module, finds settings.yaml, and adds all the settings).
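For context, a minimal sketch of that approach (this is not the actual bot code, and it assumes Text-generation-webui's modules.shared layout):

    # Merge settings.yaml into Text-generation-webui's shared.settings,
    # much like the webui applies it at launch.
    import yaml
    from modules import shared  # Text-generation-webui's shared module

    with open("settings.yaml", "r", encoding="utf-8") as f:
        user_settings = yaml.safe_load(f)

    # Every key is applied, including extension keys such as
    # alltalk_tts-deepspeed_activate.
    shared.settings.update(user_settings)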

The settings that I use are all working just the same as if I launch the webui directly.

If I change one of the shared.settings parameters, it is immediately reflected in the responses.

It's working great. All except for deepspeed_activate (specifically with this bot script).

For all I know, it may actually be applying Deepspeed but just logging "False".

Here is a screenshot of the cmd output from my script, where I print shared.settings:

[screenshot: cmd output printing shared.settings]

I'm not expecting you to work out and resolve any potential errors with my script. However, I've checked what I can, and everything looks like it should work...

diagnostics.log

AllTalk was updated: 12/24/2023
Custom Python environment: Same env as textgen-webui
Text-generation-webUI was updated: 12/24/2023

finetuning name wrong: voice.json -> vocab.json

Just completed my first finetuned model and found this.
In the finetuning UI, in the "What to do next" tab it says:
In the folder /finetune/tmp-trn/training/XTTS_FT(date here)/ you will find your best_model.pth, config.json and voice.json (also the last epoch training best_model_xxx.pth)

"voice.json" is wrong, it should be "vocab.json".

Super minor, but someone might think they did something wrong if they can't find it.

export to wav broken

diagnostics.log

Nothing happens when I click export to wav with split set to 100.

To Reproduce
Steps to reproduce the behaviour:
I put in a text of 900k characters, waited until all 4200 chunks (at size 4) were completed, and clicked on export.

Screenshots
[screenshot]

All the console output was fine and looked like this; this is the last part, after I clicked export:

[AllTalk TTSGen] 5.87 seconds. LowVRAM: False DeepSpeed: True
[AllTalk TTSGen] Link’s laughter was carried upon the wind. β€œThen let’s go.” He whirled about. heel tipping off the edge of the plateau as he gave them all a wide grin. β€œWe’ve got a future to build.” He let himself freefall for a moment before tapping his Slate and vanishing into tendrils of blue upon Nayru’s winds.
[AllTalk TTSGen] 11.44 seconds. LowVRAM: False DeepSpeed: True

AllTalk was updated: v1.9
Custom Python environment: default
Text-generation-webUI was updated: latest, 7th Jan 2024

I checked in the outputs folder and nothing is more than 7MB. I've checked Task Manager and no process is taking more than 1% CPU. I've waited more than 30 minutes, and clicked a few more times after 10 minutes just to be sure.

πŸ”„ AllTalk Minor updates/bug fixes list

🟩 AllTalk minor updates/bug fixes/new features (This is for version 1.x of AllTalk)

If you have an issue, I will be keeping a list here of any minor updates that I make to fix those issues. If an issue you are experiencing is in the list, please follow the updating instructions.

Help with known problems can be found here

πŸŸͺ Changelog

19th June 2024

  • AllTalk - docker - Merged in updates for the docker build files #249

14th June 2024

  • AllTalk - atsetup.sh - Forced the PyTorch version back to 2.2.1 on standalone installs for v1.9 of AllTalk.

14th May 2024

  • AllTalk - Finetuning - Updated file path handling to deal with Gradio 4.xx versions.

1st May 2024

  • AllTalk - Narrator - Low VRAM & Narrator, when used together, should now be faster. Changed the way the AI model is moved about in this circumstance, which should reduce generation time where Low VRAM is used, shaving X seconds off a standard-length generation. (The general pattern is sketched below.)
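For anyone curious, the general Low VRAM pattern is to keep the model in system RAM and only hold it on the GPU around generation. A minimal sketch (the model object and its generate call are placeholders, not AllTalk's real API):

    import torch

    def generate_low_vram(model, *args, **kwargs):
        model.to("cuda")              # move weights onto the GPU for this job
        try:
            return model.generate(*args, **kwargs)  # placeholder generate call
        finally:
            model.to("cpu")           # hand the VRAM back afterwards
            torch.cuda.empty_cache()  # release cached blocks to the driver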

28th April 2024

  • AllTalk - TTS Generator - Amended text splitting to catch outlier scenarios.
  • AllTalk - Streaming - Added an endpoint for stopping Streaming mid generation (in testing currently).

26th April 2024

  • AllTalk - Finetuning - Custom folders for saving models. Learning rate selectable. Updated documentation.

8th April 2024

  • AllTalk - atsetup.sh - Linux specific. Corrected PyTorch download URL, due to a missing digit on the path.

5th April 2024

  • AllTalk - TTS Generator - Export WAV, updated crunker to a new version as it fixes export performance across a network.
  • AllTalk - atsetup.bat - Altered the way Windows 10 is handled as the old way appeared to have an issue on Windows 10.
  • AllTalk - Compacted some images down just to improve performance.

1st April 2024

  • AllTalk - Sorry, I fluffed up DeepSpeed in ATsetup (not joking, though I know this is an April 1st update). Please git pull, or follow the instructions directly below. The updated ATsetup should clear the fault now! Sorry, sorry.
  • AllTalk - Vastly improved the start-up screen. Added additional useful information/checks. Removed large DeepSpeed text output. Added a Github last updated check, so people know when AllTalk had its last change made. This is detailed in the help section under Understanding the AllTalk start-up screen.

29th March 2024

  • AllTalk - DeepSpeed setup issue with ATSetup.bat has been corrected.

    1. In your alltalk_tts folder, perform a git pull
    2. Run atsetup.bat.
    3. Select option 2 AllTalk as a Standalone Application
    4. Select option 3 Re-Apply/Update the requirements file.

If you cannot git pull (AllTalk installed from a ZIP file), you can download the updated atsetup.bat from here, saving it over the top of the one in the alltalk_tts folder, then follow from step 2 above.

28th March 2024

  • AllTalk & Finetuning - Streamlined the interface. Added additional Pre-flight checks and terminal/console warnings. Improved the documentation throughout. Removed the need to install the Nvidia CUDA Toolkit v11.8, though it will still be required for compiling DeepSpeed on Linux systems.
  • AllTalk - TTS Generator - Squashed various interface bugs. The ID list buttons are now dynamic and enable/disable as necessary. Resolved a few outlier issues with editing the text of generated TTS when trying to regenerate.
  • AllTalk - TTS Generator - Added 2x new features. TTSDiff will compare the original text to the generated TTS and let you know which generated TTS clips are bad (on a best-effort basis). TTSSRT can be used to generate subtitle files, for things such as video audiobooks.
  • AllTalk - Moved up the Python requirements. Simplified the requirements installation files. Updated ATsetup. Updated Diagnostics. Re-wrote all documentation and applied it to both Github & the built-in documentation. Cleaned up the whole file structure. Tested across Windows & Linux. All in about 12-14 days' work.

25th March 2024

  • AllTalk & Finetuning - Further cleaned up instructions. Cleaned up Pre-flight check. Added a refresh to the dropdowns on step 3.

22nd March 2024

  • AllTalk - Documentation - Provided a large update to the Github documented instructions. Built in documentation to be done at a later date.

17th March 2024

  • AllTalk - TTS Generator - Stopped the currently playing ID number from resetting to 1 when pressing Stop.

16th March 2024

  • AllTalk & Finetuning - Added a custom tokenizer for Japanese. Also added nice big warning messages if you don't pass the Pre-flight check.
  • AllTalk & DeepSpeed - Built the DeepSpeed v14 wheel for CUDA 11.8 with PyTorch 2.2.1. Option added to windows atsetup.bat.

16th March 2024

  • AllTalk & Finetuning - Finetuning now has a pre-flight check system, with additional help documentation, to ensure your system is configured correctly for Finetuning. The interface was given a little overhaul. There is more yet to add/change, but I figured this is a good start. TTS is now version 0.22.0 across the board for all of AllTalk's apps.
  • AllTalk - Diagnostics were updated to make things a little cleaner to look through and understand.

11th March 2024

  • AllTalk & Finetuning - Reduced the thread count when working with Japanese language training (a limitation of the external training scripts). Improved some documentation. Moved Whisper models to 32-bit floats, allowing older non-RTX cards to work (no noticeable impact on speed).
  • AllTalk & Text-gen-webui - Specifically when used with the Stable Diffusion Plugin. AllTalk will now strip any images before TTS generation, generate the TTS, then re-insert the image back into the chat when handing the audio and text string back to Text-generation-webui. It should also be noted that the Stable Diffusion Plugin will remove text from the generation, so you need to consider the load order of plugins; link to details here.

7th March 2024

21st Feb 2024

  • AllTalk - Currently only for Text-generation-webui - Attempted to allow the Chinese character set to pass through (not sure if it will or won't work with the Narrator).

5th Feb 2024

  • AllTalk extended API text length to 2000 characters.

24th Jan 2024

  • AllTalk - A new version of Transformers has been released (4.37.1), which resolves the prior loading issue.
  • AllTalk - Documentation/Github - Added another link to some sample new voices (as yet unsure of the quality).

22nd Jan 2024

  • AllTalk - A new version of Transformers has been released (4.37, https://github.com/huggingface/transformers/releases) which causes a load/import problem: ImportError: cannot import name 'SampleOutput' from 'transformers.generation.utils'. At this time I'm unsure if this is a bug in their code, as I cannot find any breaking changes currently. I have forced pip install transformers==4.36.2 in the requirements files.

21st Jan 2024

  • AllTalk - Simplified changing the allowed start-up duration, for people with older machines. At the top of script.py there is now startup_wait_time = 120. Help documentation updated accordingly.
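In other words, near the top of script.py there is a plain variable you can raise for older or slower machines, e.g.:

    # Default is 120; raise it if your machine needs longer to start the server.
    startup_wait_time = 240  # seconds allowed for AllTalk's start-up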

19th Jan 2024

  • AllTalk - TTS Generator - Removed the hard-coded IP of the AllTalk server, so that it will dynamically point to the correct location on generation requests.

18th Jan 2024

  • AllTalk - Text-gen-webui temperature and repetition sliders now within the main interface.
  • AllTalk - SillyTavern - Following a couple of changes to AllTalk's ST extension, ST have approved the extension into the staging area.

15th Jan 2024

  • AllTalk - TTS Generator Added a warning message if people try to run it from its disk location vs its URL address.
  • AllTalk - TTS Generator Push mimetype of application/javascript (for exporting on TTS generator).

13th Jan 2024

  • AllTalk - Version update Added additional API endpoints for streaming generation and server status. Added atsetup utility for Windows & Linux systems to streamline installation & maintenance both with Text-gen-webui and Standalone installation routines. Added SillyTavern support (yet to send PR to SillyTavern). Documentation created along with installation videos for the atsetup utility. Text-gen-webui interface cleaned up to look a bit nicer. Cleaned up a few console outputs. Added cutlet and unidic-lite to assist generation on non-Japanese enabled computers with Japanese TTS.
  • AllTalk - TTS API Corrected missing _combined.wav on non-timestamped narrator generations.
  • AllTalk - TTS Generator/API Timestamps - Corrected by applying a short UUID to the timestamp, to avoid two filenames being generated within the same second for short sentences, which resulted in one of the files being overwritten by the other.
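A minimal sketch of that timestamp-plus-short-UUID naming (the prefix and exact format here are illustrative, not AllTalk's exact scheme):

    import uuid
    from datetime import datetime

    def unique_wav_name(prefix: str = "TTS") -> str:
        stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
        short_id = uuid.uuid4().hex[:8]  # short UUID fragment
        # Two files made within the same second now still differ.
        return f"{prefix}_{stamp}_{short_id}.wav"

    print(unique_wav_name())  # e.g. TTS_20240113_142501_3fa85f64.wav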

8th Jan 2024

  • AllTalk - TTS Generator - Added export batch file splitting to avoid the 1GB limit when combining wavs in the browser. Smaller batches can also reduce memory overhead, so this is good for systems with less memory. Ran a test generating TTS that was 57,911 words long. (The batching idea is sketched below.)
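The batching idea, sketched in Python for clarity (the real combining happens in the browser; the 1GB ceiling and function name here are illustrative):

    import os

    def split_into_batches(paths, max_bytes=1_000_000_000):
        """Group wav files so no single combine job exceeds max_bytes."""
        batches, current, size = [], [], 0
        for path in paths:
            fsize = os.path.getsize(path)
            if current and size + fsize > max_bytes:
                batches.append(current)   # close the full batch
                current, size = [], 0
            current.append(path)
            size += fsize
        if current:
            batches.append(current)       # keep the final partial batch
        return batches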

7th Jan 2024

  • AllTalk - TTS Generator - Added page pagination for large TTS generations to reduce browser memory use. Cleared some browser audio cache issues to further reduce memory use. Added a "No Playback" option, which will be good for very large generations (20,000+ word scenarios) to keep memory down (as yet untested at that word count).

6th Jan 2024

  • AllTalk - TTS Generator - Added the TTS Generator, which is designed for creating TTS of any length, from as large an amount of text as you want. You are able to individually edit/regenerate sections after all the TTS is produced, and export out to one wav file. You can also stream TTS if you just want text played back, or even push the audio output to wherever AllTalk is currently running, at the terminal/command prompt.

5th Jan 2024

  • AllTalk - Updated filtering to allow Hungarian Ε‘ and Ε± characters to pass through correctly.

4th Jan 2024

  • AllTalk - Updated filtering to allow Cyrillic characters to pass through correctly.
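Both of these filtering fixes amount to allow-listing extra Unicode ranges so the characters survive text clean-up. A hypothetical illustration (not AllTalk's actual filter):

    import re

    # Keep basic Latin, digits, common punctuation, Cyrillic (U+0400-U+04FF)
    # and the Hungarian letters ő/ű (plus capitals); strip everything else.
    DISALLOWED = re.compile(r"[^A-Za-z0-9\s.,!?'\"\-\u0400-\u04FFőűŐŰ]")

    def clean(text: str) -> str:
        return DISALLOWED.sub("", text)

    print(clean("idő хорошо!"))  # ő and the Cyrillic word both pass through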

3rd Jan 2024

  • AllTalk - Thanks to @nicobubulle. Added a greedy option to avoid apostrophes being removed, and added accented characters for foreign languages.

2nd Jan 2024

  • AllTalk - Ms Word Add-in Added a proof-of-concept MS Word add-in to stream selected text to speech from within documents. This is purely POC and not production, hence support on this will be limited.

1st Jan 2024

  • AllTalk - Narrator Given another rebuild and a final upgrade. This version passed all of my tests. AI Systems will still not follow the rules all the time, see here for details, but it should give a good level of control.

  • AllTalk - Updated Narrator; the updated filtering is now also set on the API.

  • AllTalk - Streaming audio & Separation/tidy of built in documentation. A big thanks to @rbruels who has managed to get streaming working through the demo page within the built in documentation. This now allows for a lot of other opportunities in future. The built in documentation has also been split out of the main code base, allowing for much easier editing & management of both the code and the documentation.

30th Dec 2023

  • Finetuning - Simplified Added an additional routine to give 2x possible locations to send your compacted model to. Built in the routine to compact any legacy models. Tidied up the interface a bit. Cleaned up the built in documentation & added some additional documentation on Github. Corrected the gitignore file not ignoring the finetune folder.

29th Dec 2023

  • AllTalk - Additional API Endpoints and Playback - 3x additional API endpoints, providing a Ready status, a list of available voices, and a preview voice option. The API now also supports playing the generated TTS at the terminal/command prompt where the script is running, through that machine's local audio device. Full details in the API section.

  • AllTalk - 4th Model Loader for Finetuned models - The finetuning process now moves models to /models/trainedmodel/. As long as a model is detected in this location when AllTalk starts up, a 4th model loader will become available in the Gradio interface so that you can directly load the model.

28th Dec 2023

  • AllTalk & Finetuning - Larger update on Finetuning to simplify the last steps, as well as compact down the model. Added a separate compaction script for people "stuck" with large models here. Improved the Narrator text-splitting function, though I am still hunting out some outlier situations (it varies LLM model to model, so they are harder to track down).

27th Dec 2023

  • AllTalk - Standalone API Fix possible lost TTS segments. Same as the fix on Dec 25th, but for the standalone mode API. This will have no bearing for anyone who is just using AllTalk normally and not in standalone mode.

25th Dec 2023

  • AllTalk - Fix possible lost TTS segments. Applied a small update to avoid a possible race condition on file naming with small sentences when generating narrator/character speech. This would fix small sentences sometimes being lost.

24th Dec 2023

  • Finetuning - MP3 & Flac issue. Corrected finetuning not correctly picking up MP3 and Flac file names.
  • Finetuning - speakers_xtts.pth file missing - The issue is with TTS 0.22.0. I have set a downgrade in the requirements_finetune.txt file to TTS 0.21.3 while I get an answer/solution from Coqui. Re-run pip install -r requirements_finetune.txt if you get this issue.

πŸ”„ Compacting an existing finetuned model.

With the older versions of finetune (pre 28th December 2023), it wasn't compacting models (due to the fact that the "correct" code for doing this had changed). I've now gotten hold of the updated/correct code and it has been integrated into finetune.py, so this will not be an issue moving forwards.

For people stuck with large (5GB models) who want to compact them, you will need the updated version of AllTalk https://github.com/erew123/alltalk_tts#-updating

This process has now been built into Finetune. You would:

  1. Copy the 5GB model.pth file into the /finetune/ folder and rename it best_model.pth

  2. Start up finetune.py and go to the final tab.

  3. There is a button at the bottom called Compact a legacy finetuned model. Click the button and wait for the on-screen prompt to say it's completed.

  4. In the /finetune/ folder you should now have both your best_model.pth and model.pth.
    The model.pth is your new compressed file. Copy it back to your loader location, confirm that it works, then you can delete your best_model.pth file.
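For the curious, here is a minimal sketch of what compacting does, assuming a standard Coqui-style checkpoint layout (the key names may differ; the built-in button handles all of this for you):

    import torch

    # Load the full 5GB training checkpoint onto the CPU.
    ckpt = torch.load("best_model.pth", map_location="cpu")

    # Keep only the model weights; drop optimizer/scheduler/training state,
    # which is what makes the file so large.
    compact = {"model": ckpt["model"]}
    torch.save(compact, "model.pth")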

Deepspeed installation errors.

I've been trying to add DeepSpeed, as we really need that extra speed; this is on a 4090, and we want to be at around 400-700ms for 20-second audio files.

On Ubuntu 22.04.

I followed the alltalk_tts DeepSpeed instructions.

I'm getting the same error as microsoft/DeepSpeed#3531.

I tried adding "export CUDA_HOME=/usr/local/cuda" to start_linux.sh,
but each time I re-run start_linux.sh and then go to ./cmd_linux.sh and run:

izzy@izzy-System-Product-Name:~/text-generation-webui$ ds_report

I get

File "/home/izzy/text-generation-webui/installer_files/env/lib/python3.11/subprocess.py", line 1026, in init
self._execute_child(args, executable, preexec_fn, close_fds,
File "/home/izzy/text-generation-webui/installer_files/env/lib/python3.11/subprocess.py", line 1950, in _execute_child
raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: '/home/izzy/text-generation-webui/installer_files/env/bin/nvcc'

If I run

export CUDA_HOME=/usr/local/cuda

manually, I still get the same error as microsoft/DeepSpeed#3531,

which is

File "/home/izzy/text-generation-webui/installer_files/env/lib/python3.11/site-packages/deepspeed/comm/comm.py", line 677, in mpi_discovery
from mpi4py import MPI
ModuleNotFoundError: No module named 'mpi4py'

If I go
./cmd_linux.sh
then manually run
export CUDA_HOME=/usr/local/cuda
then run
ds_report

I get:

izzy@izzy-System-Product-Name:~/text-generation-webui$ ./cmd_linux.sh
(/home/izzy/text-generation-webui/installer_files/env) izzy@izzy-System-Product-Name:~/text-generation-webui$ export CUDA_HOME=/usr/local/cuda
(/home/izzy/text-generation-webui/installer_files/env) izzy@izzy-System-Product-Name:~/text-generation-webui$ ds_report
[2023-12-17 09:13:30,168] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
async_io ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_lion ............... [NO] ....... [OKAY]
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
evoformer_attn ......... [NO] ....... [NO]
fused_lamb ............. [NO] ....... [OKAY]
fused_lion ............. [NO] ....... [OKAY]
inference_core_ops ..... [NO] ....... [OKAY]
cutlass_ops ............ [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
ragged_device_ops ...... [NO] ....... [OKAY]
ragged_ops ............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.1
 [WARNING]  using untested triton version (2.1.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/izzy/text-generation-webui/installer_files/env/lib/python3.11/site-packages/torch']
torch version .................... 2.1.2+cu121
deepspeed install path ........... ['/home/izzy/text-generation-webui/installer_files/env/lib/python3.11/site-packages/deepspeed']
deepspeed info ................... 0.12.5, unknown, unknown
torch cuda version ............... 12.1
torch hip version ................ None
nvcc version ..................... 12.3
deepspeed wheel compiled w. ...... torch 2.1, cuda 12.1
shared memory (/dev/shm) size .... 15.54 GB
(/home/izzy/text-generation-webui/installer_files/env) izzy@izzy-System-Product-Name:~/text-generation-webui$ 
