What if you could fulfill your dream of becoming a cute girl? Well, it's possible now (sort of).
- Audio transcription is done with Whisper.
- Translation is done with DeepL.
- Text to (cute) speech is done with Voicevox.
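The three steps above form a simple chain: audio in, text, translated text, audio out. Here is a minimal sketch of that flow. The function names (`make_pipeline`, `transcribe`, `translate`, `synthesize`) are made up for illustration and are not taken from this repo, and the stages are stand-ins so the snippet runs without Whisper, DeepL, or Voicevox.

```python
from typing import Callable


def make_pipeline(
    transcribe: Callable[[bytes], str],
    translate: Callable[[str], str],
    synthesize: Callable[[str], bytes],
) -> Callable[[bytes], bytes]:
    """Chain the three stages: audio -> text -> translated text -> audio."""

    def run(audio: bytes) -> bytes:
        text = transcribe(audio)
        if not text.strip():
            # Nothing was recognized, so skip translation and synthesis.
            return b""
        return synthesize(translate(text))

    return run


# Stand-in stages (hypothetical) so the sketch is runnable on its own.
pipeline = make_pipeline(
    transcribe=lambda audio: audio.decode("utf-8"),
    translate=lambda text: text.upper(),
    synthesize=lambda text: text.encode("utf-8"),
)
print(pipeline(b"hello"))  # b'HELLO'
```

Passing the stages in as plain callables keeps each one swappable, for example to move a stage to a remote machine later.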
Demo running on my laptop, CPU only: Screen.Recording.2023-05-07.at.12.12.45.AM.mov
- Install Docker for the Voicevox engine.
- Install Python 3.10 and Poetry. I recommend using asdf for this.
- Install dependencies with Poetry by running `poetry install`. If you don't want to use Poetry, check `pyproject.toml` for the Python and package versions.
- Rename/copy `config.template.py` to `config.py`.
- Download one of Whisper's models (https://github.com/openai/whisper#available-models-and-languages) and update `WHISPER_MODEL_PATH` in `config.py` with the path to the model file of your choice.
- Update the array `VOICE_OUTPUT_DEVICE_IDS` in `config.py` with the devices that you want the final voice to go to (e.g. speaker/headphone/"fake" microphone for voice chats).
- Set `SPEAKER_ID` in `voicevox_client/voice_config.py` to your desired speaker ID. See below for how to check the voices out.
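Before starting anything, it can help to sanity-check the two values edited in `config.py`. This is only a sketch: `check_config` is a hypothetical helper that is not part of this repo, and it assumes the device IDs are integer indices like the ones printed by `sounddevice`.

```python
import os


def check_config(model_path, device_ids):
    """Return a list of problems; an empty list means the values look usable."""
    problems = []
    if not os.path.isfile(model_path):
        problems.append(f"WHISPER_MODEL_PATH does not point to a file: {model_path}")
    if not device_ids:
        problems.append("VOICE_OUTPUT_DEVICE_IDS is empty, so no audio would be played")
    # Assumption: IDs are plain integers (bool is excluded on purpose).
    if any(isinstance(d, bool) or not isinstance(d, int) for d in device_ids):
        problems.append("VOICE_OUTPUT_DEVICE_IDS should contain integer device IDs")
    return problems
```

With the real config this would be called as `check_config(config.WHISPER_MODEL_PATH, config.VOICE_OUTPUT_DEVICE_IDS)` after `import config`.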
Start the Voicevox engine in one console:
# Depends on whether you have GPU or not
# With GPU
docker compose -f docker-compose.gpu.yml up
# Without GPU
docker compose -f docker-compose.cpu.yml up
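To confirm the engine is actually up before starting the program, something like the sketch below can be used. It assumes the engine is reachable at `http://localhost:50021` (the Voicevox engine's default port; the compose files in this repo may map a different one) and uses the engine's `/version` endpoint. `engine_version` is a hypothetical helper name.

```python
import json
import urllib.error
import urllib.request


def engine_version(base_url="http://localhost:50021", timeout=2.0):
    """Return the engine's version string, or None if it cannot be reached."""
    try:
        with urllib.request.urlopen(f"{base_url}/version", timeout=timeout) as resp:
            # The endpoint answers with a JSON-encoded string.
            return json.loads(resp.read().decode("utf-8"))
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, or an unexpected response body.
        return None


if __name__ == "__main__":
    print(engine_version() or "Voicevox engine is not reachable yet")
```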
Start the program in another console:
poetry run python main.py
# Or with a shell inside Poetry's virtualenv
poetry shell
python main.py
- Move Whisper audio transcription and the Voicevox engine to a cloud server with a GPU (or just Google Colab, if the internet connection is good) so that fewer local resources are needed and things run faster.
Run this inside a Python console with asyncio (`python -m asyncio`):
from voicevox_client.client import Client
with Client() as client:
    for speaker in client.fetch_speakers():
        print(speaker)
`speaker_uuid` from this output can be used to get more info about the speaker.
Each speaker has a `styles` array, and each element has its own `id` that can be used for speaker initialization/voice synthesis.
We can combine `speaker_uuid` and `id` to check voice samples from the get speaker info API.
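To make the speaker/style relationship easier to scan, the speakers list can be flattened into one row per style. This is a sketch that assumes each speaker is a dict with `speaker_uuid` and `styles` as described above; the `name` fields are an assumption about the response shape, and the sample data is made up.

```python
def list_styles(speakers):
    """Flatten speakers into (speaker name, speaker_uuid, style name, style id) rows."""
    rows = []
    for speaker in speakers:
        for style in speaker.get("styles", []):
            rows.append(
                (speaker.get("name"), speaker["speaker_uuid"], style.get("name"), style["id"])
            )
    return rows


# Made-up data in the shape described above (not real Voicevox speakers).
sample = [
    {
        "name": "Speaker A",
        "speaker_uuid": "uuid-a",
        "styles": [{"name": "normal", "id": 2}, {"name": "sweet", "id": 0}],
    }
]
for row in list_styles(sample):
    print(row)
```

The style `id` in the last column is the value to try as `SPEAKER_ID`.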
Run this inside a Python console with asyncio (`python -m asyncio`):
from voicevox_client.client import Client
with Client() as client:
    speaker = client.fetch_speaker_info("<speaker_uuid>")
    # speaker["portrait"] is a base64-encoded image
    # speaker["style_infos"] is an array where each element contains id (style id), portrait (base64-encoded image), icon (base64-encoded image), voice_samples (array of base64-encoded voice samples)
    # Sample code to write the base64-encoded data to a file:
    # import base64
    # decoded = base64.b64decode(speaker["style_infos"][0]["voice_samples"][0])
    # out_file = "test.wav"
    # with open(out_file, "wb") as file:
    #     file.write(decoded)
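If you want to listen to every sample instead of just the first one, the decoding step can be wrapped in a small loop. This is a sketch assuming the speaker info is a dict shaped as described in the comments above; `save_voice_samples` and the `<style id>_<n>.wav` naming are made up for illustration.

```python
import base64
from pathlib import Path


def save_voice_samples(speaker_info, out_dir):
    """Decode every base64 voice sample to <style id>_<n>.wav and return the paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for style in speaker_info["style_infos"]:
        for n, sample in enumerate(style["voice_samples"]):
            path = out / f"{style['id']}_{n}.wav"
            path.write_bytes(base64.b64decode(sample))
            written.append(path)
    return written
```

For example, `save_voice_samples(speaker, "samples")` would write the files into a `samples` folder next to where the console was started.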
Run this inside a Python console with asyncio (`python -m asyncio`):
from voicevox_client.client import Client
with Client() as client:
    with open("test.wav", "wb") as f:
        f.write(client.text_to_speech("交流できて嬉しいです", speaker_id=10))
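A quick way to check that the synthesized audio is valid without opening a player is to read its header with the standard library. This sketch assumes the returned bytes are a complete PCM WAV file, which matches writing them straight to `test.wav` above; `wav_info` is a hypothetical helper.

```python
import io
import wave


def wav_info(data):
    """Return channel count, sample rate and duration (seconds) of WAV bytes."""
    with wave.open(io.BytesIO(data), "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "sample_rate": rate,
            "seconds": frames / rate,
        }
```

For the file written above: `wav_info(open("test.wav", "rb").read())`. A duration of zero, or a `wave.Error`, would suggest the synthesis did not return usable audio.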
Run this inside a python console:
import sounddevice as sd
print(sd.query_devices())
Use something like VB-CABLE to forward the audio output of this program to a fake audio input device, then use that fake device as the audio input for your voice chat application. This should work with most games, Discord, and Zoom.
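To pick the IDs for `VOICE_OUTPUT_DEVICE_IDS` without reading the whole device table by eye, the list can be filtered for devices that can play audio, optionally by name. This is a sketch: `find_output_devices` is a hypothetical helper, it relies on the `name` and `max_output_channels` keys of the entries returned by `sd.query_devices()`, and the exact name of the virtual cable device differs between operating systems.

```python
def find_output_devices(devices, name_contains=""):
    """Return indices of devices that can play audio and whose name matches."""
    needle = name_contains.lower()
    return [
        index
        for index, device in enumerate(devices)
        if device["max_output_channels"] > 0 and needle in device["name"].lower()
    ]
```

With sounddevice installed, `find_output_devices(sd.query_devices(), "cable")` would list the candidates for the virtual cable, and calling it without a name lists every output device.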