
📺 Auto-Subtitles

license

About • Features • Installation • Usage

banner

About

Auto-Subtitles is a CLI tool that automatically generates and embeds subtitles for any YouTube video. It can also translate the transcript prior to embedding the subtitles in the output video.

Why Should You Use It?

Prior to the advancement of automatic speech recognition (ASR), transcription was often seen as a tedious manual task that required listening to the given audio meticulously.

I studied and interned in the film and media industry prior to working as a Machine Learning/Platform Engineer. I was involved in several productions that required manually transcribing audio and overlaying subtitles via video-editing software for various advertisements and commercials.

With OpenAI's Whisper models garnering favourable interest from developers due to their ease of local processing and high accuracy in languages such as English, they soon became a viable (free) drop-in replacement for professional (paid) transcription services.

While far from perfect, Auto-Subtitles generates transcriptions on your local setup and is easy to set up and use from the get-go. The CLI tool can serve as the initial phase of the subtitling process, generating a first draft of the transcript that a human can vet and edit before producing the final output. This reduces the time-intensive process of scrubbing audio and typing every single word from scratch.

Features

Supported Models

Currently, the auto-subtitles workflow supports the following variants of the Whisper model:

  1. @ggerganov/whisper.cpp:
    • Provides the whisper-cpp backend for the workflow.
    • A port of OpenAI's Whisper model in C/C++ that generates fast transcriptions on local setups (especially macOS via MPS).
  2. @jianfch/stable-ts:
    • Provides the faster-whisper backend for the workflow, while producing more reliable and accurate timestamps for transcription (see the sketch after this list).
    • Also includes VAD filters to detect voice activity more accurately.
  3. @Vaibhavs10/insanely-fast-whisper [Experimental]:
    • Leverages Flash Attention 2 (or Scaled Dot Product Attention) and batching to improve transcription speed.
    • Works only on GPU setups (cuda or mps) at the moment.
    • Supports only the large, large-v2, and large-v3 models.
    • No default support for max segment length; currently uses self-implemented heuristics for segment length adjustment.
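
As a rough illustration of what the faster-whisper backend does under the hood, here is a minimal sketch that calls the faster-whisper library directly (the workflow drives it through stable-ts; the audio file name and model/compute settings below are placeholders):

# Minimal sketch, not the project's actual code: VAD-filtered transcription
# with faster-whisper. "audio.wav" and the settings are placeholders.
from faster_whisper import WhisperModel

model = WhisperModel("medium", device="cpu", compute_type="int8")
segments, info = model.transcribe("audio.wav", vad_filter=True)

print(f"Detected language: {info.language} (p={info.language_probability:.2f})")
for seg in segments:
    print(f"[{seg.start:6.2f} --> {seg.end:6.2f}] {seg.text.strip()}")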

Translation

In Auto-Subtitles, we also include the functionality to translate transcripts, e.g., from English (en) to Chinese (zh), prior to embedding subtitles on the output video.

We did not opt to use the translation feature built into the Whisper model itself, due to observed performance issues and hallucinations in the generated transcripts.

To support a more efficient and reliable translation process, we use Meta AI's No Language Left Behind (NLLB) family of models for translation post-transcription.

Currently, the following models are supported:

  1. facebook/nllb-200-1.3B
  2. facebook/nllb-200-3.3B
  3. facebook/nllb-200-distilled-600M
  4. facebook/nllb-200-distilled-1.3B

By default, the facebook/nllb-200-distilled-600M model is used.
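
As a hedged sketch of how such a model can be invoked (not the project's exact code; the example sentence and generation parameters are illustrative), translating a transcript segment with Hugging Face transformers looks roughly like this:

# Translate one transcript segment from English to Simplified Chinese with NLLB.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "facebook/nllb-200-distilled-600M"
tokenizer = AutoTokenizer.from_pretrained(model_name, src_lang="eng_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

inputs = tokenizer("Subtitles make videos accessible to everyone.", return_tensors="pt")
tokens = model.generate(
    **inputs,
    # Force decoding into the target language via its FLORES-200 code
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("zho_Hans"),
    max_length=128,
)
print(tokenizer.batch_decode(tokens, skip_special_tokens=True)[0])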

Installation

For this project, you can set up the requirements/dependencies and environment either locally or in a containerised environment with Docker.

Local Setup

Pre-requisites

  1. ffmpeg

    Install it via your platform's package manager, as referenced from @openai/whisper:

    # on Ubuntu or Debian
    sudo apt update && sudo apt install ffmpeg
    
    # on Arch Linux
    sudo pacman -S ffmpeg
    
    # on MacOS using Homebrew (https://brew.sh/)
    brew install ffmpeg
    
    # on Windows using Chocolatey (https://chocolatey.org/)
    choco install ffmpeg
    
    # on Windows using Scoop (https://scoop.sh/)
    scoop install ffmpeg
  2. Python 3.9

  3. whisper.cpp

    # build the binary for usage
    git clone https://github.com/ggerganov/whisper.cpp.git
    
    cd whisper.cpp
    make
    • Please refer to the upstream repository for other build arguments relevant to your local setup for better performance.
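
Once built, the binary can be invoked directly, which is roughly what the workflow does when [--backend whisper-cpp] is selected. A hedged sketch from Python (paths and the audio file are placeholders for your local setup):

# Sketch only: invoke the pre-built whisper.cpp binary to emit an .srt file.
import os
import subprocess

whisper_bin = os.path.expanduser("~/code/whisper.cpp/main")                   # built with `make`
model_path = os.path.expanduser("~/code/whisper.cpp/models/ggml-medium.bin")  # ggml model file

subprocess.run(
    [
        whisper_bin,
        "-m", model_path,
        "-f", "audio.wav",  # 16 kHz mono WAV input
        "-t", "8",          # number of threads
        "-ml", "47",        # max segment length in characters
        "-osrt",            # write subtitles as .srt next to the input
    ],
    check=True,
)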

Python Dependencies

Install the dependencies in requirements.txt into a virtual environment (virtualenv):

python -m venv .venv

# activate the environment (macOS/Linux)
source .venv/bin/activate

# install dependencies
pip install --upgrade pip setuptools wheel
pip install -r requirements.txt

Docker Setup

To run the workflow using Docker, first build the image:

# build the image
docker buildx build -t auto-subs .

Usage

Transcribing

To run the automatic subtitling process on a video, simply run the following command (see Detailed Options below for advanced options):

Local

chmod +x ./workflow.sh

./workflow.sh -u https://www.youtube.com/watch?v=fnvZJU5Fj3Q \
    -b faster-whisper \
    -t 8 \
    -m medium \
    -ml 47

Docker

# run the image
docker run \
   --volume <absolute-path>:/app/output \
   auto-subs \
   -u https://www.youtube.com/watch?v=fnvZJU5Fj3Q \
   -b faster-whisper \
   -t 8 \
   -ml 47

The above command runs the workflow with the following settings:

  1. Using the faster-whisper backend
    • More reliable and accurate timestamps as opposed to whisper.cpp, using VAD etc.
  2. Running on 8 threads for increased performance
  3. Using the openai/whisper-medium multilingual model
  4. Limiting the maximum length of each transcript segment to 47 characters (illustrated in the sketch after this list).
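
A hypothetical illustration of the effect of -ml/--max-length (the workflow's actual segment-splitting heuristics may differ):

# Wrap a transcript segment on word boundaries so no subtitle line
# exceeds the character limit (illustrative only).
import textwrap

def wrap_segment(text: str, max_len: int = 47) -> list[str]:
    return textwrap.wrap(text, width=max_len)

# Returns a list of lines, each at most 47 characters long
print(wrap_segment("Segments longer than the limit are wrapped onto additional subtitle lines."))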

The following is the generated video:

ollama-transcribed.mp4

Transcribing + Translating

To run the automatic subtitling process on a video and generate Chinese (zh) subtitles:

Local

chmod +x ./workflow.sh

./workflow.sh -u https://www.youtube.com/watch?v=DtLJjNyl57M \
    -b whisper-cpp \
    -wbp ~/code/whisper.cpp \
    -t 8 \
    -m medium \
    -ml 47 \
    -tf "eng_Latn" \
    -tt "zho_Hans"

Docker

# run the image
docker run \
   --volume <absolute-path>:/app/output \
   auto-subs \
   -u https://www.youtube.com/watch?v=DtLJjNyl57M \
   -b whisper-cpp \
   -t 8 \
   -ml 47 \
   -tf "eng_Latn" \
   -tt "zho_Hans"

The above command runs the workflow with the following settings:

  1. Using the whisper-cpp backend
    • Faster transcription process compared to faster-whisper.
    • However, it may produce degraded output with inaccurate timestamps or subtitles appearing early with no noticeable voice activity.
  2. Specifying the directory path (-wbp) to the pre-built binary of whisper.cpp to be used for transcription.
  3. Running on 8 threads for increased performance
  4. Using the openai/whisper-medium multilingual model
  5. Limiting the maximum length of each transcript segment to 47 characters.
  6. Translating from (-tf) English (eng_Latn) to (-tt) Chinese (zho_Hans), using FLORES-200 language codes (a sketch of the final embedding step follows this list).
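
As a rough sketch of the final embedding step (file names and the font are placeholders; the workflow's actual ffmpeg invocation may differ), the translated subtitles can be burned into the video with ffmpeg's subtitles filter:

# Burn an .srt file into the video via ffmpeg's libass-based subtitles filter;
# force_style selects a font that has glyphs for the target script.
import subprocess

subprocess.run(
    [
        "ffmpeg", "-y",
        "-i", "input.mp4",
        "-vf", "subtitles=output.zh.srt:force_style='FontName=Arial Unicode MS'",
        "output.mp4",
    ],
    check=True,
)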

The following is the generated video:

coding-zero-to-hero-translated.mp4

Detailed Options

To check all the available options, use the --help flag:

./workflow.sh --help

Usage: ./workflow.sh [-u <youtube_video_url>] [options]
Options:
  -u, --url <youtube_video_url>                       YouTube video URL
  -o, --output-path <output_path>                     Output path
  -b, --backend <backend>                             Backend to use: whisper-cpp or faster-whisper
  -wbp, --whisper-bin-path <whisper_bin_path>         Path to whisper-cpp binary. Required if using [--backend whisper-cpp].
  -ml, --max-length <max_length>                      Maximum length of the generated transcript
  -t, --threads <threads>                             Number of threads to use
  -w, --workers <workers>                             Number of workers to use
  -m, --model <model>                                 Model name to use
  -tf, --translate-from <translate_from>              Translate from language
  -tt, --translate-to <translate_to>                  Translate to language
  -f, --font <font>                                   Font to use for subtitles

[WIP] Performance

For the mps device, I am running performance testing on a MacBook Pro (14-inch, 2023) with an M2 Max (12 CPU / 30 GPU cores).

Transcription

| Model  | Backend        | Device | Threads | Time Taken |
| ------ | -------------- | ------ | ------- | ---------- |
| base   | whisper-cpp    | cpu    | 4       | ~          |
| base   | whisper-cpp    | mps    | 4       | ~          |
| base   | faster-whisper | cpu    | 4       | ~          |
| base   | faster-whisper | mps    | 4       | ~          |
| medium | whisper-cpp    | cpu    | 4       | ~          |
| medium | whisper-cpp    | mps    | 4       | ~          |
| medium | faster-whisper | cpu    | 4       | ~          |
| medium | faster-whisper | mps    | 4       | ~          |

Transcription + Translation

| Model  | Backend        | Device | Threads | Time Taken |
| ------ | -------------- | ------ | ------- | ---------- |
| base   | whisper-cpp    | cpu    | 4       | ~          |
| base   | whisper-cpp    | mps    | 4       | ~          |
| base   | faster-whisper | cpu    | 4       | ~          |
| base   | faster-whisper | mps    | 4       | ~          |
| medium | whisper-cpp    | cpu    | 4       | ~          |
| medium | whisper-cpp    | mps    | 4       | ~          |
| medium | faster-whisper | cpu    | 4       | ~          |
| medium | faster-whisper | mps    | 4       | ~          |

Known Issues

  1. Korean subtitles are not supported at the moment.
    • Details: The default font used to embed subtitles is Arial Unicode MS, which does not provide glyphs for Korean characters.
    • Potential Solution: Add alternate fonts for Korean characters.
    • Status: ✅ Done

Changelog

  1. 🗓️ [24/02/2024]: Included a ./fonts folder to host downloaded fonts to be copied into the Docker container. Once copied, users can specify their desired fonts with the -f or --font flag.
