Cog implementation of transcribing + diarization pipeline with Whisper & Pyannote

Cog Whisper Diarization

Audio transcribing + diarization pipeline.

Models used

Used at Audiogest
Or try at Replicate
Or deploy it yourself on Replicate (make sure to add your own Hugging Face API key and accept the terms of use of the pyannote models used)

Input

file_string: str: Either provide a Base64-encoded audio file.
file_url: str: Or provide a direct audio file URL.
file: Path: Or provide an audio file.
group_segments: bool: Group consecutive segments from the same speaker that are less than 2 seconds apart. Default is True.
num_speakers: int: Number of speakers. Leave empty to autodetect. Must be between 1 and 50.
language: str: Language of the spoken words as a language code like 'en'. Leave empty to autodetect the language.
prompt: str: Vocabulary: provide names, acronyms, and loanwords in a list. Use punctuation for best accuracy.
offset_seconds: int: Offset in seconds, used for chunked inputs. Default is 0.
transcript_output_format: str: Specify the format of the transcript output: individual words with timestamps, full text of segments, or a combination of both.
- Default is both.
- Options are words_only, segments_only, and both.
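
As a rough illustration of how these parameters fit together, the sketch below assembles an input dictionary from the documented names. The helper `build_input` is hypothetical and not part of this repository. The checks mirror the constraints listed above, and requiring exactly one audio source is an assumption made for the example.

```python
import base64
from typing import Optional

# Output formats listed in the parameter description above.
ALLOWED_FORMATS = {"words_only", "segments_only", "both"}


def build_input(
    audio_bytes: Optional[bytes] = None,
    file_url: Optional[str] = None,
    num_speakers: Optional[int] = None,
    language: Optional[str] = None,
    prompt: Optional[str] = None,
    group_segments: bool = True,
    offset_seconds: int = 0,
    transcript_output_format: str = "both",
) -> dict:
    """Hypothetical helper: build an input dict using the documented names."""
    # Assumption: exactly one audio source is supplied per request.
    if (audio_bytes is None) == (file_url is None):
        raise ValueError("provide exactly one of audio_bytes or file_url")
    if num_speakers is not None and not 1 <= num_speakers <= 50:
        raise ValueError("num_speakers must be between 1 and 50")
    if transcript_output_format not in ALLOWED_FORMATS:
        raise ValueError(f"unknown format: {transcript_output_format!r}")

    payload = {
        "group_segments": group_segments,
        "offset_seconds": offset_seconds,
        "transcript_output_format": transcript_output_format,
    }
    if audio_bytes is not None:
        # file_string expects the audio as Base64 text.
        payload["file_string"] = base64.b64encode(audio_bytes).decode("ascii")
    else:
        payload["file_url"] = file_url

    # Optional fields are left out so they can be autodetected.
    if num_speakers is not None:
        payload["num_speakers"] = num_speakers
    if language:
        payload["language"] = language
    if prompt:
        payload["prompt"] = prompt
    return payload
```

For example, `build_input(file_url="https://example.com/meeting.mp3", language="en")` yields a dictionary with `file_url`, `language`, and the three defaults, while `num_speakers` is omitted so that the speaker count is detected.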

Output

segments: List[Dict]: List of segments, each with speaker, start time, and end time.
- Includes avg_logprob for each segment and probability for each word-level segment.
num_speakers: int: Number of speakers (detected, unless specified in input).
language: str: Language of the spoken words as a language code like 'en' (detected, unless specified in input).
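
To make the segment structure and the grouping rule more concrete, here is a minimal sketch that merges consecutive segments from the same speaker when the gap between them is under 2 seconds. It is not the pipeline's actual implementation. The function `merge_close_segments`, the key names `speaker`, `start`, `end`, and `text`, and the sample values are all assumptions made for illustration.

```python
from typing import Dict, List


def merge_close_segments(segments: List[Dict], max_gap: float = 2.0) -> List[Dict]:
    """Illustrative only: merge same-speaker segments less than max_gap apart."""
    merged: List[Dict] = []
    for seg in segments:
        prev = merged[-1] if merged else None
        if (
            prev is not None
            and prev["speaker"] == seg["speaker"]
            and seg["start"] - prev["end"] < max_gap
        ):
            # Extend the previous segment instead of starting a new one.
            prev["end"] = seg["end"]
            prev["text"] = f'{prev["text"]} {seg["text"]}'.strip()
        else:
            # Copy so the caller's input list is not modified.
            merged.append(dict(seg))
    return merged


# Made-up sample data in the assumed segment shape.
sample = [
    {"speaker": "SPEAKER_00", "start": 0.0, "end": 1.5, "text": "Hello"},
    {"speaker": "SPEAKER_00", "start": 2.0, "end": 3.0, "text": "there."},
    {"speaker": "SPEAKER_01", "start": 3.2, "end": 4.0, "text": "Hi."},
    {"speaker": "SPEAKER_00", "start": 7.0, "end": 8.0, "text": "Bye."},
]

for seg in merge_close_segments(sample):
    print(f'{seg["speaker"]} [{seg["start"]:.1f}-{seg["end"]:.1f}] {seg["text"]}')
```

In this sample the first two segments are merged because they share a speaker and are 0.5 seconds apart. The last segment stays separate because a different speaker talks in between.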