Notes on usage:
- Make sure to change runtime to GPU.
- The transcript will be saved in Files, which you can find in the menu on the left.
- Change the number of speakers below if different from two.
- Pick a bigger model if you want more accuracy, or a smaller model if you want the program to run faster (more info).
- If you know the language being spoken is English, then change language to 'English' as this improves performance.
High-level overview of what's happening here:
- I'm using OpenAI's Whisper model to separate the audio into segments and generate transcripts.
- I'm then generating speaker embeddings for each segment.
- Then I'm using agglomerative clustering on the embeddings to identify the speaker for each segment.
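The clustering step above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the embeddings here are synthetic random vectors standing in for the output of a speaker-embedding model, and the dimensionality (192) and `num_speakers` value are assumptions.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Hypothetical embeddings: one vector per Whisper segment.
# In the real pipeline, each row would come from a speaker-embedding model.
rng = np.random.default_rng(0)
embeddings = np.vstack([
    rng.normal(0.0, 0.1, size=(3, 192)),  # three segments from one speaker
    rng.normal(1.0, 0.1, size=(3, 192)),  # three segments from another
])

num_speakers = 2  # matches the "number of speakers" setting above
clustering = AgglomerativeClustering(n_clusters=num_speakers).fit(embeddings)

# One speaker label per segment, e.g. "SPEAKER 1", "SPEAKER 2"
labels = [f"SPEAKER {label + 1}" for label in clustering.labels_]
print(labels)
```

Each transcript segment then gets tagged with its cluster's label, which is how consecutive segments from the same voice end up attributed to the same speaker.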
https://huggingface.co/blog/fine-tune-whisper
Let me know if I can make it better!