ai-rapper's Introduction

AI Rapper

Generate talking-head videos of your favorite rapper rapping about anything, using open-source NLP and TTS libraries.

Table of Contents

    📝 About
    💻 How to build
    🔧 Tools used
    👤 Contact

📝 About

Overview

  • Input: a prompt, a reference audio clip, and a reference photo.
  • Output: auto-generated rap lyrics in the style of the rapper, audio synthesized with the cloned voice, and a talking-head video.

Features

  • Intelligent text (lyrics): input a prompt and harness state-of-the-art LLMs to craft creative and engaging rap verses.

  • Synthetic audio (voice): a text-to-speech (TTS) system clones a voice from an audio sample and speaks the generated lyrics.

  • Talking Head (video): input a reference image and combine it with the generated audio to create a realistic, engaging talking head.

💻 How to build

Prerequisites

  • Clone MakeItTalk (used for video generation) from https://github.com/adobe-research/MakeItTalk/ into the root directory of ai-rapper.
  • Add an image of the rapper, exactly 256 x 256 pixels, to MakeItTalk/examples. The face should be clear and unobstructed. Example: MakeItTalk/examples/eminem.png
  • Add a .wav audio file (about 10-30 seconds) of the rapper to a separate audio_samples directory, e.g. audio_samples/eminem_00.wav
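
The size and duration requirements above can be checked before running the app. The following is a minimal sketch using only the Python standard library; the function names (`png_size`, `wav_duration_seconds`, `check_inputs`) are hypothetical and not part of the repo, and it assumes a PNG image and an uncompressed PCM .wav file.

```python
import struct
import wave

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_size(path):
    """Return (width, height) read from the IHDR chunk of a PNG file."""
    with open(path, "rb") as f:
        header = f.read(24)
    if len(header) < 24 or header[:8] != PNG_SIGNATURE or header[12:16] != b"IHDR":
        raise ValueError(f"{path} is not a valid PNG file")
    # Width and height are two big-endian 32-bit integers at bytes 16..23.
    return struct.unpack(">II", header[16:24])


def wav_duration_seconds(path):
    """Return the duration of a PCM .wav file in seconds."""
    with wave.open(str(path), "rb") as w:
        return w.getnframes() / w.getframerate()


def check_inputs(image_path, audio_path, min_sec=10.0, max_sec=30.0):
    """Return a list of problems found; the list is empty if inputs look fine."""
    problems = []
    width, height = png_size(image_path)
    if (width, height) != (256, 256):
        problems.append(f"image is {width} x {height}, expected 256 x 256")
    duration = wav_duration_seconds(audio_path)
    if not (min_sec <= duration <= max_sec):
        problems.append(
            f"audio is {duration:.1f} s, expected {min_sec:.0f}-{max_sec:.0f} s"
        )
    return problems
```

For example, `check_inputs("MakeItTalk/examples/eminem.png", "audio_samples/eminem_00.wav")` returns an empty list when both files meet the requirements.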

Install dependencies and run

pip install -r requirements.txt
python src/app.py

Output

Look for the generated video in MakeItTalk/examples. Sample console output:

/tmp/tmpx_swo6p1eminem_00.wav
/tmp/tmp7zx0u65zem.png
Audio-----> tmpx_swo6p1eminem_00.wav
Parameters===== tmpx_swo6p1eminem_00.wav 48000 [-29 -36 -43 ... 120 125 124]
Loaded the voice encoder model on cuda in 0.04 seconds.
Processing audio file tmpx_swo6p1eminem_00.wav
Loaded the voice encoder model on cuda in 0.03 seconds.
source shape: torch.Size([1, 576, 80]) torch.Size([1, 256]) torch.Size([1, 256]) torch.Size([1, 576, 257])
converted shape: torch.Size([1, 576, 80]) torch.Size([1, 1152])
Run on device: cuda
======== LOAD PRETRAINED FACE ID MODEL examples/ckpt/ckpt_speaker_branch.pth =========
....
....
....
====================================
z = torch.tensor(torch.zeros(aus.shape[0], 128), requires_grad=False, dtype=torch.float).to(device)
OpenCV: FFMPEG: fallback to use tag 0x7634706d/'mp4v'
examples/tmpx_swo6p1eminem_00.wav
ffmpeg version 4.4.2-0ubuntu0.22.04.1 Copyright (c) 2000-2021 the FFmpeg developers
....
....
....
OpenCV: FFMPEG: tag 0x67706a6d/'mjpg' is not supported with codec id 7 and format 'mp4 / MP4 (MPEG-4 Part 14)'
OpenCV: FFMPEG: fallback to use tag 0x7634706d/'mp4v'
Time - ffmpeg add audio: 15.704241514205933
finish image2image gen
examples/test_pred_fls_tmpx_swo6p1eminem_00_audio_embed.mp4

🔧 Tools used

Python, Hugging Face Transformers, PyTorch Audio, CUDA Toolkit, Tortoise TTS, FFMPEG, OpenCV

NLP

Hugging Face Transformers library

  • Uses pre-trained and fine-tuned language models to generate rap lyrics.
  • AutoModelForCausalLM loads a causal language model, which generates text by predicting the next token from the preceding ones only, never from those that follow. This suits specific creative tasks such as rap lyrics: a model trained on vast amounts of diverse text can continue a user prompt coherently, in context, and in a given style.
  • AutoTokenizer converts input prompts into the tokens the model expects. The model used is DistilGPT2 (a distilled, more efficient version of GPT-2). See usage in src/text_generation/text_generator.py
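
As a rough illustration of how these pieces fit together, the sketch below loads DistilGPT2 through `AutoTokenizer` and `AutoModelForCausalLM` and samples a continuation of a prompt. It is not the code from src/text_generation/text_generator.py: the prompt template, the function names, and the sampling settings (`top_p`, `temperature`, `max_new_tokens`) are assumptions chosen for the example.

```python
def build_prompt(rapper, topic):
    """Format a plain-text prompt (hypothetical template, not from the repo)."""
    return f"A rap verse by {rapper} about {topic}:\n"


def generate_lyrics(prompt, model_name="distilgpt2", max_new_tokens=80):
    """Sample a continuation of `prompt` from a causal language model."""
    # Imported lazily so build_prompt stays usable without the heavy dependency.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)

    # Tokenize the prompt into PyTorch tensors (input_ids, attention_mask).
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=True,    # sample rather than decode greedily, for variety
        top_p=0.95,        # assumed value
        temperature=0.9,   # assumed value
        pad_token_id=tokenizer.eos_token_id,  # GPT-2 defines no pad token
    )
    # The decoded text contains the prompt followed by the generated verse.
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


if __name__ == "__main__":
    print(generate_lyrics(build_prompt("Eminem", "debugging at 3 a.m.")))
```

Because sampling is enabled, each run produces different lyrics. A fine-tuned checkpoint could be used instead by passing its name or path as `model_name`.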

TTS

Tortoise TTS

  • Used for synthesizing audio from text
  • Supports custom voice models to mimic specific rappers' voices

CUDA Toolkit

  • A custom TTS model was trained on Eminem's voice (as in the example).
  • NVIDIA's CUDA Toolkit accelerates training on the GPU.

PyTorch Audio

  • The torchaudio library handles audio data and saves the synthesized rap audio as .wav files.
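
The synthesis and saving steps described above might look roughly like the sketch below. It follows the usage shown in the Tortoise TTS README (`TextToSpeech`, `load_audio`, `tts_with_preset`) and saves the result with `torchaudio.save`. The function name `synthesize`, the `fast` preset, and the output path are assumptions. The repo's own code may differ, for example by using its custom-trained voice model.

```python
# Tortoise produces audio at 24 kHz and loads reference clips at 22.05 kHz
# (values taken from the Tortoise TTS README and scripts).
TORTOISE_OUTPUT_RATE = 24000
REFERENCE_CLIP_RATE = 22050


def synthesize(lyrics, reference_wavs, output_path="generated.wav", preset="fast"):
    """Speak `lyrics` in the voice of the reference clips and save a .wav file."""
    # Imported lazily: these packages are large and typically need a GPU.
    import torchaudio
    from tortoise.api import TextToSpeech
    from tortoise.utils.audio import load_audio

    # Load each reference clip as a tensor for voice conditioning.
    voice_samples = [load_audio(p, REFERENCE_CLIP_RATE) for p in reference_wavs]

    tts = TextToSpeech()
    audio = tts.tts_with_preset(lyrics, voice_samples=voice_samples, preset=preset)

    # Drop the batch dimension and move to CPU before writing to disk.
    torchaudio.save(output_path, audio.squeeze(0).cpu(), TORTOISE_OUTPUT_RATE)
    return output_path


if __name__ == "__main__":
    synthesize("Hypothetical verse goes here.", ["audio_samples/eminem_00.wav"])
```

Faster presets trade quality for speed, so a higher-quality preset may be worth the extra time for a final render.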

Talking Head generation

MakeItTalk

  • Open-source GitHub repo used for video synthesis, built on OpenCV and FFMPEG
  • Demo: https://github.com/yzhou359/MakeItTalk/blob/main/quick_demo_tdlr.ipynb

OpenCV

  • Used to segment facial features in the input image and lip-sync them to the audio

FFMPEG

  • Used to combine the generated video and audio into a single, widely compatible output file
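
MakeItTalk runs this step internally (the log line "Time - ffmpeg add audio" above), so no extra code is needed to use the project. Purely as an illustration of the kind of call involved, the sketch below builds an ffmpeg command that adds an audio track to a silent video. The function names and the codec choices are assumptions, not taken from MakeItTalk.

```python
import subprocess


def build_mux_command(video_path, audio_path, output_path):
    """Build an ffmpeg command that adds an audio track to a silent video."""
    return [
        "ffmpeg", "-y",        # overwrite the output file if it exists
        "-i", video_path,      # input 0: silent video
        "-i", audio_path,      # input 1: synthesized rap audio
        "-c:v", "copy",        # keep the video stream unchanged
        "-c:a", "aac",         # encode audio as AAC for MP4 compatibility
        "-shortest",           # stop at the end of the shorter stream
        output_path,
    ]


def mux(video_path, audio_path, output_path):
    """Run ffmpeg; raises CalledProcessError if ffmpeg exits with an error."""
    subprocess.run(build_mux_command(video_path, audio_path, output_path), check=True)
```

Passing the arguments as a list rather than a shell string avoids quoting problems with file names that contain spaces.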

👤 Contact

Email Twitter

ai-rapper's People

Contributors

vdutts7 avatar
