Comments (16)

sanchit-gandhi commented on August 26, 2024

Hey @Saltb0xApps @adamfils - this sounds like it would work. The only change I would make is to use a more powerful audio encoder to extract more meaningful representations from the music conditioning (e.g. warm-starting an audio encoder from the HuBERT model to extract music embedding representations). Using DAC or EnCodec alone will only give you a down-sampled version of the music inputs, rather than features that encode tempo, key, mood, rhythm, etc. This is analogous to what the Flan-T5 encoder does for the text conditioning.
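
For illustration, a minimal sketch of the warm-started music encoder idea, assuming the facebook/hubert-base-ls960 checkpoint and 16 kHz mono input; the function name embed_music is hypothetical, not part of Parler-TTS:

```python
# Sketch only: warm-start a music encoder from HuBERT and use its hidden
# states as conditioning embeddings. Checkpoint and names are illustrative.
import torch
from transformers import AutoFeatureExtractor, HubertModel

feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/hubert-base-ls960")
encoder = HubertModel.from_pretrained("facebook/hubert-base-ls960")

def embed_music(waveform, sampling_rate=16_000):
    """Return (frames, hidden_size) embeddings for a 1-D mono waveform array."""
    inputs = feature_extractor(waveform, sampling_rate=sampling_rate, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state  # (1, frames, 768)
    return hidden.squeeze(0)
```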

Note that you could use a similar setup to train a TTS model with both text and voice conditioning (just replace the music conditioning with a voice sample in the flowchart above). You could then give it a 2-second voice prompt to control the style of the generated voice, and control how fast/slow or animated/monotonous the speech is using the text prompt.

adamfils commented on August 26, 2024

@Saltb0xApps Awesome! Please share some of your audio outputs when you can as the training proceeds.

adamfils commented on August 26, 2024

@Saltb0xApps I was thinking the same thing, and also about adding another input conditioning, such as background music for the generated audio to follow. I am not exactly an ML engineer, but here is my rough thinking.

  1. Text Encoder (Unchanged):
    Continues to map text descriptions to a sequence of hidden-state representations using a frozen text encoder initialized from Flan-T5.

  2. Music Encoder:
    A new component that takes background music as input and generates a music-conditioned representation using a pretrained music autoencoder (DAC or EnCodec).
    This encoder analyses the background music to extract features such as tempo, key, mood, and rhythm, which will be used to condition the generated speech.

  3. Parler-TTS Decoder (Modified):
    The decoder now auto-regressively generates audio tokens conditional not only on the encoder hidden-state representations (from text) but also on the music-conditioned representation.
    To incorporate the music-conditioned representation, you could either:

  • Concatenate: Directly concatenate the music-conditioned representation with the text-conditioned hidden states before feeding them into the decoder (a minimal sketch of this option follows the list).
  • Cross-Attention Modification: Integrate the music-conditioned representation into the cross-attention layers of the decoder, allowing the decoder to attend to both text and music features simultaneously.
  4. Audio Codec (Unchanged):
    Continues to recover the audio waveform from the audio tokens predicted by the decoder using the DAC model or EnCodec as preferred.
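
For the "Concatenate" option above, a minimal sketch, assuming the music and text encoders output different hidden sizes; the module name MusicTextConditioner is hypothetical:

```python
# Hypothetical sketch of the "Concatenate" option: project music embeddings
# to the text hidden size and concatenate along the sequence axis, so the
# decoder cross-attends to one combined conditioning sequence.
import torch
import torch.nn as nn

class MusicTextConditioner(nn.Module):
    def __init__(self, music_dim: int, text_dim: int):
        super().__init__()
        self.proj = nn.Linear(music_dim, text_dim)

    def forward(self, text_hidden, music_hidden):
        # text_hidden:  (batch, text_len, text_dim)
        # music_hidden: (batch, music_len, music_dim)
        music_proj = self.proj(music_hidden)
        return torch.cat([music_proj, text_hidden], dim=1)
```

The cross-attention variant would instead add a second cross-attention block in each decoder layer so the decoder attends to the music embeddings separately.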

@sanchit-gandhi Does this sound feasible?

Saltb0xApps commented on August 26, 2024

@sanchit-gandhi / @adamfils Thank you for providing a detailed response!
I have a large dataset of around 1k hours of just vocals (separated out of SoundCloud songs using Demucs), with their lyrics/transcriptions generated using Whisper. I was wondering if I could take that vocals-only dataset, combine it with dataspeech info for style, and retrain parler-tts to output only singing vocals.

The idea is to create a robust singing-vocals version of a text-to-speech model using parler-tts, one that generates singing vocals instead of regular speech.

Would this require code-level changes as mentioned above, or would simply retraining parler-tts on this singing-vocals dataset be a good enough starting point for something that generates vocals only?

ylacombe commented on August 26, 2024

Hey @Saltb0xApps, this would totally work, as you only need three things from your dataset for training:

  1. Audio samples
  2. Transcriptions
  3. Text conditioning

Parler-TTS is agnostic to the text conditioning and audio samples you're using!

Also, 1K hours should be enough to get a good-enough model from scratch. You can also explore fine-tuning the current model, as it has already learned some acoustic features and how to associate text tokens with acoustic sounds!
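
For illustration, a minimal sketch of those three fields as a 🤗 Datasets dataset; the column names here are placeholders, since the training script lets you remap them:

```python
# Illustrative only: the three fields the training needs. Column names
# and values are placeholders, not the script's required names.
from datasets import Audio, Dataset

ds = Dataset.from_dict({
    "audio": ["vocals_0001.wav"],                    # 1. audio sample
    "text": ["some example lyrics here"],            # 2. transcription (lyrics)
    "description": ["A female vocalist sings ..."],  # 3. text conditioning
}).cast_column("audio", Audio(sampling_rate=44_100))
```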

Saltb0xApps commented on August 26, 2024

@ylacombe I curated the dataset of 1,000 hours of vocals (mostly English).
Would love to hear your thoughts on whether this dataset could work:
• audio + transcriptions - https://huggingface.co/datasets/AkhilTolani/vocals
• transcriptions + dataspeech tags - https://huggingface.co/datasets/AkhilTolani/vocals-stripped-pitch-text-bins-descriptions

I just started a fine-tune run here on 2x A100 GPUs. Would love to know if my fine-tune parameters are adjusted correctly for the difference in hardware/dataset size: https://wandb.ai/akhiltolani/parler-speech/runs/wkh5eor3/overview?nw=nwuserakhiltolani

ylacombe commented on August 26, 2024

Hey @Saltb0xApps, wow, thanks for sharing this! How did you create this dataset, out of curiosity?

A few remarks:

  • I'm pretty sure the model can learn from your samples, but it would surely benefit from more precise tags: for now, it only uses the dataspeech tags, but you would probably need more signals about the singing voices (for example, the singing style, the notes, etc.). Maybe you can get some additional features from the way you created your dataset; would you like to describe this a bit more?
  • You should also probably modify the prompt used to create the descriptions a bit. The current one is about delivering speech, whereas you want to describe singing.
  • I'll take a proper look at the hyper-parameters ASAP.

ylacombe commented on August 26, 2024

Thank you also for sharing your logs; they bring a lot of value to the community! I really like your initiative!
If you're okay with this, we can probably make a big splash with your model once the results are what we expect. What do you think?

ylacombe commented on August 26, 2024

I've listened to some samples; the model seems to get a sense of singing, which is a good sign! It'd probably need some better hyper-parameters though!

Re: Hyper-parameters

  • You've got 1k hours of audio, so you'd better raise your global batch size to get closer to the global batch size we used for training (196). You can do this by setting gradient_accumulation_steps to around 6 (see the back-of-the-envelope check after this list).
  • You might have to experiment with the learning rate; for now you can leave it as it is, but it's a bit too high IMO.
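
For illustration, a quick sanity check of the suggested setting. The per-device batch size below is an assumption for the sake of the arithmetic, not the run's actual value:

```python
# Effective global batch size = per-device batch * n_gpus * grad accumulation.
per_device_train_batch_size = 16   # assumed value; check your own config
n_gpus = 2                         # the 2x A100 setup mentioned above
gradient_accumulation_steps = 6    # the value suggested above

print(per_device_train_batch_size * n_gpus * gradient_accumulation_steps)
# 192, close to the global batch size of 196 used for the original training
```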

Saltb0xApps commented on August 26, 2024

> Hey @Saltb0xApps, wow, thanks for sharing this! How did you create this dataset, out of curiosity?
>
> A few remarks:
>
> • I'm pretty sure the model can learn from your samples, but it would surely benefit from more precise tags: for now, it only uses the dataspeech tags, but you would probably need more signals about the singing voices (for example, the singing style, the notes, etc.). Maybe you can get some additional features from the way you created your dataset; would you like to describe this a bit more?
> • You should also probably modify the prompt used to create the descriptions a bit. The current one is about delivering speech, whereas you want to describe singing.
> • I'll take a proper look at the hyper-parameters ASAP.

Thank you! The dataset is just built from music found online. I separated out the vocals using Demucs, chunked them with pydub silence detection, and then transcribed them using Whisper large/medium.
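
A minimal sketch of the chunk-and-transcribe steps (pydub silence splitting plus openai-whisper); the silence thresholds are illustrative guesses, and the Demucs separation is assumed to have already produced the vocals file:

```python
# Sketch of the described pipeline: split separated vocals on silence with
# pydub, then transcribe each chunk with Whisper. Thresholds are guesses.
import whisper
from pydub import AudioSegment
from pydub.silence import split_on_silence

model = whisper.load_model("medium")  # "large" was also used

def chunk_and_transcribe(vocals_path):
    vocals = AudioSegment.from_file(vocals_path)
    chunks = split_on_silence(
        vocals,
        min_silence_len=700,              # ms of silence that ends a chunk
        silence_thresh=vocals.dBFS - 16,  # loudness threshold, relative to the track
        keep_silence=200,                 # keep a little padding around each chunk
    )
    results = []
    for i, chunk in enumerate(chunks):
        path = f"chunk_{i:05d}.wav"
        chunk.export(path, format="wav")
        results.append((path, model.transcribe(path)["text"]))
    return results
```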

I'm happy to add more info about the singing, but we'll need to come up with automatic techniques to detect signals like singing style/notes. I was thinking of LP-MusicCaps (https://github.com/seungheondoh/lp-music-caps), but I'm open to other ideas.

I decided to keep the prompt the same and not introduce additional tags, since I'm fine-tuning the base model instead of training one from scratch. The idea is that the model will basically "sing" instead of "talk", with everything else staying the same.

Saltb0xApps commented on August 26, 2024

> Thank you also for sharing your logs; they bring a lot of value to the community! I really like your initiative! If you're okay with this, we can probably make a big splash with your model once the results are what we expect. What do you think?

Absolutely! I'd love to connect and discuss a plan over Discord/email. Let me know if that works for you!

Saltb0xApps commented on August 26, 2024

> I've listened to some samples; the model seems to get a sense of singing, which is a good sign! It'd probably need some better hyper-parameters though!
>
> Re: Hyper-parameters
>
> • You've got 1k hours of audio, so you'd better raise your global batch size to get closer to the global batch size we used for training (196). You can do this by setting gradient_accumulation_steps to around 6.
> • You might have to experiment with the learning rate; for now you can leave it as it is, but it's a bit too high IMO.

@ylacombe Ah, thank you for sharing this! I definitely agree 1e-4 is too high. I'll do another run later today with 4e-5 as the LR and gradient_accumulation_steps updated to 6.

Saltb0xApps commented on August 26, 2024

@ylacombe I believe parler-tts is going to come out with a larger model very soon, if I remember correctly?
600M is comparable to musicgen-small, and I believe there is a major qualitative difference between 3B+ params and 600M params (at least in musicgen)!
I'd imagine fine-tuning the parler-tts large model would most likely also give much better singing results!

Would really appreciate it if you could share any details about the large model (param size, dataset size, estimated launch date, etc.) if possible :)

Fine-tune run 2, with a less aggressive learning rate and gradient_accumulation_steps = 6: https://wandb.ai/akhiltolani/parler-speech/runs/mv9dd4hz/overview?nw=nwuserakhiltolani

Saltb0xApps commented on August 26, 2024

Here is a Hugging Face Space to try out the singing-vocals fine-tune of Parler-TTS!
https://huggingface.co/spaces/AkhilTolani/vocals

  1. The model is having a very hard time differentiating between male and female vocals. Maybe it's my training dataset?

  2. Ideas on how to generate consistently long vocals based on speaker_id in chunks? (2 min+ for practical use?)

  3. Need to figure out how to determine min_length or min_new_tokens from the input text so that the model doesn't miss a few words. Could do something very rudimentary like input_prompt_word_count*0.8 (or any simple formula that almost gets it right); see the sketch after this list.

  4. The model occasionally just generates screeching noises with no coherent words. Need to determine why.
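
For point 3, a sketch of that rudimentary formula; tokens_per_word is a made-up constant you would tune empirically, since it depends on the codec frame rate and the singing tempo:

```python
# Hypothetical heuristic: scale min_new_tokens with the prompt's word count.
def estimate_min_new_tokens(prompt, tokens_per_word=40, slack=0.8):
    return int(len(prompt.split()) * tokens_per_word * slack)

# e.g. pass it to generation:
# model.generate(..., min_new_tokens=estimate_min_new_tokens(prompt_text))
```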

ylacombe commented on August 26, 2024

Hey @Saltb0xApps, you can reach out to me by email at yoach [at] huggingface.co!
This run is definitely better, and I really love the Space you've created.
Let's discuss offline how we can make this even better. I believe you could use a mix of automatic features and generated features to create better singing-voice descriptions.

Also, some of the dataspeech features might not be well suited here.

ylacombe commented on August 26, 2024
> Ideas on how to generate consistently long vocals based on speaker_id in chunks? (2 min+ for practical use?)

This needs a bit of hacking around the model, e.g. adding speaker names to the descriptions to signal to the model to keep the voice consistent.
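
A minimal sketch of that hack, following the generation API from the Parler-TTS README; the checkpoint name is a placeholder, and the assumption that the fine-tuned model has learned to associate a name like "Jenny" with a voice is exactly the hack being described:

```python
# Sketch: pin a speaker name in every description and generate the long
# lyric chunk by chunk, then concatenate the audio afterwards.
import torch
from parler_tts import ParlerTTSForConditionalGeneration
from transformers import AutoTokenizer

repo = "parler-tts/parler_tts_mini_v0.1"  # placeholder: use your fine-tune
model = ParlerTTSForConditionalGeneration.from_pretrained(repo)
tokenizer = AutoTokenizer.from_pretrained(repo)

description = "Jenny sings in a high pitch with an energetic, clear delivery."
audio_chunks = []
for lyric_chunk in ["first verse ...", "second verse ..."]:
    input_ids = tokenizer(description, return_tensors="pt").input_ids
    prompt_ids = tokenizer(lyric_chunk, return_tensors="pt").input_ids
    with torch.no_grad():
        audio = model.generate(input_ids=input_ids, prompt_input_ids=prompt_ids)
    audio_chunks.append(audio.cpu().numpy().squeeze())
```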
