
ML_Music_Style_Transfer

Style transfer in the piano space only.

Design

Use a WaveNet autoencoder/decoder structure with a condition on pitch (i.e. MIDI). In Neural Audio Synthesis of Musical Notes with WaveNet Autoencoders, this conditioning happens by concatenating the latent embedding with a one-hot vector of the pitch. For music where MIDI is not available, you can get it using an Onsets and Frames transcription model.
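A minimal sketch of that concatenation, assuming PyTorch and a (batch, channels, time) latent; the function name and shapes are illustrative, not taken from the paper's code:

```python
import torch
import torch.nn.functional as F

def condition_on_pitch(z: torch.Tensor, pitch: torch.Tensor) -> torch.Tensor:
    """Concatenate the latent embedding with a one-hot pitch vector.

    z:     (batch, latent_dim, time) latent from the WaveNet encoder
    pitch: (batch,) MIDI note numbers in [0, 127], as a LongTensor
    """
    one_hot = F.one_hot(pitch, num_classes=128).float()           # (batch, 128)
    # Broadcast the static pitch vector across every time step of the latent.
    one_hot = one_hot.unsqueeze(-1).expand(-1, -1, z.shape[-1])   # (batch, 128, time)
    return torch.cat([z, one_hot], dim=1)                         # (batch, latent_dim + 128, time)
```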

Having the MIDI as an input to the decoder is key because the decoder then becomes "general purpose". I.e., in A Universal Music Translation Network the style transfer happens on a decoder-by-decoder basis - to get a generic piano sound you use the piano decoder. Instead, I want different kinds of piano sounds, conditioned on the inputs: a) the piano sound I provide and b) the MIDI input I provide. Hopefully, MIDI also allows polyphonic output.

Data

Converting MIDI -> vectors

As stated in the MAESTRO paper: "The input to the context stack is an onset “piano roll” representation, a size-88 vector signaling the onset of any keys on the keyboard, with 4ms bins (250Hz). Each element of the vector is a float that represents the strike velocity of a piano key in the 4ms frame, scaled to the range [0, 1]. When there is no onset for a key at a given time, the value is 0." So I think you basically go through every MIDI file with that bin size and vectorize everything; those vectors can then be input to WaveNet.
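A sketch of that vectorization, assuming the pretty_midi library (the MAESTRO code itself may differ):

```python
import numpy as np
import pretty_midi

def midi_to_onset_roll(midi_path: str, frame_rate: int = 250) -> np.ndarray:
    """Vectorize a MIDI file into the MAESTRO-style onset piano roll:
    size-88 vectors at 4 ms (250 Hz) bins, values = strike velocity in [0, 1]."""
    pm = pretty_midi.PrettyMIDI(midi_path)
    n_frames = int(np.ceil(pm.get_end_time() * frame_rate)) + 1
    roll = np.zeros((n_frames, 88), dtype=np.float32)
    for inst in pm.instruments:
        for note in inst.notes:
            frame = int(round(note.start * frame_rate))
            key = note.pitch - 21          # piano keys span MIDI 21 (A0) .. 108 (C8)
            if 0 <= key < 88:
                roll[frame, key] = note.velocity / 127.0
    return roll
```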

This repo is probably very important for making sure your MIDI and audio align properly. How are they doing it?

Note: something very key will be to take a single MIDI file and convert it into many different-sounding piano versions. Take the MidiNet MIDI files and augment them in Native Instruments. Make them available online; I am sure they will be valued.

Note: the second key thing to make the conditioning work properly is for the input audio to have the same timbre as the target but deliberately different notes (across many different MIDI combinations). The output spectrogram is the actual matching MIDI/piano rendering. This is easily done via Mixcraft.
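A hypothetical sketch of how the resulting renders could be paired into training triplets, where the timbre reference shares a VST with the target but uses a different MIDI file (all names here are illustrative):

```python
import random

def make_training_triplets(renders):
    """Pair renders into (timbre_reference_wav, target_midi_id, target_wav).

    `renders` maps (vst_name, midi_id) -> path of the audio rendered with
    that piano VST. The timbre reference uses the same VST as the target
    but a *different* MIDI file, so the audio encoder can only supply
    timbre, never the melody.
    """
    by_vst = {}
    for (vst, midi_id), wav_path in renders.items():
        by_vst.setdefault(vst, []).append((midi_id, wav_path))
    triplets = []
    for vst, items in by_vst.items():
        for midi_id, target_wav in items:
            others = [w for m, w in items if m != midi_id]
            if others:  # need at least two MIDI files per VST
                triplets.append((random.choice(others), midi_id, target_wav))
    return triplets
```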

Upsampling

You probably need to upsample your MIDI vector arrays to make sure they have the same temporal resolution as the output audio you want. This is also what is done in Conditioning Deep Generative Raw Audio Models for Structured Automatic Music.
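The simplest version of that upsampling is nearest-neighbour frame repetition, sketched here in PyTorch (the Tacotron 2 paper instead learns the upsampling with transposed convolutions; repeating frames is just the easiest baseline):

```python
import torch

def upsample_roll(roll: torch.Tensor, frame_rate: int = 250,
                  sample_rate: int = 16000) -> torch.Tensor:
    """Repeat each (time, 88) piano-roll frame so the roll reaches the audio
    sample rate and every audio sample gets a conditioning vector."""
    factor = sample_rate // frame_rate        # 64 samples per 4 ms frame
    return roll.repeat_interleave(factor, dim=0)
```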

Loss

I think the loss you eventually want is a spectral loss like the one used in DDSP. The point they make (see also the "Phase Invariance" example here) is that maximum likelihood/cross-entropy is not a good loss, since similar-sounding audio can have very different waveforms.
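A sketch of a DDSP-style multi-scale spectral loss in PyTorch (DDSP also adds a log-magnitude term; this keeps just the linear-magnitude part):

```python
import torch

def multiscale_spectral_loss(pred: torch.Tensor, target: torch.Tensor,
                             fft_sizes=(2048, 1024, 512, 256)) -> torch.Tensor:
    """L1 distance between STFT magnitudes at several resolutions.
    Phase is discarded, so waveforms that differ only in phase (but sound
    the same) incur no penalty."""
    loss = 0.0
    for n_fft in fft_sizes:
        window = torch.hann_window(n_fft, device=pred.device)
        s_pred = torch.stft(pred, n_fft, hop_length=n_fft // 4,
                            window=window, return_complex=True).abs()
        s_true = torch.stft(target, n_fft, hop_length=n_fft // 4,
                            window=window, return_complex=True).abs()
        loss = loss + (s_pred - s_true).abs().mean()
    return loss
```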

Current Best Idea

Following Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions, your encoded audio is the mel-spectrogram (potentially run through a few CNN layers), while your input symbolic representation is the MIDI data (probably upsampled to the same sampling rate as the output audio), as mentioned in Conditioning Deep Generative Raw Audio Models for Structured Automatic Music.
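A sketch of how the mel frame rate can be tied to the 250 Hz piano-roll rate, assuming torchaudio and a 16 kHz sample rate (all settings are illustrative):

```python
import torch
import torchaudio

# Hypothetical settings; the hop length ties the mel frame rate to the
# piano-roll frame rate (16000 / 64 = 250 Hz, matching the 4 ms MIDI bins).
mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000, n_fft=1024, hop_length=64, n_mels=80)

audio = torch.randn(1, 16000)   # stand-in for one second of piano audio
mel_frames = mel(audio)         # (1, 80, ~251) mel frames at 250 Hz
# A piano roll binned at 250 Hz now aligns frame-for-frame with mel_frames,
# so the two can be concatenated on the channel axis before the decoder.
```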

PerformanceNet Figure 2: pretraining the networks as end-to-end models is for sure a good idea. But then the encoder doesn't learn the right thing when you mix-and-match - it learns a melody + timbre representation, where you just want the timbre. Instead you need to train a single end-to-end model with both the audio encoder E_a and the MIDI encoder E_s feeding a single decoder D_a. How to combine the latent representations? Maybe concatenation (with a constant token as a separator?) followed by a dense network (to learn feature combinations), and then whatever they have as the decoder - see the sketch below.
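One hedged version of that concatenate-then-dense fusion, with illustrative dimensions (the separator-token variant is omitted):

```python
import torch
import torch.nn as nn

class LatentFusion(nn.Module):
    """Concatenate the audio (timbre) latent from E_a and the symbolic (MIDI)
    latent from E_s per frame, then mix them with a small dense network
    before the shared decoder D_a."""
    def __init__(self, audio_dim=64, midi_dim=88, out_dim=128):
        super().__init__()
        self.mix = nn.Sequential(
            nn.Linear(audio_dim + midi_dim, out_dim),
            nn.ReLU(),
            nn.Linear(out_dim, out_dim),
        )

    def forward(self, e_audio, e_midi):
        # e_audio: (batch, time, audio_dim); e_midi: (batch, time, midi_dim)
        return self.mix(torch.cat([e_audio, e_midi], dim=-1))
```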

I think it's also worth trying audio convolutions concatenated with the piano-roll MIDI (upscaled to match the dimension). Then at each layer you have piano-roll skip connections, upscaled to match that layer's output dimension - sketched below.
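A rough sketch of those per-layer piano-roll skip connections, with the roll interpolated to each layer's temporal size (all sizes are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RollConditionedConvStack(nn.Module):
    """1-D conv stack over audio features; at every layer the piano roll is
    interpolated to that layer's time length and concatenated as extra
    channels (the piano-roll skip-connection idea above)."""
    def __init__(self, in_ch=80, roll_ch=88, hidden=128, n_layers=4):
        super().__init__()
        self.layers = nn.ModuleList()
        ch = in_ch
        for _ in range(n_layers):
            self.layers.append(nn.Conv1d(ch + roll_ch, hidden,
                                         kernel_size=3, padding=1))
            ch = hidden

    def forward(self, x, roll):
        # x: (batch, in_ch, T_audio); roll: (batch, 88, T_roll)
        for conv in self.layers:
            r = F.interpolate(roll, size=x.shape[-1], mode="nearest")
            x = F.relu(conv(torch.cat([x, r], dim=1)))
        return x
```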

Specific Architectures

  • A Universal Music Translation Network in pytorch.
  • MAESTRO paper - they say: "WaveNet (van den Oord et al., 2016) is able to synthesize realistic instrument sounds directly in the waveform domain, but it is not as adept at capturing musical structure at timescales of seconds or longer. However, if we provide a MIDI sequence to a WaveNet model as conditioning information, we eliminate the need for capturing large scale structure, and the model can focus on local structure instead, i.e., instrument timbre and local interactions between notes."
  • tacotron2 with paper - conditions a WaveNet decoder on mel-spectrograms. This could be what you need: your encoder is just audio -> mel-spectrogram, and the decoder is a WaveNet conditioned on MIDI + mel-spectrogram. That would be a lot simpler and probably easier to train.
  • PerformanceNet - looks like they skip WaveNet altogether and use a convolutional architecture. Apparently easier to train, since WaveNet is slow and data-hungry. They learn a score -> audio mapping.
  • Conditioning Deep Generative Raw Audio Models for Structured Automatic Music - this may be the ticket for what you want, and at the very least it gives you a nice explanation of how to do the conditioning.

Additional Code Resources

Datasets

Blog resources
