Style transfer in the piano space only.
Use a WaveNet autoencoder/decoder structure with conditioning on pitch (i.e. MIDI). In Neural Audio Synthesis of Musical Notes with WaveNet Autoencoders (NSynth), this conditioning is done by concatenating the latent embedding with a one-hot vector of the pitch. For music where MIDI is not available, it can be extracted with the Onsets and Frames transcription model.
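A minimal sketch of that conditioning step, assuming a 128-way MIDI pitch space and a made-up latent dimension (in the real NSynth pipeline this concatenated vector conditions the decoder at every time step):

```python
import numpy as np

def condition_on_pitch(z, midi_pitch, n_pitches=128):
    """Concatenate a latent embedding with a one-hot pitch vector,
    NSynth-style. z: (latent_dim,) embedding; midi_pitch in [0, n_pitches)."""
    one_hot = np.zeros(n_pitches, dtype=z.dtype)
    one_hot[midi_pitch] = 1.0
    return np.concatenate([z, one_hot])

z = np.random.randn(16).astype(np.float32)   # hypothetical latent code
cond = condition_on_pitch(z, midi_pitch=60)  # middle C
print(cond.shape)  # (144,)
```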
Having the MIDI as an input to the decoder is key, because the decoder then becomes "general purpose". In A Universal Music Translation Network, style transfer happens on a decoder-by-decoder basis: to get a generic piano sound you use the piano decoder. Instead, I want different kinds of piano sounds, conditioned on two inputs: a) the piano sound I provide and b) the MIDI input I provide. Hopefully, conditioning on MIDI also allows polyphonic output.
As stated in the MAESTRO paper: "The input to the context stack is an onset “piano roll” representation, a size-88 vector signaling the onset of any keys on the keyboard, with 4ms bins (250Hz). Each element of the vector is a float that represents the strike velocity of a piano key in the 4ms frame, scaled to the range [0, 1]. When there is no onset for a key at a given time, the value is 0." So I think the procedure is basically: go through every MIDI file, bin the onsets at the chosen resolution, and vectorize everything. The resulting arrays can then be fed to WaveNet as conditioning.
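A rough sketch of that vectorization, using a hypothetical `notes` list of (onset, pitch, velocity) tuples in place of a real MIDI parser:

```python
import numpy as np

def onset_piano_roll(notes, duration_s, bin_ms=4.0):
    """Vectorize note onsets into a (num_bins, 88) float array, as in the
    MAESTRO quote above. notes: list of (onset_time_s, midi_pitch, velocity).
    Each onset writes its velocity, scaled to [0, 1], into the bin where the
    key is struck; pitches outside the 88-key range are skipped."""
    n_bins = int(np.ceil(duration_s * 1000.0 / bin_ms))
    roll = np.zeros((n_bins, 88), dtype=np.float32)
    for onset, pitch, velocity in notes:
        key = pitch - 21                      # lowest piano key A0 = MIDI 21
        if not 0 <= key < 88:
            continue
        b = min(int(onset * 1000.0 / bin_ms), n_bins - 1)
        roll[b, key] = velocity / 127.0       # MIDI velocity is 0..127
    return roll

notes = [(0.0, 60, 64), (0.5, 64, 100)]      # two toy onsets
roll = onset_piano_roll(notes, duration_s=1.0)
print(roll.shape)  # (250, 88) -> 1 s of 4 ms bins
```

In practice a library like pretty_midi would supply the note list, but the binning logic is the same.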
This repo is probably very important for making sure your MIDI/audio are aligned properly. How are they doing it??
Note: Something very key will be to take a single MIDI file and convert it to many different-sounding piano versions. Take the MidiNet MIDI files and augment them in Native Instruments. Make them available online; I am sure they will be valued. Note: The second key thing to make the conditioning work properly is to have the input audio share the timbre but use deliberately different notes (across many different MIDI combinations). The output spectrogram is the actual matching MIDI/piano rendering. This is easily done via Mixcraft.
You probably need to upsample your MIDI vector arrays so that they have the same temporal resolution as the output audio you want. This is also what is done in Conditioning Deep Generative Raw Audio Models for Structured Automatic Music.
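A simple way to do this, assuming an integer ratio between the 250 Hz piano-roll rate and a 16 kHz audio rate, is sample-and-hold via np.repeat:

```python
import numpy as np

def upsample_roll(roll, roll_hz=250, audio_hz=16000):
    """Hold each 4 ms conditioning frame constant across the audio samples
    it covers, so the control signal matches the waveform's time axis."""
    factor = audio_hz // roll_hz          # assumes an integer ratio (64 here)
    return np.repeat(roll, factor, axis=0)

roll = np.zeros((250, 88), dtype=np.float32)   # 1 s of 4 ms onset bins
up = upsample_roll(roll)
print(up.shape)  # (16000, 88)
```

Linear interpolation is an alternative, but for sparse onset vectors nearest-neighbor repetition keeps the onsets sharp.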
However, I think the loss you eventually want is a spectral loss like the one used in DDSP. The point they show (see also the "Phase Invariance" example here) is that maximum likelihood/cross-entropy is not a good loss, because similarly sounding audio can have very different waveforms.
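A minimal numpy sketch of such a multi-scale spectral loss, loosely following the DDSP formulation (the FFT sizes, hop choice, and log-magnitude term here are my assumptions, not their exact recipe):

```python
import numpy as np

def stft_mag(x, n_fft, hop):
    """Magnitude STFT via a Hann window and rfft (minimal, no padding)."""
    win = np.hanning(n_fft)
    frames = [x[i:i + n_fft] * win
              for i in range(0, len(x) - n_fft + 1, hop)]
    return np.abs(np.fft.rfft(np.stack(frames), axis=-1))

def multiscale_spectral_loss(x, y, fft_sizes=(2048, 1024, 512, 256)):
    """L1 distance between magnitude (and log-magnitude) spectrograms at
    several resolutions. Phase is discarded, so waveforms that sound alike
    score as close even when their samples differ."""
    loss = 0.0
    for n_fft in fft_sizes:
        sx = stft_mag(x, n_fft, n_fft // 4)
        sy = stft_mag(y, n_fft, n_fft // 4)
        loss += np.mean(np.abs(sx - sy))
        loss += np.mean(np.abs(np.log(sx + 1e-6) - np.log(sy + 1e-6)))
    return loss

t = np.linspace(0, 1, 16000, endpoint=False)
a = np.sin(2 * np.pi * 440 * t)
b = np.sin(2 * np.pi * 440 * t + np.pi / 2)   # same sound, shifted phase
print(multiscale_spectral_loss(a, a))          # 0.0
```

The phase-shifted `b` has a very different waveform from `a` but a nearly identical magnitude spectrum, which is exactly the invariance argument.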
Following NATURAL TTS SYNTHESIS BY CONDITIONING WAVENET ON MEL SPECTROGRAM PREDICTIONS, your encoded audio is the mel-spectrogram (potentially run through a few CNN layers), while your input symbolic representation is the MIDI data (probably upsampled to the output audio sampling rate), as is done in Conditioning Deep Generative Raw Audio Models for Structured Automatic Music.
PerformanceNet Figure 2: pretraining the networks as end-to-end models is for sure a good idea. But then the encoder doesn't learn the right thing when you mix-and-match: it learns a melody + timbre representation, when you want just the timbre. Instead, train a single end-to-end model with an audio encoder E_a and a MIDI encoder E_s feeding a single decoder D_a. How to combine the latent representations? Maybe concatenate them (with a constant token always inserted as a separator?), followed by a dense network (to learn feature combinations), and then whatever they have as the decoder.
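A shape-level sketch of that fusion, with made-up latent dimensions and a random (untrained) dense layer standing in for the learned mixing network:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-frame latents: E_a yields a timbre code, E_s a score code.
T, d_a, d_s, d_out = 100, 64, 32, 128
z_audio = rng.standard_normal((T, d_a)).astype(np.float32)
z_score = rng.standard_normal((T, d_s)).astype(np.float32)

# Concatenate along the feature axis, then mix with one dense layer so the
# decoder D_a sees learned combinations rather than two disjoint blocks.
z = np.concatenate([z_audio, z_score], axis=-1)        # (T, d_a + d_s)
W = rng.standard_normal((d_a + d_s, d_out)).astype(np.float32) * 0.01
bias = np.zeros(d_out, dtype=np.float32)
fused = np.maximum(z @ W + bias, 0.0)                  # ReLU dense mix
print(fused.shape)  # (100, 128)
```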
I think it's also worth just trying audio convolutions concatenated with the piano MIDI (upscaled to match the dimension). Then at each layer you have piano-roll skip connections, upscaled to match that layer's output dimension.
- A Universal Music Translation Network in pytorch.
- MAESTRO paper - they say: WaveNet (van den Oord et al., 2016) is able to synthesize realistic instrument sounds directly in the waveform domain, but it is not as adept at capturing musical structure at timescales of seconds or longer. However, if we provide a MIDI sequence to a WaveNet model as conditioning information, we eliminate the need for capturing large scale structure, and the model can focus on local structure instead, i.e., instrument timbre and local interactions between notes.
- tacotron2 (with paper) - conditions a WaveNet decoder on mel-spectrograms. This could be what you need to do: your encoder is just audio -> mel-spectrogram, and the decoder is WaveNet conditioned on MIDI + mel-spectrogram. This would be a lot simpler and probably easier to train.
- PerformanceNet - looks like they skip WaveNet altogether and use a convolutional architecture. Apparently easier to train, as WaveNet is slow and data-hungry. They learn a score -> audio mapping.
- CONDITIONING DEEP GENERATIVE RAW AUDIO MODELS FOR STRUCTURED AUTOMATIC MUSIC - this may be the ticket for what you want, and at the very least it gives a nice explanation of how to do the conditioning.