lucidrains / musiclm-pytorch

Implementation of MusicLM, Google's new SOTA model for music generation using attention networks, in Pytorch

License: MIT License

Python 100.00%
artificial-intelligence attention-mechanisms deep-learning music-synthesis transformers

musiclm-pytorch's Introduction

MusicLM - Pytorch

Implementation of MusicLM, Google's new SOTA model for music generation using attention networks, in Pytorch.

They are basically using text-conditioned AudioLM, but surprisingly with the embeddings from a text-audio contrastively learned model named MuLan. MuLan is what will be built out in this repository, with AudioLM modified from the other repository to support the music generation needs here.

Please join us on Discord if you are interested in helping out with the replication with the LAION community.

Video explainer: What's AI by Louis Bouchard

Appreciation

Install

$ pip install musiclm-pytorch

Usage

MuLaN first needs to be trained

import torch
from musiclm_pytorch import MuLaN, AudioSpectrogramTransformer, TextTransformer

audio_transformer = AudioSpectrogramTransformer(
    dim = 512,
    depth = 6,
    heads = 8,
    dim_head = 64,
    spec_n_fft = 128,
    spec_win_length = 24,
    spec_aug_stretch_factor = 0.8
)

text_transformer = TextTransformer(
    dim = 512,
    depth = 6,
    heads = 8,
    dim_head = 64
)

mulan = MuLaN(
    audio_transformer = audio_transformer,
    text_transformer = text_transformer
)

# get a ton of <sound, text> pairs and train

wavs = torch.randn(2, 1024)
texts = torch.randint(0, 20000, (2, 256))

loss = mulan(wavs, texts)
loss.backward()

# after much training, you can embed sounds and text into a joint embedding space
# for conditioning the audio LM

embeds = mulan.get_audio_latents(wavs)  # during training

embeds = mulan.get_text_latents(texts)  # during inference

To obtain the conditioning embeddings for the three transformers that are a part of AudioLM, you must use the MuLaNEmbedQuantizer like so

from musiclm_pytorch import MuLaNEmbedQuantizer

# setup the quantizer with the namespaced conditioning embeddings, unique per quantizer as well as namespace (per transformer)

quantizer = MuLaNEmbedQuantizer(
    mulan = mulan,                          # pass in trained mulan from above
    conditioning_dims = (1024, 1024, 1024), # say all three transformers have model dimensions of 1024
    namespaces = ('semantic', 'coarse', 'fine')
)

# now say you want the conditioning embeddings for semantic transformer

wavs = torch.randn(2, 1024)
conds = quantizer(wavs = wavs, namespace = 'semantic') # (2, 8, 1024) - 8 is number of quantizers

To train (or finetune) the three transformers that are a part of AudioLM, you simply follow the instructions over at audiolm-pytorch for training, but pass in the MuLaNEmbedQuantizer instance to the training classes under the keyword audio_conditioner.

ex. SemanticTransformerTrainer

import torch
from audiolm_pytorch import HubertWithKmeans, SemanticTransformer, SemanticTransformerTrainer

wav2vec = HubertWithKmeans(
    checkpoint_path = './hubert/hubert_base_ls960.pt',
    kmeans_path = './hubert/hubert_base_ls960_L9_km500.bin'
)

semantic_transformer = SemanticTransformer(
    num_semantic_tokens = wav2vec.codebook_size,
    dim = 1024,
    depth = 6,
    audio_text_condition = True      # this must be set to True (same for CoarseTransformer and FineTransformer)
).cuda()

trainer = SemanticTransformerTrainer(
    transformer = semantic_transformer,
    wav2vec = wav2vec,
    audio_conditioner = quantizer,   # pass in the MuLaNEmbedQuantizer instance above
    folder ='/path/to/audio/files',
    batch_size = 1,
    data_max_length = 320 * 32,
    num_train_steps = 1
)

trainer.train()

After much training on all three transformers (semantic, coarse, fine), you will pass your finetuned or trained-from-scratch AudioLM and MuLaN wrapped in MuLaNEmbedQuantizer to the MusicLM

# you need the trained AudioLM (audio_lm) from above
# with the MuLaNEmbedQuantizer (mulan_embed_quantizer)

from musiclm_pytorch import MusicLM

musiclm = MusicLM(
    audio_lm = audio_lm,                 # `AudioLM` from https://github.com/lucidrains/audiolm-pytorch
    mulan_embed_quantizer = quantizer    # the `MuLaNEmbedQuantizer` from above
)

music = musiclm('the crystalline sounds of the piano in a ballroom', num_samples = 4) # sample 4 and pick the top match with mulan

Todo

  • mulan seems to be using decoupled contrastive learning, offer that as an option

  • wrap mulan with mulan wrapper and quantize the output, project to audiolm dimensions

  • modify audiolm to accept conditioning embeddings, optionally take care of different dimensions through a separate projection

  • audiolm and mulan go into musiclm, generate, and filter with mulan

  • give dynamic positional bias to self attention in AST

  • implement MusicLM generating multiple samples and selecting top match with MuLaN

  • support variable-length audio with masking in audio transformer

  • add a version of mulan to open clip

  • set all the proper spectrogram hyperparameters

Citations

@inproceedings{Agostinelli2023MusicLMGM,
    title     = {MusicLM: Generating Music From Text},
    author    = {Andrea Agostinelli and Timo I. Denk and Zal{\'a}n Borsos and Jesse Engel and Mauro Verzetti and Antoine Caillon and Qingqing Huang and Aren Jansen and Adam Roberts and Marco Tagliasacchi and Matthew Sharifi and Neil Zeghidour and C. Frank},
    year      = {2023}
}
@article{Huang2022MuLanAJ,
    title   = {MuLan: A Joint Embedding of Music Audio and Natural Language},
    author  = {Qingqing Huang and Aren Jansen and Joonseok Lee and Ravi Ganti and Judith Yue Li and Daniel P. W. Ellis},
    journal = {ArXiv},
    year    = {2022},
    volume  = {abs/2208.12415}
}
@misc{https://doi.org/10.48550/arxiv.2302.01327,
    doi     = {10.48550/ARXIV.2302.01327},
    url     = {https://arxiv.org/abs/2302.01327},
    author  = {Kumar, Manoj and Dehghani, Mostafa and Houlsby, Neil},
    title   = {Dual PatchNorm},
    publisher = {arXiv},
    year    = {2023},
    copyright = {Creative Commons Attribution 4.0 International}
}
@article{Liu2022PatchDropoutEV,
    title   = {PatchDropout: Economizing Vision Transformers Using Patch Dropout},
    author  = {Yue Liu and Christos Matsoukas and Fredrik Strand and Hossein Azizpour and Kevin Smith},
    journal = {ArXiv},
    year    = {2022},
    volume  = {abs/2208.07220}
}
@misc{liu2021swin,
    title   = {Swin Transformer V2: Scaling Up Capacity and Resolution},
    author  = {Ze Liu and Han Hu and Yutong Lin and Zhuliang Yao and Zhenda Xie and Yixuan Wei and Jia Ning and Yue Cao and Zheng Zhang and Li Dong and Furu Wei and Baining Guo},
    year    = {2021},
    eprint  = {2111.09883},
    archivePrefix = {arXiv},
    primaryClass = {cs.CV}
}
@misc{gilmer2023intriguing,
    title  = {Intriguing Properties of Transformer Training Instabilities},
    author = {Justin Gilmer and Andrea Schioppa and Jeremy Cohen},
    year   = {2023},
    status = {to be published - one attention stabilization technique is circulating within Google Brain, being used by multiple teams}
}
@inproceedings{Shukor2022EfficientVP,
    title   = {Efficient Vision-Language Pretraining with Visual Concepts and Hierarchical Alignment},
    author  = {Mustafa Shukor and Guillaume Couairon and Matthieu Cord},
    booktitle = {British Machine Vision Conference},
    year    = {2022}
}
@inproceedings{Zhai2023SigmoidLF,
    title   = {Sigmoid Loss for Language Image Pre-Training},
    author  = {Xiaohua Zhai and Basil Mustafa and Alexander Kolesnikov and Lucas Beyer},
    year    = {2023}
}

The only truth is music. - Jack Kerouac

Music is the universal language of mankind. - Henry Wadsworth Longfellow

musiclm-pytorch's People

Contributors

lucidrains


musiclm-pytorch's Issues

It takes forever to generate music samples

music = musiclm('the crystalline sounds of the piano in a ballroom', num_samples = 4) # sample 4 and pick the top match with mulan

I'm using this example line - it's taking me 2.5 hours and then it fails on MPS for Mac. Should this be happening? Trying again on CPU and it seems to be a lot quicker for some reason.

What's the hold up here? Is this lazy execution with training waiting until sample generation is triggered? Or is the sample generation actually taking this long?

Usage about grad_accum_every

I am curious about how "grad_accum_every" is used in https://github.com/lucidrains/musiclm-pytorch/blob/main/musiclm_pytorch/trainer.py#L317

In my previous experience, the model basically gets its gradients (backward) once per step. Why should we split the loss "grad_accum_every" times to compute the gradients within a step?

If I have a GPU constraint (1 T4 GPU), meaning I can only set the batch size to 1 or 2 at each training stage, should I still set "grad_accum_every" to a large number like 16 or 32?

Thank you!
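
For what it's worth, here is a minimal sketch of what gradient accumulation does in general (plain PyTorch, not this repo's trainer): the loss is divided by grad_accum_every and backward is called on several micro-batches before a single optimizer step, so the effective batch size is batch_size * grad_accum_every while peak memory stays at the level of one micro-batch.

import torch
from torch import nn

# generic gradient accumulation sketch, not this repo's trainer
model = nn.Linear(16, 1)
optimizer = torch.optim.Adam(model.parameters())

grad_accum_every = 16          # with batch_size = 2 this emulates an effective batch of 32

optimizer.zero_grad()
for _ in range(grad_accum_every):
    x, y = torch.randn(2, 16), torch.randn(2, 1)   # one small micro-batch
    loss = nn.functional.mse_loss(model(x), y)
    (loss / grad_accum_every).backward()           # gradients accumulate in .grad across micro-batches
optimizer.step()                                   # one optimizer update per accumulated step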

Error running AudioLM

I am running the audiolm implementation from GitHub and facing an error in the following

audiolm = AudioLM(
    wav2vec = wav2vec,
    codec = soundstream,
    semantic_transformer = semantic_transformer,
    coarse_transformer = coarse_transformer,
    fine_transformer = fine_transformer
)

text = "The sound of a violin playing a sad melody"
generated_wav = audiolm(text=text, batch_size=1)

I have tried changing the dimensions in the transformers, but the issue is still there,

fine_transformer = FineTransformer(
    num_coarse_quantizers = 3,
    num_fine_quantizers = 5,
    codebook_size = 1024,
    dim = 1024,
    depth = 6,
    audio_text_condition = True    # this must be set to True (same for SemanticTransformer and FineTransformer)
)

coarse_transformer = CoarseTransformer(
    num_semantic_tokens = wav2vec.codebook_size,
    codebook_size = 1024,
    num_coarse_quantizers = 3,
    dim = 1024,
    depth = 6,
    audio_text_condition = True    # this must be set to True (same for SemanticTransformer and FineTransformer)
)

semantic_transformer = SemanticTransformer(
    num_semantic_tokens = wav2vec.codebook_size,
    dim = 1024,
    depth = 6,
    audio_text_condition = True    # this must be set to True (same for CoarseTransformer and FineTransformers)
).cuda()

but I am still getting the following error:

AssertionError: you had specified a conditioning dimension of 1024, yet what was received by the transformer has
dimension of 768

Please help, I need to submit this implementation

Error: name 'partial' is not defined (The new release version)

I'm getting this error on the new release of musiclm-pytorch

Traceback (most recent call last):
  /home/ramsy/projects/music/main.py:47 in <module>
    CLI()
  /home/ramsy/.local/lib/python3.10/site-packages/typer/main.py:214 in __call__
    return get_command(self)(*args, **kwargs)
  /home/ramsy/.local/lib/python3.10/site-packages/click/core.py:829 in __call__
    return self.main(*args, **kwargs)
  /home/ramsy/.local/lib/python3.10/site-packages/click/core.py:782 in main
    rv = self.invoke(ctx)
  /home/ramsy/.local/lib/python3.10/site-packages/click/core.py:1259 in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  /home/ramsy/.local/lib/python3.10/site-packages/click/core.py:1066 in invoke
    return ctx.invoke(self.callback, **ctx.params)
  /home/ramsy/.local/lib/python3.10/site-packages/click/core.py:610 in invoke
    return callback(*args, **kwargs)
  /home/ramsy/.local/lib/python3.10/site-packages/typer/main.py:497 in wrapper
    return callback(**use_params)  # type: ignore
  /home/ramsy/projects/music/main.py:34 in train
    train_model()
  /home/ramsy/projects/music/src/train_model.py:37 in train_model
    mulan = MuLaN(
        audio_transformer = AUDIO_TRANSFORMER,
        text_transformer = TEXT_TRANSFORMER
    )
  <@beartype(musiclm_pytorch.musiclm_pytorch.MuLaN.__init__) at 0x7f7e2d17eef0>:52 in __init__
  /home/ramsy/.local/lib/python3.10/site-packages/musiclm_pytorch/musiclm_pytorch.py:673 in __init__
    klass = SigmoidContrastiveLearning if sigmoid_contrastive_loss else partial(Soft
NameError: name 'partial' is not defined

I have looked at the file /home/ramsy/.local/lib/python3.10/site-packages/musiclm_pytorch/musiclm_pytorch.py and I can't seem to find where this partial is defined.
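
For reference, partial normally comes from the standard library functools module, so a hedged guess (not a confirmed fix for this release) is that the module is simply missing the import:

from functools import partial   # standard library; defines partial used on the failing line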

Ex code for CoarseTransformer and FineTransformer

I have been able to successfully train with SemanticTransformerTrainer, but I am getting errors with the latter two.
coarse_transformer = CoarseTransformer(
    codebook_size = wav2vec.codebook_size,
    num_coarse_quantizers = 8,
    num_semantic_tokens = 1000,
    dim = 1024,
    depth = 6,
    audio_text_condition = True    # this must be set to True (same for SemanticTransformer and FineTransformer)
).cuda()

trainer = CoarseTransformerTrainer(
    transformer = coarse_transformer,
    wav2vec = wav2vec,
    audio_conditioner = quantizer,    # pass in the MulanEmbedQuantizer instance above
    folder = '/content/music_data',
    soundstream = soundstream,        # where to get this from?
    batch_size = 1,
    data_max_length = 320 * 32,
    num_train_steps = 1
)

Exception when attempting to train

I'm excited to try this out!

I attempted to train, feeding in a MockTextAudioDataset similar to the example on AudioLM's page (that worked with the semantic trainer there), but encountered the following exception: TypeError: 'int' object is not iterable

Full stack trace, in case it helps:

File "train_mulan.py", line 60, in
trainer.train()
File "<@beartype(musiclm_pytorch.trainer.MuLaNTrainer.train) at 0x7ff0e221f160>", line 30, in train
File "/mnt/c/audio-ml-workspace/musiclm/musiclm_pytorch/trainer.py", line 363, in train
logs = self.train_step()
File "/mnt/c/audio-ml-workspace/musiclm/musiclm_pytorch/trainer.py", line 330, in train_step
data_kwargs = self.data_tuple_to_kwargs(next(self.dl_iter))
File "/mnt/c/audio-ml-workspace/musiclm/musiclm_pytorch/trainer.py", line 57, in cycle
for data in dl:
File "/home/qualia/anaconda3/envs/audiolm/lib/python3.8/site-packages/accelerate/data_loader.py", line 375, in iter
current_batch = next(dataloader_iter)
File "/home/qualia/anaconda3/envs/audiolm/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 628, in next
data = self._next_data()
File "/home/qualia/anaconda3/envs/audiolm/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 671, in _next_data
data = self._dataset_fetcher.fetch(index) # may raise StopIteration
File "/home/qualia/anaconda3/envs/audiolm/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 61, in fetch
return self.collate_fn(data)
File "/mnt/c/audio-ml-workspace/musiclm/musiclm_pytorch/trainer.py", line 146, in inner
output = fn(datum)
File "/mnt/c/audio-ml-workspace/musiclm/musiclm_pytorch/trainer.py", line 156, in curtail_to_shortest_collate
min_len = min(*[datum.shape[0] for datum in data])
TypeError: 'int' object is not iterable
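
As an assumption about the root cause (not a confirmed diagnosis), min(*[...]) fails whenever the collated batch reaching that line contains a single element, because the star-unpacking turns it into min of a bare int:

# hedged illustration of the failure mode, assuming a batch of size one reaches the collate function
lengths = [7]
min(*[l for l in lengths])    # unpacks to min(7) -> TypeError: 'int' object is not iterable
min([l for l in lengths])     # taking the min of the list itself works for any batch size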

Inference with MuLaN

@lucidrains Somehow I got MuLaN trained with the MusicCaps dataset. Now I want to check how close the text and wav embeddings are. So while extracting text and wav embeddings using:

audio_transformer = AudioSpectrogramTransformer(
    dim = 512,
    depth = 6,
    heads = 8,
    dim_head = 64,
    spec_n_fft = 128,
    spec_win_length = 24,
    spec_aug_stretch_factor = 0.8
)

text_transformer = TextTransformer(
    dim = 512,
    depth = 6,
    heads = 8,
    dim_head = 64,
    max_seq_len = 512
)

mulan = MuLaN(
    audio_transformer = audio_transformer,
    text_transformer = text_transformer
)

mulan.to(device)

model = torch.load('results/mulan.45000.pt')

# create new OrderedDict that does not contain `module.`
from collections import OrderedDict
new_state_dict = OrderedDict()
for k, v in model.items():
    name = k[7:] # remove `module.`
    new_state_dict[name] = v

mulan.load_state_dict(new_state_dict)

wavs = torch.randn(2, 1024)
texts = torch.randint(0, 20000, (2, 256))

wav_emb = mulan.get_audio_latents(wavs) 
text_emb = mulan.get_text_latents(texts)

I get the following error:

Traceback (most recent call last):
  File "test_mulan.py", line 140, in <module>
    audio_emb = mulan.get_audio_latents(wavs)
  File "venv/lib/python3.8/site-packages/musiclm_pytorch/musiclm_pytorch.py", line 502, in get_audio_latents
    audio_embeds = self.audio(wavs)
  File "/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/venv/lib/python3.8/site-packages/musiclm_pytorch/musiclm_pytorch.py", line 379, in forward
    x = self.transformer(x, rel_pos_bias = rel_pos_bias)
  File "/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/venv/lib/python3.8/site-packages/musiclm_pytorch/musiclm_pytorch.py", line 217, in forward
    x = attn(x, rel_pos_bias = rel_pos_bias, mask = mask) + x
  File "/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/venv/lib/python3.8/site-packages/musiclm_pytorch/musiclm_pytorch.py", line 163, in forward
    sim = sim + rel_pos_bias
RuntimeError: The size of tensor a (20) must match the size of tensor b (2560) at non-singleton dimension 3

It happens only for audio inputs. Not for text.

Create endless stream of genre

Hi,

Can I use this to create an endless stream per genre? So a stream of "minimal house", "classical music in Bach style" and "reggae music of the Caribbean"?

inference time

Thanks for your great work!
Could you share with me the inference time of generating 10s audio on CPU/GPU?

Tried to run the example from the README, got an error related to tensor dimensions

I am trying to run the example in Google Colab, but I get a RuntimeError when running the part for obtaining the conditioning embeddings:

from musiclm_pytorch import MuLaNEmbedQuantizer

# setup the quantizer with the namespaced conditioning embeddings, unique per quantizer as well as namespace (per transformer)

quantizer = MuLaNEmbedQuantizer(
    mulan = mulan,                          # pass in trained mulan from above
    conditioning_dims = (1024, 1024, 1024), # say all three transformers have model dimensions of 1024
    namespaces = ('semantic', 'coarse', 'fine')
)

# now say you want the conditioning embeddings for semantic transformer

wavs = torch.randn(2, 1024)
conds = quantizer(wavs = wavs, namespace = 'semantic') # (2, 8, 1024) - 8 is number of quantizers

RuntimeError: The size of tensor a (20) must match the size of tensor b (2560) at non-singleton dimension 3

Runtime Error on CPU

Discussed in #55

Originally posted by sauravp June 9, 2023
I am trying to train this on CPU (on a small dataset) to validate some ideas.

import torch
from musiclm_pytorch import MusicLM, MuLaNTrainer
from musiclm_pytorch import MuLaN, AudioSpectrogramTransformer, TextTransformer, MuLaNEmbedQuantizer

import random
import numpy as np

device = 'cpu'


audio_transformer = AudioSpectrogramTransformer(
    dim = 512,
    depth = 6,
    heads = 8,
    dim_head = 64,
    spec_n_fft = 128,
    spec_win_length = 24,
    spec_aug_stretch_factor = 0.8
)

text_transformer = TextTransformer(
    dim = 512,
    depth = 6,
    heads = 8,
    dim_head = 64,
    max_seq_len = 512
)

mulan = MuLaN(
    audio_transformer = audio_transformer,
    text_transformer = text_transformer
)

mulan.to(device)
mulan.eval()

wavs = torch.randn(5, 1024).to(device)
texts = torch.randint(0, 20000, (5, 512)).to(device)
#print(wavs.shape, texts.shape)

from torch.utils.data import Dataset

class TextAudioDataset(Dataset):
    def __init__(self, wavs, texts):
        super().__init__()
        self.wavs = wavs
        self.texts = texts

    def __len__(self):
        if len(self.wavs) != len(self.texts):
            return -1
        else:
            return len(self.wavs)

    def __getitem__(self, idx):
        return self.wavs[idx], self.texts[idx]


trainer = MuLaNTrainer(
    mulan = mulan,
    dataset = TextAudioDataset(wavs, texts),
    batch_size = 2
)

trainer.to(device)

trainer.train()

I am getting the following error:

RuntimeError: stft input and window must be on the same device but got self on mps:0 and window on cpu

Is there a way to run the entire thing on CPU?

Additional Work

Could you please document any additional work which is needed for this project? I'd like to contribute.

time of training

Thanks for your great work!
I'm interested in audio music generation. Given the big dataset and complex architecture, I wonder how long the model would take to train in general.

checkpoint files?

I don't even know if it makes sense in this context, but would it be possible to release a checkpoint file for a trained model so we can run inference without having to train ourselves?

Getting started

I'm really interested in generating music from text. However, I have poor knowledge of the pytorch framework and ML in general.

I appreciate the README a lot. However, it's quite implicit, especially for a newbie like me. Is there a way I can get started?

How do I get the ./hubert/hubert_base_ls960.pt and ./hubert/hubert_base_ls960_L9_km500.bin files? Where do I download audio files?

Training MuLan

@lucidrains The dataset for training MuLan in their original paper seems to be private, so we need to look at other options, like the Free Music Archive (FMA) dataset, where the text part of a sample is a list of strings, like:

['low quality', 'sustained strings melody', 'soft female vocal', 'mellow piano melody', 'sad', 'soulful', 'ballad']

My question is: Which string should we feed to our network along with the audio? One randomly selected string? Or all? Or make pairs of all with the audio to make even more samples?
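
For illustration only, a hedged sketch of two of the options above (randomly sampling one tag per step, or joining all tags into a single caption); nothing here is prescribed by the repo:

import random

tags = ['low quality', 'sustained strings melody', 'soft female vocal',
        'mellow piano melody', 'sad', 'soulful', 'ballad']

caption = random.choice(tags)     # option 1: one randomly selected tag per training step
caption_all = ', '.join(tags)     # option 2: concatenate all tags into a single caption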

Controlling the length of output when calling MusicLM model

Using the default settings, it seems that MusicLM will always output a tensor of length 163840. This is a bit of a strange number, as it's not divisible by the standard sample rate of 44100 that it would presumably be trained on.

I've found that it's possible to pass a max_length argument when calling MusicLM, which gets passed to AudioLM. But passing this argument only controls how many semantic tokens are generated - the coarse, fine and output tensor remain the same size.

For now I've hacked a solution together by additionally passing a max_length to the self.coarse.generate() call in audiolm_pytorch:1628, but I'm wondering if this is the correct way to do it.

What's the best way to generate outputs of different lengths with this model?

Support our open source music pretrained Transformer

Hi, we are researchers from the MAP (music audio pre-train) project. We pre-train transformer LMs on large-scale music audio datasets.
See below. Our model, MERT, uses a similar method to HuBERT and has verified its performance on downstream music information retrieval tasks. It has been released on Hugging Face and can be used interchangeably with the HuBERT loading code: model = HubertModel.from_pretrained("m-a-p/MERT-v0")
We are currently working on training a better base model and scaling up to a large model with more music+speech data.
Using our weights as an initialization will be a better start than using speech HuBERT. Better checkpoints will be released soon.

https://huggingface.co/m-a-p/MERT-v0

nan loss in MuLaN training

@lucidrains
While training MuLaN on a dataset of around 5.2k samples, the loss goes to nan after some 15-16k steps.
My batch size is 4, and the text part of the data samples are tokenized using:

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
text_in_numbers = tokenizer.encode(text)

Does it have something to do with a zero division, or a square root of 0, in the loss function?
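
In case it helps, a generic way to localize where the first non-finite value appears (plain PyTorch anomaly detection, not specific to this repo; mulan, wavs and texts are assumed to be the training objects from the snippets above):

import torch

torch.autograd.set_detect_anomaly(True)    # slow; enable only while debugging

loss = mulan(wavs, texts)                  # assumed objects from the earlier snippets
assert torch.isfinite(loss), 'loss became non-finite'
loss.backward()                            # anomaly mode reports the op that produced the first nan/inf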

How to include audioset

Hello,

I am fairly new to the whole topic of huge ML models and I'd really like to try this out.
However, I don't quite get a grasp on how to implement the dataset to train the model.
In another issue, someone mentioned Audioset (https://research.google.com/audioset/download.html) which
sounds interesting but apparently "only" offers large csv-files.
How do I use them in this project, where a path to the dataset is required? Or is there a tutorial where I can look this up?

Thank you so much in advance!

Install and use musiclm

Hello,
I'd like to input a prompt and get music from your AI, but I'm not sure how I can install it.

Happy to help

I've been working on some basic music generation lately. I'm very interested in this implementation and would be happy to contribute in any way. Maybe that's as simple as donating to sustain your work, or it could be contributing code. Just want to put it out there that I support what you're doing and I'm very interested to see an implementation of this in PyTorch.

Nameerror in musiclm_pytorch.py

Hi, I'm implementing the code in Google Colab, and I found a strange error.

import torch
from musiclm_pytorch import MuLaN, AudioSpectrogramTransformer, TextTransformer

audio_transformer = AudioSpectrogramTransformer(
    dim = 512,
    depth = 6,
    heads = 8,
    dim_head = 64,
    spec_n_fft = 128,
    spec_win_length = 24,
    spec_aug_stretch_factor = 0.8
)

text_transformer = TextTransformer(
    dim = 512,
    depth = 6,
    heads = 8,
    dim_head = 64
)

mulan = MuLaN(
    audio_transformer = audio_transformer,
    text_transformer = text_transformer
)

The error occurs like this:

Traceback (most recent call last):
  in <cell line: 1>:1
  in __init__:52
  /usr/local/lib/python3.9/dist-packages/musiclm_pytorch/musiclm_pytorch.py:673 in __init__
    klass = SigmoidContrastiveLearning if sigmoid_contrastive_loss else partial(Soft
NameError: name 'partial' is not defined

I'm getting a memory error that seems unrealistic (small dataset) so I think I've messed up or there's a bug

Can you help with this when you have a moment please? I'd much appreciate it.

this is the error:

Traceback (most recent call last):
  File "C:\Program Files\JetBrains\PyCharm Community Edition 2022.3.2\plugins\python-ce\helpers\pydev\pydevd.py", line 1496, in _exec
    pydev_imports.execfile(file, globals, locals)  # execute the script
  File "C:\Program Files\JetBrains\PyCharm Community Edition 2022.3.2\plugins\python-ce\helpers\pydev\_pydev_imps\_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "E:\ML\playingWithMusicLM\main.py", line 114, in <module>
    loss = mulan(wavsTensor, selectedTextsTensor)
  File "C:\Users\user\.conda\envs\playingWithMusicLM\lib\site-packages\torch\nn\modules\module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "<@beartype(musiclm_pytorch.musiclm_pytorch.MuLaN.forward) at 0x24d7011c670>", line 47, in forward
  File "C:\Users\user\.conda\envs\playingWithMusicLM\lib\site-packages\musiclm_pytorch\musiclm_pytorch.py", line 547, in forward
    audio_latents = self.get_audio_latents(wavs)
  File "C:\Users\user\.conda\envs\playingWithMusicLM\lib\site-packages\musiclm_pytorch\musiclm_pytorch.py", line 523, in get_audio_latents
    audio_embeds = self.audio(wavs)
  File "C:\Users\user\.conda\envs\playingWithMusicLM\lib\site-packages\torch\nn\modules\module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\Users\user\.conda\envs\playingWithMusicLM\lib\site-packages\musiclm_pytorch\musiclm_pytorch.py", line 396, in forward
    rel_pos_bias = self.dynamic_pos_bias_mlp(rel_dist.float())
RuntimeError: [enforce fail at ..\c10\core\impl\alloc_cpu.cpp:72] data. DefaultCPUAllocator: not enough memory: you tried to allocate 2250000000 bytes.

I'm just running 10 song/text pairs as tensor params in order to "train mulan"

The code to do so is as follows:

 # get a ton of <sound, text> pairs and train
ids = []
texts = []
with open(musicDescriptiveMetadataFilename, newline='\n') as csvfile:
    rows = csv.reader(csvfile, delimiter=',')
    for rowNumber, row in enumerate(rows):
        if rowNumber > 0:
            id = row[0]
            text = row[5]
            ids.append(id)
            texts.append(text)

wavs = []
selectedTexts = []

audioFileNames = os.listdir(".//ytRips")
for n, id in enumerate(ids):
    if n < 10:
        for audioFileName in audioFileNames:
            if audioFileName.__contains__(id):
                a = read(".\\ytRips\\" + audioFileName)
                a = np.array(a[1], dtype=np.float32)
                try:
                    channels=a.shape[1]
                except:
                    channels=1
                    continue

                samples=a.shape[0]

                if channels==2:
                    a = np.resize(a, (samples,1))

                if samples == 480000:
                    wavs.append(a)
                    selectedTexts.append(numpy.asarray(stringToListOfInts(texts[n]),dtype=np.compat.long))

#resize texts to same size
resizedSelectedTexts=[]
for selectedText in selectedTexts:
    size=selectedText.shape[0]
    if size > 450:
        resizedSelectedTexts.append(numpy.resize(selectedText,(450,1)))
    else:
        tmp=selectedText
        for x in range(10):
            tmp=np.concatenate((tmp, selectedText), axis=0)
        if tmp is not None:
            resizedSelectedTexts.append(numpy.resize(np.stack(tmp, axis=0), (450,1)))

wavsTensor = torch.squeeze(torch.tensor(np.stack(wavs, axis=0),dtype=torch.float32))
selectedTextsTensor= torch.squeeze(torch.tensor(np.stack(resizedSelectedTexts, axis=0),dtype=torch.long))

loss = mulan(wavsTensor, selectedTextsTensor)
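
As a hedged guess at the cause (not a confirmed diagnosis): 480000-sample clips make the attention and relative-position-bias tensors inside the audio transformer very large, so cropping each clip before calling mulan keeps memory bounded. max_samples below is just an illustrative cap, mirroring the data_max_length used in the trainer example above.

max_samples = 320 * 32                          # illustrative cap, mirrors data_max_length above
wavsTensor = wavsTensor[:, :max_samples]        # crop every clip to the same bounded length
loss = mulan(wavsTensor, selectedTextsTensor)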

Training data

Hi there!
I'm trying to train the model using the MusicCaps dataset.

However, on the readme, according to wavs = torch.randn(2, 1024) it looks like the audio tensors are 2x1024 (which makes me think it's requiring stereo audio).
The MusicCaps audio is actually mono.

I'm not sure if I'm interpreting this correctly. Could you give me a hint here?
Thanks!
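
For what it's worth, a hedged reading of that README line (an assumption, not a confirmed answer): the tensor is (batch, num_samples) of mono audio, so the 2 is a batch of two clips rather than a stereo channel dimension.

import torch

wavs = torch.randn(2, 1024)              # assumed meaning: two mono clips of 1024 samples each

# batching two mono MusicCaps clips of equal length would then look like
clip_a = torch.randn(160000)             # (num_samples,)
clip_b = torch.randn(160000)
batch = torch.stack([clip_a, clip_b])    # (2, 160000)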

Few questions

First of all, it is a very interesting project.
Thanks for your work!

So, I'm trying to implement this project step by step on Colab(https://colab.research.google.com/drive/1fkXdwUBw9tDxofj5-us0vuOenuqC7rfZ?usp=sharing).

But there is something bothering me: the code below finished very quickly, in about 10 seconds.

import torch
from musiclm_pytorch import MuLaN, AudioSpectrogramTransformer, TextTransformer

audio_transformer = AudioSpectrogramTransformer(
    dim = 512,
    depth = 6,
    heads = 8,
    dim_head = 64,
    spec_n_fft = 128,
    spec_win_length = 24,
    spec_aug_stretch_factor = 0.8
)

text_transformer = TextTransformer(
    dim = 512,
    depth = 6,
    heads = 8,
    dim_head = 64
)

mulan = MuLaN(
    audio_transformer = audio_transformer,
    text_transformer = text_transformer
)

# get a ton of <sound, text> pairs and train

wavs = torch.randn(2, 1024)
texts = torch.randint(0, 20000, (2, 256))

loss = mulan(wavs, texts)
loss.backward()

With the message below:

spectrogram yielded shape of (65, 86), but had to be cropped to (64, 80) to be patchified for transformer

Is this normal behaviour?

And to create musiclm it requires audiolm; to create audiolm, do I have to create soundstream, coarse_transformer and fine_transformer according to here (https://github.com/lucidrains/audiolm-pytorch)?
Or is there another way to achieve it?

audiolm = AudioLM(
    wav2vec = wav2vec,
    soundstream = soundstream,
    semantic_transformer = semantic_transformer,
    coarse_transformer = coarse_transformer,
    fine_transformer = fine_transformer
)

musiclm = MusicLM(
    audio_lm = embeds_audio,
    mulan_embed_quantizer = quantizer
)

music = musiclm(['the crystalline sounds of the piano in a ballroom']) # torch.Tensor

ast pretrained model use

@lucidrains Hi, I want to use the pretrained AST (https://github.com/YuanGongND/ast) model as the audio transformer here.
In musiclm-pytorch, the input audio wave is 2-dimensional (e.g. (2, 1024)); however, the pretrained AST needs a 3-dimensional input (e.g. (batch_size, time, frequency)).

I saw an example someone made for applying the audiocap dataset to MuLaN training, but it didn't work because of the difference in input dimensions.
So, do you have an idea how to apply an existing dataset (e.g. audiocap) to this MuLaN model?
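
As a hedged sketch (the hyperparameters are placeholders, not the values AST was trained with), torchaudio can turn the (batch, num_samples) waveforms used here into the (batch, time, frequency) features an AST-style model expects:

import torch
import torchaudio

wavs = torch.randn(2, 16000)                  # (batch, num_samples), mono

mel = torchaudio.transforms.MelSpectrogram(
    sample_rate = 16000,                      # placeholder values, adjust to the pretrained checkpoint
    n_fft = 400,
    n_mels = 128
)(wavs)                                       # (batch, n_mels, time)

ast_input = mel.transpose(1, 2)               # (batch, time, n_mels), i.e. (batch, time, frequency)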
