
onsets-and-frames's Introduction

PyTorch Implementation of Onsets and Frames

This is a PyTorch implementation of Google's Onsets and Frames model, using the Maestro dataset for training and the Disklavier portion of the MAPS database for testing.

Instructions

This project is quite resource-intensive; 32 GB or more of system memory and 8 GB or more of GPU memory are recommended.

Downloading Dataset

The data subdirectory already contains the MAPS database. To download the Maestro dataset, first make sure the ffmpeg executable is available, then run the prepare_maestro.sh script:

ffmpeg -version
cd data
./prepare_maestro.sh

This will download the full Maestro dataset from Google's servers and automatically unzip and re-encode the audio as FLAC files to save storage. However, you will still need about 200 GB of space for intermediate storage.

Training

All package requirements are contained in requirements.txt. To train the model, run:

pip install -r requirements.txt
python train.py

train.py is written using sacred, and accepts configuration options such as:

python train.py with logdir=runs/model iterations=1000000

Trained models will be saved in the specified logdir, or otherwise in a timestamped directory under runs/.
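
For reference, here is a minimal sketch of how such options are typically exposed with sacred. This is an illustration, not the repo's actual train.py; the 500,000-iteration default matches the training logs shown later on this page, and the timestamped logdir default is an assumption based on the sentence above:

from datetime import datetime
from sacred import Experiment

ex = Experiment('train_transcriber')

@ex.config
def config():
    # each value can be overridden from the CLI, e.g. `python train.py with logdir=runs/model iterations=1000000`
    logdir = 'runs/transcriber-' + datetime.now().strftime('%y%m%d-%H%M%S')
    iterations = 500000

@ex.automain
def train(logdir, iterations):
    print(f'writing checkpoints to {logdir} for {iterations} iterations')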

Testing

To evaluate the trained model using the MAPS database, run the following command to calculate the note and frame metrics:

python evaluate.py runs/model/model-100000.pt

Specifying --save-path will output the transcribed MIDI file along with the piano roll images:

python evaluate.py runs/model/model-100000.pt --save-path output/

In order to test on the Maestro dataset's test split instead of the MAPS database, run:

python evaluate.py runs/model/model-100000.pt Maestro test

Implementation Details

This implementation contains a few of the additional improvements to the model that were reported in the Maestro paper, including:

  • Offset head
  • Increased model capacity, making it 26M parameters by default
  • Gradient stopping of inter-stack connections
  • L2 gradient clipping of each parameter at 3 (see the sketch after this list)
  • Using the HTK mel frequencies
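
As an illustration of the per-parameter clipping named above (not necessarily the exact code in this repo), a minimal sketch in PyTorch; it would be called after loss.backward() and before optimizer.step():

import torch

def clip_gradients_per_parameter(parameters, max_norm=3.0):
    # clip the L2 norm of each parameter's gradient individually,
    # rather than the global norm over all parameters at once
    for p in parameters:
        if p.grad is not None:
            torch.nn.utils.clip_grad_norm_([p], max_norm)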

Meanwhile, this implementation does not include the following features:

  • Variable-length input sequences that are sliced at silence or zero crossings
  • Harmonically decaying weights on the frame loss

Despite these omissions, this implementation achieves performance comparable to what the Maestro paper reports without data augmentation.

onsets-and-frames's People

Contributors

cwitkowitz, dependabot[bot], falaktheoptimist, jongwook, sargammenghani

onsets-and-frames's Issues

Expected results?

I get the following when evaluating on MAPS after training the model over 100k iterations.

These metrics appear to be quite low, especially the frame metrics which are 0.65/0.65/0.64 whereas the Maestro paper reports 0.90/0.95/0.81.

Is this expected?

Thanks!

                            note precision                : 0.795 ± 0.096
                            note recall                   : 0.756 ± 0.109
                            note f1                       : 0.773 ± 0.096
                            note overlap                  : 0.541 ± 0.101
               note-with-offsets precision                : 0.362 ± 0.127
               note-with-offsets recall                   : 0.345 ± 0.126
               note-with-offsets f1                       : 0.352 ± 0.125
               note-with-offsets overlap                  : 0.808 ± 0.092
              note-with-velocity precision                : 0.739 ± 0.093
              note-with-velocity recall                   : 0.704 ± 0.110
              note-with-velocity f1                       : 0.719 ± 0.096
              note-with-velocity overlap                  : 0.543 ± 0.102
  note-with-offsets-and-velocity precision                : 0.341 ± 0.123
  note-with-offsets-and-velocity recall                   : 0.325 ± 0.124
  note-with-offsets-and-velocity f1                       : 0.332 ± 0.122
  note-with-offsets-and-velocity overlap                  : 0.807 ± 0.092
                           frame f1                       : 0.636 ± 0.108
                           frame precision                : 0.649 ± 0.163
                           frame recall                   : 0.654 ± 0.102
                           frame accuracy                 : 0.475 ± 0.115
                           frame substitution_error       : 0.106 ± 0.058
                           frame miss_error               : 0.240 ± 0.108
                           frame false_alarm_error        : 0.337 ± 0.338
                           frame total_error              : 0.683 ± 0.337
                           frame chroma_precision         : 0.686 ± 0.155
                           frame chroma_recall            : 0.696 ± 0.102
                           frame chroma_accuracy          : 0.516 ± 0.106
                           frame chroma_substitution_error: 0.064 ± 0.033
                           frame chroma_miss_error        : 0.240 ± 0.108
                           frame chroma_false_alarm_error : 0.337 ± 0.338
                           frame chroma_total_error       : 0.641 ± 0.315

Missing Sigmoid when calculating BCE loss / Activation function during inference

Hi JongWook,

Thank you for your implementation.

I have 2 questions regarding your implementation.

  1. Based on the documentation, a sigmoid activation needs to be applied to the logits before calculating the BCE loss. However, in transcriber.py, a sigmoid is not applied before the loss is computed. May I know whether this is an error?

  2. During inference in evaluate.py, I noticed that a relu activation is applied in the evaluate() function. Shouldn't sigmoid be used here instead of relu?

Best,

Nicolas
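
For illustration only (this is not code from the repo), the two standard ways of computing binary cross-entropy from logits in PyTorch are equivalent, and only one of them expects an explicit sigmoid; which form transcriber.py intends determines whether the missing sigmoid is actually a bug:

import torch
import torch.nn.functional as F

logits = torch.randn(4, 88)
targets = torch.randint(0, 2, (4, 88)).float()

loss_a = F.binary_cross_entropy_with_logits(logits, targets)     # sigmoid folded into the loss
loss_b = F.binary_cross_entropy(torch.sigmoid(logits), targets)  # sigmoid applied explicitly
assert torch.allclose(loss_a, loss_b, atol=1e-6)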

delete

sorry, GitHub threw an error but ended up actually opening the issue, so it got posted twice

Errors in mir_eval library

I'm running train.py with the following parameters:

batch_size = 4
sequence_length = 327680
model_complexity = 16

And I got the following error:

0%|          | 499/500000 [01:32<25:51:47,  5.36it/s]/home/lab/.linuxbrew/Cellar/python/3.7.2_2/lib/python3.7/site-packages/mir_eval/transcription.py:167: UserWarning: Estimated notes are empty.
  warnings.warn("Estimated notes are empty.")
/home/lab/.linuxbrew/Cellar/python/3.7.2_2/lib/python3.7/site-packages/mir_eval/multipitch.py:275: UserWarning: Estimate frequencies are all empty.
  warnings.warn("Estimate frequencies are all empty.")
  1%|          | 2999/500000 [10:07<25:37:36,  5.39it/s]/home/lab/.linuxbrew/Cellar/python/3.7.2_2/lib/python3.7/site-packages/mir_eval/transcription_velocity.py:185: FutureWarning: `rcond` parameter will change to the default of machine precision times ``max(M, N)`` where M and N are the input matrix dimensions.
To use the future default and silence this warning we advise to pass `rcond=None`, to keep using the old, explicitly pass `rcond=-1`.
  ref_matched_velocities)[0]
ERROR - train_transcriber - Failed after 0:10:13!
Traceback (most recent calls WITHOUT Sacred internals):
  File "train.py", line 117, in train
    for key, value in evaluate(validation_dataset, model).items():
  File "/home/lab/Documents/CS224NProject/Sample/onsets-and-frames/evaluate.py", line 70, in evaluate
    p, r, f, o = evaluate_notes_with_velocity(i_ref, p_ref, v_ref, i_est, p_est, v_est, velocity_tolerance=0.1)
  File "/home/lab/.linuxbrew/Cellar/python/3.7.2_2/lib/python3.7/site-packages/mir_eval/transcription_velocity.py", line 291, in precision_recall_f1_overlap
    offset_min_tolerance, strict, velocity_tolerance)
  File "/home/lab/.linuxbrew/Cellar/python/3.7.2_2/lib/python3.7/site-packages/mir_eval/transcription_velocity.py", line 178, in match_notes
    ref_matched_velocities = ref_velocities[matching[:, 0]]
IndexError: too many indices for array

BTW, another issue: I first tried all the default parameters and kept getting a GPU out-of-memory message, even though my system should meet the minimum requirements.

Does anyone have insights into these two issues?

RuntimeError: Input type (torch.FloatTensor) and weight type (torch.cuda.FloatTensor) should be the same

When running transcribe.py, if I set DEFAULT_DEVICE='cpu', this raises the error:
RuntimeError: Input type (torch.FloatTensor) and weight type (torch.cuda.FloatTensor) should be the same

File "export.py", line 44, in transcribe
mel = melspectrogram(audio.reshape(-1, audio.shape[-1])[:, :-1]).transpose(-1, -2)
File "/home/ryusinka/anaconda3/envs/pytorch/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in call
result = self.forward(*input, **kwargs)
File "/home/ryusinka/cli/pytorch/onsets-and-frames/onsets_and_frames/mel.py", line 93, in forward
magnitudes, phases = self.stft(y)
File "/home/ryusinka/anaconda3/envs/pytorch/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in call
result = self.forward(*input, **kwargs)
File "/home/ryusinka/cli/pytorch/onsets-and-frames/onsets_and_frames/mel.py", line 59, in forward
padding=0)
RuntimeError: Input type (torch.FloatTensor) and weight type (torch.cuda.FloatTensor) should be the same
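
The message means the convolution weights live on the GPU while the input tensor is still on the CPU. Below is a minimal self-contained illustration of the mismatch and the usual fix, which is keeping inputs and weights on the same device; this is a general PyTorch pattern, not necessarily how the repo intends DEFAULT_DEVICE to be handled:

import torch

conv = torch.nn.Conv1d(1, 8, kernel_size=3)
x = torch.randn(1, 1, 100)

if torch.cuda.is_available():
    conv = conv.cuda()
    # conv(x) at this point would raise the "Input type ... and weight type ..." RuntimeError
    x = x.cuda()  # moving the input to the same device as the weights fixes it

y = conv(x)
print(y.shape)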

Subtract - not supported for PyTorch 1.3.1

Hi, really thank you for your implementation!
I use PyTorch 1.3.1, which reports an error at line 25 of onsets_and_frames/decoding.py.
Since the '-' operation between two bool tensors is deprecated, can I simply replace A - B with (A ^ B) | (A & B)?
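
Note that (A ^ B) | (A & B) simplifies to A | B, which is not what the subtraction computes. In decoding code like this, A - B is usually followed by == 1 to detect rising edges; under that assumption, here is a version-safe sketch of two equivalent replacements:

import torch

A = torch.tensor([0, 1, 1, 0, 1], dtype=torch.bool)
B = torch.tensor([0, 0, 1, 1, 0], dtype=torch.bool)

rising_cast = (A.int() - B.int()) == 1  # cast to int before subtracting: allowed on all versions
rising_bool = A & ~B                    # pure boolean equivalent of "(A - B) == 1"
assert torch.equal(rising_cast, rising_bool)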

results when replicate the code following Onsets and Frames: Dual-Objective Piano Transcription

Hello @jongwook, thanks for open-sourcing the code.
Recently I used your code to reproduce the performance reported in the paper "Onsets and Frames: Dual-Objective Piano Transcription". I trained the model on MAPS with batch_size=4 for 358,000 iterations. When evaluating, I get the following performance.
[screenshot of evaluation results omitted]
Some metrics appear to be quite low, especially the frame metrics, which are 82.2/70.4/75.5, whereas the "Onsets and Frames: Dual-Objective Piano Transcription" paper reports 88.53/70.89/78.3.

Do you know the reason for that?
Thanks a lot.

If only the onset stack is trained, how accurate will it be?

I used TensorFlow Keras to implement the onset_stack network and prepared the corresponding training data. As I understand the model, the onset stack only predicts the starting points of piano notes. However, my training accuracy is extremely low: after several hours of training, the accuracy is only about 0.05. Is it possible to train only the onset stack, and if so, what accuracy should it reach?

Converting midi to txt

It seems to be a problem with the midi-to-tsv conversion. The original Onsets and Frames paper says:

"If a note is active when sustain goes on, that note will be extended until either sustain goes off or the same note is played again. This process gives the same note durations as the text files included with the dataset."

I noticed that I don't get the same offsets as in the text files, and then I saw that when moving the offset, you search for a sustain-off event but not also for a new onset of the same note. Or did you omit this for a reason?

Adding the highlighted condition (bold in the original post) to line 45 of midi.py seems to give the result I wanted:

offset = next(n for n in events[offset['index'] + 1:] if n['type'] == 'sustain_off' or n['note'] == offset['note'] or n is events[-1])

How to run inference on single wav file?

When I try using transcribe.py to run inference on a single wav file, it throws: Error opening '/home/Jason/TestFlac/': File contains data in an unknown format.

Many warnings during training

During training, I get tons of warnings for Sequential, ConvStack, LSTM, etc.
The warning messages look like this:

"type " + obj.name + ". It won't be checked "
/opt/conda/envs/onset_model/lib/python3.6/site-packages/torch/serialization.py:292: UserWarning: Couldn't retrieve source code for container of type LSTM. It won't be checked for correctness upon loading.

Is it normal?

I created a new conda environment to run this code and installed all the dependencies via requirements.txt.

RuntimeError: Tensor stacks

Hello ,

I'm trying to run your model in training mode and get the following error:

RuntimeError: stack expects each tensor to be equal size, but got [327680, 2] at entry 0 and [327680] at entry 6

Do you know what the problem could be? I installed all the suggested requirements.

Sincerely yours,
Aleksandra
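
The [327680, 2] shape at entry 0 suggests that at least one audio file is stereo while the others are mono (an assumption based on the error message alone). If that is the cause, here is a minimal sketch of downmixing such a file to mono before training, using soundfile; the file name is hypothetical:

import numpy as np
import soundfile as sf

audio, sr = sf.read('example.flac', dtype='int16')   # hypothetical path to the offending file
if audio.ndim == 2:                                  # (samples, channels) -> average channels to mono
    audio = audio.mean(axis=1).astype(np.int16)
sf.write('example.flac', audio, sr, subtype='PCM_16')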

Discarding the last sample point from the audio?

In your transcriber.py, line 102, you obtain the melspectrogram by using

mel = melspectrogram(audio_label.reshape(-1, audio_label.shape[-1])[:, :-1]).transpose(-1, -2)

The audio_label.reshape(-1, audio_label.shape[-1])[:, :-1] part discards the last audio sample.
May I know the reason for putting [:, :-1] there to discard it?

What happens if we keep the complete audio? (Not discarding the last sample point)
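
One plausible reason, offered here as an assumption rather than the author's confirmed intent, is frame-count alignment: with a centered STFT, an input of exactly N x hop_length samples yields N + 1 frames, while dropping one sample yields exactly N, matching an N-step label matrix. A quick arithmetic sketch, assuming the usual 16 kHz / 512-sample-hop setup:

hop_length = 512                     # 32 ms at 16 kHz (assumed to match the repo's constants)
sequence_length = 327680             # 640 * hop_length, the value used in train.py

frames_full    = 1 + sequence_length // hop_length         # centered STFT on the full audio: 641 frames
frames_trimmed = 1 + (sequence_length - 1) // hop_length   # after dropping one sample:       640 frames
print(frames_full, frames_trimmed)                         # 641 640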

Upload pretrained model to run inference

It would be great if anyone could upload a pretrained model so that we could try this model/project without needing to train the model. It is quite a big commitment to wait a week for training (as mentioned in #10 ) if you primarily just want to check out the performance on some .wav files.

And I would also like to say this repo is very well written and educational. Thanks!

Suggest to loosen the dependency on sacred

Dear developers,

Your project onsets-and-frames requires "sacred==0.7.4" as a dependency. After analyzing the source code, we found that sacred 0.7.3 is also suitable and does not affect your project. Therefore, we suggest loosening the dependency from "sacred==0.7.4" to "sacred>=0.7.3,<=0.7.4" to avoid possible conflicts when importing more packages or for downstream projects that depend on onsets-and-frames.

May I open a pull request to loosen the dependency on sacred?

By the way, could you please tell us whether this kind of dependency analysis could help make dependency maintenance easier during your development?



Details:

Your project (commit id: 783ca08) directly uses 3 APIs from the sacred package:

sacred.experiment.Experiment.__init__, sacred.observers.file_storage.FileStorageObserver.create, sacred.commands.print_config

Starting from these, 44 functions are then indirectly called, including 25 of sacred's internal APIs and 19 outside APIs, as follows:

[/jongwook/onsets-and-frames]
+--sacred.experiment.Experiment.__init__
|      +--inspect.stack
|      +--os.path.basename
|      +--sacred.ingredient.Ingredient.__init__
|      |      +--collections.OrderedDict
|      |      +--inspect.stack
|      |      +--os.path.dirname
|      |      +--os.path.abspath
|      |      +--sacred.dependencies.gather_sources_and_dependencies
|      |      |      +--sacred.dependencies.get_main_file
|      |      |      |      +--os.path.abspath
|      |      |      |      +--sacred.dependencies.Source.create
|      |      |      |      |      +--os.path.exists
|      |      |      |      |      +--sacred.dependencies.get_py_file_if_possible
|      |      |      |      |      |      +--os.path.exists
|      |      |      |      |      +--os.path.abspath
|      |      |      |      |      +--sacred.dependencies.get_commit_if_possible
|      |      |      |      |      |      +--os.path.dirname
|      |      |      |      |      |      +--git.Repo
|      |      |      |      |      |      +--git.Repo.is_dirty
|      |      |      |      |      |      +--git.Repo.remote
|      |      |      |      |      +--sacred.dependencies.Source.__init__
|      |      |      |      |      +--sacred.dependencies.get_digest
|      |      |      |      |      |      +--hashlib.md5
|      |      |      |      +--os.path.dirname
|      |      |      +--sacred.dependencies.PackageDependency.create
|      |      |      |      +--sacred.dependencies.PackageDependency.__init__
|      +--sacred.ingredient.Ingredient.command
|      |      +--sacred.ingredient.Ingredient.capture
|      |      |      +--sacred.config.captured_function.create_captured_function
|      |      |      |      +--sacred.config.signature.Signature.__init__
|      |      |      |      |      +--sacred.config.signature.get_argspec
|      |      |      |      |      |      +--inspect.signature
|      |      |      |      |      |      +--collections.OrderedDict
|      |      |      |      |      |      +--inspect.getfullargspec
|      |      |      |      |      |      +--inspect.getargspec
|      |      |      |      +--sacred.config.captured_function.captured_function
|      |      |      |      |      +--sacred.config.custom_containers.FallbackDict.__init__
|      |      |      |      |      +--sacred.randomness.get_seed
|      |      |      |      |      |      +--random.randint
|      |      |      |      |      +--sacred.randomness.create_rnd
|      |      |      |      |      |      +--random.Random
|      |      |      |      |      +--time.time
|      |      |      |      |      +--datetime.timedelta
+--sacred.observers.file_storage.FileStorageObserver.create
|      +--os.path.exists
|      +--os.makedirs
|      +--os.path.join
|      +--sacred.utils.FileNotFoundError.__init__
+--sacred.commands.print_config
|      +--sacred.commands._format_config
|      |      +--sacred.commands._iterate_marked
|      |      |      +--sacred.utils.iterate_flattened_separately
|      |      |      |      +--sacred.utils.iterate_flattened_separately
|      |      |      |      |      +--sacred.utils.join_paths
|      |      +--sacred.commands._format_entry

Since none of these functions changed between sacred 0.7.3 and 0.7.4, we believe it is safe to loosen the corresponding dependency.

How are tsv files created?

If I am not wrong, you store all the annotations in tsv files and load the labels directly from these tsv files.

If I have my own dataset with only .midi files, how do I get these annotations? Do you have a script for the .midi-to-tsv conversion?
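
As a rough illustration of the target format described elsewhere on this page (one row per note: onset seconds, offset seconds, MIDI pitch, velocity), here is a minimal sketch using the third-party pretty_midi library. It is not the repo's own conversion and ignores the sustain-pedal extension discussed in the "Converting midi to txt" issue above; the file names are hypothetical, and whether a header row is expected should be checked against the repo's existing tsv files:

import pretty_midi

def midi_to_tsv(midi_path, tsv_path):
    # write one row per note: onset time, offset time, MIDI pitch, velocity
    midi = pretty_midi.PrettyMIDI(midi_path)
    with open(tsv_path, 'w') as f:
        for instrument in midi.instruments:
            for note in sorted(instrument.notes, key=lambda n: n.start):
                f.write(f'{note.start:.6f}\t{note.end:.6f}\t{note.pitch}\t{note.velocity}\n')

midi_to_tsv('example.midi', 'example.tsv')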

Why is there a need for Torch implementation of STFT and MelSpectrogram?

Hi Jong,
Thank you for such a useful implementation! Sorry for a silly doubt, but I am a beginner in MIR and working with Onsets and Frames for a project of mine.

Looking at the STFT and MelSpectrogram classes, they don't seem to be (and probably don't need to be) 'learnable'. I am a little confused about why a Torch implementation of the STFT and mel spectrogram is needed. Is it not possible to use the librosa implementations instead?

Thanks!
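
One likely motivation, offered as the editor's reading rather than the author's statement, is that a torch implementation runs on the GPU, in batches, on the same device as the model, whereas librosa operates on NumPy arrays on the CPU. A minimal sketch of a GPU-capable mel spectrogram built from torch.stft plus librosa's mel filterbank; the 2048/512/229 constants are assumed to mirror the repo's settings, and htk=True follows the "HTK mel frequencies" note in the README:

import torch
import librosa

sample_rate, n_fft, hop_length, n_mels = 16000, 2048, 512, 229

mel_basis = torch.from_numpy(
    librosa.filters.mel(sr=sample_rate, n_fft=n_fft, n_mels=n_mels, htk=True)).float()
window = torch.hann_window(n_fft)

def melspectrogram(audio):  # audio: (batch, samples) float tensor on any device
    spec = torch.stft(audio, n_fft, hop_length=hop_length, window=window.to(audio.device),
                      center=True, return_complex=True)
    power = spec.abs() ** 2                    # (batch, n_fft // 2 + 1, frames)
    mel = mel_basis.to(audio.device) @ power   # (batch, n_mels, frames)
    return torch.log(torch.clamp(mel, min=1e-5))

print(melspectrogram(torch.randn(2, 327680)).shape)  # torch.Size([2, 229, 641])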

Checkpoint and possible fine-tuning on a custom dataset

Hello Jong Wook,

I would like to experiment with fine-tuning Onsets and Frames on a custom dataset using your PyTorch implementation.

To that end, may I ask whether a pretrained model checkpoint is available for your implementation?

I would then format the custom dataset like the MAPS example:
  • one folder of .flac audio inputs at 16 kHz mono
  • one folder of matching .tsv annotation targets (col 1: onset sec. / col 2: offset sec. / col 3: note / col 4: velocity)
so that it can be read with PianoRollAudioDataset and used to continue training from a previous checkpoint.

One more thing I would like to confirm about the annotation files: the 3rd (note) column should be the MIDI pitch (in the range of the 88 piano keys), and the 4th (velocity) column should be scaled to which range? (It doesn't seem to go up to 127 like a MIDI velocity.)

Thanks for porting the model to PyTorch!
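
On the mechanics of continuing from a checkpoint, a generic PyTorch sketch, assuming the checkpoint stores the whole model object as the model-100000.pt usage in the README suggests (adjust if it stores a state_dict instead); the learning rate and data loader are illustrative placeholders:

import torch

device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = torch.load('runs/model/model-100000.pt', map_location=device)  # checkpoint path from the README
model.train()

optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)  # small learning rate for fine-tuning (illustrative)
# for batch in custom_loader:               # e.g. a DataLoader over the custom PianoRollAudioDataset
#     loss = compute_loss(model, batch)     # hypothetical helper; mirror whatever train.py does here
#     optimizer.zero_grad()
#     loss.backward()
#     optimizer.step()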

Error while trying to retrain the model with MAPS dataset

I'm trying to train the model with MAPS dataset with the command:

python train.py with logdir=runs/model iterations=1000000

and

python train.py

But I got the following error:

INFO - train_transcriber - Running command 'train'
INFO - train_transcriber - Started run with ID "3"
ERROR - train_transcriber - Failed after 0:00:02!
Exception originated from within Sacred.
Traceback (most recent calls):
  File "/home/user/anaconda3/lib/python3.8/site-packages/sacred/commands.py", line 40, in _non_unicode_repr
    repr_string, isreadable, isrecursive = pprint._safe_repr(objekt, context,
TypeError: _safe_repr() missing 1 required positional argument: 'sort_dicts'

I installed the dependencies with:

pip install -r requirements.txt

I checked the version of sacred and it's 0.7.4

Stuck at 0%

Hello, I am using a new dataset and want to start with a small portion of it, like 4 MIDI and 4 wav files, but the training is stuck at 0%. I kept it running for 30 seconds and there was still no change.
0%| | 0/500000 [00:00<?, ?it/s]

Thanks.

Onset length as per the onset frames paper

Firstly, thank you so much for your super useful PyTorch implementation of the Onsets and Frames model. It has been valuable for understanding the paper and for our project. I was wondering about the length of the onsets in the labels, which is 1 frame in this implementation:

label[left:onset_right, f] = 3

However, the Onsets and Frames paper mentions that

We performed a coarse hyperparameter search over onset length (we tried 16, 32 and 48ms) and found that 32ms worked best. In hindsight this is not surprising as it is also the length of our frames and so almost all onsets will end up spanning exactly two frames.

In this case, would making this 2 frames help (either here or by doubling the ONSET_LENGTH constant)? I was also curious because it took the model about 6k steps on the Maestro dataset to start predicting onsets (not surprising, since onsets are sparse across samples); before that it predicted only frames and no onsets. I wanted to know your take on these values.

Thanks.

Kouretchian

Hello Jong Wook

I downloaded the model you shared as a checkpoint and tried to apply it to a music file to see the result. However, when I run transcribe.py I get the error "'LSTM' object has no attribute '_flat_weights'". I would appreciate your help with this.

Training method enquiry and evaluation issue

Hello,

Thank you for providing the PyTorch version of onsets-and-frames.
I would like to ask two questions.

  • Is the training method here similar to the one described in the Onsets and Frames paper? "we split the training audio into smaller files... We found that 20 second splits allowed us to achieve a reasonable batch size during training of at least 8... When notes are active and we must split, we choose a zero-crossing of the audio signal. Inference is performed on the original and un-split audio file."

If not, could you please explain the inputs and predictions of the model?

  • When using the provided checkpoint and running the evaluation, I get zero predictions for both MAPS and MAESTRO. I have updated the mir_eval library to the latest version and still have the same issue. Could you please advise me on this?
UserWarning: Reference notes are empty.
UserWarning: Estimated notes are empty.
UserWarning: Estimate frequencies are all empty.
UserWarning: Reference frequencies are all empty.
UserWarning: Reference frequencies are all empty.

Thank you!
