
Insights into transfer learning between image and audio music transcription

Full text available here.


About

We consider a music transcription system trained on either image (Optical Music Recognition, OMR) or audio (Automatic Music Transcription, AMT) [1] data, and adapt it to the unseen domain during the training phase using different transfer learning schemes.


How To Use

Dataset

We use the Camera-PrIMuS dataset.

The Camera-PrIMuS dataset contains 87 678 [2] real-music incipits [3], each represented by six files: (i) the Plaine and Easie code source, (ii) an image with the rendered score, (iii) a distorted image, (iv) the musical symbolic representation of the incipit both in Music Encoding Initiative format (MEI) and (v) in an on-purpose simplified encoding (semantic encoding), and (vi) a sequence containing the graphical symbols shown in the score with their position in the staff without any musical meaning (agnostic encoding).

To obtain the corresponding audio files, we must convert one of the provided representations to MIDI and then synthesize the MIDI data. We have opted to convert the semantic representation, as there is a publicly available semantic-to-MIDI converter. Once we have obtained the MIDI files, we render them using FluidSynth.

The specific steps to follow are:

  1. Download the semantic-to-MIDI converter from here and place the omr-3.0-SNAPSHOT.jar file in the dataset folder.
  2. Download a General MIDI SoundFont (sf2). We recommend the SGM-v2.01 soundfont, as this code has been tested with it. Place the sf2 file in the dataset folder.
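For orientation, the MIDI-to-audio rendering step described above can be sketched in Python as a thin wrapper around the FluidSynth command-line tool. This is a minimal sketch and not the contents of prepare_data.sh: the function names, the file names, and the 44100 Hz sample rate are assumptions chosen for illustration.

```python
import subprocess


def fluidsynth_command(midi_path, soundfont_path, wav_path, sample_rate=44100):
    """Build the FluidSynth argument list that renders one MIDI file to WAV.

    The sample rate is an illustrative default, not a value taken from the paper.
    """
    return [
        "fluidsynth",
        "-n",                     # do not open a MIDI input driver
        "-i",                     # do not start the interactive shell
        "-F", str(wav_path),      # fast-render the audio to this file
        "-r", str(sample_rate),   # output sample rate in Hz
        str(soundfont_path),      # SoundFont (sf2) used for synthesis
        str(midi_path),           # MIDI file to render
    ]


def render(midi_path, soundfont_path, wav_path):
    """Run FluidSynth; raises CalledProcessError if rendering fails."""
    subprocess.run(fluidsynth_command(midi_path, soundfont_path, wav_path), check=True)
```

Calling render("incipit.mid", "soundfont.sf2", "incipit.wav") would synthesize a single file, assuming FluidSynth is on the PATH and both input files exist (the file names here are hypothetical).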

Experiments

We consider two scenarios:

  • Scenario A. This scenario assesses the performance of the transcription models when transfer learning is both considered and ignored.
  • Scenario B. This scenario studies the amount of data required in the target domain for an efficient transfer process that outperforms the base case of ignoring transfer learning.
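To make Scenario B concrete, the following minimal sketch shows one way to draw nested subsets of the target-domain training set at increasing sizes. It is an illustration only: the function name, the fractions, and the seed are hypothetical and are not taken from the paper or from main.py.

```python
import random


def subsample(samples, fraction, seed=0):
    """Return a reproducible subset holding roughly `fraction` of the samples.

    The list is shuffled once with a fixed seed and a prefix is taken, so a
    larger fraction always contains every sample of a smaller one.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in the interval (0, 1]")
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    size = max(1, round(len(shuffled) * fraction))
    return shuffled[:size]


# Hypothetical sample identifiers and fractions, for illustration only.
target_ids = list(range(100))
subsets = {fraction: subsample(target_ids, fraction) for fraction in (0.1, 0.5, 1.0)}
```

Keeping the subsets nested means that any change in transcription quality between two fractions comes from the added samples rather than from a different random draw.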

To replicate our experiments, you will first need to meet certain requirements specified in the Dockerfile. Alternatively, you can set up a virtual environment if preferred. Once you have prepared your environment (either a Docker container or a virtual environment) and followed the steps in the dataset section, you are ready to begin. Follow this recipe to replicate our experiments:

Important note: To execute the following code, both Java and FluidSynth must be installed.

$ cd dataset
$ sh prepare_data.sh
$ cd ..
$ python main.py
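Before running the recipe, it can help to confirm that the two external tools named in the note above are reachable. The sketch below is an optional convenience check, not part of the repository; the executable names java and fluidsynth are assumed to match those on your system.

```python
import shutil


def missing_tools(tools=("java", "fluidsynth")):
    """Return the names of the required executables that are not on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing required tools: " + ", ".join(missing))
    else:
        print("Java and FluidSynth were found.")
```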

Citations

@inproceedings{alfaro2022insights,
  title     = {{Insights into Transfer Learning between Image and Audio Music Transcription}},
  author    = {Alfaro-Contreras, Mar{\'\i}a and Valero-Mas, Jose J and I{\~n}esta, Jos{\'e} M and Calvo-Zaragoza, Jorge},
  booktitle = {{Proceedings of the 19th Sound and Music Computing Conference}},
  pages     = {295--301},
  year      = {2022},
  publisher = {Zenodo},
  month     = jun,
  address   = {Saint-Étienne, France},
  doi       = {10.5281/zenodo.6797870},
}

Acknowledgments

This work is part of the I+D+i PID2020-118447RA-I00 (MultiScore) project, funded by MCIN/AEI/10.13039/501100011033.

License

This work is under an MIT license.

Footnotes

  1. It is important to clarify that the model we are referring to is actually an Audio-to-Score (A2S) model. At the time of conducting this research, we used the term AMT because the distinction between AMT and A2S did not exist in the literature. However, nowadays, there is a clear distinction between the two. AMT typically focuses on note-level transcription, encoding the acoustic piece in terms of onset, offset, pitch values, and the musical instrument of the estimated notes. In contrast, A2S aims to achieve a score-level codification.

  2. In this work, we consider 22 285 samples out of the total 87 678 that constitute the complete Camera-PrIMuS dataset. This selection resulted from a data curation process, primarily involving the removal of samples containing long multi-rests. These music events contribute minimally to the length of the score image but may span a large number of frames in the audio signal.

  3. An incipit is a sequence of notes, typically the first ones, used for identifying a melody or musical work.
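The curation step mentioned in footnote 2 can be illustrated with a short filter over semantic encodings. This is a sketch under stated assumptions, not the authors' curation script: it assumes multi-rests appear as tokens of the form multirest-N, where N is the number of bars, and the max_bars threshold is hypothetical.

```python
import re

# Assumed token shape for multi-rests in the semantic encoding.
MULTIREST = re.compile(r"^multirest-(\d+)$")


def has_long_multirest(semantic_tokens, max_bars=1):
    """Return True if any multi-rest token spans more than `max_bars` bars."""
    for token in semantic_tokens:
        match = MULTIREST.match(token)
        if match and int(match.group(1)) > max_bars:
            return True
    return False


def curate(samples, max_bars=1):
    """Keep only the token sequences without long multi-rests."""
    return [tokens for tokens in samples if not has_long_multirest(tokens, max_bars)]
```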
