mir-dataset-loaders / mirdata

Python library for working with Music Information Retrieval datasets

Home Page: https://mirdata.readthedocs.io/en/stable/

License: BSD 3-Clause "New" or "Revised" License

Python 100.00%
music audio python dataset mirdata mir

mirdata's Introduction

mirdata

Common loaders for Music Information Retrieval (MIR) datasets. Find the API documentation here.


This library provides tools for working with common MIR datasets, including tools for:

  • downloading datasets to a common location and format
  • validating that the files for a dataset are all present
  • loading annotation files to a common format, consistent with the format required by mir_eval
  • parsing track level metadata for detailed evaluations

Installation

To install, simply run:

pip install mirdata

Quick example

import mirdata

orchset = mirdata.initialize('orchset')
orchset.download()  # download the dataset
orchset.validate()  # validate that all the expected files are there

example_track = orchset.choice_track()  # choose a random example track
print(example_track)  # see the available data

See the documentation for more examples and the API reference.

Currently supported datasets

Supported datasets include AcousticBrainz, DALI, Guitarset, MAESTRO, TinySOL, among many others.

For the complete list of supported datasets, see the documentation.

Citing

There are two ways of citing mirdata:

If you are using the library for your work, please cite the version you used as indexed at Zenodo:

DOI

If you refer to mirdata's design principles, motivation etc., please cite the following paper:

DOI

"mirdata: Software for Reproducible Usage of Datasets"
Rachel M. Bittner, Magdalena Fuentes, David Rubinstein, Andreas Jansson, Keunwoo Choi, and Thor Kell
in International Society for Music Information Retrieval (ISMIR) Conference, 2019
@inproceedings{bittner_fuentes_2019,
  title={mirdata: Software for Reproducible Usage of Datasets},
  author={Bittner, Rachel M and Fuentes, Magdalena and Rubinstein, David and Jansson, Andreas and Choi, Keunwoo and Kell, Thor},
  booktitle={International Society for Music Information Retrieval (ISMIR) Conference},
  year={2019}
}

When working with datasets, please cite the version of mirdata that you are using (given by the DOI above) AND include the reference of the dataset, which can be found in the respective dataset loader using the cite() method.

Contributing a new dataset loader

We welcome contributions to this library, especially new datasets. Please see contributing for guidelines.

mirdata's People

Contributors

andreasjansson, bmcfee, carlthome, chrisdonahue, cyxoud, dave-foster, drubinstein, francescopapaleo, genisplaja, giovana-morais, guillemcortes, harshpalan, iranroman, keunwoochoi, kyungyunlee, lostanlen, magdalenafuentes, migperfer, mimbres, mmscibor, nkundiushuti, ooyamatakehisa, pramoneda, rabitt, sebastianrosenzweig, spijkervet, tanmayy24, tkell, tomxi, tyffical


mirdata's Issues

guitarset.download doesn't create the right folder structure

Everything is unzipped directly into ~/mir_datasets/GuitarSet instead of into the expected subfolders. This is partly because download_utils.downloader downloads directly into data_home by default. We should add an option to RemoteFileMetadata that optionally specifies a subfolder to create and unzip data into.
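A sketch of the proposed behavior, assuming a hypothetical destination_dir argument (not current mirdata API) that tells the downloader which subfolder of data_home to unzip into:

```python
import os
import zipfile


def unzip_to(zip_path, data_home, destination_dir=None):
    """Unzip an archive into data_home, or into an optional subfolder of it.

    destination_dir is the suggested addition: when given, the archive's
    contents land in data_home/destination_dir instead of data_home itself.
    """
    if destination_dir is None:
        target = data_home
    else:
        target = os.path.join(data_home, destination_dir)
    os.makedirs(target, exist_ok=True)
    with zipfile.ZipFile(zip_path) as z:
        z.extractall(target)
```

With something like this, a RemoteFileMetadata entry could carry its own subfolder name and the GuitarSet archives would each unzip into their own directory.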

Add GuitarSet

I'd like to include one very clean dataloader in the first release - one with an open download and trivial data loading. GuitarSet is a good example of a dataset that does everything right: it's freely downloadable, and loading the data from JAMS files should be trivial.

Clean inconsistencies between loaders

List of small inconsistencies between loaders that should be fixed

  • Beatles: beat positions are str -> int

  • RWC collection: track.track_duration_sec -> track.duration_sec

Check if dataset is available locally

If a dataset is "available" on disk, we should know it. There are three possible states for a given dataset:

  1. exists and valid (the .validate() function returns empty dictionaries)
  2. exists and not valid (at least one file in the dataset index exists, but some may be missing or have invalid checksums)
  3. does not exist (no files in the dataset exist locally)

Let's indicate states (1) and (2) by files called _VALIDATED and _INVALID.json which live in the dataset folder, e.g. Orchset/_VALIDATED.

  1. when calling .download(), first check if _VALIDATED exists. If it does not (or if clobber=True) download the dataset / print instructions to download. Otherwise, print something that indicates that the dataset already exists and is valid.
  2. In utils.validator() if _VALIDATED exists, don't run validation. Add a flag to the function to force run even if _VALIDATED exists.
  3. At the end of utils.validator(), create an empty _VALIDATED file if the run completed successfully with no missing/invalid files. If every file in the index is missing, print a message telling the user the dataset is not available locally. If some files are missing or invalid, create _INVALID.json, which saves the dictionaries of invalid/missing files.
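The status-file bookkeeping above could be sketched roughly like this (the helper names write_validation_status and is_validated are illustrative, not mirdata API):

```python
import json
import os

VALIDATED_FILE = "_VALIDATED"
INVALID_FILE = "_INVALID.json"


def write_validation_status(dataset_path, missing_files, invalid_checksums):
    """Record the outcome of a validation run inside the dataset folder."""
    if not missing_files and not invalid_checksums:
        # state (1): everything present and valid -> empty marker file
        open(os.path.join(dataset_path, VALIDATED_FILE), "w").close()
    else:
        # state (2): partially present or corrupt -> save the details
        with open(os.path.join(dataset_path, INVALID_FILE), "w") as f:
            json.dump(
                {"missing_files": missing_files,
                 "invalid_checksums": invalid_checksums},
                f,
            )


def is_validated(dataset_path):
    """True if a previous validation run found the dataset complete."""
    return os.path.exists(os.path.join(dataset_path, VALIDATED_FILE))
```

Both .download() and utils.validator() could then consult is_validated() before doing any expensive work.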

Add `how to test` to readme

Maybe a subissue of #18, but we should also include an example of how to clone and install for test purposes.

Something like:

For local development/integration testing:

  • Install pyenv
  • Create a Python virtual environment
  • Add functionality and test

For running the test suite:

  • Install tox
  • Run tox

Bug in validator?

When I'm loading any dataset like this:

from mirdata import orchset
orchset.download()
orchset.load()

I get the following error:

  File "/home/mfuentes/astre/code/repositories/mirdata/tests/tmp.py", line 10, in <module>
    orchset.load()
  File "/home/mfuentes/astre/code/repositories/mirdata/mirdata/orchset.py", line 79, in load
    validate(data_home)
  File "/home/mfuentes/astre/code/repositories/mirdata/mirdata/orchset.py", line 70, in validate
    missing_files, invalid_checksums = utils.validator(ORCHSET_INDEX, data_home, dataset_path)
  File "/home/mfuentes/astre/code/repositories/mirdata/mirdata/utils.py", line 75, in validator
    if check_validated(dataset_path):
  File "/home/mfuentes/astre/code/repositories/mirdata/mirdata/utils.py", line 293, in check_validated
    return os.path.exists(os.path.join(dataset_path, VALIDATED_FILE_NAME))
  File "/home/mfuentes/anaconda3/envs/struct/lib/python3.6/posixpath.py", line 80, in join
    a = os.fspath(a)
TypeError: expected str, bytes or os.PathLike object, not NoneType

I think a check for dataset_path being None (or similar) is missing in the validator.
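A minimal sketch of the suggested guard; check_validated and VALIDATED_FILE_NAME mirror the names in the traceback, but this is an illustration rather than the actual mirdata source:

```python
import os

VALIDATED_FILE_NAME = "_VALIDATED"


def check_validated(dataset_path):
    # Guard against dataset_path being None before calling os.path.join,
    # which raises "TypeError: expected str, bytes or os.PathLike object,
    # not NoneType" as in the traceback.
    if dataset_path is None:
        return False
    return os.path.exists(os.path.join(dataset_path, VALIDATED_FILE_NAME))
```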

beatles.download checksum error

When I run mirdata.beatles.download(), everything appears to download as expected, but the checksums don't match. Can someone else run the same and see if you can reproduce it? The checksum of my downloaded file is c3b7d505e033ea9ff0d7a1d57871f2ee, vs the expected 62425c552d37c6bb655a78e4603828cc

ikala audio volumes

When we load the mix audio to mono, we compute (L + R)/2, but when we load the vocal/instrumental parts, we just take L or R directly. We should divide L and R by 2 (or multiply the mix by 2) so the overall volumes of each source are consistent.
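A sketch of the proposed scaling, assuming the iKala wavs are loaded as stereo arrays of shape (2, n_samples); the function names are illustrative:

```python
import numpy as np


def load_mix(stereo):
    # mono mix: (L + R) / 2
    return (stereo[0] + stereo[1]) / 2.0


def load_source(stereo, channel):
    # divide the individual channel by 2 so its level matches the mix
    return stereo[channel] / 2.0
```

With this scaling, the two sources sum exactly to the mono mix, so the overall volumes stay consistent.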

Medley-solos-DB

Medley-solos-DB is a cross-collection dataset for automatic musical instrument recognition in solo recordings. It consists of a training set of 3-second audio clips, which are extracted from the MedleyDB dataset of Bittner et al. (ISMIR 2014) as well as a test set of 3-second clips, which are extracted from the solosDB dataset of Essid et al. (IEEE TASLP 2009). Each of these clips contains a single instrument among a taxonomy of eight: clarinet, distorted electric guitar, female singer, flute, piano, tenor saxophone, trumpet, and violin.

The Medley-solos-DB dataset is the dataset that is used in the benchmarks of musical instrument recognition in the publications of Lostanlen and Cella (ISMIR 2016) and Andén et al. (IEEE TSP 2019).

https://ieeexplore.ieee.org/abstract/document/8721532/

Medley-solos-DB is available on Zenodo: https://zenodo.org/record/2582103

I am personally working on a PR. My branch is https://github.com/lostanlen/mirdata/tree/medley-solos

remove dataset directory from index

For when #17 is merged:

Since dataset_dir is being passed to validator, the top level dataset directory name in the index is redundant.

(1) update paths in the index to start below the dataset folder level, e.g.
Beatles/audio/02_-_With_the_Beatles/01_-_It_Won't_Be_Long.wav --> audio/02_-_With_the_Beatles/01_-_It_Won't_Be_Long.wav

(2) update the dataset's individual loaders to include the dataset name.

Increase test coverage

Increase the test coverage.

For each of the loaders, write simple tests to check their accuracy:

  • test_beatles.py (move the current tests to the generic tests below)
  • test_ikala.py
  • test_medleydb_melody.py
  • test_medleydb_pitch.py
  • test_salami.py (move the current tests to the generic tests below)
  • test_orchset.py

Can we create a "generic" single test module which tests all of the common functions for each loader?

  • test_download.py
  • test_validate.py
  • test_cite.py

Jams and annotations consistency

We should write a `to_jams` function to convert data to the JAMS format, and make clear that we don't test annotation consistency, only that the loaders are doing their job. The `to_jams` function would let users check for themselves.
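As a rough illustration of the mapping for a beat annotation (stdlib-only; a real `to_jams` would build objects with the jams package rather than plain dicts):

```python
def beats_to_jams_dict(beat_times, beat_positions):
    """Map parallel lists of beat times and positions into a JAMS-like dict.

    Each observation follows the jams convention of (time, duration, value);
    beats are instantaneous, so duration is 0.0.
    """
    annotation = {
        "namespace": "beat",
        "data": [
            {"time": float(t), "duration": 0.0, "value": int(p)}
            for t, p in zip(beat_times, beat_positions)
        ],
    }
    return {"annotations": [annotation], "file_metadata": {}}
```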

download any remote metadata as part of dataset

Right now we download metadata when loading a dataset, and this is inconsistent with the rest of the API. Instead, any metadata downloading should be done as part of the .download() method.

rename the library?

The name of the library is pretty clunky... got some great, prettier suggestions from @lostanlen:

  • mirabelle: a small hardy European plum tree with finely toothed leaves and small cherry-shaped fruit
  • mireille: Opéra in five acts by Charles-François Gounod. 1864
  • mirific: (maɪˈrɪfɪk) achieving wonderful things or working wonders.
  • mirliton: a generic term for membranophones played by a performer speaking or singing into them, which alter the sound of the voice by means of a vibrating membrane.

I like mirliton, personally. Other thoughts/opinions/ideas?

Regularly check if links for downloading data are online

Just discussed this with @rabitt. We use several links to download data; we should regularly check that they are still online. Is it possible to write a test for that?

If a link goes offline, one idea is to write a function that checks links when calling download, and prints a "temporarily offline" message or similar. What do you think?
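One possible shape for such a check, using requests.head and treating any connection error as offline; the session parameter is only there so a stub can be injected in tests:

```python
import requests


def link_is_online(url, session=None, timeout=5):
    """Return True if a HEAD request to url succeeds with a non-error status."""
    session = session or requests
    try:
        response = session.head(url, timeout=timeout, allow_redirects=True)
        return response.status_code < 400
    except requests.exceptions.RequestException:
        # DNS failure, refused connection, timeout, etc. -> treat as offline
        return False
```

A CI job could run this over every remote in the download registry, and .download() could call it up front to print the "temporarily offline" message instead of failing mid-download.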

RWC beat parsing

The parsing of beat positions in the RWC collections is probably wrong; I'm trying to get more information so we can get it right.

Testing similar functionality

This was part of #44 - we have functions like download and validate that are similar but not always the same across many datasets.

Testing these functions is a pain because it requires mocking a lot of IO, and is unfriendly to new people. On the other hand, it is nice to have tests for important functions.

In a discussion with @rabitt, we discussed:

  • Not testing these functions. 👻
  • Making a base class and testing these functions there. 🏗
  • Testing these functions per-dataset. 📜 📖 ✍️

What do folks think? cc @rabitt, @lostanlen, @andreasjansson

How should we handle big datasets?

Right now we store the dataset index in the repo, and we load annotations into memory. This works OK for smaller datasets but won't scale to big ones. How should we handle this?

Support multichannel wav files

Guitarset mirdata/guitarset.py has 6-channel wav files. Librosa supports this, but I think we need to get librosa 0.7.0 working in the build environment. Right now we're raising NotImplementedError if someone tries to access the 6-channel audio files - once this is ready we can update the guitarset module.

Docs with Examples

  • Set up a docs page (ghpages? readthedocs?).
  • Add usage examples
  • Docs for each dataset with a basic description of the dataset and links to the relevant websites. Especially focus on good docs for the data loaded in each dataset track object, e.g. OrchsetTrack.

remember to update README with last changes

e.g. Create Module -> ExampleTrack is a namedtuple:

ExampleTrack = namedtuple(
    'ExampleTrack',
    ['track_id',
     'audio_path',
     'annotation_path',
     'genre']
)

To do when changes in modules are done.

Error loading a dataset that wasn't downloaded

Example:

If you didn't download medleydb_melody and try medleydb_melody.load(), you get FileNotFoundError: [Errno 2] No such file or directory: '/home/mfuentes/mir_datasets/MedleyDB-Melody/_INVALID.json'. I think this happens with all datasets. Is this the expected behaviour?

Should we create the dataset folder just to host _INVALID.json? Or maybe not try to create the file if the folder doesn't exist?

Check if dataset exists on load

Wait for #17 to be closed.

The loaders for each dataset should throw an error if the data does not exist on disk, but should work if _VALIDATED or _INVALID.json exist.

import mirdata is slow

When we import mirdata, each of the loaders is imported, which loads each of the JSON indexes into memory. No bueno.

We should load the index on the fly or when it's first accessed to speed things up.
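A minimal sketch of deferred index loading, assuming each loader keeps its index in a JSON file; functools.lru_cache makes repeated lookups free after the first read:

```python
import functools
import json


@functools.lru_cache(maxsize=None)
def load_index(index_path):
    """Read a dataset index from disk, caching the result.

    The file is only opened on the first call for a given path; later
    calls return the cached dictionary, so module import stays cheap.
    """
    with open(index_path) as f:
        return json.load(f)
```

Each loader module would then call load_index() inside its functions instead of at import time.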

Bug in Beatles.chords

start_time is fine, but the end_time data should be converted to float.

ipdb> track.chords.end_time
array(['2.612267', '11.459070', '12.921927', '17.443474', '20.410362',
       '21.908049', '23.370907', '24.856984', '26.343061', '27.840748',
       '29.350045', '35.305963', '36.803650', '41.263102', '44.245646',
       '45.720113', '47.206190', '48.692267', '50.155124', '51.652811',
       '53.138888', '56.111043', '65.131995', '68.150589', '71.192403',
       '74.199387', '75.697074', '80.236575', '83.208730', '86.221693',
       '87.736621', '89.257528', '90.720385', '92.157453', '104.106689',
       '107.125283', '110.178707', '113.124087', '114.613718',
       '116.099795', '118.944961', '128.046462', '131.053446',
       '134.037210', '137.044195', '138.475524', '143.058163',
       '146.041927', '147.551224', '149.060521', '150.511768',
       '152.021065', '153.530362', '155.062879', '159.532721',
       '161.065238', '165.581519', '167.114036', '168.646553',
       '169.737409', '171.687173', '175.804082'], dtype='<U10')
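Assuming start_time is already cast this way, the fix could be a one-line astype on the loaded array:

```python
import numpy as np

# end_time as currently loaded: an array of strings (dtype '<U10')
end_times = np.array(['2.612267', '11.459070', '12.921927'], dtype='<U10')

# convert to float so downstream code gets numeric times
end_times = end_times.astype(float)
```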

Implement `list_datasets` function

(Wait for #17 to be closed)
Implement a top-level list_datasets(data_home=None) function that pretty prints a list of all available datasets in this library, and the current status locally (check for _VALIDATED, _INVALID.json, or none).

Bonus - print the status with pretty colors.
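A rough sketch of the per-dataset status lookup such a function could build on, following the _VALIDATED / _INVALID.json convention proposed above (names are illustrative, not mirdata API):

```python
import os


def dataset_status(data_home, dataset_name):
    """Classify a dataset folder as valid, invalid, or not downloaded."""
    dataset_path = os.path.join(data_home, dataset_name)
    if os.path.exists(os.path.join(dataset_path, "_VALIDATED")):
        return "valid"
    if os.path.exists(os.path.join(dataset_path, "_INVALID.json")):
        return "invalid"
    return "not downloaded"


def list_datasets(data_home, dataset_names):
    """Pretty-print each dataset's name and local status."""
    for name in sorted(dataset_names):
        print("{:20s} {}".format(name, dataset_status(data_home, name)))
```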

Replace urllib with requests

Requests makes Py2 and Py3 compatibility much easier, though it does mean that we have to write our own function for saving large files to disk.
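A possible streaming replacement, writing the response to disk in chunks so large archives never have to fit in memory (URL and paths are illustrative):

```python
import requests


def download_file(url, destination, chunk_size=1024 * 1024):
    """Stream url to destination in chunk_size pieces (default 1 MiB)."""
    with requests.get(url, stream=True, timeout=30) as response:
        response.raise_for_status()
        with open(destination, "wb") as f:
            for chunk in response.iter_content(chunk_size=chunk_size):
                f.write(chunk)
```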

How to download the iKala data?

Dear all,

When I go to the iKala data web link, I find that it is no longer available. How can I get the dataset?
Please help me,
Thx

What to FAQ?!

Working on #18. I'd like to hear about anything we wanna add to the FAQ section.
