mir-dataset-loaders / mirdata

Python library for working with Music Information Retrieval datasets

Home Page: https://mirdata.readthedocs.io/en/stable/

License: BSD 3-Clause "New" or "Revised" License

Python 100.00%
music audio python dataset mirdata mir

mirdata's Introduction

mirdata

Common loaders for Music Information Retrieval (MIR) datasets. Find the API documentation here.


This library provides tools for working with common MIR datasets, including tools for:

  • downloading datasets to a common location and format
  • validating that the files for a dataset are all present
  • loading annotation files to a common format, consistent with the format required by mir_eval
  • parsing track level metadata for detailed evaluations

Installation

To install, simply run:

pip install mirdata

Quick example

import mirdata

orchset = mirdata.initialize('orchset')
orchset.download()  # download the dataset
orchset.validate()  # validate that all the expected files are there

example_track = orchset.choice_track()  # choose a random example track
print(example_track)  # see the available data

See the documentation for more examples and the API reference.

Currently supported datasets

Supported datasets include AcousticBrainz, DALI, Guitarset, MAESTRO, TinySOL, among many others.

For the complete list of supported datasets, see the documentation.

Citing

There are two ways of citing mirdata:

If you are using the library for your work, please cite the version you used as indexed at Zenodo:

DOI

If you refer to mirdata's design principles, motivation etc., please cite the following paper:

DOI

"mirdata: Software for Reproducible Usage of Datasets"
Rachel M. Bittner, Magdalena Fuentes, David Rubinstein, Andreas Jansson, Keunwoo Choi, and Thor Kell
in International Society for Music Information Retrieval (ISMIR) Conference, 2019
@inproceedings{bittner_fuentes_2019,
  title={mirdata: Software for Reproducible Usage of Datasets},
  author={Bittner, Rachel M and Fuentes, Magdalena and Rubinstein, David and Jansson, Andreas and Choi, Keunwoo and Kell, Thor},
  booktitle={International Society for Music Information Retrieval (ISMIR) Conference},
  year={2019}
}

When working with datasets, please cite the version of mirdata that you are using (given by the DOI above) AND include the reference of the dataset, which can be found in the respective dataset loader using the cite() method.

Contributing a new dataset loader

We welcome contributions to this library, especially new datasets. Please see contributing for guidelines.

mirdata's People

Contributors

andreasjansson, bmcfee, carlthome, chrisdonahue, cyxoud, dave-foster, drubinstein, francescopapaleo, genisplaja, giovana-morais, guillemcortes, harshpalan, iranroman, keunwoochoi, kyungyunlee, lostanlen, magdalenafuentes, migperfer, mimbres, mmscibor, nkundiushuti, ooyamatakehisa, pramoneda, rabitt, sebastianrosenzweig, spijkervet, tanmayy24, tkell, tomxi, tyffical


mirdata's Issues

guitarset.download doesn't create the right folder structure

Everything is unzipped directly into ~/mir_datasets/GuitarSet instead of into the expected subfolders. This is partly because download_utils.downloader downloads directly into data_home by default. We should add an option to RemoteFileMetadata that optionally specifies a subfolder to create and unzip data into.
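A sketch of the proposed behavior, assuming a hypothetical destination_dir argument (not current mirdata API) that tells the downloader which subfolder of data_home to unzip into:

```python
import os
import zipfile


def unzip_to(zip_path, data_home, destination_dir=None):
    """Unzip an archive into data_home, or into an optional subfolder of it.

    destination_dir is the suggested addition: when given, the archive's
    contents land in data_home/destination_dir instead of data_home itself.
    """
    if destination_dir is None:
        target = data_home
    else:
        target = os.path.join(data_home, destination_dir)
    os.makedirs(target, exist_ok=True)
    with zipfile.ZipFile(zip_path) as z:
        z.extractall(target)
```

With something like this, a RemoteFileMetadata entry could carry its own subfolder name and the GuitarSet archives would each unzip into their own directory.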

Add GuitarSet

I'd like to include one very clean dataloader in the first release - one with an open download and trivial data loading. GuitarSet is a good example of a dataset that does everything right: it's freely downloadable, and loading the data from JAMS files should be trivial.

Clean inconsistencies between loaders

List of small inconsistencies between loaders that should be fixed

  • Beatles: beat positions are str -> int

  • RWC collection: track.track_duration_sec -> track.duration_sec

Check if dataset is available locally

If a dataset is "available" on disk, we should know it. There are three possible states for a given dataset:

  1. exists and valid (the .validate() function returns empty dictionaries)
  2. exists and not valid (at least one file in the dataset index exists, but some may be missing or have invalid checksums)
  3. does not exist (no files in the dataset exist locally)

Let's indicate states (1) and (2) by files called _VALIDATED and _INVALID.json which live in the dataset folder, e.g. Orchset/_VALIDATED.

  1. when calling .download(), first check if _VALIDATED exists. If it does not (or if clobber=True) download the dataset / print instructions to download. Otherwise, print something that indicates that the dataset already exists and is valid.
  2. In utils.validator() if _VALIDATED exists, don't run validation. Add a flag to the function to force run even if _VALIDATED exists.
  3. At the end of utils.validator(), create an empty _VALIDATED file if the run completed successfully with no missing/invalid files. If every file in the index is missing, print a message telling the user the dataset is not available locally. If some files are missing or invalid, create _INVALID.json, which saves the dictionaries of invalid/missing files.
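The status-file bookkeeping above could be sketched roughly like this (the helper names write_validation_status and is_validated are illustrative, not mirdata API):

```python
import json
import os

VALIDATED_FILE = "_VALIDATED"
INVALID_FILE = "_INVALID.json"


def write_validation_status(dataset_path, missing_files, invalid_checksums):
    """Record the outcome of a validation run inside the dataset folder."""
    if not missing_files and not invalid_checksums:
        # state (1): everything present and valid -> empty marker file
        open(os.path.join(dataset_path, VALIDATED_FILE), "w").close()
    else:
        # state (2): partially present or corrupt -> save the details
        with open(os.path.join(dataset_path, INVALID_FILE), "w") as f:
            json.dump(
                {"missing_files": missing_files,
                 "invalid_checksums": invalid_checksums},
                f,
            )


def is_validated(dataset_path):
    """True if a previous validation run found the dataset complete."""
    return os.path.exists(os.path.join(dataset_path, VALIDATED_FILE))
```

Both .download() and utils.validator() could then consult is_validated() before doing any expensive work.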

Add `how to test` to readme

Maybe a subissue of #18, but we should also include an example of how to clone and install for test purposes.

Something like:

For local development/integration testing:

  • Install pyenv
  • Create a Python virtual environment
  • Add functionality and test

For running the test suite:

  • Install tox
  • Run tox

Bug in validator?

When I'm loading any dataset like this:

from mirdata import orchset
orchset.download()
orchset.load()

I get the following error:

  File "/home/mfuentes/astre/code/repositories/mirdata/tests/tmp.py", line 10, in <module>
    orchset.load()
  File "/home/mfuentes/astre/code/repositories/mirdata/mirdata/orchset.py", line 79, in load
    validate(data_home)
  File "/home/mfuentes/astre/code/repositories/mirdata/mirdata/orchset.py", line 70, in validate
    missing_files, invalid_checksums = utils.validator(ORCHSET_INDEX, data_home, dataset_path)
  File "/home/mfuentes/astre/code/repositories/mirdata/mirdata/utils.py", line 75, in validator
    if check_validated(dataset_path):
  File "/home/mfuentes/astre/code/repositories/mirdata/mirdata/utils.py", line 293, in check_validated
    return os.path.exists(os.path.join(dataset_path, VALIDATED_FILE_NAME))
  File "/home/mfuentes/anaconda3/envs/struct/lib/python3.6/posixpath.py", line 80, in join
    a = os.fspath(a)
TypeError: expected str, bytes or os.PathLike object, not NoneType

I think a check for dataset_path being None (or similar) is missing in the validator.
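A minimal sketch of the suggested guard; check_validated and VALIDATED_FILE_NAME mirror the names in the traceback, but this is an illustration rather than the actual mirdata source:

```python
import os

VALIDATED_FILE_NAME = "_VALIDATED"


def check_validated(dataset_path):
    # Guard against dataset_path being None before calling os.path.join,
    # which raises "TypeError: expected str, bytes or os.PathLike object,
    # not NoneType" as in the traceback.
    if dataset_path is None:
        return False
    return os.path.exists(os.path.join(dataset_path, VALIDATED_FILE_NAME))
```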

beatles.download checksum error

When I run mirdata.beatles.download(), everything appears to download as expected, but the checksums don't match. Can someone else run the same and see if you can reproduce it? The checksum of my downloaded file is c3b7d505e033ea9ff0d7a1d57871f2ee, vs the expected 62425c552d37c6bb655a78e4603828cc

ikala audio volumes

When we load the mix audio to mono, we compute (L + R)/2, but when we load the vocal/instrumental parts, we just take L or R directly. We should divide L and R by 2 (or multiply the mix by 2) so the overall volumes of each source are consistent.
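A sketch of the proposed scaling, assuming the iKala wavs are loaded as stereo arrays of shape (2, n_samples); the function names are illustrative:

```python
import numpy as np


def load_mix(stereo):
    # mono mix: (L + R) / 2
    return (stereo[0] + stereo[1]) / 2.0


def load_source(stereo, channel):
    # divide the individual channel by 2 so its level matches the mix
    return stereo[channel] / 2.0
```

With this scaling, the two sources sum exactly to the mono mix, so the overall volumes stay consistent.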

Medley-solos-DB

Medley-solos-DB is a cross-collection dataset for automatic musical instrument recognition in solo recordings. It consists of a training set of 3-second audio clips, which are extracted from the MedleyDB dataset of Bittner et al. (ISMIR 2014) as well as a test set of 3-second clips, which are extracted from the solosDB dataset of Essid et al. (IEEE TASLP 2009). Each of these clips contains a single instrument among a taxonomy of eight: clarinet, distorted electric guitar, female singer, flute, piano, tenor saxophone, trumpet, and violin.

The Medley-solos-DB dataset is the dataset that is used in the benchmarks of musical instrument recognition in the publications of Lostanlen and Cella (ISMIR 2016) and Andén et al. (IEEE TSP 2019).

https://ieeexplore.ieee.org/abstract/document/8721532/

Medley-solos-DB is available on Zenodo: https://zenodo.org/record/2582103

I am personally working on a PR. My branch is https://github.com/lostanlen/mirdata/tree/medley-solos

remove dataset directory from index

For when #17 is merged:

Since dataset_dir is being passed to validator, the top level dataset directory name in the index is redundant.

(1) update paths in the index to start below the dataset folder level, e.g.
Beatles/audio/02_-_With_the_Beatles/01_-_It_Won't_Be_Long.wav --> audio/02_-_With_the_Beatles/01_-_It_Won't_Be_Long.wav

(2) update the dataset's individual loaders to include the dataset name.

Increase test coverage

Increase the test coverage.

For each of the loaders, write simple tests to check their accuracy:

  • test_beatles.py (move the current tests to the generic tests below)
  • test_ikala.py
  • test_medleydb_melody.py
  • test_medleydb_pitch.py
  • test_salami.py (move the current tests to the generic tests below)
  • test_orchset.py

Can we create a "generic" single test module which tests all of the common functions for each loader?

  • test_download.py
  • test_validate.py
  • test_cite.py

Jams and annotations consistency

We should write a `to_jams` function to convert data to the JAMS format, and make clear that we don't test annotation consistency, only that the loaders are doing their job. The `to_jams` function would let users check for themselves.
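As a rough illustration of the mapping for a beat annotation (stdlib-only; a real `to_jams` would build objects with the jams package rather than plain dicts):

```python
def beats_to_jams_dict(beat_times, beat_positions):
    """Map parallel lists of beat times and positions into a JAMS-like dict.

    Each observation follows the jams convention of (time, duration, value);
    beats are instantaneous, so duration is 0.0.
    """
    annotation = {
        "namespace": "beat",
        "data": [
            {"time": float(t), "duration": 0.0, "value": int(p)}
            for t, p in zip(beat_times, beat_positions)
        ],
    }
    return {"annotations": [annotation], "file_metadata": {}}
```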

download any remote metadata as part of dataset

Right now we download metadata when loading a dataset, and this is inconsistent with the rest of the API. Instead, any metadata downloading should be done as part of the .download() method.

rename the library?

The name of the library is pretty clunky... got some great, prettier suggestions from @lostanlen:

  • mirabelle: a small hardy European plum tree with finely toothed leaves and small cherry-shaped fruit
  • mireille: Opéra in five acts by Charles-François Gounod. 1864
  • mirific: (maɪˈrɪfɪk) achieving wonderful things or working wonders.
  • mirliton: a generic term for membranophones played by a performer speaking or singing into them, which alter the sound of the voice by means of a vibrating membrane.

I like mirliton, personally. Other thoughts/opinions/ideas?

Regularly check if links for downloading data are online

Just discussed this with @rabitt. We use several links to download data; we should regularly check that they are still online. Is it possible to write a test for that?

If a link goes offline, one idea is to write a function that checks links when calling download, and prints a "temporarily offline" message or similar. What do you think?
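One possible shape for such a check, using requests.head and treating any connection error as offline; the session parameter is only there so a stub can be injected in tests:

```python
import requests


def link_is_online(url, session=None, timeout=5):
    """Return True if a HEAD request to url succeeds with a non-error status."""
    session = session or requests
    try:
        response = session.head(url, timeout=timeout, allow_redirects=True)
        return response.status_code < 400
    except requests.exceptions.RequestException:
        # DNS failure, refused connection, timeout, etc. -> treat as offline
        return False
```

A CI job could run this over every remote in the download registry, and .download() could call it up front to print the "temporarily offline" message instead of failing mid-download.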

RWC beat parsing

The parsing of beat positions in the RWC collections is probably wrong; I'm trying to get more information so we can get it right.

Testing similar functionality

This was part of #44 - we have functions like download and validate that are similar but not always the same across many datasets.

Testing these functions is a pain because it requires mocking a lot of IO, and is unfriendly to new people. On the other hand, it is nice to have tests for important functions.

In a discussion with @rabitt, we discussed:

  • Not testing these functions. 👻
  • Making a base class and testing these functions there. 🏗
  • Testing these functions per-dataset. 📜 📖 ✍️

What do folks think? cc @rabitt, @lostanlen, @andreasjansson

How should we handle big datasets?

Right now we store the dataset index in the repo, and we load annotations into memory. This works OK for smaller datasets but won't scale to big ones. How should we handle this?

Support multichannel wav files

Guitarset mirdata/guitarset.py has 6-channel wav files. Librosa supports this, but I think we need to get librosa 0.7.0 working in the build environment. Right now we're raising NotImplementedError if someone tries to access the 6-channel audio files - once this is ready we can update the guitarset module.

Docs with Examples

  • Set up a docs page (ghpages? readthedocs?).
  • Add usage examples
  • Docs for each dataset with a basic description of the dataset and links to the relevant websites. Especially focus on good docs for the data loaded in each dataset track object, e.g. OrchsetTrack.

remember to update README with last changes

e.g. Create Module -> ExampleTrack is a namedtuple:

ExampleTrack = namedtuple(
    'ExampleTrack',
    ['track_id',
     'audio_path',
     'annotation_path',
     'genre']
)

To do when changes in modules are done.

Error loading a dataset that wasn't downloaded

Example:

If you didn't download medleydb_melody and try medleydb_melody.load(), you get FileNotFoundError: [Errno 2] No such file or directory: '/home/mfuentes/mir_datasets/MedleyDB-Melody/_INVALID.json'. I think this happens with all datasets. Is this the expected behaviour?

Should we create the dataset folder just to host _INVALID.json? Or maybe not try to create the file if the folder doesn't exist?

Check if dataset exists on load

Wait for #17 to be closed.

The loaders for each dataset should throw an error if the data does not exist on disk, but should work if _VALIDATED or _INVALID.json exist.

import mirdata is slow

When we import mirdata, each of the loaders is imported, which loads each of the JSON indexes into memory. No bueno.

We should load the index on the fly or when it's first accessed to speed things up.
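A minimal sketch of deferred index loading, assuming each loader keeps its index in a JSON file; functools.lru_cache makes repeated lookups free after the first read:

```python
import functools
import json


@functools.lru_cache(maxsize=None)
def load_index(index_path):
    """Read a dataset index from disk, caching the result.

    The file is only opened on the first call for a given path; later
    calls return the cached dictionary, so module import stays cheap.
    """
    with open(index_path) as f:
        return json.load(f)
```

Each loader module would then call load_index() inside its functions instead of at import time.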

Bug in Beatles.chords

start_time is fine, but the end_time data should be converted to float.

ipdb> track.chords.end_time
array(['2.612267', '11.459070', '12.921927', '17.443474', '20.410362',
       '21.908049', '23.370907', '24.856984', '26.343061', '27.840748',
       '29.350045', '35.305963', '36.803650', '41.263102', '44.245646',
       '45.720113', '47.206190', '48.692267', '50.155124', '51.652811',
       '53.138888', '56.111043', '65.131995', '68.150589', '71.192403',
       '74.199387', '75.697074', '80.236575', '83.208730', '86.221693',
       '87.736621', '89.257528', '90.720385', '92.157453', '104.106689',
       '107.125283', '110.178707', '113.124087', '114.613718',
       '116.099795', '118.944961', '128.046462', '131.053446',
       '134.037210', '137.044195', '138.475524', '143.058163',
       '146.041927', '147.551224', '149.060521', '150.511768',
       '152.021065', '153.530362', '155.062879', '159.532721',
       '161.065238', '165.581519', '167.114036', '168.646553',
       '169.737409', '171.687173', '175.804082'], dtype='<U10')
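Assuming start_time is already cast this way, the fix could be a one-line astype on the loaded array:

```python
import numpy as np

# end_time as currently loaded: an array of strings (dtype '<U10')
end_times = np.array(['2.612267', '11.459070', '12.921927'], dtype='<U10')

# convert to float so downstream code gets numeric times
end_times = end_times.astype(float)
```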

Implement `list_datasets` function

(Wait for #17 to be closed)
Implement a top-level list_datasets(data_home=None) function that pretty prints a list of all available datasets in this library, and the current status locally (check for _VALIDATED, _INVALID.json, or none).

Bonus - print the status with pretty colors.
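A rough sketch of the per-dataset status lookup such a function could build on, following the _VALIDATED / _INVALID.json convention proposed above (names are illustrative, not mirdata API):

```python
import os


def dataset_status(data_home, dataset_name):
    """Classify a dataset folder as valid, invalid, or not downloaded."""
    dataset_path = os.path.join(data_home, dataset_name)
    if os.path.exists(os.path.join(dataset_path, "_VALIDATED")):
        return "valid"
    if os.path.exists(os.path.join(dataset_path, "_INVALID.json")):
        return "invalid"
    return "not downloaded"


def list_datasets(data_home, dataset_names):
    """Pretty-print each dataset's name and local status."""
    for name in sorted(dataset_names):
        print("{:20s} {}".format(name, dataset_status(data_home, name)))
```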

Replace urllib with requests

Requests makes Py2 and Py3 compatibility much easier, though it does mean that we have to write our own function for saving large files to disk.
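A possible streaming replacement, writing the response to disk in chunks so large archives never have to fit in memory (URL and paths are illustrative):

```python
import requests


def download_file(url, destination, chunk_size=1024 * 1024):
    """Stream url to destination in chunk_size pieces (default 1 MiB)."""
    with requests.get(url, stream=True, timeout=30) as response:
        response.raise_for_status()
        with open(destination, "wb") as f:
            for chunk in response.iter_content(chunk_size=chunk_size):
                f.write(chunk)
```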

How to download the iKala data?

Dear all,

When I go to the iKala data web link, I find that it is no longer available. How can I get the dataset?
Please help me,
Thx

What to FAQ?!

Working on #18. I'd like to hear about anything we wanna add to the FAQ section.
