audio-dataset's Introduction

What is Audio Dataset Project?

This repository was created for the Audio Dataset Project, an audio dataset collection initiative announced by LAION. The collected datasets, each containing an enormous number of audio-text pairs, will eventually be processed and used to train CLAP (Contrastive Language-Audio Pretraining) and other models.

Here is an introductory video explaining the project.

Who are we?

Since the Audio Dataset is an open-source project belonging to LAION, we have a team of open-source contributors. Alongside LAION members, the team includes a three-person research group consisting of Yusong Wu, Ke Chen, and Tianyu Zhang from Mila and UCSD; intern Marianna Nezhurina; previous intern Yuchen Hui; and many enthusiastic contributors from all over the world, such as @PiEquals4#1909 on the Discord server.

What have we done?

  • We keep collecting audio datasets, and here is the LIST of everything we have found so far.
  • We define the standard and the method for storing and processing all audio datasets, which is essential for unifying the final dataset format and thereby simplifying model training. The final format we currently use is webdataset. The concrete data processing pipeline is specified here.
  • You can also find the processing code for each processed audio dataset. Dependencies required for testing these scripts are listed in environment.txt. Please note that environment.txt may be non-exhaustive. There is also environment.yml, a superset of the exhaustive list (it contains some redundant packages); you can create the environment with `conda env create --name envname --file=environment.yml` and activate it with `conda activate envname`.
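Concretely, a webdataset shard is just a tar archive in which each sample's files share a basename (e.g. an audio file plus a JSON sidecar). Here is a minimal stdlib-only sketch of that layout; the file names and metadata fields are illustrative, not the pipeline's exact output:

```python
import io
import json
import tarfile

# Build a tiny webdataset-style shard in memory: each sample is a pair of
# files sharing a basename, e.g. "000000.flac" + "000000.json".
def write_pair(tar, basename, audio_bytes, metadata):
    for suffix, payload in ((".flac", audio_bytes),
                            (".json", json.dumps(metadata).encode())):
        info = tarfile.TarInfo(name=basename + suffix)
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))

buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tar:
    write_pair(tar, "000000", b"\x00fake-audio", {"text": ["a dog barking"]})
    write_pair(tar, "000001", b"\x00fake-audio", {"text": ["rain on a window"]})

# Reading back: a loader groups members by basename to recover the samples.
buf.seek(0)
with tarfile.open(fileobj=buf, mode="r") as tar:
    names = sorted(m.name for m in tar.getmembers())
print(names)  # ['000000.flac', '000000.json', '000001.flac', '000001.json']
```

In practice the webdataset library (or the repo's utils/make_tars.py) handles the packing; the point here is only the basename-pairing convention that ties audio to its text metadata.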

Contributing

Contact

  • You can find us in the CLAP channel of the LAION Discord server (the channel name is clap, in lowercase).
  • If you have any questions about the project, feel free to talk in the CLAP channel with intern Marianna Nezhurina (marianna13#7139), Christoph Schuhmann (@spirit-from-germany#1488), Richard (@rvencu#4120), Romain (@rom1504#5008), Yuchen Hui (@Yuchen Hui#8574), Yusong Wu (@Yusong Wu#3047), Ke Chen (@Ke Chen#0709), or Tianyu Zhang (@tianyuzhang#1725). The text in parentheses is the Discord ID.
  • Moreover, if you need compute resources while contributing, go to the compute-allocation channel of the Discord server and read the pinned messages about using the LAION pods. If you encounter any problem, feel free to ask in the channel.
  • July 14 update: the old LAION pods are no longer accessible, so please ask Richard (@rvencu#4120) in the CLAP channel for access to the new LAION cluster.

Project progress

We have created a GitHub project page to keep track of the progress of data collection and data processing. Here is a description of each board of the project:

  • Todo board: datasets from the LIST that have not yet been converted to webdataset format and that nobody is currently working on.
  • Assigned/In progress/Processing board: datasets that have been assigned to a contributor, i.e. someone is already working on them.
  • Review board: once a dataset has been converted to webdataset format, its item is moved here, indicating that it is ready for review by our team (e.g. checking for format errors to ensure the quality of model training).
  • Done board: if no problem is found at the review stage, the dataset is archived to the Done board, meaning it is ready for model training.

How to contribute?

There are mainly two ways to contribute to our audio dataset project.

  1. Collecting scattered audio sources via web scraping (and then converting them to webdataset format, i.e. the second point below).

    Example: crawling word-pronunciation pairs from the Cambridge Dictionary, or scraping videos from YouTube, extracting the audio, and associating it with the video title.

    Please join us on Discord if you want to know which scattered audio sources we are currently focusing on, or if you have a suggestion for what we should scrape next.

  2. Processing curated datasets, i.e. converting them to webdataset format according to the pipeline.

    Example: Clotho is a curated audio dataset with its own format, so we need to convert it to webdataset format with the help of data_preprocess/preprocess_clotho.py and utils/make_tars.py. For more processing details, please read the pipeline section.

    For this type of contribution, please look at the datasets on the Todo board of the GitHub project page and join us on the Discord server. After you have chosen a dataset from the Todo board, contact Marianna Nezhurina (marianna13#7139) in the CLAP channel so that we can keep track of progress and avoid several people working on the same dataset at once.

  • Last but not least, if you find any interesting curated dataset (e.g. Clotho), please tell us on the LAION Discord server. We will eventually add it to the LIST.
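As an illustration of the first contribution route, scraping boils down to extracting (audio_url, text) pairs from each page. Below is a stdlib-only sketch against a made-up page layout; the markup, URLs, and attribute names are hypothetical, and a real scraper must match the target site's actual HTML and respect its terms of use:

```python
from html.parser import HTMLParser

# Sketch: extract (audio_url, text) pairs from pages where each entry looks
# like <div class="entry" data-text="word"><audio src="..."></audio></div>.
# This markup is invented for illustration; real sites need their own logic.
class AudioPairParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.pairs = []
        self._text = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div" and attrs.get("class") == "entry":
            self._text = attrs.get("data-text")
        elif tag == "audio" and self._text is not None:
            self.pairs.append((attrs["src"], self._text))
            self._text = None

page = """
<div class="entry" data-text="hello"><audio src="https://example.com/hello.mp3"></audio></div>
<div class="entry" data-text="world"><audio src="https://example.com/world.mp3"></audio></div>
"""
parser = AudioPairParser()
parser.feed(page)
print(parser.pairs)
```

The resulting pairs would then be downloaded and packed into webdataset shards as described in the pipeline.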

Contribution Delivery

Ideally, in both cases mentioned above, we hope to receive the dataset from you in webdataset format. Once you have packed your dataset into webdataset format, upload it to our AWS S3 bucket with `aws s3 cp --recursive your/webdataset/ s3://s-laion-audio/webdataset_tar/your_webdataset/` and contact Marianna Nezhurina (marianna13#7139) so that she can move the dataset to the Review board. (If possible, please also upload the processed, not-yet-packed dataset to s3://s-laion-audio/processed_dataset.)

If you have trouble accessing AWS S3, see the LAION cluster note in the Contact section above: the S3 bucket is accessible when visited from the new LAION cluster.

Alternatively, for a scraped dataset, we also accept a CSV file with the following structure:

url_link_to_the_audio_allowing_us_to_download , text

i.e. each line is an audio_url-text pair, from which we can easily write a batch script to download and process the audio.
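For example, such a file can be consumed with a few lines of Python (the URLs below are made up; captions containing commas should be quoted in the CSV):

```python
import csv
import io

# Parse the delivered CSV of "url,text" rows into (url, text) pairs; a real
# batch job would then fetch each url and store the audio with its caption.
sample = io.StringIO(
    "https://example.com/a.wav,a dog barking\n"
    "https://example.com/b.wav,heavy rain\n"
)
pairs = [(row[0].strip(), row[1].strip()) for row in csv.reader(sample) if row]
print(pairs)
```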

The End

Last updated on July 14, 0:57 EST, 2022.
Last updated on September 5, 11:00 EST, 2022 (Marianna Nezhurina takes over the intern work of Yuchen Hui).
Last updated on November 8, 18:55 EST, 2022 (release of the LAION-Audio-630K dataset).

audio-dataset's People

Contributors

christophschuhmann, dmarx, isaac0804, kjhenner, knoriy, krishnakalyan3, lukewys, marianna13, retrocirce, rvencu, tianyu-z, tj-solergibert, turian, yuchenhui22314


audio-dataset's Issues

Missing 'tag' key in FSD50k preprocessor

Hi,
Thanks for sharing the wonderful code.

According to the data preprocessing readme (here), there should be a 'tag' key (containing labels) in the output JSON file after preprocessing.
This tag extraction/creation is missing in the preprocess_FSD50K.py file.

Am I understanding something incorrectly, or is the 'tag' creation missing in the file?

Thanks,
Saksham
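If the 'tag' key is indeed intended, one possible fix is to split FSD50K's comma-separated label string into a list and write it alongside the caption. A hypothetical sketch (the function and variable names are illustrative, not the actual script's):

```python
import json

def make_sample_json(text, labels):
    # FSD50K ground-truth labels come as a comma-separated string,
    # e.g. "Dog,Bark,Animal"; split them into the 'tag' list that the
    # preprocessing readme describes for the output JSON.
    return json.dumps({"text": text, "tag": labels.split(",")})

record = json.loads(make_sample_json(["the sound of a dog barking"],
                                     "Dog,Bark,Animal"))
print(record["tag"])
```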

How to download Freesound?

Hi, can you share some ways to download Freesound? E.g., how to use Linux scripts to download the audio.

Dataset Plan

@rvencu @rom1504
We need more data for the next step. In order of priority, we need:

  1. Audio data with natural text description(s).
  2. Audio data with other labels, for which we "make up" a text description.

For audio data with natural text descriptions, we further need:

For audio data with other labels, we need to collect new large datasets while also converting our current tag-labeled datasets.

The top-priority datasets are those that are large and whose labels are easy to turn into text descriptions:

(All of the following datasets have tag labels for the audio.)

The datasets we currently have whose labels need converting to text are:

We should come up with a unified way of converting tags to text. We could reference how CLIP did this (converting classification labels into natural text).
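As a starting point, a CLIP-style prompt template adapted to audio might look like the following sketch (the exact template wording is only a guess at what we would standardize on):

```python
# Wrap tag labels in a natural-language template so that tagged datasets
# yield text descriptions, analogous to CLIP's classification prompts.
def tags_to_text(tags, template="The sound of {}."):
    if len(tags) == 1:
        body = tags[0]
    else:
        body = ", ".join(tags[:-1]) + " and " + tags[-1]
    return template.format(body.lower())

caption = tags_to_text(["Dog", "Bark"])
print(caption)  # The sound of dog and bark.
```

A unified converter like this would let every tag-labeled dataset feed the same text pipeline, at the cost of somewhat stilted captions.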

files for preprocessing

Where do we get the following files from?
Any help would be appreciated.
Thanks in advance.

metadata_file = r'/home/yuchen/raw/freesound/parquet/freesound_parquet.parquet'

ignore_file = r'/home/yuchen/raw/freesound/filename_dic.txt'

duration_file = r"/home/yuchen/raw/freesound/all_duration.txt"

When possible prefer saving parquet with url inside

Similarly to image datasets, it is better to first save a url + metadata file as parquet, which can be distributed without copyright issues.

Then a tool like img2dataset can handle the download

Let's add that to the readme here.

AWS S3 Access

Congratulations on the herculean effort of putting this dataset together!
Where can one find the access information for the data in s3://s-laion-audio/?

decoding speed / benchmark

This repo is great. I always wanted to benchmark webdataset for audio. A couple of questions:

  1. Did you find FLAC to be a good trade-off between decoding performance and file size? Have you tried mp3 instead?
  2. Did you benchmark the pipeline against plain torch.data with torchaudio, or the new torch data pipes? Maybe add the benchmark to https://github.com/faroit/python_audio_loading_benchmark/ to give this a go?
  3. How is partial decoding/seeking typically done with webdataset when long audio is stored but only random chunks are read at decoding time? Is seeking supported? If so, does it slow down the I/O pipeline?
