Multimodal dataset catalog

This repository lists publicly available datasets encompassing the following modalities:

Audio and speech
Biomedical signals (e.g., EEG and ECG)

Although we update the list regularly, we may miss some of the most recent datasets. Feel free to let us know by opening a new issue :)

Speech-datasets (updated 10-30-2023)

The speech datasets listed here are mainly from the cybersecurity and healthcare domains.

Deepfakes

ASVspoof 2021: the commonly used deepfake dataset from the ASVspoof challenge series. The 2021 edition includes a deepfake (DF) track with 600K utterances produced by a variety of generation algorithms and codecs. See also the 2019 edition, whose logical access (LA) track also contains some deepfakes.

WaveFake: includes only crafted speech generated from the LJ Speech corpus. Each genuine utterance comes with more than 10 different deepfake versions.

In-the-wild: genuine and crafted (deepfake) utterances of celebrity voices, collected in the wild.

ADD: Mandarin deepfake detection challenge databases. Link to be updated.

Half-truth audio detection (HAD): contains partially and fully deepfaked utterances. Link to be updated.

Partial Spoof: partially spoofed utterances that contain a mix of spoofed and bona fide segments.

SceneFake: the acoustic scene is manipulated while the voice itself remains unchanged. The detailed generation pipeline can be found in the paper.

SingFake: in-the-wild dataset consisting of 28.93 hours of bona fide and 29.40 hours of deepfake song clips in five languages from 40 singers. Train/validation/test splits are provided with the data.

Healthcare

The UK COVID-19 Vocal Audio Dataset: audio recordings of volitional coughs, exhalations, and speech, accompanied by demographic, self-reported symptom, and respiratory condition data, and linked to SARS-CoV-2 PCR test results. It covers a total of 72,999 participants (25,776 tested positive). The dataset has additional potential uses for bioacoustics research, with 11.30% of participants reporting asthma and 27.20% having linked influenza PCR test results.
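As a rough illustration of what the participant counts quoted above imply for model training, the sketch below derives the positive fraction and inverse-frequency class weights. It uses only the two figures from the description; the function names are hypothetical, and it assumes (for illustration only) that every participant who is not PCR-positive can be treated as a single "non-positive" class, which the actual metadata may not support.

```python
# Figures taken from the dataset description above.
TOTAL_PARTICIPANTS = 72_999
PCR_POSITIVE = 25_776


def class_balance(total: int, positive: int) -> dict:
    """Return class counts and the positive fraction.

    Assumption: everyone who is not PCR-positive is grouped into one
    'non_positive' class (negative, invalid, or unknown results).
    """
    non_positive = total - positive
    return {
        "positive": positive,
        "non_positive": non_positive,
        "positive_fraction": positive / total,
    }


def inverse_frequency_weights(total: int, positive: int) -> dict:
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    non_positive = total - positive
    return {
        "positive": total / (2 * positive),
        "non_positive": total / (2 * non_positive),
    }


if __name__ == "__main__":
    balance = class_balance(TOTAL_PARTICIPANTS, PCR_POSITIVE)
    weights = inverse_frequency_weights(TOTAL_PARTICIPANTS, PCR_POSITIVE)
    print(f"positive fraction: {balance['positive_fraction']:.2%}")
    print(f"class weights: {weights}")
```

Roughly one third of participants tested positive, so the classes are imbalanced but not severely; weights like these are one common way to compensate in a loss function.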

Cambridge COVID Sound: (available upon request) includes ~300 hours of voice, cough, and breathing data collected remotely from healthy individuals and individuals with COVID-19. It comes with rich metadata, such as COVID status, gender, age, symptoms, and pre-existing medical conditions. However, the COVID labels are self-reported, not PCR-validated.

Coswara: COVID-19 sounds (voice, cough, breathing) collected in India. See also the related DiCOVA 1 & 2 challenge datasets, which are available upon request.

ComParE 2021 COVID Detection Dataset: (available upon request) includes ~3K audio samples (speech, cough, and breathing) from healthy individuals and individuals with COVID-19. This is an INTERSPEECH challenge dataset.

TORGO: in-lab voice recordings from individuals with dysarthria. It also provides ground-truth transcripts and articulatory traces.

Nemours: (link to be updated) ~800 sentence utterances collected in-lab from individuals with different degrees of dysarthria. Labels indicate intelligibility.

NCSC: (link to be updated) sentence utterances from individuals who underwent cervical tumor surgery, with binary labels (low / high intelligibility).

KSoF-C: (available upon request) the original version contains 5K 3-second speech segments from 37 German speakers who stutter. The version used in the INTERSPEECH 2022 ComParE challenge (KSoF-C) features only the 4,601 non-ambiguously labeled segments. Each segment is assigned to one of 8 classes: seven stuttering-related classes and an eighth “garbage” class, which covers unintelligible segments, segments containing no speech, and segments degraded by loud background noise.

Sep-28k: a dataset for stuttering event detection from podcasts with people who stutter. It contains stuttering event annotations for approximately 28,000 3-second clips (English). In addition, it includes stuttering event annotations for about 4,000 3-second clips from the FluencyBank dataset.
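To give a sense of scale, the clip counts above can be converted into approximate audio durations. This is a minimal sketch based only on the rounded figures in the description (the counts are approximate, so the results are too); the constant and function names are hypothetical.

```python
CLIP_SECONDS = 3            # clip length stated in the description
SEP28K_CLIPS = 28_000       # approximate count, per the description
FLUENCYBANK_CLIPS = 4_000   # approximate count, per the description


def total_hours(n_clips: int, clip_seconds: int = CLIP_SECONDS) -> float:
    """Convert a number of fixed-length clips into hours of audio."""
    return n_clips * clip_seconds / 3600


if __name__ == "__main__":
    print(f"Sep-28k:     ~{total_hours(SEP28K_CLIPS):.1f} h")       # ~23.3 h
    print(f"FluencyBank: ~{total_hours(FLUENCYBANK_CLIPS):.1f} h")  # ~3.3 h
```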

DAIC-WOZ: (available upon request) includes audio-visual interviews of 189 participants, male and female, who underwent evaluation of psychological distress. Each participant was assigned a self-assessed depression score using the Patient Health Questionnaire (PHQ-8). It contains ~58 hours of audio data in total.

MDVR-KCL: scripted and spontaneous speech recordings from healthy individuals and individuals with Parkinson's disease (PD). Labels are binary (PD / healthy). Other rating labels are available as well.

Biosignal-datasets

TBD

Contribute & Author

For contributions or questions, please contact us at [[email protected]].
