hear-preprocess's Issues

relpath => relpaths

Changes to make to adapt for multi-audio training examples


We need to change the behavior of the preprocessing pipeline to accommodate tasks that require multiple audio files for each training example.
Currently, the metadata prepared in ExtractMetadata is structured so that each row corresponds to one training example and can reference only one audio file.

Changing relpath to relpaths in the metadata, so that each example can reference more than one audio file, will let us support the multi-audio training example scenario.

Possible changes:

  1. Splitting can remain the same, i.e. create deterministic buckets of unique split keys, but the split key has to be defined for each example, i.e. for each relpaths entry.
  2. Subsampling can still depend on str(relpaths), because that is unique at the example level.
  3. At subsampling, we will copy the audio files in relpaths (subsampled on str(relpaths) in the previous step) into the corresponding task directory.
  4. Trimming and padding of audio files can be done or avoided depending on the task.
  5. Currently, split.json has the following format:
{
    "filename: str": "labels: Union[List[Any], str]"
}

To have multiple audio files for each example, the format of this file has to be changed to:

[
    {
        "relpaths": List[str],
        "labels": List[Any]
    }
]
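
For illustration, here is a minimal sketch of how a consumer could read the proposed per-example format (the filename train.json and the variable names are illustrative, not the pipeline's actual names):

import json

# Each entry pairs a set of audio files with the labels for that example.
with open("train.json") as fp:
    examples = json.load(fp)

for example in examples:
    relpaths = example["relpaths"]  # one or more audio files for this example
    labels = example["labels"]      # labels shared by those files
    print(len(relpaths), labels)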

A few open questions after this:

  1. How will we save the memory-mapped embeddings for these sets of audio files?
     Since multiple embeddings can share one label, a possible way of saving them is as a 2-d embedding:
     embedding --> n x m (assuming num(relpaths) == m) and label --> n x 1. This way we will have a correspondence between each set of embeddings (m audio embeddings for one example in this case) and the corresponding label. See the sketch after this list.
  2. For downstream prediction, we depend on the embedding and the label generated in the above step, so if we can keep a one-to-one correspondence between the set of embeddings for the audio in relpaths and the label (or labels, for multi-label), we can use the same downstream setup (e.g. the data loader).
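
A minimal sketch of the layout from question 1, reading "n x m" as n examples of m embeddings each (the shapes, dtypes, dimensionality d, and filenames are assumptions, not the pipeline's actual format):

import numpy as np

n, m, d = 100, 4, 512  # hypothetical: n examples, m audio files each, d-dim embeddings

# One (m, d) block of embeddings per example, memory-mapped on disk.
embeddings = np.lib.format.open_memmap(
    "embeddings.npy", mode="w+", dtype=np.float32, shape=(n, m, d)
)
# One label per example, aligned with axis 0 of the embeddings.
labels = np.lib.format.open_memmap(
    "labels.npy", mode="w+", dtype=np.int64, shape=(n,)
)

# labels[i] corresponds to the m embeddings in embeddings[i].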

Add statistics JSONs to tasks/ directory

Also add soxi -T total to each stats file?

And audio_samplerate_count should be audio_file_count.

We also need audio_duration, which is the total number of samples divided by the sample rate.

Lastly, we want label counts and percentages.

This can be broken into smaller issues.
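
A minimal sketch of what the combined stats file could contain, assuming the soundfile package, single string labels, and a hypothetical tasks/example layout:

import json
from collections import Counter
from pathlib import Path

import soundfile as sf

task_dir = Path("tasks/example")
wavs = sorted(task_dir.glob("**/*.wav"))
# Hypothetical {filename: label} mapping, e.g. from the split JSONs above.
file_labels = json.loads((task_dir / "train.json").read_text())

label_counts = Counter(file_labels.values())
total = sum(label_counts.values())

stats = {
    "audio_file_count": len(wavs),
    # Per file, duration is samples / sample rate; summed over all files (seconds).
    "audio_duration": sum(sf.info(str(p)).duration for p in wavs),
    "label_counts": dict(label_counts),
    "label_percentages": {k: 100.0 * v / total for k, v in label_counts.items()},
}

(task_dir / "stats.json").write_text(json.dumps(stats, indent=2))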

task metadata => task config

The metadata is really just the labels, so let's call them labels everywhere and drop the term "metadata".

Using metadata for both task config and labels is confusing.

Luigi interface logger in runner.py not working

There are some messages logged within runner.py using the luigi interface logger that are not showing up. I believe this is because the logger can only be called from within the pipeline (i.e. in a function that runs within the pipeline). Need to check on that. Either way, we are trying to log a couple of things that are not showing up.
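
One workaround to verify, sketched below: luigi attaches handlers to its "luigi-interface" logger when the worker starts, so a module-level logger with its own configuration should show up regardless of where it is called from (this is a hypothesis, not a confirmed diagnosis):

import logging

# A plain module-level logger, configured independently of luigi,
# so messages appear even outside the running pipeline.
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

logger.info("visible with or without the luigi pipeline running")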

Rename metadata to labels

Throughout the preprocessing pipeline we have the concept of metadata for each task -- this is really just the labels, and we should refactor the code to reflect this, i.e. rename 'metadata' to 'labels' throughout the pipeline.

A simple tf dataset task could do wonders.

Speech Commands could be simpler and safer if we just downloaded the TensorFlow dataset, where they run the original generation, instead of our ported version GenerateTrainDataset.

NSynth too? Do we save anything there?

There are also probably good tf datasets we could cherry-pick. A simple tf dataset task could do wonders.
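
A minimal sketch of what pulling Speech Commands from TFDS could look like (the speech_commands builder exists in tensorflow_datasets; whether its splits match ours would need checking):

import tensorflow_datasets as tfds

# Download and load the canonical train split, along with dataset metadata.
ds, info = tfds.load("speech_commands", split="train", with_info=True)

for example in ds.take(1):
    audio = example["audio"]  # raw PCM samples
    label = example["label"]  # integer class index
    print(audio.shape, info.features["label"].int2str(int(label)))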

Add None to task subsampling

When task subsampling is None, don't subsample that split.

The total number of hours in the dataset should be in the final tar name, unless it is None.
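
A minimal sketch of both behaviors (subsample_split and final_tar_name are hypothetical helpers, not existing pipeline code):

from typing import List, Optional

def subsample_split(filenames: List[str], max_files: Optional[int]) -> List[str]:
    # None means: keep the split exactly as-is.
    if max_files is None:
        return filenames
    return filenames[:max_files]

def final_tar_name(task_name: str, total_hours: Optional[float]) -> str:
    # Include the dataset's total hours in the tar name, unless it is None.
    if total_hours is None:
        return f"{task_name}.tar.gz"
    return f"{task_name}-{total_hours:g}h.tar.gz"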

logging instead of print

"This is different thing but it would be great to update all the print statements to use the luigi logger. logger.info("")"

Small dataset subsampling for speech commands

Ending up with a weird number of samples in the train/valid split:

  • test: 96
  • train: 56
  • valid: 132

This is caused by the background_noise subsampling in tasks/sampler.py. In Speech Commands, all the background noise samples (which are labelled as silence) are delivered as longer audio files that are expected to be sliced into smaller chunks. When we subsample this dataset, only one background noise sample is included (running_tap.wav), and it happens to be in the validation set. As a result, we end up with a validation set that is almost exclusively silence samples.

Second logger for diagnostics

@jorshi writes:

I think you can do logger = logging.getLogger('name')

that will either retrieve or create a new logger

Note that Loggers should NEVER be instantiated directly, but always through the module-level function logging.getLogger(name). Multiple calls to getLogger() with the same name will always return a reference to the same Logger object.

Example of how to set up a logger:

https://docs.python.org/3/howto/logging-cookbook.html

import logging

# Create (or retrieve) a named logger and let it emit DEBUG and above.
logger = logging.getLogger('hear-logger')
logger.setLevel(logging.DEBUG)

# Write records to a log file alongside the pipeline output.
handler = logging.FileHandler("hear-preprocess.log")
handler.setLevel(logging.DEBUG)

# Timestamp, logger name, and severity on every line.
formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
handler.setFormatter(formatter)

logger.addHandler(handler)
logger.info("message")

Verify dataset task

  • Checks metadata
  • Adds md5sum
  • Makes sure WAV files in each directory have the same size
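
A minimal sketch of the checksum and size checks (verify_directory and md5sum are hypothetical helpers, not existing pipeline code):

import hashlib
import os
from pathlib import Path

def md5sum(path: Path) -> str:
    # Stream the file in chunks so large WAVs are not loaded into memory.
    h = hashlib.md5()
    with open(path, "rb") as fp:
        for chunk in iter(lambda: fp.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_directory(directory: Path) -> None:
    wavs = sorted(directory.glob("*.wav"))
    # All WAVs in one directory should be the same size (same duration,
    # sample rate, and bit depth after trimming/padding).
    sizes = {os.path.getsize(p) for p in wavs}
    assert len(sizes) <= 1, f"mismatched WAV sizes in {directory}: {sizes}"
    for p in wavs:
        print(p.name, md5sum(p))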

Delete partially complete task workdir

Now that we assert that symlinks do not exist, we get errors when rerunning a pipeline that was stopped in the middle of a task that creates symlinks.

Let's delete partially complete task workdirs when restarting a task, or at least remove existing symlinks at the beginning of tasks that create them?
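
One way to implement the first option, sketched below (reset_workdir is a hypothetical helper, not existing pipeline code):

import shutil
from pathlib import Path

def reset_workdir(workdir: Path) -> None:
    # Remove a partially completed task's working directory before rerunning,
    # so stale symlinks from the interrupted run don't trip the assertions.
    if workdir.exists():
        shutil.rmtree(workdir)
    workdir.mkdir(parents=True)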
