hear-preprocess's Issues

relpath => relpaths

Changes to make to adapt for multi-audio training examples


We need to change the behavior of the preprocessing pipeline to accommodate tasks that require multiple audio files for each training example.
Currently, the metadata prepared in ExtractMetadata is structured so that each row corresponds to one training example and can reference only one audio file.

Changing relpath to relpaths in the metadata, so that each example can reference more than one audio file, will let us support the multi-audio training example scenario.

Possible changes:

  1. Splitting can remain the same, i.e. create deterministic buckets of unique split keys, but the split key has to be defined for each example, i.e. for each relpaths entry.
  2. Subsampling can still depend on str(relpaths), because that is unique at the example level.
  3. At subsampling, we will copy the audio files in relpaths (subsampled on str(relpaths) in the previous step) into the corresponding task directory.
  4. Trimming and padding of audio files can be done or avoided depending on the task.
  5. Currently, split.json has the following format:
{
    "filename: str": "labels: Union[List[Any], str]"
}

To have multiple audio files for each example, the format of this file has to be changed to:

[
    {
        "relpaths": List[str],
        "labels": List[Any]
    }
]
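
For illustration, here is a minimal sketch of how a consumer could read the proposed per-example format (the filename train.json and the variable names are illustrative, not the pipeline's actual names):

import json

# Each entry pairs a set of audio files with the labels for that example.
with open("train.json") as fp:
    examples = json.load(fp)

for example in examples:
    relpaths = example["relpaths"]  # one or more audio files for this example
    labels = example["labels"]      # labels shared by those files
    print(len(relpaths), labels)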

A few open questions after this:

  1. How will we save the memory-mapped embeddings for these sets of audio files?
     Since multiple embeddings can share one label, a possible way of saving them is as a 2-d embedding:
     embedding --> n x m (assuming num(relpaths) == m) and label --> n x 1. This way we will have a correspondence between each set of embeddings (m audio embeddings for one example in this case) and the corresponding label. See the sketch after this list.
  2. For downstream prediction, we depend on the embedding and the label generated in the above step, so if we can keep a one-to-one correspondence between the set of embeddings for the audio in relpaths and the label (or labels, for multi-label), we can use the same downstream setup (e.g. the data loader).
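
A minimal sketch of the layout from question 1, reading "n x m" as n examples of m embeddings each (the shapes, dtypes, dimensionality d, and filenames are assumptions, not the pipeline's actual format):

import numpy as np

n, m, d = 100, 4, 512  # hypothetical: n examples, m audio files each, d-dim embeddings

# One (m, d) block of embeddings per example, memory-mapped on disk.
embeddings = np.lib.format.open_memmap(
    "embeddings.npy", mode="w+", dtype=np.float32, shape=(n, m, d)
)
# One label per example, aligned with axis 0 of the embeddings.
labels = np.lib.format.open_memmap(
    "labels.npy", mode="w+", dtype=np.int64, shape=(n,)
)

# labels[i] corresponds to the m embeddings in embeddings[i].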

Add statistics JSONs to tasks/ directory

Also add soxi -T total to each stats file?

And audio_samplerate_count should be audio_file_count.

We also need audio_duration, which is the total number of samples divided by the sample rate.

Lastly, we want label counts and percentages.

This can be broken into smaller issues.
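
A minimal sketch of what the combined stats file could contain, assuming the soundfile package, single string labels, and a hypothetical tasks/example layout:

import json
from collections import Counter
from pathlib import Path

import soundfile as sf

task_dir = Path("tasks/example")
wavs = sorted(task_dir.glob("**/*.wav"))
# Hypothetical {filename: label} mapping, e.g. from the split JSONs above.
file_labels = json.loads((task_dir / "train.json").read_text())

label_counts = Counter(file_labels.values())
total = sum(label_counts.values())

stats = {
    "audio_file_count": len(wavs),
    # Per file, duration is samples / sample rate; summed over all files (seconds).
    "audio_duration": sum(sf.info(str(p)).duration for p in wavs),
    "label_counts": dict(label_counts),
    "label_percentages": {k: 100.0 * v / total for k, v in label_counts.items()},
}

(task_dir / "stats.json").write_text(json.dumps(stats, indent=2))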

task metadata => task config

The metadata is really just the labels, so let's call them labels everywhere and drop the term "metadata".

Using metadata for both task config and labels is confusing.

Luigi interface logger in runner.py not working

There are some messages logged within runner.py using the luigi interface logger that are not showing up. I believe this is because the logger can only be called from within the pipeline (i.e. in a function that runs within the pipeline). Need to check on that. Either way, we are trying to log a couple of things that are not showing up.
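
One workaround to verify, sketched below: luigi attaches handlers to its "luigi-interface" logger when the worker starts, so a module-level logger with its own configuration should show up regardless of where it is called from (this is a hypothesis, not a confirmed diagnosis):

import logging

# A plain module-level logger, configured independently of luigi,
# so messages appear even outside the running pipeline.
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

logger.info("visible with or without the luigi pipeline running")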

Rename metadata to labels

Throughout the preprocessing pipeline we have the concept of metadata for each task -- this is really just the labels, and we should refactor the code to reflect this, i.e. rename 'metadata' to 'labels' throughout the pipeline.

A simple tf dataset task could do wonders.

Speech Commands could be simpler and safer if we just downloaded the TensorFlow dataset, where they run the original generation, instead of our ported version GenerateTrainDataset.

NSynth too? Do we save anything there?

There are also probably good tf datasets we could cherry-pick. A simple tf dataset task could do wonders.
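
A minimal sketch of what pulling Speech Commands from TFDS could look like (the speech_commands builder exists in tensorflow_datasets; whether its splits match ours would need checking):

import tensorflow_datasets as tfds

# Download and load the canonical train split, along with dataset metadata.
ds, info = tfds.load("speech_commands", split="train", with_info=True)

for example in ds.take(1):
    audio = example["audio"]  # raw PCM samples
    label = example["label"]  # integer class index
    print(audio.shape, info.features["label"].int2str(int(label)))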

Add None to task subsampling

When task subsampling is None, don't subsample that split.

The total number of hours in the dataset should be in the final tar name, unless it is None.
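
A minimal sketch of both behaviors (subsample_split and final_tar_name are hypothetical helpers, not existing pipeline code):

from typing import List, Optional

def subsample_split(filenames: List[str], max_files: Optional[int]) -> List[str]:
    # None means: keep the split exactly as-is.
    if max_files is None:
        return filenames
    return filenames[:max_files]

def final_tar_name(task_name: str, total_hours: Optional[float]) -> str:
    # Include the dataset's total hours in the tar name, unless it is None.
    if total_hours is None:
        return f"{task_name}.tar.gz"
    return f"{task_name}-{total_hours:g}h.tar.gz"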

logging instead of print

"This is different thing but it would be great to update all the print statements to use the luigi logger. logger.info("")"

Small dataset subsampling for speech commands

Ending up with a weird number of samples in the train/valid split:

  • test: 96
  • train: 56
  • valid: 132

This is caused by the background_noise subsampling in tasks/sampler.py. In Speech Commands, all the background noise samples (which are labelled as silence) are delivered as longer audio files that are expected to be sliced into smaller chunks. When we subsample this dataset, only one background noise sample is included (running_tap.wav), and it happens to be in the validation set. As a result, we end up with a validation set that is almost exclusively silence samples.

Second logger for diagnostics

@jorshi writes:

I think you can do logger = logging.getLogger('name')

that will either retrieve or create a new logger

Note that Loggers should NEVER be instantiated directly, but always through the module-level function logging.getLogger(name). Multiple calls to getLogger() with the same name will always return a reference to the same Logger object.

Example of how to set up a logger:

https://docs.python.org/3/howto/logging-cookbook.html

import logging

# Create (or retrieve) a named logger and let it emit DEBUG and above.
logger = logging.getLogger('hear-logger')
logger.setLevel(logging.DEBUG)

# Write records to a log file alongside the pipeline output.
handler = logging.FileHandler("hear-preprocess.log")
handler.setLevel(logging.DEBUG)

# Timestamp, logger name, and severity on every line.
formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
handler.setFormatter(formatter)

logger.addHandler(handler)
logger.info("message")

Verify dataset task

  • Checks metadata
  • Adds md5sum
  • Makes sure WAV files in each directory have the same size
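
A minimal sketch of the checksum and size checks (verify_directory and md5sum are hypothetical helpers, not existing pipeline code):

import hashlib
import os
from pathlib import Path

def md5sum(path: Path) -> str:
    # Stream the file in chunks so large WAVs are not loaded into memory.
    h = hashlib.md5()
    with open(path, "rb") as fp:
        for chunk in iter(lambda: fp.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_directory(directory: Path) -> None:
    wavs = sorted(directory.glob("*.wav"))
    # All WAVs in one directory should be the same size (same duration,
    # sample rate, and bit depth after trimming/padding).
    sizes = {os.path.getsize(p) for p in wavs}
    assert len(sizes) <= 1, f"mismatched WAV sizes in {directory}: {sizes}"
    for p in wavs:
        print(p.name, md5sum(p))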

Delete partially complete task workdir

Now that we assert that symlinks do not exist, we get errors when rerunning a pipeline that was stopped in the middle of a task that creates symlinks.

Let's delete partially complete task workdirs when restarting a task, or at least remove existing symlinks at the beginning of tasks that create them?
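
One way to implement the first option, sketched below (reset_workdir is a hypothetical helper, not existing pipeline code):

import shutil
from pathlib import Path

def reset_workdir(workdir: Path) -> None:
    # Remove a partially completed task's working directory before rerunning,
    # so stale symlinks from the interrupted run don't trip the assertions.
    if workdir.exists():
        shutil.rmtree(workdir)
    workdir.mkdir(parents=True)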
