hearbenchmark / hear-eval-kit
Evaluation kit for the HEAR Benchmark
Home Page: https://hearbenchmark.com
License: Apache License 2.0
Might have the same scope as #120.
Should have the following components:
• Background and motivation (from proposal doc)
• Summary of relevant prior works and a synthesis of general trends in the literature, both in audio and in adjacent ML fields whose progress in representation learning has not yet been borne out in audio ML research.
• A high-level description of the variety of domains and tasks that the model will be evaluated on. A particular emphasis will be made on high-societal-impact audio tasks that are currently underrepresented, such as low-resource languages, environmental and ecological safety, clinical speech applications, and ethnomusicology, thus encouraging participants to devise impactful datasets rather than relying solely upon popular and/or commercially viable benchmarks.
https://github.com/neuralaudio/hear2021-eval-kit/pull/18/files#r641180248
Implement evaluation for auto-tagging tasks (multi-label) using LWLRAP; see https://arxiv.org/abs/1906.02975
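A minimal numpy sketch of LWLRAP following the definition in that paper (function and variable names here are illustrative, not the kit's API):

```python
import numpy as np

def lwlrap(truth: np.ndarray, scores: np.ndarray) -> float:
    """Label-weighted label-ranking average precision.

    truth:  (n_samples, n_labels) binary ground-truth matrix.
    scores: (n_samples, n_labels) predicted scores.
    """
    precisions = []  # one precision value per positive (sample, label) pair
    for y, s in zip(truth, scores):
        pos = np.flatnonzero(y)
        if len(pos) == 0:
            continue
        # Rank of every label for this sample, where rank 1 is the highest score.
        ranks = np.argsort(np.argsort(-s)) + 1
        for label in pos:
            # Number of true labels ranked at or above this label's rank.
            hits = np.sum(ranks[pos] <= ranks[label])
            precisions.append(hits / ranks[label])
    # Averaging over all positive (sample, label) pairs equals the label-weighted average.
    return float(np.mean(precisions))
```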
As spotted by @jorshi
Could also include full sandbox testing of the API and its speed.
Also might be nice to show how to do training in a separate notebook (could be a separate issue)
A .version() API method, returning a string, which we can use to segregate different output runs of a particular model.
This should also go into the website API description.
This might not actually be necessary if we are always working with numpy embeddings that were cached to disk.
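A minimal sketch of what such a method might look like (the string format shown is only a placeholder):

```python
# Hypothetical module-level API method; the exact version-string format is an assumption.
def version() -> str:
    """Return a version string used to segregate different output runs of this model."""
    return "2021.0.1"
```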
Much easier than mixing it with Luigi
Can we avoid creating a whole new task?
Maybe we include nonmale in the task name?
We might want to slugify the person/instrument and then the filename, for partitioning. Or just have a partition slug, which is the person/instrument.
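A small illustrative slug helper for building such partition slugs (a library like python-slugify would work equally well; nothing here is the kit's actual code):

```python
import re

def slugify(text: str) -> str:
    """Lowercase, replace runs of non-alphanumerics with hyphens, and trim."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")

# e.g. a partition slug built from the person/instrument:
# slugify("Bassoon Player 03") -> "bassoon-player-03"
```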
Implement an evaluation metric for multi-class predictions using MRR (mean reciprocal rank); a small sketch follows below.
Related to #11
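A minimal numpy sketch of MRR for multi-class predictions (names are illustrative):

```python
import numpy as np

def mean_reciprocal_rank(truth: np.ndarray, scores: np.ndarray) -> float:
    """truth: (n_samples,) integer class indices; scores: (n_samples, n_classes)."""
    # Rank of every class for each sample, where rank 1 is the highest score.
    ranks = np.argsort(np.argsort(-scores, axis=1), axis=1) + 1
    true_ranks = ranks[np.arange(len(truth)), truth]
    return float(np.mean(1.0 / true_ranks))
```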
For transcription I suggest we go with Jesse's paper (https://arxiv.org/pdf/1710.11153.pdf)
Explain our versioning: year-major-minor
This will allow us to do regression testing.
The idea of hop_size might be confusing. Another proposal from @maxsolomonhenry is frame_rate (as the number of frames per second); see the small conversion sketch after this discussion.
What any user wants from an audio embedding for downstream use (e.g. frame-based transcription or SED) is that the embedding is based on the prediction at each particular timestep. However, the input to the embedding might be variable-length or use multi-scale centered frames.
The concern was that this distinction might not be clear, and hop_size suggests classic overlap-add processing.
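For concreteness, a small sketch of how the two parameterizations relate (documentation-style only; the actual API naming is still open):

```python
# hop_size (in milliseconds) and frame_rate (in frames per second) are two views
# of the same spacing between embedding timestamps, independent of how wide or
# multi-scale the input window around each timestamp is.

def hop_size_ms_to_frame_rate(hop_size_ms: float) -> float:
    return 1000.0 / hop_size_ms

def frame_rate_to_hop_size_ms(frame_rate: float) -> float:
    return 1000.0 / frame_rate

# e.g. hop_size_ms=25.0 -> 40 frames per second.
```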
@jorshi "should we pull out all the models from this repo since we have the separate repo for the baseline now?" yeah
Originally posted by @turian in #68 (comment)
Check that get_audio_embedding is framing things correctly based upon the hop size and centering.
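A hedged sketch of the kind of check meant here, assuming the API returns one timestamp (in milliseconds) per embedding frame:

```python
import numpy as np

def check_framing(timestamps: np.ndarray, hop_size_ms: float, centered: bool) -> None:
    """Verify that frame timestamps are spaced hop_size apart and, if frames
    are centered, that the first frame sits at t=0."""
    hops = np.diff(timestamps)
    assert np.allclose(hops, hop_size_ms), "timestamps are not hop_size apart"
    if centered:
        assert np.isclose(timestamps[0], 0.0), "centered frames should start at t=0"
```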
We don't want NaN labels for any of the audio, so a filter step should be applied towards the top of the pipeline
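A minimal pandas sketch of that filter step (the column name is an assumption):

```python
import pandas as pd

def drop_nan_labels(metadata: pd.DataFrame) -> pd.DataFrame:
    """Drop rows with missing labels before any splitting or subsampling."""
    return metadata.dropna(subset=["label"])
```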
If there are multiple requires in a Luigi task, use a dict; this is less brittle than numerical indexing (see the sketch below).
This will also require changing utils/luigi.py for the stage number.
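A minimal Luigi sketch of dict-style requires (the upstream task names and paths are hypothetical):

```python
import luigi


class DownloadCorpus(luigi.Task):  # hypothetical upstream task
    def output(self):
        return luigi.LocalTarget("_workdir/corpus")


class DownloadLabels(luigi.Task):  # hypothetical upstream task
    def output(self):
        return luigi.LocalTarget("_workdir/labels.csv")


class ProcessMetadata(luigi.Task):
    def requires(self):
        # A dict keeps downstream access readable and robust to reordering,
        # unlike positional indexing such as self.input()[0].
        return {"audio": DownloadCorpus(), "labels": DownloadLabels()}

    def run(self):
        audio_dir = self.input()["audio"].path
        labels_csv = self.input()["labels"].path
        # ... build the process-metadata CSV from these inputs ...
```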
Given the task type, test.csv, and predicted-test.csv, output evaluation scores. (Note that we can implement this now just by creating random test.csv and predicted-test.csv files; Christian is starting this task.)
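A hedged sketch of that entry point for the multiclass case (the file layout and column names are assumptions, not the kit's format):

```python
import pandas as pd

def evaluate(task_type: str, test_csv: str, predicted_csv: str) -> dict:
    """Compare ground-truth and predicted labels and return metric scores."""
    truth = pd.read_csv(test_csv)
    pred = pd.read_csv(predicted_csv)
    merged = truth.merge(pred, on="filename", suffixes=("_true", "_pred"))
    if task_type == "multiclass":
        accuracy = (merged["label_true"] == merged["label_pred"]).mean()
        return {"accuracy": float(accuracy)}
    raise NotImplementedError(task_type)
```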
This would be a higher-level convenience.
In this case, the return value would be a numpy array.
This code exists in heareval/task_embeddings.py; however, we might consider exposing a convenience higher-level API over all embeddings that follow our lower-level API.
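One hedged idea of what such a convenience wrapper could look like on top of the lower-level API (every name and signature here is an assumption):

```python
import numpy as np
import soundfile as sf

def compute_all_embeddings(module, audio_paths, hop_size_ms=25.0):
    """Run a module's lower-level embedding API over a list of audio files
    and return a dict mapping each path to a numpy embedding array."""
    embeddings = {}
    for path in audio_paths:
        audio, _sr = sf.read(path)  # assumes files are already at the model's sample rate
        # get_audio_embedding and its hop_size argument are assumed from the lower-level API.
        emb = module.get_audio_embedding(audio, hop_size=hop_size_ms)
        embeddings[path] = np.asarray(emb)
    return embeddings
```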
Most of the util/luigi.py stuff can be put into pipeline.py
The ProcessMetadata pattern should be better documented and should include column headers in the CSV.
Add sanity-check code for ProcessMetadata that checks the existence of the desired columns.
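A minimal sketch of that sanity check (the required column set is an assumption):

```python
import pandas as pd

REQUIRED_COLUMNS = {"relpath", "label", "partition"}  # assumed column set

def check_metadata_columns(metadata: pd.DataFrame) -> None:
    missing = REQUIRED_COLUMNS - set(metadata.columns)
    if missing:
        raise ValueError(f"ProcessMetadata output is missing columns: {sorted(missing)}")
```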
Luigi should create README and LICENSE, as described in the tasks/README.md
Implement evaluation of ranking tasks. Spearman seems like an appropriate metric for these tasks.
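A minimal scipy sketch, assuming per-item scalar targets and predictions:

```python
from scipy.stats import spearmanr

def ranking_score(true_scores, predicted_scores) -> float:
    """Spearman rank correlation between ground-truth and predicted orderings."""
    rho, _pvalue = spearmanr(true_scores, predicted_scores)
    return float(rho)
```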
I don't think there's a need for this. Timestamps will be the same whether or not the frames are centered.
Originally posted by @maxsolomonhenry in #2 (comment)
This should be run both on the original partitions and also the subsampled partitions
_workdir/{config.TASKNAME}/
Originally posted by @turian in #12 (comment)
Full nsynth is quite large. Should we subsample it? At least for training / val. I think we should keep the full test set.
Originally posted by @jorshi in #78 (comment)