audiotagging's People

Contributors

joshvarty, nathanhubens


audiotagging's Issues

Plan

  • Persist images to disk
  • Convert all images
    • Convert train
    • Convert train noisy
    • Convert test
  • Train model on train
    • Submit
  • Train model on train_noisy
    • Submit

Use competition metric

We would like to use the same metric that the competition is using. Let's figure out how.
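
The metric is lwlrap (label-weighted label-ranking average precision). The organizers published a reference implementation we should verify against; below is a rough NumPy sketch of how I understand it to work.

```python
import numpy as np

def lwlrap(truth, scores):
    """Rough sketch of label-weighted label-ranking average precision."""
    truth = np.asarray(truth, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    n_samples, n_classes = truth.shape
    precisions = np.zeros((n_samples, n_classes))
    for i in range(n_samples):
        pos = np.flatnonzero(truth[i])
        if len(pos) == 0:
            continue
        # rank of every class for this sample (1 = highest score)
        ranks = np.argsort(np.argsort(-scores[i])) + 1
        for c in pos:
            # fraction of the classes ranked at or above c that are true labels
            hits = np.sum(ranks[pos] <= ranks[c])
            precisions[i, c] = hits / ranks[c]
    # average precision per class, weighted by how often each class is a true label
    class_counts = truth.sum(axis=0)
    per_class = precisions.sum(axis=0) / np.maximum(class_counts, 1)
    weights = class_counts / class_counts.sum()
    return float(np.sum(per_class * weights))
```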

Try regenerating dataset with different audio parameters

Some people are getting higher LB scores than us with very shallow models. They mention that their data preprocessing is probably partly responsible. It might be worth trying to explore how different audio parameters affect our score.

I would like to try a few experiments (see the config sketch after these items):

Use the maximum sampling rate:

  • conf.sampling_rate = 44100 (the sample rate the competition clips are provided at)

Try different hop lengths

  • We're using 500, but we should try different values.
  • A hop length of 500 means we're looking at 2 seconds at a time.
  • A hop length of 250 means we're looking at 1 second at a time.

Correct fmax

  • Right now we're using fmax=14000 but we should be using fmax = sampling_rate // 2
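
For reference, here's roughly what the regenerated preprocessing could look like. The conf attribute names and the n_fft/n_mels/fmin values are assumptions, not what's currently checked in.

```python
import librosa
import numpy as np

class conf:
    sampling_rate = 44100         # native rate of the competition clips
    hop_length = 500              # also try 250
    n_mels = 128                  # assumed, not tuned
    n_fft = 2048                  # assumed, not tuned
    fmin = 20                     # assumed, not tuned
    fmax = sampling_rate // 2     # instead of the hard-coded 14000

def audio_to_melspectrogram(path):
    y, _ = librosa.load(path, sr=conf.sampling_rate)
    mel = librosa.feature.melspectrogram(
        y=y, sr=conf.sampling_rate, n_mels=conf.n_mels,
        hop_length=conf.hop_length, n_fft=conf.n_fft,
        fmin=conf.fmin, fmax=conf.fmax)
    return librosa.power_to_db(mel).astype(np.float32)
```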

Try with more folds

@nathanhubens' solution performed well using an ensemble of 10 models. Perhaps we should try something similar and see if it improves our results.
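
If we go this route, the ensembling itself is just averaging per-fold probabilities before ranking. A minimal sketch, assuming one prediction array per fold model:

```python
import numpy as np

def ensemble_folds(fold_preds):
    """Average a list of (n_clips, n_classes) probability arrays, one per fold.
    Since lwlrap only depends on ranking, simple averaging is a reasonable
    default; rank averaging would be another option to try."""
    return np.mean(np.stack(fold_preds, axis=0), axis=0)
```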

Try mixup

Everyone else seems to use it for audio.
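
If our training setup doesn't give us mixup out of the box, the core of it is small enough to sketch. alpha=0.4 is a commonly used default, not something we've tuned.

```python
import numpy as np
import torch

def mixup_batch(x, y, alpha=0.4):
    """Blend each example (and its multi-hot targets) with a random partner
    from the same batch."""
    lam = np.random.beta(alpha, alpha)
    perm = torch.randperm(x.size(0))
    x_mixed = lam * x + (1 - lam) * x[perm]
    y_mixed = lam * y + (1 - lam) * y[perm]
    return x_mixed, y_mixed
```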

Investigate waveforms

I need to sanity-check the spectrogram pipeline by comparing what our clips look like before and after the transformation.
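
Something like this should do for the sanity check (file name and spectrogram parameters are placeholders):

```python
import librosa
import librosa.display
import matplotlib.pyplot as plt

y, sr = librosa.load("example.wav", sr=44100)   # placeholder clip

plt.figure(figsize=(10, 6))
plt.subplot(2, 1, 1)
plt.plot(y)                                     # raw waveform
plt.title("waveform")

mel = librosa.power_to_db(
    librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128, hop_length=500))
plt.subplot(2, 1, 2)
librosa.display.specshow(mel, sr=sr, hop_length=500, x_axis="time", y_axis="mel")
plt.title("mel spectrogram (dB)")
plt.tight_layout()
plt.show()
```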

Look at what we're getting wrong.

We haven't looked at the output distributions, so we should see which clips and classes we're getting wrong. It might also be useful to compare the lengths of the clips we're getting right vs. wrong.
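
A rough starting point, assuming we dump validation predictions to a CSV with per-clip length, labels, and a correctness flag (the file and column names below are hypothetical):

```python
import pandas as pd

df = pd.read_csv("valid_predictions.csv")                 # hypothetical dump
# Clip length distribution for clips we get right vs. wrong
print(df.groupby("top1_correct")["seconds"].describe())
# Which labels show up most among the clips we miss
print(df.loc[~df["top1_correct"], "labels"].value_counts().head(10))
```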

Keep Track of Results

I'm fairly sure our work on #7 has given us a reasonable validation set, but we should double check that by recording how our test scores change in comparison to our validation scores.

| Name | Valid lwlrap | Test lwlrap |
| --- | --- | --- |
| vgg-16 (5 folds) | 0.797 | 0.648 |
| xresnet-101 | 0.791 | 0.647 |
| xresnet-101 (curated) | 0.814 | 0.664 |
| xresnet-101 (curated) (TTA) | 0.843 | 0.674 |
| xresnet-101 (curated) (TTA) (label smoothing) | 0.851 | 0.683 |
| xresnet-152 (curated) (TTA) (label smoothing) | 0.855 | 0.690 |

Integrate noisy dataset

We should try to integrate the noisy dataset so we can see how performance changes. We should:

  • Include it during training
  • Investigate a custom loss function to protect against noise
    • Consider some kind of per-sample flag, so we can apply one loss function to the noisy portion and another to the curated portion (see the sketch below)
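
One possible shape for the flag idea, assuming a per-sample is_noisy boolean and multi-hot targets; this is a sketch of the concept, not something we've implemented:

```python
import torch
import torch.nn.functional as F

def split_loss(logits, targets, is_noisy, noisy_weight=0.5):
    """BCE on curated samples, down-weighted BCE on samples flagged as coming
    from train_noisy. noisy_weight is a guess and would need tuning."""
    per_sample = F.binary_cross_entropy_with_logits(
        logits, targets, reduction="none").mean(dim=1)
    weights = torch.where(is_noisy,
                          torch.full_like(per_sample, noisy_weight),
                          torch.ones_like(per_sample))
    return (weights * per_sample).mean()
```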

Explore lwlrap

We should try to better understand lwlrap. What causes a good score? What causes a bad one? Are there any interesting properties of lwlrap that should guide our predictions?

One thought: lwlrap seems to be rank-based. Once we determine the ordering of our predictions, would it be beneficial to minimize the distance between successive items?

For example: if we output [0.9, 0.2, 0.1], would it help to modify these to something like [0.9, 0.89, 0.88]? This maintains the order but minimizes the distance between each prediction.
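
Since the metric is computed purely from the ranking of the scores, squeezing the values together while preserving their order shouldn't change it at all. A quick check, reusing the lwlrap() sketch from the "Use competition metric" issue above:

```python
import numpy as np

truth = np.array([[1, 0, 1]])
spread_out = np.array([[0.9, 0.2, 0.1]])
squeezed = np.array([[0.9, 0.89, 0.88]])

# Both calls print the same value because the ordering is identical.
print(lwlrap(truth, spread_out))
print(lwlrap(truth, squeezed))
```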

Consider other models

Probably should have started out with this, but we should compare the other models.

ResNet-18

| epoch | train_loss | valid_loss | lwlrap | time |
| --- | --- | --- | --- | --- |
| 97 | 0.035250 | 0.029217 | 0.664344 | 00:39 |
| 98 | 0.035705 | 0.029428 | 0.662940 | 00:39 |
| 99 | 0.035190 | 0.029455 | 0.662831 | 00:39 |

ResNet-50

| epoch | train_loss | valid_loss | lwlrap | time |
| --- | --- | --- | --- | --- |
| 97 | 0.043572 | 0.028318 | 0.665870 | 00:43 |
| 98 | 0.043541 | 0.027721 | 0.672451 | 00:43 |
| 99 | 0.043261 | 0.027860 | 0.665838 | 00:43 |

VGG-16 BN

| epoch | train_loss | valid_loss | lwlrap | time |
| --- | --- | --- | --- | --- |
| 97 | 0.041682 | 0.025638 | 0.700495 | 01:03 |
| 98 | 0.042080 | 0.025700 | 0.704907 | 01:03 |
| 99 | 0.041788 | 0.025718 | 0.704059 | 01:03 |

VGG-19 BN

| epoch | train_loss | valid_loss | lwlrap | time |
| --- | --- | --- | --- | --- |
| 97 | 0.045181 | 0.027131 | 0.674292 | 01:12 |
| 98 | 0.044569 | 0.027153 | 0.676560 | 01:12 |
| 99 | 0.044443 | 0.027219 | 0.676386 | 01:12 |

Figure out how many crops to take

Right now we're taking 10 crops on the validation and test sets. Is this appropriate? Should we use more? Should we use fewer? Should we use a variable number based on clip length? (Probably; see the sketch below.)
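
If we do make the number of crops variable, one simple rule could be roughly one crop per crop-length of audio, capped at some maximum (the numbers below are guesses, not tuned):

```python
import math

def n_crops_for_clip(duration_s, crop_s=2.0, max_crops=10):
    """Roughly one crop per crop-length of audio, capped at max_crops."""
    return min(max_crops, max(1, math.ceil(duration_s / crop_s)))
```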

Explore the lengths of the noisy dataset and test dataset

We should take a look at the noisy and test datasets in our exploratory data analysis. My understanding is that the test dataset is from the same source as the curated dataset:

The test set is used for system evaluation and consists of manually-labeled data from FSD. Since most of the train data come from YFCC, some acoustic domain mismatch between the train and test set can be expected. All the acoustic material present in the test set is labeled, except human error, considering the vocabulary of 80 classes used in the competition.

Are the clips in our test set the same length as the ones in the curated set?
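
A quick way to gather the length distributions (the folder paths are placeholders for wherever the data lives locally):

```python
from pathlib import Path

import librosa
import pandas as pd

def clip_lengths(folder):
    """Duration in seconds for every .wav file in a folder."""
    rows = [{"fname": p.name, "seconds": librosa.get_duration(filename=str(p))}
            for p in Path(folder).glob("*.wav")]
    return pd.DataFrame(rows)

# e.g. clip_lengths("data/test")["seconds"].describe()
#      clip_lengths("data/train_noisy")["seconds"].describe()
```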

Consider incorporating other representations of sound into our model

It sounds like other features may help improve our model's ability to distinguish between sounds.

From: https://www.kaggle.com/c/freesound-audio-tagging-2019/discussion/93337#latest-537350

Paper that describes some of this: https://arxiv.org/pdf/1905.00078.pdf

B. Audio Features: '… However, due to the physics of sound production, there are additional correlations for frequencies that are multiples of the same base frequency (harmonics). To allow a spatially local model (e.g., a CNN) to take these into account, a third dimension can be added that directly yields the magnitudes of the harmonic series [14], [15].'

Here's one such feature that might be worth exploring: https://librosa.github.io/librosa/generated/librosa.core.cqt.html
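
As a first pass at the harmonic idea, a constant-Q transform channel could look roughly like this (the hop length and number of bins are librosa defaults, not tuned):

```python
import librosa
import numpy as np

y, sr = librosa.load("example.wav", sr=44100)          # placeholder clip
cqt = np.abs(librosa.cqt(y, sr=sr, hop_length=512, n_bins=84))
cqt_db = librosa.amplitude_to_db(cqt)                  # candidate extra input channel
```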

Figure out best label smoothing parameters

I chose the initial parameters somewhat arbitrarily, so we're not sure what the best values are. (A sketch of how I'm reading Min/Max follows the table.)

| Min | Max | lwlrap |
| --- | --- | --- |
| 0.000 | 1.000 | 0.8482 |
| 0.000 | 0.950 | 0.8520 |
| 0.001 | 0.990 | 0.8513 |
| 0.001 | 0.975 | 0.8531 |
| 0.001 | 0.960 | 0.8497 |
| 0.001 | 0.950 | 0.8542 |
| 0.001 | 0.935 | 0.8497 |
| 0.010 | 0.990 | 0.8501 |
| 0.010 | 0.975 | 0.8466 |
| 0.010 | 0.950 | 0.8526 |
| 0.010 | 0.935 | 0.8499 |
| 0.010 | 0.900 | |
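
For reference, the way I'm reading the Min/Max columns: hard 0/1 targets get mapped into [min, max] before the loss. A minimal sketch of that interpretation (not necessarily exactly what's in our training code):

```python
import torch.nn.functional as F

def smoothed_bce(logits, targets, smooth_min=0.001, smooth_max=0.950):
    """Map hard 0/1 multi-hot targets into [smooth_min, smooth_max], then BCE."""
    targets = targets * (smooth_max - smooth_min) + smooth_min
    return F.binary_cross_entropy_with_logits(logits, targets)
```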
