joshvarty / audiotagging
Working on: https://www.kaggle.com/c/freesound-audio-tagging-2019
License: MIT License
All of our outputs should be between 0 and 1; we should investigate whether we can use y_range to help deal with this.
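If y_range doesn't slot in cleanly, a minimal sketch of the same idea in plain PyTorch (a scaled sigmoid on the head's outputs; the module name is our own) might look like:

```python
import torch
import torch.nn as nn

class SigmoidRange(nn.Module):
    """Squash raw logits into (lo, hi) with a scaled sigmoid, which is
    essentially what fastai's y_range does for regression heads."""
    def __init__(self, lo: float = 0.0, hi: float = 1.0):
        super().__init__()
        self.lo, self.hi = lo, hi

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(x) * (self.hi - self.lo) + self.lo
```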
Some of the files are corrupted or mislabelled. We should remove them or relabel them:
https://www.kaggle.com/c/freesound-audio-tagging-2019/discussion/93480#latest-537909
Some people have suggested that this augmentation, while it doesn't seem to make sense, actually leads to improvements in LB score. It may be worth trying: https://www.kaggle.com/c/freesound-audio-tagging-2019/discussion/93291#537135
I don't want to do all my dev on Kaggle Kernels. Create a kernel that accepts an export.pkl
and can just generate a submission from that.
We would like to use the same metric that the competition is using. Let's figure out how.
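One option is the sklearn shortcut that circulated on the forums; a sketch (assuming one-hot truth and raw score matrices):

```python
import numpy as np
from sklearn.metrics import label_ranking_average_precision_score

def lwlrap(truth: np.ndarray, scores: np.ndarray) -> float:
    """Label-weighted label-ranking average precision (the competition
    metric). Weighting each sample by its number of positive labels
    reproduces the per-label weighting in the official definition."""
    sample_weight = truth.sum(axis=1)
    nonzero = sample_weight > 0          # sklearn mishandles zero-label rows
    return label_ranking_average_precision_score(
        truth[nonzero] > 0, scores[nonzero],
        sample_weight=sample_weight[nonzero])
```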
There are some useful ideas for incorporating the noisy dataset here: https://www.kaggle.com/c/freesound-audio-tagging-2019/discussion/92969#latest-536416
I should finish #24 and use this approach to see which classes we're doing poorly on.
@nathanhubens' solution performed well using an ensemble of 10 models. Perhaps we should try something similar and see if it improves our results.
We should try to better understand lwlrap. What causes a good lwlrap score? What causes a bad one? Are there any interesting features of lwlrap that should guide our predictions?
One thought: lwlrap seems to be rank-based. Once we determine the ordering of our outputs, would it be beneficial to minimize the distance between successive items? For example, if we output [0.9, 0.2, 0.1], would it help to modify these to something like [0.9, 0.89, 0.88]? This maintains the order but minimizes the distance between successive predictions. A sketch of that post-processing is below.
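A minimal sketch of that idea (compress_ranks is a hypothetical helper, untested):

```python
import numpy as np

def compress_ranks(preds: np.ndarray, eps: float = 0.01) -> np.ndarray:
    """Keep each row's ranking but place successive scores eps apart,
    e.g. [0.9, 0.2, 0.1] -> [0.9, 0.89, 0.88]."""
    out = np.empty_like(preds)
    for i, row in enumerate(preds):
        order = np.argsort(-row)                 # highest score first
        out[i, order] = row[order[0]] - eps * np.arange(len(row))
    return out
```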
I'm fairly sure our work on #7 has given us a reasonable validation set, but we should double-check that by recording how our test scores change in comparison to our validation scores.
Name | Valid lwlrap | Test lwlrap |
---|---|---|
vgg-16 5 folds | 0.797 | 0.648 |
xresnet-101 | 0.791 | 0.647 |
xresnet-101 (curated) | 0.814 | 0.664 |
xresnet-101 (curated) (TTA) | 0.843 | 0.674 |
xresnet-101 (curated) (TTA) (label smoothing) | 0.851 | 0.683 |
xresnet-152 (curated) (TTA) (label smoothing) | 0.855 | 0.690 |
Try training without discriminative learning rates.
Right now we're taking 10 crops on the validation and test sets. Is this appropriate? Should we use more? Fewer? A variable number based on clip length? (Probably.)
This kernel seemed to improve a lot when it switched to LeakyReLU. It's also a very small network, which is interesting.
Probably should have started out with this, but we should compare the other models.
Final three epochs from each run:

epoch | train_loss | valid_loss | lwlrap | time |
---|---|---|---|---|
97 | 0.035250 | 0.029217 | 0.664344 | 00:39 |
98 | 0.035705 | 0.029428 | 0.662940 | 00:39 |
99 | 0.035190 | 0.029455 | 0.662831 | 00:39 |
97 | 0.043572 | 0.028318 | 0.665870 | 00:43 |
98 | 0.043541 | 0.027721 | 0.672451 | 00:43 |
99 | 0.043261 | 0.027860 | 0.665838 | 00:43 |
97 | 0.041682 | 0.025638 | 0.700495 | 01:03 |
98 | 0.042080 | 0.025700 | 0.704907 | 01:03 |
99 | 0.041788 | 0.025718 | 0.704059 | 01:03 |
97 | 0.045181 | 0.027131 | 0.674292 | 01:12 |
98 | 0.044569 | 0.027153 | 0.676560 | 01:12 |
99 | 0.044443 | 0.027219 | 0.676386 | 01:12 |
We should try to integrate the noisy dataset so we can see how performance changes.
Currently our train-set performance is much better than our test-set performance. Why is this? How can we improve?
A couple people have mentioned that they were using a custom loss function and that this may be responsible for some of their results.
A kernel scoring 0.66 was doing this, but it has since been made private. It's discussed here: https://www.kaggle.com/c/freesound-audio-tagging-2019/discussion/93291#latest-538024
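We don't know the private kernel's exact loss; one plausible candidate to try (an assumption on our part, not confirmed by the thread) is a focal variant of BCE for multi-label targets:

```python
import torch
import torch.nn.functional as F

def focal_bce_loss(logits: torch.Tensor, targets: torch.Tensor,
                   gamma: float = 2.0) -> torch.Tensor:
    """Focal BCE: down-weight easy examples so training focuses on the
    hard ones. gamma=0 recovers plain BCE-with-logits."""
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction='none')
    p_t = torch.exp(-bce)                 # model's probability of the truth
    return ((1 - p_t) ** gamma * bce).mean()
```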
I'm unfamiliar with Inceptionv3 other than that it takes a fairly long time to train.
I need to sanity check the spectrogram process to see what our waveforms look like after transformation when compared to before.
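A quick sanity check could plot the two side by side (file path is a placeholder; librosa 0.7-era API):

```python
import numpy as np
import librosa
import librosa.display
import matplotlib.pyplot as plt

y, sr = librosa.load('sample.wav', sr=44100)            # placeholder clip
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
log_mel = librosa.power_to_db(mel, ref=np.max)

plt.figure(figsize=(10, 6))
plt.subplot(2, 1, 1)
librosa.display.waveplot(y, sr=sr)                      # waveform before
plt.subplot(2, 1, 2)
librosa.display.specshow(log_mel, sr=sr, x_axis='time', y_axis='mel')
plt.tight_layout()
plt.show()
```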
Are ResNets the best here? Are there other alternatives worth considering?
Can we hand-label clips from the noisy or test datasets and improve our results?
Investigate label smoothing
I chose the initial parameters at random, so we don't really know what the best values are.
Min | Max | lwlrap |
---|---|---|
0.000 | 1.000 | 0.8482 |
0.000 | 0.950 | 0.8520 |
0.001 | 0.990 | 0.8513 |
0.001 | 0.975 | 0.8531 |
0.001 | 0.960 | 0.8497 |
0.001 | 0.950 | 0.8542 |
0.001 | 0.935 | 0.8497 |
0.010 | 0.990 | 0.8501 |
0.010 | 0.975 | 0.8466 |
0.010 | 0.950 | 0.8526 |
0.010 | 0.935 | 0.8499 |
0.010 | 0.900 | |
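A minimal sketch of the smoothing being swept here, assuming it's a simple squeeze of the hard 0/1 targets into [Min, Max]:

```python
import torch

def smooth_targets(targets: torch.Tensor,
                   lo: float = 0.001, hi: float = 0.950) -> torch.Tensor:
    """Clamp hard 0/1 multi-label targets into [lo, hi] before BCE;
    lo/hi correspond to the Min/Max columns above."""
    return targets.clamp(min=lo, max=hi)
```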
We haven't looked at the output distributions, so we should probably see which classes we're getting wrong. It might also be useful to compare the lengths of the clips we're getting right vs. wrong.
We should take a look at the noisy and test datasets in our exploratory data analysis. My understanding is that the test dataset is from the same source as the curated dataset:

> The test set is used for system evaluation and consists of manually-labeled data from FSD. Since most of the train data come from YFCC, some acoustic domain mismatch between the train and test set can be expected. All the acoustic material present in the test set is labeled, except human error, considering the vocabulary of 80 classes used in the competition.

Are the clips in our test set the same length as the ones in the curated set?
To learn a bit more about the intermediate weights, let's try visualizing them.
Try training systems on different clip lengths? Can we see whether we get better results on some classes depending on length?
Is it possible to leverage this: https://towardsdatascience.com/audio-classification-using-fastai-and-on-the-fly-frequency-transforms-4dbe1b540f89
Can I use half precision to speed up training?
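fastai v1 makes mixed precision a one-liner (assuming our learner is a standard Learner; the epoch count here is arbitrary):

```python
# Convert the learner to mixed precision; fastai handles loss scaling.
learn = learn.to_fp16()
learn.fit_one_cycle(10)
```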
It probably doesn't make sense to use ImageNet stats to normalize our spectrograms. Are our images in [-1, 1], [0, 1], or something else? Do we need to normalize them?
This technique was used last year: librosa.effects.trim was used to strip leading and trailing silence. Everyone else seems to use it for audio.
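Usage is straightforward (file path is a placeholder):

```python
import librosa

y, sr = librosa.load('sample.wav', sr=44100)        # placeholder clip
# Returns the trimmed signal plus the [start, end] sample interval kept.
y_trimmed, interval = librosa.effects.trim(y, top_db=60)
```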
Some people are getting higher LB scores than us with very shallow models. They mention that their data preprocessing is probably partly responsible. It might be worth trying to explore how different audio parameters affect our score.
I would like to try a few experiments (a sketch follows this list):

- conf.sampling_rate = 44100 (the format the sounds are presented to us in), but we should try different values.
- 500 means we're looking at 2 seconds at a time; 250 means we're looking at 1 second at a time.
- fmax=14000, but we should be using fmax = sampling_rate // 2.
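A sketch of the settings in question (file path is a placeholder; the parameter name behind the 500/250 knob was lost above, so it isn't shown):

```python
import librosa

sampling_rate = 44100                      # the rate the clips ship at
fmax = sampling_rate // 2                  # Nyquist: 22050 Hz, not 14000

y, sr = librosa.load('sample.wav', sr=sampling_rate)   # placeholder clip
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128, fmax=fmax)
```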
Learn what the differences are. Why is the log-mel spectrogram preferred for this competition?
Since we're using K-fold cross validation we end up with 5 exported models. We should modify the Kaggle Kernel to accept multiple export.pkl
files and average the predictions from all of them.
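A sketch of the fold averaging (fastai v1; the file names, folder layout, and item list are assumptions about how we export):

```python
import numpy as np
from fastai.basic_data import DatasetType
from fastai.vision import ImageList, load_learner

test_items = ImageList.from_folder('data/test')     # assumed layout
fold_preds = []
for fold in range(5):
    learn = load_learner('models', f'export_fold{fold}.pkl', test=test_items)
    preds, _ = learn.get_preds(ds_type=DatasetType.Test)
    fold_preds.append(preds.numpy())
avg_preds = np.mean(fold_preds, axis=0)             # simple mean over folds
```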
It sounds like other features may help improve our model's ability to distinguish between sounds.
From: https://www.kaggle.com/c/freesound-audio-tagging-2019/discussion/93337#latest-537350
Paper that describes some of this: https://arxiv.org/pdf/1905.00078.pdf
B. Audio Features: '… However, due to the physics of sound production, there are additional correlations for frequencies that are multiples of the same base frequency (harmonics). To allow a spatially local model (e.g., a CNN) to take these into account, a third dimension can be added that directly yields the magnitudes of the harmonic series [14], [15].'
Here's one such feature that might be worth exploring: https://librosa.github.io/librosa/generated/librosa.core.cqt.html
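A quick way to look at that feature (file path is a placeholder; the CQT uses log-spaced bins, so harmonic stacks keep a fixed spatial pattern):

```python
import numpy as np
import librosa

y, sr = librosa.load('sample.wav', sr=44100)        # placeholder clip
# Constant-Q transform: 84 log-spaced bins, 12 per octave.
C = np.abs(librosa.cqt(y, sr=sr, n_bins=84, bins_per_octave=12))
log_C = librosa.amplitude_to_db(C, ref=np.max)
```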