joshvarty / audiotagging
Working on: https://www.kaggle.com/c/freesound-audio-tagging-2019
License: MIT License
All of our outputs should be between 0 and 1; we should investigate whether we can use y_range to help deal with this.
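If y_range doesn't slot in cleanly, a minimal sketch of the same idea in plain PyTorch (a scaled sigmoid on the head's outputs; the module name is our own) might look like:

```python
import torch
import torch.nn as nn

class SigmoidRange(nn.Module):
    """Squash raw logits into (lo, hi) with a scaled sigmoid, which is
    essentially what fastai's y_range does for regression heads."""
    def __init__(self, lo: float = 0.0, hi: float = 1.0):
        super().__init__()
        self.lo, self.hi = lo, hi

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(x) * (self.hi - self.lo) + self.lo
```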
Some of the files are corrupted or mislabelled. We should remove them or relabel them:
https://www.kaggle.com/c/freesound-audio-tagging-2019/discussion/93480#latest-537909
Some people have suggested that this augmentation, while it doesn't seem to make sense, actually leads to improvements in LB score. It may be worth trying: https://www.kaggle.com/c/freesound-audio-tagging-2019/discussion/93291#537135
I don't want to do all my dev on Kaggle Kernels. Create a kernel that accepts an export.pkl
and can just generate a submission from that.
We would like to use the same metric that the competition is using. Let's figure out how.
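One option is the sklearn shortcut that circulated on the forums; a sketch (assuming one-hot truth and raw score matrices):

```python
import numpy as np
from sklearn.metrics import label_ranking_average_precision_score

def lwlrap(truth: np.ndarray, scores: np.ndarray) -> float:
    """Label-weighted label-ranking average precision (the competition
    metric). Weighting each sample by its number of positive labels
    reproduces the per-label weighting in the official definition."""
    sample_weight = truth.sum(axis=1)
    nonzero = sample_weight > 0          # sklearn mishandles zero-label rows
    return label_ranking_average_precision_score(
        truth[nonzero] > 0, scores[nonzero],
        sample_weight=sample_weight[nonzero])
```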
There are some useful ideas for incorporating the noisy dataset here: https://www.kaggle.com/c/freesound-audio-tagging-2019/discussion/92969#latest-536416
I should finish #24 and use this approach to see which classes we're doing poorly on.
@nathanhubens' solution performed well using an ensemble of 10 models. Perhaps we should try something similar and see if it improves our results.
We should try to better understand lwlrap. What causes a good lwlrap score? What causes a bad one? Are there any interesting features of lwlrap that should guide our predictions?
One thought: lwlrap seems to be rank-based. Once we determine the ordering of our outputs, would it be beneficial to minimize the distance between successive items? For example, if we output [0.9, 0.2, 0.1], would it help to modify these to something like [0.9, 0.89, 0.88]? This maintains the order but minimizes the distance between successive predictions. A sketch of that post-processing is below.
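A minimal sketch of that idea (compress_ranks is a hypothetical helper, untested):

```python
import numpy as np

def compress_ranks(preds: np.ndarray, eps: float = 0.01) -> np.ndarray:
    """Keep each row's ranking but place successive scores eps apart,
    e.g. [0.9, 0.2, 0.1] -> [0.9, 0.89, 0.88]."""
    out = np.empty_like(preds)
    for i, row in enumerate(preds):
        order = np.argsort(-row)                 # highest score first
        out[i, order] = row[order[0]] - eps * np.arange(len(row))
    return out
```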
I'm fairly sure our work on #7 has given us a reasonable validation set, but we should double-check that by recording how our test scores change in comparison to our validation scores.
Name | Valid lwlrap | Test lwlrap |
---|---|---|
vgg-16 5 folds | 0.797 | 0.648 |
xresnet-101 | 0.791 | 0.647 |
xresnet-101 (curated) | 0.814 | 0.664 |
xresnet-101 (curated) (TTA) | 0.843 | 0.674 |
xresnet-101 (curated) (TTA) (label smoothing) | 0.851 | 0.683 |
xresnet-152 (curated) (TTA) (label smoothing) | 0.855 | 0.690 |
Try training without discriminative learning rates.
Right now we're taking 10 crops on the validation and test sets. Is this appropriate? Should we use more? Fewer? A variable number based on clip length? (Probably.)
This kernel seemed to improve a lot when it switched to LeakyReLU. It's also a very small network, which is interesting.
Probably should have started out with this, but we should compare the other models.
Final three epochs from each run:

epoch | train_loss | valid_loss | lwlrap | time |
---|---|---|---|---|
97 | 0.035250 | 0.029217 | 0.664344 | 00:39 |
98 | 0.035705 | 0.029428 | 0.662940 | 00:39 |
99 | 0.035190 | 0.029455 | 0.662831 | 00:39 |
97 | 0.043572 | 0.028318 | 0.665870 | 00:43 |
98 | 0.043541 | 0.027721 | 0.672451 | 00:43 |
99 | 0.043261 | 0.027860 | 0.665838 | 00:43 |
97 | 0.041682 | 0.025638 | 0.700495 | 01:03 |
98 | 0.042080 | 0.025700 | 0.704907 | 01:03 |
99 | 0.041788 | 0.025718 | 0.704059 | 01:03 |
97 | 0.045181 | 0.027131 | 0.674292 | 01:12 |
98 | 0.044569 | 0.027153 | 0.676560 | 01:12 |
99 | 0.044443 | 0.027219 | 0.676386 | 01:12 |
We should try to integrate the noisy dataset so we can see how performance changes.
Currently our train-set performance is much better than our test-set performance. Why is this? How can we improve?
A couple people have mentioned that they were using a custom loss function and that this may be responsible for some of their results.
A kernel scoring 0.66 was doing this, but it has since been made private. It's discussed here: https://www.kaggle.com/c/freesound-audio-tagging-2019/discussion/93291#latest-538024
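We don't know the private kernel's exact loss; one plausible candidate to try (an assumption on our part, not confirmed by the thread) is a focal variant of BCE for multi-label targets:

```python
import torch
import torch.nn.functional as F

def focal_bce_loss(logits: torch.Tensor, targets: torch.Tensor,
                   gamma: float = 2.0) -> torch.Tensor:
    """Focal BCE: down-weight easy examples so training focuses on the
    hard ones. gamma=0 recovers plain BCE-with-logits."""
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction='none')
    p_t = torch.exp(-bce)                 # model's probability of the truth
    return ((1 - p_t) ** gamma * bce).mean()
```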
I'm unfamiliar with Inceptionv3 other than that it takes a fairly long time to train.
I need to sanity check the spectrogram process to see what our waveforms look like after transformation when compared to before.
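A quick sanity check could plot the two side by side (file path is a placeholder; librosa 0.7-era API):

```python
import numpy as np
import librosa
import librosa.display
import matplotlib.pyplot as plt

y, sr = librosa.load('sample.wav', sr=44100)            # placeholder clip
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
log_mel = librosa.power_to_db(mel, ref=np.max)

plt.figure(figsize=(10, 6))
plt.subplot(2, 1, 1)
librosa.display.waveplot(y, sr=sr)                      # waveform before
plt.subplot(2, 1, 2)
librosa.display.specshow(log_mel, sr=sr, x_axis='time', y_axis='mel')
plt.tight_layout()
plt.show()
```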
Are ResNets the best here? Are there other alternatives worth considering?
Can we hand-label clips from the noisy or test datasets and improve our results?
Investigate label smoothing
I chose the initial parameters at random, so we don't really know what the best values are.
Min | Max | lwlrap |
---|---|---|
0.000 | 1.000 | 0.8482 |
0.000 | 0.950 | 0.8520 |
0.001 | 0.990 | 0.8513 |
0.001 | 0.975 | 0.8531 |
0.001 | 0.960 | 0.8497 |
0.001 | 0.950 | 0.8542 |
0.001 | 0.935 | 0.8497 |
0.010 | 0.990 | 0.8501 |
0.010 | 0.975 | 0.8466 |
0.010 | 0.950 | 0.8526 |
0.010 | 0.935 | 0.8499 |
0.010 | 0.900 | |
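A minimal sketch of the smoothing being swept here, assuming it's a simple squeeze of the hard 0/1 targets into [Min, Max]:

```python
import torch

def smooth_targets(targets: torch.Tensor,
                   lo: float = 0.001, hi: float = 0.950) -> torch.Tensor:
    """Clamp hard 0/1 multi-label targets into [lo, hi] before BCE;
    lo/hi correspond to the Min/Max columns above."""
    return targets.clamp(min=lo, max=hi)
```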
We haven't looked at the output distributions, so we should probably see which classes we're getting wrong. It might also be useful to compare the lengths of the clips we're getting right vs. wrong.
We should take a look at the noisy and test datasets in our exploratory data analysis. My understanding is that the test dataset is from the same source as the curated dataset:

> The test set is used for system evaluation and consists of manually-labeled data from FSD. Since most of the train data come from YFCC, some acoustic domain mismatch between the train and test set can be expected. All the acoustic material present in the test set is labeled, except human error, considering the vocabulary of 80 classes used in the competition.

Are the clips in our test set the same length as the ones in the curated set?
To learn a bit more about the intermediate weights, let's try visualizing them.
Try training systems on different clip lengths? Can we see whether we get better results on some classes depending on length?
Is it possible to leverage this: https://towardsdatascience.com/audio-classification-using-fastai-and-on-the-fly-frequency-transforms-4dbe1b540f89
Can I use half precision to speed up training?
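fastai v1 makes mixed precision a one-liner (assuming our learner is a standard Learner; the epoch count here is arbitrary):

```python
# Convert the learner to mixed precision; fastai handles loss scaling.
learn = learn.to_fp16()
learn.fit_one_cycle(10)
```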
It probably doesn't make sense to use ImageNet stats to normalize our spectrograms. Are our images in [-1, 1], [0, 1], or something else? Do we need to normalize them?
This technique was used last year: librosa.effects.trim was used to strip leading and trailing silence. Everyone else seems to use it for audio.
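Usage is straightforward (file path is a placeholder):

```python
import librosa

y, sr = librosa.load('sample.wav', sr=44100)        # placeholder clip
# Returns the trimmed signal plus the [start, end] sample interval kept.
y_trimmed, interval = librosa.effects.trim(y, top_db=60)
```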
Some people are getting higher LB scores than us with very shallow models. They mention that their data preprocessing is probably partly responsible. It might be worth trying to explore how different audio parameters affect our score.
I would like to try a few experiments (a sketch follows this list):

- conf.sampling_rate = 44100 (the format the sounds are presented to us in), but we should try different values.
- 500 means we're looking at 2 seconds at a time; 250 means we're looking at 1 second at a time.
- fmax=14000, but we should be using fmax = sampling_rate // 2.
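A sketch of the settings in question (file path is a placeholder; the parameter name behind the 500/250 knob was lost above, so it isn't shown):

```python
import librosa

sampling_rate = 44100                      # the rate the clips ship at
fmax = sampling_rate // 2                  # Nyquist: 22050 Hz, not 14000

y, sr = librosa.load('sample.wav', sr=sampling_rate)   # placeholder clip
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128, fmax=fmax)
```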
Learn what the differences are. Why is the log-mel spectrogram preferred for this competition?
Since we're using K-fold cross validation we end up with 5 exported models. We should modify the Kaggle Kernel to accept multiple export.pkl
files and average the predictions from all of them.
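A sketch of the fold averaging (fastai v1; the file names, folder layout, and item list are assumptions about how we export):

```python
import numpy as np
from fastai.basic_data import DatasetType
from fastai.vision import ImageList, load_learner

test_items = ImageList.from_folder('data/test')     # assumed layout
fold_preds = []
for fold in range(5):
    learn = load_learner('models', f'export_fold{fold}.pkl', test=test_items)
    preds, _ = learn.get_preds(ds_type=DatasetType.Test)
    fold_preds.append(preds.numpy())
avg_preds = np.mean(fold_preds, axis=0)             # simple mean over folds
```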
It sounds like other features may help improve our model's ability to distinguish between sounds.
From: https://www.kaggle.com/c/freesound-audio-tagging-2019/discussion/93337#latest-537350
Paper that describes some of this: https://arxiv.org/pdf/1905.00078.pdf
B. Audio Features: '… However, due to the physics of sound production, there are additional correlations for frequencies that are multiples of the same base frequency (harmonics). To allow a spatially local model (e.g., a CNN) to take these into account, a third dimension can be added that directly yields the magnitudes of the harmonic series [14], [15].'
Here's one such feature that might be worth exploring: https://librosa.github.io/librosa/generated/librosa.core.cqt.html
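A quick way to look at that feature (file path is a placeholder; the CQT uses log-spaced bins, so harmonic stacks keep a fixed spatial pattern):

```python
import numpy as np
import librosa

y, sr = librosa.load('sample.wav', sr=44100)        # placeholder clip
# Constant-Q transform: 84 log-spaced bins, 12 per octave.
C = np.abs(librosa.cqt(y, sr=sr, n_bins=84, bins_per_octave=12))
log_C = librosa.amplitude_to_db(C, ref=np.max)
```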