audiotagging's People

Contributors

joshvarty, nathanhubens


audiotagging's Issues

Plan

  • Persist images to disk
  • Convert all images
    • Convert train
    • Convert train noisy
    • Convert test
  • Train model on train
    • Submit
  • Train model on train_noisy
    • Submit

Use competition metric

We would like to use the same metric that the competition is using. Let's figure out how.
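
The metric is lwlrap (label-weighted label-ranking average precision). The organizers published a reference implementation we should verify against; below is a rough NumPy sketch of how I understand it to work.

```python
import numpy as np

def lwlrap(truth, scores):
    """Rough sketch of label-weighted label-ranking average precision."""
    truth = np.asarray(truth, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    n_samples, n_classes = truth.shape
    precisions = np.zeros((n_samples, n_classes))
    for i in range(n_samples):
        pos = np.flatnonzero(truth[i])
        if len(pos) == 0:
            continue
        # rank of every class for this sample (1 = highest score)
        ranks = np.argsort(np.argsort(-scores[i])) + 1
        for c in pos:
            # fraction of the classes ranked at or above c that are true labels
            hits = np.sum(ranks[pos] <= ranks[c])
            precisions[i, c] = hits / ranks[c]
    # average precision per class, weighted by how often each class is a true label
    class_counts = truth.sum(axis=0)
    per_class = precisions.sum(axis=0) / np.maximum(class_counts, 1)
    weights = class_counts / class_counts.sum()
    return float(np.sum(per_class * weights))
```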

Try regenerating dataset with different audio parameters

Some people are getting higher LB scores than us with very shallow models. They mention that their data preprocessing is probably partly responsible. It might be worth trying to explore how different audio parameters affect our score.

I would like to try a few experiments (see the config sketch after these items):

Use the maximum sampling rate:

  • conf.sampling_rate = 44100 (the sample rate the competition clips are provided at)

Try different hop lengths

  • We're using 500, but we should try different values.
  • A hop length of 500 means we're looking at 2 seconds at a time.
  • A hop length of 250 means we're looking at 1 second at a time.

Correct fmax

  • Right now we're using fmax=14000 but we should be using fmax = sampling_rate // 2
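
For reference, here's roughly what the regenerated preprocessing could look like. The conf attribute names and the n_fft/n_mels/fmin values are assumptions, not what's currently checked in.

```python
import librosa
import numpy as np

class conf:
    sampling_rate = 44100         # native rate of the competition clips
    hop_length = 500              # also try 250
    n_mels = 128                  # assumed, not tuned
    n_fft = 2048                  # assumed, not tuned
    fmin = 20                     # assumed, not tuned
    fmax = sampling_rate // 2     # instead of the hard-coded 14000

def audio_to_melspectrogram(path):
    y, _ = librosa.load(path, sr=conf.sampling_rate)
    mel = librosa.feature.melspectrogram(
        y=y, sr=conf.sampling_rate, n_mels=conf.n_mels,
        hop_length=conf.hop_length, n_fft=conf.n_fft,
        fmin=conf.fmin, fmax=conf.fmax)
    return librosa.power_to_db(mel).astype(np.float32)
```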

Try with more folds

@nathanhubens' solution performed well using an ensemble of 10 models. Perhaps we should try something similar and see if it improves our results.
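
If we go this route, the ensembling itself is just averaging per-fold probabilities before ranking. A minimal sketch, assuming one prediction array per fold model:

```python
import numpy as np

def ensemble_folds(fold_preds):
    """Average a list of (n_clips, n_classes) probability arrays, one per fold.
    Since lwlrap only depends on ranking, simple averaging is a reasonable
    default; rank averaging would be another option to try."""
    return np.mean(np.stack(fold_preds, axis=0), axis=0)
```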

Try mixup

Everyone else seems to use it for audio.
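
If our training setup doesn't give us mixup out of the box, the core of it is small enough to sketch. alpha=0.4 is a commonly used default, not something we've tuned.

```python
import numpy as np
import torch

def mixup_batch(x, y, alpha=0.4):
    """Blend each example (and its multi-hot targets) with a random partner
    from the same batch."""
    lam = np.random.beta(alpha, alpha)
    perm = torch.randperm(x.size(0))
    x_mixed = lam * x + (1 - lam) * x[perm]
    y_mixed = lam * y + (1 - lam) * y[perm]
    return x_mixed, y_mixed
```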

Investigate waveforms

I need to sanity-check the spectrogram pipeline by comparing what our clips look like before and after the transformation.
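
Something like this should do for the sanity check (file name and spectrogram parameters are placeholders):

```python
import librosa
import librosa.display
import matplotlib.pyplot as plt

y, sr = librosa.load("example.wav", sr=44100)   # placeholder clip

plt.figure(figsize=(10, 6))
plt.subplot(2, 1, 1)
plt.plot(y)                                     # raw waveform
plt.title("waveform")

mel = librosa.power_to_db(
    librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128, hop_length=500))
plt.subplot(2, 1, 2)
librosa.display.specshow(mel, sr=sr, hop_length=500, x_axis="time", y_axis="mel")
plt.title("mel spectrogram (dB)")
plt.tight_layout()
plt.show()
```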

Look at what we're getting wrong.

We haven't looked at the output distributions, so we should see which clips and classes we're getting wrong. It might also be useful to compare the lengths of the clips we're getting right vs. wrong.
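
A rough starting point, assuming we dump validation predictions to a CSV with per-clip length, labels, and a correctness flag (the file and column names below are hypothetical):

```python
import pandas as pd

df = pd.read_csv("valid_predictions.csv")                 # hypothetical dump
# Clip length distribution for clips we get right vs. wrong
print(df.groupby("top1_correct")["seconds"].describe())
# Which labels show up most among the clips we miss
print(df.loc[~df["top1_correct"], "labels"].value_counts().head(10))
```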

Keep Track of Results

I'm fairly sure our work on #7 has given us a reasonable validation set, but we should double check that by recording how our test scores change in comparison to our validation scores.

| Name | Valid lwlrap | Test lwlrap |
| --- | --- | --- |
| vgg-16 (5 folds) | 0.797 | 0.648 |
| xresnet-101 | 0.791 | 0.647 |
| xresnet-101 (curated) | 0.814 | 0.664 |
| xresnet-101 (curated) (TTA) | 0.843 | 0.674 |
| xresnet-101 (curated) (TTA) (label smoothing) | 0.851 | 0.683 |
| xresnet-152 (curated) (TTA) (label smoothing) | 0.855 | 0.690 |

Integrate noisy dataset

We should try to integrate the noisy dataset so we can see how performance changes. We should:

  • Include it during training
  • Investigate a custom loss function to protect against noise
    • Consider some kind of per-sample flag, so we can apply one loss function to the noisy portion and another to the curated portion (see the sketch below)
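
One possible shape for the flag idea, assuming a per-sample is_noisy boolean and multi-hot targets; this is a sketch of the concept, not something we've implemented:

```python
import torch
import torch.nn.functional as F

def split_loss(logits, targets, is_noisy, noisy_weight=0.5):
    """BCE on curated samples, down-weighted BCE on samples flagged as coming
    from train_noisy. noisy_weight is a guess and would need tuning."""
    per_sample = F.binary_cross_entropy_with_logits(
        logits, targets, reduction="none").mean(dim=1)
    weights = torch.where(is_noisy,
                          torch.full_like(per_sample, noisy_weight),
                          torch.ones_like(per_sample))
    return (weights * per_sample).mean()
```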

Explore lwlrap

We should try to better understand lwlrap. What causes a good score? What causes a bad one? Are there any interesting properties of lwlrap that should guide our predictions?

One thought: lwlrap seems to be rank-based. Once we determine the ordering of our predictions, would it be beneficial to minimize the distance between successive items?

For example: if we output [0.9, 0.2, 0.1], would it help to modify these to something like [0.9, 0.89, 0.88]? This maintains the order but minimizes the distance between each prediction.
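
Since the metric is computed purely from the ranking of the scores, squeezing the values together while preserving their order shouldn't change it at all. A quick check, reusing the lwlrap() sketch from the "Use competition metric" issue above:

```python
import numpy as np

truth = np.array([[1, 0, 1]])
spread_out = np.array([[0.9, 0.2, 0.1]])
squeezed = np.array([[0.9, 0.89, 0.88]])

# Both calls print the same value because the ordering is identical.
print(lwlrap(truth, spread_out))
print(lwlrap(truth, squeezed))
```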

Consider other models

Probably should have started out with this, but we should compare the other models.

ResNet-18

| epoch | train_loss | valid_loss | lwlrap | time |
| --- | --- | --- | --- | --- |
| 97 | 0.035250 | 0.029217 | 0.664344 | 00:39 |
| 98 | 0.035705 | 0.029428 | 0.662940 | 00:39 |
| 99 | 0.035190 | 0.029455 | 0.662831 | 00:39 |

ResNet-50

| epoch | train_loss | valid_loss | lwlrap | time |
| --- | --- | --- | --- | --- |
| 97 | 0.043572 | 0.028318 | 0.665870 | 00:43 |
| 98 | 0.043541 | 0.027721 | 0.672451 | 00:43 |
| 99 | 0.043261 | 0.027860 | 0.665838 | 00:43 |

VGG-16 BN

| epoch | train_loss | valid_loss | lwlrap | time |
| --- | --- | --- | --- | --- |
| 97 | 0.041682 | 0.025638 | 0.700495 | 01:03 |
| 98 | 0.042080 | 0.025700 | 0.704907 | 01:03 |
| 99 | 0.041788 | 0.025718 | 0.704059 | 01:03 |

VGG-19 BN

| epoch | train_loss | valid_loss | lwlrap | time |
| --- | --- | --- | --- | --- |
| 97 | 0.045181 | 0.027131 | 0.674292 | 01:12 |
| 98 | 0.044569 | 0.027153 | 0.676560 | 01:12 |
| 99 | 0.044443 | 0.027219 | 0.676386 | 01:12 |

Figure out how many crops to take

Right now we're taking 10 crops on the validation and test sets. Is this appropriate? Should we use more? Should we use fewer? Should we use a variable number based on clip length? (Probably; see the sketch below.)
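
If we do make the number of crops variable, one simple rule could be roughly one crop per crop-length of audio, capped at some maximum (the numbers below are guesses, not tuned):

```python
import math

def n_crops_for_clip(duration_s, crop_s=2.0, max_crops=10):
    """Roughly one crop per crop-length of audio, capped at max_crops."""
    return min(max_crops, max(1, math.ceil(duration_s / crop_s)))
```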

Explore the lengths of the noisy dataset and test dataset

We should take a look at the noisy and test datasets in our exploratory data analysis. My understanding is that the test dataset is from the same source as the curated dataset:

The test set is used for system evaluation and consists of manually-labeled data from FSD. Since most of the train data come from YFCC, some acoustic domain mismatch between the train and test set can be expected. All the acoustic material present in the test set is labeled, except human error, considering the vocabulary of 80 classes used in the competition.

Are the clips in our test set the same length as the ones in the curated set?
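
A quick way to gather the length distributions (the folder paths are placeholders for wherever the data lives locally):

```python
from pathlib import Path

import librosa
import pandas as pd

def clip_lengths(folder):
    """Duration in seconds for every .wav file in a folder."""
    rows = [{"fname": p.name, "seconds": librosa.get_duration(filename=str(p))}
            for p in Path(folder).glob("*.wav")]
    return pd.DataFrame(rows)

# e.g. clip_lengths("data/test")["seconds"].describe()
#      clip_lengths("data/train_noisy")["seconds"].describe()
```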

Consider incorporating other representations of sound into our model

It sounds like other features may help improve our model's ability to distinguish between sounds.

From: https://www.kaggle.com/c/freesound-audio-tagging-2019/discussion/93337#latest-537350

Paper that describes some of this: https://arxiv.org/pdf/1905.00078.pdf

B. Audio Features: '… However, due to the physics of sound production, there are additional correlations for frequencies that are multiples of the same base frequency (harmonics). To allow a spatially local model (e.g., a CNN) to take these into account, a third dimension can be added that directly yields the magnitudes of the harmonic series [14], [15].'

Here's one such feature that might be worth exploring: https://librosa.github.io/librosa/generated/librosa.core.cqt.html
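
As a first pass at the harmonic idea, a constant-Q transform channel could look roughly like this (the hop length and number of bins are librosa defaults, not tuned):

```python
import librosa
import numpy as np

y, sr = librosa.load("example.wav", sr=44100)          # placeholder clip
cqt = np.abs(librosa.cqt(y, sr=sr, hop_length=512, n_bins=84))
cqt_db = librosa.amplitude_to_db(cqt)                  # candidate extra input channel
```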

Figure out best label smoothing parameters

I chose the initial parameters somewhat arbitrarily, so we're not sure what the best values are. (A sketch of how I'm reading Min/Max follows the table.)

| Min | Max | lwlrap |
| --- | --- | --- |
| 0.000 | 1.000 | 0.8482 |
| 0.000 | 0.950 | 0.8520 |
| 0.001 | 0.990 | 0.8513 |
| 0.001 | 0.975 | 0.8531 |
| 0.001 | 0.960 | 0.8497 |
| 0.001 | 0.950 | 0.8542 |
| 0.001 | 0.935 | 0.8497 |
| 0.010 | 0.990 | 0.8501 |
| 0.010 | 0.975 | 0.8466 |
| 0.010 | 0.950 | 0.8526 |
| 0.010 | 0.935 | 0.8499 |
| 0.010 | 0.900 | |
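
For reference, the way I'm reading the Min/Max columns: hard 0/1 targets get mapped into [min, max] before the loss. A minimal sketch of that interpretation (not necessarily exactly what's in our training code):

```python
import torch.nn.functional as F

def smoothed_bce(logits, targets, smooth_min=0.001, smooth_max=0.950):
    """Map hard 0/1 multi-hot targets into [smooth_min, smooth_max], then BCE."""
    targets = targets * (smooth_max - smooth_min) + smooth_min
    return F.binary_cross_entropy_with_logits(logits, targets)
```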
