joshvarty commented on August 11, 2024

Try regenerating dataset with different audio parameters

from audiotagging.

Comments (4)

JoshVarty commented on August 11, 2024

Parameters to try:

from easydict import EasyDict

conf = EasyDict()

conf.sampling_rate = 44100            # Highest quality
conf.duration = 3                     # Seconds per clip; about double the length of time we look at
conf.hop_length = 500                 # We're looking at about 1.4 seconds
conf.fmin = 20                        # Near the lowest a human can hear
conf.fmax = conf.sampling_rate // 2   # The maximum frequency we can represent (Nyquist)
conf.n_mels = 128                     # Our crops are 128 in height
conf.n_fft = conf.n_mels * 20         # FFT window of 2560 samples (about 58 ms at 44.1 kHz)

conf.samples = conf.sampling_rate * conf.duration   # 132300 samples per clip
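To sanity-check the comments in the config above, here is a minimal sketch of the arithmetic. It makes two assumptions that are not stated in the thread: that crops are 128 frames wide (only the 128 height is given), and that frames are counted with the centered-STFT convention of `1 + samples // hop_length`.

```python
# Sketch only: plain arithmetic on the config values from the comment above.
SAMPLING_RATE = 44100
DURATION = 3
HOP_LENGTH = 500
N_MELS = 128
N_FFT = N_MELS * 20
CROP_FRAMES = 128  # ASSUMPTION: 128-frame-wide crops; the thread only gives the height


def frames_for(samples: int, hop_length: int) -> int:
    """Frame count under the centered-STFT convention (1 + samples // hop)."""
    return 1 + samples // hop_length


samples = SAMPLING_RATE * DURATION                     # samples in one clip
clip_frames = frames_for(samples, HOP_LENGTH)          # spectrogram width per clip
crop_seconds = CROP_FRAMES * HOP_LENGTH / SAMPLING_RATE  # audio covered by one crop
window_ms = 1000 * N_FFT / SAMPLING_RATE               # length of one FFT window

print(f"samples per clip : {samples}")
print(f"frames per clip  : {clip_frames}")
print(f"seconds per crop : {crop_seconds:.2f}")
print(f"FFT window (ms)  : {window_ms:.1f}")
```

Under those assumptions a crop covers roughly 1.45 seconds of a 3-second clip, which matches the "about 1.4 seconds" and "about double" comments in the config.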

JoshVarty commented on August 11, 2024

This didn't lead to any noticeable improvement :'(

JoshVarty commented on August 11, 2024

From: Section C of https://arxiv.org/pdf/1905.00078.pdf

"The receptive field (the number of samples or spectra involved in computing a prediction) of a CNN is fixed by its architecture. It can be increased by using larger kernels or stacking more layers. Especially for raw waveform inputs with a high sample rate, reaching a sufficient receptive field size may result in a large number of parameters of the CNN and high computational complexity."

While we're not using raw waveforms, I've increased the sampling rate from 32 kHz to 44.1 kHz. Perhaps we need a larger receptive field for our convolutions? We could try stacking another 3x3 convolution at the beginning of the network.

They also suggest dilated convolutions, something I've never used before. Basically you take your 3x3 conv filter and stick zeros between the values. With one zero between neighbouring values (dilation 2), the 3x3 filter covers a 5x5 area. The zeros add no new weights, so the filter still has nine parameters, but its receptive field now covers a larger area.
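To compare the two ideas (stacking another 3x3 conv versus dilating one), here is a rough sketch of the usual receptive-field recurrence along a single axis. The helper name `receptive_field` and the `(kernel_size, stride, dilation)` layer tuples are made up for illustration and do not describe our actual network.

```python
from typing import List, Tuple

Layer = Tuple[int, int, int]  # (kernel_size, stride, dilation) along one axis


def receptive_field(layers: List[Layer]) -> int:
    """Receptive field, in input positions, of a stack of conv layers."""
    rf = 1    # one output position starts out seeing one input position
    jump = 1  # spacing, in input positions, between adjacent outputs so far
    for kernel_size, stride, dilation in layers:
        rf += (kernel_size - 1) * dilation * jump
        jump *= stride
    return rf


single = receptive_field([(3, 1, 1)])               # one 3x3 conv
stacked = receptive_field([(3, 1, 1), (3, 1, 1)])   # two stacked 3x3 convs
dilated = receptive_field([(3, 1, 2)])              # one 3x3 conv with dilation 2

print(single, stacked, dilated)
```

Both options reach a receptive field of 5 along each axis. Stacking adds a second set of weights, while dilation keeps the original nine weights but skips the positions in between.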

It looks like it's built into PyTorch via the dilation parameter of Conv2d: https://pytorch.org/docs/stable/nn.html#conv2d
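As a quick check of the "zeros between the values" picture, the sketch below runs a 3x3 filter with `dilation=2` and compares it with the same weights written out as a 5x5 kernel with zeros in the gaps. The input is random data standing in for a spectrogram crop, and the 16x16 size is arbitrary.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(1, 1, 16, 16)  # (batch, channels, height, width); random stand-in input
w3 = torch.randn(1, 1, 3, 3)   # a single 3x3 filter

# Dilated convolution: the nine taps are spaced two positions apart.
# padding=2 keeps the output the same size as the input.
y_dilated = F.conv2d(x, w3, dilation=2, padding=2)

# The same filter written as a 5x5 kernel with zeros between the taps.
w5 = torch.zeros(1, 1, 5, 5)
w5[:, :, ::2, ::2] = w3
y_sparse = F.conv2d(x, w5, padding=2)

same = torch.allclose(y_dilated, y_sparse, atol=1e-5)

# As a layer: dilation does not change the number of learnable parameters.
layer = torch.nn.Conv2d(1, 1, kernel_size=3, dilation=2, padding=2)
n_params = sum(p.numel() for p in layer.parameters())  # 9 weights + 1 bias

print(y_dilated.shape, same, n_params)
```

The two outputs agree up to floating-point tolerance, so a dilated 3x3 conv behaves like a 5x5 conv whose extra entries are fixed at zero.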

JoshVarty commented on August 11, 2024

I would like to try with an extra convnet.

xresnet18: 0.8409990
xresnet18: 0.8438536
xresnet18: 0.8428597
Avg: 0.842570767

xresnet18 3 channels: 0.845929
xresnet18 3 channels: 0.844033
xresnet18 3 channels: 0.846719
Avg: 0.845560333

xresnet18 3 channels + extra conv: 0.8452421
xresnet18 3 channels + extra conv: 0.8453489
xresnet18 3 channels + extra conv: 0.8484291
Avg: 0.846340033

xresnet18 3 channels + dilated convs: 0.83400726
xresnet18 3 channels + dilated convs: 0.8337481
xresnet18 3 channels + dilated convs: 0.8354515
Avg: 0.834402287
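A small sketch for comparing the configurations: it recomputes the mean of each set of three runs and adds the run-to-run standard deviation, using only the scores listed in this comment. The dictionary layout and variable names are just for illustration.

```python
from statistics import mean, stdev

# Scores copied from the comment above; three runs per configuration.
runs = {
    "xresnet18": [0.8409990, 0.8438536, 0.8428597],
    "xresnet18 3 channels": [0.845929, 0.844033, 0.846719],
    "xresnet18 3 channels + extra conv": [0.8452421, 0.8453489, 0.8484291],
    "xresnet18 3 channels + dilated convs": [0.83400726, 0.8337481, 0.8354515],
}

# Map each configuration to (mean score, sample standard deviation).
summary = {name: (mean(scores), stdev(scores)) for name, scores in runs.items()}

for name, (avg, spread) in summary.items():
    print(f"{name:38s} mean={avg:.6f}  stdev={spread:.6f}")
```

With only three runs per configuration, a gap between two means that is no larger than the spread within each set should be read with caution.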
