
swimmiing / acl-ssl


Repository of the WACV'24 paper "Can CLIP Help Sound Source Localization?"

Languages: Python 99.57%, Shell 0.43%
Topics: audio-visual-correspondence, audio-visual-segmentation, self-supervised-learning, visual-sound-localization

acl-ssl's People

Contributors: swimmiing
Forkers: qq909244296

acl-ssl's Issues

Hyperparameter for training

Hi,
Thanks for making the code public for this interesting work.

I was trying to train the network on VGGSound. I noticed a difference between the batch size and learning rate mentioned in the paper and those in the config file here. The paper gives a learning rate of 1e-3 and a batch size of 16, whereas the config file uses 1e-4 and 8. I tried training with both settings and observed that with the values from the paper (1e-3 and 16) the training loss diverges, whereas the values from the config file seem to work well. I am attaching screenshots of both curves below for your reference.
[Screenshots: training loss curves for both settings]

Also, in practice the batch size for InfoNCE loss is usually larger (~128). Is there any specific reason you chose a small batch size (8/16)? And what could be the reason the curve diverges at the larger batch size?
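For context, a minimal sketch of a generic symmetric InfoNCE loss in PyTorch (not necessarily the exact loss used in this repo) that makes the batch-size dependence explicit: each positive pair is contrasted against the other B - 1 items in the batch, so batch size directly controls the number of negatives.

```python
import torch
import torch.nn.functional as F

def info_nce(audio_emb: torch.Tensor, visual_emb: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of paired embeddings of shape (B, D)."""
    a = F.normalize(audio_emb, dim=-1)
    v = F.normalize(visual_emb, dim=-1)
    logits = a @ v.t() / tau                      # (B, B) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)
    # Row i is a softmax over B items: 1 positive and B - 1 in-batch negatives,
    # so doubling the batch size doubles the number of negatives per sample.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```

Note that in the two settings above both the batch size and the learning rate change, so the divergence at (1e-3, 16) may simply reflect the 10x larger learning rate rather than the batch size itself.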

Thanks.

Discrepancy in conversion of output to logits

I noticed a difference in how the outputs are converted into logits for training versus evaluation. Here, the logits are obtained during training by multiplying the output by w and adding b, but during evaluation the operation is different: the output is added to b/w. Could you please clarify this? Thanks in advance.
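For reference, a minimal numeric sketch (hypothetical values; assuming w is positive and the evaluation score is only used for ranking or thresholding) of how the two forms relate: w*x + b = w*(x + b/w), i.e. the evaluation form is the training logit divided by the positive scale w.

```python
import numpy as np

# Hypothetical learned scale/shift; the repo's actual values may differ.
w, b = 10.0, -2.0
x = np.array([-0.3, 0.1, 0.7])       # raw similarity outputs

train_logits = w * x + b             # form described for training
eval_scores = x + b / w              # form described for evaluation

# Identical up to the positive factor w, so rankings are preserved and a
# threshold t on train_logits corresponds to t / w on eval_scores.
assert np.allclose(train_logits, w * eval_scores)
```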

Accuracy drop compared to the paper

Dear authors,

Thanks for open-sourcing this project. Generally, the codebase is well organized.
However, I can't reproduce the same or similar numbers as in the paper. I suspect I made some basic mistake.

Below is my "test_rst.txt" on the VGGSound dataset; the AP is worse than that reported in the paper.
ACL (vggss_test), AP50(cIoU) / AUC by threshold:

  thr    AP50(cIoU)   AUC
  0.05   30.58        36.18
  0.10   36.71        39.26
  0.15   39.58        40.90
  0.20   41.09        41.67
  0.25   41.59        41.88
  0.30   41.52        41.81
  0.35   41.67        41.56
  0.40   41.50        41.18
  0.45   40.81        40.65
  0.50   39.66        40.07
  0.55   38.68        39.42
  0.60   37.61        38.60
  0.65   36.15        37.69
  0.70   34.88        36.64
  0.75   32.98        35.32
  0.80   30.93        33.86
  0.85   27.51        31.99
  0.90   23.67        29.25
  0.95   16.29        23.33
I have attached one of the visualization outputs, and it seems reasonable.

I have a few doubts:

  1. I downloaded the VGGSound dataset and extracted the audio, which resulted in some "stereo" (2-channel) clips, so I slightly revised the code to always select the first channel (see the sketch after this list). To my understanding, this shouldn't cause an accuracy drop.
  2. I used VGGSound from Hugging Face and the extended VGGSound from the link you provided; I'm not sure whether that is correct.
  3. I used the checkpoints provided in this repo; are they expected to reproduce the same numbers?
     I didn't revise the code except for 1).
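For context on point 1, a minimal sketch of the channel handling (torchaudio is used here purely for illustration; the repo's actual audio-loading code may differ), contrasting first-channel selection with channel averaging:

```python
import torch
import torchaudio

def load_mono(path: str, use_mean: bool = False) -> torch.Tensor:
    """Load an audio clip and reduce it to a single channel."""
    waveform, sr = torchaudio.load(path)   # shape: (channels, samples)
    if waveform.shape[0] > 1:
        if use_mean:
            # Average the channels (a common default in audio pipelines).
            waveform = waveform.mean(dim=0, keepdim=True)
        else:
            # Keep only the first channel, as described in point 1.
            waveform = waveform[:1]
    return waveform
```

For typical VGGSound clips the two reductions usually give very similar spectrograms, so this change alone seems unlikely to explain a several-point AP drop.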

Do you know of any other possible causes of the accuracy drop?
Thanks for your help!
