
swimmiing / acl-ssl


Repository of the WACV'24 paper "Can CLIP Help Sound Source Localization?"

Languages: Python 99.57%, Shell 0.43%
Topics: audio-visual-correspondence, audio-visual-segmentation, self-supervised-learning, visual-sound-localization

acl-ssl's People

Contributors: swimmiing
Forkers: qq909244296

acl-ssl's Issues

Hyperparameter for training

Hi,
Thanks for making the code public for this interesting work.

I was trying to train the network on VGGSound. I noticed a difference between the batch size and learning rate mentioned in the paper and those in the config file here. The paper gives a learning rate of 1e-3 and a batch size of 16, whereas the config file uses 1e-4 and 8. I tried training with both settings and observed that with the values from the paper (1e-3 and 16) the training loss diverges, whereas the values from the config file seem to work well. I am attaching screenshots of both curves below for your reference.
[Screenshots: training loss curves for both settings]

Also, in practice the batch size for InfoNCE loss is usually larger (~128). Is there any specific reason you chose a small batch size (8/16)? And what could be the reason the curve diverges at the larger batch size?
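For context, a minimal sketch of a generic symmetric InfoNCE loss in PyTorch (not necessarily the exact loss used in this repo) that makes the batch-size dependence explicit: each positive pair is contrasted against the other B - 1 items in the batch, so batch size directly controls the number of negatives.

```python
import torch
import torch.nn.functional as F

def info_nce(audio_emb: torch.Tensor, visual_emb: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of paired embeddings of shape (B, D)."""
    a = F.normalize(audio_emb, dim=-1)
    v = F.normalize(visual_emb, dim=-1)
    logits = a @ v.t() / tau                      # (B, B) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)
    # Row i is a softmax over B items: 1 positive and B - 1 in-batch negatives,
    # so doubling the batch size doubles the number of negatives per sample.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```

Note that in the two settings above both the batch size and the learning rate change, so the divergence at (1e-3, 16) may simply reflect the 10x larger learning rate rather than the batch size itself.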

Thanks.

Discrepancy in conversion of output to logits

I noticed a difference in how the outputs are converted into logits for training versus evaluation. Here, the logits are obtained during training by multiplying the output by w and adding b, but during evaluation the operation is different: the output is added to b/w. Could you please clarify this? Thanks in advance.
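For reference, a minimal numeric sketch (hypothetical values; assuming w is positive and the evaluation score is only used for ranking or thresholding) of how the two forms relate: w*x + b = w*(x + b/w), i.e. the evaluation form is the training logit divided by the positive scale w.

```python
import numpy as np

# Hypothetical learned scale/shift; the repo's actual values may differ.
w, b = 10.0, -2.0
x = np.array([-0.3, 0.1, 0.7])       # raw similarity outputs

train_logits = w * x + b             # form described for training
eval_scores = x + b / w              # form described for evaluation

# Identical up to the positive factor w, so rankings are preserved and a
# threshold t on train_logits corresponds to t / w on eval_scores.
assert np.allclose(train_logits, w * eval_scores)
```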

Accuracy drop compared to the paper

Dear authors,

Thanks for open-sourcing this project. Generally, the codebase is well organized.
However, I can't reproduce the same or similar numbers as in the paper. I suspect I made some basic mistake.

Below is my "test_rst.txt" on the VGGSound dataset; the AP is worse than that reported in the paper.
ACL (vggss_test), AP50(cIoU) / AUC by threshold:

  thr    AP50(cIoU)   AUC
  0.05   30.58        36.18
  0.10   36.71        39.26
  0.15   39.58        40.90
  0.20   41.09        41.67
  0.25   41.59        41.88
  0.30   41.52        41.81
  0.35   41.67        41.56
  0.40   41.50        41.18
  0.45   40.81        40.65
  0.50   39.66        40.07
  0.55   38.68        39.42
  0.60   37.61        38.60
  0.65   36.15        37.69
  0.70   34.88        36.64
  0.75   32.98        35.32
  0.80   30.93        33.86
  0.85   27.51        31.99
  0.90   23.67        29.25
  0.95   16.29        23.33
I have attached one of the visualization outputs, and it seems reasonable.

I have a few doubts:

  1. I downloaded the VGGSound dataset and extracted the audio, which resulted in some "stereo" (2-channel) clips, so I slightly revised the code to always select the first channel (see the sketch after this list). To my understanding, this shouldn't cause an accuracy drop.
  2. I used VGGSound from Hugging Face and the extended VGGSound from the link you provided; I'm not sure whether that is correct.
  3. I used the checkpoints provided in this repo; are they expected to reproduce the same numbers?
     I didn't revise the code except for 1).
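For context on point 1, a minimal sketch of the channel handling (torchaudio is used here purely for illustration; the repo's actual audio-loading code may differ), contrasting first-channel selection with channel averaging:

```python
import torch
import torchaudio

def load_mono(path: str, use_mean: bool = False) -> torch.Tensor:
    """Load an audio clip and reduce it to a single channel."""
    waveform, sr = torchaudio.load(path)   # shape: (channels, samples)
    if waveform.shape[0] > 1:
        if use_mean:
            # Average the channels (a common default in audio pipelines).
            waveform = waveform.mean(dim=0, keepdim=True)
        else:
            # Keep only the first channel, as described in point 1.
            waveform = waveform[:1]
    return waveform
```

For typical VGGSound clips the two reductions usually give very similar spectrograms, so this change alone seems unlikely to explain a several-point AP drop.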

Do you know of any other possible causes of the accuracy drop?
Thanks for your help!
