
Comments (14)

KaimingHe commented on July 20, 2024

I have a small confusion regarding this issue (maybe I am missing something): can the situation occur where the queue of negative examples contains samples that are similar to, or exactly the same as, the sample in l_pos? If that happens, should some labels of l_neg be one instead of zero?

Yes, it could happen. This only matters when the queue is too large. On ImageNet, the queue (65536) is ~6% of the dataset size (1.28M), so the chance of having a positive in the queue is ~6% at the first iteration of each epoch, and it reduces to 0% by the 6th percentile of iterations of each epoch. This noise is negligible, so for simplicity we don't handle it. If the queue is too large, or the dataset is too small, an extra indicator should be used on these positive targets in the queue.
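The "extra indicator" can be sketched as a mask over the queued logits, under the assumption that we also keep the dataset index of every queued key; the `batch_idx` and `queue_idx` bookkeeping below is hypothetical and not part of the released MoCo code:

```python
import torch
import torch.nn.functional as F

# Toy shapes: N queries, K queued keys, C-dim embeddings.
N, K, C = 4, 8, 16
q = F.normalize(torch.randn(N, C), dim=1)        # queries
k = F.normalize(torch.randn(N, C), dim=1)        # their positive keys
queue = F.normalize(torch.randn(C, K), dim=0)    # queued keys (negatives)

batch_idx = torch.tensor([0, 1, 2, 3])               # dataset index of each query (hypothetical)
queue_idx = torch.tensor([5, 6, 0, 7, 8, 9, 2, 10])  # dataset index of each queued key (hypothetical)

l_pos = torch.einsum('nc,nc->n', q, k).unsqueeze(-1)  # N x 1
l_neg = torch.einsum('nc,ck->nk', q, queue)           # N x K

# Indicator: a queued key with the same dataset index as the query is a
# hidden positive; -inf removes it from the softmax instead of treating it
# as a negative with label 0.
hidden_pos = queue_idx[None, :] == batch_idx[:, None]  # N x K boolean mask
l_neg = l_neg.masked_fill(hidden_pos, float('-inf'))

logits = torch.cat([l_pos, l_neg], dim=1) / 0.07      # temperature as in the paper
labels = torch.zeros(N, dtype=torch.long)
loss = F.cross_entropy(logits, labels)
```

A masked entry contributes zero probability mass to the softmax, so the hidden positive neither helps nor hurts the query.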

from moco.

osdf commented on July 20, 2024

labels is the ground-truth 'index' into the (1+len(queue))-wide logits tensor. As shown in the code snippet in KaimingHe's answer, this is always index 0 (l_pos is the first column of the resulting logits tensor). logits is later fed into the CrossEntropy criterion, i.e. the contrasting happens through the entanglement of the logit scores by the softmax function.
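A toy illustration of that point (the numbers below are invented): with a (1+K)-wide logits row and target 0, the loss is small only when l_pos dominates every queued score:

```python
import torch
import torch.nn.functional as F

# Two fake rows of logits = [l_pos, l_neg_1, l_neg_2]; targets are always 0.
logits = torch.tensor([[5.0, 1.0, 0.5],    # positive score dominates
                       [0.1, 4.0, 0.2]])   # a negative score dominates
labels = torch.zeros(2, dtype=torch.long)  # ground-truth "index" is column 0
loss = F.cross_entropy(logits, labels, reduction='none')
# The first row gets a small loss, the second a large one: the shared softmax
# denominator ties every score together, which is the "entanglement" above.
```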


KeremTurgutlu commented on July 20, 2024

I have a follow-up question regarding negatives. Is there an intuition behind not using encoder negatives along with the positives (index 0)? For example, calculating cross-entropy on an N x (N+K) logits matrix instead of N x (1+K), where the labels are torch.arange(N). Also, wouldn't this mitigate the BN-signature issue of easily identifying the positives from the same batch, because now there will also be N-1 negatives with the same batch-norm stats?

Edit: I implemented this approach, trained without Shuffle BN on 1 GPU, and it didn't overfit or have any performance issues on a downstream task.

Here is a link to pretraining and downstream training. With MoCo and this approach, without Shuffle BN, on 1 GPU, performance improves from 18.2% to 71.2% (random init vs. fine-tuning on MoCo weights).
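For readers following along, the variant described above might look roughly like this. This is a sketch of one reading of the proposal, not the official MoCo loss; shapes and temperature are illustrative:

```python
import torch
import torch.nn.functional as F

N, K, C, T = 4, 16, 32, 0.07
q = F.normalize(torch.randn(N, C), dim=1)       # query-encoder outputs
k = F.normalize(torch.randn(N, C), dim=1)       # momentum-encoder keys
queue = F.normalize(torch.randn(C, K), dim=0)   # queued keys

l_batch = q @ k.t()      # N x N: the diagonal holds the positives,
                         # off-diagonal entries are in-batch negatives
l_queue = q @ queue      # N x K: queued negatives
logits = torch.cat([l_batch, l_queue], dim=1) / T   # N x (N+K)
labels = torch.arange(N)                            # positive for row i is column i
loss = F.cross_entropy(logits, labels)
```

Compared with the N x (1+K) formulation, each query here additionally contrasts against the other N-1 keys of its own batch.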


mmiakashs commented on July 20, 2024

I have a small confusion regarding this issue (maybe I am missing something): can the situation occur where the queue of negative examples contains samples that are similar to, or exactly the same as, the sample in l_pos? If that happens, should some labels of l_neg be one instead of zero?


VaticanCameos99 commented on July 20, 2024

labels is the ground-truth 'index' into the (1+len(queue))-wide logits tensor. As shown in the code snippet in KaimingHe's answer, this is always index 0 (l_pos is the first column of the resulting logits tensor). logits is later fed into the CrossEntropy criterion, i.e. the contrasting happens through the entanglement of the logit scores by the softmax function.

Hi, I'm still having trouble understanding this. Given that the labels denote the index at which we have a positive pair, why do we still use the cross-entropy loss as the contrastive-learning loss? Can someone please explain exactly how it resolves to a softmax over the logits? More specifically, can someone unpack "the contrasting happens through the entanglement of the logit scores by the softmax function"?
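One way to see the equivalence being asked about, as a quick numeric check (the logit values are invented): cross-entropy with target 0 is exactly -log of the softmax probability of column 0, which is the InfoNCE contrastive objective written out by hand:

```python
import torch
import torch.nn.functional as F

# One query's logits: [l_pos, l_neg_1, l_neg_2, l_neg_3].
logits = torch.tensor([[2.0, 0.3, -0.5, 1.1]])
ce = F.cross_entropy(logits, torch.tensor([0]))

# InfoNCE by hand: -log( e^{l_pos} / sum_j e^{l_j} ).
info_nce = -torch.log(torch.exp(logits[0, 0]) / torch.exp(logits[0]).sum())

# The two values coincide. Because every logit sits in the shared softmax
# denominator, raising l_pos or lowering any l_neg reduces the loss; that
# coupling is the "entanglement" doing the contrasting.
```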


KaimingHe commented on July 20, 2024

Positive samples are in the zeroth column:

logits = torch.cat([l_pos, l_neg], dim=1)
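For context, the surrounding computation in the MoCo pseudocode goes roughly as follows (shapes shrunk for illustration): l_pos is N x 1 and l_neg is N x K, so the concatenation puts every positive in column 0.

```python
import torch
import torch.nn.functional as F

N, K, C = 8, 256, 128   # batch, queue length (65536 in the paper), feature dim
q = F.normalize(torch.randn(N, C), dim=1)      # encoder output for queries
k = F.normalize(torch.randn(N, C), dim=1)      # momentum-encoder output for keys
queue = F.normalize(torch.randn(C, K), dim=0)  # dictionary queue

l_pos = torch.einsum('nc,nc->n', q, k).unsqueeze(-1)  # N x 1
l_neg = torch.einsum('nc,ck->nk', q, queue)           # N x K
logits = torch.cat([l_pos, l_neg], dim=1)             # N x (1+K)
labels = torch.zeros(N, dtype=torch.long)             # hence the all-zero targets
```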


Oktai15 commented on July 20, 2024

@KaimingHe, but logits also contains negative samples; why are the targets all zeros?


Oktai15 commented on July 20, 2024

@osdf oh, got it, thank you!


mmiakashs commented on July 20, 2024

Sounds good, thanks for the clear explanation. And a big thanks for the concise implementation of MoCo; I learned a lot 😃


qishibo commented on July 20, 2024

I have a follow-up question regarding negatives. Is there an intuition behind not using encoder negatives along with the positives (index 0)? For example, calculating cross-entropy on an N x (N+K) logits matrix instead of N x (1+K), where the labels are torch.arange(N). Also, wouldn't this mitigate the BN-signature issue of easily identifying the positives from the same batch, because now there will also be N-1 negatives with the same batch-norm stats?

Edit: I implemented this approach, trained without Shuffle BN on 1 GPU, and it didn't overfit or have any performance issues on a downstream task.

Here is a link to pretraining and downstream training. With MoCo and this approach, without Shuffle BN, on 1 GPU, performance improves from 18.2% to 71.2% (random init vs. fine-tuning on MoCo weights).

@KeremTurgutlu
So you mean each row's target is [1, 0*(N-1), 0*K] instead of [1, 0*K]?


skaudrey commented on July 20, 2024

I have a small confusion regarding this issue (maybe I am missing something): can the situation occur where the queue of negative examples contains samples that are similar to, or exactly the same as, the sample in l_pos? If that happens, should some labels of l_neg be one instead of zero?

Yes, it could happen. This only matters when the queue is too large. On ImageNet, the queue (65536) is ~6% of the dataset size (1.28M), so the chance of having a positive in the queue is ~6% at the first iteration of each epoch, and it reduces to 0% by the 6th percentile of iterations of each epoch. This noise is negligible, so for simplicity we don't handle it. If the queue is too large, or the dataset is too small, an extra indicator should be used on these positive targets in the queue.

@KaimingHe

I have a follow-up question about K. I am confused about how to measure the chance of meeting a positive in the queue for video data. Is it based on the number of videos or the number of video frames?

If I have 40K videos, and each time I sample a clip (say 32 frames) as the input sample, and I want to classify videos: with K = 65536, the chance of having a positive in the queue is 65536/40000 = 1.64, which is already larger than 100%. In this case, can I simply set a smaller K, like 2400, to keep the chance of meeting a positive in the queue small?

However, VideoMoCo, a model for video classification, uses 65536 for K. Is this because the chance is measured on frames rather than on the number of videos? E.g., if each video has 300 frames, the chance is 65536/(40000*300) = 0.55%.

Which measure is more reasonable?


clarkkent0618 commented on July 20, 2024

I have a small confusion regarding this issue (maybe I am missing something): can the situation occur where the queue of negative examples contains samples that are similar to, or exactly the same as, the sample in l_pos? If that happens, should some labels of l_neg be one instead of zero?

Yes, it could happen. This only matters when the queue is too large. On ImageNet, the queue (65536) is ~6% of the dataset size (1.28M), so the chance of having a positive in the queue is ~6% at the first iteration of each epoch, and it reduces to 0% by the 6th percentile of iterations of each epoch. This noise is negligible, so for simplicity we don't handle it. If the queue is too large, or the dataset is too small, an extra indicator should be used on these positive targets in the queue.

@KaimingHe

I have a follow-up question about K. I am confused about how to measure the chance of meeting a positive in the queue for video data. Is it based on the number of videos or the number of video frames?

If I have 40K videos, and each time I sample a clip (say 32 frames) as the input sample, and I want to classify videos: with K = 65536, the chance of having a positive in the queue is 65536/40000 = 1.64, which is already larger than 100%. In this case, can I simply set a smaller K, like 2400, to keep the chance of meeting a positive in the queue small?

However, VideoMoCo, a model for video classification, uses 65536 for K. Is this because the chance is measured on frames rather than on the number of videos? E.g., if each video has 300 frames, the chance is 65536/(40000*300) = 0.55%.

Which measure is more reasonable?

Actually I cannot understand KaimingHe's answer. Do you know what "the chance of having a positive in the queue is ~6% at the first iteration of each epoch, and it reduces to 0% by the 6th percentile of iterations of each epoch" means, and how it is derived? Thank you.
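One way to unpack the quoted sentence, assuming the queue simply holds the K most recently enqueued keys (a sketch of the arithmetic, not an authoritative answer):

```python
# At the start of an epoch the queue still holds K keys from the previous
# epoch; any one of them could be a view of the current query image.
K, dataset = 65536, 1_280_000
start_chance = K / dataset   # ~= 0.051, i.e. the quoted "~6%" figure

# Each iteration enqueues fresh keys, so after K samples of the new epoch
# (i.e. K/dataset of its iterations, the quoted "6th percentile") the queue
# holds only this epoch's earlier samples. With sampling without replacement,
# those cannot include the not-yet-seen current query, so the chance is 0%.
flush_fraction = K / dataset
```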



clarkkent0618 commented on July 20, 2024


@skaudrey

