Comments (5)
Did you use the image-tag recognition decoder on the tagging task to obtain the Grad-CAM? Figure 7 of Tag2Text is obtained from the backward gradients of the image-tag interaction encoder on the generation task.
I have also found that the image-tag recognition decoder's Grad-CAM is often a meaningless scatter plot, even when it predicts high logits. Normally, with good recognition performance, its Grad-CAM should be very accurate. I haven't found the reason yet.
It seems that I am indeed computing Grad-CAM on the recognition task, because your code does not enable the generation task for RAM. I have added the generation task to RAM in the same way as in Tag2Text, and I will test it. Thank you.
Also, I share your opinion that "with good recognition performance, its grad-cam should be very accurate". But I encountered a similar situation in pytorch-grad-cam/issues/84 and in some other discussions about Grad-CAM on Swin-Transformer-based models, though unfortunately those discussions were fruitless. I suspect the patch-merging operation in Swin Transformer makes the features lose their traditional spatial structure (see the sketch below), but that cannot explain why Grad-CAM is sometimes accurate, so I am still confused. Thanks for your reply; I'll try it again. Thank you again for your excellent work and kind reply.
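For reference, the workaround I found in the pytorch-grad-cam examples is to fold the Swin token sequence back into a 2-D map before the CAM weighting, since Swin blocks emit tokens of shape (batch, H*W, channels) rather than spatial maps. A minimal sketch, assuming the last stage yields a 7x7 token grid (true for a 224-px input in common Swin configurations, but I have not verified the RAM setup):

def reshape_transform(tensor, height=7, width=7):
    # (batch, H*W, C) -> (batch, C, H, W): restore the 2-D layout Grad-CAM expects
    result = tensor.reshape(tensor.size(0), height, width, tensor.size(2))
    return result.permute(0, 3, 1, 2).contiguous()

pytorch-grad-cam accepts exactly this kind of function via the reshape_transform argument of its CAM constructors, which is how its ViT and Swin examples handle the same problem.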
Thank you for your interest and your kind words. You are welcome to provide feedback if you have more issues.
I think I have some problem performing the backward pass to compute gradients from the image-tag interaction encoder. My approach is to pre-define a hook and then register it at the location I need. The general idea is as follows:

def backward_hook(module, grad_input, grad_output):
    # cache the gradient flowing out of the visual encoder for later use
    global gradients
    print('Backward hook running...')
    gradients = grad_output
    print(f'Gradients size: {gradients[0].size()}')

# keep the handle under a different name so it does not shadow the hook function itself
hook_handle = model.visual_encoder.register_full_backward_hook(backward_hook)
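For completeness, I also register a forward hook on the same layer, since Grad-CAM needs the forward activations as well as the gradients (the variable names here are mine, not from the repo):

def forward_hook(module, inputs, output):
    # cache the layer's activations so they can later be weighted by the gradients
    global activations
    activations = output

forward_handle = model.visual_encoder.register_forward_hook(forward_hook)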
Then I call backward() on the logit of the category whose gradient I need. For example, with the earlier recognition decoder I can do the following for any class:

logits[0, 252-1].backward()  # 252 is the line number of the word "cat" in ram_tag_list
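With both hooks in place, I assemble the heatmap as the ReLU of the activations weighted by the channel-wise mean of the gradients, roughly like this (assuming both cached tensors are already in (batch, channels, H, W) layout; token-sequence outputs would first need the reshape above):

import torch.nn.functional as F

grad = gradients[0]                            # cached by backward_hook
act = activations                              # cached by forward_hook
weights = grad.mean(dim=(2, 3), keepdim=True)  # global-average-pool the gradients per channel
cam = F.relu((weights * act).sum(dim=1))       # channel-weighted sum, negatives clipped
cam = cam / (cam.max() + 1e-8)                 # normalize to [0, 1] for the overlay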
But now the interaction encoder's output is not a scalar; it has shape (#beams, max_length, #features), e.g. (3, 40, 768).
You mentioned that the gradients for figure 7 are obtained from the interaction encoder. Does that mean I should call backward() on this output to compute the gradients, or is some other operation needed?
In addition, I also tried to call .backward() on the output of the text generation decoder, but since self.text_decoder is an instance of the official transformers library, its generate method runs under no_grad, so I cannot call backward() on its output. I hope you can give me some ideas; I want to reproduce results similar to figure 7.
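The only workaround I can think of is to first let generate() produce the caption, then re-run the decoder's ordinary forward pass on that sequence (teacher forcing) with gradients enabled, and backprop from the logit of a single generated token. Roughly like this, where the argument names are my guesses at the Tag2Text/RAM internals rather than their confirmed API:

# 1) generate the caption as usual (generate() itself runs under no_grad)
caption_ids = model.text_decoder.generate(input_ids=prompt_ids,
                                          encoder_hidden_states=fused_embeds,
                                          max_length=40)

# 2) re-run a plain forward pass on the generated sequence, this time with gradients
outputs = model.text_decoder(input_ids=caption_ids,
                             encoder_hidden_states=fused_embeds,
                             return_dict=True)

# 3) backprop from the logit of one generated token (position t, vocabulary id v)
t = 5                                    # hypothetical: the 6th token of sequence 0
v = caption_ids[0, t].item()
outputs.logits[0, t - 1, v].backward()   # the logits at t-1 predict the token at t

Does something along these lines match how the figure 7 gradients were obtained?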
Hi @SKBL5694, have you resolved it?
Related Issues (20)
- NameError: name '_C' is not defined
- VisionTransformer undefined in ram.models.utils.py
- HuggingFace App is not working
- Uncertain output results
- 【Bug】BertLayer should be used as a decoder model if cross attention is added
- finetuning on specific tag list
- How can I obtain the file ram_plus_swin_large_14m.pth?
- how to form a ram_plus_tag_embedding_class_4585_des_51.pth for my own data.
- Unable to proceed with command 'pip install -e .'
- Can't load tokenizer for 'bert-base-uncased'
- tag_encoder and text_decoder
- pip install error
- Normalize image features while calculating the L1 loss
- i think it is the best to call it MAM(match-anything-model)
- CUDA out of memory error
- Pip Install Error
- Checkpoints for smaller versions of Swin
- Relax transformers dependency
- Question about fine-tuning the Tag2Text model
- retrieval code of Tag2text