Comments (5)
Did you use the image-tag recognition decoder on the tagging task to obtain the Grad-CAM? Figure 7 of Tag2Text is obtained from the backward gradients of the image-tag interaction encoder on the generation task.
I have also found that the image-tag recognition decoder's Grad-CAM is often a meaningless scatter plot, even when it predicts high logits. Normally, with good recognition performance, its Grad-CAM should be very accurate. I haven't found the reason yet.
It seems that I am indeed computing Grad-CAM on the recognition task, because your code does not enable the generation task for RAM. I have added the generation task to RAM in the same way as in Tag2Text, and I will test it. Thank you.
Also, I share your opinion that "with good recognition performance, its grad-cam should be very accurate". But I encountered a similar situation in pytorch-grad-cam/issues/84 and in some other discussions about Grad-CAM on Swin-Transformer-based models, though unfortunately those discussions were fruitless. I suspect the patch-merging operation in Swin Transformer makes the features lose their traditional spatial structure (see the sketch below), but that cannot explain why Grad-CAM is sometimes accurate, so I am still confused. Thanks for your reply; I'll try it again. Thank you again for your excellent work and kind reply.
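For reference, the workaround I found in the pytorch-grad-cam examples is to fold the Swin token sequence back into a 2-D map before the CAM weighting, since Swin blocks emit tokens of shape (batch, H*W, channels) rather than spatial maps. A minimal sketch, assuming the last stage yields a 7x7 token grid (true for a 224-px input in common Swin configurations, but I have not verified the RAM setup):

def reshape_transform(tensor, height=7, width=7):
    # (batch, H*W, C) -> (batch, C, H, W): restore the 2-D layout Grad-CAM expects
    result = tensor.reshape(tensor.size(0), height, width, tensor.size(2))
    return result.permute(0, 3, 1, 2).contiguous()

pytorch-grad-cam accepts exactly this kind of function via the reshape_transform argument of its CAM constructors, which is how its ViT and Swin examples handle the same problem.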
Thank you for your interest and your kind words. You are welcome to provide feedback if you have more issues.
I think I have some problem performing the backward pass to compute gradients from the image-tag interaction encoder. My approach is to pre-define a hook and then register it at the location I need. The general idea is as follows:

def backward_hook(module, grad_input, grad_output):
    # cache the gradient flowing out of the visual encoder for later use
    global gradients
    print('Backward hook running...')
    gradients = grad_output
    print(f'Gradients size: {gradients[0].size()}')

# keep the handle under a different name so it does not shadow the hook function itself
hook_handle = model.visual_encoder.register_full_backward_hook(backward_hook)
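For completeness, I also register a forward hook on the same layer, since Grad-CAM needs the forward activations as well as the gradients (the variable names here are mine, not from the repo):

def forward_hook(module, inputs, output):
    # cache the layer's activations so they can later be weighted by the gradients
    global activations
    activations = output

forward_handle = model.visual_encoder.register_forward_hook(forward_hook)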
Then I call backward() on the logit of the category whose gradient I need. For example, with the earlier recognition decoder I can do the following for any class:

logits[0, 252-1].backward()  # 252 is the line number of the word "cat" in ram_tag_list
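With both hooks in place, I assemble the heatmap as the ReLU of the activations weighted by the channel-wise mean of the gradients, roughly like this (assuming both cached tensors are already in (batch, channels, H, W) layout; token-sequence outputs would first need the reshape above):

import torch.nn.functional as F

grad = gradients[0]                            # cached by backward_hook
act = activations                              # cached by forward_hook
weights = grad.mean(dim=(2, 3), keepdim=True)  # global-average-pool the gradients per channel
cam = F.relu((weights * act).sum(dim=1))       # channel-weighted sum, negatives clipped
cam = cam / (cam.max() + 1e-8)                 # normalize to [0, 1] for the overlay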
But now the interaction encoder's output is not a scalar; it has shape (#beams, max_length, #features), e.g. (3, 40, 768).
You mentioned that the gradients for figure 7 are obtained from the interaction encoder. Does that mean I should call backward() on this output to compute the gradients, or is some other operation needed?
In addition, I also tried to call .backward() on the output of the text generation decoder, but since self.text_decoder is an instance of the official transformers library, its generate method runs under no_grad, so I cannot call backward() on its output. I hope you can give me some ideas; I want to reproduce results similar to figure 7.
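The only workaround I can think of is to first let generate() produce the caption, then re-run the decoder's ordinary forward pass on that sequence (teacher forcing) with gradients enabled, and backprop from the logit of a single generated token. Roughly like this, where the argument names are my guesses at the Tag2Text/RAM internals rather than their confirmed API:

# 1) generate the caption as usual (generate() itself runs under no_grad)
caption_ids = model.text_decoder.generate(input_ids=prompt_ids,
                                          encoder_hidden_states=fused_embeds,
                                          max_length=40)

# 2) re-run a plain forward pass on the generated sequence, this time with gradients
outputs = model.text_decoder(input_ids=caption_ids,
                             encoder_hidden_states=fused_embeds,
                             return_dict=True)

# 3) backprop from the logit of one generated token (position t, vocabulary id v)
t = 5                                    # hypothetical: the 6th token of sequence 0
v = caption_ids[0, t].item()
outputs.logits[0, t - 1, v].backward()   # the logits at t-1 predict the token at t

Does something along these lines match how the figure 7 gradients were obtained?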
Hi @SKBL5694, have you resolved it?
Related Issues (20)
- NameError: name '_C' is not defined
- VisionTransformer undefined in ram.models.utils.py
- HuggingFace App is not working
- Uncertain output results
- 【Bug】BertLayer should be used as a decoder model if cross attention is added
- finetuning on specific tag list
- How can I obtain the file ram_plus_swin_large_14m.pth?
- how to form a ram_plus_tag_embedding_class_4585_des_51.pth for my own data.
- Unable to proceed with command 'pip install -e .'
- Can't load tokenizer for 'bert-base-uncased'
- tag_encoder and text_decoder
- pip install error
- Normalize image features while calculating the L1 loss
- i think it is the best to call it MAM(match-anything-model)
- CUDA out of memory error
- Pip Install Error
- Checkpoints for smaller versions of Swin
- Relax transformers dependency
- Question about fine-tuning the Tag2Text model
- retrieval code of Tag2text