om-ai-lab / vl-checklist


Evaluating Vision & Language Pretraining Models with Objects, Attributes and Relations.

Python 100.00%

evaluation-metrics multimodal-deep-learning vision-and-language


vl-checklist's Issues

PIC dataset is not available anymore

I tried to follow the instructions in the HAKE codebase, but the PIC dataset (and its website) is no longer available.

Difference between code and description in the paper

Hi,

Thanks for open-sourcing your code. I am trying to reproduce the ALBEF results from your paper, but without success. Going through your code, I noticed that the ITM logits/probabilities are used differently in the code than in the paper. The paper says, "If the model score on the original text description is higher than the score on the generated negative samples, we regard it as positive output." In the code, however, only the ITM logit corresponding to "matching", z[1], is used. In other words, the code never compares the scores of the positive and negative texts as described in the paper. Can you please clarify?

Thanks,
Ajinkya
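
To make the discrepancy concrete, here is a minimal sketch of the two decision rules. The function names, the 0.5 threshold, and the example scores are hypothetical and chosen only for illustration; they are not taken from the repository or the paper.

```python
def single_logit_correct(pos_match_score, threshold=0.5):
    """Rule resembling what the issue describes in the code: only the 'match'
    probability of the original caption is checked.
    The 0.5 threshold is an assumption made for this sketch."""
    return pos_match_score > threshold


def pairwise_correct(pos_match_score, neg_match_score):
    """Rule quoted from the paper: the original caption must score
    higher than the generated negative caption."""
    return pos_match_score > neg_match_score


def accuracy(flags):
    """Fraction of samples counted as correct."""
    return sum(flags) / len(flags)


# Hypothetical (positive caption score, negative caption score) pairs.
pairs = [(0.9, 0.95), (0.6, 0.2), (0.4, 0.1), (0.7, 0.7)]

single_acc = accuracy([single_logit_correct(p) for p, _ in pairs])
pairwise_acc = accuracy([pairwise_correct(p, n) for p, n in pairs])
print(single_acc, pairwise_acc)
```

On the same scores the two rules can give different accuracies, which is one possible reason reproduced numbers would differ from the reported ones.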

Running test with other models

Hi!
Thank you for publishing this great work. I was able to run the test with your ViLT model. Is it possible to run the test with other models, such as TCL and the rest? Their checkpoints are different, so it is not clear to me whether the code supports them.
Thank you and have a great week,
Amit

Reproducing CLIP score in the paper

Hi,

Thanks for open-sourcing the code.
I'm trying to reproduce the CLIP scores from the paper, but I have not been able to.
I used the sample config file, changing MODE_NAME to CLIP (ViT-L/14).
I evaluated all the datasets in the corpus and then averaged the final accuracies.
I got the following scores, which are quite different from those in the paper:

Object: 0.8205209550766983
Attribute: 0.6806109948697314
Relation: 0.67975

How can I reproduce the scores in the paper?
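
One thing worth checking when averaged numbers do not match is how the per-dataset accuracies are combined. The sketch below contrasts an unweighted (macro) average with a sample-weighted (micro) average. The dataset names and counts are made up for illustration, and nothing here is confirmed to be the averaging used in the paper.

```python
def macro_average(results):
    """Unweighted mean of per-dataset accuracies."""
    accuracies = [correct / total for correct, total in results.values()]
    return sum(accuracies) / len(accuracies)


def micro_average(results):
    """Accuracy over all samples pooled together, so larger datasets weigh more."""
    correct = sum(c for c, _ in results.values())
    total = sum(t for _, t in results.values())
    return correct / total


# Hypothetical (correct, total) counts per dataset.
results = {"dataset_a": (90, 100), "dataset_b": (300, 500)}

macro = macro_average(results)
micro = micro_average(results)
print(macro, micro)
```

When dataset sizes differ a lot, the two averages can diverge noticeably, so stating which one was used makes a comparison with published numbers easier.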

Question about the visualization results shown in Figure5 and Figure6

Thanks for this wonderful and inspiring work. I am unsure how to produce the heat-maps shown in your paper. I look forward to your reply at your convenience.

Why does the attention demo choose a language model layer to capture model attention?

In the attention.py demo, the get_attention_by_gradcam method takes both image_input and text_input. I would like to know why text_input (and the text encoder) is the one used for the attention computation. The code is shown below.

def get_attention_by_gradcam(self, model, tokenizer, image_path, image_input, text_input, attr_name, target_layer):
    encoder_name = getattr(model, attr_name, None)
    encoder_name.encoder.layer[target_layer].crossattention.self.save_attention = True
    output = model(image_input, text_input)
    loss = output[:, 1].sum()
    model.zero_grad()
    loss.backward()
    image_size = 256
    temp = int(np.sqrt(image_size))
    # The mask multiplies padding tokens by 0 so that they are excluded from cams and grads.
    # Because of the text preprocessing in ALBEF and TCL, the mask has no effect here.
    mask = text_input.attention_mask.view(text_input.attention_mask.size(0), 1, -1, 1, 1)
    grads = encoder_name.encoder.layer[target_layer].crossattention.self.get_attn_gradients()
    cams = encoder_name.encoder.layer[target_layer].crossattention.self.get_attention_map()

A related question concerns the 'albef' attention branch, where the demo passes attr_name='text_encoder'. The code is shown below.

def getAttMap(self, image_path, text):
    if self.model_name.lower() == 'albef':
        engine = ALBEF('ALBEF_4M.pth')
        model, tokenizer = engine.load_model(engine.model_id)
        image_input = engine.load_data(src_type='local', data=[image_path])[0]
        text_input = tokenizer(engine.pre_caption(text), return_tensors="pt")
        self.get_attention_by_gradcam(model, tokenizer, image_path, image_input, text_input,
                                          attr_name='text_encoder', target_layer=8)
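
The quoted get_attention_by_gradcam snippet stops after fetching grads and cams. As a rough sketch of how cross-attention maps and their gradients are commonly combined in Grad-CAM-style visualizations, the NumPy code below drops the first image token, keeps only positive gradients, multiplies them with the attention maps, and averages over heads. The function name, the array shapes (12 heads, 5 text tokens, 1 + 256 image tokens), and the random inputs are assumptions for illustration, not the repository's implementation. In this kind of setup each text token attends over image patches inside a cross-attention layer, which may be why the hook is placed on a layer reached through the text encoder.

```python
import numpy as np


def gradcam_from_cross_attention(cams, grads, num_patches=256):
    """Combine attention maps and gradients into one heat-map per text token.

    cams, grads: arrays of shape (heads, text_tokens, 1 + num_patches),
    where index 0 on the last axis is assumed to be a [CLS]-style image token.
    Returns an array of shape (text_tokens, side, side).
    """
    side = int(np.sqrt(num_patches))
    heads, tokens = cams.shape[0], cams.shape[1]
    # Drop the first image token and lay the patches out on a square grid.
    cams = cams[:, :, 1:].reshape(heads, tokens, side, side)
    # Keep only positive gradients, as is usual for Grad-CAM.
    grads = np.clip(grads[:, :, 1:], 0, None).reshape(heads, tokens, side, side)
    # Weight attention by its gradient and average over attention heads.
    return (cams * grads).mean(axis=0)


# Random stand-ins for the saved attention maps and gradients.
rng = np.random.default_rng(0)
cams = rng.random((12, 5, 257))
grads = rng.standard_normal((12, 5, 257))

heatmaps = gradcam_from_cross_attention(cams, grads)
print(heatmaps.shape)
```

Each 16 x 16 map could then be resized to the input image resolution and overlaid on the image to obtain a heat-map similar in spirit to the figures in question.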

Annotation path in Relation corpus is wrong

VL-CheckList/corpus/v1/Relation/spatial/vg.yaml

Line 1 in 9e6b5ef

ANNO_PATH: "data/Attribute/vg/spatial.json"

I think Attribute should be changed to Relation in this path.
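
A small consistency check can catch this kind of mismatch across corpus files. The sketch below assumes the layout visible in this issue, where the category directory follows "v1" in the config path and "data" in ANNO_PATH. The helper name and that layout rule are assumptions for illustration, not part of the repository.

```python
from pathlib import PurePosixPath


def category_mismatch(config_path, anno_path):
    """Return (config_category, anno_category) if they differ, else None.

    Assumes the category directory follows 'v1' in the config path
    and 'data' in the annotation path.
    """
    config_parts = PurePosixPath(config_path).parts
    anno_parts = PurePosixPath(anno_path).parts
    config_category = config_parts[config_parts.index("v1") + 1]
    anno_category = anno_parts[anno_parts.index("data") + 1]
    if config_category != anno_category:
        return (config_category, anno_category)
    return None


config_file = "VL-CheckList/corpus/v1/Relation/spatial/vg.yaml"
mismatch = category_mismatch(config_file, "data/Attribute/vg/spatial.json")
fixed = category_mismatch(config_file, "data/Relation/vg/spatial.json")
print(mismatch, fixed)
```

Running a check like this over every YAML file in the corpus directory would flag any other paths that point at the wrong category.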

Object features for Oscar model

Hi, this is great work! Since region-based methods like Oscar use pre-extracted features for evaluation, could you provide the features.tsv file or the detector used for object detection in your paper?

Many thanks!
