
visualbert's Introduction

This repository contains code for the following two papers:

  • VisualBERT: A Simple and Performant Baseline for Vision and Language (arXiv)
  • What Does BERT with Vision Look At? (ACL 2020)

The model VisualBERT has also been integrated into several libraries, such as Hugging Face Transformers (many thanks to Gunjan Chhablani, who made it work) and Facebook MMF.
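As a quick illustration of the Transformers integration, the Hugging Face port can be loaded roughly like this (a minimal sketch; the visual features below are random placeholders that would normally come from a region detector such as a Faster R-CNN):

import torch
from transformers import BertTokenizer, VisualBertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = VisualBertModel.from_pretrained("uclanlp/visualbert-vqa-coco-pre")

inputs = tokenizer("A dog chases a ball.", return_tensors="pt")

# Placeholder visual features: 36 regions with 2048-d ROI features each.
visual_embeds = torch.randn(1, 36, 2048)
inputs.update({
    "visual_embeds": visual_embeds,
    "visual_attention_mask": torch.ones(visual_embeds.shape[:-1]),
    "visual_token_type_ids": torch.ones(visual_embeds.shape[:-1], dtype=torch.long),
})

outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (1, text_length + 36, 768)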

Thanks~

visualbert's People

Contributors

dependabot[bot], erjanmx, kaiweichang, liunian-harold-li

visualbert's Issues

COCO pre-training size

Hi! Could you share the size of pre-training data?
I saw that you extend the training set with part of the validation set.

Minimum GPU requirement

What is the minimum GPU requirement to use VisualBERT properly?
Is it possible to use VisualBERT on a machine with a 12 GB GPU?

ModuleNotFoundError: No module named 'visualbert'

(visualbert) [xxx@localhost ~]$ export PYTHONPATH=$PYTHONPATH:visualbert_vcr/visualbert/
(visualbert) [xxx@localhost ~]$ export PYTHONPATH=$PYTHONPATH:visualbert_vcr/
(visualbert) [xxx@localhost ~]$ cd visualbert_vcr/visualbert/models/
(visualbert) [xxx@localhost models]$ CUDA_VISIBLE_DEVICES=7 python train.py -folder ../trained_models -config ../configs/vcr/fine-tune-qa.json
Traceback (most recent call last):
File "train.py", line 26, in
from visualbert.utils.pytorch_misc import time_batch, save_checkpoint, clip_grad_norm,
ModuleNotFoundError: No module named 'visualbert'

I keep running into this problem. How can I fix it?

"pre-training" section in the readme

Just want to confirm: when you talk about "pre-training" in the readme (https://github.com/airsplay/lxmert#pre-training), do you mean training the entire LXMERT model from scratch?

If we just want to use a trained LXMERT model (and stick a classification or LSTM layer on the end), we can just use the pre-trained model link you provided (http://nlp.cs.unc.edu/data/model_LXRT.pth), load your model, freeze the weights, and then fine-tune on our specific task, right?
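Concretely, here is roughly what I have in mind (a generic PyTorch sketch, not specific to LXMERT's actual module layout; I'm assuming the loaded encoder behaves like a standard nn.Module that returns a pooled feature vector):

import torch
import torch.nn as nn

class FrozenEncoderClassifier(nn.Module):
    """Wrap a pretrained encoder, freeze it, and train only a task head."""

    def __init__(self, encoder: nn.Module, hidden_size: int, num_classes: int):
        super().__init__()
        self.encoder = encoder
        for p in self.encoder.parameters():
            p.requires_grad = False  # freeze the pretrained weights
        self.classifier = nn.Linear(hidden_size, num_classes)

    def forward(self, *args, **kwargs):
        # Assumption: the encoder returns a pooled (batch, hidden_size) tensor.
        with torch.no_grad():
            pooled = self.encoder(*args, **kwargs)
        return self.classifier(pooled)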

Thanks

Contact author

Thank you very much for releasing the code. I have only just started working in the multimodal field, so it's a pleasure to see this code. I have run into quite a few problems while reproducing the experiments, and my understanding is not yet deep enough. If you have time, I'd like to ask you for some advice.
My email: [email protected]

Extracting Detectron Features

Hi, thanks for releasing your code soon after your paper and for making your evaluations easy to reproduce! Could you please provide more detail on how you extracted the Detectron features? I don't see a straightforward way to extract the features with the existing code in the Detectron repository. Thanks!

How to evaluate on VQA?

Hi, I have fine-tuned the VisualBERT model on VQA following the instructions in the readme, but I have no idea how to make predictions on the official VQA v2.0 test-dev set and compute an accuracy score comparable to the performance mentioned in your paper (70.80). Thank you very much!

Visual Features Computation

Hi, I'm really interested in your work! I'd like to ask how exactly the visual features are computed. It would be really useful if you could point me directly to the relevant snippet of code.
Moreover, could you kindly clarify the preprocessing carried out for Flickr30k's visual features? In the code, you concatenate additional "spatial features" to the original ones, resulting in an embedding of size 2054 instead of 2048. I couldn't find any explanation of this in the paper.
Thank you!
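For concreteness, this is roughly how I picture the concatenation (my own guess at the layout of the six extra dimensions, e.g. normalized corners plus width and height; not taken from your code):

import torch

def append_spatial_features(roi_features, boxes, image_w, image_h):
    """Concatenate simple box-geometry features to 2048-d region features.

    roi_features: (num_boxes, 2048) pooled region features
    boxes:        (num_boxes, 4) absolute (x1, y1, x2, y2) coordinates
    Returns a (num_boxes, 2054) tensor.
    """
    x1, y1, x2, y2 = boxes.unbind(-1)
    spatial = torch.stack([
        x1 / image_w, y1 / image_h,   # normalized top-left corner
        x2 / image_w, y2 / image_h,   # normalized bottom-right corner
        (x2 - x1) / image_w,          # normalized width
        (y2 - y1) / image_h,          # normalized height
    ], dim=-1)
    return torch.cat([roi_features, spatial], dim=-1)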

Sentence-image matching

Hi, I'm very interested in your work. I'd like to ask how the sentence-image matching prediction of the pre-training task is carried out. Is it in 'TrainVisualBERTObjective'? I don't quite understand. I'm looking forward to your reply. Thank you.

config_vcr is nowhere to be found

I am running the following command:

python train.py --folder log --config ../configs/vcr/fine-tune-qa.json
Traceback (most recent call last):
  File "train.py", line 26, in <module>
    from visualbert.dataloaders.vcr import VCR, VCRLoader
  File "/auto/nlg-05/chengham/third-party/visualbert/dataloaders/vcr.py", line 20, in <module>
    from dataloaders.box_utils import load_image, resize_image, to_tensor_and_normalize
  File "/auto/nlg-05/chengham/third-party/visualbert/dataloaders/box_utils.py", line 8, in <module>
    from config_vcr import USE_IMAGENET_PRETRAINED
ModuleNotFoundError: No module named 'config_vcr'

I searched the whole repo and cannot find the file.

About Chinese

I am very happy to see this resource, and I want to know whether it performs well on Chinese datasets.

seq_relationship_score logits order

I'm testing this model on the image-sentence-alignment task and I'm observing weird results.

By running the pretrained COCO model in eval mode on COCO17, I get results below random chance (using basically the same setting used for pre-training).

The 'seq_relationship_score' output returns two logits, and according to what is reported in the documentation:

  • index 0 is "next sentence is the continuation"
  • index 1 is "next sentence is random"

Following the doc, as I said, I get results that would make much more sense if the meaning of the logits was flipped.

Moreover, that part of the code seems to have been borrowed from the transformers library, and recently a similar issue has been found in another BERT-based model: huggingface/transformers#9212

We are conducting experiments with your model, and it would be convenient for us to simply ignore the documentation and report the results flipped.

It would be great if you could clarify this point!

Thank you in advance!
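For reference, this is essentially how we probe the two logits through the Hugging Face port (a minimal sketch with random placeholder visual features; whether index 0 or index 1 means "matched" is exactly the point in question, so the interpretation of the printed probabilities is what needs clarifying):

import torch
from transformers import BertTokenizer, VisualBertForPreTraining

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = VisualBertForPreTraining.from_pretrained("uclanlp/visualbert-vqa-coco-pre")
model.eval()

inputs = tokenizer("A man is riding a horse.", return_tensors="pt")

# Placeholder visual features; in our experiments these are real COCO region features.
visual_embeds = torch.randn(1, 36, 2048)
inputs.update({
    "visual_embeds": visual_embeds,
    "visual_attention_mask": torch.ones(visual_embeds.shape[:-1]),
    "visual_token_type_ids": torch.ones(visual_embeds.shape[:-1], dtype=torch.long),
})

with torch.no_grad():
    outputs = model(**inputs)

# Two logits per example: per the documentation, index 0 = "caption matches the image",
# index 1 = "caption is random" -- the ordering this issue asks about.
print(outputs.seq_relationship_logits.softmax(dim=-1))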

About MCAN

[screenshot of the results table]
Hello, could you please tell me where I can find the papers for the entries (MCAN + VG + Multiple Detectors + BERT) in the last few rows of the table? And is their code open source?

checkpoints for flickr30k?

Hi,

Thank you for your nice work! But I didn't see any checkpoints for the Flickr30k experiments. Could you provide links to them if possible?

Mask Probability for Task-Specific Pre-training

Hi, in your paper you mention that task-specific pre-training also uses masked language modelling, similar to task-agnostic pre-training. However, I cannot find any mask probability in the task-specific pre-training .json files. Why is no probability specified, and did you use the same 15% probability as for task-agnostic pre-training?

Sorry, perhaps I'm missing something here -- thanks for the help!

The ROIAlign module used in the VCR experiments crashes during the forward pass

Hi, I am running the experiments on VCR following the instructions in the readme. I have installed the customized torchvision and detectron modules. However, when the COCO pre-training process begins, the program is terminated by a segmentation fault when tensors are passed through the ROIAlign module. Could you help me figure out this issue? Thank you very much!

VisualBERT VQA model gives lower validation accuracy (around 40%) with the Hugging Face framework

import torch
from torch.utils.data import DataLoader
from tqdm import tqdm

# Note: `config` (providing id2label/label2id), `questions`, `annotations`,
# `id_to_filename`, and `device` are defined elsewhere in my script.

class VQADataset(torch.utils.data.Dataset):
    """VQA (v2) dataset."""

    def __init__(self, questions, annotations, tokenizer, image_preprocess, frcnn, frcnn_cfg):
        self.questions = questions
        self.annotations = annotations
        self.tokenizer = tokenizer
        self.image_preprocess = image_preprocess
        self.frcnn = frcnn
        self.frcnn_cfg = frcnn_cfg

    def __len__(self):
        return len(self.annotations)

    def __getitem__(self, idx):
        # answer
        annotation = self.annotations[idx]
        # question
        questions = self.questions[idx]
        image_path = id_to_filename[annotation["image_id"]]
        image_path = image_path.replace("./multimodal_data/vqa2/val2014/.", "", 1)
        text = questions["question"]

        inputs = self.tokenizer(
            text,
            padding="max_length",
            max_length=25,
            truncation=True,
            return_token_type_ids=True,
            return_attention_mask=True,
            add_special_tokens=True,
            return_tensors="pt")

        images, sizes, scales_yx = self.image_preprocess(image_path)
        output_dict = self.frcnn(
            images,
            sizes,
            scales_yx=scales_yx,
            padding="max_detections",
            max_detections=self.frcnn_cfg.max_detections,
            return_tensors="pt")

        # Very important that the boxes are normalized
        feature = output_dict.get("roi_features")
        normalized_boxes = output_dict.get("normalized_boxes")

        inputs.update(
            {
                "visual_embeds": feature,
                "visual_attention_mask": torch.ones(feature.shape[:-1], dtype=torch.float),
                # "visual_token_type_ids": torch.ones(feature.shape[:-1], dtype=torch.long),
                "output_attentions": False
            }
        )

        # remove batch dimension
        for k, v in inputs.items():
            if isinstance(v, torch.Tensor):
                inputs[k] = v.squeeze()

        # add labels
        labels = annotation["labels"]
        # print("label candidate:", labels)
        scores = annotation["scores"]

        targets = torch.zeros(len(config.id2label), dtype=torch.float)
        for label, score in zip(labels, scores):
            # print(f"Setting target at index {label} to {score}")
            targets[label] = score
        inputs["labels"] = targets
        inputs["text"] = text

        print(text)
        return inputs

from visualbert.processing_image import Preprocess
from visualbert.visualizing_image import SingleImageViz
from visualbert.modeling_frcnn import GeneralizedRCNN
from visualbert.utils import Config

frcnn_cfg = Config.from_pretrained("unc-nlp/frcnn-vg-finetuned")
frcnn = GeneralizedRCNN.from_pretrained("unc-nlp/frcnn-vg-finetuned", config=frcnn_cfg)
image_preprocess = Preprocess(frcnn_cfg)

from transformers import VisualBertForQuestionAnswering, AutoTokenizer, BertTokenizerFast
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

model = VisualBertForQuestionAnswering.from_pretrained(
    "uclanlp/visualbert-vqa",
    num_labels=len(config.id2label),
    id2label=config.id2label,
    label2id=config.label2id,
    output_hidden_states=True)

model.to(device)
model.eval()

dataset = VQADataset(
    questions=questions[:100],
    annotations=annotations[:100],
    tokenizer=tokenizer,
    image_preprocess=image_preprocess,
    frcnn=frcnn,
    frcnn_cfg=frcnn_cfg)

test_dataloader = DataLoader(dataset, batch_size=1, shuffle=False)
correct = 0.0
total = 0

for batch in tqdm(test_dataloader):
    batch = {k: v.to(device) for k, v in batch.items()}
    outputs = model(**batch)
    logits = outputs.logits  # [batch_size, 3129]
    _, pre = torch.max(logits, 1)
    _, target = torch.max(batch["labels"], 1)
    print("prediction:", pre)
    print("target:", target)
    print("Predicted answer:", model.config.id2label[pre.item()])
    print("Target answer:", model.config.id2label[target.item()])
    correct += (pre == target).sum()
    total = total + 1
    print(total)

final_acc = correct / float(len(test_dataloader.dataset))
print('Accuracy of test: %f %%' % (100 * float(final_acc)))

BERT-large

Hi VisualBERT team! Can you provide BERT-large version checkpoints for VQA? I see there are only base version checkpoints in this repo.

Pre-training on other BERT models

Thanks for the great repo and your efforts! Two quick questions:

Is there anything that speaks against pre-training VisualBERT with ALBERT instead of BERT on COCO and then fine-tuning it for downstream tasks?
Also, I haven't found exact details on what resources are needed for pre-training, except that it took less than a day on COCO according to your paper. How many hours did it take, and what GPUs did you use?

About evaluation

Can I use the models on the Hugging Face model hub for evaluation without fine-tuning, and get the performance mentioned in your paper?

  1. "uclanlp/visualbert-vqa" to evaluate VQA
  2. "uclanlp/visualbert-nlvr2" to evaluate NLVR2

Question about visualBERT

I want to know about the visual token dimension.
The linguistic token dimension is 768, the same as in BERT for NLP;
what about the visual tokens?

Is there a fully connected layer that maps the visual features to the embedding dimension (768)?
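Something like this is what I have in mind (dimensions assumed: 2048-d detector features projected into the 768-d token space; this is my guess, not taken from the code):

import torch
import torch.nn as nn

# Guess at the visual embedding step: a single linear projection from the
# 2048-d region features to the 768-d hidden size shared with the text tokens.
visual_projection = nn.Linear(2048, 768)

region_features = torch.randn(1, 36, 2048)   # e.g. 36 detected regions
visual_tokens = visual_projection(region_features)
print(visual_tokens.shape)  # torch.Size([1, 36, 768])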

thanks

COCO features

Hi! Thank you for your excellent work. I noticed that we download the COCO features separately for NLVR, VQA, and VCR. What is the difference between these features? Are they from different Detectron2 models?
By the way, could you please provide the script for generating the Flickr30k features?

Flickr30k entities support

Hi! Thanks for releasing the code, it is very useful! I wanted to play with the attention on the Flickr30k Entities dataset, but cannot load the entries in the Flickr30kFeatureDataset constructor (I think some .hdf5 and .pkl files are missing). Could you provide more details on how to instantiate this class?

Thanks!

Features vqacoco-pre-train

Hi,

Thank you for this repo! I would like to know what visual features are used for the checkpoint: visualbert/configs/vqa/coco-pre-train.json?

Which model do the image features come from? Do you have the checkpoint (for example, Detectron e2e_mask_rcnn_R-101-FPN_2x, model_id: 35861858) so that I can use the model with different images?

Number of ROIs

Hi and thanks for the nice repo!

  1. I couldn't find in the paper how many proposals you used for pre-training and fine-tuning in each dataset (except for NLVR, where you use 144).
  2. Also, could it be that you do the "Task-Agnostic Pre-Training" on COCO separately for each task? (Given that you use different detectors for each task)

Thanks a lot! And congrats on the ACL follow-up paper

"VisualBERTDetector not in acceptable choices for type

I encounter this error when I pretrain on VCR. How can I solve this?
allennlp.common.checks.ConfigurationError: "VisualBERTDetector not in acceptable choices for type: ['bcn', 'constituency_parser', 'biaffine_parser', 'coref', 'crf_tagger', 'decomposable_attention', 'event2mind', 'simple_seq2seq', 'bidaf', 'bidaf-ensemble', 'dialog_qa', 'nlvr_coverage_parser', 'nlvr_direct_parser', 'quarel_parser', 'wikitables_mml_parser', 'wikitables_erm_parser', 'atis_parser', 'text2sql_parser', 'srl', 'simple_tagger', 'esim', 'bimpm', 'graph_parser', 'bidirectional-language-model']"

ModuleNotFoundError: No module named 'visualbert'

Hi,
I'm having some trouble running the code. I created a bash script as shown in the readme, but I keep getting this:

Traceback (most recent call last):
File "train.py", line 23, in
from visualbert.utils.pytorch_misc import time_batch, save_checkpoint, clip_grad_norm,
ModuleNotFoundError: No module named 'visualbert'

How can I solve it?

Using VisualBERT for generation

Hi, great work with this - very clearly explained, and I'm enjoying tinkering around with it.
I wanted to try to use the same model for text generation - captioning images, for example. Could you give some guidance on how I could proceed?
I think it will require adding a decoder stack on top of the encoder, which could be trained on COCO (which has captions) in the same way - MLM plus fine-tuning on COCO itself, right?
https://arxiv.org/pdf/2003.01473.pdf - these authors have done this, and their approach is slightly different in that they use two BERT encoders in parallel to encode images and text separately.
Do you think generation like that would be possible with VisualBERT, and how do you think I should proceed to try it out? Since you say your version of BERT is from Hugging Face, maybe I can use a decoder stack from them?
Alternatively, Hugging Face has an EncoderDecoder class - might that work once trained, if I preprocess the image features the same way you do here?

Flickr30k Entities fine-tuning clarification

Hi,

Thanks for the open-source repository.

I was wondering: how did you implement fine-tuning for the Flickr30k Entities dataset? From the ACL short paper:

[screenshot of the Flickr30k fine-tuning description from the ACL short paper]

What is the loss between the predicted alignment and the ground-truth alignment? As noted in your preprint on arXiv, this can be a bit complicated because the ground-truth alignment can have multiple boxes.

VisualBERT with Detectron2

Hi,

I was wondering whether VisualBERT can be used out of the box (from Hugging Face) with Detectron2? I followed this nice tutorial (also linked on the same Hugging Face page) for extracting embeddings with Detectron2, but the VisualBERT paper states that it was trained with Detectron rather than Detectron2. Do I have to do my own pre-training in order to use Detectron2 embeddings?

Thanks in advance!
