
visualbert's Introduction

This repository contains code for the following two papers:

  • VisualBERT: A Simple and Performant Baseline for Vision and Language (arXiv)
  • What Does BERT with Vision Look At? (ACL 2020)

The model VisualBERT has also been integrated into several libraries, such as Hugging Face Transformers (many thanks to Gunjan Chhablani, who made it work) and Facebook MMF.
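As a quick illustration of the Transformers integration, the Hugging Face port can be loaded roughly like this (a minimal sketch; the visual features below are random placeholders that would normally come from a region detector such as a Faster R-CNN):

import torch
from transformers import BertTokenizer, VisualBertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = VisualBertModel.from_pretrained("uclanlp/visualbert-vqa-coco-pre")

inputs = tokenizer("A dog chases a ball.", return_tensors="pt")

# Placeholder visual features: 36 regions with 2048-d ROI features each.
visual_embeds = torch.randn(1, 36, 2048)
inputs.update({
    "visual_embeds": visual_embeds,
    "visual_attention_mask": torch.ones(visual_embeds.shape[:-1]),
    "visual_token_type_ids": torch.ones(visual_embeds.shape[:-1], dtype=torch.long),
})

outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (1, text_length + 36, 768)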

Thanks~

visualbert's People

Contributors

dependabot[bot], erjanmx, kaiweichang, liunian-harold-li

visualbert's Issues

COCO pre-training size

Hi! Could you share the size of pre-training data?
I saw that you extend the training set with part of the validation set.

Minimum GPU requirement

What is the minimum GPU requirement to use VisualBERT properly?
Is it possible to use VisualBERT on a machine with a 12 GB GPU?

ModuleNotFoundError: No module named 'visualbert'

(visualbert) [xxx@localhost ~]$ export PYTHONPATH=$PYTHONPATH:visualbert_vcr/visualbert/
(visualbert) [xxx@localhost ~]$ export PYTHONPATH=$PYTHONPATH:visualbert_vcr/
(visualbert) [xxx@localhost ~]$ cd visualbert_vcr/visualbert/models/
(visualbert) [xxx@localhost models]$ CUDA_VISIBLE_DEVICES=7 python train.py -folder ../trained_models -config ../configs/vcr/fine-tune-qa.json
Traceback (most recent call last):
File "train.py", line 26, in
from visualbert.utils.pytorch_misc import time_batch, save_checkpoint, clip_grad_norm,
ModuleNotFoundError: No module named 'visualbert'

I keep running into this problem. How can I fix it?

"pre-training" section in the readme

Just want to confirm: when you talk about "pre-training" in the readme (https://github.com/airsplay/lxmert#pre-training), do you mean training the entire LXMERT model from scratch?

If we just want to use a trained LXMERT model (and stick a classification or LSTM layer on the end), we can just use the pre-trained model link you provided (http://nlp.cs.unc.edu/data/model_LXRT.pth), load your model, freeze the weights, and then fine-tune on our specific task, right?
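Concretely, here is roughly what I have in mind (a generic PyTorch sketch, not specific to LXMERT's actual module layout; I'm assuming the loaded encoder behaves like a standard nn.Module that returns a pooled feature vector):

import torch
import torch.nn as nn

class FrozenEncoderClassifier(nn.Module):
    """Wrap a pretrained encoder, freeze it, and train only a task head."""

    def __init__(self, encoder: nn.Module, hidden_size: int, num_classes: int):
        super().__init__()
        self.encoder = encoder
        for p in self.encoder.parameters():
            p.requires_grad = False  # freeze the pretrained weights
        self.classifier = nn.Linear(hidden_size, num_classes)

    def forward(self, *args, **kwargs):
        # Assumption: the encoder returns a pooled (batch, hidden_size) tensor.
        with torch.no_grad():
            pooled = self.encoder(*args, **kwargs)
        return self.classifier(pooled)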

Thanks

Contact author

Thank you very much for releasing the code. I have only just started working in the multimodal field, so it's a pleasure to see this code. I have run into quite a few problems while reproducing the experiments, and my understanding is not yet deep enough. If you have time, I'd like to ask you for some advice.
My email: [email protected]

Extracting Detectron Features

Hi, thanks for releasing your code soon after your paper and for making your evaluations easy to reproduce! Could you please provide more detail on how you extracted the Detectron features? I don't see a straightforward way to extract the features with the existing code in the Detectron repository. Thanks!

How to evaluate on VQA?

Hi, I have fine-tuned the VisualBERT model on VQA following the instructions in the readme, but I have no idea how to make predictions on the official VQA v2.0 test-dev set and compute an accuracy score comparable to the performance mentioned in your paper (70.80). Thank you very much!

Visual Features Computation

Hi, I'm really interested in your work! I'd like to ask how exactly the visual features are computed. It would be really useful if you could point me directly to the relevant snippet of code.
Moreover, could you kindly clarify the preprocessing carried out for Flickr30k's visual features? In the code, you concatenate additional "spatial features" to the original ones, resulting in an embedding of size 2054 instead of 2048. I couldn't find any explanation of this in the paper.
Thank you!
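For concreteness, this is roughly how I picture the concatenation (my own guess at the layout of the six extra dimensions, e.g. normalized corners plus width and height; not taken from your code):

import torch

def append_spatial_features(roi_features, boxes, image_w, image_h):
    """Concatenate simple box-geometry features to 2048-d region features.

    roi_features: (num_boxes, 2048) pooled region features
    boxes:        (num_boxes, 4) absolute (x1, y1, x2, y2) coordinates
    Returns a (num_boxes, 2054) tensor.
    """
    x1, y1, x2, y2 = boxes.unbind(-1)
    spatial = torch.stack([
        x1 / image_w, y1 / image_h,   # normalized top-left corner
        x2 / image_w, y2 / image_h,   # normalized bottom-right corner
        (x2 - x1) / image_w,          # normalized width
        (y2 - y1) / image_h,          # normalized height
    ], dim=-1)
    return torch.cat([roi_features, spatial], dim=-1)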

Sentence-image matching

Hi, I'm very interested in your work. I'd like to ask how the sentence-image matching prediction of the pre-training task is carried out. Is it in 'TrainVisualBERTObjective'? I don't quite understand. I'm looking forward to your reply. Thank you.

config_vcr is nowhere to be found

I am running the following command:

python train.py --folder log --config ../configs/vcr/fine-tune-qa.json
Traceback (most recent call last):
  File "train.py", line 26, in <module>
    from visualbert.dataloaders.vcr import VCR, VCRLoader
  File "/auto/nlg-05/chengham/third-party/visualbert/dataloaders/vcr.py", line 20, in <module>
    from dataloaders.box_utils import load_image, resize_image, to_tensor_and_normalize
  File "/auto/nlg-05/chengham/third-party/visualbert/dataloaders/box_utils.py", line 8, in <module>
    from config_vcr import USE_IMAGENET_PRETRAINED
ModuleNotFoundError: No module named 'config_vcr'

I searched the whole repo and cannot find the file.

About Chinese

I am very happy to see this resource, and I want to know whether it performs well on Chinese datasets.

seq_relationship_score logits order

I'm testing this model on the image-sentence-alignment task and I'm observing weird results.

By running the pretrained COCO model in eval mode on COCO17, I get results below random chance (using basically the same setting used for pre-training).

The 'seq_relationship_score' output returns two logits, and according to what is reported in the documentation:

  • index 0 is "next sentence is the continuation"
  • index 1 is "next sentence is random"

Following the doc, as I said, I get results that would make much more sense if the meaning of the logits was flipped.

Moreover, that part of the code seems to have been borrowed from the transformers library, and recently a similar issue has been found in another BERT-based model: huggingface/transformers#9212

We are conducting experiments with your model, and it would be convenient for us to simply ignore the documentation and report the results flipped.

It would be great if you could clarify this point!

Thank you in advance!
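For reference, this is essentially how we probe the two logits through the Hugging Face port (a minimal sketch with random placeholder visual features; whether index 0 or index 1 means "matched" is exactly the point in question, so the interpretation of the printed probabilities is what needs clarifying):

import torch
from transformers import BertTokenizer, VisualBertForPreTraining

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = VisualBertForPreTraining.from_pretrained("uclanlp/visualbert-vqa-coco-pre")
model.eval()

inputs = tokenizer("A man is riding a horse.", return_tensors="pt")

# Placeholder visual features; in our experiments these are real COCO region features.
visual_embeds = torch.randn(1, 36, 2048)
inputs.update({
    "visual_embeds": visual_embeds,
    "visual_attention_mask": torch.ones(visual_embeds.shape[:-1]),
    "visual_token_type_ids": torch.ones(visual_embeds.shape[:-1], dtype=torch.long),
})

with torch.no_grad():
    outputs = model(**inputs)

# Two logits per example: per the documentation, index 0 = "caption matches the image",
# index 1 = "caption is random" -- the ordering this issue asks about.
print(outputs.seq_relationship_logits.softmax(dim=-1))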

About MCAN

[screenshot of the results table]
Hello, could you please tell me where I can find the papers for the entries (MCAN + VG + Multiple Detectors + BERT) in the last few rows of the table? And is their code open source?

checkpoints for flickr30k?

Hi,

Thank you for your nice work! But I didn't see any checkpoints for the Flickr30k experiments. Could you provide links to them if possible?

Mask Probability for Task-Specific Pre-training

Hi, in your paper you mention that task-specific pre-training also uses masked language modelling, similar to task-agnostic pre-training. However, I cannot find any mask probability in the task-specific pre-training .json files. Why is no probability specified, and did you use the same 15% probability as for task-agnostic pre-training?

Sorry, perhaps I'm missing something here -- thanks for the help!

The ROIAlign module used in the VCR experiments crashes during the forward pass

Hi, I am running the experiments on VCR following the instructions in the readme. I have installed the customized torchvision and detectron modules. However, when the COCO pre-training process begins, the program is terminated by a segmentation fault when tensors are passed through the ROIAlign module. Could you help me figure out this issue? Thank you very much!

VisualBERT VQA model gives lower validation accuracy (around 40%) with the Hugging Face framework

import torch
from torch.utils.data import DataLoader
from tqdm import tqdm

# Note: `config` (providing id2label/label2id), `questions`, `annotations`,
# `id_to_filename`, and `device` are defined elsewhere in my script.

class VQADataset(torch.utils.data.Dataset):
    """VQA (v2) dataset."""

    def __init__(self, questions, annotations, tokenizer, image_preprocess, frcnn, frcnn_cfg):
        self.questions = questions
        self.annotations = annotations
        self.tokenizer = tokenizer
        self.image_preprocess = image_preprocess
        self.frcnn = frcnn
        self.frcnn_cfg = frcnn_cfg

    def __len__(self):
        return len(self.annotations)

    def __getitem__(self, idx):
        # answer
        annotation = self.annotations[idx]
        # question
        questions = self.questions[idx]
        image_path = id_to_filename[annotation["image_id"]]
        image_path = image_path.replace("./multimodal_data/vqa2/val2014/.", "", 1)
        text = questions["question"]

        inputs = self.tokenizer(
            text,
            padding="max_length",
            max_length=25,
            truncation=True,
            return_token_type_ids=True,
            return_attention_mask=True,
            add_special_tokens=True,
            return_tensors="pt")

        images, sizes, scales_yx = self.image_preprocess(image_path)
        output_dict = self.frcnn(
            images,
            sizes,
            scales_yx=scales_yx,
            padding="max_detections",
            max_detections=self.frcnn_cfg.max_detections,
            return_tensors="pt")

        # Very important that the boxes are normalized
        feature = output_dict.get("roi_features")
        normalized_boxes = output_dict.get("normalized_boxes")

        inputs.update(
            {
                "visual_embeds": feature,
                "visual_attention_mask": torch.ones(feature.shape[:-1], dtype=torch.float),
                # "visual_token_type_ids": torch.ones(feature.shape[:-1], dtype=torch.long),
                "output_attentions": False
            }
        )

        # remove batch dimension
        for k, v in inputs.items():
            if isinstance(v, torch.Tensor):
                inputs[k] = v.squeeze()

        # add labels
        labels = annotation["labels"]
        # print("label candidate:", labels)
        scores = annotation["scores"]

        targets = torch.zeros(len(config.id2label), dtype=torch.float)
        for label, score in zip(labels, scores):
            # print(f"Setting target at index {label} to {score}")
            targets[label] = score
        inputs["labels"] = targets
        inputs["text"] = text

        print(text)
        return inputs

from visualbert.processing_image import Preprocess
from visualbert.visualizing_image import SingleImageViz
from visualbert.modeling_frcnn import GeneralizedRCNN
from visualbert.utils import Config

frcnn_cfg = Config.from_pretrained("unc-nlp/frcnn-vg-finetuned")
frcnn = GeneralizedRCNN.from_pretrained("unc-nlp/frcnn-vg-finetuned", config=frcnn_cfg)
image_preprocess = Preprocess(frcnn_cfg)

from transformers import VisualBertForQuestionAnswering, AutoTokenizer, BertTokenizerFast
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

model = VisualBertForQuestionAnswering.from_pretrained(
    "uclanlp/visualbert-vqa",
    num_labels=len(config.id2label),
    id2label=config.id2label,
    label2id=config.label2id,
    output_hidden_states=True)

model.to(device)
model.eval()

dataset = VQADataset(
    questions=questions[:100],
    annotations=annotations[:100],
    tokenizer=tokenizer,
    image_preprocess=image_preprocess,
    frcnn=frcnn,
    frcnn_cfg=frcnn_cfg)

test_dataloader = DataLoader(dataset, batch_size=1, shuffle=False)
correct = 0.0
total = 0

for batch in tqdm(test_dataloader):
    batch = {k: v.to(device) for k, v in batch.items()}
    outputs = model(**batch)
    logits = outputs.logits  # [batch_size, 3129]
    _, pre = torch.max(logits, 1)
    _, target = torch.max(batch["labels"], 1)
    print("prediction:", pre)
    print("target:", target)
    print("Predicted answer:", model.config.id2label[pre.item()])
    print("Target answer:", model.config.id2label[target.item()])
    correct += (pre == target).sum()
    total = total + 1
    print(total)

final_acc = correct / float(len(test_dataloader.dataset))
print('Accuracy of test: %f %%' % (100 * float(final_acc)))

BERT-large

Hi VisualBERT team! Can you provide BERT-large version checkpoints for VQA? I see there are only base version checkpoints in this repo.

Pre-training on other BERT models

Thanks for the great repo and your efforts! Two quick questions:

Is there anything that speaks against pre-training VisualBERT with ALBERT instead of BERT on COCO and then fine-tuning it for downstream tasks?
Also, I haven't found exact details on what resources are needed for pre-training, except that it took less than a day on COCO according to your paper. How many hours did it take, and what GPUs did you use?

About evaluation

Can I use the models on the Hugging Face model hub for evaluation without fine-tuning, and get the performance mentioned in your paper?

  1. "uclanlp/visualbert-vqa" to evaluate VQA
  2. "uclanlp/visualbert-nlvr2" to evaluate NLVR2

Question about visualBERT

I want to know about the visual token dimension.
The linguistic token dimension is 768, the same as in BERT for NLP;
what about the visual tokens?

Is there a fully connected layer that maps the visual features to the embedding dimension (768)?
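Something like this is what I have in mind (dimensions assumed: 2048-d detector features projected into the 768-d token space; this is my guess, not taken from the code):

import torch
import torch.nn as nn

# Guess at the visual embedding step: a single linear projection from the
# 2048-d region features to the 768-d hidden size shared with the text tokens.
visual_projection = nn.Linear(2048, 768)

region_features = torch.randn(1, 36, 2048)   # e.g. 36 detected regions
visual_tokens = visual_projection(region_features)
print(visual_tokens.shape)  # torch.Size([1, 36, 768])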

thanks

COCO features

Hi! Thank you for your excellent work. I noticed that we download the COCO features separately for NLVR, VQA, and VCR. What is the difference between these features? Are they from different Detectron2 models?
By the way, could you please provide the script for generating the Flickr30k features?

Flickr30k entities support

Hi! Thanks for releasing the code, it is very useful! I wanted to play with the attention on the Flickr30k Entities dataset, but cannot load the entries in the Flickr30kFeatureDataset constructor (I think some .hdf5 and .pkl files are missing). Could you provide more details on how to instantiate this class?

Thanks!

Features vqacoco-pre-train

Hi,

Thank you for this repo! I would like to know what visual features are used for the checkpoint: visualbert/configs/vqa/coco-pre-train.json?

Which model do the image features come from? Do you have the checkpoint (for example, Detectron e2e_mask_rcnn_R-101-FPN_2x, model_id: 35861858) so that I can use the model with different images?

Number of ROIs

Hi and thanks for the nice repo!

  1. I couldn't find in the paper how many proposals you used for pre-training and fine-tuning in each dataset (except for NLVR, where you use 144).
  2. Also, could it be that you do the "Task-Agnostic Pre-Training" on COCO separately for each task? (Given that you use different detectors for each task)

Thanks a lot! And congrats on the ACL follow-up paper

"VisualBERTDetector not in acceptable choices for type

I encounter this error when I pretrain on VCR. How can I solve this?
allennlp.common.checks.ConfigurationError: "VisualBERTDetector not in acceptable choices for type: ['bcn', 'constituency_parser', 'biaffine_parser', 'coref', 'crf_tagger', 'decomposable_attention', 'event2mind', 'simple_seq2seq', 'bidaf', 'bidaf-ensemble', 'dialog_qa', 'nlvr_coverage_parser', 'nlvr_direct_parser', 'quarel_parser', 'wikitables_mml_parser', 'wikitables_erm_parser', 'atis_parser', 'text2sql_parser', 'srl', 'simple_tagger', 'esim', 'bimpm', 'graph_parser', 'bidirectional-language-model']"

ModuleNotFoundError: No module named 'visualbert'

Hi,
I'm having some trouble running the code. I created a bash script as shown in the readme, but I keep getting this:

Traceback (most recent call last):
File "train.py", line 23, in
from visualbert.utils.pytorch_misc import time_batch, save_checkpoint, clip_grad_norm,
ModuleNotFoundError: No module named 'visualbert'

How can I solve it?

Using VisualBERT for generation

Hi, great work with this - very clearly explained, and I'm enjoying tinkering around with it.
I wanted to try to use the same model for text generation - captioning images, for example. Could you give some guidance on how I could proceed?
I think it will require adding a decoder stack on top of the encoder, which could be trained on COCO (which has captions) in the same way - MLM plus fine-tuning on COCO itself, right?
https://arxiv.org/pdf/2003.01473.pdf - these authors have done this, and their approach is slightly different in that they use two BERT encoders in parallel to encode images and text separately.
Do you think generation like that would be possible with VisualBERT, and how do you think I should proceed to try it out? Since you say your version of BERT is from Hugging Face, maybe I can use a decoder stack from them?
Alternatively, Hugging Face has an EncoderDecoder class - might that work once trained, if I preprocess the image features the same way you do here?

Flickr30k Entities fine-tuning clarification

Hi,

Thanks for the open-source repository.

I was wondering: how did you implement fine-tuning for the Flickr30k Entities dataset? From the ACL short paper:

[screenshot of the Flickr30k fine-tuning description from the ACL short paper]

What is the loss between the predicted alignment and the ground-truth alignment? As noted in your preprint on arXiv, this can be a bit complicated because the ground-truth alignment can have multiple boxes.

VisualBERT with Detectron2

Hi,

I was wondering whether VisualBERT can be used out of the box (from Hugging Face) with Detectron2? I followed this nice tutorial (also linked on the same Hugging Face page) for extracting embeddings with Detectron2, but the VisualBERT paper states that it was trained with Detectron rather than Detectron2. Do I have to do my own pre-training in order to use Detectron2 embeddings?

Thanks in advance!
