
vlp's Introduction

VLP

This repo hosts the source code for our AAAI 2020 work Vision-Language Pre-training (VLP). We have released the model pre-trained on the Conceptual Captions dataset, as well as models fine-tuned on COCO Captions and Flickr30k for image captioning and on VQA 2.0 for VQA.

Installation

Conda Environment (Option I, Recommended)

  1. Clone the repo recursively (over SSH) to include the coco-caption and pythia submodules:
git clone --recursive git@github.com:LuoweiZhou/VLP.git

or clone with https:

git clone --recursive https://github.com/LuoweiZhou/VLP.git
  2. Install CUDA (e.g., 10.0), CUDNN (e.g., v7.5), and Miniconda (either Miniconda2 or 3, version 4.6+).

  3. Run the following commands to set up the conda env and install the Python packages:

MINICONDA_ROOT=[to your Miniconda root directory] # e.g., /home/[usrname]/miniconda3
cd VLP
conda env create -f misc/vlp.yml --prefix $MINICONDA_ROOT/envs/vlp
conda activate vlp
  4. Finally, cd to the repo root directory and install the other dependencies by running:
./setup.sh

To support language evaluation (SPICE), run

cd coco-caption
./get_stanford_models.sh

Docker Image (Option II)

First, install or upgrade to the latest docker (e.g., set <VERSION_STRING> to 5:19.03.2~3-0~ubuntu-xenial). Then pull our docker image:

docker pull luzhou/vlp

Before running the container, set the environment variable $DATA_ROOT to your data root (see data prep); it will be attached as a volume to the container. Finally, install nvidia-container-toolkit and run the docker image in a fresh container:

docker run --gpus all --name vlp_container -it \
     -v $DATA_ROOT:/mnt/dat \
     --shm-size 8G -p 8888:8888 vlp /bin/bash

You can learn more about Docker commands and usage here.

(Optional) To build the image yourself:

docker build -t vlp .

Data Preparation

Download links for dataset annotations and features: COCO Captions + VQA 2.0 (Part I (95GB), Part II (79GB); download both and run cat COCO0* > COCO.tar.gz), Flickr30k Captions (27GB). If you prefer to download with wget, the commands are listed under Misc below. Then, uncompress the downloaded files and place them under your data root (denoted as DATA_ROOT).
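
A minimal sketch of this step, assuming the two COCO parts were saved as COCO00 and COCO01 (the file names used by the wget commands under Misc), the archives sit in the current directory, and they extract into COCO/ and flickr30k/ subfolders; the DATA_ROOT path is only an example:

export DATA_ROOT=/path/to/your/data    # example location; adjust to your setup
mkdir -p $DATA_ROOT
cat COCO0* > COCO.tar.gz               # combine the two downloaded parts
tar xzf COCO.tar.gz -C $DATA_ROOT
tar xzf flickr30k.tar.gz -C $DATA_ROOT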

To prepare for pre-training, first download and uncompress our pre-processed Conceptual Captions (CC) data (6GB) and place it under your data root. Then, download and uncompress the region features from Google Drive (feat (509GB), cls (468GB)) under the CC/region_feat_gvd_wo_bgd/feat_cls_1000_float16 dir. To evaluate CC on caption generation, download the reference file and place it under coco-caption/annotations.

In addition, download and uncompress the Detectron fc7 weight files (GVD Detectron fc7) under the code root directory (denoted as CODE_ROOT).

(Optional, only for VQA) Download the VQA 2.0 annotation (based on Pythia):

cd $CODE_ROOT/pythia
mkdir -p data && cd data
wget http://dl.fbaipublicfiles.com/pythia/data/vocab.tar.gz
tar xf vocab.tar.gz && rm vocab.tar.gz

wget https://s3.amazonaws.com/cvmlp/vqa/mscoco/vqa/v2_Annotations_Val_mscoco.zip
unzip v2_Annotations_Val_mscoco.zip && rm v2_Annotations_Val_mscoco.zip

mkdir -p imdb && cd imdb
wget https://dl.fbaipublicfiles.com/pythia/data/imdb/vqa.tar.gz
tar xf vqa.tar.gz && rm vqa.tar.gz

(Optional, only for pre-training) Download the UniLM checkpoints and uncompress them under your checkpoint root (denoted as CHECKPOINT_ROOT).
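
A sketch of the checkpoint layout assumed by the pre-training command below; the archive name matches the wget list under Misc, and the CHECKPOINT_ROOT path is only an example:

export CHECKPOINT_ROOT=/path/to/your/checkpoints   # example location; adjust to your setup
mkdir -p $CHECKPOINT_ROOT
tar xzf bert_save.tar.gz -C $CHECKPOINT_ROOT       # expected to yield $CHECKPOINT_ROOT/bert_save/... used by the pre-training command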

Experiment Overview

Most of the experiments in this work are performed on 8x V100 GPUs with distributed data parallel (i.e., set --world_size to 8, --local_rank and --global_rank from 0 to 7 with 8 separate scripts), unless specified otherwise. See below for detailed configurations (also in the Appendix of the paper).

| Dataset | Batch Size | Learning Rate | # of Epochs | GPUs | Time per Epoch |
|---|---|---|---|---|---|
| CC | 64(x8) | 1e-4(x8) | 30 | 8x V100 | 5hr |
| COCO | 64(x8) | 3e-5(x8) | 30 | 8x V100 | 12min |
| VQA 2.0 | 64(x2) | 2e-5(x2) | 20 | 2x V100 | 32min |
| Flickr30k | 64(x8) | 3e-5(x8) | 30 | 8x V100 | 3min |
| COCO (w/o pre-training) | 64(x8) | 3e-4(x8) | 30 | 8x V100 | 12min |
| COCO (SCST training) | 16(x4) | 1e-6(x4) | 30 | 4x Titan Xp | 3hr |

The (x2), (x4), (x8) in the batch size and learning rate columns come from distributed data parallel: gradients are accumulated (summed) across GPUs.

Note that some modules need to be manually added to PYTHONPATH:

export PYTHONPATH=$CODE_ROOT/pythia:$CODE_ROOT/pythia/pythia/legacy:$CODE_ROOT:$PYTHONPATH

Pre-training

An example command for single-GPU training:

python vlp/run_img2txt_dist.py --output_dir $CHECKPOINT_ROOT/${checkpoint_cc} \
    --model_recover_path $CHECKPOINT_ROOT/bert_save/base_model_pretrained/model_153999_cpu.bin \
    --do_train --learning_rate ${lr} --new_segment_ids --always_truncate_tail --amp \
    --src_file $DATA_ROOT/CC/annotations/dataset_cc.json \
    --dataset cc --split train --file_valid_jpgs $DATA_ROOT/CC/annotations/cc_valid_jpgs.json \
    --local_rank -1 --global_rank -1 --world_size 1 --enable_butd \
    --s2s_prob ${w_s} --bi_prob ${w_b} --image_root $DATA_ROOT/CC/region_feat_gvd_wo_bgd \
    --region_bbox_file bbox/cc_detection_vg_thresh0.2_feat_gvd_checkpoint_trainval.h5 \
    --region_det_file_prefix feat_cls_1000_float16/cc_detection_vg_100dets_gvd_checkpoint_trainval

where lr=1e-4, w_s=0.75, w_b=0.25, and checkpoint_cc is the id of the checkpoint. The pre-trained models are available here.
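
A sketch of the shell variables assumed by the command above; the checkpoint id is arbitrary (here it mirrors the name of the released pre-trained checkpoint):

lr=1e-4
w_s=0.75    # sampling probability of the seq2seq objective (--s2s_prob)
w_b=0.25    # sampling probability of the bidirectional objective (--bi_prob)
checkpoint_cc=cc_g8_lr1e-4_batch512_s0.75_b0.25   # example id; any directory name works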

Fine-tuning

The fine-tuning checkpoints are available at: COCO (CE optim), COCO (CIDEr optim), VQA 2.0 (train on train set only), Flickr30k.

COCO Captions

An example command for single-GPU training:

python vlp/run_img2txt_dist.py --output_dir $CHECKPOINT_ROOT/${checkpoint_coco_ce} \
    --model_recover_path $CHECKPOINT_ROOT/${checkpoint_cc}/model.30.bin \
    --do_train --new_segment_ids --always_truncate_tail --amp \
    --src_file $DATA_ROOT/COCO/annotations/dataset_coco.json \
    --file_valid_jpgs $DATA_ROOT/COCO/annotations/coco_valid_jpgs.json \
    --image_root $DATA_ROOT/COCO/region_feat_gvd_wo_bgd --enable_butd --s2s_prob 1 --bi_prob 0

(Optional) To enable Self-Critical Sequence Training (SCST), set --model_recover_path $CHECKPOINT_ROOT/${checkpoint_coco_ce}/model.28.bin, --max_pred 0, --mask_prob 0, --scst, --learning_rate 1e-6 (note that SCST requires a much smaller learning rate than the default 3e-5), and --output_dir accordingly. The training takes 30 epochs to converge, with each epoch taking roughly 3 hours.
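
A sketch of a single-GPU SCST run that assembles the flags above on top of the cross-entropy command; the output-directory variable checkpoint_coco_scst and the --train_batch_size 16 value (taken from the configuration table) are assumptions:

python vlp/run_img2txt_dist.py --output_dir $CHECKPOINT_ROOT/${checkpoint_coco_scst} \
    --model_recover_path $CHECKPOINT_ROOT/${checkpoint_coco_ce}/model.28.bin \
    --do_train --new_segment_ids --always_truncate_tail --amp \
    --src_file $DATA_ROOT/COCO/annotations/dataset_coco.json \
    --file_valid_jpgs $DATA_ROOT/COCO/annotations/coco_valid_jpgs.json \
    --image_root $DATA_ROOT/COCO/region_feat_gvd_wo_bgd --enable_butd --s2s_prob 1 --bi_prob 0 \
    --max_pred 0 --mask_prob 0 --scst --learning_rate 1e-6 --train_batch_size 16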

An example command for 2-GPU training with distributed data parallel:

python vlp/run_img2txt_dist.py --output_dir $CHECKPOINT_ROOT/${checkpoint_coco_ce} \
    --model_recover_path $CHECKPOINT_ROOT/${checkpoint_cc}/model.30.bin \
    --do_train --new_segment_ids --always_truncate_tail --amp \
    --src_file $DATA_ROOT/COCO/annotations/dataset_coco.json \
    --file_valid_jpgs $DATA_ROOT/COCO/annotations/coco_valid_jpgs.json \
    --image_root $DATA_ROOT/COCO/region_feat_gvd_wo_bgd --enable_butd --s2s_prob 1 --bi_prob 0 \
    --local_rank 0 --global_rank 0 --world_size 2 &
python vlp/run_img2txt_dist.py --output_dir $CHECKPOINT_ROOT/${checkpoint_coco_ce} \
    --model_recover_path $CHECKPOINT_ROOT/${checkpoint_cc}/model.30.bin \
    --do_train --new_segment_ids --always_truncate_tail --amp \
    --src_file $DATA_ROOT/COCO/annotations/dataset_coco.json \
    --file_valid_jpgs $DATA_ROOT/COCO/annotations/coco_valid_jpgs.json \
    --image_root $DATA_ROOT/COCO/region_feat_gvd_wo_bgd --enable_butd --s2s_prob 1 --bi_prob 0 \
    --local_rank 1 --global_rank 1 --world_size 2
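
For more GPUs, the same pattern can be scripted with a loop. A sketch (not from the original instructions) that launches one process per GPU for the 8x V100 COCO configuration in the table above:

WORLD_SIZE=8
for RANK in $(seq 0 $((WORLD_SIZE - 1))); do
    python vlp/run_img2txt_dist.py --output_dir $CHECKPOINT_ROOT/${checkpoint_coco_ce} \
        --model_recover_path $CHECKPOINT_ROOT/${checkpoint_cc}/model.30.bin \
        --do_train --new_segment_ids --always_truncate_tail --amp \
        --src_file $DATA_ROOT/COCO/annotations/dataset_coco.json \
        --file_valid_jpgs $DATA_ROOT/COCO/annotations/coco_valid_jpgs.json \
        --image_root $DATA_ROOT/COCO/region_feat_gvd_wo_bgd --enable_butd --s2s_prob 1 --bi_prob 0 \
        --local_rank $RANK --global_rank $RANK --world_size $WORLD_SIZE &
done
wait   # block until all ranks finish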

VQA 2.0

An example command for single-GPU training:

python vlp/run_img2txt_dist.py --output_dir $CHECKPOINT_ROOT/${checkpoint_vqa2} \
    --model_recover_path $CHECKPOINT_ROOT/${checkpoint_cc}/model.30.bin \
    --do_train --learning_rate 2e-5 --new_segment_ids --always_truncate_tail --amp \
    --num_train_epochs 20 --enable_butd --s2s_prob 0 --bi_prob 1 \
    --image_root $DATA_ROOT/COCO/region_feat_gvd_wo_bgd \
    --tasks vqa2 --src_file $CODE_ROOT/pythia/data/imdb/vqa/imdb_train2014.npy \
    --file_valid_jpgs $DATA_ROOT/COCO/annotations/coco_valid_jpgs.json \
    --mask_prob 0 --max_pred 1

To get the models for the leaderboard, we train on both the train set and the val set (set --src_file to imdb_train2014 and imdb_val2014).

Flickr30k Captions

python vlp/run_img2txt_dist.py --output_dir $CHECKPOINT_ROOT/${checkpoint_flickr30k} \
    --model_recover_path $CHECKPOINT_ROOT/${checkpoint_cc}/model.30.bin \
    --do_train --new_segment_ids --always_truncate_tail --amp \
    --image_root $DATA_ROOT/flickr30k/region_feat_gvd_wo_bgd --enable_butd --s2s_prob 1 --bi_prob 0 \
    --dataset flickr30k --region_bbox_file flickr30k_detection_vg_thresh0.2_feat_gvd_checkpoint_trainvaltest.h5 \
    --src_file $DATA_ROOT/flickr30k/annotations/dataset_flickr30k.json \
    --file_valid_jpgs $DATA_ROOT/flickr30k/annotations/flickr30k_valid_jpgs.json

Inference and Testing

Here we list the expected results from our Unified VLP checkpoints. For image captioning, on Karpathy's test split:

| Dataset | Method | BLEU@4 | METEOR | CIDEr | SPICE |
|---|---|---|---|---|---|
| COCO | Unified VLP | 36.5 | 28.4 | 116.9 | 21.2 |
| COCO | Unified VLP + SCST | 39.5 | 29.3 | 129.3 | 23.2 |
| Flickr30k | Unified VLP | 30.1 | 23.0 | 67.4 | 17.0 |

For VQA:

| Dataset | Trained on | Eval Split | Overall | Yes/No | Number | Other |
|---|---|---|---|---|---|---|
| VQA 2.0 | train only | Dev | 67.4 | 85.4 | 50.1 | 58.3 |
| VQA 2.0 | train+val | Test-Dev | 70.5 | 87.2 | 52.1 | 60.3 |
| VQA 2.0 | train+val | Test-Standard | 70.7 | 87.4 | 52.1 | 60.5 |

Note that the results on Test-Dev and Test-Standard are from the VQA 2.0 evaluation server. train+val indicates that the model is trained on both the training set and the validation set, following the practice of earlier works.

Note: all the evaluation scripts support data parallelism, but since we do not use the standard PyTorch DataLoader, data loading speed might be the bottleneck (imagine num_workers always being 0). We recommend performing single-GPU inference (e.g., CUDA_VISIBLE_DEVICES=0).

COCO Captions

python vlp/decode_img2txt.py \
    --model_recover_path $CHECKPOINT_ROOT/${checkpoint_coco_ce}/model.${epoch}.bin \
    --new_segment_ids --batch_size 100 --beam_size ${beam} --enable_butd \
    --image_root $DATA_ROOT/COCO/region_feat_gvd_wo_bgd/ --split ${split} \
    --src_file $DATA_ROOT/COCO/annotations/dataset_coco.json \
    --file_valid_jpgs $DATA_ROOT/COCO/annotations/coco_valid_jpgs.json

where checkpoint_coco_ce is the checkpoint name, beam=1 for split=val and beam=5 for split=test, and epoch selects the checkpoint saved at that epoch.
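
A sketch of the variables assumed by the command above, here for test-set decoding on a single GPU as recommended; the epoch value and checkpoint name are only examples:

export CUDA_VISIBLE_DEVICES=0   # single-GPU inference, as recommended above
split=test
beam=5      # use beam=1 for split=val
epoch=30    # whichever checkpoint epoch you want to evaluate
checkpoint_coco_ce=coco_g8_lr3e-5_batch512_ft_from_s0.75_b0.25   # e.g., the released CE-optimized checkpoint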

VQA 2.0

python vlp/eval_vqa2.py \
    --model_recover_path $CHECKPOINT_ROOT/${checkpoint_vqa2}/model.${epoch}.bin \
    --new_segment_ids --enable_butd --image_root $DATA_ROOT/COCO/region_feat_gvd_wo_bgd/ \
    --src_file $CODE_ROOT/pythia/data/imdb/vqa/imdb_${split}.npy --batch_size 50 \
    --file_valid_jpgs $DATA_ROOT/COCO/annotations/coco_valid_jpgs.json --split ${split}

where split could be val2014 or test2015.

Flickr30k Captions

python vlp/decode_img2txt.py \
    --model_recover_path $CHECKPOINT_ROOT/${checkpoint_flickr30k}/model.${epoch}.bin \
    --new_segment_ids --batch_size 100 --beam_size ${beam} --enable_butd \
    --image_root $DATA_ROOT/flickr30k/region_feat_gvd_wo_bgd/ --split ${split} \
    --dataset flickr30k --region_bbox_file flickr30k_detection_vg_thresh0.2_feat_gvd_checkpoint_trainvaltest.h5 \
    --src_file $DATA_ROOT/flickr30k/annotations/dataset_flickr30k.json \
    --file_valid_jpgs $DATA_ROOT/flickr30k/annotations/flickr30k_valid_jpgs.json

where beam=1 for split=val and beam=5 for split=test, and epoch selects the checkpoint saved at that epoch.

Testing

For all datasets, the checkpoints (by epoch) with the best validation performance (CIDEr for captioning and overall accuracy for VQA) are evaluated on the test set (Test-Dev and Test-Standard for VQA 2.0).

Misc

The Detectron-based feature extraction code is available under this repo. You need to download this config file and checkpoint file.

List of download commands (only for OneDrive):

wget -O caption_cc_val.json "https://onedrive.live.com/download?cid=E5364FD183A1F5BB&resid=E5364FD183A1F5BB%212017&authkey=AHy5eiJM75RwPxg"

# data
wget -O COCO00 "https://onedrive.live.com/download?cid=E5364FD183A1F5BB&resid=E5364FD183A1F5BB%212019&authkey=ACn4bwZ0nmZ0nik"
wget -O COCO01 "https://onedrive.live.com/download?cid=E5364FD183A1F5BB&resid=E5364FD183A1F5BB%212018&authkey=AHoTGG-7-6kwoAY"
wget -O flickr30k.tar.gz "https://onedrive.live.com/download?cid=E5364FD183A1F5BB&resid=E5364FD183A1F5BB%212015&authkey=AFZ2iehPM8HREeA"
wget -O CC.tar.gz "https://onedrive.live.com/download?cid=E5364FD183A1F5BB&resid=E5364FD183A1F5BB%213781&authkey=ANA--esfJnWIKIE"

# UniLM checkpoint
wget -O bert_save.tar.gz "https://onedrive.live.com/download?cid=E5364FD183A1F5BB&resid=E5364FD183A1F5BB%212016&authkey=AB5-lxzCkgpfLhg"

# pre-training checkpoints
wget -O cc_g8_lr1e-4_batch512_s0.75_b0.25.tar.gz "https://onedrive.live.com/download?cid=E5364FD183A1F5BB&resid=E5364FD183A1F5BB%212026&authkey=AH98pIVaNS4apSI"

# fine-tuning checkpoints
wget -O coco_g8_lr3e-5_batch512_ft_from_s0.75_b0.25.tar.gz "https://onedrive.live.com/download?cid=E5364FD183A1F5BB&resid=E5364FD183A1F5BB%212028&authkey=AEjQxFF1FcBK-Aw"
wget -O coco_g4_lr1e-6_batch64_scst.tar.gz "https://onedrive.live.com/download?cid=E5364FD183A1F5BB&resid=E5364FD183A1F5BB%212027&authkey=ACM1UXlFxgfWyt0"
wget -O vqa2_g2_lr2e-5_batch512_ft_from_s0.75_b0.25.tar.gz "https://onedrive.live.com/download?cid=E5364FD183A1F5BB&resid=E5364FD183A1F5BB%212029&authkey=APjfGJd1-nzDO7s"
wget -O flickr30k_g8_lr3e-5_batch512_ft_from_s0.75_b0.25.tar.gz "https://onedrive.live.com/download?cid=E5364FD183A1F5BB&resid=E5364FD183A1F5BB%212030&authkey=AGmfQ0fXcYCQun0"

# Detectron config/model
wget -O e2e_faster_rcnn_X-101-64x4d-FPN_2x-vlp.yaml "https://onedrive.live.com/download?cid=E5364FD183A1F5BB&resid=E5364FD183A1F5BB%212013&authkey=AHIvnE1FcggwiLU"
wget -O e2e_faster_rcnn_X-101-64x4d-FPN_2x-vlp.pkl "https://onedrive.live.com/download?cid=E5364FD183A1F5BB&resid=E5364FD183A1F5BB%212014&authkey=AAHgqN3Y-LXcBvU"

Reference

Please acknowledge the following paper if you use the code:

@article{zhou2019vlp,
  title={Unified Vision-Language Pre-Training for Image Captioning and VQA},
  author={Zhou, Luowei and Palangi, Hamid and Zhang, Lei and Hu, Houdong and Corso, Jason J. and Gao, Jianfeng},
  journal={arXiv preprint arXiv:1909.11059},
  year={2019}
}

Related Projects/Codebase

Acknowledgement

Our code is mainly based on Li Dong et al.'s UniLM repo. Also, a part of the code is based on pytorch-transformers v0.4.0 and ImageCaptioning.pytorch. We thank the authors for their wonderful open-source efforts.

License

This project is licensed under the license found in the LICENSE file in the root directory of this source tree. Portions of the source code are based on the UniLM project and pytorch-transformers v0.4.0 project.

vlp's Issues

coco-caption commit is orphaned and can no longer be checked out

The repo uses kdexd/coco-caption@de6f385. This commit no longer exists in the commit history (https://github.com/kdexd/coco-caption/commits/master); it was likely force-pushed over, or the corresponding branch was deleted. Because of this, CiderD(df=cached_tokens) fails, as the syntax has changed in the latest commit.

Another failure is:

    cocoEval = COCOEvalCap(coco, cocoRes, 'corpus')
TypeError: __init__() takes 3 positional arguments but 4 were given

Do you have a backup of that commit's tree that you can push? Or can you point to another commit that works with this repo?

UnboundLocalError: local variable 'vis_pe' referenced before assignment

Hi, I want to test some images of my own, but failed (I wrote a JSON file for them). It seems that if the args.enable_butd flag is not set, the code raises an UnboundLocalError in seq2seq_loader.py:472 because "vis_pe" is never assigned.

How can I test on images other than COCO or Flickr?
Thanks!

start the caption evaluation...
0%| | 0/1 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "vlp/decode_img2txt.py", line 268, in <module>
    main()
  File "vlp/decode_img2txt.py", line 223, in main
    instances.append(proc(instance))
  File "/opt/tiger/workspace/VLP/vlp/seq2seq_loader.py", line 472, in __call__
    return (input_ids, segment_ids, position_ids, input_mask, self.task_idx, img, vis_pe)
UnboundLocalError: local variable 'vis_pe' referenced before assignment

Values in Region Geometric Information

Hi,

I noticed that the vector containing the region geometric information has 7 values instead of 5 ("top left and bottom right corner coordinates of the region bounding box and one value for its relative area", as described in the VLP paper). What are the last two values meant for? How can we compute them?

Thanks!

Possibility of bugs when not enable butd - decode_img2txt.py and seq2seq_loader.py

Hi,
I am trying to modify your decode_img2txt.py and seq2seq_loader.py to run inference on a single image (not a whole dataset such as COCO).
Therefore, in the case where we do not have region_bbox_file or region_det_file_prefix (so enable_butd is False), should we uncomment these lines for the CNN model: 96, 171, 172, 181, 185?
The other issue (possibly a bug) is in seq2seq_loader.py: vis_pe is only defined in the "else" branch (from line 442, when enable_butd == True), so I wonder how to define it when enable_butd == False? Could it be []?
I am a newbie in NLP, so this is a little confusing for me. Thank you very much for open-sourcing your great work 💯

Pretrained Model

Is there any pretrained model checkpoint available, so that I can test the model without having to train it myself?

Thanks for your help!

Performance improvements for data loading process

Hi,

First of all, thanks a lot for releasing this codebase to the public. I was playing with the code and I realised that you assume all the data must be loaded into memory before starting the training process. I suspect this procedure might not scale well with very large datasets. A solution could be defining the Img2TextDataset as an IterableDataset that supports streams of data.

I've noticed that the current implementation of the dataset has already an __iter__ method. However, it seems to me that there might be an issue in the way you sample the elements contained in a given batch. Specifically, as specified in the seq2seq_loader, for every batch you use randint(0, len(self.ex_list)-1) to sample a given example index. This is incorrect because randint won't guarantee that the sampled elements are going to be unique.

I might have soon a fix for this so I can send you a PR if you like :)

Thank you in advance for your answer!

Alessandro

Require a quick start for simple usage...

Hi, I just want to test the captioning results on some raw images. I have read vlp/decode_img2txt.py, but the settings are a bit complicated for me, for example the expected size of an input image.

It would be very kind of you to provide a simple usage example.

I really appreciate any help you can provide.

Other Pretrained Models

Hi @LuoweiZhou, would it be possible to provide other pre-trained checkpoints, such as cc_g8_lr1e-4_batch512_s0.25_b0.75.tar.gz or cc_g8_lr1e-4_batch512_s0_b1.tar.gz? Many thanks.

Finetuning and Testing on separate dataset

Hi,

Would you have directions for finetuning and testing on a separate dataset? For instance, preparing the dataset, running the model on the dataset, and collecting error metrics?

Thanks

Multiple GPUs Support

The provided fine-tuning script fails on a multi-GPU machine:

Traceback (most recent call last):
  File "run_img2txt_dist.py", line 621, in <module>
    main()
  File "run_img2txt_dist.py", line 546, in main
    iter_bar.set_description('Iter (loss=%5.3f)' % loss.item())
ValueError: only one element tensors can be converted to Python scalars

What is the recommended way to run parallel training?
Thanks :)

When will you release the visual feature detectron code?

Hi Luowei,
Recently I have been trying to extract Detectron visual features following your guideline here, but I still cannot replicate the features. I use the recommended docker image and refer to the preliminary scripts you sent me by email. However, your script extract_features_luowei.py is actually incompatible with the Detectron model in housebw's repo. For example, the im_detect_bbox method (in the screenshot below) should be imported from core.test instead of core.test_engine.
[screenshot of the import error omitted]

I also found several other incompatibility issues when trying to run your code inside the provided docker environment. Even after fixing some of them and running the code successfully, I encounter errors such as:

WARNING cnn.py:  40: [====DEPRECATE WARNING====]: you are creating an object from CNNModelHelper class which will be deprecated soon. Please use ModelHelper object with brew module. For more information, please refer to caffe2.ai and python/brew.py, python/brew_test.py for more information.
INFO net.py:  59: Loading weights from: /export/home/vlp_data/e2e_faster_rcnn_X-101-64x4d-FPN_2x.pkl
I1029 08:52:12.747747  2485 net_dag_utils.cc:118] Operator graph pruning prior to chain compute took: 0.000186569 secs
I1029 08:52:12.747993  2485 net_dag.cc:61] Number of parallel execution chains 36 Number of operators = 371
I1029 08:52:12.770298  2485 net_dag_utils.cc:118] Operator graph pruning prior to chain compute took: 0.000165061 secs
I1029 08:52:12.770571  2485 net_dag.cc:61] Number of parallel execution chains 30 Number of operators = 358
/export/home/vlp_data/coco_raw/coco_tiny/COCO_test2014_000000000001.jpg
terminate called after throwing an instance of 'caffe2::EnforceNotMet'
  what():  [enforce fail at blob.h:94] IsType<T>(). wrong type for the Blob instance. Blob contains nullptr (uninitialized) while caller expects caffe2::Tensor<caffe2::CUDAContext> .
Offending Blob name: gpu_0/rois_fpn2.
Error from operator: 
input: "gpu_0/fpn_res2_2_sum" input: "gpu_0/rois_fpn2" output: "gpu_0/roi_feat_fpn2" name: "" type: "RoIAlign" arg { name: "pooled_h" i: 7 } arg { name: "sampling_ratio" i: 2 } arg { name: "spatial_scale" f: 0.25 } arg { name: "pooled_w" i: 7 } device_option { device_type: 1 cuda_gpu_id: 0 }
*** Aborted at 1572339134 (unix time) try "date -d @1572339134" if you are using GNU date ***
terminate called recursively
terminate called recursively
PC: @     0x7fe85f011428 gsignal
*** SIGABRT (@0x9b5) received by PID 2485 (TID 0x7fe7c0ffd700) from PID 2485; stack trace: ***
    @     0x7fe85f3b7390 (unknown)
    @     0x7fe85f011428 gsignal
    @     0x7fe85f01302a abort
    @     0x7fe8590bf84d __gnu_cxx::__verbose_terminate_handler()
terminate called recursively
    @     0x7fe8590bd6b6 (unknown)
    @     0x7fe8590bd701 std::terminate()
    @     0x7fe8590e8d38 (unknown)
    @     0x7fe85f3ad6ba start_thread
    @     0x7fe85f0e341d clone
    @                0x0 (unknown)
Aborted (core dumped)

Due to these incompatibility issues, I find it pretty difficult to extract the same visual features as yours. But if we use other detection codebases like Detectron2 or mmdetection, we cannot use your pre-trained models. Therefore, I would like to ask when you can fully release your Detectron code (code and Python environment), which would be extremely helpful for those, like me, planning to apply your VLP model to their own datasets. Looking forward to your reply :)

Regarding the data count mismatch in the Flickr30k annotations JSON file

After untarring flickr30k.tar.gz, the .h5 file under the region_feat_gvd_wo_bgd folder covers a total of 31,783 images, and the Flickr30k image set counts to the same number. But under the annotations folder, dataset_flickr30k.json has only 31,014 entries. Not all images seem to be covered by the annotations JSON file? Kindly clarify and help!

About pre-computed image features

Thank you for uploading the pre-computed image features. I have downloaded the COCO data you provided on the Baidu cloud disk, but after unzipping the file, I found that feat_cls_1000 seems to lack the file coco_detection_vg_100dets_gvd_checkpoint_trainval_feat385.h5. Can you provide it separately?

Data Download Problem

Hi, thank you for your interesting work!
I have a problem when trying to download the provided dataset annotations and features, since the OneDrive link cannot be accessed from China without a VPN, so it is difficult for me to prepare the data on my Ubuntu machine.
Do you have any advice for solving this problem? Or could you please provide another download link for the MSCOCO data that is easily accessible from China?
Thank you!

adding specific tokens to vocabulary

Hi Luowei,

Thanks for sharing this repo! I am trying to adapt it to a specific task in which I wish to keep some tokens unsplit (thousands of tokens). Is there a way I could do that? I tried to add tokens to the BERT vocabulary file but could not find the file. Thanks, and looking forward to your reply!

Chinese image captioning: multiple words of the same type appear in the results

Hello, I am using the COCO dataset with a two-layer LSTM model: one layer for top-down attention and one layer for the language model.

I extract words with jieba. I used all words that occur more than 3 times in the image descriptions as the dictionary, 14,226 words in total:
words = [w for w in word_freq.keys() if word_freq[w] > 3]

After training the model, multiple words of the same type appear in its outputs, such as:

Note notebook laptop computer on bed
A little girl little girl girl standing together

How can I solve this problem?

The JSON file of the dataset is needed

We intend to pre-train your model on our own dataset, but we don't know how to structure our data to fit your model. In addition, the COCO dataset is too large and we don't have enough space, so we cannot inspect the JSON file in which you organize your training data.
Could you please share your JSON file so that we can learn how to format our own data? Thanks a lot!

error: command 'gcc' failed with exit status 1 raised in installing apex

Hi,
I ran ./setup.sh after creating the conda environment from the provided yml. However, an error is raised by gcc when building apex.

/home/chenyutong/anaconda3/lib/python3.7/site-packages/torch/include/c10/util/Exception.h:338:3: error: expected ‘;’ before ‘do’
   do {                                                    \
   ^
csrc/scale_check_overflow.cpp:23:3: note: in expansion of macro ‘AT_CHECK’
...
error: command 'gcc' failed with exit status 1

Does anyone know what the problem is here? Thanks a bunch!!

Bert Model

It seems there is no link for the pre-trained BERT model. Where can I get it for inference?

Thanks

Unable to download features from Google Drive

Hi,

The feature download link is not available for me and the file is very big. Could you please provide another link, such as a Baidu disk link?
When I click the Google Drive link, the following error appears:
[screenshot of the error omitted]

Thanks

Google Drive Link for 509G Feat Invalid, Please Help With An Alternate Download Link

@LuoweiZhou @darkmatter08 @erjanmx @hamidpalangi

I'm trying to download the 509G and 468G files from https://drive.google.com/file/d/14mr49-14-ZjJXOohInzoOLBZlJb_y7fh/view?usp=sharing and https://drive.google.com/file/d/1kRlnQJcTjGFaOHSptekgG98MiCsTQYDt/view?usp=sharing but these links are not valid.

To prepare for the pre-training, first download and uncompress our pre-processed Conceptual Captions (CC) data(6GB) and place under your data root. Then, download and uncompress the region features from Google Drive (feat(509GB), cls(468GB)) under the CC/region_feat_gvd_wo_bgd/feat_cls_1000_float16 dir. To evaluate CC on caption generation, download the reference file and place it under coco-caption/annotations.

I'll be glad if an alternate link could be provided.

[BUG]: AttributeError: 'Tensor' object has no attribute 'append'

Hi,

I just spotted a bug in the training script run_img2txt_dist.py. Specifically, when running the code with multiple GPUs the following exception is raised:

Traceback (most recent call last):                                                                                                                                                
  File "vlp/run_guesswhat_dist.py", line 625, in <module>
    main()
  File "vlp/run_guesswhat_dist.py", line 543, in main
    vqa2_loss.append(ans_loss.item())
AttributeError: 'Tensor' object has no attribute 'append'

Unfortunately, this is because at line https://github.com/LuoweiZhou/VLP/blob/master/vlp/run_img2txt_dist.py#L542 you're overriding vqa2_loss, which becomes a torch.Tensor, so the append call at line 543 breaks.

Changing line 542 to ans_loss = ans_loss.mean() should fix the error.

Optimizer state dict for RL fine-tuning

Hi,

when you start the RL fine-tuning for image captioning, do you load the optimiser state that you used during the SL phase, or do you start with a brand-new optimiser used just for the RL training phase? In your current codebase I can see that you completely disabled optimiser state checkpointing.

Many thanks,

Alessandro

File Not Found:feat_cls_1000_float16/cc_detection_vg_100dets_gvd_checkpoint_trainval

Thanks a lot for sharing this useful repo. we are trying to reproduce the finetuning result on flicker30k, but an error occurs which say "feat_cls_1000_float16/cc_detection_vg_100dets_gvd_checkpoint_trainval not found". I found this error is about this line of code, parser.add_argument('--region_det_file_prefix', default='feat_cls_1000/coco_detection_vg_100dets_gvd_checkpoint_trainval', type=str) . So where can I get access of "coco_detection_vg_100dets_gvd_checkpoint_trainval"? In my folder of feat_cls_1000, there are only "flickr30k_detection_vg_thresh0.2_feat_gvd_checkpoint_trainvaltest.h5" and "trainval".

Question about seq2seq decoder generation

Hi,

I'm having quite a few issues in generating sound sentences using your code. In particular, I'm using your class BertForSeq2SeqDecoder but I really don't understand how it works after the very first generation step.

In the forward pass of the decoder, at the very first generation step your x_input_ids are supposed to encode both the input ids and the [MASK] token (at the end of the sequence). This looks reasonable to me because you're allowing the model to 'look' at the entire input sequence and to generate an hidden state for the [MASK] token. This hidden state is then used to generate a probability distribution over the vocabulary size. After the second step, I start having some issues in understanding how the code works. After the first step, you consider as curr_ids only the last generated token and discard all the actual input ids (https://github.com/LuoweiZhou/VLP/blob/master/pytorch_pretrained_bert/modeling.py#L1249). This single token is then concatenated with the [MASK] as before to predict the next token. I can see that this time you give in input to the model the prev_embeddings (https://github.com/LuoweiZhou/VLP/blob/master/pytorch_pretrained_bert/modeling.py#L1218). My question is: how is the model actually exploiting the previously computed representations? The code used for the BertEncoder in your codebase (https://github.com/LuoweiZhou/VLP/blob/master/pytorch_pretrained_bert/modeling.py#L382) seems not really aligned with the current code used in Huggingface Transformers (https://github.com/huggingface/transformers/blob/master/transformers/modeling_bert.py#L369) so I'm having trouble understanding how it actually works.

Could you please provide me with some details about this process?

Thanks a lot!

Alessandro

Can't reproduce the scst results

Hi, thanks for the great work.
I followed the steps in the README but was unable to reproduce the SCST results (129.3 CIDEr in the paper); I got the highest performance in the first epoch of SCST: 120.5 CIDEr.

My script:
python vlp/run_img2txt_dist.py --output_dir $CHECKPOINT_ROOT/${checkpoint_coco_ce} \
    --do_train --new_segment_ids --always_truncate_tail --amp \
    --src_file $DATA_ROOT/COCO/annotations/dataset_coco.json \
    --file_valid_jpgs $DATA_ROOT/COCO/annotations/coco_valid_jpgs.json \
    --image_root $DATA_ROOT/COCO/region_feat_gvd_wo_bgd --enable_butd --s2s_prob 1 --bi_prob 0 \
    --train_batch_size 16 --max_pred 0 --mask_prob 0 --scst \
    --model_recover_path "coco_g8_lr3e-5_batch512_ft_from_s0.75_b0.25/model.28.bin"

I use 4 GPUs, the training takes 30 epochs, and the batch size is set to 16. I set --model_recover_path to the model you provided in the link.

I would like to know if I am doing something wrong that prevents me from reproducing the results in the paper. Thanks~

Finetuning checkpoints link not working

Hi, thanks for releasing the code. The finetuning checkpoints linked in the README file are not available at the moment. Is there any way we could access them?

Dimension of Position Embedding

The original paper describes the position embedding as a 5-dimensional vector containing the coordinates of the top-left and bottom-right corners plus a relative area. But in the data preprocessing script (seq2seq_loader.py), the variable vis_pe, which I figure stands for the position embedding, has a dimension of 6. What causes the difference, and what is the extra value used for?

Segmentation fault (core dump) while running evaluation

A segmentation fault (core dump) occurs while running the Flickr30k evaluation. The script is as follows:
beam=5
CHECKPOINT_ROOT=./flickr30k_g8_lr3e-5_batch512_ft_from_s0.75_b0.25
split=val
DATA_ROOT=./data

python vlp/decode_img2txt.py \
--model_recover_path $CHECKPOINT_ROOT/model.21.bin \
--new_segment_ids --batch_size 100 --beam_size ${beam} --enable_butd \
--image_root $DATA_ROOT/flickr30k/region_feat_gvd_wo_bgd/ --split ${split} \
--dataset flickr30k \
--region_bbox_file flickr30k_detection_vg_thresh0.2_feat_gvd_checkpoint_trainvaltest.h5 \
--src_file $DATA_ROOT/flickr30k/annotations/dataset_flickr30k.json \
--file_valid_jpgs $DATA_ROOT/flickr30k/annotations/flickr30k_valid_jpgs.json

Is there any chance that I am using the wrong CUDA/cuDNN environment?

Not able to load pre-trained model on CC dataset

Hi Luowei,

Thank you so much for sharing the pre-trained model and with such detailed instructions.

I followed the instructions to install VLP; I am using Python 3.6 and PyTorch 1.1.0. I also downloaded the checkpoint file pre-trained on the CC dataset (cc_g8_lr1e-4_batch512_s0.75_b0.25.tar.gz).

However, when I load the model in the code (line 344 in run_img2txt_dist.py):
model_recover = torch.load(args.model_recover_path)

I received the following error:
Traceback (most recent call last):
  File "vlp/run_img2txt_dist.py", line 629, in <module>
    main()
  File "vlp/run_img2txt_dist.py", line 362, in main
    model_recover = torch.load(args.model_recover_path, encoding='latin1')
  File "/home/ubuntu/anaconda3/envs/vlp/lib/python3.6/site-packages/torch/serialization.py", line 387, in load
    return _load(f, map_location, pickle_module, **pickle_load_args)
  File "/home/ubuntu/anaconda3/envs/vlp/lib/python3.6/site-packages/torch/serialization.py", line 564, in _load
    magic_number = pickle_module.load(f, **pickle_load_args)
_pickle.UnpicklingError: invalid load key, '\x1f'.

I tried to load the file (cc_g8_lr1e-4_batch512_s0.75_b0.25.tar.gz) using encoding="latin1", but the error persists.

Could you please let me know where might be the error? Thank you!

Regards,

Chenye

Bus error (core dump) during training

Hi, thanks for sharing this project.

I prepared all the required h5py files and caption annotation files for COCO Captions fine-tuning as instructed in the README. The training went normally at the beginning, but got killed (bus error (core dumped)) after around 70k~100k iterations.

I wonder if it was an out-of-memory issue caused by data loading. It seemed that huge memory was progressively consumed by the program, perhaps due to reading more and more image features from h5py files. Using del or gc.collect() didn't help free unreferenced objects' memory.

Is there any good solution to save memory for multimodal training? Or any idea of what was going on in my case? Thanks a lot!

Getting `No module named 'apex.optimizers'` error

Hello, Thanks for your work.

Currently, I am trying to run inference on the Flickr features. I have installed apex as per the instructions in setup.sh, with the same apex commit (1603407bf49c7fc3da74fceb6a6c7b47fece2ef8) mentioned there. My PyTorch version is 1.6.0+cu101, which is different from the one in misc/vlp.yml. While installing apex I get the following error message: error: command 'gcc' failed with exit status 1. When I try running the Flickr inference code I get the error below:

Traceback (most recent call last):
  File "vlp/decode_img2txt.py", line 19, in <module>
    from pytorch_pretrained_bert.tokenization import BertTokenizer, WhitespaceTokenizer
  File "/home/default/ephemeral_drive/work/image_captioning/VLP/pytorch_pretrained_bert/__init__.py", line 6, in <module>
    from .optimization_fp16 import FP16_Optimizer_State
  File "/home/default/ephemeral_drive/work/image_captioning/VLP/pytorch_pretrained_bert/optimization_fp16.py", line 4, in <module>
    from apex.optimizers import FP16_Optimizer
ModuleNotFoundError: No module named 'apex.optimizers'

I tried installing the latest apex version as per the instructions here. I get a 'Successfully installed apex-0.1' message, but when I run the inference code I get the error below.

Traceback (most recent call last):
  File "vlp/decode_img2txt.py", line 19, in <module>
    from pytorch_pretrained_bert.tokenization import BertTokenizer, WhitespaceTokenizer
  File "/home/default/ephemeral_drive/work/image_captioning/VLP/pytorch_pretrained_bert/__init__.py", line 6, in <module>
    from .optimization_fp16 import FP16_Optimizer_State
  File "/home/default/ephemeral_drive/work/image_captioning/VLP/pytorch_pretrained_bert/optimization_fp16.py", line 4, in <module>
    from apex.optimizers import FP16_Optimizer
ImportError: cannot import name 'FP16_Optimizer'

It seems the optimizers are different in the latest apex commit. Would you recommend replacing the optimizer in your code with any of the current ones, or do you have any other suggestion for resolving the issue? I am not able to use the conda environment you mentioned as a requirement, as I am working on a controlled-access machine and do not have the liberty to do all the installations.

Some missing *_cls_prob.npy files in Flickr30k

Hi Luowei :)

I am trying to finetune a model on Flickr30k using https://github.com/LuoweiZhou/VLP#flickr30k-captions
and the training begins successfully. But in certain batches, there are missing *_cls_prob.npy files in the $DATA_ROOT/flickr30k/region_feat_gvd_wo_bgd/trainval data folder.

I downloaded the Flickr30k data twice and the issue persists. Do you have any idea why this might be happening?
Any pointers for how to best fix this?

Thanks and best,
Shruti

Unable to reproduce image features for COCO and CC

Hi Luowei --

I'm unable to reproduce the image features that you've published here for COCO and CC.
I've trained and evaluated the model using your provided features as well as my own extracted features on the VQA2 task (VQA2 uses COCO images). There is still an outstanding gap in performance: while you report 67.4, I can only achieve 64.3, a significant 3-point gap. I am wondering if others have encountered a similar problem and how they resolved it.

I've extracted my own features using the script you shared with me privately (slightly modified to resolve dependency issues). Using the housebw/detectron image and your provided Detectron checkpoint .pkl and config .yaml, I generate different features than yours. Comparing image by image, I have different values in the tensors/matrices. I also get different aggregate statistics (min, max, mean, variance) for the features, image by image. The same is true for CC as well. I've also confirmed it is not a precision issue (float16 vs. float32).

As it stands, I cannot replicate your results despite my best efforts to follow all your provided documentation, using the same environment, code, data dependencies, and source data.

I am attempting to use your SOTA model on a new dataset/task. Not being able to replicate your results is an impediment...

Thanks,
Shawn

What is GPU memory size of your V100? (ERROR: Unexpected bus error encountered in worker)

Hi,
I am trying to use one V100 GPU with 16GB memory to run the fine-tuning on the COCO image captioning task and always encounter the error "ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm)".
What is the GPU memory of your V100, and what are the recommended configurations (e.g., batch_size, num_workers) for running COCO captioning fine-tuning on my cluster with 4x V100 GPUs of 16GB memory? Thanks!

Why is the dimension of the region object labels 1601?

Hi, I downloaded the extracted features and found that the shape of the region object labels is (100, 1601) for each image. Why are there so many classes? I also notice that the shape of the region geometric information is (100, 6) for each image. What does each value in the second dimension represent?

Invalid UniLM checkpoints

Hi Luowei,
thanks for releasing the code to the public.
I find that the 'UniLM checkpoints' link is invalid now; could you release an accessible link again?
I see that you use BERT-base as the Transformer backbone, as stated in your paper, and that the weights of your BERT model are initialized from UniLM. However, in the UniLM paper and their GitHub repo they only explore BERT-large, so I don't know whether the UniLM checkpoint you use is BERT-base or BERT-large.
