tgdt's Introduction

Efficient Token-Guided Image-Text Retrieval with Consistent Multimodal Contrastive Training

This repo is built on top of VSE++ and TERAN.

Setup

Set up the Python environment using conda:

conda env create --file environment.yml
conda activate gls
export PYTHONPATH=.

Get the data

Download and extract the data folder, which contains the annotations, the splits by Karpathy et al., and the precomputed ROUGE-L and SPICE relevances for both the COCO and Flickr30K datasets:

wget http://datino.isti.cnr.it/teran/data.tar
tar -xvf data.tar

Download the bottom-up features for both COCO and Flickr30K. We use the code by Anderson et al. to extract them. The following commands unpack the features under data/coco/ and data/f30k/. If you prefer another location, be sure to adjust the configuration file accordingly.

# for MS-COCO
wget http://datino.isti.cnr.it/teran/features_36_coco.tar
tar -xvf features_36_coco.tar -C data/coco

# for Flickr30k
wget http://datino.isti.cnr.it/teran/features_36_f30k.tar
tar -xvf features_36_f30k.tar -C data/f30k
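
Note that `tar -C` fails if the target directory does not exist. The layout of data.tar is not documented here, so as a precaution (assuming the archive may not already provide both directories) they can be created before extracting the features:

```shell
# Create the extraction targets if they are missing.
# -p makes this a no-op when data.tar already created them.
mkdir -p data/coco data/f30k
```

Run this from the repository root, before the two `tar -xvf ... -C` commands above.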

Evaluate

Download and extract our pre-trained models.

Then, issue the following commands for evaluating a given model.

# F30K
python3 test.py runs/f30k_m0.3/model_best_rsum.pth.tar --config configs/f30k_global.yaml
python3 test.py runs/f30k_m0.3/model_best_rsum.pth.tar --config configs/f30k_local.yaml
python3 test_gl.py runs/f30k_m0.3/model_best_rsum.pth.tar --config configs/f30k_local.yaml

# COCO
python3 test.py runs/coco_m0.3/model_best_rsum.pth.tar --config configs/coco_global.yaml
python3 test.py runs/coco_m0.3/model_best_rsum.pth.tar --config configs/coco_local.yaml
python3 test_gl.py runs/coco_m0.3/model_best_rsum.pth.tar --config configs/coco_local.yaml
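
The six commands above follow one pattern per dataset, so they can be generated rather than typed. The sketch below is a dry run: `eval_cmds` is a hypothetical helper (not part of this repo) that only prints the commands, and it assumes the checkpoints sit under runs/<dataset>_m0.3/ exactly as shown above.

```shell
# Hypothetical helper: print the three evaluation commands for one dataset.
# The paths mirror the commands listed above; nothing is executed here.
eval_cmds() {
  ds="$1"
  ckpt="runs/${ds}_m0.3/model_best_rsum.pth.tar"
  echo "python3 test.py $ckpt --config configs/${ds}_global.yaml"
  echo "python3 test.py $ckpt --config configs/${ds}_local.yaml"
  echo "python3 test_gl.py $ckpt --config configs/${ds}_local.yaml"
}

# Print the commands for both datasets.
for ds in f30k coco; do
  eval_cmds "$ds"
done
```

To actually run the evaluations, pipe the output to a shell, for example `eval_cmds f30k | sh`.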

Train

To train the model with a given configuration, run the command for the corresponding dataset:

python3 train.py --config configs/f30k_all.yaml --logger_name runs/f30k_m0.3
python3 train.py --config configs/coco_all.yaml --logger_name runs/coco_m0.3

Citation

Please cite this work if you find it useful:

@article{liu2023efficient,
  title={Efficient Token-Guided Image-Text Retrieval with Consistent Multimodal Contrastive Training},
  author={Liu, Chong and Zhang, Yuqi and Wang, Hongsong and Chen, Weihua and Wang, Fan and Huang, Yan and Shen, Yi-Dong and Wang, Liang},
  journal={IEEE Transactions on Image Processing},
  year={2023}
}

tgdt's Issues

Unable to get data set

The data link has expired. Can you provide an accessible link? Thank you!

https://cnrsc-my.sharepoint.com/personal/nicola_messina_cnr_it/_layouts/15/onedrive.aspx?id=%2Fpersonal%2Fnicola%5Fmessina%5Fcnr%5Fit%2FDocuments%2Fweb%2Ftern%5Fteran&ga=1
I also can't access this link.

The results cannot be reproduced

What an outstanding job. I tried to reproduce the results with the code in this repository, but I cannot match the numbers reported in the paper: the rsum differs by tens of points. What can I do to get the paper's results?
Remarkably, the results on Flickr30K and COCO 1K are exactly the same, which is surprising.

Question about the transformer_encoder

Thanks for sharing this great work. I have two questions about it. First, you mention that you use pre-extracted image features, but the code uses a transformer to extract image features instead. This doesn't match the paper. Second, I am wondering why you add transformer_encoder_text and transformer_encoder_img. I have seen your baseline, TERAN, and compared with it you add these two encoders. I don't understand them, especially the .detach() operation. Can you help me with these two questions? Thanks again for this work!
