reclip's Introduction

ReCLIP: A Strong Zero-shot Baseline for Referring Expression Comprehension

This repository contains the code for the paper ReCLIP: A Strong Zero-shot Baseline for Referring Expression Comprehension (ACL 2022).

Setup

This code has been tested on Ubuntu 18.04. We recommend creating a new environment with Python 3.6+ to install the appropriate versions of dependencies for this project. First, install pytorch, torchvision, and cudatoolkit following the instructions in https://pytorch.org/get-started/locally/. Then run pip install -r requirements.txt. Download the ALBEF pre-trained checkpoint and place it at the path albef/checkpoint.pth.
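
For example, a fresh setup might look roughly like the following (the environment name, Python version, and exact PyTorch install command are placeholders; pick the PyTorch command for your CUDA version from https://pytorch.org/get-started/locally/):

conda create -n reclip python=3.8
conda activate reclip
# install pytorch, torchvision, and cudatoolkit per https://pytorch.org/get-started/locally/
pip install -r requirements.txt
# then place the downloaded ALBEF pre-trained checkpoint at albef/checkpoint.pth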

Data Download

Download the preprocessed data files via gsutil cp gs://reclip-sanjays/reclip_data.tar.gz . and extract them using tar -xvzf reclip_data.tar.gz. This data does not include images. Download the images for RefCOCO/g/+ from http://images.cocodataset.org/zips/train2014.zip. Download the images for RefGTA from the original dataset release. NOTE: As stated in the original RefGTA dataset release, the images in RefGTA may only be used "in non-commercial and research uses."
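
For example, the annotations and the COCO train2014 images might be fetched as follows (assuming gsutil, wget, and unzip are available; adjust destinations as needed):

gsutil cp gs://reclip-sanjays/reclip_data.tar.gz .
tar -xvzf reclip_data.tar.gz
wget http://images.cocodataset.org/zips/train2014.zip
unzip train2014.zip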

Results with CLIP/ALBEF/MDETR

The following format can be used to run experiments:

python main.py --input_file INPUT_FILE --image_root IMAGE_ROOT --method {parse/baseline/gradcam/random} --gradcam_alpha 0.5 0.5 --box_method_aggregator sum {--clip_model RN50x16,ViT-B/32} {--albef_path albef --albef_mode itm/itc --albef_block_num 8/11} {--mdetr mdetr_efficientnetB3/mdetr_efficientnetB3_refcocoplus/mdetr_efficientnetB3_refcocog} {--box_representation_method crop,blur/crop/blur/shade} {--detector_file PATH_TO_DETECTOR_FILE} {--cache_path PATH_TO_CACHE_DIRECTORY} {--output_file PATH_TO_OUTPUT_FILE}

(/ is used above to denote different options for a given argument.)

--input_file: should be in .jsonl format (we provide these files for the datasets discussed in our paper; see the Data Download information above).

--image_root: the top-level directory containing all images in the dataset. For RefCOCO/g/+, this is the train2014 directory. For RefGTA, this directory contains three subdirectories called black_wearing, dont_specify, white_wearing.

--detector_file: if not specified, ground-truth proposals are used. For RefCOCO/g/+, the detection files are in reclip_data.tar.gz and have the format {refcoco/refcocog/refcoco+}_dets_dict.json. For RefGTA, the detections are in reclip_data.tar.gz and have the format refgta_{val/test}_{gt/unidet_dt/unidet_all_dt}_output2.json.

For ALBEF, we use ALBEF block num 8 for ITM (following the ALBEF paper) and block num 11 for ITC. Note that several arguments are only required for a particular "method," but they can still be included in the command when using a different method.

Choices for method: "parse" is the full version of ReCLIP that includes isolated proposal scoring and the heuristic-based relation handling system. "baseline" is the version of ReCLIP using only isolated proposal scoring. "gradcam" uses GradCAM, and "random" selects one of the proposals uniformly at random. (default: "parse")

Choices for clip_model: The choices are the same as the model names used in the CLIP repository except that the model names can be concatenated with a comma between consecutive names. (default: "RN50x16,ViT-B/32")

Choices for box_representation_method: This argument dictates which of the following methods is used to score proposals: CPT-adapted, cropping, blurring, or some combination of these. For CPT-adapted, choose "shade". To use more than one method, concatenate them with a comma between consecutive methods. (default: "crop,blur")

For explanations of the other arguments, see main.py.
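
As a concrete sketch, a ReCLIP run on RefCOCOg using ground-truth proposals (no --detector_file) might look like the following; the .jsonl filename and the paths are placeholders, so substitute the annotation file from reclip_data.tar.gz and your own directories:

python main.py --input_file reclip_data/refcocog_val.jsonl --image_root /path/to/train2014 --method parse --box_method_aggregator sum --clip_model RN50x16,ViT-B/32 --box_representation_method crop,blur --cache_path cache/ --output_file refcocog_val_predictions.txt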

Results with UNITER

We recommend creating a new environment for UNITER experiments. See UNITER/requirements.txt for the dependencies/versions that we used for these experiments. The commented-out lines should still be installed, but it may be easier to install them separately rather than installing all packages at once via pip. In particular, we recommend first following the instructions at https://pytorch.org/get-started/locally to install pytorch, torchvision, and cudatoolkit; then cloning https://github.com/NVIDIA/apex and following the instructions in that repository to install apex; then installing horovod via pip install horovod; and finally running pip install -r requirements.txt.

Download the pre-trained UNITER model from https://acvrpublicycchen.blob.core.windows.net/uniter/pretrained/uniter-large.pt and place it inside UNITER/downloads/pretrained/.

To train a model on RefCOCO+, edit UNITER/config/train-refcoco+-large-1gpu.json to have the correct data paths and desired output path. The necessary data files are provided in reclip_data.tar.gz. Run the following command within the UNITER/ directory to train the model:

python train_re.py --config config/train-refcoco+-large-1gpu.json --output_dir OUTPUT_DIR --simple_format

where OUTPUT_DIR is the desired output directory. (Training on RefCOCOg can be done in a similar manner.) Alternatively, you can download our UNITER models trained on RefCOCO+/RefCOCOg:

gsutil cp gs://reclip-sanjays/uniter_large_refcoco+_py10100feats.tar.gz .
gsutil cp gs://reclip-sanjays/uniter_large_refcocog_py10100feats.tar.gz .
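
After downloading, extract each archive in the usual way, e.g. tar -xvzf uniter_large_refcoco+_py10100feats.tar.gz (and likewise for the RefCOCOg archive).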

To evaluate, run bash scripts/eval_{refcoco+/refcocog/refgta}.sh OUTPUT_DIR. Again, you will probably need to modify the data paths in eval_{refcoco+/refcocog/refgta}.sh.

Synthetic relations experiment on CLEVR-like images

To obtain the accuracies for the relations task on synthetic CLEVR-like images (Section 3.2 in our paper), download the data via gsutil cp gs://reclip-sanjays/clevr-dataset-gen.tar.gz . and extract it using tar -xvzf clevr-dataset-gen.tar.gz. Then run python generic_clip_pairs.py --input_file clevr-dataset-gen/spatial_2obj_text_pairs.json --image_root clevr-dataset-gen/output/images --gpu 0 --clip_model RN50x16 to obtain results on the spatial text pair task using the CLIP RN50x16 model. Results for the spatial image pair and non-spatial image/text pair tasks can be obtained by replacing the JSON file name appropriately, and results for the other CLIP models can be obtained by replacing "RN50x16" with the appropriate model name. Results for the ALBEF model can be obtained by specifying the ALBEF path (which should be "albef"); to obtain results with ALBEF ITC, add the --albef_itc flag.
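
For instance, a run of the same spatial text pair task with ALBEF ITC might look like the following; this assumes the ALBEF checkpoint is at albef/checkpoint.pth (see Setup) and that this flag combination is accepted as-is by generic_clip_pairs.py:

python generic_clip_pairs.py --input_file clevr-dataset-gen/spatial_2obj_text_pairs.json --image_root clevr-dataset-gen/output/images --gpu 0 --albef_path albef --albef_itc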

Other

We used UniDet to detect objects for RefGTA. We provide the outputs in reclip_data.tar.gz, but if you would like to run the pipeline yourself, you can clone UniDet https://github.com/xingyizhou/UniDet and use our script in UniDet/extract_boxes.py on the outputs to obtain the desired detections.

We provide input features for UNITER in reclip_data.tar.gz, but if you would like to run the feature extraction yourself, you can clone py-bottom-up-attention https://github.com/airsplay/py-bottom-up-attention and use our script in py-bottom-up-attention/extract_features.py to obtain the features for ground-truth/detected proposals. You should compile the repository (following the directions given in the repository) before running the script.

Acknowledgements

The code in the albef directory is taken from the ALBEF repository. The code in clip_mm_explain is taken from https://github.com/hila-chefer/Transformer-MM-Explainability. The code in UNITER is a slightly modified version of https://github.com/ChenRocks/UNITER. The script py-bottom-up-attention/extract_features.py is adapted from code in https://github.com/airsplay/py-bottom-up-attention. The file clevr-dataset-gen/bounding_box.py is adapted from https://github.com/larchen/clevr-vqa/blob/master/bounding_box.py.

Citation

If you find this repository useful, please cite our paper:

@inproceedings{subramanian-etal-2022-reclip,
    title = "ReCLIP: A Strong Zero-shot Baseline for Referring Expression Comprehension",
    author = "Subramanian, Sanjay  and
      Merrill, Will  and
      Darrell, Trevor  and
      Gardner, Matt  and
      Singh, Sameer  and
      Rohrbach, Anna",
    booktitle = "Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics",
    month = may,
    year = "2022",
    address = "Dublin, Ireland",
    publisher = "Association for Computational Linguistics"
}

reclip's Issues

Question about extracting region proposals in IPS module

Hi.
Thank you for sharing your nice work.

I understand that the detection files in reclip_data.tar.gz include bounding boxes and that the IPS module uses these boxes for cropping and blurring.
How did you extract the bounding boxes? Using a pre-trained object detector, or ground-truth boxes?

Thanks.

Performance much better than the paper says

Thank you for sharing your great work.

I have cloned this repo to Google Drive and run the experiments in Colab by executing python main.py --input_file /mypath/reclip/dataset/reclip_data/refcoco*.jsonl --image_root /mypath/reclip/dataset/train2014/ --method parse --box_method_aggregator sum --clip_model RN50x16,ViT-B/32 --box_representation_method crop,blur --cache_path /mypath/reclip/cache/ --output_file /mypath/reclip/output/refcoco*.txt. However, I found that the results are all 1-9% higher than those in Table 2 of the paper.

My accuracies are as follows:

  • refcocog val 67.913 test 67.069
  • refcoco+ val 52.082 testA 51.659 testB 52.015
  • refcoco val 50.526 testA 47.110 testB 54.877

I wonder why this happened. Thanks.

Required files for RefCOCO are missing

@sanjayss34 @authors Thank you for sharing the code. I downloaded the preprocessed data files via gsutil cp gs://reclip-sanjays/reclip_data.tar.gz, but refcoco_train.jsonl, refcoco_train_gt_boxes10100.pt, and refcoco_val_det_boxes10100.pt are missing. Does RefCOCO share these files with RefCOCO+? Could you please give a brief explanation of how these files are generated? Sincerely, thank you for your help!
