
connect-caption-and-trace's Introduction

connect-caption-and-trace

This repository contains the reference code for our paper Connecting What to Say With Where to Look by Modeling Human Attention Traces (CVPR2021).

(Figure: example results)

Requirements

  • Python 3
  • PyTorch 1.5+ (along with torchvision)
  • coco-caption (Remember to follow initialization steps in coco-caption/README.md)

Prepare data

Our experiments cover all four datasets included in Localized Narratives: COCO2017, Flickr30k, Open Images and ADE20k. For each dataset, we need five things: (1) a json file containing image info and word tokens (DATASET_LN.json); (2) an h5 file containing caption labels (DATASET_LN_label.h5); (3) the trace labels extracted from Localized Narratives (DATASET_LN_trace_box/); (4) a json file for coco-caption evaluation (captions_DATASET_LN_test.json); (5) image features (with bounding boxes) extracted by a Mask R-CNN pretrained on Visual Genome.

You can download (1--4) from here (make a folder named data and put (1--3) in it, and put (4) under coco-caption/annotations/).
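After downloading, the layout should look roughly like this (shown for COCO; the other datasets follow the same naming pattern):

data/
    coco_LN.json
    coco_LN_label.h5
    coco_LN_trace_box/
coco-caption/
    annotations/
        captions_coco_LN_test.json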

To get (5), you can use Detectron2. First, install Detectron2, then follow Prepare COCO-style annotations for Visual Genome (we use the pre-trained Resnet101-C4 model provided there). After that, you can use tools/extract_feats.py in Detectron2 to extract features. Finally, run scripts/prepare_feats_boxes_from_npz.py in this repo to prepare features and bounding boxes in separate folders for training.

For the COCO dataset you can also directly use the features provided by Peter Anderson here. The performance is almost the same (around a 0.2% difference).
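As a quick sanity check after preparing (5), you can load one image's files and confirm the shapes agree. This is a minimal sketch, not code from the repo: the per-image file naming, the 'feat' key, and the .npz/.npy extensions follow the conventions of ImageCaptioning.pytorch (which this repo builds on), so adjust them if your prepare_feats_boxes_from_npz.py output differs.

import numpy as np

image_id = "462632"  # any image id from your split

# Region features: one .npz file per image, shape (num_boxes, feature_dim); 'feat' key assumed.
att_feats = np.load("Dir_to_image_features_vg/%s.npz" % image_id)["feat"]

# Bounding boxes: one .npy file per image, shape (num_boxes, 4); assumed (x1, y1, x2, y2).
boxes = np.load("Dir_to_bounding_boxes_vg/%s.npy" % image_id)

# Both arrays should describe the same set of regions.
assert att_feats.shape[0] == boxes.shape[0]
print(att_feats.shape, boxes.shape)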

Training

The --dataset_choice flag selects one of the four datasets (coco, flk30k, openimg, ade20k). The --task can be chosen from trace, caption, c_joint_t and pred_both. The --eval_task can be chosen from trace, caption, and pred_both.

COCO: joint training of controlled caption generation and trace generation (N=2 layers, evaluated on caption generation)

python tools/train.py --language_eval 0 --id transformer_LN_coco  --caption_model transformer --input_json data/coco_LN.json --input_att_dir Dir_to_image_features_vg --input_box_dir Dir_to_bounding_boxes_vg --input_label_h5 data/coco_LN_label.h5 --batch_size 30 --learning_rate 5e-4 --learning_rate_decay_start 0 --scheduled_sampling_start 100 --learning_rate_decay_every 3  --save_checkpoint_every 1000 --max_epochs 30 --max_length 225 --seq_per_img 1 --use_box 1   --use_trace 1  --input_trace_dir data/coco_LN_trace_box --use_trace_feat 0 --beam_size 1 --val_images_use -1 --num_layers 2 --task c_joint_t --eval_task caption --dataset_choice=coco

Open Images: training to generate the caption and trace at the same time (N=1 layer, evaluated on predicting both)

python tools/train.py --language_eval 0 --id transformer_LN_openimg  --caption_model transformer --input_json data/openimg_LN.json --input_att_dir Dir_to_image_features_vg --input_box_dir Dir_to_bounding_boxes_vg --input_label_h5 data/openimg_LN_label.h5 --batch_size 30 --learning_rate 5e-4 --learning_rate_decay_start 0 --scheduled_sampling_start 100 --learning_rate_decay_every 3  --save_checkpoint_every 1000 --max_epochs 30 --max_length 225 --seq_per_img 1 --use_box 1   --use_trace 1  --input_trace_dir data/openimg_LN_trace_box --use_trace_feat 0 --beam_size 1 --val_images_use -1 --num_layers 1 --task pred_both --eval_task pred_both --dataset_choice=openimg

Flickr30k: training of controlled caption generation alone (N=1 layer)

python tools/train.py --language_eval 0 --id transformer_LN_flk30k  --caption_model transformer --input_json data/flk30k_LN.json --input_att_dir Dir_to_image_features_vg --input_box_dir Dir_to_bounding_boxes_vg --input_label_h5 data/flk30k_LN_label.h5 --batch_size 30 --learning_rate 5e-4 --learning_rate_decay_start 0 --scheduled_sampling_start 100 --learning_rate_decay_every 3  --save_checkpoint_every 1000 --max_epochs 30 --max_length 225 --seq_per_img 1 --use_box 1   --use_trace 1  --input_trace_dir data/flk30k_LN_trace_box --use_trace_feat 0 --beam_size 1 --val_images_use -1 --num_layers 1 --task caption --eval_task caption --dataset_choice=flk30k

ADE20k: training of controlled trace generation alone (N=1 layer)

python tools/train.py --language_eval 0 --id transformer_LN_ade20k  --caption_model transformer --input_json data/ade20k_LN.json --input_att_dir Dir_to_image_features_vg --input_box_dir Dir_to_bounding_boxes_vg --input_label_h5 data/ade20k_LN_label.h5 --batch_size 30 --learning_rate 5e-4 --learning_rate_decay_start 0 --scheduled_sampling_start 100 --learning_rate_decay_every 3  --save_checkpoint_every 1000 --max_epochs 30 --max_length 225 --seq_per_img 1 --use_box 1   --use_trace 1  --input_trace_dir data/ade20k_LN_trace_box --use_trace_feat 0 --beam_size 1 --val_images_use -1 --num_layers 1 --task trace --eval_task trace --dataset_choice=ade20k

Evaluating

COCO: joint training of controlled caption generation and trace generation (N=2 layers, evaluated on caption generation)

python tools/train.py --language_eval 1 --id transformer_LN_coco  --caption_model transformer --input_json data/coco_LN.json --input_att_dir Dir_to_image_features_vg --input_box_dir Dir_to_bounding_boxes_vg --input_label_h5 data/coco_LN_label.h5 --batch_size 2 --learning_rate 5e-4 --learning_rate_decay_start 0 --scheduled_sampling_start 100 --learning_rate_decay_every 3  --save_checkpoint_every 1000 --max_epochs 30 --max_length 225 --seq_per_img 1 --use_box 1   --use_trace 1  --input_trace_dir data/coco_LN_trace_box --use_trace_feat 0 --beam_size 5 --val_images_use -1 --num_layers 2 --task c_joint_t --eval_task caption --dataset_choice=coco

COCO: joint training of controlled caption generation and trace generation (N=2 layers, evaluated on trace generation)

python tools/train.py --language_eval 1 --id transformer_LN_coco  --caption_model transformer --input_json data/coco_LN.json --input_att_dir Dir_to_image_features_vg --input_box_dir Dir_to_bounding_boxes_vg --input_label_h5 data/coco_LN_label.h5 --batch_size 30 --learning_rate 5e-4 --learning_rate_decay_start 0 --scheduled_sampling_start 100 --learning_rate_decay_every 3  --save_checkpoint_every 1000 --max_epochs 30 --max_length 225 --seq_per_img 1 --use_box 1   --use_trace 1  --input_trace_dir data/coco_LN_trace_box --use_trace_feat 0 --beam_size 1 --val_images_use -1 --num_layers 2 --task c_joint_t --eval_task trace --dataset_choice=coco

Open Images: training to generate the caption and trace at the same time (N=1 layer, evaluated on predicting both)

python tools/train.py --language_eval 1 --id transformer_LN_openimg  --caption_model transformer --input_json data/openimg_LN.json --input_att_dir Dir_to_image_features_vg --input_box_dir Dir_to_bounding_boxes_vg --input_label_h5 data/openimg_LN_label.h5 --batch_size 2 --learning_rate 5e-4 --learning_rate_decay_start 0 --scheduled_sampling_start 100 --learning_rate_decay_every 3  --save_checkpoint_every 1000 --max_epochs 30 --max_length 225 --seq_per_img 1 --use_box 1   --use_trace 1  --input_trace_dir data/openimg_LN_trace_box --use_trace_feat 0 --beam_size 5 --val_images_use -1 --num_layers 1 --task pred_both --eval_task pred_both --dataset_choice=openimg

Acknowledgements

Some components of this repo were built from Ruotian Luo's ImageCaptioning.pytorch.


connect-caption-and-trace's Issues

Detectron2 Preprocessing

Hi, I'm having trouble following the steps for (5), the image features (with bounding boxes) extracted by a Mask R-CNN pretrained on Visual Genome, specifically the step "Prepare COCO-style annotations for Visual Genome".

Could you elaborate on these steps? I think the issue comes from the deprecation of the repository that was linked; the newer installation no longer allows the same steps.

I would really appreciate detailed preprocessing instructions. Alternatively, if you could provide the preprocessed features directly so the results can be recreated, that would be amazing as well.

Thank you for the amazing work!

How to get npz files from the tsv files for step (5), feature extraction using Detectron2?

Hi!
I am using the already available features by Peter Anderson mentioned in your readme to get features for the MS COCO dataset. The features mentioned here are stored as tsv files, but the files required for training are in npz format. I am getting the following error while training:

python tools/train.py --language_eval 0 --id transformer_LN_coco --caption_model transformer --input_json data/coco_LN.json --input_att_dir test2014/ --input_box_dir data/coco_LN_trace_box --input_label_h5 data/coco_LN_label.h5 --batch_size 30 --learning_rate 5e-4 --learning_rate_decay_start 0 --scheduled_sampling_start 100 --learning_rate_decay_every 3 --save_checkpoint_every 1000 --max_epochs 30 --max_length 225 --seq_per_img 1 --use_box 1 --use_trace 1 --input_trace_dir data/coco_LN_trace_box --use_trace_feat 0 --beam_size 1 --val_images_use -1 --num_layers 2 --task c_joint_t --eval_task caption --dataset_choice=coco

Terminal output:

Warning: coco-caption not available
cider or coco-caption missing
DataLoader loading json file: data/coco_LN.json
vocab size is 8370
DataLoader loading h5 file: data/cocotalk_fc test2014/ data/coco_LN_trace_box data/coco_LN_label.h5
max sequence length in data is 225
read 123287 image features
assigned 118287 images to split train
assigned 5000 images to split val
assigned 5000 images to split test
<class 'captioning.models.TransformerModel_mitr.TransformerModel'>
Traceback (most recent call last):
  File "/Users/mayankamedhe/connect-caption-and-trace/train.py", line 329, in <module>
    train(opt)
  File "/Users/mayankamedhe/connect-caption-and-trace/train.py", line 172, in train
    data = loader.get_batch('train')
  File "/Users/mayankamedhe/connect-caption-and-trace/captioning/data/dataloader.py", line 424, in get_batch
    data = next(self.iters[split])
  File "/Users/mayankamedhe/anaconda3/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 633, in __next__
    data = self._next_data()
  File "/Users/mayankamedhe/anaconda3/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 677, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/Users/mayankamedhe/anaconda3/lib/python3.11/site-packages/torch/utils/data/_utils/fetch.py", line 51, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/Users/mayankamedhe/anaconda3/lib/python3.11/site-packages/torch/utils/data/_utils/fetch.py", line 51, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/Users/mayankamedhe/connect-caption-and-trace/captioning/data/dataloader.py", line 344, in __getitem__
    att_feat = self.att_loader.get(str(self.info['images'][ix]['id']))
  File "/Users/mayankamedhe/connect-caption-and-trace/captioning/data/dataloader.py", line 76, in get
    f_input = open(os.path.join(self.db_path, key + self.ext), 'rb').read()
FileNotFoundError: [Errno 2] No such file or directory: 'test2014/462632.npz'

Could you tell me how to get the npz files required for training? Thanks a lot!
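For reference, a minimal conversion sketch could look like the following. It is not part of this repo: the TSV field names and base64 encoding follow Peter Anderson's published bottom-up-attention schema, while the output file names, the 'feat' key, the .npz/.npy extensions, and the paths are assumptions that should be matched to what the dataloader expects.

import base64
import csv
import os
import sys

import numpy as np

csv.field_size_limit(sys.maxsize)
FIELDNAMES = ["image_id", "image_w", "image_h", "num_boxes", "boxes", "features"]

def tsv_to_npz(tsv_path, att_dir, box_dir):
    os.makedirs(att_dir, exist_ok=True)
    os.makedirs(box_dir, exist_ok=True)
    with open(tsv_path) as f:
        for item in csv.DictReader(f, delimiter="\t", fieldnames=FIELDNAMES):
            num_boxes = int(item["num_boxes"])
            # Features and boxes are stored as base64-encoded float32 arrays.
            feats = np.frombuffer(base64.b64decode(item["features"]),
                                  dtype=np.float32).reshape(num_boxes, -1)
            boxes = np.frombuffer(base64.b64decode(item["boxes"]),
                                  dtype=np.float32).reshape(num_boxes, 4)
            # One file per image, named by image id (e.g. 462632.npz / 462632.npy).
            np.savez_compressed(os.path.join(att_dir, item["image_id"]), feat=feats)
            np.save(os.path.join(box_dir, item["image_id"]), boxes)

# Hypothetical paths: point these at the downloaded TSV and the feature/box dirs used in training.
tsv_to_npz("path/to/downloaded_features.tsv",
           "Dir_to_image_features_vg", "Dir_to_bounding_boxes_vg")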

More details for creating dense word-to-box alignment.

Hi, @zihangm,

Thanks for your nice work!

I am curious about more details on creating the dense word-to-box alignment described in Section 3.1 of your paper. I compared your released coco_LN_trace_box data with the originally released LN annotations and found that the numbers of trace segments for a given image do not match. For example, for the image (id: 322944) in the coco_val split, the number of trace segments in your released data is 13, while in the original release it is 18. Did you apply extra rules for filtering or merging the original trace segments to get a better alignment during preprocessing?

Since I can't find the related preprocessing code in the repo, I would appreciate it if you could share some details.

Thanks,
Jianjie

Some questions about dataset

Hi,
I'm interested in the data files "(2) h5 file containing caption labels (DATASET_LN_label.h5)" and "(3) the trace labels extracted from Localized Narratives (DATASET_LN_trace_box/)".

  1. How are these data files generated?
  2. What is fc_feat in the model's input?
  3. Which image features provided by Peter Anderson are suitable for this task?

box_feats, trace_feats dimension size 5

Hi,

I was attempting to reproduce the model and I have two questions. I saw that box_feats (the bounding boxes of object proposals) and trace_feats (the bounding boxes of traces) have 5 dimensions.

Could you elaborate on what each dimension means? Specifically, what is the 5th dimension and what does its value refer to?

Also, are the bounding boxes expressed in terms of width and height or as two corner coordinates, i.e. (x, y, w, h, ?) or (x1, y1, x2, y2, ?)?

Thank you!
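For context, one common convention in caption models derived from Ruotian Luo's ImageCaptioning.pytorch (acknowledged above as the base of this repo) is to normalize the four corner coordinates by the image width/height and append the relative box area as the fifth value. The sketch below shows that convention; whether this repo uses exactly the same construction is an assumption, not something confirmed here.

import numpy as np

def make_box_feats(boxes, img_w, img_h):
    # boxes: (num_boxes, 4) in pixel coordinates, assumed (x1, y1, x2, y2).
    x1, y1, x2, y2 = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    rel_area = (x2 - x1) * (y2 - y1) / (img_w * img_h)
    # Returns (num_boxes, 5): corners normalized to [0, 1] plus relative box area.
    return np.stack([x1 / img_w, y1 / img_h, x2 / img_w, y2 / img_h, rel_area], axis=1)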
