
hodor's Introduction

HODOR: High-Level Object Descriptors for Object Re-segmentation in Video Learned from Static Images (CVPR 2022 Oral)

Ali Athar, Jonathon Luiten, Alexander Hermans, Deva Ramanan, Bastian Leibe

[arXiv] [BibTeX] [Related Workshop Paper] [YouTube]

This repository contains the official code for the above-mentioned publication, as well as for the related workshop paper titled 'Differentiable Soft-Masked Attention', which was presented at the Transformers for Vision (T4V) Workshop at CVPR 2022.

Conceptual Overview

  • Idea: Can we learn to do VOS by training only on static images?
  • Unlike existing VOS methods, which learn pixel-to-pixel correspondences, HODOR learns to encode object appearance information from an image frame into concise object descriptors which can then be decoded into another video frame to "re-segment" that object.
  • We can also train using cyclic consistency on video clips where just one frame is annotated.

(Figure: conceptual overview of HODOR)

Differentiable Soft-Masked Attention

If you followed the workshop paper and want to dive into the implementation of our novel differentiable soft-masked attention, take a look at the PyTorch module here: hodor/modeling/encoder/soft_masked_attention.py. The API is similar to PyTorch's native nn.MultiheadAttention, the main difference being that the forward pass requires a soft attention mask as an additional input.
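
As a rough illustration of how such a module is used (the class name, constructor arguments and tensor shapes below are assumptions for the sketch, not the exact API; see the module file for the real signature):

import torch
from hodor.modeling.encoder.soft_masked_attention import SoftMaskedAttention  # assumed class name

# Assumed constructor arguments, analogous to nn.MultiheadAttention.
attn = SoftMaskedAttention(embed_dim=256, num_heads=8)

queries = torch.randn(10, 2, 256)         # (num_queries, batch, embed_dim), assumed layout
keys = values = torch.randn(900, 2, 256)  # e.g. flattened feature-map locations
soft_mask = torch.rand(2, 10, 900)        # per-query soft mask over the keys, values in [0, 1]

# Unlike nn.MultiheadAttention, the soft mask is a required forward input.
output = attn(queries, keys, values, soft_mask)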

Installation

The following packages are required:

  • Python v3.7.10
  • PyTorch (v1.9.0)
  • Torchvision (v0.10.0)
  • Pillow
  • opencv-python
  • imgaug
  • einops
  • timm
  • tqdm
  • pyyaml
  • tensorboardX
  • pycocotools
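
One possible way to install these dependencies (the pinned PyTorch/Torchvision wheels may instead require the platform-specific command from pytorch.org, depending on your CUDA setup):

pip install torch==1.9.0 torchvision==0.10.0
pip install pillow opencv-python imgaug einops timm tqdm pyyaml tensorboardX pycocotools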

Directory Setup

  1. Set the environment variable HODOR_WORKSPACE_DIR: This is the top-level directory under which all checkpoints will be loaded/saved, and also where the datasets are expected to be located. The directory structure should look like this:
$HODOR_WORKSPACE_DIR
    - dataset_images
        - coco_2017_train
        - davis_trainval
            - seq1
            - seq2
            - ...
        - davis_testdev
            - seq1
            - seq2
            - ...
        - youtube_vos_2019_train
            - seq1
            - seq2
            - ...
        - youtube_vos_2019_val
            - seq1
            - seq2
            - ...
    - dataset_json_annotations
        - coco_2017_train.json
        - davis_train.json
        - davis_val.json
        - youtube_vos_2019_train.json
        - youtube_vos_2019_val.json
    - pretrained_backbones
        - swin_tiny_pretrained.pth
    - checkpoints
        - my_training_session
        - another_training_session

Note that we convert all annotations for COCO, DAVIS and YouTube-VOS into a somewhat standardized JSON format so that data loading code can be easily re-used.
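
As a quick sanity check before training or inference, a small script along these lines (purely illustrative, not part of the repository) can verify that the expected workspace layout is in place:

import os

workspace = os.environ["HODOR_WORKSPACE_DIR"]
expected = [
    "dataset_images/coco_2017_train",
    "dataset_images/davis_trainval",
    "dataset_json_annotations/coco_2017_train.json",
    "dataset_json_annotations/davis_val.json",
    "pretrained_backbones/swin_tiny_pretrained.pth",
    "checkpoints",
]
for rel in expected:
    path = os.path.join(workspace, rel)
    status = "OK     " if os.path.exists(path) else "MISSING"
    print(f"{status} {path}")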

  2. Download annotations and pretrained models: Links to downloadable resources are given below. For the easiest setup, download the entire zipped workspace. This includes all model checkpoints (COCO training + fine-tuning on sparse and dense video) as well as train/val/test annotations in JSON format for all 3 datasets (COCO, DAVIS, YouTube-VOS). Note that you will still have to copy the dataset images to the relevant directories under $HODOR_WORKSPACE_DIR/dataset_images.
Content                                                      URLs
Zipped Workspace (Model Checkpoints + Dataset Annotations)   LINK
Dataset Annotations                                          LINK
Model Checkpoints                                            LINK

Inference

DAVIS 2017 val: Run the following from the repository base directory:

python hodor/inference/main.py $HODOR_WORKSPACE_DIR/checkpoints/static_image/250000.pth --dataset davis_val --output_dir davis_inference_output --temporal_window 7 --min_image_dim 512

This will create a directory called davis_inference_output in $HODOR_WORKSPACE_DIR/checkpoints/static_image and write the output masks to it. Likewise, you can point the script to the checkpoints in video_dense or video_sparse to evaluate those.

YouTube-VOS or DAVIS testdev: To run inference on a different dataset, set the --dataset argument to davis_testdev or youtube_vos_val. For detailed inference options, run the script with --help. Note that you may need to adjust the --min_image_dim and/or --temporal_window options to get the exact results mentioned in the paper for different datasets.
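
For example, a YouTube-VOS run could look like the following (the --temporal_window and --min_image_dim values shown are simply the DAVIS settings from above and may need adjusting, as noted; the output directory name is just an example):

python hodor/inference/main.py $HODOR_WORKSPACE_DIR/checkpoints/static_image/250000.pth --dataset youtube_vos_val --output_dir ytvos_inference_output --temporal_window 7 --min_image_dim 512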

Training

Static Images

For single GPU training on static images from COCO:

python hodor/training/main.py --model_dir my_training_on_static_images --cfg static_image.yaml

For multi-GPU training (e.g. 8 GPUs) on static images from COCO:

python -m torch.distributed.launch --nproc_per_node=8 hodor/training/main.py --model_dir my_training_on_static_images --cfg static_image.yaml --allow_multigpu

The checkpoints provided above were usually trained on 4 or 8 GPUs. Note that we use gradient accumulation so it is possible to train with the default batch size of 8 even on a single GPU, but the results will not be exactly reproducible.
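
To illustrate the gradient accumulation idea in isolation (a generic PyTorch sketch, not HODOR's actual training loop):

import torch
import torch.nn as nn

# Stand-ins for the real model, optimizer and data loader.
model = nn.Linear(16, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
micro_batches = [torch.randn(2, 16) for _ in range(8)]

accumulation_steps = 4  # e.g. 4 micro-batches of 2 samples -> effective batch size 8
optimizer.zero_grad()
for step, x in enumerate(micro_batches):
    loss = model(x).pow(2).mean()            # dummy loss
    (loss / accumulation_steps).backward()   # scale so gradients match one large batch
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()                     # single update per accumulated batch
        optimizer.zero_grad()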

Video

To fine-tune the COCO-trained model on sparse video (i.e. assuming that only one frame per video is annotated in the DAVIS and YouTube-VOS training sets):

python -m torch.distributed.launch --nproc_per_node=8 hodor/training/main.py --model_dir my_finetuning_on_video --cfg video_sparse.yaml --allow_multigpu --restore_path /path/to/coco/trained/checkpoint.pth

Likewise you can set --cfg video_dense.yaml to train with the full set of available training annotations.

Cite

@inproceedings{athar2022hodor,
  title={HODOR: High-level Object Descriptors for Object Re-segmentation in Video Learned from Static Images},
  author={Athar, Ali and Luiten, Jonathon and Hermans, Alexander and Ramanan, Deva and Leibe, Bastian},
  booktitle={CVPR},
  year={2022}
}

hodor's People

Contributors

ali2500


hodor's Issues

Exact configurations to reproduce the numbers in the paper

Thank you for releasing this code.
You mentioned in the README that one would need to adjust the min_image_dim and temporal_window parameters to reproduce the numbers in the paper, but I missed the part where you specify them. Would it be possible to get the parameters used to compute the exact numbers in the paper, please? Thank you :)

Question about the learning rate

Dear Ali,
thanks for making your inspiring work publicly available!
I have a question about the learning rate you use throughout your experiments. E.g., for pre-training on MS COCO, you specify the base_lr after warmup to be 0.0001:

BASE_LR: 0.0001
and you decay by 0.1 after 100,000 steps. Yet, in the appendix of your paper (S4, Learning rate schedule), you state that the learning rate after warmup is 10e-4 (i.e., 0.001). Is this a typo, or am I misunderstanding something?
Thanks for your help!
