OnlineRefer: A Simple Online Baseline for Referring Video Object Segmentation

Dongming Wu, Tiancai Wang, Yuang Zhang, Xiangyu Zhang, Jianbing Shen

Abstract

Referring video object segmentation (RVOS) aims at segmenting an object in a video following human instruction. Current state-of-the-art methods fall into an offline pattern, in which each clip independently interacts with the text embedding for cross-modal understanding. They usually claim that the offline pattern is necessary for RVOS, yet it models only limited temporal association within each clip. In this work, we break with the previous offline belief and propose a simple yet effective online model using explicit query propagation, named OnlineRefer. Specifically, our approach leverages target cues that gather semantic information and position priors to improve the accuracy and ease of referring predictions for the current frame. Furthermore, we generalize our online model into a semi-online framework to be compatible with video-based backbones. To show the effectiveness of our method, we evaluate it on four benchmarks, i.e., Refer-Youtube-VOS, Refer-DAVIS17, A2D-Sentences, and JHMDB-Sentences. Without bells and whistles, our OnlineRefer with a Swin-L backbone achieves 63.5 J&F and 64.8 J&F on Refer-Youtube-VOS and Refer-DAVIS17, outperforming all other offline methods.
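The core idea of explicit query propagation can be illustrated with a toy sketch. This is not the OnlineRefer implementation (which uses a transformer decoder over real frame features); `propagate_queries` and its mixing weights are purely hypothetical stand-ins showing how each frame's updated queries are carried forward as the next frame's initial queries:

```python
import numpy as np

def propagate_queries(frames, text_embed, num_queries=5, dim=8):
    """Toy per-frame decoding that carries updated queries to the next frame."""
    rng = np.random.default_rng(0)
    queries = rng.standard_normal((num_queries, dim))  # initial learnable queries
    outputs = []
    for feat in frames:
        # stand-in for cross-modal decoding: queries interact with the
        # frame features and the text embedding (the real model uses
        # deformable attention, not this weighted sum)
        updated = 0.5 * queries + 0.3 * feat + 0.2 * text_embed
        outputs.append(updated)
        # explicit query propagation: the updated queries become the next
        # frame's initial queries, carrying semantics and position priors
        queries = updated
    return outputs

frames = [np.ones(8) * t for t in range(4)]  # toy per-frame feature vectors
outs = propagate_queries(frames, np.zeros(8))
print(len(outs))  # one set of output embeddings per frame
```

Because queries are updated frame by frame, the model processes video strictly online rather than attending over a whole clip at once.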

Update

  • (2023/07/18) OnlineRefer is accepted by ICCV 2023. The online mode is released.

Setup

The main setup of our code follows ReferFormer.

Please refer to install.md for installation.

Please refer to data.md for data preparation.

Training and Evaluation

To train and evaluate our online model on Ref-Youtube-VOS with a ResNet-50 backbone, run the following command:

sh ./scripts/online_ytvos_r50.sh

To train and evaluate our online model on Ref-Youtube-VOS with a Swin-L backbone, run the following command:

sh ./scripts/online_ytvos_swinl.sh

To run inference on your own video sequence, run the following command:

python inference_long_videos.py

Note: the models with ResNet-50 are trained on 8 NVIDIA 2080 Ti GPUs, and the models with Swin-L are trained on 8 NVIDIA Tesla V100 GPUs.

Model Zoo

Ref-Youtube-VOS

Please upload the zip file to the competition server.

| Backbone     | J&F  | J    | F    | Pretrain | Model | Submission |
|--------------|------|------|------|----------|-------|------------|
| ResNet-50    | 57.3 | 55.6 | 58.9 | weight   | model | link       |
| Swin-L       | 63.5 | 61.6 | 65.5 | weight   | model | link       |
| Video Swin-B | 62.9 | 61.0 | 64.7 | -        | -     | link       |
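The competition server expects a single zip archive of the predictions. A minimal packaging sketch is below; the `Annotations/<video_id>/<frame>.png` layout is an assumption for illustration, so check the structure the server actually requires against the competition instructions:

```python
import os
import zipfile

# Hypothetical layout: per-video PNG masks under Annotations/<video_id>/.
# Here we create a dummy mask just to make the example self-contained.
os.makedirs("Annotations/demo_video", exist_ok=True)
open("Annotations/demo_video/00000.png", "wb").close()

# Pack the whole Annotations tree into submission.zip, preserving paths.
with zipfile.ZipFile("submission.zip", "w", zipfile.ZIP_DEFLATED) as zf:
    for root, _, files in os.walk("Annotations"):
        for name in files:
            zf.write(os.path.join(root, name))

print(zipfile.ZipFile("submission.zip").namelist())
```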

Ref-DAVIS17

As described in the paper, we report results using the model trained on Ref-Youtube-VOS without fine-tuning.

| Backbone  | J&F  | J    | F    | Model |
|-----------|------|------|------|-------|
| ResNet-50 | 59.3 | 55.7 | 62.9 | model |
| Swin-L    | 64.8 | 61.6 | 67.7 | model |

Citation

If you find OnlineRefer useful in your research, please consider citing:

@inproceedings{wu2023onlinerefer,
  title={OnlineRefer: A Simple Online Baseline for Referring Video Object Segmentation},
  author={Wu, Dongming and Wang, Tiancai and Zhang, Yuang and Zhang, Xiangyu and Shen, Jianbing},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
  pages={2761--2770},
  year={2023}
}

Acknowledgement

Issues

There is a bug when batch_size != 1

Thanks for your great work!
I find that the default value of batch_size in your released code is 1. But when I set batch_size to other values (such as 2 or 4), it doesn't work...
The relevant error message is:

File "/home/fxk/python_proj/tracking_proj/OnlineRefer/models/ops/modules/ms_deform_attn.py", line 99, in forward
sampling_offsets = self.sampling_offsets(query).view(N, Len_q, self.n_heads, self.n_levels, self.n_points, 2)
RuntimeError: shape '[2, 5, 8, 4, 4, 2]' is invalid for input of size 1280
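The numbers in the traceback show what goes wrong: the `.view` asks for shape (2, 5, 8, 4, 4, 2), which needs 2560 elements, but the projection output has only 1280 = 1·5·8·4·4·2, i.e. it was built for batch size 1 somewhere upstream. A minimal numpy sketch of the mismatch (not the actual PyTorch code) is:

```python
import numpy as np

# Shapes from the traceback: N=2 (requested batch), Len_q=5, n_heads=8,
# n_levels=4, n_points=4, and a final coordinate dim of 2.
target = (2, 5, 8, 4, 4, 2)               # requires 2560 elements
flat = np.zeros(1 * 5 * 8 * 4 * 4 * 2)    # only 1280: batch size 1 upstream

try:
    flat.reshape(target)                  # same failure mode as .view()
except ValueError as err:
    print(err)  # size 1280 is incompatible with shape (2, 5, 8, 4, 4, 2)
```

So the fix is to find where a tensor (likely the queries or their mask) is collapsed to batch size 1 before reaching `ms_deform_attn.py`, rather than changing the `.view` itself.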

Is the CUDA operator compilation necessary?

Dear author:
Thank you for your great work. I have encountered some problems compiling the CUDA operator. Is it necessary for running the code? Thank you in advance! Your reply is appreciated very much!

Parameters

I found a parameter named 'reverse_aug', but it is not initialized in opts.py.
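If the code reads `args.reverse_aug` without it being registered, one workaround is to add the flag to the argument parser in opts.py. The `action` and semantics below are assumptions (the name suggests reversing frame order as an augmentation) and should be checked against how the training code actually consumes the value:

```python
import argparse

parser = argparse.ArgumentParser()
# Hypothetical fix: register the missing flag. The store_true action and
# the help text are assumptions, not confirmed by the repository.
parser.add_argument('--reverse_aug', action='store_true',
                    help='assumed: reverse frame order as a data augmentation')

args = parser.parse_args(['--reverse_aug'])
print(args.reverse_aug)  # True
```

With `store_true`, omitting the flag yields `reverse_aug=False`, so existing training scripts keep their behavior unless the flag is passed explicitly.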
