Towards Robust Referring Video Object Segmentation with Cyclic Relational Consistency

Xiang Li, Jinglu Wang, Xiaohao Xu, Xiao Li, Bhiksha Raj, Yan Lu

Updates

(2023-05-30) Code released.
(2023-07-13) R2VOS is accepted to ICCV 2023!

Install

conda install pytorch==1.8.1 torchvision==0.9.1 torchaudio==0.8.1 -c pytorch
pip install -r requirements.txt 
pip install 'git+https://github.com/facebookresearch/fvcore' 
pip install -U 'git+https://github.com/cocodataset/cocoapi.git#subdirectory=PythonAPI'
cd models/ops
python setup.py build install
cd ../..
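
After installing, a quick sanity check can confirm that the pinned versions above were actually picked up. The following is a minimal sketch using only the standard library (Python 3.8 or newer); the helper name `check_pins` is hypothetical and not part of this repo. Note that the conda package `pytorch` is registered under the distribution name `torch`.

```python
from importlib import metadata

# Versions pinned by the conda command above.
PINNED = {"torch": "1.8.1", "torchvision": "0.9.1", "torchaudio": "0.8.1"}


def check_pins(pins, get_version=metadata.version):
    """Return {package: (expected, found)} for every mismatch or missing package."""
    problems = {}
    for name, expected in pins.items():
        try:
            found = get_version(name)
        except metadata.PackageNotFoundError:
            found = None
        # Local build tags such as "1.8.1+cu111" still count as a match.
        if found is None or found.split("+")[0] != expected:
            problems[name] = (expected, found)
    return problems


if __name__ == "__main__":
    for name, (expected, found) in check_pins(PINNED).items():
        print(f"{name}: expected {expected}, found {found}")
```

An empty result means all three pins match; any printed line names a package to reinstall.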

Docker

You may try Docker for a quick start.

Weights

Please download checkpoint.pth and put it in the main folder.

Run demo:

Run inference on the images in demo/demo_examples.

python demo.py --with_box_refine --binary --freeze_text_encoder --output_dir=output/demo --resume=checkpoint.pth --backbone resnet50 --ngpu 1 --use_cycle --mix_query --neg_cls --is_eval --use_cls --demo_exp 'a big track on the road' --demo_path 'demo/demo_examples'
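
To try several referring expressions on the same images, the command above can be assembled programmatically. This is a minimal sketch, not part of the repo: `build_demo_command` is a hypothetical helper, and every flag is copied verbatim from the command above rather than taken from any documented API of `demo.py`.

```python
import shlex
import sys

# Flags copied verbatim from the demo command above.
DEMO_FLAGS = [
    "--with_box_refine", "--binary", "--freeze_text_encoder",
    "--backbone", "resnet50", "--ngpu", "1",
    "--use_cycle", "--mix_query", "--neg_cls", "--is_eval", "--use_cls",
]


def build_demo_command(expression, demo_path="demo/demo_examples",
                       checkpoint="checkpoint.pth", output_dir="output/demo"):
    """Return the demo.py argument list for one referring expression."""
    return [
        sys.executable, "demo.py", *DEMO_FLAGS,
        f"--output_dir={output_dir}", f"--resume={checkpoint}",
        "--demo_exp", expression, "--demo_path", demo_path,
    ]


if __name__ == "__main__":
    # Print the command instead of running it; pass the list to
    # subprocess.run(...) from the repo root to actually execute it.
    print(shlex.join(build_demo_command("a big track on the road")))
```

Passing the arguments as a list avoids shell quoting problems when the expression contains spaces or quotes.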

Inference:

To evaluate on Ref-YTVOS, run inference_ytvos.py. If inference over the entire video runs out of memory (OOM), use inference_ytvos_segm.py instead.

python inference_ytvos.py --with_box_refine --binary --freeze_text_encoder --output_dir=output/eval --resume=checkpoint.pth --backbone resnet50 --ngpu 1 --use_cycle --mix_query --neg_cls --is_eval --use_cls --ytvos_path=/data/ref-ytvos
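
The out-of-memory fallback mentioned above presumably processes a video in shorter segments instead of all frames at once. The sketch below only illustrates that general idea; it is not taken from inference_ytvos_segm.py, and `split_into_segments` and `segment_len` are hypothetical names.

```python
def split_into_segments(frames, segment_len):
    """Split a list of frame names into consecutive chunks of at most segment_len."""
    if segment_len <= 0:
        raise ValueError("segment_len must be positive")
    return [frames[i:i + segment_len] for i in range(0, len(frames), segment_len)]


if __name__ == "__main__":
    frames = [f"{i:05d}.jpg" for i in range(10)]
    for segment in split_into_segments(frames, 4):
        # Each segment would be passed to the model separately to bound peak memory.
        print(segment[0], "...", segment[-1], f"({len(segment)} frames)")
```

Shorter segments lower peak memory, but each segment is then processed without the frames outside it, so results may differ from whole-video inference.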

Related works for robust multimodal video segmentation:

R2-Bench: Benchmarking the Robustness of Referring Perception Models under Perturbations, arXiv 2024

Towards Robust Audiovisual Segmentation in Complex Environments with Quantization-based Semantic Decomposition, CVPR 2024

Citation

@inproceedings{li2023robust,
  title={Robust referring video object segmentation with cyclic structural consensus},
  author={Li, Xiang and Wang, Jinglu and Xu, Xiaohao and Li, Xiao and Raj, Bhiksha and Lu, Yan},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
  pages={22236--22245},
  year={2023}
}

Training problem

Could you please tell me which parameters need to be specified for training? I am running python3 main.py with the options --with_box_refine, --binary, --freeze_text_encoder, --ngpu 1, --use_cycle, --mix_query, --use_fg_contra, --neg_cls, --use_cls, --output_dir=ytvos_dirs/resnet50, and --pretrained_weights=pretrained_weights/r50_pretrain.pth. What would the complete command look like?
I always get the following error when training:
Loss is nan, stopping training
{'loss_ce': tensor(0.2781, device='cuda:0', grad_fn=),
 'loss_bbox': tensor(0.8912, device='cuda:0', grad_fn=),
 'loss_giou': tensor(1.0353, device='cuda:0', grad_fn=),
 'loss_mask': tensor(0.0142, device='cuda:0', grad_fn=),
 'loss_dice': tensor(0.1781, device='cuda:0', grad_fn=),
 'loss_cycle_dist': tensor(nan, device='cuda:0', grad_fn=),
 'loss_cycle_angle': tensor(0., device='cuda:0', grad_fn=),
 'loss_cycle_mse': tensor(16.8410, device='cuda:0', grad_fn=),
 'loss_cycle_contrastive': tensor(0., device='cuda:0', grad_fn=),
 'loss_cycle_cls': tensor(0.6948, device='cuda:0', grad_fn=),
 'loss_fg_contra': tensor(1.2444, device='cuda:0', grad_fn=),
 'loss_VQ': tensor(2.7403, device='cuda:0', grad_fn=),
 'loss_ce_0': tensor(0.2831, device='cuda:0', grad_fn=),
 'loss_bbox_0': tensor(0.8821, device='cuda:0', grad_fn=),
 'loss_giou_0': tensor(1.0914, device='cuda:0', grad_fn=),
 'loss_mask_0': tensor(0.0197, device='cuda:0', grad_fn=),
 'loss_dice_0': tensor(0.1742, device='cuda:0', grad_fn=),
 'loss_ce_1': tensor(0.2792, device='cuda:0', grad_fn=),
 'loss_bbox_1': tensor(0.8986, device='cuda:0', grad_fn=),
 'loss_giou_1': tensor(1.0750, device='cuda:0', grad_fn=),
 'loss_mask_1': tensor(0.0141, device='cuda:0', grad_fn=),
 'loss_dice_1': tensor(0.1859, device='cuda:0', grad_fn=),
 'loss_ce_2': tensor(0.2896, device='cuda:0', grad_fn=),
 'loss_bbox_2': tensor(0.8999, device='cuda:0', grad_fn=),
 'loss_giou_2': tensor(1.0165, device='cuda:0', grad_fn=),
 'loss_mask_2': tensor(0.0141, device='cuda:0', grad_fn=),
 'loss_dice_2': tensor(0.1790, device='cuda:0', grad_fn=)}

I wonder whether this is because I am training in non-distributed mode.
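
In the log above, `loss_cycle_dist` is the only term reported as nan, so one way to narrow the problem down is to flag non-finite entries in the loss dictionary before they are summed. A minimal sketch, assuming each value can be converted with `float()` (true for Python floats and single-element PyTorch tensors); `find_non_finite` is a hypothetical helper, not part of the repo.

```python
import math


def find_non_finite(loss_dict):
    """Return the names of loss terms whose value is nan or infinite."""
    return [name for name, value in loss_dict.items()
            if not math.isfinite(float(value))]


if __name__ == "__main__":
    # A few values copied from the log above, as plain floats.
    losses = {
        "loss_ce": 0.2781,
        "loss_cycle_dist": float("nan"),
        "loss_cycle_mse": 16.8410,
        "loss_fg_contra": 1.2444,
    }
    print(find_non_finite(losses))  # prints ['loss_cycle_dist']
```

Knowing which term goes non-finite first shows which part of the loss computation to inspect; it does not by itself say whether the non-distributed setup is the cause.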
