
zhang-tao-whu / dvis


DVIS: Decoupled Video Instance Segmentation Framework

License: MIT License

Python 92.82% Shell 0.07% C++ 0.71% Cuda 6.40%
offline online ovis segmentation video-instance-segmentation video-panoptic-segmentation

dvis's People

Contributors

zhang-tao-whu


dvis's Issues

Model weights for the VSPW dataset

Very good idea and project!
When I test the model on the VSPW dataset, I can't find the corresponding weights in the Model Zoo. Where can I find them? Thank you!

Is inference not supported on a single GPU?

As the title says. I ran the command:
python train_net_video.py --num-gpus 1 --config-file configs/ovis/DVIS_Offline_R50.yaml --eval-only MODEL.WEIGHTS checkpoints/DVIS_offline_ovis_r50.pth
and it returned:

[09/04 15:40:30 d2.checkpoint.detection_checkpoint]: [DetectionCheckpointer] Loading from checkpoints/DVIS_offline_ovis_r50.pth ...
[09/04 15:40:30 fvcore.common.checkpoint]: [Checkpointer] Loading from checkpoints/DVIS_offline_ovis_r50.pth ...
[09/04 15:40:32 d2.data.common]: Serializing the dataset using: <class 'detectron2.data.common._TorchSerializedList'>
[09/04 15:40:32 d2.data.common]: Serializing 140 elements to byte tensors and concatenating them all ...
[09/04 15:40:32 d2.data.common]: Serialized dataset takes 0.42 MiB
COCO Evaluator instantiated using config, this is deprecated behavior. Please pass in explicit arguments instead.
[09/04 15:40:32 d2.evaluation.evaluator]: Start inference on 140 batches
/home/hs/AIGC/DVIS_ENV/lib/python3.10/site-packages/torch/functional.py:568: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at  ../aten/src/ATen/native/TensorShape.cpp:2228.)
  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
Killed
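
(A hedged note and mitigation sketch: "Killed" with no Python traceback usually means the Linux out-of-memory killer terminated the process, so single-GPU inference is likely running out of memory rather than being unsupported. The keys below are detectron2-standard; whether DVIS's video evaluation path honors them is an assumption.)

from detectron2.config import get_cfg

# Sketch only: shrink the test-time frame size to reduce memory use.
# The concrete values are placeholders, not the released DVIS settings.
cfg = get_cfg()
cfg.INPUT.MIN_SIZE_TEST = 360   # e.g. down from 480
cfg.INPUT.MAX_SIZE_TEST = 640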

VSS bug

Hi, authors,

Thanks for your great work!
I ran into a few problems when running the VSS code; could you please give some suggestions?

  1. How do I train DVIS on VSS? (Which config file should I follow?)
  2. What is the difference between the 480p and 720p datasets?

I ran: python train_net_video.py --num-gpus 4 --config-file configs/VSPW/MinVIS_R50_480p.yaml

The failing line in the traceback is: sem_seg_gt[sem_seg_gt == 0] = 255
ValueError: assignment destination is read-only

Thanks!
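
(A hedged fix sketch for the ValueError above: NumPy refuses in-place writes to arrays backed by read-only buffers, which is likely how sem_seg_gt is produced here, though the thread does not confirm it. Copying before the assignment makes the write legal.)

import numpy as np

# Reproduce and fix the error: an array over an immutable buffer is read-only.
buf = bytes(16)                                   # immutable backing buffer
sem_seg_gt = np.frombuffer(buf, dtype=np.uint8)   # read-only array, as in the traceback
sem_seg_gt = sem_seg_gt.copy()                    # writable copy: the fix
sem_seg_gt[sem_seg_gt == 0] = 255                 # the failing line now succeeds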

dataset

Hello, the dataset I use is YouTube-VIS 2021, and my directory structure is as follows:
ytvis_2021/
  train.json
  valid.json
  train/
    Annotations/
    JPEGImages/
  valid/
    Annotations/
    JPEGImages/
I put them under the datasets directory.

But when I train with this command:
python train_net_video.py --num-gpus 2 --config-file ./configs/youtubevis_2021/swin/DVIS_Offline_SwinL.yaml --resume MODEL.WEIGHTS ./pretrain/DVIS_offline_ytvis21_swinl.pth
The following error occurred:
FileNotFoundError: [Errno 2] No such file or directory: 'datasets/coco/annotations/coco2ytvis2021_train.json'
Do I need a JSON file in COCO format for training?
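
(A hedged workaround sketch: the missing coco2ytvis2021_train.json suggests the config lists a COCO pseudo-video split for joint training. If that file cannot be generated, restricting training to the YTVIS split alone sidesteps the dependency, likely at some cost in accuracy. DATASETS.TRAIN is detectron2's standard key; the dataset name below is an assumption about how the repo registers YTVIS 2021.)

from detectron2.config import get_cfg

# Sketch only: drop the COCO joint-training split from the training datasets.
cfg = get_cfg()
cfg.DATASETS.TRAIN = ("ytvis_2021_train",)   # name assumed; check the repo's dataset registrations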

No detections shown

I tried the demo on a few videos, but even though some instances are detected, they aren't drawn on the output frames. Lowering the score threshold doesn't help.

Training parameters

How can I replicate similar results on only one GPU? Which parameters need to be modified?
Also, can I display only the segmentation masks (without categories and scores) when running demo.py for visualization?
Thank you!
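
(A hedged sketch for the single-GPU question, not an official recipe: detectron2 recipes usually transfer across GPU counts via the linear scaling rule, shrinking batch size and learning rate together and stretching the schedule. The key names are detectron2-standard; the baseline values are placeholders, not the released DVIS settings.)

from detectron2.config import get_cfg

# Sketch: adapt an 8-GPU, batch-8 recipe to a single GPU via linear scaling.
cfg = get_cfg()
cfg.SOLVER.IMS_PER_BATCH = 1          # down from a hypothetical 8
cfg.SOLVER.BASE_LR = 1e-4 / 8         # scale the learning rate with the batch size
cfg.SOLVER.MAX_ITER = 160_000 * 8     # train longer so the total images seen match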

Problem when evaluating DVIS (online) on the OVIS dataset

Hi! Thank you for your great work!
When I try to evaluate DVIS (online) on the OVIS dataset, I only get "nan" results.

I can't find the reason. The only change I made was setting IMS_PER_BATCH in DVIS_Online_R50.yaml from 8 to 4.
Here is my command: python train_net_video.py --num-gpus 4 --config-file configs/ovis/DVIS_Online_R50.yaml --eval-only --resume MODEL.WEIGHTS checkpoints/DVIS_online_ovis_r50.pth
My evaluation log is attached as log.txt.

I hope you can help.

Cannot produce demos

Hi authors,

Thanks for your great work! I tried to produce some demo videos for the VIPSeg dataset, but the segmentation results are poor. Do I need to change any settings to reproduce the results shown in the paper?

Thanks

About the JSON file size

Hi,

Thanks for your excellent work. When I try to evaluate the R-50 DVIS offline model on YTVIS 2019, I find that the generated JSON file is more than 500 MB, which exceeds the CodaLab server limit. Is that a normal result?

Thanks
Kai

A problem training the R50 model on the VIPSeg dataset

Hi! Your work is excellent!
I am trying to train the DVIS_R50 model on the VIPSeg dataset.
Do I need to finetune the segmenter by following "Training on a new dataset"? Or do I just need to follow "Training" and use a MinVIS pretrained weights file like "minvis_ovis_R50.pth"?
I'm looking forward to your answer. Thanks!

How to make a dataset for video instance segmentation model?

Hi! The DVIS model is a great model for video tasks. I've finished labeling my video data. Each object in each frame of each video has segmentation, classification, and persistent ID information, and I built a JSON file like:

{
  "info": {},
  "licenses": [],
  "videos": [],        # video information
  "categories": [],
  "annotations": []    # one entry per instance; an instance is the collection of objects sharing a unique ID across the video's frames
}


I have a question: what format and information does the 'image_instance' of an image contain? In my case, what do I need to do?

Thanks!
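
(A hedged reference sketch, since DVIS's loaders follow the public YouTube-VIS annotation layout: the field names below match the YTVIS release, while every value is made up for illustration.)

# One video plus one instance track, YTVIS-style (all values are placeholders).
ytvis_like = {
    "videos": [{
        "id": 1, "width": 1280, "height": 720, "length": 2,
        "file_names": ["video1/00000.jpg", "video1/00005.jpg"],
    }],
    "categories": [{"id": 1, "name": "person"}],
    "annotations": [{
        "id": 1,            # track id, unique across the dataset
        "video_id": 1,
        "category_id": 1,
        # one entry per frame; None marks frames where the instance is absent
        "segmentations": [[[100.0, 50.0, 140.0, 50.0, 140.0, 130.0, 100.0, 130.0]], None],
        "bboxes": [[100.0, 50.0, 40.0, 80.0], None],
        "areas": [3200.0, None],
        "iscrowd": 0,
    }],
}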

Cannot use the demo file


When I tried to use the demo file, the program stopped with a "keyboard interrupt" error even though I didn't press any key.

Exploring Real-time Video Instance Segmentation with DVIS Model

I am currently using the DVIS model for inference, and it appears to take a directory of video frames in image format as input. I would like to inquire whether it is possible to directly input a video for real-time video instance segmentation.

Is it feasible to configure the DVIS model to work with video input, allowing for real-time video instance segmentation, or is it limited to processing individual frames in image format?

Thank you for your guidance and support!
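
(A hedged bridge sketch while native video input is unconfirmed: decode the video into the directory of image frames the demo reportedly consumes. The OpenCV calls are standard; the output layout and file naming are assumptions, not the repo's documented interface.)

import os
import cv2  # opencv-python

# Sketch: dump a video file into a directory of numbered JPEG frames.
def video_to_frames(video_path: str, out_dir: str) -> None:
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:                 # end of stream
            break
        cv2.imwrite(os.path.join(out_dir, f"{idx:05d}.jpg"), frame)
        idx += 1
    cap.release()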

Some questions about your motivation for instance association

Dear Zhang,

I have read your paper closely, but I am confused about the two principles you proposed; I can't quite see the problem you want to solve.

(1) encourage sufficient interaction between instance representations of adjacent frames to fully exploit their similarity for better association.

You argue that previous works use heuristic methods that lack interaction between frames. But what does "interaction" really mean here? A lot of prior work has inter-frame interaction in different forms, and the previous works you mention use post-processing methods to track. Do you want to emphasize that your method does not need post-processing, or something else? And how do you define whether the interaction is sufficient?

(2) avoid mixing their information during the interaction process to prevent introducing indistinguishable noise that may interfere with association results.

This part also confuses me, especially the word "mixing". What does the pre-frame instance representation mean? Is it the query? The instance feature? In your paper you cite two prior works, references 33 and 9. In [33], the query is simply passed to the next frame. Is that what you call the instance representation? But your work also passes the query from the decoder to the next TD block.
I am just confused about this part of the motivation: the problem you want to solve and your own specific solution. If you could kindly give me a more detailed explanation, I would greatly appreciate it.

Have a nice day.
Rohan

How to Train on New Data

Hello, I am interested in training on a custom dataset using your DVIS model. After reading GETTING_STARTED.md, my understanding is that I should follow these steps:

  1. Finetune the segmenter.
  2. Use the weights from the trained segmenter to train DVIS_Online.
  3. Use the weights from the trained DVIS_Online to train DVIS_Offline.

Could you please confirm if my understanding is correct?

Why doesn't the temporal refiner module work well on my dataset?

Thank you so much for your excellent work!!!
I want to build a temporal refinement mechanism like the one in your work. First, I use a trained model to extract time-series features that fuse temporal information, then feed them into the temporal refinement module, but I found that it doesn't work well.
I think one reason is that my dataset is relatively small, and my feature extractor is CNN-based.
Could you give me some advice? Thank you very much!!!

No detection results from demo.py

Hi,

I am trying to generate some visual results on the VSPW dataset. I used VIPSeg's config and checkpoint, but it detects 0 instances. Could you give me any suggestions?

Thanks!

Checkpoint reproduction

Hi, Author!

When I evaluated the trained models you provided, DVIS_online_r50 and DVIS_offline_r50, the performance was similar to the paper.

To reproduce the checkpoints by retraining, I followed your configuration.

# train the DVIS_Online
python train_net_video.py \
  --num-gpus 8 \
  --config-file ./configs/youtubevis_2019/DVIS_Online_R50.yaml \
  MODEL.WEIGHTS ./ckpt/minvis_ytvis19_R50.pth

The benchmark score of the reproduced online method was similar to the paper.

# train the DVIS_Offline
python train_net_video.py \
  --num-gpus 8 \
  --config-file ./configs/youtubevis_2019/DVIS_Offline_R50.yaml \
  MODEL.WEIGHTS ./ckpt/DVIS_online_ytvis19_r50.pth

However, when I trained the offline method with the command above, the reproduced performance was significantly lower (YTVIS 2019: 1 AP). Could you please check the training code or configuration?
(I did not change anything in your code.)

Can I deploy the DVIS model with ONNX?

Thanks for the super awesome work! I'm really impressed by it :) And I have a quick question:
Can I deploy the DVIS model with ONNX? If not yet, do you have any plans for that?

Train on custom dataset

I can't manage to load an annotated custom image dataset and train on it. How can I do that?
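
(A hedged sketch for the image-level finetuning stage only, assuming the custom data is in COCO format: detectron2's stock registration exposes the dataset under a name the configs can reference. The name and paths below are placeholders; video-level, YTVIS-style data needs the repo's own registration helpers instead.)

from detectron2.data.datasets import register_coco_instances

# Sketch: register a COCO-format image dataset for training.
register_coco_instances(
    "my_train",                        # then set DATASETS.TRAIN = ("my_train",)
    {},                                # extra metadata; none needed here
    "datasets/my_data/train.json",     # COCO-format annotations (placeholder path)
    "datasets/my_data/images",         # image root (placeholder path)
)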

About the transformer denoising blocks (TD)

Thanks for your great work!

As I understand it, from frame 2 onward, the Q, K, and V are the same for every TD block of the same frame, right? The only thing that changes is the ID, which is updated through the L blocks.

So have you ever run an ablation on the effect of the number of blocks L?

A problem submitting results

Hi, I tried to get inference results by submitting result_submission.zip to the server for video panoptic segmentation (Swin-L), but it failed. The errors are shown below; could you please check?

WARNING: Your kernel does not support swap limit capabilities or the cgroup is not mounted. Memory limited without swap.
Traceback (most recent call last):
File "/tmp/codalab/tmp4vzAlN/run/program/score.py", line 815, in
pred_js = pred_j[video_id]
KeyError: '1001_5z_ijQjUf_0'
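
(The KeyError means the scoring script found no prediction entry for video '1001_5z_ijQjUf_0'. A hedged pre-submission check, assuming the prediction JSON is a dict keyed by video id, as the script's pred_j[video_id] lookup suggests; both file names are placeholders.)

import json

# Sketch: list ground-truth video ids that have no prediction entry.
with open("pred.json") as f:
    pred_j = json.load(f)                    # dict keyed by video id (assumed)
with open("val.json") as f:
    gt = json.load(f)
gt_ids = {v["id"] for v in gt["videos"]}     # ground-truth video ids (assumed layout)
print("missing:", sorted(gt_ids - set(pred_j)))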

Will the LSVOS challenge technical report be released?

Congratulations on "DVIS achieved 1st place in the VIS Track of the 5th LSVOS challenge at ICCV 2023".

I want to ask whether you will release the technical report on winning first place.

Or is the released paper code alone enough to achieve it?
