
ClickSEG: A Codebase for Click-Based Interactive Segmentation

Introduction

ClickSEG is a codebase for click-based interactive segmentation, developed on top of the RITM codebase.

What's New?

Compared with the RITM codebase, ClickSEG has the following new features:

1. The official implementation of the following papers.

Conditional Diffusion for Interactive Segmentation (ICCV2021)
FocalClick: Towards Practical Interactive Image Segmentation (CVPR2022)

2. More accurate crop augmentation during training.

The RITM codebase uses albumentations to crop and resize image-mask pairs for training. In this setup the crop size is fixed, which is not suitable for training on a combined dataset with varying image sizes. Besides, the nearest-neighbor interpolation adopted in albumentations causes the mask to have a 1-pixel bias towards the bottom-right, which is harmful for boundary details, especially for the Refiner of FocalClick.

Therefore, we re-wrote the augmentation, which is crucial for the final performance.
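For illustration, here is a minimal sketch (not the exact ClickSEG implementation; the function name and parameters are hypothetical) of a crop-and-resize augmentation for image-mask pairs that samples the crop size relative to each image and avoids nearest-neighbor interpolation on the mask:

```python
import cv2
import numpy as np

def random_crop_resize(image, mask, out_size=384, scale_range=(0.75, 1.4)):
    """Crop a square region whose size is sampled relative to the image,
    then resize both image and mask to a fixed training resolution."""
    h, w = image.shape[:2]
    scale = np.random.uniform(*scale_range)
    crop = min(int(min(h, w) / scale), h, w)
    y0 = np.random.randint(0, h - crop + 1)
    x0 = np.random.randint(0, w - crop + 1)
    img_c = image[y0:y0 + crop, x0:x0 + crop]
    msk_c = mask[y0:y0 + crop, x0:x0 + crop]
    # Bilinear resize for both; re-binarize the mask instead of using
    # nearest-neighbor interpolation, which can shift the mask by a pixel.
    img_r = cv2.resize(img_c, (out_size, out_size), interpolation=cv2.INTER_LINEAR)
    msk_r = cv2.resize(msk_c.astype(np.float32), (out_size, out_size),
                       interpolation=cv2.INTER_LINEAR)
    return img_r, (msk_r > 0.5).astype(np.uint8)
```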

3. More backbones and more train/val data.

We add efficient backbones such as MobileNets and PPLCNet. We train all our models on the COCO+LVIS dataset for the standard configuration. In addition, we train them on a large combined dataset and provide the trained weights to facilitate academic research and industrial applications. The combined dataset consists of 8 datasets with high-quality annotations and diversified scenes: COCO [1], LVIS [2], ADE20K [3], MSRA10K [4], DUT [5], YouTubeVOS [6], ThinObject [7], HFlickr [8].

1. Microsoft COCO: Common Objects in Context
2. LVIS: A Dataset for Large Vocabulary Instance Segmentation
3. Scene Parsing through ADE20K Dataset
4. Salient Object Detection: A Benchmark
5. Learning to Detect Salient Objects with Image-level Supervision
6. YouTube-VOS: A Large-Scale Video Object Segmentation Benchmark
7. Deep Interactive Thin Object Selection
8. DoveNet: Deep Image Harmonization via Domain Verification

4. Dataset and evaluation code for starting from initial masks.

In the FocalClick paper, we propose a new dataset, DAVIS-585, which provides initial masks for evaluation. The dataset can be downloaded from the ClickSEG Google Drive. We also provide evaluation code in this codebase.



User Guidelines

To use this codebase to train/validate your own models, please follow these steps:

  1. Install the requirements by executing
     pip install -r requirements.txt
  2. Prepare the dataset and pretrained backbone weights following: Data_Weight_Preparation.md

  3. Train or validate the model following: Train_Val_Guidance.md



Supported Methods

The trained model weights can be downloaded from the ClickSEG Google Drive.

CDNet: Conditional Diffusion for Interactive Segmentation (ICCV2021)

CONFIG
Input Size: 384 x 384
Previous Mask: No
Iterative Training: No
All entries are NoC 85/90% (average number of clicks needed to reach 85% / 90% IoU).

| Train Dataset | Model | GrabCut | Berkeley | Pascal VOC | COCO MVal | SBD | DAVIS | DAVIS585 from zero | DAVIS585 from init |
|---|---|---|---|---|---|---|---|---|---|
| SBD | ResNet34 (89.72 MB) | 1.86/2.18 | 1.95/3.27 | 3.61/4.51 | 4.13/5.88 | 5.18/7.89 | 5.00/6.89 | 6.68/9.59 | 5.04/7.06 |
| COCO+LVIS | ResNet34 (89.72 MB) | 1.40/1.52 | 1.47/2.06 | 2.74/3.30 | 2.51/3.88 | 4.30/7.04 | 4.27/5.56 | 4.86/7.37 | 4.21/5.92 |

FocalClick: Towards Practical Interactive Image Segmentation (CVPR2022)

CONFIG
S1 version: coarse segmentator input size 128x128; refiner input size 256x256.  
S2 version: coarse segmentator input size 256x256; refiner input size 256x256.  
Previous Mask: Yes
Iterative Training: Yes
All entries are NoC 85/90%, as above.

| Train Dataset | Model | GrabCut | Berkeley | Pascal VOC | COCO MVal | SBD | DAVIS | DAVIS585 from zero | DAVIS585 from init |
|---|---|---|---|---|---|---|---|---|---|
| COCO+LVIS | HRNet18s-S1 (16.58 MB) | 1.64/1.88 | 1.84/2.89 | 3.24/3.91 | 2.89/4.00 | 4.74/7.29 | 4.77/6.56 | 5.62/8.08 | 2.72/3.82 |
| COCO+LVIS | HRNet18s-S2 (16.58 MB) | 1.48/1.62 | 1.60/2.23 | 2.93/3.46 | 2.61/3.59 | 4.43/6.79 | 3.90/5.23 | 4.87/6.87 | 2.47/3.30 |
| COCO+LVIS | HRNet32-S2 (119.11 MB) | 1.64/1.80 | 1.70/2.36 | 2.80/3.35 | 2.62/3.65 | 4.24/6.61 | 4.01/5.39 | 4.77/6.84 | 2.32/3.09 |
| Combined Datasets | HRNet32-S2 (119.11 MB) | 1.30/1.34 | 1.49/1.85 | 2.84/3.38 | 2.80/3.85 | 4.35/6.61 | 3.19/4.81 | 4.80/6.63 | 2.37/3.26 |
| COCO+LVIS | SegFormerB0-S1 (14.38 MB) | 1.60/1.86 | 2.05/3.29 | 3.54/4.22 | 3.08/4.21 | 4.98/7.60 | 5.13/7.42 | 6.21/9.06 | 2.63/3.69 |
| COCO+LVIS | SegFormerB0-S2 (14.38 MB) | 1.40/1.66 | 1.59/2.27 | 2.97/3.52 | 2.65/3.59 | 4.56/6.86 | 4.04/5.49 | 5.01/7.22 | 2.21/3.08 |
| COCO+LVIS | SegFormerB3-S2 (174.56 MB) | 1.44/1.50 | 1.55/1.92 | 2.46/2.88 | 2.32/3.12 | 3.53/5.59 | 3.61/4.90 | 4.06/5.89 | 2.00/2.76 |
| Combined Datasets | SegFormerB3-S2 (174.56 MB) | 1.22/1.26 | 1.35/1.48 | 2.54/2.96 | 2.51/3.33 | 3.70/5.84 | 2.92/4.52 | 3.98/5.75 | 1.98/2.72 |

Efficient Baselines using MobileNets and PPLCNets

CONFIG
Input Size: 384x384.
Previous Mask: Yes
Iterative Training: Yes
All entries are NoC 85/90%, as above.

| Train Dataset | Model | GrabCut | Berkeley | Pascal VOC | COCO MVal | SBD | DAVIS | DAVIS585 from zero | DAVIS585 from init |
|---|---|---|---|---|---|---|---|---|---|
| COCO+LVIS | MobileNetV2 (7.5 MB) | 1.82/2.02 | 1.95/2.69 | 2.97/3.61 | 2.74/3.73 | 4.44/6.75 | 3.65/5.81 | 5.25/7.28 | 2.15/3.04 |
| COCO+LVIS | PPLCNet (11.92 MB) | 1.74/1.92 | 1.96/2.66 | 2.95/3.51 | 2.72/3.75 | 4.41/6.66 | 4.40/5.78 | 5.11/7.28 | 2.03/2.90 |
| Combined Datasets | MobileNetV2 (7.5 MB) | 1.50/1.62 | 1.62/2.25 | 3.00/3.61 | 2.80/3.96 | 4.66/7.05 | 3.59/5.24 | 5.05/7.12 | 2.06/2.97 |
| Combined Datasets | PPLCNet (11.92 MB) | 1.46/1.66 | 1.63/1.99 | 2.88/3.44 | 2.75/3.89 | 4.44/6.74 | 3.65/5.34 | 5.02/6.98 | 1.96/2.81 |


License

The code is released under the MIT License. It is a short, permissive software license. Basically, you can do whatever you want as long as you include the original copyright and license notice in any copy of the software/source.



Acknowledgement

The core framework of this codebase follows: https://github.com/saic-vul/ritm_interactive_segmentation

Some code and pretrained weights are brought from:
https://github.com/Tramac/Lightweight-Segmentation
https://github.com/facebookresearch/video-nonlocal-net
https://github.com/visinf/1-stage-wseg
https://github.com/frotms/PP-LCNet-Pytorch

We thank those authors for their great work.



Citation

If you find this work useful for your research, please cite our papers:

@inproceedings{cdnet,
  title={Conditional Diffusion for Interactive Segmentation},
  author={Chen, Xi and Zhao, Zhiyan and Yu, Feiwu and Zhang, Yilei and Duan, Manni},
  booktitle={ICCV},
  year={2021}
}

@inproceedings{focalclick,
  title={FocalClick: Towards Practical Interactive Image Segmentation},
  author={Chen, Xi and Zhao, Zhiyan and Zhang, Yilei and Duan, Manni and Qi, Donglian and Zhao, Hengshuang},
  booktitle={CVPR},
  year={2022}
}


clickseg's Issues

Recurrence

I trained COCO+LVIS SegFormerB0-S1 using the trainval_scripts/train_focalclickB0_S1_cclvs.sh with nothing changed.
And the val result is as follows:
[image: validation results of the retrained model]
compared with the model published:
[image: results of the published model]
I notice that the batch size is 32 in the paper, but it is 64 in the bash script. Is that the reason for the accuracy gap?

Also, there are two data augmentation pipelines in the code, while the paper says "During training, we only use flip and random resize with the scale from 0.75 to 1.4 as data augmentation." Which one was used for the published model?

Question: importance of click history

Dear authors, thank you for sharing the code, great work!

I am trying to understand how you use clicks. It looks like the Segmentor receives the previous mask and up to 20 last clicks in the order they were generated. The last one is the correction the user is trying to make, whereas the others are historical clicks that led to the previous mask. Is that correct?

If so, did you find it important to keep the history of clicks for coarse segmentation? What would happen if you passed only the previous mask and the last click (e.g. like you do later for refinement)?
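For context, here is a rough sketch of the encoding described above, assuming clicks are rasterized as positive/negative disk maps and stacked with the previous mask as extra input channels; the function name, disk radius, and channel layout are illustrative and not taken from the ClickSEG code:

```python
import numpy as np

def encode_inputs(prev_mask, clicks, radius=5):
    """prev_mask: HxW array in {0, 1}; clicks: list of (y, x, is_positive).
    Returns a 3xHxW array: previous mask, positive click map, negative click map."""
    h, w = prev_mask.shape
    pos = np.zeros((h, w), dtype=np.float32)
    neg = np.zeros((h, w), dtype=np.float32)
    yy, xx = np.mgrid[:h, :w]
    for y, x, is_positive in clicks[-20:]:  # keep at most the 20 most recent clicks
        disk = (yy - y) ** 2 + (xx - x) ** 2 <= radius ** 2
        (pos if is_positive else neg)[disk] = 1.0
    return np.stack([prev_mask.astype(np.float32), pos, neg], axis=0)
```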

Pretrained weights

Hello, thank you for the work.

What ImageNet weights are you specifically using for pre-training? For W18_S2 I tried a few different weights from the HRNet repo but they did not match.

Thank you!

Validation

I see that you have validation turned off in your training script.

Moreover, the code would not work because of

from albumentations.augmentations import functional as F

This should be

from albumentations.augmentations.geometric import functional as F

because that's where resize() is located.

Now, the question: did you use those validation augmentations to come up with the training schedule? How important are they?

Questions on Datasets

Could you please share your considerations when choosing datasets for training?

  • As one option, you use COCO+LVIS. In the RITM paper, they say that LVIS is the best; however, it has a problem of being "long-tailed and therefore lacking general object categories". Could you elaborate on the problem from your experience? For example, if we wanted to achieve way over 90% IoU, wouldn't low-quality COCO annotations be a problem?
  • Models trained on a combination of 8 datasets achieve better results than those trained on COCO+LVIS. Is the reason higher image diversity (meaning COCO+LVIS is not diverse enough), or that the other 6 datasets have higher-quality annotations than COCO?
  • How did you combine those 8 datasets? For example, did you just add 6 full datasets to the COCO+LVIS combination?
  • How long did it take to train on the huge combined dataset?

Positive click on mask / negative click on background

During benchmarking, every click is either a positive click on a region the previous mask labels as background or a negative click on a region it labels as foreground. This results in the assumption that for a click at (y, x), pred_mask[y, x] is different from previous_mask[y, x] at https://github.com/alibaba/ClickSEG/blob/main/isegm/inference/evaluation.py#L50
Since progressive merge is not performed during training, it is run only on a fully trained model where the above assumption always (?) holds, at least for the Clicker that you defined.

Two bugs / features with the above:

  1. During a real user interaction, nothing prevents the user from placing a positive click on an existing mask or a negative click on the background. Then progressive merge is going to ruin the mask. For example, for a freshly loaded image without a previous mask, if the user places a negative click anywhere in the image, the resulting mask will cover the whole image.

  2. Another possibility is that the user's correction simply did not lead to any changes - at least at the location of the click. I have actually run into such a situation while playing with your model and real user clicks, but in theory there's no reason why this can't happen even with ideal center-of-the-region clicks you use for your benchmarks.

The fix for (1) would be not to run the algorithm at all. Since you don't have this scenario in your tests and don't provide inference code, I guess it's irrelevant. The fix for (2) would probably be to detect this strange result and discard the predicted mask, logging an error/warning.

I ran into such scenarios when testing your code in inference mode. I guess for benchmarking it's a very minor issue; however, for inference / real user interaction, the results may be really surprising.
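As an illustration of the fix suggested for (2), here is a minimal sketch; the helper name and merge callback are hypothetical, not code from evaluation.py. The idea is to apply progressive merge only when the new click actually flipped the prediction at the clicked pixel, and otherwise fall back to the raw prediction:

```python
import logging

def safe_progressive_merge(prev_mask, pred_mask, click_y, click_x, merge_fn):
    """Apply merge_fn only when the click changed the prediction at its own location."""
    if pred_mask[click_y, click_x] == prev_mask[click_y, click_x]:
        # The correction had no effect at the clicked pixel; skip the merge
        # and keep the raw prediction instead of corrupting the mask.
        logging.warning("Click at (%d, %d) did not change the prediction; "
                        "skipping progressive merge.", click_y, click_x)
        return pred_mask
    return merge_fn(prev_mask, pred_mask, click_y, click_x)
```

This keeps benchmark behavior unchanged while letting real-user interaction fail soft instead of producing a full-image mask.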
