
microsoft / proda

278 stars · 7 watchers · 46 forks · 975 KB

Prototypical Pseudo Label Denoising and Target Structure Learning for Domain Adaptive Semantic Segmentation (CVPR 2021)

Home Page: https://arxiv.org/abs/2101.10979

License: MIT License

Python 100.00%
deep-learning neural-network domain-adaptation semantic-segmentation semi-supervised-learning pseudo-label computer-vision


proda's Issues

Have you trained with source only?

Hi.
Thank you for sharing your code.

I am running your code and ran into an issue that may be related to a problem in your paper:
the mIoU score of the "train" class for ProDA was zero.

1. Adding this to your code

ProDA/models/adaptation_modelv2.py
This is for source-only training:

class CustomModel():
    def step_source(self, source_x, source_label, source_imageS=None, source_params=None):
        self.BaseOpti.zero_grad()
        
        if self.opt.S_pseudo_src > 0: # always False in this setup (opt.S_pseudo_src == 0)
            source_output = self.BaseNet_DP(source_imageS)
            source_label_d4 = F.interpolate(source_label.unsqueeze(1).float(), size=source_output['out'].size()[2:])
            source_labelS = self.label_strong_T(source_label_d4.clone().float(), source_params, padding=250, scale=4).to(torch.int64)
            loss_ = cross_entropy2d(input=source_output['out'], target=source_labelS.squeeze(1))
            loss_GTA = loss_ * self.opt.S_pseudo_src
            source_outputUp = F.interpolate(source_output['out'], size=source_x.size()[2:], mode='bilinear', align_corners=True)
        else: # the branch that actually runs for source-only training
            source_output = self.BaseNet_DP(source_x, ssl=True)
            source_outputUp = F.interpolate(source_output['out'], size=source_x.size()[2:], mode='bilinear', align_corners=True)
            loss_GTA = cross_entropy2d(input=source_outputUp, target=source_label, size_average=True, reduction='mean')

        loss_GTA.backward()
        self.BaseOpti.step()

        return loss_GTA.item()

2. Result of training with source only

But I got this result:
[screenshot of per-class results]

3. The main issue

I have never seen classes score exactly 0 before, but in the result above several classes have a zero score.
What do you think about this?

Thank you.



PS: Running config

init_parameter = 'imagenet'

# gta5 / synthia / cityscapes
src_dataset = 'gta5' 
tgt_dataset = 'gta5'
# GTA5 / Synthia / Cityscapes
src_rootpath = 'Dataset/GTA5'
tgt_rootpath = 'Dataset/GTA5'
n_class = 19
num_workers = 8
bs = 8

name = 'gta5_src'
seed = 1337

## model
model_name = 'deeplabv2'
lr = 0.0005
freeze_bn = False
epochs = 84
train_iters = 90000
bn = 'sync_bn'
no_resume = False
stage = 'src_only' # warm_up / stage1 / stage2 or stage3 / src_only 
finetune = False
bn_clr = False
train_thred = 0
used_save_pseudo = False
no_droplast = False

resize = 2200
rcrop = [896, 512]
hflip = 0.5

S_pseudo = 0.0

noshuffle = False
noaug = False

Train with a custom dataset

Hi! The idea is amazing. May I ask how to train it on my own dataset (only images and labels)?

Exact command line of the warm-up stage

Hey :)
Could you please supply the command line used to train the warm-up phase?
I'm trying to retrain a network from scratch.
Thanks!
Shahaf

Cannot reproduce the GTA5 -> Cityscapes task.

We reran your code with the default settings. The result is considerably lower than the reported 57.5%.
Our result is as follows:
our result is as follows:
Overall Acc: 0.8863829080331662
Mean Acc : 0.6676762265179861
FreqW Acc : 0.8026627310122789
Mean IoU : 0.5486842157620068

UMAP visualization

Good morning. I am trying to reproduce the UMAP visualization.

But it is harder than I thought.

Can you share the UMAP visualization code?

Number of epochs for both the source-only and warm-up stages

Hi, I am trying to reproduce the full pipeline from the very beginning, i.e. starting from training the model only with the source dataset, then doing the warm-up stage and so on.

Could you provide more information about the number of epochs or iterations you trained the model for in the following stages:

  1. Source only
  2. Warm-up

Thanks for the help.

Missing key(s) in state_dict

hello~ Thanks for your excellent work, but I ran into a problem following the steps in "Inference Using Pretrained Model" -> "2) GTA -> Cityscapes". It throws an error:
RuntimeError: Error(s) in loading state_dict for ResNet101:
Missing key(s) in state_dict: "bn_pretrain.weight", "bn_pretrain.bias", "bn_pretrain.running_mean", "bn_pretrain.running_var".
[screenshot]
Has anyone met the same error?

question about the loss formulation of kl_div

I am wondering whether there is any motivation behind excluding log() when computing the kl_div loss in the distillation stage (line 315 of adaptation_modelv2.py). People usually use loss_kd = F.kl_div(student.log(), teacher), but here you compute loss_kd = F.kl_div(student, teacher), right?
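
For context, a minimal standalone sketch of the two variants being compared (toy tensors, not the repo's training loop):

import torch
import torch.nn.functional as F

logits_student = torch.randn(2, 19, 4, 4)   # toy logits: (batch, classes, H, W)
logits_teacher = torch.randn(2, 19, 4, 4)

teacher = F.softmax(logits_teacher, dim=1)

# Conventional usage: F.kl_div expects log-probabilities as its first argument,
# so the student goes through log_softmax (or softmax(...).log()).
loss_conventional = F.kl_div(F.log_softmax(logits_student, dim=1), teacher, reduction='batchmean')

# The variant described in the question: raw probabilities for the student.
loss_as_in_question = F.kl_div(F.softmax(logits_student, dim=1), teacher, reduction='batchmean')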

about the size of the prototype

Thanks for your great work and code sharing. Here is one thing that confuses me.

I saw that you define the objective_vectors in models/adaptation_modelv2.py with a size of 256:

self.objective_vectors = torch.zeros([self.class_numbers, 256])
self.objective_vectors_num = torch.zeros([self.class_numbers])

but I found that the feature maps of layer4 have 2048 channels:

self.layer4 = self._make_layer(block, 512, layers[3], stride=1, dilation=4, BatchNorm=BatchNorm)
self.layer5 = self._make_pred_layer(Classifier_Module2, 2048, [6, 12, 18, 24], [6, 12, 18, 24], num_classes)

and in your code, you just use the output feature of layer4 as out['feat']:

def forward(self, x, ssl=False, lbl=None):
    _, _, h, w = x.size()
    x = self.conv1(x)
    x = self.bn1(x)
    x = self.relu(x)
    x = self.maxpool(x)
    x = self.layer1(x)
    x = self.layer2(x)
    x = self.layer3(x)
    x = self.layer4(x)
    if self.bn_clr:
        x = self.bn_pretrain(x)

    out = self.layer5(x, get_feat=True)
    # out = dict()
    # out['feat'] = x
    # x = self.layer5(x)

    # if not ssl:
    #     x = nn.functional.upsample(x, (h, w), mode='bilinear', align_corners=True)
    #     if lbl is not None:
    #         self.loss = self.CrossEntropy2d(x, lbl)
    # out['out'] = x
    return out

So... which is the correct size of the prototype, and if it's 256, how do you obtain that feature~
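
My guess, for what it's worth (an assumption, not confirmed against the repo): the 256-dim feature likely comes from a bottleneck inside the classifier head (layer5) rather than from layer4 itself, roughly like this hypothetical sketch:

import torch
import torch.nn as nn

# Hypothetical sketch: a classifier head with a 2048 -> 256 bottleneck that
# returns both the projected feature and the logits, so out['feat'] is 256-dim.
class HeadWithBottleneck(nn.Module):
    def __init__(self, in_ch=2048, feat_ch=256, n_class=19):
        super().__init__()
        self.bottleneck = nn.Conv2d(in_ch, feat_ch, kernel_size=3, padding=1)
        self.classifier = nn.Conv2d(feat_ch, n_class, kernel_size=1)

    def forward(self, x, get_feat=False):
        feat = self.bottleneck(x)        # (B, 256, H, W)
        logits = self.classifier(feat)   # (B, 19, H, W)
        if get_feat:
            return {'feat': feat, 'out': logits}
        return logits

head = HeadWithBottleneck()
out = head(torch.randn(1, 2048, 65, 65), get_feat=True)
print(out['feat'].shape)  # torch.Size([1, 256, 65, 65])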

Discriminator model

Thank you for your project!
In discriminator.py I see this class:

class FCDiscriminator_class(nn.Module): 
#TODO: whether reduce channels before pooling, whether update pred, more complex discriminator
#TODO: 19 different discriminators or 1 discriminator after projection

Have you finished implementing this class? And with the current code, can I use it in the warm-up stage?

SynchronizedBatchNorm2d or nn.BatchNorm2d?

What's the difference between SynchronizedBatchNorm2d and nn.BatchNorm2d? I usually freeze BN since the batch size is usually small, e.g., 1. Does this training trick boost your final performance? I think a comparison under fair settings would show the real improvement of your method.
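
(Not from this repo; just a minimal sketch of the BN-freezing practice mentioned above, in case anyone wants to run the comparison.)

import torch.nn as nn

def freeze_bn(module):
    # Put every BatchNorm layer into eval mode and stop updating its affine parameters,
    # so running statistics and weights stay fixed during training.
    if isinstance(module, (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d, nn.SyncBatchNorm)):
        module.eval()
        for p in module.parameters():
            p.requires_grad = False

# usage: model.apply(freeze_bn) after building/loading the model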

Where do I get the "split.mat" file?

Whenever I run generate_pseudo_label.py,
it raises FileNotFoundError: [Errno 2] No such file or directory: '~path_to_Dataset/GTA5/split.mat'.
How or where do I get the "split.mat" file?

Thank you

Warm-up stage training

Can you share how to train the warm-up stage?
I used the script below to train it, but the result is quite poor and I can't reach 43.3 mIoU like the pretrained model.
python train.py --name gta2citylabv2_warmup --stage warm_up --student_init simclr --no_resume --lr 0.0001

About the usage of your code with PyTorch>=1.6.0

Thanks for sharing your fantastic work!

I am using your code with PyTorch >= 1.6.0, and it seems that the behavior of F.affine_grid() and F.grid_sample() has changed, leading to the warning:

UserWarning: Default grid_sample and affine_grid behavior has changed to align_corners=False since 1.3.0. Please specify align_corners=True if the old behavior is desired. See the documentation of grid_sample for details. "Default grid_sample and affine_grid behavior has changed "

Because I don't know which PyTorch version your environment runs, I am not sure whether I need to change the align_corners argument.
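
For anyone hitting the same warning: pinning the argument explicitly reproduces the pre-1.3.0 behavior on any version (a standalone sketch, not the repo's code):

import torch
import torch.nn.functional as F

theta = torch.eye(2, 3).unsqueeze(0)                   # identity affine transform, (1, 2, 3)
grid = F.affine_grid(theta, size=[1, 3, 8, 8], align_corners=True)
img = torch.randn(1, 3, 8, 8)
warped = F.grid_sample(img, grid, align_corners=True)  # matches the old default behavior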

The same issue occurs for the upsample function:

UserWarning: The default behavior for interpolate/upsample with float scale_factor changed in 1.6.0 to align with other frameworks/libraries, and now uses scale_factor directly, instead of relying on the computed output size. If you wish to restore the old behavior, please set recompute_scale_factor=True. See the documentation of nn.Upsample for details. "The default behavior for interpolate/upsample with float scale_factor changed "

Could you please tell me the running version of your PyTorch?

Is warming-up a critical part in the full model performance

Hi, thanks for sharing the code. As mentioned in the README.md, a warm-up model is used to start the 3-stage training process, which seems to be a pretraining process with adversarial training according to your code. However, this part is not discussed much in the paper. The warm-up model initializes the Basemodel in stage 1, and from the training instructions each stage relies heavily on the trained model from its previous stage (either to initialize the Basemodel or a Basemodel_ema). I wonder: if I replace the DA warm-up model with a regular source-only model for startup, will there be a severe chain effect in the downstream stages?

pretrained/simclr/r101_1x_sk0.pth

Thank you very much for open-sourcing your code; I am very interested in your work. When running inference with the pretrained model, I had a problem: I don't know how to choose the SimCLRv2 weights.
[screenshot]
I went to the fine-tuned SimCLRv2 models on 100% of labels, but I don't know which model I should use.
[screenshot]

two GPUs

How can I use just two GPUs to train the model?
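
(A common approach, assuming the training script picks up all visible GPUs via nn.DataParallel: restrict the visible devices before launching.)

import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0,1'  # must be set before torch initializes CUDA
import torch
print(torch.cuda.device_count())            # should now report 2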

source-only pretrained model

Thanks for your work. Can you release the parameters of the source-only model? I want to do some experiments based on it.

deeplabv2.py code problem

Hi, I'm curious about a line of code in deeplabv2.py:

return out

It seems that return out has an extra indent, which makes the for loop useless. Or does it have some purpose?

    def forward(self, x):
        out = self.conv2d_list[0](x)
        for i in range(len(self.conv2d_list) - 1):
            out += self.conv2d_list[i + 1](x)
            return out
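
For comparison, a sketch of what the loop presumably intends, with the return dedented so every branch contributes to the sum:

    def forward(self, x):
        out = self.conv2d_list[0](x)
        for i in range(len(self.conv2d_list) - 1):
            out += self.conv2d_list[i + 1](x)
        return out  # moved out of the loop body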

The method works so well without domain alignment.

Hi, I am amazed that this algorithm works so well without any domain alignment. Is domain alignment important? What I want to know is: do you use the class prototypes of the source data or of the target data in the experiments? And which of the two works better?

Inconsistency between paper and code

Hi, congratulations on your great work and acceptance in CVPR '21. Thanks for releasing the code and model weights.

In the paper, you mention using DeepLabv2 with a ResNet-101 backbone. However, your code actually uses a modified ASPP module (Classifier_Module2 in models/deeplabv2.py), whereas Classifier_Module is the one that matches DeepLabv2. Similar issues were raised here and here, which mention that this type of ASPP module is used in DeepLabv3+ and has much better performance than DeepLabv2 (both issues were raised in Jan. 2020). Could you please confirm this point? And if you have also run experiments with the original DeepLabv2 model, could you report those results for a fair comparison with prior art?

DeepLab network selection

Hi and congratulations on your work. It is astonishing how much ground UDA has gained over the years.

Looking over the code, I noticed that you changed some parts of the last layers of DeepLabv2. In particular, you add group normalization, ReLU activations, and dropout in the last layers. What is the inspiration behind those changes? How much did they contribute to the results? Similar methods have shown much improved results using DeepLabv3. Is it fair to compare your improved model with other DeepLabv2 approaches without explaining how much those changes contributed to the result?

Thank you for your time and consideration. I am looking forward to your response.

Model selection

Why do you use the validation set (500 Cityscapes images) to select the best model? The validation set should only be used for evaluation, not for model selection (saving the best model). This paper has a big problem!

Problem about the pretrained warmup model in "SYNTHIA -> Cityscapes"

Hi, I'm reproducing your results, but I found that the mIoU of the warm-up model for "SYNTHIA -> Cityscapes" is 23.8, which does not match the 41.4 you report. Can you check it?

I downloaded from_synthia_to_cityscapes_on_deeplabv2_best_model.pkl from the link you provided and tested it with the following command:

python test.py --n_class 16 --resume pretrained/syn2citylabv2_warmup/from_synthia_to_cityscapes_on_deeplabv2_best_model.pkl

Out of memory with two GPU training

Hi, I am trying to reproduce the results for the GTA V -> Cityscapes adaptation. I have downloaded the warm-up model, generated the soft pseudo-labels successfully, and also calculated the prototypes. But after running the stage 1 training script, the highest mIoU I get is 52.9.

The parameters that I used for my current setup (2 Nvidia RTX 2080 with 24 GB each) are:

batch size = 2
learning_rate = 0.0001 / 2
epochs = 84
train_iters = 90000 * 2

I couldn't run the script with the default configuration (bs=4, lr=0.0001, train_iters=90000) due to out-of-memory errors. Any thoughts on how I can reach the reported results with my hardware?
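
(Not an official recommendation; one common workaround when the per-GPU batch doesn't fit is gradient accumulation, sketched here with toy stand-ins.)

import torch
import torch.nn as nn

# Toy stand-ins; the real script would use its own model, optimizer, and loader.
model = nn.Linear(10, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()
data = [(torch.randn(2, 10), torch.randint(0, 2, (2,))) for _ in range(8)]

accum_steps = 2  # effective batch size = bs * accum_steps
optimizer.zero_grad()
for step, (x, y) in enumerate(data):
    loss = criterion(model(x), y) / accum_steps  # scale so gradients match a larger batch
    loss.backward()                              # gradients accumulate across steps
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()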

About the mIoU metric

The warm-up pretrained model is supposed to reach 43.3 mIoU, but when I test on the Cityscapes validation set, I get a much lower value. I am using the runningScore class from your repo, and my own metric function gives the same result.
[screenshot]
I'm really confused by this result. Could you tell me how to get the correct mIoU for the pretrained model?
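
(In case the gap comes from the metric itself, here is a minimal standalone mIoU via a confusion matrix; it should agree with a standard runningScore-style implementation, assuming ignore index 250.)

import numpy as np

def mean_iou(preds, gts, n_class=19, ignore=250):
    # Accumulate a confusion matrix over all (prediction, ground-truth) pairs,
    # skipping ignored pixels, then average the per-class IoU.
    cm = np.zeros((n_class, n_class), dtype=np.int64)
    for p, g in zip(preds, gts):
        mask = (g != ignore) & (g < n_class)
        cm += np.bincount(n_class * g[mask].astype(int) + p[mask].astype(int),
                          minlength=n_class ** 2).reshape(n_class, n_class)
    inter = np.diag(cm)
    union = cm.sum(axis=0) + cm.sum(axis=1) - inter
    with np.errstate(divide='ignore', invalid='ignore'):
        iou = inter / union          # nan where a class never appears
    return np.nanmean(iou)

# usage: mean_iou([pred_map1, ...], [gt_map1, ...]) with integer arrays of shape (H, W)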

The time of training

How long does each stage take under the paper's settings?
I am training stage 1 on 2 Tesla V100-DGXS-32GB GPUs, and it takes 2 hours to train 1000 iterations (about 1.3 epochs). That seems a little slow; is this expected?

specific methods of stage two

After reading your paper, and being amazed by the model's performance, I have trouble understanding the specific method of stage two. I know little about representation learning, and the paper doesn't explain this table in detail, so I hope someone can give me an explanation.
[screenshot of the ablation table]

In stage two, what does "self-supervised initialization" mean? And how is the self-supervised initialization combined with/without self-distillation?

Here is my own understanding of stage two:
the three initializations represent how the student model is initialized, and the teacher model is always the best-performing model from stage one when self-distillation is used. "Stage1 init." is the best-performing model from stage one (the 53.7 one). "Supervised init." is the SimCLRv2 pretrained model (I assume ResNet-101 with supervised training).

I don't know if my understanding is right.

Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!

when I run train.py:

 python train.py --name gta2citylabv2_stage1Denoise --used_save_pseudo --ema --proto_rectify --moving_prototype --path_soft Pseudo/gta2citylabv2_warmup_soft --resume_path ./pretrained/gta2citylabv2_warmup/from_gta5_to_cityscapes_on_deeplabv2_best_model.pkl --proto_consistW 10 --rce --regular_w 0.1
Traceback (most recent call last):
  File "train2.py", line 217, in <module>
    train(opt, logger)
  File "train2.py", line 88, in train
    target_lpsoft, target_image_full, target_weak_params)
  File "/media/ailab/data/yy/ProDA/models/adaptation_modelv2.py", line 209, in step
    weights = self.get_prototype_weight(ema_out['feat'], target_weak_params=target_weak_params)
  File "/media/ailab/data/yy/ProDA/models/adaptation_modelv2.py", line 350, in get_prototype_weight
    feat_proto_distance = self.feat_prototype_distance(feat)
  File "/media/ailab/data/yy/ProDA/models/adaptation_modelv2.py", line 345, in feat_prototype_distance
    feat_proto_distance[:, i, :, :] = torch.norm(self.objective_vectors[i].reshape(-1,1,1).expand(-1, H, W) - feat, 2, dim=1,)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!

I only want to use two GPUs, so I changed the code only here:

 if opt.model_name == 'deeplabv2':
        model = adaptation_modelv2.CustomModel(opt, logger)
        model = torch.nn.DataParallel(model, device_ids=device_ids)
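
The traceback suggests self.objective_vectors lives on the CPU while the features are on cuda:0. A likely fix (my guess, not an official patch) is to move the prototypes onto the GPU once after they are created, illustrated standalone:

import torch

class_numbers = 19
objective_vectors = torch.zeros([class_numbers, 256])  # created on the CPU by default

# Moving the prototypes onto the same device as the features avoids mixing
# cuda:0 tensors with cpu tensors inside feat_prototype_distance().
if torch.cuda.is_available():
    objective_vectors = objective_vectors.to('cuda:0')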

The kl_div loss of self distillation

The following code calculates the kl_div loss between the teacher from stage 1 and the student model, but the student's output is not passed through log_softmax. Is this a mistake?

    student = F.softmax(target_out['out'], dim=1)
    with torch.no_grad():
        teacher_out = self.teacher_DP(target_imageS)
        teacher_out['out'] = F.interpolate(teacher_out['out'], size=threshold_arg.shape[2:], mode='bilinear', align_corners=True)
        teacher = F.softmax(teacher_out['out'], dim=1)

    loss_kd = F.kl_div(student, teacher, reduction='none')
    mask = (teacher != 250).float()
    loss_kd = (loss_kd * mask).sum() / mask.sum()
    loss = loss + self.opt.distillation * loss_kd   

About the function full2weak() and label_strong_T()

Dear authors: thanks for sharing your code with us. I do not understand why you apply augmentation to the output probability of the encoder in the functions full2weak() and label_strong_T(). How do these two functions help? I sincerely hope you can help me :>

Training Script

Hi,

Thanks for sharing the code.
Could you please also share the training script to reproduce the results?

Thanks!

warm-up training for SYNTHIA -> Cityscapes

How many classes did you use for the warm-up training of SYNTHIA -> Cityscapes? Can you provide the warm-up training command for SYNTHIA -> Cityscapes as an example? Thank you very much indeed.

Is the classifier replaced with an enhanced version with an SE module?

The paper states: 'We use the DeepLabv2 [8] for segmentation with the backbone ResNet-101 [25].' However, in the code, the last classifier is replaced with an enhanced version with an SE module. The enhanced network structure may further improve UDA performance. Won't this lead to an unfair comparison with DeepLabv2 in other works?

Plotting features using UMAP

Hello, thanks for such a good contribution in the field, it is really a groundbreaking work.

I was trying to reproduce the plot of the features in Figure 5 of the main manuscript using UMAP. How did you determine which features belong to those specific classes (building, traffic sign, pole, and vegetation)? From the output we can determine which class each pixel belongs to, but how did you do it in the feature space? By resizing the logits back to the feature-map shape and taking the argmax to determine the correspondence?
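
(My own guess at a workable recipe, not the authors' code: nearest-neighbour downsample the label map to the feature resolution so each feature vector gets a class, then embed the selected classes with UMAP. Class IDs below are assumed Cityscapes train IDs.)

import torch
import torch.nn.functional as F
import umap  # pip install umap-learn

feat = torch.randn(1, 256, 64, 128)                   # toy feature map (B, C, h, w)
label = torch.randint(0, 19, (1, 512, 1024)).float()  # toy full-resolution label map

# Nearest-neighbour downsampling keeps the labels categorical at feature resolution.
label_small = F.interpolate(label.unsqueeze(1), size=feat.shape[2:], mode='nearest')
label_small = label_small.squeeze(1).long()

keep = {2, 5, 7, 8}  # building, pole, traffic sign, vegetation (assumed train IDs)
mask = torch.zeros_like(label_small, dtype=torch.bool)
for c in keep:
    mask |= (label_small == c)

X = feat.permute(0, 2, 3, 1)[mask].numpy()            # (N, 256) selected feature vectors
y = label_small[mask].numpy()                         # class of each selected vector
emb = umap.UMAP(n_components=2).fit_transform(X)      # 2-D embedding, colour by y when plotting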

A Problem of Using Pretrained Model

Hi~ I want to use the pretrained "GTA5 -> Cityscapes" model. However, when running the command python test.py --bn_clr --student_init simclr --resume ./pretrained/gta2citylabv2_stage3/from_gta5_to_cityscapes_on_deeplabv2_best_model.pkl, I encountered an error reading FileNotFoundError: [Errno 2] No such file or directory: '/mnt/blob\\Dataset/GTA5\\split.mat'. Could you please describe how to set the correct path and how to configure the dataset? Thanks! :)

what is the warm-up model?

In the training process of stage 1, I need to download the warm-up model (43.4 mIoU) and resume from it, but there seems to be no description of which dataset it was pretrained on. Is it pretrained on the source dataset?
Thank you.

Question about structure learning

I noticed that weak and strong augmentations are used in structure learning.

In my opinion, the strong augmentation differs more from the original image than the weak augmentation does, so why do you use the weak augmentation rather than the original image? Did you run an ablation study on different strengths of the weak augmentation, or even no augmentation at all?

About the flip Image

if opt.flip:
    flip_out = model.BaseNet_DP(fliplr(images_val))
    flip_out['out'] = F.interpolate(sm(flip_out['out']), size=images_val.size()[2:], mode='bilinear', align_corners=True)
    out['out'] = F.interpolate(sm(out['out']), size=images_val.size()[2:], mode='bilinear', align_corners=True)
    out['out'] = (out['out'] + fliplr(flip_out['out'])) / 2

Dear author, thanks for sharing the code with us. I don't understand why we should flip the image and compute flip_out here. If it helps performance, why not also use it when generating the soft labels in the first place?

Why the student is initialized with self-supervised weights rather than supervised weights

Great work and thank you for sharing the code.

I noticed that in the distillation stage, you initialize your student model with the SSL weights learned by SimCLRv2. I am wondering why you do not use the fully supervised ImageNet weights; after all, the fully supervised model is better than the SSL one. I guess this is not because you want to avoid using ImageNet labels, as you initialize your DeepLab model with fully supervised weights in the first stage. So the question is: why not use the fully supervised weights for the distillation as well?

Thanks.

question about the argument in if statement

if self.split == 'train' and self.opt.used_save_pseudo:
    if self.opt.proto_rectify:
        lpsoft = np.load(os.path.join(self.opt.path_soft, os.path.basename(img_path).replace('.png', '.npy')))
    else:
        lp_path = os.path.join(self.opt.path_LP, os.path.basename(img_path))
        lp = Image.open(lp_path)
        lp = lp.resize(self.img_size, Image.NEAREST)
        lp = np.array(lp, dtype=np.uint8)
        if self.opt.threshold:
            conf = np.load(os.path.join(self.opt.path_LP, os.path.basename(img_path).replace('.png', '_conf.npy')))
            lp[conf <= self.opt.threshold] = 250

In line 166, if self.opt.proto_rectify:, I'm wondering whether it should be if self.opt.path_soft: instead?
Thank you~

Where are the generated pseudo labels of the target data loaded?

First I ran generate_pseudo_label.py; then, when running train.py, I can't find how the generated pseudo labels of the target data are loaded into target_train_loader. Could you please show me where this part of the code is?

Training Stage Loss

Dear author,
thanks for your great work. I have recently been reproducing the code and found a small issue that I'm not sure about. When I finished stage 1 training and started training stage 2, I found that the loss becomes negative in some cases. After checking the loss code, I thought it should not be negative in theory. Do you know what the problem is, or is it not a problem and just my poor understanding?
[screenshot of the loss]
