
lang-seg's People

Contributors

boyiliee, ranftlr


lang-seg's Issues

ImportError

I am trying to run a zero-shot demo. I compiled and installed torch-encoding with gcc 7.5.

(lang-seg) [zhongzm@ai_gpu28 lang-seg]$ python -u test_lseg_zs.py --backbone clip_resnet101 --module clipseg_DPT_test_v2 --dataset fss \
> --widehead --no-scaleinv --arch_option 0 --ignore_index 255 --fold 0 --nshot 0 \
> --weights checkpoints/fss_l16.ckpt 
Traceback (most recent call last):
  File "test_lseg_zs.py", line 8, in <module>
    from modules.lseg_module_zs import LSegModuleZS
  File "/public/home/zhongzm/project/lang-seg/modules/lseg_module_zs.py", line 7, in <module>
    from .lsegmentation_module_zs import LSegmentationModuleZS
  File "/public/home/zhongzm/project/lang-seg/modules/lsegmentation_module_zs.py", line 13, in <module>
    from encoding.models import get_segmentation_model
  File "/public/home/zhongzm/anaconda3/envs/lang-seg/lib/python3.8/site-packages/encoding/__init__.py", line 13, in <module>
    from . import nn, functions, parallel, utils, models, datasets, transforms
  File "/public/home/zhongzm/anaconda3/envs/lang-seg/lib/python3.8/site-packages/encoding/nn/__init__.py", line 12, in <module>
    from .encoding import *
  File "/public/home/zhongzm/anaconda3/envs/lang-seg/lib/python3.8/site-packages/encoding/nn/encoding.py", line 18, in <module>
    from ..functions import scaled_l2, aggregate, pairwise_cosine
  File "/public/home/zhongzm/anaconda3/envs/lang-seg/lib/python3.8/site-packages/encoding/functions/__init__.py", line 2, in <module>
    from .encoding import *
  File "/public/home/zhongzm/anaconda3/envs/lang-seg/lib/python3.8/site-packages/encoding/functions/encoding.py", line 15, in <module>
    from encoding import cpu
ImportError: /public/home/zhongzm/anaconda3/envs/lang-seg/lib/python3.8/site-packages/encoding/cpu.cpython-38-x86_64-linux-gnu.so: undefined symbol: _ZN3c106detail14torchCheckFailEPKcS2_jS2_

I then tried to reinstall the environment and got this error:

 File "/public/home/zhongzm/anaconda3/envs/lang-seg/lib/python3.8/site-packages/encoding/functions/encoding.py", line 17, in <module>
    from encoding import gpu
ImportError: cannot import name 'gpu' from partially initialized module 'encoding' (most likely due to a circular import) (/public/home/zhongzm/anaconda3/envs/lang-seg/lib/python3.8/site-packages/encoding/__init__.py)

Difference between the settings for demo and those in your paper

Hi,
I have a question about the difference between the settings of the demo in your README and the experiment in your paper.

In the README, you published the pre-trained weights for the demo.
It says that during training the backbones for both image and text are ViT-L/16.
Section 5.1 of your paper says:

We used LSeg with DPT and a smaller ViT-B/32 backbone together with the CLIP ViT-B/32 text encoder ...

When reproducing your results in Section 5.1, does that require a full training run from scratch with the ViT-B/32 backbone for the images?
Also, are there any other differences, such as batch size? More specifically, how do I change the arguments in train.sh?

Finally, is it possible to share with us (or me) the weights used for your results?

Thank you in advance.

Problem when running the streamlit app

model = _load_state(cls, checkpoint, strict=strict, **kwargs)

File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/pytorch_lightning/core/saving.py", line 158, in _load_state
obj = cls(**_cls_kwargs)
File "/teamspace/studios/this_studio/lang-seg/modules/lseg_module.py", line 55,

about finetune_weights

There is a --finetune_weights flag in modules/lseg_module.py, but I do not see where this flag is used. Can it be used directly to fine-tune a pre-trained model on a new dataset?
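
For context, a minimal sketch (an assumption, not the repo's actual wiring) of how such a flag is typically consumed: load the checkpoint's state_dict into the current model non-strictly before fine-tuning on a new dataset.

import torch

# Hypothetical helper illustrating how a --finetune_weights checkpoint is
# usually applied; `model` is any torch.nn.Module (e.g., the LSeg net).
def load_finetune_weights(model, ckpt_path):
    ckpt = torch.load(ckpt_path, map_location="cpu")
    state = ckpt.get("state_dict", ckpt)          # Lightning checkpoints nest the weights
    missing, unexpected = model.load_state_dict(state, strict=False)
    print(f"missing keys: {len(missing)}, unexpected keys: {len(unexpected)}")
    return model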

question about the image encoder

Hi, thanks for open-sourcing the code.

I have a quick question:

What's the reason for choosing DPT as the image encoder?

What should I note if I want to use other encoders (e.g., HR-Net)?

Error with Pytorch Encoding

I am running Windows and I have issues installing this project, specifically the torch-encoding package.

Some of the primary errors include:
error: ninja: error: loading 'build.ninja': The system cannot find the file specified.
and
Error building extension 'enclib_cpu'

These errors usually come in tandem with a giant list of other errors, presumably in dependencies. When I tried to build the package via Docker, similar issues arose as well.

Things I have tried:

  1. I have ensured that Visual Studio and the C++ compilers are properly installed, and that the environment variables are set
  2. I have installed cudatoolkit along with various PyTorch libraries with CUDA support (torch.cuda.is_available() returns True)
  3. Tried to build a Docker image by following the PyTorch-Encoding installation guide, but the error still occurs

I am running on a Windows 10 machine.

Are there any fixes or guides to get lang-seg to work under these circumstances?

How can I download the torch-encoding library?

I cannot download the torch-encoding library. When running the lseg_app.py file, I encounter the following error:
File "/jinx/language-drive-seg/lang-seg-main/data/__init__.py", line 17, in <module>
import encoding.datasets as enc_ds
ModuleNotFoundError: No module named 'encoding'
I found that this is likely due to the torch-encoding library missing from the dependencies. After attempting to download it with the command, I encountered an error as shown in the screenshot. Could you please advise on how to resolve this issue?

Replace DPT backbone with ResNet101

I want to use a ResNet-based LSeg and I did the following:

Generally, I added an elif branch in _make_encoder() that returns a resnet101, and modified the dimensions in _make_scratch to [256, 512, 1024, 2048]. I also replaced forward_vit in lseg_net.py with a vanilla ResNet forward (returning the 4-stage outputs). With this I could start training, but could not get the expected performance.

I might have plugged in ResNet incorrectly or missed some points. Are there any demos of a ResNet-based LSeg, or any ResNet pre-trained weights for LSeg? Thanks!
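
For reference, a rough sketch of the kind of elif branch described above (my own reconstruction under the stated assumptions, not the repository's code): expose a torchvision ResNet-101 as four stages whose channel counts match the [256, 512, 1024, 2048] passed to _make_scratch.

import torch.nn as nn
from torchvision.models import resnet101

def make_resnet101_encoder(pretrained: bool = True) -> nn.Module:
    # Uses the older torchvision `pretrained=` API, in line with the repo's era.
    net = resnet101(pretrained=pretrained)
    encoder = nn.Module()
    # Stage outputs carry 256 / 512 / 1024 / 2048 channels, matching the
    # dimensions passed to _make_scratch in the modification described above.
    encoder.layer1 = nn.Sequential(net.conv1, net.bn1, net.relu, net.maxpool, net.layer1)
    encoder.layer2 = net.layer2
    encoder.layer3 = net.layer3
    encoder.layer4 = net.layer4
    return encoder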

multiple-gpu

If I use eight GPUs, in addition to setting the batch size to 8, do I need to modify other parameters such as --num_nodes and the learning rate?
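
Not an official answer, but one common heuristic when changing the GPU count is to scale the base learning rate linearly with the effective (global) batch size; a toy sketch with assumed reference numbers:

# Toy sketch of linear LR scaling (a rule of thumb, not the authors' guidance).
base_lr = 0.004            # base_lr used in the training commands in these issues
ref_batch = 6 * 6          # assumed reference: 6 GPUs x batch_size 6
new_batch = 8 * 8          # 8 GPUs x batch_size 8
print(base_lr * new_batch / ref_batch)   # ~0.0071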

Error when building Pytorch-encoding

Hi, I would like to try your code, but an error shows up when I try to install pytorch-encoding. Could you share your environment info (CUDA, GPU, Python, g++, and so on)? Do you have any advice on installing it? I see that a lot of people hit this error.

My env:
OS: Ubuntu 18.04
gcc: 7.5.0
GPU: 3090
driver: 515
CUDA: tried 11.7 and 10.2
pytorch: tried 1.12 and 1.7.1

Was LSeg trained in an inductive zero-shot setting?

Thanks for your excellent research.

I have a question about LSeg's zero-shot setting.
Was LSeg trained in an inductive zero-shot setting?

What is the difference between the inductive zero-shot setting and the language-driven semantic segmentation setting in the training step?

The settings for zero-shot semantic segmentation are confusing. Please help me.

Thank you for reading.

Reproduction issue for Table 5

Hi, I have a reproduction issue for Table 5. In the paper, the LSeg with ViT-B/32 backbone achieves 79.7 pixAcc and 37.8 mIoU. However, I only get 78.9 pixAcc and 33.7 mIoU by using the released code. The reproduced pixAcc/mIoU are not as expected.

Our reproduction command, run on 8 GPU cards, is as follows.

python -u train_lseg.py --dataset ade20k --data_path datasets --batch_size 4 --exp_name lseg_ade20k_b32_240e --base_lr 0.004 --weight_decay 1e-4 --no-scaleinv --max_epochs 240 --widehead --accumulate_grad_batches 2 --backbone clip_vitb32_384

So what is the reason for the performance gap? I may have missed some detailed settings.

By the way, I encountered a warning when running the released code.
[W reducer.cpp:283] Warning: Grad strides do not match bucket view strides. This may indicate grad was not created according to the gradient layout contract, or that the param's strides changed since DDP was constructed. This is not an error, but may impair performance. grad.sizes() = [512, 768, 1, 1], strides() = [768, 1, 768, 768]

Have you ever met this warning? Could the performance gap be caused by it?

Look forward to your reply.

Error when running test.py

Hi! Thanks for the good work!
I met a problem when I tried to run test.sh. The error said, "cannot import name 'Resize' from 'utils'". I have checked utils.py; there is no function or class called 'Resize'. Is the code missing this part?
Looking forward to your reply.
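
In case it helps others hitting this, a minimal stand-in, assuming the missing Resize only needs to resize an image tensor before evaluation (this is a guess, not the original helper):

import torch.nn.functional as F

class Resize:
    """Guessed stand-in for the missing utils.Resize: bilinear-resize a CHW tensor."""
    def __init__(self, size):
        self.size = size  # (height, width)

    def __call__(self, img):
        return F.interpolate(img.unsqueeze(0), size=self.size,
                             mode="bilinear", align_corners=True).squeeze(0)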

Hard to reproduce the zero-shot results on COCO dataset.

Hi, could you please provide the range of the learning rate, or other hyper-parameter settings for the zero-shot experiments on the COCO-20i dataset? It is difficult to reproduce the results shown in the paper.
I use ViT-L/16 as the backbone, and the results are 10 points lower than yours.

error with torch-encoding

Hi! Thanks for the great work

I can't seem to import encoding after following the installation steps. The error I got is "cannot import name 'gpu' from partially initialized module 'encoding' (most likely due to a circular import)". Can you please let me know whether you know the cause of this issue? Thanks!

Is the training proceeding normally?

When I train the code, the multi-GPU allocation stops at the following location:
"
Resuming checkpoint None, exp_version=None
initializing ddp: GLOBAL_RANK: 1, MEMBER: 2/5
Resuming checkpoint None, exp_version=None
initializing ddp: GLOBAL_RANK: 4, MEMBER: 5/5

"
Is training actually progressing, or is something wrong?

Test_lseg_zs.py

Thank you for providing an implementation. While testing the code, I am running the test_lseg_zs.py file for pascal but am facing an error.
Any suggestion?

Model parameter descriptions

Hi Boyi,

Thanks a lot for releasing the code for LSeg!

I’ve been playing around with the demo code + model in a zero-shot setting and just have a few (hopefully quick) questions about some of the model parameters.

Could you please give a brief overview (description, where the default values originate, what the optimum values might be) of the following parameters used in the LSeg_MultiEvalModule:
1. ‘scales’ - e.g., lseg_app.py Line 315
2. (‘base_size’ - e.g., additional_utils/models.py Line 28)
3. ‘crop_size’ - e.g., additional_utils/models.py Line 29
And this parameter used in the LSegNet class:
4. scale_factor - e.g., module/models/lseg_net.py Line 216 (this has a default value of 0.5 and is different to the scale_factor parameter that is passed to 'Interpolate')

Thanks!

Training configuration

Hi! Can you please let me know what is the correct training configuration to reproduce the performance reported in the paper?

In the paper you mentioned that 6 GPUs were used and the batch size was set to 6. Does this mean that I should just launch train.sh with 6 GPUs available? And can you please let me know the approximate training time? Thanks!

Questions about training and inference configuration

Hi,

Thanks for open-sourcing such great work. I have some questions when using this code:

  1. Does the test_lseg.py script support multi-GPU inference? When using a single GPU, it takes about 2~3 hours for inference on ade20k.
  2. I tried to evaluate the provided demo_e200.ckpt on ade20k and got (pixAcc: 0.8078, mIoU: 0.3207), is that correct? It seems lower than the values in the paper.
  3. I trained a model on ade20k (the same config as train.sh, backbone is vit_l16_384) with 8*V100 but found it needs ~90 hours for training 240 epochs. Is it reasonable (it seems much longer than you said in #7)?
  4. When I use this code for other datasets like cityscapes, what changes should I make? The only difference I found is get_labels() in lseg_module.py (see the sketch below). Have you evaluated the mIoU on cityscapes?

Thanks in advance.
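
Regarding item 4, a hedged sketch of the kind of change it asks about: adding a Cityscapes label list (the 19 standard evaluation classes) to a get_labels-style helper. The function shape is an assumption, not the repository's exact code.

# Hypothetical extension of a get_labels-style helper for Cityscapes.
CITYSCAPES_LABELS = [
    "road", "sidewalk", "building", "wall", "fence", "pole", "traffic light",
    "traffic sign", "vegetation", "terrain", "sky", "person", "rider", "car",
    "truck", "bus", "train", "motorcycle", "bicycle",
]

def get_labels(dataset: str):
    if dataset.lower() == "cityscapes":
        return CITYSCAPES_LABELS
    raise ValueError(f"no label list defined for {dataset!r}")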

How to train the zero-shot model?

Hi! Thanks for your interesting work!
I am trying to reproduce the zero-shot experiments in the paper, but, like #19 (comment), I get mIoU much lower than yours.

Here are my scripts:

train_lseg_zs.py:

from modules.lseg_module_zs import LSegModuleZS
from utils import do_training, get_default_argument_parser

if __name__ == "__main__":
    parser = LSegModuleZS.add_model_specific_args(get_default_argument_parser())
    args = parser.parse_args()
    do_training(args, LSegModuleZS)

command:

python -u train_lseg_zs.py --backbone clip_resnet101 --exp_name lsegzs_pascal_f0 --dataset pascal \
--widehead --no-scaleinv --arch_option 0 --ignore_index 255 --fold 0 --nshot 0 --batch_size 8 \

Default arguments: base_lr=0.004, weight_decay=1e-4, momentum=0.9

I wonder where the problem is. And could you please share your training scripts for the zero-shot experiment?

How many GPUs are needed?

Thanks for your great work. I would like to know how many GPUs are needed to train this model.

Link to zero-shot model of COCO fold 1

Hi,
Thanks for your great work. I believe the link to the zero-shot COCO fold 1 model is identical to fold 2, or did I miss anything?
Could you please take a look?
Many thanks,

Training set used in demo model

Good day,

I was wondering whether the demo model available from the repo (demo_e200.ckpt) was solely trained on ADE20K, as specified in the repo, or whether it was trained on all 7 datasets presented in section 5.2. This is unclear to me since the demo model works well with classes that are not covered by ADE20K, such as animals, which are covered by other datasets such as COCO.

In case it was trained on multiple datasets, I would like to know how to do so myself.

Thank you in advance.

How long does it take for default training?

System: 4x RTX 3090.
Training script (default):
python -u train_lseg.py --dataset ade20k --data_path ./datasets --batch_size 1 --exp_name lseg_ade20k_l16 \
--base_lr 0.004 --weight_decay 1e-4 --no-scaleinv --max_epochs 200 --widehead --accumulate_grad_batches 2 --backbone clip_vitl16_384

My system shows that one epoch takes about 45 minutes, which is a pretty long time for 200 epochs. Is that normal, or do we not need that many epochs?
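
For what it's worth, the quick arithmetic on the numbers above:

# 45 minutes per epoch x 200 epochs on 4x RTX 3090 (numbers from the post above).
minutes = 45 * 200
print(minutes / 60, "hours")        # 150.0 hours
print(minutes / 60 / 24, "days")    # 6.25 days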

about backward between encode_text and encode_image

Thank you very much for your innovative work.
I want to find the code that updates the image encoder in your model, but I can't find it, which is very important to me. Like you, I also want to use CLIP's text encoder, but I get the error 'Trying to backward through the graph a second time'. There are two networks that need backpropagation; please tell me how you did it.

cannot load released ckpt to perform inference

Hi, @Boyiliee ,
Great work. I just followed your instructions to run the demo and it failed. The issue occurred when loading the model from the released checkpoint. I attach the error below:
"super(_open_zipfile_reader, self).__init__(torch._C.PyTorchFileReader(name_or_buffer))
RuntimeError: [enforce fail at inline_container.cc:145] . PytorchStreamReader failed reading zip archive: invalid header or archive is corrupted"

looking forward to your feedback.

freeze CLIP TextEncoder

Hi, the paper points out that the text encoder of CLIP should be frozen during training. I wonder how to achieve this and where the corresponding code is. Thanks!
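
For context, a hedged sketch (not necessarily the repo's exact mechanism) of how a CLIP text encoder is usually frozen in PyTorch: set requires_grad = False on all text-side parameters so the optimizer never updates them.

import clip
import torch

model, _ = clip.load("ViT-B/32", device="cpu")
# Freeze the text-side modules and parameters of the OpenAI CLIP model.
for module in (model.token_embedding, model.transformer, model.ln_final):
    for p in module.parameters():
        p.requires_grad = False
model.positional_embedding.requires_grad = False
model.text_projection.requires_grad = False

# Only parameters that still require grad are handed to the optimizer.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(trainable, lr=0.004)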

Pretrained LSeg on Pascal-5i, COCO-20i

Congrats on your paper being accepted to ICLR 2022!

Do you have pretrained models for the 4 folds of Pascal-5i and COCO-20i? Can you share them?

I really appreciate your response.

How to train a zero-shot model

Hi, thanks for the interesting work and demo!

I wrote train_lseg_zs.py based on train_lseg.py to train a zero-shot model; it gets mIoU = 28.36% (pascal fold 0, best val mIoU = 27.51%, epoch = 0), versus 52.8% reported in the paper. I have tested the pretrained model pascal_fold0.ckpt and got mIoU = 52.8%.

So I wonder how the model was trained. Could you please provide the training scripts for the zero-shot experiment?

Here is my script:

train_lseg_zs.py

from modules.lseg_module_zs import LSegModuleZS
from utils import do_training, get_default_argument_parser

if __name__ == "__main__":
    parser = LSegModuleZS.add_model_specific_args(get_default_argument_parser())
    args = parser.parse_args()
    do_training(args, LSegModuleZS)

command:

python -u train_lseg_zs.py --backbone clip_resnet101 --exp_name lsegzs_pascal_f0 --dataset pascal \
--widehead --no-scaleinv --arch_option 0 --ignore_index 255 --fold 0 --nshot 0 --batch_size 8 \

Default arguments: base_lr=0.004, weight_decay=1e-4, momentum=0.9

Here is the log (Epoch01-05):


:=========== Few-shot Seg. with HSNet ===========
|              logpath:                         
|            benchmark: pascal                  
|                  bsz: 8                       
|                 fold: 0                       
|                nshot: 0                       
|        finetune_mode: False                   
:================================================

[Epoch: 00] [Batch: 0001/0125] L: 0.76545  Avg L: 0.76545  mIoU:  8.32  |  FB-IoU: 13.67

*** Validation [@Epoch 00] Avg L: 0.77806  mIoU:  8.66   FB-IoU: 12.32   ***

[Epoch: 00] [Batch: 0001/1425] L: 0.70466  Avg L: 0.70466  mIoU:  5.47  |  FB-IoU: 30.35
[Epoch: 00] [Batch: 0051/1425] L: 0.47214  Avg L: 0.54077  mIoU: 38.52  |  FB-IoU: 58.87
[Epoch: 00] [Batch: 0101/1425] L: 0.51257  Avg L: 0.52989  mIoU: 41.43  |  FB-IoU: 62.78
[Epoch: 00] [Batch: 0151/1425] L: 0.47782  Avg L: 0.51624  mIoU: 44.89  |  FB-IoU: 65.67
[Epoch: 00] [Batch: 0201/1425] L: 0.44150  Avg L: 0.50688  mIoU: 47.24  |  FB-IoU: 67.38
[Epoch: 00] [Batch: 0251/1425] L: 0.49717  Avg L: 0.49714  mIoU: 48.49  |  FB-IoU: 68.80
[Epoch: 00] [Batch: 0301/1425] L: 0.45244  Avg L: 0.49022  mIoU: 49.80  |  FB-IoU: 69.67
[Epoch: 00] [Batch: 0351/1425] L: 0.44908  Avg L: 0.48395  mIoU: 50.82  |  FB-IoU: 70.51
[Epoch: 00] [Batch: 0401/1425] L: 0.47295  Avg L: 0.47964  mIoU: 51.95  |  FB-IoU: 71.10
[Epoch: 00] [Batch: 0451/1425] L: 0.51765  Avg L: 0.47718  mIoU: 52.74  |  FB-IoU: 71.66
[Epoch: 00] [Batch: 0501/1425] L: 0.46239  Avg L: 0.47528  mIoU: 53.60  |  FB-IoU: 72.06
[Epoch: 00] [Batch: 0551/1425] L: 0.45335  Avg L: 0.47240  mIoU: 54.53  |  FB-IoU: 72.64
[Epoch: 00] [Batch: 0601/1425] L: 0.49251  Avg L: 0.47043  mIoU: 54.87  |  FB-IoU: 72.97
[Epoch: 00] [Batch: 0651/1425] L: 0.45890  Avg L: 0.46813  mIoU: 55.26  |  FB-IoU: 73.36
[Epoch: 00] [Batch: 0701/1425] L: 0.41210  Avg L: 0.46570  mIoU: 55.87  |  FB-IoU: 73.77
[Epoch: 00] [Batch: 0751/1425] L: 0.48672  Avg L: 0.46475  mIoU: 56.46  |  FB-IoU: 74.08
[Epoch: 00] [Batch: 0801/1425] L: 0.43756  Avg L: 0.46335  mIoU: 57.12  |  FB-IoU: 74.42
[Epoch: 00] [Batch: 0851/1425] L: 0.50632  Avg L: 0.46193  mIoU: 57.38  |  FB-IoU: 74.69
[Epoch: 00] [Batch: 0901/1425] L: 0.42512  Avg L: 0.46019  mIoU: 57.81  |  FB-IoU: 74.98
[Epoch: 00] [Batch: 0951/1425] L: 0.43964  Avg L: 0.45871  mIoU: 58.42  |  FB-IoU: 75.25
[Epoch: 00] [Batch: 1001/1425] L: 0.45453  Avg L: 0.45688  mIoU: 58.87  |  FB-IoU: 75.46
[Epoch: 00] [Batch: 1051/1425] L: 0.46058  Avg L: 0.45650  mIoU: 59.39  |  FB-IoU: 75.68
[Epoch: 00] [Batch: 1101/1425] L: 0.44513  Avg L: 0.45552  mIoU: 59.83  |  FB-IoU: 75.89
[Epoch: 00] [Batch: 1151/1425] L: 0.48968  Avg L: 0.45476  mIoU: 60.06  |  FB-IoU: 76.05
[Epoch: 00] [Batch: 1201/1425] L: 0.37759  Avg L: 0.45366  mIoU: 60.34  |  FB-IoU: 76.26
[Epoch: 00] [Batch: 1251/1425] L: 0.33930  Avg L: 0.45289  mIoU: 60.50  |  FB-IoU: 76.39
[Epoch: 00] [Batch: 1301/1425] L: 0.39120  Avg L: 0.45196  mIoU: 60.78  |  FB-IoU: 76.57
[Epoch: 00] [Batch: 1351/1425] L: 0.41460  Avg L: 0.45114  mIoU: 61.08  |  FB-IoU: 76.73
[Epoch: 00] [Batch: 1401/1425] L: 0.44821  Avg L: 0.45055  mIoU: 61.29  |  FB-IoU: 76.87

*** Training [@Epoch 00] Avg L: 0.44989  mIoU: 61.41   FB-IoU: 76.94   ***

[Epoch: 00] [Batch: 0001/0125] L: 0.44296  Avg L: 0.66636  mIoU:  8.40  |  FB-IoU: 24.95
[Epoch: 00] [Batch: 0051/0125] L: 0.40932  Avg L: 0.43308  mIoU: 26.50  |  FB-IoU: 57.01
[Epoch: 00] [Batch: 0101/0125] L: 0.43034  Avg L: 0.42625  mIoU: 27.23  |  FB-IoU: 58.17

*** Validation [@Epoch 00] Avg L: 0.42385  mIoU: 27.51   FB-IoU: 58.78   ***

[Epoch: 01] [Batch: 0001/1425] L: 0.40329  Avg L: 0.44986  mIoU: 61.41  |  FB-IoU: 76.94
[Epoch: 01] [Batch: 0051/1425] L: 0.43004  Avg L: 0.44835  mIoU: 61.71  |  FB-IoU: 77.12
[Epoch: 01] [Batch: 0101/1425] L: 0.48569  Avg L: 0.44722  mIoU: 62.09  |  FB-IoU: 77.29
[Epoch: 01] [Batch: 0151/1425] L: 0.34780  Avg L: 0.44597  mIoU: 62.34  |  FB-IoU: 77.47
[Epoch: 01] [Batch: 0201/1425] L: 0.46758  Avg L: 0.44531  mIoU: 62.66  |  FB-IoU: 77.64
[Epoch: 01] [Batch: 0251/1425] L: 0.44974  Avg L: 0.44444  mIoU: 62.86  |  FB-IoU: 77.78
[Epoch: 01] [Batch: 0301/1425] L: 0.44508  Avg L: 0.44375  mIoU: 63.05  |  FB-IoU: 77.90
[Epoch: 01] [Batch: 0351/1425] L: 0.44855  Avg L: 0.44306  mIoU: 63.22  |  FB-IoU: 78.04
[Epoch: 01] [Batch: 0401/1425] L: 0.43203  Avg L: 0.44248  mIoU: 63.46  |  FB-IoU: 78.18
[Epoch: 01] [Batch: 0451/1425] L: 0.45815  Avg L: 0.44203  mIoU: 63.72  |  FB-IoU: 78.32
[Epoch: 01] [Batch: 0501/1425] L: 0.40439  Avg L: 0.44159  mIoU: 63.93  |  FB-IoU: 78.45
[Epoch: 01] [Batch: 0551/1425] L: 0.38655  Avg L: 0.44125  mIoU: 64.11  |  FB-IoU: 78.54
[Epoch: 01] [Batch: 0601/1425] L: 0.44867  Avg L: 0.44079  mIoU: 64.27  |  FB-IoU: 78.62
[Epoch: 01] [Batch: 0651/1425] L: 0.39662  Avg L: 0.44018  mIoU: 64.45  |  FB-IoU: 78.73
[Epoch: 01] [Batch: 0701/1425] L: 0.40490  Avg L: 0.43969  mIoU: 64.63  |  FB-IoU: 78.82
[Epoch: 01] [Batch: 0751/1425] L: 0.49497  Avg L: 0.43916  mIoU: 64.80  |  FB-IoU: 78.91
[Epoch: 01] [Batch: 0801/1425] L: 0.41001  Avg L: 0.43860  mIoU: 65.06  |  FB-IoU: 79.03
[Epoch: 01] [Batch: 0851/1425] L: 0.47044  Avg L: 0.43825  mIoU: 65.24  |  FB-IoU: 79.13
[Epoch: 01] [Batch: 0901/1425] L: 0.42717  Avg L: 0.43781  mIoU: 65.47  |  FB-IoU: 79.21
[Epoch: 01] [Batch: 0951/1425] L: 0.46506  Avg L: 0.43753  mIoU: 65.62  |  FB-IoU: 79.30
[Epoch: 01] [Batch: 1001/1425] L: 0.39906  Avg L: 0.43723  mIoU: 65.72  |  FB-IoU: 79.37
[Epoch: 01] [Batch: 1051/1425] L: 0.47572  Avg L: 0.43697  mIoU: 65.85  |  FB-IoU: 79.45
[Epoch: 01] [Batch: 1101/1425] L: 0.42398  Avg L: 0.43653  mIoU: 65.94  |  FB-IoU: 79.50
[Epoch: 01] [Batch: 1151/1425] L: 0.46927  Avg L: 0.43618  mIoU: 66.06  |  FB-IoU: 79.56
[Epoch: 01] [Batch: 1201/1425] L: 0.44691  Avg L: 0.43587  mIoU: 66.17  |  FB-IoU: 79.64
[Epoch: 01] [Batch: 1251/1425] L: 0.41890  Avg L: 0.43541  mIoU: 66.29  |  FB-IoU: 79.71
[Epoch: 01] [Batch: 1301/1425] L: 0.48377  Avg L: 0.43498  mIoU: 66.46  |  FB-IoU: 79.78
[Epoch: 01] [Batch: 1351/1425] L: 0.32745  Avg L: 0.43464  mIoU: 66.52  |  FB-IoU: 79.85
[Epoch: 01] [Batch: 1401/1425] L: 0.44876  Avg L: 0.43403  mIoU: 66.63  |  FB-IoU: 79.93

*** Training [@Epoch 01] Avg L: 0.43380  mIoU: 66.73   FB-IoU: 79.97   ***

[Epoch: 01] [Batch: 0001/0125] L: 0.44064  Avg L: 0.42399  mIoU: 27.44  |  FB-IoU: 58.71
[Epoch: 01] [Batch: 0051/0125] L: 0.39828  Avg L: 0.42200  mIoU: 23.29  |  FB-IoU: 56.57
[Epoch: 01] [Batch: 0101/0125] L: 0.41144  Avg L: 0.42079  mIoU: 20.82  |  FB-IoU: 55.18

*** Validation [@Epoch 01] Avg L: 0.42004  mIoU: 20.03   FB-IoU: 54.81   ***

[Epoch: 02] [Batch: 0001/1425] L: 0.43414  Avg L: 0.43380  mIoU: 66.73  |  FB-IoU: 79.97
[Epoch: 02] [Batch: 0051/1425] L: 0.47006  Avg L: 0.43274  mIoU: 66.83  |  FB-IoU: 80.04
[Epoch: 02] [Batch: 0101/1425] L: 0.39890  Avg L: 0.43218  mIoU: 66.90  |  FB-IoU: 80.11
[Epoch: 02] [Batch: 0151/1425] L: 0.43031  Avg L: 0.43159  mIoU: 67.06  |  FB-IoU: 80.19
[Epoch: 02] [Batch: 0201/1425] L: 0.41378  Avg L: 0.43115  mIoU: 67.17  |  FB-IoU: 80.25
[Epoch: 02] [Batch: 0251/1425] L: 0.41220  Avg L: 0.43086  mIoU: 67.30  |  FB-IoU: 80.33
[Epoch: 02] [Batch: 0301/1425] L: 0.37929  Avg L: 0.43035  mIoU: 67.39  |  FB-IoU: 80.40
[Epoch: 02] [Batch: 0351/1425] L: 0.44048  Avg L: 0.42986  mIoU: 67.48  |  FB-IoU: 80.46
[Epoch: 02] [Batch: 0401/1425] L: 0.37508  Avg L: 0.42955  mIoU: 67.67  |  FB-IoU: 80.53
[Epoch: 02] [Batch: 0451/1425] L: 0.43737  Avg L: 0.42913  mIoU: 67.79  |  FB-IoU: 80.61
[Epoch: 02] [Batch: 0501/1425] L: 0.38389  Avg L: 0.42876  mIoU: 67.88  |  FB-IoU: 80.67
[Epoch: 02] [Batch: 0551/1425] L: 0.36958  Avg L: 0.42827  mIoU: 68.02  |  FB-IoU: 80.75
[Epoch: 02] [Batch: 0601/1425] L: 0.39566  Avg L: 0.42797  mIoU: 68.12  |  FB-IoU: 80.82
[Epoch: 02] [Batch: 0651/1425] L: 0.36679  Avg L: 0.42770  mIoU: 68.26  |  FB-IoU: 80.88
[Epoch: 02] [Batch: 0701/1425] L: 0.38809  Avg L: 0.42742  mIoU: 68.35  |  FB-IoU: 80.93
[Epoch: 02] [Batch: 0751/1425] L: 0.32842  Avg L: 0.42722  mIoU: 68.43  |  FB-IoU: 81.00
[Epoch: 02] [Batch: 0801/1425] L: 0.26225  Avg L: 0.42675  mIoU: 68.53  |  FB-IoU: 81.07
[Epoch: 02] [Batch: 0851/1425] L: 0.33936  Avg L: 0.42639  mIoU: 68.67  |  FB-IoU: 81.14
[Epoch: 02] [Batch: 0901/1425] L: 0.38384  Avg L: 0.42603  mIoU: 68.79  |  FB-IoU: 81.20
[Epoch: 02] [Batch: 0951/1425] L: 0.39195  Avg L: 0.42583  mIoU: 68.87  |  FB-IoU: 81.25
[Epoch: 02] [Batch: 1001/1425] L: 0.45193  Avg L: 0.42544  mIoU: 68.97  |  FB-IoU: 81.31
[Epoch: 02] [Batch: 1051/1425] L: 0.34169  Avg L: 0.42516  mIoU: 69.07  |  FB-IoU: 81.37
[Epoch: 02] [Batch: 1101/1425] L: 0.38230  Avg L: 0.42485  mIoU: 69.17  |  FB-IoU: 81.42
[Epoch: 02] [Batch: 1151/1425] L: 0.34792  Avg L: 0.42452  mIoU: 69.24  |  FB-IoU: 81.47
[Epoch: 02] [Batch: 1201/1425] L: 0.37395  Avg L: 0.42417  mIoU: 69.37  |  FB-IoU: 81.54
[Epoch: 02] [Batch: 1251/1425] L: 0.36396  Avg L: 0.42388  mIoU: 69.45  |  FB-IoU: 81.59
[Epoch: 02] [Batch: 1301/1425] L: 0.45701  Avg L: 0.42368  mIoU: 69.51  |  FB-IoU: 81.63
[Epoch: 02] [Batch: 1351/1425] L: 0.35544  Avg L: 0.42358  mIoU: 69.61  |  FB-IoU: 81.68
[Epoch: 02] [Batch: 1401/1425] L: 0.40611  Avg L: 0.42350  mIoU: 69.67  |  FB-IoU: 81.72

*** Training [@Epoch 02] Avg L: 0.42335  mIoU: 69.69   FB-IoU: 81.74   ***

[Epoch: 02] [Batch: 0001/0125] L: 0.48114  Avg L: 0.42028  mIoU: 19.98  |  FB-IoU: 54.76
[Epoch: 02] [Batch: 0051/0125] L: 0.45244  Avg L: 0.42095  mIoU: 20.91  |  FB-IoU: 55.19
[Epoch: 02] [Batch: 0101/0125] L: 0.42347  Avg L: 0.42128  mIoU: 21.42  |  FB-IoU: 55.35

*** Validation [@Epoch 02] Avg L: 0.42098  mIoU: 21.69   FB-IoU: 55.55   ***

[Epoch: 03] [Batch: 0001/1425] L: 0.41656  Avg L: 0.42335  mIoU: 69.69  |  FB-IoU: 81.74
[Epoch: 03] [Batch: 0051/1425] L: 0.38363  Avg L: 0.42257  mIoU: 69.78  |  FB-IoU: 81.80
[Epoch: 03] [Batch: 0101/1425] L: 0.36494  Avg L: 0.42210  mIoU: 69.87  |  FB-IoU: 81.86
[Epoch: 03] [Batch: 0151/1425] L: 0.31996  Avg L: 0.42191  mIoU: 69.94  |  FB-IoU: 81.92
[Epoch: 03] [Batch: 0201/1425] L: 0.28822  Avg L: 0.42155  mIoU: 70.03  |  FB-IoU: 81.97
[Epoch: 03] [Batch: 0251/1425] L: 0.41492  Avg L: 0.42117  mIoU: 70.13  |  FB-IoU: 82.03
[Epoch: 03] [Batch: 0301/1425] L: 0.37413  Avg L: 0.42083  mIoU: 70.22  |  FB-IoU: 82.08
[Epoch: 03] [Batch: 0351/1425] L: 0.44080  Avg L: 0.42038  mIoU: 70.33  |  FB-IoU: 82.14
[Epoch: 03] [Batch: 0401/1425] L: 0.44819  Avg L: 0.42010  mIoU: 70.41  |  FB-IoU: 82.20
[Epoch: 03] [Batch: 0451/1425] L: 0.35568  Avg L: 0.41966  mIoU: 70.49  |  FB-IoU: 82.25
[Epoch: 03] [Batch: 0501/1425] L: 0.39758  Avg L: 0.41940  mIoU: 70.54  |  FB-IoU: 82.30
[Epoch: 03] [Batch: 0551/1425] L: 0.44330  Avg L: 0.41916  mIoU: 70.60  |  FB-IoU: 82.34
[Epoch: 03] [Batch: 0601/1425] L: 0.34404  Avg L: 0.41895  mIoU: 70.68  |  FB-IoU: 82.38
[Epoch: 03] [Batch: 0651/1425] L: 0.39861  Avg L: 0.41872  mIoU: 70.75  |  FB-IoU: 82.42
[Epoch: 03] [Batch: 0701/1425] L: 0.37759  Avg L: 0.41848  mIoU: 70.84  |  FB-IoU: 82.47
[Epoch: 03] [Batch: 0751/1425] L: 0.38684  Avg L: 0.41823  mIoU: 70.92  |  FB-IoU: 82.52
[Epoch: 03] [Batch: 0801/1425] L: 0.37498  Avg L: 0.41805  mIoU: 71.00  |  FB-IoU: 82.56
[Epoch: 03] [Batch: 0851/1425] L: 0.39698  Avg L: 0.41779  mIoU: 71.10  |  FB-IoU: 82.61
[Epoch: 03] [Batch: 0901/1425] L: 0.38732  Avg L: 0.41745  mIoU: 71.18  |  FB-IoU: 82.66
[Epoch: 03] [Batch: 0951/1425] L: 0.39495  Avg L: 0.41708  mIoU: 71.26  |  FB-IoU: 82.71
[Epoch: 03] [Batch: 1001/1425] L: 0.37841  Avg L: 0.41702  mIoU: 71.33  |  FB-IoU: 82.74
[Epoch: 03] [Batch: 1051/1425] L: 0.36131  Avg L: 0.41684  mIoU: 71.37  |  FB-IoU: 82.77
[Epoch: 03] [Batch: 1101/1425] L: 0.38658  Avg L: 0.41641  mIoU: 71.43  |  FB-IoU: 82.81
[Epoch: 03] [Batch: 1151/1425] L: 0.28449  Avg L: 0.41623  mIoU: 71.49  |  FB-IoU: 82.85
[Epoch: 03] [Batch: 1201/1425] L: 0.37200  Avg L: 0.41598  mIoU: 71.55  |  FB-IoU: 82.88
[Epoch: 03] [Batch: 1251/1425] L: 0.34831  Avg L: 0.41563  mIoU: 71.61  |  FB-IoU: 82.92
[Epoch: 03] [Batch: 1301/1425] L: 0.43214  Avg L: 0.41548  mIoU: 71.69  |  FB-IoU: 82.97
[Epoch: 03] [Batch: 1351/1425] L: 0.39302  Avg L: 0.41528  mIoU: 71.77  |  FB-IoU: 83.01
[Epoch: 03] [Batch: 1401/1425] L: 0.39410  Avg L: 0.41494  mIoU: 71.84  |  FB-IoU: 83.05

*** Training [@Epoch 03] Avg L: 0.41487  mIoU: 71.87   FB-IoU: 83.07   ***

[Epoch: 03] [Batch: 0001/0125] L: 0.43704  Avg L: 0.42103  mIoU: 21.67  |  FB-IoU: 55.53
[Epoch: 03] [Batch: 0051/0125] L: 0.41049  Avg L: 0.42024  mIoU: 21.10  |  FB-IoU: 55.31
[Epoch: 03] [Batch: 0101/0125] L: 0.42498  Avg L: 0.41957  mIoU: 20.52  |  FB-IoU: 55.03

*** Validation [@Epoch 03] Avg L: 0.41898  mIoU: 20.37   FB-IoU: 55.04   ***

[Epoch: 04] [Batch: 0001/1425] L: 0.36217  Avg L: 0.41486  mIoU: 71.87  |  FB-IoU: 83.07
[Epoch: 04] [Batch: 0051/1425] L: 0.40869  Avg L: 0.41446  mIoU: 71.92  |  FB-IoU: 83.10
[Epoch: 04] [Batch: 0101/1425] L: 0.39136  Avg L: 0.41423  mIoU: 71.98  |  FB-IoU: 83.14
[Epoch: 04] [Batch: 0151/1425] L: 0.35325  Avg L: 0.41399  mIoU: 72.03  |  FB-IoU: 83.18
[Epoch: 04] [Batch: 0201/1425] L: 0.40706  Avg L: 0.41368  mIoU: 72.11  |  FB-IoU: 83.22
[Epoch: 04] [Batch: 0251/1425] L: 0.37324  Avg L: 0.41343  mIoU: 72.17  |  FB-IoU: 83.26
[Epoch: 04] [Batch: 0301/1425] L: 0.35059  Avg L: 0.41320  mIoU: 72.24  |  FB-IoU: 83.29
[Epoch: 04] [Batch: 0351/1425] L: 0.38189  Avg L: 0.41303  mIoU: 72.30  |  FB-IoU: 83.33
[Epoch: 04] [Batch: 0401/1425] L: 0.41410  Avg L: 0.41285  mIoU: 72.34  |  FB-IoU: 83.36
[Epoch: 04] [Batch: 0451/1425] L: 0.40258  Avg L: 0.41262  mIoU: 72.42  |  FB-IoU: 83.40
[Epoch: 04] [Batch: 0501/1425] L: 0.39372  Avg L: 0.41237  mIoU: 72.49  |  FB-IoU: 83.44
[Epoch: 04] [Batch: 0551/1425] L: 0.37758  Avg L: 0.41220  mIoU: 72.56  |  FB-IoU: 83.48
[Epoch: 04] [Batch: 0601/1425] L: 0.34296  Avg L: 0.41200  mIoU: 72.62  |  FB-IoU: 83.52
[Epoch: 04] [Batch: 0651/1425] L: 0.37940  Avg L: 0.41178  mIoU: 72.69  |  FB-IoU: 83.55
[Epoch: 04] [Batch: 0701/1425] L: 0.38315  Avg L: 0.41164  mIoU: 72.74  |  FB-IoU: 83.59
[Epoch: 04] [Batch: 0751/1425] L: 0.36820  Avg L: 0.41140  mIoU: 72.79  |  FB-IoU: 83.63
[Epoch: 04] [Batch: 0801/1425] L: 0.45394  Avg L: 0.41122  mIoU: 72.84  |  FB-IoU: 83.66
[Epoch: 04] [Batch: 0851/1425] L: 0.41756  Avg L: 0.41101  mIoU: 72.87  |  FB-IoU: 83.68
[Epoch: 04] [Batch: 0901/1425] L: 0.41762  Avg L: 0.41072  mIoU: 72.93  |  FB-IoU: 83.72
[Epoch: 04] [Batch: 0951/1425] L: 0.37698  Avg L: 0.41059  mIoU: 72.99  |  FB-IoU: 83.75
[Epoch: 04] [Batch: 1001/1425] L: 0.34747  Avg L: 0.41038  mIoU: 73.04  |  FB-IoU: 83.78
[Epoch: 04] [Batch: 1051/1425] L: 0.42113  Avg L: 0.41022  mIoU: 73.09  |  FB-IoU: 83.82
[Epoch: 04] [Batch: 1101/1425] L: 0.31263  Avg L: 0.40999  mIoU: 73.15  |  FB-IoU: 83.85
[Epoch: 04] [Batch: 1151/1425] L: 0.39397  Avg L: 0.40979  mIoU: 73.22  |  FB-IoU: 83.89
[Epoch: 04] [Batch: 1201/1425] L: 0.33008  Avg L: 0.40968  mIoU: 73.28  |  FB-IoU: 83.92
[Epoch: 04] [Batch: 1251/1425] L: 0.43431  Avg L: 0.40958  mIoU: 73.34  |  FB-IoU: 83.95
[Epoch: 04] [Batch: 1301/1425] L: 0.38524  Avg L: 0.40942  mIoU: 73.39  |  FB-IoU: 83.99
[Epoch: 04] [Batch: 1351/1425] L: 0.39327  Avg L: 0.40932  mIoU: 73.43  |  FB-IoU: 84.01
[Epoch: 04] [Batch: 1401/1425] L: 0.36319  Avg L: 0.40910  mIoU: 73.48  |  FB-IoU: 84.04

*** Training [@Epoch 04] Avg L: 0.40901  mIoU: 73.51   FB-IoU: 84.05   ***

[Epoch: 04] [Batch: 0001/0125] L: 0.45650  Avg L: 0.41905  mIoU: 20.35  |  FB-IoU: 55.02
[Epoch: 04] [Batch: 0051/0125] L: 0.42925  Avg L: 0.41770  mIoU: 20.47  |  FB-IoU: 55.09
[Epoch: 04] [Batch: 0101/0125] L: 0.40337  Avg L: 0.41644  mIoU: 20.52  |  FB-IoU: 55.09

*** Validation [@Epoch 04] Avg L: 0.41569  mIoU: 20.62   FB-IoU: 55.19   ***

[Epoch: 05] [Batch: 0001/1425] L: 0.38914  Avg L: 0.40901  mIoU: 73.51  |  FB-IoU: 84.05
[Epoch: 05] [Batch: 0051/1425] L: 0.38582  Avg L: 0.40863  mIoU: 73.56  |  FB-IoU: 84.08
[Epoch: 05] [Batch: 0101/1425] L: 0.39982  Avg L: 0.40837  mIoU: 73.63  |  FB-IoU: 84.12
[Epoch: 05] [Batch: 0151/1425] L: 0.36958  Avg L: 0.40814  mIoU: 73.70  |  FB-IoU: 84.16
[Epoch: 05] [Batch: 0201/1425] L: 0.44301  Avg L: 0.40796  mIoU: 73.77  |  FB-IoU: 84.19
[Epoch: 05] [Batch: 0251/1425] L: 0.37197  Avg L: 0.40778  mIoU: 73.84  |  FB-IoU: 84.22
[Epoch: 05] [Batch: 0301/1425] L: 0.34206  Avg L: 0.40756  mIoU: 73.89  |  FB-IoU: 84.26
[Epoch: 05] [Batch: 0351/1425] L: 0.31855  Avg L: 0.40728  mIoU: 73.94  |  FB-IoU: 84.29
[Epoch: 05] [Batch: 0401/1425] L: 0.36472  Avg L: 0.40715  mIoU: 74.00  |  FB-IoU: 84.33
[Epoch: 05] [Batch: 0451/1425] L: 0.33310  Avg L: 0.40697  mIoU: 74.04  |  FB-IoU: 84.36
[Epoch: 05] [Batch: 0501/1425] L: 0.38936  Avg L: 0.40689  mIoU: 74.09  |  FB-IoU: 84.38
[Epoch: 05] [Batch: 0551/1425] L: 0.28624  Avg L: 0.40672  mIoU: 74.15  |  FB-IoU: 84.41
[Epoch: 05] [Batch: 0601/1425] L: 0.35422  Avg L: 0.40659  mIoU: 74.19  |  FB-IoU: 84.44
[Epoch: 05] [Batch: 0651/1425] L: 0.38933  Avg L: 0.40646  mIoU: 74.24  |  FB-IoU: 84.47
[Epoch: 05] [Batch: 0701/1425] L: 0.36583  Avg L: 0.40621  mIoU: 74.30  |  FB-IoU: 84.50
[Epoch: 05] [Batch: 0751/1425] L: 0.35363  Avg L: 0.40603  mIoU: 74.36  |  FB-IoU: 84.53
[Epoch: 05] [Batch: 0801/1425] L: 0.33130  Avg L: 0.40575  mIoU: 74.42  |  FB-IoU: 84.56
[Epoch: 05] [Batch: 0851/1425] L: 0.35086  Avg L: 0.40560  mIoU: 74.46  |  FB-IoU: 84.59
[Epoch: 05] [Batch: 0901/1425] L: 0.32383  Avg L: 0.40543  mIoU: 74.50  |  FB-IoU: 84.61
[Epoch: 05] [Batch: 0951/1425] L: 0.39691  Avg L: 0.40528  mIoU: 74.54  |  FB-IoU: 84.64
[Epoch: 05] [Batch: 1001/1425] L: 0.37308  Avg L: 0.40502  mIoU: 74.58  |  FB-IoU: 84.67
[Epoch: 05] [Batch: 1051/1425] L: 0.37183  Avg L: 0.40492  mIoU: 74.63  |  FB-IoU: 84.70
[Epoch: 05] [Batch: 1101/1425] L: 0.37630  Avg L: 0.40475  mIoU: 74.67  |  FB-IoU: 84.73
[Epoch: 05] [Batch: 1151/1425] L: 0.41443  Avg L: 0.40453  mIoU: 74.71  |  FB-IoU: 84.76
[Epoch: 05] [Batch: 1201/1425] L: 0.36623  Avg L: 0.40434  mIoU: 74.76  |  FB-IoU: 84.79
[Epoch: 05] [Batch: 1251/1425] L: 0.31162  Avg L: 0.40414  mIoU: 74.81  |  FB-IoU: 84.82
[Epoch: 05] [Batch: 1301/1425] L: 0.39881  Avg L: 0.40392  mIoU: 74.84  |  FB-IoU: 84.84
[Epoch: 05] [Batch: 1351/1425] L: 0.39712  Avg L: 0.40369  mIoU: 74.89  |  FB-IoU: 84.87
[Epoch: 05] [Batch: 1401/1425] L: 0.42172  Avg L: 0.40356  mIoU: 74.94  |  FB-IoU: 84.90

*** Training [@Epoch 05] Avg L: 0.40350  mIoU: 74.96   FB-IoU: 84.91   ***

[Epoch: 05] [Batch: 0001/0125] L: 0.47616  Avg L: 0.41578  mIoU: 20.61  |  FB-IoU: 55.18
[Epoch: 05] [Batch: 0051/0125] L: 0.45201  Avg L: 0.41625  mIoU: 20.61  |  FB-IoU: 55.11
[Epoch: 05] [Batch: 0101/0125] L: 0.41850  Avg L: 0.41652  mIoU: 20.59  |  FB-IoU: 55.01

*** Validation [@Epoch 05] Avg L: 0.41648  mIoU: 20.59   FB-IoU: 55.01   ***

Reason for bad results with CLIP-based initialization of the image encoder

This is a question on an interesting report in the paper.
The paper reported

We also evaluated on a model initialized with the CLIP image encoder with the same setup and hyperparameters, but observed worse performance than using the ViT initialization.

It seems surprising that the CLIP image encoder, which is already well aligned to the text encoder, is not helpful for the task. Do the authors have any guesses about the reason? Also, was the performance much worse or only a little worse?

Different zero-shot results; grad strides do not match bucket view strides

Thank you for your great paper.

I tried to train a zero-shot model (vitl16_384), tested it on PASCAL fold 0, and ran into the following problems:

  1. After testing, it returned mIoU = 32.9 versus 61.3 reported in the paper.

This is my training script:

python train_lseg_zs.py \
--exp_name train_vitl16_pascal_fold0 --project_name lightseg \
--backbone clip_vitl16_384 \
--dataset pascal --data_path data/Dataset_HSN \
--fold 0 --nshot 0 \
--batch_size 4 --base_lr 0.0001 --max_epochs 200 \
--weight_decay 1e-5 --no-scaleinv --widehead

How does the max_epochs argument take part in the training process, given that only 4 epochs are logged?
Apart from changing the model from vitl16_384 to vitb32_384, is there anything wrong with my training script?

  2. While training, this warning is logged:
[W reducer.cpp:283] Warning: Grad strides do not match bucket view strides. This may indicate grad was not created according to the gradient layout contract, or that the param's strides changed since DDP was constructed.  This is not an error, but may impair performance.
grad.sizes() = [512, 256, 1, 1], strides() = [256, 1, 256, 256]
bucket_view.sizes() = [512, 256, 1, 1], strides() = [256, 1, 1, 1] (function operator())

While training, DDP is enabled and I only used one GPU with batch_size = 4. I am not sure if this hurts training. Could the accumulate_grad_batches argument be causing this?
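
Not an authoritative diagnosis, but this DDP warning generally means some gradients arrive with a memory layout that differs from the parameter's layout; a small hypothetical helper to spot the offending parameters after a backward pass:

import torch

def report_mismatched_grads(model: torch.nn.Module):
    # Hypothetical debugging helper (not part of lang-seg): after a backward
    # pass, list parameters whose gradient strides differ from the parameter
    # strides, the usual trigger for the bucket-stride warning above.
    for name, p in model.named_parameters():
        if p.grad is not None and p.grad.stride() != p.stride():
            print(name, "param stride", p.stride(), "grad stride", p.grad.stride())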

load ckpt error

I want to run

python -u test_lseg_zs.py --backbone clip_resnet101 --module clipseg_DPT_test_v2 --dataset fss \
--widehead --no-scaleinv --arch_option 0 --ignore_index 255 --fold 0 --nshot 0 \
--weights checkpoints/fss_l16.ckpt 

which loads it into <class 'modules.lseg_module_zs.LSegModuleZS'>, and I get this error:

size mismatch for net.scratch.layer4_rn.weight: copying a param with shape torch.Size([256, 1024, 3, 3]) from checkpoint, the shape in current model is torch.Size([256, 2048, 3, 3]).
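
A small hypothetical check that can make mismatches like this visible: inspect the shapes stored in the checkpoint for the reported layer and compare them with what the chosen --backbone builds (a 1024-vs-2048 difference often points at a backbone mismatch between the checkpoint and the flag).

import torch

# Hypothetical debugging snippet: print the checkpoint's shapes for the layer
# named in the size-mismatch error.
ckpt = torch.load("checkpoints/fss_l16.ckpt", map_location="cpu")
state = ckpt.get("state_dict", ckpt)
for name, tensor in state.items():
    if "layer4_rn" in name:
        print(name, tuple(tensor.shape))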

System requirements (GPU?)

Hello,

This is great work @Boyiliee ! I'm excited to try this out.

I have a quick question: what kind of system requirements are necessary to train and run inference on this model? Specifically I am wondering about the type of GPU(s) needed to train LSeg.

AssertionError: Please setup the dataset usingencoding/scripts/prepare_ade20k.py

Getting the following error after running the streamlit command, i.e. streamlit run lseg_app.py:

Namespace(model='encnet', backbone='clip_vitl16_384', dataset='ade20k', workers=16, base_size=520, crop_size=480, train_split='train', aux=False, se_loss=False, se_weight=0.2, batch_size=16, test_batch_size=16, no_cuda=False, seed=1, weights='', eval=False, export=None, acc_bn=False, test_val=False, no_val=False, module='lseg', data_path='../datasets/', scale_inv=True, widehead=False, widehead_hr=False, ignore_index=-1, label_src='default', arch_option=0, block_depth=0, activation='lrelu', cuda=True)
** Use norm [0.5, 0.5, 0.5], [0.5, 0.5, 0.5] as the mean and std **
{'base_size': 520, 'crop_size': 480}
train
BaseDataset: base_size 520, crop_size 480
2022-01-13 13:34:01.885 Traceback (most recent call last):
  File "/home/resham/anaconda3/envs/lang-seg/lib/python3.9/site-packages/streamlit/legacy_caching/caching.py", line 540, in get_or_create_cached_value
    return_value = _read_from_cache(
  File "/home/resham/anaconda3/envs/lang-seg/lib/python3.9/site-packages/streamlit/legacy_caching/caching.py", line 339, in _read_from_cache
    raise e
  File "/home/resham/anaconda3/envs/lang-seg/lib/python3.9/site-packages/streamlit/legacy_caching/caching.py", line 324, in _read_from_cache
    return _read_from_mem_cache(
  File "/home/resham/anaconda3/envs/lang-seg/lib/python3.9/site-packages/streamlit/legacy_caching/caching.py", line 242, in _read_from_mem_cache
    raise CacheKeyNotFoundError("Key not found in mem cache")
streamlit.legacy_caching.caching.CacheKeyNotFoundError: Key not found in mem cache

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/resham/anaconda3/envs/lang-seg/lib/python3.9/site-packages/streamlit/script_runner.py", line 354, in _run_script
    exec(code, module.__dict__)
  File "/home/resham/lang-seg/lseg_app.py", line 341, in <module>
    lseg_model, lseg_transform = load_model()
  File "/home/resham/anaconda3/envs/lang-seg/lib/python3.9/site-packages/streamlit/legacy_caching/caching.py", line 574, in wrapped_func
    return get_or_create_cached_value()
  File "/home/resham/anaconda3/envs/lang-seg/lib/python3.9/site-packages/streamlit/legacy_caching/caching.py", line 558, in get_or_create_cached_value
    return_value = func(*args, **kwargs)
  File "/home/resham/lang-seg/lseg_app.py", line 274, in load_model
    module = LSegModule.load_from_checkpoint(
  File "/home/resham/anaconda3/envs/lang-seg/lib/python3.9/site-packages/pytorch_lightning/core/saving.py", line 157, in load_from_checkpoint
    model = cls._load_model_state(checkpoint, strict=strict, **kwargs)
  File "/home/resham/anaconda3/envs/lang-seg/lib/python3.9/site-packages/pytorch_lightning/core/saving.py", line 199, in _load_model_state
    model = cls(**_cls_kwargs)
  File "/home/resham/lang-seg/modules/lseg_module.py", line 55, in __init__
    self.trainset = self.get_trainset(
  File "/home/resham/lang-seg/modules/lsegmentation_module.py", line 202, in get_trainset
    dset = get_dataset(
  File "/home/resham/lang-seg/data/__init__.py", line 19, in get_dataset
    return encoding_datasets[name.lower()](**kwargs)
  File "/home/resham/anaconda3/envs/lang-seg/lib/python3.9/site-packages/encoding/datasets/__init__.py", line 39, in get_dataset
    return datasets[name.lower()](**kwargs)
  File "/home/resham/anaconda3/envs/lang-seg/lib/python3.9/site-packages/encoding/datasets/ade20k.py", line 29, in __init__
    assert os.path.exists(root), "Please setup the dataset using" + \
AssertionError: Please setup the dataset usingencoding/scripts/prepare_ade20k.py

How many GPUs do you use?

Hello, can you tell me how many GPUs you used for training the model? This is important for reproducing your results. Thank you!

Huggingface Spaces

Hi, would you be interested in sharing a web demo on Huggingface Spaces for lang-seg?

It would make this model more accessible as it would allow people to try out the model directly from the browser. Some other recent machine learning model repos have set up Spaces for easy access:

github: https://github.com/salesforce/BLIP
Spaces: https://huggingface.co/spaces/akhaliq/BLIP

github: https://github.com/facebookresearch/omnivore
Spaces: https://huggingface.co/spaces/akhaliq/omnivore

Spaces is completely free, and I can help set up a Gradio Space. Here are some getting-started instructions if you'd prefer to do it yourself: https://huggingface.co/blog/gradio-spaces

Query regarding the Dataset Split

Hi,

Thanks for open-sourcing this awesome work. While going through your code I could not find the zero-shot or few-shot splits for the dataset; I could only find the ADE20K supervised label split. Does this mean this code is for the fully supervised version?

Question of label set vectors

Hi, thanks for providing this great work.
I have a question about the implementation detail of the label set vectors (T). As you've pointed out in the paper, the text encoder embeds the set of N potential labels into a continuous vector space. However, as far as I can see, the code below seems to be that part, and it seems that only the feature of the EOS token is selected after tokenizing the label set.

text_features = self.clip_pretrained.encode_text(text)

Shouldn't it extract the embedding from each label token?
Or is this handled by another part of the code?
Thanks
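
For reference, a minimal sketch of how the OpenAI CLIP API behaves here, assuming the label set is tokenized with clip.tokenize: each label becomes its own token sequence, so taking the EOS-token feature per sequence yields one embedding per label (N vectors in total), not a single vector for the whole set.

import clip
import torch

model, _ = clip.load("ViT-B/32", device="cpu")
labels = ["dog", "grass", "other"]              # example label set (N = 3)
tokens = clip.tokenize(labels)                  # shape [3, 77]: one sequence per label
with torch.no_grad():
    text_features = model.encode_text(tokens)   # shape [3, 512]: one vector per label
print(text_features.shape)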

How to get pixel-level embeddings

Hi, I am trying to use your model for research purposes on Explainable AI.
After struggling for longer than I'd like to admit, I finally managed to get it up and working; however, I can't find an easy way to get the pixel-level embeddings from your framework, since the interfaces are quite convoluted.

Right now I've been able to do so with evaluator._modules['module'].net.get_image_features(image) starting from your notebook. I had to write get_image_features as a modified version of forward that ends at the image features. As such, I don't think this is the best way.

Do you have any suggestions on how to proceed? Maybe some general instructions on how to go about it?

Thank you in advance!
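
In case it's useful, a hypothetical alternative to modifying forward(): register a forward hook on the module that produces the per-pixel features. The module path "scratch.head1" is an assumption about lseg_net.py's layout; adjust it to wherever the pixel embeddings actually live.

import torch

def capture_activation(model, module_path, run_forward):
    # Grab the output of an inner module via a forward hook (hypothetical helper).
    store = {}
    module = model.get_submodule(module_path)
    handle = module.register_forward_hook(lambda m, inp, out: store.update(out=out.detach()))
    try:
        with torch.no_grad():
            run_forward(model)
    finally:
        handle.remove()
    return store["out"]

# Usage sketch (module path and call signature are assumptions):
# pixel_embeddings = capture_activation(net, "scratch.head1",
#                                       lambda m: m(image, labelset=""))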

How to prepare and train on medical images?

I wonder if this is suitable for segmentation of grayscale 2D medical images. How should I do the data preparation? It looks like I need to prepare the medical dataset using exactly the same file structure as the ADE20K dataset?
For training, do I still have to use the "--dataset ade20k" argument if I prepare my customized training dataset?
Any other suggestions? Many thanks!

RuntimeError: Trying to backward through the graph a second time

Hi
Thanks for your great work! When I tried to add LSegNet to my own framework, I got "RuntimeError: Trying to backward through the graph a second time, but the saved intermediate results have already been freed." My train function is (run on ADE20K):

def train(self, cur_epoch, optim, train_loader, scheduler=None, print_int=10, logger=None):
        device = self.device
        model = self.model
        criterion = nn.CrossEntropyLoss(ignore_index=-1)
        model.train()
        for cur_step, (images, labels) in enumerate(train_loader):
            images = images.to(device, dtype=torch.float32)
            labels = labels.to(device, dtype=torch.long)
            optim.zero_grad()
            outputs = model(images, labelset='')
            loss = criterion(outputs, labels)
            self.scaler.scale(loss)
            loss.backward()
            optim.step()
            if scheduler is not None:
                scheduler.step()

The model is LSegNet and I didn't modify lseg_net.py. I think maybe some optimizations are handled by PyTorch Lightning. Could you give me some suggestions? Thank you!
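
Not a diagnosis of the exact cause here, but a self-contained toy reproduction of this error with dummy encoders, under the assumption that features from a second (e.g., text) network are computed once and reused across iterations; detaching or recomputing them each step avoids it.

import torch
import torch.nn as nn

text_encoder = nn.Linear(8, 4)
image_encoder = nn.Conv2d(3, 4, kernel_size=1)
criterion = nn.CrossEntropyLoss()
optim = torch.optim.SGD(image_encoder.parameters(), lr=0.01)

# The graph through text_encoder is built once here and reused every iteration.
text_features = text_encoder(torch.randn(2, 8))

for step in range(2):
    optim.zero_grad()
    feats = image_encoder(torch.randn(1, 3, 4, 4))                  # [1, 4, 4, 4]
    logits = torch.einsum("bchw,nc->bnhw", feats, text_features)    # [1, 2, 4, 4]
    loss = criterion(logits, torch.zeros(1, 4, 4, dtype=torch.long))
    loss.backward()   # the second iteration raises "backward through the graph a second time"
    optim.step()

# If the text encoder is frozen (as in LSeg), computing text_features under
# torch.no_grad() or calling .detach() on it removes the shared graph and the error.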

Installation errors

Traceback (most recent call last):
  File "/home/airs/Clip_Seg/lang-seg/prepare_ade20k.py", line 9, in <module>
    from encoding.utils import download, mkdir
  File "/home/airs/anaconda3/lib/python3.9/site-packages/encoding/__init__.py", line 13, in <module>
    from . import nn, functions, parallel, utils, models, datasets, transforms
  File "/home/airs/anaconda3/lib/python3.9/site-packages/encoding/nn/__init__.py", line 12, in <module>
    from .encoding import *
  File "/home/airs/anaconda3/lib/python3.9/site-packages/encoding/nn/encoding.py", line 18, in <module>
    from ..functions import scaled_l2, aggregate, pairwise_cosine
  File "/home/airs/anaconda3/lib/python3.9/site-packages/encoding/functions/__init__.py", line 2, in <module>
    from .encoding import *
  File "/home/airs/anaconda3/lib/python3.9/site-packages/encoding/functions/encoding.py", line 17, in <module>
    from encoding import gpu
ImportError: cannot import name 'gpu' from partially initialized module 'encoding' (most likely due to a circular import) (/home/airs/anaconda3/lib/python3.9/site-packages/encoding/__init__.py)

question about the "other" class

Hi, thanks for your great work.
I noticed that you use an "other" class to refer to background or unknown classes in training. May I ask where the corresponding processing code is? Besides, do you encode the unseen (novel) classes as "other" during training?
