sukjunhwang / ifc
Video Instance Segmentation using Inter-Frame Communication Transformers (NeurIPS 2021)
License: Apache License 2.0
Thanks for your wonderful work!
I noticed that in your paper, before training IFC on the VIS dataset, you first add an extra pretraining stage on the COCO dataset by setting T to 1. This implies the memory tokens and all bus layers are also pretrained during this stage.
So I'm wondering: how does this stage influence the final performance on VIS? If we do not pretrain the memory tokens and bus layers on COCO, what happens to the final performance on the YouTube-VIS dataset?
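For reference, I assume the COCO pretraining stage amounts to training on still images with the clip length set to 1, roughly like the command below; the config file name comes from this repo, but whether T=1 is set via INPUT.SAMPLING_FRAME_NUM is my guess.
python projects/IFC/train_net.py --num-gpus 8 --config-file projects/IFC/configs/base_coco.yaml INPUT.SAMPLING_FRAME_NUM 1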
Hoping for your reply and thank you again.
Thanks for your great work. I want to know how to perform sequential, clip-by-clip inference instead of processing the whole video at once.
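For what it's worth, here is a rough sketch of what I mean by sequential inference; all names are illustrative, not the repo's API:
import torch

T, S = 5, 1                                   # clip size and stride (semi-online setup)
frames = torch.randn(36, 3, 360, 640)         # hypothetical (num_frames, C, H, W) video
model = lambda clip: {"masks": torch.zeros(10, len(clip), 45, 80)}  # stand-in for the per-clip model

video_results = []
for start in range(0, frames.shape[0] - T + 1, S):
    clip = frames[start:start + T]            # overlapping window of T frames
    clip_results = model(clip)                # run the model on this clip only
    # merge clip_results into video-level tracks by matching on the T - S overlapping frames
    video_results.append(clip_results)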
After running the following command to evaluate:
python projects/IFC/train_net.py --num-gpus 8 --eval-only --config-file projects/IFC/configs/base_ytvis.yaml MODEL.WEIGHTS pretrained_weights/coco_r50.pth INPUT.SAMPLING_FRAME_NUM 5
An error occurred:
File "/SSD_DISK/users/yanghongye/projects/rvos/IFC/projects/IFC/ifc/ifc.py", line 221, in forward
video_output.update(clip_results)
File "/SSD_DISK/users/yanghongye/projects/rvos/IFC/projects/IFC/ifc/structures/clip_output.py", line 103, in update
input_clip.frame_idx] = input_clip.mask_logits[left_idx]
RuntimeError: shape mismatch: value tensor of shape [100, 5, 45, 80] cannot be broadcast to indexing result of shape [50, 5, 45, 80]
I changed num_max_inst = 100, but the error still occurred when updating the second clip of the video:
File "/SSD_DISK/users/yanghongye/projects/rvos/IFC/projects/IFC/ifc/ifc.py", line 221, in forward
video_output.update(clip_results)
File "/SSD_DISK/users/yanghongye/projects/rvos/IFC/projects/IFC/ifc/structures/clip_output.py", line 103, in update
input_clip.frame_idx] = input_clip.mask_logits[left_idx]
RuntimeError: shape mismatch: value tensor of shape [5, 5, 45, 80] cannot be broadcast to indexing result of shape [0, 5, 45, 80]
Could you help me solve it?
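For context, the error itself is plain PyTorch advanced-indexing broadcasting: the preallocated video buffer and the per-clip value disagree on the instance dimension. A minimal reproduction of the same error class (my own example, not the repo's buffers):
import torch

buffer = torch.zeros(50, 5, 45, 80)           # e.g. 50 preallocated instance slots
value = torch.randn(100, 5, 45, 80)           # a clip that produced 100 instances
buffer[:, [0, 1, 2, 3, 4]] = value            # RuntimeError: shape mismatch, like the one above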
Thanks for the code.
It would be great if you could also share the COCO pretrained model.
Thanks for your great work.
I have two questions, about memory_bus and memory_pos.
The first one:
In the paper, the memory tokens help features in different frames communicate with each other.
However, in the code, it seems the communication is designed to happen between encoder layers:
for layer_idx in range(self.num_layers):
    output = torch.cat((output, memory_bus))
    output = self.enc_layers[layer_idx](output, src_mask=mask,
                                        src_key_padding_mask=src_key_padding_mask, pos=pos)
    output, memory_bus = output[:hw, :, :], output[hw:, :, :]

    memory_bus = memory_bus.view(M, bs, t, c).permute(2, 1, 0, 3).flatten(1, 2)  # T x BM x C
    memory_bus = self.bus_layers[layer_idx](memory_bus)
    memory_bus = memory_bus.view(t, bs, M, c).permute(2, 1, 0, 3).flatten(1, 2)  # M x BT x C
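To help pin down my confusion, here is a standalone trace of just the reshapes around the bus layer (made-up sizes); it looks like the bus layer attends over the T axis, i.e. the same memory slot across all frames:
import torch

M, T, B, HW, C = 8, 5, 2, 100, 256
memory_bus = torch.randn(M, B * T, C)         # one set of M memory tokens per frame

mb = memory_bus.view(M, B, T, C).permute(2, 1, 0, 3).flatten(1, 2)  # (T, B*M, C)
# a bus layer applied here attends over the T axis, so each memory slot mixes across frames
mb = mb.view(T, B, M, C).permute(2, 1, 0, 3).flatten(1, 2)          # back to (M, B*T, C)
assert mb.shape == (M, B * T, C)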
The second one:
It seems self.memory_bus and self.memory_pos are not updated along the video. Intuitively, I guess it would be helpful if they were updated along with the frames (more on this in the note after the code below).
self.memory_bus = torch.nn.Parameter(torch.randn(num_memory_bus, d_model))
self.memory_pos = torch.nn.Parameter(torch.randn(num_memory_bus, d_model))
if num_memory_bus:
    nn.init.kaiming_normal_(self.memory_bus, mode="fan_out", nonlinearity="relu")
    nn.init.kaiming_normal_(self.memory_pos, mode="fan_out", nonlinearity="relu")
self.return_intermediate_dec = return_intermediate_dec
self.d_model = d_model
self.nhead = nhead

def _reset_parameters(self):
    for p in self.parameters():
        if p.dim() > 1:
            nn.init.xavier_uniform_(p)

def pad_zero(self, x, pad, dim=0):
    if x is None:
        return None
    pad_shape = list(x.shape)
    pad_shape[dim] = pad
    return torch.cat((x, x.new_zeros(pad_shape)), dim=dim)
def forward(self, src, mask, query_embed, pos_embed, is_train):
    # prepare for enc-dec
    bs = src.shape[0] // self.num_frames if is_train else 1
    t = src.shape[0] // bs
    _, c, h, w = src.shape

    memory_bus = self.memory_bus
    memory_pos = self.memory_pos

    # encoder
    src = src.view(bs * t, c, h * w).permute(2, 0, 1)               # HW, BT, C
    frame_pos = pos_embed.view(bs * t, c, h * w).permute(2, 0, 1)   # HW, BT, C
    frame_mask = mask.view(bs * t, h * w)                           # BT, HW

    src, memory_bus = self.encoder(src, memory_bus, memory_pos,
                                   src_key_padding_mask=frame_mask,
                                   pos=frame_pos, is_train=is_train)

    # decoder
    dec_src = src.view(h * w, bs, t, c).permute(2, 0, 1, 3).flatten(0, 1)  # THW, B, C
    query_embed = query_embed.unsqueeze(1).repeat(1, bs, 1)         # Q, B, C
    tgt = torch.zeros_like(query_embed)
    dec_pos = pos_embed.view(bs, t, c, h * w).permute(1, 3, 0, 2).flatten(0, 1)
    dec_mask = mask.view(bs, t * h * w)                             # B, THW

    clip_hs = self.clip_decoder(tgt, dec_src, memory_bus, memory_pos,
                                memory_key_padding_mask=dec_mask,
                                pos=dec_pos, query_pos=query_embed, is_train=is_train)

    ret_memory = src.permute(1, 2, 0).reshape(bs * t, c, h, w)
    return clip_hs, ret_memory
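One detail from tracing this: dec_src stacks all frames' spatial tokens, so the clip decoder attends over the whole clip at once. A quick standalone check of that reshape (made-up sizes):
import torch

T, B, HW, C = 5, 2, 100, 256
src = torch.randn(HW, B * T, C)                                    # encoder output layout
dec_src = src.view(HW, B, T, C).permute(2, 0, 1, 3).flatten(0, 1)
print(dec_src.shape)                                               # torch.Size([500, 2, 256]) = (T*HW, B, C)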
Am I misunderstanding something?
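One note on the second question: since self.memory_bus and self.memory_pos are nn.Parameter, they are updated by the optimizer during training; they are only fixed, rather than recurrent, across frames at inference. A minimal check:
import torch
import torch.nn as nn

memory_bus = nn.Parameter(torch.randn(8, 256))
loss = (memory_bus ** 2).sum()                # any loss that touches the parameter
loss.backward()
print(memory_bus.grad is not None)            # True: the optimizer can update it during training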
When will you add the camera-ready performance numbers to the README?
Thanks!
How can I obtain the FPS and AP numbers for the network on the validation set?
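In case it helps frame the question: FPS is typically measured by timing end-to-end inference over the validation videos and dividing the total frame count by the elapsed time, and YouTube-VIS AP typically comes from submitting the results file to the evaluation server, since the validation annotations are not public. A hedged timing sketch (not the authors' actual script):
import time
import torch

def measure_fps(model, videos):
    # videos: iterable of (num_frames, C, H, W) tensors; model: hypothetical per-video callable
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.perf_counter()
    num_frames = 0
    with torch.no_grad():
        for video in videos:
            model(video)
            num_frames += video.shape[0]
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return num_frames / (time.perf_counter() - start)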
Hello,
First of all, great paper! I just have one question. Would you mind helping me understand why only the last feature map is used in the transformer? Aren't you losing information by discarding the others?
IFC/projects/IFC/ifc/models/ifc.py, line 65 at commit fb2ee45
Hello again,
I have one last question that I'm still unclear about. In this implementation, is the size of the input fed into the network (B x C x H x W), with B being the number of frames? Or is it actually (B x F x C x H x W), with F being the number of frames?
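From the forward() pasted earlier in this thread, the frames appear to be flattened into the batch dimension, i.e. the transformer sees (B*T, C, H, W). A tiny illustration (my own sizes):
import torch

B, T, C, H, W = 2, 5, 256, 45, 80
frames = torch.randn(B, T, C, H, W)
src = frames.flatten(0, 1)                    # (B*T, C, H, W): the layout forward() receives
bs = src.shape[0] // T                        # B and T are then recovered as in forward()
print(src.shape, bs)                          # torch.Size([10, 256, 45, 80]) 2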
Hi, I am trying to run the evaluation, but the dataset I downloaded does not contain the annotation files. Could you give me a clue?
Hi, thanks for the amazing work!
I wanted to ask how you compute the FPS in the semi-online setup, and how it depends on the stride S and the clip size T.
Taking the T=5 & S=1 scenario (the one reported in the main results table), the model takes 5 frames at a time as input, 4 of which overlap from window to window (is this correct?). This means the number of effectively new frame predictions from step to step is just 1, as the other 4 frames are part of the overlap used to compute the matching.
With this in mind, how do you compute the FPS? I guess it is not computed counting just the 1 effective new frame, as then the FPS would be directly proportional to the stride for a fixed clip size T.
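To make the question concrete, here is the arithmetic I am unsure about (timings are made up):
t_clip = 0.1                                  # hypothetical seconds to process one T=5 window
for S in (1, 5):
    windows = 40 // S                         # e.g. a 40-frame video, ignoring edge effects
    fps = 40 / (windows * t_clip)             # counting every video frame exactly once
    print(S, fps)                             # 10.0 for S=1, 50.0 for S=5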
Thanks a lot for your clarifications!!
I trained a model using the base_coco.yaml config, but when I wanted to visualize the detection results on COCO images instead of videos, I encountered the same problem as #5.
May I ask how to use the base_coco.yaml config to run inference on some images and obtain visualized results?
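In case it is useful, here is a plain-detectron2 sketch of what I am after. It assumes the IFC config can be loaded after registering the project's extra keys; the commented-out registration call is a guess, not a confirmed API:
import cv2
from detectron2.config import get_cfg
from detectron2.data import MetadataCatalog
from detectron2.engine import DefaultPredictor
from detectron2.utils.visualizer import Visualizer

cfg = get_cfg()
# add_ifc_config(cfg)  # hypothetical: register the MODEL.IFC.* keys before merging the yaml
cfg.merge_from_file("projects/IFC/configs/base_coco.yaml")
cfg.MODEL.WEIGHTS = "output/model_final.pth"  # example path to the trained weights

predictor = DefaultPredictor(cfg)
img = cv2.imread("input.jpg")
outputs = predictor(img)

vis = Visualizer(img[:, :, ::-1], MetadataCatalog.get(cfg.DATASETS.TRAIN[0]))
out = vis.draw_instance_predictions(outputs["instances"].to("cpu"))
cv2.imwrite("vis.jpg", out.get_image()[:, :, ::-1])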
Hey,
first things first: Great paper!
I am currently trying to run your model at inference, and to that end used the script demo/demo.py and passed the arguments
--config-file ifc_repo/configs/COCO-InstanceSegmentation/mask_rcnn_R_101_FPN_3x.yaml
--output <path_to_output_file>
--video-input <path_to_input_file>
--opts MODEL.WEIGHTS detectron2://COCO-InstanceSegmentation/mask_rcnn_R_101_FPN_3x/138205316/model_final_a3ec72.pkl
Everything works fine, but I think that's not using your model, right? Putting r101.pth for WEIGHTS and R101_ytvis.yaml for the config file does not work ("KeyError: 'Non-existent config key: MODEL.IFC'"). So how can I use your pretrained model at inference, just to visualize and test the results?
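From that error message, the YTVIS config contains custom MODEL.IFC keys that the vanilla get_cfg() in demo/demo.py does not know about; detectron2 projects usually ship a helper that registers such keys before merging the yaml. A hedged sketch (the helper's name and import are my guesses; see how train_net.py builds its cfg):
from detectron2.config import get_cfg
# from ifc import add_ifc_config              # hypothetical import path

cfg = get_cfg()
# add_ifc_config(cfg)                         # register the MODEL.IFC.* keys BEFORE merge_from_file
cfg.merge_from_file("projects/IFC/configs/R101_ytvis.yaml")
cfg.MODEL.WEIGHTS = "r101.pth"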
When I finish training for instance segmentation and use demo.py to generate masks, I get a result like the first image.
The first image includes the box, the class (0 in this case), and the probability.
Also, the segmented object has edges in different colors.
I want to ask how to generate a mask like the second image:
an image without edges, boxes, probabilities, and class labels.
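A sketch of rendering only flat-colored masks from detectron2-style outputs, with no boxes, edges, scores, or class text (my own code, not from the repo):
import cv2
import numpy as np

def render_plain_masks(outputs, path="mask_only.png"):
    # outputs: detectron2-style dict with outputs["instances"].pred_masks of shape (N, H, W), bool
    masks = outputs["instances"].pred_masks.cpu().numpy()
    canvas = np.zeros(masks.shape[1:] + (3,), dtype=np.uint8)
    palette = [(255, 99, 71), (60, 179, 113), (65, 105, 225)]   # arbitrary flat colors
    for i, m in enumerate(masks):
        canvas[m] = palette[i % len(palette)]                   # solid fill: no edges or labels
    cv2.imwrite(path, canvas)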