sukjunhwang / ifc
Video Instance Segmentation using Inter-Frame Communication Transformers (NeurIPS 2021)
License: Apache License 2.0
Thanks for your wonderful work!
I noticed that in your paper, before training IFC on the VIS dataset, you first add an extra pretraining stage on the COCO dataset by setting T to 1. This implies the memory tokens and all bus layers are also pretrained during this stage.
So I'm wondering: how does this stage influence the final performance on VIS? If we do not pretrain the memory tokens and bus layers on COCO, what happens to the final performance on the YouTube-VIS dataset?
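For reference, I assume the COCO pretraining stage amounts to training on still images with the clip length set to 1, roughly like the command below; the config file name comes from this repo, but whether T=1 is set via INPUT.SAMPLING_FRAME_NUM is my guess.
python projects/IFC/train_net.py --num-gpus 8 --config-file projects/IFC/configs/base_coco.yaml INPUT.SAMPLING_FRAME_NUM 1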
Hoping for your reply and thank you again.
Thanks for your great work. I want to know how to perform sequential, clip-by-clip inference instead of processing the whole video at once.
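For what it's worth, here is a rough sketch of what I mean by sequential inference; all names are illustrative, not the repo's API:
import torch

T, S = 5, 1                                   # clip size and stride (semi-online setup)
frames = torch.randn(36, 3, 360, 640)         # hypothetical (num_frames, C, H, W) video
model = lambda clip: {"masks": torch.zeros(10, len(clip), 45, 80)}  # stand-in for the per-clip model

video_results = []
for start in range(0, frames.shape[0] - T + 1, S):
    clip = frames[start:start + T]            # overlapping window of T frames
    clip_results = model(clip)                # run the model on this clip only
    # merge clip_results into video-level tracks by matching on the T - S overlapping frames
    video_results.append(clip_results)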
After running the following command to evaluate:
python projects/IFC/train_net.py --num-gpus 8 --eval-only --config-file projects/IFC/configs/base_ytvis.yaml MODEL.WEIGHTS pretrained_weights/coco_r50.pth INPUT.SAMPLING_FRAME_NUM 5
An error occurred:
File "/SSD_DISK/users/yanghongye/projects/rvos/IFC/projects/IFC/ifc/ifc.py", line 221, in forward
video_output.update(clip_results)
File "/SSD_DISK/users/yanghongye/projects/rvos/IFC/projects/IFC/ifc/structures/clip_output.py", line 103, in update
input_clip.frame_idx] = input_clip.mask_logits[left_idx]
RuntimeError: shape mismatch: value tensor of shape [100, 5, 45, 80] cannot be broadcast to indexing result of shape [50, 5, 45, 80]
I changed num_max_inst = 100, but the error still occurred when updating the second clip of the video:
File "/SSD_DISK/users/yanghongye/projects/rvos/IFC/projects/IFC/ifc/ifc.py", line 221, in forward
video_output.update(clip_results)
File "/SSD_DISK/users/yanghongye/projects/rvos/IFC/projects/IFC/ifc/structures/clip_output.py", line 103, in update
input_clip.frame_idx] = input_clip.mask_logits[left_idx]
RuntimeError: shape mismatch: value tensor of shape [5, 5, 45, 80] cannot be broadcast to indexing result of shape [0, 5, 45, 80]
Could you help me solve it?
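For context, the error itself is plain PyTorch advanced-indexing broadcasting: the preallocated video buffer and the per-clip value disagree on the instance dimension. A minimal reproduction of the same error class (my own example, not the repo's buffers):
import torch

buffer = torch.zeros(50, 5, 45, 80)           # e.g. 50 preallocated instance slots
value = torch.randn(100, 5, 45, 80)           # a clip that produced 100 instances
buffer[:, [0, 1, 2, 3, 4]] = value            # RuntimeError: shape mismatch, like the one above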
Thanks for the code.
It would be great if you could also share the COCO pretrained model.
Thanks for your great work.
I have two questions, about memory_bus and memory_pos.
The first one:
In the paper, the memory tokens help features in different frames communicate with each other.
However, in the code, it seems the communication is designed to happen between encoder layers:
for layer_idx in range(self.num_layers):
    output = torch.cat((output, memory_bus))
    output = self.enc_layers[layer_idx](output, src_mask=mask,
                                        src_key_padding_mask=src_key_padding_mask, pos=pos)
    output, memory_bus = output[:hw, :, :], output[hw:, :, :]

    memory_bus = memory_bus.view(M, bs, t, c).permute(2, 1, 0, 3).flatten(1, 2)  # T x BM x C
    memory_bus = self.bus_layers[layer_idx](memory_bus)
    memory_bus = memory_bus.view(t, bs, M, c).permute(2, 1, 0, 3).flatten(1, 2)  # M x BT x C
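To help pin down my confusion, here is a standalone trace of just the reshapes around the bus layer (made-up sizes); it looks like the bus layer attends over the T axis, i.e. the same memory slot across all frames:
import torch

M, T, B, HW, C = 8, 5, 2, 100, 256
memory_bus = torch.randn(M, B * T, C)         # one set of M memory tokens per frame

mb = memory_bus.view(M, B, T, C).permute(2, 1, 0, 3).flatten(1, 2)  # (T, B*M, C)
# a bus layer applied here attends over the T axis, so each memory slot mixes across frames
mb = mb.view(T, B, M, C).permute(2, 1, 0, 3).flatten(1, 2)          # back to (M, B*T, C)
assert mb.shape == (M, B * T, C)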
The second one:
It seems self.memory_bus and self.memory_pos are not updated along the video. Intuitively, I guess it would be helpful if they were updated along with the frames (more on this in the note after the code below).
self.memory_bus = torch.nn.Parameter(torch.randn(num_memory_bus, d_model))
self.memory_pos = torch.nn.Parameter(torch.randn(num_memory_bus, d_model))
if num_memory_bus:
    nn.init.kaiming_normal_(self.memory_bus, mode="fan_out", nonlinearity="relu")
    nn.init.kaiming_normal_(self.memory_pos, mode="fan_out", nonlinearity="relu")
self.return_intermediate_dec = return_intermediate_dec
self.d_model = d_model
self.nhead = nhead

def _reset_parameters(self):
    for p in self.parameters():
        if p.dim() > 1:
            nn.init.xavier_uniform_(p)

def pad_zero(self, x, pad, dim=0):
    if x is None:
        return None
    pad_shape = list(x.shape)
    pad_shape[dim] = pad
    return torch.cat((x, x.new_zeros(pad_shape)), dim=dim)
def forward(self, src, mask, query_embed, pos_embed, is_train):
    # prepare for enc-dec
    bs = src.shape[0] // self.num_frames if is_train else 1
    t = src.shape[0] // bs
    _, c, h, w = src.shape

    memory_bus = self.memory_bus
    memory_pos = self.memory_pos

    # encoder
    src = src.view(bs * t, c, h * w).permute(2, 0, 1)               # HW, BT, C
    frame_pos = pos_embed.view(bs * t, c, h * w).permute(2, 0, 1)   # HW, BT, C
    frame_mask = mask.view(bs * t, h * w)                           # BT, HW

    src, memory_bus = self.encoder(src, memory_bus, memory_pos,
                                   src_key_padding_mask=frame_mask,
                                   pos=frame_pos, is_train=is_train)

    # decoder
    dec_src = src.view(h * w, bs, t, c).permute(2, 0, 1, 3).flatten(0, 1)  # THW, B, C
    query_embed = query_embed.unsqueeze(1).repeat(1, bs, 1)         # Q, B, C
    tgt = torch.zeros_like(query_embed)
    dec_pos = pos_embed.view(bs, t, c, h * w).permute(1, 3, 0, 2).flatten(0, 1)
    dec_mask = mask.view(bs, t * h * w)                             # B, THW

    clip_hs = self.clip_decoder(tgt, dec_src, memory_bus, memory_pos,
                                memory_key_padding_mask=dec_mask,
                                pos=dec_pos, query_pos=query_embed, is_train=is_train)

    ret_memory = src.permute(1, 2, 0).reshape(bs * t, c, h, w)
    return clip_hs, ret_memory
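One detail from tracing this: dec_src stacks all frames' spatial tokens, so the clip decoder attends over the whole clip at once. A quick standalone check of that reshape (made-up sizes):
import torch

T, B, HW, C = 5, 2, 100, 256
src = torch.randn(HW, B * T, C)                                    # encoder output layout
dec_src = src.view(HW, B, T, C).permute(2, 0, 1, 3).flatten(0, 1)
print(dec_src.shape)                                               # torch.Size([500, 2, 256]) = (T*HW, B, C)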
Am I misunderstanding something?
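One note on the second question: since self.memory_bus and self.memory_pos are nn.Parameter, they are updated by the optimizer during training; they are only fixed, rather than recurrent, across frames at inference. A minimal check:
import torch
import torch.nn as nn

memory_bus = nn.Parameter(torch.randn(8, 256))
loss = (memory_bus ** 2).sum()                # any loss that touches the parameter
loss.backward()
print(memory_bus.grad is not None)            # True: the optimizer can update it during training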
When will you add the camera-ready performance numbers to the README?
Thanks!
How can I obtain the FPS and AP numbers for the network on the validation set?
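In case it helps frame the question: FPS is typically measured by timing end-to-end inference over the validation videos and dividing the total frame count by the elapsed time, and YouTube-VIS AP typically comes from submitting the results file to the evaluation server, since the validation annotations are not public. A hedged timing sketch (not the authors' actual script):
import time
import torch

def measure_fps(model, videos):
    # videos: iterable of (num_frames, C, H, W) tensors; model: hypothetical per-video callable
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.perf_counter()
    num_frames = 0
    with torch.no_grad():
        for video in videos:
            model(video)
            num_frames += video.shape[0]
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return num_frames / (time.perf_counter() - start)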
Hello,
First of all, great paper! I just have one question. Would you mind helping me understand why only the last feature map is used in the transformer? Aren't you losing information by discarding the others?
IFC/projects/IFC/ifc/models/ifc.py, line 65 at commit fb2ee45
Hello again,
I have one last question that I'm still unclear about. In this implementation, is the size of the input fed into the network (B x C x H x W), with B being the number of frames? Or is it actually (B x F x C x H x W), with F being the number of frames?
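From the forward() pasted earlier in this thread, the frames appear to be flattened into the batch dimension, i.e. the transformer sees (B*T, C, H, W). A tiny illustration (my own sizes):
import torch

B, T, C, H, W = 2, 5, 256, 45, 80
frames = torch.randn(B, T, C, H, W)
src = frames.flatten(0, 1)                    # (B*T, C, H, W): the layout forward() receives
bs = src.shape[0] // T                        # B and T are then recovered as in forward()
print(src.shape, bs)                          # torch.Size([10, 256, 45, 80]) 2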
Hi, I am trying to run the evaluation, but the dataset I downloaded does not contain the annotation files. Could you give me a clue?
Hi, thanks for the amazing work!
I wanted to ask how you compute the FPS in the semi-online setup, and how it depends on the stride S and the clip size T.
Taking the T=5 & S=1 scenario (the one reported in the main results table), the model takes 5 frames at a time as input, 4 of which overlap from window to window (is this correct?). This means the number of effectively new frame predictions from step to step is just 1, as the other 4 frames are part of the overlap used to compute the matching.
With this in mind, how do you compute the FPS? I guess it is not computed counting just the 1 effective new frame, as then the FPS would be directly proportional to the stride for a fixed clip size T.
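To make the question concrete, here is the arithmetic I am unsure about (timings are made up):
t_clip = 0.1                                  # hypothetical seconds to process one T=5 window
for S in (1, 5):
    windows = 40 // S                         # e.g. a 40-frame video, ignoring edge effects
    fps = 40 / (windows * t_clip)             # counting every video frame exactly once
    print(S, fps)                             # 10.0 for S=1, 50.0 for S=5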
Thanks a lot for your clarifications!!
I trained a model using the base_coco.yaml config, but when I wanted to visualize the detection results on COCO images instead of videos, I encountered the same problem as #5.
May I ask how to use the base_coco.yaml config to run inference on some images and obtain visualized results?
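In case it is useful, here is a plain-detectron2 sketch of what I am after. It assumes the IFC config can be loaded after registering the project's extra keys; the commented-out registration call is a guess, not a confirmed API:
import cv2
from detectron2.config import get_cfg
from detectron2.data import MetadataCatalog
from detectron2.engine import DefaultPredictor
from detectron2.utils.visualizer import Visualizer

cfg = get_cfg()
# add_ifc_config(cfg)  # hypothetical: register the MODEL.IFC.* keys before merging the yaml
cfg.merge_from_file("projects/IFC/configs/base_coco.yaml")
cfg.MODEL.WEIGHTS = "output/model_final.pth"  # example path to the trained weights

predictor = DefaultPredictor(cfg)
img = cv2.imread("input.jpg")
outputs = predictor(img)

vis = Visualizer(img[:, :, ::-1], MetadataCatalog.get(cfg.DATASETS.TRAIN[0]))
out = vis.draw_instance_predictions(outputs["instances"].to("cpu"))
cv2.imwrite("vis.jpg", out.get_image()[:, :, ::-1])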
Hey,
first things first: Great paper!
I am currently trying to run your model at inference, and to that end used the script demo/demo.py and passed the arguments
--config-file ifc_repo/configs/COCO-InstanceSegmentation/mask_rcnn_R_101_FPN_3x.yaml
--output <path_to_output_file>
--video-input <path_to_input_file>
--opts MODEL.WEIGHTS detectron2://COCO-InstanceSegmentation/mask_rcnn_R_101_FPN_3x/138205316/model_final_a3ec72.pkl
Everything works fine, but I think that's not using your model, right? Putting r101.pth for WEIGHTS and R101_ytvis.yaml for the config file does not work ("KeyError: 'Non-existent config key: MODEL.IFC'"). So how can I use your pretrained model at inference, just to visualize and test the results?
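From that error message, the YTVIS config contains custom MODEL.IFC keys that the vanilla get_cfg() in demo/demo.py does not know about; detectron2 projects usually ship a helper that registers such keys before merging the yaml. A hedged sketch (the helper's name and import are my guesses; see how train_net.py builds its cfg):
from detectron2.config import get_cfg
# from ifc import add_ifc_config              # hypothetical import path

cfg = get_cfg()
# add_ifc_config(cfg)                         # register the MODEL.IFC.* keys BEFORE merge_from_file
cfg.merge_from_file("projects/IFC/configs/R101_ytvis.yaml")
cfg.MODEL.WEIGHTS = "r101.pth"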
When I finish training for instance segmentation and use demo.py to generate masks, I get a result like the first image.
The first image includes the box, the class (0 in this case), and the probability.
Also, the segmented object has edges in different colors.
I want to ask how to generate a mask like the second image:
an image without edges, boxes, probabilities, and class labels.
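A sketch of rendering only flat-colored masks from detectron2-style outputs, with no boxes, edges, scores, or class text (my own code, not from the repo):
import cv2
import numpy as np

def render_plain_masks(outputs, path="mask_only.png"):
    # outputs: detectron2-style dict with outputs["instances"].pred_masks of shape (N, H, W), bool
    masks = outputs["instances"].pred_masks.cpu().numpy()
    canvas = np.zeros(masks.shape[1:] + (3,), dtype=np.uint8)
    palette = [(255, 99, 71), (60, 179, 113), (65, 105, 225)]   # arbitrary flat colors
    for i, m in enumerate(masks):
        canvas[m] = palette[i % len(palette)]                   # solid fill: no edges or labels
    cv2.imwrite(path, canvas)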