shuuchen / DetCo.pytorch
A PyTorch implementation of DetCo: https://arxiv.org/pdf/2102.04803.pdf
License: MIT License
Hi,
Thanks for your nice code. Have you tried any experiments on CIFAR? If so, do you have any suggestions regarding the hyperparameter configs? Also, do you think it would be a good idea to resize CIFAR images to 256x256 and use the global/local views mentioned in the paper, or is it better to make a new arrangement suited to the CIFAR image size?
Thanks for your valuable work. I wonder whether you have achieved the performance reported in the original paper.
Hi, thanks for your work. May I ask about the training time?
I used 8 GPUs with a batch size of 256; it took about 240 hours for 200 epochs.
Hi, thank you for the code, but an error occurred when I installed:
ERROR: Could not install packages due to an OSError: [Errno 2] There is no such file or directory: '/home/conda/feedstock_root/build_artifacts/cffi_1600276415718/work'
Do I need to create this directory?
Hello @shuuchen ,
Thank you for providing the unsupervised training code!
When my model finished training, I noticed that the final .pth file was not saved. This might be caused by line 283:
Line 283 in b4591b9
As the model is trained for 200 epochs by default, the condition epoch % 50 == 0 can skip saving the final checkpoint. I think this should be modified. What do you say?
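A minimal sketch of one possible fix, assuming a 0-based `epoch` counter running over `total_epochs` iterations (the variable names here are illustrative, not necessarily the repo's):

```python
def should_save(epoch, total_epochs, every=50):
    """Save every `every` epochs, and always save the final epoch."""
    return epoch % every == 0 or epoch == total_epochs - 1

# With 200 epochs (indices 0..199), the original `epoch % 50 == 0`
# saves at 0, 50, 100, 150 but never at 199; the extra clause fixes that.
saved = [e for e in range(200) if should_save(e, 200)]
print(saved)  # [0, 50, 100, 150, 199]
```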
Thanks!
Hi, I have encountered a weird issue when attempting to train a DetCo (ResNet18 backbone) model. In short, the model trained perfectly well for 9 epochs, with loss on a downwards trend. Then part-way through the 10th epoch, an error occurred:
RuntimeError('The expanded size of the tensor (8) must match the existing size (16) at non-singleton dimension 2. Target sizes: [8, 128, 8]. Tensor sizes: [8, 128, 16]')
Running it again with the code wrapped in a try-except to validate confirms that after this first occurs, it occurs for every forward call made from that point onwards, on all of the GPUs. The stack trace points to the line
self.queue[:, :, ptr:ptr + batch_size] = keys.permute(1, 2, 0)
in this function, which is called at the end of the forward function:
DetCo.pytorch/detco/builder.py
Lines 48 to 62 in b4591b9
It seems like this cannot be a problem with the data, as the exact same data had been passed through the model several times previously, and from the point it first occurs it occurs for every batch. I believe the exact point at which it occurs differs each run: if I remember correctly, it occurred after 8 epochs the first time I encountered it but after 9 the second time.
I wondered whether it could be due to using 4 GPUs while the code was tested with 8, but this would still not really explain it only beginning part-way through training. Running continual experiments to test this or other hypotheses would require a lot of GPU-time, so I decided to post here first in case you have any insight. Thanks!
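For what it's worth, one hypothesis consistent with these symptoms is queue-pointer misalignment: if the effective (gathered) batch size does not always divide the queue length K, for example because of a smaller last batch when the DataLoader is created without drop_last=True, the pointer drifts until ptr + batch_size overruns the end of the queue, and the slice silently clamps to fewer slots than the keys provide. A minimal NumPy sketch of this failure mode (a 2-D queue and small sizes for illustration, not the repo's actual values):

```python
import numpy as np

K, dim = 64, 128           # illustrative queue length and feature dim
queue = np.zeros((dim, K))
ptr = 0

def dequeue_and_enqueue(keys):
    """MoCo-style ring-buffer update, as in the line the trace points to."""
    global ptr
    batch_size = keys.shape[1]
    # The slice clamps at K, so its width can be smaller than batch_size:
    slot = queue[:, ptr:ptr + batch_size]
    if slot.shape != keys.shape:
        raise ValueError(f"slot {slot.shape} != keys {keys.shape}")
    queue[:, ptr:ptr + batch_size] = keys
    ptr = (ptr + batch_size) % K

# Full batches of 16 work fine (ptr: 16, 32, 48)...
for _ in range(3):
    dequeue_and_enqueue(np.ones((dim, 16)))

# ...but one ragged batch of 8 leaves ptr = 56, and the next full batch
# needs slots 56:72 while only 56:64 exist -> a shape mismatch like
# "expanded size (8) must match the existing size (16)".
dequeue_and_enqueue(np.ones((dim, 8)))
try:
    dequeue_and_enqueue(np.ones((dim, 16)))
    failed = False
except ValueError:
    failed = True
```

If this is the cause, ensuring drop_last=True on the training DataLoader (so every gathered batch size divides K) would be one candidate fix, and would also explain why the breaking point varies between runs and GPU counts.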
Hi @shuuchen
Thanks for the wonderful and ready-to-use repo. I have a few questions, just to improve my understanding of the SSL approach.
We use DetCo for the pretext task on an unlabeled dataset such as COCO, with the ResNet-50 architecture. Once we have the pre-trained model, how do we set it up for the downstream object detection task?
Supervision => will this be supervised, i.e. where we have the images and the related bbox information for them?
Architecture => if I used ResNet-50 for the DetCo pre-trained model but want to use MobileNetV2 for the downstream object detection with labels, is that possible, or does the downstream task need ResNet-50 as well?
Thank you very much for the code. Can I run this code on one GPU?
Hi, thanks for your brilliant work, but I have some questions about it from trying to reproduce it. My device is 4x GTX 1070 Ti with 8 GB of memory each, but even when I set the batch size to 4 I still get a CUDA out-of-memory error. I'm confused whether this is normal or I did something wrong.
Hi @shuuchen
I want to train the pre-trained model on the downstream task of object detection. I used the MoCo v2 model pre-trained for 800 epochs here.
I followed this process:
step 1: Install detectron2.
step 2: Convert a pre-trained MoCo model to detectron2's format:
python3 convert-pretrain-to-detectron2.py input.pth.tar output.pkl
Put the dataset under the "./datasets" directory, following the directory structure required by detectron2.
step 3: Run training:
python train_net.py --config-file configs/pascal_voc_R_50_C4_24k_moco.yaml \
--num-gpus 1 MODEL.WEIGHTS ./output.pkl
The only change I made was using a single GPU rather than 8 GPUs.
I am getting the following error:
[08/31 12:42:12] fvcore.common.checkpoint WARNING: Some model parameters or buffers are not found in the checkpoint:
proposal_generator.rpn_head.anchor_deltas.{bias, weight}
proposal_generator.rpn_head.conv.{bias, weight}
proposal_generator.rpn_head.objectness_logits.{bias, weight}
roi_heads.box_predictor.bbox_pred.{bias, weight}
roi_heads.box_predictor.cls_score.{bias, weight}
roi_heads.res5.norm.{bias, running_mean, running_var, weight}
[08/31 12:42:12] fvcore.common.checkpoint WARNING: The checkpoint state_dict contains keys that are not used by the model:
stem.fc.0.{bias, weight}
stem.fc.2.{bias, weight}
[08/31 12:42:12] d2.engine.train_loop INFO: Starting training from iteration 0
[08/31 12:42:13] d2.engine.train_loop ERROR: Exception during training:
Traceback (most recent call last):
File "/home/ubuntu/livesense/Detectron2/detectron2/detectron2/engine/train_loop.py", line 149, in train
self.run_step()
File "/home/ubuntu/livesense/Detectron2/detectron2/detectron2/engine/defaults.py", line 493, in run_step
self._trainer.run_step()
File "/home/ubuntu/livesense/Detectron2/detectron2/detectron2/engine/train_loop.py", line 273, in run_step
loss_dict = self.model(data)
File "/home/ubuntu/anaconda3/envs/detectron_env/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/ubuntu/livesense/Detectron2/detectron2/detectron2/modeling/meta_arch/rcnn.py", line 154, in forward
features = self.backbone(images.tensor)
File "/home/ubuntu/anaconda3/envs/detectron_env/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/ubuntu/livesense/Detectron2/detectron2/detectron2/modeling/backbone/resnet.py", line 445, in forward
x = self.stem(x)
File "/home/ubuntu/anaconda3/envs/detectron_env/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/ubuntu/livesense/Detectron2/detectron2/detectron2/modeling/backbone/resnet.py", line 356, in forward
x = self.conv1(x)
File "/home/ubuntu/anaconda3/envs/detectron_env/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/ubuntu/livesense/Detectron2/detectron2/detectron2/layers/wrappers.py", line 88, in forward
x = self.norm(x)
File "/home/ubuntu/anaconda3/envs/detectron_env/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/ubuntu/anaconda3/envs/detectron_env/lib/python3.8/site-packages/torch/nn/modules/batchnorm.py", line 519, in forward
world_size = torch.distributed.get_world_size(process_group)
File "/home/ubuntu/anaconda3/envs/detectron_env/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 638, in get_world_size
return _get_group_size(group)
File "/home/ubuntu/anaconda3/envs/detectron_env/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 220, in _get_group_size
_check_default_pg()
File "/home/ubuntu/anaconda3/envs/detectron_env/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 210, in _check_default_pg
assert _default_pg is not None, \
AssertionError: Default process group is not initialized
[08/31 12:42:13] d2.engine.hooks INFO: Total training time: 0:00:00 (0:00:00 on hooks)
[08/31 12:42:13] d2.utils.events INFO: iter: 0 lr: N/A max_mem: 207M
How can we run the training on a single GPU?
attached are the logs for details
log.txt
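The final AssertionError is raised from SyncBatchNorm (the trace goes through torch/nn/modules/batchnorm.py into torch.distributed.get_world_size), which requires an initialized process group that a single-GPU run never creates. Assuming your config inherits the MoCo detection configs, which set the backbone norm to "SyncBN", one workaround is to override the norm back to plain BN on the command line (a sketch, not verified against this exact config):

```shell
# MODEL.RESNETS.NORM is a standard detectron2 config key; overriding it
# to "BN" avoids SyncBatchNorm's dependency on torch.distributed.
python train_net.py --config-file configs/pascal_voc_R_50_C4_24k_moco.yaml \
    --num-gpus 1 MODEL.WEIGHTS ./output.pkl \
    MODEL.RESNETS.NORM "BN"
```

Note that plain BN with a single GPU sees smaller per-batch statistics than SyncBN across 8 GPUs, so results may differ from the reported numbers.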
Thank you very much for the code. I have some questions.
(1) Local MLPs. Take ResNet-50 as an example: the feature dim of the last stage is 2048, so according to the paper and the code, the in_dim of the local MLP is 2048 * 9 = 18432. The first layer alone then has 18432 * 18432 = 339,738,624 ≈ 340M learnable parameters, far more than the ResNet-50 backbone (25.5M). Is it possible to train such a network, and is it really reasonable to use such a huge MLP? I open this issue just for discussion.
(2) G2L. I used this idea in another task and found that both the global and local streams converge, but the g2l loss does not. Have you encountered this situation?
Thank you again.
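The parameter estimate in (1) is straightforward to check: a single fully connected layer mapping the flattened 3x3 grid of 2048-d features to a hidden layer of the same width would hold:

```python
# Flattened local-patch feature: a 3x3 grid of 2048-d vectors (per the paper).
in_dim = 2048 * 9                  # 18432
hidden = in_dim                    # assuming a square first FC layer, as in the issue
weights = in_dim * hidden          # weight matrix alone, ignoring biases
print(weights)                     # 339738624, i.e. ~340M parameters
resnet50_params = 25.5e6
print(weights / resnet50_params)   # roughly 13x the backbone size
```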