
detco.pytorch's People

Contributors

shuuchen

Forkers

maodong2056

detco.pytorch's Issues

Experiment on CIFAR

Hi

Thanks for your nice code. Have you tried any experiments on CIFAR? If so, do you have any suggestions regarding the hyperparameter configs? Also, do you think it would be a good idea to resize CIFAR images to 256×256 and use the global/local views mentioned in the paper, or would it be better to make a new arrangement suited to the CIFAR image size?
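If you go with the first option, here is a minimal sketch of what upsampling CIFAR-10 to the ImageNet-style input size could look like (my own illustration, assuming the usual torchvision transforms; the crop sizes and augmentation strengths would still need tuning for 32×32 source images):

import torchvision
import torchvision.transforms as T

# Hypothetical sketch: upsample 32x32 CIFAR-10 images so that ImageNet-style
# global crops can be reused; the local patch views from the paper would then
# be built on top of these crops as in the original pipeline.
train_transform = T.Compose([
    T.Resize(256),                             # 32x32 -> 256x256
    T.RandomResizedCrop(224, scale=(0.2, 1.0)),
    T.RandomHorizontalFlip(),
    T.ToTensor(),
])

dataset = torchvision.datasets.CIFAR10(
    root="./data", train=True, download=True, transform=train_transform
)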

About training time

Hi, thanks for your work. May I ask about the training time?

I use 8 GPUs with a batch size of 256, and it takes about 240 hours for 200 epochs.

A pip install error

Hi, thank you for the code, but an error occurred during pip install:
ERROR: Could not install packages due to an OSError: [Errno 2] There is no such file or directory: '/home/conda/feedstock_root/build_artifacts/cffi_1600276415718/work'
Do I need to create this folder?

'model_best.pth' saving issue

Hello @shuuchen ,

Thank you for providing the unsupervised training code!

When my model finished training, I noticed that the final .pth file had not been saved. This might be caused by line 283:

if (not args.multiprocessing_distributed or (args.multiprocessing_distributed and args.rank % ngpus_per_node == 0)) and epoch % 50 == 0:

As the model is trained for 200 epochs by default, the condition epoch % 50 == 0 can cause the final checkpoint to be skipped, since the last epoch index (199, assuming 0-based counting) is not a multiple of 50. I think this should be modified. What do you say?
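For example, the guard could also fire on the last epoch. A minimal sketch, assuming 0-based epoch numbering and the argument names from the quoted line (the actual save call is left as in the original):

# Save every 50 epochs and additionally on the final epoch, so the last
# checkpoint (epoch index args.epochs - 1, i.e. 199 by default) is kept.
is_main_process = (not args.multiprocessing_distributed
                   or args.rank % ngpus_per_node == 0)
if is_main_process and (epoch % 50 == 0 or epoch == args.epochs - 1):
    ...  # keep the existing checkpoint-saving code from line 283 here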

Thanks!

Tensor dimension issue occurring partway through training

Hi, I have encountered a weird issue when attempting to train a DetCo (ResNet-18 backbone) model. In short, the model trained perfectly well for 9 epochs, with the loss on a downward trend. Then, part-way through the 10th epoch, an error occurred:

RuntimeError('The expanded size of the tensor (8) must match the existing size (16) at non-singleton dimension 2. Target sizes: [8, 128, 8]. Tensor sizes: [8, 128, 16]')

Running it again with the code wrapped in try-except confirms that, once this first occurs, it occurs for every subsequent forward call on all of the GPUs. The stack trace points to the line

self.queue[:, :, ptr:ptr + batch_size] = keys.permute(1, 2, 0)

in this function, which is called at the end of the forward function:

@torch.no_grad()
def _dequeue_and_enqueue(self, keys):
    # gather keys before updating queue
    keys = concat_all_gather(keys)
    batch_size = keys.shape[0]

    ptr = int(self.queue_ptr)
    assert self.K % batch_size == 0  # for simplicity

    # replace the keys at ptr (dequeue and enqueue)
    self.queue[:, :, ptr:ptr + batch_size] = keys.permute(1, 2, 0)
    ptr = (ptr + batch_size) % self.K  # move pointer

    self.queue_ptr[0] = ptr

It seems like this cannot be a problem with the data, as the exact same data have been passed through the model several times previously, and once the error occurs it occurs for every batch. I believe the exact point at which it occurs differs between runs: if I remember correctly, it occurred after 8 epochs the first time I encountered it and after 9 the second time.

I wondered whether it could be due to using 4 GPUs while the code was tested with 8, but that would still not really explain why it only begins part-way through training. Running repeated experiments to test this or other hypotheses would require a lot of GPU time, so I decided to post here first in case you have any insight. Thanks!
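Not an answer from the author, but one mechanism that would produce exactly this symptom: if a single smaller batch is ever enqueued (for example, the last incomplete batch of an epoch when drop_last is not set), the queue pointer drifts off its usual stride, and a later full-size batch eventually overruns the end of the queue, so the slice on the left-hand side is shorter than the gathered keys. Two possible mitigations are setting drop_last=True on the training DataLoader, or making the enqueue wrap around, as in this sketch (same attribute names as the function above; a guess at a fix, not code from the repository):

@torch.no_grad()
def _dequeue_and_enqueue(self, keys):
    # gather keys before updating queue
    keys = concat_all_gather(keys)
    batch_size = keys.shape[0]
    ptr = int(self.queue_ptr)

    keys = keys.permute(1, 2, 0)  # same layout as the original assignment
    end = ptr + batch_size
    if end <= self.K:
        self.queue[:, :, ptr:end] = keys
    else:
        # wrap around instead of overrunning the end of the queue
        first = self.K - ptr
        self.queue[:, :, ptr:] = keys[:, :, :first]
        self.queue[:, :, :end - self.K] = keys[:, :, first:]

    self.queue_ptr[0] = end % self.K  # move pointer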

How to create a downstream object detection task

Hi @shuuchen

Thanks for the wonderful and ready-to-use repo. I have a few questions, just to improve my understanding of the SSL approach.

We use DetCo for the pretext task on an unlabeled dataset such as COCO, with the ResNet-50 architecture. Once we have the pre-trained model, how do we set it up for the downstream object detection task?

  • Supervision => will the downstream task be supervised, i.e. with images and the related bounding-box annotations?

  • Architecture => for DetCo pre-training I used ResNet-50; for the labeled downstream detection task I want to use MobileNetV2. Is this possible, or should the downstream task use ResNet-50 as well?
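For what it's worth, the pre-trained weights can only be reused by a downstream backbone of the same architecture; switching to MobileNetV2 would mean pre-training that backbone instead. A hypothetical sketch of pulling the query-encoder backbone out of a MoCo/DetCo-style checkpoint into a plain torchvision ResNet-50 for supervised fine-tuning (the key prefix module.encoder_q. follows the MoCo convention and the checkpoint filename is only an example, so both may differ for this repository):

import torch
import torchvision

# Load a MoCo/DetCo-style checkpoint and keep only the query-encoder backbone
# weights, dropping the MLP projection heads.
ckpt = torch.load("checkpoint_0199.pth.tar", map_location="cpu")
state = {
    k.replace("module.encoder_q.", ""): v
    for k, v in ckpt["state_dict"].items()
    if k.startswith("module.encoder_q.") and ".fc." not in k
}

backbone = torchvision.models.resnet50()
missing, unexpected = backbone.load_state_dict(state, strict=False)
print("missing:", missing)        # typically just the classifier fc layer
print("unexpected:", unexpected)  # anything that did not map onto the backbone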

one GPU

Thank you very much for the code. Can I run this code on one GPU?

A question about your training machine

Hi, thanks for your brilliant work, but I have some questions after trying to reproduce it. My machine has 4× GTX 1070 Ti GPUs with 8 GB of memory each, but even with the batch size set to 4 I still get a CUDA out of memory error. Is this normal, or did I do something wrong?

Training a pretrained model on an object detection task on a single GPU

Hi @shuuchen

I want to train the pre-trained model on the downstream task of object detection. I used the MoCo v2 model pre-trained for 800 epochs from here.

I followed this process:
step 1: Install detectron2.

step 2: Convert a pre-trained MoCo model to detectron2's format:

python3 convert-pretrain-to-detectron2.py input.pth.tar output.pkl

Put the dataset under the "./datasets" directory, following the directory structure required by detectron2.

step 3: Run training:

python train_net.py --config-file configs/pascal_voc_R_50_C4_24k_moco.yaml \
 --num-gpus 1 MODEL.WEIGHTS ./output.pkl

The only change I made was to use a single GPU rather than 8 GPUs.

I am getting the following error:

[08/31 12:42:12] fvcore.common.checkpoint WARNING: Some model parameters or buffers are not found in the checkpoint:
proposal_generator.rpn_head.anchor_deltas.{bias, weight}
proposal_generator.rpn_head.conv.{bias, weight}
proposal_generator.rpn_head.objectness_logits.{bias, weight}
roi_heads.box_predictor.bbox_pred.{bias, weight}
roi_heads.box_predictor.cls_score.{bias, weight}
roi_heads.res5.norm.{bias, running_mean, running_var, weight}
[08/31 12:42:12] fvcore.common.checkpoint WARNING: The checkpoint state_dict contains keys that are not used by the model:
  stem.fc.0.{bias, weight}
  stem.fc.2.{bias, weight}
[08/31 12:42:12] d2.engine.train_loop INFO: Starting training from iteration 0
[08/31 12:42:13] d2.engine.train_loop ERROR: Exception during training:
Traceback (most recent call last):
  File "/home/ubuntu/livesense/Detectron2/detectron2/detectron2/engine/train_loop.py", line 149, in train
    self.run_step()
  File "/home/ubuntu/livesense/Detectron2/detectron2/detectron2/engine/defaults.py", line 493, in run_step
    self._trainer.run_step()
  File "/home/ubuntu/livesense/Detectron2/detectron2/detectron2/engine/train_loop.py", line 273, in run_step
    loss_dict = self.model(data)
  File "/home/ubuntu/anaconda3/envs/detectron_env/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/livesense/Detectron2/detectron2/detectron2/modeling/meta_arch/rcnn.py", line 154, in forward
    features = self.backbone(images.tensor)
  File "/home/ubuntu/anaconda3/envs/detectron_env/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/livesense/Detectron2/detectron2/detectron2/modeling/backbone/resnet.py", line 445, in forward
    x = self.stem(x)
  File "/home/ubuntu/anaconda3/envs/detectron_env/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/livesense/Detectron2/detectron2/detectron2/modeling/backbone/resnet.py", line 356, in forward
    x = self.conv1(x)
  File "/home/ubuntu/anaconda3/envs/detectron_env/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/livesense/Detectron2/detectron2/detectron2/layers/wrappers.py", line 88, in forward
    x = self.norm(x)
  File "/home/ubuntu/anaconda3/envs/detectron_env/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/anaconda3/envs/detectron_env/lib/python3.8/site-packages/torch/nn/modules/batchnorm.py", line 519, in forward
    world_size = torch.distributed.get_world_size(process_group)
  File "/home/ubuntu/anaconda3/envs/detectron_env/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 638, in get_world_size
    return _get_group_size(group)
  File "/home/ubuntu/anaconda3/envs/detectron_env/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 220, in _get_group_size
    _check_default_pg()
  File "/home/ubuntu/anaconda3/envs/detectron_env/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 210, in _check_default_pg
    assert _default_pg is not None, \
AssertionError: Default process group is not initialized
[08/31 12:42:13] d2.engine.hooks INFO: Total training time: 0:00:00 (0:00:00 on hooks)
[08/31 12:42:13] d2.utils.events INFO:  iter: 0    lr: N/A  max_mem: 207M

How can we run the training on a single GPU?
Attached are the logs for details:
log.txt
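For context, the AssertionError comes from SyncBatchNorm, which requires an initialized distributed process group; the MoCo detection config uses SyncBN, so a plain single-GPU run hits this assert before the first iteration. A possible workaround (my own guess, not from the thread) is to override the normalization to plain BN, either by appending MODEL.RESNETS.NORM BN to the training command the same way MODEL.WEIGHTS is passed, or through detectron2's config API:

# Hypothetical sketch: switch SyncBatchNorm to plain BatchNorm so the backbone
# can run without torch.distributed being initialized (single-GPU training).
from detectron2.config import get_cfg

cfg = get_cfg()
cfg.merge_from_file("configs/pascal_voc_R_50_C4_24k_moco.yaml")
cfg.MODEL.WEIGHTS = "./output.pkl"
cfg.MODEL.RESNETS.NORM = "BN"  # the MoCo config sets this to "SyncBN"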

Some questions about local MLPs and G2L learning

Thank you very much for the code. I have some questions.
(1) Local MLPs. Taking ResNet-50 as an example, the feature dim of the last stage is 2048, so according to the paper and the code the in_dim of the local MLPs is 2048 * 9 = 18432. A single 18432 * 18432 linear layer therefore has 339,738,624 ≈ 340 M learnable parameters, far more than the ResNet-50 backbone (25.5 M). Is it possible to train such a network, and is it really reasonable to use such huge MLPs? I open this issue just for discussion.
(2) G2L. I used this idea in another task and found that while both the global and local streams converge, the G2L loss does not. Have you encountered this situation?
Thank you again.
