
detco.pytorch's People

Contributors

shuuchen

Forkers

maodong2056

detco.pytorch's Issues

Experiment on CIFAR

Hi

Thanks for your nice code. Have you tried any experiments on CIFAR? If so, do you have any suggestions regarding the hyperparameter configs? Also, do you think it would be a good idea to resize CIFAR images to 256×256 and use the global/local views mentioned in the paper, or would it be better to make a new arrangement suited to the CIFAR image size?
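If you go with the first option, here is a minimal sketch of what upsampling CIFAR-10 to the ImageNet-style input size could look like (my own illustration, assuming the usual torchvision transforms; the crop sizes and augmentation strengths would still need tuning for 32×32 source images):

import torchvision
import torchvision.transforms as T

# Hypothetical sketch: upsample 32x32 CIFAR-10 images so that ImageNet-style
# global crops can be reused; the local patch views from the paper would then
# be built on top of these crops as in the original pipeline.
train_transform = T.Compose([
    T.Resize(256),                             # 32x32 -> 256x256
    T.RandomResizedCrop(224, scale=(0.2, 1.0)),
    T.RandomHorizontalFlip(),
    T.ToTensor(),
])

dataset = torchvision.datasets.CIFAR10(
    root="./data", train=True, download=True, transform=train_transform
)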

About training time

Hi, thanks for your work. May I ask about the training time?

I use 8 GPUs with a batch size of 256, and it takes about 240 hours for 200 epochs.

A pip install error

Hi, thank you for the code, but an error occurred during pip install:
ERROR: Could not install packages due to an OSError: [Errno 2] There is no such file or directory: '/home/conda/feedstock_root/build_artifacts/cffi_1600276415718/work'
Do I need to create this folder?

'model_best.pth' saving issue

Hello @shuuchen ,

Thank you for providing the unsupervised training code!

When my model finished training, I noticed that the final .pth file had not been saved. This might be caused by line 283:

if (not args.multiprocessing_distributed or (args.multiprocessing_distributed and args.rank % ngpus_per_node == 0)) and epoch % 50 == 0:

As the model is trained for 200 epochs by default, the condition epoch % 50 == 0 can cause the final checkpoint to be skipped, since the last epoch index (199, assuming 0-based counting) is not a multiple of 50. I think this should be modified. What do you say?
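For example, the guard could also fire on the last epoch. A minimal sketch, assuming 0-based epoch numbering and the argument names from the quoted line (the actual save call is left as in the original):

# Save every 50 epochs and additionally on the final epoch, so the last
# checkpoint (epoch index args.epochs - 1, i.e. 199 by default) is kept.
is_main_process = (not args.multiprocessing_distributed
                   or args.rank % ngpus_per_node == 0)
if is_main_process and (epoch % 50 == 0 or epoch == args.epochs - 1):
    ...  # keep the existing checkpoint-saving code from line 283 here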

Thanks!

Tensor dimension issue occurring partway through training

Hi, I have encountered a weird issue when attempting to train a DetCo (ResNet-18 backbone) model. In short, the model trained perfectly well for 9 epochs, with the loss on a downward trend. Then, part-way through the 10th epoch, an error occurred:

RuntimeError('The expanded size of the tensor (8) must match the existing size (16) at non-singleton dimension 2. Target sizes: [8, 128, 8]. Tensor sizes: [8, 128, 16]')

Running it again with the code wrapped in try-except confirms that, once this first occurs, it occurs for every subsequent forward call on all of the GPUs. The stack trace points to the line

self.queue[:, :, ptr:ptr + batch_size] = keys.permute(1, 2, 0)

in this function, which is called at the end of the forward function:

@torch.no_grad()
def _dequeue_and_enqueue(self, keys):
    # gather keys before updating queue
    keys = concat_all_gather(keys)
    batch_size = keys.shape[0]

    ptr = int(self.queue_ptr)
    assert self.K % batch_size == 0  # for simplicity

    # replace the keys at ptr (dequeue and enqueue)
    self.queue[:, :, ptr:ptr + batch_size] = keys.permute(1, 2, 0)
    ptr = (ptr + batch_size) % self.K  # move pointer

    self.queue_ptr[0] = ptr

It seems like this cannot be a problem with the data, as the exact same data have been passed through the model several times previously, and once the error occurs it occurs for every batch. I believe the exact point at which it occurs differs between runs: if I remember correctly, it occurred after 8 epochs the first time I encountered it and after 9 the second time.

I wondered whether it could be due to using 4 GPUs while the code was tested with 8, but that would still not really explain why it only begins part-way through training. Running repeated experiments to test this or other hypotheses would require a lot of GPU time, so I decided to post here first in case you have any insight. Thanks!
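Not an answer from the author, but one mechanism that would produce exactly this symptom: if a single smaller batch is ever enqueued (for example, the last incomplete batch of an epoch when drop_last is not set), the queue pointer drifts off its usual stride, and a later full-size batch eventually overruns the end of the queue, so the slice on the left-hand side is shorter than the gathered keys. Two possible mitigations are setting drop_last=True on the training DataLoader, or making the enqueue wrap around, as in this sketch (same attribute names as the function above; a guess at a fix, not code from the repository):

@torch.no_grad()
def _dequeue_and_enqueue(self, keys):
    # gather keys before updating queue
    keys = concat_all_gather(keys)
    batch_size = keys.shape[0]
    ptr = int(self.queue_ptr)

    keys = keys.permute(1, 2, 0)  # same layout as the original assignment
    end = ptr + batch_size
    if end <= self.K:
        self.queue[:, :, ptr:end] = keys
    else:
        # wrap around instead of overrunning the end of the queue
        first = self.K - ptr
        self.queue[:, :, ptr:] = keys[:, :, :first]
        self.queue[:, :, :end - self.K] = keys[:, :, first:]

    self.queue_ptr[0] = end % self.K  # move pointer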

How to create a downstream object detection task

Hi @shuuchen

Thanks for the wonderful and ready-to-use repo. I have a few questions, just to improve my understanding of the SSL approach.

We use DetCo for the pretext task on an unlabeled dataset such as COCO, with the ResNet-50 architecture. Once we have the pre-trained model, how do we set it up for the downstream object detection task?

  • Supervision => will the downstream task be supervised, i.e. with images and the related bounding-box annotations?

  • Architecture => for DetCo pre-training I used ResNet-50; for the labeled downstream detection task I want to use MobileNetV2. Is this possible, or should the downstream task use ResNet-50 as well?
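For what it's worth, the pre-trained weights can only be reused by a downstream backbone of the same architecture; switching to MobileNetV2 would mean pre-training that backbone instead. A hypothetical sketch of pulling the query-encoder backbone out of a MoCo/DetCo-style checkpoint into a plain torchvision ResNet-50 for supervised fine-tuning (the key prefix module.encoder_q. follows the MoCo convention and the checkpoint filename is only an example, so both may differ for this repository):

import torch
import torchvision

# Load a MoCo/DetCo-style checkpoint and keep only the query-encoder backbone
# weights, dropping the MLP projection heads.
ckpt = torch.load("checkpoint_0199.pth.tar", map_location="cpu")
state = {
    k.replace("module.encoder_q.", ""): v
    for k, v in ckpt["state_dict"].items()
    if k.startswith("module.encoder_q.") and ".fc." not in k
}

backbone = torchvision.models.resnet50()
missing, unexpected = backbone.load_state_dict(state, strict=False)
print("missing:", missing)        # typically just the classifier fc layer
print("unexpected:", unexpected)  # anything that did not map onto the backbone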

one GPU

Thank you very much for the code. Can I run this code on one GPU?

A question about your training machine

Hi, thanks for your brilliant work, but I have some questions after trying to reproduce it. My machine has 4× GTX 1070 Ti GPUs with 8 GB of memory each, but even with the batch size set to 4 I still get a CUDA out of memory error. Is this normal, or did I do something wrong?

Training a pretrained model on an object detection task on a single GPU

Hi @shuuchen

I want to train the pre-trained model on the downstream task of object detection. I used the MoCo v2 model pre-trained for 800 epochs from here.

I followed this process:
step 1: Install detectron2.

step 2: Convert a pre-trained MoCo model to detectron2's format:

python3 convert-pretrain-to-detectron2.py input.pth.tar output.pkl

Put the dataset under the "./datasets" directory, following the directory structure required by detectron2.

step 3: Run training:

python train_net.py --config-file configs/pascal_voc_R_50_C4_24k_moco.yaml \
 --num-gpus 1 MODEL.WEIGHTS ./output.pkl

The only change I made was to use a single GPU rather than 8 GPUs.

I am getting the following error:

[08/31 12:42:12] fvcore.common.checkpoint WARNING: Some model parameters or buffers are not found in the checkpoint:
proposal_generator.rpn_head.anchor_deltas.{bias, weight}
proposal_generator.rpn_head.conv.{bias, weight}
proposal_generator.rpn_head.objectness_logits.{bias, weight}
roi_heads.box_predictor.bbox_pred.{bias, weight}
roi_heads.box_predictor.cls_score.{bias, weight}
roi_heads.res5.norm.{bias, running_mean, running_var, weight}
[08/31 12:42:12] fvcore.common.checkpoint WARNING: The checkpoint state_dict contains keys that are not used by the model:
  stem.fc.0.{bias, weight}
  stem.fc.2.{bias, weight}
[08/31 12:42:12] d2.engine.train_loop INFO: Starting training from iteration 0
[08/31 12:42:13] d2.engine.train_loop ERROR: Exception during training:
Traceback (most recent call last):
  File "/home/ubuntu/livesense/Detectron2/detectron2/detectron2/engine/train_loop.py", line 149, in train
    self.run_step()
  File "/home/ubuntu/livesense/Detectron2/detectron2/detectron2/engine/defaults.py", line 493, in run_step
    self._trainer.run_step()
  File "/home/ubuntu/livesense/Detectron2/detectron2/detectron2/engine/train_loop.py", line 273, in run_step
    loss_dict = self.model(data)
  File "/home/ubuntu/anaconda3/envs/detectron_env/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/livesense/Detectron2/detectron2/detectron2/modeling/meta_arch/rcnn.py", line 154, in forward
    features = self.backbone(images.tensor)
  File "/home/ubuntu/anaconda3/envs/detectron_env/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/livesense/Detectron2/detectron2/detectron2/modeling/backbone/resnet.py", line 445, in forward
    x = self.stem(x)
  File "/home/ubuntu/anaconda3/envs/detectron_env/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/livesense/Detectron2/detectron2/detectron2/modeling/backbone/resnet.py", line 356, in forward
    x = self.conv1(x)
  File "/home/ubuntu/anaconda3/envs/detectron_env/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/livesense/Detectron2/detectron2/detectron2/layers/wrappers.py", line 88, in forward
    x = self.norm(x)
  File "/home/ubuntu/anaconda3/envs/detectron_env/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/anaconda3/envs/detectron_env/lib/python3.8/site-packages/torch/nn/modules/batchnorm.py", line 519, in forward
    world_size = torch.distributed.get_world_size(process_group)
  File "/home/ubuntu/anaconda3/envs/detectron_env/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 638, in get_world_size
    return _get_group_size(group)
  File "/home/ubuntu/anaconda3/envs/detectron_env/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 220, in _get_group_size
    _check_default_pg()
  File "/home/ubuntu/anaconda3/envs/detectron_env/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 210, in _check_default_pg
    assert _default_pg is not None, \
AssertionError: Default process group is not initialized
[08/31 12:42:13] d2.engine.hooks INFO: Total training time: 0:00:00 (0:00:00 on hooks)
[08/31 12:42:13] d2.utils.events INFO:  iter: 0    lr: N/A  max_mem: 207M

How can we run the training on a single GPU?
Attached are the logs for details:
log.txt
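For context, the AssertionError comes from SyncBatchNorm, which requires an initialized distributed process group; the MoCo detection config uses SyncBN, so a plain single-GPU run hits this assert before the first iteration. A possible workaround (my own guess, not from the thread) is to override the normalization to plain BN, either by appending MODEL.RESNETS.NORM BN to the training command the same way MODEL.WEIGHTS is passed, or through detectron2's config API:

# Hypothetical sketch: switch SyncBatchNorm to plain BatchNorm so the backbone
# can run without torch.distributed being initialized (single-GPU training).
from detectron2.config import get_cfg

cfg = get_cfg()
cfg.merge_from_file("configs/pascal_voc_R_50_C4_24k_moco.yaml")
cfg.MODEL.WEIGHTS = "./output.pkl"
cfg.MODEL.RESNETS.NORM = "BN"  # the MoCo config sets this to "SyncBN"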

Some questions about local MLPs and G2L learning

Thank you very much for the code. I have some questions.
(1) Local MLPs. Taking ResNet-50 as an example, the feature dim of the last stage is 2048, so according to the paper and the code the in_dim of the local MLPs is 2048 * 9 = 18432. A single 18432 * 18432 linear layer therefore has 339,738,624 ≈ 340 M learnable parameters, far more than the ResNet-50 backbone (25.5 M). Is it possible to train such a network, and is it really reasonable to use such huge MLPs? I open this issue just for discussion.
(2) G2L. I used this idea in another task and found that while both the global and local streams converge, the G2L loss does not. Have you encountered this situation?
Thank you again.
