hi, have you successfully run the train.py? I encountered a runt

runtime error about ssd.pytorch HOT 20 CLOSED

amdegroot commented on July 30, 2024

runtime error

from ssd.pytorch.

Comments (20)

superhans commented on July 30, 2024 2

Edit: I believe basic Python 2.7 vs. Python 3 compatibility issues cause the problem, since this code was written for Python 3, not Python 2.7.

Adding the line from __future__ import division in box_utils.py, prior_box.py and detection.py gets rid of the above error, and some other errors.
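To illustrate why that import matters, here is a minimal sketch of the two division behaviours. It runs under Python 3 and uses `//` to mimic what Python 2 does for `int / int` without the import. The function names and the values 30 and 300 are hypothetical, chosen only for illustration, and are not taken from the repository.

```python
# Python 2: `/` on two ints floors the result unless the module starts with
#     from __future__ import division
# Python 3: `/` is always true division.

def scale_floor_division(min_size, image_size):
    # Mimics Python 2 `min_size / image_size` for ints (no __future__ import).
    return min_size // image_size

def scale_true_division(min_size, image_size):
    # Python 3 behaviour, or Python 2 with `from __future__ import division`.
    return min_size / image_size

floor_scale = scale_floor_division(30, 300)  # ratio is truncated to an int
true_scale = scale_true_division(30, 300)    # ratio is kept as a float

print(floor_scale, true_scale)
```

A ratio that silently truncates to zero is the kind of value that can later produce a division by zero or a log of zero, so it is worth checking any `int / int` expression when running such code under Python 2.7.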

amdegroot commented on July 30, 2024

Yes I've successfully trained several models. For some reason I cannot reproduce this error on my machine. Did you make sure your repo is up to date with the current master branch?

amdegroot commented on July 30, 2024

I am using Python 3 and have not tested with 2.7, so that is the only thing I can think of at the moment if your local repo is up to date. I will add the lack of 2.7 support to the README if that's the issue.

pillar02 commented on July 30, 2024

I didn't build from the source but installed pytorch from pip. I have also made some changes to adapt your code to python 2.7 (link star expression)

I checked the latest master branch and found that https://github.com/pytorch/pytorch/blob/master/torch/autograd/variable.py#L317-L320 still only supports scalar division. In the case of your code "x/=norm.expand_as(x)", it is clearly an element-wise division. But I don't understand how the python version can affect this.

pillar02 commented on July 30, 2024

BTW, could you please give me a rough time estimate for running one epoch (with machine specs)?

amdegroot commented on July 30, 2024

Yeah I agree I don't understand how it is working on my computer if that's the case. I'll look into it more after my classes today, sorry I don't have an answer right this second. As for the time estimate, it takes ~1.4 seconds to run a batch of size 32 forward and backward, but I'm not in my lab right now so I can't remember the exact time per epoch. Will get back to you on all of this right after class.

amdegroot commented on July 30, 2024

And that's on a single Tesla K80 ^

amdegroot commented on July 30, 2024

I think if you update to the latest version of PyTorch you will see that element-wise division with .div_() is supported. I do remember that it was originally not supported, but it was added not too long ago. When I run something as simple as:

x = torch.Tensor([1,2,3,4,5,6])
y = torch.Tensor([2,2,2,2,2,2])
x/=y

the correct result is returned. With a batch size of 32, on 1 Tesla K80, it takes me ~ 109 sec. per epoch.

pillar02 commented on July 30, 2024

As I mentioned in the previous post, in the latest github pytorch source code (master branch), it still shows:

def div_(self, other):
    if not isinstance(other, Variable) and not torch.is_tensor(other):
        return DivConstant(other, inplace=True)(self)
    raise RuntimeError("div_ only supports scalar multiplication")

I still don't understand how it works for your case. But I will try to update my pytorch to the latest version.
Thanks a lot.

amdegroot commented on July 30, 2024

Yeah, I apologize for the lack of a better answer, but since I cannot reproduce the error I am closing the issue for now. Let me know if updating PyTorch fixes it; I will try to figure out more myself in the meantime.

amdegroot commented on July 30, 2024

Ah, figured it out. That line in the source code is referring to Variables, so it is just saying Variables cannot be divided by Tensors, but Variables can be divided by other Variables of the same size (which is the case here) and Tensors can be divided by other Tensors of the same size.

torch/csrc/generic/methods/TensorMath.cwrap line 1038 has what looks like the place that bridges the python and C for the tensor div_ definition, and it's implied in torch/tensor.py 378: return self.div_(other) even though it doesn't seem like self.div_ is defined.

So again, not sure what the exact source of the problem is in your case, but my best bet is your version of PyTorch. Hopefully that helps.

amdegroot commented on July 30, 2024

Also, update on training time: it takes approx. 37.5 sec. per epoch with a gtx1060 and batch size of 16, which is what I am currently using (ran out of money to afford the K80 EC2 instance :P).

pillar02 commented on July 30, 2024

Thanks a lot. I will definitely update my Pytorch.

Regarding the training time, does it really take only 37.5 sec for one epoch? (I suppose you were training on VOC2007 with about 10000 images, right?) I have tried training an MXNet SSD implementation, which takes about 270 sec for one epoch using both VOC2007 and VOC2012 data on my Titan X GPU. Does this mean this PyTorch SSD is even faster than the MXNet implementation? That doesn't seem plausible.

amdegroot commented on July 30, 2024

Yeah, that's my bad. Disregard that number, it's late here. Training purely on the training set (2501 images) from VOC07, it takes on average ~140 sec. per epoch on a single GTX 1060. So yeah, the previous number was off by a lot. I would be curious to see how it compares on a Titan X, though.
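As a rough sanity check on the numbers quoted in this thread, the per-epoch time can be estimated from the per-batch time. The sketch below assumes the earlier K80 measurement (~1.4 sec. per batch of 32) was taken on the same 2501-image training set and that the final partial batch counts as a full one; neither is stated explicitly above. The function name is hypothetical.

```python
import math

def estimate_epoch_seconds(num_images, batch_size, seconds_per_batch):
    # Number of batches needed to see every image once (last batch may be partial).
    batches = math.ceil(num_images / batch_size)
    return batches * seconds_per_batch

# Figures quoted in the thread: 2501 images, batch size 32, ~1.4 s per batch.
k80_estimate = estimate_epoch_seconds(2501, 32, 1.4)
print(round(k80_estimate, 1))
```

Under those assumptions the estimate comes to roughly 110 seconds, which is in line with the ~109 sec. per epoch reported for the K80.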

pillar02 commented on July 30, 2024

one more question ;-)

I am wondering how you got the fc-reduced VGG-16 weights?

amdegroot commented on July 30, 2024

Hahah of course... I converted them to Chainer and then from Chainer to PyTorch. I also was able to convert them to Torch and then from Torch to PyTorch, but the specific weight file I supply was one that took the Chainer route.

pillar02 commented on July 30, 2024

Hi, I just updated PyTorch to the latest version (0.1.11_5) and ran train.py.
Luckily, I didn't get the div_ error.
But this time I got "RuntimeError: cuda runtime error (59) : device-side assert triggered at /b/wheel/pytorch-src/torch/lib/T1" at "conf = labels[best_truth_idx] + 1" in box_utils.py.
Any idea about this? It seems to have something to do with tensor addition.

amdegroot commented on July 30, 2024

Is this on the first feed forward, or were you able to get through some iterations? The only time that line has ever been an issue was a while back, when I had an explicit 'background' label in the VOC labelmap and it became an index-out-of-range issue for softmax. But I'm currently training as I type this and can't think of what could be causing that. Have you pulled the most recent update of master? Or maybe you're on a different branch?
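The index-out-of-range failure described above can be caught on the CPU before it turns into an opaque device-side assert. This is a minimal sketch, assuming labels are 0-based object class indices and that num_classes counts the background class (for example 21 for the 20 VOC classes plus background). The helper name shift_labels_for_background is hypothetical and not part of the repository.

```python
def shift_labels_for_background(labels, num_classes):
    """Shift 0-based object labels by one so that index 0 can mean background.

    Raises ValueError if any label would fall outside [1, num_classes - 1]
    after the shift, which is the situation that breaks the softmax indexing.
    """
    num_object_classes = num_classes - 1  # background takes one slot
    bad = [label for label in labels if not 0 <= label < num_object_classes]
    if bad:
        raise ValueError(
            "labels out of range [0, %d): %r" % (num_object_classes, bad)
        )
    return [label + 1 for label in labels]

shifted = shift_labels_for_background([0, 7, 19], num_classes=21)
print(shifted)
```

Running a check like this over a custom dataset's annotations is a cheap way to rule out a labelmap mismatch before debugging on the GPU.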

meetps commented on July 30, 2024

I faced this issue as well, with PyTorch version ( 0.1.12_4 ) which is very recent.

I fixed it by changing the forward() function in L2Norm.py as follows:

def forward(self, x):
    norm = x.pow(2).sum(1).sqrt()+self.eps
    norm_stretch = norm.expand_as(x)
    x = x / norm_stretch
    out = self.weight.unsqueeze(0).unsqueeze(2).unsqueeze(3).expand_as(x) * x
    return out
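
For readers on newer PyTorch releases, the same normalization can be written with broadcasting instead of expand_as. This is a sketch, not the repository's code: the constructor arguments n_channels and scale and the default eps are assumptions, and keepdim=True is passed explicitly because sum(1) drops the reduced dimension in later versions.

```python
import torch
import torch.nn as nn

class L2Norm(nn.Module):
    def __init__(self, n_channels, scale, eps=1e-10):
        super().__init__()
        self.eps = eps
        # One learnable scale per channel, initialised to `scale` (assumed).
        self.weight = nn.Parameter(torch.full((n_channels,), float(scale)))

    def forward(self, x):
        # x has shape [batch, channels, height, width].
        # Keep the channel dimension so the norm broadcasts against x.
        norm = x.pow(2).sum(dim=1, keepdim=True).sqrt() + self.eps
        x = x / norm  # out-of-place division, as in the fix above
        # Reshape the per-channel weights to [1, channels, 1, 1] for broadcasting.
        return self.weight.view(1, -1, 1, 1) * x

layer = L2Norm(n_channels=4, scale=20.0)
features = torch.ones(2, 4, 3, 3)
out = layer(features)
print(out.shape)
```

The out-of-place `x / norm` avoids the in-place div_ call that triggered the original error, at the cost of one extra temporary tensor.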

I am then facing the following error in box_utils.py:

THCudaCheck FAIL file=/b/wheel/pytorch-src/torch/lib/THC/generic/THCTensorMath.cu line=226 error=59 : device-side assert triggered
Traceback (most recent call last):
  File "train_cars.py", line 232, in <module>
    train()
  File "train_cars.py", line 184, in train
    loss_l, loss_c = criterion(out, targets)
  File "/usr/local/lib/python2.7/dist-packages/torch/nn/modules/module.py", line 206, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/mshah/code/ssd.pytorch/layers/modules/multibox_loss.py", line 70, in forward
    match(self.threshold,truths,defaults,self.variance,labels,loc_t,conf_t,idx)
  File "/home/mshah/code/ssd.pytorch/layers/box_utils.py", line 107, in match
    loc = encode(matches, priors, variances)
  File "/home/mshah/code/ssd.pytorch/layers/box_utils.py", line 133, in encode
    return torch.cat([g_cxcy, g_wh], 1)  # [num_priors,4]
RuntimeError: cuda runtime error (59) : device-side assert triggered at /b/wheel/pytorch-src/torch/lib/THC/generic/THCTensorMath.cu:226

acrosson commented on July 30, 2024

Great suggestion, @superhans. Adding from __future__ import division to most of the files gets rid of any nan or inf in the loss under Python 2.7.
