
Comments (11)

violet17 commented on May 14, 2024

Thank you for the code. Finally, I found that the labels path was wrong, so it couldn't find the labels. Thank you very much for your patience.

In train.py, line 125 of your latest commit, name is "module.module_list.0.conv_0.weight"

So name.split('.')[2] is '0', while the current code uses index [1] and fails with:
Traceback (most recent call last):
  File "train.py", line 221, in <module>
    main(opt)
  File "train.py", line 127, in main
    if int(name.split('.')[1]) < 75:  # if layer < 75
ValueError: invalid literal for int() with base 10: 'module_list'

And I think if int(name.split('.')[1]) < 75: should be if int(name.split('.')[2]) < 75:
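
For context, a minimal sketch (not the repo's exact code) of why the index shifts: nn.DataParallel prefixes every parameter name with "module.", which pushes the layer number from split('.')[1] to split('.')[2]. A version that handles both cases, assuming the name format shown above:

name = 'module.module_list.0.conv_0.weight'  # parameter name as reported above
parts = name.split('.')
# drop the 'module.' prefix that nn.DataParallel adds, if present
if parts[0] == 'module':
    parts = parts[1:]
layer_idx = int(parts[1])  # '0' -> 0
if layer_idx < 75:  # e.g. only touch backbone layers below 75
    pass  # for instance, param.requires_grad = False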


glenn-jocher commented on May 14, 2024

nT is the number of targets per batch, and rloss['nT'] should be the running mean of nT for the epoch. This behavior shouldn't occur in the default code; have you changed the code? If not, try the latest commit, as the code changes fairly often!
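
For illustration, a hypothetical sketch (not the repo's exact code) of what a per-epoch running mean of a per-batch statistic like nT looks like; the batch counts below are made-up toy values:

batch_target_counts = [12, 7, 0, 9]  # toy nT values, one per batch
rloss = {'nT': 0.0}
for i, nT in enumerate(batch_target_counts):
    # running mean over the batches seen so far in the epoch
    rloss['nT'] = (rloss['nT'] * i + nT) / (i + 1)
print(rloss['nT'])  # 7.0 for the toy values above

With the default data in place this value should be non-zero; a persistent zero usually just means no targets are being loaded (for example the wrong labels path mentioned earlier in this thread).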


violet17 commented on May 14, 2024

I used the latest commit and the command "python train.py", but rloss['nT'] is zero.
The torch version is 0.4.1.


glenn-jocher commented on May 14, 2024

@violet17 Sorry, it's possible one of the latest commits broke something, as I've been updating quite often, and I noticed others had problems as well. I've been running internal tests with a more advanced version with no issues. I just committed these changes now as b07ee41. Can you try downloading this from scratch and rerunning?

If this works, I'd advise you to check back in a day or two; I'm compiling updates that should increase performance significantly. These updates are not reflected in the latest commit; they are being tracked in #2 (comment) and should be implemented in a single commit in the next few days.


glenn-jocher commented on May 14, 2024

If you run the default repo everything works fine. Start from there.

sudo rm -rf yolov3 && git clone https://github.com/ultralytics/yolov3 && cd yolov3 && python3 train.py


violet17 commented on May 14, 2024

I run python train.py with torch 0.4.1.

    Epoch      Batch          x          y          w          h       conf        cls      total          P          R   nTargets         TP         FP         FN       time
Traceback (most recent call last):
 File "train.py", line 227, in <module>
   main(opt)
 File "train.py", line 151, in main
   loss = model(imgs.to(device), targets, batch_report=opt.batch_report, var=opt.var)
 File "/home/liumm/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
   result = self.forward(*input, **kwargs)
 File "/home/liumm/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 119, in forward
   inputs, kwargs = self.scatter(inputs, kwargs, self.device_ids)
 File "/home/liumm/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 130, in scatter
   return scatter_kwargs(inputs, kwargs, device_ids, dim=self.dim)
 File "/home/liumm/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 35, in scatter_kwargs
   inputs = scatter(inputs, target_gpus, dim) if inputs else []
 File "/home/liumm/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 28, in scatter
   return scatter_map(inputs)
 File "/home/liumm/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 15, in scatter_map
   return list(zip(*map(scatter_map, obj)))
 File "/home/liumm/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 17, in scatter_map
   return list(map(list, zip(*map(scatter_map, obj))))
 File "/home/liumm/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 13, in scatter_map
   return Scatter.apply(target_gpus, None, dim, obj)
 File "/home/liumm/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/_functions.py", line 87, in forward
   outputs = comm.scatter(input, ctx.target_gpus, ctx.chunk_sizes, ctx.dim, streams)
 File "/home/liumm/anaconda3/lib/python3.6/site-packages/torch/cuda/comm.py", line 142, in scatter
   return tuple(torch._C._scatter(tensor, devices, chunk_sizes, dim, streams))
RuntimeError: start (0) + length (0) exceeds dimension size (0). (narrow at /opt/conda/conda-bld/pytorch_1532502421238/work/aten/src/ATen/native/TensorShape.cpp:157)
frame #0: at::Type::narrow(at::Tensor const&, long, long, long) const + 0x49 (0x7f74bac67639 in /home/liumm/anaconda3/lib/python3.6/site-packages/torch/lib/libcaffe2.so)
frame #1: at::native::split_with_sizes(at::Tensor const&, at::ArrayRef<long>, long) + 0x12e (0x7f74baabb64e in /home/liumm/anaconda3/lib/python3.6/site-packages/torch/lib/libcaffe2.so)
frame #2: at::Type::split_with_sizes(at::Tensor const&, at::ArrayRef<long>, long) const + 0x49 (0x7f74bac65f49 in /home/liumm/anaconda3/lib/python3.6/site-packages/torch/lib/libcaffe2.so)
frame #3: torch::autograd::VariableType::split_with_sizes(at::Tensor const&, at::ArrayRef<long>, long) const + 0x496 (0x7f74b0795e76 in /home/liumm/anaconda3/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so)
frame #4: at::native::chunk(at::Tensor const&, long, long) + 0x11c (0x7f74baabbc7c in /home/liumm/anaconda3/lib/python3.6/site-packages/torch/lib/libcaffe2.so)
frame #5: at::Type::chunk(at::Tensor const&, long, long) const + 0x41 (0x7f74bac6b9e1 in /home/liumm/anaconda3/lib/python3.6/site-packages/torch/lib/libcaffe2.so)
frame #6: torch::autograd::VariableType::chunk(at::Tensor const&, long, long) const + 0x183 (0x7f74b0733de3 in /home/liumm/anaconda3/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so)
frame #7: torch::cuda::scatter(at::Tensor const&, at::ArrayRef<long>, at::optional<std::vector<long, std::allocator<long> > > const&, long, at::optional<std::vector<CUDAStreamInternals*, std::allocator<CUDAStreamInternals*> > > const&) + 0xd98 (0x7f74b0b34128 in /home/liumm/anaconda3/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so)
frame #8: <unknown function> + 0xc42a0b (0x7f74b0b3ba0b in /home/liumm/anaconda3/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so)
frame #9: <unknown function> + 0x38a5cb (0x7f74b02835cb in /home/liumm/anaconda3/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so)
<omitting python frames>
frame #20: THPFunction_apply(_object*, _object*) + 0x38f (0x7f74b0661a2f in /home/liumm/anaconda3/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so)

I am trying to figure out why these errors appear.


violet17 commented on May 14, 2024

I found a similar error described at https://discuss.pytorch.org/t/runtimeerror-start-0-length-0-exceeds-dimension-size-0/24233.
Then I commented out these lines:

        if torch.cuda.device_count() > 1:
            print('Using ', torch.cuda.device_count(), ' GPUs')
            model = nn.DataParallel(model)

and

        if torch.cuda.device_count() > 1:
            print('Using ', torch.cuda.device_count(), ' GPUs')
            model = nn.DataParallel(model)

Finally it works. When there are multiple GPU devices, it doesn't use all of them; it only uses GPU 0.
And when I changed
device = torch.device('cuda:0' if cuda else 'cpu')
to cuda:1, it used GPU 0 and GPU 1, and then errors appeared across the different GPUs.
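
As an alternative to editing the source, a hypothetical single-GPU workaround (an assumption, not from the repo) is to hide all but one device before torch initializes CUDA, so torch.cuda.device_count() returns 1 and the nn.DataParallel branch is never taken:

import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'  # expose only GPU 0; must be set before torch touches CUDA

import torch
print(torch.cuda.device_count())  # 1 on a multi-GPU machine
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

The same effect can be had from the shell with CUDA_VISIBLE_DEVICES=0 python train.py.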


glenn-jocher commented on May 14, 2024

Ah yes, multi-GPU is not supported yet, sorry. Issue #21 is open for this. I only have a single GPU machine, so I can't debug this. Any help would be appreciated! In the meantime I will add a warning there to alert users.


glenn-jocher commented on May 14, 2024

I've changed the code to raise an error when multi-GPU operation is attempted, until this is resolved.

yolov3/train.py

Lines 60 to 63 in af0033c

if torch.cuda.device_count() > 1:
    raise Exception('Multi-GPU not currently supported: https://github.com/ultralytics/yolov3/issues/21')
    # print('Using ', torch.cuda.device_count(), ' GPUs')
    # model = nn.DataParallel(model)


varunnair18 commented on May 14, 2024

I was able to fix this issue by changing the augment argument on line 40 of train.py from True to False.

Previous: dataloader = LoadImagesAndLabels(train_path, batch_size, img_size, multi_scale=multi_scale, augment=True)

Fixed: dataloader = LoadImagesAndLabels(train_path, batch_size, img_size, multi_scale=multi_scale, augment=False)


glenn-jocher commented on May 14, 2024

@varunnair18 good to hear you found a workaround. This is an interesting finding. Thank you for sharing your solution!

