
Comments (11)

violet17 commented on May 14, 2024

Thank you for the code. Finally, I found that the labels path was wrong, so it couldn't find the labels. Thank you very much for your patience.

In train.py, line 125 of your latest commit, name is "module.module_list.0.conv_0.weight"

So name.split('.')[2] is '0', while the current code uses index [1] and fails with:
Traceback (most recent call last):
  File "train.py", line 221, in <module>
    main(opt)
  File "train.py", line 127, in main
    if int(name.split('.')[1]) < 75:  # if layer < 75
ValueError: invalid literal for int() with base 10: 'module_list'

And I think if int(name.split('.')[1]) < 75: should be if int(name.split('.')[2]) < 75:
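
For context, a minimal sketch (not the repo's exact code) of why the index shifts: nn.DataParallel prefixes every parameter name with "module.", which pushes the layer number from split('.')[1] to split('.')[2]. A version that handles both cases, assuming the name format shown above:

name = 'module.module_list.0.conv_0.weight'  # parameter name as reported above
parts = name.split('.')
# drop the 'module.' prefix that nn.DataParallel adds, if present
if parts[0] == 'module':
    parts = parts[1:]
layer_idx = int(parts[1])  # '0' -> 0
if layer_idx < 75:  # e.g. only touch backbone layers below 75
    pass  # for instance, param.requires_grad = False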


glenn-jocher commented on May 14, 2024

nT is the number of targets per batch, and rloss['nT'] should be the running mean of nT for the epoch. This behavior shouldn't occur in the default code; have you changed the code? If not, try the latest commit, as the code changes fairly often!
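
For illustration, a hypothetical sketch (not the repo's exact code) of what a per-epoch running mean of a per-batch statistic like nT looks like; the batch counts below are made-up toy values:

batch_target_counts = [12, 7, 0, 9]  # toy nT values, one per batch
rloss = {'nT': 0.0}
for i, nT in enumerate(batch_target_counts):
    # running mean over the batches seen so far in the epoch
    rloss['nT'] = (rloss['nT'] * i + nT) / (i + 1)
print(rloss['nT'])  # 7.0 for the toy values above

With the default data in place this value should be non-zero; a persistent zero usually just means no targets are being loaded (for example the wrong labels path mentioned earlier in this thread).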


violet17 commented on May 14, 2024

I used the latest commit and the command "python train.py", but rloss['nT'] is zero.
The torch version is 0.4.1.


glenn-jocher commented on May 14, 2024

@violet17 Sorry, it's possible one of the latest commits broke something, as I've been updating quite often, and I noticed others had problems as well. I've been running internal tests with a more advanced version with no issues. I just committed these changes now as b07ee41. Can you try downloading this from scratch and rerunning?

If this works, I'd advise you to check back in a day or two; I'm compiling updates that should increase performance significantly. These updates are not reflected in the latest commit; they are being tracked in #2 (comment) and should be implemented in a single commit in the next few days.


glenn-jocher commented on May 14, 2024

If you run the default repo everything works fine. Start from there.

sudo rm -rf yolov3 && git clone https://github.com/ultralytics/yolov3 && cd yolov3 && python3 train.py


violet17 commented on May 14, 2024

I run python train.py with torch 0.4.1.

    Epoch      Batch          x          y          w          h       conf        cls      total          P          R   nTargets         TP         FP         FN       time
Traceback (most recent call last):
 File "train.py", line 227, in <module>
   main(opt)
 File "train.py", line 151, in main
   loss = model(imgs.to(device), targets, batch_report=opt.batch_report, var=opt.var)
 File "/home/liumm/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
   result = self.forward(*input, **kwargs)
 File "/home/liumm/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 119, in forward
   inputs, kwargs = self.scatter(inputs, kwargs, self.device_ids)
 File "/home/liumm/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 130, in scatter
   return scatter_kwargs(inputs, kwargs, device_ids, dim=self.dim)
 File "/home/liumm/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 35, in scatter_kwargs
   inputs = scatter(inputs, target_gpus, dim) if inputs else []
 File "/home/liumm/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 28, in scatter
   return scatter_map(inputs)
 File "/home/liumm/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 15, in scatter_map
   return list(zip(*map(scatter_map, obj)))
 File "/home/liumm/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 17, in scatter_map
   return list(map(list, zip(*map(scatter_map, obj))))
 File "/home/liumm/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 13, in scatter_map
   return Scatter.apply(target_gpus, None, dim, obj)
 File "/home/liumm/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/_functions.py", line 87, in forward
   outputs = comm.scatter(input, ctx.target_gpus, ctx.chunk_sizes, ctx.dim, streams)
 File "/home/liumm/anaconda3/lib/python3.6/site-packages/torch/cuda/comm.py", line 142, in scatter
   return tuple(torch._C._scatter(tensor, devices, chunk_sizes, dim, streams))
RuntimeError: start (0) + length (0) exceeds dimension size (0). (narrow at /opt/conda/conda-bld/pytorch_1532502421238/work/aten/src/ATen/native/TensorShape.cpp:157)
frame #0: at::Type::narrow(at::Tensor const&, long, long, long) const + 0x49 (0x7f74bac67639 in /home/liumm/anaconda3/lib/python3.6/site-packages/torch/lib/libcaffe2.so)
frame #1: at::native::split_with_sizes(at::Tensor const&, at::ArrayRef<long>, long) + 0x12e (0x7f74baabb64e in /home/liumm/anaconda3/lib/python3.6/site-packages/torch/lib/libcaffe2.so)
frame #2: at::Type::split_with_sizes(at::Tensor const&, at::ArrayRef<long>, long) const + 0x49 (0x7f74bac65f49 in /home/liumm/anaconda3/lib/python3.6/site-packages/torch/lib/libcaffe2.so)
frame #3: torch::autograd::VariableType::split_with_sizes(at::Tensor const&, at::ArrayRef<long>, long) const + 0x496 (0x7f74b0795e76 in /home/liumm/anaconda3/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so)
frame #4: at::native::chunk(at::Tensor const&, long, long) + 0x11c (0x7f74baabbc7c in /home/liumm/anaconda3/lib/python3.6/site-packages/torch/lib/libcaffe2.so)
frame #5: at::Type::chunk(at::Tensor const&, long, long) const + 0x41 (0x7f74bac6b9e1 in /home/liumm/anaconda3/lib/python3.6/site-packages/torch/lib/libcaffe2.so)
frame #6: torch::autograd::VariableType::chunk(at::Tensor const&, long, long) const + 0x183 (0x7f74b0733de3 in /home/liumm/anaconda3/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so)
frame #7: torch::cuda::scatter(at::Tensor const&, at::ArrayRef<long>, at::optional<std::vector<long, std::allocator<long> > > const&, long, at::optional<std::vector<CUDAStreamInternals*, std::allocator<CUDAStreamInternals*> > > const&) + 0xd98 (0x7f74b0b34128 in /home/liumm/anaconda3/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so)
frame #8: <unknown function> + 0xc42a0b (0x7f74b0b3ba0b in /home/liumm/anaconda3/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so)
frame #9: <unknown function> + 0x38a5cb (0x7f74b02835cb in /home/liumm/anaconda3/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so)
<omitting python frames>
frame #20: THPFunction_apply(_object*, _object*) + 0x38f (0x7f74b0661a2f in /home/liumm/anaconda3/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so)

I am trying to figure out why these errors appear.


violet17 commented on May 14, 2024

I found a similar error described at https://discuss.pytorch.org/t/runtimeerror-start-0-length-0-exceeds-dimension-size-0/24233.
Then I commented out these lines:

        if torch.cuda.device_count() > 1:
            print('Using ', torch.cuda.device_count(), ' GPUs')
            model = nn.DataParallel(model)

and

        if torch.cuda.device_count() > 1:
            print('Using ', torch.cuda.device_count(), ' GPUs')
            model = nn.DataParallel(model)

Finally it works. When there are multiple GPU devices, it doesn't use all of them; it only uses GPU 0.
And when I changed
device = torch.device('cuda:0' if cuda else 'cpu')
to cuda:1, it used GPU 0 and GPU 1, and then errors appeared across the different GPUs.
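
As an alternative to editing the source, a hypothetical single-GPU workaround (an assumption, not from the repo) is to hide all but one device before torch initializes CUDA, so torch.cuda.device_count() returns 1 and the nn.DataParallel branch is never taken:

import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'  # expose only GPU 0; must be set before torch touches CUDA

import torch
print(torch.cuda.device_count())  # 1 on a multi-GPU machine
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

The same effect can be had from the shell with CUDA_VISIBLE_DEVICES=0 python train.py.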


glenn-jocher commented on May 14, 2024

Ah yes, multi-GPU is not supported yet, sorry. Issue #21 is open for this. I only have a single GPU machine, so I can't debug this. Any help would be appreciated! In the meantime I will add a warning there to alert users.


glenn-jocher commented on May 14, 2024

I've changed the code to raise an error when multi-GPU operation is attempted, until this is resolved.

yolov3/train.py

Lines 60 to 63 in af0033c

if torch.cuda.device_count() > 1:
    raise Exception('Multi-GPU not currently supported: https://github.com/ultralytics/yolov3/issues/21')
    # print('Using ', torch.cuda.device_count(), ' GPUs')
    # model = nn.DataParallel(model)


varunnair18 commented on May 14, 2024

I was able to fix this issue by changing the augment argument on line 40 of train.py from True to False.

Previous: dataloader = LoadImagesAndLabels(train_path, batch_size, img_size, multi_scale=multi_scale, augment=True)

Fixed: dataloader = LoadImagesAndLabels(train_path, batch_size, img_size, multi_scale=multi_scale, augment=False)


glenn-jocher commented on May 14, 2024

@varunnair18 good to hear you found a workaround. This is an interesting finding. Thank you for sharing your solution!

