Comments (11)
Thank you for the code. I finally found that the labels path was wrong, so the labels couldn't be found. Thank you very much for your patience.
In train.py, line 125 of your latest commit, name is "module.module_list.0.conv_0.weight"
So
name.split('.')[2] == '0'
Traceback (most recent call last):
File "train.py", line 221, in <module>
main(opt)
File "train.py", line 127, in main
if int(name.split('.')[1]) < 75: # if layer < 75
ValueError: invalid literal for int() with base 10: 'module_list'
I think
if int(name.split('.')[1]) < 75: # if layer < 75
should be
if int(name.split('.')[2]) < 75: # if layer < 75
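The shift can be seen with a small sketch (layer_index is a hypothetical helper, not code from the repo): nn.DataParallel prepends "module." to every parameter name, which moves the layer index from field 1 to field 2 of the dot-separated name, so a hard-coded index breaks depending on whether the model is wrapped.

```python
# Hypothetical helper: extract the module_list layer index from a parameter
# name, whether or not the model is wrapped in nn.DataParallel.
# DataParallel prepends "module." to every name, shifting the dot-separated
# fields by one and breaking a hard-coded name.split('.')[1].
def layer_index(name):
    parts = name.split('.')
    if parts[0] == 'module':      # nn.DataParallel wrapper prefix
        parts = parts[1:]
    # parts is now e.g. ['module_list', '0', 'conv_0', 'weight']
    return int(parts[1])

print(layer_index('module.module_list.0.conv_0.weight'))  # 0
print(layer_index('module_list.74.conv_74.weight'))       # 74
```

Stripping the wrapper prefix first makes the same check work for both the wrapped and unwrapped model.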
from yolov3.
nT is the number of targets per batch; rloss['nT'] should be the running mean of nT for the epoch. This behavior shouldn't occur in the default code. Have you changed the code? If not, try the latest commit; the code changes fairly often!
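As a sketch of what "running mean of nT for the epoch" means (a hypothetical helper, not the repo's implementation): each batch's target count is folded into the mean incrementally, so rloss['nT'] should settle at the average targets per batch and only be zero if every batch truly has zero targets.

```python
# Incremental running mean: fold each new value into the previous mean.
# count is the number of values seen so far, including new_value.
def running_mean(prev_mean, new_value, count):
    return prev_mean + (new_value - prev_mean) / count

m = 0.0
for i, nT in enumerate([3, 5, 4], start=1):  # targets per batch
    m = running_mean(m, nT, i)
print(m)  # 4.0, the mean of 3, 5, 4
```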
from yolov3.
I used the latest commit and the command "python train.py", but rloss['nT'] is zero.
The torch version is 0.4.1.
from yolov3.
@violet17 Sorry, it's possible one of the latest commits broke something, as I've been updating quite often, and I noticed others had problems as well. I've been running internal tests with a more advanced version with no issues. I just committed these changes now as b07ee41. Can you try to download this from scratch and rerun?
If this works I'd advise you to check back in a day or two, I'm compiling updates that should increase performance significantly. These updates are not reflected in the latest commit, they are being tracked in #2 (comment) and should be implemented in one single commit in the next few days.
from yolov3.
If you run the default repo everything works fine. Start from there.
sudo rm -rf yolov3 && git clone https://github.com/ultralytics/yolov3 && cd yolov3 && python3 train.py
from yolov3.
I ran python train.py with torch 0.4.1.
Epoch Batch x y w h conf cls total P R nTargets TP FP FN time
Traceback (most recent call last):
File "train.py", line 227, in <module>
main(opt)
File "train.py", line 151, in main
loss = model(imgs.to(device), targets, batch_report=opt.batch_report, var=opt.var)
File "/home/liumm/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
result = self.forward(*input, **kwargs)
File "/home/liumm/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 119, in forward
inputs, kwargs = self.scatter(inputs, kwargs, self.device_ids)
File "/home/liumm/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 130, in scatter
return scatter_kwargs(inputs, kwargs, device_ids, dim=self.dim)
File "/home/liumm/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 35, in scatter_kwargs
inputs = scatter(inputs, target_gpus, dim) if inputs else []
File "/home/liumm/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 28, in scatter
return scatter_map(inputs)
File "/home/liumm/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 15, in scatter_map
return list(zip(*map(scatter_map, obj)))
File "/home/liumm/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 17, in scatter_map
return list(map(list, zip(*map(scatter_map, obj))))
File "/home/liumm/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 13, in scatter_map
return Scatter.apply(target_gpus, None, dim, obj)
File "/home/liumm/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/_functions.py", line 87, in forward
outputs = comm.scatter(input, ctx.target_gpus, ctx.chunk_sizes, ctx.dim, streams)
File "/home/liumm/anaconda3/lib/python3.6/site-packages/torch/cuda/comm.py", line 142, in scatter
return tuple(torch._C._scatter(tensor, devices, chunk_sizes, dim, streams))
RuntimeError: start (0) + length (0) exceeds dimension size (0). (narrow at /opt/conda/conda-bld/pytorch_1532502421238/work/aten/src/ATen/native/TensorShape.cpp:157)
frame #0: at::Type::narrow(at::Tensor const&, long, long, long) const + 0x49 (0x7f74bac67639 in /home/liumm/anaconda3/lib/python3.6/site-packages/torch/lib/libcaffe2.so)
frame #1: at::native::split_with_sizes(at::Tensor const&, at::ArrayRef<long>, long) + 0x12e (0x7f74baabb64e in /home/liumm/anaconda3/lib/python3.6/site-packages/torch/lib/libcaffe2.so)
frame #2: at::Type::split_with_sizes(at::Tensor const&, at::ArrayRef<long>, long) const + 0x49 (0x7f74bac65f49 in /home/liumm/anaconda3/lib/python3.6/site-packages/torch/lib/libcaffe2.so)
frame #3: torch::autograd::VariableType::split_with_sizes(at::Tensor const&, at::ArrayRef<long>, long) const + 0x496 (0x7f74b0795e76 in /home/liumm/anaconda3/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so)
frame #4: at::native::chunk(at::Tensor const&, long, long) + 0x11c (0x7f74baabbc7c in /home/liumm/anaconda3/lib/python3.6/site-packages/torch/lib/libcaffe2.so)
frame #5: at::Type::chunk(at::Tensor const&, long, long) const + 0x41 (0x7f74bac6b9e1 in /home/liumm/anaconda3/lib/python3.6/site-packages/torch/lib/libcaffe2.so)
frame #6: torch::autograd::VariableType::chunk(at::Tensor const&, long, long) const + 0x183 (0x7f74b0733de3 in /home/liumm/anaconda3/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so)
frame #7: torch::cuda::scatter(at::Tensor const&, at::ArrayRef<long>, at::optional<std::vector<long, std::allocator<long> > > const&, long, at::optional<std::vector<CUDAStreamInternals*, std::allocator<CUDAStreamInternals*> > > const&) + 0xd98 (0x7f74b0b34128 in /home/liumm/anaconda3/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so)
frame #8: <unknown function> + 0xc42a0b (0x7f74b0b3ba0b in /home/liumm/anaconda3/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so)
frame #9: <unknown function> + 0x38a5cb (0x7f74b02835cb in /home/liumm/anaconda3/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so)
<omitting python frames>
frame #20: THPFunction_apply(_object*, _object*) + 0x38f (0x7f74b0661a2f in /home/liumm/anaconda3/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so)
I am trying to figure out why these things appear.
from yolov3.
I found a similar error reported at https://discuss.pytorch.org/t/runtimeerror-start-0-length-0-exceeds-dimension-size-0/24233.
Then I commented out these lines (in both places where they appear):
if torch.cuda.device_count() > 1:
    print('Using ', torch.cuda.device_count(), ' GPUs')
    model = nn.DataParallel(model)
Finally it works.
If there are multiple GPU devices, it doesn't use all of them, only GPU 0.
And when I changed
device = torch.device('cuda:0' if cuda else 'cpu')
to cuda:1, it used GPU 0 and GPU 1. Then errors appeared on the different GPUs.
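Until multi-GPU is supported, one common way to pin training to a single specific card is to restrict CUDA_VISIBLE_DEVICES before CUDA is initialized, instead of wrapping the model in DataParallel (a sketch, not code from the repo):

```python
import os

# Sketch: make only one physical GPU visible to the process. If this runs
# before CUDA is initialized, the chosen card appears as cuda:0, so the
# default device string in train.py works unchanged and DataParallel's
# multi-GPU path is never taken.
def pick_device(gpu_id=0):
    os.environ['CUDA_VISIBLE_DEVICES'] = str(gpu_id)
    return 'cuda:0'

print(pick_device(1))  # 'cuda:0', mapped onto physical GPU 1
```

Note this only takes effect if set before the CUDA context is created; setting it after CUDA initialization has no effect.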
from yolov3.
Ah yes, multi-GPU is not supported yet, sorry. Issue #21 is open for this. I only have a single GPU machine, so I can't debug this. Any help would be appreciated! In the meantime I will add a warning there to alert users.
from yolov3.
I've changed the code to raise an error when multi-GPU operation is attempted, until this is resolved.
Lines 60 to 63 in af0033c
from yolov3.
I was able to fix this issue by changing the augment argument on line 40 of train.py from True to False.
Previous: dataloader = LoadImagesAndLabels(train_path, batch_size, img_size, multi_scale=multi_scale, augment=True)
Fixed: dataloader = LoadImagesAndLabels(train_path, batch_size, img_size, multi_scale=multi_scale, augment=False)
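One hedged guess about why augment=False helps: the traceback above fails while narrowing a tensor whose dimension size is 0, and augmentation could leave a batch with zero targets, so a size-0 targets tensor would reach DataParallel's scatter and trigger the narrow() error in torch 0.4.1. A hypothetical guard (not code from the repo) would skip such batches before the forward pass:

```python
# Hypothetical guard: drop batches with no targets so an empty (size-0)
# tensor never reaches DataParallel's scatter, which torch 0.4.1 rejected
# with "start (0) + length (0) exceeds dimension size (0)".
def safe_batches(batches):
    for imgs, targets in batches:
        if len(targets) == 0:   # nothing to learn from this batch
            continue
        yield imgs, targets

batches = [([1, 2], ['t1']), ([3, 4], []), ([5, 6], ['t2'])]
print(list(safe_batches(batches)))  # the empty-target batch is dropped
```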
from yolov3.
@varunnair18 good to hear you found a workaround. This is an interesting finding. Thank you for sharing your solution!
from yolov3.