ultralytics / yolov3
YOLOv3 in PyTorch > ONNX > CoreML > TFLite
Home Page: https://docs.ultralytics.com
License: GNU Affero General Public License v3.0
Does this mean I read empty data?
Line 24 in train.py, import test, may keep lines 9~20 from working.
It also outputs two namespaces:
Namespace(batch_report=False, batch_size=16, cfg='cfg/yolov2.cfg', data_config_path='cfg/coco.data', epochs=100, freeze_darknet53=False, img_size=416, optimizer='SGD', resume=False, var=0)
Namespace(batch_size=32, cfg='cfg/yolov3.cfg', class_path='data/coco.names', conf_thres=0.3, data_config_path='cfg/coco.data', img_size=416, iou_thres=0.5, n_cpu=0, nms_thres=0.45, weights_path='weights/yolov3.pt')
Maybe putting lines 7~19 of test.py inside if __name__ == '__main__': would be better.
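The suggestion above can be sketched as follows. This is a minimal, hypothetical layout for test.py, not the repository's actual file: build_parser and main are illustrative names, and only two of the options shown in the Namespace output are reproduced.

```python
import argparse


def build_parser():
    """Build the option parser without parsing anything yet."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--batch-size', type=int, default=32)
    parser.add_argument('--cfg', type=str, default='cfg/yolov3.cfg')
    return parser


def main(opt):
    # Placeholder for the real evaluation loop (assumption).
    return opt.batch_size, opt.cfg


if __name__ == '__main__':
    # Runs only when the file is executed directly, so `import test`
    # from train.py no longer parses train.py's command line.
    opt = build_parser().parse_args()
    print(opt)
    main(opt)
```

With this layout, train.py could import the module and call main() with its own options, and only one Namespace would be printed per run.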
Hi,
I started to train yolov3 using 1 GPU without changing your code, and I got the graphs below, which are all slightly different from your results. The shapes are roughly the same, but the values are all in a different range, as shown below. I am a bit confused. It would be great if you could point me in the right direction. Thank you!
When developing the training code I found that replacing Binary Cross Entropy (BCE) loss with Cross Entropy (CE) loss significantly improves Precision, Recall and mAP. All show about 2X improvements using CE, though the YOLOv3 paper states these loss terms as BCE in darknet.
The two loss terms are on lines 162 and 163 of models.py. If anyone has any insight into this phenomenon I'd be very interested to hear it. For now you can swap the two back and forth. Note that SGD does not converge using either BCE or CE, so that issue appears independent of this one.
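To make the two options concrete, here is a small sketch in plain Python (not the code in models.py) that computes both losses for the class logits of a single cell. The function names are hypothetical. CE applies a softmax, so the classes compete for one unit of probability; BCE applies an independent sigmoid per class. It does not explain the 2X difference reported above, it only shows what is being swapped.

```python
import math


def cross_entropy(logits, target_index):
    """Softmax cross entropy: classes compete for one unit of probability."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[target_index]


def binary_cross_entropy(logits, target_index):
    """Sum of independent per-class sigmoid losses (multi-label form)."""
    total = 0.0
    for i, z in enumerate(logits):
        t = 1.0 if i == target_index else 0.0
        # Stable form of -[t*log(s) + (1-t)*log(1-s)] with s = sigmoid(z)
        total += max(z, 0.0) - z * t + math.log1p(math.exp(-abs(z)))
    return total


logits = [2.0, -1.0, 0.5]
print(cross_entropy(logits, 0), binary_cross_entropy(logits, 0))
```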
When I print(labels) in datasets.py, line 143, there is a problem:
I cannot print(labels), but I can print(labels[0][0]) and print(labels0.shape).
if os.path.isfile(label_path):
labels0 = np.loadtxt(label_path, dtype=np.float32).reshape(-1, 5)
print(labels0.shape) #143
print(labels0[0][0]) #144
print(labels) #145
exit()
#############
Traceback (most recent call last):
File "train.py", line 193, in <module>
main(opt)
File "train.py", line 116, in main
for i, (imgs, targets) in enumerate(dataloader):
File "/home/chenfei/Downloads/yolov3-master1/utils/datasets.py", line 143, in __next__
print(labels)
File "/home/chenfei/anaconda3/lib/python3.6/site-packages/numpy/core/arrayprint.py", line 1504, in array_str
return array2string(a, max_line_width, precision, suppress_small, ' ', "")
File "/home/chenfei/anaconda3/lib/python3.6/site-packages/numpy/core/arrayprint.py", line 668, in array2string
return _array2string(a, options, separator, prefix)
File "/home/chenfei/anaconda3/lib/python3.6/site-packages/numpy/core/arrayprint.py", line 460, in wrapper
return f(self, *args, **kwargs)
File "/home/chenfei/anaconda3/lib/python3.6/site-packages/numpy/core/arrayprint.py", line 495, in _array2string
summary_insert, options['legacy'])
File "/home/chenfei/anaconda3/lib/python3.6/site-packages/numpy/core/arrayprint.py", line 796, in _formatArray
curr_width=line_width)
File "/home/chenfei/anaconda3/lib/python3.6/site-packages/numpy/core/arrayprint.py", line 750, in recurser
word = recurser(index + (-i,), next_hanging_indent, next_width)
File "/home/chenfei/anaconda3/lib/python3.6/site-packages/numpy/core/arrayprint.py", line 704, in recurser
return format_function(a[index])
IndexError: tuple index out of range
Hi,
Thank you very much for the code. But when I run test.py with yolov3.pt/latest.pt, I'm getting the error below.
File "test.py", line 40, in <module>
model.load_state_dict(checkpoint['model'])
File "/home/xxiaofan/.local/lib/python3.5/site-packages/torch/nn/modules/module.py", line 721, in load_state_dict
self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for Darknet:
Unexpected key(s) in state_dict: "module_list.0.batch_norm_0.num_batches_tracked", "module_list.1.batch_norm_1.num_batches_tracked", "module_list.2.batch_norm_2.num_batches_tracked", "module_list.3.batch_norm_3.num_batches_tracked", "module_list.5.batch_norm_5.num_batches_tracked", "module_list.6.batch_norm_6.num_batches_tracked", "module_list.7.batch_norm_7.num_batches_tracked", "module_list.9.batch_norm_9.num_batches_tracked", "module_list.10.batch_norm_10.num_batches_tracked", "module_list.12.batch_norm_12.num_batches_tracked", "module_list.13.batch_norm_13.num_batches_tracked", "module_list.14.batch_norm_14.num_batches_tracked", "module_list.16.batch_norm_16.num_batches_tracked", "module_list.17.batch_norm_17.num_batches_tracked", "module_list.19.batch_norm_19.num_batches_tracked", "module_list.20.batch_norm_20.num_batches_tracked", "module_list.22.batch_norm_22.num_batches_tracked", "module_list.23.batch_norm_23.num_batches_tracked", "module_list.25.batch_norm_25.num_batches_tracked", "module_list.26.batch_norm_26.num_batches_tracked", "module_list.28.batch_norm_28.num_batches_tracked", "module_list.29.batch_norm_29.num_batches_tracked", "module_list.31.batch_norm_31.num_batches_tracked", "module_list.32.batch_norm_32.num_batches_tracked", "module_list.34.batch_norm_34.num_batches_tracked", "module_list.35.batch_norm_35.num_batches_tracked", "module_list.37.batch_norm_37.num_batches_tracked", "module_list.38.batch_norm_38.num_batches_tracked", "module_list.39.batch_norm_39.num_batches_tracked", "module_list.41.batch_norm_41.num_batches_tracked", "module_list.42.batch_norm_42.num_batches_tracked", "module_list.44.batch_norm_44.num_batches_tracked", 
"module_list.45.batch_norm_45.num_batches_tracked", "module_list.47.batch_norm_47.num_batches_tracked", "module_list.48.batch_norm_48.num_batches_tracked", "module_list.50.batch_norm_50.num_batches_tracked", "module_list.51.batch_norm_51.num_batches_tracked", "module_list.53.batch_norm_53.num_batches_tracked", "module_list.54.batch_norm_54.num_batches_tracked", "module_list.56.batch_norm_56.num_batches_tracked", "module_list.57.batch_norm_57.num_batches_tracked", "module_list.59.batch_norm_59.num_batches_tracked", "module_list.60.batch_norm_60.num_batches_tracked", "module_list.62.batch_norm_62.num_batches_tracked", "module_list.63.batch_norm_63.num_batches_tracked", "module_list.64.batch_norm_64.num_batches_tracked", "module_list.66.batch_norm_66.num_batches_tracked", "module_list.67.batch_norm_67.num_batches_tracked", "module_list.69.batch_norm_69.num_batches_tracked", "module_list.70.batch_norm_70.num_batches_tracked", "module_list.72.batch_norm_72.num_batches_tracked", "module_list.73.batch_norm_73.num_batches_tracked", "module_list.75.batch_norm_75.num_batches_tracked", "module_list.76.batch_norm_76.num_batches_tracked", "module_list.77.batch_norm_77.num_batches_tracked", "module_list.78.batch_norm_78.num_batches_tracked", "module_list.79.batch_norm_79.num_batches_tracked", "module_list.80.batch_norm_80.num_batches_tracked", "module_list.84.batch_norm_84.num_batches_tracked", "module_list.87.batch_norm_87.num_batches_tracked", "module_list.88.batch_norm_88.num_batches_tracked", "module_list.89.batch_norm_89.num_batches_tracked", "module_list.90.batch_norm_90.num_batches_tracked", "module_list.91.batch_norm_91.num_batches_tracked", "module_list.92.batch_norm_92.num_batches_tracked", "module_list.96.batch_norm_96.num_batches_tracked", "module_list.99.batch_norm_99.num_batches_tracked", "module_list.100.batch_norm_100.num_batches_tracked", "module_list.101.batch_norm_101.num_batches_tracked", "module_list.102.batch_norm_102.num_batches_tracked", 
"module_list.103.batch_norm_103.num_batches_tracked", "module_list.104.batch_norm_104.num_batches_tracked".
Dear @glenn-jocher,
I am facing an issue when training on my dataset with your code.
Dataset: 1 class
pytorch: 0.4.1
Ubuntu 16.04
1 GPU
+++++++++++++++++++++++++++++++++++++++++++++++
The logs are:
Traceback (most recent call last):
File "train.py", line 198, in <module>
main(opt)
File "train.py", line 132, in main
loss = model(imgs.to(device), targets, requestPrecision=True)
File "/home/khanhdd/anaconda3/envs/actiondetectionYOLO/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
result = self.forward(*input, **kwargs)
File "/home/khanhdd/KhanhWorkSpace/realtime-action-detection/YOLOv3_Training/models.py", line 237, in forward
x, *losses = module[0](x, targets, requestPrecision)
File "/home/khanhdd/anaconda3/envs/actiondetectionYOLO/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
result = self.forward(*input, **kwargs)
File "/home/khanhdd/KhanhWorkSpace/realtime-action-detection/YOLOv3_Training/models.py", line 117, in forward
p = p.view(bs, self.nA, self.bbox_attrs, nG, nG).permute(0, 1, 3, 4, 2).contiguous() # prediction
RuntimeError: invalid argument 2: size '[16 x 3 x 6 x 10 x 10]' is invalid for input with 408000 elements at /opt/conda/conda-bld/pytorch_1535491974311/work/aten/src/TH/THStorage.cpp:84
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Please give me your advice about this problem.
Thank you,
Khanh
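The numbers in the error message can be checked by hand. The sketch below assumes the standard YOLOv3 layout, in which the convolution feeding each yolo layer needs anchors_per_scale * (classes + 5) filters; yolo_filters is a hypothetical helper, not a function from the repository.

```python
def yolo_filters(num_classes, anchors_per_scale=3):
    """Output channels needed by the conv layer feeding a YOLO layer."""
    return anchors_per_scale * (num_classes + 5)


batch, grid = 16, 10
elements = 408000                          # from the error message
channels = elements // (batch * grid * grid)

print(channels)            # channels the network actually produced: 255
print(yolo_filters(80))    # 80-class COCO setting: 255
print(yolo_filters(1))     # what a 1-class dataset needs: 18
```

The tensor has 255 channels, which matches the 80-class setting, while the view to [16 x 3 x 6 x 10 x 10] expects 3 * 6 = 18. One possible reading is that the filters value before each yolo layer in the cfg was left at 255 when the class count was changed to 1.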
Thanks a lot for sharing your project.
I have a small question about the function bbox_iou in utils/utils.py.
Line 174: def bbox_iou(box1, box2, x1y1x2y2=True):
I find that the yolo-darknet53 model output is in 'xywh' format, but here you set x1y1x2y2=True.
And in line 400, ious = bbox_iou(max_detections[-1], detections_class[1:]), no argument is passed to change the value of x1y1x2y2.
I manually changed it from True to False, but then the detection mAP declined.
Could you tell me the reason?
Thank you very much!
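For reference, the two box conventions can be compared with a small standalone sketch. This is not the repository's bbox_iou, only an illustration with the same flag name. If the boxes have already been converted to corner format before the IoU call (an assumption about the calling code), then setting the flag to False would convert them a second time and give wrong overlaps, which would be consistent with the drop in mAP.

```python
def xywh_to_xyxy(box):
    """Convert (x_center, y_center, w, h) to (x1, y1, x2, y2)."""
    x, y, w, h = box
    return (x - w / 2, y - h / 2, x + w / 2, y + h / 2)


def bbox_iou(box1, box2, x1y1x2y2=True):
    """IoU of two boxes; set x1y1x2y2=False for center-format input."""
    if not x1y1x2y2:
        box1, box2 = xywh_to_xyxy(box1), xywh_to_xyxy(box2)
    ix1, iy1 = max(box1[0], box2[0]), max(box1[1], box2[1])
    ix2, iy2 = min(box1[2], box2[2]), min(box1[3], box2[3])
    inter = max(ix2 - ix1, 0.0) * max(iy2 - iy1, 0.0)
    area1 = (box1[2] - box1[0]) * (box1[3] - box1[1])
    area2 = (box2[2] - box2[0]) * (box2[3] - box2[1])
    return inter / (area1 + area2 - inter)


a_xywh, b_xywh = (5.0, 5.0, 4.0, 4.0), (6.0, 5.0, 4.0, 4.0)
print(bbox_iou(a_xywh, b_xywh, x1y1x2y2=False))
print(bbox_iou(xywh_to_xyxy(a_xywh), xywh_to_xyxy(b_xywh)))
```

Both calls describe the same pair of boxes, so they print the same IoU; mixing the conventions would not.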
I've trained COCO with this code and the result is impressive, so I want to try VOC with this code. I made train_list.txt as: cls_name x_center y_center width height. But the result is not as good as I expected. Could anybody who has successfully trained on VOC give me some suggestions?
When I load the trained model and run detect.py, no bounding boxes are detected. I am very confused. Can anyone give me some suggestions to solve the problem? Thanks.
I have encountered a problem and really need your help.
I want to train my own coco dataset. When I run train.py, this problem occurs:
Traceback (most recent call last):
File "train.py", line 193, in <module>
main(opt)
File "train.py", line 116, in main
for i, (imgs, targets) in enumerate(dataloader):
File "/home/pytorch/github/yolov3/utils/datasets.py", line 189, in __next__
img_all = np.stack(img_all)[:, :, :, ::-1].transpose(0, 3, 1, 2) # BGR to RGB and cv2 to pytorch
File "/home/pytorch/anaconda3/envs/pytorch0.4/lib/python3.6/site-packages/numpy/core/shape_base.py", line 349, in stack
raise ValueError('need at least one array to stack')
ValueError: need at least one array to stack
I changed coco.data like this:
classes= 3
train=data/coco/trainval.txt
valid=data/coco/test.txt
names=data/coco.names
backup=backup/
eval=coco
and the train_path in train.py:
if platform == 'darwin':  # MacOS (local)
    train_path = data_config['train']
else:  # linux (cloud, i.e. gcp)
    train_path = 'data/coco/trainval.txt'
Thank you in advance!
Hi,
I am new to all this. I am trying to get it to work in PyCharm on Windows, but I get this error.
The detections work, but there are no bounding boxes and no images in the output folder.
Anyone, please help.
Namespace(batch_size=1, cfg='cfg/yolov3.cfg', class_path='data/coco.names', conf_thres=0.5, image_folder='data/samples', img_size=416, nms_thres=0.45, output_folder='output', plot_flag=True, txt_out=False)
'rm' is not recognized as an internal or external command,
operable program or batch file.
0 (3, 416, 416) C:\Users--\AppData\Local\Programs\Python\Python37\lib\site-packages\torch\nn\modules\upsampling.py:122: UserWarning: nn.Upsampling is deprecated. Use nn.functional.interpolate instead.
warnings.warn("nn.Upsampling is deprecated. Use nn.functional.interpolate instead.")
Batch 0... (Done 1.172s)
image 0: 'data/samples\img1.jpg'
1 trucks
1 dogs
1 bicycles
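The 'rm' message suggests that the script clears its output folder by calling the Unix rm command through the shell, which the Windows command prompt does not provide. A portable alternative can be sketched with the standard library; reset_dir is a hypothetical helper name, and the demonstration uses a temporary directory rather than the real output folder.

```python
import os
import shutil
import tempfile


def reset_dir(path):
    """Delete path if it exists, then recreate it empty (portable)."""
    shutil.rmtree(path, ignore_errors=True)
    os.makedirs(path, exist_ok=True)


# Demonstration in a temporary location rather than the real output folder.
base = tempfile.mkdtemp()
output = os.path.join(base, 'output')
os.makedirs(output)
open(os.path.join(output, 'stale.jpg'), 'w').close()

reset_dir(output)
print(os.listdir(output))  # the stale file is gone
shutil.rmtree(base)
```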
Could you add support for yolov3-spp model? https://github.com/pjreddie/darknet/blob/master/cfg/yolov3-spp.cfg
Hi, glenn-jocher:
When I ran train.py on PASCAL VOC2007 (about 160 epochs), I got 87% recall and 85% precision, but when I ran test.py on the PASCAL VOC2007 test set, I only got 0.62 mAP. I also found that mAP stops growing at around 100 epochs, staying around 0.62. How can I further improve mAP?
# Normalized xywh to pixel xyxy format
labels = labels0.copy()
labels[:, 1] = ratio * w * (labels0[:, 1] - labels0[:, 3] / 2) + padw
labels[:, 2] = ratio * h * (labels0[:, 2] - labels0[:, 4] / 2) + padh
labels[:, 3] = ratio * w * (labels0[:, 1] + labels0[:, 3] / 2) + padw
labels[:, 4] = ratio * h * (labels0[:, 2] + labels0[:, 4] / 2) + padh
Is (x, y) of (xywh) the center coordinate?
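The arithmetic in the snippet above can be checked on a single label. In this sketch (plain Python, with a hypothetical function name) half the width and height are subtracted from and added to (x, y) to get the corners, which only makes sense if (x, y) is the box center.

```python
def label_to_pixel_xyxy(label, w, h, ratio=1.0, padw=0.0, padh=0.0):
    """Map one normalized (cls, x, y, bw, bh) label to pixel corners."""
    cls, x, y, bw, bh = label
    x1 = ratio * w * (x - bw / 2) + padw
    y1 = ratio * h * (y - bh / 2) + padh
    x2 = ratio * w * (x + bw / 2) + padw
    y2 = ratio * h * (y + bh / 2) + padh
    return cls, x1, y1, x2, y2


# A box centered in a 640x480 image, half as wide and half as tall.
cls, x1, y1, x2, y2 = label_to_pixel_xyxy((0, 0.5, 0.5, 0.5, 0.5), 640, 480)
print(x1, y1, x2, y2)
print((x1 + x2) / 2, (y1 + y2) / 2)  # midpoint of the corners: the center
```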
Hello, I'm sorry to trouble you. I have trained on the COCO dataset myself and get a test mAP of about 55.9%; I can't reach 58%. Do you have any advice? (Batch size = 20, train epochs = 80.) I also ran your model with yolov3.weights and then used the COCO API to compute the mAP, which is much lower than the reported 57.9%. I find that there may be a difference in how resizing is handled. Darknet defines its own function to resize the image to 416x416 and fills the space around the image with the value 0.5, but our PyTorch model uses 0.502, and other pixel values also differ. I then changed the resize function in the PyTorch model to match Darknet's, and this gives the same detection results as Darknet. Sorry, my English is bad!
Hi, isn't the learning rate updated during the training phase?
I'm training VOC2007. After 1 epoch, an error shows up:
F:\pytorch-yolov3-master-ul\test.py:122: RuntimeWarning: invalid value encountered in double_scalars
print('%15s: %-.4f' % (c, AP_accum[i] / AP_accum_count[i]))
aeroplane: nan
bicycle: nan
bird: nan
boat: nan
bottle: nan
bus: nan
car: nan
cat: nan
chair: nan
cow: nan
diningtable: nan
dog: nan
horse: nan
motorbike: nan
person: nan
pottedplant: nan
sheep: nan
sofa: nan
train: nan
tvmonitor: nan
Traceback (most recent call last):
File "train.py", line 268, in <module>
var=opt.var,
File "train.py", line 224, in train
img_size=img_size,
File "F:\pytorch-yolov3-master-ul\test.py", line 125, in test
return mean_mAP, mean_R, mean_P
UnboundLocalError: local variable 'mean_mAP' referenced before assignment
What's the problem?
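The nan values are consistent with AP_accum_count[i] being zero for every class, so each per-class average is a 0/0 division. A guarded version of that averaging step can be sketched as below. The function names and the dictionary layout are assumptions for illustration, not the structure used in test.py.

```python
def per_class_mean(ap_sum, ap_count):
    """Average AP per class, skipping classes that were never evaluated."""
    means = {}
    for c, total in ap_sum.items():
        n = ap_count.get(c, 0)
        if n > 0:
            means[c] = total / n
    return means


def mean_ap(ap_sum, ap_count):
    """Mean over evaluated classes; 0.0 when nothing was evaluated."""
    means = per_class_mean(ap_sum, ap_count)
    return sum(means.values()) / len(means) if means else 0.0


print(mean_ap({'dog': 1.2, 'cat': 0.0}, {'dog': 2, 'cat': 0}))
print(mean_ap({'dog': 0.0}, {'dog': 0}))
```

Returning a defined value when nothing was evaluated also avoids reaching the return statement with the result variable unset.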
The CPU usage is too high.
How to solve it?
When developing the training code I found that SGD caused divergence very quickly at the default LR of 1e-4. Loss terms began to grow exponentially, becoming Inf within about 10 batches of starting training.
In contrast, Adam always seems to converge, which is why I use it as the default optimizer in train.py. I don't understand why Adam works and SGD does not, as darknet uses SGD successfully. This is one of the key differences between darknet and this repo, so any insights into how we can get SGD to converge would be appreciated.
It might be that I simply don't have the proper learning rate (and scheduler) in place.
line 82 in train.py
# optimizer = torch.optim.SGD(model.parameters(), lr=.001, momentum=.9, weight_decay=5e-4)
optimizer = torch.optim.Adam(filter(lambda p: p.requires_grad, model.parameters()), lr=1e-4, weight_decay=5e-4)
The mAP computation code is similar to https://github.com/eriklindernoren/PyTorch-YOLOv3/blob/959e0ff43f5b82bdacef87f4240bae8415eac45b/test.py#L69
It is incorrect to average the AP for each sample, because AP is computed per-class. The right way is to rank all detected instances across the whole test set for each object class, compute AP for each class, and then average the AP.
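The procedure described above can be sketched in plain Python. This is an illustrative version, not the repository's test.py. It assumes each detection has already been marked as a true or false positive against the ground truth (for example with an IoU threshold), it uses the all-point area under the precision envelope, and the function names are hypothetical.

```python
def average_precision(scored_hits, num_gt):
    """AP for one class from (confidence, is_true_positive) pairs
    pooled over the whole test set, with num_gt ground-truth boxes."""
    if num_gt == 0:
        return None  # class absent from the test set
    tp = fp = 0
    recall, precision = [0.0], [1.0]
    for _, hit in sorted(scored_hits, key=lambda d: d[0], reverse=True):
        tp, fp = tp + int(hit), fp + int(not hit)
        recall.append(tp / num_gt)
        precision.append(tp / (tp + fp))
    # Make precision monotonically non-increasing (precision envelope).
    for i in range(len(precision) - 2, -1, -1):
        precision[i] = max(precision[i], precision[i + 1])
    # Sum rectangle areas wherever recall increases.
    return sum((recall[i] - recall[i - 1]) * precision[i]
               for i in range(1, len(recall)))


def mean_average_precision(per_class):
    """per_class maps class name -> (scored_hits, num_gt)."""
    aps = [average_precision(h, n) for h, n in per_class.values()]
    aps = [ap for ap in aps if ap is not None]
    return sum(aps) / len(aps) if aps else 0.0


person = ([(0.9, True), (0.8, False), (0.7, True)], 2)
print(average_precision(*person))
print(mean_average_precision({'person': person, 'bat': ([(0.6, True)], 1)}))
```

The key difference from a per-image average is that the ranking by confidence happens once per class over all images, so a confident false positive in one image lowers the precision seen by every lower-scored detection of that class.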
Hi @glenn-jocher,
Thanks for this wonderful port of Yolo v3. I had two questions, however, about the matching step in build_targets -- where you compute which anchor box corresponds to each ground truth box.
You seem to be computing IoU using only the width and height of each anchor box against each target. Darknet doesn't appear to do this -- if I'm reading the implementation correctly, it iterates through the grid cells and computes the IoU using the X and Y of each cell. Is there a reason you compute IoU using width and height only? Optimization?
Also during this step, the ignore_threshold is set to 0.5 in the Darknet paper, and 0.7 in the Darknet implementation, while you seem to be using 0.1 in build_targets. Is there a reason for that?
Thanks!
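For readers unfamiliar with the width-and-height-only comparison being asked about, here is a minimal sketch of what it computes. It is not the repository's build_targets; the function names are hypothetical, and the three anchors are the largest-scale shapes from the standard yolov3.cfg. Dropping x and y amounts to treating the target and every anchor as sharing the same center, so only shape decides the match.

```python
def wh_iou(wh1, wh2):
    """IoU of two boxes that share the same center (or corner)."""
    inter = min(wh1[0], wh2[0]) * min(wh1[1], wh2[1])
    union = wh1[0] * wh1[1] + wh2[0] * wh2[1] - inter
    return inter / union


def best_anchor(target_wh, anchors):
    """Index of the anchor whose shape best matches the target."""
    ious = [wh_iou(target_wh, a) for a in anchors]
    return max(range(len(anchors)), key=ious.__getitem__)


# Three example anchor shapes in pixels (width, height).
anchors = [(116, 90), (156, 198), (373, 326)]
print(best_anchor((150, 200), anchors))
```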
COCO2014 mAP computation on official YOLOv3 weights corresponds to expected value of 0.58 (same as darknet), but mAP computation on trained checkpoints seems higher than should be. In particular many false positives do not seem to negatively impact mAP.
For example validation image 2 should have 4 people and 1 baseball bat. At epoch 37, I see ~140 objects detected. Precision and Recall look like this:
Precision-Recall curve looks like this:
AP for this image is then calculated as 0.78, which is strangely high for 4 TP and ~140 FP's.
AP = compute_ap(recall, precision)
Out[66]: 0.78596
Lastly, I believe mAP is supposed to be calculated per class in each image, but here all the classes seem combined.
I found darknet's polynomial learning rate curve here:
case POLY:
return net->learning_rate * pow(1 - (float)batch_num / net->max_batches, net->power);
https://github.com/pjreddie/darknet/blob/680d3bde1924c8ee2d1c1dea54d3e56a05ca9a26/src/network.c#L111
If I use power = 4 from parser.c, then I plot the following curve (in MATLAB), assuming max_batches = 1563360 (160 epochs at batch_size 12, for 9771 batches/epoch). This leaves the final lr(1563360) = 0. This means that it is impossible for anyone to begin training a model from the official YOLOv3 weights and expect to resume training at lr = 0.001 with no problems. The model is clearly going to bounce out of its local minimum back into the huge gradients it first saw at epoch 0.
>> batch = 0:(9771*160);
>> lr = 1e-3 * (1 - batch./1563360).^4;
>> fig; plot(batch,lr,'.-'); xyzlabel('batch','learning rate'); fcnfontsize(14); fcntight;
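For readers without MATLAB, the same curve can be evaluated in Python. This sketch mirrors the POLY formula quoted above with the values assumed in this issue (base rate 1e-3, power 4, max_batches 1563360); poly_lr is a hypothetical function name.

```python
def poly_lr(batch_num, base_lr=1e-3, max_batches=1563360, power=4):
    """Polynomial decay as in the darknet POLY case quoted above."""
    return base_lr * (1 - batch_num / max_batches) ** power


for b in (0, 781680, 1563360):
    print(b, poly_lr(b))
```

Halfway through training the rate has already fallen to one sixteenth of its starting value, and at the final batch it is exactly zero.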
Thanks for sharing your work.
yolov3 initializes model weights (up to line 549 in yolov3.cfg) from darknet53 classifier if I am not mistaken. Your model might not converge at epoch 160 if that is the case. Have you tried initializing yolov3 with darknet53?
Could you please add the code for tiny-yolo?
I tried tiny-yolo.cfg and got a very bad result, mAP = 0.04, after training for about 50 epochs.
Say for a batch of 32 images for computing the mAP?
Thank you for this awesome repository.
Due to some changes in the training scheme v.s. original darknet code, I wonder if the provided PyTorch weights are converted from the original .weights file - or rather, results of a fresh training session in PyTorch.
If it's the former, it would be wonderful if the results of PyTorch training could be provided as well!
Thanks!
@glenn-jocher
Hi, first of all, thank you very much for your code. I have been training for 24 hours (10 epochs) on my GTX1080, but it seems that I can't load the weights pre-trained on ImageNet (the darknet53.conv). And it takes a long time to train from scratch. TAT
Hi,
Have you tried to run training on multiple gpus?
I am getting the error below when I try to do that. Thank you.
Traceback (most recent call last):
File "train.py", line 194, in <module>
main(opt)
File "train.py", line 128, in main
loss = model(imgs, targets, requestPrecision=True)
File "/opt/anaconda/envs/pytorch_p35/lib/python3.5/site-packages/torch/nn/modules/module.py", line 477, in __call__
result = self.forward(*input, **kwargs)
File "/opt/anaconda/envs/pytorch_p35/lib/python3.5/site-packages/torch/nn/parallel/data_parallel.py", line 123, in forward
outputs = self.parallel_apply(replicas, inputs, kwargs)
File "/opt/anaconda/envs/pytorch_p35/lib/python3.5/site-packages/torch/nn/parallel/data_parallel.py", line 133, in parallel_apply
return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
File "/opt/anaconda/envs/pytorch_p35/lib/python3.5/site-packages/torch/nn/parallel/parallel_apply.py", line 77, in parallel_apply
raise output
File "/opt/anaconda/envs/pytorch_p35/lib/python3.5/site-packages/torch/nn/parallel/parallel_apply.py", line 53, in _worker
output = module(*input, **kwargs)
File "/opt/anaconda/envs/pytorch_p35/lib/python3.5/site-packages/torch/nn/modules/module.py", line 477, in __call__
result = self.forward(*input, **kwargs)
TypeError: forward() missing 1 required positional argument: 'x'
I've got a model trained using Darknet on a custom dataset containing 43 classes. This requires changing the number of filters for 3 conv layers from 255 to 144 (and of course the number of classes, although this shouldn't affect the bug I'm describing).
Alas, when loading this weight file (with the relevant cfg file), the network loading part crashes due to the weights file 'ending' before it should.
The actual error is generated by:
conv_w = torch.from_numpy(weights[ptr:ptr + num_w]).view_as(conv_layer.weight)
While the error is:
RuntimeError: invalid argument 2: size '[144 x 256 x 1 x 1]' is invalid for input with 36863 elements at /opt/conda/conda-bld/pytorch_1535491974311/work/aten/src/TH/THStorage.cpp:84
Now, the size of weights is 61802511, and ptr+num_w is 61802512, i.e. it's one float shorter.
I'll note that this doesn't happen when loading the original yolov3.weights file that comes with the original Darknet/YOLOv3 repository (in that one, the numbers perfectly align).
I can't think of a cause other than something causing the counter to advance one index too fast, but can't think of a reason why this would happen. Any ideas?
Thanks!
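One thing that may be worth checking, offered as a hypothesis rather than a diagnosis, is the length of the file header. To my understanding of darknet's parser.c, a .weights file starts with three 32-bit integers (major, minor, revision) followed by a "seen" counter that is stored as 8 bytes when major * 10 + minor >= 2 and as 4 bytes otherwise. A loader that always skips the same number of header values would then be off by one 4-byte float on files of the other kind. The sketch below is simplified and uses a hypothetical function name.

```python
import struct


def header_size(raw):
    """Byte length of a darknet .weights header, given the file's bytes."""
    major, minor, _revision = struct.unpack('<3i', raw[:12])
    # Newer files store the `seen` counter as 8 bytes, older ones as 4.
    seen_bytes = 8 if (major * 10 + minor) >= 2 else 4
    return 12 + seen_bytes


# Synthetic headers for illustration only, not real weight files.
old = struct.pack('<3i', 0, 1, 0) + struct.pack('<i', 0)
new = struct.pack('<3i', 0, 2, 0) + struct.pack('<q', 0)
print(header_size(old), header_size(new))
```

Reading the first 12 bytes of both the custom file and the original yolov3.weights and comparing the version numbers would show whether the two files use the same header layout.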
Is the following for loop necessary? Except for the last batch, len(imgs) = n, so j could only be 0 in the loop. In the last batch, if len(imgs) is smaller than n, int(len(imgs) / n) = 0, so the loop is skipped. Otherwise, len(imgs) = n, so j could only be 0 in the loop.
Lines 118 to 119 in 68de92f
All, I've started training using the official darknet repo to compare. The first two things I noticed are:
In yolov3.cfg, if max_batches = 500200 is the total train time and batch=64 is the images per batch, then this will take about 28 days on a GCP P100 at about 18,000 batches per day (all train settings to default).
-multi_scale = True in train.py (though currently this changes the size every batch).
For some reason I can't seem to train my own dataset on the latest commit. I am able to do it from an earlier commit, e.g. this state. In this state, if I run my training (with the exact same cfg files, dataset etc.), I get these results after a couple of epochs:
Epoch Batch x y w h conf cls total P R nTargets TP FP FN time
0/99 99/99 1.49 1.47 7.46 12.8 111 7.38 141 0 0 3 0 2.9e+03 0 0.124
1/99 99/99 1.26 1.19 1.99 3.07 16.1 7.35 30.9 0 0 10 0 1 8 0.131
2/99 99/99 1.04 1.02 0.831 1.08 4.64 7.25 15.9 0 0 7 0 3 4 0.129
3/99 99/99 0.756 0.796 0.666 0.79 3.67 7.25 13.9 0.000769 0.00187 10 0 2 7 0.129
4/99 99/99 0.58 0.683 0.574 0.739 2.93 7.15 12.7 0.00314 0.0167 3 0 3 1 0.129
5/99 99/99 0.455 0.54 0.462 0.64 2.62 7.14 11.9 0.00636 0.0221 6 0 6 1 0.132
If I use the latest commit, I get this:
Epoch Batch x y w h conf cls total P R nTargets TP FP FN time
0/99 99/99 1.49 1.47 7.41 12.6 111 7.38 141 0 0 3 0 0 0 0.117
1/99 99/99 4.91 4.93 nan nan nan nan nan 0 0 10 0 0 0 0.125
2/99 99/99 5.52 5.24 nan nan nan nan nan 0 0 7 0 0 0 0.124
3/99 99/99 5.23 5.21 nan nan nan nan nan 0 0 10 0 0 0 0.129
4/99 99/99 5.35 5.17 nan nan nan nan nan 0 0 3 0 0 0 0.125
5/99 99/99 5.65 5.41 nan nan nan nan nan 0 0 6 0 0 0 0.124
6/99 99/99 5.13 5.13 nan nan nan nan nan 0 0 9 0 0 0 0.126
Also in the latest commit, line 197 of train.py causes the following error:
Traceback (most recent call last):
File "/home/rick/Documents/yolov3-master/train.py", line 208, in <module>
main(opt)
File "/home/rick/Documents/yolov3-master/train.py", line 195, in main
mAP, R, P = test.main(test.opt)
File "/home/rick/Documents/yolov3-master/test.py", line 42, in main
model.load_state_dict(checkpoint['model'])
File "/media/rick/HDD/Env/yolov3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 719, in load_state_dict
self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for Darknet:
size mismatch for module_list.81.conv_81.weight: copying a param of torch.Size([255, 1024, 1, 1]) from checkpoint, where the shape is torch.Size([303, 1024, 1, 1]) in current model.
size mismatch for module_list.81.conv_81.bias: copying a param of torch.Size([255]) from checkpoint, where the shape is torch.Size([303]) in current model.
size mismatch for module_list.93.conv_93.weight: copying a param of torch.Size([255, 512, 1, 1]) from checkpoint, where the shape is torch.Size([303, 512, 1, 1]) in current model.
size mismatch for module_list.93.conv_93.bias: copying a param of torch.Size([255]) from checkpoint, where the shape is torch.Size([303]) in current model.
size mismatch for module_list.105.conv_105.weight: copying a param of torch.Size([255, 256, 1, 1]) from checkpoint, where the shape is torch.Size([303, 256, 1, 1]) in current model.
size mismatch for module_list.105.conv_105.bias: copying a param of torch.Size([255]) from checkpoint, where the shape is torch.Size([303]) in current model.
So I replaced it with (the old) code:
with open('results.txt', 'a') as file:
    file.write(s + '\n')
I don't know if this is normal, but I can't seem to find a solution.
Do you know why I get nan while using the exact same cfg files and data? My txt files for each image are spot on; the bounding boxes, width, height etc. are all relative to the image width and height.
Lines 207 to 213 in fd6619d
Can somebody explain this?
The correct YOLO v3 loss constants are:
lambda_coord = 1.0
lambda_obj = 1.0
lambda_noobj = 1.0
rather than the below constants, which derive from the original yolov1.cfg file:
https://github.com/pjreddie/darknet/blob/61c9d02ec461e30d55762ec7669d6a1d3c356fb2/cfg/yolov1.cfg#L257-L260
lambda_coord = 5.0
lambda_obj = 1.0
lambda_noobj = 0.5
The latest yolov3 constants appear to be hard coded into parser.c rather than in yolov3.cfg. Credit to @ydixon, who originally noticed this discrepancy in issue #12.
https://github.com/pjreddie/darknet/blob/680d3bde1924c8ee2d1c1dea54d3e56a05ca9a26/src/parser.c#L376-L381
Hey,
Following Section 2.2 of YOLO, I have a few questions about the loss calculation shown at the end of this issue.
1. We are using λ coord = 5 from line 156 to line 159. Should we also use λ noobj = .5 in line 167?
2. Why are we multiplying BCELoss with 1.5 in line 160? I have not found any reference to this in the papers.
3. pred_conf gives us a [batch_size x anchor_number x grid_size x grid_size] tensor. Assuming batch_size = 1, anchor_number = 3 and grid_size = 2, there are 12 elements in this tensor. If nM = 3, pred_conf[~mask] contains 9 elements, and so does mask[~mask].float(). BCEWithLogitsLoss1 gives the sum of BCE loss for these 9 elements, whereas BCEWithLogitsLoss2 takes the mean of BCEWithLogitsLoss1 (i.e. divides it by 9 in our case). Now, my question is: why are we multiplying BCEWithLogitsLoss2 by nM instead of using BCEWithLogitsLoss1 (which should probably be divided by batch_size too) in line 167? There is no division in Section 2.2 of YOLO. Btw, pred_conf[~mask] could normally contain 15k elements, so we are practically ignoring the confidence loss in line 167.
4. Similar to 3, we should use BCEWithLogitsLoss1 (which should probably be divided by batch_size too) in line 163, because BCEWithLogitsLoss1(pred_cls[mask], tcls.float()) / BCEWithLogitsLoss2(pred_cls[mask], tcls.float()) = batch_size x nM x number_of_classes.
5. Why are we not dividing all the losses by the batch_size? As the batch_size increases, the loss increases too. However, we should minimize the expected loss per sample.
Lines 155 to 167 in 9514e74
THANK YOU
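The sum-versus-mean point in this issue can be reproduced with a few lines of plain Python. This is a sketch with made-up logits, not the code from models.py; bce_with_logits is a hypothetical stand-in for the two reductions that the issue calls BCEWithLogitsLoss1 (sum) and BCEWithLogitsLoss2 (mean).

```python
import math


def bce_with_logits(logits, targets, reduction='mean'):
    """Binary cross entropy on raw logits, summed or averaged."""
    losses = [max(z, 0.0) - z * t + math.log1p(math.exp(-abs(z)))
              for z, t in zip(logits, targets)]
    total = sum(losses)
    return total if reduction == 'sum' else total / len(losses)


# 9 background cells, as in the pred_conf[~mask] example above.
logits = [0.3, -1.2, 0.8, -0.5, 2.0, -2.0, 0.1, 0.0, -0.7]
targets = [0.0] * 9
nM = 3

loss_sum = bce_with_logits(logits, targets, 'sum')
loss_mean = bce_with_logits(logits, targets, 'mean')
print(loss_sum, loss_mean, nM * loss_mean)
print(loss_sum / loss_mean)  # equals the number of elements
```

With 9 elements and nM = 3, the scaled mean is one third of the sum. With roughly 15k background cells, as mentioned in the issue, the gap between the two grows accordingly.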
The following error occurred while I was training coco.
Traceback (most recent call last):
File "/project/yolov3/train.py", line 202, in <module>
main(opt)
File "/project/yolov3/train.py", line 132, in main
loss = model(imgs.to(device), targets, requestPrecision=True)
File "/data_b/VirEnv/project/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
result = self.forward(*input, **kwargs)
File "/data_b/VirEnv/project/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 123, in forward
outputs = self.parallel_apply(replicas, inputs, kwargs)
File "/data_b/VirEnv/project/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 133, in parallel_apply
return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
File "/data_b/VirEnv/project/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 77, in parallel_apply
raise output
File "/data_b/VirEnv/project/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 53, in _worker
output = module(*input, **kwargs)
File "/data_b/VirEnv/project/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
result = self.forward(*input, **kwargs)
File "/project/yolov3/models.py", line 238, in forward
x, *losses = module[0](x, targets, requestPrecision)
File "/data_b/VirEnv/project/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
result = self.forward(*input, **kwargs)
File "/project/yolov3/models.py", line 156, in forward
requestPrecision)
File "/project/yolov3/utils/utils.py", line 278, in build_targets
tmp = pred_cls[b, a, gj, gi]
IndexError: index 8 is out of bounds for dimension 0 with size 8
Traceback (most recent call last):
File "train.py", line 211, in <module>
main(opt)
File "train.py", line 177, in main
loss_per_target = rloss['loss'] / rloss['nT']
ZeroDivisionError: float division by zero
Hi,
First, thanks for your work here.
I have a problem: whenever I go from epoch 0 to 1, I get a "cuda out of memory" error.
I decreased the batch size to 1 and still get the error. The first epoch runs fine with batch sizes from 8 down.
I am training on a custom dataset. My image sizes vary.
I'm running it on a GTX 1070.
Thanks in advance
Edit:
multi_scale is set to false.
While training, the used memory on my GTX is:
2445/8116 MiB
After the first epoch the VRAM usage bloats. I could only check it during the epoch change, and it was nearly completely used until it ran out of memory again. What is running between epochs that is so memory-intensive?
Line 243 in b48c108
As far as I understand, the multi_scale parameter sets a random height within a range when you resize the image to a square size and set the border.
Wouldn't this be the same as setting the scale range parameter later in the affine transformation?
Thanks for your improvements to this YOLOv3 implementation.
I have just tested the training and ran into a problem.
I followed these steps.
4. I saved the weights at precision 0.2 and ran detect.py.
The result looks like this:
If I do not train, the original weights give this result:
I do not know whether I used wrong parameters or something else led to so many bboxes being generated.
Could you give me some suggestions?
Thank you~
When I run:
~/PycharmProjects/yolov3-master$ python test.py -weights_path checkpoints/latest.pt
there is an error in datasets.py:
img_all = np.stack(img_all)[:, :, :, ::-1].transpose(0, 3, 1, 2) #row 218(dataset.py)
Traceback (most recent call last):
File "test.py", line 59, in <module>
for batch_i, (imgs, targets) in enumerate(dataloader):
File "/home/chenfei/PycharmProjects/yolov3-master/utils/datasets.py", line 218, in __next__
img_all = np.stack(img_all)[:, :, :, ::-1].transpose(0, 3, 1, 2) # BGR to RGB and cv2 to pytorch
File "/home/chenfei/anaconda3/lib/python3.6/site-packages/numpy/core/shape_base.py", line 349, in stack
raise ValueError('need at least one array to stack')
ValueError: need at least one array to stack
Can you help me?
I ran detect.py but got nothing in the output folder, so I changed results_img_path and results_txt_path as follows:
results_img_path = os.path.join(output, path.split('/')[-1].split('\\')[-1])
results_txt_path = results_img_path.split('.')[-2] + '.txt'
Is it a small bug?
P.S. I'm a rookie in deep learning, no offense.
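A variant of the same idea that avoids manual splitting can be sketched with the standard library. result_paths is a hypothetical helper, not part of detect.py. ntpath.basename treats both '/' and '\' as separators on any platform, and os.path.splitext only removes the final extension, so a dot elsewhere in the path does not matter.

```python
import ntpath
import os


def result_paths(output, path):
    """Build output image and text paths for an input image path that
    may use '/' or '\\' as separator."""
    name = ntpath.basename(path)          # ntpath splits on both separators
    stem, _ext = os.path.splitext(name)
    return os.path.join(output, name), os.path.join(output, stem + '.txt')


img_path, txt_path = result_paths('output', 'data/samples\\img1.jpg')
print(img_path)
print(txt_path)
```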
Can you explain why you used the following constants? I have inspected a few different yolov3 implementations, but none had a similar operation.
Line 183 in a284fc9