ultralytics / yolov3

YOLOv3 in PyTorch > ONNX > CoreML > TFLite

Home Page: https://docs.ultralytics.com

License: GNU Affero General Public License v3.0

yolov3's Issues

raise ValueError 'need at least one array to stack

Does this mean I read empty data?

line 9~20 in train.py may not work

The import test on line 24 of train.py may stop lines 9~20 from working as intended.
It also outputs two namespaces, which are

Namespace(batch_report=False, batch_size=16, cfg='cfg/yolov2.cfg', data_config_path='cfg/coco.data', epochs=100, freeze_darknet53=False, img_size=416, optimizer='SGD', resume=False, var=0)

Namespace(batch_size=32, cfg='cfg/yolov3.cfg', class_path='data/coco.names', conf_thres=0.3, data_config_path='cfg/coco.data', img_size=416, iou_thres=0.5, n_cpu=0, nms_thres=0.45, weights_path='weights/yolov3.pt')

It might be better to put lines 7~19 of test.py inside an if __name__ == '__main__': block.
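
A minimal sketch of that suggestion, assuming test.py builds its options with argparse at module level. The parse_opt and main names are illustrative, not the repository's actual code; the flag names are taken from the second Namespace printed above.

```python
import argparse


def parse_opt(argv=None):
    """Build the test options on demand instead of at import time."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--batch_size', type=int, default=32)
    parser.add_argument('--cfg', type=str, default='cfg/yolov3.cfg')
    parser.add_argument('--conf_thres', type=float, default=0.3)
    # parse_known_args ignores flags meant for another script (an illustrative choice).
    opt, _unknown = parser.parse_known_args(argv)
    return opt


def main(opt):
    # Placeholder for the evaluation loop.
    return opt.batch_size


if __name__ == '__main__':
    # Runs only for `python test.py`, so `import test` from train.py
    # no longer parses sys.argv or prints a second Namespace.
    opt = parse_opt()
    print(opt)
    main(opt)
```

With this layout train.py can call test.main(test.parse_opt([])) explicitly when it needs an evaluation pass.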

different training results

Hi,
I started training YOLOv3 on 1 GPU without changing your code, and I got the graphs below, which are all slightly different from your results. The shapes are roughly the same, but the values are all in a different range. I am a bit confused. It would be great if you could point me in the right direction. Thank you!

RuntimeError: invalid argument 2: size '[16 x 3 x 15 x 13 x 13]' is invalid for input with 689520 elements at /pytorch/aten/src/TH/THStorage.cpp:84

models.py
p = p.view(bs, self.nA, self.bbox_attrs, nG, nG).permute(0, 1, 3, 4, 2).contiguous()
When I run this code on my own dataset, it stops at this line with the error shown in the title.

Classification Loss: CE vs BCE

When developing the training code I found that replacing Binary Cross Entropy (BCE) loss with Cross Entropy (CE) loss significantly improves Precision, Recall and mAP. All show about 2X improvements using CE, though the YOLOv3 paper states these loss terms as BCE in darknet.

The two loss terms are on lines 162 and 163 of models.py. If anyone has any insight into this phenomenon I'd be very interested to hear it. For now you can swap the two back and forth. Note that SGD does not converge using either BCE or CE, so that issue appears independent of this one.
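
For readers comparing the two terms, here is a small pure-Python sketch of how they differ on a single prediction. It is not the code from models.py; the function names and the example logits are made up for illustration. CE applies a softmax across classes, so classes compete, while BCE applies an independent sigmoid to each class.

```python
import math


def cross_entropy(logits, target_index):
    """Softmax cross entropy for one sample: classes compete for probability mass."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target_index]


def binary_cross_entropy(logits, target_index):
    """Mean of independent per-class sigmoid losses against a one-hot target."""
    total = 0.0
    for k, z in enumerate(logits):
        t = 1.0 if k == target_index else 0.0
        # Stable form of -[t*log(sigmoid(z)) + (1-t)*log(1-sigmoid(z))]
        total += max(z, 0.0) - z * t + math.log1p(math.exp(-abs(z)))
    return total / len(logits)


logits = [2.0, 0.5, -1.0]  # hypothetical class scores
print(cross_entropy(logits, 0), binary_cross_entropy(logits, 0))
```

The two values are on different scales, so any comparison of the resulting loss magnitudes should keep that in mind.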

PyCharm Printing Numpy Arrays (IndexError: tuple index out of range)

When I call print(labels) in datasets.py at line 143, there is a problem:
I cannot print(labels), but I can print(labels[0][0]) and print(labels0.shape).

        # Load labels
        if os.path.isfile(label_path):
            labels0 = np.loadtxt(label_path, dtype=np.float32).reshape(-1, 5)
            print(labels0.shape)    #143
            print(labels0[0][0])      #144
            print(labels)                #145
            exit()

#############

Traceback (most recent call last):
File "train.py", line 193, in <module>
main(opt)
File "train.py", line 116, in main
for i, (imgs, targets) in enumerate(dataloader):
File "/home/chenfei/Downloads/yolov3-master1/utils/datasets.py", line 143, in __next__
print(labels)
File "/home/chenfei/anaconda3/lib/python3.6/site-packages/numpy/core/arrayprint.py", line 1504, in array_str
return array2string(a, max_line_width, precision, suppress_small, ' ', "")
File "/home/chenfei/anaconda3/lib/python3.6/site-packages/numpy/core/arrayprint.py", line 668, in array2string
return _array2string(a, options, separator, prefix)
File "/home/chenfei/anaconda3/lib/python3.6/site-packages/numpy/core/arrayprint.py", line 460, in wrapper
return f(self, *args, **kwargs)
File "/home/chenfei/anaconda3/lib/python3.6/site-packages/numpy/core/arrayprint.py", line 495, in _array2string
summary_insert, options['legacy'])
File "/home/chenfei/anaconda3/lib/python3.6/site-packages/numpy/core/arrayprint.py", line 796, in _formatArray
curr_width=line_width)
File "/home/chenfei/anaconda3/lib/python3.6/site-packages/numpy/core/arrayprint.py", line 750, in recurser
word = recurser(index + (-i,), next_hanging_indent, next_width)
File "/home/chenfei/anaconda3/lib/python3.6/site-packages/numpy/core/arrayprint.py", line 704, in recurser
return format_function(a[index])
IndexError: tuple index out of range

Unexpected key(s) in state_dict when running test.py

Hi,
Thank you very much for the code. But when I run test.py with yolov3.pt/latest.pt, I'm getting the error below.
File "test.py", line 40, in <module> model.load_state_dict(checkpoint['model']) File "/home/xxiaofan/.local/lib/python3.5/site-packages/torch/nn/modules/module.py", line 721, in load_state_dict self.__class__.__name__, "\n\t".join(error_msgs))) RuntimeError: Error(s) in loading state_dict for Darknet: Unexpected key(s) in state_dict: "module_list.0.batch_norm_0.num_batches_tracked", "module_list.1.batch_norm_1.num_batches_tracked", "module_list.2.batch_norm_2.num_batches_tracked", "module_list.3.batch_norm_3.num_batches_tracked", "module_list.5.batch_norm_5.num_batches_tracked", "module_list.6.batch_norm_6.num_batches_tracked", "module_list.7.batch_norm_7.num_batches_tracked", "module_list.9.batch_norm_9.num_batches_tracked", "module_list.10.batch_norm_10.num_batches_tracked", "module_list.12.batch_norm_12.num_batches_tracked", "module_list.13.batch_norm_13.num_batches_tracked", "module_list.14.batch_norm_14.num_batches_tracked", "module_list.16.batch_norm_16.num_batches_tracked", "module_list.17.batch_norm_17.num_batches_tracked", "module_list.19.batch_norm_19.num_batches_tracked", "module_list.20.batch_norm_20.num_batches_tracked", "module_list.22.batch_norm_22.num_batches_tracked", "module_list.23.batch_norm_23.num_batches_tracked", "module_list.25.batch_norm_25.num_batches_tracked", "module_list.26.batch_norm_26.num_batches_tracked", "module_list.28.batch_norm_28.num_batches_tracked", "module_list.29.batch_norm_29.num_batches_tracked", "module_list.31.batch_norm_31.num_batches_tracked", "module_list.32.batch_norm_32.num_batches_tracked", "module_list.34.batch_norm_34.num_batches_tracked", "module_list.35.batch_norm_35.num_batches_tracked", "module_list.37.batch_norm_37.num_batches_tracked", "module_list.38.batch_norm_38.num_batches_tracked", "module_list.39.batch_norm_39.num_batches_tracked", "module_list.41.batch_norm_41.num_batches_tracked", "module_list.42.batch_norm_42.num_batches_tracked", "module_list.44.batch_norm_44.num_batches_tracked", 
"module_list.45.batch_norm_45.num_batches_tracked", "module_list.47.batch_norm_47.num_batches_tracked", "module_list.48.batch_norm_48.num_batches_tracked", "module_list.50.batch_norm_50.num_batches_tracked", "module_list.51.batch_norm_51.num_batches_tracked", "module_list.53.batch_norm_53.num_batches_tracked", "module_list.54.batch_norm_54.num_batches_tracked", "module_list.56.batch_norm_56.num_batches_tracked", "module_list.57.batch_norm_57.num_batches_tracked", "module_list.59.batch_norm_59.num_batches_tracked", "module_list.60.batch_norm_60.num_batches_tracked", "module_list.62.batch_norm_62.num_batches_tracked", "module_list.63.batch_norm_63.num_batches_tracked", "module_list.64.batch_norm_64.num_batches_tracked", "module_list.66.batch_norm_66.num_batches_tracked", "module_list.67.batch_norm_67.num_batches_tracked", "module_list.69.batch_norm_69.num_batches_tracked", "module_list.70.batch_norm_70.num_batches_tracked", "module_list.72.batch_norm_72.num_batches_tracked", "module_list.73.batch_norm_73.num_batches_tracked", "module_list.75.batch_norm_75.num_batches_tracked", "module_list.76.batch_norm_76.num_batches_tracked", "module_list.77.batch_norm_77.num_batches_tracked", "module_list.78.batch_norm_78.num_batches_tracked", "module_list.79.batch_norm_79.num_batches_tracked", "module_list.80.batch_norm_80.num_batches_tracked", "module_list.84.batch_norm_84.num_batches_tracked", "module_list.87.batch_norm_87.num_batches_tracked", "module_list.88.batch_norm_88.num_batches_tracked", "module_list.89.batch_norm_89.num_batches_tracked", "module_list.90.batch_norm_90.num_batches_tracked", "module_list.91.batch_norm_91.num_batches_tracked", "module_list.92.batch_norm_92.num_batches_tracked", "module_list.96.batch_norm_96.num_batches_tracked", "module_list.99.batch_norm_99.num_batches_tracked", "module_list.100.batch_norm_100.num_batches_tracked", "module_list.101.batch_norm_101.num_batches_tracked", "module_list.102.batch_norm_102.num_batches_tracked", 
"module_list.103.batch_norm_103.num_batches_tracked", "module_list.104.batch_norm_104.num_batches_tracked".
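
One possible cause, offered as an assumption rather than a confirmed diagnosis: BatchNorm layers gained a num_batches_tracked buffer in PyTorch 0.4.1, so a checkpoint saved with a newer PyTorch can carry keys that an older install does not expect. A minimal sketch of filtering those keys before loading; the helper name is hypothetical and the dict stands in for checkpoint['model'].

```python
def drop_num_batches_tracked(state_dict):
    """Return a copy of state_dict without BatchNorm 'num_batches_tracked' entries."""
    return {k: v for k, v in state_dict.items()
            if not k.endswith('num_batches_tracked')}


# Stand-in for checkpoint['model']; real values would be tensors.
state = {
    'module_list.0.conv_0.weight': 'w',
    'module_list.0.batch_norm_0.running_mean': 'm',
    'module_list.0.batch_norm_0.num_batches_tracked': 0,
}
cleaned = drop_num_batches_tracked(state)
print(sorted(cleaned))
# With a real checkpoint: model.load_state_dict(cleaned)
```

Upgrading PyTorch to the version that wrote the checkpoint, or passing strict=False to load_state_dict, are alternatives worth trying.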

RuntimeError: invalid argument 2: size '[16 x 3 x 6 x 10 x 10]' is invalid for input with 408000 elements at /opt/conda/conda-bld/pytorch_1535491974311/work/aten/src/TH/THStorage.cpp:84

Dear @glenn-jocher,
I am facing an issue when training my dataset with your code.
Dataset: 1 class
pytorch: 0.4.1
Ubuntu 16.04
1 GPU
+++++++++++++++++++++++++++++++++++++++++++++++
The logs are:
Traceback (most recent call last):
File "train.py", line 198, in <module>
main(opt)
File "train.py", line 132, in main
loss = model(imgs.to(device), targets, requestPrecision=True)
File "/home/khanhdd/anaconda3/envs/actiondetectionYOLO/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
result = self.forward(*input, **kwargs)
File "/home/khanhdd/KhanhWorkSpace/realtime-action-detection/YOLOv3_Training/models.py", line 237, in forward
x, *losses = module[0](x, targets, requestPrecision)
File "/home/khanhdd/anaconda3/envs/actiondetectionYOLO/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
result = self.forward(*input, **kwargs)
File "/home/khanhdd/KhanhWorkSpace/realtime-action-detection/YOLOv3_Training/models.py", line 117, in forward
p = p.view(bs, self.nA, self.bbox_attrs, nG, nG).permute(0, 1, 3, 4, 2).contiguous() # prediction
RuntimeError: invalid argument 2: size '[16 x 3 x 6 x 10 x 10]' is invalid for input with 408000 elements at /opt/conda/conda-bld/pytorch_1535491974311/work/aten/src/TH/THStorage.cpp:84
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Please give me your advice about this problem.
Thank you,
Khanh
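
A quick arithmetic check on the numbers in the traceback, as a sketch. The tensor reaching the view has 408000 elements for a batch of 16 on a 10 x 10 grid, which implies 255 channels, whereas the view expects 3 x 6 = 18. Assuming the usual YOLOv3 layout of 3 anchors x (5 + classes) channels per detection layer, 255 matches 80 classes and 18 matches 1 class, so the convolution before each yolo layer may still have filters=255 in the cfg. The helper names are illustrative.

```python
def implied_channels(n_elements, batch_size, grid):
    """Channels the conv layer actually produced, from the error message numbers."""
    channels, remainder = divmod(n_elements, batch_size * grid * grid)
    if remainder:
        raise ValueError('element count does not match batch/grid size')
    return channels


def expected_filters(n_classes, n_anchors=3):
    """Filters needed before a yolo layer: anchors x (x, y, w, h, conf + classes)."""
    return n_anchors * (5 + n_classes)


print(implied_channels(408000, 16, 10))  # channels the network produced
print(expected_filters(1))               # channels a 1-class view expects
```

The same check on the earlier report with size '[16 x 3 x 15 x 13 x 13]' and 689520 elements also gives 255 produced channels against 45 expected.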

I have a question about 'x1y1x2y2' in 'bbox_iou'

Thanks a lot for sharing your project.
I have a small question about the function "bbox_iou" in utils/utils.py.
line 174 -------def bbox_iou(box1, box2, x1y1x2y2=True):
I find that the yolo-darknet53 model output is in 'xywh' format, but here you set "x1y1x2y2=True".
And in line 400-------ious = bbox_iou(max_detections[-1], detections_class[1:]), no argument is passed to change the "x1y1x2y2" value.
I manually changed it from "True" to "False", but I found that the detection mAP declined.
Could you tell me the reason?

Thank you very much!
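
A sketch of why flipping the flag can hurt, assuming the boxes passed at line 400 have already been converted to corner (x1, y1, x2, y2) form before non-maximum suppression; this has not been verified against the repository. The function below is a simplified scalar re-implementation for boxes with positive area, not the one from utils/utils.py.

```python
def bbox_iou(box1, box2, x1y1x2y2=True):
    """IoU of two boxes given as corners, or as (cx, cy, w, h) if x1y1x2y2 is False."""
    if not x1y1x2y2:
        # Convert center/size to corners.
        box1 = (box1[0] - box1[2] / 2, box1[1] - box1[3] / 2,
                box1[0] + box1[2] / 2, box1[1] + box1[3] / 2)
        box2 = (box2[0] - box2[2] / 2, box2[1] - box2[3] / 2,
                box2[0] + box2[2] / 2, box2[1] + box2[3] / 2)
    inter_w = max(0.0, min(box1[2], box2[2]) - max(box1[0], box2[0]))
    inter_h = max(0.0, min(box1[3], box2[3]) - max(box1[1], box2[1]))
    inter = inter_w * inter_h
    area1 = (box1[2] - box1[0]) * (box1[3] - box1[1])
    area2 = (box2[2] - box2[0]) * (box2[3] - box2[1])
    return inter / (area1 + area2 - inter)


a = (0.0, 0.0, 10.0, 10.0)  # corner format
b = (5.0, 0.0, 15.0, 10.0)
print(bbox_iou(a, b))                  # boxes read as corners
print(bbox_iou(a, b, x1y1x2y2=False))  # same numbers misread as center/size
```

If the inputs really are corners, setting the flag to False makes the function reinterpret them and return a different overlap, which would change which detections NMS suppresses.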

Did anybody train voc with this code?

I've trained COCO with this code and the result is impressive, so I want to try VOC with it. I made train_list.txt as: cls_name x_center y_center width height. But the result is not as good as I expected. Could anybody who has successfully trained VOC give me some suggestions?

Nothing was detected

When I load the trained model and run detect.py, no bounding boxes are detected. I am very confused. Can anyone give me some suggestions to solve the problem? Thanks.

ValueError: need at least one array to stack

I have encountered a problem and really need your help. Can you help me?
I want to train on my own COCO dataset. When I run train.py, the following problem occurs:

Traceback (most recent call last):
  File "train.py", line 193, in <module>
    main(opt)
  File "train.py", line 116, in main
    for i, (imgs, targets) in enumerate(dataloader):
  File "/home/pytorch/github/yolov3/utils/datasets.py", line 189, in __next__
    img_all = np.stack(img_all)[:, :, :, ::-1].transpose(0, 3, 1, 2)  # BGR to RGB and cv2 to pytorch
  File "/home/pytorch/anaconda3/envs/pytorch0.4/lib/python3.6/site-packages/numpy/core/shape_base.py", line 349, in stack
    raise ValueError('need at least one array to stack')
ValueError: need at least one array to stack

I changed coco.data like this:

classes= 3
train=data/coco/trainval.txt
valid=data/coco/test.txt
names=data/coco.names
backup=backup/
eval=coco

and the train_path in train.py:

    if platform == 'darwin':  # MacOS (local)
        train_path = data_config['train']
    else:  # linux (cloud, i.e. gcp)
        train_path = 'data/coco/trainval.txt'

Thank you in advance!
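
np.stack raises this error when it receives an empty list, so one thing worth checking (a guess, not a confirmed diagnosis) is whether the paths in trainval.txt resolve to existing image files from the directory where train.py runs. A minimal sketch; the helper name is hypothetical.

```python
import os


def check_image_list(list_path):
    """Split the paths listed in list_path into existing and missing files."""
    with open(list_path) as f:
        paths = [line.strip() for line in f if line.strip()]
    found = [p for p in paths if os.path.isfile(p)]
    missing = [p for p in paths if not os.path.isfile(p)]
    return found, missing


if __name__ == '__main__':
    list_file = 'data/coco/trainval.txt'  # path from the coco.data shown above
    if os.path.isfile(list_file):
        found, missing = check_image_list(list_file)
        print(len(found), 'found,', len(missing), 'missing')
        print(missing[:5])
```

If every path is reported missing, the list probably uses paths relative to a different working directory.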

Windows vs Unix/MacOS pathnames

Hi
I am new to all this. I am trying to get it to work in PyCharm on Windows, but I get this error.
The detections work, but there are no bounding boxes and no images in the output folder.
Can anyone please help?

Namespace(batch_size=1, cfg='cfg/yolov3.cfg', class_path='data/coco.names', conf_thres=0.5, image_folder='data/samples', img_size=416, nms_thres=0.45, output_folder='output', plot_flag=True, txt_out=False)
'rm' is not recognized as an internal or external command,
operable program or batch file.
0 (3, 416, 416) C:\Users--\AppData\Local\Programs\Python\Python37\lib\site-packages\torch\nn\modules\upsampling.py:122: UserWarning: nn.Upsampling is deprecated. Use nn.functional.interpolate instead.
warnings.warn("nn.Upsampling is deprecated. Use nn.functional.interpolate instead.")
Batch 0... (Done 1.172s)
image 0: 'data/samples\img1.jpg'
1 trucks
1 dogs
1 bicycles
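
The 'rm' message suggests the script shells out to the Unix rm command to clear the output folder, which Windows does not provide; that is an inference from the log, not something checked in the source. The mixed separators in 'data/samples\img1.jpg' may likewise break code that splits paths on '/'. A portable sketch using only the standard library (both function names are hypothetical):

```python
import os
import shutil


def reset_output_dir(path):
    """Delete path if it exists and recreate it empty, on any OS."""
    shutil.rmtree(path, ignore_errors=True)
    os.makedirs(path, exist_ok=True)


def output_path(output_folder, image_path):
    """Build the save path from the file name only, whatever separator was used."""
    name = image_path.replace('\\', '/').split('/')[-1]
    return os.path.join(output_folder, name)


print(output_path('output', 'data/samples\\img1.jpg'))
```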

yolov3-spp.cfg Support

Could you add support for yolov3-spp model? https://github.com/pjreddie/darknet/blob/master/cfg/yolov3-spp.cfg

VOC mAP

Hi, glenn-jocher:
When I ran train.py on PASCAL VOC2007 (about 160 epochs), I got 87% recall and 85% precision, but when I ran test.py on the PASCAL VOC2007 test set, I only got 0.62 mAP. I also found that mAP stops improving at around 100 epochs, staying around 0.62. How can I further improve mAP?

Could you tell me the meaning of 'Normalized xywh to pixel xyxy format'

# Normalized xywh to pixel xyxy format
labels = labels0.copy()
labels[:, 1] = ratio * w * (labels0[:, 1] - labels0[:, 3] / 2) + padw
labels[:, 2] = ratio * h * (labels0[:, 2] - labels0[:, 4] / 2) + padh
labels[:, 3] = ratio * w * (labels0[:, 1] + labels0[:, 3] / 2) + padw
labels[:, 4] = ratio * h * (labels0[:, 2] + labels0[:, 4] / 2) + padh

Is (x, y) of (xywh) the center coordinate?
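
Reading the formulas above, (x, y) appears to be the box center, since half the width and height are subtracted and added to get the corners. A worked sketch in plain Python for one label row; ratio, padw and padh are taken to be the letterbox scale and padding, which is an interpretation of the variable names rather than something stated in the code.

```python
def xywh_norm_to_xyxy_pixels(label, w, h, ratio=1.0, padw=0.0, padh=0.0):
    """label = (cls, x_center, y_center, width, height), all normalized to 0..1."""
    cls, xc, yc, bw, bh = label
    x1 = ratio * w * (xc - bw / 2) + padw  # left edge in pixels
    y1 = ratio * h * (yc - bh / 2) + padh  # top edge in pixels
    x2 = ratio * w * (xc + bw / 2) + padw  # right edge in pixels
    y2 = ratio * h * (yc + bh / 2) + padh  # bottom edge in pixels
    return (cls, x1, y1, x2, y2)


# A box centered in a 640 x 480 image, covering half of each dimension.
print(xywh_norm_to_xyxy_pixels((0, 0.5, 0.5, 0.5, 0.5), w=640, h=480))
```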

Something about mAP of COCO API

Hello, sorry to trouble you. I have trained on the COCO dataset myself and get a test mAP of about 55.9%; I can't reach 58%. Is there any advice? (Batch size = 20, training epochs = 80.) I also ran your model with yolov3.weights and used the COCO API to compute the mAP, which is much lower than the reported 57.9%. I find that there may be a difference in how resizing is handled. Darknet defines its own function to resize the image to 416x416 and fills the space around the image with the value 0.5, but our PyTorch model uses 0.502, and other pixel values also differ. After I changed the resize function in the PyTorch model to match Darknet's, the detection results were the same as Darknet's. Sorry, my English is bad!

SGD Learning Rate 'Burn In'

Hi, isn't the learning rate updated during the training phase?
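
For context on the title, a sketch of a Darknet-style burn-in, where the learning rate ramps up from zero over the first burn_in batches and then stays at the base rate. The values burn_in=1000 and power=4 are assumed from Darknet's yolov3.cfg and parser defaults and should be checked there; the function is illustrative and is not taken from train.py.

```python
def burn_in_lr(batch_num, base_lr=1e-3, burn_in=1000, power=4):
    """Learning rate for a given batch index under a polynomial warm-up."""
    if batch_num < burn_in:
        return base_lr * (batch_num / burn_in) ** power
    return base_lr


for b in (0, 250, 500, 1000):
    print(b, burn_in_lr(b))
```

In PyTorch the returned value would be written into each entry of optimizer.param_groups under the 'lr' key before the optimizer step.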

mean_mAP issue

I'm training VOC2007, and after 1 epoch an error shows up:

F:\pytorch-yolov3-master-ul\test.py:122: RuntimeWarning: invalid value encountered in double_scalars
print('%15s: %-.4f' % (c, AP_accum[i] / AP_accum_count[i]))
aeroplane: nan
bicycle: nan
bird: nan
boat: nan
bottle: nan
bus: nan
car: nan
cat: nan
chair: nan
cow: nan
diningtable: nan
dog: nan
horse: nan
motorbike: nan
person: nan
pottedplant: nan
sheep: nan
sofa: nan
train: nan
tvmonitor: nan
Traceback (most recent call last):
File "train.py", line 268, in <module>
var=opt.var,
File "train.py", line 224, in train
img_size=img_size,
File "F:\pytorch-yolov3-master-ul\test.py", line 125, in test
return mean_mAP, mean_R, mean_P
UnboundLocalError: local variable 'mean_mAP' referenced before assignment

What's the problem?
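
The nan values look like 0/0 divisions, which happen when a class has no accumulated AP count (plausible after a single epoch with no true positives), and the UnboundLocalError suggests mean_mAP is only assigned on a code path that did not run. Both are inferences from the log. A defensive sketch that reuses the names AP_accum and AP_accum_count from the traceback; the function itself is hypothetical.

```python
def summarize_ap(AP_accum, AP_accum_count):
    """Per-class mean AP with 0.0 for classes never counted, plus the overall mean."""
    per_class = [a / c if c > 0 else 0.0
                 for a, c in zip(AP_accum, AP_accum_count)]
    # Always assigned, so the caller can return it even when nothing was detected.
    mean_mAP = sum(per_class) / len(per_class) if per_class else 0.0
    return per_class, mean_mAP


print(summarize_ap([1.0, 0.0], [2, 0]))
```

Scoring uncounted classes as 0.0 is one choice; skipping them from the mean is another, and the two give different mAP values.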

CPU usage toooo high in train.py!!!

The CPU usage is too high.
How to solve it?

RuntimeError: Expected object of type torch.cuda.FloatTensor but found type torch.FloatTensor for argument #2 'other'

python 3.7.1
torch 0.4.1

Optimizer Choice: SGD vs Adam

When developing the training code I found that SGD caused divergence very quickly at the default LR of 1e-4. Loss terms began to grow exponentially, becoming Inf within about 10 batches of starting training.

Adam always seems to converge in contrast, which is why I use it as the default optimizer in train.py. I don't understand why Adam works and SGD does not, as darknet uses SGD successfully. This is one of the key differences between darknet and this repo, so any insights into how we can get SGD to converge would be appreciated.

It might be that I simply don't have the proper learning rate (and scheduler) in place.

line 82 in train.py

# optimizer = torch.optim.SGD(model.parameters(), lr=.001, momentum=.9, weight_decay=5e-4)
optimizer = torch.optim.Adam(filter(lambda p: p.requires_grad, model.parameters()), lr=1e-4, weight_decay=5e-4)

mAP Computation vs Pycocotools

The mAP computation code is similar to https://github.com/eriklindernoren/PyTorch-YOLOv3/blob/959e0ff43f5b82bdacef87f4240bae8415eac45b/test.py#L69

It is incorrect to average the AP for each sample, because AP is computed per-class. The right way is to rank all detected instances across the whole test set for each object class, compute AP for each class, and then average the AP.
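
The procedure described above can be sketched as follows. This is an illustrative all-point interpolated AP, not pycocotools and not the repository's code; it assumes each detection has already been marked as a true or false positive by IoU matching upstream, and the data at the bottom is invented.

```python
def average_precision(scores, is_tp, n_gt):
    """AP for one class from detections pooled over the whole test set (n_gt > 0)."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    tp = fp = 0
    recall, precision = [], []
    for i in order:  # walk detections from most to least confident
        if is_tp[i]:
            tp += 1
        else:
            fp += 1
        recall.append(tp / n_gt)
        precision.append(tp / (tp + fp))
    # Make precision non-increasing from right to left (the precision envelope).
    for k in range(len(precision) - 2, -1, -1):
        precision[k] = max(precision[k], precision[k + 1])
    ap, prev_recall = 0.0, 0.0
    for r, p in zip(recall, precision):
        ap += (r - prev_recall) * p  # area under the precision-recall curve
        prev_recall = r
    return ap


def mean_average_precision(detections, n_gt_per_class):
    """detections: (class_id, score, is_tp) tuples from all images together."""
    aps = []
    for cls, n_gt in n_gt_per_class.items():
        if n_gt == 0:
            continue  # a class with no ground truth has no defined AP
        dets = [(s, t) for c, s, t in detections if c == cls]
        aps.append(average_precision([s for s, _ in dets],
                                     [t for _, t in dets], n_gt))
    return sum(aps) / len(aps) if aps else 0.0


detections = [(0, 0.9, True), (0, 0.8, False), (0, 0.7, True), (1, 0.6, True)]
print(mean_average_precision(detections, {0: 2, 1: 1}))
```

One property of this definition is that false positives ranked below the last true positive add no area, so they do not lower AP.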

IoU step in build_targets compared to Darknet implementation

Hi @glenn-jocher,

Thanks for this wonderful port of Yolo v3. I had two questions, however, about the matching step in build_targets -- where you compute which anchor box corresponds to each ground truth box.

You seem to be computing IoU using only the width and height of each anchor box against each target. Darknet doesn't appear to do this -- if I'm reading the implementation correctly, it iterates through the grid cells and computes the IoU using the X and Y of each cell. Is there a reason you compute IoU using width and height only? Optimization?

Also during this step, the ignore_threshold is set to 0.5 in the Darknet paper, and 0.7 in the Darknet implementation, while you seem to be using 0.1 in build_targets. Is there a reason for that?

Thanks!
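
To make the first question concrete, here is a sketch of what a width-height-only IoU computes: both boxes are treated as sharing the same center, so only their shapes are compared. The anchor list is assumed to be the standard yolov3.cfg set and should be checked against the cfg in use; the function names are illustrative, not the ones in build_targets.

```python
def wh_iou(wh1, wh2):
    """IoU of two boxes that share a center, from (width, height) alone."""
    inter = min(wh1[0], wh2[0]) * min(wh1[1], wh2[1])
    union = wh1[0] * wh1[1] + wh2[0] * wh2[1] - inter
    return inter / union


def best_anchor(target_wh, anchors):
    """Index of the anchor whose shape overlaps the target shape the most."""
    return max(range(len(anchors)), key=lambda i: wh_iou(target_wh, anchors[i]))


# Assumed yolov3.cfg anchors, in pixels at 416 x 416.
ANCHORS = [(10, 13), (16, 30), (33, 23), (30, 61), (62, 45),
           (59, 119), (116, 90), (156, 198), (373, 326)]
print(best_anchor((120, 95), ANCHORS), wh_iou((120, 95), ANCHORS[6]))
```

Because position is ignored, the match depends only on box shape; whether that agrees with what Darknet does is exactly the question raised in this issue.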

mAP Computation in test.py

COCO2014 mAP computation on official YOLOv3 weights corresponds to expected value of 0.58 (same as darknet), but mAP computation on trained checkpoints seems higher than it should be. In particular many false positives do not seem to negatively impact mAP.

For example validation image 2 should have 4 people and 1 baseball bat. At epoch 37, I see ~140 objects detected. Precision and Recall look like this:

Precision-Recall curve looks like this:

AP for this image is then calculated as 0.78, which is strangely high for 4 TP and ~140 FP's.

AP = compute_ap(recall, precision)
Out[66]: 0.78596

Lastly, I believe mAP is supposed to be calculated per class in each image, but here all the classes seem combined.

Darknet Polynomial LR Curve

I found darknet's polynomial learning rate curve here:

case POLY:
    return net->learning_rate * pow(1 - (float)batch_num / net->max_batches, net->power);

https://github.com/pjreddie/darknet/blob/680d3bde1924c8ee2d1c1dea54d3e56a05ca9a26/src/network.c#L111

If I use power = 4 from parser.c then I plot the following curve (in MATLAB), assuming max_batches = 1563360 (160 epochs at batch_size 12, for 9771 batches/epoch). This leaves the final lr(1563360) = 0. This means that it is impossible for anyone to begin training a model from the official YOLOv3 weights and expect to resume training at lr = 0.001 with no problems. The model is going to clearly bounce out of its local minimum back into the huge gradients it first saw at epoch 0.
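
The same curve as a Python sketch, for readers without MATLAB. It transcribes the POLY formula quoted above with the values from this paragraph; the function name is illustrative.

```python
def poly_lr(batch_num, base_lr=1e-3, max_batches=1563360, power=4):
    """Darknet POLY policy: base_lr * (1 - batch / max_batches) ** power."""
    return base_lr * (1 - batch_num / max_batches) ** power


for b in (0, 781680, 1563360):  # start, midpoint, end of training
    print(b, poly_lr(b))
```

At the midpoint the rate has already dropped to one sixteenth of its starting value, and it reaches exactly zero at max_batches.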

>> batch = 0:(9771*160);
>> lr = 1e-3 * (1 - batch./1563360).^4;
>> fig; plot(batch,lr,'.-'); xyzlabel('batch','learning rate'); fcnfontsize(14); fcntight;

Pretrained Convolutional Weights from darknet53

Thanks for sharing your work.
yolov3 initializes model weights (up to line 549 in yolov3.cfg) from darknet53 classifier if I am not mistaken. Your model might not converge at epoch 160 if that is the case. Have you tried initializing yolov3 with darknet53?

tiny-yolo needed!

Could you please add the code for tiny-yolo?
I tried tiny-yolo.cfg and got a very bad result, with mAP = 0.04 after training for about 50 epochs.

What is the expected inference time?

Say for a batch of 32 images for computing the mAP?

Checkpoint from PyTorch-trained model?

Thank you for this awesome repository.
Due to some changes in the training scheme vs. the original darknet code, I wonder if the provided PyTorch weights are converted from the original .weights file or are instead the result of a fresh training session in PyTorch.
If it's the former, it would be wonderful if the results of PyTorch training could be provided as well!

Thanks!

The pretrained weights on ImageNet

@glenn-jocher
Hi, first of all, thank you very much for your code. I have been training for 24 hours (10 epochs) on my GTX1080, but it seems that I can't load the weights pre-trained on ImageNet (the darknet53.comv). And it takes a long time to train from scratch. TAT

Multi-GPU Training

Hi,
Have you tried to run training on multiple gpus?
I am getting the error below when I try to do that. Thank you.

Traceback (most recent call last):
  File "train.py", line 194, in <module>
    main(opt)
  File "train.py", line 128, in main
    loss = model(imgs, targets, requestPrecision=True)
  File "/opt/anaconda/envs/pytorch_p35/lib/python3.5/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/opt/anaconda/envs/pytorch_p35/lib/python3.5/site-packages/torch/nn/parallel/data_parallel.py", line 123, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/opt/anaconda/envs/pytorch_p35/lib/python3.5/site-packages/torch/nn/parallel/data_parallel.py", line 133, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/opt/anaconda/envs/pytorch_p35/lib/python3.5/site-packages/torch/nn/parallel/parallel_apply.py", line 77, in parallel_apply
    raise output
  File "/opt/anaconda/envs/pytorch_p35/lib/python3.5/site-packages/torch/nn/parallel/parallel_apply.py", line 53, in _worker
    output = module(*input, **kwargs)
  File "/opt/anaconda/envs/pytorch_p35/lib/python3.5/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
TypeError: forward() missing 1 required positional argument: 'x'

Can't seem to load weights from a custom training session.

I've got a model trained using Darknet on a custom dataset containing 43 classes. This requires changing the number of filters for 3 conv layers from 255 to 144 (and of course the number of classes, although this shouldn't affect the bug I'm describing).

Alas, when loading this weight file (with the relevant cfg file), the network loading part crashes due to the weights file 'ending' before it should.
The actual error is generated by:
conv_w = torch.from_numpy(weights[ptr:ptr + num_w]).view_as(conv_layer.weight)

While the error is:

RuntimeError: invalid argument 2: size '[144 x 256 x 1 x 1]' is invalid for input with 36863 elements at /opt/conda/conda-bld/pytorch_1535491974311/work/aten/src/TH/THStorage.cpp:84

Now, the size of weights is 61802511, and ptr+num_w is 61802512 - i.e. it's one float shorter.
I'll note that this doesn't happen when loading the original yolov3.weights file that comes with the original Darknet/YOLOv3 repository (in that one, the numbers perfectly align).

I can't think of a cause other than something causing the counter to advance one index too fast, but can't think of a reason why this would happen. Any ideas?

Thanks!
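
One possible explanation, stated as a hypothesis: Darknet weight files begin with a header whose length depends on the version fields. Assuming Darknet's load_weights logic, the 'seen' counter is stored as 8 bytes when major*10 + minor >= 2 and as 4 bytes otherwise. A loader that always skips five 32-bit values would then consume one real weight from a file written with the shorter header, leaving the data exactly one float short. A sketch of a version-aware header size check; the function name and the sample 'seen' value are hypothetical.

```python
import struct


def darknet_header_bytes(prefix):
    """Header length in bytes, from the first 12 bytes of a .weights file."""
    major, minor, _revision = struct.unpack('<3i', prefix[:12])
    seen_is_64bit = (major * 10 + minor) >= 2 and major < 1000 and minor < 1000
    return 12 + (8 if seen_is_64bit else 4)


new_style = struct.pack('<3iq', 0, 2, 0, 32013312)  # version 0.2, 64-bit seen
old_style = struct.pack('<3ii', 0, 1, 0, 32013312)  # version 0.1, 32-bit seen
print(darknet_header_bytes(new_style), darknet_header_bytes(old_style))
```

Printing the first three integers of the custom weights file would show which case applies.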

Can u get the mAP as reported in darknet ?

Training code

Is the following for loop necessary? Except for the last batch, len(imgs) = n, so j could only be 0 in the loop. In the last batch, if len(imgs) is smaller than n, int(len(imgs) / n) = 0, the loop is ignored. Otherwise, len(imgs) = n, so j could only be 0 in the loop.

yolov3/train.py

Lines 118 to 119 in 68de92f

    
        n = opt.batch_size  # number of pictures at a time
        for j in range(int(len(imgs) / n)):
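
The reasoning above can be checked directly with a small sketch (the helper name is made up): for a full batch the loop body runs once with j = 0, and for a short final batch it does not run at all.

```python
def inner_iterations(n_imgs, batch_size):
    """How many times `for j in range(int(len(imgs) / n))` executes its body."""
    return len(range(int(n_imgs / batch_size)))


print(inner_iterations(16, 16))  # full batch
print(inner_iterations(5, 16))   # short final batch
```

So under these assumptions the last partial batch is silently skipped rather than trained on.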

Darknet Training Comparison

All, I've started training using the official darknet repo to compare. The first three things I noticed are:

Darknet training speed appears quite slow. In darknet yolov3.cfg, max_batches = 500200 is the total train time, and batch=64 is the images per batch, then this will take about 28 days on a GCP P100 at about 18,000 batches per day (all train settings to default).
Darknet appears set to train for 267 epochs. This is 500200 batches times 64 images per batch divided by 120,000 images in the training set. Can this be right? This seems like a lot.
Darknet is using multi_scale training, changing the image size every 10 batches. I've set this behavior as well in this repo if -multi_scale = True in train.py (though currently this changes the size every batch).
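
The arithmetic behind the first two points, as a short sketch using only the figures quoted above:

```python
max_batches = 500200      # from darknet yolov3.cfg, as quoted above
images_per_batch = 64
train_images = 120000     # approximate training set size used above
batches_per_day = 18000   # observed rate on a GCP P100, as quoted above

epochs = max_batches * images_per_batch / train_images
days = max_batches / batches_per_day
print(round(epochs, 1), 'epochs,', round(days, 1), 'days')
```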

Own dataset doesn't work on latest commit

For some reason I can't seem to train my own dataset on the latest commit. I am able to do it from an earlier commit, e.g. this state. In this state, if I run my training (with the exact same cfg files, dataset, etc.), I get these results after a couple of epochs:

      Epoch      Batch          x          y          w          h       conf        cls      total          P          R   nTargets         TP         FP         FN       time
       0/99      99/99       1.49       1.47       7.46       12.8        111       7.38        141          0          0          3          0    2.9e+03          0      0.124
       1/99      99/99       1.26       1.19       1.99       3.07       16.1       7.35       30.9          0          0         10          0          1          8      0.131
       2/99      99/99       1.04       1.02      0.831       1.08       4.64       7.25       15.9          0          0          7          0          3          4      0.129
       3/99      99/99      0.756      0.796      0.666       0.79       3.67       7.25       13.9   0.000769    0.00187         10          0          2          7      0.129
       4/99      99/99       0.58      0.683      0.574      0.739       2.93       7.15       12.7    0.00314     0.0167          3          0          3          1      0.129
       5/99      99/99      0.455       0.54      0.462       0.64       2.62       7.14       11.9    0.00636     0.0221          6          0          6          1      0.132

If I use the latest commit, I get this:

      Epoch      Batch          x          y          w          h       conf        cls      total          P          R   nTargets         TP         FP         FN       time
       0/99      99/99       1.49       1.47       7.41       12.6        111       7.38        141          0          0          3          0          0          0      0.117
       1/99      99/99       4.91       4.93        nan        nan        nan        nan        nan          0          0         10          0          0          0      0.125
       2/99      99/99       5.52       5.24        nan        nan        nan        nan        nan          0          0          7          0          0          0      0.124
       3/99      99/99       5.23       5.21        nan        nan        nan        nan        nan          0          0         10          0          0          0      0.129
       4/99      99/99       5.35       5.17        nan        nan        nan        nan        nan          0          0          3          0          0          0      0.125
       5/99      99/99       5.65       5.41        nan        nan        nan        nan        nan          0          0          6          0          0          0      0.124
       6/99      99/99       5.13       5.13        nan        nan        nan        nan        nan          0          0          9          0          0          0      0.126

Also in the latest commit, line 197 of train.py causes the following error:

Traceback (most recent call last):
  File "/home/rick/Documents/yolov3-master/train.py", line 208, in <module>
    main(opt)
  File "/home/rick/Documents/yolov3-master/train.py", line 195, in main
    mAP, R, P = test.main(test.opt)
  File "/home/rick/Documents/yolov3-master/test.py", line 42, in main
    model.load_state_dict(checkpoint['model'])
  File "/media/rick/HDD/Env/yolov3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 719, in load_state_dict
    self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for Darknet:
	size mismatch for module_list.81.conv_81.weight: copying a param of torch.Size([255, 1024, 1, 1]) from checkpoint, where the shape is torch.Size([303, 1024, 1, 1]) in current model.
	size mismatch for module_list.81.conv_81.bias: copying a param of torch.Size([255]) from checkpoint, where the shape is torch.Size([303]) in current model.
	size mismatch for module_list.93.conv_93.weight: copying a param of torch.Size([255, 512, 1, 1]) from checkpoint, where the shape is torch.Size([303, 512, 1, 1]) in current model.
	size mismatch for module_list.93.conv_93.bias: copying a param of torch.Size([255]) from checkpoint, where the shape is torch.Size([303]) in current model.
	size mismatch for module_list.105.conv_105.weight: copying a param of torch.Size([255, 256, 1, 1]) from checkpoint, where the shape is torch.Size([303, 256, 1, 1]) in current model.
	size mismatch for module_list.105.conv_105.bias: copying a param of torch.Size([255]) from checkpoint, where the shape is torch.Size([303]) in current model.

So I replaced it with (the old) code:

        with open('results.txt', 'a') as file:
            file.write(s + '\n')

I don't know if this is normal, but I can't seem to find a solution.
Do you know why I get nan while using the exact same cfg files and data? The txt files for each image are spot on: the bounding boxes, width, height, etc. are all relative to the image width and height.

Sum False Positives from unassigned anchors

yolov3/models.py

Lines 207 to 213 in fd6619d

    
        # Sum False Positives from unassigned anchors
        FPe = torch.zeros(self.nC)
        if batch_report:
            i = torch.sigmoid(pred_conf[~mask]) > 0.5
            if i.sum() > 0:
                FP_classes = torch.argmax(pred_cls[~mask][i], 1)
                FPe = torch.bincount(FP_classes, minlength=self.nC).float().cpu()  # extra FPs

Can somebody explain this?
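
One reading of the snippet, as a plain-Python sketch rather than an authoritative explanation: among anchors that were not assigned a target (~mask), any whose objectness exceeds 0.5 after the sigmoid is a confident detection with no matching ground truth, so it is counted as a false positive for the class it scores highest. The function name and the data below are made up.

```python
import math


def extra_false_positives(pred_conf, pred_cls, mask, n_classes):
    """Per-class count of confident predictions on anchors with no target."""
    counts = [0] * n_classes  # plays the role of FPe
    for conf, cls_scores, assigned in zip(pred_conf, pred_cls, mask):
        if assigned:
            continue  # anchors with a target are handled elsewhere
        if 1 / (1 + math.exp(-conf)) > 0.5:  # sigmoid(conf) > 0.5
            best = max(range(n_classes), key=lambda k: cls_scores[k])  # argmax
            counts[best] += 1  # what bincount accumulates
    return counts


pred_conf = [3.0, -2.0, 0.5, 1.0]                              # objectness logits
pred_cls = [[0.1, 0.9], [0.8, 0.2], [0.7, 0.3], [0.2, 0.6]]   # class scores
mask = [True, False, False, False]                            # anchor 0 has a target
print(extra_false_positives(pred_conf, pred_cls, mask, 2))
```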

Loss Constants: _coord, _obj and _noobj

The correct YOLO v3 loss constants are:

lambda_coord = 1.0
lambda_obj = 1.0
lambda_noobj = 1.0

rather than the below constants, which derive from the original yolov1.cfg file:
https://github.com/pjreddie/darknet/blob/61c9d02ec461e30d55762ec7669d6a1d3c356fb2/cfg/yolov1.cfg#L257-L260

lambda_coord = 5.0
lambda_obj = 1.0
lambda_noobj = 0.5

The latest yolov3 constants appear to be hard coded into parser.c rather than in yolov3.cfg. Credit to @ydixon who originally noticed this discrepancy in issue #12.
https://github.com/pjreddie/darknet/blob/680d3bde1924c8ee2d1c1dea54d3e56a05ca9a26/src/parser.c#L376-L381

Hi, I want to train on VOC, can you help me?

Model Loss

Hey,
Following Section 2.2 of YOLO, I have a few questions about the loss calculation shown at the end of this issue.

1. We are using λ coord = 5 from line 156 to line 159. Should we also use λ noobj = .5 in line 167?
2. Why are we multiplying BCELoss with 1.5 in line 160? I have not found any reference to this in the papers.
3. pred_conf gives us a [batch_size x anchor_number x grid_size x grid_size] tensor. Assuming batch_size = 1, anchor_number=3 and grid_size = 2, there are 12 elements in this tensor. If nM = 3, pred_conf[~mask] contains 9 elements, so does mask[~mask].float(). BCEWithLogitsLoss1 gives the sum of BCE loss for these 9 elements, whereas BCEWithLogitsLoss2 takes the mean of BCEWithLogitsLoss1 (i.e. divides it by 9 for our case). Now, my question is why are we multiplying BCEWithLogitsLoss2 with nM instead of using BCEWithLogitsLoss1 (should divide by batch_size too prob.) in line 167? There is no division in Section 2.2 of YOLO. Btw, pred_conf[~mask] could contain 15k elements normally, so we are practically ignoring the confidence loss in line 167.
4. Similar to 3, we should use BCEWithLogitsLoss1 (should divide by batch_size too prob.) in line 163. Because
BCEWithLogitsLoss1(pred_cls[mask], tcls.float()) / BCEWithLogitsLoss2(pred_cls[mask], tcls.float()) = batch_size x nM x number_of_classes.
5. Why are we not dividing all the losses by the batch_size? As the batch_size increases, the loss increases too. However, we should minimize the expected loss per sample.

yolov3/models.py

Lines 155 to 167 in 9514e74

    
    if nM > 0:
        lx = 5 * MSELoss(x[mask], tx[mask])
        ly = 5 * MSELoss(y[mask], ty[mask])
        lw = 5 * MSELoss(w[mask], tw[mask])
        lh = 5 * MSELoss(h[mask], th[mask])
        lconf = 1.5 * BCEWithLogitsLoss1(pred_conf[mask], mask[mask].float())
        # lcls = nM * CrossEntropyLoss(pred_cls[mask], torch.argmax(tcls, 1))
        lcls = nM * BCEWithLogitsLoss2(pred_cls[mask], tcls.float())
    else:
        lx, ly, lw, lh, lcls, lconf = FT([0]), FT([0]), FT([0]), FT([0]), FT([0]), FT([0])

    lconf += nM * BCEWithLogitsLoss2(pred_conf[~mask], mask[~mask].float())
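The sum-versus-mean point raised in question 3 can be checked numerically. The sketch below is a pure-Python stand-in for BCE with logits, assuming (as the question states) that one loss object sums and the other averages; the logit values are made up, and only nM = 3 and the 9 unassigned elements come from the example in the question.

```python
import math


def bce_with_logits(logits, targets, reduction="mean"):
    """Binary cross entropy on raw logits, using a numerically stable form."""
    losses = [max(x, 0.0) - x * t + math.log1p(math.exp(-abs(x)))
              for x, t in zip(logits, targets)]
    total = sum(losses)
    return total / len(losses) if reduction == "mean" else total


nM = 3                 # matched targets, as in the question
logits = [0.3] * 9     # 9 unassigned predictions (illustrative values)
targets = [0.0] * 9    # no object at any of them

summed = bce_with_logits(logits, targets, reduction="sum")
mean = bce_with_logits(logits, targets, reduction="mean")
scaled = nM * mean     # what line 167 computes under this assumption
```

Here `scaled` equals `summed * nM / 9`, so the no-object term shrinks as the number of unassigned predictions grows relative to nM, which is the effect the question describes.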

About train

THANK YOU

Train error

The following error occurred while I was training on COCO.

Traceback (most recent call last):
  File "/project/yolov3/train.py", line 202, in <module>
    main(opt)
  File "/project/yolov3/train.py", line 132, in main
    loss = model(imgs.to(device), targets, requestPrecision=True)
  File "/data_b/VirEnv/project/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/data_b/VirEnv/project/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 123, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/data_b/VirEnv/project/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 133, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/data_b/VirEnv/project/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 77, in parallel_apply
    raise output
  File "/data_b/VirEnv/project/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 53, in _worker
    output = module(*input, **kwargs)
  File "/data_b/VirEnv/project/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/project/yolov3/models.py", line 238, in forward
    x, *losses = module[0](x, targets, requestPrecision)
  File "/data_b/VirEnv/project/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/project/yolov3/models.py", line 156, in forward
    requestPrecision)
  File "/project/yolov3/utils/utils.py", line 278, in build_targets
    tmp = pred_cls[b, a, gj, gi]
IndexError: index 8 is out of bounds for dimension 0 with size 8

rloss['nT'] is zero when training

When training:

Traceback (most recent call last):
  File "train.py", line 211, in <module>
    main(opt)
  File "train.py", line 177, in main
    loss_per_target = rloss['loss'] / rloss['nT']
ZeroDivisionError: float division by zero
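The division fails whenever no targets were accumulated, which may also be a sign that the labels did not load. A minimal guard could look like the sketch below; the helper name is hypothetical and only the `rloss['loss']` and `rloss['nT']` keys are taken from the traceback.

```python
def loss_per_target(rloss):
    """Average loss per target, returning 0.0 when no targets were seen."""
    n_targets = rloss['nT']
    if n_targets == 0:
        # Nothing to average over; this is also worth logging, since a
        # batch with zero targets often means the label files were not found.
        return 0.0
    return rloss['loss'] / n_targets
```

This only hides the crash. If `nT` stays at zero for every batch, the label paths are the more likely thing to check.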

Cuda out of memory while training

Hi,

First, thanks for your work here.

I have a problem: whenever training moves from epoch 0 to epoch 1, I get a "CUDA out of memory" error.
I decreased the batch size to 1 and still get the error. The first epoch itself runs fine at batch sizes of 8 and below.

I am training on a custom dataset. My image sizes vary.

I am running it on a GTX 1070.

Thanks in advance

Edit:
multi_scale is set to false.
While training, the memory used on my GTX is:
2445/8116 MiB
After the first epoch, VRAM usage balloons. I could only check it during the epoch change, and it was nearly completely used until it ran out of memory again. What is running between epochs that is so memory-intensive?

multi_scale parameter used to resize_square instead of affine transformation

yolov3/train.py

Line 243 in b48c108

    
           parser.add_argument('--multi-scale', action='store_true', help='random image sizes per batch 320 - 608')

As far as I understand, the multi_scale parameter sets a random height within a range when the image is resized to a square and the border is set.

Wouldn't this just be the same as setting the scale range parameter later in the affine transformation?
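For context on the question, the sketch below shows one way a per-batch image size in the 320 to 608 range from the help string could be drawn. Restricting sizes to multiples of 32 is an assumption (it matches the usual YOLO stride, but this thread does not confirm it), and the function name is hypothetical.

```python
import random


def random_img_size(min_size=320, max_size=608, stride=32, rng=random):
    """Pick one square image size for the next batch.

    Assumes sizes must be multiples of `stride`; 320..608 in steps of 32
    gives ten candidates.
    """
    return rng.choice(range(min_size, max_size + 1, stride))


# Ten batches drawn from a seeded generator so the run is repeatable
rng = random.Random(0)
sizes = [random_img_size(rng=rng) for _ in range(10)]
```

A size drawn this way changes the resolution of the whole input tensor for a batch, whereas a scale range inside an affine transform changes how large objects appear within a canvas of fixed size, so the two need not be equivalent.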

Resume training from official yolov3 weights

Thanks for your improvements to this YOLOv3 implementation.
I have just tested training and ran into a problem.
I followed these steps:

1. Loaded the original yolov3.weight into the model.
2. Trained it on coco2014 with your train.py.
3. Got the following logs: precision drops quickly from 0.5 to 0.1, while recall rises to 0.35.
See screenshot here

4. Saved the weights at precision 0.2 and ran detect.py.
The result looks like this:

If I do not train, the original weights give this result:

I do not know whether I used the wrong parameters or whether something else leads to so many bounding boxes being generated.
Could you give me some suggestions?
Thank you~

5k.txt / 5k.part file extension

When I run:
~/PycharmProjects/yolov3-master$ python test.py -weights_path checkpoints/latest.pt
there is an error in datasets.py:

img_all = np.stack(img_all)[:, :, :, ::-1].transpose(0, 3, 1, 2)  # line 218 (datasets.py)

Traceback (most recent call last):
  File "test.py", line 59, in <module>
    for batch_i, (imgs, targets) in enumerate(dataloader):
  File "/home/chenfei/PycharmProjects/yolov3-master/utils/datasets.py", line 218, in next
    img_all = np.stack(img_all)[:, :, :, ::-1].transpose(0, 3, 1, 2) # BGR to RGB and cv2 to pytorch
  File "/home/chenfei/anaconda3/lib/python3.6/site-packages/numpy/core/shape_base.py", line 349, in stack
    raise ValueError('need at least one array to stack')
ValueError: need at least one array to stack

Can you help me?
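np.stack raises this ValueError when it is handed an empty list, so the batch most likely contained no loaded images. One way to narrow that down is to check the image list before building the dataloader. The sketch below assumes a plain text file with one image path per line; the helper name is hypothetical.

```python
import os


def split_image_list(list_file):
    """Return (found, missing) image paths listed one per line in list_file."""
    with open(list_file) as f:
        # Skip blank lines and strip trailing newlines
        paths = [line.strip() for line in f if line.strip()]
    found = [p for p in paths if os.path.isfile(p)]
    missing = [p for p in paths if not os.path.isfile(p)]
    return found, missing
```

If `found` comes back empty, the paths in the list file (or the name of the list file itself) do not match what is on disk, and stacking will fail as in the traceback above.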

Path and File Separators

I ran detect.py but got nothing in the output, so I changed results_img_path and results_txt_path as follows:

results_img_path = os.path.join(output, path.split('/')[-1].split('\\')[-1])
results_txt_path = results_img_path.split('.')[-2] + '.txt'

Is this a small bug?

P.S. I'm a rookie in deep learning, no offense.
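A variant of the same idea that avoids manual splitting is sketched below. It relies on `ntpath.basename`, which treats both '/' and '\' as separators on any platform, and on `os.path.splitext`, which only removes the final extension. The function name and the example paths are hypothetical.

```python
import ntpath
import os


def result_paths(output_dir, image_path):
    """Build the output image and text paths for one input image."""
    # ntpath splits on both '/' and '\\', regardless of the host OS
    name = ntpath.basename(image_path)
    # splitext drops only the last extension, so dots elsewhere are kept
    stem, _ext = os.path.splitext(name)
    img_path = os.path.join(output_dir, name)
    txt_path = os.path.join(output_dir, stem + '.txt')
    return img_path, txt_path


win_style = result_paths('output', 'data\\samples\\img.jpg')
posix_style = result_paths('output', 'data/samples/img.v2.jpg')
```

Splitting the full path on '.' can misbehave when a directory name or the output folder contains a dot, which is the case `splitext` is meant to handle.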

build_targets function

Can you explain why you used the following constants? I have inspected a few different YOLOv3 implementations, but none had a similar operation.

yolov3/utils/utils.py

Line 183 in a284fc9

    
           u = gi.float() * 0.4361538773074043 + gj.float() * 0.28012496588736746 + a.float() * 0.6627147212460307
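One plausible reading, which this thread does not confirm, is that the weighted sum collapses each (gi, gj, a) triple into a single float key so that duplicate grid-cell and anchor assignments can be detected. The sketch below compares that float key with a plain tuple key on a few made-up triples; the helper names are hypothetical and only the three constants come from the quoted line.

```python
def float_key(gi, gj, a):
    """Collapse (gi, gj, a) into one float using the constants quoted above."""
    return (gi * 0.4361538773074043
            + gj * 0.28012496588736746
            + a * 0.6627147212460307)


def first_unique(keys):
    """Indices of the first occurrence of each key, in input order."""
    seen = set()
    keep = []
    for i, key in enumerate(keys):
        if key not in seen:
            seen.add(key)
            keep.append(i)
    return keep


# Made-up (gi, gj, anchor) triples; the second repeats the first
triples = [(1, 2, 0), (1, 2, 0), (3, 4, 1), (1, 2, 1)]
by_float = first_unique([float_key(*t) for t in triples])
by_tuple = first_unique(triples)
```

On this input both keys keep the same entries. The tuple key makes the intent explicit and does not depend on the constants never producing the same sum for two different triples.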

	n = opt.batch_size # number of pictures at a time
	for j in range(int(len(imgs) / n)):


Load labels

when training:
