
Comments (31)

zhreshold avatar zhreshold commented on August 28, 2024

The SmoothL1 loss is NaN at the first batch, which suggests there's some problem.
What MXNet version are you using?

from gluon-cv.

1292765944 avatar 1292765944 commented on August 28, 2024

I use MXNet version 1.2.0.

from gluon-cv.

zhreshold avatar zhreshold commented on August 28, 2024

I just re-ran with the latest 1.2.0 MXNet and there was no problem.
You can self-diagnose by reducing --log-interval to 1 and checking whether the loss is consistently NaN; if so, there's some problem with your data.
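If you want a quick check, a minimal sketch along these lines should work (it assumes gluoncv's VOCDetection API and is not part of the train script):

# Hedged sketch: scan the VOC labels for non-finite or degenerate boxes.
import numpy as np
from gluoncv.data import VOCDetection

dataset = VOCDetection(splits=[(2007, 'trainval'), (2012, 'trainval')])
for idx in range(len(dataset)):
    _, label = dataset[idx]             # label: (N, 6) array, box corners in columns 0-3
    boxes = label[:, :4]
    if not np.isfinite(boxes).all():
        print('non-finite box in sample', idx)
    if (boxes[:, 2] <= boxes[:, 0]).any() or (boxes[:, 3] <= boxes[:, 1]).any():
        print('degenerate box in sample', idx)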

from gluon-cv.

1292765944 avatar 1292765944 commented on August 28, 2024

@zhreshold
I ran the training script five times. Only one run trained correctly in the first epoch, and its loss became NaN from the second epoch on (the 1st log below), while the other runs fell into a NaN loss right at the second batch of the first epoch (the 2nd log).

INFO:root:Namespace(batch_size=32, data_shape=300, dataset='voc', epochs=240, gpus='0', log_interval=1, lr=0.001, lr_decay=0.1, lr_decay_epoch='160,200', momentum=0.9, network='vgg16_atrous', num_workers=4, resume='', save_interval=10, save_prefix='ssd_300_vgg16_atrous_voc', seed=233, start_epoch=0, wd=0.0005)
INFO:root:Start training from [Epoch 0]
[11:29:29] src/operator/nn/./cudnn/./cudnn_algoreg-inl.h:107: Running performance tests to find the best convolution algorithm, this can take a while... (setting env variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
[Epoch 0][Batch 0], Speed: 1.592764 samples/sec, CrossEntropy=18.970078, SmoothL1=3.773359
[Epoch 0][Batch 1], Speed: 33.675918 samples/sec, CrossEntropy=18.700781, SmoothL1=3.705951
[Epoch 0][Batch 2], Speed: 34.501906 samples/sec, CrossEntropy=17.878072, SmoothL1=3.636311
[Epoch 0][Batch 3], Speed: 34.971340 samples/sec, CrossEntropy=17.188730, SmoothL1=3.532549
[Epoch 0][Batch 4], Speed: 35.210838 samples/sec, CrossEntropy=16.539876, SmoothL1=3.594374
[Epoch 0][Batch 5], Speed: 35.344982 samples/sec, CrossEntropy=16.004512, SmoothL1=3.554986
[Epoch 0][Batch 6], Speed: 35.138346 samples/sec, CrossEntropy=15.519500, SmoothL1=3.495123
[Epoch 0][Batch 7], Speed: 35.231662 samples/sec, CrossEntropy=15.133650, SmoothL1=3.454569
[Epoch 0][Batch 8], Speed: 34.617724 samples/sec, CrossEntropy=14.822596, SmoothL1=3.410199
[Epoch 0][Batch 9], Speed: 32.935670 samples/sec, CrossEntropy=14.517015, SmoothL1=3.382657
[Epoch 0][Batch 10], Speed: 31.795520 samples/sec, CrossEntropy=14.196736, SmoothL1=3.346445
[Epoch 0][Batch 11], Speed: 31.414617 samples/sec, CrossEntropy=13.857106, SmoothL1=3.314300
[Epoch 0][Batch 12], Speed: 31.531594 samples/sec, CrossEntropy=13.554048, SmoothL1=3.303076
[Epoch 0][Batch 13], Speed: 31.247252 samples/sec, CrossEntropy=13.248046, SmoothL1=3.292221
[Epoch 0][Batch 14], Speed: 30.715182 samples/sec, CrossEntropy=12.961388, SmoothL1=3.299442
[Epoch 0][Batch 15], Speed: 30.934725 samples/sec, CrossEntropy=12.669294, SmoothL1=3.280241
[Epoch 0][Batch 16], Speed: 31.433194 samples/sec, CrossEntropy=12.401437, SmoothL1=3.253198
[Epoch 0][Batch 17], Speed: 31.619426 samples/sec, CrossEntropy=12.157580, SmoothL1=3.254118
[Epoch 0][Batch 18], Speed: 31.543910 samples/sec, CrossEntropy=11.918905, SmoothL1=3.228512
[Epoch 0][Batch 19], Speed: 31.251217 samples/sec, CrossEntropy=11.685043, SmoothL1=3.222793
[Epoch 0][Batch 20], Speed: 30.518158 samples/sec, CrossEntropy=11.486743, SmoothL1=3.205430
...

[Epoch 0][Batch 504], Speed: 30.911898 samples/sec, CrossEntropy=5.058636, SmoothL1=2.151901
[Epoch 0][Batch 505], Speed: 31.521841 samples/sec, CrossEntropy=5.056911, SmoothL1=2.151394
[Epoch 0][Batch 506], Speed: 31.359363 samples/sec, CrossEntropy=5.054802, SmoothL1=2.151377
[Epoch 0][Batch 507], Speed: 30.824867 samples/sec, CrossEntropy=5.052479, SmoothL1=2.150690
[Epoch 0][Batch 508], Speed: 30.793573 samples/sec, CrossEntropy=5.051375, SmoothL1=2.150632
[Epoch 0][Batch 509], Speed: 31.249697 samples/sec, CrossEntropy=5.049899, SmoothL1=2.150408
[Epoch 0][Batch 510], Speed: 31.339885 samples/sec, CrossEntropy=5.048257, SmoothL1=2.149687
[Epoch 0][Batch 511], Speed: 30.592469 samples/sec, CrossEntropy=5.046074, SmoothL1=2.149366
[Epoch 0][Batch 512], Speed: 31.167457 samples/sec, CrossEntropy=5.043880, SmoothL1=2.148887
[Epoch 0][Batch 513], Speed: 31.217229 samples/sec, CrossEntropy=5.042284, SmoothL1=2.148108
[Epoch 0][Batch 514], Speed: 31.367938 samples/sec, CrossEntropy=5.040168, SmoothL1=2.146903
[Epoch 0][Batch 515], Speed: 30.607803 samples/sec, CrossEntropy=5.038861, SmoothL1=2.146586
[Epoch 0][Batch 516], Speed: 30.433037 samples/sec, CrossEntropy=5.037583, SmoothL1=2.146050
[Epoch 0] Training cost: 555.827060, CrossEntropy=5.037583, SmoothL1=2.146050
[Epoch 0] Validation: 
aeroplane=0.123616
bicycle=0.030370
bird=0.122992
boat=0.006477
bottle=0.026802
bus=0.247903
car=0.480617
cat=0.302473
chair=0.031894
cow=0.191469
diningtable=0.008598
dog=0.260007
horse=0.086784
motorbike=0.222414
person=0.435699
pottedplant=0.013110
sheep=0.113426
sofa=0.061493
train=0.130224
tvmonitor=0.064738
mAP=0.148055
[Epoch 1][Batch 0], Speed: 7.914320 samples/sec, CrossEntropy=4.576447, SmoothL1=1.757895
[Epoch 1][Batch 1], Speed: 34.310139 samples/sec, CrossEntropy=nan, SmoothL1=nan
[Epoch 1][Batch 2], Speed: 34.706138 samples/sec, CrossEntropy=nan, SmoothL1=nan
[Epoch 1][Batch 3], Speed: 33.453595 samples/sec, CrossEntropy=nan, SmoothL1=nan
[Epoch 1][Batch 4], Speed: 34.769189 samples/sec, CrossEntropy=nan, SmoothL1=nan
[Epoch 1][Batch 5], Speed: 35.310856 samples/sec, CrossEntropy=nan, SmoothL1=nan
[Epoch 1][Batch 6], Speed: 35.516562 samples/sec, CrossEntropy=nan, SmoothL1=nan
[Epoch 1][Batch 7], Speed: 35.191598 samples/sec, CrossEntropy=nan, SmoothL1=nan
[Epoch 1][Batch 8], Speed: 34.956712 samples/sec, CrossEntropy=nan, SmoothL1=nan
[Epoch 1][Batch 9], Speed: 35.018677 samples/sec, CrossEntropy=nan, SmoothL1=nan
[Epoch 1][Batch 10], Speed: 33.698188 samples/sec, CrossEntropy=nan, SmoothL1=nan
[Epoch 1][Batch 11], Speed: 32.429172 samples/sec, CrossEntropy=nan, SmoothL1=nan
[Epoch 1][Batch 12], Speed: 31.884709 samples/sec, CrossEntropy=nan, SmoothL1=nan
[Epoch 1][Batch 13], Speed: 31.174225 samples/sec, CrossEntropy=nan, SmoothL1=nan
[Epoch 1][Batch 14], Speed: 30.056717 samples/sec, CrossEntropy=nan, SmoothL1=nan
[Epoch 1][Batch 15], Speed: 32.228178 samples/sec, CrossEntropy=nan, SmoothL1=nan
[Epoch 1][Batch 16], Speed: 31.982889 samples/sec, CrossEntropy=nan, SmoothL1=nan

Namespace(batch_size=32, data_shape=300, dataset='voc', epochs=240, gpus='0', log_interval=1, lr=0.001, lr_decay=0.1, lr_decay_epoch='160,200', momentum=0.9, network='vgg16_atrous', num_workers=4, resume='', save_interval=10, save_prefix='ssd_300_vgg16_atrous_voc', seed=233, start_epoch=0, wd=0.0005)
Start training from [Epoch 0]
[Epoch 0][Batch 0], Speed: 1.199697 samples/sec, CrossEntropy=19.308395, SmoothL1=3.815743
[Epoch 0][Batch 1], Speed: 35.496752 samples/sec, CrossEntropy=4357492302527741261497340633546752.000000, SmoothL1=nan
[Epoch 0][Batch 2], Speed: 36.115509 samples/sec, CrossEntropy=2904994868351830197815071171674112.000000, SmoothL1=nan
[Epoch 0][Batch 3], Speed: 36.453160 samples/sec, CrossEntropy=2178746151263872648361303378755584.000000, SmoothL1=nan
[Epoch 0][Batch 4], Speed: 35.790659 samples/sec, CrossEntropy=1742996921011098176335117933346816.000000, SmoothL1=nan
[Epoch 0][Batch 5], Speed: 35.105847 samples/sec, CrossEntropy=1452497434175915098907535585837056.000000, SmoothL1=nan
[Epoch 0][Batch 6], Speed: 35.377832 samples/sec, CrossEntropy=1244997800722212983096512809533440.000000, SmoothL1=nan
[Epoch 0][Batch 7], Speed: 35.383120 samples/sec, CrossEntropy=1089373075631936324180651689377792.000000, SmoothL1=nan
[Epoch 0][Batch 8], Speed: 35.142164 samples/sec, CrossEntropy=968331622783943447310086415843328.000000, SmoothL1=nan
[Epoch 0][Batch 9], Speed: 35.495522 samples/sec, CrossEntropy=871498460505549088167558966673408.000000, SmoothL1=nan
[Epoch 0][Batch 10], Speed: 35.528794 samples/sec, CrossEntropy=792271327732317392183741263118336.000000, SmoothL1=nan
[Epoch 0][Batch 11], Speed: 35.216732 samples/sec, CrossEntropy=726248717087957549453767792918528.000000, SmoothL1=nan
[Epoch 0][Batch 12], Speed: 35.383988 samples/sec, CrossEntropy=670383431158114672120030891606016.000000, SmoothL1=nan
[Epoch 0][Batch 13], Speed: 36.050767 samples/sec, CrossEntropy=622498900361106491548256404766720.000000, SmoothL1=nan
[Epoch 0][Batch 14], Speed: 35.381637 samples/sec, CrossEntropy=580998973670366010739976619163648.000000, SmoothL1=nan
[Epoch 0][Batch 15], Speed: 35.286246 samples/sec, CrossEntropy=544686537815968162090325844688896.000000, SmoothL1=nan
[Epoch 0][Batch 16], Speed: 35.189660 samples/sec, CrossEntropy=512646153238558295634751631917056.000000, SmoothL1=nan
[Epoch 0][Batch 17], Speed: 35.466971 samples/sec, CrossEntropy=484165811391971723655043207921664.000000, SmoothL1=nan
[Epoch 0][Batch 18], Speed: 35.249279 samples/sec, CrossEntropy=458683400266078459871600083730432.000000, SmoothL1=nan
[Epoch 0][Batch 19], Speed: 35.442255 samples/sec, CrossEntropy=435749230252774544083779483336704.000000, SmoothL1=nan
[Epoch 0][Batch 20], Speed: 35.256659 samples/sec, CrossEntropy=414999266907404303679639590535168.000000, SmoothL1=nan
[Epoch 0][Batch 21], Speed: 35.038926 samples/sec, CrossEntropy=396135663866158696091870631559168.000000, SmoothL1=nan

from gluon-cv.

zhreshold avatar zhreshold commented on August 28, 2024

@1292765944 I will investigate this problem.
Just a question that may or may not be related to this problem: are your GPUs running pretty hot?

from gluon-cv.

1292765944 avatar 1292765944 commented on August 28, 2024

@zhreshold I think my GPU is fine; I use a Maxwell TITAN X in my experiments. The running temperature is 85C and the idle temperature is 42C (from nvidia-smi).
I also find that if two GPUs are used for training, the training loss is worse than with one GPU, and the loss always becomes NaN immediately after one batch of training.
Before using gluoncv, I also tried your old mxnet-ssd project. That project works well, although the accuracy is a little lower than caffe (74.5% vs 77.5% on the VOC 2007 test set); still, the training efficiency of mxnet is far better than caffe and the training time is much shorter. So I'd appreciate your help.

from gluon-cv.

1292765944 avatar 1292765944 commented on August 28, 2024

@zhreshold Were you able to reproduce my error? What is causing it? Is it a model initialization problem?

from gluon-cv.

zhreshold avatar zhreshold commented on August 28, 2024

I tried multiple times on EC2 and cannot reproduce the error yet. Since you are getting an exploding loss after the first update, I suspect your pretrained model or the initialized tail is abnormal. Let me think about where it could possibly go wrong.
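If you want to check on your side, here is a rough diagnostic sketch (the model name and thresholds are assumptions; it is not part of the train script) that scans the freshly built network for non-finite or extreme parameter values before the first update:

# Hedged sketch: look for bad values in the pretrained base and the newly
# initialized layers of the SSD network.
import numpy as np
import mxnet as mx
from gluoncv import model_zoo

net = model_zoo.get_model('ssd_300_vgg16_atrous_voc', pretrained_base=True)
net.initialize()                        # initializes the non-pretrained layers
net(mx.nd.zeros((1, 3, 300, 300)))      # dummy forward to finish deferred init

for name, param in net.collect_params().items():
    arr = param.data().asnumpy()
    if not np.isfinite(arr).all():
        print('non-finite values in', name)
    elif np.abs(arr).max() > 1e3:
        print('suspiciously large values in', name, float(np.abs(arr).max()))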

from gluon-cv.

1292765944 avatar 1292765944 commented on August 28, 2024

@zhreshold any ideas for this problem?

from gluon-cv.

zhreshold avatar zhreshold commented on August 28, 2024

@1292765944 Try removing the related pretrained models in ~/.mxnet/models, updating mxnet/gluon-cv, and running with the default parameters again.
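For example, something along these lines (a sketch; the cached file names are an assumption and may differ on your machine):

# Hedged sketch: delete the cached vgg16_atrous weights so they are
# re-downloaded on the next run.
import glob
import os

cache_dir = os.path.expanduser('~/.mxnet/models')
for path in glob.glob(os.path.join(cache_dir, 'vgg16_atrous*.params')):
    print('removing', path)
    os.remove(path)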

from gluon-cv.

Intellige avatar Intellige commented on August 28, 2024

Hello,
We used the file gluon-cv/scripts/detection/ssd/train_ssd.py with mxnet-cu80 version 1.3.0.
1. When we used the COCO dataset, we had the same NaN problem as @1292765944 reported.
We ran: python train_ssd.py --gpus 0,1 --dataset coco --network vgg16_atrous --data-shape 300
The result: INFO:root:[Epoch 0][Batch 99], Speed: 65.487 samples/sec, CrossEntropy=17.562, SmoothL1=nan
2. When we used the VOC dataset, even though CrossEntropy and SmoothL1 were normal, the mAP in the validation step was very low, around 0.002 in the first epoch, and a smaller lr did not help.

We have reinstalled mxnet and reloaded the initial parameters, but it doesn't work.

from gluon-cv.

zhreshold avatar zhreshold commented on August 28, 2024

We fixed a minor problem with coco, so the coco behavior is confirmed.
However, for the VOC dataset it is very weird that people are getting all kinds of different problems during training. I have repeated the training multiple times and it all went pretty smoothly.

I will try some instances with CUDA 8.0 and the latest mxnet; please let me know if you have new findings.

from gluon-cv.

zqburde avatar zqburde commented on August 28, 2024

I also encountered this problem; decreasing the lr was my solution.

from gluon-cv.

1292765944 avatar 1292765944 commented on August 28, 2024

@zqburde How did you set the lr? Did it hurt the final mAP?

from gluon-cv.

zqburde avatar zqburde commented on August 28, 2024

@1292765944 I changed the lr from 0.001 to 0.0008 and increased the number of epochs; it doesn't affect the final mAP.

from gluon-cv.

Wallart avatar Wallart commented on August 28, 2024

I am working with a GTX 1080 Ti, and I had the same issue on the previous gluoncv release (0.1): the loss was rapidly going to NaN. Unfortunately, I can no longer reproduce it on gluoncv 0.2.

To understand it better, I also implemented my own SSD inspired by amdegroot's ssd.pytorch and had no issues. Then I tried to improve my implementation using some gluoncv concepts, and the training loss became NaN too. In my experiments it is linked to:
# Xavier here is mxnet.initializer.Xavier
self.init = {
    'weight_initializer': Xavier(
        rnd_type='gaussian', factor_type='out', magnitude=2),
    'bias_initializer': 'zeros',
}

This is used in the VGGAtrousBase class. Removing this initializer solved my NaN issue.
Hope it helps.

from gluon-cv.

Angzz avatar Angzz commented on August 28, 2024

When I train SSD on coco, the training curve seems normal, but at validation the AP[0.5:0.95] is always nearly zero. What might the problem be?

This is my parameter setting:
Namespace(batch_size=16, data_shape=512, dataset='coco', epochs=240, gpus='0,1', log_interval=100, lr=0.001, lr_decay=0.1, lr_decay_epoch='160,200', momentum=0.9, network='resnet50_v1', num_workers=4, resume='', save_interval=1, save_prefix='ssd_512_resnet50_v1_coco', seed=233, start_epoch=0, val_interval=1, wd=0.0005)

This is my training curve:
[Epoch 9][Batch 4799], Speed: 53.161 samples/sec, CrossEntropy=3.285, SmoothL1=3.270
[Epoch 9][Batch 4899], Speed: 54.386 samples/sec, CrossEntropy=3.285, SmoothL1=3.274
[Epoch 9][Batch 4999], Speed: 58.573 samples/sec, CrossEntropy=3.285, SmoothL1=3.275
[Epoch 9][Batch 5099], Speed: 55.887 samples/sec, CrossEntropy=3.285, SmoothL1=3.274
[Epoch 9][Batch 5199], Speed: 54.713 samples/sec, CrossEntropy=3.285, SmoothL1=3.275
[Epoch 9][Batch 5299], Speed: 41.466 samples/sec, CrossEntropy=3.284, SmoothL1=3.276
[Epoch 9][Batch 5399], Speed: 53.258 samples/sec, CrossEntropy=3.283, SmoothL1=3.278
[Epoch 9][Batch 5499], Speed: 57.705 samples/sec, CrossEntropy=3.283, SmoothL1=3.279
[Epoch 9][Batch 5599], Speed: 56.832 samples/sec, CrossEntropy=3.283, SmoothL1=3.279
[Epoch 9][Batch 5699], Speed: 54.667 samples/sec, CrossEntropy=3.284, SmoothL1=3.280
[Epoch 9][Batch 5799], Speed: 53.720 samples/sec, CrossEntropy=3.285, SmoothL1=3.278
[Epoch 9][Batch 5899], Speed: 32.793 samples/sec, CrossEntropy=3.285, SmoothL1=3.279
[Epoch 9][Batch 5999], Speed: 57.327 samples/sec, CrossEntropy=3.286, SmoothL1=3.282
[Epoch 9][Batch 6099], Speed: 40.294 samples/sec, CrossEntropy=3.286, SmoothL1=3.281
[Epoch 9][Batch 6199], Speed: 55.066 samples/sec, CrossEntropy=3.286, SmoothL1=3.283
[Epoch 9][Batch 6299], Speed: 56.626 samples/sec, CrossEntropy=3.285, SmoothL1=3.281
[Epoch 9][Batch 6399], Speed: 54.353 samples/sec, CrossEntropy=3.285, SmoothL1=3.282
[Epoch 9][Batch 6499], Speed: 27.371 samples/sec, CrossEntropy=3.284, SmoothL1=3.284
[Epoch 9][Batch 6599], Speed: 38.576 samples/sec, CrossEntropy=3.284, SmoothL1=3.286
[Epoch 9][Batch 6699], Speed: 36.140 samples/sec, CrossEntropy=3.283, SmoothL1=3.286
[Epoch 9][Batch 6799], Speed: 14.973 samples/sec, CrossEntropy=3.283, SmoothL1=3.285
[Epoch 9][Batch 6899], Speed: 56.728 samples/sec, CrossEntropy=3.284, SmoothL1=3.283
[Epoch 9][Batch 6999], Speed: 56.083 samples/sec, CrossEntropy=3.284, SmoothL1=3.281
[Epoch 9][Batch 7099], Speed: 53.444 samples/sec, CrossEntropy=3.283, SmoothL1=3.280
[Epoch 9][Batch 7199], Speed: 19.701 samples/sec, CrossEntropy=3.283, SmoothL1=3.281
[Epoch 9][Batch 7299], Speed: 55.817 samples/sec, CrossEntropy=3.283, SmoothL1=3.279
[Epoch 9] Training cost: 3115.671, CrossEntropy=3.283, SmoothL1=3.279
[Epoch 9] Validation:

 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.000
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.001
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.000
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.000
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.001
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.003
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.002
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.003
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.003
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.000
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.001
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.006

So why? I use the newest pre-release version of gluoncv without any modification, and mxnet is also the pre-release version; my GPU is a TITAN Xp (12G). Can you give me some suggestions? @zhreshold

from gluon-cv.

zhreshold avatar zhreshold commented on August 28, 2024

This is what I got after one epoch with python3 ../gluon-cv/scripts/detection/ssd/train_ssd.py --gpus 0,1,2,3 -j 32 --network vgg16_atrous --data-shape 300 --dataset coco --lr 0.001 --lr-decay-epoch 160,200 --lr-decay 0.1 --epochs 240:

~~~~ Summary metrics ~~~~
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.019
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.048
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.010
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.004
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.020
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.028
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.038
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.058
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.063
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.013
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.053
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.083

from gluon-cv.

Angzz avatar Angzz commented on August 28, 2024

Today I found that when training with python3 everything is OK, but not with python2. Can you have a try? @zhreshold

from gluon-cv.

zhreshold avatar zhreshold commented on August 28, 2024

@Angzz So now you remind me of this thread: #195
Yes, it is a known bug with division on python2.
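A minimal illustration of the pitfall (a sketch with hypothetical orig_width/new_width names, not the actual gluon-cv code):

# On python2, '/' between two ints is integer division, so scaling factors
# computed from image sizes silently truncate.
from __future__ import division         # makes '/' behave like python3;
                                        # must sit at the top of the module

orig_width, new_width = 500, 300
print(orig_width / new_width)           # 1.666...; without the import it would be 1
print(float(orig_width) / new_width)    # an explicit cast also works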

from gluon-cv.

Angzz avatar Angzz commented on August 28, 2024

@zhreshold OK, I will convert orig_height and orig_width to float and have a try, thanks!

from gluon-cv.

zhreshold avatar zhreshold commented on August 28, 2024

Closing this, let me know if it is still a problem.

from gluon-cv.

LeeRel1991 avatar LeeRel1991 commented on August 28, 2024

Hi all. Recently I have also encountered this problem (loss = nan). Specifically, when I train ssd512_vgg16_atrous on a GTX 1080 for face detection with batch size 8, both the SmoothL1 and CrossEntropy losses are always NaN. When I comment out the net.hybridize() line in the train and validate functions, the loss becomes normal and training succeeds. Finally, I varied the batch size from 8 to 16 and the lr from 0.001 to 0.00001 with net.hybridize() kept, on VOC2012 and a self-designed dataset, and found the following:
batch size >= 12: the GTX 1080 runs out of memory,
batch size = 10: training is normal,
batch size <= 8: the loss easily becomes NaN.

So the conclusion may be that a small batch size tends to produce a NaN loss when using net.hybridize(); a possible solution is to comment it out (see the sketch below) or to use a larger batch size if the GPU supports it. Alternatively, you can switch to ssd300_vgg16, for which a GTX 1080 also supports batch size >= 16.
Hope it helps.
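For reference, a minimal sketch of the workaround (the model name and input shape here are just assumptions matching this report; train_ssd.py builds the network differently):

# Hedged sketch: skip hybridization so the network runs imperatively
# (slower, but it reportedly avoided the NaN loss in this setup).
import mxnet as mx
from gluoncv import model_zoo

net = model_zoo.get_model('ssd_512_vgg16_atrous_voc', pretrained_base=True)
net.initialize()
# net.hybridize()                          # <- left commented out on purpose
out = net(mx.nd.zeros((8, 3, 512, 512)))   # batch size 8, as in the report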

from gluon-cv.

Feywell avatar Feywell commented on August 28, 2024

@zhreshold
I encountered the same problem.
How can I fix it?
environment:

centos 6.4
python 3.5
mxnet-cuda80 1.3.0
gluoncv 0.3.0
gpu tesla k20 * 2

command:
train_ssd.py --batch-size 32 --num-workers 4 --gpus 0,1 --log-interval 1 --epochs 20
result:

/home/liyang/anaconda2/envs/gluon/lib/python3.5/site-packages/mxnet/gluon/block.py:421: UserWarning: load_params is deprecated. Please use load_parameters.
warnings.warn("load_params is deprecated. Please use load_parameters.")
INFO:root:Namespace(batch_size=32, data_shape=300, dataset='voc', epochs=20, gpus='0,1', log_interval=1, lr=0.001, lr_decay=0.1, lr_decay_epoch='160,200', momentum=0.9, network='vgg16_atrous', num_workers=4, resume='', save_interval=10, save_prefix='ssd_300_vgg16_atrous_voc', seed=233, start_epoch=0, val_interval=1, wd=0.0005)
INFO:root:Start training from [Epoch 0]
INFO:root:[Epoch 0][Batch 0], Speed: 1.764 samples/sec, CrossEntropy=19.147, SmoothL1=3.899
INFO:root:[Epoch 0][Batch 1], Speed: 24.210 samples/sec, CrossEntropy=2911234189311495764698741583380480.000, SmoothL1=nan
INFO:root:[Epoch 0][Batch 2], Speed: 24.310 samples/sec, CrossEntropy=1940822792874330605875953106157568.000, SmoothL1=nan
INFO:root:[Epoch 0][Batch 3], Speed: 24.335 samples/sec, CrossEntropy=1455617094655747882349370791690240.000, SmoothL1=nan
INFO:root:[Epoch 0][Batch 4], Speed: 24.308 samples/sec, CrossEntropy=1164493675724598305879496633352192.000, SmoothL1=nan
INFO:root:[Epoch 0][Batch 5], Speed: 24.310 samples/sec, CrossEntropy=970411396437165302937976553078784.000, SmoothL1=nan
INFO:root:[Epoch 0][Batch 6], Speed: 24.105 samples/sec, CrossEntropy=831781196946141647056783309537280.000, SmoothL1=nan
INFO:root:[Epoch 0][Batch 7], Speed: 24.253 samples/sec, CrossEntropy=727808547327873941174685395845120.000, SmoothL1=nan
INFO:root:[Epoch 0][Batch 8], Speed: 23.193 samples/sec, CrossEntropy=646940930958110201958651035385856.000, SmoothL1=nan
INFO:root:[Epoch 0][Batch 9], Speed: 24.274 samples/sec, CrossEntropy=582246837862299152939748316676096.000, SmoothL1=nan
INFO:root:[Epoch 0][Batch 10], Speed: 24.294 samples/sec, CrossEntropy=529315307147544684490680287887360.000, SmoothL1=nan
INFO:root:[Epoch 0][Batch 11], Speed: 24.299 samples/sec,

PS:
batch size = 16 also does not work,
but batch size = 8 is OK.

from gluon-cv.

zhreshold avatar zhreshold commented on August 28, 2024

@Feywell Try reducing the lr slightly, with --lr 0.0005 for example.

from gluon-cv.

jacky4323 avatar jacky4323 commented on August 28, 2024

Hi,
I have tested different batch sizes. Why does a larger batch size give a NaN loss?
I also tested the three commands below; the 2nd and 3rd use the same batch size, yet only one of the two runs gets NaN.

1st: python3 train_ssd.py --batch-size 12 --num-workers 10 --gpus 0 --log-interval 1 --lr 0.0005
2nd: python3 train_ssd.py --batch-size 8 --num-workers 10 --gpus 0 --log-interval 1 --lr 0.0005
3rd: python3 train_ssd.py --batch-size 8 --num-workers 10 --gpus 0 --log-interval 1 --lr 0.0005

python3 train_ssd.py --batch-size 12 --num-workers 10 --gpus 0 --log-interval 1 --lr 0.0005

INFO:root:[Epoch 0][Batch 0], Speed: 2.745 samples/sec, CrossEntropy=19.230, SmoothL1=3.947
INFO:root:[Epoch 0][Batch 1], Speed: 10.406 samples/sec, CrossEntropy=nan, SmoothL1=nan
INFO:root:[Epoch 0][Batch 1], Speed: 10.406 samples/sec, CrossEntropy=nan, SmoothL1=nan
INFO:root:[Epoch 0][Batch 1], Speed: 10.406 samples/sec, CrossEntropy=nan, SmoothL1=nan

python3 train_ssd.py --batch-size 8 --num-workers 10 --gpus 0 --log-interval 1 --lr 0.0005

INFO:root:[Epoch 0][Batch 0], Speed: 2.362 samples/sec, CrossEntropy=18.623, SmoothL1=4.061
INFO:root:[Epoch 0][Batch 1], Speed: 10.102 samples/sec, CrossEntropy=nan, SmoothL1=nan
INFO:root:[Epoch 0][Batch 2], Speed: 10.230 samples/sec, CrossEntropy=nan, SmoothL1=nan
INFO:root:[Epoch 0][Batch 3], Speed: 10.214 samples/sec, CrossEntropy=nan, SmoothL1=nan
INFO:root:[Epoch 0][Batch 4], Speed: 10.209 samples/sec, CrossEntropy=nan, SmoothL1=nan
INFO:root:[Epoch 0][Batch 5], Speed: 10.219 samples/sec, CrossEntropy=nan, SmoothL1=nan
INFO:root:[Epoch 0][Batch 6], Speed: 10.218 samples/sec, CrossEntropy=nan, SmoothL1=nan
INFO:root:[Epoch 0][Batch 7], Speed: 10.201 samples/sec, CrossEntropy=nan, SmoothL1=nan

python3 train_ssd.py --batch-size 8 --num-workers 10 --gpus 0 --log-interval 1 --lr 0.0005

INFO:root:Start training from [Epoch 0]
INFO:root:[Epoch 0][Batch 0], Speed: 1.939 samples/sec, CrossEntropy=18.577, SmoothL1=3.817
INFO:root:[Epoch 0][Batch 1], Speed: 10.114 samples/sec, CrossEntropy=18.320, SmoothL1=3.906
INFO:root:[Epoch 0][Batch 2], Speed: 10.197 samples/sec, CrossEntropy=18.193, SmoothL1=4.051
INFO:root:[Epoch 0][Batch 3], Speed: 10.168 samples/sec, CrossEntropy=17.870, SmoothL1=3.842
INFO:root:[Epoch 0][Batch 4], Speed: 10.158 samples/sec, CrossEntropy=17.454, SmoothL1=4.076
INFO:root:[Epoch 0][Batch 5], Speed: 10.179 samples/sec, CrossEntropy=16.981, SmoothL1=3.938
INFO:root:[Epoch 0][Batch 6], Speed: 10.168 samples/sec, CrossEntropy=16.503, SmoothL1=3.988
INFO:root:[Epoch 0][Batch 7], Speed: 10.134 samples/sec, CrossEntropy=16.100, SmoothL1=3.868

from gluon-cv.

FishYuLi avatar FishYuLi commented on August 28, 2024

@Intellige Have you solved the problem? How?

Hello, I got a similar problem.
The environment: cuda8.0, mxnet1.5.0, python2.7 (I don't have sudo permission to update the GPU driver...)
I run exactly the same shell script as the provided one.
The losses are pretty large at first, larger than in the provided log, and the validation mAP is really poor. (Shown as follows)
[screenshot: ssd1]
The cross entropy loss converges normally, but the SmoothL1 does not converge, and the validation mAP is still very bad after 20 epochs. (Shown as follows)
[screenshot: ssd2]

I also tried python 3.6, but I got the same problem.
@zhreshold Any possible suggestions?

from gluon-cv.

Intellige avatar Intellige commented on August 28, 2024

Hi, sorry to answer the question so late.
When I encountered the problem, I tried changing the versions of MXNet and CUDA. Of course, I also tried some of the suggestions above.
However, I wasn't lucky. In the end, I reinstalled Ubuntu, CUDA, and so on, keeping the same versions as when I faced the problem. It really works now. I don't know why...

from gluon-cv.

FishYuLi avatar FishYuLi commented on August 28, 2024

@Intellige Thanks. It's really confusing...

from gluon-cv.

zhreshold avatar zhreshold commented on August 28, 2024

Please reduce the learning rate a little bit if you run into sudden NaN problems. I have some feedback saying that reducing the lr by half can solve the NaN problem in some cases.

from gluon-cv.

zhreshold avatar zhreshold commented on August 28, 2024

Just an update: the root cause has been found and the fix has been merged to master: apache/mxnet#14209

By using a master/nightly-built pip package, hopefully you won't meet the same problem any more.

from gluon-cv.
