Training problem when epoch changing (mobilefacenet_tf, closed)

sirius-ai commented on August 24, 2024

Training problem when epoch changing

from mobilefacenet_tf.

Comments (13)

zfs1993 commented on August 24, 2024 1

No description provided.

Can you provide the hyperparameters you set? My inference loss doesn't converge.

ltcs11 commented on August 24, 2024

When I train with the Refined MS1M dataset as the training data,
the loss changes sharply whenever the epoch number changes.

Attached are my TensorBoard results.

From the above you can see the loss change.

This becomes a serious problem once the learning rate decreases, since it greatly increases the time needed for training to converge.

Have you ever met this problem when training the net?
If so, how did you solve it?

Thanks a lot.

muyoucun commented on August 24, 2024

I have the same problem.

sirius-ai commented on August 24, 2024

@ltcs11 @muyoucun thanks for pointing out the bug. It is likely caused by dataset.shuffle. The reason is that (total samples) % (shuffle buffer size) = remainder; if the remainder is far smaller than the shuffle buffer size, dataset.shuffle will repeat it until it equals the buffer size, so the end of every epoch leads to overfitting and increases the loss at the beginning of the next epoch (refer to https://stackoverflow.com/questions/46928328/why-training-loss-is-increased-at-the-beginning-of-each-epoch).
A temporary workaround is to comment out "dataset.shuffle" if your dataset is already random enough, or to set the dataset.shuffle buffer size to len(datasets).

sirius-ai commented on August 24, 2024

https://github.com/sirius-ai/MobileFaceNet_TF/commit/604e36bf5d98d875a0d68ca2ef8b624973f91a0a
Closed!

zfs1993 commented on August 24, 2024

Did you just comment out dataset.shuffle? I used the provided dataset, and I don't know whether it has already been shuffled.

ltcs11 commented on August 24, 2024

The provided dataset is ordered by label number.
I just randomly split the one tfrecord file into many small files and use dataset.shuffle(size) with a size big enough to cover the maximum length of those files.

I just use the default hyperparameters, and I didn't get the 99.2+ results either.
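
A minimal sketch of the random split described above, in plain Python. It is not ltcs11's actual script: the `split_into_shards` helper, the shard count, and the toy label list are all hypothetical, and integers stand in for serialized tfrecord examples (writing the files is not shown).

```python
import random


def split_into_shards(records, num_shards, seed=None):
    """Randomly assign every record to one of num_shards lists."""
    rng = random.Random(seed)
    shards = [[] for _ in range(num_shards)]
    for record in records:
        shards[rng.randrange(num_shards)].append(record)
    return shards


# Hypothetical stand-in for a label-ordered dataset: 5 identities, 200 images each.
labels = [label for label in range(5) for _ in range(200)]
shards = split_into_shards(labels, num_shards=4, seed=0)

# A shuffle buffer at least as long as the longest shard can hold a whole shard.
buffer_size = max(len(shard) for shard in shards)
print("shard sizes:", [len(shard) for shard in shards])
print("buffer size:", buffer_size)
```

Records inside each shard keep their original relative order here, which is why the buffer still has to cover a full shard.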

muyoucun commented on August 24, 2024

I have 5822653 images and I use a shuffle buffer size of 35504, so the remainder is 5822653 % 35504 = 35501.
According to @sirius-ai, it will only repeat 3 images at the end, since 35504 - 35501 = 3. If my understanding is right, the remainder is so close to the shuffle buffer size that it probably would not overfit.
But the loss changed greatly when the first epoch ended. (BTW, when the second or third epoch ends, the loss does not increase greatly.)

Maybe I should split the one tfrecord file into many small files.
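
The remainder arithmetic in this comment can be checked with a few lines of Python. The helper name `shuffle_remainder` is made up for illustration; it only evaluates the modulo from the explanation above and says nothing about what dataset.shuffle actually does internally.

```python
def shuffle_remainder(total_samples, buffer_size):
    """Return (remainder, shortfall) for a dataset size and a shuffle buffer size.

    remainder: total_samples % buffer_size
    shortfall: how many samples the last partial buffer lacks to be full
    """
    remainder = total_samples % buffer_size
    shortfall = (buffer_size - remainder) % buffer_size
    return remainder, shortfall


remainder, shortfall = shuffle_remainder(5822653, 35504)
print(remainder, shortfall)  # prints "35501 3"
```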

zfs1993 commented on August 24, 2024

I checked the number of records in tran.tfrecord (generated from the default datasets), and there are only 3804846 pictures in it (if I didn't make a mistake). I changed the buffer size to 8747 (3804846 % 8747 = 8648, which is only 99 short of a full buffer), but the loss increase still occurs, so maybe this is not a good idea. Did you try the method ltcs11 provided? I am going to try it.

muyoucun commented on August 24, 2024

The insightface author has updated his datasets. You can download them from his GitHub.
As for this issue, I now believe it is entirely a SHUFFLE problem (see https://stackoverflow.com/questions/46444018/meaning-of-buffer-size-in-dataset-map-dataset-prefetch-and-dataset-shuffle?noredirect=1&lq=1).

I tested it like this:

a = np.array([1,2,3.....,59])
...
dataset = dataset.shuffle(10)
dataset = dataset.batch(9)
...
el = iteration.get_next()

I printed el every time. One example looks like this:
[ 5 7 8 4 1 3 14 10 17]
[ 6 16 15 20 13 24 21 2 9]
[18 19 23 25 28 12 29 30 32]
[36 33 27 39 26 31 11 34 45]
[43 40 41 38 42 50 35 22 53]
[54 48 52 49 55 56 51 58 46]
[44 37 47 57]
You can see that the smaller numbers come out first.
Since the provided dataset is ordered by label number, the chosen batches are not random.
So now I am shuffling the data and generating the tfrecord file again.
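
The experiment above can be imitated without TensorFlow. The sketch below is a pure-Python stand-in that assumes the usual shuffle-buffer behaviour (fill the buffer, emit a random element, replace it with the next input element); `buffered_shuffle` and `batches` are hypothetical helpers, not the code that was run in this comment and not TensorFlow's implementation.

```python
import random


def buffered_shuffle(items, buffer_size, seed=None):
    """Yield items in the order a fixed-size shuffle buffer would emit them."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            # Still filling the buffer: nothing is emitted yet.
            buffer.append(item)
            continue
        # Buffer is full: emit a random slot and refill it with the new item.
        idx = rng.randrange(buffer_size)
        yield buffer[idx]
        buffer[idx] = item
    # Input exhausted: drain what is left in random order.
    rng.shuffle(buffer)
    yield from buffer


def batches(iterable, batch_size):
    """Group an iterable into lists of at most batch_size elements."""
    batch = []
    for value in iterable:
        batch.append(value)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch


data = list(range(1, 60))  # ordered input, like a label-sorted dataset
for batch in batches(buffered_shuffle(data, buffer_size=10, seed=0), 9):
    print(batch)
```

With a buffer of 10, the k-th emitted element (counting from 1) can never be larger than k + 9, so early batches only see the start of the ordered data. A buffer at least as large as the input makes the function reduce to a full shuffle.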

zfs1993 commented on August 24, 2024

I followed your advice and found that if I comment out dataset.shuffle, an error occurs:
Traceback (most recent call last):
  File "train_nets4.py", line 262, in <module>
    eer = brentq(lambda x: 1. - x - interpolate.interp1d(fpr, tpr)(x), 0., 1.)
  File "/opt/app/anaconda3/lib/python3.6/site-packages/scipy/optimize/zeros.py", line 442, in brentq
    r = _zeros._brentq(f,a,b,xtol,rtol,maxiter,args,full_output,disp)
  File "train_nets4.py", line 262, in <lambda>
    eer = brentq(lambda x: 1. - x - interpolate.interp1d(fpr, tpr)(x), 0., 1.)
  File "/opt/app/anaconda3/lib/python3.6/site-packages/scipy/interpolate/polyint.py", line 79, in __call__
    y = self._evaluate(x)
  File "/opt/app/anaconda3/lib/python3.6/site-packages/scipy/interpolate/interpolate.py", line 610, in _evaluate
    below_bounds, above_bounds = self._check_bounds(x_new)
  File "/opt/app/anaconda3/lib/python3.6/site-packages/scipy/interpolate/interpolate.py", line 642, in _check_bounds
    raise ValueError("A value in x_new is above the interpolation "
ValueError: A value in x_new is above the interpolation range.
I don't know why this happened. Did you encounter this problem?

sirius-ai commented on August 24, 2024

@muyoucun thank you!
@zfs1993 Don't use a pretrained model to initialize the weights; try retraining from scratch.

zfs1993 commented on August 24, 2024

I trained the model from scratch. The accuracy on LFW has reached 98.5% and the val is 95%, but the result on AgeDB is really bad, especially the val part, which is about 30%. That is a huge difference.
