Comments (13)

zfs1993 commented on August 24, 2024

Can you share the hyperparameters you used? My inference loss doesn't converge.

from mobilefacenet_tf.

ltcs11 commented on August 24, 2024

When I train with the Refined MS1M dataset, the loss changes sharply whenever the epoch number changes.

Attached are my TensorBoard results:
[image]
[image]

From the plots above you can see the loss change.

This becomes a serious problem once the learning rate decreases, because it greatly increases the time needed for training to converge.

Have you ever run into this problem when training the net?
If so, how did you solve it?

Thanks a lot.

muyoucun commented on August 24, 2024

I have the same problem.

sirius-ai commented on August 24, 2024

@ltcs11 @muyoucun thanks for pointing out the bug. It is likely caused by dataset.shuffle. The reason is that (total samples) % (shuffle buffer size) = remainder; if the remainder is far smaller than the shuffle buffer size, dataset.shuffle will repeat it until it equals the buffer size, so the end of every epoch leads to overfitting and the loss increases when the next epoch begins. (See https://stackoverflow.com/questions/46928328/why-training-loss-is-increased-at-the-beginning-of-each-epoch)
A temporary workaround is to comment out dataset.shuffle if your dataset is already random enough, or to set the dataset.shuffle buffer size to len(datasets).

sirius-ai commented on August 24, 2024

https://github.com/sirius-ai/MobileFaceNet_TF/commit/604e36bf5d98d875a0d68ca2ef8b624973f91a0a
Closed!

zfs1993 commented on August 24, 2024

Did you just comment out dataset.shuffle? I used the dataset it provided, and I don't know whether it has already been shuffled.

ltcs11 commented on August 24, 2024

The provided dataset is ordered by label number.
I randomly split the single tfrecord file into many small files and used dataset.shuffle(size) with a size big enough to cover the largest of those files.

I just used the default hyperparameters, and I didn't reach the 99.2+ result either.
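
A minimal sketch of the random split described above, using plain Python lists in place of real tfrecord files. The helper name `split_into_shards`, the toy `(label, index)` records, and the shard count of 8 are all assumptions made for illustration; none of them come from the repository.

```python
import random

def split_into_shards(records, num_shards, seed=0):
    """Assign each record to a randomly chosen shard (hypothetical helper)."""
    rng = random.Random(seed)
    shards = [[] for _ in range(num_shards)]
    for record in records:
        shards[rng.randrange(num_shards)].append(record)
    return shards

# Toy stand-in for a label-ordered dataset: 100 labels, 10 records each.
records = [(label, i) for label in range(100) for i in range(10)]
shards = split_into_shards(records, num_shards=8)

# A shuffle buffer at least this large can hold any single shard in full.
buffer_size = max(len(shard) for shard in shards)
print(len(shards), buffer_size)
```

Because records are assigned at random, each shard mixes many labels even though the input is sorted by label, which is the property the split is meant to provide.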

muyoucun commented on August 24, 2024

I have 5822653 images and use a shuffle buffer size of 35504, so the remainder is 5822653 % 35504 = 35501.
According to @sirius-ai, it will only repeat 3 images at the end, since 35504 - 35501 = 3. If my understanding is right, the remainder is so close to the shuffle buffer size that it probably would not overfit.
But the loss still changed greatly when the first epoch ended. (BTW, at the end of the second and third epochs the loss did not increase much.)

Maybe I should split the single tfrecord file into many small files.

zfs1993 commented on August 24, 2024

I checked the number of records in tran.tfrecord (generated from the default datasets), and there are only 3804846 pictures in it (if I didn't make a mistake). I changed the buffer size to 8747 (3804846 % 8747 = 8648, which is 99 short of the buffer size), but the loss increase still occurs, so maybe it is not a good idea. Did you try the method ltcs11 described? I am going to try it.
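
The remainder arithmetic quoted in this thread can be checked with a short sketch. The function name `last_buffer_fill` is hypothetical; the sample counts and buffer sizes are the ones mentioned in the comments above.

```python
def last_buffer_fill(total_samples, buffer_size):
    """Return (remainder, shortfall) for the final partial buffer."""
    remainder = total_samples % buffer_size
    # How many samples the last fill is short of a full buffer.
    shortfall = (buffer_size - remainder) % buffer_size
    return remainder, shortfall

for total, buf in [(5822653, 35504), (3804846, 8747)]:
    remainder, shortfall = last_buffer_fill(total, buf)
    print(f"total={total} buffer={buf} remainder={remainder} shortfall={shortfall}")
```

In both configurations the last fill is only a few samples short of a full buffer (3 and 99 respectively), yet both posters still saw the loss jump, which suggests the remainder alone does not explain it.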

muyoucun commented on August 24, 2024

The insightface author has updated his datasets. You can download them from his GitHub.
As for this issue, I now believe it is entirely a problem with SHUFFLE (see https://stackoverflow.com/questions/46444018/meaning-of-buffer-size-in-dataset-map-dataset-prefetch-and-dataset-shuffle?noredirect=1&lq=1).

I tested it like this:

a = np.arange(1, 60)  # 1, 2, 3, ..., 59
...
dataset = dataset.shuffle(10)
dataset = dataset.batch(9)
...
el = iteration.get_next()

I printed el on every iteration. One example run looks like this:
[ 5 7 8 4 1 3 14 10 17]
[ 6 16 15 20 13 24 21 2 9]
[18 19 23 25 28 12 29 30 32]
[36 33 27 39 26 31 11 34 45]
[43 40 41 38 42 50 35 22 53]
[54 48 52 49 55 56 51 58 46]
[44 37 47 57]
You can see that the smaller numbers come out first.
Since the provided dataset is ordered by label number, the batches are not chosen randomly.
So I am now shuffling the data and generating the tfrecord file again.
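
The behaviour seen in this test can be reproduced without TensorFlow. Below is a rough pure-Python sketch of a fixed-size shuffle buffer, assuming the usual fill-then-sample-and-replace scheme; `buffered_shuffle` is a hypothetical stand-in, not the actual dataset.shuffle implementation.

```python
import random

def buffered_shuffle(items, buffer_size, seed=0):
    """Emit items through a fixed-size shuffle buffer (illustrative only)."""
    rng = random.Random(seed)
    source = iter(items)
    buffer = []
    # Fill the buffer with the first buffer_size items.
    for item in source:
        buffer.append(item)
        if len(buffer) == buffer_size:
            break
    output = []
    # Emit a random buffered item and put the next input item in its slot.
    for item in source:
        idx = rng.randrange(len(buffer))
        output.append(buffer[idx])
        buffer[idx] = item
    # Input exhausted: drain what is left in random order.
    rng.shuffle(buffer)
    output.extend(buffer)
    return output

out = buffered_shuffle(range(1, 60), buffer_size=10)
for start in range(0, len(out), 9):
    print(out[start:start + 9])
```

With this scheme the element emitted at position p (counting from 0) can never be larger than p + 10, so small values necessarily come out first, and a label-ordered file yields batches dominated by a narrow range of labels.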

zfs1993 commented on August 24, 2024

I followed your advice and found that if I comment out dataset.shuffle, an error occurs:
Traceback (most recent call last):
  File "train_nets4.py", line 262, in <module>
    eer = brentq(lambda x: 1. - x - interpolate.interp1d(fpr, tpr)(x), 0., 1.)
  File "/opt/app/anaconda3/lib/python3.6/site-packages/scipy/optimize/zeros.py", line 442, in brentq
    r = _zeros._brentq(f,a,b,xtol,rtol,maxiter,args,full_output,disp)
  File "train_nets4.py", line 262, in <lambda>
    eer = brentq(lambda x: 1. - x - interpolate.interp1d(fpr, tpr)(x), 0., 1.)
  File "/opt/app/anaconda3/lib/python3.6/site-packages/scipy/interpolate/polyint.py", line 79, in __call__
    y = self._evaluate(x)
  File "/opt/app/anaconda3/lib/python3.6/site-packages/scipy/interpolate/interpolate.py", line 610, in _evaluate
    below_bounds, above_bounds = self._check_bounds(x_new)
  File "/opt/app/anaconda3/lib/python3.6/site-packages/scipy/interpolate/interpolate.py", line 642, in _check_bounds
    raise ValueError("A value in x_new is above the interpolation "
ValueError: A value in x_new is above the interpolation range.
I don't know why this happened. Did you encounter this problem?
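
The ValueError says that brentq asked the interpolant for a point beyond the largest fpr value, which would happen if the ROC curve never reaches fpr = 1. A possible pre-check is sketched below, assuming fpr is a plain list of floats; `roc_spans_unit_interval` is a hypothetical helper, not part of train_nets4.py or SciPy, and the sample values are made up.

```python
import math

def roc_spans_unit_interval(fpr):
    """Return True if the FPR values are all non-NaN and cover [0, 1]."""
    if len(fpr) == 0 or any(math.isnan(x) for x in fpr):
        return False
    return min(fpr) <= 0.0 and max(fpr) >= 1.0

healthy_fpr = [0.0, 0.1, 0.4, 1.0]
degenerate_fpr = [0.0, 0.0, 0.0, 0.0]  # made-up case: curve never reaches 1

print(roc_spans_unit_interval(healthy_fpr))
print(roc_spans_unit_interval(degenerate_fpr))
```

Running a check like this before the brentq call would separate a degenerate ROC curve (for example from a model whose outputs have collapsed) from a genuine bug in the EER computation.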

sirius-ai commented on August 24, 2024

@muyoucun thank you!
@zfs1993 Don't use the pretrained model to initialize the weights; try retraining instead.

zfs1993 commented on August 24, 2024

I trained the model from scratch. The accuracy on LFW has reached 98.5% and the val is 95%, but the result on AgeDB is really bad, especially the val part, which is only about 30%. That is a huge difference.
