
Comments (6)

swethmandava commented on May 10, 2024

When a NaN loss is encountered, the batch is skipped and loss scale is adjusted - global step is not incremented. You see this warning in that case but it is benign.
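The skip-and-rescale behaviour described above can be sketched in plain Python (a hedged illustration of dynamic loss scaling in general, not the repository's exact code; the growth interval of 2000 steps and the halving/doubling factors are assumed typical values):

```python
def training_step(grads_are_finite, state):
    """One optimizer step under dynamic loss scaling.

    state: dict with 'global_step', 'loss_scale', 'good_steps'.
    """
    if not grads_are_finite:           # NaN/Inf seen in the gradients
        state['loss_scale'] /= 2.0     # back off the loss scale
        state['good_steps'] = 0
        return state                   # batch skipped: global_step NOT incremented
    state['global_step'] += 1          # normal update path
    state['good_steps'] += 1
    if state['good_steps'] >= 2000:    # long run of clean steps: grow the scale
        state['loss_scale'] *= 2.0
        state['good_steps'] = 0
    return state
```

This is why the warning is benign: the skipped batch leaves `global_step` unchanged for one iteration, then training proceeds normally once a finite loss is seen.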

Can you quantify the speedup between fp16 and fp32? Can you also share your launch commands?

from deeplearningexamples.

JianJiao16 commented on May 10, 2024

Thanks swethmandava for the reply.

I also realized that NaN could cause the warning, so I added the following code in optimization.py, but it did not catch the NaN or print anything:

    if new_global_step == global_step:
        tf.logging.info("not all_are_finite")

How can I narrow down the problem? Why does it not happen with fp32?

The speedup on my side is roughly 3-4x. I made some changes to the interfaces; here are the flags related to speed: --use_fp16=True --horovod --use_xla=True

Thanks.


swethmandava commented on May 10, 2024

Please provide the entire launch command. Do you mean fp16 is 3-4 times faster than fp32? Or the other way round?

You will have to add training hooks or tf.cond; a Python if/else will not work in TensorFlow graph mode.
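To see why the Python if/else fails: in TF 1.x graph mode the if runs once at graph-construction time, when the condition is still a tensor object (always truthy), whereas tf.cond builds a graph node whose branch is chosen at run time. A minimal pure-Python simulation of the two behaviours (no TensorFlow needed; `Tensor` and `cond` here are hypothetical stand-ins for the real ops):

```python
class Tensor:
    """Stand-in for a graph tensor: holds a thunk, evaluated at 'session run'."""
    def __init__(self, fn):
        self.fn = fn
    def run(self):
        return self.fn()

def cond(pred, true_fn, false_fn):
    """Like tf.cond: builds a node that picks its branch at *run* time."""
    return Tensor(lambda: true_fn().run() if pred.run() else false_fn().run())

all_finite = Tensor(lambda: False)   # e.g. a NaN was found in the gradients

# Wrong: the Python `if` sees a Tensor object (always truthy) at build time.
step_with_if = Tensor(lambda: 1) if all_finite else Tensor(lambda: 0)

# Right: the condition is part of the graph and evaluated at run time.
step_with_cond = cond(all_finite, lambda: Tensor(lambda: 1),
                                  lambda: Tensor(lambda: 0))

print(step_with_if.run())    # 1 -- branched on object truthiness, wrongly
print(step_with_cond.run())  # 0 -- respected the runtime value
```

The same reasoning explains why the `if new_global_step == global_step:` check above never fires: the comparison happens once while the graph is being built, not on every step.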

Please refer to https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html for details about why loss scaling is required for fp16.
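The core reason fits in one runnable example: small gradients underflow to zero when cast to fp16, but survive if the loss (and hence the gradients) is multiplied by a scale first and unscaled in fp32 afterwards (a NumPy sketch; the scale of 1024 is just an example value):

```python
import numpy as np

grad = 1e-8                              # a tiny but real gradient value
naive = np.float16(grad)                 # underflows: below fp16's smallest subnormal
scale = 1024.0
scaled = np.float16(grad * scale)        # 1.024e-5 is representable in fp16
recovered = np.float32(scaled) / scale   # unscale in fp32 before the weight update

print(naive)      # 0.0 -- the gradient is lost without scaling
print(recovered)  # ~1e-8 -- preserved with loss scaling
```

fp32 does not need this because its exponent range is far wider, which is why the problem never shows up in fp32 runs.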


652994331 commented on May 10, 2024

@swethmandava @JianJiao16 Hi, I just noticed that I have the same problem. When I train on 4 GPUs on one machine with --train_batch_size=32, at step 0 it shows that the global step has not been increased, and training does not move on to step 1. But after I changed --train_batch_size to 8 (smaller than 32), it works. Could you please help me with this? Thanks.


swethmandava commented on May 10, 2024

By "not working", do you mean it is indefinitely stuck at step 1, or does it simply take longer to proceed from step 1? Have you tried adjusting the learning rate?


652994331 commented on May 10, 2024

@swethmandava Thanks so much for your reply. I think it just takes longer, but in any case it is solved now (though I am not sure why it works). The learning rate is what concerns me now; if it's convenient for you, please see my latest issue about the bad performance of BERT pretraining. For now I just keep the original learning rate from run_pretraining.sh (1e-4). I have 4 GPUs, but it seems the code already scales the learning rate for us when using Horovod. Still, for the problem in my latest issue I got a large loss and a pretrained model with bad performance; could the learning rate be the reason?
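For reference, the linear scaling rule that Horovod setups commonly apply (an assumption about what the script does here; verify against run_pretraining.sh before relying on it):

```python
def effective_lr(base_lr, num_workers):
    """Linear scaling rule: with N workers each step consumes N times the
    per-worker batch, so the learning rate is scaled by the worker count."""
    return base_lr * num_workers

print(effective_lr(1e-4, 4))  # 0.0004
```

If the script applies this automatically, the effective rate on 4 GPUs is already 4e-4, which is worth keeping in mind when comparing loss curves against single-GPU runs.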

