
Comments (6)

swethmandava commented on May 10, 2024

When a NaN loss is encountered, the batch is skipped and loss scale is adjusted - global step is not incremented. You see this warning in that case but it is benign.
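The skip-and-rescale behaviour described above can be sketched in plain Python (a hedged illustration of dynamic loss scaling in general, not the repository's exact code; the growth interval of 2000 steps and the halving/doubling factors are assumed typical values):

```python
def training_step(grads_are_finite, state):
    """One optimizer step under dynamic loss scaling.

    state: dict with 'global_step', 'loss_scale', 'good_steps'.
    """
    if not grads_are_finite:           # NaN/Inf seen in the gradients
        state['loss_scale'] /= 2.0     # back off the loss scale
        state['good_steps'] = 0
        return state                   # batch skipped: global_step NOT incremented
    state['global_step'] += 1          # normal update path
    state['good_steps'] += 1
    if state['good_steps'] >= 2000:    # long run of clean steps: grow the scale
        state['loss_scale'] *= 2.0
        state['good_steps'] = 0
    return state
```

This is why the warning is benign: the skipped batch leaves `global_step` unchanged for one iteration, then training proceeds normally once a finite loss is seen.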

Can you quantify the speedup between fp16 and fp32? Can you also share your launch commands?

from deeplearningexamples.

JianJiao16 commented on May 10, 2024

Thanks swethmandava for the reply.

I also realized that NaN could cause the warning, so I added the following code in optimization.py, but it did not catch the NaN or print anything:

    if new_global_step == global_step:
        tf.logging.info("not all_are_finite")

How can I narrow down the problem? Why does it not happen with fp32?

The speedup on my side is roughly 3-4x. I made some changes to the interfaces; here are the flags related to speed: --use_fp16=True --horovod --use_xla=True

Thanks.


swethmandava commented on May 10, 2024

Please provide the entire launch command. Do you mean fp16 is 3-4 times faster than fp32? Or the other way round?

You will have to add training hooks or tf.cond; a Python if/else will not work in TensorFlow graph mode.
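To see why the Python if/else fails: in TF 1.x graph mode the if runs once at graph-construction time, when the condition is still a tensor object (always truthy), whereas tf.cond builds a graph node whose branch is chosen at run time. A minimal pure-Python simulation of the two behaviours (no TensorFlow needed; `Tensor` and `cond` here are hypothetical stand-ins for the real ops):

```python
class Tensor:
    """Stand-in for a graph tensor: holds a thunk, evaluated at 'session run'."""
    def __init__(self, fn):
        self.fn = fn
    def run(self):
        return self.fn()

def cond(pred, true_fn, false_fn):
    """Like tf.cond: builds a node that picks its branch at *run* time."""
    return Tensor(lambda: true_fn().run() if pred.run() else false_fn().run())

all_finite = Tensor(lambda: False)   # e.g. a NaN was found in the gradients

# Wrong: the Python `if` sees a Tensor object (always truthy) at build time.
step_with_if = Tensor(lambda: 1) if all_finite else Tensor(lambda: 0)

# Right: the condition is part of the graph and evaluated at run time.
step_with_cond = cond(all_finite, lambda: Tensor(lambda: 1),
                                  lambda: Tensor(lambda: 0))

print(step_with_if.run())    # 1 -- branched on object truthiness, wrongly
print(step_with_cond.run())  # 0 -- respected the runtime value
```

The same reasoning explains why the `if new_global_step == global_step:` check above never fires: the comparison happens once while the graph is being built, not on every step.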

Please refer to https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html for details about why loss scaling is required for fp16.
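The core reason fits in one runnable example: small gradients underflow to zero when cast to fp16, but survive if the loss (and hence the gradients) is multiplied by a scale first and unscaled in fp32 afterwards (a NumPy sketch; the scale of 1024 is just an example value):

```python
import numpy as np

grad = 1e-8                              # a tiny but real gradient value
naive = np.float16(grad)                 # underflows: below fp16's smallest subnormal
scale = 1024.0
scaled = np.float16(grad * scale)        # 1.024e-5 is representable in fp16
recovered = np.float32(scaled) / scale   # unscale in fp32 before the weight update

print(naive)      # 0.0 -- the gradient is lost without scaling
print(recovered)  # ~1e-8 -- preserved with loss scaling
```

fp32 does not need this because its exponent range is far wider, which is why the problem never shows up in fp32 runs.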


652994331 commented on May 10, 2024

@swethmandava @JianJiao16 Hi, I just noticed that I have the same problem. When I train on 4 GPUs on one machine with --train_batch_size=32, at step 0 it shows that the global step has not been increased, and training does not move on to step 1. But after I changed --train_batch_size to 8 (smaller than 32), it works. Could you please help me with this? Thanks.


swethmandava commented on May 10, 2024

By "not working", do you mean it is indefinitely stuck at step 1, or does it simply take longer to proceed from step 1? Have you tried adjusting the learning rate?


652994331 commented on May 10, 2024

@swethmandava Thanks so much for your reply. I think it just takes longer, but in any case it is solved now (though I am not sure why it works). The learning rate is what concerns me now; if it's convenient for you, please see my latest issue about the bad performance of BERT pretraining. For now I just keep the original learning rate from run_pretraining.sh (1e-4). I have 4 GPUs, but it seems the code already scales the learning rate for us when using Horovod. Still, for the problem in my latest issue I got a large loss and a pretrained model with bad performance; could the learning rate be the reason?
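For reference, the linear scaling rule that Horovod setups commonly apply (an assumption about what the script does here; verify against run_pretraining.sh before relying on it):

```python
def effective_lr(base_lr, num_workers):
    """Linear scaling rule: with N workers each step consumes N times the
    per-worker batch, so the learning rate is scaled by the worker count."""
    return base_lr * num_workers

print(effective_lr(1e-4, 4))  # 0.0004
```

If the script applies this automatically, the effective rate on 4 GPUs is already 4e-4, which is worth keeping in mind when comparing loss curves against single-GPU runs.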

