Comments (6)
When a NaN loss is encountered, the batch is skipped and the loss scale is adjusted; the global step is not incremented. You will see this warning in that case, but it is benign.
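For context, here is a minimal sketch of the dynamic loss-scaling loop described above, in plain Python. This is illustrative only; the names `DynamicLossScaler`, `growth_interval`, and the concrete constants are assumptions for the sketch, not the repo's actual implementation:

```python
import math

class DynamicLossScaler:
    """Toy dynamic loss scaler: skip non-finite batches, adjust the scale."""

    def __init__(self, init_scale=2.0 ** 15, growth_interval=2000):
        self.scale = init_scale
        self.growth_interval = growth_interval
        self.good_steps = 0

    def step(self, scaled_grads, global_step):
        # Unscale the gradients, then check them for inf/NaN.
        grads = [g / self.scale for g in scaled_grads]
        if all(math.isfinite(g) for g in grads):
            # Healthy batch: apply the update and increment the step.
            self.good_steps += 1
            if self.good_steps >= self.growth_interval:
                self.scale *= 2.0  # try a larger scale again
                self.good_steps = 0
            return grads, global_step + 1
        # Overflow: skip the batch, halve the scale, do NOT bump the step.
        self.scale = max(self.scale / 2.0, 1.0)
        self.good_steps = 0
        return None, global_step
```

Because `global_step` is returned unchanged when an overflow is detected, anything that compares the step before and after the update sees no progress on that batch, which is exactly the warning in question.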
Can you quantify the speedup between fp16 and fp32? Can you also share your launch commands?
from deeplearningexamples.
Thanks @swethmandava for the reply.
I also realized that NaN could cause the warning, so I added the following code in optimization.py, but it did not catch the NaN or print anything:

    if new_global_step == global_step:
        tf.logging.info("not all_are_finite")

How could I narrow down the problem? And why does it not happen with fp32?
The speedup on my side is roughly 3-4x. I made some changes to the interfaces; here are the flags related to speed: --use_fp16=True --horovod --use_xla=True
Thanks.
Please provide the entire launch command. Do you mean fp16 is 3-4 times faster than fp32, or the other way round?
You will have to add training hooks or tf.cond; a Python if/else will not work in TensorFlow graph mode.
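To see why the Python `if` in the earlier snippet never fires: in TF1 graph mode that comparison runs once, at graph-construction time, against symbolic tensors rather than against their per-step values. A stand-in sketch of the same trap in plain Python (no TensorFlow required; `SymbolicTensor` is a hypothetical placeholder class for illustration):

```python
class SymbolicTensor:
    """Stand-in for a graph-mode tensor: a named placeholder with no value yet."""

    def __init__(self, name):
        self.name = name

def build_graph(log):
    new_global_step = SymbolicTensor("new_global_step")
    global_step = SymbolicTensor("global_step")
    # This comparison executes exactly once, while the "graph" is being
    # built. It compares the placeholder objects, not their per-step
    # runtime values, so the branch can never react to a NaN at train time.
    if new_global_step == global_step:  # distinct objects: always False
        log.append("not all_are_finite")

log = []
build_graph(log)
# Nothing was logged, regardless of what values flow through the graph later.
```

In graph mode the check has to live inside the graph itself (e.g. `tf.cond` or `tf.print` on the finiteness predicate) or be done from Python via a `SessionRunHook` that fetches the step value each iteration.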
Please refer to https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html for details on why loss scaling is required for fp16.
@swethmandava @JianJiao16 Hi, I just noticed that I have the same problem. When I train on 4 GPUs on one machine with --train_batch_size=32, at step 0 it shows that the global step has not been increased, and training never moves on to step 1. But after I changed --train_batch_size to 8 (smaller than 32), it works. Could you please help me out with this? Thanks.
By "not working", do you mean it is indefinitely stuck at step 1, or does it simply take longer to proceed from step 1? Have you tried adjusting the learning rate?
@swethmandava Thanks so much for your reply. I think it just takes longer, but in any case it is solved (though I am not sure why it works now). The learning rate is what concerns me now; if it is convenient for you, please see my latest issue about the bad performance of BERT pretraining. For now I just keep the original learning rate from run_pretraining.sh (1e-4). I have 4 GPUs, but it seems the code already scales it for us when we use Horovod. Still, in my latest issue I get a large loss and a pretrained model with bad performance; could the learning rate be the reason?
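On the learning-rate question: a common convention in data-parallel training (e.g. with Horovod) is the linear scaling rule, where the base learning rate is multiplied by the number of workers. Whether this repo applies it automatically is worth verifying in the code; the sketch below only illustrates the rule itself:

```python
def scaled_lr(base_lr, num_workers):
    """Linear scaling rule: effective LR grows with the number of workers."""
    return base_lr * num_workers

# With the script's default of 1e-4 on 4 GPUs, the effective rate would be 4e-4.
print(scaled_lr(1e-4, 4))
```

If the script already does this multiplication internally, setting a pre-scaled rate yourself would double-scale it, which is one plausible cause of a large loss.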