Comments (2)
@zhoubay Yes, it looks like there is nowhere to set raise_error_at_min_scale to False if you grep DeepSpeed's source code.
Also, this is not an 'underflow' issue but an 'overflow' one. You could train in fp32, or use bf16 instead of fp16. It is also worth double-checking your code for a bug that causes the overflow.
from deepspeed.
Thank you for your reply! After switching training from fp16 to bf16, the error disappeared!
I'm closing this issue.
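For anyone landing here with the same error, here is a minimal sketch of the change discussed above, assuming the precision settings live in a DeepSpeed config dict. The helper name `use_bf16_instead_of_fp16` and the example values are made up for illustration; only the `fp16`, `bf16`, `enabled`, `loss_scale`, `min_loss_scale` and `train_micro_batch_size_per_gpu` keys are taken from DeepSpeed's config format. bf16 has the same exponent range as fp32, so the dynamic loss scaler that raised the error should no longer be involved, but bf16 does need hardware that supports it.

```python
import copy
import json


def use_bf16_instead_of_fp16(ds_config: dict) -> dict:
    """Return a copy of a DeepSpeed config dict with fp16 removed and bf16 enabled.

    Hypothetical helper for illustration; it is not part of DeepSpeed.
    """
    new_config = copy.deepcopy(ds_config)
    # Drop the fp16 section so its loss-scaling options no longer apply.
    new_config.pop("fp16", None)
    # Enable bf16 mixed precision instead.
    new_config["bf16"] = {"enabled": True}
    return new_config


# Example config with made-up values; "loss_scale": 0 selects dynamic loss scaling.
example_config = {
    "train_micro_batch_size_per_gpu": 4,
    "fp16": {"enabled": True, "loss_scale": 0, "min_loss_scale": 1},
}

bf16_config = use_bf16_instead_of_fp16(example_config)
print(json.dumps(bf16_config, indent=2))
```

The resulting dict can be written out as the JSON config file or passed wherever the original config was passed. If the overflow comes from a real bug in the model code, switching precision may only hide it, so checking the code is still worthwhile.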