Comments (7)
I would like to work on this.
from deepspeed.
Update: I tried to reproduce this with a more lightweight example, using the test case added by @HeyangQin and simply adding model.gradient_checkpointing_enable() after loading the model with from_pretrained. However, the loss is consistent between zero3 and zeropp, and no hang was observed either (gist here). The model used is again llama-2-7b-hf; I changed the test case to use bf16 accordingly.
The logged loss appears zero at 2nd step (0-index) because the loss at that step is NaN (how it appeared zero: here).
By adding logs to check NaN in model parameters throughout the training loop:
def check_nan_parameters(model, stage, found_nan=[False]):
    # found_nan is a deliberately shared mutable default: it acts as a
    # one-shot flag so that only the first NaN occurrence is reported.
    torch.cuda.synchronize()
    for name, param in model.named_parameters():
        if torch.isnan(param).any() and not found_nan[0] and torch.distributed.get_rank() == 0:
            print(f"NaN detected in {param.ds_id} during {stage}, param stage {param.ds_status}")
            found_nan[0] = True
we found that one particular param contains NaN after the forward pass in the 1st step (the step before the one whose loss is NaN; i.e. at this point), which then kicks off a cascade that turns all layernorm weights to NaN after backprop.
Yet another interesting finding is that this only happens if I set the number of layers to be >= 9:
config = transformers.AutoConfig.from_pretrained(MODEL_NAME)
config.num_hidden_layers = 9
model = transformers.AutoModelForCausalLM.from_config(config)
With <= 8 layers the issue doesn't occur. It doesn't occur on smaller models such as GPT2 either. Gradient checkpointing is not related: with this 9-layer llama the issue can be observed without gradient checkpointing.
The particular parameter that has NaN values has ds_status = ZeroParamStatus.INFLIGHT at the time the NaN appears, while all others are in ZeroParamStatus.NOT_AVAILABLE. Ideally it should be NOT_AVAILABLE too, because once forward is completed the params are partitioned, their data is set to torch.empty(0), and the status is set to ZeroParamStatus.NOT_AVAILABLE.
This led us to suspect an out-of-sync issue, because at this stage, after forward is done and before backward is explicitly called, there should be no request to allgather a model parameter. A run with prefetch disabled (i.e. setting stage3_prefetch_bucket_size to 0) shows that the issue doesn't happen.
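For anyone who wants to repeat the prefetch-disabled run, here is a minimal sketch of patching a config dict. The helper name with_prefetch_disabled is made up for illustration; only the stage3_prefetch_bucket_size key comes from the discussion above.

```python
import copy


def with_prefetch_disabled(ds_config):
    """Return a copy of a DeepSpeed config dict with ZeRO-3 prefetch turned off.

    Hypothetical helper for illustration; it only sets the
    stage3_prefetch_bucket_size key mentioned in this thread to 0.
    """
    patched = copy.deepcopy(ds_config)  # leave the caller's dict untouched
    zero = patched.setdefault("zero_optimization", {})
    zero["stage3_prefetch_bucket_size"] = 0
    return patched


base = {"zero_optimization": {"stage": 3, "stage3_prefetch_bucket_size": 15099494.4}}
patched = with_prefetch_disabled(base)
print(patched["zero_optimization"]["stage3_prefetch_bucket_size"])
```

Copying first keeps the original config available, so the same script can run both variants back to back.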
I tried to reproduce with a much simplified training loop without HF trainer, but with it we can't observe the NaN loss problem:
import deepspeed.comm as dist
import deepspeed
from deepspeed.runtime.zero.config import DeepSpeedZeroConfig
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from torch.utils.data import DataLoader
import numpy as np
import pytest
import transformers
MODEL_NAME = "meta-llama/Llama-2-7b"
DS_CONFIG = {
    'bf16': {'enabled': True},
    'optimizer': {
        'type': 'AdamW',
        'params': {
            'lr': 2e-05,
            'betas': [0.9, 0.999],
            'eps': 1e-08,
            'weight_decay': 0.0
        }
    },
    'scheduler': {
        'type': 'WarmupLR',
        'params': {
            'warmup_min_lr': 0,
            'warmup_max_lr': 2e-05,
            'warmup_num_steps': 0
        }
    },
    'zero_optimization': {
        'stage': 3,
        'overlap_comm': True,
        'contiguous_gradients': True,
        'sub_group_size': 1e9,
        'reduce_bucket_size': 16777216,
        'stage3_prefetch_bucket_size': 15099494.4,
        'stage3_max_live_parameters': 1000000000.0,
        'stage3_max_reuse_distance': 1000000000.0,
        'stage3_gather_16bit_weights_on_model_save': True,
        'zero_hpz_partition_size': 4
    },
    'gradient_accumulation_steps': 1,
    'gradient_clipping': 1.0,
    'steps_per_print': float('inf'),
    'train_batch_size': 16,
    'train_micro_batch_size_per_gpu': 2,
    'wall_clock_breakdown': False,
}
def load_and_prepare_data(model_name):
    """Load model, tokenizer and dataset, and prepare data loader."""
    from datasets import load_dataset
    # Build a 9-layer model from the config (randomly initialized) and a tokenizer
    config = transformers.AutoConfig.from_pretrained(model_name)
    config.num_hidden_layers = 9
    model = transformers.AutoModelForCausalLM.from_config(config).to(torch.bfloat16)
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    tokenizer.pad_token = tokenizer.eos_token
    # Load and tokenize dataset, dropping empty lines
    dataset = load_dataset("wikitext", 'wikitext-103-raw-v1', split='train[:1%]').filter(lambda x: x['text'])

    def tokenize_function(examples):
        # Tokenize and ensure 'labels' are the same as 'input_ids'
        tokenized_output = tokenizer(examples["text"], padding="longest", truncation=True, return_tensors='pt', max_length=256)
        tokenized_output["labels"] = tokenized_output["input_ids"].clone()
        return tokenized_output

    tokenized_dataset = dataset.map(tokenize_function, batched=True).filter(lambda x: x['text'])
    tokenized_dataset.set_format('torch', columns=['input_ids', 'attention_mask', 'labels'])
    # Create data loader
    data_loader = DataLoader(tokenized_dataset, batch_size=2, shuffle=False)
    return model, data_loader
def get_loss(model, data_loader, config_dict, step=500):
    """Train the model and calculate average loss."""
    # Initialize DeepSpeed
    model, _, _, _ = deepspeed.initialize(model=model, model_parameters=model.parameters(), config=config_dict, dist_init_required=True)
    dist.barrier()
    model.train()
    # Training loop
    losses = []
    for n, batch in enumerate(data_loader):
        if n >= step:
            break
        batch = {k: v.to(model.device) for k, v in batch.items()}
        outputs = model(**batch)
        loss = outputs.loss
        if torch.distributed.get_rank() == 0:
            print(f"loss: {loss}")
        model.backward(loss)
        model.step()
        losses.append(loss.item())
    return np.nanmean(losses[-100:])
if __name__ == "__main__":
    torch.manual_seed(0)
    model, data_loader = load_and_prepare_data(MODEL_NAME)
    zeropp_loss = get_loss(model, data_loader, DS_CONFIG)
(This DS config is an exact match of what HF/accelerate renders after filling in all the "auto" fields; I got it by printing the DS config right before accelerate passes it to deepspeed.initialize.)
I then modified the HF trainer training loop by pretty much nuking everything after initialization and replacing it with the same training loop as in the script above (patched trainer.py in HF trainer, based on transformers==4.37.2: https://gist.github.com/yundai424/d785089f55684c3fce48434d9e727c1a#file-trainer-py-L1739). With this, the issue appears as well.
More findings:
So far we know this is related to prefetch, and here we want to understand why. With logging enabled for prefetch, release and wait events, we can see that the particular param (ds_id = 39) has its second prefetch (the one for backprop) issued almost immediately after it is released following its forward:
...
-prefetch: {'id': 79, 'status': 'NOT_AVAILABLE', 'requires_grad': True, 'persist': False}
-release: {'id': 38, 'status': 'AVAILABLE', 'requires_grad': True, 'persist': False}
-wait: {'id': 39, 'status': 'INFLIGHT', 'requires_grad': True, 'persist': False}
-prefetch: {'id': 83, 'status': 'NOT_AVAILABLE', 'requires_grad': True, 'persist': False}
-release: {'id': 39, 'status': 'AVAILABLE', 'requires_grad': True, 'persist': False}
-prefetch: {'id': 39, 'status': 'NOT_AVAILABLE', 'requires_grad': True, 'persist': False}
-wait: {'id': 40, 'status': 'INFLIGHT', 'requires_grad': True, 'persist': False}
...
i.e. the execution order is:
prefetch(39) -> [ some time spent on forward pass for previous modules] -> wait(39) -> forward on 39 -> release(39) -> [very short time, even immediately!] -> prefetch(39)
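This pattern can also be searched for automatically in a longer trace. The sketch below assumes the event lines look exactly like the excerpt above; find_immediate_refetch is a hypothetical helper name, not something that exists in DeepSpeed.

```python
import ast


def find_immediate_refetch(trace_lines):
    """Return ids whose release is directly followed by a prefetch of the same id."""
    events = []
    for line in trace_lines:
        line = line.strip()
        if not line.startswith("-") or ": " not in line:
            continue  # skip "..." and anything that is not an event line
        kind, payload = line[1:].split(": ", 1)
        events.append((kind, ast.literal_eval(payload)["id"]))
    # Compare every event with the one that follows it.
    return [
        cur_id
        for (kind, cur_id), (next_kind, next_id) in zip(events, events[1:])
        if kind == "release" and next_kind == "prefetch" and cur_id == next_id
    ]


trace = """
-prefetch: {'id': 79, 'status': 'NOT_AVAILABLE', 'requires_grad': True, 'persist': False}
-release: {'id': 38, 'status': 'AVAILABLE', 'requires_grad': True, 'persist': False}
-wait: {'id': 39, 'status': 'INFLIGHT', 'requires_grad': True, 'persist': False}
-prefetch: {'id': 83, 'status': 'NOT_AVAILABLE', 'requires_grad': True, 'persist': False}
-release: {'id': 39, 'status': 'AVAILABLE', 'requires_grad': True, 'persist': False}
-prefetch: {'id': 39, 'status': 'NOT_AVAILABLE', 'requires_grad': True, 'persist': False}
-wait: {'id': 40, 'status': 'INFLIGHT', 'requires_grad': True, 'persist': False}
""".splitlines()

suspects = find_immediate_refetch(trace)
print(suspects)  # [39]
```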
This makes us wonder whether it's because partitioning the secondary tensor is not a blocking operation. Our hypothesis is that the secondary tensor is first initialized with empty and later filled by a copy. However, before the copy finishes, the next prefetch on it has already been kicked off, and that prefetch uses the secondary tensor now that it's not None. Since the copy may not have finished yet, the allgather may run on arbitrary data (it's just torch.empty), thus leading to the trouble.
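The hypothesized race can be shown with a toy model in plain Python. This is not DeepSpeed code: a background thread stands in for the asynchronous copy, a list of NaNs stands in for the uninitialized torch.empty buffer (NaN is an arbitrary choice of garbage value), and every name here is made up for illustration.

```python
import math
import threading


def gather_from_secondary(sync_before_gather):
    """Toy model: read a buffer that a background copy may not have filled yet."""
    source = [1.0, 2.0, 3.0, 4.0]
    secondary = [float("nan")] * len(source)  # stands in for torch.empty contents
    copy_may_finish = threading.Event()       # models the copy still being in flight

    def async_copy():
        copy_may_finish.wait()
        secondary[:] = source

    worker = threading.Thread(target=async_copy)
    worker.start()

    if sync_before_gather:
        # Analogue of a synchronize call: let the copy finish, then wait for it.
        copy_may_finish.set()
        worker.join()

    gathered = list(secondary)  # the "allgather" reads whatever is there right now

    copy_may_finish.set()       # let the copy complete in the unsynchronized case
    worker.join()
    return gathered


racy = gather_from_secondary(sync_before_gather=False)
safe = gather_from_secondary(sync_before_gather=True)
print(all(math.isnan(x) for x in racy), safe)
```

Without the synchronization step the reader sees only placeholder values. With it, the reader sees the copied data. This is the behavior the hypothesis predicts for the secondary tensor.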
To validate this hypothesis I added get_accelerator().synchronize() to the end of _partition_param_sec (i.e. here), and the issue no longer shows up.
As @ByronHsu suggested, putting it right before launching the allgather kernel works as well, and in theory that can be more performant.
However, when I tried to further validate this by initializing the secondary tensor with torch.randn instead of torch.empty (and removing the synchronize added above), the issue persists. We may need to take an even closer look at what's going wrong between these async ops.
It's not yet clear when the actual allgather kernel executes (the "-prefetch" print statement fires when the kernel gets enqueued, but it may take time before it's executed), so we need to further pinpoint exactly which operation should have been synchronized.
(At this point I believe we can safely rule the HuggingFace trainer out of scope too.)
- The issue can be reproduced even if we initialize the secondary tensor with torch.rand, which seems to break our hypothesis.
- We added a check on the secondary tensor before the allgather, and observed that param_ds_tensor does contain NaN:
if torch.isnan(param_ds_tensor).any().item():
    print("param ds tensor contains nan!!!")
    exit()
handles = _dist_allgather_fn(
    param_ds_tensor.to(get_accelerator().current_device_name()),
    param_buffer,
    ds_process_group,
)
param ds tensor contains nan!!!
- We added a second NaN check inside the if branch, after a sleep, to see whether this could be an async problem:
if torch.isnan(param_ds_tensor).any().item():
    import time
    time.sleep(2)
    print(torch.isnan(param_ds_tensor).any().item())
    print("param ds tensor contains nan!!!")
    exit()
param ds tensor contains nan!!!
False
This means the ds tensor contains NaN at first, but after 2 seconds the NaN is gone!
- We removed the above two checks and added a check right after the secondary partitioning. In this case, the NaN doesn't happen (what???)
if torch.isnan(param.ds_secondary_tensor).any().item():
    print("param ds tensor contains nan after partition!!!")
    exit()
print_rank_0(f"{param.ds_id} partitioned type {param.dtype} dev {param.device} shape {param.shape}",
             force=False)
- copy_ is non-blocking if it is d2d.
- Instead of injecting the NaN check in DeepSpeed code, I moved the check to the HuggingFace trainer. I added checks in multiple places: "before forward", "after forward", "before backward", "after backward", "before step", and "after step". I observed that the NaN appears at "after backward", which means that during backward the weights get mutated to NaN, thus causing NaN in the grads. Technically, backward shouldn't change weights at all, so it is likely that the collective ops messed up the weights.
- With torch.cuda.synchronize() added before the allgather, if we set stage3_param_persistence_threshold to zero, we can still observe the NaN issue.
- The NaN issue is not reproducible with a ~7B linear model.
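The stage-by-stage checking described in the list above can be sketched as a small tracker that remembers the first checkpoint at which a NaN shows up. This is a plain-Python illustration with made-up names (FirstNaNStage, a two-element weights list), not the actual patch applied to the trainer.

```python
import math


class FirstNaNStage:
    """Remember the first named checkpoint at which any watched value is NaN."""

    def __init__(self):
        self.first_stage = None

    def check(self, stage, values):
        # Only the first offending stage is recorded; later checks are no-ops.
        if self.first_stage is None and any(math.isnan(v) for v in values):
            self.first_stage = stage
        return self.first_stage


tracker = FirstNaNStage()
weights = [0.1, 0.2]  # stand-in for model weights

tracker.check("before forward", weights)
tracker.check("after forward", weights)
tracker.check("before backward", weights)
weights[1] = float("nan")  # simulated corruption happening during backward
tracker.check("after backward", weights)
tracker.check("before step", weights)
tracker.check("after step", weights)
print(tracker.first_stage)
```

Recording only the first hit matters because once one weight is NaN, every later stage reports NaN too and hides where the corruption started.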
Some questions that bother us, and our hypotheses:
Question: Why does adding the check "after 2nd partitioning" prevent the NaN issue, while adding the check "before all gather" leaves the issue present?
Hypothesis:
"torch.isnan(tensor).any().item()" essentially copies data from device to host and thus serves as a sync point. However, "after 2nd partitioning" runs on the forward thread, while "before all gather" runs on the backward thread. If we put the check on the same thread where the copy happens, it will sync due to .item(). If we put the check on a different thread, .item() does not wait for the copy.
Question: Why doesn't initializing the secondary tensors with torch.randn fix the issue?
Hypothesis 1:
[copying ]
[all gather ]
Even if we initialize the secondary tensors as non-NaN, probing the values of the tensor while the copy is in progress may return some NaN values.
Hypothesis 2:
The all gather doesn't get any NaN; instead it gets a randn value, which is then used for the optimizer step, causing NaN in the updated weights.