Comments (7)

ighoshsubho commented on May 18, 2024

I would like to work on this.

yundai424 commented on May 18, 2024

Update: I was trying to reproduce with a more lightweight example using this test case added by @HeyangQin, simply adding model.gradient_checkpointing_enable() after loading the model with from_pretrained, but the loss is consistent between zero3 and zeropp and no hanging is observed either (gist here) 😓 (the model used here is again llama-2-7b-hf; I changed the test case to use bf16 accordingly too). A minimal sketch of the change is below.
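
For concreteness, here is a minimal sketch of that reproduction attempt (the model name and bf16 dtype are assumed from the description above; the actual test case lives in the linked gist):

import torch
import transformers

# Load the model as in the original test case, then enable gradient checkpointing,
# which is the only change being tested here.
model = transformers.AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", torch_dtype=torch.bfloat16)
model.gradient_checkpointing_enable()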

yundai424 commented on May 18, 2024

The logged loss appears as zero at the 2nd step (0-indexed) because the loss at that step is NaN (how it ends up displayed as zero: here).

By adding logs to check for NaN in the model parameters throughout the training loop:

def check_nan_parameters(model, stage, found_nan=[False]):
    # The mutable default keeps state across calls so only the first NaN is reported.
    torch.cuda.synchronize()
    for i, (name, param) in enumerate(model.named_parameters()):
        if torch.isnan(param).any() and not found_nan[0] and torch.distributed.get_rank() == 0:
            # ds_id / ds_status are the attributes DeepSpeed ZeRO-3 attaches to each parameter.
            print(f"NaN detected in {param.ds_id} during {stage}, param status {param.ds_status}")
            found_nan[0] = True

we found that it's related to one particular param having NaN after the forward pass of the 1st step (the step before the one whose loss is NaN), i.e. at this point, which then kicks off a cascade that turns all the layernorm weights to NaN after a backprop.

Yet another interesting finding is that this only happens if I set the number of layers to be >= 9:

config = transformers.AutoConfig.from_pretrained(MODEL_NAME)
config.num_hidden_layers = 9
model = transformers.AutoModelForCausalLM.from_config(config)

With <= 8 layers the issue doesn't occur. It doesn't show up on smaller models such as GPT-2 either. Gradient checkpointing is unrelated: with this 9-layer llama the issue can be observed without gradient checkpointing.

The particular parameter carrying the NaN value has ds_status = ZeroParamStatus.INFLIGHT at the time the NaN appears, while all the others are in ZeroParamStatus.NOT_AVAILABLE. Ideally it should be NOT_AVAILABLE, because once forward is completed the params are partitioned, their data is set to torch.empty(0), and the status should have been set to ZeroParamStatus.NOT_AVAILABLE:

param.ds_status = ZeroParamStatus.NOT_AVAILABLE

This made us suspect some out-of-sync issue, because at this stage, after forward is done and before backward is explicitly called, there should be no request to all-gather a model parameter. A run with prefetch disabled (i.e. setting stage3_prefetch_bucket_size to 0) shows that the issue doesn't happen; a sketch of that configuration follows.
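
For reference, a minimal sketch of the prefetch-disabled configuration (the keys are the standard DeepSpeed ZeRO-3 / ZeRO++ config fields also used in the full script further below):

ds_config_no_prefetch = {
    "zero_optimization": {
        "stage": 3,
        "stage3_prefetch_bucket_size": 0,  # disable parameter prefetching
        "zero_hpz_partition_size": 4,      # keep ZeRO++ hpZ (secondary partitioning) enabled
        # ... remaining fields unchanged from the full config ...
    },
}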

yundai424 commented on May 18, 2024

I was trying to reproduce with a much simplified training loop without the HF Trainer, but we can't observe the NaN loss problem:

import deepspeed.comm as dist
import deepspeed
from deepspeed.runtime.zero.config import DeepSpeedZeroConfig
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from torch.utils.data import DataLoader
import numpy as np
import pytest
import transformers


MODEL_NAME = "meta-llama/Llama-2-7b-hf"

DS_CONFIG = {
    'bf16': {'enabled': True}, 
    'optimizer': {
        'type': 'AdamW', 
        'params': {
            'lr': 2e-05, 
            'betas': [0.9, 0.999], 
            'eps': 1e-08, 
            'weight_decay': 0.0
        }
    }, 
    'scheduler': {
        'type': 'WarmupLR', 
        'params': {
            'warmup_min_lr': 0, 
            'warmup_max_lr': 2e-05, 
            'warmup_num_steps': 0
        }
    }, 
    'zero_optimization': {
        'stage': 3, 
        'overlap_comm': True, 
        'contiguous_gradients': True,
        'sub_group_size': 1e9,
        'reduce_bucket_size': 16777216,
        'stage3_prefetch_bucket_size': 15099494.4,
        'stage3_max_live_parameters': 1000000000.0,
        'stage3_max_reuse_distance': 1000000000.0,
        'stage3_gather_16bit_weights_on_model_save': True, 
        'zero_hpz_partition_size': 4
    },
    'gradient_accumulation_steps': 1, 
    'gradient_clipping': 1.0, 
    'steps_per_print': float('inf'), 
    'train_batch_size': 16, 
    'train_micro_batch_size_per_gpu': 2, 
    'wall_clock_breakdown': False, 
}

def load_and_prepare_data(model_name):
    """Load model, tokenizer and dataset, and prepare data loader."""
    from datasets import load_dataset

    # Load model and tokenizer
    config = transformers.AutoConfig.from_pretrained(model_name)
    config.num_hidden_layers = 9
    model = transformers.AutoModelForCausalLM.from_config(config).to(torch.bfloat16)
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    tokenizer.pad_token = tokenizer.eos_token

    # Load and tokenize dataset
    dataset = load_dataset("wikitext", 'wikitext-103-raw-v1', split='train[:1%]').filter(lambda x: x['text'])

    def tokenize_function(examples):
        # Tokenize and ensure 'labels' are the same as 'input_ids'
        tokenized_output = tokenizer(examples["text"], padding="longest", truncation=True, return_tensors='pt', max_length=256)
        tokenized_output["labels"] = tokenized_output["input_ids"].clone()
        return tokenized_output

    tokenized_dataset = dataset.map(tokenize_function, batched=True).filter(lambda x: x['text'])
    tokenized_dataset.set_format('torch', columns=['input_ids', 'attention_mask', 'labels'])

    # Create data loader
    data_loader = DataLoader(tokenized_dataset, batch_size=2, shuffle=False)
    return model, data_loader

def get_loss(model, data_loader, config_dict, step=500):
    """Train the model and calculate average loss."""
    # Initialize DeepSpeed
    model, _, _, _ = deepspeed.initialize(model=model, model_parameters=model.parameters(), config=config_dict, dist_init_required=True)
    dist.barrier()
    model.train()

    # Training loop
    losses = []
    for n, batch in enumerate(data_loader):
        if n >= step:
            break
        batch = {k: v.to(model.device) for k, v in batch.items()}
        outputs = model(**batch)
        loss = outputs.loss
        if torch.distributed.get_rank() == 0:
            print(f"loss: {loss}")
        model.backward(loss)
        model.step()
        losses.append(loss.item())

    return np.nanmean(losses[-100:])

if __name__ == "__main__":
    torch.manual_seed(0)
    model, data_loader = load_and_prepare_data(MODEL_NAME)
    zeropp_loss = get_loss(model, data_loader, DS_CONFIG)

(This DS config is an exact match of what HF/accelerate renders after filling in all the "auto" fields -- obtained by printing out the DS config right before accelerate passes it to deepspeed.initialize.)

I then tried modifying the HF Trainer training loop by pretty much nuking everything after initialization and replacing it with the same training loop as in the script above (patched trainer.py in the HF Trainer, based on transformers==4.37.2: https://gist.github.com/yundai424/d785089f55684c3fce48434d9e727c1a#file-trainer-py-L1739), and the issue appears as well.

yundai424 commented on May 18, 2024

More findings:

So far we know this is related to prefetch, and here we want to understand why. With logging enabled for prefetch, release and wait events, it can be seen that the particular param (ds_id = 39) gets its second prefetch (the one for backprop) almost immediately after it is released following the forward pass on it:

...
-prefetch: {'id': 79, 'status': 'NOT_AVAILABLE', 'requires_grad': True, 'persist': False}
-release: {'id': 38, 'status': 'AVAILABLE', 'requires_grad': True, 'persist': False}
-wait: {'id': 39, 'status': 'INFLIGHT', 'requires_grad': True, 'persist': False}
-prefetch: {'id': 83, 'status': 'NOT_AVAILABLE', 'requires_grad': True, 'persist': False}
-release: {'id': 39, 'status': 'AVAILABLE', 'requires_grad': True, 'persist': False}
-prefetch: {'id': 39, 'status': 'NOT_AVAILABLE', 'requires_grad': True, 'persist': False}
-wait: {'id': 40, 'status': 'INFLIGHT', 'requires_grad': True, 'persist': False}
...

i.e. the execution order is:

prefetch(39) -> [ some time spent on forward pass for previous modules] -> wait(39) -> forward on 39 -> release(39) -> [very short time, even immediately!] -> prefetch(39)

This makes us wonder whether it's because partitioning the secondary tensor is not a blocking operation. Our hypothesis: the secondary tensor is first initialized with torch.empty, and a copy is issued later to fill it. But before that copy finishes, the next prefetch on the param has already been kicked off, and it uses the secondary tensor now that it is not None. Since the copy may not have completed yet, the all-gather can operate on a piece of arbitrary data (it's just torch.empty), leading to the trouble. A toy sketch of the suspected race is shown below.
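
To illustrate the suspected race, here is a toy PyTorch sketch (not DeepSpeed code): an async device-to-device copy fills a buffer on one stream while another stream reads it without waiting, analogous to the prefetch all-gather reading ds_secondary_tensor before the partitioning copy has landed.

import torch

src = torch.full((1 << 24,), 3.0, device="cuda")
dst = torch.empty_like(src)            # stands in for the freshly allocated secondary tensor

copy_stream = torch.cuda.Stream()
read_stream = torch.cuda.Stream()

with torch.cuda.stream(copy_stream):
    dst.copy_(src, non_blocking=True)  # async d2d copy; the host call returns immediately

with torch.cuda.stream(read_stream):
    # Missing read_stream.wait_stream(copy_stream): this read may observe
    # uninitialized memory, just like an all-gather launched too early.
    snapshot = dst.clone()

torch.cuda.synchronize()
print(torch.equal(snapshot, src))      # may print False when the race is hit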

To validate this hypothesis, I added get_accelerator().synchronize() at the end of _partition_param_sec (i.e. here), and the issue doesn't pop up again.
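
Roughly, that experiment looks like the following sketch (not the actual DeepSpeed source; the method body is elided and the signature approximated):

from deepspeed.accelerator import get_accelerator

def _partition_param_sec(self, param, buffer=None, has_been_updated=False):
    ...  # existing logic: allocate param.ds_secondary_tensor and copy_ the local shard into it
    get_accelerator().synchronize()  # added for the experiment: block until the pending copy has finished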

As @ByronHsu suggested, putting the synchronization right before launching the all-gather kernel works as well, and that should theoretically be more performant.

However, when I tried to further validate this by initializing the secondary tensor with torch.randn instead of torch.empty (and removing the cuda synchronize added above), the issue persisted. We may need to take an even closer look at what is going wrong between these async ops.

It's not clear yet when the actual all-gather kernel happens (the "-prefetch" print statement fires when the kernel gets enqueued, but it may take time before it is executed), so we need to further pinpoint exactly which operation should have been synchronized.

(At this point I believe we can safely rule the HuggingFace Trainer out of scope too.)

ByronHsu commented on May 18, 2024
  1. The issue can be reproduced even if we initialize the 2nd tensor as torch.rand, which seems to break our hypothesis.
  2. Add a check for the 2nd tensor before the all gather, and we can observe that param_ds_tensor does contain NaN:
if torch.isnan(param_ds_tensor).any().item():
    print("param ds tensor contains nan!!!")
    exit()
    
handles = _dist_allgather_fn(
    param_ds_tensor.to(get_accelerator().current_device_name()),
    param_buffer,
    ds_process_group,
)
param ds tensor contains nan!!!
  3. Add a NaN check in the if condition again, to see if it could be an async problem:
if torch.isnan(param_ds_tensor).any().item():
    import time
    time.sleep(2)
    print(torch.isnan(param_ds_tensor).any().item())
    print("param ds tensor contains nan!!!")
    exit()
param ds tensor contains nan!!!
False

It means that the ds tensor contains nan in the beginning, but after 2 seconds, the nan is gone!!

  4. Remove the above two checks and add a check after the 2nd partitioning. In this case, the NaN doesn't happen (what???):
if torch.isnan(param.ds_secondary_tensor).any().item():
    print("param ds tensor contains nan after partition!!!")
    exit()
print_rank_0(f"{param.ds_id} partitioned type {param.dtype} dev {param.device} shape {param.shape}",
             force=False)
  5. copy_ is non-blocking when it is a device-to-device (d2d) copy (see the small demonstration after this list).

  6. Instead of injecting the NaN check into the DeepSpeed code, I moved the check to the HuggingFace trainer. I added checks in multiple places, including “before forward”, “after forward”, “before backward”, “after backward”, “before step”, and “after step”. I observed that NaN happens at “after backward”, which means the weights get mutated to NaN during backward, thus causing NaN in the grads. Technically, backward shouldn't change the weights at all. It is likely that the collective ops messed up the weights.

  7. With torch.cuda.synchronize() added before the all gather, if we tune stage3_param_persistence_threshold to zero, we can still observe the NaN issue.

  8. The NaN issue is not reproducible with a ~7B linear model.
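
A small demonstration of point 5 above (plain PyTorch behavior, not DeepSpeed code): a device-to-device copy_ returns control to the host almost immediately, long before the data has actually been moved.

import time
import torch

src = torch.randn(1 << 26, device="cuda")   # ~256 MB of float32
dst = torch.empty_like(src)

torch.cuda.synchronize()
t0 = time.perf_counter()
dst.copy_(src)                              # d2d copy: only enqueued on the current stream
t_enqueue = time.perf_counter() - t0

torch.cuda.synchronize()                    # now actually wait for the copy to finish
t_done = time.perf_counter() - t0
print(f"copy_ returned after {t_enqueue * 1e3:.3f} ms, finished after {t_done * 1e3:.3f} ms")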

ByronHsu commented on May 18, 2024

Some questions that bother us, and our hypotheses

Question: why does adding the check “after 2nd partitioning” prevent the NaN issue, while adding the check “before all gather” still leaves the issue present?

Hypothesis:
torch.isnan(tensor).any().item() essentially copies data from device to host and thus serves as a sync point. However, the “after 2nd partitioning” check runs on the forward thread while the “before all gathering” check runs on the backward thread. If we put the check on the same thread where the copying happens, it will sync due to .item(); if we put the check on a different thread, .item() does not wait for that copy. A toy illustration of the .item() sync point is sketched below.
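
As a toy illustration of that sync-point behavior (standard PyTorch semantics, not DeepSpeed code): .item() performs a device-to-host copy of a scalar, so the host blocks until the work that produces it on the current stream has finished.

import torch

x = torch.randn(1 << 20, device="cuda")
nan_flag = torch.isnan(x).any()   # still a device tensor; the kernel runs asynchronously
result = nan_flag.item()          # d2h copy of a scalar: the host blocks here until it is ready
print(result)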

Question: why doesn't initializing the 2nd tensors with torch.randn fix the issue?

Hypothesis 1:
The all gather overlaps in time with the copy that fills the secondary tensor:

    copying:      [==========]
    all gather:        [==========]

Even if we initialize the 2nd tensors as non-NaN, if we probe the values of the tensor while the copy is still in flight, we will get some NaN values.

Hypothesis 2:
The all gather doesn't pick up any NaN, but instead picks up a randn value, which then gets used in the optimizer step, causing NaN in the updated weights.
