Comments (12)
Update: we can now use any two of the following three options: ZeRO Stage 2, pipeline parallelism, and activation checkpointing. Enabling all three fails, starting with the following warning:
/root/anaconda3/envs/ds/lib/python3.7/site-packages/deepspeed/runtime/pipe/engine.py:993: UserWarning: The .grad attribute of a Tensor that is not a leaf Tensor is being accessed. Its .grad attribute won't be populated during autograd.backward(). If you indeed want the gradient for a non-leaf Tensor, use .retain_grad() on the non-leaf Tensor. If you access the non-leaf Tensor by mistake, make sure you access the leaf Tensor instead. See github.com/pytorch/pytorch/pull/30531 for more informations.
If you open the Stella branch you can replicate this. Running `sh scripts/train_enwik8_pipeline.sh` will have all three enabled and will error.
To turn off activation checkpointing, set `"number_checkpoints": null` in `configs/deepspeed_zero2.json`.
To turn off pipelining, run `sh scripts/train_enwik8.sh`.
To turn off ZeRO Stage 2, use `configs/deepspeed_zero1.json` as your config file.
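For reference, a minimal sketch of the two config sections being toggled, using DeepSpeed's standard config schema (the repo's actual config files will carry more fields than shown here):

```json
{
  "zero_optimization": {
    "stage": 2,
    "contiguous_gradients": true
  },
  "activation_checkpointing": {
    "number_checkpoints": null
  }
}
```

Setting `"number_checkpoints": null` disables activation checkpointing, and dropping `"stage"` to 1 gives the zero1 config.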
from gpt-neox.
I ran all three pairwise combinations; the results are as follows.
Zero2+pipeline: does not work (contiguous gradients both on and off)
Checkpoint+pipeline: does work (contiguous gradients both on and off)
Zero2+checkpoint: does work (contiguous gradients on; didn't test off)
All of the errors and warnings that occur for zero2+pipeline:
Traceback (most recent call last):
File "train_enwik8_pipeline.py", line 109, in <module>
loss = model_engine.train_batch()
File "/root/anaconda3/envs/ds/lib/python3.7/site-packages/deepspeed/runtime/pipe/engine.py", line 273, in train_batch
self._exec_schedule(sched)
File "/root/anaconda3/envs/ds/lib/python3.7/site-packages/deepspeed/runtime/pipe/engine.py", line 1162, in _exec_schedule
self._exec_instr(**cmd.kwargs)
File "/root/anaconda3/envs/ds/lib/python3.7/site-packages/deepspeed/runtime/pipe/engine.py", line 952, in _exec_optimizer_step
self._take_model_step(lr_kwargs)
File "/root/anaconda3/envs/ds/lib/python3.7/site-packages/deepspeed/runtime/engine.py", line 916, in _take_model_step
self.optimizer.step()
File "/root/anaconda3/envs/ds/lib/python3.7/site-packages/deepspeed/runtime/zero/stage2.py", line 1341, in step
self.check_overflow()
File "/root/anaconda3/envs/ds/lib/python3.7/site-packages/deepspeed/runtime/zero/stage2.py", line 1612, in check_overflow
self._check_overflow(partition_gradients)
File "/root/anaconda3/envs/ds/lib/python3.7/site-packages/deepspeed/runtime/zero/stage2.py", line 1516, in _check_overflow
self.overflow = self.has_overflow(partition_gradients)
File "/root/anaconda3/envs/ds/lib/python3.7/site-packages/deepspeed/runtime/zero/stage2.py", line 1535, in has_overflow
overflow = self.local_overflow if self.cpu_offload else self.has_overflow_partitioned_grads_serial(
File "/root/anaconda3/envs/ds/lib/python3.7/site-packages/deepspeed/runtime/zero/stage2.py", line 1528, in has_overflow_partitioned_grads_serial
for j, grad in enumerate(self.averaged_gradients[i]):
KeyError: 0
(this one shows up 4 times)
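A toy sketch (not DeepSpeed code) of the failure mode behind `KeyError: 0`: stage2's overflow check iterates `self.averaged_gradients[i]` for each parameter group, but that dict is only populated by `independent_gradient_partition_epilogue()`. If the epilogue never runs, the very first lookup raises `KeyError: 0`.

```python
# Toy model of the stage2 overflow check. Names mirror DeepSpeed's,
# but the bodies are placeholders.
class ToyZeroStage2:
    def __init__(self, num_groups=1):
        self.num_groups = num_groups
        self.averaged_gradients = {}  # filled only by the epilogue

    def independent_gradient_partition_epilogue(self):
        for i in range(self.num_groups):
            self.averaged_gradients[i] = [0.1, 0.2]  # placeholder grads

    def has_overflow_partitioned_grads_serial(self):
        for i in range(self.num_groups):
            for grad in self.averaged_gradients[i]:  # KeyError: 0 if empty
                if grad != grad:  # NaN stand-in for the real inf/nan check
                    return True
        return False

opt = ToyZeroStage2()
try:
    opt.has_overflow_partitioned_grads_serial()   # epilogue was skipped
except KeyError as e:
    print("reproduced KeyError:", e)              # reproduced KeyError: 0
```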
Traceback (most recent call last):
File "train_enwik8_pipeline.py", line 109, in <module>
loss = model_engine.train_batch()
File "/root/anaconda3/envs/ds/lib/python3.7/site-packages/deepspeed/runtime/pipe/engine.py", line 273, in train_batch
self._exec_schedule(sched)
File "/root/anaconda3/envs/ds/lib/python3.7/site-packages/deepspeed/runtime/pipe/engine.py", line 1162, in _exec_schedule
self._exec_instr(**cmd.kwargs)
File "/root/anaconda3/envs/ds/lib/python3.7/site-packages/deepspeed/runtime/pipe/engine.py", line 602, in _exec_backward_pass
torch.autograd.backward(tensors=(outputs, ), grad_tensors=(grad_tensors, ))
File "/root/anaconda3/envs/ds/lib/python3.7/site-packages/torch/autograd/__init__.py", line 132, in backward
allow_unreachable=True) # allow_unreachable flag
File "/root/anaconda3/envs/ds/lib/python3.7/site-packages/deepspeed/runtime/zero/stage2.py", line 594, in reduce_partition_and_remove_grads
self.reduce_ready_partitions_and_remove_grads(param, i)
File "/root/anaconda3/envs/ds/lib/python3.7/site-packages/deepspeed/runtime/zero/stage2.py", line 984, in reduce_ready_partitions_and_remove_grads
self.reduce_independent_p_g_buckets_and_remove_grads(param, i)
File "/root/anaconda3/envs/ds/lib/python3.7/site-packages/deepspeed/runtime/zero/stage2.py", line 633, in reduce_independent_p_g_buckets_and_remove_grads
new_grad_tensor = self.ipg_buffer[self.ipg_index].narrow(
AttributeError: 'FP16_DeepSpeedZeroOptimizer' object has no attribute 'ipg_index'
(this one pops up 4 times; two of the copies were interleaved in the paste and have been untangled here)
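A toy sketch (not DeepSpeed code) of that `AttributeError`. My reading: the ZeRO stage-2 "independent parameter gradient" (ipg) machinery is set up by a backward prologue that the engine normally runs before autograd, but the pipeline engine calls `torch.autograd.backward()` directly, so the reduction hook fires before `ipg_index` ever exists. (The attribute placement below is illustrative, not copied from DeepSpeed.)

```python
class ToyZeroOptimizer:
    def __init__(self):
        self.ipg_buffer = [[0.0] * 8]  # gradient reduction buffer

    def backward_prologue(self):
        self.ipg_index = 0  # normally set before autograd runs

    def reduce_hook(self, grad):
        # fired by autograd for each parameter's gradient
        return self.ipg_buffer[self.ipg_index]

opt = ToyZeroOptimizer()
try:
    opt.reduce_hook(grad=0.5)  # pipeline path: prologue never ran
except AttributeError as e:
    print(e)  # ... object has no attribute 'ipg_index'
```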
As well as the warning that Stella mentioned above.
With contiguous gradients off, the `FP16_DeepSpeedZeroOptimizer` error no longer happens and I get 8 `KeyError`s.
Checkpoint+pipeline works with contiguous gradients both on and off, so I don't think it's a major factor in zero2 breaking, but I'll keep it off for the remainder of my tests.
Focusing on the `KeyError` now. The only place where `self.averaged_gradients` is written to within `stage2.py` is in the `independent_gradient_partition_epilogue` function (https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/runtime/zero/stage2.py#L485). So either this function isn't being called at all, or it is being called but L485 is never reached.
The only place `independent_gradient_partition_epilogue` is called is https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/runtime/zero/stage2.py#L580, which is only called from https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/runtime/engine.py#L835, which in turn is only called from https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/runtime/engine.py#L915.
The pipeline code is problematic because it disables `backward_allreduce`:
https://github.com/microsoft/DeepSpeed/blob/81aeea361da3936b875a678b9cb44596800510b5/deepspeed/runtime/pipe/engine.py#L56
which means `allreduce_gradients` in the non-pipelined engine never runs:
https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/runtime/engine.py#L914
which means `independent_gradient_partition_epilogue` never gets called (see the previous comment).
The pipeline code pushes a ReduceTiedGrads and then a ReduceGrads here: https://github.com/microsoft/DeepSpeed/blob/81aeea361da3936b875a678b9cb44596800510b5/deepspeed/runtime/pipe/schedule.py#L235
Execution of that ReduceTiedGrads op:
https://github.com/microsoft/DeepSpeed/blob/81aeea361da3936b875a678b9cb44596800510b5/deepspeed/runtime/pipe/engine.py#L1139
https://github.com/microsoft/DeepSpeed/blob/81aeea361da3936b875a678b9cb44596800510b5/deepspeed/runtime/pipe/engine.py#L208
https://github.com/microsoft/DeepSpeed/blob/81aeea361da3936b875a678b9cb44596800510b5/deepspeed/runtime/pipe/module.py#L405
Execution of ReduceGrads:
https://github.com/microsoft/DeepSpeed/blob/81aeea361da3936b875a678b9cb44596800510b5/deepspeed/runtime/pipe/engine.py#L211
which calls `buffered_allreduce_fallback`, but only if data parallelism is enabled:
https://github.com/microsoft/DeepSpeed/blob/865104be85902ca398038045ad9cf94ec7d48745/deepspeed/runtime/engine.py#L1156
With the patch applied,
Zero2+pipeline now works
Checkpoint+pipeline now works
Zero2+checkpoint now works
Zero2+checkpoint+pipeline now works
Profiling results:
patched, zero2+checkpoint+pipeline: samples/sec: 1159.741, max vram: 3245MiB
patched, zero2+checkpoint: samples/sec: 1120.857, max vram: 1704MiB
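Quick arithmetic on those two runs: pipelining buys roughly 3.5% throughput here at about 1.9x the peak VRAM.

```python
# Ratios computed from the profiling numbers above.
with_pipeline = {"samples_per_sec": 1159.741, "max_vram_mib": 3245}
without_pipeline = {"samples_per_sec": 1120.857, "max_vram_mib": 1704}

speedup = with_pipeline["samples_per_sec"] / without_pipeline["samples_per_sec"]
vram_ratio = with_pipeline["max_vram_mib"] / without_pipeline["max_vram_mib"]
print(f"throughput: {speedup:.3f}x, vram: {vram_ratio:.2f}x")
# throughput: 1.035x, vram: 1.90x
```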
With DeepSpeed's updates this seems to run just fine. Whether it runs efficiently is still an open question, though.
Turns out we weren't using gradient checkpointing at all! You can add checkpointing to the params without initializing the checkpointer, and you can initialize the checkpointer without actually using it! #90 should actually implement gradient checkpointing.
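A toy illustration of that bug (not the gpt-neox code): wiring up a checkpoint function has no effect unless the forward pass actually routes computation through it.

```python
# Stand-in for an activation-checkpointing wrapper (e.g. DeepSpeed's);
# the counter just records whether it was ever actually invoked.
calls = {"checkpointed": 0}

def checkpoint(fn, *args):
    calls["checkpointed"] += 1
    return fn(*args)

def forward_plain(x):
    return x * 2  # checkpoint() is configured, but never used

def forward_checkpointed(x):
    return checkpoint(lambda t: t * 2, x)

forward_plain(3)
print(calls["checkpointed"])   # 0 -> checkpointing silently unused
forward_checkpointed(3)
print(calls["checkpointed"])   # 1
```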