
Comments (12)

StellaAthena commented on May 9, 2024

Update: we can now use any two of the following three options: ZeRO Stage 2, Parallel Pipelining, and Activation Checkpointing. If all three are enabled, training fails with the following warning (the accompanying errors are collected in a comment below):

/root/anaconda3/envs/ds/lib/python3.7/site-packages/deepspeed/runtime/pipe/engine.py:993: UserWarning: The .grad attribute of a Tensor that is not a leaf Tensor is being accessed. Its .grad attribute won't be populated during autograd.backward(). If you indeed want the gradient for a non-leaf Tensor, use .retain_grad() on the non-leaf Tensor. If you access the non-leaf Tensor by mistake, make sure you access the leaf Tensor instead. See github.com/pytorch/pytorch/pull/30531 for more informations.

You can replicate this on the Stella branch: if you run sh scripts/train_enwik8_pipeline.sh, all three are enabled and it errors.

To turn off activation checkpointing, set "number_checkpoints": null in configs/deepspeed_zero2.json.

To turn off pipelining, run sh scripts/train_enwik8.sh.

To turn off ZeRO Stage 2, use configs/deepspeed_zero1.json as your config file.
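
For reference, here is a minimal sketch of the relevant config fragments, written as a Python dict (the key names follow the toggles above; the surrounding values are illustrative, not a copy of the repo's actual configs):

# Hedged sketch of the toggles above; values beyond the two keys shown are illustrative.
ds_config = {
    "zero_optimization": {
        "stage": 2,  # drop to 1 to disable ZeRO Stage 2 (as in configs/deepspeed_zero1.json)
    },
    "activation_checkpointing": {
        "number_checkpoints": None,  # "number_checkpoints": null in JSON disables activation checkpointing
    },
}
# Pipelining is toggled by which training script you launch, not by this config.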


leogao2 commented on May 9, 2024

I ran all three pairwise combinations and the results are as follows (the contiguous-gradients flag lives in the ZeRO config; see the fragment after this list):

Zero2+pipeline: does not work (contiguous gradients both on and off)
Checkpoint+pipeline: does work (contiguous gradients both on and off)
Zero2+checkpoint: does work (contiguous gradients on; didn't test off)
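
For clarity, "contiguous gradients" refers to the contiguous_gradients flag under zero_optimization in the DeepSpeed config. A minimal sketch, as a Python dict with illustrative surrounding values:

# Hedged sketch; only the contiguous_gradients key is the point here.
ds_config = {
    "zero_optimization": {
        "stage": 2,
        "contiguous_gradients": False,  # the flag toggled on/off in the tests above
    },
}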


leogao2 commented on May 9, 2024

All of the errors and warnings that occur for zero2+pipeline:

Traceback (most recent call last):
  File "train_enwik8_pipeline.py", line 109, in <module>
    loss = model_engine.train_batch()
  File "/root/anaconda3/envs/ds/lib/python3.7/site-packages/deepspeed/runtime/pipe/engine.py", line 273, in train_batch
    self._exec_schedule(sched)
  File "/root/anaconda3/envs/ds/lib/python3.7/site-packages/deepspeed/runtime/pipe/engine.py", line 1162, in _exec_schedule
    self._exec_instr(**cmd.kwargs)
  File "/root/anaconda3/envs/ds/lib/python3.7/site-packages/deepspeed/runtime/pipe/engine.py", line 952, in _exec_optimizer_step
    self._take_model_step(lr_kwargs)
  File "/root/anaconda3/envs/ds/lib/python3.7/site-packages/deepspeed/runtime/engine.py", line 916, in _take_model_step
    self.optimizer.step()
  File "/root/anaconda3/envs/ds/lib/python3.7/site-packages/deepspeed/runtime/zero/stage2.py", line 1341, in step
    self.check_overflow()
  File "/root/anaconda3/envs/ds/lib/python3.7/site-packages/deepspeed/runtime/zero/stage2.py", line 1612, in check_overflow
    self._check_overflow(partition_gradients)
  File "/root/anaconda3/envs/ds/lib/python3.7/site-packages/deepspeed/runtime/zero/stage2.py", line 1516, in _check_overflow
    self.overflow = self.has_overflow(partition_gradients)
  File "/root/anaconda3/envs/ds/lib/python3.7/site-packages/deepspeed/runtime/zero/stage2.py", line 1535, in has_overflow
    overflow = self.local_overflow if self.cpu_offload else self.has_overflow_partitioned_grads_serial(
  File "/root/anaconda3/envs/ds/lib/python3.7/site-packages/deepspeed/runtime/zero/stage2.py", line 1528, in has_overflow_partitioned_grads_serial
    for j, grad in enumerate(self.averaged_gradients[i]):
KeyError: 0

(this one shows up 4 times)

Traceback (most recent call last):
  File "train_enwik8_pipeline.py", line 109, in <module>
    loss = model_engine.train_batch()
  File "/root/anaconda3/envs/ds/lib/python3.7/site-packages/deepspeed/runtime/pipe/engine.py", line 273, in train_batch
    self._exec_schedule(sched)
  File "/root/anaconda3/envs/ds/lib/python3.7/site-packages/deepspeed/runtime/pipe/engine.py", line 1162, in _exec_schedule
    self._exec_instr(**cmd.kwargs)
  File "/root/anaconda3/envs/ds/lib/python3.7/site-packages/deepspeed/runtime/pipe/engine.py", line 602, in _exec_backward_pass
    torch.autograd.backward(tensors=(outputs, ), grad_tensors=(grad_tensors, ))
  File "/root/anaconda3/envs/ds/lib/python3.7/site-packages/torch/autograd/__init__.py", line 132, in backward
    allow_unreachable=True)  # allow_unreachable flag
  File "/root/anaconda3/envs/ds/lib/python3.7/site-packages/deepspeed/runtime/zero/stage2.py", line 594, in reduce_partition_and_remove_grads
    self.reduce_ready_partitions_and_remove_grads(param, i)
  File "/root/anaconda3/envs/ds/lib/python3.7/site-packages/deepspeed/runtime/zero/stage2.py", line 984, in reduce_ready_partitions_and_remove_grads
    self.reduce_independent_p_g_buckets_and_remove_grads(param, i)
  File "/root/anaconda3/envs/ds/lib/python3.7/site-packages/deepspeed/runtime/zero/stage2.py", line 633, in reduce_independent_p_g_buckets_and_remove_grads
    new_grad_tensor = self.ipg_buffer[self.ipg_index].narrow(
AttributeError: 'FP16_DeepSpeedZeroOptimizer' object has no attribute 'ipg_index'

(this one pops up 4 times; two of the copies were interleaved with each other in the raw log, so the traceback above is shown untangled)

Plus the warning that Stella mentioned above.


leogao2 commented on May 9, 2024

With contiguous gradients off, the FP16_DeepSpeedZeroOptimizer error no longer happens and I get 8 KeyErrors.
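
That fits the code paths in the traceback: ipg_buffer/ipg_index are only touched on the contiguous-gradients path, and they appear to be set up outside the optimizer before autograd runs on the non-pipelined path, while the pipeline engine calls torch.autograd.backward directly. A minimal Python sketch of that failure pattern, with the setup step stubbed out; the class and method bodies here are paraphrases, not DeepSpeed's actual source:

import torch

class TinyZero:
    # Paraphrase of the stage2 reduce path that is guarded by contiguous_gradients.
    def __init__(self, contiguous_gradients):
        self.contiguous_gradients = contiguous_gradients

    def setup_for_backward(self):
        # What the non-pipelined engine does before autograd runs (paraphrased):
        # allocate the flat buffer and reset the index.
        self.ipg_buffer = [torch.empty(1024)]
        self.ipg_index = 0

    def reduce_grad(self, grad):
        if self.contiguous_gradients:
            # Fails with AttributeError if setup_for_backward was never called
            # (the real trace reports ipg_index; the cause is the same skipped setup).
            return self.ipg_buffer[self.ipg_index].narrow(0, 0, grad.numel())
        return grad  # the non-contiguous path never touches the buffer

opt = TinyZero(contiguous_gradients=True)
try:
    # Pipeline path (pre-fix): the hook fires without the setup call.
    opt.reduce_grad(torch.ones(4))
except AttributeError as e:
    print(e)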


leogao2 commented on May 9, 2024

Checkpoint+pipeline works with contiguous gradients both on and off. So I don't think contiguous gradients is a major factor in zero2 breaking, but I'll keep it off for the remainder of my tests.


leogao2 commented on May 9, 2024

Focusing on the KeyError now.

The only place where self.averaged_gradients is written to within stage2.py is in the independent_gradient_partition_epilogue function (https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/runtime/zero/stage2.py#L485). So either this function just isn't being called, or it is being called but L485 is never being reached.
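
To make the KeyError concrete: has_overflow_partitioned_grads_serial indexes self.averaged_gradients by partition-group id, so if nothing ever populates the dict, the very first lookup fails with key 0. A minimal sketch (names follow the traceback; the loop body is paraphrased):

averaged_gradients = {}  # only independent_gradient_partition_epilogue ever fills this

def has_overflow_partitioned_grads_serial(num_partition_groups=1):
    # Paraphrase of stage2.py: scan each group's averaged gradients for inf/nan.
    for i in range(num_partition_groups):
        for j, grad in enumerate(averaged_gradients[i]):  # KeyError: 0 if never populated
            pass

try:
    has_overflow_partitioned_grads_serial()
except KeyError as e:
    print("KeyError:", e)  # -> KeyError: 0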


leogao2 commented on May 9, 2024

So the only place where independent_gradient_partition_epilogue is called is https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/runtime/zero/stage2.py#L580. Which is only called at https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/runtime/engine.py#L835. Which is only called at https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/runtime/engine.py#L915.


leogao2 commented on May 9, 2024

The pipeline code is problematic because it disables backward_allreduce
https://github.com/microsoft/DeepSpeed/blob/81aeea361da3936b875a678b9cb44596800510b5/deepspeed/runtime/pipe/engine.py#L56
which means allreduce_gradients in the non-pipelined engine never runs
https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/runtime/engine.py#L914
which in turn means independent_gradient_partition_epilogue never gets called (see the previous comment).

The pipeline code pushes a ReduceTiedGrads and then a ReduceGrads here: https://github.com/microsoft/DeepSpeed/blob/81aeea361da3936b875a678b9cb44596800510b5/deepspeed/runtime/pipe/schedule.py#L235

Execution of that ReduceTiedGrads op:
https://github.com/microsoft/DeepSpeed/blob/81aeea361da3936b875a678b9cb44596800510b5/deepspeed/runtime/pipe/engine.py#L1139
https://github.com/microsoft/DeepSpeed/blob/81aeea361da3936b875a678b9cb44596800510b5/deepspeed/runtime/pipe/engine.py#L208
https://github.com/microsoft/DeepSpeed/blob/81aeea361da3936b875a678b9cb44596800510b5/deepspeed/runtime/pipe/module.py#L405

Execution of ReduceGrads:
https://github.com/microsoft/DeepSpeed/blob/81aeea361da3936b875a678b9cb44596800510b5/deepspeed/runtime/pipe/engine.py#L211
which calls buffered_allreduce_fallback, but only if data parallelism is enabled.
https://github.com/microsoft/DeepSpeed/blob/865104be85902ca398038045ad9cf94ec7d48745/deepspeed/runtime/engine.py#L1156
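
In other words, under zero2+pipeline nothing ever reaches the ZeRO-2 epilogue, because the only route to it is gated behind the disabled backward_allreduce. A hypothetical sketch of the shape a fix could take in the pipeline engine's ReduceGrads handler; the method names are taken from the call sites cited above plus DeepSpeed's engine (an assumption on my part), and this is a paraphrase, not the actual upstream patch:

# Hypothetical patch sketch for PipelineEngine._exec_reduce_grads.
def _exec_reduce_grads(self):
    if self.is_data_parallel:
        if self.zero_optimization():
            # Restore the missing ZeRO-2 epilogue, which populates
            # averaged_gradients (the base engine normally reaches it
            # via allreduce_gradients).
            self.optimizer.overlapping_partition_gradients_reduce_epilogue()
        else:
            # Existing non-ZeRO data-parallel path.
            self.buffered_allreduce_fallback()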


leogao2 commented on May 9, 2024

With the patch applied,

Zero2+pipeline now works
Checkpoint+pipeline now works
Zero2+checkpoint now works

Zero2+checkpoint+pipeline now works


leogao2 commented on May 9, 2024

Profiling results:

patched, zero2+checkpoint+pipeline: samples/sec: 1159.741, max vram: 3245 MiB
patched, zero2+checkpoint: samples/sec: 1120.857, max vram: 1704 MiB


StellaAthena commented on May 9, 2024

With DeepSpeed's updates this seems to run just fine. Whether it runs efficiently is still an open question, though.

microsoft/DeepSpeed#677


StellaAthena commented on May 9, 2024

It turns out we weren't using gradient checkpointing at all! You can add checkpointing to the params without initializing the checkpointer, and you can initialize the checkpointer without actually using it. #90 should actually implement gradient checkpointing.
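
For what it's worth, a minimal sketch of the distinction, assuming DeepSpeed's checkpointing API (deepspeed.checkpointing.configure and deepspeed.checkpointing.checkpoint); the layer loop is illustrative:

import deepspeed
import torch.nn as nn

# Step 1: configuring the checkpointer alone does nothing to the forward pass.
deepspeed.checkpointing.configure(None, deepspeed_config="configs/deepspeed_zero2.json")

# Step 2: each block must actually be run through the wrapper, so its
# activations are recomputed during backward instead of being stored.
def forward_with_checkpointing(layers: nn.ModuleList, x):
    for layer in layers:
        x = deepspeed.checkpointing.checkpoint(layer, x)
    return x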

