azure / ms-amp

Microsoft Automatic Mixed Precision Library

Home Page: https://azure.github.io/MS-AMP/

License: MIT License

Makefile 0.17% Python 84.48% C++ 5.50% Cuda 7.73% Dockerfile 0.54% CMake 0.25% Shell 0.06% JavaScript 1.03% CSS 0.25%
amp deep-learning fp8 gpu pytorch transformer mixed-precision

ms-amp's Introduction

MS-AMP: Microsoft Automatic Mixed Precision

MS-AMP is an automatic mixed precision package for deep learning developed by Microsoft.

📢 v0.4.0 has been released!

Check aka.ms/msamp/doc for more details.
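A minimal usage sketch (based on the pattern shown in the documentation; treat the exact call as an assumption and see aka.ms/msamp/doc for the authoritative example):

import torch
import msamp

model = torch.nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

# Wrap the model and optimizer for FP8 mixed-precision training (O1/O2 opt levels).
model, optimizer = msamp.initialize(model, optimizer, opt_level="O2")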

Publication

FP8-LM: Training FP8 Large Language Models [bib]
@misc{fp8lm,
      title={FP8-LM: Training FP8 Large Language Models},
      author={Houwen Peng and Kan Wu and Yixuan Wei and Guoshuai Zhao and Yuxiang Yang and Ze Liu and Yifan Xiong and Ziyue Yang and Bolin Ni and Jingcheng Hu and Ruihang Li and Miaosen Zhang and Chen Li and Jia Ning and Ruizhe Wang and Zheng Zhang and Shuguang Liu and Joe Chau and Han Hu and Peng Cheng},
      year={2023},
      eprint={2310.18313},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties' policies.

ms-amp's People

Contributors

abuccts, eltociear, guoshzhao, microsoft-github-operations[bot], microsoftopensource, tocean, wkcn, yzygitzh


ms-amp's Issues

Question: FP8 Allreduce

I have noticed that MS-AMP uses the FP8 data type in communication.

In TransformerEngine, FP8 operations need a scale to adjust the result. Does MS-AMP's FP8 allreduce also need a scale to adjust the allreduce result?

[Question] How to apply MS-AMP to only part of the model?

What's the issue, what's expected?:
I would like to apply MS-AMP to only parts of the model that are less sensitive to reduced precision.

Additional information:

Some parts of a model are more sensitive to reduced precision than others, but the current API makes it difficult to apply MS-AMP only to the desired parts of the model. Is there an easy way of doing this?
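One possible workaround is sketched below, assuming msamp.initialize can be applied to an arbitrary nn.Module placed on the GPU (a sketch, not an officially supported pattern): convert only the precision-tolerant submodule and leave the sensitive parts untouched.

import torch
import msamp

class Hybrid(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.sensitive = torch.nn.Linear(256, 256)  # keep in default precision
        self.tolerant = torch.nn.Sequential(
            torch.nn.Linear(256, 1024), torch.nn.GELU(), torch.nn.Linear(1024, 256)
        )

    def forward(self, x):
        return self.sensitive(self.tolerant(x))

model = Hybrid().cuda()
opt = torch.optim.AdamW(model.tolerant.parameters(), lr=1e-3)
# Convert only the `tolerant` submodule; the `sensitive` part would need its own optimizer.
model.tolerant, opt = msamp.initialize(model.tolerant, opt, opt_level="O1")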

Moving extension installation from post install to setup.py under project root folder

What would you like to be added:
Moving extension installation from post install to setup.py under project root folder.

Why is this needed:
Extensions are part of MS-AMP and should be installed with MS-AMP package.

Without this feature, how does current msamp work
MS-AMP uses `make postinstall` to install PyTorch extensions such as dist_op and optim.

Components that may involve changes:
msamp.operators.dist_op and msamp.optim

Brief description of your proposal if any:
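A hedged sketch of what this could look like: declare the CUDA/C++ extensions directly in the root setup.py via torch.utils.cpp_extension so that `pip install .` builds them. The extension name and source path below are illustrative, not the actual layout.

from setuptools import setup
from torch.utils.cpp_extension import BuildExtension, CUDAExtension

setup(
    name='msamp',
    ext_modules=[
        CUDAExtension(
            name='msamp_dist_op',  # illustrative extension name
            sources=['msamp/operators/dist_op/dist_op.cpp'],  # illustrative source path
        ),
    ],
    cmdclass={'build_ext': BuildExtension},
)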

Is activation checkpointing used for Table 5 from the FP8-LM paper?

Hi,

I'm wondering if the TFLOPs/MFU numbers in Table 5 of the paper are using activation checkpointing?

I've looked through the MS-AMP-Examples repo, and it seems like the GPT3 Megatron scripts do not use --checkpoint-activations, while the Megatron-DeepSpeed scripts do.

Thank you

V0.3 Release Plan

Release Manager

@cp5555

Endgame

  • Code freeze: Oct. 20th, 2023
  • Bug Bash date: Oct. 23rd, 2023
  • Release date: Nov. 3rd, 2023

Main Features

MS-AMP Performance Improvement

  • 1. Integrate latest TE with MS-AMP (#98)
  • 2. Support latest Megatron-LM (#95 and #100)

MS-AMP O3 Optimization

  • 1. Support DistributedDataParallel with FP8-support (#93)
  • 2. Integrate with MSCCL (#90)

MS-AMP Improvement

  • 1. Refactor code in dist_op module (#94)
  • 2. Support UT for distributed testing

MS-AMP Demo & Documentation

  • 1. Setup MS-AMP Website (Related to #29)

Backlog

MS-AMP O3 Optimization

  1. Support auto scaling factor tuning (ASFT) for FP8 collective communication (Related to #41)

MS-AMP Improvement

  1. Move extension installation from post install to setup.py (Related to #43)

Questions: Clarifying the use of FP8 for Training

@tocean @wkcn

In line with the investigation in NVIDIA/TransformerEngine#424, it would be great to get the insights from the team at microsoft for using FP8 in aspects of training besides matmul.

Questions

1. Performance

The repo only mentions training accuracy and memory savings. However, the kernels may not be very optimized, and the majority is implemented in Torch. I guess that performance is still unexplored.

2. Weight Update

  • is the weight update applied while the backward pass is running (on-the-fly)? Or is it applied after the entire backward pass is complete?
    • Seems that one can get memory savings from on-the-fly
    • Is there a possibility to use the CUDA graphs API to efficiently schedule the weight updates concurrently even if they do not saturate the GPU? Furthermore, one should be able to batch uneven-sized weight updates into a single kernel invocation based on an outstanding_weight_updates_bytes threshold.

3. More Accurate Scaling Factors

Is there a way to maintain more accurate amax by estimating:

  • For e.g. naive SGD case:
    • scaling_factor_weights_t = amax_weights_{t-1} + amax_grad_t - this is an accurate upper bound (no a priori knowledge needed)
    • amax_weights_t = max(abs(weights_t)) - this is only used for the next iteration
  • For Adam optimizer:
    • Utilizing e5m2 might be able to help with dynamic range for v (same dynamic range as FP16).
    • Storing sqrt_v rather than v may help the precision. Update rule: see appendix
      • Intuition: sqrt will reduce the dynamic range of bits by half (2^16 -> 2^8, 2^-16 -> 2^-8). Hence we perform sqrt in fp32/fp16 and quantize that as fp8, thus preserving the dynamic range
    • A more rigorous analysis is needed here.
  • If it is possible to better estimate scaling_factor_weights_t then it may be possible to use more of the dynamic range. Hence, storing the weights as FP8 (rather than FP16 as in the MS-AMP repo) might be possible.
    • Since Adam optimizer is momentum-based, the effect of deviation of amax on a per-batch basis is more bounded.

4. Adaptive Precision

Has it been explored using lower precision (FP8) at high learning rate (at earlier epochs) and higher precision (e.g. FP32, FP16) at lower learning rate (at later epochs)?

Appendix

Update Rule for sqrt_v_fp8

scaling_factor = amax_sqrt_v_prev / 448.  # 448 = 2^8 * (1 + 3/4) - use more of the fp8e5m2 dynamic range; margin = 7

v_fp32 = (sqrt_v_fp8.to(dtype.fp32) * scaling_factor) ** 2     # recover v from the stored sqrt(v)
v_new = beta_2 * v_fp32 + (1 - beta_2) * grad_sq               # Adam second-moment update
sqrt_v_fp8 = (sqrt(v_new) / scaling_factor).to(dtype.fp8e5m2)  # re-quantize sqrt(v)

# end of loop
amax_sqrt_v_new = sqrt(max(v_new))

Notes:

  1. If amax_sqrt_v_fp8 = 448.0, then the scaling factor is 1. This is captured in margin bits:
    def compute_scaling_factor(amax, scale, fp_max: float, margin: int):
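For reference, here is a sketch of the standard FP8 delayed-scaling recipe that a function with this signature typically implements (an approximation for discussion, not necessarily MS-AMP's exact code):

import torch

def compute_scaling_factor(amax: torch.Tensor, scale: torch.Tensor, fp_max: float, margin: int) -> torch.Tensor:
    # Pick the largest power-of-two scale such that amax * scale <= fp_max / 2**margin.
    exp = torch.floor(torch.log2(fp_max / amax)) - margin
    new_scale = torch.pow(2.0, exp)
    # Keep the previous scale when amax is zero or non-finite (nothing to calibrate on).
    keep_old = (amax <= 0) | ~torch.isfinite(amax)
    return torch.where(keep_old, scale, new_scale)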

PyTorch 1.14 is not supported

What's the issue, what's expected?:
MS-AMP does not support PyTorch 1.14. MS-AMP should support 1.14 since many users are using it to train their models.

How to reproduce it?:

Install MS-AMP in docker nvcr.io/nvidia/pytorch:22.12-py3 and run mnist ddp example.

sudo docker run -it  -d --name=torchtest --privileged --net=host --ipc=host --gpus=all nvcr.io/nvidia/pytorch:22.12-py3  bash
sudo docker exec -it torchtest bash

git clone https://github.com/Azure/MS-AMP.git
cd MS-AMP
python -m pip install --upgrade pip
pip install -e .
make postinstall

cd examples
python mnist.py --enable-msamp --opt-level=O2

Log message or snapshot?:
(screenshot attached to the original issue)

Additional information:

V0.2 Release Plan

Release Manager

@cp5555

Endgame

  • Code freeze: July 5th, 2023
  • Bug Bash date: July 8th, 2023
  • Release date: July 19th, 2023

Main Features

MS-AMP O3 Optimization

    • Support pipeline parallelism and tensor parallelism (#81, Related to #40)

MS-AMP New Features

    • Support ScalingTensor in functional.linear (#65)
    • Support customized attributes in FP8Linear (#64)

MS-AMP Bug Fix

    • Fix bug in all_reduce_grads() when training multiple models (Related to #62) (#63)
    • Fix bug in the cast_to_fp16 (#76)

MS-AMP Improvement

    • Performance optimization (Related to #30) (#23 and #39)
    • Add DockerFile for CUDA12.1 support (#72)
    • Support PyTorch 2 (#74)
    • Support FP8 progress group in PyTorch (Related to #50) (#85)
    • Add DeepSpeed cifar10 as UT (#79)
    • Cache TE build in pipeline (#82)
    • Remove env PYTHONOPTIMIZE in docker image (#83)

MS-AMP Demo & Documentation

    • Add RoBERTa FP8 demo
    • Improve the homepage (#66, and #69)

Backlog

  1. Move extension installation from post install to setup.py (Related to #43)
  2. Solve NCCL installation issue (Related to #44)
  3. Support latest Megatron-LM
  4. Support auto scaling factor tuning (ASFT) for FP8 collective communication (Related to #41)
  5. Setup MS-AMP Website (Related to #29)
  6. Remove the dependency on Nvidia Transformer Engine (TE) (Related to #33)
  7. Support DistributedDataParallel with FP8-support (#75)

Support pipeline parallelism and tensor parallelism

What would you like to be added:
Support pipeline parallelism and tensor parallelism in MS-AMP.

Why is this needed:
Pipeline parallelism and tensor parallelism are essential for training large scale models such as GPT3. MS-AMP needs to support them.

Without this feature, how does current msamp work
Currently ms-amp doesn't support them.

Components that may involve changes:

Brief description of your proposal if any:
TBD

MNIST example failed in docker nvcr.io/nvidia/pytorch:22.09-py3

What's the issue, what's expected?:
MNIST example failed in docker nvcr.io/nvidia/pytorch:22.09-py3. It can succeed in superbench/dev:cuda11.8.

How to reproduce it?:

Start a docker container using nvcr.io/nvidia/pytorch:22.09-py3

docker run -it  -d --name=yuxmsamp --privileged --net=host --ipc=host --gpus=all  -v /:/data nvcr.io/nvidia/pytorch:22.09-py3 bash
docker exec -it yuxmsamp  bash

Install nccl and MS-AMP

Run MNIST example

cd examples
python mnist.py --enable-msamp --opt-level=O2
python -m torch.distributed.launch --nproc_per_node=8 mnist_ddp.py --enable-msamp --opt-level=O2

Log message or snapshot?:
(screenshot attached to the original issue)

Additional information:

There is a downward spike in the accuracy@1 curve when training a vision transformer

What's the issue, what's expected?:
There is a downward spike in the acc@1 curve when training a vision transformer. However, the curve is very smooth when using mfp8.

How to reproduce it?:
Train vision transformer model using MS-AMP O2 with multiple H100 GPUs.

Log message or snapshot?:

(screenshot attached to the original issue)

Additional information:

Remove the dependency on Transformer Engine

What would you like to be added:
Remove the dependency on Transformer Engine and implement the related FP8 functions, such as gemm, transpose, and cast, in MS-AMP.

Why is this needed:
As a competitor of Transformer Engine, MS-AMP should not depend on it. Implementing these FP8-related functions ourselves also gives us opportunities to improve performance.

Without this feature, how does current msamp work
Currently, MS-AMP uses the fp8-related functions from Transformer Engine.

Components that may involve changes:
msamp.common.tensor
msamp.operators.gemm

Brief description of your proposal if any:

Question about the paper

Hello, thanks for your great work. I don't understand parts of the paper very clearly. Here are some questions.

To equip tensor parallelism with FP8, we convert the sharded weight and activation tensors to FP8 format for linear layer computation.

1. If I don't use tensor parallelism, are the linear weights FP8 or FP16?
2. During forward and backward, are all the weights used in FP8 for computation (not the master weights)?

g = g''_1 + g''_2 + ··· + g''_N

3. When computing g with this formula, is g FP8 or FP16? If it's FP8, will overflow occur?

FP8 in linear layer question

Are the weights of the FP8Linear actually in FP8 or in FP16? - The paper seems to imply so

To equip tensor parallelism with FP8, we convert the sharded weight and activation tensors to FP8 format for linear layer computation.

So what I understand from this is that the activations are in FP8 but the actual weights are in float16. Am I missing something?

Automatic Scaling in the code

I'm having a bit of a hard time mapping the equations from the paper to the code. Can you point to where eq. 3 is in the code?

Essentially, where is mu updated every 1k steps after checking the gradient statistics?

(equation screenshot from the paper attached to the original issue)

I've gotten as far as working out that compute_scaling_factor is related to eqs. 4 and 5.

V0.1.0 Test Plan

V0.1.0 Test Plan

Test table

| Machine Type | #Node * #GPU * GPU type | PyTorch Version | Accelerated Computing Toolkit |
|---|---|---|---|
| NDV4 | 1 * 8 * A100 80GB | 1.13 | CUDA 11.8 |
| NDV4 | 1 * 8 * A100 80GB | 1.14 | CUDA 11.8 |
| NDV5 | 1 * 8 * H100 80GB | 1.13 | CUDA 11.8 |
| NDV5 | 1 * 8 * H100 80GB | 1.14 | CUDA 11.8 |

Use nvcr.io/nvidia/pytorch:22.09-py3 and nvcr.io/nvidia/pytorch:22.12-py3 to test pytorch 1.13 and 1.14.

Test cases

Installation

  • Install nccl
  • Install ms-amp
  • unit test

Run MNIST

  • MNIST on single card
  • MNIST on multi cards

Swin-Transformer

  • Swin-Tiny
  • Swin-Giant for demonstrating memory saving

Deit

  • Deit-Small
  • Deit-Large for demonstrating memory saving

MS-AMP install from source

What's the issue, what's expected?:
When I install MS-AMP from source, executing "python3 -m pip install", an error occurs.

How to reproduce it?:
ENV:
nvcr.io/nvidia/pytorch:23.10-py3

/usr/lib/python3.10/runpy.py:126: RuntimeWarning: 'torch.utils.collect_env' found in sys.modules after import of package 'torch.utils', but prior to execution of 'torch.utils.collect_env'; this may result in unpredictable behaviour
warn(RuntimeWarning(msg))
Collecting environment information...
PyTorch version: 2.1.0a0+32f93b1
Is debug build: False
CUDA used to build PyTorch: 12.2
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.3 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.27.6
Libc version: glibc-2.35

Python version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-3.10.0-957.el7.x86_64-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 12.2.140
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA L40S
GPU 1: NVIDIA L40S
GPU 2: NVIDIA L40S
GPU 3: NVIDIA L40S
GPU 4: NVIDIA L40S
GPU 5: NVIDIA L40S
GPU 6: NVIDIA L40S
GPU 7: NVIDIA L40S

Nvidia driver version: 535.129.03
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.5
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.5
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.5
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.5
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.5
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.5
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.5
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 52 bits physical, 57 bits virtual
Byte Order: Little Endian
CPU(s): 208
On-line CPU(s) list: 0-207
Vendor ID: GenuineIntel
Model name: Intel(R) Xeon(R) Platinum 8470
CPU family: 6
Model: 143
Thread(s) per core: 2
Core(s) per socket: 52
Socket(s): 2
Stepping: 6
CPU max MHz: 3800.0000
CPU min MHz: 800.0000
BogoMIPS: 4000.00
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb cat_l3 cat_l2 cdp_l3 intel_pt cdp_l2 ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq spec_ctrl intel_stibp flush_l1d arch_capabilities
Virtualization: VT-x
L1d cache: 4.9 MiB (104 instances)
L1i cache: 3.3 MiB (104 instances)
L2 cache: 208 MiB (104 instances)
L3 cache: 210 MiB (2 instances)
NUMA node(s): 2
NUMA node0 CPU(s): 0-51,104-155
NUMA node1 CPU(s): 52-103,156-207
Vulnerability L1tf: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1: Mitigation; Load fences, __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Enhanced IBRS

Versions of relevant libraries:
[pip3] numpy==1.22.2
[pip3] pytorch-quantization==2.1.2
[pip3] torch==2.1.0a0+32f93b1
[pip3] torch-tensorrt==0.0.0
[pip3] torchdata==0.7.0a0
[pip3] torchtext==0.16.0a0
[pip3] torchvision==0.16.0a0
[pip3] triton==2.1.0+e621604
[conda] Could not collect

Log message or snapshot?:

Collecting deepspeed==0.9.2 (from MS-AMP==0.3.0)
Downloading https://pypi.tuna.tsinghua.edu.cn/packages/9e/6a/ce4e9afcf36f242f55bf2c91fa1366b26f0f40653d98a96c0f64b7154c12/deepspeed-0.9.2.tar.gz (779 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 779.3/779.3 kB 13.6 MB/s eta 0:00:00
Preparing metadata (setup.py) ... error
error: subprocess-exited-with-error

× python setup.py egg_info did not run successfully.
│ exit code: 1
╰─> [40 lines of output]
/usr/local/lib/python3.10/dist-packages/pydantic/_internal/_config.py:317: UserWarning: Valid config keys have changed in V2:
* 'allow_population_by_field_name' has been renamed to 'populate_by_name'
* 'validate_all' has been renamed to 'validate_default'
warnings.warn(message, UserWarning)
/usr/local/lib/python3.10/dist-packages/pydantic/_internal/fields.py:128: UserWarning: Field "model_persistence_threshold" has conflict with protected namespace "model".

  You may be able to resolve this warning by setting `model_config['protected_namespaces'] = ()`.
    warnings.warn(
  Traceback (most recent call last):
    File "<string>", line 2, in <module>
    File "<pip-setuptools-caller>", line 34, in <module>
    File "/tmp/pip-install-k3j9klrt/deepspeed_ee89b11125db4d71bd6069eb5c252c37/setup.py", line 36, in <module>
      from op_builder import get_default_compute_capabilities, OpBuilder
    File "/tmp/pip-install-k3j9klrt/deepspeed_ee89b11125db4d71bd6069eb5c252c37/op_builder/__init__.py", line 18, in <module>
      import deepspeed.ops.op_builder  # noqa: F401
    File "/tmp/pip-install-k3j9klrt/deepspeed_ee89b11125db4d71bd6069eb5c252c37/deepspeed/__init__.py", line 16, in <module>
      from . import module_inject
    File "/tmp/pip-install-k3j9klrt/deepspeed_ee89b11125db4d71bd6069eb5c252c37/deepspeed/module_inject/__init__.py", line 6, in <module>
      from .replace_module import replace_transformer_layer, revert_transformer_layer, ReplaceWithTensorSlicing, GroupQuantizer, generic_injection
    File "/tmp/pip-install-k3j9klrt/deepspeed_ee89b11125db4d71bd6069eb5c252c37/deepspeed/module_inject/replace_module.py", line 731, in <module>
      from ..pipe import PipelineModule
    File "/tmp/pip-install-k3j9klrt/deepspeed_ee89b11125db4d71bd6069eb5c252c37/deepspeed/pipe/__init__.py", line 6, in <module>
      from ..runtime.pipe import PipelineModule, LayerSpec, TiedLayerSpec
    File "/tmp/pip-install-k3j9klrt/deepspeed_ee89b11125db4d71bd6069eb5c252c37/deepspeed/runtime/pipe/__init__.py", line 6, in <module>
      from .module import PipelineModule, LayerSpec, TiedLayerSpec
    File "/tmp/pip-install-k3j9klrt/deepspeed_ee89b11125db4d71bd6069eb5c252c37/deepspeed/runtime/pipe/module.py", line 19, in <module>
      from ..activation_checkpointing import checkpointing
    File "/tmp/pip-install-k3j9klrt/deepspeed_ee89b11125db4d71bd6069eb5c252c37/deepspeed/runtime/activation_checkpointing/checkpointing.py", line 25, in <module>
      from deepspeed.runtime.config import DeepSpeedConfig
    File "/tmp/pip-install-k3j9klrt/deepspeed_ee89b11125db4d71bd6069eb5c252c37/deepspeed/runtime/config.py", line 28, in <module>
      from .zero.config import get_zero_config, ZeroStageEnum
    File "/tmp/pip-install-k3j9klrt/deepspeed_ee89b11125db4d71bd6069eb5c252c37/deepspeed/runtime/zero/__init__.py", line 6, in <module>
      from .partition_parameters import ZeroParamType
    File "/tmp/pip-install-k3j9klrt/deepspeed_ee89b11125db4d71bd6069eb5c252c37/deepspeed/runtime/zero/partition_parameters.py", line 603, in <module>
      class Init(InsertPostInitMethodToModuleSubClasses):
    File "/tmp/pip-install-k3j9klrt/deepspeed_ee89b11125db4d71bd6069eb5c252c37/deepspeed/runtime/zero/partition_parameters.py", line 605, in Init
      param_persistence_threshold = get_config_default(DeepSpeedZeroConfig, "param_persistence_threshold")
    File "/tmp/pip-install-k3j9klrt/deepspeed_ee89b11125db4d71bd6069eb5c252c37/deepspeed/runtime/config_utils.py", line 116, in get_config_default
      field_name).required, f"'{field_name}' is a required field and does not have a default value"
  AttributeError: 'FieldInfo' object has no attribute 'required'. Did you mean: 'is_required'?
  [end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

note: This is an issue with the package mentioned above, not pip.
hint: See above for details.

Additional information:
Does MS-AMP support the L40S?

Optimize performance to close the gap between TE and MS-AMP

What's the issue, what's expected?:
There is a performance gap between transformer engine and MS-AMP. The performance should be very close.

How to reproduce it?:
Launch an end-to-end training workload with Transformer Engine and with MS-AMP, and compare the time per training step.

Log message or snapshot?:

Additional information:
Use a profiling tool to analyze the bottleneck.

[Bug] `LBOptimizer.all_reduce_grads` reduces gradients of only one model, even when training several models

What's the issue, what's expected?:
In the function LBOptimizer.all_reduce_grads, the gradients of the passed model are reduced when model_state.ready_to_all_reduce_grads is True; otherwise the gradients are not reduced.

When training more than one model, the gradients of the first model are reduced and then model_state.ready_to_all_reduce_grads is set to False. The gradients of the second model are therefore not reduced, since model_state.ready_to_all_reduce_grads is already False.

https://github.com/Azure/MS-AMP/blob/main/msamp/optim/optimizer.py#LL47C1-L55C54

    def all_reduce_grads(self, model):
        """All-reduce gradients of parameters."""
        if not model_state.ready_to_all_reduce_grads:  # skip when model_state.ready_to_all_reduce_grads is False
            return
        scaling_params = [p for p in model.parameters() if isinstance(p, ScalingParameter)]
        grads = [p.grad for p in scaling_params if p.grad is not None]
        TensorDist.all_reduce_avg(grads)
        # make sure that FP8 weight gradients have been reduced.
        model_state.ready_to_all_reduce_grads = False

How to reproduce it?:

model1 = Model1()
model2 = Model2()

loss = loss_func1(model1(x), target1) + loss_func2(model2(x), target2)
loss.backward()

# `model_state.ready_to_all_reduce_grads` will be set to False. The gradients of model1 are reduced,
optimizer.all_reduce_grads(model1)
# [BUG Here] `model_state.ready_to_all_reduce_grads` has been False. The gradients of model2 will not be reduced,
optimizer.all_reduce_grads(model2)

Log message or snapshot?:

Additional information:
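One hedged workaround at the call site (a sketch using the same helpers as the quoted method, not the project's fix) is to collect the ScalingParameter gradients of all models and reduce them in a single call before the global flag flips:

# Reduce the gradients of every model while model_state.ready_to_all_reduce_grads is still True.
scaling_params = [p for m in (model1, model2) for p in m.parameters() if isinstance(p, ScalingParameter)]
grads = [p.grad for p in scaling_params if p.grad is not None]
TensorDist.all_reduce_avg(grads)
model_state.ready_to_all_reduce_grads = False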

MS-AMP does not support PyTorch 2.1

What's the issue, what's expected?:
MS-AMP cannot be installed with PyTorch 2.0.

How to reproduce it?:
Install MS-AMP in docker nvcr.io/nvidia/pytorch:23.01-py3.

sudo docker run -it  -d --name=torchtest --privileged --net=host --ipc=host --gpus=all nvcr.io/nvidia/pytorch:23.01-py3  bash
sudo docker exec -it torchtest bash

git clone https://github.com/Azure/MS-AMP.git
cd MS-AMP
python -m pip install --upgrade pip
pip install -e .
make postinstall

Log message or snapshot?:
(screenshot attached to the original issue)

Additional information:

[Question] Is MS-AMP going to support ZeRO-2 + PP?

class MSAMPPipelineEngine(MSAMPDeepSpeedEngine, PipelineEngine):
    """Pipeline engine supports pipeline+ZeRO-2+BF16."""
    def _exec_reduce_grads(self):
        """Reduce gradients across pipeline stages."""
        self._force_grad_boundary = True
        if self.pipeline_enable_backward_allreduce:
            if self.bfloat16_enabled():
                if self.zero_optimization_stage() == ZeroStageEnum.disabled:
                    self._bf16_reduce_grads()
                elif self.zero_optimization_stage() == ZeroStageEnum.optimizer_states:
                    self.allreduce_gradients(bucket_size=MEMORY_OPT_ALLREDUCE_SIZE)
                else:
                    raise NotImplementedError('PP+BF16 only work for ZeRO Stage 1')
            else:
                self.allreduce_gradients(bucket_size=MEMORY_OPT_ALLREDUCE_SIZE)
        self._force_grad_boundary = False

The docstring of the class says "Pipeline engine supports pipeline+ZeRO-2+BF16.", but it's not implemented.

When I tried ZeRO stage 2 without bf16 (with fp16, for example), I got the same error as in DeepSpeed:

Traceback (most recent call last):
  File "train.py", line 117, in <module>
    train_pipe(args)
  File "train.py", line 97, in train_pipe
    engine, _, _, _ = deepspeed.initialize(
  File "/usr/local/lib/python3.8/dist-packages/msamp/deepspeed/__init__.py", line 152, in initialize
    engine = MSAMPPipelineEngine(
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/pipe/engine.py", line 58, in __init__
    assert self.zero_optimization_stage() < 2, "ZeRO-2 and ZeRO-3 are incompatible with pipeline parallelism"
AssertionError: ZeRO-2 and ZeRO-3 are incompatible with pipeline parallelism

Here's the code that I tried. I followed #81 with small changes to match v0.3.0. ZeRO-1 + PP worked with this code.

#!/usr/bin/env python3

import os
import argparse

import torch
import torch.distributed as dist

import torchvision
import torchvision.transforms as transforms
from torchvision.models import AlexNet
from torchvision.models import vgg19

from deepspeed import PipelineModule
from msamp import deepspeed


def cifar_trainset(local_rank, dl_path='/tmp/cifar10-data'):
    transform = transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
        transforms.ConvertImageDtype(dtype=torch.bfloat16),
        transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ])

    # Ensure only one rank downloads.
    # Note: if the download path is not on a shared filesytem, remove the semaphore
    # and switch to args.local_rank
    dist.barrier()
    if local_rank != 0:
        dist.barrier()
    trainset = torchvision.datasets.CIFAR10(root=dl_path,
                                            train=True,
                                            download=True,
                                            transform=transform)
    if local_rank == 0:
        dist.barrier()
    return trainset

def get_args():
    parser = argparse.ArgumentParser(description='CIFAR')
    parser.add_argument('--local_rank',
                        type=int,
                        default=-1,
                        help='local rank passed from distributed launcher')
    parser.add_argument('-s',
                        '--steps',
                        type=int,
                        default=100,
                        help='quit after this many steps')
    parser.add_argument('-p',
                        '--pipeline-parallel-size',
                        type=int,
                        default=2,
                        help='pipeline parallelism')
    parser.add_argument('--backend',
                        type=str,
                        default='nccl',
                        help='distributed backend')
    parser.add_argument('--seed', type=int, default=1138, help='PRNG seed')
    parser = deepspeed.add_config_arguments(parser)
    args = parser.parse_args()
    return args


def join_layers(vision_model):
    layers = [
        *vision_model.features,
        vision_model.avgpool,
        lambda x: torch.flatten(x, 1),
        *vision_model.classifier,
    ]
    return layers


def train_pipe(args, part='parameters'):
    torch.manual_seed(args.seed)
    # deepspeed.runtime.utils.set_random_seed(args.seed)
    #
    # Build the model
    #

    # VGG also works :-)
    #net = vgg19(num_classes=10)
    net = AlexNet(num_classes=10)
    net = PipelineModule(layers=join_layers(net),
                         loss_fn=torch.nn.CrossEntropyLoss(),
                         num_stages=args.pipeline_parallel_size,
                         partition_method=part,
                         activation_checkpoint_interval=0)

    trainset = cifar_trainset(args.local_rank)

    engine, _, _, _ = deepspeed.initialize(
        args=args,
        model=net,
        model_parameters=[p for p in net.parameters() if p.requires_grad],
        training_data=trainset)

    for step in range(args.steps):
        loss = engine.train_batch()


if __name__ == '__main__':
    args = get_args()

    deepspeed.init_distributed(dist_backend=args.backend)
    args.local_rank = int(os.environ['LOCAL_RANK'])
    torch.cuda.set_device(args.local_rank)

    if args.pipeline_parallel_size == 0:
        train_base(args)
    else:
        train_pipe(args)

Is there any plan to support ZeRO-2 + PP in MS-AMP roadmap? Or is ZeRO-2 + PP impossible empirically or theoretically?

`make postinstall` failed due to undeclared symbols.

What's the issue, what's expected?:
Building MS-AMP fails because ncclFp8E4M3 and ncclFp8E5M2 cannot be found.

How to reproduce it?:

cd third_party/nccl
make -j src.build NVCC_GENCODE="-gencode=arch=compute_80,code=sm_80"

apt-get update
apt install build-essential devscripts debhelper fakeroot
make pkg.debian.build
dpkg -i build/pkg/deb/libnccl2_*.deb

cd -
python3 -m pip install --upgrade pip
python3 -m pip install .
make postinstall

Log message or snapshot?:
(screenshot attached to the original issue)

Additional information:

V0.4 Release Plan

Release Manager

@cp5555

Endgame

  • Code freeze: Feb. 9th, 2024
  • Bug Bash date: Feb. 12th, 2024
  • Release date: Feb. 23rd, 2024

Main Features

MS-AMP O3 Optimization

  • 1. Support auto scaling factor tuning (ASFT) for FP8 collective communication (Related to #41, #140)
  • 2. Support PyTorch FSDP (Related to #122)

MS-AMP Improvement

  • 1. Move extension installation from post install to setup.py (Related to #43)
  • 2. Improve FP8 kernel performance in MS-AMP (#132)
  • 3. MS-AMP support on different devices (Nvidia A100 and AMD MI300X)

MS-AMP Examples

  • 1. Release the datapoints (Related to #115)

Backlog

FP8 on Activation

Question: Difficulty of FP8 + ZeRO

Directly applying FP8 to ZeRO is infeasible, because it is difficult to handle the scaling factors associated with the FP8 partitions. The per-tensor scaling factors should be distributed along with FP8 partitions. [...] In this way, the tensor scaling factors can be distributed along with the tensors smoothly, while reducing communication and compute complexity

from page 7 of https://arxiv.org/pdf/2310.18313.pdf

I wanted to clarify exactly what is the challenge with the original ZeRO partitioning that is intra-parameter. Is the challenge solely a performance challenge (namely, (1) communicating scaling factors and (2) requiring scale computation for a single logical parameter on possibly multiple workers)? Or, are there other non-performance-related challenges as well?


Furthermore, in ZeRO distributed training, our method distributes each FP8 tensor along with its associated scaling factor as a whole, rather than partitioning the tensor into splits across GPUs. This strategy not only results in more GPU memory savings but also maintains a balanced memory load across GPUs, as demonstrated in Tab. 8.

I looked at this problem very briefly as an intern a few years ago (pytorch/pytorch#59410), so this design choice was curious to me.

I have not read your implementation, but I was wondering: would changing from an intra-parameter sharding (like existing ZeRO) to an inter-parameter sharding make it challenging to implement a performant ZeRO-2 or ZeRO-3, where the consideration is that all-gather (for intra-parameter sharding) may be more performant than a broadcast, or group of broadcasts, (for inter-parameter sharding)?

Moreover, do you have any concern around the load balancing when the data-parallel degree is larger (e.g. >= 128 GPUs)?

MNIST single GPU example: GradScaler AssertionError

What's the issue, what's expected?:
python mnist.py --enable-msamp --opt-level=O2 should work with the versions pinned in pyproject.toml. Specifically, it should work with torch==2.2.1, given that torch is unpinned.

How to reproduce it?:
build MS-AMP with torch==2.2.1.

Log message or snapshot?:

$ python mnist.py --enable-msamp --opt-level=O2
[2024-03-05 14:56:15,819] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
msamp is enabled, opt_level: O2
Traceback (most recent call last):
  File "/home/a/MS-AMP/examples/mnist.py", line 185, in <module>
    main()
  File "/home/a/MS-AMP/examples/mnist.py", line 176, in main
    train(args, model, device, train_loader, optimizer, epoch)
  File "/home/a/MS-AMP/examples/mnist.py", line 73, in train
    scaler.step(optimizer)
  File "/home/a/venv/lib/python3.10/site-packages/torch/cuda/amp/grad_scaler.py", line 447, in step
    self.unscale_(optimizer)
  File "/home/a/venv/lib/python3.10/site-packages/torch/cuda/amp/grad_scaler.py", line 337, in unscale_
    optimizer_state["found_inf_per_device"] = self._unscale_grads_(
  File "/home/a/venv/lib/python3.10/site-packages/torch/cuda/amp/grad_scaler.py", line 255, in _unscale_grads_
    assert isinstance(param, torch.Tensor)
AssertionError

Additional information:
This occurs because

  1. the isinstance check was introduced in this torch commit
  2. optimizer.param_groups[:,'params'] contains ScalingParameters
  3. ScalingParameters subclass ScalingTensor which subclasses nothing, so the isinstance check fails

Commenting out the assertion line manually fixes the issue. I do not know how to reasonably fix this without resorting to that.
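A minimal illustration of the failure mode described in point 3 (the class bodies below are stand-ins, not MS-AMP's definitions):

import torch

class ScalingTensor:                    # does not subclass torch.Tensor, per point 3
    pass

class ScalingParameter(ScalingTensor):
    pass

param = ScalingParameter()
print(isinstance(param, torch.Tensor))  # False, so the assert in GradScaler._unscale_grads_ trips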

NCCL building failed without specifying NVCC_GENCODE

What's the issue, what's expected?:
NCCL building fails if we don't specify NVCC_GENCODE. According to the NCCL documentation, specifying NVCC_GENCODE only speeds up compilation. The build should not fail even if we don't pass NVCC_GENCODE.

How to reproduce it?:

git clone https://github.com/Azure/MS-AMP.git
cd MS-AMP
git submodule update --init --recursive
cd third_party/nccl
make -j src.build

Log message or snapshot?:
(screenshot attached to the original issue)

Additional information:

MS-AMP needs a website

What would you like to be added:
A website for MS-AMP.

Why is this needed:
We want to provide our customers with more comprehensive information, including details about the design, user tutorials, and developer guides.

Without this feature, how does current msamp work
Currently, msamp uses README.md to introduce itself.

Components that may involve changes:

Brief description of your proposal if any:
The superbench website is a good reference for us.

Cannot run mnist_ddp.py when using PyTorch 1.14

What's the issue, what's expected?:
Running mnist_ddp.py with PyTorch 1.14 fails.

How to reproduce it?:
Start a docker container

docker run -it -d --name=torch_test --privileged --net=host --ipc=host --gpus=all -v /:/data superbench/dev:cuda11.8 bash
docker exec -it torch_test bash

Please make sure the pytorch version is 1.14.
Install MS-AMP following the README.md and run the mnist_ddp.py example.

Log message or snapshot?:
(screenshot attached to the original issue)

Additional information:

Replace dist_op with fp8_op

What's the issue, what's expected?:
After introducing fp8_op, dist_op can be removed.

How to reproduce it?:

Log message or snapshot?:

Additional information:

Support ZeRO

What would you like to be added:
We'd like to support ZeRO, so that users who use ZeRO for training DL models can easily enable FP8 with MS-AMP.

Why is this needed:
ZeRO is widely used for training large-scale deep learning models. People who use ZeRO also want to enable FP8 to save memory and improve efficiency.

Without this feature, how does current msamp work
Currently, MS-AMP does not support ZeRO.

Components that may involve changes:
Will add a new component in MS-AMP for ZeRO.

Brief description of your proposal if any:

Auto scaling factor tuning for FP8 collective communication

What would you like to be added:
Tune the scaling factor automatically for FP8 collective communication.

Why is this needed:
Reducing the scaling factor to the minimum value across all GPUs may cause underflow. What's more, the reduced gradient may not be the real average gradient.

Without this feature, how does current msamp work
In FP8Linear's backward, the scaling factor's minimum value is synchronized across all GPUs. In the optimizer's all_reduce_grads, the average gradient is synchronized across all GPUs: it first computes the sum and then divides the sum by the world size. Computing the sum may cause overflow.

Components that may involve changes:

Brief description of your proposal if any:
Currently we do not have a perfect solution for this.
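For discussion, here is a hedged sketch of the flow described under "Without this feature" (the dequantize/quantize helpers are hypothetical placeholders, not MS-AMP's API):

import torch
import torch.distributed as dist

def all_reduce_avg_fp8_grads(grad_fp8, scaling_factor, dequantize, quantize):
    # 1. Agree on a single scaling factor so every rank interprets the FP8 payload identically.
    dist.all_reduce(scaling_factor, op=dist.ReduceOp.MIN)
    # 2. Reduce in a wider dtype to limit overflow when summing, then average.
    grad = dequantize(grad_fp8, scaling_factor).to(torch.float32)
    dist.all_reduce(grad, op=dist.ReduceOp.SUM)
    grad /= dist.get_world_size()
    # 3. Re-quantize the averaged gradient with the shared factor.
    return quantize(grad, scaling_factor)

The underflow concern above comes from step 1: always taking the minimum scaling factor across ranks.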

Support FP8 ProcessGroup in pytorch

What's the issue, what's expected?:
GPT training needs an FP8-capable ProcessGroup to all-reduce FP8 gradients in data parallelism.
The PyTorch ProcessGroup does not support the FP8 data types.

How to reproduce it?:
Here is the ProcessGroupNCCL definition in PyTorch:
https://github.com/pytorch/pytorch/blob/master/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp#L53
FP8 support needs to be added, e.g. {at::kByte, ncclFp8E4M3}, {at::kChar, ncclFp8E5M2}.

Log message or snapshot?:
(screenshot attached to the original issue)

Additional information:
We may work around it by compiling PyTorch ourselves while waiting for official FP8 support in PyTorch.

unit-test for multi-process training

What would you like to be added:
unit-test for multi-process training

Why is this needed:
Current unit tests are all executed in a single process (single GPU). They do not cover the multi-process cases.

The following cases need unit-tests on multi-process.

  1. msamp/nn/linear.py
    https://github.com/Azure/MS-AMP/blob/main/msamp/nn/linear.py#L171
    https://github.com/Azure/MS-AMP/blob/main/msamp/nn/linear.py#L258
  2. msamp/common/tensor/cast.py
    https://github.com/Azure/MS-AMP/blob/main/msamp/common/tensor/cast.py#L45
    https://github.com/Azure/MS-AMP/blob/main/msamp/common/tensor/cast.py#L79
  3. msamp/operators/dist_op
  4. msamp/optim/optimizer.py
    https://github.com/Azure/MS-AMP/blob/main/msamp/optim/optimizer.py#L47

Without this feature, how does current msamp work
It works, but potential bugs may exist when training with multiple processes.

Components that may involve changes:
unit-test

Brief description of your proposal if any:
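A minimal sketch of one way to run such tests, assuming a pytest-style test that spawns workers with torch.multiprocessing and the gloo backend (an assumed pattern, not the project's actual harness):

import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def _worker(rank, world_size):
    os.environ['MASTER_ADDR'] = '127.0.0.1'
    os.environ['MASTER_PORT'] = '29500'
    dist.init_process_group('gloo', rank=rank, world_size=world_size)
    t = torch.ones(4) * (rank + 1)
    dist.all_reduce(t, op=dist.ReduceOp.SUM)           # 1 + 2 = 3 on every rank
    assert torch.allclose(t, torch.full((4,), 3.0))
    dist.destroy_process_group()

def test_all_reduce_multiprocess():
    mp.spawn(_worker, args=(2,), nprocs=2, join=True)  # spawn two worker processes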

Question about FP8 matmul coverage in FP8-LM

Hello, I appreciate your pioneering work and believe this is a promising direction for future LLMs.

As far as I can tell from reading the code, msamp provides FP8 training by:

  • overriding torch.nn.functional with msamp.nn._FP8GemmFunction to execute FP8 matmul via the TransformerEngine API.
  • calling msamp.te.TeReplacer or msamp.nn.LinearReplacer to replace the model's submodules with FP8-training-compatible instances such as FP8Linear.
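For context, a hedged usage sketch based on the class names above (the exact signature of LinearReplacer.replace is an assumption):

import torch
from msamp.nn import LinearReplacer

model = torch.nn.Sequential(
    torch.nn.Linear(16, 16), torch.nn.GELU(), torch.nn.Linear(16, 4)
).cuda()
model = LinearReplacer.replace(model)  # nn.Linear modules become FP8-aware FP8Linear modules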

I also read through MS-AMP-Example, but I don't know about the following points of FP8-LM implementation.

  1. Are the matmuls in the multi-head attention or flash attention modules executed in FP8?
  2. Do the input and positional embedding layers remain FP16 for the matmul input?

Thank you.

NVLink bandwidth of H100 FP8 is only 1/10 of H100 FP16

What's the issue, what's expected?:
The NVLink bandwidth of FP8 is 1/10 of FP16 on H100. The expected ratio is about 1/2.

How to reproduce it?:
Launch two training jobs with multiple H100 GPUs: one uses MS-AMP FP8 and the other uses FP16 AMP. Monitor nv_link_tx_bytes/nv_link_rx_bytes using dcgmi. You will observe that the NVLink bandwidth of FP8 is only 1/10 of FP16.

Log message or snapshot?:

(screenshot attached to the original issue)

Additional information:
It may be related to the implementation of all-reduce in MS-AMP. In all-reduce, the tensors are distributed by bucket, and the bucket size is 512 MB. Please refer to https://github.com/Azure/MS-AMP/blob/main/msamp/common/tensor/tensor_dist.py.
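A generic illustration of the bucketing idea (not tensor_dist.py itself; the bucket handling there may differ, e.g. buckets may be spread across ranks):

import torch
import torch.distributed as dist

BUCKET_BYTES = 512 * 1024 * 1024  # the 512 MB bucket size mentioned above

def bucketed_all_reduce(tensors):
    # Flatten many small gradients into large contiguous buffers so each NCCL call moves more data.
    bucket, nbytes = [], 0
    for t in tensors:
        bucket.append(t)
        nbytes += t.numel() * t.element_size()
        if nbytes >= BUCKET_BYTES:
            _flush(bucket)
            bucket, nbytes = [], 0
    if bucket:
        _flush(bucket)

def _flush(bucket):
    flat = torch.cat([t.reshape(-1) for t in bucket])  # assumes a common dtype per bucket
    dist.all_reduce(flat)
    offset = 0
    for t in bucket:  # copy the reduced values back into the original tensors
        n = t.numel()
        t.copy_(flat[offset:offset + n].reshape(t.shape))
        offset += n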

Training curve datapoints or smoothing

Hello. Would it be possible to release the datapoints for the curves in Figure 4? I'm primarily interested in applying some smoothing to see how close the curves are after smoothing. Alternatively, adding a bit of smoothing and just showing how it looks would be interesting too :)

Fig 4 of
https://arxiv.org/abs/2310.18313

V0.2.0 Test Plan

V0.2.0 Test Plan

Test table

| Machine Type | #Node * #GPU * GPU type | PyTorch Version | Accelerated Computing Toolkit | Docker Image |
|---|---|---|---|---|
| NDV4 | 1 * 8 * A100 80GB | 1.14 | CUDA 11.8 | nvcr.io/nvidia/pytorch:22.12-py3 |
| NDV4 | 1 * 8 * A100 80GB | 2.1 | CUDA 12.1 | nvcr.io/nvidia/pytorch:23.04-py3 |
| NDV5 | 1 * 8 * H100 80GB | 1.14 | CUDA 11.8 | nvcr.io/nvidia/pytorch:22.12-py3 |
| NDV5 | 1 * 8 * H100 80GB | 2.1 | CUDA 12.1 | nvcr.io/nvidia/pytorch:23.04-py3 |

Test cases

Installation

  • Install nccl
  • Install ms-amp
  • unit test

Run MNIST

  • MNIST on single card
  • MNIST on multi cards

Run CIFAR10

  • CIFAR10 using deepspeed
  • CIFAR10 using deepspeed with msamp
  • CIFAR10 using deepspeed-ZeRO with msamp

Swin-Transformer

  • Swin-Tiny
  • Swin-Giant for demonstrating memory saving

Deit

  • Deit-Small
  • Deit-Large for demonstrating memory saving

RoBERTa

  • RoBERTa amp
  • RoBERTa ms-amp

GPT-3

  • GPT3-345M fp16/msamp
  • GPT3-13B bf16/msamp

Docker test

  • Pull ghcr.io/azure/msamp:main-cuda11.8, run UT and all examples in docker
  • Pull ghcr.io/azure/msamp:main-cuda12.1, run UT and all examples in docker

Support for latest Megatron-LM and transformer-engine 1.0 +

Thank you for such a great and exciting project !

What would you like to be added:
Support for latest Megatron-LM and transformer-engine 1.0 +

Why is this needed:
The latest Megatron-LM supports context parallelism and expert parallelism with transformer-engine 1.0+, which helps train LLMs with long context and MoE models!

Please update obsolete dependencies

This library is currently pinned to Flash Attention 1.0.9, Transformer Engine 0.11, and DeepSpeed 0.9.2. All of these are approximately six months old and lack important new features (e.g. DeepSpeed's ZeRO++). Please update the dependency versions, thanks :)

Questions about error reporting

What's the issue, what's expected?:
Hello, I had an NVLink reference error when I installed the environment myself. When I installed MSCCL using make -j src.build NVCC_GENCODE="-gencode=arch=compute_90,code=sm_90", an error occurred. Another question is whether this project can run on the H800. I did not use --privileged --net=host --ipc=host when creating a new container; will this also have an impact?

How to reproduce it?:
My system environment is Ubuntu 22.04, Python 3.11, CUDA 11.8, and torch 2.0.1. The hardware environment is 8x H800.

Log message or snapshot?:

Additional information:

Optimizer compilation fails with PyTorch 2.2

What's the issue, what's expected?:

I tried to compile the MS-AMP optimizer with the new Torch 2.2:

cd msamp/optim
pip install -v .

but got this error:

    File "/scratch/brr/MS-AMP/msamp/optim/setup.py", line 7, in <module>
      from torch.utils import cpp_extension
    File "/scratch/miniconda3/envs/brr/lib/python3.10/site-packages/torch/__init__.py", line 237, in <module>
      from torch._C import *  # noqa: F403
  ImportError: /scratch/miniconda3/envs/brr/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so: undefined symbol: ncclCommRegister
  error: subprocess-exited-with-error
  
  × python setup.py egg_info did not run successfully.
  │ exit code: 1
  ╰─> See above for output.

How to reproduce it?:

Running this code in Python reproduces the error:

>>> import torch
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/scratch/miniconda3/envs/brr/lib/python3.10/site-packages/torch/__init__.py", line 237, in <module>
    from torch._C import *  # noqa: F403
ImportError: /scratch/miniconda3/envs/brr/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so: undefined symbol: ncclCommRegister

Log message or snapshot?:

See above

Additional information:

My best guess is that this is caused by MS-AMP being pinned to an external old version of libnccl (2.17.1), while PyTorch 2.2 seems to depend on a newer version (2.19.3).

V0.3.0 Test Plan

V0.3.0 Test Plan

Test table

| Machine Type | #Node * #GPU * GPU type | PyTorch Version | Accelerated Computing Toolkit | Docker Image |
|---|---|---|---|---|
| NDV4 | 1 * 8 * A100 80GB | 1.14 | CUDA 11.8 | nvcr.io/nvidia/pytorch:22.12-py3 |
| NDV4 | 1 * 8 * A100 80GB | 2.1 | CUDA 12.1 | nvcr.io/nvidia/pytorch:23.04-py3 |
| NDV5 | 1 * 8 * H100 80GB | 1.14 | CUDA 11.8 | nvcr.io/nvidia/pytorch:22.12-py3 |
| NDV5 | 1 * 8 * H100 80GB | 2.1 | CUDA 12.1 | nvcr.io/nvidia/pytorch:23.04-py3 |

Test cases

Installation

  • Install nccl
  • Install ms-amp
  • unit test

Run MNIST

  • MNIST on single card
  • MNIST on multi cards

Run CIFAR10

  • CIFAR10 using deepspeed
  • CIFAR10 using deepspeed with msamp
  • CIFAR10 using deepspeed-ZeRO with msamp

Swin-Transformer

  • Swin-Tiny
  • Swin-Giant for demonstrating memory saving

Deit

  • Deit-Small
  • Deit-Large for demonstrating memory saving

RoBERTa

  • RoBERTa amp
  • RoBERTa ms-amp

GPT-3

  • GPT3-345M fp16/msamp using Megatron-LM
  • GPT3-13B bf16/msamp using Megatron-LM
  • GPT3-345M fp16/msamp using Megatron-DeepSpeed
  • GPT3-13B bf16/msamp using Megatron-DeepSpeed

Docker test

  • Pull ghcr.io/azure/msamp:main-cuda11.8, run UT and all examples in docker
  • Pull ghcr.io/azure/msamp:main-cuda12.1, run UT and all examples in docker
  • Build docker image from docker file, run UT and MNIST/CIFAR10 in docker container

MS-AMP crashes with DeepSpeed ZeRO 3

I am fine-tuning Facebook's OPT-1.3B on 2x 4090 GPUs, using Ubuntu 22.04, PyTorch 2.1.0, CUDA 12.1, and HuggingFace Accelerate, using this code from the HuggingFace examples repo:

https://github.com/huggingface/transformers/blob/main/examples/pytorch/language-modeling/run_clm_no_trainer.py

When using DeepSpeed ZeRO 3 for partitioning model states with optimization level O3, MS-AMP crashes with this stack trace:

Traceback (most recent call last):
File "/home/alyssa/lm_fun/run_clm.py", line 769, in
main()
File "/home/alyssa/lm_fun/run_clm.py", line 583, in main
model, optimizer, train_dataloader, eval_dataloader, lr_scheduler = accelerator.prepare(
File "/home/alyssa/anaconda3/envs/lm_fun/lib/python3.10/site-packages/accelerate/accelerator.py", line 1284, in prepare
result = self._prepare_deepspeed(*args)
File "/home/alyssa/anaconda3/envs/lm_fun/lib/python3.10/site-packages/accelerate/accelerator.py", line 1667, in _prepare_deepspeed
engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)
File "/home/alyssa/anaconda3/envs/lm_fun/lib/python3.10/site-packages/msamp/deepspeed/init.py", line 119, in initialize
config_class = MSAMPDeepSpeedConfig(config, mpu)
File "/home/alyssa/anaconda3/envs/lm_fun/lib/python3.10/site-packages/deepspeed/runtime/config.py", line 777, in init
self._do_sanity_check()
File "/home/alyssa/anaconda3/envs/lm_fun/lib/python3.10/site-packages/deepspeed/runtime/config.py", line 957, in _do_sanity_check
self._do_error_check()
File "/home/alyssa/anaconda3/envs/lm_fun/lib/python3.10/site-packages/msamp/deepspeed/runtime/config.py", line 40, in _do_error_check
self.zero_optimization_stage in [ZeroStageEnum.optimizer_states, ZeroStageEnum.gradients],
AssertionError: MS-AMP O3 requires ZeRO with optimizer_states or gradients partitioning.

When I switch to optimization level O2, it instead crashes with this stack trace, presumably because the MS-AMP cast.py code doesn't expect DeepSpeed's parameter partitioning:

Traceback (most recent call last):
File "/home/alyssa/lm_fun/run_clm.py", line 769, in
main()
File "/home/alyssa/lm_fun/run_clm.py", line 583, in main
model, optimizer, train_dataloader, eval_dataloader, lr_scheduler = accelerator.prepare(
File "/home/alyssa/anaconda3/envs/lm_fun/lib/python3.10/site-packages/accelerate/accelerator.py", line 1284, in prepare
result = self._prepare_deepspeed(*args)
File "/home/alyssa/anaconda3/envs/lm_fun/lib/python3.10/site-packages/accelerate/accelerator.py", line 1667, in _prepare_deepspeed
engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)
File "/home/alyssa/anaconda3/envs/lm_fun/lib/python3.10/site-packages/msamp/deepspeed/init.py", line 135, in initialize
engine = MSAMPDeepSpeedEngine(
File "/home/alyssa/anaconda3/envs/lm_fun/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 304, in init
self._configure_optimizer(optimizer, model_parameters)
File "/home/alyssa/anaconda3/envs/lm_fun/lib/python3.10/site-packages/msamp/deepspeed/runtime/engine.py", line 81, in _configure_optimizer
model, basic_optimizer = msamp_initialize(self.module, basic_optimizer, optlevel)
File "/home/alyssa/anaconda3/envs/lm_fun/lib/python3.10/site-packages/msamp/init.py", line 61, in initialize
cast_model = LinearReplacer.replace(model)
File "/home/alyssa/anaconda3/envs/lm_fun/lib/python3.10/site-packages/msamp/nn/linear.py", line 176, in replace
Traceback (most recent call last):
File "/home/alyssa/lm_fun/run_clm.py", line 769, in
model = cls._replace(model, weight_qtype)
File "/home/alyssa/anaconda3/envs/lm_fun/lib/python3.10/site-packages/msamp/nn/linear.py", line 158, in _replace
setattr(model, child_name, cls._replace(child, weight_qtype))
File "/home/alyssa/anaconda3/envs/lm_fun/lib/python3.10/site-packages/msamp/nn/linear.py", line 158, in _replace
setattr(model, child_name, cls._replace(child, weight_qtype))
File "/home/alyssa/anaconda3/envs/lm_fun/lib/python3.10/site-packages/msamp/nn/linear.py", line 158, in _replace
setattr(model, child_name, cls._replace(child, weight_qtype))
[Previous line repeated 3 more times]
File "/home/alyssa/anaconda3/envs/lm_fun/lib/python3.10/site-packages/msamp/nn/linear.py", line 154, in _replace
main()
File "/home/alyssa/lm_fun/run_clm.py", line 583, in main
fp8_net = cls._build_fp8linear(model, weight_qtype)
model, optimizer, train_dataloader, eval_dataloader, lr_scheduler = accelerator.prepare(
File "/home/alyssa/anaconda3/envs/lm_fun/lib/python3.10/site-packages/accelerate/accelerator.py", line 1284, in prepare
File "/home/alyssa/anaconda3/envs/lm_fun/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/alyssa/anaconda3/envs/lm_fun/lib/python3.10/site-packages/msamp/nn/linear.py", line 98, in _build_fp8linear
weight = weight.cast(weight_qtype)
result = self._prepare_deepspeed(*args) File "/home/alyssa/anaconda3/envs/lm_fun/lib/python3.10/site-packages/msamp/common/tensor/tensor.py", line 703, in _cast_to_scalingtensor

File "/home/alyssa/anaconda3/envs/lm_fun/lib/python3.10/site-packages/accelerate/accelerator.py", line 1667, in _prepare_deepspeed
return ScalingTensor(TypeCast.cast_to_fp16(self, meta, sync=sync), meta=meta)
File "/home/alyssa/anaconda3/envs/lm_fun/lib/python3.10/site-packages/msamp/common/tensor/cast.py", line 81, in cast_to_fp16
meta.amax[0] = input.abs().max()
RuntimeError: max(): Expected reduction dim to be specified for input.numel() == 0. Specify the reduction dim with the 'dim' argument.
engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)
File "/home/alyssa/anaconda3/envs/lm_fun/lib/python3.10/site-packages/msamp/deepspeed/init.py", line 135, in initialize
engine = MSAMPDeepSpeedEngine(
File "/home/alyssa/anaconda3/envs/lm_fun/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 304, in init
self._configure_optimizer(optimizer, model_parameters)
File "/home/alyssa/anaconda3/envs/lm_fun/lib/python3.10/site-packages/msamp/deepspeed/runtime/engine.py", line 81, in _configure_optimizer
model, basic_optimizer = msamp_initialize(self.module, basic_optimizer, optlevel)
File "/home/alyssa/anaconda3/envs/lm_fun/lib/python3.10/site-packages/msamp/init.py", line 61, in initialize
cast_model = LinearReplacer.replace(model)
File "/home/alyssa/anaconda3/envs/lm_fun/lib/python3.10/site-packages/msamp/nn/linear.py", line 176, in replace
model = cls._replace(model, weight_qtype)
File "/home/alyssa/anaconda3/envs/lm_fun/lib/python3.10/site-packages/msamp/nn/linear.py", line 158, in _replace
setattr(model, child_name, cls._replace(child, weight_qtype))
File "/home/alyssa/anaconda3/envs/lm_fun/lib/python3.10/site-packages/msamp/nn/linear.py", line 158, in _replace
setattr(model, child_name, cls._replace(child, weight_qtype))
File "/home/alyssa/anaconda3/envs/lm_fun/lib/python3.10/site-packages/msamp/nn/linear.py", line 158, in _replace
setattr(model, child_name, cls._replace(child, weight_qtype))
[Previous line repeated 3 more times]
File "/home/alyssa/anaconda3/envs/lm_fun/lib/python3.10/site-packages/msamp/nn/linear.py", line 154, in _replace
fp8_net = cls._build_fp8linear(model, weight_qtype)
File "/home/alyssa/anaconda3/envs/lm_fun/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/alyssa/anaconda3/envs/lm_fun/lib/python3.10/site-packages/msamp/nn/linear.py", line 98, in _build_fp8linear
weight = weight.cast(weight_qtype)
File "/home/alyssa/anaconda3/envs/lm_fun/lib/python3.10/site-packages/msamp/common/tensor/tensor.py", line 703, in _cast_to_scalingtensor
return ScalingTensor(TypeCast.cast_to_fp16(self, meta, sync=sync), meta=meta)
File "/home/alyssa/anaconda3/envs/lm_fun/lib/python3.10/site-packages/msamp/common/tensor/cast.py", line 81, in cast_to_fp16
meta.amax[0] = input.abs().max()
RuntimeError: max(): Expected reduction dim to be specified for input.numel() == 0. Specify the reduction dim with the 'dim' argument.
[2023-11-14 17:42:01,194] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 543272) of binary: /home/alyssa/anaconda3/envs/lm_fun/bin/python3
Traceback (most recent call last):
File "/home/alyssa/anaconda3/envs/lm_fun/bin/accelerate", line 8, in
sys.exit(main())
File "/home/alyssa/anaconda3/envs/lm_fun/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main
args.func(args)
File "/home/alyssa/anaconda3/envs/lm_fun/lib/python3.10/site-packages/accelerate/commands/launch.py", line 979, in launch_command
deepspeed_launcher(args)
File "/home/alyssa/anaconda3/envs/lm_fun/lib/python3.10/site-packages/accelerate/commands/launch.py", line 695, in deepspeed_launcher
distrib_run.run(args)
File "/home/alyssa/anaconda3/envs/lm_fun/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
elastic_launch(
File "/home/alyssa/anaconda3/envs/lm_fun/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/alyssa/anaconda3/envs/lm_fun/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

Support for MS-AMP in FSDP

What would you like to be added:
Support for MS-AMP in FSDP.

Why is this needed:
This will help train large models with optimizer state sharding.

FP8 in tensor parallel region question

Hello, I have a couple of questions about the FP8 operations circled in the Fig 2. from the paper:

(Fig. 2 screenshot attached to the original issue)

  • First green circle: in Megatron-LM (e.g. the gpt-3 example), where does the conversion to FP8 happen after the layer norm?
  • For the pink circle, why do we need to convert to FP8 again? Doesn't the g op already give us the result in FP8?
  • For the blue circle, why do we need to cast here again? As I understand it, the LinearReplacer replaces all the nn.Linear modules; this should include the linear layer before the GELU, so the output of that layer should already be in FP8, no?

I suspect I'm misunderstanding the convention of the "FP8 low-bit" orange box in the image, I'd appreciate if you can help me clear up my confusion :)

Huggingface Accelerate Support

Hi all! I'm looking at this library as an alternative to Transformer Engine for the accelerate project and had a few general questions:

  • One pain point with TE is that HF models (such as Llama-2) have many custom layers doing ops that generally can be replaced with TE/FP8-specific ops, so this cannot be done OOTB easily. Have you tried your framework on models from HF (such as bloomz-3b or falcon-7b)? If so, does MS-AMP already know the layers/ops to modify, or is there something specific we have to do for each individual layer where it can be done?
  • Does the framework support all FP8-compatible graphics cards? (e.g. can I test this on some local 4090's)

Looking forward to hearing back, we're very excited about this framework!
