TorchDistPackage

TorchDistPackage provides some easy-to-use modules and tools for Distributed Training in PyTorch.

The project is under construction. You are welcome to use it and contribute.

Main features overview (in Chinese)

Installation and usage

  • install
git clone https://github.com/KimmiShi/TorchDistPackage.git
cd TorchDistPackage
pip install -e . # or pip install . --user
  • simple example
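import torch
import torch.distributed as dist
from torchdistpackage import setup_distributed, test_comm, tpc

# initialize torch.distributed
setup_distributed()

# initialize process groups; the data-parallel size is whatever remains
# after pipeline and tensor parallelism are carved out of the world size
world_size = dist.get_world_size()
pp_size = 2
tp_size = 2
dp_size = world_size // (tp_size * pp_size)
dist_config = [('data', dp_size), ('pipe', pp_size), ('tensor', tp_size)]
tpc.setup_process_groups(dist_config)

# test communication in groups
tmp = torch.rand([100, 1024]).cuda()

# collective: broadcast from the first rank of the 'model' group
dist.broadcast(tmp, tpc.get_ranks_in_group('model')[0], tpc.get_group('model'))

# p2p: the first pipeline stage sends to the next stage,
# the last pipeline stage receives from the previous stage
if tpc.is_first_in_pipeline_group():
    dist.send(tmp, tpc.get_next_global_rank('pipe'))
if tpc.is_last_in_pipeline_group():
    dist.recv(tmp, tpc.get_prev_global_rank('pipe'))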

Features

0. Simple DDP Module in PyTorch (pure Python implementation)

example: TestNaiveDdp

Highlights:

  • Python-only implementation that is easy to understand and debug.
  • Overlaps gradient reduction with computation, like PyTorch DDP.
  • For pipeline parallelism, reduces gradients only at the last micro-batch while still overlapping communication, which improves on the ColossalAI implementation.

Drawbacks/TODO:

  • The all-reduce launch seems to take more time than in PyTorch DDP for some models.
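The pipeline-parallel highlight above can be illustrated with a small simulation. This is a minimal sketch in plain Python, not the package's code: gradients are lists of floats, `all_reduce_mean` is a hypothetical stand-in for a real all-reduce, and the overlap of communication with computation is not modeled. It only shows that accumulating locally and reducing once at the last micro-batch gives the same averaged gradient with fewer communication calls.

```python
def all_reduce_mean(per_rank_vectors):
    """Simulated all-reduce: every rank receives the element-wise mean."""
    n_ranks = len(per_rank_vectors)
    mean = [sum(column) / n_ranks for column in zip(*per_rank_vectors)]
    return [list(mean) for _ in range(n_ranks)]


def reduce_every_micro_batch(micro_grads):
    """micro_grads[rank][micro_batch] is a gradient vector (list of floats)."""
    n_ranks = len(micro_grads)
    n_micro = len(micro_grads[0])
    dim = len(micro_grads[0][0])
    accumulated = [[0.0] * dim for _ in range(n_ranks)]
    calls = 0
    for m in range(n_micro):
        # One communication call per micro-batch.
        reduced = all_reduce_mean([micro_grads[r][m] for r in range(n_ranks)])
        calls += 1
        for r in range(n_ranks):
            accumulated[r] = [a + g for a, g in zip(accumulated[r], reduced[r])]
    return accumulated, calls


def reduce_at_last_micro_batch(micro_grads):
    """Accumulate locally over all micro-batches, then communicate once."""
    local = [[sum(column) for column in zip(*rank_grads)] for rank_grads in micro_grads]
    return all_reduce_mean(local), 1


grads = [
    [[1.0, 2.0], [3.0, 4.0]],  # rank 0: two micro-batches
    [[5.0, 6.0], [7.0, 8.0]],  # rank 1: two micro-batches
]
eager, eager_calls = reduce_every_micro_batch(grads)
lazy, lazy_calls = reduce_at_last_micro_batch(grads)
print(eager == lazy, eager_calls, lazy_calls)
```

Both strategies produce the same accumulated gradient because averaging is linear; the second one issues a single reduction per step instead of one per micro-batch.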

1. Initialize torch distributed from Slurm - torch_launch_from_slurm

Initializes torch.distributed from the environment of a Slurm job.

example
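As a rough illustration of what initializing from Slurm involves, the sketch below maps the per-task variables that Slurm sets (`SLURM_PROCID`, `SLURM_NTASKS`, `SLURM_LOCALID`) onto the variables that PyTorch's `env://` initialization reads. The function name `torch_env_from_slurm` and the default port are assumptions made for this example, not the package's API, and the master address is passed in by the caller (on a real cluster it is commonly the first host reported by `scontrol show hostnames`).

```python
def torch_env_from_slurm(slurm_env, master_addr, master_port=29500):
    """Translate Slurm task variables into torch.distributed env:// variables.

    Hypothetical helper for illustration; it does not touch os.environ.
    """
    return {
        "RANK": slurm_env["SLURM_PROCID"],         # global rank of this task
        "WORLD_SIZE": slurm_env["SLURM_NTASKS"],   # total number of tasks
        "LOCAL_RANK": slurm_env["SLURM_LOCALID"],  # rank of this task on its node
        "MASTER_ADDR": master_addr,
        "MASTER_PORT": str(master_port),
    }


# Example with a fake Slurm environment for the fourth task of an 8-task job.
fake_slurm_env = {"SLURM_PROCID": "3", "SLURM_NTASKS": "8", "SLURM_LOCALID": "1"}
torch_env = torch_env_from_slurm(fake_slurm_env, master_addr="node-0")
print(torch_env)
```

In a real launch script, the returned values would be written to `os.environ` before calling `torch.distributed.init_process_group` with `init_method="env://"`.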

2. Flexible process group initialization for Mixed Parallelism

See the main features overview for details.
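To make the idea of mixed-parallel group partitioning concrete, here is a minimal sketch that turns a config such as `[('data', 2), ('pipe', 2), ('tensor', 2)]` into lists of global ranks per group. The function `build_rank_groups` is hypothetical, and the rank layout (the last listed dimension varies fastest, so tensor-parallel peers are adjacent ranks) is an assumption of this example; the package's actual layout may differ.

```python
from itertools import product


def build_rank_groups(config, world_size):
    """Return {name: [rank lists]} for a config like [('data', 2), ('pipe', 2)].

    Assumption: ranks are laid out row-major over the config dimensions,
    so the last listed dimension varies fastest.
    """
    names = [name for name, _ in config]
    sizes = [int(size) for _, size in config]
    total = 1
    for size in sizes:
        total *= size
    if total != world_size:
        raise ValueError(f"group sizes multiply to {total}, expected {world_size}")

    # Stride of each dimension in the flattened rank index.
    strides = []
    stride = 1
    for size in reversed(sizes):
        strides.append(stride)
        stride *= size
    strides.reverse()

    groups = {}
    for axis, name in enumerate(names):
        other_axes = [i for i in range(len(sizes)) if i != axis]
        axis_groups = []
        # Fix the coordinates on all other axes and sweep along this axis.
        for coords in product(*(range(sizes[i]) for i in other_axes)):
            base = sum(c * strides[i] for c, i in zip(coords, other_axes))
            axis_groups.append([base + k * strides[axis] for k in range(sizes[axis])])
        groups[name] = axis_groups
    return groups


groups = build_rank_groups([("data", 2), ("pipe", 2), ("tensor", 2)], world_size=8)
for name, rank_lists in groups.items():
    print(name, rank_lists)
```

Each rank belongs to exactly one group per dimension, which is what lets data, pipeline, and tensor parallelism be combined on the same set of processes.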

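3. For Pipeline Parallelism

Usage example; test case reference

4. MoE Data Parallelism

Building on expert parallelism (Expert Parallel), the package supports MoE data parallelism: some experts are replicated, replicas of the same expert run data parallelism among themselves (initial parameter broadcast, gradient averaging), and different experts run expert parallelism.

Usage example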
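The split between expert parallelism and MoE data parallelism can be sketched as a rank-grouping rule. The example below is an illustration under stated assumptions, not the package's implementation: the function `moe_groups` is hypothetical, and it assumes that each block of `ep_size` consecutive ranks holds one full set of experts, with rank `r` holding expert shard `r % ep_size`.

```python
def moe_groups(world_size, ep_size):
    """Return (expert-parallel groups, MoE data-parallel groups) as rank lists.

    Assumption: consecutive blocks of ep_size ranks each hold one copy of
    all experts, and rank r holds expert shard r % ep_size.
    """
    if world_size % ep_size != 0:
        raise ValueError("world_size must be divisible by ep_size")
    # Ranks holding different experts of the same copy: expert parallelism.
    ep_groups = [list(range(start, start + ep_size))
                 for start in range(0, world_size, ep_size)]
    # Ranks holding replicas of the same expert: broadcast the initial
    # parameters and average gradients within each of these groups.
    moe_dp_groups = [list(range(shard, world_size, ep_size))
                     for shard in range(ep_size)]
    return ep_groups, moe_dp_groups


ep_groups, moe_dp_groups = moe_groups(world_size=8, ep_size=4)
print("expert parallel:", ep_groups)
print("MoE data parallel:", moe_dp_groups)
```

With 8 ranks and 4 expert shards, every expert has two replicas, so each MoE data-parallel group contains two ranks.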

5. Tensor Parallel & Sequence Parallel

A simple TP implementation.

Test case reference
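For readers new to tensor parallelism, the sketch below shows a column-parallel linear layer, one common TP building block, using plain Python lists in place of tensors. It is not necessarily how the package shards its layers; the helper names are hypothetical, and the "ranks" are simulated inside a single process.

```python
def matmul(x, w):
    """Multiply a (batch x in) matrix by an (in x out) matrix."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]


def split_columns(w, tp_size):
    """Split the output columns of w evenly into tp_size shards."""
    n_cols = len(w[0])
    if n_cols % tp_size != 0:
        raise ValueError("output dimension must be divisible by tp_size")
    step = n_cols // tp_size
    return [[row[i * step:(i + 1) * step] for row in w] for i in range(tp_size)]


def column_parallel_linear(x, w, tp_size):
    """Each simulated rank multiplies by its column shard; results are concatenated."""
    shards = split_columns(w, tp_size)
    partial = [matmul(x, shard) for shard in shards]  # one partial output per rank
    # Simulated all-gather along the feature dimension.
    return [sum((p[i] for p in partial), []) for i in range(len(x))]


x = [[1, 2], [3, 4]]
w = [[1, 0, 2, 1], [0, 1, 1, 3]]
full = matmul(x, w)
sharded = column_parallel_linear(x, w, tp_size=2)
print(full == sharded)
```

Each rank stores only a fraction of the weight matrix, and the concatenated result matches the unsharded computation.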

6. Hybrid ZeRO / Intra-node ZeRO - Faster multi-GPU ZeRO training

See the main features overview for details.

7. Sharded EMA

Reduces the GPU memory consumed by EMA; see the sharded EMA example.
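The memory saving of a sharded EMA comes from each rank tracking the moving average for only its own slice of the parameters. The following is a minimal sketch of that idea with a flat list of floats standing in for the model parameters; the class `ShardedEMA` and the helper `shard_bounds` are hypothetical names for this example, and the package's real interface may look different.

```python
def shard_bounds(numel, world_size, rank):
    """Return the [start, end) slice of a flat parameter list owned by rank."""
    per_rank = (numel + world_size - 1) // world_size  # ceiling division
    start = min(rank * per_rank, numel)
    return start, min(start + per_rank, numel)


class ShardedEMA:
    """Keeps an exponential moving average for one rank's parameter slice only."""

    def __init__(self, params, world_size, rank, decay=0.999):
        self.decay = decay
        self.start, self.end = shard_bounds(len(params), world_size, rank)
        self.shard = list(params[self.start:self.end])

    def update(self, params):
        d = self.decay
        for i, p in enumerate(params[self.start:self.end]):
            self.shard[i] = d * self.shard[i] + (1 - d) * p


# Simulate two ranks tracking five parameters (decay 0.5 keeps the numbers exact).
initial = [0.0, 0.0, 0.0, 0.0, 0.0]
emas = [ShardedEMA(initial, world_size=2, rank=r, decay=0.5) for r in range(2)]
for ema in emas:
    ema.update([2.0, 4.0, 6.0, 8.0, 10.0])

# Gathering the shards reconstructs the full EMA, e.g. for evaluation or checkpointing.
full_ema = [value for ema in emas for value in ema.shard]
print([len(ema.shard) for ema in emas], full_ema)
```

Each rank stores roughly `1 / world_size` of the EMA state, and the full average is only assembled when it is actually needed.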

Tools

1. model profiler

Hierarchical breakdown of time and GPU memory consumption. Reference

Contributors

kimmishi, shenglongz