
tau's Issues

Expose partitioning policy in `Pipe.from_tracing` as a programmable interface

Currently, we only partition the model based on the presence of IR.pipe_split (and its derivatives such as IR.annotate_split_points and IR.PipeSplitWrapper). However, someone might want to use the graph representation of the original code to automatically partition the model into stages: https://github.com/jamesr66a/PiPPy/blob/527af1fd8123d35bd81b9fe304a8d0ed29c9fd8d/pippy/IR.py#L376 We should expose an interface through which a user can supply a computation that partitions the graph.
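A minimal sketch of what such an interface could look like, assuming a hypothetical split_policy callback passed to Pipe.from_tracing that receives the traced fx.GraphModule and inserts pipe_split calls before splitting (the parameter name and callback shape are assumptions, not the existing API):

import torch.fx
from pippy.IR import Pipe, pipe_split

def split_every_n_call_modules(gm: torch.fx.GraphModule, n: int = 2) -> torch.fx.GraphModule:
    # Hypothetical policy: insert a pipe_split after every n-th call_module node.
    seen = 0
    for node in list(gm.graph.nodes):
        if node.op == 'call_module':
            seen += 1
            if seen % n == 0:
                with gm.graph.inserting_after(node):
                    gm.graph.call_function(pipe_split, args=())
    gm.recompile()
    return gm

# Hypothetical usage; `split_policy` does not exist on Pipe.from_tracing today:
# pipe = Pipe.from_tracing(model, split_policy=split_every_n_call_modules)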

Add unit tests for PipelineDriver.py

We have unit tests for IR.py in test/test_ir.py. We should add similar unit tests to exercise the functionality of PipelineDriver.py. We can use pytest-cov (run pytest --cov=pippy test/ in the repo root) to check how well we cover this file. As of the time of this writing, coverage looks like:

---------- coverage: platform linux, python 3.7.12-final-0 -----------
Name                      Stmts   Miss  Cover
---------------------------------------------
pippy/IR.py                 371      7    98%
pippy/PipelineDriver.py     394    330    16%
pippy/__init__.py             3      0   100%
pippy/version.py              2      2     0%
---------------------------------------------
TOTAL                       770    339    56%

RRef refcounting consistency issue

Repro:

$ test/launch_local_test_forward_backward.sh 
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
REPLICATE config: False -> MultiUseParameterConfig.TRANSMIT
GraphModule(
  (submod_0): GraphModule()
  (submod_1): GraphModule()
  (submod_2): GraphModule()
  (_loss): MSELoss()
)



def forward(self, x, target):
    submod_0 = self.submod_0(x)
    getitem_2 = submod_0[2]
    getitem = submod_0[0]
    getitem_1 = submod_0[1]
    submod_1 = self.submod_1(getitem, getitem_2)
    getitem_4 = submod_1[1]
    getitem_3 = submod_1[0]
    submod_2 = self.submod_2(getitem_3, getitem_1, getitem_4)
    _loss = self._loss(submod_2, target)
    stage_backward = pippy_IR_stage_backward(stage_output = _loss, output_grads = None, input_values = [submod_2, target]);  target = None
    getitem_5 = stage_backward[0]
    getitem_6 = stage_backward[1];  stage_backward = None
    getitem_7 = getitem_5[0]
    getitem_8 = getitem_5[1];  getitem_5 = None
    stage_backward_1 = pippy_IR_stage_backward(stage_output = submod_2, output_grads = getitem_7, input_values = [getitem_3, getitem_1, getitem_4]);  submod_2 = getitem_7 = getitem_3 = getitem_1 = getitem_4 = None
    getitem_9 = stage_backward_1[0]
    getitem_10 = stage_backward_1[1];  stage_backward_1 = None
    getitem_11 = getitem_9[0]
    getitem_12 = getitem_9[1]
    getitem_13 = getitem_9[2];  getitem_9 = None
    stage_backward_2 = pippy_IR_stage_backward(stage_output = submod_1, output_grads = [getitem_11, getitem_13], input_values = [getitem, getitem_2]);  submod_1 = getitem_11 = getitem_13 = getitem = getitem_2 = None
    getitem_14 = stage_backward_2[0]
    getitem_15 = stage_backward_2[1];  stage_backward_2 = None
    getitem_16 = getitem_14[0]
    getitem_17 = getitem_14[1];  getitem_14 = None
    stage_backward_3 = pippy_IR_stage_backward(stage_output = submod_0, output_grads = [getitem_16, getitem_12, getitem_17], input_values = [x]);  submod_0 = getitem_16 = getitem_12 = getitem_17 = x = None
    getitem_18 = stage_backward_3[0]
    getitem_19 = stage_backward_3[1];  stage_backward_3 = None
    getitem_20 = getitem_18[0];  getitem_18 = None
    sync_barrier = pippy_IR_sync_barrier(_loss, [getitem_6, getitem_10, getitem_15, getitem_19]);  _loss = getitem_6 = getitem_10 = getitem_15 = getitem_19 = None
    return sync_barrier
    
/fsx/users/jamesreed/pipeline_for_real/pippy/PipelineDriver.py:498: UserWarning: Running pipeline with 3 stages on world_size of 10. Remaining ranks will be idle.
  warnings.warn(f'Running pipeline with {len(executor_descriptors)} stages on world_size of {self.world_size}. '
Traceback (most recent call last):
  File "/fsx/users/jamesreed/pipeline_for_real/test/local_test_forward_backward.py", line 117, in <module>
    out = pipe_driver.run((input, target), {}, chunks=CHUNKS, _debug_mask_minibatches = DEBUG_MASK_MINIBATCHES)
  File "/fsx/users/jamesreed/pipeline_for_real/pippy/PipelineDriver.py", line 691, in run
    last_nodes.append(interp.run_until(lambda n: n.op == 'output'))
  File "/fsx/users/jamesreed/pipeline_for_real/pippy/PipelineDriver.py", line 606, in run_until
    self.env[node] = self.run_node(node)
  File "/fsx/users/jamesreed/pytorch/torch/fx/interpreter.py", line 152, in run_node
    return getattr(self, n.op)(n.target, args, kwargs)
  File "/fsx/users/jamesreed/pipeline_for_real/pippy/PipelineDriver.py", line 573, in call_function
    return args[0].remote().__getitem__(args[1])
  File "/fsx/users/jamesreed/pytorch/torch/distributed/rpc/rref_proxy.py", line 41, in _invoke_rpc
    rref_fut.wait()
RuntimeError: RPCErr:1:RPC ran for more than set timeout (60000 ms) and will now be marked with an error

When I replace the __getitem__ call with this DEBUG_INDEX call, things work:

https://github.com/jamesr66a/PiPPy/blob/527af1fd8123d35bd81b9fe304a8d0ed29c9fd8d/pippy/PipelineDriver.py#L554

Is this because the __getitem__ call is done synchronously or something?
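For reference, one way to sidestep the nested synchronous call is to index on the RRef's owner via a plain helper function instead of going through the RRef proxy's __getitem__. This is only a sketch approximating the DEBUG_INDEX workaround referenced above, not the actual code:

import torch.distributed.rpc as rpc

def _index_value(rref, idx):
    # Runs on the owner of `rref`; indexes the local value directly, so the
    # caller does not block on a nested RPC issued from inside another RPC.
    return rref.local_value()[idx]

# In RemoteInterpreter.call_function, instead of
#     args[0].remote().__getitem__(args[1])
# one could issue (sketch):
#     rpc.remote(args[0].owner(), _index_value, args=(args[0], args[1]))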

Composition of PiPPy with DDP

Does this work out of the box by setting up the ProcessGroups in the correct way? If not, what do we have to do to make this work?
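A sketch of one possible setup, assuming each pipeline stage is replicated across data-parallel ranks and each replica group gets its own ProcessGroup for DDP's allreduce (the rank layout below is an assumption for illustration, not something PiPPy does today):

import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# After dist.init_process_group(...): assume world_size == pp_size * dp_size,
# laid out so ranks holding the same pipeline stage form one replica group.
pp_size, dp_size = 4, 2
rank = dist.get_rank()

dp_groups = []
for stage in range(pp_size):
    ranks = [stage * dp_size + r for r in range(dp_size)]
    # new_group must be called by every rank, in the same order.
    dp_groups.append(dist.new_group(ranks=ranks))

my_stage = rank // dp_size
# `stage_module` would be the submodule this rank executes for its stage:
# ddp_stage = DDP(stage_module, process_group=dp_groups[my_stage])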

Design a more systematic way to close over the autograd traced program in runtime

Right now we use this HACK:

https://github.com/jamesr66a/PiPPy/blob/527af1fd8123d35bd81b9fe304a8d0ed29c9fd8d/pippy/PipelineDriver.py#L268

As it stands, we are wasting communication time/bandwidth sending the tensor values over RPC and then ignoring them, accessing the local values directly instead. We should figure out a better end-to-end design for keeping these values on the host so that they can eventually be used with their associated autograd trace.

NOTE: also, if torch.autograd allowed you to just pass in grad_fn rather than a tensor, we could save some space as well

Make PipeStageExecutor multi-threaded

This could be one approach to implementing asynchronously scheduled pipeline stages. Micro-batches within a stage are intrinsically unordered and can therefore (in theory) be executed in any order. If a pipeline stage has I/O-bound tasks, such as collectives, we can yield the executing micro-batch and admit another one (assuming we can post multiple collectives in the same process group). Using threads and relying on the GIL for serial admission is one approach that might work here.

We might also need to limit admission via registers/resources/etc.; the sketch below illustrates one way to bound the number of in-flight micro-batches.
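A minimal sketch of the threading approach, using a thread pool for execution and a semaphore to bound admission. The class and method names are illustrative, not the existing PipeStageExecutor API:

import threading
from concurrent.futures import ThreadPoolExecutor

class ThreadedStageExecutor:
    def __init__(self, stage_module, max_in_flight=4):
        self.stage_module = stage_module
        self.pool = ThreadPoolExecutor(max_workers=max_in_flight)
        # Bound admission so that yielding on I/O-bound work (e.g. collectives)
        # does not let an unbounded number of micro-batches pile up.
        self.admission = threading.Semaphore(max_in_flight)

    def submit_microbatch(self, *args):
        self.admission.acquire()
        future = self.pool.submit(self._run, *args)
        future.add_done_callback(lambda _: self.admission.release())
        return future

    def _run(self, *args):
        # Micro-batches within a stage are unordered, so any interleaving is
        # valid; the GIL serializes the Python-level bookkeeping for us.
        return self.stage_module(*args)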

Set up CI + testing infrastructure

We should have a battery of tests that runs on PRs/commits to ensure correctness and (ideally) catch performance regressions. We could potentially set this up via GitHub Actions. Some challenges here:

  • We should flesh out our test suite and test many different configurations (e.g. GPU vs. CPU, different topologies across different interconnects/network configurations, etc.)
  • We will need to allocate GPU VMs to properly test GPU execution. This isn't natively supported by GHA. AFAIU, the main PyTorch CI does this by calling out to AWS

[Bug] Pipe.from_tracing(transformers.T5Model()) crashes with 'GraphModule' object has no attribute 'decoder'

import inspect

import transformers.utils.fx as fx
from pippy.IR import MultiUseParameterConfig, Pipe
from transformers import *

model = T5Model(T5Config())
print(model)

input_names = model.dummy_inputs.keys()
sig = inspect.signature(model.forward)
concrete_args = {p.name: p.default for p in sig.parameters.values() if p.name not in input_names}

hf_tracer = fx.HFTracer()

model_pipe = Pipe.from_tracing(model, MultiUseParameterConfig.TRANSMIT, tracer=hf_tracer,
                               concrete_args=concrete_args)

T5Model(
  (shared): Embedding(32128, 512)
  (encoder): T5Stack(...)
  (decoder): T5Stack(...)
  ...
Traceback (most recent call last):
  File "/Users/pbelevich/PycharmProjects/PiPPy/test/hf_t5_test.py", line 15, in <module>
    model_pipe = Pipe.from_tracing(model, MultiUseParameterConfig.TRANSMIT, tracer=hf_tracer,
  File "/Users/pbelevich/PycharmProjects/PiPPy/pippy/IR.py", line 606, in from_tracing
    return Pipe._from_traced(mod, traced, multi_use_param_spec, loss_fn, **kwargs)
  File "/Users/pbelevich/PycharmProjects/PiPPy/pippy/IR.py", line 457, in _from_traced
    mod_itr = getattr(mod_itr, atom)
  File "/Users/pbelevich/miniconda3/envs/PiPPy/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1186, in __getattr__
    raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'GraphModule' object has no attribute 'decoder'

Maybe related to https://github.com/jamesr66a/PiPPy/issues/44

[Bug] PipelineDriverFillDrain._retrieve_output_values crashes with custom result types

HF models (e.g. GPT2Model) return custom types (e.g. BaseModelOutputWithPastAndCrossAttentions -> ModelOutput -> OrderedDict), and _retrieve_output_values fails while handling them (a sketch of one possible direction follows the traceback):

/home/pbelevich/local/PiPPy/pippy/PipelineDriver.py:393: UserWarning: Running pipeline with 13 stages on world_size of 20. Remaining ranks will be idle.
  warnings.warn(f'Running pipeline with {len(executor_descriptors)} stages on world_size of {self.world_size}. '
Traceback (most recent call last):
  File "/home/pbelevich/local/PiPPy/test/local_test_forward.py", line 55, in <module>
    out = pipe_driver.run(gpt2_input, chunks=5, _debug_mask_minibatches=True)
  File "/home/pbelevich/local/PiPPy/pippy/PipelineDriver.py", line 570, in run
    return self._retrieve_output_values(microbatch_interpreters, last_nodes, _debug_mask_minibatches, splits_per_arg)
  File "/home/pbelevich/local/PiPPy/pippy/PipelineDriver.py", line 605, in _retrieve_output_values
    sliced_outputs.append(result[start:end])
TypeError: unhashable type: 'slice'

see https://github.com/jamesr66a/PiPPy/pull/43 for the details
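One possible direction (a sketch of the general idea, not the fix from the PR) is to flatten the output with torch.utils._pytree so that slicing is applied only to tensor leaves and the original container is rebuilt afterwards. Note that custom container types such as HF's ModelOutput would need to be registered with pytree (e.g. via _register_pytree_node) before their leaves would be visible here:

import torch
from torch.utils._pytree import tree_flatten, tree_unflatten

def slice_output(result, start, end):
    # Flatten arbitrary nested containers (tuples, dicts, registered custom
    # types) into a flat list of leaves plus a spec describing the structure.
    leaves, spec = tree_flatten(result)
    sliced = [leaf[start:end] if isinstance(leaf, torch.Tensor) else leaf
              for leaf in leaves]
    return tree_unflatten(sliced, spec)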

Support multi-use parameters in leaf modules (whether tracing or sequential frontend)

Currently, only for the tracing frontend, we detect parameters that are used in more than one module and emit different code depending on the policy that the user specified (TRANSMIT or REPLICATE):

https://github.com/jamesr66a/PiPPy/blob/527af1fd8123d35bd81b9fe304a8d0ed29c9fd8d/pippy/IR.py#L465

We should generalize this to work on:

  1. Leaf modules in fx tracing
  2. Modules that share parameters in the sequential frontend

Concretely, these cases are already supported by default with the REPLICATE mode. To support the TRANSMIT mode, we could emit code at the end of the first stage that uses the parameter; this code would fetch the parameter value and transmit it to the subsequent use stages.
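For concreteness, here is a minimal example of case 2 (and, if the submodules are treated as fx leaves, of case 1): an embedding weight tied to an output projection, where the two uses would land in different stages. This is only an illustration of the scenario, not PiPPy code:

import torch

class TiedModel(torch.nn.Module):
    def __init__(self, vocab=100, hidden=32):
        super().__init__()
        self.embed = torch.nn.Embedding(vocab, hidden)           # first stage
        self.proj = torch.nn.Linear(hidden, vocab, bias=False)   # last stage
        # Tie the weights: the same Parameter is used by both leaf modules.
        self.proj.weight = self.embed.weight

    def forward(self, ids):
        return self.proj(torch.relu(self.embed(ids)))

# Under TRANSMIT, the stage owning embed.weight would need to send the
# parameter value to the stage running proj at the end of its forward.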

Error Handling

We should figure out how to gracefully handle the case where an exception is thrown in one of the pipeline stages.

Use Ray?

Is there any value in using Ray for the runtime?

[Bug] RemoteInterpreter.call_function doesn't handle _null_coalesce_accumulate

# Copyright (c) Meta Platforms, Inc. and affiliates
import argparse
import os
import socket

import torch
import torch.distributed.rpc as rpc
import torch.multiprocessing as mp

from pippy.IR import Pipe, pipe_split, TrivialLossWrapper, _null_coalesce_accumulate
from pippy.PipelineDriver import PipelineDriverFillDrain, PipelineDriver1F1B, PipelineDriverBase
from pippy.microbatch import TensorChunkSpec, CustomReducer

PROFILING_ENABLED = True
CHECK_NUMERIC_EQUIVALENCE = True

schedules = {
    'FillDrain': PipelineDriverFillDrain,
    '1F1B': PipelineDriver1F1B,
}

torch.fx.Tracer.proxy_buffer_attributes = True


def run_master(args):
    all_ranks = list(range(1, args.world_size))  # exclude master rank = 0
    chunks = len(all_ranks)
    batches = 1
    bs = 4 * chunks
    hid_dim = 50

    class Code(torch.nn.Module):
        def __init__(self):
            super().__init__()
            self.linear = torch.nn.Linear(hid_dim, hid_dim)

        def forward(self, x):
            x = self.linear(x)
            pipe_split()
            y = torch.relu(x)
            pipe_split()
            z = torch.sigmoid(x)
            pipe_split()
            return y + z

    c = Code()
    c.train()
    mse_loss = torch.nn.MSELoss()
    wrapper = TrivialLossWrapper(c, mse_loss)
    accum_pipe = Pipe.from_tracing(wrapper)
    assert 4 == len(list(accum_pipe.split_gm.children()))
    assert any(n.target == _null_coalesce_accumulate for n in accum_pipe.split_gm.graph.nodes)
    input = torch.randn(bs, hid_dim)
    target = torch.randn(bs, hid_dim)
    accum_pipe(input, target)

    args_chunk_spec = (TensorChunkSpec(0), TensorChunkSpec(0))
    kwargs_chunk_spec = {}
    output_chunk_spec = CustomReducer(torch.tensor(0.0), lambda a, b: a + b)
    pipe_driver: PipelineDriverBase = schedules[args.schedule](accum_pipe, args_chunk_spec, kwargs_chunk_spec,
                                                               output_chunk_spec, args.world_size - 1,
                                                               all_ranks=all_ranks, _debug_mask_minibatches=True)

    for i in range(batches):
        pipe_driver.run(chunks, input, target)


def run_worker(rank, world_size, args):
    print(f"rank = {rank} host/pid = {socket.gethostname()}/{os.getpid()}")
    os.environ['MASTER_ADDR'] = args.master_addr
    os.environ['MASTER_PORT'] = args.master_port
    if args.rank == -1:  # run via mp.spawn
        # each worker will see its GPU as `cuda:0`
        try:
            import subprocess
            device_count = int(subprocess.getoutput('nvidia-smi --list-gpus | wc -l'))
            os.environ['CUDA_VISIBLE_DEVICES'] = str(rank % device_count)
        except ValueError:
            pass
    assert not torch.cuda.is_available() or torch.cuda.device_count() == 1, \
        "Do not use torch.cuda.* before setting CUDA_VISIBLE_DEVICES"
    options = rpc.TensorPipeRpcBackendOptions(num_worker_threads=256, _transports=["uv"])  # uv for AWS EFA instances
    if args.use_cuda and torch.cuda.is_available():
        for i in range(world_size):
            options.set_device_map(f"worker{i}", {0: 0})
    rpc.init_rpc(
        f"worker{rank}",
        rank=rank,
        world_size=world_size,
        rpc_backend_options=options
    )
    if rank == 0:
        run_master(args)
    rpc.shutdown()


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument('--world_size', type=int, default=int(os.getenv("WORLD_SIZE", 5)))
    parser.add_argument('--rank', type=int, default=int(os.getenv("RANK", -1)))
    parser.add_argument('--master_addr', type=str, default=os.getenv('MASTER_ADDR', 'localhost'))
    parser.add_argument('--master_port', type=str, default=os.getenv('MASTER_PORT', '29500'))
    parser.add_argument('-s', '--schedule', type=str, default=list(schedules.keys())[0], choices=schedules.keys())
    parser.add_argument('--replicate', type=int, default=int(os.getenv("REPLICATE", '0')))
    parser.add_argument('--use_cuda', type=int, default=1)
    args = parser.parse_args()

    if args.rank == -1:
        mp.spawn(run_worker, args=(args.world_size, args,), nprocs=args.world_size, join=True)
    elif args.rank < args.world_size:
        run_worker(args.rank, args.world_size, args)
    else:
        print("I'm unused, exiting")

Traceback (most recent call last):
  File "/Users/pbelevich/miniconda3/envs/PiPPy/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/Users/pbelevich/PycharmProjects/PiPPy/examples/slurm/hf/t5/test_null_coalesce_accumulate.py", line 93, in run_worker
    run_master(args)
  File "/Users/pbelevich/PycharmProjects/PiPPy/examples/slurm/hf/t5/test_null_coalesce_accumulate.py", line 65, in run_master
    pipe_driver.run(chunks, input, target)
  File "/Users/pbelevich/PycharmProjects/PiPPy/pippy/PipelineDriver.py", line 871, in run
    last_nodes.append(interp.run_until(lambda n: n.op == 'output'))
  File "/Users/pbelevich/PycharmProjects/PiPPy/pippy/PipelineDriver.py", line 811, in run_until
    self.env[node] = super().run_node(node)
  File "/Users/pbelevich/miniconda3/envs/PiPPy/lib/python3.9/site-packages/torch/fx/interpreter.py", line 152, in run_node
    return getattr(self, n.op)(n.target, args, kwargs)
  File "/Users/pbelevich/PycharmProjects/PiPPy/pippy/PipelineDriver.py", line 800, in call_function
    raise AssertionError(f'Unknown operator {torch.typename(target)}')
AssertionError: Unknown operator pippy.IR._null_coalesce_accumulate

Support custom `torch.fx` tracers in `Pipe.from_tracing`

Some users may want to make use of a custom torch.fx tracer when pipelining their model. An example is Hugging Face Transformers with its custom tracer:

https://github.com/huggingface/transformers/blob/f65fe3663a6c62975a9c04654703252644c9a652/src/transformers/utils/fx.py#L233

We should generalize the interface of Pipe.from_tracing to allow users to pass in a custom tracer: https://github.com/jamesr66a/PiPPy/blob/527af1fd8123d35bd81b9fe304a8d0ed29c9fd8d/pippy/IR.py#L382
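A sketch of what passing a custom tracer could look like, here with an illustrative tracer that keeps nn.MultiheadAttention as a leaf module (the tracer= keyword is the interface being requested, mirroring the HF usage in the T5 bug report above):

import torch
import torch.fx
from pippy.IR import Pipe, MultiUseParameterConfig

class LeafTracer(torch.fx.Tracer):
    # Example custom tracer: do not trace into nn.MultiheadAttention.
    def is_leaf_module(self, m, module_qualified_name):
        if isinstance(m, torch.nn.MultiheadAttention):
            return True
        return super().is_leaf_module(m, module_qualified_name)

# Proposed usage (sketch):
# pipe = Pipe.from_tracing(model, MultiUseParameterConfig.TRANSMIT, tracer=LeafTracer())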

Design interaction between PipelineDriver and DistributedOptimizer

https://github.com/jamesr66a/PiPPy/blob/5f4c6cd4676d6135dec9ee86341286416afd296f/test/local_test_forward_backward.py#L107

Our test currently just checks that PipelineDriver gets the right gradient values; we're not actually applying an update step yet. I think this should just work out of the box with DistributedOptimizer (https://pytorch.org/docs/stable/rpc.html#torch.distributed.optim.DistributedOptimizer), but we should make sure.
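A sketch of how this might be wired up, following the documented DistributedOptimizer pattern. The remote_parameters() accessor is hypothetical (the driver would need to expose RRefs to the parameters held by the remote stage executors), input/target/CHUNKS are the values from the linked test, and whether DistributedOptimizer's distributed-autograd context composes with PiPPy's own stage_backward is exactly what needs verifying:

import torch
import torch.distributed.autograd as dist_autograd
from torch.distributed.optim import DistributedOptimizer

param_rrefs = pipe_driver.remote_parameters()  # hypothetical accessor

opt = DistributedOptimizer(torch.optim.SGD, param_rrefs, lr=0.01)

with dist_autograd.context() as context_id:
    out = pipe_driver.run((input, target), {}, chunks=CHUNKS)
    # step() applies gradients recorded under this context on the remote workers.
    opt.step(context_id)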

Implement get_attr normalization

This is reproducible with HF transformers 4.16.2!

Traced HF BertModel.forward looks like this:

def forward(self, input_ids):
    ...
    _tensor_constant0 = self._tensor_constant0
    ...
    _tensor_constant0_1 = self._tensor_constant0
    ...
    _tensor_constant0_2 = self._tensor_constant0
    ...
    _tensor_constant0_3 = self._tensor_constant0
    ...
    _tensor_constant0_4 = self._tensor_constant0
    ...
    _tensor_constant0_5 = self._tensor_constant0
    ...
    _tensor_constant0_6 = self._tensor_constant0
    ...
    _tensor_constant0_7 = self._tensor_constant0
    ...
    _tensor_constant0_8 = self._tensor_constant0
    ...
    _tensor_constant0_9 = self._tensor_constant0
    ...
    _tensor_constant0_10 = self._tensor_constant0
    ...
    _tensor_constant0_11 = self._tensor_constant0
    ...

And bert_pipe = Pipe.from_traced(bert, bert_traced, ...) fails with

AttributeError: 'GraphModule' object has no attribute '_tensor_constant0'

It's caused by deleting the attribute in line 442

see https://github.com/jamesr66a/PiPPy/pull/43 for the details

bert_traced.txt
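One way to normalize this (a sketch of the general idea, not the actual fix in the PR) is to deduplicate get_attr nodes so that every use of a given attribute points at a single node before attributes are moved or deleted during splitting:

import torch.fx

def dedup_get_attrs(gm: torch.fx.GraphModule) -> torch.fx.GraphModule:
    first_for_target = {}
    for node in list(gm.graph.nodes):
        if node.op != 'get_attr':
            continue
        if node.target in first_for_target:
            # Redirect all uses of the duplicate to the first get_attr node,
            # then erase the duplicate.
            node.replace_all_uses_with(first_for_target[node.target])
            gm.graph.erase_node(node)
        else:
            first_for_target[node.target] = node
    gm.recompile()
    return gm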
