
Comments (12)

kuizhiqing commented on August 17, 2024

You should add nprocPerNode to the spec and launch with torchrun (torch.distributed.run) or torch.distributed.launch; see #1840 for more.

apiVersion: "kubeflow.org/v1"
kind: PyTorchJob
metadata:
  name: pytorch-helloworld
  namespace: default
spec:
  nprocPerNode: "8"
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: gcr.io/...
              imagePullPolicy: Always
              resources:
                limits:
                  nvidia.com/gpu: 8
                requests:
                  nvidia.com/gpu: 8

    Worker:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: gcr.io/...
              imagePullPolicy: Always
              resources:
                limits:
                  nvidia.com/gpu: 8
                requests:
                  nvidia.com/gpu: 8
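
In case it helps, a minimal sketch of the container entrypoint this spec assumes (the script path /workspace/train.py is a placeholder, not from the original job):

    # inside each container spec, alongside image/resources:
    command:
      - torchrun                 # or: python -m torch.distributed.launch
      - /workspace/train.py
    # MASTER_ADDR, MASTER_PORT, WORLD_SIZE and RANK are injected by the training-operator;
    # the assumption (per #1840) is that spec.nprocPerNode is propagated to the launcher,
    # so no explicit --nproc_per_node flag is passed here.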


av98-alphonse commented on August 17, 2024

Hello @kuizhiqing, should this example work with the currently released version (v1.6.0), or only with the proposed changes in #1840? Thanks.


kuizhiqing commented on August 17, 2024

@av98-alphonse In the current version, you should use nProcPerNode in spec.elasticPolicy.
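
A rough sketch of where that field sits in the current (v1.6.0) spec, with illustrative values (the rendezvous and replica settings are assumptions, adjust for your job):

    spec:
      elasticPolicy:
        nProcPerNode: 8
        rdzvBackend: c10d      # assumption: typical rendezvous backend for elastic jobs
        minReplicas: 1
        maxReplicas: 2
      pytorchReplicaSpecs:
        # Master / Worker replica specs as in the examples above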


av98-alphonse commented on August 17, 2024

Thanks @kuizhiqing. I am reading the excellent issues #1840 and #1836. A quick TL;DR would be that torchrun is not currently supported. In elastic mode, would it be possible to use python -m torch.distributed.launch, since the environment variables RANK, LOCAL_RANK, and WORLD_SIZE are configured, and also the case of plain python train_script.py? Would this be correct?


kuizhiqing commented on August 17, 2024

@av98-alphonse torchrun is supported in elastic mode by setting spec.elasticPolicy, while the multi-GPU mode proposed in this issue can use launch or run as the entrypoint.


kkolli commented on August 17, 2024

Thanks for the clarification @kuizhiqing, I really appreciate it. I tried the following based on your suggestion, but I feel there is still something missing, or I'm doing something wrong. While I'm able to run multi-GPU training now (thanks again 😄), I'm not able to complete a successful training run on any toy model. These training runs work just fine with single-GPU training or 2-worker, 1-GPU training.

I constantly get:
torch.distributed.elastic.rendezvous.api.RendezvousConnectionError, which causes my pods to keep restarting and crashing. Have you seen anything like this in your testing?

Additionally, with this setup, when the workers complete, the master pod will not terminate successfully.
[Screenshot attached: pod status, 2023-08-01 11:48 AM]

Is there anything obvious I'm doing wrong in the steps below?

  1. Apply the latest build from master: kubectl apply -k "github.com/kubeflow/training-operator/manifests/overlays/standalone"
  2. Apply the manifest with kubectl:
apiVersion: "kubeflow.org/v1"
kind: PyTorchJob
metadata:
  name: pytorch-dist-helloworld
  namespace: default
spec:
  elasticPolicy:
    nProcPerNode: 8
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: gcr.io/...
              command: ["python", "-m", "torch.distributed.launch", "/app/binary"]
              imagePullPolicy: Always
              resources:
                limits:
                  nvidia.com/gpu: 8
                requests:
                  nvidia.com/gpu: 8
    Worker:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: gcr.io/....
              command: ["python", "-m", "torch.distributed.launch", "/app/binary"]
              imagePullPolicy: Always
              resources:
                limits:
                  nvidia.com/gpu: 8
                requests:
                  nvidia.com/gpu: 8

Full error:

  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launch.py", line 196, in <module>
    main()
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launch.py", line 192, in main
    launch(args)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launch.py", line 177, in launch
    run(args)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 241, in launch_agent
    result = agent.run()
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/elastic/metrics/api.py", line 129, in wrapper
    result = f(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/elastic/agent/server/api.py", line 723, in run
    result = self._invoke_run(role)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/elastic/agent/server/api.py", line 858, in _invoke_run
    self._initialize_workers(self._worker_group)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/elastic/metrics/api.py", line 129, in wrapper
    result = f(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/elastic/agent/server/api.py", line 692, in _initialize_workers
    self._rendezvous(worker_group)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/elastic/metrics/api.py", line 129, in wrapper
    result = f(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/elastic/agent/server/api.py", line 546, in _rendezvous
    store, group_rank, group_world_size = spec.rdzv_handler.next_rendezvous()
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 1028, in next_rendezvous
    self._op_executor.run(join_op, deadline)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 607, in run
    has_set = self._state_holder.sync()
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 409, in sync
    get_response = self._backend.get_state()
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 73, in get_state
    base64_state: bytes = self._call_store("get", self._key)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 115, in _call_store
    raise RendezvousConnectionError(
torch.distributed.elastic.rendezvous.api.RendezvousConnectionError: The connection to the C10d store has failed. See inner exception for details


kuizhiqing commented on August 17, 2024

@kkolli please try these two modifications (a sketch follows the list):

  • use nprocPerNode: "2" in spec
  • use torch.distributed.run instead of torch.distributed.launch in the command
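
Put together, a minimal sketch of the adjusted container command (same placeholder binary path as in the manifest above):

    command: ["python", "-m", "torch.distributed.run", "/app/binary"]
    # Assumption: spec.nprocPerNode (available in the master build applied in step 1)
    # supplies the per-node process count, so --nproc_per_node is not passed explicitly.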


kkolli commented on August 17, 2024

@kuizhiqing Unfortunately, I still wasn't able to get it to work with your suggestion. However, converting to an MPI-based deployment did the trick for me.

Here are my changes:

apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
  name: pytorch-mpi-2w
  labels:
    mpi_job: pytorch
spec:
  slotsPerWorker: 8
  runPolicy:
    cleanPodPolicy: Running
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        metadata:
          labels:
            mpi_job: pytorch
        spec:
          containers:
          - image: gcr.io/..
            name: mpi-launcher
            resources:
              limits:
                nvidia.com/gpu: 8
            env: 
            - name: MASTER_ADDR
              value: pytorch-mpi-2w-worker-0.pytorch-mpi-2w-worker.default.svc.cluster.local
            - name: MASTER_PORT
              value: "23456"
            command:
            - mpirun
            - --allow-run-as-root
            - -np
            - "16"
            - -bind-to
            - none
            - -map-by
            - slot
            - -x
            - NCCL_DEBUG=INFO
            - -x
            - LD_LIBRARY_PATH
            - -x
            - PATH
            - -x 
            - MASTER_ADDR
            - -x
            - MASTER_PORT
            - -mca
            - pml
            - ob1
            - -mca
            - btl
            - ^openib
            - /app/binary
    Worker:
      replicas: 2
      template:
        metadata:
          labels:
            mpi_job: kkolli-worker
        spec:
          containers:
          - image: gcr.io/
            name: mpi-worker
            resources:
              limits:
                nvidia.com/gpu: 8

Thanks again for your suggestions @kuizhiqing. I'll be closing the ticket, given I was able to find a solution.


deepanker13 commented on August 17, 2024

Hi @kuizhiqing, may I know if the required changes have been made in the training client Python SDK as well?


tenzen-y commented on August 17, 2024

Hi @kuizhiqing, may I know if the required changes have been made in the training client Python SDK as well?

@deepanker13 You mean whether we can set nprocPerNode via the SDK? If so, yes.


deepanker13 commented on August 17, 2024

@tenzen-y I couldn't see it in the SDK:
    def create_job(
        self,
        job: Optional[constants.JOB_MODELS_TYPE] = None,
        name: Optional[str] = None,
        namespace: Optional[str] = None,
        job_kind: Optional[str] = None,
        base_image: Optional[str] = None,
        train_func: Optional[Callable] = None,
        parameters: Optional[Dict[str, Any]] = None,
        num_worker_replicas: Optional[int] = None,
        num_chief_replicas: Optional[int] = None,
        num_ps_replicas: Optional[int] = None,
        packages_to_install: Optional[List[str]] = None,
        pip_index_url: str = constants.DEFAULT_PIP_INDEX_URL,
    )


tenzen-y commented on August 17, 2024

@tenzen-y I couldn't see it in the SDK: def create_job( self, job: Optional[constants.JOB_MODELS_TYPE] = None, name: Optional[str] = None, namespace: Optional[str] = None, job_kind: Optional[str] = None, base_image: Optional[str] = None, train_func: Optional[Callable] = None, parameters: Optional[Dict[str, Any]] = None, num_worker_replicas: Optional[int] = None, num_chief_replicas: Optional[int] = None, num_ps_replicas: Optional[int] = None, packages_to_install: Optional[List[str]] = None, pip_index_url: str = constants.DEFAULT_PIP_INDEX_URL, )

Oops, you're right.

https://github.com/kubeflow/training-operator/blob/master/sdk/python/kubeflow/training/api/training_client.py#L106-L131

We might want to add args to pass runPolicy in the SDK.

cc: @andreyvelich

