
Comments (12)

kuizhiqing commented on August 17, 2024

You should add nprocPerNode to the spec and launch with torchrun (torch.distributed.run) or torch.distributed.launch; see #1840 for more.

apiVersion: "kubeflow.org/v1"
kind: PyTorchJob
metadata:
  name: pytorch-helloworld
  namespace: default
spec:
  nprocPerNode: "8"
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: gcr.io/...
              imagePullPolicy: Always
              resources:
                limits:
                  nvidia.com/gpu: 8
                requests:
                  nvidia.com/gpu: 8

    Worker:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: gcr.io/...
              imagePullPolicy: Always
              resources:
                limits:
                  nvidia.com/gpu: 8
                requests:
                  nvidia.com/gpu: 8
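
In case it helps, a minimal sketch of the container entrypoint this spec assumes (the script path /workspace/train.py is a placeholder, not from the original job):

    # inside each container spec, alongside image/resources:
    command:
      - torchrun                 # or: python -m torch.distributed.launch
      - /workspace/train.py
    # MASTER_ADDR, MASTER_PORT, WORLD_SIZE and RANK are injected by the training-operator;
    # the assumption (per #1840) is that spec.nprocPerNode is propagated to the launcher,
    # so no explicit --nproc_per_node flag is passed here.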


av98-alphonse commented on August 17, 2024

Hello @kuizhiqing, should this example work with the currently released version (v1.6.0), or only with the proposed changes in #1840? Thanks.


kuizhiqing commented on August 17, 2024

@av98-alphonse In the current version, you should use nProcPerNode in spec.elasticPolicy.
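
A rough sketch of where that field sits in the current (v1.6.0) spec, with illustrative values (the rendezvous and replica settings are assumptions, adjust for your job):

    spec:
      elasticPolicy:
        nProcPerNode: 8
        rdzvBackend: c10d      # assumption: typical rendezvous backend for elastic jobs
        minReplicas: 1
        maxReplicas: 2
      pytorchReplicaSpecs:
        # Master / Worker replica specs as in the examples above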


av98-alphonse commented on August 17, 2024

Thanks @kuizhiqing. I am reading the excellent issues #1840 and #1836. A quick TL;DR would be that torchrun is not currently supported. In elastic mode, would it be possible to use python -m torch.distributed.launch, since the environment variables RANK, LOCAL_RANK, and WORLD_SIZE are configured, and also the case of plain python train_script.py? Would this be correct?


kuizhiqing commented on August 17, 2024

@av98-alphonse torchrun is supported in elastic mode by setting spec.elasticPolicy, while the multi-GPU mode proposed in this issue can use launch or run as the entrypoint.


kkolli commented on August 17, 2024

Thanks for the clarification @kuizhiqing, I really appreciate it. I tried the following based on your suggestion, but I feel there is still something missing, or I'm doing something wrong. While I'm able to run multi-GPU training now (thanks again 😄), I'm not able to complete a successful training run on any toy model. These training runs work just fine with single-GPU training or 2-worker, 1-GPU training.

I constantly get:
torch.distributed.elastic.rendezvous.api.RendezvousConnectionError, which causes my pods to keep restarting and crashing. Have you seen anything like this in your testing?

Additionally, with this setup, when the workers complete, the master pod will not terminate successfully.
[Screenshot attached: pod status, 2023-08-01 11:48 AM]

Is there anything obvious I'm doing wrong in the steps below?

  1. Apply the latest build from master: kubectl apply -k "github.com/kubeflow/training-operator/manifests/overlays/standalone"
  2. Apply the manifest with kubectl:
apiVersion: "kubeflow.org/v1"
kind: PyTorchJob
metadata:
  name: pytorch-dist-helloworld
  namespace: default
spec:
  elasticPolicy:
    nProcPerNode: 8
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: gcr.io/...
              command: ["python", "-m", "torch.distributed.launch", "/app/binary"]
              imagePullPolicy: Always
              resources:
                limits:
                  nvidia.com/gpu: 8
                requests:
                  nvidia.com/gpu: 8
    Worker:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: gcr.io/....
              command: ["python", "-m", "torch.distributed.launch", "/app/binary"]
              imagePullPolicy: Always
              resources:
                limits:
                  nvidia.com/gpu: 8
                requests:
                  nvidia.com/gpu: 8

Full error:

  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launch.py", line 196, in <module>
    main()
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launch.py", line 192, in main
    launch(args)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launch.py", line 177, in launch
    run(args)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 241, in launch_agent
    result = agent.run()
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/elastic/metrics/api.py", line 129, in wrapper
    result = f(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/elastic/agent/server/api.py", line 723, in run
    result = self._invoke_run(role)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/elastic/agent/server/api.py", line 858, in _invoke_run
    self._initialize_workers(self._worker_group)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/elastic/metrics/api.py", line 129, in wrapper
    result = f(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/elastic/agent/server/api.py", line 692, in _initialize_workers
    self._rendezvous(worker_group)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/elastic/metrics/api.py", line 129, in wrapper
    result = f(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/elastic/agent/server/api.py", line 546, in _rendezvous
    store, group_rank, group_world_size = spec.rdzv_handler.next_rendezvous()
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 1028, in next_rendezvous
    self._op_executor.run(join_op, deadline)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 607, in run
    has_set = self._state_holder.sync()
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 409, in sync
    get_response = self._backend.get_state()
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 73, in get_state
    base64_state: bytes = self._call_store("get", self._key)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 115, in _call_store
    raise RendezvousConnectionError(
torch.distributed.elastic.rendezvous.api.RendezvousConnectionError: The connection to the C10d store has failed. See inner exception for details


kuizhiqing commented on August 17, 2024

@kkolli please try these two modifications (a sketch follows the list):

  • use nprocPerNode: "2" in spec
  • use torch.distributed.run instead of torch.distributed.launch in the command
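
Put together, a minimal sketch of the adjusted container command (same placeholder binary path as in the manifest above):

    command: ["python", "-m", "torch.distributed.run", "/app/binary"]
    # Assumption: spec.nprocPerNode (available in the master build applied in step 1)
    # supplies the per-node process count, so --nproc_per_node is not passed explicitly.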


kkolli commented on August 17, 2024

@kuizhiqing Unfortunately, I still wasn't able to get it to work with your suggestion. However, converting to an MPI-based deployment did the trick for me.

Here are my changes:

apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
  name: pytorch-mpi-2w
  labels:
    mpi_job: pytorch
spec:
  slotsPerWorker: 8
  runPolicy:
    cleanPodPolicy: Running
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        metadata:
          labels:
            mpi_job: pytorch
        spec:
          containers:
          - image: gcr.io/..
            name: mpi-launcher
            resources:
              limits:
                nvidia.com/gpu: 8
            env: 
            - name: MASTER_ADDR
              value: pytorch-mpi-2w-worker-0.pytorch-mpi-2w-worker.default.svc.cluster.local
            - name: MASTER_PORT
              value: "23456"
            command:
            - mpirun
            - --allow-run-as-root
            - -np
            - "16"
            - -bind-to
            - none
            - -map-by
            - slot
            - -x
            - NCCL_DEBUG=INFO
            - -x
            - LD_LIBRARY_PATH
            - -x
            - PATH
            - -x 
            - MASTER_ADDR
            - -x
            - MASTER_PORT
            - -mca
            - pml
            - ob1
            - -mca
            - btl
            - ^openib
            - /app/binary
    Worker:
      replicas: 2
      template:
        metadata:
          labels:
            mpi_job: kkolli-worker
        spec:
          containers:
          - image: gcr.io/
            name: mpi-worker
            resources:
              limits:
                nvidia.com/gpu: 8

Thanks again for your suggestions @kuizhiqing. I'll be closing the ticket, given I was able to find a solution.


deepanker13 commented on August 17, 2024

Hi @kuizhiqing, may I know if the required changes have been made in the training client Python SDK as well?


tenzen-y commented on August 17, 2024

Hi @kuizhiqing, may I know if the required changes have been made in the training client Python SDK as well?

@deepanker13 You mean whether we can set nprocPerNode via the SDK? If so, yes.


deepanker13 commented on August 17, 2024

@tenzen-y I couldn't see it in the SDK:
    def create_job(
        self,
        job: Optional[constants.JOB_MODELS_TYPE] = None,
        name: Optional[str] = None,
        namespace: Optional[str] = None,
        job_kind: Optional[str] = None,
        base_image: Optional[str] = None,
        train_func: Optional[Callable] = None,
        parameters: Optional[Dict[str, Any]] = None,
        num_worker_replicas: Optional[int] = None,
        num_chief_replicas: Optional[int] = None,
        num_ps_replicas: Optional[int] = None,
        packages_to_install: Optional[List[str]] = None,
        pip_index_url: str = constants.DEFAULT_PIP_INDEX_URL,
    )


tenzen-y commented on August 17, 2024

@tenzen-y I couldn't see it in the SDK: def create_job( self, job: Optional[constants.JOB_MODELS_TYPE] = None, name: Optional[str] = None, namespace: Optional[str] = None, job_kind: Optional[str] = None, base_image: Optional[str] = None, train_func: Optional[Callable] = None, parameters: Optional[Dict[str, Any]] = None, num_worker_replicas: Optional[int] = None, num_chief_replicas: Optional[int] = None, num_ps_replicas: Optional[int] = None, packages_to_install: Optional[List[str]] = None, pip_index_url: str = constants.DEFAULT_PIP_INDEX_URL, )

Oops, you're right.

https://github.com/kubeflow/training-operator/blob/master/sdk/python/kubeflow/training/api/training_client.py#L106-L131

We might want to add args to pass runPolicy in the SDK.

cc: @andreyvelich

