
Comments (9)

tenzen-y commented on August 17, 2024

I am looking into the mpi-operator repository. Are there any guidelines on how to support ssh in the images used by this operator (I am thinking about custom images)? I see only one set of images that adds ssh (https://github.com/kubeflow/mpi-operator/tree/master/build/base). Is there any documentation about the contract the mpi-operator expects? Thanks again =)

Can you create a separate issue on the mpi-operator repository?
I don't think this is related to the training-operator. Thanks for your understanding.

from training-operator.

tenzen-y commented on August 17, 2024

Hello everyone. I am trying to run the training-operator on a small test cluster of rpi4 boards. The training-operator has been installed and appears to be working. However, when I tried to run a small test, I got an error with the launcher container.

The kubectl-delivery image on Docker Hub appears to have been last updated two years ago and only lists amd64 builds: https://hub.docker.com/r/mpioperator/kubectl-delivery/tags

The log of the launcher container is:

Defaulted container "mpi" out of: mpi, kubectl-delivery (init)
Error from server (BadRequest): container "mpi" in pod "simple-hello-world-launcher" is waiting to start: PodInitializing

Is this expected?

thanks

Yes, that image isn't built automatically. Building it automatically might be good, though; feel free to open a PR:

https://github.com/kubeflow/training-operator/blob/9e084ff0b0904b82312225c4baca295baf482b1e/.github/workflows/publish-core-images.yaml

That said, I would suggest using MPIJob v2 (https://github.com/kubeflow/mpi-operator) instead of MPIJob v1.
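
For anyone hitting the same PodInitializing symptom on ARM boards, one quick thing to check is whether the init container image publishes an arm64 variant at all. The sketch below is only an illustration: it assumes you have saved the JSON printed by `docker manifest inspect <image>` and that the image is a multi-arch manifest list (a single-platform image prints a plain manifest without a `manifests` array, in which case the function returns an empty set). The `SAMPLE` document is made-up data, not the real manifest of mpioperator/kubectl-delivery.

```python
import json


def supported_architectures(manifest_json: str) -> set:
    """Return the CPU architectures listed in a manifest list document."""
    doc = json.loads(manifest_json)
    archs = set()
    # A multi-arch image carries a "manifests" array, one entry per platform.
    for entry in doc.get("manifests", []):
        arch = entry.get("platform", {}).get("architecture")
        # Attestation entries may report "unknown"; skip them.
        if arch and arch != "unknown":
            archs.add(arch)
    return archs


# Hypothetical sample data for illustration only.
SAMPLE = json.dumps({
    "schemaVersion": 2,
    "manifests": [
        {"platform": {"architecture": "amd64", "os": "linux"}},
    ],
})

if __name__ == "__main__":
    archs = supported_architectures(SAMPLE)
    print("arm64 supported:", "arm64" in archs)
```

If arm64 is missing from the result, the init container cannot start on an rpi4 node, which would be consistent with a launcher pod that never leaves PodInitializing.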


tenzen-y commented on August 17, 2024

/kind question


aavbsouza commented on August 17, 2024

Hello @tenzen-y, it appears that the Dockerfile for this image does not exist in this repository, and it was removed from the mpi-operator in this pull request (https://github.com/kubeflow/mpi-operator/pull/494/files). What is the replacement for this image when using MPIJob v2? Would it be passed as an argument in the CRD definition (#1525)?

Another question: is it mandatory to use a scheduling plugin like the one provided by the Volcano project?

thanks


tenzen-y commented on August 17, 2024

it appears that the Dockerfile for this image does not exist in this repository, and it was removed from the mpi-operator in this pull request (https://github.com/kubeflow/mpi-operator/pull/494/files).

Oh, yes. It seems that we need to copy the Dockerfile to this repository (kubeflow/training-operator).

What is the replacement for this image when using MPIJob v2? Would it be passed as an argument in the CRD definition (#1525)?

We have two MPIJob versions, and they are hosted in separate operators (repositories):

  • MPIJob v1 is deployed as part of training-operator.
  • MPIJob v2 is deployed by mpi-operator.

MPIJob v1 uses kubectl exec to initialize the MPI environment via kubectl-delivery, whereas MPIJob v2 uses ssh. So MPIJob v2 doesn't need kubectl-delivery and is more scalable than MPIJob v1.
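
To make the difference concrete, here is a minimal sketch of what an MPIJob v2 manifest can look like, built as a Python dict and printed as JSON (which `kubectl apply -f -` accepts). It is an illustration modeled on the examples in the mpi-operator repository, not a verified configuration: the job name, the image `example.com/my-mpi-image:latest`, and the `hostname` command are placeholders, and real images usually need extra sshd and user settings. Note that there is no kubectl-delivery init container anywhere; the workers just run an ssh daemon.

```python
import json


def mpijob_v2(name: str, image: str, workers: int) -> dict:
    """Build a minimal MPIJob (kubeflow.org/v2beta1) manifest as a dict."""
    return {
        "apiVersion": "kubeflow.org/v2beta1",
        "kind": "MPIJob",
        "metadata": {"name": name},
        "spec": {
            "slotsPerWorker": 1,
            "mpiReplicaSpecs": {
                # The launcher runs mpirun, which reaches the workers over ssh.
                "Launcher": {
                    "replicas": 1,
                    "template": {"spec": {"containers": [{
                        "name": "mpi-launcher",
                        "image": image,
                        "command": ["mpirun", "-n", str(workers), "hostname"],
                    }]}},
                },
                # Workers only need to keep an ssh daemon running.
                "Worker": {
                    "replicas": workers,
                    "template": {"spec": {"containers": [{
                        "name": "mpi-worker",
                        "image": image,
                        "command": ["/usr/sbin/sshd", "-De"],
                    }]}},
                },
            },
        },
    }


if __name__ == "__main__":
    # Placeholder name and image; substitute your own values.
    job = mpijob_v2("hello-mpi", "example.com/my-mpi-image:latest", 2)
    print(json.dumps(job, indent=2))
```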


tenzen-y commented on August 17, 2024

Another question: is it mandatory to use a scheduling plugin like the one provided by the Volcano project?

The training-operator supports Volcano gang scheduling, and you can refer to the following docs for how to use the Volcano scheduler:

https://www.kubeflow.org/docs/components/training/job-scheduling

However, we have so far only verified Volcano gang scheduling, so I'm not sure whether the training-operator works well with other Volcano scheduler plugins.
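
For readers unfamiliar with the term, gang scheduling means all-or-nothing placement: a job's pods are started only if the minimum required number can be placed at once, which matters for MPI because a launcher without its workers cannot make progress. The toy model below only illustrates that idea. It is not Volcano's algorithm or API, and `min_available` is just an illustrative parameter name here.

```python
def gang_admit(free_slots: int, min_available: int) -> bool:
    """All-or-nothing check: admit only if every required pod fits."""
    return free_slots >= min_available


def schedule(jobs, free_slots):
    """Admit jobs in order; a job that does not fully fit gets no pods at all.

    jobs is a list of (name, min_available) pairs.
    Returns the admitted job names and the slots left over.
    """
    admitted = []
    for name, min_available in jobs:
        if gang_admit(free_slots, min_available):
            admitted.append(name)
            free_slots -= min_available
    return admitted, free_slots


if __name__ == "__main__":
    # Hypothetical jobs: each needs all of its pods (launcher plus workers).
    jobs = [("job-a", 3), ("job-b", 4), ("job-c", 1)]
    print(schedule(jobs, free_slots=5))
```

With five free slots, job-b is skipped entirely rather than being started with only two of its four pods, which is the behavior that keeps a partially placed MPI job from holding resources while it waits.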


aavbsouza commented on August 17, 2024

Hello @tenzen-y. I am looking into the mpi-operator repository. Are there any guidelines on how to support ssh in the images used by this operator (I am thinking about custom images)? I see only one set of images that adds ssh (https://github.com/kubeflow/mpi-operator/tree/master/build/base). Is there any documentation about the contract the mpi-operator expects? Thanks again =)


tenzen-y commented on August 17, 2024

/close

If you have any other questions about the training-operator, feel free to open new issues.


google-oss-prow commented on August 17, 2024

@tenzen-y: Closing this issue.

In response to this:

/close

If you have any other questions about the training-operator, feel free to open new issues.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

