Comments (9)

pineking commented on May 22, 2024

> The NVIDIA plugin is also running on the master node, which has no NVIDIA drivers or nvidia-docker installed. Is this behaviour correct?

Correct.

> I can only run 1 GPU on my cluster at a time. For example, if I run the TensorFlow notebook with 1 GPU, it works. But if I deploy another pod using another GPU, the pod gets stuck in Pending, stating that there are insufficient GPU resources.

Paste your YAML file and the output of kubectl describe node for each node.

from k8s-device-plugin.

jonathan-goh commented on May 22, 2024

OK. I ran nvidia_pod.yaml and got the following error:
Message: 0/5 nodes are available: 5 Insufficient nvidia.com/gpu.

Attached are the descriptions of each node.

node2.txt
node3.txt
node4.txt
node1.txt
nvidia_pod.yml.txt

jonathan-goh commented on May 22, 2024

Am I supposed to have both:

alpha.kubernetes.io/nvidia-gpu: 1
nvidia.com/gpu: 1

in the Capacity and Allocatable fields?

pineking commented on May 22, 2024

@jonathan-goh There is only 1 GPU on each node, so you cannot request 2 GPUs in one pod.
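
To make the reasoning concrete, here is a minimal sketch of the per-node fit check, not the real scheduler logic. It assumes a hypothetical cluster of one GPU-less master and four 1-GPU nodes; the node names, the `nodes_that_fit` and `scheduling_summary` helpers, and the free-GPU counts are all illustrative. The point it shows: a pod lands on a single node, so GPUs on different nodes cannot be added together for one pod.

```python
from typing import Dict, List

GPU_RESOURCE = "nvidia.com/gpu"  # resource name seen in the error message above


def nodes_that_fit(requested_gpus: int, free_gpus: Dict[str, int]) -> List[str]:
    """Return the nodes that have at least `requested_gpus` unallocated GPUs.

    A pod always runs on exactly one node, so free GPUs on different
    nodes are never combined to satisfy a single pod's request.
    """
    return [node for node, free in free_gpus.items() if free >= requested_gpus]


def scheduling_summary(requested_gpus: int, free_gpus: Dict[str, int]) -> str:
    """Build a message in the style of the scheduler event quoted earlier."""
    fitting = nodes_that_fit(requested_gpus, free_gpus)
    short = len(free_gpus) - len(fitting)
    msg = f"{len(fitting)}/{len(free_gpus)} nodes are available"
    if short:
        msg += f": {short} Insufficient {GPU_RESOURCE}"
    return msg + "."


if __name__ == "__main__":
    # Hypothetical cluster: a master without GPUs plus four 1-GPU nodes.
    cluster = {"master": 0, "node1": 1, "node2": 1, "node3": 1, "node4": 1}
    print(scheduling_summary(2, cluster))  # a 2-GPU pod fits nowhere
    print(scheduling_summary(1, cluster))  # a 1-GPU pod fits on any GPU node
```

With these assumed numbers, a 2-GPU request fits on none of the five nodes, while a 1-GPU request fits on the four GPU nodes.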

RenaudWasTaken commented on May 22, 2024

Hello @jonathan-goh!

> @jonathan-goh there is only 1 GPU on every node, so you can not request 2 GPUs in one pod.

Looks like it. Thanks for handling this issue, @pineking!

> Am I supposed to have both:
> alpha.kubernetes.io/nvidia-gpu: 1
> nvidia.com/gpu: 1
> in capacity and allocation fields?

Ideally, it's better if you don't enable the Accelerator flags on kubelet, though it shouldn't have any impact.

jonathan-goh commented on May 22, 2024

@pineking Oh, OK. Sorry! I did not know that, as I am really new to this. But let's say I want to do distributed learning on my cluster: how do I do that? Do I use a deployment?

@RenaudWasTaken That is the thing: I already removed it, reloaded and restarted the kube, but it is still there.

RenaudWasTaken commented on May 22, 2024

> @pineking oh ok. Sorry! I did not know that as I am really new to this! But lets say I want to do distributed learning on my cluster, how do I do that? Do I use a deployment?

If you are talking about MPI, that's not supported yet but we are working on it :)

pineking commented on May 22, 2024

> If you are talking about MPI, that's not supported yet but we are working on it :)

@RenaudWasTaken Are there any issues on GitHub or docs/links to track the progress?

> @pineking oh ok. Sorry! I did not know that as I am really new to this! But lets say I want to do distributed learning on my cluster, how do I do that? Do I use a deployment?

@jonathan-goh For distributed training, you can create more than one pod (worker), each with 1 GPU. See https://github.com/kubeflow/kubeflow and https://github.com/tensorflow/k8s
For TensorFlow and MPI, see https://github.com/uber/horovod
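
As a rough illustration of the "several pods, 1 GPU each" layout, the sketch below builds plain Pod manifests as Python dictionaries and prints them as JSON. The pod names, the `app: trainer` label, the `example.com/trainer:latest` image, and the `worker_pod` / `worker_pods` helpers are all hypothetical placeholders. It is not how Kubeflow or Horovod set up their workers, and it leaves out how the workers find each other, which depends on the training framework.

```python
import json
from typing import Any, Dict

GPU_RESOURCE = "nvidia.com/gpu"  # resource name advertised by the device plugin


def worker_pod(index: int, image: str) -> Dict[str, Any]:
    """Return a Pod manifest for one worker that requests exactly 1 GPU."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {
            "name": f"trainer-worker-{index}",  # hypothetical naming scheme
            "labels": {"app": "trainer"},
        },
        "spec": {
            "restartPolicy": "OnFailure",
            "containers": [
                {
                    "name": "worker",
                    "image": image,
                    # 1 GPU per pod, so each pod fits on a 1-GPU node.
                    "resources": {"limits": {GPU_RESOURCE: 1}},
                }
            ],
        },
    }


def worker_pods(count: int, image: str = "example.com/trainer:latest") -> Dict[str, Any]:
    """Wrap `count` worker pods in a v1 List so they can be applied together."""
    return {
        "apiVersion": "v1",
        "kind": "List",
        "items": [worker_pod(i, image) for i in range(count)],
    }


if __name__ == "__main__":
    # Four workers for a hypothetical cluster with four 1-GPU nodes.
    print(json.dumps(worker_pods(4), indent=2))
```

The JSON output can be saved to a file and passed to kubectl apply -f. Because each pod asks for a single GPU, the scheduler is free to spread the workers across the 1-GPU nodes.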

> @RenaudWasTaken That is the thing, I already removed it, reloaded and restarted the kube, but it is still there..

@jonathan-goh I think you can ignore it.

RenaudWasTaken commented on May 22, 2024

> @RenaudWasTaken Are there some issues on GitHub or docs/links to track the progress?

Nope, it's on our roadmap but really depends on getting the Resource Class API merged.
