Issues when requesting for more than 1 GPU about k8s-device-plugin HOT 9 CLOSED

nvidia commented on May 22, 2024

Issues when requesting for more than 1 GPU

from k8s-device-plugin.

Comments (9)

pineking commented on May 22, 2024

> The nvidia plugin is also running on the master node, which has no nvidia drivers and nvidia docker installed. Is this behaviour correct?

Correct.

> I can only run 1 GPU on my cluster at a time. For example, if I run the tensorflow notebook with 1 GPU, it works. But if I deploy another pod utilising another 1 GPU, the pod status gets stuck on pending, stating that there are insufficient GPU resources.

Please paste your YAML file and the output of kubectl describe node for each node.

jonathan-goh commented on May 22, 2024

OK. I ran nvidia_pod.yaml and got the following error:
Message: 0/5 nodes are available: 5 Insufficient nvidia.com/gpu.

Attached is the description of each node.

node2.txt
node3.txt
node4.txt
node1.txt
nvidia_pod.yml.txt

jonathan-goh commented on May 22, 2024

Am I supposed to have both:

alpha.kubernetes.io/nvidia-gpu: 1
nvidia.com/gpu: 1

in capacity and allocation fields?

pineking commented on May 22, 2024

@jonathan-goh there is only 1 GPU on every node, so you can not request 2 GPUs in one pod.
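
To illustrate the point above: a pod's GPU request has to be satisfied by a single node, so GPUs on different nodes cannot be added together for one pod. The snippet below is only a toy model of that per-node check, not the real scheduler. The cluster layout (one master without a GPU, four workers with one GPU each) and the function name `nodes_that_fit` are assumptions made for the example.

```python
def nodes_that_fit(free_gpus_by_node, requested_gpus):
    """Return the names of nodes whose free GPU count covers the whole pod request."""
    return [
        name
        for name, free_gpus in free_gpus_by_node.items()
        if free_gpus >= requested_gpus
    ]


# Hypothetical cluster: a master without GPUs and four workers with one GPU each.
cluster = {"master": 0, "node1": 1, "node2": 1, "node3": 1, "node4": 1}

print(nodes_that_fit(cluster, 1))  # ['node1', 'node2', 'node3', 'node4']
print(nodes_that_fit(cluster, 2))  # [] : no single node offers 2 GPUs
```

In this model a request for 2 GPUs fits nowhere even though the cluster has 4 GPUs in total, which is consistent with the "Insufficient nvidia.com/gpu" message reported for every node.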

RenaudWasTaken commented on May 22, 2024

Hello @jonathan-goh !

> @jonathan-goh there is only 1 GPU on every node, so you can not request 2 GPUs in one pod.

Looks like it, thanks for handling this issue @pineking!

> Am I suppose to have both:
> alpha.kubernetes.io/nvidia-gpu: 1
> nvidia.com/gpu: 1
> in capacity and allocation fields?

Ideally, you should not enable the Accelerator flags on the kubelet, though leaving them on shouldn't have any impact.

jonathan-goh commented on May 22, 2024

@pineking Oh, OK. Sorry! I did not know that, as I am really new to this! But let's say I want to do distributed learning on my cluster, how do I do that? Do I use a deployment?

@RenaudWasTaken That is the thing: I already removed it, reloaded and restarted the kube, but it is still there.

RenaudWasTaken commented on May 22, 2024

> @pineking oh ok. Sorry! I did not know that as I am really new to this! But lets say I want to do distributed learning on my cluster, how do I do that? Do I use a deployment?

If you are talking about MPI, that's not supported yet but we are working on it :)

pineking commented on May 22, 2024

> If you are talking about MPI, that's not supported yet but we are working on it :)

@RenaudWasTaken Are there some issues on GitHub or docs/links to track the progress?

> @pineking oh ok. Sorry! I did not know that as I am really new to this! But lets say I want to do distributed learning on my cluster, how do I do that? Do I use a deployment?

@jonathan-goh For distributed training, you can create more than 1 pod (worker), with each pod using 1 GPU. See https://github.com/kubeflow/kubeflow and https://github.com/tensorflow/k8s
For TensorFlow and MPI, see https://github.com/uber/horovod
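
As a rough sketch of the "one pod per worker, one GPU per pod" layout described above, the following script generates Pod manifests that each request a single `nvidia.com/gpu`. The pod names, the `app: trainer` label, the worker count, and the image `example.com/my-training-image:latest` are all hypothetical placeholders; how the workers find and talk to each other is left to the training framework and is not shown here.

```python
import json


def worker_pod(index, image):
    """Build a Pod manifest (as a dict) for one worker that requests exactly one GPU."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {
            "name": f"trainer-worker-{index}",  # hypothetical naming scheme
            "labels": {"app": "trainer"},
        },
        "spec": {
            "restartPolicy": "OnFailure",
            "containers": [
                {
                    "name": "worker",
                    "image": image,
                    # One GPU per pod, so each worker fits on a single-GPU node.
                    "resources": {"limits": {"nvidia.com/gpu": 1}},
                }
            ],
        },
    }


def worker_pods(count, image):
    """Wrap `count` worker pods in a v1 List so they can be applied in one go."""
    return {
        "apiVersion": "v1",
        "kind": "List",
        "items": [worker_pod(i, image) for i in range(count)],
    }


# Placeholder image name; replace it with a real training image.
manifest = worker_pods(4, "example.com/my-training-image:latest")
print(json.dumps(manifest, indent=2))
```

The printed JSON can be saved to a file and passed to `kubectl apply -f`, since kubectl accepts JSON manifests as well as YAML.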

> @RenaudWasTaken That is the thing, I already removed it, reloaded and restarted the kube, but it is still there..

@jonathan-goh I think you can ignore it.

RenaudWasTaken commented on May 22, 2024

> @RenaudWasTaken Are there some issues on GitHub or docs/links to track the progress?

Nope, it's on our roadmap, but it really depends on getting the Resource Class API merged.
