Comments (23)

rohitagarwal003 commented on May 16, 2024

It's not getting scheduled because you only have 1 GPU and your pod is requesting two GPUs: 1 for cuda-container and 1 for digits-container.
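
For reference, a minimal sketch of a spec that fits on a single-GPU node, keeping only the CUDA container (the pod name and image tag are taken from the specs quoted later in this thread):

apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  containers:
    - name: cuda-container
      image: nvidia/cuda:9.0-devel
      resources:
        limits:
          nvidia.com/gpu: 1 # the whole pod now requests a single GPU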

RenaudWasTaken commented on May 16, 2024

Hello!

Your GPU is too old to be health-checked. The behavior we've decided on internally is to mark such GPUs as unhealthy.

If you still need to use it, I can add an option to the daemonset to have GPUs that can't be health-checked marked as healthy (though it wouldn't be the default behavior). Would that be a solution for you?

ernestmartinez commented on May 16, 2024

Yes, that would be very helpful.

Thank you.

flx42 commented on May 16, 2024

Fixed. Set the following environment variable when starting the device plugin: DP_DISABLE_HEALTHCHECKS=xids. You can set it either at the pod-spec level for the DaemonSet or in your docker run command.
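
For the docker run case, a sketch of the invocation (the mount path and image tag match the plugin's DaemonSet manifest quoted in the next comment; treat the exact flags as an assumption rather than the canonical command):

docker run -it \
  -e DP_DISABLE_HEALTHCHECKS=xids \
  -v /var/lib/kubelet/device-plugins:/var/lib/kubelet/device-plugins \
  nvidia/k8s-device-plugin:1.9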

ernestmartinez commented on May 16, 2024

Same problem.
I assume that when you said "at the pod-spec level for the daemon set" you meant in nvidia-device-plugin.yml.

Here is the file I edited; I just added the env stanza:

apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
spec:
  template:
    metadata:
      # Mark this pod as a critical add-on; when enabled, the critical add-on scheduler
      # reserves resources for critical add-on pods so that they can be rescheduled after
      # a failure.  This annotation works in tandem with the toleration below.
      annotations:
        scheduler.alpha.kubernetes.io/critical-pod: ""
      labels:
        name: nvidia-device-plugin-ds
    spec:
      nodeSelector:
        gpu: "true"
      tolerations:
      # Allow this pod to be rescheduled while the node is in "critical add-ons only" mode.
      # This, along with the annotation above marks this pod as a critical add-on.
      - key: CriticalAddonsOnly
        operator: Exists
      containers:
      - image: nvidia/k8s-device-plugin:1.9
        name: nvidia-device-plugin-ctr
        env:
        - name: DP_DISABLE_HEALTHCHECKS
          value: "xids"
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop: ["ALL"]
        volumeMounts:
          - name: device-plugin
            mountPath: /var/lib/kubelet/device-plugins
      volumes:
        - name: device-plugin
          hostPath:
            path: /var/lib/kubelet/device-plugins

flx42 commented on May 16, 2024

Try docker pull nvidia/k8s-device-plugin:1.9

ernestmartinez commented on May 16, 2024

That did it. Thanks!!!!

chi-hung commented on May 16, 2024

Same problem here. My YAML file looks like this:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  containers:
    - name: cuda-container
      image: nvidia/cuda:9.0-devel
      env:
      - name: DP_DISABLE_HEALTHCHECKS
        value: "xids"
      resources:
        limits:
          nvidia.com/gpu: 1 # requesting 1 GPU
    - name: digits-container
      image: nvidia/digits:6.0
      env:
      - name: DP_DISABLE_HEALTHCHECKS
        value: "xids"
      resources:
        limits:
          nvidia.com/gpu: 1 # requesting 1 GPU

and here's the output of kubectl describe pods:

Name:         gpu-pod
Namespace:    default
Node:         <none>
Labels:       <none>
Annotations:  <none>
Status:       Pending
IP:           
Containers:
  cuda-container:
    Image:      nvidia/cuda:9.0-devel
    Port:       <none>
    Host Port:  <none>
    Limits:
      nvidia.com/gpu:  1
    Requests:
      nvidia.com/gpu:  1
    Environment:
      DP_DISABLE_HEALTHCHECKS:  xids
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-7x4mv (ro)
  digits-container:
    Image:      nvidia/digits:6.0
    Port:       <none>
    Host Port:  <none>
    Limits:
      nvidia.com/gpu:  1
    Requests:
      nvidia.com/gpu:  1
    Environment:
      DP_DISABLE_HEALTHCHECKS:  xids
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-7x4mv (ro)
Conditions:
  Type           Status
  PodScheduled   False 
Volumes:
  default-token-7x4mv:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-7x4mv
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason            Age               From               Message
  ----     ------            ----              ----               -------
  Warning  FailedScheduling  3s (x23 over 3m)  default-scheduler  0/1 nodes are available: 1 Insufficient nvidia.com/gpu.

System info

Output of lsb_release -a:

No LSB modules are available.
Distributor ID:	Ubuntu
Description:	Ubuntu 16.04.4 LTS
Release:	16.04
Codename:	xenial

Output of docker run --rm nvidia/cuda nvidia-smi:

Tue Jul 10 12:11:53 2018       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.25                 Driver Version: 390.25                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 107...  Off  | 00000000:01:00.0  On |                  N/A |
| 29%   43C    P8     7W / 180W |     51MiB /  8116MiB |      1%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

Output of docker -v:

Docker version 18.03.1-ce, build 9ee9f40

Output of kubeadm version:

kubeadm version: &version.Info{Major:"1", Minor:"11", GitVersion:"v1.11.0", GitCommit:"91e7b4fd31fcd3d5f436da26c980becec37ceefe", GitTreeState:"clean", BuildDate:"2018-06-27T20:14:41Z", GoVersion:"go1.10.2", Compiler:"gc", Platform:"linux/amd64"}

Is this also a health-checking issue?

RenaudWasTaken commented on May 16, 2024

Hello!

The DP_DISABLE_HEALTHCHECKS environment variable is expected to be set in the device plugin pod itself, not in your workload pod.

Can you:

  • Make sure you don't have any other GPU pods running
  • Provide the output of kubectl describe node $MY_GPU_NODE?
  • Provide the logs of the device plugin (kubectl logs -n kube-system $DEVICE_PLUGIN_POD) on the GPU node? (See the sketch below for locating that pod.)

Thanks!
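
A sketch for locating the device plugin pod, assuming the name=nvidia-device-plugin-ds label from the DaemonSet spec shown earlier in this thread:

DEVICE_PLUGIN_POD=$(kubectl get pods -n kube-system -l name=nvidia-device-plugin-ds -o jsonpath='{.items[0].metadata.name}')
kubectl logs -n kube-system $DEVICE_PLUGIN_POD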

chi-hung commented on May 16, 2024

Hi!

here are the outputs:

node description (single node only, for testing)

Name:               127.0.0.1
Roles:              <none>
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/os=linux
                    kubernetes.io/hostname=127.0.0.1
Annotations:        node.alpha.kubernetes.io/ttl=0
                    volumes.kubernetes.io/controller-managed-attach-detach=true
                    volumes.kubernetes.io/keep-terminated-pod-volumes=true
CreationTimestamp:  Wed, 11 Jul 2018 16:07:40 +0800
Taints:             <none>
Unschedulable:      false
Conditions:
  Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----             ------  -----------------                 ------------------                ------                       -------
  OutOfDisk        False   Wed, 11 Jul 2018 16:27:52 +0800   Wed, 11 Jul 2018 16:07:40 +0800   KubeletHasSufficientDisk     kubelet has sufficient disk space available
  MemoryPressure   False   Wed, 11 Jul 2018 16:27:52 +0800   Wed, 11 Jul 2018 16:07:40 +0800   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     False   Wed, 11 Jul 2018 16:27:52 +0800   Wed, 11 Jul 2018 16:07:40 +0800   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure      False   Wed, 11 Jul 2018 16:27:52 +0800   Wed, 11 Jul 2018 16:07:40 +0800   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready            True    Wed, 11 Jul 2018 16:27:52 +0800   Wed, 11 Jul 2018 16:07:40 +0800   KubeletReady                 kubelet is posting ready status. AppArmor enabled
Addresses:
  InternalIP:  127.0.0.1
  Hostname:    127.0.0.1
Capacity:
 cpu:                12
 ephemeral-storage:  214726764Ki
 hugepages-1Gi:      0
 hugepages-2Mi:      0
 memory:             16342940Ki
 nvidia.com/gpu:     1
 pods:               110
Allocatable:
 cpu:                12
 ephemeral-storage:  197892185375
 hugepages-1Gi:      0
 hugepages-2Mi:      0
 memory:             16240540Ki
 nvidia.com/gpu:     1
 pods:               110
System Info:
 Machine ID:                 c0c5f10b8ebccd69fed924285b432951
 System UUID:                1C2B5312-6345-F8F3-52D2-4CEDFB66E216
 Boot ID:                    bddbd025-64d2-4cbf-ba4f-56ac8c7bd745
 Kernel Version:             4.4.0-130-generic
 OS Image:                   Ubuntu 16.04.4 LTS
 Operating System:           linux
 Architecture:               amd64
 Container Runtime Version:  docker://18.3.1
 Kubelet Version:            v1.11.0-dirty
 Kube-Proxy Version:         v1.11.0-dirty
Non-terminated Pods:         (2 in total)
  Namespace                  Name                                    CPU Requests  CPU Limits  Memory Requests  Memory Limits
  ---------                  ----                                    ------------  ----------  ---------------  -------------
  kube-system                kube-dns-7b479ccbc6-pdzjp               260m (2%)     0 (0%)      110Mi (0%)       170Mi (1%)
  kube-system                nvidia-device-plugin-daemonset-wdn9w    0 (0%)        0 (0%)      0 (0%)           0 (0%)
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource        Requests    Limits
  --------        --------    ------
  cpu             260m (2%)   0 (0%)
  memory          110Mi (0%)  170Mi (1%)
  nvidia.com/gpu  0           0
Events:
  Type    Reason                   Age                From                   Message
  ----    ------                   ----               ----                   -------
  Normal  Starting                 20m                kube-proxy, 127.0.0.1  Starting kube-proxy.
  Normal  Starting                 20m                kubelet, 127.0.0.1     Starting kubelet.
  Normal  NodeHasSufficientDisk    20m (x2 over 20m)  kubelet, 127.0.0.1     Node 127.0.0.1 status is now: NodeHasSufficientDisk
  Normal  NodeHasSufficientMemory  20m (x2 over 20m)  kubelet, 127.0.0.1     Node 127.0.0.1 status is now: NodeHasSufficientMemory
  Normal  NodeHasNoDiskPressure    20m (x2 over 20m)  kubelet, 127.0.0.1     Node 127.0.0.1 status is now: NodeHasNoDiskPressure
  Normal  NodeHasSufficientPID     20m (x2 over 20m)  kubelet, 127.0.0.1     Node 127.0.0.1 status is now: NodeHasSufficientPID
  Normal  NodeAllocatableEnforced  20m                kubelet, 127.0.0.1     Updated Node Allocatable limit across pods
  Normal  NodeReady                20m                kubelet, 127.0.0.1     Node 127.0.0.1 status is now: NodeReady

log of the device plugin

2018/07/11 08:17:27 Loading NVML
2018/07/11 08:17:27 Fetching devices.
2018/07/11 08:17:27 Starting FS watcher.
2018/07/11 08:17:27 Starting OS watcher.
2018/07/11 08:17:27 Starting to serve on /var/lib/kubelet/device-plugins/nvidia.sock
2018/07/11 08:17:27 Registered device plugin with Kubelet

RenaudWasTaken commented on May 16, 2024

Hello!

Looks like your node is correctly advertising the number of GPUs it has (here, one).
Additionally, your node is not advertising your GPU as unhealthy.

At this point your pod should get scheduled. How much time did you wait?
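
For comparing the two sides, a quick sketch of the check (what the node advertises versus what the pod actually asks for; the node and pod names are the ones from the output above):

kubectl describe node 127.0.0.1 | grep nvidia.com/gpu   # the node advertises 1
kubectl describe pod gpu-pod | grep nvidia.com/gpu      # the pod spec requests 1 per container, 2 in total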

chi-hung commented on May 16, 2024

@RenaudWasTaken thanks for the comment! I understand it better now.
@mindprince you are absolutely right... my stupid mistake. Thanks for pointing it out!

ZhiyuanChen commented on May 16, 2024

Hello!

Looks like your node is correctly advertising the number of GPUs it has (here, one).
Additionally, your node is not advertising your GPU as unhealthy.

At this point your pod should get scheduled. How much time did you wait?

Hi, I got the same issue, except that there is no nvidia.com/gpu entry in the node's Capacity section. I'm currently running Docker 18.09.1 (I'm not sure why I ended up with the latest version when running docker version, as I manually pinned Docker to 18.06 during installation), with Kubernetes 1.13.2 and version 1.12 of the NVIDIA device plugin. I'm wondering whether the plugin simply doesn't support such a new version?
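
One thing worth checking in this situation (a later comment in this thread lands on the same fix) is whether Docker's default runtime is nvidia; a sketch of the check:

docker info | grep -i runtime        # should list nvidia and report "Default Runtime: nvidia"
cat /etc/docker/daemon.json          # should contain "default-runtime": "nvidia"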

AjayZinngg commented on May 16, 2024

I am facing the same error.

This is the output of the kubectl describe pods command:

  Name:           xyz-v1-5c5b57cf9c-kvjxn
  Namespace:      default
  Node:           <none>
  Labels:         app=xyz
                  pod-template-hash=1716137957
                  version=v1
  Annotations:    <none>
  Status:         Pending
  IP:             
  Controlled By:  ReplicaSet/xyz-v1-5c5b57cf9c
  Containers:
    aadhar:
      Image:      tensorflow/serving:1.11.1-gpu
      Port:       9000/TCP
      Host Port:  0/TCP
      Command:
        /usr/bin/tensorflow_model_server
      Args:
        --port=9000
        --model_name=xyz
        --model_base_path=gs://xyz_kuber_app-xyz-identification/export/
      Limits:
        cpu:             4
        memory:          4Gi
        nvidia.com/gpu:  1
      Requests:
        cpu:             1
        memory:          1Gi
        nvidia.com/gpu:  1
      Environment:       <none>
      Mounts:
        /var/run/secrets/kubernetes.io/serviceaccount from default-token-b6dpn (ro)
    aadhar-http-proxy:
      Image:      gcr.io/kubeflow-images-public/tf-model-server-http-proxy:v20180606-9dfda4f2
      Port:       8000/TCP
      Host Port:  0/TCP
      Command:
        python
        /usr/src/app/server.py
        --port=8000
        --rpc_port=9000
        --rpc_timeout=10.0
      Limits:
        cpu:     1
        memory:  1Gi
      Requests:
        cpu:        500m
        memory:     500Mi
      Environment:  <none>
      Mounts:
        /var/run/secrets/kubernetes.io/serviceaccount from default-token-b6dpn (ro)
  Conditions:
    Type           Status
    PodScheduled   False 
  Volumes:
    default-token-b6dpn:
      Type:        Secret (a volume populated by a Secret)
      SecretName:  default-token-b6dpn
      Optional:    false
  QoS Class:       Burstable
  Node-Selectors:  <none>
  Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                   node.kubernetes.io/unreachable:NoExecute for 300s
                   nvidia.com/gpu:NoSchedule
  Events:
    Type     Reason             Age                   From                Message
    ----     ------             ----                  ----                -------
    Warning  FailedScheduling   20m (x5 over 21m)     default-scheduler   0/2 nodes are available: 1 Insufficient nvidia.com/gpu, 1 node(s) were unschedulable.
    Warning  FailedScheduling   20m (x2 over 20m)     default-scheduler   0/2 nodes are available: 1 Insufficient nvidia.com/gpu, 1 node(s) were not ready, 1 node(s) were out of disk space, 1 node(s) were unschedulable.
    Warning  FailedScheduling   16m (x9 over 19m)     default-scheduler   0/1 nodes are available: 1 Insufficient nvidia.com/gpu.
    Normal   NotTriggerScaleUp  15m (x26 over 20m)    cluster-autoscaler  pod didn't trigger scale-up (it wouldn't fit if a new node is added)
    Warning  FailedScheduling   2m42s (x54 over 23m)  default-scheduler   0/2 nodes are available: 2 Insufficient nvidia.com/gpu.
    Normal   TriggeredScaleUp   13s                   cluster-autoscaler  pod triggered scale-up: [{https://content.googleapis.com/compute/v1/projects/adhaar-identification/zones/us-central1-a/instanceGroups/gke-kuberflow-aadhaar-pool-1-9753107b-grp 1->2 (max: 10)}]


  Name:           mnist-deploy-gcp-b4dd579bf-sjwj7
  Namespace:      default
  Node:           gke-kuberflow-xyz-default-pool-ab1fa086-w6q3/10.128.0.8
  Start Time:     Thu, 14 Feb 2019 14:44:08 +0530
  Labels:         app=xyz-object
                  pod-template-hash=608813569
                  version=v1
  Annotations:    sidecar.istio.io/inject: 
  Status:         Running
  IP:             10.36.4.18
  Controlled By:  ReplicaSet/mnist-deploy-gcp-b4dd579bf
  Containers:
    aadhaar-object:
      Container ID:  docker://921717d82b547a023034e7c8be78216493beeb55dca57f4eddb5968122e36c16
      Image:         tensorflow/serving:1.11.1
      Image ID:      docker-pullable://tensorflow/serving@sha256:a01c6475c69055c583aeda185a274942ced458d178aaeb84b4b842ae6917a0bc
      Ports:         9000/TCP, 8500/TCP
      Host Ports:    0/TCP, 0/TCP
      Command:
        /usr/bin/tensorflow_model_server
      Args:
        --port=9000
        --rest_api_port=8500
        --model_name=xyz-object
        --model_base_path=gs://xyz_kuber_app-xyz-identification/export
        --monitoring_config_file=/var/config/monitoring_config.txt
      State:          Running
        Started:      Thu, 14 Feb 2019 14:48:21 +0530
      Last State:     Terminated
        Reason:       Error
        Exit Code:    137
        Started:      Thu, 14 Feb 2019 14:45:58 +0530
        Finished:     Thu, 14 Feb 2019 14:48:21 +0530
      Ready:          True
      Restart Count:  1
      Limits:
        cpu:     4
        memory:  4Gi
      Requests:
        cpu:     1
        memory:  1Gi
      Liveness:  tcp-socket :9000 delay=30s timeout=1s period=30s #success=1 #failure=3
      Environment:
        GOOGLE_APPLICATION_CREDENTIALS:  /secret/gcp-credentials/user-gcp-sa.json
      Mounts:
        /secret/gcp-credentials from gcp-credentials (rw)
        /var/config/ from config-volume (rw)
        /var/run/secrets/kubernetes.io/serviceaccount from default-token-b6dpn (ro)
  Conditions:
    Type           Status
    Initialized    True 
    Ready          True 
    PodScheduled   True 
  Volumes:
    config-volume:
      Type:      ConfigMap (a volume populated by a ConfigMap)
      Name:      mnist-deploy-gcp-config
      Optional:  false
    gcp-credentials:
      Type:        Secret (a volume populated by a Secret)
      SecretName:  user-gcp-sa
      Optional:    false
    default-token-b6dpn:
      Type:        Secret (a volume populated by a Secret)
      SecretName:  default-token-b6dpn
      Optional:    false
  QoS Class:       Burstable
  Node-Selectors:  <none>
  Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                   node.kubernetes.io/unreachable:NoExecute for 300s
  Events:          <none>

The output of kubectl describe pods | grep gpu is:

    Image:      tensorflow/serving:1.11.1-gpu
      nvidia.com/gpu:  1
      nvidia.com/gpu:  1
                 nvidia.com/gpu:NoSchedule
  Warning  FailedScheduling   28m (x5 over 29m)     default-scheduler   0/2 nodes are available: 1 Insufficient nvidia.com/gpu, 1 node(s) were unschedulable.
  Warning  FailedScheduling   28m (x2 over 28m)     default-scheduler   0/2 nodes are available: 1 Insufficient nvidia.com/gpu, 1 node(s) were not ready, 1 node(s) were out of disk space, 1 node(s) were unschedulable.
  Warning  FailedScheduling   24m (x9 over 27m)     default-scheduler   0/1 nodes are available: 1 Insufficient nvidia.com/gpu.
  Warning  FailedScheduling   11m (x54 over 31m)    default-scheduler   0/2 nodes are available: 2 Insufficient nvidia.com/gpu.
  Warning  FailedScheduling   48s (x23 over 6m57s)  default-scheduler   0/3 nodes are available: 3 Insufficient nvidia.com/gpu.

maminio commented on May 16, 2024

@AjayZinngg Did you manage to fix it? I got the same issue.

guo897654050 commented on May 16, 2024

@maminio Have you solved this problem? I got the same error.

toniz commented on May 16, 2024

apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  containers:
    - name: cuda-container
      image: nvidia/cuda:9.0-devel
      resources:
        limits:
          nvidia.com/gpu: "1" # requesting 1 GPU
    - name: digits-container
      image: nvidia/digits:6.0
      resources:
        limits:
          nvidia.com/gpu: "1" # requesting 1 GPU
  tolerations:
  - effect: NoSchedule
    operator: Exists

I added the tolerations, and then it ran.

Baren123 commented on May 16, 2024

@chi-hung Do you have any detailed suggestions for solving this problem? I am facing the same error.

Baren123 commented on May 16, 2024

@ernestmartinez Do you have any detailed suggestions for solving this problem? I am facing the same error.

klueska commented on May 16, 2024

Please file a new issue with your specific details if you want it to be looked at. This issue has already been resolved.

hellojack123 commented on May 16, 2024

I also ran into this problem. It turned out that another pod had used up all of the GPU resources; after deleting that pod, it worked. So check whether there are enough GPU resources and whether they are already occupied.
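
A sketch of that check, assuming you know the GPU node's name: the Allocated resources section of the node description shows how many nvidia.com/gpu units are already claimed by running pods.

kubectl describe node <your-gpu-node> | grep -A8 "Allocated resources"
kubectl get pods --all-namespaces -o wide | grep <your-gpu-node>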

gocpplua commented on May 16, 2024

This issue has been resolved for me as follows (see the daemon.json sketch after these steps):

  1. vim /etc/docker/daemon.json
    add: "default-runtime": "nvidia",

  2. systemctl restart docker

Then:
kubectl describe node -A | grep nvidia
nvidia.com/gpu: 4
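
For reference, a complete /etc/docker/daemon.json with nvidia as the default runtime usually looks like the sketch below (the runtime path assumes nvidia-container-runtime is installed at /usr/bin); restart Docker afterwards, as in step 2 above.

{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}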

Abhishekghosh1998 commented on May 16, 2024

Hello there, I am facing a similar issue while integrating an Nvidia GPU with a pod.
I have successfully set up the nvidia-container-toolkit.

$ docker run --runtime=nvidia --gpus all nvidia/cuda:11.6.2-base-ubuntu20.04 nvidia-smi
Wed Jul  5 06:46:43 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03              Driver Version: 535.54.03    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 2080        On  | 00000000:26:00.0 Off |                  N/A |
| 23%   35C    P8              18W / 215W |     92MiB /  8192MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
+---------------------------------------------------------------------------------------+

I have set up a kind-control-plane node, as shown by kubectl:

$ kubectl get nodes
NAME                 STATUS   ROLES           AGE   VERSION
kind-control-plane   Ready    control-plane   35m   v1.27.3

I have labeled the node with gpu=installed

$ kubectl label nodes kind-control-plane gpu=installed

And I have also tainted the node with:

kubectl taint nodes kind-control-plane gpu:NoSchedule

I did this because, without it, the plugin was complaining that the node is not a GPU node:

$ DEVICE_PLUGIN_POD="nvidia-device-plugin-daemonset-tm4pv"
abhishek@abhishek:~$ kubectl logs -n kube-system $DEVICE_PLUGIN_POD
I0705 05:08:24.579349       1 main.go:154] Starting FS watcher.
I0705 05:08:24.580666       1 main.go:161] Starting OS watcher.
I0705 05:08:24.581008       1 main.go:176] Starting Plugins.
I0705 05:08:24.581018       1 main.go:234] Loading configuration.
I0705 05:08:24.581125       1 main.go:242] Updating config with default resource matching patterns.
I0705 05:08:24.581304       1 main.go:253] 
Running with config:
{
  "version": "v1",
  "flags": {
    "migStrategy": "none",
    "failOnInitError": false,
    "nvidiaDriverRoot": "/",
    "gdsEnabled": false,
    "mofedEnabled": false,
    "plugin": {
      "passDeviceSpecs": false,
      "deviceListStrategy": [
        "envvar"
      ],
      "deviceIDStrategy": "uuid",
      "cdiAnnotationPrefix": "cdi.k8s.io/",
      "nvidiaCTKPath": "/usr/bin/nvidia-ctk",
      "containerDriverRoot": "/driver-root"
    }
  },
  "resources": {
    "gpus": [
      {
        "pattern": "*",
        "name": "nvidia.com/gpu"
      }
    ]
  },
  "sharing": {
    "timeSlicing": {}
  }
}
I0705 05:08:24.581315       1 main.go:256] Retreiving plugins.
W0705 05:08:24.584333       1 factory.go:31] No valid resources detected, creating a null CDI handler
I0705 05:08:24.584382       1 factory.go:107] Detected non-NVML platform: could not load NVML library: libnvidia-ml.so.1: cannot open shared object file: No such file or directory
I0705 05:08:24.584425       1 factory.go:107] Detected non-Tegra platform: /sys/devices/soc0/family file not found
E0705 05:08:24.584433       1 factory.go:115] Incompatible platform detected
E0705 05:08:24.584440       1 factory.go:116] If this is a GPU node, did you configure the NVIDIA Container Toolkit?
E0705 05:08:24.584447       1 factory.go:117] You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
E0705 05:08:24.584455       1 factory.go:118] You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
E0705 05:08:24.584465       1 factory.go:119] If this is not a GPU node, you should set up a toleration or nodeSelector to only deploy this plugin on GPU nodes
I0705 05:08:24.584475       1 main.go:287] No devices found. Waiting indefinitely.
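
(Side note: the "could not load NVML library: libnvidia-ml.so.1" message above can be double-checked from inside the kind node container itself; a sketch, assuming the node container is named kind-control-plane as shown by kubectl get nodes:)

docker exec kind-control-plane ldconfig -p | grep libnvidia-ml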

So after labeling and tainting the node we have the following node description:

$ kubectl describe node  kind-control-plane
Name:               kind-control-plane
Roles:              control-plane
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/os=linux
                    gpu=installed
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=kind-control-plane
                    kubernetes.io/os=linux
                    node-role.kubernetes.io/control-plane=
                    node.kubernetes.io/exclude-from-external-load-balancers=
Annotations:        kubeadm.alpha.kubernetes.io/cri-socket: unix:///run/containerd/containerd.sock
                    node.alpha.kubernetes.io/ttl: 0
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Wed, 05 Jul 2023 11:42:02 +0530
Taints:             gpu:NoSchedule
Unschedulable:      false
Lease:
  HolderIdentity:  kind-control-plane
  AcquireTime:     <unset>
  RenewTime:       Wed, 05 Jul 2023 12:26:51 +0530
Conditions:
  Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----             ------  -----------------                 ------------------                ------                       -------
  MemoryPressure   False   Wed, 05 Jul 2023 12:21:56 +0530   Wed, 05 Jul 2023 11:42:01 +0530   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     False   Wed, 05 Jul 2023 12:21:56 +0530   Wed, 05 Jul 2023 11:42:01 +0530   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure      False   Wed, 05 Jul 2023 12:21:56 +0530   Wed, 05 Jul 2023 11:42:01 +0530   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready            True    Wed, 05 Jul 2023 12:21:56 +0530   Wed, 05 Jul 2023 11:42:22 +0530   KubeletReady                 kubelet is posting ready status
Addresses:
  InternalIP:  172.19.0.2
  Hostname:    kind-control-plane
Capacity:
  cpu:                16
  ephemeral-storage:  238948692Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             65782984Ki
  pods:               110
Allocatable:
  cpu:                16
  ephemeral-storage:  238948692Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             65782984Ki
  pods:               110
System Info:
  Machine ID:                 99d5a8ae622644c889e61e882ec29ec9
  System UUID:                b9544999-5e7c-40f5-a2e1-519c23074823
  Boot ID:                    4fdb330e-7a23-453d-99b2-5a4073672224
  Kernel Version:             5.15.0-76-generic
  OS Image:                   Debian GNU/Linux 11 (bullseye)
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  containerd://1.7.1
  Kubelet Version:            v1.27.3
  Kube-Proxy Version:         v1.27.3
PodCIDR:                      10.244.0.0/24
PodCIDRs:                     10.244.0.0/24
ProviderID:                   kind://docker/kind/kind-control-plane
Non-terminated Pods:          (9 in total)
  Namespace                   Name                                          CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
  ---------                   ----                                          ------------  ----------  ---------------  -------------  ---
  kube-system                 coredns-5d78c9869d-fdxj7                      100m (0%)     0 (0%)      70Mi (0%)        170Mi (0%)     44m
  kube-system                 coredns-5d78c9869d-sgnqj                      100m (0%)     0 (0%)      70Mi (0%)        170Mi (0%)     44m
  kube-system                 etcd-kind-control-plane                       100m (0%)     0 (0%)      100Mi (0%)       0 (0%)         44m
  kube-system                 kindnet-56rdl                                 100m (0%)     100m (0%)   50Mi (0%)        50Mi (0%)      44m
  kube-system                 kube-apiserver-kind-control-plane             250m (1%)     0 (0%)      0 (0%)           0 (0%)         44m
  kube-system                 kube-controller-manager-kind-control-plane    200m (1%)     0 (0%)      0 (0%)           0 (0%)         44m
  kube-system                 kube-proxy-vswdf                              0 (0%)        0 (0%)      0 (0%)           0 (0%)         44m
  kube-system                 kube-scheduler-kind-control-plane             100m (0%)     0 (0%)      0 (0%)           0 (0%)         44m
  local-path-storage          local-path-provisioner-6bc4bddd6b-9ntj6       0 (0%)        0 (0%)      0 (0%)           0 (0%)         44m
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests    Limits
  --------           --------    ------
  cpu                950m (5%)   100m (0%)
  memory             290Mi (0%)  390Mi (0%)
  ephemeral-storage  0 (0%)      0 (0%)
  hugepages-1Gi      0 (0%)      0 (0%)
  hugepages-2Mi      0 (0%)      0 (0%)
Events:
  Type    Reason                   Age                From             Message
  ----    ------                   ----               ----             -------
  Normal  Starting                 44m                kube-proxy       
  Normal  Starting                 40m                kube-proxy       
  Normal  NodeAllocatableEnforced  45m                kubelet          Updated Node Allocatable limit across pods
  Normal  Starting                 45m                kubelet          Starting kubelet.
  Normal  NodeHasSufficientMemory  45m (x8 over 45m)  kubelet          Node kind-control-plane status is now: NodeHasSufficientMemory
  Normal  NodeHasSufficientPID     45m (x7 over 45m)  kubelet          Node kind-control-plane status is now: NodeHasSufficientPID
  Normal  NodeHasNoDiskPressure    45m (x8 over 45m)  kubelet          Node kind-control-plane status is now: NodeHasNoDiskPressure
  Normal  NodeAllocatableEnforced  44m                kubelet          Updated Node Allocatable limit across pods
  Normal  Starting                 44m                kubelet          Starting kubelet.
  Normal  NodeHasSufficientMemory  44m                kubelet          Node kind-control-plane status is now: NodeHasSufficientMemory
  Normal  NodeHasNoDiskPressure    44m                kubelet          Node kind-control-plane status is now: NodeHasNoDiskPressure
  Normal  NodeHasSufficientPID     44m                kubelet          Node kind-control-plane status is now: NodeHasSufficientPID
  Normal  RegisteredNode           44m                node-controller  Node kind-control-plane event: Registered Node kind-control-plane in Controller
  Normal  NodeReady                44m                kubelet          Node kind-control-plane status is now: NodeReady
  Normal  NodeAllocatableEnforced  40m                kubelet          Updated Node Allocatable limit across pods
  Normal  Starting                 40m                kubelet          Starting kubelet.
  Normal  NodeHasSufficientMemory  40m (x8 over 40m)  kubelet          Node kind-control-plane status is now: NodeHasSufficientMemory
  Normal  NodeHasNoDiskPressure    40m (x8 over 40m)  kubelet          Node kind-control-plane status is now: NodeHasNoDiskPressure
  Normal  NodeHasSufficientPID     40m (x7 over 40m)  kubelet          Node kind-control-plane status is now: NodeHasSufficientPID
  Normal  RegisteredNode           40m                node-controller  Node kind-control-plane event: Registered Node kind-control-plane in Controller

Note that in the above description, the node does not have nvidia.com/gpu listed anywhere.

The description of the gpu-pod is as follows:

$ kubectl describe pods
Name:         gpu-pod
Namespace:    default
Priority:     0
Node:         <none>
Labels:       <none>
Annotations:  <none>
Status:       Pending
IP:           
IPs:          <none>
Containers:
  cuda-container:
    Image:      nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2
    Port:       <none>
    Host Port:  <none>
    Limits:
      nvidia.com/gpu:  1
    Requests:
      nvidia.com/gpu:  1
    Environment:       <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-86lnd (ro)
Conditions:
  Type           Status
  PodScheduled   False 
Volumes:
  kube-api-access-86lnd:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
                             nvidia.com/gpu:NoSchedule op=Exists
Events:
  Type     Reason            Age   From               Message
  ----     ------            ----  ----               -------
  Warning  FailedScheduling  35s   default-scheduler  0/1 nodes are available: 1 node(s) had untolerated taint {gpu: }. preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling..
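
(Side note: the untolerated taint {gpu: } in the event above is the gpu:NoSchedule taint added earlier, which the pod's nvidia.com/gpu:NoSchedule toleration does not match; if that turns out to be the blocker, removing the taint would look like this sketch:)

kubectl taint nodes kind-control-plane gpu:NoSchedule-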

But now, after the labeling and the taint, the device plugin DaemonSet pod does not even start, or rather does not show up in the output of kubectl get pods -n kube-system.

$ kubectl get pods -n kube-system
NAME                                         READY   STATUS    RESTARTS      AGE
coredns-5d78c9869d-fdxj7                     1/1     Running   1 (45m ago)   49m
coredns-5d78c9869d-sgnqj                     1/1     Running   1 (45m ago)   49m
etcd-kind-control-plane                      1/1     Running   1 (45m ago)   50m
kindnet-56rdl                                1/1     Running   1 (45m ago)   49m
kube-apiserver-kind-control-plane            1/1     Running   1 (45m ago)   50m
kube-controller-manager-kind-control-plane   1/1     Running   1 (45m ago)   50m
kube-proxy-vswdf                             1/1     Running   1 (45m ago)   49m
kube-scheduler-kind-control-plane            1/1     Running   1 (45m ago)   50m

Trying to create it again throws the following error:

$ kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.0/nvidia-device-plugin.yml
Error from server (AlreadyExists): error when creating "https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.0/nvidia-device-plugin.yml": daemonsets.apps "nvidia-device-plugin-daemonset" already exists
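
(If the intent is to re-create the already-existing DaemonSet, a sketch using the same manifest URL would be to delete it first, or to use apply instead of create:)

kubectl delete -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.0/nvidia-device-plugin.yml
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.0/nvidia-device-plugin.yml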

Please, can anyone help me out?
