Comments (23)
It's not getting scheduled because you only have 1 GPU and your pod is requesting two GPUs: 1 for cuda-container and 1 for digits-container.
from k8s-device-plugin.
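A quick way to check how many GPUs a node is actually advertising before sizing pod requests (a sketch; assumes kubectl is configured for the cluster):

```shell
# Print each node's Allocatable section; nvidia.com/gpu should be listed
# with the number of schedulable GPUs.
kubectl describe nodes | grep -A 8 'Allocatable:'
```

If nvidia.com/gpu shows 1, a pod whose containers together request 2 GPUs will stay Pending.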
Hello!
Your GPU is too old to be health-checked. The behavior we've decided internally is to mark such GPUs as unhealthy.
If you still need to use it, I can add an option to the daemonset to have GPUs that can't be health-checked marked as healthy (though it wouldn't be the default behavior). Would that be a solution for you?
Yes that would be very helpful.
Thank you.
Fixed. Set the following environment variable when starting the device plugin: DP_DISABLE_HEALTHCHECKS=xids. You can set it either at the pod-spec level for the daemonset, or in your docker run command.
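For the docker run route, a sketch of passing the variable (the socket mount matches the device-plugin daemonset's volumeMounts used in this thread; treat the exact set of flags as an assumption that may need adjusting for your setup):

```shell
docker run -it \
  -e DP_DISABLE_HEALTHCHECKS=xids \
  -v /var/lib/kubelet/device-plugins:/var/lib/kubelet/device-plugins \
  nvidia/k8s-device-plugin:1.9
```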
Same problem.
I assume that when you said "at the pod spec level for the daemon set" you meant in nvidia-device-plugin.yml.
Here is the file I edited; I just added the env stanza:
apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
spec:
  template:
    metadata:
      # Mark this pod as a critical add-on; when enabled, the critical add-on scheduler
      # reserves resources for critical add-on pods so that they can be rescheduled after
      # a failure. This annotation works in tandem with the toleration below.
      annotations:
        scheduler.alpha.kubernetes.io/critical-pod: ""
      labels:
        name: nvidia-device-plugin-ds
    spec:
      nodeSelector:
        gpu: "true"
      tolerations:
      # Allow this pod to be rescheduled while the node is in "critical add-ons only" mode.
      # This, along with the annotation above, marks this pod as a critical add-on.
      - key: CriticalAddonsOnly
        operator: Exists
      containers:
      - image: nvidia/k8s-device-plugin:1.9
        name: nvidia-device-plugin-ctr
        env:
        - name: DP_DISABLE_HEALTHCHECKS
          value: "xids"
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop: ["ALL"]
        volumeMounts:
        - name: device-plugin
          mountPath: /var/lib/kubelet/device-plugins
      volumes:
      - name: device-plugin
        hostPath:
          path: /var/lib/kubelet/device-plugins
Try docker pull nvidia/k8s-device-plugin:1.9
That did it. Thanks!!!!
Same problem here. My YAML file looks like this:
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  containers:
  - name: cuda-container
    image: nvidia/cuda:9.0-devel
    env:
    - name: DP_DISABLE_HEALTHCHECKS
      value: "xids"
    resources:
      limits:
        nvidia.com/gpu: 1 # requesting 1 GPU
  - name: digits-container
    image: nvidia/digits:6.0
    env:
    - name: DP_DISABLE_HEALTHCHECKS
      value: "xids"
    resources:
      limits:
        nvidia.com/gpu: 1 # requesting 1 GPU
and here's the output of kubectl describe pods:
Name: gpu-pod
Namespace: default
Node: <none>
Labels: <none>
Annotations: <none>
Status: Pending
IP:
Containers:
cuda-container:
Image: nvidia/cuda:9.0-devel
Port: <none>
Host Port: <none>
Limits:
nvidia.com/gpu: 1
Requests:
nvidia.com/gpu: 1
Environment:
DP_DISABLE_HEALTHCHECKS: xids
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from default-token-7x4mv (ro)
digits-container:
Image: nvidia/digits:6.0
Port: <none>
Host Port: <none>
Limits:
nvidia.com/gpu: 1
Requests:
nvidia.com/gpu: 1
Environment:
DP_DISABLE_HEALTHCHECKS: xids
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from default-token-7x4mv (ro)
Conditions:
Type Status
PodScheduled False
Volumes:
default-token-7x4mv:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-7x4mv
Optional: false
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 3s (x23 over 3m) default-scheduler 0/1 nodes are available: 1 Insufficient nvidia.com/gpu.
System info
Output of lsb_release -a:
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 16.04.4 LTS
Release: 16.04
Codename: xenial
Output of docker run --rm nvidia/cuda nvidia-smi:
Tue Jul 10 12:11:53 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.25 Driver Version: 390.25 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 107... Off | 00000000:01:00.0 On | N/A |
| 29% 43C P8 7W / 180W | 51MiB / 8116MiB | 1% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
Output of docker -v:
Docker version 18.03.1-ce, build 9ee9f40
Output of kubeadm version:
kubeadm version: &version.Info{Major:"1", Minor:"11", GitVersion:"v1.11.0", GitCommit:"91e7b4fd31fcd3d5f436da26c980becec37ceefe", GitTreeState:"clean", BuildDate:"2018-06-27T20:14:41Z", GoVersion:"go1.10.2", Compiler:"gc", Platform:"linux/amd64"}
Is this also a health-checking issue?
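Since cuda-container and digits-container each request one GPU, the pod above needs two GPUs in total; a sketch that fits a one-GPU node keeps a single GPU container:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  containers:
  - name: cuda-container
    image: nvidia/cuda:9.0-devel
    resources:
      limits:
        nvidia.com/gpu: 1 # the pod's total request now fits a one-GPU node
```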
Hello!
DP_DISABLE_HEALTHCHECKS is a variable expected to be set on the device plugin pod.
Can you:
- Make sure you don't have any other GPU pods running?
- Provide the output of kubectl describe node $MY_GPU_NODE?
- Provide the logs of the device plugin (kubectl logs -n kube-system $DEVICE_PLUGIN_POD) on the GPU node?
Thanks!
Hi!
Here are the outputs:
node description (single node only, for testing)
Name: 127.0.0.1
Roles: <none>
Labels: beta.kubernetes.io/arch=amd64
beta.kubernetes.io/os=linux
kubernetes.io/hostname=127.0.0.1
Annotations: node.alpha.kubernetes.io/ttl=0
volumes.kubernetes.io/controller-managed-attach-detach=true
volumes.kubernetes.io/keep-terminated-pod-volumes=true
CreationTimestamp: Wed, 11 Jul 2018 16:07:40 +0800
Taints: <none>
Unschedulable: false
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
OutOfDisk False Wed, 11 Jul 2018 16:27:52 +0800 Wed, 11 Jul 2018 16:07:40 +0800 KubeletHasSufficientDisk kubelet has sufficient disk space available
MemoryPressure False Wed, 11 Jul 2018 16:27:52 +0800 Wed, 11 Jul 2018 16:07:40 +0800 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Wed, 11 Jul 2018 16:27:52 +0800 Wed, 11 Jul 2018 16:07:40 +0800 KubeletHasNoDiskPressure kubelet has no disk pressure
PIDPressure False Wed, 11 Jul 2018 16:27:52 +0800 Wed, 11 Jul 2018 16:07:40 +0800 KubeletHasSufficientPID kubelet has sufficient PID available
Ready True Wed, 11 Jul 2018 16:27:52 +0800 Wed, 11 Jul 2018 16:07:40 +0800 KubeletReady kubelet is posting ready status. AppArmor enabled
Addresses:
InternalIP: 127.0.0.1
Hostname: 127.0.0.1
Capacity:
cpu: 12
ephemeral-storage: 214726764Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 16342940Ki
nvidia.com/gpu: 1
pods: 110
Allocatable:
cpu: 12
ephemeral-storage: 197892185375
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 16240540Ki
nvidia.com/gpu: 1
pods: 110
System Info:
Machine ID: c0c5f10b8ebccd69fed924285b432951
System UUID: 1C2B5312-6345-F8F3-52D2-4CEDFB66E216
Boot ID: bddbd025-64d2-4cbf-ba4f-56ac8c7bd745
Kernel Version: 4.4.0-130-generic
OS Image: Ubuntu 16.04.4 LTS
Operating System: linux
Architecture: amd64
Container Runtime Version: docker://18.3.1
Kubelet Version: v1.11.0-dirty
Kube-Proxy Version: v1.11.0-dirty
Non-terminated Pods: (2 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits
--------- ---- ------------ ---------- --------------- -------------
kube-system kube-dns-7b479ccbc6-pdzjp 260m (2%) 0 (0%) 110Mi (0%) 170Mi (1%)
kube-system nvidia-device-plugin-daemonset-wdn9w 0 (0%) 0 (0%) 0 (0%) 0 (0%)
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 260m (2%) 0 (0%)
memory 110Mi (0%) 170Mi (1%)
nvidia.com/gpu 0 0
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Starting 20m kube-proxy, 127.0.0.1 Starting kube-proxy.
Normal Starting 20m kubelet, 127.0.0.1 Starting kubelet.
Normal NodeHasSufficientDisk 20m (x2 over 20m) kubelet, 127.0.0.1 Node 127.0.0.1 status is now: NodeHasSufficientDisk
Normal NodeHasSufficientMemory 20m (x2 over 20m) kubelet, 127.0.0.1 Node 127.0.0.1 status is now: NodeHasSufficientMemory
Normal NodeHasNoDiskPressure 20m (x2 over 20m) kubelet, 127.0.0.1 Node 127.0.0.1 status is now: NodeHasNoDiskPressure
Normal NodeHasSufficientPID 20m (x2 over 20m) kubelet, 127.0.0.1 Node 127.0.0.1 status is now: NodeHasSufficientPID
Normal NodeAllocatableEnforced 20m kubelet, 127.0.0.1 Updated Node Allocatable limit across pods
Normal NodeReady 20m kubelet, 127.0.0.1 Node 127.0.0.1 status is now: NodeReady
log of the device plugin
2018/07/11 08:17:27 Loading NVML
2018/07/11 08:17:27 Fetching devices.
2018/07/11 08:17:27 Starting FS watcher.
2018/07/11 08:17:27 Starting OS watcher.
2018/07/11 08:17:27 Starting to serve on /var/lib/kubelet/device-plugins/nvidia.sock
2018/07/11 08:17:27 Registered device plugin with Kubelet
Hello!
Looks like your node is correctly advertising the number of GPUs it has (here, one).
Additionally, your node is not advertising your GPU as unhealthy.
At this point your pod should get scheduled. How much time did you wait?
@RenaudWasTaken thanks for the comment! I understand it better now.
@mindprince you are absolutely right... my stupid mistake. Thanks for pointing it out!
Hello!
Looks like your node is correctly advertising the number of GPUs it has (here, one). Additionally, your node is not advertising your GPU as unhealthy. At this point your pod should get scheduled. How much time did you wait?
Hi, I got the same issue, except that I see no nvidia.com/gpu entry in the capacity list at all. I'm currently running Docker 18.09.1 (I'm not sure why I got the latest version when running docker version, as I manually set the Docker version to 18.06 when installing), with Kubernetes 1.13.2 and nvidia plugin 1.12. I'm wondering whether the plugin simply doesn't support such a new version?
I am facing the same error.
This is the output of the kubectl describe pods command:
Name: xyz-v1-5c5b57cf9c-kvjxn
Namespace: default
Node: <none>
Labels: app=xyz
pod-template-hash=1716137957
version=v1
Annotations: <none>
Status: Pending
IP:
Controlled By: ReplicaSet/xyz-v1-5c5b57cf9c
Containers:
aadhar:
Image: tensorflow/serving:1.11.1-gpu
Port: 9000/TCP
Host Port: 0/TCP
Command:
/usr/bin/tensorflow_model_server
Args:
--port=9000
--model_name=xyz
--model_base_path=gs://xyz_kuber_app-xyz-identification/export/
Limits:
cpu: 4
memory: 4Gi
nvidia.com/gpu: 1
Requests:
cpu: 1
memory: 1Gi
nvidia.com/gpu: 1
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from default-token-b6dpn (ro)
aadhar-http-proxy:
Image: gcr.io/kubeflow-images-public/tf-model-server-http-proxy:v20180606-9dfda4f2
Port: 8000/TCP
Host Port: 0/TCP
Command:
python
/usr/src/app/server.py
--port=8000
--rpc_port=9000
--rpc_timeout=10.0
Limits:
cpu: 1
memory: 1Gi
Requests:
cpu: 500m
memory: 500Mi
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from default-token-b6dpn (ro)
Conditions:
Type Status
PodScheduled False
Volumes:
default-token-b6dpn:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-b6dpn
Optional: false
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
nvidia.com/gpu:NoSchedule
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 20m (x5 over 21m) default-scheduler 0/2 nodes are available: 1 Insufficient nvidia.com/gpu, 1 node(s) were unschedulable.
Warning FailedScheduling 20m (x2 over 20m) default-scheduler 0/2 nodes are available: 1 Insufficient nvidia.com/gpu, 1 node(s) were not ready, 1 node(s) were out of disk space, 1 node(s) were unschedulable.
Warning FailedScheduling 16m (x9 over 19m) default-scheduler 0/1 nodes are available: 1 Insufficient nvidia.com/gpu.
Normal NotTriggerScaleUp 15m (x26 over 20m) cluster-autoscaler pod didn't trigger scale-up (it wouldn't fit if a new node is added)
Warning FailedScheduling 2m42s (x54 over 23m) default-scheduler 0/2 nodes are available: 2 Insufficient nvidia.com/gpu.
Normal TriggeredScaleUp 13s cluster-autoscaler pod triggered scale-up: [{https://content.googleapis.com/compute/v1/projects/adhaar-identification/zones/us-central1-a/instanceGroups/gke-kuberflow-aadhaar-pool-1-9753107b-grp 1->2 (max: 10)}]
Name: mnist-deploy-gcp-b4dd579bf-sjwj7
Namespace: default
Node: gke-kuberflow-xyz-default-pool-ab1fa086-w6q3/10.128.0.8
Start Time: Thu, 14 Feb 2019 14:44:08 +0530
Labels: app=xyz-object
pod-template-hash=608813569
version=v1
Annotations: sidecar.istio.io/inject:
Status: Running
IP: 10.36.4.18
Controlled By: ReplicaSet/mnist-deploy-gcp-b4dd579bf
Containers:
aadhaar-object:
Container ID: docker://921717d82b547a023034e7c8be78216493beeb55dca57f4eddb5968122e36c16
Image: tensorflow/serving:1.11.1
Image ID: docker-pullable://tensorflow/serving@sha256:a01c6475c69055c583aeda185a274942ced458d178aaeb84b4b842ae6917a0bc
Ports: 9000/TCP, 8500/TCP
Host Ports: 0/TCP, 0/TCP
Command:
/usr/bin/tensorflow_model_server
Args:
--port=9000
--rest_api_port=8500
--model_name=xyz-object
--model_base_path=gs://xyz_kuber_app-xyz-identification/export
--monitoring_config_file=/var/config/monitoring_config.txt
State: Running
Started: Thu, 14 Feb 2019 14:48:21 +0530
Last State: Terminated
Reason: Error
Exit Code: 137
Started: Thu, 14 Feb 2019 14:45:58 +0530
Finished: Thu, 14 Feb 2019 14:48:21 +0530
Ready: True
Restart Count: 1
Limits:
cpu: 4
memory: 4Gi
Requests:
cpu: 1
memory: 1Gi
Liveness: tcp-socket :9000 delay=30s timeout=1s period=30s #success=1 #failure=3
Environment:
GOOGLE_APPLICATION_CREDENTIALS: /secret/gcp-credentials/user-gcp-sa.json
Mounts:
/secret/gcp-credentials from gcp-credentials (rw)
/var/config/ from config-volume (rw)
/var/run/secrets/kubernetes.io/serviceaccount from default-token-b6dpn (ro)
Conditions:
Type Status
Initialized True
Ready True
PodScheduled True
Volumes:
config-volume:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: mnist-deploy-gcp-config
Optional: false
gcp-credentials:
Type: Secret (a volume populated by a Secret)
SecretName: user-gcp-sa
Optional: false
default-token-b6dpn:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-b6dpn
Optional: false
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events: <none>
The output of kubectl describe pods | grep gpu is:
Image: tensorflow/serving:1.11.1-gpu
nvidia.com/gpu: 1
nvidia.com/gpu: 1
nvidia.com/gpu:NoSchedule
Warning FailedScheduling 28m (x5 over 29m) default-scheduler 0/2 nodes are available: 1 Insufficient nvidia.com/gpu, 1 node(s) were unschedulable.
Warning FailedScheduling 28m (x2 over 28m) default-scheduler 0/2 nodes are available: 1 Insufficient nvidia.com/gpu, 1 node(s) were not ready, 1 node(s) were out of disk space, 1 node(s) were unschedulable.
Warning FailedScheduling 24m (x9 over 27m) default-scheduler 0/1 nodes are available: 1 Insufficient nvidia.com/gpu.
Warning FailedScheduling 11m (x54 over 31m) default-scheduler 0/2 nodes are available: 2 Insufficient nvidia.com/gpu.
Warning FailedScheduling 48s (x23 over 6m57s) default-scheduler 0/3 nodes are available: 3 Insufficient nvidia.com/gpu.
@AjayZinngg Did you manage to fix it? I got the same issue.
@maminio have you solved this problem? I got the same error.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  containers:
  - name: cuda-container
    image: nvidia/cuda:9.0-devel
    resources:
      limits:
        nvidia.com/gpu: "1" # requesting 1 GPU
  - name: digits-container
    image: nvidia/digits:6.0
    resources:
      limits:
        nvidia.com/gpu: "1" # requesting 1 GPU
  tolerations:
  - effect: NoSchedule
    operator: Exists
I added the tolerations, and then it ran.
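For reference, the blanket toleration above matches every NoSchedule taint, whatever its key; a narrower sketch that only tolerates the nvidia.com/gpu taint seen elsewhere in this thread would be:

```yaml
tolerations:
- key: nvidia.com/gpu
  operator: Exists
  effect: NoSchedule
```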
@chi-hung Do you have any detailed suggestions for solving this problem? I am facing the same error.
@ernestmartinez Do you have any detailed suggestions for solving this problem? I am facing the same error.
Please file a new issue with your specific details if you want it to be looked at. This issue has already been resolved.
I also ran into this problem. It turned out that another pod had used up all the GPU resources; after deleting that pod it worked. So check whether there are enough GPU resources and whether they are already in use.
This issue has been resolved:
- vim /etc/docker/daemon.json and add: "default-runtime": "nvidia"
- systemctl restart docker
Then:
kubectl describe node -A | grep nvidia
  nvidia.com/gpu: 4
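A sketch of a full /etc/docker/daemon.json with that setting, assuming the NVIDIA container runtime binary is named nvidia-container-runtime and is on the PATH (adjust to your install):

```json
{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
```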
Hello there, I am facing a similar issue while integrating an Nvidia GPU with a pod.
I have successfully set up the nvidia-container-toolkit.
$ docker run --runtime=nvidia --gpus all nvidia/cuda:11.6.2-base-ubuntu20.04 nvidia-smi
Wed Jul 5 06:46:43 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03 Driver Version: 535.54.03 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 2080 On | 00000000:26:00.0 Off | N/A |
| 23% 35C P8 18W / 215W | 92MiB / 8192MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
+---------------------------------------------------------------------------------------+
I have set up a kind-control-plane node, visible via kubectl:
$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
kind-control-plane Ready control-plane 35m v1.27.3
I have labeled the node with gpu=installed:
$ kubectl label nodes kind-control-plane gpu=installed
And I have also tainted the node with:
$ kubectl taint nodes kind-control-plane gpu:NoSchedule
because without this the plugin was complaining that the node is not a GPU node:
$ DEVICE_PLUGIN_POD="nvidia-device-plugin-daemonset-tm4pv"
abhishek@abhishek:~$ kubectl logs -n kube-system $DEVICE_PLUGIN_POD
I0705 05:08:24.579349 1 main.go:154] Starting FS watcher.
I0705 05:08:24.580666 1 main.go:161] Starting OS watcher.
I0705 05:08:24.581008 1 main.go:176] Starting Plugins.
I0705 05:08:24.581018 1 main.go:234] Loading configuration.
I0705 05:08:24.581125 1 main.go:242] Updating config with default resource matching patterns.
I0705 05:08:24.581304 1 main.go:253]
Running with config:
{
"version": "v1",
"flags": {
"migStrategy": "none",
"failOnInitError": false,
"nvidiaDriverRoot": "/",
"gdsEnabled": false,
"mofedEnabled": false,
"plugin": {
"passDeviceSpecs": false,
"deviceListStrategy": [
"envvar"
],
"deviceIDStrategy": "uuid",
"cdiAnnotationPrefix": "cdi.k8s.io/",
"nvidiaCTKPath": "/usr/bin/nvidia-ctk",
"containerDriverRoot": "/driver-root"
}
},
"resources": {
"gpus": [
{
"pattern": "*",
"name": "nvidia.com/gpu"
}
]
},
"sharing": {
"timeSlicing": {}
}
}
I0705 05:08:24.581315 1 main.go:256] Retreiving plugins.
W0705 05:08:24.584333 1 factory.go:31] No valid resources detected, creating a null CDI handler
I0705 05:08:24.584382 1 factory.go:107] Detected non-NVML platform: could not load NVML library: libnvidia-ml.so.1: cannot open shared object file: No such file or directory
I0705 05:08:24.584425 1 factory.go:107] Detected non-Tegra platform: /sys/devices/soc0/family file not found
E0705 05:08:24.584433 1 factory.go:115] Incompatible platform detected
E0705 05:08:24.584440 1 factory.go:116] If this is a GPU node, did you configure the NVIDIA Container Toolkit?
E0705 05:08:24.584447 1 factory.go:117] You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
E0705 05:08:24.584455 1 factory.go:118] You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
E0705 05:08:24.584465 1 factory.go:119] If this is not a GPU node, you should set up a toleration or nodeSelector to only deploy this plugin on GPU nodes
I0705 05:08:24.584475 1 main.go:287] No devices found. Waiting indefinitely.
So after labeling and tainting the node we have the following node description:
$ kubectl describe node kind-control-plane
Name: kind-control-plane
Roles: control-plane
Labels: beta.kubernetes.io/arch=amd64
beta.kubernetes.io/os=linux
gpu=installed
kubernetes.io/arch=amd64
kubernetes.io/hostname=kind-control-plane
kubernetes.io/os=linux
node-role.kubernetes.io/control-plane=
node.kubernetes.io/exclude-from-external-load-balancers=
Annotations: kubeadm.alpha.kubernetes.io/cri-socket: unix:///run/containerd/containerd.sock
node.alpha.kubernetes.io/ttl: 0
volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp: Wed, 05 Jul 2023 11:42:02 +0530
Taints: gpu:NoSchedule
Unschedulable: false
Lease:
HolderIdentity: kind-control-plane
AcquireTime: <unset>
RenewTime: Wed, 05 Jul 2023 12:26:51 +0530
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
MemoryPressure False Wed, 05 Jul 2023 12:21:56 +0530 Wed, 05 Jul 2023 11:42:01 +0530 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Wed, 05 Jul 2023 12:21:56 +0530 Wed, 05 Jul 2023 11:42:01 +0530 KubeletHasNoDiskPressure kubelet has no disk pressure
PIDPressure False Wed, 05 Jul 2023 12:21:56 +0530 Wed, 05 Jul 2023 11:42:01 +0530 KubeletHasSufficientPID kubelet has sufficient PID available
Ready True Wed, 05 Jul 2023 12:21:56 +0530 Wed, 05 Jul 2023 11:42:22 +0530 KubeletReady kubelet is posting ready status
Addresses:
InternalIP: 172.19.0.2
Hostname: kind-control-plane
Capacity:
cpu: 16
ephemeral-storage: 238948692Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 65782984Ki
pods: 110
Allocatable:
cpu: 16
ephemeral-storage: 238948692Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 65782984Ki
pods: 110
System Info:
Machine ID: 99d5a8ae622644c889e61e882ec29ec9
System UUID: b9544999-5e7c-40f5-a2e1-519c23074823
Boot ID: 4fdb330e-7a23-453d-99b2-5a4073672224
Kernel Version: 5.15.0-76-generic
OS Image: Debian GNU/Linux 11 (bullseye)
Operating System: linux
Architecture: amd64
Container Runtime Version: containerd://1.7.1
Kubelet Version: v1.27.3
Kube-Proxy Version: v1.27.3
PodCIDR: 10.244.0.0/24
PodCIDRs: 10.244.0.0/24
ProviderID: kind://docker/kind/kind-control-plane
Non-terminated Pods: (9 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits Age
--------- ---- ------------ ---------- --------------- ------------- ---
kube-system coredns-5d78c9869d-fdxj7 100m (0%) 0 (0%) 70Mi (0%) 170Mi (0%) 44m
kube-system coredns-5d78c9869d-sgnqj 100m (0%) 0 (0%) 70Mi (0%) 170Mi (0%) 44m
kube-system etcd-kind-control-plane 100m (0%) 0 (0%) 100Mi (0%) 0 (0%) 44m
kube-system kindnet-56rdl 100m (0%) 100m (0%) 50Mi (0%) 50Mi (0%) 44m
kube-system kube-apiserver-kind-control-plane 250m (1%) 0 (0%) 0 (0%) 0 (0%) 44m
kube-system kube-controller-manager-kind-control-plane 200m (1%) 0 (0%) 0 (0%) 0 (0%) 44m
kube-system kube-proxy-vswdf 0 (0%) 0 (0%) 0 (0%) 0 (0%) 44m
kube-system kube-scheduler-kind-control-plane 100m (0%) 0 (0%) 0 (0%) 0 (0%) 44m
local-path-storage local-path-provisioner-6bc4bddd6b-9ntj6 0 (0%) 0 (0%) 0 (0%) 0 (0%) 44m
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 950m (5%) 100m (0%)
memory 290Mi (0%) 390Mi (0%)
ephemeral-storage 0 (0%) 0 (0%)
hugepages-1Gi 0 (0%) 0 (0%)
hugepages-2Mi 0 (0%) 0 (0%)
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Starting 44m kube-proxy
Normal Starting 40m kube-proxy
Normal NodeAllocatableEnforced 45m kubelet Updated Node Allocatable limit across pods
Normal Starting 45m kubelet Starting kubelet.
Normal NodeHasSufficientMemory 45m (x8 over 45m) kubelet Node kind-control-plane status is now: NodeHasSufficientMemory
Normal NodeHasSufficientPID 45m (x7 over 45m) kubelet Node kind-control-plane status is now: NodeHasSufficientPID
Normal NodeHasNoDiskPressure 45m (x8 over 45m) kubelet Node kind-control-plane status is now: NodeHasNoDiskPressure
Normal NodeAllocatableEnforced 44m kubelet Updated Node Allocatable limit across pods
Normal Starting 44m kubelet Starting kubelet.
Normal NodeHasSufficientMemory 44m kubelet Node kind-control-plane status is now: NodeHasSufficientMemory
Normal NodeHasNoDiskPressure 44m kubelet Node kind-control-plane status is now: NodeHasNoDiskPressure
Normal NodeHasSufficientPID 44m kubelet Node kind-control-plane status is now: NodeHasSufficientPID
Normal RegisteredNode 44m node-controller Node kind-control-plane event: Registered Node kind-control-plane in Controller
Normal NodeReady 44m kubelet Node kind-control-plane status is now: NodeReady
Normal NodeAllocatableEnforced 40m kubelet Updated Node Allocatable limit across pods
Normal Starting 40m kubelet Starting kubelet.
Normal NodeHasSufficientMemory 40m (x8 over 40m) kubelet Node kind-control-plane status is now: NodeHasSufficientMemory
Normal NodeHasNoDiskPressure 40m (x8 over 40m) kubelet Node kind-control-plane status is now: NodeHasNoDiskPressure
Normal NodeHasSufficientPID 40m (x7 over 40m) kubelet Node kind-control-plane status is now: NodeHasSufficientPID
Normal RegisteredNode 40m node-controller Node kind-control-plane event: Registered Node kind-control-plane in Controller
Note that in the above description, the node does not have nvidia.com/gpu listed anywhere.
The description of the gpu-pod is as follows:
$ kubectl describe pods
Name: gpu-pod
Namespace: default
Priority: 0
Node: <none>
Labels: <none>
Annotations: <none>
Status: Pending
IP:
IPs: <none>
Containers:
cuda-container:
Image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2
Port: <none>
Host Port: <none>
Limits:
nvidia.com/gpu: 1
Requests:
nvidia.com/gpu: 1
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-86lnd (ro)
Conditions:
Type Status
PodScheduled False
Volumes:
kube-api-access-86lnd:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
nvidia.com/gpu:NoSchedule op=Exists
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 35s default-scheduler 0/1 nodes are available: 1 node(s) had untolerated taint {gpu: }. preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling..
But now, after the labeling and taint, the daemonset pod does not even start, or rather does not show up in the output of the kubectl get pods -n kube-system command.
$ kubectl get pods -n kube-system
NAME READY STATUS RESTARTS AGE
coredns-5d78c9869d-fdxj7 1/1 Running 1 (45m ago) 49m
coredns-5d78c9869d-sgnqj 1/1 Running 1 (45m ago) 49m
etcd-kind-control-plane 1/1 Running 1 (45m ago) 50m
kindnet-56rdl 1/1 Running 1 (45m ago) 49m
kube-apiserver-kind-control-plane 1/1 Running 1 (45m ago) 50m
kube-controller-manager-kind-control-plane 1/1 Running 1 (45m ago) 50m
kube-proxy-vswdf 1/1 Running 1 (45m ago) 49m
kube-scheduler-kind-control-plane 1/1 Running 1 (45m ago) 50m
Trying to create it again throws the following error:
$ kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.0/nvidia-device-plugin.yml
Error from server (AlreadyExists): error when creating "https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.0/nvidia-device-plugin.yml": daemonsets.apps "nvidia-device-plugin-daemonset" already exists
Please, can anyone help me out?
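One detail worth noting in the output above: the FailedScheduling event reports an untolerated taint with key gpu, while the pod's tolerations only cover nvidia.com/gpu:NoSchedule. A toleration matching the taint that was actually applied (kubectl taint nodes kind-control-plane gpu:NoSchedule) would look like this sketch:

```yaml
tolerations:
- key: gpu
  operator: Exists
  effect: NoSchedule
```

Separately, the device plugin log above shows NVML could not be loaded inside the kind node ("No devices found. Waiting indefinitely."), so the missing nvidia.com/gpu capacity is expected until the NVIDIA Container Toolkit is visible to the node's runtime.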