nvidia / k8s-device-plugin
NVIDIA device plugin for Kubernetes
License: Apache License 2.0
I've installed nvidia-docker2 (nvidia-container-runtime) and Kubernetes 1.9.7, and I run TensorFlow training on 8 NVIDIA 1080 Ti GPUs. When I use kubectl to deploy a pod with nvidia.com/gpu: 8, the logs show:
Iteration 200 (0.833003 iter/s)
But when I run the same job with the command docker run --runtime=nvidia caffe-mpi:v0.2.22.test1,
the performance is much better:
Iteration 200 (1.655003 iter/s)
But when I add the --cgroup-parent of the pod I created earlier to the docker run command, the performance is the same as in the pod:
docker run --runtime=nvidia --cgroup-parent=kubepods-besteffort-podf4e9758b_6fda_11e8_93ce_00163e008c08.slice caffe-mpi:v0.2.22.test1
Iteration 200 (0.851113 iter/s)
I suspect it's related to the cgroup settings Kubernetes applies. Do you have any suggestions? Thanks in advance.
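One way to narrow this down (a sketch, assuming a Linux host; the host-side paths are cgroup-v1 examples and will differ per distro and cgroup driver) is to find the cgroup the kubelet placed the container in and compare its CPU settings with those of the bare docker run:

```shell
# Run inside the slow container: print this process's cgroup membership to
# reveal the kubepods slice the pod was placed under.
head -3 /proc/self/cgroup

# Then, on the host, inspect the CPU bandwidth settings for that slice, e.g.
# (hypothetical slice name, cgroup v1 layout):
#   cat /sys/fs/cgroup/cpu/kubepods.slice/kubepods-besteffort.slice/cpu.cfs_quota_us
#   cat /sys/fs/cgroup/cpu/kubepods.slice/kubepods-besteffort.slice/cpu.shares
```

A cpu.cfs_quota_us of -1 means no hard CPU limit; differing cpu.shares between the two cgroups would explain a throughput gap under CPU contention.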
I have installed k8s v1.9 and done everything in the README file.
The GPU job shows Pending. When I restart the kubelet, the feature-gates line in its log shows nothing:
I0302 17:35:17.372045 16680 feature_gate.go:220] feature gates: &{{} map[]}
but I have added Environment="KUBELET_EXTRA_ARGS=--feature-gates=DevicePlugins=true" to /etc/systemd/system/kubelet.service.d/10-kubeadm.conf.
Is this normal? How should I debug this problem?
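For reference, the drop-in is expected to look like the excerpt below (a sketch of the kubeadm-style file; kubeadm's ExecStart line already passes $KUBELET_EXTRA_ARGS through). It only takes effect after systemctl daemon-reload followed by systemctl restart kubelet, and the flag should then be visible on the kubelet command line in ps aux:

```ini
# /etc/systemd/system/kubelet.service.d/10-kubeadm.conf (excerpt)
[Service]
Environment="KUBELET_EXTRA_ARGS=--feature-gates=DevicePlugins=true"
```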
When I run kubectl describe pod gpu-pod, it shows:
root@a-Z170-HD3P:/home/a/fyk/k8s-device-plugin# kubectl describe pod gpu-pod
Name: gpu-pod
Namespace: default
Node:
Labels:
Annotations:
Status: Pending
IP:
Containers:
cuda-container:
Image: nvidia/cuda:9.0-devel
Port:
Limits:
nvidia.com/gpu: 1
Requests:
nvidia.com/gpu: 1
Environment:
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from default-token-zxxzs (ro)
digits-container:
Image: nvidia/digits:6.0
Port:
Limits:
nvidia.com/gpu: 1
Requests:
nvidia.com/gpu: 1
Environment:
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from default-token-zxxzs (ro)
Conditions:
Type Status
PodScheduled False
Volumes:
default-token-zxxzs:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-zxxzs
Optional: false
QoS Class: BestEffort
Node-Selectors:
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
Warning FailedScheduling 2m (x124 over 37m) default-scheduler 0/1 nodes are available: 1 Insufficient nvidia.com/gpu.
I tried kubectl describe node; the result shows nothing about GPUs:
Capacity:
cpu: 8
memory: 16387484Ki
pods: 110
I think the device plugin is not working at all.
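Two quick checks that may help here (default kubelet paths assumed): a working plugin leaves its gRPC socket in the kubelet's device-plugins directory, and the node then advertises nvidia.com/gpu in its capacity:

```shell
# A healthy device plugin creates nvidia.sock in the kubelet's plugin dir.
ls /var/lib/kubelet/device-plugins/ 2>/dev/null | grep . \
  || echo "no device-plugin sockets found"

# The node's Capacity/Allocatable should then list the resource, e.g.:
#   kubectl describe node <node-name> | grep nvidia.com/gpu
```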
I am following the blog post "How to use GPUs with Device Plugin in OpenShift 3.9 (Now Tech Preview!)" on blog.openshift.com.
In my case, nvidia-device-plugin shows errors like below:
# oc logs -f nvidia-device-plugin-daemonset-nj9p8
2018/06/06 12:40:11 Loading NVML
2018/06/06 12:40:11 Fetching devices.
2018/06/06 12:40:11 Starting FS watcher.
2018/06/06 12:40:11 Starting OS watcher.
2018/06/06 12:40:11 Starting to serve on /var/lib/kubelet/device-plugins/nvidia.sock
2018/06/06 12:40:16 Could not register device plugin: context deadline exceeded
2018/06/06 12:40:16 Could not contact Kubelet, retrying. Did you enable the device plugin feature gate?
2018/06/06 12:40:16 You can check the prerequisites at: https://github.com/NVIDIA/k...
2018/06/06 12:40:16 You can learn how to set the runtime at: https://github.com/NVIDIA/k...
2018/06/06 12:40:16 Starting to serve on /var/lib/kubelet/device-plugins/nvidia.sock
...
# oc describe pod nvidia-device-plugin-daemonset-2
Name: nvidia-device-plugin-daemonset-2jqgk
Namespace: nvidia
Node: node02/192.168.5.102
Start Time: Wed, 06 Jun 2018 22:59:32 +0900
Labels: controller-revision-hash=4102904998
name=nvidia-device-plugin-ds
pod-template-generation=1
Annotations: openshift.io/scc=nvidia-deviceplugin
Status: Running
IP: 192.168.5.102
Controlled By: DaemonSet/nvidia-device-plugin-daemonset
Containers:
nvidia-device-plugin-ctr:
Container ID: docker://b92280bd124df9fd46fe08ab4bbda76e2458cf5572f5ffc651661580bcd9126d
Image: nvidia/k8s-device-plugin:1.9
Image ID: docker-pullable://nvidia/k8s-device-plugin@sha256:7ba244bce75da00edd907209fe4cf7ea8edd0def5d4de71939899534134aea31
Port: <none>
State: Running
Started: Wed, 06 Jun 2018 22:59:34 +0900
Ready: True
Restart Count: 0
Environment: <none>
Mounts:
/var/lib/kubelet/device-plugins from device-plugin (rw)
/var/run/secrets/kubernetes.io/serviceaccount from nvidia-deviceplugin-token-cv7p5 (ro)
Conditions:
Type Status
Initialized True
Ready True
PodScheduled True
Volumes:
device-plugin:
Type: HostPath (bare host directory volume)
Path: /var/lib/kubelet/device-plugins
HostPathType:
nvidia-deviceplugin-token-cv7p5:
Type: Secret (a volume populated by a Secret)
SecretName: nvidia-deviceplugin-token-cv7p5
Optional: false
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/disk-pressure:NoSchedule
node.kubernetes.io/memory-pressure:NoSchedule
node.kubernetes.io/not-ready:NoExecute
node.kubernetes.io/unreachable:NoExecute
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal SuccessfulMountVolume 1h kubelet, node02 MountVolume.SetUp succeeded for volume "device-plugin"
Normal SuccessfulMountVolume 1h kubelet, node02 MountVolume.SetUp succeeded for volume "nvidia-deviceplugin-token-cv7p5"
Normal Pulled 1h kubelet, node02 Container image "nvidia/k8s-device-plugin:1.9" already present on machine
Normal Created 1h kubelet, node02 Created container
Normal Started 1h kubelet, node02 Started container
Running
docker run -it -v /var/lib/kubelet/device-plugins:/var/lib/kubelet/device-plugins nvidia/k8s-device-plugin:1.9
shows the same log messages as above.
On each origin-node, a docker run test looks like this (it's normal, right?):
# docker run --rm nvidia/cuda nvidia-smi --query-gpu=gpu_name --format=csv,noheader --id=0 | sed -e 's/ /-/g'
Tesla-P40
# docker run -it --rm docker.io/mirrorgoogleconta...
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
[Test Env.]
[Master]
# oc version
oc v3.9.0+46ff3a0-18
kubernetes v1.9.1+a0ce1bc657
features: Basic-Auth GSSAPI Kerberos SPNEGO
Server https://MYDOMAIN.local:8443
openshift v3.9.0+46ff3a0-18
kubernetes v1.9.1+a0ce1bc657
# uname -r
3.10.0-862.3.2.el7.x86_64
# cat /etc/redhat-release
CentOS Linux release 7.5.1804 (Core)
[GPU nodes]
# docker version
Client:
Version: 18.03.1-ce
API version: 1.37
Go version: go1.9.5
Git commit: 9ee9f40
Built: Thu Apr 26 07:20:16 2018
OS/Arch: linux/amd64
Experimental: false
Orchestrator: swarm
Server:
Engine:
Version: 18.03.1-ce
API version: 1.37 (minimum version 1.12)
Go version: go1.9.5
Git commit: 9ee9f40
Built: Thu Apr 26 07:23:58 2018
OS/Arch: linux/amd64
Experimental: false
# uname -r
3.10.0-862.3.2.el7.x86_64
# cat /etc/redhat-release
CentOS Linux release 7.5.1804 (Core)
# docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
4b1a37d31cb9 openshift/node:v3.9.0 "/usr/local/bin/orig…" 22 minutes ago Up 21 minutes origin-node
efbedeeb88f0 fe3e6b0d95b5 "nvidia-device-plugin" About an hour ago Up About an hour k8s_nvidia-device-plugin-ctr_nvidia-device-plugin-daemonset-4sn5v_nvidia_bffb6d61-6986-11e8-8dd7-0cc47ad9bf7a_0
36aa988447b8 openshift/origin-pod:v3.9.0 "/usr/bin/pod" About an hour ago Up About an hour k8s_POD_nvidia-device-plugin-daemonset-4sn5v_nvidia_bffb6d61-6986-11e8-8dd7-0cc47ad9bf7a_0
6e6b598fa144 openshift/openvswitch:v3.9.0 "/usr/local/bin/ovs-…" 2 hours ago Up 2 hours openvswitch
# cat /etc/docker/daemon.json
{
"default-runtime": "nvidia",
"runtimes": {
"nvidia": {
"path": "/usr/bin/nvidia-container-runtime",
"runtimeArgs": []
}
}
}
Please help me with this problem. TIA!
I tried to deploy device-plugin v1.9 on k8s,
and I have a problem similar to the v1.8 issue "nvidia-device-plugin container CrashLoopBackOff error":
the container goes into CrashLoopBackOff:
NAME READY STATUS RESTARTS AGE
nvidia-device-plugin-daemonset-2h9rh 0/1 CrashLoopBackOff 11 33m
Building and running the image locally with Docker shows the problem:
docker build -t nvidia/k8s-device-plugin:1.9 .
Successfully built d12ed13b386a
Successfully tagged nvidia/k8s-device-plugin:1.9
14:25:40 Loading NVML
14:25:40 Failed to start nvml with error: could not load NVML library.
Environment :
$ cat /etc/ld.so.conf.d/x86_64-linux-gnu_GL.conf
/usr/lib/nvidia-384
/usr/lib32/nvidia-384
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.90 Driver Version: 384.90 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 106... Off | 00000000:03:00.0 Off | N/A |
| 38% 29C P8 6W / 120W | 0MiB / 6069MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
And I used docker run --runtime=nvidia --security-opt=no-new-privileges --cap-drop=ALL --network=none -it -v /var/lib/kubelet/device-plugins:/var/lib/kubelet/device-plugins nvidia/k8s-device-plugin:1.9
show error :
2017/12/27 14:38:22 Loading NVML
2017/12/27 14:38:22 Fetching devices.
2017/12/27 14:38:22 Starting FS watcher.
2017/12/27 14:38:22 Starting OS watcher.
2017/12/27 14:38:22 Starting to serve on /var/lib/kubelet/device-plugins/nvidia.sock
2017/12/27 14:38:27 Could not register device plugin: context deadline exceeded
2017/12/27 14:38:27 Could not contact Kubelet, retrying. Did you enable the device plugin feature gate?
2017/12/27 14:38:27 Starting to serve on /var/lib/kubelet/device-plugins/nvidia.sock
2017/12/27 14:38:32 Could not register device plugin: context deadline exceeded
2017/12/27 14:38:32 Could not contact Kubelet, retrying. Did you enable the device plugin feature gate?
2017/12/27 14:38:32 Starting to serve on /var/lib/kubelet/device-plugins/nvidia.sock
2017/12/27 14:38:37 Could not register device plugin: context deadline exceeded
.
.
.
like this one. 67e5d3f
1. Question
This is not an issue, only a question.
Can we use k8s-device-plugin and nvidia-docker2 in minikube?
2. Environment
・OS: Ubuntu 16.04
・minikube: v0.24.1
・kubectl: v1.10.0
・NVIDIA driver: 384.111
・Docker:
Client:
Version: 18.03.0-ce
API version: 1.37
Go version: go1.9.4
Git commit: 0520e24
Built: Wed Mar 21 23:10:01 2018
OS/Arch: linux/amd64
Experimental: false
Orchestrator: swarm
Server:
Engine:
Version: 18.03.0-ce
API version: 1.37 (minimum version 1.12)
Go version: go1.9.4
Git commit: 0520e24
Built: Wed Mar 21 23:08:31 2018
OS/Arch: linux/amd64
Experimental: false
・Results of kubectl version
Client Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.0", GitCommit:"fc32d2f3698e36b93322a3465f63a14e9f0eaead", GitTreeState:"clean", BuildDate:"2018-03-26T16:55:54Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"8", GitVersion:"v1.8.0", GitCommit:"0b9efaeb34a2fc51ff8e4d34ad9bc6375459c4a4", GitTreeState:"clean", BuildDate:"2017-11-29T22:43:34Z", GoVersion:"go1.9.1", Compiler:"gc", Platform:"linux/amd64"}
・GPU: Maxwell GeForce TITAN X
・/etc/docker/daemon.json
{
"dns": ["150.16.X.X", "150.16.X.X"],
"runtimes": {
"nvidia": {
"path": "/usr/bin/nvidia-container-runtime",
"runtimeArgs": []
}
}
}
3. Problem
(1) I installed minikube and kubectl to test nvidia-docker2.
(2) I started minikube as below:
sudo CHANGE_MINIKUBE_NONE_USER=true minikube start --vm-driver=none --feature-gates=Accelerators=true
★Hypervisor = on (Ubuntu PC BIOS)
(2) I did as below:
$ kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v1.8/nvidia-device-plugin.yml
(3) Next I did as below:
$ kubectl create -f test.yml
(4) The test.yml file:
apiVersion: v1
kind: Pod
metadata:
name: cuda-vector-add
spec:
restartPolicy: OnFailure
containers:
- name: cuda-vector-add
# https://github.com/kubernetes/kubernetes/blob/v1.7.11/test/images/nvidi$
image: "k8s.gcr.io/cuda-vector-add:v0.1"
resources:
limits:
nvidia.com/gpu: 1 # requesting 1 GPU
(5) Results:
$ kubectl get pod
NAME READY STATUS RESTARTS AGE
cuda-vector-add 0/1 Pending 0 11s
nvidia-device-plugin-daemonset-mq4pm 0/1 CrashLoopBackOff 4 2m
★Pod error and nvidia-device-plugin-daemonset error
(6) My opinion
I faced this error (the pod and DaemonSet were not Running); I think the nvidia-device-plugin was disabled.
But I don't know how to enable the nvidia-device-plugin.
Perhaps I must set --feature-gates=DevicePlugins=true,
but it looks like minikube does not use the kubelet directly.
★Could you give any advice on using nvidia-docker2 in minikube?
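A possible approach (an assumption on my part, not verified on this exact setup): with --vm-driver=none, minikube does run a real kubelet on the host, and kubelet flags can be passed through minikube's --extra-config option rather than through KUBELET_EXTRA_ARGS:

```shell
# Sketch: forward the DevicePlugins feature gate to minikube's kubelet.
sudo CHANGE_MINIKUBE_NONE_USER=true minikube start \
  --vm-driver=none \
  --extra-config=kubelet.feature-gates=DevicePlugins=true
```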
I'm using JupyterLab on Kubernetes and have a cluster of 8 CPU worker nodes and one CPU/GPU worker node. I have the device plugin set up, and when I log into JupyterLab, a user pod is created and the device plugin/scheduler run it on my GPU node. All is great, until a second user logs in: the second user's pod fails to start, as the GPU has already been allocated to the first user.
Q: Is it correct that pods can't share a GPU device? If so, why not? There seems to be a valid use case here: multiple users running training tasks on a shared GPU, at least at different times.
I have a mixed cluster -- some nodes with GPUs, some without. The plugins start up nicely on the GPU nodes, but not so much on the nodes without GPUs (obviously).
The implementation uses a DaemonSet, so each node gets a pod, but the pod on the non-GPU nodes goes into CrashLoopBackOff, I assume because of the missing GPU. My question is whether I can set a flag/label/something to tell the pod running on a non-GPU node to just stop trying? I'd rather not leave it there continually trying to restart.
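One way to keep the DaemonSet off the non-GPU nodes entirely (a sketch: accelerator=nvidia is a hypothetical label name you would first apply yourself, e.g. kubectl label node <gpu-node> accelerator=nvidia) is a nodeSelector in the DaemonSet's pod template:

```yaml
# DaemonSet pod-template excerpt: only schedule onto nodes carrying the
# (hypothetical) accelerator=nvidia label.
spec:
  template:
    spec:
      nodeSelector:
        accelerator: nvidia
```

With that in place, plugin pods are simply never created on unlabeled nodes, so there is nothing left to crash-loop.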
I deployed the device-plugin container on k8s via the guide.
However, I got a container CrashLoopBackOff error:
NAME READY STATUS RESTARTS AGE
nvidia-device-plugin-daemonset-zb8xn 0/1 CrashLoopBackOff 6 9m
And when I run
docker run -it -v /var/lib/kubelet/device-plugins:/var/lib/kubelet/device-plugins nvidia/k8s-device-plugin:1.8
I got an error like this:
2017/11/29 01:54:30 Loading NVML
2017/11/29 01:54:30 could not load NVML library
But I am pretty sure that I have installed the NVML library.
So did I miss anything here?
How can I check whether the NVML library is installed?
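One way to check (on the host; libnvidia-ml ships with the NVIDIA driver package, not with CUDA, so having CUDA installed is not enough):

```shell
# Look for the NVML shared library in the dynamic linker cache; it must also
# be visible inside the container for the plugin to start.
ldconfig -p 2>/dev/null | grep libnvidia-ml \
  || echo "NVML library not found in ld cache"
```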
When using DaemonSets in this kind of cluster, non-GPU nodes complain:
Error: failed to start container "nvidia-device-plugin-ctr": Error response from daemon: oci runtime error: container_linux.go:265: starting container process caused "process_linux.go:368: container init caused \"process_linux.go:351: running prestart hook 0 caused \\\"error running hook: exit status 1, stdout: , stderr: exec command: [/usr/bin/nvidia-container-cli --load-kmods configure --ldconfig=@/sbin/ldconfig --device=ALL --utility --compute --pid=16424 /var/lib/docker/overlay/a86473af4c52afb44dfdfdcc817edb45316d520cccfb086d87cc227314d09015/merged]\\\\nnvidia-container-cli: initialization error: load library failed: libcuda.so.1: cannot open shared object file: no such file or directory\\\\n\\\"\""
It's straightforward to work around with taints (which could be documented), but how about also handling it in this plugin (i.e. better error handling)?
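Until the plugin handles GPU-less nodes gracefully, the taint route can be sketched as follows: taint the GPU nodes (with a hypothetical key, e.g. kubectl taint nodes <gpu-node> nvidia.com/gpu=present:NoSchedule) and add a matching toleration to the device-plugin DaemonSet and to GPU workloads. Note that a toleration alone only lets pods onto tainted nodes; to keep the DaemonSet off non-GPU nodes it is usually combined with a node label and selector:

```yaml
# DaemonSet pod-template excerpt: tolerate the (hypothetical) GPU-node taint.
spec:
  template:
    spec:
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
```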
On a k8s CPU node, we set nvidia as the default runtime:
{
"runtimes": {
"nvidia": {
"path": "/usr/bin/nvidia-container-runtime",
"runtimeArgs": []
}
},
"default-runtime": "nvidia",
"registry-mirrors": ["https://registry.docker-cn.com"]
}
When we start the pod of nvidia/k8s-device-plugin:1.10,
the error is:
kubelet, 00-25-90-c0-f7-c8 Error: failed to start container "nvidia-device-plugin-ctr": Error response from daemon: oci runtime error: container_linux.go:265: starting container process caused "process_linux.go:368: container init caused \"process_linux.go:351: running prestart hook 0 caused \\\"error running hook: exit status 1, stdout: , stderr: exec command: [/usr/bin/nvidia-container-cli --load-kmods configure --ldconfig=@/sbin/ldconfig --device=all --utility --pid=3567 /var/lib/docker/overlay2/c4498cb4052e704adff6d4ce5d4a8190afb89764a7bc8645d97c6b0520ba3a81/merged]\\\\nnvidia-container-cli: initialization error: cuda error: unknown error\\\\n\\\"\""
Warning BackOff 1s (x3 over 5s) kubelet, 00-25-90-c0-f7-c8 Back-off restarting failed container
What we expected:
nvidia/k8s-device-plugin:1.10
should run on a non-GPU node with the nvidia Docker runtime.
Hi, everyone.
I've got the CrashLoopBackOff error too:
NAME READY STATUS RESTARTS AGE
nvidia-device-plugin-daemonset-csjxw 0/1 CrashLoopBackOff 12 39m
However, when I ran the container on the node with an NVIDIA GPU:
docker run --runtime=nvidia -it -v /var/lib/kubelet/device-plugins:/var/lib/kubelet/device-plugins nvidia/k8s-device-plugin:1.8
it showed that the device plugin registered successfully, didn't it?
2017/12/08 06:41:44 Loading NVML
2017/12/08 06:41:45 Starting to serve on /var/lib/kubelet/device-plugins/nvidia.sock
2017/12/08 06:41:45 Registered device plugin with Kubelet
And the kubelet has started with --feature-gates=DevicePlugins=true.
By the way, my NVIDIA GPU is a GeForce GTX 1070.
Why did this error come out?
I mean:
does k8s-device-plugin:v1.8 only work with Kubernetes v1.8.x,
and k8s-device-plugin:v1.9 with Kubernetes v1.9.x?
Could we use k8s-device-plugin:v1.9 with Kubernetes v1.8.x?
Hi everyone.
I ran into some trouble today installing this plugin.
Here is my environment:
AWS Ubuntu Server 16.04
docker 18.03.1-ce
NVIDIA Docker: 2.0.3
CUDA Version 9.1.85
I have already installed nvidia-docker2. Then I used the following command to test nvidia-docker2, and it was successful:
docker run --runtime=nvidia -it -p 8888:8888 tensorflow/tensorflow:latest-gpu
Then I followed the guide to install this plugin. I configured /etc/docker/daemon.json and
ran the following commands:
sudo systemctl daemon-reload && sudo systemctl restart docker
And my configuration in daemon.json is here
{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "/usr/bin/nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
But this step failed, and I got the following output:
Job for docker.service failed because the control process exited with error code
Who can help me?
Thank you!
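"Job for docker.service failed" is most often caused by a syntax error in daemon.json. A quick way to validate the file before restarting Docker (a sketch; python3 is assumed to be available):

```shell
# Write the intended configuration to a scratch file and check that it parses
# as JSON before copying it to /etc/docker/daemon.json.
cat > /tmp/daemon.json <<'EOF'
{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "/usr/bin/nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
EOF
python3 -m json.tool /tmp/daemon.json > /dev/null && echo "valid JSON"

# If docker still fails to start, the real reason shows up in:
#   journalctl -u docker.service
```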
I am unable to get GPU device support through k8s.
I am running 2 p2.xlarge nodes on AWS with a manual installation of K8s.
The nvidia-docker2 is installed and set as the default runtime. I tested this by running the following and getting the expected output.
docker run --rm nvidia/cuda nvidia-smi
I followed all the steps in the README of this repo, and cannot seem to get the containers to have GPU access. The DaemonSet from nvidia-device-plugin.yml seems to be up and working, but running a pod gives this error when trying to launch the digits job:
$ kubectl get pod gpu-pod --template '{{.status.conditions}}' [map[type:PodScheduled lastProbeTime:<nil> lastTransitionTime:2018-02-26T21:58:32Z message:0/2 nodes are available: 1 PodToleratesNodeTaints, 2 Insufficient nvidia.com/gpu. reason:Unschedulable status:False]]
I thought that it might be that I was requiring too many resources (2 per node), but even lowering the requirements in the yml still yielded the same result. Any ideas where things could be going wrong?
As mentioned in our discussion, it'd be helpful to push the device plugin docker image to docker hub so that we don't have to build and manage it ourselves.
Trying to install the device plugin, but no luck:
Error: failed to start container "nvidia-device-plugin-ctr": Error response from daemon: OCI runtime create failed: container_linux.go:296: starting container process caused "process_linux.go:398: container init caused \"process_linux.go:381: running prestart hook 0 caused \\\"error running hook: exit status 1, stdout: , stderr: exec command: [/usr/bin/nvidia-container-cli --load-kmods configure --ldconfig=@/sbin/ldconfig --device=all --utility --pid=31021 /var/lib/docker/aufs/mnt/a2f849e29fcb8dc87d51e90497d7e44a38d7ecf93acabc285523d13c1cdf9046]\\\\nnvidia-container-cli: ldcache error: process /sbin/ldconfig failed with error code: 127\\\\n\\\"\"": unknown
Back-off restarting failed container
I installed and configured the default runtime:
# nvidia-docker version
NVIDIA Docker: 2.0.2
Client:
Version: 17.12.0-ce
API version: 1.35
Go version: go1.9.2
Git commit: c97c6d6
Built: Wed Dec 27 20:11:19 2017
OS/Arch: linux/amd64
Server:
Engine:
Version: 17.12.0-ce
API version: 1.35 (minimum version 1.12)
Go version: go1.9.2
Git commit: c97c6d6
Built: Wed Dec 27 20:09:53 2017
OS/Arch: linux/amd64
Experimental: false
# docker info | grep -i runtime
Runtimes: nvidia runc
WARNING: No swap limit support
Default Runtime: nvidia
Configured kubernetes with feature gates
# ps -ef | grep kube | grep featu
root 23964 23945 3 15:03 ? 00:00:16 kube-apiserver --bind-address=0.0.0.0 --insecure-bind-address=127.0.0.1 --insecure-port=8080 --service-node-port-range=30000-32767 --storage-backend=etcd3 --admission-control=Initializers,NamespaceLifecycle,LimitRanger,ServiceAccount,DefaultStorageClass,ValidatingAdmissionWebhook,ResourceQuota --allow-privileged=true --apiserver-count=1 \
--feature-gates=Initializers=False,PersistentLocalVolumes=False,DevicePlugins=True --runtime-config=admissionregistration.k8s.io/v1alpha1 --requestheader-extra-headers-prefix=X-Remote-Extra- --advertise-address=192.168.0.102 --service-account-key-file=/etc/kubernetes/ssl/sa.pub --enable-bootstrap-token-auth=true --kubelet-preferred-address-types=InternalIP,ExternalIP,Hostname --requestheader-group-headers=X-Remote-Group --client-ca-file=/etc/kubernetes/ssl/ca.crt --tls-private-key-file=/etc/kubernetes/ssl/apiserver.key --kubelet-client-key=/etc/kubernetes/ssl/apiserver-kubelet-client.key --requestheader-client-ca-file=/etc/kubernetes/ssl/front-proxy-ca.crt --proxy-client-cert-file=/etc/kubernetes/ssl/front-proxy-client.crt --tls-cert-file=/etc/kubernetes/ssl/apiserver.crt --proxy-client-key-file=/etc/kubernetes/ssl/front-proxy-client.key --requestheader-username-headers=X-Remote-User --requestheader-allowed-names=front-proxy-client --service-cluster-ip-range=10.233.0.0/18 --kubelet-client-certificate=/etc/kubernetes/ssl/apiserver-kubelet-client.crt --secure-port=6443 --authorization-mode=Node,RBAC --etcd-servers=https://192.168.0.102:2379 --etcd-cafile=/etc/kubernetes/ssl/etcd/ca.pem --etcd-certfile=/etc/kubernetes/ssl/etcd/node-rig3.pem --etcd-keyfile=/etc/kubernetes/ssl/etcd/node-rig3-key.pem
root 24226 24208 1 15:03 ? 00:00:07 kube-controller-manager --feature-gates=Initializers=False,PersistentLocalVolumes=False,DevicePlugins=True \
--node-monitor-grace-period=40s --node-monitor-period=5s --pod-eviction-timeout=5m0s --cluster-signing-cert-file=/etc/kubernetes/ssl/ca.crt --cluster-signing-key-file=/etc/kubernetes/ssl/ca.key --use-service-account-credentials=true --root-ca-file=/etc/kubernetes/ssl/ca.crt --service-account-private-key-file=/etc/kubernetes/ssl/sa.key --kubeconfig=/etc/kubernetes/controller-manager.conf --address=127.0.0.1 --leader-elect=true --controllers=*,bootstrapsigner,tokencleaner --allocate-node-cidrs=true --cluster-cidr=10.233.64.0/18 --node-cidr-mask-size=24
root 25315 1 2 15:04 ? 00:00:09 /usr/local/bin/kubelet --logtostderr=true --v=2 --address=0.0.0.0 --node-ip=192.168.0.102 --hostname-override=rig3 --allow-privileged=true --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf --authorization-mode=Webhook --client-ca-file=/etc/kubernetes/ssl/ca.crt --pod-manifest-path=/etc/kubernetes/manifests --cadvisor-port=0 --pod-infra-container-image=gcr.io/google_containers/pause-amd64:3.0 --kube-reserved cpu=100m,memory=256M --node-status-update-frequency=10s --cgroup-driver=cgroupfs --docker-disable-shared-pid=True --anonymous-auth=false --read-only-port=0 --fail-swap-on=True --cluster-dns=10.233.0.3 --cluster-domain=umine.farm --resolv-conf=/etc/resolv.conf --kube-reserved cpu=200m,memory=512M \
--feature-gates=Initializers=False,PersistentLocalVolumes=False,DevicePlugins=True --network-plugin=cni --cni-conf-dir=/etc/cni/net.d --cni-bin-dir=/opt/cni/bin
Latest version of kubernetes
# kubectl version
Client Version: version.Info{Major:"1", Minor:"9", GitVersion:"v1.9.2+coreos.0", GitCommit:"b427929b2982726eeb64e985bc1ebb41aaa5e095", GitTreeState:"clean", BuildDate:"2018-01-18T22:56:14Z", GoVersion:"go1.9.2", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"9", GitVersion:"v1.9.2+coreos.0", GitCommit:"b427929b2982726eeb64e985bc1ebb41aaa5e095", GitTreeState:"clean", BuildDate:"2018-01-18T22:56:14Z", GoVersion:"go1.9.2", Compiler:"gc", Platform:"linux/amd64"}
Pod description:
# kubectl describe pod nvidia-device-plugin-daemonset-kzlbx -n kube-system
Name: nvidia-device-plugin-daemonset-kzlbx
Namespace: kube-system
Node: rig1/192.168.0.103
Start Time: Fri, 16 Feb 2018 15:06:31 +0200
Labels: controller-revision-hash=54069593
name=nvidia-device-plugin-ds
pod-template-generation=1
Annotations: scheduler.alpha.kubernetes.io/critical-pod=
Status: Running
IP: 10.233.101.88
Controlled By: DaemonSet/nvidia-device-plugin-daemonset
Containers:
nvidia-device-plugin-ctr:
Container ID: docker://42676f92a1cce3489f87650433029ad27aa2bb24d9529a15689641410ed31d41
Image: nvidia/k8s-device-plugin:1.9
Image ID: docker-pullable://nvidia/k8s-device-plugin@sha256:ed1cb6269dd827bada9691a7ae59dab4f431a05a9fb8082f8c28bfa9fd90b6c4
Port: <none>
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: ContainerCannotRun
Message: OCI runtime create failed: container_linux.go:296: starting container process caused "process_linux.go:398: container init caused \"process_linux.go:381: running prestart hook 0 caused \\\"error running hook: exit status 1, stdout: , stderr: exec command: [/usr/bin/nvidia-container-cli --load-kmods configure --ldconfig=@/sbin/ldconfig --device=all --utility --pid=2193 /var/lib/docker/aufs/mnt/ac16d904f39b452545a1bebf06148a8802b1a4b088a183f4fe733cf2547ed32c]\\\\nnvidia-container-cli: ldcache error: process /sbin/ldconfig failed with error code: 127\\\\n\\\"\"": unknown
Exit Code: 128
Started: Fri, 16 Feb 2018 15:17:32 +0200
Finished: Fri, 16 Feb 2018 15:17:32 +0200
Ready: False
Restart Count: 7
Environment: <none>
Mounts:
/var/lib/kubelet/device-plugins from device-plugin (rw)
/var/run/secrets/kubernetes.io/serviceaccount from default-token-pm75k (ro)
Conditions:
Type Status
Initialized True
Ready False
PodScheduled True
Volumes:
device-plugin:
Type: HostPath (bare host directory volume)
Path: /var/lib/kubelet/device-plugins
HostPathType:
default-token-pm75k:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-pm75k
Optional: false
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: CriticalAddonsOnly
node.kubernetes.io/disk-pressure:NoSchedule
node.kubernetes.io/memory-pressure:NoSchedule
node.kubernetes.io/not-ready:NoExecute
node.kubernetes.io/unreachable:NoExecute
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal SuccessfulMountVolume 13m kubelet, rig1 MountVolume.SetUp succeeded for volume "device-plugin"
Normal SuccessfulMountVolume 13m kubelet, rig1 MountVolume.SetUp succeeded for volume "default-token-pm75k"
Warning Failed 13m kubelet, rig1 Error: failed to start container "nvidia-device-plugin-ctr": Error response from daemon: OCI runtime create failed: container_linux.go:296: starting container process caused "process_linux.go:398: container init caused \"process_linux.go:381: running prestart hook 0 caused \\\"error running hook: exit status 1, stdout: , stderr: exec command: [/usr/bin/nvidia-container-cli --load-kmods configure --ldconfig=@/sbin/ldconfig --device=all --utility --pid=31021 /var/lib/docker/aufs/mnt/a2f849e29fcb8dc87d51e90497d7e44a38d7ecf93acabc285523d13c1cdf9046]\\\\nnvidia-container-cli: ldcache error: process /sbin/ldconfig failed with error code: 127\\\\n\\\"\"": unknown
Warning Failed 13m kubelet, rig1 Error: failed to start container "nvidia-device-plugin-ctr": Error response from daemon: OCI runtime create failed: container_linux.go:296: starting container process caused "process_linux.go:398: container init caused \"process_linux.go:381: running prestart hook 0 caused \\\"error running hook: exit status 1, stdout: , stderr: exec command: [/usr/bin/nvidia-container-cli --load-kmods configure --ldconfig=@/sbin/ldconfig --device=all --utility --pid=31068 /var/lib/docker/aufs/mnt/508159dc054cd38ef20a75373a230703de9cba817f44e69da02b82ceac08fb64]\\\\nnvidia-container-cli: ldcache error: process /sbin/ldconfig failed with error code: 127\\\\n\\\"\"": unknown
Warning Failed 12m kubelet, rig1 Error: failed to start container "nvidia-device-plugin-ctr": Error response from daemon: OCI runtime create failed: container_linux.go:296: starting container process caused "process_linux.go:398: container init caused \"process_linux.go:381: running prestart hook 0 caused \\\"error running hook: exit status 1, stdout: , stderr: exec command: [/usr/bin/nvidia-container-cli --load-kmods configure --ldconfig=@/sbin/ldconfig --device=all --utility --pid=31210 /var/lib/docker/aufs/mnt/9dab03a8dcf80c0de647bc46b985c0e66fed9cead529e20d499dfaf7d9dcc49c]\\\\nnvidia-container-cli: ldcache error: process /sbin/ldconfig failed with error code: 127\\\\n\\\"\"": unknown
Warning Failed 12m kubelet, rig1 Error: failed to start container "nvidia-device-plugin-ctr": Error response from daemon: OCI runtime create failed: container_linux.go:296: starting container process caused "process_linux.go:398: container init caused \"process_linux.go:381: running prestart hook 0 caused \\\"error running hook: exit status 1, stdout: , stderr: exec command: [/usr/bin/nvidia-container-cli --load-kmods configure --ldconfig=@/sbin/ldconfig --device=all --utility --pid=31378 /var/lib/docker/aufs/mnt/2dbe1488b7df983513be06da0e3d439e0dda69c169ac4cbe4e5c7204a892c448]\\\\nnvidia-container-cli: ldcache error: process /sbin/ldconfig failed with error code: 127\\\\n\\\"\"": unknown
Normal Created 11m (x5 over 13m) kubelet, rig1 Created container
Normal Pulled 11m (x5 over 13m) kubelet, rig1 Container image "nvidia/k8s-device-plugin:1.9" already present on machine
Warning Failed 11m kubelet, rig1 Error: failed to start container "nvidia-device-plugin-ctr": Error response from daemon: OCI runtime create failed: container_linux.go:296: starting container process caused "process_linux.go:398: container init caused \"process_linux.go:381: running prestart hook 0 caused \\\"error running hook: exit status 1, stdout: , stderr: exec command: [/usr/bin/nvidia-container-cli --load-kmods configure --ldconfig=@/sbin/ldconfig --device=all --utility --pid=31670 /var/lib/docker/aufs/mnt/148216fd0c884ee7e2a6978c4035b7cc7651ad715b086b1e9aba14f0a24a733e]\\\\nnvidia-container-cli: ldcache error: process /sbin/ldconfig failed with error code: 127\\\\n\\\"\"": unknown
Warning BackOff 2m (x42 over 12m) kubelet, rig1 Back-off restarting failed container
I have 9 worker nodes in my cluster but only ONE of them has a GPU. However, the device plugin seems to be running on ALL nodes. On the nodes without a GPU you can see the device plugin failing to find NVML (it succeeds on the node with a GPU), so it seems to me that this plugin should only be running on the node that has a GPU.
Q: How can I make the device plugin run only on my GPU node? Labels? Taints? Something else?
I installed NVIDIA Docker and am now trying to test it on my local minikube, without success.
I followed a few threads around the same topics, also without luck.
sudo minikube start --vm-driver=none --feature-gates=Accelerators=true
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v1.10/nvidia-device-plugin.yml
kubectl get nodes -o=custom-columns=NAME:.metadata.name,GPUs:.status.capacity.'nvidia\.com/gpu'
Getting:
NAME GPUs
minikube <none>
Hi, I'm building a single-machine testing environment. Because Minikube doesn't support GPUs well, I use the local-up-cluster.sh script
provided at https://github.com/kubernetes/kubernetes/blob/master/hack/local-up-cluster.sh
to bring up a single-node cluster, but it doesn't work well with k8s-device-plugin.
Do the following to reproduce it:
get source code using go get -d k8s.io/kubernetes
In order to make local-up-cluster.sh launch kubelet with the feature gate enabled, I inserted the following line at the top of local-up-cluster.sh:
FEATURE_GATES="DevicePlugins=true"
start the cluster using sudo ./hack/local-up-cluster.sh
When running
docker run -it -v /var/lib/kubelet/device-plugins:/var/lib/kubelet/device-plugins nvidia/k8s-device-plugin:1.9
I got:
2018/02/28 12:14:51 Loading NVML
2018/02/28 12:14:51 Fetching devices.
2018/02/28 12:14:51 Starting FS watcher.
2018/02/28 12:14:51 Starting OS watcher.
2018/02/28 12:14:51 Starting to serve on /var/lib/kubelet/device-plugins/nvidia.sock
2018/02/28 12:14:51 Could not register device plugin: rpc error: code = Unimplemented desc = unknown service deviceplugin.Registration
2018/02/28 12:14:51 Could not contact Kubelet, retrying. Did you enable the device plugin feature gate?
2018/02/28 12:14:51 You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
2018/02/28 12:14:51 You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
2018/02/28 12:14:51 Starting to serve on /var/lib/kubelet/device-plugins/nvidia.sock
2018/02/28 12:14:51 Could not register device plugin: rpc error: code = Unimplemented desc = unknown service deviceplugin.Registration
2018/02/28 12:14:51 Could not contact Kubelet, retrying. Did you enable the device plugin feature gate?
2018/02/28 12:14:51 You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
2018/02/28 12:14:51 You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
2018/02/28 12:14:51 Starting to serve on /var/lib/kubelet/device-plugins/nvidia.sock
2018/02/28 12:14:51 Could not register device plugin: rpc error: code = Unimplemented desc = unknown service deviceplugin.Registration
2018/02/28 12:14:51 Could not contact Kubelet, retrying. Did you enable the device plugin feature gate?
2018/02/28 12:14:51 You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
2018/02/28 12:14:51 You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
2018/02/28 12:14:51 Starting to serve on /var/lib/kubelet/device-plugins/nvidia.sock
Here are some of my configs:
/etc/docker/daemon.json
root@ubuntu-10-53-66-17:~# cat /etc/docker/daemon.json
{
"default-runtime": "nvidia",
"runtimes": {
"nvidia": {
"path": "/usr/bin/nvidia-container-runtime",
"runtimeArgs": []
}
}
}
The command line to run kubelet
# ps aux | grep kubelet
root 71370 3.4 0.0 2621216 123892 pts/0 Sl+ 20:20 0:06 /home/mi/go/src/k8s.io/kubernetes/_output/local/bin/linux/amd64/hyperkube kubelet --v=3 --vmodule= --chaos-chance=0.0 --container-runtime=docker --rkt-path= --rkt-stage1-image= --hostname-override=127.0.0.1 --cloud-provider= --cloud-config= --address=127.0.0.1 --kubeconfig /var/run/kubernetes/kubelet.kubeconfig --feature-gates=DevicePlugins=true --cpu-cfs-quota=true --enable-controller-attach-detach=true --cgroups-per-qos=true --cgroup-driver=cgroupfs --keep-terminated-pod-volumes=true --eviction-hard=memory.available<100Mi,nodefs.available<10%,nodefs.inodesFree<5% --eviction-soft= --eviction-pressure-transition-period=1m --pod-manifest-path=/var/run/kubernetes/static-pods --fail-swap-on=false --cluster-dns=10.0.0.10 --cluster-domain=cluster.local --port=10250
The output of nvidia-smi
on the machine:
Wed Feb 28 20:26:18 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.81 Driver Version: 384.81 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla K80 Off | 00000000:05:00.0 Off | 0 |
| N/A 34C P8 26W / 149W | 0MiB / 11439MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla K80 Off | 00000000:06:00.0 Off | 0 |
| N/A 28C P8 30W / 149W | 0MiB / 11439MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla K80 Off | 00000000:84:00.0 Off | 0 |
| N/A 40C P8 27W / 149W | 0MiB / 11439MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla K80 Off | 00000000:85:00.0 Off | 0 |
| N/A 33C P8 29W / 149W | 0MiB / 11439MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
Hi there,
My Kubernetes cluster is as such
Master (no GPU)
Node 1 (GPU)
Node 2 (GPU)
Node 3 (GPU)
Node 4 (GPU)
Nodes 1 - 4 have Nvidia drivers (384) and nvidia docker 2 installed.
First issue:
When I run the command
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v1.9/nvidia-device-plugin.yml
the nvidia plugin also runs on the master node, which has neither NVIDIA drivers nor nvidia-docker installed. Is this behaviour correct?
Second issue:
I can only run 1 GPU on my cluster at a time. For example, if I run the tensorflow notebook with 1 GPU, it works. But if I deploy another pod utilising another GPU, that pod gets stuck in Pending, stating that there are insufficient GPU resources.
How do I solve this? Thanks.
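For the second issue, it can help to check what each node actually advertises (a sketch; `<node-name>` is a placeholder). If a node advertises only 1 GPU and a running pod already holds it, a second one-GPU pod bound to that node will stay Pending:

```shell
# Show the GPU count each node advertises as allocatable
# (nvidia.com/gpu is the extended resource the plugin registers)
kubectl get nodes \
  -o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.'nvidia\.com/gpu'

# See how much of that is already claimed by running pods
kubectl describe node <node-name> | grep -A 6 'Allocated resources'
```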
@flx42 Hi, I'm happy to have found this project; it's wonderful!
I have tried it locally, and it works well.
I am just wondering: is this project open to accepting contributions at the moment?
Please push a docker image built for ppc64le and use manifest-tool (https://github.com/estesp/manifest-tool ). Then, when the provided k8s DaemonSet pulls the docker image, it will "just work" on all platforms.
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"8", GitVersion:"v1.8.6", GitCommit:"6260bb08c46c31eea6cb538b34a9ceb3e406689c", GitTreeState:"clean", BuildDate:"2017-12-21T06:34:11Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"8", GitVersion:"v1.8.6", GitCommit:"6260bb08c46c31eea6cb538b34a9ceb3e406689c", GitTreeState:"clean", BuildDate:"2017-12-21T06:23:29Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}
$ kubeadm version
kubeadm version: &version.Info{Major:"1", Minor:"8", GitVersion:"v1.8.6", GitCommit:"6260bb08c46c31eea6cb538b34a9ceb3e406689c", GitTreeState:"clean", BuildDate:"2017-12-21T06:23:29Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}
$ kubelet --version
Kubernetes v1.8.6
NVIDIA-SMI 375.26
$ docker version
Client:
Version: 17.03.2-ce
API version: 1.27
Go version: go1.7.5
Git commit: f5ec1e2
Built: Tue Jun 27 02:31:19 2017
OS/Arch: linux/amd64
Server:
Version: 17.03.2-ce
API version: 1.27 (minimum version 1.12)
Go version: go1.7.5
Git commit: f5ec1e2
Built: Tue Jun 27 02:31:19 2017
OS/Arch: linux/amd64
Experimental: false
OS system is Debian 9.
GPU: Tesla K40m.
CUDA: Cuda compilation tools, release 8.0, V8.0.61
I installed nvidia-docker according to the Debian instructions and NVIDIA/nvidia-docker#516, and I can run docker run --runtime=nvidia --rm nvidia/cuda nvidia-smi
successfully. I set Nvidia as default-runtime and enabled the DevicePlugins feature gate on my 2-node k8s cluster equipped with Tesla K40m.
But when I run
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v1.8/nvidia-device-plugin.yml
or
docker run --security-opt=no-new-privileges --cap-drop=ALL --network=none -it -v /var/lib/kubelet/device-plugins:/var/lib/kubelet/device-plugins nvidia/k8s-device-plugin:1.8
they both give this error:
2018/01/05 12:25:01 Loading NVML
2018/01/05 12:25:01 Failed to start nvml with error: could not load NVML library.
The output of ldconfig is
$ ldconfig -p | grep nvidia-ml
libnvidia-ml.so.1 (libc6,x86-64) => /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1
libnvidia-ml.so.1 (libc6) => /usr/lib32/libnvidia-ml.so.1
libnvidia-ml.so (libc6,x86-64) => /usr/lib/x86_64-linux-gnu/libnvidia-ml.so
libnvidia-ml.so (libc6) => /usr/lib32/libnvidia-ml.so
I checked other issues like NVIDIA/nvidia-docker#74 and NVIDIA/nvidia-docker#470, they failed to run nvidia-docker but I can.
Another strange thing is that there is no nvidia-device-plugin
in my path and the output of locate nvidia-device-plugin
is blank.
Could you please help me check what went wrong?
Thanks!
I have a Kubernetes node on 1.10.2 with nvidia/k8s-device-plugin:1.10. Everything worked great initially, but now I can't schedule any pods requesting nvidia.com/gpu. Looking at the output of kubectl get node, I see:
status:
addresses:
- address: 134.79.129.97
type: InternalIP
- address: ocio-gpu01
type: Hostname
allocatable:
cpu: "48"
ephemeral-storage: "9391196145"
hugepages-1Gi: "0"
hugepages-2Mi: "0"
memory: 263412492Ki
nvidia.com/gpu: "0"
pods: "110"
capacity:
cpu: "48"
ephemeral-storage: 10190100Ki
hugepages-1Gi: "0"
hugepages-2Mi: "0"
memory: 263514892Ki
nvidia.com/gpu: "16"
pods: "110"
I think I cannot schedule any pods because allocatable is zero. I have pods running on the box, but none that requested any GPUs.
Any pointers on how I can troubleshoot this?
thanks,
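A few things that may be worth checking (a sketch; the pod name is a placeholder and the namespace assumes the stock manifest): whether the plugin pod on that node is still healthy, and whether its logs show a registration or NVML error. Deleting the plugin pod so the DaemonSet recreates it forces a fresh registration with the kubelet, which in some reports restores the allocatable count:

```shell
# Find the device plugin pod on the affected node
kubectl -n kube-system get pods -o wide | grep nvidia-device-plugin

# Inspect its logs for registration or NVML errors
kubectl -n kube-system logs <nvidia-device-plugin-pod-name>

# Force re-registration: the DaemonSet recreates the deleted pod
kubectl -n kube-system delete pod <nvidia-device-plugin-pod-name>
```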
We operate a GPU cluster where every server has 4 GPUs, with IDs 0, 1, 2, 3. Suppose one job has taken GPU 0 and the next job needs 2 GPUs: can the plugin give GPUs 2 and 3 to the kubelet (currently it gives 1 and 2)? If it did, jobs sharing a PCIe switch could communicate faster than jobs in different PCIe slots.
Currently the alpha.kubernetes.io/nvidia-gpu
resource does not support ResourceQuota because it is an alpha resource. Does nvidia.com/gpu
support it?
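For reference, on Kubernetes 1.10+ extended resources such as nvidia.com/gpu can be limited with a ResourceQuota by prefixing the resource name with `requests.` (a sketch; the namespace and quota name are placeholders):

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota
  namespace: ml-team           # placeholder namespace
spec:
  hard:
    requests.nvidia.com/gpu: "4"   # cap total GPUs requested in this namespace
```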
I am getting the following error when starting the plugin as a docker container
2017/11/24 09:06:24 Loading NVML
2017/11/24 09:06:24 Starting to serve on /var/lib/kubelet/device-plugins/nvidia.sock
2017/11/24 09:06:29 Could not register device plugin: context deadline exceeded
My installation of nvidia-docker works fine.
What is the problem?
I can deploy the device plugin on my GPU nodes successfully. After running the kubectl create
command, every worker node runs one nvidia-device-plugin pod. I know this is because a DaemonSet
is used to deploy the plugin, but what confuses me is: do we also need to deploy the plugin on the non-GPU nodes?
Suggestion: add a requiredDuringSchedulingIgnoredDuringExecution
node affinity rule or a nodeSelector
to the manifest.
I'm working on a feature where I can hotplug NVIDIA GPUs on the host. But when I did that, the device plugin did not recognize the hotplugged GPU.
It would be great if the support for hotplug events is provided.
I want to build a k8s cluster with GPU support.
Which procedure is better?
kubernetes/kubernetes@e64517c
this commit changes the deviceplugin api from v1alpha to v1beta1, and will be released in v1.10 of k8s.
I have tested k8s-device-plugin:v1.9, which does not work with this commit.
Could someone please update the code and cut a new release, k8s-device-plugin:v1.10?
I checked issue #19 but it does not help me.
versions:
docker version
Client:
Version: 1.13.1
API version: 1.26
Go version: go1.6.2
Git commit: 092cba3
Built: Thu Nov 2 20:40:23 2017
OS/Arch: linux/amd64
Server:
Version: 1.13.1
API version: 1.26 (minimum version 1.12)
Go version: go1.6.2
Git commit: 092cba3
Built: Thu Nov 2 20:40:23 2017
OS/Arch: linux/amd64
Experimental: false
kubectl version
GitVersion:"v1.10.2"
kubeadm version
GitVersion:"v1.10.2
nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2016 NVIDIA Corporation
Built on Tue_Jan_10_13:22:03_CST_2017
Cuda compilation tools, release 8.0, V8.0.61
ldconfig -p | grep nvidia-ml
libnvidia-ml.so.1 (libc6,x86-64) => /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1
libnvidia-ml.so (libc6,x86-64) => /usr/lib/x86_64-linux-gnu/libnvidia-ml.so
nvidia-smi
Wed May 2 21:18:39 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.124 Driver Version: 367.124 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GRID K1 Off | 0000:0B:00.0 Off | N/A |
| N/A 29C P8 8W / 31W | 0MiB / 4036MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
errors:
docker run --security-opt=no-new-privileges --cap-drop=ALL --network=none -it -v /var/lib/kubelet/device-plugins:/var/lib/kubelet/device-plugins nvidia/k8s-device-plugin:1.10
2018/05/02 21:18:02 Loading NVML
2018/05/02 21:18:02 Failed to initialize NVML: could not load NVML library.
2018/05/02 21:18:02 If this is a GPU node, did you set the docker default runtime to `nvidia`?
2018/05/02 21:18:02 You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
2018/05/02 21:18:02 You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
@RenaudWasTaken mentioned in #19 that old GPUs might hit this issue.
Is this the case here? Please help take a look, thanks a lot.
go build
failed because the nvml sources are missing from recent versions of github.com/NVIDIA/nvidia-docker. I found that the nvml source files only exist in the 1.0.x versions of nvidia-docker.
Deploying any PODS with the nvidia.com/gpu resource limits results in "0/1 nodes are available: 1 Insufficient nvidia.com/gpu."
I also see this error in the Daemonset POD logs:
2018/02/27 16:43:50 Warning: GPU with UUID GPU-edae6d5d-6698-fb8d-2c6b-2a791224f089 is too old to support healtchecking with error: %!s(MISSING). Marking it unhealthy
I'm running nvidia-docker2 and have deployed the NVIDIA device plugin as a DaemonSet.
On worker Node
uname -a
Linux gpu 4.4.0-116-generic #140-Ubuntu SMP Mon Feb 12 21:23:04 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
docker run --rm nvidia/cuda nvidia-smi
Wed Feb 28 18:07:07 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.30 Driver Version: 390.30 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 760 Off | 00000000:0B:00.0 N/A | N/A |
| 34% 43C P8 N/A / N/A | 0MiB / 1999MiB | N/A Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX 760 Off | 00000000:90:00.0 N/A | N/A |
| 34% 42C P8 N/A / N/A | 0MiB / 1999MiB | N/A Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 Not Supported |
| 1 Not Supported |
+-----------------------------------------------------------------------------+
I installed nvidia-docker2 and deployed the device plugin on Kubernetes 1.8, but when I run kubectl describe pods I get this error:
loading NVML
Failed to start nvml with error: could not load NVML library
I deployed the device-plugin container on k8s via the guide. But when I run the tensorflow-notebook (by executing kubectl create -f tensorflow-notebook.yml), the pod was still pending:
[root@mlssdi010001 k8s]# kubectl describe pod tf-notebook-747db6987b-86zts
Name: tf-notebook-747db6987b-86zts
....
Events:
Type Reason Age From Message
Warning FailedScheduling 47s (x15 over 3m) default-scheduler 0/3 nodes are available: 1 PodToleratesNodeTaints, 3 Insufficient nvidia.com/gpu.
Pod info:
[root@mlssdi010001 k8s]# kubectl get pod --all-namespaces -o wide
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE
default tf-notebook-747db6987b-86zts 0/1 Pending 0 5s
....
kube-system nvidia-device-plugin-daemonset-ljrwc 1/1 Running 0 34s 10.244.1.11 mlssdi010003
kube-system nvidia-device-plugin-daemonset-m7h2r 1/1 Running 0 34s 10.244.2.12 mlssdi010002
Nodes info:
NAME STATUS ROLES AGE VERSION
mlssdi010001 Ready master 1d v1.9.0
mlssdi010002 Ready 1d v1.9.0 (GPU Node,1 * Tesla M40)
mlssdi010003 Ready 1d v1.9.0 (GPU Node,1 * Tesla M40)
In the current setup, the nvidia runtime is set as the default docker runtime instead of the original runc,
so the issue described in kubernetes/kubernetes#59631 and kubernetes/kubernetes#59629 arises:
all GPUs are exposed into every container.
So one way to solve it is to use an env variable to tell nvidia-container-runtime not to expose the GPUs.
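That env-variable workaround could be sketched as follows (a sketch, not part of the plugin: `NVIDIA_VISIBLE_DEVICES=void` is the nvidia-container-runtime convention for injecting no devices; the pod name and image are placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: no-gpu-pod            # placeholder name
spec:
  containers:
  - name: app
    image: busybox            # placeholder non-GPU workload
    command: ["sleep", "3600"]
    env:
    - name: NVIDIA_VISIBLE_DEVICES
      value: "void"           # tells nvidia-container-runtime to expose no GPUs
```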
another better way:
so the issue kubernetes/kubernetes#59631 can be fixed.
@flx42 @RenaudWasTaken @cmluciano @jiayingz @vikaschoudhary16
I'm running K8s 1.10.2-0 on RHEL7.4 with docker 18.03.1
I have a 9 worker node K8s cluster. Only one of those nodes has a GPU on it (NVIDIA TitanXp).
I installed nvidia-docker2 on ALL worker nodes:
nvidia-docker2.noarch 2.0.3-1.docker18.03.1.ce
I installed nvidia-container-runtime on ALL worker nodes:
nvidia-container-runtime.x86_64 2.0.0-1.docker18.03.1
I installed nvidia-device-plugin.yml v1.10 via kubectl (the device plugin is running OK on all worker nodes)
I can ssh into my GPU worker node and run nvidia-smi inside a container OK:
[whacuser@gpu ~]$ sudo docker run --rm nvidia/cuda nvidia-smi
Unable to find image 'nvidia/cuda:latest' locally
latest: Pulling from nvidia/cuda
297061f60c36: Pull complete
e9ccef17b516: Pull complete
dbc33716854d: Pull complete
8fe36b178d25: Pull complete
686596545a94: Pull complete
f611dfbee954: Pull complete
c51814f3e9ba: Pull complete
5da0fc07e73a: Pull complete
97462b1887aa: Pull complete
924ea239f6fe: Pull complete
Digest: sha256:69f3780f80a72cb7cebc7f401a716370f79412c5aa9362306005ca4eb84d0f3c
Status: Downloaded newer image for nvidia/cuda:latest
Mon May 14 20:14:16 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.25 Driver Version: 390.25 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 TITAN X (Pascal) Off | 00000000:13:00.0 Off | N/A |
| 23% 21C P8 8W / 250W | 0MiB / 12196MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
I label my GPU worker node like so:
kubectl label nodes gpu accelerator=nvidia-titan-xp --overwrite=true
However, when I try to run a pod:
apiVersion: v1
kind: Pod
metadata:
name: gpu-pod
spec:
containers:
- name: cuda-container
image: nvidia/cuda:9.0-devel
resources:
limits:
nvidia.com/gpu: 1 # requesting 1 GPU
- name: digits-container
image: nvidia/digits:6.0
resources:
limits:
nvidia.com/gpu: 1 # requesting 1 GPU
nodeSelector:
accelerator: nvidia-titan-xp
I get an error:
0/12 nodes are available: 11 MatchNodeSelector, 12 Insufficient nvidia.com/gpu, 3 PodToleratesNodeTaints.
any ideas?
docker build -t nvidia/k8s-device-plugin:1.9 https://github.com/NVIDIA/k8s-device-plugin.git
should be:
docker build -t nvidia/k8s-device-plugin:1.9 https://github.com/NVIDIA/k8s-device-plugin.git#v1.9
Is there a k8s-device-plugin for CUDA 8.0? In my environment I run TensorFlow 1.4, which needs CUDA 8.0.
I read another device plugin example, https://github.com/vikaschoudhary16/sfc-device-plugin
In that device plugin, during the allocate phase it only responds with the host path and container path of the device. Does that mean k8s itself mounts the device into the container?
In the nvidia device plugin, it instead sets the env var "NVIDIA_VISIBLE_DEVICES"; nvidia-container-cli then uses that env var to mount the devices into the container.
Quoting what @RenaudWasTaken mentioned in another thread:
"The Nvidia Device plugin has a lot of such features coming up a few of these are:
memory scrubbing
healthCheck and reset in case of bad state
GPU Allocated memory checks
"Zombie processes" checks
...
"
Creating this issue to track the progress on these improvements.
@RenaudWasTaken could you also provide more details on some of these features, like what GPU Allocated memory checks and "Zombie processes" checks do?
I deployed k8s v1.10 and k8s-device-plugin v1.10, but when I run
$kubectl describe node bjpg-g271.yz02
I cannot find the GPU capacity.
The CUDA version I deployed:
$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2016 NVIDIA Corporation
Built on Tue_Jan_10_13:22:03_CST_2017
Cuda compilation tools, release 8.0, V8.0.61
nvidia-docker is 17.03.2
I can run a GPU container with docker run, but the GPU cannot be scheduled by k8s.
Hi, I'm trying to deploy TensorFlow
(with GPU support) on kubernetes with this device plugin.
And some error occurred:
ImportError: libcuda.so.1: cannot open shared object file: No such file or directory
Failed to load the native TensorFlow runtime.
After debugging the source code, I think this is because the Allocate
function doesn't return the mount path of libcuda.so
.
@flx42 PTAL; I think I can send a PR to fix it later.
The yaml manifest available upstream [1] a) is not the one suggested in the project's README [2] and b) is GKE-specific. As a result it is not clear to Kubernetes distributions (such as, for example, CDK [3]) which manifest should be shipped with each k8s release. We are doing our best, but any feedback from you on what the right path is would be much appreciated.
[1] https://github.com/kubernetes/kubernetes/blob/release-1.10/cluster/addons/device-plugins/nvidia-gpu/daemonset.yaml
[2] https://github.com/NVIDIA/k8s-device-plugin/blob/v1.9/nvidia-device-plugin.yml
[3] https://www.ubuntu.com/kubernetes
Are you open to a PR that allocates the same GPU to multiple requests based on additional requirements passed to the process?
I'm thinking I could ask for nvidia/gpu:1
and get 1 whole GPU, or I could ask for nvidia/gpu-memory:1Gi
and nvidia/gpu-cpu:2
and get "allocated" 1Gi of memory and 2 cores on 1 GPU, leaving whatever is left for other nvidia/gpu-memory
and nvidia/gpu-cpu
requests.
It wouldn't be enforced, but this way we can at least context switch between multiple processes on 1 GPU, which is something the main kubernetes project doesn't seem to want to support until at least v1.11 (kubernetes/kubernetes#52757)