k8s-device-plugin's Issues

Using Kubernetes to run TensorFlow training is much slower than `docker run`

I've installed nvidia-docker2 (nvidia-container-runtime) + Kubernetes 1.9.7 and run TensorFlow training on 8 NVIDIA 1080 Ti GPUs. When I use kubectl to deploy a pod with nvidia.com/gpu=8, the logs show:

Iteration 200 (0.833003 iter/s)

But when I run with the command docker run --runtime=nvidia caffe-mpi:v0.2.22.test1, the performance is much better:

Iteration 200 (1.655003 iter/s)

But when I add the --cgroup-parent of the pod I created earlier, the performance is the same as in the pod:

docker run --runtime=nvidia --cgroup-parent=kubepods-besteffort-podf4e9758b_6fda_11e8_93ce_00163e008c08.slice caffe-mpi:v0.2.22.test1
Iteration 200 (0.851113 iter/s)

I suspect it's related to Kubernetes' cgroup settings. Do you have any suggestions? Thanks in advance.
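
A quick way to confirm the cgroup difference (a debugging sketch only; the container IDs below are placeholders for your own containers) is to compare the cgroup parent Docker records for the kubelet-launched container and for the manually started one:

$ docker inspect --format '{{.HostConfig.CgroupParent}}' <k8s-launched-container-id>
$ docker inspect --format '{{.HostConfig.CgroupParent}}' <docker-run-container-id>

If the slower run always lands under the kubepods-besteffort slice shown above, it may also be worth testing the pod with CPU and memory requests equal to limits (Guaranteed QoS), since that places the pod under a different cgroup parent than the BestEffort slice.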

Always pending

I have installed k8s v1.9 and done everything in the README file.
The GPU job shows Pending. When I restart the kubelet, the feature gates log shows nothing:
I0302 17:35:17.372045 16680 feature_gate.go:220] feature gates: &{{} map[]}
but I have added Environment="KUBELET_EXTRA_ARGS=--feature-gates=DevicePlugins=true" in /etc/systemd/system/kubelet.service.d/10-kubeadm.conf.
Is this normal? How should I debug this problem?

When I run kubectl describe pod gpu-pod, it shows:
root@a-Z170-HD3P:/home/a/fyk/k8s-device-plugin# kubectl describe pod gpu-pod
Name:           gpu-pod
Namespace:      default
Node:
Labels:
Annotations:
Status:         Pending
IP:
Containers:
  cuda-container:
    Image:  nvidia/cuda:9.0-devel
    Port:
    Limits:
      nvidia.com/gpu:  1
    Requests:
      nvidia.com/gpu:  1
    Environment:
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-zxxzs (ro)
  digits-container:
    Image:  nvidia/digits:6.0
    Port:
    Limits:
      nvidia.com/gpu:  1
    Requests:
      nvidia.com/gpu:  1
    Environment:
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-zxxzs (ro)
Conditions:
  Type           Status
  PodScheduled   False
Volumes:
  default-token-zxxzs:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-zxxzs
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason            Age                 From               Message
  ----     ------            ----                ----               -------
  Warning  FailedScheduling  2m (x124 over 37m)  default-scheduler  0/1 nodes are available: 1 Insufficient nvidia.com/gpu.

I tried kubectl describe node; the result shows nothing about GPUs:
Capacity:
cpu: 8
memory: 16387484Ki
pods: 110

I think the device plugin does not work at all.
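
A short verification sequence (a debugging sketch, not a guaranteed fix): the empty feature-gate map in the log above usually means the systemd drop-in was never picked up, so reload systemd, restart the kubelet, and check the flags of the running process:

$ sudo systemctl daemon-reload
$ sudo systemctl restart kubelet
$ ps -ef | grep kubelet | grep -o 'feature-gates=[^ ]*'   # should print feature-gates=DevicePlugins=true

If nothing is printed, KUBELET_EXTRA_ARGS is not reaching the kubelet, the device plugin cannot register, and the node never advertises nvidia.com/gpu, which matches the Insufficient nvidia.com/gpu scheduling error above.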

OpenShift 3.9/Docker-CE, Could not register device plugin: context deadline exceeded

I am following the blog post "How to use GPUs with Device Plugin in OpenShift 3.9 (Now Tech Preview!)" on blog.openshift.com.

In my case, the nvidia-device-plugin shows errors like the ones below:

# oc logs -f nvidia-device-plugin-daemonset-nj9p8
2018/06/06 12:40:11 Loading NVML
2018/06/06 12:40:11 Fetching devices.
2018/06/06 12:40:11 Starting FS watcher.
2018/06/06 12:40:11 Starting OS watcher.
2018/06/06 12:40:11 Starting to serve on /var/lib/kubelet/device-plugins/nvidia.sock
2018/06/06 12:40:16 Could not register device plugin: context deadline exceeded
2018/06/06 12:40:16 Could not contact Kubelet, retrying. Did you enable the device plugin feature gate?
2018/06/06 12:40:16 You can check the prerequisites at: https://github.com/NVIDIA/k...
2018/06/06 12:40:16 You can learn how to set the runtime at: https://github.com/NVIDIA/k...
2018/06/06 12:40:16 Starting to serve on /var/lib/kubelet/device-plugins/nvidia.sock
...
  • The description of one of the device-plugin daemonset pods is:
# oc describe pod nvidia-device-plugin-daemonset-2
Name:           nvidia-device-plugin-daemonset-2jqgk
Namespace:      nvidia
Node:           node02/192.168.5.102
Start Time:     Wed, 06 Jun 2018 22:59:32 +0900
Labels:         controller-revision-hash=4102904998
                name=nvidia-device-plugin-ds
                pod-template-generation=1
Annotations:    openshift.io/scc=nvidia-deviceplugin
Status:         Running
IP:             192.168.5.102
Controlled By:  DaemonSet/nvidia-device-plugin-daemonset
Containers:
  nvidia-device-plugin-ctr:
    Container ID:   docker://b92280bd124df9fd46fe08ab4bbda76e2458cf5572f5ffc651661580bcd9126d
    Image:          nvidia/k8s-device-plugin:1.9
    Image ID:       docker-pullable://nvidia/k8s-device-plugin@sha256:7ba244bce75da00edd907209fe4cf7ea8edd0def5d4de71939899534134aea31
    Port:           <none>
    State:          Running
      Started:      Wed, 06 Jun 2018 22:59:34 +0900
    Ready:          True
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /var/lib/kubelet/device-plugins from device-plugin (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from nvidia-deviceplugin-token-cv7p5 (ro)
Conditions:
  Type           Status
  Initialized    True 
  Ready          True 
  PodScheduled   True 
Volumes:
  device-plugin:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/kubelet/device-plugins
    HostPathType:  
  nvidia-deviceplugin-token-cv7p5:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  nvidia-deviceplugin-token-cv7p5
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/disk-pressure:NoSchedule
                 node.kubernetes.io/memory-pressure:NoSchedule
                 node.kubernetes.io/not-ready:NoExecute
                 node.kubernetes.io/unreachable:NoExecute
Events:
  Type    Reason                 Age   From             Message
  ----    ------                 ----  ----             -------
  Normal  SuccessfulMountVolume  1h    kubelet, node02  MountVolume.SetUp succeeded for volume "device-plugin"
  Normal  SuccessfulMountVolume  1h    kubelet, node02  MountVolume.SetUp succeeded for volume "nvidia-deviceplugin-token-cv7p5"
  Normal  Pulled                 1h    kubelet, node02  Container image "nvidia/k8s-device-plugin:1.9" already present on machine
  Normal  Created                1h    kubelet, node02  Created container
  Normal  Started                1h    kubelet, node02  Started container
  • And running
    "docker run -it -v /var/lib/kubelet/device-plugins:/var/lib/kubelet/device-plugins nvidia/k8s-device-plugin:1.9" shows the same log messages as above.

  • On each origin node, a docker run test shows the following (this is normal, right?):

# docker run --rm nvidia/cuda nvidia-smi --query-gpu=gpu_name --format=csv,noheader --id=0 | sed -e 's/ /-/g'
Tesla-P40
# docker run -it --rm docker.io/mirrorgoogleconta...
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done

[Test Env.]

  • 1 Master with OpenShift v3.9(Origin)
  • 2 GPU nodes with Tesla-P40*2
  • Docker-CE, nvidia-docker2 on GPU nodes

[Master]

# oc version
oc v3.9.0+46ff3a0-18
kubernetes v1.9.1+a0ce1bc657
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://MYDOMAIN.local:8443
openshift v3.9.0+46ff3a0-18
kubernetes v1.9.1+a0ce1bc657
# uname -r
3.10.0-862.3.2.el7.x86_64
# cat /etc/redhat-release 
CentOS Linux release 7.5.1804 (Core)

[GPU nodes]

# docker version
Client:
Version: 18.03.1-ce
API version: 1.37
Go version: go1.9.5
Git commit: 9ee9f40
Built: Thu Apr 26 07:20:16 2018
OS/Arch: linux/amd64
Experimental: false
Orchestrator: swarm

Server:
Engine:
Version: 18.03.1-ce
API version: 1.37 (minimum version 1.12)
Go version: go1.9.5
Git commit: 9ee9f40
Built: Thu Apr 26 07:23:58 2018
OS/Arch: linux/amd64
Experimental: false
# uname -r
3.10.0-862.3.2.el7.x86_64
# cat /etc/redhat-release 
CentOS Linux release 7.5.1804 (Core)
# docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
4b1a37d31cb9 openshift/node:v3.9.0 "/usr/local/bin/orig…" 22 minutes ago Up 21 minutes origin-node
efbedeeb88f0 fe3e6b0d95b5 "nvidia-device-plugin" About an hour ago Up About an hour k8s_nvidia-device-plugin-ctr_nvidia-device-plugin-daemonset-4sn5v_nvidia_bffb6d61-6986-11e8-8dd7-0cc47ad9bf7a_0
36aa988447b8 openshift/origin-pod:v3.9.0 "/usr/bin/pod" About an hour ago Up About an hour k8s_POD_nvidia-device-plugin-daemonset-4sn5v_nvidia_bffb6d61-6986-11e8-8dd7-0cc47ad9bf7a_0
6e6b598fa144 openshift/openvswitch:v3.9.0 "/usr/local/bin/ovs-…" 2 hours ago Up 2 hours openvswitch
# cat /etc/docker/daemon.json 
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}

Please help me with this problem. TIA!
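
One thing worth double-checking, under the assumption that these Origin 3.9 nodes use the standard node configuration file (paths and the service name can differ per install, so treat this as a sketch): the DevicePlugins feature gate has to be set via kubeletArguments in /etc/origin/node/node-config.yaml on the GPU nodes, since OpenShift 3.9 runs Kubernetes 1.9 where the gate is still required, for example:

kubeletArguments:
  feature-gates:
  - DevicePlugins=true

followed by a restart of the node service (e.g. systemctl restart origin-node) so the kubelet opens the device-plugin registration endpoint; otherwise the plugin keeps logging "context deadline exceeded" exactly as above.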

k8s-device-plugin v1.9 deployment CrashLoopBackOff

I tried to deploy device-plugin v1.9 on k8s.

I have a problem similar to the earlier issue "nvidia-device-plugin container CrashLoopBackOff error" (v1.8),

and the container hits a CrashLoopBackOff error:

NAME                                   READY     STATUS             RESTARTS   AGE
nvidia-device-plugin-daemonset-2h9rh   0/1       CrashLoopBackOff   11          33m

Problem when running locally with docker run

docker build -t nvidia/k8s-device-plugin:1.9 .

Successfully built d12ed13b386a
Successfully tagged nvidia/k8s-device-plugin:1.9
14:25:40 Loading NVML
14:25:40 Failed to start nvml with error: could not load NVML library.

Environment :

$ cat /etc/ld.so.conf.d/x86_64-linux-gnu_GL.conf 
/usr/lib/nvidia-384
/usr/lib32/nvidia-384
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.90                 Driver Version: 384.90                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 106...  Off  | 00000000:03:00.0 Off |                  N/A |
| 38%   29C    P8     6W / 120W |      0MiB /  6069MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+


And when I ran docker run --runtime=nvidia --security-opt=no-new-privileges --cap-drop=ALL --network=none -it -v /var/lib/kubelet/device-plugins:/var/lib/kubelet/device-plugins nvidia/k8s-device-plugin:1.9

it shows this error:

2017/12/27 14:38:22 Loading NVML
2017/12/27 14:38:22 Fetching devices.
2017/12/27 14:38:22 Starting FS watcher.
2017/12/27 14:38:22 Starting OS watcher.
2017/12/27 14:38:22 Starting to serve on /var/lib/kubelet/device-plugins/nvidia.sock
2017/12/27 14:38:27 Could not register device plugin: context deadline exceeded
2017/12/27 14:38:27 Could not contact Kubelet, retrying. Did you enable the device plugin feature gate?
2017/12/27 14:38:27 Starting to serve on /var/lib/kubelet/device-plugins/nvidia.sock
2017/12/27 14:38:32 Could not register device plugin: context deadline exceeded
2017/12/27 14:38:32 Could not contact Kubelet, retrying. Did you enable the device plugin feature gate?
2017/12/27 14:38:32 Starting to serve on /var/lib/kubelet/device-plugins/nvidia.sock
2017/12/27 14:38:37 Could not register device plugin: context deadline exceeded
.
.
.

Can we use k8s-device-plugin and Nvidia-docker2 in minikube?

1. Question
This is not an issue, only a question:
Can we use k8s-device-plugin and nvidia-docker2 in minikube?

2. Environment
• OS: Ubuntu 16.04
• minikube: v0.24.1
• kubectl: v1.10.0
• NVIDIA driver: 384.111
• Docker:
Client:
 Version: 18.03.0-ce
 API version: 1.37
 Go version: go1.9.4
  Git commit: 0520e24
 Built: Wed Mar 21 23:10:01 2018
 OS/Arch: linux/amd64
 Experimental: false
 Orchestrator: swarm

 Server:
 Engine:
 Version: 18.03.0-ce
 API version: 1.37 (minimum version 1.12)
 Go version: go1.9.4
 Git commit: 0520e24
 Built: Wed Mar 21 23:08:31 2018
 OS/Arch: linux/amd64
 Experimental: false

• Results of kubectl version:
Client Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.0", GitCommit:"fc32d2f3698e36b93322a3465f63a14e9f0eaead", GitTreeState:"clean", BuildDate:"2018-03-26T16:55:54Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"8", GitVersion:"v1.8.0", GitCommit:"0b9efaeb34a2fc51ff8e4d34ad9bc6375459c4a4", GitTreeState:"clean", BuildDate:"2017-11-29T22:43:34Z", GoVersion:"go1.9.1", Compiler:"gc", Platform:"linux/amd64"}

• GPU: Maxwell GeForce TITAN X

• /etc/docker/daemon.json
{
    "dns": ["150.16.X.X", "150.16.X.X"],
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}

3. Problem
(1) I installed minikube and kubectl to test nvidia-docker2.

(2) I started minikube as below:
sudo CHANGE_MINIKUBE_NONE_USER=true minikube start --vm-driver=none --featuregates=Accelerators=true

★ Hypervisor = on (Ubuntu PC BIOS)

(3) I created the device plugin as below:
 $ kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v1.8/nvidia-device-plugin.yml

(4) Next I did as below:
$ kubectl create -f test.yml

(5) The test.yml file:
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vector-add
spec:
  restartPolicy: OnFailure
  containers:
    - name: cuda-vector-add
      # https://github.com/kubernetes/kubernetes/blob/v1.7.11/test/images/nvidi$
      image: "k8s.gcr.io/cuda-vector-add:v0.1"
      resources:
        limits:
          nvidia.com/gpu: 1 # requesting 1 GPU

(6) Results:
$ kubectl get pod
NAME READY STATUS RESTARTS AGE
cuda-vector-add 0/1 Pending 0 11s
nvidia-device-plugin-daemonset-mq4pm 0/1 CrashLoopBackOff 4 2m

★Pod error and nvidia-device-plugin-daemonset error

(7) My opinion
I faced this error (the pod and the DaemonSet were not Running); I think the nvidia-device-plugin was disabled.
But I don't know how to enable the nvidia-device-plugin.
Perhaps I must set --feature-gates="DevicePlugins=true".
But in minikube it looks like the kubelet is not configured directly.

★ Could you give any advice on using nvidia-docker2 in minikube?
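
A sketch of what has worked in similar setups (the exact flag syntax depends on the minikube version, so treat it as an assumption and check minikube start --help): the gate that matters for the device plugin is DevicePlugins, not the older Accelerators gate, and the flag name in step (2) above is missing its hyphen.

sudo CHANGE_MINIKUBE_NONE_USER=true minikube start --vm-driver=none \
  --extra-config=kubelet.feature-gates=DevicePlugins=true   # pass the gate through to the kubelet

Also note that the daemon.json shown above does not set "default-runtime": "nvidia", which the plugin's prerequisites require so that the plugin pod can load NVML.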

pods cannot share a GPU?

I'm using JupyterLab on Kubernetes and have a cluster of 8 CPU worker nodes and one CPU/GPU worker node. I have the device plugin set up, and when I log into JupyterLab a user pod is created and the device plugin/scheduler run it on my GPU node. All is great, until a second user logs in: the second user's pod fails to start because the GPU has already been allocated to the first user.

Q: Is it correct that pods can't share a GPU device? If so, why not? It seems like there is a valid use case here: multiple users being able to run training tasks on a shared GPU, at different times at least.

Can I disable the device plugin pod for an individual node?

I have a mixed cluster -- some nodes with GPUs, some without. The plugins start up nicely on the GPU nodes, but not so much on the nodes without GPUs (obviously).

The implementation uses a DaemonSet, so each node gets a pod ... but the pod on the non-GPU node is in CrashLoopBackOff -- I assume because of the lack of GPU. My question is whether I can set a flag/label/something to tell the pod running on the non-GPU node to just stop trying? I'd rather not just leave it there continually trying to restart ...
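
One way to do this (a sketch; the label key and value below are made up for illustration) is to label only the GPU nodes and add a matching nodeSelector to the device-plugin DaemonSet's pod template, so the DaemonSet controller never creates the pod on non-GPU nodes in the first place:

kubectl label node <gpu-node-name> hardware-type=NVIDIAGPU

# added under spec.template.spec in nvidia-device-plugin.yml
nodeSelector:
  hardware-type: NVIDIAGPU

This removes the permanently restarting pod instead of just hiding it.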

nvidia-device-plugin container CrashLoopBackOff error

I deployed the device-plugin container on k8s via the guide.
However, I got a container CrashLoopBackOff error:

NAME                                   READY     STATUS             RESTARTS   AGE
nvidia-device-plugin-daemonset-zb8xn   0/1       CrashLoopBackOff   6          9m

And when I run

docker run -it -v /var/lib/kubelet/device-plugins:/var/lib/kubelet/device-plugins nvidia/k8s-device-plugin:1.8

I got error like this:

2017/11/29 01:54:30 Loading NVML
2017/11/29 01:54:30 could not load NVML library

But I am pretty sure that I have installed the NVML library.
Did I miss anything here?
How can I check whether the NVML library is installed?
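
Two checks that usually narrow this down (a debugging sketch, not an exhaustive list): whether the host linker can find the NVML library at all, and whether docker's default runtime is nvidia, since the docker run command above does not pass --runtime=nvidia and therefore only gets the NVIDIA libraries injected when nvidia is the default runtime:

ldconfig -p | grep nvidia-ml                # should list libnvidia-ml.so / libnvidia-ml.so.1
docker info | grep -i 'default runtime'     # should print: Default Runtime: nvidia

If the first command prints nothing, the driver (which ships libnvidia-ml) is not correctly installed on the host; if the second does not say nvidia, fix /etc/docker/daemon.json or rerun the container with --runtime=nvidia.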

Using in clusters which contain both GPU nodes and non-GPU nodes

When using DaemonSets in this kind of cluster, non-GPU nodes will complain:

Error: failed to start container "nvidia-device-plugin-ctr": Error response from daemon: oci runtime error: container_linux.go:265: starting container process caused "process_linux.go:368: container init caused \"process_linux.go:351: running prestart hook 0 caused \\\"error running hook: exit status 1, stdout: , stderr: exec command: [/usr/bin/nvidia-container-cli --load-kmods configure --ldconfig=@/sbin/ldconfig --device=ALL --utility --compute --pid=16424 /var/lib/docker/overlay/a86473af4c52afb44dfdfdcc817edb45316d520cccfb086d87cc227314d09015/merged]\\\\nnvidia-container-cli: initialization error: load library failed: libcuda.so.1: cannot open shared object file: no such file or directory\\\\n\\\"\""

It's straightforward to use taints (which could be documented), but how about also handling this in the plugin itself (i.e. better error handling)?

"nvidia-container-cli: initialization error: cuda error: unknown error" on CPU node

On a k8s CPU node, with nvidia set as the default runtime:

{
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    },
    "default-runtime": "nvidia",
    "registry-mirrors": ["https://registry.docker-cn.com"]
}

When we start the nvidia/k8s-device-plugin:1.10 pod, the error is:

kubelet, 00-25-90-c0-f7-c8  Error: failed to start container "nvidia-device-plugin-ctr": Error response from daemon: oci runtime error: container_linux.go:265: starting container process caused "process_linux.go:368: container init caused \"process_linux.go:351: running prestart hook 0 caused \\\"error running hook: exit status 1, stdout: , stderr: exec command: [/usr/bin/nvidia-container-cli --load-kmods configure --ldconfig=@/sbin/ldconfig --device=all --utility --pid=3567 /var/lib/docker/overlay2/c4498cb4052e704adff6d4ce5d4a8190afb89764a7bc8645d97c6b0520ba3a81/merged]\\\\nnvidia-container-cli: initialization error: cuda error: unknown error\\\\n\\\"\""
  Warning  BackOff                1s (x3 over 5s)    kubelet, 00-25-90-c0-f7-c8  Back-off restarting failed container

What we expected:
nvidia/k8s-device-plugin:1.10 can run on a non-GPU node with the nvidia docker runtime.

container CrashLoopBackOff error

Hi, everyone.
I've got the CrashLoopBackOff error too:

NAME READY STATUS RESTARTS AGE
nvidia-device-plugin-daemonset-csjxw 0/1 CrashLoopBackOff 12 39m

However, when I ran the container directly on the node with nvidia:
docker run --runtime=nvidia -it -v /var/lib/kubelet/device-plugins:/var/lib/kubelet/device-plugins nvidia/k8s-device-plugin:1.8
it shows that the device plugin is registered successfully, doesn't it?

2017/12/08 06:41:44 Loading NVML
2017/12/08 06:41:45 Starting to serve on /var/lib/kubelet/device-plugins/nvidia.sock
2017/12/08 06:41:45 Registered device plugin with Kubelet

And the kubelet has started with --feature-gates=DevicePlugins=true.

By the way, my NVIDIA GPU is a GeForce GTX 1070.
Why did this error come up?

what's the difference between v1.8 and v1.9?

I mean:
Does k8s-device-plugin:v1.8 only work for kubernetes:v1.8.x,
and k8s-device-plugin:v1.9 -> kubernetes:v1.9.x?

Could we use k8s-device-plugin:v1.9 with kubernetes:v1.8.x?

failed create pod sandbox

I'm trying to install the NVIDIA plugin on kubeadm 1.9.
I already installed the NVIDIA driver, CUDA toolkit and nvidia-docker.
But when I create the k8s-device-plugin from the master node, the pod is stuck in the ContainerCreating state.
When I use kubectl describe pod, it shows the error "failed create pod sandbox".

Cannot restart docker after configuring /etc/docker/daemon.json

Hi everyone.
I ran into some trouble today installing this plugin.
Here is my environment:
AWS Ubuntu Server 16.04
docker 18.03.1-ce
NVIDIA Docker: 2.0.3
CUDA Version 9.1.85

I have already installed nvidia-docker2. Then I used the following command to test nvidia-docker2, and it was successful:
docker run --runtime=nvidia -it -p 8888:8888 tensorflow/tensorflow:latest-gpu

Then I followed the guide to install this plugin. I tried to configure /etc/docker/daemon.json and
ran the following command:
sudo systemctl daemon-reload && sudo systemctl restart docker

And my configuration in daemon.json is:
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}

But this step failed and I got the following output:
Job for docker.service failed because the control process exited with error code.

Who can help me?
Thank you!
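
A short way to find the actual failure (a sketch; it assumes systemd and the stock docker unit): systemctl only reports that the service failed, so read the docker daemon's own log and make sure the edited file is still valid JSON:

sudo journalctl -u docker.service --no-pager -n 50    # shows dockerd's real error message
python -m json.tool /etc/docker/daemon.json           # fails loudly if the JSON is malformed

A stray comma or smart quote introduced while editing daemon.json is a common cause of "Job for docker.service failed" right after adding the nvidia runtime block.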

Error running GPU pod: "Insufficient nvidia.com/gpu"

I am unable to get GPU device support through k8s.
I am running 2 p2.xlarge nodes on AWS with a manual installation of K8s.
The nvidia-docker2 is installed and set as the default runtime. I tested this by running the following and getting the expected output.
docker run --rm nvidia/cuda nvidia-smi

I followed all the steps in the readme of this repo, and cannot seem to get the containers to have GPU access. The nvidia-device-plugin.yml DaemonSet seems to be up and running, but creating a pod gives this error when trying to launch the digits job:

$ kubectl get pod gpu-pod --template '{{.status.conditions}}' [map[type:PodScheduled lastProbeTime:<nil> lastTransitionTime:2018-02-26T21:58:32Z message:0/2 nodes are available: 1 PodToleratesNodeTaints, 2 Insufficient nvidia.com/gpu. reason:Unschedulable status:False]]

I thought that it might be that I was requesting too many resources (2 per node), but even lowering the requirements in the yml still yielded the same result. Any ideas where things could be going wrong?
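
Before changing the pod spec, it can help to confirm the plugin actually advertised GPUs to the scheduler (a sketch; the namespace and label assume the stock nvidia-device-plugin.yml manifest):

kubectl get nodes -o=custom-columns=NAME:.metadata.name,GPUs:.status.capacity.'nvidia\.com/gpu'
kubectl -n kube-system logs -l name=nvidia-device-plugin-ds    # plugin logs on each node

If the first command shows <none> for the p2.xlarge nodes, the plugin never registered (feature gate or default-runtime problem) and every GPU request will fail with Insufficient nvidia.com/gpu regardless of how small it is.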

failed to start container "nvidia-device-plugin-ctr"

I am trying to install the device plugin, but with no luck:

Error: failed to start container "nvidia-device-plugin-ctr": Error response from daemon: OCI runtime create failed: container_linux.go:296: starting container process caused "process_linux.go:398: container init caused \"process_linux.go:381: running prestart hook 0 caused \\\"error running hook: exit status 1, stdout: , stderr: exec command: [/usr/bin/nvidia-container-cli --load-kmods configure --ldconfig=@/sbin/ldconfig --device=all --utility --pid=31021 /var/lib/docker/aufs/mnt/a2f849e29fcb8dc87d51e90497d7e44a38d7ecf93acabc285523d13c1cdf9046]\\\\nnvidia-container-cli: ldcache error: process /sbin/ldconfig failed with error code: 127\\\\n\\\"\"": unknown
Back-off restarting failed container

Installed and configured the default runtime:

# nvidia-docker version
NVIDIA Docker: 2.0.2
Client:
 Version:	17.12.0-ce
 API version:	1.35
 Go version:	go1.9.2
 Git commit:	c97c6d6
 Built:	Wed Dec 27 20:11:19 2017
 OS/Arch:	linux/amd64

Server:
 Engine:
  Version:	17.12.0-ce
  API version:	1.35 (minimum version 1.12)
  Go version:	go1.9.2
  Git commit:	c97c6d6
  Built:	Wed Dec 27 20:09:53 2017
  OS/Arch:	linux/amd64
  Experimental:	false
# docker info | grep -i runtime
Runtimes: nvidia runc
WARNING: No swap limit support
Default Runtime: nvidia

Configured Kubernetes with the feature gates:

# ps -ef | grep kube | grep featu
root     23964 23945  3 15:03 ?        00:00:16 kube-apiserver --bind-address=0.0.0.0 --insecure-bind-address=127.0.0.1 --insecure-port=8080 --service-node-port-range=30000-32767 --storage-backend=etcd3 --admission-control=Initializers,NamespaceLifecycle,LimitRanger,ServiceAccount,DefaultStorageClass,ValidatingAdmissionWebhook,ResourceQuota --allow-privileged=true --apiserver-count=1 \
--feature-gates=Initializers=False,PersistentLocalVolumes=False,DevicePlugins=True --runtime-config=admissionregistration.k8s.io/v1alpha1 --requestheader-extra-headers-prefix=X-Remote-Extra- --advertise-address=192.168.0.102 --service-account-key-file=/etc/kubernetes/ssl/sa.pub --enable-bootstrap-token-auth=true --kubelet-preferred-address-types=InternalIP,ExternalIP,Hostname --requestheader-group-headers=X-Remote-Group --client-ca-file=/etc/kubernetes/ssl/ca.crt --tls-private-key-file=/etc/kubernetes/ssl/apiserver.key --kubelet-client-key=/etc/kubernetes/ssl/apiserver-kubelet-client.key --requestheader-client-ca-file=/etc/kubernetes/ssl/front-proxy-ca.crt --proxy-client-cert-file=/etc/kubernetes/ssl/front-proxy-client.crt --tls-cert-file=/etc/kubernetes/ssl/apiserver.crt --proxy-client-key-file=/etc/kubernetes/ssl/front-proxy-client.key --requestheader-username-headers=X-Remote-User --requestheader-allowed-names=front-proxy-client --service-cluster-ip-range=10.233.0.0/18 --kubelet-client-certificate=/etc/kubernetes/ssl/apiserver-kubelet-client.crt --secure-port=6443 --authorization-mode=Node,RBAC --etcd-servers=https://192.168.0.102:2379 --etcd-cafile=/etc/kubernetes/ssl/etcd/ca.pem --etcd-certfile=/etc/kubernetes/ssl/etcd/node-rig3.pem --etcd-keyfile=/etc/kubernetes/ssl/etcd/node-rig3-key.pem
root     24226 24208  1 15:03 ?        00:00:07 kube-controller-manager --feature-gates=Initializers=False,PersistentLocalVolumes=False,DevicePlugins=True \
--node-monitor-grace-period=40s --node-monitor-period=5s --pod-eviction-timeout=5m0s --cluster-signing-cert-file=/etc/kubernetes/ssl/ca.crt --cluster-signing-key-file=/etc/kubernetes/ssl/ca.key --use-service-account-credentials=true --root-ca-file=/etc/kubernetes/ssl/ca.crt --service-account-private-key-file=/etc/kubernetes/ssl/sa.key --kubeconfig=/etc/kubernetes/controller-manager.conf --address=127.0.0.1 --leader-elect=true --controllers=*,bootstrapsigner,tokencleaner --allocate-node-cidrs=true --cluster-cidr=10.233.64.0/18 --node-cidr-mask-size=24
root     25315     1  2 15:04 ?        00:00:09 /usr/local/bin/kubelet --logtostderr=true --v=2 --address=0.0.0.0 --node-ip=192.168.0.102 --hostname-override=rig3 --allow-privileged=true --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf --authorization-mode=Webhook --client-ca-file=/etc/kubernetes/ssl/ca.crt --pod-manifest-path=/etc/kubernetes/manifests --cadvisor-port=0 --pod-infra-container-image=gcr.io/google_containers/pause-amd64:3.0 --kube-reserved cpu=100m,memory=256M --node-status-update-frequency=10s --cgroup-driver=cgroupfs --docker-disable-shared-pid=True --anonymous-auth=false --read-only-port=0 --fail-swap-on=True --cluster-dns=10.233.0.3 --cluster-domain=umine.farm --resolv-conf=/etc/resolv.conf --kube-reserved cpu=200m,memory=512M \
--feature-gates=Initializers=False,PersistentLocalVolumes=False,DevicePlugins=True --network-plugin=cni --cni-conf-dir=/etc/cni/net.d --cni-bin-dir=/opt/cni/bin

Latest version of Kubernetes:

# kubectl version
Client Version: version.Info{Major:"1", Minor:"9", GitVersion:"v1.9.2+coreos.0", GitCommit:"b427929b2982726eeb64e985bc1ebb41aaa5e095", GitTreeState:"clean", BuildDate:"2018-01-18T22:56:14Z", GoVersion:"go1.9.2", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"9", GitVersion:"v1.9.2+coreos.0", GitCommit:"b427929b2982726eeb64e985bc1ebb41aaa5e095", GitTreeState:"clean", BuildDate:"2018-01-18T22:56:14Z", GoVersion:"go1.9.2", Compiler:"gc", Platform:"linux/amd64"}

Pod description:

# kubectl describe pod nvidia-device-plugin-daemonset-kzlbx -n kube-system
Name:           nvidia-device-plugin-daemonset-kzlbx
Namespace:      kube-system
Node:           rig1/192.168.0.103
Start Time:     Fri, 16 Feb 2018 15:06:31 +0200
Labels:         controller-revision-hash=54069593
                name=nvidia-device-plugin-ds
                pod-template-generation=1
Annotations:    scheduler.alpha.kubernetes.io/critical-pod=
Status:         Running
IP:             10.233.101.88
Controlled By:  DaemonSet/nvidia-device-plugin-daemonset
Containers:
  nvidia-device-plugin-ctr:
    Container ID:   docker://42676f92a1cce3489f87650433029ad27aa2bb24d9529a15689641410ed31d41
    Image:          nvidia/k8s-device-plugin:1.9
    Image ID:       docker-pullable://nvidia/k8s-device-plugin@sha256:ed1cb6269dd827bada9691a7ae59dab4f431a05a9fb8082f8c28bfa9fd90b6c4
    Port:           <none>
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       ContainerCannotRun
      Message:      OCI runtime create failed: container_linux.go:296: starting container process caused "process_linux.go:398: container init caused \"process_linux.go:381: running prestart hook 0 caused \\\"error running hook: exit status 1, stdout: , stderr: exec command: [/usr/bin/nvidia-container-cli --load-kmods configure --ldconfig=@/sbin/ldconfig --device=all --utility --pid=2193 /var/lib/docker/aufs/mnt/ac16d904f39b452545a1bebf06148a8802b1a4b088a183f4fe733cf2547ed32c]\\\\nnvidia-container-cli: ldcache error: process /sbin/ldconfig failed with error code: 127\\\\n\\\"\"": unknown
      Exit Code:    128
      Started:      Fri, 16 Feb 2018 15:17:32 +0200
      Finished:     Fri, 16 Feb 2018 15:17:32 +0200
    Ready:          False
    Restart Count:  7
    Environment:    <none>
    Mounts:
      /var/lib/kubelet/device-plugins from device-plugin (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-pm75k (ro)
Conditions:
  Type           Status
  Initialized    True
  Ready          False
  PodScheduled   True
Volumes:
  device-plugin:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/kubelet/device-plugins
    HostPathType:
  default-token-pm75k:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-pm75k
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     CriticalAddonsOnly
                 node.kubernetes.io/disk-pressure:NoSchedule
                 node.kubernetes.io/memory-pressure:NoSchedule
                 node.kubernetes.io/not-ready:NoExecute
                 node.kubernetes.io/unreachable:NoExecute
Events:
  Type     Reason                 Age                From           Message
  ----     ------                 ----               ----           -------
  Normal   SuccessfulMountVolume  13m                kubelet, rig1  MountVolume.SetUp succeeded for volume "device-plugin"
  Normal   SuccessfulMountVolume  13m                kubelet, rig1  MountVolume.SetUp succeeded for volume "default-token-pm75k"
  Warning  Failed                 13m                kubelet, rig1  Error: failed to start container "nvidia-device-plugin-ctr": Error response from daemon: OCI runtime create failed: container_linux.go:296: starting container process caused "process_linux.go:398: container init caused \"process_linux.go:381: running prestart hook 0 caused \\\"error running hook: exit status 1, stdout: , stderr: exec command: [/usr/bin/nvidia-container-cli --load-kmods configure --ldconfig=@/sbin/ldconfig --device=all --utility --pid=31021 /var/lib/docker/aufs/mnt/a2f849e29fcb8dc87d51e90497d7e44a38d7ecf93acabc285523d13c1cdf9046]\\\\nnvidia-container-cli: ldcache error: process /sbin/ldconfig failed with error code: 127\\\\n\\\"\"": unknown
  Warning  Failed                 13m                kubelet, rig1  Error: failed to start container "nvidia-device-plugin-ctr": Error response from daemon: OCI runtime create failed: container_linux.go:296: starting container process caused "process_linux.go:398: container init caused \"process_linux.go:381: running prestart hook 0 caused \\\"error running hook: exit status 1, stdout: , stderr: exec command: [/usr/bin/nvidia-container-cli --load-kmods configure --ldconfig=@/sbin/ldconfig --device=all --utility --pid=31068 /var/lib/docker/aufs/mnt/508159dc054cd38ef20a75373a230703de9cba817f44e69da02b82ceac08fb64]\\\\nnvidia-container-cli: ldcache error: process /sbin/ldconfig failed with error code: 127\\\\n\\\"\"": unknown
  Warning  Failed                 12m                kubelet, rig1  Error: failed to start container "nvidia-device-plugin-ctr": Error response from daemon: OCI runtime create failed: container_linux.go:296: starting container process caused "process_linux.go:398: container init caused \"process_linux.go:381: running prestart hook 0 caused \\\"error running hook: exit status 1, stdout: , stderr: exec command: [/usr/bin/nvidia-container-cli --load-kmods configure --ldconfig=@/sbin/ldconfig --device=all --utility --pid=31210 /var/lib/docker/aufs/mnt/9dab03a8dcf80c0de647bc46b985c0e66fed9cead529e20d499dfaf7d9dcc49c]\\\\nnvidia-container-cli: ldcache error: process /sbin/ldconfig failed with error code: 127\\\\n\\\"\"": unknown
  Warning  Failed                 12m                kubelet, rig1  Error: failed to start container "nvidia-device-plugin-ctr": Error response from daemon: OCI runtime create failed: container_linux.go:296: starting container process caused "process_linux.go:398: container init caused \"process_linux.go:381: running prestart hook 0 caused \\\"error running hook: exit status 1, stdout: , stderr: exec command: [/usr/bin/nvidia-container-cli --load-kmods configure --ldconfig=@/sbin/ldconfig --device=all --utility --pid=31378 /var/lib/docker/aufs/mnt/2dbe1488b7df983513be06da0e3d439e0dda69c169ac4cbe4e5c7204a892c448]\\\\nnvidia-container-cli: ldcache error: process /sbin/ldconfig failed with error code: 127\\\\n\\\"\"": unknown
  Normal   Created                11m (x5 over 13m)  kubelet, rig1  Created container
  Normal   Pulled                 11m (x5 over 13m)  kubelet, rig1  Container image "nvidia/k8s-device-plugin:1.9" already present on machine
  Warning  Failed                 11m                kubelet, rig1  Error: failed to start container "nvidia-device-plugin-ctr": Error response from daemon: OCI runtime create failed: container_linux.go:296: starting container process caused "process_linux.go:398: container init caused \"process_linux.go:381: running prestart hook 0 caused \\\"error running hook: exit status 1, stdout: , stderr: exec command: [/usr/bin/nvidia-container-cli --load-kmods configure --ldconfig=@/sbin/ldconfig --device=all --utility --pid=31670 /var/lib/docker/aufs/mnt/148216fd0c884ee7e2a6978c4035b7cc7651ad715b086b1e9aba14f0a24a733e]\\\\nnvidia-container-cli: ldcache error: process /sbin/ldconfig failed with error code: 127\\\\n\\\"\"": unknown
  Warning  BackOff                2m (x42 over 12m)  kubelet, rig1  Back-off restarting failed container

device plugin runs on ALL nodes

I have 9 worker nodes in my cluster but only ONE of them has a GPU. However, the device plugin seems to be running on ALL nodes. On the nodes without a GPU you can see the device plugin failing to find NVML (it succeeds on the node with a GPU), so it seems to me that this plugin should only be running on the node that has a GPU.

Q: How can I make the device plugin run only on my GPU node? Labels? Taints? Something else?

Minikube doesn't recognize GPU

I installed NVIDIA Docker and am now trying to test it on my local minikube, without success.
I followed a few threads around the same topics, also without luck.

sudo minikube start --vm-driver=none --feature-gates=Accelerators=true 
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v1.10/nvidia-device-plugin.yml 
kubectl get nodes -o=custom-columns=NAME:.metadata.name,GPUs:.status.capacity.'nvidia\.com/gpu' 

Getting:

NAME GPUs
minikube <none>

not working on a local one-node cluster created by local-up-cluster.sh

Hi, I'm building a single-machine testing environment. Because Minikube doesn't support GPUs well, I use the local-up-cluster.sh provided at https://github.com/kubernetes/kubernetes/blob/master/hack/local-up-cluster.sh to bring up a single-node cluster, but it does not work well with k8s-device-plugin.

Do the following to reproduce it:

  • get source code using go get -d k8s.io/kubernetes

  • In order to make local-up-cluster.sh launch the kubelet with the gate option, I inserted the following line at the top of local-up-cluster.sh:
    FEATURE_GATES="DevicePlugins=true"

  • start the cluster using sudo ./hack/local-up-cluster.sh

When running
docker run -it -v /var/lib/kubelet/device-plugins:/var/lib/kubelet/device-plugins nvidia/k8s-device-plugin:1.9

I got:

2018/02/28 12:14:51 Loading NVML
2018/02/28 12:14:51 Fetching devices.
2018/02/28 12:14:51 Starting FS watcher.
2018/02/28 12:14:51 Starting OS watcher.
2018/02/28 12:14:51 Starting to serve on /var/lib/kubelet/device-plugins/nvidia.sock
2018/02/28 12:14:51 Could not register device plugin: rpc error: code = Unimplemented desc = unknown service deviceplugin.Registration
2018/02/28 12:14:51 Could not contact Kubelet, retrying. Did you enable the device plugin feature gate?
2018/02/28 12:14:51 You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
2018/02/28 12:14:51 You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
2018/02/28 12:14:51 Starting to serve on /var/lib/kubelet/device-plugins/nvidia.sock
2018/02/28 12:14:51 Could not register device plugin: rpc error: code = Unimplemented desc = unknown service deviceplugin.Registration
2018/02/28 12:14:51 Could not contact Kubelet, retrying. Did you enable the device plugin feature gate?
2018/02/28 12:14:51 You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
2018/02/28 12:14:51 You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
2018/02/28 12:14:51 Starting to serve on /var/lib/kubelet/device-plugins/nvidia.sock
2018/02/28 12:14:51 Could not register device plugin: rpc error: code = Unimplemented desc = unknown service deviceplugin.Registration
2018/02/28 12:14:51 Could not contact Kubelet, retrying. Did you enable the device plugin feature gate?
2018/02/28 12:14:51 You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
2018/02/28 12:14:51 You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
2018/02/28 12:14:51 Starting to serve on /var/lib/kubelet/device-plugins/nvidia.sock

Here are some of my configs:
/etc/docker/daemon.json

root@ubuntu-10-53-66-17:~# cat /etc/docker/daemon.json
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}

The command line to run kubelet

# ps aux | grep kubelet

root      71370  3.4  0.0 2621216 123892 pts/0  Sl+  20:20   0:06 /home/mi/go/src/k8s.io/kubernetes/_output/local/bin/linux/amd64/hyperkube kubelet --v=3 --vmodule= --chaos-chance=0.0 --container-runtime=docker --rkt-path= --rkt-stage1-image= --hostname-override=127.0.0.1 --cloud-provider= --cloud-config= --address=127.0.0.1 --kubeconfig /var/run/kubernetes/kubelet.kubeconfig --feature-gates=DevicePlugins=true --cpu-cfs-quota=true --enable-controller-attach-detach=true --cgroups-per-qos=true --cgroup-driver=cgroupfs --keep-terminated-pod-volumes=true --eviction-hard=memory.available<100Mi,nodefs.available<10%,nodefs.inodesFree<5% --eviction-soft= --eviction-pressure-transition-period=1m --pod-manifest-path=/var/run/kubernetes/static-pods --fail-swap-on=false --cluster-dns=10.0.0.10 --cluster-domain=cluster.local --port=10250

The output of nvidia-smi on the machine:

Wed Feb 28 20:26:18 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.81                 Driver Version: 384.81                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 00000000:05:00.0 Off |                    0 |
| N/A   34C    P8    26W / 149W |      0MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80           Off  | 00000000:06:00.0 Off |                    0 |
| N/A   28C    P8    30W / 149W |      0MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla K80           Off  | 00000000:84:00.0 Off |                    0 |
| N/A   40C    P8    27W / 149W |      0MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla K80           Off  | 00000000:85:00.0 Off |                    0 |
| N/A   33C    P8    29W / 149W |      0MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Issues when requesting for more than 1 GPU

Hi there,

My Kubernetes cluster is as follows:

Master (no GPU)
Node 1 (GPU)
Node 2 (GPU)
Node 3 (GPU)
Node 4 (GPU)

Nodes 1 - 4 have NVIDIA drivers (384) and nvidia-docker2 installed.

First issue:
When I run the command
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v1.9/nvidia-device-plugin.yml

the nvidia plugin also runs on the master node, which has no NVIDIA drivers and no nvidia-docker installed. Is this behaviour correct?

Second issue:
I can only use 1 GPU on my cluster at a time. For example, if I run the TensorFlow notebook with 1 GPU, it works. But if I deploy another pod using another GPU, the pod status gets stuck on Pending, stating that there are insufficient GPU resources.

How do I solve this? Thanks.
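
For the second issue, a quick check (a sketch only) is whether each GPU node is really advertising its GPUs, because a pod's GPU request has to be satisfied by a single node and anything already allocated stays reserved until the owning pod terminates:

kubectl get nodes -o=custom-columns=NAME:.metadata.name,CAP:.status.capacity.'nvidia\.com/gpu',ALLOC:.status.allocatable.'nvidia\.com/gpu'

If only one of the four GPU nodes shows a non-zero value, the device-plugin pods on the other nodes most likely failed to register, which would explain why the second 1-GPU pod stays Pending.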

Can't deploy NVIDIA device plugin on k8s 1.8.6 because could not load NVML library

version

$ kubectl version
Client Version: version.Info{Major:"1", Minor:"8", GitVersion:"v1.8.6", GitCommit:"6260bb08c46c31eea6cb538b34a9ceb3e406689c", GitTreeState:"clean", BuildDate:"2017-12-21T06:34:11Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"8", GitVersion:"v1.8.6", GitCommit:"6260bb08c46c31eea6cb538b34a9ceb3e406689c", GitTreeState:"clean", BuildDate:"2017-12-21T06:23:29Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}

$ kubeadm version
kubeadm version: &version.Info{Major:"1", Minor:"8", GitVersion:"v1.8.6", GitCommit:"6260bb08c46c31eea6cb538b34a9ceb3e406689c", GitTreeState:"clean", BuildDate:"2017-12-21T06:23:29Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}

$ kubelet --version
Kubernetes v1.8.6

NVIDIA-SMI 375.26

$ docker version
Client:
 Version:      17.03.2-ce
 API version:  1.27
 Go version:   go1.7.5
 Git commit:   f5ec1e2
 Built:        Tue Jun 27 02:31:19 2017
 OS/Arch:      linux/amd64

Server:
 Version:      17.03.2-ce
 API version:  1.27 (minimum version 1.12)
 Go version:   go1.7.5
 Git commit:   f5ec1e2
 Built:        Tue Jun 27 02:31:19 2017
 OS/Arch:      linux/amd64
 Experimental: false

OS system is Debian 9.
GPU: Tesla K40m.
CUDA: Cuda compilation tools, release 8.0, V8.0.61

error

I installed nvidia-docker according to the Debian instructions and NVIDIA/nvidia-docker#516, and I can run docker run --runtime=nvidia --rm nvidia/cuda nvidia-smi successfully. I set nvidia as the default-runtime and enabled the DevicePlugins feature gate on my 2-node k8s cluster equipped with Tesla K40m GPUs.

But when I run

kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v1.8/nvidia-device-plugin.yml

or

docker run --security-opt=no-new-privileges --cap-drop=ALL --network=none -it -v /var/lib/kubelet/device-plugins:/var/lib/kubelet/device-plugins nvidia/k8s-device-plugin:1.8

they give this error:

2018/01/05 12:25:01 Loading NVML
2018/01/05 12:25:01 Failed to start nvml with error: could not load NVML library.

The output of ldconfig is

$ ldconfig -p | grep nvidia-ml
	libnvidia-ml.so.1 (libc6,x86-64) => /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1
	libnvidia-ml.so.1 (libc6) => /usr/lib32/libnvidia-ml.so.1
	libnvidia-ml.so (libc6,x86-64) => /usr/lib/x86_64-linux-gnu/libnvidia-ml.so
	libnvidia-ml.so (libc6) => /usr/lib32/libnvidia-ml.so

I checked other issues like NVIDIA/nvidia-docker#74 and NVIDIA/nvidia-docker#470; in those, people failed to run nvidia-docker, but I can.

Another strange thing is that there is no nvidia-device-plugin in my path and the output of locate nvidia-device-plugin is blank.

Could you please help me check what went wrong?
Thanks!

allocatable stuck at zero

I have a Kubernetes node on 1.10.2 with nvidia/k8s-device-plugin:1.10. Everything worked great initially, but now I can't schedule any pods with nvidia.com/gpu. Looking at the output of kubectl get node, I see:

status:
  addresses:
  - address: 134.79.129.97
    type: InternalIP
  - address: ocio-gpu01
    type: Hostname
  allocatable:
    cpu: "48"
    ephemeral-storage: "9391196145"
    hugepages-1Gi: "0"
    hugepages-2Mi: "0"
    memory: 263412492Ki
    nvidia.com/gpu: "0"
    pods: "110"
  capacity:
    cpu: "48"
    ephemeral-storage: 10190100Ki
    hugepages-1Gi: "0"
    hugepages-2Mi: "0"
    memory: 263514892Ki
    nvidia.com/gpu: "16"
    pods: "110"

I think I cannot schedule any pods because allocatable is zero. I have pods running on the box, but none that requested any GPUs.

Any pointers on how I can troubleshoot this?

Thanks,
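
A debugging sketch (the namespace and label assume the stock DaemonSet manifest): look at the plugin's own log on that node and, if it looks healthy, force it to re-register by recreating the pod, since the kubelet recomputes allocatable from the plugin's device list after they reconnect:

kubectl -n kube-system logs -l name=nvidia-device-plugin-ds --tail=50
kubectl -n kube-system delete pod -l name=nvidia-device-plugin-ds    # DaemonSet recreates it and it re-registers

A common cause of "capacity 16 / allocatable 0" is the plugin having stopped or failed to re-register after a kubelet restart; recreating the plugin pod (or restarting the kubelet) makes it register again.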

Can k8s-device-plugin support NUMA-aware allocation?

We operate a GPU cluster where every server has 4 GPUs; suppose their IDs are 0, 1, 2, 3. One job has taken ID 0. If the next job needs 2 GPUs, can the plugin give 2 and 3 to the kubelet? (Right now it gives 1 and 2.) If it did, jobs on the same PCIe switch could communicate faster than across different PCIe slots.
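
The plugin does not appear to do PCIe/NUMA-aware packing here, but you can at least inspect the locality the allocator would need to respect with nvidia-smi's topology matrix (a read-only check, not a workaround):

nvidia-smi topo -m    # prints the GPU-to-GPU connection matrix (e.g. PIX/PXB for same PCIe switch)

GPUs that share a PCIe switch show up as PIX or PXB in that matrix, which is exactly the pairing the request above wants preferred.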

Could not register device plugin: context deadline exceeded

I am getting the following error when starting the plugin as a docker container:

2017/11/24 09:06:24 Loading NVML
2017/11/24 09:06:24 Starting to serve on /var/lib/kubelet/device-plugins/nvidia.sock
2017/11/24 09:06:29 Could not register device plugin: context deadline exceeded

My installation of nvidia-docker works fine.

What is the problem?
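
A minimal debugging sketch for this symptom (it assumes the kubelet runs under systemd): the plugin serves its socket fine, so the timeout happens on the kubelet side, and the kubelet log usually says why it refused the registration, most often a missing DevicePlugins feature gate on Kubernetes 1.8/1.9:

journalctl -u kubelet --since "10 minutes ago" | grep -iE 'device.?plugin|feature'

If the feature gate is present and the error persists, check that the plugin image version matches the cluster version (e.g. k8s-device-plugin:1.8 for Kubernetes 1.8.x), since the device-plugin API differs between releases.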

Does the nvidia-device-plugin need to be running on all worker nodes?

I can deploy the device-plugin on my GPU nodes successfully. After running the kubectl create command, every worker node deploys one nvidia-device-plugin pod. I know this is because a DaemonSet is used to deploy the plugin, but what confuses me is: do we also need to deploy the plugin on the non-GPU nodes?

I suggest adding a node affinity of type requiredDuringSchedulingIgnoredDuringExecution, or a nodeSelector.
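
A sketch of that suggestion (the label key is illustrative only), using the affinity type mentioned, placed in the DaemonSet's pod template:

      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: accelerator
                operator: Exists

With this, the plugin pod is only created on nodes that carry an accelerator label (for example the accelerator=nvidia-titan-xp label used elsewhere on this page), so non-GPU workers never run it.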

Handling hot plug events

I'm working on a feature where I can hot-plug NVIDIA GPUs into the host. But when I do that, the device plugin does not recognize the hot-plugged GPU.

It would be great if the support for hotplug events is provided.

k8s-device-plugin Failed to initialize NVML: could not load NVML library

I checked issue #19; it does not help me out.

versions:

docker version
Client:
 Version:      1.13.1
 API version:  1.26
 Go version:   go1.6.2
 Git commit:   092cba3
 Built:        Thu Nov  2 20:40:23 2017
 OS/Arch:      linux/amd64

Server:
 Version:      1.13.1
 API version:  1.26 (minimum version 1.12)
 Go version:   go1.6.2
 Git commit:   092cba3
 Built:        Thu Nov  2 20:40:23 2017
 OS/Arch:      linux/amd64
 Experimental: false


kubectl version
GitVersion:"v1.10.2"


kubeadm version
GitVersion:"v1.10.2

nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2016 NVIDIA Corporation
Built on Tue_Jan_10_13:22:03_CST_2017
Cuda compilation tools, release 8.0, V8.0.61


ldconfig -p | grep nvidia-ml
	libnvidia-ml.so.1 (libc6,x86-64) => /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1
	libnvidia-ml.so (libc6,x86-64) => /usr/lib/x86_64-linux-gnu/libnvidia-ml.so


nvidia-smi
Wed May  2 21:18:39 2018       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.124                Driver Version: 367.124                   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GRID K1             Off  | 0000:0B:00.0     Off |                  N/A |
| N/A   29C    P8     8W /  31W |      0MiB /  4036MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

errors:

docker run --security-opt=no-new-privileges --cap-drop=ALL --network=none -it -v /var/lib/kubelet/device-plugins:/var/lib/kubelet/device-plugins nvidia/k8s-device-plugin:1.10
2018/05/02 21:18:02 Loading NVML
2018/05/02 21:18:02 Failed to initialize NVML: could not load NVML library.
2018/05/02 21:18:02 If this is a GPU node, did you set the docker default runtime to `nvidia`?
2018/05/02 21:18:02 You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
2018/05/02 21:18:02 You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start

@RenaudWasTaken mentions in #19 that old GPUs might hit this issue.
Is that the case here? Please help take a look, thanks a lot.

0/1 nodes are available: 1 Insufficient nvidia.com/gpu

Deploying any PODS with the nvidia.com/gpu resource limits results in "0/1 nodes are available: 1 Insufficient nvidia.com/gpu."

I also see this error in the Daemonset POD logs:
2018/02/27 16:43:50 Warning: GPU with UUID GPU-edae6d5d-6698-fb8d-2c6b-2a791224f089 is too old to support healtchecking with error: %!s(MISSING). Marking it unhealthy

I am running nvidia-docker2 and have deployed the nvidia device plugin as a DaemonSet.

On the worker node:
uname -a
Linux gpu 4.4.0-116-generic #140-Ubuntu SMP Mon Feb 12 21:23:04 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

docker run --rm nvidia/cuda nvidia-smi
Wed Feb 28 18:07:07 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.30 Driver Version: 390.30 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 760 Off | 00000000:0B:00.0 N/A | N/A |
| 34% 43C P8 N/A / N/A | 0MiB / 1999MiB | N/A Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX 760 Off | 00000000:90:00.0 N/A | N/A |
| 34% 42C P8 N/A / N/A | 0MiB / 1999MiB | N/A Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 Not Supported |
| 1 Not Supported |
+-----------------------------------------------------------------------------+

Failed to start nvml with error

I installed nvidia-docker2 and deployed the device plugin on Kubernetes 1.8, but when I run kubectl describe pods, I get this error:

loading NVML
Failed to start nvml with error: could not load NVML library

0/3 nodes are available: 1 PodToleratesNodeTaints, 3 Insufficient nvidia.com/gpu.

I deployed the device-plugin container on k8s via the guide. But when I run the tensorflow-notebook (by executing kubectl create -f tensorflow-notebook.yml), the pod is still pending:

[root@mlssdi010001 k8s]# kubectl describe pod tf-notebook-747db6987b-86zts
Name: tf-notebook-747db6987b-86zts
....
Events:
Type Reason Age From Message


Warning FailedScheduling 47s (x15 over 3m) default-scheduler 0/3 nodes are available: 1 PodToleratesNodeTaints, 3 Insufficient nvidia.com/gpu.

Pod info:

[root@mlssdi010001 k8s]# kubectl get pod --all-namespaces -o wide
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE
default tf-notebook-747db6987b-86zts 0/1 Pending 0 5s
....
kube-system nvidia-device-plugin-daemonset-ljrwc 1/1 Running 0 34s 10.244.1.11 mlssdi010003
kube-system nvidia-device-plugin-daemonset-m7h2r 1/1 Running 0 34s 10.244.2.12 mlssdi010002

Nodes info:

NAME STATUS ROLES AGE VERSION
mlssdi010001 Ready master 1d v1.9.0
mlssdi010002 Ready 1d v1.9.0 (GPU Node,1 * Tesla M40)
mlssdi010003 Ready 1d v1.9.0 (GPU Node,1 * Tesla M40)

Is it better to change the docker runtime from nvidia to the default runC?

In the current settings, the nvidia runtime is set as the default docker runtime instead of the original runC,
so we hit the issue described in kubernetes/kubernetes#59631 and kubernetes/kubernetes#59629:
all GPUs are exposed into the container.

So one way to solve it is to use an env variable to tell nvidia-container-runtime not to expose the GPUs.
Another, better way:

  1. Set the default docker runtime back to runC.
  2. Stop relying on nvidia-container-runtime's pre-start hook; use k8s-device-plugin to do the same jobs, such as injecting the GPU devices.

That way the issue kubernetes/kubernetes#59631 can be fixed.

@flx42 @RenaudWasTaken @cmluciano @jiayingz @vikaschoudhary16

can't schedule GPU pod

I'm running K8s 1.10.2-0 on RHEL 7.4 with docker 18.03.1.
I have a 9-worker-node K8s cluster. Only one of those nodes has a GPU (an NVIDIA TITAN Xp).
I installed nvidia-docker2 on ALL worker nodes:
nvidia-docker2.noarch 2.0.3-1.docker18.03.1.ce
I installed nvidia-container-runtime on ALL worker nodes:
nvidia-container-runtime.x86_64 2.0.0-1.docker18.03.1
I installed nvidia-device-plugin.yml v1.10 via kubectl (the device plugin is running OK on all worker nodes).

I can ssh into my GPU worker node and run nvidia-smi inside a container OK:

[whacuser@gpu ~]$ sudo docker run --rm nvidia/cuda nvidia-smi
Unable to find image 'nvidia/cuda:latest' locally
latest: Pulling from nvidia/cuda
297061f60c36: Pull complete
e9ccef17b516: Pull complete
dbc33716854d: Pull complete
8fe36b178d25: Pull complete
686596545a94: Pull complete
f611dfbee954: Pull complete
c51814f3e9ba: Pull complete
5da0fc07e73a: Pull complete
97462b1887aa: Pull complete
924ea239f6fe: Pull complete
Digest: sha256:69f3780f80a72cb7cebc7f401a716370f79412c5aa9362306005ca4eb84d0f3c
Status: Downloaded newer image for nvidia/cuda:latest
Mon May 14 20:14:16 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.25                 Driver Version: 390.25                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  TITAN X (Pascal)    Off  | 00000000:13:00.0 Off |                  N/A |
| 23%   21C    P8     8W / 250W |      0MiB / 12196MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

I label my GPU worker node like so:
kubectl label nodes gpu accelerator=nvidia-titan-xp --overwrite=true

However, when I try to run a pod:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  containers:
    - name: cuda-container
      image: nvidia/cuda:9.0-devel
      resources:
        limits:
          nvidia.com/gpu: 1 # requesting 1 GPU
    - name: digits-container
      image: nvidia/digits:6.0
      resources:
        limits:
          nvidia.com/gpu: 1 # requesting 1 GPU
  nodeSelector:
    accelerator: nvidia-titan-xp

I get an error:

0/12 nodes are available: 11 MatchNodeSelector, 12 Insufficient nvidia.com/gpu, 3 PodToleratesNodeTaints.

Any ideas?

a small change in README

docker build -t nvidia/k8s-device-plugin:1.9 https://github.com/NVIDIA/k8s-device-plugin.git

should be:

docker build -t nvidia/k8s-device-plugin:1.9 https://github.com/NVIDIA/k8s-device-plugin.git#v1.9

what k8s does behind device plugin ?

I read another device plugin example, https://github.com/vikaschoudhary16/sfc-device-plugin.
In that device plugin, during the allocate phase, it only responds with the host path and container path of the device. Does that mean k8s will mount the device into the container?
In the nvidia device plugin, it sets the env "NVIDIA_VISIBLE_DEVICES", and the nvidia-container-cli then uses that env to mount the devices into the container.
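
For comparison, that environment-variable mechanism can be reproduced outside Kubernetes (a sketch; any CUDA image works), since the nvidia runtime's prestart hook reads the variable and injects only the listed devices:

docker run --rm --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=0 nvidia/cuda nvidia-smi   # only GPU 0 is visible inside

The sfc-style plugin you linked instead returns host/container device paths in its Allocate response and lets the kubelet pass plain device mounts to the runtime; both flows are valid uses of the Allocate API.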

Enhance Nvidia Device plugin with more health checking features

Quoting what @RenaudWasTaken mentioned in another thread:
"The Nvidia Device plugin has a lot of such features coming up a few of these are:

memory scrubbing
healthCheck and reset in case of bad state
GPU Allocated memory checks
"Zombie processes" checks
...
"

Creating this issue to track the progress on these improvements.

@RenaudWasTaken could you also provide more details on some of these features, like what GPU Allocated memory checks and "Zombie processes" checks do?

I deployed k8s 1.10 and the k8s-device-plugin, but the GPU capacity is not found

I deployed k8s v1.10 and k8s-device-plugin v1.10, but when I use
$ kubectl describe node bjpg-g271.yz02
I cannot find the GPU capacity.

The CUDA version I deployed is:
$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2016 NVIDIA Corporation
Built on Tue_Jan_10_13:22:03_CST_2017
Cuda compilation tools, release 8.0, V8.0.61

nvidia-docker is 17.03.2.
I can run a GPU container with docker run, but GPUs cannot be scheduled by k8s.

Allocate() need return mount path of libcuda

Hi, I'm trying to deploy TensorFlow (with GPU support) on Kubernetes with this device plugin,
and this error occurred:

ImportError: libcuda.so.1: cannot open shared object file: No such file or directory
Failed to load the native TensorFlow runtime.

After debugging the source code, I think this is because the Allocate function doesn't return the mount path of libcuda.so.

@flx42 PTAL; I think I can send a PR to fix it later.

Manifest in upstream kubernetes?

The yaml manifest available upstream [1] a) is not the one suggested in the project's README [2] and b) is GKE-specific. As a result it is not clear to Kubernetes distributions (such as, for example, CDK [3]) which manifest should be shipped with each k8s release. We are doing our best, but any feedback from you on what the right path is would be much appreciated.

[1] https://github.com/kubernetes/kubernetes/blob/release-1.10/cluster/addons/device-plugins/nvidia-gpu/daemonset.yaml
[2] https://github.com/NVIDIA/k8s-device-plugin/blob/v1.9/nvidia-device-plugin.yml
[3] https://www.ubuntu.com/kubernetes

Allocating same GPU to multiple requests

Are you open to a PR that allocates the same GPU to multiple requests based on additional requirements passed to the process?

I'm thinking I could ask for nvidia/gpu:1 and get 1 whole GPU, or I could ask for nvidia/gpu-memory:1Gi and nvidia/gpu-cpu:2 and get "allocated" 1Gi of memory and 2 cores on 1 GPU, leaving whatever is left for other nvidia/gpu-memory and nvidia/gpu-cpu requests.

It wouldn't be enforced, but this way we can at least context switch between multiple processes on 1 GPU, which is something the main kubernetes project doesn't seem to want to support until at least v1.11 (kubernetes/kubernetes#52757)
