Comments (24)
Thanks so much, @RenaudWasTaken: with your help, it works now!
Here's what I've done; I hope it is helpful for further documentation of the project.
In my case, using OpenShift v3.9 and Docker CE with NVIDIA P40 GPUs (rather than a Kubernetes-only environment where you would edit the Kubernetes manifest files on the worker node), I had to edit the file '/etc/systemd/system/origin-node.service':
### edit the "ExecStart=/usr/bin/docker run ..." line:
### add a "-v /var/lib/kubelet/device-plugins/:/var/lib/kubelet/device-plugins/" bind mount, as below
# REDACTED
ExecStart=/usr/bin/docker run --name origin-node \
...
\
-v /dev:/dev $DOCKER_ADDTL_BIND_MOUNTS -v /etc/pki:/etc/pki:ro \
\
-v /var/lib/kubelet/device-plugins/:/var/lib/kubelet/device-plugins/ \
\
openshift/node:${IMAGE_VERSION}
...
Next, restart the origin-node service (on each GPU worker node):
# setenforce 0
# systemctl daemon-reload
# systemctl restart origin-node
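Before restarting, it can help to double-check that the bind mount actually landed in the unit file. The sketch below writes a sample unit to a hypothetical /tmp path so it is safe to run anywhere; on a real node you would grep /etc/systemd/system/origin-node.service directly.

```shell
# Sketch: verify the device-plugins bind mount is present in the unit file.
# A sample file stands in for /etc/systemd/system/origin-node.service here.
UNIT=/tmp/origin-node.service.sample
cat > "$UNIT" <<'EOF'
ExecStart=/usr/bin/docker run --name origin-node \
  -v /dev:/dev \
  -v /var/lib/kubelet/device-plugins/:/var/lib/kubelet/device-plugins/ \
  openshift/node:${IMAGE_VERSION}
EOF
if grep -q '/var/lib/kubelet/device-plugins/:/var/lib/kubelet/device-plugins/' "$UNIT"; then
  echo "bind mount present"
fi
```

If the grep prints nothing, the edit did not take and restarting origin-node will not expose the socket directory to the containerized kubelet.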
Delete the existing k8s-device-plugin DaemonSet that was in an abnormal state (on the master node):
# oc delete -f nvidia-device-plugin-daemonset.yml
daemonset "nvidia-device-plugin-daemonset" deleted
Check the YAML for running the k8s-device-plugin and re-create the DaemonSet (on the master node):
# vi 02-nvidia-device-plugin-daemonset.yml
apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: nvidia
spec:
  template:
    metadata:
      labels:
        name: nvidia-device-plugin-ds
    spec:
      priorityClassName: system-node-critical
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: openshift.com/gpu-accelerator
                operator: Exists
      securityContext:
        privileged: true
      serviceAccount: nvidia-deviceplugin
      serviceAccountName: nvidia-deviceplugin
      hostNetwork: true
      hostPID: true
      containers:
      - image: nvidia/k8s-device-plugin:1.9
        name: nvidia-device-plugin-ctr
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop: ["ALL"]
        volumeMounts:
        - name: device-plugin
          mountPath: /var/lib/kubelet/device-plugins
      volumes:
      - name: device-plugin
        hostPath:
          path: /var/lib/kubelet/device-plugins
# oc create -f 02-nvidia-device-plugin-daemonset.yml
daemonset "nvidia-device-plugin-daemonset" created
# oc logs -f nvidia-device-plugin-daemonset-89nrf
2018/06/18 09:54:12 Loading NVML
2018/06/18 09:54:12 Fetching devices.
2018/06/18 09:54:12 Starting FS watcher.
2018/06/18 09:54:12 Starting OS watcher.
2018/06/18 09:54:12 Starting to serve on /var/lib/kubelet/device-plugins/nvidia.sock
2018/06/18 09:54:12 Registered device plugin with Kubelet
^C
Now, check the node's GPU capacity.
# oc describe node node01 | egrep 'Capacity|Allocatable|gpu'
Labels: apptier=gpu
openshift.com/gpu-accelerator=true
Capacity:
nvidia.com/gpu: 2
Allocatable:
nvidia.com/gpu: 2
Normal NodeAllocatableEnforced 17m kubelet, node01 Updated Node Allocatable limit across pods
Normal NodeAllocatableEnforced 14m kubelet, node01 Updated Node Allocatable limit across pods
Run the test pod (cuda-vector-add):
# vi cuda-vector-add.yaml
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vector-add
  namespace: nvidia
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda-vector-add
    image: "docker.io/mirrorgooglecontainers/cuda-vector-add:v0.1"
    env:
    - name: NVIDIA_VISIBLE_DEVICES
      value: all
    - name: NVIDIA_DRIVER_CAPABILITIES
      value: "compute,utility"
    - name: NVIDIA_REQUIRE_CUDA
      value: "cuda>=8.0"
    resources:
      limits:
        nvidia.com/gpu: 1 # requesting 1 GPU
# oc create -f cuda-vector-add.yaml
pod "cuda-vector-add" created
# oc get pods
NAME READY STATUS RESTARTS AGE
cuda-vector-add 0/1 Completed 0 5s
nvidia-device-plugin-daemonset-7rl44 1/1 Running 0 4m
nvidia-device-plugin-daemonset-nghph 1/1 Running 0 4m
# oc logs -f cuda-vector-add
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
Thank you for all your interest and help.
from k8s-device-plugin.
I think I figured it out. Your kubelet is running in a container!
You need to mount the host directory there too: add -v /var/lib/kubelet/device-plugins/:/var/lib/kubelet/device-plugins/
to the docker command in your origin-node systemd unit file.
from k8s-device-plugin.
Did you do this part?
After a successful installation of OpenShift 3.9, the first step is to add the Device Plugins feature gate to /etc/origin/node/node-config.yaml on every node that has a GPU:
kind: NodeConfig
kubeletArguments:
  feature-gates:
  - DevicePlugins=true
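A quick grep can confirm the gate is present. This sketch runs against a sample file at a hypothetical /tmp path so it works anywhere; on a node, point CFG at /etc/origin/node/node-config.yaml instead.

```shell
# Sketch: check that the DevicePlugins feature gate is present in the config.
# A sample file stands in for /etc/origin/node/node-config.yaml here.
CFG=/tmp/node-config-sample.yaml
cat > "$CFG" <<'EOF'
kind: NodeConfig
kubeletArguments:
  feature-gates:
  - DevicePlugins=true
EOF
grep -q 'DevicePlugins=true' "$CFG" && echo "feature gate present"
```

Remember that the kubelet only picks the gate up after a restart of the node service.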
from k8s-device-plugin.
@flx42 hi!
Thanks for the interest. The /etc/origin/node/node-config.yaml from one of my GPU nodes is below.
I also ran "systemctl daemon-reload && systemctl restart origin-node".
# cat /etc/origin/node/node-config.yaml
allowDisabledDocker: false
apiVersion: v1
dnsBindAddress: 127.0.0.1:53
dnsRecursiveResolvConf: /etc/origin/node/resolv.conf
dnsDomain: cluster.local
dnsIP: 192.168.5.101
dockerConfig:
  execHandlerName: ""
iptablesSyncPeriod: "30s"
imageConfig:
  format: openshift/origin-${component}:${version}
  latest: False
kind: NodeConfig
kubeletArguments:
  node-labels:
  - region=primary
  - zone=east
  feature-gates:
  - DevicePlugins=true
masterClientConnectionOverrides:
  acceptContentTypes: application/vnd.kubernetes.protobuf,application/json
  contentType: application/vnd.kubernetes.protobuf
  burst: 200
  qps: 100
masterKubeConfig: system:node:node01.kubeconfig
networkPluginName: redhat/openshift-ovs-subnet
# networkConfig struct introduced in origin 1.0.6 and OSE 3.0.2 which
# deprecates networkPluginName above. The two should match.
networkConfig:
  mtu: 1450
  networkPluginName: redhat/openshift-ovs-subnet
nodeName: node01
podManifestConfig:
servingInfo:
  bindAddress: 0.0.0.0:10250
  certFile: server.crt
  clientCA: ca.crt
  keyFile: server.key
volumeDirectory: /var/lib/origin/openshift.local.volumes
proxyArguments:
  proxy-mode:
  - iptables
volumeConfig:
  localQuota:
    perFSGroup:
from k8s-device-plugin.
Can you do systemctl status kubelet
on the GPU node?
from k8s-device-plugin.
@flx42
You mean the origin-node.service, right?
# systemctl status origin-node -l
● origin-node.service
Loaded: loaded (/etc/systemd/system/origin-node.service; enabled; vendor preset: disabled)
Active: active (running) since 목 2018-06-07 02:28:23 KST; 50min ago
Process: 69206 ExecStopPost=/usr/bin/dbus-send --system --dest=uk.org.thekelleys.dnsmasq /uk/org/thekelleys/dnsmasq uk.org.thekelleys.SetDomainServers array:string: (code=exited, status=0/SUCCESS)
Process: 69202 ExecStopPost=/usr/bin/rm /etc/dnsmasq.d/node-dnsmasq.conf (code=exited, status=0/SUCCESS)
Process: 69136 ExecStop=/usr/bin/docker stop origin-node (code=exited, status=0/SUCCESS)
Process: 69809 ExecStartPost=/usr/bin/sleep 10 (code=exited, status=0/SUCCESS)
Process: 69804 ExecStartPre=/usr/bin/dbus-send --system --dest=uk.org.thekelleys.dnsmasq /uk/org/thekelleys/dnsmasq uk.org.thekelleys.SetDomainServers array:string:/in-addr.arpa/127.0.0.1,/cluster.local/127.0.0.1 (code=exited, status=0/SUCCESS)
Process: 69801 ExecStartPre=/usr/bin/cp /etc/origin/node/node-dnsmasq.conf /etc/dnsmasq.d/ (code=exited, status=0/SUCCESS)
Process: 69773 ExecStartPre=/usr/bin/docker rm -f origin-node (code=exited, status=1/FAILURE)
Main PID: 69808 (docker)
Tasks: 41
Memory: 18.8M
CGroup: /system.slice/origin-node.service
└─69808 /usr/bin/docker run --name origin-node --rm --privileged --net=host --pid=host --env-file=/etc/sysconfig/origin-node -v /:/rootfs:ro,rslave -e CONFIG_FILE=/etc/origin/node/node-config.yaml -e OPTIONS=--loglevel=2 -e HOST=/rootfs -e HOST_ETC=/host-etc -v /var/lib/origin:/var/lib/origin:rslave -v /etc/origin/node:/etc/origin/node -v /etc/localtime:/etc/localtime:ro -v /etc/machine-id:/etc/machine-id:ro -v /run:/run -v /sys:/sys:rw -v /sys/fs/cgroup:/sys/fs/cgroup:rw -v /usr/bin/docker:/usr/bin/docker:ro -v /var/lib/docker:/var/lib/docker -v /lib/modules:/lib/modules -v /etc/origin/openvswitch:/etc/openvswitch -v /etc/origin/sdn:/etc/openshift-sdn -v /var/lib/cni:/var/lib/cni -v /etc/systemd/system:/host-etc/systemd/system -v /var/log:/var/log -v /dev:/dev -v /etc/pki:/etc/pki:ro openshift/node:v3.9.0
6월 07 02:33:13 node02 origin-node[69808]: I0607 02:33:13.794741 69862 kubelet.go:1286] Image garbage collection succeeded
6월 07 02:38:13 node02 origin-node[69808]: I0607 02:38:13.794012 69862 container_manager_linux.go:426] [ContainerManager]: Discovered runtime cgroups name: /system.slice/docker.service
6월 07 02:43:13 node02 origin-node[69808]: I0607 02:43:13.794367 69862 container_manager_linux.go:426] [ContainerManager]: Discovered runtime cgroups name: /system.slice/docker.service
6월 07 02:48:13 node02 origin-node[69808]: I0607 02:48:13.794702 69862 container_manager_linux.go:426] [ContainerManager]: Discovered runtime cgroups name: /system.slice/docker.service
6월 07 02:53:13 node02 origin-node[69808]: I0607 02:53:13.794955 69862 container_manager_linux.go:426] [ContainerManager]: Discovered runtime cgroups name: /system.slice/docker.service
6월 07 02:58:13 node02 origin-node[69808]: I0607 02:58:13.795198 69862 container_manager_linux.go:426] [ContainerManager]: Discovered runtime cgroups name: /system.slice/docker.service
6월 07 03:03:13 node02 origin-node[69808]: I0607 03:03:13.795521 69862 container_manager_linux.go:426] [ContainerManager]: Discovered runtime cgroups name: /system.slice/docker.service
6월 07 03:08:13 node02 origin-node[69808]: I0607 03:08:13.795751 69862 container_manager_linux.go:426] [ContainerManager]: Discovered runtime cgroups name: /system.slice/docker.service
6월 07 03:13:13 node02 origin-node[69808]: I0607 03:13:13.796042 69862 container_manager_linux.go:426] [ContainerManager]: Discovered runtime cgroups name: /system.slice/docker.service
6월 07 03:18:13 node02 origin-node[69808]: I0607 03:18:13.796350 69862 container_manager_linux.go:426] [ContainerManager]: Discovered runtime cgroups name: /system.slice/docker.service
and ...
# docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
f48ca242b676 fe3e6b0d95b5 "nvidia-device-plugin" About an hour ago Up About an hour k8s_nvidia-device-plugin-ctr_nvidia-device-plugin-daemonset-2jqgk_nvidia_d6560d98-6991-11e8-8dd7-0cc47ad9bf7a_1
868070455e00 openshift/origin-pod:v3.9.0 "/usr/bin/pod" About an hour ago Up About an hour k8s_POD_nvidia-device-plugin-daemonset-2jqgk_nvidia_d6560d98-6991-11e8-8dd7-0cc47ad9bf7a_1
08de767f8aa2 openshift/node:v3.9.0 "/usr/local/bin/orig…" About an hour ago Up About an hour origin-node
32c907f3987a openshift/openvswitch:v3.9.0 "/usr/local/bin/ovs-…" About an hour ago Up About an hour openvswitch
from k8s-device-plugin.
Part of the journalctl logs (weird):
# journalctl -xeu origin-node -l -f
...
6월 07 02:28:06 node01 origin-node[98054]: E0607 02:28:06.668042 98108 container_manager_linux.go:584] [ContainerManager]: Fail to get rootfs information unable to find data for container /
6월 07 02:28:07 node01 origin-node[98054]: E0607 02:28:07.668242 98108 container_manager_linux.go:584] [ContainerManager]: Fail to get rootfs information unable to find data for container /
6월 07 02:28:07 node01 origin-node[98054]: I0607 02:28:07.682324 98108 factory.go:54] Registering systemd factory
6월 07 02:28:07 node01 origin-node[98054]: I0607 02:28:07.683411 98108 factory.go:86] Registering Raw factory
6월 07 02:28:07 node01 origin-node[98054]: I0607 02:28:07.684331 98108 manager.go:1178] Started watching for new ooms in manager
6월 07 02:28:07 node01 origin-node[98054]: I0607 02:28:07.690052 98108 nvidia.go:59] Starting goroutine to initialize NVML
6월 07 02:28:07 node01 origin-node[98054]: I0607 02:28:07.690577 98108 manager.go:329] Starting recovery of all containers
6월 07 02:28:07 node01 origin-node[98054]: I0607 02:28:07.766060 98108 manager.go:334] Recovery completed
6월 07 02:28:10 node01 origin-node[98054]: I0607 02:28:10.668318 98108 kubelet.go:1837] SyncLoop (ADD, "api"): "nvidia-device-plugin-daemonset-82x2d_nvidia(d6572e89-6991-11e8-8dd7-0cc47ad9bf7a)"
6월 07 02:28:10 node01 origin-node[98054]: I0607 02:28:10.676270 98108 kubelet.go:1882] SyncLoop (PLEG): "nvidia-device-plugin-daemonset-82x2d_nvidia(d6572e89-6991-11e8-8dd7-0cc47ad9bf7a)", event: &pleg.PodLifecycleEvent{ID:"d6572e89-6991-11e8-8dd7-0cc47ad9bf7a", Type:"ContainerDied", Data:"e3a158a0c6209a6465af3e279bc91267de6e249f3bb0fd574a9e09702c07a0b3"}
6월 07 02:28:10 node01 origin-node[98054]: W0607 02:28:10.676324 98108 pod_container_deletor.go:77] Container "e3a158a0c6209a6465af3e279bc91267de6e249f3bb0fd574a9e09702c07a0b3" not found in pod's containers
6월 07 02:28:10 node01 origin-node[98054]: I0607 02:28:10.868317 98108 reconciler.go:217] operationExecutor.VerifyControllerAttachedVolume started for volume "nvidia-deviceplugin-token-cv7p5" (UniqueName: "kubernetes.io/secret/d6572e89-6991-11e8-8dd7-0cc47ad9bf7a-nvidia-deviceplugin-token-cv7p5") pod "nvidia-device-plugin-daemonset-82x2d" (UID: "d6572e89-6991-11e8-8dd7-0cc47ad9bf7a")
6월 07 02:28:10 node01 origin-node[98054]: I0607 02:28:10.868357 98108 reconciler.go:217] operationExecutor.VerifyControllerAttachedVolume started for volume "device-plugin" (UniqueName: "kubernetes.io/host-path/d6572e89-6991-11e8-8dd7-0cc47ad9bf7a-device-plugin") pod "nvidia-device-plugin-daemonset-82x2d" (UID: "d6572e89-6991-11e8-8dd7-0cc47ad9bf7a")
6월 07 02:28:10 node01 origin-node[98054]: I0607 02:28:10.868555 98108 reconciler.go:262] operationExecutor.MountVolume started for volume "device-plugin" (UniqueName: "kubernetes.io/host-path/d6572e89-6991-11e8-8dd7-0cc47ad9bf7a-device-plugin") pod "nvidia-device-plugin-daemonset-82x2d" (UID: "d6572e89-6991-11e8-8dd7-0cc47ad9bf7a")
6월 07 02:28:10 node01 origin-node[98054]: I0607 02:28:10.868587 98108 reconciler.go:262] operationExecutor.MountVolume started for volume "nvidia-deviceplugin-token-cv7p5" (UniqueName: "kubernetes.io/secret/d6572e89-6991-11e8-8dd7-0cc47ad9bf7a-nvidia-deviceplugin-token-cv7p5") pod "nvidia-device-plugin-daemonset-82x2d" (UID: "d6572e89-6991-11e8-8dd7-0cc47ad9bf7a")
6월 07 02:28:10 node01 origin-node[98054]: I0607 02:28:10.868677 98108 operation_generator.go:552] MountVolume.SetUp succeeded for volume "device-plugin" (UniqueName: "kubernetes.io/host-path/d6572e89-6991-11e8-8dd7-0cc47ad9bf7a-device-plugin") pod "nvidia-device-plugin-daemonset-82x2d" (UID: "d6572e89-6991-11e8-8dd7-0cc47ad9bf7a")
6월 07 02:28:10 node01 origin-node[98054]: I0607 02:28:10.881478 98108 operation_generator.go:552] MountVolume.SetUp succeeded for volume "nvidia-deviceplugin-token-cv7p5" (UniqueName: "kubernetes.io/secret/d6572e89-6991-11e8-8dd7-0cc47ad9bf7a-nvidia-deviceplugin-token-cv7p5") pod "nvidia-device-plugin-daemonset-82x2d" (UID: "d6572e89-6991-11e8-8dd7-0cc47ad9bf7a")
6월 07 02:28:10 node01 origin-node[98054]: I0607 02:28:10.986472 98108 kuberuntime_manager.go:403] No ready sandbox for pod "nvidia-device-plugin-daemonset-82x2d_nvidia(d6572e89-6991-11e8-8dd7-0cc47ad9bf7a)" can be found. Need to start a new one
6월 07 02:28:11 node01 origin-node[98054]: W0607 02:28:11.503953 98108 container.go:393] Failed to create summary reader for "/libcontainer_98576_systemd_test_default.slice": none of the resources are being tracked.
6월 07 02:28:11 node01 origin-node[98054]: W0607 02:28:11.504124 98108 container.go:393] Failed to create summary reader for "/libcontainer_98588_systemd_test_default.slice": none of the resources are being tracked.
6월 07 02:28:11 node01 origin-node[98054]: I0607 02:28:11.504163 98108 kuberuntime_manager.go:758] checking backoff for container "nvidia-device-plugin-ctr" in pod "nvidia-device-plugin-daemonset-82x2d_nvidia(d6572e89-6991-11e8-8dd7-0cc47ad9bf7a)"
6월 07 02:28:11 node01 origin-node[98054]: W0607 02:28:11.504327 98108 container.go:393] Failed to create summary reader for "/libcontainer_98598_systemd_test_default.slice": none of the resources are being tracked.
6월 07 02:28:11 node01 origin-node[98054]: W0607 02:28:11.669403 98108 container.go:507] Failed to update stats for container "%s": %s/libcontainer_98615_systemd_test_default.slicefailed to parse memory.memsw.max_usage_in_bytes - read /sys/fs/cgroup/memory/libcontainer_98615_systemd_test_default.slice/memory.memsw.max_usage_in_bytes: no such device, continuing to push stats
6월 07 02:28:13 node01 origin-node[98054]: W0607 02:28:13.423573 98108 container.go:393] Failed to create summary reader for "/libcontainer_98647_systemd_test_default.slice": none of the resources are being tracked.
6월 07 02:28:13 node01 origin-node[98054]: W0607 02:28:13.423822 98108 container.go:393] Failed to create summary reader for "/libcontainer_98653_systemd_test_default.slice": none of the resources are being tracked.
6월 07 02:28:13 node01 origin-node[98054]: W0607 02:28:13.424033 98108 container.go:393] Failed to create summary reader for "/libcontainer_98659_systemd_test_default.slice": none of the resources are being tracked.
6월 07 02:28:13 node01 origin-node[98054]: I0607 02:28:13.424722 98108 kubelet.go:1882] SyncLoop (PLEG): "nvidia-device-plugin-daemonset-82x2d_nvidia(d6572e89-6991-11e8-8dd7-0cc47ad9bf7a)", event: &pleg.PodLifecycleEvent{ID:"d6572e89-6991-11e8-8dd7-0cc47ad9bf7a", Type:"ContainerStarted", Data:"a0a0585147cc56b31bb3383575872b56dc0d0f0c8a43248786db0d6de8c529d7"}
6월 07 02:28:14 node01 origin-node[98054]: I0607 02:28:14.434689 98108 kubelet.go:1882] SyncLoop (PLEG): "nvidia-device-plugin-daemonset-82x2d_nvidia(d6572e89-6991-11e8-8dd7-0cc47ad9bf7a)", event: &pleg.PodLifecycleEvent{ID:"d6572e89-6991-11e8-8dd7-0cc47ad9bf7a", Type:"ContainerStarted", Data:"efbee1f979dd709651c135adb0c00d0432a405b9b443546a60ef4780c6c3ab58"}
6월 07 02:28:14 node01 systemd[1]: Started origin-node.service.
-- Subject: Unit origin-node.service has finished start-up
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit origin-node.service has finished starting up.
--
-- The start-up result is done.
6월 07 02:28:15 node01 origin-node[98054]: W0607 02:28:15.812939 98108 container.go:393] Failed to create summary reader for "/libcontainer_98778_systemd_test_default.slice": none of the resources are being tracked.
6월 07 02:28:15 node01 origin-node[98054]: W0607 02:28:15.813144 98108 container.go:393] Failed to create summary reader for "/libcontainer_98789_systemd_test_default.slice": none of the resources are being tracked.
6월 07 02:28:15 node01 origin-node[98054]: W0607 02:28:15.813321 98108 container.go:393] Failed to create summary reader for "/libcontainer_98795_systemd_test_default.slice": none of the resources are being tracked.
6월 07 02:28:15 node01 origin-node[98054]: I0607 02:28:15.836199 98108 kubelet_node_status.go:454] Recording NodeReady event message for node node01
6월 07 02:28:17 node01 origin-node[98054]: E0607 02:28:17.883854 98108 cadvisor_stats_provider.go:355] Partial failure issuing cadvisor.ContainerInfoV2: partial failures: ["/libcontainer_98615_systemd_test_default.slice": RecentStats: unable to find data for container /libcontainer_98615_systemd_test_default.slice]
6월 07 02:28:27 node01 origin-node[98054]: E0607 02:28:27.897193 98108 cadvisor_stats_provider.go:355] Partial failure issuing cadvisor.ContainerInfoV2: partial failures: ["/libcontainer_98615_systemd_test_default.slice": RecentStats: unable to find data for container /libcontainer_98615_systemd_test_default.slice]
6월 07 02:33:05 node01 origin-node[98054]: I0607 02:33:05.668099 98108 container_manager_linux.go:426] [ContainerManager]: Discovered runtime cgroups name: /system.slice/docker.service
6월 07 02:33:05 node01 origin-node[98054]: I0607 02:33:05.669667 98108 kubelet.go:1286] Image garbage collection succeeded
6월 07 02:38:05 node01 origin-node[98054]: I0607 02:38:05.668345 98108 container_manager_linux.go:426] [ContainerManager]: Discovered runtime cgroups name: /system.slice/docker.service
6월 07 02:43:05 node01 origin-node[98054]: I0607 02:43:05.668652 98108 container_manager_linux.go:426] [ContainerManager]: Discovered runtime cgroups name: /system.slice/docker.service
6월 07 02:48:05 node01 origin-node[98054]: I0607 02:48:05.668921 98108 container_manager_linux.go:426] [ContainerManager]: Discovered runtime cgroups name: /system.slice/docker.service
6월 07 02:53:05 node01 origin-node[98054]: I0607 02:53:05.669240 98108 container_manager_linux.go:426] [ContainerManager]: Discovered runtime cgroups name: /system.slice/docker.service
6월 07 02:58:05 node01 origin-node[98054]: I0607 02:58:05.669575 98108 container_manager_linux.go:426] [ContainerManager]: Discovered runtime cgroups name: /system.slice/docker.service
6월 07 03:03:05 node01 origin-node[98054]: I0607 03:03:05.669862 98108 container_manager_linux.go:426] [ContainerManager]: Discovered runtime cgroups name: /system.slice/docker.service
6월 07 03:08:05 node01 origin-node[98054]: I0607 03:08:05.670156 98108 container_manager_linux.go:426] [ContainerManager]: Discovered runtime cgroups name: /system.slice/docker.service
6월 07 03:13:05 node01 origin-node[98054]: I0607 03:13:05.670453 98108 container_manager_linux.go:426] [ContainerManager]: Discovered runtime cgroups name: /system.slice/docker.service
6월 07 03:18:05 node01 origin-node[98054]: I0607 03:18:05.670754 98108 container_manager_linux.go:426] [ContainerManager]: Discovered runtime cgroups name: /system.slice/docker.service
6월 07 03:23:05 node01 origin-node[98054]: I0607 03:23:05.671054 98108 container_manager_linux.go:426] [ContainerManager]: Discovered runtime cgroups name: /system.slice/docker.service
6월 07 03:28:05 node01 origin-node[98054]: I0607 03:28:05.671360 98108 container_manager_linux.go:426] [ContainerManager]: Discovered runtime cgroups name: /system.slice/docker.service
6월 07 03:33:05 node01 origin-node[98054]: I0607 03:33:05.671768 98108 container_manager_linux.go:426] [ContainerManager]: Discovered runtime cgroups name: /system.slice/docker.service
6월 07 03:38:05 node01 origin-node[98054]: I0607 03:38:05.672065 98108 container_manager_linux.go:426] [ContainerManager]: Discovered runtime cgroups name: /system.slice/docker.service
6월 07 03:43:05 node01 origin-node[98054]: I0607 03:43:05.672355 98108 container_manager_linux.go:426] [ContainerManager]: Discovered runtime cgroups name: /system.slice/docker.service
6월 07 03:48:05 node01 origin-node[98054]: I0607 03:48:05.672692 98108 container_manager_linux.go:426] [ContainerManager]: Discovered runtime cgroups name: /system.slice/docker.service
6월 07 03:53:05 node01 origin-node[98054]: I0607 03:53:05.673016 98108 container_manager_linux.go:426] [ContainerManager]: Discovered runtime cgroups name: /system.slice/docker.service
6월 07 03:58:05 node01 origin-node[98054]: I0607 03:58:05.673287 98108 container_manager_linux.go:426] [ContainerManager]: Discovered runtime cgroups name: /system.slice/docker.service
6월 07 04:03:05 node01 origin-node[98054]: I0607 04:03:05.673524 98108 container_manager_linux.go:426] [ContainerManager]: Discovered runtime cgroups name: /system.slice/docker.service
6월 07 04:08:05 node01 origin-node[98054]: I0607 04:08:05.673932 98108 container_manager_linux.go:426] [ContainerManager]: Discovered runtime cgroups name: /system.slice/docker.service
6월 07 04:13:05 node01 origin-node[98054]: I0607 04:13:05.674171 98108 container_manager_linux.go:426] [ContainerManager]: Discovered runtime cgroups name: /system.slice/docker.service
6월 07 04:18:05 node01 origin-node[98054]: I0607 04:18:05.674454 98108 container_manager_linux.go:426] [ContainerManager]: Discovered runtime cgroups name: /system.slice/docker.service
6월 07 04:23:05 node01 origin-node[98054]: I0607 04:23:05.674692 98108 container_manager_linux.go:426] [ContainerManager]: Discovered runtime cgroups name: /system.slice/docker.service
6월 07 04:28:05 node01 origin-node[98054]: I0607 04:28:05.674955 98108 container_manager_linux.go:426] [ContainerManager]: Discovered runtime cgroups name: /system.slice/docker.service
6월 07 04:33:05 node01 origin-node[98054]: I0607 04:33:05.675251 98108 container_manager_linux.go:426] [ContainerManager]: Discovered runtime cgroups name: /system.slice/docker.service
6월 07 04:38:05 node01 origin-node[98054]: I0607 04:38:05.675499 98108 container_manager_linux.go:426] [ContainerManager]: Discovered runtime cgroups name: /system.slice/docker.service
6월 07 04:43:05 node01 origin-node[98054]: I0607 04:43:05.675761 98108 container_manager_linux.go:426] [ContainerManager]: Discovered runtime cgroups name: /system.slice/docker.service
6월 07 04:48:05 node01 origin-node[98054]: I0607 04:48:05.676065 98108 container_manager_linux.go:426] [ContainerManager]: Discovered runtime cgroups name: /system.slice/docker.service
6월 07 04:53:05 node01 origin-node[98054]: I0607 04:53:05.676349 98108 container_manager_linux.go:426] [ContainerManager]: Discovered runtime cgroups name: /system.slice/docker.service
6월 07 04:58:05 node01 origin-node[98054]: I0607 04:58:05.676612 98108 container_manager_linux.go:426] [ContainerManager]: Discovered runtime cgroups name: /system.slice/docker.service
6월 07 05:03:05 node01 origin-node[98054]: I0607 05:03:05.676877 98108 container_manager_linux.go:426] [ContainerManager]: Discovered runtime cgroups name: /system.slice/docker.service
6월 07 05:08:05 node01 origin-node[98054]: I0607 05:08:05.677169 98108 container_manager_linux.go:426] [ContainerManager]: Discovered runtime cgroups name: /system.slice/docker.service
6월 07 05:13:05 node01 origin-node[98054]: I0607 05:13:05.677528 98108 container_manager_linux.go:426] [ContainerManager]: Discovered runtime cgroups name: /system.slice/docker.service
6월 07 05:18:05 node01 origin-node[98054]: I0607 05:18:05.677807 98108 container_manager_linux.go:426] [ContainerManager]: Discovered runtime cgroups name: /system.slice/docker.service
6월 07 05:23:05 node01 origin-node[98054]: I0607 05:23:05.678061 98108 container_manager_linux.go:426] [ContainerManager]: Discovered runtime cgroups name: /system.slice/docker.service
6월 07 05:28:05 node01 origin-node[98054]: I0607 05:28:05.678359 98108 container_manager_linux.go:426] [ContainerManager]: Discovered runtime cgroups name: /system.slice/docker.service
6월 07 05:33:05 node01 origin-node[98054]: I0607 05:33:05.678751 98108 container_manager_linux.go:426] [ContainerManager]: Discovered runtime cgroups name: /system.slice/docker.service
6월 07 05:38:05 node01 origin-node[98054]: I0607 05:38:05.679128 98108 container_manager_linux.go:426] [ContainerManager]: Discovered runtime cgroups name: /system.slice/docker.service
6월 07 05:43:05 node01 origin-node[98054]: I0607 05:43:05.679393 98108 container_manager_linux.go:426] [ContainerManager]: Discovered runtime cgroups name: /system.slice/docker.service
6월 07 05:48:05 node01 origin-node[98054]: I0607 05:48:05.679743 98108 container_manager_linux.go:426] [ContainerManager]: Discovered runtime cgroups name: /system.slice/docker.service
6월 07 05:53:05 node01 origin-node[98054]: I0607 05:53:05.680150 98108 container_manager_linux.go:426] [ContainerManager]: Discovered runtime cgroups name: /system.slice/docker.service
6월 07 05:58:05 node01 origin-node[98054]: I0607 05:58:05.680504 98108 container_manager_linux.go:426] [ContainerManager]: Discovered runtime cgroups name: /system.slice/docker.service
6월 07 06:03:05 node01 origin-node[98054]: I0607 06:03:05.680793 98108 container_manager_linux.go:426] [ContainerManager]: Discovered runtime cgroups name: /system.slice/docker.service
6월 07 06:08:05 node01 origin-node[98054]: I0607 06:08:05.681052 98108 container_manager_linux.go:426] [ContainerManager]: Discovered runtime cgroups name: /system.slice/docker.service
6월 07 06:13:05 node01 origin-node[98054]: I0607 06:13:05.681377 98108 container_manager_linux.go:426] [ContainerManager]: Discovered runtime cgroups name: /system.slice/docker.service
6월 07 06:18:05 node01 origin-node[98054]: I0607 06:18:05.681708 98108 container_manager_linux.go:426] [ContainerManager]: Discovered runtime cgroups name: /system.slice/docker.service
6월 07 06:23:05 node01 origin-node[98054]: I0607 06:23:05.682018 98108 container_manager_linux.go:426] [ContainerManager]: Discovered runtime cgroups name: /system.slice/docker.service
6월 07 06:27:05 node01 origin-node[98054]: E0607 06:27:05.977260 98108 logs.go:351] Failed with err write tcp 192.168.5.101:10250->192.168.5.100:32938: i/o timeout when writing log for log file "/var/log/pods/d6572e89-6991-11e8-8dd7-0cc47ad9bf7a/nvidia-device-plugin-ctr_1.log": &{timestamp:{wall:974901445 ext:63663917225 loc:<nil>} stream:stderr log:[50 48 49 56 47 48 54 47 48 54 32 50 49 58 50 55 58 48 53 32 89 111 117 32 99 97 110 32 99 104 101 99 107 32 116 104 101 32 112 114 101 114 101 113 117 105 115 105 116 101 115 32 97 116 58 32 104 116 116 112 115 58 47 47 103 105 116 104 117 98 46 99 111 109 47 78 86 73 68 73 65 47 107 56 115 45 100 101 118 105 99 101 45 112 108 117 103 105 110 35 112 114 101 114 101 113 117 105 115 105 116 101 115 10]}
6월 07 06:27:06 node01 origin-node[98054]: I0607 06:27:05.977499 98108 logs.go:41] http: multiple response.WriteHeader calls
6월 07 06:28:05 node01 origin-node[98054]: I0607 06:28:05.682304 98108 container_manager_linux.go:426] [ContainerManager]: Discovered runtime cgroups name: /system.slice/docker.service
6월 07 06:33:05 node01 origin-node[98054]: I0607 06:33:05.682593 98108 container_manager_linux.go:426] [ContainerManager]: Discovered runtime cgroups name: /system.slice/docker.service
...
from k8s-device-plugin.
Can you upload the kubelet log file?
from k8s-device-plugin.
@RenaudWasTaken
Attached the log file.
origin-node-journalctl.log
from k8s-device-plugin.
Any update on this?
from k8s-device-plugin.
With this command:
# docker run -it -v /var/lib/kubelet/device-plugins:/var/lib/kubelet/device-plugins nvidia/k8s-device-plugin:1.9
# ls -l /var/lib/kubelet/device-plugins/
srwxr-xr-x. 1 root root 0 6월 12 10:22 nvidia.sock
I have tested almost all of the current plugin releases, but got the same result ('context deadline exceeded...'):
- 1.8
- 1.9, 1.9-centos7
- 1.10, 1.10-centos7
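As a side note, the nvidia.sock entry above is a Unix domain socket: the leading `s` in the ls mode string marks that file type. A minimal sketch reproducing such an entry, using a hypothetical /tmp path and assuming python3 is available:

```shell
# Create a throwaway Unix domain socket and inspect it; the mode string
# printed by ls starts with 's' for sockets, just like nvidia.sock above.
rm -f /tmp/demo.sock
python3 -c "import socket; socket.socket(socket.AF_UNIX).bind('/tmp/demo.sock')"
ls -l /tmp/demo.sock
rm -f /tmp/demo.sock
```

The device plugin serving on its socket is only half of the handshake; registration also needs the kubelet's own kubelet.sock in the same directory, which is why the bind mount matters.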
from k8s-device-plugin.
What about ls -lZ /var/lib/kubelet/device-plugins/
?
from k8s-device-plugin.
# ls -lZ /var/lib/kubelet/device-plugins/
srwxr-xr-x. root root system_u:object_r:container_var_lib_t:s0 nvidia.sock
from k8s-device-plugin.
And as for /dev/nvidia* (though I wonder whether it matters):
# ls -alZ /dev/nvidia*
crw-rw-rw-. root root system_u:object_r:device_t:s0 /dev/nvidia-modeset
crw-rw-rw-. root root system_u:object_r:device_t:s0 /dev/nvidia-uvm
crw-rw-rw-. root root system_u:object_r:device_t:s0 /dev/nvidia-uvm-tools
crw-rw-rw-. root root unconfined_u:object_r:xserver_misc_device_t:s0 /dev/nvidia0
crw-rw-rw-. root root unconfined_u:object_r:xserver_misc_device_t:s0 /dev/nvidia1
crw-rw-rw-. root root unconfined_u:object_r:xserver_misc_device_t:s0 /dev/nvidiactl
from k8s-device-plugin.
If you temporarily disable SELinux, does the device plugin work?
from k8s-device-plugin.
@flx42
I've run the following on the GPU node:
# setenforce 0
# systemctl restart docker
and
# systemctl restart origin-node
but nothing changes.
from k8s-device-plugin.
This seems really weird. The logs you provided say that the kubelet is exposing a socket at /var/lib/kubelet/device-plugins/kubelet.sock,
but from the output you provided, the socket isn't in that directory...
- Can you provide me with the version of kubelet on that node?
- Can you make sure kubelet has write access to that directory?
- If you remove that directory and restart kubelet can you tell me if the socket is there afterwards?
I'm not sure what else might cause this issue.
from k8s-device-plugin.
@RenaudWasTaken
First, version information for node01 and node02:
# oc describe node node01 | grep -i vers
Kernel Version: 3.10.0-862.3.2.el7.x86_64
Container Runtime Version: docker://18.3.1
Kubelet Version: v1.9.1+a0ce1bc657
Kube-Proxy Version: v1.9.1+a0ce1bc657
# oc describe node node02 | grep -i vers
Kernel Version: 3.10.0-862.3.2.el7.x86_64
Container Runtime Version: docker://18.3.1
Kubelet Version: v1.9.1+a0ce1bc657
Kube-Proxy Version: v1.9.1+a0ce1bc657
from k8s-device-plugin.
@RenaudWasTaken
Second, could you please show me how to tell whether kubelet has write access to /var/lib/kubelet/device-plugins/ in an OpenShift Origin cluster?
Also, as far as I can see, only nvidia.sock is in the /var/lib/kubelet/device-plugins/ directory, as below; there is no kubelet.sock file there.
# ls -alZ /var/lib/kubelet/device-plugins/
drwxr-xr-x. root root system_u:object_r:container_var_lib_t:s0 .
drwxr-xr-x. root root system_u:object_r:container_var_lib_t:s0 ..
srwxr-xr-x. root root system_u:object_r:container_var_lib_t:s0 nvidia.sock
@RenaudWasTaken
Third, I've done the following steps.
[root@node02 ~]# ls -alZ /var/lib/kubelet/device-plugins/
drwxr-xr-x. root root system_u:object_r:container_var_lib_t:s0 .
drwxr-xr-x. root root system_u:object_r:container_var_lib_t:s0 ..
srwxr-xr-x. root root system_u:object_r:container_var_lib_t:s0 nvidia.sock
[root@node02 ~]# rm -rf /var/lib/kubelet/device-plugins
[root@node02 ~]# systemctl restart origin-node
[root@node02 ~]# ls -alZ /var/lib/kubelet/device-plugins/
ls: cannot access /var/lib/kubelet/device-plugins/: No such file or directory
[root@node02 ~]# setenforce 0
[root@node02 ~]# systemctl restart docker
[root@node02 ~]# systemctl restart origin-node
[root@node02 ~]# ls -alZ /var/lib/kubelet/device-plugins/
drwxr-xr-x. root root system_u:object_r:container_var_lib_t:s0 .
drwxr-xr-x. root root system_u:object_r:container_var_lib_t:s0 ..
srwxr-xr-x. root root system_u:object_r:container_var_lib_t:s0 nvidia.sock
[root@node02 ~]# systemctl status origin-node -l
● origin-node.service
Loaded: loaded (/etc/systemd/system/origin-node.service; enabled; vendor preset: disabled)
Active: active (running) since Thu 2018-06-14 12:43:27 KST; 1min 1s ago
Process: 60519 ExecStopPost=/usr/bin/dbus-send --system --dest=uk.org.thekelleys.dnsmasq /uk/org/thekelleys/dnsmasq uk.org.thekelleys.SetDomainServers array:string: (code=exited, status=0/SUCCESS)
Process: 60515 ExecStopPost=/usr/bin/rm /etc/dnsmasq.d/node-dnsmasq.conf (code=exited, status=0/SUCCESS)
Process: 59089 ExecStop=/usr/bin/docker stop origin-node (code=exited, status=0/SUCCESS)
Process: 60557 ExecStartPost=/usr/bin/sleep 10 (code=exited, status=0/SUCCESS)
Process: 60553 ExecStartPre=/usr/bin/dbus-send --system --dest=uk.org.thekelleys.dnsmasq /uk/org/thekelleys/dnsmasq uk.org.thekelleys.SetDomainServers array:string:/in-addr.arpa/127.0.0.1,/cluster.local/127.0.0.1 (code=exited, status=0/SUCCESS)
Process: 60550 ExecStartPre=/usr/bin/cp /etc/origin/node/node-dnsmasq.conf /etc/dnsmasq.d/ (code=exited, status=0/SUCCESS)
Process: 60522 ExecStartPre=/usr/bin/docker rm -f origin-node (code=exited, status=1/FAILURE)
Main PID: 60556 (docker)
Tasks: 26
Memory: 15.0M
CGroup: /system.slice/origin-node.service
└─60556 /usr/bin/docker run --name origin-node --rm --privileged --net=host --pid=host --env-file=/etc/sysconfig/origin-node -v /:/rootfs:ro,rslave -e CONFIG_FILE=/etc/origin/node/node-config.yaml -e OPTIONS=--loglevel=2 -e HOST=/rootfs -e HOST_ETC=/host-etc -v /var/lib/origin:/var/lib/origin:rslave -v /etc/origin/node:/etc/origin/node -v /etc/localtime:/etc/localtime:ro -v /etc/machine-id:/etc/machine-id:ro -v /run:/run -v /sys:/sys:rw -v /sys/fs/cgroup:/sys/fs/cgroup:rw -v /usr/bin/docker:/usr/bin/docker:ro -v /var/lib/docker:/var/lib/docker -v /lib/modules:/lib/modules -v /etc/origin/openvswitch:/etc/openvswitch -v /etc/origin/sdn:/etc/openshift-sdn -v /var/lib/cni:/var/lib/cni -v /etc/systemd/system:/host-etc/systemd/system -v /var/log:/var/log -v /dev:/dev -v /etc/pki:/etc/pki:ro openshift/node:v3.9.0
Jun 14 12:43:25 node02 origin-node[60556]: W0614 12:43:25.366200 60604 container.go:393] Failed to create summary reader for "/libcontainer_61272_systemd_test_default.slice": none of the resources are being tracked.
Jun 14 12:43:25 node02 origin-node[60556]: W0614 12:43:25.366466 60604 container.go:393] Failed to create summary reader for "/libcontainer_61278_systemd_test_default.slice": none of the resources are being tracked.
Jun 14 12:43:25 node02 origin-node[60556]: I0614 12:43:25.835890 60604 kubelet.go:1882] SyncLoop (PLEG): "nvidia-device-plugin-daemonset-q9vls_nvidia(a33deba4-6f7c-11e8-8dd7-0cc47ad9bf7a)", event: &pleg.PodLifecycleEvent{ID:"a33deba4-6f7c-11e8-8dd7-0cc47ad9bf7a", Type:"ContainerStarted", Data:"004a127291e8abcd5672fddc3d43672356ee6dea486f9cf2f7cd0928275ae4bc"}
Jun 14 12:43:25 node02 origin-node[60556]: I0614 12:43:25.836112 60604 kubelet.go:1882] SyncLoop (PLEG): "nvidia-device-plugin-daemonset-q9vls_nvidia(a33deba4-6f7c-11e8-8dd7-0cc47ad9bf7a)", event: &pleg.PodLifecycleEvent{ID:"a33deba4-6f7c-11e8-8dd7-0cc47ad9bf7a", Type:"ContainerStarted", Data:"d044082bfb1af7164f94195d58645e791a2a0aa306fba684f817796895be7eb4"}
Jun 14 12:43:25 node02 origin-node[60556]: I0614 12:43:25.848582 60604 kubelet.go:1882] SyncLoop (PLEG): "gitlab-ce-redis-1-fv9js_gitlab(5fe9e42b-6ae6-11e8-8dd7-0cc47ad9bf7a)", event: &pleg.PodLifecycleEvent{ID:"5fe9e42b-6ae6-11e8-8dd7-0cc47ad9bf7a", Type:"ContainerStarted", Data:"496980a9dfce3072eb2b88e463df676c0c2abe7ccd8fde097a63a4e57dac0c19"}
Jun 14 12:43:25 node02 origin-node[60556]: I0614 12:43:25.848618 60604 kubelet.go:1882] SyncLoop (PLEG): "gitlab-ce-redis-1-fv9js_gitlab(5fe9e42b-6ae6-11e8-8dd7-0cc47ad9bf7a)", event: &pleg.PodLifecycleEvent{ID:"5fe9e42b-6ae6-11e8-8dd7-0cc47ad9bf7a", Type:"ContainerStarted", Data:"6c62786ea442bb6ed75fec8d37daed3ce1fa23af3d312758600bc1d0a0560cdb"}
Jun 14 12:43:25 node02 origin-node[60556]: I0614 12:43:25.898109 60604 roundrobin.go:310] LoadBalancerRR: Setting endpoints for gitlab/gitlab-ce-redis:6379-redis to [10.129.4.20:6379]
Jun 14 12:43:25 node02 origin-node[60556]: I0614 12:43:25.898135 60604 roundrobin.go:240] Delete endpoint 10.129.4.20:6379 for service "gitlab/gitlab-ce-redis:6379-redis"
Jun 14 12:43:27 node02 systemd[1]: Started origin-node.service.
Jun 14 12:43:28 node02 origin-node[60556]: I0614 12:43:28.615763 60604 kubelet_node_status.go:454] Recording NodeReady event message for node node02
and on OpenShift Origin master,
# oc get nodes -o wide
NAME STATUS ROLES AGE VERSION EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
master Ready,SchedulingDisabled master 18d v1.9.1+a0ce1bc657 <none> CentOS Linux 7 (Core) 3.10.0-862.3.2.el7.x86_64 docker://1.13.1
node01 Ready <none> 7d v1.9.1+a0ce1bc657 <none> CentOS Linux 7 (Core) 3.10.0-862.3.2.el7.x86_64 docker://18.3.1
node02 Ready <none> 7d v1.9.1+a0ce1bc657 <none> CentOS Linux 7 (Core) 3.10.0-862.3.2.el7.x86_64 docker://18.3.1
node03 Ready <none> 18d v1.9.1+a0ce1bc657 <none> CentOS Linux 7 (Core) 3.10.0-862.3.2.el7.x86_64 docker://1.13.1
node04 Ready <none> 18d v1.9.1+a0ce1bc657 <none> CentOS Linux 7 (Core) 3.10.0-862.3.2.el7.x86_64 docker://1.13.1
node05 Ready <none> 18d v1.9.1+a0ce1bc657 <none> CentOS Linux 7 (Core) 3.10.0-862.3.2.el7.x86_64 docker://1.13.1
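For what it's worth, the `docker run` line in the CGroup section of the `systemctl status` output above bind-mounts `/dev`, `/var/lib/origin`, `/var/lib/docker`, and so on, but not `/var/lib/kubelet/device-plugins`; if kubelet runs inside that container, a socket it creates there would never appear on the host. A hedged sketch for checking that: the helper name `mounts_path` is made up for illustration, and the `docker inspect` format string in the usage comment is an assumption about the container layout.

```shell
#!/bin/sh
# Report whether a string of bind mounts (or a full docker run command
# line) mentions a given host path.
mounts_path() {
    case "$1" in
        *"$2"*) echo "mounted" ;;
        *)      echo "not mounted" ;;
    esac
}

# Live usage on the node (assumes the container is named origin-node):
#   mounts_path "$(docker inspect --format \
#       '{{range .HostConfig.Binds}}{{.}} {{end}}' origin-node)" \
#       /var/lib/kubelet/device-plugins
```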
FYI, in case it helps, I've attached origin-node's log file from node02, captured with --loglevel=2:
# docker inspect --format='{{.LogPath}}' origin-node
# cp /var/lib/docker/containers/2e7f9d352193de3bb68f260cccfd5c292cd04e8cf455243ce08a0c2974ba0270/2e7f9d352193de3bb68f260cccfd5c292cd04e8cf455243ce08a0c2974ba0270-json.log ./origin-node.log
Attached is origin-node's log file from node01, captured with --loglevel=5:
# cat `docker inspect --format='{{.LogPath}}' origin-node` > origin-node-json-node01.log
origin-node-json-node01.log.tar.gz
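Logs captured this way are large, so it can help to extract just the device-plugin-related entries before attaching them. A hedged sketch: the helper name `grep_plugin_lines` and the grep patterns are made up for illustration.

```shell
#!/bin/sh
# Pull device-plugin related lines out of a docker json-file log
# (the logging driver used by origin-node above).
grep_plugin_lines() {
    grep -i -e 'device.plugin' -e 'kubelet\.sock' "$1" || true
}

# Live usage, matching the inspect command above:
#   grep_plugin_lines "$(docker inspect --format='{{.LogPath}}' origin-node)"
```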
@RenaudWasTaken
I'm away from the machines right now; I'll set this up and test it once I'm back,
and post the result right after. Thanks!
Thanks to @RenaudWasTaken @flx42