K3d GPU Support

This repository builds a k3s + nvidia/cuda base image that gives a K3d cluster access to your host machine's CUDA-capable NVIDIA GPU(s).

Pre-Requisites

Access to GitHub and GitHub Container Registry. Please follow the GitHub Container Registry instructions.

Docker and all of its dependencies must be installed.

For the container GPU test, an NVIDIA GPU with CUDA cores and drivers must be present. Additionally, the CUDA toolkit and the NVIDIA container toolkit must be installed.
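
The host-side tools listed above can be checked before building anything. The following is a minimal sketch, assuming the tools are exposed on PATH under their usual command names (docker, nvidia-smi, nvcc, nvidia-ctk). The helper names check_tool and check_gpu_prereqs are hypothetical and not part of this repository.

```shell
#!/bin/sh
# Hypothetical helper (not part of this repository): report whether a
# command is available on PATH. Returns non-zero when it is missing.
check_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "found: $1"
  else
    echo "missing: $1"
    return 1
  fi
}

# Assumed command names for Docker, the NVIDIA driver utilities,
# the CUDA toolkit, and the NVIDIA container toolkit.
check_gpu_prereqs() {
  status=0
  for tool in docker nvidia-smi nvcc nvidia-ctk; do
    check_tool "$tool" || status=1
  done
  return "$status"
}
```

Calling check_gpu_prereqs prints one line per tool and returns non-zero if any of them is missing. It only checks that the commands exist, not that the driver or container runtime is configured correctly.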

For Kubernetes testing and its pre-requisites, see Kubernetes Deployment.

Usage

Building and Pushing the Image

Check out the Make targets for the various options.

Kubernetes Deployment

Follow the instructions in the zarf-package-k3d-airgap repository for bootstrapping a K3d cluster that can access your NVIDIA GPUs.

You can also use a more abstracted version of the above Kubernetes deployment by following the instructions in the uds-leapfrogai bundle repository.

Test

Run:

kubectl apply -f test/cuda-vector-add.yaml
kubectl logs cuda-vector-add
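
Requesting logs immediately after apply can race with the pod starting. The sketch below waits for the pod to complete before reading its logs. It assumes kubectl 1.23 or newer (for the jsonpath form of kubectl wait) and assumes the sample prints a "Test PASSED" line on success; check the actual log output if that marker differs. The function names logs_report_pass and run_gpu_test are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: succeeds when the log text on stdin contains the
# success marker. "Test PASSED" is an assumption about the sample's output.
logs_report_pass() {
  grep -q "Test PASSED"
}

# Hypothetical wrapper around the two commands above.
run_gpu_test() {
  kubectl apply -f test/cuda-vector-add.yaml || return 1
  # Block until the pod has run to completion (kubectl 1.23 or newer).
  kubectl wait --for=jsonpath='{.status.phase}'=Succeeded \
    pod/cuda-vector-add --timeout=120s || return 1
  kubectl logs cuda-vector-add | logs_report_pass
}
```

Run run_gpu_test from a shell whose kubeconfig points at the K3d cluster. A non-zero exit status means the manifest could not be applied, the pod did not reach the Succeeded phase within the timeout, or the marker was not found in the logs.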

References

https://k3d.io/v5.7.2/usage/advanced/cuda/

Issues under WSL2

Issues

Thank you very much for your work. I have attempted to run K3S and CUDA workloads under WSL2. I tested using this issue and the files provided by your repository, and it is almost working.

The current issue is that nvidia-smi runs successfully inside the nvidia device plugin pod, but the plugin's logs indicate that no GPU devices are found.

I suspect it may be due to the unavailability of NVCC. Do you have any ideas?

System: Windows 11 23H2
Runtime: Docker Desktop 4.28.0 (139021)

nvidia-device-plugin log

$ kubectl logs nvidia-device-plugin-daemonset-vvpkz -n kube-system
I0414 10:00:33.522494       1 main.go:154] Starting FS watcher.
I0414 10:00:33.522555       1 main.go:161] Starting OS watcher.
I0414 10:00:33.522912       1 main.go:176] Starting Plugins.
I0414 10:00:33.522931       1 main.go:234] Loading configuration.
I0414 10:00:33.522979       1 main.go:242] Updating config with default resource matching patterns.
I0414 10:00:33.523113       1 main.go:253]
Running with config:
{
  "version": "v1",
  "flags": {
    "migStrategy": "none",
    "failOnInitError": true,
    "nvidiaDriverRoot": "/",
    "gdsEnabled": false,
    "mofedEnabled": false,
    "plugin": {
      "passDeviceSpecs": true,
      "deviceListStrategy": [
        "envvar"
      ],
      "deviceIDStrategy": "uuid",
      "cdiAnnotationPrefix": "cdi.k8s.io/",
      "nvidiaCTKPath": "/usr/bin/nvidia-ctk",
      "containerDriverRoot": "/driver-root"
    }
  },
  "resources": {
    "gpus": [
      {
        "pattern": "*",
        "name": "nvidia.com/gpu"
      }
    ]
  },
  "sharing": {
    "timeSlicing": {}
  }
}
I0414 10:00:33.523131       1 main.go:256] Retreiving plugins.
I0414 10:00:33.524465       1 factory.go:107] Detected NVML platform: found NVML library
I0414 10:00:33.524495       1 factory.go:107] Detected non-Tegra platform: /sys/devices/soc0/family file not found
I0414 10:00:33.541706       1 main.go:287] No devices found. Waiting indefinitely.
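
The last log line above is the one that matters for diagnosis. As a rough sketch, the check can be scripted by scanning the plugin logs for that message. The function names plugin_found_devices and check_plugin_pod are hypothetical, and the pod name must be supplied by the caller.

```shell
#!/bin/sh
# Hypothetical helper: read device plugin logs on stdin and fail when the
# plugin reports that it found no GPUs.
plugin_found_devices() {
  ! grep -q "No devices found"
}

# Hypothetical wrapper: $1 is the device plugin pod name in kube-system.
check_plugin_pod() {
  kubectl logs "$1" -n kube-system | plugin_found_devices
}
```

For the pod in this report, check_plugin_pod nvidia-device-plugin-daemonset-vvpkz would return non-zero, because the log contains the "No devices found" message.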

Running nvidia-smi / nvcc inside the nvidia-device-plugin pod

root@nvidia-device-plugin-daemonset-t68w2:/# nvidia-smi
Sun Apr 14 10:25:52 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.73.01              Driver Version: 552.12         CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 2080 Ti     On  |   00000000:01:00.0 Off |                  N/A |
| 31%   28C    P8             16W /  250W |    1515MiB /  22528MiB |      5%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
root@nvidia-device-plugin-daemonset-t68w2:/#
root@nvidia-device-plugin-daemonset-t68w2:/# nvcc -V
bash: nvcc: command not found
root@nvidia-device-plugin-daemonset-t68w2:/#

kubectl describe output for the cuda-vector-add pod

$ kubectl describe pod cuda-vector-add
Name:             cuda-vector-add
Namespace:        default
Priority:         0
Service Account:  default
Node:             <none>
Labels:           <none>
Annotations:      <none>
Status:           Pending
IP:
IPs:              <none>
Containers:
  cuda-vector-add:
    Image:      tingweiwu/cuda-vector-add:v0.1
    Port:       <none>
    Host Port:  <none>
    Limits:
      nvidia.com/gpu:  1
    Requests:
      nvidia.com/gpu:  1
    Environment:       <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-fljnc (ro)
Conditions:
  Type           Status
  PodScheduled   False
Volumes:
  kube-api-access-fljnc:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason            Age                  From               Message
  ----     ------            ----                 ----               -------
  Warning  FailedScheduling  16m                  default-scheduler  0/1 nodes are available: 1 Insufficient nvidia.com/gpu. preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod..
  Warning  FailedScheduling  6m19s (x2 over 11m)  default-scheduler  0/1 nodes are available: 1 Insufficient nvidia.com/gpu. preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod..
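
The "Insufficient nvidia.com/gpu" events mean the scheduler sees no node advertising that resource. One way to confirm this is to list each node's allocatable nvidia.com/gpu count. The following is a sketch, assuming kubectl prints `<none>` in a custom column when the field is absent; the function names list_gpu_capacity and nodes_without_gpu are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print a two-column table of node name and
# allocatable nvidia.com/gpu count.
list_gpu_capacity() {
  kubectl get nodes \
    -o custom-columns='NODE:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'
}

# Hypothetical helper: read that table on stdin, skip the header row, and
# print the nodes that advertise no GPU ("<none>" or "0").
nodes_without_gpu() {
  awk 'NR > 1 && ($2 == "<none>" || $2 == "0") { print $1 }'
}
```

Piping list_gpu_capacity into nodes_without_gpu lists the nodes on which a pod requesting nvidia.com/gpu cannot be scheduled.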

Repository: justinthelaw/k3d-gpu-support
