K3d GPU Support

This is a simple repository for preparing a k3s + nvidia/cuda base image that enables a K3d cluster to have access to your host machine's NVIDIA, CUDA-capable GPU(s).

Pre-Requisites

Access to GitHub and GitHub Container Registry. Please follow the GitHub Container Registry instructions.

Docker and all of its dependencies must be installed.

For the container GPU test, an NVIDIA GPU with CUDA cores and drivers must be present. Additionally, the CUDA toolkit and the NVIDIA Container Toolkit must be installed.
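
Before building anything, it can be worth confirming that Docker itself can reach the GPU. The following is a minimal sketch, not one of this repository's Make targets: the helper name `gpu_smoke_test` is hypothetical, and the default image tag is an assumption that you should replace with any CUDA base image you can pull.

```shell
# Run nvidia-smi inside a throwaway CUDA container to check that
# Docker and the NVIDIA Container Toolkit can see the host GPU.
# The default image tag below is an assumption; pass your own as $1.
gpu_smoke_test() {
  image="${1:-nvidia/cuda:12.4.1-base-ubuntu22.04}"
  docker run --rm --gpus all "$image" nvidia-smi
}
```

Calling `gpu_smoke_test` with no arguments uses the assumed default image. If the command prints the usual `nvidia-smi` table, the container runtime side of the setup is probably working.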

For Kubernetes testing and pre-requisites, please see Kubernetes Deployment for details.

Usage

Building and Pushing the Image

See the repository's Make targets for the available build and push options.

Kubernetes Deployment

Follow the instructions in the zarf-package-k3d-airgap repository for bootstrapping a K3d cluster that can access your NVIDIA GPUs.

You can also use a more abstracted version of the above Kubernetes deployment by following the instructions in the uds-leapfrogai bundle repository.

Test

Run:

kubectl apply -f test/cuda-vector-add.yaml
kubectl logs cuda-vector-add
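
If the test pod stays in Pending, it can help to check whether any node actually advertises the `nvidia.com/gpu` resource. The sketch below assumes `kubectl` is already pointed at the K3d cluster. The function names `list_gpu_capacity` and `has_gpu_node` are hypothetical helpers and are not part of this repository.

```shell
# Print each node name and its allocatable nvidia.com/gpu count,
# separated by a tab. Nodes that do not advertise the resource
# show an empty second column.
list_gpu_capacity() {
  kubectl get nodes \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.allocatable.nvidia\.com/gpu}{"\n"}{end}'
}

# Succeed only if at least one node reports a non-zero GPU count.
has_gpu_node() {
  list_gpu_capacity | awk -F '\t' '$2 + 0 > 0 { found = 1 } END { exit !found }'
}
```

For example, `has_gpu_node || echo "no schedulable GPUs"` gives a quick yes/no answer before you dig into the device plugin logs.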

k3d-gpu-support's Issues

Issues under WSL2

Thank you very much for your work. I have been trying to run K3s and CUDA workloads under WSL2. I tested with the files provided by your repository, following this issue, and it almost works.

The current issue is that nvidia-smi runs successfully inside the nvidia device plugin pod, but the plugin's logs indicate that no GPU is detected.

I suspect this may be because nvcc is not available in the pod. Do you have any ideas?

System: Windows 11 23H2
Runtime: Docker Desktop 4.28.0 (139021)

nvidia-device-plugin log

$ kubectl logs nvidia-device-plugin-daemonset-vvpkz -n kube-system
I0414 10:00:33.522494       1 main.go:154] Starting FS watcher.
I0414 10:00:33.522555       1 main.go:161] Starting OS watcher.
I0414 10:00:33.522912       1 main.go:176] Starting Plugins.
I0414 10:00:33.522931       1 main.go:234] Loading configuration.
I0414 10:00:33.522979       1 main.go:242] Updating config with default resource matching patterns.
I0414 10:00:33.523113       1 main.go:253]
Running with config:
{
  "version": "v1",
  "flags": {
    "migStrategy": "none",
    "failOnInitError": true,
    "nvidiaDriverRoot": "/",
    "gdsEnabled": false,
    "mofedEnabled": false,
    "plugin": {
      "passDeviceSpecs": true,
      "deviceListStrategy": [
        "envvar"
      ],
      "deviceIDStrategy": "uuid",
      "cdiAnnotationPrefix": "cdi.k8s.io/",
      "nvidiaCTKPath": "/usr/bin/nvidia-ctk",
      "containerDriverRoot": "/driver-root"
    }
  },
  "resources": {
    "gpus": [
      {
        "pattern": "*",
        "name": "nvidia.com/gpu"
      }
    ]
  },
  "sharing": {
    "timeSlicing": {}
  }
}
I0414 10:00:33.523131       1 main.go:256] Retreiving plugins.
I0414 10:00:33.524465       1 factory.go:107] Detected NVML platform: found NVML library
I0414 10:00:33.524495       1 factory.go:107] Detected non-Tegra platform: /sys/devices/soc0/family file not found
I0414 10:00:33.541706       1 main.go:287] No devices found. Waiting indefinitely.
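
The last log line above is the one that matters. A small sketch for spotting it automatically follows. It reads log text on standard input, and the function name `plugin_found_devices` is a hypothetical helper. It only looks for the exact "No devices found" message shown in this log, so the absence of that message is a hint rather than proof that GPUs were registered.

```shell
# Read device plugin log text on stdin.
# Return 1 if the plugin logged "No devices found", 0 otherwise.
plugin_found_devices() {
  if grep -q 'No devices found'; then
    echo "device plugin reported no devices"
    return 1
  fi
  echo "no 'No devices found' message in log"
}
```

It could be used as `kubectl logs -n kube-system <plugin-pod-name> | plugin_found_devices`, where `<plugin-pod-name>` is a placeholder for the actual pod name.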

Running nvidia-smi and nvcc inside the nvidia-device-plugin pod

root@nvidia-device-plugin-daemonset-t68w2:/# nvidia-smi
Sun Apr 14 10:25:52 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.73.01              Driver Version: 552.12         CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 2080 Ti     On  |   00000000:01:00.0 Off |                  N/A |
| 31%   28C    P8             16W /  250W |    1515MiB /  22528MiB |      5%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
root@nvidia-device-plugin-daemonset-t68w2:/#
root@nvidia-device-plugin-daemonset-t68w2:/# nvcc -V
bash: nvcc: command not found
root@nvidia-device-plugin-daemonset-t68w2:/#

kubectl describe output for the cuda-vector-add pod

$ kubectl describe pod cuda-vector-add
Name:             cuda-vector-add
Namespace:        default
Priority:         0
Service Account:  default
Node:             <none>
Labels:           <none>
Annotations:      <none>
Status:           Pending
IP:
IPs:              <none>
Containers:
  cuda-vector-add:
    Image:      tingweiwu/cuda-vector-add:v0.1
    Port:       <none>
    Host Port:  <none>
    Limits:
      nvidia.com/gpu:  1
    Requests:
      nvidia.com/gpu:  1
    Environment:       <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-fljnc (ro)
Conditions:
  Type           Status
  PodScheduled   False
Volumes:
  kube-api-access-fljnc:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason            Age                  From               Message
  ----     ------            ----                 ----               -------
  Warning  FailedScheduling  16m                  default-scheduler  0/1 nodes are available: 1 Insufficient nvidia.com/gpu. preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod..
  Warning  FailedScheduling  6m19s (x2 over 11m)  default-scheduler  0/1 nodes are available: 1 Insufficient nvidia.com/gpu. preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod..
