4paradigm / k8s-vgpu-scheduler

The OpenAIOS vGPU device plugin for Kubernetes originated from the OpenAIOS project to virtualize GPU device memory, allowing applications to access a larger memory space than the physical capacity. It is designed to make extended device memory easy to use for AI workloads.

License: Apache License 2.0

Go 72.89% Makefile 0.21% Shell 0.64% Dockerfile 0.22% Smarty 0.46% C 25.59%

k8s-vgpu-scheduler's Introduction

English version|中文版

OpenAIOS vGPU scheduler for Kubernetes


Supported devices

NVIDIA GPU, Cambricon MLU, Hygon DCU

Note: This project has been renamed to Project-HAMi. We keep the old repo here for compatibility reasons.

Introduction

The 4paradigm k8s vGPU scheduler is an "all in one" chart to manage GPUs in your Kubernetes cluster. It has everything you would expect from a Kubernetes GPU manager, including:

GPU sharing: Each task can allocate a portion of a GPU instead of a whole card, so a single GPU can be shared among multiple tasks.

Device Memory Control: GPUs can be allocated a specific amount of device memory (e.g. 3000M) or a percentage of the whole GPU's memory (e.g. 50%), and the plugin enforces that the allocation is not exceeded.

Virtual Device memory: You can oversubscribe GPU device memory by using host memory as its swap.

GPU Type Specification: You can specify which type of GPU to use or to avoid for a certain GPU task by setting the "nvidia.com/use-gputype" or "nvidia.com/nouse-gputype" annotations (see the annotation example after this list).

Easy to use: You don't need to modify your task YAML to use our scheduler. All your GPU jobs are automatically supported after installation. In addition, you can specify a resource name other than "nvidia.com/gpu" if you wish.
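
For example, a minimal sketch of the GPU type annotations on a pod; the annotation keys come from the list above, while the pod name and the GPU type strings are illustrative placeholders:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-type-example
  annotations:
    nvidia.com/use-gputype: "A100,V100"   # placeholder: prefer these GPU types
    nvidia.com/nouse-gputype: "1080Ti"    # placeholder: avoid this GPU type
spec:
  containers:
    - name: ubuntu-container
      image: ubuntu:18.04
      command: ["bash", "-c", "sleep 86400"]
      resources:
        limits:
          nvidia.com/gpu: 1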

The k8s vGPU scheduler retains the features of the 4paradigm k8s-device-plugin (4paradigm/k8s-device-plugin), such as splitting the physical GPU and limiting device memory and compute units. It adds a scheduling module to balance GPU usage across GPU nodes. In addition, it allows users to allocate GPUs by specifying device memory and device core usage. Furthermore, the vGPU scheduler can virtualize device memory (the device memory in use can exceed the physical device memory), which makes it possible to run tasks with large device memory requirements or to increase the number of shared tasks. You can refer to the benchmarks report.

When to use

  1. Scenarios where pods need to be allocated a certain amount of device memory or a certain number of device cores.
  2. The need to balance GPU usage in a cluster with multiple GPU nodes.
  3. Low utilization of device memory and compute units, such as running 10 tf-serving instances on one GPU.
  4. Situations that require a large number of small GPUs, such as teaching scenarios where one GPU is provided to multiple students, or cloud platforms that offer small GPU instances.
  5. Cases of insufficient physical device memory, where virtual device memory can be turned on, such as training with large batches or large models.

Prerequisites

The list of prerequisites for running the NVIDIA device plugin is described below:

  • NVIDIA drivers >= 384.81
  • nvidia-docker version > 2.0
  • Kubernetes version >= 1.16
  • glibc >= 2.17
  • kernel version >= 3.10
  • helm > 3.0

Quick Start

Preparing your GPU Nodes

The following steps need to be executed on all your GPU nodes. This README assumes that the NVIDIA drivers and the nvidia-container-toolkit have been pre-installed. It also assumes that you have configured the nvidia-container-runtime as the default low-level runtime to use.

Please see: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html

Example for debian-based systems with docker and containerd

Install the nvidia-container-toolkit
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/libnvidia-container/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | sudo tee /etc/apt/sources.list.d/libnvidia-container.list

sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
Configure docker

When running kubernetes with docker, edit the config file which is usually present at /etc/docker/daemon.json to set up nvidia-container-runtime as the default low-level runtime:

{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}

And then restart docker:

$ sudo systemctl daemon-reload && sudo systemctl restart docker
Configure containerd

When running kubernetes with containerd, edit the config file which is usually present at /etc/containerd/config.toml to set up nvidia-container-runtime as the default low-level runtime:

version = 2
[plugins]
  [plugins."io.containerd.grpc.v1.cri"]
    [plugins."io.containerd.grpc.v1.cri".containerd]
      default_runtime_name = "nvidia"

      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
          privileged_without_host_devices = false
          runtime_engine = ""
          runtime_root = ""
          runtime_type = "io.containerd.runc.v2"
          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
            BinaryName = "/usr/bin/nvidia-container-runtime"

And then restart containerd:

$ sudo systemctl daemon-reload && sudo systemctl restart containerd

Then you need to label the GPU nodes that should be scheduled by the 4pd-k8s-scheduler by adding the "gpu=on" label; otherwise they cannot be managed by our scheduler.

kubectl label nodes {nodeid} gpu=on

Enabling vGPU Support in Kubernetes

First, check your Kubernetes version by using the following command

kubectl version

Then, add our repo in helm

helm repo add vgpu-charts https://4paradigm.github.io/k8s-vgpu-scheduler

You need to set the Kubernetes scheduler image version according to your Kubernetes server version during installation. For example, if your cluster server version is 1.16.8, then you should use the following command for deployment

helm install vgpu vgpu-charts/vgpu --set scheduler.kubeScheduler.imageTag=v1.16.8 -n kube-system

You can customize your installation by adjusting configs.
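
For example, a hedged sketch that overrides the scheduler webhook NodePort and the monitoring port documented later in this README (the values shown here are just the defaults):

helm install vgpu vgpu-charts/vgpu \
  --set scheduler.kubeScheduler.imageTag=v1.16.8 \
  --set devicePlugin.service.schedulerPort=31998 \
  --set devicePlugin.service.httpPort=31992 \
  -n kube-system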

You can verify your installation by the following command:

$ kubectl get pods -n kube-system

If both the vgpu-device-plugin and vgpu-scheduler pods are in the Running state, your installation was successful.

Running GPU Jobs

NVIDIA vGPUs can now be requested by a container using the nvidia.com/gpu resource type:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  containers:
    - name: ubuntu-container
      image: ubuntu:18.04
      command: ["bash", "-c", "sleep 86400"]
      resources:
        limits:
          nvidia.com/gpu: 2 # requesting 2 vGPUs
          nvidia.com/gpumem: 3000 # Each vGPU contains 3000m device memory (Optional,Integer)
          nvidia.com/gpucores: 30 # Each vGPU uses 30% of the entire GPU (Optional,Integer)

Be aware that if the task cannot fit on any GPU node (i.e. the number of nvidia.com/gpu you request exceeds the number of GPUs on every node), the task will get stuck in the Pending state.

You can now execute the nvidia-smi command in the container and see the difference in GPU memory between the vGPU and the real GPU.
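
For instance, assuming the example pod above is named gpu-pod:

kubectl exec -it gpu-pod -- nvidia-smi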

WARNING: if you don't request vGPUs when using the device plugin with NVIDIA images, all the vGPUs on the machine will be exposed inside your container.

More examples

Click here

Scheduler Webhook Service NodePort

The default schedulerPort is 31998; other values can be set using --set devicePlugin.service.schedulerPort during installation.

Monitoring vGPU status

Monitoring is automatically enabled after installation. You can get vGPU status of a node by visiting

http://{nodeip}:{monitorPort}/metrics

The default monitorPort is 31992; other values can be set using --set devicePlugin.service.httpPort during installation.
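
For example, with the default port:

curl http://{nodeip}:31992/metrics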

grafana dashboard example

Note: The status of a node won't be collected until a GPU operation has run on it.

Upgrade

To upgrade k8s-vGPU to the latest version, all you need to do is update the repo and reinstall the chart.

$ helm uninstall vgpu -n kube-system
$ helm repo update
$ helm install vgpu vgpu-charts/vgpu -n kube-system

Uninstall

helm uninstall vgpu -n kube-system

Scheduling

The current scheduling strategy is to select the GPU with the fewest tasks, which balances the load across multiple GPUs.

Benchmarks

Three instances from ai-benchmark have been used to evaluate vGPU-device-plugin performance as follows

Test environment:
  Kubernetes version: v1.12.9
  Docker version: 18.09.1
  GPU type: Tesla V100
  GPU count: 2

Test instances:
  nvidia-device-plugin: k8s + NVIDIA k8s-device-plugin
  vGPU-device-plugin: k8s + vGPU k8s-device-plugin, without virtual device memory
  vGPU-device-plugin (virtual device memory): k8s + vGPU k8s-device-plugin, with virtual device memory

Test Cases:

test id case type params
1.1 Resnet-V2-50 inference batch=50,size=346*346
1.2 Resnet-V2-50 training batch=20,size=346*346
2.1 Resnet-V2-152 inference batch=10,size=256*256
2.2 Resnet-V2-152 training batch=10,size=256*256
3.1 VGG-16 inference batch=20,size=224*224
3.2 VGG-16 training batch=2,size=224*224
4.1 DeepLab inference batch=2,size=512*512
4.2 DeepLab training batch=1,size=384*384
5.1 LSTM inference batch=100,size=1024*300
5.2 LSTM training batch=10,size=1024*300

Test Result: (benchmark result charts)

To reproduce:

  1. Install k8s-vGPU-scheduler and configure it properly.
  2. Run the benchmark job:
$ kubectl apply -f benchmarks/ai-benchmark/ai-benchmark.yml
  3. View the result by using kubectl logs:
$ kubectl logs [pod id]

Features

  • Specify the number of vGPUs each physical GPU is divided into.
  • Limit each vGPU's device memory.
  • Allocate vGPUs by specifying device memory.
  • Limit each vGPU's streaming multiprocessors.
  • Allocate vGPUs by specifying device core usage.
  • Zero changes to existing programs.

Experimental Features

  • Virtual Device Memory

    The device memory of the vGPU can exceed the physical device memory of the GPU. The excess part is put in host RAM, which has a certain impact on performance (see the install sketch below).
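
A minimal sketch of turning this on at install time, assuming the chart exposes a devicePlugin.deviceMemoryScaling value (this value also appears in the issue reports below); a factor greater than 1 oversubscribes device memory:

# Assumption: deviceMemoryScaling=2 advertises twice the physical device memory per GPU
helm install vgpu vgpu-charts/vgpu --set scheduler.kubeScheduler.imageTag=v1.16.8 --set devicePlugin.deviceMemoryScaling=2 -n kube-system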

Known Issues

  • Currently, A100 MIG can only support "none" and "mixed" mode
  • Currently, tasks with the "nodeName" field set can't be scheduled; please use "nodeSelector" instead (see the example after this list)
  • Currently, only computing tasks are supported, and video codec processing is not supported.
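
For example, a minimal sketch of pinning a pod to a node via nodeSelector instead of nodeName (the node name gpu-node-1 is a placeholder):

apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  # nodeName: gpu-node-1                  # not supported by the vGPU scheduler
  nodeSelector:
    kubernetes.io/hostname: gpu-node-1    # schedule onto this node instead
  containers:
    - name: ubuntu-container
      image: ubuntu:18.04
      command: ["bash", "-c", "sleep 86400"]
      resources:
        limits:
          nvidia.com/gpu: 1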

TODO

  • Support video codec processing
  • Support Multi-Instance GPUs (MIG)

Tests

  • TensorFlow 1.14.0/2.4.1
  • torch 1.1.0
  • mxnet 1.4.0
  • mindspore 1.1.1

The above frameworks have passed the test.

Issues and Contributing

Authors

Contact

Owner & Maintainer: Limengxuan

Feel free to reach me by

email: <[email protected]> 
phone: +86 18810644493
WeChat: xuanzong4493

k8s-vgpu-scheduler's People

Contributors

archlitchi, atttx123, calvin0327, chaunceyjiang, coderth, gsakun, haitwang-cloud, klueska, lengrongfu, peizhaoyou, wawa0210, whybeyoung, zhengbingxian


k8s-vgpu-scheduler's Issues

vgpu-device-plugin CreateContainerError

kubectl logs -n kube-system vgpu-admission-patch-n7psv
W0516 05:24:15.597365 1 client_config.go:615] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
{"level":"info","msg":"patching webhook configurations 'vgpu-webhook' mutating=true, validating=false, failurePolicy=Fail","source":"k8s/k8s.go:118","time":"2024-05-16T05:24:15Z"}
{"err":"mutatingwebhookconfigurations.admissionregistration.k8s.io "vgpu-webhook" not found","level":"fatal","msg":"failed getting mutating webhook","source":"cmd/patch.go:103","time":"2024-05-16T05:24:15Z"}

Device memory display issue

Running nvidia-smi inside the container returns the following:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.64.00 Driver Version: 440.64.00 CUDA Version: 10.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... On | 00000000:00:0A.0 Off | 0 |
| N/A 36C P0 42W / 300W | 112MiB / 16160MiB | 0% Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |

Memory-Usage: 112MiB / 16160MiB

  1. No program is running yet, so why does it show 112MiB already in use?
  2. By default one card is equivalent to 3 vGPU cards; shouldn't the total device memory be 16160MiB/3?

vgpu chart not found in repo

When I execute the command: helm install vgpu vgpu-charts/vgpu --set scheduler.kubeScheduler.imageTag=v1.25.0 --set devicePlugin.deviceMemoryScaling=3 -n kube-system, it prompts Error: INSTALLATION FAILED: chart "vgpu" matching not found in vgpu-charts index. (try 'helm repo update'): no chart name found, indicating that the vgpu repo does not exist anymore.

The splitting feature does not work; please help

Environment description
Kubernetes version v1.11.2
Docker version 18.03.1-ce
GPU Type Tesla V100
GPU Num 2

The configuration parameters are:

"args": [
"--fail-on-init-error=false",
"--device-split-count=2",
"--device-memory-scaling=2",
"--device-cores-scaling=2"
],

Checking the GPU node with kubectl describe node xxx.xxx.xxx.xxx:

Capacity:
cpu: 36
ephemeral-storage: 3478455808Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 131500528Ki
nvidia.com/gpu: 2
pods: 110

The nvidia.com/gpu count did not become 4 as in the demo. Could the author please take a look? Thanks a lot.

Handle_remap not found handle

1. Issue or feature description

The "Handle_remap not found handle" problem occasionally occurs when using vGPU.

2. Steps to reproduce the issue

It occurs occasionally; recreating the pod restores normal behavior.
Running nvidia-smi inside the affected pod's container reports an error.

Running nvidia-smi on the host works fine.
Running nvidia-smi in other pods on the same host works fine.

3. Information to attach (optional if deemed irrelevant)

Error log

root@service416776181220773888-55d7479f64-tvg9r:/# nvidia-smi
[4pdvGPU Debug(99:140414784235264:libvgpu.c:39)]: init_dlsym

[4pdvGPU Debug(99:140414784235264:libvgpu.c:61)]: into dlsym nvmlInitWithFlags
[4pdvGPU Debug(99:140414784235264:hook.c:542)]: nvmlInitWithFlags
...
[4pdvGPU Debug(99:140414784235264:hook.c:129)]: LOADING cuEventDestroy_v2 89
[4pdvGPU Debug(99:140414784235264:hook.c:129)]: LOADING cuModuleLoadDataEx 90
[4pdvGPU Debug(99:140414784235264:hook.c:129)]: LOADING cuModuleLoadFatBinary 91
[4pdvGPU Debug(99:140414784235264:hook.c:129)]: LOADING cuModuleGetFunction 92
[4pdvGPU Info(99:140414784235264:hook.c:136)]: loaded_cuda_libraries
[4pdvGPU Debug(99:140414784235264:multiprocess_memory_limit.c:476)]: Try create shrreg
[4pdvGPU Debug(99:140414784235264:hook.c:558)]: nvmlInit_v2
[4pdvGPU Debug(99:140414784235264:hook.c:560)]: Hijacking nvmlInit_v2
[4pdvGPU Debug(99:140414784235264:hook.c:542)]: nvmlInitWithFlags
[4pdvGPU Debug(99:140414784235264:hook.c:544)]: Hijacking nvmlInitWithFlags
[4pdvGPU Debug(99:140414784235264:hook.c:472)]: nvmlDeviceGetHandleByIndex_v2 index=0
[4pdvGPU Debug(99:140414784235264:hook.c:476)]: Hijacking nvmlDeviceGetHandleByIndex_v2
[4pdvGPU Debug(99:140414784235264:nvml_entry.c:775)]: Hijacking nvmlDeviceGetUUID
[4pdvGPU Debug(99:140414784235264:hook.c:472)]: nvmlDeviceGetHandleByIndex_v2 index=0
[4pdvGPU Debug(99:140414784235264:hook.c:476)]: Hijacking nvmlDeviceGetHandleByIndex_v2
[4pdvGPU Debug(99:140414784235264:nvml_entry.c:775)]: Hijacking nvmlDeviceGetUUID
[4pdvGPU Debug(99:140414784235264:hook.c:472)]: nvmlDeviceGetHandleByIndex_v2 index=1
[4pdvGPU Debug(99:140414784235264:hook.c:476)]: Hijacking nvmlDeviceGetHandleByIndex_v2
[4pdvGPU Debug(99:140414784235264:nvml_entry.c:775)]: Hijacking nvmlDeviceGetUUID
[4pdvGPU Debug(99:140414784235264:hook.c:472)]: nvmlDeviceGetHandleByIndex_v2 index=2
[4pdvGPU Debug(99:140414784235264:hook.c:476)]: Hijacking nvmlDeviceGetHandleByIndex_v2
[4pdvGPU Debug(99:140414784235264:nvml_entry.c:775)]: Hijacking nvmlDeviceGetUUID
[4pdvGPU ERROR (pid:99 thread=140414784235264 hook.c:285)]: Handle_remap not found handle=7fb4daa19938
nvidia-smi: /home/limengxuan/work/libcuda_override/src/nvml/hook.c:285: handle_remap: Assertion `0' failed.
Aborted (core dumped)

error-in-container.log

Host nvidia-smi -a output:
nvidia-smi-host.txt

2 vGPUs allocated but only 1 visible

On a machine with 8 A100 cards, each card is split into 5 vGPUs (--device-split-count=5). After creating a pod with 2 vGPUs, the nvidia-smi command inside the container shows only one vGPU, while two GPUs are visible under the /dev directory. The k8s-vgpu-plugin version is v0.9.0.18.

vGPU limit issue

I am using a libvgpu.so updated 6 months ago. It works: on PyTorch it behaves normally and exceeding the device memory size raises an error as expected. But on TensorFlow it does not: the device memory limit does not take effect and usage can exceed the split size without reporting an error.

parameter devicePlugin.deviceSplitCount does not work

I used helm to install k8s-vgpu-scheduler with devicePlugin.deviceSplitCount = 5. After it deployed successfully, I ran 'kubectl describe node' and saw 40 allocatable 'nvidia.com/gpu' resources (the machine has 8 A40 cards). Then I created 6 pods, each requesting 1 'nvidia.com/gpu', but when I created a pod that needs 3 'nvidia.com/gpu', Kubernetes said the pod could not be scheduled.

The vgpu-scheduler logs are shown below; they seem to say that only 2 GPU cards are usable?
I0313 00:58:35.594437 1 score.go:65] "devices status" I0313 00:58:35.594467 1 score.go:67] "device status" device id="GPU-0707087e-8264-4ba4-bc45-30c70272ec4a" device detail={"Id":"GPU-0707087e-8264-4ba4-bc45-30c70272ec4a","Index":0,"Used":0,"Count":10,"Usedmem":0,"Totalmem":46068,"Totalcore":100,"Usedcores":0,"Numa":0,"Type":"NVIDIA-NVIDIA A40","Health":true} I0313 00:58:35.594519 1 score.go:67] "device status" device id="GPU-b3e35ad4-81ee-0aee-9865-4787748b93ce" device detail={"Id":"GPU-b3e35ad4-81ee-0aee-9865-4787748b93ce","Index":1,"Used":0,"Count":10,"Usedmem":0,"Totalmem":46068,"Totalcore":100,"Usedcores":0,"Numa":0,"Type":"NVIDIA-NVIDIA A40","Health":true} I0313 00:58:35.594542 1 score.go:67] "device status" device id="GPU-d38a391c-9f2f-395e-2f91-1785a648f6c4" device detail={"Id":"GPU-d38a391c-9f2f-395e-2f91-1785a648f6c4","Index":2,"Used":1,"Count":10,"Usedmem":46068,"Totalmem":46068,"Totalcore":100,"Usedcores":0,"Numa":0,"Type":"NVIDIA-NVIDIA A40","Health":true} I0313 00:58:35.594568 1 score.go:67] "device status" device id="GPU-7099a282-5a75-55f8-0cd0-a4b48098ae1e" device detail={"Id":"GPU-7099a282-5a75-55f8-0cd0-a4b48098ae1e","Index":3,"Used":1,"Count":10,"Usedmem":46068,"Totalmem":46068,"Totalcore":100,"Usedcores":0,"Numa":0,"Type":"NVIDIA-NVIDIA A40","Health":true} I0313 00:58:35.594600 1 score.go:67] "device status" device id="GPU-56967eb2-30b7-c808-367a-225b8bd8a12e" device detail={"Id":"GPU-56967eb2-30b7-c808-367a-225b8bd8a12e","Index":4,"Used":1,"Count":10,"Usedmem":46068,"Totalmem":46068,"Totalcore":100,"Usedcores":0,"Numa":0,"Type":"NVIDIA-NVIDIA A40","Health":true} I0313 00:58:35.594639 1 score.go:67] "device status" device id="GPU-54191405-e5a9-2f7b-8ac4-f4e86c6669cb" device detail={"Id":"GPU-54191405-e5a9-2f7b-8ac4-f4e86c6669cb","Index":5,"Used":1,"Count":10,"Usedmem":46068,"Totalmem":46068,"Totalcore":100,"Usedcores":0,"Numa":0,"Type":"NVIDIA-NVIDIA A40","Health":true} I0313 00:58:35.594671 1 score.go:67] "device status" device id="GPU-e731cd15-879f-6d00-485d-d1b468589de9" device detail={"Id":"GPU-e731cd15-879f-6d00-485d-d1b468589de9","Index":6,"Used":1,"Count":10,"Usedmem":46068,"Totalmem":46068,"Totalcore":100,"Usedcores":0,"Numa":0,"Type":"NVIDIA-NVIDIA A40","Health":true} I0313 00:58:35.594693 1 score.go:67] "device status" device id="GPU-865edbf8-5d63-8e57-5e14-36682179eaf6" device detail={"Id":"GPU-865edbf8-5d63-8e57-5e14-36682179eaf6","Index":7,"Used":1,"Count":10,"Usedmem":46068,"Totalmem":46068,"Totalcore":100,"Usedcores":0,"Numa":0,"Type":"NVIDIA-NVIDIA A40","Health":true} I0313 00:58:35.594725 1 score.go:90] "Allocating device for container request" pod="default/gpu-pod-2" card request={"Nums":5,"Type":"NVIDIA","Memreq":0,"MemPercentagereq":100,"Coresreq":0} I0313 00:58:35.594757 1 score.go:93] "scoring pod" pod="default/gpu-pod-2" Memreq=0 MemPercentagereq=100 Coresreq=0 Nums=5 device index=7 device="GPU-b3e35ad4-81ee-0aee-9865-4787748b93ce" I0313 00:58:35.594800 1 score.go:140] "first fitted" pod="default/gpu-pod-2" device="GPU-b3e35ad4-81ee-0aee-9865-4787748b93ce" I0313 00:58:35.594829 1 score.go:93] "scoring pod" pod="default/gpu-pod-2" Memreq=0 MemPercentagereq=100 Coresreq=0 Nums=4 device index=6 device="GPU-0707087e-8264-4ba4-bc45-30c70272ec4a" I0313 00:58:35.594850 1 score.go:140] "first fitted" pod="default/gpu-pod-2" device="GPU-0707087e-8264-4ba4-bc45-30c70272ec4a" I0313 00:58:35.594869 1 score.go:93] "scoring pod" pod="default/gpu-pod-2" Memreq=0 MemPercentagereq=100 Coresreq=0 Nums=3 device index=5 
device="GPU-865edbf8-5d63-8e57-5e14-36682179eaf6" I0313 00:58:35.594889 1 score.go:93] "scoring pod" pod="default/gpu-pod-2" Memreq=0 MemPercentagereq=100 Coresreq=0 Nums=3 device index=4 device="GPU-e731cd15-879f-6d00-485d-d1b468589de9" I0313 00:58:35.594911 1 score.go:93] "scoring pod" pod="default/gpu-pod-2" Memreq=0 MemPercentagereq=100 Coresreq=0 Nums=3 device index=3 device="GPU-54191405-e5a9-2f7b-8ac4-f4e86c6669cb" I0313 00:58:35.594929 1 score.go:93] "scoring pod" pod="default/gpu-pod-2" Memreq=0 MemPercentagereq=100 Coresreq=0 Nums=3 device index=2 device="GPU-56967eb2-30b7-c808-367a-225b8bd8a12e" I0313 00:58:35.594948 1 score.go:93] "scoring pod" pod="default/gpu-pod-2" Memreq=0 MemPercentagereq=100 Coresreq=0 Nums=3 device index=1 device="GPU-7099a282-5a75-55f8-0cd0-a4b48098ae1e" I0313 00:58:35.594966 1 score.go:93] "scoring pod" pod="default/gpu-pod-2" Memreq=0 MemPercentagereq=100 Coresreq=0 Nums=3 device index=0 device="GPU-d38a391c-9f2f-395e-2f91-1785a648f6c4" I0313 00:58:35.594989 1 score.go:211] "calcScore:node not fit pod" pod="default/gpu-pod-2" node="gpu-230"

The outputs of kubectl describe node gpu-230 and of nvidia-smi were attached as screenshots.

so somebody can solve this issue? thanks

can't find function nvmlDeviceGetComputeRunningProcesses_v2 in libnvidia-ml.so.1

nvidia-smi runs successfully on the host,
and inside the container if I use the original k8s-device-plugin,
but I get the following error when using this vGPU device plugin

output of nvidia-smi

can't find function nvmlDeviceGetComputeRunningProcesses_v2 in libnvidia-ml.so.1
Tue Aug  3 08:36:12 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla M40           Off  | 00000000:00:08.0 Off |                  Off |
| N/A   35C    P0    63W / 250W |    174MiB / 12215MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|

After enabling vGPU, running nvidia-smi in the pod reports Segmentation fault (core dumped)

I'm a Kubernetes beginner with a question; please kindly advise.

1. Issue or feature description

After enabling vGPU, running nvidia-smi inside the pod reports Segmentation fault (core dumped).

2. Steps to reproduce the issue

The master node is an 8-core 16GB Tencent Cloud VM; the worker node is a 20-core 80GB Tencent Cloud VM with one NVIDIA T4 GPU. The operating system is Ubuntu Server 18.04.
On the worker node I installed docker and nvidia-docker2 and enabled vGPU, using the latest image with all parameters at their defaults (I tried modifying the parameters but the result was the same).

I performed the operations shown in the attached screenshots; entering the pod and running nvidia-smi produced the error shown in the screenshots.

Running nvidia-smi on the GPU host itself works without problems.
Docker version: 20.10
Kubernetes version 1.19.0, installed with kubeadm; the kubelet version is also 1.19.0.
The docker info output is attached; daemon.json is already configured with the nvidia runtime and default-runtime set to nvidia.

Who's using vGPU K8s Device Plugin / 您在使用vGPU K8s Device Plugin吗 ?

Sincerely thank you for using and continuing to pay attention to vGPU K8s Device Plugin. In order to better build the community and attract more people to use and contribute to vGPU K8s Device Plugin to strengthen the community, please comment the following information in the issue:

  1. Your company, school or organization.
  2. Your contact info: email.
  3. Your scenarios using vGPU K8s Device Plugin.

You can refer to the following format to provide information:
Company(Organization): xxx
Website: xxx (Just to get the company logo)
Contact: xxx
Scenarios: DL inference


Questions about usage

1. Issue or feature description

We are about to try this GPU plugin. Suppose my cluster has only one GPU and I split it into two. Given the restriction that "the number of vGPUs required by tasks assigned to a node cannot exceed the actual number of GPUs on the node", does that mean I can actually start/use only one compute instance (application using the GPU)?

Two GPUs, but only one card is recognized

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01 Driver Version: 440.33.01 CUDA Version: 10.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla V100-PCIE... Off | 00000000:1A:00.0 Off | 0 |
| N/A 33C P0 24W / 250W | 0MiB / 16160MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla V100-PCIE... Off | 00000000:68:00.0 Off | 0 |
| N/A 27C P0 23W / 250W | 4MiB / 16160MiB | 0% Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+

As shown above, the server has two V100s in total, but after using the device plugin only card 0 is split.

The split parameters are as follows:
args:
- '--fail-on-init-error=false'
- '--device-split-count=4'
- '--device-memory-scaling=2'
- '--device-cores-scaling=4'

After describing the GPU node, I also only get 4 vGPUs instead of 8:

Capacity:
cpu: 36
ephemeral-storage: 3478455808Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 131500528Ki
nvidia.com/gpu: 4
pods: 110
Allocatable:
cpu: 35600m
ephemeral-storage: 3478455808Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 127399425539
nvidia.com/gpu: 4
pods: 110

docker version: 20.10.12

k8s version: v1.19.9

Is there a way to monitor vGPU with DCGM?

DCGM exporter is not picking up the pods that are using vGPU, making it hard to track utilization of the pods.
is there any workaround to monitor GPU utilization with vGPU?
is there a way to get the mapping between the vGPU and the actual GPU IDs?

I am using version v0.9.0.0. After building it and deploying it as a daemon service on the GPU node, it reports that device-split-count and several other parameters are undefined. After removing these parameters, the pod runs normally on the GPU node, but the logs show that NVML cannot be found. The GPU node uses P100 cards. Please advise.

The template below is mostly useful for bug reports and support questions. Feel free to remove anything which doesn't apply to you and add more information where it makes sense.

1. Issue or feature description

2. Steps to reproduce the issue

3. Information to attach (optional if deemed irrelevant)

Common error checking:

  • The output of nvidia-smi -a on your host
  • Your docker configuration file (e.g: /etc/docker/daemon.json)
  • The k8s-device-plugin container logs
  • The kubelet logs on the node (e.g: sudo journalctl -r -u kubelet)

Additional information that might help better understand your environment and reproduce the bug:

  • Docker version from docker version
  • Docker command, image and tag used
  • Kernel version from uname -a
  • Any relevant kernel output lines from dmesg
  • NVIDIA packages version from dpkg -l '*nvidia*' or rpm -qa '*nvidia*'
  • NVIDIA container library version from nvidia-container-cli -V
  • NVIDIA container library logs (see troubleshooting)

Running nvidia-smi in the pod fails

The template below is mostly useful for bug reports and support questions. Feel free to remove anything which doesn't apply to you and add more information where it makes sense.

1. Issue or feature description

create pod

apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  containers:
    - name: ubuntu-container
      image: xx:runtime-py3.6-cudnn7.3-cuda9.2-centos7
      command: 
        - /bin/bash
        - -c
        - sleep 1d
      env:
        - name: LIBCUDA_LOG_LEVEL
          value: "5"
      resources:
        limits:
          nvidia.com/gpu: 2

Running nvidia-smi fails:

[root@gpu-pod /]# nvidia-smi
[4pdvGPU Warn(38:140500277225280:hook.c:396)]: can't find function nvmlDeviceGetMemoryInfo_v2 in libnvidia-ml.so.1
[4pdvGPU Warn(38:140500277225280:hook.c:396)]: can't find function nvmlDeviceSetTemperatureThreshold in libnvidia-ml.so.1
[4pdvGPU Warn(38:140500277225280:hook.c:396)]: can't find function nvmlVgpuInstanceGetGpuInstanceId in libnvidia-ml.so.1
[4pdvGPU Warn(38:140500277225280:hook.c:339)]: NVML error at line 339: 1
Failed to initialize NVML: Unknown Error

2. Steps to reproduce the issue

3. Information to attach (optional if deemed irrelevant)

Common error checking:

  • The output of nvidia-smi -a on your host
[root@xxx ~]# nvidia-smi -a

==============NVSMI LOG==============

Timestamp                                 : Sat Aug 13 16:10:39 2022
Driver Version                            : 455.38
CUDA Version                              : 11.1

Attached GPUs                             : 4
GPU 00000000:02:00.0
    Product Name                          : TITAN V
    Product Brand                         : Titan
    Display Mode                          : Disabled
    Display Active                        : Disabled
    Persistence Mode                      : Disabled
    MIG Mode
        Current                           : N/A
        Pending                           : N/A
    Accounting Mode                       : Disabled
    Accounting Mode Buffer Size           : 4000
    Driver Model
        Current                           : N/A
        Pending                           : N/A
    Serial Number                         : 0320618057831
    GPU UUID                              : GPU-f5a3f95f-2685-cf01-2063-7bc624963433
    Minor Number                          : 0
    VBIOS Version                         : 88.00.41.00.18
    MultiGPU Board                        : No
    Board ID                              : 0x200
    GPU Part Number                       : 900-1G500-2500-000
    Inforom Version
        Image Version                     : G001.0000.01.04
        OEM Object                        : 1.1
        ECC Object                        : N/A
        Power Management Object           : N/A
    GPU Operation Mode
        Current                           : N/A
        Pending                           : N/A
    GPU Virtualization Mode
        Virtualization Mode               : None
        Host VGPU Mode                    : N/A
    IBMNPU
        Relaxed Ordering Mode             : N/A
    PCI
        Bus                               : 0x02
        Device                            : 0x00
        Domain                            : 0x0000
        Device Id                         : 0x1D8110DE
        Bus Id                            : 00000000:02:00.0
        Sub System Id                     : 0x121810DE
        GPU Link Info
            PCIe Generation
                Max                       : 3
                Current                   : 1
            Link Width
                Max                       : 16x
                Current                   : 16x
        Bridge Chip
            Type                          : N/A
            Firmware                      : N/A
        Replays Since Reset               : 0
        Replay Number Rollovers           : 0
        Tx Throughput                     : 0 KB/s
        Rx Throughput                     : 0 KB/s
    Fan Speed                             : 28 %
    Performance State                     : P8
    Clocks Throttle Reasons
        Idle                              : Active
        Applications Clocks Setting       : Not Active
        SW Power Cap                      : Not Active
        HW Slowdown                       : Not Active
            HW Thermal Slowdown           : Not Active
            HW Power Brake Slowdown       : Not Active
        Sync Boost                        : Not Active
        SW Thermal Slowdown               : Not Active
        Display Clock Setting             : Not Active
    FB Memory Usage
        Total                             : 12066 MiB
        Used                              : 0 MiB
        Free                              : 12066 MiB
    BAR1 Memory Usage
        Total                             : 256 MiB
        Used                              : 2 MiB
        Free                              : 254 MiB
    Compute Mode                          : Default
    Utilization
        Gpu                               : 0 %
        Memory                            : 0 %
        Encoder                           : 0 %
        Decoder                           : 0 %
    Encoder Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    FBC Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    Ecc Mode
        Current                           : N/A
        Pending                           : N/A
    ECC Errors
        Volatile
            Single Bit
                Device Memory             : N/A
                Register File             : N/A
                L1 Cache                  : N/A
                L2 Cache                  : N/A
                Texture Memory            : N/A
                Texture Shared            : N/A
                CBU                       : N/A
                Total                     : N/A
            Double Bit
                Device Memory             : N/A
                Register File             : N/A
                L1 Cache                  : N/A
                L2 Cache                  : N/A
                Texture Memory            : N/A
                Texture Shared            : N/A
                CBU                       : N/A
                Total                     : N/A
        Aggregate
            Single Bit
                Device Memory             : N/A
                Register File             : N/A
                L1 Cache                  : N/A
                L2 Cache                  : N/A
                Texture Memory            : N/A
                Texture Shared            : N/A
                CBU                       : N/A
                Total                     : N/A
            Double Bit
                Device Memory             : N/A
                Register File             : N/A
                L1 Cache                  : N/A
                L2 Cache                  : N/A
                Texture Memory            : N/A
                Texture Shared            : N/A
                CBU                       : N/A
                Total                     : N/A
    Retired Pages
        Single Bit ECC                    : N/A
        Double Bit ECC                    : N/A
        Pending Page Blacklist            : N/A
    Remapped Rows                         : N/A
    Temperature
        GPU Current Temp                  : 39 C
        GPU Shutdown Temp                 : 100 C
        GPU Slowdown Temp                 : 97 C
        GPU Max Operating Temp            : 91 C
        Memory Current Temp               : 36 C
        Memory Max Operating Temp         : 95 C
    Power Readings
        Power Management                  : Supported
        Power Draw                        : 27.28 W
        Power Limit                       : 250.00 W
        Default Power Limit               : 250.00 W
        Enforced Power Limit              : 250.00 W
        Min Power Limit                   : 100.00 W
        Max Power Limit                   : 300.00 W
    Clocks
        Graphics                          : 135 MHz
        SM                                : 135 MHz
        Memory                            : 850 MHz
        Video                             : 555 MHz
    Applications Clocks
        Graphics                          : 1200 MHz
        Memory                            : 850 MHz
    Default Applications Clocks
        Graphics                          : 1200 MHz
        Memory                            : 850 MHz
    Max Clocks
        Graphics                          : 1912 MHz
        SM                                : 1912 MHz
        Memory                            : 850 MHz
        Video                             : 1717 MHz
    Max Customer Boost Clocks
        Graphics                          : N/A
    Clock Policy
        Auto Boost                        : N/A
        Auto Boost Default                : N/A
    Processes                             : None

GPU 00000000:03:00.0
    Product Name                          : TITAN V
    Product Brand                         : Titan
    Display Mode                          : Disabled
    Display Active                        : Disabled
    Persistence Mode                      : Disabled
    MIG Mode
        Current                           : N/A
        Pending                           : N/A
    Accounting Mode                       : Disabled
    Accounting Mode Buffer Size           : 4000
    Driver Model
        Current                           : N/A
        Pending                           : N/A
    Serial Number                         : 0320618058217
    GPU UUID                              : GPU-3d5859fc-73e1-23d1-2e59-78c5e7049d61
    Minor Number                          : 1
    VBIOS Version                         : 88.00.41.00.18
    MultiGPU Board                        : No
    Board ID                              : 0x300
    GPU Part Number                       : 900-1G500-2500-000
    Inforom Version
        Image Version                     : G001.0000.01.04
        OEM Object                        : 1.1
        ECC Object                        : N/A
        Power Management Object           : N/A
    GPU Operation Mode
        Current                           : N/A
        Pending                           : N/A
    GPU Virtualization Mode
        Virtualization Mode               : None
        Host VGPU Mode                    : N/A
    IBMNPU
        Relaxed Ordering Mode             : N/A
    PCI
        Bus                               : 0x03
        Device                            : 0x00
        Domain                            : 0x0000
        Device Id                         : 0x1D8110DE
        Bus Id                            : 00000000:03:00.0
        Sub System Id                     : 0x121810DE
        GPU Link Info
            PCIe Generation
                Max                       : 3
                Current                   : 1
            Link Width
                Max                       : 16x
                Current                   : 16x
        Bridge Chip
            Type                          : N/A
            Firmware                      : N/A
        Replays Since Reset               : 0
        Replay Number Rollovers           : 0
        Tx Throughput                     : 0 KB/s
        Rx Throughput                     : 0 KB/s
    Fan Speed                             : 31 %
    Performance State                     : P8
    Clocks Throttle Reasons
        Idle                              : Active
        Applications Clocks Setting       : Not Active
        SW Power Cap                      : Not Active
        HW Slowdown                       : Not Active
            HW Thermal Slowdown           : Not Active
            HW Power Brake Slowdown       : Not Active
        Sync Boost                        : Not Active
        SW Thermal Slowdown               : Not Active
        Display Clock Setting             : Not Active
    FB Memory Usage
        Total                             : 12066 MiB
        Used                              : 0 MiB
        Free                              : 12066 MiB
    BAR1 Memory Usage
        Total                             : 256 MiB
        Used                              : 2 MiB
        Free                              : 254 MiB
    Compute Mode                          : Default
    Utilization
        Gpu                               : 0 %
        Memory                            : 0 %
        Encoder                           : 0 %
        Decoder                           : 0 %
    Encoder Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    FBC Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    Ecc Mode
        Current                           : N/A
        Pending                           : N/A
    ECC Errors
        Volatile
            Single Bit
                Device Memory             : N/A
                Register File             : N/A
                L1 Cache                  : N/A
                L2 Cache                  : N/A
                Texture Memory            : N/A
                Texture Shared            : N/A
                CBU                       : N/A
                Total                     : N/A
            Double Bit
                Device Memory             : N/A
                Register File             : N/A
                L1 Cache                  : N/A
                L2 Cache                  : N/A
                Texture Memory            : N/A
                Texture Shared            : N/A
                CBU                       : N/A
                Total                     : N/A
        Aggregate
            Single Bit
                Device Memory             : N/A
                Register File             : N/A
                L1 Cache                  : N/A
                L2 Cache                  : N/A
                Texture Memory            : N/A
                Texture Shared            : N/A
                CBU                       : N/A
                Total                     : N/A
            Double Bit
                Device Memory             : N/A
                Register File             : N/A
                L1 Cache                  : N/A
                L2 Cache                  : N/A
                Texture Memory            : N/A
                Texture Shared            : N/A
                CBU                       : N/A
                Total                     : N/A
    Retired Pages
        Single Bit ECC                    : N/A
        Double Bit ECC                    : N/A
        Pending Page Blacklist            : N/A
    Remapped Rows                         : N/A
    Temperature
        GPU Current Temp                  : 45 C
        GPU Shutdown Temp                 : 100 C
        GPU Slowdown Temp                 : 97 C
        GPU Max Operating Temp            : 91 C
        Memory Current Temp               : 43 C
        Memory Max Operating Temp         : 95 C
    Power Readings
        Power Management                  : Supported
        Power Draw                        : 30.67 W
        Power Limit                       : 250.00 W
        Default Power Limit               : 250.00 W
        Enforced Power Limit              : 250.00 W
        Min Power Limit                   : 100.00 W
        Max Power Limit                   : 300.00 W
    Clocks
        Graphics                          : 135 MHz
        SM                                : 135 MHz
        Memory                            : 850 MHz
        Video                             : 555 MHz
    Applications Clocks
        Graphics                          : 1200 MHz
        Memory                            : 850 MHz
    Default Applications Clocks
        Graphics                          : 1200 MHz
        Memory                            : 850 MHz
    Max Clocks
        Graphics                          : 1912 MHz
        SM                                : 1912 MHz
        Memory                            : 850 MHz
        Video                             : 1717 MHz
    Max Customer Boost Clocks
        Graphics                          : N/A
    Clock Policy
        Auto Boost                        : N/A
        Auto Boost Default                : N/A
    Processes                             : None

GPU 00000000:82:00.0
    Product Name                          : TITAN V
    Product Brand                         : Titan
    Display Mode                          : Disabled
    Display Active                        : Disabled
    Persistence Mode                      : Disabled
    MIG Mode
        Current                           : N/A
        Pending                           : N/A
    Accounting Mode                       : Disabled
    Accounting Mode Buffer Size           : 4000
    Driver Model
        Current                           : N/A
        Pending                           : N/A
    Serial Number                         : 0324917182924
    GPU UUID                              : GPU-e2336a65-b527-8ba6-c005-209ebc071c78
    Minor Number                          : 2
    VBIOS Version                         : 88.00.36.00.01
    MultiGPU Board                        : No
    Board ID                              : 0x8200
    GPU Part Number                       : 900-1G500-2500-000
    Inforom Version
        Image Version                     : G001.0000.01.04
        OEM Object                        : 1.1
        ECC Object                        : N/A
        Power Management Object           : N/A
    GPU Operation Mode
        Current                           : N/A
        Pending                           : N/A
    GPU Virtualization Mode
        Virtualization Mode               : None
        Host VGPU Mode                    : N/A
    IBMNPU
        Relaxed Ordering Mode             : N/A
    PCI
        Bus                               : 0x82
        Device                            : 0x00
        Domain                            : 0x0000
        Device Id                         : 0x1D8110DE
        Bus Id                            : 00000000:82:00.0
        Sub System Id                     : 0x121810DE
        GPU Link Info
            PCIe Generation
                Max                       : 3
                Current                   : 1
            Link Width
                Max                       : 16x
                Current                   : 16x
        Bridge Chip
            Type                          : N/A
            Firmware                      : N/A
        Replays Since Reset               : 0
        Replay Number Rollovers           : 0
        Tx Throughput                     : 0 KB/s
        Rx Throughput                     : 0 KB/s
    Fan Speed                             : 28 %
    Performance State                     : P8
    Clocks Throttle Reasons
        Idle                              : Active
        Applications Clocks Setting       : Not Active
        SW Power Cap                      : Not Active
        HW Slowdown                       : Not Active
            HW Thermal Slowdown           : Not Active
            HW Power Brake Slowdown       : Not Active
        Sync Boost                        : Not Active
        SW Thermal Slowdown               : Not Active
        Display Clock Setting             : Not Active
    FB Memory Usage
        Total                             : 12066 MiB
        Used                              : 0 MiB
        Free                              : 12066 MiB
    BAR1 Memory Usage
        Total                             : 256 MiB
        Used                              : 2 MiB
        Free                              : 254 MiB
    Compute Mode                          : Default
    Utilization
        Gpu                               : 0 %
        Memory                            : 0 %
        Encoder                           : 0 %
        Decoder                           : 0 %
    Encoder Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    FBC Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    Ecc Mode
        Current                           : N/A
        Pending                           : N/A
    ECC Errors
        Volatile
            Single Bit
                Device Memory             : N/A
                Register File             : N/A
                L1 Cache                  : N/A
                L2 Cache                  : N/A
                Texture Memory            : N/A
                Texture Shared            : N/A
                CBU                       : N/A
                Total                     : N/A
            Double Bit
                Device Memory             : N/A
                Register File             : N/A
                L1 Cache                  : N/A
                L2 Cache                  : N/A
                Texture Memory            : N/A
                Texture Shared            : N/A
                CBU                       : N/A
                Total                     : N/A
        Aggregate
            Single Bit
                Device Memory             : N/A
                Register File             : N/A
                L1 Cache                  : N/A
                L2 Cache                  : N/A
                Texture Memory            : N/A
                Texture Shared            : N/A
                CBU                       : N/A
                Total                     : N/A
            Double Bit
                Device Memory             : N/A
                Register File             : N/A
                L1 Cache                  : N/A
                L2 Cache                  : N/A
                Texture Memory            : N/A
                Texture Shared            : N/A
                CBU                       : N/A
                Total                     : N/A
    Retired Pages
        Single Bit ECC                    : N/A
        Double Bit ECC                    : N/A
        Pending Page Blacklist            : N/A
    Remapped Rows                         : N/A
    Temperature
        GPU Current Temp                  : 42 C
        GPU Shutdown Temp                 : 100 C
        GPU Slowdown Temp                 : 97 C
        GPU Max Operating Temp            : 91 C
        Memory Current Temp               : 38 C
        Memory Max Operating Temp         : 95 C
    Power Readings
        Power Management                  : Supported
        Power Draw                        : 27.65 W
        Power Limit                       : 250.00 W
        Default Power Limit               : 250.00 W
        Enforced Power Limit              : 250.00 W
        Min Power Limit                   : 100.00 W
        Max Power Limit                   : 300.00 W
    Clocks
        Graphics                          : 135 MHz
        SM                                : 135 MHz
        Memory                            : 850 MHz
        Video                             : 555 MHz
    Applications Clocks
        Graphics                          : 1200 MHz
        Memory                            : 850 MHz
    Default Applications Clocks
        Graphics                          : 1200 MHz
        Memory                            : 850 MHz
    Max Clocks
        Graphics                          : 1912 MHz
        SM                                : 1912 MHz
        Memory                            : 850 MHz
        Video                             : 1717 MHz
    Max Customer Boost Clocks
        Graphics                          : N/A
    Clock Policy
        Auto Boost                        : N/A
        Auto Boost Default                : N/A
    Processes                             : None

GPU 00000000:83:00.0
    Product Name                          : TITAN V
    Product Brand                         : Titan
    Display Mode                          : Disabled
    Display Active                        : Disabled
    Persistence Mode                      : Disabled
    MIG Mode
        Current                           : N/A
        Pending                           : N/A
    Accounting Mode                       : Disabled
    Accounting Mode Buffer Size           : 4000
    Driver Model
        Current                           : N/A
        Pending                           : N/A
    Serial Number                         : 0320618057916
    GPU UUID                              : GPU-5908bbc9-ddab-ebe7-5624-446b3fc15348
    Minor Number                          : 3
    VBIOS Version                         : 88.00.41.00.18
    MultiGPU Board                        : No
    Board ID                              : 0x8300
    GPU Part Number                       : 900-1G500-2500-000
    Inforom Version
        Image Version                     : G001.0000.01.04
        OEM Object                        : 1.1
        ECC Object                        : N/A
        Power Management Object           : N/A
    GPU Operation Mode
        Current                           : N/A
        Pending                           : N/A
    GPU Virtualization Mode
        Virtualization Mode               : None
        Host VGPU Mode                    : N/A
    IBMNPU
        Relaxed Ordering Mode             : N/A
    PCI
        Bus                               : 0x83
        Device                            : 0x00
        Domain                            : 0x0000
        Device Id                         : 0x1D8110DE
        Bus Id                            : 00000000:83:00.0
        Sub System Id                     : 0x121810DE
        GPU Link Info
            PCIe Generation
                Max                       : 3
                Current                   : 1
            Link Width
                Max                       : 16x
                Current                   : 16x
        Bridge Chip
            Type                          : N/A
            Firmware                      : N/A
        Replays Since Reset               : 0
        Replay Number Rollovers           : 0
        Tx Throughput                     : 0 KB/s
        Rx Throughput                     : 0 KB/s
    Fan Speed                             : 28 %
    Performance State                     : P8
    Clocks Throttle Reasons
        Idle                              : Active
        Applications Clocks Setting       : Not Active
        SW Power Cap                      : Not Active
        HW Slowdown                       : Not Active
            HW Thermal Slowdown           : Not Active
            HW Power Brake Slowdown       : Not Active
        Sync Boost                        : Not Active
        SW Thermal Slowdown               : Not Active
        Display Clock Setting             : Not Active
    FB Memory Usage
        Total                             : 12066 MiB
        Used                              : 0 MiB
        Free                              : 12066 MiB
    BAR1 Memory Usage
        Total                             : 256 MiB
        Used                              : 2 MiB
        Free                              : 254 MiB
    Compute Mode                          : Default
    Utilization
        Gpu                               : 0 %
        Memory                            : 0 %
        Encoder                           : 0 %
        Decoder                           : 0 %
    Encoder Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    FBC Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    Ecc Mode
        Current                           : N/A
        Pending                           : N/A
    ECC Errors
        Volatile
            Single Bit
                Device Memory             : N/A
                Register File             : N/A
                L1 Cache                  : N/A
                L2 Cache                  : N/A
                Texture Memory            : N/A
                Texture Shared            : N/A
                CBU                       : N/A
                Total                     : N/A
            Double Bit
                Device Memory             : N/A
                Register File             : N/A
                L1 Cache                  : N/A
                L2 Cache                  : N/A
                Texture Memory            : N/A
                Texture Shared            : N/A
                CBU                       : N/A
                Total                     : N/A
        Aggregate
            Single Bit
                Device Memory             : N/A
                Register File             : N/A
                L1 Cache                  : N/A
                L2 Cache                  : N/A
                Texture Memory            : N/A
                Texture Shared            : N/A
                CBU                       : N/A
                Total                     : N/A
            Double Bit
                Device Memory             : N/A
                Register File             : N/A
                L1 Cache                  : N/A
                L2 Cache                  : N/A
                Texture Memory            : N/A
                Texture Shared            : N/A
                CBU                       : N/A
                Total                     : N/A
    Retired Pages
        Single Bit ECC                    : N/A
        Double Bit ECC                    : N/A
        Pending Page Blacklist            : N/A
    Remapped Rows                         : N/A
    Temperature
        GPU Current Temp                  : 41 C
        GPU Shutdown Temp                 : 100 C
        GPU Slowdown Temp                 : 97 C
        GPU Max Operating Temp            : 91 C
        Memory Current Temp               : 40 C
        Memory Max Operating Temp         : 95 C
    Power Readings
        Power Management                  : Supported
        Power Draw                        : 25.91 W
        Power Limit                       : 250.00 W
        Default Power Limit               : 250.00 W
        Enforced Power Limit              : 250.00 W
        Min Power Limit                   : 100.00 W
        Max Power Limit                   : 300.00 W
    Clocks
        Graphics                          : 135 MHz
        SM                                : 135 MHz
        Memory                            : 850 MHz
        Video                             : 555 MHz
    Applications Clocks
        Graphics                          : 1200 MHz
        Memory                            : 850 MHz
    Default Applications Clocks
        Graphics                          : 1200 MHz
        Memory                            : 850 MHz
    Max Clocks
        Graphics                          : 1912 MHz
        SM                                : 1912 MHz
        Memory                            : 850 MHz
        Video                             : 1717 MHz
    Max Customer Boost Clocks
        Graphics                          : N/A
    Clock Policy
        Auto Boost                        : N/A
        Auto Boost Default                : N/A
    Processes                             : None

  • Your docker configuration file (e.g: /etc/docker/daemon.json)
{
    "init": true,
    "exec-opts": ["native.cgroupdriver=systemd"],
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
  • The k8s-device-plugin container logs
2022/08/13 07:41:28 Starting FS watcher.
2022/08/13 07:41:28 Starting OS watcher.
2022/08/13 07:41:28 Retreiving plugins.
2022/08/13 07:41:28 migstrategy= none
2022/08/13 07:41:28 uuid= GPU-f5a3f95f-2685-cf01-2063-7bc624963433
2022/08/13 07:41:28 uuid= GPU-3d5859fc-73e1-23d1-2e59-78c5e7049d61
2022/08/13 07:41:28 uuid= GPU-e2336a65-b527-8ba6-c005-209ebc071c78
2022/08/13 07:41:28 uuid= GPU-5908bbc9-ddab-ebe7-5624-446b3fc15348
2022/08/13 07:41:28 Starting GRPC server for 'nvidia.com/gpu'
2022/08/13 07:41:28 Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
2022/08/13 07:41:28 Registered device plugin for 'nvidia.com/gpu' with Kubelet
  • The kubelet logs on the node (e.g: sudo journalctl -r -u kubelet)

Additional information that might help better understand your environment and reproduce the bug:

  • Docker version from docker version
Client: Docker Engine - Community
 Version:           19.03.3
 API version:       1.40
 Go version:        go1.12.10
 Git commit:        a872fc2f86
 Built:             Tue Oct  8 00:58:10 2019
 OS/Arch:           linux/amd64
 Experimental:      false

Server: Docker Engine - Community
 Engine:
  Version:          19.03.13
  API version:      1.40 (minimum version 1.12)
  Go version:       go1.13.15
  Git commit:       4484c46d9d
  Built:            Wed Sep 16 17:02:21 2020
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.2.10
  GitCommit:        b34a5c8af56e510852c35414db4c1f4fa6172339
 nvidia:
  Version:          1.0.0-rc8+dev
  GitCommit:        3e425f80a8c931f88e6d94a8c831b9d5aa481657
 docker-init:
  Version:          0.18.0
  GitCommit:        fec3683
  • Docker command, image and tag used
  • Kernel version from uname -a
    Linux xxx 3.10.0-862.14.4.el7.x86_64 #1 SMP Wed Sep 26 15:12:11 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
  • Any relevant kernel output lines from dmesg
  • NVIDIA packages version from dpkg -l '*nvidia*' or rpm -qa '*nvidia*'
  • NVIDIA container library version from nvidia-container-cli -V
version: 1.0.0
build date: 2018-03-06T02:05+0000
build revision: be797da00b156493e80f1ae6f38d69f23c932554
build compiler: gcc 4.8.5 20150623 (Red Hat 4.8.5-16)
build platform: x86_64
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections

Committed image cannot run on another node.

The template below is mostly useful for bug reports and support questions. Feel free to remove anything which doesn't apply to you and add more information where it makes sense.

1. Issue or feature description

A committed image cannot run on another node.

2. Steps to reproduce the issue

  1. Start a pod with GPU access enabled.
  2. Commit the container to an image and push it to a registry.
  3. Start a pod with the committed image on another node.
     The container cannot run and fails with the following error:
Error response from daemon: OCI runtime create failed: container_linux.go:349: starting container process caused "process_linux.go:449: container init caused \"process_linux.go:432: running prestart hook 0 caused \\\"error running hook: 
exit status 1, stdout: , stderr: nvidia-container-cli: device error: GPU-caba9b00-6386-2c33-7834-646ef2692cb7: unknown device\\\\n\\\"\"": unknown
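
A likely explanation (not confirmed in the report) is that docker commit captures the container's environment, including the GPU UUID that the nvidia container runtime set in NVIDIA_VISIBLE_DEVICES on the original node; that UUID does not exist on the second node, which would produce exactly this "unknown device" error. A minimal workaround sketch, assuming that is the cause, is to override the variable when running the committed image (registry, image name and tag are placeholders):

# Hedged workaround sketch: override the GPU UUID assumed to be baked into the
# committed image's NVIDIA_VISIBLE_DEVICES environment variable.
docker run --rm --runtime=nvidia \
    -e NVIDIA_VISIBLE_DEVICES=all \
    <registry>/<committed-image>:<tag> nvidia-smi -L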

3. Information to attach (optional if deemed irrelevant)

Common error checking:

  • The output of nvidia-smi -a on your host
  • Your docker configuration file (e.g: /etc/docker/daemon.json)
  • The k8s-device-plugin container logs
  • The kubelet logs on the node (e.g: sudo journalctl -r -u kubelet)

Additional information that might help better understand your environment and reproduce the bug:

  • Docker version from docker version: 19.03
  • Docker command, image and tag used: docker commit
  • Kernel version from uname -a
  • Any relevant kernel output lines from dmesg
  • NVIDIA packages version from dpkg -l '*nvidia*' or rpm -qa '*nvidia*'
  • NVIDIA container library version from nvidia-container-cli -V
  • NVIDIA container library logs (see troubleshooting)

Core dump when requesting 2 or more GPUs with a Tesla T4

1. Issue or feature description

Requesting 1 GPU in the yaml works fine, but when requesting more than one, the output of nvidia-smi inside the pod is as follows:
[screenshot of nvidia-smi output omitted]
The output of nvidia-smi on the host machine is fine.

On another machine with a GeForce RTX 2070 SUPER, requesting 2 GPUs works fine.
[screenshot of nvidia-smi output omitted]
But when I run the application locally, it aborts with:

[4pdvGPU ERROR (pid:697 thread=140106827071488 context.c:189)]: cuCtxGetDevice Not Found. tid=140106827071488 ctx=0x239601906000:0x23960041a000
 home/limengxuan/work/libcuda_override/src/cuda/context.c:189: cuCtxGetDevice: Assertion `0' failed.

2. Steps to reproduce the issue

Ubuntu 20.04 + microk8s + Tesla T4 GPU + 510 driver
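
A minimal pod spec sketch to reproduce the report; only the request for 2 nvidia.com/gpu comes from the description above, the pod name and image are placeholders:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod-2x          # placeholder name
spec:
  containers:
    - name: cuda-container
      image: ubuntu:18.04   # placeholder image
      command: ["bash", "-c", "sleep 86400"]
      resources:
        limits:
          nvidia.com/gpu: 2 # requesting 2 vGPUs, as in the report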

3. Information to attach (optional if deemed irrelevant)

Common error checking:

  • The output of nvidia-smi -a on your host
    [screenshot of nvidia-smi -a output omitted]
  • Your docker configuration file (e.g: /etc/docker/daemon.json)
    {
        "default-runtime": "nvidia",
        "runtimes": {
            "nvidia": {
                "path": "nvidia-container-runtime",
                "runtimeArgs": []
            }
        }
    }

Additional information that might help better understand your environment and reproduce the bug:

  • Any relevant kernel output lines from dmesg
 nvidia-smi[2260220]: segfault at 0 ip 00007fde46d051ce sp 00007ffe1ae4c9e8 error 4 in libc-2.31.so[7fde46b9d000+178000]
[89993.700532] Code: fd d7 c9 0f bc d1 c5 fe 7f 27 c5 fe 7f 6f 20 c5 fe 7f 77 40 c5 fe 7f 7f 60 49 83 c0 1f 49 29 d0 48 8d 7c 17 61 e9 c2 04 00 00 <c5> fe 6f 1e c5 fe 6f 56 20 c5 fd 74 cb c5 fd d7 d1 49 83 f8 21 0f
[90182.697502] nvidia-smi[2265941]: segfault at 0 ip 00007f241971c1ce sp 00007fffff703d08 error 4 in libc-2.31.so[7f24195b4000+178000]
[90182.697509] Code: fd d7 c9 0f bc d1 c5 fe 7f 27 c5 fe 7f 6f 20 c5 fe 7f 77 40 c5 fe 7f 7f 60 49 83 c0 1f 49 29 d0 48 8d 7c 17 61 e9 c2 04 00 00 <c5> fe 6f 1e c5 fe 6f 56 20 c5 fd 74 cb c5 fd d7 d1 49 83 f8 21 0f

Segmentation fault (core dumped)

1. Issue or feature description

When I run the provided example, it fails with Segmentation fault (core dumped).
The card is an NVIDIA Corporation GP104GL [Tesla P4] (rev a1).

2. Steps to reproduce the issue

1. Modify the https://raw.githubusercontent.com/4paradigm/k8s-device-plugin/master/nvidia-device-plugin.yml file with the args
"--device-split-count=3", "--device-memory-scaling=1", "--device-cores-scaling=1"

2. kubectl apply -f nvidia-device-plugin.yml

3. Deploy the following pod:
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  containers:
    - name: ubuntu-container
      image: ubuntu:18.04
      command: ["bash", "-c", "sleep 86400"]
      resources:
        limits:
          nvidia.com/gpu: 2 # requesting 2 vGPUs
Then exec into the pod and run nvidia-smi; the result is:

[root@node1 4p]# kubectl exec -it gpu-pod /bin/sh

nvidia-smi
[4pdvGPU Msg(29:140241530709824:libvgpu.c:813)]: Initializing...
[4pdvGPU Msg(29:140241530709824:context.c:120)]: vdevices_pci=0000:84:00.0
Segmentation fault (core dumped)

Does the GPU need extra configuration, or is this caused by the OS itself being CentOS 7.6?

3. Attempted workaround

I do not know whether there is a deeper cause, for example a problem in the .so file when one pod is allocated 2 vGPUs.
But making the following change at the device-plugin level works around it.

for i, vd := range vdevices {
	if i != 0 { // added: only handle the first vdevice
		break
	}

	limitKey := fmt.Sprintf("CUDA_DEVICE_MEMORY_LIMIT_%v", i)
	// added: give the first vdevice the memory of all allocated vdevices
	response.Envs[limitKey] = fmt.Sprintf("%vm", vd.memory*uint64(len(vdevices)))
	mapEnvs = append(mapEnvs, fmt.Sprintf("%v:%v", i, vd.dev.ID))
}
// added: multiply the SM limit by the number of allocated vGPUs as well
response.Envs["CUDA_DEVICE_SM_LIMIT"] =
	strconv.Itoa(int(100*global.DeviceCoresScalingFlag/float64(global.DeviceSplitCountFlag)) * len(vdevices))

Split into 10 shares, but vGPU device memory does not change (NVIDIA A100)

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
  annotations:
    deprecated.daemonset.template.generation: '2'
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  template:
    metadata:
      creationTimestamp: null
      labels:
        name: nvidia-device-plugin-ds
      annotations:
        scheduler.alpha.kubernetes.io/critical-pod: ''
    spec:
      volumes:
        - name: device-plugin
          hostPath:
            path: /var/lib/kubelet/device-plugins
            type: ''
        - name: vgpu-dir
          hostPath:
            path: /usr/local/vgpu
            type: ''
      containers:
        - name: nvidia-device-plugin-ctr
          image: 4pdosc/k8s-device-plugin:latest
          args:
            - '--fail-on-init-error=true'
            - '--device-split-count=10'
            - '--device-memory-scaling=1'
            - '--device-cores-scaling=1'
          env:
            - name: PCIBUSFILE
              value: /usr/local/vgpu/pciinfo.vgpu
            - name: NVIDIA_MIG_MONITOR_DEVICES
              value: all
          resources: {}
          volumeMounts:
            - name: device-plugin
              mountPath: /var/lib/kubelet/device-plugins
            - name: vgpu-dir
              mountPath: /usr/local/vgpu
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          imagePullPolicy: Always
          securityContext:
            capabilities:
              add:
                - SYS_ADMIN
              drop:
                - ALL
            allowPrivilegeEscalation: false
      restartPolicy: Always

Thu May 5 09:50:57 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.60.02    Driver Version: 510.60.02    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  Off  | 00000000:00:0C.0 Off |                    0 |
| N/A   28C    P0    51W / 400W |    413MiB / 40960MiB |      3%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                   |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     77977      C                                     411MiB |
+-----------------------------------------------------------------------------+

The device memory shown is still 40 GB.
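
A quick sanity check, assuming the per-vGPU memory limit is passed to containers via environment variables such as CUDA_DEVICE_MEMORY_LIMIT_0 (as in the patch quoted earlier) and enforced by the preloaded libvgpu.so, is to inspect a running pod that requested a vGPU (pod name is a placeholder):

# Hedged check: the limit env vars should be present and nvidia-smi inside the
# pod should report the per-vGPU share rather than the full 40960MiB.
kubectl exec -it gpu-pod -- env | grep CUDA_DEVICE_MEMORY_LIMIT
kubectl exec -it gpu-pod -- nvidia-smi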

Error: failed to create FS watcher: no such file or directory

2021/08/26 07:14:50 Loading PciInfo
0 = 00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] (rev 02)
1 = 00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
2 = 00:01.1 IDE interface: Intel Corporation 82371SB PIIX3 IDE [Natoma/Triton II]
3 = 00:01.2 USB controller: Intel Corporation 82371SB PIIX3 USB [Natoma/Triton II] (rev 01)
4 = 00:01.3 Bridge: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 03)
5 = 00:02.0 VGA compatible controller: Cirrus Logic GD 5446
6 = 00:03.0 Multimedia audio controller: Intel Corporation 82801AA AC'97 Audio Controller (rev 01)
7 = 00:04.0 Communication controller: Red Hat, Inc. Virtio console
8 = 00:05.0 SCSI storage controller: Red Hat, Inc. Virtio block device
9 = 00:06.0 SCSI storage controller: Red Hat, Inc. Virtio block device
10 = 00:07.0 SCSI storage controller: Red Hat, Inc. Virtio block device
11 = 00:08.0 Ethernet controller: Red Hat, Inc. Virtio network device
12 = 00:09.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 16GB] (rev a1)
found 00:09.0
13 = 00:0a.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 16GB] (rev a1)
found 00:0a.0
14 = 00:0b.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 16GB] (rev a1)
found 00:0b.0
15 = 00:0c.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 16GB] (rev a1)
found 00:0c.0
16 = 00:0d.0 Unclassified device [00ff]: Red Hat, Inc. Virtio memory balloon
17 = 00:0e.0 Ethernet controller: Red Hat, Inc. Virtio network device
18 = 00:0f.0 Ethernet controller: Red Hat, Inc. Virtio network device
19 = 00:10.0 Ethernet controller: Red Hat, Inc. Virtio network device
2021/08/26 07:14:50 Loading NVML
20 = 00:11.0 Ethernet controller: Red Hat, Inc. Virtio network device
21 = 00:12.0 Ethernet controller: Red Hat, Inc. Virtio network device
22 = 00:13.0 Ethernet controller: Red Hat, Inc. Virtio network device
23 = 00:14.0 Ethernet controller: Red Hat, Inc. Virtio network device
24 =
pcibusstr= 00:09.0
00:0a.0
00:0b.0
00:0c.0

2021/08/26 07:14:50 Starting FS watcher.
2021/08/26 07:14:50 Shutdown of NVML returned:
2021/08/26 07:14:50 Error: failed to create FS watcher: no such file or directory

Driver Version: 440.64.00

Failed to initialize NVML: could not load NVML library.

ENV :

K8s : v1.23.10
Runtime: docker 20.10.8
NVIDIA System Management Interface -- v535.161.07
Image: 4pdosc/k8s-device-plugin:v0.10.0.4-ubuntu20.04

Issue:

After deploying the plugin DaemonSet, the logs show:

2024/03/27 15:41:13 Loading PciInfo

 0 = 00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] (rev 02)

 1 = 00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]

 2 = 00:01.1 IDE interface: Intel Corporation 82371SB PIIX3 IDE [Natoma/Triton II]

 3 = 00:01.2 USB controller: Intel Corporation 82371SB PIIX3 USB [Natoma/Triton II] (rev 01)

 4 = 00:01.3 Bridge: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 03)

 5 = 00:02.0 VGA compatible controller: Cirrus Logic GD 5446

 6 = 00:03.0 PCI bridge: Red Hat, Inc. QEMU PCI-PCI bridge

 7 = 00:04.0 PCI bridge: Red Hat, Inc. QEMU PCI-PCI bridge

 8 = 00:05.0 Ethernet controller: Red Hat, Inc. Virtio network device

 9 = 00:06.0 Audio device: Intel Corporation 82801FB/FBM/FR/FW/FRW (ICH6 Family) High Definition Audio Controller (rev 01)

 10 = 00:07.0 SCSI storage controller: Red Hat, Inc. Virtio block device

 11 = 00:08.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 32GB] (rev a1)

 found 00:08.0

 12 = 00:09.0 Unclassified device [00ff]: Red Hat, Inc. Virtio memory balloon

 13 = 

 pcibusstr= 00:08.0


 2024/03/27 15:41:13 Loading NVML

 2024/03/27 15:41:13 Failed to initialize NVML: could not load NVML library.

 2024/03/27 15:41:13 If this is a GPU node, did you set the docker default runtime to `nvidia`?

 2024/03/27 15:41:13 You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites

 2024/03/27 15:41:13 You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start

 2024/03/27 15:41:13 If this is not a GPU node, you should set up a toleration or nodeSelector to only deploy this plugin on GPU nodes

  1. I have checked the environment, and nvidia-smi works on the VM:
root@master:/usr/local/vgpu# nvidia-smi 
Wed Mar 27 15:46:02 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.07             Driver Version: 535.161.07   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla V100-SXM2-32GB           Off | 00000000:00:08.0 Off |                    0 |
| N/A   31C    P0              23W / 300W |      0MiB / 32768MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
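
Since nvidia-smi works on the host while NVML cannot be loaded inside the plugin container, the hint in the plugin's own log is worth verifying first. A minimal check, assuming Docker is the runtime as stated in the environment above:

# Hedged check: the device plugin container needs the nvidia runtime to see the
# driver libraries, so the default runtime should be "nvidia".
docker info | grep -i "default runtime"
# expected: Default Runtime: nvidia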

undefined symbol:_dl_sym,version GLIBC_PRIVATE

When I build this plugin from the latest source code and deploy it in my k8s cluster, the ML process using the GPU fails with the error "symbol lookup error: /usr/local/vgpu/libvgpu.so: undefined symbol: _dl_sym, version GLIBC_PRIVATE".
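
_dl_sym is a private glibc symbol, so this usually points to a mismatch between the glibc the preload library was built against and the glibc on the node. A quick diagnostic (an assumption about the root cause, not a confirmed fix) is to compare the node's glibc version with the build environment's:

# Hedged diagnostic: print the glibc version on the node where libvgpu.so is preloaded.
ldd --version | head -n 1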

Device memory isolation

We are currently experimenting with this project on an internal test cluster.
The cluster version information is as follows:
Client Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.2", GitCommit:"8b5a19147530eaac9476b0ab82980b4088bbc1b2", GitTreeState:"clean", BuildDate:"2021-09-15T21:38:50Z", GoVersion:"go1.16.8", Compiler:"gc", Platform:"linux/amd64"}

The docker version is as follows:
Client: Docker Engine - Community
 Version:           20.10.10
 API version:       1.41
 Go version:        go1.16.9
 Git commit:        b485636
 Built:             Mon Oct 25 07:42:59 2021
 OS/Arch:           linux/amd64
 Context:           default
 Experimental:      true

Server: Docker Engine - Community
 Engine:
  Version:          20.10.9
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.16.8
  Git commit:       79ea9d3
  Built:            Mon Oct  4 16:06:37 2021
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.4.11
  GitCommit:        5b46e404f6b9f661a205e28d59c982d3634148f8
 nvidia:
  Version:          1.0.2
  GitCommit:        v1.0.2-0-g52b36a2
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0

The yaml used to deploy the GPU plugin is as follows:
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      # This annotation is deprecated. Kept here for backward compatibility
      # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
      annotations:
        scheduler.alpha.kubernetes.io/critical-pod: ""
      labels:
        name: nvidia-device-plugin-ds
    spec:
      tolerations:
        # This toleration is deprecated. Kept here for backward compatibility
        # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
        - key: CriticalAddonsOnly
          operator: Exists
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      nodeSelector:
        nvidia-device-enable: enable
      # Mark this pod as a critical add-on; when enabled, the critical add-on
      # scheduler reserves resources for critical add-on pods so that they can
      # be rescheduled after a failure.
      # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
      priorityClassName: "system-node-critical"
      containers:
        - image: 4pdosc/k8s-device-plugin:latest
          # - image: m7-ieg-pico-test01:5000/k8s-device-plugin-test:v0.9.0-ubuntu20.04
          imagePullPolicy: Always
          name: nvidia-device-plugin-ctr
          args: ["--fail-on-init-error=true", "--device-split-count=3", "--device-memory-scaling=1", "--device-cores-scaling=1"]
          env:
            - name: PCIBUSFILE
              value: "/usr/local/vgpu/pciinfo.vgpu"
          securityContext:
            allowPrivilegeEscalation: false
            capabilities:
              drop: ["ALL"]
          volumeMounts:
            - name: device-plugin
              mountPath: /var/lib/kubelet/device-plugins
            - name: vgpu-dir
              mountPath: /usr/local/vgpu
      volumes:
        - name: device-plugin
          hostPath:
            path: /var/lib/kubelet/device-plugins
        - name: vgpu-dir
          hostPath:
            path: /usr/local/vgpu

The GPU driver information is as follows:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 970     Off  | 00000000:01:00.0 Off |                  N/A |
| 36%   33C    P8    27W / 200W |      0MiB /  4041MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

After the GPU is successfully split in the cluster, we start different pods that use vGPUs, but device memory isolation does not seem to be in effect, and the pods interfere with each other when training at the same time. Is this caused by my CUDA version, or is device memory actually not isolated?

Many thanks.
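
One way to tell the two cases apart is a hedged check, assuming the per-vGPU limit is enforced by the preloaded libvgpu.so and reflected inside the container: compare what each pod reports with what the host reports while both are training (pod names below are placeholders):

# Hedged isolation check.
kubectl exec -it gpu-pod-a -- nvidia-smi   # should show the per-vGPU share, not the full 4041MiB
kubectl exec -it gpu-pod-b -- nvidia-smi
# On the host, both pods' processes should appear, each bounded by its share.
nvidia-smi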
