4paradigm / k8s-vgpu-scheduler

The OpenAIOS vGPU device plugin for Kubernetes originated from the OpenAIOS project to virtualize GPU device memory, allowing applications to access a larger memory space than the physical capacity. It is designed to make extended device memory easy to use for AI workloads.

License: Apache License 2.0

Go 72.89% Makefile 0.21% Shell 0.64% Dockerfile 0.22% Smarty 0.46% C 25.59%

k8s-vgpu-scheduler's Introduction

English version|中文版

OpenAIOS vGPU scheduler for Kubernetes


Supported devices

NVIDIA GPU, Cambricon MLU, Hygon DCU

Note: This project has been renamed to Project-HAMi. We keep the old repo here for compatibility reasons.

Introduction

The 4paradigm k8s vGPU scheduler is an "all in one" chart to manage GPUs in your Kubernetes cluster. It has everything you would expect from a Kubernetes GPU manager, including:

GPU sharing: Each task can allocate a portion of a GPU instead of a whole card, so a single GPU can be shared among multiple tasks.

Device Memory Control: GPUs can be allocated a specific amount of device memory (e.g. 3000M) or a percentage of the whole GPU's memory (e.g. 50%), and the plugin enforces that the allocation is not exceeded.

Virtual Device memory: You can oversubscribe GPU device memory by using host memory as its swap.

GPU Type Specification: You can specify which type of GPU to use or to avoid for a certain GPU task by setting the "nvidia.com/use-gputype" or "nvidia.com/nouse-gputype" annotations (see the annotation example after this list).

Easy to use: You don't need to modify your task YAML to use our scheduler. All your GPU jobs are automatically supported after installation. In addition, you can specify a resource name other than "nvidia.com/gpu" if you wish.
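
For example, a minimal sketch of the GPU type annotations on a pod; the annotation keys come from the list above, while the pod name and the GPU type strings are illustrative placeholders:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-type-example
  annotations:
    nvidia.com/use-gputype: "A100,V100"   # placeholder: prefer these GPU types
    nvidia.com/nouse-gputype: "1080Ti"    # placeholder: avoid this GPU type
spec:
  containers:
    - name: ubuntu-container
      image: ubuntu:18.04
      command: ["bash", "-c", "sleep 86400"]
      resources:
        limits:
          nvidia.com/gpu: 1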

The k8s vGPU scheduler retains the features of the 4paradigm k8s-device-plugin (4paradigm/k8s-device-plugin), such as splitting the physical GPU and limiting device memory and compute units. It adds a scheduling module to balance GPU usage across GPU nodes. In addition, it allows users to allocate GPUs by specifying device memory and device core usage. Furthermore, the vGPU scheduler can virtualize device memory (the device memory in use can exceed the physical device memory), which makes it possible to run tasks with large device memory requirements or to increase the number of shared tasks. You can refer to the benchmarks report.

When to use

  1. Scenarios where pods need to be allocated a certain amount of device memory or a certain number of device cores.
  2. The need to balance GPU usage in a cluster with multiple GPU nodes.
  3. Low utilization of device memory and compute units, such as running 10 tf-serving instances on one GPU.
  4. Situations that require a large number of small GPUs, such as teaching scenarios where one GPU is provided to multiple students, or cloud platforms that offer small GPU instances.
  5. Cases of insufficient physical device memory, where virtual device memory can be turned on, such as training with large batches or large models.

Prerequisites

The list of prerequisites for running the NVIDIA device plugin is described below:

  • NVIDIA drivers >= 384.81
  • nvidia-docker version > 2.0
  • Kubernetes version >= 1.16
  • glibc >= 2.17
  • kernel version >= 3.10
  • helm > 3.0

Quick Start

Preparing your GPU Nodes

The following steps need to be executed on all your GPU nodes. This README assumes that the NVIDIA drivers and the nvidia-container-toolkit have been pre-installed. It also assumes that you have configured the nvidia-container-runtime as the default low-level runtime to use.

Please see: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html

Example for debian-based systems with docker and containerd

Install the nvidia-container-toolkit
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/libnvidia-container/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | sudo tee /etc/apt/sources.list.d/libnvidia-container.list

sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
Configure docker

When running kubernetes with docker, edit the config file which is usually present at /etc/docker/daemon.json to set up nvidia-container-runtime as the default low-level runtime:

{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}

And then restart docker:

$ sudo systemctl daemon-reload && sudo systemctl restart docker
Configure containerd

When running kubernetes with containerd, edit the config file which is usually present at /etc/containerd/config.toml to set up nvidia-container-runtime as the default low-level runtime:

version = 2
[plugins]
  [plugins."io.containerd.grpc.v1.cri"]
    [plugins."io.containerd.grpc.v1.cri".containerd]
      default_runtime_name = "nvidia"

      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
          privileged_without_host_devices = false
          runtime_engine = ""
          runtime_root = ""
          runtime_type = "io.containerd.runc.v2"
          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
            BinaryName = "/usr/bin/nvidia-container-runtime"

And then restart containerd:

$ sudo systemctl daemon-reload && sudo systemctl restart containerd

Then you need to label the GPU nodes that should be scheduled by the 4pd-k8s-scheduler by adding the "gpu=on" label; otherwise they cannot be managed by our scheduler.

kubectl label nodes {nodeid} gpu=on

Enabling vGPU Support in Kubernetes

First, check your Kubernetes version by using the following command

kubectl version

Then, add our repo in helm

helm repo add vgpu-charts https://4paradigm.github.io/k8s-vgpu-scheduler

You need to set the Kubernetes scheduler image version according to your Kubernetes server version during installation. For example, if your cluster server version is 1.16.8, then you should use the following command for deployment

helm install vgpu vgpu-charts/vgpu --set scheduler.kubeScheduler.imageTag=v1.16.8 -n kube-system

You can customize your installation by adjusting configs.
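
For example, a hedged sketch that overrides the scheduler webhook NodePort and the monitoring port documented later in this README (the values shown here are just the defaults):

helm install vgpu vgpu-charts/vgpu \
  --set scheduler.kubeScheduler.imageTag=v1.16.8 \
  --set devicePlugin.service.schedulerPort=31998 \
  --set devicePlugin.service.httpPort=31992 \
  -n kube-system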

You can verify your installation by the following command:

$ kubectl get pods -n kube-system

If both the vgpu-device-plugin and vgpu-scheduler pods are in the Running state, your installation was successful.

Running GPU Jobs

NVIDIA vGPUs can now be requested by a container using the nvidia.com/gpu resource type:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  containers:
    - name: ubuntu-container
      image: ubuntu:18.04
      command: ["bash", "-c", "sleep 86400"]
      resources:
        limits:
          nvidia.com/gpu: 2 # requesting 2 vGPUs
          nvidia.com/gpumem: 3000 # Each vGPU contains 3000m device memory (Optional,Integer)
          nvidia.com/gpucores: 30 # Each vGPU uses 30% of the entire GPU (Optional,Integer)

Be aware that if the task cannot fit on any GPU node (i.e. the number of nvidia.com/gpu you request exceeds the number of GPUs on every node), the task will get stuck in the Pending state.

You can now execute the nvidia-smi command in the container and see the difference in GPU memory between the vGPU and the real GPU.
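
For instance, assuming the example pod above is named gpu-pod:

kubectl exec -it gpu-pod -- nvidia-smi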

WARNING: if you don't request vGPUs when using the device plugin with NVIDIA images, all the vGPUs on the machine will be exposed inside your container.

More examples

Click here

Scheduler Webhook Service NodePort

The default schedulerPort is 31998; other values can be set using --set devicePlugin.service.schedulerPort during installation.

Monitoring vGPU status

Monitoring is automatically enabled after installation. You can get vGPU status of a node by visiting

http://{nodeip}:{monitorPort}/metrics

The default monitorPort is 31992; other values can be set using --set devicePlugin.service.httpPort during installation.
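
For example, with the default port:

curl http://{nodeip}:31992/metrics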

grafana dashboard example

Note: The status of a node won't be collected until a GPU operation has run on it.

Upgrade

To upgrade k8s-vGPU to the latest version, all you need to do is update the repo and reinstall the chart.

$ helm uninstall vgpu -n kube-system
$ helm repo update
$ helm install vgpu vgpu-charts/vgpu -n kube-system

Uninstall

helm uninstall vgpu -n kube-system

Scheduling

The current scheduling strategy is to select the GPU with the fewest tasks, which balances the load across multiple GPUs.

Benchmarks

Three instances from ai-benchmark have been used to evaluate vGPU-device-plugin performance as follows

Test environment:
  Kubernetes version: v1.12.9
  Docker version: 18.09.1
  GPU type: Tesla V100
  GPU count: 2

Test instances:
  nvidia-device-plugin: k8s + NVIDIA k8s-device-plugin
  vGPU-device-plugin: k8s + vGPU k8s-device-plugin, without virtual device memory
  vGPU-device-plugin (virtual device memory): k8s + vGPU k8s-device-plugin, with virtual device memory

Test Cases:

test id case type params
1.1 Resnet-V2-50 inference batch=50,size=346*346
1.2 Resnet-V2-50 training batch=20,size=346*346
2.1 Resnet-V2-152 inference batch=10,size=256*256
2.2 Resnet-V2-152 training batch=10,size=256*256
3.1 VGG-16 inference batch=20,size=224*224
3.2 VGG-16 training batch=2,size=224*224
4.1 DeepLab inference batch=2,size=512*512
4.2 DeepLab training batch=1,size=384*384
5.1 LSTM inference batch=100,size=1024*300
5.2 LSTM training batch=10,size=1024*300

Test Result: (benchmark result charts)

To reproduce:

  1. Install k8s-vGPU-scheduler and configure it properly.
  2. Run the benchmark job:
$ kubectl apply -f benchmarks/ai-benchmark/ai-benchmark.yml
  3. View the result by using kubectl logs:
$ kubectl logs [pod id]

Features

  • Specify the number of vGPUs each physical GPU is divided into.
  • Limit each vGPU's device memory.
  • Allocate vGPUs by specifying device memory.
  • Limit each vGPU's streaming multiprocessors.
  • Allocate vGPUs by specifying device core usage.
  • Zero changes to existing programs.

Experimental Features

  • Virtual Device Memory

    The device memory of the vGPU can exceed the physical device memory of the GPU. The excess part is put in host RAM, which has a certain impact on performance (see the install sketch below).
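
A minimal sketch of turning this on at install time, assuming the chart exposes a devicePlugin.deviceMemoryScaling value (this value also appears in the issue reports below); a factor greater than 1 oversubscribes device memory:

# Assumption: deviceMemoryScaling=2 advertises twice the physical device memory per GPU
helm install vgpu vgpu-charts/vgpu --set scheduler.kubeScheduler.imageTag=v1.16.8 --set devicePlugin.deviceMemoryScaling=2 -n kube-system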

Known Issues

  • Currently, A100 MIG can only support "none" and "mixed" mode
  • Currently, tasks with the "nodeName" field set can't be scheduled; please use "nodeSelector" instead (see the example after this list)
  • Currently, only computing tasks are supported, and video codec processing is not supported.
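
For example, a minimal sketch of pinning a pod to a node via nodeSelector instead of nodeName (the node name gpu-node-1 is a placeholder):

apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  # nodeName: gpu-node-1                  # not supported by the vGPU scheduler
  nodeSelector:
    kubernetes.io/hostname: gpu-node-1    # schedule onto this node instead
  containers:
    - name: ubuntu-container
      image: ubuntu:18.04
      command: ["bash", "-c", "sleep 86400"]
      resources:
        limits:
          nvidia.com/gpu: 1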

TODO

  • Support video codec processing
  • Support Multi-Instance GPUs (MIG)

Tests

  • TensorFlow 1.14.0/2.4.1
  • torch 1.1.0
  • mxnet 1.4.0
  • mindspore 1.1.1

The above frameworks have passed the test.

Issues and Contributing

Authors

Contact

Owner & Maintainer: Limengxuan

Feel free to reach me by

email: <[email protected]> 
phone: +86 18810644493
WeChat: xuanzong4493

k8s-vgpu-scheduler's People

Contributors

archlitchi, atttx123, calvin0327, chaunceyjiang, coderth, gsakun, haitwang-cloud, klueska, lengrongfu, peizhaoyou, wawa0210, whybeyoung, zhengbingxian


k8s-vgpu-scheduler's Issues

vgpu-device-plugin CreateContainerError

kubectl logs -n kube-system vgpu-admission-patch-n7psv
W0516 05:24:15.597365 1 client_config.go:615] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
{"level":"info","msg":"patching webhook configurations 'vgpu-webhook' mutating=true, validating=false, failurePolicy=Fail","source":"k8s/k8s.go:118","time":"2024-05-16T05:24:15Z"}
{"err":"mutatingwebhookconfigurations.admissionregistration.k8s.io "vgpu-webhook" not found","level":"fatal","msg":"failed getting mutating webhook","source":"cmd/patch.go:103","time":"2024-05-16T05:24:15Z"}

Device memory display issue

Running nvidia-smi inside the container returns the following:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.64.00 Driver Version: 440.64.00 CUDA Version: 10.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... On | 00000000:00:0A.0 Off | 0 |
| N/A 36C P0 42W / 300W | 112MiB / 16160MiB | 0% Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |

Memory-Usage: 112MiB / 16160MiB

  1. No program is running yet, so why does it show 112MiB already in use?
  2. By default one card is equivalent to 3 vGPU cards; shouldn't the total device memory be 16160MiB/3?

vgpu chart not found in repo

When I execute the command: helm install vgpu vgpu-charts/vgpu --set scheduler.kubeScheduler.imageTag=v1.25.0 --set devicePlugin.deviceMemoryScaling=3 -n kube-system, it prompts Error: INSTALLATION FAILED: chart "vgpu" matching not found in vgpu-charts index. (try 'helm repo update'): no chart name found, indicating that the vgpu repo does not exist anymore.

The splitting feature does not work; please help

Environment description
Kubernetes version v1.11.2
Docker version 18.03.1-ce
GPU Type Tesla V100
GPU Num 2

The configuration parameters are:

"args": [
"--fail-on-init-error=false",
"--device-split-count=2",
"--device-memory-scaling=2",
"--device-cores-scaling=2"
],

Checking the GPU node with kubectl describe node xxx.xxx.xxx.xxx:

Capacity:
cpu: 36
ephemeral-storage: 3478455808Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 131500528Ki
nvidia.com/gpu: 2
pods: 110

The nvidia.com/gpu count did not become 4 as in the demo. Could the author please take a look? Thanks a lot.

Handle_remap not found handle

1. Issue or feature description

The "Handle_remap not found handle" problem occasionally occurs when using vGPU.

2. Steps to reproduce the issue

It occurs occasionally; recreating the pod restores normal behavior.
Running nvidia-smi inside the affected pod's container reports an error.

Running nvidia-smi on the host works fine.
Running nvidia-smi in other pods on the same host works fine.

3. Information to attach (optional if deemed irrelevant)

Error log

root@service416776181220773888-55d7479f64-tvg9r:/# nvidia-smi
[4pdvGPU Debug(99:140414784235264:libvgpu.c:39)]: init_dlsym

[4pdvGPU Debug(99:140414784235264:libvgpu.c:61)]: into dlsym nvmlInitWithFlags
[4pdvGPU Debug(99:140414784235264:hook.c:542)]: nvmlInitWithFlags
...
[4pdvGPU Debug(99:140414784235264:hook.c:129)]: LOADING cuEventDestroy_v2 89
[4pdvGPU Debug(99:140414784235264:hook.c:129)]: LOADING cuModuleLoadDataEx 90
[4pdvGPU Debug(99:140414784235264:hook.c:129)]: LOADING cuModuleLoadFatBinary 91
[4pdvGPU Debug(99:140414784235264:hook.c:129)]: LOADING cuModuleGetFunction 92
[4pdvGPU Info(99:140414784235264:hook.c:136)]: loaded_cuda_libraries
[4pdvGPU Debug(99:140414784235264:multiprocess_memory_limit.c:476)]: Try create shrreg
[4pdvGPU Debug(99:140414784235264:hook.c:558)]: nvmlInit_v2
[4pdvGPU Debug(99:140414784235264:hook.c:560)]: Hijacking nvmlInit_v2
[4pdvGPU Debug(99:140414784235264:hook.c:542)]: nvmlInitWithFlags
[4pdvGPU Debug(99:140414784235264:hook.c:544)]: Hijacking nvmlInitWithFlags
[4pdvGPU Debug(99:140414784235264:hook.c:472)]: nvmlDeviceGetHandleByIndex_v2 index=0
[4pdvGPU Debug(99:140414784235264:hook.c:476)]: Hijacking nvmlDeviceGetHandleByIndex_v2
[4pdvGPU Debug(99:140414784235264:nvml_entry.c:775)]: Hijacking nvmlDeviceGetUUID
[4pdvGPU Debug(99:140414784235264:hook.c:472)]: nvmlDeviceGetHandleByIndex_v2 index=0
[4pdvGPU Debug(99:140414784235264:hook.c:476)]: Hijacking nvmlDeviceGetHandleByIndex_v2
[4pdvGPU Debug(99:140414784235264:nvml_entry.c:775)]: Hijacking nvmlDeviceGetUUID
[4pdvGPU Debug(99:140414784235264:hook.c:472)]: nvmlDeviceGetHandleByIndex_v2 index=1
[4pdvGPU Debug(99:140414784235264:hook.c:476)]: Hijacking nvmlDeviceGetHandleByIndex_v2
[4pdvGPU Debug(99:140414784235264:nvml_entry.c:775)]: Hijacking nvmlDeviceGetUUID
[4pdvGPU Debug(99:140414784235264:hook.c:472)]: nvmlDeviceGetHandleByIndex_v2 index=2
[4pdvGPU Debug(99:140414784235264:hook.c:476)]: Hijacking nvmlDeviceGetHandleByIndex_v2
[4pdvGPU Debug(99:140414784235264:nvml_entry.c:775)]: Hijacking nvmlDeviceGetUUID
[4pdvGPU ERROR (pid:99 thread=140414784235264 hook.c:285)]: Handle_remap not found handle=7fb4daa19938
nvidia-smi: /home/limengxuan/work/libcuda_override/src/nvml/hook.c:285: handle_remap: Assertion `0' failed.
Aborted (core dumped)

error-in-container.log

Host nvidia-smi -a output:
nvidia-smi-host.txt

2 vGPUs allocated but only 1 visible

On a machine with 8 A100 cards, each card is split into 5 vGPUs (--device-split-count=5). After creating a pod with 2 vGPUs, the nvidia-smi command inside the container shows only one vGPU, while two GPUs are visible under the /dev directory. The k8s-vgpu-plugin version is v0.9.0.18.

vGPU limit issue

I am using a libvgpu.so updated 6 months ago. It works: on PyTorch it behaves normally and exceeding the device memory size raises an error as expected. But on TensorFlow it does not: the device memory limit does not take effect and usage can exceed the split size without reporting an error.

parameter devicePlugin.deviceSplitCount does not work

I used helm to install k8s-vgpu-scheduler with devicePlugin.deviceSplitCount = 5. After it deployed successfully, I ran 'kubectl describe node' and saw 40 allocatable 'nvidia.com/gpu' resources (the machine has 8 A40 cards). Then I created 6 pods, each requesting 1 'nvidia.com/gpu', but when I created a pod that needs 3 'nvidia.com/gpu', Kubernetes said the pod could not be scheduled.

The vgpu-scheduler logs are shown below; they seem to say that only 2 GPU cards are usable?
I0313 00:58:35.594437 1 score.go:65] "devices status" I0313 00:58:35.594467 1 score.go:67] "device status" device id="GPU-0707087e-8264-4ba4-bc45-30c70272ec4a" device detail={"Id":"GPU-0707087e-8264-4ba4-bc45-30c70272ec4a","Index":0,"Used":0,"Count":10,"Usedmem":0,"Totalmem":46068,"Totalcore":100,"Usedcores":0,"Numa":0,"Type":"NVIDIA-NVIDIA A40","Health":true} I0313 00:58:35.594519 1 score.go:67] "device status" device id="GPU-b3e35ad4-81ee-0aee-9865-4787748b93ce" device detail={"Id":"GPU-b3e35ad4-81ee-0aee-9865-4787748b93ce","Index":1,"Used":0,"Count":10,"Usedmem":0,"Totalmem":46068,"Totalcore":100,"Usedcores":0,"Numa":0,"Type":"NVIDIA-NVIDIA A40","Health":true} I0313 00:58:35.594542 1 score.go:67] "device status" device id="GPU-d38a391c-9f2f-395e-2f91-1785a648f6c4" device detail={"Id":"GPU-d38a391c-9f2f-395e-2f91-1785a648f6c4","Index":2,"Used":1,"Count":10,"Usedmem":46068,"Totalmem":46068,"Totalcore":100,"Usedcores":0,"Numa":0,"Type":"NVIDIA-NVIDIA A40","Health":true} I0313 00:58:35.594568 1 score.go:67] "device status" device id="GPU-7099a282-5a75-55f8-0cd0-a4b48098ae1e" device detail={"Id":"GPU-7099a282-5a75-55f8-0cd0-a4b48098ae1e","Index":3,"Used":1,"Count":10,"Usedmem":46068,"Totalmem":46068,"Totalcore":100,"Usedcores":0,"Numa":0,"Type":"NVIDIA-NVIDIA A40","Health":true} I0313 00:58:35.594600 1 score.go:67] "device status" device id="GPU-56967eb2-30b7-c808-367a-225b8bd8a12e" device detail={"Id":"GPU-56967eb2-30b7-c808-367a-225b8bd8a12e","Index":4,"Used":1,"Count":10,"Usedmem":46068,"Totalmem":46068,"Totalcore":100,"Usedcores":0,"Numa":0,"Type":"NVIDIA-NVIDIA A40","Health":true} I0313 00:58:35.594639 1 score.go:67] "device status" device id="GPU-54191405-e5a9-2f7b-8ac4-f4e86c6669cb" device detail={"Id":"GPU-54191405-e5a9-2f7b-8ac4-f4e86c6669cb","Index":5,"Used":1,"Count":10,"Usedmem":46068,"Totalmem":46068,"Totalcore":100,"Usedcores":0,"Numa":0,"Type":"NVIDIA-NVIDIA A40","Health":true} I0313 00:58:35.594671 1 score.go:67] "device status" device id="GPU-e731cd15-879f-6d00-485d-d1b468589de9" device detail={"Id":"GPU-e731cd15-879f-6d00-485d-d1b468589de9","Index":6,"Used":1,"Count":10,"Usedmem":46068,"Totalmem":46068,"Totalcore":100,"Usedcores":0,"Numa":0,"Type":"NVIDIA-NVIDIA A40","Health":true} I0313 00:58:35.594693 1 score.go:67] "device status" device id="GPU-865edbf8-5d63-8e57-5e14-36682179eaf6" device detail={"Id":"GPU-865edbf8-5d63-8e57-5e14-36682179eaf6","Index":7,"Used":1,"Count":10,"Usedmem":46068,"Totalmem":46068,"Totalcore":100,"Usedcores":0,"Numa":0,"Type":"NVIDIA-NVIDIA A40","Health":true} I0313 00:58:35.594725 1 score.go:90] "Allocating device for container request" pod="default/gpu-pod-2" card request={"Nums":5,"Type":"NVIDIA","Memreq":0,"MemPercentagereq":100,"Coresreq":0} I0313 00:58:35.594757 1 score.go:93] "scoring pod" pod="default/gpu-pod-2" Memreq=0 MemPercentagereq=100 Coresreq=0 Nums=5 device index=7 device="GPU-b3e35ad4-81ee-0aee-9865-4787748b93ce" I0313 00:58:35.594800 1 score.go:140] "first fitted" pod="default/gpu-pod-2" device="GPU-b3e35ad4-81ee-0aee-9865-4787748b93ce" I0313 00:58:35.594829 1 score.go:93] "scoring pod" pod="default/gpu-pod-2" Memreq=0 MemPercentagereq=100 Coresreq=0 Nums=4 device index=6 device="GPU-0707087e-8264-4ba4-bc45-30c70272ec4a" I0313 00:58:35.594850 1 score.go:140] "first fitted" pod="default/gpu-pod-2" device="GPU-0707087e-8264-4ba4-bc45-30c70272ec4a" I0313 00:58:35.594869 1 score.go:93] "scoring pod" pod="default/gpu-pod-2" Memreq=0 MemPercentagereq=100 Coresreq=0 Nums=3 device index=5 
device="GPU-865edbf8-5d63-8e57-5e14-36682179eaf6" I0313 00:58:35.594889 1 score.go:93] "scoring pod" pod="default/gpu-pod-2" Memreq=0 MemPercentagereq=100 Coresreq=0 Nums=3 device index=4 device="GPU-e731cd15-879f-6d00-485d-d1b468589de9" I0313 00:58:35.594911 1 score.go:93] "scoring pod" pod="default/gpu-pod-2" Memreq=0 MemPercentagereq=100 Coresreq=0 Nums=3 device index=3 device="GPU-54191405-e5a9-2f7b-8ac4-f4e86c6669cb" I0313 00:58:35.594929 1 score.go:93] "scoring pod" pod="default/gpu-pod-2" Memreq=0 MemPercentagereq=100 Coresreq=0 Nums=3 device index=2 device="GPU-56967eb2-30b7-c808-367a-225b8bd8a12e" I0313 00:58:35.594948 1 score.go:93] "scoring pod" pod="default/gpu-pod-2" Memreq=0 MemPercentagereq=100 Coresreq=0 Nums=3 device index=1 device="GPU-7099a282-5a75-55f8-0cd0-a4b48098ae1e" I0313 00:58:35.594966 1 score.go:93] "scoring pod" pod="default/gpu-pod-2" Memreq=0 MemPercentagereq=100 Coresreq=0 Nums=3 device index=0 device="GPU-d38a391c-9f2f-395e-2f91-1785a648f6c4" I0313 00:58:35.594989 1 score.go:211] "calcScore:node not fit pod" pod="default/gpu-pod-2" node="gpu-230"

The outputs of kubectl describe node gpu-230 and of nvidia-smi were attached as screenshots.

so somebody can solve this issue? thanks

can't find function nvmlDeviceGetComputeRunningProcesses_v2 in libnvidia-ml.so.1

nvidia-smi runs successfully on the host,
and inside the container if I use the original k8s-device-plugin,
but I get the following error when using this vGPU device plugin

output of nvidia-smi

can't find function nvmlDeviceGetComputeRunningProcesses_v2 in libnvidia-ml.so.1
Tue Aug  3 08:36:12 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla M40           Off  | 00000000:00:08.0 Off |                  Off |
| N/A   35C    P0    63W / 250W |    174MiB / 12215MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|

After enabling vGPU, running nvidia-smi in the pod reports Segmentation fault (core dumped)

I'm a Kubernetes beginner with a question; please kindly advise.

1. Issue or feature description

After enabling vGPU, running nvidia-smi inside the pod reports Segmentation fault (core dumped).

2. Steps to reproduce the issue

The master node is an 8-core 16GB Tencent Cloud VM; the worker node is a 20-core 80GB Tencent Cloud VM with one NVIDIA T4 GPU. The operating system is Ubuntu Server 18.04.
On the worker node I installed docker and nvidia-docker2 and enabled vGPU, using the latest image with all parameters at their defaults (I tried modifying the parameters but the result was the same).

I performed the operations shown in the attached screenshots; entering the pod and running nvidia-smi produced the error shown in the screenshots.

Running nvidia-smi on the GPU host itself works without problems.
Docker version: 20.10
Kubernetes version 1.19.0, installed with kubeadm; the kubelet version is also 1.19.0.
The docker info output is attached; daemon.json is already configured with the nvidia runtime and default-runtime set to nvidia.

Who's using vGPU K8s Device Plugin / 您在使用vGPU K8s Device Plugin吗 ?

Sincerely thank you for using and continuing to pay attention to vGPU K8s Device Plugin. In order to better build the community and attract more people to use and contribute to vGPU K8s Device Plugin to strengthen the community, please comment the following information in the issue:

  1. Your company, school or organization.
  2. Your contact info: email.
  3. Your scenarios using vGPU K8s Device Plugin.

You can refer to the following format to provide information:
Company(Organization): xxx
Website: xxx (Just to get the company logo)
Contact: xxx
Scenarios: DL inference


Questions about usage

1. Issue or feature description

We are about to try this GPU plugin. Suppose my cluster has only one GPU and I split it into two. Given the restriction that "the number of vGPUs required by tasks assigned to a node cannot exceed the actual number of GPUs on the node", does that mean I can actually start/use only one compute instance (application using the GPU)?

Two GPUs, but only one card is recognized

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01 Driver Version: 440.33.01 CUDA Version: 10.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla V100-PCIE... Off | 00000000:1A:00.0 Off | 0 |
| N/A 33C P0 24W / 250W | 0MiB / 16160MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla V100-PCIE... Off | 00000000:68:00.0 Off | 0 |
| N/A 27C P0 23W / 250W | 4MiB / 16160MiB | 0% Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+

As shown above, the server has two V100s in total, but after using the device plugin only card 0 is split.

The split parameters are as follows:
args:
- '--fail-on-init-error=false'
- '--device-split-count=4'
- '--device-memory-scaling=2'
- '--device-cores-scaling=4'

After describing the GPU node, I also only get 4 vGPUs instead of 8:

Capacity:
cpu: 36
ephemeral-storage: 3478455808Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 131500528Ki
nvidia.com/gpu: 4
pods: 110
Allocatable:
cpu: 35600m
ephemeral-storage: 3478455808Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 127399425539
nvidia.com/gpu: 4
pods: 110

docker version: 20.10.12

k8s version: v1.19.9

Is there a way to monitor vGPU with DCGM?

DCGM exporter is not picking up the pods that are using vGPU, making it hard to track utilization of the pods.
is there any workaround to monitor GPU utilization with vGPU?
is there a way to get the mapping between the vGPU and the actual GPU IDs?

I am using version v0.9.0.0. After building it and deploying it as a daemon service on the GPU node, it reports that device-split-count and several other parameters are undefined. After removing these parameters, the pod runs normally on the GPU node, but the logs show that NVML cannot be found. The GPU node uses P100 cards. Please advise.

The template below is mostly useful for bug reports and support questions. Feel free to remove anything which doesn't apply to you and add more information where it makes sense.

1. Issue or feature description

2. Steps to reproduce the issue

3. Information to attach (optional if deemed irrelevant)

Common error checking:

  • The output of nvidia-smi -a on your host
  • Your docker configuration file (e.g: /etc/docker/daemon.json)
  • The k8s-device-plugin container logs
  • The kubelet logs on the node (e.g: sudo journalctl -r -u kubelet)

Additional information that might help better understand your environment and reproduce the bug:

  • Docker version from docker version
  • Docker command, image and tag used
  • Kernel version from uname -a
  • Any relevant kernel output lines from dmesg
  • NVIDIA packages version from dpkg -l '*nvidia*' or rpm -qa '*nvidia*'
  • NVIDIA container library version from nvidia-container-cli -V
  • NVIDIA container library logs (see troubleshooting)

Running nvidia-smi in the pod fails

The template below is mostly useful for bug reports and support questions. Feel free to remove anything which doesn't apply to you and add more information where it makes sense.

1. Issue or feature description

create pod

apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  containers:
    - name: ubuntu-container
      image: xx:runtime-py3.6-cudnn7.3-cuda9.2-centos7
      command: 
        - /bin/bash
        - -c
        - sleep 1d
      env:
        - name: LIBCUDA_LOG_LEVEL
          value: "5"
      resources:
        limits:
          nvidia.com/gpu: 2

Running nvidia-smi fails:

[root@gpu-pod /]# nvidia-smi
[4pdvGPU Warn(38:140500277225280:hook.c:396)]: can't find function nvmlDeviceGetMemoryInfo_v2 in libnvidia-ml.so.1
[4pdvGPU Warn(38:140500277225280:hook.c:396)]: can't find function nvmlDeviceSetTemperatureThreshold in libnvidia-ml.so.1
[4pdvGPU Warn(38:140500277225280:hook.c:396)]: can't find function nvmlVgpuInstanceGetGpuInstanceId in libnvidia-ml.so.1
[4pdvGPU Warn(38:140500277225280:hook.c:339)]: NVML error at line 339: 1
Failed to initialize NVML: Unknown Error

2. Steps to reproduce the issue

3. Information to attach (optional if deemed irrelevant)

Common error checking:

  • The output of nvidia-smi -a on your host
[root@xxx ~]# nvidia-smi -a

==============NVSMI LOG==============

Timestamp                                 : Sat Aug 13 16:10:39 2022
Driver Version                            : 455.38
CUDA Version                              : 11.1

Attached GPUs                             : 4
GPU 00000000:02:00.0
    Product Name                          : TITAN V
    Product Brand                         : Titan
    Display Mode                          : Disabled
    Display Active                        : Disabled
    Persistence Mode                      : Disabled
    MIG Mode
        Current                           : N/A
        Pending                           : N/A
    Accounting Mode                       : Disabled
    Accounting Mode Buffer Size           : 4000
    Driver Model
        Current                           : N/A
        Pending                           : N/A
    Serial Number                         : 0320618057831
    GPU UUID                              : GPU-f5a3f95f-2685-cf01-2063-7bc624963433
    Minor Number                          : 0
    VBIOS Version                         : 88.00.41.00.18
    MultiGPU Board                        : No
    Board ID                              : 0x200
    GPU Part Number                       : 900-1G500-2500-000
    Inforom Version
        Image Version                     : G001.0000.01.04
        OEM Object                        : 1.1
        ECC Object                        : N/A
        Power Management Object           : N/A
    GPU Operation Mode
        Current                           : N/A
        Pending                           : N/A
    GPU Virtualization Mode
        Virtualization Mode               : None
        Host VGPU Mode                    : N/A
    IBMNPU
        Relaxed Ordering Mode             : N/A
    PCI
        Bus                               : 0x02
        Device                            : 0x00
        Domain                            : 0x0000
        Device Id                         : 0x1D8110DE
        Bus Id                            : 00000000:02:00.0
        Sub System Id                     : 0x121810DE
        GPU Link Info
            PCIe Generation
                Max                       : 3
                Current                   : 1
            Link Width
                Max                       : 16x
                Current                   : 16x
        Bridge Chip
            Type                          : N/A
            Firmware                      : N/A
        Replays Since Reset               : 0
        Replay Number Rollovers           : 0
        Tx Throughput                     : 0 KB/s
        Rx Throughput                     : 0 KB/s
    Fan Speed                             : 28 %
    Performance State                     : P8
    Clocks Throttle Reasons
        Idle                              : Active
        Applications Clocks Setting       : Not Active
        SW Power Cap                      : Not Active
        HW Slowdown                       : Not Active
            HW Thermal Slowdown           : Not Active
            HW Power Brake Slowdown       : Not Active
        Sync Boost                        : Not Active
        SW Thermal Slowdown               : Not Active
        Display Clock Setting             : Not Active
    FB Memory Usage
        Total                             : 12066 MiB
        Used                              : 0 MiB
        Free                              : 12066 MiB
    BAR1 Memory Usage
        Total                             : 256 MiB
        Used                              : 2 MiB
        Free                              : 254 MiB
    Compute Mode                          : Default
    Utilization
        Gpu                               : 0 %
        Memory                            : 0 %
        Encoder                           : 0 %
        Decoder                           : 0 %
    Encoder Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    FBC Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    Ecc Mode
        Current                           : N/A
        Pending                           : N/A
    ECC Errors
        Volatile
            Single Bit
                Device Memory             : N/A
                Register File             : N/A
                L1 Cache                  : N/A
                L2 Cache                  : N/A
                Texture Memory            : N/A
                Texture Shared            : N/A
                CBU                       : N/A
                Total                     : N/A
            Double Bit
                Device Memory             : N/A
                Register File             : N/A
                L1 Cache                  : N/A
                L2 Cache                  : N/A
                Texture Memory            : N/A
                Texture Shared            : N/A
                CBU                       : N/A
                Total                     : N/A
        Aggregate
            Single Bit
                Device Memory             : N/A
                Register File             : N/A
                L1 Cache                  : N/A
                L2 Cache                  : N/A
                Texture Memory            : N/A
                Texture Shared            : N/A
                CBU                       : N/A
                Total                     : N/A
            Double Bit
                Device Memory             : N/A
                Register File             : N/A
                L1 Cache                  : N/A
                L2 Cache                  : N/A
                Texture Memory            : N/A
                Texture Shared            : N/A
                CBU                       : N/A
                Total                     : N/A
    Retired Pages
        Single Bit ECC                    : N/A
        Double Bit ECC                    : N/A
        Pending Page Blacklist            : N/A
    Remapped Rows                         : N/A
    Temperature
        GPU Current Temp                  : 39 C
        GPU Shutdown Temp                 : 100 C
        GPU Slowdown Temp                 : 97 C
        GPU Max Operating Temp            : 91 C
        Memory Current Temp               : 36 C
        Memory Max Operating Temp         : 95 C
    Power Readings
        Power Management                  : Supported
        Power Draw                        : 27.28 W
        Power Limit                       : 250.00 W
        Default Power Limit               : 250.00 W
        Enforced Power Limit              : 250.00 W
        Min Power Limit                   : 100.00 W
        Max Power Limit                   : 300.00 W
    Clocks
        Graphics                          : 135 MHz
        SM                                : 135 MHz
        Memory                            : 850 MHz
        Video                             : 555 MHz
    Applications Clocks
        Graphics                          : 1200 MHz
        Memory                            : 850 MHz
    Default Applications Clocks
        Graphics                          : 1200 MHz
        Memory                            : 850 MHz
    Max Clocks
        Graphics                          : 1912 MHz
        SM                                : 1912 MHz
        Memory                            : 850 MHz
        Video                             : 1717 MHz
    Max Customer Boost Clocks
        Graphics                          : N/A
    Clock Policy
        Auto Boost                        : N/A
        Auto Boost Default                : N/A
    Processes                             : None

GPU 00000000:03:00.0
    Product Name                          : TITAN V
    Product Brand                         : Titan
    Display Mode                          : Disabled
    Display Active                        : Disabled
    Persistence Mode                      : Disabled
    MIG Mode
        Current                           : N/A
        Pending                           : N/A
    Accounting Mode                       : Disabled
    Accounting Mode Buffer Size           : 4000
    Driver Model
        Current                           : N/A
        Pending                           : N/A
    Serial Number                         : 0320618058217
    GPU UUID                              : GPU-3d5859fc-73e1-23d1-2e59-78c5e7049d61
    Minor Number                          : 1
    VBIOS Version                         : 88.00.41.00.18
    MultiGPU Board                        : No
    Board ID                              : 0x300
    GPU Part Number                       : 900-1G500-2500-000
    Inforom Version
        Image Version                     : G001.0000.01.04
        OEM Object                        : 1.1
        ECC Object                        : N/A
        Power Management Object           : N/A
    GPU Operation Mode
        Current                           : N/A
        Pending                           : N/A
    GPU Virtualization Mode
        Virtualization Mode               : None
        Host VGPU Mode                    : N/A
    IBMNPU
        Relaxed Ordering Mode             : N/A
    PCI
        Bus                               : 0x03
        Device                            : 0x00
        Domain                            : 0x0000
        Device Id                         : 0x1D8110DE
        Bus Id                            : 00000000:03:00.0
        Sub System Id                     : 0x121810DE
        GPU Link Info
            PCIe Generation
                Max                       : 3
                Current                   : 1
            Link Width
                Max                       : 16x
                Current                   : 16x
        Bridge Chip
            Type                          : N/A
            Firmware                      : N/A
        Replays Since Reset               : 0
        Replay Number Rollovers           : 0
        Tx Throughput                     : 0 KB/s
        Rx Throughput                     : 0 KB/s
    Fan Speed                             : 31 %
    Performance State                     : P8
    Clocks Throttle Reasons
        Idle                              : Active
        Applications Clocks Setting       : Not Active
        SW Power Cap                      : Not Active
        HW Slowdown                       : Not Active
            HW Thermal Slowdown           : Not Active
            HW Power Brake Slowdown       : Not Active
        Sync Boost                        : Not Active
        SW Thermal Slowdown               : Not Active
        Display Clock Setting             : Not Active
    FB Memory Usage
        Total                             : 12066 MiB
        Used                              : 0 MiB
        Free                              : 12066 MiB
    BAR1 Memory Usage
        Total                             : 256 MiB
        Used                              : 2 MiB
        Free                              : 254 MiB
    Compute Mode                          : Default
    Utilization
        Gpu                               : 0 %
        Memory                            : 0 %
        Encoder                           : 0 %
        Decoder                           : 0 %
    Encoder Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    FBC Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    Ecc Mode
        Current                           : N/A
        Pending                           : N/A
    ECC Errors
        Volatile
            Single Bit
                Device Memory             : N/A
                Register File             : N/A
                L1 Cache                  : N/A
                L2 Cache                  : N/A
                Texture Memory            : N/A
                Texture Shared            : N/A
                CBU                       : N/A
                Total                     : N/A
            Double Bit
                Device Memory             : N/A
                Register File             : N/A
                L1 Cache                  : N/A
                L2 Cache                  : N/A
                Texture Memory            : N/A
                Texture Shared            : N/A
                CBU                       : N/A
                Total                     : N/A
        Aggregate
            Single Bit
                Device Memory             : N/A
                Register File             : N/A
                L1 Cache                  : N/A
                L2 Cache                  : N/A
                Texture Memory            : N/A
                Texture Shared            : N/A
                CBU                       : N/A
                Total                     : N/A
            Double Bit
                Device Memory             : N/A
                Register File             : N/A
                L1 Cache                  : N/A
                L2 Cache                  : N/A
                Texture Memory            : N/A
                Texture Shared            : N/A
                CBU                       : N/A
                Total                     : N/A
    Retired Pages
        Single Bit ECC                    : N/A
        Double Bit ECC                    : N/A
        Pending Page Blacklist            : N/A
    Remapped Rows                         : N/A
    Temperature
        GPU Current Temp                  : 45 C
        GPU Shutdown Temp                 : 100 C
        GPU Slowdown Temp                 : 97 C
        GPU Max Operating Temp            : 91 C
        Memory Current Temp               : 43 C
        Memory Max Operating Temp         : 95 C
    Power Readings
        Power Management                  : Supported
        Power Draw                        : 30.67 W
        Power Limit                       : 250.00 W
        Default Power Limit               : 250.00 W
        Enforced Power Limit              : 250.00 W
        Min Power Limit                   : 100.00 W
        Max Power Limit                   : 300.00 W
    Clocks
        Graphics                          : 135 MHz
        SM                                : 135 MHz
        Memory                            : 850 MHz
        Video                             : 555 MHz
    Applications Clocks
        Graphics                          : 1200 MHz
        Memory                            : 850 MHz
    Default Applications Clocks
        Graphics                          : 1200 MHz
        Memory                            : 850 MHz
    Max Clocks
        Graphics                          : 1912 MHz
        SM                                : 1912 MHz
        Memory                            : 850 MHz
        Video                             : 1717 MHz
    Max Customer Boost Clocks
        Graphics                          : N/A
    Clock Policy
        Auto Boost                        : N/A
        Auto Boost Default                : N/A
    Processes                             : None

GPU 00000000:82:00.0
    Product Name                          : TITAN V
    Product Brand                         : Titan
    Display Mode                          : Disabled
    Display Active                        : Disabled
    Persistence Mode                      : Disabled
    MIG Mode
        Current                           : N/A
        Pending                           : N/A
    Accounting Mode                       : Disabled
    Accounting Mode Buffer Size           : 4000
    Driver Model
        Current                           : N/A
        Pending                           : N/A
    Serial Number                         : 0324917182924
    GPU UUID                              : GPU-e2336a65-b527-8ba6-c005-209ebc071c78
    Minor Number                          : 2
    VBIOS Version                         : 88.00.36.00.01
    MultiGPU Board                        : No
    Board ID                              : 0x8200
    GPU Part Number                       : 900-1G500-2500-000
    Inforom Version
        Image Version                     : G001.0000.01.04
        OEM Object                        : 1.1
        ECC Object                        : N/A
        Power Management Object           : N/A
    GPU Operation Mode
        Current                           : N/A
        Pending                           : N/A
    GPU Virtualization Mode
        Virtualization Mode               : None
        Host VGPU Mode                    : N/A
    IBMNPU
        Relaxed Ordering Mode             : N/A
    PCI
        Bus                               : 0x82
        Device                            : 0x00
        Domain                            : 0x0000
        Device Id                         : 0x1D8110DE
        Bus Id                            : 00000000:82:00.0
        Sub System Id                     : 0x121810DE
        GPU Link Info
            PCIe Generation
                Max                       : 3
                Current                   : 1
            Link Width
                Max                       : 16x
                Current                   : 16x
        Bridge Chip
            Type                          : N/A
            Firmware                      : N/A
        Replays Since Reset               : 0
        Replay Number Rollovers           : 0
        Tx Throughput                     : 0 KB/s
        Rx Throughput                     : 0 KB/s
    Fan Speed                             : 28 %
    Performance State                     : P8
    Clocks Throttle Reasons
        Idle                              : Active
        Applications Clocks Setting       : Not Active
        SW Power Cap                      : Not Active
        HW Slowdown                       : Not Active
            HW Thermal Slowdown           : Not Active
            HW Power Brake Slowdown       : Not Active
        Sync Boost                        : Not Active
        SW Thermal Slowdown               : Not Active
        Display Clock Setting             : Not Active
    FB Memory Usage
        Total                             : 12066 MiB
        Used                              : 0 MiB
        Free                              : 12066 MiB
    BAR1 Memory Usage
        Total                             : 256 MiB
        Used                              : 2 MiB
        Free                              : 254 MiB
    Compute Mode                          : Default
    Utilization
        Gpu                               : 0 %
        Memory                            : 0 %
        Encoder                           : 0 %
        Decoder                           : 0 %
    Encoder Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    FBC Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    Ecc Mode
        Current                           : N/A
        Pending                           : N/A
    ECC Errors
        Volatile
            Single Bit
                Device Memory             : N/A
                Register File             : N/A
                L1 Cache                  : N/A
                L2 Cache                  : N/A
                Texture Memory            : N/A
                Texture Shared            : N/A
                CBU                       : N/A
                Total                     : N/A
            Double Bit
                Device Memory             : N/A
                Register File             : N/A
                L1 Cache                  : N/A
                L2 Cache                  : N/A
                Texture Memory            : N/A
                Texture Shared            : N/A
                CBU                       : N/A
                Total                     : N/A
        Aggregate
            Single Bit
                Device Memory             : N/A
                Register File             : N/A
                L1 Cache                  : N/A
                L2 Cache                  : N/A
                Texture Memory            : N/A
                Texture Shared            : N/A
                CBU                       : N/A
                Total                     : N/A
            Double Bit
                Device Memory             : N/A
                Register File             : N/A
                L1 Cache                  : N/A
                L2 Cache                  : N/A
                Texture Memory            : N/A
                Texture Shared            : N/A
                CBU                       : N/A
                Total                     : N/A
    Retired Pages
        Single Bit ECC                    : N/A
        Double Bit ECC                    : N/A
        Pending Page Blacklist            : N/A
    Remapped Rows                         : N/A
    Temperature
        GPU Current Temp                  : 42 C
        GPU Shutdown Temp                 : 100 C
        GPU Slowdown Temp                 : 97 C
        GPU Max Operating Temp            : 91 C
        Memory Current Temp               : 38 C
        Memory Max Operating Temp         : 95 C
    Power Readings
        Power Management                  : Supported
        Power Draw                        : 27.65 W
        Power Limit                       : 250.00 W
        Default Power Limit               : 250.00 W
        Enforced Power Limit              : 250.00 W
        Min Power Limit                   : 100.00 W
        Max Power Limit                   : 300.00 W
    Clocks
        Graphics                          : 135 MHz
        SM                                : 135 MHz
        Memory                            : 850 MHz
        Video                             : 555 MHz
    Applications Clocks
        Graphics                          : 1200 MHz
        Memory                            : 850 MHz
    Default Applications Clocks
        Graphics                          : 1200 MHz
        Memory                            : 850 MHz
    Max Clocks
        Graphics                          : 1912 MHz
        SM                                : 1912 MHz
        Memory                            : 850 MHz
        Video                             : 1717 MHz
    Max Customer Boost Clocks
        Graphics                          : N/A
    Clock Policy
        Auto Boost                        : N/A
        Auto Boost Default                : N/A
    Processes                             : None

GPU 00000000:83:00.0
    Product Name                          : TITAN V
    Product Brand                         : Titan
    Display Mode                          : Disabled
    Display Active                        : Disabled
    Persistence Mode                      : Disabled
    MIG Mode
        Current                           : N/A
        Pending                           : N/A
    Accounting Mode                       : Disabled
    Accounting Mode Buffer Size           : 4000
    Driver Model
        Current                           : N/A
        Pending                           : N/A
    Serial Number                         : 0320618057916
    GPU UUID                              : GPU-5908bbc9-ddab-ebe7-5624-446b3fc15348
    Minor Number                          : 3
    VBIOS Version                         : 88.00.41.00.18
    MultiGPU Board                        : No
    Board ID                              : 0x8300
    GPU Part Number                       : 900-1G500-2500-000
    Inforom Version
        Image Version                     : G001.0000.01.04
        OEM Object                        : 1.1
        ECC Object                        : N/A
        Power Management Object           : N/A
    GPU Operation Mode
        Current                           : N/A
        Pending                           : N/A
    GPU Virtualization Mode
        Virtualization Mode               : None
        Host VGPU Mode                    : N/A
    IBMNPU
        Relaxed Ordering Mode             : N/A
    PCI
        Bus                               : 0x83
        Device                            : 0x00
        Domain                            : 0x0000
        Device Id                         : 0x1D8110DE
        Bus Id                            : 00000000:83:00.0
        Sub System Id                     : 0x121810DE
        GPU Link Info
            PCIe Generation
                Max                       : 3
                Current                   : 1
            Link Width
                Max                       : 16x
                Current                   : 16x
        Bridge Chip
            Type                          : N/A
            Firmware                      : N/A
        Replays Since Reset               : 0
        Replay Number Rollovers           : 0
        Tx Throughput                     : 0 KB/s
        Rx Throughput                     : 0 KB/s
    Fan Speed                             : 28 %
    Performance State                     : P8
    Clocks Throttle Reasons
        Idle                              : Active
        Applications Clocks Setting       : Not Active
        SW Power Cap                      : Not Active
        HW Slowdown                       : Not Active
            HW Thermal Slowdown           : Not Active
            HW Power Brake Slowdown       : Not Active
        Sync Boost                        : Not Active
        SW Thermal Slowdown               : Not Active
        Display Clock Setting             : Not Active
    FB Memory Usage
        Total                             : 12066 MiB
        Used                              : 0 MiB
        Free                              : 12066 MiB
    BAR1 Memory Usage
        Total                             : 256 MiB
        Used                              : 2 MiB
        Free                              : 254 MiB
    Compute Mode                          : Default
    Utilization
        Gpu                               : 0 %
        Memory                            : 0 %
        Encoder                           : 0 %
        Decoder                           : 0 %
    Encoder Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    FBC Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    Ecc Mode
        Current                           : N/A
        Pending                           : N/A
    ECC Errors
        Volatile
            Single Bit
                Device Memory             : N/A
                Register File             : N/A
                L1 Cache                  : N/A
                L2 Cache                  : N/A
                Texture Memory            : N/A
                Texture Shared            : N/A
                CBU                       : N/A
                Total                     : N/A
            Double Bit
                Device Memory             : N/A
                Register File             : N/A
                L1 Cache                  : N/A
                L2 Cache                  : N/A
                Texture Memory            : N/A
                Texture Shared            : N/A
                CBU                       : N/A
                Total                     : N/A
        Aggregate
            Single Bit
                Device Memory             : N/A
                Register File             : N/A
                L1 Cache                  : N/A
                L2 Cache                  : N/A
                Texture Memory            : N/A
                Texture Shared            : N/A
                CBU                       : N/A
                Total                     : N/A
            Double Bit
                Device Memory             : N/A
                Register File             : N/A
                L1 Cache                  : N/A
                L2 Cache                  : N/A
                Texture Memory            : N/A
                Texture Shared            : N/A
                CBU                       : N/A
                Total                     : N/A
    Retired Pages
        Single Bit ECC                    : N/A
        Double Bit ECC                    : N/A
        Pending Page Blacklist            : N/A
    Remapped Rows                         : N/A
    Temperature
        GPU Current Temp                  : 41 C
        GPU Shutdown Temp                 : 100 C
        GPU Slowdown Temp                 : 97 C
        GPU Max Operating Temp            : 91 C
        Memory Current Temp               : 40 C
        Memory Max Operating Temp         : 95 C
    Power Readings
        Power Management                  : Supported
        Power Draw                        : 25.91 W
        Power Limit                       : 250.00 W
        Default Power Limit               : 250.00 W
        Enforced Power Limit              : 250.00 W
        Min Power Limit                   : 100.00 W
        Max Power Limit                   : 300.00 W
    Clocks
        Graphics                          : 135 MHz
        SM                                : 135 MHz
        Memory                            : 850 MHz
        Video                             : 555 MHz
    Applications Clocks
        Graphics                          : 1200 MHz
        Memory                            : 850 MHz
    Default Applications Clocks
        Graphics                          : 1200 MHz
        Memory                            : 850 MHz
    Max Clocks
        Graphics                          : 1912 MHz
        SM                                : 1912 MHz
        Memory                            : 850 MHz
        Video                             : 1717 MHz
    Max Customer Boost Clocks
        Graphics                          : N/A
    Clock Policy
        Auto Boost                        : N/A
        Auto Boost Default                : N/A
    Processes                             : None

  • Your docker configuration file (e.g: /etc/docker/daemon.json)
{
    "init": true,
    "exec-opts": ["native.cgroupdriver=systemd"],
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
  • The k8s-device-plugin container logs
2022/08/13 07:41:28 Starting FS watcher.
2022/08/13 07:41:28 Starting OS watcher.
2022/08/13 07:41:28 Retreiving plugins.
2022/08/13 07:41:28 migstrategy= none
2022/08/13 07:41:28 uuid= GPU-f5a3f95f-2685-cf01-2063-7bc624963433
2022/08/13 07:41:28 uuid= GPU-3d5859fc-73e1-23d1-2e59-78c5e7049d61
2022/08/13 07:41:28 uuid= GPU-e2336a65-b527-8ba6-c005-209ebc071c78
2022/08/13 07:41:28 uuid= GPU-5908bbc9-ddab-ebe7-5624-446b3fc15348
2022/08/13 07:41:28 Starting GRPC server for 'nvidia.com/gpu'
2022/08/13 07:41:28 Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
2022/08/13 07:41:28 Registered device plugin for 'nvidia.com/gpu' with Kubelet
  • The kubelet logs on the node (e.g: sudo journalctl -r -u kubelet)

Additional information that might help better understand your environment and reproduce the bug:

  • Docker version from docker version
Client: Docker Engine - Community
 Version:           19.03.3
 API version:       1.40
 Go version:        go1.12.10
 Git commit:        a872fc2f86
 Built:             Tue Oct  8 00:58:10 2019
 OS/Arch:           linux/amd64
 Experimental:      false

Server: Docker Engine - Community
 Engine:
  Version:          19.03.13
  API version:      1.40 (minimum version 1.12)
  Go version:       go1.13.15
  Git commit:       4484c46d9d
  Built:            Wed Sep 16 17:02:21 2020
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.2.10
  GitCommit:        b34a5c8af56e510852c35414db4c1f4fa6172339
 nvidia:
  Version:          1.0.0-rc8+dev
  GitCommit:        3e425f80a8c931f88e6d94a8c831b9d5aa481657
 docker-init:
  Version:          0.18.0
  GitCommit:        fec3683
  • Docker command, image and tag used
  • Kernel version from uname -a
    Linux xxx 3.10.0-862.14.4.el7.x86_64 #1 SMP Wed Sep 26 15:12:11 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
  • Any relevant kernel output lines from dmesg
  • NVIDIA packages version from dpkg -l '*nvidia*' or rpm -qa '*nvidia*'
  • NVIDIA container library version from nvidia-container-cli -V
version: 1.0.0
build date: 2018-03-06T02:05+0000
build revision: be797da00b156493e80f1ae6f38d69f23c932554
build compiler: gcc 4.8.5 20150623 (Red Hat 4.8.5-16)
build platform: x86_64
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections

Committed image cannot run on another node.

The template below is mostly useful for bug reports and support questions. Feel free to remove anything which doesn't apply to you and add more information where it makes sense.

1. Issue or feature description

A committed image cannot run on another node.

2. Steps to reproduce the issue

  1. Start a pod with GPU access enabled.
  2. Commit the container to an image and push it to a registry.
  3. Start a pod with the committed image on another node.
     The container cannot run and fails with the following error:
Error response from daemon: OCI runtime create failed: container_linux.go:349: starting container process caused "process_linux.go:449: container init caused \"process_linux.go:432: running prestart hook 0 caused \\\"error running hook: 
exit status 1, stdout: , stderr: nvidia-container-cli: device error: GPU-caba9b00-6386-2c33-7834-646ef2692cb7: unknown device\\\\n\\\"\"": unknown
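
A likely explanation (not confirmed in the report) is that docker commit captures the container's environment, including the GPU UUID that the nvidia container runtime set in NVIDIA_VISIBLE_DEVICES on the original node; that UUID does not exist on the second node, which would produce exactly this "unknown device" error. A minimal workaround sketch, assuming that is the cause, is to override the variable when running the committed image (registry, image name and tag are placeholders):

# Hedged workaround sketch: override the GPU UUID assumed to be baked into the
# committed image's NVIDIA_VISIBLE_DEVICES environment variable.
docker run --rm --runtime=nvidia \
    -e NVIDIA_VISIBLE_DEVICES=all \
    <registry>/<committed-image>:<tag> nvidia-smi -L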

3. Information to attach (optional if deemed irrelevant)

Common error checking:

  • The output of nvidia-smi -a on your host
  • Your docker configuration file (e.g: /etc/docker/daemon.json)
  • The k8s-device-plugin container logs
  • The kubelet logs on the node (e.g: sudo journalctl -r -u kubelet)

Additional information that might help better understand your environment and reproduce the bug:

  • Docker version from docker version: 19.03
  • Docker command, image and tag used: docker commit
  • Kernel version from uname -a
  • Any relevant kernel output lines from dmesg
  • NVIDIA packages version from dpkg -l '*nvidia*' or rpm -qa '*nvidia*'
  • NVIDIA container library version from nvidia-container-cli -V
  • NVIDIA container library logs (see troubleshooting)

Core dump when requesting 2 or more GPUs with a Tesla T4

1. Issue or feature description

Requesting 1 GPU in the yaml works fine, but when requesting more than one, the output of nvidia-smi inside the pod is as follows:
[screenshot of nvidia-smi output omitted]
The output of nvidia-smi on the host machine is fine.

On another machine with a GeForce RTX 2070 SUPER, requesting 2 GPUs works fine.
[screenshot of nvidia-smi output omitted]
But when I run the application locally, it aborts with:

[4pdvGPU ERROR (pid:697 thread=140106827071488 context.c:189)]: cuCtxGetDevice Not Found. tid=140106827071488 ctx=0x239601906000:0x23960041a000
 home/limengxuan/work/libcuda_override/src/cuda/context.c:189: cuCtxGetDevice: Assertion `0' failed.

2. Steps to reproduce the issue

Ubuntu 20.04 + microk8s + Tesla T4 GPU + 510 driver
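
A minimal pod spec sketch to reproduce the report; only the request for 2 nvidia.com/gpu comes from the description above, the pod name and image are placeholders:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod-2x          # placeholder name
spec:
  containers:
    - name: cuda-container
      image: ubuntu:18.04   # placeholder image
      command: ["bash", "-c", "sleep 86400"]
      resources:
        limits:
          nvidia.com/gpu: 2 # requesting 2 vGPUs, as in the report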

3. Information to attach (optional if deemed irrelevant)

Common error checking:

  • The output of nvidia-smi -a on your host
    [screenshot of nvidia-smi -a output omitted]
  • Your docker configuration file (e.g: /etc/docker/daemon.json)
    {
        "default-runtime": "nvidia",
        "runtimes": {
            "nvidia": {
                "path": "nvidia-container-runtime",
                "runtimeArgs": []
            }
        }
    }

Additional information that might help better understand your environment and reproduce the bug:

  • Any relevant kernel output lines from dmesg
 nvidia-smi[2260220]: segfault at 0 ip 00007fde46d051ce sp 00007ffe1ae4c9e8 error 4 in libc-2.31.so[7fde46b9d000+178000]
[89993.700532] Code: fd d7 c9 0f bc d1 c5 fe 7f 27 c5 fe 7f 6f 20 c5 fe 7f 77 40 c5 fe 7f 7f 60 49 83 c0 1f 49 29 d0 48 8d 7c 17 61 e9 c2 04 00 00 <c5> fe 6f 1e c5 fe 6f 56 20 c5 fd 74 cb c5 fd d7 d1 49 83 f8 21 0f
[90182.697502] nvidia-smi[2265941]: segfault at 0 ip 00007f241971c1ce sp 00007fffff703d08 error 4 in libc-2.31.so[7f24195b4000+178000]
[90182.697509] Code: fd d7 c9 0f bc d1 c5 fe 7f 27 c5 fe 7f 6f 20 c5 fe 7f 77 40 c5 fe 7f 7f 60 49 83 c0 1f 49 29 d0 48 8d 7c 17 61 e9 c2 04 00 00 <c5> fe 6f 1e c5 fe 6f 56 20 c5 fd 74 cb c5 fd d7 d1 49 83 f8 21 0f

Segmentation fault (core dumped)

1. Issue or feature description

When I run the provided example, it fails with Segmentation fault (core dumped).
The card is an NVIDIA Corporation GP104GL [Tesla P4] (rev a1).

2. Steps to reproduce the issue

1. Modify the https://raw.githubusercontent.com/4paradigm/k8s-device-plugin/master/nvidia-device-plugin.yml file with the args
"--device-split-count=3", "--device-memory-scaling=1", "--device-cores-scaling=1"

2. kubectl apply -f nvidia-device-plugin.yml

3. Deploy the following pod:
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  containers:
    - name: ubuntu-container
      image: ubuntu:18.04
      command: ["bash", "-c", "sleep 86400"]
      resources:
        limits:
          nvidia.com/gpu: 2 # requesting 2 vGPUs
Then exec into the pod and run nvidia-smi; the result is:

[root@node1 4p]# kubectl exec -it gpu-pod /bin/sh

nvidia-smi
[4pdvGPU Msg(29:140241530709824:libvgpu.c:813)]: Initializing...
[4pdvGPU Msg(29:140241530709824:context.c:120)]: vdevices_pci=0000:84:00.0
Segmentation fault (core dumped)

Does the GPU need extra configuration, or is this caused by the OS itself being CentOS 7.6?

3. Attempted workaround

I do not know whether there is a deeper cause, for example a problem in the .so file when one pod is allocated 2 vGPUs.
But making the following change at the device-plugin level works around it.

for i, vd := range vdevices {
	if i != 0 { // added: only handle the first vdevice
		break
	}

	limitKey := fmt.Sprintf("CUDA_DEVICE_MEMORY_LIMIT_%v", i)
	// added: give the first vdevice the memory of all allocated vdevices
	response.Envs[limitKey] = fmt.Sprintf("%vm", vd.memory*uint64(len(vdevices)))
	mapEnvs = append(mapEnvs, fmt.Sprintf("%v:%v", i, vd.dev.ID))
}
// added: multiply the SM limit by the number of allocated vGPUs as well
response.Envs["CUDA_DEVICE_SM_LIMIT"] =
	strconv.Itoa(int(100*global.DeviceCoresScalingFlag/float64(global.DeviceSplitCountFlag)) * len(vdevices))

Split into 10 shares, but vGPU device memory does not change (NVIDIA A100)

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
  annotations:
    deprecated.daemonset.template.generation: '2'
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  template:
    metadata:
      creationTimestamp: null
      labels:
        name: nvidia-device-plugin-ds
      annotations:
        scheduler.alpha.kubernetes.io/critical-pod: ''
    spec:
      volumes:
        - name: device-plugin
          hostPath:
            path: /var/lib/kubelet/device-plugins
            type: ''
        - name: vgpu-dir
          hostPath:
            path: /usr/local/vgpu
            type: ''
      containers:
        - name: nvidia-device-plugin-ctr
          image: 4pdosc/k8s-device-plugin:latest
          args:
            - '--fail-on-init-error=true'
            - '--device-split-count=10'
            - '--device-memory-scaling=1'
            - '--device-cores-scaling=1'
          env:
            - name: PCIBUSFILE
              value: /usr/local/vgpu/pciinfo.vgpu
            - name: NVIDIA_MIG_MONITOR_DEVICES
              value: all
          resources: {}
          volumeMounts:
            - name: device-plugin
              mountPath: /var/lib/kubelet/device-plugins
            - name: vgpu-dir
              mountPath: /usr/local/vgpu
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          imagePullPolicy: Always
          securityContext:
            capabilities:
              add:
                - SYS_ADMIN
              drop:
                - ALL
            allowPrivilegeEscalation: false
      restartPolicy: Always

Thu May 5 09:50:57 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.60.02    Driver Version: 510.60.02    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  Off  | 00000000:00:0C.0 Off |                    0 |
| N/A   28C    P0    51W / 400W |    413MiB / 40960MiB |      3%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                   |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     77977      C                                     411MiB |
+-----------------------------------------------------------------------------+

The device memory shown is still 40 GB.
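
A quick sanity check, assuming the per-vGPU memory limit is passed to containers via environment variables such as CUDA_DEVICE_MEMORY_LIMIT_0 (as in the patch quoted earlier) and enforced by the preloaded libvgpu.so, is to inspect a running pod that requested a vGPU (pod name is a placeholder):

# Hedged check: the limit env vars should be present and nvidia-smi inside the
# pod should report the per-vGPU share rather than the full 40960MiB.
kubectl exec -it gpu-pod -- env | grep CUDA_DEVICE_MEMORY_LIMIT
kubectl exec -it gpu-pod -- nvidia-smi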

Error: failed to create FS watcher: no such file or directory

2021/08/26 07:14:50 Loading PciInfo
0 = 00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] (rev 02)
1 = 00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
2 = 00:01.1 IDE interface: Intel Corporation 82371SB PIIX3 IDE [Natoma/Triton II]
3 = 00:01.2 USB controller: Intel Corporation 82371SB PIIX3 USB [Natoma/Triton II] (rev 01)
4 = 00:01.3 Bridge: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 03)
5 = 00:02.0 VGA compatible controller: Cirrus Logic GD 5446
6 = 00:03.0 Multimedia audio controller: Intel Corporation 82801AA AC'97 Audio Controller (rev 01)
7 = 00:04.0 Communication controller: Red Hat, Inc. Virtio console
8 = 00:05.0 SCSI storage controller: Red Hat, Inc. Virtio block device
9 = 00:06.0 SCSI storage controller: Red Hat, Inc. Virtio block device
10 = 00:07.0 SCSI storage controller: Red Hat, Inc. Virtio block device
11 = 00:08.0 Ethernet controller: Red Hat, Inc. Virtio network device
12 = 00:09.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 16GB] (rev a1)
found 00:09.0
13 = 00:0a.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 16GB] (rev a1)
found 00:0a.0
14 = 00:0b.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 16GB] (rev a1)
found 00:0b.0
15 = 00:0c.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 16GB] (rev a1)
found 00:0c.0
16 = 00:0d.0 Unclassified device [00ff]: Red Hat, Inc. Virtio memory balloon
17 = 00:0e.0 Ethernet controller: Red Hat, Inc. Virtio network device
18 = 00:0f.0 Ethernet controller: Red Hat, Inc. Virtio network device
19 = 00:10.0 Ethernet controller: Red Hat, Inc. Virtio network device
2021/08/26 07:14:50 Loading NVML
20 = 00:11.0 Ethernet controller: Red Hat, Inc. Virtio network device
21 = 00:12.0 Ethernet controller: Red Hat, Inc. Virtio network device
22 = 00:13.0 Ethernet controller: Red Hat, Inc. Virtio network device
23 = 00:14.0 Ethernet controller: Red Hat, Inc. Virtio network device
24 =
pcibusstr= 00:09.0
00:0a.0
00:0b.0
00:0c.0

2021/08/26 07:14:50 Starting FS watcher.
2021/08/26 07:14:50 Shutdown of NVML returned:
2021/08/26 07:14:50 Error: failed to create FS watcher: no such file or directory

Driver Version: 440.64.00

Failed to initialize NVML: could not load NVML library.

ENV :

K8s : v1.23.10
Runtime: docker 20.10.8
NVIDIA System Management Interface -- v535.161.07
Image: 4pdosc/k8s-device-plugin:v0.10.0.4-ubuntu20.04

Issue:

After deploying the plugin DaemonSet, the logs show:

2024/03/27 15:41:13 Loading PciInfo

 0 = 00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] (rev 02)

 1 = 00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]

 2 = 00:01.1 IDE interface: Intel Corporation 82371SB PIIX3 IDE [Natoma/Triton II]

 3 = 00:01.2 USB controller: Intel Corporation 82371SB PIIX3 USB [Natoma/Triton II] (rev 01)

 4 = 00:01.3 Bridge: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 03)

 5 = 00:02.0 VGA compatible controller: Cirrus Logic GD 5446

 6 = 00:03.0 PCI bridge: Red Hat, Inc. QEMU PCI-PCI bridge

 7 = 00:04.0 PCI bridge: Red Hat, Inc. QEMU PCI-PCI bridge

 8 = 00:05.0 Ethernet controller: Red Hat, Inc. Virtio network device

 9 = 00:06.0 Audio device: Intel Corporation 82801FB/FBM/FR/FW/FRW (ICH6 Family) High Definition Audio Controller (rev 01)

 10 = 00:07.0 SCSI storage controller: Red Hat, Inc. Virtio block device

 11 = 00:08.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 32GB] (rev a1)

 found 00:08.0

 12 = 00:09.0 Unclassified device [00ff]: Red Hat, Inc. Virtio memory balloon

 13 = 

 pcibusstr= 00:08.0


 2024/03/27 15:41:13 Loading NVML

 2024/03/27 15:41:13 Failed to initialize NVML: could not load NVML library.

 2024/03/27 15:41:13 If this is a GPU node, did you set the docker default runtime to `nvidia`?

 2024/03/27 15:41:13 You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites

 2024/03/27 15:41:13 You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start

 2024/03/27 15:41:13 If this is not a GPU node, you should set up a toleration or nodeSelector to only deploy this plugin on GPU nodes

  1. I have checked the environment, and nvidia-smi works on the VM:
root@master:/usr/local/vgpu# nvidia-smi 
Wed Mar 27 15:46:02 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.07             Driver Version: 535.161.07   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla V100-SXM2-32GB           Off | 00000000:00:08.0 Off |                    0 |
| N/A   31C    P0              23W / 300W |      0MiB / 32768MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
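
Since nvidia-smi works on the host while NVML cannot be loaded inside the plugin container, the hint in the plugin's own log is worth verifying first. A minimal check, assuming Docker is the runtime as stated in the environment above:

# Hedged check: the device plugin container needs the nvidia runtime to see the
# driver libraries, so the default runtime should be "nvidia".
docker info | grep -i "default runtime"
# expected: Default Runtime: nvidia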

undefined symbol:_dl_sym,version GLIBC_PRIVATE

When I build this plugin from the latest source code and deploy it in my k8s cluster, the ML process using the GPU fails with the error "symbol lookup error: /usr/local/vgpu/libvgpu.so: undefined symbol: _dl_sym, version GLIBC_PRIVATE".
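
_dl_sym is a private glibc symbol, so this usually points to a mismatch between the glibc the preload library was built against and the glibc on the node. A quick diagnostic (an assumption about the root cause, not a confirmed fix) is to compare the node's glibc version with the build environment's:

# Hedged diagnostic: print the glibc version on the node where libvgpu.so is preloaded.
ldd --version | head -n 1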

Device memory isolation

We are currently experimenting with this project on an internal test cluster.
The cluster version information is as follows:
Client Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.2", GitCommit:"8b5a19147530eaac9476b0ab82980b4088bbc1b2", GitTreeState:"clean", BuildDate:"2021-09-15T21:38:50Z", GoVersion:"go1.16.8", Compiler:"gc", Platform:"linux/amd64"}

The docker version is as follows:
Client: Docker Engine - Community
 Version:           20.10.10
 API version:       1.41
 Go version:        go1.16.9
 Git commit:        b485636
 Built:             Mon Oct 25 07:42:59 2021
 OS/Arch:           linux/amd64
 Context:           default
 Experimental:      true

Server: Docker Engine - Community
 Engine:
  Version:          20.10.9
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.16.8
  Git commit:       79ea9d3
  Built:            Mon Oct  4 16:06:37 2021
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.4.11
  GitCommit:        5b46e404f6b9f661a205e28d59c982d3634148f8
 nvidia:
  Version:          1.0.2
  GitCommit:        v1.0.2-0-g52b36a2
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0

The yaml used to deploy the GPU plugin is as follows:
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      # This annotation is deprecated. Kept here for backward compatibility
      # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
      annotations:
        scheduler.alpha.kubernetes.io/critical-pod: ""
      labels:
        name: nvidia-device-plugin-ds
    spec:
      tolerations:
        # This toleration is deprecated. Kept here for backward compatibility
        # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
        - key: CriticalAddonsOnly
          operator: Exists
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      nodeSelector:
        nvidia-device-enable: enable
      # Mark this pod as a critical add-on; when enabled, the critical add-on
      # scheduler reserves resources for critical add-on pods so that they can
      # be rescheduled after a failure.
      # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
      priorityClassName: "system-node-critical"
      containers:
        - image: 4pdosc/k8s-device-plugin:latest
          # - image: m7-ieg-pico-test01:5000/k8s-device-plugin-test:v0.9.0-ubuntu20.04
          imagePullPolicy: Always
          name: nvidia-device-plugin-ctr
          args: ["--fail-on-init-error=true", "--device-split-count=3", "--device-memory-scaling=1", "--device-cores-scaling=1"]
          env:
            - name: PCIBUSFILE
              value: "/usr/local/vgpu/pciinfo.vgpu"
          securityContext:
            allowPrivilegeEscalation: false
            capabilities:
              drop: ["ALL"]
          volumeMounts:
            - name: device-plugin
              mountPath: /var/lib/kubelet/device-plugins
            - name: vgpu-dir
              mountPath: /usr/local/vgpu
      volumes:
        - name: device-plugin
          hostPath:
            path: /var/lib/kubelet/device-plugins
        - name: vgpu-dir
          hostPath:
            path: /usr/local/vgpu

The GPU driver information is as follows:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 970     Off  | 00000000:01:00.0 Off |                  N/A |
| 36%   33C    P8    27W / 200W |      0MiB /  4041MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

After the GPU is successfully split in the cluster, we start different pods that use vGPUs, but device memory isolation does not seem to be in effect, and the pods interfere with each other when training at the same time. Is this caused by my CUDA version, or is device memory actually not isolated?

Many thanks.
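
One way to tell the two cases apart is a hedged check, assuming the per-vGPU limit is enforced by the preloaded libvgpu.so and reflected inside the container: compare what each pod reports with what the host reports while both are training (pod names below are placeholders):

# Hedged isolation check.
kubectl exec -it gpu-pod-a -- nvidia-smi   # should show the per-vGPU share, not the full 4041MiB
kubectl exec -it gpu-pod-b -- nvidia-smi
# On the host, both pods' processes should appear, each bounded by its share.
nvidia-smi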
