Comments (11)
How did you set up MPS?
from k8s-device-plugin.
I haven't set MPS in the YAML; I just requested GPU resources as in time-slicing mode. How should I set it up? Thank you!
The settings to enable CUDA MPS are as follows:
version: v1
flags:
  migStrategy: "none"
  failOnInitError: true
  nvidiaDriverRoot: "/"
  plugin:
    passDeviceSpecs: false
    deviceListStrategy: "envvar"
    deviceIDStrategy: "uuid"
  gfd:
    oneshot: false
    noTimestamp: false
    outputFile: /etc/kubernetes/node-feature-discovery/features.d/gfd
    sleepInterval: 60s
sharing:
  mps:
    resources:
    - name: nvidia.com/gpu
      replicas: 10
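With replicas: 10, each physical GPU is advertised as ten nvidia.com/gpu resources. A minimal pod spec to exercise one shared replica might look like this (the pod name and image tag are illustrative, not from this thread; the vectoradd CUDA sample is just a convenient test workload):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: mps-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda-vectoradd
    # Any CUDA workload image works; this sample image is a common smoke test.
    image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1
    resources:
      limits:
        nvidia.com/gpu: 1   # one of the 10 MPS replicas per physical GPU
```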
@ysz-github do you have an example application / podspec that you're using to confirm this?
Could you also please confirm your driver version? We are investigating an issue where setting the device memory limits by UUID is not having the desired effect.
I have the same issue using MPS with a CUDA process in Docker. The driver is 535.129.03 and the nvidia-device-plugin version is 0.15.0-rc1.
There is a known issue with 0.15.0-rc.1 where memory limits were not correctly applied. This will be addressed in v0.15.0-rc.2 which we will release soon.
OK, got it. Thanks for your reply!
@aphrodite1028 @ysz-github we have just released https://github.com/NVIDIA/k8s-device-plugin/releases/tag/v0.15.0-rc.2 which should address this issue. Please let us know if you're still experiencing problems.
I found https://github.com/NVIDIA/k8s-device-plugin/blob/main/cmd/mps-control-daemon/mps/daemon.go#L77-L85 here.
If I do not set the CUDA_VISIBLE_DEVICES env var and start nvidia-cuda-mps-control -d manually, setting the device memory limit fails and nvidia-cuda-mps-server is not found in the container. If I set it again, ignoring the mps-control-daemon DaemonSet config, it succeeds on the host machine but segfaults in the container.
How do I set the device memory limit for a client in a container?
driver version is 535.129.03
GPU is RTX A6000
Also, when I deploy via helm in k8s, I get an error like "linux mounts: path /run/nvidia/mps is mounted on /run but it is not a shared mount" when mountPropagation is set:
volumeMounts:
- mountPath: /mps
  mountPropagation: Bidirectional
  name: mps-root
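One common workaround for that error, assuming the root cause is that /run on the host is a private mount (mountPropagation: Bidirectional requires the parent mount of the hostPath to be shared), is to make it shared on the host before installing the chart. This is a sketch; adjust the path to wherever the mps-root hostPath actually lives on your nodes:

```shell
# On the host node: make the parent mount of /run/nvidia/mps shared.
sudo mount --make-shared /run
# Verify: the PROPAGATION column for /run should now show "shared".
findmnt -o TARGET,PROPAGATION /run
```

Note this does not persist across reboots unless your distro or a systemd unit reapplies it.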
@aphrodite1028: You shouldn't need to do anything special in your user container. The system starts the MPS server for all GPUs on the machine and your client will be forced to make use of it.
These lines set the upper limit on the pinned device memory and thread percentage consumable by the client.
https://github.com/NVIDIA/k8s-device-plugin/blob/main/cmd/mps-control-daemon/mps/daemon.go#L111-L122
You can manually adjust the pinned memory limit and thread percentage to something smaller than this using the env vars when you start your container (but you can't set them to something larger).
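For example, a client container could lower its own limits below what the control daemon granted by setting the standard CUDA MPS client env vars in its pod spec (the values here are illustrative, and must not exceed the daemon-side limits):

```yaml
env:
- name: CUDA_MPS_PINNED_DEVICE_MEM_LIMIT
  value: "0=2G"   # device index 0: cap this client's pinned memory at 2 GiB
- name: CUDA_MPS_ACTIVE_THREAD_PERCENTAGE
  value: "10"     # this client may use at most 10% of the SMs
```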
Thanks for your reply.
Does the MPS pinned device memory limit have a driver version requirement? Looking at man nvidia-cuda-mps-control on driver 470, the set_default_device_pinned_mem_limit command is not listed.
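If I understand correctly, that command was introduced in a newer driver branch than R470 (around the CUDA 11.5-era drivers), so its absence there is expected; a quick way to check what your node actually has is:

```shell
# Report the installed driver version on this node.
nvidia-smi --query-gpu=driver_version --format=csv,noheader
# See whether your control daemon's man page documents the command.
man nvidia-cuda-mps-control | grep -A2 set_default_device_pinned_mem_limit
```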
Related Issues (20)
- Using CUDA MPS to enable GPU sharing in K8S, error:error checking MPS daemon health HOT 2
- K3s in Docker (K3D) - `nvml error: insufficient permissions`
- Fix e2e tests HOT 1
- WSL2 - No devices found. Waiting indefinitely. HOT 3
- MPS use error: Failed to allocate device vector A (error code all CUDA-capable devices are busy or unavailable)! HOT 30
- Back-off restarting failed container nvidia-device-plugin-ctr HOT 3
- Error in nvidia-device-plugin pod. HOT 2
- Go Package: github.com/opencontainers/runc 1.0.0-rc93 < 1.1.12 - Local Sandbox Bypass Vulnerability HOT 1
- When use MPS, add a initContainers to default set compute model
- update nodelabel for config-manger k8s-device-plugin continuing printing error msg, not stop HOT 1
- allPossibleMigStrategiesAreNone is false when using default values HOT 4
- Fix mode detection on Tegra-based platforms that support NVML HOT 1
- Workloads keep in hang state except cuda-sample:vectoradd under MPS mode HOT 9
- mps server error Failed to start : invalid argument
- nvidia-device-plugin.hasConfigMap returns a string HOT 9
- helm: can't upgrade to 0.15.0 in place due to daemonset label selector change HOT 3
- Addressing several security vulnerabilities in the version v0.15.0
- Failed when deploy via helm HOT 1
- The plugin has already support nvlink? HOT 1
- K3S - Failed to start plugin: error waiting for MPS daemon HOT 6