Comments (6)
This definitely points to NVIDIA_MIG_MONITOR_DEVICES not being set correctly. Can you verify that this setting is actually being picked up in the container? Meaning, exec into the container and run export to observe the env vars that are set.
from k8s-device-plugin.
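A minimal sketch of the check suggested above, for anyone who wants to script it. It assumes you have already captured the output of export (or env) from inside the container as text; the helper name find_env_var is hypothetical and not part of the device plugin or the GPU Operator.

```python
import os

VAR = "NVIDIA_MIG_MONITOR_DEVICES"


def find_env_var(env_text, name=VAR):
    """Return the value of `name` from env/export-style output, or None if absent.

    Hypothetical helper for illustration only.
    """
    for line in env_text.splitlines():
        line = line.strip()
        # bash's `export` prints `declare -x NAME="value"`; other shells print `export NAME='value'`
        for prefix in ("declare -x ", "export "):
            if line.startswith(prefix):
                line = line[len(prefix):]
        key, sep, value = line.partition("=")
        if sep and key == name:
            # Drop the surrounding quotes that `export` adds
            return value.strip("'\"")
    return None


if __name__ == "__main__":
    # When run inside the container itself, the process environment can be read directly.
    value = os.environ.get(VAR)
    print(f"{VAR}={value}" if value is not None else f"{VAR} is not set")
```

Run directly inside the container, the script reports whether the variable is visible to a process there. Alternatively, capture the output of something like kubectl exec <pod> -- env (pod name left as a placeholder) and pass that text to find_env_var.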
@yunfeng-scale since you mention the GPU Operator being used, could you please confirm the GPU Operator version that is being used to deploy the v0.14.0 version of the device plugin?
from k8s-device-plugin.
GPU Operator version is 22.9.1, driver version is 470.161.03. Will circle back on checking NVIDIA_MIG_MONITOR_DEVICES next week.
from k8s-device-plugin.
Sorry for the late reply. @klueska yes, I can confirm this is set correctly: I exec'd into a container and grepped the env vars.
Also, the problem persists with the latest GPU Operator, 23.9.2.
from k8s-device-plugin.
Tried to install gpu-feature-discovery from its own helm chart (removing it from the GPU Operator) and that didn't work either.
Also, can you help me understand what sets the permissions based on the env var NVIDIA_MIG_MONITOR_DEVICES, so I may be able to do some investigation?
from k8s-device-plugin.
For others encountering the same issue: we upgraded EKS from 1.23 to 1.29 and the permission issue is resolved.
from k8s-device-plugin.