Comments (10)
This appears to have been reported repeatedly: #151, #201, #222.
from dcgm-exporter.
The performance metrics require exclusive access to the GPU hardware on the Turing architecture. If another pod tries to read performance metrics, the DCGM exporter cannot read them.
from dcgm-exporter.
There is only one pod per node trying to read the metrics, but multiple pods are using the same GPU. The issue is that dcgm-exporter should report metrics for all the pods using the GPU.
from dcgm-exporter.
@ThisIsQasim, can you share how you request GPU resources for pods?
from dcgm-exporter.
Sure. A single GPU is advertised as multiple GPUs using the NVIDIA device plugin:
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
data:
  any: |-
    version: v1
    flags:
      migStrategy: none
    sharing:
      timeSlicing:
        renameByDefault: false
        failRequestsGreaterThanOne: false
        resources:
          - name: nvidia.com/gpu
            replicas: 4
and then GPUs are requested with the regular resource requests:
resources:
  requests:
    cpu: 3600m
  limits:
    memory: 13000Mi
    nvidia.com/gpu: "1"
from dcgm-exporter.
@ThisIsQasim, and do you use the GPU Operator?
from dcgm-exporter.
I do not. It's manually deployed.
from dcgm-exporter.
Unfortunately, there is a known dcgm-exporter limitation: DCGM-Exporter does not support associating metrics to containers when GPU time-slicing is enabled with the NVIDIA Kubernetes Device Plugin.
from dcgm-exporter.
> Unfortunately, there is a known dcgm-exporter limitation: DCGM-Exporter does not support associating metrics to containers when GPU time-slicing is enabled with the NVIDIA Kubernetes Device Plugin.
Is there a known root-cause for this issue?
From what I've dug up:
Pods using time-sliced GPUs have a -<idx> suffix appended to the end of their device IDs, like so:
&ContainerDevices{ResourceName:nvidia.com/gpu,DeviceIds:[GPU-51424525-5928-4e4c-2503-8ca3bca0b134-2],}
Thanks @larry-lu-lu (#201 (comment)).
Therefore when the deviceToPodMap is updated here, none of the pods using the GPU are associated with the base deviceID. Execution then reaches this loop and, because none of the pods in deviceToPod are associated with the baseID, dcgm-exporter totally skips the pod/namespace label and moves on.
Unfortunately there doesn't seem to be a quick fix. As far as I understand, the DCGM metrics we collect are associated with exactly one UUID. This is OK for MIGs because they will each have a unique UUID. But metrics on time-sliced GPUs will, if I'm not mistaken, have the UUID of the base device, without an index attached.
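To make the mismatch concrete, here is a minimal Go sketch of the lookup described above. It is not the dcgm-exporter code: `Allocation`, `PodInfo`, `baseUUID`, and `buildDeviceToPods` are hypothetical names, the pod names are made up, and the `-<idx>` suffix format is assumed from the device ID quoted above. It shows that a map keyed by the suffixed IDs finds nothing for the base UUID, while a map keyed by the stripped ID returns every pod sharing the GPU, which is why a single per-UUID metric still cannot be attributed to exactly one pod.

```go
package main

import (
	"fmt"
	"regexp"
)

// replicaID matches a GPU UUID followed by a "-<idx>" replica suffix.
// Assumption: the base UUID has the usual "GPU-" + 8-4-4-4-12 hex layout.
var replicaID = regexp.MustCompile(
	`^(GPU-[0-9a-fA-F]{8}(?:-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12})-(\d+)$`)

// baseUUID strips a trailing replica index, if present, and returns the base UUID.
func baseUUID(deviceID string) string {
	if m := replicaID.FindStringSubmatch(deviceID); m != nil {
		return m[1]
	}
	return deviceID
}

// PodInfo is a hypothetical stand-in for the exporter's pod record.
type PodInfo struct {
	Namespace, Name string
}

// Allocation pairs a device ID, as reported for a container, with its pod.
type Allocation struct {
	DeviceID string
	Pod      PodInfo
}

// buildDeviceToPods groups pods by device ID. With normalize set,
// replica suffixes are stripped so the keys are base GPU UUIDs.
func buildDeviceToPods(allocs []Allocation, normalize bool) map[string][]PodInfo {
	out := make(map[string][]PodInfo)
	for _, a := range allocs {
		key := a.DeviceID
		if normalize {
			key = baseUUID(key)
		}
		out[key] = append(out[key], a.Pod)
	}
	return out
}

func main() {
	allocs := []Allocation{
		{"GPU-51424525-5928-4e4c-2503-8ca3bca0b134-0", PodInfo{"default", "pod-a"}},
		{"GPU-51424525-5928-4e4c-2503-8ca3bca0b134-2", PodInfo{"default", "pod-b"}},
	}
	// The UUID a metric sample would carry: the base device, no index.
	metricUUID := "GPU-51424525-5928-4e4c-2503-8ca3bca0b134"

	raw := buildDeviceToPods(allocs, false)
	fmt.Println("pods found with suffixed keys:", len(raw[metricUUID]))

	norm := buildDeviceToPods(allocs, true)
	fmt.Println("pods found with base-UUID keys:", len(norm[metricUUID]))
}
```

With suffixed keys the lookup returns no pods, matching the skipped pod/namespace labels; with base-UUID keys it returns both pods, so any fix along these lines would have to decide how one GPU-level sample maps onto several pods.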
@nikkon-dev and others, forgive me for pinging, I would really like to know if my understanding is correct here.
from dcgm-exporter.
I understand that, for now, the same will apply to the new MPS support in the device plugin: per-pod metrics will not be shown. Is that correct?
Another question: if we have pods that do not request a GPU through the device plugin but are able to use one through some tricks (mounts, etc.), can they be reported by DCGM when they use the GPU?
from dcgm-exporter.