What is the version? 3.3.5-3.4.1 What happened?


Per pod metrics not exposed with time-slicing enabled about dcgm-exporter HOT 10 OPEN

ThisIsQasim commented on September 27, 2024

Per pod metrics not exposed with time-slicing enabled

from dcgm-exporter.

Comments (10)

ThisIsQasim commented on September 27, 2024

This appears to have been reported repeatedly: #151, #201, #222.


nvvfedorov commented on September 27, 2024

The performance metrics require exclusive access to the GPU hardware on GPUs with the Turing architecture. If another pod tries to read performance metrics, dcgm-exporter cannot read them.


ThisIsQasim commented on September 27, 2024

There is only one pod per node trying to read the metrics, but multiple pods are using the same GPU. The issue is that dcgm-exporter should report metrics for all the pods using the GPU.


nvvfedorov commented on September 27, 2024

@ThisIsQasim, can you share how you request GPU resources for pods?


ThisIsQasim commented on September 27, 2024

Sure. A single GPU is advertised as multiple GPUs using the NVIDIA device plugin:

apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
data:
  any: |-
    version: v1
    flags:
      migStrategy: none
    sharing:
      timeSlicing:
        renameByDefault: false
        failRequestsGreaterThanOne: false
        resources:
          - name: nvidia.com/gpu
            replicas: 4

and then GPUs are requested with the regular resource requests:

resources:
  requests:
    cpu: 3600m
  limits:
    memory: 13000Mi
    nvidia.com/gpu: "1"
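
As a rough illustration of what the configuration above implies for scheduling, the sketch below assumes that each physical GPU is advertised `replicas` times under the unchanged `nvidia.com/gpu` resource name (as with `renameByDefault: false`). The function names are hypothetical and are not part of the device plugin.

```python
def advertised_gpu_capacity(physical_gpus: int, replicas: int) -> int:
    """Number of nvidia.com/gpu resources a node would advertise.

    Assumption: every physical GPU is exposed `replicas` times.
    """
    if physical_gpus < 0 or replicas < 1:
        raise ValueError("physical_gpus must be >= 0 and replicas >= 1")
    return physical_gpus * replicas


def max_gpu_pods(physical_gpus: int, replicas: int, gpus_per_pod: int = 1) -> int:
    """Upper bound on pods that each request `gpus_per_pod` GPU resources."""
    if gpus_per_pod < 1:
        raise ValueError("gpus_per_pod must be >= 1")
    return advertised_gpu_capacity(physical_gpus, replicas) // gpus_per_pod


# One physical GPU with replicas: 4, each pod requesting nvidia.com/gpu: "1"
print(max_gpu_pods(physical_gpus=1, replicas=4))  # 4
```

With these numbers, up to four pods can land on the same physical GPU, which is the situation in which per-pod attribution becomes a question.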


nvvfedorov commented on September 27, 2024

@ThisIsQasim, and do you use the GPU Operator?


ThisIsQasim commented on September 27, 2024

I do not. It’s manually deployed.


nvvfedorov commented on September 27, 2024

Unfortunately, there is a known dcgm-exporter limitation: DCGM-Exporter does not support associating metrics to containers when GPU time-slicing is enabled with the NVIDIA Kubernetes Device Plugin.


svetly-todorov commented on September 27, 2024

"Unfortunately, there is known dcgm-exporter limitation: DCGM-Exporter does not support associating metrics to containers when GPU time-slicing is enabled with the NVIDIA Kubernetes Device Plugin."

Is there a known root-cause for this issue?

From what I've dug up:

Pods using time-sliced GPUs have a -<idx> suffix appended to the end of their device IDs, like so:

&ContainerDevices{ResourceName:nvidia.com/gpu,DeviceIds:[GPU-51424525-5928-4e4c-2503-8ca3bca0b134-2],}

Thanks @larry-lu-lu (#201 (comment)).

Therefore, when the deviceToPodMap is updated here, none of the pods using the GPU are associated with the base device ID. Execution then reaches this loop and, because none of the pods in deviceToPod are associated with the base ID, dcgm-exporter skips the pod/namespace labels entirely and moves on.

Unfortunately there doesn't seem to be a quick fix. As far as I understand, the DCGM metrics we collect are associated with exactly one UUID. This is OK for MIGs because they will each have a unique UUID. But metrics on time-sliced GPUs will, if I'm not mistaken, have the UUID of the base device, without an index attached.
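
The lookup mismatch described above can be sketched in Python. This is not dcgm-exporter's code (the exporter is written in Go); it is a minimal illustration that assumes, based only on the example device ID quoted in this comment, that time-sliced replicas are named `<GPU UUID>-<idx>`. The helpers `base_device_id` and `build_device_to_pods`, the pod names, and the input shape are all hypothetical.

```python
import re
from collections import defaultdict

# Assumed naming: a GPU UUID followed by "-<idx>" for a time-sliced replica.
_REPLICA_ID = re.compile(
    r"^(GPU-[0-9a-fA-F]{8}(?:-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12})-(\d+)$"
)


def base_device_id(device_id: str) -> str:
    """Strip the replica index if present; return other IDs unchanged."""
    match = _REPLICA_ID.match(device_id)
    return match.group(1) if match else device_id


def build_device_to_pods(allocations):
    """Build two maps from (pod, device_id) pairs: keyed by the exact
    device ID, and keyed by the base device ID."""
    exact = defaultdict(list)
    normalized = defaultdict(list)
    for pod, device_id in allocations:
        exact[device_id].append(pod)
        normalized[base_device_id(device_id)].append(pod)
    return dict(exact), dict(normalized)


allocations = [
    ("pod-a", "GPU-51424525-5928-4e4c-2503-8ca3bca0b134-0"),
    ("pod-b", "GPU-51424525-5928-4e4c-2503-8ca3bca0b134-2"),
]
# Metrics are assumed to carry the base UUID, without a replica index.
metric_uuid = "GPU-51424525-5928-4e4c-2503-8ca3bca0b134"

exact, normalized = build_device_to_pods(allocations)
print(exact.get(metric_uuid))       # None: no pod is keyed by the base UUID
print(normalized.get(metric_uuid))  # ['pod-a', 'pod-b']
```

Even with the suffix stripped, every pod sharing the GPU maps to the same UUID, so a metric could at best be labelled with all of those pods; it still cannot be split per pod.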

@nikkon-dev and others, forgive me for pinging, I would really like to know if my understanding is correct here.


ettelr commented on September 27, 2024

I understand that, for now, the same applies to the new MPS support in the device plugin: per-pod metrics will not be shown. Is that correct?
Another question: if we have pods that do not request a GPU through the device plugin but are able to use one through other means (mounts, etc.), can they be reported to DCGM when they use the GPU?
