
Comments (10)

ThisIsQasim avatar ThisIsQasim commented on September 27, 2024

This appears to have been reported repeatedly: #151, #201, #222.

from dcgm-exporter.

nvvfedorov avatar nvvfedorov commented on September 27, 2024

On GPUs with the Turing architecture, the performance metrics require exclusive access to the GPU hardware. If another pod tries to read the performance metrics, the DCGM exporter cannot read them.


ThisIsQasim avatar ThisIsQasim commented on September 27, 2024

There is only one pod per node trying to read the metrics, but multiple pods are using the same GPU. The issue is that dcgm-exporter should report metrics for all the pods using the GPU.


nvvfedorov avatar nvvfedorov commented on September 27, 2024

@ThisIsQasim, can you share how you request GPU resources for pods?


ThisIsQasim avatar ThisIsQasim commented on September 27, 2024

Sure. A single GPU is advertised as multiple GPUs using the NVIDIA device plugin:

apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
data:
  any: |-
    version: v1
    flags:
      migStrategy: none
    sharing:
      timeSlicing:
        renameByDefault: false
        failRequestsGreaterThanOne: false
        resources:
          - name: nvidia.com/gpu
            replicas: 4

and then GPUs are requested with the regular resource requests:

resources:
  requests:
    cpu: 3600m
  limits:
    memory: 13000Mi
    nvidia.com/gpu: "1"


nvvfedorov avatar nvvfedorov commented on September 27, 2024

@ThisIsQasim, and do you use the GPU Operator?


ThisIsQasim avatar ThisIsQasim commented on September 27, 2024

I do not. It’s manually deployed.


nvvfedorov avatar nvvfedorov commented on September 27, 2024

Unfortunately, there is a known dcgm-exporter limitation: DCGM-Exporter does not support associating metrics to containers when GPU time-slicing is enabled with the NVIDIA Kubernetes Device Plugin.


svetly-todorov avatar svetly-todorov commented on September 27, 2024

> Unfortunately, there is a known dcgm-exporter limitation: DCGM-Exporter does not support associating metrics to containers when GPU time-slicing is enabled with the NVIDIA Kubernetes Device Plugin.

Is there a known root-cause for this issue?


From what I've dug up:

Pods using time-sliced GPUs have a -<idx> suffix appended to their device IDs, like so:

&ContainerDevices{ResourceName:nvidia.com/gpu,DeviceIds:[GPU-51424525-5928-4e4c-2503-8ca3bca0b134-2],}

Thanks @larry-lu-lu (#201 (comment)).

Therefore when the deviceToPodMap is updated here, none of the pods using the GPU are associated with the base deviceID. Execution then reaches this loop and, because none of the pods in deviceToPod are associated with the baseID, dcgm-exporter totally skips the pod/namespace label and moves on.
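To make the mismatch concrete, here is a minimal Go sketch. It is not the dcgm-exporter code: `baseDeviceID`, `PodInfo`, and the map contents are hypothetical names introduced for illustration, and the `-<idx>` suffix format is assumed only from the log line quoted above (other device plugin versions may format replica IDs differently). It shows that a lookup by the base UUID misses when the map is keyed by the suffixed IDs, and that stripping the suffix before inserting would make the lookup hit.

```go
package main

import (
	"fmt"
	"regexp"
)

// replicaID matches a GPU UUID followed by a "-<idx>" replica suffix.
// ASSUMPTION: the suffix format is inferred from the log line in this thread.
var replicaID = regexp.MustCompile(
	`^(GPU-[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12})-\d+$`)

// baseDeviceID (hypothetical helper) strips the replica suffix if present
// and returns every other ID unchanged.
func baseDeviceID(id string) string {
	if m := replicaID.FindStringSubmatch(id); m != nil {
		return m[1]
	}
	return id
}

// PodInfo is a hypothetical stand-in for the pod identity used as a label.
type PodInfo struct {
	Namespace, Name string
}

func main() {
	const base = "GPU-51424525-5928-4e4c-2503-8ca3bca0b134"

	// Keyed by the IDs as reported for time-sliced pods.
	deviceToPod := map[string]PodInfo{
		base + "-2": {Namespace: "default", Name: "pod-a"},
	}

	// The metric carries the base UUID, so this lookup misses.
	_, found := deviceToPod[base]
	fmt.Println("lookup by base UUID, suffixed keys:", found)

	// Normalizing the keys on insert makes the same lookup hit.
	normalized := map[string]PodInfo{}
	for id, pod := range deviceToPod {
		normalized[baseDeviceID(id)] = pod
	}
	_, found = normalized[base]
	fmt.Println("lookup by base UUID, normalized keys:", found)
}
```

Normalizing the key alone still keeps only one pod per GPU, since several replicas collapse onto the same base UUID.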

Unfortunately there doesn't seem to be a quick fix. As far as I understand, the DCGM metrics we collect are associated with exactly one UUID. This is OK for MIGs because they will each have a unique UUID. But metrics on time-sliced GPUs will, if I'm not mistaken, have the UUID of the base device, without an index attached.
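One way to picture the consequence is the sketch below, assuming that a metric sample carries only the base UUID while several pods share that GPU. All types and the `fanOut` function are hypothetical and are not part of dcgm-exporter. The sketch attaches the same device-wide sample to every pod sharing the GPU, which illustrates why this is not a real per-pod measurement: each pod would be labeled with an identical value.

```go
package main

import "fmt"

// PodInfo is a hypothetical pod identity used for labeling.
type PodInfo struct {
	Namespace, Name string
}

// Sample is a hypothetical metric sample keyed by the base GPU UUID.
type Sample struct {
	UUID  string
	Name  string
	Value float64
}

// LabeledSample pairs a sample with the pod it is attributed to.
type LabeledSample struct {
	Sample
	Pod PodInfo
}

// fanOut returns one copy of s per pod sharing s.UUID. If no pod is known
// for the UUID, it returns the sample once with an empty PodInfo.
func fanOut(s Sample, podsByUUID map[string][]PodInfo) []LabeledSample {
	pods := podsByUUID[s.UUID]
	if len(pods) == 0 {
		return []LabeledSample{{Sample: s}}
	}
	out := make([]LabeledSample, 0, len(pods))
	for _, p := range pods {
		out = append(out, LabeledSample{Sample: s, Pod: p})
	}
	return out
}

func main() {
	const uuid = "GPU-51424525-5928-4e4c-2503-8ca3bca0b134"

	// Several time-sliced pods share one physical GPU (illustrative data).
	podsByUUID := map[string][]PodInfo{
		uuid: {
			{Namespace: "default", Name: "pod-a"},
			{Namespace: "default", Name: "pod-b"},
		},
	}

	s := Sample{UUID: uuid, Name: "DCGM_FI_DEV_GPU_UTIL", Value: 87}
	for _, ls := range fanOut(s, podsByUUID) {
		fmt.Printf("%s{pod=%q,namespace=%q} %v\n",
			ls.Name, ls.Pod.Name, ls.Pod.Namespace, ls.Value)
	}
}
```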

@nikkon-dev and others, forgive me for pinging, I would really like to know if my understanding is correct here.


ettelr avatar ettelr commented on September 27, 2024

I understand that, in the meantime, the same will apply to the new MPS support in the device plugin: per-pod metrics will not be shown. Is that correct?
Another question: if we have pods that do not request a GPU through the device plugin but are able to use the GPU through some tricks (mounts, etc.), can they be reported to DCGM when they use the GPU?

