Right now, when dcgm-exporter is deployed in Kubernetes (we're using gpu-operator), th

Add Kubernetes node name to exported labels about dcgm-exporter HOT 9 CLOSED

neggert commented on June 21, 2024 7

Add Kubernetes node name to exported labels

from dcgm-exporter.

Comments (9)

k0nstantinv commented on June 21, 2024 3

@neggert As a workaround, you can do this right now on the Prometheus side, in the dcgm-exporter scrape job, using relabel_configs. For example:

- job_name: dcgm-exporter
  scrape_interval: 30s
  scrape_timeout: 10s
  kubernetes_sd_configs:
    - role: endpoints
  relabel_configs:
    - source_labels: [__meta_kubernetes_endpoints_name]
      regex: 'dcgm-exporter'
      action: keep
    - source_labels: [__meta_kubernetes_namespace]
      action: replace
      target_label: namespace
    - source_labels: [__meta_kubernetes_pod_name]
      action: replace
      target_label: pod_name
    - source_labels: [__meta_kubernetes_pod_node_name]
      action: replace
      target_label: node

Result:

DCGM_FI_DEV_FB_FREE{UUID="GPU-16e319ba-0b7d-3a4b-e35f-915bf484870f", cluster="test-cluster01", device="nvidia0", gpu="0", instance="10.109.133.139:9400", job="dcgm-exporter", namespace="dcgm-exporter", node="node06", pod_name="dcgm-exporter-k22f5"}

PlayMTL commented on June 21, 2024 1

Hey, I solved this by adding some relabelings to my ServiceMonitor.

I also overwrote the instance label so I don't have to customize my Grafana dashboards.

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
...
spec:
  endpoints:
  - relabelings:
    - action: replace
      sourceLabels:
      - __meta_kubernetes_pod_node_name
      targetLabel: nodename
    - action: replace
      sourceLabels:
      - __meta_kubernetes_pod_node_name
      targetLabel: instance
    ...

harjitdotsingh commented on June 21, 2024 1

How can I get the name of the pod that is actually using the GPU? Right now the pod label we get is the name of the dcgm-exporter pod, and the job label is populated with the name of the GPU metrics job.

alex-g-tejada commented on June 21, 2024

Would it be possible to have pod labels in general added as well? For example, we have Kubernetes pods labeled app=dev, and these labels are not visible in the DCGM metrics.

slyt commented on June 21, 2024

I needed node_ip and hostname to join DCGM metrics to Node Exporter metrics.

I used the following relabelings in my ServiceMonitor to add node_ip and hostname labels to all DCGM metrics:

spec:
  selector:
    matchLabels:
      app: nvidia-dcgm-exporter # target service
  namespaceSelector:
    matchNames:
      - nvidia-gpu-operator
  endpoints:
    - port: gpu-metrics
      interval: 5s # Scrape interval
      relabelings:
        - sourceLabels: ["__meta_kubernetes_pod_host_ip"]
          regex: "(.*)"
          replacement: "$1"
          targetLabel: "node_ip"
        - sourceLabels: ["__meta_kubernetes_pod_node_name"]
          regex: "(.*)"
          replacement: "$1"
          targetLabel: "hostname"
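
With those labels in place, a join against Node Exporter could look roughly like the recording rule below. This is only a sketch: the group and rule names are made up, it assumes Node Exporter's `node_uname_info` metric is being scraped, and it assumes the `nodename` label on that metric matches the Kubernetes node name stored in `hostname`, which is not guaranteed in every cluster.

```yaml
groups:
  - name: dcgm-node-join                        # hypothetical group name
    rules:
      # Attach the kernel release reported by Node Exporter to each GPU series.
      # node_uname_info has the value 1, so the multiplication keeps the DCGM value.
      - record: dcgm:fb_free:with_kernel_release   # hypothetical rule name
        expr: |
          DCGM_FI_DEV_FB_FREE
            * on (hostname) group_left (release)
          label_replace(node_uname_info, "hostname", "$1", "nodename", "(.+)")
```

The `label_replace` call copies `nodename` into a `hostname` label so both sides share a label to match on. A join on `node_ip` would work the same way, but would first need the port stripped from Node Exporter's `instance` label.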

thekuffs commented on June 21, 2024

Hi, as I commented in #99 (comment), this feature does not need to be added to dcgm-exporter. It is already available in Prometheus. My comment has references on how to configure it.

francescov1 commented on June 21, 2024

@thekuffs Do you mind elaborating on how to do this?

I've got a GPU node that is running two pods: dcgm-exporter and a workload that uses the GPU. I'd like to be able to associate that workload with the metrics exposed by dcgm-exporter.

How exactly is this done? I understand how to use the pod role with kubernetes_sd_configs, but it's not clear to me how that solves the problem. I can target my workload this way, but my workload doesn't expose a /metrics endpoint (only dcgm-exporter does). I can target the dcgm-exporter pod, but then I'm still not sure how to associate its metrics with the workload.

thekuffs commented on June 21, 2024

@francescov1 I think I was just wrong when I posted that comment. I had forgotten that the GPU resources on a node can be shared among multiple pods. If your workload is 1:1:1 (gpu:pod:node), you can use something like the cAdvisor/kubelet metrics to "join" the DCGM metrics to your own workload: those metrics carry both a pod and a node label. Select them by your workload's pod name, then join them against themselves to find the dcgm-exporter instance on that node, which in turn gives you the DCGM metrics for the node your workload is running on. Like I said though, that requires you to be allocating the entire GPU on a node to a single pod.
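
A rough sketch of that join, written as a Prometheus recording rule. Several things here are assumptions rather than anything dcgm-exporter provides: the DCGM series are assumed to carry a `node` label (for example from a relabel_config like the one earlier in this thread), the cAdvisor series are assumed to carry `node` and `pod` labels (this depends on how the kubelet job is relabeled), and the pod name pattern, group name, and rule name are placeholders. The `group` aggregation needs Prometheus 2.20 or later.

```yaml
groups:
  - name: dcgm-workload-join                       # hypothetical group name
    rules:
      # Tag each GPU series with the workload pod running on the same node.
      # `group by` yields the value 1 per (node, pod), so the DCGM value is unchanged.
      - record: workload:dcgm_fi_dev_fb_free:by_pod   # hypothetical rule name
        expr: |
          DCGM_FI_DEV_FB_FREE
            * on (node) group_left (workload_pod)
          label_replace(
            group by (node, pod) (container_memory_working_set_bytes{pod=~"my-workload-.*"}),
            "workload_pod", "$1", "pod", "(.+)"
          )
```

The pod name is copied into a separate `workload_pod` label so it does not overwrite any pod label already present on the DCGM series. If more than one matching pod runs on the same node, the match becomes many-to-many and the query fails, which reflects the 1:1:1 restriction.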

So, I apologize for my comment. I made a few other comments on other similar issues without thinking thoroughly about it.

francescov1 commented on June 21, 2024

@thekuffs Thanks for the insight, this is exactly what I need. Will give it a shot!
