Comments (9)
@neggert As a workaround for now, you can do this on the Prometheus side, in the dcgm-exporter scrape job, e.g. with relabel_configs. For example:
- job_name: dcgm-exporter
  scrape_interval: 30s
  scrape_timeout: 10s
  kubernetes_sd_configs:
    - role: endpoints
  relabel_configs:
    - source_labels: [__meta_kubernetes_endpoints_name]
      regex: 'dcgm-exporter'
      action: keep
    - source_labels: [__meta_kubernetes_namespace]
      action: replace
      target_label: namespace
    - source_labels: [__meta_kubernetes_pod_name]
      action: replace
      target_label: pod_name
    - source_labels: [__meta_kubernetes_pod_node_name]
      action: replace
      target_label: node
Result:
DCGM_FI_DEV_FB_FREE{UUID="GPU-16e319ba-0b7d-3a4b-e35f-915bf484870f", cluster="test-cluster01", device="nvidia0", gpu="0", instance="10.109.133.139:9400", job="dcgm-exporter", namespace="dcgm-exporter", node="node06", pod_name="dcgm-exporter-k22f5"}
from dcgm-exporter.
Hey, I solved this by adding some relabelings to my ServiceMonitor.
I also overwrote the instance label so I don't have to customize my Grafana dashboards.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
...
spec:
  endpoints:
    - relabelings:
        - action: replace
          sourceLabels:
            - __meta_kubernetes_pod_node_name
          targetLabel: nodename
        - action: replace
          sourceLabels:
            - __meta_kubernetes_pod_node_name
          targetLabel: instance
...
from dcgm-exporter.
How can I get the name of the pod that is actually using the GPU? Right now what we get is the name of the dcgm-exporter pod, and the job label is populated with the name of the GPU metrics job.
from dcgm-exporter.
Would it be possible to have pod labels in general added as well? For example, we have Kubernetes pods labeled label.app=dev, and these labels are not visible in the DCGM metrics.
from dcgm-exporter.
I needed node_ip and hostname to join DCGM metrics to Node Exporter metrics.
I used the following relabelings in my ServiceMonitor to add node_ip and hostname labels to all DCGM metrics:
spec:
  selector:
    matchLabels:
      app: nvidia-dcgm-exporter # target service
  namespaceSelector:
    matchNames:
      - nvidia-gpu-operator
  endpoints:
    - port: gpu-metrics
      interval: 5s # Scrape interval
      relabelings:
        - sourceLabels: ["__meta_kubernetes_pod_host_ip"]
          regex: "(.*)"
          replacement: "$1"
          targetLabel: "node_ip"
        - sourceLabels: ["__meta_kubernetes_pod_node_name"]
          regex: "(.*)"
          replacement: "$1"
          targetLabel: "hostname"
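For anyone who wants to use these labels in a join, here is a rough sketch of a Prometheus recording rule that attaches the kernel release from Node Exporter's node_uname_info to a DCGM metric through the hostname label added above. The group name and rule name are made up for illustration. It also assumes that the nodename label on node_uname_info matches the Kubernetes node name exactly, which is not guaranteed (for example when one side is an FQDN).

```yaml
groups:
  - name: dcgm-node-exporter-join                    # hypothetical group name
    rules:
      # Copy node_uname_info's "nodename" into a "hostname" label so both
      # sides share a join key, then pull the "release" label onto each
      # DCGM series. node_uname_info has the value 1, so the DCGM value
      # is unchanged by the multiplication.
      - record: dcgm:fb_free:with_kernel_release     # hypothetical rule name
        expr: |
          DCGM_FI_DEV_FB_FREE
            * on (hostname) group_left (release)
          label_replace(node_uname_info, "hostname", "$1", "nodename", "(.+)")
```

The join only works if there is exactly one node_uname_info series per hostname. A node with several GPUs is fine, since group_left allows many DCGM series to match one Node Exporter series.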
from dcgm-exporter.
Hi, as I commented in #99 (comment), this feature does not need to be added to dcgm-exporter. It is already available in Prometheus. My comment has references on how to configure it.
from dcgm-exporter.
@thekuffs Do you mind elaborating on how to do this?
I've got a GPU node which is running two pods: dcgm-exporter and a workload which uses the GPU. I'd like to be able to associate that workload with the metrics exposed by dcgm-exporter.
How exactly is this done? I understand how to use the pod role with kubernetes_sd_configs, but it's not clear to me how this helps with the problem. I can target my workload through this, but my workload doesn't expose a /metrics endpoint (only dcgm-exporter does). I can target the dcgm-exporter pod, but then I'm still not sure how to associate those metrics with the workload.
from dcgm-exporter.
@francescov1 I think I was just wrong when I posted that comment. I had forgotten that available GPU resources can be shared among multiple pods. If your workload is 1:1:1 (gpu:pod:node), you can use something like the cadvisor/kubelet metrics to "join" the dcgm metrics with your own workload. That is, there are metrics in cadvisor/kubelet that have both a pod and a node label. You can select those by your workload's pod name, then join them against themselves to find the specific instance of the dcgm exporter for that node, which in turn gives you the dcgm metrics for the node your workload is running on. Like I said though, that requires you to be allocating the entire GPU on a node to a single pod.
So, I apologize for my comment. I made a few other comments on other similar issues without thinking thoroughly about it.
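To make the join described above a bit more concrete, here is a minimal sketch written as a Prometheus recording rule. It covers only the 1:1:1 case. It assumes the DCGM series carry a node label (as in the relabel_configs at the top of this thread) and that the cAdvisor series carry pod and node labels, which depends on how the kubelet is scraped in your setup. The pod name pattern my-workload-.*, the workload_pod label, and the group and rule names are placeholders. It is also a simplified variant: it joins directly on node instead of first looking up the dcgm-exporter instance.

```yaml
groups:
  - name: dcgm-workload-join                  # hypothetical group name
    rules:
      # Right-hand side: one series (value 1) per node/pod pair for the
      # workload, with the pod name copied into "workload_pod" so it does
      # not collide with any pod label already present on the DCGM series.
      # Left-hand side: every DCGM series on that node. group_left keeps
      # the DCGM labels and value, and adds workload_pod.
      - record: workload:dcgm_fi_dev_fb_free  # hypothetical rule name
        expr: |
          DCGM_FI_DEV_FB_FREE
            * on (node) group_left (workload_pod)
          label_replace(
            group by (node, pod) (container_memory_working_set_bytes{pod=~"my-workload-.*"}),
            "workload_pod", "$1", "pod", "(.+)"
          )
```

If more than one matching pod runs on the same node, the right-hand side is no longer unique per node and the query fails with a many-to-many matching error, which matches the caveat above about allocating the whole GPU to a single pod. The group aggregator needs a reasonably recent Prometheus release.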
from dcgm-exporter.
@thekuffs Thanks for the insight, this is exactly what I need. Will give it a shot!
from dcgm-exporter.