Comments (11)
Which version of dcgm-exporter can I use to see exported_pod?
from dcgm-exporter.
I use dcgm-exporter-2.4.5-2.6.7-ubuntu20.04
2.2.9-2.4.0-ubuntu20.04 didn't help
Nvidia gpu-operator 1.11.0
You can check your ServiceMonitor config. Did you add honorLabels: true?
> You can check your ServiceMonitor config. Did you add honorLabels: true?
Yes
curl localhost:9400/metrics returns nothing containing "exported".
Is there any solution?
@Muscule
The exported_pod label is created by Prometheus when it attaches server-side labels, not by the exporter. So it is expected that 'curl localhost:9400/metrics' returns nothing with "exported".
Let's assume you set honor_labels to "false".
When Prometheus scrapes the metrics, relabeling rules can attach extra labels to the target.
If you scrape dcgm-exporter through a ServiceMonitor in Kubernetes, Prometheus attaches additional labels such as "node", "pod" and "namespace".
These new labels clash with the labels of the same name coming from dcgm-exporter.
In that case, Prometheus adds the prefix "exported_" to the conflicting labels from the scraped data.
But if you set honor_labels to "true", Prometheus keeps the source labels and ignores the conflicting server-side labels, so no exported_ labels appear.
Check the link below
https://prometheus.io/docs/prometheus/latest/configuration/configuration/#scrape_config
# If honor_labels is set to "true", label conflicts are resolved by keeping label
# values from the scraped data and ignoring the conflicting server-side labels.
#
# If honor_labels is set to "false", label conflicts are resolved by renaming
# conflicting labels in the scraped data to "exported_<original-label>" (for
# example "exported_instance", "exported_job") and then attaching server-side
# labels.
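To make the rule quoted above concrete, here is a minimal Python sketch of the conflict resolution. It is a simplified model written for this thread, not Prometheus's actual implementation, and the function name merge_labels and all label values are hypothetical.

```python
def merge_labels(scraped, target, honor_labels):
    """Simplified model of how one scraped sample's labels are merged
    with the server-side (target) labels. Not Prometheus's real code."""
    merged = dict(scraped)
    for name, value in target.items():
        if name in scraped:
            if honor_labels:
                # Keep the exporter's value, ignore the server-side label.
                continue
            # Rename the conflicting scraped label to exported_<name>.
            merged["exported_" + name] = scraped[name]
        # Attach the server-side label.
        merged[name] = value
    return merged


# Hypothetical label values, for illustration only.
scraped = {"gpu": "0", "pod": "cuda-job"}
target = {"pod": "dcgm-exporter-abc12", "namespace": "gpu-operator"}

print(merge_labels(scraped, target, honor_labels=False))
print(merge_labels(scraped, target, honor_labels=True))
```

With honor_labels=False the workload pod ends up under exported_pod and pod holds the exporter's own pod. With honor_labels=True the pod label keeps the value sent by the exporter and no exported_ label is created.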
The source metrics from the exporter don't show me that information either.
The ServiceMonitor also doesn't do any relabeling by default:
https://github.com/NVIDIA/dcgm-exporter/blob/main/deployment/templates/service-monitor.yaml
extraEnv:
  - name: "DCGM_EXPORTER_KUBERNETES"
    value: "true"
  - name: "DCGM_EXPORTER_KUBERNETES_GPU_ID_TYPE"
    value: "uid"
This seems to be working, but I'm still testing.
BTW: nvidia's products are awesome and docs are total shit