Comments (3)
Hi,
You need to check whether dcgm-exporter is able to connect to the remote nv-hostengine configured as DCGM_REMOTE_HOSTENGINE_INFO: $(NODE_IP):5555.
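One rough way to test that connection is a plain TCP check against the hostengine port. The sketch below is illustrative only: the function name can_reach_hostengine and the timeout are assumptions, not part of dcgm-exporter, and a successful TCP connect only shows that something is listening on the port, not that it is a healthy nv-hostengine.

```python
import socket


def can_reach_hostengine(host: str, port: int = 5555, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout.

    Hypothetical helper: it only checks that the port accepts connections.
    """
    try:
        # create_connection resolves the host and attempts a TCP connect.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, unreachable host, DNS failure, and timeouts.
        return False
```

Run it from inside the dcgm-exporter pod (or a pod on the same network) with the node IP that $(NODE_IP) resolves to, for example can_reach_hostengine("<node-ip>", 5555).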
dcgm-exporter supports two modes:
1) dcgm-exporter connects to a standalone nv-hostengine running in another container or directly on the node.
2) dcgm-exporter runs an embedded hostengine and does not need to connect to any other hostengine on port 5555.
In the second mode, dcgm-exporter will fail to initialize some modules if another nv-hostengine process is running on the same physical machine.
You need to look at the dcgm-exporter command line to see which mode it is running in.
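As a rough illustration of that check, the sketch below classifies the mode from a process's arguments and environment. It assumes that remote mode is selected either by the DCGM_REMOTE_HOSTENGINE_INFO variable mentioned above or by a -r / --remote-hostengine-info flag; the flag names are an assumption here and should be confirmed against dcgm-exporter --help for your version. The function hostengine_mode is hypothetical.

```python
from typing import Mapping, Sequence

# Assumed flag names; verify with `dcgm-exporter --help` for your version.
REMOTE_FLAGS = ("-r", "--remote-hostengine-info")
# Environment variable named in the comment above.
REMOTE_ENV = "DCGM_REMOTE_HOSTENGINE_INFO"


def hostengine_mode(argv: Sequence[str], env: Mapping[str, str]) -> str:
    """Return "remote" if a remote hostengine is configured, else "embedded"."""
    for arg in argv:
        # Accept both "--flag value" and "--flag=value" spellings.
        if arg in REMOTE_FLAGS or arg.startswith("--remote-hostengine-info="):
            return "remote"
    if env.get(REMOTE_ENV):
        return "remote"
    return "embedded"
```

In Kubernetes, the arguments and environment can be read from the container spec of the dcgm-exporter DaemonSet (for example with kubectl get daemonset ... -o yaml) and passed to this function.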
from dcgm-exporter.
Hi,
"You need to look at the dcgm-exporter command line to see which mode it is running in."
Can you give me some pointers on how to check this? I'm trying this for the first time.
Please take a look at line 23 of dcgm-exporter/deployment/values.yaml (commit 0869a2a).