Comments (13)
I just got this working after reading some comments in the archived repo: NVIDIA/gpu-monitoring-tools#96 (comment)
My helm chart values:
###
#
# Reference: https://github.com/NVIDIA/dcgm-exporter/blob/main/deployment/values.yaml
#
serviceMonitor:
  enabled: false
resources:
  limits:
    cpu: 100m
    memory: 128Mi
  requests:
    cpu: 100m
    memory: 128Mi
securityContext:
  privileged: true
tolerations:
  - operator: Exists
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: cloud.google.com/gke-accelerator
              operator: Exists
podAnnotations:
  ad.datadoghq.com/exporter.check_names: |
    ["openmetrics"]
  ad.datadoghq.com/exporter.init_configs: |
    [{}]
  ad.datadoghq.com/exporter.instances: |
    [
      {
        "openmetrics_endpoint": "http://%%host%%:9400/metrics",
        "namespace": "nvidia-dcgm-exporter",
        "metrics": [{"*":"*"}]
      }
    ]
extraHostVolumes:
  - name: vulkan-icd-mount
    hostPath: /home/kubernetes/bin/nvidia/vulkan/icd.d
  - name: nvidia-install-dir-host
    hostPath: /home/kubernetes/bin/nvidia
extraVolumeMounts:
  - name: nvidia-install-dir-host
    mountPath: /usr/local/nvidia
    readOnly: true
  - name: vulkan-icd-mount
    mountPath: /etc/vulkan/icd.d
    readOnly: true
and...
❯ kubectl -n nvidia-dcgm-exporter logs -f nvidia-dcgm-exporter-4m855
time="2022-05-12T00:10:42Z" level=info msg="Starting dcgm-exporter"
time="2022-05-12T00:10:42Z" level=info msg="DCGM successfully initialized!"
time="2022-05-12T00:10:43Z" level=info msg="Collecting DCP Metrics"
time="2022-05-12T00:10:43Z" level=info msg="No configmap data specified, falling back to metric file /etc/dcgm-exporter/dcp-metrics-included.csv"
time="2022-05-12T00:10:44Z" level=info msg="Kubernetes metrics collection enabled!"
time="2022-05-12T00:10:44Z" level=info msg="Pipeline starting"
time="2022-05-12T00:10:44Z" level=info msg="Starting webserver"
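(Aside: the values above disable the chart's ServiceMonitor because scraping goes through Datadog. On a Prometheus Operator stack you would enable it instead; a minimal sketch, assuming your chart version exposes these serviceMonitor fields and that your Prometheus selects objects labeled release: prometheus:)

```yaml
serviceMonitor:
  enabled: true
  interval: 15s          # scrape interval; value here is an assumption
  additionalLabels:
    release: prometheus  # assumed label your Prometheus Operator selects on
```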
from dcgm-exporter.
Was there any progress on this in the last month? We're seeing the exact same issue on GKE, and it would be great to get some actual metrics from the GPUs.
Just to clarify, were the libnvidia-ml.so files found inside the container? I just pulled the image and checked, and did not find them. Also, have you followed the Kubernetes integration guide found here? https://docs.nvidia.com/datacenter/cloud-native/gpu-telemetry/dcgm-exporter.html#integrating-gpu-telemetry-into-kubernetes
@glowkey the NVIDIA drivers were already installed on the node by the nvidia-driver-installer that's now automatically included in GKE clusters. The libnvidia-ml.so files are available inside the container because the driver directory was mounted from the node:
volumeMounts:
  - name: nvidia-install-dir-host
    mountPath: /usr/local/nvidia
volumes:
  - name: nvidia-install-dir-host
    hostPath:
      path: /home/kubernetes/bin/nvidia
@praveenperera,
Inside the container, could you run ldconfig -p | grep libnvidia-ml.so and see if there are any results?
It's not enough to just mount /usr/local/nvidia; you also need to tell the OS where to look for the NVIDIA libraries (i.e., update the ldcache).
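(A related option on GKE, where you cannot easily run ldconfig against the host driver install at image build time: point the loader at the mounted directory through an environment variable. This is a sketch, assuming your chart version exposes extraEnv; the lib64 path matches the ldconfig output shown later in this thread.)

```yaml
extraEnv:
  - name: LD_LIBRARY_PATH
    value: /usr/local/nvidia/lib64  # where the host driver mount lands in the container
```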
@nikkon-dev yes, sorry: ldconfig -p | grep libnvidia-ml.so was run inside the container, and it showed the library files where I expected them to be (the folder I mounted them to).
@praveenperera,
Then we need to understand what the dynamic loader is loading on the system (nv-hostengine and the other DCGM libraries use an RPATH that may interfere with the system environment).
Could you provide the output of LD_DEBUG=all ./nv-hostengine -n?
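(Aside: LD_DEBUG=all produces an enormous trace; the glibc loader also accepts narrower categories, and LD_DEBUG=libs is usually enough to see which path each library is resolved from. A small illustration with a generic dynamically linked binary; /bin/true is just a stand-in for nv-hostengine here.)

```shell
# LD_DEBUG is honored by the glibc dynamic loader; the "libs" category
# traces only library search and resolution events (written to stderr).
LD_DEBUG=libs /bin/true 2>&1 | grep 'find library' | head -n 3

# The same idea applied to the binary from this thread:
#   LD_DEBUG=libs ./nv-hostengine -n 2>&1 | grep nvidia-ml
```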
Hey @nikkon-dev this is the output I get: https://gist.github.com/praveenperera/48ca14a4a898ef9a51d9e8b91b5076b1
And the output of ldconfig -p | grep libnvidia-ml.so is:
root@nvidia-gpu-metrics-dcgm-exporter-dp86x:/# ldconfig -p | grep libnvidia-ml.so
libnvidia-ml.so.1 (libc6,x86-64) => /usr/local/nvidia/lib64/libnvidia-ml.so.1
libnvidia-ml.so (libc6,x86-64) => /usr/local/nvidia/lib64/libnvidia-ml.so
I took a look at your configuration, and here is an issue I noticed: you should not mount the NVIDIA libraries into the container on your own; the NVIDIA Docker runtime handles that automatically.
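(For context: on clusters where the NVIDIA container runtime actually is installed, workloads typically select it through a RuntimeClass rather than manual mounts. A minimal sketch, assuming the runtime is registered in containerd under the name nvidia:)

```yaml
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia  # must match the runtime name configured in containerd
```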
I don't believe the NVIDIA Docker runtime is in play on GKE, but I could be wrong. As far as I'm aware, this is all native containerd functionality via their Container-Optimized OS (COS) image and a daemonset they provide to manage the NVIDIA drivers. So I believe we do need to mount the NVIDIA drivers as indicated in the example provided.
I just got this working after reading some comments in the archived repo: NVIDIA/gpu-monitoring-tools#96 (comment)
....
I'll try that, thanks!
Thanks a lot for sharing the values. I also got it running on GKE with that setup.
I had to bump the memory request up to 256Mi, otherwise the pods got OOMKilled.
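(For reference, the corresponding change to the values shown earlier would look like this; the 256Mi figure comes from the comment above, and keeping request and limit equal simply mirrors the original values.)

```yaml
resources:
  limits:
    cpu: 100m
    memory: 256Mi
  requests:
    cpu: 100m
    memory: 256Mi
```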
Thanks for sharing the values. I did the same thing and bumped the memory to 256Mi, and things are working now. But it's odd that of the 17 pods, 4 are still hitting CrashLoopBackOff; not sure if anyone has a clue.
Related Issues (20)
- Cannot build from source HOT 9
- how to query rated power? HOT 1
- Cannot build from source via Ansible HOT 4
- Executing dcgmi diag -r 3 in dcgm-exporter, the prompt shows "nvvs binary was not found" HOT 1
- hello,I use docker run -d --gpus all --rm -p 9400:9400 nvcr.io/nvidia/k8s/dcgm-exporter:3.3.6-3.4.2-ubuntu22.04 to start the container and an error message readlink: missing operand HOT 5
- Profiling module failed to load HOT 5
- Could not enable kubernetes metric collection: nvml: Unknown Error HOT 2
- Failed to watch metrics: Error watching fields: The third-party Profiling module returned an u HOT 2
- Makefile missing DIST_DIR := cmd/dcgm-exporter HOT 1
- Hello, why /var/log/nv-hostengine.log file had many ERROR [5231:5273] [[NvSwitch]] ReadNvSwitchStatusAllSwitches() HOT 1
- https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/ is not signed HOT 2
- nvlink metrics are not available on the gh200 gpu node HOT 2
- I can't get the following metrics, but I've set the environment variable HOT 3
- config csv DCGM_FI_DEV_CORRECTABLE_REMAPPED_ROWS, but cannot get on metrics HOT 2
- can I get computeRunningProcesses and graphicsRunningProcesses this two metrics?? HOT 1
- exported_pod cause issue with query -> every sample a different metrics HOT 3
- Switch GPU Util metric to `DCGM_FI_PROF_GR_ENGINE_ACTIVE` in NVIDIA DCGM Metrics Dashboard
- `namespace` and `pod` labels are sometimes missing from metrics HOT 10
- How to obtain the namespace , pod and container data HOT 4
- How to install dcgm-exporter on Windows Server? HOT 6