Crash loop backoff with `Error: Failed to initialize NVML` on GKE (dcgm-exporter, CLOSED)

praveenperera commented on June 22, 2024 1

Crash loop backoff with `Error: Failed to initialize NVML` on GKE

from dcgm-exporter.

Comments (13)

lhriley commented on June 22, 2024 9

I just got this working after reading some comments in the archived repo: NVIDIA/gpu-monitoring-tools#96 (comment)

My helm chart values:

###
#
# Reference: https://github.com/NVIDIA/dcgm-exporter/blob/main/deployment/values.yaml
#

serviceMonitor:
  enabled: false

resources:
  limits:
    cpu: 100m
    memory: 128Mi
  requests:
    cpu: 100m
    memory: 128Mi

securityContext:
  privileged: true

tolerations:
  - operator: Exists

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: cloud.google.com/gke-accelerator
              operator: Exists

podAnnotations:
  ad.datadoghq.com/exporter.check_names: |
    ["openmetrics"]
  ad.datadoghq.com/exporter.init_configs: |
    [{}]
  ad.datadoghq.com/exporter.instances: |
    [
      {
        "openmetrics_endpoint": "http://%%host%%:9400/metrics",
        "namespace": "nvidia-dcgm-exporter",
        "metrics": [{"*":"*"}]
      }
    ]

extraHostVolumes:
  - name: vulkan-icd-mount
    hostPath: /home/kubernetes/bin/nvidia/vulkan/icd.d
  - name: nvidia-install-dir-host
    hostPath: /home/kubernetes/bin/nvidia

extraVolumeMounts:
  - name: nvidia-install-dir-host
    mountPath: /usr/local/nvidia
    readOnly: true
  - name: vulkan-icd-mount
    mountPath: /etc/vulkan/icd.d
    readOnly: true

and...

❯ kubectl -n nvidia-dcgm-exporter logs -f nvidia-dcgm-exporter-4m855
time="2022-05-12T00:10:42Z" level=info msg="Starting dcgm-exporter"
time="2022-05-12T00:10:42Z" level=info msg="DCGM successfully initialized!"
time="2022-05-12T00:10:43Z" level=info msg="Collecting DCP Metrics"
time="2022-05-12T00:10:43Z" level=info msg="No configmap data specified, falling back to metric file /etc/dcgm-exporter/dcp-metrics-included.csv"
time="2022-05-12T00:10:44Z" level=info msg="Kubernetes metrics collection enabled!"
time="2022-05-12T00:10:44Z" level=info msg="Pipeline starting"
time="2022-05-12T00:10:44Z" level=info msg="Starting webserver"

lhriley commented on June 22, 2024 1

Was there any progress on this in the last month? We're seeing the exact same issue on GKE, and it would be great to get some actual metrics from the GPUs.

glowkey commented on June 22, 2024

Just to clarify, were the libnvidia-ml.so files found inside the container? I just pulled them and checked, and did not find them. Also, have you followed the Kubernetes integration guide found here? https://docs.nvidia.com/datacenter/cloud-native/gpu-telemetry/dcgm-exporter.html#integrating-gpu-telemetry-into-kubernetes

praveenperera commented on June 22, 2024

@glowkey the Nvidia drivers were already installed on the node by the nvidia-driver-installer that's now automatically included in GKE clusters.

The libnvidia-ml.so files are available inside the container because they are mounted from the node.

volumeMounts:
  - name: nvidia-install-dir-host
    mountPath: /usr/local/nvidia

volumes:
  - name: nvidia-install-dir-host
    hostPath:
      path: /home/kubernetes/bin/nvidia

nikkon-dev commented on June 22, 2024

@praveenperera,
Inside the container, could you run `ldconfig -p | grep libnvidia-ml.so` and see if there are any results?
It's not enough to just mount /usr/local/nvidia; you also need to tell the OS where to look for the Nvidia libraries (update the ld cache).
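
As a minimal sketch of the "update the ld cache" step: this assumes a Debian/Ubuntu-style image where `ldconfig` reads drop-in files from `/etc/ld.so.conf.d/`, and it assumes the driver libraries sit in `/usr/local/nvidia/lib64` (the directory shown in the `ldconfig -p` output posted in this thread). The function name `register_lib_dir` and the conf file name are hypothetical.

```shell
#!/bin/sh
# register_lib_dir DIR CONF_FILE
# Appends DIR to CONF_FILE so that a later `ldconfig` run picks it up.
# The function name and the conf file name are illustrative only.
register_lib_dir() {
    dir="$1"
    conf="$2"
    # Skip the write if the directory is already listed in the file.
    if [ -f "$conf" ] && grep -qxF "$dir" "$conf"; then
        return 0
    fi
    printf '%s\n' "$dir" >> "$conf"
}

# Usage inside the container (needs root, so it is left commented out):
#   register_lib_dir /usr/local/nvidia/lib64 /etc/ld.so.conf.d/nvidia.conf
#   ldconfig
#   ldconfig -p | grep libnvidia-ml.so
```

If the last command prints nothing after `ldconfig` has run, the loader still cannot see NVML through the cache.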

praveenperera commented on June 22, 2024

@nikkon-dev yes, sorry: `ldconfig -p | grep libnvidia-ml.so` was run inside the container, and it showed the library files where I expected them to be (the folder I mounted them to).

nikkon-dev commented on June 22, 2024

@praveenperera,
Then we need to understand what the dynamic linker is actually loading on the system (nv-hostengine and other DCGM libraries use RPATH, which may interfere with the system environment).
Could you provide the output of `LD_DEBUG=all ./nv-hostengine -n`?
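
Because `LD_DEBUG=all` is very verbose, a narrower run can be easier to read. The sketch below assumes glibc's `LD_DEBUG=libs` category (library search paths only) and that the loader writes its trace to stderr. The helper name `nvml_search_lines` is hypothetical.

```shell
#!/bin/sh
# nvml_search_lines
# Reads a dynamic-loader trace on stdin and keeps only the lines that
# mention libnvidia-ml. Returns success even when nothing matches.
nvml_search_lines() {
    grep -F 'libnvidia-ml' || true
}

# Usage inside the container (commented out; nv-hostengine must be present):
#   LD_DEBUG=libs ./nv-hostengine -n 2>&1 | nvml_search_lines
```

The surviving lines should show which directories were searched for NVML and which file, if any, was picked.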

praveenperera commented on June 22, 2024

Hey @nikkon-dev this is the output I get: https://gist.github.com/praveenperera/48ca14a4a898ef9a51d9e8b91b5076b1

And the output of ldconfig -p | grep libnvidia-ml.so is

root@nvidia-gpu-metrics-dcgm-exporter-dp86x:/# ldconfig -p | grep libnvidia-ml.so
	libnvidia-ml.so.1 (libc6,x86-64) => /usr/local/nvidia/lib64/libnvidia-ml.so.1
	libnvidia-ml.so (libc6,x86-64) => /usr/local/nvidia/lib64/libnvidia-ml.so

nikkon-dev commented on June 22, 2024

I took a look at your configuration, and here is an issue I noticed:
You should not mount the Nvidia libraries inside the container on your own; the Nvidia Docker runtime handles that automatically.

lhriley commented on June 22, 2024

I don't believe the Nvidia Docker runtime is in play on GKE, but I could be wrong.

As far as I'm aware, this is all native containerd functionality via their Container-Optimized OS (COS) image and a DaemonSet they provide to manage the Nvidia drivers. So, I believe that we would need to mount the Nvidia drivers as indicated in the example provided.

praveenperera commented on June 22, 2024

> I just got this working after reading some comments in the archived repo: NVIDIA/gpu-monitoring-tools#96 (comment)
>
> ....

I'll try that, thanks!

vanHavel commented on June 22, 2024

Thanks a lot for sharing the values. I also got it running on GKE with that setup.
I had to bump the memory request up to 256Mi, otherwise the pods got OOMKilled.
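
One way to apply that bump without editing the main values file is a small override file. This is a sketch under assumptions: the file name `memory-override.yaml` is made up, and the limit is raised together with the request because Kubernetes rejects a pod whose request exceeds its limit (the values shared earlier cap memory at 128Mi).

```shell
#!/bin/sh
# Write a Helm values override that raises the exporter's memory to 256Mi.
# The file name is hypothetical; the keys mirror the `resources` block
# from the values shared earlier in this thread.
cat > memory-override.yaml <<'EOF'
resources:
  limits:
    memory: 256Mi
  requests:
    memory: 256Mi
EOF

# Later -f files take precedence over earlier ones, so the override wins:
#   helm upgrade --install <release> <chart> -f values.yaml -f memory-override.yaml
```

`<release>` and `<chart>` are placeholders for whatever names the existing deployment uses.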

xiaoyifan commented on June 22, 2024

Thanks for sharing the values. I also bumped the memory to 256Mi and things are working now. It's weird, though: there are 17 pods, and 4 of them are still hitting the CrashLoopBackOff issue. Not sure if anyone has a clue.
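
To narrow down why only some pods crash, it may help to see which nodes the failing pods land on. A rough sketch, assuming the `nvidia-dcgm-exporter` namespace used earlier in the thread and a single container per pod; `crashing_pods` is a hypothetical helper name.

```shell
#!/bin/sh
# crashing_pods
# Reads "NAME NODE REASON" rows on stdin and prints the name and node of
# every pod whose first container is waiting in CrashLoopBackOff.
crashing_pods() {
    awk '$3 == "CrashLoopBackOff" { print $1, $2 }'
}

# Usage against a cluster (commented out; requires kubectl access):
#   kubectl -n nvidia-dcgm-exporter get pods --no-headers \
#     -o custom-columns='NAME:.metadata.name,NODE:.spec.nodeName,REASON:.status.containerStatuses[0].state.waiting.reason' \
#     | crashing_pods
#   kubectl -n nvidia-dcgm-exporter logs <pod> --previous
```

If the failing pods turn out to share a node pool or driver version, that may point at the nodes rather than at the chart values.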
