
dcgm-exporter's Introduction

DCGM-Exporter

This repository contains the DCGM-Exporter project. It exposes GPU metrics for Prometheus, leveraging NVIDIA DCGM.

Documentation

Official documentation for DCGM-Exporter can be found on docs.nvidia.com.

Quickstart

To gather metrics on a GPU node, simply start the dcgm-exporter container:

$ docker run -d --gpus all --rm -p 9400:9400 nvcr.io/nvidia/k8s/dcgm-exporter:3.3.5-3.4.1-ubuntu22.04
$ curl localhost:9400/metrics
# HELP DCGM_FI_DEV_SM_CLOCK SM clock frequency (in MHz).
# TYPE DCGM_FI_DEV_SM_CLOCK gauge
# HELP DCGM_FI_DEV_MEM_CLOCK Memory clock frequency (in MHz).
# TYPE DCGM_FI_DEV_MEM_CLOCK gauge
# HELP DCGM_FI_DEV_MEMORY_TEMP Memory temperature (in C).
# TYPE DCGM_FI_DEV_MEMORY_TEMP gauge
...
DCGM_FI_DEV_SM_CLOCK{gpu="0", UUID="GPU-604ac76c-d9cf-fef3-62e9-d92044ab6e52"} 139
DCGM_FI_DEV_MEM_CLOCK{gpu="0", UUID="GPU-604ac76c-d9cf-fef3-62e9-d92044ab6e52"} 405
DCGM_FI_DEV_MEMORY_TEMP{gpu="0", UUID="GPU-604ac76c-d9cf-fef3-62e9-d92044ab6e52"} 9223372036854775794
...

Quickstart on Kubernetes

Note: Consider using the NVIDIA GPU Operator rather than DCGM-Exporter directly.

Ensure you have already set up your cluster with NVIDIA as the default container runtime.

The recommended way to install DCGM-Exporter is to use the Helm chart:

$ helm repo add gpu-helm-charts \
  https://nvidia.github.io/dcgm-exporter/helm-charts

Update the repo:

$ helm repo update

And install the chart:

$ helm install \
    --generate-name \
    gpu-helm-charts/dcgm-exporter

Alternatively, you can deploy dcgm-exporter with the reference manifest from this repository:

$ kubectl create -f https://raw.githubusercontent.com/NVIDIA/dcgm-exporter/master/dcgm-exporter.yaml

Once the dcgm-exporter pod is deployed, you can use port forwarding to obtain metrics quickly:

# Let's get the output of a random pod:
$ NAME=$(kubectl get pods -l "app.kubernetes.io/name=dcgm-exporter" \
                         -o "jsonpath={ .items[0].metadata.name}")

$ kubectl port-forward $NAME 8080:9400 &
$ curl -sL http://127.0.0.1:8080/metrics
# HELP DCGM_FI_DEV_SM_CLOCK SM clock frequency (in MHz).
# TYPE DCGM_FI_DEV_SM_CLOCK gauge
# HELP DCGM_FI_DEV_MEM_CLOCK Memory clock frequency (in MHz).
# TYPE DCGM_FI_DEV_MEM_CLOCK gauge
# HELP DCGM_FI_DEV_MEMORY_TEMP Memory temperature (in C).
# TYPE DCGM_FI_DEV_MEMORY_TEMP gauge
...
DCGM_FI_DEV_SM_CLOCK{gpu="0", UUID="GPU-604ac76c-d9cf-fef3-62e9-d92044ab6e52",container="",namespace="",pod=""} 139
DCGM_FI_DEV_MEM_CLOCK{gpu="0", UUID="GPU-604ac76c-d9cf-fef3-62e9-d92044ab6e52",container="",namespace="",pod=""} 405
DCGM_FI_DEV_MEMORY_TEMP{gpu="0", UUID="GPU-604ac76c-d9cf-fef3-62e9-d92044ab6e52",container="",namespace="",pod=""} 9223372036854775794
...

To integrate DCGM-Exporter with Prometheus and Grafana, see the full instructions in the user guide. dcgm-exporter is deployed as part of the GPU Operator. To get started with integrating with Prometheus, check the Operator user guide.
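
For a quick test outside the Operator, a minimal static Prometheus scrape configuration might look like the sketch below. This is illustrative only; the job name and target address are placeholders for your environment.

# prometheus.yml (sketch) -- scrape dcgm-exporter directly on its default port 9400.
scrape_configs:
  - job_name: "dcgm-exporter"
    static_configs:
      - targets: ["gpu-node-01:9400"]   # placeholder host; replace with your node or service address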

TLS and Basic Auth

dcgm-exporter supports TLS and basic auth via the Prometheus exporter-toolkit. To enable TLS and/or basic auth, pass the --web-config-file CLI flag as follows:

dcgm-exporter --web-config-file=web-config.yaml

A sample web-config.yaml file can be fetched from the exporter-toolkit repository, and the web-config.yaml reference is available in its documentation.
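
As a minimal sketch of what such a file can look like (the certificate paths and the bcrypt hash below are placeholders, not values shipped with this project), the exporter-toolkit format supports both TLS and basic auth:

# web-config.yaml (sketch) -- all paths and the password hash are placeholders.
tls_server_config:
  cert_file: /etc/dcgm-exporter/tls/server.crt
  key_file: /etc/dcgm-exporter/tls/server.key
basic_auth_users:
  # user "metrics"; generate your own bcrypt hash, for example with: htpasswd -nBC 10 "" | tr -d ':\n'
  metrics: $2y$10$REPLACE_WITH_A_REAL_BCRYPT_HASH

With a configuration like this, metrics can then be scraped over HTTPS with credentials, e.g. curl -u metrics --cacert ca.crt https://localhost:9400/metrics.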

Building from Source

To build dcgm-exporter from source, ensure you have Go and DCGM installed, then run:

$ git clone https://github.com/NVIDIA/dcgm-exporter.git
$ cd dcgm-exporter
$ make binary
$ sudo make install
...
$ dcgm-exporter &
$ curl localhost:9400/metrics
# HELP DCGM_FI_DEV_SM_CLOCK SM clock frequency (in MHz).
# TYPE DCGM_FI_DEV_SM_CLOCK gauge
# HELP DCGM_FI_DEV_MEM_CLOCK Memory clock frequency (in MHz).
# TYPE DCGM_FI_DEV_MEM_CLOCK gauge
# HELP DCGM_FI_DEV_MEMORY_TEMP Memory temperature (in C).
# TYPE DCGM_FI_DEV_MEMORY_TEMP gauge
...
DCGM_FI_DEV_SM_CLOCK{gpu="0", UUID="GPU-604ac76c-d9cf-fef3-62e9-d92044ab6e52"} 139
DCGM_FI_DEV_MEM_CLOCK{gpu="0", UUID="GPU-604ac76c-d9cf-fef3-62e9-d92044ab6e52"} 405
DCGM_FI_DEV_MEMORY_TEMP{gpu="0", UUID="GPU-604ac76c-d9cf-fef3-62e9-d92044ab6e52"} 9223372036854775794
...

Changing Metrics

With dcgm-exporter you can configure which fields are collected by specifying a custom CSV file. The default CSV file is under etc/default-counters.csv in the repository and is installed on your system or container as /etc/dcgm-exporter/default-counters.csv.

The layout and format of this file are as follows:

# Format
# If line starts with a '#' it is considered a comment
# DCGM FIELD, Prometheus metric type, help message

# Clocks
DCGM_FI_DEV_SM_CLOCK,  gauge, SM clock frequency (in MHz).
DCGM_FI_DEV_MEM_CLOCK, gauge, Memory clock frequency (in MHz).

A custom CSV file can be specified using the -f (or --collectors) option as follows:

$ dcgm-exporter -f /tmp/custom-collectors.csv


What about a Grafana Dashboard?

You can find the official NVIDIA DCGM-Exporter dashboard here: https://grafana.com/grafana/dashboards/12239

You will also find the dashboard JSON file in this repo under grafana/dcgm-exporter-dashboard.json.

Pull requests are accepted!

Building the containers

This project uses docker buildx for multi-arch image creation. Follow the buildx documentation to get a working builder instance for creating these containers. Some other useful build options follow.

Builds local images based on the machine architecture and makes them available in 'docker images'

make local

Build the ubuntu image and export to 'docker images'

make ubuntu22.04 PLATFORMS=linux/amd64 OUTPUT=type=docker

Build and push the images to some other 'private_registry'

make REGISTRY=<private_registry> push

Issues and Contributing

Check out the Contributing document!

Reporting Security Issues

We ask that all community members and users of DCGM Exporter follow the standard NVIDIA process for reporting security vulnerabilities. This process is documented at the NVIDIA Product Security website. Following the process will result in any needed CVE being created as well as appropriate notifications being communicated to the entire DCGM Exporter community. NVIDIA reserves the right to delete vulnerability reports until they're fixed.

Please refer to the policies listed there to answer questions related to reporting security issues.

dcgm-exporter's People

Contributors

bmerry, dbeer, decayofmind, dependabot[bot], drauthius, dualvtable, elezar, flx42, glowkey, guptanswati, hpsony94, jjacobelli, klueska, lukeyeager, mjpieters, nikkon-dev, nvjmayo, nvvfedorov, patrungel, renaudwastaken, rohit-arora-dev, runyontr, shivamerla, sozercan, srikiz, suffiank, tobbez, treydock, uhthomas, zwpaper

dcgm-exporter's Issues

Error with unsupported new metrics on V100 GPUs

Running 2.4.6-2.6.9, if I enable the following metrics:

DCGM_FI_PROF_PIPE_TENSOR_IMMA_ACTIVE
DCGM_FI_PROF_PIPE_TENSOR_HMMA_ACTIVE

The DaemonSet works and reports metrics for nodes with A100 GPUs.

For nodes with V100-16GB GPUs, the DaemonSet fails with the following message:

setting up csv
/etc/dcgm-exporter/dcp-metrics-bolt.csv
done
time="2022-07-20T01:54:11Z" level=info msg="Starting dcgm-exporter"
time="2022-07-20T01:54:11Z" level=info msg="DCGM successfully initialized!"
time="2022-07-20T01:54:11Z" level=info msg="Collecting DCP Metrics"
time="2022-07-20T01:54:11Z" level=info msg="No configmap data specified, falling back to metric file /etc/dcgm-exporter/dcp-metrics-bolt.csv"
time="2022-07-20T01:54:12Z" level=fatal msg="Error watching fields: Feature not supported"
running

Does not run when one GPU has ERR! state

If nvidia-smi reports that any of the GPUs has a Fan/Temp/Perf reading of ERR!, dcgm-exporter cannot run, even if that GPU is excluded. For example, on a machine in which the eighth of 8 GPUs shows:

+-------------------------------+----------------------+----------------------+
|   7  NVIDIA A100 80G...  On   | 00000000:D6:00.0 Off |                 ERR! |
|ERR!  ERR! ERR!      1W /  N/A |      0MiB / 80994MiB |    ERR!      Default |
|                               |                      |                 ERR! |
+-------------------------------+----------------------+----------------------+

docker reports the following when I attempt to pass the 7th GPU to the container:

$ docker run --gpus 6 --rm -p 9400:9400 nvidia/dcgm-exporter bash
docker: Error response from daemon: failed to create shim: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: detection error: nvml error: unknown error: unknown.

An identically configured machine with no GPUs in the ERR! state can run the above command without errors.

DCGM exporter crashloopbackoff

Hi,

Using the nvidia GPU operator, on OpenShift (4.7.37).
As of a couple of weeks ago, when deploying new GPU nodes, the dcgm-exporter DaemonSet creates Pods that crash in a loop.
We have several clusters affected. Running the same OpenShift version, same version of the GPU operator (v1.8.2), same ClusterPolicy object.

Failing Pods pull their image from nvcr.io/nvidia/k8s/dcgm-exporter:2.2.9-2.5.0-ubi8.
It works on nodes that resolved this tag to nvcr.io/nvidia/k8s/dcgm-exporter@sha256:fd1c82078e67368b49baa69e4c0644325fb2bda8ca98833cd1794f2b7e8f7f16
and is broken on nodes running nvcr.io/nvidia/k8s/dcgm-exporter@sha256:9264ed31a3190de36ce652c3b8d35ee3b0b14bcac50b4451b58c809e93004785

Pods in crashloopbackoff shows the following logs:

$> oc logs -n gpu-operator-resources nvidia-dcgm-exporter-4s2g9 -p
time="2021-12-22T14:52:20Z" level=info msg="Starting dcgm-exporter"
time="2021-12-22T14:52:20Z" level=info msg="Attemping to connect to remote hostengine at 100.79.3.83:5555"
time="2021-12-22T14:52:20Z" level=fatal msg="Error connecting to nv-hostengine: API version mismatch"

All other pods in that namespace are OK, ready/alive.

ClusterPolicy:

spec:
  daemonsets:
    priorityClassName: xxx
    tolerations:
    - operator: Exists
  dcgm:
    image: dcgm
    repository: nvcr.io/nvidia/cloud-native
    version: 2.2.9-ubi8
  dcgmExporter:
    image: dcgm-exporter
    repository: nvcr.io/nvidia/k8s
    version: 2.2.9-2.5.0-ubi8
  devicePlugin:
    image: k8s-device-plugin
    repository: nvcr.io/nvidia
    version: v0.9.0-ubi8
  driver:
    image: driver
    manager:
      env:
      - name: ENABLE_AUTO_DRAIN
        value: "true"
      - name: DRAIN_USE_FORCE
        value: "false"
      - name: DRAIN_POD_SELECTOR_LABEL
        value: ""
      - name: DRAIN_TIMEOUT_SECONDS
        value: 0s
      - name: DRAIN_DELETE_EMPTYDIR_DATA
        value: "false"
      image: k8s-driver-manager
      repository: nvcr.io/nvidia/cloud-native
      version: v0.1.0
    repository: nvcr.io/nvidia
    version: 470.57.02
  gfd:
    image: gpu-feature-discovery
    repository: nvcr.io/nvidia
    version: v0.4.1
  migManager:
    enabled: false
    image: k8s-mig-manager
    repository: nvcr.io/nvidia/cloud-native
    version: v0.1.3-ubi8
  nodeStatusExporter:
    image: gpu-operator-validator
    repository: nvcr.io/nvidia/cloud-native
    version: v1.8.2-ubi8
  operator:
    defaultRuntime: crio
    initContainer:
      image: cuda
      repository: nvcr.io/nvidia
      version: 11.3.0-base-ubi8
    runtimeClass: nvidia
  toolkit:
    image: container-toolkit
    repository: nvcr.io/nvidia/k8s
    version: 1.7.1-ubi8
  validator:
    image: gpu-operator-validator
    repository: nvcr.io/nvidia/cloud-native
    version: v1.8.2-ubi8

Inspecting the image tags, I don't see anything unexpected, like a mismatching version ... Not sure what would be going on.

Besides, that CRD schema won't allow me to define a sha256-formatted version for my dcgm-exporter image ...
What can I do?
Is this a known issue pending some fix?

Zero values for MIG instances using dcgm-exporter.

Hi,

DCGM Version: 2.2.9
CUDA: 11.4
Driver: datacenter-gpu-manager-2.2.9-1.x86_64

We have recently purchased a Dell R750xa with 4x A100-40GB GPUs. I built the dcgm-exporter binary from source and, when running it, I can obtain values for the parent GPU cards. However, all values are reported as zero for MIG instances even though the MIG instances are being utilized.

I have also noticed that not all the MIG profiles are listed.

Thank You !!

Error starting nv-hostengine: DCGM initialization error

Run this command on a server with NVIDIA A100 GPUs, one of which has MIG turned on:
docker run --gpus all --rm -p 9400:9400 nvcr.io/nvidia/k8s/dcgm-exporter:2.3.1-2.6.1-ubuntu20.04
and this is the output I got:

Warning #2: dcgm-exporter doesn't have sufficient privileges to expose profiling metrics. To get profiling metrics with dcgm-exporter, use --cap-add SYS_ADMIN
time="2021-12-14T17:28:56Z" level=info msg="Starting dcgm-exporter"
CacheManager Init Failed. Error: -17
time="2021-12-14T17:28:56Z" level=fatal msg="Error starting nv-hostengine: DCGM initialization error"

Docker version 20.10.11, build dea9396
Ubuntu: VERSION="20.04.3 LTS (Focal Fossa)" x86_64
CPU: AMD

user@host~$ nvidia-smi
Tue Dec 14 17:30:37 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 495.29.05    Driver Version: 495.29.05    CUDA Version: 11.5     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-PCI...  Off  | 00000000:0B:00.0 Off |                    0 |
| N/A   34C    P0    33W / 250W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-PCI...  Off  | 00000000:14:00.0 Off |                   On |
| N/A   30C    P0    32W / 250W |     20MiB / 40536MiB |     N/A      Default |
|                               |                      |              Enabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| MIG devices:                                                                |
+------------------+----------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |         Memory-Usage |        Vol|         Shared        |
|      ID  ID  Dev |           BAR1-Usage | SM     Unc| CE  ENC  DEC  OFA  JPG|
|                  |                      |        ECC|                       |
|==================+======================+===========+=======================|
|  1    1   0   0  |     10MiB / 20096MiB | 42      0 |  3   0    2    0    0 |
|                  |      0MiB / 32767MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  1    2   0   1  |     10MiB / 20096MiB | 42      0 |  3   0    2    0    0 |
|                  |      0MiB / 32767MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

DCGM_FI_DEV_GPU_UTIL for HPA is showing error "no metrics returned from custom metrics API"

Hi,

I have deployed Triton on top of Kubernetes (EKS). I've installed everything and it is working well, and I have also created an HPA for the Triton deployment based on GPU utilization (DCGM_FI_DEV_GPU_UTIL).

The YAML file for the HPA looks like this:

kind: HorizontalPodAutoscaler
apiVersion: autoscaling/v2beta1
metadata:
  name: triton
  namespace: prometheus
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: triton
  minReplicas: 2
  maxReplicas: 3
  metrics:
  - type: Pods
    pods:
      metricName: DCGM_FI_DEV_GPU_UTIL # Average GPU usage of the pod.
      targetAverageValue: 90

But I have realized that it is not working; it shows the error "unable to get metric DCGM_FI_DEV_GPU_UTIL: no metrics returned from custom metrics API".

**kubectl describe hpa -n prometheus**

Name:                              triton
Namespace:                         prometheus
Labels:                            <none>
Annotations:                       <none>
CreationTimestamp:                 Sat, 05 Feb 2022 20:48:07 +0530
Reference:                         Deployment/triton
Metrics:                           ( current / target )
  "DCGM_FI_DEV_GPU_UTIL" on pods:  <unknown> / 90
Min replicas:                      1
Max replicas:                      3
Deployment pods:                   2 current / 2 desired
Conditions:
  Type           Status  Reason               Message
  ----           ------  ------               -------
  AbleToScale    True    SucceededGetScale    the HPA controller was able to get the target's current scale
  ScalingActive  False   FailedGetPodsMetric  the HPA was unable to compute the replica count: unable to get metric DCGM_FI_DEV_GPU_UTIL: no metrics returned from custom metrics API
Events:
  Type     Reason               Age                     From                       Message
  ----     ------               ----                    ----                       -------
  Warning  FailedGetPodsMetric  4m33s (x9421 over 39h)  horizontal-pod-autoscaler  unable to get metric DCGM_FI_DEV_GPU_UTIL: no metrics returned from custom metrics API
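
For context, the custom metrics API that the HPA queries is typically served by an adapter such as prometheus-adapter, and the HPA can only see DCGM_FI_DEV_GPU_UTIL if the adapter exposes it as a pod metric. A hedged sketch of such an adapter rule is below; the label names depend on how your Prometheus relabels the dcgm-exporter series, so treat them as placeholders.

# prometheus-adapter rule (sketch) -- assumes prometheus-adapter is in use and the series
# carries namespace/pod labels; adjust label names to match your setup.
rules:
  - seriesQuery: 'DCGM_FI_DEV_GPU_UTIL{namespace!="",pod!=""}'
    resources:
      overrides:
        namespace: {resource: "namespace"}
        pod: {resource: "pod"}
    name:
      matches: "DCGM_FI_DEV_GPU_UTIL"
      as: "DCGM_FI_DEV_GPU_UTIL"
    metricsQuery: 'avg(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'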

Metric DCGM_FI_DEV_FB_RESERVED does not appear to be reported by dcgm-exporter (2.4.6-2.6.9)

We used to have just:

DCGM_FI_DEV_FB_FREE
DCGM_FI_DEV_FB_USED
DCGM_FI_DEV_FB_TOTAL

We were calculating the % by doing the following:

DCGM_FI_DEV_FB_USED / DCGM_FI_DEV_FB_TOTAL

I've tried to move this over to the new metric DCGM_FI_DEV_FB_USED_PERCENT to make everything easier; however, this is now based on a new metric added in 2.4.5, DCGM_FI_DEV_FB_RESERVED (instead of just the old metrics).

DCGM_FI_DEV_FB_USED_PERCENT = (DCGM_FI_DEV_FB_RESERVED + DCGM_FI_DEV_FB_USED) / DCGM_FI_DEV_FB_TOTAL

This means, for example, that on an unused V100-16GB GPU where we used to get 0%, we now get 0.000008%.

I'm trying to sanity check this, and I think the difference is coming from DCGM_FI_DEV_FB_RESERVED, which I assume means GPU system-reserved memory, but this metric is not being reported by dcgm-exporter.
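
As an interim workaround while DCGM_FI_DEV_FB_RESERVED is not exported, the old-style percentage can still be computed in PromQL from the metrics that are reported; this is simply the calculation described above expressed as a query (illustrative, not an official recommendation):

# Fraction of framebuffer in use, ignoring the reserved pool (old behaviour):
DCGM_FI_DEV_FB_USED / DCGM_FI_DEV_FB_TOTAL
# Or derived defensively from the parts that are exported:
DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE)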

Using dcgm-exporter and dcgmi simultaneously

Hi Devs,

I noticed that an error occurred while using dcgm-exporter and dcgmi dmon at the same time:

/bin/dcgmi dmon -e 1001,1004,1005,150 -g 3 -c 1
# Entity  GRACT  TENSO  DRAMA  TMPTR
      Id                           C
Error setting watches. Result: The third-party Profiling module returned an unrecoverable error

It seems you have to stop the dcgm-exporter to make use of dcgmi. Is this a limitation of the nv-hostengine?

Thanks

DCGM_FI_DEV_GPU_UTIL for grafana not showing

Hi,

I've installed everything and it is working well, but I realized that even with DCGM_FI_DEV_GPU_UTIL enabled in the metrics configuration, this metric is not showing up in Prometheus and Grafana.

Could anyone help me?

Crash loop backoff with `Error: Failed to initialize NVML` on GKE

Similar to issue: #27

My daemonset.yaml

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-gpu-metrics-dcgm-exporter
  namespace: default
  uid: 3415e29d-346f-4580-b99e-aaca03a672ad
  resourceVersion: '5254468'
  generation: 12
  creationTimestamp: '2022-03-31T15:25:15Z'
  labels:
    app.kubernetes.io/component: dcgm-exporter
    app.kubernetes.io/instance: nvidia-gpu-metrics
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: dcgm-exporter
    app.kubernetes.io/version: 2.6.5
    helm.sh/chart: dcgm-exporter-2.6.5
  annotations:
    deprecated.daemonset.template.generation: '12'
    meta.helm.sh/release-name: nvidia-gpu-metrics
    meta.helm.sh/release-namespace: default
spec:
  selector:
    matchLabels:
      app.kubernetes.io/component: dcgm-exporter
      app.kubernetes.io/instance: nvidia-gpu-metrics
      app.kubernetes.io/name: dcgm-exporter
  template:
    metadata:
      creationTimestamp: null
      labels:
        app.kubernetes.io/component: dcgm-exporter
        app.kubernetes.io/instance: nvidia-gpu-metrics
        app.kubernetes.io/name: dcgm-exporter
    spec:
      volumes:
        - name: pod-gpu-resources
          hostPath:
            path: /var/lib/kubelet/pod-resources
            type: ''
        - name: nvidia-install-dir-host
          hostPath:
            path: /home/kubernetes/bin/nvidia
            type: ''
      containers:
        - name: exporter
          image: nvcr.io/nvidia/k8s/dcgm-exporter:2.3.5-2.6.5-ubuntu20.04
          args:
            - '-f'
            - /etc/dcgm-exporter/dcp-metrics-included.csv
          ports:
            - name: metrics
              containerPort: 9400
              protocol: TCP
          env:
            - name: DCGM_EXPORTER_KUBERNETES
              value: 'true'
            - name: DCGM_EXPORTER_LISTEN
              value: ':9400'
          resources: {}
          volumeMounts:
            - name: pod-gpu-resources
              readOnly: true
              mountPath: /var/lib/kubelet/pod-resources
            - name: nvidia-install-dir-host
              mountPath: /usr/local/nvidia
          livenessProbe:
            httpGet:
              path: /health
              port: 9400
              scheme: HTTP
            initialDelaySeconds: 45
            timeoutSeconds: 1
            periodSeconds: 5
            successThreshold: 1
            failureThreshold: 3
          readinessProbe:
            httpGet:
              path: /health
              port: 9400
              scheme: HTTP
            initialDelaySeconds: 45
            timeoutSeconds: 1
            periodSeconds: 10
            successThreshold: 1
            failureThreshold: 3
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          imagePullPolicy: IfNotPresent
          securityContext:
            capabilities:
              add:
                - SYS_ADMIN
            runAsUser: 0
            runAsNonRoot: false
      restartPolicy: Always
      terminationGracePeriodSeconds: 30
      dnsPolicy: ClusterFirst
      serviceAccountName: nvidia-gpu-metrics-dcgm-exporter
      serviceAccount: nvidia-gpu-metrics-dcgm-exporter
      securityContext: {}
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: cloud.google.com/gke-accelerator
                    operator: Exists
      schedulerName: default-scheduler
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
      maxSurge: 0
  revisionHistoryLimit: 10

What I've tried

  • Tried running nvidia-smi in container, same error
  • ldconfig -p | grep -i libnvidia-ml.so the library was found in the /usr/local/nvidia/lib64/
  • Ran /usr/bin/nv-hostengine -f /tmp/nvhostengine.debug.log --log-level debug
2022-04-01 15:25:29.629 DEBUG [29:29] Initialized base logger [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmApi.cpp:6806] [dcgmStartEmbedded_v2]
2022-04-01 15:25:29.629 INFO  [29:29] version:2.3.5;arch:x86_64;buildtype:Release;buildid:13;builddate:2022-03-09;commit:e7246b91195b78740e0db2d0f1edf15dd88436d6;branch:rel_dcgm_2_3;buildplatform:Linux 4.15.0-159-generic #167-Ubuntu SMP Tue Sep 21 08:55:05 UTC 2021 x86_64;;crc:d764e6617965aa186e46fc5540b128aa [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmApi.cpp:6809] [dcgmStartEmbedded_v2]
2022-04-01 15:25:29.632 ERROR [29:29] Cannot initialize the hostengine: Error: Failed to initialize NVML [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmHostEngineHandler.cpp:3677] [DcgmHostEngineHandler::Init]
2022-04-01 15:25:29.632 ERROR [29:29] DcgmHostEngineHandler::Init failed [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmApi.cpp:6824] [dcgmStartEmbedded_v2]
2022-04-01 15:25:29.632 DEBUG [29:29] Before dcgmapiFreeClientHandler [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmApi.cpp:7137] [dcgmShutdown]
2022-04-01 15:25:29.632 INFO  [29:29] Another thread freed the client handler for us. [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmApi.cpp:255] [dcgmapiFreeClientHandler]
2022-04-01 15:25:29.632 DEBUG [29:29] After dcgmapiFreeClientHandler [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmApi.cpp:7139] [dcgmShutdown]
2022-04-01 15:25:29.632 DEBUG [29:29] dcgmShutdown completed successfully [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmApi.cpp:7174] [dcgmShutdown]

Running dcgm-exporter with Docker

Hi!

I am trying to run dcgm-exporter with Docker like this:

DCGM_EXPORTER_VERSION=2.1.4-2.3.1 && \
docker run -d --rm \
   --gpus all \
   --net host \
   --cap-add SYS_ADMIN \
   nvcr.io/nvidia/k8s/dcgm-exporter:${DCGM_EXPORTER_VERSION}-ubuntu20.04 \
   -f /etc/dcgm-exporter/dcp-metrics-included.csv

What I would like to do is modify this dcp-metrics-included.csv file. How can I do this? Do I need to clone the repo, modify the file there and rebuild the Docker image?

Best regards
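
One approach that avoids rebuilding the image is to bind-mount a modified copy of the CSV into the container and point dcgm-exporter at it with -f. A sketch (the local file name my-metrics.csv is illustrative):

$ docker run -d --rm --gpus all --net host --cap-add SYS_ADMIN \
   -v "$(pwd)/my-metrics.csv:/etc/dcgm-exporter/my-metrics.csv:ro" \
   nvcr.io/nvidia/k8s/dcgm-exporter:2.1.4-2.3.1-ubuntu20.04 \
   -f /etc/dcgm-exporter/my-metrics.csv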

GPU freezes when dcgm-exporter is SIGKILL'd

  • GPU Type: A100
  • Driver Version: 515.48.07
  • OS: RHEL8 running Kubernetes
  • dcgm-exporter version: 2.3.4-2.6.4-ubuntu20.04
  • MIG: yes

If the dcgm-exporter is forcibly killed (either kill -9, taking too long to respond to SIGTERM so k8s SIGKILLs it, or an oomkill), it appears to cause my GPU to freeze. nvidia-smi hangs and no other processes are able to use the GPU until the server is restarted.

Since I also observe dcgm-exporter having a memory leak as noted in #340, that means SIGKILLs can be a regular occurrence.

When the GPU is frozen, I see this in dmesg:

[Mon Jul 25 16:46:04 2022] NVRM: GPU Board Serial Number: 1565020013641
[Mon Jul 25 16:46:04 2022] NVRM: Xid (PCI:0000:3b:00): 120, pid='<unknown>', name=<unknown>, GSP Error: Task 1 raised error code 0x5 for reason 0x0 at 0x63f01ac (0 more errors skipped)
[Mon Jul 25 16:46:20 2022] NVRM: Xid (PCI:0000:3b:00): 119, pid='<unknown>', name=<unknown>, Timeout waiting for RPC from GSP! Expected function FREE (0x0 0x0).
[Mon Jul 25 16:46:20 2022] NVRM: Xid (PCI:0000:3b:00): 119, pid='<unknown>', name=<unknown>, Timeout waiting for RPC from GSP! Expected function FREE (0x0 0x0).
[Mon Jul 25 16:46:20 2022] NVRM: Xid (PCI:0000:3b:00): 119, pid='<unknown>', name=<unknown>, Timeout waiting for RPC from GSP! Expected function FREE (0x0 0x0).
[Mon Jul 25 16:46:20 2022] NVRM: Xid (PCI:0000:3b:00): 119, pid='<unknown>', name=<unknown>, Timeout waiting for RPC from GSP! Expected function FREE (0x0 0x0).
[Mon Jul 25 16:46:20 2022] NVRM: Xid (PCI:0000:3b:00): 119, pid='<unknown>', name=<unknown>, Timeout waiting for RPC from GSP! Expected function FREE (0x0 0x0).
[Mon Jul 25 16:46:20 2022] NVRM: GPU at PCI:0000:d8:00: GPU-d7099080-bc3c-6429-51d1-5825fdccd129
[Mon Jul 25 16:46:20 2022] NVRM: GPU Board Serial Number: 1565020014726
[Mon Jul 25 16:46:20 2022] NVRM: Xid (PCI:0000:d8:00): 119, pid='<unknown>', name=<unknown>, Timeout waiting for RPC from GSP! Expected function FREE (0x0 0x0).
[Mon Jul 25 16:46:20 2022] NVRM: Xid (PCI:0000:d8:00): 119, pid='<unknown>', name=<unknown>, Timeout waiting for RPC from GSP! Expected function FREE (0x0 0x0).
[Mon Jul 25 16:46:20 2022] NVRM: Xid (PCI:0000:d8:00): 119, pid='<unknown>', name=<unknown>, Timeout waiting for RPC from GSP! Expected function FREE (0x0 0x0).
[Mon Jul 25 16:46:20 2022] NVRM: Xid (PCI:0000:d8:00): 119, pid='<unknown>', name=<unknown>, Timeout waiting for RPC from GSP! Expected function FREE (0x0 0x0).
[Mon Jul 25 16:46:20 2022] NVRM: Xid (PCI:0000:d8:00): 119, pid='<unknown>', name=<unknown>, Timeout waiting for RPC from GSP! Expected function FREE (0x0 0x0).
[Mon Jul 25 16:46:20 2022] NVRM: Xid (PCI:0000:d8:00): 119, pid='<unknown>', name=<unknown>, Timeout waiting for RPC from GSP! Expected function FREE (0x0 0x0).
[Mon Jul 25 16:46:20 2022] NVRM: Xid (PCI:0000:d8:00): 120, pid='<unknown>', name=<unknown>, GSP Error: Task 1 raised error code 0x5 for reason 0x0 at 0x63f01ac (0 more errors skipped)
[Mon Jul 25 16:46:26 2022] NVRM: Xid (PCI:0000:3b:00): 119, pid='<unknown>', name=<unknown>, Timeout waiting for RPC from GSP! Expected function GSP_RM_CONTROL (0x20801348 0x410).
[Mon Jul 25 16:46:30 2022] NVRM: Xid (PCI:0000:3b:00): 119, pid='<unknown>', name=<unknown>, Timeout waiting for RPC from GSP! Expected function GSP_RM_ALLOC (0x0 0x6c).
[Mon Jul 25 16:46:30 2022] NVRM: Xid (PCI:0000:3b:00): 119, pid='<unknown>', name=<unknown>, Timeout waiting for RPC from GSP! Expected function GSP_RM_ALLOC (0x80 0x38).
[Mon Jul 25 16:46:34 2022] NVRM: Xid (PCI:0000:3b:00): 119, pid='<unknown>', name=<unknown>, Timeout waiting for RPC from GSP! Expected function GSP_RM_ALLOC (0x2080 0x4).
[Mon Jul 25 16:46:38 2022] NVRM: Xid (PCI:0000:3b:00): 119, pid='<unknown>', name=<unknown>, Timeout waiting for RPC from GSP! Expected function FREE (0x0 0x0).
[Mon Jul 25 16:46:42 2022] NVRM: Xid (PCI:0000:3b:00): 119, pid='<unknown>', name=<unknown>, Timeout waiting for RPC from GSP! Expected function FREE (0x0 0x0).
[Mon Jul 25 16:46:55 2022] NVRM: Xid (PCI:0000:3b:00): 119, pid='<unknown>', name=<unknown>, Timeout waiting for RPC from GSP! Expected function GSP_RM_CONTROL (0x20801348 0x410).
[Mon Jul 25 16:47:03 2022] NVRM: Xid (PCI:0000:3b:00): 119, pid='<unknown>', name=<unknown>, Timeout waiting for RPC from GSP! Expected function GSP_RM_ALLOC (0x0 0x6c).
[Mon Jul 25 16:47:03 2022] NVRM: Xid (PCI:0000:3b:00): 119, pid='<unknown>', name=<unknown>, Timeout waiting for RPC from GSP! Expected function GSP_RM_ALLOC (0x80 0x38).
[Mon Jul 25 16:47:11 2022] NVRM: Xid (PCI:0000:3b:00): 119, pid='<unknown>', name=<unknown>, Timeout waiting for RPC from GSP! Expected function GSP_RM_ALLOC (0x2080 0x4).

That Xid code appears to be undocumented: https://docs.nvidia.com/deploy/xid-errors/index.html

Also reported here: https://forums.developer.nvidia.com/t/a100-gpu-freezes-after-process-gets-oomkilled/221630

Any ideas what could be causing this?

how to interpret DCGM_FI_PROF_PCIE_TX_BYTES metric

I'm testing some metrics on dcgm-exporter and ran into the following ones, and I just could not figure out how they work.

DCGM_FI_PROF_PCIE_TX_BYTES, counter, The number of bytes of active pcie tx data including both header and payload.
DCGM_FI_PROF_PCIE_RX_BYTES, counter, The number of bytes of active pcie rx data including both header and payload.

The two metrics above are both shown as counters, so every time Prometheus collects data the value increases monotonically.
So in theory it seems as if we are required to match the scraping interval of Prometheus and the request interval of dcgm-exporter to the speed of PCIe and NVLink.
I could not find any other interpretation, reference, or other information regarding these metrics.
Am I interpreting these metrics correctly?

And does the above logic apply to the following metrics as well?
DCGM_FI_PROF_NVLINK_TX_BYTES
DCGM_FI_PROF_NVLINK_RX_BYTES

Can anyone please help me? Any guidance or link to a reference is very much appreciated!
Thank you in advance.
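
For what it's worth, if these fields behave as ordinary monotonically increasing Prometheus counters (as their declared type suggests), the usual way to consume them is via rate() rather than the raw value, so the scrape interval does not need to match the PCIe or NVLink speed. A hedged PromQL sketch:

# Approximate PCIe TX throughput in bytes per second, averaged over the last minute:
rate(DCGM_FI_PROF_PCIE_TX_BYTES[1m])
# The same pattern would apply to the NVLink byte counters if they are true counters.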

Add Kubernetes node name to exported labels

Right now, when dcgm-exporter is deployed in Kubernetes (we're using gpu-operator), the Hostname label is set to the pod name, which is not particularly useful. I'd like to suggest either:

  • Adding a new label node or
  • Using different logic to populate Hostname when running in Kubernetes

It should be fairly straightforward to inject the node name into the container using the Downward API.
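
For illustration, injecting the node name via the Downward API into the exporter container could look like the sketch below; the environment variable name NODE_NAME is just an example, and the exporter would still need logic (or relabeling) to turn it into a label.

# DaemonSet container snippet (sketch) -- standard Downward API usage:
env:
  - name: NODE_NAME
    valueFrom:
      fieldRef:
        fieldPath: spec.nodeName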

Edge node deployment dcgm

Looking for advice: I use KubeEdge to deploy edge nodes, and dcgm-exporter also needs to be deployed to the edge nodes to collect GPU metrics, but there is no kubelet.sock file there. Is there any good solution for this?

Extracting errors and bugs in k8s environment

I have a pod in status Completed that used a GPU card. The output of 'kubectl describe node gpu-178' differs from what the exporter reports; evidently, dcgm-exporter still includes the cards of the completed pod.

Pod metrics displays Daemonset name of dcgm-exporter rather than the pod with GPU

Expected Behavior: I'm trying to get GPU metrics working for my workloads and would expect to be able to see my pod name show up in the Prometheus metrics, as per this guide in the section "Per-pod GPU metrics in a Kubernetes cluster".

Existing Behavior: The metrics show up but the "pod" tag is "somename-gpu-dcgm-exporter" which is unhelpful as it does not map back to my pods.

example metric: DCGM_FI_DEV_GPU_TEMP{UUID="GPU-<UUID>", container="exporter", device="nvidia0", endpoint="metrics", gpu="0", instance="<Instance>", job="somename-gpu-dcgm-exporter", namespace="some-namespace", pod="somename-gpu-dcgm-exporter-vfbhl", service="somename-gpu-dcgm-exporter"}

K8s cluster: GKE clusters with a nodepool running 2 V100 GPUs per node
Setup: I used helm template to generate the yaml to apply to my GKE cluster. I ran into the issue described here, so I needed to add privileged: true, downgrade to nvcr.io/nvidia/k8s/dcgm-exporter:2.0.13-2.1.1-ubuntu18.04, and add nvidia-install-dir-host volume.

Things I've tried:

  • Verified DCGM_EXPORTER_KUBERNETES is set to true
  • Went through https://github.com/NVIDIA/dcgm-exporter/blob/main/pkg/dcgmexporter/kubernetes.go#L126 to see if I misunderstood the functionality or could find any easy resolution
I see there has been a code change since my downgrade, but that seemed to enable MIG, which didn't seem to apply to me. Even if it did, the issue I encountered that forced the downgrade would still exist.

The daemonset looked as below:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: somename-gpu-dcgm-exporter
  namespace: some-namespace
  labels:
    helm.sh/chart: dcgm-exporter-2.4.0
    app.kubernetes.io/name: dcgm-exporter
    app.kubernetes.io/instance: somename-gpu
    app.kubernetes.io/version: "2.4.0"
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/component: "dcgm-exporter"
spec:
  updateStrategy:
    type: RollingUpdate
  selector:
    matchLabels:
      app.kubernetes.io/name: dcgm-exporter
      app.kubernetes.io/instance: somename-gpu
      app.kubernetes.io/component: "dcgm-exporter"
  template:
    metadata:
      labels:
        app.kubernetes.io/name: dcgm-exporter
        app.kubernetes.io/instance: somename-gpu
        app.kubernetes.io/component: "dcgm-exporter"
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: cloud.google.com/gke-accelerator
                    operator: Exists
      serviceAccountName: gpu-dcgm-exporter
      volumes:
      - name: "pod-gpu-resources"
        hostPath:
          path: "/var/lib/kubelet/pod-resources"
      - name: nvidia-install-dir-host
        hostPath:
          path: /home/kubernetes/bin/nvidia
      tolerations:
        - effect: NoSchedule
          key: nvidia.com/gpu
          operator: "Exists"
        - effect: NoSchedule
          key: nodeSize
          operator: Equal
          value: my-special-nodepool-taint
      containers:
      - name: exporter
        securityContext:
          capabilities:
            add:
            - SYS_ADMIN
          runAsNonRoot: false
          runAsUser: 0
          privileged: true
        image: "nvcr.io/nvidia/k8s/dcgm-exporter:2.0.13-2.1.1-ubuntu18.04"
        imagePullPolicy: "IfNotPresent"
        args:
        - -f
        - /etc/dcgm-exporter/dcp-metrics-included.csv
        env:
        - name: "DCGM_EXPORTER_KUBERNETES"
          value: "true"
        - name: "DCGM_EXPORTER_LISTEN"
          value: ":9400"
        ports:
        - name: "metrics"
          containerPort: 9400
        volumeMounts:
        - name: "pod-gpu-resources"
          readOnly: true
          mountPath: "/var/lib/kubelet/pod-resources"
        - name: nvidia-install-dir-host
          mountPath: /usr/local/nvidia
        livenessProbe:
          httpGet:
            path: /health
            port: 9400
          initialDelaySeconds: 5
          periodSeconds: 5
        readinessProbe:
          httpGet:
            path: /health
            port: 9400
          initialDelaySeconds: 5

No exported_pod in metrics

@nikkon-dev I got it working! I needed to add this to my env as well since it was the non-default option

- name: "DCGM_EXPORTER_KUBERNETES_GPU_ID_TYPE"
              value: "device-name"

Now I see my pod coming as exported_pod="my-pod-zzzzzzz-xxxx". Thanks a ton for all your help here!

Didn't help with gpu-operator v1.11.0

Originally posted by @Muscule in #27 (comment)

Confirm DCP GPU family

Hi.

I have two questions.

  1. I would like to know about the DCP GPU family. Which GPUs are included?

  2. How should I build one standard dashboard to show GPU utilization for servers with mixed GPU families (T4, RTX A6000, A100, GeForce RTX 3080, and so on) in a K8s environment?

As you know, if a GPU is not part of the DCP GPU family, the DCGM_FI_PROF_* metrics will be disabled. If GPU families are mixed in our cluster, the dashboard will not work well... Or should I use the older DCGM_FI_DEV_GPU_UTIL metric?

Best regards.
Kaka

Require image for ARM64 architecture

Hi Team,

I am trying to use the nvidia/dcgm-exporter image on the arm64 platform but it seems it is not available for arm64.

I have successfully built the image using the command docker build -t image_name . on the arm64 platform by making some changes in the files Dockerfile and .github/workflows/go.yml.

I have used Github action to build and push the image for both the platforms.

Commit Link - odidev@9dec3a2

Github action link - https://github.com/odidev/dcgm-exporter/runs/3743542640?check_suite_focus=true

Docker Hub Link - https://hub.docker.com/repository/registry-1.docker.io/odidev/dcgm-exporter/tags?page=1&ordering=last_updated

Do you have any plans on releasing arm64 images?

It would be very helpful if an arm64 image were available. If you are interested, I will raise a PR.
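
In the meantime, the buildx-based Makefile targets described in "Building the containers" can in principle be pointed at arm64, assuming the base images used by the Dockerfile are published for that platform (untested sketch):

$ make ubuntu22.04 PLATFORMS=linux/arm64 OUTPUT=type=docker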

can't find `dcgm_agent.h`

  1. git clone
  2. go mod vendor
    go version = 1.15
  3. go build cmd/dcgm-exporter/main.go
  4. I got this:
# github.com/NVIDIA/go-dcgm/pkg/dcgm
vendor/github.com/NVIDIA/go-dcgm/pkg/dcgm/admin.go:24:10: fatal error: ../../internal/include/dcgm_agent.h: No such file or directory
   24 | #include "../../internal/include/dcgm_agent.h"
      |          ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
compilation terminated.

Configuration doesn't get mounted when deploying the helm chart

On booting the exporter pod by applying the helm chart, the logs read:

level=info msg="No configmap data specified, falling back to metric file /etc/dcgm-exporter/dcp-metrics-included.csv"

I solved the issue by adding a volume mount and associated volume to the daemon set. I believe the role, rolebinding, service account and configmap are all working correctly, but the mechanism for explicitly giving the pods spawned by the daemon set access to the configmap contents is omitted. This may have been missed because the fallback mechanism only results in an info level log, not a warning or error.

My implementation relies on terraform, so it is possible something got lost in translation.

Here's my solution with terraform:

resource "kubernetes_daemonset" "daemonset_workbench_dcgm_exporter" {
  metadata {
[...]
  }
  spec {
    strategy {
      type = "RollingUpdate"
    }
    selector {
     [...]
    }
    template {
      [...]
      }
      spec {
        container {
            image             = "nvcr.io/nvidia/k8s/dcgm-exporter:2.2.9-2.5.0-ubuntu20.04"
            name              = "dcgm-exporter-pod"
            image_pull_policy = "IfNotPresent"
            args = [
              "-f",
              "/etc/dcgm-exporter/dcp-metrics-included.csv",
            ]
            env {
                name = "DCGM_EXPORTER_KUBERNETES"
                value = "true"
            }
            env {
                name = "DCGM_EXPORTER_LISTEN"
                value = ":9400"
            }
            port {
                container_port = 9400
                name = "metrics"
            }
            readiness_probe {
              http_get {
                path = "/health"
                port = 9400
              }
              initial_delay_seconds = 45
            }
            security_context {
              capabilities {
                add = [
                  "SYS_ADMIN",
                ]
              }
              run_as_non_root = false
              run_as_user = 0
            }
            volume_mount {
              mount_path = "/var/lib/kubelet/pod-resources"
              name       = "pod-gpu-resources"
              read_only = true
            }
            volume_mount {
              name = "dcgm-exporter-config-volume"
              mount_path = "/etc/dcgm-exporter/dcp-metrics-included.csv"
              sub_path   = "dcp-metrics-included.csv"
            }
        }

        service_account_name = "dcgm-exporter-sa"
        volume {
          host_path {
            path = "/var/lib/kubelet/pod-resources"
          }
          name = "pod-gpu-resources"
        }
        volume {
          name = "dcgm-exporter-config-volume"
          config_map {
            name = "dcgm-exporter-metrics-cm"
          }
        }
      }
    }
  }
}

Latest Release bugs 2.4.5-2.6.7 - metrics missing

When upgrading to 2.4.5-2.6.7 we lose access to:
DCGM_FI_DEV_FB_USED
DCGM_FI_DEV_FB_TOTAL
DCGM_FI_DEV_FB_FREE

Going back to 2.3.5-2.6.5 resolves the issue

Also, it looks like DCGM 2.4.5 should now support the following; however, it appears the new release, which uses 2.4.5, does not yet (this is more a question than an issue):
DCGM_FI_PROF_PIPE_TENSOR_IMMA_ACTIVE
DCGM_FI_PROF_PIPE_TENSOR_HMMA_ACTIVE

Servers are running 470.57.02

RFE: Export GPU's capacity

When working on a GPU dashboard to show actual usage per GPU and provide an overview of all GPUs installed in a cluster, we would need to know the capacity of a particular GPU (probably keyed by its UUID).

This is relevant for non-percent-based metrics. So far we query historical maximums of these metrics, but this is not a reliable source.

Examples:

    DCGM_FI_DEV_SM_CLOCK - what is the maximum per GPU? Is burst clock speed relevant?
    DCGM_FI_DEV_MEM_CLOCK - likewise
    DCGM_FI_DEV_POWER_USAGE - what is the maximal consumption? Can we think about any thresholds here?
    DCGM_FI_DEV_GPU_TEMP - what are the thresholds?

TYPE DCGM_FI_PROF_ metrics value issue

I'm testing dcgm-exporter version [2.3.6-2.6.6] with MIG enabled on k8s.
Everything seems to work fine except for the profiling metrics.
The profiling metrics also seem to work fine at the beginning, but once the dcgm-exporter pod has been up for several hours they show values that just don't make sense.

Below is the value dcgm-exporter gives when running a TensorFlow benchmark on a MIG instance.
The dcgm-exporter pod has been up for 18 hours, during which the MIG instance was sitting idle for the first 17 hours and the TF benchmark has been running for the last hour (and is still running).

# TYPE DCGM_FI_PROF_GR_ENGINE_ACTIVE gauge
DCGM_FI_PROF_GR_ENGINE_ACTIVE{gpu="0",UUID="GPU-603a8873-0edd-c157-ce48-2078e8b1a935",device="nvidia0",modelName="NVIDIA A30",GPU_I_PROFILE="2g.12gb",GPU_I_ID="1",Hostname="dcgm-exporter-8cqdc",container="container-testest-1",namespace="gi-user1-giops-ai",pod="testest-9sxmd"} 0.117468

And if I restart the dcgm-exporter while keeping the TF job running, I get the following value:

# TYPE DCGM_FI_PROF_GR_ENGINE_ACTIVE gauge
DCGM_FI_PROF_GR_ENGINE_ACTIVE{gpu="0",UUID="GPU-603a8873-0edd-c157-ce48-2078e8b1a935",device="nvidia0",modelName="NVIDIA A30",GPU_I_PROFILE="2g.12gb",GPU_I_ID="1",Hostname="dcgm-exporter-4flp2",container="container-testest-1",namespace="gi-user1-giops-ai",pod="testest-9sxmd"} 0.959228

The difference between the two values is too large to neglect; it seems like the values are not getting flushed for some reason.

Container, namespace and pod informations on metrics

Why does my dcgm-exporter /metrics output not show container, namespace, and pod information?
Every metric comes out like this: container="",namespace="",pod=""

# HELP DCGM_FI_DEV_SM_CLOCK SM clock frequency (in MHz).
# TYPE DCGM_FI_DEV_SM_CLOCK gauge

DCGM_FI_DEV_SM_CLOCK{gpu="0",UUID="GPU-55274b56-9033-f166-7723-77088b0ff94d",device="nvidia0",modelName="NVIDIA TITAN V",Hostname="dcgm-exporter-lcxpl",container="",namespace="",pod=""} 135
DCGM_FI_DEV_SM_CLOCK{gpu="1",UUID="GPU-dad58a68-8239-6bb7-1f07-539b85fabfee",device="nvidia1",modelName="NVIDIA TITAN V",Hostname="dcgm-exporter-lcxpl",container="",namespace="",pod=""} 135
DCGM_FI_DEV_SM_CLOCK{gpu="2",UUID="GPU-a29d2dc7-d18a-858c-a87b-f81eb1384333",device="nvidia2",modelName="NVIDIA TITAN V",Hostname="dcgm-exporter-lcxpl",container="",namespace="",pod=""} 135
DCGM_FI_DEV_SM_CLOCK{gpu="3",UUID="GPU-4b2cedc3-89f9-dde9-ce2f-3b8d40dbe531",device="nvidia3",modelName="NVIDIA TITAN V",Hostname="dcgm-exporter-lcxpl",container="",namespace="",pod=""} 135
DCGM_FI_DEV_SM_CLOCK{gpu="4",UUID="GPU-58fae45b-e5a5-0c62-d478-8ff7c485e74e",device="nvidia4",modelName="NVIDIA TITAN V",Hostname="dcgm-exporter-lcxpl",container="",namespace="",pod=""} 1335
DCGM_FI_DEV_SM_CLOCK{gpu="5",UUID="GPU-5f10c931-56f5-ad53-4ce6-54e98a17d07f",device="nvidia5",modelName="NVIDIA TITAN V",Hostname="dcgm-exporter-lcxpl",container="",namespace="",pod=""} 135
DCGM_FI_DEV_SM_CLOCK{gpu="6",UUID="GPU-d56e5edb-72a1-d150-0ce0-dfe2c3092263",device="nvidia6",modelName="NVIDIA TITAN V",Hostname="dcgm-exporter-lcxpl",container="",namespace="",pod=""} 135
DCGM_FI_DEV_SM_CLOCK{gpu="7",UUID="GPU-7d2d5bed-82d2-c5fd-1fee-ccc74692ff91",device="nvidia7",modelName="NVIDIA TITAN V",Hostname="dcgm-exporter-lcxpl",container="",namespace="",pod=""} 135

# HELP DCGM_FI_DEV_MEM_CLOCK Memory clock frequency (in MHz).
# TYPE DCGM_FI_DEV_MEM_CLOCK gauge

DCGM_FI_DEV_MEM_CLOCK{gpu="0",UUID="GPU-55274b56-9033-f166-7723-77088b0ff94d",device="nvidia0",modelName="NVIDIA TITAN V",Hostname="dcgm-exporter-lcxpl",container="",namespace="",pod=""} 850
DCGM_FI_DEV_MEM_CLOCK{gpu="1",UUID="GPU-dad58a68-8239-6bb7-1f07-539b85fabfee",device="nvidia1",modelName="NVIDIA TITAN V",Hostname="dcgm-exporter-lcxpl",container="",namespace="",pod=""} 850
DCGM_FI_DEV_MEM_CLOCK{gpu="2",UUID="GPU-a29d2dc7-d18a-858c-a87b-f81eb1384333",device="nvidia2",modelName="NVIDIA TITAN V",Hostname="dcgm-exporter-lcxpl",container="",namespace="",pod=""} 850
DCGM_FI_DEV_MEM_CLOCK{gpu="3",UUID="GPU-4b2cedc3-89f9-dde9-ce2f-3b8d40dbe531",device="nvidia3",modelName="NVIDIA TITAN V",Hostname="dcgm-exporter-lcxpl",container="",namespace="",pod=""} 850
DCGM_FI_DEV_MEM_CLOCK{gpu="4",UUID="GPU-58fae45b-e5a5-0c62-d478-8ff7c485e74e",device="nvidia4",modelName="NVIDIA TITAN V",Hostname="dcgm-exporter-lcxpl",container="",namespace="",pod=""} 850
DCGM_FI_DEV_MEM_CLOCK{gpu="5",UUID="GPU-5f10c931-56f5-ad53-4ce6-54e98a17d07f",device="nvidia5",modelName="NVIDIA TITAN V",Hostname="dcgm-exporter-lcxpl",container="",namespace="",pod=""} 850
DCGM_FI_DEV_MEM_CLOCK{gpu="6",UUID="GPU-d56e5edb-72a1-d150-0ce0-dfe2c3092263",device="nvidia6",modelName="NVIDIA TITAN V",Hostname="dcgm-exporter-lcxpl",container="",namespace="",pod=""} 850
DCGM_FI_DEV_MEM_CLOCK{gpu="7",UUID="GPU-7d2d5bed-82d2-c5fd-1fee-ccc74692ff91",device="nvidia7",modelName="NVIDIA TITAN V",Hostname="dcgm-exporter-lcxpl",container="",namespace="",pod=""} 850

[root@master1 ~]# kubectl version
Client Version: version.Info{Major:"1", Minor:"14", GitVersion:"v1.19.4-20211018+1dd39f5bc682f8cb90cdc7cc217c00465d00e1e0", GitCommit:"$Format:%H$", GitTreeState:"", BuildDate:"1970-01-01T00:00:00Z", GoVersion:"go1.15.13", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"", Minor:"", GitVersion:"v1.19.4-20211018+1dd39f5bc682f8cb90cdc7cc217c00465d00e1e0", GitCommit:"$Format:%H$", GitTreeState:"", BuildDate:"1970-01-01T00:00:00Z", GoVersion:"go1.15.13", Compiler:"gc", Platform:"linux/amd64"}

kubectl create -f https://raw.githubusercontent.com/NVIDIA/dcgm-exporter/master/dcgm-exporter.yaml

HOST:
[root@gpu34-tianv ~]# nvidia-smi
Tue Mar 22 15:46:52 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02 Driver Version: 470.57.02 CUDA Version: 11.4 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA TITAN V On | 00000000:3D:00.0 Off | N/A |
| 28% 27C P8 24W / 250W | 0MiB / 12066MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA TITAN V On | 00000000:3E:00.0 Off | N/A |
| 28% 27C P8 23W / 250W | 0MiB / 12066MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 2 NVIDIA TITAN V On | 00000000:3F:00.0 Off | N/A |
| 28% 26C P8 22W / 250W | 0MiB / 12066MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 3 NVIDIA TITAN V On | 00000000:40:00.0 Off | N/A |
| 28% 29C P8 24W / 250W | 0MiB / 12066MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 4 NVIDIA TITAN V On | 00000000:43:00.0 Off | N/A |
| 45% 63C P2 192W / 250W | 10889MiB / 12066MiB | 100% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 5 NVIDIA TITAN V On | 00000000:44:00.0 Off | N/A |
| 28% 26C P8 23W / 250W | 0MiB / 12066MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 6 NVIDIA TITAN V On | 00000000:45:00.0 Off | N/A |
| 28% 26C P8 25W / 250W | 0MiB / 12066MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 7 NVIDIA TITAN V On | 00000000:46:00.0 Off | N/A |
| 28% 28C P8 23W / 250W | 0MiB / 12066MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 4 N/A N/A 650485 C ./gpu_burn 10885MiB |
+-----------------------------------------------------------------------------+

BUT dcgm-exporter Container:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02 Driver Version: 470.57.02 CUDA Version: 11.4 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA TITAN V On | 00000000:3D:00.0 Off | N/A |
| 28% 27C P8 24W / 250W | 0MiB / 12066MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA TITAN V On | 00000000:3E:00.0 Off | N/A |
| 28% 27C P8 23W / 250W | 0MiB / 12066MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 2 NVIDIA TITAN V On | 00000000:3F:00.0 Off | N/A |
| 28% 26C P8 22W / 250W | 0MiB / 12066MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 3 NVIDIA TITAN V On | 00000000:40:00.0 Off | N/A |
| 28% 29C P8 24W / 250W | 0MiB / 12066MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 4 NVIDIA TITAN V On | 00000000:43:00.0 Off | N/A |
| 45% 62C P2 190W / 250W | 10889MiB / 12066MiB | 100% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 5 NVIDIA TITAN V On | 00000000:44:00.0 Off | N/A |
| 28% 26C P8 23W / 250W | 0MiB / 12066MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 6 NVIDIA TITAN V On | 00000000:45:00.0 Off | N/A |
| 28% 26C P8 25W / 250W | 0MiB / 12066MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 7 NVIDIA TITAN V On | 00000000:46:00.0 Off | N/A |
| 28% 28C P8 23W / 250W | 0MiB / 12066MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+

nvidia-dcgm-exporter pod keeps crashing

I have installed it through 'microk8s enable gpu', but the nvidia-dcgm-exporter-* pod keeps crashing. The control plane machine does not have a GPU, but one of the worker nodes does.

gpu-operator-resources nvidia-dcgm-exporter-wjgff 0/1 CrashLoopBackOff 46 (2m52s ago) 4h18

time="2022-04-15T00:35:52Z" level=info msg="Starting dcgm-exporter"
time="2022-04-15T00:35:52Z" level=info msg="Attemping to connect to remote hostengine at 10.194.160.35:5555"
time="2022-04-15T00:36:52Z" level=info msg="DCGM successfully initialized!"
time="2022-04-15T00:36:52Z" level=info msg="Not collecting DCP metrics: Error getting supported metrics: Host engine connection invalid/disconnected"
time="2022-04-15T00:36:52Z" level=warning msg="Skipping line 55 ('DCGM_FI_PROF_GR_ENGINE_ACTIVE'): DCP metrics not enabled"
time="2022-04-15T00:36:52Z" level=warning msg="Skipping line 58 ('DCGM_FI_PROF_PIPE_TENSOR_ACTIVE'): DCP metrics not enabled"
time="2022-04-15T00:36:52Z" level=warning msg="Skipping line 59 ('DCGM_FI_PROF_DRAM_ACTIVE'): DCP metrics not enabled"
time="2022-04-15T00:36:52Z" level=warning msg="Skipping line 63 ('DCGM_FI_PROF_PCIE_TX_BYTES'): DCP metrics not enabled"
time="2022-04-15T00:36:52Z" level=warning msg="Skipping line 64 ('DCGM_FI_PROF_PCIE_RX_BYTES'): DCP metrics not enabled"
time="2022-04-15T00:36:52Z" level=fatal msg="Error getting devices count: Host engine connection invalid/disconnected"

(base) root@pbm088:~# kubectl describe pod/nvidia-dcgm-exporter-wjgff -n gpu-operator-resources
Name: nvidia-dcgm-exporter-wjgff
Namespace: gpu-operator-resources
Priority: 2000001000
Priority Class Name: system-node-critical
Node: itml09-vm1/10.194.160.35
Start Time: Thu, 14 Apr 2022 13:20:48 -0700
Labels: app=nvidia-dcgm-exporter
controller-revision-hash=5ddb68bf9d
pod-template-generation=1
Annotations: cni.projectcalico.org/podIP: 10.1.241.16/32
cni.projectcalico.org/podIPs: 10.1.241.16/32
Status: Running
IP: 10.1.241.16
IPs:
IP: 10.1.241.16
Controlled By: DaemonSet/nvidia-dcgm-exporter
Init Containers:
toolkit-validation:
Container ID: containerd://0225e964397202c518b818711f65f6e4197a71af25298e2438e71522a395d943
Image: nvcr.io/nvidia/cloud-native/gpu-operator-validator:v1.8.2
Image ID: nvcr.io/nvidia/cloud-native/gpu-operator-validator@sha256:a07fd1c74e3e469ac316d17cf79635173764fdab3b681dbc282027a23dbbe227
Port:
Host Port:
Command:
sh
-c
Args:
until [ -f /run/nvidia/validations/toolkit-ready ]; do echo waiting for nvidia container stack to be setup; sleep 5; done
State: Terminated
Reason: Completed
Exit Code: 0
Started: Thu, 14 Apr 2022 13:20:50 -0700
Finished: Thu, 14 Apr 2022 13:20:50 -0700
Ready: True
Restart Count: 0
Environment:
Mounts:
/run/nvidia from run-nvidia (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-fwz6d (ro)
Containers:
nvidia-dcgm-exporter:
Container ID: containerd://abbf1b909c33a0b0ba2d28b48b9c0c45fbbf2e0693021233b5521c40433a7f10
Image: nvcr.io/nvidia/k8s/dcgm-exporter:2.2.9-2.4.0-ubuntu20.04
Image ID: nvcr.io/nvidia/k8s/dcgm-exporter@sha256:25d60a39b316702809d9645d175fea0aa8e51fddf749e334ae2f033a0590485f
Port: 9400/TCP
Host Port: 0/TCP
State: Running
Started: Thu, 14 Apr 2022 17:41:57 -0700
Last State: Terminated
Reason: Error
Exit Code: 1
Started: Thu, 14 Apr 2022 17:35:52 -0700
Finished: Thu, 14 Apr 2022 17:36:52 -0700
Ready: True
Restart Count: 47
Environment:
NODE_IP: (v1:status.hostIP)
DCGM_EXPORTER_LISTEN: :9400
DCGM_EXPORTER_KUBERNETES: true
DCGM_EXPORTER_COLLECTORS: /etc/dcgm-exporter/dcp-metrics-included.csv
DCGM_REMOTE_HOSTENGINE_INFO: $(NODE_IP):5555
Mounts:
/var/lib/kubelet/pod-resources from pod-gpu-resources (ro)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-fwz6d (ro)
Conditions:
Type Status
Initialized True
Ready True
ContainersReady True
PodScheduled True
Volumes:
pod-gpu-resources:
Type: HostPath (bare host directory volume)
Path: /var/lib/kubelet/pod-resources
HostPathType:
run-nvidia:
Type: HostPath (bare host directory volume)
Path: /run/nvidia
HostPathType:
kube-api-access-fwz6d:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional:
DownwardAPI: true
QoS Class: BestEffort
Node-Selectors: nvidia.com/gpu.deploy.dcgm-exporter=true
Tolerations: node.kubernetes.io/disk-pressure:NoSchedule op=Exists
node.kubernetes.io/memory-pressure:NoSchedule op=Exists
node.kubernetes.io/not-ready:NoExecute op=Exists
node.kubernetes.io/pid-pressure:NoSchedule op=Exists
node.kubernetes.io/unreachable:NoExecute op=Exists
node.kubernetes.io/unschedulable:NoSchedule op=Exists
nvidia.com/gpu:NoSchedule op=Exists
Events:
Type Reason Age From Message


Warning BackOff 2m5s (x993 over 4h21m) kubelet Back-off restarting failed container

What should I check for to resolve this?
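
Given that the container is configured with DCGM_REMOTE_HOSTENGINE_INFO=$(NODE_IP):5555, the fatal "Host engine connection invalid/disconnected" error suggests nothing is answering on port 5555 at that node IP. A minimal sketch of what to check, assuming the GPU Operator's standalone DCGM hostengine is supposed to be running on the GPU node (pod name and IP taken from the output above):

# Is a DCGM hostengine pod running, and is it on the GPU node?
$ kubectl get pods -n gpu-operator-resources -o wide | grep dcgm

# Can the exporter pod reach port 5555 on the node IP? (nc may not be present
# in the image; any TCP connectivity check will do)
$ kubectl exec -n gpu-operator-resources nvidia-dcgm-exporter-wjgff -- \
    sh -c 'nc -zv 10.194.160.35 5555'

# The DaemonSet only makes sense on GPU nodes; check which nodes carry the
# label it selects on (see Node-Selectors above)
$ kubectl get nodes -L nvidia.com/gpu.deploy.dcgm-exporter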

Supported CPU architecture

Hi

Please let me confirm the supported CPU architectures.
Is it possible to run dcgm-exporter on ppc64le?
Either bare metal or a container deployment would be fine.

Best regards.
Kaka

Comments in metrics file not escaped correctly

When using a custom dcgm-metrics.csv with the GPU operator, as described in the documentation, the pod crashes:

kubectl logs nvidia-dcgm-exporter-zlcrb -n gpu-operator
time="2022-05-11T12:30:38Z" level=info msg="Starting dcgm-exporter"
time="2022-05-11T12:30:38Z" level=info msg="DCGM successfully initialized!"
time="2022-05-11T12:30:38Z" level=info msg="Collecting DCP Metrics"
time="2022-05-11T12:30:38Z" level=info msg="No configmap data specified, falling back to metric file /etc/dcgm-exporter/dcgm-metrics.csv"
time="2022-05-11T12:30:38Z" level=error msg="Could not read metrics file '/etc/dcgm-exporter/dcgm-metrics.csv': record on line 3: wrong number of fields\n"
time="2022-05-11T12:30:38Z" level=fatal msg="record on line 3: wrong number of fields"

It seems that the comment lines are not escaped correctly: after appending two commas to every comment line, the pod runs.

Used Version: dcgm-exporter:2.3.4-2.6.4-ubuntu20.04
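
For reference, a sketch of the workaround described above, using field names from the default metrics file: the metrics file is parsed as CSV with three fields per record (field name, Prometheus type, help text), so padding each comment line with two trailing commas makes the record lengths match.

# Comment lines padded with two commas so every record has three fields,,
DCGM_FI_DEV_GPU_TEMP,    gauge, GPU temperature (in C).
DCGM_FI_DEV_POWER_USAGE, gauge, Power draw (in W).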

Deployment by helm chart doesn't populate pod/container values

When I run dcgm-exporter from the command line or in a Docker container on my GPU-enabled host, "pod" and "container" show the names of the pod and container running my GPU application.

When I deploy with the Helm chart, "pod" and "container" are present in the metrics but are always blank:

DCGM_FI_DEV_FB_FREE{gpu="0",UUID="GPU-<uuid>",device="nvidia0",modelName="<model>",Hostname="<pod name>",container="",namespace="",pod=""} 15007
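
Two things worth verifying, based on the exporter configuration visible in the kubectl describe output earlier on this page: pod/container mapping requires DCGM_EXPORTER_KUBERNETES=true and the kubelet pod-resources socket mounted at /var/lib/kubelet/pod-resources. A quick sketch to confirm both on the Helm-deployed pod (the pod name is a placeholder):

$ kubectl get pod <dcgm-exporter-pod> -o yaml | grep -A1 DCGM_EXPORTER_KUBERNETES
$ kubectl get pod <dcgm-exporter-pod> -o yaml | grep -B1 -A2 pod-resources

Note also that, as far as I understand the pod-resources API, the pod and container labels are only populated for GPUs that have actually been allocated to a pod through the device plugin; idle GPUs report empty values.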

[Dashboard - BUG] Grafana dashboard: ${DS_PROMETHEUS} - not found

I use pulumi to deploy the dashboard inside a ConfigMap along with DCGM itself.

When I open the dashboard in grafana I get:

Templating
Failed to upgrade legacy queries Datasource named ${DS_PROMETHEUS} was not found

And absolutely nothing works until I manually set the datasource to Prometheus for every dashboard panel.

Here is how I deploy the dashboard:

  import * as k8s from "@pulumi/kubernetes";

  // ConfigMap labelled so the Grafana dashboard sidecar picks it up and
  // loads the embedded JSON as a dashboard.
  const grafanaDcgm = new k8s.core.v1.ConfigMap("grafana-dashboard-dcgm", {
    metadata: {
      name: "grafana-dashboard-dcgm",
      namespace: namespace, // namespace variable defined elsewhere in the program
      labels: {
        grafana_dashboard: "1"
      }
    },
    data: {
      // fetch() needs Node 18+ (or a fetch polyfill); Pulumi accepts the
      // resulting Promise<string> as an Input<string>.
      "dcgm.json":
        fetch("https://grafana.com/api/dashboards/12239/revisions/1/download")
          .then(res => res.text())
    }
  });

BTW, I think this dashboard should be in this repo and should be included in the helm chart.
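
A possible workaround (just a sketch, not an official fix): rewrite the templated datasource placeholder to a concrete datasource name before embedding the JSON in the ConfigMap. "Prometheus" below is an assumed datasource name; use whatever your Grafana instance calls it.

$ curl -sL https://grafana.com/api/dashboards/12239/revisions/1/download \
    | sed 's/\${DS_PROMETHEUS}/Prometheus/g' > dcgm.json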

dcgm-exporter running on "g4dn.metal" in AWS EKS fails with "fatal: morestack on gsignal" #208

We are running the "dcgm-exporter" Kubernetes DaemonSet on AWS EKS, and whenever we use a "g4dn.metal" EC2 instance, "dcgm-exporter" gets stuck in a crash loop with the following log message:

time="2021-08-13T20:07:08Z" level=info msg="Starting dcgm-exporter"
time="2021-08-13T20:07:09Z" level=info msg="DCGM successfully initialized!"
time="2021-08-13T20:07:27Z" level=info msg="Collecting DCP Metrics"
fatal: morestack on gsignal

This does not happen on any other G4DN class of machine, only the "metal" variant. The NVIDIA drivers are installed and user code utilizing the GPUs runs fine. Running "nvidia-smi" shows all 8 GPUs as expected. I have searched but cannot find any information on this.

Copied from here: NVIDIA/gpu-monitoring-tools#208

Issue running 2.4.6-2.6.8

@glowkey were there any breaking changes in the latest release?

I just tested the release by swapping the docker images out and I get the following error:

setting up csv
/etc/dcgm-exporter/dcp-metrics-bolt.csv
done
time="2022-07-19T14:22:40Z" level=info msg="Starting dcgm-exporter"
time="2022-07-19T14:22:41Z" level=info msg="DCGM successfully initialized!"
time="2022-07-19T14:22:41Z" level=info msg="Collecting DCP Metrics"
time="2022-07-19T14:22:41Z" level=info msg="No configmap data specified, falling back to metric file /etc/dcgm-exporter/dcp-metrics-bolt.csv"
time="2022-07-19T14:22:41Z" level=fatal msg="Error getting device busid: API version mismatch"

Rolling back to 2.3.5-2.6.5 resolved the issue. I didn't see it on 2.4.5-2.6.7 either, but that release had other metric issues.
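
For what it's worth, "API version mismatch" usually points to the DCGM client library in the exporter image talking to an nv-hostengine built from a different DCGM version (if I read the tags correctly, 2.4.6-2.6.8 pairs DCGM 2.4.6 with exporter 2.6.8). If you run a standalone hostengine, a rough way to compare the two sides, assuming dcgmi is available in both places:

# DCGM version on the host running nv-hostengine
$ dcgmi --version

# DCGM version bundled in the exporter image
$ docker run --rm --entrypoint dcgmi \
    nvcr.io/nvidia/k8s/dcgm-exporter:2.4.6-2.6.8-ubuntu20.04 --version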

the tests fail

Trying to run the tests under pkg/dcgmexporter fails. Here are the steps:
cd pkg/dcgmexporter
go test
2022/03/21 09:42:57 proto: duplicate proto type registered: v1alpha1.ListPodResourcesRequest
2022/03/21 09:42:57 proto: duplicate proto type registered: v1alpha1.ListPodResourcesResponse
2022/03/21 09:42:57 proto: duplicate proto type registered: v1alpha1.PodResources
2022/03/21 09:42:57 proto: duplicate proto type registered: v1alpha1.ContainerResources
2022/03/21 09:42:57 proto: duplicate proto type registered: v1alpha1.ContainerDevices
--- FAIL: TestDCGMCollector (0.00s)
gpu_collector_test.go:35:
Error Trace: gpu_collector_test.go:35
Error: Received unexpected error:
libdcgm.so not Found
Test: TestDCGMCollector
/tmp/go-build21440241/b001/dcgmexporter.test: symbol lookup error: /tmp/go-build21440241/b001/dcgmexporter.test: undefined symbol: dcgmGetAllDevices
exit status 127
FAIL github.com/NVIDIA/dcgm-exporter/pkg/dcgmexporter 0.016s

Are there any settings required to run the tests?
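
The error itself ("libdcgm.so not Found", undefined symbol dcgmGetAllDevices) means the test binary links against the DCGM shared library at run time, so DCGM has to be installed on the machine running go test. A minimal sketch for Ubuntu (package name, repository setup, and library path may differ on other distributions, and the collector test also expects a GPU to be present):

# install DCGM, which provides libdcgm.so (requires the NVIDIA/CUDA apt repo)
$ sudo apt-get install -y datacenter-gpu-manager

# make sure the test binary can find the library, then run the tests
$ export LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu:$LD_LIBRARY_PATH
$ cd pkg/dcgmexporter && go test ./...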

Profiling metrics not being collected

Hello,

dcgmi version: 2.2.9

I built dcgm-exporter from source and am running it on a single GPU (Tesla K80). I can't seem to get profiling metrics to show up, though other metrics show up fine.

root@node-0:/etc/dcgm-exporter# dcgm-exporter -f etc/dcp-metrics-included.csv  -a :9402
INFO[0000] Starting dcgm-exporter
INFO[0000] DCGM successfully initialized!
INFO[0000] Not collecting DCP metrics: Error getting supported metrics: This request is serviced by a module of DCGM that is not currently loaded
INFO[0000] No configmap data specified, falling back to metric file etc/dcp-metrics-included.csv
WARN[0000] Skipping line 55 ('DCGM_FI_PROF_GR_ENGINE_ACTIVE'): DCP metrics not enabled
WARN[0000] Skipping line 58 ('DCGM_FI_PROF_PIPE_TENSOR_ACTIVE'): DCP metrics not enabled
WARN[0000] Skipping line 59 ('DCGM_FI_PROF_DRAM_ACTIVE'): DCP metrics not enabled
WARN[0000] Skipping line 63 ('DCGM_FI_PROF_PCIE_TX_BYTES'): DCP metrics not enabled

Error: Unable to Get supported metric groups: This request is serviced by a module of DCGM that is not currently loaded.

It looks like the profiling module fails to load:

root@node-0:/etc/dcgm-exporter# dcgmi modules -l
+-----------+--------------------+--------------------------------------------------+
| List Modules                                                                      |
| Status: Success                                                                   |
+===========+====================+==================================================+
| Module ID | Name               | State                                            |
+-----------+--------------------+--------------------------------------------------+
| 0         | Core               | Loaded                                           |
| 1         | NvSwitch           | Loaded                                           |
| 2         | VGPU               | Not loaded                                       |
| 3         | Introspection      | Not loaded                                       |
| 4         | Health             | Not loaded                                       |
| 5         | Policy             | Not loaded                                       |
| 6         | Config             | Not loaded                                       |
| 7         | Diag               | Not loaded                                       |
| 8         | Profiling          | Failed to load                                   |
+-----------+--------------------+--------------------------------------------------+

Though I'm not sure whether this is attributable to dcgm-exporter or to DCGM itself, because I can't get the metrics to load even when using dcgmi directly:

root@node-0:/home/user# dcgmi dmon -e 1010
# Entity                 PCIRX
      Id
Error setting watches. Result: This request is serviced by a module of DCGM that is not currently loaded

I followed the instructions to build dcgm-exporter from source, and the service runs inside a sidecar container that is responsible for collecting metrics.

How can I enable the collection of profiling metrics?
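
As far as I know, the DCP/profiling metrics come from DCGM's Profiling module, which only supports newer GPU architectures (roughly Volta and later), so the "Failed to load" state on a Kepler-class Tesla K80 is expected and not something dcgm-exporter can work around. One way to confirm is to ask DCGM directly which profiling metric groups the GPU supports (a sketch):

# list the profiling metric groups DCGM supports on this GPU
$ dcgmi profile --list

If this reports the same "module not currently loaded" error, the limitation is in DCGM/hardware rather than in the exporter.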
