Comments (9)

andrewjamesbrown commented on September 23, 2024

I have the same issue, and I verified that the serviceaccount does have access to read the configmap.

time="2023-11-01T20:42:27Z" level=info msg="Starting dcgm-exporter"
time="2023-11-01T20:42:27Z" level=info msg="DCGM successfully initialized!"
time="2023-11-01T20:42:27Z" level=info msg="Collecting DCP Metrics"
time="2023-11-01T20:42:29Z" level=info msg="Malformed configmap contents. No metrics found, falling back to metric file /etc/dcgm-exporter/default-counters.csv"
time="2023-11-01T20:42:29Z" level=info msg="Kubernetes metrics collection enabled!"
time="2023-11-01T20:42:29Z" level=info msg="Pipeline starting"
time="2023-11-01T20:42:29Z" level=info msg="Starting webserver"

Using the serviceaccount successfully fetches the configmap:

% k get configmap/exporter-metrics-config-map -n dcgm-exporter  -o yaml | yq .kind
ConfigMap

Pod spec:

spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchFields:
          - key: metadata.name
            operator: In
            values:
            - ip-10-160-18-33.ec2.internal
  containers:
  - args:
    - -m
    - dcgm-exporter:exporter-metrics-config-map
    env:
    - name: DCGM_EXPORTER_KUBERNETES
      value: "true"
    - name: DCGM_EXPORTER_LISTEN
      value: :9400
    image: nvcr.io/nvidia/k8s/dcgm-exporter:2.4.6-2.6.9-ubuntu20.04

from dcgm-exporter.

Muscule commented on September 23, 2024

Same problem. But your solution didn't help.

nikkon-dev commented on September 23, 2024

The message in the log is not an error.
There are two ways to specify the metrics configuration:

  1. via a CSV file
  2. via a ConfigMap

If you choose the latter and your dcgm-exporter command line has the -m namespace:config-map-name argument, dcgm-exporter will try to find that ConfigMap and fall back to the CSV file if it does not exist.
Please take a look here, here and here for examples.
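
To make the -m namespace:config-map-name format concrete, here is a minimal Go sketch of splitting such a value into its two parts. splitConfigMapArg is a hypothetical helper written for illustration, not dcgm-exporter's own code; the sample values are the ones that appear in this thread.

```go
package main

import (
	"fmt"
	"strings"
)

// splitConfigMapArg is a hypothetical helper (an assumption, not part of
// dcgm-exporter). It splits a "-m" value of the form
// "namespace:config-map-name" into namespace and name.
func splitConfigMapArg(arg string) (namespace, name string, err error) {
	parts := strings.SplitN(arg, ":", 2)
	if len(parts) != 2 || parts[0] == "" || parts[1] == "" {
		return "", "", fmt.Errorf("expected namespace:config-map-name, got %q", arg)
	}
	return parts[0], parts[1], nil
}

func main() {
	args := []string{
		"dcgm-exporter:exporter-metrics-config-map",
		"metrics:exporter-metrics-config-map",
		"exporter-metrics-config-map", // no namespace part
	}
	for _, arg := range args {
		ns, name, err := splitConfigMapArg(arg)
		if err != nil {
			fmt.Println("invalid:", err)
			continue
		}
		fmt.Printf("namespace=%s name=%s\n", ns, name)
	}
}
```

The point of the sketch is that the part before the colon is the namespace, so the ConfigMap has to exist in that namespace for the lookup to succeed.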

avickars commented on September 23, 2024

@nikkon-dev Please note that option #2 does not work. I applied it and received the following in the logs:

time="2022-12-06T17:32:19Z" level=info msg="Malformed configmap contents. No metrics found, falling back to metric file /etc/dcgm-exporter/default-counters.csv"

Please find my values.yaml attached (sorry for the format, GitHub thinks it's markdown). It appears that the supplied configmap is broken.

`

# Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

image:
  repository: nvcr.io/nvidia/k8s/dcgm-exporter
  pullPolicy: IfNotPresent
  # Image tag defaults to AppVersion, but you can use the tag key
  # for the image tag, e.g:
  tag: 3.0.4-3.0.0-ubuntu20.04

# Comment the following line to stop profiling metrics from DCGM
arguments: ["-m", "metrics:exporter-metrics-config-map"]
# NOTE: in general, add any command line arguments to arguments above
# and they will be passed through.
# Use "-r", ":" to connect to an already running hostengine
# Example arguments: ["-r", "host123:5555"]
# Use "-n" to remove the hostname tag from the output.
# Example arguments: ["-n"]
# Use "-d" to specify the devices to monitor. -d must be followed by a string
# in the following format: [f] or [g[:numeric_range][+]][i[:numeric_range]]
# Where a numeric range is something like 0-4 or 0,2,4, etc.
# Example arguments: ["-d", "g+i"] to monitor all GPUs and GPU instances or
# ["-d", "g:0-3"] to monitor GPUs 0-3.
# Use "-m" to specify the namespace and name of a configmap containing
# the watched exporter fields.
# Example arguments: ["-m", "default:exporter-metrics-config-map"]

imagePullSecrets: []
nameOverride: ""
fullnameOverride: ""
namespaceOverride: ""

serviceAccount:
  # Specifies whether a service account should be created
  create: true
  # Annotations to add to the service account
  annotations: {}
  # The name of the service account to use.
  # If not set and create is true, a name is generated using the fullname template
  name:

podAnnotations: {}
# Using this annotation which is required for prometheus scraping
# prometheus.io/scrape: "true"
# prometheus.io/port: "9400"

podSecurityContext: {}
  # fsGroup: 2000

securityContext:
  runAsNonRoot: false
  runAsUser: 0
  capabilities:
    add: ["SYS_ADMIN"]
  # readOnlyRootFilesystem: true

service:
  enable: true
  type: ClusterIP
  port: 9400
  address: ":9400"
  # Annotations to add to the service
  annotations: {}

resources: {}
  # limits:
  #   cpu: 100m
  #   memory: 128Mi
  # requests:
  #   cpu: 100m
  #   memory: 128Mi

serviceMonitor:
  enabled: true
  interval: 15s
  honorLabels: false
  additionalLabels: {}
    #monitoring: prometheus
  relabelings: []
    # - sourceLabels: [__meta_kubernetes_pod_node_name]
    #   separator: ;
    #   regex: ^(.*)$
    #   targetLabel: nodename
    #   replacement: $1
    #   action: replace

mapPodsMetrics: false

nodeSelector: {}
  #node: gpu

tolerations: []
#- operator: Exists

affinity: {}
  #nodeAffinity:
  #  requiredDuringSchedulingIgnoredDuringExecution:
  #    nodeSelectorTerms:
  #    - matchExpressions:
  #      - key: nvidia-gpu
  #        operator: Exists

extraHostVolumes: []
#- name: host-binaries
#  hostPath: /opt/bin

extraConfigMapVolumes:
  - name: exporter-metrics-volume
    configMap:
      name: exporter-metrics-config-map

extraVolumeMounts: []
#- name: host-binaries
#  mountPath: /opt/bin
#  readOnly: true

extraEnv: []
#- name: EXTRA_VAR
#  value: "TheStringValue"

kubeletPath: "/var/lib/kubelet/pod-resources"

`

nikkon-dev commented on September 23, 2024

Have you provided the config map itself? https://github.com/NVIDIA/dcgm-exporter/blob/574d63d717af9a4447092070d9501ec1eb83873e/deployment/templates/metrics-configmap.yaml

harjitdotsingh commented on September 23, 2024

How do I provide it when installing the Helm chart?

mckornfield commented on September 23, 2024

Is it a version issue? I noticed that this line, https://github.com/NVIDIA/dcgm-exporter/blob/3.1.3-3.1.2/pkg/dcgmexporter/parser.go#L179, was added later on (commit ab87097). Before that version, the CSV is essentially treated as malformed if it has # in it.

I also debugged this "No metrics found" message with a small main.go script locally (go run main.go):

package main

import (
	"encoding/csv"
	"fmt"
	"log"
	"os"
)

func main() {
	filePath := "test.csv"
	f, err := os.Open(filePath)
	if err != nil {
		log.Fatalf("unable to open input file %s: %v", filePath, err)
	}
	defer f.Close()

	r := csv.NewReader(f)
	// r.Comment = '#' // Uncomment to skip lines starting with '#' and see it work

	records, err := r.ReadAll()
	if err != nil {
		// Without r.Comment, comment lines are parsed as data and can fail here.
		fmt.Println(err)
	}
	// Print each record's fields concatenated, one record per line.
	for _, record := range records {
		for _, item := range record {
			fmt.Print(item)
		}
		fmt.Println()
	}
}
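
For anyone without a test.csv at hand, the same experiment can be run on an in-memory string. This is a minimal sketch: the sample contents only mimic the shape of a counters file (comment lines plus three-column rows) and are an assumption, and parse is a hypothetical helper, not dcgm-exporter code.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"strings"
)

// sample is an assumed, shortened stand-in for a counters file:
// '#' comment lines followed by three-column metric rows.
const sample = `# Format
# DCGM FIELD, Prometheus metric type, help message
DCGM_FI_DEV_SM_CLOCK, gauge, SM clock frequency (in MHz).
DCGM_FI_DEV_GPU_TEMP, gauge, GPU temperature (in C).
`

// parse is a hypothetical helper that reads data as CSV and,
// when skipComments is true, ignores lines starting with '#'.
func parse(data string, skipComments bool) ([][]string, error) {
	r := csv.NewReader(strings.NewReader(data))
	if skipComments {
		r.Comment = '#'
	}
	return r.ReadAll()
}

func main() {
	// Comment lines are treated as records here; their field counts
	// differ from the metric rows, so ReadAll reports an error.
	_, err := parse(sample, false)
	fmt.Println("without Comment:", err)

	// With Comment set, only the metric rows are returned.
	records, err := parse(sample, true)
	fmt.Println("with Comment:", len(records), "records, err =", err)
}
```

With the default reader settings, the first record fixes the expected field count, so a one-field comment line followed by a three-field line is enough to make ReadAll fail.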

nvvfedorov commented on September 23, 2024

Please try to use the most recent version.

nvvfedorov commented on September 23, 2024

No response.
