Comments (9)
I have the same issue, and I verified that the serviceaccount does have access to read the configmap.
time="2023-11-01T20:42:27Z" level=info msg="Starting dcgm-exporter"
time="2023-11-01T20:42:27Z" level=info msg="DCGM successfully initialized!"
time="2023-11-01T20:42:27Z" level=info msg="Collecting DCP Metrics"
time="2023-11-01T20:42:29Z" level=info msg="Malformed configmap contents. No metrics found, falling back to metric file /etc/dcgm-exporter/default-counters.csv"
time="2023-11-01T20:42:29Z" level=info msg="Kubernetes metrics collection enabled!"
time="2023-11-01T20:42:29Z" level=info msg="Pipeline starting"
time="2023-11-01T20:42:29Z" level=info msg="Starting webserver"
Using the serviceaccount successfully fetches the configmap:
% k get configmap/exporter-metrics-config-map -n dcgm-exporter -o yaml | yq .kind
ConfigMap
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchFields:
          - key: metadata.name
            operator: In
            values:
            - ip-10-160-18-33.ec2.internal
  containers:
  - args:
    - -m
    - dcgm-exporter:exporter-metrics-config-map
    env:
    - name: DCGM_EXPORTER_KUBERNETES
      value: "true"
    - name: DCGM_EXPORTER_LISTEN
      value: :9400
    image: nvcr.io/nvidia/k8s/dcgm-exporter:2.4.6-2.6.9-ubuntu20.04
from dcgm-exporter.
Same problem. But your solution didn't help.
from dcgm-exporter.
The message in the log is not an error.
There are two ways to specify metrics configuration:
- via CSV file
- via ConfigMap
If you choose the latter and your dcgm-exporter command line has the -m namespace:config-map-name argument, then dcgm-exporter will try to find that ConfigMap and fall back to the CSV file if it does not exist.
Please take a look here, here and here for examples.
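To make the lookup-then-fallback behaviour described above concrete, here is a minimal sketch in Go. It is not dcgm-exporter's actual code: splitConfigMapArg and chooseSource are hypothetical helpers written only for illustration, and the only details taken from this thread are the namespace:config-map-name format of the -m value and the fallback path shown in the logs.

```go
package main

import (
	"fmt"
	"strings"
)

// splitConfigMapArg splits the value given to -m into namespace and name.
// Hypothetical helper for illustration, not dcgm-exporter's own code.
func splitConfigMapArg(arg string) (namespace, name string, err error) {
	parts := strings.Split(arg, ":")
	if len(parts) != 2 || parts[0] == "" || parts[1] == "" {
		return "", "", fmt.Errorf("expected namespace:config-map-name, got %q", arg)
	}
	return parts[0], parts[1], nil
}

// chooseSource reports where metric definitions would be read from:
// the ConfigMap when it yielded any metrics, otherwise the fallback CSV file.
func chooseSource(configMapMetrics []string, fallbackCSV string) string {
	if len(configMapMetrics) > 0 {
		return "configmap"
	}
	return fallbackCSV
}

func main() {
	ns, name, err := splitConfigMapArg("dcgm-exporter:exporter-metrics-config-map")
	if err != nil {
		fmt.Println("bad -m value:", err)
		return
	}
	fmt.Printf("look up ConfigMap %q in namespace %q\n", name, ns)

	// No metrics were parsed from the ConfigMap, so the CSV path is chosen.
	fmt.Println("source:", chooseSource(nil, "/etc/dcgm-exporter/default-counters.csv"))
}
```

One thing the sketch highlights: the part before the colon is the namespace, so the ConfigMap has to exist in exactly that namespace for the lookup to succeed.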
from dcgm-exporter.
@nikkon-dev Please note that option #2 does not work. I applied it and received the following in the logs:
time="2022-12-06T17:32:19Z" level=info msg="Malformed configmap contents. No metrics found, falling back to metric file /etc/dcgm-exporter/default-counters.csv"
Please find my values.yaml attached (sorry for the format, GitHub thinks it's Markdown). It appears that the configmap that is supplied is broken.
`
# Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved.
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#     http://www.apache.org/licenses/LICENSE-2.0
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
image:
  repository: nvcr.io/nvidia/k8s/dcgm-exporter
  pullPolicy: IfNotPresent
  # Image tag defaults to AppVersion, but you can use the tag key
  # for the image tag, e.g:
  tag: 3.0.4-3.0.0-ubuntu20.04
# Comment the following line to stop profiling metrics from DCGM
arguments: ["-m", "metrics:exporter-metrics-config-map"]
# NOTE: in general, add any command line arguments to arguments above
# and they will be passed through.
# Use "-r", ":" to connect to an already running hostengine
# Example arguments: ["-r", "host123:5555"]
# Use "-n" to remove the hostname tag from the output.
# Example arguments: ["-n"]
# Use "-d" to specify the devices to monitor. -d must be followed by a string
# in the following format: [f] or [g[:numeric_range][+]][i[:numeric_range]]
# Where a numeric range is something like 0-4 or 0,2,4, etc.
# Example arguments: ["-d", "g+i"] to monitor all GPUs and GPU instances or
# ["-d", "g:0-3"] to monitor GPUs 0-3.
# Use "-m" to specify the namespace and name of a configmap containing
# the watched exporter fields.
# Example arguments: ["-m", "default:exporter-metrics-config-map"]
imagePullSecrets: []
nameOverride: ""
fullnameOverride: ""
namespaceOverride: ""
serviceAccount:
  # Specifies whether a service account should be created
  create: true
  # Annotations to add to the service account
  annotations: {}
  # The name of the service account to use.
  # If not set and create is true, a name is generated using the fullname template
  name:
podAnnotations: {}
# Using this annotation which is required for prometheus scraping
# prometheus.io/scrape: "true"
# prometheus.io/port: "9400"
podSecurityContext: {}
# fsGroup: 2000
securityContext:
  runAsNonRoot: false
  runAsUser: 0
  capabilities:
    add: ["SYS_ADMIN"]
  readOnlyRootFilesystem: true
service:
  enable: true
  type: ClusterIP
  port: 9400
  address: ":9400"
  # Annotations to add to the service
  annotations: {}
resources: {}
# limits:
#   cpu: 100m
#   memory: 128Mi
# requests:
#   cpu: 100m
#   memory: 128Mi
serviceMonitor:
  enabled: true
  interval: 15s
  honorLabels: false
  additionalLabels: {}
  #monitoring: prometheus
  relabelings: []
  # - sourceLabels: [__meta_kubernetes_pod_node_name]
  #   separator: ;
  #   regex: ^(.*)$
  #   targetLabel: nodename
  #   replacement: $1
  #   action: replace
mapPodsMetrics: false
nodeSelector: {}
#node: gpu
tolerations: []
#- operator: Exists
affinity: {}
#nodeAffinity:
#  requiredDuringSchedulingIgnoredDuringExecution:
#    nodeSelectorTerms:
#    - matchExpressions:
#      - key: nvidia-gpu
#        operator: Exists
extraHostVolumes: []
#- name: host-binaries
#  hostPath: /opt/bin
extraConfigMapVolumes:
  - name: exporter-metrics-volume
    configMap:
      name: exporter-metrics-config-map
extraVolumeMounts: []
#- name: host-binaries
#  mountPath: /opt/bin
#  readOnly: true
extraEnv: []
#- name: EXTRA_VAR
#  value: "TheStringValue"
kubeletPath: "/var/lib/kubelet/pod-resources"
`
from dcgm-exporter.
Have you provided the config map itself? https://github.com/NVIDIA/dcgm-exporter/blob/574d63d717af9a4447092070d9501ec1eb83873e/deployment/templates/metrics-configmap.yaml
from dcgm-exporter.
How do I provide it while installing the Helm chart?
from dcgm-exporter.
Is it a version issue? I noticed that this line, https://github.com/NVIDIA/dcgm-exporter/blob/3.1.3-3.1.2/pkg/dcgmexporter/parser.go#L179, was added later on (ab87097); before that version, the CSV is essentially treated as malformed if it has a # in it.
I also debugged the "No metrics found" message with a small main.go script locally (go run main.go):
package main

import (
	"encoding/csv"
	"log"
	"os"
)

func main() {
	filePath := "test.csv"
	f, err := os.Open(filePath)
	if err != nil {
		log.Fatalf("Unable to read input file %s: %v", filePath, err)
	}
	defer f.Close()

	r := csv.NewReader(f)
	// r.Comment = '#' // Uncomment to make the reader skip lines starting with '#'
	records, err := r.ReadAll()
	if err != nil {
		// With r.Comment unset, '#' lines are parsed as data and may cause an error here.
		println(err.Error())
	}

	// Print each record on its own line, fields concatenated.
	for _, record := range records {
		for _, item := range record {
			print(item)
		}
		println("")
	}
}
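For anyone who wants to reproduce this without a test.csv on disk, below is a self-contained variant of the script above that reads from an in-memory string. The sample contents and the countRecords helper are made up for illustration (they are not taken from dcgm-exporter); the only library behaviour relied on is the Comment field of encoding/csv's Reader.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"strings"
)

// sample is a hypothetical two-line metrics file: one '#' comment line
// followed by one three-field metric definition.
const sample = "# Clocks\nDCGM_FI_DEV_SM_CLOCK, gauge, SM clock frequency (in MHz).\n"

// countRecords parses data as CSV and returns how many records were read.
// A zero comment rune leaves comment handling disabled (the Reader default).
func countRecords(data string, comment rune) (int, error) {
	r := csv.NewReader(strings.NewReader(data))
	r.Comment = comment
	records, err := r.ReadAll()
	return len(records), err
}

func main() {
	// Comment handling off: the '#' line is parsed as a one-field record,
	// so the three-field line after it no longer matches the field count.
	n, err := countRecords(sample, 0)
	fmt.Println("without Comment:", n, err)

	// Comment handling on: the '#' line is skipped before parsing.
	n, err = countRecords(sample, '#')
	fmt.Println("with Comment '#':", n, err)
}
```

With comment handling off, ReadAll should return an error and no records, which would be consistent with the "No metrics found" fallback seen in the logs.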
from dcgm-exporter.
Please try to use the most recent version.
from dcgm-exporter.
No response.
from dcgm-exporter.