Comments (10)
Hi @Shadowphax,
Have you tried running dcgm-exporter with the --devices f or --devices i command-line argument?
By default, dcgm-exporter does not monitor MIG instances. You need to either specify the devices explicitly or set "Flex" mode (--devices f), in which dcgm-exporter monitors all MIG instances instead of all GPUs.
I made a pull request (#4) to improve the --devices documentation a bit.
In its current state, if you need both GPUs and MIG instances, you need to specify them explicitly, as in --devices g:0,g:1...,i:0,i:1,...
from dcgm-exporter.
Hi @nikkon-dev
Appreciate the feedback, thank you.
dcgm-exporter -f /etc/dcgm-exporter/default-counters.csv --devices g:2,3,i:0
INFO[0000] Starting dcgm-exporter
FATA[0000] Invalid ranged device option 'g:2,3,i:0': there can only be one specified range
I am not sure whether I have the syntax correct. Also, MIG IDs (instance IDs) are required as input for "i". Are we meant to use the ID from "nvidia-smi" as the instance ID?
from dcgm-exporter.
@Shadowphax,
You are right. In the current implementation, dcgm-exporter does not support mixing range types, or more than one range in general. This limitation is on the dcgm-exporter side only, as DCGM itself supports such scenarios.
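The limitation described above can be checked before launching the exporter. The sketch below is a hypothetical pre-flight helper, not dcgm-exporter's actual parser. It assumes, based only on the error message quoted earlier in this thread, that an argument is rejected when it carries more than one type prefix (that is, more than one colon). The function name check_devices_arg is invented for illustration; the prefixes f, g, and i are the ones mentioned in this thread.

```python
def check_devices_arg(arg: str) -> str:
    """Return arg unchanged if it names at most one range, else raise ValueError.

    Assumption: this only models the "one specified range" limitation
    discussed in the thread; it does not validate the IDs themselves.
    """
    # Split off the type prefix ("f", "g" or "i") from the optional ID list.
    kind, _, ids = arg.partition(":")
    if kind not in ("f", "g", "i"):
        raise ValueError(f"unknown device type {kind!r} in {arg!r}")
    # A second colon means a second range such as "g:2,3,i:0".
    if ":" in ids:
        raise ValueError(f"{arg!r}: there can only be one specified range")
    return arg


if __name__ == "__main__":
    for candidate in ("i:0", "f", "g:2,3,i:0"):
        try:
            print("ok:", check_devices_arg(candidate))
        except ValueError as err:
            print("rejected:", err)
```

With this check, i:0 and f pass, while the g:2,3,i:0 argument from the failing command above is rejected before the exporter is started.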
from dcgm-exporter.
Hi @nikkon-dev
Could you provide guidance on how to monitor a single MIG instance with dcgm-exporter, especially to identify the correct MIG instance IDs? I've partitioned the cards appropriately. Is there something I am meant to do with dcgmi in order for dcgm-exporter to expose the metrics for MIG?
dcgmi group -l
+-------------------+----------------------------------------------------------+
| GROUPS                                                                       |
| 2 groups found.                                                              |
+===================+==========================================================+
| Groups            |                                                          |
| -> 2              |                                                          |
|    -> Group ID    | 2                                                        |
|    -> Group Name  | card0                                                    |
|    -> Entities    | GPU 0, GPU_I 3, GPU_I 2, GPU_I 0, GPU_I 1                |
| -> 3              |                                                          |
|    -> Group ID    | 3                                                        |
|    -> Group Name  | card1                                                    |
|    -> Entities    | GPU 1, GPU_I 10, GPU_I 7, GPU_I 8, GPU_I 9               |
+-------------------+----------------------------------------------------------+
So I have four MIG instances on each card. I would like to know where the per-instance ID for --devices i:(x) comes from, where x is either 0 or 1 as mentioned in --help.
Cheers
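The group listing above can be turned into per-group ID lists with a few lines of Python. This is a minimal sketch: the function name parse_group_entities is hypothetical, it assumes the dcgmi group -l layout shown in this comment, and the thread does not confirm that the GPU_I numbers are the same IDs that --devices i:X expects.

```python
import re

# Excerpt of the listing from this comment (only the rows the parser needs).
SAMPLE = """\
| -> Group Name | card0 |
| -> Entities | GPU 0, GPU_I 3, GPU_I 2, GPU_I 0, GPU_I 1 |
| -> Group Name | card1 |
| -> Entities | GPU 1, GPU_I 10, GPU_I 7, GPU_I 8, GPU_I 9 |
"""


def parse_group_entities(listing: str) -> dict:
    """Map each group name to its GPU and GPU_I (MIG instance) entity IDs."""
    groups = {}
    name = None
    for line in listing.splitlines():
        # Drop the outer table borders, then split into the two cells.
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        if len(cells) != 2:
            continue  # border rows and the single-cell header rows
        key, value = cells
        if key == "-> Group Name":
            name = value
        elif key == "-> Entities" and name is not None:
            groups[name] = {
                # "GPU 0" matches here; "GPU_I 3" does not (no space after GPU).
                "GPU": [int(n) for n in re.findall(r"\bGPU (\d+)", value)],
                "GPU_I": [int(n) for n in re.findall(r"\bGPU_I (\d+)", value)],
            }
    return groups


if __name__ == "__main__":
    for group, entities in parse_group_entities(SAMPLE).items():
        print(group, entities)
```

The parser strips whitespace around each cell, so it works on both the aligned table and a copy whose padding was lost.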
from dcgm-exporter.
Could you please check whether dcgmi discovery -c gives you the information you need?
WBR,
Nik
from dcgm-exporter.
@nikkon-dev
Hi, I got zero values for the MIG instance too.
CUDA: 11.4
Driver: datacenter-gpu-manager 1:2.3.1 amd64
dcgm-exporter --devices=i:2
Thank you!
from dcgm-exporter.
Please take a look at the dcgmi discovery -c output. In the --devices=i:X argument, X is the entity ID from the discovery command output, not the MIG Dev from the nvidia-smi output.
WBR,
Nik
from dcgm-exporter.
@nikkon-dev
Here is the output:
from dcgm-exporter.
We have been making quite a few enhancements to DCGM and DCGM-Exporter with respect to MIG. Some of these fixes and enhancements are available in 2.3.2-2.6.3, and others are coming in the next version. Please try upgrading now or in a few weeks, and let us know if the latest versions do not fix the issue.
from dcgm-exporter.
@glowkey Is there an example of monitoring MIG instance state, such as instance memory or instance utilization?
from dcgm-exporter.