Comments (10)

nikkon-dev commented on September 23, 2024

Hi @Shadowphax,

Have you tried running dcgm-exporter with the --devices f or --devices i command-line argument?
By default, dcgm-exporter does not monitor MIG instances. You need to either specify the devices explicitly or use "Flex" mode (f), in which dcgm-exporter monitors all MIG instances instead of all GPUs.

I opened pull request #4 to improve the --devices documentation a bit.
As things stand, if you need both GPUs and MIG instances, you have to list them explicitly: --devices g:0,g:1...,i:0,i:1,...

from dcgm-exporter.

Shadowphax commented on September 23, 2024

Hi @nikkon-dev

Appreciate the feedback, thank you.

dcgm-exporter -f /etc/dcgm-exporter/default-counters.csv --devices g:2,3,i:0
INFO[0000] Starting dcgm-exporter
FATA[0000] Invalid ranged device option 'g:2,3,i:0': there can only be one specified range

I am not sure whether I have the syntax right. Also, "i" requires MIG instance IDs as input. Are we meant to use the instance ID shown by "nvidia-smi"?

nikkon-dev commented on September 23, 2024

@Shadowphax,
You are right. In the current implementation, dcgm-exporter does not support ranges of more than one type, or more than one range in general. This limitation is on the dcgm-exporter side only; DCGM itself supports such scenarios.
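
To make the limitation concrete, here is a minimal Python sketch of a pre-flight check that flags a --devices value mixing entity types, as 'g:2,3,i:0' does. It is not dcgm-exporter's parser: the function names device_kinds and mixes_kinds are hypothetical, and the check only looks for the f, g and i prefixes mentioned in this thread without validating the IDs themselves.

```python
import re
from typing import List


def device_kinds(option: str) -> List[str]:
    """Return the entity-type prefixes (f, g, i) found in a --devices value.

    Hypothetical helper: a prefix is a single letter at the start of the
    value or right after a comma, followed by ':' or ',' or the end.
    """
    return re.findall(r"(?:^|,)([fgi])(?=:|,|$)", option)


def mixes_kinds(option: str) -> bool:
    """True if the value names more than one entity type."""
    return len(set(device_kinds(option))) > 1


if __name__ == "__main__":
    for value in ("g:2,3,i:0", "i:2", "f"):
        print(value, device_kinds(value), mixes_kinds(value))
```

With the value from the failing command above, device_kinds returns both g and i, so mixes_kinds reports True; for i:2 it reports False.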

Shadowphax commented on September 23, 2024

Hi @nikkon-dev

Could you provide guidance on how to monitor a single MIG instance with dcgm-exporter, especially to identify the correct MIG instance IDs? I've partitioned the cards appropriately. Is there something I am meant to do with dcgmi in order for dcgm-exporter to expose the metrics for MIG?

dcgmi group -l
+-------------------+----------------------------------------------------------+
| GROUPS                                                                       |
| 2 groups found.                                                              |
+===================+==========================================================+
| Groups            |                                                          |
| -> 2              |                                                          |
| -> Group ID       | 2                                                        |
| -> Group Name     | card0                                                    |
| -> Entities       | GPU 0, GPU_I 3, GPU_I 2, GPU_I 0, GPU_I 1                |
| -> 3              |                                                          |
| -> Group ID       | 3                                                        |
| -> Group Name     | card1                                                    |
| -> Entities       | GPU 1, GPU_I 10, GPU_I 7, GPU_I 8, GPU_I 9               |
+-------------------+----------------------------------------------------------+

So I have four MIG instances on each card. I would like to know where the ID is obtained per MIG for --devices i:(x) where x is either 0 or 1 as mentioned in --help

Cheers

nikkon-dev commented on September 23, 2024

@Shadowphax,

Could you check whether dcgmi discovery -c gives you the info you need?

WBR,
Nik

xwhuang0923 commented on September 23, 2024

@nikkon-dev
Hi, I get zero values for the MIG instance too.
CUDA: 11.4
Driver: datacenter-gpu-manager 1:2.3.1 amd64
[image: 1637292603(1)]

dcgm-exporter --devices=i:2

[image: 1637292721(1)]

[image: 1637292787(1)]

Thank you!

nikkon-dev commented on September 23, 2024

@xwhuang0923,

Please take a look at the dcgmi discovery -c output. In the --devices=i:X argument, X is the entity ID from the discovery command output, not the MIG Dev value from the nvidia-smi output.
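
As a rough illustration, the GPU_I entries in the dcgmi group -l listing quoted earlier in this thread can be pulled out with a few lines of Python. This sketch assumes those GPU_I numbers are the same entity IDs that dcgmi discovery -c reports, which the thread does not confirm. The function name mig_entity_ids and the SAMPLE text are hypothetical, and the row layout is taken only from the listing above.

```python
import re
from typing import Dict, List

# Two "Entities" rows copied from the group listing earlier in the thread.
SAMPLE = """\
| -> Group Name | card0 |
| -> Entities | GPU 0, GPU_I 3, GPU_I 2, GPU_I 0, GPU_I 1 |
| -> Group Name | card1 |
| -> Entities | GPU 1, GPU_I 10, GPU_I 7, GPU_I 8, GPU_I 9 |
"""


def mig_entity_ids(group_listing: str) -> Dict[str, List[int]]:
    """Map each 'GPU n' in an Entities row to the sorted GPU_I IDs on that row."""
    result: Dict[str, List[int]] = {}
    for line in group_listing.splitlines():
        # Split a table row into its two cells, dropping the outer pipes.
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        if len(cells) != 2 or not cells[0].endswith("Entities"):
            continue
        gpus = re.findall(r"\bGPU (\d+)", cells[1])
        instances = sorted(int(n) for n in re.findall(r"\bGPU_I (\d+)", cells[1]))
        for gpu in gpus:
            result["GPU " + gpu] = instances
    return result


if __name__ == "__main__":
    print(mig_entity_ids(SAMPLE))
```

For the sample rows this yields IDs 0 to 3 for GPU 0 and 7 to 10 for GPU 1, which would be the candidate values for X under the assumption stated above.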

WBR,
Nik

xwhuang0923 commented on September 23, 2024

@nikkon-dev
Here is the output:

[image]

glowkey commented on September 23, 2024

We have been making quite a few enhancements to DCGM and DCGM-Exporter with respect to MIG. Some of these fixes and enhancements are available in 2.3.2-2.6.3, and others are coming in the next version. Please try upgrading now or in a few weeks, and let us know if the latest versions do not fix the issue.

dixing0908 commented on September 23, 2024

@glowkey Is there an example of monitoring MIG instance state, such as instance memory or instance utilization?
