
ROCm System Management Interface (ROCm SMI) Library

The ROCm System Management Interface Library, or ROCm SMI library, is part of the Radeon Open Compute (ROCm) software stack. It is a C library for Linux that provides a user space interface for applications to monitor and control AMD GPUs.

For additional information, refer to the ROCm Documentation.

DISCLAIMER

The information contained herein is for informational purposes only, and is subject to change without notice. In addition, any stated support is planned and is also subject to change. While every precaution has been taken in the preparation of this document, it may contain technical inaccuracies, omissions and typographical errors, and AMD is under no obligation to update or otherwise correct this information. Advanced Micro Devices, Inc. makes no representations or warranties with respect to the accuracy or completeness of the contents of this document, and assumes no liability of any kind, including the implied warranties of noninfringement, merchantability or fitness for particular purposes, with respect to the operation or use of AMD hardware, software or other products described herein.

© 2022-2024 Advanced Micro Devices, Inc. All Rights Reserved.

Planned Deprecation Notice

ROCm System Management Interface (ROCm SMI) Library is planned to be deprecated. The release date will be announced soon. Please start migrating to AMD SMI.

Installation

Install amdgpu using ROCm

  • Install amdgpu driver:
    See the example below; your release and link may differ. Running amdgpu-install --usecase=rocm installs both the amdgpu driver and the ROCm SMI packages on your device.
sudo apt update
wget https://repo.radeon.com/amdgpu-install/6.0.2/ubuntu/jammy/amdgpu-install_6.0.60002-1_all.deb
sudo apt install ./amdgpu-install_6.0.60002-1_all.deb
sudo amdgpu-install --usecase=rocm
  • Verify the installation: rocm-smi --help

Building ROCm SMI

Additional Required software for building

In order to build the ROCm SMI library, the following components are required. Note that the software versions listed are those used in development; earlier versions are not guaranteed to work:

  • CMake (v3.5.0)
  • g++ (5.4.0)

In order to build the latest documentation, the following are required:

  • Python 3.8+
  • NPM (sass)

The source code for ROCm SMI is available on Github.

After the ROCm SMI library git repository has been cloned to a local Linux machine, the library can be built by following the typical CMake sequence:

mkdir -p build
cd build
cmake ..
make -j $(nproc)
# Install library file and header; default location is /opt/rocm
make install

The built library will appear in the build folder.

To build the rpm and deb packages, follow the above steps and then run:

make package

Documentation

The following is an example of how to build the docs:

python3 -m venv .venv
.venv/bin/python3 -m pip install -r docs/sphinx/requirements.txt
.venv/bin/python3 -m sphinx -T -E -b html -d docs/_build/doctrees -D language=en docs docs/_build/html

Building the Tests

In order to verify the build and capability of ROCm SMI on your system and to see an example of how ROCm SMI can be used, you may build and run the tests that are available in the repo. To build the tests, follow these steps:

mkdir build
cd build
cmake -DBUILD_TESTS=ON ..
make -j $(nproc)

To run the tests, execute the rsmitst program built from the steps above.

Usage Basics

Device Indices

Many of the functions in the library take a "device index". The device index is a number greater than or equal to 0, and less than the number of devices detected, as determined by rsmi_num_monitor_devices(). The index is used to distinguish the detected devices from one another. It is important to note that a device may end up with a different index after a reboot, so an index should not be relied upon to be constant over reboots.

Hello ROCm SMI

The only required ROCm-SMI call for any program that wants to use ROCm-SMI is the rsmi_init() call. This call initializes some internal data structures that will be used by subsequent ROCm-SMI calls.

When ROCm-SMI is no longer being used, rsmi_shut_down() should be called. This provides a way to do any releasing of resources that ROCm-SMI may have held. In many cases, this may have no effect, but may be necessary in future versions of the library.

A simple "Hello World" type program that displays the device ID of detected devices would look like this:

#include <stdint.h>
#include "rocm_smi/rocm_smi.h"
int main() {
  rsmi_status_t ret;
  uint32_t num_devices;
  uint16_t dev_id;

  // We will skip return code checks for this example, but it
  // is recommended to always check this as some calls may not
  // apply for some devices or ROCm releases

  ret = rsmi_init(0);
  ret = rsmi_num_monitor_devices(&num_devices);

  for (uint32_t i = 0; i < num_devices; ++i) {
    ret = rsmi_dev_id_get(i, &dev_id);
    // dev_id holds the device ID of device i, upon a
    // successful call
  }
  ret = rsmi_shut_down();
  return 0;
}
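
Assuming a default install under /opt/rocm (the include and library paths below are assumptions; adjust them to your install prefix), a program like the one above can be compiled and linked roughly as follows:

```shell
# Paths assume the default /opt/rocm prefix; adjust if yours differs.
gcc hello_rsmi.c -I/opt/rocm/include -L/opt/rocm/lib -lrocm_smi64 -o hello_rsmi
./hello_rsmi
```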

rocm_smi_lib's People

Contributors

alexsierrag, ascollard, ashutom, bill-shuzhou-liu, cfreehill, charis-poag-amd, dishikre, dmitrii-galantsev, emollier, hkasivis, junyi-99, kentrussell, lmoriche, marifamd, miketxli, mukjoshi, mystro256, neon60, oliveiradan, ori-messinger, ptfoplayer, raramakr, rerrabolu, rocmamd, samjwu, solaiys, sreekantsomasekharan, srinivamd, trixirt, vstempen


rocm_smi_lib's Issues

RSMI_STATUS_PERMISSION on rocm-smi --setmclk

  • System: ubuntu-focal (5.4.0-109-generic)
  • rocm-5.2.1
  • GPU: MI250X

I am trying to set the memory clock frequency using rocm-smi, and it fails with the RSMI_STATUS_PERMISSION error. The performance level was set to manual:

$ rocm-smi --showhw


======================= ROCm System Management Interface =======================
============================ Concise Hardware Info =============================
GPU  DID   GFX RAS  SDMA RAS  UMC RAS  VBIOS           BUS
0    740c  ENABLED  ENABLED   ENABLED  113-D65210-063  0000:31:00.0
1    740c  ENABLED  ENABLED   ENABLED  113-D65210-063  0000:34:00.0
2    740c  ENABLED  ENABLED   ENABLED  113-D65210-063  0000:11:00.0
3    740c  ENABLED  ENABLED   ENABLED  113-D65210-063  0000:14:00.0
4    740c  ENABLED  ENABLED   ENABLED  113-D65210-063  0000:AE:00.0
5    740c  ENABLED  ENABLED   ENABLED  113-D65210-063  0000:B3:00.0
6    740c  ENABLED  ENABLED   ENABLED  113-D65210-063  0000:8E:00.0
7    740c  ENABLED  ENABLED   ENABLED  113-D65210-063  0000:93:00.0
================================================================================
============================= End of ROCm SMI Log ==============================
$ sudo rocm-smi -d 0 --showclkfrq --showperflevel


======================= ROCm System Management Interface =======================
============================ Show Performance Level ============================
GPU[0]          : Performance Level: manual
================================================================================
========================= Supported clock frequencies ==========================
GPU[0]          :
GPU[0]          : Supported fclk frequencies on GPU0
GPU[0]          : 0: 0Mhz *
GPU[0]          :
GPU[0]          : Supported mclk frequencies on GPU0
GPU[0]          : 0: 400Mhz
GPU[0]          : 1: 700Mhz
GPU[0]          : 2: 1200Mhz
GPU[0]          : 3: 1600Mhz *
GPU[0]          :
GPU[0]          : Supported sclk frequencies on GPU0
GPU[0]          : 0: 500Mhz
GPU[0]          : 1: 1700Mhz *
GPU[0]          :
GPU[0]          : Supported socclk frequencies on GPU0
GPU[0]          : 0: 666Mhz
GPU[0]          : 1: 857Mhz
GPU[0]          : 2: 1000Mhz
GPU[0]          : 3: 1090Mhz *
GPU[0]          : 4: 1333Mhz
GPU[0]          :
--------------------------------------------------------------------------------
================================================================================
============================= End of ROCm SMI Log ==============================
$ sudo rocm-smi -d 0 --setmclk 2


======================= ROCm System Management Interface =======================
============================== Set mclk Frequency ==============================
ERROR: 4 GPU[0]:RSMI_STATUS_PERMISSION: The user ID of the calling process does not have sufficient permission to execute a command.  Often this is fixed by running as root (sudo).
ERROR: GPU[0]           : Unable to set mclk bitmask to: 0x4
================================================================================
============================= End of ROCm SMI Log ==============================
$ sudo rocm-smi -d 0 --setmclk 0


======================= ROCm System Management Interface =======================
============================== Set mclk Frequency ==============================
ERROR: 4 GPU[0]:RSMI_STATUS_PERMISSION: The user ID of the calling process does not have sufficient permission to execute a command.  Often this is fixed by running as root (sudo).
ERROR: GPU[0]           : Unable to set mclk bitmask to: 0x1
================================================================================
============================= End of ROCm SMI Log ==============================

I found that only sclk is configurable. Is this expected, or did I miss anything? Thanks!

Initialization sometimes fails on multi-GPU nodes due to race condition

When using pytorch with the NCCL/RCCL backend on a system with eight GPUs/node, I get initialization failures of the following kind:

347: pthread_mutex_timedlock() returned 110
 347: Failed to initialize RSMI device mutex after 5 seconds. Previous execution may not have shutdown cleanly. To fix problem, stop all rocm_smi programs, and then delete the rocm_smi* shared memory files in /dev/shm.: Success
 348: pthread_mutex_timedlock() returned 110
 348: Failed to initialize RSMI device mutex after 5 seconds. Previous execution may not have shutdown cleanly. To fix problem, stop all rocm_smi programs, and then delete the rocm_smi* shared memory files in /dev/shm.: Success
 757: pthread_mutex_timedlock() returned 110
 757: Failed to initialize RSMI device mutex after 5 seconds. Previous execution may not have shutdown cleanly. To fix problem, stop all rocm_smi programs, and then delete the rocm_smi* shared memory files in /dev/shm.: Success
 350: pthread_mutex_timedlock() returned 110
 350: Failed to initialize RSMI device mutex after 5 seconds. Previous execution may not have shutdown cleanly. To fix problem, stop all rocm_smi programs, and then delete the rocm_smi* shared memory files in /dev/shm.: Success
 753: pthread_mutex_timedlock() returned 110
 753: Failed to initialize RSMI device mutex after 5 seconds. Previous execution may not have shutdown cleanly. To fix problem, stop all rocm_smi programs, and then delete the rocm_smi* shared memory files in /dev/shm.: Success
 351: pthread_mutex_timedlock() returned 110
 351: Failed to initialize RSMI device mutex after 5 seconds. Previous execution may not have shutdown cleanly. To fix problem, stop all rocm_smi programs, and then delete the rocm_smi* shared memory files in /dev/shm.: Success
 756: pthread_mutex_timedlock() returned 110
 756: Failed to initialize RSMI device mutex after 5 seconds. Previous execution may not have shutdown cleanly. To fix problem, stop all rocm_smi programs, and then delete the rocm_smi* shared memory files in /dev/shm.: Success
 758: pthread_mutex_timedlock() returned 110
 758: Failed to initialize RSMI device mutex after 5 seconds. Previous execution may not have shutdown cleanly. To fix problem, stop all rocm_smi programs, and then delete the rocm_smi* shared memory files in /dev/shm.: Success
1050: pthread_mutex_timedlock() returned 110
1050: Failed to initialize RSMI device mutex after 5 seconds. Previous execution may not have shutdown cleanly. To fix problem, stop all rocm_smi programs, and then delete the rocm_smi* shared memory files in /dev/shm.: Success
 347: rsmi_init() failed
1052: pthread_mutex_timedlock() returned 110

The reason is that rocm_smi_lib creates a mutex in /dev/shm whose name is independent of the process ID, which creates a race condition.
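
As the error text itself suggests, the recovery step (not a fix for the underlying race) is to remove the stale shared-memory files once no ROCm SMI clients are running. A minimal sketch, using a SHM_DIR variable only so the path is explicit (the rocm_smi* glob comes straight from the error message):

```shell
# Recovery step: remove stale RSMI mutex files in /dev/shm.
# Only do this when no rocm_smi-based programs are running.
SHM_DIR="${SHM_DIR:-/dev/shm}"
rm -f "$SHM_DIR"/rocm_smi*
```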

Temperature on 5700XT

I am calling rsmi_dev_temp_metric_get() with RSMI_TEMP_CURRENT.

That then calls get_dev_mon_value with the following params:
type: amd::smi::kMonTemp
dv_ind: 1
sensor_ind: 1

The issue is that val_str is an empty string, so the subsequent stoi call (line 397) crashes the application.

Do you know how I can modify this to support the 5700XT ?
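
Until the library guards against this, the crash illustrates a general point: std::stoi throws on an empty or non-numeric string, so sysfs reads should be validated first. A hedged sketch of such a guard (parse_monitor_value is a hypothetical helper, not an actual rocm_smi_lib function):

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical helper: convert a sysfs value string to an integer,
// returning false instead of crashing when the string is empty or
// not numeric (as happens for unsupported sensors on some GPUs).
static bool parse_monitor_value(const std::string &val_str, int64_t *out) {
  if (val_str.empty()) return false;
  try {
    *out = std::stoll(val_str);
  } catch (const std::invalid_argument &) {
    return false;
  } catch (const std::out_of_range &) {
    return false;
  }
  return true;
}
```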

Please enable two factor authentication in your github account

@nitishjohn;@guansong;@ukidaveyash15;@amdgerritcr

We are going to enforce two factor authentication in the (https://github.com/RadeonOpenCompute/) organization on 8th April, 2022. Since we identified you as an outside collaborator for this organization, you need to enable two factor authentication in your GitHub account, or you will be removed from the organization after the enforcement. Please skip if already done.
To set up two factor authentication, please go through the steps in the link below:

https://docs.github.com/en/free-pro-team@latest/github/authenticating-to-github/configuring-two-factor-authentication

Please reach out to "[email protected]" for queries.

exception reading frequencies on Renoir APU 4650G

$ /opt/rocm/bin/rocm-smi -d 1 --showall

======================= ROCm System Management Interface =======================
========================= Version of System Component ==========================
Driver version: 5.11.10-gentoo

====================================== ID ======================================
GPU[1] : GPU ID: 0x1636

================================== Unique ID ===================================
GPU[1] : Unique ID: N/A

==================================== VBIOS =====================================
GPU[1] : VBIOS version: 113-RENOIR-033

================================= Temperature ==================================
GPU[1] : Temperature (Sensor edge) (C): 28.0
GPU[1] : Temperature (Sensor junction) (C): N/A
GPU[1] : Temperature (Sensor memory) (C): N/A

========================== Current clock frequencies ===========================
Exception caught: map::at
ERROR: GPU[1] : dcefclk clock is unsupported
python3: /storage/work/local/rocm_smi_lib/src/rocm_smi.cc:894: rsmi_status_t get_frequencies(amd::smi::DevInfoTypes, uint32_t, rsmi_frequencies_t*, uint32_t*): Assertion `f->frequency[i-1] <= f->frequency[i]' failed.
Aborted (core dumped)

Major device number AMD GPUs

Hello!

My intention is to build a discovery mechanism using rocm_smi_lib that queries the system for all available AMD GPUs. I have done similar work for NVIDIA GPUs; in that case I could assume the major device number was always 195, since it is reserved by NVIDIA, and obtain the minor device number by querying NVML. I see your library has rsmi_dev_id_get, which from what I understand returns the minor device number, but how can I query the major device number? I need both of them for every AMD GPU present on the system.

Thanks,
Robin
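
For reference, on Linux the major/minor numbers of a device node can be read directly with stat(2) and the major()/minor() macros; AMD GPUs are exposed through DRM nodes (e.g. /dev/dri/renderD128, DRM character major 226) rather than a single reserved major like NVIDIA's 195. A minimal sketch (the device path is an assumption; point it at your render node):

```c
#include <stdio.h>
#include <sys/stat.h>
#include <sys/sysmacros.h>

// Print the major:minor device numbers of a device node.
// Returns 0 on success, -1 if the path cannot be stat'ed or is
// not a character/block device.
int print_dev_numbers(const char *path) {
  struct stat st;
  if (stat(path, &st) != 0) return -1;
  if (!S_ISCHR(st.st_mode) && !S_ISBLK(st.st_mode)) return -1;
  printf("%s -> %u:%u\n", path, major(st.st_rdev), minor(st.st_rdev));
  return 0;
}
```

Calling it as print_dev_numbers("/dev/dri/renderD128") on a system with an AMD GPU would be expected to report major 226.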

Unknown linker flag 'noexecheap' for ld.gold

When trying to build this with ld.gold I get an error that noexecheap is not a valid option. Is this flag required and necessary when building?

I can work around this in two ways: either forcing the build to use ld.bfd, or patching out the noexecheap option, but I'm not sure which one to go for.
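
For reference, the ld.bfd workaround can be expressed at configure time; -fuse-ld=bfd is a standard GCC driver flag, though exactly which CMake linker-flag variables the project honors is an assumption here:

```shell
# Sketch: force the GCC driver to invoke ld.bfd instead of ld.gold.
mkdir -p build && cd build
cmake -DCMAKE_SHARED_LINKER_FLAGS="-fuse-ld=bfd" \
      -DCMAKE_EXE_LINKER_FLAGS="-fuse-ld=bfd" ..
make -j "$(nproc)"
```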

Can you create a flat directory structure on install?

Currently;

./
├── lib
│   ├── librocm_smi64.so -> ../rocm_smi/lib/librocm_smi64.so
│   └── librocm_smi64.so.1 -> ../rocm_smi/lib/librocm_smi64.so.1
├── oam
│   ├── include
│   │   └── oam
│   │       ├── amd_oam.h
│   │       └── oam_mapi.h
│   └── lib
└── rocm_smi
    ├── docs
    │   ├── README.md
    │   └── ROCm_SMI_Manual.pdf
    │       └── refman.pdf
    ├── include
    │   └── rocm_smi
    │       ├── kfd_ioctl.h
    │       └── rocm_smi.h
    └── lib
        ├── librocm_smi64.so -> librocm_smi64.so.1
        ├── librocm_smi64.so.1 -> librocm_smi64.so.1.0
        └── librocm_smi64.so.1.0

Can you just make it flat?

./
├── lib
│   ├── librocm_smi64.so -> librocm_smi64.so.1
│   ├── librocm_smi64.so.1 -> librocm_smi64.so.1.0
│   ├── librocm_smi64.so.1.0
│   ├── liboam.so -> liboam.so.1
│   ├── liboam.so.1 -> liboam.so.1.0
│   └── liboam.so.1.0
├── include
│   ├── rocm_smi
│   │   ├── kfd_ioctl.h
│   │   └── rocm_smi.h
│   └── oam
│       ├── amd_oam.h
│       └── oam_mapi.h
└── docs
    ├── README.md
    └── ROCm_SMI_Manual.pdf
        └── refman.pdf

`memcpy` not found -- missing `string.h`

When compiling rocm_smi_lib with GCC 12 errors like

/build/rocm-smi-lib/src/rocm_smi_lib-rocm-5.1.1/src/rocm_smi_gpu_metrics.cc:225:11: error: ‘memset’ was not declared in this scope
  225 |     (void)memset(data->temperature_hbm, 0,
      |           ^~~~~~

or

/build/rocm-smi-lib/src/rocm_smi_lib-rocm-5.1.1/src/rocm_smi_gpu_metrics.cc: In function ‘void map_gpu_metrics_1_2_to_rsmi_gpu_metrics_t(const rsmi_gpu_metrics_v_1_2*, rsmi_gpu_metrics_t*)’:
/build/rocm-smi-lib/src/rocm_smi_lib-rocm-5.1.1/src/rocm_smi_gpu_metrics.cc:242:5: error: ‘memcpy’ was not declared in this scope
  242 |     memcpy(rsmi_gpu_metrics, &gpu_metrics_v_1_2->base,
      |     ^~~~~~

appear. Adding string.h to the includes of this file fixes the issue; see my PR.
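
This is the usual fix for code that relied on transitive includes that GCC 12's libstdc++ no longer provides; in a C++ translation unit the portable spelling is <cstring>. A minimal stand-in for the failing pattern:

```cpp
#include <cstring>  // declares memset/memcpy explicitly; GCC >= 12 no
                    // longer pulls this in transitively via other headers
#include <cstdint>

// Minimal stand-in for the pattern in rocm_smi_gpu_metrics.cc:
// zero a buffer with memset, then copy it with memcpy.
uint32_t zero_then_copy(void) {
  uint16_t src[4];
  uint16_t dst[4];
  std::memset(src, 0, sizeof(src));
  std::memcpy(dst, src, sizeof(src));
  return dst[0] + dst[1] + dst[2] + dst[3];  // all zero after memset
}
```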

init() and shut_down() needs reference counting

Hello

I am reviewing patches that add rsmi support to hwloc. The major complaint is that the rsmi init()/shut_down() calls do not have reference counting.

Any application using both rsmi and hwloc will call rsmi_init() and hwloc_topology_load() which also calls rsmi_init(). hwloc doesn't know when the application actually uses rsmi. Hence hwloc cannot ever call rsmi_shut_down() because it may always break somebody else using rsmi.

The issue basically prevents us from ever calling rsmi_shut_down(), which is ugly and makes valgrind complain etc.

Note that this is not specific to hwloc. It's a generic problem with libraries that may be used by multiple layers in a software stack. MPI is fixing similar issues (nobody cared 25 years ago, today it's a big issue).

Note also that I am not even talking about thread safety here. Reference counting can be thread-unsafe for now if the rest of the lib isn't thread-safe. But you must make rsmi_shut_down() a no-op unless it's the last invocation with respect to the number of earlier rsmi_init() calls.

Also it'd be good to know which version fixes this so that hwloc can check the rsmi version before deciding whether it may safely call rsmi_shut_down() or not.

thanks
Brice
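
A reference-counted wrapper of the kind described above can be sketched as follows. Everything here is illustrative (real_init, real_shutdown, and the counter are stand-ins, not actual rocm_smi_lib internals), and per the issue it is deliberately thread-unsafe:

```c
#include <stdint.h>

// Illustrative stand-ins for the library's actual one-shot
// initialization and teardown work.
static int initialized = 0;
static void real_init(void)     { initialized = 1; }
static void real_shutdown(void) { initialized = 0; }

// Count of outstanding init calls. A real implementation would
// protect this with a mutex or atomics.
static uint32_t init_count = 0;

int ref_counted_init(void) {
  if (init_count++ == 0)
    real_init();          // only the first caller pays the cost
  return 0;
}

int ref_counted_shutdown(void) {
  if (init_count == 0)
    return -1;            // unbalanced shutdown
  if (--init_count == 0)
    real_shutdown();      // only the last caller tears down
  return 0;
}
```

With this scheme hwloc and the application can each call init/shutdown in matched pairs, and the real teardown only happens when the last user is done.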

Add lookup for device index given GPU ID

In order to unambiguously map GPU IDs as reported by kfd, please add an API call:

rsmi_status_t rsmi_dev_id_get( uint64_t gpuid, uint32_t* pdv_ind);

Where:

gpuid - GPU ID as reported by kfd in /sys/class/kfd/kfd/topology/nodes/<node num>/gpu_id
pdv_ind - pointer to device index (as used throughout rocm_smi_lib APIs)

Return value:

RSMI_STATUS_SUCCESS - if OK
RSMI_STATUS_NOT_SUPPORTED - if given gpuid could not be mapped to device index
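
Until such an API exists, the mapping can be approximated in user space by scanning the kfd topology directory the issue cites. A sketch, parameterized on the base path so it can be pointed at /sys/class/kfd/kfd/topology/nodes (the function name is hypothetical, and it assumes kfd node enumeration order matches rocm_smi_lib's device index order, which is not guaranteed):

```c
#include <stdint.h>
#include <stdio.h>

// Scan <base>/<node>/gpu_id files for a matching gpuid and report its
// zero-based index among nodes with a nonzero gpu_id (CPU nodes have 0).
// Returns 0 on success, -1 if the gpuid was not found.
int lookup_dev_index(const char *base, uint64_t gpuid, uint32_t *index) {
  uint32_t gpu_seen = 0;
  for (int node = 0; ; ++node) {
    char path[512];
    snprintf(path, sizeof(path), "%s/%d/gpu_id", base, node);
    FILE *f = fopen(path, "r");
    if (!f)
      return -1;  // no more nodes; gpuid not found
    unsigned long long id = 0;
    int ok = fscanf(f, "%llu", &id);
    fclose(f);
    if (ok != 1)
      return -1;  // unreadable gpu_id file
    if (id == 0)
      continue;   // CPU node, not a GPU
    if (id == gpuid) {
      *index = gpu_seen;
      return 0;
    }
    ++gpu_seen;
  }
}
```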

poor bandwidth on dual Radeon Pro VII GPUs

Hello everyone

We recently added a second Radeon Pro VII to our simulation system. Unfortunately, though, it seems the GPUs do not want to talk to each other, although they are directly connected with an Infinity Fabric Link Bridge.

The system usually runs Arch Linux, where I also started a discussion about the issue, but testing with Ubuntu shows the same issue. Everything posted here was done on the Ubuntu system.

system

hardware setup

  • GPUs: 2 AMD Radeon Pro VII
  • CPU: AMD Ryzen Threadripper 2950X
  • mainboard: Asus X399-A

The GPUs are connected with an Infinity Fabric Link Bridge.

software

  • OS: Ubuntu 20.04.3
  • kernel: 5.11
  • ROCM: installed via sudo amdgpu-install --usecase=rocm with amdgpu-install from here

other requirements

I did verify that the critical requirements according to the ROCm supported hardware page are met, e.g. hardware (see above), but also:

IOMMU

$ sudo dmesg | grep -i iommu
[sudo] password for tinux: 
[    0.271162] iommu: Default domain type: Translated 
[    0.471020] pci 0000:00:00.2: AMD-Vi: IOMMU performance counters supported
[    0.471076] pci 0000:40:00.2: AMD-Vi: IOMMU performance counters supported
[    0.471120] pci 0000:00:01.0: Adding to iommu group 0
[    0.471133] pci 0000:00:01.1: Adding to iommu group 1
[    0.471146] pci 0000:00:01.2: Adding to iommu group 2
[    0.471166] pci 0000:00:02.0: Adding to iommu group 3
[    0.471183] pci 0000:00:03.0: Adding to iommu group 4
[    0.471195] pci 0000:00:03.1: Adding to iommu group 5
[    0.471213] pci 0000:00:04.0: Adding to iommu group 6
[    0.471231] pci 0000:00:07.0: Adding to iommu group 7
[    0.471243] pci 0000:00:07.1: Adding to iommu group 8
[    0.471261] pci 0000:00:08.0: Adding to iommu group 9
[    0.471273] pci 0000:00:08.1: Adding to iommu group 10
[    0.471297] pci 0000:00:14.0: Adding to iommu group 11
[    0.471308] pci 0000:00:14.3: Adding to iommu group 11
[    0.471368] pci 0000:00:18.0: Adding to iommu group 12
[    0.471379] pci 0000:00:18.1: Adding to iommu group 12
[    0.471390] pci 0000:00:18.2: Adding to iommu group 12
[    0.471401] pci 0000:00:18.3: Adding to iommu group 12
[    0.471414] pci 0000:00:18.4: Adding to iommu group 12
[    0.471425] pci 0000:00:18.5: Adding to iommu group 12
[    0.471436] pci 0000:00:18.6: Adding to iommu group 12
[    0.471447] pci 0000:00:18.7: Adding to iommu group 12
[    0.471506] pci 0000:00:19.0: Adding to iommu group 13
[    0.471517] pci 0000:00:19.1: Adding to iommu group 13
[    0.471529] pci 0000:00:19.2: Adding to iommu group 13
[    0.471542] pci 0000:00:19.3: Adding to iommu group 13
[    0.471553] pci 0000:00:19.4: Adding to iommu group 13
[    0.471565] pci 0000:00:19.5: Adding to iommu group 13
[    0.471577] pci 0000:00:19.6: Adding to iommu group 13
[    0.471588] pci 0000:00:19.7: Adding to iommu group 13
[    0.471622] pci 0000:01:00.0: Adding to iommu group 14
[    0.471635] pci 0000:01:00.1: Adding to iommu group 14
[    0.471649] pci 0000:01:00.2: Adding to iommu group 14
[    0.471654] pci 0000:02:00.0: Adding to iommu group 14
[    0.471658] pci 0000:02:01.0: Adding to iommu group 14
[    0.471662] pci 0000:02:02.0: Adding to iommu group 14
[    0.471666] pci 0000:02:03.0: Adding to iommu group 14
[    0.471670] pci 0000:02:04.0: Adding to iommu group 14
[    0.471674] pci 0000:02:09.0: Adding to iommu group 14
[    0.471678] pci 0000:05:00.0: Adding to iommu group 14
[    0.471683] pci 0000:08:00.0: Adding to iommu group 14
[    0.471695] pci 0000:09:00.0: Adding to iommu group 15
[    0.471707] pci 0000:0a:00.0: Adding to iommu group 16
[    0.471719] pci 0000:0b:00.0: Adding to iommu group 17
[    0.471744] pci 0000:0c:00.0: Adding to iommu group 18
[    0.471759] pci 0000:0c:00.1: Adding to iommu group 19
[    0.471772] pci 0000:0d:00.0: Adding to iommu group 20
[    0.471784] pci 0000:0d:00.2: Adding to iommu group 21
[    0.471798] pci 0000:0d:00.3: Adding to iommu group 22
[    0.471810] pci 0000:0e:00.0: Adding to iommu group 23
[    0.471825] pci 0000:0e:00.2: Adding to iommu group 24
[    0.471838] pci 0000:0e:00.3: Adding to iommu group 25
[    0.471856] pci 0000:40:01.0: Adding to iommu group 26
[    0.471872] pci 0000:40:02.0: Adding to iommu group 27
[    0.471890] pci 0000:40:03.0: Adding to iommu group 28
[    0.471902] pci 0000:40:03.1: Adding to iommu group 29
[    0.471920] pci 0000:40:04.0: Adding to iommu group 30
[    0.471937] pci 0000:40:07.0: Adding to iommu group 31
[    0.471949] pci 0000:40:07.1: Adding to iommu group 32
[    0.471968] pci 0000:40:08.0: Adding to iommu group 33
[    0.471981] pci 0000:40:08.1: Adding to iommu group 34
[    0.471994] pci 0000:41:00.0: Adding to iommu group 35
[    0.472006] pci 0000:42:00.0: Adding to iommu group 36
[    0.472031] pci 0000:43:00.0: Adding to iommu group 37
[    0.472048] pci 0000:43:00.1: Adding to iommu group 38
[    0.472061] pci 0000:44:00.0: Adding to iommu group 39
[    0.472074] pci 0000:44:00.2: Adding to iommu group 40
[    0.472086] pci 0000:44:00.3: Adding to iommu group 41
[    0.472100] pci 0000:45:00.0: Adding to iommu group 42
[    0.472113] pci 0000:45:00.2: Adding to iommu group 43
[    0.502585] pci 0000:00:00.2: AMD-Vi: Found IOMMU cap 0x40
[    0.502595] pci 0000:40:00.2: AMD-Vi: Found IOMMU cap 0x40
[    0.503499] perf/amd_iommu: Detected AMD IOMMU #0 (2 banks, 4 counters/bank).
[    0.503517] perf/amd_iommu: Detected AMD IOMMU #1 (2 banks, 4 counters/bank).
[    1.017979] AMD-Vi: AMD IOMMUv2 driver by Joerg Roedel <[email protected]>

CRAT

$ sudo dmesg | grep -i crat
[    0.000000] ACPI: CRAT 0x0000000077CDE878 001DF8 (v01 AMD    AMD CRAT 00000001 AMD  00000001)
[    0.000000] ACPI: Reserving CRAT table memory at [mem 0x77cde878-0x77ce066f]
[    1.168518] amdgpu: Ignoring ACPI CRAT on non-APU system
[    1.168521] amdgpu: Virtual CRAT table created for CPU
[    2.265620] amdgpu: Virtual CRAT table created for GPU
[    3.261272] amdgpu: Virtual CRAT table created for GPU

Atomics

$ sudo dmesg | grep -i kfd
[    2.177738] kfd kfd: amdgpu: Allocated 3969056 bytes on gart
[    2.265959] kfd kfd: amdgpu: added device 1002:66a1
[    3.169496] kfd kfd: amdgpu: Allocated 3969056 bytes on gart
[    3.261636] kfd kfd: amdgpu: added device 1002:66a1

and

$ sudo lspci -vvv -s 43:00.0 | grep Atomic
			 AtomicOpsCap: 32bit+ 64bit+ 128bitCAS-
			 AtomicOpsCtl: ReqEn+

issues

It seems the GPUs are not connected to each other, despite the fact that they are physically connected with an Infinity Fabric Link Bridge.

tests with rocm-smi

$ sudo rocm-smi --shownodesbw


======================= ROCm System Management Interface =======================
================================== Bandwidth ===================================
       GPU0         GPU1         
GPU0   N/A          0-0          
GPU1   0-0          N/A          
Format: min-max; Units: mps
"0-0" min-max bandwidth indicates devices are not connected dirrectly
============================= End of ROCm SMI Log ==============================

I also ran a few other tests, but I cannot really make sense of them, given the output of the command above.

$ sudo rocm-smi --showtopoaccess


======================= ROCm System Management Interface =======================
===================== Link accessibility between two GPUs ======================
       GPU0         GPU1         
GPU0   True         True         
GPU1   True         True         
============================= End of ROCm SMI Log ==============================

and

$ sudo rocm-smi --showtopo


======================= ROCm System Management Interface =======================
=========================== Weight between two GPUs ============================
       GPU0         GPU1         
GPU0   0            15           
GPU1   15           0            

============================ Hops between two GPUs =============================
       GPU0         GPU1         
GPU0   0            1            
GPU1   1            0            

========================== Link Type between two GPUs ==========================
       GPU0         GPU1         
GPU0   0            XGMI         
GPU1   XGMI         0            

================================== Numa Nodes ==================================
GPU[0]		: (Topology) Numa Node: 0
GPU[0]		: (Topology) Numa Affinity: 4294967295
GPU[1]		: (Topology) Numa Node: 0
GPU[1]		: (Topology) Numa Affinity: 4294967295
============================= End of ROCm SMI Log ==============================

other benchmarks

I also ran a benchmark from the RCCL repository, which is much slower on 2 GPUs than on a single GPU.

2 GPUs

$ sudo ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 2
# nThreads: 1 nGpus: 2 nRanks: 1 minBytes: 8 maxBytes: 134217728 step: 2(factor) warmupIters: 5 iters: 20 validation: 1 
#
# Using devices
#   Rank  0 Pid   2916 on  ultrafast device  0 [0000:0c:00.0] AMD Radeon (TM) Pro VII
#   Rank  1 Pid   2916 on  ultrafast device  1 [0000:43:00.0] AMD Radeon (TM) Pro VII
#
#                                                       out-of-place                       in-place          
#       size         count      type   redop     time   algbw   busbw  error     time   algbw   busbw  error
#        (B)    (elements)                       (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
           8             2     float     sum    24.85    0.00    0.00  0e+00    21.89    0.00    0.00  0e+00
          16             4     float     sum    20.07    0.00    0.00  0e+00    19.69    0.00    0.00  0e+00
          32             8     float     sum    19.91    0.00    0.00  0e+00    19.60    0.00    0.00  0e+00
          64            16     float     sum    19.61    0.00    0.00  0e+00    19.63    0.00    0.00  0e+00
         128            32     float     sum    19.78    0.01    0.01  0e+00    21.51    0.01    0.01  0e+00
         256            64     float     sum    19.76    0.01    0.01  0e+00    19.83    0.01    0.01  0e+00
         512           128     float     sum    19.98    0.03    0.03  0e+00    19.97    0.03    0.03  0e+00
        1024           256     float     sum    35.68    0.03    0.03  0e+00    35.36    0.03    0.03  0e+00
        2048           512     float     sum    20.42    0.10    0.10  0e+00    20.13    0.10    0.10  0e+00
        4096          1024     float     sum    37.20    0.11    0.11  0e+00    37.01    0.11    0.11  0e+00
        8192          2048     float     sum    36.14    0.23    0.23  0e+00    33.72    0.24    0.24  0e+00
       16384          4096     float     sum    33.62    0.49    0.49  0e+00    32.19    0.51    0.51  0e+00
       32768          8192     float     sum    32.93    1.00    1.00  0e+00    32.84    1.00    1.00  0e+00
       65536         16384     float     sum    34.00    1.93    1.93  0e+00    33.47    1.96    1.96  0e+00
      131072         32768     float     sum    35.17    3.73    3.73  0e+00    34.86    3.76    3.76  0e+00
      262144         65536     float     sum    38.97    6.73    6.73  0e+00    38.77    6.76    6.76  0e+00
      524288        131072     float     sum    49.84   10.52   10.52  0e+00    49.69   10.55   10.55  0e+00
     1048576        262144     float     sum    66.13   15.86   15.86  0e+00    65.54   16.00   16.00  0e+00
     2097152        524288     float     sum    97.07   21.61   21.61  0e+00    97.34   21.55   21.55  0e+00
     4194304       1048576     float     sum    160.2   26.19   26.19  0e+00    160.3   26.16   26.16  0e+00
     8388608       2097152     float     sum    284.9   29.45   29.45  0e+00    285.0   29.43   29.43  0e+00
    16777216       4194304     float     sum    532.9   31.48   31.48  0e+00    536.1   31.30   31.30  0e+00
    33554432       8388608     float     sum   1043.1   32.17   32.17  0e+00   1056.0   31.77   31.77  0e+00
    67108864      16777216     float     sum   2072.9   32.37   32.37  0e+00   2074.7   32.35   32.35  0e+00
   134217728      33554432     float     sum   4095.4   32.77   32.77  0e+00   4096.3   32.77   32.77  0e+00
# Errors with asterisks indicate errors that have exceeded the maximum threshold.
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 9.86367 
#

1 GPU

$ sudo ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 1
# nThreads: 1 nGpus: 1 nRanks: 1 minBytes: 8 maxBytes: 134217728 step: 2(factor) warmupIters: 5 iters: 20 validation: 1 
#
# Using devices
#   Rank  0 Pid   3122 on  ultrafast device  0 [0000:0c:00.0] AMD Radeon (TM) Pro VII
#
#                                                       out-of-place                       in-place          
#       size         count      type   redop     time   algbw   busbw  error     time   algbw   busbw  error
#        (B)    (elements)                       (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
           8             2     float     sum     8.38    0.00    0.00  0e+00     4.89    0.00    0.00  0e+00
          16             4     float     sum     7.77    0.00    0.00  0e+00     5.24    0.00    0.00  0e+00
          32             8     float     sum     7.27    0.00    0.00  0e+00     8.03    0.00    0.00  0e+00
          64            16     float     sum     7.46    0.01    0.00  0e+00     3.99    0.02    0.00  0e+00
         128            32     float     sum     8.42    0.02    0.00  0e+00     3.82    0.03    0.00  0e+00
         256            64     float     sum     7.72    0.03    0.00  0e+00     4.25    0.06    0.00  0e+00
         512           128     float     sum     8.05    0.06    0.00  0e+00     4.40    0.12    0.00  0e+00
        1024           256     float     sum     7.66    0.13    0.00  0e+00     4.01    0.26    0.00  0e+00
        2048           512     float     sum     9.16    0.22    0.00  0e+00     4.56    0.45    0.00  0e+00
        4096          1024     float     sum     7.51    0.55    0.00  0e+00     4.10    1.00    0.00  0e+00
        8192          2048     float     sum     7.88    1.04    0.00  0e+00     3.92    2.09    0.00  0e+00
       16384          4096     float     sum     7.84    2.09    0.00  0e+00     3.71    4.41    0.00  0e+00
       32768          8192     float     sum     7.42    4.42    0.00  0e+00     3.80    8.63    0.00  0e+00
       65536         16384     float     sum     7.45    8.80    0.00  0e+00     4.27   15.35    0.00  0e+00
      131072         32768     float     sum     8.17   16.05    0.00  0e+00     4.47   29.31    0.00  0e+00
      262144         65536     float     sum     9.10   28.81    0.00  0e+00     3.71   70.69    0.00  0e+00
      524288        131072     float     sum    39.66   13.22    0.00  0e+00     3.69  142.27    0.00  0e+00
     1048576        262144     float     sum    12.87   81.45    0.00  0e+00     3.96  264.85    0.00  0e+00
     2097152        524288     float     sum    14.53  144.29    0.00  0e+00     2.92  718.67    0.00  0e+00
     4194304       1048576     float     sum    23.76  176.54    0.00  0e+00     3.21  1308.00    0.00  0e+00
     8388608       2097152     float     sum    36.37  230.62    0.00  0e+00     3.60  2330.23    0.00  0e+00
    16777216       4194304     float     sum    67.07  250.16    0.00  0e+00     3.30  5079.62    0.00  0e+00
    33554432       8388608     float     sum    123.2  272.40    0.00  0e+00     3.19  10509.90    0.00  0e+00
    67108864      16777216     float     sum    240.4  279.14    0.00  0e+00     3.18  21079.55    0.00  0e+00
   134217728      33554432     float     sum    470.8  285.08    0.00  0e+00     4.88  27490.11    0.00  0e+00
# Errors with asterisks indicate errors that have exceeded the maximum threshold.
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 0 
#

Any help is highly appreciated.

About setting the performance level of AMD Instinct MI100 by rocm-smi

Hi,
According to the information printed by running rocm-smi -h on an AMD Instinct MI100 GPU, the device can be preset by the rocm-smi command to a specific condition; e.g., its working performance level can be set with rocm-smi --setperflevel [low | high | auto | manual].

Here I have a few questions about the use of the rocm-smi command with its --setperflevel flag.

  1. What is the difference in performance between this flag's parameters (auto, high, low, manual)?
  2. Are there any other parameters available for this flag?
  3. When this flag is set to "manual", what else needs to be done to get the MI100 into its best state for a specific computing task?

I am now running benchmarks on the MI100 to measure its maximum performance in some HPC work, and I consider the rocm-smi command a tool for obtaining reliable performance from this GPU device.

Thanks for any suggestions on the use of AMD GPUs.

failure to build with link time optimization enabled

Greetings,

for information, the Debian project is investigating the use of link time optimization at large scale, and Matthias Klose noticed in Debian Bug #1015653 that the rocm-smi-lib was failing to link with the following error:

[ 87%] Linking CXX shared library librocm_smi64.so
cd /<<PKGBUILDDIR>>/obj-x86_64-linux-gnu/rocm_smi && /usr/bin/cmake -E cmake_link_script CMakeFiles/rocm_smi64.dir/link.txt --verbose=1
/usr/bin/c++ -fPIC -g -O2 -ffile-prefix-map=/<<PKGBUILDDIR>>=. -flto=auto -ffat-lto-objects -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -Wall -Wextra -fno-rtti -m64 -msse -msse2 -std=c++11  -Wconversion -Wcast-align  -Wformat=2 -fno-common -Wstrict-overflow   -Woverloaded-virtual -Wreorder  -DFORTIFY_SOURCE=2 -fstack-protector-all -Wcast-align -Wl,-z,noexecstack -Wl,-znoexecheap -Wl,-z,relro  -Wtrampolines -Wl,-z,now -fPIE -ggdb -O0 -DDEBUG -flto=auto -ffat-lto-objects -Wl,-z,relro -Wl,-z,now -shared -Wl,-soname,librocm_smi64.so.1 -o librocm_smi64.so.1.0 CMakeFiles/rocm_smi64.dir/__/src/rocm_smi_device.cc.o CMakeFiles/rocm_smi64.dir/__/src/rocm_smi_main.cc.o CMakeFiles/rocm_smi64.dir/__/src/rocm_smi_monitor.cc.o CMakeFiles/rocm_smi64.dir/__/src/rocm_smi_power_mon.cc.o CMakeFiles/rocm_smi64.dir/__/src/rocm_smi_utils.cc.o CMakeFiles/rocm_smi64.dir/__/src/rocm_smi_counters.cc.o CMakeFiles/rocm_smi64.dir/__/src/rocm_smi_kfd.cc.o CMakeFiles/rocm_smi64.dir/__/src/rocm_smi_io_link.cc.o CMakeFiles/rocm_smi64.dir/__/src/rocm_smi_gpu_metrics.cc.o CMakeFiles/rocm_smi64.dir/__/src/rocm_smi.cc.o CMakeFiles/rocm_smi64.dir/__/third_party/shared_mutex/shared_mutex.cc.o  -lpthread -lrt 
/usr/bin/ld: warning: -z noexecheap ignored
/usr/bin/ld: /tmp/ccSiNzIs.ltrans0.ltrans.o: warning: relocation against `_ZNSt17_Function_handlerIFbcENSt8__detail11_AnyMatcherINSt7__cxx1112regex_traitsIcEELb1ELb0ELb0EEEE10_M_managerERSt9_Any_dataRKS8_St18_Manager_operation' in read-only section `.text'
/usr/bin/ld: /tmp/ccSiNzIs.ltrans0.ltrans.o: relocation R_X86_64_PC32 against symbol `_ZTVSt9exception@@GLIBCXX_3.4' can not be used when making a shared object; recompile with -fPIC

You can refer to a more complete log in the Debian bug tracker. Note that the build occurred with -fPIC enabled, so the message is probably a red herring. I'm not sure what to make of this error; it may be nothing, or it may be symptomatic of something else. In any case, I thought you might want to be aware of the issue.

In the meantime, I can simply make sure no attempt will be made to build rocm-smi-lib with link time optimization enabled in Debian. This shouldn't be too harmful; I don't believe rocm-smi-lib is where performance is most critical.

Have a nice day :)
Étienne.

rocm-smi fails during initialization if old AMD GPUs are present

I have been using the deprecated rocm-smi for a while now to monitor the status of my GPUs. I have a FirePro S10000 (Tahiti), which works with amdgpu, but does not provide, or only provides on a different path, some of the hardware interfaces expected from newer GPUs (for example voltages, clocks, power draw/cap and gpu_busy_percent). This caused the now-deprecated rocm-smi to show a warning about being unable to read gpu_busy_percent, but otherwise it worked.

This new rocm-smi version sadly straight-up fails to deal with this and errors out during initialization.

> /opt/rocm/bin/rocm-smi
rsmi_init() failed
Exception caught: rsmi_init.
ERROR:root:ROCm SMI returned 8 (the expected value is 0)

I have already narrowed this initialization problem down to an attempt to read /sys/class/hwmon/hwmon2/in0_label, which does not exist on the hwmon monitors of the Tahiti GPUs. This leads the program to attempt to find "" within kVoltSensorNameMap, which throws an exception (map::at).

Even without this issue, these GPUs don't provide a frequency table (as far as I know), which causes another exception:

» ./rocm-smi
======================= ROCm System Management Interface =======================
================================= Concise Info =================================
python3: [..]/src/rocm_smi_lib-rocm-4.1.0/src/rocm_smi.cc:895: rsmi_status_t get_frequencies(amd::smi::DevInfoTypes, uint32_t, rsmi_frequencies_t*, uint32_t*): Assertion `f->frequency[i-1] <= f->frequency[i]' failed.
[1]    69803 abort (core dumped)  ./rocm-smi

I don't expect rocm-smi to support these old GPUs, but it would be good if it still worked when old GPUs are present. Let me know if you need more information.

Relevant part of lspci:

0a:00.0 PCI bridge: PLX Technology, Inc. PEX 8747 48-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s) Switch (rev ba)
0b:08.0 PCI bridge: PLX Technology, Inc. PEX 8747 48-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s) Switch (rev ba)
0b:10.0 PCI bridge: PLX Technology, Inc. PEX 8747 48-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s) Switch (rev ba)
0c:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Tahiti PRO GL [FirePro Series]
0c:00.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Tahiti HDMI Audio [Radeon HD 7870 XT / 7950/7970]
0d:00.0 Display controller: Advanced Micro Devices, Inc. [AMD/ATI] Tahiti PRO GL [FirePro Series]
0e:00.0 PCI bridge: Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 XL Upstream Port of PCI Express Switch (rev c1)
0f:00.0 PCI bridge: Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 XL Downstream Port of PCI Express Switch
10:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Navi 21 [Radeon RX 6800/6800 XT / 6900 XT] (rev c1)
10:00.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Device ab28

Hardware monitor files of Tahiti:

» ls /sys/class/drm/card0/device/hwmon/hwmon2/
device       fan1_input   freq1_input  freq2_input  name   pwm1         pwm1_max  subsystem   temp1_crit_hyst  temp1_label
fan1_enable  fan1_target  freq1_label  freq2_label  power  pwm1_enable  pwm1_min  temp1_crit  temp1_input      uevent

Hardware monitor files of Navi21:

» ls /sys/class/drm/card2/device/hwmon/hwmon4
device       fan1_target  in0_input       power1_cap      pwm1_max         temp1_emergency  temp2_emergency  temp3_emergency
fan1_enable  freq1_input  in0_label       power1_cap_max  pwm1_min         temp1_input      temp2_input      temp3_input
fan1_input   freq1_label  name            power1_cap_min  subsystem        temp1_label      temp2_label      temp3_label
fan1_max     freq2_input  power           pwm1            temp1_crit       temp2_crit       temp3_crit       uevent
fan1_min     freq2_label  power1_average  pwm1_enable     temp1_crit_hyst  temp2_crit_hyst  temp3_crit_hyst
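
Both failure modes described above come down to unconditional assumptions about sysfs contents. As a hedged sketch (hypothetical helper names, not the library's actual functions), the kind of defensive checks that would let initialization survive these older GPUs looks like:

```cpp
#include <algorithm>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// map::at("") aborts with an exception when in0_label is absent;
// find() lets the caller fall back to "not supported" instead.
int lookup_volt_sensor(const std::map<std::string, int>& sensor_map,
                       const std::string& label) {
    auto it = sensor_map.find(label);
    return (it == sensor_map.end()) ? -1 : it->second;  // -1: unsupported
}

// Validate a frequency table instead of asserting it is sorted:
// Tahiti-era GPUs may expose an empty or unordered table.
bool freq_table_usable(const std::vector<uint64_t>& freqs) {
    return !freqs.empty() && std::is_sorted(freqs.begin(), freqs.end());
}
```

With guards like these, unsupported sensors could simply be reported as N/A, as the deprecated rocm-smi used to do.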

Minor device number from AMD GPUs

@cfreehill Is there a corresponding library call for getting the minor device number, as in nvmlDeviceGetMinorNumber from the NVIDIA management library? I see there are rsmi_dev_pci_id_get and rsmi_dev_id_get, but it is unclear to me whether these can be used for my purpose.

Is such an API call on the roadmap?
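
For what it's worth, newer releases of the library expose rsmi_dev_drm_render_minor_get, which maps a device index to its DRM render-node minor number; check rocm_smi.h in your installed version. Failing that, rsmi_dev_pci_id_get returns the PCI domain/bus/device/function packed into one 64-bit value, which can be matched against the BDF of a render node. A small illustrative sketch of unpacking it (pure Python, not a library call), using the bit layout documented in rocm_smi.h:

```python
def decode_bdfid(bdfid: int) -> str:
    """Unpack the 64-bit BDF value returned by rsmi_dev_pci_id_get.

    Layout (per rocm_smi.h): bits 63..32 domain, 15..8 bus,
    7..3 device, 2..0 function.
    """
    domain = (bdfid >> 32) & 0xFFFFFFFF
    bus = (bdfid >> 8) & 0xFF
    device = (bdfid >> 3) & 0x1F
    function = bdfid & 0x7
    return f"{domain:04x}:{bus:02x}:{device:02x}.{function:x}"

# A device at bus 0x0c, device 0, function 0:
print(decode_bdfid(0x0C << 8))  # → 0000:0c:00.0
```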

`rocm-smi -a` fails with error code 2

This never happened with previous versions. Maybe it would be better to just ignore info that cannot be queried and report errors only when the user specifically uses options like --showpagesinfo on the command line?

Maybe error code is returned because of this:

================================== Pages Info ==================================
ERROR: 2 GPU[0]: ras: RSMI_STATUS_NOT_SUPPORTED: This function is not supported in the current environment.	
============================ Show Valid sclk Range =============================
ERROR: 2 GPU[0]: od volt: RSMI_STATUS_NOT_SUPPORTED: This function is not supported in the current environment.	
GPU[0]		: Unable to display sclk range
ERROR: 2 GPU[1]: od volt: RSMI_STATUS_NOT_SUPPORTED: This function is not supported in the current environment.	
GPU[1]		: Unable to display sclk range
================================================================================
============================ Show Valid mclk Range =============================
ERROR: 2 GPU[0]: od volt: RSMI_STATUS_NOT_SUPPORTED: This function is not supported in the current environment.	
GPU[0]		: Unable to display mclk range
ERROR: 2 GPU[1]: od volt: RSMI_STATUS_NOT_SUPPORTED: This function is not supported in the current environment.	
GPU[1]		: Unable to display mclk range
================================================================================
=========================== Show Valid voltage Range ===========================
ERROR: 2 GPU[0]: od volt: RSMI_STATUS_NOT_SUPPORTED: This function is not supported in the current environment.	
GPU[0]		: Unable to display voltage range
ERROR: 2 GPU[1]: od volt: RSMI_STATUS_NOT_SUPPORTED: This function is not supported in the current environment.	
GPU[1]		: Unable to display voltage range
================================================================================
============================= Voltage Curve Points =============================
ERROR: 2 GPU[0]: od volt: RSMI_STATUS_NOT_SUPPORTED: This function is not supported in the current environment.	
GPU[0]		: Voltage Curve is not supported
ERROR: 2 GPU[1]: od volt: RSMI_STATUS_NOT_SUPPORTED: This function is not supported in the current environment.	
GPU[1]		: Voltage Curve is not supported
================================================================================
WARNING:  		 One or more commands failed
============================= End of ROCm SMI Log ==============================

ROCm/ROC-smi#95

Instacrash on hu-HU localized Windows

When run on a hu-HU (Hungarian) localized Windows, the command-line tool instacrashes with the following stack trace:

PS C:\Users\mate> amdsmi.exe
Traceback (most recent call last):
  File "main.py", line 25, in <module>
  File "<frozen importlib._bootstrap>", line 1178, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1149, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 690, in _load_unlocked
  File "PyInstaller\loader\pyimod02_importers.py", line 352, in exec_module
  File "exceptions.py", line 22, in <module>
  File "<frozen importlib._bootstrap>", line 1178, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1149, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 690, in _load_unlocked
  File "PyInstaller\loader\pyimod02_importers.py", line 352, in exec_module
  File "amdsmi_import\__init__.py", line 26, in <module>
  File "settings.py", line 93, in initGlobalSettings
  File "settings.py", line 33, in __init__
  File "utils.py", line 198, in get_platform_desc
  File "subprocess.py", line 550, in run
  File "subprocess.py", line 1194, in communicate
  File "<frozen codecs>", line 322, in decode
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x82 in position 32: invalid start byte
[6364] Failed to execute script 'main' due to unhandled exception!

Hex 0x82 is 130 in decimal, an accented letter in the legacy console code page, likely picked up from my name, Máté, but who knows. When changing the display language to en-US it doesn't crash. (It simply fails to return any devices.)
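
For context, the failure is in the Python CLI wrapper rather than the library: subprocess output from a localized Windows console arrives in a legacy OEM code page (cp852 for hu-HU), not UTF-8, and Python's strict UTF-8 decode then raises. A hedged sketch reproducing and defusing the problem (illustrative only, not the actual amdsmi code):

```python
# Bytes as a hu-HU console might emit them; 0x82 is not a valid
# UTF-8 start byte, which reproduces the reported crash.
raw = b"M\x82t\x82"

try:
    raw.decode("utf-8")
    crashed = False
except UnicodeDecodeError:
    crashed = True

# Two defensive alternatives that never hard-fail:
lossy = raw.decode("utf-8", errors="replace")  # U+FFFD for bad bytes
oem = raw.decode("cp852")                      # the console's own code page

print(crashed, lossy, oem)
```

In the tool itself, passing errors="replace" (or an explicit encoding) when decoding subprocess output would avoid the unhandled exception; subprocess.Popen has accepted encoding and errors keyword arguments since Python 3.6.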

Unable to get some GPU parameters in rocm 2.4

It worked in ROCm 2.3; now rocm-smi and rocm-smi -a give:

========================ROCm System Management Interface========================
================================================================================
GPU  Temp   AvgPwr  SCLK    MCLK    Fan     Perf  PwrCap  SCLK OD  MCLK OD  GPU%  
0    31.0c  N/A     852Mhz  167Mhz  28.63%  auto  N/A     0%       0%       N/A   
1    26.0c  N/A     852Mhz  167Mhz  27.84%  auto  N/A     0%       0%       N/A   
================================================================================
==============================End of ROCm SMI Log ==============================
GPU[0] 		: Unable to get maximum Graphics Package Power
GPU[1] 		: Unable to get maximum Graphics Package Power
================================================================================
================================================================================
GPU[0] 		: Unable to get Power Profile
GPU[1] 		: Unable to get Power Profile
================================================================================
================================================================================
GPU[0] 		: Unable to get Average Graphics Package Power Consumption
GPU[1] 		: Unable to get Average Graphics Package Power Consumption

GPU[0] 		: Unable to get GPU use.
GPU[1] 		: Unable to get GPU use.
================================================================================
================================================================================
GPU[0] 		: Unable to display voltage
GPU[1] 		: Unable to display voltage

Debian roc-smi manpage for review

Dear rocm-smi-lib maintainers,
Please review this manpage and suggest any corrections before it is uploaded.
It is a merge of the output of roc-smi --help and the README.md.

In particular, let me know if you would like any individual author to be cited.

If this is welcome, I can submit a comprehensive PR on the README in order to streamline future updates; that would help a lot with producing future man pages here.

No cmake variable to explicitly enable/disable documentation

Currently, documentation in rocm_smi/CMakeLists.txt is generated when doxygen and latex are found. However, being able to disable document generation explicitly is still desirable. In particular, I have encountered a problem where the document generation causes my build to hang. This is the case with the package manager Spack, where I have opened a PR with a patch commenting out that part of the code. spack/spack#28842

It would be great if a CMake variable explicitly enabling/disabling documentation were added.
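
A typical pattern would be to guard the generation with a user-settable option (sketch only; the variable name BUILD_DOCS is a suggestion, not an existing project convention):

```cmake
# Let packagers opt out explicitly, e.g. cmake -DBUILD_DOCS=OFF ...
option(BUILD_DOCS "Build the doxygen/latex documentation" ON)

if(BUILD_DOCS)
    find_package(Doxygen)
    find_package(LATEX)
    if(DOXYGEN_FOUND AND LATEX_FOUND)
        # existing doc-generation rules move inside this guard
    endif()
endif()
```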

Example in README.md should use uint16_t dev_id

In the README.md's Hello ROCm SMI example, "uint64_t dev_id;" should be changed to "uint16_t dev_id;"; otherwise we see an error like this:

~/rocm_smi_lib$ hipcc -L /opt/rocm/rocm_smi/lib -lrocm_smi64 -I /opt/rocm/rocm_smi/include hello.c -o hello
clang-10: warning: treating 'c' input as 'c++' when in C++ mode, this behavior is deprecated [-Wdeprecated]
hello.c:16:11: error: no matching function for call to 'rsmi_dev_id_get'
    ret = rsmi_dev_id_get(i, &dev_id);
          ^~~~~~~~~~~~~~~
/opt/rocm/rocm_smi/include/rocm_smi/rocm_smi.h:819:15: note: candidate function not viable: no known conversion from 'uint64_t *' (aka 'unsigned long *') to 'uint16_t *' (aka 'unsigned short *') for 2nd argument
rsmi_status_t rsmi_dev_id_get(uint32_t dv_ind, uint16_t *id);
              ^
1 error generated.
hello.c:16:11: error: no matching function for call to 'rsmi_dev_id_get'
    ret = rsmi_dev_id_get(i, &dev_id);
          ^~~~~~~~~~~~~~~
/opt/rocm/rocm_smi/include/rocm_smi/rocm_smi.h:819:15: note: candidate function not viable: no known conversion from 'uint64_t *' (aka 'unsigned long *') to 'uint16_t *' (aka 'unsigned short *') for 2nd argument
rsmi_status_t rsmi_dev_id_get(uint32_t dv_ind, uint16_t *id);
              ^
1 error generated.
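
To make the fix concrete, here is a self-contained sketch of the corrected call site. The library function is stubbed out so the snippet compiles standalone; the real prototype is the one from rocm_smi.h quoted in the error above, and the stub's return value is purely illustrative.

```c
#include <stdint.h>

typedef int rsmi_status_t;

/* Stub standing in for the real librocm_smi64 call; the signature
 * matches the rocm_smi.h prototype, which takes a uint16_t pointer. */
static rsmi_status_t rsmi_dev_id_get(uint32_t dv_ind, uint16_t *id) {
    (void)dv_ind;
    *id = 0x731f;  /* example device id, for illustration only */
    return 0;
}

/* Corrected call site: dev_id must be uint16_t, not uint64_t. */
uint16_t query_dev_id(uint32_t dv_ind) {
    uint16_t dev_id;
    rsmi_status_t ret = rsmi_dev_id_get(dv_ind, &dev_id);
    return (ret == 0) ? dev_id : 0;
}
```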

Error in rocm-smi -a

Hello,

I have a problem with rocm-smi -a:

 ./rocm-smi -a
========================ROCm System Management Interface========================
Driver version: 5.6.19
================================================================================
GPU[0]          : GPU ID: 0x731f
================================================================================
================================================================================
GPU[0]          : VBIOS version: 111
================================================================================
================================================================================
GPU[0]          : Temperature (Sensor edge) (C): 67.0
GPU[0]          : Temperature (Sensor junction) (C): 76.0
GPU[0]          : Temperature (Sensor mem) (C): 68.0
================================================================================
================================================================================
GPU[0]          : dcefclk clock level: 0 (506Mhz)
GPU[0]          : fclk clock level: 1 (1085Mhz)
GPU[0]          : mclk clock level: 0 (100Mhz)
GPU[0]          : pcie clock level: 1 (16.0GT/s, x16 619Mhz)
GPU[0]          : sclk clock level: 2 (2100Mhz)
GPU[0]          : socclk clock level: 1 (1085Mhz)
================================================================================
================================================================================
GPU[0]          : Fan Level: 68 (26%)
================================================================================
================================================================================
GPU[0]          : Performance Level: auto
================================================================================
================================================================================
GPU[0]          : GPU OverDrive value (%): 0
================================================================================
================================================================================
GPU[0]          : GPU Memory OverDrive value (%): 0
================================================================================
================================================================================
GPU[0]          : Max Graphics Package Power (W): 220.0
================================================================================
================================================================================
GPU[0]          :
GPU[0]          : PROFILE_INDEX(NAME) CLOCK_TYPE(NAME) FPS MinFreqType MinActiveFreqType MinActiveFreq BoosterFreqType BoosterFreq PD_Data_limit_c PD_Data_error_coeff PD_Data_error_rate_coeff
GPU[0]          :  0 BOOTUP_DEFAULT :
GPU[0]          :                     0(       GFXCLK)       0       5       1       0       4     800 4587520  -65536       0
GPU[0]          :                     1(       SOCCLK)       0       5       1       0       3     800 1310720   -6553       0
GPU[0]          :                     2(        MEMLK)       0       5       1       0       4     800  327680  -65536       0
GPU[0]          :  1 3D_FULL_SCREEN :
GPU[0]          :                     0(       GFXCLK)       0       5       1       0       4     650 3932160   -6553  -65536
GPU[0]          :                     1(       SOCCLK)       0       5       1     850       4     800 1310720   -6553       0
GPU[0]          :                     2(        MEMLK)       0       5       4     850       4     800  327680  -65536       0
GPU[0]          :  2   POWER_SAVING :
GPU[0]          :                     0(       GFXCLK)       0       5       1       0       3       0 5898240  -65536       0
GPU[0]          :                     1(       SOCCLK)       0       5       1       0       3       0 1310720   -6553       0
GPU[0]          :                     2(        MEMLK)       0       5       1       0       3       0 1966080  -65536       0
GPU[0]          :  3          VIDEO :
GPU[0]          :                     0(       GFXCLK)       0       5       1       0       4     500 4587520  -65536       0
GPU[0]          :                     1(       SOCCLK)       0       5       1       0       4     500 1310720   -6553       0
GPU[0]          :                     2(        MEMLK)       0       5       1       0       4     500 1966080  -65536       0
GPU[0]          :  4             VR :
GPU[0]          :                     0(       GFXCLK)       0       5       4    1000       4     800 4587520  -65536       0
GPU[0]          :                     1(       SOCCLK)       0       5       1       0       4     800  327680  -65536       0
GPU[0]          :                     2(        MEMLK)       0       5       1       0       4     800  327680  -65536       0
GPU[0]          :  5        COMPUTE*:
GPU[0]          :                     0(       GFXCLK)       0       5       4    1000       3       0 3932160  -65536  -65536
GPU[0]          :                     1(       SOCCLK)       0       5       4     850       3       0  327680  -65536  -32768
GPU[0]          :                     2(        MEMLK)       0       5       4     850       3       0  327680  -65536  -32768
GPU[0]          :  6         CUSTOM :
GPU[0]          :                     0(       GFXCLK)       0       5       1       0       4     800 4587520  -65536       0
GPU[0]          :                     1(       SOCCLK)       0       5       1       0       3     800 1310720   -6553       0
GPU[0]          :                     2(        MEMLK)       0       5       1       0       4     800  327680  -65536       0
================================================================================
================================================================================
GPU[0]          : Average Graphics Package Power (W): 71.0
================================================================================
================================================================================
GPU[0]          : Supported dcefclk frequencies on GPU0
GPU[0]          : 0: 506Mhz *
GPU[0]          : 1: 886Mhz
GPU[0]          : 2: 1266Mhz
GPU[0]          :
GPU[0]          : Supported fclk frequencies on GPU0
GPU[0]          : 0: 506Mhz
GPU[0]          : 1: 1085Mhz *
GPU[0]          : 2: 1266Mhz
GPU[0]          :
GPU[0]          : Supported mclk frequencies on GPU0
GPU[0]          : 0: 100Mhz *
GPU[0]          : 1: 500Mhz
GPU[0]          : 2: 625Mhz
GPU[0]          : 3: 875Mhz
GPU[0]          :
GPU[0]          : Supported pcie frequencies on GPU0
GPU[0]          : 0: 2.5GT/s, x16 619Mhz
GPU[0]          : 1: 16.0GT/s, x16 619Mhz *
GPU[0]          :
GPU[0]          : Supported sclk frequencies on GPU0
GPU[0]          : 0: 300Mhz
GPU[0]          : 1: 1200Mhz
GPU[0]          : 2: 2100Mhz *
GPU[0]          :
GPU[0]          : Supported socclk frequencies on GPU0
GPU[0]          : 0: 506Mhz
GPU[0]          : 1: 1085Mhz *
GPU[0]          : 2: 1266Mhz
GPU[0]          :
================================================================================
================================================================================
GPU[0]          : GPU use (%): 99
================================================================================
================================================================================
GPU[0]          : GPU memory use (%): 0
================================================================================
================================================================================
GPU[0]          : GPU memory vendor: micron
================================================================================
================================================================================
GPU[0]          : PCIe Replay Count: 0
================================================================================
================================================================================
GPU[0]          : Unique ID: N/A
================================================================================
================================================================================
GPU[0]          : Serial Number: N/A
================================================================================
PIDs for KFD processes:
2683628
================================================================================
ERROR: GPU[0]           : Unable to display PowerPlay table
================================================================================
================================================================================
GPU[0]          : Voltage (mV): 1168
================================================================================
================================================================================
GPU[0]          : PCI Bus: 0000:4b:00.0
================================================================================
================================================================================
GPU[0]          : ASD firmware version:         553648188
GPU[0]          : CE firmware version:          37
GPU[0]          : DMCU firmware version:        0
GPU[0]          : MC firmware version:          0
GPU[0]          : ME firmware version:          94
GPU[0]          : MEC firmware version:         137
GPU[0]          : MEC2 firmware version:        137
GPU[0]          : PFP firmware version:         144
GPU[0]          : RLC firmware version:         128
GPU[0]          : RLC SRLC firmware version:    0
GPU[0]          : RLC SRLG firmware version:    0
GPU[0]          : RLC SRLS firmware version:    0
GPU[0]          : SDMA firmware version:        30
GPU[0]          : SDMA2 firmware version:       30
GPU[0]          : SMC firmware version:         00.42.61.00
GPU[0]          : SOS firmware version:         0x00100250
GPU[0]          : TA RAS firmware version:      00.00.00.00
GPU[0]          : TA XGMI firmware version:     00.00.00.00
GPU[0]          : UVD firmware version:         0x00000000
GPU[0]          : VCE firmware version:         0x00000000
GPU[0]          : VCN firmware version:         0x0510a00d
================================================================================
================================================================================
GPU[0]          : Card series:          Navi 10 [Radeon RX 5600 OEM/5600 XT / 5700/5700 XT]
GPU[0]          : Card vendor:          Tul Corporation / PowerColor
GPU[0]          : 148d  DIGICOM Systems, Inc.
Traceback (most recent call last):
  File "./rocm-smi", line 3021, in <module>
    showProductName(deviceList)
  File "./rocm-smi", line 1853, in showProductName
    sku = vbios.split('-')[1][:6]
IndexError: list index out of range

Could you help me? I don't know if the problem is the code or the configuration, as all the configuration is in automatic mode.

Thanks,
Berta

rsmitst64 and rocm-smi -a fail on OD_VDDC_CURVE with ppfeaturemask set

Hello!

Recently back into ROCm and working on getting SMI going on 6900xt with Ubuntu 22.04, kernel 5.17.0-1025-oem and ROCm 5.4.1. All functionality works in SMI save for anything requiring the featuremask. With it set, I receive the error below:

/home/mk/Installs/rocm_smi_lib/src/rocm_smi.cc:1158: rsmi_status_t get_od_clk_volt_info(uint32_t, rsmi_od_volt_freq_data_t*): Assertion `val_vec[kOD_VDDC_CURVE_label_array_index] == "OD_VDDC_CURVE:"' failed.
Aborted

Thanks!

Are there any other way to reset the GPU except rocm-smi?

I'm using an AMD GPU. Some of my code has bugs, and if I kill the process, the GPU becomes unavailable. The output of rocm-smi is:

======================= ROCm System Management Interface =======================
================================= Concise Info =================================
Expected integer value from monitor, but got ""
Expected integer value from monitor, but got ""
ERROR: 15 GPU[0]: power: Data (usually from reading a file) was not of the type that was expected	
================================================================================
================================================================================
Expected integer value from monitor, but got ""
ERROR: 15 GPU[0]:Data (usually from reading a file) was not of the type that was expected	
Expected integer value from monitor, but got ""
ERROR: 15 GPU[0]:Data (usually from reading a file) was not of the type that was expected	
GPU  Temp  AvgPwr  SCLK  MCLK  Fan   Perf     PwrCap       VRAM%  GPU%  
0    N/A   N/A     None  None  0.0%  unknown  Unsupported    0%   0%    
================================================================================
WARNING:  		 One or more commands failed
============================= End of ROCm SMI Log ==============================

If I use rocm-smi --gpureset -d 0 to reset the GPU, the output is:

======================= ROCm System Management Interface =======================
================================== Reset GPU ===================================
GPU[0]		: Successfully reset GPU 0
================================================================================

But the GPU is still not available unless I reboot the computer.

The rocm-smi documentation mentions: "Note that GPU reset will not always work, depending on the manner in which the GPU is hung."

If rocm-smi cannot reset the GPU, are there any other tools that can do it?
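For reference, one thing that sometimes works when rocm-smi's reset does not is asking the kernel for a PCI function-level reset through sysfs. This is a hedged sketch, not an official tool: whether the `reset` attribute exists, and whether it actually recovers the GPU, depends on the device and on how it hung; root privileges are required, and the BDF used below is only an example.

```python
# Sketch: trigger a kernel-level PCI reset via sysfs, bypassing rocm-smi.
# Assumptions: you know the GPU's domain:bus:device.function (e.g. from
# `lspci -D`), the kernel exposes a `reset` attribute for it, and you run
# as root. Not all devices (or hang states) support this.
from pathlib import Path

def pci_reset(bdf: str) -> None:
    """Write '1' to /sys/bus/pci/devices/<bdf>/reset to reset the function."""
    reset_file = Path("/sys/bus/pci/devices") / bdf / "reset"
    if not reset_file.exists():
        raise FileNotFoundError(
            f"{reset_file} not present; the device offers no reset hook")
    reset_file.write_text("1")  # blocks until the kernel finishes the reset

if __name__ == "__main__":
    try:
        pci_reset("0000:03:00.0")  # example BDF; substitute your GPU's address
        print("reset issued")
    except (FileNotFoundError, PermissionError) as err:
        print(f"reset not possible here: {err}")
```

If even a function-level reset does not help, the remaining options are generally a driver reload (`sudo rmmod amdgpu && sudo modprobe amdgpu`, if nothing holds the module) or a reboot.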

License grant to various README.md and ROCm_SMI_Manual.pdf

Greetings,

While packaging rocm_smi_lib for Debian, we noted that some of the files are under AMD copyright but carry no explicit reference to the NCSA Open Source License described in License.txt. Instead, they carry a disclaimer that "feels" like it would be a problem if we were to redistribute the source code in the context of the Debian project:

[...] No license, including implied or arising by estoppel, to any intellectual property rights is granted by this document. Terms and limitations applicable to the purchase or use of AMD’s products are as set forth in a signed agreement between the parties or in AMD's Standard Terms and Conditions of Sale.

© 2020 Advanced Micro Devices, Inc. All Rights Reserved.

These files are:

  • README.md
  • python_smi_tools/README.md
  • rocm_smi/docs/ROCm_SMI_Manual.pdf

If these files are actually redistributable per NCSA license, would it be possible to indicate it explicitly alongside the disclaimer? Otherwise I believe I must assume I cannot redistribute them.

In any case, many thanks for making available ROCm source code!

Have a nice day, :)
Étienne.

PS: I am not a native English speaker, so don't hesitate to let me know if I misinterpreted the disclaimer for a legally binding statement or misunderstood it entirely.

Error on building test

Hello,

I'm trying to test rocm-smi, but these errors appear when building the tests:

[ 11%] Building CXX object CMakeFiles/rsmitst64.dir/test_common.cc.o
/usr/bin/c++ -DDEBUG -DLITTLEENDIAN_CPU=1 -D__linux__ -I/opt/rocm-4.0.0/rocm_smi/include -I/home/zymvol/git/rocm_smi_lib/tests/rocm_smi_test/.. -I/home/zymvol/git/rocm_smi_lib/tests/rocm_smi_test/gtest/include -isystem /home/zymvol/git/rocm_smi_lib/tests/rocm_smi_test/gtest/googletest/include -isystem /home/zymvol/git/rocm_smi_lib/tests/rocm_smi_test/gtest/googletest -std=c++11  -fexceptions -fno-rtti -fno-math-errno -fno-threadsafe-statics -fmerge-all-constants -fms-extensions -Wall -Wextra -m64  -msse -msse2 -ggdb -O0 -g -pthread -o CMakeFiles/rsmitst64.dir/test_common.cc.o -c /home/zymvol/git/rocm_smi_lib/tests/rocm_smi_test/test_common.cc
/home/zymvol/git/rocm_smi_lib/tests/rocm_smi_test/test_common.cc:70:6: error: ‘RSMI_DEV_PERF_LEVEL_DETERMINISM’ was not declared in this scope; did you mean ‘RSMI_DEV_PERF_LEVEL_MANUAL’?
   70 |     {RSMI_DEV_PERF_LEVEL_DETERMINISM, "RSMI_DEV_PERF_LEVEL_DETERMINISM"},
      |      ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      |      RSMI_DEV_PERF_LEVEL_MANUAL
/home/zymvol/git/rocm_smi_lib/tests/rocm_smi_test/test_common.cc:73:1: error: could not convert ‘{{RSMI_DEV_PERF_LEVEL_AUTO, "RSMI_DEV_PERF_LEVEL_AUTO"}, {RSMI_DEV_PERF_LEVEL_LOW, "RSMI_DEV_PERF_LEVEL_LOW"}, {RSMI_DEV_PERF_LEVEL_HIGH, "RSMI_DEV_PERF_LEVEL_HIGH"}, {RSMI_DEV_PERF_LEVEL_MANUAL, "RSMI_DEV_PERF_LEVEL_MANUAL"}, {RSMI_DEV_PERF_LEVEL_STABLE_STD, "RSMI_DEV_PERF_LEVEL_STABLE_STD"}, {RSMI_DEV_PERF_LEVEL_STABLE_PEAK, "RSMI_DEV_PERF_LEVEL_STABLE_PEAK"}, {RSMI_DEV_PERF_LEVEL_STABLE_MIN_MCLK, "RSMI_DEV_PERF_LEVEL_STABLE_MIN_MCLK"}, {RSMI_DEV_PERF_LEVEL_STABLE_MIN_SCLK, "RSMI_DEV_PERF_LEVEL_STABLE_MIN_SCLK"}, {<expression error>, "RSMI_DEV_PERF_LEVEL_DETERMINISM"}, {RSMI_DEV_PERF_LEVEL_UNKNOWN, "RSMI_DEV_PERF_LEVEL_UNKNOWN"}}’ from ‘<brace-enclosed initializer list>’ to ‘const std::map<rsmi_dev_perf_level_t, const char*>’
   73 | };
      | ^
      | |
      | <brace-enclosed initializer list>
/home/zymvol/git/rocm_smi_lib/tests/rocm_smi_test/test_common.cc:76:43: error: ‘RSMI_DEV_PERF_LEVEL_DETERMINISM’ was not declared in this scope; did you mean ‘RSMI_DEV_PERF_LEVEL_MANUAL’?
   76 | static_assert(RSMI_DEV_PERF_LEVEL_LAST == RSMI_DEV_PERF_LEVEL_DETERMINISM,
      |                                           ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      |                                           RSMI_DEV_PERF_LEVEL_MANUAL
make[2]: *** [CMakeFiles/rsmitst64.dir/build.make:111: CMakeFiles/rsmitst64.dir/test_common.cc.o] Error 1
make[2]: Leaving directory '/home/zymvol/git/rocm_smi_lib/build'
make[1]: *** [CMakeFiles/Makefile2:160: CMakeFiles/rsmitst64.dir/all] Error 2
make[1]: Leaving directory '/home/zymvol/git/rocm_smi_lib/build'
make: *** [Makefile:152: all] Error 2

I installed rocm-smi from the ROCm package; I don't know if this is related.
My card is a Radeon RX 5700 XT and the OS is Ubuntu 20.04.

If you need more information let me know.

Thanks,
Berta

Does not compile with doc

It compiles fine without docs, but with docs enabled I get:

writing tag file...
Running plantuml with JAVA...
lookup cache used 346/65536 hits=1455 misses=358
finished...
[ 45%] Generating latex/refman.pdf
cd /debamd/rocm-smi-lib/rocm-smi-lib/obj-x86_64-linux-gnu/rocm_smi/latex && make > /dev/null
make[4]: warning: jobserver unavailable: using -j1.  Add '+' to parent make rule.
make[4]: *** [Makefile:8: refman.pdf] Error 1
make[3]: *** [rocm_smi/CMakeFiles/docs.dir/build.make:76: rocm_smi/latex/refman.pdf] Error 2
make[3]: Leaving directory '/debamd/rocm-smi-lib/rocm-smi-lib/obj-x86_64-linux-gnu'
make[2]: *** [CMakeFiles/Makefile2:260: rocm_smi/CMakeFiles/docs.dir/all] Error 2
make[2]: *** Waiting for unfinished jobs....

installing `python-is-python3` into rocm container removes all rocm packages

I'm not sure if this is the right place, but I was playing around with the ROCm Docker images and needed Python 3 to be the default, so I tried to install python-is-python3; this removes all ROCm packages again:

$ docker run --rm -it rocm/dev-ubuntu-20.04:5.2.3
# apt-get update -q
# apt-get install -q -y --no-install-recommends apt-utils
# apt-get upgrade -q -y --no-install-recommends
# apt-get install --no-install-recommends python-is-python3
Reading package lists... Done  
Building dependency tree
Reading state information... Done
The following packages were automatically installed and are no longer required:

Use 'apt autoremove' to remove them.
The following packages will be REMOVED:
  hip-runtime-amd openmp-extras python-is-python2 rocm-clang-ocl rocm-dev rocm-llvm rocm-utils
The following NEW packages will be installed:
  python-is-python3
0 upgraded, 1 newly installed, 7 to remove and 0 not upgraded.
Need to get 2364 B of archives.
After this operation, 85.8 MB disk space will be freed.

As the rocm-smi-lib package is a known Python user, I suspect this has something to do with its dependencies. If you have other ideas for preventing these removals, that would be appreciated too. Thanks.

Unable to get full info dump from -a

======================= ROCm System Management Interface =======================
========================= Version of System Component ==========================
Driver version: 5.10.0-7-amd64
================================================================================
====================================== ID ======================================
GPU[0]		: GPU ID: 0x687f
================================================================================
================================== Unique ID ===================================
GPU[0]		: Unique ID: 0x213fab3ebd840c4
================================================================================
==================================== VBIOS =====================================
GPU[0]		: VBIOS version: 113-D0500100-102
================================================================================
================================= Temperature ==================================
GPU[0]		: Temperature (Sensor edge) (C): 59.0
GPU[0]		: Temperature (Sensor junction) (C): 59.0
GPU[0]		: Temperature (Sensor memory) (C): 60.0
================================================================================
========================== Current clock frequencies ===========================
GPU[0]		: dcefclk clock level: 0: (600Mhz)
ERROR: GPU[0] 		: fclk clock is unsupported
GPU[0]		: mclk clock level: 3: (945Mhz)
GPU[0]		: sclk clock level: 0: (852Mhz)
GPU[0]		: socclk clock level: 5: (960Mhz)
python3: /home/<username>/Downloads/rocm_smi_lib-rocm-4.1.0/src/rocm_smi.cc:898: rsmi_status_t get_frequencies(amd::smi::DevInfoTypes, uint32_t, rsmi_frequencies_t*, uint32_t*): Assertion `f->current == RSMI_MAX_NUM_FREQUENCIES + 1' failed.
Aborted

I'm having issues just getting the full info dump. I'm not sure what else I need to provide for help.

Card: Vega64
OS: SparkyLinux 6 (Po-Tolo) [Debian testing] with 5.10.0-7-amd64 drivers.

I can do some manual control, like setting fans, but things like setting the memory clock report success and then don't actually change anything.

Python CLI - README.md table

Hello, just sharing an itch I got while trying to turn your README into roff man pages.
The visual tables you have (for setprofile, for example) do not translate well into roff;
classic Markdown tables do. I'd probably submit a PR directly.
apjanke/ronn-ng#79

Segmentation Fault (core dumped)

uname -ar
Linux cr-2 4.19.0-rc5-dandy #1 SMP Thu Sep 27 11:44:26 CDT 2018 x86_64 x86_64 x86_64 GNU/Linux

Is this normal behavior?

    **Device ID: 0x6863
    **Performance Level:MANUAL
    **OverDrive Level:0
    **Supported GPU Memory clock frequencies: 4
    **  0: 9f437c0
    **  1: 1dcd6500
    **  2: 2faf0800
    **  3: 38538e40 *
    **Supported GPU clock frequencies: 8
    **  0: 32c87d00
    **  1: 3b1175c0
    **  2: 43d48080
    **  3: 4ba36740
    **  4: 5058d900 *
    **  5: 55d4a800
    **  6: 5b136e00
    **  7: 5f5e1000
    **Monitor name: amdgpu

    **Temperature: 30C
    **Current Fan Speed: 0% (0/ff)
    **Current fan RPMs: 0
    **rsmi_dev_power_max_get(): RSMI_STATUS_NOT_SUPPORTED: This function is not supported in the current environment.
    **Current Power Cap: d1cef00uW
    **Power Cap Range: 0 to d1cef00 uW
    **Averge Power Usage: 143 W
    =======
    **Device ID: 0x6863
    **Performance Level:MANUAL
    **OverDrive Level:0
    **Supported GPU Memory clock frequencies: 4
    **  0: 9f437c0
    **  1: 1dcd6500
    **  2: 2faf0800
    **  3: 38538e40 *
    **Supported GPU clock frequencies: 8
    **  0: 32c87d00
    **  1: 3b1175c0
    **  2: 43d48080
    **  3: 4ba36740
    **  4: 5058d900 *
    **  5: 55d4a800
    **  6: 5b136e00
    **  7: 5f5e1000
    **Monitor name: amdgpu

    **Temperature: 34C
    **Current Fan Speed: 29.8039% (4c/ff)
    **Current fan RPMs: 5df
    
     Segmentation fault (core dumped)

I also tried this on 4.18.7 with the same result; I'm not sure if the kernel is relevant.

Update the default soname for tarballs

When building rocm_smi_lib from a source tarball, the soname defaults to 1.0.0.

Could you please update the soname in the CMake script for each release so that it always makes sense?

How weights and hops are calculated

Hi,

Considering the following output:

======================= ROCm System Management Interface =======================
=========================== Weight between two GPUs ============================
       GPU0         GPU1         GPU2         GPU3         
GPU0   0            52           52           52           
GPU1   52           0            52           52           
GPU2   52           52           0            52           
GPU3   52           52           52           0            

============================ Hops between two GPUs =============================
       GPU0         GPU1         GPU2         GPU3         
GPU0   0            3            3            3            
GPU1   3            0            3            3            
GPU2   3            3            0            3            
GPU3   3            3            3            0            

========================== Link Type between two GPUs ==========================
       GPU0         GPU1         GPU2         GPU3         
GPU0   0            PCIE         PCIE         PCIE         
GPU1   PCIE         0            PCIE         PCIE         
GPU2   PCIE         PCIE         0            PCIE         
GPU3   PCIE         PCIE         PCIE         0            

================================== Numa Nodes ==================================
GPU[0]		: (Topology) Numa Node: 0
GPU[0]		: (Topology) Numa Affinity: 0
GPU[1]		: (Topology) Numa Node: 1
GPU[1]		: (Topology) Numa Affinity: 1
GPU[2]		: (Topology) Numa Node: 3
GPU[2]		: (Topology) Numa Affinity: 3
GPU[3]		: (Topology) Numa Node: 2
GPU[3]		: (Topology) Numa Affinity: 2
============================= End of ROCm SMI Log ==============================

I could not find relevant documents explaining how hops and weights are calculated between AMD GPUs.
From the source code, it seems that these are summations of values intrinsically assigned for specific hardware.

In the case of NVIDIA, there is a clear topology hierarchy (NVLINK -> PIX -> PXB -> PHB -> NODE -> SYS -> X),
so I can deduce the spatial relationship between GPUs from nvidia-smi.

From rocm-smi it is not immediately clear to me how to interpret the aforementioned weights and hops.
Some clarification would be much appreciated.
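For what it's worth, the raw numbers appear to come from the kernel's KFD topology: each node directory under /sys/class/kfd/kfd/topology/nodes/ has io_links/*/properties files of key/value pairs, including a per-link weight, and the GPU-to-GPU weight reported by rocm-smi seems to aggregate the link weights along the path (with hops counting the links). A hedged sketch to dump them; the exact key names are as exposed on my kernel and may vary:

```python
# Sketch: dump KFD IO-link properties, the sysfs source behind rocm-smi's
# topology weights. Assumes a Linux box with the amdgpu/KFD driver loaded;
# on machines without /sys/class/kfd this simply prints nothing.
from pathlib import Path

def parse_io_link_properties(text: str) -> dict:
    """Parse a KFD `properties` file: one `key value` pair per line."""
    props = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].lstrip("-").isdigit():
            props[parts[0]] = int(parts[1])
    return props

def dump_kfd_links(root: str = "/sys/class/kfd/kfd/topology/nodes") -> None:
    for props_file in sorted(Path(root).glob("*/io_links/*/properties")):
        props = parse_io_link_properties(props_file.read_text())
        print(props_file.parent, "->",
              {k: props.get(k) for k in ("node_from", "node_to", "type", "weight")})

if __name__ == "__main__":
    dump_kfd_links()
```

Comparing the dumped `weight` values against the rocm-smi matrix should show how the per-link values compose into the numbers above.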

Add support for package configuration files

This is required for https://github.com/ROCmSoftwarePlatform/MIOpen.

Why we need rocm_smi_lib:

  • Beginning from ROCm 4.1 RC, MIOpen needs to inform offline compilers about target features (XNACK and SRAMECC).
  • Otherwise the HIP/OCL runtime will fail to load code objects, or performance may suffer.
  • Target features can be obtained from the OpenCL and HIP runtimes, but the values are valid ONLY if the DKMS driver is up to date.
  • Therefore MIOpen needs to know the DKMS driver version.

We are going to use the library for the above purpose.

To find external packages (like compilers, libraries, etc.), MIOpen relies on CMake. However, CMake's find_package() fails to find rocm_smi_lib because it does not provide package configuration files.

We could implement our own Findrocm_smi_lib.cmake, but that is not a future-proof solution.

I recommend using https://github.com/RadeonOpenCompute/rocm-cmake to implement the package config files. Please refer to the CMake documentation for more information.
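As a sketch of what the requested support could look like (target and file names below are illustrative assumptions, not the project's actual ones), the standard CMake machinery is roughly:

```cmake
# Hypothetical sketch only: export a namespaced target plus config/version
# files so that consumers can use find_package(rocm_smi CONFIG).
include(CMakePackageConfigHelpers)

install(TARGETS rocm_smi64 EXPORT rocm_smi-targets
        LIBRARY DESTINATION lib
        INCLUDES DESTINATION include)
install(EXPORT rocm_smi-targets
        NAMESPACE rocm_smi::
        DESTINATION lib/cmake/rocm_smi)

# A rocm_smi-config.cmake then only needs to include the targets file above.
write_basic_package_version_file(
  "${CMAKE_CURRENT_BINARY_DIR}/rocm_smi-config-version.cmake"
  VERSION ${PROJECT_VERSION}
  COMPATIBILITY SameMajorVersion)
install(FILES "${CMAKE_CURRENT_BINARY_DIR}/rocm_smi-config-version.cmake"
        DESTINATION lib/cmake/rocm_smi)
```

Consumers would then write `find_package(rocm_smi CONFIG REQUIRED)` and `target_link_libraries(app PRIVATE rocm_smi::rocm_smi64)` without any hand-written find module.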

Compatibility with non-PRO AMD Software

The HIP SDK on Windows (if one is not careful) will install the PRO version of the driver, which installs amdsmi.exe, which is nice. The issue is that Radeon Software will soon start complaining that newer non-PRO drivers are available. After installing one (because I also use my box for gaming) and fixing the localization issue #126 by installing a new display language, amdsmi.exe simply fails to enumerate my devices (the iGPU inside a Ryzen 7 6800HS and the dGPU, a Radeon RX 6800S).

PS C:\Users\mate> amdsmi.exe
SMI-LIB has returned error '-1007' - 'Device Not found' Error code: -1007
PS C:\Users\mate> & 'C:\Program Files\AMD\ROCm\5.5\bin\hipInfo.exe' | sls Name:

Name:                             AMD Radeon RX 6800S
gcnArchName:                      gfx1032
Name:                             AMD Radeon(TM) Graphics
gcnArchName:                      gfx1035

Bottom line, I've not seen this utility work once, but it would be nice to have a rocm-smi alternative on Windows.

Mapping between HIP device ID and rocm_smi

I have a HIP app that uses hipSetDevice and related APIs to do its work. It might be run with ROCR_VISIBLE_DEVICES or HIP_VISIBLE_DEVICES set.

For a given HIP device, I want to query some info using rocm_smi_lib, e.g., rsmi_topo_get_numa_node_number.

What is the recommended way to map between a HIP device and a ROCm SMI device index?

Manually looping over results of rsmi_dev_pci_id_get for all devices and comparing with hipDeviceProp_t::pciBusID and friends seems like a possible solution, but I wonder if there's an easier / official way.
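In case it helps anyone landing here, a hedged sketch of the PCI-ID matching approach: rsmi_dev_pci_id_get is documented to pack domain/bus/device/function into one 64-bit BDF value, so the HIP side's pciDomainID/pciBusID/pciDeviceID can be packed the same way and compared. The bit layout below is my reading of the rocm_smi.h comments; verify it against your header before relying on it (HIP reports no PCI function, which is effectively 0 for a GPU).

```python
# Sketch: pack a PCI domain/bus/device/function into the 64-bit BDF layout
# that rsmi_dev_pci_id_get is documented to use, so HIP device properties
# can be matched against ROCm SMI device indices.
def bdf_to_u64(domain: int, bus: int, device: int, function: int = 0) -> int:
    # Layout per rocm_smi.h comments: [63:32] domain, [15:8] bus,
    # [7:3] device, [2:0] function.
    return ((domain & 0xFFFFFFFF) << 32) | ((bus & 0xFF) << 8) \
           | ((device & 0x1F) << 3) | (function & 0x7)

def smi_index_for_hip_device(hip_prop, smi_bdfs):
    """Map a HIP device (anything exposing pciDomainID/pciBusID/pciDeviceID)
    to an index into smi_bdfs, a list of rsmi_dev_pci_id_get results."""
    want = bdf_to_u64(hip_prop.pciDomainID, hip_prop.pciBusID,
                      hip_prop.pciDeviceID)
    return smi_bdfs.index(want)  # raises ValueError if no SMI device matches
```

Here `smi_bdfs` would be filled by calling rsmi_dev_pci_id_get for each SMI device index (via the Python bindings or ctypes), and `hip_prop` stands in for a hipDeviceProp_t; I am not aware of a more official mapping than this.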

roc_smi example has a bug.

Looking at the example here, the device ID is returned in val_ui16, but val_ui64 is being used as the ID instead. The example probably needs a statement such as val_ui64 = val_ui16 immediately after the rsmi_dev_id_get call.
