ricks-lab / gpu-utils

A set of utilities for monitoring and customizing GPU performance

License: GNU General Public License v3.0

Python 100.00%
gpu-settings gpu-monitoring amdgpu overclock linux python3 boinc setiathome gpu-computing einsteinathome

gpu-utils's Introduction

Ricks-Lab GPU Utilities

rickslab-gpu-utils

A set of utilities for monitoring GPU performance and modifying control settings.

To get the most out of these utilities, you should run a kernel that supports the GPUs you have installed. If you use AMD GPUs, installing the latest amdgpu driver or ROCm package may provide additional capabilities. If you have Nvidia GPUs, nvidia-smi must be installed for the utilities to read the cards. Writing to GPUs is currently possible only for compatible AMD GPUs on systems with an appropriate kernel version and with the AMD ppfeaturemask set to enable this capability, as described here.
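
As a rough illustration of the ppfeaturemask requirement, the sketch below parses a mask value and tests a single feature bit. The sysfs path and the overdrive bit value 0x4000 are assumptions taken from the amdgpu kernel driver, not from this README, and the helper names are hypothetical. It is not necessarily how rickslab-gpu-utils performs its own check.

```python
from pathlib import Path

# Assumed location of the amdgpu module parameter (not stated in this README).
PPFEATUREMASK_PATH = Path('/sys/module/amdgpu/parameters/ppfeaturemask')
# Assumed overdrive bit (PP_OVERDRIVE_MASK in the amdgpu kernel driver).
OVERDRIVE_BIT = 0x4000


def parse_mask(text: str) -> int:
    """Parse a mask reported as hex ('0xfffd7fff') or as plain decimal."""
    return int(text.strip(), 0)


def overdrive_enabled(mask: int) -> bool:
    """Return True if the assumed overdrive bit is set in the mask."""
    return bool(mask & OVERDRIVE_BIT)


if __name__ == '__main__':
    if PPFEATUREMASK_PATH.exists():
        mask = parse_mask(PPFEATUREMASK_PATH.read_text())
        print(f'ppfeaturemask: {mask:#010x}, overdrive: {overdrive_enabled(mask)}')
    else:
        print('amdgpu ppfeaturemask not found')
```

Parsing with base 0 accepts both hex and decimal text, which matters because the reported format appears to differ between kernel versions.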

Installation

If you get a key expired message during apt update, try updating the project PUBLIC.KEY with the following command:

wget -q -O - https://debian.rickslab.com/PUBLIC.KEY | sudo gpg --dearmour -o /usr/share/keyrings/rickslab-agent.gpg

There are 4 methods of installation available, summarized here:

  • Repository - This approach is recommended for those interested in contributing to the project or helping to troubleshoot an issue in realtime with the developer. This type of installation can exist alongside any of the other installation types.

  • PyPI - Meant for users wanting to run the very latest version. All PATCH level versions are released here first. This installation method is also meant for users not on a Debian distribution.

  • Rickslab.com Debian - Lags the PyPI release in order to assure robustness. May not include every PATCH version.

  • Official Debian - Only MAJOR/MINOR releases. This works for releases of Ubuntu 22.04 or Bullseye 11.3 or later.

User Guide

For a detailed introduction, a community sourced User Guide is available. All tools are demonstrated and use cases are presented. Additions to the guide are welcome. Please submit a pull request with your suggested additions!

Commands

A summary of command line tools available in rickslab-gpu-utils follows. Additional details are available in man pages and the User Guide.

gpu-chk

This utility verifies if the user's environment is compatible with rickslab-gpu-utils.

gpu-ls

This utility displays the most relevant parameters for installed and compatible GPUs. The default behavior is to list relevant parameters by GPU. OpenCL platform information is added when the --clinfo option is used. A brief listing of key parameters is available with the --short command line option. A simplified table of current GPU state is displayed with the --table option. The --no_fan option can be used to ignore fan settings. The --pstate option outputs the p-state table for each GPU instead of the list of basic parameters. The --ppm option outputs the table of available power/performance modes instead of basic parameters. The --features option outputs the table of amdgpu pp features and their status instead of basic parameters. The --force_all option results in an attempt to read all possible sensors, regardless of how the GPU is classified. The --raw option reads all possible driver files and displays their contents, along with an indication of whether a gpu-utils keyword and description is associated with each file. The --verbose option displays progress and informational messages generated by the utilities. By default, output data is formatted and color coded; the --no_markup option can be specified to get plain text.

gpu-mon

A utility that gives the current state of all compatible GPUs. The default behavior is to continuously update a text based table in the current window until Ctrl-C is pressed. With the --gui option, a table of relevant parameters is updated in a Gtk window. You can specify the delay between updates with the --sleep N option, where N is an integer greater than zero that specifies the number of seconds to sleep between updates. The --no_fan option can be used to disable the reading and display of fan information. The --log option writes all monitor data to a psv log file. When writing to a log file, the utility indicates this in red at the top of the window with a message that includes the log file name. The --plot option displays a plot of critical GPU parameters which updates at the specified --sleep N interval. If you need both the plot and monitor displays, using the --plot option is preferred over running both tools, since a single read of the GPUs is used to update both displays. The --ltz option results in the use of local time instead of UTC. The --verbose option displays progress and informational messages generated by the utilities.
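
For post-processing a file written with --log, a minimal sketch follows. It assumes that psv means pipe-separated values with a header row. The sample data and its column names are hypothetical, so check the header of a real log before relying on them.

```python
import csv
import io


def read_psv(stream):
    """Return the rows of a pipe-separated log as dicts keyed by the header row."""
    return list(csv.DictReader(stream, delimiter='|'))


# Hypothetical sample; real gpu-mon column names may differ.
SAMPLE = ('Time|Card#|power|temp\n'
          '10:00:00|1|45.0|60.0\n'
          '10:00:03|1|47.0|61.0\n')

rows = read_psv(io.StringIO(SAMPLE))
mean_power = sum(float(row['power']) for row in rows) / len(rows)
print(f'{len(rows)} rows, mean power {mean_power:.1f} W')
```

For a real log, replace the io.StringIO sample with an open file handle, for example open(log_name, newline='').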

gpu-plot

A utility to continuously plot the trend of critical GPU parameters for all compatible GPUs. The --sleep N option can be used to specify the update interval. The gpu-plot utility has 2 modes of operation. The default mode is to read the GPU driver details directly, which is useful as a standalone utility. The --stdin option causes gpu-plot to read GPU data from stdin. This is how gpu-mon produces the plot, and it can also be used to pipe your own data into the process. The --simlog option can be used with the --stdin option when a monitor log file is piped as stdin. This is useful for troubleshooting and can be used to display saved log results. The --ltz option results in the use of local time instead of UTC. If you plan to run both gpu-plot and gpu-mon, the --plot option of the gpu-mon utility should be used instead of running both utilities, in order to reduce data reads by a factor of 2. The --verbose option displays progress and informational messages generated by the utilities.

gpu-pac

Program and Control compatible GPUs with this utility. By default, the commands to be written to a GPU are written to a bash file for the user to inspect and run. If you have confidence, the --execute_pac option can be used to execute and then delete the saved bash file. Since the GPU device files are writable only by root, sudo is used to execute commands in the bash file. As a result, you will be prompted for credentials in the terminal where you executed gpu-pac. The --no_fan option can be used to eliminate fan details from the utility. The --force_write option can be used to force all configuration parameters to be written to the GPU. The default behavior is to only write changes. The --verbose option displays progress and informational messages generated by the utilities.

New in Current Release - v3.9.0

  • Optimized regex compile strategy for improved performance.
  • Fixed issue with handling of alpha strings in kernel version.
  • Fixed matplotlib deprecation issue.
  • Enhanced gpu-ls --about output.
  • Catch and report PermissionError for driver files.
  • Prep for debian repository release.

Development Plans

  • Add status read capabilities for Intel GPUs. Need someone to provide gpu-ls --raw --no_markup and clinfo output for Intel GPU.
  • Add pac capabilities for Nvidia GPUs.

Known Issues

  • Seems like over/under clocking capabilities are disabled for Workstation cards.
  • Reset of Curve Points for Vega20 (Radeon VII) does not work.
  • Some windows do not support scrolling or resize, making it unusable for lower resolution installations.
  • I/O error when selecting CUSTOM ppm. Maybe it requires arguments to specify the custom configuration.
  • Doesn't work well with Fiji ProDuo cards.
  • P-state mask gets intermittently reset for GPU used as display output.
  • Utility gpu-pac doesn't show what the current P-state mask is. Not sure if that can be read back.
  • Utility gpu-pac fan speed setting results in actual fan speeds a bit different from setting and pac interface shows actual values instead of set values.

History

New in Previous Release - v3.8.4

  • Fixed GpuType and GpuVendor dictionary initialization as described in update to issue 139.
  • Fixed skip list for APU which incorrectly included memory parameters.
  • Fixed matplotlib 3.5.* compatibility issues.

New in Previous Release - v3.8.3

  • Implementation of gpu-pac capability for VDDGFX Offset mode type of AMD GPUs. Does not seem to work for negative values.
  • Improvements to code including improved use of Enum objects as dictionary keys.
  • Improved check for Gtk import errors.
  • Fixed bug 147, ignore invalid data read from GPU.

New in Previous Release - v3.8.2

  • Utility gpu-mon will default to text format when Gtk is not available.

New in Previous Release - v3.8.0

  • Prep for next official Debian release.

New in Previous Release - v3.7.8

  • Improved read/write status summary.
  • Made monitor window not resizable.
  • Ignore 'Non-VGA' PCIe entries.
  • Fixed AMD/ATI regex.
  • Changed INTEL color to blue and other to magenta.

New in Previous Release - v3.7.7

  • Add check of rickslab public key in apt-key. Users should follow the new protocol of adding it to a shared keyring as described in the User Guide. Apt-key is being deprecated, and using a shared keyring is better practice.

New in Previous Release - v3.7.6

  • Update installation guide due to deprecation of apt-key.
  • Fixed inconsistency in table/plot item formats.

New in Previous Release - v3.7.5

  • Fixed placement of read P-state data in gpu-ls for complete P-state details in the output.
  • Improved implementation of Vddc Range for CurvePts type AMD GPU.
  • Optimized by GPU type skip lists.
  • Disable clock and voltage range reading/displaying when pp_od_clk_voltage reading is not possible.

New in Previous Release - v3.7.4

  • Documentation updates.
  • Code clean up, simplification, and optimization.
  • Moved high level requirement definitions to init file, modify setup.py and env checks to use these.
  • Fixed hash-bang statements across project to use python as specified in env.

New in Previous Release - v3.7.3

  • Improved Icon file management.
  • Improved compute status logic. Add Unknown status for when clinfo is not available.
  • Do not display invalid energy reading in plot.
  • Resolved linter issues.
  • Code simplification.
  • Better organized credits.

New in Previous Release - v3.7.2

  • Implemented long version of gpu-ls, which will display all information from ppm, pstate, features, and clinfo.
  • Improved gpu-ls argument parsing.
  • Various code optimizations.

New in Previous Release - v3.7.1

  • Fixed an issue created just as I released. Omitted testing on my APU system.

New in Previous Release - v3.7.0

  • Fixed error in calculating power when invalid sensor data is returned.
  • Check for OSError when reading from all sensor files. Disable sensor reading on error.
  • Check for system type. Only systemD is fully supported. Issues in reading sockets in systemV are handled.
  • Added read of Power DPM State for AMD GPUs.
  • On read error, make read for the parameter False instead of indicating card is not readable.
  • Add gpu-ls option --force_all to attempt to read all relevant sensors, regardless of card classification.
  • Improve error message handling. Minor (expected) errors are suppressed unless --verbose is specified. GPU output will indicate all sensors that were disabled due to read errors.
  • Implemented gpu-ls option --raw to give a summary view of the content of all available driver files.
  • Enable gpu-plot and gpu-mon capability to include GPUs with incomplete driver coverage.
  • Allow plain text instead of formatted/color coded output with the --no_markup option.
  • Use pp_dpm_*clk files as a source of P-state information.
  • Separate lists to manage skipped and disabled parameters for easier user interpretation.

New in Previous Release - v3.6.2

  • Minor User Guide updates.
  • Add /usr/share/doc/pci.ids to possible locations of pci decode file.
  • Modify to handle pci addresses that include domain.

New in Previous Release - v3.6.1

  • Update logger to output hex version of amdfeaturemask value.
  • Improve reading/displaying of AMD GPUs when amdfeaturemask is not set to write.

New in Previous Release - v3.6.0

  • Rewrite of the installation guide and simplification of the readme.
  • Roll-up all v3.5.x patches into a new minor revision release.

New in Previous Release - v3.5.10

  • Set Neon as a validated distribution.
  • Check all possible package readers for undefined distribution.

New in Previous Release - v3.5.9

  • Optimize gpu-mon table size.
  • Toggle button color to match enable/disable status of plot line.
  • When install type is repository, force use of repository gpu-plot from gpu-mon.

New in Previous Release - v3.5.8

  • Fixed bug in determining AMD GPU card type. Now it properly identifies APU and Legacy types.

New in Previous Release - v3.5.7

  • More robust determination of install type and display this with --about and in logger.
  • Implementation of scroll within PAC window.
  • Fixed plot crash for invalid ticker increment.
  • Code robustness improvements with more typing for class variables.

New in Previous Release - v3.5.6

  • Fixed issue in reading AMD FeatureMask for Kernel 5.11

New in Previous Release - v3.5.5

  • Include debian release package.
  • Check gtk initialization for errors and handle nicely.
  • Use logger to output plot exceptions.
  • Check number of compatible and readable GPUs at utility start.
  • Minor User Guide and man page improvements.
  • Use minimal python packages in requirements.

New in Previous Release - v3.5.0

  • Utilities now include reading of NV GPUs with full gpu-ls, gpu-mon, and gpu-plot support!
  • Update name from amdgpu-utils to rickslab-gpu-utils.
  • Improved PyPI packaging.
  • Updated User Guide to cover latest features and capabilities.
  • Improved robustness of NV read by validating sensor support for each query item the first time read. This will assure functionality on older model GPUs.
  • Fixed issue in setting display model name for NV GPUs.
  • Improved how lack of voltage readings for NV is handled in the utilities.
  • Fixed an issue in assessing compute capability when GPUs of multiple vendors are installed.

New in Previous Release - v3.3.14

  • Display card path details in logger whenever card path exists.
  • Implemented read capabilities for Nvidia. Now supported by all utilities except pac.
  • Added APU type and tuned parameters read/displayed for AMD APU integrated GPU.
  • Read generic pcie sensors for all types of GPUs.
  • Improved lspci search by using a no-shell call and using compiled regex.
  • Implement PyPI package for easy installation.
  • More robust handling of missing Icon and PCIID files.

New in Previous Release - v3.2.0

  • Fixed CRITICAL issue where Zero fan speed could be written when invalid fan speed was read from the GPU.
  • Fixed issue in reading pciid file in Gentoo (@CH3CN).
  • Modified setup to indicate minimum instead of absolute package versions (@smoe).
  • Modified requirements to include min/max package versions for major packages.
  • Fixed crash for missing pci-ids file and add location for Arch Linux (@berturion).
  • Fixed a crash in amdgpu-pac when no fan details could be read (laptop GPU).
  • Fixed deprecation warnings for several property setting functions. Consolidated all property setting to a single function in a new module, and ignore warnings for those that are deprecated. All deprecated actions are marked with FIXME in GPUgui.py.
  • Replaced deprecated set properties statement for colors with css formatting.
  • Implemented a more robust string format of datetime to address datetime conversion for pandas in some installations.
  • Implemented debug logging across the project. Activated with --debug option and output saved to a .log file.
  • Updated color scheme of Gtk applications to work in Ubuntu 20.04. Unified color scheme across all utilities.
  • Additional memory parameters added to utilities.
  • Read ID information for all GPUs and attempt to decode GPU name. For cards with no card path entry, determine system device path and use for reading ID. Report system device path in amdgpu-ls. Add amdgpu-ls --short report to give brief description of all installed GPUs.

New in Previous Release - v3.0.0

  • Style and code robustness improvements
  • Deprecated amdgpu-pciid and removed all related code.
  • Complete rewrite based on benchMT learning. Simplified code with ObjDict for GpuItem parameters and use of class variables for generic behavior parameters.
  • Use lspci as the starting point for developing GPU list and classify by vendor, readability, writability, and compute capability. Build in potential to be generic GPU util, instead of AMD focused.
  • Test for readability and writability of all GPUs and apply utilities as appropriate.
  • Add assessment of compute capability.
  • Eliminated the use of lshw to determine driver compatibility and display of driver details is now informational with no impact on the utilities.
  • Add p-state masking capability for Type 2 GPUs.
  • Optimized pac writing to GPUs.

New in Previous Release - v2.7.0

  • Initial release of man pages
  • Modifications to work with distribution installation
  • Use system pci.ids file and make amdgpu-pciid obsolete
  • Update setup.py file for successful installation.

New in Previous Release - v2.6.0

  • PEP8 style modifications
  • Fixed a bug in monitor display.
  • Implement requirements file for with and without a venv.
  • Found and fixed a few minor bugs.
  • Fixed issue with amdgpu-plot becoming corrupt over time.
  • Implemented clean shutdown of monitor and better buffering to plot. This could have caused problems in systems with many GPUs.

New in Previous Release - v2.5.2

  • Some preparation work for Debian package (@smoe).
  • Added --ltz option to use local times instead of UTC for logging and plot data.
  • Added 0xfffd7fff to valid amdgpu.ppfeaturemask values (@pastaq).
  • Updates to User Guide to include instructions to apply PAC conditions on startup (@csecht).

New in Previous Release - v2.5.1

  • Fixed a compatibility issue with matplotlib 3.x. Converted time string to a datetime object.
  • Display version information for pandas, matplotlib, and numpy with the --about option for amdgpu-plot

New in Previous Release - v2.5.0

  • Implemented the --plot option for amdgpu-monitor. This will display plots of critical GPU parameters that update at an interval defined by the --sleep N option.
  • Errors in reading non-critical parameters will now show a warning the first time and are disabled for future reads.
  • Fixed a bug in implementation of compatibility checks and improved usage of try/except.

New in Previous Release - v2.4.0

  • Implemented amdgpu-pac feature for type 2 Freq/Voltage controlled GPUs, which includes the Radeon VII.
  • Implemented the amdgpu-pac --force_write option, which writes all configuration parameters to the GPU, even if unchanged. The default behavior is changed to now only write changed configuration parameters.
  • Indicate number of changes to be written by PAC, and if no changes, don't execute bash file. Display execute complete message in terminal, and update messages in PAC message box.
  • Implemented a new GPU type 0, which represents some older cards whose p-states cannot be changed.
  • Tuned amdgpu-pac window format.

New in Previous Release - v2.3.1

  • Fixed and improved Python/Kernel compatibility checks.
  • Added Python2 compatible utility to check amdgpu-utils compatibility.
  • Fixed confusing mode/level fileptr names.
  • Removed CUSTOM PPM mode until I figure out syntax.
  • Implemented classification of card type based on how it implements frequency/voltage control. This is reported by amdgpu-ls and alters the behavior of both amdgpu-pac and amdgpu-monitor.
  • Changed dpkg error to a warning to handle custom driver installs.
  • Initial User Guide - Need contributors!

New in Previous Release - v2.3.0

  • Implemented a message box in amdgpu-pac to indicate details of PAC execution and indicate if sudo is pending credential entry.
  • Implement more robust classification of card compatibility and only use compatible GPUs in the utilities.
  • Official release of amdgpu-pciid which updates a local list of GPU names from the official pci.ids website.
  • Optimized refresh of data by moving static items to a different function and only read those that are dynamic.
  • Power Cap and Fan parameters can be reset by setting to -1 in the amdgpu-pac interface.
  • Initial basic functionality for Radeon VII GPU!

New in Previous Release - v2.2.0

  • Major bug fix in the way HWMON directory was determined. This fixes an issue in not seeing sensor files correctly when some other card is resident in a PCIe slot.
  • Implemented logging option --log for amdgpu-monitor. A red indicator will indicate active logging and the target filename.
  • Implemented energy meter in amdgpu-monitor.
  • Implemented the ability to check the GPU extracted ID in a pci.ids file for correct model name. Implemented a function to extract only AMD information for the pci.ids file and store in the file amd_pci_id.txt which is included in this distribution.
  • Optimized long, short, and decoded GPU model names.
  • Alpha release of a utility to update device decode data from the pci.ids website.

New in Previous Release - v2.1.0

  • Significant bug fixes and error proofing. Added messages to stderr for missing driver related files.
  • Added fan monitor and control features.
  • Implemented --no_fan option across all tools. This eliminates the reading and display of fan parameters and is useful for those who have installed GPU waterblocks.
  • Implemented P-state masking, which limits available P-states to those specified. Useful for power management.
  • Fixed implementation of global variables that broke with implementation of modules in library.
  • Added more validation checks before writing parameters to cards.

New in Previous Release - v2.0.0

  • Many bug fixes!
  • First release of amdgpu-pac.
  • Add check of amdgpu driver in the check of environment for all utilities. Add display of amdgpu driver version.
  • Split list functions of the original amdgpu-monitor into amdgpu-ls.
  • Added --clinfo option to amdgpu-ls which will list openCL platform details for each GPU.
  • Added --ppm option to amdgpu-ls which will display the table of available power/performance modes available for each GPU.
  • Error messages are now output to stderr instead of stdout.
  • Added power cap and power/performance mode to the monitor utilities. I have also included them in the amdgpu-ls display in addition to the power cap limits.

New in Previous Release - v1.1.0

  • Added --pstates feature to display table of p-states instead of GPU details.
  • Added more error checking and exit if no compatible AMD GPUs are found.

New in Previous Release - v1.0.0

  • Completed implementation of the GPU Monitor tool.

gpu-utils's People

Contributors

csecht, flying-x, natalyalangford, pastaq, ricks-lab, smoe, tripledes

gpu-utils's Issues

PCIID encoding issue when running amdgpu-ls

Thank you for creating this project. I am using Gentoo instead of Ubuntu and I was running into an issue when using amdgpu-ls.

(amdgpu-utils-env) gentoo /home/ch3cn/amdgpu-utils-master # ./amdgpu-ls
Package addon [clinfo] executable not found. Use sudo apt-get install clinfo to install
OS command [dpkg] executable not found.
OS Command [clinfo] not found. Use sudo apt-get install clinfo to install
Detected GPUs: INTEL: 1, AMD: 1
Command None not found. Can not determine amdgpu version.
AMD: Wattman features enabled: 0xfffd7fff
Traceback (most recent call last):
  File "./amdgpu-ls", line 141, in <module>
    main()
  File "./amdgpu-ls", line 111, in main
    gpu_list.read_gpu_sensor_data(data_type='All')
  File "/home/ch3cn/amdgpu-utils-master/GPUmodules/GPUmodule.py", line 1512, in read_gpu_sensor_data
    v.read_gpu_sensor_data(data_type)
  File "/home/ch3cn/amdgpu-utils-master/GPUmodules/GPUmodule.py", line 940, in read_gpu_sensor_data
    self.set_params_value(param, rdata)
  File "/home/ch3cn/amdgpu-utils-master/GPUmodules/GPUmodule.py", line 334, in set_params_value
    self.prm.model_device_decode = self.read_pciid_model()
  File "/home/ch3cn/amdgpu-utils-master/GPUmodules/GPUmodule.py", line 353, in read_pciid_model
    for line_item in pci_id_file_ptr:
  File "/usr/lib/python-exec/python3.6/../../../lib64/python3.6/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 4514: ordinal not in range(128)

I checked the encoding of the pci.ids file:

(amdgpu-utils-env) gentoo /home/ch3cn/amdgpu-utils-master # file -i /usr/share/misc/pci.ids
/usr/share/misc/pci.ids: text/plain; charset=utf-8

I updated line 350 in GPUmodule.py to include the utf8 encoding:

with open(env.GUT_CONST.sys_pciid, 'r', encoding='utf8') as pci_id_file_ptr:

Now amdgpu-ls works:

(amdgpu-utils-env) gentoo /home/ch3cn/amdgpu-utils-master # ./amdgpu-ls
Package addon [clinfo] executable not found. Use sudo apt-get install clinfo to install
OS command [dpkg] executable not found.
OS Command [clinfo] not found. Use sudo apt-get install clinfo to install
Detected GPUs: INTEL: 1, AMD: 1
Command None not found. Can not determine amdgpu version.
AMD: Wattman features enabled: 0xfffd7fff
2 total GPUs, 1 rw, 0 r-only, 0 w-only

Card Number: 0
Vendor: INTEL
Readable: False
Writable: False
Compute: True
Card Model: Intel Corporation 2nd Generation Core Processor Family Integrated Graphics Controller (rev 09)
PCIe ID: 00:02.0
Driver: i915
Card Path: /sys/class/drm/card0/device

Card Number: 1
Vendor: AMD
Readable: True
Writable: True
Compute: True
GPU UID:
Device ID: {'vendor': '0x1002', 'device': '0x67ef', 'subsystem_vendor': '0x1458', 'subsystem_device': '0x22de'}
Decoded Device ID: Baffin [Radeon RX 460/560D / Pro 450/455/460/555/555X/560/560X]
Card Model: Advanced Micro Devices, Inc. [AMD/ATI] Baffin [Radeon RX 460/560D / Pro 450/455/460/555/555X/560/560X] (rev cf)
Display Card Model: Baffin [Radeon RX 460/560D / Pro 450/455/460/555/555X/560/560X]
PCIe ID: 01:00.0
Link Speed: 2.5 GT/s
Link Width: 8
##################################################
Driver: amdgpu
vBIOS Version: xxx-xxx-xxx
Compute Platform: None
GPU Frequency/Voltage Control Type: 1
HWmon: /sys/class/drm/card1/device/hwmon/hwmon2
Card Path: /sys/class/drm/card1/device
##################################################
Current Power (W): 7.2
Power Cap (W): 48.0
Power Cap Range (W): [0, 72]
Fan Enable: 0
Fan PWM Mode: [2, 'Dynamic']
Fan Target Speed (rpm): 933
Current Fan Speed (rpm): 933
Current Fan PWM (%): 31
Fan Speed Range (rpm): [0, 4600]
Fan PWM Range (%): [0, 100]
##################################################
Current GPU Loading (%): 0
Current Memory Loading (%): 3
Current Temps (C): {'edge': 22.0}
Critical Temp (C): 94.0
Current Voltages (V): {'vddgfx': 800}
Vddc Range: ['800mV', '1150mV']
Current Clk Frequencies (MHz): {'sclk': 214.0, 'mclk': 300.0}
Current SCLK P-State: [0, '214Mhz']
SCLK Range: ['214MHz', '1800MHz']
Current MCLK P-State: [0, '300Mhz']
MCLK Range: ['300MHz', '2000MHz']
Power Profile Mode: 1-3D_FULL_SCREEN
Power DPM Force Performance Level: auto

I'm not sure why the encoding has to be explicitly stated and I'm not knowledgeable at all in Python to find out.
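
A likely explanation, offered as a sketch rather than a confirmed diagnosis: when open() is called without an encoding argument, Python falls back to the locale's preferred encoding, and under a C/POSIX locale on Python 3.6 that is ASCII. The byte 0xc2 is the first byte of a two-byte UTF-8 sequence, which an ASCII decoder rejects. The temporary file and the read_lines helper below are hypothetical stand-ins for pci.ids and the project's reader.

```python
import locale
import os
import tempfile

# The encoding open() falls back to when none is given.
print('locale default:', locale.getpreferredencoding(False))

# Write a small UTF-8 file with one non-ASCII character (U+00AE encodes as 0xc2 0xae).
with tempfile.NamedTemporaryFile('wb', suffix='.ids', delete=False) as tmp:
    tmp.write('1002  Advanced Micro Devices\u00ae\n'.encode('utf-8'))
    path = tmp.name


def read_lines(file_name, encoding):
    """Read all lines of file_name using an explicit text encoding."""
    with open(file_name, 'r', encoding=encoding) as file_ptr:
        return file_ptr.readlines()


try:
    read_lines(path, 'ascii')
    ascii_ok = True
except UnicodeDecodeError as err:
    ascii_ok = False
    print('ascii failed:', err.reason)

utf8_lines = read_lines(path, 'utf8')
# ascii() keeps the demo printable even on an ASCII-only terminal.
print('utf8 ok:', ascii(utf8_lines[0].strip()))
os.remove(path)
```

Stating the encoding explicitly, as in the fix above, makes the read independent of the locale the script happens to run under.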

Ryzen 5 3400G

Should this tool also work with Ryzen 5 3400G (Picasso)?
Graphics: Device-1: Advanced Micro Devices [AMD/ATI] Picasso driver: amdgpu v: kernel
Display: server: X.Org 1.20.8 driver: amdgpu resolution: 3840x2160~60Hz
It just sees nothing :-(
amdgpu-ls
AMD Wattman features enabled: 0xfffd7fff
amdgpu version: UNKNOWN
No AMD GPUs detected, exiting...

apt show ricks-amdgpu-utils
Package: ricks-amdgpu-utils
Version: 2.6.0-1

Arch Linux Support

Hello, I followed the user guide and I am stuck at the first amdgpu-ls step.
The script tries to get a file that does not exist:

OS command [dpkg] executable not found.
Detected GPUs: AMD: 2
Command None not found. Can not determine amdgpu version.
AMD: Wattman features enabled: 0xfffd7fff
Error: Can not access system pci.ids file [/usr/share/misc/pci.ids]
Traceback (most recent call last):
  File "./amdgpu-ls", line 141, in <module>
    main()
  File "./amdgpu-ls", line 111, in main
    gpu_list.read_gpu_sensor_data(data_type='All')
  File "/home/bertrand/bin/amdgpu-utils/GPUmodules/GPUmodule.py", line 1512, in read_gpu_sensor_data
    v.read_gpu_sensor_data(data_type)
  File "/home/bertrand/bin/amdgpu-utils/GPUmodules/GPUmodule.py", line 940, in read_gpu_sensor_data
    self.set_params_value(param, rdata)
  File "/home/bertrand/bin/amdgpu-utils/GPUmodules/GPUmodule.py", line 336, in set_params_value
    len(self.prm.model_device_decode) < 1.2*len(self.prm.model_short)):
TypeError: object of type 'NoneType' has no len()

Also I had errors saying that dpkg is not available on amdgpu-chk command:

Using python 3.8.2
           Python version OK. 
Using Linux Kernel 5.6.3-arch1-1
           OS kernel OK. 
Command dpkg not found. Can not determine amdgpu version.
           gpu-utils can still be used. 
python3 venv is installed
           python3-venv OK. 
amdgpu-utils-env available
           amdgpu-utils-env OK. 
In amdgpu-utils-env
           amdgpu-utils-env is activated.

I think that these tools are not written for Arch Linux, but only for deb-like distros.
It would be great to support Arch Linux also.
Thank you.
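
One way the crash above could be avoided is to search several candidate locations for pci.ids and to fall back gracefully when none is found. This is only a sketch: the function names are hypothetical, the first two paths appear elsewhere in this document, and the third path is an assumption about where some distributions keep the file.

```python
import os
from typing import Iterable, Optional

# First two paths are mentioned in this document; the last one is an assumption.
PCIID_CANDIDATES = ('/usr/share/misc/pci.ids',
                    '/usr/share/doc/pci.ids',
                    '/usr/share/hwdata/pci.ids')


def find_pciid_file(candidates: Iterable[str] = PCIID_CANDIDATES) -> Optional[str]:
    """Return the first candidate that is a readable file, or None."""
    for path in candidates:
        if os.path.isfile(path) and os.access(path, os.R_OK):
            return path
    return None


def choose_model_name(decoded: Optional[str], short_name: str) -> str:
    """Use the decoded name when available, otherwise keep the short name."""
    return decoded if decoded else short_name


if __name__ == '__main__':
    print('pci.ids file:', find_pciid_file())
    print('model:', choose_model_name(None, 'Baffin'))
```

Treating a missing decode as None and checking for it before calling len() avoids the TypeError shown in the traceback.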

gpu-plot not plotting

I just tried running gpu-plot for the first time since installing Ubuntu 20.04 and discovered that it is not plotting data.
Attached is the debug file and the terminal stdout.
The same errors are produced whether using venv or not.
Here is my environment check:

$ ./gpu-chk
Using python 3.8.2
           Python version OK. 
Using Linux Kernel 5.4.0-42-generic
           OS kernel OK. 
Using Linux distribution: Ubuntu 20.04.1 LTS
           Distro has been Validated. 
AMD: amdgpu version: 20.10-1048554
           AMD driver OK. 
python3 venv is installed
           python3-venv OK. 
rickslab-gpu-utils-env available
           rickslab-gpu-utils-env OK. 
In rickslab-gpu-utils-env
           rickslab-gpu-utils-env is activated. 

Probably unrelated, but this from UserGuide didn't work, saying "Command 'python3.6' not found," :
python3.6 -m venv rickslab-gpu-utils-env
...but this did work:
python3.8 -m venv rickslab-gpu-utils-env
debug_gpu-utils_20200904-104428.log
gpu-plot_error.log

suggestion: provide option to report averages

When optimizing certain GPU run parameters, it would be handy to have an option for amdgpu-monitor or amdgpu-plot, or both, to report averages of GPU performance variables like load, power, and clock speed. I suppose that reporting a cumulative moving average using a user-specified time interval would be most useful. For example, assuming default reporting every 3 sec, provide an option to monitor or graph moving averages of those values for 1, 3, or 10 minute windows.
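
To make the suggestion concrete, here is a minimal sketch of a simple moving average over a fixed time window, assuming one reading every 3 seconds. The MovingAverage class is hypothetical and not part of the utilities.

```python
from collections import deque


class MovingAverage:
    """Simple moving average over the most recent window of samples."""

    def __init__(self, window_sec: float, sample_sec: float = 3.0):
        # Number of samples that fit in the window, with at least one.
        self.samples = deque(maxlen=max(1, round(window_sec / sample_sec)))

    def update(self, value: float) -> float:
        """Add a reading and return the average of the current window."""
        self.samples.append(value)
        return sum(self.samples) / len(self.samples)


# A 9 second window at 3 second sampling holds the last 3 readings.
avg = MovingAverage(window_sec=9, sample_sec=3)
results = [avg.update(load) for load in (10, 20, 30, 40)]
print(results)
```

A 1, 3, or 10 minute window would use window_sec values of 60, 180, or 600 with the same sampling interval.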

can't run modules

On Lubuntu 18.04 with the AMDGPU 18.5 All-Open (Mesa) drivers package, I can't run any of the amdgpu-utils modules. They all fail to launch, like this:

~/Desktop/amdgpu-utils-master$ ./amdgpu-ls
AMD Wattman features enabled: 0xffff7fff
Traceback (most recent call last):
File "./amdgpu-ls", line 136, in
main()
File "./amdgpu-ls", line 94, in main
gut_const.get_amd_driver_version()
File "/home/craig/Desktop/amdgpu-utils-master/GPUmodules/GPUmodules.py", line 101, in get_amd_driver_version
stderr=subprocess.DEVNULL).decode().split("\n")
File "/usr/lib/python3.6/subprocess.py", line 336, in check_output
**kwargs).stdout
File "/usr/lib/python3.6/subprocess.py", line 418, in run
output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['dpkg', '-l', 'amdgpu-pro']' returned non-zero exit status 1.

Do you have any suggestions for what I might be doing wrong?
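For anyone hitting the same traceback: the exception is raised because `dpkg -l amdgpu-pro` exits non-zero when that package is not installed. Below is a rough sketch of how such a query could be made tolerant of that case. The helper name `query_package` is hypothetical and this is not the project's actual fix.

```python
import subprocess
from typing import List, Optional


def query_package(cmd: List[str]) -> Optional[str]:
    """Return the command's stdout, or None if it is missing or fails."""
    try:
        return subprocess.check_output(cmd, stderr=subprocess.DEVNULL).decode()
    except FileNotFoundError:
        # The tool itself is not installed (e.g. no dpkg on this distribution).
        return None
    except subprocess.CalledProcessError:
        # The tool ran but exited non-zero (e.g. the package is not installed).
        return None


listing = query_package(["dpkg", "-l", "amdgpu-pro"])
if listing is None:
    print("amdgpu-pro package not found; continuing without a driver version")
```

Returning `None` lets the caller decide whether a missing driver package is fatal or only worth a warning.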

Review of man pages

I have been preparing a set of man pages for use with a potential Debian install package. I have posted a nearly complete version of them here. Please review and let me know if you see any issues.

The format is groff and they can be viewed with

man -l filename

Recent Updates Broke amdgpu-plot

I suspect it was the latest matplotlib update, but I am now seeing a similar issue to this one after upgrading to 18.04.2: 12892

I have made this change (not committed), but still have an issue:

            plt.figure(v["figure_num"])
            v["ax1"].clear()
            v["ax1"].set_ylabel('Loading/Power/Temp', color='k', fontsize=10)
            for plot_item in ['loading', 'power_cap', 'power', 'temp']:
                if gc.plot_items[plot_item] == True:
                    print([xi[0] for xi in (ldf[ldf["Card#"].isin([k])].loc[:,['Time']]).values.tolist()])
                    print([yi[0] for yi in (ldf[ldf["Card#"].isin([k])].loc[:,[plot_item]]).values.tolist()])
                    print("color: ", gc.colors[plot_item])
                    v["ax1"].plot(kind='line',
                            x=[xi[0] for xi in (ldf[ldf["Card#"].isin([k])].loc[:,['Time']]).values.tolist()],
                            y=[yi[0] for yi in (ldf[ldf["Card#"].isin([k])].loc[:,[plot_item]]).values.tolist()],
                            color=gc.colors[plot_item], linewidth=0.5)
                    v["ax1"].text(x=ldf[ldf["Card#"].isin([k])].loc[:,['Time']].iloc[-1],
                            y=ldf[ldf["Card#"].isin([k])].loc[:,[plot_item]].iloc[-1],
                            s=str(int(ldf[ldf["Card#"].isin([k])].loc[:,[plot_item]].iloc[-1].values[0])),
                            bbox = dict(boxstyle="round,pad=0.2", facecolor=gc.colors[plot_item]), fontsize=6)

            v["ax2"].clear()
            v["ax2"].set_ylabel('MHz/mV', color='k', fontsize=10)
            #ylim_val = 1.1* ldf[ldf["Card#"].isin([k])].loc[:,['vddgfx', 'sclk_f', 'mclk_f']].max().max()
            for plot_item in ['vddgfx', 'sclk_f', 'mclk_f']:
                if gc.plot_items[plot_item] == True:
                    v["ax2"].plot(kind='line',
                            x=[xi[0] for xi in (ldf[ldf["Card#"].isin([k])].loc[:,['Time']]).values.tolist()],
                            y=[yi[0] for yi in (ldf[ldf["Card#"].isin([k])].loc[:,[plot_item]]).values.tolist()],
                            color=gc.colors[plot_item], linewidth=0.5)
                    v["ax2"].text(x=ldf[ldf["Card#"].isin([k])].loc[:,['Time']].iloc[-1],
                            y=ldf[ldf["Card#"].isin([k])].loc[:,[plot_item]].iloc[-1],
                            s=str(int(ldf[ldf["Card#"].isin([k])].loc[:,[plot_item]].iloc[-1].values[0])),
                            bbox = dict(boxstyle="round,pad=0.2", facecolor=gc.colors[plot_item]), fontsize=6)

Original code still on master amdgpu-plot. Looking for help:

  • Can someone confirm that this is working before upgrade to 18.04.2 and if so, what version of matplotlib are you using?
  • Is my method of specifying plot data from a dataframe not optimal?

<amdgpu-pac --execute_pac> error when setting fan speeds

I tried the auto pac execution option to set fan speeds of both my cards at once, using Save All in the pac Gtk window. Below is the terminal output. (Card1 is the RX 460, Card0 is RX 570.) This may be the same problem as the issue I just reported where manual execution of pac shell scripts does not change fan speeds, but here provides more information about the nature of the errors (?)

$ ./amdgpu-pac --execute_pac
AMD Wattman features enabled: 0xffff7fff
amdgpu version: 18.50-708488
2 AMD GPUs detected
2 are compatible

WARNING: Under Development
WARNING: Works but not fully tested.
Please report any bugs found. Thanks!

Batch file completed: /home/craig/Desktop/amdgpu-utils-2.1.0-Features/pac_writer_8ef342bdb0fa4cca9bfe1de361976364.sh
Writing changes to GPU /sys/class/drm/card1/device/

  • sudo sh -c echo 1' > /sys/class/drm/card1/device/hwmon/hwmon1/pwm1_enable
    [sudo] password for craig:
    sh: 1: Syntax error: Unterminated quoted string
  • sudo sh -c echo '63' > /sys/class/drm/card1/device/hwmon/hwmon1/pwm
    sh: 1: cannot create /sys/class/drm/card1/device/hwmon/hwmon1/pwm: Permission denied
  • sudo sh -c echo 's 0 214 800' > /sys/class/drm/card1/device/pp_od_clk_voltage
  • sudo sh -c echo 's 1 481 821' > /sys/class/drm/card1/device/pp_od_clk_voltage
  • sudo sh -c echo 's 2 760 825' > /sys/class/drm/card1/device/pp_od_clk_voltage
  • sudo sh -c echo 's 3 1020 925' > /sys/class/drm/card1/device/pp_od_clk_voltage
  • sudo sh -c echo 's 4 1102 1012' > /sys/class/drm/card1/device/pp_od_clk_voltage
  • sudo sh -c echo 's 5 1138 1056' > /sys/class/drm/card1/device/pp_od_clk_voltage
  • sudo sh -c echo 's 6 1172 1100' > /sys/class/drm/card1/device/pp_od_clk_voltage
  • sudo sh -c echo 's 7 1200 1143' > /sys/class/drm/card1/device/pp_od_clk_voltage
  • sudo sh -c echo 'm 0 300 800' > /sys/class/drm/card1/device/pp_od_clk_voltage
  • sudo sh -c echo 'm 1 1750 850' > /sys/class/drm/card1/device/pp_od_clk_voltage
  • sudo sh -c echo 'auto' > /sys/class/drm/card1/device/power_dpm_force_performance_level
  • sudo sh -c echo 'c' > /sys/class/drm/card1/device/pp_od_clk_voltage

Batch file completed: /home/craig/Desktop/amdgpu-utils-2.1.0-Features/pac_writer_60c71cbf8a0444ce95eead142863da98.sh
Writing changes to GPU /sys/class/drm/card0/device/

  • sudo sh -c echo 1' > /sys/class/drm/card0/device/hwmon/hwmon0/pwm1_enable
    sh: 1: Syntax error: Unterminated quoted string
  • sudo sh -c echo '153' > /sys/class/drm/card0/device/hwmon/hwmon0/pwm
    sh: 1: cannot create /sys/class/drm/card0/device/hwmon/hwmon0/pwm: Permission denied
  • sudo sh -c echo 's 0 300 750' > /sys/class/drm/card0/device/pp_od_clk_voltage
  • sudo sh -c echo 's 1 588 765' > /sys/class/drm/card0/device/pp_od_clk_voltage
  • sudo sh -c echo 's 2 952 918' > /sys/class/drm/card0/device/pp_od_clk_voltage
  • sudo sh -c echo 's 3 1076 1025' > /sys/class/drm/card0/device/pp_od_clk_voltage
  • sudo sh -c echo 's 4 1143 1087' > /sys/class/drm/card0/device/pp_od_clk_voltage
  • sudo sh -c echo 's 5 1208 1150' > /sys/class/drm/card0/device/pp_od_clk_voltage
  • sudo sh -c echo 's 6 1250 1150' > /sys/class/drm/card0/device/pp_od_clk_voltage
  • sudo sh -c echo 's 7 1286 1150' > /sys/class/drm/card0/device/pp_od_clk_voltage
  • sudo sh -c echo 'm 0 300 750' > /sys/class/drm/card0/device/pp_od_clk_voltage
  • sudo sh -c echo 'm 1 1000 800' > /sys/class/drm/card0/device/pp_od_clk_voltage
  • sudo sh -c echo 'm 2 1750 900' > /sys/class/drm/card0/device/pp_od_clk_voltage
  • sudo sh -c echo 'auto' > /sys/class/drm/card0/device/power_dpm_force_performance_level
  • sudo sh -c echo 'c' > /sys/class/drm/card0/device/pp_od_clk_voltage

^C
Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/gi/overrides/Gtk.py", line 1588, in main_quit
@OverRide(Gtk.main_quit)
KeyboardInterrupt

At this point the terminal was non-responsive, so I closed it with ctrl-shift-w. Fan speeds remained unchanged.
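The "Unterminated quoted string" lines above come from an unbalanced quote in the generated command. One way to avoid that class of error is to let `shlex.quote` do the quoting and to pass the whole `echo ... > file` pipeline to `sh -c` as a single argument. The sketch below assumes that approach; the function name `sysfs_write_command` is made up, and it only builds the command without running it.

```python
import shlex
from typing import List


def sysfs_write_command(value: str, path: str) -> List[str]:
    """Build an argv list that writes `value` to `path` from a root shell."""
    # Quote both pieces, then pass the whole pipeline as ONE argument to sh -c
    # so the redirection is performed by the privileged shell.
    script = f"echo {shlex.quote(value)} > {shlex.quote(path)}"
    return ["sudo", "sh", "-c", script]


cmd = sysfs_write_command("1", "/sys/class/drm/card1/device/hwmon/hwmon1/pwm1_enable")
# Print a copy-pasteable form of the command for logging.
print(" ".join(shlex.quote(part) for part in cmd))
```

Passing the list to `subprocess.run` (without `shell=True`) would then execute it without a second round of shell parsing.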

amdgpu-plot erasing plot

I haven't used amdgpu-plot in a while, but just noticed today that, after a few minutes running, it begins to erase the oldest readings. With every data tick added, displayed data points are removed from the left side.
Screenshot from 2019-08-13 18-14-31

P-state masking broken in amdgpu-pac

I can no longer apply a P-state mask to my RX 460s or 570s. When I try with amdgpu-pac --execute it gives a sh: echo: I/O error and reverts the GPUs to the default non-masked p-state. Here, I tried to apply a 0,3 mask to two GPUs:

~/Desktop/amdgpu-utils-2.5.1$ ./amdgpu-pac --execute
AMD Wattman features enabled: 0xffff7fff
amdgpu version: 19.30-838629
2 AMD GPUs detected, 2 may be compatible, checking...
2 are confirmed compatible.

Gtk-Message: 19:45:14.941: Failed to load module "canberra-gtk-module"
Batch file completed: /home/craig/Desktop/amdgpu-utils-2.5.1/pac_writer_1337992511324f5eb36a334ca77ccb6e.sh
Writing 1 changes to GPU /sys/class/drm/card1/device/
+ sudo sh -c echo '0 3' >  /sys/class/drm/card1/device/pp_dpm_sclk
[sudo] password for craig: 
sh: echo: I/O error
PAC execution complete.
Batch file completed: /home/craig/Desktop/amdgpu-utils-2.5.1/pac_writer_e589cd63d8844e93bf4a580400652328.sh
Writing 1 changes to GPU /sys/class/drm/card0/device/
+ sudo sh -c echo '0 3' >  /sys/class/drm/card0/device/pp_dpm_sclk
sh: echo: I/O error
PAC execution complete.

I can change other PAC parameters, like performance levels or fan speeds. The problem persists in the latest master and v2.5.1, so it may be something with my system? The masking worked three days ago, however, when I restarted my system and the auto-start PAC sh script ran as normal. My current system information is:
Kernel: Linux 4.15.0-55-generic (x86_64)
Version: #60-Ubuntu SMP Tue Jul 2 18:22:20 UTC 2019
C Library: GNU C Library / (Ubuntu GLIBC 2.27-3ubuntu1) 2.27
Distribution: Ubuntu 18.04.2 LTS
Any ideas?

Fails on Radeon R5

#lspci | grep VGA
00:01.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Wani [Radeon R5/R6/R7 Graphics] (rev 84)

#clinfo |grep 'loader Profile'
Device open failed, aborting...
Device open failed, aborting...
Device open failed, aborting...
Device open failed, aborting...
Device open failed, aborting...
ICD loader Profile OpenCL 2.2

and yet

#./amdgpu-ls --clinfo
Cannot read ppfeaturemask. Exiting...

Auto-run pac .sh scripts on startup

I got my optimized settings in amdgpu-pac .sh scripts to auto-run on startup by using crontab with @reboot. I edited the scripts (created with amdgpu-pac --force) to create a log file on my desktop so I'll know if my system has rebooted, like when there is a power glitch when I'm away. To have the pac scripts for both GPUs execute before the boinc-client auto-starts and puts a load on the GPUs, I included a <start_delay>30</start_delay> option in cc_config.xml. The 30 sec delay is more than ample time on my system.
I wanted the last script (of the two) that's executed at boot to launch amdgpu-monitor, but it didn't work. I put this on the last line of the .sh pac file:
/usr/bin/python3.6 /home/craig/Desktop/amdgpu-utils-2.5.1/amdgpu-monitor
This command line works from the Terminal to launch the monitor, but not in the .sh script that crontab runs. Any ideas? Do I need a 'sudo' in there somewhere?
Is there a better way to do this?

PPM table change in amdgpu 19.3 drivers

I just upgraded my amdgpu driver package from 18.5 to 19.3 and notice a change in the ppm tables. They have added a BOOTUP_DEFAULT option at state 0, which adds 1 to the indices of the previous modes. Anyone that has a pac_writer script set to run at startup may need to update the script to the correct performance mode. My new ppm table looks like this:

Card: /sys/class/drm/card1/device/
Power Performance Mode: manual
  0:  BOOTUP_DEFAULT                 -                 -                 -                 -                 -                 -
  1:  3D_FULL_SCREEN                 0               100                30                 0               100                10
  2:    POWER_SAVING                10                 0                30                 -                 -                 -
  3:           VIDEO                 -                 -                 -                10                16                31
  4:              VR                 0                11                50                 0               100                10
  5:         COMPUTE                 0                 5                30                 0               100                10
  6:          CUSTOM                 -                 -                 -                 -                 -                 -
 -1:            AUTO              Auto

Another difference I noticed with the new driver package is that my card indices have changed. Card 0 and Card 1 are now Card 1 and Card 2.
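Since the indices shifted between driver versions, a startup script could look a mode up by name instead of hard-coding its number. Here is a small sketch that parses a table in the layout printed above; the shortened `PPM_TABLE` text and the function name `ppm_modes` are illustrative assumptions, and the raw sysfs file may be laid out differently.

```python
import re
from typing import Dict

# Table text in the same "index: NAME values..." layout as the listing above.
PPM_TABLE = """\
  0:  BOOTUP_DEFAULT   -    -   -   -    -   -
  1:  3D_FULL_SCREEN   0  100  30   0  100  10
  2:    POWER_SAVING  10    0  30   -    -   -
  3:           VIDEO   -    -   -  10   16  31
  4:              VR   0   11  50   0  100  10
  5:         COMPUTE   0    5  30   0  100  10
  6:          CUSTOM   -    -   -   -    -   -
 -1:            AUTO  Auto
"""


def ppm_modes(table: str) -> Dict[str, int]:
    """Map each mode name to its index, e.g. {'COMPUTE': 5, ...}."""
    modes = {}
    for line in table.splitlines():
        # An optional minus sign, digits, a colon, then the mode name.
        match = re.match(r"\s*(-?\d+):\s+(\w+)", line)
        if match:
            modes[match.group(2)] = int(match.group(1))
    return modes


modes = ppm_modes(PPM_TABLE)
print(f"COMPUTE is mode {modes['COMPUTE']}")
```

With this lookup, a script asking for COMPUTE gets 5 on the 19.3 table and would get whatever index an older table assigns.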

GPUs with Empty pstate table

In trying an older R9 290X card, I found that the driver file which should contain p-state details is empty. Perhaps this format of pp_od_clk_voltage should be classified as "type 0".

amdgpu version: 18.50-725072
1 AMD GPUs detected, 1 may be compatible, checking...
Error: problem reading sensor data from GPU HWMON: /sys/class/drm/card0/device/hwmon/hwmon5/
1 are confirmed compatible.

UUID: 7fdf9c163a6e46a9b32e29aa67ff6567
amdgpu-utils Compatibility: Yes
Device ID: {'vendor': '0x1002', 'device': '0x67b0', 'subsystem_vendor': '0x1043', 'subsystem_device': '0x046a'}
GPU Frequency/Voltage Control Type: 1
Decoded Device ID: R9 290X DirectCU II
Card Model:  Hawaii XT / Grenada XT [Radeon R9 290X/390X]
Short Card Model:  R9 290X/390X
Display Card Model:  R9 290X/390X
Card Number: 0
Card Path: /sys/class/drm/card0/device/
PCIe ID: 42:00.0
Driver: amdgpu
vBIOS Version: 1113-AD62700-101
HWmon: /sys/class/drm/card0/device/hwmon/hwmon5/
Current Power (W): 33.238
Power Cap (W): 250.0
Power Cap Range (W): [0, 250]
Fan Enable: 1
Fan PWM Mode: [-1, 'UNK']
Current Fan PWM (%): -1
Current Fan Speed (rpm): -1
Fan Target Speed (rpm): -1
Fan Speed Range (rpm): [0, 0]
Fan PWM Range (%): [0, 100]
Current Temp (C): 39.0
Critical Temp (C): 104000.0
Current VddGFX (mV): -1
Vddc Range: ['', '']
Current Loading (%): 0
Link Speed: 8 GT/s
Link Width: 16
Current SCLK P-State: 2
Current SCLK: 727Mhz 
SCLK Range: ['', '']
Current MCLK P-State: 1
Current MCLK: 1250Mhz 
MCLK Range: ['', '']
Power Performance Mode: 0-3D_FULL_SCREEN
Power Force Performance Level: auto

Error after setting amdgpu.ppfeaturemask=0xfffd7fff

So this is what happens when I run /usr/bin/gpu-mon after using the ppfeaturemask kernel parameters. It works normally without (though I can't use gpu-ls in either).

Also, I have an AMD R9 290, and am currently running Debian Bullseye. The kernel is 5.10.12-xanmod1-cacule. And ricks-amdgpu-utils is version 3.5.0-1 from the Debian Testing repos.

I should also mention that I am using the refind bootloader rather than GRUB, but that (afaik) shouldn't affect how kernel parameters are applied.

gpu-mon error:

Traceback (most recent call last):
  File /usr/bin/gpu-mon, line 391, in <module>
    main()
  File /usr/bin/gpu-mon, line 295, in main
    gpu_list.set_gpu_list()
  File /usr/lib/python3/dist-packages/GPUmodules/GPUmodule.py, line 1666, in set_gpu_list
    self.amd_featuremask = env.GUT_CONST.read_amdfeaturemask()
  File /usr/lib/python3/dist-packages/GPUmodules/env.py, line 205, in read_amdfeaturemask
    self.amdfeaturemask = int(fm_file.readline())
ValueError: invalid literal for int() with base 10: '0xfffd7fff\n' 

gpu-ls error:

Traceback (most recent call last):
  File /usr/bin/gpu-mon, line 391, in <module>
    main()
  File /usr/bin/gpu-mon, line 295, in main
    gpu_list.set_gpu_list()
  File /usr/lib/python3/dist-packages/GPUmodules/GPUmodule.py, line 1666, in set_gpu_list
    self.amd_featuremask = env.GUT_CONST.read_amdfeaturemask()
  File /usr/lib/python3/dist-packages/GPUmodules/env.py, line 205, in read_amdfeaturemask
    self.amdfeaturemask = int(fm_file.readline())
ValueError: invalid literal for int() with base 10: '0xfffd7fff\n' 
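The ValueError shows `int()` being asked to parse a `0x`-prefixed string as base 10. As a sketch of one possible way to accept both forms (not necessarily how the project fixed it), `int()` with base 0 infers the base from the prefix. The function name `parse_feature_mask` is hypothetical.

```python
def parse_feature_mask(text: str) -> int:
    """Parse a ppfeaturemask value given as decimal or 0x-prefixed hex."""
    # Base 0 lets int() infer the base from the prefix; strip the newline first.
    return int(text.strip(), 0)


mask = parse_feature_mask("0xfffd7fff\n")
print(hex(mask))
```

The same call still handles a plain decimal value, so it should work with or without the kernel parameter set in hex.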

skip detection of non amd gpus

I am running the amdgpu driver but the i915 GPU is getting in the way. Please skip this onboard gpu if detected.

Using python 3.6.9
           Python version OK.
Using Linux Kernel 4.20.0-042000-generic
           OS kernel OK.
AMD GPU driver is driver=i915 latency=0 
           AMD's 'amdgpu' driver package is required.
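A possible way to skip the onboard GPU is to check which kernel driver each card is bound to before treating it as an AMD card. The sketch below assumes the usual sysfs layout where `cardN/device/driver` is a symlink ending in the driver name; `cards_using_driver` is a made-up helper, and the demo runs against a temporary directory rather than the real /sys/class/drm.

```python
import os
import re
import tempfile
from typing import List


def cards_using_driver(drm_root: str, driver: str = "amdgpu") -> List[str]:
    """Return card names under drm_root whose device is bound to `driver`."""
    matches = []
    for name in sorted(os.listdir(drm_root)):
        if not re.fullmatch(r"card\d+", name):
            continue  # skip connector and render-node entries
        link = os.path.join(drm_root, name, "device", "driver")
        # The driver entry is a symlink whose final component is the driver name.
        if os.path.islink(link) and os.path.basename(os.readlink(link)) == driver:
            matches.append(name)
    return matches


# Demo on a fake tree: card0 bound to i915, card1 bound to amdgpu.
with tempfile.TemporaryDirectory() as root:
    for card, drv in (("card0", "i915"), ("card1", "amdgpu")):
        os.makedirs(os.path.join(root, card, "device"))
        os.symlink(os.path.join("/drivers", drv),
                   os.path.join(root, card, "device", "driver"))
    amd_cards = cards_using_driver(root)
print(amd_cards)
```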

Fan PWM on RX 5600 XT

EDIT: After a system restart, this issue posted below corrected itself. I'll leave it open as I try to reproduce the conditions that cause a problem, if there is one.

I found a new issue running PAC with a Navi 10 RX 5600 XT.
The problem is that, upon Save, Fan PWM becomes set to whatever the current reading is in the entry field, regardless of whether any change was made to the fan setting. So if the card had been running in Auto, then PAC is executed to change some other parameter and fans happen to be off because the card is resting, then upon Save, the fans are (re)set to 0%.
Below is an example terminal stdout where I used PAC to first 'reset' the Fan PWM (auto mode), then immediately Saved again with no changes entered. It echoed '0' to pwm1 because the fans were not running at the time:

$ ./amdgpu-pac --execute
Detected GPUs: INTEL: 1, AMD: 1
AMD: amdgpu version: 20.10-1048554
AMD: Wattman features enabled: 0xfffd7fff
2 total GPUs, 1 rw, 0 r-only, 0 w-only

# Write Delta mode.
Batch file completed: /home/craig/amdgpu-utils-master/pac_writer_4f74907aa15646c8a291ac095e1423a2.sh
Writing 1 changes to GPU /sys/class/drm/card1/device
+ sudo sh -c echo '0' >  /sys/class/drm/card1/device/hwmon/hwmon3/pwm1_enable
+ sudo sh -c echo '2' >  /sys/class/drm/card1/device/hwmon/hwmon3/pwm1_enable
PAC execution complete.
# Write Delta mode.
Batch file completed: /home/craig/amdgpu-utils-master/pac_writer_c62fb651f47549bdbd90e909e4018fd7.sh
Writing 1 changes to GPU /sys/class/drm/card1/device
+ sudo sh -c echo '1' >  /sys/class/drm/card1/device/hwmon/hwmon3/pwm1_enable
+ sudo sh -c echo '0' >  /sys/class/drm/card1/device/hwmon/hwmon3/pwm1
PAC execution complete.

My current workaround is to enter "reset" for Fan PWM every time before I hit Save, and even that sometimes requires a separate "reset" and Save to get the fans going.

Unrelated, but good news: this card can accept changes to p-state masks on-the-fly, unlike the Ellesmere and Polaris RX 4xx & 5xx series that required the card not be under load. In fact, I think that sclk mask changes on Navi 10 work only when the card is under load; every change I tried to make to sclk masks when the card was not under load did not work. MCLK masks work in all situations.

PAC not working with new AMDGPU 20.10 package

I just downloaded the newly released AMDGPU package, installed the OpenCL component to my Ubuntu 18.04.4 system and found that amdgpu-pac --execute could not write to the drm files for my two cards.

~/Desktop/Link to amdgpu-utils-2.7.0$ ./amdgpu-pac --execute
AMD Wattman features enabled: 0xfffd7fff
amdgpu version: 20.10-1048554
2 AMD GPUs detected, 2 may be compatible, checking...
2 are confirmed compatible.

Batch file completed: /home/craig/amdgpu-utils-2.7.0/pac_writer_98a5934db0234fcdbd9a29b32088dc3d.sh
Writing 2 changes to GPU /sys/class/drm/card1/device/
+ sudo sh -c echo '0 6' >  /sys/class/drm/card1/device/pp_dpm_sclk
[sudo] password for craig: 
sh: echo: I/O error
+ sudo sh -c echo '0  2' >  /sys/class/drm/card1/device/pp_dpm_mclk
sh: echo: I/O error
PAC execution complete.
Batch file completed: /home/craig/amdgpu-utils-2.7.0/pac_writer_9e1b20dd96a94012b1539767fdc6dba1.sh
Writing 2 changes to GPU /sys/class/drm/card2/device/
+ sudo sh -c echo '0  6' >  /sys/class/drm/card2/device/pp_dpm_sclk
sh: echo: I/O error
+ sudo sh -c echo '0  2' >  /sys/class/drm/card2/device/pp_dpm_mclk
sh: echo: I/O error
PAC execution complete.

My earlier post pointed out that PAC doesn't work at all with the latest Master, so I used the Master that was current several days ago. The older Master executes, but doesn't seem to have write permissions, so it's something with the AMDGPU upgrade. (My older startup PAC scripts to set parameters automatically on reboot no longer work either.) Luckily the default GPU run conditions are okay for now.

Debian Installer - Testing/Verification Requested

I have posted a Debian installer for gpu-utils. Execute the following commands to install:

wget -q -O - https://debian.rickslab.com/PUBLIC.KEY | sudo apt-key add -

echo 'deb [arch=amd64] https://debian.rickslab.com/gpu-utils/ eddore main' | sudo tee /etc/apt/sources.list.d/rickslab-gpu-utils.list

sudo apt update

sudo apt install rickslab-gpu-utils

If you have already installed from PyPI, then first uninstall and close that terminal:

pip uninstall rickslab-gpu-utils

exit

Let me know of any issues or feedback on package construction. This (and ups-utils) are my first attempt at distributing a Debian package.

monitor module: --gui warning and column width

In v. 2.0, the command <./amdgpu-monitor --gui > brings up this warning:

(amdgpu-monitor:23263): dbind-WARNING **: 15:05:23.731: Error retrieving accessibility bus address: org.freedesktop.DBus.Error.ServiceUnknown: The name org.a11y.Bus was not provided by any .service files.

..but the Gtk monitor window launches fine.

And here is a minor cosmetic issue: I'm running an RX 460 and an RX 570. The columns in the monitor Gtk window are wide enough (self-adjusting?) to show all card information; in my case the card model name is a string of all AMD models that fall in the RX x60 (Baffin) or RX x70 & RX x80 (Ellesmere) series. The long string is fine (it does include the installed card), but when monitor is used only in the terminal window, <./amdgpu-monitor>, the columns for each card are too narrow to display the full model name(s), and may truncate before the installed model is seen. For example, the RX570 column only shows "RX 470/480/" . Also the narrow column width can truncate the card's full performance mode name, e.g. 0-3D_FULL_SCREEN; and the first column needs to be 1 character wider to display all of "Power Cap (W)". This is mostly eye-candy, but I raise it because the loss of card information in the terminal window may be confusing or misleading in some circumstances.
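To illustrate the kind of self-adjusting terminal columns I mean, here is a minimal sketch that sizes each column to its longest entry. It is not taken from amdgpu-monitor; `format_table` and the sample values are made up for the example.

```python
from typing import Dict, List


def format_table(rows: Dict[str, List[str]]) -> str:
    """Render label -> per-card values with columns sized to their content."""
    labels = list(rows)
    n_cards = len(next(iter(rows.values())))
    # Width of the label column and of each card column = longest entry in it.
    label_w = max(len(label) for label in labels)
    card_w = [max(len(rows[label][i]) for label in labels) for i in range(n_cards)]
    lines = []
    for label in labels:
        cells = [rows[label][i].ljust(card_w[i]) for i in range(n_cards)]
        lines.append("|".join([label.ljust(label_w)] + cells))
    return "\n".join(lines)


# Illustrative values only, not real monitor output.
table = format_table({
    "Model": ["RX 460", "RX 470/480/570/580"],
    "Power Cap (W)": ["48.0", "120.0"],
    "Perf Mode": ["0-3D_FULL_SCREEN", "4-COMPUTE"],
})
print(table)
```

Because widths come from the data, nothing is truncated; the trade-off is that very long model strings make the whole table wider.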

false venv error?

amdgpu-chk is reporting a venv error after I've activated venv. What is missing?
(In the command prompt, amdgpu-utils is a Desktop symlink to today's amdgpu-utils-master in my home directory.)
amdgpu-chk_err_screenshop
I followed the instructions in the User Guide (they are not in README) for installing venv and requirements; the amdgpu-util-env directory is in the Master folder and the requirements-venv.txt is installed via pip.

Bug in Master from 26 June 2020

All amdgpu commands, except -chk, have this type of error:

~/Desktop/amdgpu-utils$ ./amdgpu-ls
Traceback (most recent call last):
  File "./amdgpu-monitor", line 47, in <module>
    __version__ = env.GUT_CONST.version
AttributeError: 'GutConst' object has no attribute 'version'

Ubuntu upgrade broke amdgpu-pac?

The first cold restart I did after an Ubuntu upgrade with Linux kernel 5.0.0.31, amdgpu-pac was no longer able to set p-states, and my RX570's were crunching E@H at a p-state of 2 at full 120 W power and running very hot. I had been using the default Linux 5.0 drivers with only OpenCL loaded from the 19.2 AMDGPU package. After installing the latest 19.3 (August 2019) AMDGPU package, amdgpu-pac is once again fully functional and the GPUs are running normally. So, not an issue with amdgpu-utils, but a heads-up about upgrading Linux drivers.

X-axis values obscured in amdgpu-plot

The time series graphs for amdgpu-plot show only a small portion of the x-axis values. I'm pretty sure this was an issue prior to the most recent Master.
amdgpu-plot

PyPI Package Available

I have posted a package at PyPI which can be installed with:

pip install ricks-amdgpu-utils

Let me know of any issues.

Exploring Extended Support for Non-AMD GPUs

The utilities already leverage PCIe information to identify all installed GPUs. I would like to extend this by examining additional sensor files in the card and hwmon directories. Looking for users with GPUs other than AMD to share details here. The easiest way is to run the latest version on the extended branch with the --debug option and share the debug log file.

v3.0 documentation

Here are some ideas for edits to the User Guide. Let me know what you think and I can include them for a pull request.
In "Getting Started" section, it says:
After saving, update grub:

sudo update-grub

and then reboot.

But, after updating the ppfeaturemask code in grub, I didn't have to reboot for new PAC features (e.g. overclocking) to work. However, amdgpu-ls still lists the feature mask as what it was before the grub update. Is the featuremask code reported by -ls read from the last boot record and not the current grub file? Is a reboot only necessary following update-grub for the initial loading of amdgpu.ppfeaturemask? This is more for my clarification on how grub works than any edits to the text.
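One way to check which mask the running kernel was actually booted with is to look at its command line, which on Linux is exposed in /proc/cmdline and only changes at boot. The sketch below pulls the value out of such a string; `featuremask_from_cmdline` is a hypothetical helper and the sample command line is invented.

```python
import re
from typing import Optional


def featuremask_from_cmdline(cmdline: str) -> Optional[int]:
    """Return the amdgpu.ppfeaturemask value in a kernel command line, if any."""
    match = re.search(r"\bamdgpu\.ppfeaturemask=(\S+)", cmdline)
    # Base 0 accepts both 0x-prefixed hex and plain decimal values.
    return int(match.group(1), 0) if match else None


# Example string only; on a live system this would come from /proc/cmdline.
sample = "BOOT_IMAGE=/vmlinuz root=/dev/sda1 ro quiet amdgpu.ppfeaturemask=0xfffd7fff"
print(hex(featuremask_from_cmdline(sample)))
```

Comparing this value with the one in the grub file would show whether an edit has been picked up by a boot yet.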

In the "Using amdgpu-ls" section,
I see in the ppm graphic that the timings table was removed. If users wonder what's going on when the table of timing values is reported in their terminal, however, it may be helpful to add an explanation, unless you just want to keep less clutter in the User Guide. From the ROCm-smi page, https://github.com/RadeonOpenCompute/ROC-smi/tree/roc-2.7.0, the column headers for the ppm timings table could be included along with brief definitions, like this:

Card Number: 1
   Card Model: Radeon RX 570
   Card: /sys/class/drm/card1/device
   Power Performance Mode: manual
                      SCLK_UP_HYST  SCLK_DOWN_HYST  SCLK_ACTIVE_LEVEL  MCLK_UP_HYST  MCLK_DOWN_HYST  MCLK_ACTIVE_LEVEL
 0:   BOOTUP_DEFAULT             -               -                  -             -               -                  -
 1:   3D_FULL_SCREEN             0             100                 30             0             100                 10
 2:     POWER_SAVING            10               0                 30             -               -                  -
 3:            VIDEO             -               -                  -            10              16                 31
 4:               VR             0              11                 50             0             100                 10
 5:          COMPUTE             0               5                 30             0             100                 10
 6:           CUSTOM             -               -                  -             -               -                  -
-1:             AUTO          Auto

(Text extracted and paraphrased from the ROCm-smi readme, https://github.com/RadeonOpenCompute/ROC-smi/tree/roc-2.7.0)
  • SCLK_UP_HYST - Delay before sclk is increased (in milliseconds).
  • SCLK_DOWN_HYST - Delay before sclk is decreased (in milliseconds).
  • SCLK_ACTIVE_LEVEL - Workload required before sclk levels change (in %).
  • MCLK_UP_HYST - Delay before mclk is increased (in milliseconds).
  • MCLK_DOWN_HYST - Delay before mclk is decreased (in milliseconds).
  • MCLK_ACTIVE_LEVEL - Workload required before mclk levels change (in %).

Values displayed as '-' are hidden fields and are not enabled. When a compute queue is detected, the COMPUTE Power Profile values will be automatically applied to the system, provided that the Perf Level is set to "auto". The CUSTOM Power Profile is only applied when the Performance Level is set to "manual" and can be specified using ROCm-smi (??with rocm loaded??). It is not possible to modify non-CUSTOM Profiles because these are hard-coded by the kernel.

Maybe include this descriptive text in --ppm terminal output instead of adding it to the User Guide?

In the "Using amdgpu-monitor" section,
Need to update the terminal output and GUI graphics and include descriptive text for Memory Load monitoring.

In the "Using amdgpu-pac" section,
Add after, "If you know how to obtain the current value, please let me know!"...
"When changing sclk P-state MHz or mV, the desired P-state mask, if different from default, will have to be re-entered for speed or voltage changes to be applied."
At least this is how it has been working for me.

Need to get confirmation that ver.3.0 works with RX 5xxx-series (Navi) cards?

In the "Setting GPU Automatically at Startup" section,
Change section header to "Running Startup amdgpu-pac Bash Files". (and change ToC index entry)
Add instruction for setting up $HWMON variables to handle shifting hwmon# (thus increasing chances of bash files writing desired GPU parameters)?
Probably don't need to use the --force_write option for a startup bash file; it just needs to write the changes from default settings.
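On the shifting hwmon# point, the idea is to resolve the directory at run time instead of hard-coding the number. A rough sketch of that lookup is below; `find_hwmon_dir` is a made-up name, it assumes one hwmon directory per card, and the demo uses a temporary directory standing in for /sys/class/drm/cardN/device.

```python
import glob
import os
import tempfile
from typing import Optional


def find_hwmon_dir(card_device_path: str) -> Optional[str]:
    """Return the card's hwmon directory, whatever number it was assigned."""
    candidates = sorted(glob.glob(os.path.join(card_device_path, "hwmon", "hwmon*")))
    return candidates[0] if candidates else None


# Demo on a fake device directory that contains hwmon/hwmon3.
with tempfile.TemporaryDirectory() as device:
    os.makedirs(os.path.join(device, "hwmon", "hwmon3"))
    hwmon_name = os.path.basename(find_hwmon_dir(device))
print(hwmon_name)
```

The same glob pattern could be used directly in a bash startup file to set a $HWMON variable before the write commands.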

feature request: monitor and set fans

For cards with air coolers, it would be great to be able to monitor fan RPM and %max and to be able to set fan speeds (as a percentage of max). For the RX460 and RX570 cards I have, AMD's default fan setting can make things a bit toasty under full loads.
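For reference, setting a fan by percentage mostly comes down to scaling into the raw PWM range that the hwmon pwm1 file expects. The sketch below assumes the conventional 0 to 255 range; a real implementation would read the card's pwm1_min and pwm1_max files instead. The function name `percent_to_pwm` is hypothetical.

```python
def percent_to_pwm(percent: float, pwm_min: int = 0, pwm_max: int = 255) -> int:
    """Convert a fan speed percentage to a raw PWM value, clamped to range."""
    # Clamp to 0..100 so an out-of-range request cannot exceed the limits.
    percent = min(100.0, max(0.0, percent))
    return round(pwm_min + (pwm_max - pwm_min) * percent / 100.0)


print(percent_to_pwm(40))
```

Under that 0 to 255 assumption, 40% corresponds to a raw value of 102.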

HW file error

I just set up a new Ubuntu 18.04, Linux kernel 4.15 host with two Rx570, and loaded AMDGPU 18.4 drivers (not 18.5, because I first loaded AMDGPU drivers on a Ubuntu 16.04 system, which I subsequently upgraded to Ubuntu 18.04 - that was the only way I could figure out how to retain the 4.15 kernel, because the 4.18 kernel doesn't work with AMDGPU drivers).
With amdgpu-ls from today's amdgpu-utils I see the following errors:

$ ./amdgpu-ls
AMD Wattman features enabled: 0xffff7fff
amdgpu version: 18.40-673869
2 AMD GPUs detected, 2 may be compatible, checking...
Error: HW file doesn't exist: /sys/class/drm/card1/device/hwmon/hwmon3/fan1_max
Error: HW file doesn't exist: /sys/class/drm/card2/device/hwmon/hwmon4/fan1_max
Error: HW file doesn't exist: /sys/class/drm/card1/device/hwmon/hwmon3/fan1_enable
Error: HW file doesn't exist: /sys/class/drm/card1/device/hwmon/hwmon3/fan1_target
Error: HW file doesn't exist: /sys/class/drm/card2/device/hwmon/hwmon4/fan1_enable
Error: HW file doesn't exist: /sys/class/drm/card2/device/hwmon/hwmon4/fan1_target
2 are confirmed compatible.
....etc...

And the contents of those two folders

$ ls /sys/class/drm/card1/device/hwmon/hwmon3
device      in0_label  power1_average  power1_cap_min  pwm1_max   temp1_crit       uevent
fan1_input  name       power1_cap      pwm1            pwm1_min   temp1_crit_hyst
in0_input   power      power1_cap_max  pwm1_enable     subsystem  temp1_input

$ ls /sys/class/drm/card2/device/hwmon/hwmon4
device      in0_label  power1_average  power1_cap_min  pwm1_max   temp1_crit       uevent
fan1_input  name       power1_cap      pwm1            pwm1_min   temp1_crit_hyst
in0_input   power      power1_cap_max  pwm1_enable     subsystem  temp1_input

I can, however, control the fan speeds with amdgpu-pac --execute, which I verified with amdgpu-monitor:

$ ./amdgpu-pac --execute
(amdgpu-pac:3167): dbind-WARNING **: 15:28:12.776: Error retrieving accessibility bus address: org.freedesktop.DBus.Error.ServiceUnknown: The name org.a11y.Bus was not provided by any .service files
AMD Wattman features enabled: 0xffff7fff
amdgpu version: 18.40-673869
2 AMD GPUs detected, 2 may be compatible, checking...
Error: HW file doesn't exist: /sys/class/drm/card1/device/hwmon/hwmon3/fan1_max
Error: HW file doesn't exist: /sys/class/drm/card2/device/hwmon/hwmon4/fan1_max
Error: HW file doesn't exist: /sys/class/drm/card1/device/hwmon/hwmon3/fan1_enable
Error: HW file doesn't exist: /sys/class/drm/card1/device/hwmon/hwmon3/fan1_target
Error: HW file doesn't exist: /sys/class/drm/card2/device/hwmon/hwmon4/fan1_enable
Error: HW file doesn't exist: /sys/class/drm/card2/device/hwmon/hwmon4/fan1_target
2 are confirmed compatible.

Error: HW file doesn't exist: /sys/class/drm/card1/device/hwmon/hwmon3/fan1_enable
Error: HW file doesn't exist: /sys/class/drm/card1/device/hwmon/hwmon3/fan1_target
Error: HW file doesn't exist: /sys/class/drm/card2/device/hwmon/hwmon4/fan1_enable
Error: HW file doesn't exist: /sys/class/drm/card2/device/hwmon/hwmon4/fan1_target
Batch file completed: /home/craig/Desktop/amdgpu-utils-master/pac_writer_bf254171022d44a4aa8e4fcdc8ca8094.sh
Writing 2 changes to GPU /sys/class/drm/card1/device/
+ sudo sh -c echo '1' >  /sys/class/drm/card1/device/hwmon/hwmon3/pwm1_enable
[sudo] password for craig: 
+ sudo sh -c echo '102' >  /sys/class/drm/card1/device/hwmon/hwmon3/pwm1
+ sudo sh -c echo 'manual' >  /sys/class/drm/card1/device/power_dpm_force_performance_level
+ sudo sh -c echo '4' >  /sys/class/drm/card1/device/pp_power_profile_mode
PAC execution complete.
Error: HW file doesn't exist: /sys/class/drm/card1/device/hwmon/hwmon3/fan1_enable
Error: HW file doesn't exist: /sys/class/drm/card1/device/hwmon/hwmon3/fan1_target
Error: HW file doesn't exist: /sys/class/drm/card2/device/hwmon/hwmon4/fan1_enable
Error: HW file doesn't exist: /sys/class/drm/card2/device/hwmon/hwmon4/fan1_target
Batch file completed: /home/craig/Desktop/amdgpu-utils-master/pac_writer_2e23c00831eb4228b2e67df8603f52d5.sh
Writing 2 changes to GPU /sys/class/drm/card2/device/
+ sudo sh -c echo '1' >  /sys/class/drm/card2/device/hwmon/hwmon4/pwm1_enable
+ sudo sh -c echo '102' >  /sys/class/drm/card2/device/hwmon/hwmon4/pwm1
+ sudo sh -c echo 'manual' >  /sys/class/drm/card2/device/power_dpm_force_performance_level
+ sudo sh -c echo '4' >  /sys/class/drm/card2/device/pp_power_profile_mode
PAC execution complete.
Error: HW file doesn't exist: /sys/class/drm/card1/device/hwmon/hwmon3/fan1_enable
Error: HW file doesn't exist: /sys/class/drm/card1/device/hwmon/hwmon3/fan1_target
Error: HW file doesn't exist: /sys/class/drm/card2/device/hwmon/hwmon4/fan1_enable
Error: HW file doesn't exist: /sys/class/drm/card2/device/hwmon/hwmon4/fan1_target

So, are the HW file errors anything to be concerned with since PAC seems to be working?
Also, my PC build has a PWM hub with two case fans connected to it. Is that related to the errors?
Do I need to upgrade to the 18.5 AMD drivers?
What about that dbind-WARNING?
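A quick way to see which of the probed files are really absent is to check the hwmon folder directly. The sketch below is only an illustration: the EXPECTED_FILES list and both function names are hypothetical, and the set of files the utilities actually probe may differ. In the listings above, the fan1_enable, fan1_target and fan1_max files are absent while pwm1 and pwm1_enable are present, which is consistent with the PAC writes to pwm1 succeeding.

```python
from pathlib import Path

# Hypothetical list for illustration; the files the utilities probe may differ.
EXPECTED_FILES = ("fan1_enable", "fan1_target", "fan1_max", "pwm1_enable", "pwm1")


def missing_hwmon_files(hwmon_dir, expected=EXPECTED_FILES):
    """Return the expected file names that are absent from hwmon_dir."""
    base = Path(hwmon_dir)
    return [name for name in expected if not (base / name).exists()]


def can_set_pwm(hwmon_dir):
    """Assumption: writing a PWM fan speed only needs pwm1_enable and pwm1."""
    return not missing_hwmon_files(hwmon_dir, ("pwm1_enable", "pwm1"))
```

For example, missing_hwmon_files("/sys/class/drm/card1/device/hwmon/hwmon3") would list the fan1_* names that the error messages above complain about.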

Radeon VII and possible glibc issue

Hi Rick. This is Sean from the s@h boards.
Card 1 is a Vega 56. Card 0 is a Radeon VII.
System is Ubuntu 18.04 LTS, Ubuntu GLIBC 2.27-3ubuntu1
I'll include the output from amdgpu-ls (2.1.0) here as well:

./amdgpu-ls
AMD Wattman features enabled: 0xffffffff
amdgpu version: 18.50-725072
2 AMD GPUs detected
Error: HW file doesn't exist: /sys/class/drm/card1/device/hwmon/hwmon1/power1_cap_max
Error: HW file doesn't exist: /sys/class/drm/card1/device/hwmon/hwmon1/power1_cap
Error: HW file doesn't exist: /sys/class/drm/card1/device/hwmon/hwmon1/power1_average
Error: HW file doesn't exist: /sys/class/drm/card1/device/hwmon/hwmon1/temp1_input
Error: HW file doesn't exist: /sys/class/drm/card1/device/hwmon/hwmon1/temp1_crit
Error: HW file doesn't exist: /sys/class/drm/card1/device/hwmon/hwmon1/fan1_enable
Error: HW file doesn't exist: /sys/class/drm/card1/device/hwmon/hwmon1/fan1_target
Error: HW file doesn't exist: /sys/class/drm/card1/device/hwmon/hwmon1/fan1_input
Error: HW file doesn't exist: /sys/class/drm/card1/device/hwmon/hwmon1/fan1_max
Error: HW file doesn't exist: /sys/class/drm/card1/device/hwmon/hwmon1/pwm1_enable
Error: HW file doesn't exist: /sys/class/drm/card1/device/hwmon/hwmon1/pwm1
Error: HW file doesn't exist: /sys/class/drm/card1/device/hwmon/hwmon1/pwm1_max
Error: HW file doesn't exist: /sys/class/drm/card1/device/hwmon/hwmon1/in0_label
Error: HW file doesn't exist: /sys/class/drm/card0/device/hwmon/hwmon0/power1_cap_max
Error: HW file doesn't exist: /sys/class/drm/card0/device/hwmon/hwmon0/power1_cap
Error: HW file doesn't exist: /sys/class/drm/card0/device/hwmon/hwmon0/power1_average
Error: HW file doesn't exist: /sys/class/drm/card0/device/hwmon/hwmon0/temp1_input
Error: HW file doesn't exist: /sys/class/drm/card0/device/hwmon/hwmon0/temp1_crit
Error: HW file doesn't exist: /sys/class/drm/card0/device/hwmon/hwmon0/fan1_enable
Error: HW file doesn't exist: /sys/class/drm/card0/device/hwmon/hwmon0/fan1_target
Error: HW file doesn't exist: /sys/class/drm/card0/device/hwmon/hwmon0/fan1_input
Error: HW file doesn't exist: /sys/class/drm/card0/device/hwmon/hwmon0/fan1_max
Error: HW file doesn't exist: /sys/class/drm/card0/device/hwmon/hwmon0/pwm1_enable
Error: HW file doesn't exist: /sys/class/drm/card0/device/hwmon/hwmon0/pwm1
Error: HW file doesn't exist: /sys/class/drm/card0/device/hwmon/hwmon0/pwm1_max
Error: HW file doesn't exist: /sys/class/drm/card0/device/hwmon/hwmon0/in0_label
2 are Compatible

Error: Invalid pstate entry: /sys/class/drm/card0/device/pp_od_clk_voltage
Error: Invalid pstate entry: /sys/class/drm/card0/device/pp_od_clk_voltage
Error: Invalid pstate entry: /sys/class/drm/card0/device/pp_od_clk_voltage
UUID: 5d20111fb1d24b97a38ea653c57c55af
Card Model: Vega 10 XT [Radeon RX Vega 64]
Short Card Model: RX Vega 64
Card Number: 1
Card Path: /sys/class/drm/card1/device/
PCIe ID: 06:00.0
Driver: amdgpu
HWmon: /sys/class/drm/card1/device/hwmon/hwmon1/
Current Power (W): -1
Power Cap (W): -1
Power Cap Range (W): [-1, -1]
Fan Enable: -1
Fan PWM Mode: [-1, 'UNK']
Current Fan PWM (%): -1
Current Fan Speed (rpm): -1
Fan Target Speed (rpm): -1
Fan Speed Range (rpm): [-1, -1]
Fan PWM Range (%): [-1, -1]
Current Temp (C): -1
Critical Temp (C): -1
Current VddGFX (mV): -1
Vddc Range: ['800mV', '1200mV']
Current Loading (%): 83
Link Speed: 8 GT/s
Link Width: 16
vBIOS Version: 113-D0500300-101
Current SCLK P-State: 7
Current SCLK: 1590Mhz
SCLK Range: ['852MHz', '2400MHz']
Current MCLK P-State: 3
Current MCLK: 800Mhz
MCLK Range: ['167MHz', '1500MHz']
Power Performance Mode: 2-VIDEO
Power Force Performance Level: auto

UUID: 2ffbc1178e06458783b121e71dc487bd
Card Model: Device 081e
Short Card Model: Device 081e
Card Number: 0
Card Path: /sys/class/drm/card0/device/
PCIe ID: 03:00.0
Driver: amdgpu
HWmon: /sys/class/drm/card0/device/hwmon/hwmon0/
Current Power (W): -1
Power Cap (W): -1
Power Cap Range (W): [-1, -1]
Fan Enable: -1
Fan PWM Mode: [-1, 'UNK']
Current Fan PWM (%): -1
Current Fan Speed (rpm): -1
Fan Target Speed (rpm): -1
Fan Speed Range (rpm): [-1, -1]
Fan PWM Range (%): [-1, -1]
Current Temp (C): -1
Critical Temp (C): -1
Current VddGFX (mV): -1
Vddc Range: ['', '']
Current Loading (%): 97
Link Speed: 8 GT/s
Link Width: 16
vBIOS Version: 113-D3600200-105
Current SCLK P-State: -1
Current SCLK:
SCLK Range: ['808Mhz', '2200Mhz']
Current MCLK P-State: -1
Current MCLK:
MCLK Range: ['351Mhz', '1200Mhz']
Power Performance Mode: 2-VIDEO
Power Force Performance Level: auto

v2.7.0 Release Candidate

This release candidate includes modifications required to facilitate a Debian package release. It also supports running from a cloned local repository. Please report any bugs or feedback on this release candidate. Thanks!

Current GUI Color Scheme not Readable in Ubuntu 20.04

The gui apps' color scheme is not readable when using Ubuntu 20.04. I have had the same issue with boinc manager. I was considering changing it to a dark theme to eliminate the problem. I have changed amdgpu-monitor --gui

Let me know what you think.

No fan speed reading with RX 5600xt

I just installed an RX 5600XT, along with amdgpu-pro 20.1 OpenCL components, to run Einstein@Home. Initially the card's fans were off up to a GPU temp of 80 C when crunching an E@H task. Running amdgpu-ls gave "Error: Invalid pstate entry: /sys/class/drm/card1/devicepp_od_clk_voltage" (below).
I turned on the fans by running amdgpu-pac to set fan speed to 30%. That worked and brought temp down to 46 C, but with the same error msg (below). The practical problem is that Fan Spd in -monitor and in -pac shows as 0%, so I can't tell what the current fan speed is.

$ ./amdgpu-ls
Detected GPUs: INTEL: 1, AMD: 1
AMD: amdgpu version: 20.10-1048554
AMD: Wattman features enabled: 0xfffd7fff
2 total GPUs, 1 rw, 0 r-only, 0 w-only

Error: Invalid pstate entry: /sys/class/drm/card1/devicepp_od_clk_voltage
Error: Invalid pstate entry: /sys/class/drm/card1/devicepp_od_clk_voltage
Error: Invalid pstate entry: /sys/class/drm/card1/devicepp_od_clk_voltage
Card Number: 0
   Vendor: INTEL
   Readable: False
   Writable: False
   Compute: False
   Card Model: Intel Corporation 8th Gen Core Processor Gaussian Mixture Model
   PCIe ID: 00:02.0
   Driver: i915
   Card Path: /sys/class/drm/card0/device

Card Number: 1
   Vendor: AMD
   Readable: True
   Writable: True
   Compute: True
   GPU UID: 
   Device ID: {'vendor': '0x1002', 'device': '0x731f', 'subsystem_vendor': '0x1da2', 'subsystem_device': '0xe411'}
   Decoded Device ID: Navi 10 [Radeon RX 5700 / 5700 XT]
   Card Model: Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 [Radeon RX 5700 / 5700 XT] (rev ca)
   Display Card Model: Navi 10 [Radeon RX 5700 / 5700 XT]
   PCIe ID: 03:00.0
      Link Speed: 16 GT/s
      Link Width: 16
   ##################################################
   Driver: amdgpu
   vBIOS Version: 113-5E4111U-X4G
   Compute Platform: OpenCL 2.0 AMD-APP (3075.10)
   GPU Frequency/Voltage Control Type: 2
   HWmon: /sys/class/drm/card1/device/hwmon/hwmon3
   Card Path: /sys/class/drm/card1/device
   ##################################################
   Current Power (W): 79.0
   Power Cap (W): 160.0
      Power Cap Range (W): [0, 192]
   Fan Enable: 1
   Fan PWM Mode: [1, 'Manual']
   Fan Target Speed (rpm): 0
   Current Fan Speed (rpm): 0
   Current Fan PWM (%): 0
      Fan Speed Range (rpm): [0, 3200]
      Fan PWM Range (%): [0, 100]
   ##################################################
   Current GPU Loading (%): 62
   Current Memory Loading (%): 30
   Current Temps (C): {'mem': 60.0, 'edge': 46.0, 'junction': 47.0}
      Critical Temp (C): 118.0
   Current Voltages (V): {'vddgfx': 950}
   Current Clk Frequencies (MHz): {'sclk': 1780.0, 'mclk': 875.0}
   Current SCLK P-State: [2, '1780Mhz']
      SCLK Range: ['800Mhz', '1820Mhz']
   Current MCLK P-State: [3, '875Mhz']
      MCLK Range: ['625Mhz', '930Mhz']
   Power Profile Mode: 5-COMPUTE
   Power DPM Force Performance Level: manual
$ ./amdgpu-pac --execute
Detected GPUs: INTEL: 1, AMD: 1
AMD: amdgpu version: 20.10-1048554
AMD: Wattman features enabled: 0xfffd7fff
2 total GPUs, 1 rw, 0 r-only, 0 w-only

Error: Invalid pstate entry: /sys/class/drm/card1/devicepp_od_clk_voltage
Error: Invalid pstate entry: /sys/class/drm/card1/devicepp_od_clk_voltage
Error: Invalid pstate entry: /sys/class/drm/card1/devicepp_od_clk_voltage
Error: Invalid pstate entry: /sys/class/drm/card1/devicepp_od_clk_voltage
Error: Invalid pstate entry: /sys/class/drm/card1/devicepp_od_clk_voltage
Error: Invalid pstate entry: /sys/class/drm/card1/devicepp_od_clk_voltage
# Write Delta mode.
Batch file completed: /home/craig/amdgpu-utils-master/pac_writer_99c0cfbf059042e68c31a899838cced5.sh
Writing 1 changes to GPU /sys/class/drm/card1/device
+ sudo sh -c echo '1' >  /sys/class/drm/card1/device/hwmon/hwmon3/pwm1_enable
[sudo] password for craig: 
+ sudo sh -c echo '76' >  /sys/class/drm/card1/device/hwmon/hwmon3/pwm1
PAC execution complete.
Error: Invalid pstate entry: /sys/class/drm/card1/devicepp_od_clk_voltage
Error: Invalid pstate entry: /sys/class/drm/card1/devicepp_od_clk_voltage
Error: Invalid pstate entry: /sys/class/drm/card1/devicepp_od_clk_voltage

Gpu Memory utilization

Hi,
I am running Einstein@Home on a Radeon VII under Ubuntu 18.04.x.
I am running into an issue that looks like I am running out of memory on the gpu. If I exceed say 5 Gravity Wave gpu tasks, the gpu tasks stall.

Specifically the "Memory Load %" goes to 0 while the gpu load % goes to 100%. And the tasks appear to stop calculating and just increase wall clock time.

It would be very helpful if I could determine how much memory in the gpu is being used vs. available. Since I am also running two R5700's on the Gamma Ray Pulsar#1 search (different system) it would be helpful to see if I could boost from 3 gpu tasks to 4 without hitting the memory limit.

If I am missing something that is already present, I apologize and request instructions.
Tom M
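As a rough sketch of how used vs. total GPU memory could be read: the amdgpu kernel driver exposes mem_info_vram_used and mem_info_vram_total (values in bytes) in the card's device folder on reasonably recent kernels. Whether they are present on a given system is an assumption to verify, and the function names here are hypothetical rather than part of the utilities.

```python
from pathlib import Path


def read_vram_usage(device_dir):
    """Return (used_bytes, total_bytes), or None if the files are unavailable."""
    base = Path(device_dir)
    try:
        used = int((base / "mem_info_vram_used").read_text().strip())
        total = int((base / "mem_info_vram_total").read_text().strip())
    except (OSError, ValueError):
        # Older kernels or non-amdgpu cards may not provide these files.
        return None
    return used, total


def format_vram(used, total):
    """Format byte counts as GiB with a usage percentage."""
    gib = 1024 ** 3
    pct = 100.0 * used / total if total else 0.0
    return f"{used / gib:.2f} / {total / gib:.2f} GiB ({pct:.1f}%)"
```

Calling read_vram_usage("/sys/class/drm/card0/device") while adding tasks one at a time would show how close each additional task brings the card to its limit.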

fail more nicely on machines with no AMD card

Hello,
I ran on my virtualbox machine and got

amdgpu-ls
amdgpu-utils non-compatible driver: driver=vboxvideo latency=0
Error in environment. Exiting...

same for

$ amdgpu-monitor 
amdgpu-utils non-compatible driver: driver=vboxvideo latency=0
Error in environment. Exiting...

Have not tried the others. But:

$ amdgpu-chk 
Using python 3.7.4
           Python version OK. 
Using Linux Kernel4.20.0-trunk-amd64
           OS kernel OK. 

My initial thought was that I was just missing a setting among the environment variables. Maybe there is something more user-friendly to say that there is no driver installed that could possibly run AMD cards? And an interpretation of lspci suggesting that there is nothing installed hardware-wise either?
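To illustrate what a friendlier message could look like, here is a minimal sketch that reads the kernel driver bound to each DRM card from the driver symlink in sysfs and reports what was found. The SUPPORTED_DRIVERS set, the function names and the message wording are all hypothetical, not what the utilities currently do.

```python
import os
from pathlib import Path

# Assumption for illustration: only amdgpu counts as a supported driver.
SUPPORTED_DRIVERS = {"amdgpu"}


def card_drivers(drm_root="/sys/class/drm"):
    """Map each card name (card0, card1, ...) to its kernel driver name."""
    drivers = {}
    for card in sorted(Path(drm_root).glob("card[0-9]*")):
        if "-" in card.name:
            continue  # skip connector entries such as card0-HDMI-A-1
        link = card / "device" / "driver"
        if link.exists():
            drivers[card.name] = os.path.basename(os.path.realpath(link))
        else:
            drivers[card.name] = None
    return drivers


def environment_message(drivers):
    """Return a user-facing explanation, or None if a supported driver exists."""
    if not drivers:
        return "No GPUs were found under /sys/class/drm."
    found = sorted({name for name in drivers.values() if name})
    if SUPPORTED_DRIVERS.intersection(found):
        return None
    listed = ", ".join(found) if found else "none"
    return ("No supported GPU driver is loaded (drivers found: " + listed + "). "
            "These utilities need an AMD card using the amdgpu driver.")
```

On the VirtualBox machine above, the message would name vboxvideo as the only driver found instead of the generic "Error in environment".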

GPU running in thermostatic mode?

For a while one of my two RX 570s ran in what appeared to be a thermostatic mode while the other ran as expected from amdgpu-pac settings.

This occurred after upgrading to the amdgpu-pro-19.30-934563-ubuntu-18.04 package (release date November 5th 2019) with the command ./amdgpu-pro-install -y --opencl=pal,legacy --headless, then rebooting. I have my systemd set to run amdgpu-pac bash scripts at startup that set PWM fan speeds and other card parameters for einstein@home crunching. While one card ran its PAC settings as expected, the other card ignored its prescribed PWM setting (45%) and instead ran variable fan speeds which kept the GPU around 74 C. Over the course of a day or so, fan speeds ran between 36% and 43%, in 1% increments, as GPU load and room temperature varied. I have never seen that happen before. I tried to reproduce this behavior on the other card by setting its fan PWM to default (-1), but that did not trigger the thermostatic mode. Unfortunately, when I rebooted the system again, all fan PWM settings returned to normal. Has anyone else experienced this?

I've no idea what happened, but it would be great to have the option in amdgpu-pac to take advantage of this (mysterious) AMDGPU feature and set fan PWM to "thermostatic".

V3.0 Rewrite

I am in the process of a major rewrite. This is mostly motivated by how much more I understand Python now, but also by innovations in how I am managing GPUs in benchMT. The implementation will be done in a way to potentially be applicable to other GPU vendors in addition to AMD. I will replace AMD compatible status with flags for readability, writability, and compute capability. Development is on Branch v3.0

Let me know of any recommendations to consider in this rewrite.

Linux Distribution Dependent Behavior

I would like to build in distribution dependent behavior and need help determining distribution specific commands. This includes the method of determining which distribution is used and which command is used to determine if a package is installed.

  1. Which distribution: Potentially use lsb_release, /etc/*-release, /proc/version, or hostnamectl
  2. Which tool to determine if a package is installed: dpkg for debian
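As a starting point for discussion, the two steps above could be sketched as follows. This parses the text of /etc/os-release (its ID and ID_LIKE fields) and maps the distribution family to a package query command. Only dpkg comes from this thread; the pacman and equery entries are suggestions that users of those distributions would need to confirm, and all function and table names are hypothetical.

```python
import shlex

# Assumed mapping; only the dpkg entry comes from this thread.
PACKAGE_QUERY = {
    "debian": ["dpkg", "-l"],
    "arch": ["pacman", "-Q"],
    "gentoo": ["equery", "list"],
}


def parse_os_release(text):
    """Parse KEY=value lines (values may be quoted) into a dict."""
    info = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        parts = shlex.split(value)  # removes surrounding quotes
        info[key] = parts[0] if parts else ""
    return info


def package_query_command(info, package):
    """Return the command list for checking a package, or None if unknown."""
    candidates = [info.get("ID", "")] + info.get("ID_LIKE", "").split()
    for distro in candidates:
        if distro in PACKAGE_QUERY:
            return PACKAGE_QUERY[distro] + [package]
    return None
```

Checking ID_LIKE as well as ID lets derivatives such as Ubuntu fall back to the Debian behavior without needing their own entry.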

Here is the list of distributions that I am aware are being used:

  1. Debian - @Ricks-Lab @smoe
  2. Gentoo - @CH3CN
  3. Arch - @berturion

Looking for feedback on distro behavior. Thanks!

Fan Setting/Reading Issues

UPDATE: possible bug. I've discovered that each time PAC is saved for a card, the fan PWM decreases 3%. If the fan PWM field is left blank or entered with a non-valid character, then the 3% decrease is from the last saved setting. There are some PWM % values that become set as entered: 0 (!), 20, 40, 60, 80, 100. This explains all previous "odd" behavior I've seen with PAC, so it's been there all along; it just took me a while to figure it out (dang, sorry).
So, yeah, a warning needs to be inserted that an entry of zero means the fan will be shut off with possible damage to the card, or zero should be made a non-valid entry (though I suppose some folk may want to shut off their fan??). And that 3% decrement is a bit quirky and confusing, either for amdgpu or for amdgpu-pac.

Originally posted by @csecht in #10 (comment)
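One possible explanation, offered as an unconfirmed assumption rather than a diagnosis: the pwm1 file takes a raw value from 0 to 255 (the logs elsewhere on this page show 102 written for one setting and 76 for another), so a percentage has to be converted to raw and back. If both conversions truncate, only multiples of 20% survive a round trip, which matches the list of values that stay as entered. This sketch does not account for the size of the 3% step, and the function names are hypothetical.

```python
PWM_MAX = 255  # raw range of the pwm1 file is 0..255


def roundtrip_truncating(percent):
    """Percent -> raw -> percent, truncating at both steps."""
    raw = percent * PWM_MAX // 100
    return raw * 100 // PWM_MAX


def roundtrip_rounding(percent):
    """Percent -> raw -> percent, rounding to nearest at both steps."""
    raw = (percent * PWM_MAX + 50) // 100
    return (raw * 100 + PWM_MAX // 2) // PWM_MAX


def stable_percentages(roundtrip):
    """Percent values that come back unchanged after one round trip."""
    return [p for p in range(101) if roundtrip(p) == p]
```

With truncation, stable_percentages(roundtrip_truncating) gives [0, 20, 40, 60, 80, 100], and a 30% entry comes back as 29%. With rounding, every value from 0 to 100 is stable.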

Development Ideas

I have opened this thread to facilitate any discussions on development ideas and code refinements. Potential topics:

  • Recommendations for more Pythonic approach
  • Gtk improvements for a nicer gui
  • Better way to write to driver files as root
  • Plotting capability approach
  • Code criticism

ValueError when reading the feature mask

Trying to run gpu-utils with the following environment:

  • Running on a venv
  • Installed from pypi
  • Python version 3.9.0
  • Kernel: 5.10 (liquorix)
  • OS: Debian/testing

I get the following error:

Traceback (most recent call last):
  File "/home/sjr/.pyenv/versions/gpu-utils/bin/gpu-ls", line 150, in <module>
    main()
  File "/home/sjr/.pyenv/versions/gpu-utils/bin/gpu-ls", line 98, in main
    gpu_list.set_gpu_list(clinfo_flag=True)
  File "/home/sjr/.pyenv/versions/3.9.0/envs/gpu-utils/lib/python3.9/site-packages/GPUmodules/GPUmodule.py", line 1671, in set_gpu_list
    self.amd_featuremask = env.GUT_CONST.read_amdfeaturemask()
  File "/home/sjr/.pyenv/versions/3.9.0/envs/gpu-utils/lib/python3.9/site-packages/GPUmodules/env.py", line 206, in read_amdfeaturemask
    self.amdfeaturemask = int(fm_file.readline())

I get the same error using the packaged version for Debian testing (bullseye).
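The traceback above is cut off before the exception message, so the cause is a guess. One plausible cause is that the feature mask file on this kernel holds a hex string such as 0xfffd7fff, which int() with no base argument rejects with a ValueError. A minimal sketch of a more tolerant parser, with hypothetical function names, might look like this:

```python
def parse_featuremask(text):
    """Parse a feature mask given as decimal or 0x-prefixed hex text.

    Returns None instead of raising when the text cannot be parsed.
    """
    text = text.strip()
    try:
        # Base 0 lets int() accept both "4294803455" and "0xfffd7fff".
        return int(text, 0)
    except ValueError:
        return None


def format_featuremask(mask):
    """Format the mask the way the utilities display it, e.g. 0xfffd7fff."""
    return f"0x{mask:08x}"
```

Returning None lets the caller print a warning and continue, rather than ending with a traceback.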

Release Candidate - Testing Requested

I have prepared v3.2.0 Release Candidate 1 on master. I have tested on my 3 systems. Looks good so far. Please provide your experience here as verification/feedback before release planned for this coming weekend. Thanks!

amdgpu-ls driver check and warning needs editing

From the recent Master, from amdgpu-ls I get the warning:

Command '['dpkg', '-l', 'amdgpu-pro']' returned non-zero exit status 1.
Warning: amdgpu drivers not may not be installed.

True, I don't have amdgpu-pro installed, but amdgpu is:

~/Desktop/amdgpu-utils$ dpkg -l amdgpu
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name              Version            Architecture       Description
+++-=================-==================-==================-=========================================
ii  amdgpu            19.30-838629       amd64              Meta package to install amdgpu components.

Shouldn't the driver check also include the amdgpu All-Open stack as well as the Pro stack?
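A minimal sketch of a check that covers both stacks, assuming the package names amdgpu and amdgpu-pro shown in this report. The function name is hypothetical, and the run parameter exists only so the logic can be exercised without dpkg being present.

```python
import subprocess

# Package names taken from this report; other releases may use different names.
DRIVER_PACKAGES = ("amdgpu", "amdgpu-pro")


def installed_driver_packages(names=DRIVER_PACKAGES, run=subprocess.run):
    """Return the names that dpkg lists as installed (status 'ii')."""
    found = []
    for name in names:
        try:
            result = run(["dpkg", "-l", name], stdout=subprocess.PIPE,
                         stderr=subprocess.DEVNULL, universal_newlines=True)
        except OSError:
            return found  # dpkg is not available on this system
        lines = result.stdout.splitlines()
        if result.returncode == 0 and any(l.startswith("ii") for l in lines):
            found.append(name)
    return found
```

The warning would then be printed only when the returned list is empty, so a system with just the All-Open stack installed would not trigger it.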

Major Release Plans

The latest development work on the utilities has been to add Nvidia read capabilities. Also, the utilities now leverage generic PCIe sensor reading to detect all GPUs in a system. As a result, it may be appropriate to change the name of the project for the next major release. I was considering a name that would be consistent with what could be used for the package on Debian and a package released on PyPI. I was considering a project name of rickslab-gpu-utils and executables with names like rlgpu-ls, rlgpu-mon, rlgpu-pac, and rlgpu-plot.

Let me know your thoughts or any other recommendations.

User Guide - Contributors Needed

I have started a new markdown format file as a User Guide. If you would like to contribute, just edit this file: USER_GUIDE.md and do a pull request. If you are not that familiar with markdown, don't worry about format too much, and I will tune the look and feel of the document. Thanks!

AMDGPU 19.5 breaks sclk masking

Today I upgraded Ubuntu 18.04.3 LTS to the latest distro, which upgraded the Linux kernel from 5.0.0-37 to 5.3.0-26. That also upgraded AMDGPU drivers from 19.3-934563 to 19.5-967956. As a result, amdgpu-pac can no longer set sclk masking. There is a post (https://linuxreviews.org/Mesa_20_Will_Have_SDMA_Disabled_On_AMD_RX-Series_GPUs) that talks about sdma being disabled in the recent AMDGPU drivers for RX series and older AMD cards, but also for Navi (RX 5700-series) cards. Uninstalling and reinstalling AMDGPU downloaded from AMD does not fix it.
