rocm / hipblaslt Goto Github PK

hipBLASLt is a library that provides general matrix-matrix operations with a flexible API and extends functionalities beyond a traditional BLAS library

Home Page: https://rocm.docs.amd.com/projects/hipBLASLt/en/latest/index.html

License: MIT License

Shell 0.01% CMake 0.04% C++ 0.97% C 0.01% Python 0.76% Assembly 98.21% Awk 0.01% Dockerfile 0.01%

amd assembly blas gemm gpu-computing hip machine-learning matrix-multiplication radeon-open-compute rocm

hipblaslt's Introduction

hipBLASLt

hipBLASLt is a library that provides general matrix-matrix operations. It has a flexible API that extends functionalities beyond a traditional BLAS library, such as adding flexibility to matrix data layouts, input types, compute types, and algorithmic implementations and heuristics.

hipBLASLt uses the HIP programming language with an underlying optimized generator as its backend kernel provider.

After you specify a set of options for a matrix-matrix operation, you can reuse these for different inputs. The general matrix-multiply (GEMM) operation is performed by the hipblasLtMatmul API.

The equation is:

$$D = Activation(alpha \cdot op(A) \cdot op(B) + beta \cdot op(C) + bias)$$

Where op( ) refers to in-place operations, such as transpose and non-transpose, and alpha and beta are scalars.

The activation function supports GELU and ReLU. the bias vector matches matrix D rows and broadcasts to all D columns.

The following table provides data type support. Note that fp8 and bf8 are only supported on the gfx94x platform.

A	B	C	D	Compute(Scale)
fp32	fp32	fp32	fp32	fp32
fp16	fp16	fp16	fp16	fp32
fp16	fp16	fp16	fp32	fp32
bf16	bf16	bf16	bf16	fp32
fp8	fp8/bf8	fp32	fp32	fp32
fp8	fp8/bf8	fp16	fp16	fp32
bf8	fp8	fp32	fp32	fp32
bf8	fp8	fp16	fp16	fp32

Documentation

Full documentation for hipBLASLt is available at rocm.docs.amd.com/projects/hipBLASLt.

Run the steps below to build documentation locally.

cd docs

pip3 install -r sphinx/requirements.txt

python3 -m sphinx -T -E -b html -d _build/doctrees -D language=en . _build/html

Alternatively, build with CMake:

cmake -DBUILD_DOCS=ON ...

Requirements

To install hipBLASLt, you must meet the following requirements:

Required hardware:

gfx90a card
gfx94x card

Required software:

Git
CMake 3.16.8 or later
python3.7 or later
python3.7-venv or later
AMD ROCm, version 5.5 or later
hipBLAS (for the header file)

Build and install

You can build hipBLASLt using the install.sh script:

# Clone hipBLASLt using git
git clone https://github.com/ROCmSoftwarePlatform/hipBLASLt

# Go to hipBLASLt directory
cd hipBLASLt

# Run install.sh script
# Command line options:
#   -h|--help         - prints help message
#   -i|--install      - install after build
#   -d|--dependencies - install build dependencies
#   -c|--clients      - build library clients too (combines with -i & -d)
#   -g|--debug        - build with debug flag
./install.sh -idc

Unit tests

All unit tests are located in build/release/clients/staging/. To build these tests, you must build hipBLASLt with --clients.

You can find more information at the following links:

Contribute

If you want to submit an issue, you can do so on GitHub.

To contribute to our repository, you can create a GitHub pull request.

hipblaslt's People

Contributors

Stargazers

Watchers

Forkers

jichangjichang jinp800125 vin-huang kkyang aazz44ss serge45 tonyyhsieh lajagapp cmingch imcarsonliao samjwu arvindcheru hovertank3d are-we-gfx1100-yet cgmb zstreet87 lzy17 wenchuanchen yuguo-jack lawruble13 andysu12 dllehr-amd dangglxxx yoichiyoshida mahmoodw jeffdaily streamhpc solaslin briannwu eidenyoshida ssuyuanchang ray-tw captnjacksparrow evetsso alexbrownamd mengzcai charlifu tntran92 raramakr yenong-amd bethune-bryant abhimeda swraw

hipblaslt's Issues

Would it be possible for hipBLASLt to support int8 ops?

Like their CUDA equivalents here:

https://github.com/TimDettmers/bitsandbytes/blob/main/csrc/ops.cu#L383

[Issue]: Dependency on Tensile headers

Problem Description

I am packaging hipBLASLt for Fedora.
When building with -DBUILD_WITH_TENSILE=OFF post 5.7, there are build errors like this

hipBLASLt/library/src/amd_detail/rocblaslt/src/include/tensile_host.hpp:45:10: fatal error: 'Tensile/DataTypes.hpp' file not found
45 | #include <Tensile/DataTypes.hpp>
| ^~~~~~~~~~~~~~~~~~~~~~~

Operating System

Fedora Rawhide

CPU

x86_64

GPU

AMD Instinct MI210

Other

No response

ROCm Version

ROCm 5.7.1

ROCm Component

No response

Steps to Reproduce

No response

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

No response

Additional Information

the issue tracker drop down on rocm version needs to include 6.0

FP8 support

Does hipblaslt-bench support FP8? Or any other way to run a fp8 matmul test

Build Failure During Tensile Libraries Generation

Local ROCm version: 5.2.5.1
hipBLASLt version used in build: release/rocm-rel-5.5
Python version: 3.10
CPU: POWER9
GPU: gfx906

The hipBLASLt requirement arose for us re: bitsandbytes-rocm/ops.cu:400 that is required for 8-bit loading of HuggingFace language models. Unfortunately, the current implementation seems to rely on hipBLASLt for 8-bit matmul, and lacks in 4-bit implementation. Would you say that for gfx906/gfx908, hipBLASLt provides an advantage in 8-bit or 4-bit inference compared to hipBLAS code?

During the build process, the following commands were used:

CMake command: `cmake -DAMDGPU_TARGETS=gfx906 -DCMAKE_BUILD_TYPE=Release -DCMAKE_CXX_COMPILER=hipcc -DCMAKE_C_COMPILER=hipcc -G "Unix Makefiles" ..`
Make command: `make -j16`

CMake did not report any errors. However, the build failed at the "Generating Tensile Libraries" target, immediately after displaying the message "Reading logic files: Launching 32 threads...". The build failure persists even when configuring using install.sh with AMDGPU_TARGETS hardcoded to gfx906.

traceback:

cmake.log
make.log

rocminfo: rocminfo.txt

Update: Seems the same error appears when compiling with ROCm 5.5.

Any plans for adding gfx10+ support?

I cannot use rx6xxx cards anymore for LLM fine tuning with the new hipblaslt requirement. Are there any plans to add support in the future?

/usr/lib/gcc/

Sorry, i'm not sure how i managed to fat finger so badly I've opened an issue by mistake....

Sorry!

ValueError: mutable default <class 'Tensile.KernelWriter.ABMatrixInfo'> for field a is not allowed: use default_factory

make[2]: Leaving directory '/home/xxx/amd_workspace/hipBLASLt/build/release'
[ 5%] Built target hipblaslt-test-data
Traceback (most recent call last):
File "/home/xxx/amd_workspace/hipBLASLt/build/release/library/../virtualenv/lib/python3.11/site-packages/Tensile/bin/TensileCreateLibrary", line 30, in
from Tensile.TensileCreateLibrary import TensileCreateLibrary
ModuleNotFoundError: No module named 'Tensile'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/xxx/amd_workspace/hipBLASLt/build/release/library/../virtualenv/lib/python3.11/site-packages/Tensile/bin/TensileCreateLibrary", line 37, in
from Tensile.TensileCreateLibrary import TensileCreateLibrary
File "/home/xxx/amd_workspace/hipBLASLt/build/release/virtualenv/lib/python3.11/site-packages/Tensile/TensileCreateLibrary.py", line 40, in
from .KernelWriterAssembly import KernelWriterAssembly
File "/home/xxx/amd_workspace/hipBLASLt/build/release/virtualenv/lib/python3.11/site-packages/Tensile/KernelWriterAssembly.py", line 43, in
from .KernelWriter import KernelWriter
File "/home/xxx/amd_workspace/hipBLASLt/build/release/virtualenv/lib/python3.11/site-packages/Tensile/KernelWriter.py", line 92, in
@DataClass
^^^^^^^^^
File "/root/miniconda3/lib/python3.11/dataclasses.py", line 1230, in dataclass
return wrap(cls)
^^^^^^^^^
File "/root/miniconda3/lib/python3.11/dataclasses.py", line 1220, in wrap
return _process_class(cls, init, repr, eq, order, unsafe_hash,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.11/dataclasses.py", line 958, in _process_class
cls_fields.append(_get_field(cls, name, type, kw_only))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.11/dataclasses.py", line 815, in _get_field
raise ValueError(f'mutable default {type(f.default)} for field '
ValueError: mutable default <class 'Tensile.KernelWriter.ABMatrixInfo'> for field a is not allowed: use default_factory

How to solve this build error? really thanks.

AttributeError: 'NoneType' object has no attribute 'solutions'

Please see the details

[...]
# Found  hipcc version 6.0.32830-d62f6a171
Tensile::WARNING: Global parameter AsmDebug = False unrecognised.
# CodeObjectVersion from TensileCreateLibrary: default
# CxxCompiler       from TensileCreateLibrary: hipcc
# Architecture      from TensileCreateLibrary: gfx908:xnack+_gfx908:xnack-
# LibraryFormat     from TensileCreateLibrary: msgpack
# LibraryLogicFiles:
Reading logic files: Launching 32 threads...
Reading logic files: Done. (0.0 secs elapsed)
Traceback (most recent call last):
  File "/usr/local/src/hipBLASLt/build/release/library/../virtualenv/lib/python3.10/site-packages/Tensile/bin/TensileCreateLibrary", line 43, in <module>
    TensileCreateLibrary()
  File "/usr/local/src/hipBLASLt/build/release/virtualenv/lib/python3.10/site-packages/Tensile/TensileCreateLibrary.py", line 60, in wrapper
    res = func(*args, **kwargs)
  File "/usr/local/src/hipBLASLt/build/release/virtualenv/lib/python3.10/site-packages/Tensile/TensileCreateLibrary.py", line 1357, in TensileCreateLibrary
    solutions, masterLibraries, fullMasterLibrary = generateLogicDataAndSolutions(logicFiles, args)
  File "/usr/local/src/hipBLASLt/build/release/virtualenv/lib/python3.10/site-packages/Tensile/TensileCreateLibrary.py", line 60, in wrapper
    res = func(*args, **kwargs)
  File "/usr/local/src/hipBLASLt/build/release/virtualenv/lib/python3.10/site-packages/Tensile/TensileCreateLibrary.py", line 1125, in generateLogicDataAndSolutions
    solutions = [sol.originalSolution for _, sol in fullMasterLibrary.solutions.items()]
AttributeError: 'NoneType' object has no attribute 'solutions'
make[2]: *** [library/CMakeFiles/TENSILE_LIBRARY_TARGET.dir/build.make:74: Tensile/library/TensileManifest.txt] Error 1
make[2]: Leaving directory '/usr/local/src/hipBLASLt/build/release'
make[1]: *** [CMakeFiles/Makefile2:272: library/CMakeFiles/TENSILE_LIBRARY_TARGET.dir/all] Error 2
make[1]: Leaving directory '/usr/local/src/hipBLASLt/build/release'
make: *** [Makefile:166: all] Error 2
+ check_exit_code 2
+ ((  2 != 0  ))
+ exit 2

[Issue]: Build fails without showing any details while building TENSILE_LIBRARY_TARGET

Problem Description

tried running the following

./install.sh -idc

and got this error

#   /home/hipBLASLt/library/src/amd_detail/rocblaslt/src/Tensile/Logic/asm_full/aquavanjaram/gfx941/Equality/aquavanjaram_Cijk_Alik_Bljk_BBS_BH.yaml
#   /home/hipBLASLt/library/src/amd_detail/rocblaslt/src/Tensile/Logic/asm_full/aquavanjaram/gfx941/Equality/aquavanjaram_Cijk_Ailk_Bjlk_BBS_BH.yaml
#   /home/hipBLASLt/library/src/amd_detail/rocblaslt/src/Tensile/Logic/asm_full/aquavanjaram/gfx941/Equality/aquavanjaram_Cijk_Ailk_Bljk_F8F8S_BH.yaml
#   /home/hipBLASLt/library/src/amd_detail/rocblaslt/src/Tensile/Logic/asm_full/aquavanjaram/gfx941/Equality/aquavanjaram_Cijk_Ailk_Bljk_HSS_BH.yaml
#   /home/hipBLASLt/library/src/amd_detail/rocblaslt/src/Tensile/Logic/asm_full/aquavanjaram/gfx941/Equality/aquavanjaram_Cijk_Ailk_Bljk_F8H_HHS_BH_Bias_AS_SAB_SAV_custom.yaml
#   /home/hipBLASLt/library/src/amd_detail/rocblaslt/src/Tensile/Logic/asm_full/aquavanjaram/gfx941/Equality/aquavanjaram_Cijk_Alik_Bljk_HF8_HHS_BH.yaml
#   /home/hipBLASLt/library/src/amd_detail/rocblaslt/src/Tensile/Logic/asm_full/aquavanjaram/gfx941/Equality/aquavanjaram_Cijk_Alik_Bljk_HHS_BH.yaml
Reading logic files: Launching 32 threads...
Reading logic files: Done. (464.4 secs elapsed)
[|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||] 100% (4.1 secs elapsed)
[|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||] 100% (0.1 secs elapsed)
# Writing Custom CMake
# Writing Kernels...
Generating kernels: Launching 32 threads...
Generating kernels: Done. (1329.7 secs elapsed)
*
[|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||] 100% (159.3 secs elapsed)
Compiling source kernels: Launching 32 threads...
Compiling source kernels: Done. (259.5 secs elapsed)
# Check if generated files exists.
# Tensile Library Writer DONE
################################################################################

[  6%] TENSILE_LIBRARY_TARGET
make[2]: Leaving directory '/home/hipBLASLt/build/release'
[  6%] Built target TENSILE_LIBRARY_TARGET
make[1]: Leaving directory '/home/hipBLASLt/build/release'
make: *** [Makefile:166: all] Error 2
root@test:/home/hipBLASLt#

Operating System

Ubuntu 22.04.3 LTS (Jammy Jellyfish)

CPU

AMD EPYC 9534 64-Core Processor

GPU

Other

MI300X

ROCm Version

ROCm 5.7.1

ROCm Component

No response

Steps to Reproduce

No response

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

ROCk module version 6.5.4 is loaded
=====================
HSA System Attributes
=====================
Runtime Version:         1.1
Runtime Ext Version:     1.4
System Timestamp Freq.:  1000.000000MHz
Sig. Max Wait Duration:  18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model:           LARGE
System Endianness:       LITTLE
Mwaitx:                  DISABLED
DMAbuf Support:          YES

==========
HSA Agents
==========
*******
Agent 1
*******
  Name:                    AMD EPYC 9534 64-Core Processor
  Uuid:                    CPU-XX
  Marketing Name:          AMD EPYC 9534 64-Core Processor
  Vendor Name:             CPU
  Feature:                 None specified
  Profile:                 FULL_PROFILE
  Float Round Mode:        NEAR
  Max Queue Number:        0(0x0)
  Queue Min Size:          0(0x0)
  Queue Max Size:          0(0x0)
  Queue Type:              MULTI
  Node:                    0

  SIMDs per CU:            0
  Shader Engines:          0
  Shader Arrs. per Eng.:   0
  WatchPts on Addr. Ranges:1
  Features:                None
  Pool Info:
    Pool 1
      Segment:                 GLOBAL; FLAGS: FINE GRAINED
      Size:                    792330184(0x2f39ffc8) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Recommended Granule:4KB
      Alloc Alignment:         4KB
      Accessible by all:       TRUE
    Pool 2
      Segment:                 GLOBAL; FLAGS: KERNARG, FINE GRAINED
      Size:                    792330184(0x2f39ffc8) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Recommended Granule:4KB
      Alloc Alignment:         4KB
      Accessible by all:       TRUE
    Pool 3
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED
      Size:                    792330184(0x2f39ffc8) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Recommended Granule:4KB
      Alloc Alignment:         4KB
      Accessible by all:       TRUE
  ISA Info:
*******
Agent 2
*******
  Name:                    AMD EPYC 9534 64-Core Processor
  Uuid:                    CPU-XX
  Marketing Name:          AMD EPYC 9534 64-Core Processor
  Vendor Name:             CPU
  Feature:                 None specified
  Profile:                 FULL_PROFILE
  Float Round Mode:        NEAR
  Max Queue Number:        0(0x0)
  Queue Min Size:          0(0x0)
  Queue Max Size:          0(0x0)
  Queue Type:              MULTI
  Node:                    1
  Device Type:             CPU
  Cache Info:
    L1:                      32768(0x8000) KB
  Chip ID:                 0(0x0)
  ASIC Revision:           0(0x0)
  Cacheline Size:          64(0x40)
  Max Clock Freq. (MHz):   2450
  BDFID:                   0
  Internal Node ID:        1
  Compute Unit:            112
  SIMDs per CU:            0
  Shader Engines:          0
  Shader Arrs. per Eng.:   0
      Accessible by all:       TRUE
    Pool 2
      Segment:                 GLOBAL; FLAGS: KERNARG, FINE GRAINED
      Size:                    792638584(0x2f3eb478) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Recommended Granule:4KB
      Alloc Alignment:         4KB
      Accessible by all:       TRUE
    Pool 3
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED
      Size:                    792638584(0x2f3eb478) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Recommended Granule:4KB
      Alloc Alignment:         4KB
      Accessible by all:       TRUE
  ISA Info:
*******
Agent 3
*******
  Name:                    gfx942
  Uuid:                    GPU-3088f9ba5d22a8e2
  Marketing Name:          AMD Instinct MI300X
  Vendor Name:             AMD
  Feature:                 KERNEL_DISPATCH
  Profile:                 BASE_PROFILE
  Float Round Mode:        NEAR
  Max Queue Number:        128(0x80)
  Queue Min Size:          64(0x40)
  Queue Max Size:          131072(0x20000)
  Queue Type:              MULTI
  Node:                    2
  Device Type:             GPU
  Cache Info:
    L1:                      32(0x20) KB
    L2:                      32768(0x8000) KB
    L3:                      262144(0x40000) KB
  Chip ID:                 29857(0x74a1)
  ASIC Revision:           1(0x1)
  Cacheline Size:          64(0x40)
  Max Clock Freq. (MHz):   2100
  BDFID:                   25856
  Internal Node ID:        2
  Compute Unit:            304
  SIMDs per CU:            4
  Shader Engines:          32
  Shader Arrs. per Eng.:   1
  WatchPts on Addr. Ranges:4
  Coherent Host Access:    FALSE
  Features:                KERNEL_DISPATCH
  Fast F16 Operation:      TRUE
  Wavefront Size:          64(0x40)
  Workgroup Max Size:      1024(0x400)
  Workgroup Max Size per Dimension:
    x                        1024(0x400)
    y                        1024(0x400)
    z                        1024(0x400)
  M
  SDMA engine uCode::      19
  IOMMU Support::          None
  Pool Info:
    Pool 1
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED
      Size:                    201310208(0xbffc000) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Recommended Granule:2048KB
      Alloc Alignment:         4KB
      Accessible by all:       FALSE
    Pool 2
      Segment:                 GLOBAL; FLAGS: EXTENDED FINE GRAINED
      Size:                    201310208(0xbffc000) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Recommended Granule:2048KB
      Alloc Alignment:         4KB
      Accessible by all:       FALSE
    Pool 3
      Segment:                 GLOBAL; FLAGS: FINE GRAINED
      Size:                    201310208(0xbffc000) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Recommended Granule:2048KB
      Alloc Alignment:         4KB
      Accessible by all:       FALSE
    Pool 4
      Segment:                 GROUP
      Size:                    64(0x40) KB
      Allocatable:             FALSE
      Alloc Granule:           0KB
      Alloc Recommended Granule:0KB
      Alloc Alignment:         0KB
      Accessible by all:       FALSE
  ISA Info:
    ISA 1
      Name:                    amdgcn-amd-amdhsa--gfx942:sramecc+:xnack-
      Machine Models:          HSA_MACHINE_MODEL_LARGE
      Profiles:                HSA_PROFILE_BASE
      Default Rounding Mode:   NEAR
      Default Rounding Mode:   NEAR
      Fast f16:                TRUE
      Workgroup Max Size:      1024(0x400)
      Workgroup Max Size per Dimension:
        x                        1024(0x400)
        y                        1024(0x400)
        z                        1024(0x400)
      Grid Max Size:           4294967295(0xffffffff)
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)
        y                        4294967295(0xffffffff)
        z                        4294967295(0xffffffff)
      FBarrier Max Size:       32
*******
Agent 4
*******
  Name:                    gfx942
  Uuid:                    GPU-cf0e063
  Queue Type:              MULTI
  Node:                    3
  Device Type:             GPU
  Cache Info:
    L1:                      32(0x20) KB
    L2:                      32768(0x8000) KB
    L3:                      262144(0x40000) KB
  Chip ID:                 29857(0x74a1)
  ASIC Revision:           1(0x1)
  Cacheline Size:          64(0x40)
  Max Clock Freq. (MHz):   2100
  BDFID:                   17920
  Internal Node ID:        3
  Compute Unit:            304
  SIMDs per CU:            4
  Shader Engines:          32
  Shader Arrs. per Eng.:   1
  WatchPts on Addr. Ranges:4
  Coherent Host Access:    FALSE
  Features:                KERNEL_DISPATCH
  Fast F16 Operation:      TRUE
  Wavefront Size:          64(0x40)
  Workgroup Max Size:      1024(0x400)
  Workgroup Max Size per Dimension:
    x                        1024(0x400)
    y                        1024(0x400)
    z                        1024(0x400)
  Max Waves Per CU:        32(0x20)
  Max Work-item Per CU:    2048(0x800)
  Grid Max Size:           4294967295(0xffffffff)
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)
    y                        4294967295(0xffffffff)
    z                        4294967295(0xffffffff)
  Max fbarriers/Workgrp:   32
  Packet Processor uCode:: 136
  SDMA engine uCode::      19
  IOMMU Support::          None
  Pool Info:
    Pool 1
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED
      Size:                    201310208(0xbffc000) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Recommended Granule:2048KB
      Alloc Alignment:         4KB
      Accessible by all:       FALSE
    Pool 2
      Segment:                 GLOBAL; FLAGS: EXTENDED FINE GRAINED
      Size:                    201310208(0xbffc000) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Recommended Granule:2048KB
      Alloc Alignment:         4KB
      Accessible by all:       FALSE
    Pool 3
      Segment:                 GLOBAL; FLAGS: FINE GRAINED
      Size:                    201310208(0xbffc000) KB
      A
      Allocatable:             FALSE
      Alloc Granule:           0KB
      Alloc Recommended Granule:0KB
      Alloc Alignment:         0KB
      Accessible by all:       FALSE
  ISA Info:
    ISA 1
      Name:                    amdgcn-amd-amdhsa--gfx942:sramecc+:xnack-
      Machine Models:          HSA_MACHINE_MODEL_LARGE
      Profiles:                HSA_PROFILE_BASE
      Default Rounding Mode:   NEAR
      Default Rounding Mode:   NEAR
      Fast f16:                TRUE
      Workgroup Max Size:      1024(0x400)
      Workgroup Max Size per Dimension:
        x                        1024(0x400)
        y                        1024(0x400)
        z                        1024(0x400)
      Grid Max Size:           4294967295(0xffffffff)
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)
        y                        4294967295(0xffffffff)
        z                        4294967295(0xffffffff)
      FBarrier Max Size:       32
*******
Agent 5
*******
  Name:                    gfx942
  Uuid:                    GPU-1192aa9ad9f51fe8
  Marketing Name:          AMD Instinct MI300X
  Vendor Name:             AMD
  Feature:                 KERNEL_DISPATCH
  Profile:                 BASE_PROFILE
  Float Round Mode:        NEAR
  Max Queue Number:        128(0x80)
  Queue Min Size:          64(0x40)
  Queue Max Size:          131072(0x20000)
  Queue Type:              MULTI
  Node:                    4
  Device Type:             GPU
  Cache Info:
    L1:                      32(0x20) KB
    L2:                      32768(0x8000) KB
    L3:                      262144(0x40000) KB
  Chip ID:                 29857(0x74a1)
  ASIC Revision:           1(0x1)
  Cacheline Size:          64(0x40)
  Max Clock Freq. (MHz):   2100
  BDFID:                   1280
  Internal Node ID:        4
  Compute Unit:            304
  SIMDs per CU:            4
  Shader Engines:          32
  Shader Arrs. per Eng.:   1
  WatchPts on Addr. Ranges:4
  Coherent Host Access:    FALSE
  Features:                KERNEL_DISPATCH
  Fast F16 Operation:      TRUE
  Wavefront Size:          64(0x40)
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)
    y                        4294967295(0xffffffff)
    z                        4294967295(0xffffffff)
  Max fbarriers/Workgrp:   32
  Packet Processor uCode:: 136
  SDMA engine uCode::      19
  IOMMU Support::          None
  Pool Info:
    Pool 1
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED
      Size:                    201310208(0xbffc000) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Recommended Granule:2048KB
      Alloc Alignment:         4KB
      Accessible by all:       FALSE
    Pool 2
      Segment:                 GLOBAL; FLAGS: EXTENDED FINE GRAINED
      Size:                    201310208(0xbffc000) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Recommended Granule:2048KB
      Alloc Alignment:         4KB
      Accessible by all:       FALSE
    Pool 3
      Segment:                 GLOBAL; FLAGS: FINE GRAINED
      Size:                    201310208(0xbffc000) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Recommended Granule:2048KB
      Alloc Alignment:         4KB
      Accessible by all:       FALSE
    Pool 4
      Segment:                 GROUP
      Size:                    64(0x40) KB
      Allocatable:             FALSE
      Alloc Granule:           0KB
      Alloc Recommended Granule:0KB
      Alloc Alignment:         0KB
      Accessible by all:       FALSE
  ISA Info:
    ISA 1
      Name:                    amdgcn-amd-amdhsa--gfx942:sramecc+:xnack-
      Machine Models:          HSA_MACHINE_MODEL_LARGE
      Profiles:                HSA_PROFILE_BASE
      Default Rounding Mode:   NEAR
      Default Rounding Mode:   NEAR
      Fast f16:                TRUE
      Workgroup Max Size:      1024(0x400)
      Workgroup Max Size per Dimension:
        x                        1024(0x400)
        y                        1024(0x400)
        z                        1024(0x400)
      Grid Max Size:           4294967295(0xffffffff)
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)
        y                        4294967295(0xffffffff)
        z
  Profile:                 BASE_PROFILE
  Float Round Mode:        NEAR
  Max Queue Number:        128(0x80)
  Queue Min Size:          64(0x40)
  Queue Max Size:          131072(0x20000)
  Queue Type:              MULTI
  Node:                    5
  Device Type:             GPU
  Cache Info:
    L1:                      32(0x20) KB
    L2:                      32768(0x8000) KB
    L3:                      262144(0x40000) KB
  Chip ID:                 29857(0x74a1)
  ASIC Revision:           1(0x1)
  Cacheline Size:          64(0x40)
  Max Clock Freq. (MHz):   2100
  BDFID:                   9728
  Internal Node ID:        5
  Compute Unit:            304
  SIMDs per CU:            4
  Shader Engines:          32
  Shader Arrs. per Eng.:   1
  WatchPts on Addr. Ranges:4
  Coherent Host Access:    FALSE
  Features:                KERNEL_DISPATCH
  Fast F16 Operation:      TRUE
  Wavefront Size:          64(0x40)
  Workgroup Max Size:      1024(0x400)
  Workgroup Max Size per Dimension:
    x                        1024(0x400)
    y                        1024(0x400)
    z                        1024(0x400)
  Max Waves Per CU:        32(0x20)
  Max Work-item Per CU:    2048(0x800)
  Grid Max Size:           4294967295(0xffffffff)
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)
    y                        4294967295(0xffffffff)
    z                        4294967295(0xffffffff)
  Max fbarriers/Workgrp:   32
  Packet Processor uCode:: 136
  SDMA engine uCode::      19
  IOMMU Support::          None
  Pool Info:
    Pool 1
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED
      Size:                    201310208(0xbffc000) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Recommended Granule:2048KB
      Alloc Alignment:         4KB
      Accessible by all:       FALSE
    Pool 2
      Segment:                 GLOBAL; FLAGS: EXTENDED FINE GRAINED
      Size:                    201310208(0xbffc000) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Recommended Granule:2048KB

      Alloc Alignment:         4KB
      Accessible by all:       FALSE
    Pool 4
      Segment:                 GROUP
      Size:                    64(0x40) KB
      Allocatable:             FALSE
      Alloc Granule:           0KB
      Alloc Recommended Granule:0KB
      Alloc Alignment:         0KB
      Accessible by all:       FALSE
  ISA Info:
    ISA 1
      Name:                    amdgcn-amd-amdhsa--gfx942:sramecc+:xnack-
      Machine Models:          HSA_MACHINE_MODEL_LARGE
      Profiles:                HSA_PROFILE_BASE
      Default Rounding Mode:   NEAR
      Default Rounding Mode:   NEAR
      Fast f16:                TRUE
      Workgroup Max Size:      1024(0x400)
      Workgroup Max Size per Dimension:
        x                        1024(0x400)
        y                        1024(0x400)
        z                        1024(0x400)
      Grid Max Size:           4294967295(0xffffffff)
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)
        y                        4294967295(0xffffffff)
        z                        4294967295(0xffffffff)
      FBarrier Max Size:       32
*******
Agent 7
*******
  Name:                    gfx942
  Uuid:                    GPU-e872eb353b69db87
  Marketing Name:          AMD Instinct MI300X
  Vendor Name:             AMD
  Feature:                 KERNEL_DISPATCH
  Profile:                 BASE_PROFILE
  Float Round Mode:        NEAR
  Max Queue Number:        128(0x80)
  Queue Min Size:          64(0x40)
  Queue Max Size:          131072(0x20000)
  Queue Type:              MULTI
  Node:                    6
  Device Type:             GPU
  Cache Info:
    L1:                      32(0x20) KB
    L2:                      32768(0x8000) KB
    L3:                      262144(0x40000) KB
  Chip ID:                 29857(0x74a1)
  ASIC Revision:           1(0x1)
  Cacheline Size:          64(0x40)
  Max Clock Freq. (MHz):   2100
  BDFID:                   58624
  Internal Node ID:        6
  Compute Unit:            304
  SIMDs per CU:            4
  Shader Engines:          32
  Shader Arrs. per Eng.:   1
    y                        1024(0x400)
    z                        1024(0x400)
  Max Waves Per CU:        32(0x20)
  Max Work-item Per CU:    2048(0x800)
  Grid Max Size:           4294967295(0xffffffff)
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)
    y                        4294967295(0xffffffff)
    z                        4294967295(0xffffffff)
  Max fbarriers/Workgrp:   32
  Packet Processor uCode:: 136
  SDMA engine uCode::      19
  IOMMU Support::          None
  Pool Info:
    Pool 1
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED
      Size:                    201310208(0xbffc000) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Recommended Granule:2048KB
      Alloc Alignment:         4KB
      Accessible by all:       FALSE
    Pool 2
      Segment:                 GLOBAL; FLAGS: EXTENDED FINE GRAINED
      Size:                    201310208(0xbffc000) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Recommended Granule:2048KB
      Alloc Alignment:         4KB
      Accessible by all:       FALSE
    Pool 3
      Segment:                 GLOBAL; FLAGS: FINE GRAINED
      Size:                    201310208(0xbffc000) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Recommended Granule:2048KB
      Alloc Alignment:         4KB
      Accessible by all:       FALSE
    Pool 4
      Segment:                 GROUP
      Size:                    64(0x40) KB
      Allocatable:             FALSE
      Alloc Granule:           0KB
      Alloc Recommended Granule:0KB
      Alloc Alignment:         0KB
      Accessible by all:       FALSE
  ISA Info:
    ISA 1
      Name:                    amdgcn-amd-amdhsa--gfx942:sramecc+:xnack-
      Machine Models:          HSA_MACHINE_MODEL_LARGE
      Profiles:                HSA_PROFILE_BASE
      Default Rounding Mode:   NEAR
      Default Rounding Mode:   NEAR
      Fast f16:                TRUE
      Workgroup Max Size:      1024(0x400)
      Workgroup Max Size per Dimension:
        x                        1024(0x400)
        y                        1024(0x400)
        z
  Name:                    gfx942
  Uuid:                    GPU-13b763cbf1011d40
  Marketing Name:          AMD Instinct MI300X
  Vendor Name:             AMD
  Feature:                 KERNEL_DISPATCH
  Profile:                 BASE_PROFILE
  Float Round Mode:        NEAR
  Max Queue Number:        128(0x80)
  Queue Min Size:          64(0x40)
  Queue Max Size:          131072(0x20000)
  Queue Type:              MULTI
  Node:                    7
  Device Type:             GPU
  Cache Info:
    L1:                      32(0x20) KB
    L2:                      32768(0x8000) KB
    L3:                      262144(0x40000) KB
  Chip ID:                 29857(0x74a1)
  ASIC Revision:           1(0x1)
  Cacheline Size:          64(0x40)
  Max Clock Freq. (MHz):   2100
  BDFID:                   50688
  Internal Node ID:        7
  Compute Unit:            304
  SIMDs per CU:            4
  Shader Engines:          32
  Shader Arrs. per Eng.:   1
  WatchPts on Addr. Ranges:4
  Coherent Host Access:    FALSE
  Features:                KERNEL_DISPATCH
  Fast F16 Operation:      TRUE
  Wavefront Size:          64(0x40)
  Workgroup Max Size:      1024(0x400)
  Workgroup Max Size per Dimension:
    x                        1024(0x400)
    y                        1024(0x400)
    z                        1024(0x400)
  Max Waves Per CU:        32(0x20)
  Max Work-item Per CU:    2048(0x800)
  Grid Max Size:           4294967295(0xffffffff)
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)
    y                        4294967295(0xffffffff)
    z                        4294967295(0xffffffff)
  Max fbarriers/Workgrp:   32
  Packet Processor uCode:: 136
  SDMA engine uCode::      19
  IOMMU Support::          None
  Pool Info:
    Pool 1
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED
      Size:                    201310208(0xbffc000) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Recommended Granule:2048KB
      Alloc Alignment:         4KB
      Accessible by all:       FALSE
    Pool 2
      Segment:                 GLOBAL; FLA
      Size:                    201310208(0xbffc000) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Recommended Granule:2048KB
      Alloc Alignment:         4KB
      Accessible by all:       FALSE
    Pool 4
      Segment:                 GROUP
      Size:                    64(0x40) KB
      Allocatable:             FALSE
      Alloc Granule:           0KB
      Alloc Recommended Granule:0KB
      Alloc Alignment:         0KB
      Accessible by all:       FALSE
  ISA Info:
    ISA 1
      Name:                    amdgcn-amd-amdhsa--gfx942:sramecc+:xnack-
      Machine Models:          HSA_MACHINE_MODEL_LARGE
      Profiles:                HSA_PROFILE_BASE
      Default Rounding Mode:   NEAR
      Default Rounding Mode:   NEAR
      Fast f16:                TRUE
      Workgroup Max Size:      1024(0x400)
      Workgroup Max Size per Dimension:
        x                        1024(0x400)
        y                        1024(0x400)
        z                        1024(0x400)
      Grid Max Size:           4294967295(0xffffffff)
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)
        y                        4294967295(0xffffffff)
        z                        4294967295(0xffffffff)
      FBarrier Max Size:       32
*******
Agent 9
*******
  Name:                    gfx942
  Uuid:                    GPU-acd560d0f22afbdd
  Marketing Name:          AMD Instinct MI300X
  Vendor Name:             AMD
  Feature:                 KERNEL_DISPATCH
  Profile:                 BASE_PROFILE
  Float Round Mode:        NEAR
  Max Queue Number:        128(0x80)
  Queue Min Size:          64(0x40)
  Queue Max Size:          131072(0x20000)
  Queue Type:              MULTI
  Node:                    8
  Device Type:             GPU
  Cache Info:
    L1:                      32(0x20) KB
    L2:                      32768(0x8000) KB
    L3:                      262144(0x40000) KB
  Chip ID:                 29857(0x74a1)
  ASIC Revision:           1(0x1)
  Cacheline Size:          64(0x40)
  Max Clock Freq. (MHz):   2100
  BDFID:                   34048
  Intern
  Fast F16 Operation:      TRUE
  Wavefront Size:          64(0x40)
  Workgroup Max Size:      1024(0x400)
  Workgroup Max Size per Dimension:
    x                        1024(0x400)
    y                        1024(0x400)
    z                        1024(0x400)
  Max Waves Per CU:        32(0x20)
  Max Work-item Per CU:    2048(0x800)
  Grid Max Size:           4294967295(0xffffffff)
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)
    y                        4294967295(0xffffffff)
    z                        4294967295(0xffffffff)
  Max fbarriers/Workgrp:   32
  Packet Processor uCode:: 136
  SDMA engine uCode::      19
  IOMMU Support::          None
  Pool Info:
    Pool 1
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED
      Size:                    201310208(0xbffc000) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Recommended Granule:2048KB
      Alloc Alignment:         4KB
      Accessible by all:       FALSE
    Pool 2
      Segment:                 GLOBAL; FLAGS: EXTENDED FINE GRAINED
      Size:                    201310208(0xbffc000) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Recommended Granule:2048KB
      Alloc Alignment:         4KB
      Accessible by all:       FALSE
    Pool 3
      Segment:                 GLOBAL; FLAGS: FINE GRAINED
      Size:                    201310208(0xbffc000) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Recommended Granule:2048KB
      Alloc Alignment:         4KB
      Accessible by all:       FALSE
    Pool 4
      Segment:                 GROUP
      Size:                    64(0x40) KB
      Allocatable:             FALSE
      Alloc Granule:           0KB
      Alloc Recommended Granule:0KB
      Alloc Alignment:         0KB
      Accessible by all:       FALSE
  ISA Info:
    ISA 1
      Name:                    amdgcn-amd-amdhsa--gfx942:sramecc+:xnack-
      Machine Models:          HSA_MACHINE_MODEL_LARGE
      Profiles:                HSA_PROFILE_BASE
      Default Rounding Mode:   NEAR
      Default Rounding Mode:   NEAR
      Fast f16:                TRUE
        z                        4294967295(0xffffffff)
      FBarrier Max Size:       32
*******
Agent 10
*******
  Name:                    gfx942
  Uuid:                    GPU-896b648ee58968ee
  Marketing Name:          AMD Instinct MI300X
  Vendor Name:             AMD
  Feature:                 KERNEL_DISPATCH
  Profile:                 BASE_PROFILE
  Float Round Mode:        NEAR
  Max Queue Number:        128(0x80)
  Queue Min Size:          64(0x40)
  Queue Max Size:          131072(0x20000)
  Queue Type:              MULTI
  Node:                    9
  Device Type:             GPU
  Cache Info:
    L1:                      32(0x20) KB
    L2:                      32768(0x8000) KB
    L3:                      262144(0x40000) KB
  Chip ID:                 29857(0x74a1)
  ASIC Revision:           1(0x1)
  Cacheline Size:          64(0x40)
  Max Clock Freq. (MHz):   2100
  BDFID:                   42496
  Internal Node ID:        9
  Compute Unit:            304
  SIMDs per CU:            4
  Shader Engines:          32
  Shader Arrs. per Eng.:   1
  WatchPts on Addr. Ranges:4
  Coherent Host Access:    FALSE
  Features:                KERNEL_DISPATCH
  Fast F16 Operation:      TRUE
  Wavefront Size:          64(0x40)
  Workgroup Max Size:      1024(0x400)
  Workgroup Max Size per Dimension:
    x                        1024(0x400)
    y                        1024(0x400)
    z                        1024(0x400)
  Max Waves Per CU:        32(0x20)
  Max Work-item Per CU:    2048(0x800)
  Grid Max Size:           4294967295(0xffffffff)
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)
    y                        4294967295(0xffffffff)
    z                        4294967295(0xffffffff)
  Max fbarriers/Workgrp:   32
  Packet Processor uCode:: 136
  SDMA engine uCode::      19
  IOMMU Support::          None
  Pool Info:
    Pool 1
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED
      Size:                    201310208(0xbffc000) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Recommended Granule:2048KB
      Alloc Alignment:         4KB
      Accessible by all:       FALSE
    Pool 2
      Segment:                 GLOBAL; FLAGS: EXTENDED FINE GRAINED
      Size:                    201310208(0xbffc000) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Recommended Granule:2048KB
      Alloc Alignment:         4KB
      Accessible by all:       FALSE
    Pool 3
      Segment:                 GLOBAL; FLAGS: FINE GRAINED
      Size:                    201310208(0xbffc000) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Recommended Granule:2048KB
      Alloc Alignment:         4KB
      Accessible by all:       FALSE
    Pool 4
      Segment:                 GROUP
      Size:                    64(0x40) KB
      Allocatable:             FALSE
      Alloc Granule:           0KB
      Alloc Recommended Granule:0KB
      Alloc Alignment:         0KB
      Accessible by all:       FALSE
  ISA Info:
    ISA 1
      Name:                    amdgcn-amd-amdhsa--gfx942:sramecc+:xnack-
      Machine Models:          HSA_MACHINE_MODEL_LARGE
      Profiles:                HSA_PROFILE_BASE
      Default Rounding Mode:   NEAR
      Default Rounding Mode:   NEAR
      Fast f16:                TRUE
      Workgroup Max Size:      1024(0x400)
      Workgroup Max Size per Dimension:
        x                        1024(0x400)
        y                        1024(0x400)
        z                        1024(0x400)
      Grid Max Size:           4294967295(0xffffffff)
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)
        y                        4294967295(0xffffffff)
        z                        4294967295(0xffffffff)
      FBarrier Max Size:       32
*** Done ***

Additional Information

No response

hipBlastLT build failed with msgpack error even though it is installed

Using ROCm-6.0 release

pip3 list | grep -i msgpack
msgpack 1.0.8

Also did pip install -r .sphinx/requirements as instructed in the README.


./install.sh --clients
Creating project build directory in: /root/extdir/gg/git/codelab-scripts/build-install-scripts/rocm/ROCm-6.0/hipBLASLt/build
/home/rocm/ROCm-6.0/hipBLASLt /home/rocm/ROCm-6.0/hipBLASLt
/home/rocm/ROCm-6.0/hipBLASLt/build/release /home/rocm/ROCm-6.0/hipBLASLt/build/release /home/rocm/ROCm-6.0/hipBLASLt
/home/rocm/ROCm-6.0/hipBLASLt/build/release /home/rocm/ROCm-6.0/hipBLASLt
-DAMDGPU_TARGETS=all -DCMAKE_BUILD_TYPE=Release
-- The CXX compiler identification is Clang 17.0.0
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /opt/rocm/bin/hipcc - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Using hip-clang to build for amdgpu backend

*******************************************************************************
*------------------------------- ROCMChecks WARNING --------------------------*
  Options and properties should be set on a cmake target where possible. The
  variable 'CMAKE_CXX_FLAGS' may be set by the cmake toolchain, either by
  calling 'cmake -DCMAKE_CXX_FLAGS=" -D__HIP_HCC_COMPAT_MODE__=1"'
  or set in a toolchain file and added with
  'cmake -DCMAKE_TOOLCHAIN_FILE=<toolchain-file>'. ROCMChecks now calling:
CMake Warning at /opt/rocm/share/rocm/cmake/ROCMChecks.cmake:46 (message):
  'CMAKE_CXX_FLAGS' is set at
  /root/extdir/gg/git/codelab-scripts/build-install-scripts/rocm/ROCm-6.0/hipBLASLt/CMakeLists.txt:<line#>
  shown below:
Call Stack (most recent call first):
  CMakeLists.txt:9223372036854775807 (rocm_check_toolchain_var)
  CMakeLists.txt:130 (set)


*-----------------------------------------------------------------------------*
*******************************************************************************

-- Found Git: /usr/bin/git (found version "2.43.0")
CMake Warning (dev) at cmake/findBLIS.cmake:41 (set):
  Cannot set "BLIS_FOUND": current scope has no parent.
Call Stack (most recent call first):
  CMakeLists.txt:141 (include)
This warning is for project developers.  Use -Wno-dev to suppress it.

CMake Warning (dev) at cmake/findBLIS.cmake:42 (set):
  Cannot set "BLIS_INCLUDE_DIR": current scope has no parent.
Call Stack (most recent call first):
  CMakeLists.txt:141 (include)
This warning is for project developers.  Use -Wno-dev to suppress it.

CMake Warning (dev) at cmake/findBLIS.cmake:43 (set):
  Cannot set "BLIS_LIB": current scope has no parent.
Call Stack (most recent call first):
  CMakeLists.txt:141 (include)
This warning is for project developers.  Use -Wno-dev to suppress it.

BLIS heeader directory found: /root/extdir/gg/git/codelab-scripts/build-install-scripts/rocm/ROCm-6.0/hipBLASLt/build/deps/blis/include/blis
BLIS lib found: /root/extdir/gg/git/codelab-scripts/build-install-scripts/rocm/ROCm-6.0/hipBLASLt/build/deps/blis/lib/libblis.a
-- Performing Test COMPILER_HAS_TARGET_ID_gfx90a_xnack_on
-- Performing Test COMPILER_HAS_TARGET_ID_gfx90a_xnack_on - Success
-- Performing Test COMPILER_HAS_TARGET_ID_gfx90a_xnack_off
-- Performing Test COMPILER_HAS_TARGET_ID_gfx90a_xnack_off - Success
-- Performing Test COMPILER_HAS_TARGET_ID_gfx940
-- Performing Test COMPILER_HAS_TARGET_ID_gfx940 - Success
-- Performing Test COMPILER_HAS_TARGET_ID_gfx941
-- Performing Test COMPILER_HAS_TARGET_ID_gfx941 - Success
-- Performing Test COMPILER_HAS_TARGET_ID_gfx942
-- Performing Test COMPILER_HAS_TARGET_ID_gfx942 - Success
-- AMDGPU_TARGETS: gfx90a:xnack+;gfx90a:xnack-;gfx940;gfx941;gfx942
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Success
-- Found Threads: TRUE
-- Performing Test HIP_CLANG_SUPPORTS_PARALLEL_JOBS
-- Performing Test HIP_CLANG_SUPPORTS_PARALLEL_JOBS - Success
/usr/bin/python3 -m venv /root/extdir/gg/git/codelab-scripts/build-install-scripts/rocm/ROCm-6.0/hipBLASLt/build/release/virtualenv --system-site-packages --clear
/home/rocm/ROCm-6.0/hipBLASLt/build/release/virtualenv/bin/python3 -m pip install /home/rocm/ROCm-6.0/hipBLASLt/tensilelite
Processing /home/rocm/ROCm-6.0/hipBLASLt/tensilelite
  DEPRECATION: A future pip version will change local packages to be built in-place without first copying to a temporary directory. We recommend you use --use-feature=in-tree-build to test your packages with this new behavior before it becomes the default.
   pip 21.3 will remove support for this functionality. You can find discussion regarding this at https://github.com/pypa/pip/issues/7555.
Requirement already satisfied: pyyaml in /usr/local/lib64/python3.9/site-packages (from Tensile==4.33.0) (6.0)
Requirement already satisfied: msgpack in /usr/local/lib64/python3.9/site-packages (from Tensile==4.33.0) (1.0.8)
Requirement already satisfied: joblib in /usr/local/lib/python3.9/site-packages (from Tensile==4.33.0) (1.3.2)
Using legacy 'setup.py install' for Tensile, since package 'wheel' is not installed.
Installing collected packages: Tensile
    Running setup.py install for Tensile: started
    Running setup.py install for Tensile: finished with status 'done'
Successfully installed Tensile-4.33.0
WARNING: You are using pip version 21.2.3; however, version 24.0 is available.
You should consider upgrading via the '/home/rocm/ROCm-6.0/hipBLASLt/build/release/virtualenv/bin/python3 -m pip install --upgrade pip' command.
-- Adding /home/rocm/ROCm-6.0/hipBLASLt/build/release/virtualenv to CMAKE_PREFIX_PATH
-- The C compiler identification is GNU 11.4.1
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
CMake Error at build/release/virtualenv/lib64/python3.9/site-packages/Tensile/Source/lib/CMakeLists.txt:105 (find_package):
  By not providing "Findmsgpack.cmake" in CMAKE_MODULE_PATH this project has
  asked CMake to find a package configuration file provided by "msgpack", but
  CMake did not find one.

  Could not find a package configuration file provided by "msgpack" with any
  of the following names:

    msgpackConfig.cmake
    msgpack-config.cmake

  Add the installation prefix of "msgpack" to CMAKE_PREFIX_PATH or set
  "msgpack_DIR" to a directory containing one of the above files.  If
  "msgpack" provides a separate development package or SDK, be sure it has
  been installed.

No mechanism to turn header off in example_hipblaslt_preference

https://github.com/ROCmSoftwarePlatform/hipBLASLt/blob/464a1e7d58e1d3d2eeb72a7df866bb0ed36b6a1d/clients/samples/example_hipblaslt_preference.cpp#L265

Install guide for non-root user

Can you please provide build instructions for non-root users ?


Successfully installed Tensile-4.33.0 msgpack-1.0.7

[notice] A new release of pip is available: 23.0.1 -> 24.0
[notice] To update, run: python3 -m pip install --upgrade pip
-- Adding /path/to/hipBLASLt/build/virtualenv to CMAKE_PREFIX_PATH
-- The C compiler identification is GNU 11.4.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
CMake Error at build/virtualenv/lib/python3.10/site-packages/Tensile/Source/lib/CMakeLists.txt:105 (find_package):
  By not providing "Findmsgpack.cmake" in CMAKE_MODULE_PATH this project has
  asked CMake to find a package configuration file provided by "msgpack", but
  CMake did not find one.

  Could not find a package configuration file provided by "msgpack" with any
  of the following names:

    msgpackConfig.cmake
    msgpack-config.cmake

  Add the installation prefix of "msgpack" to CMAKE_PREFIX_PATH or set
  "msgpack_DIR" to a directory containing one of the above files.  If
  "msgpack" provides a separate development package or SDK, be sure it has
  been installed.

cublasLtMatrixTransform() equivalent

Hello,
I did not find any reference of an equivalent of the function cublasLtMatrixTransform() in hipBLASLt.
Is there any plan to make one available ?

[Issue]: hipBLASLt support for more GPUs for PyTorch with ROCm 5.7 or later

Problem Description

PyTorch now requires hipBLASLt now when building with ROCm 5.7 or later, but hipBLASLt supports only gfx90a GPUs.

https://github.com/pytorch/pytorch/blob/84b2a323594bc7c4b47d61223b3f6466fe054416/cmake/public/LoadHIP.cmake#L158-L160

Is it means other GPUs (e.g., MI100) can not use PyTorch with the latest ROCm 6.0 release?

Operating System

Ubuntu 22.04.3 LTS

CPU

AMD EPYC 7773X

GPU

AMD Instinct MI100

Other

No response

ROCm Version

ROCm 5.7.1

ROCm Component

Pytorch

Steps to Reproduce

No response

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

No response

Additional Information

No response

hipblasLtMatmul api usage

I would like to ask whether it is necessary to callhipblasLtMatmulAlgoGetHeuristic() to search algo before executing hipblasLtMatmul()?
When I set the three parameters of hipblasLtMatmulAlgo_t *algo, void *workspace, size_t workspaceSizeInBytes of hipblasLtMatmul() to nullptr, nullptr, and 0 respectively, the segment fault occurred.

[Feature-request] support amd gpus without matrix cores in rocBLASLt

Should be doable to "emulate" these in the GCN5.0/RDNA2+ ISA

[Issue]: install.sh is not friendly with `&>` log file redirection

Problem Description

Hi,

Building hipblaslt, I get a line:

`#29 666.5 # Writing Custom CMake
#29 666.5 # Writing Kernels...
#29 668.8 Generating kernels: Launching 64 threads...
#29 1051.9 Generating kernels: Done. (383.1 secs elapsed)
#29 1051.9 �|�/�-�\�|�/�-�\�|�/�-�\�|�/�-�\�|�/�-�\�|�/�-�\�|�/�-�\�|�/�-�\�|�/�-�\�|�/�-�\�|�/�-�\�|�/�-�\�|�/�-�\�|�/�-�\�|�/�-�\�|�/�-�\�|�/�-�\�|�/�-�\�|�/�-�\�|�/�-�\�|�/�-�\�|�/�-�\�|�/�-�\�|�/�-�\�|�/�-�\�|�/�-�\�|�/�-�\�|�/�-�\�|�/�-�\�|�/�-�\�|�/�-�\�|�/�-�\�|�/�-�\�|�/�-�\�|�/�-�\�|�/�-�\�|�/�-�\�|�/�-�\�|�/�-�\�|�/�-�\�|�/�-�\�|�/�-�\�|�/�-�\�|�/�-�\�|�/�-�\�|�/�-�\�|�/�-�\�|�/�-�\�|�/�-�\�|�/�-�\�|�/�-�\�|�/�-�\�|�/�-�\�|�/�-�\�|�/�-�\�|�/�-�\�|�/

with 60k characters in my log file of the build (cropped here).

I am using:

RUN git clone -b develop https://github.com/ROCm/hipBLASLt && \
    export GTest_DIR="/usr/local/lib/cmake/GTest/" && \
    cd hipBLASLt && \
    ./install.sh -idc --architecture 'gfx90a;gfx942'

and redirect my docker build to a file.

ROCm version is 6.0.2, using develop branch.

Operating System

Ubuntu 22.04

CPU

Intel(R) Xeon(R) Platinum 8480C

GPU

Other

MI300

ROCm Version

ROCm 5.7.1

ROCm Component

No response

Steps to Reproduce

No response

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

No response

Additional Information

No response

is this a serious program>?

root@1a89b5aa5fce:/opt/hipBLASLt/build/release# ./clients/staging/hipblaslt-bench -m 2048 -n 2048 -k 2048 --precision f32_r -v 1 --activation_type relu
Query device success: there are 1 devices

Device ID 0 : AMD Radeon VII gfx906:sramecc+:xnack-
with 17.2 GB memory, max. SCLK 1801 MHz, max. MCLK 1000 MHz, compute capability 9.0
maxGridDimX 2147483647, sharedMemPerBlock 65.5 KB, maxThreadsPerBlock 1024, warpSize 64

rocblaslt warning: No paths matched /opt/hipBLASLt/build/release/library/../Tensile/library/gfx906co. Make sure that HIPBLASLT_TENSILE_LIBPATH is set correctly.
transA,transB,grouped_gemm,batch_count,M,N,K,alpha,lda,stride_a,beta,ldb,stride_b,ldc,stride_c,ldd,stride_d,d_type,compute_type,activation_type,bias_vector,hipblaslt-Gflops,us,CPU-Gflops,CPU-us,norm_error_1
N,N,0,1,2048,2048,2048,1,2048,4194304,0,2048,4194304,2048,4194304,2048,4194304,f32_r,f32_r,relu,0, 2.72763e+06, 6.3,4.47063,3.84376e+06,1.08487
root@1a89b5aa5fce:/opt/hipBLASLt/build/release# ./clients/staging/hipblaslt-bench -m 1024 -n 1024 -k 1024 --precision f32_r -v 1 --activation_type relu
Query device success: there are 1 devices

Device ID 0 : AMD Radeon VII gfx906:sramecc+:xnack-
with 17.2 GB memory, max. SCLK 1801 MHz, max. MCLK 1000 MHz, compute capability 9.0
maxGridDimX 2147483647, sharedMemPerBlock 65.5 KB, maxThreadsPerBlock 1024, warpSize 64

rocblaslt warning: No paths matched /opt/hipBLASLt/build/release/library/../Tensile/library/gfx906co. Make sure that HIPBLASLT_TENSILE_LIBPATH is set correctly.
transA,transB,grouped_gemm,batch_count,M,N,K,alpha,lda,stride_a,beta,ldb,stride_b,ldc,stride_c,ldd,stride_d,d_type,compute_type,activation_type,bias_vector,hipblaslt-Gflops,us,CPU-Gflops,CPU-us,norm_error_1
N,N,0,1,1024,1024,1024,1,1024,1048576,0,1024,1048576,1024,1048576,1024,1048576,f32_r,f32_r,relu,0, 279030, 7.7,4.39526,488829,1.12318
root@1a89b5aa5fce:/opt/hipBLASLt/build/release# ^C
root@1a89b5aa5fce:/opt/hipBLASLt/build/release# ./clients/staging/hipblaslt-bench -m 102^C-n 1024 -k 1024 --precision f32_r -v 1 --activation_type relu
root@1a89b5aa5fce:/opt/hipBLASLt/build/release# ./clients/staging/hipblaslt-bench --precision f32_r -v 1
Query device success: there are 1 devices

Device ID 0 : AMD Radeon VII gfx906:sramecc+:xnack-
with 17.2 GB memory, max. SCLK 1801 MHz, max. MCLK 1000 MHz, compute capability 9.0
maxGridDimX 2147483647, sharedMemPerBlock 65.5 KB, maxThreadsPerBlock 1024, warpSize 64

rocblaslt warning: No paths matched /opt/hipBLASLt/build/release/library/../Tensile/library/gfx906co. Make sure that HIPBLASLT_TENSILE_LIBPATH is set correctly.
transA,transB,grouped_gemm,batch_count,M,N,K,alpha,lda,stride_a,beta,ldb,stride_b,ldc,stride_c,ldd,stride_d,d_type,compute_type,activation_type,bias_vector,hipblaslt-Gflops,us,CPU-Gflops,CPU-us,norm_error_1
N,N,0,1,128,128,128,1,128,16384,0,128,16384,128,16384,128,16384,f32_r,f32_r,none,0, 776.723, 5.4,4.06425,1032,1.07202

what fuck the gpu has 200Tflops? 279030

rocm / hipblaslt Goto Github PK

hipblaslt's Introduction

hipBLASLt

Documentation

Requirements

Build and install

Unit tests

Contribute

hipblaslt's People

Contributors

Stargazers

Watchers

Forkers

hipblaslt's Issues

Problem Description

Operating System

CPU

GPU

Other

ROCm Version

ROCm Component

Steps to Reproduce

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

Additional Information

Problem Description

Operating System

CPU

GPU

Other

ROCm Version

ROCm Component

Steps to Reproduce

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

Additional Information

Problem Description

Operating System

CPU

GPU

Other

ROCm Version

ROCm Component

Steps to Reproduce

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

Additional Information

Problem Description

Operating System

CPU

GPU

Other

ROCm Version

ROCm Component

Steps to Reproduce

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

Additional Information

root@1a89b5aa5fce:/opt/hipBLASLt/build/release# ./clients/staging/hipblaslt-bench -m 2048 -n 2048 -k 2048 --precision f32_r -v 1 --activation_type relu Query device success: there are 1 devices

Device ID 0 : AMD Radeon VII gfx906:sramecc+:xnack- with 17.2 GB memory, max. SCLK 1801 MHz, max. MCLK 1000 MHz, compute capability 9.0 maxGridDimX 2147483647, sharedMemPerBlock 65.5 KB, maxThreadsPerBlock 1024, warpSize 64

Device ID 0 : AMD Radeon VII gfx906:sramecc+:xnack- with 17.2 GB memory, max. SCLK 1801 MHz, max. MCLK 1000 MHz, compute capability 9.0 maxGridDimX 2147483647, sharedMemPerBlock 65.5 KB, maxThreadsPerBlock 1024, warpSize 64

Device ID 0 : AMD Radeon VII gfx906:sramecc+:xnack- with 17.2 GB memory, max. SCLK 1801 MHz, max. MCLK 1000 MHz, compute capability 9.0 maxGridDimX 2147483647, sharedMemPerBlock 65.5 KB, maxThreadsPerBlock 1024, warpSize 64

Recommend Projects

Recommend Topics

Recommend Org

root@1a89b5aa5fce:/opt/hipBLASLt/build/release# ./clients/staging/hipblaslt-bench -m 2048 -n 2048 -k 2048 --precision f32_r -v 1 --activation_type relu
Query device success: there are 1 devices

Device ID 0 : AMD Radeon VII gfx906:sramecc+:xnack-
with 17.2 GB memory, max. SCLK 1801 MHz, max. MCLK 1000 MHz, compute capability 9.0
maxGridDimX 2147483647, sharedMemPerBlock 65.5 KB, maxThreadsPerBlock 1024, warpSize 64

Device ID 0 : AMD Radeon VII gfx906:sramecc+:xnack-
with 17.2 GB memory, max. SCLK 1801 MHz, max. MCLK 1000 MHz, compute capability 9.0
maxGridDimX 2147483647, sharedMemPerBlock 65.5 KB, maxThreadsPerBlock 1024, warpSize 64

Device ID 0 : AMD Radeon VII gfx906:sramecc+:xnack-
with 17.2 GB memory, max. SCLK 1801 MHz, max. MCLK 1000 MHz, compute capability 9.0
maxGridDimX 2147483647, sharedMemPerBlock 65.5 KB, maxThreadsPerBlock 1024, warpSize 64