rocmlir's Introduction

MLIR-based convolution and GEMM kernel generator for ROCm

This is the repository for an MLIR-based convolution and GEMM kernel generator targeting AMD hardware. This generator is mainly used from MIGraphX, but it can be used on a standalone basis. (The ability to use this code via torch-mlir is being investigated as well.)

Building (and testing)

To build the system

mkdir build
cd build
cmake -G Ninja .. -DCMAKE_BUILD_TYPE=RelWithDebInfo -DCMAKE_C_COMPILER=/opt/rocm/llvm/bin/clang -DCMAKE_CXX_COMPILER=/opt/rocm/llvm/bin/clang++
ninja check-rocmlir

Note that we require building against a relatively recent clang. The above commands specify the ROCm clang release in order to match our standard development practice.

To build the tests without actually running them, use the check-rocmlir-build-only target.
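
That is, from the build directory:

ninja check-rocmlir-build-only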

To build the static library that is used by MIGraphX

mkdir build
cd build
cmake -G Ninja .. -DBUILD_FAT_LIBROCKCOMPILER=On -DCMAKE_BUILD_TYPE=Release -DCMAKE_C_COMPILER=/opt/rocm/llvm/bin/clang -DCMAKE_CXX_COMPILER=/opt/rocm/llvm/bin/clang++
ninja

and to install it so MIGraphX can find it

cmake --install . --prefix [your/MIGraphX/deps/folder/path]
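
MIGraphX can then be pointed at that location when it is configured; a hypothetical configure line (reusing the placeholder path above, and the MIGRAPHX_ENABLE_MLIR option that appears in a reproduction recipe later on this page) would be:

cmake .. -DMIGRAPHX_ENABLE_MLIR=ON -DCMAKE_PREFIX_PATH=[your/MIGraphX/deps/folder/path]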

Standalone usage

For usage examples, see mlir/test/rocmlir-driver, especially the file sanity.mlir and the contents of the e2e_for_pr directory.

This project also includes code that translates from TOSA to kernels, see mlir/test/fusion for examples of how to invoke it.

In general (with all invocations given from the build directory)

  • ./bin/rocmlir-gen generates high-level convolution operations and host code. Many of the options control data layout, size, etc., but some other useful flags are (an example invocation follows this list):
    • -mfma=on (which enables mfma usage) (or -wmma=on for gfx11 targets)
    • -mfma=off (which disables mfma usage) (or -wmma=off for gfx11 targets)
    • -ph (which causes host code to be generated)
    • -pv (which makes the host code validate the results against a reference)
    • -pv_with_gpu (which uses a GPU validator instead)
    • -pr (which prints kernel results)
  • ./bin/rocmlir-driver is a wrapper around the kernel generation pipeline. Use -c (or --kernel-pipeline=full --host-pipeline=runner) to run the default pipeline.
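
For example, a hypothetical invocation that generates an accelerated f16 GEMM with host code and CPU validation and sends it through the default pipeline (the operation, sizes, and target are illustrative, not a recommended configuration) is:

./bin/rocmlir-gen -mfma=on -ph -pv -operation gemm -t f16 --arch gfx90a -g 1 -m 1024 -k 1024 -n 1024 | ./bin/rocmlir-driver -c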

The simplest way to run the result of this pipeline is to pass it to the rocm-run script in mlir/utils/widgets/rocm-run, which calls mlir-cpu-runner with the appropriate flags and infers the pathnames for the libraries correctly.

In more detail, the result of the above pipeline can be passed to ./external/llvm-project/llvm/bin/mlir-cpu-runner.

mlir-cpu-runner needs to link the generated host code against libraries that map from MLIR operations to the HIP runtime. The required command-line arguments (if running from build/) are

./external/llvm-project/llvm/bin/mlir-cpu-runner --shared-libs=./external/llvm-project/llvm/lib/libmlir_rocm_runtime.so,./lib/libconv-validation-wrappers.so,./external/llvm-project/llvm/lib/libmlir_runner_utils.so --entry-point-result=void

Adding --debug-only=serialize-to-blob to the rocmlir-driver invocation will cause the GCN assembly code for the kernels being executed to be dumped to standard error.
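
Putting the pieces together, a complete end-to-end run with the assembly dump enabled might look like the sketch below (the dump goes to standard error, so it does not interfere with the pipe; treat the exact flag set as illustrative):

./bin/rocmlir-gen -ph -pv | ./bin/rocmlir-driver -c --debug-only=serialize-to-blob | ./external/llvm-project/llvm/bin/mlir-cpu-runner --shared-libs=./external/llvm-project/llvm/lib/libmlir_rocm_runtime.so,./lib/libconv-validation-wrappers.so,./external/llvm-project/llvm/lib/libmlir_runner_utils.so --entry-point-result=void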

Disabling MFMA/WMMA in tests

By default, we infer the use of GPU-specific acceleration instructions, like MFMA or WMMA, based on the features of the currently available GPU.

To disable this, add -DROCMLIR_GEN_FLAGS="-mfma=off -wmma=off" to the cmake invocations given above. Note that this will not affect behavior in production/static library builds, which do not use rocmlir-gen.
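
For example, the debug/test configuration from above with acceleration disabled in the generated test kernels would be:

cmake -G Ninja .. -DCMAKE_BUILD_TYPE=RelWithDebInfo -DCMAKE_C_COMPILER=/opt/rocm/llvm/bin/clang -DCMAKE_CXX_COMPILER=/opt/rocm/llvm/bin/clang++ -DROCMLIR_GEN_FLAGS="-mfma=off -wmma=off"
ninja check-rocmlir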


rocmlir's Issues

[Issue]: Assertion `Index < Length && "Invalid index!"' failed

Problem Description

Build MIGraphX develop branch + ROCm/AMDMIGraphX#2782
I was using a ROCm 6.0.2 docker
I ran the migraphx-driver and hit the error listed in the "Additional Information" section of this issue.

Operating System

20.04.6

CPU

AMD EPYC 7702 64-Core Processor

GPU

AMD Instinct MI100

ROCm Version

ROCm 6.0.0

ROCm Component

rocMLIR

Steps to Reproduce

No response

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

No response

Additional Information

Trace collected from MIGRAPHX_TRACE_MLIR=1 /workspace/AMDMIGraphX/build/bin/driver perf /new-saved-models/tiny_llama/decoder_model.onnx --fill1 input_ids attention_mask --input-dim @input_ids 1 256 @attention_mask 1 256

mlir_error_tinylama.txt

[Issue]: Attention failing with gpt2-10.onnx

Problem Description

This failure happens only when including the attention op.

MIGRAPHX_MLIR_USE_SPECIFIC_OPS="attention"  /opt/rocm/bin/migraphx-driver perf --exhaustive-tune /models/onnx-model-zoo/gpt2-10.onnx --batch 1 --fp16
Compiling ...
Reading: /models/onnx-model-zoo/gpt2-10.onnx
terminate called after throwing an instance of 'migraphx::version_2_10_0::exception'
  what():  /workspace/AMDMIGraphX/src/targets/gpu/compile_ops.cpp:222: benchmark: No valid tuned compilation for gpu::mlir_op with gfx942:sramecc+:xnack-       304     -t f16 -out_datatype f16 -transA false -transB false -g 12 -m 1 -n 1 -k 64

Operating System

22.04

CPU

AMD EPYC 7702 64-Core Processor

GPU

AMD Instinct MI300X

ROCm Version

ROCm 6.1.0

ROCm Component

ROCm

Steps to Reproduce

No response

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

No response

Additional Information

No response

rocMLIR installation failed

I used the following instructions to install rocMLIR:

mkdir build
cd build
cmake -G Ninja .. -DCMAKE_BUILD_TYPE=RelWithDebInfo -DCMAKE_C_COMPILER=/opt/rocm/llvm/bin/clang -DCMAKE_CXX_COMPILER=/opt/rocm/llvm/bin/clang++
ninja check-rocmlir

and an error occurred:

rocmlir/external/llvm-project/mlir/lib/Dialect/GPU/AmdDeviceLibsIncGen.py: Permission denied

I do not know why this happens.

[bf16] Debug build verification failed

bf16 PR:
#45

The release build passes, but the debug build fails.

./bin/mlir-miopen-driver -p=false -c -x2 -t bf16 fil_layout=kcyx -in_layout=nchw -out_layout=nkhw -batchsize=64 -in_channels=4 -out_channels=64 -in_h=32 -in_w=32 -fil_h=3 -fil_w=3 --dilation_h=1 --dilation_w=1 --padding_h=0 --padding_w=0 --conv_stride_h=2 --conv_stride_w=2 -pv | ./bin/mlir-rocm-runner --shared-libs=./lib/librocm-runtime-wrappers.so,./lib/libmlir_runner_utils.so --entry-point-result=void

release build:
gpu results:32.0 cpu:32.0

debug build:
gpu results:32.02 cpu:32.12

Creating this issue to track; will look into the ISA.

MIGraphX + rocMLIR: gibberish in the standard output when `MIGRAPHX_TRACE_MLIR=1`

Hi all,

I configured MIGraphX to use rocMLIR. In one particular case, i.e., when MIGRAPHX_TRACE_MLIR=1, I noticed some garbage in the output. For example:

@parammlir_main:pointwise41:x1.0@parammlir_main:pointwise43:x1.0:x1.0 =
:mlir_main:pointwise48:x1.0@param:x1.0 ->  =  = @param@returnfloat_typex1.0mlir_main:pointwise53:x1.0mlir_main:pointwise36:x1.0@parammlir_main:pointwise14:x1.0mlir_main:pointwise6:x3.0 = mlir_main:pointwise17:x1.0,  ->  = mlir_main:pointwise21:x1.0(mlir_main:pointwise0:@6)@parammlir_main:pointwise28:x3.0:{mlir_main:pointwise24:x1.0 = float_type, {0: =  = @param:x1.0 -> mlir_main:pointwise50:x1.0: -> x1.01, 128, 1, 1 ->  = mlir_main:pointwise65:x1.01, 64, 1, 1mlir_main:pointwise46:x1.0, target_id=@parammlir_main:pointwise55:x1.0x1.0 -> float_type, {1, 128, 1, 1:x1.0@paramfloat_type, @param = }, }, { = 1, 128, 1, 1}, {128, 1, 1, 1}, target_id=0:0
float_type}, {,  -> float_type, {:1, 512, 1, 1}, {512, 1, 1, 1}, target_id=0
:@param{}, @param:x1.0 -> 64, 1, 1, 1x1.0float_type -> 128, 1, 1, 1

:{x1.0 = { ->  = float_type, { = :x3.0 = } -> , target_id=, float_type, {x3.0, 1, 256, 1, 1}, target_id=0@param}, x1.0float_typemlir_main:pointwise26:x1.0:x1.0 -> float_type, {{, {
 -> float_type, 256, 1, 1, 1 = float_typemlir_main:pointwise41:@1 = multibroadcast[out_lens={, 1@param{@param0:x3.01, 64, 1, 1:
}, { -> 64, 1, 1, 1}, target_id=0
{mlir_main:pointwise19:@1 = multibroadcast[out_lens=x1.0{ -> 1float_type, , {@parammlir_main:pointwise86:x1.0 = @param:x1.0 -> float_type, {1, 1024, 1, 1:1, 1024, 1, 1}, x1.0{}, {@parammlir_main:pointwise43:x2 -> 1024, 1, 1, 11, 256, 1, 11, 256, 1, 1 = @param:x2float_type, :}mlir_main:pointwise63:x1.0}, {}, { -> mlir_main:pointwise4:@11, 512, 1, 11, 256, 1, 1mlir_main:pointwise10:@1}, { = multibroadcast[out_lens={256, 1, 1, 1mlir_main:pointwise70:x1.0mlir_main:pointwise98:x1.0 = @param:x1.0 -> mlir_main:pointwise96:x1.0mlir_main:pointwise79:x1.0float_type = @param:x1.0 =  = , mlir_main:pointwise75:x1.0@param:x1.0 -> @parammlir_main:pointwise82:x1.0mlir_main:pointwise84:x1.01024, 1, 1, 1{1, 512, 1, 1}, {512, 1, 1, 1}, target_id=0
mlir_main:pointwise98:@1 = multibroadcast[out_lens={1, 512, 7, 7},out_dyn_dims={}](mlir_main:pointwise98:x1.0) -> float_type, {1, 512, 7, 7}, { = 512, 1, 0, 0}, target_id=0
mlir_main:pointwise98:y1 = @param:y1 -> float_type, {512, 512, 3, 3}, {4608, 9, 3, 1}, target_id=0
float_typemlir_main:pointwise98:y0 = , @param:y0 -> float_type, {}1, 512, 7, 7mlir_main:pointwise48:@1 = multibroadcast[out_lens={@param:x1.0{ -> float_type, {128, 1, 1, 1mlir_main:pointwise108:x1.0 = @param:x1.0 -> float_type, {x1.01, 2048, 1, 1}, {2048, 1, 1, 11, 128, 1, 1}}1, 64, 1, 1}, { -> 128, 1, 1, 1, target_id= -> , target_id=}, {1, 128, 1, 164, 1, 1, 11, 64, 1, 10
float_type, {1}, { -> mlir_main:pointwise93:x1.01, 1024, 1, 1}, {}, {:mlir_main:pointwise108:x2{} = float_type}x1.0256, 1, 1, 11024, 1, 1, 1 = float_type64, target_id=0
256, 1, 1, 1}128, 1, 1, 164, 1, 1, 1{ = @param:x1.0 -> float_type, { -> , target_id=0, mlir_main:pointwise111:x1.0}, target_id=0
mlir_main:pointwise77:x1.0{float_type, target_id=0}
mlir_main:pointwise91:x1.0 = @param:x1.0 -> float_type, {1, 256, 1, 1}, {256, 1, 1, 1}, target_id=0
mlir_main:pointwise91:@1 = multibroadcast[out_lens={1, 256, 14, 14},out_dyn_dims={}](mlir_main:pointwise91:x1.0) -> float_type, {1, 256, 14, 14}, {256, 1, 0, 0}, target_id=0
mlir_main:pointwise91:y1 = @param:y1 -> float_type, {256, 256, 3, 3}, {2304, 9, 3, 1}, target_id=0
, mlir_main:pointwise91:y064 = , 56, 56},out_dyn_dims={@param = }, {mlir_main:pointwise21:x2 = @param:x2 -> float_type, {mlir_main:pointwise53:@1 = multibroadcast[out_lens={1, 256, 14, 14},out_dyn_dims={}](mlir_main:pointwise53:x1.0) -> float_type, {1, 256, 14, 14}, {256, 1, 0, 0}, target_id=0
1, 512, 1, 1, }, mlir_main:pointwise53:y156{1, 256, 1, 1, 56},out_dyn_dims=mlir_main:pointwise113:x1.0}, }512, 1, 1, 1float_type = @param:x1.0 -> {, target_id=@param}, target_id=0{256, 1, 1, 1, {, target_id=@param25088, 49, 7, 1}, target_id=0
1, 128, 1, 1}, target_id=0
{1, 512, 1, 1}, {mlir_main:pointwise98:@4, target_id=@param0

I assume this was not done intentionally. It looks more like several threads concurrently writing to standard output. The issue seems to stem from the MIGraphX + rocMLIR integration (more on the MIGraphX side).

Here is the way to reproduce the issue:

cd <migraphx-src>
cmake . -B build -DMIGRAPHX_ENABLE_MLIR=ON \
  -DCMAKE_PREFIX_PATH="$(realpath ../depend);/MIGraphXDeps" \
  -DCMAKE_BUILD_TYPE=RelWithDebInfo \
  -DLLD_BUILD_TOOLS=ON \
  -DCMAKE_CXX_COMPILER=/opt/rocm/llvm/bin/clang++
cd build
make -j driver test_gpu_mlir

MIGRAPHX_ENABLE_MLIR=1 MIGRAPHX_TRACE_MLIR=1 ./bin/migraphx-driver compile  --onnx /MIGraphXDeps/resnet50-v1-7.onnx

Question: How to get AssembleISA to work on gfx906

Hey guys,

I narrowed down the problem I had here, relating to building on gfx906, to the AssembleISA function.

I was able to build a standalone tool that generates hsaco code, based on backendutils' compileISAToHsaco and friends:
Input:
1. Assembly file
Output:
1. Hsaco code (.co file)
2. Intermediate, non-linked code object (.o file)

I am using the assembly code found in this sample, and I am also testing the hsaco code we are able to generate from compileISAToHsaco by linking it into the executable generated there.

Tests I tried:

Test 0:
1. Build the example from this sample
2. Run the example by itself (which uses clang to assemble + link the ISA code into .co code)

Result: on gfx1030 (success), on gfx906 (fail)

Test 1:

  1. Generate .co file using compileISAToHsaco
  2. Link against executable
  3. run executable

Result: on gfx1030 (success), on gfx906 (fail: can assemble but cannot run)
Runtime Error: hsa_executable_load_code_object failed: HSA_STATUS_ERROR_INCOMPATIBLE_ARGUMENTS: The arguments passed to a functions are not compatible.

Test 2:

  1. Generate a .o file using compileISAToHsaco (i.e., take the intermediate result from assembleISA and save it to a file)
  2. Use clang to link the compileISAToHsaco .o file and generate a new .co file
  3. Link the .co into the executable
  4. Run the executable

Result: on gfx1030 (success), on gfx906 (fail: can assemble but cannot run)
Runtime Error: hsa_executable_load_code_object failed: HSA_STATUS_ERROR_INCOMPATIBLE_ARGUMENTS: The arguments passed to a functions are not compatible.

Note: The same error shows up on the sample by itself if mcode-object-version=3, but it works successfully if mcode-object-version=4.

Also, I was able to run the test I found in this sample successfully standalone on gfx906 when I switched the code object version from 3 to 4. Could that be the reason it is not able to assemble ISA properly on gfx906? Thank you so much for your help in advance! :)

[verification stall]

We can reproduce the issue with:
Step 1:
/usr/local/bin/cmake -G Ninja ../llvm -DLLVM_ENABLE_PROJECTS="mlir;lld" -DLLVM_BUILD_EXAMPLES=ON -DLLVM_TARGETS_TO_BUILD="X86;AMDGPU" -DCMAKE_BUILD_TYPE=Release -DLLVM_ENABLE_ASSERTIONS=ON -DBUILD_SHARED_LIBS=ON -DLLVM_BUILD_LLVM_DYLIB=ON -DMLIR_KEVIN_DRIVER_ENABLED=1 -DMLIR_MIOPEN_DRIVER_ENABLED=1 -DMLIR_ENABLE_SQLITE=1 -DMLIR_ROCM_RUNNER_ENABLED=1

Step 2:
cmake --build . --target check-mlir

Step 3:
./bin/mlir-miopen-driver -p -c -x2 -pv | ./bin/mlir-rocm-runner --shared-libs=./lib/librocm-runtime-wrappers.so,./lib/libmlir_runner_utils.so --entry-point-result=void

It will stall and not show results.

Undefined soname for MLIR libraries

Hi! I'm packaging rocMLIR for Arch Linux. The installed libraries have a generic "VERSION" suffix that makes it impossible to automatically check for ABI changes in the package or in packages that depend on it. The problem is that add_mlir_library calls llvm_add_library, which uses LLVM_VERSION_MAJOR and LLVM_VERSION_SUFFIX to set the soversion via set_target_properties. The file that defines these variables (in the external llvm-project) is included after the mlir folder, so the LLVM version is undefined and cmake defaults to VERSION. A possible solution is to copy the definitions of LLVM_VERSION_* to mlir/CMakeLists.txt.
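
A minimal sketch of that suggestion (the values are placeholders and would need to match the bundled external/llvm-project):

# Hypothetical addition to mlir/CMakeLists.txt, before the libraries are declared
set(LLVM_VERSION_MAJOR 18)   # placeholder; keep in sync with external/llvm-project
set(LLVM_VERSION_SUFFIX "")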

[Question] Is xdlops_gemm_v2 usable?

I have seen examples of xdlops_gemm_v2 MLIR, but I do not see lowerings to LLVM IR or a transformation to ROCDL's mfma.
The example here seems to be using "lowering-step5", which seems to be (affine -> std) + (loop -> std). Any guidance would be appreciated :)

Blockwisegemm Summary

Signature of Blockwisegemm Op

Arguments<(ins MemRefRankOf<[F32, F16, BF16, I8], [3]>:$matrixA,
               MemRefRankOf<[F32, F16, BF16, I8], [3]>:$matrixB,
               MemRefRankOf<[F32, F16, BF16, I32], [2]>:$matrixC,
               IndexAttr:$kPerThread,
               IndexAttr:$mPerThread,
               IndexAttr:$nPerThread,
               IndexAttr:$mThreadsPerCuwave,
               IndexAttr:$nThreadsPerCuwave,
               IndexAttr:$mCuwavesPerBlock,
               IndexAttr:$nCuwavesPerBlock
               )> {

The miopen.blockwise_gemm op does gemm at the blockwise level without xdlops.

  • Matrix A resides in LDS and has dimensions [k, M_b, kPack].
  • Matrix B resides in LDS and has dimensions [k, N_b, kPack].
  • Matrix C resides in registers and has dimensions [m_c, n_c], where m_c = mRepeat * mPerThread and n_c = nRepeat * nPerThread.
  • It is the case that m_a = m_c * mCuwavesPerBlock * mThreadsPerCuwave and n_a = n_c * nCuwavesPerBlock * nThreadsPerCuwave.
  • m/n/k per thread: tuning parameters giving the length of the smallest-level tiles in the m/n/k dimension that are passed to the threadwise gemm, which computes over mRepeat/nRepeat such tiles.
  • m/n threads per cu wave: Both currently hard-coded as 4
  • m/n cu waves per block: Number of cu waves in a block

Divisions of the work in a block

Simplified representations:

  • mCuwavesPerBlock and nCuwavesPerBlock are simplified as M and N
  • mThreadsPerCuwave and nThreadsPerCuwave are simplified as m and n
  • With the above, the number of threads per cu wave is threads = m * n = 16
  • The tile sizes satisfy (a worked example follows this list):
    • MPerBlock = MRepeat * (M * m * MPerThread)
    • NPerBlock = NRepeat * (N * n * NPerThread)
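
For example, with the hard-coded m = 4 and illustrative values MRepeat = 2, M = 2, and MPerThread = 8:

  MPerBlock = MRepeat * (M * m * MPerThread) = 2 * (2 * 4 * 8) = 128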

Visualization

  • An arbitrary row:
    • Start with the tid / (threads * N) tile
    • Within the tile, the thread index is (tid % threads) / n
    • Therefore, row = tid / (threads * N) + (tid % threads) / n
  • An arbitrary column:
    • Start with the (tid / threads) % N tile
    • Within the tile, the thread index is tid % n
    • Therefore, column = (tid / threads) % N + (tid % n)

The blockwise gemm would loop through the M * N tiles in the following fashion (a fuller C sketch follows the loops):

for (int a_elem = 0; a_elem < MRepeat * MPerThread; ++a_elem) {
  curRow = a_elem % MPerThread + MPerThread * (row + M * m * (a_elem / MPerThread));
}
for (int b_elem = 0; b_elem < NRepeat * NPerThread; ++b_elem) {
  curCol = b_elem % NPerThread + NPerThread * (column + N * n * (b_elem / NPerThread));
}
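
A fuller, self-contained C sketch of the indexing above; only the formulas come from this summary, while the parameter values and the choice of tid are illustrative assumptions.

#include <stdio.h>

int main(void) {
  /* Simplified names from the text: M/N = m/nCuwavesPerBlock, m/n = m/nThreadsPerCuwave. */
  const int M = 2, N = 2;                   /* assumed cu waves per block */
  const int m = 4, n = 4;                   /* hard-coded as 4 in the text */
  const int threads = m * n;                /* 16 threads per cu wave */
  const int MPerThread = 4, NPerThread = 4; /* assumed tuning params */
  const int MRepeat = 2, NRepeat = 2;       /* assumed tuning params */

  int tid = 37;                             /* an arbitrary thread id for illustration */

  /* Row and column of this thread, as derived in the Visualization section. */
  int row = tid / (threads * N) + (tid % threads) / n;
  int column = (tid / threads) % N + (tid % n);

  /* Rows of the A tile visited by this thread. */
  for (int a_elem = 0; a_elem < MRepeat * MPerThread; ++a_elem) {
    int curRow = a_elem % MPerThread + MPerThread * (row + M * m * (a_elem / MPerThread));
    printf("a_elem=%d curRow=%d\n", a_elem, curRow);
  }
  /* Columns of the B tile visited by this thread. */
  for (int b_elem = 0; b_elem < NRepeat * NPerThread; ++b_elem) {
    int curCol = b_elem % NPerThread + NPerThread * (column + N * n * (b_elem / NPerThread));
    printf("b_elem=%d curCol=%d\n", b_elem, curCol);
  }
  return 0;
}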

Convolution hang/crash

This is the trace from the crash, which includes the MLIR program and the tuning problem and solution keys:

Problem: gfx90a:sramecc+:xnack- 110     conv -F 1 -f GNCHW -I HWNGC -O NGCHW -n 1 -c 512 -H 256 -W 256 -k 512 -y 3 -x 3 -p 1 -q 1 -u 1 -v 1 -l 1 -j 1 -g 1
Benchmarking solution: 128,256,4,64,128,4,1,1
module {
  func.func @mlir_transpose_convolution_add(%arg0: !migraphx.shaped<1x512x256x256xf32, 0x1x0x0>, %arg1: !migraphx.shaped<256x256x1x512xf32, 131072x512x512x1>, %arg2: !migraphx.shaped<512x512x3x3xf32, 4608x9x3x1>) -> !migraphx.shaped<1x512x256x256xf32, 33554432x65536x256x1> attributes {arch = "", kernel = "mixr", num_cu = 0 : i64} {
    %0 = migraphx.transpose %arg1 {permutation = [2, 3, 0, 1]} : <256x256x1x512xf32, 131072x512x512x1> -> <1x512x256x256xf32, 512x1x131072x512>
    %1 = migraphx.convolution %0, %arg2 {dilation = [1, 1], group = 1 : i64, padding = [1, 1, 1, 1], padding_mode = 0 : i64, stride = [1, 1]} : <1x512x256x256xf32, 512x1x131072x512>, <512x512x3x3xf32, 4608x9x3x1> -> <1x512x256x256xf32, 33554432x65536x256x1>
    %2 = migraphx.add %1, %arg0 : <1x512x256x256xf32, 33554432x65536x256x1>, <1x512x256x256xf32, 0x1x0x0> -> <1x512x256x256xf32, 33554432x65536x256x1>
    return %2 : !migraphx.shaped<1x512x256x256xf32, 33554432x65536x256x1>
  }
}

DoD (Definition of Done)

  • Needs to be backported to ROCm 6.1

mlir-rocm-runner hipErrorSharedObjectInitFailed during gpu-to-hsaco.mlir

Hey guys,

I am testing out mlir-rocm-runner with gpu-to-hsaco.mlir, but I seem to be getting an error in createGpuToLLVMConversionPass.

If I comment out the call to other_func in gpu-to-hsaco.mlir it works fine, but obviously I'd like to do something with the GPUs :). I have attached a screenshot of the error.

This is my command line: ./mlir-rocm-runner --triple="amdgcn-amd-amdhsa" --target="gfx906" --shared-libs=librocm-runtime-wrappers.so,libmlir_runner_utils.so --entry-point-result=void ../gpu-to-hsaco.mlir

Screen Shot 2021-04-22 at 6 20 44 PM

Thanks in advance for all your help! A fan of your work 😄!

Crash on align tiling when running RNN GRU models enabling sigmoid

After fixing a missing comma, MLIR will now work with sigmoid operators, but this causes the ./bin/test_verify test_gru_bidirct test to fail in MIGraphX on MI100:

UNREACHABLE executed at /usr/local/cget/build/tmp-832dcb3bff0f4556884f71aa299c95b7/rocMLIR-1ad9d6df32acc6d29d58e8ed6710e36746d0a4d6/mlir/lib/Dialect/Rock/Transforms/AlignTiling.cpp:98!

Investigate tuning difference with Full and Exhaustive

So I ran BERT with Full, and I see these results:

Benchmarking gpu::mlir_op: 458 configs
Fastest solution: 4,128,4,4,64,4,1,1: 
Fastest time: 0.404553
Slowest time: 0.414574
Benchmarking gpu::mlir_op: 458 configs
Fastest solution: 64,32,4,64,16,4,1,1: 
Fastest time: 0.190852
Slowest time: 0.200733
Benchmarking gpu::mlir_op: 458 configs
Fastest solution: 4,256,8,4,64,8,1,1: 
Fastest time: 0.18019
Slowest time: 0.188544
Benchmarking gpu::mlir_op: 458 configs
Fastest solution: 64,64,8,16,64,8,1,1: 
Fastest time: 1.07512
Slowest time: 1.0977
Benchmarking gpu::mlir_op: 458 configs
Fastest solution: 64,64,8,64,64,1,1,1: 
Fastest time: 0.670955
Slowest time: 0.679869
Benchmarking gpu::mlir_op: 458 configs
Fastest solution: 256,64,8,128,64,1,1,1: 
Fastest time: 0.0541399
Slowest time: 0.0620074

And running it with Exhaustive shows these configs:

Benchmarking gpu::mlir_op: 30240 configs
Fastest solution: 16,256,2,8,8,4,0,1: 
Fastest time: 0.4017
Slowest time: 4.88479
Benchmarking gpu::mlir_op: 30240 configs
Fastest solution: 4,256,1,64,8,4,0,1: 
Fastest time: 0.188217
Slowest time: 8.79304
Benchmarking gpu::mlir_op: 30240 configs
Fastest solution: 8,64,2,16,32,8,0,1: 
Fastest time: 0.178298
Slowest time: 0.198162
Benchmarking gpu::mlir_op: 30240 configs
Fastest solution: 256,64,4,32,64,1,0,1: 
Fastest time: 1.07209
Slowest time: 1.53007
Benchmarking gpu::mlir_op: 30240 configs
Fastest solution: 8,64,1,8,8,8,0,1: 
Fastest time: 0.665448
Slowest time: 0.807237
Benchmarking gpu::mlir_op: 30240 configs
Fastest solution: 4,16,1,4,4,1,1,1: 
Fastest time: 0.0520503
Slowest time: 0.0751473

I added a branch that will also print out the problem and solution config for the slowest as well. I am running it now, but it might not be done until tomorrow when I am out. You can run it from my branch with (assuming the BERT ONNX file is in the /onnx directory):

MIGRAPHX_MLIR_TUNE_EXHAUSTIVE=1 MIGRAPHX_ENABLE_MLIR=1 ./bin/driver perf /onnx/bert_base_cased_1.onnx --exhaustive-tune --fp16 --fill1 input_ids --input-dim @input_ids 32 384

[Issue]: rocMLIR compilation issue

Problem Description

Hi folks,

I'm trying to compile version 6.1.0 with these configuration options:

cmake \
  -Wno-dev \
  -G Ninja \
  -D CMAKE_CXX_FLAGS="${CXXFLAGS} -fcf-protection=none" \
  -D CMAKE_INSTALL_PREFIX=/usr \
  -D CMAKE_BUILD_TYPE=Release \
  -D CMAKE_C_COMPILER=/opt/rocm/llvm/bin/clang \
  -D CMAKE_CXX_COMPILER=/opt/rocm/llvm/bin/clang++ \
  -D BUILD_FAT_LIBROCKCOMPILER=ON \
  -D BUILD_SHARED_LIBS=ON \
  -D ROCM_TEST_CHIPSET=gfx900 \
  ..

ninja

and I got this error:

[3/299] Building CXX object mlir/lib/Dialect/Rock/Transforms/CMakeFiles/obj.MLIRRockTransforms.dir/ViewToTransform.cpp.o
FAILED: mlir/lib/Dialect/Rock/Transforms/CMakeFiles/obj.MLIRRockTransforms.dir/ViewToTransform.cpp.o
/opt/rocm/llvm/bin/clang++ -DGTEST_HAS_RTTI=0 -D_DEBUG -D_GLIBCXX_ASSERTIONS -D_LIBCPP_ENABLE_ASSERTIONS -D__STDC_CONSTANT_MACROS -D__STDC_FORMAT_MACROS -D__STDC_LIMIT_MACROS -I/mnt/arhiv/rocm/rocm-build/build/rocmlir/mlir/lib/Dialect/Rock/Transforms -I/mnt/arhiv/rocm/release/rocMLIR-rocm-6.1.0/mlir/lib/Dialect/Rock/Transforms -I/mnt/arhiv/rocm/release/rocMLIR-rocm-6.1.0/external/llvm-project/llvm/include -I/mnt/arhiv/rocm/rocm-build/build/rocmlir/external/llvm-project/llvm/include -I/mnt/arhiv/rocm/release/rocMLIR-rocm-6.1.0/external/llvm-project/mlir/include -I/mnt/arhiv/rocm/rocm-build/build/rocmlir/external/llvm-project/llvm/tools/mlir/include -I/mnt/arhiv/rocm/release/rocMLIR-rocm-6.1.0/external/mlir-hal/mlir/include -I/mnt/arhiv/rocm/rocm-build/build/rocmlir/external/mlir-hal/include -I/mnt/arhiv/rocm/release/rocMLIR-rocm-6.1.0/external/mlir-hal/include -I/mnt/arhiv/rocm/release/rocMLIR-rocm-6.1.0/mlir/include -I/mnt/arhiv/rocm/rocm-build/build/rocmlir/mlir/include -fcf-protection=none -fPIC -fno-semantic-interposition -fvisibility-inlines-hidden -Werror=date-time -Werror=unguarded-availability-new -Wall -Wextra -Wno-unused-parameter -Wwrite-strings -Wcast-qual -Wmissing-field-initializers -pedantic -Wno-long-long -Wc++98-compat-extra-semi -Wimplicit-fallthrough -Wcovered-switch-default -Wno-noexcept-type -Wnon-virtual-dtor -Wdelete-non-virtual-dtor -Wsuggest-override -Wstring-conversion -Wmisleading-indentation -Wctad-maybe-unsupported -fdiagnostics-color -ffunction-sections -fdata-sections -fPIC -fno-semantic-interposition -fvisibility-inlines-hidden -Werror=date-time -Werror=unguarded-availability-new -Wall -Wextra -Wno-unused-parameter -Wwrite-strings -Wcast-qual -Wmissing-field-initializers -pedantic -Wno-long-long -Wc++98-compat-extra-semi -Wimplicit-fallthrough -Wcovered-switch-default -Wno-noexcept-type -Wnon-virtual-dtor -Wdelete-non-virtual-dtor -Wsuggest-override -Wstring-conversion -Wmisleading-indentation -Wctad-maybe-unsupported -fdiagnostics-color -ffunction-sections -fdata-sections -Werror=global-constructors -O3 -DNDEBUG -std=gnu++17 -fPIC -D_DEBUG -D_GLIBCXX_ASSERTIONS -D_LIBCPP_ENABLE_ASSERTIONS -D__STDC_CONSTANT_MACROS -D__STDC_FORMAT_MACROS -D__STDC_LIMIT_MACROS -D_DEBUG -D_GLIBCXX_ASSERTIONS -D_LIBCPP_ENABLE_ASSERTIONS -D__STDC_CONSTANT_MACROS -D__STDC_FORMAT_MACROS -D__STDC_LIMIT_MACROS -fno-exceptions -funwind-tables -fno-rtti -UNDEBUG -MD -MT mlir/lib/Dialect/Rock/Transforms/CMakeFiles/obj.MLIRRockTransforms.dir/ViewToTransform.cpp.o -MF mlir/lib/Dialect/Rock/Transforms/CMakeFiles/obj.MLIRRockTransforms.dir/ViewToTransform.cpp.o.d -o mlir/lib/Dialect/Rock/Transforms/CMakeFiles/obj.MLIRRockTransforms.dir/ViewToTransform.cpp.o -c /mnt/arhiv/rocm/release/rocMLIR-rocm-6.1.0/mlir/lib/Dialect/Rock/Transforms/ViewToTransform.cpp
In file included from /mnt/arhiv/rocm/release/rocMLIR-rocm-6.1.0/mlir/lib/Dialect/Rock/Transforms/ViewToTransform.cpp:14:
/mnt/arhiv/rocm/release/rocMLIR-rocm-6.1.0/mlir/include/mlir/Conversion/TosaToRock/TosaToRock.h:21:10: fatal error: 'mlir/Conversion/RocMLIRPasses.h.inc' file not found
21 | #include "mlir/Conversion/RocMLIRPasses.h.inc"
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1 error generated.
[20/299] Building CXX object external/llvm-project/llvm/tools/mlir/lib/Dialect/Linalg/Transforms/CMakeFiles/obj.MLIRLinalgTransforms.dir/ElementwiseOpFusion.cpp.o
ninja: build stopped: subcommand failed.

[2948/3182] Building CXX object mlir/lib/Dialect/Rock/Transforms/CMakeFiles/obj.MLIRRockTransforms.dir/CheckResidency.cpp.o
FAILED: mlir/lib/Dialect/Rock/Transforms/CMakeFiles/obj.MLIRRockTransforms.dir/CheckResidency.cpp.o
/opt/rocm/llvm/bin/clang++ -DGTEST_HAS_RTTI=0 -D_DEBUG -D_GLIBCXX_ASSERTIONS -D_LIBCPP_ENABLE_ASSERTIONS -D__STDC_CONSTANT_MACROS -D__STDC_FORMAT_MACROS -D__STDC_LIMIT_MACROS -I/mnt/arhiv/rocm/rocm-build/build/rocmlir/mlir/lib/Dialect/Rock/Transforms -I/mnt/arhiv/rocm/release/rocMLIR-rocm-6.1.0/mlir/lib/Dialect/Rock/Transforms -I/mnt/arhiv/rocm/release/rocMLIR-rocm-6.1.0/external/llvm-project/llvm/include -I/mnt/arhiv/rocm/rocm-build/build/rocmlir/external/llvm-project/llvm/include -I/mnt/arhiv/rocm/release/rocMLIR-rocm-6.1.0/external/llvm-project/mlir/include -I/mnt/arhiv/rocm/rocm-build/build/rocmlir/external/llvm-project/llvm/tools/mlir/include -I/mnt/arhiv/rocm/release/rocMLIR-rocm-6.1.0/external/mlir-hal/mlir/include -I/mnt/arhiv/rocm/rocm-build/build/rocmlir/external/mlir-hal/include -I/mnt/arhiv/rocm/release/rocMLIR-rocm-6.1.0/external/mlir-hal/include -I/mnt/arhiv/rocm/release/rocMLIR-rocm-6.1.0/mlir/include -I/mnt/arhiv/rocm/rocm-build/build/rocmlir/mlir/include -fcf-protection=none -fPIC -fno-semantic-interposition -fvisibility-inlines-hidden -Werror=date-time -Werror=unguarded-availability-new -Wall -Wextra -Wno-unused-parameter -Wwrite-strings -Wcast-qual -Wmissing-field-initializers -pedantic -Wno-long-long -Wc++98-compat-extra-semi -Wimplicit-fallthrough -Wcovered-switch-default -Wno-noexcept-type -Wnon-virtual-dtor -Wdelete-non-virtual-dtor -Wsuggest-override -Wstring-conversion -Wmisleading-indentation -Wctad-maybe-unsupported -fdiagnostics-color -ffunction-sections -fdata-sections -fPIC -fno-semantic-interposition -fvisibility-inlines-hidden -Werror=date-time -Werror=unguarded-availability-new -Wall -Wextra -Wno-unused-parameter -Wwrite-strings -Wcast-qual -Wmissing-field-initializers -pedantic -Wno-long-long -Wc++98-compat-extra-semi -Wimplicit-fallthrough -Wcovered-switch-default -Wno-noexcept-type -Wnon-virtual-dtor -Wdelete-non-virtual-dtor -Wsuggest-override -Wstring-conversion -Wmisleading-indentation -Wctad-maybe-unsupported -fdiagnostics-color -ffunction-sections -fdata-sections -Werror=global-constructors -O3 -DNDEBUG -std=gnu++17 -fPIC -D_DEBUG -D_GLIBCXX_ASSERTIONS -D_LIBCPP_ENABLE_ASSERTIONS -D__STDC_CONSTANT_MACROS -D__STDC_FORMAT_MACROS -D__STDC_LIMIT_MACROS -D_DEBUG -D_GLIBCXX_ASSERTIONS -D_LIBCPP_ENABLE_ASSERTIONS -D__STDC_CONSTANT_MACROS -D__STDC_FORMAT_MACROS -D__STDC_LIMIT_MACROS -fno-exceptions -funwind-tables -fno-rtti -UNDEBUG -MD -MT mlir/lib/Dialect/Rock/Transforms/CMakeFiles/obj.MLIRRockTransforms.dir/CheckResidency.cpp.o -MF mlir/lib/Dialect/Rock/Transforms/CMakeFiles/obj.MLIRRockTransforms.dir/CheckResidency.cpp.o.d -o mlir/lib/Dialect/Rock/Transforms/CMakeFiles/obj.MLIRRockTransforms.dir/CheckResidency.cpp.o -c /mnt/arhiv/rocm/release/rocMLIR-rocm-6.1.0/mlir/lib/Dialect/Rock/Transforms/CheckResidency.cpp
In file included from /mnt/arhiv/rocm/release/rocMLIR-rocm-6.1.0/mlir/lib/Dialect/Rock/Transforms/CheckResidency.cpp:16:
In file included from /mnt/arhiv/rocm/release/rocMLIR-rocm-6.1.0/external/mlir-hal/include/mlir/Dialect/MHAL/IR/MHAL.h:27:
/mnt/arhiv/rocm/release/rocMLIR-rocm-6.1.0/external/mlir-hal/include/mlir/Dialect/MHAL/IR/MHALTypes.h:22:10: fatal error: 'mlir/Dialect/MHAL/IR/MHALTypes.h.inc' file not found
22 | #include "mlir/Dialect/MHAL/IR/MHALTypes.h.inc"
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1 error generated.

Any help, please?

Operating System

Slackware 15.0

CPU

AMD Ryzen 7 3800X 8-Core Processor

GPU

AMD Radeon RX 7900 XTX

ROCm Version

ROCm 6.1.0

ROCm Component

rocMLIR

Steps to Reproduce

No response

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

No response

Additional Information

No response

Can't generate the lib MLIRMIOpen

Dear ROCm developers,
I have built this project (llvm-project-mlir) at the tag rocm-5.1.1, to then be used for MIOpen, with no particular flags, and here are the libraries generated:

ubuntu@cdp-rocmbuild:/opt/rocm-dev2/mlir/lib$ ls
libLLVMROCmBackendUtils.so          libMLIRMIGraphX.so                       libMLIRMIOpenOps.so               libMLIRMIOpenTransforms.so          librocm-runtime-wrappers.so
libLLVMROCmBackendUtils.so.VERSION  libMLIRMIGraphX.so.VERSION               libMLIRMIOpenOps.so.VERSION       libMLIRMIOpenTransforms.so.VERSION  librocm-runtime-wrappers.so.VERSION
libMIOpenOptMain.so                 libMLIRMIGraphXToTosa.so                 libMLIRMIOpenPipeline.so          libMLIRMIOpenTuning.so
libMIOpenOptMain.so.VERSION         libMLIRMIGraphXToTosa.so.VERSION         libMLIRMIOpenPipeline.so.VERSION  libMLIRMIOpenTuning.so.VERSION
libMLIRCAPIMIGraphX.so              libMLIRMIOpenConv2dGenerator.so          libMLIRMIOpenToGPU.so             libMLIRTosaToMIOpen.so
libMLIRCAPIMIGraphX.so.VERSION      libMLIRMIOpenConv2dGenerator.so.VERSION  libMLIRMIOpenToGPU.so.VERSION     libMLIRTosaToMIOpen.so.VERSION

But then, when configuring MIOpen, I get this message:

[...]
-- Build with rocblas
-- HIP backend selected.
-- clang-offload-bundler found: /opt/rocm-dev2/llvm/bin/clang-offload-bundler
CMake Error at CMakeLists.txt:314 (find_library):
  Could not find LIBMLIRMIOPEN using the following names: MLIRMIOpen

And indeed, there is not a library (.so file) with that name. Am I doing something wrong?

Thank you for your help!

[Question] Are there any docs on using the MIOpen dialect gemm with xdlops?

Hey guys, I have a question: I am trying to set up a toy example of gemm using xdlops with the miopen dialect. I have seen the example miopen_xdlops_gemm_v2_one_result_f16, but it is not very obvious to me how the function signature works.

Looking at the op definition and the lowering to MFMA, I can roughly understand that matrixA and matrixB are supposed to be the initial matrices, which we load into bufferA and bufferB for computation, and that the two indices are wave offsets.

But I still don't quite understand why matrixA and matrixB have a size of 12288xf16. I was wondering if there are docs on this, and if you have any pointers on writing simple gemm test cases for the miopen dialect with xdlops. Thanks in advance! :)

[Issue]: Reduced performance with MLIR running two models with default flags

Problem Description

While attempting to move the MLIR commit hash from bf2911e to da3df73 via ROCm/AMDMIGraphX#2798, I saw a 13% drop in performance on two models.

Operating System

20.04.6

CPU

AMD EPYC 7702 64-Core Processor

GPU

AMD Instinct MI250, AMD Instinct MI100

ROCm Version

ROCm 6.0.0

ROCm Component

rocMLIR

Steps to Reproduce

Using MLIR
Before: bf2911e
After: da3df73

migraphx-driver perf --gpu mobilenet_i64.pb

Before: gpu::code_object::mlir_convolution_add_clip: 7.59713ms / 26 = 0.292197ms
After:  gpu::code_object::mlir_convolution_add_clip: 8.59946ms / 26 = 0.330749ms

migraphx-driver perf --gpu mobilenet_i64.pb

Before: gpu::code_object::mlir_convolution: 165.61ms / 237 = 0.698778ms
After:  gpu::code_object::mlir_convolution: 184.571ms / 237 = 0.77878ms

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

No response

Additional Information

No response

private copy of llvm

I am packaging rocMLIR for Fedora.

Fedora will have a problem with the local copy of llvm-project.
The use of ROCm's llvm was rejected, so rocMLIR's copy will (likely) be rejected as well.

rocMLIR needs to work with the system's llvm.

Documentation about your work

I am interested in your work on the MIOpen dialect, and I want to base my compiler on it. Where should I start? Do you have your own documentation about it?
I also find that there are only conv, gemm, and other GPU operations, without MIOpen layer ops. Do you plan to support those, or to generate them with MLIR?

wrap the hip kernel [help wanted]

If my understanding is correct, the workflow in this repo is:
MIOpen dialect -> GPU dialect -> ROCDL dialect

I'd like to use an AMD GPU in a different manner.
Suppose that I already have a HIP kernel like the one below:
__global__ void
vectoradd_float(float* __restrict__ a, const float* __restrict__ b, const float* __restrict__ c, int width, int height)
{
  int x = hipBlockDim_x * hipBlockIdx_x + hipThreadIdx_x;
  int y = hipBlockDim_y * hipBlockIdx_y + hipThreadIdx_y;

  int i = y * width + x;
  if (i < (width * height)) {
    a[i] = b[i] + c[i];
  }
}

  1. Can this kernel be compiled individually?
    In OpenCL, host code and device code can be compiled in different phases.
    In HIP programming, host code and device code can be compiled together using hipcc.
    Now my question is: can I compile the device kernel individually?
    If so, what will it look like after compiling?

  2. Is there a way to wrap the kernel in MLIR?
    It seems that kernel outlining and gpu-to-cubin are the same as my idea, but:

  • kernel outlining is based on the gpu dialect
  • gpu-to-cubin is for NVIDIA GPUs
    Can anyone help clarify how I can do this for an AMD GPU?

many thanks

hipModuleLaunchKernel not working but hipLaunchKernelGGL equivalent works

Hey guys, I am interested in using hipModuleLaunchKernel for matmuls, but it does not seem to be working.
I suspect it might be due to the way I set up the kernel arguments.
If I set it up with kernelParams as such:
Screen Shot 2021-06-03 at 5 23 45 PM
It gets this error:
Screen Shot 2021-06-03 at 1 19 19 AM

I also tried using the extras field, as some documentation points out for the kernel args:
Screen Shot 2021-06-03 at 5 24 43 PM
and got this error instead:
Screen Shot 2021-06-03 at 5 25 13 PM

Weirdly, when I modify the code to launch the same kernel with hipLaunchKernelGGL, it was able to output:
Screen Shot 2021-06-03 at 5 27 49 PM

Here is the sample code:
var_mm.zip

Thanks in advance for all your help! :)

Please add a tag for 5.4.1

Hi, I am packaging this among other libraries for nixpkgs. Seen here: NixOS/nixpkgs@b0bd0b4
I'm not sure if there were any pertinent changes regarding this library for 5.3.1, but like the other ROCm libraries, I believe this should have such a tag.

Some observations from gfx1102 runs

Hi,

I've got an RX 7600 card, which is gfx1102 with WMMA support, and here are a couple of issues I found.

I tested a 1k x 1k x 1k gemm with this:
./bin/rocmlir-gen -ph -operation gemm -t f16 -out_datatype f16 --arch gfx1102 --num_cu 32 -g 1 -m 1024 -k 1024 -n 1024 -transA=True -transB=False

The initially picked tuning params are:

// -----// IR Dump After RockAffixTuningParametersPass (rock-affix-params) //----- //
func.func @rock_gemm(%arg0: memref<1x1024x1024xf16>, %arg1: memref<1x1024x1024xf16>, %arg2: memref<1x1024x1024xf16>) attributes {block_size = 128 : i32, grid_size = 64 : i32, kernel, mhal.arch = "amdgcn-amd-amdhsa:gfx1102", wave_size = 32 : i32} {
  rock.gemm %arg2 = tr %arg0 * %arg1 features =  dot|atomic_add|atomic_fmax_f32|wmma storeMethod =  set {arch = "amdgcn-amd-amdhsa:gfx1102", derivedBlockSize = 128 : i32, gridSize = 64 : i32, numCU = 32 : i32, params = #rock.wmma_gemm_params<kpackPerBlock = 4, mPerBlock = 128, nPerBlock = 128, kpack = 16, mPerWave = 64, nPerWave = 64, forceUnroll = true>} : memref<1x1024x1024xf16> = memref<1x1024x1024xf16> * memref<1x1024x1024xf16>
  return
}

Which is --perf_config='128, 128, 4, 64, 64, 16, 1, 1'.
This runs and verifies against the CPU result (when enabled).
But it appears to mismatch validRangeWmmaGemmParams in tuning, which has {16, 32} for kPerBlock and sets the last param to false.

The other issue is that it causes a GPU hang for lots of tuning params, for example (a sketch of a full command line follows this list):

 --perf_config='16, 32, 16, 16, 32, 16, 1, 0'
 --perf_config='16, 64, 16, 16, 64, 16, 1, 1'
 --perf_config='32, 16, 32, 16, 16, 16, 1, 0'
 --perf_config='32, 32, 32, 16, 32, 16, 1, 0'
 --perf_config='64, 16, 16, 32, 16, 16, 1, 0'
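
Presumably each of these is reproduced by appending the perf_config to the rocmlir-gen command above and piping through the pipeline from the README; a sketch for the first config (where exactly --perf_config is passed is an assumption, not a verified command):

./bin/rocmlir-gen -ph -pv -operation gemm -t f16 -out_datatype f16 --arch gfx1102 --num_cu 32 -g 1 -m 1024 -k 1024 -n 1024 -transA=True -transB=False --perf_config='16, 32, 16, 16, 32, 16, 1, 0' | ./bin/rocmlir-driver -c | ./external/llvm-project/llvm/bin/mlir-cpu-runner --shared-libs=./external/llvm-project/llvm/lib/libmlir_rocm_runtime.so,./lib/libconv-validation-wrappers.so,./external/llvm-project/llvm/lib/libmlir_runner_utils.so --entry-point-result=void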

What I've noticed is that the failing ones use an even larger number of VGPRs, and the compiler spills them to memory.

    .sgpr_count:     18
    .sgpr_spill_count: 0
    .symbol:         rock_gemm.kd
    .vgpr_count:     256
    .vgpr_spill_count: 22
    .wavefront_size: 32

Kernels that do run don't have spilled VGPRs. I'm not sure that's actually what's causing the issue; I've only checked several cases.

Build failure with ubuntu 22.04

There is this failure when MIGraphX builds with Ubuntu 22.04:

Traceback (most recent call last):
  File "/usr/local/cget/build/tmp-25313f4698354e8987318405eb382ca9/rocMLIR-bf2911e5245d6452d194ee5dc41c9c2b2869e5c0/external/llvm-project/mlir/lib/Dialect/GPU/AmdDeviceLibsIncGen.py", line 53, in <module>
    generate(Path(sys.argv[1]), Path(sys.argv[2]), sys.argv[3:])
  File "/usr/local/cget/build/tmp-25313f4698354e8987318405eb382ca9/rocMLIR-bf2911e5245d6452d194ee5dc41c9c2b2869e5c0/external/llvm-project/mlir/lib/Dialect/GPU/AmdDeviceLibsIncGen.py", line 29, in generate
    with (bcPath / (lib + ".bc")).open("rb") as libFile:
  File "/usr/lib/python3.10/pathlib.py", line 1119, in open
    return self._accessor.open(self, mode, buffering, encoding, errors,
FileNotFoundError: [Errno 2] No such file or directory: '/usr/amdgcn/bitcode/ocml.bc'

[Issue]: Installation missing cmake

Problem Description

Trying to use rocMLIR with MIOpen. MIOpen expects to find rocMLIR with the usual cmake config file
https://github.com/ROCm/MIOpen/blob/develop/CMakeLists.txt#L408

A successful build/install produces a lot of .so files but no cmake/rocMLIR.

Operating System

Fedora 40

CPU

x86_64

GPU

AMD Radeon Pro W7900

ROCm Version

ROCm 6.1.0

ROCm Component

No response

Steps to Reproduce

mkdir build
cd build

prefix=$PWD/install
launcher=ccache

cmake -G Ninja \
  -DCMAKE_INSTALL_PREFIX=${prefix} \
  -DAMDGPU_TARGETS=gfx1100 \
  -DBUILD_SHARED_LIBS=OFF \
  -DCMAKE_BUILD_TYPE=Release \
  -DCMAKE_SKIP_RPATH=ON \
  -DCMAKE_C_COMPILER=gcc \
  -DCMAKE_CXX_COMPILER=g++ \
  -DCMAKE_C_COMPILER_LAUNCHER=$launcher \
  -DCMAKE_CXX_COMPILER_LAUNCHER=$launcher \
  -DLLVM_BUILD_TOOLS=OFF \
  -DLLVM_LIBDIR_SUFFIX=64 \
  -DLLVM_ENABLE_PROJECTS="lld;mlir;llvm" \
  -DLLVM_INCLUDE_DOCS=OFF \
  -DLLVM_INCLUDE_EXAMPLES=OFF \
  -DLLVM_INCLUDE_TESTS=OFF \
  -DLLVM_TARGETS_TO_BUILD="X86;AMDGPU" \
  -DMLIR_INCLUDE_TESTS=OFF \
  ..

if [ -f build.ninja ]; then
  ninja
  if [ $? = 0 ]; then
    ninja install
  else
    bash
    exit 1
  fi
fi

$ find install
install
install/lib64
install/lib64/libMLIRRockAnalysis.so.2.0
install/lib64/libMLIRRockAnalysis.so
install/lib64/libRocmlirEmulateFp8ExtTrunc.so.2.0
install/lib64/libRocmlirEmulateFp8ExtTrunc.so
install/lib64/libMLIRGPUToMIGraphX.so.2.0
install/lib64/libMLIRGPUToMIGraphX.so
install/lib64/libMLIRMIGraphXToTosa.so.2.0
install/lib64/libMLIRMIGraphXToTosa.so
install/lib64/libMLIRRockToGPU.so.2.0
install/lib64/libMLIRRockToGPU.so
install/lib64/libMLIRTosaToRock.so.2.0
install/lib64/libMLIRTosaToRock.so
install/lib64/libMLIRRockOps.so.2.0
install/lib64/libMLIRRockOps.so
install/lib64/libMLIRRockTransforms.so.2.0
install/lib64/libMLIRRockTransforms.so
install/lib64/libMLIRRockTuning.so.2.0
install/lib64/libMLIRRockTuning.so
install/lib64/libMLIRRockConv2dGenerator.so.2.0
install/lib64/libMLIRRockConv2dGenerator.so
install/lib64/libMLIRRockPipeline.so.2.0
install/lib64/libMLIRRockPipeline.so
install/lib64/libMLIRRockUtility.so.2.0
install/lib64/libMLIRRockUtility.so
install/lib64/libMLIRMIGraphXDialect.so.2.0
install/lib64/libMLIRMIGraphXDialect.so
install/lib64/libMLIRMIGraphXTransforms.so.2.0
install/lib64/libMLIRMIGraphXTransforms.so
install/lib64/libMLIRMIGraphXPipeline.so.2.0
install/lib64/libMLIRMIGraphXPipeline.so
install/lib64/libconv-validation-wrappers.so.2.0
install/lib64/libconv-validation-wrappers.so
install/lib64/libMLIRRocTarget.so.2.0
install/lib64/libMLIRRocTarget.so
install/lib64/libGpuModuleToRocdlirTranslation.a
install/lib64/libMLIRCAPIMIGraphX.so.2.0
install/lib64/libMLIRCAPIMIGraphX.so
install/lib64/objects-Release
install/lib64/objects-Release/obj.MLIRCAPIMIGraphX
install/lib64/objects-Release/obj.MLIRCAPIMIGraphX/MIGraphX.cpp.o
install/lib64/objects-Release/obj.MLIRCAPIRock
install/lib64/objects-Release/obj.MLIRCAPIRock/Rock.cpp.o
install/lib64/objects-Release/obj.MLIRCAPIRegisterRocMLIR
install/lib64/objects-Release/obj.MLIRCAPIRegisterRocMLIR/RegisterRocMLIR.cpp.o
install/lib64/libMLIRCAPIRock.so.2.0
install/lib64/libMLIRCAPIRock.so
install/lib64/libMLIRCAPIRegisterRocMLIR.so.2.0
install/lib64/libMLIRCAPIRegisterRocMLIR.so
install/lib64/libRocMLIROptMain.a

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

No response

Additional Information

with develop branch

No applicable convolution

When trying to run the SDXL UNet with MLIR for all convolutions, there is an error in MIGraphX saying there is no applicable kernel. This is the problem config:

gfx942:sramecc+:xnack-        304     convfp16 -F 1 -f GNCHW -I NGCHW -O NGCHW -n 2 -c 640 -H 64 -W 64 -k 640 -y 3 -x 3 -p 1 -q 1 -u 2 -v 2 -l 1 -j 1 -g 1

I see these errors pop up; I don't know if they're relevant:

Error: 'linalg.generic' op is infusible with non-`Set` store method
Note: see current operation:
"linalg.generic"(%4, %1, %3, %100) <{indexing_maps = [affine_map<(d0, d1, d2, d3) -> (d0, d1, d2, d3)>, affine_map<(d0, d1, d2, d3) -> (d0, d1, d2, d3)>, affine_map<(d0, d1, d2, d3) -> (d0, d1, d2, d3)>, affine_map<(d0, d1, d2, d3) -> (d0, d1, d2, d3)>], iterator_types = [#linalg.iterator_type<parallel>, #linalg.iterator_type<parallel>, #linalg.iterator_type<parallel>, #linalg.iterator_type<parallel>], operandSegmentSizes = array<i32: 3, 1>}> ({
^bb0(%arg5: f16, %arg6: f16, %arg7: f16, %arg8: f16):
  %101 = "arith.addf"(%arg5, %arg6) <{fastmath = #arith.fastmath<none>}> : (f16, f16) -> f16
  %102 = "arith.addf"(%101, %arg7) <{fastmath = #arith.fastmath<none>}> : (f16, f16) -> f16
  "linalg.yield"(%102) : (f16) -> ()
}) {rock.majorTensorNumber = 2 : index} : (memref<2x1280x32x32xf16>, memref<2x1280x32x32xf16>, memref<2x1280x32x32xf16>, memref<2x1280x32x32xf16>) -> ()

This is using e50d72f.
