rocm / llvm-project

This project forked from llvm/llvm-project


This is the AMD-maintained fork of the LLVM git repository. This repository accepts pull requests and issues related to AMD fork-specific topics (amd/*). For all other issues/PRs, please submit upstream at https://github.com/llvm/llvm-project.

License: Other

C++ 33.34% CMake 0.28% C 17.09% Objective-C++ 0.07% HTML 0.14% Fortran 0.74% Python 0.88% LLVM 37.08% Assembly 8.91% Starlark 0.08% Cuda 0.06% MLIR 1.26% NASL 0.01% Roff 0.01% SWIG 0.02% Shell 0.02% Perl 0.02% Batchfile 0.01% Dockerfile 0.01% CSS 0.01%

llvm-project's Issues

[Feature]: Need to dump the GPU assembly code generated with Windows HIP SDK

Suggestion Description

It seems the HIP compiler on Linux can be controlled by the environment variable KMDUMPISA: when KMDUMPISA=1 is set, the compiler generates the assembly code of the GPU code object.

I tried setting the same KMDUMPISA environment variable on Windows, with HIP SDK 5.7.1, but no assembly code was generated.

I need to see the assembly code to find optimization opportunities in my code.

Thanks.

Operating System

Windows 10

GPU

RX 7900XT

ROCm Component

HIPCC

Comgr calls exit

Comgr calls exit(), which is not advised for a library.
To quote Fedora's rpmlint tool:

This library package calls exit() or _exit(), probably in a non-fork()
context. Doing so from a library is strongly discouraged - when a library
function calls exit(), it prevents the calling program from handling the
error, reporting it to the user, closing files properly, and cleaning up any
state that the program has. It is preferred for the library to return an
actual error code and let the calling program decide how to handle the
situation.
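
For illustration, here is a minimal sketch (not Comgr's actual API) of the pattern rpmlint recommends: the library reports failure through a status code and leaves the decision to the caller, instead of terminating the process.

#include <cstdio>
#include <cstdlib>

// Discouraged: the library terminates the whole process on failure,
// so the caller never gets a chance to report the error or clean up
// its own state.
void lib_do_work_or_die(bool ok) {
  if (!ok) {
    std::fprintf(stderr, "lib: fatal error\n");
    std::exit(1);
  }
}

// Preferred: return a status code and let the calling program decide.
enum class LibStatus { Success, Error };

LibStatus lib_do_work(bool ok) {
  return ok ? LibStatus::Success : LibStatus::Error;
}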

Thanks!

correctly rounded mathematical functions?

The current C working draft [1, p. 392] reserves names for correctly
rounded functions (cr_exp, cr_log, cr_sin, ...).

We propose to provide such correctly rounded implementations
for the three IEEE formats (binary32, binary64, binary128) and the
"extended double" format (long double on x86_64).

These implementations will be correctly rounded for all rounding modes;
for example, one could do the following to emulate interval arithmetic:

fesetround (FE_DOWNWARD);
y_lo = cr_exp (x_lo);
fesetround (FE_UPWARD);
y_hi = cr_exp (x_hi);

Users who want a fast implementation will call the exp/log/sin/... functions;
users who want a correctly rounded function and thus reproducible results
(whatever the hardware, compiler, or operating system) will use the
cr_exp/cr_log/cr_sin/... functions. Our goal is nevertheless to get the
best performance possible.

Our objective is to provide open-source implementations that can be integrated
in the major mathematical libraries (GNU libc, Intel Math Library, AMD Libm,
Redhat Newlib, OpenLibm, Musl, llvm-libc, CUDA, ROCm).

Are developers of ROCm interested in such functions?
If so, we could discuss the requirements for integration into ROCm in terms
of license, table size, and allowed operations.

We have started to work on two functions (cbrt and acos), for which we
provide presumably correctly rounded implementations (up to the knowledge
of hard-to-round cases) [2].

Christoph Lauter
Jean-Michel Muller
Alexei Sibidanov
Paul Zimmermann

[1] http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2596.pdf
[2] https://homepages.loria.fr/PZimmermann/CORE-MATH/

Binary hipcc doesn't detect clang properly

Hi! After building the 5.6.0 branch for Arch Linux I noticed that hipcc.bin doesn't work properly, as the compiler path isn't picked up correctly. The output of

env LANG=C.UTF-8 ./build/hipcc.bin --version

is

sh: line 1: /tmp/canRunqEY2uN: Is a directory
sh: line 1: /tmp/canRunGoAqAz: Is a directory
sh: line 1: /tmp/canRunJ6tdsf: Is a directory
sh: line 1: /tmp/canRun6EPxCE: Is a directory
Device not supported - Defaulting to AMD
sh: line 1: /bin/rocm_agent_enumerator: No such file or directory
sh: line 1: /tmp/canRunxo0z3d: Is a directory
sh: line 1: /tmp/canRunG1KbTL: Is a directory
Hip Clang Compiler not found
HIP version: 4.4.0-0
sh: line 1: llvm/bin/clang++: No such file or directory

failed to execute:llvm/bin/clang++ --driver-mode=g++ -L"/home/torsten/Dokumente/HIPCC/lib" -O3 -lgcc_s -lgcc -lpthread -lm -lrt  --version -Wl,-rpath=/home/torsten/Dokumente/HIPCC/lib:/lib -lamdhip64  -Lllvm/bin/../lib/clang//lib/linux -lclang_rt.builtins-x86_64

The issue is that complierPath [sic!], as constructed in src/hipBin_amd.h::HipBinAmd::constructCompilerPath, relies on getRoccmPath() in src/hipBin_base.h, which only returns the content of the environment variable ROCM_PATH. If this is not set, hipClangPath is an empty path. The Perl script instead defaults ROCM_PATH to /opt/rocm/ when the environment variable is not set.
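
For comparison, a minimal sketch (not the actual hipBin code) of the defaulting behaviour the Perl script implements and which hipcc.bin appears to be missing:

#include <cstdlib>
#include <string>

// Sketch only: return ROCM_PATH if set, otherwise fall back to /opt/rocm
// the way hipcc.pl does. The real getRoccmPath() in src/hipBin_base.h
// currently yields an empty string when the variable is unset.
std::string getRocmPathOrDefault() {
  const char* env = std::getenv("ROCM_PATH");
  return (env && *env) ? std::string(env) : std::string("/opt/rocm");
}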

But setting ROCM_PATH doesn't really fix my issue.

env LANG=C.UTF-8 ROCM_PATH=/opt/rocm ./build/hipcc.bin --version

with output

sh: line 1: /tmp/canRunIRLYT9: Is a directory
sh: line 1: /tmp/canRunDRvTyG: Is a directory
sh: line 1: /tmp/canRunI3Saom: Is a directory
sh: line 1: /tmp/canRun3Nr0Ej: Is a directory
Device not supported - Defaulting to AMD
sh: line 1: /tmp/canRunfa4okA: Is a directory
sh: line 1: /tmp/canRunlaL7PI: Is a directory
Hip Clang Compiler not found
HIP version: 4.4.0-0
clang version 16.0.0
Target: x86_64-pc-linux-gnu
Thread model: posix
InstalledDir: /opt/rocm/llvm/bin

It does report the correct path, but Hip Clang Compiler not found is still raised by src/hipBin_amd.h::HipBinAmd::getCompilerVersion. I haven't figured out yet why this doesn't work.

Furthermore, I'm wondering where the shell errors (sh: line 1 ...) are coming from.

The perl script works as expected:

env LANG=C.UTF-8 ROCM_PATH=/opt/rocm ./build/hipcc.pl --version

HIP version: 5.5.0-0
clang version 16.0.0
Target: x86_64-pc-linux-gnu
Thread model: posix
InstalledDir: /opt/rocm/llvm/bin

Plan to support mfma dtype cast?

I think float4 and __attribute__((__vector_size__(4 * sizeof(float)))) float are the same thing, but hipcc provides no direct conversion between them:

extern "C" __global__ void __launch_bounds__(64) sgemm_16x16x16(float *__restrict__ A, float *__restrict__ B, float *__restrict__ C) {
//   using float4 = __attribute__((__vector_size__(4 * sizeof(float)))) float;
  float Accum[4];
  float MultiA[1];
  float MultiB[1];
  for (int i = 0; i < 4; ++i) {
    Accum[i] = 0.000000e+00f;
  }
  MultiA[0] = A[(((((int)threadIdx.x) & 15) * 4) + (((int)threadIdx.x) >> 4))];
  MultiB[0] = B[(((((int)threadIdx.x) >> 4) * 4) + (((int)threadIdx.x) & 15))];
  *(float4 *)(Accum + 0) =
      (__builtin_amdgcn_mfma_f32_16x16x4f32(
          *((float *)MultiA + 0), *((float *)MultiB + 0),
          *((float4 *)Accum + 0), 0, 0, 0));
  ;
  for (int mma_accum_c_id = 0; mma_accum_c_id < 4; ++mma_accum_c_id) {
    C[((((((int)threadIdx.x) >> 4) * 64) + (mma_accum_c_id * 16)) +
       (((int)threadIdx.x) & 15))] = Accum[mma_accum_c_id];
  }
}

This throws no viable conversion from 'float4' (aka 'HIP_vector_type<float, 4>') to '__attribute__((__vector_size__(4 * sizeof(float)))) float' (vector of 4 'float' values), which doesn't make sense to me.
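
As a possible workaround (my own assumption, not an official recommendation), declaring the accumulator with Clang's native vector extension avoids the HIP_vector_type wrapper entirely, so no conversion is needed. A sketch, which only builds when targeting an MFMA-capable arch such as gfx908/gfx90a:

#include <hip/hip_runtime.h>

// Native Clang vector type; this is what the MFMA builtin operates on,
// unlike HIP's float4, which is the HIP_vector_type<float, 4> class.
typedef float floatx4 __attribute__((ext_vector_type(4)));

extern "C" __global__ void __launch_bounds__(64)
sgemm_16x16x16_native(const float *__restrict__ A, const float *__restrict__ B,
                      float *__restrict__ C) {
  floatx4 accum = {0.f, 0.f, 0.f, 0.f};
  float a = A[((threadIdx.x & 15) * 4) + (threadIdx.x >> 4)];
  float b = B[((threadIdx.x >> 4) * 4) + (threadIdx.x & 15)];
  // The builtin takes and returns the native 4 x float vector directly.
  accum = __builtin_amdgcn_mfma_f32_16x16x4f32(a, b, accum, 0, 0, 0);
  for (int i = 0; i < 4; ++i)
    C[((threadIdx.x >> 4) * 64) + (i * 16) + (threadIdx.x & 15)] = accum[i];
}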

[Feature]: Better preprocessor macros to detect RDNA/CDNA family at compile time

Suggestion Description

As new instructions/features are added with each new arch, it is useful to know the target architecture at compile time to employ separate code paths. For example: FP64 MFMA was added in CDNA2, so CDNA2 and later can use one code path while CDNA1 uses a different code path.

It gets tedious because all the archs need to be enumerated, and code needs to be updated as new archs become available:

#if __gfx940__ || __gfx941__ || __gfx942__
// Code path for CDNA3
#elif __gfx90a__
// Code path for CDNA2
#elif __gfx908__
// Code path for CDNA1
#endif

It would be nice if we had something like:

#if CDNA_VERSION >= 3
// Code path for CDNA3 and later
#elif CDNA_VERSION >= 2
// Code path for CDNA2
#else
// Code path for CDNA1
#endif

This would mirror the way it is done in CUDA:

__device__ void func()
{
#if __CUDA_ARCH__ >= 800
   // Code path for compute capability 8.x and later
#elif __CUDA_ARCH__ >= 700
   // Code path for compute capability 7.x
#else
  // Code path for compute capability < 7.0
#endif
}
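
Until such a macro exists, one way to approximate it (a sketch that assumes only the existing per-target __gfx*__ macros) is a small helper header that collapses the enumeration into a single number, so the arch list only has to be maintained in one place:

// cdna_version.h -- sketch: derive a single CDNA_VERSION number from the
// existing per-target macros.
#if defined(__gfx940__) || defined(__gfx941__) || defined(__gfx942__)
#define CDNA_VERSION 3
#elif defined(__gfx90a__)
#define CDNA_VERSION 2
#elif defined(__gfx908__)
#define CDNA_VERSION 1
#else
#define CDNA_VERSION 0  // host pass or non-CDNA target
#endif

// Usage then mirrors the CUDA pattern above:
#if CDNA_VERSION >= 3
// Code path for CDNA3 and later
#elif CDNA_VERSION >= 2
// Code path for CDNA2
#endif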

Operating System

No response

GPU

No response

ROCm Component

No response

[Issue]: __builtin_amdgcn_workgroup_size_x incorrectly returns 0 on Code Object Model 5

Problem Description

Machine Details:
NAME="Ubuntu"
VERSION="22.04.4 LTS (Jammy Jellyfish)"
CPU:
model name : AMD Ryzen 7 7840HS w/ Radeon 780M Graphics
GPU:
Name: AMD Ryzen 7 7840HS w/ Radeon 780M Graphics
Marketing Name: AMD Ryzen 7 7840HS w/ Radeon 780M Graphics
Name: gfx1103
Marketing Name: AMD Radeon Graphics
Name: amdgcn-amd-amdhsa--gfx1103

Operating System

Ubuntu 22.04.4 LTS (Jammy Jellyfish)

CPU

AMD Ryzen 7 7840HS w/ Radeon 780M Graphics

GPU

AMD Radeon RX 7900 XTX

ROCm Version

ROCm 6.1.0

ROCm Component

clang-ocl

Steps to Reproduce

In the following kernel, __builtin_amdgcn_workgroup_size_x() always returns zero on the APU. This causes the index calculation to be incorrect. If I replace that call with a hardcoded value of 64 (which is what main.cpp sets the workgroup size to), the calculation comes out correct.

Trace from the running program -

Using agent: gfx1103
Engine init: OK
Kernel arg size: 280
Workgroup sizes: 64 1 1
Grid sizes: 100000 1 1
Setup dispatch: OK
Dispatch: OK
Wait: OK
We expected the sum to be :1409965408. Calculated sum is 4032

__attribute__((visibility("default"), amdgpu_kernel)) void add_arrays(int* input_a, int* input_b, int* output)
{
    int index =  __builtin_amdgcn_workgroup_id_x() * __builtin_amdgcn_workgroup_size_x() + __builtin_amdgcn_workitem_id_x();
    output[index] = input_a[index] + input_b[index];
}

You can run the above kernel using the following steps

  1. git clone https://github.com/deepankarsharma/hansa.git
  2. cd hansa
  3. mkdir build
  4. cd build
  5. cmake ..
  6. make
  7. ./hansa
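
A possible workaround (a sketch I have not verified against this reproducer) is to pass the workgroup size as an explicit kernel argument instead of querying the builtin, generalizing the hardcoded-64 experiment above:

// Sketch: same kernel as above, but the workgroup size is passed in by the
// host (which already knows it, e.g. 64 in main.cpp) rather than read from
// __builtin_amdgcn_workgroup_size_x().
__attribute__((visibility("default"), amdgpu_kernel))
void add_arrays_wg(int* input_a, int* input_b, int* output,
                   unsigned int workgroup_size_x)
{
    int index = __builtin_amdgcn_workgroup_id_x() * workgroup_size_x
              + __builtin_amdgcn_workitem_id_x();
    output[index] = input_a[index] + input_b[index];
}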

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

ROCk module version 6.7.0 is loaded

HSA System Attributes

Runtime Version: 1.13
Runtime Ext Version: 1.4
System Timestamp Freq.: 1000.000000MHz
Sig. Max Wait Duration: 18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model: LARGE
System Endianness: LITTLE
Mwaitx: DISABLED
DMAbuf Support: YES

==========
HSA Agents


Agent 1


Name: AMD Ryzen 7 7840HS w/ Radeon 780M Graphics
Uuid: CPU-XX
Marketing Name: AMD Ryzen 7 7840HS w/ Radeon 780M Graphics
Vendor Name: CPU
Feature: None specified
Profile: FULL_PROFILE
Float Round Mode: NEAR
Max Queue Number: 0(0x0)
Queue Min Size: 0(0x0)
Queue Max Size: 0(0x0)
Queue Type: MULTI
Node: 0
Device Type: CPU
Cache Info:
L1: 32768(0x8000) KB
Chip ID: 0(0x0)
ASIC Revision: 0(0x0)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 5137
BDFID: 0
Internal Node ID: 0
Compute Unit: 16
SIMDs per CU: 0
Shader Engines: 0
Shader Arrs. per Eng.: 0
WatchPts on Addr. Ranges:1
Features: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: FINE GRAINED
Size: 15509504(0xeca800) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 2
Segment: GLOBAL; FLAGS: KERNARG, FINE GRAINED
Size: 15509504(0xeca800) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 3
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 15509504(0xeca800) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
ISA Info:


Agent 2


Name: gfx1103
Uuid: GPU-XX
Marketing Name: AMD Radeon Graphics
Vendor Name: AMD
Feature: KERNEL_DISPATCH
Profile: BASE_PROFILE
Float Round Mode: NEAR
Max Queue Number: 128(0x80)
Queue Min Size: 64(0x40)
Queue Max Size: 131072(0x20000)
Queue Type: MULTI
Node: 1
Device Type: GPU
Cache Info:
L1: 32(0x20) KB
L2: 2048(0x800) KB
Chip ID: 5567(0x15bf)
ASIC Revision: 7(0x7)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 2700
BDFID: 1024
Internal Node ID: 1
Compute Unit: 12
SIMDs per CU: 2
Shader Engines: 1
Shader Arrs. per Eng.: 2
WatchPts on Addr. Ranges:4
Coherent Host Access: FALSE
Features: KERNEL_DISPATCH
Fast F16 Operation: TRUE
Wavefront Size: 32(0x20)
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Max Waves Per CU: 32(0x20)
Max Work-item Per CU: 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
Max fbarriers/Workgrp: 32
Packet Processor uCode:: 35
SDMA engine uCode:: 17
IOMMU Support:: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 524288(0x80000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:2048KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 2
Segment: GLOBAL; FLAGS: EXTENDED FINE GRAINED
Size: 524288(0x80000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:2048KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 3
Segment: GROUP
Size: 64(0x40) KB
Allocatable: FALSE
Alloc Granule: 0KB
Alloc Recommended Granule:0KB
Alloc Alignment: 0KB
Accessible by all: FALSE
ISA Info:
ISA 1
Name: amdgcn-amd-amdhsa--gfx1103
Machine Models: HSA_MACHINE_MODEL_LARGE
Profiles: HSA_PROFILE_BASE
Default Rounding Mode: NEAR
Default Rounding Mode: NEAR
Fast f16: TRUE
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
FBarrier Max Size: 32
*** Done ***

Additional Information

LLVM ERROR: Unsupported calling convention for call for llvm/clang 14.0.5

I'm building rocm-compilersupport against llvm/clang 14.0.5. The build was successful, but many compile tests failed:

The following tests FAILED:
         12 - comgr_compile_test (Subprocess aborted)
         13 - comgr_compile_minimal_test (Subprocess aborted)
         16 - comgr_compile_device_libs_test (Subprocess aborted)
         17 - comgr_compile_source_with_device_libs_to_bc_test (Subprocess aborted)
Errors while running CTest

They all fail with

LLVM ERROR: Unsupported calling convention for call

I backtraced and found that at https://github.com/llvm/llvm-project/blob/c12386ae247c0d46e1d513942e322e3a0510b126/llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp#L980, CC is set to 91 (CallingConv::AMDGPU_KERNEL), while if I build comgr with ROCm's LLVM, the value is 0 (CallingConv::C).

I continued tracing until I lost track of this value at https://github.com/llvm/llvm-project/blob/c12386ae247c0d46e1d513942e322e3a0510b126/llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp#L1136. The value hit 0 twice, and then hit 91, which causes the unsupported calling convention error.

Also, running AMD_COMGR_SAVE_TEMPS=1 AMD_COMGR_REDIRECT_LOGS=stdout AMD_COMGR_EMIT_VERBOSE_LOGS=1 ./compile_minimal_test:

amd_comgr_do_action:
          ActionKind: AMD_COMGR_ACTION_COMPILE_SOURCE_TO_BC
             IsaName: amdgcn-amd-amdhsa--gfx803
             Options: "-mllvm" "-amdgpu-early-inline-all"
                Path:
            Language: AMD_COMGR_LANGUAGE_OPENCL_1_2
COMGR::executeInProcessDriver argv: clang "-cc1" "-triple" "amdgcn-amd-amdhsa" "-emit-llvm-bc" "-emit-llvm-uselists" "-clear-ast-before-backend" "-main-file-name" "source1.cl" "-mrelocation-model" "pic" "-pic-level" "1" "-fhalf-no-semantic-interposition" "-mframe-pointer=none" "-fdenormal-fp-math-f32=preserve-sign,preserve-sign" "-ffp-contract=on" "-fno-rounding-math" "-mconstructor-aliases" "-fvisibility" "hidden" "-fapply-global-visibility-to-externs" "-target-cpu" "gfx803" "-mllvm" "-treat-scalable-fixed-error-as-warning" "-debugger-tuning=gdb" "-resource-dir" "../../../../lib/clang/14.0.5" "-I" "/tmp/comgr-f8c1b9/include" "-isysroot" "/opt/gentoo" "-O3" "-std=cl1.2" "-fdebug-compilation-dir=/ext4-disk/build/portage/dev-libs/rocm-comgr-5.0.2-r1/work/rocm-comgr-5.0.2_build/test" "-ferror-limit" "19" "-fgnuc-version=4.2.1" "-fno-threadsafe-statics" "-fcolor-diagnostics" "-vectorize-loops" "-vectorize-slp" "-mllvm" "-amdgpu-early-inline-all" "-faddrsig" "-o" "/tmp/comgr-f8c1b9/output/source1.cl.bc" "-x" "cl" "/tmp/comgr-f8c1b9/input/source1.cl"
COMGR::executeInProcessDriver argv: clang "-cc1" "-triple" "amdgcn-amd-amdhsa" "-emit-llvm-bc" "-emit-llvm-uselists" "-clear-ast-before-backend" "-main-file-name" "source2.cl" "-mrelocation-model" "pic" "-pic-level" "1" "-fhalf-no-semantic-interposition" "-mframe-pointer=none" "-fdenormal-fp-math-f32=preserve-sign,preserve-sign" "-ffp-contract=on" "-fno-rounding-math" "-mconstructor-aliases" "-fvisibility" "hidden" "-fapply-global-visibility-to-externs" "-target-cpu" "gfx803" "-mllvm" "-treat-scalable-fixed-error-as-warning" "-debugger-tuning=gdb" "-resource-dir" "../../../../lib/clang/14.0.5" "-I" "/tmp/comgr-f8c1b9/include" "-isysroot" "/opt/gentoo" "-O3" "-std=cl1.2" "-fdebug-compilation-dir=/ext4-disk/build/portage/dev-libs/rocm-comgr-5.0.2-r1/work/rocm-comgr-5.0.2_build/test" "-ferror-limit" "19" "-fgnuc-version=4.2.1" "-fno-threadsafe-statics" "-fcolor-diagnostics" "-vectorize-loops" "-vectorize-slp" "-mllvm" "-amdgpu-early-inline-all" "-faddrsig" "-o" "/tmp/comgr-f8c1b9/output/source2.cl.bc" "-x" "cl" "/tmp/comgr-f8c1b9/input/source2.cl"
        ReturnStatus: AMD_COMGR_STATUS_SUCCESS

amd_comgr_do_action:
          ActionKind: AMD_COMGR_ACTION_LINK_BC_TO_BC
             IsaName: amdgcn-amd-amdhsa--gfx803
             Options: "-mllvm" "-amdgpu-early-inline-all"
                Path:
            Language: AMD_COMGR_LANGUAGE_OPENCL_1_2
        ReturnStatus: AMD_COMGR_STATUS_SUCCESS

amd_comgr_do_action:
          ActionKind: AMD_COMGR_ACTION_CODEGEN_BC_TO_RELOCATABLE
             IsaName: amdgcn-amd-amdhsa--gfx803
             Options: "-mllvm" "-amdgpu-early-inline-all"
                Path:
            Language: AMD_COMGR_LANGUAGE_OPENCL_1_2
COMGR::executeInProcessDriver argv: clang "-cc1" "-triple" "amdgcn-amd-amdhsa" "-emit-obj" "-mrelax-all" "--mrelax-relocations" "-clear-ast-before-backend" "-main-file-name" "linked.bc" "-mrelocation-model" "pic" "-pic-level" "1" "-fhalf-no-semantic-interposition" "-mframe-pointer=all" "-fdenormal-fp-math-f32=preserve-sign,preserve-sign" "-ffp-contract=on" "-fno-rounding-math" "-mconstructor-aliases" "-fvisibility" "hidden" "-fapply-global-visibility-to-externs" "-target-cpu" "gfx803" "-mllvm" "-treat-scalable-fixed-error-as-warning" "-debugger-tuning=gdb" "-resource-dir" "../../../../lib/clang/14.0.5" "-fdebug-compilation-dir=/ext4-disk/build/portage/dev-libs/rocm-comgr-5.0.2-r1/work/rocm-comgr-5.0.2_build/test" "-ferror-limit" "19" "-fgnuc-version=4.2.1" "-fcolor-diagnostics" "-mllvm" "-amdgpu-early-inline-all" "-faddrsig" "-o" "/tmp/comgr-da2c2e/output/linked.bc.o" "-x" "ir" "/tmp/comgr-da2c2e/input/linked.bc"
LLVM ERROR: Unsupported calling convention for call
zsh: abort      AMD_COMGR_SAVE_TEMPS=1 AMD_COMGR_REDIRECT_LOGS=stdout =1

Running the same command directly from the command line fails in the same way:

clang "-cc1" "-triple" "amdgcn-amd-amdhsa" "-emit-obj" "-mrelax-all" "--mrelax-relocations" "-clear-ast-before-backend" "-main-file-name" "linked.bc" "-mrelocation-model" "pic" "-pic-level" "1" "-fhalf-no-semantic-interposition" "-mframe-pointer=all" "-fdenormal-fp-math-f32=preserve-sign,preserve-sign" "-ffp-contract=on" "-fno-rounding-math" "-mconstructor-aliases" "-fvisibility" "hidden" "-fapply-global-visibility-to-externs" "-target-cpu" "gfx803" "-mllvm" "-treat-scalable-fixed-error-as-warning" "-debugger-tuning=gdb" "-resource-dir" "../../../../lib/clang/14.0.5" "-fdebug-compilation-dir=/ext4-disk/build/portage/dev-libs/rocm-comgr-5.0.2-r1/work/rocm-comgr-5.0.2_build/test" "-ferror-limit" "19" "-fgnuc-version=4.2.1" "-fcolor-diagnostics" "-mllvm" "-amdgpu-early-inline-all" "-faddrsig" "-o" "/tmp/comgr-da2c2e/output/linked.bc.o" "-x" "ir" "/tmp/comgr-da2c2e/input/linked.bc"
fatal error: error in backend: Unsupported calling convention for call

[Issue]: ROCm 6.1.2 does not compile against upstream LLVM 17

Problem Description

I already reported the issue to @searlmc1. It seems there was a feature backport to this tree from LLVM 19 for 6.1.2, while 6.1.x was originally planned to use the LLVM 17 API. This causes issues for anyone using upstream LLVM 17 instead of the ROCm LLVM tree.

It's pretty easy to work around, because anyone using upstream LLVM can just revert 96b2ba3

I just wanted to report this for awareness, and to document it for people using upstream LLVM who are looking for workarounds. This will likely be an issue for a few releases, since LLVM 19 hasn't even branched yet at this time.

Operating System

All

CPU

Any

GPU

AMD Instinct MI300X

ROCm Version

ROCm 6.1.0

ROCm Component

llvm-project, ROCm-Device-Libs

Steps to Reproduce

No response

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

No response

Additional Information

No response

[Issue]: Create 6.1.x branch

Problem Description

As per the build guide for NVIDIA systems, I need to check out the $ROCM_BRANCH branch of hipcc.git. e.g.:

      export ROCM_BRANCH=rocm-6.1.x
      git clone -b "$ROCM_BRANCH" https://github.com/ROCm-Developer-Tools/hip.git
      git clone -b "$ROCM_BRANCH" https://github.com/ROCm-Developer-Tools/clr.git
      git clone -b "$ROCM_BRANCH" https://github.com/ROCm-Developer-Tools/HIPCC.git
      git clone -b "$ROCM_BRANCH" https://github.com/ROCm/hipother.git

Currently, however, whilst rocm-6.1.x exists in the other repos, I am forced to check out the rocm-6.0.x branch for HIPCC as this is the latest existing branch.

(Note also the documentation entirely fails to mention the hipother.git branch.)

Please either create a branch to allow consistency in building for NVIDIA or update the build instructions for NVIDIA systems to reflect the correct process.

Operating System

n/a

CPU

n/a

GPU

AMD Instinct MI250X

ROCm Version

ROCm 6.1.0

ROCm Component

No response

Steps to Reproduce

No response

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

No response

Additional Information

No response

[Feature]: Support executing Clang through a wrapper/launcher for caching

Suggestion Description

See ROCm/ROCm#2817 for a more detailed description and motivation.

It would be greatly beneficial for developer productivity and distro packagers' convenience if we could introduce a wrapper executable to launch the Clang executable. This would allow us to cache HIP compilations and massively speed up builds that have very few code changes, similar to how ccache and sccache can cache C/C++ and Rust builds. For example, when specifying sccache as the wrapper, instead of executing $ROCM_PATH/llvm/bin/clang --offload-arch=..., sccache $ROCM_PATH/llvm/bin/clang --offload-arch=... is executed. CMake's HIP support can already do the equivalent using the environment variable CMAKE_HIP_COMPILER_LAUNCHER.

The way I propose this feature be implemented is to introduce the environment variable HIP_CLANG_LAUNCHER, which is analogous to the CMake variable CMAKE_HIP_COMPILER_LAUNCHER. A PR implementing this feature can be found at ROCm/HIPCC#148. This feature is used to package the entire ROCm stack for Solus Linux, so it has been pretty thoroughly tested.

This feature by itself is not very useful, but it is very powerful when combined with compiler wrappers that cache compilations. Once HIP support in sccache is implemented, this will allow us to cache HIP compilations. Please see ROCm/ROCm#2817 for more details on this.
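
A rough sketch of the idea (hypothetical code, not the actual change in ROCm/HIPCC#148): hipcc reads HIP_CLANG_LAUNCHER and, if it is set, prepends it to the compiler command it is about to execute.

#include <cstdlib>
#include <string>

// Hypothetical illustration: prepend the launcher (e.g. "sccache") to the
// clang command line that hipcc builds.
std::string withLauncher(const std::string &clangCmd) {
  const char *launcher = std::getenv("HIP_CLANG_LAUNCHER");
  if (launcher && *launcher)
    return std::string(launcher) + " " + clangCmd;
  return clangCmd;
}

// With HIP_CLANG_LAUNCHER=sccache set, withLauncher("/opt/rocm/llvm/bin/clang
// --offload-arch=...") yields "sccache /opt/rocm/llvm/bin/clang
// --offload-arch=...".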

Operating System

No response

GPU

No response

ROCm Component

HIPCC

[Issue]: The OpenCL kernel compiled with clang-ocl in ROCm 6.1 produces incorrect results.

Problem Description

Incorrect results occur when loading and executing OpenCL kernels compiled with clang-ocl through the HIP API on ROCm 6.1 on Ubuntu 22.04.

Operating System

Ubuntu 22.04.3 LTS (Jammy Jellyfish)

CPU

AMD EPYC 7413

GPU

AMD Instinct MI250X

ROCm Version

ROCm 6.1.0

ROCm Component

clang-ocl

Steps to Reproduce

  1. Compile the OpenCL kernel with clang-ocl.
  2. Use HIP APIs such as hipModuleLoad(), hipModuleGetFunction(), and hipExtModuleLaunchKernel() to load and execute the compiled binary from the host code.
  3. Incorrect results are encountered.
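
For reference, a minimal sketch of the load-and-launch flow in step 2 (the file name, kernel name, sizes, and argument are placeholders; the actual reproducer is in the attached archive and uses hipExtModuleLaunchKernel):

#include <hip/hip_runtime.h>

// Sketch of step 2: load a code object produced by clang-ocl and launch one
// of its kernels via the HIP module API. Error checking omitted for brevity.
int main() {
  hipModule_t module;
  hipFunction_t function;
  hipModuleLoad(&module, "kernel.co");
  hipModuleGetFunction(&function, module, "my_kernel");

  void *devBuf = nullptr;            // placeholder kernel argument
  void *args[] = {&devBuf};
  hipModuleLaunchKernel(function,
                        /*gridDimX=*/1, 1, 1,
                        /*blockDimX=*/64, 1, 1,
                        /*sharedMemBytes=*/0, /*stream=*/nullptr,
                        args, nullptr);
  hipDeviceSynchronize();
  return 0;
}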

Attached below is the source code and Makefile used for reproduction.
The first value is found to be incorrect upon running the attached source code.

src.tar.gz

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

No response

Additional Information

No response

[Issue]: enabling `amdgpu-unsafe-fp-atomics` for gfx90a

Problem Description

Hi! I'm one of the developers of the AMDGPU.jl library, which provides support for AMD GPU programming in Julia.
To compile Julia GPU kernels we bypass HIP and use LLVM directly, hence -munsafe-fp-atomics is not available.

To enable HW atomics, we instead add the amdgpu-unsafe-fp-atomics=true function attribute to our LLVM IR during compilation.
This works fine on gfx1100 devices, replacing the CAS loop with a HW fadd.

However, on gfx90a devices this does nothing.
Am I missing something else that needs to be done?

Here's an example Julia kernel, which does atomic fadd on the first array item:

@kernel function ker!(x)
    @inbounds @atomic x[1] += 1f0
end

Here's its optimized LLVM IR with atomicrmw fadd float, which is the same for gfx1100 and gfx90a (notice the amdgpu-unsafe-fp-atomics attribute):

; ModuleID = 'start'
source_filename = "start"
target datalayout = "e-p:64:64-p1:64:64-p2:32:32-p3:32:32-p4:64:64-p5:32:32-p6:32:32-i64:64-v16:16-v24:32-v32:32-v48:64-v96:128-v192:256-v256:256-v512:512-v1024:1024-v2048:2048-n32:64-S32-A5-G1-ni:10:11:12:13"
target triple = "amdgcn-amd-amdhsa"

; Function Attrs: nounwind readnone speculatable willreturn
declare i32 @llvm.amdgcn.workgroup.id.x() #0

; Function Attrs: nounwind readnone speculatable willreturn
declare i32 @llvm.amdgcn.workitem.id.x() #0

; Function Attrs: cold noreturn nounwind
declare void @llvm.amdgcn.endpgm() #1

;  @ none within `gpu_ker!`
define amdgpu_kernel void @_Z8gpu_ker_16CompilerMetadataI11DynamicSize12DynamicCheckv16CartesianIndicesILi1E5TupleI5OneToI5Int64EEE7NDRangeILi1ES0_S0_S2_ILi1ES3_IS4_IS5_EEES2_ILi1ES3_IS4_IS5_EEEEE14ROCDeviceArrayI7Float32Li1ELi1EE({ i64, i64, i64, i64, i64, i64, i32, i32, i64, i64, i64, i64 } %state, { [1 x [1 x [1 x i64]]], [2 x [1 x [1 x [1 x i64]]]] } %0, { [1 x i64], i8 addrspace(1)*, i64 } %1) local_unnamed_addr #2 !dbg !41 {
conversion:
  %.fca.0.0.0.0.extract = extractvalue { [1 x [1 x [1 x i64]]], [2 x [1 x [1 x [1 x i64]]]] } %0, 0, 0, 0, 0
  %.fca.1.1.0.0.0.extract = extractvalue { [1 x [1 x [1 x i64]]], [2 x [1 x [1 x [1 x i64]]]] } %0, 1, 1, 0, 0, 0
  %.fca.1.extract = extractvalue { [1 x i64], i8 addrspace(1)*, i64 } %1, 1
;  @ none within `gpu_ker!` @ /home/pxl-th/.julia/packages/KernelAbstractions/Zcyra/src/macros.jl:94
; ┌ @ /home/pxl-th/.julia/dev/AMDGPU/src/ROCKernels.jl:144 within `#__validindex`
; │┌ @ /home/pxl-th/.julia/dev/AMDGPU/src/device/gcn/indexing.jl:172 within `blockIdx`
; ││┌ @ /home/pxl-th/.julia/dev/AMDGPU/src/device/gcn/indexing.jl:95 within `blockIdx_x`
; │││┌ @ /home/pxl-th/.julia/dev/AMDGPU/src/device/gcn/indexing.jl:93 within `workgroupIdx_x`
; ││││┌ @ /home/pxl-th/.julia/dev/AMDGPU/src/device/gcn/indexing.jl:3 within `_index`
; │││││┌ @ /home/pxl-th/.julia/dev/AMDGPU/src/device/gcn/indexing.jl:3 within `macro expansion` @ /home/pxl-th/.julia/packages/LLVM/Q3CgR/src/interop/base.jl:38
        %2 = call i32 @llvm.amdgcn.workgroup.id.x(), !dbg !45, !range !66
; │└└└└└
; │┌ @ /home/pxl-th/.julia/dev/AMDGPU/src/device/gcn/indexing.jl:164 within `threadIdx`
; ││┌ @ /home/pxl-th/.julia/dev/AMDGPU/src/device/gcn/indexing.jl:89 within `threadIdx_x`
; │││┌ @ /home/pxl-th/.julia/dev/AMDGPU/src/device/gcn/indexing.jl:87 within `workitemIdx_x`
; ││││┌ @ /home/pxl-th/.julia/dev/AMDGPU/src/device/gcn/indexing.jl:3 within `_index`
; │││││┌ @ /home/pxl-th/.julia/dev/AMDGPU/src/device/gcn/indexing.jl:3 within `macro expansion` @ /home/pxl-th/.julia/packages/LLVM/Q3CgR/src/interop/base.jl:38
        %3 = call i32 @llvm.amdgcn.workitem.id.x(), !dbg !67, !range !76
; ││││└└
; ││││┌ @ int.jl:1068 within `+` @ int.jl:87
       %4 = add nuw nsw i32 %3, 1, !dbg !77
; │└└└└
; │┌ @ /home/pxl-th/.julia/packages/KernelAbstractions/Zcyra/src/nditeration.jl:84 within `expand`
; ││┌ @ abstractarray.jl:1291 within `getindex`
; │││┌ @ indices.jl:350 within `to_indices` @ indices.jl:354
; ││││┌ @ indices.jl:359 within `_to_indices1`
; │││││┌ @ indices.jl:277 within `to_index` @ indices.jl:292
; ││││││┌ @ number.jl:7 within `convert`
; │││││││┌ @ boot.jl:784 within `Int64`
; ││││││││┌ @ boot.jl:708 within `toInt64`
           %5 = zext i32 %4 to i64, !dbg !81
; ││└└└└└└└
; ││ @ /home/pxl-th/.julia/packages/KernelAbstractions/Zcyra/src/nditeration.jl:84 within `expand` @ /home/pxl-th/.julia/packages/KernelAbstractions/Zcyra/src/nditeration.jl:74
; ││┌ @ ntuple.jl:48 within `ntuple`
; │││┌ @ /home/pxl-th/.julia/packages/KernelAbstractions/Zcyra/src/nditeration.jl:78 within `#1`
; ││││┌ @ int.jl:86 within `-`
       %6 = zext i32 %2 to i64, !dbg !104
; ││││└
; ││││┌ @ int.jl:88 within `*`
       %7 = mul i64 %.fca.1.1.0.0.0.extract, %6, !dbg !112
; ││││└
; ││││┌ @ int.jl:87 within `+`
       %8 = add i64 %7, %5, !dbg !114
; │└└└└
; │ @ /home/pxl-th/.julia/dev/AMDGPU/src/ROCKernels.jl:145 within `#__validindex`
; │┌ @ multidimensional.jl:471 within `in`
; ││┌ @ tuple.jl:318 within `map`
; │││┌ @ range.jl:1439 within `in`
; ││││┌ @ int.jl:514 within `<=`
       %9 = icmp slt i64 %8, 1, !dbg !115
       %10 = icmp sgt i64 %8, %.fca.0.0.0.0.extract, !dbg !115
; └└└└└
  %11 = or i1 %9, %10, !dbg !62
  br i1 %11, label %L128, label %L104, !dbg !62

L104:                                             ; preds = %conversion
  %.fca.0.0.extract = extractvalue { [1 x i64], i8 addrspace(1)*, i64 } %1, 0, 0
;  @ none within `gpu_ker!` @ /home/pxl-th/.julia/packages/KernelAbstractions/Zcyra/src/macros.jl:95
; ┌ @ /home/pxl-th/.julia/dev/atomic.jl:6 within `macro expansion`
; │┌ @ /home/pxl-th/.julia/packages/Atomix/F9VIX/src/references.jl:95 within `getindex`
; ││┌ @ abstractarray.jl:702 within `checkbounds` @ abstractarray.jl:687
; │││┌ @ abstractarray.jl:763 within `checkindex`
; ││││┌ @ int.jl:513 within `<`
       %.not = icmp slt i64 %.fca.0.0.extract, 1, !dbg !127
; │││└└
; │││ @ abstractarray.jl:702 within `checkbounds`
     br i1 %.not, label %L115, label %L119, !dbg !133

L115:                                             ; preds = %L104
; │││┌ @ /home/pxl-th/.julia/dev/AMDGPU/src/device/quirks.jl:8 within `#throw_boundserror`
; ││││┌ @ /home/pxl-th/.julia/dev/AMDGPU/src/device/runtime.jl:113 within `signal_exception`
; │││││┌ @ /home/pxl-th/.julia/dev/AMDGPU/src/device/runtime.jl:11 within `exception_flag`
; ││││││┌ @ none within `kernel_state`
; │││││││┌ @ none within `macro expansion` @ /home/pxl-th/.julia/packages/LLVM/Q3CgR/src/interop/base.jl:38
          %state.i.fca.0.extract.i = extractvalue { i64, i64, i64, i64, i64, i64, i32, i32, i64, i64, i64, i64 } %state, 0, !dbg !141
; │││││└└└
; │││││┌ @ pointer.jl:146 within `unsafe_store!` @ pointer.jl:146
        %memcpy_refined_dst.i = inttoptr i64 %state.i.fca.0.extract.i to i32*, !dbg !156
        store i32 1, i32* %memcpy_refined_dst.i, align 1, !dbg !156
; │││││└
; │││││ @ /home/pxl-th/.julia/dev/AMDGPU/src/device/runtime.jl:115 within `signal_exception`
; │││││┌ @ /home/pxl-th/.julia/dev/AMDGPU/src/device/gcn/execution_control.jl:52 within `endpgm`
        call void @llvm.amdgcn.endpgm(), !dbg !160
; │││││└
; │││││ @ /home/pxl-th/.julia/dev/AMDGPU/src/device/runtime.jl:116 within `signal_exception`
       unreachable, !dbg !164

L119:                                             ; preds = %L104
; │└└└└
; │┌ @ /home/pxl-th/.julia/packages/Atomix/F9VIX/src/core.jl:33 within `modify!` @ /home/pxl-th/.julia/packages/UnsafeAtomicsLLVM/6HZfV/src/internal.jl:20
; ││┌ @ /home/pxl-th/.julia/packages/UnsafeAtomicsLLVM/6HZfV/src/atomics.jl:359 within `atomic_pointermodify`
; │││┌ @ /home/pxl-th/.julia/packages/UnsafeAtomicsLLVM/6HZfV/src/atomics.jl:255 within `llvm_atomic_op`
; ││││┌ @ /home/pxl-th/.julia/packages/UnsafeAtomicsLLVM/6HZfV/src/atomics.jl:255 within `macro expansion` @ /home/pxl-th/.julia/packages/LLVM/Q3CgR/src/interop/base.jl:38
       %12 = bitcast i8 addrspace(1)* %.fca.1.extract to float addrspace(1)*, !dbg !165
       %13 = atomicrmw fadd float addrspace(1)* %12, float 1.000000e+00 seq_cst, align 4, !dbg !165
; ││└└└
; ││ @ /home/pxl-th/.julia/packages/Atomix/F9VIX/src/core.jl:33 within `modify!`
    br label %L128, !dbg !176

L128:                                             ; preds = %L119, %conversion
; └└
;  @ none within `gpu_ker!` @ /home/pxl-th/.julia/packages/KernelAbstractions/Zcyra/src/macros.jl:97
  ret void, !dbg !179
}

attributes #0 = { nounwind readnone speculatable willreturn "amdgpu-unsafe-fp-atomics"="true"}
attributes #1 = { cold noreturn nounwind }
attributes #2 = { "amdgpu-unsafe-fp-atomics"="true"}

And here's the assembly output for gfx1100; notice that global_atomic_add_f32 is present:

	.text
	.amdgcn_target "amdgcn-amd-amdhsa--gfx1100"
	.globl	_Z8gpu_ker_16CompilerMetadataI11DynamicSize12DynamicCheckv16CartesianIndicesILi1E5TupleI5OneToI5Int64EEE7NDRangeILi1ES0_S0_S2_ILi1ES3_IS4_IS5_EEES2_ILi1ES3_IS4_IS5_EEEEE14ROCDeviceArrayI7Float32Li1ELi1EE ; -- Begin function _Z8gpu_ker_16CompilerMetadataI11DynamicSize12DynamicCheckv16CartesianIndicesILi1E5TupleI5OneToI5Int64EEE7NDRangeILi1ES0_S0_S2_ILi1ES3_IS4_IS5_EEES2_ILi1ES3_IS4_IS5_EEEEE14ROCDeviceArrayI7Float32Li1ELi1EE
	.p2align	8
	.type	_Z8gpu_ker_16CompilerMetadataI11DynamicSize12DynamicCheckv16CartesianIndicesILi1E5TupleI5OneToI5Int64EEE7NDRangeILi1ES0_S0_S2_ILi1ES3_IS4_IS5_EEES2_ILi1ES3_IS4_IS5_EEEEE14ROCDeviceArrayI7Float32Li1ELi1EE,@function
_Z8gpu_ker_16CompilerMetadataI11DynamicSize12DynamicCheckv16CartesianIndicesILi1E5TupleI5OneToI5Int64EEE7NDRangeILi1ES0_S0_S2_ILi1ES3_IS4_IS5_EEES2_ILi1ES3_IS4_IS5_EEEEE14ROCDeviceArrayI7Float32Li1ELi1EE: ; @_Z8gpu_ker_16CompilerMetadataI11DynamicSize12DynamicCheckv16CartesianIndicesILi1E5TupleI5OneToI5Int64EEE7NDRangeILi1ES0_S0_S2_ILi1ES3_IS4_IS5_EEES2_ILi1ES3_IS4_IS5_EEEEE14ROCDeviceArrayI7Float32Li1ELi1EE
.Lfunc_begin0:
	.file	1 "." "none"
	.loc	1 0 0                           ; none:0:0
	.cfi_sections .debug_frame
	.cfi_startproc
; %bb.0:                                ; %conversion
	s_clause 0x1
	s_load_b64 s[2:3], s[0:1], 0x68
	s_load_b64 s[4:5], s[0:1], 0x58
.Ltmp0:
	.file	2 "." "boot.jl"
	.loc	2 708 0 prologue_end            ; boot.jl:708:0
	v_dual_mov_b32 v1, 0 :: v_dual_add_nc_u32 v0, 1, v0
.Ltmp1:
	.file	3 "." "int.jl"
	.loc	3 87 0                          ; int.jl:87:0
	s_waitcnt lgkmcnt(0)
	s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
	v_mad_u64_u32 v[2:3], null, s2, s15, v[0:1]
	v_mov_b32_e32 v0, v3
	s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
	v_mad_u64_u32 v[3:4], null, s3, s15, v[0:1]
.Ltmp2:
	.loc	3 514 0                         ; int.jl:514:0
	v_cmp_lt_i64_e32 vcc_lo, 0, v[2:3]
	v_cmp_ge_i64_e64 s2, s[4:5], v[2:3]
	s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(SALU_CYCLE_1)
	s_and_b32 s2, vcc_lo, s2
	s_and_saveexec_b32 s3, s2
	s_cbranch_execz .LBB0_4
.Ltmp3:
; %bb.1:                                ; %L104
	.loc	3 0 0 is_stmt 0                 ; int.jl:0:0
	s_load_b64 s[2:3], s[0:1], 0x70
.Ltmp4:
	.loc	3 513 0 is_stmt 1               ; int.jl:513:0
	s_waitcnt lgkmcnt(0)
	v_cmp_gt_i64_e64 s2, s[2:3], 0
	s_delay_alu instid0(VALU_DEP_1)
	s_and_b32 vcc_lo, exec_lo, s2
	s_mov_b32 s2, -1
	s_cbranch_vccz .LBB0_3
.Ltmp5:
; %bb.2:                                ; %L119
	.loc	3 0 0 is_stmt 0                 ; int.jl:0:0
	s_load_b64 s[2:3], s[0:1], 0x78
	v_dual_mov_b32 v0, 0 :: v_dual_mov_b32 v1, 1.0
.Ltmp6:
	.file	4 "." "/home/pxl-th/.julia/packages/LLVM/Q3CgR/src/interop/base.jl"
	.loc	4 38 0 is_stmt 1                ; /home/pxl-th/.julia/packages/LLVM/Q3CgR/src/interop/base.jl:38:0
	s_waitcnt vmcnt(0) lgkmcnt(0)
	s_waitcnt_vscnt null, 0x0
	global_atomic_add_f32 v0, v1, s[2:3]
	s_waitcnt_vscnt null, 0x0
	buffer_gl0_inv
	buffer_gl1_inv
	s_mov_b32 s2, 0
.Ltmp7:
.LBB0_3:                                ; %Flow
	.loc	4 0 0 is_stmt 0                 ; /home/pxl-th/.julia/packages/LLVM/Q3CgR/src/interop/base.jl:0:0
	s_delay_alu instid0(SALU_CYCLE_1)
	s_and_not1_b32 vcc_lo, exec_lo, s2
	s_cbranch_vccz .LBB0_5
.LBB0_4:                                ; %UnifiedReturnBlock
	s_endpgm
.LBB0_5:                                ; %L115
	s_load_b64 s[0:1], s[0:1], 0x0
	v_dual_mov_b32 v2, 0 :: v_dual_mov_b32 v3, 1
	s_waitcnt lgkmcnt(0)
	v_dual_mov_b32 v0, s0 :: v_dual_mov_b32 v1, s1
.Ltmp8:
	.file	5 "." "pointer.jl"
	.loc	5 146 0 is_stmt 1               ; pointer.jl:146:0
	s_clause 0x3
	flat_store_b8 v[0:1], v2 offset:3
	flat_store_b8 v[0:1], v2 offset:2
	flat_store_b8 v[0:1], v2 offset:1
	flat_store_b8 v[0:1], v3
.Ltmp9:
	.file	6 "." "/home/pxl-th/.julia/dev/AMDGPU/src/device/gcn/execution_control.jl"
	.loc	6 52 0                          ; /home/pxl-th/.julia/dev/AMDGPU/src/device/gcn/execution_control.jl:52:0
	s_endpgm
	; divergent unreachable
	s_sendmsg sendmsg(MSG_DEALLOC_VGPRS)
	s_endpgm

And here's the assembly for gfx90a; notice the regular global_atomic_cmpswap loop:

       .text
        .amdgcn_target "amdgcn-amd-amdhsa--gfx90a:sramecc+"
        .globl  _Z8gpu_ker_16CompilerMetadataI11DynamicSize12DynamicCheckv16CartesianIndicesILi1E5TupleI5OneToI5Int64EEE7NDRangeILi1ES0_S0_S2_ILi1ES3_IS4_IS5_EEES2_ILi1ES3_IS4_IS5_EEEEE14ROCDeviceArrayI7Float32Li1ELi1EE ; -- Begin function _Z8gpu_ker_16CompilerMetadataI11DynamicSize12DynamicCheckv16CartesianIndicesILi1E5TupleI5OneToI5Int64EEE7NDRangeILi1ES0_S0_S2_ILi1ES3_IS4_IS5_EEES2_ILi1ES3_IS4_IS5_EEEEE14ROCDeviceArrayI7Float32Li1ELi1EE
        .p2align        8
        .type   _Z8gpu_ker_16CompilerMetadataI11DynamicSize12DynamicCheckv16CartesianIndicesILi1E5TupleI5OneToI5Int64EEE7NDRangeILi1ES0_S0_S2_ILi1ES3_IS4_IS5_EEES2_ILi1ES3_IS4_IS5_EEEEE14ROCDeviceArrayI7Float32Li1ELi1EE,@function
_Z8gpu_ker_16CompilerMetadataI11DynamicSize12DynamicCheckv16CartesianIndicesILi1E5TupleI5OneToI5Int64EEE7NDRangeILi1ES0_S0_S2_ILi1ES3_IS4_IS5_EEES2_ILi1ES3_IS4_IS5_EEEEE14ROCDeviceArrayI7Float32Li1ELi1EE: ; @_Z8gpu_ker_16CompilerMetadataI11DynamicSize12DynamicCheckv16CartesianIndicesILi1E5TupleI5OneToI5Int64EEE7NDRangeILi1ES0_S0_S2_ILi1ES3_IS4_IS5_EEES2_ILi1ES3_IS4_IS5_EEEEE14ROCDeviceArrayI7Float32Li1ELi1EE
.Lfunc_begin0:
        .file   1 "." "none"
        .loc    1 0 0                           ; none:0:0
        .cfi_sections .debug_frame
        .cfi_startproc
; %bb.0:                                ; %conversion
        s_load_dwordx2 s[0:1], s[4:5], 0x68
        s_load_dwordx2 s[2:3], s[4:5], 0x58
.Ltmp0:
        .file   2 "." "int.jl"
        .loc    2 87 0 prologue_end             ; int.jl:87:0
        v_add_u32_e32 v0, 1, v0
.Ltmp1:
        .file   3 "." "boot.jl"
        .loc    3 708 0                         ; boot.jl:708:0
        v_mov_b32_e32 v1, 0
.Ltmp2:
        .loc    2 87 0                          ; int.jl:87:0
        v_mov_b32_e32 v2, s6
        s_waitcnt lgkmcnt(0)
        s_mul_i32 s7, s1, s6
        v_mad_u64_u32 v[0:1], s[0:1], s0, v2, v[0:1]
        v_add_u32_e32 v1, s7, v1
.Ltmp3:
        .loc    2 514 0                         ; int.jl:514:0
        v_cmp_lt_i64_e32 vcc, 0, v[0:1]
        v_cmp_ge_i64_e64 s[0:1], s[2:3], v[0:1]
        s_and_b64 s[0:1], vcc, s[0:1]
        s_and_saveexec_b64 s[2:3], s[0:1]
        s_cbranch_execz .LBB0_6
.Ltmp4:
; %bb.1:                                ; %L104
        .loc    2 0 0 is_stmt 0                 ; int.jl:0:0
        s_load_dwordx2 s[2:3], s[4:5], 0x70
        s_mov_b64 s[0:1], 0
.Ltmp5:
        .loc    2 513 0 is_stmt 1               ; int.jl:513:0
        s_waitcnt lgkmcnt(0)
        v_cmp_gt_i64_e64 s[6:7], s[2:3], 0
        s_mov_b64 s[2:3], -1
        s_and_b64 vcc, exec, s[6:7]
        s_cbranch_vccz .LBB0_5
.Ltmp6:
; %bb.2:                                ; %L119
        .loc    2 0 0 is_stmt 0                 ; int.jl:0:0
        s_load_dwordx2 s[2:3], s[4:5], 0x78
        v_mov_b32_e32 v2, 0
.Ltmp7:
        .file   4 "." "/users/antonsmi/.julia/packages/LLVM/Q3CgR/src/interop/base.jl"
        .loc    4 38 0 is_stmt 1                ; /users/antonsmi/.julia/packages/LLVM/Q3CgR/src/interop/base.jl:38:0
        s_waitcnt lgkmcnt(0)
        s_load_dword s6, s[2:3], 0x0
        s_waitcnt lgkmcnt(0)
        v_mov_b32_e32 v1, s6
.LBB0_3:                                ; %atomicrmw.start
                                        ; =>This Inner Loop Header: Depth=1
        v_add_f32_e32 v0, 1.0, v1
        buffer_wbl2
        s_waitcnt vmcnt(0) lgkmcnt(0)
        global_atomic_cmpswap v0, v2, v[0:1], s[2:3] glc
        s_waitcnt vmcnt(0)
        buffer_invl2
        buffer_wbinvl1_vol
        v_cmp_eq_u32_e32 vcc, v0, v1
        s_or_b64 s[0:1], vcc, s[0:1]
        v_mov_b32_e32 v1, v0
        s_andn2_b64 exec, exec, s[0:1]
        s_cbranch_execnz .LBB0_3
.Ltmp8:
; %bb.4:                                ; %Flow
        .loc    4 0 0 is_stmt 0                 ; /users/antonsmi/.julia/packages/LLVM/Q3CgR/src/interop/base.jl:0:0
        s_or_b64 exec, exec, s[0:1]
        s_mov_b64 s[2:3], 0
.LBB0_5:                                ; %Flow4
        s_and_b64 vcc, exec, s[2:3]
        s_cbranch_vccnz .LBB0_7
.LBB0_6:                                ; %UnifiedReturnBlock
        s_endpgm
.LBB0_7:                                ; %L115
        s_load_dwordx2 s[0:1], s[4:5], 0x0
        v_mov_b32_e32 v2, 0
        v_mov_b32_e32 v3, 1
        s_waitcnt lgkmcnt(0)
        v_pk_mov_b32 v[0:1], s[0:1], s[0:1] op_sel:[0,1]
.Ltmp9:
        .file   5 "." "pointer.jl"
        .loc    5 146 0 is_stmt 1               ; pointer.jl:146:0
        flat_store_byte v[0:1], v2 offset:3
        flat_store_byte v[0:1], v2 offset:2
        flat_store_byte v[0:1], v2 offset:1
        flat_store_byte v[0:1], v3
.Ltmp10:
        .file   6 "." "/pfs/lustrep2/scratch/project_465000557/antonsmi/julia_depot/dev/AMDGPU/src/device/gcn/execution_control.jl"
        .loc    6 52 0                          ; /pfs/lustrep2/scratch/project_465000557/antonsmi/julia_depot/dev/AMDGPU/src/device/gcn/execution_control.jl:52:0
        s_endpgm
        ; divergent unreachable
        s_endpgm

Any help or advice is appreciated.
Thanks!

Operating System

Ubuntu 22.04.3 LTS (Jammy Jellyfish)

CPU

AMD Ryzen 7 5800X 8-Core Processor

GPU

AMD Instinct MI250X, AMD Radeon RX 7900 XTX

ROCm Version

ROCm 6.0.0, ROCm 5.6.0

Linking static library with device functions depends on argument order

After creating a static library containing device functions using -fgpu-rdc and ar, the library is linked with a command like:

hipcc libdevicelibrary.a main.cpp -fgpu-rdc -o program

This works fine. However, when the argument order is reversed and main.cpp is passed before libdevicelibrary.a, the compiler tries to interpret the files from libdevicelibrary.a as source files:

$ hipcc main.cpp libdevicelibrary.a -fgpu-rdc -o program
/tmp/library.o:1:1: error: expected unqualified-id
<U+007F>ELF<U+0002><U+0001><U+0001><U+0000>.... many more lines

This is caused by the fact that placing main.cpp first makes hipcc emit -x hip before it, which causes clang++ to also interpret the following library.o as a HIP source file:

$ HIPCC_VERBOSE=1 hipcc main.cpp libdevicelibrary.a -fgpu-rdc -o program
hipcc-cmd: /opt/rocm/llvm/bin/clang++ [...extra options omitted for brevity] -x hip main.hip """/tmp/library.o""" -std=c++17 -fgpu-rdc -o "program" 

[COMGR] Codegen is slow.

Suggestion Description

AMD_COMGR_ACTION_CODEGEN_BC_TO_RELOCATABLE takes much more time than the other actions.
It depends on the input and environment, but it is hard to accept considering the system configuration.
(I assume it is single-threaded.)
Best case: 1~2 s
Most cases: 20~30 s or 60~70 s
Worst case: 80~100 s

Is there any way to make it faster?

maybe related: #36

HIP SDK version

6.1.40252-53f3e11ac (official AMD release)

CPU

Intel Core i9 14900K

Operating System

Windows 11 Pro

GPU

RX 7900 XTX

ROCm Component

COMGR

[Issue]: AMD Clang 18 not correctly linking to math libraries | lld: error: undefined hidden symbol: expf

Problem Description

When compiling a C++ project with Clang 18 from the latest ROCm 6.2.0 install using the flag -fmath-errno, I get the error lld: error: undefined hidden symbol: expf, eventually followed by clang++: error: amdgcn-link command failed with exit code 1 (use -v to see invocation).

While trying to fix the issue, I reinstalled ROCm 6.1.2 and compiled the same code using AMD Clang 17 (via the environment variable ROCM_PATH=/opt/rocm-6.1.2) and had no issues; everything compiled fine. It also compiled fine when using the ROCm 6.2.0 path/files but pointing the makefile at Clang 17 from ROCm 6.1.2.
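
For context, a minimal device-side use of expf of the kind that exercises the device math library (a simplified stand-in I wrote for illustration, not the koboldcpp-rocm code and not a verified reproducer):

#include <hip/hip_runtime.h>

// Stand-in kernel: calls expf() in device code. The link failure reported
// above occurred when a project containing device math calls like this was
// built with -fmath-errno on ROCm 6.2.0.
__global__ void exp_kernel(const float *in, float *out, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n)
    out[i] = expf(in[i]);
}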

Operating System

Linux Mint 21.3 (Virginia)

CPU

12th Gen Intel(R) Core(TM) i5-12600K

GPU

AMD Radeon RX 6800 XT

ROCm Version

ROCm 6.2.0

ROCm Component

llvm-project

Steps to Reproduce

  1. Install ROCm 6.2.0 in full and have a supported GPU installed
  2. Compile the source code (modify -j as needed):
git clone https://github.com/YellowRoseCx/koboldcpp-rocm.git -b v1.72.yr0-ROCm
make LLAMA_HIPBLAS=1 -j14

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

ROCk module version 6.8.5 is loaded
=====================    
HSA System Attributes    
=====================    
Runtime Version:         1.14
Runtime Ext Version:     1.6
System Timestamp Freq.:  1000.000000MHz
Sig. Max Wait Duration:  18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model:           LARGE                              
System Endianness:       LITTLE                             
Mwaitx:                  DISABLED
DMAbuf Support:          YES

==========               
HSA Agents               
==========               
*******                  
Agent 1                  
*******                  
  Name:                    12th Gen Intel(R) Core(TM) i5-12600K
  Uuid:                    CPU-XX                             
  Marketing Name:          12th Gen Intel(R) Core(TM) i5-12600K
  Vendor Name:             CPU                                
  Feature:                 None specified                     
  Profile:                 FULL_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        0(0x0)                             
  Queue Min Size:          0(0x0)                             
  Queue Max Size:          0(0x0)                             
  Queue Type:              MULTI                              
  Node:                    0                                  
  Device Type:             CPU                                
  Cache Info:              
    L1:                      49152(0xc000) KB                   
  Chip ID:                 0(0x0)                             
  ASIC Revision:           0(0x0)                             
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   5000                               
  BDFID:                   0                                  
  Internal Node ID:        0                                  
  Compute Unit:            16                                 
  SIMDs per CU:            0                                  
  Shader Engines:          0                                  
  Shader Arrs. per Eng.:   0                                  
  WatchPts on Addr. Ranges:1                                  
  Memory Properties:       
  Features:                None
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: FINE GRAINED        
      Size:                    65698236(0x3ea79bc) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 2                   
      Segment:                 GLOBAL; FLAGS: KERNARG, FINE GRAINED
      Size:                    65698236(0x3ea79bc) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 3                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    65698236(0x3ea79bc) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
  ISA Info:                
*******                  
Agent 2                  
*******                  
  Name:                    gfx1030                            
  Uuid:                    GPU-05965f569492ad18               
  Marketing Name:          AMD Radeon RX 6800 XT              
  Vendor Name:             AMD                                
  Feature:                 KERNEL_DISPATCH                    
  Profile:                 BASE_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        128(0x80)                          
  Queue Min Size:          64(0x40)                           
  Queue Max Size:          131072(0x20000)                    
  Queue Type:              MULTI                              
  Node:                    1                                  
  Device Type:             GPU                                
  Cache Info:              
    L1:                      16(0x10) KB                        
    L2:                      4096(0x1000) KB                    
    L3:                      131072(0x20000) KB                 
  Chip ID:                 29631(0x73bf)                      
  ASIC Revision:           1(0x1)                             
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   2575                               
  BDFID:                   768                                
  Internal Node ID:        1                                  
  Compute Unit:            72                                 
  SIMDs per CU:            2                                  
  Shader Engines:          4                                  
  Shader Arrs. per Eng.:   2                                  
  WatchPts on Addr. Ranges:4                                  
  Coherent Host Access:    FALSE                              
  Memory Properties:       
  Features:                KERNEL_DISPATCH 
  Fast F16 Operation:      TRUE                               
  Wavefront Size:          32(0x20)                           
  Workgroup Max Size:      1024(0x400)                        
  Workgroup Max Size per Dimension:
    x                        1024(0x400)                        
    y                        1024(0x400)                        
    z                        1024(0x400)                        
  Max Waves Per CU:        32(0x20)                           
  Max Work-item Per CU:    1024(0x400)                        
  Grid Max Size:           4294967295(0xffffffff)             
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)             
    y                        4294967295(0xffffffff)             
    z                        4294967295(0xffffffff)             
  Max fbarriers/Workgrp:   32                                 
  Packet Processor uCode:: 118                                
  SDMA engine uCode::      83                                 
  IOMMU Support::          None                               
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    16760832(0xffc000) KB              
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:2048KB                             
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 2                   
      Segment:                 GLOBAL; FLAGS: EXTENDED FINE GRAINED
      Size:                    16760832(0xffc000) KB              
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:2048KB                             
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 3                   
      Segment:                 GROUP                              
      Size:                    64(0x40) KB                        
      Allocatable:             FALSE                              
      Alloc Granule:           0KB                                
      Alloc Recommended Granule:0KB                                
      Alloc Alignment:         0KB                                
      Accessible by all:       FALSE                              
  ISA Info:                
    ISA 1                    
      Name:                    amdgcn-amd-amdhsa--gfx1030         
      Machine Models:          HSA_MACHINE_MODEL_LARGE            
      Profiles:                HSA_PROFILE_BASE                   
      Default Rounding Mode:   NEAR                               
      Default Rounding Mode:   NEAR                               
      Fast f16:                TRUE                               
      Workgroup Max Size:      1024(0x400)                        
      Workgroup Max Size per Dimension:
        x                        1024(0x400)                        
        y                        1024(0x400)                        
        z                        1024(0x400)                        
      Grid Max Size:           4294967295(0xffffffff)             
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)             
        y                        4294967295(0xffffffff)             
        z                        4294967295(0xffffffff)             
      FBarrier Max Size:       32                                 
*******                  
Agent 3                  
*******                  
  Name:                    gfx1010                            
  Uuid:                    GPU-XX                             
  Marketing Name:          AMD Radeon RX 5600 XT              
  Vendor Name:             AMD                                
  Feature:                 KERNEL_DISPATCH                    
  Profile:                 BASE_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        128(0x80)                          
  Queue Min Size:          64(0x40)                           
  Queue Max Size:          131072(0x20000)                    
  Queue Type:              MULTI                              
  Node:                    2                                  
  Device Type:             GPU                                
  Cache Info:              
    L1:                      16(0x10) KB                        
    L2:                      4096(0x1000) KB                    
  Chip ID:                 29471(0x731f)                      
  ASIC Revision:           2(0x2)                             
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   1780                               
  BDFID:                   2304                               
  Internal Node ID:        2                                  
  Compute Unit:            36                                 
  SIMDs per CU:            2                                  
  Shader Engines:          2                                  
  Shader Arrs. per Eng.:   2                                  
  WatchPts on Addr. Ranges:4                                  
  Coherent Host Access:    FALSE                              
  Memory Properties:       
  Features:                KERNEL_DISPATCH 
  Fast F16 Operation:      TRUE                               
  Wavefront Size:          32(0x20)                           
  Workgroup Max Size:      1024(0x400)                        
  Workgroup Max Size per Dimension:
    x                        1024(0x400)                        
    y                        1024(0x400)                        
    z                        1024(0x400)                        
  Max Waves Per CU:        40(0x28)                           
  Max Work-item Per CU:    1280(0x500)                        
  Grid Max Size:           4294967295(0xffffffff)             
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)             
    y                        4294967295(0xffffffff)             
    z                        4294967295(0xffffffff)             
  Max fbarriers/Workgrp:   32                                 
  Packet Processor uCode:: 149                                
  SDMA engine uCode::      35                                 
  IOMMU Support::          None                               
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    6275072(0x5fc000) KB               
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:2048KB                             
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 2                   
      Segment:                 GLOBAL; FLAGS: EXTENDED FINE GRAINED
      Size:                    6275072(0x5fc000) KB               
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:2048KB                             
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 3                   
      Segment:                 GROUP                              
      Size:                    64(0x40) KB                        
      Allocatable:             FALSE                              
      Alloc Granule:           0KB                                
      Alloc Recommended Granule:0KB                                
      Alloc Alignment:         0KB                                
      Accessible by all:       FALSE                              
  ISA Info:                
    ISA 1                    
      Name:                    amdgcn-amd-amdhsa--gfx1010:xnack-  
      Machine Models:          HSA_MACHINE_MODEL_LARGE            
      Profiles:                HSA_PROFILE_BASE                   
      Default Rounding Mode:   NEAR                               
      Default Rounding Mode:   NEAR                               
      Fast f16:                TRUE                               
      Workgroup Max Size:      1024(0x400)                        
      Workgroup Max Size per Dimension:
        x                        1024(0x400)                        
        y                        1024(0x400)                        
        z                        1024(0x400)                        
      Grid Max Size:           4294967295(0xffffffff)             
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)             
        y                        4294967295(0xffffffff)             
        z                        4294967295(0xffffffff)             
      FBarrier Max Size:       32                                 
*** Done ***             

Additional Information

$ ROCR_VISIBLE_DEVICES=0 CUDA_VISIBLE_DEVICES=0 make LLAMA_HIPBLAS=1 -j14
I llama.cpp build info: 
I UNAME_S:  Linux
I UNAME_P:  x86_64
I UNAME_M:  x86_64
I CFLAGS:   -I. -Iggml/include -Iggml/src -Iinclude -Isrc -I./include -I./include/CL -I./otherarch -I./otherarch/tools -I./otherarch/sdcpp -I./otherarch/sdcpp/thirdparty -I./include/vulkan -O3 -fno-finite-math-only -fmath-errno -DNDEBUG -std=c11   -fPIC -DLOG_DISABLE_LOGS -D_GNU_SOURCE -DGGML_USE_LLAMAFILE -pthread -s -Wno-deprecated -Wno-deprecated-declarations -pthread -march=native -mtune=native
I CXXFLAGS: -I. -Iggml/include -Iggml/src -Iinclude -Isrc -I./common -I./include -I./include/CL -I./otherarch -I./otherarch/tools -I./otherarch/sdcpp -I./otherarch/sdcpp/thirdparty -I./include/vulkan -O3 -fno-finite-math-only -fmath-errno -DNDEBUG -std=c++11 -fPIC -DLOG_DISABLE_LOGS -D_GNU_SOURCE -DGGML_USE_LLAMAFILE -pthread -s -Wno-multichar -Wno-write-strings -Wno-deprecated -Wno-deprecated-declarations -pthread
I LDFLAGS:  
I CC:       cc (Ubuntu 13.1.0-8ubuntu1~22.04) 13.1.0
I CXX:      g++ (Ubuntu 13.1.0-8ubuntu1~22.04) 13.1.0

cc  -I. -Iggml/include -Iggml/src -Iinclude -Isrc -I./include -I./include/CL -I./otherarch -I./otherarch/tools -I./otherarch/sdcpp -I./otherarch/sdcpp/thirdparty -I./include/vulkan -Ofast -fno-finite-math-only -fmath-errno -DNDEBUG -std=c11   -fPIC -DLOG_DISABLE_LOGS -D_GNU_SOURCE -DGGML_USE_LLAMAFILE -pthread -s -Wno-deprecated -Wno-deprecated-declarations -pthread -march=native -mtune=native  -c ggml/src/ggml.c -o ggml.o
cc  -I. -Iggml/include -Iggml/src -Iinclude -Isrc -I./include -I./include/CL -I./otherarch -I./otherarch/tools -I./otherarch/sdcpp -I./otherarch/sdcpp/thirdparty -I./include/vulkan -Ofast -fno-finite-math-only -fmath-errno -DNDEBUG -std=c11   -fPIC -DLOG_DISABLE_LOGS -D_GNU_SOURCE -DGGML_USE_LLAMAFILE -pthread -s -Wno-deprecated -Wno-deprecated-declarations -pthread -march=native -mtune=native  -c otherarch/ggml_v3.c -o ggml_v3.o
cc  -I. -Iggml/include -Iggml/src -Iinclude -Isrc -I./include -I./include/CL -I./otherarch -I./otherarch/tools -I./otherarch/sdcpp -I./otherarch/sdcpp/thirdparty -I./include/vulkan -Ofast -fno-finite-math-only -fmath-errno -DNDEBUG -std=c11   -fPIC -DLOG_DISABLE_LOGS -D_GNU_SOURCE -DGGML_USE_LLAMAFILE -pthread -s -Wno-deprecated -Wno-deprecated-declarations -pthread -march=native -mtune=native  -c otherarch/ggml_v2.c -o ggml_v2.o
cc  -I. -Iggml/include -Iggml/src -Iinclude -Isrc -I./include -I./include/CL -I./otherarch -I./otherarch/tools -I./otherarch/sdcpp -I./otherarch/sdcpp/thirdparty -I./include/vulkan -Ofast -fno-finite-math-only -fmath-errno -DNDEBUG -std=c11   -fPIC -DLOG_DISABLE_LOGS -D_GNU_SOURCE -DGGML_USE_LLAMAFILE -pthread -s -Wno-deprecated -Wno-deprecated-declarations -pthread -march=native -mtune=native  -c otherarch/ggml_v1.c -o ggml_v1.o
g++ -I. -Iggml/include -Iggml/src -Iinclude -Isrc -I./common -I./include -I./include/CL -I./otherarch -I./otherarch/tools -I./otherarch/sdcpp -I./otherarch/sdcpp/thirdparty -I./include/vulkan -O3 -fno-finite-math-only -fmath-errno -DNDEBUG -std=c++11 -fPIC -DLOG_DISABLE_LOGS -D_GNU_SOURCE -DGGML_USE_LLAMAFILE -pthread -s -Wno-multichar -Wno-write-strings -Wno-deprecated -Wno-deprecated-declarations -pthread -c expose.cpp -o expose.o
g++ -I. -Iggml/include -Iggml/src -Iinclude -Isrc -I./common -I./include -I./include/CL -I./otherarch -I./otherarch/tools -I./otherarch/sdcpp -I./otherarch/sdcpp/thirdparty -I./include/vulkan -O3 -fno-finite-math-only -fmath-errno -DNDEBUG -std=c++11 -fPIC -DLOG_DISABLE_LOGS -D_GNU_SOURCE -DGGML_USE_LLAMAFILE -pthread -s -Wno-multichar -Wno-write-strings -Wno-deprecated -Wno-deprecated-declarations -pthread -c gpttype_adapter.cpp -o gpttype_adapter.o
g++ -I. -Iggml/include -Iggml/src -Iinclude -Isrc -I./common -I./include -I./include/CL -I./otherarch -I./otherarch/tools -I./otherarch/sdcpp -I./otherarch/sdcpp/thirdparty -I./include/vulkan -O3 -fno-finite-math-only -fmath-errno -DNDEBUG -std=c++11 -fPIC -DLOG_DISABLE_LOGS -D_GNU_SOURCE -DGGML_USE_LLAMAFILE -pthread -s -Wno-multichar -Wno-write-strings -Wno-deprecated -Wno-deprecated-declarations -pthread -c otherarch/sdcpp/sdtype_adapter.cpp -o sdcpp_default.o
g++ -I. -Iggml/include -Iggml/src -Iinclude -Isrc -I./common -I./include -I./include/CL -I./otherarch -I./otherarch/tools -I./otherarch/sdcpp -I./otherarch/sdcpp/thirdparty -I./include/vulkan -O3 -fno-finite-math-only -fmath-errno -DNDEBUG -std=c++11 -fPIC -DLOG_DISABLE_LOGS -D_GNU_SOURCE -DGGML_USE_LLAMAFILE -pthread -s -Wno-multichar -Wno-write-strings -Wno-deprecated -Wno-deprecated-declarations -pthread -c otherarch/whispercpp/whisper_adapter.cpp -o whispercpp_default.o
g++ -I. -Iggml/include -Iggml/src -Iinclude -Isrc -I./common -I./include -I./include/CL -I./otherarch -I./otherarch/tools -I./otherarch/sdcpp -I./otherarch/sdcpp/thirdparty -I./include/vulkan -O3 -fno-finite-math-only -fmath-errno -DNDEBUG -std=c++11 -fPIC -DLOG_DISABLE_LOGS -D_GNU_SOURCE -DGGML_USE_LLAMAFILE -pthread -s -Wno-multichar -Wno-write-strings -Wno-deprecated -Wno-deprecated-declarations -pthread -c examples/llava/clip.cpp -o llavaclip_default.o
g++ -I. -Iggml/include -Iggml/src -Iinclude -Isrc -I./common -I./include -I./include/CL -I./otherarch -I./otherarch/tools -I./otherarch/sdcpp -I./otherarch/sdcpp/thirdparty -I./include/vulkan -O3 -fno-finite-math-only -fmath-errno -DNDEBUG -std=c++11 -fPIC -DLOG_DISABLE_LOGS -D_GNU_SOURCE -DGGML_USE_LLAMAFILE -pthread -s -Wno-multichar -Wno-write-strings -Wno-deprecated -Wno-deprecated-declarations -pthread -c examples/llava/llava.cpp -o llava.o
cc  -I. -Iggml/include -Iggml/src -Iinclude -Isrc -I./include -I./include/CL -I./otherarch -I./otherarch/tools -I./otherarch/sdcpp -I./otherarch/sdcpp/thirdparty -I./include/vulkan -O3 -fno-finite-math-only -fmath-errno -DNDEBUG -std=c11   -fPIC -DLOG_DISABLE_LOGS -D_GNU_SOURCE -DGGML_USE_LLAMAFILE -pthread -s -Wno-deprecated -Wno-deprecated-declarations -pthread -march=native -mtune=native -c ggml/src/ggml-backend.c -o ggml-backend_default.o
cc  -I. -Iggml/include -Iggml/src -Iinclude -Isrc -I./include -I./include/CL -I./otherarch -I./otherarch/tools -I./otherarch/sdcpp -I./otherarch/sdcpp/thirdparty -I./include/vulkan -O3 -fno-finite-math-only -fmath-errno -DNDEBUG -std=c11   -fPIC -DLOG_DISABLE_LOGS -D_GNU_SOURCE -DGGML_USE_LLAMAFILE -pthread -s -Wno-deprecated -Wno-deprecated-declarations -pthread -march=native -mtune=native -c ggml/src/ggml-alloc.c -o ggml-alloc.o
cc  -I. -Iggml/include -Iggml/src -Iinclude -Isrc -I./include -I./include/CL -I./otherarch -I./otherarch/tools -I./otherarch/sdcpp -I./otherarch/sdcpp/thirdparty -I./include/vulkan -O3 -fno-finite-math-only -fmath-errno -DNDEBUG -std=c11   -fPIC -DLOG_DISABLE_LOGS -D_GNU_SOURCE -DGGML_USE_LLAMAFILE -pthread -s -Wno-deprecated -Wno-deprecated-declarations -pthread -march=native -mtune=native -c ggml/src/ggml-aarch64.c -o ggml-aarch64.o
cc  -I. -Iggml/include -Iggml/src -Iinclude -Isrc -I./include -I./include/CL -I./otherarch -I./otherarch/tools -I./otherarch/sdcpp -I./otherarch/sdcpp/thirdparty -I./include/vulkan -O3 -fno-finite-math-only -fmath-errno -DNDEBUG -std=c11   -fPIC -DLOG_DISABLE_LOGS -D_GNU_SOURCE -DGGML_USE_LLAMAFILE -pthread -s -Wno-deprecated -Wno-deprecated-declarations -pthread -march=native -mtune=native  -c ggml/src/ggml-quants.c -o ggml-quants.o
g++ -I. -Iggml/include -Iggml/src -Iinclude -Isrc -I./common -I./include -I./include/CL -I./otherarch -I./otherarch/tools -I./otherarch/sdcpp -I./otherarch/sdcpp/thirdparty -I./include/vulkan -O3 -fno-finite-math-only -fmath-errno -DNDEBUG -std=c++11 -fPIC -DLOG_DISABLE_LOGS -D_GNU_SOURCE -DGGML_USE_LLAMAFILE -pthread -s -Wno-multichar -Wno-write-strings -Wno-deprecated -Wno-deprecated-declarations -pthread -c src/unicode.cpp -o unicode.o
g++ -I. -Iggml/include -Iggml/src -Iinclude -Isrc -I./common -I./include -I./include/CL -I./otherarch -I./otherarch/tools -I./otherarch/sdcpp -I./otherarch/sdcpp/thirdparty -I./include/vulkan -O3 -fno-finite-math-only -fmath-errno -DNDEBUG -std=c++11 -fPIC -DLOG_DISABLE_LOGS -D_GNU_SOURCE -DGGML_USE_LLAMAFILE -pthread -s -Wno-multichar -Wno-write-strings -Wno-deprecated -Wno-deprecated-declarations -pthread -c src/unicode-data.cpp -o unicode-data.o
g++ -I. -Iggml/include -Iggml/src -Iinclude -Isrc -I./common -I./include -I./include/CL -I./otherarch -I./otherarch/tools -I./otherarch/sdcpp -I./otherarch/sdcpp/thirdparty -I./include/vulkan -O3 -fno-finite-math-only -fmath-errno -DNDEBUG -std=c++11 -fPIC -DLOG_DISABLE_LOGS -D_GNU_SOURCE -DGGML_USE_LLAMAFILE -pthread -s -Wno-multichar -Wno-write-strings -Wno-deprecated -Wno-deprecated-declarations -pthread  -c ggml/src/llamafile/sgemm.cpp -o sgemm.o
g++ -I. -Iggml/include -Iggml/src -Iinclude -Isrc -I./common -I./include -I./include/CL -I./otherarch -I./otherarch/tools -I./otherarch/sdcpp -I./otherarch/sdcpp/thirdparty -I./include/vulkan -O3 -fno-finite-math-only -fmath-errno -DNDEBUG -std=c++11 -fPIC -DLOG_DISABLE_LOGS -D_GNU_SOURCE -DGGML_USE_LLAMAFILE -pthread -s -Wno-multichar -Wno-write-strings -Wno-deprecated -Wno-deprecated-declarations -pthread -c common/common.cpp -o common.o
g++ -I. -Iggml/include -Iggml/src -Iinclude -Isrc -I./common -I./include -I./include/CL -I./otherarch -I./otherarch/tools -I./otherarch/sdcpp -I./otherarch/sdcpp/thirdparty -I./include/vulkan -O3 -fno-finite-math-only -fmath-errno -DNDEBUG -std=c++11 -fPIC -DLOG_DISABLE_LOGS -D_GNU_SOURCE -DGGML_USE_LLAMAFILE -pthread -s -Wno-multichar -Wno-write-strings -Wno-deprecated -Wno-deprecated-declarations -pthread -c common/sampling.cpp -o sampling.o
g++ -I. -Iggml/include -Iggml/src -Iinclude -Isrc -I./common -I./include -I./include/CL -I./otherarch -I./otherarch/tools -I./otherarch/sdcpp -I./otherarch/sdcpp/thirdparty -I./include/vulkan -O3 -fno-finite-math-only -fmath-errno -DNDEBUG -std=c++11 -fPIC -DLOG_DISABLE_LOGS -D_GNU_SOURCE -DGGML_USE_LLAMAFILE -pthread -s -Wno-multichar -Wno-write-strings -Wno-deprecated -Wno-deprecated-declarations -pthread -c common/grammar-parser.cpp -o grammar-parser.o
cc  -I. -Iggml/include -Iggml/src -Iinclude -Isrc -I./include -I./include/CL -I./otherarch -I./otherarch/tools -I./otherarch/sdcpp -I./otherarch/sdcpp/thirdparty -I./include/vulkan -Ofast -fno-finite-math-only -fmath-errno -DNDEBUG -std=c11   -fPIC -DLOG_DISABLE_LOGS -D_GNU_SOURCE -DGGML_USE_LLAMAFILE -pthread -s -Wno-deprecated -Wno-deprecated-declarations -pthread -march=native -mtune=native   -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DSD_USE_CUBLAS  -D__HIP_PLATFORM_HCC__= -D__HIP_PLATFORM_AMD__= -I/opt/rocm-6.2.0/include -I/include -c ggml/src/ggml.c -o ggml_v4_cublas.o
cc  -I. -Iggml/include -Iggml/src -Iinclude -Isrc -I./include -I./include/CL -I./otherarch -I./otherarch/tools -I./otherarch/sdcpp -I./otherarch/sdcpp/thirdparty -I./include/vulkan -Ofast -fno-finite-math-only -fmath-errno -DNDEBUG -std=c11   -fPIC -DLOG_DISABLE_LOGS -D_GNU_SOURCE -DGGML_USE_LLAMAFILE -pthread -s -Wno-deprecated -Wno-deprecated-declarations -pthread -march=native -mtune=native   -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DSD_USE_CUBLAS  -D__HIP_PLATFORM_HCC__= -D__HIP_PLATFORM_AMD__= -I/opt/rocm-6.2.0/include -I/include -c otherarch/ggml_v3.c -o ggml_v3_cublas.o
cc  -I. -Iggml/include -Iggml/src -Iinclude -Isrc -I./include -I./include/CL -I./otherarch -I./otherarch/tools -I./otherarch/sdcpp -I./otherarch/sdcpp/thirdparty -I./include/vulkan -Ofast -fno-finite-math-only -fmath-errno -DNDEBUG -std=c11   -fPIC -DLOG_DISABLE_LOGS -D_GNU_SOURCE -DGGML_USE_LLAMAFILE -pthread -s -Wno-deprecated -Wno-deprecated-declarations -pthread -march=native -mtune=native   -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DSD_USE_CUBLAS  -D__HIP_PLATFORM_HCC__= -D__HIP_PLATFORM_AMD__= -I/opt/rocm-6.2.0/include -I/include -c otherarch/ggml_v2.c -o ggml_v2_cublas.o
g++ -I. -Iggml/include -Iggml/src -Iinclude -Isrc -I./common -I./include -I./include/CL -I./otherarch -I./otherarch/tools -I./otherarch/sdcpp -I./otherarch/sdcpp/thirdparty -I./include/vulkan -O3 -fno-finite-math-only -fmath-errno -DNDEBUG -std=c++11 -fPIC -DLOG_DISABLE_LOGS -D_GNU_SOURCE -DGGML_USE_LLAMAFILE -pthread -s -Wno-multichar -Wno-write-strings -Wno-deprecated -Wno-deprecated-declarations -pthread  -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DSD_USE_CUBLAS  -D__HIP_PLATFORM_HCC__= -D__HIP_PLATFORM_AMD__= -I/opt/rocm-6.2.0/include -I/include -c gpttype_adapter.cpp -o gpttype_adapter_cublas.o
g++ -I. -Iggml/include -Iggml/src -Iinclude -Isrc -I./common -I./include -I./include/CL -I./otherarch -I./otherarch/tools -I./otherarch/sdcpp -I./otherarch/sdcpp/thirdparty -I./include/vulkan -O3 -fno-finite-math-only -fmath-errno -DNDEBUG -std=c++11 -fPIC -DLOG_DISABLE_LOGS -D_GNU_SOURCE -DGGML_USE_LLAMAFILE -pthread -s -Wno-multichar -Wno-write-strings -Wno-deprecated -Wno-deprecated-declarations -pthread  -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DSD_USE_CUBLAS  -D__HIP_PLATFORM_HCC__= -D__HIP_PLATFORM_AMD__= -I/opt/rocm-6.2.0/include -I/include -c otherarch/sdcpp/sdtype_adapter.cpp -o sdcpp_cublas.o
g++ -I. -Iggml/include -Iggml/src -Iinclude -Isrc -I./common -I./include -I./include/CL -I./otherarch -I./otherarch/tools -I./otherarch/sdcpp -I./otherarch/sdcpp/thirdparty -I./include/vulkan -O3 -fno-finite-math-only -fmath-errno -DNDEBUG -std=c++11 -fPIC -DLOG_DISABLE_LOGS -D_GNU_SOURCE -DGGML_USE_LLAMAFILE -pthread -s -Wno-multichar -Wno-write-strings -Wno-deprecated -Wno-deprecated-declarations -pthread  -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DSD_USE_CUBLAS  -D__HIP_PLATFORM_HCC__= -D__HIP_PLATFORM_AMD__= -I/opt/rocm-6.2.0/include -I/include -c otherarch/whispercpp/whisper_adapter.cpp -o whispercpp_cublas.o
g++ -I. -Iggml/include -Iggml/src -Iinclude -Isrc -I./common -I./include -I./include/CL -I./otherarch -I./otherarch/tools -I./otherarch/sdcpp -I./otherarch/sdcpp/thirdparty -I./include/vulkan -O3 -fno-finite-math-only -fmath-errno -DNDEBUG -std=c++11 -fPIC -DLOG_DISABLE_LOGS -D_GNU_SOURCE -DGGML_USE_LLAMAFILE -pthread -s -Wno-multichar -Wno-write-strings -Wno-deprecated -Wno-deprecated-declarations -pthread  -c examples/llava/clip.cpp -o llavaclip_cublas.o
cc  -I. -Iggml/include -Iggml/src -Iinclude -Isrc -I./include -I./include/CL -I./otherarch -I./otherarch/tools -I./otherarch/sdcpp -I./otherarch/sdcpp/thirdparty -I./include/vulkan -O3 -fno-finite-math-only -fmath-errno -DNDEBUG -std=c11   -fPIC -DLOG_DISABLE_LOGS -D_GNU_SOURCE -DGGML_USE_LLAMAFILE -pthread -s -Wno-deprecated -Wno-deprecated-declarations -pthread -march=native -mtune=native  -c ggml/src/ggml-backend.c -o ggml-backend_cublas.o
/opt/rocm/llvm/bin/clang++ -I. -Iggml/include -Iggml/src -Iinclude -Isrc -I./common -I./include -I./include/CL -I./otherarch -I./otherarch/tools -I./otherarch/sdcpp -I./otherarch/sdcpp/thirdparty -I./include/vulkan -O3 -fno-finite-math-only -fmath-errno -DNDEBUG -std=c++11 -fPIC -DLOG_DISABLE_LOGS -D_GNU_SOURCE -DGGML_USE_LLAMAFILE -pthread -s -Wno-multichar -Wno-write-strings -Wno-deprecated -Wno-deprecated-declarations -pthread -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DSD_USE_CUBLAS  -D__HIP_PLATFORM_HCC__= -D__HIP_PLATFORM_AMD__= -I/opt/rocm-6.2.0/include -I/include --offload-arch=gfx1030 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DK_QUANTS_PER_ITERATION=2 -x hip -c -o ggml-cuda.o ggml/src/ggml-cuda.cu
clang++: warning: argument unused during compilation: '-s' [-Wunused-command-line-argument]
/opt/rocm/llvm/bin/clang++ -I. -Iggml/include -Iggml/src -Iinclude -Isrc -I./common -I./include -I./include/CL -I./otherarch -I./otherarch/tools -I./otherarch/sdcpp -I./otherarch/sdcpp/thirdparty -I./include/vulkan -O3 -fno-finite-math-only -fmath-errno -DNDEBUG -std=c++11 -fPIC -DLOG_DISABLE_LOGS -D_GNU_SOURCE -DGGML_USE_LLAMAFILE -pthread -s -Wno-multichar -Wno-write-strings -Wno-deprecated -Wno-deprecated-declarations -pthread -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DSD_USE_CUBLAS  -D__HIP_PLATFORM_HCC__= -D__HIP_PLATFORM_AMD__= -I/opt/rocm-6.2.0/include -I/include --offload-arch=gfx1030 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DK_QUANTS_PER_ITERATION=2 -x hip -c -o ggml_v3-cuda.o otherarch/ggml_v3-cuda.cu
clang++: warning: argument unused during compilation: '-s' [-Wunused-command-line-argument]
otherarch/ggml_v3-cuda.cu:590:1: warning: function declared 'noreturn' should not return [-Winvalid-noreturn]
  590 | }
      | ^
/opt/rocm/llvm/bin/clang++ -I. -Iggml/include -Iggml/src -Iinclude -Isrc -I./common -I./include -I./include/CL -I./otherarch -I./otherarch/tools -I./otherarch/sdcpp -I./otherarch/sdcpp/thirdparty -I./include/vulkan -O3 -fno-finite-math-only -fmath-errno -DNDEBUG -std=c++11 -fPIC -DLOG_DISABLE_LOGS -D_GNU_SOURCE -DGGML_USE_LLAMAFILE -pthread -s -Wno-multichar -Wno-write-strings -Wno-deprecated -Wno-deprecated-declarations -pthread -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DSD_USE_CUBLAS  -D__HIP_PLATFORM_HCC__= -D__HIP_PLATFORM_AMD__= -I/opt/rocm-6.2.0/include -I/include --offload-arch=gfx1030 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DK_QUANTS_PER_ITERATION=2 -x hip -c -o ggml_v2-cuda.o otherarch/ggml_v2-cuda.cu
clang++: warning: argument unused during compilation: '-s' [-Wunused-command-line-argument]
/opt/rocm/llvm/bin/clang++ -I. -Iggml/include -Iggml/src -Iinclude -Isrc -I./common -I./include -I./include/CL -I./otherarch -I./otherarch/tools -I./otherarch/sdcpp -I./otherarch/sdcpp/thirdparty -I./include/vulkan -O3 -fno-finite-math-only -fmath-errno -DNDEBUG -std=c++11 -fPIC -DLOG_DISABLE_LOGS -D_GNU_SOURCE -DGGML_USE_LLAMAFILE -pthread -s -Wno-multichar -Wno-write-strings -Wno-deprecated -Wno-deprecated-declarations -pthread -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DSD_USE_CUBLAS  -D__HIP_PLATFORM_HCC__= -D__HIP_PLATFORM_AMD__= -I/opt/rocm-6.2.0/include -I/include --offload-arch=gfx1030 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DK_QUANTS_PER_ITERATION=2 -x hip -c -o ggml_v2-cuda-legacy.o otherarch/ggml_v2-cuda-legacy.cu
clang++: warning: argument unused during compilation: '-s' [-Wunused-command-line-argument]
/opt/rocm/llvm/bin/clang++ -I. -Iggml/include -Iggml/src -Iinclude -Isrc -I./common -I./include -I./include/CL -I./otherarch -I./otherarch/tools -I./otherarch/sdcpp -I./otherarch/sdcpp/thirdparty -I./include/vulkan -O3 -fno-finite-math-only -fmath-errno -DNDEBUG -std=c++11 -fPIC -DLOG_DISABLE_LOGS -D_GNU_SOURCE -DGGML_USE_LLAMAFILE -pthread -s -Wno-multichar -Wno-write-strings -Wno-deprecated -Wno-deprecated-declarations -pthread -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DSD_USE_CUBLAS  -D__HIP_PLATFORM_HCC__= -D__HIP_PLATFORM_AMD__= -I/opt/rocm-6.2.0/include -I/include --offload-arch=gfx1030 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DK_QUANTS_PER_ITERATION=2 -x hip -c -o ggml/src/ggml-cuda/acc.o ggml/src/ggml-cuda/acc.cu
clang++: warning: argument unused during compilation: '-s' [-Wunused-command-line-argument]
/opt/rocm/llvm/bin/clang++ -I. -Iggml/include -Iggml/src -Iinclude -Isrc -I./common -I./include -I./include/CL -I./otherarch -I./otherarch/tools -I./otherarch/sdcpp -I./otherarch/sdcpp/thirdparty -I./include/vulkan -O3 -fno-finite-math-only -fmath-errno -DNDEBUG -std=c++11 -fPIC -DLOG_DISABLE_LOGS -D_GNU_SOURCE -DGGML_USE_LLAMAFILE -pthread -s -Wno-multichar -Wno-write-strings -Wno-deprecated -Wno-deprecated-declarations -pthread -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DSD_USE_CUBLAS  -D__HIP_PLATFORM_HCC__= -D__HIP_PLATFORM_AMD__= -I/opt/rocm-6.2.0/include -I/include --offload-arch=gfx1030 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DK_QUANTS_PER_ITERATION=2 -x hip -c -o ggml/src/ggml-cuda/arange.o ggml/src/ggml-cuda/arange.cu
clang++: warning: argument unused during compilation: '-s' [-Wunused-command-line-argument]
/opt/rocm/llvm/bin/clang++ -I. -Iggml/include -Iggml/src -Iinclude -Isrc -I./common -I./include -I./include/CL -I./otherarch -I./otherarch/tools -I./otherarch/sdcpp -I./otherarch/sdcpp/thirdparty -I./include/vulkan -O3 -fno-finite-math-only -fmath-errno -DNDEBUG -std=c++11 -fPIC -DLOG_DISABLE_LOGS -D_GNU_SOURCE -DGGML_USE_LLAMAFILE -pthread -s -Wno-multichar -Wno-write-strings -Wno-deprecated -Wno-deprecated-declarations -pthread -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DSD_USE_CUBLAS  -D__HIP_PLATFORM_HCC__= -D__HIP_PLATFORM_AMD__= -I/opt/rocm-6.2.0/include -I/include --offload-arch=gfx1030 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DK_QUANTS_PER_ITERATION=2 -x hip -c -o ggml/src/ggml-cuda/argsort.o ggml/src/ggml-cuda/argsort.cu
clang++: warning: argument unused during compilation: '-s' [-Wunused-command-line-argument]
/opt/rocm/llvm/bin/clang++ -I. -Iggml/include -Iggml/src -Iinclude -Isrc -I./common -I./include -I./include/CL -I./otherarch -I./otherarch/tools -I./otherarch/sdcpp -I./otherarch/sdcpp/thirdparty -I./include/vulkan -O3 -fno-finite-math-only -fmath-errno -DNDEBUG -std=c++11 -fPIC -DLOG_DISABLE_LOGS -D_GNU_SOURCE -DGGML_USE_LLAMAFILE -pthread -s -Wno-multichar -Wno-write-strings -Wno-deprecated -Wno-deprecated-declarations -pthread -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DSD_USE_CUBLAS  -D__HIP_PLATFORM_HCC__= -D__HIP_PLATFORM_AMD__= -I/opt/rocm-6.2.0/include -I/include --offload-arch=gfx1030 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DK_QUANTS_PER_ITERATION=2 -x hip -c -o ggml/src/ggml-cuda/binbcast.o ggml/src/ggml-cuda/binbcast.cu
clang++: warning: argument unused during compilation: '-s' [-Wunused-command-line-argument]
/opt/rocm/llvm/bin/clang++ -I. -Iggml/include -Iggml/src -Iinclude -Isrc -I./common -I./include -I./include/CL -I./otherarch -I./otherarch/tools -I./otherarch/sdcpp -I./otherarch/sdcpp/thirdparty -I./include/vulkan -O3 -fno-finite-math-only -fmath-errno -DNDEBUG -std=c++11 -fPIC -DLOG_DISABLE_LOGS -D_GNU_SOURCE -DGGML_USE_LLAMAFILE -pthread -s -Wno-multichar -Wno-write-strings -Wno-deprecated -Wno-deprecated-declarations -pthread -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DSD_USE_CUBLAS  -D__HIP_PLATFORM_HCC__= -D__HIP_PLATFORM_AMD__= -I/opt/rocm-6.2.0/include -I/include --offload-arch=gfx1030 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DK_QUANTS_PER_ITERATION=2 -x hip -c -o ggml/src/ggml-cuda/clamp.o ggml/src/ggml-cuda/clamp.cu
clang++: warning: argument unused during compilation: '-s' [-Wunused-command-line-argument]
/opt/rocm/llvm/bin/clang++ -I. -Iggml/include -Iggml/src -Iinclude -Isrc -I./common -I./include -I./include/CL -I./otherarch -I./otherarch/tools -I./otherarch/sdcpp -I./otherarch/sdcpp/thirdparty -I./include/vulkan -O3 -fno-finite-math-only -fmath-errno -DNDEBUG -std=c++11 -fPIC -DLOG_DISABLE_LOGS -D_GNU_SOURCE -DGGML_USE_LLAMAFILE -pthread -s -Wno-multichar -Wno-write-strings -Wno-deprecated -Wno-deprecated-declarations -pthread -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DSD_USE_CUBLAS  -D__HIP_PLATFORM_HCC__= -D__HIP_PLATFORM_AMD__= -I/opt/rocm-6.2.0/include -I/include --offload-arch=gfx1030 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DK_QUANTS_PER_ITERATION=2 -x hip -c -o ggml/src/ggml-cuda/concat.o ggml/src/ggml-cuda/concat.cu
clang++: warning: argument unused during compilation: '-s' [-Wunused-command-line-argument]
/opt/rocm/llvm/bin/clang++ -I. -Iggml/include -Iggml/src -Iinclude -Isrc -I./common -I./include -I./include/CL -I./otherarch -I./otherarch/tools -I./otherarch/sdcpp -I./otherarch/sdcpp/thirdparty -I./include/vulkan -O3 -fno-finite-math-only -fmath-errno -DNDEBUG -std=c++11 -fPIC -DLOG_DISABLE_LOGS -D_GNU_SOURCE -DGGML_USE_LLAMAFILE -pthread -s -Wno-multichar -Wno-write-strings -Wno-deprecated -Wno-deprecated-declarations -pthread -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DSD_USE_CUBLAS  -D__HIP_PLATFORM_HCC__= -D__HIP_PLATFORM_AMD__= -I/opt/rocm-6.2.0/include -I/include --offload-arch=gfx1030 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DK_QUANTS_PER_ITERATION=2 -x hip -c -o ggml/src/ggml-cuda/convert.o ggml/src/ggml-cuda/convert.cu
clang++: warning: argument unused during compilation: '-s' [-Wunused-command-line-argument]
/opt/rocm/llvm/bin/clang++ -I. -Iggml/include -Iggml/src -Iinclude -Isrc -I./common -I./include -I./include/CL -I./otherarch -I./otherarch/tools -I./otherarch/sdcpp -I./otherarch/sdcpp/thirdparty -I./include/vulkan -O3 -fno-finite-math-only -fmath-errno -DNDEBUG -std=c++11 -fPIC -DLOG_DISABLE_LOGS -D_GNU_SOURCE -DGGML_USE_LLAMAFILE -pthread -s -Wno-multichar -Wno-write-strings -Wno-deprecated -Wno-deprecated-declarations -pthread -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DSD_USE_CUBLAS  -D__HIP_PLATFORM_HCC__= -D__HIP_PLATFORM_AMD__= -I/opt/rocm-6.2.0/include -I/include --offload-arch=gfx1030 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DK_QUANTS_PER_ITERATION=2 -x hip -c -o ggml/src/ggml-cuda/conv-transpose-1d.o ggml/src/ggml-cuda/conv-transpose-1d.cu
clang++: warning: argument unused during compilation: '-s' [-Wunused-command-line-argument]
/opt/rocm/llvm/bin/clang++ -I. -Iggml/include -Iggml/src -Iinclude -Isrc -I./common -I./include -I./include/CL -I./otherarch -I./otherarch/tools -I./otherarch/sdcpp -I./otherarch/sdcpp/thirdparty -I./include/vulkan -O3 -fno-finite-math-only -fmath-errno -DNDEBUG -std=c++11 -fPIC -DLOG_DISABLE_LOGS -D_GNU_SOURCE -DGGML_USE_LLAMAFILE -pthread -s -Wno-multichar -Wno-write-strings -Wno-deprecated -Wno-deprecated-declarations -pthread -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DSD_USE_CUBLAS  -D__HIP_PLATFORM_HCC__= -D__HIP_PLATFORM_AMD__= -I/opt/rocm-6.2.0/include -I/include --offload-arch=gfx1030 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DK_QUANTS_PER_ITERATION=2 -x hip -c -o ggml/src/ggml-cuda/cpy.o ggml/src/ggml-cuda/cpy.cu
clang++: warning: argument unused during compilation: '-s' [-Wunused-command-line-argument]
/opt/rocm/llvm/bin/clang++ -I. -Iggml/include -Iggml/src -Iinclude -Isrc -I./common -I./include -I./include/CL -I./otherarch -I./otherarch/tools -I./otherarch/sdcpp -I./otherarch/sdcpp/thirdparty -I./include/vulkan -O3 -fno-finite-math-only -fmath-errno -DNDEBUG -std=c++11 -fPIC -DLOG_DISABLE_LOGS -D_GNU_SOURCE -DGGML_USE_LLAMAFILE -pthread -s -Wno-multichar -Wno-write-strings -Wno-deprecated -Wno-deprecated-declarations -pthread -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DSD_USE_CUBLAS  -D__HIP_PLATFORM_HCC__= -D__HIP_PLATFORM_AMD__= -I/opt/rocm-6.2.0/include -I/include --offload-arch=gfx1030 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DK_QUANTS_PER_ITERATION=2 -x hip -c -o ggml/src/ggml-cuda/diagmask.o ggml/src/ggml-cuda/diagmask.cu
clang++: warning: argument unused during compilation: '-s' [-Wunused-command-line-argument]
/opt/rocm/llvm/bin/clang++ -I. -Iggml/include -Iggml/src -Iinclude -Isrc -I./common -I./include -I./include/CL -I./otherarch -I./otherarch/tools -I./otherarch/sdcpp -I./otherarch/sdcpp/thirdparty -I./include/vulkan -O3 -fno-finite-math-only -fmath-errno -DNDEBUG -std=c++11 -fPIC -DLOG_DISABLE_LOGS -D_GNU_SOURCE -DGGML_USE_LLAMAFILE -pthread -s -Wno-multichar -Wno-write-strings -Wno-deprecated -Wno-deprecated-declarations -pthread -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DSD_USE_CUBLAS  -D__HIP_PLATFORM_HCC__= -D__HIP_PLATFORM_AMD__= -I/opt/rocm-6.2.0/include -I/include --offload-arch=gfx1030 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DK_QUANTS_PER_ITERATION=2 -x hip -c -o ggml/src/ggml-cuda/dmmv.o ggml/src/ggml-cuda/dmmv.cu
clang++: warning: argument unused during compilation: '-s' [-Wunused-command-line-argument]
/opt/rocm/llvm/bin/clang++ -I. -Iggml/include -Iggml/src -Iinclude -Isrc -I./common -I./include -I./include/CL -I./otherarch -I./otherarch/tools -I./otherarch/sdcpp -I./otherarch/sdcpp/thirdparty -I./include/vulkan -O3 -fno-finite-math-only -fmath-errno -DNDEBUG -std=c++11 -fPIC -DLOG_DISABLE_LOGS -D_GNU_SOURCE -DGGML_USE_LLAMAFILE -pthread -s -Wno-multichar -Wno-write-strings -Wno-deprecated -Wno-deprecated-declarations -pthread -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DSD_USE_CUBLAS  -D__HIP_PLATFORM_HCC__= -D__HIP_PLATFORM_AMD__= -I/opt/rocm-6.2.0/include -I/include --offload-arch=gfx1030 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DK_QUANTS_PER_ITERATION=2 -x hip -c -o ggml/src/ggml-cuda/fattn.o ggml/src/ggml-cuda/fattn.cu
clang++: warning: argument unused during compilation: '-s' [-Wunused-command-line-argument]
/opt/rocm/llvm/bin/clang++ -I. -Iggml/include -Iggml/src -Iinclude -Isrc -I./common -I./include -I./include/CL -I./otherarch -I./otherarch/tools -I./otherarch/sdcpp -I./otherarch/sdcpp/thirdparty -I./include/vulkan -O3 -fno-finite-math-only -fmath-errno -DNDEBUG -std=c++11 -fPIC -DLOG_DISABLE_LOGS -D_GNU_SOURCE -DGGML_USE_LLAMAFILE -pthread -s -Wno-multichar -Wno-write-strings -Wno-deprecated -Wno-deprecated-declarations -pthread -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DSD_USE_CUBLAS  -D__HIP_PLATFORM_HCC__= -D__HIP_PLATFORM_AMD__= -I/opt/rocm-6.2.0/include -I/include --offload-arch=gfx1030 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DK_QUANTS_PER_ITERATION=2 -x hip -c -o ggml/src/ggml-cuda/fattn-tile-f16.o ggml/src/ggml-cuda/fattn-tile-f16.cu
clang++: warning: argument unused during compilation: '-s' [-Wunused-command-line-argument]
/opt/rocm/llvm/bin/clang++ -I. -Iggml/include -Iggml/src -Iinclude -Isrc -I./common -I./include -I./include/CL -I./otherarch -I./otherarch/tools -I./otherarch/sdcpp -I./otherarch/sdcpp/thirdparty -I./include/vulkan -O3 -fno-finite-math-only -fmath-errno -DNDEBUG -std=c++11 -fPIC -DLOG_DISABLE_LOGS -D_GNU_SOURCE -DGGML_USE_LLAMAFILE -pthread -s -Wno-multichar -Wno-write-strings -Wno-deprecated -Wno-deprecated-declarations -pthread -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DSD_USE_CUBLAS  -D__HIP_PLATFORM_HCC__= -D__HIP_PLATFORM_AMD__= -I/opt/rocm-6.2.0/include -I/include --offload-arch=gfx1030 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DK_QUANTS_PER_ITERATION=2 -x hip -c -o ggml/src/ggml-cuda/fattn-tile-f32.o ggml/src/ggml-cuda/fattn-tile-f32.cu
clang++: warning: argument unused during compilation: '-s' [-Wunused-command-line-argument]
/opt/rocm/llvm/bin/clang++ -I. -Iggml/include -Iggml/src -Iinclude -Isrc -I./common -I./include -I./include/CL -I./otherarch -I./otherarch/tools -I./otherarch/sdcpp -I./otherarch/sdcpp/thirdparty -I./include/vulkan -O3 -fno-finite-math-only -fmath-errno -DNDEBUG -std=c++11 -fPIC -DLOG_DISABLE_LOGS -D_GNU_SOURCE -DGGML_USE_LLAMAFILE -pthread -s -Wno-multichar -Wno-write-strings -Wno-deprecated -Wno-deprecated-declarations -pthread -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DSD_USE_CUBLAS  -D__HIP_PLATFORM_HCC__= -D__HIP_PLATFORM_AMD__= -I/opt/rocm-6.2.0/include -I/include --offload-arch=gfx1030 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DK_QUANTS_PER_ITERATION=2 -x hip -c -o ggml/src/ggml-cuda/getrows.o ggml/src/ggml-cuda/getrows.cu
clang++: warning: argument unused during compilation: '-s' [-Wunused-command-line-argument]
/opt/rocm/llvm/bin/clang++ -I. -Iggml/include -Iggml/src -Iinclude -Isrc -I./common -I./include -I./include/CL -I./otherarch -I./otherarch/tools -I./otherarch/sdcpp -I./otherarch/sdcpp/thirdparty -I./include/vulkan -O3 -fno-finite-math-only -fmath-errno -DNDEBUG -std=c++11 -fPIC -DLOG_DISABLE_LOGS -D_GNU_SOURCE -DGGML_USE_LLAMAFILE -pthread -s -Wno-multichar -Wno-write-strings -Wno-deprecated -Wno-deprecated-declarations -pthread -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DSD_USE_CUBLAS  -D__HIP_PLATFORM_HCC__= -D__HIP_PLATFORM_AMD__= -I/opt/rocm-6.2.0/include -I/include --offload-arch=gfx1030 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DK_QUANTS_PER_ITERATION=2 -x hip -c -o ggml/src/ggml-cuda/im2col.o ggml/src/ggml-cuda/im2col.cu
clang++: warning: argument unused during compilation: '-s' [-Wunused-command-line-argument]
/opt/rocm/llvm/bin/clang++ -I. -Iggml/include -Iggml/src -Iinclude -Isrc -I./common -I./include -I./include/CL -I./otherarch -I./otherarch/tools -I./otherarch/sdcpp -I./otherarch/sdcpp/thirdparty -I./include/vulkan -O3 -fno-finite-math-only -fmath-errno -DNDEBUG -std=c++11 -fPIC -DLOG_DISABLE_LOGS -D_GNU_SOURCE -DGGML_USE_LLAMAFILE -pthread -s -Wno-multichar -Wno-write-strings -Wno-deprecated -Wno-deprecated-declarations -pthread -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DSD_USE_CUBLAS  -D__HIP_PLATFORM_HCC__= -D__HIP_PLATFORM_AMD__= -I/opt/rocm-6.2.0/include -I/include --offload-arch=gfx1030 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DK_QUANTS_PER_ITERATION=2 -x hip -c -o ggml/src/ggml-cuda/mmq.o ggml/src/ggml-cuda/mmq.cu
clang++: warning: argument unused during compilation: '-s' [-Wunused-command-line-argument]
/opt/rocm/llvm/bin/clang++ -I. -Iggml/include -Iggml/src -Iinclude -Isrc -I./common -I./include -I./include/CL -I./otherarch -I./otherarch/tools -I./otherarch/sdcpp -I./otherarch/sdcpp/thirdparty -I./include/vulkan -O3 -fno-finite-math-only -fmath-errno -DNDEBUG -std=c++11 -fPIC -DLOG_DISABLE_LOGS -D_GNU_SOURCE -DGGML_USE_LLAMAFILE -pthread -s -Wno-multichar -Wno-write-strings -Wno-deprecated -Wno-deprecated-declarations -pthread -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DSD_USE_CUBLAS  -D__HIP_PLATFORM_HCC__= -D__HIP_PLATFORM_AMD__= -I/opt/rocm-6.2.0/include -I/include --offload-arch=gfx1030 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DK_QUANTS_PER_ITERATION=2 -x hip -c -o ggml/src/ggml-cuda/mmvq.o ggml/src/ggml-cuda/mmvq.cu
clang++: warning: argument unused during compilation: '-s' [-Wunused-command-line-argument]
lld: error: undefined hidden symbol: expf
>>> referenced by /tmp/fattn-gfx1030-87b27f.o:(void flash_attn_combine_results<256, 4>(float const*, HIP_vector_type<float, 2u> const*, float*))
>>> referenced by /tmp/fattn-gfx1030-87b27f.o:(void flash_attn_combine_results<256, 4>(float const*, HIP_vector_type<float, 2u> const*, float*))
>>> referenced by /tmp/fattn-gfx1030-87b27f.o:(void flash_attn_combine_results<256, 2>(float const*, HIP_vector_type<float, 2u> const*, float*))
>>> referenced 3 more times
clang++: error: amdgcn-link command failed with exit code 1 (use -v to see invocation)
make: *** [Makefile:274: ggml/src/ggml-cuda/fattn.o] Error 1
make: *** Waiting for unfinished jobs....
otherarch/ggml_v3-cuda.cu:5747:24: warning: loop not unrolled: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]
 5747 | static __global__ void soft_max_f32(const float * x, const float * y, float * dst, const int ncols_par, const int nrows_y, const float scale) {
      |                        ^
otherarch/ggml_v3-cuda.cu:5747:24: warning: loop not unrolled: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]
otherarch/ggml_v3-cuda.cu:5747:24: warning: loop not unrolled: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]
otherarch/ggml_v3-cuda.cu:5747:24: warning: loop not unrolled: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]
otherarch/ggml_v3-cuda.cu:5747:24: warning: loop not unrolled: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]
otherarch/ggml_v3-cuda.cu:5747:24: warning: loop not unrolled: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]
lld: error: undefined hidden symbol: expf
>>> referenced by /tmp/fattn-tile-f16-gfx1030-4a7125.o:(void flash_attn_combine_results<64, 4>(float const*, HIP_vector_type<float, 2u> const*, float*))
>>> referenced by /tmp/fattn-tile-f16-gfx1030-4a7125.o:(void flash_attn_combine_results<64, 4>(float const*, HIP_vector_type<float, 2u> const*, float*))
>>> referenced by /tmp/fattn-tile-f16-gfx1030-4a7125.o:(void flash_attn_combine_results<128, 4>(float const*, HIP_vector_type<float, 2u> const*, float*))
>>> referenced 5 more times
clang++: error: amdgcn-link command failed with exit code 1 (use -v to see invocation)
make: *** [Makefile:274: ggml/src/ggml-cuda/fattn-tile-f16.o] Error 1
lld: error: undefined hidden symbol: expf
>>> referenced by /tmp/fattn-tile-f32-gfx1030-726ef6.o:(void flash_attn_tile_ext_f32<64, 16, 8, 4>(char const*, char const*, char const*, char const*, float*, HIP_vector_type<float, 2u>*, float, float, float, float, unsigned int, int, int, int, int, int, int, int, int, int, int, int, int, int, int, int, int, int, int, int, int, int, int, int))
>>> referenced by /tmp/fattn-tile-f32-gfx1030-726ef6.o:(void flash_attn_tile_ext_f32<64, 16, 8, 4>(char const*, char const*, char const*, char const*, float*, HIP_vector_type<float, 2u>*, float, float, float, float, unsigned int, int, int, int, int, int, int, int, int, int, int, int, int, int, int, int, int, int, int, int, int, int, int, int))
>>> referenced by /tmp/fattn-tile-f32-gfx1030-726ef6.o:(void flash_attn_combine_results<64, 4>(float const*, HIP_vector_type<float, 2u> const*, float*))
>>> referenced 17 more times
clang++: error: amdgcn-link command failed with exit code 1 (use -v to see invocation)
make: *** [Makefile:274: ggml/src/ggml-cuda/fattn-tile-f32.o] Error 1
7 warnings generated when compiling for gfx1030.
lld: error: undefined hidden symbol: expf
>>> referenced by /tmp/ggml_v3-cuda-gfx1030-0d42b1.o:(silu_f32(float const*, float*, int))
>>> referenced by /tmp/ggml_v3-cuda-gfx1030-0d42b1.o:(silu_f32(float const*, float*, int))
>>> referenced by /tmp/ggml_v3-cuda-gfx1030-0d42b1.o:(gelu_quick_f32(float const*, float*, int))
>>> referenced 21 more times
clang++: error: amdgcn-link command failed with exit code 1 (use -v to see invocation)
make: *** [Makefile:282: ggml_v3-cuda.o] Error 1
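
For context, here is a minimal sketch of the kind of device code that appears to trigger this link error (a hypothetical reproducer, not a file from the build above): the clang++ lines in the log compile HIP device code with -fmath-errno, and with errno-setting math enabled a call such as expf may be lowered to a plain libcall that the amdgcn device linker cannot resolve.

// repro_expf.hip -- hypothetical reproducer, not part of the project being built above.
// Suggested compile line (mirrors the flags and offload arch from the log):
//   /opt/rocm/llvm/bin/clang++ -x hip --offload-arch=gfx1030 -O3 \
//     -fno-finite-math-only -fmath-errno -c repro_expf.hip -o repro_expf.o
#include <hip/hip_runtime.h>
#include <cmath>

__global__ void exp_kernel(const float *in, float *out, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n)
    out[i] = expf(in[i]);  // with -fmath-errno this call may stay unresolved at device link time
}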

[Issue]: build failure on archlinux

Problem Description

This occurred while attempting to compile HEAD; any guidance is appreciated.

[920/7072] Building CXX object lib/Target/SystemZ/CMakeFiles/LLVMSystemZCodeGen.dir/SystemZISelLowering.cpp.o
FAILED: lib/Target/SystemZ/CMakeFiles/LLVMSystemZCodeGen.dir/SystemZISelLowering.cpp.o
/usr/bin/c++ -D_GNU_SOURCE -D__STDC_CONSTANT_MACROS -D__STDC_FORMAT_MACROS -D__STDC_LIMIT_MACROS -I/home/raijin/aur/llvm-rocm-git/src/_build/lib/Target/SystemZ -I/home/raijin/aur/llvm-rocm-git/src/llvm-project/llvm/lib/Target/SystemZ -I/home/raijin/aur/llvm-rocm-git/src/_build/include -I/home/raijin/aur/llvm-rocm-git/src/llvm-project/llvm/include -march=x86-64 -mtune=generic -O2 -pipe -fno-plt -fexceptions         -Wp,-D_FORTIFY_SOURCE=2 -Wformat -Werror=format-security         -fstack-clash-protection -fcf-protection -Wp,-D_GLIBCXX_ASSERTIONS -fPIC -fno-semantic-interposition -fvisibility-inlines-hidden -Werror=date-time -Wall -Wextra -Wno-unused-parameter -Wwrite-strings -Wcast-qual -Wno-missing-field-initializers -pedantic -Wno-long-long -Wimplicit-fallthrough -Wno-maybe-uninitialized -Wno-nonnull -Wno-class-memaccess -Wno-redundant-move -Wno-pessimizing-move -Wno-noexcept-type -Wdelete-non-virtual-dtor -Wsuggest-override -Wno-comment -Wno-misleading-indentation -Wctad-maybe-unsupported -fdiagnostics-color -ffunction-sections -fdata-sections -O3 -DNDEBUG -std=c++17 -fvisibility=hidden  -fno-exceptions -funwind-tables -MD -MT lib/Target/SystemZ/CMakeFiles/LLVMSystemZCodeGen.dir/SystemZISelLowering.cpp.o -MF lib/Target/SystemZ/CMakeFiles/LLVMSystemZCodeGen.dir/SystemZISelLowering.cpp.o.d -o lib/Target/SystemZ/CMakeFiles/LLVMSystemZCodeGen.dir/SystemZISelLowering.cpp.o -c /home/raijin/aur/llvm-rocm-git/src/llvm-project/llvm/lib/Target/SystemZ/SystemZISelLowering.cpp
/home/raijin/aur/llvm-rocm-git/src/llvm-project/llvm/lib/Target/SystemZ/SystemZISelLowering.cpp: In member function ‘llvm::SDValue llvm::SystemZTargetLowering::combineTruncateExtract(const llvm::SDLoc&, llvm::EVT, llvm::SDValue, llvm::TargetLowering::DAGCombinerInfo&) const’:
/home/raijin/aur/llvm-rocm-git/src/llvm-project/llvm/lib/Target/SystemZ/SystemZISelLowering.cpp:6590:7: error: ‘N0’ was not declared in this scope
 6590 |   if (N0.getOpcode() == ISD::XOR &&
      |       ^~
/home/raijin/aur/llvm-rocm-git/src/llvm-project/llvm/lib/Target/SystemZ/SystemZISelLowering.cpp:6595:9: error: ‘VT’ was not declared in this scope
 6595 |     if (VT.isScalarInteger() && VT.getSizeInBits() < X.getValueSizeInBits()) {
      |         ^~
/home/raijin/aur/llvm-rocm-git/src/llvm-project/llvm/lib/Target/SystemZ/SystemZISelLowering.cpp:6596:25: error: ‘DAG’ was not declared in this scope
 6596 |       KnownBits Known = DAG.computeKnownBits(X);

Operating System

OS: NAME="Arch Linux"

CPU

CPU: 8-core AMD Ryzen 9 4900HS with Radeon Graphics (-MT MCP-)

GPU

AMD Radeon VII

ROCm Version

ROCm 6.0.0

ROCm Component

ROCm-CompilerSupport

Steps to Reproduce

attempt to build HEAD of this llvm-project

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

ROCk module is loaded
=====================
HSA System Attributes
=====================
Runtime Version:         1.1
System Timestamp Freq.:  1000.000000MHz
Sig. Max Wait Duration:  18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model:           LARGE
System Endianness:       LITTLE
Mwaitx:                  DISABLED
DMAbuf Support:          YES

==========
HSA Agents
==========
*******
Agent 1
*******
  Name:                    AMD Ryzen 9 4900HS with Radeon Graphics
  Uuid:                    CPU-XX
  Marketing Name:          AMD Ryzen 9 4900HS with Radeon Graphics
  Vendor Name:             CPU
  Feature:                 None specified
  Profile:                 FULL_PROFILE
  Float Round Mode:        NEAR
  Max Queue Number:        0(0x0)
  Queue Min Size:          0(0x0)
  Queue Max Size:          0(0x0)
  Queue Type:              MULTI
  Node:                    0
  Device Type:             CPU
  Cache Info:
    L1:                      32768(0x8000) KB
  Chip ID:                 0(0x0)
  ASIC Revision:           0(0x0)
  Cacheline Size:          64(0x40)
  Max Clock Freq. (MHz):   3000
  BDFID:                   0
  Internal Node ID:        0
  Compute Unit:            16
  SIMDs per CU:            0
  Shader Engines:          0
  Shader Arrs. per Eng.:   0
  WatchPts on Addr. Ranges:1
  Features:                None
  Pool Info:
    Pool 1
      Segment:                 GLOBAL; FLAGS: FINE GRAINED
      Size:                    15778348(0xf0c22c) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Alignment:         4KB
      Accessible by all:       TRUE
    Pool 2
      Segment:                 GLOBAL; FLAGS: KERNARG, FINE GRAINED
      Size:                    15778348(0xf0c22c) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Alignment:         4KB
      Accessible by all:       TRUE
    Pool 3
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED
      Size:                    15778348(0xf0c22c) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Alignment:         4KB
      Accessible by all:       TRUE
  ISA Info:
*******
Agent 2
*******
  Name:                    gfx90c
  Uuid:                    GPU-XX
  Marketing Name:          AMD Radeon Graphics
  Vendor Name:             AMD
  Feature:                 KERNEL_DISPATCH
  Profile:                 BASE_PROFILE
  Float Round Mode:        NEAR
  Max Queue Number:        128(0x80)
  Queue Min Size:          64(0x40)
  Queue Max Size:          131072(0x20000)
  Queue Type:              MULTI
  Node:                    1
  Device Type:             GPU
  Cache Info:
    L1:                      16(0x10) KB
    L2:                      1024(0x400) KB
  Chip ID:                 5686(0x1636)
  ASIC Revision:           0(0x0)
  Cacheline Size:          64(0x40)
  Max Clock Freq. (MHz):   1750
  BDFID:                   1024
  Internal Node ID:        1
  Compute Unit:            8
  SIMDs per CU:            4
  Shader Engines:          1
  Shader Arrs. per Eng.:   1
  WatchPts on Addr. Ranges:4
  Coherent Host Access:    FALSE
  Features:                KERNEL_DISPATCH
  Fast F16 Operation:      TRUE
  Wavefront Size:          64(0x40)
  Workgroup Max Size:      1024(0x400)
  Workgroup Max Size per Dimension:
    x                        1024(0x400)
    y                        1024(0x400)
    z                        1024(0x400)
  Max Waves Per CU:        40(0x28)
  Max Work-item Per CU:    2560(0xa00)
  Grid Max Size:           4294967295(0xffffffff)
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)
    y                        4294967295(0xffffffff)
    z                        4294967295(0xffffffff)
  Max fbarriers/Workgrp:   32
  Packet Processor uCode:: 471
  SDMA engine uCode::      40
  IOMMU Support::          None
  Pool Info:
    Pool 1
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED
      Size:                    524288(0x80000) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Alignment:         4KB
      Accessible by all:       FALSE
    Pool 2
      Segment:                 GLOBAL; FLAGS: EXTENDED FINE GRAINED
      Size:                    524288(0x80000) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Alignment:         4KB
      Accessible by all:       FALSE
    Pool 3
      Segment:                 GROUP
      Size:                    64(0x40) KB
      Allocatable:             FALSE
      Alloc Granule:           0KB
      Alloc Alignment:         0KB
      Accessible by all:       FALSE
  ISA Info:
    ISA 1
      Name:                    amdgcn-amd-amdhsa--gfx90c:xnack-
      Machine Models:          HSA_MACHINE_MODEL_LARGE
      Profiles:                HSA_PROFILE_BASE
      Default Rounding Mode:   NEAR
      Default Rounding Mode:   NEAR
      Fast f16:                TRUE
      Workgroup Max Size:      1024(0x400)
      Workgroup Max Size per Dimension:
        x                        1024(0x400)
        y                        1024(0x400)
        z                        1024(0x400)
      Grid Max Size:           4294967295(0xffffffff)
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)
        y                        4294967295(0xffffffff)
        z                        4294967295(0xffffffff)
      FBarrier Max Size:       32
*** Done ***

Additional Information

No response

Link to LLVM failed

I'm trying to build comgr from source, with the following cmake config:

cmake -S . -B build \
-DCMAKE_BUILD_TYPE=Release \
-DAMDDeviceLibs_DIR=$HOME/opt/rocm/5.4.3/lib/cmake/AMDDeviceLibs \
-DLLD_DIR=$HOME/opt/llvm/15.0.7/lib/cmake/lld \
-DClang_DIR=$HOME/opt/llvm/15.0.7/lib/cmake/clang \
-DROCM_DIR=$HOME/opt/rocm/5.4.3/share/rocm/cmake \
-DCMAKE_INSTALL_PREFIX=$PWD/install

cmake --build build

There are linking errors during the build, reporting undefined references to the LLVM libraries:

/usr/bin/ld: CMakeFiles/amd_comgr.dir/src/comgr-metadata.cpp.o: in function `COMGR::metadata::getMetadataRoot(COMGR::DataObject*, COMGR::DataMeta*)':
comgr-metadata.cpp:(.text+0x2b5): undefined reference to `llvm::object::object_category()'
/usr/bin/ld: comgr-metadata.cpp:(.text+0x2cf): undefined reference to `llvm::StringError::StringError(llvm::Twine const&, std::error_code)'
/usr/bin/ld: comgr-metadata.cpp:(.text+0x2f9): undefined reference to `llvm::object::object_category()'
/usr/bin/ld: comgr-metadata.cpp:(.text+0x313): undefined reference to `llvm::StringError::StringError(llvm::Twine const&, std::error_code)'
/usr/bin/ld: comgr-metadata.cpp:(.text+0x4ad): undefined reference to `llvm::object::object_category()'
/usr/bin/ld: comgr-metadata.cpp:(.text+0x4c2): undefined reference to `llvm::StringError::StringError(llvm::Twine const&, std::error_code)'
/usr/bin/ld: comgr-metadata.cpp:(.text+0x4ec): undefined reference to `llvm::object::object_category()'
/usr/bin/ld: comgr-metadata.cpp:(.text+0x501): undefined reference to `llvm::StringError::StringError(llvm::Twine const&, std::error_code)'
/usr/bin/ld: comgr-metadata.cpp:(.text+0x715): undefined reference to `llvm::object::object_category()'
/usr/bin/ld: comgr-metadata.cpp:(.text+0x72f): undefined reference to `llvm::StringError::StringError(llvm::Twine const&, std::error_code)'
/usr/bin/ld: comgr-metadata.cpp:(.text+0x759): undefined reference to `llvm::object::object_category()'
/usr/bin/ld: comgr-metadata.cpp:(.text+0x773): undefined reference to `llvm::StringError::StringError(llvm::Twine const&, std::error_code)'
/usr/bin/ld: comgr-metadata.cpp:(.text+0x8fd): undefined reference to `llvm::object::object_category()'
/usr/bin/ld: comgr-metadata.cpp:(.text+0x917): undefined reference to `llvm::StringError::StringError(llvm::Twine const&, std::error_code)'
/usr/bin/ld: comgr-metadata.cpp:(.text+0x941): undefined reference to `llvm::object::object_category()'
/usr/bin/ld: comgr-metadata.cpp:(.text+0x95b): undefined reference to `llvm::StringError::StringError(llvm::Twine const&, std::error_code)'
/usr/bin/ld: CMakeFiles/amd_comgr.dir/src/comgr-metadata.cpp.o: in function `COMGR::metadata::getELFObjectFileBase(COMGR::DataObject*)':

I'm using the latest AMD LLVM and device libs from the amd-stg-open branch. The full build log is available in build.log. Is there any way to tackle this? Thanks in advance.
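
Not a confirmed fix, but one thing worth checking: undefined references to llvm:: symbols at this stage often indicate that the LLVM libraries being linked do not come from the same build as the Clang/LLD packages passed to CMake. Below is a hedged sketch of the configure step with an explicit, matching LLVM_DIR added (the paths are illustrative and assume comgr resolves LLVM through the standard find_package(LLVM CONFIG) mechanism; point the LLVM, Clang, and LLD *_DIR entries at the same install):

cmake -S . -B build \
-DCMAKE_BUILD_TYPE=Release \
-DAMDDeviceLibs_DIR=$HOME/opt/rocm/5.4.3/lib/cmake/AMDDeviceLibs \
-DLLVM_DIR=$HOME/opt/llvm/15.0.7/lib/cmake/llvm \
-DLLD_DIR=$HOME/opt/llvm/15.0.7/lib/cmake/lld \
-DClang_DIR=$HOME/opt/llvm/15.0.7/lib/cmake/clang \
-DROCM_DIR=$HOME/opt/rocm/5.4.3/share/rocm/cmake \
-DCMAKE_INSTALL_PREFIX=$PWD/install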

[Feature]: Compiler option for expanding s_waitcnt instructions or don’t merge them in the first place

Suggestion Description

In the kernels we have studied, it is not uncommon for a single waitcnt instruction to wait for multiple load instructions to finish loading their values.

Once the PC-sampling feature is enabled on these kernels, we expect to see a non-negligible number of PC samples reported at the waitcnt instructions.

In order to figure out which load instruction might be the bottleneck, it would be nice to have a compiler option that expands a single waitcnt instruction for value N into a series of waitcnt instructions with decreasing values N+k, N+k-1, …, N when the waitcnt instruction is waiting for k loads to complete.

Please note that we are NOT looking for the existing compiler option -amdgpu-waitcnt-forcezero, which adds an s_waitcnt(0) after every instruction, as we still want to hide memory load latency with compute instructions as much as possible.
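
To illustrate the request (a hypothetical instruction sequence, not actual compiler output): with three outstanding vector loads, current codegen may emit a single merged wait, whereas the proposed option would emit a descending sequence so that PC samples can be attributed to the individual loads.

// current: one merged wait covers all three loads
global_load_dwordx4 v[0:3],  v[20:21], off
global_load_dwordx4 v[4:7],  v[22:23], off
global_load_dwordx4 v[8:11], v[24:25], off
s_waitcnt vmcnt(0)            // all stall samples land on this instruction

// proposed: descending counts retire the loads one at a time
s_waitcnt vmcnt(2)            // first load complete
s_waitcnt vmcnt(1)            // second load complete
s_waitcnt vmcnt(0)            // third load complete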

Operating System

No response

GPU

MI200 / MI250 / MI300

ROCm Component

No response

[Issue]: SD3 cpp implementation build on Windows issue

Problem Description

I'm trying to build the SD3 cpp implementation from
https://github.com/leejet/stable-diffusion.cpp

On Linux it builds and runs, but using the HIP SDK on Windows I get the following error:

cmake .. -G "Ninja" -DCMAKE_C_COMPILER=clang -DCMAKE_CXX_COMPILER=clang++ -DSD_HIPBLAS=ON -DCMAKE_BUILD_TYPE=Release -DAMDGPU_TARGETS=gfx1100

LINK : warning LNK4044: unrecognized option '/v'; ignored
LINK : warning LNK4001: no object files specified; libraries used
LINK : warning LNK4068: /MACHINE not specified; defaulting to X64
LINK : fatal error LNK1561: entry point must be defined

This looks related to the -v option in ggml\src\CMakeLists.txt, specifically the line
COMMAND ${CMAKE_C_COMPILER} ${CMAKE_EXE_LINKER_FLAGS} -Wl,-v

On Linux, the configuration step passes.

Can anyone help to solve this?
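
A possible workaround, sketched here without having verified the surrounding code (the execute_process wrapper is an assumption; only the -Wl,-v line is quoted above), would be to guard the GNU-style linker probe so it is skipped when the toolchain drives MSVC's link.exe, which rejects -Wl,-v:

# hypothetical guard around the linker probe in ggml\src\CMakeLists.txt
if(NOT MSVC)
  execute_process(
    COMMAND ${CMAKE_C_COMPILER} ${CMAKE_EXE_LINKER_FLAGS} -Wl,-v
    OUTPUT_VARIABLE LINKER_PROBE_OUTPUT
    ERROR_VARIABLE  LINKER_PROBE_ERROR)
endif()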

Operating System

Windows 11 Pro 23H2

CPU

AMD 7700X

GPU

AMD Radeon RX 7900 XTX

ROCm Version

ROCm 5.7.1, ROCm 5.7.0

ROCm Component

No response

Steps to Reproduce

Following stable-diffusion.cpp's "get source and build" guide with the HIP SDK 5.7 preinstalled (5.5 also has the same issue).

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

No response

Additional Information

No response

[Issue]: comgr.cpp compilation error

Problem Description

I'm trying to compile amd_comgr on Visual Studio 2022.
I'm on the latest amd-staging branch of https://github.com/ROCm/llvm-project
( 6dfd4ed )

I was able to build llvm and device-libs correctly.
But for amd_comgr, I get these compilation errors:

1>comgr.cpp
1>H:\PROJECTS\096_rocm\llvm\llvm-project\llvm\myBuild\tools\clang\include\clang/Driver/Options.inc(65,1): warning C4003: not enough arguments for function-like macro invocation 'LLVM_MAKE_OPT_ID'
1>H:\PROJECTS\096_rocm\llvm\llvm-project\llvm\myBuild\tools\clang\include\clang/Driver/Options.inc(66,1): warning C4003: not enough arguments for function-like macro invocation 'LLVM_MAKE_OPT_ID'
1>H:\PROJECTS\096_rocm\llvm\llvm-project\llvm\myBuild\tools\clang\include\clang/Driver/Options.inc(66,1): error C2365: 'clang::driver::options::OPT_': redefinition; previous definition was 'enumerator'
1>H:\PROJECTS\096_rocm\llvm\llvm-project\llvm\myBuild\tools\clang\include\clang/Driver/Options.inc(65,1): message : see declaration of 'clang::driver::options::OPT_'
1>H:\PROJECTS\096_rocm\llvm\llvm-project\llvm\myBuild\tools\clang\include\clang/Driver/Options.inc(67,1): warning C4003: not enough arguments for function-like macro invocation 'LLVM_MAKE_OPT_ID'
1>H:\PROJECTS\096_rocm\llvm\llvm-project\llvm\myBuild\tools\clang\include\clang/Driver/Options.inc(67,1): error C2365: 'clang::driver::options::OPT_': redefinition; previous definition was 'enumerator'
1>H:\PROJECTS\096_rocm\llvm\llvm-project\llvm\myBuild\tools\clang\include\clang/Driver/Options.inc(65,1): message : see declaration of 'clang::driver::options::OPT_'
........... lots of errors like that in Options.inc .......

Am I missing something?

Operating System

Windows 11

CPU

AMD Ryzen Threadripper PRO 5955WX 16-Cores

GPU

AMD Radeon Pro W7900

ROCm Version

ROCm 6.0.0

ROCm Component

No response

Steps to Reproduce

No response

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

No response

Additional Information

No response

[Issue]: comgr fails to build hip code on musl systems

Problem Description

While running amd/comgr autotests with musl libc, tests related to HIP code fail like this:

33/38 Test: comgr_compile_source_to_executable
Command: "/var/tmp/portage/dev-libs/rocm-comgr-6.1.1/work/llvm-project-rocm-6.1.1/amd/comgr_build/test/compile_source_to_executable"
Directory: /var/tmp/portage/dev-libs/rocm-comgr-6.1.1/work/llvm-project-rocm-6.1.1/amd/comgr_build/test
"comgr_compile_source_to_executable" start time: May 22 04:02 UTC
Output:
----------------------------------------------------------
In file included from <built-in>:1:
In file included from /usr/lib/llvm/18/bin/../../../../lib/clang/18/include/__clang_hip_runtime_wrapper.h:111:
/usr/lib/llvm/18/bin/../../../../lib/clang/18/include/cuda_wrappers/cmath:27:15: fatal error: 'cmath' file not found
   27 | #include_next <cmath>
      |               ^~~~~~~
1 error generated when compiling for gfx900.
FAILED: amd_comgr_do_action
 REASON: ERROR
<end of output>
Test time =   0.74 sec
----------------------------------------------------------
Test Failed.

Note that clang and hipcc work; this is only an issue with the comgr runtime compiler.

The problem is caused by lines https://github.com/ROCm/llvm-project/blob/rocm-6.1.1/amd/comgr/src/comgr-compiler.cpp#L1054-L1055

    Args.push_back("-target");
    Args.push_back("x86_64-unknown-linux-gnu");

The correct target would be x86_64-unknown-linux-musl; however, the easiest fix is simply to remove these two lines, because clang uses the host target by default. Removing them allows all comgr tests to pass with musl libc (given that other non-musl fixes are applied, and excluding comgr_nested_kernel_test due to #35).

Not setting -target enables automatic target selection and sets up the flags accordingly:

  • glibc system automatically adds /usr/lib/gcc/x86_64-pc-linux-gnu/13/include/g++-v13/ to include path
  • musl system automatically adds /usr/lib/gcc/x86_64-gentoo-linux-musl/13/include/g++-v13/ to include path
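
If keeping an explicit -target in comgr-compiler.cpp is preferred over deleting the two lines, a possible alternative would be to derive the triple from the toolchain's default instead of hard-coding it. This is only a minimal sketch, not tested against the comgr sources; the function name appendHostTarget and the Args type are illustrative, and the header providing getDefaultTargetTriple moved between LLVM releases:

#include <string>
#include <vector>
#include "llvm/TargetParser/Host.h" // llvm/Support/Host.h in older LLVM releases

// Illustrative only: keep an explicit -target, but ask LLVM for its default
// (host) triple instead of hard-coding x86_64-unknown-linux-gnu. On a
// musl-built LLVM this yields a *-linux-musl triple, on glibc *-linux-gnu.
static void appendHostTarget(std::vector<const char *> &Args) {
  static const std::string DefaultTriple = llvm::sys::getDefaultTargetTriple();
  Args.push_back("-target");
  Args.push_back(DefaultTriple.c_str());
}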

Operating System

Gentoo / musl profile

CPU

GPU

AMD Instinct MI300X

ROCm Version

ROCm 6.1.0

ROCm Component

No response

Steps to Reproduce

No response

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

No response

Additional Information

No response

HIPCC-5.7.0 misparsing types, causing operator '*' to be reported as "ambiguous"

While compiling dla-future (https://github.com/eth-cscs/DLA-Future/) with spack (https://github.com/spack/spack), the error below was encountered. It seems hipcc treats const HIP_vector_type<float, 2> as the same type as hipFloatComplex (that is, float2, a struct in the HIP headers), resulting in two viable '*' operators.

Shortened Error Message
/tmp/yizeyi18/spack-stage/spack-stage-dla-future-0.3.1-z3uclos3gvxaa7c5siuj7l7cx44yzj4k/spack-src/src/lapack/gpu/add.cu:32:17: error: use of overloaded operator '*' is ambiguous (with operand types 'const HIP_vector_type' and 'const HIP_vector_type')
  b = b + alpha * a;
          ~~~~~ ^ ~
/tmp/yizeyi18/spack-stage/spack-stage-dla-future-0.3.1-z3uclos3gvxaa7c5siuj7l7cx44yzj4k/spack-src/src/lapack/gpu/add.cu:101:30: note: in instantiation of function template specialization 'dlaf::gpulapack::kernels::addAlpha>' requested here
    addDiagInternal(m, n, alpha, a, lda, b, ldb);
                             ^
/tmp/yizeyi18/spack-stage/spack-stage-dla-future-0.3.1-z3uclos3gvxaa7c5siuj7l7cx44yzj4k/spack-src/src/lapack/gpu/add.cu:126:9: note: in instantiation of function template specialization 'dlaf::gpulapack::kernels::addDiag<&dlaf::util::isLower, HIP_vector_type>' requested here
        addDiag(m, n, alpha, a, lda, b, ldb);
        ^
/tmp/yizeyi18/spack-stage/spack-stage-dla-future-0.3.1-z3uclos3gvxaa7c5siuj7l7cx44yzj4k/spack-src/src/lapack/gpu/add.cu:161:12: note: in instantiation of function template specialization 'dlaf::gpulapack::kernels::add>' requested here
  kernels::add<<>>(util::blasToCublas(uplo), um, un,
           ^
/opt/rocm/include/hip/amd_detail/amd_hip_vector_types.h:548:65: note: candidate function
        friend __HOST_DEVICE__ inline constexpr HIP_vector_type operator*(
                                                                ^
/opt/rocm/include/hip/amd_detail/amd_hip_complex.h:237:1: note: candidate function
COMPLEX_MUL_OP_OVERLOAD(hipFloatComplex)
^
/opt/rocm/include/hip/amd_detail/amd_hip_complex.h:78:40: note: expanded from macro 'COMPLEX_MUL_OP_OVERLOAD'
    __HOST_DEVICE__ static inline type operator*(const type& lhs, const type& rhs) {           \

full build logs:
logs_and_source.zip
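
A minimal sketch of what appears to trigger the ambiguity, reduced from the add.cu line above (repro.hip is a hypothetical file name; hipFloatComplex is an alias of float2, i.e. HIP_vector_type<float, 2>, so both the vector-type friend operator* and the COMPLEX_MUL_OP_OVERLOAD candidate are viable):

#include <hip/hip_runtime.h>
#include <hip/hip_complex.h> // pulls in both amd_hip_vector_types.h and amd_hip_complex.h

// Mirrors "b = b + alpha * a" from src/lapack/gpu/add.cu with T = HIP_vector_type<float, 2>.
__device__ void scale_add(const hipFloatComplex alpha, const hipFloatComplex a,
                          hipFloatComplex& b) {
  b = b + alpha * a; // the affected hipcc reports this '*' as ambiguous
}

If this is confirmed, a possible user-side workaround might be to call hipCmulf(alpha, a) explicitly instead of relying on operator*.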

[Issue]: failure of `AMDGPU DAG->DAG Pattern Instruction Selection` due to `llvm.llrint.i64.f32`

Problem Description

The compiler from the [email protected] suite fails to generate code for llvm.llrint.i64.f32.

Operating System

Ubuntu 20.04.1 LTS

CPU

Intel(R) Xeon(R) Gold 6132 CPU

GPU

AMD Instinct MI100

ROCm Version

ROCm 6.0.0

ROCm Component

llvm-project

Steps to Reproduce

Here is a small reproducible example (func.ll)

declare i64 @llvm.llrint.i64.f32(float)

define i64 @convert(float %arg) {
entry:
  %0 = tail call i64 @llvm.llrint.i64.f32(float %arg)
  ret i64 %0
}

Here is the corresponding Makefile

CC=/opt/rocm/llvm/bin/llc

x86: func.ll
        $(CC) $< -mtriple="x86_64-linux-gnu" -o x86.o

amdgcn: func.ll
        $(CC) $< -mtriple="amdgcn-amd-amdhsa" -o amdgcn.o

clean:
        rm -f x86.o amdgcn.o

Execute the following to reproduce the bug

make amdgcn

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

No response

Additional Information

This bug was discovered during code generation of a Triton kernel. The original source of the issue is here.

Remove newline in hipconfig 5.7?

hipconfig from version 5.7 automatically adds a trailing newline that was not there in previous hipcc versions:

 $ ml hip/5.6.0
 $ hipconfig --cpp_config
 -D__HIP_PLATFORM_HCC__= -D__HIP_PLATFORM_AMD__= -I/opt/rocm-5.6.0/include -I/opt/rocm-5.6.0/llvm/bin/../lib/clang/16.0.0  $ #no newline

 $ ml -hip hip/5.7.0
 $ hipconfig --cpp_config
 -D__HIP_PLATFORM_HCC__= -D__HIP_PLATFORM_AMD__= -I/opt/rocm-5.7.0/include -I/opt/rocm-5.7.0/llvm/lib/clang/17.0.0
  $ #newline

The --newline, -n option seems a bit useless then. But the issue we have is with a legacy autotools build system, where FLAGS="$(hipconfig --cpp_config) $FLAGS" in configure.ac generates a Makefile with a newline between the output of hipconfig and $FLAGS, leading to Makefile syntax errors.

IEEE 754 rounding modes

I noticed that __fadd_ru, __fsub_ru, and friends are not supported in HIP, because they were disabled in the device-lib a while ago. What would it take to re-enable IEEE 754 rounding modes (up/down) for single-precision floats again?
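
For context, here is a minimal sketch of the kind of device code this would enable; the __fadd_ru name follows the CUDA intrinsic convention and is assumed here, since the HIP equivalents are exactly what was disabled in the device-lib:

#include <hip/hip_runtime.h>

// With directed rounding, every partial sum is rounded toward +infinity,
// so the result is a guaranteed upper bound on the exact sum.
__device__ float sum_upper_bound(const float* x, int n) {
  float s = 0.0f;
  for (int i = 0; i < n; ++i)
    s = __fadd_ru(s, x[i]); // assumed intrinsic: single-precision add, round up
  return s;
}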

[Issue]: Failure to compile OpenCL kernels in darktable under Fedora 40 with ROCm 6.0.2

Problem Description

Since updating from Fedora 39 to Fedora 40, darktable-cltest fails to compile the OpenCL kernels with ROCm. It worked previously on Fedora 39 KDE. Attached is the error from darktable. It seems to be related to #70.

Output from darktable-cltest -d verbose

     0.4118 [opencl_build_program] could not build program: CL_BUILD_PROGRAM_FAILURE
     0.4118 [opencl_build_program] BUILD STATUS: -2
     0.4119 BUILD LOG:
     0.4119 fatal error: malformed or corrupted AST file: 'could not find file '/usr/lib64/llvm17/bin/../../../lib/clang/17/include/opencl-c-base.h' referenced by AST file '/tmp/comgr-c6b354/include/opencl1.2-c.pch''
1 error generated.
Error: Failed to compile source (from CL or HIP source to LLVM IR).
  • darktable version 4.7.0~git1208.6bf26b3c

I checked and opencl-c-base.h is in /usr/lib/clang/17/include/ and also in /usr/lib/clang/18/include/

Operating System

Fedora 40 KDE X11

CPU

AMD Ryzen 7 5700G with Radeon Graphics

GPU

AMD Radeon VII

ROCm Version

ROCm 6.0.0

ROCm Component

ROCm-CompilerSupport

Steps to Reproduce

See above.

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

ROCk module is loaded
=====================    
HSA System Attributes    
=====================    
Runtime Version:         1.1
System Timestamp Freq.:  1000.000000MHz
Sig. Max Wait Duration:  18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model:           LARGE                              
System Endianness:       LITTLE                             
Mwaitx:                  DISABLED
DMAbuf Support:          YES

==========               
HSA Agents               
==========               
*******                  
Agent 1                  
*******                  
  Name:                    AMD Ryzen 7 5700G with Radeon Graphics
  Uuid:                    CPU-XX                             
  Marketing Name:          AMD Ryzen 7 5700G with Radeon Graphics
  Vendor Name:             CPU                                
  Feature:                 None specified                     
  Profile:                 FULL_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        0(0x0)                             
  Queue Min Size:          0(0x0)                             
  Queue Max Size:          0(0x0)                             
  Queue Type:              MULTI                              
  Node:                    0                                  
  Device Type:             CPU                                
  Cache Info:              
    L1:                      32768(0x8000) KB                   
  Chip ID:                 0(0x0)                             
  ASIC Revision:           0(0x0)                             
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   3800                               
  BDFID:                   0                                  
  Internal Node ID:        0                                  
  Compute Unit:            16                                 
  SIMDs per CU:            0                                  
  Shader Engines:          0                                  
  Shader Arrs. per Eng.:   0                                  
  WatchPts on Addr. Ranges:1                                  
  Features:                None
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: FINE GRAINED        
      Size:                    15679652(0xef40a4) KB              
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 2                   
      Segment:                 GLOBAL; FLAGS: KERNARG, FINE GRAINED
      Size:                    15679652(0xef40a4) KB              
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 3                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    15679652(0xef40a4) KB              
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
  ISA Info:                
*******                  
Agent 2                  
*******                  
  Name:                    gfx90c                             
  Uuid:                    GPU-XX                             
  Marketing Name:          AMD Radeon Graphics                
  Vendor Name:             AMD                                
  Feature:                 KERNEL_DISPATCH                    
  Profile:                 BASE_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        128(0x80)                          
  Queue Min Size:          64(0x40)                           
  Queue Max Size:          131072(0x20000)                    
  Queue Type:              MULTI                              
  Node:                    1                                  
  Device Type:             GPU                                
  Cache Info:              
    L1:                      16(0x10) KB                        
    L2:                      1024(0x400) KB                     
  Chip ID:                 5688(0x1638)                       
  ASIC Revision:           0(0x0)                             
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   2000                               
  BDFID:                   3328                               
  Internal Node ID:        1                                  
  Compute Unit:            8                                  
  SIMDs per CU:            4                                  
  Shader Engines:          1                                  
  Shader Arrs. per Eng.:   1                                  
  WatchPts on Addr. Ranges:4                                  
  Coherent Host Access:    FALSE                              
  Features:                KERNEL_DISPATCH 
  Fast F16 Operation:      TRUE                               
  Wavefront Size:          64(0x40)                           
  Workgroup Max Size:      1024(0x400)                        
  Workgroup Max Size per Dimension:
    x                        1024(0x400)                        
    y                        1024(0x400)                        
    z                        1024(0x400)                        
  Max Waves Per CU:        40(0x28)                           
  Max Work-item Per CU:    2560(0xa00)                        
  Grid Max Size:           4294967295(0xffffffff)             
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)             
    y                        4294967295(0xffffffff)             
    z                        4294967295(0xffffffff)             
  Max fbarriers/Workgrp:   32                                 
  Packet Processor uCode:: 471                                
  SDMA engine uCode::      40                                 
  IOMMU Support::          None                               
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    524288(0x80000) KB                 
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 2                   
      Segment:                 GLOBAL; FLAGS: EXTENDED FINE GRAINED
      Size:                    524288(0x80000) KB                 
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 3                   
      Segment:                 GROUP                              
      Size:                    64(0x40) KB                        
      Allocatable:             FALSE                              
      Alloc Granule:           0KB                                
      Alloc Alignment:         0KB                                
      Accessible by all:       FALSE                              
  ISA Info:                
    ISA 1                    
      Name:                    amdgcn-amd-amdhsa--gfx90c:xnack-   
      Machine Models:          HSA_MACHINE_MODEL_LARGE            
      Profiles:                HSA_PROFILE_BASE                   
      Default Rounding Mode:   NEAR                               
      Default Rounding Mode:   NEAR                               
      Fast f16:                TRUE                               
      Workgroup Max Size:      1024(0x400)                        
      Workgroup Max Size per Dimension:
        x                        1024(0x400)                        
        y                        1024(0x400)                        
        z                        1024(0x400)                        
      Grid Max Size:           4294967295(0xffffffff)             
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)             
        y                        4294967295(0xffffffff)             
        z                        4294967295(0xffffffff)             
      FBarrier Max Size:       32                                 
*** Done ***             

Below is also the output from rocm-clinfo. I do have an Nvidia GPU, and on that one the kernels compile without issues (RPM Fusion drivers).

Number of platforms:                             4
  Platform Profile:                              FULL_PROFILE
  Platform Version:                              OpenCL 3.0 CUDA 12.4.131
  Platform Name:                                 NVIDIA CUDA
  Platform Vendor:                               NVIDIA Corporation
  Platform Extensions:                           cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_fp64 cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing cl_nv_compiler_options cl_nv_device_attribute_query cl_nv_pragma_unroll cl_nv_copy_opts cl_nv_create_buffer cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_device_uuid cl_khr_pci_bus_info cl_khr_external_semaphore cl_khr_external_memory cl_khr_external_semaphore_opaque_fd cl_khr_external_memory_opaque_fd
  Platform Profile:                              FULL_PROFILE
  Platform Version:                              OpenCL 2.1 AMD-APP (3602.0)
  Platform Name:                                 AMD Accelerated Parallel Processing
  Platform Vendor:                               Advanced Micro Devices, Inc.
  Platform Extensions:                           cl_khr_icd cl_amd_event_callback 
  Platform Profile:                              FULL_PROFILE
  Platform Version:                              OpenCL 1.1 Mesa 24.0.7
  Platform Name:                                 Clover
  Platform Vendor:                               Mesa
  Platform Extensions:                           cl_khr_icd
  Platform Profile:                              FULL_PROFILE
  Platform Version:                              OpenCL 3.0 
  Platform Name:                                 rusticl
  Platform Vendor:                               Mesa/X.org
  Platform Extensions:                           cl_khr_byte_addressable_store cl_khr_create_command_queue cl_khr_expect_assume cl_khr_extended_versioning cl_khr_icd cl_khr_il_program cl_khr_spirv_no_integer_wrap_decoration


  Platform Name:                                 NVIDIA CUDA
Number of devices:                               1
  Device Type:                                   CL_DEVICE_TYPE_GPU
  Vendor ID:                                     10deh
  Max compute units:                             28
  Max work items dimensions:                     3
    Max work items[0]:                           1024
    Max work items[1]:                           1024
    Max work items[2]:                           64
  Max work group size:                           1024
  Preferred vector width char:                   1
  Preferred vector width short:                  1
  Preferred vector width int:                    1
  Preferred vector width long:                   1
  Preferred vector width float:                  1
  Preferred vector width double:                 1
  Native vector width char:                      1
  Native vector width short:                     1
  Native vector width int:                       1
  Native vector width long:                      1
  Native vector width float:                     1
  Native vector width double:                    1
  Max clock frequency:                           1777Mhz
  Address bits:                                  64
  Max memory allocation:                         3154821120
  Image support:                                 Yes
  Max number of images read arguments:           256
  Max number of images write arguments:          32
  Max image 2D width:                            32768
  Max image 2D height:                           32768
  Max image 3D width:                            16384
  Max image 3D height:                           16384
  Max image 3D depth:                            16384
  Max samplers within kernel:                    32
  Max size of kernel argument:                   32764
  Alignment (bits) of base address:              4096
  Minimum alignment (bytes) for any datatype:    128
  Single precision floating point capability
    Denorms:                                     Yes
    Quiet NaNs:                                  Yes
    Round to nearest even:                       Yes
    Round to zero:                               Yes
    Round to +ve and infinity:                   Yes
    IEEE754-2008 fused multiply-add:             Yes
  Cache type:                                    Read/Write
  Cache line size:                               128
  Cache size:                                    802816
  Global memory size:                            12619284480
  Constant buffer size:                          65536
  Max number of constant args:                   9
  Local memory type:                             Scratchpad
  Local memory size:                             49152
  Max pipe arguments:                            0
  Max pipe active reservations:                  0
  Max pipe packet size:                          0
  Max global variable size:                      0
  Max global variable preferred total size:      0
  Max read/write image args:                     0
  Max on device events:                          0
  Queue on device max size:                      0
  Max on device queues:                          0
  Queue on device preferred size:                0
  SVM capabilities:                              
    Coarse grain buffer:                         Yes
    Fine grain buffer:                           No
    Fine grain system:                           No
    Atomics:                                     No
  Preferred platform atomic alignment:           0
  Preferred global atomic alignment:             0
  Preferred local atomic alignment:              0
  Kernel Preferred work group size multiple:     32
  Error correction support:                      0
  Unified memory for Host and Device:            0
  Profiling timer resolution:                    1000
  Device endianess:                              Little
  Available:                                     Yes
  Compiler available:                            Yes
  Execution capabilities:                                
    Execute OpenCL kernels:                      Yes
    Execute native function:                     No
  Queue on Host properties:                              
    Out-of-Order:                                Yes
    Profiling :                                  Yes
  Queue on Device properties:                            
    Out-of-Order:                                No
    Profiling :                                  No
  Platform ID:                                   0x564f9f0c6e00
  Name:                                          NVIDIA GeForce RTX 3060
  Vendor:                                        NVIDIA Corporation
  Device OpenCL C version:                       OpenCL C 1.2 
  Driver version:                                550.78
  Profile:                                       FULL_PROFILE
  Version:                                       OpenCL 3.0 CUDA
  Extensions:                                    cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_fp64 cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing cl_nv_compiler_options cl_nv_device_attribute_query cl_nv_pragma_unroll cl_nv_copy_opts cl_nv_create_buffer cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_device_uuid cl_khr_pci_bus_info cl_khr_external_semaphore cl_khr_external_memory cl_khr_external_semaphore_opaque_fd cl_khr_external_memory_opaque_fd


  Platform Name:                                 AMD Accelerated Parallel Processing
Number of devices:                               1
  Device Type:                                   CL_DEVICE_TYPE_GPU
  Vendor ID:                                     1002h
  Board name:                                    AMD Radeon Graphics
  Device Topology:                               PCI[ B#13, D#0, F#0 ]
  Max compute units:                             8
  Max work items dimensions:                     3
    Max work items[0]:                           1024
    Max work items[1]:                           1024
    Max work items[2]:                           1024
  Max work group size:                           256
  Preferred vector width char:                   4
  Preferred vector width short:                  2
  Preferred vector width int:                    1
  Preferred vector width long:                   1
  Preferred vector width float:                  1
  Preferred vector width double:                 1
  Native vector width char:                      4
  Native vector width short:                     2
  Native vector width int:                       1
  Native vector width long:                      1
  Native vector width float:                     1
  Native vector width double:                    1
  Max clock frequency:                           2000Mhz
  Address bits:                                  64
  Max memory allocation:                         402653184
  Image support:                                 Yes
  Max number of images read arguments:           128
  Max number of images write arguments:          8
  Max image 2D width:                            16384
  Max image 2D height:                           16384
  Max image 3D width:                            16384
  Max image 3D height:                           16384
  Max image 3D depth:                            8192
  Max samplers within kernel:                    16
  Max size of kernel argument:                   1024
  Alignment (bits) of base address:              1024
  Minimum alignment (bytes) for any datatype:    128
  Single precision floating point capability
    Denorms:                                     Yes
    Quiet NaNs:                                  Yes
    Round to nearest even:                       Yes
    Round to zero:                               Yes
    Round to +ve and infinity:                   Yes
    IEEE754-2008 fused multiply-add:             Yes
  Cache type:                                    Read/Write
  Cache line size:                               64
  Cache size:                                    16384
  Global memory size:                            536870912
  Constant buffer size:                          402653184
  Max number of constant args:                   8
  Local memory type:                             Scratchpad
  Local memory size:                             65536
  Max pipe arguments:                            16
  Max pipe active reservations:                  16
  Max pipe packet size:                          402653184
  Max global variable size:                      402653184
  Max global variable preferred total size:      536870912
  Max read/write image args:                     64
  Max on device events:                          1024
  Queue on device max size:                      8388608
  Max on device queues:                          1
  Queue on device preferred size:                262144
  SVM capabilities:                              
    Coarse grain buffer:                         Yes
    Fine grain buffer:                           Yes
    Fine grain system:                           No
    Atomics:                                     No
  Preferred platform atomic alignment:           0
  Preferred global atomic alignment:             0
  Preferred local atomic alignment:              0
  Kernel Preferred work group size multiple:     64
  Error correction support:                      0
  Unified memory for Host and Device:            0
  Profiling timer resolution:                    1
  Device endianess:                              Little
  Available:                                     Yes
  Compiler available:                            Yes
  Execution capabilities:                                
    Execute OpenCL kernels:                      Yes
    Execute native function:                     No
  Queue on Host properties:                              
    Out-of-Order:                                No
    Profiling :                                  Yes
  Queue on Device properties:                            
    Out-of-Order:                                Yes
    Profiling :                                  Yes
  Platform ID:                                   0x7f12861f1808
  Name:                                          gfx90c:xnack-
  Vendor:                                        Advanced Micro Devices, Inc.
  Device OpenCL C version:                       OpenCL C 2.0 
  Driver version:                                3602.0 (HSA1.1,LC)
  Profile:                                       FULL_PROFILE
  Version:                                       OpenCL 2.0 
  Extensions:                                    cl_khr_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_fp16 cl_khr_gl_sharing cl_amd_device_attribute_query cl_amd_media_ops cl_amd_media_ops2 cl_khr_image2d_from_buffer cl_khr_subgroups cl_khr_depth_images cl_amd_copy_buffer_p2p cl_amd_assembly_program 


  Platform Name:                                 Clover
Number of devices:                               1
  Device Type:                                   CL_DEVICE_TYPE_GPU
  Vendor ID:                                     1002h
  Max compute units:                             8
  Max work items dimensions:                     3
    Max work items[0]:                           256
    Max work items[1]:                           256
    Max work items[2]:                           256
  Max work group size:                           256
  Preferred vector width char:                   16
  Preferred vector width short:                  8
  Preferred vector width int:                    4
  Preferred vector width long:                   2
  Preferred vector width float:                  4
  Preferred vector width double:                 2
  Native vector width char:                      16
  Native vector width short:                     8
  Native vector width int:                       4
  Native vector width long:                      2
  Native vector width float:                     4
  Native vector width double:                    2
  Max clock frequency:                           2000Mhz
  Address bits:                                  64
  Max memory allocation:                         2006994944
  Image support:                                 No
  Max size of kernel argument:                   1024
  Alignment (bits) of base address:              32768
  Minimum alignment (bytes) for any datatype:    128
  Single precision floating point capability
    Denorms:                                     No
    Quiet NaNs:                                  Yes
    Round to nearest even:                       Yes
    Round to zero:                               No
    Round to +ve and infinity:                   No
    IEEE754-2008 fused multiply-add:             No
  Cache type:                                    None
  Cache line size:                               0
  Cache size:                                    0
  Global memory size:                            8027979776
  Constant buffer size:                          67108864
  Max number of constant args:                   16
  Local memory type:                             Scratchpad
  Local memory size:                             65536
ERROR: clBuildProgram(-11)

Additional Information

If this is not the correct place to raise this issue, please let me know.

[Issue]: Crash while compiling rocSPARSE

Problem Description

Compiling rocSPARSE crashes the compiler, rather than producing an error or succeeding.
fail.txt

Operating System

Arch linux, kernel 6.9.7-arch1-1

CPU

AMD Threadripper 1950X

GPU

AMD Radeon RX 7900 XTX

ROCm Version

ROCm 6.1.0

ROCm Component

rocSPARSE

Steps to Reproduce

After compiling all prerequisites, try doing the following (or something like it):

cd $BASEDIR 
[[ -n "${BASEDIR}" ]] &&  rm -rf "$BASEDIR/14_sparse"
mkdir -p 14_sparse
cd 14_sparse 

mkdir -p build 
DEST="$BASEDIR/14_sparse/build"

git clone https://github.com/ROCmSoftwarePlatform/rocSPARSE
cd rocSPARSE


cmake \
    -Wno-dev \
    -D CMAKE_BUILD_TYPE=Release \
    -D CMAKE_CXX_COMPILER=${ROCM_INSTALL_DIR}/bin/hipcc \
    -D CMAKE_CXX_FLAGS="${CXXFLAGS} -fcf-protection=none" \
    -D CMAKE_INSTALL_PREFIX=${ROCM_INSTALL_DIR} \
    -G Ninja \
    $BASEDIR/14_sparse/rocSPARSE
    
"${NINJA:=ninja}" $NUMJOBS
DESTDIR=$DEST "$NINJA" install

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

No response

Additional Information

No response

[Issue]: Atomic optimizer reorder causes memory access fault in Blender

Problem Description

Following https://projects.blender.org/blender/blender/issues/112084, I've bisected the rocm-6.0.x branch and found that commit 30a3adf caused any Blender render (using HIP, of course) to crash with a message along the lines of "Memory access fault by GPU node-1 (Agent handle: 0x7f1db8337e00) on address 0x7f1bf177e000. Reason: Page not present or supervisor privilege."

Operating System

Solus 4.5 Resilience

CPU

AMD Ryzen 7 5800H with Radeon Graphics

GPU

AMD Instinct MI250, AMD Radeon VII

ROCm Version

ROCm 6.0.0

ROCm Component

llvm-project

Steps to Reproduce

  1. Build this project at the commit mentioned.

  2. Download the Blender 4.1 release binaries: curl -O https://download.blender.org/release/Blender4.1/blender-4.1.0-linux-x64.tar.xz, tar xf blender-4.1.0-linux-x64.tar.xz. You should now have a folder blender-4.1.0-linux-x64.

  3. Clone Blender. Just cloning the v4.1.0 tag is enough: git clone https://projects.blender.org/blender/blender.git --depth 1 --branch v4.1.0.

  4. In the Blender repo, compile the HIP fatbin used to run Blender render: hipcc --offload-arch=$arch --genco intern/cycles/kernel/device/hip/kernel.cpp -D CCL_NAMESPACE_BEGIN= -D CCL_NAMESPACE_END= -D HIPCC -I intern/cycles/kernel/.. -I intern/cycles/kernel/device/hip -ffast-math -o kernel_$arch.fatbin. Adjust HIP_ROCCLR_HOME, HIP_CLANG_PATH as necessary to point to the Clang you just compiled. Replace $arch with the GPU architecture to run on, e.g. gfx900 or gfx1030. Don't add extra attributes like :xnack-.

    If you want to run on multiple architectures, repeat steps 4 and 5 for each architecture.

  5. Put this file into blender-4.1.0-linux-x64/4.1/scripts/addons/cycles/lib/kernel_$arch.fatbin.

  6. Get the BMW27 Blender demo file. curl -O https://download.blender.org/demo/test/BMW27.blend.zip, unzip BMW27.blend.zip. You should have a file BMW27.blend.

  7. Now run Blender render. blender-4.1.0-linux-x64/blender -b <path-to-BMW27.blend> -f 0 -- --cycles-device HIP. By default it runs on GPU with device ID 0, so adjust HIP_VISIBLE_DEVICES as necessary to run on the desired GPU.

    You should almost immediately see Blender crash with an error message similar to "Memory access fault by GPU node-1 (Agent handle: 0x7f1db8337e00) on address 0x7f1bf177e000. Reason: Page not present or supervisor privilege."

  8. Now, build LLVM at one commit prior, e.g. git switch --detach 30a3adf50e2d49dfc97c1b614d9b93638eba672d~1. Repeat steps 4-7, and Blender should render normally.

All of this is on ROCm 6.0.0. If you get a hang instead of a crash when running Blender (likely you're on an APU), Ctrl+C and run again with the environment variable HSA_ENABLE_SDMA=0.

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

rocminfo --support
ROCk module is loaded
=====================    
HSA System Attributes    
=====================    
Runtime Version:         1.1
System Timestamp Freq.:  1000.000000MHz
Sig. Max Wait Duration:  18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model:           LARGE                              
System Endianness:       LITTLE                             
Mwaitx:                  DISABLED
DMAbuf Support:          YES

==========               
HSA Agents               
==========               
*******                  
Agent 1                  
*******                  
  Name:                    AMD Ryzen 7 5800H with Radeon Graphics
  Uuid:                    CPU-XX                             
  Marketing Name:          AMD Ryzen 7 5800H with Radeon Graphics
  Vendor Name:             CPU                                
  Feature:                 None specified                     
  Profile:                 FULL_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        0(0x0)                             
  Queue Min Size:          0(0x0)                             
  Queue Max Size:          0(0x0)                             
  Queue Type:              MULTI                              
  Node:                    0                                  
  Device Type:             CPU                                
  Cache Info:              
    L1:                      32768(0x8000) KB                   
  Chip ID:                 0(0x0)                             
  ASIC Revision:           0(0x0)                             
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   2200                               
  BDFID:                   0                                  
  Internal Node ID:        0                                  
  Compute Unit:            16                                 
  SIMDs per CU:            0                                  
  Shader Engines:          0                                  
  Shader Arrs. per Eng.:   0                                  
  WatchPts on Addr. Ranges:1                                  
  Features:                None
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: FINE GRAINED        
      Size:                    61576816(0x3ab9670) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 2                   
      Segment:                 GLOBAL; FLAGS: KERNARG, FINE GRAINED
      Size:                    61576816(0x3ab9670) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 3                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    61576816(0x3ab9670) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
  ISA Info:                
*******                  
Agent 2                  
*******                  
  Name:                    gfx1032                            
  Uuid:                    GPU-XX                             
  Marketing Name:          AMD Radeon RX 6600M                
  Vendor Name:             AMD                                
  Feature:                 KERNEL_DISPATCH                    
  Profile:                 BASE_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        128(0x80)                          
  Queue Min Size:          64(0x40)                           
  Queue Max Size:          131072(0x20000)                    
  Queue Type:              MULTI                              
  Node:                    1                                  
  Device Type:             GPU                                
  Cache Info:              
    L1:                      16(0x10) KB                        
    L2:                      2048(0x800) KB                     
    L3:                      32768(0x8000) KB                   
  Chip ID:                 29695(0x73ff)                      
  ASIC Revision:           0(0x0)                             
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   2720                               
  BDFID:                   768                                
  Internal Node ID:        1                                  
  Compute Unit:            28                                 
  SIMDs per CU:            2                                  
  Shader Engines:          2                                  
  Shader Arrs. per Eng.:   2                                  
  WatchPts on Addr. Ranges:4                                  
  Coherent Host Access:    FALSE                              
  Features:                KERNEL_DISPATCH 
  Fast F16 Operation:      TRUE                               
  Wavefront Size:          32(0x20)                           
  Workgroup Max Size:      1024(0x400)                        
  Workgroup Max Size per Dimension:
    x                        1024(0x400)                        
    y                        1024(0x400)                        
    z                        1024(0x400)                        
  Max Waves Per CU:        32(0x20)                           
  Max Work-item Per CU:    1024(0x400)                        
  Grid Max Size:           4294967295(0xffffffff)             
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)             
    y                        4294967295(0xffffffff)             
    z                        4294967295(0xffffffff)             
  Max fbarriers/Workgrp:   32                                 
  Packet Processor uCode:: 116                                
  SDMA engine uCode::      76                                 
  IOMMU Support::          None                               
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    8372224(0x7fc000) KB               
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 2                   
      Segment:                 GLOBAL; FLAGS: EXTENDED FINE GRAINED
      Size:                    8372224(0x7fc000) KB               
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 3                   
      Segment:                 GROUP                              
      Size:                    64(0x40) KB                        
      Allocatable:             FALSE                              
      Alloc Granule:           0KB                                
      Alloc Alignment:         0KB                                
      Accessible by all:       FALSE                              
  ISA Info:                
    ISA 1                    
      Name:                    amdgcn-amd-amdhsa--gfx1032         
      Machine Models:          HSA_MACHINE_MODEL_LARGE            
      Profiles:                HSA_PROFILE_BASE                   
      Default Rounding Mode:   NEAR                               
      Default Rounding Mode:   NEAR                               
      Fast f16:                TRUE                               
      Workgroup Max Size:      1024(0x400)                        
      Workgroup Max Size per Dimension:
        x                        1024(0x400)                        
        y                        1024(0x400)                        
        z                        1024(0x400)                        
      Grid Max Size:           4294967295(0xffffffff)             
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)             
        y                        4294967295(0xffffffff)             
        z                        4294967295(0xffffffff)             
      FBarrier Max Size:       32                                 
*******                  
Agent 3                  
*******                  
  Name:                    gfx90c                             
  Uuid:                    GPU-XX                             
  Marketing Name:          AMD Radeon Graphics                
  Vendor Name:             AMD                                
  Feature:                 KERNEL_DISPATCH                    
  Profile:                 BASE_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        128(0x80)                          
  Queue Min Size:          64(0x40)                           
  Queue Max Size:          131072(0x20000)                    
  Queue Type:              MULTI                              
  Node:                    2                                  
  Device Type:             GPU                                
  Cache Info:              
    L1:                      16(0x10) KB                        
    L2:                      1024(0x400) KB                     
  Chip ID:                 5688(0x1638)                       
  ASIC Revision:           0(0x0)                             
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   2000                               
  BDFID:                   2048                               
  Internal Node ID:        2                                  
  Compute Unit:            8                                  
  SIMDs per CU:            4                                  
  Shader Engines:          1                                  
  Shader Arrs. per Eng.:   1                                  
  WatchPts on Addr. Ranges:4                                  
  Coherent Host Access:    FALSE                              
  Features:                KERNEL_DISPATCH 
  Fast F16 Operation:      TRUE                               
  Wavefront Size:          64(0x40)                           
  Workgroup Max Size:      1024(0x400)                        
  Workgroup Max Size per Dimension:
    x                        1024(0x400)                        
    y                        1024(0x400)                        
    z                        1024(0x400)                        
  Max Waves Per CU:        40(0x28)                           
  Max Work-item Per CU:    2560(0xa00)                        
  Grid Max Size:           4294967295(0xffffffff)             
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)             
    y                        4294967295(0xffffffff)             
    z                        4294967295(0xffffffff)             
  Max fbarriers/Workgrp:   32                                 
  Packet Processor uCode:: 471                                
  SDMA engine uCode::      40                                 
  IOMMU Support::          None                               
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    4194304(0x400000) KB               
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 2                   
      Segment:                 GLOBAL; FLAGS: EXTENDED FINE GRAINED
      Size:                    4194304(0x400000) KB               
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 3                   
      Segment:                 GROUP                              
      Size:                    64(0x40) KB                        
      Allocatable:             FALSE                              
      Alloc Granule:           0KB                                
      Alloc Alignment:         0KB                                
      Accessible by all:       FALSE                              
  ISA Info:                
    ISA 1                    
      Name:                    amdgcn-amd-amdhsa--gfx90c:xnack-   
      Machine Models:          HSA_MACHINE_MODEL_LARGE            
      Profiles:                HSA_PROFILE_BASE                   
      Default Rounding Mode:   NEAR                               
      Default Rounding Mode:   NEAR                               
      Fast f16:                TRUE                               
      Workgroup Max Size:      1024(0x400)                        
      Workgroup Max Size per Dimension:
        x                        1024(0x400)                        
        y                        1024(0x400)                        
        z                        1024(0x400)                        
      Grid Max Size:           4294967295(0xffffffff)             
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)             
        y                        4294967295(0xffffffff)             
        z                        4294967295(0xffffffff)             
      FBarrier Max Size:       32                                 
*** Done ***          

Additional Information

This behavior has been reproduced on MI250, RX6600M, Vega 10, and Ryzen 7 5800H. (gfx90a, gfx1032, gfx900, gfx90c, respectively)

Kernel version: 6.6.22-281.current, with torvalds/linux@96c211f reverted (ref: https://lists.freedesktop.org/archives/amd-gfx/2023-October/100298.html and ROCm/ROCm#2596 (comment))

[Issue]: ROCm-6.1 test failures with upstream clang-17

Problem Description

Four tests failed for the Gentoo rocm-device-libs-6.1.0 package with clang-17.0.6:

          6 - compile_frexp__gfx600 (Failed)
          7 - compile_fract__gfx600 (Failed)
         12 - compile_fract__gfx700 (Failed)
         17 - compile_fract__gfx803 (Failed)

Test log:
LastTest.log

The compilation output of the failing tests:
failed_assembly.tar.gz

Operating System

Gentoo Prefix on kernel 6.7.9

CPU

AMD Ryzen 7 7700 8-Core Processor

GPU

AMD Radeon RX 7900 XT

ROCm Version

ROCm 6.1.0

ROCm Component

ROCm-Device-Libs

Steps to Reproduce

## Setup ebuild repo, currently in my own branch
pushd /var/db/repos
rm -rf gentoo
git clone --depth 1 https://github.com/littlewu2508/gentoo.git -b rocm-runtime-6.1

## Testing setup
usermod -a -G render portage # add portage to the render group to access the GPU
mkdir -p /etc/portage/env/
echo 'FEATURES="test"' > /etc/portage/env/test.conf
echo 'dev-libs/rocm-device-libs test.conf' > /etc/portage/package.env

## Install deps and execute test
emerge -v "=dev-libs/rocm-device-libs-6.1.0"

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

No response

Additional Information

No response
