rocm / llvm-project

This project forked from llvm/llvm-project

This is the AMD-maintained fork of the LLVM git repository. This repository accepts pull requests and issues related to AMD fork-specific topics (amd/*). For all other issues/PRs, please submit upstream at https://github.com/llvm/llvm-project.

License: Other

llvm-project's Introduction

AMD Fork of The LLVM Compiler Infrastructure

The AMD fork aims to contain all of upstream LLVM, and also includes several AMD-specific additions in the llvm-project/amd directory:

amd/comgr - The Code Object Manager API, designed to simplify linking, compiling, and inspecting code objects (code owner: @lamb-j)
amd/device-libs - The sources and CMake build system for a set of AMD-specific device-side language runtime libraries (code owner: @b-sumner)
amd/hipcc - A compiler driver utility that wraps clang and passes the appropriate include and library options for the target compiler and HIP infrastructure (code owner: @david-salinas)

See the README files in respective subdirectories for more information on these AMD-specific projects. While the AMD fork aims to otherwise follow upstream as closely as possible, there are several outstanding differences.

OpenMP - The AMD fork contains several changes:
- Additional optimizations for OpenMP offload
- Host-exec services for printing on-device and doing malloc/free from device
- Improved support for OMPT, the OpenMP tools interface
- Driver improvements for multi-image and Target ID features
- OMPD support, implementing the OpenMP debugging (OMPD) interface
- ASAN support for OpenMP.
- MI300A Unified Shared Memory support
Heterogeneous Debugging - A prototype of debug-info supporting AMDGPU targets, affecting most parts of the compiler, is implemented as documented in docs/AMDGPULLVMExtensionsForHeterogeneousDebugging.rst but is an ongoing work-in-progress. Fundamental changes are expected as parts of the design are adapted for upstreaming.
Address Sanitizer - Changes were added to the sanitizer_common and asan libraries in compiler-rt to support AMD GPU address sanitizer error detection and reporting. These changes are intended to be upstreamed. The instrumentation pass changes have already been upstreamed.
Reverted Patches - For upstream patches that break internal testing, we may temporarily revert these patches until the testing issues are resolved. We maintain a list of reverted upstream patches in llvm-project/revert_patches.txt.


llvm-project's Issues

[Feature]: Better preprocessor macros to detect RDNA/CDNA family at compile time

Suggestion Description

As new instructions/features are added with each new arch, it is useful to know the target architecture at compile time to employ separate code paths. For example: FP64 MFMA was added in CDNA2, so CDNA2 and later can use one code path while CDNA1 uses a different code path.

It gets tedious because all the archs need to be enumerated, and code needs to be updated as new archs become available:

#if __gfx940__ || __gfx941__ || __gfx942__
// Code path for CDNA3
#elif __gfx90a__
// Code path for CDNA2
#elif __gfx908__
// Code path for CDNA1
#endif

It would be nice if we had something like:

#if CDNA_VERSION >= 3
// Code path for CDNA3 and later
#elif CDNA_VERSION >= 2
// Code path for CDNA2
#else
// Code path for CDNA1
#endif

This would mirror the way it is done in CUDA:

__device__ void func()
{
#if __CUDA_ARCH__ >= 800
  // Code path for compute capability 8.x and later
#elif __CUDA_ARCH__ >= 700
  // Code path for compute capability 7.x
#else
  // Code path for compute capability < 7.0
#endif
}
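Until something like this exists in the compiler, a project-local header could approximate it. The following is only a sketch: MY_CDNA_VERSION and cdna_code_path are hypothetical names, not macros or functions provided by the compiler, and the mapping covers only the gfx targets listed above, so it would still need updating for new architectures.

```cpp
#include <cassert>

// Hypothetical project-local macro; NOT defined by the compiler.
// Maps the per-target __gfx*__ macros listed above to one ordered version number.
#if defined(__gfx940__) || defined(__gfx941__) || defined(__gfx942__)
#define MY_CDNA_VERSION 3
#elif defined(__gfx90a__)
#define MY_CDNA_VERSION 2
#elif defined(__gfx908__)
#define MY_CDNA_VERSION 1
#else
#define MY_CDNA_VERSION 0  // host compilation pass or a non-CDNA target
#endif

// Illustrative helper: reports which code path was selected at compile time.
constexpr int cdna_code_path() {
#if MY_CDNA_VERSION >= 3
    return 3;  // code path for CDNA3 and later
#elif MY_CDNA_VERSION >= 2
    return 2;  // code path for CDNA2
#elif MY_CDNA_VERSION == 1
    return 1;  // code path for CDNA1
#else
    return 0;  // generic fallback
#endif
}
```

With this in one header, the enumeration of gfx targets lives in a single place and the rest of the code can use ordered comparisons.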

Operating System

No response

GPU

No response

ROCm Component

No response

Comgr calls exit

Comgr calls exit, which is not advised.
To quote Fedora's rpmlint tool:

This library package calls exit() or _exit(), probably in a non-fork()
context. Doing so from a library is strongly discouraged - when a library
function calls exit(), it prevents the calling program from handling the
error, reporting it to the user, closing files properly, and cleaning up any
state that the program has. It is preferred for the library to return an
actual error code and let the calling program decide how to handle the
situation.

Thanks!
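To illustrate the pattern rpmlint recommends, here is a minimal sketch of a library function that reports failure through its return value instead of calling exit(). LibStatus and lib_checked_copy are hypothetical names chosen for this example; they are not part of the Comgr API.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical status codes for an example library.
enum class LibStatus { Success, InvalidArgument };

// Copies src into dst (truncating if needed) and always NUL-terminates.
// On bad input it returns an error code so the caller decides what to do;
// calling exit() here would terminate the host program instead.
LibStatus lib_checked_copy(const char* src, char* dst, std::size_t dst_size) {
    if (src == nullptr || dst == nullptr || dst_size == 0)
        return LibStatus::InvalidArgument;
    std::size_t i = 0;
    for (; i + 1 < dst_size && src[i] != '\0'; ++i)
        dst[i] = src[i];
    dst[i] = '\0';
    return LibStatus::Success;
}
```

The calling program can then log the error, close its files, and clean up its own state before deciding whether to terminate.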

[Issue]: comgr.cpp compilation error

Problem Description

I'm trying to compile amd_comgr on Visual Studio 2022.
I'm on the latest amd-staging branch of https://github.com/ROCm/llvm-project
( 6dfd4ed )

I was able to build llvm and device-libs correctly.
But for amd_comgr, I get these compilation errors:

1>comgr.cpp
1>H:\PROJECTS\096_rocm\llvm\llvm-project\llvm\myBuild\tools\clang\include\clang/Driver/Options.inc(65,1): warning C4003: not enough arguments for function-like macro invocation 'LLVM_MAKE_OPT_ID'
1>H:\PROJECTS\096_rocm\llvm\llvm-project\llvm\myBuild\tools\clang\include\clang/Driver/Options.inc(66,1): warning C4003: not enough arguments for function-like macro invocation 'LLVM_MAKE_OPT_ID'
1>H:\PROJECTS\096_rocm\llvm\llvm-project\llvm\myBuild\tools\clang\include\clang/Driver/Options.inc(66,1): error C2365: 'clang::driver::options::OPT_': redefinition; previous definition was 'enumerator'
1>H:\PROJECTS\096_rocm\llvm\llvm-project\llvm\myBuild\tools\clang\include\clang/Driver/Options.inc(65,1): message : see declaration of 'clang::driver::options::OPT_'
1>H:\PROJECTS\096_rocm\llvm\llvm-project\llvm\myBuild\tools\clang\include\clang/Driver/Options.inc(67,1): warning C4003: not enough arguments for function-like macro invocation 'LLVM_MAKE_OPT_ID'
1>H:\PROJECTS\096_rocm\llvm\llvm-project\llvm\myBuild\tools\clang\include\clang/Driver/Options.inc(67,1): error C2365: 'clang::driver::options::OPT_': redefinition; previous definition was 'enumerator'
1>H:\PROJECTS\096_rocm\llvm\llvm-project\llvm\myBuild\tools\clang\include\clang/Driver/Options.inc(65,1): message : see declaration of 'clang::driver::options::OPT_'
........... lots of errors like that in Options.inc .......

Am I missing something?

Operating System

Windows 11

CPU

AMD Ryzen Threadripper PRO 5955WX 16-Cores

GPU

AMD Radeon Pro W7900

ROCm Version

ROCm 6.0.0

ROCm Component

No response

Steps to Reproduce

No response

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

No response

Additional Information

No response

[Feature]: Need to dump the GPU assembly code generated with Windows HIP SDK

Suggestion Description

It seems the HIP compiler on Linux can be controlled by the environment variable KMDUMPISA: when KMDUMPISA=1 is set, the compiler emits the assembly code of the GPU code object.

I tried setting the same KMDUMPISA environment variable on Windows, with HIP SDK 5.7.1, but no assembly code was generated.

I need to see the assembly code to find optimization opportunities in my code.

Thanks.

Operating System

Windows 10

GPU

RX 7900XT

ROCm Component

HIPCC

[Issue]: Atomic optimizer reorder causes memory access fault in Blender

Problem Description

Following https://projects.blender.org/blender/blender/issues/112084, I've bisected the rocm-6.0.x branch and found that commit 30a3adf causes any Blender render (using HIP, of course) to crash with a message along the lines of "Memory access fault by GPU node-1 (Agent handle: 0x7f1db8337e00) on address 0x7f1bf177e000. Reason: Page not present or supervisor privilege."

Operating System

Solus 4.5 Resilience

CPU

AMD Ryzen 7 5800H with Radeon Graphics

GPU

AMD Instinct MI250, AMD Radeon VII

ROCm Version

ROCm 6.0.0

ROCm Component

llvm-project

Steps to Reproduce

1. Build this project at the commit mentioned.
2. Download the Blender 4.1 release binaries: curl https://download.blender.org/release/Blender4.1/blender-4.1.0-linux-x64.tar.xz, tar xf blender-4.1.0-linux-x64.tar.xz. You should now have a folder blender-4.1.0-linux-x64.
3. Clone Blender. Just cloning the v4.1.0 tag is enough: git clone https://projects.blender.org/blender/blender.git --depth 1 --branch v4.1.0.
4. In the Blender repo, compile the HIP fatbin used to run Blender render: hipcc --offload-arch=$arch --genco intern/cycles/kernel/device/hip/kernel.cpp -D CCL_NAMESPACE_BEGIN= -D CCL_NAMESPACE_END= -D HIPCC -I intern/cycles/kernel/.. -I intern/cycles/kernel/device/hip -ffast-math -o kernel_$arch.fatbin. Adjust HIP_ROCCLR_HOME, HIP_CLANG_PATH as necessary to point to the Clang you just compiled. Replace $arch with the GPU architecture to run on, e.g. gfx900 or gfx1030. Don't add extra attributes like :xnack-.

   If you want to run on multiple architectures, repeat steps 4 and 5 for each architecture.
5. Put this file into blender-4.1.0-linux-x64/4.1/scripts/addons/cycles/lib/kernel_$arch.fatbin.
6. Get the BMW27 Blender demo file. curl https://download.blender.org/demo/test/BMW27.blend.zip, unzip BMW27.blend.zip. You should have a file BMW27.blend.
7. Now run Blender render. blender-4.1.0-linux-x64/blender -b <path-to-BMW27.blend> -f 0 -- --cycles-device HIP. By default it runs on GPU with device ID 0, so adjust HIP_VISIBLE_DEVICES as necessary to run on the desired GPU.

   You should almost immediately see Blender crash with an error message similar to "Memory access fault by GPU node-1 (Agent handle: 0x7f1db8337e00) on address 0x7f1bf177e000. Reason: Page not present or supervisor privilege."
8. Now, build LLVM at one commit prior, e.g. git switch --detach 30a3adf50e2d49dfc97c1b614d9b93638eba672d~1. Repeat steps 4-7, and Blender should render normally.

All of this is on ROCm 6.0.0. If you get a hang instead of a crash when running Blender (likely if you're on an APU), press Ctrl+C and run again with the environment variable HSA_ENABLE_SDMA=0.

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

rocminfo --support

ROCk module is loaded
=====================    
HSA System Attributes    
=====================    
Runtime Version:         1.1
System Timestamp Freq.:  1000.000000MHz
Sig. Max Wait Duration:  18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model:           LARGE                              
System Endianness:       LITTLE                             
Mwaitx:                  DISABLED
DMAbuf Support:          YES

==========               
HSA Agents               
==========               
*******                  
Agent 1                  
*******                  
  Name:                    AMD Ryzen 7 5800H with Radeon Graphics
  Uuid:                    CPU-XX                             
  Marketing Name:          AMD Ryzen 7 5800H with Radeon Graphics
  Vendor Name:             CPU                                
  Feature:                 None specified                     
  Profile:                 FULL_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        0(0x0)                             
  Queue Min Size:          0(0x0)                             
  Queue Max Size:          0(0x0)                             
  Queue Type:              MULTI                              
  Node:                    0                                  
  Device Type:             CPU                                
  Cache Info:              
    L1:                      32768(0x8000) KB                   
  Chip ID:                 0(0x0)                             
  ASIC Revision:           0(0x0)                             
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   2200                               
  BDFID:                   0                                  
  Internal Node ID:        0                                  
  Compute Unit:            16                                 
  SIMDs per CU:            0                                  
  Shader Engines:          0                                  
  Shader Arrs. per Eng.:   0                                  
  WatchPts on Addr. Ranges:1                                  
  Features:                None
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: FINE GRAINED        
      Size:                    61576816(0x3ab9670) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 2                   
      Segment:                 GLOBAL; FLAGS: KERNARG, FINE GRAINED
      Size:                    61576816(0x3ab9670) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 3                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    61576816(0x3ab9670) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
  ISA Info:                
*******                  
Agent 2                  
*******                  
  Name:                    gfx1032                            
  Uuid:                    GPU-XX                             
  Marketing Name:          AMD Radeon RX 6600M                
  Vendor Name:             AMD                                
  Feature:                 KERNEL_DISPATCH                    
  Profile:                 BASE_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        128(0x80)                          
  Queue Min Size:          64(0x40)                           
  Queue Max Size:          131072(0x20000)                    
  Queue Type:              MULTI                              
  Node:                    1                                  
  Device Type:             GPU                                
  Cache Info:              
    L1:                      16(0x10) KB                        
    L2:                      2048(0x800) KB                     
    L3:                      32768(0x8000) KB                   
  Chip ID:                 29695(0x73ff)                      
  ASIC Revision:           0(0x0)                             
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   2720                               
  BDFID:                   768                                
  Internal Node ID:        1                                  
  Compute Unit:            28                                 
  SIMDs per CU:            2                                  
  Shader Engines:          2                                  
  Shader Arrs. per Eng.:   2                                  
  WatchPts on Addr. Ranges:4                                  
  Coherent Host Access:    FALSE                              
  Features:                KERNEL_DISPATCH 
  Fast F16 Operation:      TRUE                               
  Wavefront Size:          32(0x20)                           
  Workgroup Max Size:      1024(0x400)                        
  Workgroup Max Size per Dimension:
    x                        1024(0x400)                        
    y                        1024(0x400)                        
    z                        1024(0x400)                        
  Max Waves Per CU:        32(0x20)                           
  Max Work-item Per CU:    1024(0x400)                        
  Grid Max Size:           4294967295(0xffffffff)             
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)             
    y                        4294967295(0xffffffff)             
    z                        4294967295(0xffffffff)             
  Max fbarriers/Workgrp:   32                                 
  Packet Processor uCode:: 116                                
  SDMA engine uCode::      76                                 
  IOMMU Support::          None                               
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    8372224(0x7fc000) KB               
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 2                   
      Segment:                 GLOBAL; FLAGS: EXTENDED FINE GRAINED
      Size:                    8372224(0x7fc000) KB               
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 3                   
      Segment:                 GROUP                              
      Size:                    64(0x40) KB                        
      Allocatable:             FALSE                              
      Alloc Granule:           0KB                                
      Alloc Alignment:         0KB                                
      Accessible by all:       FALSE                              
  ISA Info:                
    ISA 1                    
      Name:                    amdgcn-amd-amdhsa--gfx1032         
      Machine Models:          HSA_MACHINE_MODEL_LARGE            
      Profiles:                HSA_PROFILE_BASE                   
      Default Rounding Mode:   NEAR                               
      Default Rounding Mode:   NEAR                               
      Fast f16:                TRUE                               
      Workgroup Max Size:      1024(0x400)                        
      Workgroup Max Size per Dimension:
        x                        1024(0x400)                        
        y                        1024(0x400)                        
        z                        1024(0x400)                        
      Grid Max Size:           4294967295(0xffffffff)             
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)             
        y                        4294967295(0xffffffff)             
        z                        4294967295(0xffffffff)             
      FBarrier Max Size:       32                                 
*******                  
Agent 3                  
*******                  
  Name:                    gfx90c                             
  Uuid:                    GPU-XX                             
  Marketing Name:          AMD Radeon Graphics                
  Vendor Name:             AMD                                
  Feature:                 KERNEL_DISPATCH                    
  Profile:                 BASE_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        128(0x80)                          
  Queue Min Size:          64(0x40)                           
  Queue Max Size:          131072(0x20000)                    
  Queue Type:              MULTI                              
  Node:                    2                                  
  Device Type:             GPU                                
  Cache Info:              
    L1:                      16(0x10) KB                        
    L2:                      1024(0x400) KB                     
  Chip ID:                 5688(0x1638)                       
  ASIC Revision:           0(0x0)                             
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   2000                               
  BDFID:                   2048                               
  Internal Node ID:        2                                  
  Compute Unit:            8                                  
  SIMDs per CU:            4                                  
  Shader Engines:          1                                  
  Shader Arrs. per Eng.:   1                                  
  WatchPts on Addr. Ranges:4                                  
  Coherent Host Access:    FALSE                              
  Features:                KERNEL_DISPATCH 
  Fast F16 Operation:      TRUE                               
  Wavefront Size:          64(0x40)                           
  Workgroup Max Size:      1024(0x400)                        
  Workgroup Max Size per Dimension:
    x                        1024(0x400)                        
    y                        1024(0x400)                        
    z                        1024(0x400)                        
  Max Waves Per CU:        40(0x28)                           
  Max Work-item Per CU:    2560(0xa00)                        
  Grid Max Size:           4294967295(0xffffffff)             
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)             
    y                        4294967295(0xffffffff)             
    z                        4294967295(0xffffffff)             
  Max fbarriers/Workgrp:   32                                 
  Packet Processor uCode:: 471                                
  SDMA engine uCode::      40                                 
  IOMMU Support::          None                               
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    4194304(0x400000) KB               
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 2                   
      Segment:                 GLOBAL; FLAGS: EXTENDED FINE GRAINED
      Size:                    4194304(0x400000) KB               
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 3                   
      Segment:                 GROUP                              
      Size:                    64(0x40) KB                        
      Allocatable:             FALSE                              
      Alloc Granule:           0KB                                
      Alloc Alignment:         0KB                                
      Accessible by all:       FALSE                              
  ISA Info:                
    ISA 1                    
      Name:                    amdgcn-amd-amdhsa--gfx90c:xnack-   
      Machine Models:          HSA_MACHINE_MODEL_LARGE            
      Profiles:                HSA_PROFILE_BASE                   
      Default Rounding Mode:   NEAR                               
      Default Rounding Mode:   NEAR                               
      Fast f16:                TRUE                               
      Workgroup Max Size:      1024(0x400)                        
      Workgroup Max Size per Dimension:
        x                        1024(0x400)                        
        y                        1024(0x400)                        
        z                        1024(0x400)                        
      Grid Max Size:           4294967295(0xffffffff)             
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)             
        y                        4294967295(0xffffffff)             
        z                        4294967295(0xffffffff)             
      FBarrier Max Size:       32                                 
*** Done ***

Additional Information

This behavior has been reproduced on MI250, RX6600M, Vega 10, and Ryzen 7 5800H (gfx90a, gfx1032, gfx900, and gfx90c, respectively).

Kernel version: 6.6.22-281.current, with torvalds/linux@96c211f reverted (ref: https://lists.freedesktop.org/archives/amd-gfx/2023-October/100298.html and ROCm/ROCm#2596 (comment))

[Issue]: enabling `amdgpu-unsafe-fp-atomics` for gfx90a

Problem Description

Hi! I'm one of the developers of AMDGPU.jl library that provides support for AMD GPU programming in Julia.
To compile Julia GPU kernels we bypass HIP and use LLVM directly, hence -munsafe-fp-atomics is not available.

To enable hardware atomics, we instead add the amdgpu-unsafe-fp-atomics=true function attribute to our LLVM IR during compilation.
This works fine with gfx1100 devices, replacing the CAS loop with a hardware fadd.

However, for gfx90a devices this does nothing.
I was wondering if I'm missing something else that needs to be done?
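For readers unfamiliar with the term, the CAS loop mentioned above is the usual fallback when a native floating-point atomic add is not used. Below is a host-side C++ sketch of the idea using std::atomic. It is not the code LLVM generates for AMDGPU, and cas_fadd is a hypothetical name used only for illustration.

```cpp
#include <atomic>
#include <cassert>

// Emulates an atomic float add with a compare-and-swap loop.
// Returns the value stored in `target` before the addition.
float cas_fadd(std::atomic<float>& target, float value) {
    float expected = target.load(std::memory_order_relaxed);
    // On failure, compare_exchange_weak reloads `expected` with the current
    // value, so the sum is recomputed on every retry.
    while (!target.compare_exchange_weak(expected, expected + value,
                                         std::memory_order_seq_cst,
                                         std::memory_order_relaxed)) {
    }
    return expected;
}
```

A hardware fadd performs the same update in a single instruction, which is why the retry loop is worth avoiding under contention.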

Here's an example Julia kernel, which does atomic fadd on the first array item:

@kernel function ker!(x)
    @inbounds @atomic x[1] += 1f0
end

Here's its optimized LLVM IR with atomicrmw fadd float, which is the same for gfx1100 and gfx90a (note the amdgpu-unsafe-fp-atomics attribute):


; ModuleID = 'start'
source_filename = "start"
target datalayout = "e-p:64:64-p1:64:64-p2:32:32-p3:32:32-p4:64:64-p5:32:32-p6:32:32-i64:64-v16:16-v24:32-v32:32-v48:64-v96:128-v192:256-v256:256-v512:512-v1024:1024-v2048:2048-n32:64-S32-A5-G1-ni:10:11:12:13"
target triple = "amdgcn-amd-amdhsa"

; Function Attrs: nounwind readnone speculatable willreturn
declare i32 @llvm.amdgcn.workgroup.id.x() #0

; Function Attrs: nounwind readnone speculatable willreturn
declare i32 @llvm.amdgcn.workitem.id.x() #0

; Function Attrs: cold noreturn nounwind
declare void @llvm.amdgcn.endpgm() #1

;  @ none within `gpu_ker!`
define amdgpu_kernel void @_Z8gpu_ker_16CompilerMetadataI11DynamicSize12DynamicCheckv16CartesianIndicesILi1E5TupleI5OneToI5Int64EEE7NDRangeILi1ES0_S0_S2_ILi1ES3_IS4_IS5_EEES2_ILi1ES3_IS4_IS5_EEEEE14ROCDeviceArrayI7Float32Li1ELi1EE({ i64, i64, i64, i64, i64, i64, i32, i32, i64, i64, i64, i64 } %state, { [1 x [1 x [1 x i64]]], [2 x [1 x [1 x [1 x i64]]]] } %0, { [1 x i64], i8 addrspace(1)*, i64 } %1) local_unnamed_addr #2 !dbg !41 {
conversion:
  %.fca.0.0.0.0.extract = extractvalue { [1 x [1 x [1 x i64]]], [2 x [1 x [1 x [1 x i64]]]] } %0, 0, 0, 0, 0
  %.fca.1.1.0.0.0.extract = extractvalue { [1 x [1 x [1 x i64]]], [2 x [1 x [1 x [1 x i64]]]] } %0, 1, 1, 0, 0, 0
  %.fca.1.extract = extractvalue { [1 x i64], i8 addrspace(1)*, i64 } %1, 1
;  @ none within `gpu_ker!` @ /home/pxl-th/.julia/packages/KernelAbstractions/Zcyra/src/macros.jl:94
; ┌ @ /home/pxl-th/.julia/dev/AMDGPU/src/ROCKernels.jl:144 within `#__validindex`
; │┌ @ /home/pxl-th/.julia/dev/AMDGPU/src/device/gcn/indexing.jl:172 within `blockIdx`
; ││┌ @ /home/pxl-th/.julia/dev/AMDGPU/src/device/gcn/indexing.jl:95 within `blockIdx_x`
; │││┌ @ /home/pxl-th/.julia/dev/AMDGPU/src/device/gcn/indexing.jl:93 within `workgroupIdx_x`
; ││││┌ @ /home/pxl-th/.julia/dev/AMDGPU/src/device/gcn/indexing.jl:3 within `_index`
; │││││┌ @ /home/pxl-th/.julia/dev/AMDGPU/src/device/gcn/indexing.jl:3 within `macro expansion` @ /home/pxl-th/.julia/packages/LLVM/Q3CgR/src/interop/base.jl:38
        %2 = call i32 @llvm.amdgcn.workgroup.id.x(), !dbg !45, !range !66
; │└└└└└
; │┌ @ /home/pxl-th/.julia/dev/AMDGPU/src/device/gcn/indexing.jl:164 within `threadIdx`
; ││┌ @ /home/pxl-th/.julia/dev/AMDGPU/src/device/gcn/indexing.jl:89 within `threadIdx_x`
; │││┌ @ /home/pxl-th/.julia/dev/AMDGPU/src/device/gcn/indexing.jl:87 within `workitemIdx_x`
; ││││┌ @ /home/pxl-th/.julia/dev/AMDGPU/src/device/gcn/indexing.jl:3 within `_index`
; │││││┌ @ /home/pxl-th/.julia/dev/AMDGPU/src/device/gcn/indexing.jl:3 within `macro expansion` @ /home/pxl-th/.julia/packages/LLVM/Q3CgR/src/interop/base.jl:38
        %3 = call i32 @llvm.amdgcn.workitem.id.x(), !dbg !67, !range !76
; ││││└└
; ││││┌ @ int.jl:1068 within `+` @ int.jl:87
       %4 = add nuw nsw i32 %3, 1, !dbg !77
; │└└└└
; │┌ @ /home/pxl-th/.julia/packages/KernelAbstractions/Zcyra/src/nditeration.jl:84 within `expand`
; ││┌ @ abstractarray.jl:1291 within `getindex`
; │││┌ @ indices.jl:350 within `to_indices` @ indices.jl:354
; ││││┌ @ indices.jl:359 within `_to_indices1`
; │││││┌ @ indices.jl:277 within `to_index` @ indices.jl:292
; ││││││┌ @ number.jl:7 within `convert`
; │││││││┌ @ boot.jl:784 within `Int64`
; ││││││││┌ @ boot.jl:708 within `toInt64`
           %5 = zext i32 %4 to i64, !dbg !81
; ││└└└└└└└
; ││ @ /home/pxl-th/.julia/packages/KernelAbstractions/Zcyra/src/nditeration.jl:84 within `expand` @ /home/pxl-th/.julia/packages/KernelAbstractions/Zcyra/src/nditeration.jl:74
; ││┌ @ ntuple.jl:48 within `ntuple`
; │││┌ @ /home/pxl-th/.julia/packages/KernelAbstractions/Zcyra/src/nditeration.jl:78 within `#1`
; ││││┌ @ int.jl:86 within `-`
       %6 = zext i32 %2 to i64, !dbg !104
; ││││└
; ││││┌ @ int.jl:88 within `*`
       %7 = mul i64 %.fca.1.1.0.0.0.extract, %6, !dbg !112
; ││││└
; ││││┌ @ int.jl:87 within `+`
       %8 = add i64 %7, %5, !dbg !114
; │└└└└
; │ @ /home/pxl-th/.julia/dev/AMDGPU/src/ROCKernels.jl:145 within `#__validindex`
; │┌ @ multidimensional.jl:471 within `in`
; ││┌ @ tuple.jl:318 within `map`
; │││┌ @ range.jl:1439 within `in`
; ││││┌ @ int.jl:514 within `<=`
       %9 = icmp slt i64 %8, 1, !dbg !115
       %10 = icmp sgt i64 %8, %.fca.0.0.0.0.extract, !dbg !115
; └└└└└
  %11 = or i1 %9, %10, !dbg !62
  br i1 %11, label %L128, label %L104, !dbg !62

L104:                                             ; preds = %conversion
  %.fca.0.0.extract = extractvalue { [1 x i64], i8 addrspace(1)*, i64 } %1, 0, 0
;  @ none within `gpu_ker!` @ /home/pxl-th/.julia/packages/KernelAbstractions/Zcyra/src/macros.jl:95
; ┌ @ /home/pxl-th/.julia/dev/atomic.jl:6 within `macro expansion`
; │┌ @ /home/pxl-th/.julia/packages/Atomix/F9VIX/src/references.jl:95 within `getindex`
; ││┌ @ abstractarray.jl:702 within `checkbounds` @ abstractarray.jl:687
; │││┌ @ abstractarray.jl:763 within `checkindex`
; ││││┌ @ int.jl:513 within `<`
       %.not = icmp slt i64 %.fca.0.0.extract, 1, !dbg !127
; │││└└
; │││ @ abstractarray.jl:702 within `checkbounds`
     br i1 %.not, label %L115, label %L119, !dbg !133

L115:                                             ; preds = %L104
; │││┌ @ /home/pxl-th/.julia/dev/AMDGPU/src/device/quirks.jl:8 within `#throw_boundserror`
; ││││┌ @ /home/pxl-th/.julia/dev/AMDGPU/src/device/runtime.jl:113 within `signal_exception`
; │││││┌ @ /home/pxl-th/.julia/dev/AMDGPU/src/device/runtime.jl:11 within `exception_flag`
; ││││││┌ @ none within `kernel_state`
; │││││││┌ @ none within `macro expansion` @ /home/pxl-th/.julia/packages/LLVM/Q3CgR/src/interop/base.jl:38
          %state.i.fca.0.extract.i = extractvalue { i64, i64, i64, i64, i64, i64, i32, i32, i64, i64, i64, i64 } %state, 0, !dbg !141
; │││││└└└
; │││││┌ @ pointer.jl:146 within `unsafe_store!` @ pointer.jl:146
        %memcpy_refined_dst.i = inttoptr i64 %state.i.fca.0.extract.i to i32*, !dbg !156
        store i32 1, i32* %memcpy_refined_dst.i, align 1, !dbg !156
; │││││└
; │││││ @ /home/pxl-th/.julia/dev/AMDGPU/src/device/runtime.jl:115 within `signal_exception`
; │││││┌ @ /home/pxl-th/.julia/dev/AMDGPU/src/device/gcn/execution_control.jl:52 within `endpgm`
        call void @llvm.amdgcn.endpgm(), !dbg !160
; │││││└
; │││││ @ /home/pxl-th/.julia/dev/AMDGPU/src/device/runtime.jl:116 within `signal_exception`
       unreachable, !dbg !164

L119:                                             ; preds = %L104
; │└└└└
; │┌ @ /home/pxl-th/.julia/packages/Atomix/F9VIX/src/core.jl:33 within `modify!` @ /home/pxl-th/.julia/packages/UnsafeAtomicsLLVM/6HZfV/src/internal.jl:20
; ││┌ @ /home/pxl-th/.julia/packages/UnsafeAtomicsLLVM/6HZfV/src/atomics.jl:359 within `atomic_pointermodify`
; │││┌ @ /home/pxl-th/.julia/packages/UnsafeAtomicsLLVM/6HZfV/src/atomics.jl:255 within `llvm_atomic_op`
; ││││┌ @ /home/pxl-th/.julia/packages/UnsafeAtomicsLLVM/6HZfV/src/atomics.jl:255 within `macro expansion` @ /home/pxl-th/.julia/packages/LLVM/Q3CgR/src/interop/base.jl:38
       %12 = bitcast i8 addrspace(1)* %.fca.1.extract to float addrspace(1)*, !dbg !165
       %13 = atomicrmw fadd float addrspace(1)* %12, float 1.000000e+00 seq_cst, align 4, !dbg !165
; ││└└└
; ││ @ /home/pxl-th/.julia/packages/Atomix/F9VIX/src/core.jl:33 within `modify!`
    br label %L128, !dbg !176

L128:                                             ; preds = %L119, %conversion
; └└
;  @ none within `gpu_ker!` @ /home/pxl-th/.julia/packages/KernelAbstractions/Zcyra/src/macros.jl:97
  ret void, !dbg !179
}

attributes #0 = { nounwind readnone speculatable willreturn "amdgpu-unsafe-fp-atomics"="true"}
attributes #1 = { cold noreturn nounwind }
attributes #2 = { "amdgpu-unsafe-fp-atomics"="true"}

Here is the assembly output for gfx1100; note that global_atomic_add_f32 is present:

	.text
	.amdgcn_target "amdgcn-amd-amdhsa--gfx1100"
	.globl	_Z8gpu_ker_16CompilerMetadataI11DynamicSize12DynamicCheckv16CartesianIndicesILi1E5TupleI5OneToI5Int64EEE7NDRangeILi1ES0_S0_S2_ILi1ES3_IS4_IS5_EEES2_ILi1ES3_IS4_IS5_EEEEE14ROCDeviceArrayI7Float32Li1ELi1EE ; -- Begin function _Z8gpu_ker_16CompilerMetadataI11DynamicSize12DynamicCheckv16CartesianIndicesILi1E5TupleI5OneToI5Int64EEE7NDRangeILi1ES0_S0_S2_ILi1ES3_IS4_IS5_EEES2_ILi1ES3_IS4_IS5_EEEEE14ROCDeviceArrayI7Float32Li1ELi1EE
	.p2align	8
	.type	_Z8gpu_ker_16CompilerMetadataI11DynamicSize12DynamicCheckv16CartesianIndicesILi1E5TupleI5OneToI5Int64EEE7NDRangeILi1ES0_S0_S2_ILi1ES3_IS4_IS5_EEES2_ILi1ES3_IS4_IS5_EEEEE14ROCDeviceArrayI7Float32Li1ELi1EE,@function
_Z8gpu_ker_16CompilerMetadataI11DynamicSize12DynamicCheckv16CartesianIndicesILi1E5TupleI5OneToI5Int64EEE7NDRangeILi1ES0_S0_S2_ILi1ES3_IS4_IS5_EEES2_ILi1ES3_IS4_IS5_EEEEE14ROCDeviceArrayI7Float32Li1ELi1EE: ; @_Z8gpu_ker_16CompilerMetadataI11DynamicSize12DynamicCheckv16CartesianIndicesILi1E5TupleI5OneToI5Int64EEE7NDRangeILi1ES0_S0_S2_ILi1ES3_IS4_IS5_EEES2_ILi1ES3_IS4_IS5_EEEEE14ROCDeviceArrayI7Float32Li1ELi1EE
.Lfunc_begin0:
	.file	1 "." "none"
	.loc	1 0 0                           ; none:0:0
	.cfi_sections .debug_frame
	.cfi_startproc
; %bb.0:                                ; %conversion
	s_clause 0x1
	s_load_b64 s[2:3], s[0:1], 0x68
	s_load_b64 s[4:5], s[0:1], 0x58
.Ltmp0:
	.file	2 "." "boot.jl"
	.loc	2 708 0 prologue_end            ; boot.jl:708:0
	v_dual_mov_b32 v1, 0 :: v_dual_add_nc_u32 v0, 1, v0
.Ltmp1:
	.file	3 "." "int.jl"
	.loc	3 87 0                          ; int.jl:87:0
	s_waitcnt lgkmcnt(0)
	s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
	v_mad_u64_u32 v[2:3], null, s2, s15, v[0:1]
	v_mov_b32_e32 v0, v3
	s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
	v_mad_u64_u32 v[3:4], null, s3, s15, v[0:1]
.Ltmp2:
	.loc	3 514 0                         ; int.jl:514:0
	v_cmp_lt_i64_e32 vcc_lo, 0, v[2:3]
	v_cmp_ge_i64_e64 s2, s[4:5], v[2:3]
	s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(SALU_CYCLE_1)
	s_and_b32 s2, vcc_lo, s2
	s_and_saveexec_b32 s3, s2
	s_cbranch_execz .LBB0_4
.Ltmp3:
; %bb.1:                                ; %L104
	.loc	3 0 0 is_stmt 0                 ; int.jl:0:0
	s_load_b64 s[2:3], s[0:1], 0x70
.Ltmp4:
	.loc	3 513 0 is_stmt 1               ; int.jl:513:0
	s_waitcnt lgkmcnt(0)
	v_cmp_gt_i64_e64 s2, s[2:3], 0
	s_delay_alu instid0(VALU_DEP_1)
	s_and_b32 vcc_lo, exec_lo, s2
	s_mov_b32 s2, -1
	s_cbranch_vccz .LBB0_3
.Ltmp5:
; %bb.2:                                ; %L119
	.loc	3 0 0 is_stmt 0                 ; int.jl:0:0
	s_load_b64 s[2:3], s[0:1], 0x78
	v_dual_mov_b32 v0, 0 :: v_dual_mov_b32 v1, 1.0
.Ltmp6:
	.file	4 "." "/home/pxl-th/.julia/packages/LLVM/Q3CgR/src/interop/base.jl"
	.loc	4 38 0 is_stmt 1                ; /home/pxl-th/.julia/packages/LLVM/Q3CgR/src/interop/base.jl:38:0
	s_waitcnt vmcnt(0) lgkmcnt(0)
	s_waitcnt_vscnt null, 0x0
	global_atomic_add_f32 v0, v1, s[2:3]
	s_waitcnt_vscnt null, 0x0
	buffer_gl0_inv
	buffer_gl1_inv
	s_mov_b32 s2, 0
.Ltmp7:
.LBB0_3:                                ; %Flow
	.loc	4 0 0 is_stmt 0                 ; /home/pxl-th/.julia/packages/LLVM/Q3CgR/src/interop/base.jl:0:0
	s_delay_alu instid0(SALU_CYCLE_1)
	s_and_not1_b32 vcc_lo, exec_lo, s2
	s_cbranch_vccz .LBB0_5
.LBB0_4:                                ; %UnifiedReturnBlock
	s_endpgm
.LBB0_5:                                ; %L115
	s_load_b64 s[0:1], s[0:1], 0x0
	v_dual_mov_b32 v2, 0 :: v_dual_mov_b32 v3, 1
	s_waitcnt lgkmcnt(0)
	v_dual_mov_b32 v0, s0 :: v_dual_mov_b32 v1, s1
.Ltmp8:
	.file	5 "." "pointer.jl"
	.loc	5 146 0 is_stmt 1               ; pointer.jl:146:0
	s_clause 0x3
	flat_store_b8 v[0:1], v2 offset:3
	flat_store_b8 v[0:1], v2 offset:2
	flat_store_b8 v[0:1], v2 offset:1
	flat_store_b8 v[0:1], v3
.Ltmp9:
	.file	6 "." "/home/pxl-th/.julia/dev/AMDGPU/src/device/gcn/execution_control.jl"
	.loc	6 52 0                          ; /home/pxl-th/.julia/dev/AMDGPU/src/device/gcn/execution_control.jl:52:0
	s_endpgm
	; divergent unreachable
	s_sendmsg sendmsg(MSG_DEALLOC_VGPRS)
	s_endpgm

And here is the assembly for gfx90a; note the regular global_atomic_cmpswap loop instead:

       .text
        .amdgcn_target "amdgcn-amd-amdhsa--gfx90a:sramecc+"
        .globl  _Z8gpu_ker_16CompilerMetadataI11DynamicSize12DynamicCheckv16CartesianIndicesILi1E5TupleI5OneToI5Int64EEE7NDRangeILi1ES0_S0_S2_ILi1ES3_IS4_IS5_EEES2_ILi1ES3_IS4_IS5_EEEEE14ROCDeviceArrayI7Float32Li1ELi1EE ; -- Begin function _Z8gpu_ker_16CompilerMetadataI11DynamicSize12DynamicCheckv16CartesianIndicesILi1E5TupleI5OneToI5Int64EEE7NDRangeILi1ES0_S0_S2_ILi1ES3_IS4_IS5_EEES2_ILi1ES3_IS4_IS5_EEEEE14ROCDeviceArrayI7Float32Li1ELi1EE
        .p2align        8
        .type   _Z8gpu_ker_16CompilerMetadataI11DynamicSize12DynamicCheckv16CartesianIndicesILi1E5TupleI5OneToI5Int64EEE7NDRangeILi1ES0_S0_S2_ILi1ES3_IS4_IS5_EEES2_ILi1ES3_IS4_IS5_EEEEE14ROCDeviceArrayI7Float32Li1ELi1EE,@function
_Z8gpu_ker_16CompilerMetadataI11DynamicSize12DynamicCheckv16CartesianIndicesILi1E5TupleI5OneToI5Int64EEE7NDRangeILi1ES0_S0_S2_ILi1ES3_IS4_IS5_EEES2_ILi1ES3_IS4_IS5_EEEEE14ROCDeviceArrayI7Float32Li1ELi1EE: ; @_Z8gpu_ker_16CompilerMetadataI11DynamicSize12DynamicCheckv16CartesianIndicesILi1E5TupleI5OneToI5Int64EEE7NDRangeILi1ES0_S0_S2_ILi1ES3_IS4_IS5_EEES2_ILi1ES3_IS4_IS5_EEEEE14ROCDeviceArrayI7Float32Li1ELi1EE
.Lfunc_begin0:
        .file   1 "." "none"
        .loc    1 0 0                           ; none:0:0
        .cfi_sections .debug_frame
        .cfi_startproc
; %bb.0:                                ; %conversion
        s_load_dwordx2 s[0:1], s[4:5], 0x68
        s_load_dwordx2 s[2:3], s[4:5], 0x58
.Ltmp0:
        .file   2 "." "int.jl"
        .loc    2 87 0 prologue_end             ; int.jl:87:0
        v_add_u32_e32 v0, 1, v0
.Ltmp1:
        .file   3 "." "boot.jl"
        .loc    3 708 0                         ; boot.jl:708:0
        v_mov_b32_e32 v1, 0
.Ltmp2:
        .loc    2 87 0                          ; int.jl:87:0
        v_mov_b32_e32 v2, s6
        s_waitcnt lgkmcnt(0)
        s_mul_i32 s7, s1, s6
        v_mad_u64_u32 v[0:1], s[0:1], s0, v2, v[0:1]
        v_add_u32_e32 v1, s7, v1
.Ltmp3:
        .loc    2 514 0                         ; int.jl:514:0
        v_cmp_lt_i64_e32 vcc, 0, v[0:1]
        v_cmp_ge_i64_e64 s[0:1], s[2:3], v[0:1]
        s_and_b64 s[0:1], vcc, s[0:1]
        s_and_saveexec_b64 s[2:3], s[0:1]
        s_cbranch_execz .LBB0_6
.Ltmp4:
; %bb.1:                                ; %L104
        .loc    2 0 0 is_stmt 0                 ; int.jl:0:0
        s_load_dwordx2 s[2:3], s[4:5], 0x70
        s_mov_b64 s[0:1], 0
.Ltmp5:
        .loc    2 513 0 is_stmt 1               ; int.jl:513:0
        s_waitcnt lgkmcnt(0)
        v_cmp_gt_i64_e64 s[6:7], s[2:3], 0
        s_mov_b64 s[2:3], -1
        s_and_b64 vcc, exec, s[6:7]
        s_cbranch_vccz .LBB0_5
.Ltmp6:
; %bb.2:                                ; %L119
        .loc    2 0 0 is_stmt 0                 ; int.jl:0:0
        s_load_dwordx2 s[2:3], s[4:5], 0x78
        v_mov_b32_e32 v2, 0
.Ltmp7:
        .file   4 "." "/users/antonsmi/.julia/packages/LLVM/Q3CgR/src/interop/base.jl"
        .loc    4 38 0 is_stmt 1                ; /users/antonsmi/.julia/packages/LLVM/Q3CgR/src/interop/base.jl:38:0
        s_waitcnt lgkmcnt(0)
        s_load_dword s6, s[2:3], 0x0
        s_waitcnt lgkmcnt(0)
        v_mov_b32_e32 v1, s6
.LBB0_3:                                ; %atomicrmw.start
                                        ; =>This Inner Loop Header: Depth=1
        v_add_f32_e32 v0, 1.0, v1
        buffer_wbl2
        s_waitcnt vmcnt(0) lgkmcnt(0)
        global_atomic_cmpswap v0, v2, v[0:1], s[2:3] glc
        s_waitcnt vmcnt(0)
        buffer_invl2
        buffer_wbinvl1_vol
        v_cmp_eq_u32_e32 vcc, v0, v1
        s_or_b64 s[0:1], vcc, s[0:1]
        v_mov_b32_e32 v1, v0
        s_andn2_b64 exec, exec, s[0:1]
        s_cbranch_execnz .LBB0_3
.Ltmp8:
; %bb.4:                                ; %Flow
        .loc    4 0 0 is_stmt 0                 ; /users/antonsmi/.julia/packages/LLVM/Q3CgR/src/interop/base.jl:0:0
        s_or_b64 exec, exec, s[0:1]
        s_mov_b64 s[2:3], 0
.LBB0_5:                                ; %Flow4
        s_and_b64 vcc, exec, s[2:3]
        s_cbranch_vccnz .LBB0_7
.LBB0_6:                                ; %UnifiedReturnBlock
        s_endpgm
.LBB0_7:                                ; %L115
        s_load_dwordx2 s[0:1], s[4:5], 0x0
        v_mov_b32_e32 v2, 0
        v_mov_b32_e32 v3, 1
        s_waitcnt lgkmcnt(0)
        v_pk_mov_b32 v[0:1], s[0:1], s[0:1] op_sel:[0,1]
.Ltmp9:
        .file   5 "." "pointer.jl"
        .loc    5 146 0 is_stmt 1               ; pointer.jl:146:0
        flat_store_byte v[0:1], v2 offset:3
        flat_store_byte v[0:1], v2 offset:2
        flat_store_byte v[0:1], v2 offset:1
        flat_store_byte v[0:1], v3
.Ltmp10:
        .file   6 "." "/pfs/lustrep2/scratch/project_465000557/antonsmi/julia_depot/dev/AMDGPU/src/device/gcn/execution_control.jl"
        .loc    6 52 0                          ; /pfs/lustrep2/scratch/project_465000557/antonsmi/julia_depot/dev/AMDGPU/src/device/gcn/execution_control.jl:52:0
        s_endpgm
        ; divergent unreachable
        s_endpgm
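
For readers comparing the two listings: the gfx90a code above expands the atomicrmw fadd into a compare-and-swap retry loop (the atomicrmw.start block), while the gfx1100 code issues a single global_atomic_add_f32. The Python below is only a rough model of that retry loop, not what the compiler or AMDGPU.jl actually runs. Word32, cmpswap and atomic_fadd_via_cas are hypothetical names, and a lock stands in for hardware atomicity.

```python
import struct
import threading


def f32_to_bits(x):
    """Reinterpret a float32 value as its 32-bit integer pattern."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def bits_to_f32(bits):
    """Reinterpret a 32-bit integer pattern as a float32 value."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]


class Word32:
    """One 32-bit memory cell with an atomic compare-and-swap (hypothetical model)."""

    def __init__(self, bits=0):
        self._bits = bits
        self._lock = threading.Lock()  # stands in for hardware atomicity

    def load(self):
        with self._lock:
            return self._bits

    def cmpswap(self, expected, desired):
        """Store `desired` only if the cell holds `expected`; always return the old bits."""
        with self._lock:
            old = self._bits
            if old == expected:
                self._bits = desired
            return old


def atomic_fadd_via_cas(cell, value):
    """Add `value` to the float32 in `cell` with a CAS retry loop; return the old value."""
    expected = cell.load()                        # initial load (s_load_dword)
    while True:
        desired = f32_to_bits(bits_to_f32(expected) + value)  # v_add_f32
        old = cell.cmpswap(expected, desired)     # global_atomic_cmpswap ... glc
        if old == expected:                       # v_cmp_eq_u32: compare bit patterns
            return bits_to_f32(old)
        expected = old                            # another thread won; retry with its value
```

Success is decided by comparing 32-bit patterns, mirroring the v_cmp_eq_u32 in the listing, so the loop may retry many times under contention, which is the usual reason a native floating-point atomic add is preferable where the target offers one.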

Any help or advice is appreciated.
Thanks!

Operating System

Ubuntu 22.04.3 LTS (Jammy Jellyfish)

CPU

AMD Ryzen 7 5800X 8-Core Processor

GPU

AMD Instinct MI250X, AMD Radeon RX 7900 XTX

ROCm Version

ROCm 6.0.0, ROCm 5.6.0

LLVM ERROR: Unsupported calling convention for call for llvm/clang 14.0.5

I'm building ROCm-CompilerSupport (comgr) against llvm/clang 14.0.5. The build was successful, but several compile tests failed:

The following tests FAILED:
         12 - comgr_compile_test (Subprocess aborted)
         13 - comgr_compile_minimal_test (Subprocess aborted)
         16 - comgr_compile_device_libs_test (Subprocess aborted)
         17 - comgr_compile_source_with_device_libs_to_bc_test (Subprocess aborted)
Errors while running CTest

They all fail with

LLVM ERROR: Unsupported calling convention for call

I traced the failure back to https://github.com/llvm/llvm-project/blob/c12386ae247c0d46e1d513942e322e3a0510b126/llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp#L980, where CC is set to 91 (CallingConv::AMDGPU_KERNEL). If I build comgr with ROCm's LLVM instead, the value is 0 (CallingConv::C).

I continued tracing until I lost track of this value at https://github.com/llvm/llvm-project/blob/c12386ae247c0d46e1d513942e322e3a0510b126/llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp#L1136. The value was 0 twice and then 91, which triggers the unsupported calling convention error.
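
To make the two numbers in this report easier to read, here is a small lookup sketch. The numeric IDs are a subset of llvm/IR/CallingConv.h; the outcomes are taken only from what is described above (0 was fine, 91 aborted), and describe_call_cc is a hypothetical helper name, not an LLVM or Comgr API.

```python
# Subset of numeric calling convention IDs from llvm/IR/CallingConv.h.
CALLING_CONV_NAMES = {
    0: "C",
    8: "Fast",
    9: "Cold",
    91: "AMDGPU_KERNEL",
}

# Outcomes observed in this report only; nothing is claimed for other IDs.
OBSERVED_AT_CALL_SITE = {
    0: "no error reported",
    91: "LLVM ERROR: Unsupported calling convention for call",
}


def describe_call_cc(cc_id):
    """Return a one-line description of a calling convention ID seen at a call site."""
    name = CALLING_CONV_NAMES.get(cc_id, "unlisted")
    outcome = OBSERVED_AT_CALL_SITE.get(cc_id, "not observed in this report")
    return f"{name} ({cc_id}): {outcome}"


if __name__ == "__main__":
    for cc in (0, 0, 91):  # the sequence of values described in the trace above
        print(describe_call_cc(cc))
```

A value of 91 at that point suggests the IR contains a call site that uses the amdgpu_kernel convention, so disassembling linked.bc (for example with llvm-dis) and searching for such calls may help narrow it down.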

Also, running AMD_COMGR_SAVE_TEMPS=1 AMD_COMGR_REDIRECT_LOGS=stdout AMD_COMGR_EMIT_VERBOSE_LOGS=1 ./compile_minimal_test:

amd_comgr_do_action:
          ActionKind: AMD_COMGR_ACTION_COMPILE_SOURCE_TO_BC
             IsaName: amdgcn-amd-amdhsa--gfx803
             Options: "-mllvm" "-amdgpu-early-inline-all"
                Path:
            Language: AMD_COMGR_LANGUAGE_OPENCL_1_2
COMGR::executeInProcessDriver argv: clang "-cc1" "-triple" "amdgcn-amd-amdhsa" "-emit-llvm-bc" "-emit-llvm-uselists" "-clear-ast-before-backend" "-main-file-name" "source1.cl" "-mrelocation-model" "pic" "-pic-level" "1" "-fhalf-no-semantic-interposition" "-mframe-pointer=none" "-fdenormal-fp-math-f32=preserve-sign,preserve-sign" "-ffp-contract=on" "-fno-rounding-math" "-mconstructor-aliases" "-fvisibility" "hidden" "-fapply-global-visibility-to-externs" "-target-cpu" "gfx803" "-mllvm" "-treat-scalable-fixed-error-as-warning" "-debugger-tuning=gdb" "-resource-dir" "../../../../lib/clang/14.0.5" "-I" "/tmp/comgr-f8c1b9/include" "-isysroot" "/opt/gentoo" "-O3" "-std=cl1.2" "-fdebug-compilation-dir=/ext4-disk/build/portage/dev-libs/rocm-comgr-5.0.2-r1/work/rocm-comgr-5.0.2_build/test" "-ferror-limit" "19" "-fgnuc-version=4.2.1" "-fno-threadsafe-statics" "-fcolor-diagnostics" "-vectorize-loops" "-vectorize-slp" "-mllvm" "-amdgpu-early-inline-all" "-faddrsig" "-o" "/tmp/comgr-f8c1b9/output/source1.cl.bc" "-x" "cl" "/tmp/comgr-f8c1b9/input/source1.cl"
COMGR::executeInProcessDriver argv: clang "-cc1" "-triple" "amdgcn-amd-amdhsa" "-emit-llvm-bc" "-emit-llvm-uselists" "-clear-ast-before-backend" "-main-file-name" "source2.cl" "-mrelocation-model" "pic" "-pic-level" "1" "-fhalf-no-semantic-interposition" "-mframe-pointer=none" "-fdenormal-fp-math-f32=preserve-sign,preserve-sign" "-ffp-contract=on" "-fno-rounding-math" "-mconstructor-aliases" "-fvisibility" "hidden" "-fapply-global-visibility-to-externs" "-target-cpu" "gfx803" "-mllvm" "-treat-scalable-fixed-error-as-warning" "-debugger-tuning=gdb" "-resource-dir" "../../../../lib/clang/14.0.5" "-I" "/tmp/comgr-f8c1b9/include" "-isysroot" "/opt/gentoo" "-O3" "-std=cl1.2" "-fdebug-compilation-dir=/ext4-disk/build/portage/dev-libs/rocm-comgr-5.0.2-r1/work/rocm-comgr-5.0.2_build/test" "-ferror-limit" "19" "-fgnuc-version=4.2.1" "-fno-threadsafe-statics" "-fcolor-diagnostics" "-vectorize-loops" "-vectorize-slp" "-mllvm" "-amdgpu-early-inline-all" "-faddrsig" "-o" "/tmp/comgr-f8c1b9/output/source2.cl.bc" "-x" "cl" "/tmp/comgr-f8c1b9/input/source2.cl"
        ReturnStatus: AMD_COMGR_STATUS_SUCCESS

amd_comgr_do_action:
          ActionKind: AMD_COMGR_ACTION_LINK_BC_TO_BC
             IsaName: amdgcn-amd-amdhsa--gfx803
             Options: "-mllvm" "-amdgpu-early-inline-all"
                Path:
            Language: AMD_COMGR_LANGUAGE_OPENCL_1_2
        ReturnStatus: AMD_COMGR_STATUS_SUCCESS

amd_comgr_do_action:
          ActionKind: AMD_COMGR_ACTION_CODEGEN_BC_TO_RELOCATABLE
             IsaName: amdgcn-amd-amdhsa--gfx803
             Options: "-mllvm" "-amdgpu-early-inline-all"
                Path:
            Language: AMD_COMGR_LANGUAGE_OPENCL_1_2
COMGR::executeInProcessDriver argv: clang "-cc1" "-triple" "amdgcn-amd-amdhsa" "-emit-obj" "-mrelax-all" "--mrelax-relocations" "-clear-ast-before-backend" "-main-file-name" "linked.bc" "-mrelocation-model" "pic" "-pic-level" "1" "-fhalf-no-semantic-interposition" "-mframe-pointer=all" "-fdenormal-fp-math-f32=preserve-sign,preserve-sign" "-ffp-contract=on" "-fno-rounding-math" "-mconstructor-aliases" "-fvisibility" "hidden" "-fapply-global-visibility-to-externs" "-target-cpu" "gfx803" "-mllvm" "-treat-scalable-fixed-error-as-warning" "-debugger-tuning=gdb" "-resource-dir" "../../../../lib/clang/14.0.5" "-fdebug-compilation-dir=/ext4-disk/build/portage/dev-libs/rocm-comgr-5.0.2-r1/work/rocm-comgr-5.0.2_build/test" "-ferror-limit" "19" "-fgnuc-version=4.2.1" "-fcolor-diagnostics" "-mllvm" "-amdgpu-early-inline-all" "-faddrsig" "-o" "/tmp/comgr-da2c2e/output/linked.bc.o" "-x" "ir" "/tmp/comgr-da2c2e/input/linked.bc"
LLVM ERROR: Unsupported calling convention for call
zsh: abort      AMD_COMGR_SAVE_TEMPS=1 AMD_COMGR_REDIRECT_LOGS=stdout =1

Running the same command directly from the command line fails as well:

clang "-cc1" "-triple" "amdgcn-amd-amdhsa" "-emit-obj" "-mrelax-all" "--mrelax-relocations" "-clear-ast-before-backend" "-main-file-name" "linked.bc" "-mrelocation-model" "pic" "-pic-level" "1" "-fhalf-no-semantic-interposition" "-mframe-pointer=all" "-fdenormal-fp-math-f32=preserve-sign,preserve-sign" "-ffp-contract=on" "-fno-rounding-math" "-mconstructor-aliases" "-fvisibility" "hidden" "-fapply-global-visibility-to-externs" "-target-cpu" "gfx803" "-mllvm" "-treat-scalable-fixed-error-as-warning" "-debugger-tuning=gdb" "-resource-dir" "../../../../lib/clang/14.0.5" "-fdebug-compilation-dir=/ext4-disk/build/portage/dev-libs/rocm-comgr-5.0.2-r1/work/rocm-comgr-5.0.2_build/test" "-ferror-limit" "19" "-fgnuc-version=4.2.1" "-fcolor-diagnostics" "-mllvm" "-amdgpu-early-inline-all" "-faddrsig" "-o" "/tmp/comgr-da2c2e/output/linked.bc.o" "-x" "ir" "/tmp/comgr-da2c2e/input/linked.bc"
fatal error: error in backend: Unsupported calling convention for call

Link to LLVM failed

I'm trying to build comgr from source, with the following cmake config:

cmake -S . -B build \
-DCMAKE_BUILD_TYPE=Release \
-DAMDDeviceLibs_DIR=$HOME/opt/rocm/5.4.3/lib/cmake/AMDDeviceLibs \
-DLLD_DIR=$HOME/opt/llvm/15.0.7/lib/cmake/lld \
-DClang_DIR=$HOME/opt/llvm/15.0.7/lib/cmake/clang \
-DROCM_DIR=$HOME/opt/rocm/5.4.3/share/rocm/cmake \
-DCMAKE_INSTALL_PREFIX=$PWD/install

cmake --build build

The build fails at link time with undefined references to LLVM library symbols:

/usr/bin/ld: CMakeFiles/amd_comgr.dir/src/comgr-metadata.cpp.o: in function `COMGR::metadata::getMetadataRoot(COMGR::DataObject*, COMGR::DataMeta*)':
comgr-metadata.cpp:(.text+0x2b5): undefined reference to `llvm::object::object_category()'
/usr/bin/ld: comgr-metadata.cpp:(.text+0x2cf): undefined reference to `llvm::StringError::StringError(llvm::Twine const&, std::error_code)'
/usr/bin/ld: comgr-metadata.cpp:(.text+0x2f9): undefined reference to `llvm::object::object_category()'
/usr/bin/ld: comgr-metadata.cpp:(.text+0x313): undefined reference to `llvm::StringError::StringError(llvm::Twine const&, std::error_code)'
/usr/bin/ld: comgr-metadata.cpp:(.text+0x4ad): undefined reference to `llvm::object::object_category()'
/usr/bin/ld: comgr-metadata.cpp:(.text+0x4c2): undefined reference to `llvm::StringError::StringError(llvm::Twine const&, std::error_code)'
/usr/bin/ld: comgr-metadata.cpp:(.text+0x4ec): undefined reference to `llvm::object::object_category()'
/usr/bin/ld: comgr-metadata.cpp:(.text+0x501): undefined reference to `llvm::StringError::StringError(llvm::Twine const&, std::error_code)'
/usr/bin/ld: comgr-metadata.cpp:(.text+0x715): undefined reference to `llvm::object::object_category()'
/usr/bin/ld: comgr-metadata.cpp:(.text+0x72f): undefined reference to `llvm::StringError::StringError(llvm::Twine const&, std::error_code)'
/usr/bin/ld: comgr-metadata.cpp:(.text+0x759): undefined reference to `llvm::object::object_category()'
/usr/bin/ld: comgr-metadata.cpp:(.text+0x773): undefined reference to `llvm::StringError::StringError(llvm::Twine const&, std::error_code)'
/usr/bin/ld: comgr-metadata.cpp:(.text+0x8fd): undefined reference to `llvm::object::object_category()'
/usr/bin/ld: comgr-metadata.cpp:(.text+0x917): undefined reference to `llvm::StringError::StringError(llvm::Twine const&, std::error_code)'
/usr/bin/ld: comgr-metadata.cpp:(.text+0x941): undefined reference to `llvm::object::object_category()'
/usr/bin/ld: comgr-metadata.cpp:(.text+0x95b): undefined reference to `llvm::StringError::StringError(llvm::Twine const&, std::error_code)'
/usr/bin/ld: CMakeFiles/amd_comgr.dir/src/comgr-metadata.cpp.o: in function `COMGR::metadata::getELFObjectFileBase(COMGR::DataObject*)':

I'm using the latest AMD LLVM and device libs from the amd-stg-open branch. The full build log is available in build.log. Is there a way to resolve this? Thanks in advance.
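
Since the log repeats the same few symbols many times, a short script can condense it before comparing against what the LLVM libraries in use actually export. This is a minimal sketch that assumes GNU ld's message format as shown above; summarize_undefined is a hypothetical helper, and it does not diagnose the cause of the failure.

```python
import re
from collections import Counter

# GNU ld quotes the symbol between a backtick and an apostrophe.
UNDEF_RE = re.compile(r"undefined reference to `([^']+)'")


def summarize_undefined(log_text):
    """Count how often each undefined symbol appears in linker output."""
    counts = Counter()
    for symbol in UNDEF_RE.findall(log_text):
        # Terminal hard-wrapping can split a symbol across lines; rejoin it.
        counts[symbol.replace("\n", "")] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Usage: python summarize_undefined.py build.log
    with open(sys.argv[1], encoding="utf-8", errors="replace") as handle:
        for symbol, count in summarize_undefined(handle.read()).most_common():
            print(f"{count:5d}  {symbol}")
```

For the excerpt above this reduces the output to two distinct symbols, llvm::object::object_category() and the llvm::StringError constructor.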

Parallel compilation with Comgr

Currently Comgr includes mutex locks around compilation actions to avoid unexpected LLVM multi-threading side effects.

https://github.com/RadeonOpenCompute/ROCm-CompilerSupport/blob/amd-stg-open/lib/comgr/src/comgr.cpp#L1252

It would be nice to fix the LLVM issues that prevent multiple threads from maintaining independent LLVM instances. We could then remove these locks in Comgr to allow parallel compilation.
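
As a rough illustration of the current behaviour, the sketch below models a process-wide lock around each compilation action. It is written in Python for brevity and is not Comgr's actual code (Comgr is C++); run_parallel, ConcurrencyProbe and the fake "compile" step are hypothetical stand-ins.

```python
import threading
import time

_action_lock = threading.Lock()  # one process-wide lock, as described above


class ConcurrencyProbe:
    """Fake compilation action that records how many calls overlap in time."""

    def __init__(self):
        self.active = 0
        self.peak = 0
        self._guard = threading.Lock()

    def __call__(self, source):
        with self._guard:
            self.active += 1
            self.peak = max(self.peak, self.active)
        time.sleep(0.01)  # pretend to compile
        with self._guard:
            self.active -= 1
        return f"compiled:{source}"


def run_parallel(action, sources, serialize):
    """Run `action` on each source in its own thread, optionally under the global lock."""
    results = [None] * len(sources)

    def worker(index, source):
        if serialize:
            with _action_lock:  # only one action may run at a time
                results[index] = action(source)
        else:
            results[index] = action(source)

    threads = [
        threading.Thread(target=worker, args=(i, s)) for i, s in enumerate(sources)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results
```

With serialize=True the probe never sees more than one action in flight, so extra threads add no throughput; removing the lock only becomes safe once each thread can own fully independent LLVM state.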

[Issue]: build failure on archlinux

Problem Description

This occurred while attempting to compile HEAD. Any guidance is appreciated.

[920/7072] Building CXX object lib/Target/SystemZ/CMakeFiles/LLVMSystemZCodeGen.dir/SystemZISelLowering.cpp.o
FAILED: lib/Target/SystemZ/CMakeFiles/LLVMSystemZCodeGen.dir/SystemZISelLowering.cpp.o
/usr/bin/c++ -D_GNU_SOURCE -D__STDC_CONSTANT_MACROS -D__STDC_FORMAT_MACROS -D__STDC_LIMIT_MACROS -I/home/raijin/aur/llvm-rocm-git/src/_build/lib/Target/SystemZ -I/home/raijin/aur/llvm-rocm-git/src/llvm-project/llvm/lib/Target/SystemZ -I/home/raijin/aur/llvm-rocm-git/src/_build/include -I/home/raijin/aur/llvm-rocm-git/src/llvm-project/llvm/include -march=x86-64 -mtune=generic -O2 -pipe -fno-plt -fexceptions         -Wp,-D_FORTIFY_SOURCE=2 -Wformat -Werror=format-security         -fstack-clash-protection -fcf-protection -Wp,-D_GLIBCXX_ASSERTIONS -fPIC -fno-semantic-interposition -fvisibility-inlines-hidden -Werror=date-time -Wall -Wextra -Wno-unused-parameter -Wwrite-strings -Wcast-qual -Wno-missing-field-initializers -pedantic -Wno-long-long -Wimplicit-fallthrough -Wno-maybe-uninitialized -Wno-nonnull -Wno-class-memaccess -Wno-redundant-move -Wno-pessimizing-move -Wno-noexcept-type -Wdelete-non-virtual-dtor -Wsuggest-override -Wno-comment -Wno-misleading-indentation -Wctad-maybe-unsupported -fdiagnostics-color -ffunction-sections -fdata-sections -O3 -DNDEBUG -std=c++17 -fvisibility=hidden  -fno-exceptions -funwind-tables -MD -MT lib/Target/SystemZ/CMakeFiles/LLVMSystemZCodeGen.dir/SystemZISelLowering.cpp.o -MF lib/Target/SystemZ/CMakeFiles/LLVMSystemZCodeGen.dir/SystemZISelLowering.cpp.o.d -o lib/Target/SystemZ/CMakeFiles/LLVMSystemZCodeGen.dir/SystemZISelLowering.cpp.o -c /home/raijin/aur/llvm-rocm-git/src/llvm-project/llvm/lib/Target/SystemZ/SystemZISelLowering.cpp
/home/raijin/aur/llvm-rocm-git/src/llvm-project/llvm/lib/Target/SystemZ/SystemZISelLowering.cpp: In member function ‘llvm::SDValue llvm::SystemZTargetLowering::combineTruncateExtract(const llvm::SDLoc&, llvm::EVT, llvm::SDValue, llvm::TargetLowering::DAGCombinerInfo&) const’:
/home/raijin/aur/llvm-rocm-git/src/llvm-project/llvm/lib/Target/SystemZ/SystemZISelLowering.cpp:6590:7: error: ‘N0’ was not declared in this scope
 6590 |   if (N0.getOpcode() == ISD::XOR &&
      |       ^~
/home/raijin/aur/llvm-rocm-git/src/llvm-project/llvm/lib/Target/SystemZ/SystemZISelLowering.cpp:6595:9: error: ‘VT’ was not declared in this scope
 6595 |     if (VT.isScalarInteger() && VT.getSizeInBits() < X.getValueSizeInBits()) {
      |         ^~
/home/raijin/aur/llvm-rocm-git/src/llvm-project/llvm/lib/Target/SystemZ/SystemZISelLowering.cpp:6596:25: error: ‘DAG’ was not declared in this scope
 6596 |       KnownBits Known = DAG.computeKnownBits(X);

Operating System

OS: NAME="Arch Linux"

CPU

CPU: 8-core AMD Ryzen 9 4900HS with Radeon Graphics (-MT MCP-)

GPU

AMD Radeon VII

ROCm Version

ROCm 6.0.0

ROCm Component

ROCm-CompilerSupport

Steps to Reproduce

Attempt to build HEAD of this llvm-project repository.

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

ROCk module is loaded
=====================
HSA System Attributes
=====================
Runtime Version:         1.1
System Timestamp Freq.:  1000.000000MHz
Sig. Max Wait Duration:  18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model:           LARGE
System Endianness:       LITTLE
Mwaitx:                  DISABLED
DMAbuf Support:          YES

==========
HSA Agents
==========
*******
Agent 1
*******
  Name:                    AMD Ryzen 9 4900HS with Radeon Graphics
  Uuid:                    CPU-XX
  Marketing Name:          AMD Ryzen 9 4900HS with Radeon Graphics
  Vendor Name:             CPU
  Feature:                 None specified
  Profile:                 FULL_PROFILE
  Float Round Mode:        NEAR
  Max Queue Number:        0(0x0)
  Queue Min Size:          0(0x0)
  Queue Max Size:          0(0x0)
  Queue Type:              MULTI
  Node:                    0
  Device Type:             CPU
  Cache Info:
    L1:                      32768(0x8000) KB
  Chip ID:                 0(0x0)
  ASIC Revision:           0(0x0)
  Cacheline Size:          64(0x40)
  Max Clock Freq. (MHz):   3000
  BDFID:                   0
  Internal Node ID:        0
  Compute Unit:            16
  SIMDs per CU:            0
  Shader Engines:          0
  Shader Arrs. per Eng.:   0
  WatchPts on Addr. Ranges:1
  Features:                None
  Pool Info:
    Pool 1
      Segment:                 GLOBAL; FLAGS: FINE GRAINED
      Size:                    15778348(0xf0c22c) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Alignment:         4KB
      Accessible by all:       TRUE
    Pool 2
      Segment:                 GLOBAL; FLAGS: KERNARG, FINE GRAINED
      Size:                    15778348(0xf0c22c) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Alignment:         4KB
      Accessible by all:       TRUE
    Pool 3
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED
      Size:                    15778348(0xf0c22c) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Alignment:         4KB
      Accessible by all:       TRUE
  ISA Info:
*******
Agent 2
*******
  Name:                    gfx90c
  Uuid:                    GPU-XX
  Marketing Name:          AMD Radeon Graphics
  Vendor Name:             AMD
  Feature:                 KERNEL_DISPATCH
  Profile:                 BASE_PROFILE
  Float Round Mode:        NEAR
  Max Queue Number:        128(0x80)
  Queue Min Size:          64(0x40)
  Queue Max Size:          131072(0x20000)
  Queue Type:              MULTI
  Node:                    1
  Device Type:             GPU
  Cache Info:
    L1:                      16(0x10) KB
    L2:                      1024(0x400) KB
  Chip ID:                 5686(0x1636)
  ASIC Revision:           0(0x0)
  Cacheline Size:          64(0x40)
  Max Clock Freq. (MHz):   1750
  BDFID:                   1024
  Internal Node ID:        1
  Compute Unit:            8
  SIMDs per CU:            4
  Shader Engines:          1
  Shader Arrs. per Eng.:   1
  WatchPts on Addr. Ranges:4
  Coherent Host Access:    FALSE
  Features:                KERNEL_DISPATCH
  Fast F16 Operation:      TRUE
  Wavefront Size:          64(0x40)
  Workgroup Max Size:      1024(0x400)
  Workgroup Max Size per Dimension:
    x                        1024(0x400)
    y                        1024(0x400)
    z                        1024(0x400)
  Max Waves Per CU:        40(0x28)
  Max Work-item Per CU:    2560(0xa00)
  Grid Max Size:           4294967295(0xffffffff)
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)
    y                        4294967295(0xffffffff)
    z                        4294967295(0xffffffff)
  Max fbarriers/Workgrp:   32
  Packet Processor uCode:: 471
  SDMA engine uCode::      40
  IOMMU Support::          None
  Pool Info:
    Pool 1
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED
      Size:                    524288(0x80000) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Alignment:         4KB
      Accessible by all:       FALSE
    Pool 2
      Segment:                 GLOBAL; FLAGS: EXTENDED FINE GRAINED
      Size:                    524288(0x80000) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Alignment:         4KB
      Accessible by all:       FALSE
    Pool 3
      Segment:                 GROUP
      Size:                    64(0x40) KB
      Allocatable:             FALSE
      Alloc Granule:           0KB
      Alloc Alignment:         0KB
      Accessible by all:       FALSE
  ISA Info:
    ISA 1
      Name:                    amdgcn-amd-amdhsa--gfx90c:xnack-
      Machine Models:          HSA_MACHINE_MODEL_LARGE
      Profiles:                HSA_PROFILE_BASE
      Default Rounding Mode:   NEAR
      Default Rounding Mode:   NEAR
      Fast f16:                TRUE
      Workgroup Max Size:      1024(0x400)
      Workgroup Max Size per Dimension:
        x                        1024(0x400)
        y                        1024(0x400)
        z                        1024(0x400)
      Grid Max Size:           4294967295(0xffffffff)
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)
        y                        4294967295(0xffffffff)
        z                        4294967295(0xffffffff)
      FBarrier Max Size:       32
*** Done ***

Additional Information

No response
