
rocmsoftwareplatform / tensile


Stretching GPU performance for GEMMs and tensor contractions.

License: MIT License

CMake 1.10% C++ 30.21% Python 50.32% Shell 1.08% Awk 0.03% Makefile 0.09% Dockerfile 0.02% Groovy 0.46% TeX 1.42% Assembly 15.27%
gemm blas dnn neural-networks machine-learning tensors python opencl hip radeon auto-tuning amd gpu-computing gpu-acceleration gpu matrix-multiplication tensor-contraction assembly

tensile's Introduction

Tensile is a tool for creating benchmark-driven backend libraries for GEMMs, GEMM-like problems (such as batched GEMM), and general N-dimensional tensor contractions on a GPU. The Tensile library is mainly used as a backend library for rocBLAS. Tensile acts as the performance backbone for a wide variety of 'compute' applications running on AMD GPUs.

See Tensile Wiki for documentation.

tensile's People

Contributors

aazz44ss, aferoz21, alexbrownamd, amcamd, amdkila, amgddm, babakpst, bensander, cgmb, cgmillette, eidenyoshida, ellosel, guacamoleo, jeremyadamhart, jgoldsamd, jichangjichang, leekillough, nakajee, nielenventer, pfultz2, ramjana, rosenrodt, saadrahim, sdquiring, solaslin, tonyyhsieh, torrezuk, wbgilmartin, yoichiyoshida, zaliu


tensile's Issues

Basic build for gfx1010

I'm trying to build this library so that I can link it to rocBLAS and have a functioning gemm implementation in HIP (similarly to this PR) for gfx1010/gfx1012.

I saw a commit titled "Kernels now working on gfx1010." and I'm wondering what set of environment variables and compilers is needed to accomplish this.

I'm using Linux Mint 19.3 with ROCm 3.3.0 on an RX 5500 XT (gfx1012) and an RX 5700 XT (gfx1010). The host processor is a Ryzen 9 3900X.

I first tried:

python3 ../Tensile/bin/Tensile ../Tensile/Configs/rocblas_sgemm_asm_only.yaml ./

(Output attached as Tensile_default_output.TXT.zip.)

Most notably, I don't think that HCC supports gfx1010 and up, so I see a lot of compiler errors that look like

'['/opt/rocm/bin/hcc', '-x', 'assembler', '-target', 'amdgcn-amd-amdhsa', '-mno-code-object-v3', '-mcpu=gfx1010', '-mwavefrontsize64', '-c', '-o', '/home/mihir/Programs/TensileOcl/Tensile/build_hip/1_BenchmarkProblems/Cijk_Ailk_Bljk_SB_00/00_BF/sourceTmp/assembly/Cijk_Ailk_Bljk_SB_MT64x32x32_SE_AMAS0_EPS1_GRVW1_K1_PGR1_TT8_4_VW1_WG8_8_4_WGM1.o', '/home/mihir/Programs/TensileOcl/Tensile/build_hip/1_BenchmarkProblems/Cijk_Ailk_Bljk_SB_00/00_BF/sourceTmp/assembly/Cijk_Ailk_Bljk_SB_MT64x32x32_SE_AMAS0_EPS1_GRVW1_K1_PGR1_TT8_4_VW1_WG8_8_4_WGM1.s']' returned non-zero exit status 1.
/home/mihir/Programs/TensileOcl/Tensile/build_hip/1_BenchmarkProblems/Cijk_Ailk_Bljk_SB_00/00_BF/sourceTmp/assembly/Cijk_Ailk_Bljk_SB_MT32x32x32_SE_AMAS0_EPS0_GRVW1_K1_PGR0_TT4_4_VW1_WG8_8_4_WGM8.s:1461:1: /home/mihir/Programs/TensileOcl/Tensile/build_hip/1_BenchmarkProblems/Cijk_Ailk_Bljk_SB_00/00_BF/sourceTmp/assembly/Cijk_Ailk_Bljk_SB_MT32x64x32_SE_AMAS0_EPS1_GRVW1_K1_PGR1_TT4_8_VW1_WG8_8_4_WGM1.s:2346:1: error: instruction not supported on this GPU

Additionally, the generated run.sh script tries to set the graphics card clock to an invalid value.

+ /opt/rocm/bin/rocm-smi -d 0 --setfan 255 --setsclk 7
[sudo] password for mihir:          


========================ROCm System Management Interface========================
ERROR: GPU[0] 		: Unable to set clock level
ERROR: GPU[0]	: Max clock level is 2
GPU[0] 		: Successfully set fan control to 'manual'
GPU[0] 		: Successfully set fan speed to Level 255

I also tried setting --cxx-compiler to hipcc with $HIP_PLATFORM set to clang but CMake-generated flags include -hc which isn't recognized by the compiler. I could get a little further by editing TensileCreateLibrary.py to pass in -D__HIP_VDI__, manually editing the generated flags.make in the build files, and setting the benchmark config to not build a new client, but I inevitably run into one of the above issues.

Is there a recommended way to directly build a basic gemm kernel in pure HIP that can be used by rocBLAS without using the benchmarking driver program?

Tensile solution found, but kernel not found in any loaded module.

This issue popped up in the AUR package for rocBLAS here: rocm-arch/rocm-arch#312

When compiling rocBLAS 3.5.0 with the following cmake options:

        -Damd_comgr_DIR=/opt/rocm/lib/cmake/amd_comgr
        -DBUILD_WITH_TENSILE=ON
        -DTensile_TEST_LOCAL_PATH="..."
        -DTensile_COMPILER=hipcc
        -DTensile_ARCHITECTURE=all
        -DTensile_LOGIC=asm_full
        -DBUILD_CLIENTS_TESTS=OFF
        -DBUILD_CLIENTS_BENCHMARKS=OFF
        -DBUILD_CLIENTS_SAMPLES=OFF
        -DBUILD_TESTING=OFF

using the corresponding Tensile library, and with Vega 20 (gfx906) hardware available, I can't get a simple dgemm to work:

#include <cmath>
#include <iostream>
#include <vector>
#include <hip/hip_runtime.h>
#include "rocblas.h"

int main()
{
    rocblas_int N = 10;

    double alpha = 1.0;
    double beta = 0.0;

    // create a and b on the host
    std::vector<double> A_h(N * N, 1.0);
    std::vector<double> B_h(N * N, 1.0);
    std::vector<double> C_h(N * N, 0.0);

    double * A_d;
    double * B_d;
    double * C_d;

    rocblas_handle handle;
    rocblas_create_handle(&handle);

    // allocate memory on device
    hipMalloc(&A_d, N * N * sizeof(double));
    hipMalloc(&B_d, N * N * sizeof(double));
    hipMalloc(&C_d, N * N * sizeof(double));

    hipMemcpy(A_d, A_h.data(), sizeof(double) * N * N, hipMemcpyHostToDevice);
    hipMemcpy(B_d, B_h.data(), sizeof(double) * N * N, hipMemcpyHostToDevice);
    hipMemcpy(C_d, C_h.data(), sizeof(double) * N * N, hipMemcpyHostToDevice);

    rocblas_dgemm(handle, rocblas_operation_none, rocblas_operation_none, N, N, N, &alpha, A_d, N, B_d, N, &beta, C_d, N);

    // copy output from device memory to host memory
    hipMemcpy(C_h.data(), C_d, sizeof(double) * N * N, hipMemcpyDeviceToHost);

    // verify rocblas_dgemm result: every element of C should equal N
    auto err = 0.0;
    for (size_t i = 0; i < N * N; ++i)
        err += std::abs(C_h[i] - N);
    
    std::cout << "Error = " << err << '\n';

    hipFree(A_d);
    hipFree(B_d);
    hipFree(C_d);
    rocblas_destroy_handle(handle);
}

compiled with:

hipcc -std=c++14 -I /path/to/rocblas/include/ dgemm.cpp -L /path/to/rocblas/lib/ -lrocblas

It outputs:

Error: Tensile solution found, but Kernel Cijk_Ailk_Bljk_DB_MT32x64x8_SN_APM1_AF0EM1_AF1EM1_AMAS3_ASBE01_ASEM1_BL1_DTL0_EPS1_FL1_GRVW2_GSU1_ISA906_IU1_K1_KLA_LBSPPn1_LPA0_LPB0_LDL1_NLCA1_NLCB1_ONLL1_PBD0_PK0_PGR1_PLR1_RK0_SIA1_SU0_SUM0_SUS256_SNLL1_TT4_4_TLDS0_USFGRO0_VAW1_VS1_VW2_WG8_16_1_WGM1 not found in any loaded module. exception thown for { a_type: "f64_r", b_type: "f64_r", c_type: "f64_r", d_type: "f64_r", compute_type: "f64_r", transA: 'N', transB: 'N', M: 10, N: 10, K: 10, lda: 10, ldb: 10, ldc: 10, ldd: 10, beta: -12345.0, batch_count: 1, stride_a: 100, stride_b: 100, stride_c: 100, stride_d: 100 }

It seems this kernel does exist, though:

$ grep Cijk_Ailk_Bljk_DB_MT32x64x8_SN_APM1_AF0EM1_AF1EM1_AMAS3_ASBE01_ASEM1_BL1_DTL0_EPS1_FL1_GRVW2_GSU1_ISA906_IU1_K1_KLA_LBSPPn1_LPA0_LPB0_LDL1_NLCA1_NLCB1_ONLL1_PBD0_PK0_PGR1_PLR1_RK0_SIA1_SU0_SUM0_SUS256_SNLL1_TT4_4_TLDS0_USFGRO0_VAW1_VS1_VW2_WG8_16_1_WGM1 /path/to/rocblas/lib/  -r
Binary file library/TensileLibrary_gfx906.co matches
library/TensileLibrary.yaml:  name: Cijk_Ailk_Bljk_DB_MT32x64x8_SN_APM1_AF0EM1_AF1EM1_AMAS3_ASBE01_ASEM1_BL1_DTL0_EPS1_FL1_GRVW2_GSU1_ISA906_IU1_K1_KLA_LBSPPn1_LPA0_LPB0_LDL1_NLCA1_NLCB1_ONLL1_PBD0_PK0_PGR1_PLR1_RK0_SIA1_SU0_SUM0_SUS256_SNLL1_TT4_4_TLDS0_USFGRO0_VAW1_VS1_VW2_WG8_16_1_WGM1
library/TensileLibrary.yaml:  name: Cijk_Ailk_Bljk_DB_MT32x64x8_SN_APM1_AF0EM1_AF1EM1_AMAS3_ASBE01_ASEM1_BL1_DTL0_EPS1_FL1_GRVW2_GSU1_ISA906_IU1_K1_KLA_LBSPPn1_LPA0_LPB0_LDL1_NLCA1_NLCB1_ONLL1_PBD0_PK0_PGR1_PLR1_RK0_SIA1_SU0_SUM0_SUS256_SNLL1_TT4_4_TLDS0_USFGRO0_VAW1_VS1_VW2_WG8_16_1_WGM1

Any idea what is causing this problem?
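
The grep check above can be scripted when several library directories need to be compared. The following is a minimal sketch, not part of rocBLAS or Tensile: the function name `find_kernel_references` is hypothetical, and it reads each file fully into memory, which is an assumption that only suits moderately sized libraries. A match shows only that the name string is present in a file; it does not show that the runtime actually loaded that file.

```python
import os


def find_kernel_references(root, kernel_name):
    """Return sorted paths (relative to root) of files whose bytes contain kernel_name.

    Hypothetical helper: mirrors `grep -r` for a single fixed string.
    """
    needle = kernel_name.encode("utf-8")
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            path = os.path.join(dirpath, filename)
            try:
                with open(path, "rb") as handle:
                    # Binary read so that code objects (.co) and YAML are treated alike.
                    if needle in handle.read():
                        matches.append(os.path.relpath(path, root))
            except OSError:
                # Unreadable files are skipped rather than aborting the scan.
                continue
    return sorted(matches)
```

Calling `find_kernel_references("/path/to/rocblas/lib/", name)` with the kernel name from the error message would list the files that mention it, which can then be compared with the files the application actually opens at run time.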

Skinny Tensors

The Solution Selection Logic doesn't yet check whether a problem is skinny and choose a skinny solution for it.
Tests for skinniness exist from Google and others.
GEMV is probably also a skinny tensor contraction.

Thank you for your work. Can you teach me how to create a non-batched s, d, c, z GEMM library?

I would like to use Tensile to create GEMM code (like clBLAS) so that I have one library per device: Fiji, Polaris, and Vega. The goal is to use them in an environment where multiple devices are available and where amdgpu-pro, legacy, and ROCm are in use.

I can play with the package, and I can create libraries and tests, but only by using the Tensile.py code. I have an idea of how to create the profiles for the z and c GEMMs. I noticed that rocBLAS is built using Tensile. My understanding is that there will be multiple choices to use at run time for the same problem and for different sizes (and thus for different devices).

However, I have no clear understanding of how to build the library and then use it the way I used to use clBLAS. Please be patient with me, and let me know if you are interested in helping me.

I envision building three Tensile libraries for OpenCL: Fiji, Polaris, and Vega (and future devices), using a clBLAS-style interface with basically four GEMMs each. For every device there will be a queue/stream with a platform. The GEMM can take a parameter specifying the device, or the GEMM name can differ, so as to address the correct algorithm (architecture and problem sizes). Any of these GEMMs will be called in parallel, each on a different device.

Please, can you teach me how to create a library with one GEMM function, either at a low level, where I take care of the data movement using standard OpenCL (and soon share data among GPUs), or at a high level, where I specify the device?

Cheers
Paolo

New client does not honor size assertions of size N (AF1EM)

The new client does not check the free1 dimension (sizeN) of the problem size against the requirement of a particular solution.

Say I have a kernel with AssertFree1ElementMultiple: 8 and a problem size of (m, n, k) = (1024, 1023, 1024). The old client blocks the kernel from launching. The new client ignores the requirement and launches the kernel anyway. That leads to undefined behavior, where validation may or may not pass.
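
The check that the old client performs can be illustrated with a short sketch. This is not Tensile's client code: the function names and the dictionary layout are hypothetical, and the only rule assumed is that a solution with AssertFree1ElementMultiple: k may run only when sizeN is a multiple of k.

```python
def satisfies_free1_assertion(size_n, element_multiple):
    """True when sizeN is a whole multiple of the solution's AssertFree1ElementMultiple."""
    return size_n % element_multiple == 0


def runnable_solutions(problem, solutions):
    """Filter solutions by their free1 assertion.

    problem   -- (m, n, k) tuple, as in the issue text
    solutions -- dict mapping a solution label to its AssertFree1ElementMultiple
                 (hypothetical layout, for illustration only)
    """
    _m, n, _k = problem
    return [
        name
        for name, multiple in solutions.items()
        if satisfies_free1_assertion(n, multiple)
    ]


if __name__ == "__main__":
    candidates = {"AF1EM1": 1, "AF1EM8": 8}
    # 1023 is not a multiple of 8, so only the AF1EM1 solution should remain.
    print(runnable_solutions((1024, 1023, 1024), candidates))
```

With n = 1023 the AF1EM8 solution is filtered out, which corresponds to the DID_NOT_SATISFY_ASSERTS row reported by the old client.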

Old client (correct behavior)

Tensile Client Columns: GFlops (clock-normalized), GFlops (raw), SolName, KernelMs, Valid/Total, CoreMhz, MemMhz, TempC, FanSpeed, Idx/Total, TimeStamp
Initializing 399 MBytes................
Problem[0/1]: 1024, 1023, 1, 1024, 1024, 1024, 1024, 1023
  3021.327,   3021.327, Cijk_Ailk_Bjlk_HBH_MT32x256x16_MI32x32x4x2_SE_AF1EM1*,     0.710, PASSED: 33792/33792, 878, 1200, 38.000, 0, 0/2, 2020-04-08 00:48:36.338256, 
     0.000,      0.000, Cijk_Ailk_Bjlk_HBH_MT32x256x16_MI32x32x4x2_SE_AF1EM8 ,     0.710, DID_NOT_SATISFY_ASSERTS, 878, 1200, 37.000, 0, 1/2, 2020-04-08 00:48:36.340868, 

New client (wrong behavior)

run, problem-progress, solution-progress, operation, problem-sizes, solution, validation, time-us, gflops, empty, total-granularity, tiles-per-cu, num-cus, tile0-granularity, tile1-granularity, cu-granularity, wave-granularity, mem-read-bytes, mem-write-bytes, temp-edge, clock-sys, clock-soc, clock-mem, fan-rpm, hardware-samples, enqueue-time
0, 0/1, 0/1, Contraction_l_Ailk_Bjlk_Cijk_Dijk, "(1024,1023,1,1024)", Cijk_Ailk_Bjlk_HBH_MT32x256x16_MI32x32x4x2_SE_AF1EM1_K1, PASSED, 777.76, 2758, , 0.53, 1.07, 120, 1.00, 1.00, 0.53, 1.00, 77518848, 2095104, 37.00, 878.00, 750.00, 1200.00, 0, 1, 2020-04-08 00:47:29.225930
0, 0/1, 1/1, Contraction_l_Ailk_Bjlk_Cijk_Dijk, "(1024,1023,1,1024)", Cijk_Ailk_Bjlk_HBH_MT32x256x16_MI32x32x4x2_SE_AF1EM8_K1, PASSED, 625.76, 3428, , 0.53, 1.07, 120, 1.00, 1.00, 0.53, 1.00, 77518848, 2095104, 37.00, 878.00, 750.00, 1200.00, 0, 1, 2020-04-08 00:47:29.241186

Define skinny-ness

A tensor contraction needs a skinny tile to restore performance if the smaller of dim0/dim1 is less than 32 and the greater is greater than 1024 (or 4096?).
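
The proposed definition can be written as a predicate. This is only a sketch of the rule as stated in this issue: the name `is_skinny` and its parameter names are hypothetical, and the upper threshold is left configurable because the issue does not settle whether it should be 1024 or 4096.

```python
def is_skinny(dim0, dim1, small_limit=32, large_limit=1024):
    """Return True when the smaller free dimension is below small_limit
    and the larger free dimension is above large_limit."""
    smaller, larger = sorted((dim0, dim1))
    return smaller < small_limit and larger > large_limit


if __name__ == "__main__":
    print(is_skinny(16, 4096))                    # small side 16, large side 4096
    print(is_skinny(16, 2048, large_limit=4096))  # same shape class, stricter threshold
```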

Why is there "KernelLanguage: Source" in vega20*.yaml under rocBLAS/library/src/blas3/Tensile/Logic/asm_full?

What is the expected behavior

  • I thought vega20*.yaml was reserved for Assembly kernels only.

What actually happens

  • There are multiple Source kernels in the vega20 configuration, and I am confused as to why.

Environment

Hardware description
GPU Vega20
CPU AMD
Software version
ROCK v3.3
ROCR v3.3
HCC v3.3
Library v3.3

Cannot library-ize problem/solutions with multiple summation indices with different orders

The Solution Selection Logic currently doesn't try to match order of free indices (strideA+strideB of the index) or the order of summation indices. So, the library backend may not be able to support multiple different kinds of multi-summation problems.

If all problems only have one type of summation index order then it'll probably work since they'll all take the same path.

GPU memory out of bounds problem

Attempting to benchmark the attached file with the latest code produces a memory access fault during the first stage:


+ ./client --platform-idx 0 --device-idx 0 --init-alpha 2 --init-beta 0 --init-d 0 --init-c 3 --init-a 3 --init-b 3 --c-equal-d 1 --print-valids 0 --print-max 4 --num-benchmarks 1 --num-elements-to-validate 0 --num-enqueues-per-sync 1 --num-syncs-per-benchmark 1 --use-gpu-timer 1 --sleep-percent 200 --benchmark-solutions 0
################################################################################
# Device[0]: Vega 10 XT [Radeon RX Vega 64] (gfx900)
# Compute:   13352 GFlop/s (64 CUs @ 1630 MHz)
# Bandwidth: 483 GB/s (2048-bit @ 945 MHz)
################################################################################
Tensile Client Columns: GFlops (clock-normalized), GFlops (raw), SolName, KernelMs, Valid/Total, CoreMhz, MemMhz, TempC, FanSpeed, Idx/Total, TimeStamp
Initializing 362 MBytes................


Problem[0/1]: 512, 3584, 1, 32770, 512, 512, 512, 32770
Memory access fault by GPU node-1 (Agent handle: 0x10bdc80) on address 0x7fe4dce82000. Reason: Page not present or supervisor privilege.
/home/zen/devel/Tensile/build/1_BenchmarkProblems/Cijk_Ailk_Bljk_SB_00/00_Final/build/run.sh: line 12: 10554 Aborted                 (core dumped) ./client --platform-idx 0 --device-idx 0 --init-alpha 2 --init-beta 0 --init-d 0 --init-c 3 --init-a 3 --init-b 3 --c-equal-d 1 --print-valids 0 --print-max 4 --num-benchmarks 1 --num-elements-to-validate 0 --num-enqueues-per-sync 1 --num-syncs-per-benchmark 1 --use-gpu-timer 1 --sleep-percent 200 --benchmark-solutions 0
+ ERR1=134

It seems to be caused by an out-of-bounds access from inside the GPU kernel.

test.zip

LLVM version requirements on master branch

I've updated a PR in our CI to rocBLAS commit hash 45dce72, which has the latest master branch of Tensile. Some LLVM requirement is now preventing our build from completing. Did the required LLVM version change recently?

CMake Error at /usr/local/cget/build/tmp-76df881074ee43768f8c4d0236b39e32/build/virtualenv/lib/python3.5/site-packages/Tensile/Source/lib/CMakeLists.txt:48 (find_package):

  Could not find a configuration file for package "LLVM" that is compatible with requested version "7.0".

  The following configuration files were considered but not accepted:
    /usr/share/llvm-3.8/cmake/LLVMConfig.cmake, version: 3.8.0

-- Configuring incomplete, errors occurred!

Re-order work-groups

Try smaller macro-tile sizes with a z-order work-group ordering so that, hopefully, most global memory accesses go through the L2 cache. Doing so improves effective global memory bandwidth, allows a smaller macro-tile, and gives higher occupancy. The super-fast GEMM kernel uses a 32x32 macro-tile.
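
As an illustration of what a z-order mapping looks like, the sketch below converts a linear work-group index into 2D tile coordinates by de-interleaving its bits (a Morton curve). This is one common definition of z-order, not necessarily the mapping Tensile would use, and the function names are hypothetical.

```python
def compact_bits(value):
    """Gather the bits at even positions (0, 2, 4, ...) of value into a contiguous integer."""
    result = 0
    shift = 0
    while value:
        result |= (value & 1) << shift
        value >>= 2
        shift += 1
    return result


def z_order_to_tile(linear_id):
    """Map a linear work-group index to (tile_x, tile_y) along a Morton (z-order) curve."""
    tile_x = compact_bits(linear_id)       # even bits form the x coordinate
    tile_y = compact_bits(linear_id >> 1)  # odd bits form the y coordinate
    return tile_x, tile_y


if __name__ == "__main__":
    for work_group in range(8):
        print(work_group, z_order_to_tile(work_group))
```

Consecutive indices stay within small square blocks of tiles, so work-groups that are scheduled close together tend to read overlapping rows of A and columns of B, which is the cache reuse the issue is hoping for.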

'/opt/rocm/bin/hcc-config' path hardcoded in Tensile

When building Tensile in a Docker container we get the error below:
FileNotFoundError: [Errno 2] No such file or directory: '/opt/rocm/bin/hcc-config'

In a Docker container this does not work, since we are using a linked path rather than the system linked path.
Tensile could instead use /opt/rocm/hcc/bin/hcc-config, which is where the actual binary is installed.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/cget/build/tmp-30667e2aac3647f982bfccb12d2861d4/build/virtualenv/bin/TensileCreateLibrary", line 10, in <module>
    sys.exit(TensileCreateLibrary())
  File "/usr/local/cget/build/tmp-30667e2aac3647f982bfccb12d2861d4/build/virtualenv/lib/python3.5/site-packages/Tensile/TensileCreateLibrary.py", line 993, in TensileCreateLibrary
    kernelWriterSource, kernelWriterAssembly)
  File "/usr/local/cget/build/tmp-30667e2aac3647f982bfccb12d2861d4/build/virtualenv/lib/python3.5/site-packages/Tensile/TensileCreateLibrary.py", line 299, in writeSolutionsAndKernels
    codeObjectFiles += buildSourceCodeObjectFiles(kernelFiles, kernels + kernelsBetaOnly, outputPath)
  File "/usr/local/cget/build/tmp-30667e2aac3647f982bfccb12d2861d4/build/virtualenv/lib/python3.5/site-packages/Tensile/TensileCreateLibrary.py", line 138, in buildSourceCodeObjectFiles
    method=lambda x: x.starmap)
  File "/usr/local/cget/build/tmp-30667e2aac3647f982bfccb12d2861d4/build/virtualenv/lib/python3.5/site-packages/Tensile/Common.py", line 1133, in ParallelMap
    rv = mapFunc(function, objects)
  File "/usr/lib/python3.5/multiprocessing/pool.py", line 268, in starmap
    return self._map_async(func, iterable, starmapstar, chunksize).get()
  File "/usr/lib/python3.5/multiprocessing/pool.py", line 608, in get
    raise self._value

FileNotFoundError: [Errno 2] No such file or directory: '/opt/rocm/bin/hcc-config'

CMake Error at /usr/local/cget/build/tmp-30667e2aac3647f982bfccb12d2861d4/build/virtualenv/cmake/TensileConfig.cmake:87 (message):
  Error generating kernels
Call Stack (most recent call first):
  library/src/CMakeLists.txt:41 (TensileCreateLibrary)
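
One way to avoid depending on a single hardcoded location is to probe a list of candidate directories and then fall back to the PATH. The sketch below is not Tensile code: `find_tool` and its parameters are hypothetical, and the two directories in the usage example are simply the ones named in this issue.

```python
import os
import shutil


def find_tool(name, candidate_dirs, path_lookup=shutil.which):
    """Return the resolved path of an executable, or None if it cannot be found.

    candidate_dirs are tried in order; path_lookup (shutil.which by default)
    is consulted only when none of them contains the tool.
    """
    for directory in candidate_dirs:
        candidate = os.path.join(directory, name)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            # realpath follows symlinks, so a linked /opt/rocm/bin entry
            # resolves to the directory where the binary is installed.
            return os.path.realpath(candidate)
    found = path_lookup(name)
    return os.path.realpath(found) if found else None


if __name__ == "__main__":
    print(find_tool("hcc-config", ["/opt/rocm/bin", "/opt/rocm/hcc/bin"]))
```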

SGEMM benchmark config for Vega FE?

Hi,

I'm experimenting with SGEMM performance tuning on Vega FE and get around 5 GFLOP/s max with the 5760 benchmark config. I was wondering if there was a pointer to a current best config for Vega/Vega FE that I could use as a starting point that was closer to peak performance?

Thanks,

Eddie

Multi-dimensional tensor

Generic tensor support is broken, having to do with the index order. Try the 7D convolution case.

Tensile 4.8 and greater fail to build with Docker

When building the docker for MIOpen we are getting this error when building rocBLAS:

Process Process-1:
Traceback (most recent call last):
  File "/usr/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/usr/lib/python2.7/multiprocessing/process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/rocm/cget/build/tmp-0ad16ebbc1bd491da0f635bafc5f0437/build/virtualenv/lib/python2.7/site-packages//Tensile/TensileCreateLibrary.py", line 67, in processKernelSourceChunk
    results.append (processKernelSource(kernel, kernelWriterSource, kernelWriterAssembly)) # returns err, src, header, kernelName
  File "/opt/rocm/cget/build/tmp-0ad16ebbc1bd491da0f635bafc5f0437/build/virtualenv/lib/python2.7/site-packages//Tensile/TensileCreateLibrary.py", line 46, in processKernelSource
    (err, src) = kernelWriter.getSourceFileString(kernel)
  File "/opt/rocm/cget/build/tmp-0ad16ebbc1bd491da0f635bafc5f0437/build/virtualenv/lib/python2.7/site-packages/Tensile/KernelWriter.py", line 2141, in getSourceFileString
    (error, kb) = self.kernelBody( kernel, tensorParametersA, tensorParametersB)
  File "/opt/rocm/cget/build/tmp-0ad16ebbc1bd491da0f635bafc5f0437/build/virtualenv/lib/python2.7/site-packages/Tensile/KernelWriter.py", line 659, in kernelBody
    self.makeSchedule(kernel, tensorParametersA, tensorParametersB, localWriteEndIter)
  File "/opt/rocm/cget/build/tmp-0ad16ebbc1bd491da0f635bafc5f0437/build/virtualenv/lib/python2.7/site-packages/Tensile/KernelWriter.py", line 78, in makeSchedule
    maxVmcnt = globalParameters["AsmCaps"][currentIsa]["MaxVmcnt"]
KeyError: 'MaxVmcnt'

The error appears when we attempt to build for rocBLAS 2.1.0 and 2.2.0, but not 2.0.0 and earlier.

The Tensile version that does seem to build is the one pinned by rocBLAS 2.0.0:
set( tensile_tag v4.7.0 CACHE STRING "Tensile tag to download" )

Here is a link to our dockerfile:
https://github.com/AMDComputeLibraries/MLOpen/blob/develop/Dockerfile

The base install for our CI is ROCm 1.9.2.

We are not sure why this would be the case. What is MaxVmcnt?

How to use the parameter 'DirectToLds'

After reading the source code in 'Common.py' and 'KernelWriterAssembly.py', I wonder how 'DirectToLds' takes effect. It is initialized as False and does not change when I set 'DirectToLds = [ True ]', even though I am sure that 'GlobalVectorWidth = 1'.
Another question concerns 'KernelWriterAssembly.py': when 'PrefetchGlobalRead' is set, what happens if 'DirectToLds = True'?

Here is the code.
(The original post included a screenshot of the code here.)

Extremely long compilation times

I'm using an AMD node with over 128 CPUs, but building rocBLAS with Tensile is excruciatingly slow.

Builds are taking more than 2 hours and 15 minutes. The build runs almost completely single-threaded, and some individual clang processes run for 20 minutes straight. Two hours are spent compiling kernels.

What can I do to speed it up?

This is from the build logs:

################################################################################
# Tensile Create Library
# Detected local GPU with ISA: gfx906
# Asm caps for gfx803:SupportedISA=1 HasExplicitCO=0 HasExplicitNC=0 HasDirectToLds=1 HasAddLshl=0 HasSMulHi=0 HasCodeObjectV3=1 MaxVmcnt=15 SupportedSource=1
# Arch caps for gfx803:HasEccHalf=0 Waitcnt0Disabled=0 SeparateVscnt=0 CMPXWritesSGPR=1 HasWave32=0
# Asm caps for gfx900:SupportedISA=1 HasExplicitCO=1 HasExplicitNC=0 HasDirectToLds=1 HasAddLshl=1 HasSMulHi=1 HasCodeObjectV3=1 MaxVmcnt=63 SupportedSource=1
# Arch caps for gfx900:HasEccHalf=0 Waitcnt0Disabled=0 SeparateVscnt=0 CMPXWritesSGPR=1 HasWave32=0
# Asm caps for gfx906:SupportedISA=1 HasExplicitCO=1 HasExplicitNC=0 HasDirectToLds=1 HasAddLshl=1 HasSMulHi=1 HasCodeObjectV3=1 MaxVmcnt=63 SupportedSource=1
# Arch caps for gfx906:HasEccHalf=1 Waitcnt0Disabled=0 SeparateVscnt=0 CMPXWritesSGPR=1 HasWave32=0
# Asm caps for gfx908:SupportedISA=1 HasExplicitCO=1 HasExplicitNC=0 HasDirectToLds=1 HasAddLshl=1 HasSMulHi=1 HasCodeObjectV3=1 MaxVmcnt=63 SupportedSource=1
# Arch caps for gfx908:HasEccHalf=1 Waitcnt0Disabled=1 SeparateVscnt=0 CMPXWritesSGPR=1 HasWave32=0
# Asm caps for gfx1010:SupportedISA=1 HasExplicitCO=1 HasExplicitNC=1 HasDirectToLds=1 HasAddLshl=1 HasSMulHi=1 HasCodeObjectV3=1 MaxVmcnt=63 SupportedSource=0
# Arch caps for gfx1010:HasEccHalf=0 Waitcnt0Disabled=0 SeparateVscnt=1 CMPXWritesSGPR=0 HasWave32=1
# Asm caps for gfx000:SupportedISA=0 HasExplicitCO=0 HasExplicitNC=0 HasDirectToLds=0 HasAddLshl=0 HasSMulHi=0 HasCodeObjectV3=0 MaxVmcnt=0 SupportedSource=1
# Arch caps for gfx000:HasEccHalf=0 Waitcnt0Disabled=0 SeparateVscnt=0 CMPXWritesSGPR=1 HasWave32=0
Tensile::WARNING: Global parameter EmbedLibrary = None unrecognised.
# CodeObjectVersion from TensileCreateLibrary: V2
# CxxCompiler       from TensileCreateLibrary: hipcc
# Architecture      from TensileCreateLibrary: all

Generating kernels: Done.
*
Compiling source kernels: Launching 64 threads...
...
Compiling source kernels: Done.
# Kernel Building elapsed time = 7181.6 secs
# Tensile Library Writer DONE
################################################################################

-- TARGET: gfx803
-- TARGET: gfx900
-- TARGET: gfx906
-- TARGET: gfx908
-- Performing Test COMPILER_HAS_HIDDEN_VISIBILITY
-- Performing Test COMPILER_HAS_HIDDEN_VISIBILITY - Success
-- Performing Test COMPILER_HAS_HIDDEN_INLINE_VISIBILITY
-- Performing Test COMPILER_HAS_HIDDEN_INLINE_VISIBILITY - Success
-- Performing Test COMPILER_HAS_DEPRECATED_ATTR
-- Performing Test COMPILER_HAS_DEPRECATED_ATTR - Success
-- Configuring done
-- Generating done
-- Build files have been written to: /dev/shm/spack/hstoppel/spack-stage-rocblas-3.5.0-zhkw5kdvy5cvpzwsdd3rwkrmwdg3a3bi/spack-build
==> rocblas: Executing phase: 'build'
==> [2020-08-12-23:28:03.231710] 'make' '-j128'
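
When comparing build logs from several machines, the capability lines above can be turned into a dictionary for easier inspection. The sketch below assumes only the `# Asm caps for <isa>:KEY=VALUE ...` layout visible in this log; `parse_caps` is a hypothetical helper, not a Tensile function.

```python
import re

# Matches lines such as "# Asm caps for gfx803:SupportedISA=1 ... MaxVmcnt=15"
CAPS_LINE = re.compile(r"^# (Asm|Arch) caps for (gfx\d+):(.*)$")


def parse_caps(log_text):
    """Return {isa: {"Asm": {...}, "Arch": {...}}} with integer capability values."""
    caps = {}
    for line in log_text.splitlines():
        match = CAPS_LINE.match(line.strip())
        if not match:
            continue
        kind, isa, body = match.groups()
        entry = caps.setdefault(isa, {}).setdefault(kind, {})
        for pair in body.split():
            key, _, value = pair.partition("=")
            entry[key] = int(value)
    return caps


if __name__ == "__main__":
    sample = (
        "# Asm caps for gfx803:SupportedISA=1 HasCodeObjectV3=1 MaxVmcnt=15 SupportedSource=1\n"
        "# Arch caps for gfx1010:HasEccHalf=0 SeparateVscnt=1 HasWave32=1\n"
    )
    print(parse_caps(sample))
```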

Tuning for concurrent API calls

Tensile currently tunes itself assuming each API call should use the entire GPU for the fastest result.
We should add another API for the scenario where the user knows they will call multiple non-batchable GEMMs simultaneously (different precisions, different transposes, different sizes).

Validation failure with LDS padding + MI 32x32x2x1

Kernels generated with the 1-block version of the MI fp32 instruction do not work properly with LDS padding enabled. The issue was found on a fairly recent develop branch, at 92fc266.

63.015,     63.015, Cijk_Alik_Bljk_SB_MT32x128x32_MI32x32x2x1_SE_LPA0_LPB0*,     0.017, PASSED: 16384/16384, 1289, 1200, 31.000, 0, 0/4, 2020-03-23 15:21:36.688785, 
76.205,     76.205, Cijk_Alik_Bljk_SB_MT32x128x32_MI32x32x2x1_SE_LPA0_LPB1*,     0.014, PASSED: 16384/16384, 1289, 1200, 31.000, 0, 2/4, 2020-03-23 15:21:36.691804, 
0.000,      0.000, Cijk_Alik_Bljk_SB_MT32x128x32_MI32x32x2x1_SE_LPA1_LPB0 ,     0.017, FAILED: 1/16384, 1289, 1200, 31.000, 0, 1/4, 2020-03-23 15:21:36.690565, 
0.000,      0.000, Cijk_Alik_Bljk_SB_MT32x128x32_MI32x32x2x1_SE_LPA1_LPB1 ,     0.014, FAILED: 1/16384, 1289, 1200, 31.000, 0, 3/4, 2020-03-23 15:21:36.693013, 

Here is the config YAML to reproduce the error:

TestParameters:
  marks: [skip-gfx900, skip-gfx906, skip-gfx1010] # not supported by arch

GlobalParameters:
  NumElementsToValidate: 65536
  BoundsCheck: True
  KernelTime: True

BenchmarkProblems:
  ########################################
  # TN - standard
  ########################################
  -
    - # ProblemType
      OperationType: GEMM
      DataType: s
      TransposeA: True
      TransposeB: False
      UseBeta: True
      Batched: True

    - # BenchmarkProblemSizeGroup - Standard
      InitialSolutionParameters:
      BenchmarkCommonParameters:
        - KernelLanguage: ["Assembly"]
        - EdgeType: ["ShiftPtr"]
        - PrefetchLocalRead: [True]
      ForkParameters:
        - MatrixInstruction:
          - [32, 32, 2, 1]
        - PrefetchGlobalRead: [True]
        - ThreadTile:
          - [ 1, 32 ]
        - WorkGroup:
          - [ 32, 8, 1 ]
        - WorkGroupMapping: [8]
        - LdsPadA: [0, 1]
        - LdsPadB: [0, 1]
        - GlobalSplitU: [1]
        - DepthU: [32]
        - VectorWidth: [-1]
        - AssertSummationElementMultiple: [2]
        - AssertFree0ElementMultiple: [2]
      JoinParameters:
      BenchmarkJoinParameters:
      BenchmarkFinalParameters:
        - ProblemSizes:
          - Exact: [ 64, 256, 1, 32 ]
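
The four result rows above correspond to the four LdsPadA/LdsPadB combinations in this config. The sketch below assumes that fork parameters combine as a Cartesian product, which is consistent with those four rows but is not taken from Tensile's source; `expand_fork_parameters` is a hypothetical helper, and the order in which it lists combinations is not claimed to match Tensile's solution indices.

```python
import itertools


def expand_fork_parameters(fork_parameters):
    """Expand a list of (name, values) pairs into one dict per combination."""
    names = [name for name, _values in fork_parameters]
    value_lists = [values for _name, values in fork_parameters]
    return [dict(zip(names, combo)) for combo in itertools.product(*value_lists)]


if __name__ == "__main__":
    # Subset of the ForkParameters from the config above.
    fork = [
        ("WorkGroupMapping", [8]),
        ("LdsPadA", [0, 1]),
        ("LdsPadB", [0, 1]),
        ("DepthU", [32]),
    ]
    for solution in expand_fork_parameters(fork):
        print(solution)
```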

Higher Accumulation Precision

In order to fully support mixed precision, the kernel generator needs to define a MAD() instruction for real and complex types (and probably a few more things).

Solution Selection Logic doesn't support tolerance

SSL does include a tolerance, which should simplify the logic and reduce how many solutions are needed, but it doesn't use the tolerance for detecting conflicts and, more importantly, it isn't known how to support the tolerance when merging two rules that technically conflict but whose conflict is outside the threshold.

Cannot library-ize -O4 kernels

-O4 kernels optimize all kernel arguments into pre-processor definitions. These can be benchmarked but they cannot be library-ized for two reasons:

  1. The kernel namer can't yet name kernels with all sizes in the kernel name.
  2. The Kernel Selection Logic can't account for exactly matching all sizes; it only checks that multiples match.

ValueError: Old is good (1.82645) and new is bad (-1.0).

I am tuning some int8 configs and get the following error from Tensile:

Traceback (most recent call last):
  File "/usr/local/bin/Tensile", line 11, in <module>
    load_entry_point('Tensile==4.11.0', 'console_scripts', 'Tensile')()
  File "/usr/local/lib/python3.6/dist-packages/Tensile/Tensile.py", line 224, in main
    Tensile(sys.argv[1:])
  File "/usr/local/lib/python3.6/dist-packages/Tensile/Tensile.py", line 181, in Tensile
    executeStepsInConfig( config )
  File "/usr/local/lib/python3.6/dist-packages/Tensile/Tensile.py", line 51, in executeStepsInConfig
    BenchmarkProblems.main( config["BenchmarkProblems"] )
  File "/usr/local/lib/python3.6/dist-packages/Tensile/BenchmarkProblems.py", line 796, in main
    problemSizeGroupConfig, problemSizeGroupIdx)
  File "/usr/local/lib/python3.6/dist-packages/Tensile/BenchmarkProblems.py", line 364, in benchmarkProblemType
    results = getResults(resultsFileName, solutions, enableTileSelection, newResultsFileName)
  File "/usr/local/lib/python3.6/dist-packages/Tensile/BenchmarkProblems.py", line 461, in getResults
    diffCSV.writerow([compareResults(old,new,name) for old,new,name in itertools.zip_longest(row, newRow, headerRow)])
  File "/usr/local/lib/python3.6/dist-packages/Tensile/BenchmarkProblems.py", line 461, in <listcomp>
    diffCSV.writerow([compareResults(old,new,name) for old,new,name in itertools.zip_longest(row, newRow, headerRow)])
  File "/usr/local/lib/python3.6/dist-packages/Tensile/BenchmarkProblems.py", line 406, in compareResults
    raise ValueError("Old is good ({}) and new is bad ({}). Name: {}".format(old, new, name))
ValueError: Old is good (1.82645) and new is bad (-1.0). Name:  Cijk_Ailk_Bljk_4xi8B_MT32x32x8_SE_EPS0_FL0_GRVW1_GSU8_LPB0_PGR0_PLR0_TT2_2_USFGRO1_VW2_WG16_16_1_WGM1

Here is the yaml file I am using for benchmarking:

GlobalParameters:
  CMakeBuildType: Release
  DataInitTypeAB: 0
  DataInitTypeBeta: 0
  Device: 0
  EnqueuesPerSync: 1
  ForceRedoBenchmarkProblems: true
  ForceRedoLibraryClient: true
  ForceRedoLibraryLogic: true
  KernelTime: true
  LibraryPrintDebug: false
  MergeFiles: true
  MinimumRequiredVersion: 4.2.0
  NumElementsToValidate: 0
  PinClocks: false
  Platform: 0
  PrintWinnersOnly: 1
  ShortNames: false
  SleepPercent: 50
  SolutionSelectionAlg: 1
  SyncsPerBenchmark: 1
  ValidationMaxToPrint: 4
  ValidationPrintValids: false
BenchmarkProblems:
- - {Batched: true, DataType: 4xi8, DestDataType: I, OperationType: GEMM, TransposeA: false, TransposeB: false,
    UseBeta: true}
  - BenchmarkCommonParameters:
    - EdgeType: [ShiftPtr]
    - KernelLanguage: [Assembly]
    - LoopTail: [true]
    BenchmarkFinalParameters:
    - ProblemSizes:
      - Exact: [2, 1, 1, 768]
      - Exact: [768, 1, 1, 768]
    BenchmarkForkParameters: null
    BenchmarkJoinParameters: null
    ForkParameters:
    - WorkGroupMapping: [1, 8]
    - DepthU: [8, 16, 32]
    - VectorWidth: [2, 4]
    - GlobalSplitU: [1, 4]
    - GlobalReadVectorWidth: [1, 2, 4]
    - FractionalLoad: [0, 1]
    - PrefetchGlobalRead: [false, true]
    - PrefetchLocalRead: [false, true]
    - LdsPadA: [0, -1]
    - LdsPadB: [0, -1]
    - ThreadTile:
      - [2, 2]
      - [4, 2]
      - [2, 4]
    - WorkGroup:
      - [16, 16, 1]
      - [8, 16, 2]
      - [16, 8, 2]
      - [4, 16, 4]
      - [16, 4, 4]
      - [32, 8, 4]
      - [8, 32, 4]
    InitialSolutionParameters: null
    JoinParameters: null
  - BenchmarkCommonParameters:
    - EdgeType: [ShiftPtr]
    - KernelLanguage: [Assembly]
    - LoopTail: [true]
    - GlobalSplitU: [1]
    BenchmarkFinalParameters:
    - ProblemSizes:
      - Exact: [3072, 128, 1, 768]
    BenchmarkForkParameters: null
    BenchmarkJoinParameters: null
    ForkParameters:
    - WorkGroupMapping: [1, 8]
    - DepthU: [8, 16, 32]
    - VectorWidth: [2, 4]
    - GlobalReadVectorWidth: [1, 2, 4]
    - FractionalLoad: [0, 1]
    - PrefetchGlobalRead: [false, true]
    - PrefetchLocalRead: [false, true]
    - ThreadTile:
      - [4, 4]
      - [6, 4]
      - [4, 6]
      - [4, 8]
      - [8, 4]
      - [8, 8]
    - WorkGroup:
      - [16, 16, 1]
    InitialSolutionParameters: null
    JoinParameters: null
  - BenchmarkCommonParameters:
    - EdgeType: [ShiftPtr]
    - KernelLanguage: [Assembly]
    - LoopTail: [true]
    BenchmarkFinalParameters:
    - ProblemSizes:
      - Exact: [768, 128, 1, 3072]
      - Exact: [768, 128, 1, 768]
    BenchmarkForkParameters: null
    BenchmarkJoinParameters: null
    ForkParameters:
    - WorkGroupMapping: [1, 8]
    - DepthU: [8, 16, 32]
    - VectorWidth: [2, 4]
    - GlobalSplitU: [1, 8]
    - GlobalReadVectorWidth: [1, 2, 4]
    - FractionalLoad: [0, 1]
    - PrefetchGlobalRead: [false, true]
    - PrefetchLocalRead: [false, true]
    - LdsPadA: [0, -1]
    - LdsPadB: [0, -1]
    - ThreadTile:
      - [2, 2]
      - [4, 2]
      - [2, 4]
      - [4, 4]
    - WorkGroup:
      - [16, 16, 1]
      - [8, 16, 2]
      - [16, 8, 2]
      - [4, 16, 4]
      - [16, 4, 4]
    InitialSolutionParameters: null
    JoinParameters: null
  - BenchmarkCommonParameters:
    - EdgeType: [ShiftPtr]
    - KernelLanguage: [Assembly]
    - LoopTail: [true]
    BenchmarkFinalParameters:
    - ProblemSizes:
      - Exact: [2, 1, 1, 768]
      - Exact: [768, 1, 1, 768]
      - Exact: [2, 1, 1, 768]
      - Exact: [768, 1, 1, 768]
    BenchmarkForkParameters: null
    BenchmarkJoinParameters: null
    ForkParameters:
    - WorkGroupMapping: [1, 8]
    - DepthU: [8, 16, 32]
    - VectorWidth: [2, 4]
    - GlobalSplitU: [1, 4]
    - GlobalReadVectorWidth: [1, 2, 4]
    - FractionalLoad: [0, 1]
    - PrefetchGlobalRead: [false, true]
    - PrefetchLocalRead: [false, true]
    - LdsPadA: [0, -1]
    - LdsPadB: [0, -1]
    - ThreadTile:
      - [2, 2]
      - [4, 2]
      - [2, 4]
    - WorkGroup:
      - [16, 16, 1]
      - [8, 16, 2]
      - [16, 8, 2]
      - [4, 16, 4]
      - [16, 4, 4]
      - [32, 8, 4]
      - [8, 32, 4]
    InitialSolutionParameters: null
    JoinParameters: null
  - BenchmarkCommonParameters:
    - EdgeType: [ShiftPtr]
    - KernelLanguage: [Assembly]
    - LoopTail: [true]
    - GlobalSplitU: [1]
    BenchmarkFinalParameters:
    - ProblemSizes:
      - Exact: [3072, 128, 1, 768]
      - Exact: [3072, 128, 1, 768]
    BenchmarkForkParameters: null
    BenchmarkJoinParameters: null
    ForkParameters:
    - WorkGroupMapping: [1, 8]
    - DepthU: [8, 16, 32]
    - VectorWidth: [2, 4]
    - GlobalReadVectorWidth: [1, 2, 4]
    - FractionalLoad: [0, 1]
    - PrefetchGlobalRead: [false, true]
    - PrefetchLocalRead: [false, true]
    - ThreadTile:
      - [4, 4]
      - [6, 4]
      - [4, 6]
      - [4, 8]
      - [8, 4]
      - [8, 8]
    - WorkGroup:
      - [16, 16, 1]
    InitialSolutionParameters: null
    JoinParameters: null
  - BenchmarkCommonParameters:
    - EdgeType: [ShiftPtr]
    - KernelLanguage: [Assembly]
    - LoopTail: [true]
    BenchmarkFinalParameters:
    - ProblemSizes:
      - Exact: [768, 128, 1, 3072]
      - Exact: [768, 128, 1, 768]
      - Exact: [768, 128, 1, 3072]
      - Exact: [768, 128, 1, 768]
    BenchmarkForkParameters: null
    BenchmarkJoinParameters: null
    ForkParameters:
    - WorkGroupMapping: [1, 8]
    - DepthU: [8, 16, 32]
    - VectorWidth: [2, 4]
    - GlobalSplitU: [1, 8]
    - GlobalReadVectorWidth: [1, 2, 4]
    - FractionalLoad: [0, 1]
    - PrefetchGlobalRead: [false, true]
    - PrefetchLocalRead: [false, true]
    - LdsPadA: [0, -1]
    - LdsPadB: [0, -1]
    - ThreadTile:
      - [2, 2]
      - [4, 2]
      - [2, 4]
      - [4, 4]
    - WorkGroup:
      - [16, 16, 1]
      - [8, 16, 2]
      - [16, 8, 2]
      - [4, 16, 4]
      - [16, 4, 4]
    InitialSolutionParameters: null
    JoinParameters: null
  - BenchmarkCommonParameters:
    - EdgeType: [ShiftPtr]
    - KernelLanguage: [Assembly]
    - LoopTail: [true]
    - GlobalSplitU: [1]
    BenchmarkFinalParameters:
    - ProblemSizes:
      - Exact: [128, 128, 12, 64]
      - Exact: [64, 128, 12, 128]
    BenchmarkForkParameters: null
    BenchmarkJoinParameters: null
    ForkParameters:
    - WorkGroupMapping: [1, 8]
    - DepthU: [8, 16, 32]
    - VectorWidth: [2, 4]
    - GlobalReadVectorWidth: [1, 2, 4]
    - FractionalLoad: [0, 1]
    - PrefetchGlobalRead: [false, true]
    - PrefetchLocalRead: [false, true]
    - LdsPadA: [0, -1]
    - LdsPadB: [0, -1]
    - ThreadTile:
      - [4, 4]
      - [4, 2]
      - [2, 4]
      - [4, 8]
      - [8, 4]
      - [8, 8]
    - WorkGroup:
      - [16, 16, 1]
      - [16, 8, 1]
      - [8, 16, 1]
    InitialSolutionParameters: null
    JoinParameters: null
  - BenchmarkCommonParameters:
    - EdgeType: [ShiftPtr]
    - KernelLanguage: [Assembly]
    - LoopTail: [true]
    - GlobalSplitU: [1]
    BenchmarkFinalParameters:
    - ProblemSizes:
      - Exact: [128, 128, 12, 64]
      - Exact: [64, 128, 12, 128]
      - Exact: [128, 128, 12, 64]
      - Exact: [64, 128, 12, 128]
    BenchmarkForkParameters: null
    BenchmarkJoinParameters: null
    ForkParameters:
    - WorkGroupMapping: [1, 8]
    - DepthU: [8, 16, 32]
    - VectorWidth: [2, 4]
    - GlobalReadVectorWidth: [1, 2, 4]
    - FractionalLoad: [0, 1]
    - PrefetchGlobalRead: [false, true]
    - PrefetchLocalRead: [false, true]
    - LdsPadA: [0, -1]
    - LdsPadB: [0, -1]
    - ThreadTile:
      - [4, 4]
      - [4, 2]
      - [2, 4]
      - [4, 8]
      - [8, 4]
      - [8, 8]
    - WorkGroup:
      - [16, 16, 1]
      - [16, 8, 1]
      - [8, 16, 1]
    InitialSolutionParameters: null
    JoinParameters: null
LibraryLogic:
  ArchitectureName: gfx906
  DeviceNames: [Device 66a0, Device 66a1, Device 66a7, Vega 20]
  ScheduleName: vega20
LibraryClient:
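
For a sense of how large this search space is, the number of fork-parameter combinations in one benchmark group can be counted directly. The sketch below copies the `ForkParameters` of the first group from the YAML above into a Python dictionary and multiplies the list lengths. It is only an upper bound on the number of kernels, on the assumption that Tensile may reject some combinations as invalid; `count_combinations` is a hypothetical helper name.

```python
from functools import reduce
from operator import mul

# ForkParameters of the first benchmark group, copied from the YAML above.
fork_parameters = {
    "WorkGroupMapping": [1, 8],
    "DepthU": [8, 16, 32],
    "VectorWidth": [2, 4],
    "GlobalSplitU": [1, 4],
    "GlobalReadVectorWidth": [1, 2, 4],
    "FractionalLoad": [0, 1],
    "PrefetchGlobalRead": [False, True],
    "PrefetchLocalRead": [False, True],
    "LdsPadA": [0, -1],
    "LdsPadB": [0, -1],
    "ThreadTile": [[2, 2], [4, 2], [2, 4]],
    "WorkGroup": [
        [16, 16, 1], [8, 16, 2], [16, 8, 2], [4, 16, 4],
        [16, 4, 4], [32, 8, 4], [8, 32, 4],
    ],
}


def count_combinations(params):
    """Multiply the number of choices of every forked parameter."""
    return reduce(mul, (len(values) for values in params.values()), 1)


total = count_combinations(fork_parameters)
print("candidate combinations in this group:", total)
```

Each additional two-valued parameter doubles the count, so trimming a few fork lists shrinks the benchmark run considerably.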

Finer thread assignments

Allow threads to operate on adjacent values even without using VectorWidth. Is there a way to use different thread assignments for loading and for computing, so that global memory reads can always be dwordx4?

"Memory access fault by GPU node-1" error in Conv3d.

🐛 Bug

I got "Memory access fault by GPU node-1" when training my model, and I can now reproduce the problem with a very simple script.
The environment is ROCm 2.9.6 on a Radeon VII, with PyTorch compiled from the most recent source on the master branch.
Details follow.

To Reproduce

import torch
import torch.nn as nn
t=torch.rand(2,32,64,128,160).to('cuda')
t2=nn.Conv3d(32, 16, kernel_size=3, stride=1, padding=1, bias=False).to('cuda')(t) #error occurs.

Python 3.7.5 (default, Oct 25 2019, 15:51:11)
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.

import torch
import torch.nn as nn
t=torch.rand(2,32,64,128,160).to('cuda')
HIP_DB=0x1 [api]
hip-api pid:9748 tid:1:HIP initialized short_tid#1 (maps to full_tid: 0x7fba8044f740)
t2=nn.Conv3d(32, 16, kernel_size=3, stride=1, padding=1, bias=False).to('cuda')(t)
<<hip-api pid:9748 tid:1.63 9748 1.63 hipLaunchKernel 'ZN12_GLOBAL__N_110hip_fill_nILj256EPjmjEEvT0_T1_T2' gridDim:{163840,1,1} groupDim:{256,1,1} sharedMem:+0 stream:0.0 @5334006293209
<<hip-api pid:9748 tid:1.69 9748 1.69 hipLaunchKernel 'ZN2at6native14vol2col_kernelIfEEviPKT_iiiiiiiiiiiiiiiiiiPS2' gridDim:{40960,1,1} groupDim:{1024,1,1} sharedMem:+0 stream:0.0 @5340563243577
<<hip-api pid:9748 tid:1.409 9748 1.409 hipLaunchKernel 'Cijk_Ailk_Bljk_SB_MT128x64x8_SE_APM1_AF0EM1_AF1EM1_AMAS3_ASEM1_BL1_DTL0_EPS1_FL1_GRVW4_GSU1_ISA906_IU1_K1_KLA_LPA0_LPB0_LDL1_MGWVW1_NLCA1_NLCB1_PK0_PGR1_PLR1_RK0_SU32_SNLL0_TT8_4_USFGRO0_VAW1_VW4_WG16_16_1_WGM8' gridDim:{10240,1,1} groupDim:{256,1,1} sharedMem:+0 stream:0.0 @5340572207622
Memory access fault by GPU node-1 (Agent handle: 0x55e2fa08a6f0) on address 0x7fb968e02000. Reason: Page not present or supervisor privilege.
Aborted (core dumped)
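
As a sanity check on the sizes involved, the shapes in this reproduction can be worked out by hand. The sketch below assumes that the `vol2col` step builds the usual im2col-style float32 column buffer of shape (C_in * k^3) x (D_out * H_out * W_out) per sample; that layout is an assumption, not something the log confirms. It does not establish the cause of the fault, it only shows the magnitudes.

```python
def conv_output_size(size, kernel, stride, padding):
    """Standard convolution output length along one dimension (dilation 1)."""
    return (size + 2 * padding - kernel) // stride + 1


# Values taken from the reproduction script above.
batch, c_in, depth, height, width = 2, 32, 64, 128, 160
c_out, kernel, stride, padding = 16, 3, 1, 1

out_dims = [conv_output_size(s, kernel, stride, padding)
            for s in (depth, height, width)]
out_voxels = out_dims[0] * out_dims[1] * out_dims[2]

# Assumed column buffer for one sample: (C_in * k^3) rows, out_voxels columns.
col_rows = c_in * kernel ** 3
col_elements = col_rows * out_voxels
col_bytes = col_elements * 4  # float32

print("output dims per sample:", out_dims)
print("column buffer elements:", col_elements)
print("column buffer bytes:", col_bytes)
print("fits in 32-bit byte offset:", col_bytes < 2 ** 32)
```

Under these assumptions, C_in * out_voxels equals the 40960 x 1024 threads of the `vol2col` launch in the log, and the GEMM that follows would be 16 x 1310720 with an inner dimension of 864.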

Environment

ROCM Version: 2.9.6

PyTorch version: 1.4.0a0+21ab112
Is debug build: No
CUDA used to build PyTorch: Could not collect

OS: Ubuntu 18.04.3 LTS
GCC version: (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0
CMake version: version 3.12.0

Python version: 3.7
Is CUDA available: Yes
CUDA runtime version: 10.0.130
GPU models and configuration: Could not collect
Nvidia driver version: Could not collect
cuDNN version: Could not collect

Versions of relevant libraries:
[pip] numpy==1.17.3
[pip] torch==1.4.0a0+21ab112
[pip] torchvision==0.2.0
[conda] mkl 2019.4 243
[conda] mkl-include 2019.4 243

Something in Python is wrong while generating kernels

I am getting the following output:

Reading logic files: Done.
[|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||] 100% (14.2 secs elapsed)
# Writing Custom CMake
# Writing Kernels...
Generating kernels: Launching 48 threads...
info: growing pool += 1 * 2 for GlobalWrite

0         1   
01234567890123
....#######|..
0         1     
0123456789012345
....#######|..||
info: growing pool += 1 * 2 for GlobalWrite

0         1   
01234567890123
....#######|..
0         1     
0123456789012345
....#######|..||
info: growing pool += 1 * 2 for GlobalWrite

0         1   
01234567890123
....#######|..
0         1     
0123456789012345
....#######|..||
info: growing pool += 1 * 2 for GlobalWrite

0         1   
01234567890123
....#######|..
0         1     
0123456789012345
....#######|..||
Generating kernels: Done.
Compiling source kernels: Launching 48 threads...
Traceback (most recent call last):
  File "/home/selveskii/rocm-arch/rocblas/src/build/virtualenv/lib/python3.8/site-packages/Tensile/Common.py", line 1371, in apply_print_exception
    return func(*args)
  File "/home/selveskii/rocm-arch/rocblas/src/build/virtualenv/lib/python3.8/site-packages/Tensile/TensileCreateLibrary.py", line 146, in buildSourceCodeObjectFile
    hipFlags += subprocess.check_output([which('hipconfig'), '-C']).decode().split(' ')
  File "/usr/lib/python3.8/subprocess.py", line 411, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
  File "/usr/lib/python3.8/subprocess.py", line 489, in run
    with Popen(*popenargs, **kwargs) as process:
  File "/usr/lib/python3.8/subprocess.py", line 854, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "/usr/lib/python3.8/subprocess.py", line 1583, in _execute_child
    and os.path.dirname(executable)
  File "/usr/lib/python3.8/posixpath.py", line 152, in dirname
    p = os.fspath(p)
TypeError: expected str, bytes or os.PathLike object, not NoneType
multiprocessing.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/usr/lib/python3.8/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/usr/lib/python3.8/multiprocessing/pool.py", line 51, in starmapstar
    return list(itertools.starmap(args[0], args[1]))
  File "/home/selveskii/rocm-arch/rocblas/src/build/virtualenv/lib/python3.8/site-packages/Tensile/Common.py", line 1371, in apply_print_exception
    return func(*args)
  File "/home/selveskii/rocm-arch/rocblas/src/build/virtualenv/lib/python3.8/site-packages/Tensile/TensileCreateLibrary.py", line 146, in buildSourceCodeObjectFile
    hipFlags += subprocess.check_output([which('hipconfig'), '-C']).decode().split(' ')
  File "/usr/lib/python3.8/subprocess.py", line 411, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
  File "/usr/lib/python3.8/subprocess.py", line 489, in run
    with Popen(*popenargs, **kwargs) as process:
  File "/usr/lib/python3.8/subprocess.py", line 854, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "/usr/lib/python3.8/subprocess.py", line 1583, in _execute_child
    and os.path.dirname(executable)
  File "/usr/lib/python3.8/posixpath.py", line 152, in dirname
    p = os.fspath(p)
TypeError: expected str, bytes or os.PathLike object, not NoneType
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/selveskii/rocm-arch/rocblas/src/build/virtualenv/lib/python3.8/site-packages/Tensile/bin/TensileCreateLibrary", line 38, in <module>
    TensileCreateLibrary()
  File "/home/selveskii/rocm-arch/rocblas/src/build/virtualenv/lib/python3.8/site-packages/Tensile/TensileCreateLibrary.py", line 1165, in TensileCreateLibrary
    codeObjectFiles = writeSolutionsAndKernels(outputPath, CxxCompiler, problemTypes, solutions,
  File "/home/selveskii/rocm-arch/rocblas/src/build/virtualenv/lib/python3.8/site-packages/Tensile/TensileCreateLibrary.py", line 409, in writeSolutionsAndKernels
    codeObjectFiles += buildSourceCodeObjectFiles(CxxCompiler, kernelFiles, outputPath)
  File "/home/selveskii/rocm-arch/rocblas/src/build/virtualenv/lib/python3.8/site-packages/Tensile/TensileCreateLibrary.py", line 211, in buildSourceCodeObjectFiles
    coFiles = Common.ParallelMap(buildSourceCodeObjectFile, args, "Compiling source kernels",
  File "/home/selveskii/rocm-arch/rocblas/src/build/virtualenv/lib/python3.8/site-packages/Tensile/Common.py", line 1441, in ParallelMap
    rv = mapFunc(function, objects)
  File "/usr/lib/python3.8/multiprocessing/pool.py", line 372, in starmap
    return self._map_async(func, iterable, starmapstar, chunksize).get()
  File "/usr/lib/python3.8/multiprocessing/pool.py", line 768, in get
    raise self._value
TypeError: expected str, bytes or os.PathLike object, not NoneType

I am not sure whether this is related to my Python version being 3.8.
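
Reading the traceback, the `TypeError` is raised because the first element of the command list passed to `subprocess.check_output` is `None`, which suggests that `which('hipconfig')` did not find `hipconfig` on `PATH`. That points at the environment rather than at Python 3.8 itself. A minimal sketch of a guarded lookup follows; it uses the standard library's `shutil.which` in place of Tensile's own `which` helper, and `find_tool` and `hip_compile_flags` are hypothetical names.

```python
import shutil
import subprocess


def find_tool(name):
    """Return the full path of an executable, or fail with a clear message."""
    path = shutil.which(name)
    if path is None:
        raise FileNotFoundError(
            "{!r} was not found on PATH; check that its directory is exported "
            "in the build environment".format(name)
        )
    return path


def hip_compile_flags(tool="hipconfig"):
    """Run '<tool> -C' (the call seen in the traceback) and split its output."""
    output = subprocess.check_output([find_tool(tool), "-C"])
    return output.decode().split()
```

Running `find_tool("hipconfig")` inside the same shell or build sandbox that runs the rocBLAS build shows quickly whether the tool is visible there.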

I am building Tensile as a part of building rocblas, on Arch Linux, using a modified AUR file:

  mkdir -p "$srcdir/build"
  cd "$srcdir/build"

  # fix broken build with stack protection
  export CFLAGS="$(sed -e 's/-fstack-protector-strong//' <<< "$CFLAGS")"
  export CXXFLAGS="$(sed -e 's/-fstack-protector-strong//' <<< "$CXXFLAGS")"
  export CPPFLAGS="$(sed -e 's/-fstack-protector-strong//' <<< "$CPPFLAGS")"

  # compile with HCC
  export CXX="/opt/rocm/hcc/bin/hcc"

  # TODO: fix librocblas.so, it contains references to $srcdir
  cmake -DCMAKE_BUILD_TYPE=Release \
        -DCMAKE_INSTALL_PREFIX=/opt/rocm/rocblas \
        -HIP_DIR=/opt/rocm/hip/lib/cmake/hip \
        -hcc_DIR=/opt/rocm/hcc/lib/cmake/hcc \
        -Damd_comgr_DIR=/opt/rocm/lib/cmake/amd_comgr \
        -DTensile_ARCHITECTURE=gfx906 \
        -DTensile_LOGIC=asm_full \
        -DBUILD_WITH_TENSILE=ON \
        "$srcdir/rocBLAS-rocm-$_pkgver"
  make

An issue about the file name: 00_BF.csv

image
I am using tensile-master-3.3 with rocm-3.3.1. At first, Tensile did not find 'llvm-config', so I manually added its path in '.bashrc'.
image
image
Then I got a different error, which says that '00_BF-new.csv' cannot be found, even though Tensile itself built a file named '00_BF.csv', as shown above.
How can I solve this? Thanks.

run error

When I run Tensile with the command python3 ../Tensile/bin/Tensile ../Tensile/Configs/rocblas_sgemm_asm_only.yaml ./, I get the following result:

image
I'm sure that I have permission to access the file /public/home/caspra69/Tensile/build/1_BenchmarkProblems/Cijk_Ailk_Bljk_SB_00/Data/00_Final.csv

Why does this happen?

tensile not completing assembly code for sgemm

Memory access fault by GPU node-2 on address 0x4100000000. Reason: Page not present or supervisor privilege.

Details :

OS : CentOS 7
Driver : ROCm 1.8
GPU : AMD Pro SSG
Can anybody help with this?

Best,
Sagar

Cannot build Tensile with ROCm's llvm fork on 3.5.0

I'm trying to build Tensile for rocBLAS using https://github.com/RadeonOpenCompute/llvm-project all on the rocm-3.5.0 tag. I can't get it to build:

In file included from /tmp/harmen/spack-stage/spack-stage-rocblas-3.5.0-mkl42uc35yef75uwoxnphdynuk2uzbfi/spack-build/virtualenv/lib/python3.7/site-packages/Tensile/Source/lib/source/llvm/YAML.cpp:2:
/tmp/harmen/spack-stage/spack-stage-rocblas-3.5.0-mkl42uc35yef75uwoxnphdynuk2uzbfi/spack-build/virtualenv/lib/python3.7/site-packages/Tensile/Source/lib/include/Tensile/llvm/YAML.hpp:249:36: error: no viable conversion from 'llvm::StringRef' to 'const std::string' (aka 'const basic_string<char>')
                Impl::inputOne(io, key, *value);
                                   ^~~
/home/harmen/spack/opt/spack/linux-ubuntu20.04-zen2/gcc-9.3.0/llvm-amdgpu-3.5.0-2pca4hvik6g3rkshkawjtzhwmvv7xgvi/include/llvm/Support/YAMLTraits.h:1101:31: note: in instantiation of member function 'llvm::yaml::CustomMappingTraits<llvm::yaml::Hide<std::map<std::__cxx11::basic_string<char>, std::__cxx11::basic_string<char>, std::less<std::__cxx11::basic_string<char>>, std::allocator<std::pair<const std::__cxx11::basic_string<char>, std::__cxx11::basic_string<char>>>>>>::inputOne' requested here
      CustomMappingTraits<T>::inputOne(io, key, Val);
                              ^
/tmp/harmen/spack-stage/spack-stage-rocblas-3.5.0-mkl42uc35yef75uwoxnphdynuk2uzbfi/spack-build/virtualenv/lib/python3.7/site-packages/Tensile/Source/lib/include/Tensile/llvm/YAML.hpp:96:13: note: in instantiation of function template specialization 'llvm::yaml::yamlize<llvm::yaml::Hide<std::map<std::__cxx11::basic_string<char>, std::__cxx11::basic_string<char>, std::less<std::__cxx11::basic_string<char>>, std::allocator<std::pair<const std::__cxx11::basic_string<char>, std::__cxx11::basic_string<char>>>>>>' requested here
            yamlize(io, hide, b, ctx);
            ^
/home/harmen/spack/opt/spack/linux-ubuntu20.04-zen2/gcc-9.3.0/llvm-amdgpu-3.5.0-2pca4hvik6g3rkshkawjtzhwmvv7xgvi/include/llvm/Support/YAMLTraits.h:945:7: note: in instantiation of function template specialization 'llvm::yaml::yamlize<std::map<std::__cxx11::basic_string<char>, std::__cxx11::basic_string<char>, std::less<std::__cxx11::basic_string<char>>, std::allocator<std::pair<const std::__cxx11::basic_string<char>, std::__cxx11::basic_string<char>>>>>' requested here
      yamlize(*this, Val, Required, Ctx);
      ^
/home/harmen/spack/opt/spack/linux-ubuntu20.04-zen2/gcc-9.3.0/llvm-amdgpu-3.5.0-2pca4hvik6g3rkshkawjtzhwmvv7xgvi/include/llvm/Support/YAMLTraits.h:851:11: note: in instantiation of function template specialization 'llvm::yaml::IO::processKey<std::map<std::__cxx11::basic_string<char>, std::__cxx11::basic_string<char>, std::less<std::__cxx11::basic_string<char>>, std::allocator<std::pair<const std::__cxx11::basic_string<char>, std::__cxx11::basic_string<char>>>>, llvm::yaml::EmptyContext>' requested here
    this->processKey(Key, Val, true, Ctx);
          ^
/tmp/harmen/spack-stage/spack-stage-rocblas-3.5.0-mkl42uc35yef75uwoxnphdynuk2uzbfi/spack-build/virtualenv/lib/python3.7/site-packages/Tensile/Source/lib/include/Tensile/llvm/YAML.hpp:131:20: note: in instantiation of function template specialization 'llvm::yaml::IO::mapRequired<std::map<std::__cxx11::basic_string<char>, std::__cxx11::basic_string<char>, std::less<std::__cxx11::basic_string<char>>, std::allocator<std::pair<const std::__cxx11::basic_string<char>, std::__cxx11::basic_string<char>>>>>' requested here
                io.mapRequired(key, obj);
                   ^
/tmp/harmen/spack-stage/spack-stage-rocblas-3.5.0-mkl42uc35yef75uwoxnphdynuk2uzbfi/spack-build/virtualenv/lib/python3.7/site-packages/Tensile/Source/lib/include/Tensile/Serialization/ContractionSolution.hpp:62:22: note: (skipping 10 contexts in backtrace; use -ftemplate-backtrace-limit=0 to see all)
                iot::mapRequired(io, "info", s.info);
                     ^
/home/harmen/spack/opt/spack/linux-ubuntu20.04-zen2/gcc-9.3.0/llvm-amdgpu-3.5.0-2pca4hvik6g3rkshkawjtzhwmvv7xgvi/include/llvm/Support/YAMLTraits.h:851:11: note: in instantiation of function template specialization 'llvm::yaml::IO::processKey<std::vector<std::shared_ptr<Tensile::ContractionSolution>, std::allocator<std::shared_ptr<Tensile::ContractionSolution>>>, llvm::yaml::EmptyContext>' requested here
    this->processKey(Key, Val, true, Ctx);
          ^
/tmp/harmen/spack-stage/spack-stage-rocblas-3.5.0-mkl42uc35yef75uwoxnphdynuk2uzbfi/spack-build/virtualenv/lib/python3.7/site-packages/Tensile/Source/lib/include/Tensile/llvm/YAML.hpp:131:20: note: in instantiation of function template specialization 'llvm::yaml::IO::mapRequired<std::vector<std::shared_ptr<Tensile::ContractionSolution>, std::allocator<std::shared_ptr<Tensile::ContractionSolution>>>>' requested here
                io.mapRequired(key, obj);
                   ^
/tmp/harmen/spack-stage/spack-stage-rocblas-3.5.0-mkl42uc35yef75uwoxnphdynuk2uzbfi/spack-build/virtualenv/lib/python3.7/site-packages/Tensile/Source/lib/include/Tensile/Serialization/SolutionLibrary.hpp:148:22: note: in instantiation of function template specialization 'Tensile::Serialization::IOTraits<llvm::yaml::IO>::mapRequired<std::vector<std::shared_ptr<Tensile::ContractionSolution>, std::allocator<std::shared_ptr<Tensile::ContractionSolution>>>>' requested here
                iot::mapRequired(io, "solutions", solutions);
                     ^
/tmp/harmen/spack-stage/spack-stage-rocblas-3.5.0-mkl42uc35yef75uwoxnphdynuk2uzbfi/spack-build/virtualenv/lib/python3.7/site-packages/Tensile/Source/lib/include/Tensile/Serialization/Base.hpp:209:46: note: in instantiation of member function 'Tensile::Serialization::MappingTraits<Tensile::MasterSolutionLibrary<Tensile::ContractionProblem, Tensile::ContractionSolution>, llvm::yaml::IO, Tensile::Serialization::EmptyContext>::mapping' requested here
                MappingTraits<Subclass, IO>::mapping(io, *sc);
                                             ^
/tmp/harmen/spack-stage/spack-stage-rocblas-3.5.0-mkl42uc35yef75uwoxnphdynuk2uzbfi/spack-build/virtualenv/lib/python3.7/site-packages/Tensile/Source/lib/include/Tensile/llvm/YAML.hpp:265:52: note: in instantiation of function template specialization 'Tensile::Serialization::PointerMappingTraits<Tensile::MasterSolutionLibrary<Tensile::ContractionProblem, Tensile::ContractionSolution>, llvm::yaml::IO, Tensile::Serialization::EmptyContext>::mapping<Tensile::MasterSolutionLibrary<Tensile::ContractionProblem, Tensile::ContractionSolution>>' requested here
                sn::PointerMappingTraits<obj, IO>::mapping(io, o);
                                                   ^
/usr/lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/bits/basic_string.h:448:7: note: candidate constructor not viable: no known conversion from 'llvm::StringRef' to 'const std::__cxx11::basic_string<char> &' for 1st argument
      basic_string(const basic_string& __str)
      ^
/usr/lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/bits/basic_string.h:525:7: note: candidate constructor not viable: no known conversion from 'llvm::StringRef' to 'const char *' for 1st argument
      basic_string(const _CharT* __s, const _Alloc& __a = _Alloc())
      ^
/usr/lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/bits/basic_string.h:552:7: note: candidate constructor not viable: no known conversion from 'llvm::StringRef' to 'std::__cxx11::basic_string<char> &&' for 1st argument
      basic_string(basic_string&& __str) noexcept
      ^
/usr/lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/bits/basic_string.h:579:7: note: candidate constructor not viable: no known conversion from 'llvm::StringRef' to 'initializer_list<char>' for 1st argument
      basic_string(initializer_list<_CharT> __l, const _Alloc& __a = _Alloc())
      ^
/usr/lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/bits/basic_string.h:440:7: note: explicit constructor is not a candidate
      basic_string(const _Alloc& __a) _GLIBCXX_NOEXCEPT
      ^
/home/harmen/spack/opt/spack/linux-ubuntu20.04-zen2/gcc-9.3.0/llvm-amdgpu-3.5.0-2pca4hvik6g3rkshkawjtzhwmvv7xgvi/include/llvm/ADT/StringRef.h:275:14: note: explicit conversion function is not a candidate
    explicit operator std::string() const { return str(); }
             ^
/tmp/harmen/spack-stage/spack-stage-rocblas-3.5.0-mkl42uc35yef75uwoxnphdynuk2uzbfi/spack-build/virtualenv/lib/python3.7/site-packages/Tensile/Source/lib/include/Tensile/Serialization/Containers.hpp:83:62: note: passing argument to parameter 'keyStr' here
            static void inputOne(IO & io, std::string const& keyStr, Map & value)
                                                             ^
In file included from /tmp/harmen/spack-stage/spack-stage-rocblas-3.5.0-mkl42uc35yef75uwoxnphdynuk2uzbfi/spack-build/virtualenv/lib/python3.7/site-packages/Tensile/Source/lib/source/llvm/Loading.cpp:29:
/tmp/harmen/spack-stage/spack-stage-rocblas-3.5.0-mkl42uc35yef75uwoxnphdynuk2uzbfi/spack-build/virtualenv/lib/python3.7/site-packages/Tensile/Source/lib/include/Tensile/llvm/YAML.hpp:249:36: error: no viable conversion from 'llvm::StringRef' to 'const std::string' (aka 'const basic_string<char>')
                Impl::inputOne(io, key, *value);
                                   ^~~
/home/harmen/spack/opt/spack/linux-ubuntu20.04-zen2/gcc-9.3.0/llvm-amdgpu-3.5.0-2pca4hvik6g3rkshkawjtzhwmvv7xgvi/include/llvm/Support/YAMLTraits.h:1101:31: note: in instantiation of member function 'llvm::yaml::CustomMappingTraits<llvm::yaml::Hide<std::map<std::__cxx11::basic_string<char>, std::__cxx11::basic_string<char>, std::less<std::__cxx11::basic_string<char>>, std::allocator<std::pair<const std::__cxx11::basic_string<char>, std::__cxx11::basic_string<char>>>>>>::inputOne' requested here
      CustomMappingTraits<T>::inputOne(io, key, Val);
                              ^
/tmp/harmen/spack-stage/spack-stage-rocblas-3.5.0-mkl42uc35yef75uwoxnphdynuk2uzbfi/spack-build/virtualenv/lib/python3.7/site-packages/Tensile/Source/lib/include/Tensile/llvm/YAML.hpp:96:13: note: in instantiation of function template specialization 'llvm::yaml::yamlize<llvm::yaml::Hide<std::map<std::__cxx11::basic_string<char>, std::__cxx11::basic_string<char>, std::less<std::__cxx11::basic_string<char>>, std::allocator<std::pair<const std::__cxx11::basic_string<char>, std::__cxx11::basic_string<char>>>>>>' requested here
            yamlize(io, hide, b, ctx);
            ^
/home/harmen/spack/opt/spack/linux-ubuntu20.04-zen2/gcc-9.3.0/llvm-amdgpu-3.5.0-2pca4hvik6g3rkshkawjtzhwmvv7xgvi/include/llvm/Support/YAMLTraits.h:945:7: note: in instantiation of function template specialization 'llvm::yaml::yamlize<std::map<std::__cxx11::basic_string<char>, std::__cxx11::basic_string<char>, std::less<std::__cxx11::basic_string<char>>, std::allocator<std::pair<const std::__cxx11::basic_string<char>, std::__cxx11::basic_string<char>>>>>' requested here
      yamlize(*this, Val, Required, Ctx);
      ^
/home/harmen/spack/opt/spack/linux-ubuntu20.04-zen2/gcc-9.3.0/llvm-amdgpu-3.5.0-2pca4hvik6g3rkshkawjtzhwmvv7xgvi/include/llvm/Support/YAMLTraits.h:851:11: note: in instantiation of function template specialization 'llvm::yaml::IO::processKey<std::map<std::__cxx11::basic_string<char>, std::__cxx11::basic_string<char>, std::less<std::__cxx11::basic_string<char>>, std::allocator<std::pair<const std::__cxx11::basic_string<char>, std::__cxx11::basic_string<char>>>>, llvm::yaml::EmptyContext>' requested here
    this->processKey(Key, Val, true, Ctx);
          ^
/tmp/harmen/spack-stage/spack-stage-rocblas-3.5.0-mkl42uc35yef75uwoxnphdynuk2uzbfi/spack-build/virtualenv/lib/python3.7/site-packages/Tensile/Source/lib/include/Tensile/llvm/YAML.hpp:131:20: note: in instantiation of function template specialization 'llvm::yaml::IO::mapRequired<std::map<std::__cxx11::basic_string<char>, std::__cxx11::basic_string<char>, std::less<std::__cxx11::basic_string<char>>, std::allocator<std::pair<const std::__cxx11::basic_string<char>, std::__cxx11::basic_string<char>>>>>' requested here
                io.mapRequired(key, obj);
                   ^
/tmp/harmen/spack-stage/spack-stage-rocblas-3.5.0-mkl42uc35yef75uwoxnphdynuk2uzbfi/spack-build/virtualenv/lib/python3.7/site-packages/Tensile/Source/lib/include/Tensile/Serialization/ContractionSolution.hpp:62:22: note: (skipping 10 contexts in backtrace; use -ftemplate-backtrace-limit=0 to see all)
                iot::mapRequired(io, "info", s.info);
                     ^
/home/harmen/spack/opt/spack/linux-ubuntu20.04-zen2/gcc-9.3.0/llvm-amdgpu-3.5.0-2pca4hvik6g3rkshkawjtzhwmvv7xgvi/include/llvm/Support/YAMLTraits.h:851:11: note: in instantiation of function template specialization 'llvm::yaml::IO::processKey<std::vector<std::shared_ptr<Tensile::ContractionSolution>, std::allocator<std::shared_ptr<Tensile::ContractionSolution>>>, llvm::yaml::EmptyContext>' requested here
    this->processKey(Key, Val, true, Ctx);
          ^
/tmp/harmen/spack-stage/spack-stage-rocblas-3.5.0-mkl42uc35yef75uwoxnphdynuk2uzbfi/spack-build/virtualenv/lib/python3.7/site-packages/Tensile/Source/lib/include/Tensile/llvm/YAML.hpp:131:20: note: in instantiation of function template specialization 'llvm::yaml::IO::mapRequired<std::vector<std::shared_ptr<Tensile::ContractionSolution>, std::allocator<std::shared_ptr<Tensile::ContractionSolution>>>>' requested here
                io.mapRequired(key, obj);
                   ^
/tmp/harmen/spack-stage/spack-stage-rocblas-3.5.0-mkl42uc35yef75uwoxnphdynuk2uzbfi/spack-build/virtualenv/lib/python3.7/site-packages/Tensile/Source/lib/include/Tensile/Serialization/SolutionLibrary.hpp:148:22: note: in instantiation of function template specialization 'Tensile::Serialization::IOTraits<llvm::yaml::IO>::mapRequired<std::vector<std::shared_ptr<Tensile::ContractionSolution>, std::allocator<std::shared_ptr<Tensile::ContractionSolution>>>>' requested here
                iot::mapRequired(io, "solutions", solutions);
                     ^
/tmp/harmen/spack-stage/spack-stage-rocblas-3.5.0-mkl42uc35yef75uwoxnphdynuk2uzbfi/spack-build/virtualenv/lib/python3.7/site-packages/Tensile/Source/lib/include/Tensile/Serialization/Base.hpp:209:46: note: in instantiation of member function 'Tensile::Serialization::MappingTraits<Tensile::MasterSolutionLibrary<Tensile::ContractionProblem, Tensile::ContractionSolution>, llvm::yaml::IO, Tensile::Serialization::EmptyContext>::mapping' requested here
                MappingTraits<Subclass, IO>::mapping(io, *sc);
                                             ^
/tmp/harmen/spack-stage/spack-stage-rocblas-3.5.0-mkl42uc35yef75uwoxnphdynuk2uzbfi/spack-build/virtualenv/lib/python3.7/site-packages/Tensile/Source/lib/include/Tensile/llvm/YAML.hpp:265:52: note: in instantiation of function template specialization 'Tensile::Serialization::PointerMappingTraits<Tensile::MasterSolutionLibrary<Tensile::ContractionProblem, Tensile::ContractionSolution>, llvm::yaml::IO, Tensile::Serialization::EmptyContext>::mapping<Tensile::MasterSolutionLibrary<Tensile::ContractionProblem, Tensile::ContractionSolution>>' requested here
                sn::PointerMappingTraits<obj, IO>::mapping(io, o);
                                                   ^
/usr/lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/bits/basic_string.h:448:7: note: candidate constructor not viable: no known conversion from 'llvm::StringRef' to 'const std::__cxx11::basic_string<char> &' for 1st argument
      basic_string(const basic_string& __str)
      ^
/usr/lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/bits/basic_string.h:525:7: note: candidate constructor not viable: no known conversion from 'llvm::StringRef' to 'const char *' for 1st argument
      basic_string(const _CharT* __s, const _Alloc& __a = _Alloc())
      ^
/usr/lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/bits/basic_string.h:552:7: note: candidate constructor not viable: no known conversion from 'llvm::StringRef' to 'std::__cxx11::basic_string<char> &&' for 1st argument
      basic_string(basic_string&& __str) noexcept
      ^
/usr/lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/bits/basic_string.h:579:7: note: candidate constructor not viable: no known conversion from 'llvm::StringRef' to 'initializer_list<char>' for 1st argument
      basic_string(initializer_list<_CharT> __l, const _Alloc& __a = _Alloc())
      ^
/usr/lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/bits/basic_string.h:440:7: note: explicit constructor is not a candidate
      basic_string(const _Alloc& __a) _GLIBCXX_NOEXCEPT
      ^
/home/harmen/spack/opt/spack/linux-ubuntu20.04-zen2/gcc-9.3.0/llvm-amdgpu-3.5.0-2pca4hvik6g3rkshkawjtzhwmvv7xgvi/include/llvm/ADT/StringRef.h:275:14: note: explicit conversion function is not a candidate
    explicit operator std::string() const { return str(); }
             ^
/tmp/harmen/spack-stage/spack-stage-rocblas-3.5.0-mkl42uc35yef75uwoxnphdynuk2uzbfi/spack-build/virtualenv/lib/python3.7/site-packages/Tensile/Source/lib/include/Tensile/Serialization/Containers.hpp:83:62: note: passing argument to parameter 'keyStr' here
            static void inputOne(IO & io, std::string const& keyStr, Map & value)
                                                             ^
In file included from /tmp/harmen/spack-stage/spack-stage-rocblas-3.5.0-mkl42uc35yef75uwoxnphdynuk2uzbfi/spack-build/virtualenv/lib/python3.7/site-packages/Tensile/Source/lib/source/llvm/YAML.cpp:2:
/tmp/harmen/spack-stage/spack-stage-rocblas-3.5.0-mkl42uc35yef75uwoxnphdynuk2uzbfi/spack-build/virtualenv/lib/python3.7/site-packages/Tensile/Source/lib/include/Tensile/llvm/YAML.hpp:249:36: error: no viable conversion from 'llvm::StringRef' to 'const std::string' (aka 'const basic_string<char>')
                Impl::inputOne(io, key, *value);
                                   ^~~
/home/harmen/spack/opt/spack/linux-ubuntu20.04-zen2/gcc-9.3.0/llvm-amdgpu-3.5.0-2pca4hvik6g3rkshkawjtzhwmvv7xgvi/include/llvm/Support/YAMLTraits.h:1101:31: note: in instantiation of member function 'llvm::yaml::CustomMappingTraits<llvm::yaml::Hide<std::map<int, double, std::less<int>, std::allocator<std::pair<const int, double>>>>>::inputOne' requested here
      CustomMappingTraits<T>::inputOne(io, key, Val);
                              ^
/tmp/harmen/spack-stage/spack-stage-rocblas-3.5.0-mkl42uc35yef75uwoxnphdynuk2uzbfi/spack-build/virtualenv/lib/python3.7/site-packages/Tensile/Source/lib/include/Tensile/llvm/YAML.hpp:96:13: note: in instantiation of function template specialization 'llvm::yaml::yamlize<llvm::yaml::Hide<std::map<int, double, std::less<int>, std::allocator<std::pair<const int, double>>>>>' requested here
            yamlize(io, hide, b, ctx);
            ^
/home/harmen/spack/opt/spack/linux-ubuntu20.04-zen2/gcc-9.3.0/llvm-amdgpu-3.5.0-2pca4hvik6g3rkshkawjtzhwmvv7xgvi/include/llvm/Support/YAMLTraits.h:945:7: note: in instantiation of function template specialization 'llvm::yaml::yamlize<std::map<int, double, std::less<int>, std::allocator<std::pair<const int, double>>>>' requested here
      yamlize(*this, Val, Required, Ctx);
      ^
/home/harmen/spack/opt/spack/linux-ubuntu20.04-zen2/gcc-9.3.0/llvm-amdgpu-3.5.0-2pca4hvik6g3rkshkawjtzhwmvv7xgvi/include/llvm/Support/YAMLTraits.h:888:11: note: in instantiation of function template specialization 'llvm::yaml::IO::processKey<std::map<int, double, std::less<int>, std::allocator<std::pair<const int, double>>>, llvm::yaml::EmptyContext>' requested here
    this->processKey(Key, Val, false, Ctx);
          ^
/home/harmen/spack/opt/spack/linux-ubuntu20.04-zen2/gcc-9.3.0/llvm-amdgpu-3.5.0-2pca4hvik6g3rkshkawjtzhwmvv7xgvi/include/llvm/Support/YAMLTraits.h:861:5: note: in instantiation of function template specialization 'llvm::yaml::IO::mapOptionalWithContext<std::map<int, double, std::less<int>, std::allocator<std::pair<const int, double>>>, llvm::yaml::EmptyContext>' requested here
    mapOptionalWithContext(Key, Val, Ctx);
    ^
/tmp/harmen/spack-stage/spack-stage-rocblas-3.5.0-mkl42uc35yef75uwoxnphdynuk2uzbfi/spack-build/virtualenv/lib/python3.7/site-packages/Tensile/Source/lib/include/Tensile/llvm/YAML.hpp:143:20: note: (skipping 11 contexts in backtrace; use -ftemplate-backtrace-limit=0 to see all)
                io.mapOptional(key, obj);
                   ^
/home/harmen/spack/opt/spack/linux-ubuntu20.04-zen2/gcc-9.3.0/llvm-amdgpu-3.5.0-2pca4hvik6g3rkshkawjtzhwmvv7xgvi/include/llvm/Support/YAMLTraits.h:851:11: note: in instantiation of function template specialization 'llvm::yaml::IO::processKey<std::vector<std::shared_ptr<Tensile::ContractionSolution>, std::allocator<std::shared_ptr<Tensile::ContractionSolution>>>, llvm::yaml::EmptyContext>' requested here
    this->processKey(Key, Val, true, Ctx);
          ^
/tmp/harmen/spack-stage/spack-stage-rocblas-3.5.0-mkl42uc35yef75uwoxnphdynuk2uzbfi/spack-build/virtualenv/lib/python3.7/site-packages/Tensile/Source/lib/include/Tensile/llvm/YAML.hpp:131:20: note: in instantiation of function template specialization 'llvm::yaml::IO::mapRequired<std::vector<std::shared_ptr<Tensile::ContractionSolution>, std::allocator<std::shared_ptr<Tensile::ContractionSolution>>>>' requested here
                io.mapRequired(key, obj);
                   ^
/tmp/harmen/spack-stage/spack-stage-rocblas-3.5.0-mkl42uc35yef75uwoxnphdynuk2uzbfi/spack-build/virtualenv/lib/python3.7/site-packages/Tensile/Source/lib/include/Tensile/Serialization/SolutionLibrary.hpp:148:22: note: in instantiation of function template specialization 'Tensile::Serialization::IOTraits<llvm::yaml::IO>::mapRequired<std::vector<std::shared_ptr<Tensile::ContractionSolution>, std::allocator<std::shared_ptr<Tensile::ContractionSolution>>>>' requested here
                iot::mapRequired(io, "solutions", solutions);
                     ^
/tmp/harmen/spack-stage/spack-stage-rocblas-3.5.0-mkl42uc35yef75uwoxnphdynuk2uzbfi/spack-build/virtualenv/lib/python3.7/site-packages/Tensile/Source/lib/include/Tensile/Serialization/Base.hpp:209:46: note: in instantiation of member function 'Tensile::Serialization::MappingTraits<Tensile::MasterSolutionLibrary<Tensile::ContractionProblem, Tensile::ContractionSolution>, llvm::yaml::IO, Tensile::Serialization::EmptyContext>::mapping' requested here
                MappingTraits<Subclass, IO>::mapping(io, *sc);
                                             ^
/tmp/harmen/spack-stage/spack-stage-rocblas-3.5.0-mkl42uc35yef75uwoxnphdynuk2uzbfi/spack-build/virtualenv/lib/python3.7/site-packages/Tensile/Source/lib/include/Tensile/llvm/YAML.hpp:265:52: note: in instantiation of function template specialization 'Tensile::Serialization::PointerMappingTraits<Tensile::MasterSolutionLibrary<Tensile::ContractionProblem, Tensile::ContractionSolution>, llvm::yaml::IO, Tensile::Serialization::EmptyContext>::mapping<Tensile::MasterSolutionLibrary<Tensile::ContractionProblem, Tensile::ContractionSolution>>' requested here
                sn::PointerMappingTraits<obj, IO>::mapping(io, o);
                                                   ^
/usr/lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/bits/basic_string.h:448:7: note: candidate constructor not viable: no known conversion from 'llvm::StringRef' to 'const std::__cxx11::basic_string<char> &' for 1st argument
      basic_string(const basic_string& __str)
      ^
/usr/lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/bits/basic_string.h:525:7: note: candidate constructor not viable: no known conversion from 'llvm::StringRef' to 'const char *' for 1st argument
      basic_string(const _CharT* __s, const _Alloc& __a = _Alloc())
      ^
/usr/lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/bits/basic_string.h:552:7: note: candidate constructor not viable: no known conversion from 'llvm::StringRef' to 'std::__cxx11::basic_string<char> &&' for 1st argument
      basic_string(basic_string&& __str) noexcept
      ^
/usr/lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/bits/basic_string.h:579:7: note: candidate constructor not viable: no known conversion from 'llvm::StringRef' to 'initializer_list<char>' for 1st argument
      basic_string(initializer_list<_CharT> __l, const _Alloc& __a = _Alloc())
      ^
/usr/lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/bits/basic_string.h:440:7: note: explicit constructor is not a candidate
      basic_string(const _Alloc& __a) _GLIBCXX_NOEXCEPT
      ^
/home/harmen/spack/opt/spack/linux-ubuntu20.04-zen2/gcc-9.3.0/llvm-amdgpu-3.5.0-2pca4hvik6g3rkshkawjtzhwmvv7xgvi/include/llvm/ADT/StringRef.h:275:14: note: explicit conversion function is not a candidate
    explicit operator std::string() const { return str(); }
             ^
/tmp/harmen/spack-stage/spack-stage-rocblas-3.5.0-mkl42uc35yef75uwoxnphdynuk2uzbfi/spack-build/virtualenv/lib/python3.7/site-packages/Tensile/Source/lib/include/Tensile/Serialization/Containers.hpp:83:62: note: passing argument to parameter 'keyStr' here
            static void inputOne(IO & io, std::string const& keyStr, Map & value)
                                                             ^
In file included from /tmp/harmen/spack-stage/spack-stage-rocblas-3.5.0-mkl42uc35yef75uwoxnphdynuk2uzbfi/spack-build/virtualenv/lib/python3.7/site-packages/Tensile/Source/lib/source/llvm/Loading.cpp:29:
/tmp/harmen/spack-stage/spack-stage-rocblas-3.5.0-mkl42uc35yef75uwoxnphdynuk2uzbfi/spack-build/virtualenv/lib/python3.7/site-packages/Tensile/Source/lib/include/Tensile/llvm/YAML.hpp:249:36: error: no viable conversion from 'llvm::StringRef' to 'const std::string' (aka 'const basic_string<char>')
                Impl::inputOne(io, key, *value);
                                   ^~~
/home/harmen/spack/opt/spack/linux-ubuntu20.04-zen2/gcc-9.3.0/llvm-amdgpu-3.5.0-2pca4hvik6g3rkshkawjtzhwmvv7xgvi/include/llvm/Support/YAMLTraits.h:1101:31: note: in instantiation of member function 'llvm::yaml::CustomMappingTraits<llvm::yaml::Hide<std::map<int, double, std::less<int>, std::allocator<std::pair<const int, double>>>>>::inputOne' requested here
      CustomMappingTraits<T>::inputOne(io, key, Val);
                              ^
/tmp/harmen/spack-stage/spack-stage-rocblas-3.5.0-mkl42uc35yef75uwoxnphdynuk2uzbfi/spack-build/virtualenv/lib/python3.7/site-packages/Tensile/Source/lib/include/Tensile/llvm/YAML.hpp:96:13: note: in instantiation of function template specialization 'llvm::yaml::yamlize<llvm::yaml::Hide<std::map<int, double, std::less<int>, std::allocator<std::pair<const int, double>>>>>' requested here
            yamlize(io, hide, b, ctx);
            ^
/home/harmen/spack/opt/spack/linux-ubuntu20.04-zen2/gcc-9.3.0/llvm-amdgpu-3.5.0-2pca4hvik6g3rkshkawjtzhwmvv7xgvi/include/llvm/Support/YAMLTraits.h:945:7: note: in instantiation of function template specialization 'llvm::yaml::yamlize<std::map<int, double, std::less<int>, std::allocator<std::pair<const int, double>>>>' requested here
      yamlize(*this, Val, Required, Ctx);
      ^
/home/harmen/spack/opt/spack/linux-ubuntu20.04-zen2/gcc-9.3.0/llvm-amdgpu-3.5.0-2pca4hvik6g3rkshkawjtzhwmvv7xgvi/include/llvm/Support/YAMLTraits.h:888:11: note: in instantiation of function template specialization 'llvm::yaml::IO::processKey<std::map<int, double, std::less<int>, std::allocator<std::pair<const int, double>>>, llvm::yaml::EmptyContext>' requested here
    this->processKey(Key, Val, false, Ctx);
          ^
/home/harmen/spack/opt/spack/linux-ubuntu20.04-zen2/gcc-9.3.0/llvm-amdgpu-3.5.0-2pca4hvik6g3rkshkawjtzhwmvv7xgvi/include/llvm/Support/YAMLTraits.h:861:5: note: in instantiation of function template specialization 'llvm::yaml::IO::mapOptionalWithContext<std::map<int, double, std::less<int>, std::allocator<std::pair<const int, double>>>, llvm::yaml::EmptyContext>' requested here
    mapOptionalWithContext(Key, Val, Ctx);
    ^
/tmp/harmen/spack-stage/spack-stage-rocblas-3.5.0-mkl42uc35yef75uwoxnphdynuk2uzbfi/spack-build/virtualenv/lib/python3.7/site-packages/Tensile/Source/lib/include/Tensile/llvm/YAML.hpp:143:20: note: (skipping 11 contexts in backtrace; use -ftemplate-backtrace-limit=0 to see all)
                io.mapOptional(key, obj);
                   ^
/home/harmen/spack/opt/spack/linux-ubuntu20.04-zen2/gcc-9.3.0/llvm-amdgpu-3.5.0-2pca4hvik6g3rkshkawjtzhwmvv7xgvi/include/llvm/Support/YAMLTraits.h:851:11: note: in instantiation of function template specialization 'llvm::yaml::IO::processKey<std::vector<std::shared_ptr<Tensile::ContractionSolution>, std::allocator<std::shared_ptr<Tensile::ContractionSolution>>>, llvm::yaml::EmptyContext>' requested here
    this->processKey(Key, Val, true, Ctx);
          ^
/tmp/harmen/spack-stage/spack-stage-rocblas-3.5.0-mkl42uc35yef75uwoxnphdynuk2uzbfi/spack-build/virtualenv/lib/python3.7/site-packages/Tensile/Source/lib/include/Tensile/llvm/YAML.hpp:131:20: note: in instantiation of function template specialization 'llvm::yaml::IO::mapRequired<std::vector<std::shared_ptr<Tensile::ContractionSolution>, std::allocator<std::shared_ptr<Tensile::ContractionSolution>>>>' requested here
                io.mapRequired(key, obj);
                   ^
/tmp/harmen/spack-stage/spack-stage-rocblas-3.5.0-mkl42uc35yef75uwoxnphdynuk2uzbfi/spack-build/virtualenv/lib/python3.7/site-packages/Tensile/Source/lib/include/Tensile/Serialization/SolutionLibrary.hpp:148:22: note: in instantiation of function template specialization 'Tensile::Serialization::IOTraits<llvm::yaml::IO>::mapRequired<std::vector<std::shared_ptr<Tensile::ContractionSolution>, std::allocator<std::shared_ptr<Tensile::ContractionSolution>>>>' requested here
                iot::mapRequired(io, "solutions", solutions);
                     ^
/tmp/harmen/spack-stage/spack-stage-rocblas-3.5.0-mkl42uc35yef75uwoxnphdynuk2uzbfi/spack-build/virtualenv/lib/python3.7/site-packages/Tensile/Source/lib/include/Tensile/Serialization/Base.hpp:209:46: note: in instantiation of member function 'Tensile::Serialization::MappingTraits<Tensile::MasterSolutionLibrary<Tensile::ContractionProblem, Tensile::ContractionSolution>, llvm::yaml::IO, Tensile::Serialization::EmptyContext>::mapping' requested here
                MappingTraits<Subclass, IO>::mapping(io, *sc);
                                             ^
/tmp/harmen/spack-stage/spack-stage-rocblas-3.5.0-mkl42uc35yef75uwoxnphdynuk2uzbfi/spack-build/virtualenv/lib/python3.7/site-packages/Tensile/Source/lib/include/Tensile/llvm/YAML.hpp:265:52: note: in instantiation of function template specialization 'Tensile::Serialization::PointerMappingTraits<Tensile::MasterSolutionLibrary<Tensile::ContractionProblem, Tensile::ContractionSolution>, llvm::yaml::IO, Tensile::Serialization::EmptyContext>::mapping<Tensile::MasterSolutionLibrary<Tensile::ContractionProblem, Tensile::ContractionSolution>>' requested here
                sn::PointerMappingTraits<obj, IO>::mapping(io, o);
                                                   ^
/usr/lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/bits/basic_string.h:448:7: note: candidate constructor not viable: no known conversion from 'llvm::StringRef' to 'const std::__cxx11::basic_string<char> &' for 1st argument
      basic_string(const basic_string& __str)
      ^
/usr/lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/bits/basic_string.h:525:7: note: candidate constructor not viable: no known conversion from 'llvm::StringRef' to 'const char *' for 1st argument
      basic_string(const _CharT* __s, const _Alloc& __a = _Alloc())
      ^
/usr/lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/bits/basic_string.h:552:7: note: candidate constructor not viable: no known conversion from 'llvm::StringRef' to 'std::__cxx11::basic_string<char> &&' for 1st argument
      basic_string(basic_string&& __str) noexcept
      ^
/usr/lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/bits/basic_string.h:579:7: note: candidate constructor not viable: no known conversion from 'llvm::StringRef' to 'initializer_list<char>' for 1st argument
      basic_string(initializer_list<_CharT> __l, const _Alloc& __a = _Alloc())
      ^
/usr/lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/bits/basic_string.h:440:7: note: explicit constructor is not a candidate
      basic_string(const _Alloc& __a) _GLIBCXX_NOEXCEPT
      ^
/home/harmen/spack/opt/spack/linux-ubuntu20.04-zen2/gcc-9.3.0/llvm-amdgpu-3.5.0-2pca4hvik6g3rkshkawjtzhwmvv7xgvi/include/llvm/ADT/StringRef.h:275:14: note: explicit conversion function is not a candidate
    explicit operator std::string() const { return str(); }
             ^
/tmp/harmen/spack-stage/spack-stage-rocblas-3.5.0-mkl42uc35yef75uwoxnphdynuk2uzbfi/spack-build/virtualenv/lib/python3.7/site-packages/Tensile/Source/lib/include/Tensile/Serialization/Containers.hpp:83:62: note: passing argument to parameter 'keyStr' here
            static void inputOne(IO & io, std::string const& keyStr, Map & value)
                                                             ^
In file included from /tmp/harmen/spack-stage/spack-stage-rocblas-3.5.0-mkl42uc35yef75uwoxnphdynuk2uzbfi/spack-build/virtualenv/lib/python3.7/site-packages/Tensile/Source/lib/source/llvm/YAML.cpp:2:
/tmp/harmen/spack-stage/spack-stage-rocblas-3.5.0-mkl42uc35yef75uwoxnphdynuk2uzbfi/spack-build/virtualenv/lib/python3.7/site-packages/Tensile/Source/lib/include/Tensile/llvm/YAML.hpp:249:36: error: no viable conversion from 'llvm::StringRef' to 'const std::string' (aka 'const basic_string<char>')
                Impl::inputOne(io, key, *value);
                                   ^~~
/home/harmen/spack/opt/spack/linux-ubuntu20.04-zen2/gcc-9.3.0/llvm-amdgpu-3.5.0-2pca4hvik6g3rkshkawjtzhwmvv7xgvi/include/llvm/Support/YAMLTraits.h:1101:31: note: in instantiation of member function 'llvm::yaml::CustomMappingTraits<llvm::yaml::Hide<std::unordered_map<std::__cxx11::basic_string<char>, std::shared_ptr<Tensile::SolutionLibrary<Tensile::ContractionProblem, Tensile::ContractionSolution>>, std::hash<std::string>, std::equal_to<std::__cxx11::basic_string<char>>, std::allocator<std::pair<const std::__cxx11::basic_string<char>, std::shared_ptr<Tensile::SolutionLibrary<Tensile::ContractionProblem, Tensile::ContractionSolution>>>>>>>::inputOne' requested here
      CustomMappingTraits<T>::inputOne(io, key, Val);
                              ^
/tmp/harmen/spack-stage/spack-stage-rocblas-3.5.0-mkl42uc35yef75uwoxnphdynuk2uzbfi/spack-build/virtualenv/lib/python3.7/site-packages/Tensile/Source/lib/include/Tensile/llvm/YAML.hpp:96:13: note: in instantiation of function template specialization 'llvm::yaml::yamlize<llvm::yaml::Hide<std::unordered_map<std::__cxx11::basic_string<char>, std::shared_ptr<Tensile::SolutionLibrary<Tensile::ContractionProblem, Tensile::ContractionSolution>>, std::hash<std::string>, std::equal_to<std::__cxx11::basic_string<char>>, std::allocator<std::pair<const std::__cxx11::basic_string<char>, std::shared_ptr<Tensile::SolutionLibrary<Tensile::ContractionProblem, Tensile::ContractionSolution>>>>>>>' requested here
            yamlize(io, hide, b, ctx);
            ^
/home/harmen/spack/opt/spack/linux-ubuntu20.04-zen2/gcc-9.3.0/llvm-amdgpu-3.5.0-2pca4hvik6g3rkshkawjtzhwmvv7xgvi/include/llvm/Support/YAMLTraits.h:945:7: note: in instantiation of function template specialization 'llvm::yaml::yamlize<std::unordered_map<std::__cxx11::basic_string<char>, std::shared_ptr<Tensile::SolutionLibrary<Tensile::ContractionProblem, Tensile::ContractionSolution>>, std::hash<std::string>, std::equal_to<std::__cxx11::basic_string<char>>, std::allocator<std::pair<const std::__cxx11::basic_string<char>, std::shared_ptr<Tensile::SolutionLibrary<Tensile::ContractionProblem, Tensile::ContractionSolution>>>>>>' requested here
      yamlize(*this, Val, Required, Ctx);
      ^
/home/harmen/spack/opt/spack/linux-ubuntu20.04-zen2/gcc-9.3.0/llvm-amdgpu-3.5.0-2pca4hvik6g3rkshkawjtzhwmvv7xgvi/include/llvm/Support/YAMLTraits.h:851:11: note: in instantiation of function template specialization 'llvm::yaml::IO::processKey<std::unordered_map<std::__cxx11::basic_string<char>, std::shared_ptr<Tensile::SolutionLibrary<Tensile::ContractionProblem, Tensile::ContractionSolution>>, std::hash<std::string>, std::equal_to<std::__cxx11::basic_string<char>>, std::allocator<std::pair<const std::__cxx11::basic_string<char>, std::shared_ptr<Tensile::SolutionLibrary<Tensile::ContractionProblem, Tensile::ContractionSolution>>>>>, llvm::yaml::EmptyContext>' requested here
    this->processKey(Key, Val, true, Ctx);
          ^
/tmp/harmen/spack-stage/spack-stage-rocblas-3.5.0-mkl42uc35yef75uwoxnphdynuk2uzbfi/spack-build/virtualenv/lib/python3.7/site-packages/Tensile/Source/lib/include/Tensile/llvm/YAML.hpp:131:20: note: in instantiation of function template specialization 'llvm::yaml::IO::mapRequired<std::unordered_map<std::__cxx11::basic_string<char>, std::shared_ptr<Tensile::SolutionLibrary<Tensile::ContractionProblem, Tensile::ContractionSolution>>, std::hash<std::string>, std::equal_to<std::__cxx11::basic_string<char>>, std::allocator<std::pair<const std::__cxx11::basic_string<char>, std::shared_ptr<Tensile::SolutionLibrary<Tensile::ContractionProblem, Tensile::ContractionSolution>>>>>>' requested here
                io.mapRequired(key, obj);
                   ^
/tmp/harmen/spack-stage/spack-stage-rocblas-3.5.0-mkl42uc35yef75uwoxnphdynuk2uzbfi/spack-build/virtualenv/lib/python3.7/site-packages/Tensile/Source/lib/include/Tensile/Serialization/MapLibrary.hpp:48:22: note: (skipping 12 contexts in backtrace; use -ftemplate-backtrace-limit=0 to see all)
                iot::mapRequired(io, "map",      lib.map);
                     ^
/home/harmen/spack/opt/spack/linux-ubuntu20.04-zen2/gcc-9.3.0/llvm-amdgpu-3.5.0-2pca4hvik6g3rkshkawjtzhwmvv7xgvi/include/llvm/Support/YAMLTraits.h:851:11: note: in instantiation of function template specialization 'llvm::yaml::IO::processKey<std::shared_ptr<Tensile::SolutionLibrary<Tensile::ContractionProblem, Tensile::ContractionSolution>>, llvm::yaml::EmptyContext>' requested here
    this->processKey(Key, Val, true, Ctx);
          ^
/tmp/harmen/spack-stage/spack-stage-rocblas-3.5.0-mkl42uc35yef75uwoxnphdynuk2uzbfi/spack-build/virtualenv/lib/python3.7/site-packages/Tensile/Source/lib/include/Tensile/llvm/YAML.hpp:131:20: note: in instantiation of function template specialization 'llvm::yaml::IO::mapRequired<std::shared_ptr<Tensile::SolutionLibrary<Tensile::ContractionProblem, Tensile::ContractionSolution>>>' requested here
                io.mapRequired(key, obj);
                   ^
/tmp/harmen/spack-stage/spack-stage-rocblas-3.5.0-mkl42uc35yef75uwoxnphdynuk2uzbfi/spack-build/virtualenv/lib/python3.7/site-packages/Tensile/Source/lib/include/Tensile/Serialization/SolutionLibrary.hpp:173:22: note: in instantiation of function template specialization 'Tensile::Serialization::IOTraits<llvm::yaml::IO>::mapRequired<std::shared_ptr<Tensile::SolutionLibrary<Tensile::ContractionProblem, Tensile::ContractionSolution>>>' requested here
                iot::mapRequired(io, "library", innerLibrary);
                     ^
/tmp/harmen/spack-stage/spack-stage-rocblas-3.5.0-mkl42uc35yef75uwoxnphdynuk2uzbfi/spack-build/virtualenv/lib/python3.7/site-packages/Tensile/Source/lib/include/Tensile/Serialization/Base.hpp:209:46: note: in instantiation of member function 'Tensile::Serialization::MappingTraits<Tensile::MasterSolutionLibrary<Tensile::ContractionProblem, Tensile::ContractionSolution>, llvm::yaml::IO, Tensile::Serialization::EmptyContext>::mapping' requested here
                MappingTraits<Subclass, IO>::mapping(io, *sc);
                                             ^
/tmp/harmen/spack-stage/spack-stage-rocblas-3.5.0-mkl42uc35yef75uwoxnphdynuk2uzbfi/spack-build/virtualenv/lib/python3.7/site-packages/Tensile/Source/lib/include/Tensile/llvm/YAML.hpp:265:52: note: in instantiation of function template specialization 'Tensile::Serialization::PointerMappingTraits<Tensile::MasterSolutionLibrary<Tensile::ContractionProblem, Tensile::ContractionSolution>, llvm::yaml::IO, Tensile::Serialization::EmptyContext>::mapping<Tensile::MasterSolutionLibrary<Tensile::ContractionProblem, Tensile::ContractionSolution>>' requested here
                sn::PointerMappingTraits<obj, IO>::mapping(io, o);
                                                   ^
/usr/lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/bits/basic_string.h:448:7: note: candidate constructor not viable: no known conversion from 'llvm::StringRef' to 'const std::__cxx11::basic_string<char> &' for 1st argument
      basic_string(const basic_string& __str)
      ^
/usr/lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/bits/basic_string.h:525:7: note: candidate constructor not viable: no known conversion from 'llvm::StringRef' to 'const char *' for 1st argument
      basic_string(const _CharT* __s, const _Alloc& __a = _Alloc())
      ^
/usr/lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/bits/basic_string.h:552:7: note: candidate constructor not viable: no known conversion from 'llvm::StringRef' to 'std::__cxx11::basic_string<char> &&' for 1st argument
      basic_string(basic_string&& __str) noexcept
      ^
/usr/lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/bits/basic_string.h:579:7: note: candidate constructor not viable: no known conversion from 'llvm::StringRef' to 'initializer_list<char>' for 1st argument
      basic_string(initializer_list<_CharT> __l, const _Alloc& __a = _Alloc())
      ^
/usr/lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/bits/basic_string.h:440:7: note: explicit constructor is not a candidate
      basic_string(const _Alloc& __a) _GLIBCXX_NOEXCEPT
      ^
/home/harmen/spack/opt/spack/linux-ubuntu20.04-zen2/gcc-9.3.0/llvm-amdgpu-3.5.0-2pca4hvik6g3rkshkawjtzhwmvv7xgvi/include/llvm/ADT/StringRef.h:275:14: note: explicit conversion function is not a candidate
    explicit operator std::string() const { return str(); }
             ^
/tmp/harmen/spack-stage/spack-stage-rocblas-3.5.0-mkl42uc35yef75uwoxnphdynuk2uzbfi/spack-build/virtualenv/lib/python3.7/site-packages/Tensile/Source/lib/include/Tensile/Serialization/Containers.hpp:83:62: note: passing argument to parameter 'keyStr' here
            static void inputOne(IO & io, std::string const& keyStr, Map & value)
                                                             ^
In file included from /tmp/harmen/spack-stage/spack-stage-rocblas-3.5.0-mkl42uc35yef75uwoxnphdynuk2uzbfi/spack-build/virtualenv/lib/python3.7/site-packages/Tensile/Source/lib/source/llvm/Loading.cpp:29:
/tmp/harmen/spack-stage/spack-stage-rocblas-3.5.0-mkl42uc35yef75uwoxnphdynuk2uzbfi/spack-build/virtualenv/lib/python3.7/site-packages/Tensile/Source/lib/include/Tensile/llvm/YAML.hpp:249:36: error: no viable conversion from 'llvm::StringRef' to 'const std::string' (aka 'const basic_string<char>')
                Impl::inputOne(io, key, *value);
                                   ^~~
/home/harmen/spack/opt/spack/linux-ubuntu20.04-zen2/gcc-9.3.0/llvm-amdgpu-3.5.0-2pca4hvik6g3rkshkawjtzhwmvv7xgvi/include/llvm/Support/YAMLTraits.h:1101:31: note: in instantiation of member function 'llvm::yaml::CustomMappingTraits<llvm::yaml::Hide<std::unordered_map<std::__cxx11::basic_string<char>, std::shared_ptr<Tensile::SolutionLibrary<Tensile::ContractionProblem, Tensile::ContractionSolution>>, std::hash<std::string>, std::equal_to<std::__cxx11::basic_string<char>>, std::allocator<std::pair<const std::__cxx11::basic_string<char>, std::shared_ptr<Tensile::SolutionLibrary<Tensile::ContractionProblem, Tensile::ContractionSolution>>>>>>>::inputOne' requested here
      CustomMappingTraits<T>::inputOne(io, key, Val);
                              ^
/tmp/harmen/spack-stage/spack-stage-rocblas-3.5.0-mkl42uc35yef75uwoxnphdynuk2uzbfi/spack-build/virtualenv/lib/python3.7/site-packages/Tensile/Source/lib/include/Tensile/llvm/YAML.hpp:96:13: note: in instantiation of function template specialization 'llvm::yaml::yamlize<llvm::yaml::Hide<std::unordered_map<std::__cxx11::basic_string<char>, std::shared_ptr<Tensile::SolutionLibrary<Tensile::ContractionProblem, Tensile::ContractionSolution>>, std::hash<std::string>, std::equal_to<std::__cxx11::basic_string<char>>, std::allocator<std::pair<const std::__cxx11::basic_string<char>, std::shared_ptr<Tensile::SolutionLibrary<Tensile::ContractionProblem, Tensile::ContractionSolution>>>>>>>' requested here
            yamlize(io, hide, b, ctx);
            ^
/home/harmen/spack/opt/spack/linux-ubuntu20.04-zen2/gcc-9.3.0/llvm-amdgpu-3.5.0-2pca4hvik6g3rkshkawjtzhwmvv7xgvi/include/llvm/Support/YAMLTraits.h:945:7: note: in instantiation of function template specialization 'llvm::yaml::yamlize<std::unordered_map<std::__cxx11::basic_string<char>, std::shared_ptr<Tensile::SolutionLibrary<Tensile::ContractionProblem, Tensile::ContractionSolution>>, std::hash<std::string>, std::equal_to<std::__cxx11::basic_string<char>>, std::allocator<std::pair<const std::__cxx11::basic_string<char>, std::shared_ptr<Tensile::SolutionLibrary<Tensile::ContractionProblem, Tensile::ContractionSolution>>>>>>' requested here
      yamlize(*this, Val, Required, Ctx);
      ^
/home/harmen/spack/opt/spack/linux-ubuntu20.04-zen2/gcc-9.3.0/llvm-amdgpu-3.5.0-2pca4hvik6g3rkshkawjtzhwmvv7xgvi/include/llvm/Support/YAMLTraits.h:851:11: note: in instantiation of function template specialization 'llvm::yaml::IO::processKey<std::unordered_map<std::__cxx11::basic_string<char>, std::shared_ptr<Tensile::SolutionLibrary<Tensile::ContractionProblem, Tensile::ContractionSolution>>, std::hash<std::string>, std::equal_to<std::__cxx11::basic_string<char>>, std::allocator<std::pair<const std::__cxx11::basic_string<char>, std::shared_ptr<Tensile::SolutionLibrary<Tensile::ContractionProblem, Tensile::ContractionSolution>>>>>, llvm::yaml::EmptyContext>' requested here
    this->processKey(Key, Val, true, Ctx);
          ^
/tmp/harmen/spack-stage/spack-stage-rocblas-3.5.0-mkl42uc35yef75uwoxnphdynuk2uzbfi/spack-build/virtualenv/lib/python3.7/site-packages/Tensile/Source/lib/include/Tensile/llvm/YAML.hpp:131:20: note: in instantiation of function template specialization 'llvm::yaml::IO::mapRequired<std::unordered_map<std::__cxx11::basic_string<char>, std::shared_ptr<Tensile::SolutionLibrary<Tensile::ContractionProblem, Tensile::ContractionSolution>>, std::hash<std::string>, std::equal_to<std::__cxx11::basic_string<char>>, std::allocator<std::pair<const std::__cxx11::basic_string<char>, std::shared_ptr<Tensile::SolutionLibrary<Tensile::ContractionProblem, Tensile::ContractionSolution>>>>>>' requested here
                io.mapRequired(key, obj);
                   ^
/tmp/harmen/spack-stage/spack-stage-rocblas-3.5.0-mkl42uc35yef75uwoxnphdynuk2uzbfi/spack-build/virtualenv/lib/python3.7/site-packages/Tensile/Source/lib/include/Tensile/Serialization/MapLibrary.hpp:48:22: note: (skipping 12 contexts in backtrace; use -ftemplate-backtrace-limit=0 to see all)
                iot::mapRequired(io, "map",      lib.map);
                     ^
/home/harmen/spack/opt/spack/linux-ubuntu20.04-zen2/gcc-9.3.0/llvm-amdgpu-3.5.0-2pca4hvik6g3rkshkawjtzhwmvv7xgvi/include/llvm/Support/YAMLTraits.h:851:11: note: in instantiation of function template specialization 'llvm::yaml::IO::processKey<std::shared_ptr<Tensile::SolutionLibrary<Tensile::ContractionProblem, Tensile::ContractionSolution>>, llvm::yaml::EmptyContext>' requested here
    this->processKey(Key, Val, true, Ctx);
          ^
/tmp/harmen/spack-stage/spack-stage-rocblas-3.5.0-mkl42uc35yef75uwoxnphdynuk2uzbfi/spack-build/virtualenv/lib/python3.7/site-packages/Tensile/Source/lib/include/Tensile/llvm/YAML.hpp:131:20: note: in instantiation of function template specialization 'llvm::yaml::IO::mapRequired<std::shared_ptr<Tensile::SolutionLibrary<Tensile::ContractionProblem, Tensile::ContractionSolution>>>' requested here
                io.mapRequired(key, obj);
                   ^
/tmp/harmen/spack-stage/spack-stage-rocblas-3.5.0-mkl42uc35yef75uwoxnphdynuk2uzbfi/spack-build/virtualenv/lib/python3.7/site-packages/Tensile/Source/lib/include/Tensile/Serialization/SolutionLibrary.hpp:173:22: note: in instantiation of function template specialization 'Tensile::Serialization::IOTraits<llvm::yaml::IO>::mapRequired<std::shared_ptr<Tensile::SolutionLibrary<Tensile::ContractionProblem, Tensile::ContractionSolution>>>' requested here
                iot::mapRequired(io, "library", innerLibrary);
                     ^
/tmp/harmen/spack-stage/spack-stage-rocblas-3.5.0-mkl42uc35yef75uwoxnphdynuk2uzbfi/spack-build/virtualenv/lib/python3.7/site-packages/Tensile/Source/lib/include/Tensile/Serialization/Base.hpp:209:46: note: in instantiation of member function 'Tensile::Serialization::MappingTraits<Tensile::MasterSolutionLibrary<Tensile::ContractionProblem, Tensile::ContractionSolution>, llvm::yaml::IO, Tensile::Serialization::EmptyContext>::mapping' requested here
                MappingTraits<Subclass, IO>::mapping(io, *sc);
                                             ^
/tmp/harmen/spack-stage/spack-stage-rocblas-3.5.0-mkl42uc35yef75uwoxnphdynuk2uzbfi/spack-build/virtualenv/lib/python3.7/site-packages/Tensile/Source/lib/include/Tensile/llvm/YAML.hpp:265:52: note: in instantiation of function template specialization 'Tensile::Serialization::PointerMappingTraits<Tensile::MasterSolutionLibrary<Tensile::ContractionProblem, Tensile::ContractionSolution>, llvm::yaml::IO, Tensile::Serialization::EmptyContext>::mapping<Tensile::MasterSolutionLibrary<Tensile::ContractionProblem, Tensile::ContractionSolution>>' requested here
                sn::PointerMappingTraits<obj, IO>::mapping(io, o);
                                                   ^
/usr/lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/bits/basic_string.h:448:7: note: candidate constructor not viable: no known conversion from 'llvm::StringRef' to 'const std::__cxx11::basic_string<char> &' for 1st argument
      basic_string(const basic_string& __str)
      ^
/usr/lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/bits/basic_string.h:525:7: note: candidate constructor not viable: no known conversion from 'llvm::StringRef' to 'const char *' for 1st argument
      basic_string(const _CharT* __s, const _Alloc& __a = _Alloc())
      ^
/usr/lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/bits/basic_string.h:552:7: note: candidate constructor not viable: no known conversion from 'llvm::StringRef' to 'std::__cxx11::basic_string<char> &&' for 1st argument
      basic_string(basic_string&& __str) noexcept
      ^
/usr/lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/bits/basic_string.h:579:7: note: candidate constructor not viable: no known conversion from 'llvm::StringRef' to 'initializer_list<char>' for 1st argument
      basic_string(initializer_list<_CharT> __l, const _Alloc& __a = _Alloc())
      ^
/usr/lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/bits/basic_string.h:440:7: note: explicit constructor is not a candidate
      basic_string(const _Alloc& __a) _GLIBCXX_NOEXCEPT
      ^
/home/harmen/spack/opt/spack/linux-ubuntu20.04-zen2/gcc-9.3.0/llvm-amdgpu-3.5.0-2pca4hvik6g3rkshkawjtzhwmvv7xgvi/include/llvm/ADT/StringRef.h:275:14: note: explicit conversion function is not a candidate
    explicit operator std::string() const { return str(); }
             ^
/tmp/harmen/spack-stage/spack-stage-rocblas-3.5.0-mkl42uc35yef75uwoxnphdynuk2uzbfi/spack-build/virtualenv/lib/python3.7/site-packages/Tensile/Source/lib/include/Tensile/Serialization/Containers.hpp:83:62: note: passing argument to parameter 'keyStr' here
            static void inputOne(IO & io, std::string const& keyStr, Map & value)
                                                             ^
