Thank you for your work. Can you teach me how to create an s,d,c,z GEMM non-batched library? (about ROCmSoftwarePlatform/Tensile)

Comments (6)

gstoner commented on July 27, 2024

First, we should explain what Tensile is and is not. Tensile is a tool we use to generate x-GEMM kernels that we leverage in our library development. As a tool, Tensile can generate OpenCL, HIP, or GCN assembly kernels. For rocBLAS we develop HIP and GCN ISA kernels. Note that the AMDGPU-PRO driver, as of the 18.10 driver, cannot support the GCN assembly kernels, since it does not use the native LLVM compiler, which has the GCN ISA support.

What Tensile is not is a BLAS library interface. What you are really asking is to create a new OpenCL-based BLAS library with an interface compatible with clBLAS. clBLAS as a library would need to be gutted to do what you are asking. It would be better to use the current rocBLAS library as a reference for the new library, but with OpenCL kernels generated by Tensile.

One of the best places to start learning how to use Tensile is here: https://github.com/ROCmSoftwarePlatform/Tensile/wiki

rocBLAS is based on what we learned from clBLAS.

Tensile Library

The Tensile API, Tensile.h, is confined to C89 so that it will be usable by most software. The code behind the API is allowed to be C++11.

Device Languages (https://github.com/ROCmSoftwarePlatform/Tensile/wiki/Languages#device-languages)

The device languages Tensile supports for the GPU kernels are:

* OpenCL 1.2
* HIP
* Assembly
  * gfx803
  * gfx900

Greg

On Mar 1, 2018, at 12:49 AM, paolodalberto wrote:

I would like to use Tensile to create GEMM codes (like clBLAS) so as to have one library per device: Fiji, Polaris, and Vega. The goal is to use them in an environment where multiple devices are available, using amdgpu-pro with legacy and ROCm. I can play with the package and I can create libraries and the tests, but I can do it only using the Tensile.py code. I have an idea of how to create the profiles for the z and c GEMM. I noticed that rocBLAS is built using Tensile.

My understanding is that there will be multiple choices to use at run time for the same problem and for different sizes (and thus for different devices). However, I have no clear understanding of how to build the library and then use it the way I used to use clBLAS. Be patient with me, and please let me know if you are interested in helping me.

I envision building three Tensile libraries for OpenCL: Fiji, Polaris, and Vega (and future devices), using a clBLAS interface with basically four GEMMs each. For every device there will be a queue/stream with a platform. The GEMM can take a parameter specifying the device, or the GEMM name can be different, so as to address the correct algorithm (architecture and problem sizes). Any of these GEMMs will be called in parallel, each on a different device.

Please, can you teach me to create a library where there is one GEMM function, either at a low level, where I will take care of the data movement using standard OpenCL (and soon sharing data among GPUs), or at a high level, where I specify the device?

Cheers,
Paolo

from tensile.

tingxingdong commented on July 27, 2024

You do not need to differentiate between Fiji and Polaris. They share the same GFX803 architecture, while Vega is GFX900. A kernel that is optimal for Fiji should also be optimal for Polaris. This helps you reduce the amount of work.
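As a rough illustration of the point above, the device-to-architecture grouping can be sketched in a few lines of Python. This is only a sketch: the table `DEVICE_TO_ISA` and the helper `libraries_to_build` are hypothetical names, not part of Tensile, and the mapping contains only the three devices mentioned in this thread.

```python
# Hypothetical lookup table (not part of Tensile): marketing name -> ISA target,
# using only the devices and architectures named in this thread.
DEVICE_TO_ISA = {
    "fiji": "gfx803",
    "polaris": "gfx803",
    "vega": "gfx900",
}


def libraries_to_build(devices):
    """Return the sorted, de-duplicated ISA targets needed for `devices`."""
    targets = set()
    for name in devices:
        key = name.strip().lower()
        if key not in DEVICE_TO_ISA:
            raise ValueError(f"unknown device: {name!r}")
        targets.add(DEVICE_TO_ISA[key])
    return sorted(targets)


if __name__ == "__main__":
    # Three devices collapse to two architectures, so two tuning runs.
    print(libraries_to_build(["Fiji", "Polaris", "Vega"]))
```

Under this assumption, three devices need only two libraries, one per architecture.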


paolodalberto commented on July 27, 2024

Thank you for the reply. I was not clear. Let me work on my question.

Background: I am looking for sGEMM, dGEMM, cGEMM, and zGEMM: classic GEMMs without batching (I may add batched versions in the future). My library is currently built using C and OpenCL 1.2, and it calls the *GEMM routines from clBLAS, with code generated for Fiji (GFX803). clBLAS provides a clean and well-known interface. I manage the data movement and then link the library to my application.

My understanding: Tensile creates a library of methods by an empirical search, so different devices may have different winners even though they share the same architecture. At least, I think you explore a space and select winners. In that case, I am willing to explore each device.
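To make the "explore a space and select winners" idea concrete, here is a toy sketch in Python. It is not Tensile's code: the function `pick_winners`, the kernel names, and the timing numbers are all made up for illustration, and the real search is more involved than a maximum over a list.

```python
def pick_winners(results):
    """Pick the fastest kernel for each problem size.

    `results` is an iterable of (problem_size, kernel_name, gflops) tuples.
    Returns a dict mapping problem_size -> winning kernel_name.
    """
    best = {}
    for size, kernel, gflops in results:
        if size not in best or gflops > best[size][1]:
            best[size] = (kernel, gflops)
    return {size: kernel for size, (kernel, _) in best.items()}


# Made-up benchmark numbers for one device; sizes are (m, n, k).
example_results = [
    ((1024, 1024, 1024), "kernel_a", 3100.0),
    ((1024, 1024, 1024), "kernel_b", 3400.0),
    ((128, 128, 128), "kernel_a", 900.0),
    ((128, 128, 128), "kernel_b", 700.0),
]

if __name__ == "__main__":
    for size, kernel in pick_winners(example_results).items():
        print(size, "->", kernel)
```

Running the same selection on benchmark data from a second device would show whether the winners differ between devices of the same architecture.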

gstoner: Note that the AMDGPU-PRO driver, as of the 18.10 driver, cannot support the GCN assembly kernels, since it does not use the native LLVM compiler, which has the GCN ISA support.

I recently installed Radeon™ Software for Linux® Driver Version 17.50 for Ubuntu 16.04.3. I could run the experiments and create a client using Tensile.py. I do not understand the statement above. Are you saying that I cannot create the library and use it apart from the experiments created by Tensile (HIP only)?

Note: As long as I can use OpenCL to call the final result, I will be very happy. I will be able to reuse my OpenCL code, and I can work with other devices that are not GPUs. But if the only way is to use ROCm and HIP, I will work on introducing a new interface for the new requirements. Either way, it is moving forward.

I am asking to learn how to create a self-contained s,d,c,z GEMM library for a device, in such a way that I can link it to an application written in C using an OpenCL interface. I would rather customize the call for the device a priori or at run time. Looking at Client.cpp and Client.h in 4_LibraryClient, I completely miss the methods that are called to execute the computational kernel, although I can follow most of the data preparation (maybe because I already know how to do it).

I hope this time I have clarified my request and expressed my ignorance. Would you mind adding a tutorial to address my request? For example, after we build the sgemm and its libtensile.a, what is the interface to call the OpenCL sgemm function, if there is one?

Note: clBLAS used to have sample code for sgemm in C and C++. The code was clear (not short), but everything was there to understand how to reuse it in a different scenario. This will help me use other OpenCL implementations for devices that are not GPUs.

Please do not hesitate to contact me directly if you wish to ask me to do anything in particular.


gstoner commented on July 27, 2024

17.50 is still shipping with ROCm support.

G


paolodalberto commented on July 27, 2024

gstoner: 17.50 still shipping with ROCm support.

Legacy and ROCm, yep.

What about a tutorial? Is it worth asking?


paolodalberto commented on July 27, 2024

The answer is no, so let us move on (no mixed devices).
Next will be rocBLAS, then. I installed it and ran the first sgemm.

sgemm example
NT: m, n, k, lda, ldb, ldc = 1023, 1024, 1025, 1023, 1024, 1023
PASS: max_relative_error = 1.17549e-38
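For readers unfamiliar with the notation in that output, the following pure-Python sketch shows what an NT GEMM with leading dimensions computes under the usual column-major BLAS convention: A is m x k (not transposed, lda >= m), B is stored as n x k (used transposed, ldb >= n), and C is m x n (ldc >= m). The values lda = 1023, ldb = 1024, ldc = 1023 above are consistent with that reading. The function names are hypothetical, and the error metric is an assumption; the rocBLAS sample may define max_relative_error differently.

```python
def sgemm_nt_reference(m, n, k, alpha, a, lda, b, ldb, beta, c, ldc):
    """Compute C = alpha * A * B^T + beta * C in place (column-major).

    a holds A (m x k) with leading dimension lda >= m.
    b holds B (n x k) with leading dimension ldb >= n.
    c holds C (m x n) with leading dimension ldc >= m.
    """
    for j in range(n):
        for i in range(m):
            acc = 0.0
            for p in range(k):
                # A[i, p] * B[j, p]: B is read as if transposed.
                acc += a[i + p * lda] * b[j + p * ldb]
            c[i + j * ldc] = alpha * acc + beta * c[i + j * ldc]


def max_relative_error(got, want):
    """Largest |got - want| / |want| (assumed metric; denominator 1 if want is 0)."""
    worst = 0.0
    for g, w in zip(got, want):
        denom = abs(w) if w != 0.0 else 1.0
        worst = max(worst, abs(g - w) / denom)
    return worst


if __name__ == "__main__":
    # A = [[1, 2], [3, 4]] and B = [[5, 6], [7, 8]], both stored column-major.
    a = [1.0, 3.0, 2.0, 4.0]
    b = [5.0, 7.0, 6.0, 8.0]
    c = [0.0] * 4
    sgemm_nt_reference(2, 2, 2, 1.0, a, 2, b, 2, 0.0, c, 2)
    print(c)
```

A slow reference like this is only useful for checking a GPU result on small sizes; it says nothing about how rocBLAS implements the computation.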

Can I customize rocBLAS per device? (Tensile does that.)
Are the z and c GEMMs available?
