Comments (5)
from cutlass.
OK, that is to say, at least two of M, N, and K should be multiples of 16.
Thank you, I will close the issue.
from cutlass.
Thank you for your reply, Kerr. I now have a need in my job: computing C (int) = A (int8) × B (int8), where A, B, and C are all row-major matrices. A is M × K, B is K × N, and C is M × N.
I can guarantee that K is a multiple of 16, and M can be made a multiple of 16 (an arbitrary M would be best, but this is acceptable). N, however, can be any number.
How do I achieve this with CUTLASS?
I tested all layout combinations in CUTLASS. If A, B, and C are row-major, then N and K must be multiples of 16. If I instead rewrite A × B as B.trans × A.trans (selecting column-major for A, B, and C and swapping the operands), then M becomes N and N becomes M. In that case N can be chosen freely but M must be a multiple of 16, which still does not solve my problem.
Can this problem be solved with CUTLASS? It is fine if the solution does not use TensorOp, or even WMMA.
from cutlass.
Here are three possible recourses:
1.) Padding.
Size the matrices such that they are divisible by 16 elements and initialize the extra elements with zero.
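As a rough host-side illustration of the padding approach, the sketch below copies a row-major int8 matrix into a zero-initialized buffer whose dimensions are rounded up to a multiple of 16. The names `round_up`, `PaddedMatrix`, and `pad_row_major` are hypothetical helpers for this example and are not part of CUTLASS.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: round n up to the next multiple of `multiple`.
inline std::size_t round_up(std::size_t n, std::size_t multiple) {
  return (n + multiple - 1) / multiple * multiple;
}

// Hypothetical container for a padded row-major matrix.
struct PaddedMatrix {
  std::vector<int8_t> data;  // rows * cols elements, row-major
  std::size_t rows;          // padded row count
  std::size_t cols;          // padded column count (also the leading dimension)
};

// Copy a rows x cols row-major matrix into a zero-filled buffer whose
// dimensions are both multiples of `multiple`.
inline PaddedMatrix pad_row_major(const std::vector<int8_t>& src,
                                  std::size_t rows, std::size_t cols,
                                  std::size_t multiple = 16) {
  PaddedMatrix out;
  out.rows = round_up(rows, multiple);
  out.cols = round_up(cols, multiple);
  out.data.assign(out.rows * out.cols, 0);  // extra elements stay zero
  for (std::size_t r = 0; r < rows; ++r) {
    std::copy(src.begin() + r * cols, src.begin() + (r + 1) * cols,
              out.data.begin() + r * out.cols);
  }
  return out;
}
```

Because the extra elements are zero, they contribute nothing to any dot product along K, and the valid result is the top-left M × N block of the padded C.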
2.) Reduce the alignment requirement at the expense of performance.
The device-level GEMM API accepts an admittedly long list of template arguments including the alignment constraints.
https://github.com/NVIDIA/cutlass/blob/master/include/cutlass/gemm/device/gemm.h#L201
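// Assumed for this snippet: int32 output and accumulation, as in the question.
using ElementOutput = int32_t;
using ElementAccumulator = int32_t;

using Gemm = cutlass::gemm::device::Gemm<
  int8_t,                                   // element type of A
  cutlass::layout::RowMajor,                // layout of A
  int8_t,                                   // element type of B
  cutlass::layout::ColumnMajor,             // layout of B
  ElementOutput,                            // element type of C
  cutlass::layout::RowMajor,                // layout of C
  ElementAccumulator,                       // accumulator type
  cutlass::arch::OpClassTensorOp,           // use Tensor Cores
  cutlass::arch::Sm75,                      // target architecture
  cutlass::gemm::GemmShape<128, 128, 64>,   // threadblock tile
  cutlass::gemm::GemmShape<64, 64, 64>,     // warp tile
  cutlass::gemm::GemmShape<8, 8, 16>,       // instruction shape
  cutlass::epilogue::thread::LinearCombination<
    ElementOutput,
    1,                                      // alignment of C in units of elements
    ElementAccumulator,
    ElementAccumulator
  >,
  cutlass::gemm::threadblock::GemmIdentityThreadblockSwizzle,
  2,                                        // number of pipeline stages
  1,                                        // alignment of A in units of elements
  1                                         // alignment of B in units of elements
>;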
3.) Use the integer-valued SIMT kernels.
You may consider using a kernel targeting the integer dot product "dp4" instructions, available since the Pascal microarchitecture.
Here is the definition syntax, visible in unit tests for these kernels.
https://github.com/NVIDIA/cutlass/blob/master/test/unit/gemm/device/simt_int8_igemm_sm61.cu
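// Output data type - may be int8_t or int32_t
using ElementOutput = int8_t;

// Accumulator data type
using ElementAccumulator = int32_t;

// Scalar data type
using ElementCompute = float;

// Instruction shape - describes a 1x1x4 dot product computed by
// the "dp4" instruction.
using InstructionShape = cutlass::gemm::GemmShape<1, 1, 4>;

// ThreadBlockShape and WarpShape are cutlass::gemm::GemmShape<> tile sizes
// that must be defined before this point.
using Gemm = cutlass::gemm::device::Gemm<
  int8_t,                         // element type of A
  cutlass::layout::ColumnMajor,   // layout of A
  int8_t,                         // element type of B
  cutlass::layout::ColumnMajor,   // layout of B
  ElementOutput,                  // element type of C
  cutlass::layout::RowMajor,      // layout of C
  int32_t,                        // accumulator type
  cutlass::arch::OpClassSimt,     // SIMT (non-Tensor Core) kernels
  cutlass::arch::Sm61,            // target architecture
  ThreadBlockShape,
  WarpShape,
  InstructionShape
>;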
There is no restriction on M, N, or K, but the matrices themselves must be 32-bit aligned. That is, pointers and leading dimensions (in bytes) must be divisible by 4.
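The alignment condition above can be checked on the host before launching a kernel. The following is a minimal sketch, assuming int8 elements and a leading dimension expressed in elements; `is_32bit_aligned` is a hypothetical helper name, not a CUTLASS function.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: returns true if both the base pointer and the
// leading dimension (given in int8 elements) are divisible by 4 bytes.
inline bool is_32bit_aligned(const int8_t* ptr,
                             std::size_t leading_dim_elements) {
  const bool ptr_ok = reinterpret_cast<std::uintptr_t>(ptr) % 4 == 0;
  const bool ld_ok = (leading_dim_elements * sizeof(int8_t)) % 4 == 0;
  return ptr_ok && ld_ok;
}
```

Note that a leading dimension may be larger than the logical width of the matrix. For instance, a row-major matrix with 13 columns could be stored with a leading dimension of 16 so that each row starts on a 4-byte boundary.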
from cutlass.
Thank you for your help. I will close the question.
from cutlass.