Comments (6)
Yes.
All arrangements of the quadpair MMAs are possible. For example,
TiledMMA mma_4x1  = make_tiled_mma(SM70_8x8x4_F32F16F16F32_NT{},
                                   Layout<Shape <_4,_1>, Stride<_1,_0>>{});
TiledMMA mma_2x2c = make_tiled_mma(SM70_8x8x4_F32F16F16F32_NT{},
                                   Layout<Shape <_2,_2>, Stride<_1,_2>>{});
TiledMMA mma_2x2r = make_tiled_mma(SM70_8x8x4_F32F16F16F32_NT{},
                                   Layout<Shape <_2,_2>, Stride<_2,_1>>{});
TiledMMA mma_1x4  = make_tiled_mma(SM70_8x8x4_F32F16F16F32_NT{},
                                   Layout<Shape <_1,_4>, Stride<_0,_1>>{});
In fact, you can swizzle the 2x2 arrangement of quadpairs as well. Left as an exercise.
Your table is not showing the projections of the quadpairs. None of the quadpairs can communicate. I would correct it as follows:
QP0 [ 0: 4)+[16:20)
QP1 [ 4: 8)+[20:24)
QP2 [ 8:12)+[24:28)
QP3 [12:16)+[28:32)
B
| QP0&QP2 | QP1&QP3 |
-----+---------+---------+
QP0 | | |
& | QP0MMA | QP1MMA |
QP1 | | |
A -----+---------C---------+
QP2 | | |
& | QP2MMA | QP3MMA |
QP3 | | |
-----+---------+---------+
from cutlass.
Could you elaborate a bit on the projection? I don't think you are referring to the projection that strips out modes via dice, and I'm not quite sure what it should refer to.
The following suddenly makes a lot of sense to me now (warp-wise speaking, replace "&" with "/", i.e. "or"). And it is indeed the same as #1367 =)
A
| QP0/QP2 | QP1/QP3 |
-----+---------+---------+
QP0 | | |
/ | QP0MMA | QP1MMA |
QP1 | | |
B -----+---------C---------+
QP2 | | |
/ | QP2MMA | QP3MMA |
QP3 | | |
-----+---------+---------+
In TiledMMA, the peculiarity here is that C sometimes (if not always) shows more threads than A and/or B. When we get_slice into A or B, we can imagine the thread ids that are not present in the plot of LayoutA_TV or LayoutB_TV as "overflowing into an imaginary mode that grows perpendicularly out of/into the screen". This indicates a value-broadcasting case.
To wrap up, a TiledMMA always describes a computation that is carried out in a cooperative style. Combined with the previous property, at least one source operand must be broadcast.
And my peculiar 4x4 TiledMMA makes no sense at all. I wanted to replicate the standalone-style computation of 4 independent MMAs described in the PTX ISA docs; all I need is to stick to a 1x1 TiledMMA or the Atom itself =)
The threads are projected onto each other such that multiple threads are reading the same data elements in A and B.
Consider the tutorial example:
TiledMMA mma = make_tiled_mma(UniversalFMA<float>{},
                              Layout<Shape<_16,_16>>{}); // Layout of Atoms
Only T0 is displayed in A, but clearly T16, T32, T64, ... are all loading the same element.
This is related: #1367
@ccecka Thanks for the quick reply. I think #1367 is rather different, though it shed some insight that TiledMMA operates in a broadcasting way. That issue asked about the logical mapping of TV; here the question is whether or not the computation specified by the TiledMMA can be materialized.
The ThrIDs of MMAs other than the SM70 ones are all 32:1, i.e. they cover a full warp. SM70_8x8x4_F32F16F16F32_NT, however, has the quad-pair ThrID (4,2):(1,16). Due to this sub-warp nature, there are further preconditions on the instruction, which is issued warp-wide.
See https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#matrix-fragments-for-mma-m8n8k4-with-f16-floating-point-type for the following table:
| MMA computation | Threads participating in MMA computation |
|---|---|
| MMA computation 1 | Threads with %laneid 0-3 (low group) and 16-19 (high group) |
| MMA computation 2 | Threads with %laneid 4-7 (low group) and 20-23 (high group) |
| MMA computation 3 | Threads with %laneid 8-11 (low group) and 24-27 (high group) |
| MMA computation 4 | Threads with %laneid 12-15 (low group) and 28-31 (high group) |
So it seems that the computation mma(QP_x, QP_y) is valid iff x == y.
However,
TiledMMA mma = make_tiled_mma(SM70_8x8x4_F32F16F16F32_NT{},
Layout<Shape <_2,_2>,
Stride<_2,_1>>{}); // 2x2 n-major layout of Atoms
produces the computation depicted below:
QP0 [ 0: 4)+[16:20)
QP1 [ 4: 8)+[20:24)
QP2 [ 8:12)+[24:28)
QP3 [12:16)+[28:32)
A
| QP0 | QP1 |
-----+---------+---------+
| Good | Bad? |
QP0 | mma of | mma of |
|(QP0,QP0)|(QP0,QP1)|
B -----+---------C---------+
| | |
QP2 | Bad? | Bad? |
| | |
-----+---------+---------+
Read `mma of (QP0,QP1)` as: a rank-4 accumulation between data held by QP0 and data held by QP1.
The TV mapping as shown by print_latex is pretty reasonable. The question is: can the hardware materialize the computation specified by that TV mapping?
The projection of the threads being referred to here is how the thread layout of the TiledMMA maps onto the logical coordinates within A (such that multiple threads of the MMA get projected onto the same coordinate).
Thanks! After putting all of this back into the code, I think I fully grasp the idea now.