Comments (6)
Yes.
All arrangements of the quadpair MMAs are possible. For example,
TiledMMA mma_4x1  = make_tiled_mma(SM70_8x8x4_F32F16F16F32_NT{},
                                   Layout<Shape <_4,_1>, Stride<_1,_0>>{});
TiledMMA mma_2x2c = make_tiled_mma(SM70_8x8x4_F32F16F16F32_NT{},
                                   Layout<Shape <_2,_2>, Stride<_1,_2>>{});
TiledMMA mma_2x2r = make_tiled_mma(SM70_8x8x4_F32F16F16F32_NT{},
                                   Layout<Shape <_2,_2>, Stride<_2,_1>>{});
TiledMMA mma_1x4  = make_tiled_mma(SM70_8x8x4_F32F16F16F32_NT{},
                                   Layout<Shape <_1,_4>, Stride<_0,_1>>{});
In fact, you can swizzle the 2x2 arrangement of quadpairs as well. Left as an exercise.
Your table is not showing the projections of the quadpairs. None of the quadpairs can communicate. I would correct it as follows:
QP0 [ 0: 4)+[16:20)
QP1 [ 4: 8)+[20:24)
QP2 [ 8:12)+[24:28)
QP3 [12:16)+[28:32)
B
| QP0&QP2 | QP1&QP3 |
-----+---------+---------+
QP0 | | |
& | QP0MMA | QP1MMA |
QP1 | | |
A -----+---------C---------+
QP2 | | |
& | QP2MMA | QP3MMA |
QP3 | | |
-----+---------+---------+
from cutlass.
Could you elaborate a bit on the projection? I don't think you are referring to the projection that strips out modes via dice, and I'm not quite sure what it should refer to.
The following suddenly makes a lot of sense to me now (warp-wise speaking, replace "&" with "/", i.e. "or"). And it is indeed the same as #1367 =)
A
| QP0/QP2 | QP1/QP3 |
-----+---------+---------+
QP0 | | |
/ | QP0MMA | QP1MMA |
QP1 | | |
B -----+---------C---------+
QP2 | | |
/ | QP2MMA | QP3MMA |
QP3 | | |
-----+---------+---------+
In TiledMMA, the peculiarity here is that C sometimes (if not always) shows more threads than A and/or B. When we get_slice into A or B, we can imagine the thread ids that are not present in the plot of LayoutA_TV or LayoutB_TV as "overflowing into an imaginary mode that grows perpendicularly out of/into the screen". This indicates a value-broadcasting case.
To wrap up, a TiledMMA always describes a computation that is carried out in a cooperative style. Combined with the previous property, at least one source operand must be broadcast.
And my peculiar 4x4 TiledMMA makes no sense at all. I wanted to replicate the standalone-style computation of 4 independent MMAs described in the PTX ISA docs; all I need is to stick to a 1x1 TiledMMA or the Atom itself =)
The threads are projected onto each other such that multiple threads are reading the same data elements in A and B.
Consider the tutorial example:
TiledMMA mma = make_tiled_mma(UniversalFMA<float>{},
                              Layout<Shape<_16,_16>>{}); // Layout of Atoms
Only T0 is displayed in A, but clearly T16, T32, T64, ... are all loading the same element.
This is related: #1367
@ccecka Thanks for the quick reply. I think #1367 is rather different, though it shed some insight that TiledMMA operates in a broadcasting way. That issue asked about the logical mapping of TV; here the question is whether or not the computation specified by the TiledMMA can be materialized.
The ThrIDs of MMAs other than the SM70 ones are all 32:1, i.e. they cover a full warp. SM70_8x8x4_F32F16F16F32_NT, however, has the quad-pair ThrID (4,2):(1,16). Due to this sub-warp nature, there are further preconditions on the instruction, which is issued warp-wide.
See https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#matrix-fragments-for-mma-m8n8k4-with-f16-floating-point-type for the following table:
| MMA computation | Threads participating in MMA computation |
|---|---|
| MMA computation 1 | Threads with %laneid 0-3 (low group) and 16-19 (high group) |
| MMA computation 2 | Threads with %laneid 4-7 (low group) and 20-23 (high group) |
| MMA computation 3 | Threads with %laneid 8-11 (low group) and 24-27 (high group) |
| MMA computation 4 | Threads with %laneid 12-15 (low group) and 28-31 (high group) |
So it seems that the computation mma(QP_x, QP_y) is valid iff x == y.
However,
TiledMMA mma = make_tiled_mma(SM70_8x8x4_F32F16F16F32_NT{},
Layout<Shape <_2,_2>,
Stride<_2,_1>>{}); // 2x2 n-major layout of Atoms
produces the computation depicted below:
QP0 [ 0: 4)+[16:20)
QP1 [ 4: 8)+[20:24)
QP2 [ 8:12)+[24:28)
QP3 [12:16)+[28:32)
A
| QP0 | QP1 |
-----+---------+---------+
| Good | Bad? |
QP0 | mma of | mma of |
|(QP0,QP0)|(QP0,QP1)|
B -----+---------C---------+
| | |
QP2 | Bad? | Bad? |
| | |
-----+---------+---------+
Read `mma of (QP0,QP1)` as: a rank-4 accumulation between data held by QP0 and data held by QP1.
The TV mapping as shown by print_latex is pretty reasonable. The question is: can the hardware materialize the computation specified by that TV mapping?
The projection of the threads being referred to here is how the thread layout of the TiledMMA maps onto the logical coordinates within A (such that multiple threads of the MMA get projected onto the same coordinate).
Thanks! After putting all of this back into the code, I think I fully grasp the idea now.