Coder Social home page Coder Social logo

Comments (6)

ccecka avatar ccecka commented on July 17, 2024 1

Yes.

All arrangements of the quadpair MMAs are possible. For example,

    TiledMMA mma_4x1 = make_tiled_mma(SM70_8x8x4_F32F16F16F32_NT{},
                                      Layout<Shape <_4,_1>, Stride<_1,_0>>{});

    TiledMMA mma_2x2c = make_tiled_mma(SM70_8x8x4_F32F16F16F32_NT{},
                                      Layout<Shape <_2,_2>, Stride<_1,_2>>{});

    TiledMMA mma_2x2r = make_tiled_mma(SM70_8x8x4_F32F16F16F32_NT{},
                                      Layout<Shape <_2,_2>, Stride<_2,_1>>{});

    TiledMMA mma_1x4 = make_tiled_mma(SM70_8x8x4_F32F16F16F32_NT{},
                                      Layout<Shape <_1,_4>, Stride<_0,_1>>{});

In fact, you can swizzle the 2x2 arrangement of quadpairs as well. Left as an exercise.

Your table is not showing the projections of the quadpairs. None of the quadpairs can communicate. I would correct it as follows

QP0 [ 0: 4)+[16:20)
QP1 [ 4: 8)+[20:24)
QP2 [ 8:12)+[24:28)
QP3 [12:16)+[28:32)

                 B
       | QP0&QP2 | QP1&QP3 |
  -----+---------+---------+
  QP0  |         |         |
   &   |  QP0MMA |  QP1MMA |
  QP1  |         |         |
A -----+---------C---------+
  QP2  |         |         |
   &   |  QP2MMA |  QP3MMA |
  QP3  |         |         |
  -----+---------+---------+

from cutlass.

cloudhan avatar cloudhan commented on July 17, 2024 1

Could you elaborate a bit on the projection? I don't think you are refering to the projection that strip out modes via dice and I'm not quite sure what it should refer to.

The following suddenly makes a lot sense to me now. (warp-wise speaking, replace & with |, or), And it is indeed the same as #1367 =)

                 A
       | QP0/QP2 | QP1/QP3 |
  -----+---------+---------+
  QP0  |         |         |
   /   |  QP0MMA |  QP1MMA |
  QP1  |         |         |
B -----+---------C---------+
  QP2  |         |         |
   /   |  QP2MMA |  QP3MMA |
  QP3  |         |         |
  -----+---------+---------+

In TiledMMA, the peculiarity here is the C sometimes (if not always) shows more threads than A or/and B has. When get_slice into A or B, we can imagine those thr ids which are not presented in the plot of LayoutA_TV or LayoutB_TV to "overflow into an imaginary mode that perpendicularly grows out of/into the screen". This indicate a value broadcasting case.

To wrap up, TiledMMA always describe a computation that is carried out in a cooperative style. Combining with previous property, at least one src must be broadcasting.

And my perculiar 4x4 TiledMMA makes no sense at all. I want to replicate the standalone style 4 MMAs computation which is described in PTX ISA docs, all I need is just to stick to 1x1 TiledMMA or the Atom =)

from cutlass.

ccecka avatar ccecka commented on July 17, 2024

The threads are projected onto each other such that multiple threads are reading the same data elements in A and B.

Consider the tutorial example:

    TiledMMA mma = make_tiled_mma(UniversalFMA<float>{},
                                  Layout<Shape<_16,_16>>{},   // Layout of Atoms

Only T0 is displayed in A, but clearly T16, T32, T64, ... are all loading the same element.

TiledMmaC

This is related
#1367

from cutlass.

cloudhan avatar cloudhan commented on July 17, 2024

@ccecka Thanks for the quick reply. I think #1367 is rather different. Tho, It shedded some insight that TiledMMA operates in a broadcasting way. It asked about the logical mapping of TV. Here is whether or not the computation which is specified by the TiledMMA can be materialized.

ThrID other than sm70 MMAs are all 32:1, that they cover a full warp. SM70_8x8x4_F32F16F16F32_NT, however, with quad pair (4,2):(1,16). Due to the subwarp nature, there are some futher precondition on the instruction which is issued warp-wise.

See https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#matrix-fragments-for-mma-m8n8k4-with-f16-floating-point-type for the following table:

MMA Computation Threads participating in MMA computation
MMA computation 1 Threads with %laneid 0-3 (low group) and 16-19 (high group)
MMA computation 2 Threads with %laneid 4-7 (low group) and 20-23 (high group)
MMA computation 3 Threads with %laneid 8-11 (low group) and 24-27 (high group)
MMA computation 4 Threads with %laneid 12-15 (low group) and 28-31 (high group)

So it seems that the computation mma(QP_x, QP_y) is valid iff x == y.

However,

    TiledMMA mma = make_tiled_mma(SM70_8x8x4_F32F16F16F32_NT{},
                                  Layout<Shape <_2,_2>,
                                         Stride<_2,_1>>{});   // 2x2 n-major layout of Atoms

Produces a computation as depicted as follows:

QP0 [ 0: 4)+[16:20)
QP1 [ 4: 8)+[20:24)
QP2 [ 8:12)+[24:28)
QP3 [12:16)+[28:32)

                 A
       |   QP0   |   QP1   |
  -----+---------+---------+
       |  Good   |  Bad?   |
  QP0  | mma of  | mma of  |
       |(QP0,QP0)|(QP0,QP1)|
B -----+---------C---------+
       |         |         |
  QP2  |  Bad?   |  Bad?   |
       |         |         |
  -----+---------+---------+


Read `mma of (QP0,QP1)` as:
rank-4 acc between data held by QP0 and data held by QP1

The TV mapping as shown in print_latex is pretty reasonable. The question is, can the hardware materialize the computation specified by that TV mapping?

from cutlass.

thakkarV avatar thakkarV commented on July 17, 2024

The projections of the threads being referred to here is how the thread layout of the tiled mma maps onto the logical coordinates within A (st multiple threads of the MMA get projected out to the same coord)

from cutlass.

cloudhan avatar cloudhan commented on July 17, 2024

Thanks! After putting all these back to the code, I think I fully grasp the idea now.

from cutlass.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.