[QST] TiledMMA with `>1` Atoms in K dimension --- how to reduce?

HanGuo97 commented on July 17, 2024

[QST] TiledMMA with `>1` Atoms in K dimension --- how to reduce?

Comments (4)

ccecka commented on July 17, 2024

Even though we've left that open as a possible configuration in CuTe, I have never actually seen it used or asked for. This also means we don't have much experience with how to perform this reduction. We don't recommend it unless you have a very special use case.

The very simplest thing you could do is probably write an `atomic_axpy` as a new epilogue:

// Atomic AXPY
template <class Alpha,
          class XEngine, class XLayout,
          class YEngine, class YLayout>
CUTE_HOST_DEVICE void
atomic_axpy(Alpha const& alpha,
            Tensor<XEngine, XLayout> const& x,
            Tensor<YEngine, YLayout>      & y)
{
  CUTE_UNROLL
  for (int i = 0; i < size(x); ++i) {
    // y(i) += alpha * x(i);
    atomicAdd(&y(i), alpha * x(i));
  }
}

which would reduce the partial accumulators back into global memory safely. Optimizations that use shared memory and/or in-place register reductions using CUB would clearly be the next step.
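To make the reduction concrete, here is a minimal host-side sketch in plain C++ (not CuTe, and not something from the CUTLASS codebase). It assumes row-major matrices and a K extent that divides evenly by the number of splits; `partial_gemm` and `gemm_k_split` are hypothetical names used only for illustration. Each K slice produces its own partial accumulator, and an axpy-style step folds it into the output, which is the role `atomicAdd` plays on the device when several partial accumulators target the same element.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side stand-in for the reduction step: y(i) += alpha * x(i).
// On the GPU this is where atomicAdd (or a shared-memory reduction) would go.
void axpy(float alpha, const std::vector<float>& x, std::vector<float>& y) {
  assert(x.size() == y.size());
  for (std::size_t i = 0; i < x.size(); ++i) {
    y[i] += alpha * x[i];
  }
}

// Partial product of row-major A (M x K) and B (K x N) over k in [k_begin, k_end).
std::vector<float> partial_gemm(const std::vector<float>& A,
                                const std::vector<float>& B,
                                int M, int N, int K, int k_begin, int k_end) {
  std::vector<float> acc(static_cast<std::size_t>(M) * N, 0.0f);
  for (int m = 0; m < M; ++m)
    for (int n = 0; n < N; ++n)
      for (int k = k_begin; k < k_end; ++k)
        acc[m * N + n] += A[m * K + k] * B[k * N + n];
  return acc;
}

// Split K into `splits` equal slices (assumption: K % splits == 0),
// accumulate each slice separately, then reduce the partials into C.
std::vector<float> gemm_k_split(const std::vector<float>& A,
                                const std::vector<float>& B,
                                int M, int N, int K, int splits) {
  assert(splits > 0 && K % splits == 0);
  std::vector<float> C(static_cast<std::size_t>(M) * N, 0.0f);
  const int k_per_split = K / splits;
  for (int s = 0; s < splits; ++s) {
    std::vector<float> acc =
        partial_gemm(A, B, M, N, K, s * k_per_split, (s + 1) * k_per_split);
    axpy(1.0f, acc, C);  // reduction of one partial accumulator
  }
  return C;
}
```

With exactly representable inputs, `gemm_k_split(A, B, M, N, K, 2)` should match the unsplit result `gemm_k_split(A, B, M, N, K, 1)`; with general floating-point data the two can differ slightly because the summation order changes.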

HanGuo97 commented on July 17, 2024

Got it, thanks for the explanation!

Based on the description, are alternative methods like Split-K or Stream-K strictly more favorable than using more than one Atom in the K dimension? (After all, they try to achieve essentially the same goal.)

On that note, do you have good pointers/references to a Split-K / Stream-K implementation using CUTLASS 3? I'm somewhat new to the library, and the many Split-K/Stream-K examples (a lot of which use CUTLASS 2) are a bit confusing to me.

Thanks again for being so helpful!

thakkarV commented on July 17, 2024

> are alternative methods like Split-K or Stream-K strictly more favorable

This highly depends on the exact nature of your problem and what the bottlenecks in your kernel are. It is hard to answer without knowing this, but there are some niche use cases where splitting the accumulation of tiles into multiple partial k-blocks is a viable strategy. This is not something I would start with, however.

The CUTLASS 3 Hopper persistent cooperative kernel supports Stream-K already. I encourage you to take a look at it. The kernel layer is here: https://github.com/NVIDIA/cutlass/blob/main/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp

which simply composes with the Stream-K scheduler: https://github.com/NVIDIA/cutlass/blob/main/include/cutlass/gemm/kernel/sm90_tile_scheduler_stream_k.hpp

Unit tests are here: https://github.com/NVIDIA/cutlass/blob/main/test/unit/gemm/device/sm90_gemm_f8_f8_f32_tensor_op_f32_cooperative_stream_k.cu

And they are stamped out in the profiler too.
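For readers new to the idea, the following is a rough sketch of the work partitioning behind Stream-K, written as plain host-side C++. It is not the CUTLASS scheduler linked above and makes no claim about how that scheduler is implemented; `WorkUnit` and `stream_k_units` are hypothetical names. The assumption is that the total work is `tiles * k_iters` k-iterations and each worker takes an even, contiguous share. A worker's share can start or end in the middle of a tile, and any tile covered by more than one worker then needs its partial accumulators reduced, which is the same reduction problem discussed earlier in this thread.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// One contiguous piece of work: k-iterations [k_begin, k_end) of output tile `tile`.
struct WorkUnit {
  int tile;
  int k_begin;
  int k_end;
};

// Give worker `w` (0-based) of `workers` an even share of all
// (tile, k-iteration) pairs, split at tile boundaries into work units.
std::vector<WorkUnit> stream_k_units(int tiles, int k_iters, int workers, int w) {
  assert(tiles > 0 && k_iters > 0 && workers > 0 && w >= 0 && w < workers);
  const long long total = 1LL * tiles * k_iters;
  long long begin = total * w / workers;
  const long long end = total * (w + 1) / workers;

  std::vector<WorkUnit> units;
  while (begin < end) {
    const int tile = static_cast<int>(begin / k_iters);
    const int k_begin = static_cast<int>(begin % k_iters);
    // Stop at the end of this tile or the end of this worker's share.
    const int k_end = static_cast<int>(
        std::min<long long>(k_iters, k_begin + (end - begin)));
    units.push_back({tile, k_begin, k_end});
    begin += k_end - k_begin;
  }
  return units;
}
```

For example, with 3 tiles, 4 k-iterations per tile, and 2 workers, worker 0 gets all of tile 0 plus the first half of tile 1, and worker 1 gets the second half of tile 1 plus all of tile 2, so only tile 1 needs a cross-worker reduction.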

HanGuo97 commented on July 17, 2024

Thanks for the quick response!
