
Comments (4)

ccecka commented on July 17, 2024

Even though we've left that open as a possible configuration in CuTe, I have never actually seen it used or asked for. This also means we don't have much experience with how to perform this reduction, and we don't recommend it unless you have a very special use case.

The very simplest thing you could do is probably write an atomic_axpy as a new epilogue:

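// Atomic AXPY: y(i) += alpha * x(i), with each update performed atomically.
// Marked device-only because atomicAdd is a CUDA device function.
// Requires an atomicAdd overload for y's element type (e.g. float).
template <class Alpha,
          class XEngine, class XLayout,
          class YEngine, class YLayout>
CUTE_DEVICE void
atomic_axpy(Alpha const& alpha,
            Tensor<XEngine, XLayout> const& x,
            Tensor<YEngine, YLayout>      & y)
{
  CUTE_UNROLL
  for (int i = 0; i < size(x); ++i) {
    // Non-atomic equivalent: y(i) += alpha * x(i);
    atomicAdd(&y(i), alpha * x(i));
  }
}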

which would reduce the partial accumulators back into global memory safely. Optimizations that use shared memory and/or in-place register reductions using CUB would clearly be the next step.
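To make the idea concrete, here is a minimal host-side C++ sketch of the same pattern. It is not CuTe code and not from the thread: `std::thread` workers stand in for thread blocks, `std::atomic<float>` stands in for global memory, and the names `atomic_add`, `accumulate_k_slice`, and `split_k_matvec` are hypothetical. Each worker computes partial sums over its own slice of K and folds them into the shared output with an atomic add, so the result does not depend on which worker finishes first.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Atomic float add built from a compare-exchange loop (plays the role of
// CUDA's atomicAdd in this host-side emulation).
inline void atomic_add(std::atomic<float>& target, float value) {
  float expected = target.load(std::memory_order_relaxed);
  while (!target.compare_exchange_weak(expected, expected + value,
                                       std::memory_order_relaxed)) {
    // On failure, `expected` is refreshed with the current value; retry.
  }
}

// Accumulate the contribution of columns [k_begin, k_end) of the row-major
// m x k matrix A times vector v into y, one atomic add per output element.
inline void accumulate_k_slice(std::vector<float> const& A,
                               std::vector<float> const& v,
                               std::size_t m, std::size_t k,
                               std::size_t k_begin, std::size_t k_end,
                               std::vector<std::atomic<float>>& y) {
  assert(A.size() == m * k && v.size() == k && y.size() == m);
  assert(k_begin <= k_end && k_end <= k);
  for (std::size_t i = 0; i < m; ++i) {
    float partial = 0.0f;  // private partial accumulator for this slice
    for (std::size_t j = k_begin; j < k_end; ++j) {
      partial += A[i * k + j] * v[j];
    }
    atomic_add(y[i], partial);  // y(i) += 1 * partial, atomically
  }
}

// y = A * v with K split into `splits` slices, one worker thread per slice.
inline std::vector<float> split_k_matvec(std::vector<float> const& A,
                                         std::vector<float> const& v,
                                         std::size_t m, std::size_t k,
                                         std::size_t splits) {
  assert(splits > 0);
  std::vector<std::atomic<float>> y(m);
  for (auto& e : y) e.store(0.0f);

  std::vector<std::thread> workers;
  for (std::size_t s = 0; s < splits; ++s) {
    workers.emplace_back([&, s] {
      accumulate_k_slice(A, v, m, k, s * k / splits, (s + 1) * k / splits, y);
    });
  }
  for (auto& w : workers) w.join();

  std::vector<float> out(m);
  for (std::size_t i = 0; i < m; ++i) out[i] = y[i].load();
  return out;
}
```

Calling `split_k_matvec(A, v, m, k, 2)` splits K across two workers. Floating-point addition is not associative, so for general inputs the rounding of the result can vary with the order in which the atomic adds land; the same caveat applies to the device-side `atomicAdd` version.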

from cutlass.

HanGuo97 commented on July 17, 2024

Got it, thanks for the explanation!

Based on the description, are alternative methods like Split-K or Stream-K strictly more favorable than using more than one Atom in the K dimension? (After all, they try to achieve essentially the same goal.)

On that note, do you have good pointers/references to a Split-K / Stream-K implementation using CUTLASS 3? I'm somewhat new to the library, and seeing a lot of Split-K/Stream-K examples (many of which use CUTLASS 2) is a bit confusing to me.

Thanks again for being so helpful!


thakkarV commented on July 17, 2024

> are alternative methods like Split-K or Stream-K strictly more favorable

This depends heavily on the exact nature of your problem and on what the bottlenecks in your kernel are. It is hard to answer without knowing this, but there are some niche use cases where splitting the accumulation of tiles into multiple partial k-blocks is a viable strategy. This is not something I would start with, however.

The CUTLASS 3 Hopper persistent cooperative kernel already supports Stream-K. I encourage you to take a look at it. The kernel layer is here: https://github.com/NVIDIA/cutlass/blob/main/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp

which simply composes with the stream K scheduler: https://github.com/NVIDIA/cutlass/blob/main/include/cutlass/gemm/kernel/sm90_tile_scheduler_stream_k.hpp

Unit tests are here: https://github.com/NVIDIA/cutlass/blob/main/test/unit/gemm/device/sm90_gemm_f8_f8_f32_tensor_op_f32_cooperative_stream_k.cu

And they are stamped out in the profiler too.
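For readers new to the idea, the following host-side C++ sketch illustrates the basic Stream-K work split in its simplest form. It is an assumption-laden illustration, not the CUTLASS scheduler linked above, which is considerably more involved. The names `WorkSegment` and `stream_k_partition` are hypothetical. The total number of k-iterations across all output tiles is divided as evenly as possible among a fixed number of workers, and each worker's contiguous range is then cut at tile boundaries.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// One contiguous piece of work: k-iterations [k_begin, k_end) of one output
// tile, assigned to one worker (a stand-in for a persistent thread block).
struct WorkSegment {
  std::size_t worker;
  std::size_t tile;
  std::size_t k_begin;
  std::size_t k_end;
};

// Divide num_tiles * iters_per_tile k-iterations evenly across num_workers,
// then cut each worker's range wherever it crosses a tile boundary.
inline std::vector<WorkSegment> stream_k_partition(std::size_t num_tiles,
                                                   std::size_t iters_per_tile,
                                                   std::size_t num_workers) {
  assert(num_workers > 0 && iters_per_tile > 0);
  std::size_t const total = num_tiles * iters_per_tile;

  std::vector<WorkSegment> segments;
  for (std::size_t w = 0; w < num_workers; ++w) {
    std::size_t begin = w * total / num_workers;        // first global iteration
    std::size_t const end = (w + 1) * total / num_workers;  // one past the last
    while (begin < end) {
      std::size_t const tile = begin / iters_per_tile;
      std::size_t const tile_first = tile * iters_per_tile;
      std::size_t const stop = std::min(end, tile_first + iters_per_tile);
      segments.push_back({w, tile, begin - tile_first, stop - tile_first});
      begin = stop;
    }
  }
  return segments;
}
```

With 3 tiles of 4 k-iterations each and 2 workers, worker 0 receives all of tile 0 plus the first half of tile 1, and worker 1 receives the second half of tile 1 plus all of tile 2. Any tile that ends up split across workers, like tile 1 here, needs its partial accumulators combined afterwards, which is the same reduction problem discussed earlier in the thread.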


HanGuo97 commented on July 17, 2024

Thanks for the quick response!
