Comments (4)
Even though we've left that open as a possible configuration in CuTe, I have never actually seen it used or asked for. This also means we don't have much experience with how to perform this reduction. We don't recommend it unless you have a very special use case.
The very simplest thing you could do is probably write an atomic_axpy as a new epilogue:
// Atomic AXPY
template <class Alpha,
          class XEngine, class XLayout,
          class YEngine, class YLayout>
CUTE_HOST_DEVICE void
atomic_axpy(Alpha                    const& alpha,
            Tensor<XEngine, XLayout> const& x,
            Tensor<YEngine, YLayout>      & y)
{
  CUTE_UNROLL
  for (int i = 0; i < size(x); ++i) {
    // y(i) += alpha * x(i);
    atomicAdd(&y(i), alpha * x(i));
  }
}
which would reduce the partial accumulators back into global memory safely. Optimizations that use shared memory and/or in-place register reductions using CUB would clearly be the next step.
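To make the reduction pattern concrete without a GPU, here is a minimal host-side sketch in plain C++. It is not CuTe code: `atomic_add`, `atomic_axpy_host`, and `reduce_partials` are hypothetical names, `std::atomic<float>` with a compare-exchange loop stands in for the device `atomicAdd`, and each `std::thread` stands in for one CTA holding a partial accumulator for the same output tile.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Host-side stand-in for atomicAdd on a float: retry a compare-exchange
// until our addition has been applied on top of the latest stored value.
inline void atomic_add(std::atomic<float>& target, float value) {
  float observed = target.load(std::memory_order_relaxed);
  while (!target.compare_exchange_weak(observed, observed + value,
                                       std::memory_order_relaxed)) {
    // On failure, `observed` now holds the current value; try again.
  }
}

// Hypothetical helper: y[i] += alpha * x[i], one atomic update per element.
inline void atomic_axpy_host(float alpha,
                             std::vector<float> const& x,
                             std::vector<std::atomic<float>>& y) {
  assert(x.size() == y.size());
  for (std::size_t i = 0; i < x.size(); ++i) {
    atomic_add(y[i], alpha * x[i]);
  }
}

// Hypothetical driver: every worker owns one partial accumulator and all
// workers reduce into the same output buffer concurrently.
inline std::vector<float> reduce_partials(
    std::vector<std::vector<float>> const& partials, float alpha) {
  assert(!partials.empty());
  std::vector<std::atomic<float>> y(partials[0].size());
  for (auto& v : y) v.store(0.0f);  // the output must start from zero

  std::vector<std::thread> workers;
  for (auto const& p : partials) {
    workers.emplace_back([&y, &p, alpha] { atomic_axpy_host(alpha, p, y); });
  }
  for (auto& w : workers) w.join();

  std::vector<float> out;
  for (auto& v : y) out.push_back(v.load());
  return out;
}
```

Two things this sketch makes visible: the output has to be zero-initialized (or hold the value you want to accumulate onto) before any worker starts, and floating-point atomics apply the additions in a nondeterministic order, so results can differ in the last bits from run to run when the values are not exactly representable.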
from cutlass.
Got it, thanks for the explanation!
Based on the description, are alternative methods like Split-K or Stream-K strictly more favorable than using more than one Atom in the K dimension? (After all, they try to achieve essentially the same goal.)
On that note, do you have good pointers/references to a split-K / stream-K implementation using CUTLASS-3? I'm somewhat new to the library, and seeing a lot of Split-K/Stream-K examples (many of which use CUTLASS-2) could be a bit confusing to me.
Thanks again for being so helpful!
from cutlass.
"are alternative methods like Split-K or Stream-K strictly more favorable"
This depends heavily on the exact nature of your problem and on what the bottlenecks in your kernel are. It is hard to answer without knowing that, but there are some niche use cases where splitting the accumulation of a tile into multiple partial K-blocks is a viable strategy. It is not something I would start with, however.
The CUTLASS 3 Hopper persistent cooperative kernel already supports Stream-K. I encourage you to take a look at it. The kernel layer is here: https://github.com/NVIDIA/cutlass/blob/main/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp
It simply composes with the Stream-K scheduler: https://github.com/NVIDIA/cutlass/blob/main/include/cutlass/gemm/kernel/sm90_tile_scheduler_stream_k.hpp
Unit tests are here: https://github.com/NVIDIA/cutlass/blob/main/test/unit/gemm/device/sm90_gemm_f8_f8_f32_tensor_op_f32_cooperative_stream_k.cu
These kernels are stamped out in the profiler too.
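For intuition only, the basic Stream-K idea of dividing the aggregate k-iterations of all output tiles evenly across workers can be sketched in plain C++ as below. This is not the CUTLASS scheduler linked above, which handles far more; `WorkSegment` and `stream_k_segments` are hypothetical names, and the even contiguous split is an assumption made for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// One contiguous piece of a worker's share that lies inside a single tile.
struct WorkSegment {
  int tile;     // output tile index
  int k_begin;  // first k-iteration within that tile (inclusive)
  int k_end;    // last k-iteration within that tile (exclusive)
};

// Hypothetical helper: treat tiles * k_iters_per_tile as one linear range of
// iterations, give each worker an even contiguous share, and report which
// tiles (and which k-ranges of them) the given worker touches.
inline std::vector<WorkSegment> stream_k_segments(int tiles,
                                                  int k_iters_per_tile,
                                                  int workers,
                                                  int worker) {
  assert(tiles > 0 && k_iters_per_tile > 0 && workers > 0);
  assert(worker >= 0 && worker < workers);

  int total = tiles * k_iters_per_tile;
  int base  = total / workers;
  int extra = total % workers;  // the first `extra` workers take one more

  int begin = worker * base + std::min(worker, extra);
  int end   = begin + base + (worker < extra ? 1 : 0);

  std::vector<WorkSegment> segments;
  while (begin < end) {
    int tile       = begin / k_iters_per_tile;
    int tile_start = tile * k_iters_per_tile;
    int stop       = std::min(end, tile_start + k_iters_per_tile);
    segments.push_back({tile, begin - tile_start, stop - tile_start});
    begin = stop;
  }
  return segments;
}
```

With 3 tiles of 4 k-iterations each and 2 workers, worker 0 covers all of tile 0 plus the first half of tile 1, and worker 1 covers the second half of tile 1 plus all of tile 2. Tile 1 is therefore produced by two workers, and its partial accumulators need a cross-worker reduction before the epilogue.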
from cutlass.
Thanks for the quick response!
from cutlass.