cutlass calculate matrix size problem about cutlass HOT 5 CLOSED

nvidia commented on July 16, 2024

cutlass calculate matrix size problem

from cutlass.

Comments (5)

kerrmudgeon commented on July 16, 2024

This is expected behavior. As a performance optimization, most kernels targeting Tensor Cores require 128b aligned memory accesses. This enables vectorized memory accesses, fewer address instructions, and greater efficiency. It is a documented requirement in cublas, described here in greater detail <https://docs.nvidia.com/cuda/cublas/index.html#tensorop-restrictions>. Here are some additional suggestions for higher performance on the Turing architecture: 1. Target Tensor Cores natively with `cutlass::arch::OpClassTensorOp` rather than WMMA. This utilizes feature new in CUDA 10.2. using Gemm = cutlass::gemm::device::Gemm< int8_t, cutlass::layout::RowMajor, int8_t, cutlass::layout::ColumnMajor, ElementOutput, cutlass::layout::RowMajor, ElementAccumulator, cutlass::arch::OpClassTensorOp, cutlass::arch::Sm75, cutlass::gemm::GemmShape<128, 128, 32>, cutlass::gemm::GemmShape<64, 64, 32>, cutlass::gemm::GemmShape<8, 8, 16>

;

2. Adapt your application to structure the A matrix as row-major.

…

On Mon, Dec 23, 2019 at 5:36 AM wanghr323 ***@***.***> wrote: I wrote a function using cutlass to test the performance of cutlass calculation (int8, int8 to int), but I have now found a problem. M, N, and K in my parameters cannot be selected at random, where N and K must Multiples of 16. Choosing something else will cause an error. Is there something wrong with my writing of this function? int Int8Operator:: cutlass_gemm32I_tensorop(const CBLAS_TRANSPOSE TransA, const CBLAS_TRANSPOSE TransB, const int M, const int N, const int K, const void *alpha, const void* A, const void* B, const void *beta, void* C,cublasGemmAlgo_t algo/*non used*/) { using A_Major = cutlass::layout::ColumnMajor; using B_Major = cutlass::layout::ColumnMajor; using ElementOutput = int32_t; using ElementAccumulator = int32_t; int lda = (TransA == CblasNoTrans) ? K : M; int ldb = (TransB == CblasNoTrans) ? N : K; int ldc = N; using Gemm = cutlass::gemm::device::Gemm< int8_t, A_Major, int8_t, B_Major, ElementOutput, cutlass::layout::RowMajor, ElementAccumulator, cutlass::arch::OpClassWmmaTensorOp, cutlass::arch::Sm75, cutlass::gemm::GemmShape<128, 128, 32>, cutlass::gemm::GemmShape<64, 64, 32>, cutlass::gemm::GemmShape<16, 16, 16>, cutlass::epilogue::thread::LinearCombination< ElementOutput, 128 / cutlass::sizeof_bits<ElementOutput>::value, ElementAccumulator, ElementAccumulator >, cutlass::gemm::threadblock::GemmIdentityThreadblockSwizzle, 2 >; Gemm gemm_op; int alpha_ = *(static_cast<const int*>(alpha)); int beta_ = *(static_cast<const int*>(beta)); cutlass::Status status = gemm_op({ {M, N, K}, {static_cast<const int8_t *>(A), lda}, {static_cast<const int8_t *>(B), ldb}, {static_cast<int*>(C), ldc}, {static_cast<int*>(C), ldc}, {alpha_,beta_} }); if (status != cutlass::Status::kSuccess) { return cudaErrorUnknown; } return cudaSuccess; } — You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub <#75?email_source=notifications&email_token=AAACX6ZQYN6CPASEURBHXF3Q2C5HVA5CNFSM4J6UH4V2YY3PNVWWK3TUL52HS4DFUVEXG43VMWVGG33NNVSW45C7NFSM4ICKSZOQ>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AAACX64YETMNR2KKCM5IP63Q2C5HVANCNFSM4J6UH4VQ> .

from cutlass.

wanghr323 commented on July 16, 2024

OK，that is to say , M ,N ,K at least two of them should be Multiples of 16.
Thank u , I will close the issue.

from cutlass.

wanghr323 commented on July 16, 2024

thank u for your reply，Kerr，Then I have a need in my job now, calculating C (int) = A (int8) × B (int8), where I want A, B, and C to be Rowmajor matrices, the size of A is M × K, and the size of B is K × N, the size of C is M * N.
I can guarantee that K is a multiple of 16, and M can be converted to a multiple of 16 (if you can choose it arbitrarily, it is the best, if not, it is fine), but N must be a random number.
How do I achieve it with cutlass?
I tested all combinations in cutlass. If ABC is rowmajor, then N and K must be multiples of 16. If I convert my thoughts and convert A × B to B.trans * A.trans (ABC selects column_major, and brings it back in), then M becomes N and N becomes M, this time it becomes, N can be chosen at will, M must be a multiple of 16, still cannot solve my problem.
Can this problem be solved by cutlass? It's fine if you don't use tensorcoreop, or even wmma.

from cutlass.

kerrmudgeon commented on July 16, 2024

Here are a three possible recourses:

1.) Padding.

Size the matrices such that they are divisible by 16 elements and initialize the extra elements with zero.

2.) Reduce the alignment requirement at the expense of performance.

The device-level GEMM API accepts an admittedly long list of template arguments including the alignment constraints.

https://github.com/NVIDIA/cutlass/blob/master/include/cutlass/gemm/device/gemm.h#L201

using Gemm = cutlass::gemm::device::Gemm<
      int8_t,
      cutlass::layout::RowMajor,
      int8_t,
      B_Major,
      cutlass::layout::ColumnMajor,
      cutlass::layout::RowMajor,
      ElementAccumulator,
      cutlass::arch::OpClassTensorOp,
      cutlass::arch::Sm75,
      cutlass::gemm::GemmShape<128, 128, 64>,
      cutlass::gemm::GemmShape<64, 64, 64>,
      cutlass::gemm::GemmShape<8, 8, 16>,
      cutlass::epilogue::thread::LinearCombination<
        ElementOutput,
        1,     // alignment of C units
        ElementAccumulator,
        ElementAccumulator
      >,
      cutlass::gemm::threadblock::GemmIdentityThreadblockSwizzle,
      2,
      1,   // alignment of A in units of number of elements
      1    // alignment of B in units of number of elements
  >;

3.) Use the integer-valued SIMT kernels.

You may consider using a kernel targeting integer dot product "dp4" instructions, first available in the Pascal microarchitecture and beyond.

Here is the definition syntax, visible in unit tests for these kernels.
https://github.com/NVIDIA/cutlass/blob/master/test/unit/gemm/device/simt_int8_igemm_sm61.cu

  // Output data type - may be int8_t or int32_t
  using ElementOutput = int8_t;

  // Accumulator data type
  using ElementAccumulator = int32_t;

  // Scalar data type
  using ElementCompute = float;

  // Instruction shape - describes a 1x1x4 dot product computed by
  // the "dp4" instruction.
  using InstructionShape = cutlass::gemm::GemmShape<1, 1, 4>;

  using Gemm = cutlass::gemm::device::Gemm<
    int8_t,
    cutlass::layout::ColumnMajor,
    int8_t,
    cutlass::layout::ColumnMajor,
    ElementOutput,
    cutlass::layout::RowMajor,
    int32_t,
    cutlass::arch::OpClassSimt,
    cutlass::arch::Sm61,
    ThreadBlockShape,
    WarpShape,
    InstructionShape
  >;

There is no restriction on M, N, or K, but the matrices themselves must be 32b aligned. That is, pointers and leading dimensions must be divisible by 4 bytes.

from cutlass.

wanghr323 commented on July 16, 2024

thank you for your help.I will close the question.

from cutlass.

cutlass calculate matrix size problem about cutlass HOT 5 CLOSED

Comments (5)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent