Hi, I profiled cutlass using nvvp on Tx2 with jetpack 3.2, using "./cutlass_perf_t

cutlass performance about cutlass HOT 8 CLOSED

nvidia commented on August 15, 2024

cutlass performance

from cutlass.

Comments (8)

wtiandong commented on August 15, 2024

OK,
I'm back to answer my own question. I tried --m=10240 --k=4096 --n=4096 on a TITAN X with CUDA 9.2, and CUTLASS reaches more than 95% of cuBLAS performance.
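For readers who want to reproduce this kind of comparison, here is a minimal sketch of the arithmetic behind a "percent of cuBLAS" figure. It assumes the usual convention of counting 2·m·n·k floating-point operations per GEMM; the helper names (`gemm_flops`, `gflops`, `relative_performance`) are made up for illustration and are not part of CUTLASS or its profiler.

```cpp
#include <cassert>
#include <cstdint>

// FLOP count for C = A * B with A (m x k) and B (k x n):
// one multiply and one add per (i, j, p) triple.
inline double gemm_flops(std::int64_t m, std::int64_t n, std::int64_t k) {
    assert(m > 0 && n > 0 && k > 0);
    return 2.0 * static_cast<double>(m) * static_cast<double>(n) *
           static_cast<double>(k);
}

// Achieved throughput in GFLOP/s for a measured kernel time in seconds.
inline double gflops(std::int64_t m, std::int64_t n, std::int64_t k,
                     double seconds) {
    assert(seconds > 0.0);
    return gemm_flops(m, n, k) / seconds / 1e9;
}

// Throughput of a candidate kernel relative to a baseline on the same
// problem. Because the FLOP count is identical, this is a ratio of times.
inline double relative_performance(double candidate_seconds,
                                   double baseline_seconds) {
    assert(candidate_seconds > 0.0 && baseline_seconds > 0.0);
    return baseline_seconds / candidate_seconds;
}
```

For the problem size above (m=10240, n=4096, k=4096) this is roughly 3.4e11 floating-point operations per GEMM, so a ratio of 0.95 means the candidate kernel takes about 5% longer than the baseline.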

wtiandong commented on August 15, 2024

Hmm, it is strange.

I see what I did wrong the first time: I should use sgemm_nn to compare with cuBLAS.
On a TITAN X with CUDA 9.2, performance is 95% of cuBLAS.
On a TITAN Xp with CUDA 9.0, performance is also 95%.
However, on a Jetson TX2 with CUDA 9.0 and JetPack 3.2, performance is only 70%.
Could someone give some suggestions?

GLJeff commented on August 15, 2024

I don't think CUTLASS is capable of the same automatic, heuristic-based tuning as cuBLAS. It would probably take some manual tweaking to get up to the 95% mark. Even though both devices are Pascal architecture, the core count and probably other important factors are very different. But even then, are you ever going to make up that 5% with a fused epilogue function? That's what I'm trying to figure out at the moment, but my guess is no.

wtiandong commented on August 15, 2024

Hi Jeff,
I profiled the whole system with nvvp, and it is worth doing that.

GLJeff commented on August 15, 2024

NVVP is an awesome tool. I look forward to getting the same results then.

Let us know if you find a way to get good cutlass results on your TX2, perhaps with different block sizes.

GLJeff commented on August 15, 2024

I just want to confirm what wtiandong stated (contrary to my original guess): in my case as well, it would be beneficial to adopt CUTLASS for performance gains. My simple epilogue of merely a bias and ReLU is taking 9% of my forward pass time. Nearly all of that 9% comes from memory reads that would be unnecessary with an epilogue built into the GEMM kernel. There would be room for significant gains with CUTLASS running at 95% of cuBLAS speed. Unfortunately I use strided batched GEMMs, so I'll need to wait for those to be implemented before I can adopt it. Looking forward to it.
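To make the memory-traffic argument concrete, here is a small CPU reference sketch of the two approaches. It is only an illustration of the idea, not CUTLASS code and not the code from this thread: the function names, the row-major layout, and the per-column bias are all assumptions. The unfused version writes C and then re-reads it in a second pass; the fused version applies bias and ReLU while the accumulator is still in a local variable, so C is written once and never read back.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Row-major reference: C (m x n) = A (m x k) * B (k x n), then bias + ReLU.
// Unfused: the second loop nest re-reads every element of C from memory.
void gemm_then_bias_relu(const std::vector<float>& A,
                         const std::vector<float>& B,
                         const std::vector<float>& bias,
                         std::vector<float>& C,
                         std::size_t m, std::size_t n, std::size_t k) {
    assert(A.size() == m * k && B.size() == k * n);
    assert(bias.size() == n && C.size() == m * n);
    for (std::size_t i = 0; i < m; ++i)
        for (std::size_t j = 0; j < n; ++j) {
            float acc = 0.0f;
            for (std::size_t p = 0; p < k; ++p)
                acc += A[i * k + p] * B[p * n + j];
            C[i * n + j] = acc;  // first write of C
        }
    // Separate epilogue pass: one extra read and write per element of C.
    for (std::size_t i = 0; i < m; ++i)
        for (std::size_t j = 0; j < n; ++j)
            C[i * n + j] = std::max(0.0f, C[i * n + j] + bias[j]);
}

// Fused: bias + ReLU are applied to the accumulator before the only store.
void gemm_fused_bias_relu(const std::vector<float>& A,
                          const std::vector<float>& B,
                          const std::vector<float>& bias,
                          std::vector<float>& C,
                          std::size_t m, std::size_t n, std::size_t k) {
    assert(A.size() == m * k && B.size() == k * n);
    assert(bias.size() == n && C.size() == m * n);
    for (std::size_t i = 0; i < m; ++i)
        for (std::size_t j = 0; j < n; ++j) {
            float acc = 0.0f;
            for (std::size_t p = 0; p < k; ++p)
                acc += A[i * k + p] * B[p * n + j];
            C[i * n + j] = std::max(0.0f, acc + bias[j]);
        }
}
```

Both functions produce the same output; the difference is only in how often C travels through memory, which is where the 9% mentioned above would be expected to go on a GPU.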

wtiandong commented on August 15, 2024

Hmm, I'm a little busy these days... I tried changing the block size/tile size a little, but had no good result: either the build failed or the kernel ran slowly. I think it needs deeper analysis.
If you use a TITAN or GeForce card, you can use CUTLASS directly. The slowdown only occurs on the TX2.

kerrmudgeon commented on August 15, 2024

Sorry to let this issue languish for so long. Several comments.

> Unfortunately I use strided batch gemms, so I'll need to wait for those to be implemented before I can adopt. Looking forward to it.

Batched strided GEMMs are implemented in CUTLASS 2.x!

> I don't think cutlass is capable of doing the same automatic heuristic based tuning as cublas.

Correct. CUTLASS performs no heuristics or decision-making on its own, except in selecting the first functionally sufficient kernel in the recently added host-side API in the CUTLASS Library.
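As a rough illustration of what "first functionally sufficient kernel" selection can look like, here is a hypothetical sketch. None of these types or functions are the CUTLASS Library API; `KernelInfo`, `Problem`, `is_sufficient`, and `select_first_sufficient` are invented names, and the sufficiency criteria (leading-dimension alignment and batched support) are assumptions chosen only to keep the example small.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical description of one compiled kernel (not a CUTLASS type).
struct KernelInfo {
    std::string name;
    int alignment;          // required alignment of leading dimensions, in elements
    bool supports_batched;  // whether the kernel can run batched problems
};

// Hypothetical problem description (not a CUTLASS type).
struct Problem {
    int lda, ldb, ldc;  // leading dimensions of A, B, C
    bool batched;
};

// A kernel is "functionally sufficient" if it can run the problem at all.
// No performance model is consulted.
bool is_sufficient(const KernelInfo& kernel, const Problem& p) {
    assert(kernel.alignment > 0);
    const bool aligned = p.lda % kernel.alignment == 0 &&
                         p.ldb % kernel.alignment == 0 &&
                         p.ldc % kernel.alignment == 0;
    return aligned && (kernel.supports_batched || !p.batched);
}

// Returns the first kernel, in table order, that can run the problem,
// or nullptr if none can. Table order is the only "preference".
const KernelInfo* select_first_sufficient(const std::vector<KernelInfo>& table,
                                          const Problem& p) {
    for (const KernelInfo& kernel : table)
        if (is_sufficient(kernel, p)) return &kernel;
    return nullptr;
}
```

The point of the sketch is that such a selector never measures or predicts speed, which is why matching a heuristic-tuned library on a given device may still require choosing tile sizes by hand.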

I'd like to close this issue as the code has undergone significant changes since it was opened. Feel free to re-open to resume this discussion.

Thanks!
