cutlass performance about cutlass HOT 8 CLOSED

nvidia commented on August 15, 2024
cutlass performance

from cutlass.

Comments (8)

wtiandong commented on August 15, 2024

OK,
I'm back to answer my own question. I tried --m=10240 --k=4096 --n=4096 on a TITAN X with CUDA 9.2, and cutlass performance is more than 95% of cublas.
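
For anyone repeating this comparison, here is a minimal sketch of how a "percent of cublas" figure can be derived from measured kernel times. It assumes the usual 2*m*n*k operation count for a GEMM (about 343.6 billion operations for the sizes above). The function names are hypothetical helpers, not part of cutlass or cublas, and the elapsed times are whatever your own profiler reports.

```cpp
#include <cassert>
#include <cstdint>

// Floating-point operations in one GEMM C = A * B with A (m x k) and B (k x n):
// each of the m*n outputs needs k multiplies and k adds.
double gemm_flops(std::int64_t m, std::int64_t n, std::int64_t k) {
    assert(m > 0 && n > 0 && k > 0);
    return 2.0 * static_cast<double>(m) * static_cast<double>(n) *
           static_cast<double>(k);
}

// Throughput in GFLOP/s for a kernel time measured in milliseconds.
// flops / (ms * 1e-3 s) / 1e9 simplifies to flops / (ms * 1e6).
double gemm_gflops(std::int64_t m, std::int64_t n, std::int64_t k,
                   double elapsed_ms) {
    assert(elapsed_ms > 0.0);
    return gemm_flops(m, n, k) / (elapsed_ms * 1.0e6);
}

// Relative performance of a candidate kernel against a baseline on the
// same problem size. A value of 0.95 means "95% of the baseline".
double relative_performance(double baseline_ms, double candidate_ms) {
    assert(baseline_ms > 0.0 && candidate_ms > 0.0);
    return baseline_ms / candidate_ms;
}
```

Because both kernels solve the same problem, the ratio of times equals the ratio of GFLOP/s, so the operation count only matters when reporting absolute throughput.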

wtiandong commented on August 15, 2024

Hmm, it is strange.

  1. I noticed what I did wrong the first time: I should use sgemm_nn to compare with cublas.
  2. On TITAN X with CUDA 9.2, the performance is 95% of cublas.
  3. On TITAN Xp with CUDA 9.0, the performance is also 95%.
  4. However, on Jetson TX2 with CUDA 9.0 and JetPack 3.2, the performance is only 70%.

Could someone give some suggestions?

GLJeff commented on August 15, 2024

I don't think cutlass is capable of doing the same automatic heuristic-based tuning as cublas. It would probably take some manual tweaking to get up to the 95% mark. Even though both devices are Pascal architecture, the core count and probably other important factors are very different. But even then, are you ever going to make up that 5% with a fused epilogue function? That's what I'm trying to figure out at the moment, but my guess is no.

wtiandong commented on August 15, 2024

Hi Jeff,
I profiled the whole system with nvvp, and it is worth doing.

GLJeff commented on August 15, 2024

NVVP is an awesome tool. I look forward to getting the same results then.

Let us know if you find a way to get good cutlass results on your TX2, perhaps with different block sizes.

GLJeff commented on August 15, 2024

I just want to confirm what wtiandong stated (contrary to my original guess): in my case as well, it would be beneficial to adopt cutlass for performance gains. My simple epilogue of merely a bias and ReLU is taking 9% of my forward pass time. Nearly all of that 9% comes from memory reads that would be unnecessary with an epilogue built into the GEMM kernel. There would be room for significant gains with cutlass running at 95% of cublas speed. Unfortunately I use strided batched GEMMs, so I'll need to wait for those to be implemented before I can adopt it. Looking forward to it.
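
To make the memory-traffic argument concrete, here is a minimal CPU reference sketch in C++. It is not cutlass code and not a GPU kernel; the function names are hypothetical. It only contrasts where the bias and ReLU are applied: on the accumulator before the single store to C (fused), or in a second pass that has to read C back (unfused).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Row-major reference GEMM: C (m x n) = A (m x k) * B (k x n).
// When `bias` is non-null, the epilogue (bias add + ReLU) is applied to the
// accumulator before the single store to C, so C is never read back.
void gemm_bias_relu(const std::vector<float>& A, const std::vector<float>& B,
                    const float* bias, std::vector<float>& C,
                    std::size_t m, std::size_t n, std::size_t k) {
    assert(A.size() == m * k && B.size() == k * n && C.size() == m * n);
    for (std::size_t i = 0; i < m; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            float acc = 0.0f;
            for (std::size_t p = 0; p < k; ++p) {
                acc += A[i * k + p] * B[p * n + j];
            }
            if (bias != nullptr) {
                acc = std::max(acc + bias[j], 0.0f);  // fused epilogue
            }
            C[i * n + j] = acc;
        }
    }
}

// Unfused epilogue: a second pass that loads every element of C again.
// This is the extra read traffic that the fused version avoids.
void bias_relu_inplace(std::vector<float>& C, const std::vector<float>& bias,
                       std::size_t m, std::size_t n) {
    assert(C.size() == m * n && bias.size() == n);
    for (std::size_t i = 0; i < m; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            C[i * n + j] = std::max(C[i * n + j] + bias[j], 0.0f);
        }
    }
}
```

Calling `gemm_bias_relu` with a bias pointer gives the same result as calling it with `nullptr` and then running `bias_relu_inplace`; the difference is one fewer full read and write of C.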

wtiandong commented on August 15, 2024

Hmm, I'm a little busy these days... I tried changing the block size/tile size a little, but got no good results: either the build failed or it ran slowly. I think it needs deeper analysis.
If you use a TITAN or GeForce, you can use cutlass directly. The slowdown only occurs on the TX2.

kerrmudgeon commented on August 15, 2024

Sorry to let this issue languish for so long. Several comments.

> Unfortunately I use strided batch gemms, so I'll need to wait for those to be implemented before I can adopt. Looking forward to it.

Batched strided GEMMs are implemented in CUTLASS 2.x!
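
For readers unfamiliar with the term, here is a rough CPU reference sketch of what a strided batched GEMM computes. It is not the CUTLASS API; the function name and parameters are hypothetical, and row-major storage is an assumption made for the example. Each batch's matrices sit at a fixed element offset (the stride) from the previous batch's.

```cpp
#include <cassert>
#include <cstddef>

// Reference semantics of a strided batched GEMM on row-major data:
// for each batch b, C_b (m x n) = A_b (m x k) * B_b (k x n), where
// A_b starts at A + b * stride_a, and likewise for B and C.
void strided_batched_gemm(const float* A, std::size_t stride_a,
                          const float* B, std::size_t stride_b,
                          float* C, std::size_t stride_c,
                          std::size_t m, std::size_t n, std::size_t k,
                          std::size_t batch_count) {
    assert(A != nullptr && B != nullptr && C != nullptr);
    for (std::size_t b = 0; b < batch_count; ++b) {
        const float* a = A + b * stride_a;  // start of this batch's A
        const float* bm = B + b * stride_b; // start of this batch's B
        float* c = C + b * stride_c;        // start of this batch's C
        for (std::size_t i = 0; i < m; ++i) {
            for (std::size_t j = 0; j < n; ++j) {
                float acc = 0.0f;
                for (std::size_t p = 0; p < k; ++p) {
                    acc += a[i * k + p] * bm[p * n + j];
                }
                c[i * n + j] = acc;
            }
        }
    }
}
```

In this sketch, passing a stride of 0 for an operand reuses the same matrix for every batch, since every batch then starts at the same address.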

> I don't think cutlass is capable of doing the same automatic heuristic based tuning as cublas.

Correct. CUTLASS performs no heuristics or decision-making on its own, except in selecting the first functionally sufficient kernel in the recently added host-side API in the CUTLASS Library.

I'd like to close this issue as the code has undergone significant changes since it was opened. Feel free to re-open to resume this discussion.

Thanks!
