How to use NVTX in device code · nvtx · closed

nvidia commented on June 24, 2024:
How to use NVTX in device code.

from nvtx.

Comments (2)

jcohen-nvidia commented on June 24, 2024:

Hi Xiaobin,

NVTX is not usable from device code, by design. We've thought a lot about how we could implement this, but it would almost certainly cause unacceptable performance degradation. Keep in mind that CUDA kernels make best use of the GPU when they launch thousands or millions of threads. Imagine a 1-million-thread kernel with a single NVTX range, using nvtxRangePushA and nvtxRangePop, running under a trace tool like Nsight Systems. Just one kernel launch like this would produce 2 million trace records, and the Push records contain a string of unbounded length. With two 8-byte timestamps and perhaps 24 more bytes of other data (already optimistically low), that's 40 bytes per range. So for the whole kernel, that's 40 MB of trace data being generated on the GPU, and that data would have to be transferred from device memory to host memory, and then to disk -- all without harming performance. It's already challenging to keep the overhead of tracing just the CUDA kernel start and end times under 1 µs per kernel launch, and the overhead added by NVTX calls would be far worse, scaling with the number of NVTX calls times the number of threads.

This is why we think it's not a plausible solution to the problem of investigating performance within a CUDA kernel. For that problem, I recommend using Nsight Compute to analyze specific kernels. You could use Nsight Systems first to identify which kernels are the bottlenecks in your application, and then use NVTX ranges or other tricks to focus Nsight Compute's deep-dive analysis on just the problematic kernels.
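To make the recommended host-side pattern concrete, here is a minimal sketch of annotating a kernel launch with an NVTX range. The kernel and function names are hypothetical; the `nvtxRangePushA`/`nvtxRangePop` calls and the header-only NVTX v3 header path (`nvtx3/nvToolsExt.h`, bundled with recent CUDA toolkits) are the assumed setup:

```cuda
// Host-side NVTX usage (the supported pattern): annotate around a kernel
// launch so tools like Nsight Systems / Nsight Compute can filter on the
// range. One Push/Pop pair per launch, on the host thread -- not one per
// device thread.
#include <nvtx3/nvToolsExt.h>

// Hypothetical example kernel.
__global__ void myKernel(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

void launchWithRange(float* d_data, int n) {
    nvtxRangePushA("myKernel");                     // open the named range
    myKernel<<<(n + 255) / 256, 256>>>(d_data, n);
    cudaDeviceSynchronize();                        // keep the work inside the range
    nvtxRangePop();                                 // close it: 2 records total
}
```

A range like this is also what you would pass to Nsight Compute's NVTX filtering to restrict deep-dive profiling to just the problematic kernels.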

If you really do want to track progress of the many device code threads arriving at a certain line of code during a kernel's execution, there is a way to do that -- the "pmevent" instruction, which is callable using inline PTX assembly:
https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#miscellaneous-instructions-pmevent
This instruction increments a GPU hardware counter any time a CUDA device-code thread issues it. If you place it at an important point in your device function, you'd expect the counter to go from zero at kernel start up to the total number of threads once they've all passed that line. You can then use tools like Nsight Systems, Nsight Compute, or the PerfWorks library to sample these counters and graph them over time, so you can watch progress during a kernel launch. The effect on kernel performance should be insignificant with this approach, even with huge numbers of pmevents. Still, this isn't an approach to device-code perf analysis I'd recommend. I encourage you to use Nsight Compute, learn what its reported metrics mean, and learn how to change your code to avoid the perf issues it highlights -- the documentation will help you understand the inner workings of the GPU and how to write efficient device code.
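As a sketch of the pmevent approach described above (the kernel itself is a hypothetical illustration; the `pmevent` instruction and its immediate operand 0-15 are documented in the PTX ISA):

```cuda
// Device-side progress marker via the PTX "pmevent" instruction.
// Each thread that reaches the marked line pulses a hardware
// performance-monitor event; sample and graph the corresponding counter
// with a tool such as Nsight Systems to watch kernel progress over time.
__global__ void processWithMarker(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    out[i] = in[i] * in[i];       // the work whose completion we track

    asm volatile("pmevent 0;");   // each arriving thread triggers event slot 0
}
```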

I tried to briefly describe this in the NVTX docs here:
https://github.com/NVIDIA/NVTX#which-platforms-does-nvtx-support
...but I should probably pull this topic out to be its own section, because this is a request we get often. And when describing pmevent, I should also make sure all relevant tools have documentation for how to collect & display that counter, and then provide links from the NVTX docs to the tool-specific docs for that.


jiangxiaobin96 avatar jiangxiaobin96 commented on June 24, 2024

Greatly appreciated!

