Comments (2)
Hi Xiaobin,
NVTX is not usable from device code, by design. We've thought a lot about how we could implement it, but it would almost certainly cause unacceptable performance degradation. Keep in mind that CUDA kernels make the best use of the GPU when they launch thousands or millions of threads. Imagine a 1-million-thread kernel with a single NVTX range, using nvtxRangePushA and nvtxRangePop, running under a trace tool like Nsight Systems. Just one kernel launch like this would produce 2 million trace records, and the Push records contain a string of unbounded length. With two 8-byte timestamps and perhaps 24 more bytes of other data (already optimistically low), that's 40 bytes per range. So for the whole kernel, that's 40 MB of trace data being generated on the GPU, and that data would have to be transferred from device memory to host memory, and then to disk -- all without harming performance. It's already challenging to keep the overhead of tracing just the CUDA kernel start and end times below 1 µs per kernel launch, and the overhead added by NVTX calls would be far worse, scaling with the number of NVTX calls times the number of threads.
This is why we think it's not a plausible solution to the problem of investigating performance within a CUDA kernel. For that problem, I recommend using Nsight Compute to analyze specific kernels. You could use Nsight Systems first to identify which kernels are the bottlenecks in your application, and then use NVTX ranges or other tricks to focus Nsight Compute's deep-dive analysis on just the problematic kernels.
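For reference, the host-side pattern being recommended looks roughly like this: NVTX ranges annotate CPU-side code around kernel launches, and Nsight Systems correlates them with the GPU activity inside. This is only a sketch; the kernel, its launch configuration, and the range label are placeholders.

```cuda
#include <nvtx3/nvToolsExt.h>  // NVTX v3 is header-only

__global__ void myKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;  // placeholder work
}

void launchWithRange(float *d_data, int n) {
    // The range brackets the host-side launch and sync,
    // not the per-thread work inside the kernel.
    nvtxRangePushA("myKernel launch");
    myKernel<<<(n + 255) / 256, 256>>>(d_data, n);
    cudaDeviceSynchronize();
    nvtxRangePop();
}
```

Ranges like this are also how you can mark the problematic region of your app so that Nsight Compute's analysis can be focused on just the kernels launched inside it.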
If you really do want to track progress of the many device code threads arriving at a certain line of code during a kernel's execution, there is a way to do that -- the "pmevent" instruction, which is callable using inline PTX assembly:
https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#miscellaneous-instructions-pmevent
This instruction increments a GPU hardware counter any time a CUDA device-code thread issues it. If you place this instruction at an important point in your device function, you'd expect the hardware counter to climb from zero at kernel start to the total number of threads once they've all passed that line. You can then use tools like Nsight Systems, Nsight Compute, or the PerfWorks library to sample these counters and graph them over time, letting you watch progress during a kernel's execution. The effect on kernel performance should be insignificant with this approach, even with huge numbers of pmevents. Still, this isn't an approach to device-code perf analysis I'd generally recommend... I encourage you to use Nsight Compute, learn what its reported metrics mean, and learn how to change your code to avoid the perf issues it highlights -- the documentation will help you understand the inner workings of the GPU and how to write efficient device code.
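As a rough sketch of the pmevent approach (the counter index 0 and the way a given tool samples and displays that counter are assumptions to verify against the PTX documentation and your tool's docs):

```cuda
__global__ void progressKernel(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    out[i] = in[i] * in[i];  // placeholder work

    // Each thread that reaches this line bumps performance-monitor event 0.
    // A profiler sampling that counter over time sees it rise from 0 toward n,
    // giving a live progress indicator for the kernel.
    asm volatile("pmevent 0;");
}
```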
I tried to briefly describe this in the NVTX docs here:
https://github.com/NVIDIA/NVTX#which-platforms-does-nvtx-support
...but I should probably pull this topic out to be its own section, because this is a request we get often. And when describing pmevent, I should also make sure all relevant tools have documentation for how to collect & display that counter, and then provide links from the NVTX docs to the tool-specific docs for that.
from nvtx.
Greatly appreciated!