The NVIDIA® Tools Extension SDK (NVTX) is a C-based Application Programming Interface (API) for annotating events, code ranges, and resources in your applications.

License: Apache License 2.0

C 52.96% CMake 4.57% C++ 26.73% Makefile 0.16% Python 6.30% Batchfile 0.20% Cython 2.76% Shell 0.22% CSS 6.09%

nvtx's Introduction

NVTX (NVIDIA Tools Extension Library)

NVTX is a cross-platform API for annotating source code to provide contextual information to developer tools.

The NVTX API is written in C, with wrappers provided for C++ and Python.

What does NVTX do?

By default, NVTX API calls do nothing. When you launch a program from a developer tool, NVTX calls in that program are redirected to functions in the tool. Developer tools are free to implement NVTX API calls however they wish.

Here are some examples of what a tool might do with NVTX calls:

  • Print a message to the console
  • Record a trace of when NVTX calls occur, and display them on a timeline
  • Build a statistical profile of NVTX calls, or time spent in ranges between calls
  • Enable/disable tool features in ranges bounded by NVTX calls matching some criteria
  • Forward the data to other logging APIs or event systems
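
To make the "statistical profile" idea concrete, here is a minimal Python sketch of what a tool-side handler for push/pop ranges could look like. ToyProfiler and its methods are hypothetical names invented for this illustration; they are not part of NVTX or of any real tool, and a fake clock is used so the numbers are deterministic.

```python
import threading
import time
from collections import defaultdict


class ToyProfiler:
    """Hypothetical tool-side handler; not part of NVTX or any real tool."""

    def __init__(self, clock=time.perf_counter):
        self._clock = clock
        self._local = threading.local()       # one range stack per thread
        self._lock = threading.Lock()
        self.total_time = defaultdict(float)  # message -> accumulated time
        self.call_count = defaultdict(int)    # message -> number of ranges

    def _stack(self):
        if not hasattr(self._local, "stack"):
            self._local.stack = []
        return self._local.stack

    def push(self, message):
        # Remember the message and start time of the new innermost range.
        self._stack().append((message, self._clock()))

    def pop(self):
        # Close the innermost open range on the calling thread.
        message, start = self._stack().pop()
        elapsed = self._clock() - start
        with self._lock:
            self.total_time[message] += elapsed
            self.call_count[message] += 1


# A fake clock that advances by one unit per reading.
ticks = iter(range(100))
profiler = ToyProfiler(clock=lambda: float(next(ticks)))

profiler.push("some_function")      # clock reads 0
for _ in range(2):
    profiler.push("loop range")     # clock reads 1, then 3
    profiler.pop()                  # clock reads 2, then 4
profiler.pop()                      # clock reads 5
```

After this run, the sketch has recorded two "loop range" ranges and one enclosing "some_function" range. A real tool would receive the same sequence of calls but is free to do something entirely different with them.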

Example: Visualize loop iterations on a timeline

This C++ example annotates some_function with an NVTX range using the function's name. This range begins at the top of the function body, and automatically ends when the function returns. The function performs a loop, sleeping for one second in each iteration. A local nvtx3::scoped_range annotates the scope of the loop body. The loop iteration ranges are nested within the function range.

#include <chrono>
#include <thread>

#include <nvtx3/nvtx3.hpp>

void some_function()
{
    NVTX3_FUNC_RANGE();  // Range around the whole function

    for (int i = 0; i < 6; ++i) {
        nvtx3::scoped_range loop{"loop range"};  // Range for iteration

        // Make each iteration last for one second
        std::this_thread::sleep_for(std::chrono::seconds{1});
    }
}

Normally, this program waits for 6 seconds, and does nothing else.

Launch it from NVIDIA Nsight Systems, and you'll see this execution on a timeline:

[Figure: Example NVTX Ranges in Nsight Systems]

The NVTX row shows the function's name "some_function" in the top-level range and the "loop range" message in the nested ranges. The loop iterations each last for the expected one second.

What kinds of annotation does NVTX provide?

Markers

Markers annotate a specific point in a program's execution with a message. Optional extra fields may be provided: a category, a color, and a payload value.

Ranges

Ranges annotate a range between two points in a program's execution, like a related pair of markers. There are two types of ranges:

  • Push/Pop ranges, which can be nested to form a stack
    • The Pop call is automatically associated with a prior Push call on the same thread
  • Start/End ranges, which may overlap with other ranges arbitrarily
    • The Start call returns a handle which must be passed to the End call
    • These ranges can start and end on different threads
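
The difference between the two range types can be sketched from a tool's point of view. The following Python model is a hypothetical illustration (ToyRangeTracker is an invented name, not an NVTX API): Start hands out a handle, and End looks the range up by that handle, so ranges may overlap and may be ended on a different thread.

```python
import itertools
import threading


class ToyRangeTracker:
    """Hypothetical tool-side bookkeeping; not part of NVTX."""

    def __init__(self):
        self._ids = itertools.count(1)
        self._lock = threading.Lock()
        self.open_ranges = {}   # handle -> message
        self.closed = []        # (message, name of the thread that ended it)

    def start(self, message):
        # Start returns a handle; no per-thread stack is involved.
        with self._lock:
            handle = next(self._ids)
            self.open_ranges[handle] = message
        return handle

    def end(self, handle):
        # End identifies its range by handle, so any thread may call it.
        with self._lock:
            message = self.open_ranges.pop(handle)
            self.closed.append((message, threading.current_thread().name))


tracker = ToyRangeTracker()
upload = tracker.start("upload")      # started on the current thread
decode = tracker.start("decode")      # overlaps with "upload"

worker = threading.Thread(target=tracker.end, args=(upload,), name="worker")
worker.start()
worker.join()                         # "upload" is ended on another thread
tracker.end(decode)                   # "decode" is ended on the starting thread
```

Push/Pop ranges need no handle because the tool can pair each Pop with the most recent Push on the same thread, which is also why they cannot cross threads.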

The C++ and Python interfaces provide objects and decorators for automatically managing the lifetimes of ranges.

Resource naming/tracking

Resource naming associates a displayable name string with an object. For example, naming CPU threads allows a tool that displays thread activity on a timeline to have more meaningful labels for its rows than a numeric thread ID.

Resource tracking extends the idea of naming to include object lifetime tracking, as well as important usage of the object. For example, a mutex provided by platform API (e.g. pthread_mutex, CriticalSection) can be tracked by a tool that intercepts its lock/unlock API calls, so using NVTX to name these mutex objects would be sufficient to see the names of mutexes being locked/unlocked on a timeline. However, manually implemented spin-locks may not have an interceptible API, so tools can't automatically detect when they are used. Use NVTX to annotate these types of mutexes where they are locked/unlocked to enable tools to track them just like standard platform API mutexes.

How do I use NVTX in my code?

C and C++

For C and C++, NVTX is a header-only library with no dependencies. Simply #include the header(s) you want to use, and call NVTX functions! NVTX initializes automatically during the first call to any NVTX function.

It is not necessary to link against an NVTX binary library. On POSIX platforms, however, the -ldl option must be added to the linker command line, because NVTX uses the platform's dynamic loading API to load a tool's library.

NOTE: Older versions of NVTX did require linking against a dynamic library. NVTX version 3 provides the same API, but removes the need to link with any library. Ensure you are including NVTX v3 by using the nvtx3 directory as a prefix in your #includes:

C:

#include <nvtx3/nvToolsExt.h>

void example()
{
    nvtxMark("Hello world!");
}

C++:

#include <nvtx3/nvtx3.hpp>

void example()
{
    nvtx3::mark("Hello world!");
}

The NVTX C++ API is a set of wrappers around the C API, so the C API functions are usable from C++ as well.

Since the C and C++ APIs are header-only, dependency-free, and don't require explicit initialization, they are suitable for annotating other header-only libraries. Libraries using different versions of the NVTX headers in the same translation unit or different translation units will not have conflicts, as long as best practices are followed.

See more details in the c directory of this repo, and in the NVTX C API Reference and NVTX C++ API Reference.

CMake

For projects that use CMake, the CMake scripts included with NVTX provide targets nvtx3-c and nvtx3-cpp. Use target_link_libraries to make any CMake target use nvtx3-c for the C API only and nvtx3-cpp for both the C and C++ APIs. Since NVTX is a header-only library, these targets simply add the include search path for the NVTX headers and add the -ldl linker option where required. Example usage:

# Example C program
add_executable(some_c_program main.c)
target_link_libraries(some_c_program PRIVATE nvtx3-c)
# main.c can now do #include <nvtx3/nvToolsExt.h>

# Example C++ program
add_executable(some_cpp_program main.cpp)
target_link_libraries(some_cpp_program PRIVATE nvtx3-cpp)
# main.cpp can now do #include <nvtx3/nvtx3.hpp>

NVTX provides two different ways to define the CMake targets:

Normal CMake targets (non-IMPORTED)

Non-IMPORTED targets are global to the entire build. In a typical CMake codebase, add_subdirectory is used to include every directory in a source tree, where each contains a CMakeLists.txt file that defines targets usable anywhere in the build. The NVTX CMakeLists.txt file defines normal (non-IMPORTED) targets when add_subdirectory is called on that directory.

This example code layout has a few imported third-party libraries and a separate directory for its own source. It shows that adding the NVTX directory to CMake allows the nvtx3-cpp target to be used elsewhere in the source tree:

  • CMakeLists.txt
    add_subdirectory(Imports)
    add_subdirectory(Source)
  • Imports/
    • CMakeLists.txt
      add_subdirectory(SomeLibrary)
      add_subdirectory(NVTX)
      add_subdirectory(SomeOtherLibrary)
    • SomeLibrary/
    • NVTX/ (this is the downloaded copy of NVTX)
      • CMakeLists.txt (defines nvtx3-c and nvtx3-cpp targets)
      • nvtxImportedTargets.cmake (helper script)
      • include/
        • nvtx3/ (all NVTX headers)
    • SomeOtherLibrary/
  • Source/
    • CMakeLists.txt
      add_executable(my_program main.cpp)
      target_link_libraries(my_program PRIVATE nvtx3-cpp)
    • main.cpp (does #include <nvtx3/nvtx3.hpp>)

Another example is when the NVTX directory must be added with a relative path that is not a subdirectory. In this case, CMake requires a second parameter to add_subdirectory to give a unique name for the directory where build output goes:

  • Utils/
    • SomeLibrary/
    • NVTX/ (this is the downloaded copy of NVTX)
      • CMakeLists.txt (defines nvtx3-c and nvtx3-cpp targets)
      • nvtxImportedTargets.cmake (helper script)
      • include/
        • nvtx3/ (all NVTX headers)
    • SomeOtherLibrary/
  • Project1/
  • Project2/
  • Project3/
    • CMakeLists.txt
      add_subdirectory("${CMAKE_CURRENT_LIST_DIR}/../Utils/NVTX" "ImportNVTX")
      
      add_executable(my_program main.cpp)
      target_link_libraries(my_program PRIVATE nvtx3-cpp)
    • main.cpp (does #include <nvtx3/nvtx3.hpp>)

When defining normal (non-IMPORTED) targets, the NVTX CMake scripts avoid target-already-defined errors by checking if the targets exist before attempting to define them. This enables the following scenarios:

  • The same NVTX directory can be added more than once
  • Multiple directories with copies of the same NVTX version can be added
  • Multiple directories with different versions of NVTX can be added
    • If the newest version is added first, everything should work:
      • The nvtx3-c/nvtx3-cpp targets will point to the newest version
    • If a new version is added after an old version:
      • The nvtx3-c/nvtx3-cpp targets will point to the old version
      • If features of the newest version are used, compilation will fail
      • The NVTX CMake scripts print a warning for this case

Normal (non-IMPORTED) targets will be defined when using CPM (CMake Package Manager) to fetch NVTX directly from the internet. Thus, NVTX targets defined via CPM follow the behavior described above. This example shows usage of CPM instead of a local copy of NVTX:

  • Source/
    • CMakeLists.txt
      include(path/to/CPM.cmake)
      
      CPMAddPackage(
          NAME NVTX
          GITHUB_REPOSITORY NVIDIA/NVTX
          GIT_TAG v3.1.0-c-cpp
          GIT_SHALLOW TRUE)
      
      add_executable(my_program main.cpp)
      target_link_libraries(my_program PRIVATE nvtx3-cpp)
    • main.cpp (does #include <nvtx3/nvtx3.hpp>)

See the CPM section below in "How do I get NVTX?" for more details.

IMPORTED CMake targets

IMPORTED targets are scoped to the directory where they are defined. These are useful when defining targets for dependencies of a SHARED or STATIC library, because they won't conflict with targets of the same name elsewhere in the build. This lets a library ensure it is using the expected version of its own dependencies without imposing that version on other parts of the build. NVTX provides the nvtxImportedTargets.cmake script to define the targets nvtx3-c and nvtx3-cpp as IMPORTED. Use include("path/to/nvtxImportedTargets.cmake") from any CMakeLists.txt file to define the NVTX targets scoped locally to that directory.

This example shows a program that imports multiple third-party libraries, which each import their own copy of NVTX:

  • CMakeLists.txt
    add_subdirectory(Imports)
    add_subdirectory(Source)
  • Imports/
    • CMakeLists.txt
      add_subdirectory(Foo)
      add_subdirectory(Bar)
    • Foo/
      • CMakeLists.txt
        include(Detail/NVTX/nvtxImportedTargets.cmake)
        add_library(foo STATIC foo.cpp)
        target_link_libraries(foo PRIVATE nvtx3-cpp)
      • Detail/
        • NVTX/ (downloaded copy of NVTX, version 3.1)
      • foo.cpp (does #include <nvtx3/nvtx3.hpp>)
    • Bar/
      • CMakeLists.txt
        include(Detail/NVTX/nvtxImportedTargets.cmake)
        add_library(bar SHARED bar.cpp)
        target_link_libraries(bar PRIVATE nvtx3-cpp)
      • Detail/
        • NVTX/ (downloaded copy of NVTX, version 3.2)
      • bar.cpp (does #include <nvtx3/nvtx3.hpp>)
  • Source/
    • CMakeLists.txt
      add_executable(my_program main.cpp)
      target_link_libraries(my_program PRIVATE foo bar)

Note that in this example, Foo uses an older version of NVTX than Bar, and Foo is added before Bar. Since the NVTX CMake target definitions are local within the Foo and Bar directories, both libraries will use their own copies. Bar can safely use NVTX version 3.2 features, even though Foo used version 3.1 earlier. There will be no warnings printed that an older NVTX version was added before a newer one, unlike the case with global (non-IMPORTED) target definitions (see above).

Python

Install NVTX for Python using pip or conda, and use import nvtx in your code:

import nvtx

nvtx.mark(message="Hello world!")

See more details in the python directory in this repo.
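
As a slightly larger sketch, nvtx.annotate can be used as a context manager to wrap a block of code in a range. The fallback branch below is an assumption added for this example so that the script still runs where the nvtx package is not installed; it is not something the package provides.

```python
import contextlib

try:
    import nvtx                       # real package: python3 -m pip install nvtx
    annotate = nvtx.annotate
except ImportError:
    # Hypothetical fallback: a no-op context manager with a similar call shape.
    def annotate(message=None, **kwargs):
        return contextlib.nullcontext()


def sum_of_squares(n):
    total = 0
    with annotate("sum_of_squares", color="green"):   # range around the loop
        for i in range(n):
            total += i * i
    return total


result = sum_of_squares(4)   # 0 + 1 + 4 + 9
```

The function computes the same result whether or not a tool is attached; the range only adds context for a tool that is listening.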

How do I get NVTX?

C/C++

Get NVTX from GitHub

The C/C++ NVTX headers are provided by the NVIDIA NVTX GitHub repo, in the c directory. This is the most up-to-date copy of NVTX. Copying that directory into your codebase is sufficient to use NVTX.

The release-v3 branch is the version officially supported by NVIDIA tools. Other branches may incur breaking changes at any time and are not recommended for use in production code. The release-v3-c-cpp branch is maintained as a copy of the c directory from release-v3, so downloading release-v3-c-cpp is a lightweight way to get all that is needed to build C/C++ programs with NVTX.

Get NVTX with NVIDIA Developer Tools

Some NVIDIA developer tools include the NVTX v3 library as part of the installation. See the documentation of the tools for details about where the NVTX headers are installed.

Get NVTX with the CUDA Toolkit

The CUDA toolkit provides NVTX v3.

Note that the toolkit may also include older versions for backwards compatibility, so be sure to use version 3 (the nvtx3 subdirectory of headers) for best performance, convenience, and support. Use #include <nvtx3/nvToolsExt.h> instead of #include <nvToolsExt.h> to ensure code is including v3!

Get NVTX using CMake Package Manager (CPM)

CMake Package Manager (CPM) is a utility that automatically downloads dependencies when CMake first runs on a project, and adds their CMakeLists.txt to the build. The downloaded files can be stored in an external cache directory to avoid redownloading during clean builds, and to enable offline builds once the cache is populated.

To use CPM, download CPM.cmake from CPM's repo and save it in your project. Then you can fetch NVTX directly from GitHub with CMake code like this (CMake 3.14 or greater is required):

include(path/to/CPM.cmake)

CPMAddPackage(
    NAME NVTX
    GITHUB_REPOSITORY NVIDIA/NVTX
    GIT_TAG v3.1.0-c-cpp
    GIT_SHALLOW TRUE)

# Example C program
add_executable(some_c_program main.c)
target_link_libraries(some_c_program PRIVATE nvtx3-c)

# Example C++ program
add_executable(some_cpp_program main.cpp)
target_link_libraries(some_cpp_program PRIVATE nvtx3-cpp)

Note that this downloads from GitHub using a version tag with the suffix -c-cpp. The v3.x.y tags point to the release-v3 branch, which contains the entire repo. The v3.x.y-c-cpp tags point to a separate branch called release-v3-c-cpp containing only the c directory of the repo, which is the bare minimum needed to use NVTX in a C or C++ project. If you specify GIT_TAG v3.x.y to download the full repo, the SOURCE_SUBDIR c option is also needed to tell CMake where CMakeLists.txt is in the downloaded repo. Also, avoid downloading the full history of the repo by using GIT_SHALLOW TRUE to download only the requested version.

Python

Get NVTX using Conda

conda install -c conda-forge nvtx

Get NVTX using pip

python3 -m pip install nvtx

What tools support NVTX?

These NVIDIA tools provide built-in support for NVTX:

  • Nsight Systems logs NVTX calls and shows them on a timeline alongside driver/OS/hardware events
  • Nsight Compute uses NVTX ranges to focus where deep-dive GPU performance analysis occurs
  • Nsight Graphics uses NVTX ranges to set bounds for range profiling in the Frame Debugger
  • The CUPTI API supports recording traces of NVTX calls

Other tools may provide NVTX support as well -- see the tool documentation for details.

Which platforms does NVTX support?

NVTX was designed to work on:

  • Windows
  • Linux and other POSIX-like platforms (including cygwin)
  • Android

Both 64-bit and 32-bit processes are supported. There are no restrictions on CPU architecture.

NVTX may work on other POSIX-like platforms, but support is not guaranteed. NVTX relies on the platform's standard API to load a dynamic library (.dll) or shared object (.so). Platforms without dynamic library functionality cannot support NVTX.

NVTX is not supported in GPU code, such as __device__ functions in CUDA. While NVTX for GPU may intuitively seem useful, keep in mind that GPUs are best utilized with thousands or millions of threads running the same function in parallel. A tool tracing ranges in every thread would produce an unreasonably large amount of data, and would incur large performance overhead to manage this data. Efficient instrumentation of CUDA GPU code is possible with the pmevent PTX instruction, which can be monitored by hardware performance counters with no overhead.

See the documentation for individual tools to see which platforms they support.

Which languages/compilers does NVTX support?

C

The NVTX C API is a header-only library, implemented using standard C89/C90. The headers can be compiled with -std=gnu90 or newer using many common compilers. Tested compilers include:

  • GNU gcc
  • clang
  • Microsoft Visual C++
  • NVIDIA nvcc

C89 support in these compilers has not changed in many years, so even very old compiler versions should work.

See more details in the c directory of this repo, and the NVTX C API Reference.

C++

The NVTX C++ API is a header-only library, implemented as a wrapper over the NVTX C API, using standard C++11. The C++ headers are provided alongside the C headers, and can be compiled with -std=c++11 or newer using many common compilers. Tested compilers include:

  • GNU g++ (4.8.5 to 11.1)
  • clang (3.5.2 to 12.0)
  • Microsoft Visual C++ (VS 2015 to VS 2022)
    • On VS 2017.7 and newer, NVTX enables better error message output
  • NVIDIA nvcc (CUDA 7.0 and newer)

See more details in the c directory of this repo, and the NVTX C++ API Reference.

Python

The NVTX Python API provides native Python wrappers for a subset of the NVTX C API. NVTX Python requires Python 3.6 or newer. It has been tested on Linux, with Python 3.6 to 3.9.

See more details in the python directory of this repo.

Other languages

Any language that can call into C with normal calling conventions can work with the NVTX C API. There are two general approaches to implement NVTX wrappers in other languages:

  1. Write C code that #includes the NVTX headers and exposes NVTX functionality through a language binding interface. Since the NVTX C API uses pointers and unions, wrappers for other languages may benefit from a more idiomatic API for ease of use. NVTX for Python uses this approach, based on Cython.
  2. Make a dynamic library that exports the NVTX C API directly, and use C interop bindings from the other language to call into this dynamic library. To create a dynamic library from the NVTX v3 C headers, simply compile this .c file as a dynamic library:
    #define NVTX_EXPORT_API
    #include <nvtx3/nvToolsExt.h>
    // #include any other desired NVTX C API headers here to export them

Older versions of NVTX distributed a dynamic library with C API exported. Projects depending on that library can use the code above to recreate a compatible library from NVTX v3.

NOTE: Official Fortran support coming soon!

How much overhead does NVTX add to my code?

The first call to any NVTX API function in a process will trigger initialization of the library. The implementation checks an environment variable to see if a tool wishes to intercept the NVTX calls.

When no tool is present, initialization disables all the NVTX API functions. Subsequent NVTX API calls are a handful of instructions in a likely-inlined function to jump over the disabled call.

When a tool is present, initialization configures the NVTX API so all subsequent calls jump directly into that tool's implementation. Overhead in this case is entirely determined by what the tool does.

The first NVTX call can incur significant overhead while loading and initializing the tool. If this first call happens in a latency-sensitive part of the program (e.g. a game with low frame-rate detection), it may cause the program to behave differently with the tool vs. without the tool. The nvtxInitialize C API function is provided for this situation, to allow force-initializing NVTX at a convenient time, without any other contextual meaning like a marker. It is not necessary to use nvtxInitialize in other cases.

How do I disable all NVTX calls at compile-time?

Providing non-public information to tools via NVTX is helpful in internal builds, but may not be acceptable for public release builds. The entire NVTX C and C++ APIs can be preprocessed out with a single macro before including any NVTX headers:

#define NVTX_DISABLE

Or add -DNVTX_DISABLE to the compiler command line, only in the configuration for public builds. This avoids having to manually add #ifs around NVTX calls solely for the purpose of disabling all of them in specific build configurations.

General Usage Guidelines

Add ranges around important sections of code

Developer tools often show low-level information about what the hardware or operating system is doing, but without correlation to the high-level structure of your program. Annotate sections of your code with NVTX ranges to add contextual information, so the information reported by tools can be extended to show where in your program the low-level events occur. This also enables some tools to target only these important parts of your program, and to choose which parts to target in the tool options -- no need to recompile your code to target other sections!

Give, don't take

NVTX is primarily a one-way API. Your program gives information to the tool, but it does not get actionable information back from the tool. Some NVTX functions return values, but these should only be used as inputs to other NVTX functions. Programs should not behave differently based on these values, because it is important that tools can see programs behaving the same way they would without any tools present!

Avoid depending on any particular tool

Do not use NVTX for any functionality that is required for your program to work correctly. If a program depends on a particular tool being present to work, then it would be impossible to use any other NVTX tools with this program. NVTX does not currently support multiple tools being attached to the same program.

Isolate NVTX annotations in a library using a Domain

It is possible for a program to use many libraries, all of which include NVTX annotations. When running such a program in a tool, it is helpful if the user can keep these libraries' annotations separate. A library should isolate its annotations from other libraries by creating a "domain", and performing all marker/range/naming annotations within that domain. Tools can provide options for which domains to enable, and use domains to group annotation data by library.

The domain also acts as a namespace: Different domains may use the same hard-coded values for category IDs without conflict. The NVTX C++ API provides initialize-on-first-use for domains to avoid the need for up-front initialization.

Use categories to organize annotations

While domains are meant to keep the annotations from different libraries separate, it may be useful within a library to have separate categories for annotations. NVTX markers and ranges provide a "category ID" field for this purpose. This integer may be hard-coded, like an enum in C/C++. NVTX provides API functions to assign a name to a category ID value, so tools can display meaningful names for categories. Tools are encouraged to logically group annotations into categories. Using slashes in category names like filesystem paths allows the user to create a hierarchy of categories, and tools should handle these as a hierarchy.
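
As an illustration of the slash convention, a tool could turn named category IDs into a tree along these lines. This is a hypothetical tool-side sketch in Python: build_category_tree and the "_ids" key are inventions of this example, not NVTX APIs.

```python
def build_category_tree(category_names):
    """Group category IDs into a nested dict keyed by path component.

    category_names maps a category ID to a name such as "io/read".
    The "_ids" key is a convention of this sketch only.
    """
    tree = {}
    for category_id, name in sorted(category_names.items()):
        node = tree
        for part in name.split("/"):
            node = node.setdefault(part, {})
        node.setdefault("_ids", []).append(category_id)
    return tree


# Category IDs hard-coded by a program, with the names it gave to the tool.
names = {1: "io/read", 2: "io/write", 3: "compute"}
tree = build_category_tree(names)
```

Here "io/read" and "io/write" end up as children of a shared "io" node, which is the grouping a user would expect from path-like names.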

Avoid slow processing to prepare arguments for NVTX calls

When tools are not present, the first NVTX call quickly configures the API to make all subsequent NVTX calls into no-ops. However, any processing done before making an NVTX call to prepare the arguments for the call is not disabled. Using a function like sprintf to generate a message string dynamically for each call will add overhead even in the case when no tool is present! Instead of generating message strings, it is more efficient to pass a hard-coded string for the message, and a variable as the payload.
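
The effect can be shown with a small Python sketch. Here mark is a stand-in for an annotation call that has been disabled, and describe stands for any costly message formatting; both are hypothetical names used only in this example.

```python
stats = {"format_calls": 0}


def mark(message, payload=None):
    """Stand-in for an annotation call that is disabled (no tool attached)."""


def describe(step):
    """Represents any costly message formatting; counts how often it runs."""
    stats["format_calls"] += 1
    return "solver step %d" % step


# Costly pattern: the message is built on every call, even though the
# annotation itself does nothing.
for step in range(1000):
    mark(describe(step))
calls_with_dynamic_message = stats["format_calls"]

# Cheaper pattern: constant message, with the varying value as the payload.
stats["format_calls"] = 0
for step in range(1000):
    mark("solver step", payload=step)
calls_with_constant_message = stats["format_calls"]
```

The formatting work runs once per call in the first loop and not at all in the second, regardless of whether the annotation call itself is a no-op.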

Register strings that will be used many times

In each NVTX marker or range, tools may copy the message string into a log file, or test the string (e.g. with a regex) to see if it matches some criteria for triggering other functionality. If the same message string is used repeatedly, this work in the tool would be redundant. To reduce the tool overhead and help keep log files smaller, NVTX provides functions to "register" a message string. These functions return a handle that can be used in markers and ranges in place of a message string. This allows tools to log or test message strings just once, when they are registered. Logs will be smaller when storing handle values instead of large strings, and string tests reduce to lookup of precomputed answers. The NVTX3_FUNC_RANGE macros, for example, register the function's name and save the handle in a local static variable for efficient reuse in subsequent calls to that function. Some tools may require using registered strings for overhead-sensitive functionality, such as using NVTX ranges to start/stop data collection in Nsight Systems.
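
A rough model of the saving, written as a hypothetical tool in Python (ToyTool and its methods are invented for this sketch and do not correspond to any NVTX function): a plain message is tested against the tool's pattern on every call, while a registered string is tested once and later calls only look up the stored answer.

```python
import re


class ToyTool:
    """Hypothetical tool that triggers on messages matching a pattern."""

    def __init__(self, pattern):
        self._pattern = re.compile(pattern)
        self._registered = []        # handle (list index) -> (string, matched)
        self.regex_tests = 0
        self.triggers = 0

    def _matches(self, message):
        self.regex_tests += 1
        return self._pattern.search(message) is not None

    def mark(self, message):
        # Plain string: the tool has to test it on every call.
        if self._matches(message):
            self.triggers += 1

    def register(self, message):
        # Registered string: test once and remember the answer.
        self._registered.append((message, self._matches(message)))
        return len(self._registered) - 1

    def mark_registered(self, handle):
        # Later calls reduce to a lookup of the precomputed answer.
        if self._registered[handle][1]:
            self.triggers += 1


tool = ToyTool(r"^frame")
for _ in range(100):
    tool.mark("frame start")
tests_with_plain_strings = tool.regex_tests      # one test per call

tool.regex_tests = 0
handle = tool.register("frame start")
for _ in range(100):
    tool.mark_registered(handle)
tests_with_registered_string = tool.regex_tests  # one test in total
```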

nvtx's People

Contributors

afroger-nvidia, beru, dekhtiarjonathan, dhoro-nvidia, jcohen-nvidia, jrhemstad, karthikeyann, sevagh, shwina

nvtx's Issues

`NVTX3_CPP_REQUIRE_EXPLICIT_VERSION` is problematic in header-only libraries

Hi, the documentation for NVTX3_CPP_REQUIRE_EXPLICIT_VERSION in the nvtx3.hpp header containing the C++ API explains the following:

... the recommended best practice for instrumenting header-based libraries with NVTX C++ Wrappers is to #define NVTX3_CPP_REQUIRE_EXPLICIT_VERSION before including nvtx3.hpp, #undef it afterward, and only use explicit-version symbols.

However, this breaks user code using the unversioned API directly.

For example:

#include <my_library.hpp> // includes NVTX3
#include <nvtx3/nvtx3.hpp> // user also includes NVTX3

int main() {
  nvtx3::scoped_range domain; // user uses unversioned API
}

If my_library.hpp now changes from

#include <nvtx3/nvtx3.hpp>

to

#define NVTX3_CPP_REQUIRE_EXPLICIT_VERSION
#include <nvtx3/nvtx3.hpp>
#undef NVTX3_CPP_REQUIRE_EXPLICIT_VERSION

the above user program breaks, because the unversioned API is no longer provided.

This happens in the second case because the first inclusion of nvtx3.hpp defines NVTX3_CPP_DEFINITIONS_V1_0 and emits the symbols into namespace nvtx3::v1 and v1 is not an inline namespace because NVTX3_CPP_REQUIRE_EXPLICIT_VERSION is defined. The second inclusion will then see that NVTX3_CPP_DEFINITIONS_V1_0 is already defined and not provide the unversioned API (e.g., by inlining the v1 namespace).

We observed this behavior at the following PR to CCCL/CUB: NVIDIA/cccl#1688

I therefore think that header-only libraries must not define NVTX3_CPP_REQUIRE_EXPLICIT_VERSION to not break user code. Please correct me if I am wrong. Otherwise, I would kindly ask you to update the guidance provided by the documentation.

Python 3.11 support

Issue:

  • Cannot upgrade to python 3.11 due to nvtx

Error:

ERROR: Ignored the following versions that require a different python version: 1.21.2 Requires-Python >=3.7,<3.11; 1.21.3 Requires-Python >=3.7,<3.11; 1.21.4 Requires-Python >=3.7,<3.11; 1.21.5 Requires-Python >=3.7,<3.11; 1.21.6 Requires-Python >=3.7,<3.11; 1.6.2 Requires-Python >=3.7,<3.10; 1.6.3 Requires-Python >=3.7,<3.10; 1.7.0 Requires-Python >=3.7,<3.10; 1.7.0rc1 Requires-Python >=3.7,<3.10; 1.7.0rc2 Requires-Python >=3.7,<3.10; 1.7.1 Requires-Python >=3.7,<3.10; 1.7.2 Requires-Python >=3.7,<3.11; 1.7.3 Requires-Python >=3.7,<3.11; 1.8.0 Requires-Python >=3.8,<3.11; 1.8.0rc1 Requires-Python >=3.8,<3.11; 1.8.0rc2 Requires-Python >=3.8,<3.11; 1.8.0rc3 Requires-Python >=3.8,<3.11; 1.8.0rc4 Requires-Python >=3.8,<3.11; 1.8.1 Requires-Python >=3.8,<3.11
ERROR: Could not find a version that satisfies the requirement nvtx>=0.2.5; sys_platform != "darwin" (from XXX) (from versions: 0.2.0, 0.2.1, 0.2.2, 0.2.3, 0.2.4)

Question:

  • Any plans to add support for python 3.11 in future releases?

`nvToolsExt.h` defines min/max macros on Windows

When including nvToolsExt.h in a MSVC build, the Windows.h include file isn't guarded by NOMINMAX, so it breaks subsequent code that defines min or max functions. I guess we can't decide for users whether this is intended behavior, but maybe it should be documented somewhere, as not everybody is familiar with this issue on Windows.

Among cudaEventRecord and nvtxRangePush, whose overhead is smaller?

Hello!
When I want to time a CUDA kernel, one way is to use cudaEventRecord before and after the kernel launch; the other is to use nvtxRangePush before and nvtxRangePop after the kernel. Which one is better? Which one has less additional overhead?
In NVTX docs, I also found the words "The library introduces close to zero overhead if no tool is attached to the application. The overhead when a tool is attached is specific to the tool". Does "tool" here mean nvprof or nsys?

[python] Automatic annotation with function name

Hello,

As of nvtx 0.2.8 (installed from PyPI), the feature that automatically annotates functions with their name when using @nvtx.annotate() no longer works.

I believe this is due to the following check:

if not self.attributes.message:

Indeed, self.attributes.message seems to be created regardless of whether message was provided in the constructor:

message = RegisteredString(self.domain.handle, message)

Changing the check to:

if not self.attributes.message.string:

locally seems to give back the expected behavior (however I don't know if this is the correct fix).
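
A minimal sketch of why the original check would never fire, assuming the wrapper object defines neither __bool__ nor __len__. RegisteredStringStandIn is a hypothetical stand-in written for this illustration, not the actual class from the nvtx source.

```python
class RegisteredStringStandIn:
    """Hypothetical stand-in for a wrapper object holding a string."""

    def __init__(self, string):
        self.string = string


message = RegisteredStringStandIn(None)   # no message was provided

object_is_truthy = bool(message)          # plain objects are truthy by default
string_is_truthy = bool(message.string)   # the wrapped value is None
```

Under that assumption, "not message" is always False even when no message was given, while "not message.string" distinguishes the two cases.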

Wheels for Mac OS and Windows

Currently, we only publish wheels for Linux/x86 on PyPI: https://pypi.org/project/nvtx/#files. On any other OS/arch, pip installs fall back to source builds, which are problematic because users need to configure their compilers to find the NVTX C headers. This is often a pain point, as evidenced by issues like #80.

We should provide pre-built wheels for other OS/arch.

Simplify the process of using NVTX in another CMake project

Rather than requesting users use a custom include(...) file to consume imported CMake targets, it is possible to detect whether NVTX is being built stand-alone or as part of another project. The common pattern for detecting this is to add a top-level CMakeLists.txt file with:

In a new NVTX/CMakeLists.txt file:

# ...normal project setup here...

# Check if standalone or part of another project:
if ("${CMAKE_SOURCE_DIR}" STREQUAL "${CMAKE_CURRENT_LIST_DIR}")
  set(NVTX_TOPLEVEL_PROJECT ON)
endif()

# using `option` instead of `set` lets users change this using `ccmake` or `cmake -D....`.
# When built as part of another project, NVTX_TOPLEVEL_PROJECT will be unset and default to false.
# When built stand-alone, NVTX_TOPLEVEL_PROJECT will be truthy and disable imported targets.
option(NVTX3_TARGETS_NOT_USING_IMPORTED "<doc string>" ${NVTX_TOPLEVEL_PROJECT})

# The `set(...NOT_USING_IMPORTED...)` in c/CMakeLists.txt should be removed in favor of the option above.
add_subdirectory(c)

For an example of how we use this pattern in CCCL, see: https://github.com/NVIDIA/cccl/blob/main/CMakeLists.txt#L10-L14

By making this change (and adding a top-level CMakeLists.txt file to the NVTX repo per the suggestion above), CPM usage would be simplified to:

CPMAddPackage("gh:NVIDIA/NVTX@release-v3")

for most users. Currently, we must do

CPMAddPackage(
  NAME NVTX
  GITHUB_REPOSITORY NVIDIA/NVTX
  GIT_TAG release-v3
  DOWNLOAD_ONLY
  SYSTEM
)
include("${NVTX_SOURCE_DIR}/c/nvtxImportedTargets.cmake")

for the common case of wanting imported targets.

Installing NVTX on python fails in nvidia docker image.

Hi,

I'm trying to install nvtx using pip inside the NVIDIA docker image nvidia/cuda:11.2.2-cudnn8-devel-ubuntu18.04.
I installed Python 3.9 along with pip, and when running

python3.9 -m pip install nvtx

I get:

 #13 79.02 Failed to build nvtx
#13 79.02 ERROR: Could not build wheels for nvtx, which is required to install pyproject.toml-based projects

At first I thought this was the same as #43, but I see these are not exactly the same issue. I'd rather not build the wheel inside the docker image or install conda.

A minimal dockerfile that fails:

FROM nvidia/cuda:11.2.2-cudnn8-devel-ubuntu18.04

RUN apt update --fix-missing
RUN apt install -y software-properties-common
RUN add-apt-repository ppa:deadsnakes/ppa
RUN apt update
ARG DEBIAN_FRONTEND=noninteractive
ENV TZ=Asia/Jerusalem
RUN apt install -y python3.9 python3-pip python3.9-dev python3.9-distutils curl unzip
ENV LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/
RUN python3.9 -m pip install --upgrade pip \
    && python3.9 -m pip install --upgrade setuptools \
    && python3.9 -m pip install --upgrade distlib \
    && python3.9 -m pip install nvtx

scoped_range does not work with domain::global

I am trying to use NVTX3_FUNC_RANGE(), nvtx3::scoped_range or nvtx3::scoped_range_in<nvtx3::domain::global>, but none of them works: the range does not show in nvperf / nvvp.

I have tested nvtxRangePush and nvtx3::scoped_range_in<my_domain> and these ranges show in profiling tools correctly.

Configuration

  • GPU CARD: GeForce GTX 1650 Mobile
  • driver version: 520.61.05
  • CUDA version: 11.8
  • OS version: Ubuntu 22.04

Reproduction docker

I have prepared a simple reproduction docker here.

Reproduction code:

#include <chrono>
#include <thread>
#include <nvtx3/nvtx3.hpp>

using namespace std::chrono_literals;  // enables the 1s literal below

struct my_domain{ static constexpr char const* name{"my_domain"}; };

void function_my_domain(){
    // this range does show in profiling tools as expected
    nvtx3::scoped_range_in<my_domain> r(__FUNCTION__);
    std::this_thread::sleep_for(1s);
}

void function_global(){
    // this range does not show in profiling tools
    nvtx3::scoped_range r(__FUNCTION__);
    std::this_thread::sleep_for(1s);
}

error: declaration of template parameter ‘D’ shadows template parameter

I am running into a build error on Ubuntu 18.04, g++ 9.3.0, nvcc / CUDA 11.3.

This is using the C++ version on the dev branch, added to the project via CMake. The error is produced from a simple include of the library (#include <nvtx3/nvtx3.hpp>).

/home/user/dev/project/build/_deps/nvtxpp-src/cpp/include/nvtx3/nvtx3.hpp:2212:11: error: declaration of template parameter ‘D’ shadows template parameter
 2212 |   template <typename D>
      |           ^~~~~
/home/user/dev/project/build/_deps/nvtxpp-src/cpp/include/nvtx3/nvtx3.hpp:2128:11: note: template parameter ‘D’ declared here
 2128 | template <typename D = domain::global>
      |           ^~~~~

Apart from this error, the same line is also referenced in many warnings that state the same.

`end_range` should be templated with the domain and use `nvtxDomainRangeEnd`

The documentation of the end_range function is incorrect:

This function does not have a Domain tag type template parameter as the
handle r already indicates the domain to which the range belongs.

The handle does not indicate the domain to which the range belongs. The function should be implemented the following way:

template <typename D = domain::global>
void end_range(range_handle r)
{
#ifndef NVTX_DISABLE
  // nvtxDomainRangeEnd takes the domain handle and the id of the range to end
  nvtxDomainRangeEnd(domain::get<D>(), r.get_value());
#else
  (void)r;
#endif
}

As a result, the domain_process_range implementation is incorrect. The class relies on the start_range and end_range functions. The former relies on nvtxDomainRangeStartEx while the latter relies on nvtxRangeEnd. The only scenario where it works is when domain_process_range is templated with the default domain. For custom domains, the range is started in the specified domain and ended in the default domain, which is incorrect.

Also, I am wondering whether it is really necessary to have the free-standing start_range and end_range functions. The domain_process_range class should be sufficient to handle all possible use cases and has the added benefit of preventing the end-users from misusing the API (e.g. generating an unterminated range, passing a wrong handle to end_range, etc.)

Documentation for the tool-side interface

As near as I can tell, there are two options for tool authors wishing to provide an NVTX implementation (such that NVTX annotations can be recorded in another format, as opposed to converting their own annotations to NVTX):

  1. Implement an independent NVTX library against the headers
  2. Some sort of export table mechanism, wherein we could provide pointers to the functions we wish to implement and a function that registers those function pointers.

The mechanism in 2. is sadly not documented anywhere I've been able to find online (notably not in the NVTX Doxygen docs) and the comments are somewhat sparse for a third party implementor. Is there any chance this interface could become explicitly documented/have I simply missed something in my initial research?

Implement the `NVTX3_FUNC_RANGE_IF_IN` and `NVTX3_FUNC_RANGE_IF` macros

The NVTX++ interface currently has the NVTX3_FUNC_RANGE and NVTX3_FUNC_RANGE_IN macros, which generate an NVTX range spanning the lifetime of the enclosing block.

There are scenarios where we only want to conditionally generate NVTX annotations. For example, some developers might want to annotate their libraries but have some kind of verbosity control. In this case, they might want to control whether an annotation is emitted or not dynamically.

One possible implementation would be to add an additional class similar to domain_thread_range which would enable the move constructor. The macro implementation would then be the following:

#define NVTX3_V1_FUNC_RANGE_IF_IN(C, D) \
  ::nvtx3::v1::movable_domain_thread_range<D> nvtx3_range__{}; \
  if (C) { \
    static ::nvtx3::v1::registered_string<D> const nvtx3_func_name__{__func__}; \
    static ::nvtx3::v1::event_attributes const nvtx3_func_attr__{nvtx3_func_name__}; \
    nvtx3_range__ = ::nvtx3::v1::movable_domain_thread_range<D>{nvtx3_func_attr__}; \
  }

If users want the condition to be evaluated only once for the whole duration of the program, they can cache the result in a static variable.

The downside of making a thread range movable is that it allows misuse of the API. For example, a user might create such a thread range and move it into a functor that is executed on another thread. If this is problematic, the class could be placed in the detail namespace and documented to warn users about those invalid cases.
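For comparison, the conditional behavior described above can also be sketched without a movable range type by constructing a non-movable range in place inside a std::optional. This is only an illustration under assumptions: fake_scoped_range, event_log, annotated_work and demo_conditional_range are hypothetical stand-ins that record events in a vector rather than calling NVTX, and this is not necessarily how the library would implement the macros.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <vector>

// Hypothetical event log standing in for a profiling tool.
inline std::vector<std::string>& event_log() {
    static std::vector<std::string> log;
    return log;
}

// Hypothetical stand-in for a non-copyable, non-movable scoped range.
class fake_scoped_range {
public:
    explicit fake_scoped_range(const char* name) {
        event_log().push_back(std::string("push ") + name);
    }
    ~fake_scoped_range() { event_log().push_back("pop"); }
    fake_scoped_range(const fake_scoped_range&) = delete;
    fake_scoped_range& operator=(const fake_scoped_range&) = delete;
};

// Annotates the function body only when `verbose` is true.
inline int annotated_work(bool verbose) {
    std::optional<fake_scoped_range> range;  // empty: no range started yet
    if (verbose) {
        range.emplace(__func__);  // constructed in place, so no move is needed
    }
    return 42;  // the range, if any, ends when `range` is destroyed
}

// Exercises both branches of the condition.
inline void demo_conditional_range() {
    event_log().clear();
    annotated_work(false);
    assert(event_log().empty());      // nothing emitted when disabled
    annotated_work(true);
    assert(event_log().size() == 2);  // one push and one matching pop
}
```

Because the range object never leaves the function, this variant cannot be handed to another thread, at the cost of storing one extra flag inside the optional.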

Failed to build nvtx

I'm getting a "Cannot open include file: 'nvtx3/nvToolsExt.h'" error.

This is on Windows. I've tried the suggestions in #43, but they don't seem to work. I've also tried defining C_INCLUDE in the Windows path variables, but that doesn't seem to help either.

Any ideas?

Implement `domain_process_range` without storing the handle in `std::unique_ptr`

We should avoid relying on dynamic memory allocation when possible. I don't think there is a strong reason for storing the range handle into a unique pointer.

Instead, we could:

  1. Add a boolean member variable in the domain_process_range class. It determines whether the range should be emitted or not
  2. Create a move constructor for the domain_process_range class:
domain_process_range(domain_process_range&& other) noexcept
{
#ifndef NVTX_DISABLE
  handle = other.handle;
  generate = other.generate;
  other.generate = false;
#endif
}
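To make the two steps above concrete, here is a minimal self-contained sketch of a movable, non-copyable range that stores its handle by value and uses a flag to decide whether the destructor should end the range. The names flagged_range, fake_range_start, fake_range_end, g_started, g_ended and demo_flagged_range are hypothetical; the start/end functions only count calls and stand in for the real NVTX calls.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>

// Hypothetical counters and functions standing in for NVTX range start/end.
static int g_started = 0;
static int g_ended = 0;
static std::uint64_t fake_range_start() { return static_cast<std::uint64_t>(++g_started); }
static void fake_range_end(std::uint64_t) { ++g_ended; }

// Movable, non-copyable range: handle stored by value, ownership tracked by a flag.
class flagged_range {
public:
    flagged_range() : handle_(fake_range_start()), generate_(true) {}

    flagged_range(flagged_range&& other) noexcept
        : handle_(other.handle_), generate_(other.generate_) {
        other.generate_ = false;  // the moved-from object must not end the range
    }

    flagged_range& operator=(flagged_range&& other) noexcept {
        if (this != &other) {
            if (generate_) fake_range_end(handle_);  // end the range currently owned
            handle_ = other.handle_;
            generate_ = other.generate_;
            other.generate_ = false;
        }
        return *this;
    }

    flagged_range(const flagged_range&) = delete;
    flagged_range& operator=(const flagged_range&) = delete;

    ~flagged_range() {
        if (generate_) fake_range_end(handle_);
    }

private:
    std::uint64_t handle_;
    bool generate_;
};

// Moves a range between two objects and checks that it ends exactly once.
inline void demo_flagged_range() {
    {
        flagged_range a;
        flagged_range b{std::move(a)};
        assert(g_started == 1 && g_ended == 0);  // still open after the move
    }
    assert(g_ended == 1);  // ended once, by the object that owned it last
}
```

No heap allocation is involved; the cost is one extra bool per range object.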

Rename range classes

Currently in NVTX++ we have:

  • domain_thread_range<D> (and thread_range alias for global domain)
    • Represents nested time spans
    • Non-moveable/Non-copyable
    • Must be destroyed on the same thread it was created
    • Deleted operator new prevents heap objects
  • domain_process_range<D> (process_range alias for global domain)
    • Arbitrary time span, no nesting requirements
    • Movable, not copyable
    • Can be created/destroyed on different threads

The naming leaves something to be desired. thread_range comes from the concept that the object needs to be created and destroyed on the same thread, but doesn't tell you anything else about its semantics. process_range is named primarily to contrast it with thread_range in that it can be created and destroyed on different threads.

We can do better. I propose the following:

  • domain_thread_range<D>/thread_range -> scoped_range_in<D>/scoped_range
  • domain_process_range<D>/process_range -> unique_range_in<D>/unique_range

First, the *_in<D> better conveys that the range is in the particular domain as opposed to calling it a domain_*<D>. (credit to @jcohen-nvidia for this idea).

Second, the scoped_range/unique_range naming is inspired by std::scoped_lock and std::unique_lock.

Like scoped_range, scoped_lock is neither copyable nor movable. It conveys that this object is intrinsically tied to a particular scope.

Like unique_range, unique_lock is movable, but not copyable. It represents a single, unique range that is tied to the lifetime of the object and not confined to a particular scope. Furthermore, the canonical name in C++ for "movable but not copyable" things is unique_*.

Furthermore, just as scoped_lock is typically preferred to unique_lock, scoped_range is preferred to unique_range.
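The copy/move properties of the two standard lock types that motivate the proposed names can be checked at compile time. The sketch below uses only the C++17 standard library; the helpers acquire and demo_lock_transfer are hypothetical names added for illustration.

```cpp
#include <cassert>
#include <mutex>
#include <type_traits>

using scoped = std::scoped_lock<std::mutex>;
using unique = std::unique_lock<std::mutex>;

// scoped_lock: neither copyable nor movable, so it is tied to its scope.
static_assert(!std::is_copy_constructible_v<scoped>);
static_assert(!std::is_move_constructible_v<scoped>);

// unique_lock: movable but not copyable, so ownership can be transferred.
static_assert(!std::is_copy_constructible_v<unique>);
static_assert(std::is_move_constructible_v<unique>);

// Hands a held lock to the caller, which only a movable type allows.
inline std::unique_lock<std::mutex> acquire(std::mutex& m) {
    return std::unique_lock<std::mutex>{m};
}

// Returns true if the lock is still held after being returned from acquire().
inline bool demo_lock_transfer() {
    std::mutex m;
    std::unique_lock<std::mutex> lk = acquire(m);
    assert(lk.owns_lock());
    return lk.owns_lock();
}
```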

__sync_val_compare_and_swap used incorrect parameter order

at nvtx3/nvtxDetail/nvtxInit.h:43

#define NVTX_ATOMIC_CAS_32(old, address, exchange, comparand) __sync_synchronize(); old = __sync_val_compare_and_swap(address, exchange, comparand)

but according to the referenced documentation, the function prototype is:

T __sync_val_compare_and_swap (T* __p, U __compVal, V __exchVal, ...);

This may lead to the NVTX initialization being performed multiple times in a multi-threaded program.
The following call site needs to be corrected as well, by swapping the third and fourth arguments:

    NVTX_ATOMIC_CAS_32(
        old,
        &NVTX_VERSIONED_IDENTIFIER(nvtxGlobals).initState,
        NVTX_INIT_STATE_STARTED,
        NVTX_INIT_STATE_FRESH);

If I have misunderstood, please forgive my mistake.
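As a reference point for the argument order discussed above, here is a small sketch of the GCC/Clang builtin used with the order (pointer, expected old value, new value). The constants STATE_FRESH and STATE_STARTED and the helpers cas32, try_begin_init and demo_cas are hypothetical names for this illustration, not the NVTX definitions, and the sketch assumes a compiler that provides the __sync builtins.

```cpp
#include <cassert>

// Hypothetical state values for this sketch only.
enum : int { STATE_FRESH = 0, STATE_STARTED = 1 };

// Thin wrapper using the builtin's order: (pointer, expected old value, new value).
// Returns the value that was stored at `address` before the operation.
inline int cas32(int* address, int comparand, int exchange) {
    return __sync_val_compare_and_swap(address, comparand, exchange);
}

// Returns true only for the caller that wins the FRESH -> STARTED transition.
inline bool try_begin_init(int* state) {
    return cas32(state, STATE_FRESH, STATE_STARTED) == STATE_FRESH;
}

// Single-threaded walk-through of the transition.
inline void demo_cas() {
    int state = STATE_FRESH;
    assert(try_begin_init(&state));   // first caller performs the transition
    assert(state == STATE_STARTED);
    assert(!try_begin_init(&state));  // later callers observe STARTED and back off
}
```

If the comparand and exchange values were passed in the opposite order, the first call would compare against STATE_STARTED, fail to swap, and leave the state at STATE_FRESH.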

Compilation error when `NVTX_DISABLE` is defined

The following code fails to compile with GCC 8.4.0.

#define NVTX_DISABLE
#include <nvtx3/nvtx3.hpp>
int main() {}

Compilation error:

/home/afroger/dev/projects/NVTX/cpp/include/nvtx3/nvtx3.hpp: In function ‘nvtx3::v1::range_handle nvtx3::v1::start_range(const Args& ...)’:
/home/afroger/dev/projects/NVTX/cpp/include/nvtx3/nvtx3.hpp:1979:9: error: ‘first’ was not declared in this scope
   (void)first;
         ^~~~~

Building wheel for nvtx (PEP 517) ... error

I am trying to install nvtx on Jetson Nano. However, I get the error:

ERROR: Command errored out with exit status 1:
   command: /usr/bin/python3 /usr/local/lib/python3.6/dist-packages/pip/_vendor/pep517/in_process/_in_process.py build_wheel /tmp/tmpbbid0wvf
       cwd: /tmp/pip-install-r4jbfqi6/nvtx_522b0ae8d511499ab4698e708b64cfe1
  Complete output (26 lines):
  running bdist_wheel
  running build
  running build_py
  creating build
  creating build/lib.linux-aarch64-3.6
  creating build/lib.linux-aarch64-3.6/nvtx
  copying nvtx/nvtx.py -> build/lib.linux-aarch64-3.6/nvtx
  copying nvtx/__init__.py -> build/lib.linux-aarch64-3.6/nvtx
  copying nvtx/colors.py -> build/lib.linux-aarch64-3.6/nvtx
  creating build/lib.linux-aarch64-3.6/nvtx/utils
  copying nvtx/utils/__init__.py -> build/lib.linux-aarch64-3.6/nvtx/utils
  copying nvtx/utils/cached.py -> build/lib.linux-aarch64-3.6/nvtx/utils
  creating build/lib.linux-aarch64-3.6/nvtx/_lib
  copying nvtx/_lib/__init__.py -> build/lib.linux-aarch64-3.6/nvtx/_lib
  copying nvtx/_lib/lib.pxd -> build/lib.linux-aarch64-3.6/nvtx/_lib
  running build_ext
  building 'nvtx._lib.lib' extension
  creating build/temp.linux-aarch64-3.6
  creating build/temp.linux-aarch64-3.6/nvtx
  creating build/temp.linux-aarch64-3.6/nvtx/_lib
  aarch64-linux-gnu-gcc -pthread -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -I/usr/include -I/usr/include/python3.6m -c nvtx/_lib/lib.c -o build/temp.linux-aarch64-3.6/nvtx/_lib/lib.o
  nvtx/_lib/lib.c:632:10: fatal error: nvtx3/nvToolsExt.h: No such file or directory
   #include "nvtx3/nvToolsExt.h"
            ^~~~~~~~~~~~~~~~~~~~
  compilation terminated.
  error: command 'aarch64-linux-gnu-gcc' failed with exit status 1
  ----------------------------------------
  ERROR: Failed building wheel for nvtx
Failed to build nvtx
ERROR: Could not build wheels for nvtx which use PEP 517 and cannot be installed directly

I would appreciate some help on this. Thanks.

Deleting the default constructor of some classes

For the following classes, it might make sense to delete the default-constructor:

  • event_attributes
  • domain_thread_range
  • domain_process_range

What is the rationale for allowing the creation of an empty event attribute, or of a thread range with an empty event attribute? I understand that the NVTX API does not prohibit this behavior, but a range or event without any attribute will be meaningless for most, if not all, tools.

Payloads in python events?

As far as I can tell the Python bindings don't support setting payload values - is that correct? Are there any plans to support that?

Provide mechanism to disable Python NVTX ranges

In C/C++, NVTX ranges can be disabled, effectively eliminating any overhead of the nvtx function calls (albeit this overhead should already be small when a tool is not attached).

The Python NVTX ranges should have a similar feature that makes any annotation a no-op.

Will NVTX3 ship the C++ V1 API for eternity?

We have included the NVTX3 C++ wrapper into the CCCL/CUB and hit a couple of challenges on how NVTX and CCCL can evolve with minimal maintenance effort. I am tempted to think the best way is to use NVTX3 like:

Inside CCCL/CUB:

#if __has_include(<nvtx3/nvtx3.hpp>)
#  include <nvtx3/nvtx3.hpp>
#  ifndef NVTX3_CPP_DEFINITIONS_V1_0
#    error
#  endif
...
nvtx3::v1::scoped_range
#endif

This way, we are independent of whether the user defines NVTX3_CPP_REQUIRE_EXPLICIT_VERSION or not (see discussion in #93) and of whether they use a newer NVTX3 version, since we use the oldest API. We can happily ship CCCL, and as long as we don't need any new NVTX features, this code would work for as long as nvtx3/nvtx3.hpp exists.

So, does the NVTX3 C++ wrapper guarantee that if you release a new version, e.g. NVTX3 V2 or V66, all the V1 symbols will still be available, forever? An NVTX4 library can of course break this, and would also require new work on cmake, etc.

There is a comment section in nvtx3.hpp that suggests this, but I want to be sure: https://github.com/NVIDIA/NVTX/blob/release-v3/c/include/nvtx3/nvtx3.hpp#L2835-L2841
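For readers unfamiliar with the mechanism behind the question, the following sketch shows the general C++ pattern of versioned namespaces, where older versions remain reachable by explicit qualification while the unqualified name resolves to the newest one. It uses a hypothetical library mylib with hypothetical functions api_name and demo_versions; it illustrates the language feature only and says nothing about what NVTX guarantees.

```cpp
#include <cassert>
#include <string>

// Hypothetical library, not NVTX: each API version lives in its own namespace.
namespace mylib {
namespace v1 {
inline std::string api_name() { return "v1"; }
}  // namespace v1

inline namespace v2 {  // the inline namespace is what plain `mylib::` resolves to
inline std::string api_name() { return "v2"; }
}  // namespace v2
}  // namespace mylib

// Checks both the explicitly versioned and the unversioned spelling.
inline void demo_versions() {
    assert(mylib::v1::api_name() == "v1");  // explicit version keeps working
    assert(mylib::api_name() == "v2");      // unversioned name tracks the latest
}
```

Client code that always spells out the version, as in the CCCL snippet above, is unaffected when a later version becomes the inline one, provided the older namespace is still shipped.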
