nvidia / nvtx

The NVIDIA® Tools Extension SDK (NVTX) is a C-based Application Programming Interface (API) for annotating events, code ranges, and resources in your applications.

License: Apache License 2.0

C 52.96% CMake 4.57% C++ 26.73% Makefile 0.16% Python 6.30% Batchfile 0.20% Cython 2.76% Shell 0.22% CSS 6.09%

nvtx's People

Contributors

afroger-nvidia, beru, dekhtiarjonathan, dhoro-nvidia, jcohen-nvidia, jrhemstad, karthikeyann, sevagh, shwina

nvtx's Issues

Rename range classes

Currently in NVTX++ we have:

  • domain_thread_range<D> (and thread_range alias for global domain)
    • Represents nested time spans
    • Non-moveable/Non-copyable
    • Must be destroyed on the same thread it was created
    • Deleted operator new prevents heap objects
  • domain_process_range<D> (process_range alias for global domain)
    • Arbitrary time span, no nesting requirements
    • Movable, not copyable
    • Can be created/destroyed on different threads

The naming leaves something to be desired. thread_range comes from the concept that the object needs to be created and destroyed on the same thread, but it doesn't tell you anything else about its semantics. process_range is named primarily to contrast it with thread_range, in that it can be created and destroyed on different threads.

We can do better. I propose the following:

  • domain_thread_range<D>/thread_range -> scoped_range_in<D>/scoped_range
  • domain_process_range<D>/process_range -> unique_range_in<D>/unique_range

First, the *_in<D> better conveys that the range is in the particular domain as opposed to calling it a domain_*<D>. (credit to @jcohen-nvidia for this idea).

Second, the scoped_range/unique_range naming is inspired by std::scoped_lock and std::unique_lock.

Like scoped_range, scoped_lock is neither copyable nor movable. It conveys that this object is intrinsically tied to a particular scope.

Like unique_range, unique_lock is movable, but not copyable. It represents a single, unique range that is tied to the lifetime of the object and not confined to a particular scope. Furthermore, the canonical name in C++ for "movable but not copyable" things is unique_*.

Furthermore, just as scoped_lock is typically preferred to unique_lock, scoped_range is preferred to unique_range.
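
To make the copy/move distinction above concrete, here is a minimal sketch using stand-in types. The names scoped_like and unique_like are hypothetical and are not the NVTX classes; they only mirror the properties listed in this issue (non-movable with deleted operator new versus movable but not copyable). The last two checks confirm that std::unique_lock has the same "movable, not copyable" shape.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <type_traits>

// Hypothetical stand-in for the proposed scoped_range: tied to one scope.
struct scoped_like {
  scoped_like() = default;
  scoped_like(scoped_like const&) = delete;
  scoped_like& operator=(scoped_like const&) = delete;
  scoped_like(scoped_like&&) = delete;
  scoped_like& operator=(scoped_like&&) = delete;
  // Deleting operator new prevents heap allocation with a plain new-expression.
  static void* operator new(std::size_t) = delete;
};

// Hypothetical stand-in for the proposed unique_range: one owner, transferable.
struct unique_like {
  unique_like() = default;
  unique_like(unique_like const&) = delete;
  unique_like& operator=(unique_like const&) = delete;
  unique_like(unique_like&&) noexcept = default;
  unique_like& operator=(unique_like&&) noexcept = default;
};

static_assert(!std::is_copy_constructible<scoped_like>::value, "not copyable");
static_assert(!std::is_move_constructible<scoped_like>::value, "not movable");
static_assert(!std::is_copy_constructible<unique_like>::value, "not copyable");
static_assert(std::is_move_constructible<unique_like>::value, "movable");

// std::unique_lock follows the same movable-but-not-copyable convention.
static_assert(std::is_move_constructible<std::unique_lock<std::mutex>>::value, "");
static_assert(!std::is_copy_constructible<std::unique_lock<std::mutex>>::value, "");
```

The static_asserts fail at compile time if either stand-in drifts from the intended semantics, which is one cheap way to pin down a naming contract like this.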

Provide mechanism to disable Python NVTX ranges

In C/C++, NVTX ranges can be disabled, effectively eliminating any overhead of the nvtx function calls (albeit this overhead should already be small when a tool is not attached).

The Python NVTX ranges should have a similar feature that makes any annotation a no-op.

Implement the `NVTX3_FUNC_RANGE_IF_IN` and `NVTX3_FUNC_RANGE_IF` macros

The NVTX++ interface currently has the NVTX3_FUNC_RANGE and NVTX3_FUNC_RANGE_IN macros, which generate an NVTX range from the lifetime of the enclosing block.

There are scenarios where we only want to conditionally generate NVTX annotations. For example, some developers might want to annotate their libraries but have some kind of verbosity control. In this case, they might want to control whether an annotation is emitted or not dynamically.

One possible implementation would be to add an additional class similar to domain_thread_range which would enable the move constructor. The macro implementation would then be the following:

#define NVTX3_V1_FUNC_RANGE_IF_IN(C, D) \
  /* Not const: the range is move-assigned below when the condition holds. */ \
  ::nvtx3::v1::movable_domain_thread_range<D> nvtx3_range__{}; \
  if (C) { \
    static ::nvtx3::v1::registered_string<D> const nvtx3_func_name__{__func__}; \
    static ::nvtx3::v1::event_attributes const nvtx3_func_attr__{nvtx3_func_name__}; \
    nvtx3_range__ = ::nvtx3::v1::movable_domain_thread_range<D>{nvtx3_func_attr__}; \
  }

If the user wants the condition to be evaluated only once for the whole duration of the program, they can cache the result in a static variable.

The downside of making a thread range movable is that it allows misuse of the API. For example, a user might create such a thread range and move it into a functor that is executed on another thread. If this is problematic, the class could be placed in the detail namespace and documented to warn users about those invalid cases.
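
One alternative that avoids a movable thread range entirely is a guard that remembers whether it started a range. The following is only a sketch under stated assumptions: fake_range_push and fake_range_pop stand in for the NVTX push/pop calls so the code runs without NVTX, and conditional_range_sketch and SKETCH_FUNC_RANGE_IF are hypothetical names, not part of the library.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Stand-ins for the NVTX push/pop calls; they record events in a log instead.
inline std::vector<std::string>& event_log() {
  static std::vector<std::string> log;
  return log;
}
inline void fake_range_push(char const* name) {
  event_log().push_back(std::string("push ") + name);
}
inline void fake_range_pop() { event_log().push_back("pop"); }

// Hypothetical guard: pushes a range only when enabled, and pops only if it pushed.
class conditional_range_sketch {
 public:
  conditional_range_sketch(bool enabled, char const* name) : active_{enabled} {
    if (active_) fake_range_push(name);
  }
  ~conditional_range_sketch() {
    if (active_) fake_range_pop();
  }
  // Deleting the copy operations also suppresses the implicit move operations.
  conditional_range_sketch(conditional_range_sketch const&) = delete;
  conditional_range_sketch& operator=(conditional_range_sketch const&) = delete;

 private:
  bool active_;
};

// Hypothetical macro with the same shape as the proposed NVTX3_FUNC_RANGE_IF.
#define SKETCH_FUNC_RANGE_IF(C) \
  conditional_range_sketch const sketch_range__{(C), __func__}

inline void annotated_work(bool verbose) {
  SKETCH_FUNC_RANGE_IF(verbose);
  // ... work to be profiled would go here ...
}
```

Because the guard is neither copyable nor movable, it cannot be handed to another thread, which sidesteps the misuse concern raised above.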

error: declaration of template parameter ‘D’ shadows template parameter

I am running into a build error on Ubuntu 18.04, g++ 9.3.0, nvcc / CUDA 11.3.

This is using the C++ version on the dev branch, added to the project via CMake. The error is produced from a simple include of the library (#include <nvtx3/nvtx3.hpp>).

/home/user/dev/project/build/_deps/nvtxpp-src/cpp/include/nvtx3/nvtx3.hpp:2212:11: error: declaration of template parameter ‘D’ shadows template parameter
 2212 |   template <typename D>
      |           ^~~~~
/home/user/dev/project/build/_deps/nvtxpp-src/cpp/include/nvtx3/nvtx3.hpp:2128:11: note: template parameter ‘D’ declared here
 2128 | template <typename D = domain::global>
      |           ^~~~~

Apart from this error, the same line is also referenced in many warnings that state the same.
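
For readers unfamiliar with this diagnostic, here is a minimal sketch of the general pattern and the usual fix. The code at the reported lines is not shown in this report, so the types below (global_domain_sketch, other_domain_sketch, range_sketch) are hypothetical and only illustrate the rule that a member template may not reuse the name of an enclosing template parameter.

```cpp
#include <cassert>

// Hypothetical domain tags for illustration only.
struct global_domain_sketch {
  static constexpr int id() { return 0; }
};
struct other_domain_sketch {
  static constexpr int id() { return 7; }
};

template <typename D = global_domain_sketch>
struct range_sketch {
  // Writing `template <typename D>` here would redeclare the enclosing
  // parameter, which GCC rejects with "declaration of template parameter 'D'
  // shadows template parameter". Using a distinct name avoids the clash.
  template <typename Other>
  static constexpr bool same_domain_as() {
    return D::id() == Other::id();
  }
};
```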

Among cudaEventRecord and nvtxRangePush, whose overhead is smaller?

Hello!
When I want to time a CUDA kernel, one way is to call cudaEventRecord before and after the kernel launch; the other is to call nvtxRangePush before and nvtxRangePop after the kernel. Which one is better? Which one adds less overhead?
In NVTX docs, I also found the words "The library introduces close to zero overhead if no tool is attached to the application. The overhead when a tool is attached is specific to the tool". Does "tool" here mean nvprof or nsys?

Failed to build nvtx

I'm getting a "Cannot open include file: 'nvtx3/nvToolsExt.h'" error.

This is on Windows. I've tried the suggestions in #43, but they don't seem to work. I've also tried defining C_INCLUDE in the Windows path variables, but that doesn't seem to help either.

Any ideas?

scoped_range does not work with domain::global

I am trying to use NVTX3_FUNC_RANGE(), nvtx3::scoped_range, or nvtx3::scoped_range_in<nvtx3::domain::global>, but none of them works: the range does not show up in nvperf / nvvp.

I have tested nvtxRangePush and nvtx3::scoped_range_in<my_domain> and these ranges show in profiling tools correctly.

Configuration

  • GPU CARD: GeForce GTX 1650 Mobile
  • driver version: 520.61.05
  • CUDA version: 11.8
  • OS version: Ubuntu 22.04

Reproduction docker

I have prepared a simple reproduction docker here.

Reproduction code:

struct my_domain{ static constexpr char const* name{"my_domain"}; };

void function_my_domain(){
    // this range does show in profiling tools as expected
    nvtx3::scoped_range_in<my_domain> r(__FUNCTION__);
    std::this_thread::sleep_for(1s);
}

void function_global(){
    // this range does not show in profiling tools
    nvtx3::scoped_range r(__FUNCTION__);
    std::this_thread::sleep_for(1s);
}

Compilation error when `NVTX_DISABLE` is defined

The following code fails to compile with GCC 8.4.0.

#define NVTX_DISABLE
#include <nvtx3/nvtx3.hpp>
int main() {}

Compilation error:

/home/afroger/dev/projects/NVTX/cpp/include/nvtx3/nvtx3.hpp: In function ‘nvtx3::v1::range_handle nvtx3::v1::start_range(const Args& ...)’:
/home/afroger/dev/projects/NVTX/cpp/include/nvtx3/nvtx3.hpp:1979:9: error: ‘first’ was not declared in this scope
   (void)first;
         ^~~~~
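
The error suggests that the disabled code path refers to a parameter name that does not exist in a variadic signature. As a sketch only (this is an assumption about one possible remedy, not the project's actual fix), a parameter pack can be marked as used in C++11 by expanding it inside a discarded array initializer. The names handle_sketch, SKETCH_DISABLE, and start_range_sketch are hypothetical.

```cpp
#include <cassert>

// Hypothetical stand-in for a range handle; not the NVTX type.
struct handle_sketch {
  unsigned long long id;
};

#define SKETCH_DISABLE  // plays the role of NVTX_DISABLE in this sketch

template <typename... Args>
inline handle_sketch start_range_sketch(Args const&... args) noexcept {
#ifndef SKETCH_DISABLE
  // An enabled build would forward args to the C API; this placeholder
  // just returns a non-zero id.
  int unused[] = {0, ((void)args, 0)...};
  (void)unused;
  return handle_sketch{1};
#else
  // Expanding the pack here touches every argument, so no individual
  // parameter name such as `first` is needed to silence unused warnings.
  int unused[] = {0, ((void)args, 0)...};
  (void)unused;
  return handle_sketch{0};
#endif
}
```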

Deleting the default constructor of some classes

For the following classes, it might make sense to delete the default-constructor:

  • event_attributes
  • domain_thread_range
  • domain_process_range

What is the rationale for allowing the creation of an empty event attribute, or of a thread range with an empty event attribute? I understand that the NVTX API does not prohibit this behavior, but a range or event without any attribute will be meaningless for most, if not all, tools.

[python] Automatic annotation with function name

Hello,

As of nvtx 0.2.8 (installed from PyPI), the feature that automatically annotates functions with their name when using @nvtx.annotate() no longer works.

I believe this is due to the following check:

if not self.attributes.message:

Indeed, self.attributes.message seems to be created regardless of whether message was provided in the constructor:

message = RegisteredString(self.domain.handle, message)

Changing the check to:

if not self.attributes.message.string:

locally seems to give back the expected behavior (however I don't know if this is the correct fix).

Wheels for Mac OS and Windows

Currently, we only publish wheels for Linux/x86 on PyPI: https://pypi.org/project/nvtx/#files. On any other OS/arch, pip installs fall back to source builds, which are problematic because users need to configure their compilers to find the NVTX C headers. This is often a pain point, as evidenced by issues like #80.

We should provide pre-built wheels for other OS/arch.

Building wheel for nvtx (PEP 517) ... error

I am trying to install nvtx on Jetson Nano. However, I get the error:

ERROR: Command errored out with exit status 1:
   command: /usr/bin/python3 /usr/local/lib/python3.6/dist-packages/pip/_vendor/pep517/in_process/_in_process.py build_wheel /tmp/tmpbbid0wvf
       cwd: /tmp/pip-install-r4jbfqi6/nvtx_522b0ae8d511499ab4698e708b64cfe1
  Complete output (26 lines):
  running bdist_wheel
  running build
  running build_py
  creating build
  creating build/lib.linux-aarch64-3.6
  creating build/lib.linux-aarch64-3.6/nvtx
  copying nvtx/nvtx.py -> build/lib.linux-aarch64-3.6/nvtx
  copying nvtx/__init__.py -> build/lib.linux-aarch64-3.6/nvtx
  copying nvtx/colors.py -> build/lib.linux-aarch64-3.6/nvtx
  creating build/lib.linux-aarch64-3.6/nvtx/utils
  copying nvtx/utils/__init__.py -> build/lib.linux-aarch64-3.6/nvtx/utils
  copying nvtx/utils/cached.py -> build/lib.linux-aarch64-3.6/nvtx/utils
  creating build/lib.linux-aarch64-3.6/nvtx/_lib
  copying nvtx/_lib/__init__.py -> build/lib.linux-aarch64-3.6/nvtx/_lib
  copying nvtx/_lib/lib.pxd -> build/lib.linux-aarch64-3.6/nvtx/_lib
  running build_ext
  building 'nvtx._lib.lib' extension
  creating build/temp.linux-aarch64-3.6
  creating build/temp.linux-aarch64-3.6/nvtx
  creating build/temp.linux-aarch64-3.6/nvtx/_lib
  aarch64-linux-gnu-gcc -pthread -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -I/usr/include -I/usr/include/python3.6m -c nvtx/_lib/lib.c -o build/temp.linux-aarch64-3.6/nvtx/_lib/lib.o
  nvtx/_lib/lib.c:632:10: fatal error: nvtx3/nvToolsExt.h: No such file or directory
   #include "nvtx3/nvToolsExt.h"
            ^~~~~~~~~~~~~~~~~~~~
  compilation terminated.
  error: command 'aarch64-linux-gnu-gcc' failed with exit status 1
  ----------------------------------------
  ERROR: Failed building wheel for nvtx
Failed to build nvtx
ERROR: Could not build wheels for nvtx which use PEP 517 and cannot be installed directly

I would appreciate some help on this. Thanks.

Documentation for the tool-side interface

As near as I can tell, there are two options for tool authors wishing to provide an NVTX implementation (such that NVTX annotations can be recorded in another format, as opposed to converting their own annotations to NVTX):

  1. Implement an independent NVTX library against the headers
  2. Some sort of export table mechanism, wherein we could provide pointers to the functions we wish to implement and a function that registers those function pointers.

The mechanism in 2. is sadly not documented anywhere I've been able to find online (notably not in the NVTX Doxygen docs) and the comments are somewhat sparse for a third party implementor. Is there any chance this interface could become explicitly documented/have I simply missed something in my initial research?

__sync_val_compare_and_swap used incorrect parameter order

at nvtx3/nvtxDetail/nvtxInit.h:43

#define NVTX_ATOMIC_CAS_32(old, address, exchange, comparand) __sync_synchronize(); old = __sync_val_compare_and_swap(address, exchange, comparand)

but according to the linked reference, the function prototype is:

T __sync_val_compare_and_swap (T* __p, U __compVal, V __exchVal, ...);

This may lead to the NVTX initialization operation being called multiple times in a multi-threaded program.
The call site below needs to be corrected as well, by swapping the third and fourth parameters.

    NVTX_ATOMIC_CAS_32(
        old,
        &NVTX_VERSIONED_IDENTIFIER(nvtxGlobals).initState,
        NVTX_INIT_STATE_STARTED,
        NVTX_INIT_STATE_FRESH);

If I have misunderstood, please forgive my mistake.
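
The effect of the argument order can be checked with a small self-contained sketch. It assumes a GCC- or Clang-compatible compiler (for the __sync builtin) and uses illustrative state values kFresh and kStarted; the helper names are hypothetical. The builtin compares the pointee with its second argument and, on a match, stores the third, always returning the previous value.

```cpp
#include <cassert>

// Illustrative state values, assumed for this sketch.
constexpr int kFresh = 0;
constexpr int kStarted = 1;

// Documented order: (pointer, expected old value, new value).
inline int cas_documented_order(int* state) {
  return __sync_val_compare_and_swap(state, kFresh, kStarted);
}

// Order as written in the macro quoted above: (pointer, exchange, comparand).
inline int cas_swapped_order(int* state) {
  return __sync_val_compare_and_swap(state, kStarted, kFresh);
}

// Counts how many sequential callers see kFresh and so believe they are
// the one that should run initialization.
inline int count_initializers(int (*cas)(int*), int callers) {
  int state = kFresh;
  int winners = 0;
  for (int i = 0; i < callers; ++i) {
    if (cas(&state) == kFresh) ++winners;
  }
  return winners;
}
```

With the documented order only the first caller observes kFresh. With the swapped order the state never leaves kFresh, so every caller believes it must initialize, which matches the concern described in this issue.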

`nvToolsExt.h` defines min/max macros on Windows

When including nvToolsExt.h in an MSVC build, the Windows.h include isn't guarded by NOMINMAX, so it breaks subsequent code that defines or calls min or max functions. I guess we can't decide for users whether this is intended behavior, but maybe it should be documented somewhere, as not everybody is familiar with this issue on Windows.
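
Until this is documented, user code has two common workarounds: define NOMINMAX before the include, or parenthesize the function name at the call site. The sketch below demonstrates the second option in a portable way by defining stand-in min/max macros itself (an assumption made so the example runs on any platform, not only with Windows.h); clamp_to_byte is a hypothetical helper.

```cpp
#include <algorithm>
#include <cassert>

// Stand-ins for the function-like macros that Windows.h defines when
// NOMINMAX is not set.
#define min(a, b) (((a) < (b)) ? (a) : (b))
#define max(a, b) (((a) > (b)) ? (a) : (b))

// Wrapping the name in parentheses means it is not followed directly by "(",
// so the preprocessor does not expand the macro and the std:: functions are used.
inline int clamp_to_byte(int v) {
  return (std::max)(0, (std::min)(v, 255));
}

// Remove the stand-in macros so they do not leak into later code.
#undef min
#undef max
```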

`NVTX3_CPP_REQUIRE_EXPLICIT_VERSION` is problematic in header-only libraries

Hi, the documentation for NVTX3_CPP_REQUIRE_EXPLICIT_VERSION in the nvtx3.hpp header containing the C++ API explains the following:

... the recommended best practice for instrumenting header-based libraries with NVTX C++ Wrappers is to #define NVTX3_CPP_REQUIRE_EXPLICIT_VERSION before including nvtx3.hpp, #undef it afterward, and only use explicit-version symbols.

However, this breaks user code using the unversioned API directly.

For example:

#include <my_library.hpp> // includes NVTX3
#include <nvtx3/nvtx3.hpp> // user also includes NVTX3

int main() {
  nvtx3::scoped_range domain; // user uses unversioned API
}

If my_library.hpp now changes from

#include <nvtx3/nvtx3.hpp>

to

#define NVTX3_CPP_REQUIRE_EXPLICIT_VERSION
#include <nvtx3/nvtx3.hpp>
#undef NVTX3_CPP_REQUIRE_EXPLICIT_VERSION

the above user program breaks, because the unversioned API is no longer provided.

This happens in the second case because the first inclusion of nvtx3.hpp defines NVTX3_CPP_DEFINITIONS_V1_0 and emits the symbols into namespace nvtx3::v1, where v1 is not an inline namespace because NVTX3_CPP_REQUIRE_EXPLICIT_VERSION is defined. The second inclusion then sees that NVTX3_CPP_DEFINITIONS_V1_0 is already defined and does not provide the unversioned API (e.g., by inlining the v1 namespace).

We observed this behavior at the following PR to CCCL/CUB: NVIDIA/cccl#1688

I therefore think that header-only libraries must not define NVTX3_CPP_REQUIRE_EXPLICIT_VERSION to not break user code. Please correct me if I am wrong. Otherwise, I would kindly ask you to update the guidance provided by the documentation.
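
The mechanism described above can be reproduced in a single file. This is a simplified sketch with hypothetical names (the sketch namespace and the SKETCH_* macros); it imitates the guard-plus-inline-namespace structure as this issue describes it and is not the actual contents of nvtx3.hpp.

```cpp
#include <cassert>

// What the header-only library does, following the quoted guidance.
#define SKETCH_REQUIRE_EXPLICIT_VERSION

// First "inclusion" of the hypothetical versioned header.
#ifndef SKETCH_DEFINITIONS_V1
#define SKETCH_DEFINITIONS_V1
namespace sketch {
#ifdef SKETCH_REQUIRE_EXPLICIT_VERSION
namespace v1 {         // symbols reachable only as sketch::v1::name
#else
inline namespace v1 {  // symbols also reachable as sketch::name
#endif
struct scoped_range {
  int id = 1;
};
}  // namespace v1
}  // namespace sketch
#endif

#undef SKETCH_REQUIRE_EXPLICIT_VERSION

// Second "inclusion" (the user's own include): the guard is already defined,
// so this branch is skipped and nothing exposes sketch::scoped_range.
#ifndef SKETCH_DEFINITIONS_V1
#error "not reached in this sketch"
#endif

inline int explicit_version_id() {
  sketch::v1::scoped_range r;  // the explicit-version name still works
  // sketch::scoped_range r2;  // would not compile in this configuration
  return r.id;
}
```

A namespace must be declared inline at its first declaration, so a later inclusion cannot retroactively turn v1 into an inline namespace; any remedy would need a different way to expose the unversioned names.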

Payloads in python events?

As far as I can tell the Python bindings don't support setting payload values - is that correct? Are there any plans to support that?

Implement `domain_process_range` without storing the handle in `std::unique_ptr`

We should avoid relying on dynamic memory allocation when possible. I don't think there is a strong reason for storing the range handle into a unique pointer.

Instead, we could:

  1. Add a boolean member variable in the domain_process_range class. It determines whether the range should be emitted or not
  2. Create a move constructor for the domain_process_range class:
domain_process_range(domain_process_range&& other) noexcept
{
#ifndef NVTX_DISABLE
  handle = other.handle;
  generate = other.generate;
  other.generate = false;
#endif
}
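
The two steps above can be sketched as a complete move-only class. This is a minimal illustration, assuming stand-in functions fake_start and fake_end in place of the NVTX start/end calls so that it runs without NVTX; flagged_range_sketch is a hypothetical name.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>

// Stand-ins for the NVTX range start/end calls.
inline int& ended_count() {
  static int n = 0;
  return n;
}
inline std::uint64_t fake_start() {
  static std::uint64_t next = 0;
  return ++next;
}
inline void fake_end(std::uint64_t) { ++ended_count(); }

// Hypothetical range type that stores the handle by value plus an ownership flag.
class flagged_range_sketch {
 public:
  flagged_range_sketch() : handle_{fake_start()}, owns_{true} {}

  flagged_range_sketch(flagged_range_sketch&& other) noexcept
      : handle_{other.handle_}, owns_{other.owns_} {
    other.owns_ = false;  // the moved-from object must not end the range
  }

  flagged_range_sketch& operator=(flagged_range_sketch&& other) noexcept {
    if (this != &other) {
      if (owns_) fake_end(handle_);  // end the range currently held
      handle_ = other.handle_;
      owns_ = other.owns_;
      other.owns_ = false;
    }
    return *this;
  }

  flagged_range_sketch(flagged_range_sketch const&) = delete;
  flagged_range_sketch& operator=(flagged_range_sketch const&) = delete;

  ~flagged_range_sketch() {
    if (owns_) fake_end(handle_);
  }

 private:
  std::uint64_t handle_;
  bool owns_;
};
```

No heap allocation is involved, and each started range is ended exactly once regardless of how many times the object is moved.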

Simplify the process of using NVTX in another CMake project

Rather than requiring users to include(...) a custom file to consume imported CMake targets, NVTX can detect whether it is being built stand-alone or as part of another project. The common pattern for detecting this is to add a top-level CMakeLists.txt file.

In a new NVTX/CMakeLists.txt file:

# ...normal project setup here...

# Check if standalone or part of another project:
if ("${CMAKE_SOURCE_DIR}" STREQUAL "${CMAKE_CURRENT_LIST_DIR}")
  set(NVTX_TOPLEVEL_PROJECT ON)
endif()

# using `option` instead of `set` lets users change this using `ccmake` or `cmake -D....`.
# When built as part of another project, NVTX_TOPLEVEL_PROJECT will be unset and default to false.
# When built stand-alone, NVTX_TOPLEVEL_PROJECT will be truthy and disable imported targets.
option(NVTX3_TARGETS_NOT_USING_IMPORTED "<doc string>" ${NVTX_TOPLEVEL_PROJECT})

# The `set(...NOT_USING_IMPORTED...)` in c/CMakeLists.txt should be removed in favor of the option above.
add_subdirectory(c)

For an example of how we use this pattern in CCCL, see: https://github.com/NVIDIA/cccl/blob/main/CMakeLists.txt#L10-L14

By making this change (and adding a top-level CMakeLists.txt file to the NVTX repo per the suggestion above), CPM usage would be simplified to:

CPMAddPackage("gh:NVIDIA/NVTX@release-v3")

for most users. Currently, we must do

CPMAddPackage(
  NAME NVTX
  GITHUB_REPOSITORY NVIDIA/NVTX
  GIT_TAG release-v3
  DOWNLOAD_ONLY
  SYSTEM
)
include("${NVTX_SOURCE_DIR}/c/nvtxImportedTargets.cmake")

for the common case of wanting imported targets.

Installing NVTX on python fails in nvidia docker image.

Hi,

I'm trying to install nvtx using pip inside an nvidia docker image nvidia/cuda:11.2.2-cudnn8-devel-ubuntu18.04
I installed Python 3.9 along with pip, and when running

python3.9 -m pip install nvtx

I get:

#13 79.02 Failed to build nvtx
#13 79.02 ERROR: Could not build wheels for nvtx, which is required to install pyproject.toml-based projects

At first I thought this was the same as #43, but I see they are not exactly the same issue. I'd rather not build the wheel inside the Docker image, or install conda.

A minimal dockerfile that fails:

FROM nvidia/cuda:11.2.2-cudnn8-devel-ubuntu18.04

RUN apt update --fix-missing
RUN apt install -y software-properties-common
RUN add-apt-repository ppa:deadsnakes/ppa
RUN apt update
ARG DEBIAN_FRONTEND=noninteractive
ENV TZ=Asia/Jerusalem
RUN apt install -y python3.9 python3-pip python3.9-dev python3.9-distutils curl unzip
ENV LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/
RUN python3.9 -m pip install --upgrade pip \
    && python3.9 -m pip install --upgrade setuptools \
    && python3.9 -m pip install --upgrade distlib \
    && python3.9 -m pip install nvtx

Python 3.11 support

Issue:

  • Cannot upgrade to python 3.11 due to nvtx

Error:

ERROR: Ignored the following versions that require a different python version: 1.21.2 Requires-Python >=3.7,<3.11; 1.21.3 Requires-Python >=3.7,<3.11; 1.21.4 Requires-Python >=3.7,<3.11; 1.21.5 Requires-Python >=3.7,<3.11; 1.21.6 Requires-Python >=3.7,<3.11; 1.6.2 Requires-Python >=3.7,<3.10; 1.6.3 Requires-Python >=3.7,<3.10; 1.7.0 Requires-Python >=3.7,<3.10; 1.7.0rc1 Requires-Python >=3.7,<3.10; 1.7.0rc2 Requires-Python >=3.7,<3.10; 1.7.1 Requires-Python >=3.7,<3.10; 1.7.2 Requires-Python >=3.7,<3.11; 1.7.3 Requires-Python >=3.7,<3.11; 1.8.0 Requires-Python >=3.8,<3.11; 1.8.0rc1 Requires-Python >=3.8,<3.11; 1.8.0rc2 Requires-Python >=3.8,<3.11; 1.8.0rc3 Requires-Python >=3.8,<3.11; 1.8.0rc4 Requires-Python >=3.8,<3.11; 1.8.1 Requires-Python >=3.8,<3.11
ERROR: Could not find a version that satisfies the requirement nvtx>=0.2.5; sys_platform != "darwin" (from XXX) (from versions: 0.2.0, 0.2.1, 0.2.2, 0.2.3, 0.2.4)

Question:

  • Any plans to add support for python 3.11 in future releases?

`end_range` should be templated with the domain and use `nvtxDomainRangeEnd`

The documentation of the end_range function is incorrect:

This function does not have a Domain tag type template parameter as the
handle r already indicates the domain to which the range belongs.

The handle does not indicate the domain to which the range belongs. The function should be implemented the following way:

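template <typename D = domain::global>
void end_range(range_handle r)
{
#ifndef NVTX_DISABLE
  // nvtxDomainRangeEnd takes both the domain handle and the id of the range to end.
  // Assumes range_handle exposes the underlying id via get_value().
  nvtxDomainRangeEnd(domain::get<D>(), r.get_value());
#else
  (void)r;
#endif
}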

As a result, the domain_process_range implementation is incorrect. The class relies on the start_range and end_range functions. The former relies on nvtxDomainRangeStartEx while the latter relies on nvtxRangeEnd. The only scenario where it works is when domain_process_range is templated with the default domain. For custom domains, the range will be started in the specified domain and ended in the default domain, which is incorrect.

Also, I am wondering whether it is really necessary to have the free-standing start_range and end_range functions. The domain_process_range class should be sufficient to handle all possible use cases and has the added benefit of preventing the end-users from misusing the API (e.g. generating an unterminated range, passing a wrong handle to end_range, etc.)
