


CUDA C++ Core Libraries (CCCL)

Welcome to the CUDA C++ Core Libraries (CCCL) where our mission is to make CUDA C++ more delightful.

This repository unifies three essential CUDA C++ libraries into a single, convenient repository:

  • Thrust
  • CUB
  • libcudacxx

The goal of CCCL is to provide CUDA C++ developers with building blocks that make it easier to write safe and efficient code. Bringing these libraries together streamlines your development process and broadens your ability to leverage the power of CUDA C++. For more information about the decision to unify these projects, see the announcement here.

Overview

The concept for the CUDA C++ Core Libraries (CCCL) grew organically out of the Thrust, CUB, and libcudacxx projects that were developed independently over the years with a similar goal: to provide high-quality, high-performance, and easy-to-use C++ abstractions for CUDA developers. Naturally, there was a lot of overlap among the three projects, and it became clear the community would be better served by unifying them into a single repository.

  • Thrust is the C++ parallel algorithms library which inspired the introduction of parallel algorithms to the C++ Standard Library. Thrust's high-level interface greatly enhances programmer productivity while enabling performance portability between GPUs and multicore CPUs via configurable backends that allow using multiple parallel programming frameworks (such as CUDA, TBB, and OpenMP).

  • CUB is a lower-level, CUDA-specific library designed for speed-of-light parallel algorithms across all GPU architectures. In addition to device-wide algorithms, it provides cooperative algorithms like block-wide reduction and warp-wide scan, giving CUDA kernel developers the building blocks they need to create speed-of-light, custom kernels.

  • libcudacxx is the CUDA C++ Standard Library. It provides an implementation of the C++ Standard Library that works in both host and device code. Additionally, it provides abstractions for CUDA-specific hardware features like synchronization primitives, cache control, atomics, and more.

The main goal of CCCL is to fill a similar role that the Standard C++ Library fills for Standard C++: provide general-purpose, speed-of-light tools to CUDA C++ developers, allowing them to focus on solving the problems that matter. Unifying these projects is the first step towards realizing that goal.

Example

This is a simple example demonstrating the use of CCCL functionality from Thrust, CUB, and libcudacxx.

It shows how to use Thrust/CUB/libcudacxx to implement a simple parallel reduction kernel. Each thread block computes the sum of a subset of the array using cub::BlockReduce. The sum of each block is then reduced to a single value using an atomic add via cuda::atomic_ref from libcudacxx.

It then shows how the same reduction can be done using Thrust's reduce algorithm and compares the results.

Try it live on Godbolt!

#include <thrust/execution_policy.h>
#include <thrust/device_vector.h>
#include <cub/block/block_reduce.cuh>
#include <cuda/atomic>
#include <cassert>
#include <cstdio>

constexpr int block_size = 256;

__global__ void reduce(int const* data, int* result, int N) {
  using BlockReduce = cub::BlockReduce<int, block_size>;
  __shared__ typename BlockReduce::TempStorage temp_storage;

  int const index = threadIdx.x + blockIdx.x * blockDim.x;
  int sum = 0;
  if (index < N) {
    sum += data[index];
  }
  sum = BlockReduce(temp_storage).Sum(sum);

  if (threadIdx.x == 0) {
    cuda::atomic_ref<int, cuda::thread_scope_device> atomic_result(*result);
    atomic_result.fetch_add(sum, cuda::memory_order_relaxed);
  }
}

int main() {

  // Allocate and initialize input data
  int const N = 1000;
  thrust::device_vector<int> data(N);
  thrust::fill(data.begin(), data.end(), 1);

  // Allocate output data
  thrust::device_vector<int> kernel_result(1);

  // Compute the sum reduction of `data` using a custom kernel
  int const num_blocks = (N + block_size - 1) / block_size;
  reduce<<<num_blocks, block_size>>>(thrust::raw_pointer_cast(data.data()),
                                     thrust::raw_pointer_cast(kernel_result.data()),
                                     N);

  auto const err = cudaDeviceSynchronize();
  if (err != cudaSuccess) {
    std::printf("Error: %s\n", cudaGetErrorString(err));
    return -1;
  }

  int const custom_result = kernel_result[0];

  // Compute the same sum reduction using Thrust
  int const thrust_result = thrust::reduce(thrust::device, data.begin(), data.end(), 0);

  // Ensure the two solutions are identical
  std::printf("Custom kernel sum: %d\n", custom_result);
  std::printf("Thrust reduce sum: %d\n", thrust_result);
  assert(custom_result == thrust_result);
  return 0;
}

Getting Started

Users

Everything in CCCL is header-only. Therefore, users need only concern themselves with how they get the header files and how they incorporate them into their build system.

CUDA Toolkit

The easiest way to get started using CCCL is via the CUDA Toolkit which includes the CCCL headers. When you compile with nvcc, it automatically adds CCCL headers to your include path so you can simply #include any CCCL header in your code with no additional configuration required.

If compiling with another compiler, you will need to update your build system's include search path to point to the CCCL headers in your CTK install (e.g., /usr/local/cuda/include).

#include <thrust/device_vector.h>
#include <cub/cub.cuh>
#include <cuda/std/atomic>

GitHub

Users who want to stay on the cutting edge of CCCL development are encouraged to use CCCL from GitHub. Using a newer version of CCCL with an older version of the CUDA Toolkit is supported, but not the other way around. For complete information on compatibility between CCCL and the CUDA Toolkit, see our platform support.

Everything in CCCL is header-only, so cloning and including it in a simple project is as easy as the following:

git clone https://github.com/NVIDIA/cccl.git
nvcc -Icccl/thrust -Icccl/libcudacxx/include -Icccl/cub main.cu -o main

Note Use -I and not -isystem to avoid collisions with the CCCL headers implicitly included by nvcc from the CUDA Toolkit. All CCCL headers use #pragma system_header to ensure warnings will still be silenced as if using -isystem, see #527 for more information.

CMake Integration

CCCL uses CMake for all build and installation infrastructure, including tests as well as targets to link against in other CMake projects. Therefore, CMake is the recommended way to integrate CCCL into another project.

For a complete example of how to do this using CMake Package Manager see our example project.

Other build systems should work, but only CMake is tested. Contributions to simplify integrating CCCL into other build systems are welcome.

Contributors

Interested in contributing to making CCCL better? Check out our Contributing Guide for a comprehensive overview of everything you need to know to set up your development environment, make changes, run tests, and submit a PR.

Platform Support

Objective: This section describes where users can expect CCCL to compile and run successfully.

In general, CCCL should work everywhere the CUDA Toolkit is supported; however, the devil is in the details. The sections below describe the details of support and testing for different versions of the CUDA Toolkit, host compilers, and C++ dialects.

CUDA Toolkit (CTK) Compatibility

Summary:

  • The latest version of CCCL is backward compatible with the current and preceding CTK major version series
  • CCCL is never forward compatible with any version of the CTK. Always use a CCCL version that is the same as or newer than the one included with your CTK.
  • Minor version CCCL upgrades won't break existing code, but new features may not support all CTK versions

CCCL users are encouraged to capitalize on the latest enhancements and "live at head" by always using the newest version of CCCL. For a seamless experience, you can upgrade CCCL independently of the entire CUDA Toolkit. This is possible because CCCL maintains backward compatibility with the latest patch release of every minor CTK release from both the current and previous major version series. In some exceptional cases, the minimum supported minor version of the CUDA Toolkit release may need to be newer than the oldest release within its major version series. For instance, CCCL requires a minimum supported version of 11.1 from the 11.x series due to an unavoidable compiler issue present in CTK 11.0.

When a new major CTK is released, we drop support for the oldest supported major version.

CCCL Version   Supports CUDA Toolkit Version
2.x            11.1 - 11.8, 12.x (only latest patch releases)
3.x (Future)   12.x, 13.x (only latest patch releases)

Well-behaved code using the latest CCCL should compile and run successfully with any supported CTK version. Exceptions may occur for new features that depend on new CTK features, so those features would not work on older versions of the CTK. For example, C++20 support was not added to nvcc until CUDA 12.0, so CCCL features that depend on C++20 would not work with CTK 11.x.

Users can integrate a newer version of CCCL into an older CTK, but not the other way around. This means an older version of CCCL is not compatible with a newer CTK. In other words, CCCL is never forward compatible with the CUDA Toolkit.

The table below summarizes compatibility of the CTK and CCCL:

CTK Version   Included CCCL Version   Desired CCCL          Supported?   Notes
CTK X.Y       CCCL MAJOR.MINOR        CCCL MAJOR.MINOR+n    Yes          Some new features might not work
CTK X.Y       CCCL MAJOR.MINOR        CCCL MAJOR+1.MINOR    Yes          Possible breaks; some new features might not be available
CTK X.Y       CCCL MAJOR.MINOR        CCCL MAJOR+2.MINOR    No           CCCL supports only two CTK major versions
CTK X.Y       CCCL MAJOR.MINOR        CCCL MAJOR.MINOR-n    No           CCCL isn't forward compatible
CTK X.Y       CCCL MAJOR.MINOR        CCCL MAJOR-n.MINOR    No           CCCL isn't forward compatible

For more information on CCCL versioning, API/ABI compatibility, and breaking changes see the Versioning section below.

Operating Systems

Unless otherwise specified, CCCL supports all the same operating systems as the CUDA Toolkit, which are documented here:

Host Compilers

Unless otherwise specified, CCCL supports all the same host compilers as the CUDA Toolkit, which are documented here:

C++ Dialects

  • C++11 (Deprecated in Thrust/CUB, to be removed in next major version)
  • C++14 (Deprecated in Thrust/CUB, to be removed in next major version)
  • C++17
  • C++20

GPU Architectures

Unless otherwise specified, CCCL supports all the same GPU architectures/Compute Capabilities as the CUDA Toolkit, which are documented here: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#compute-capability

Note that some features may only support certain architectures/Compute Capabilities.

Testing Strategy

CCCL's testing strategy strikes a balance between testing as many configurations as possible and maintaining reasonable CI times.

For CUDA Toolkit versions, testing is done against both the oldest and the newest supported versions. For instance, if the latest version of the CUDA Toolkit is 12.3, tests are conducted against 11.1 and 12.3. For each CUDA version, builds are completed against all supported host compilers with all supported C++ dialects.

The testing strategy and matrix are constantly evolving. The matrix defined in the ci/matrix.yaml file is the definitive source of truth. For more information about our CI pipeline, see here.

Versioning

Objective: This section describes how CCCL is versioned, API/ABI stability guarantees, and compatibility guidelines to minimize upgrade headaches.

Summary

  • The entirety of CCCL's API shares a common semantic version across all components
  • Only the most recently released version is supported and fixes are not backported to prior releases
  • API breaking changes and incrementing CCCL's major version will only coincide with a new major version release of the CUDA Toolkit
  • Not all source breaking changes are considered breaking changes of the public API that warrant bumping the major version number
  • Do not rely on ABI stability of entities in the cub:: or thrust:: namespaces
  • ABI breaking changes for symbols in the cuda:: namespace may happen at any time, but will be reflected by incrementing the ABI version which is embedded in an inline namespace for all cuda:: symbols. Multiple ABI versions may be supported concurrently.

Note: Prior to merging Thrust, CUB, and libcudacxx into this repository, each library was independently versioned according to semantic versioning. Starting with the 2.1 release, all three libraries synchronized their release versions in their separate repositories. Moving forward, CCCL will continue to be released under a single semantic version, with 2.2.0 being the first release from the nvidia/cccl repository.

Breaking Change

A Breaking Change is a change to explicitly supported functionality between released versions that would require a user to do work in order to upgrade to the newer version.

In the limit, any change has the potential to break someone somewhere. As a result, not all possible source breaking changes are considered Breaking Changes to the public API that warrant bumping the major semantic version.

The sections below describe the details of breaking changes to CCCL's API and ABI.

Application Programming Interface (API)

CCCL's public API is the entirety of the functionality intentionally exposed to provide the utility of the library.

In other words, CCCL's public API goes beyond just function signatures and includes (but is not limited to):

  • The location and names of headers intended for direct inclusion in user code
  • The namespaces intended for direct use in user code
  • The declarations and/or definitions of functions, classes, and variables located in headers and intended for direct use in user code
  • The semantics of functions, classes, and variables intended for direct use in user code

Moreover, CCCL's public API does not include any of the following:

  • Any symbol prefixed with _ or __
  • Any symbol whose name contains detail, including the detail:: namespace or any macro with detail in its name
  • Any header file contained in a detail/ directory or sub-directory thereof
  • The header files implicitly included by any header part of the public API

In general, the goal is to avoid breaking anything in the public API. Such changes are made only if they offer users better performance, easier-to-understand APIs, and/or more consistent APIs.

Any breaking change to the public API will require bumping CCCL's major version number. In keeping with CUDA Minor Version Compatibility, API breaking changes and CCCL major version bumps will only occur coinciding with a new major version release of the CUDA Toolkit.

Anything not part of the public API may change at any time without warning.

API Versioning

The public API of all CCCL's components share a unified semantic version of MAJOR.MINOR.PATCH.

Only the most recently released version is supported. As a rule, features and bug fixes are not backported to previously released versions or branches.

The preferred method for querying the version is to use CCCL_[MAJOR/MINOR/PATCH_]VERSION as described below. For backwards compatibility, the Thrust/CUB/libcudacxx version definitions are available and will always be consistent with CCCL_VERSION. Note that Thrust/CUB use an MMMmmmpp scheme whereas CCCL and libcudacxx use MMMmmmppp.

CCCL libcudacxx Thrust CUB
Header <cuda/version> <cuda/std/version> <thrust/version.h> <cub/version.h>
Major Version CCCL_MAJOR_VERSION _LIBCUDACXX_CUDA_API_VERSION_MAJOR THRUST_MAJOR_VERSION CUB_MAJOR_VERSION
Minor Version CCCL_MINOR_VERSION _LIBCUDACXX_CUDA_API_VERSION_MINOR THRUST_MINOR_VERSION CUB_MINOR_VERSION
Patch/Subminor Version CCCL_PATCH_VERSION _LIBCUDACXX_CUDA_API_VERSION_PATCH THRUST_SUBMINOR_VERSION CUB_SUBMINOR_VERSION
Concatenated Version CCCL_VERSION (MMMmmmppp) _LIBCUDACXX_CUDA_API_VERSION (MMMmmmppp) THRUST_VERSION (MMMmmmpp) CUB_VERSION (MMMmmmpp)
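
The macros above can be used for compile-time feature gating. A minimal sketch follows; the 2.3.0 cutoff is an arbitrary illustration, not a documented threshold:

#include <cuda/version>  // defines CCCL_VERSION and CCCL_*_VERSION

// CCCL_VERSION uses the MMMmmmppp encoding from the table above,
// so 2.3.0 encodes as 2003000.
#if CCCL_VERSION >= 2003000
  // Code that relies on a newer CCCL feature.
#else
  // Fallback path for older CCCL releases.
#endif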

Application Binary Interface (ABI)

The Application Binary Interface (ABI) is a set of rules for:

  • How a library's components are represented in machine code
  • How those components interact across different translation units

A library's ABI includes, but is not limited to:

  • The mangled names of functions and types
  • The size and alignment of objects and types
  • The semantics of the bytes in the binary representation of an object

An ABI Breaking Change is any change that results in a change to the ABI of a function or type in the public API. For example, adding a new data member to a struct is an ABI Breaking Change as it changes the size of the type.

In CCCL, the guarantees about ABI are as follows:

  • Symbols in the thrust:: and cub:: namespaces may break ABI at any time without warning.
  • The ABI of thrust:: and cub:: symbols includes the CUDA architectures used for compilation. Therefore, a thrust:: or cub:: symbol may have a different ABI if:
    • compiled with different architectures
    • compiled as a CUDA source file (-x cu) vs C++ source (-x cpp)
  • Symbols in the cuda:: namespace may also break ABI at any time. However, cuda:: symbols embed an ABI version number that is incremented whenever an ABI break occurs. Multiple ABI versions may be supported concurrently, and therefore users have the option to revert to a prior ABI version. For more information, see here.

Who should care about ABI?

In general, CCCL users only need to worry about ABI issues when building or using a binary artifact (like a shared library) whose API directly or indirectly includes types provided by CCCL.

For example, consider if libA.so was built using CCCL version X and its public API includes a function like:

void foo(cuda::std::optional<int>);

If another library, libB.so, is compiled using CCCL version Y and uses foo from libA.so, then this can fail if there was an ABI break between version X and Y. Unlike with API breaking changes, ABI breaks usually do not require code changes and only require recompiling everything to use the same ABI version.

To learn more about ABI and why it is important, see What is ABI, and What Should C++ Do About It?.

Compatibility Guidelines

As mentioned above, not all possible source breaking changes constitute a Breaking Change that would require incrementing CCCL's API major version number.

Users are encouraged to adhere to the following guidelines in order to minimize the risk of disruptions from accidentally depending on parts of CCCL that are not part of the public API:

  • Do not add any declarations to the thrust::, cub::, nv::, or cuda:: namespaces unless an exception is noted for a specific symbol, e.g., specializing a type trait.
    • Rationale: This would cause symbol conflicts if a symbol is added with the same name.
  • Do not take the address of any API in the thrust::, cub::, cuda::, or nv:: namespaces.
    • Rationale: This would prevent adding overloads of these APIs.
  • Do not forward declare any API in the thrust::, cub::, cuda::, or nv:: namespaces.
    • Rationale: This would prevent adding overloads of these APIs.
  • Do not directly reference any symbol prefixed with _ or __, or any symbol with detail anywhere in its name, including the detail:: namespace or macros
    • Rationale: These symbols are for internal use only and may change at any time without warning.
  • Include what you use. For every CCCL symbol that you use, directly #include the header file that declares that symbol. In other words, do not rely on headers implicitly included by other headers.
    • Rationale: Internal includes may change at any time.

Portions of this section were inspired by Abseil's Compatibility Guidelines.

Deprecation Policy

We will do our best to notify users prior to making any breaking changes to the public API, ABI, or modifying the supported platforms and compilers.

As appropriate, deprecations will come in the form of programmatic warnings which can be disabled.

The deprecation period will depend on the impact of the change, but will usually last at least 2 minor version releases.

Mapping to CTK Versions

Coming soon!

CI Pipeline Overview

For a detailed overview of the CI pipeline, see ci-overview.md.

Related Projects

Projects that are related to CCCL's mission to make CUDA C++ more delightful:

  • cuCollections - GPU accelerated data structures like hash tables
  • NVBench - Benchmarking library tailored for CUDA applications
  • stdexec - Reference implementation for Senders asynchronous programming model

Projects Using CCCL

Does your project use CCCL? Open a PR to add your project to this list!

  • AmgX - Multi-grid linear solver library
  • ColossalAI - Tools for writing distributed deep learning models
  • cuDF - Algorithms and file readers for ETL data analytics
  • cuGraph - Algorithms for graph analytics
  • cuML - Machine learning algorithms and primitives
  • CuPy - NumPy & SciPy for GPU
  • cuSOLVER - Dense and sparse linear solvers
  • cuSpatial - Algorithms for geospatial operations
  • GooFit - Library for maximum-likelihood fits
  • HeavyDB - SQL database engine
  • HOOMD - Monte Carlo and molecular dynamics simulations
  • HugeCTR - GPU-accelerated recommender framework
  • Hydra - High-energy Physics Data Analysis
  • Hypre - Multigrid linear solvers
  • LightSeq - Training and inference for sequence processing and generation
  • MatX - Numerical computing library using expression templates to provide efficient, Python-like syntax
  • PyTorch - Tensor and neural network computations
  • Qiskit - High performance simulator for quantum circuits
  • QUDA - Lattice quantum chromodynamics (QCD) computations
  • RAFT - Algorithms and primitives for machine learning
  • TensorFlow - End-to-end platform for machine learning
  • TensorRT - Deep learning inference
  • tsne-cuda - Stochastic Neighborhood Embedding library
  • Visualization Toolkit (VTK) - Rendering and visualization library
  • XGBoost - Gradient boosting machine learning algorithms



cccl's Issues

Functor name clashes across "compilation units"?

I have two .h/.cu pairs, each implementing a similar structure of C++ object with member device vectors and methods which use Thrust functions calling functors defined as structs locally in their respective .cu files (not mentioned in the .h files, no cross-#include issues). Same name-spacing on each side, different C++ class names, five functors in each case, three with different names/sigs/bodies, one with same name/sig/body (which I will factor out), and the last with the same name but different sigs and body.

One of these code paths runs correctly, the second falls over with Thrust/CUDA memory allocation corruption after the step which invokes that functor which has the same name on both sides.

Renaming that functor in one side fixes the problem, as does putting an anonymous namespace around all the functors on both sides.

My first suspicion was that, because the namespacing was the same and it was a struct (so no additional C++ mangling), the Thrust run-time was somehow calling the same version of that functor from both code paths. I tried putting a printf in each functor with different messages, but those messages came out as expected, so it does not appear to be this simple.

I can think of any number of reasons why spotting and rejecting this situation at CPU compile time would be hard, as they are separate files and separate compilation units, and the pre-compiled CUDA kernel isn't "linked" until run-time, but I guess I would have expected normal C/C++ compilation-unit-scope rules to still apply, and therefore for what I did not to be a problem.

This is all on Ubuntu 18.04 with gcc 7.3.0 and CUDA/nvcc/Thrust 10.0.130.

Actual code supplied on request.
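
A minimal sketch of the anonymous-namespace fix described above (file and functor names are hypothetical). Members of an unnamed namespace have internal linkage, so each translation unit gets its own distinct functor type and the one-definition-rule collision goes away:

// a.cu
namespace {  // internal linkage: this type is private to a.cu
struct scale_op {
  float factor;
  __host__ __device__ float operator()(float x) const { return x * factor; }
};
} // anonymous namespace

// b.cu can define its own, different scale_op inside its own anonymous
// namespace without clashing with the one in a.cu.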

Make cub NVRTC-compatible

Currently, cub includes a number of system headers causing errors such as

C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\include\vadefs.h(143): error: A function without execution space annotations (host/device/global) is considered a host function, and host functions are not allowed in JIT mode. Consider using -default-device flag to process unannotated functions as device functions in JIT mode

when attempting to use cub with NVRTC. It would be great if cub could be modified to also support use in NVRTC. (Pointers only instead of general iterators would be fine for me).

Note: while passing -default-device to NVRTC removes this particular error, other errors such as

C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\include\vcruntime.h(81): error: identifier "push" is undefined

or

C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\include\vcruntime.h(81): error: this declaration has no storage class or type specifier

remain. Furthermore, packaging all compiler headers is not feasible anyways.

inclusive_scan produces the wrong result for char types

Originally reported here: https://groups.google.com/d/msg/thrust-users/X7-FEDtKfBo/4wVMgfGgBgAJ

Here's a self-contained example showing a bug with the latest Thrust (I've tried both the one included with Cuda 7.5 RC and the latest from the master branch of the repo which included a recent fix for inclusive_scan): https://gist.github.com/eglaser77/756e5a9234cf0f08a3fb.

I build it with the command:

/usr/local/cuda/bin/nvcc -arch=sm_30 thrust_test.cu -o thrust_test -I/usr/local/cuda/include -g -L/usr/local/cuda/lib64/ -lcuda -lcudart

Basically I am trying to get the locations of 'true' values in a stencil. The first method uses thrust::inclusive_scan followed by thrust::upper_bound. It works with host vectors but fails when run with device vectors on the GPU. The second method does a thrust::copy_if and works fine. I get the same results on a Quadro K2100M and a GeForce GTX 750 Ti.

Here's the output I get (hindices1 are from the inclusive_scan/upper_bound method; hindices2 are from copy_if):

i: 0 stencil_location: 467508 hindices1: 467508 hindices2: 467508
i: 1 stencil_location: 1326441 hindices1: 1326441 hindices2: 1326441
i: 2 stencil_location: 1541662 hindices1: 1541662 hindices2: 1541662
i: 3 stencil_location: 1679866 hindices1: 1679866 hindices2: 1679866
i: 4 stencil_location: 2234773 hindices1: 2234773 hindices2: 2234773
i: 5 stencil_location: 2387355 hindices1: 2387355 hindices2: 2387355
i: 6 stencil_location: 2653762 hindices1: 2653762 hindices2: 2653762
i: 7 stencil_location: 3159732 hindices1: 3159732 hindices2: 3159732
i: 8 stencil_location: 3226888 hindices1: 3226888 hindices2: 3226888
i: 9 stencil_location: 3828014 hindices1: 3828014 hindices2: 3828014
i: 10 stencil_location: 3887644 hindices1: 3887644 hindices2: 3887644
i: 11 stencil_location: 3909417 hindices1: 3909417 hindices2: 3909417
i: 12 stencil_location: 3924245 hindices1: 3924245 hindices2: 3924245
i: 13 stencil_location: 4042273 hindices1: 4233776 hindices2: 4042273
i: 14 stencil_location: 4150580 hindices1: 4446033 hindices2: 4150580
i: 15 stencil_location: 4233776 hindices1: 4484984 hindices2: 4233776
i: 16 stencil_location: 4425058 hindices1: 4836990 hindices2: 4425058
i: 17 stencil_location: 4446033 hindices1: 5328271 hindices2: 4446033
i: 18 stencil_location: 4484984 hindices1: 5483482 hindices2: 4484984
i: 19 stencil_location: 4565655 hindices1: 5755194 hindices2: 4565655
i: 20 stencil_location: 4629464 hindices1: 5781566 hindices2: 4629464
i: 21 stencil_location: 4703190 hindices1: 5987753 hindices2: 4703190
i: 22 stencil_location: 4836990 hindices1: 8000000 hindices2: 4836990
i: 23 stencil_location: 4903165 hindices1: 8000000 hindices2: 4903165
i: 24 stencil_location: 4910365 hindices1: 8000000 hindices2: 4910365
i: 25 stencil_location: 5328271 hindices1: 8000000 hindices2: 5328271
i: 26 stencil_location: 5483482 hindices1: 8000000 hindices2: 5483482
i: 27 stencil_location: 5755194 hindices1: 8000000 hindices2: 5755194
i: 28 stencil_location: 5781566 hindices1: 8000000 hindices2: 5781566
i: 29 stencil_location: 5966710 hindices1: 8000000 hindices2: 5966710
i: 30 stencil_location: 5987753 hindices1: 8000000 hindices2: 5987753
i: 31 stencil_location: 7870669 hindices1: 8000000 hindices2: 7870669

The problem appears to be in the inclusive_scan call. When I examine the values I see that it is not strictly increasing as I would expect. Printing out where the scanned values change I get the following:

i: 467508 hscanned[i]: 1
i: 1326441 hscanned[i]: 2
i: 1541662 hscanned[i]: 3
i: 1679866 hscanned[i]: 4
i: 2234773 hscanned[i]: 5
i: 2387355 hscanned[i]: 6
i: 2653762 hscanned[i]: 7
i: 3159732 hscanned[i]: 8
i: 3226888 hscanned[i]: 9
i: 3828014 hscanned[i]: 10
i: 3887644 hscanned[i]: 11
i: 3909417 hscanned[i]: 12
i: 3924245 hscanned[i]: 13
i: 4008960 hscanned[i]: 11
i: 4042273 hscanned[i]: 12
i: 4150580 hscanned[i]: 13
i: 4233776 hscanned[i]: 14
i: 4276224 hscanned[i]: 13
i: 4425058 hscanned[i]: 14
i: 4446033 hscanned[i]: 15
i: 4484984 hscanned[i]: 16
i: 4543488 hscanned[i]: 14
i: 4565655 hscanned[i]: 15
i: 4629464 hscanned[i]: 16
i: 4677120 hscanned[i]: 15
i: 4703190 hscanned[i]: 16
i: 4836990 hscanned[i]: 17
i: 4903165 hscanned[i]: 18
i: 4910365 hscanned[i]: 19
i: 4944384 hscanned[i]: 17
i: 5328271 hscanned[i]: 18
i: 5483482 hscanned[i]: 19
i: 5755194 hscanned[i]: 20
i: 5781566 hscanned[i]: 21
i: 5879808 hscanned[i]: 20
i: 5966710 hscanned[i]: 21
i: 5987753 hscanned[i]: 22
i: 6013440 hscanned[i]: 21
i: 7870669 hscanned[i]: 22
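
A hedged workaround sketch (not from the report): the non-monotonic output above is consistent with partial sums being accumulated in a char-sized type and wrapping, so scanning through an iterator that widens the stencil to int avoids the overflow. The helper below is illustrative:

#include <thrust/device_vector.h>
#include <thrust/scan.h>
#include <thrust/iterator/transform_iterator.h>

struct to_int {
  __host__ __device__ int operator()(char c) const { return c ? 1 : 0; }
};

// scanned must have the same size as stencil
void scan_stencil(const thrust::device_vector<char>& stencil,
                  thrust::device_vector<int>& scanned)
{
  auto first = thrust::make_transform_iterator(stencil.begin(), to_int{});
  auto last  = thrust::make_transform_iterator(stencil.end(),   to_int{});
  // The scan now accumulates in int, so the result stays non-decreasing.
  thrust::inclusive_scan(first, last, scanned.begin());
}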

Documentation (& code) regarding determining temp_storage_bytes not very clear

I would like to be able to determine my device-wide primitive's temp_storage_bytes before I have all of the primitive's arguments ready. The interface for obtaining it ostensibly requires everything to be ready for the actual run - and the documentation does not make it clear what arguments it actually needs and what it's going to do with them (e.g. will it look at the input at all, in any way? Probably not, but who knows)

Also, delving into the source (in my case: dispatch_select_if.cuh), I see:

if (d_temp_storage == NULL)
{
    // Return if the caller is simply requesting the size of the storage allocation
    break;
}

which is super-weird. What if I passed a pointer to a size_t containing 2^64-1? Would my code be led to believe it has to allocate that much? Hmm.
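
For reference, the snippet quoted above is what makes the usual two-phase pattern work: when d_temp_storage is NULL, the dispatch writes temp_storage_bytes and returns before touching the data. A hedged illustration, using cub::DeviceReduce::Sum as a stand-in for whichever device-wide primitive is in question:

#include <cub/device/device_reduce.cuh>
#include <cuda_runtime.h>

void device_sum(const int* d_in, int* d_out, int num_items)
{
  void*  d_temp_storage     = nullptr;
  size_t temp_storage_bytes = 0;

  // Phase 1: d_temp_storage == nullptr, so only the required size is written.
  cub::DeviceReduce::Sum(d_temp_storage, temp_storage_bytes, d_in, d_out, num_items);

  cudaMalloc(&d_temp_storage, temp_storage_bytes);

  // Phase 2: the actual reduction runs with the allocated scratch space.
  cub::DeviceReduce::Sum(d_temp_storage, temp_storage_bytes, d_in, d_out, num_items);

  cudaFree(d_temp_storage);
}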

Dynamic Backend

It's not always clear at compile time on which computational resources a Thrust program should execute at runtime. For example, the programmer may not know whether the system the program will run will host a GPU. It would be useful to be able to defer the decision of which Thrust system to target until runtime. A dynamic backend system could solve this problem.

Blocked by: Stateful Dispatch NVIDIA/thrust#3

Document and test WarpScan::Scan's treatment of `init` argument

...
int input = 1;
int init = 10;
int inclusive_output, exclusive_output;

using WarpScan = cub::WarpScan<int, 32>;
__shared__ typename WarpScan::TempStorage storage[warps_no];
WarpScan(storage[warp_id])
    .Scan(input, inclusive_output, exclusive_output, init, cub::Sum());
...

Should inclusive_output after calling WarpScan::Scan(..) function include init? Right now it does, but WarpScan::Scan's documentation is unclear about it.

Give inclusive_scan an overload with init

It's difficult to implement scan and other sequences of scan or sums otherwise. The init parameter can accept the "carry" of the previous sum. inclusive_scan without init can easily be implemented by the more general overload by passing the initial input element as the init and having the first thread copy to the first element of the output.
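
A hedged sketch of how this can be emulated today (assuming the operator is plus, as in the "carry" use case above): scan first, then fold the carry into every element with an extra pass.

#include <thrust/device_vector.h>
#include <thrust/scan.h>
#include <thrust/transform.h>
#include <thrust/functional.h>
#include <thrust/iterator/constant_iterator.h>

// out must have the same size as in
void inclusive_scan_with_init(const thrust::device_vector<int>& in,
                              thrust::device_vector<int>& out,
                              int init)
{
  thrust::inclusive_scan(in.begin(), in.end(), out.begin());
  // Add the carry from the previous chunk to every prefix sum.
  thrust::transform(out.begin(), out.end(),
                    thrust::make_constant_iterator(init),
                    out.begin(), thrust::plus<int>());
}

A dedicated init overload would avoid the second pass over the output, which is presumably the motivation for the request.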

Strided Iterators and 'Skip' Iterators

Is there anything equivalent in Thrust that corresponds to the strided_iterator found in boost compute? On the same topic of operating on specific elements of a buffer, is there a way to operate on 'padded' objects? Note, I also have this same question posted in boost compute. For example, take the row-wise padded matrix:

8 2 7 0 0 0
6 5 4 0 0 0
1 3 9 0 0 0 

So the internal flat buffer would be

8 2 7 0 0 0 6 5 4 0 0 0 1 3 9 0 0 0

As an example, how could I use sort on just the 'non-padded' elements?

Likewise, the column wise may also be padded (less common but might as well include here)

8 2 7 0 0 0
6 5 4 0 0 0
1 3 9 0 0 0 
0 0 0 0 0 0
0 0 0 0 0 0 
0 0 0 0 0 0
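
A hedged sketch of one common answer (there is no built-in strided iterator in Thrust, but one can be composed from the fancy iterators): map a dense index to the padded layout with counting_iterator + transform_iterator, then view the non-padded elements through a permutation_iterator. Sorting that view sorts only the non-padded elements of the row-wise padded matrix above (3 data columns, padded row width 6). The names below are illustrative:

#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/iterator/permutation_iterator.h>

struct padded_index {
  int row_width;     // number of real (non-padded) elements per row
  int padded_width;  // row stride in the flat buffer
  __host__ __device__ int operator()(int i) const {
    return (i / row_width) * padded_width + (i % row_width);
  }
};

void sort_non_padded(thrust::device_vector<int>& flat, int row_width, int padded_width)
{
  int rows = static_cast<int>(flat.size()) / padded_width;
  auto idx   = thrust::make_transform_iterator(thrust::make_counting_iterator(0),
                                               padded_index{row_width, padded_width});
  auto first = thrust::make_permutation_iterator(flat.begin(), idx);
  thrust::sort(first, first + rows * row_width);  // touches only non-padded slots
}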

thrust::device_delete cannot be compiled for class with user-defined constructor.

Following code cannot be compiled:

#include <thrust/device_ptr.h>
#include <thrust/device_new.h>
#include <thrust/device_delete.h>

struct Foo {
     __host__ __device__ Foo() : x(0) {}

    int x;
};

int main() {
    thrust::device_ptr<Foo> foo = thrust::device_new<Foo>();
    thrust::device_delete(foo);

    cudaDeviceReset();

    return 0;
}

Error:

$ nvcc -arch=sm_35 -std=c++11 test.cu
/usr/local/cuda/bin/../targets/x86_64-linux/include/thrust/detail/allocator/allocator_traits.h(145): error: class "thrust::detail::device_delete_allocator" has no member "value_type"
          detected during:
            instantiation of class "thrust::detail::allocator_traits<Alloc> [with Alloc=thrust::detail::device_delete_allocator]" 
(262): here
            instantiation of class "thrust::detail::allocator_system<Alloc> [with Alloc=thrust::detail::device_delete_allocator]" 
/usr/local/cuda/bin/../targets/x86_64-linux/include/thrust/detail/allocator/destroy_range.inl(137): here
            instantiation of "thrust::detail::allocator_traits_detail::enable_if_destroy_range_case2<Allocator, Pointer>::type thrust::detail::allocator_traits_detail::destroy_range(Allocator &, Pointer, Size) [with Allocator=thrust::detail::device_delete_allocator, Pointer=thrust::device_ptr<Foo>, Size=size_t]" 
/usr/local/cuda/bin/../targets/x86_64-linux/include/thrust/detail/allocator/destroy_range.inl(158): here
            instantiation of "void thrust::detail::destroy_range(Allocator &, Pointer, Size) [with Allocator=thrust::detail::device_delete_allocator, Pointer=thrust::device_ptr<Foo>, Size=size_t]" 
/usr/local/cuda/bin/../targets/x86_64-linux/include/thrust/detail/device_delete.inl(42): here
            instantiation of "void thrust::device_delete(thrust::device_ptr<T>, size_t) [with T=Foo]" 
test.cu(13): here

1 error detected in the compilation of "/tmp/tmpxft_000025dd_00000000-8_test.cpp1.ii".

What is wrong?

If I remove or defaults the constructor of Foo class, then compilation succeeds:

OK1:

struct Foo {
    int x;
};

OK2:

struct Foo {
    Foo() = default;
    int x;
};

Version:

$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Sat_Aug_25_21:08:01_CDT_2018
Cuda compilation tools, release 10.0, V10.0.130

Add reduce_each example

This code for summing matrix rows is appropriate and simple:

#include <thrust/tabulate.h>
#include <thrust/reduce.h>
#include <thrust/device_vector.h>
#include <thrust/execution_policy.h> // needed for thrust::device
#include <thrust/copy.h>
#include <thrust/sequence.h>
#include <iterator>

struct reduce_functor
{
  thrust::device_ptr<float> matrix;
  int width;

  __host__ __device__
  float operator()(int row_index)
  {
    auto begin = matrix + width * row_index;
    return thrust::reduce(thrust::device, begin, begin + width, 0.0f);
  }
};

template<class Vector>
void print_matrix(const Vector& mtx, int width)
{
  auto num_rows = mtx.size() / width;
  for(auto row = 0; row < num_rows; ++row)
  {
    std::cout << "|";
    auto begin = mtx.begin() + row * width;
    thrust::copy(begin, begin + width, std::ostream_iterator<typename Vector::value_type>(std::cout, " "));
    std::cout << "|" << std::endl;
  }
}

int main()
{
  int width = 10;
  thrust::device_vector<float> matrix(width * width);

  // fill the matrix with ascending values
  thrust::sequence(matrix.begin(), matrix.end());

  auto f = reduce_functor{matrix.data(), width};

  thrust::device_vector<float> row_sums(width);

  // call the reduce_functor on the indices of the rows of matrix
  // this reduces each row of the matrix to a single float containing the row's sum
  thrust::tabulate(row_sums.begin(), row_sums.end(), f);

  std::cout << "matrix: " << std::endl;
  print_matrix(matrix, width);
  std::cout << std::endl;

  std::cout << "row sums:" << std::endl;
  print_matrix(row_sums, 1);
  std::cout << std::endl;

  return 0;
}

thrust::system::cuda::par stream documentation incorrect

(Revised now that I've found the error)

With a cudaStream_t stream,

  • The documentation says to do thrust::cuda::par(stream).
  • The example says to use thrust::cuda::par.on(stream)

The example code works, documentation (the second code block for that method) does not compile.

The compilation error received:

/path/to/sumTest/main.cu(50): error: no instance of function template "thrust::system::cuda::detail::par_t::operator()" matches the argument list
            argument types are: (cudaStream_t)
            object type is: const thrust::system::cuda::detail::par_t

/path/to/sumTest/main.cu(47): error: no instance of overloaded function "thrust::reduce_by_key" matches the argument list
            argument types are: (<error-type>, thrust::transform_iterator<linear_index_to_row_index<int>, thrust::counting_iterator<int, thrust::use_default, thrust::use_default, thrust::use_default>, thrust::use_default, thrust::use_default>, thrust::transform_iterator<linear_index_to_row_index<int>, thrust::counting_iterator<int, thrust::use_default, thrust::use_default, thrust::use_default>, thrust::use_default, thrust::use_default>, thrust::detail::normal_iterator<thrust::device_ptr<int>>, thrust::detail::normal_iterator<thrust::device_ptr<int>>, thrust::detail::normal_iterator<thrust::device_ptr<int>>, thrust::equal_to<int>, thrust::plus<int>)

I (quickly) looked around the wiki and did not find any examples that have the same issue, the only mentions of thrust::cuda::par are here, but they aren't talking about streams. Anyway, it might be worth somebody else taking a second look 😄
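
For reference, a hedged snippet of the form that compiles (matching the example rather than the prose documentation); thrust::cuda::par.on(stream) attaches the algorithm to the given stream:

#include <thrust/reduce.h>
#include <thrust/system/cuda/execution_policy.h>
#include <cuda_runtime.h>

int sum_on_stream(const int* d_in, int n, cudaStream_t stream)
{
  // thrust::cuda::par(stream) is what the prose documentation showed and
  // fails to compile, as in the error above; .on(stream) is the working form.
  return thrust::reduce(thrust::cuda::par.on(stream), d_in, d_in + n, 0);
}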

Null references used for type dispatch

Here: https://github.com/thrust/thrust/blob/417d78471dadefa3087ff274e64f43ce74acfd3d/thrust/detail/reference.inl#L120 and at some other places in this file and also in its associated .h file, null pointers are created and passed to functions which are known to dereference them.

According to the comments and by looking at how they are ultimately used, it seems that we only care about the type. Is there anything that prevents using metaprogramming instead of passing actual, invalid values?

Zip iterators with copy across GPUs gives error

I have a struct that contains device_vectors of the same size. I want to be able to use thrust copy with zip iterators for two structs that reside on two different GPUs.

namespace T = thrust;
struct Pair_vec
{
    unsigned deviceId;
    T::device_vector<int> a;
    T::device_vector<int> b;

    Pair_vec(const unsigned device,
            const int length,
            const int val1,
            const int val2) : deviceId(device)
    {
        cudaSetDevice(deviceId);
        T::device_vector<int> tA(length, val1);
        T::device_vector<int> tB(length, val2);
        a.swap(tA);
        b.swap(tB);
    }

    T::zip_iterator<T::tuple<
        T::device_vector<int>::iterator,
        T::device_vector<int>::iterator>> begin(void)
        {
            return T::make_zip_iterator(T::make_tuple(
                        a.begin(), b.begin()));
        }

    T::zip_iterator<T::tuple<
        T::device_vector<int>::iterator,
        T::device_vector<int>::iterator>> end(void)
        {
            return T::make_zip_iterator(T::make_tuple(
                        a.end(), b.end()));
        }

    ~Pair_vec(void)
    {
        cudaSetDevice(deviceId);
    }

};

It is being used like this

        unsigned device1 = std::stoi(argv[1]);
        unsigned device2 = std::stoi(argv[2]);
        const unsigned length = 10240;
        Pair_vec v1(device1, length, 10, 11);
        Pair_vec v2(device2, length, 20, 21);

When I try to copy using the begin and end functions of the struct, I get an error.
T::copy(v1.begin(), v1.end(), v2.begin()); gives

terminate called after throwing an instance of 'thrust::system::system_error'
  what():  cudaFree in free: an illegal memory access was encountered
Aborted

This works though

T::copy(v1.a.begin(), v1.a.end(), v2.a.begin());
T::copy(v1.b.begin(), v1.b.end(), v2.b.begin());

I tried to fix this by using cudaSetDevice in the destructor, but that does not work.

  1. Is what I am trying to do even possible? How should I go about it?
  2. Is it worthwhile to do so from a performance perspective? Copying with zip iters works for a single device. Should I be concerned about the performance?

Potential speedup for count_if and suggested reduce_if

The current implementation of count_if uses a transform_reduce where the predicate is used to drive a 0-1 output which is then added up.

The computation could be sped up by discovering (per thread-block and warp) that no element returned 1, thus this warp/thread block does not need to participate in the reduction.

The implementation could also make use of a modified "reduce", called "reduce_if" which would apply a predicate to each element and discard those for which the predicate is false.

Having a reduce_if will have other benefits. It would allow skipping a separate copy_if step required when only some elements should be summed up (say).
See http://stackoverflow.com/questions/23334793/conditional-reduction-in-cuda for a discussion on this particular example.

Here is some real-world code that makes use of a (hand-crafted) conditional reduction: https://github.com/victorprad/InfiniTAM/blob/master/InfiniTAM/ITMLib/Engine/DeviceSpecific/CUDA/ITMDepthTracker_CUDA.cu#L113
In that code, each thread computes a summand (consisting of A[] and b) which is only used when the function constructing it returns "isValidPoint = true". The whole thread-block votes in "shared bool should_prefix" whether it has to participate in the reduction, which is the case only when at least one thread had "isValidPoint == true". Otherwise the threadblock can terminate early.
It might be beneficial to refine this vote a little bit to be computed per-warp, such that some warps can exit early.
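
A hedged sketch of the reduce_if behavior using existing Thrust pieces: transform_reduce maps elements that fail the predicate to the identity of the reduction, so they drop out of the sum in a single pass (no separate copy_if). This does not capture the per-warp/per-block early-exit optimization proposed above; the predicate and function names are illustrative.

#include <thrust/device_vector.h>
#include <thrust/transform_reduce.h>
#include <thrust/functional.h>

struct keep_if_positive {
  // Elements failing the predicate contribute the additive identity (0).
  __host__ __device__ float operator()(float x) const { return x > 0.0f ? x : 0.0f; }
};

float sum_of_positives(const thrust::device_vector<float>& v)
{
  return thrust::transform_reduce(v.begin(), v.end(),
                                  keep_if_positive{}, 0.0f, thrust::plus<float>());
}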

Return value of minmax_element differs from STL's minmax_element

In both STL and thrust a call to minmax_element returns a pair of iterators: the first points to the minimum and the second to the maximum value. In both STL and thrust, if more than one minimum exists, the first candidate is chosen. If more than one maximum value exists, the STL version chooses the last candidate while thrust chooses the first one. While I think the STL is quite inconsistent here, I'd still prefer if thrust behaved just like the STL.

Best wishes,
maribu

`iterator_category_to_system` returns `host_system_tag` for `input_device_iterator_tag` instead of `device_system_tag`

Hello,

I was trying to design my own iterators to make them compatible with thrust. I stumbled across some problems and dug inside the code. Ultimately, it seems that the following statement fails to compile:

static_assert(std::is_same<thrust::detail::iterator_category_to_system<thrust::input_device_iterator_tag>::type, thrust::device_system_tag>::value, "test");

While the following passes:

static_assert(std::is_same<thrust::detail::iterator_category_to_system<thrust::input_device_iterator_tag>::type, thrust::host_system_tag>::value, "test");

That is not really surprising given the code of iterator_category_to_system and the fact that input_device_iterator_tag actually inherits from std::input_iterator_tag, which is just another name for input_host_iterator_tag (see iterator_categories.h).

Am I missing something or there is a bug there?

Note that all classes inheriting from thrust::iterator_adapter seem to be fine. Still trying to figure out why.

Document THRUST_DEBUG_SYNC

The THRUST_DEBUG_SYNC ability to report the SM version of each executed algorithm is very useful in tracking down translation units compiled with mismatching SM versions.

Having formal documentation on this define would be great.

Implementation of load balance algorithms

Hi, I find it very convenient to use load balance interface like in moderngpu:
https://github.com/moderngpu/moderngpu/wiki/Load-balancing-search.
Many of the problems I encounter is given many segments of workload, while each segment has different and irregular granularity of workload, which makes the parallelizing difficult.
With 'transform_lbs' provided by moderngpu, I can pass in a lambda function with the arguments
(int index, int seg, int rank), where index is global position, seg is the segment id, and rank is the offset inside the segment.
I wonder why Thrust does not provide this kind of interface. Or can Thrust actually support it already, just requiring several API calls to implement it?

Speedup of any_of and find_if operations

I work with Thrust shipped with CUDA 9, Linux x64, Volta GPU.

Initially we had the following code to check whether we can find value X in the array:

      auto it = thrust::find_if(p, p + num_rows, is_equal_to(X));
      return it != (p + num_rows);

but its speed was 27 GB/s, i.e. 30x slower than Volta memory speed.

I changed it to the following code, which was only 5x slower than memory speed:

      auto nulls_count = thrust::count(p, p + num_rows, X);
      return (nulls_count > 0);

Finally I tried this code, but it had the same speed as thrust::find_if:

      return thrust::any_of(p, p + num_rows, _1 == X);

Now I'm going to write my own kernel. Hopefully you can improve at least find_if and any_of to be as fast as count, and ideally I'd like to see them all working at full Volta memory speed. The average array size in my tests is about 64 MB (8M elements), so I think it's not a startup-time problem.
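
A hedged sketch of the kind of hand-written early-exit kernel mentioned above, using cuda::atomic_ref from libcudacxx for the shared "found" flag (cuda::atomic_ref postdates the CUDA 9 setup described in the report, so this is illustrative rather than a drop-in fix):

#include <cuda/atomic>

__global__ void contains_kernel(const int* p, int num_rows, int x, int* found)
{
  cuda::atomic_ref<int, cuda::thread_scope_device> flag(*found);
  for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < num_rows;
       i += gridDim.x * blockDim.x) {
    if (flag.load(cuda::memory_order_relaxed)) return;  // another thread already found x
    if (p[i] == x) { flag.store(1, cuda::memory_order_relaxed); return; }
  }
}

// Launch with *found zero-initialized; after synchronization, *found != 0
// means the value was present.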

__shfl_sync instructions may have wrong member mask

When using WarpScanShfl from warp_scan_shfl.cuh inside a while() loop and in conjunction with a sub-warp LOGICAL_WARP_THREADS argument, i.e. LOGICAL_WARP_THREADS=2^n with n<5, I get lots of errors like these with cuda-memcheck --tool synccheck

========= Barrier error detected. Invalid arguments
=========     at 0x000000d0 in __cuda_sm70_shflsync_idx
=========     by thread (17,0,0) in block (204,0,0)
=========     Device Frame:__cuda_sm70_shflsync_idx (__cuda_sm70_shflsync_idx : 0xd0)
=========     Device Frame:/ccsopen/home/glaser/hoomd-blue/hoomd/extern/cub/cub/block/specializations/../../block/../util_ptx.cuh:358:void gpu_compute_nlist_binned_kernel<unsigned char=0, int=1, int=1>(unsigned int*, unsigned int*, double4*, unsigned int*, unsigned int const *, unsigned int const *, double4 const *, unsigned int const *, double const *, unsigned int, unsigned int const *, double4 const *, unsigned int const *, double4 const *, unsigned int const *, Index3D, Index2D, Index2D, BoxDim, double const *, double, unsigned int, double3, unsigned int, unsigned int, unsigned int) (void gpu_compute_nlist_binned_kernel<unsigned char=0, int=1, int=1>(unsigned int*, unsigned int*, double4*, unsigned int*, unsigned int const *, unsigned int const *, double4 const *, unsigned int const *, double const *, unsigned int, unsigned int const *, double4 const *, unsigned int const *, double4 const *, unsigned int const *, Index3D, Index2D, Index2D, BoxDim, double const *, double, unsigned int, double3, unsigned int, unsigned int, unsigned 

I believe the root cause is the following.

WarpScanShfl sets its member_mask for the shfl_sync to reflect the sub-warp membership. However, if some threads exit early, the compiler may predicate off this initialization statement, leaving member_mask in an invalid state. Later, when the PTX assembly instruction shfl.sync.idx.b32 is hit, it is executed without a predicate (such as @p) and thus with a wrong mask. Then cuda-memcheck finds that the executing warp lane executes an implicit syncwarp without its mask bits set, and issues an error, as documented here:
https://docs.nvidia.com/cuda/cuda-memcheck/index.html#synccheck-demo-illegal-syncwarp

The safe solution would be to always use the full mask (0xffffffffu) to synchronize the entire warp. I realize this may not fully take advantage of Volta's independent thread scheduling. However, if that were the goal I think the CUB API would have to expose the member_mask somehow to allow the user to set it, so that it is possible to issue e.g. a ballot_sync outside CUB first, and then pass the member mask to CUB. As discussed here: https://devblogs.nvidia.com/using-cuda-warp-level-primitives/

I will submit a pull request shortly for this solution.

variadic template implementation of thrust::tuple

thrust::tuple does not use variadic templates. A variadic template implementation of thrust::tuple would not just make the internal implementation more elegant, there are serious practical issues with the current thrust::tuple implementation:

  1. Only tuples of size 10 may be used. Nested tuples are often an inconvenient workaround and are not always feasible.
  2. Compilation speed is poor, which could otherwise be significantly improved by using variadic templates. [1]
  3. Compiler error diagnostics are often unreadable due to thrust::null_type everywhere.

In consideration of the fact that C++11 support in nvcc is still in the release candidate, undocumented, and requires host compiler support, I don't expect C++03 support to be dropped anytime soon. Instead, can a user-supplied compile-time macro (e.g. THRUST_USE_CXX11) or auto-detection of __cplusplus >= 201103L please be considered to switch to an alternate implementation?

[1] http://www.jot.fm/issues/issue_2008_02/article2.pdf

Don't call the predicate twice in remove_if

Hello Thrust Team,

While playing around with learning Thrust, I happened to put a printf statement in the predicate-functor passed to thrust::remove_if(), and noticed some curious behavior. When run on host_vectors (ie on the CPU) the printf statement is output once per element, but when run on device vectors, the printf statement is output twice per element.

According to this, Thrust uses a parallel scan to determine output positions for the non-removed elements in a stable way, so it must do something like run the predicate on each element, then create an array of "hit" tags corresponding to each element with hit[i] = 1 if predicate( array[i] ) == false (don't remove) and hit[i] = 0 if predicate( array[i] ) == true (do remove). An exclusive scan on hit then gives the output positions for filtered elements of array...I'm just saying all this to highlight that I can't for the life of me imagine why the predicate functor would need to be called twice per element. What am I missing??

I have attached a minimal working sample, identical to the remove_points2d.cu example except that it has a printf statement in the predicate functor and includes device_vector.h in addition to host_vector.h. If x and y are host_vectors, "In here" gets printed 20 times, but if x and y are device_vectors, it gets printed 40 times.

remove_points_devicevshost.cu.txt

Thanks in advance for your insight,
Michael
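
A hedged sketch (not from the report) of one way to guarantee a single predicate evaluation per element: evaluate the predicate once into a flags array, then use the stencil overload of remove_if with an identity predicate. The flags may still be read more than once internally, but the user-provided predicate runs exactly once per element.

#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/remove.h>
#include <thrust/functional.h>

struct is_negative {
  __host__ __device__ bool operator()(float x) const { return x < 0.0f; }
};

void remove_negatives_once(thrust::device_vector<float>& v)
{
  thrust::device_vector<bool> flags(v.size());
  thrust::transform(v.begin(), v.end(), flags.begin(), is_negative{});  // predicate runs once here
  auto new_end = thrust::remove_if(v.begin(), v.end(), flags.begin(),
                                   thrust::identity<bool>());           // stencil overload
  v.erase(new_end, v.end());
}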

Destructors must not throw exception

Reported by repalov, Aug 3, 2010
Some execution paths in Thrust can throw an exception in a destructor. This leads to a call to terminate() if we are already processing some exception (for example see http://www.gotw.ca/gotw/047.htm). For example, I have the following call stack:


device_vector::~device_vector
vector_base::~vector_base
contiguous_storage::~contiguous_storage
contiguous_storage::deallocate
...

thrust::detail::device::cuda::free

and the last function (free) throws an exception which is uncaught at all levels, including destructors.

Yes, I know that this error means that I have broken the CUDA device. But in any case - a destructor must not throw.
Comment 1 by repalov, Aug 3, 2010
It seems that the solution is near Issue 37 (boost/std system_error).

Comment 2 by project member jaredhoberock, Aug 3, 2010
Here is a dissenting opinion:

http://www.kolpackov.net/projects/c++/eh/dtor-1.xhtml

I don't believe there exists any reasonable alternative at this time: if we didn't allow ~device_vector to throw, how would we propagate the CUDA error to the user? CUDA errors are unrecoverable -- why does it make sense to silently continue program execution?

The dissenting opinion is very bad. The examples are very strange; moreover, some of them are incorrect. The key idea of this paper is: "Even though Abort might be an overkill in some cases, it is the most ethical solution from all that were proposed."
But why is it "most ethical"? Because if the program continues it can damage some data? And the solution from the author of the paper is to terminate the program.
He overlooks the other possibility of damaging data: not by action, but by inaction.

For example, what if MS Word, with a document that has not been saved for the last 4-8 hours, is terminated when you start a spell check, because it can't close some dictionary file in the right way? Will you be disappointed? I think you will be furious.

In MOST cases it is not important to know that destruction was unsuccessful. For example, if I destroy a mutex and the OS call returns an error, it means (among other things) that this object can't be used any more. But I am destroying the mutex, so I won't use it in any case! So the failure in the destructor is not important for me. Freeing memory is the same case: I don't really need to know about errors at "free", because I won't use this memory any more.

If I broke the CUDA subsystem then I will get an error in the future, for example at the next constructor. A constructor can (and must) use exceptions to inform about errors. And that is MORE preferable than terminating my program without any possibility to save user data.

The solution, as I said in comment 1, probably is to use boost/std::system_error (Issue 37). The free function must support a non-throwing form, and the destructor must use that form.

Forwarded from http://code.google.com/p/thrust/issues/detail?id=188

add missing easy algorithms

We're missing these algorithms (which look simple to add):

find_end
find_first_of
adjacent_find
search
search_n
rotate
rotate_copy
lexicographical_compare

Parallel rotate() seems like it demands some temporary memory, though one could be clever and conserve the amount required.

We can build a naive search() with find_if(), but a better one would cache the needle in shared memory and perform Boyer-Moore-esque optimizations [1].

[1] http://en.wikipedia.org/wiki/Boyer%E2%80%93Moore_string_search_algorithm

Forwarded from http://code.google.com/p/thrust/issues/detail?id=422

Explicitly document synchronization requirements in Warp-level APIs

For all warp-based CUB APIs, e.g. WarpScan, the examples given in the documentation do not use __syncwarp to synchronize threads within a warp.

However, it seems that on Volta, threads within a warp are no longer implicitly synchronized, and it seems __syncwarp is required before and after warp-based communication.

So, do I need to use __syncwarp with WarpScan just like I use __syncthreads with BlockScan?

Thanks!
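
For reference, here is a minimal warp-scan kernel in the style of the documented examples, with defensively placed __syncwarp() calls; whether CUB's warp collectives already synchronize internally on Volta is exactly the documentation question being asked here:

#include <cub/cub.cuh>

__global__ void warp_scan_kernel(int* d_out)
{
  // Specialize WarpScan for int across one full warp
  typedef cub::WarpScan<int> WarpScan;
  __shared__ WarpScan::TempStorage temp_storage;

  int thread_data = threadIdx.x;

  // Defensive synchronization around warp-level communication on Volta+
  // (independent thread scheduling); this is the behavior the issue asks
  // the documentation to spell out explicitly.
  __syncwarp();
  WarpScan(temp_storage).InclusiveSum(thread_data, thread_data);
  __syncwarp();

  d_out[threadIdx.x] = thread_data;
}

// launch with a single warp, e.g. warp_scan_kernel<<<1, 32>>>(d_out);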

Consider support for segmented reductions and sorts specified by count-value representation

Consider this API:

// fills an array with runs of keys specified by the count of each run
// count_by_key is the inverse of fill_by_count
template<typename InputIterator1,
         typename InputIterator2,
         typename OutputIterator>
OutputIterator fill_by_count(InputIterator1 counts_first, InputIterator1 counts_last, InputIterator2 values_first, OutputIterator result);

// reduces contiguous runs of keys into their count
// fill_by_count is the inverse of count_by_key
template<typename InputIterator,
         typename OutputIterator,
         typename Compare>
OutputIterator count_by_key(InputIterator first, InputIterator last, OutputIterator result, Compare comp);

// count_by_key may be a bad name because it is not a vectorized version of thrust::count. count_unique? unique_count?

// reduces contiguous runs of elements. each run is specified by its count
// counts come first like keys
template<typename InputIterator1,
         typename InputIterator2,
         typename OutputIterator,
         typename BinaryFunction>
OutputIterator reduce_by_count(InputIterator1 counts_first, InputIterator1 counts_last, InputIterator2 values_first, OutputIterator result, BinaryFunction binary_op);

// reduces contiguous runs of elements. each run's size is n, except for the last partial run
// values come first (cf. fill_n) because the last run could be partial
// XXX should n come first to match the previous?
template<typename Size,
         typename InputIterator,
         typename OutputIterator,
         typename BinaryFunction>
OutputIterator reduce_by_count(InputIterator values_first, InputIterator values_last, Size n, OutputIterator result, BinaryFunction binary_op);

// sorts contiguous runs of elements. each run is specified by its count
template<typename RandomAccessIterator1,
         typename RandomAccessIterator2,
         typename Compare>
void sort_by_count(RandomAccessIterator1 counts_first, RandomAccessIterator1 counts_last, RandomAccessIterator2 values_first, Compare comp);

// sorts contiguous runs of elements. each run's size is n, except for the last partial run
template<typename Size,
         typename RandomAccessIterator,
         typename Compare>
void sort_by_count(RandomAccessIterator first, RandomAccessIterator last, Size n, Compare comp);

Why a count-value structure?

  • The mapping between keys and counts is obvious (there are two functions which are inverses of each other)
  • The important special case of uniform counts is easy for the user to find (it's just an overload)
    and for the backend to target.

Why not a pair of offsets, a range of ranges, or a similar structure?

Because it forces the user to specify redundant data in the common case (contiguous, non-overlapping values). It's easy to mess this up. There's no possibility of this kind of error with a count-value format (unless the user just gets the counts wrong). Indirection could be achieved as normal with a permutation_iterator.

Doesn't the count-value format imply that the implementation of each algorithm first has to perform an exclusive scan?

Under what conditions would an initial scan be expensive compared to the operation being performed?

If we preferred a different representation more convenient for the backend, would that simply force the user to perform the scan himself?
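
To make the scan question concrete, here is a rough sketch (assuming a plain Sum reduction) of how reduce_by_count could be layered on existing primitives today: scan the counts into segment offsets, then hand the offsets to cub::DeviceSegmentedReduce:

#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/scan.h>
#include <cub/device/device_segmented_reduce.cuh>
#include <vector>
#include <iostream>

int main()
{
  thrust::device_vector<int>   counts = std::vector<int>{2, 3, 4};
  thrust::device_vector<float> values = std::vector<float>{1, 1, 2, 2, 2, 3, 3, 3, 3};
  int num_segments = counts.size();

  // the "initial scan": offsets[i] = start of run i, offsets[num_segments] = total
  thrust::device_vector<int> offsets(num_segments + 1, 0);
  thrust::inclusive_scan(counts.begin(), counts.end(), offsets.begin() + 1);

  thrust::device_vector<float> sums(num_segments);

  const float* d_values  = thrust::raw_pointer_cast(values.data());
  const int*   d_offsets = thrust::raw_pointer_cast(offsets.data());
  float*       d_sums    = thrust::raw_pointer_cast(sums.data());

  // standard two-phase CUB invocation: query temp storage, then run
  void* d_temp = nullptr;
  size_t temp_bytes = 0;
  cub::DeviceSegmentedReduce::Sum(d_temp, temp_bytes, d_values, d_sums,
                                  num_segments, d_offsets, d_offsets + 1);
  thrust::device_vector<char> temp(temp_bytes);
  cub::DeviceSegmentedReduce::Sum(thrust::raw_pointer_cast(temp.data()), temp_bytes,
                                  d_values, d_sums,
                                  num_segments, d_offsets, d_offsets + 1);

  thrust::host_vector<float> h = sums;
  std::cout << h[0] << " " << h[1] << " " << h[2] << std::endl;  // 2 6 12
}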

UBSan complains about device_vector::operator[] calls from host

To reproduce:

main.cu:

#include <thrust/device_vector.h>
#include <vector>
#include <iostream>

int main(void) {
  thrust::device_vector<int> D = std::vector<int>{1,2,3};
  std::cout << D[0] << std::endl;
  return 0;
}

Compiled with:
/usr/local/cuda-10.1/bin/nvcc main.cu -Xcompiler=-fsanitize=undefined

causes the following errors at run time:

~/p/t/thrust ❯❯❯ ./a.out
/usr/local/cuda-10.1/bin/../targets/x86_64-linux/include/thrust/detail/reference.inl:105:43: runtime error: reference binding to null pointer of type 'struct execution_policy'                                                                                                          
/usr/local/cuda-10.1/bin/../targets/x86_64-linux/include/thrust/system/detail/generic/select_system.inl:89:36: runtime error: reference binding to null pointer of type 'struct execution_policy_base'                                                                                   
/usr/local/cuda-10.1/bin/../targets/x86_64-linux/include/thrust/detail/execution_policy.h:58:39: runtime error: reference binding to null pointer of type 'struct tag'                                                                                                                   
/usr/local/cuda-10.1/bin/../targets/x86_64-linux/include/thrust/system/detail/generic/select_system.inl:89:43: runtime error: reference binding to null pointer of type 'struct tag'                                                                                                     
/usr/local/cuda-10.1/bin/../targets/x86_64-linux/include/thrust/detail/reference.inl:105:52: runtime error: reference binding to null pointer of type 'const struct tag'                                                                                                                 
/usr/local/cuda-10.1/bin/../targets/x86_64-linux/include/thrust/detail/reference.inl:133:9: runtime error: reference binding to null pointer of type 'struct tag'                                                                                                                        
/usr/local/cuda-10.1/bin/../targets/x86_64-linux/include/thrust/detail/reference.inl:137:38: runtime error: reference binding to null pointer of type 'struct execution_policy_base'                                                                                                     
/usr/local/cuda-10.1/bin/../targets/x86_64-linux/include/thrust/detail/reference.inl:137:17: runtime error: reference binding to null pointer of type 'struct execution_policy'                                                                                                          
/usr/local/cuda-10.1/bin/../targets/x86_64-linux/include/thrust/system/cuda/detail/get_value.h:51:57: runtime error: reference binding to null pointer of type 'struct policy2'                                                                                                          
/usr/local/cuda-10.1/bin/../targets/x86_64-linux/include/thrust/system/cuda/detail/cross_system.h:51:55: runtime error: reference binding to null pointer of type 'struct policy1'                                                                                                       
/usr/local/cuda-10.1/bin/../targets/x86_64-linux/include/thrust/system/cuda/detail/internal/copy_cross_system.h:213:40: runtime error: reference binding to null pointer of type 'struct execution_policy_base'                                                                          
/usr/local/cuda-10.1/bin/../targets/x86_64-linux/include/thrust/system/cuda/detail/internal/copy_cross_system.h:213:27: runtime error: reference binding to null pointer of type 'struct execution_policy'                                                                               
/usr/local/cuda-10.1/bin/../targets/x86_64-linux/include/thrust/system/cuda/detail/internal/copy_cross_system.h:103:33: runtime error: reference binding to null pointer of type 'struct execution_policy_base'                                                                          
/usr/local/cuda-10.1/bin/../targets/x86_64-linux/include/thrust/system/cuda/detail/internal/copy_cross_system.h:103:20: runtime error: reference binding to null pointer of type 'struct execution_policy'                                                                               
/usr/local/cuda-10.1/bin/../targets/x86_64-linux/include/thrust/system/cuda/detail/util.h:53:31: runtime error: reference binding to null pointer of type 'struct execution_policy_base'                                                                                                 
/usr/local/cuda-10.1/bin/../targets/x86_64-linux/include/thrust/system/cuda/detail/util.h:53:18: runtime error: reference binding to null pointer of type 'struct execution_policy'                                                                                                      
1

Nvcc version : V10.1.168
Thrust version : 1.9.5

Consider swallowing errors in deallocation functions

thrust::system::cuda::free potentially throws an exception, which can lead to multiple exceptions in flight when this function is called from a destructor. This makes it impossible to do correct exception handling.

We should either avoid throwing from functions named free, or catch and swallow exceptions in destructors. It is not clear yet what the right thing to do is.

Allow custom tuning policies to be passed into device algorithms.

When I used an iterator as the input to device-reduce, the reduction kernel was limited by the number of registers. The iterator does a few math operations on data in global memory, plus branching. Decreasing the default tuning parameters in dispatch_reduce.cuh resulted in a slight performance improvement, but I saw that it affected the performance of simple reductions. I kept those changes because they improve total performance.
Do you think it would be wise to add an optional parameter specifying a custom tuning policy for every device operation?
What other tricks can be used to improve performance for a pipeline like
read data from global memory -> deterministic logic on it -> CUB operation like reduce?
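
For context, the kind of pipeline being described can look roughly like the sketch below, where a transform_iterator carrying some math and branching is fed straight into cub::DeviceReduce (heavy_transform and the data are made up for illustration):

#include <cub/device/device_reduce.cuh>
#include <thrust/device_vector.h>
#include <thrust/iterator/transform_iterator.h>
#include <cmath>
#include <cstdio>

// hypothetical "deterministic logic": a few math operations plus branching
struct heavy_transform
{
  __host__ __device__ float operator()(float x) const
  {
    return x > 0.f ? sqrtf(x) * 0.5f + 1.f : -x;
  }
};

int main()
{
  thrust::device_vector<float> in(1 << 20, 2.0f);
  thrust::device_vector<float> out(1);

  // the iterator applies heavy_transform on the fly as DeviceReduce loads data
  auto it = thrust::make_transform_iterator(in.begin(), heavy_transform());

  float* d_out = thrust::raw_pointer_cast(out.data());

  // standard two-phase CUB invocation
  void* d_temp = nullptr;
  size_t temp_bytes = 0;
  cub::DeviceReduce::Sum(d_temp, temp_bytes, it, d_out, in.size());
  thrust::device_vector<char> temp(temp_bytes);
  cub::DeviceReduce::Sum(thrust::raw_pointer_cast(temp.data()), temp_bytes,
                         it, d_out, in.size());

  std::printf("sum = %f\n", (float)out[0]);
}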

Consider support for vectorized binary search with sorted needles

When the needles are sorted, we can significantly accelerate the search.

To indicate that the needles are sorted, we can receive an additional Compare parameter:

template<typename ForwardIterator, typename InputIterator, typename OutputIterator, typename Compare1, typename Compare2>
  OutputIterator
    lower_bound(ForwardIterator haystack_first,
                ForwardIterator haystack_last,
                InputIterator needles_first,
                InputIterator needles_last,
                OutputIterator result,
                Compare1 haystack_comp,
                Compare2 needles_comp);

consider adding algorithm for repeating values a specified number of times

e.g. repeat([A,B,C,D],[2,3,0,4]) -> [A,A,B,B,B,D,D,D,D]

fill_by_count may be a better name as this is a generalization of fill_n

template<typename InputIterator1,
         typename InputIterator2,
         typename OutputIterator>
OutputIterator fill_by_count(InputIterator1 counts_first, InputIterator1 counts_last, InputIterator2 values_first, OutputIterator result);

The semantics would be as if it were implemented as

for(; counts_first != counts_last; ++counts_first, ++values_first)
{
  result = fill_n(result, *counts_first, *values_first);
}

return result;

Yes, fill_by_count is better because it connotes the fact that the counts should be non-negative integers. Here's a parallel implementation:
http://code.google.com/p/thrust/source/browse/examples/expand.cu
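
In the same spirit as the expand.cu example, a rough parallel sketch of fill_by_count can be put together from existing primitives (inclusive_scan + vectorized upper_bound + gather); this is illustrative, not the library's implementation:

#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/scan.h>
#include <thrust/binary_search.h>
#include <thrust/gather.h>
#include <thrust/reduce.h>
#include <thrust/iterator/counting_iterator.h>
#include <vector>
#include <iostream>

int main()
{
  // repeat([A,B,C,D], [2,3,0,4]) -> [A,A,B,B,B,D,D,D,D]
  thrust::device_vector<char> values = std::vector<char>{'A', 'B', 'C', 'D'};
  thrust::device_vector<int>  counts = std::vector<int>{2, 3, 0, 4};

  int total = thrust::reduce(counts.begin(), counts.end());

  // ends[i] = one past the last output position belonging to run i
  thrust::device_vector<int> ends(counts.size());
  thrust::inclusive_scan(counts.begin(), counts.end(), ends.begin());

  // for every output position, find the run it falls into
  thrust::device_vector<int> run(total);
  thrust::upper_bound(ends.begin(), ends.end(),
                      thrust::counting_iterator<int>(0),
                      thrust::counting_iterator<int>(total),
                      run.begin());

  // copy each run's value into its output positions
  thrust::device_vector<char> result(total);
  thrust::gather(run.begin(), run.end(), values.begin(), result.begin());

  thrust::host_vector<char> h = result;
  for (char c : h) std::cout << c;   // prints AABBBDDDD
  std::cout << std::endl;
}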

Forwarded from http://code.google.com/p/thrust/issues/detail?id=458

Provide a standard base class to inherit from for defining custom allocators

As a user, oftentimes I want to customize memory allocation for thrust::device_vector or temporary memory allocation such as in thrust::sort.

There are a number of examples of doing this:

  1. https://github.com/thrust/thrust/blob/1.6.0/examples/cuda/custom_temporary_allocation.cu
  2. https://stackoverflow.com/questions/9007343/mix-custom-memory-management-and-thrust-in-cuda
  3. https://github.com/thrust/thrust/blob/master/thrust/mr/allocator.h#L52

However, notice that all 3 examples do things in quite different ways:

  1. Inherit from nothing, define a new class with an allocate and deallocate function.

    • Doesn't define any of the typedefs that Thrust may require from an allocator, e.g. pointer or reference
  2. Inherits from device_malloc_allocator, defines some of the typedefs and presumably inherits the rest of the necessary typedefs from device_malloc_allocator.

    • One potential danger is that rebind will be inherited from the base, which may not be what is wanted.
  3. The most verbose. Inherits from nothing, explicitly defines many typedefs and rebind.

As a user, I'm not sure which approach I should emulate. I would like it if there was a well-defined approach for defining a custom Thrust allocator that better defines what the expected interface is, i.e., what typedefs are needed, what member functions are needed, etc.

Even better would be to provide a base class allocator that I could inherit from that takes care of defining all of the boilerplate typedefs for me and provides the interface for which functions I need to hide. Perhaps device_allocator or device_malloc_allocator already satisfy this desire, in which case it would be nice if this were more concretely documented as the "best practice" for defining a custom Thrust allocator.
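
For what it's worth, a minimal sketch of approach (2) looks like the following: inherit from thrust::device_malloc_allocator and override only allocate/deallocate (the logging is purely illustrative, and the inherited rebind caveat from the list above still applies):

#include <thrust/device_malloc_allocator.h>
#include <thrust/device_vector.h>
#include <cstddef>
#include <cstdio>

template <typename T>
struct logging_allocator : thrust::device_malloc_allocator<T>
{
  typedef thrust::device_malloc_allocator<T> super_t;
  typedef typename super_t::pointer   pointer;
  typedef typename super_t::size_type size_type;

  pointer allocate(size_type n)
  {
    std::printf("allocating %zu elements\n", static_cast<std::size_t>(n));
    return super_t::allocate(n);
  }

  void deallocate(pointer p, size_type n)
  {
    std::printf("deallocating %zu elements\n", static_cast<std::size_t>(n));
    super_t::deallocate(p, n);
  }

  // note: rebind is inherited from device_malloc_allocator, which rebinds to
  // device_malloc_allocator<U> rather than logging_allocator<U> -- exactly the
  // potential danger called out in point (2) above
};

int main()
{
  thrust::device_vector<float, logging_allocator<float> > v(1024);
  v.resize(4096);  // triggers another allocate/deallocate pair
  return 0;
}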

Parameterize `thrust::*_vector` Resizing

We've had a request to change the resizing strategy for thrust::*_vector. Currently, when we need to reallocate, we double the size of storage. Some users are asking for us to switch to a smaller factor, like 1.5.

I doubt that we can change this globally for everyone, so I'm inclined to suggest that we make this parameterizable. We'd do this by making it a property of the allocator that the vector is using; for example, if the allocator has an allocator::recommend_new_size method, we'd call that.
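
To make the proposal concrete, a hypothetical sketch of that allocator hook might look like this (the recommend_new_size name comes from the proposal above; no such customization point exists in Thrust today):

#include <thrust/device_malloc_allocator.h>
#include <cstddef>

template <typename T>
struct growth_tuned_allocator : thrust::device_malloc_allocator<T>
{
  // hypothetical hook: the vector's reallocation path would call this when
  // size exceeds capacity, instead of hard-coding the 2x growth factor
  std::size_t recommend_new_size(std::size_t current_capacity,
                                 std::size_t required_size) const
  {
    std::size_t geometric = current_capacity + current_capacity / 2;  // 1.5x
    return geometric > required_size ? geometric : required_size;
  }
};

// sketch of how the vector's growth logic might consult it:
//   new_capacity = allocator_has_recommend_new_size
//                ? alloc.recommend_new_size(capacity(), required_size)
//                : 2 * capacity();   // current behavior as the fallback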

Some background, from emails:

As part of getting feedback from different DOE projects, I noticed this
on a list from one of the AMReX/LBL developers:

A complaint [about] thrust vector: the growth factor when resizing is 2. This results in a lot of wasted memory. Sometimes the size of the vector is 1GB before the resize; adding one number to the container then ends up with a vector with a capacity of 2GB. Given the limited amount of GPU memory, this is very annoying. We wish the growth factor were adjustable, or smaller, like Facebook's vector.

I have not validated the behavior, nor do I know if they've already said
something to you directly, but just in case, I'm passing this along.

can you re-articulate the issues you've encountered with thrust's vector resizing within AMReX on Volta and perhaps create a small example of how memory usage grows/shrinks in practice?

Finally, are you wanting ...

  • exponential 2x growth to 1GB followed by linear growth (e.g. 0.5GB increments)
  • (slower) exponential (1.4x) growth

The main use of thrust vector in AMReX is for particles. These
vectors are usually very big because of the number of particles and
the number of attributes particles have. For the WarpX code using
AMReX, most of the memory is used by particles. Particles move from
one process to another. So the sizes of particle containers change.
Memory allocation is very expensive for gpu. So AMReX has a number of
memory arenas and they are used by the particle containers. When we
run the WarpX code on Summit, we find that a lot of memory is wasted
because of the memory arena and the way vector resizing works.
Suppose at some point a vector's size and capacity are 1GB and a new
element is pushed to the back. Then suddenly the capacity becomes
2GB, whereas the size is 1+epsilon GB. This gets even worse when we
have multiple vectors and use a memory arena. Suppose we have 4
vectors that all have the capacity and size of 1GB. We add one
element to each of the vectors. For each vector, it is resized and
its capacity becomes 2GB. Then its original 1GB is released back to
the memory arena. There will be 4 chunks of 1GB memory released back
to the arena. If we could not coalesce them into a big chunk, we will
end up with 12GB of memory allocated from the system although the data
only require a tiny bit more than 4GB. We have found that, for CUDA
memory (at least for unified memory), memory blocks allocated from
different cudaMallocs cannot be coalesced even if their addresses are
contiguous. Because of that, we have modified AMReX's arena. Now we
allocate a large chunk of memory say 12GB in one cudaMalloc upfront at
the beginning and put it in our memory arena. This helps a lot. But
it is still the case that the memory used by a vector can suddenly
increase by a factor of 2 when adding only one more element.

As for the two strategies you mentioned, I don't know which way is
better. gcc, llvm, and thrust all simply use 2. Facebook's vector uses
1.5 as the growth rate, and they argue that 2 is a very bad choice.

https://github.com/facebook/folly/blob/master/folly/docs/FBVector.md

I don't know what kind of examples you would like to see. I am cc'ing
Andrew; he is our main developer on particle codes.

RFE: Add device-wide segmented scan primitives

Using CUB device-wide primitives, I wrote an implementation of the algorithm described in "Efficient Projections onto the ℓ1-Ball for Learning in High Dimensions" by John Duchi, Shai Shalev-Shwartz, Yoram Singer, and Tushar Chandra for computing the Euclidean projection of a vector onto the ℓ1-ball of a given radius.

A reference implementation of the algorithm written in MATLAB by John Duchi is available at:
https://web.stanford.edu/~jduchi/projects/DuchiShSiCh08/ProjectOntoL1Ball.m

One of the steps of the algorithm is to compute a cumulative sum (inclusive-sum) of a vector:

sv = cumsum(u);

If there is only one vector, this can be accomplished using cub::DeviceScan::InclusiveSum(). For multiple vectors, it would be nice to have available a device-wide segmented inclusive sum operation.
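
For reference, the single-vector case uses the usual two-phase cub::DeviceScan invocation shown below; a device-wide segmented variant of this is what the request is for:

#include <cub/device/device_scan.cuh>
#include <cuda_runtime.h>
#include <cstdio>

int main()
{
  const int n = 8;
  float h_u[n] = {1, 2, 3, 4, 5, 6, 7, 8};

  float *d_u, *d_sv;
  cudaMalloc(&d_u,  n * sizeof(float));
  cudaMalloc(&d_sv, n * sizeof(float));
  cudaMemcpy(d_u, h_u, n * sizeof(float), cudaMemcpyHostToDevice);

  // phase 1: query temporary storage requirements
  void* d_temp = nullptr;
  size_t temp_bytes = 0;
  cub::DeviceScan::InclusiveSum(d_temp, temp_bytes, d_u, d_sv, n);
  cudaMalloc(&d_temp, temp_bytes);

  // phase 2: sv = cumsum(u)
  cub::DeviceScan::InclusiveSum(d_temp, temp_bytes, d_u, d_sv, n);

  float h_sv[n];
  cudaMemcpy(h_sv, d_sv, n * sizeof(float), cudaMemcpyDeviceToHost);
  std::printf("sv[%d] = %g\n", n - 1, h_sv[n - 1]);  // 36

  cudaFree(d_temp);
  cudaFree(d_sv);
  cudaFree(d_u);
  return 0;
}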

Consider complementing the set operations with relational algebra algorithms

For example, natural join:

template<typename InputIterator1, typename InputIterator2, typename OutputIterator>
OutputIterator set_natural_join(InputIterator1 first1, InputIterator1 last1,
                                InputIterator2 first2, InputIterator2 last2,
                                OutputIterator result);

template<typename InputIterator1, typename InputIterator2, typename OutputIterator, typename Compare>
OutputIterator set_natural_join(InputIterator1 first1, InputIterator1 last1,
                                InputIterator2 first2, InputIterator2 last2,
                                OutputIterator result,
                                Compare comp);

template<typename InputIterator1, typename InputIterator2, typename InputIterator3, typename InputIterator4, typename OutputIterator1, typename OutputIterator2, typename OutputIterator3>
tuple<OutputIterator1, OutputIterator2, OutputIterator3>
  set_natural_join_by_key(InputIterator1 keys_first1, InputIterator1 keys_last1, InputIterator2 values_first1,
                          InputIterator3 keys_first2, InputIterator3 keys_last2, InputIterator4 values_first2,
                          OutputIterator1 keys_result, OutputIterator2 values_result1, OutputIterator3 values_result2);

template<typename InputIterator1, typename InputIterator2, typename InputIterator3, typename InputIterator4, typename OutputIterator1, typename OutputIterator2, typename OutputIterator3, typename Compare>
tuple<OutputIterator1, OutputIterator2, OutputIterator3>
  set_natural_join_by_key(InputIterator1 keys_first1, InputIterator1 keys_last1, InputIterator2 values_first1,
                          InputIterator3 keys_first2, InputIterator3 keys_last2, InputIterator4 values_first2,
                          OutputIterator1 keys_result, OutputIterator2 values_result1, OutputIterator3 values_result2,
                          Compare comp);

With these semantics:

template<typename InputIterator1, typename InputIterator2, typename InputIterator3, typename InputIterator4, typename OutputIterator1, typename OutputIterator2, typename OutputIterator3>
tuple<OutputIterator1, OutputIterator2, OutputIterator3>
  set_natural_join_by_key(InputIterator1 keys_first1, InputIterator1 keys_last1, InputIterator2 values_first1,
                          InputIterator3 keys_first2, InputIterator3 keys_last2, InputIterator4 values_first2,
                          OutputIterator1 keys_result, OutputIterator2 values_result1, OutputIterator3 values_result2)
{
  InputIterator3 keys2_restart = keys_first2;
  InputIterator4 values2_restart = values_first2;

  for(; keys_first1 != keys_last1; ++keys_first1, ++values_first1)
  {
    keys_first2 = keys2_restart;
    values_first2 = values2_restart;
    for(; keys_first2 != keys_last2; ++keys_first2, ++values_first2)
    {
      if(*keys_first1 < *keys_first2)
      {
        break;
      }
      else if(*keys_first2 < *keys_first1)
      {
        keys2_restart = keys_first2;
        ++keys2_restart;

        values2_restart = values_first2;
        ++values2_restart;
      }
      else
      {
        *keys_result = *keys_first1;
        *values_result1 = *values_first1;
        *values_result2 = *values_first2;

        ++keys_result;
        ++values_result1;
        ++values_result2;
      }
    }
  }

  return make_tuple(keys_result, values_result1, values_result2);
}

Something like set_natural_join might make implementing an algorithm like outer_product practical.

generalize get_iterator_value() to respect execution policy

Unless the iterator is a raw pointer, get_iterator_value [1] will not respect the execution policy when dereferencing via *it. A naïve replacement of *it with

typename thrust::iterator_traits<Iterator>::value_type value;
thrust::copy(exec, it, it + 1, &value);
return value; 

makes the reproducer in NVIDIA/thrust#780 die with:

terminate called after throwing an instance of 'thrust::system::system_error'
  what():  an illegal memory access was encountered
Aborted (core dumped)

[1] https://github.com/thrust/thrust/blob/master/thrust/detail/get_iterator_value.h#L29
