nvidia / thrust

[ARCHIVED] The C++ parallel algorithms library. See https://github.com/NVIDIA/cccl

License: Other

Python 1.50% Cuda 23.04% C++ 68.59% C 3.70% CMake 2.11% Makefile 0.28% Shell 0.48% Perl 0.30%

cuda nvidia-hpc-sdk thrust gpu cpp cpp11 cpp14 cpp17 cpp20 cxx

thrust's Introduction

⚠️ The Thrust repository has been archived and is now part of the unified nvidia/cccl repository. See the announcement here for more information. Please visit the new repository for the latest updates. ⚠️

Thrust: The C++ Parallel Algorithms Library


Thrust is the C++ parallel algorithms library which inspired the introduction of parallel algorithms to the C++ Standard Library. Thrust's high-level interface greatly enhances programmer productivity while enabling performance portability between GPUs and multicore CPUs. It builds on top of established parallel programming frameworks (such as CUDA, TBB, and OpenMP). It also provides a number of general-purpose facilities similar to those found in the C++ Standard Library.

Thrust is an open source project; it is available on GitHub and included in the NVIDIA HPC SDK and CUDA Toolkit. If you have one of those SDKs installed, no additional installation or compiler flags are needed to use Thrust.

Examples

Thrust is best learned through examples.

The following example generates random numbers serially and then transfers them to a parallel device where they are sorted.

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/generate.h>
#include <thrust/sort.h>
#include <thrust/copy.h>
#include <thrust/random.h>

int main() {
  // Generate 32M random numbers serially.
  thrust::default_random_engine rng(1337);
  thrust::uniform_int_distribution<int> dist;
  thrust::host_vector<int> h_vec(32 << 20);
  thrust::generate(h_vec.begin(), h_vec.end(), [&] { return dist(rng); });

  // Transfer data to the device.
  thrust::device_vector<int> d_vec = h_vec;

  // Sort data on the device.
  thrust::sort(d_vec.begin(), d_vec.end());

  // Transfer data back to host.
  thrust::copy(d_vec.begin(), d_vec.end(), h_vec.begin());
}

See it on Godbolt

This example demonstrates computing the sum of some random numbers in parallel:

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/generate.h>
#include <thrust/reduce.h>
#include <thrust/functional.h>
#include <thrust/random.h>

int main() {
  // Generate random data serially.
  thrust::default_random_engine rng(1337);
  thrust::uniform_real_distribution<double> dist(-50.0, 50.0);
  thrust::host_vector<double> h_vec(32 << 20);
  thrust::generate(h_vec.begin(), h_vec.end(), [&] { return dist(rng); });

  // Transfer to device and compute the sum.
  thrust::device_vector<double> d_vec = h_vec;
  // Use a double initial value and a double functor so the sum is not truncated to int.
  double x = thrust::reduce(d_vec.begin(), d_vec.end(), 0.0, thrust::plus<double>());
}

See it on Godbolt

This example shows how to perform such a reduction asynchronously:

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/generate.h>
#include <thrust/async/copy.h>
#include <thrust/async/reduce.h>
#include <thrust/functional.h>
#include <thrust/random.h>
#include <numeric>

int main() {
  // Generate 32M random numbers serially.
  thrust::default_random_engine rng(123456);
  thrust::uniform_real_distribution<double> dist(-50.0, 50.0);
  thrust::host_vector<double> h_vec(32 << 20);
  thrust::generate(h_vec.begin(), h_vec.end(), [&] { return dist(rng); });

  // Asynchronously transfer to the device.
  thrust::device_vector<double> d_vec(h_vec.size());
  thrust::device_event e = thrust::async::copy(h_vec.begin(), h_vec.end(),
                                               d_vec.begin());

  // After the transfer completes, asynchronously compute the sum on the device.
  thrust::device_future<double> f0 = thrust::async::reduce(thrust::device.after(e),
                                                           d_vec.begin(), d_vec.end(),
                                                           0.0, thrust::plus<double>());

  // While the sum is being computed on the device, compute the sum serially on
  // the host.
  double f1 = std::accumulate(h_vec.begin(), h_vec.end(), 0.0, thrust::plus<double>());
}

See it on Godbolt

Getting The Thrust Source Code

Thrust is a header-only library; there is no need to build or install the project unless you want to run the Thrust unit tests.

The CUDA Toolkit provides a recent release of the Thrust source code in include/thrust. This will be suitable for most users.

Users who wish to contribute to Thrust or try out newer features should recursively clone the Thrust GitHub repository:

git clone --recursive https://github.com/NVIDIA/thrust.git

Using Thrust From Your Project

For CMake-based projects, we provide a CMake package for use with find_package. See the CMake README for more information. Thrust can also be added via add_subdirectory or tools like the CMake Package Manager.

For non-CMake projects, compile with:

- The Thrust include path (-I<thrust repo root>)
- The libcu++ include path (-I<thrust repo root>/dependencies/libcudacxx/)
- The CUB include path, if using the CUDA device system (-I<thrust repo root>/dependencies/cub/)

By default, the CPP host system and CUDA device system are used. These can be changed using compiler definitions:
- -DTHRUST_HOST_SYSTEM=THRUST_HOST_SYSTEM_XXX, where XXX is CPP (serial, default), OMP (OpenMP), or TBB (Intel TBB)
- -DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_SYSTEM_XXX, where XXX is CPP, OMP, TBB, or CUDA (default).

Developing Thrust

Thrust uses the CMake build system to build unit tests, examples, and header tests. To build Thrust as a developer, it is recommended that you use our containerized development system:

# Clone Thrust and CUB repos recursively:
git clone --recursive https://github.com/NVIDIA/thrust.git
cd thrust

# Build and run tests and examples:
ci/local/build.bash

That does the equivalent of the following, but in a clean containerized environment which has all dependencies installed:

# Clone Thrust and CUB repos recursively:
git clone --recursive https://github.com/NVIDIA/thrust.git
cd thrust

# Create build directory:
mkdir build
cd build

# Configure -- use one of the following:
cmake ..   # Command line interface.
ccmake ..  # ncurses GUI (Linux only).
cmake-gui  # Graphical UI, set source/build directories in the app.

# Build:
cmake --build . -j ${NUM_JOBS} # Invokes make (or ninja, etc).

# Run tests and examples:
ctest

By default, a serial CPP host system, a CUDA-accelerated device system, and the C++14 standard are used. These defaults can be changed in CMake and via flags to ci/local/build.bash.

More information on configuring your Thrust build and creating a pull request can be found in the contributing section.

Licensing

Thrust is an open source project developed on GitHub. Thrust is distributed under the Apache License v2.0 with LLVM Exceptions; some parts are distributed under the Apache License v2.0 and the Boost License v1.0.


thrust's Issues

Quick Start Guide is too CUDA-specific

The Quick Start Guide implies that Thrust requires nvcc and a GPU. We should make it less CUDA-specific so that it is clear Thrust is multiplatform.

consider removing device-to-device copy workaround for absence of nvcc

IIRC we decided to eliminate this workaround (WAR) because it violates the one-definition rule and there's no way to ensure we get the "good" version when both are present.

[1] http://code.google.com/p/thrust/source/browse/thrust/system/cuda/detail/copy_device_to_device.inl#41

I don't think we should modify that code unless we get a bug report

I'll take a look at this later. I can't say what the right approach is, but I'm concerned about the silent (and hard to diagnose) performance hazard that this introduces.

Forwarded from http://code.google.com/p/thrust/issues/detail?id=449

Add missing vector member functions

Remaining work is mostly the c++11 additions.

Forwarded from http://code.google.com/p/thrust/issues/detail?id=4

Consider support for stateful dispatch

Thrust does not introduce (much) state beyond the implicit state of each underlying backend system. However, it is occasionally desirable to perform algorithm dispatch in a stateful manner. For example, a user may desire to associate an algorithm's launch with a particular CUDA stream, or to associate a computation with a particular processor, or to associate an algorithm's temporary storage requirements with a particular segment of memory. The current method of (stateless) tag dispatch cannot support any of these use cases in general.

Thrust should generalize tag dispatch to provide stateful forms of invocation for each algorithm, e.g.

template<typename State, typename Iterator, typename Function>
void algo(State &state, Iterator first, Iterator last, Function f);

template<typename Iterator, typename Function>
void algo(Iterator first, Iterator last, Function f);

The obvious way to introduce stateful tags would be to provide this sort of interface:

thrust::algo(Tag t, Iterator first, Iterator last);

If you want to inject state, you'd just do:

vector vec; 
state s;
thrust::algo(s, vec.begin(), vec.end());
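
As a rough illustration of the two forms above, here is a minimal serial sketch in plain C++ (no Thrust involved). The launch_log type and the loop body are assumptions made up for the example; they only show how a state object passed by reference could travel alongside an algorithm call while a stateless overload stays available.

```cpp
#include <cstddef>

// Hypothetical state object: records what each dispatch did.
struct launch_log {
  std::size_t launches = 0;  // number of algorithm invocations
  std::size_t elements = 0;  // number of elements visited in total
};

// Stateful form: the caller's state is taken by reference and updated.
template <typename State, typename Iterator, typename Function>
void algo(State& state, Iterator first, Iterator last, Function f) {
  ++state.launches;
  for (; first != last; ++first) {
    f(*first);
    ++state.elements;
  }
}

// Stateless form: forwards to the stateful form with a throwaway state.
template <typename Iterator, typename Function>
void algo(Iterator first, Iterator last, Function f) {
  launch_log scratch;
  algo(scratch, first, last, f);
}
```

The two overloads differ in arity, so a call with three arguments always selects the stateless form and a call with four arguments always selects the stateful one.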

We had previously nixed allowing the user to explicitly provide a tag, because it's not clear what this should do:

cuda::vector vec;
thrust::algo(omp::tag(), vec.begin(), vec.end());  // should this be a compile time error?

nor what this should do:

cuda::vector cuda_vec;
omp::vector omp_vec;
thrust::algo2(omp::tag(), cuda_vec.begin(), cuda_vec.end(), omp_vec.begin()); // should this be a compile-time error?

This tag-explicit form raises issues:

- If the iterator arguments' tags can't be reconciled with the provided tag, should this be a compile-time error?
- Should the iterators be retagged before dispatch? If algo2 lowers to another Thrust algorithm, then it will be dispatched using the iterators' tags, not what was provided by the user.

Perhaps all lowering should be performed using the explicit tag form? There'd be no other way to communicate the state, because select_system doesn't know anything about it.

OTOH, this interface is attractive because it makes selecting a backend simpler than using retag on all the iterator arguments:

This:

thrust::algo(thrust::cuda::tag(), vec.begin(), vec.end());

Is nicer than:

thrust::algo(thrust::retag<thrust::cuda::tag>(vec.begin()), thrust::retag<thrust::cuda::tag>(vec.end()));

A tag-explicit form also makes it easy to communicate precisely what the implicit API does:

namespace thrust
{

template<typename Iterator>
  void algo(Iterator first, Iterator second)
{
  typename thrust::iterator_system<Iterator>::type system;
  thrust::algo(select_system(system), first, second);
}

}

The tag-explicit form would do the ADL part:

namespace thrust
{

template<typename Tag, typename Iterator>
  void algo(Tag tag, Iterator first, Iterator last)
{
  // dispatch via adl
  algo(tag, first, last);
}

}

We'd probably have to change all Tag parameters to be references, which would create a lot of new entry points. Rvalue reference would probably solve that problem.

Alternatively, we could pass by value and ask the user to use something like reference_wrapper.
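
The pass-by-value alternative can be sketched in plain C++ as follows. This is only an illustration under assumptions: visit_counter and the unwrapped helper are hypothetical names, and the algorithm body is a trivial serial loop rather than anything Thrust does.

```cpp
#include <cstddef>
#include <functional>

// Hypothetical state: counts the elements an algorithm visited.
struct visit_counter {
  std::size_t visited = 0;
};

// Helper trait: yields the underlying object whether or not it is wrapped.
template <typename T>
struct unwrapped {
  static T& get(T& t) { return t; }
};

template <typename T>
struct unwrapped<std::reference_wrapper<T>> {
  static T& get(std::reference_wrapper<T>& w) { return w.get(); }
};

// Single by-value entry point. Callers who want their state updated
// pass std::ref(state); otherwise the algorithm works on a private copy.
template <typename State, typename Iterator>
void algo(State state, Iterator first, Iterator last) {
  auto& s = unwrapped<State>::get(state);
  for (; first != last; ++first) {
    ++s.visited;
  }
}
```

With this shape only one entry point per algorithm is needed. The cost is that a caller who forgets std::ref silently updates a copy, which may or may not be acceptable.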

implement sparse_vector example with merge_by_key

Currently uses sort_by_key, which doesn't exploit the sorted nature of the input sequences.

Forwarded from http://code.google.com/p/thrust/issues/detail?id=476

Specialize stable_partition_copy using a variant of the current specialization of copy_if

copy_if could be implemented with stable_partition_copy + null_iterator
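
The relationship can be sketched with the standard library, using std::partition_copy as a stand-in for stable_partition_copy. The null_output_iterator below is a hypothetical discarding iterator written for this example, not a Thrust type.

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>

// Hypothetical stand-in for null_iterator: an output iterator that
// accepts and discards every value written through it.
struct null_output_iterator {
  using iterator_category = std::output_iterator_tag;
  using value_type = void;
  using difference_type = std::ptrdiff_t;
  using pointer = void;
  using reference = void;

  template <typename T>
  null_output_iterator& operator=(const T&) { return *this; }
  null_output_iterator& operator*() { return *this; }
  null_output_iterator& operator++() { return *this; }
  null_output_iterator operator++(int) { return *this; }
};

// copy_if expressed as a partition copy whose "false" output is discarded.
template <typename InputIterator, typename OutputIterator, typename Predicate>
OutputIterator copy_if_via_partition_copy(InputIterator first, InputIterator last,
                                          OutputIterator result, Predicate pred) {
  return std::partition_copy(first, last, result, null_output_iterator(), pred).first;
}
```

Elements that satisfy the predicate are written to result in their original order; everything else goes to the discarding iterator.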

Forwarded from http://code.google.com/p/thrust/issues/detail?id=210

Document more subtleties

Here's a non-exhaustive list:

- zip_iterator equality
- the current implementations of some algorithms require data types to have default constructors and trivial copy constructors
- host/cpp algorithms have parallel semantics, but the current implementation is serial
- presence of thrust::minimum and thrust::maximum
- generalization of some algorithms to accept stencil sequences
- presence of stable_partition_copy
- interpretation of get_temporary_buffer's result
- what thrust::detail means
- what an experimental namespace implies

Forwarded from http://code.google.com/p/thrust/issues/detail?id=459

Thrust random headers do not pass trivial tests

xor_combine_engine_max.h does not properly #include <cstddef>:

./thrust/random/detail/xor_combine_engine_max.h:275:28: error: 'size_t' has not been declared
./thrust/random/detail/xor_combine_engine_max.h:275:57: error: 'size_t' has not been declared
./thrust/random/detail/xor_combine_engine_max.h:278:16: error: 'size_t' does not name a type
./thrust/random/detail/xor_combine_engine_max.h:284:37: error: 'w' was not declared in this scope
./thrust/random/detail/xor_combine_engine_max.h:284:41: error: template argument 2 is invalid
./thrust/random/detail/xor_combine_engine_max.h:285:5: error: template argument 3 is invalid
./thrust/random/detail/xor_combine_engine_max.h:291:37: error: 'w' was not declared in this scope
./thrust/random/detail/xor_combine_engine_max.h:291:41: error: template argument 2 is invalid
./thrust/random/detail/xor_combine_engine_max.h:292:5: error: template argument 3 is invalid
./thrust/random/detail/xor_combine_engine_max.h:308:7: error: 'w' was not declared in this scope

add stencil argument to partition algorithms

The partition, stable_partition, partition_copy, and stable_partition_copy algorithms would benefit from the addition of a stencil input sequence.
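
A serial sketch of what a stencil overload might look like for partition_copy, written in plain C++. The name partition_copy_with_stencil and the parameter order are assumptions for illustration; the predicate is applied to the stencil element while the corresponding value is what gets copied.

```cpp
#include <utility>

// Hypothetical stencil variant: pred is evaluated on *stencil, and the
// matching element of [first, last) is routed to out_true or out_false.
template <typename InputIterator, typename StencilIterator,
          typename OutputIterator1, typename OutputIterator2,
          typename Predicate>
std::pair<OutputIterator1, OutputIterator2>
partition_copy_with_stencil(InputIterator first, InputIterator last,
                            StencilIterator stencil,
                            OutputIterator1 out_true, OutputIterator2 out_false,
                            Predicate pred) {
  for (; first != last; ++first, ++stencil) {
    if (pred(*stencil)) {
      *out_true = *first;
      ++out_true;
    } else {
      *out_false = *first;
      ++out_false;
    }
  }
  return std::make_pair(out_true, out_false);
}
```

Because the loop visits elements in order, both outputs keep the relative order of the input, which is the behavior a stable variant would need.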

Forwarded from http://code.google.com/p/thrust/issues/detail?id=379

unit tester should test on all devices by default

Right now, the unit tester defaults to device ID 0. In a multi-GPU system, it might be better to default to device ID -1, which causes it to run tests on all GPUs in the system. This makes multi-GPU testing in Jenkins a lot easier since we don't have to launder a device ID command line variable through SCons.

Make fallback_allocator example use the allocator in the device_vector

Should use thrust::allocator (when it exists) to implement fallback_allocator

Hard to do now without making fallback_allocator verbose. In particular, fallback_allocator should return a tagged pointer instead of a raw pointer

Forwarded from http://code.google.com/p/thrust/issues/detail?id=457

integrate managed-layout clone

See here [1]

[1] http://code.google.com/r/jaredhoberock-managed-layout/

Forwarded from http://code.google.com/p/thrust/issues/detail?id=410

-Wunused-but-set-variable in vector_base.inl

./thrust/detail/vector_base.inl:918:14: warning: variable 'new_end' set but not used [-Wunused-but-set-variable]

Tag dispatch bug (customizing transform)

#include <thrust/iterator/iterator_traits.h>
#include <thrust/device_vector.h>
#include <thrust/for_each.h>
#include <thrust/transform.h>
#include <thrust/functional.h>
#include <iostream>

struct my_tag : thrust::device_system_tag {};

template<typename InputIterator,
         typename OutputIterator,
         typename UnaryFunction>
OutputIterator transform(my_tag, 
                         InputIterator first,
                         InputIterator last,
                         OutputIterator result,
                         UnaryFunction f)
{
  // output a message
  std::cout << "Hello, world from transform(my_tag)!" << std::endl;

  return thrust::transform(thrust::retag<thrust::device_system_tag>(first),
                           thrust::retag<thrust::device_system_tag>(last),
                           thrust::retag<thrust::device_system_tag>(result),
                           f);
}

int main()
{
  thrust::device_vector<int> vec(1);

  thrust::transform(thrust::retag<my_tag>(vec.begin()),
                    thrust::retag<my_tag>(vec.end()),
                    thrust::retag<my_tag>(vec.begin()),
                    thrust::identity<int>());

  return 0;
}

Jared suggested a workaround: adding a select_system overload for my_tag solves the problem:

my_tag select_system(my_tag, my_tag) {
  return my_tag();
}

He also suggested that the real problem is that the thrust::cuda::detail::select_system() overload gets called and erases the derived type of my_tag. Perhaps these select_system overloads need to be templatized?

template<typename Tag> Tag select_system(Tag,Tag) { return Tag(); }
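
The difference between the two styles of overload can be checked in isolation. The sketch below uses local stand-in tag types (not Thrust's) and two hypothetical namespaces, erasing and preserving, to show how a non-template overload on the base tag loses the derived type while a templated one keeps it.

```cpp
#include <type_traits>

// Local stand-ins for the tags in the report; these are not Thrust's types.
struct device_system_tag {};
struct my_tag : device_system_tag {};

namespace erasing {
// Non-template overload on the base tag: a derived tag converts to the
// base, so the returned type no longer mentions my_tag.
inline device_system_tag select_system(device_system_tag, device_system_tag) {
  return device_system_tag();
}
}  // namespace erasing

namespace preserving {
// Templated overload: the deduced tag type is returned unchanged.
template <typename Tag>
Tag select_system(Tag, Tag) {
  return Tag();
}
}  // namespace preserving

static_assert(std::is_same<decltype(erasing::select_system(my_tag(), my_tag())),
                           device_system_tag>::value,
              "the base-tag overload erases the derived tag");
static_assert(std::is_same<decltype(preserving::select_system(my_tag(), my_tag())),
                           my_tag>::value,
              "the templated overload preserves the derived tag");
```

Once the derived type is erased, any later dispatch on the result can only find overloads written for the base tag, which would explain why a customization for the derived tag is skipped.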

long long is not supported by ISO C++03

Reported here: https://groups.google.com/forum/#!msg/thrust-users/jSujgymfbwM/MckpqfhvcFsJ

Can result in errors like:

/usr/local/cuda/include/thrust/detail/type_traits.h:464: error: ISO C++ does not support ‘long long’

add direct testing of CUDA block algorithms

The current block algorithms are listed below and organized into categories based on inputs and outputs.

X->X (inplace)
inclusive_scan
inclusive_scan_n
inclusive_scan_by_flag
inclusive_scan_by_flag_n
reduce

X,Y->X,Y (inplace)
merging_sort

X->Y
copy

X,Y->Z
merge

X,Y,Z->W
set_difference
set_intersection
set_symmetric_difference
set_union

Forwarded from http://code.google.com/p/thrust/issues/detail?id=406

Misuse of device_vector's copy constructor results in unhelpful error message

The copy constructor of a device vector with different types does not work:

thrust::device_vector<float> t1(10);
thrust::device_vector<double> t2(t1);

It does work for host vectors. I am using Thrust version from CUDA 4.0. Thanks for a great library.

Forwarded from http://code.google.com/p/thrust/issues/detail?id=409

Functor not declared __host__ __device__ in remove_points2d.cu

is_outside_circle::operator() needs to be annotated as __host__ __device__. It's not annotated right now, and is producing warnings.

c++11 requires vector_base to construct elements using its allocator

We do it in an ad hoc fashion, but we're required to use the allocator according to S. 23.2.1 subpoint 3:

For the components affected by this subclause that declare an allocator_type, objects stored in these
components shall be constructed using the allocator_traits<allocator_type>::construct function and
destroyed using the allocator_traits<allocator_type>::destroy function (20.6.8.2). These functions
are called only for the container’s element type, not for internal types used by the container. [ Note: This
means, for example, that a node-based container might need to construct nodes containing aligned buffers
and call construct to place the element into the buffer. —end note ]

This is actually a good thing for us, because it means that the user should be able to create an uninitialized_allocator, if desired.

See this post [1] for why it's worthwhile to do this sort of thing.

[1] http://stackoverflow.com/questions/7218574/avoiding-default-construction-of-elements-in-standard-containers
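
To make the idea concrete, here is a sketch of such an allocator against std::vector, which already routes element construction through allocator_traits. Everything here is an assumption for illustration: the uninitialized_allocator below is not a Thrust type, the skipped_constructions counter exists only so the effect can be observed, and skipping construction is only reasonable for trivial element types.

```cpp
#include <cstddef>
#include <memory>
#include <new>
#include <utility>
#include <vector>

// Illustration only: counts how often default construction was skipped.
inline std::size_t& skipped_constructions() {
  static std::size_t count = 0;
  return count;
}

// Hypothetical allocator that leaves elements uninitialized when the
// container requests default construction. Intended for trivial types.
template <typename T>
struct uninitialized_allocator {
  using value_type = T;

  uninitialized_allocator() = default;
  template <typename U>
  uninitialized_allocator(const uninitialized_allocator<U>&) noexcept {}

  T* allocate(std::size_t n) { return std::allocator<T>().allocate(n); }
  void deallocate(T* p, std::size_t n) noexcept {
    std::allocator<T>().deallocate(p, n);
  }

  // Default construction requested by the container: deliberately a no-op.
  template <typename U>
  void construct(U*) noexcept { ++skipped_constructions(); }

  // Construction from arguments: placement new as usual.
  template <typename U, typename Arg, typename... Args>
  void construct(U* p, Arg&& arg, Args&&... args) {
    ::new (static_cast<void*>(p))
        U(std::forward<Arg>(arg), std::forward<Args>(args)...);
  }
};

template <typename T, typename U>
bool operator==(const uninitialized_allocator<T>&,
                const uninitialized_allocator<U>&) noexcept { return true; }

template <typename T, typename U>
bool operator!=(const uninitialized_allocator<T>&,
                const uninitialized_allocator<U>&) noexcept { return false; }

template <typename T>
using uninitialized_vector = std::vector<T, uninitialized_allocator<T>>;
```

Constructing an uninitialized_vector<int> of a given size then allocates storage without writing to it. Elements should be assigned before they are read.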

reference needs the full complement of operator overloads

For example,

reference<int> == int

is an error. This is annoying.

Forwarded from http://code.google.com/p/thrust/issues/detail?id=450

Update error codes in system/cuda/error.h to reflect those found in cuda/include/driver_types.h

It looks like we're missing quite a few from the last couple of versions.

Eliminate comma operator for placeholders

It causes problems in expressions like

thrust::transform_reduce(v.begin(), v.end(), _1 * _1, 0.0f, _1 + _2);

I don't know how to disambiguate cases like these and I can't think of any reason to keep it around.

Algorithms which assume sorted input should have _by_key variants

In practice, this would mean adding _by_key set algorithm variants.

This has value beyond Thrust; note how awkward [1] taking the difference of two std::sets is.

[1] http://stackoverflow.com/questions/7706602/how-to-subtract-one-list-of-map-keys-from-another-and-get-new-map-map-a-mab-b/7706740#7706740

How should the key and value sequences be ordered?

Option 1: [first1, last1) and [first2, last2) are the left and right key sequences

template <typename InputIterator1,
typename InputIterator2,
typename InputIterator3,
typename InputIterator4,
typename OutputIterator1,
typename OutputIterator2,
typename StrictWeakOrdering>
thrust::pair<OutputIterator1,OutputIterator2>
merge_by_key(InputIterator1 first1, InputIterator1 last1,
InputIterator2 first2, InputIterator2 last2,
InputIterator3 first3,
InputIterator4 first4,
OutputIterator1 output1,
OutputIterator2 output2,
StrictWeakOrdering comp);

Option 2: [first1, last1) and [first3, last3) are the left and right key sequences

template <typename InputIterator1,
typename InputIterator2,
typename InputIterator3,
typename InputIterator4,
typename OutputIterator1,
typename OutputIterator2,
typename StrictWeakOrdering>
thrust::pair<OutputIterator1,OutputIterator2>
merge_by_key(InputIterator1 first1, InputIterator1 last1,
InputIterator2 first2,
InputIterator3 first3, InputIterator3 last3,
InputIterator4 first4,
OutputIterator1 output1,
OutputIterator2 output2,
StrictWeakOrdering comp);

AFAICT there's no prior art within Thrust (in the public interface). Internally we used Option 1 for merge_by_key, but I didn't give it much consideration back then.

Forwarded from http://code.google.com/p/thrust/issues/detail?id=393

temporary_array needs to invoke T's constructor when it is not trivial

temporary_array<std::string, thrust::cpp::tag> probably crashes right now

remove Thrust 1.6.0 deprecations

these aliases were deprecated in v1.6.0:

device_space_tag
host_space_tag
iterator_space
any_space_tag
THRUST_DEVICE_BACKEND
THRUST_HOST_BACKEND

Forwarded from http://code.google.com/p/thrust/issues/detail?id=466

tune parallel/serial threshold in CPU sorting algorithms

The threshold should depend on the data type and comparison operator.

Forwarded from http://code.google.com/p/thrust/issues/detail?id=475

update cuda::arch constants and calculator

See Section 4.2 of the CUDA Programming Guide [1] and the new CUDA Occupancy Calculator [2]

[1] http://developer.download.nvidia.com/compute/DevZone/docs/html/C/doc/CUDA_C_Programming_Guide.pdf
[2] http://developer.download.nvidia.com/compute/DevZone/docs/html/C/tools/CUDA_Occupancy_Calculator.xls

I think we might be able to remove the occupancy calculator entirely and make decisions based on cudaFuncAttributes::maxThreadsPerBlock instead.

Forwarded from http://code.google.com/p/thrust/issues/detail?id=382

parallelize set operations on multicore systems

Currently they fall back to sequential implementations

Forwarded from http://code.google.com/p/thrust/issues/detail?id=468

transform to immutable transform_iterator compiles without error

The following shouldn't be possible:

#include <thrust/iterator/transform_iterator.h>
#include <thrust/host_vector.h>
#include <thrust/functional.h>
#include <thrust/transform.h>

int main()
{
  thrust::host_vector<int> vec;
  thrust::transform(vec.begin(), vec.end(), thrust::make_transform_iterator(vec.begin(), thrust::negate<int>()), thrust::negate<int>());
  return 0;
}

The unary transform functors in internal_functional.h seem to be assigning to the temporary returned by transform_iterator.

Add example to quick start that shows how to initialize a device_vector to existing memory

Forwarded from http://code.google.com/p/thrust/issues/detail?id=481

Scrub KNOWN_FAILURES

I suspect most of these are resolved by now

the type of sequence's step size template parameter should be allowed to differ from the init

Why not? Consider

sequence(v.begin(), v.end(), 0.0f, 5);
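
A serial sketch of the relaxed signature, using a hypothetical sequence_sketch function with separate template parameters for the initial value and the step:

```cpp
// Hypothetical variant: T (initial value) and Step are deduced independently,
// so a float init can be combined with an int step.
template <typename ForwardIterator, typename T, typename Step>
void sequence_sketch(ForwardIterator first, ForwardIterator last, T init, Step step) {
  for (; first != last; ++first) {
    *first = init;
    init = static_cast<T>(init + step);
  }
}
```

With a single shared template parameter, a call mixing 0.0f and 5 fails template argument deduction; with two parameters it deduces float and int.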

Forwarded from http://code.google.com/p/thrust/issues/detail?id=150

integrate one of D.E. Shaw Research's counter-based prngs

pick the one most suitable to GPUs, port it to c++11 interface, add it to thrust::random

Forwarded from http://code.google.com/p/thrust/issues/detail?id=427

Potentially interpret tuple as an index type in zip_iterator

z_iter[(i,j,k)] could potentially make sense if the tuple is the right size and the element types are each convertible to zip_iterator::difference_type

Comment 1 by andrew.corrigan, Sep 16, 2011
I implemented a new iterator called multi_permutation_iterator [1]. It is a new iterator which is like a permutation_iterator but with tuple-valued indexes, and thus dereferences to produce tuples of values from ElementIterator, as opposed to a single value like permutation_iterator. This implements the functionality I described last year, which led to this issue being opened. [2]

As compared to a combination of zip_iterator and multiple permutation_iterators, this new iterator has the advantage of only using one ElementIterator, which can reduce register usage and also parameter size (and thus avoid launch_closure_by_pointer and its memory allocation and copy). This has already helped speed up my code quite a bit, where so far I use it to implement non-contiguous, coalesced multi-dimensional arrays. This multi_permutation_iterator should also help others implement things like finite-difference stencils without excessive register usage, as discussed in [3].

[1] https://code.google.com/r/andrewcorrigan-thrust/source/detail?r=22958c2f47c02cb1a67d8e5c2ad1f788da99f7ac
[2] http://groups.google.com/group/thrust-users/browse_thread/thread/d76e6f7d6206790e/1a7a307e843241f7
[3] http://groups.google.com/group/thrust-users/browse_thread/thread/853579a9bccff08c/33c67c39e7e380a1

Forwarded from http://code.google.com/p/thrust/issues/detail?id=205

compilation error when sorting bools with CUDA backend

The problem is that radix sort is being dispatched, but the underlying radix sort implementation doesn't provide a bool specialization. Since we probably don't want to use radix sort here (it would be unstable for general numerical values in bool storage) we should dispatch partition or stable_partition instead.

Minimal reproducer:

#include <thrust/device_vector.h>
#include <thrust/sort.h>

int main(void)
{
  thrust::device_vector<bool> V(10);
  thrust::sort(V.begin(), V.end());
  return 0;
}

Compiler output:

/home/nathan/NV/thrust/thrust/system/cuda/detail/detail/stable_radix_sort.inl(67): error: incomplete type is not allowed
          detected during:
            instantiation of "void thrust::system::cuda::detail::detail::stable_radix_sort(RandomAccessIterator, RandomAccessIterator) [with RandomAccessIterator=thrust::detail::normal_iterator<thrust::pointer<__nv_bool, thrust::device_system_tag, thrust::use_default, thrust::use_default>>]" 
(62): here
            instantiation of "void thrust::system::cuda::detail::detail::stable_radix_sort(RandomAccessIterator, RandomAccessIterator) [with RandomAccessIterator=thrust::detail::normal_iterator<thrust::device_ptr<__nv_bool>>]" 
/home/nathan/NV/thrust/thrust/system/cuda/detail/sort.inl(284): here
            instantiation of "void thrust::system::cuda::detail::first_dispatch::stable_sort(RandomAccessIterator, RandomAccessIterator, StrictWeakOrdering, thrust::detail::true_type) [with RandomAccessIterator=thrust::detail::normal_iterator<thrust::device_ptr<__nv_bool>>, StrictWeakOrdering=thrust::less<__nv_bool>]" 
/home/nathan/NV/thrust/thrust/system/cuda/detail/sort.inl(419): here
            instantiation of "void thrust::system::cuda::detail::stable_sort(thrust::system::cuda::detail::tag, RandomAccessIterator, RandomAccessIterator, StrictWeakOrdering) [with RandomAccessIterator=thrust::detail::normal_iterator<thrust::device_ptr<__nv_bool>>, StrictWeakOrdering=thrust::less<__nv_bool>]" 
/home/nathan/NV/thrust/thrust/detail/sort.inl(88): here
            instantiation of "void thrust::stable_sort(RandomAccessIterator, RandomAccessIterator, StrictWeakOrdering) [with RandomAccessIterator=thrust::detail::normal_iterator<thrust::device_ptr<__nv_bool>>, StrictWeakOrdering=thrust::less<__nv_bool>]" 
/home/nathan/NV/thrust/thrust/system/detail/generic/sort.inl(59): here
            instantiation of "void thrust::system::detail::generic::sort(thrust::system::detail::generic::tag, RandomAccessIterator, RandomAccessIterator, StrictWeakOrdering) [with RandomAccessIterator=thrust::detail::normal_iterator<thrust::device_ptr<__nv_bool>>, StrictWeakOrdering=thrust::less<__nv_bool>]" 
/home/nathan/NV/thrust/thrust/detail/sort.inl(60): here
            instantiation of "void thrust::sort(RandomAccessIterator, RandomAccessIterator, StrictWeakOrdering) [with RandomAccessIterator=thrust::detail::normal_iterator<thrust::device_ptr<__nv_bool>>, StrictWeakOrdering=thrust::less<__nv_bool>]" 
/home/nathan/NV/thrust/thrust/system/detail/generic/sort.inl(47): here
            instantiation of "void thrust::system::detail::generic::sort(thrust::system::detail::generic::tag, RandomAccessIterator, RandomAccessIterator) [with RandomAccessIterator=thrust::detail::normal_iterator<thrust::device_ptr<__nv_bool>>]" 
/home/nathan/NV/thrust/thrust/detail/sort.inl(45): here
            instantiation of "void thrust::sort(RandomAccessIterator, RandomAccessIterator) [with RandomAccessIterator=thrust::detail::normal_iterator<thrust::device_ptr<__nv_bool>>]" 
foo.cu(7): here

/home/nathan/NV/thrust/thrust/system/cuda/detail/detail/stable_radix_sort.inl(67): error: incomplete type is not allowed
          detected during:
            instantiation of "void thrust::system::cuda::detail::detail::stable_radix_sort(RandomAccessIterator, RandomAccessIterator) [with RandomAccessIterator=thrust::detail::normal_iterator<thrust::device_ptr<__nv_bool>>]" 
/home/nathan/NV/thrust/thrust/system/cuda/detail/sort.inl(284): here
            instantiation of "void thrust::system::cuda::detail::first_dispatch::stable_sort(RandomAccessIterator, RandomAccessIterator, StrictWeakOrdering, thrust::detail::true_type) [with RandomAccessIterator=thrust::detail::normal_iterator<thrust::device_ptr<__nv_bool>>, StrictWeakOrdering=thrust::less<__nv_bool>]" 
/home/nathan/NV/thrust/thrust/system/cuda/detail/sort.inl(419): here
            instantiation of "void thrust::system::cuda::detail::stable_sort(thrust::system::cuda::detail::tag, RandomAccessIterator, RandomAccessIterator, StrictWeakOrdering) [with RandomAccessIterator=thrust::detail::normal_iterator<thrust::device_ptr<__nv_bool>>, StrictWeakOrdering=thrust::less<__nv_bool>]" 
/home/nathan/NV/thrust/thrust/detail/sort.inl(88): here
            instantiation of "void thrust::stable_sort(RandomAccessIterator, RandomAccessIterator, StrictWeakOrdering) [with RandomAccessIterator=thrust::detail::normal_iterator<thrust::device_ptr<__nv_bool>>, StrictWeakOrdering=thrust::less<__nv_bool>]" 
/home/nathan/NV/thrust/thrust/system/detail/generic/sort.inl(59): here
            instantiation of "void thrust::system::detail::generic::sort(thrust::system::detail::generic::tag, RandomAccessIterator, RandomAccessIterator, StrictWeakOrdering) [with RandomAccessIterator=thrust::detail::normal_iterator<thrust::device_ptr<__nv_bool>>, StrictWeakOrdering=thrust::less<__nv_bool>]" 
/home/nathan/NV/thrust/thrust/detail/sort.inl(60): here
            instantiation of "void thrust::sort(RandomAccessIterator, RandomAccessIterator, StrictWeakOrdering) [with RandomAccessIterator=thrust::detail::normal_iterator<thrust::device_ptr<__nv_bool>>, StrictWeakOrdering=thrust::less<__nv_bool>]" 
/home/nathan/NV/thrust/thrust/system/detail/generic/sort.inl(47): here
            instantiation of "void thrust::system::detail::generic::sort(thrust::system::detail::generic::tag, RandomAccessIterator, RandomAccessIterator) [with RandomAccessIterator=thrust::detail::normal_iterator<thrust::device_ptr<__nv_bool>>]" 
/home/nathan/NV/thrust/thrust/detail/sort.inl(45): here
            instantiation of "void thrust::sort(RandomAccessIterator, RandomAccessIterator) [with RandomAccessIterator=thrust::detail::normal_iterator<thrust::device_ptr<__nv_bool>>]" 
foo.cu(7): here

2 errors detected in the compilation of "/tmp/tmpxft_00005faf_00000000-4_foo.cpp1.ii".

Provide better error message when OpenMP is required but not found

Ideally, we want a human-readable compiler error. When we have static_assert, we can guard just the function bodies and do something like static_assert(_OPENMP, "Did you forget -fopenmp?")
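One possible shape for such a guard, sketched in standard C++11. Note that `_OPENMP` is left undefined when the compiler runs without OpenMP support, so this sketch first maps it to a constant that is always well-formed. The names `kOpenMPEnabled`, `omp_scale`, and the `Available` template parameter are hypothetical and are not part of Thrust.

```cpp
#include <cstddef>

// _OPENMP is only defined when the compiler runs with OpenMP enabled,
// so translate it into a constant that can always be named.
#ifdef _OPENMP
constexpr bool kOpenMPEnabled = true;
#else
constexpr bool kOpenMPEnabled = false;
#endif

// Hypothetical OpenMP-backed routine (not a Thrust function).
// The assertion depends on a template parameter, so it fires only
// when the function body is actually instantiated.
template <typename T, bool Available = kOpenMPEnabled>
void omp_scale(T* data, std::size_t n, T factor) {
  static_assert(Available,
                "OpenMP is required here. Did you forget -fopenmp?");
#ifdef _OPENMP
#pragma omp parallel for
#endif
  for (std::ptrdiff_t i = 0; i < static_cast<std::ptrdiff_t>(n); ++i) {
    data[i] *= factor;
  }
}
```

Because the check lives inside the function body, merely including the header stays harmless; the readable message appears only at the call site that needs OpenMP.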

Forwarded from http://code.google.com/p/thrust/issues/detail?id=94

improve test coverage of tests using pseudo random input

The bug in Google Code issue #80 would have been discovered more easily if our randomized testing of unique_copy() were more comprehensive. In general, we should:
(1) make our random input portable and not rely on the system's rand()
(2) add a range parameter to random_integers that specifies the range of random values
(3) add a function repeated_random_integers(N, range, repeats) that generates sequences of random values where the length of each run is selected randomly from [1,repeats]
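As a rough illustration of item (3), here is a minimal host-side sketch in standard C++. It assumes values are drawn from [0, range), uses `std::minstd_rand` because the standard fixes the engine's output sequence, and maps values with a plain modulo (slightly biased, but independent of any library's distribution implementation). The `seed` parameter and the exact semantics are assumptions, not the signature that Thrust's test suite uses.

```cpp
#include <cstddef>
#include <random>
#include <vector>

// Sketch only: produces n values in [0, range), emitted in runs whose
// lengths are drawn from [1, repeats]. Requires range > 0 and repeats >= 1.
std::vector<int> repeated_random_integers(std::size_t n, int range,
                                          std::size_t repeats,
                                          unsigned seed = 1337) {
  std::minstd_rand rng(seed);  // engine output is specified by the standard
  std::vector<int> out;
  out.reserve(n);
  while (out.size() < n) {
    // Pick the value first, then the length of its run.
    const int value = static_cast<int>(rng() % static_cast<unsigned>(range));
    const std::size_t run = 1 + rng() % repeats;
    for (std::size_t i = 0; i < run && out.size() < n; ++i) {
      out.push_back(value);
    }
  }
  return out;
}
```

Two adjacent runs may happen to pick the same value, so an observed run of equal elements can be longer than `repeats`; a test that depends on exact run lengths would need to reject such draws.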

Forwarded from http://code.google.com/p/thrust/issues/detail?id=81

Ambiguous overloads of reinterpret_tag

I run into an ambiguous overload of reinterpret_tag when trying to invoke thrust::copy on thrust::detail::pointer objects which are both tagged with the thrust::system::omp::tag. This error does not occur when both pointers are tagged with thrust::system::cuda::tag. It also does not occur when pointers have different tags (i.e., one is from omp and one from cuda).

#include <thrust/copy.h>
#include <thrust/detail/pointer.h>
#include <thrust/system/omp/memory.h>
#include <thrust/system/cuda/memory.h>

//#define ERROR

int main() {
#ifdef ERROR
    typedef thrust::system::omp::tag my_tag;
#else
    typedef thrust::system::cuda::tag my_tag;
#endif

    thrust::pointer<int, my_tag> a_begin((int*)NULL);
    thrust::pointer<int, my_tag> b_begin((int*)NULL);
    thrust::copy(a_begin, a_begin,
                 b_begin);
}

The error message indicates the problem comes from

/Users/catanzar/thrust/thrust/system/omp/detail/copy.inl(60): error: more than one instance of overloaded function
"thrust::reinterpret_tag" matches the argument list:
    function template "thrust::pointer<T, Tag, thrust::use_default, thrust::use_default> thrust::reinterpret_tag<Tag,T,OtherTag>
        (thrust::pointer<T, OtherTag, thrust::use_default, thrust::use_default>)"
    function template "thrust::detail::enable_if_retaggable<OtherTag, Tag, thrust::pointer<T, Tag, thrust::use_default, 
        thrust::use_default>>::type thrust::reinterpret_tag<Tag,T,OtherTag>(thrust::pointer<T, OtherTag, thrust::use_default,
        thrust::use_default>)"
    argument types are: (thrust::pointer<int, thrust::system::omp::detail::tag, thrust::use_default, thrust::use_default>)

radix_sort() spends a lot of time in cudaGetDeviceProperties

Forwarded from http://code.google.com/p/thrust/issues/detail?id=318

remove manual unrolling in cuda::block::inclusive_scan

The statically-sized variants could be implemented with TMP or automatic compiler-based unrolling (w/ #pragma unroll). The latter would likely require use of a static log2.
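To make the idea concrete, here is a host-side sketch of a static log2 driving a scan loop whose trip count is a compile-time constant, which is the property a compiler needs to unroll it automatically. It is a serial emulation in plain C++, not the CUDA block routine: the inner loops stand in for the threads of a block, and the names `static_log2` and `inclusive_scan_fixed` are hypothetical.

```cpp
#include <cstddef>

// Compile-time floor(log2(N)) via template recursion.
template <std::size_t N>
struct static_log2 {
  static constexpr std::size_t value = 1 + static_log2<N / 2>::value;
};
template <>
struct static_log2<1> {
  static constexpr std::size_t value = 0;
};

// Inclusive prefix sum over a fixed-size array in log2(N) doubling steps.
// Assumes N is a power of two.
template <typename T, std::size_t N>
void inclusive_scan_fixed(T (&data)[N]) {
  static_assert(N > 0 && (N & (N - 1)) == 0, "N must be a power of two");
  T next[N];
  // The bound is a constant expression, so the loop is a candidate for
  // full unrolling (for example via #pragma unroll in device code).
  for (std::size_t step = 0; step < static_log2<N>::value; ++step) {
    const std::size_t offset = std::size_t(1) << step;
    for (std::size_t i = 0; i < N; ++i) {
      next[i] = (i >= offset) ? data[i - offset] + data[i] : data[i];
    }
    for (std::size_t i = 0; i < N; ++i) {
      data[i] = next[i];
    }
  }
}
```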

Forwarded from http://code.google.com/p/thrust/issues/detail?id=407

Unused variable warning in testing/copy.cu & testing/copy_n.cu

testing/copy.cu:212:117: warning: variable 'l_result' set but not used [-Wunused-but-set-variable]

testing/copy_n.cu:20:59: warning: variable 'end' set but not used [-Wunused-but-set-variable]

testing/copy_n.cu:176:117: warning: variable 'l_result' set but not used [-Wunused-but-set-variable]

Thrust lacks generic malloc/free

There is no thrust::malloc or thrust::free.

This makes it difficult to allocate (deallocate) tagged memory generically.

Working around this omission forces the user to dip into the private generic namespace and dispatch through ADL.

Consider adding these and exposing them in thrust/memory.h
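A minimal sketch of what a tag-dispatched allocation pair could look like, written in plain C++ so it stands alone. Everything here is hypothetical: the `sketch` namespace, `host_tag`, `tagged_ptr`, and the `generic_malloc`/`generic_free` names only illustrate the dispatch pattern and are not Thrust's interface.

```cpp
#include <cstddef>
#include <cstdlib>

namespace sketch {

// Hypothetical system tag and tagged pointer, standing in for a
// backend tag and a tagged pointer type.
struct host_tag {};

template <typename T, typename Tag>
struct tagged_ptr {
  T* raw;
};

// Backend hooks. A new backend would add overloads for its own tag,
// and the generic entry points below would find them via ADL.
inline void* allocate_bytes(host_tag, std::size_t bytes) {
  return std::malloc(bytes);
}
inline void deallocate_bytes(host_tag, void* p) { std::free(p); }

// Public, generic entry points: callers name only the tag.
template <typename T, typename Tag>
tagged_ptr<T, Tag> generic_malloc(Tag tag, std::size_t n) {
  return tagged_ptr<T, Tag>{
      static_cast<T*>(allocate_bytes(tag, n * sizeof(T)))};
}

template <typename T, typename Tag>
void generic_free(Tag tag, tagged_ptr<T, Tag> p) {
  deallocate_bytes(tag, p.raw);
}

}  // namespace sketch
```

With entry points like these, generic code can allocate for whichever backend a tag names without reaching into a private namespace.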

reported by Bryan Catanzaro

Forwarded from http://code.google.com/p/thrust/issues/detail?id=448

improve cuda scan performance

forceinline the block scan routines
avoid shared memory writes in the serial phase (accumulate only)
use RSS instead of SSA to save one write per element (total: 2 reads + 1 write)
use raking approach to block scan
Use device and type-specific tuning parameters

Forwarded from http://code.google.com/p/thrust/issues/detail?id=294
