stotko / stdgpu Goto Github PK

View Code? Open in Web Editor NEW

1.1K 29.0 77.0 6.09 MB

stdgpu: Efficient STL-like Data Structures on the GPU

Home Page: https://stotko.github.io/stdgpu/

License: Apache License 2.0

CMake 4.90% Cuda 27.78% Shell 0.64% C++ 45.29% C 0.06% NASL 21.21% Python 0.12%

gpu gpu-computing gpu-acceleration gpgpu data-structures stl stl-like stl-containers cpp cpp17

stdgpu's People

Contributors

Stargazers

Watchers

Forkers

ykkawana sichitong vitaliytalyh zeta1999 clin99 tzf-omkey hollisjoe foreverdavid lp249839965 sevenchao linecode saimarpaka jrhemstad lsy1599 stanleyjacob sohardforaname thended isl-org yyhclimacool romeluko mathpopo geotyper imaginary-person chengwei920412 metavai feiyunwill blinky0815 henrychang213 machinelearningsystem drzhoukarl ajunlonglive zjhellofss hongzhen12 grideyes-2010 baajarmeh longervision jiapei-nexera bluceyee vn-os edwardnguyen1705 joonvan gatsby23 bqm1111 mfkiwl lionxu kstppd loki-hxg uniwangwang gg-big-org badal244 distributedsystemresearch secp8x32 yes-jumby kindofblue kindhearted57 gmh5225 hzmslx sarvex 7tendiakexshi lahmos jackylucifer dlxj noticeable 5l1v3r1 dreamplayer-zhang yofufufufu yuehchuan brugarolas weedge crhcrhcrhcrh zjarci hwzen wasahaiah brewbee classic130 tanzby

stdgpu's Issues

Remove deprecated classes and functions in stdgpu 2.0.0

This issue comprises a list of deprecated functionality that will be removed in version 2.0.0:

Sources:

bit: ispow2(), log2pow2(), mod2()
bitset: Replace internal non-static member _bit_per_block by static member version (changes object size)
cstdlib: sizedivPow2(std::size_t, std::size_t), sizediv_t
memory: safe_pinned_host_allocator, default_allocator_traits
mutex: mutex_ref
ranges: device_range(T*, index_t), host_range(T*, index_t), non-const begin() and end() member functions
unordered_map,unordered_set: createDeviceObject(index_t, index_t), excess_count(), total_count()

CMake:

Configuration Options: STDGPU_ENABLE_AUXILIARY_ARRAY_WARNING, STDGPU_ENABLE_MANAGED_ARRAY_WARNING, STDGPU_USE_FAST_DESTROY, STDGPU_USE_FIBONACCI_HASHING

Header name convention

Until now, the header names follow a simple convention. There are essentially three different types of extensions:

.h: Header files that can be used in both CUDA (.cu) and C++ (.cpp) code.
.cuh: Header files that can be used exclusively in CUDA (.cu) code.
no extension: Forward declaration files.

With the recent addition of an OpenMP backend, this scheme no longer fits as the device code is now also compiled in .cpp files. Furthermore, forward declarations follow a different convention which is also not obvious. This will also be the case for future backends. Thus, the current scheme is confusing and should be changed. Essentially, we have the following options:

thrust-like scheme: Use .h for all files including forward declaration files. This has the advantage of being a simple and uniform solution, but makes above inclusion limitations less obvious.
C++ STL-like scheme: Use no extension for all files. This would make the relation to the C++ standard library more obvious and is again a uniform solution. However, as most editors deduce the language from the file extension, we should think of a solution to keep syntax highlighting working.

Since the involving changes would lead to an API break, this will only be included in version 2.0.0.

stdgpu 2.0.0

Since we aim to postpone API (and if possible ABI) breaks until the next major version (see Semantic Versioning), this issue serves as a short overview of these major changes.

Header name convention #38
Remove deprecated classes and functions (involves #36)

Compatibility with Thrust?

Describe the bug
Current stdgpu seems to be NOT compatible with NVidia Thurst?

Steps to reproduce

git clone https://github.com/stotko/stdgpu.git
mkdir build
cd build
cmake ../

Expected behavior
Successfully built and run.

Actual behavior
No matter GCC or Clang, both failed.

GCC 11.2

[  1%] Building CXX object src/stdgpu/CMakeFiles/stdgpu.dir/impl/iterator.cpp.o
cd ....../stdgpu/build/src/stdgpu && /usr/bin/c++ -DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_SYSTEM_CUDA -Dstdgpu_EXPORTS -I....../stdgpu/src/stdgpu/.. -I....../stdgpu/build/src/stdgpu/include -isystem /usr/local/cuda/include -fPIC -Wall -pedantic -Wextra -Wshadow -Wsign-compare -Wconversion -Wfloat-equal -Wundef -Wdouble-promotion -MD -MT src/stdgpu/CMakeFiles/stdgpu.dir/impl/iterator.cpp.o -MF CMakeFiles/stdgpu.dir/impl/iterator.cpp.o.d -o CMakeFiles/stdgpu.dir/impl/iterator.cpp.o -c ....../stdgpu/src/stdgpu/impl/iterator.cpp
In file included from /usr/local/cuda/include/nv/detail/__target_macros:13,
                 from /usr/local/cuda/include/nv/target:195,
                 from /usr/local/cuda/include/cub/detail/device_synchronize.cuh:23,
                 from /usr/local/cuda/include/thrust/system/cuda/detail/util.h:36,
                 from /usr/local/cuda/include/thrust/system/cuda/detail/malloc_and_free.h:26,
                 from /usr/local/cuda/include/thrust/system/detail/adl/malloc_and_free.h:42,
                 from /usr/local/cuda/include/thrust/system/detail/generic/memory.inl:22,
                 from /usr/local/cuda/include/thrust/system/detail/generic/memory.h:69,
                 from /usr/local/cuda/include/thrust/detail/reference.h:28,
                 from ....../stdgpu/src/stdgpu/../stdgpu/iterator.h:30,
                 from ....../stdgpu/src/stdgpu/impl/iterator.cpp:16:
/usr/local/cuda/include/cub/util_device.cuh: In function ‘cudaError_t cub::PtxVersionUncached(int&)’:
/usr/local/cuda/include/cub/util_device.cuh:368:15: error: invalid conversion from ‘EmptyKernelPtr’ {aka ‘void (*)()’} to ‘const void*’ [-fpermissive]
  368 |           if (CubDebug(result = cudaFuncGetAttributes(&empty_kernel_attrs,
      |               ^~~~~~~~
      |               |
      |               EmptyKernelPtr {aka void (*)()}
In file included from /usr/local/cuda/include/thrust/system/cuda/detail/guarded_cuda_runtime_api.h:38,
                 from /usr/local/cuda/include/thrust/system/cuda/detail/malloc_and_free.h:19,
                 from /usr/local/cuda/include/thrust/system/detail/adl/malloc_and_free.h:42,
                 from /usr/local/cuda/include/thrust/system/detail/generic/memory.inl:22,
                 from /usr/local/cuda/include/thrust/system/detail/generic/memory.h:69,
                 from /usr/local/cuda/include/thrust/detail/reference.h:28,
                 from ....../stdgpu/src/stdgpu/../stdgpu/iterator.h:30,
                 from ....../stdgpu/src/stdgpu/impl/iterator.cpp:16:
/usr/local/cuda/include/cuda_runtime_api.h:4337:125: note:   initializing argument 2 of ‘cudaError_t cudaFuncGetAttributes(cudaFuncAttributes*, const void*)’
 4337 | extern __host__ __cudart_builtin__ cudaError_t CUDARTAPI cudaFuncGetAttributes(struct cudaFuncAttributes *attr, const void *func);
      |                                                                                                                 ~~~~~~~~~~~~^~~~
make[2]: *** [src/stdgpu/CMakeFiles/stdgpu.dir/build.make:93: src/stdgpu/CMakeFiles/stdgpu.dir/impl/iterator.cpp.o] Error 1
make[2]: Leaving directory '....../stdgpu/build'
make[1]: *** [CMakeFiles/Makefile2:318: src/stdgpu/CMakeFiles/stdgpu.dir/all] Error 2
make[1]: Leaving directory '....../stdgpu/build'
make: *** [Makefile:149: all] Error 2

clang 14.0

[  1%] Building CXX object src/stdgpu/CMakeFiles/stdgpu.dir/impl/iterator.cpp.o
cd ....../build/src/stdgpu && /usr/bin/clang++ -DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_SYSTEM_CUDA -I....../src/stdgpu/.. -I....../build/src/stdgpu/include -isystem /usr/local/cuda/include -Wall -pedantic -Wextra -Wshadow -Wsign-compare -Wconversion -Wfloat-equal -Wundef -Wdouble-promotion -MD -MT src/stdgpu/CMakeFiles/stdgpu.dir/impl/iterator.cpp.o -MF CMakeFiles/stdgpu.dir/impl/iterator.cpp.o.d -o CMakeFiles/stdgpu.dir/impl/iterator.cpp.o -c ....../src/stdgpu/impl/iterator.cpp
In file included from ....../src/stdgpu/impl/iterator.cpp:16:
In file included from ....../src/stdgpu/../stdgpu/iterator.h:30:
In file included from /usr/local/cuda/include/thrust/detail/reference.h:28:
In file included from /usr/local/cuda/include/thrust/system/detail/generic/memory.h:69:
In file included from /usr/local/cuda/include/thrust/system/detail/generic/memory.inl:22:
In file included from /usr/local/cuda/include/thrust/system/detail/adl/malloc_and_free.h:42:
In file included from /usr/local/cuda/include/thrust/system/cuda/detail/malloc_and_free.h:26:
In file included from /usr/local/cuda/include/thrust/system/cuda/detail/util.h:38:
/usr/local/cuda/include/cub/util_device.cuh:368:33: error: no matching function for call to 'cudaFuncGetAttributes'
          if (CubDebug(result = cudaFuncGetAttributes(&empty_kernel_attrs,
                                ^~~~~~~~~~~~~~~~~~~~~
/usr/local/cuda/include/cub/util_debug.cuh:115:64: note: expanded from macro 'CubDebug'
    #define CubDebug(e) CUB_NS_QUALIFIER::Debug((cudaError_t) (e), __FILE__, __LINE__)
                                                               ^
/usr/local/cuda/include/nv/detail/__target_macros:455:78: note: expanded from macro 'NV_IF_TARGET'
#  define NV_IF_TARGET(cond, t, ...)    _NV_BLOCK_EXPAND(_NV_TARGET_IF(cond, t, __VA_ARGS__))
                                                                             ^
/usr/local/cuda/include/nv/detail/__target_macros:419:74: note: expanded from macro '_NV_TARGET_IF'
#    define _NV_TARGET_IF(cond, t, ...) _NV_IF( _NV_ARCH_COND_CAT(cond), t, __VA_ARGS__)
                                                                         ^
note: (skipping 24 expansions in backtrace; use -fmacro-backtrace-limit=0 to see all)
/usr/local/cuda/include/nv/detail/__preprocessor:83:47: note: expanded from macro '_NV_STRIP_PAREN'
#define _NV_STRIP_PAREN(...) _NV_STRIP_PAREN1(__VA_ARGS__)
                                              ^~~~~~~~~~~
/usr/local/cuda/include/nv/detail/__preprocessor:82:48: note: expanded from macro '_NV_STRIP_PAREN1'
#define _NV_STRIP_PAREN1(...) _NV_STRIP_PAREN2 __VA_ARGS__
                                               ^~~~~~~~~~~
/usr/local/cuda/include/nv/detail/__preprocessor:81:31: note: expanded from macro '_NV_STRIP_PAREN2'
#define _NV_STRIP_PAREN2(...) __VA_ARGS__
                              ^~~~~~~~~~~
/usr/local/cuda/include/cuda_runtime_api.h:4337:58: note: candidate function not viable: no known conversion from 'EmptyKernelPtr' (aka 'void (*)()') to 'const void *' for 2nd argument; take the address of the argument with &
extern __host__ __cudart_builtin__ cudaError_t CUDARTAPI cudaFuncGetAttributes(struct cudaFuncAttributes *attr, const void *func);
                                                         ^
1 error generated.
make[2]: *** [src/stdgpu/CMakeFiles/stdgpu.dir/build.make:93: src/stdgpu/CMakeFiles/stdgpu.dir/impl/iterator.cpp.o] Error 1
make[2]: Leaving directory '....../build'
make[1]: *** [CMakeFiles/Makefile2:403: src/stdgpu/CMakeFiles/stdgpu.dir/all] Error 2
make[1]: Leaving directory '....../build'
make: *** [Makefile:149: all] Error 2

System (please complete the following information):

OS: Ubuntu 22.04
Compiler: gcc 11.2 and clang 14.0, both tried
Backend: CUDA 11.6
Library version: stdgpu 1.3.0 ??

Findthrust.cmake parse error

Describe the bug
Cannot config project because Findthrust.cmake@15 cannot find the thrust properly.

Look at this code:

string(REGEX REPLACE "#define THRUST_VERSION[ \t]+" "" THRUST_VERSION_STRING ${THRUST_VERSION_STRING})

It does not take into account that #define THRUST_VERSION may be followed by comments.

Unfortunately, cuda 12.4's thrust/version.h has a comment after THRUST_VERSION

/*! \def THRUST_VERSION
 *  \brief The preprocessor macro \p THRUST_VERSION encodes the version
 *         number of the Thrust library as MMMmmmpp.
 *
 *  \note THRUST_VERSION is formatted as `MMMmmmpp`, which differs from `CCCL_VERSION` that uses `MMMmmmppp`.
 *
 *         <tt>THRUST_VERSION % 100</tt> is the sub-minor version.
 *         <tt>THRUST_VERSION / 100 % 1000</tt> is the minor version.
 *         <tt>THRUST_VERSION / 100000</tt> is the major version.
 */
#define THRUST_VERSION 200301 // macro expansion with ## requires this to be a single value

So ${THRUST_VERSION_STRING} in cmake was parsed to 200301 // macro expansion with ## requires this to be a single value and led to subsequent errors in judgment.

If you think what I'm saying makes sense, I'll be happy to fix it.

System (please complete the following information):

OS: Windows 11
Compiler: Visual Studio 2022
Backend: CUDA 12.4
Library version: main branch

Unexpected insertion when using unordered_map and capacity equals to 4

Describe the bug
The insertion of unordered_map does not work correctly under capacity=4.

Steps to reproduce
Change the n from 100 to 4 in examples/cuda/unordered_map.cu, compile and run this example code.

Expected behavior
the terminal should be:
The duplicate-free map of numbers contains 5 elements (5 expected) and the computed sums are (2, 4) ((10, 30) expected)

Actual behavior
the terminal actually shows:
The duplicate-free map of numbers contains 2 elements (5 expected) and the computed sums are (2, 4) ((10, 30) expected)

System (please complete the following information):

OS: Ubuntu 20.04
Compiler: GCC
Backend: CUDA
Library version: master

bitset: Class name and API

Our bitset class is a GPU version of std::bitset which, however, is designed to cover more use cases. In particular, its interface and implementation (run-time fixed-sized) is somewhere between std::bitset (compile-time fixed-sized) and boost::dynamic_bitset (run-time dynamic-sized). This may lead to confusion if users expect the exact same API as std::bitset.

There are several ways to address this issue:

Rename it to dynamic_bitset and extend its API to match (as close as possible) boost.
Rename it to vector<bool> and change/extend its API to match (as close as possible) vector.
Keep the name and extend its API towards boost's version, i.e. considering the non-dynamic-sized functions.

At the moment, the last option seems to be a good compromise. However, it does not fully solve the problem regarding potential user confusion. Since any of the options will break the API, this change is considered for stdgpu 2.0.0

Execution policy support for all containers

As the containers should mimic their C++ counterparts as close as possible in terms of functionality, both per-element and iterator-based member functions are considered and provided. While the former allow for easy usage in the native context, that is e.g. in CUDA kernels for the CUDA backend, the latter iterator-based versions can be considered following algorithm semantics. However, they lack support for execution_policys prohibiting greater flexibility such as using asynchronous CUDA streams. The affected functionality is listed below:

~~All containers:~~
- ~~createDeviceObject and destroyDeviceObject~~
~~bitset:~~
- ~~set, reset, flip, count, all, any, none~~
~~deque:~~
- ~~clear, device_range, valid~~
~~memory:~~
- ~~createDeviceArray, destroyDeviceArray, and for symmetry reasons also the respective host versions~~
~~mutex:~~
- ~~valid~~
queue:
- valid
stack:
- valid
~~unordered_map, unordered_set:~~
- ~~device_range, insert, erase, clear, valid~~
~~vector:~~
- ~~insert, erase, clear, valid~~

Option 1:
Add a respective execution_policy parameter to all of these functions. This could either follow algorithm and make this the first parameter such that each functions must be duplicated. Alternatively, it could be passed as the last parameter with a default value, at the cost of an inconsistent interface to algorithm.

Option 2:
Add a scoped_execution_policy class which acts as a customizable default policy for all calls within its scope. While this minimizes the required changes for the containers, proper global management may be hard to implement as the class types of the policies could theoretically be arbitrary.

Header-only and GPU architecture independence

In contrast to boost, thrust and others, stdgpu is not a header-only library and, hence, requires shipping a compiled library. The following module currently require source file compilation:

~~bitset: Contains host-only functions which also contain code executed on the device.~~
device: Contains a function relying on backend-specific host API functions.
iterator: Only contains a wrapper function to hide the dependency to memory from the header.
~~limits: Contains the definition of static member variables.~~
memory: Both the general as well as the backend-specific parts handle the allocation and memcpy parts in the sources. ~~This includes some global variables that need to be converted to proper singletons.~~
~~mutex: Contains host-only functions which also contain code executed on the device.~~

Inlining bitset and mutex will make the library independent of the required GPU architecture, e.g. the compute capability set for CUDA. Even if we decide not to go for header-only, achieving architecture independence might be a good compromise.

Unordered_set with structure

Hello, I'm new to this library. I wonder what should I do if I want to use something like stdgpu::unordered_set<thrust::pair<int, int>> or stdgpu::unordered_set<myStruct> ? Because when I try to use stdgpu::unordered_set<thrust::pair<int, int>>::createDeviceObject(n), the error occurred:

/usr/local/include/c++/9.5.0/type_traits(2378): error: class "std::enable_if<false, thrust::pair<int, int>>" has no member "type"
          detected during:
            instantiation of type "std::enable_if_t<false, thrust::pair<int, int>>"

unordered map creation freezes async processes

Describe the bug
unordered map creation freezes async processes

Steps to reproduce

runBuldKernel << < block_size_x, thread_size_x, 0, build_stream >> > (ng, object_size_ui);

// The line below would only complete when runBuldKernel is done
stdgpu::unordered_map<uint32_t, uint32_t> map = stdgpu::unordered_map<uint32_t, uint32_t>::createDeviceObject(8);

Expected behavior
The map creation and memory allocation should complete right away, without waiting for runBuldKernel to complete

Actual behavior
The map creation and memory allocation completes only after runBuldKernel is done

System (please complete the following information):

OS: Windows 11 x64
Compiler: MSVC Visual Studio 2022
Backend: CUDA 12
Library version: master

Please help to answer the two questions about this, thank you

1.When I want to use other containers (e.g., list, array), do I need to implement one myself?
2.The hip version of the example does not have the use of container classes, is it supported?

Minimize thrust dependency

In order to implement several functions in a portable manner across the backends, thrust is used as the abstraction of choice. This may, however, limit extending support to other possible libraries making it desirable to reduce the coupling to thrust. Throughout the whole code base, the following classes and functions are used:

Header algorithm:

Header execution:

~~device~~: Only used as object alias for stdgpu::execution::device.
~~host~~: Only used as object alias for stdgpu::execution::host.

Header functional:

~~equal_to<>~~
~~identity<>~~
~~plus<>~~

Header iterator:

~~counting_iterator~~: Only used in stdgpu::for_each_index and stdgpu::transform_reduce_index implementation.
~~discard_iterator~~: Implementation detail used for iterator_adaptor.
~~distance~~
~~iterator_adaptor~~: Only used in implementation of thrust-compatible container iterators.
~~transform_iterator~~: Only used in stdgpu::transform_range implementation.
~~zip_iterator~~

Header utility:

~~pair~~: Only used as typedef for stdgpu::pair.
~~tuple~~

Other Headers:

~~random~~

Note: The examples may still make use of thrust to demonstrate interoperability.

Any chance to support spirv as a backend?

Is your feature request related to a problem? Please describe.
I'd like to use this in a hardware vendor agnostic way, and more specifically I'd like to use with webgpu. Any plans to support in the future? OpenCL subset can compile to spirv, so that could be another option?

Build suceeds but cannot include package and build

Hi there,

I have been able to build stdgpu under windows with msvc 2019 and cuda 10.2 with no problems. The example projects work fine. However, taking the installed library and using it in a test setup produces some errors (the same is true for the addsubdirectories route). My CmakeLists.txt is:

cmake_minimum_required(VERSION 3.1)
set (CMAKE_CXX_STANDARD 14)

project(VoxelGrid LANGUAGES CXX CUDA)

file(GLOB srcfiles 
${PROJECT_SOURCE_DIR}/src/*.h   
${PROJECT_SOURCE_DIR}/src/*.cpp
)
include_directories(${PROJECT_SOURCE_DIR}/src)

set(stdgpu_DIR ${PROJECT_SOURCE_DIR}/3rdParty/stdgpu/lib/cmake/stdgpu)
find_package(stdgpu 1.0.0 REQUIRED)

add_executable(VoxelGridTest exe/main.cpp ${srcfiles})
target_link_libraries(VoxelGridTest PUBLIC stdgpu::stdgpu)

in main.cpp I copied the unordered_map example. The build fails with

 FAILED: CMakeFiles/VoxelGridTest.dir/exe/main.cpp.obj 
  C:\PROGRA~2\MICROS~2\2019\COMMUN~1\VC\Tools\MSVC\1427~1.291\bin\Hostx64\x64\cl.exe  /nologo /TP -DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_SYSTEM_CUDA -I..\..\src -I..\..\3rdParty\stdgpu\include -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.2\include" /DWIN32 /D_WINDOWS /W3 /GR /EHsc /MD /Zi /O2 /Ob1 /DNDEBUG   -std:c++14 /showIncludes /FoCMakeFiles\VoxelGridTest.dir\exe\main.cpp.obj /FdCMakeFiles\VoxelGridTest.dir\ /FS -c ..\..\exe\main.cpp
C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(140): error C2059: syntax error: 'sizeof'
  C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu/atomic.cuh(332): note: see reference to class template instantiation 'stdgpu::atomic<T>' being compiled
C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(141): error C2238: unexpected token(s) preceding ';'
C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(150): error C2059: syntax error: 'sizeof'
C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(152): error C2238: unexpected token(s) preceding ';'
C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(160): error C2059: syntax error: 'sizeof'
C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(162): error C2238: unexpected token(s) preceding ';'
C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(171): error C2988: unrecognizable template declaration/definition
C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(171): error C2059: syntax error: 'sizeof'
C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(172): error C2238: unexpected token(s) preceding ';'
C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(180): error C2988: unrecognizable template declaration/definition
C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(180): error C2059: syntax error: 'sizeof'
C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(181): error C2238: unexpected token(s) preceding ';'
C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(189): error C2988: unrecognizable template declaration/definition
C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(189): error C2059: syntax error: 'sizeof'
C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(190): error C2238: unexpected token(s) preceding ';'
C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(198): error C2988: unrecognizable template declaration/definition
C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(198): error C2059: syntax error: 'sizeof'
C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(199): error C2238: unexpected token(s) preceding ';'
C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(207): error C2988: unrecognizable template declaration/definition
C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(207): error C2059: syntax error: 'sizeof'
C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(208): error C2238: unexpected token(s) preceding ';'
C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(217): error C2988: unrecognizable template declaration/definition
C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(217): error C2059: syntax error: 'sizeof'
C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(218): error C2238: unexpected token(s) preceding ';'
C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(226): error C2988: unrecognizable template declaration/definition
C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(226): error C2059: syntax error: 'sizeof'
C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(227): error C2238: unexpected token(s) preceding ';'
C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(235): error C2988: unrecognizable template declaration/definition
C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(235): error C2059: syntax error: 'sizeof'
C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(236): error C2238: unexpected token(s) preceding ';'
C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(244): error C2988: unrecognizable template declaration/definition
C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(244): error C2059: syntax error: 'sizeof'
C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(245): error C2238: unexpected token(s) preceding ';'
C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(253): error C2988: unrecognizable template declaration/definition
C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(253): error C2059: syntax error: 'sizeof'
C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(254): error C2238: unexpected token(s) preceding ';'
C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(261): error C2988: unrecognizable template declaration/definition
C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(261): error C2059: syntax error: 'sizeof'
C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(262): error C2238: unexpected token(s) preceding ';'
C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(269): error C2988: unrecognizable template declaration/definition
C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(269): error C2059: syntax error: 'sizeof'
C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(270): error C2238: unexpected token(s) preceding ';'
C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(277): error C2988: unrecognizable template declaration/definition
C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(277): error C2059: syntax error: 'sizeof'
C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(278): error C2238: unexpected token(s) preceding ';'
C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(287): error C2988: unrecognizable template declaration/definition
C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(287): error C2059: syntax error: 'sizeof'
C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(288): error C2238: unexpected token(s) preceding ';'
C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(296): error C2988: unrecognizable template declaration/definition
C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(296): error C2059: syntax error: 'sizeof'
C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(297): error C2238: unexpected token(s) preceding ';'
C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(305): error C2988: unrecognizable template declaration/definition
C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(305): error C2059: syntax error: 'sizeof'
C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(306): error C2238: unexpected token(s) preceding ';'
C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(314): error C2988: unrecognizable template declaration/definition
C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(314): error C2059: syntax error: 'sizeof'
C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(315): error C2238: unexpected token(s) preceding ';'
C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(323): error C2988: unrecognizable template declaration/definition
C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(323): error C2059: syntax error: 'sizeof'
C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(324): error C2238: unexpected token(s) preceding ';'
C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(422): error C2059: syntax error: 'sizeof'
  C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu/atomic.cuh(615): note: see reference to class template instantiation 'stdgpu::atomic_ref<T>' being compiled
C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(423): error C2238: unexpected token(s) preceding ';'
C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(432): error C2059: syntax error: 'sizeof'
C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(434): error C2238: unexpected token(s) preceding ';'
C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(442): error C2059: syntax error: 'sizeof'
C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(444): error C2238: unexpected token(s) preceding ';'
C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(453): error C2988: unrecognizable template declaration/definition
C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(453): error C2059: syntax error: 'sizeof'
C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(454): error C2238: unexpected token(s) preceding ';'
C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(462): error C2988: unrecognizable template declaration/definition
C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(462): error C2059: syntax error: 'sizeof'
C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(463): error C2238: unexpected token(s) preceding ';'
C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(471): error C2988: unrecognizable template declaration/definition
C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(471): error C2059: syntax error: 'sizeof'
C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(472): error C2238: unexpected token(s) preceding ';'
C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(480): error C2988: unrecognizable template declaration/definition
C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(480): error C2059: syntax error: 'sizeof'
C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(481): error C2238: unexpected token(s) preceding ';'
C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(489): error C2988: unrecognizable template declaration/definition
C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(489): error C2059: syntax error: 'sizeof'
C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(490): error C2238: unexpected token(s) preceding ';'
C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(499): error C2988: unrecognizable template declaration/definition
C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(499): error C2059: syntax error: 'sizeof'
C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(500): error C2238: unexpected token(s) preceding ';'
C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(508): error C2988: unrecognizable template declaration/definition
C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(508): error C2059: syntax error: 'sizeof'
C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(509): error C2238: unexpected token(s) preceding ';'
C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(517): error C2988: unrecognizable template declaration/definition
C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(517): error C2059: syntax error: 'sizeof'
C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(518): error C2238: unexpected token(s) preceding ';'
C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(526): error C2988: unrecognizable template declaration/definition
C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(526): error C2059: syntax error: 'sizeof'
C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(527): error C2238: unexpected token(s) preceding ';'
C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(535): error C2988: unrecognizable template declaration/definition
C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(535): error C2059: syntax error: 'sizeof'
C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(536): error C2238: unexpected token(s) preceding ';'
C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(543): error C2988: unrecognizable template declaration/definition
C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(543): error C2059: syntax error: 'sizeof'
C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(544): error C2238: unexpected token(s) preceding ';'
C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(551): error C2988: unrecognizable template declaration/definition
C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(551): error C2059: syntax error: 'sizeof'
C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(551): fatal error C1003: error count exceeds 100; stopping compilation

We had a similar issue under linux. Thanks in advance for the support!

cmake: Backend-specific targets

The backend system is currently restricted to build and install the library only for a single backend.

Current behavior:

Set STDGPU_BACKEND to either STDGPU_BACKEND_CUDA (default) or STDGPU_BACKEND_OPENMP to control which backend will be used.
Build target stdgpu::stdgpu for the particular choice of STDGPU_BACKEND. Other backends will not be considered at all.

Proposed behavior:

Set STDGPU_ENABLE_<BACKEND> where <BACKEND> is one of CUDA, OPENMP.
Build backend targets stdgpu::<BACKEND> for each enabled backend using the backend-specific settings and dependency checks.
Define stdgpu::stdgpu as an alias target to stdgpu::<BACKEND> serving as a default which can be controlled via STDGPU_BACKEND to match current behavior.

This will make the system more flexible and allow users to choose freely between all enabled backends in their projects rather than being globally restricted to a single choice. Note that linking to more than one backend at the same time will be considered undefined behavior/ODR violation.

Furthermore, if only a single backend should be used at all times, this intend can also be expressed more clearly by linking to stdgpu::<BACKEND> rather than the configuration-dependent stdgpu::stdgpu target.

error: redefinition of ‘constexpr const bool stdgpu::numeric_limits<T>::is_specialized’

Ubuntu 20
Cuda 11.4

Library compiled by CUDA back-end successfully, examples work.
But when add STDGPU to small project, next error during compiling:

/usr/local/include/stdgpu/impl/limits_detail.h:73:42: error: redefinition of ‘constexpr const bool stdgpu::numeric_limits::is_specialized’

Raise minimum requirements to C++17

C++17 has been released 5 years ago and the default compilers on Ubuntu 20.04 and Ubuntu 22.04 all support this standard. Support in CUDA has been added with CUDA 11.0 (released March 2020) and a potential future SYCL backend will require C++17 anyways. Furthermore, Ubuntu 18.04 will reach EOL soon in April 2023 and should not longer be used. Therefore, it makes sense to raise the requirements which will also simplify and unblock future developments.

New Requirements

Drop Ubuntu 18.04 support (EOL in April 2023)
GCC 7 -> 9, Clang 6 -> 10, MSVC 19.20 already sufficient
CMake 3.15 -> 3.18 (CUDA support for C++17), also aligns with the requirements for the Clang CUDA compiler
thrust 1.9.2 -> thrust 1.9.9
CUDA 10.0 -> 11.0

Cleanups with C++17

Library Code
- ~~Use inline variables in limits -> unblocks header-only~~
- ~~Replace *::value by shorter *_v versions~~
- ~~Drop void_t in type_traits~~
- ~~Drop attribute.h and use native attributes~~
- ~~Drop STDGPU_HAS_CXX_17~~
- ~~Drop atomic backports in CUDA backend for CC 3.0 and lower (support removed with CUDA 11.0+)~~
- ~~Drop thrust workaround in memory~~
- ~~Simplify to_address implementation with if constexpr~~
- ~~Use std::byte over unsigned char in bit~~
- ~~Add noexcept qualifier to function signatures~~
CMake
- ~~Drop custom generation of compile flags for CUDA architectures~~
- ~~Drop custom FindCUDAToolkit.cmake module~~
- ~~Drop workarounds in CUDA set_device_flags.cmake~~

Container: Add allocator object member

The recent improvements regarding the allocation system (see #56, #58, #61) increase the conformance of the container implementations with the C++ standard. A further (small) step is the introduction of an allocator object member in the container classes. However, since this will be a breaking change, it is postponed to stdgpu 2.0.0.

Support shared library building

Hi thank you for this great project, it really hits the spot.
Is there any plan to support an option for shared library building on cmaking? The option would be helpful for integrating this repo to other projects (such as integrating to custom pytorch layer).

unordered_map creation crashing with cuda 12.x

Describe the bug

unordered_map creation crashes with cuda 12.x (testing with release build)

Steps to reproduce
In cu main()

stdgpu::unordered_map<int, int> map = stdgpu::unordered_map<int, int>::createDeviceObject(1);

Expected behavior
Abvious

Actual behavior

Exception thrown at 0x00007FFD29D140AC in CudaHelloWorld.exe: Microsoft C++ exception: thrust::system::system_error at memory location 0x0000009373EFF210.
Unhandled exception at 0x00007FFD29D140AC in CudaHelloWorld.exe: Microsoft C++ exception: thrust::system::system_error at memory location 0x0000009373EFF210.

System (please complete the following information):

OS: Windows 11
Compiler: Visual Studio 2022, c++17
Backend: CUDA, c++17
Library version: master

ROCm Backend to support AMD GPUs

At the time of this writing, stdgpu implements two different backends: CUDA (NVIDIA GPUs) and OpenMP (CPUs and GPUs). Extending this effort by introducing a ROCm backend (AMD GPUs) would further increase the number of use cases and improve the hardware support. This requires the following tasks to be completed:

CMake integration: A respective backend-specific configuration is needed to built the library with ROCm.
thrust support: Since version 1.9.3 of thrust currently is required, a proper equivalent version of rocThrust is needed. Alternatively, support for older thrust versions may be investigated as well to handle this task.
Backend-specific API implementation: The backend API of stdgpu is kept minimal to simplifiy the implementation of further backends. In particular, a proper implementation consists of platform-specific macros, memory allocation and copy functions (includes managed memory) as well as atomic functions (at least exchange and compare_exchange required).

Container: resize and copy support

Up to now, the container classes have a fixed capacity and are created using the non-standard createDeviceObject factory function. Furthermore, since ease of use in GPU kernels is considered a key feature, the copy constructors are currently restricted to perform only shallow copies rather than deep copies. This behavior makes the container still feel non-standard and unintuitive to some degree, especially for new users.

In order to fix both issues, the design of the copy operations needs to be revised to match the STL more closely. At first glance, this seems to be an easy task:

Define the copy constructors and copy assignment operators to perform deep copies.
Provide a reference_wrapper<T> class which can be used on the GPU.

However, objects (or at least their states) need to be copied from CPU to GPU memory in order to allow for the proper execution of an operation. Since we want to make the containers work for as many backends and use cases as possible, we cannot make any assumptions how this transfer will be performed or whether this really requires calling the copy constructor or not. reference_wrapper<T> does not solve this problem since it points to the original object which lives in CPU memory.

Therefore, the current proposal would be:

Provide a shallow_copy_wrapper<T> class (suggestions for a better name are welcome) which wraps the object state. This class is copyable such that the object state can be easily passed to the GPU similar to reference_wrapper<T>. However, if the state of the original object is changed, e.g. due to a resize operation, this change will not be visible or propagated to the wrapper invalidating it. Thus, we trade object consistency with GPU support.
Define the copy constructors and copy assignment operators to perform deep copies, but restrict them to be callable from the host only.
Clearly document that shallow_copy_wrapper<T> is only intended to allow crossing memory boundaries and to enable container usage on the GPU. For CPU usage, std::reference_wrapper<T> should be used instead if required.
Deprecate/remove the createDeviceObject and destroyDeviceObject factory functions.

This change will break existing usage within kernels and thrust algorithms (functors). A reasonable transition strategy would be to introduce shallow_copy_wrapper<T> in the last minor release of version 1 (which might be 1.3.0) and provide an option to disable the copy constructor and copy assignment operators. This way, users could start porting to the new copy model and will only need to move away from the factory functions in version 2.0.0.

Implement at() in terms of operator[] to avoid bound checks in deque and vector

The containers in the STL that support random access implement two different functions to access specific elements in the container:

operator[]
at

operator[] just accesses the element without bound checks, while at perfoms the same operation, but doing bound checks and throwing an out_of_bounds exception when the index is out of bounds.

I propose that the containers (deque and vector basically) implement at() in terms of operator[], and perfom bound checks only when calling at(). That way they're more compliant to the STL ones. This is my proposed solution for vector (for deque is the same idea):

template <typename T>
inline STDGPU_DEVICE_ONLY typename vector<T>::reference
vector<T>::at(const vector<T>::index_type n)
{
    return const_cast<vector<T>::reference>(static_cast<const vector<T>*>(this)->at(n));
}


template <typename T>
inline STDGPU_DEVICE_ONLY typename vector<T>::const_reference
vector<T>::at(const vector<T>::index_type n) const
{
    STDGPU_EXPECTS(0 <= n);
    STDGPU_EXPECTS(n < size());
    STDGPU_EXPECTS(occupied(n));

    return this->operator[](n);
}


template <typename T>
inline STDGPU_DEVICE_ONLY typename vector<T>::reference
vector<T>::operator[](const vector<T>::index_type n)
{
    return _data[n];
}


template <typename T>
inline STDGPU_DEVICE_ONLY typename vector<T>::const_reference
vector<T>::operator[](const vector<T>::index_type n) const
{
    return _data[n];
}

Failed to build project with stdgpu. Error: expected unqualified-id before ‘sizeof’

Hello, I'm trying to embed stdgpu into my project. I write the cmake file as the tutorial and cmake built successfully.

However, when I tried to build my own project, it failed and raised many errors.

[ 70%] Built target stdgpu
[ 80%] Built target foo
[ 90%] Building CXX object CMakeFiles/parallel_cache.dir/main.cpp.o
In file included from /home/yanglinzhuo/parallel_cache/stdgpu/src/stdgpu/../stdgpu/platform.h:34:0,
                 from /home/yanglinzhuo/parallel_cache/stdgpu/src/stdgpu/../stdgpu/iterator.h:33,
                 from /home/yanglinzhuo/parallel_cache/stdgpu_test.cuh:5,
                 from /home/yanglinzhuo/parallel_cache/main.cpp:3:
/home/yanglinzhuo/parallel_cache/stdgpu/src/stdgpu/../stdgpu/cuda/platform.h:48:37: error: expected unqualified-id before ‘sizeof’
     #define STDGPU_CUDA_DEVICE_ONLY sizeof("STDGPU ERROR: Wrong compiler detected! Device-only functions must be compiled with the device compiler!")
                                     ^
/home/yanglinzhuo/parallel_cache/stdgpu/src/stdgpu/../stdgpu/cuda/platform.h:48:37: note: in definition of macro ‘STDGPU_CUDA_DEVICE_ONLY’
     #define STDGPU_CUDA_DEVICE_ONLY sizeof("STDGPU ERROR: Wrong compiler detected! Device-only functions must be compiled with the device compiler!")
                                     ^~~~~~
/home/yanglinzhuo/parallel_cache/stdgpu/src/stdgpu/../stdgpu/platform.h:80:34: note: in expansion of macro ‘STDGPU_DETAIL_CAT2_DIRECT’
 #define STDGPU_DETAIL_CAT2(A, B) STDGPU_DETAIL_CAT2_DIRECT(A, B)
                                  ^~~~~~~~~~~~~~~~~~~~~~~~~
...

I omit many error lines because they are similar. The main error here is error: expected unqualified-id before ‘sizeof’.

I'm confused with these errors and have no ideas how to fix them.

Because I'm new to compile with cmake, so I think there maybe some mistakes in my cmake file.

The following is y project's sructure:

CMakeLists.txt
main.cpp
stdgpu_test.cuh
stdgpu/

And the following is my cmake file:

cmake_minimum_required(VERSION 3.18)

project(parallel_cache)

set(CMAKE_CXX_STANDARD 14)
set(CMAKE_CXX_STANDARD_REQUIRED ON)

set(Torch_DIR /usr/local/libtorch/share/cmake/Torch)  # My libtorch path
find_package(Torch REQUIRED)

# Exclude the examples from the build
set(STDGPU_BUILD_EXAMPLES OFF CACHE INTERNAL "")
# Exclude the tests from the build
set(STDGPU_BUILD_TESTS OFF CACHE INTERNAL "")
add_subdirectory(stdgpu)
set_property(TARGET stdgpu PROPERTY CUDA_ARCHITECTURES 60)
add_library(foo stdgpu)
set_target_properties(foo PROPERTIES LINKER_LANGUAGE CXX)
target_link_libraries(foo PUBLIC stdgpu::stdgpu)

add_executable(${PROJECT_NAME} "main.cpp" "stdgpu_test.cuh")

target_link_libraries(parallel_cache PUBLIC "${TORCH_LIBRARIES}")
target_link_libraries(parallel_cache PUBLIC foo)
set_property(TARGET parallel_cache PROPERTY CXX_STANDARD 14)

My system configuration is:

OS: Ubuntu 18.04
Compiler: GCC 7.5.0
Backend: CUDA
Library version: 1.3.0

Wish for any help. Thanks.

About stdgpu::queue size < 0

Hello, I found that sometimes queue.size() < 0, I guess it is because more than one threads pop the empty queue at the same time.
I can see this warning when program running:

stdgpu/src/stdgpu/impl/deque_detail.cuh

Lines 434 to 441 in d9a4587

    
           if (current_size < 0) 
        
           { 
        
               printf("stdgpu::deque::size : Size out of bounds: %" STDGPU_PRIINDEX " not in [0, %" STDGPU_PRIINDEX 
        
                      "]. Clamping to 0\n", 
        
                      current_size, 
        
                      capacity()); 
        
               return 0; 
        
           }

Even worse, I found that the queue will be invalid(queue.valid()=false) after the CUDA kernel, which causes unexpected result when I try to reuse it in another kernel like:

__global__ void kernel1(){
// push and pop operation
}

__global__ void kernel2(){
// push and pop operation
}

int main() {
kernel1<<<>>>(queue);
kernel2<<<>>>(queue);
}

What should I do? Thank in advance!

About _excess_list_positions used in unordered_set

Hello, I use unordered_set with the vector and queue in my code like:

// do something...
element = queue.pop()
// do something...
vector.push_back(...);
// do something...
auto dup_res = unordered_set.insert(...);
if (dup_res.second)
    queue.push(...);

And I found that in my cuda kernel, sometimes the result of insert operation of unordered_set will be false because _excess_list_positions is empty:

stdgpu/src/stdgpu/impl/unordered_base_detail.cuh

Lines 931 to 953 in 00820f9

    
           template <typename Key, typename Value, typename KeyFromValue, typename Hash, typename KeyEqual, typename Allocator> 
        
           inline STDGPU_DEVICE_ONLY 
        
                   pair<typename unordered_base<Key, Value, KeyFromValue, Hash, KeyEqual, Allocator>::iterator, bool> 
        
                   unordered_base<Key, Value, KeyFromValue, Hash, KeyEqual, Allocator>::insert( 
        
                           const unordered_base<Key, Value, KeyFromValue, Hash, KeyEqual, Allocator>::value_type& value) 
        
           { 
        
               pair<iterator, operation_status> result(end(), operation_status::failed_collision); 
        
               while (true) 
        
               { 
        
                   if (result.second == operation_status::failed_collision && !full() && !_excess_list_positions.empty()) 
        
                   { 
        
                       result = try_insert(value); 
        
                   } 
        
                   else 
        
                   { 
        
                       break; 
        
                   } 
        
               } 
        
               return result.second == operation_status::success ? pair<iterator, bool>(result.first, true) 
        
                                                                 : pair<iterator, bool>(result.first, false); 
        
           }

What I have done: I add the following code snippet before return, and I sometimes get the message "list empty":

    if (result.second == operation_status::failed_collision)
    {
        if (full())
            std::printf("full \n");
        if (_excess_list_positions.empty())
            std::printf("list empty\n");
    }
    if (result.second == operation_status::failed_no_action_required)
    {
        std::printf("no_action_required");
    }

I am confused about this kind of failure, what does this mean and what should I do? Thanks in advance!

Cannot include stdgpu to an existing OpenMP/HIP project

I'm working on an OpenMP/HIP code and trying to include stdgpu as a subproject. What I need is to

compile stdgpu with hipcc/hcc,
compile my HIP sources with hipcc/hcc,
compile the rest with gcc/clang, and
leave all the linking things to gcc/clang

It seems like stdgpu needs to pass -DCMAKE_CXX_COMPILER=hcc to cmake to build HIP backend but hcc doesn't support -fopenmp yet. So the OpenMP libraries are missing when a find_package(OpenMP) is encountered. Is there any way to workaround this?

I've tried to set the compiler to clang. The compiler complains that -hc is an unknown argument. It looks like this is an hcc argument required by the rocthrust::rocthrust target.

clang++  -DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_SYSTEM_HIP -Dstdgpu_EXPORTS
...
-isystem /opt/rocm-3.3.0/hip/include -isystem /opt/rocm/include  -stdlib=libc++ -O3 -DNDEBUG -fPIC   -hc -fPIC -std=c++14 -o CMakeFiles/stdgpu.dir/impl/iterator.cpp.o -c /external/stdgpu/src/stdgpu/impl/iterator.cpp
clang-9: error: unknown argument: '-hc'

Unordered_maps with complex containers

Don't know if this feature already exists but do unordered_maps support structures other than <int,int> like <int,stdgpu::unordered_set>. Similarly can unordered_set support <pair<int,int>> using a hashing function from boost::hash<pair<int, int>> or vectors using your own hashing function for vectors.

This library is excellent btw! Solves so many issues with support for STL like containers for GPU :)

Solve compilation error in Ubuntu 22.04 and friends

Amazing work! This is not a bug report, but just leaving it here in case someone needs this in the future

Problem

For those trying to compile this on Ubuntu 22.04 and friends alike, I managed to solve the following error:

/usr/include/c++/11/bits/std_function.h:435:145: error: parameter packs not expanded with ‘...’:
  435 |         function(_Functor&& __f)
      |

The problem is that the default C++ compiler for Ubuntu 22.04 has a problem making friends with nvcc, more info on the related issues at the bottom

Solution

Install an older compiler and tell nvcc which is the host-compiler using CMake

sudo apt install g++10
cmake -DCMAKE_CUDA_HOST_COMPILER=/usr/bin/g++-10 -Bbuild

	if (current_size < 0)
	{
	printf("stdgpu::deque::size : Size out of bounds: %" STDGPU_PRIINDEX " not in [0, %" STDGPU_PRIINDEX
	"]. Clamping to 0\n",
	current_size,
	capacity());
	return 0;
	}

	template <typename Key, typename Value, typename KeyFromValue, typename Hash, typename KeyEqual, typename Allocator>
	inline STDGPU_DEVICE_ONLY
	pair<typename unordered_base<Key, Value, KeyFromValue, Hash, KeyEqual, Allocator>::iterator, bool>
	unordered_base<Key, Value, KeyFromValue, Hash, KeyEqual, Allocator>::insert(
	const unordered_base<Key, Value, KeyFromValue, Hash, KeyEqual, Allocator>::value_type& value)
	{
	pair<iterator, operation_status> result(end(), operation_status::failed_collision);

	while (true)
	{
	if (result.second == operation_status::failed_collision && !full() && !_excess_list_positions.empty())
	{
	result = try_insert(value);
	}
	else
	{
	break;
	}
	}

	return result.second == operation_status::success ? pair<iterator, bool>(result.first, true)
	: pair<iterator, bool>(result.first, false);
	}

stotko / stdgpu Goto Github PK

stdgpu's People

Contributors

Stargazers

Watchers

Forkers

stdgpu's Issues

New Requirements

Cleanups with C++17

Problem

Solution

Related issues

Recommend Projects

Recommend Topics

Recommend Org