
cucollections's Introduction

cuCollections

Examples | Doxygen Documentation (TODO)

cuCollections (cuco) is an open-source, header-only library of GPU-accelerated, concurrent data structures.

Similar to how Thrust and CUB provide STL-like, GPU-accelerated algorithms and primitives, cuCollections provides STL-like concurrent data structures. cuCollections is not a one-to-one, drop-in replacement for STL data structures like std::unordered_map. Instead, it provides functionally similar data structures tailored for efficient use with GPUs.

Development Status

cuCollections is still under heavy development. Users should expect breaking changes and refactoring to be common.

Major Updates

01/08/2024 Deprecated the experimental namespace

01/02/2024 Moved the legacy static_map to cuco::legacy namespace

Getting cuCollections

cuCollections is header only and can be incorporated manually into your project by downloading the headers and placing them into your source tree.

Adding cuCollections to a CMake Project

cuCollections is designed to be easy to include within another CMake project. The CMakeLists.txt exports a cuco target that can be linked¹ into a target to set up the include directories, dependencies, and compile flags necessary to use cuCollections in your project.

We recommend using CMake Package Manager (CPM) to fetch cuCollections into your project. With CPM, getting cuCollections is easy:

cmake_minimum_required(VERSION 3.23.1 FATAL_ERROR)

include(path/to/CPM.cmake)

CPMAddPackage(
  NAME cuco
  GITHUB_REPOSITORY NVIDIA/cuCollections
  GIT_TAG dev
  OPTIONS
     "BUILD_TESTS OFF"
     "BUILD_BENCHMARKS OFF"
     "BUILD_EXAMPLES OFF"
)

target_link_libraries(my_library cuco)

This will take care of downloading cuCollections from GitHub and making the headers available in a location that can be found by CMake. Linking against the cuco target will provide everything needed for cuco to be used by the my_library target.

¹ cuCollections is header-only, so there is no binary component to "link" against. The linking terminology comes from CMake's target_link_libraries, which is used even for header-only library targets.

Requirements

  • nvcc 11.5+
  • C++17
  • Volta+
    • Pascal is partially supported. Any data structures that require blocking algorithms are not supported. See libcu++ documentation for more details.

Dependencies

cuCollections's external dependencies are handled automatically: its CMake script first searches the system for the required libraries and, if they are not found, fetches them from GitHub. No action is required from the user.

Building cuCollections

Since cuCollections is header-only, there is nothing to build to use it.

To build the tests, benchmarks, and examples:

cd $CUCO_ROOT
mkdir -p build
cd build
cmake .. # configure
make # build
ctest --test-dir tests # run tests

Binaries will be built into:

  • build/tests/
  • build/benchmarks/
  • build/examples/

Build Script:

Alternatively, you can use the build script located at ci/build.sh. Calling this script with no arguments will trigger a full build which will be located at build/local.

cd $CUCO_ROOT
ci/build.sh # configure and build
ctest --test-dir build/local/tests # run tests

For a comprehensive list of all available options, along with descriptions and examples, run ci/build.sh -h.

Code Formatting

By default, cuCollections uses pre-commit.ci along with mirrors-clang-format to automatically format the C++/CUDA files in a pull request. Users should enable the Allow edits by maintainers option to get auto-formatting to work.

Pre-commit hook

Optionally, you may wish to set up a pre-commit hook to automatically run clang-format when you make a git commit. This can be done by installing pre-commit via conda or pip:

conda install -c conda-forge pre_commit
pip install pre-commit

and then running:

pre-commit install

from the root of the cuCollections repository. Now code formatting will be run each time you commit changes.

You may also wish to manually format the code:

pre-commit run clang-format --all-files

Caveats

mirrors-clang-format guarantees the correct version of clang-format and avoids version mismatches. Users should NOT use clang-format directly on the command line to format the code.

Documentation

Doxygen is used to generate HTML pages from the C++/CUDA comments in the source code.

The example

The following example covers most of the Doxygen block comment and tag styles for documenting C++/CUDA code in cuCollections.

/**
 * @file source_file.cpp
 * @brief Description of source file contents
 *
 * Longer description of the source file contents.
 */

/**
 * @brief Short, one sentence description of the class.
 *
 * Longer, more detailed description of the class.
 *
 * A detailed description must start after a blank line.
 *
 * @tparam T Short description of each template parameter
 * @tparam U Short description of each template parameter
 */
template <typename T, typename U>
class example_class {

  int get_my_int();             ///< Simple members can be documented like this
  void set_my_int( int value ); ///< Try to use descriptive member names

  /**
   * @brief Short, one sentence description of the member function.
   *
   * A more detailed description of what this function does and what
   * its logic does.
   *
   * @param[in]     first  This parameter is an input parameter to the function
   * @param[in,out] second This parameter is used both as an input and output
   * @param[out]    third  This parameter is an output of the function
   *
   * @return The result of the complex function
   */
  T complicated_function(int first, double* second, float* third)
  {
      // Do not use doxygen-style block comments
      // for code logic documentation.
  }

 private:
  int my_int;                ///< An example private member variable
};

Doxygen style check

cuCollections also uses Doxygen as a documentation linter. To check the Doxygen style locally, run

./ci/pre-commit/doxygen.sh

Data Structures

We plan to add many GPU-accelerated, concurrent data structures to cuCollections. The current flagship data structures are variants of hash tables.

static_set

cuco::static_set is a fixed-size container that stores unique elements in no particular order. See the Doxygen documentation in static_set.cuh for more detailed information.

Examples:

static_map

cuco::static_map is a fixed-size hash table using open addressing with linear probing. See the Doxygen documentation in static_map.cuh for more detailed information.

Examples:

static_multimap

cuco::static_multimap is a fixed-size hash table that supports storing equivalent keys. It uses double hashing by default and supports switching to linear probing. See the Doxygen documentation in static_multimap.cuh for more detailed information.

Examples:

dynamic_map

cuco::dynamic_map links together multiple cuco::static_maps to provide a hash table that can grow as key-value pairs are inserted. It currently only provides host-bulk APIs. See the Doxygen documentation in dynamic_map.cuh for more detailed information.

Examples:

distinct_count_estimator

cuco::distinct_count_estimator implements the well-established HyperLogLog++ algorithm for approximating the count of distinct items in a multiset/stream.

Examples:


cucollections's Issues

[FEA] dynamic_map device-side API

Currently, the dynamic_map has only a host-side API, which supports insert, find, contains. It would be useful to have an additional, advanced-usage API supporting __device__ versions of insert, find, and contains.

There are some important considerations in designing this API:

  1. The map will not be able to dynamically resize when using the device-side API, because control must return to the host to perform a resize. Nothing prevents the user from inserting more keys than the capacity of the map.
  2. The size field of the dynamic_map, as well as that of its constituent static_map objects, needs to be updated as the device-side insert is used. This could potentially be done automatically, such as with an RAII wrapper around the device_mutable_view object. It could also be done manually by the user.

[FEA] Create repository for hash map benchmarking

Is your feature request related to a problem? Please describe.

Any hash map benchmarking should be in a single place that is easy to find.

Describe the solution you'd like

Create a new repository for hash map benchmarking.

Additional context

What should the new repo be called? I assume we don't want to limit it to just hash map benchmarking.

Maybe something like cuCollectionsBench?

[FEA] Hash map design that supports reduce by key

Is your feature request related to a problem? Please describe.

I would like to be able to implement a hash-based reduce-by-key operation. A thrust::reduce_by_key operation requires first doing a sort_by_key. This means 2 passes through the input data and 2n memory overhead. A hash-based implementation requires only a single pass and n/0.7 memory overhead (assuming a 70% load factor).

Trying to implement a reduce-by-key with today's static_map implementation is effectively impossible.

There are two ways one could implement a hash-based reduce-by-key (assume the reduction is a sum).

The first assumes that insert returns an iterator to the slot where the insert occurred, or to the slot that contains an equivalent key:

template <typename Map, typename Key, typename Value>
__device__ void hash_rbk(Map m, Key key_begin, Key key_end, Value value_begin){
   // assume `insert` returns pair<iterator, bool>, like std::unordered_map
   auto found = m.insert(key_begin + tid, value_begin + tid);

   // key already exists, accumulate with the existing value
   if(not found.second){
      auto& found_pair = *(found.first);
      found_pair.second.fetch_add(*(value_begin + tid));
   }
}

The second assumes you can do concurrent insert and find:

template <typename Map, typename Key, typename Value>
__device__ void hash_rbk(Map m, Key key_begin, Key key_end, Value value_begin){

   auto found = m.find(key_begin + tid); // look if the key exists

   // key already exists, accumulate with the existing value
   if(found != m.end()){
      found->second.fetch_add(*(value_begin + tid));
   } else { // key doesn't exist
      auto inserted = m.insert(key_begin + tid, value_begin + tid);

      if(not inserted.second){ // someone else already did the insert, accumulate with their value
           inserted.first->second.fetch_add(*(value_begin + tid));
      }
   }
}

However, both of these are impossible because:

  1. The insert function cannot return an iterator (due to the potential for a race condition caused by the back-to-back CAS algorithm), and instead returns a bool.
    • This results from the fact that we use pair<atomic<K>, atomic<V>> as the slot storage type and use back-to-back CAS. If we instead used atomic<pair<K,V>>, we could return an iterator.
  2. You cannot do concurrent insert/find operations.
    • This results from the fact that we use pair<atomic<K>, atomic<V>> as the slot storage type and use back-to-back CAS. If we instead used atomic<pair<K,V>>, we could support concurrent insert and find.

Describe the solution you'd like

Explore a hash map implementation that allows implementing a reduce by key.

One option is to use an atomic<pair<K,V>> implementation, but that has its own drawbacks. See https://docs.google.com/presentation/d/1_UJlQoDc985sj03grMroB2zcCbiiC7-ua_1Eqm4qrRU/edit?usp=sharing

[BUG] A few simple issues with static_map

Trying to use coco::static_map in cugraph, I found several little issues:

  1. device_view::find returns an iterator. It would be nice to have a const version that returned a const_iterator so the function can be used in a const context
  2. Check the visibility of the methods on device_view_base, some things seem to be protected that should probably be public. In particular, I need to be able to iterate over the slots. The begin iterator is protected (therefore not visible to the outside), while the end iterator is public. They should either both be protected and a mechanism in the derived class uses them, or they should both be public (I suspect the latter).
  3. The example in the static_map doxygen comments indicates you should create a thrust::pair. However, pair_atomic_type is set to coco::pair.

[FEA] `static_map::insert_if`

Is your feature request related to a problem? Please describe.
Given a set of key/value pairs, insert pair[i] if the predicate applied to stencil[i] returns true. The API should be similar to static_multimap::insert_if:

  /**
   * @brief Inserts key/value pairs in the range `[first, last)` if `pred`
   * of the corresponding stencil returns true.
   *
   * The key/value pair `*(first + i)` is inserted if `pred( *(stencil + i) )` returns true.
   *
   * @tparam InputIt Device accessible random access input iterator where
   * `std::is_convertible<std::iterator_traits<InputIt>::value_type,
   * static_map<K, V>::value_type>` is `true`
   * @tparam StencilIt Device accessible random access iterator whose value_type is
   * convertible to Predicate's argument type
   * @tparam Predicate Unary predicate callable whose return type must be convertible to `bool` and
   * argument type is convertible from `std::iterator_traits<StencilIt>::value_type`.
   * @param first Beginning of the sequence of key/value pairs
   * @param last End of the sequence of key/value pairs
   * @param stencil Beginning of the stencil sequence
   * @param pred Predicate to test on every element in the range `[stencil, stencil +
   * std::distance(first, last))`
   * @param stream CUDA stream used for insert
   */
  template <typename InputIt, typename StencilIt, typename Predicate>
  void insert_if(
    InputIt first, InputIt last, StencilIt stencil, Predicate pred, cudaStream_t stream = 0);

Additional context
Required by cudf semi/anti join refactoring

[BUG] Find does not return the empty sentinel value for missing or empty sentinel keys

Describe the bug

The documentation for `find` states that if the key `*(first + i)` exists in the map, its associated value is copied to `*(output_begin + i)`; else, the empty value sentinel is copied. However, for missing keys, it seems like garbage values are returned.

Steps/Code to reproduce bug

writeBuffer[threadIdx.x / tile_size] = found->second.load(cuda::std::memory_order_relaxed);

If found is view.end(), found->second.load(cuda::std::memory_order_relaxed) loads from slots[capacity] and this access is out-of-bound if the number of allocated slots is capacity.

I think this should be

found == view.end() ? view.get_empty_value_sentinel() : found->second.load(cuda::std::memory_order_relaxed)
or
we need to allocate capacity + 1 slots and store the empty sentinel (key, value) pairs in the last slot.

Clean up README

Remove RAPIDS boilerplate from README and populate with info for cuCollections.

[FEA] Batched reduce_by_key benchmark for dynamic hash map

Is your feature request related to a problem? Please describe.

Similar to the reduce-by-key benchmark for the static map, it would be nice to have a similar benchmark for the dynamic map to showcase a problem that the dynamic map can solve.

The idea would be to emulate a "streaming" workload where you are processing data in batches and building up a single reduce-by-key result across all of the batches processed thus far. A dynamic map is perfect for this scenario, as you may know neither the number of batches nor the size of each batch in advance.

So the benchmark would generate batches of key/value pairs and then perform a bulk insert_reduce_by_key operation for each batch.

[BUG] A few small bugs

  1. In static_map_test.cu, in the section "Keys are all found after inserting many keys", there should be a call to cudaDeviceSynchronize() after the thrust::for_each is called.

  2. In static_map.cuh, dynamic_map is declared twice as a friend class on lines 104 and 106.

[FEA] Refactor of open address data structures

Is your feature request related to a problem? Please describe.

There is a significant amount of redundancy among the static_map/static_multimap/static_reduction_map classes. This is a large maintenance overhead and means optimizations made to one data structure do not translate to the others.

Furthermore, there are several configuration options we'd like to enable, like using AoS vs. SoA layouts, scalar vs. CG operations, etc.

I'd also like to enable adding static_set and static_multiset classes that could share the same backend.

Describe the solution you'd like

List of things I'd like to address:

  • Use atomic_ref when possible (4B/8B key/value types) #183
  • Falls back on atomic when necessary (<4B, >8B key/value types)
    • For <4B, we should probably just widen them ourselves and still use atomic_ref instead of atomic.
  • Eliminates redundancy among static_map/reduction_map/multimap
  • Enables AoS vs SoA layout (#103)
  • Enables statically sized device views
    • We should use a pattern like std::span with std::dynamic_extent to support both dynamic and statically sized capacities.
  • Enables adding static_set and static_multiset
  • Supports the various insert schemes: packed/back2back/cas+dependent write
  • Switch between scalar/CG operations
  • Stream support everywhere (#65)
  • Consistent use of bitwise_equal
  • Asynchronous size computation (#102, #237 )
  • rehashing (#21)

My current thinking is to create an open_address_impl class that provides an abstraction for a logical array of "slots" and exposes operations on those slots. All the core logic and switching for things like AoS/SoA, atomic_ref/atomic can/should be implemented in this common impl class.

[FEA] Remove dependency on cuDF

Is your feature request related to a problem? Please describe.

The map structures currently have extraneous dependencies on libcudf. These should be removed.

Describe the solution you'd like

Remove all includes of libcudf headers and remove libcudf code from the repo.

[FEA] Add option to report hash collisions

Is your feature request related to a problem? Please describe.
Hash collisions impact performance of hash map insert and probe. It will be useful to find a way to report the number of collisions for static_map and dynamic_map to help assess performance of insert or probe when developers are evaluating perf on their dataset. It would also allow developers to tune the hash function or occupancy to reduce collisions and find the right balance for their scenario.

Describe the solution you'd like
The map can have an optional template argument that specifies whether to count collisions (disabled by default), so it's opt-in and doesn't impact perf for the standard case. The number of collisions would be stored in a class variable accessible via something like get_num_collisions(). Implementation: allocate memory for a device variable uint64_t* d_num_collisions, update all insert and find device code to do atomicAdd(d_num_collisions, 1), then copy the contents to the host variable after the kernel. Here is where we could count the collisions for the non-CG static_map::find:

current_slot = next_slot(current_slot);

The atomic will be guarded by the template argument check, so should only impact perf if we're asked to count collisions. Similarly, would have to update all other insert and find functions.

Describe alternatives you've considered
None.

Additional context
None.

[BUG] find hangs with 2^30 inputs

Describe the bug
The code below hangs.

  rmm::device_uvector<vertex_t> keys(100, handle.get_stream());
  thrust::sequence(rmm::exec_policy(handle.get_stream())->on(handle.get_stream()), keys.begin(), keys.end(), vertex_t{0});
  rmm::device_uvector<vertex_t> queries(1073741824, handle.get_stream());
  thrust::sequence(rmm::exec_policy(handle.get_stream())->on(handle.get_stream()), queries.begin(), queries.end(), vertex_t{0});
  thrust::transform(rmm::exec_policy(handle.get_stream())->on(handle.get_stream()), queries.begin(), queries.end(), queries.begin(), []__device__(auto val) { return val % 100; });
  handle.get_stream_view().synchronize();

  cuco::static_map<vertex_t, vertex_t> test_map{
    static_cast<size_t>(static_cast<double>(keys.size()) / 0.7),
    invalid_vertex_id<vertex_t>::value,
    invalid_vertex_id<vertex_t>::value};
  auto pair_first = thrust::make_transform_iterator(
    thrust::make_zip_iterator(
      thrust::make_tuple(keys.begin(), thrust::make_counting_iterator(vertex_t{0}))),
    [] __device__(auto val) {
      return thrust::make_pair(thrust::get<0>(val), thrust::get<1>(val));
    });
  test_map.insert(pair_first, pair_first + keys.size());
  test_map.find(queries.begin(), queries.end(), queries.begin());

auto const grid_size = (tile_size * num_keys + stride * block_size - 1) / (stride * block_size);

If num_keys = 2^30, num_keys * tile_size becomes 2^32, and gridDim.x * blockDim.x becomes 2^32 overflowing 4 byte unsigned integer.

The code below can overflow if the input size is >= 2^30:

auto tid = blockDim.x * blockIdx.x + threadIdx.x;

key_idx += (gridDim.x * blockDim.x) / tile_size;

e.g. gridDim.x * blockDim.x becomes 2^32 which is 0 with 4 byte unsigned int; and this leads to infinite looping.

Upcasting blockDim.x to uint64_t before multiplying with blockIdx.x or gridDim.x could be one solution to fix this.

[FEA] Support for non-arithmetic keys

When inserting a key into an empty slot, insert uses compare_exchange_strong, which uses a bitwise equality check between empty_key_sentinel and the contents of the slot. Consequently, find must also use a bitwise equality check between empty_key_sentinel and the contents of the slot when this comparison is performed. This check can be performed using operator== for arithmetic types, but a custom high performance key_equal function is needed to support the bitwise equality check for all types, including non-arithmetic types.

[FEA] `static_multimap`

Is your feature request related to a problem? Please describe.

static_map only allows a single instance of each key.

Describe the solution you'd like

I would like an extension of static_map that supports duplicate keys.

Additional context

See https://arxiv.org/pdf/2009.07914.pdf for a discussion of multimap implementation ideas.

[FEA] Add allocator interface to static/dynamic map

Is your feature request related to a problem? Please describe.

I'd like to be able to customize how the memory is allocated for static_map and dynamic_map.

Describe the solution you'd like

Add an allocator interface to the static/dynamic_map constructors.

[FEA] Use heterogeneous lookup instead of per-function key equal/hash

Is your feature request related to a problem? Please describe.

Sometimes we want to insert/find key types that may be different from static_map::key_type. To support that today, we allow specifying custom and independent key_equal and hash operations for insert and find. This is potentially quite problematic, as it allows passing in different and conflicting notions of hashing and equality that could be incompatible and lead to undefined behavior.

  template <typename InputIt, typename Hash = MurmurHash3_32<key_type>,
            typename KeyEqual = thrust::equal_to<key_type>>
  void insert(InputIt first, InputIt last, Hash hash = Hash{},
              KeyEqual key_equal = KeyEqual{});

  template <typename InputIt, typename OutputIt,
            typename Hash = MurmurHash3_32<key_type>,
            typename KeyEqual = thrust::equal_to<key_type>>
  void find(InputIt first, InputIt last, OutputIt output_begin,
            Hash hash = Hash{}, KeyEqual key_equal = KeyEqual{}) noexcept

Describe the solution you'd like

A cleaner way to support the use case of doing insert/find operations with different key types is via "heterogeneous lookup".

In this way, the KeyEqual and Hash types would be specified only once, as part of the static_map class template parameters, but their call operators would be allowed to be templates. This should require explicit user opt-in via the is_transparent trait.

#26 (comment)

[FEA] Add option to select between Struct of Arrays vs Array of Structs

Is your feature request related to a problem? Please describe.

All of the current hash map implementations use an Array of Structs storage implementation.

There are instances where it could be nice to have a Struct of Arrays layout.

Describe the solution you'd like

Provide some way to configure the storage layout. Probably via some template parameter.

We'd want to avoid as much duplication as possible between the AoS vs SoA implementations, so would require some clever abstraction.

[BUG] `reinterpret_cast` used in constexpr function `cuda_memcmp`

Describe the bug
In cuda_memcmp, which is declared constexpr, two reinterpret_cast calls appear, as shown here. The C++ standard does not allow this (see item 17 here), and when clang is used as the host compiler, it (correctly, I believe) throws an error.

Steps/Code to reproduce bug

mkdir build
cd build
cmake -DCMAKE_CUDA_HOST_COMPILER=/usr/bin/clang++ -GNinja ..
ninja

In the output from this build attempt, we see that cuda_memcmp is identified as an invalid constexpr function because it does not return a constant expression.

Expected behavior
GCC allows this code, but I would expect/hope that compilation would succeed with the (in this case) more conformant clang compilation rules

Environment details (please complete the following information):

  • Environment location: Bare-metal
  • Method of PROJECT install: from source

Additional context
I discovered this while trying to set up IWYU for cuml (and hopefully other RAPIDS projects eventually). This requires compilation with clang.

[DOC] It is unclear whether find operation supports in-place mode or not

Report needed documentation

Report needed documentation
https://github.com/NVIDIA/cuCollections/blob/dev/include/cuco/static_map.cuh#L201
This documentation doesn't say whether InputIt first and OutputIt output_begin can coincide. It seems like this is supported based on the current implementation, but I am not sure whether the API intends to support this in the future.

Describe the documentation you'd like
Explicitly say whether this is supported or not.

For example, thrust is explicit about whether in-place operation is supported or not.

thrust::transform
https://thrust.github.io/doc/group__transformations_gacbd546527729f24f27dc44e34a5b8f73.html
The input and output sequences may coincide, resulting in an in-place transformation

thrust::reduce_by_key
https://thrust.github.io/doc/group__reductions_gad5623f203f9b3fdcab72481c3913f0e0.html#gad5623f203f9b3fdcab72481c3913f0e0
The input ranges shall not overlap either output range.

[BUG] Back-to-back CAS algorithm is incompatible with Pascal

We use a "back-to-back" CAS algorithm for inserting a key/value pair into the static_map. This technique relies on the Volta thread model of independent thread forward progress and is therefore incompatible with Pascal.

We should specialize the implementation for 4B/4B key/values (or less) to use a single CAS for the key/value pair.

We also need to add the appropriate error checking to prevent someone trying to use 8B/8B key values on Pascal.

[FEA] Add a `find_if_exists` function to the static map implementation

Is your feature request related to a problem? Please describe.

Given a set of probe keys, I would like to gather all of the matching <key,value> pairs from a map.

@nsakharnykh @Nicolas-Iskos @allisonvacanti and myself spent some time discussing how to handle the fact that the size of the output cannot be known a priori in the general case. We decided we would allow a user to specify an output buffer that is potentially too small to contain all of the matching (key,value) pairs. In the event such an "overflow" is detected (more on how to detect this below), we do not throw an exception. Instead, we exploit the fact that we can tell the user exactly how many matches there are so they can allocate a properly sized output buffer.

Instead of simply returning a std::size_t with the actual number of matches, @allisonvacanti suggested that we use something like std::expected (which hasn't made its way into the standard yet) that indicates both whether an overflow occurred and the total number of matches. This is potentially less of a foot-gun, as it more explicitly informs the user that an overflow occurred (otherwise they would have to remember to compare the returned size_t against distance(output_begin, output_end) to detect the overflow).

The API would look something like this:

 /**
   * @brief Finds all (key,value) pairs whose keys are equal to a set of probe keys.
   *
   * For each key `*(probe_first + i)`, if it exists in the map, copies the corresponding
   * (key,value) pair to the output. The order of the key,value pairs in the output is
   * non-deterministic. If there are repeated values in `[probe_first, probe_last)`, there will be
   * repeated pairs in the output.
   *
   * In general, the number of matches `n` is not known ahead of time. In the worst case, the number
   * of elements in the output is equal to `std::distance(probe_first, probe_last)`. This presents
   * the potential for a large memory overhead as, in practice, the number of matches may be far
   * less than the number of probe keys. Therefore, it is well-defined to pass in an output
   * iterator range that is potentially smaller than the actual number of matches.
   *
   * If `std::distance(output_first, output_last) < n`, i.e., the output iterator range is not large
   * enough to fit all matches, then `[output_first, output_last)` will be filled with a valid set
   * of (key,value) pairs, but plainly does not contain _all_ matches. In the event such an
   * "overflow" occurs, `find_if_exists` returns an `expected<size_t,size_t>` object that indicates:
   * 1) whether or not an overflow occurred, 2) the total number of keys in `[probe_first,
   * probe_last)` that were present in the map.
   *
   * In the event the original output iterator range was not large enough, the `expected_proxy`'s
   * reported number of matches can be used to create a new output iterator range that is exactly
   * large enough to contain all of the matches in a future call to `find_if_exists`.
   *
   * TODO: Describe detailed semantics of the "expected_proxy" object
   *
   * @tparam InputIt Input iterator whose `value_type` is convertible to `Key`
   * @tparam OutputIt Output iterator whose `value_type` is convertible from `pair<Key,Value>`
   * @tparam Hash
   * @tparam KeyEqual
   * @param probe_first Beginning of the probe key range
   * @param probe_last End of the probe key range
   * @param output_first Beginning of the output pair range
   * @param output_last End of the output pair range
   * @param hash
   * @param key_equal
   * @return An `expected_proxy` indicating whether an overflow occurred and the total number of
   * keys in `[probe_first, probe_last)` that exist in the map
   */
  template <typename InputIt, typename OutputIt, typename Hash, typename KeyEqual>
  expected_proxy<std::size_t, std::size_t> find_if_exists(InputIt probe_first,
                                                          InputIt probe_last,
                                                          OutputIt output_first,
                                                          OutputIt output_last,
                                                          Hash hash,
                                                          KeyEqual key_equal)
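To make the shape of the proposal concrete, a minimal "expected-like" result type (all names hypothetical, not existing cuco API) could be as small as:

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for the proposed `expected_proxy`: carries the total
// number of matches plus a flag telling the caller whether the output range
// overflowed. Deliberately simpler than a full std::expected implementation.
template <typename T>
struct expected_proxy {
  T value;          // total number of matches found in the map
  bool overflowed;  // true if distance(output_first, output_last) < value

  constexpr explicit operator bool() const noexcept { return not overflowed; }
  constexpr T operator*() const noexcept { return value; }
};
```

A caller could test the result directly and, on overflow, use `*result` to size a second output buffer for a retry.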

Questions/Points of discussion:

  • There are prototype implementations of std::expected in the wild, but they are more complicated than what we need for this use case. I propose we make our own simple "expected-like" object tailored to this use case. We can eventually replace that with thrust::expected down the road.

  • I documented that the order of the output is non-deterministic. This avoids tying our hands in how we implement it, but it does preclude an interesting use case that @allisonvacanti mentioned:

I'm starting to like the soft overflow more. It would also let someone just reuse a smaller buffer to stream values out of the hash map. If it overflows, just handle what you have, update your query, and call find_if_exists again.

That would only work if the returned (key,value) pairs are in the same order as their corresponding probe_keys.

  • The naive implementation of find_if_exists can be done by just doing a normal "bulk find" that returns a value for each probe key, and for probe keys that don't exist, it just returns the empty value sentinel. Then, you can do a stream compaction/copy_if of only the non-empty values.
    • However, I think there's an opportunity to do better by fusing the find + stream compaction operations. The naive approach would be to keep a single global memory offset counter, and as you find a match, you atomically increment the counter and use its return value as the location to write your found match. This suffers from atomic contention and non-coalesced writes.
      • I have two ideas that should be more efficient:
        1. We could extend the above idea and keep a shared memory buffer and atomic counter per block. When a thread finds a match, it writes it to the shared memory buffer. As that buffer gets full, you can flush from shared to global using coalesced writes. To detect overflow, when a block goes to flush its shared memory buffer, it checks if the flush will overflow the output. If yes, it atomically sets a flag indicating an overflow occurred and does a partial flush. All other blocks not yet retired will skip their flush from shmem to global memory. The output order is non-deterministic.
        2. We run a kernel to compute the number of matches per thread block, then compute a scan of those sizes to get each block's write offset. Threads within a block coordinate using a shared memory atomic to write to their window in global memory. This has the advantage that we can detect "overflow" outside of a kernel, as the last element of the scan tells us the total number of matches.
          • There's an opportunity here to make the output order deterministic by doing a per-block scan to determine the output location for each thread instead of using an atomic. But this is likely going to be slower than the above approach.
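The naive two-pass version above can be sketched with host-side STL analogues (std::unordered_map standing in for the device map, std::copy_if for the stream compaction; a device version would use thrust::copy_if):

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <unordered_map>
#include <utility>
#include <vector>

constexpr int empty_value_sentinel = -1;

// Host analogue of the naive implementation: a bulk find that writes a
// sentinel value for absent keys, followed by a copy_if compaction that
// drops the sentinels.
inline std::vector<std::pair<int, int>> find_if_exists_naive(
  std::unordered_map<int, int> const& map, std::vector<int> const& probe_keys)
{
  // Pass 1: "bulk find" -- one result per probe key, sentinel if absent
  std::vector<std::pair<int, int>> found(probe_keys.size());
  std::transform(probe_keys.begin(), probe_keys.end(), found.begin(), [&](int k) {
    auto const it = map.find(k);
    return std::pair<int, int>{k, it != map.end() ? it->second : empty_value_sentinel};
  });

  // Pass 2: stream compaction -- keep only the real matches
  std::vector<std::pair<int, int>> out;
  std::copy_if(found.begin(), found.end(), std::back_inserter(out),
               [](auto const& p) { return p.second != empty_value_sentinel; });
  return out;
}
```

A fused implementation would replace the second pass with per-block buffering as described in the bullets above.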

[FEA] On-demand size computation to solve #65 and #39

In order to support fully asynchronous bulk operations, e.g., for multi-GPU hash tables, as requested in #65, we need to rethink the way we compute the table's size.

For now, insert tracks the number of successful insertions on-the-fly. After the kernel has finished, we copy the number back to the host and add it to the size_ member. This implies that insert synchronizes with the host.

In order to overcome this limitation for cuco::static_reduction_map (PR #98), 902b93a proposes a standalone size computation based on thrust::count_if and also implements fully asynchronous bulk operations.

The size computation using thrust::count_if achieves near speed-of-light (SOL) throughput.
Additionally, since we no longer need to reduce the number of added pairs during insertion, the overall performance of the bulk insert operation improves by ~3-5%.

IMHO we should also add this feature to the other hash table implementations.

This feature also implicitly solves issue #39.
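On the host, the idea reduces to counting non-sentinel keys in the slot storage; a sketch with std::count_if (sentinel value illustrative). The device version is thrust::count_if over the slot array and can therefore run asynchronously on a stream:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

constexpr int empty_key_sentinel = -1;

// Standalone size computation: count the slots whose key is not the empty
// sentinel. No counter has to be maintained during insertion.
inline std::size_t compute_size(std::vector<std::pair<int, int>> const& slots)
{
  return static_cast<std::size_t>(
    std::count_if(slots.begin(), slots.end(),
                  [](auto const& slot) { return slot.first != empty_key_sentinel; }));
}
```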

[FEA] static_map::rehash

Is your feature request related to a problem? Please describe.

Oftentimes when you insert a set of N keys into a map, you don't know ahead of time the number of distinct keys m. This can lead to inefficient memory usage if N >> m. For example, if you bulk insert 1M keys, then for a 50% load factor we allocate 2M slots. If there are only 2 distinct keys, then only 2 of the 2M slots in the map will be occupied. This is a waste of memory.

It would be nice if there was an API to allow "compacting" the hash map down to a smaller number of slots.

One way we could achieve this is with a static_map::rehash function, somewhat inspired by std::unordered_map::rehash. However, unlike std::unordered_map::rehash, which takes the number of buckets as its argument, cuco::static_map::rehash would probably take the number of slots. For example:

// Creates a new hash map with the specified number of slots and rehashes the existing keys into the new slots
// Destroys the old slot storage
void rehash(std::size_t num_slots)

One tricky thing is to ensure that num_slots >= the number of keys that exist in the map. Assuming all inserts have been done through the bulk insert API, we already have this information. However, if a user does manual inserts via the mutable_device_view, then we no longer know exactly how many keys are present. In this situation, @Nicolas-Iskos had the idea of first running a kernel to count the number of existing keys to ensure num_slots is valid.
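A host sketch of the proposed semantics, using linear probing over a plain vector of slots (hash function and sentinel simplified for illustration; the caller must ensure num_slots is at least the number of occupied slots):

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

constexpr int empty_key_sentinel = -1;
using slot = std::pair<int, int>;

// Simplified analogue of static_map::rehash: allocate new slot storage,
// re-insert every occupied slot with linear probing, then drop the old
// storage. Precondition: num_slots >= number of occupied slots.
inline void rehash(std::vector<slot>& slots, std::size_t num_slots)
{
  std::vector<slot> new_slots(num_slots, {empty_key_sentinel, 0});
  for (auto const& s : slots) {
    if (s.first == empty_key_sentinel) { continue; }
    auto i = static_cast<std::size_t>(s.first) % num_slots;  // trivial hash
    while (new_slots[i].first != empty_key_sentinel) { i = (i + 1) % num_slots; }
    new_slots[i] = s;
  }
  slots = std::move(new_slots);  // old slot storage destroyed here
}
```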

[BUG] static_map should never use `key_equal` to compare against `empty_key_sentinel`

Describe the bug

There are cases where the user-defined key_equal function may be used to compare against the empty_key_sentinel. For example,

uint32_t existing = g.ballot(key_equal(existing_key, insert_pair.first));

If any of the slots in the CG window are empty, then we would be calling key_equal(x, empty_key_sentinel).

This is problematic as the empty_key_sentinel may not be a valid value in the user defined equality function.

Expected behavior

The static_map implementation needs to be updated such that key_equal should never be used in a situation where it can compare against empty_key_sentinel.
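The required invariant can be sketched on the host: compare against the sentinel bitwise first (here modeled with std::memcmp, as cuco's detail::bitwise_compare does), and only invoke key_equal on slots known to be non-empty:

```cpp
#include <cassert>
#include <cstring>

constexpr int empty_key_sentinel = -1;

// Bitwise equality, so no user-defined comparison is ever applied to the
// sentinel value.
template <typename T>
inline bool bitwise_compare(T const& lhs, T const& rhs)
{
  return std::memcmp(&lhs, &rhs, sizeof(T)) == 0;
}

// True iff the slot holds `key`. The user-provided predicate never sees the
// empty key sentinel.
template <typename KeyEqual>
inline bool slot_matches(int existing_key, int key, KeyEqual key_equal)
{
  if (bitwise_compare(existing_key, empty_key_sentinel)) { return false; }
  return key_equal(existing_key, key);
}
```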

[FEA] CUDA stream support

Is your feature request related to a problem? Please describe.
static_map/dynamic_map currently do not take a cudaStream_t stream parameter. This often requires additional synchronization and limits the speedup we can get when we wish to run multiple cuCollections operations concurrently on multiple CUDA streams.

Describe the solution you'd like
Add CUDA stream support

Additional context
cuGraph needs this to run multiple graph kernels concurrently using multiple CUDA streams (for batch processing).

[FEA] `insert_reduce_by_key`

Is your feature request related to a problem? Please describe.

A common use case for hash maps is to perform a reduce-by-key-like operation that doesn't require sorting the key/value pairs.

Similar to the bulk insert functions, our static/dynamic maps could provide a bulk insert_reduce_by_key operation that looks something like:

/**
 * @brief Inserts a set of key-value pairs and performs a reduction among the values of identical
 * keys.
 *
 * @tparam InputIt
 * @tparam BinaryOp
 * @tparam Hash
 * @tparam KeyEqual
 * @param first Beginning of the key-value pair range
 * @param last End of the key-value pair range
 * @param op The binary operation to perform among values of identical keys
 * @param hash
 * @param key_equal
 */
template <typename InputIt, typename BinaryOp, typename Hash, typename KeyEqual>
void insert_reduce_by_key(InputIt first, InputIt last, BinaryOp op, Hash hash, KeyEqual key_equal);

One issue with this operation is that since BinaryOp is a generic operation, applying it will have to be done with an atomicCAS loop. If the operation were simply a sum, it would be more efficient to use atomicAdd directly, but there isn't any way for us to detect what the user-provided operation is. Therefore, we could provide explicit functions that map to native atomic instructions, e.g., insert_sum_by_key, insert_min_by_key, etc.
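A host analogue of the proposed semantics using std::unordered_map (on the device, the combine step would be the atomicCAS loop or, for known operations, a native atomic):

```cpp
#include <cassert>
#include <unordered_map>
#include <utility>
#include <vector>

// Host analogue of insert_reduce_by_key: insert each pair, combining the
// values of identical keys with a generic binary operation.
template <typename BinaryOp>
inline void insert_reduce_by_key(std::unordered_map<int, int>& map,
                                 std::vector<std::pair<int, int>> const& pairs,
                                 BinaryOp op)
{
  for (auto const& [k, v] : pairs) {
    auto const [it, inserted] = map.try_emplace(k, v);
    if (not inserted) { it->second = op(it->second, v); }
  }
}
```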

Add style check to CI

We should add a CI step that verifies that clang-format has been applied to any submitted patches.

[FEA] Reduce building time

Is your feature request related to a problem? Please describe.
After introducing rapids.cmake into the project, building cuco has become unreasonably expensive for such a small project. More precisely, it takes 11 minutes to build the code with 6 concurrent threads. Linking STATIC_MAP_TEST and DYNAMIC_MAP_TEST is the major time sink.

Describe the solution you'd like
Get rid of the dynamic initialization warnings in these two tests and reduce building time.

[FEA] Replace thrust/cuco::pair with libcu++ tuple/pair

Is your feature request related to a problem? Please describe.

cuco currently uses a combination of Thrust and a custom pair type.

Describe the solution you'd like

We should just use everything from libcu++ for pair/tuple.

[FEA] static_map::erase

It would be useful to have a static_map::erase function to form a more complete API for static_map. Perhaps the simplest way to achieve this would be to use a "tombstone" sentinel value to indicate whether a slot contains an element that has been deleted. Another option would be to use an array of one-byte atomic<state> objects, one per slot. The lower 2 bits of this byte could hold states such as empty, filling, filled, or deleted.
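The second option could encode the per-slot state in the low bits of a one-byte atomic; a sketch of the encoding and a claim operation (names hypothetical, std::atomic standing in for cuda::atomic):

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Per-slot state byte for erase support: the lower 2 bits encode one of four
// states. Both empty and tombstoned (deleted) slots are insertable.
enum class slot_state : std::uint8_t { empty = 0, filling = 1, filled = 2, deleted = 3 };

// Attempt to claim a slot for insertion: empty or deleted -> filling.
inline bool try_claim(std::atomic<std::uint8_t>& state)
{
  auto expected       = static_cast<std::uint8_t>(slot_state::empty);
  auto const filling  = static_cast<std::uint8_t>(slot_state::filling);
  if (state.compare_exchange_strong(expected, filling)) { return true; }
  expected = static_cast<std::uint8_t>(slot_state::deleted);
  return state.compare_exchange_strong(expected, filling);
}
```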

[FEA] Invalidate `get_size()` after user calls `get_device_mutable_view()`

Is your feature request related to a problem? Please describe.

The value of static_map::get_size() is only guaranteed to be correct if a user goes through the bulk static_map::insert API.

If instead a user calls get_device_mutable_view(), it is likely that the user inserted keys into the map, but there is no way for us to know how many keys were inserted. Therefore, the value of static_map::get_size() is no longer valid.

Describe the solution you'd like

Invalidate the value of static_map::get_size() whenever a user calls static_map::get_device_mutable_view(). Likewise, provide a static_map::set_size() API for the user to tell us how many keys they inserted (their responsibility to make sure it is accurate).

We can do two things after static_map::get_size() has been invalidated:

  1. Throw an exception stating that get_size() cannot be called after get_mutable_device_view() unless set_size() has been called.
  2. Launch a kernel to recompute the size.

Option 2 would be more convenient, but is a potential performance pitfall. Option 1 requires more user effort, but has less chance of a performance pitfall.
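Option 1 amounts to a validity flag guarding the cached size; a host-side sketch (class and member names hypothetical):

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>

// Sketch of option 1: get_size() throws once a mutable device view has been
// handed out, until the user supplies the count via set_size().
class map_size_tracker {
 public:
  void on_get_device_mutable_view() noexcept { valid_ = false; }

  // The user asserts how many keys the map now holds; accuracy is on them.
  void set_size(std::size_t n) noexcept
  {
    size_  = n;
    valid_ = true;
  }

  std::size_t get_size() const
  {
    if (not valid_) {
      throw std::logic_error(
        "get_size() is invalid after get_device_mutable_view(); call set_size() first");
    }
    return size_;
  }

 private:
  std::size_t size_{0};
  bool valid_{true};
};
```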

[FEA] `dynamic_multimap`

Is your feature request related to a problem? Please describe.

dynamic_map only allows a single instance of each key.

Describe the solution you'd like

I would like an extension of dynamic_map that supports multiple instances of each key.

Additional context

Related: #59

[DOC] Size requirement (there should be at least one empty slot) should be clearly articulated.

Report incorrect documentation

Location of incorrect documentation

* Performance begins to degrade significantly beyond a load factor of ~70%.

Describe the problems or issues found in the documentation
If there is one key to insert and the capacity is set to 1 (1/0.7 truncates to 1 when converted to an integral type),

https://github.com/NVIDIA/cuCollections/blob/dev/include/cuco/detail/static_map.inl#L373

will hang if the search key is not in the map (as it will just visit the same slot again & again).


Suggested fix for documentation
The documentation needs to state that the capacity must be larger than the number of inserted keys, or the code (e.g., the above contains function) should be modified to limit the probe loop count so it does not exceed the capacity.
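The code-side fix amounts to bounding the probe loop by the capacity; a host sketch with linear probing (the actual code uses the map's probing scheme):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

constexpr int empty_key_sentinel = -1;

// A `contains` whose probe loop visits at most `capacity` slots, so a
// completely full table terminates instead of revisiting slots forever.
inline bool contains(std::vector<int> const& slots, int key)
{
  auto const capacity = slots.size();
  auto i = static_cast<std::size_t>(key) % capacity;  // trivial hash
  for (std::size_t probes = 0; probes < capacity; ++probes) {
    if (slots[i] == key) { return true; }
    if (slots[i] == empty_key_sentinel) { return false; }
    i = (i + 1) % capacity;
  }
  return false;  // every slot visited: table is full and the key is absent
}
```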

[BUG] CUDA Exception: Warp Illegal Address if find is executed on a static_map with capacity 0.

Describe the bug

cuco::static_map<int32_t, int32_t> map(0, -1, -1);

map.find(...);

find fails with an illegal address error if the map has capacity 0 (it succeeds if the capacity is set to 1). I think the expected behavior is to set the output array elements to -1.

The code actually fails in the line below.

auto const existing_key = current_slot->first.load(cuda::std::memory_order_relaxed);

See the debugger outputs

CUDA Exception: Warp Illegal Address
The exception was triggered at PC 0x55559c258f10 (__atomic_generated:765)

Thread 1 "MG_BFS_TEST" received signal CUDA_EXCEPTION_14, Warp Illegal Address.
[Switching focus to CUDA kernel 0, grid 329, block (160,0,0), thread (32,0,0), device 0, sm 0, warp 11, lane 0]
0x000055559c258f20 in _INTERNAL_49_tmpxft_0008af13_00000000_7_renumber_utils_cpp1_ii_d2beb9e5::cuda::__3::detail::__cuda_load_relaxed_32_device<int const volatile*, unsigned int> (__ptr=0xfffffffffffffff8, __dst=0x7ffeb3fffae4)
    at /home/seunghwak/RAPIDS/development/cugraph/cpp/build/_deps/libcudacxx-src/include/cuda/std/detail/__atomic_generated:765
765     template<class _CUDA_A, class _CUDA_B> static inline __device__ void __cuda_load_relaxed_32_device(_CUDA_A __ptr, _CUDA_B& __dst) {asm volatile("ld.relaxed.gpu.b32 %0,[%1];" : "=r"(__dst) : "l"(__ptr) : "memory"); }
(cuda-gdb) where
#0  0x000055559c258f20 in _INTERNAL_49_tmpxft_0008af13_00000000_7_renumber_utils_cpp1_ii_d2beb9e5::cuda::__3::detail::__cuda_load_relaxed_32_device<int const volatile*, unsigned int> (__ptr=0xfffffffffffffff8, __dst=0x7ffeb3fffae4)
    at /home/seunghwak/RAPIDS/development/cugraph/cpp/build/_deps/libcudacxx-src/include/cuda/std/detail/__atomic_generated:765
#1  0x000055559d93b890 in cuda::__3::detail::__atomic_load_cuda<int, 0> (__ptr=<error reading variable: Register 6 was not saved>, 
    __ret=<error reading variable: Register 4 was not saved>, __memorder=<optimized out>)
    at /home/seunghwak/RAPIDS/development/cugraph/cpp/build/_deps/libcudacxx-src/include/cuda/std/detail/__atomic_generated:775
#2  0x000055559c4ccf80 in cuda::__3::detail::__atomic_load_n_cuda<int, cuda::__3::detail::__thread_scope_device_tag> (
    __ptr=<error reading variable: Register 6 was not saved>, __memorder=<optimized out>, __s=...)
    at /home/seunghwak/RAPIDS/development/cugraph/cpp/build/_deps/libcudacxx-src/include/cuda/std/detail/__atomic_derived:129
#3  0x000055559c268c80 in cuda::std::__3::__cxx_atomic_load<int, 1> (__a=<error reading variable: Register 6 was not saved>, 
    __order=<optimized out>) at /home/seunghwak/RAPIDS/development/cugraph/cpp/build/_deps/libcudacxx-src/include/cuda/std/detail/__atomic:222
#4  0x000055559bdeead0 in cuda::std::__3::__atomic_base<int, 1, false>::load (this=<error reading variable: Register 6 was not saved>, 
    __m=<optimized out>)
    at /home/seunghwak/RAPIDS/development/cugraph/cpp/build/_deps/libcudacxx-src/include/cuda/std/detail/libcxx/include/atomic:1660
#5  0x000055559d1d3190 in cuco::static_map<int, int, (cuda::__3::thread_scope)1, cuco::cuda_allocator<char> >::device_view::find<cooperative_groups::__v1::thread_block_tile<4u, cooperative_groups::__v1::thread_block>, cuco::detail::MurmurHash3_32<int>, thrust::equal_to<int> > (
    this=<optimized out>, g=..., k=<optimized out>, hash=..., key_equal=...)
    at /home/seunghwak/RAPIDS/development/cugraph/cpp/build/_deps/cuco-src/include/cuco/detail/static_map.inl:288
#6  0x000055559dc109e0 in cuco::detail::find<128u, 4u, int, int*, int*, cuco::static_map<int, int, (cuda::__3::thread_scope)1, cuco::cuda_allocator<char> >::device_view, cuco::detail::MurmurHash3_32<int>, thrust::equal_to<int> ><<<(31251,1,1),(128,1,1)>>> (first=0x7ff6463d0a00, 
    last=0x7ff6467a1314, output_begin=0x7ff6463d0a00, view=..., hash=..., key_equal=...)
    at /home/seunghwak/RAPIDS/development/cugraph/cpp/build/_deps/cuco-src/include/cuco/detail/static_map_kernels.cuh:266
(cuda-gdb) 

and

(cuda-gdb) up
#5  0x000055559d1d3190 in cuco::static_map<int, int, (cuda::__3::thread_scope)1, cuco::cuda_allocator<char> >::device_view::find<cooperative_groups::__v1::thread_block_tile<4u, cooperative_groups::__v1::thread_block>, cuco::detail::MurmurHash3_32<int>, thrust::equal_to<int> > (
    this=0x7ffeb3fffc38, g=..., k=0x7ffeb3fffc6c, hash=..., key_equal=...)
    at /home/seunghwak/RAPIDS/development/cugraph/cpp/build/_deps/cuco-src/include/cuco/detail/static_map.inl:288
288         auto const existing_key = current_slot->first.load(cuda::std::memory_order_relaxed);
(cuda-gdb) print *this
$4 = {__b_N4cuco10static_mapIiiLN4cuda3__312thread_scopeE1ENS_14cuda_allocatorIcEEE16device_view_baseE = {slots_ = 0x0, capacity_ = 0, 
    empty_key_sentinel_ = -1, empty_value_sentinel_ = -1}}
(cuda-gdb) 

slots_ is 0x0, and this is likely the cause of the error.
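A host-side guard would give the expected behavior without launching a probe at all; a sketch (function name and sentinel fill illustrative):

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

constexpr int empty_value_sentinel = -1;

// Expected behavior for a capacity-0 map: nothing can be found, so every
// output element is the empty-value sentinel and `slots_` is never touched.
inline void find_all(std::size_t capacity, std::vector<int> const& keys, std::vector<int>& out)
{
  out.resize(keys.size());
  if (capacity == 0) {  // guard: never dereference the null slot storage
    std::fill(out.begin(), out.end(), empty_value_sentinel);
    return;
  }
  // ... normal probing path elided ...
}
```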

[BUG] Key existence check with sentinel causes illegal memory access

The problem is in this line:

if (key_equal(existing_key, insert_pair.first)) { return false; }

It checks whether the existing key value in the slot is equal to the one being inserted. This means that for uninitialized slots, the comparison happens between insert_pair.first and empty_key_sentinel_ (assuming the slots were initialized with it).

When used with a key_equal that employs indirection like this: https://github.com/devavret/cudf/blob/ff8b885f1c9d3fb9753f4c65ded4d30f6a736267/cpp/src/io/parquet/chunk_dict.cu#L47-L51
it results in accesses using an invalid lhs_idx (-1 in this case).

This is not a problem with the cg overload:

// The user-provided `key_equal` can never be used to compare against `empty_key_sentinel` as the
// sentinel is not a valid key value. Therefore, first check for the sentinel
auto const slot_is_empty =
  detail::bitwise_compare(existing_key, this->get_empty_key_sentinel());
// the key we are trying to insert is already in the map, so we return with failure to insert
if (g.any(not slot_is_empty and key_equal(existing_key, insert_pair.first))) { return false; }
