Coder Social home page Coder Social logo

ips4o / ips2ra Goto Github PK

View Code? Open in Web Editor NEW
29.0 6.0 9.0 129 KB

In-place Parallel Super Scalar Radix Sort (IPS²Ra)

Home Page: https://arxiv.org/abs/2009.13569

License: BSD 2-Clause "Simplified" License

CMake 1.50% C++ 98.50%
parallel in-place sorting

ips2ra's Introduction

In-place Parallel Super Scalar Radix Sort (IPS²Ra)

This is the implementation of the algorithm IPS²Ra presented in the paper Engineering In-place (Shared-memory) Sorting Algorithms, which contains an in-depth description of its inner workings, as well as an extensive experimental performance evaluation. Here's the abstract:

We present new sequential and parallel sorting algorithms that now represent the fastest known techniques for a wide range of input sizes, input distributions, data types, and machines. Somewhat surprisingly, part of the speed advantage is due to the additional feature of the algorithms to work in-place, i.e., they do not need a significant amount of space beyond the input array. Previously, the in-place feature often implied performance penalties. Our main algorithmic contribution is a blockwise approach to in-place data distribution that is provably cache-efficient. We also parallelize this approach taking dynamic load balancing and memory locality into account.

Our new comparison-based algorithm In-place Superscalar Samplesort (IPS⁴o), combines this technique with branchless decision trees. By taking cases with many equal elements into account and by adapting the distribution degree dynamically, we obtain a highly robust algorithm that outperforms the best previous in-place parallel comparison-based sorting algorithms by almost a factor of three. That algorithm also outperforms the best comparison-based competitors regardless of whether we consider in-place or not in-place, parallel or sequential settings.

Another surprising result is that IPS⁴o even outperforms the best (in-place or not in-place) integer sorting algorithms in a wide range of situations. In many of the remaining cases (often involving near-uniform input distributions, small keys, or a sequential setting), our new In-place Parallel Super Scalar Radix Sort (IPS²Ra) turns out to be the best algorithm.

Claims to have the -- in some sense -- "best" sorting algorithm can be found in many papers which cannot all be true. Therefore, we base our conclusions on an extensive experimental study involving a large part of the cross product of 21 state-of-the-art sorting codes, 6 data types, 10 input distributions, 4 machines, 4 memory allocation strategies, and input sizes varying over 7 orders of magnitude. This confirms the claims made about the robust performance of our algorithms while revealing major performance problems in many competitors outside the concrete set of measurements reported in the associated publications. This is particularly true for integer sorting algorithms giving one reason to prefer comparison-based algorithms for robust general-purpose sorting.

The radix sort algorithm IPS²Ra is an adaption of the samplesort algorithm IPS⁴o. The implementation of IPS⁴o is also available on GitHub. In the sequential case, IPS²Ra outperforms IPS⁴o for many input distributions and data types. However, IPS⁴o may be faster for very skewed key distributions. In the parallel case, IPS⁴o outperforms IPS²Ra most of the time.

An initial version of IPS⁴o has been described in our publication on the 25th Annual European Symposium on Algorithms.

Usage

Clone this repository and check out its submodule

git clone --recurse-submodules https://github.com/ips4o/ips2ra.git

or use the following commands instead if you want to include this repository as a submodule:

git submodule add https://github.com/ips4o/ips2ra.git
git submodule update --recursive --init

IPS²Ra provides a CMake library for simple usage:

add_subdirectory(<path-to-the-ips2ra-repository>)
target_link_libraries(<your-target> PRIVATE ips2ra)

A minimal working example:

#include "ips2ra.hpp"

// sort sequentially
ips2ra::sort(begin, end[, Extractor = ips2ra::Config<>::identity]);

// sort in parallel (uses OpenMP if available, std::thread otherwise)
ips2ra::parallel::sort(begin, end);

The parallel version of IPS²Ra requires 16-byte atomic compare-and-exchange instructions to run the fastest. Most CPUs and compilers support 16-byte compare-and-exchange instructions nowadays. If the CPU in question does so, IPS²Ra uses 16-byte compare-and-exchange instructions when you set your CPU correctly (e.g., -march=native) or when you enable the instructions explicitly (-mcx16). In this case, you also have to link against GCC's libatomic (-latomic). Otherwise, we emulate some 16-byte compare-and-exchange instructions with locks which may slightly mitigate the performance of IPS²Ra.

If you use the CMake example shown above, we automatically optimize IPS²Ra for the native CPU (e.g., -march=native). You can disable the CMake property IPS2RA_OPTIMIZE_FOR_NATIVE to avoid native optimization and you can enable the CMake property IPS2RA_USE_MCX16 if you compile with GCC or Clang to enable 16-byte compare-and-exchange instructions explicitly.

IPS²Ra uses C++ threads if not specified otherwise. If you prefer OpenMP threads, you need to enable OpenMP threads, e.g., enable the CMake property IPS2RA_USE_OPENMP or add OpenMP to your target. If you enable the CMake property DISABLE_IPS2RA_PARALLEL, most of the parallel code will not be compiled and no parallel libraries will be linked. Otherwise, CMake automatically enables C++ threads (e.g., -pthread) and links against TBB and GCC's libatomic. (Only when you compile your code for 16-byte compare-and-exchange instructions you need libatomic.) Thus, you need the Thread Building Blocks (TBB) library to compile and execute the parallel version of IPS²Ra. We search for TBB with find_package(TBB REQUIRED). If you want to execute IPS²Ra in parallel but your TBB library is not accessible via find_package(TBB REQUIRED), you can still compile IPS²Ra with parallel support. Just enable the CMake property DISABLE_IPS2RA_PARALLEL, enable C++ threads for your own target and link your own target against your TBB library (and also link your target against libatomic if you want 16-byte atomic compare-and-exchange instruction support).

If you do not set a CMake build type, we use the build type Release which disables debugging (e.g., -DNDEBUG) and enables optimizations (e.g., -O3).

Currently, the code does not compile on Windows.

Licensing

IPS²Ra is free software provided under the BSD 2-Clause License described in the LICENSE file. If you use IPS²Ra in an academic setting please cite the paper Engineering In-place (Shared-memory) Sorting Algorithms using the BibTeX entry

@misc{axtmann2020engineering,
  title =	 {Engineering In-place (Shared-memory) Sorting Algorithms},
  author =	 {Michael Axtmann and Sascha Witt and Daniel Ferizovic and Peter Sanders},
  howpublished = {Computing Research Repository (CoRR)},
  year =	 {Sept. 2020},
  archivePrefix ={arXiv},
  eprint =	 {2009.13569},
}

ips2ra's People

Contributors

bytehamster avatar ips4o avatar lorenzhs avatar lumidify avatar saschawitt avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar

ips2ra's Issues

Is there anyway to sort a pair?

Hi, I am interested in integrating this project to my own project. But I need to sort a pair which only the first element need to be sorted. Is there anyway to do that? Great thanks!

How to sort keys longer than 8 bytes

Hello,

I have longer keys than 8 bytes. My records are something like below:

struct Record
{
	uint64_t key[2];
	//possibly other data related to the current key
};

Is there a way to use ips2ra for such records?
I was digging into the code a little but I stuck somewhere around:

template <int level, class T>
    static inline T getBucket(const T val) {
        static_assert(sizeof(T) > level, "Too many levels");
        return (val >> (8 * (sizeof(T) - level - 1))) & 0xFF;
    }

I have defined my operator>> for Record, but according to your definition, it returns T (Record in my case).
So I suspect I would also need to define the & operator that would also return a Record, but in fact, the usable result is in fact uint8_t due to 0xFF masking.
I ended up with something like (it is a little modified from what I actually tried, not sure if the below version compiles, but I hope it shows the idea):

struct Record
{
	struct Proxy {
		uint8_t tmp;
		Record operator&(int val) const
		{
			Record res;
			res.data[0] = tmp;
			return res;
		}
		operator uint8_t()
		{
			return tmp;
		}
	};

	Proxy operator>>(const unsigned int shift) const
	{
		uint8_t* ptr = reinterpret_cast<uint8_t*>(&key);
		Proxy pr;
		pr.tmp = ptr[shift]; //this should depend on the little/big endian and if data[0] or data[1] is most significant
		return pr; 
	}
};

There may be some issues in the implementation, but I just wanted to make it compile.
Yet I have still a compile error on

yield(b, begin);

besides, I think that what I have done is a little tricky and I am afraid it may hurt performance or something.
Do you have any suggestions on how should I sort my Record?

128-bit support

Hi there,

I tried to pass a 128-bit integer as the key to sort the sequence, but it returned the following errors:

[ 95%] Built target tlx
Scanning dependencies of target ips2ra_example
[ 97%] Building CXX object CMakeFiles/ips2ra_example.dir/src/example.cpp.o
In file included from /home/xdong038/GitHub/ips2ra/extern/tlx/tlx/math.hpp:26,
from /home/xdong038/GitHub/ips2ra/include/ips2ra/sampling.hpp:44,
from /home/xdong038/GitHub/ips2ra/include/ips2ra/sequential.hpp:44,
from /home/xdong038/GitHub/ips2ra/include/ips2ra/parallel.hpp:58,
from /home/xdong038/GitHub/ips2ra/include/ips2ra/ips2ra.hpp:46,
from /home/xdong038/GitHub/ips2ra/include/ips2ra.hpp:38,
from /home/xdong038/GitHub/ips2ra/src/example.cpp:48:
/home/xdong038/GitHub/ips2ra/extern/tlx/tlx/math/clz.hpp:41:17: warning: inline function ‘unsigned int tlx::clz(Integral) [with Integral = __int128 unsigned]’ used but never defined
inline unsigned clz(Integral x);
^~~
In file included from /home/xdong038/GitHub/ips2ra/extern/tlx/tlx/math.hpp:27,
from /home/xdong038/GitHub/ips2ra/include/ips2ra/sampling.hpp:44,
from /home/xdong038/GitHub/ips2ra/include/ips2ra/sequential.hpp:44,
from /home/xdong038/GitHub/ips2ra/include/ips2ra/parallel.hpp:58,
from /home/xdong038/GitHub/ips2ra/include/ips2ra/ips2ra.hpp:46,
from /home/xdong038/GitHub/ips2ra/include/ips2ra.hpp:38,
from /home/xdong038/GitHub/ips2ra/src/example.cpp:48:
/home/xdong038/GitHub/ips2ra/extern/tlx/tlx/math/ctz.hpp:41:17: warning: inline function ‘unsigned int tlx::ctz(Integral) [with Integral = __int128 unsigned]’ used but never defined
inline unsigned ctz(Integral x);
^~~
[100%] Linking CXX executable ips2ra_example
CMakeFiles/ips2ra_example.dir/src/example.cpp.o: In function ips2ra::detail::Sorter<ips2ra::ExtendedConfig<std::pair<unsigned __int128, unsigned __int128>*, test<unsigned __int128>(parlay::sequence<std::pair<unsigned __int128, unsigned __int128>, parlay::allocator<std::pair<unsigned __int128, unsigned __int128> >, std::is_same<std::pair<unsigned __int128, unsigned __int128>, char>::value> const&)::{lambda(std::pair<unsigned __int128, unsigned __int128> const&)#1}, ips2ra::Config<32l, 128l, 2048l, 1024l, 8l, 2048l, long, 4096ul, 8, 4l, 20, 7>, ips2ra::OpenMPThreadPool> >::parallelGetLevels(std::pair<unsigned __int128, unsigned __int128>*, std::pair<unsigned __int128, unsigned __int128>*, std::vector<unsigned __int128, std::allocator<unsigned __int128> >&, ips2ra::detail::Sorter<ips2ra::ExtendedConfig<std::pair<unsigned __int128, unsigned __int128>*, test<unsigned __int128>(parlay::sequence<std::pair<unsigned __int128, unsigned __int128>, parlay::allocator<std::pair<unsigned __int128, unsigned __int128> >, std::is_same<std::pair<unsigned __int128, unsigned __int128>, char>::value> const&)::{lambda(std::pair<unsigned __int128, unsigned __int128> const&)#1}, ips2ra::Config<32l, 128l, 2048l, 1024l, 8l, 2048l, long, 4096ul, 8, 4l, 20, 7>, ips2ra::OpenMPThreadPool> ><bool, std::allocator<bool> >&, int, int) [clone .constprop.947]': example.cpp:(.text._ZN6ips2ra6detail6SorterINS_14ExtendedConfigIPSt4pairIooEZ4testIoEdRKN6parlay8sequenceIS3_IT_S9_ENS7_9allocatorISA_EEXsrSt7is_sameISA_cE5valueEEEEUlRKS4_E_NS_6ConfigILl32ELl128ELl2048ELl1024ELl8ELl2048ElLm4096ELi8ELl4ELi20ELi7EEENS_16OpenMPThreadPoolEEEE17parallelGetLevelsES5_S5_RSt6vectorIoSaIoEERSQ_IbSaIbEEii.constprop.947[_ZZN6ips2ra14ParallelSorterINS_14ExtendedConfigIPSt4pairIooEZ4testIoEdRKN6parlay8sequenceIS2_IT_S8_ENS6_9allocatorIS9_EEXsrSt7is_sameIS9_cE5valueEEEEUlRKS3_E_NS_6ConfigILl32ELl128ELl2048ELl1024ELl8ELl2048ElLm4096ELi8ELl4ELi20ELi7EEENS_16OpenMPThreadPoolEEEEclES4_S4_ENKUliiE_clEii]+0x307): undefined reference to unsigned int tlx::clz(unsigned __int128)'
example.cpp:(.text._ZN6ips2ra6detail6SorterINS_14ExtendedConfigIPSt4pairIooEZ4testIoEdRKN6parlay8sequenceIS3_IT_S9_ENS7_9allocatorISA_EEXsrSt7is_sameISA_cE5valueEEEEUlRKS4_E_NS_6ConfigILl32ELl128ELl2048ELl1024ELl8ELl2048ElLm4096ELi8ELl4ELi20ELi7EEENS_16OpenMPThreadPoolEEEE17parallelGetLevelsES5_S5_RSt6vectorIoSaIoEERSQ_IbSaIbEEii.constprop.947[_ZZN6ips2ra14ParallelSorterINS_14ExtendedConfigIPSt4pairIooEZ4testIoEdRKN6parlay8sequenceIS2_IT_S8_ENS6_9allocatorIS9_EEXsrSt7is_sameIS9_cE5valueEEEEUlRKS3_E_NS_6ConfigILl32ELl128ELl2048ELl1024ELl8ELl2048ElLm4096ELi8ELl4ELi20ELi7EEENS_16OpenMPThreadPoolEEEEclES4_S4_ENKUliiE_clEii]+0x315): undefined reference to unsigned int tlx::ctz<unsigned __int128>(unsigned __int128)' CMakeFiles/ips2ra_example.dir/src/example.cpp.o: In function ips2ra::detail::Sorter<ips2ra::ExtendedConfig<std::pair<unsigned __int128, unsigned __int128>, test(parlay::sequence<std::pair<unsigned __int128, unsigned __int128>, parlay::allocator<std::pair<unsigned __int128, unsigned __int128> >, std::is_same<std::pair<unsigned __int128, unsigned __int128>, char>::value> const&)::{lambda(std::pair<unsigned __int128, unsigned __int128> const&)#1}, ips2ra::Config<32l, 128l, 2048l, 1024l, 8l, 2048l, long, 4096ul, 8, 4l, 20, 7>, ips2ra::OpenMPThreadPool> >::sampleLevels(std::pair<unsigned __int128, unsigned __int128>, std::pair<unsigned __int128, unsigned __int128>*)':
example.cpp:(.text.ZN6ips2ra6detail6SorterINS_14ExtendedConfigIPSt4pairIooEZ4testIoEdRKN6parlay8sequenceIS3_IT_S9_ENS7_9allocatorISA_EEXsrSt7is_sameISA_cE5valueEEEEUlRKS4_E_NS_6ConfigILl32ELl128ELl2048ELl1024ELl8ELl2048ElLm4096ELi8ELl4ELi20ELi7EEENS_16OpenMPThreadPoolEEEE12sampleLevelsES5_S5[ZN6ips2ra6detail6SorterINS_14ExtendedConfigIPSt4pairIooEZ4testIoEdRKN6parlay8sequenceIS3_IT_S9_ENS7_9allocatorISA_EEXsrSt7is_sameISA_cE5valueEEEEUlRKS4_E_NS_6ConfigILl32ELl128ELl2048ELl1024ELl8ELl2048ElLm4096ELi8ELl4ELi20ELi7EEENS_16OpenMPThreadPoolEEEE12sampleLevelsES5_S5]+0x14b): undefined reference to unsigned int tlx::clz<unsigned __int128>(unsigned __int128)' example.cpp:(.text._ZN6ips2ra6detail6SorterINS_14ExtendedConfigIPSt4pairIooEZ4testIoEdRKN6parlay8sequenceIS3_IT_S9_ENS7_9allocatorISA_EEXsrSt7is_sameISA_cE5valueEEEEUlRKS4_E_NS_6ConfigILl32ELl128ELl2048ELl1024ELl8ELl2048ElLm4096ELi8ELl4ELi20ELi7EEENS_16OpenMPThreadPoolEEEE12sampleLevelsES5_S5_[_ZN6ips2ra6detail6SorterINS_14ExtendedConfigIPSt4pairIooEZ4testIoEdRKN6parlay8sequenceIS3_IT_S9_ENS7_9allocatorISA_EEXsrSt7is_sameISA_cE5valueEEEEUlRKS4_E_NS_6ConfigILl32ELl128ELl2048ELl1024ELl8ELl2048ElLm4096ELi8ELl4ELi20ELi7EEENS_16OpenMPThreadPoolEEEE12sampleLevelsES5_S5_]+0x159): undefined reference to unsigned int tlx::ctz(unsigned __int128)'
CMakeFiles/ips2ra_example.dir/src/example.cpp.o: In function ips2ra::detail::Sorter<ips2ra::ExtendedConfig<std::pair<unsigned __int128, unsigned __int128>*, test<unsigned __int128>(parlay::sequence<std::pair<unsigned __int128, unsigned __int128>, parlay::allocator<std::pair<unsigned __int128, unsigned __int128> >, std::is_same<std::pair<unsigned __int128, unsigned __int128>, char>::value> const&)::{lambda(std::pair<unsigned __int128, unsigned __int128> const&)#1}, ips2ra::Config<32l, 128l, 2048l, 1024l, 8l, 2048l, long, 4096ul, 8, 4l, 20, 7>, ips2ra::OpenMPThreadPool> >::sequentialGetLevels(std::pair<unsigned __int128, unsigned __int128>*, std::pair<unsigned __int128, unsigned __int128>*)': example.cpp:(.text._ZN6ips2ra6detail6SorterINS_14ExtendedConfigIPSt4pairIooEZ4testIoEdRKN6parlay8sequenceIS3_IT_S9_ENS7_9allocatorISA_EEXsrSt7is_sameISA_cE5valueEEEEUlRKS4_E_NS_6ConfigILl32ELl128ELl2048ELl1024ELl8ELl2048ElLm4096ELi8ELl4ELi20ELi7EEENS_16OpenMPThreadPoolEEEE19sequentialGetLevelsES5_S5_[_ZN6ips2ra6detail6SorterINS_14ExtendedConfigIPSt4pairIooEZ4testIoEdRKN6parlay8sequenceIS3_IT_S9_ENS7_9allocatorISA_EEXsrSt7is_sameISA_cE5valueEEEEUlRKS4_E_NS_6ConfigILl32ELl128ELl2048ELl1024ELl8ELl2048ElLm4096ELi8ELl4ELi20ELi7EEENS_16OpenMPThreadPoolEEEE19sequentialGetLevelsES5_S5_]+0x13d): undefined reference to unsigned int tlx::clz(unsigned __int128)'
example.cpp:(.text.ZN6ips2ra6detail6SorterINS_14ExtendedConfigIPSt4pairIooEZ4testIoEdRKN6parlay8sequenceIS3_IT_S9_ENS7_9allocatorISA_EEXsrSt7is_sameISA_cE5valueEEEEUlRKS4_E_NS_6ConfigILl32ELl128ELl2048ELl1024ELl8ELl2048ElLm4096ELi8ELl4ELi20ELi7EEENS_16OpenMPThreadPoolEEEE19sequentialGetLevelsES5_S5[ZN6ips2ra6detail6SorterINS_14ExtendedConfigIPSt4pairIooEZ4testIoEdRKN6parlay8sequenceIS3_IT_S9_ENS7_9allocatorISA_EEXsrSt7is_sameISA_cE5valueEEEEUlRKS4_E_NS_6ConfigILl32ELl128ELl2048ELl1024ELl8ELl2048ElLm4096ELi8ELl4ELi20ELi7EEENS_16OpenMPThreadPoolEEEE19sequentialGetLevelsES5_S5]+0x14b): undefined reference to `unsigned int tlx::ctz(unsigned __int128)'
collect2: error: ld returned 1 exit status
make[2]: *** [CMakeFiles/ips2ra_example.dir/build.make:107: ips2ra_example] Error 1
make[1]: *** [CMakeFiles/Makefile2:133: CMakeFiles/ips2ra_example.dir/all] Error 2
make: *** [Makefile:149: all] Error 2

Do you consider adding 128-bit support to your library?

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.