Origami: A High-Performance Mergesort Framework

This repository contains the implementation of the components and algorithms of the Origami framework described in the Origami paper. The framework was developed by the IRL @ Texas A&M University, published in PVLDB Volume 15, and will be presented at VLDB 2022.

Summary

Mergesort is a popular algorithm for sorting real-world workloads as it is immune to data skewness, suitable for parallelization using vectorized intrinsics, and relatively simple to multithread. We introduce Origami, an in-memory mergesort framework that is optimized for scalar, as well as all current SIMD (single-instruction multiple-data) CPU architectures. For each vector-extension set (e.g., SSE, AVX2, AVX-512), we present an in-register sorter for small sequences that is up to 8X faster than prior methods and a branchless streaming merger that achieves up to a 1.5X speed-up over the naive merge. In addition, we introduce a cache-residing quad-merge tree to avoid bottlenecking on memory bandwidth and a parallel partitioning scheme to maximize thread-level concurrency. We develop an end-to-end sort with these components and produce a highly utilized mergesort pipeline by reducing the synchronization overhead between threads. Single-threaded Origami performs up to 2X faster than the closest competitor and achieves a nearly perfect speed-up in multi-core environments.
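
To illustrate what "branchless" means for a merger, here is a minimal scalar sketch. It is not Origami's bmerge (which operates on SIMD registers); the function name branchless_merge is hypothetical. The idea is that the comparison result is used as an integer to advance the input cursors, so the main loop has no data-dependent branch. Whether the compiler actually emits a conditional move for the select is an assumption that should be checked in the generated assembly.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using Key = std::uint32_t;

// Hypothetical helper, not part of the Origami API: merges two sorted
// arrays while avoiding a data-dependent branch in the main loop.
std::vector<Key> branchless_merge(const std::vector<Key>& a,
                                  const std::vector<Key>& b) {
    assert(std::is_sorted(a.begin(), a.end()));
    assert(std::is_sorted(b.begin(), b.end()));
    std::vector<Key> out(a.size() + b.size());
    std::size_t i = 0, j = 0, k = 0;
    while (i < a.size() && j < b.size()) {
        const Key x = a[i], y = b[j];
        const std::size_t take_b = (y < x) ? 1 : 0;  // comparison as 0 or 1
        out[k++] = take_b ? y : x;  // a select; often compiled to cmov
        i += 1 - take_b;            // advance exactly one of the cursors
        j += take_b;
    }
    // One input is exhausted; copy whatever remains of the other.
    while (i < a.size()) out[k++] = a[i++];
    while (j < b.size()) out[k++] = b[j++];
    return out;
}
```

Calling branchless_merge on {1, 4, 7} and {2, 3, 9} yields {1, 2, 3, 4, 7, 9}. The loop-exit test is still a branch, but it is highly predictable, unlike a branch on the key comparison.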

Performance

A quick preview of Origami sort performance is given in the tables below. For a deeper analysis and discussion, please refer to the paper. All benchmark code was compiled with Visual Studio 2019 and run on Windows Server 2016 on an 8-core Intel Skylake-X (i7-7820X) CPU with a fixed 4.7 GHz clock on all cores, 1 MB L2 cache, and 32 GB of DDR4-3200 quad-channel RAM. When AVX-512 was used, the BIOS defaulted to a 400 MHz lower clock (i.e., 4.3 GHz); this is the AVX offset that many motherboard manufacturers implement to keep core temperature under control. We enable the maximum level of optimization and use appropriate flags (e.g., /arch:AVX512) to ensure the compiler uses all available SIMD registers.

The following table compares Origami with prior work on each SIMD architecture. It reports the sort speed for different chunk sizes within a 1 GB array of 32-bit random keys.

Single-Thread Chunked Sort Speed (M keys/s, 32-bit keys)
Chunk size	SSE		AVX2			AVX-512
	AA-sort	Origami	ASPaS-sort	Peloton-sort	Origami	Watkins-sort	Xiaochen-sort	Yin-sort	Origami
128 K	63	176	53	139	228	40	198	140	295
256 K	61	147	47	128	210	33	184	130	269
512 K	59	138	44	120	195	30	172	113	249
1 M	57	131	41	109	183	28	160	102	232
2 M	55	124	39	92	174	25	150	95	216
4 M	54	118	37	81	168	23	140	88	203
8 M	52	112	35	77	162	21	131	83	191
16 M	50	107	33	73	153	20	122	78	181
32 M	48	102	32	70	145	19	115	72	172
64 M	47	98	30	67	138	18	109	69	163
128 M	45	95	29	65	132	17	103	66	156
256 M	44	91	28	63	126	17	97	64	149
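
The measurement behind this table can be sketched as follows: split the array into consecutive chunks, sort each chunk independently, and divide the total number of keys by the elapsed time. The sketch below is an assumption about the methodology, not the repository's benchmark code; chunked_sort_speed is a hypothetical name and std::sort stands in for the sorter under test. Note that 1 GB of 32-bit keys is 256 M keys (taking 1 GB as 2^30 bytes), so the last row corresponds to sorting the whole array as one chunk.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical benchmark helper: sorts every consecutive chunk of `keys`
// in place and returns the throughput in millions of keys per second.
// std::sort is only a stand-in for the sorter being measured.
double chunked_sort_speed(std::vector<std::uint32_t>& keys, std::size_t chunk) {
    assert(chunk > 0);
    const auto start = std::chrono::steady_clock::now();
    for (std::size_t off = 0; off < keys.size(); off += chunk) {
        const std::size_t len = std::min(chunk, keys.size() - off);
        const auto first = keys.begin() + static_cast<std::ptrdiff_t>(off);
        std::sort(first, first + static_cast<std::ptrdiff_t>(len));
    }
    const std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    // Guard against a zero reading from a coarse clock on tiny inputs.
    if (elapsed.count() <= 0.0) return 0.0;
    return static_cast<double>(keys.size()) / elapsed.count() / 1e6;
}
```

A real measurement would also pin the thread, warm up the cache, and repeat the run; those steps are omitted here.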

Origami is distribution-insensitive, i.e., it retains almost constant speed on all inputs. The next table shows this characteristic on 1 GB of data for the following distributions:

D1: Uniformly random, generated by Mersenne Twister
D2: All same keys
D3: Sorted
D4: Reverse sorted
D5: Almost sorted, where every 7th key is set to KEY_MAX
D6: Pareto keys, generated as min(ceil(beta * (1/(1-u) - 1)), 10000), where beta = 7 and u ~ uniform[0, 1]
D7: Bursts of same keys, where the length of each subsequence is drawn from D6 and key from D1
D8: Random shuffle, generated by randomly permuting D7
D9: Fibonacci sequence, wrapped around when it overflows the number of items to sort
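
As an illustration of how two of these inputs could be produced, the sketch below generates D5 and D6. It is not the repository's generator; the function names are hypothetical. Assumptions: the sorted base of D5 uses the index as the key, "every 7th key" means positions 7, 14, 21, ... (1-based), and a draw of u = 1 (where the Pareto formula divides by zero) is mapped to the cap of 10000.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <random>
#include <vector>

using Key = std::uint32_t;
constexpr Key KEY_MAX = std::numeric_limits<Key>::max();

// D5 (assumed layout): ascending keys, every 7th one replaced by KEY_MAX.
std::vector<Key> make_d5(std::size_t n) {
    std::vector<Key> v(n);
    for (std::size_t i = 0; i < n; ++i) {
        v[i] = ((i + 1) % 7 == 0) ? KEY_MAX : static_cast<Key>(i);
    }
    return v;
}

// D6: one Pareto key, min(ceil(beta * (1/(1-u) - 1)), 10000).
Key pareto_key(double u, double beta = 7.0) {
    assert(u >= 0.0);
    if (u >= 1.0) return 10000;  // assumption: avoid dividing by zero
    const double x = std::ceil(beta * (1.0 / (1.0 - u) - 1.0));
    return static_cast<Key>(std::min(x, 10000.0));
}

// D6: n Pareto keys with u drawn from a Mersenne Twister.
std::vector<Key> make_d6(std::size_t n, std::uint32_t seed) {
    std::mt19937 gen(seed);
    std::uniform_real_distribution<double> uni(0.0, 1.0);
    std::vector<Key> v(n);
    for (Key& k : v) k = pareto_key(uni(gen));
    return v;
}
```

For example, pareto_key(0.5) evaluates 7 * (1/0.5 - 1) = 7, and any u close enough to 1 saturates at 10000.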

Single-Thread Sort Speed for Different Distributions (M keys/s; 64+64 = 64-bit key with 64-bit value)
Register type	Key size (bits)	D1	D2	D3	D4	D5	D6	D7	D8	D9
Scalar	32	47	52	52	52	47	47	49	47	47
	64	43	47	47	47	43	44	45	43	44
	64+64	25	27	27	27	25	25	25	25	25
SSE	32	91	90	90	90	91	91	92	91	90
	64	50	49	49	50	50	50	50	50	49
	64+64	35	34	34	34	35	35	35	35	34
AVX2	32	126	126	125	124	126	126	127	126	125
	64	48	48	48	48	48	48	48	48	47
	64+64	34	34	34	34	34	34	34	34	34
AVX-512	32	149	145	145	145	148	149	149	149	146
	64	65	63	64	63	64	64	65	64	63
	64+64	27	26	26	26	26	27	26	26	26

Origami achieves near-perfect speed-up in multi-core environments. The next table shows the scalability for 1 GB of random data:

Parallel Sort Speed on Skylake-X (M keys/s; 1C to 8C = number of cores)
Register type	Key size (bits)	Speed (M keys/s)				Speed-up
		1C	2C	4C	8C	2C	4C	8C
Scalar	32	47	74	147	282	1.6	3.1	6.0
	64	43	75	149	273	1.7	3.5	6.4
	64+64	25	44	84	162	1.8	3.4	6.5
SSE	32	91	179	352	687	2.0	3.9	7.6
	64	50	94	185	361	1.9	3.7	7.2
	64+64	35	70	139	260	2.0	4.0	7.4
AVX2	32	126	248	495	950	2.0	3.9	7.5
	64	48	95	189	369	2.0	3.9	7.7
	64+64	34	70	137	254	2.1	4.0	7.5
AVX-512	32	149	286	565	1062	1.9	3.8	7.1
	64	65	122	242	462	1.9	3.7	7.1
	64+64	27	53	105	197	2.0	3.9	7.3
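
The speed-up columns are the ratio of the multi-core speed to the single-core speed. The small sketch below reproduces that arithmetic, with hypothetical helper names. For the AVX-512 32-bit row, 1062 / 149 is about 7.1 on 8 cores, i.e., a parallel efficiency of roughly 0.89. Some entries recomputed this way differ in the last digit (e.g., 687 / 91 is about 7.55 against a listed 7.6), presumably because the table was derived from unrounded measurements.

```cpp
#include <cassert>

// Speed-up of an n-core run relative to the single-core run.
double speedup(double one_core_speed, double n_core_speed) {
    assert(one_core_speed > 0.0);
    return n_core_speed / one_core_speed;
}

// Parallel efficiency: the fraction of ideal linear scaling achieved.
double efficiency(double one_core_speed, double n_core_speed, int cores) {
    assert(cores > 0);
    return speedup(one_core_speed, n_core_speed) / cores;
}
```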

Getting Started

Recommended Setup:

OS: Windows Server 2019 or Server 2016
Compiler: MSVC++ 14.29 (Visual Studio 2019 16.11)

Make sure the project is set to x64 Release.
Set C++ standard to C++17 or later under Project > Properties > Configuration Properties > General.
Set register type (Scalar, SSE, AVX2 or AVX-512) and sort key type (uint32, int64 or <int64, int64> key-value pair) in config.h.
Set appropriate compiler flags (e.g., /arch:AVX2 or /arch:AVX512) under Project > Properties > Configuration Properties > C/C++ > Command Line.
Make sure no Windows update from 03/21 or later is installed, as these updates make the AVX2 and AVX-512 bmerge up to 20% slower.
Update parameters in config.h if necessary (details below). Current tuned parameters are for the Skylake-X mentioned above.
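
One way to confirm that the /arch flag took effect is to inspect the compiler's predefined macros, as in the sketch below. It is not part of the repository; the function names are hypothetical. MSVC defines __AVX2__ under /arch:AVX2 and __AVX512F__ under /arch:AVX512, and GCC and Clang define the same macros for the corresponding -m flags. This reports what the compiler was told to target, not what the CPU supports at run time.

```cpp
#include <cassert>

// Width in bits of the widest vector extension enabled at compile time,
// or 0 if none is detected (scalar build).
constexpr int compiled_vector_bits() {
#if defined(__AVX512F__)
    return 512;
#elif defined(__AVX2__)
    return 256;
#elif defined(__SSE2__) || defined(_M_X64) || \
    (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
    return 128;  // SSE2 is the baseline on x64
#else
    return 0;
#endif
}

// Number of keys of the given width that fit in one register.
int keys_per_register(int key_bits) {
    assert(key_bits == 32 || key_bits == 64);
    const int bits = compiled_vector_bits();
    return bits == 0 ? 1 : bits / key_bits;
}
```

With /arch:AVX512, keys_per_register(32) evaluates to 512 / 32 = 16; the register type chosen in config.h should not be wider than what this check reports.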

Parameters:

_P1_NREG: Number of registers available, typically 16 or 32
_P1_SWITCH: Switch point from mcmerge to mrmerge, get this from bench_sort_phases:phase1_matrix_merge_test()
_P1_N: Switch point from in-register sort to bmerge, get this from bench_sort_phases:phase2_switch_point_test()
_P2_MERGE_UNROLL: In-cache bmerge unroll, get this from bench_bmerge:bmerge_test()
_P2_MERGE_NREG_1x: Number of registers per stream in bmerge no unroll
_P2_MERGE_NREG_2x: Number of registers per stream in bmerge 2X unroll
_P2_MERGE_NREG_3x: Number of registers per stream in bmerge 3X unroll
_P3_MERGE_UNROLL: Out-of-cache bmerge unroll, get this from bench_bmerge:bmerge_test()
_P3_MERGE_NREG_1x: Number of registers per stream in bmerge no unroll
_P3_MERGE_NREG_2x: Number of registers per stream in bmerge 2X unroll
_P3_MERGE_NREG_3x: Number of registers per stream in bmerge 3X unroll
_MTREE_NREG: Number of registers per stream while merging within mtree, get this from bench_mtree:mtree_single_thread_test()
_MT_L1_BUFF_N: Buffer size at the internal node of each 4-way node in mtree
_MT_L2_BUFF_N: Buffer size at the root of each 4-way node in mtree
_MIN_K: Minimum way of merge to avoid memory bandwidth bottleneck, get this from bench_mtree:mtree_multi_thread_test()
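
To make the role of the mtree buffer parameters more concrete, here is a scalar sketch of a single 4-way merge node: two leaf mergers each fill a small intermediate buffer, and a root merger consumes those buffers, refilling one whenever it runs empty. This is only a conceptual model under stated assumptions, not Origami's mtree. The names Leaf, merge4, and buf_n are hypothetical, buf_n loosely plays the role of _MT_L1_BUFF_N, and the root output buffer (_MT_L2_BUFF_N) is not modeled. Keeping the intermediate buffers small is what lets the whole node stay in cache while the four inputs stream from memory.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using Key = std::uint32_t;

// Leaf merger over two sorted input streams (hypothetical type).
struct Leaf {
    const Key *a, *a_end, *b, *b_end;
    // Writes up to n merged keys to out and returns how many were written.
    std::size_t fill(Key* out, std::size_t n) {
        std::size_t k = 0;
        while (k < n && a != a_end && b != b_end) out[k++] = (*b < *a) ? *b++ : *a++;
        while (k < n && a != a_end) out[k++] = *a++;
        while (k < n && b != b_end) out[k++] = *b++;
        return k;
    }
};

// 4-way merge through two intermediate buffers of buf_n keys each.
std::vector<Key> merge4(const std::vector<Key>& s0, const std::vector<Key>& s1,
                        const std::vector<Key>& s2, const std::vector<Key>& s3,
                        std::size_t buf_n) {
    assert(buf_n > 0);
    Leaf left{s0.data(), s0.data() + s0.size(), s1.data(), s1.data() + s1.size()};
    Leaf right{s2.data(), s2.data() + s2.size(), s3.data(), s3.data() + s3.size()};
    std::vector<Key> lb(buf_n), rb(buf_n), out;
    out.reserve(s0.size() + s1.size() + s2.size() + s3.size());
    std::size_t li = 0, ln = 0, ri = 0, rn = 0;
    while (true) {
        // Refill an intermediate buffer once it has been fully consumed.
        if (li == ln) { ln = left.fill(lb.data(), buf_n); li = 0; }
        if (ri == rn) { rn = right.fill(rb.data(), buf_n); ri = 0; }
        const bool l_ok = li < ln, r_ok = ri < rn;
        if (!l_ok && !r_ok) break;  // all four streams are exhausted
        if (l_ok && (!r_ok || lb[li] <= rb[ri])) out.push_back(lb[li++]);
        else out.push_back(rb[ri++]);
    }
    return out;
}
```

For instance, merging {1, 5, 9}, {2, 6}, {3, 7, 11}, and an empty stream with buf_n = 2 produces {1, 2, 3, 5, 6, 7, 9, 11}.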

Run:

Once all parameters are set, the Origami sort API can be used as shown in bench_sort:sort_bench().

License

This project is licensed under the GPLv3.0 License - see the LICENSE file for details.

Authors

Arif Arman, Dmitri Loguinov
