Coder Social home page Coder Social logo

anikristo / learnedsort Goto Github PK

View Code? Open in Web Editor NEW
80.0 80.0 12.0 5.11 MB

Learned Sort: a model-enhanced sorting algorithm

License: GNU General Public License v3.0

CMake 1.57% C++ 81.22% Shell 10.07% Python 7.14%
algorithm linear-models sorting-algorithms

learnedsort's People

Contributors

anikristo avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar

learnedsort's Issues

Underflow in slope calculation

Describe the bug
The slope calculation in the RMI::train() function (Case 6) has an underflow error that appears for unsigned types.

To reproduce

  1. Change the data type to some unsigned type (e.g. unsigned int)
  2. Put a breakpoint in Case 6 of the training function.
  3. Look at the denominator in the calculation of the slope of the current model. You can see that min.x - max.x > min.x due to underflow and roll-over.

Performance suffers for input sizes that are not divisible by 10^6

Describe the bug
When the input size is not divisible by 10^6 (1M), the performance suffers. For example, for INPUT_SIZE=1M, the code executes at 37,174 keys/ms.
However for 1.2M keys, it goes down to 12,332 keys/ms, whereas for 2M it goes back up to 38,461 keys/ms.

To reproduce
Steps to reproduce the behavior:

  1. Go to benchmarks_driver.cc
  2. Go to the function benchmark_arguments(benchmark::internal::Benchmark *) in Line 156.
  3. Set the input size parameters as follows:
static void benchmark_arguments(benchmark::internal::Benchmark *b) {
  
  // Set input size parameter
  b->Arg(1E6); 
  b->Arg(1.2E6); 
  b->Arg(1.23E6); 
  b->Arg(1.234E6); 
  b->Arg(1.2345E6); 
  b->Arg(2E6);  
 
  // Set time measurement unit
  b->Unit(benchmark::kMillisecond);
}

Expected behavior
The sorting rate should remain fairly stable throughout the benchmarks.

Screenshots
image

Environment (please complete the following information):

  • OS: Ubuntu 18.04
  • Compiler: GCC-9
  • Any dataset (tested on Standard Normal distribution)

SegFault when a major bucket's size is less than BATCH_SZ

Describe the bug
In rare occasions, one of the major buckets' size might be less than BATCH_SZ. In this case the second round of bucketization fails because the vector minor_bckt_sizes is uninitialized. This is due to the fact that the loop condition in line 648 is unmet.

To reproduce
Steps to reproduce the behavior:

  1. Remove the size guard that switches to std::sort (line 957)
  2. Set input size to 1000

Environment (please complete the following information):

  • OS: Ubuntu 16.04
  • Compiler: GCC-9

RMI training performs int division for int types

Describe the bug
The RMI training function performs an integer division for slope calculation when the input type is of integer type. This behavior is undesired since the slope and intercept terms of linear models are floating-points.

learned_sort fails to sort some collections

Describe the bug

As mentioned in #14, I had issues when I tried to benchmark LearnedSort because it did not always yield a sorted collection depending on the input. I managed to write a simple program demonstrating one such cases.

To reproduce

#include <algorithm>
#include <cassert>
#include <functional>
#include <random>
#include <vector>
#include "learned_sort.h"

int main()
{
    std::vector<int> vec;

    std::mt19937_64 prng(1604922353);
    for (int i = 0 ; i < 1'000'000 ; ++i) {
        vec.emplace_back(i % 16);
    }
    std::shuffle(vec.begin(), vec.end(), prng);

    learned_sort::sort(vec.begin(), vec.end());
    assert(std::is_sorted(vec.begin(), vec.end()));
}

I used a standard library random engine, so the results should be reproducible across platforms.

Expected behavior

I expect this code to run to the end and the assertion not to fire. When I run it the assertion fires, which means that LearnedSort failed to sort the collection.

Environment

  • OS: Windows 10
  • Compiler: MinGW-w64 GCC10
  • C++ standard: C++20

RMI trainer not using DEFAULT_OVERALLOCATION_RATIO when supplied parameter is invalid

I was reading through the code and I noticed on line 208 that when the supplied overallocation_ratio is invalid (<= 0) that the value is set to be 1 rather than RMI<T>::Params::DEFAULT_OVERALLOCATION_RATIO (which is 1.1). However, on line 211 it is incorrectly indicated on stderr that the overallocation_ratio is set to the default value.

Expected behavior
Expect the output on stderr to match what is being set in the parameters.

Tests fail for TEST_SIZE not divisible by 100

Describe the bug
The unit tests fail for input sizes that are not divisible by 100.

To reproduce
Steps to reproduce the behavior:

  1. Set TEST_SIZE = 1234560 in unit_tests/test_driver.cc
  2. Execute test.sh

Expected behavior
Tests of any size should pass.

Environment (please complete the following information):

  • OS: Ubuntu 18.04
  • Compiler: GCC-9
  • Any dataset

Additional context
The tests pass for TEST_SIZE = 1234500, which is divisible by 100. THRESHOLD is also set to 100, so one might think it is because the input size is not divisible by the THRESHOLD value. However, even if you set THRESHOLD=50, tests will also fail for sizes that are divisible by the THRESHOLD (e.g. 1234550).

Arrays are not sorted in case of using uint64_t keys

Describe the bug
The learnedSort algorithm fails to sort the arrays when the keys to be sorted are of uint64_t type.

To reproduce
Steps to reproduce the bug:

  1. Changing the keys type in benchmark_driver.cc to be uint64_t instead of double by changing "typedef double data_t;" to be "typedef uint64_t data_t;"
  2. Run ./compile.sh
  3. Run ./run.sh

Expected behavior
It was expected to have the array output sorted, and ideally beating the performance of other state-of-the-art baselines.

Screenshots
This is a screenshot of the output error:

Screen Shot 2020-07-01 at 11 57 06 PM

Environment (please complete the following information):

  • OS: Linux 5.7.4-arch1-1 and MacOS Catalina V 10.15.4
  • Compiler: GCC 10.1.0 on Arch Linux, and Apple clang version 11.0.3 on MacOS
  • Dataset: Default datasets used in benchmark_driver.cc

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.