
radix-hash-join

Software Development for Database Systems

About

The continuous advancement of hardware technology has led to the mass production of multi-core CPUs as well as to a steady drop in the cost of RAM in terms of $/GB. In this project we demonstrate an efficient implementation of the join operation in relational databases that exploits CPU parallelism and the large amount of RAM available in modern servers.

Our project, written in C, was built in three parts that were later merged into a single program, in accordance with the instructions we were given. It features a hash-based radix partition join inspired by the join algorithms introduced in this paper.

It is also worth mentioning that the idea for this project originated from the SIGMOD Programming Contest 2018. We therefore follow the task specifications of the contest and utilize the testing workloads provided to the contestants.

Implementation

  • Query Execution

    Firstly, during the non-timed pre-processing stage we collect various statistics about the input relations, such as the minimum/maximum value and the approximate number of distinct values. These statistics can be useful for future work on query optimization.

    During the execution of a query we use an intermediate result structure to store the tuples resulting from each predicate, whether a filter or a join. In this way we avoid scanning a relation from top to bottom multiple times when it appears in more than one predicate.
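
    A minimal sketch of what such an intermediate result might look like is given below; the type and field names are illustrative, not the ones used in the project.

      #include <stdint.h>

      /* One column of row ids per relation that has already been filtered or
       * joined; later predicates read these row ids instead of re-scanning
       * the base relation from top to bottom.                              */
      typedef struct {
          uint32_t  rel_id;    /* which relation of the query the row ids refer to */
          uint64_t *row_ids;   /* one surviving row id per intermediate tuple      */
      } inter_column_t;

      typedef struct {
          uint32_t        num_cols;    /* relations present in the intermediate result */
          uint64_t        num_tuples;  /* number of intermediate tuples                */
          inter_column_t *cols;
      } inter_result_t;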

  • Radix Hash Join

    The main idea of the Radix Hash Join algorithm is to partition the input data of the two join relations into a number of buckets, so that the largest bucket fits into the CPU cache. More precisely, the RHJ algorithm consists of the following three phases (a simplified code sketch of all three is given after the figure below):

    • Partition

      We partition the data of each relation into a number of buckets by applying the same hash function (HASH_FUN_1) to both relations. In our implementation, HASH_FUN_1 uses the n least-significant bits of the join key to determine its bucket. In addition, a histogram and a prefix sum table are calculated for each of the two relations.

    • Build

      An index is created for each of the partitions (i.e. buckets) of the smaller relation. Each index resembles a hash table built from two arrays (a chain array and a bucket array). These arrays store the tuple indices of the corresponding bucket, organized according to the value of a second hash function (HASH_FUN_2).

    • Probe

      Partitions of the non-indexed relation, i.e. the larger one, are scanned and the respective index is probed for matching tuples.

    [Figure: the three phases of the Radix Hash Join algorithm]
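
    The following is a minimal, single-threaded sketch of the three phases. The names (HASH_FUN_1, HASH_FUN_2, partition, build, probe) mirror the description above, but the constants, signatures and details are illustrative rather than the project's actual code.

      #include <stdint.h>

      #define RADIX_BITS  4                            /* n least-significant bits       */
      #define NUM_BUCKETS (1 << RADIX_BITS)
      #define HASH_FUN_1(k) ((k) & (NUM_BUCKETS - 1))  /* n LSBs decide the bucket       */
      #define H2_SIZE     101
      #define HASH_FUN_2(k) ((k) % H2_SIZE)            /* second hash used by the index  */

      typedef struct { uint64_t key; uint64_t payload; } tuple_t;

      /* Phase 1 - Partition: build the histogram, turn it into a prefix sum,
       * then scatter every tuple into its bucket in the output array.        */
      void partition(const tuple_t *in, tuple_t *out, uint64_t n, uint64_t *psum)
      {
          uint64_t hist[NUM_BUCKETS] = {0}, offset[NUM_BUCKETS];

          for (uint64_t i = 0; i < n; i++)             /* histogram            */
              hist[HASH_FUN_1(in[i].key)]++;

          psum[0] = 0;                                 /* exclusive prefix sum */
          for (int b = 1; b < NUM_BUCKETS; b++)
              psum[b] = psum[b - 1] + hist[b - 1];

          for (int b = 0; b < NUM_BUCKETS; b++)
              offset[b] = psum[b];
          for (uint64_t i = 0; i < n; i++)             /* scatter              */
              out[offset[HASH_FUN_1(in[i].key)]++] = in[i];
      }

      /* Phase 2 - Build: index one bucket of the smaller relation with a bucket
       * array (one chain head per HASH_FUN_2 value) and a chain array (links). */
      void build(const tuple_t *part, uint64_t n, int64_t *bucket, int64_t *chain)
      {
          for (int h = 0; h < H2_SIZE; h++)
              bucket[h] = -1;
          for (uint64_t i = 0; i < n; i++) {
              int64_t h = HASH_FUN_2(part[i].key);
              chain[i]  = bucket[h];                   /* link to previous head    */
              bucket[h] = (int64_t)i;                  /* tuple i is the new head  */
          }
      }

      /* Phase 3 - Probe: scan the matching bucket of the larger relation and
       * follow the chain of the index, emitting tuples with equal keys.        */
      void probe(const tuple_t *r, uint64_t nr, const tuple_t *s,
                 const int64_t *bucket, const int64_t *chain)
      {
          for (uint64_t i = 0; i < nr; i++)
              for (int64_t j = bucket[HASH_FUN_2(r[i].key)]; j != -1; j = chain[j])
                  if (s[j].key == r[i].key) {
                      /* (r[i].payload, s[j].payload) is a join match */
                  }
      }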

  • Multithreading

    We managed to speed up our serial implementation by applying multithreading to various parts of the code, such as filter execution, histogram creation, bucket indexing and probing. We decided to use POSIX Threads for this purpose. You may modify the thread number here. The following figures depict the satisfactory speedup we achieved; a simplified sketch of the threading approach is given after the machine specifications below.

    [Figure: execution time vs. number of threads on the small dataset, measured on a machine from our department's Linux lab (Intel i5-6500, 3.2 GHz, 4 cores / 4 threads, 16 GB RAM)]

    [Figure: execution time vs. number of threads on the public dataset, which can be downloaded from here]

    Our machine's specifications are:

    • CPU: Ryzen 2400G, 3.6 GHz, 4 cores, 8 threads
    • RAM: 16GB DDR4 dual-channel
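
    As an illustration of the threading approach, the sketch below shows how histogram creation can be split across POSIX threads: each thread scans a disjoint slice of the relation into a private histogram and the main thread merges the partial results. The names and the fixed thread count are hypothetical, not the project's actual code (the real thread number is configurable, as noted above).

      #include <pthread.h>
      #include <stdint.h>

      #define NUM_THREADS 4
      #define RADIX_BITS  4
      #define NUM_BUCKETS (1 << RADIX_BITS)
      #define HASH_FUN_1(k) ((k) & (NUM_BUCKETS - 1))

      typedef struct { uint64_t key; uint64_t payload; } tuple_t;

      typedef struct {
          const tuple_t *tuples;               /* slice assigned to this thread */
          uint64_t       count;
          uint64_t       hist[NUM_BUCKETS];    /* thread-private histogram      */
      } hist_arg_t;

      static void *hist_worker(void *p)
      {
          hist_arg_t *a = p;
          for (uint64_t i = 0; i < a->count; i++)
              a->hist[HASH_FUN_1(a->tuples[i].key)]++;
          return NULL;
      }

      void parallel_histogram(const tuple_t *rel, uint64_t n, uint64_t *hist)
      {
          pthread_t  tid[NUM_THREADS];
          hist_arg_t arg[NUM_THREADS] = {0};
          uint64_t   chunk = (n + NUM_THREADS - 1) / NUM_THREADS;

          for (int b = 0; b < NUM_BUCKETS; b++)
              hist[b] = 0;

          for (int t = 0; t < NUM_THREADS; t++) {
              uint64_t start = (uint64_t)t * chunk;
              arg[t].tuples = rel + (start < n ? start : n);
              arg[t].count  = start < n ? (n - start < chunk ? n - start : chunk) : 0;
              pthread_create(&tid[t], NULL, hist_worker, &arg[t]);
          }
          for (int t = 0; t < NUM_THREADS; t++) {      /* wait, then merge */
              pthread_join(tid[t], NULL);
              for (int b = 0; b < NUM_BUCKETS; b++)
                  hist[b] += arg[t].hist[b];
          }
      }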

Usage

  • cd final
  • ./compile.sh && ./runTestHarness.sh

Unit Testing

For unit testing we use the CUnit testing framework. Tests are added to different suites, each responsible for testing a specific category of functions. In order to run the tests, CUnit must be installed on your system.
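
A minimal example of how a suite and a test might be registered with CUnit is shown below; the function under test and the suite name are hypothetical, not taken from the project's actual test code.

    #include <CUnit/Basic.h>

    /* hypothetical function under test */
    static int add(int a, int b) { return a + b; }

    static void test_add(void)
    {
        CU_ASSERT_EQUAL(add(2, 3), 5);
    }

    int main(void)
    {
        if (CU_initialize_registry() != CUE_SUCCESS)
            return CU_get_error();

        CU_pSuite suite = CU_add_suite("arithmetic_suite", NULL, NULL);
        if (suite == NULL || CU_add_test(suite, "test_add", test_add) == NULL) {
            CU_cleanup_registry();
            return CU_get_error();
        }

        CU_basic_set_mode(CU_BRM_VERBOSE);   /* report every test's result */
        CU_basic_run_tests();
        CU_cleanup_registry();
        return CU_get_error();
    }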

Running the tests

  • cd final
  • make unittest && ./runUnitTesting.sh

Profiling

The pie chart below was generated from profiling data collected with callgrind's function-cost mechanism.

[Figure: per-function cost breakdown]

  1. constructTuple : creates a new tuple that will be added to the join result

  2. insertAtVector : inserts the tuple into the result vector

  3. joinFunc : implements Probing (phase 3)

  4. checkSumFunc : calculates checksums after the query's execution has finished

  5. partition : implements Partition (phase 1)

You may also run a memory check with valgrind by uncommenting the corresponding line in the run.sh script.
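
For reference, this kind of profiling data is typically collected and inspected with commands along the following lines (the binary name is a placeholder; the exact invocation used for the chart above may differ):

    valgrind --tool=callgrind ./your_binary      # writes callgrind.out.<pid>
    callgrind_annotate callgrind.out.<pid>       # per-function cost report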

Authors

References
