
hyrise / hyrise-v1


HYRISE In-Memory Hybrid Storage Engine (archived, now developed in hyrise/hyrise repo)

Home Page: https://github.com/hyrise/hyrise

License: MIT License

Makefile 0.82% CSS 2.61% JavaScript 2.06% Python 3.98% Shell 0.27% C++ 87.74% C 0.69% Ruby 1.15% OpenEdge ABL 0.05% HTML 0.12% Perl 0.51%

hyrise-v1's Introduction


Welcome to Hyrise

Hyrise is a research in-memory database system that has been developed by HPI since 2009 and has been entirely rewritten in 2017. Our goal is to provide a clean and flexible platform for research in the area of in-memory data management. Its architecture allows us, our students, and other researchers to conduct experiments around new data management concepts. To enable realistic experiments, Hyrise features comprehensive SQL support and performs powerful query plan optimizations. Well-known benchmarks, such as TPC-H or TPC-DS, can be executed with a single command and without any preparation.

This readme file focuses on the technical aspects of the repository. For more background on our research and for a list of publications, please visit the Hyrise project page.

You can still find the (archived) previous version of Hyrise on GitHub.

Citation

When referencing this version of Hyrise, please use the following bibtex entry:

@inproceedings{DBLP:conf/edbt/DreselerK0KUP19,
  author    = {Markus Dreseler and
               Jan Kossmann and
               Martin Boissier and
               Stefan Klauck and
               Matthias Uflacker and
               Hasso Plattner},
  editor    = {Melanie Herschel and
               Helena Galhardas and
               Berthold Reinwald and
               Irini Fundulaki and
               Carsten Binnig and
               Zoi Kaoudi},
  title     = {Hyrise Re-engineered: An Extensible Database System for Research in
               Relational In-Memory Data Management},
  booktitle = {Advances in Database Technology - 22nd International Conference on
               Extending Database Technology, {EDBT} 2019, Lisbon, Portugal, March
               26-29, 2019},
  pages     = {313--324},
  publisher = {OpenProceedings.org},
  year      = {2019},
  url       = {https://doi.org/10.5441/002/edbt.2019.28},
  doi       = {10.5441/002/edbt.2019.28},
  timestamp = {Mon, 18 Mar 2019 16:09:00 +0100},
  biburl    = {https://dblp.org/rec/conf/edbt/DreselerK0KUP19.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}

Supported Systems

Hyrise is developed for Linux (preferably the most current Ubuntu version) and optimized to run on server hardware. We support Mac to facilitate the local development of Hyrise, but do not recommend it for benchmarking.

Supported Benchmarks

We support a number of benchmarks out of the box. This makes it easy to generate performance numbers without having to set up the data generation, load CSVs, and find a query runner. You can run them using the ./hyriseBenchmark* binaries.

Note that the query plans are generated in our CI pipeline with possibly many stages in parallel and different CI runs might be executed on different machines. Reported runtimes are not to be taken as solid benchmark performance numbers.

Benchmark      Notes
TPC-DS         Query Plans
TPC-H          Query Plans
Join Order     Query Plans
Star Schema    Query Plans
JCC-H          Call the hyriseBenchmarkTPCH binary with the -j flag.
TPC-C          In development, no proper optimization done yet

Getting started

Have a look at our contributor guidelines.

You can find definitions of most of the terms and abbreviations used in the code in the glossary. If you cannot find something that you are looking for, feel free to open an issue.

The Step by Step Guide is a good starting point to get to know Hyrise.

Native Setup

You can install the dependencies on your own or use the install_dependencies.sh script (recommended), which installs all of the dependencies and submodules listed therein. The install script was tested under macOS Monterey (12.4) and Ubuntu 22.04.

See dependencies for a detailed list of dependencies to use with brew install or apt-get install, depending on your platform. As compilers, we generally use recent versions of clang and gcc (Linux only). Please make sure that the system compiler points to the most recent version or use cmake (see below) accordingly. Older versions may work, but are neither tested nor supported.

Setup using Docker

If you want to create a Docker-based development environment using CLion, head over to our dedicated tutorial.

Otherwise, to get all dependencies of Hyrise into a Docker image, run

docker build -t hyrise .

You can start the container via

docker run -it hyrise

Inside the container, you can then checkout Hyrise and run ./install_dependencies.sh to download the required submodules.

Building and Tooling

It is highly recommended to perform out-of-source builds, i.e., creating a separate directory for the build. Advisable names for this directory would be cmake-build-{debug,release}, depending on the build type. Within this directory call cmake .. to configure the build. By default, we use very strict compiler flags (beyond -Wextra, including -Werror). If you use one of the officially supported environments, this should not be an issue. If you simply want to test Hyrise on a different system and run into issues, you can call cmake -DHYRISE_RELAXED_BUILD=On .., which will disable these strict checks. Subsequent calls to CMake, e.g., when adding files to the build, will not be necessary; the generated Makefiles will take care of that.

Compiler choice

CMake will default to your system's default compiler. To use a different one, call cmake -DCMAKE_C_COMPILER=clang -DCMAKE_CXX_COMPILER=clang++ .. in a clean build directory. See dependencies for supported compiler versions.

Unity Builds

Starting with cmake 3.16, you can use -DCMAKE_UNITY_BUILD=On to perform unity builds. For a complete (re-)build or when multiple files have to be rebuilt, these are usually faster, as the relative cost of starting a compiler process and loading the most common headers is reduced. However, this only makes sense for debug builds. See our blog post on reducing the compilation time for details.

ccache

For development, you may want to use ccache, which reduces the time needed for recompiles significantly. Especially when switching branches, this can reduce the time to recompile from several minutes to one or less. On the downside, we have seen random build failures on our CI server, which is why we do not recommend ccache anymore but merely list it as an option. To use ccache, add -DCMAKE_CXX_COMPILER_LAUNCHER=ccache to your cmake call. You will need to adjust some ccache settings either in your environment variables or in your ccache config so that ccache can handle the precompiled headers. On our CI server, this worked for us: CCACHE_SLOPPINESS=file_macro,pch_defines,time_macros CCACHE_DEPEND=1.

Build

Simply call make -j*, where * denotes the number of threads to use.

Usually, debug binaries are created. To configure a build directory for a release build, make sure it is empty and call CMake like cmake -DCMAKE_BUILD_TYPE=Release ..

Lint

./scripts/lint.sh (Google's cpplint is used for the database code. In addition, we use flake8 for linting the Python scripts under /scripts.)

Format

./scripts/format.sh (clang-format is used for the database code. We use black for formatting the Python scripts under /scripts.)

Test

Calling make hyriseTest from the build directory builds all available tests. The binary can be executed with ./<YourBuildDirectory>/hyriseTest. Subsets of all available tests can be selected via --gtest_filter=.

Coverage

./scripts/coverage.sh will print a summary to the command line and create detailed html reports at ./coverage/index.html

Requires clang on macOS and Linux.

Address/UndefinedBehavior Sanitizers

cmake -DENABLE_ADDR_UB_LEAK_SANITIZATION=ON will generate Makefiles with AddressSanitizer, LeakSanitizer, and Undefined Behavior options. Compile and run them as normal - if any issues are detected, they will be printed to the console. It will fail on the first detected error and will print a summary. To convert addresses to actual source code locations, make sure llvm-symbolizer is installed (included in the llvm package) and is available in $PATH. To specify a custom location for the symbolizer, set $ASAN_SYMBOLIZER_PATH to the path of the executable. This seems to work out of the box on macOS - if not, make sure to have llvm installed. The binary can be executed with LSAN_OPTIONS=suppressions=asan-ignore.txt ./<YourBuildDirectory>/hyriseTest.

cmake -DENABLE_THREAD_SANITIZATION=ON will work as above but with the ThreadSanitizer. Some sanitizers are mutually exclusive, which is why we use two configurations for this.

Compile Times

When trying to optimize the time spent building the project, it is often helpful to have an idea how much time is spent where. scripts/compile_time.sh helps with that. Get usage instructions by running it without any arguments.

Maintainers

  • Martin Boissier
  • Daniel Lindner
  • Marcel Weisgut

Contact: [email protected]

Maintainers Emeriti

  • Markus Dreseler
  • Stefan Halfpap
  • Jan Kossmann

Contributors

  • Yannick Bäumer
  • Lawrence Benson
  • Jasper Blum
  • Lukas Budach
  • Timo Djürken
  • Alexander Dubrawski
  • Fabian Dumke
  • Leonard Geier
  • Richard Ebeling
  • Fabian Engel
  • Ben-Noah Engelhaupt
  • Moritz Eyssen
  • Martin Fischer
  • Christian Flach
  • Pedro Flemming
  • Mathias Flüggen
  • Johannes Frohnhofen
  • Pascal Führlich
  • Carl Gödecken
  • Adrian Holfter
  • Theresa Hradilak
  • Ben Hurdelhey
  • Sven Ihde
  • Ivan Illic
  • Jonathan Janetzki
  • Michael Janke
  • Max Jendruk
  • Tobias Jordan
  • David Justen
  • Youri Kaminsky
  • Marvin Keller
  • Mirko Krause
  • Eva Krebs
  • Henok Lachmann
  • Sven Lehmann
  • Till Lehmann
  • Tom Lichtenstein
  • Alexander Löser
  • Jan Mattfeld
  • Arne Mayer
  • Dominik Meier
  • Julian Menzler
  • Torben Meyer
  • Leander Neiß
  • Vincent Rahn
  • Hendrik Rätz
  • Niklas Riekenbrauck
  • Alexander Riese
  • Marc Rosenau
  • Johannes Schneider
  • David Schumann
  • Simon Siegert
  • Arthur Silber
  • Furkan Simsek
  • Toni Stachewicz
  • Daniel Stolpe
  • Jonathan Striebel
  • Nils Thamm
  • Hendrik Tjabben
  • Justin Trautmann
  • Carsten Walther
  • Leo Wendt
  • Lukas Wenzel
  • Fabian Wiebe
  • Tim Zimmermann

hyrise-v1's People

Contributors

bastih, bensk1, brandlukas, cfrahnow, dukeharris, felixeberhardt, grundprinzip, irruputuncu, jwust, kaihowl, kateyy, klauck, martinfaust, mrks, schwald, timbokopter, torpedro, tzwenn


hyrise-v1's Issues

Inconsistent State in MVCC

In the current implementation, MVCC allows for inconsistent reads. For a record deleted by another transaction, it is impossible to determine whether it would have been valid or not in the context of the current transaction. Example:

  1. T1 starts with last_CID=6, does things
  2. T2 starts, inserts record A, commits with CID=7
  3. T3 starts, deletes record A, commits with CID=8
  4. T1 reads A with valid=0 and CID (8) > last_CID (6) ==> current implementation assumes that value was valid for T1's read
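
The ambiguity in step 4 can be made concrete with a small sketch. It assumes, as the example suggests, that a record carries only a valid flag and the CID of its last change. All type and function names here (SingleCidRecord, visible_single_cid, RangeCidRecord, visible_range_cid) are hypothetical and not taken from the Hyrise code. The second half shows one common alternative, separate begin and end CIDs, purely as an illustration and not as the fix chosen by the project.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

using CommitId = std::uint64_t;
// Hypothetical sentinel meaning "not deleted yet".
constexpr CommitId kNever = std::numeric_limits<CommitId>::max();

// Assumed layout: one valid flag plus the CID of the LAST change only.
struct SingleCidRecord {
  bool valid;
  CommitId cid;
};

// Rule described in the issue: if the last change is newer than the reader's
// snapshot, assume the record was in the opposite state before that change.
inline bool visible_single_cid(const SingleCidRecord& r, CommitId last_cid) {
  if (r.cid <= last_cid) return r.valid;
  return !r.valid;  // wrong when insert AND delete both happened after last_cid
}

// Alternative sketch: keep the inserting and the deleting CID separately.
struct RangeCidRecord {
  CommitId begin_cid;  // CID of the inserting transaction
  CommitId end_cid;    // CID of the deleting transaction, kNever while live
};

inline bool visible_range_cid(const RangeCidRecord& r, CommitId last_cid) {
  assert(r.begin_cid <= r.end_cid);
  // Visible only if inserted at or before the snapshot and not yet deleted in it.
  return r.begin_cid <= last_cid && last_cid < r.end_cid;
}
```

With the values from the example, visible_single_cid({valid=0, cid=8}, 6) reports the record as visible, whereas visible_range_cid({begin=7, end=8}, 6) does not, because the insert at CID 7 is no longer lost.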

Radix Join and PCs

Currently, the RadixJoin does not work on PointerCalculator objects since the class does not implement the getAttributeVectors() interface. Even if it did, it would break, since the positions inside the vector might not be the same as the input positions.

Possible solutions: If the input is a pointer calculator, rewrite the positions to match the input vector and extend the handling to multiple horizontal partitions.

This is related to #18

Duplicate inputs to operators are currently removed

This makes writing Operators that expect identical data inputs twice (such as self joins) a pain.

Does anyone know a case where it makes sense to filter inputs for duplicates? Please let me know; otherwise, this filter will be removed.

Add caveats section to readme

It might be nice to add a caveat section on the intended audience and what someone can expect from this chunk of code.

Assign unique table ids for logging

For logging, we need to identify to which table a certain entry was written. This could be a unique table id.

  • We need to discuss the scope of such an ID. Is it a Store or a Table? What information do we need to non-ambiguously identify where a logged value id belongs?
  • In the following, "table" means the logical construct, not the Table class
  • Table IDs must be unique over restarts of Hyrise. When I create the table "customers" and it gets the ID 5, it should have 5 when I restart Hyrise.
  • For this, we also need to save meta information (column [names], ...) about the table
  • We only need to log data in the delta - the main is persisted using snapshots
  • It would be great to have small table ids, not GUIDs
  • The table id shall be stored close to the data (in the Store class?). We should not need to go to the StorageManager every time we want to log something

Fails on clang++ 3.2

Fails in AbstractCoreBoundTask::launchThread when taking address of executeTask with "pure virtual function called".

Barrier has weird semantics

Well Barrier is kind of weird, because it uses the length n of _field_definition to forward the first n elements, regardless of actually set values in _field_definition.

We should make sure that its semantics are basically output == input.

Simplify Table Loading

There should be a way to simply access the tables stored on disk without always having to specify the load operation explicitly.

Partitioning of tables for parallel execution

The distribute method in PlanOperation returns first and last as inclusive array positions (0 and 999 for a table with 1000 elements). Following iterator style, you would expect last to point one past the final element so that it works as an exit condition when iterating over the input table.
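
A minimal sketch of the iterator-style convention, assuming a free function with a made-up name and signature (half_open_range is not the actual PlanOperation interface). It splits total rows into parts chunks and returns each chunk as a half-open range [first, last), so last can be used directly as a loop exit condition.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>

// Hypothetical helper: returns the half-open row range [first, last) that
// partition `part` (0-based) of `parts` should process out of `total` rows.
inline std::pair<std::size_t, std::size_t> half_open_range(std::size_t total,
                                                           std::size_t part,
                                                           std::size_t parts) {
  assert(parts > 0 && part < parts);
  const std::size_t base = total / parts;   // rows every partition gets
  const std::size_t extra = total % parts;  // first `extra` partitions get one more
  const std::size_t first = part * base + std::min(part, extra);
  const std::size_t last = first + base + (part < extra ? 1 : 0);
  return {first, last};
}
```

A caller would then iterate with for (std::size_t row = range.first; row < range.second; ++row), and a table with 1000 elements processed by a single partition yields the range [0, 1000).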

RFC: Improve logging to explicitly distinguish hardware counters and make the results clearer

{
    /*... other stuff ... */
    "performance_data": {
        /* ... other operands, too */
        "operandId": {
            "timing": { /* all these values are explicitly not measured through hardware counters */
                "start": 10 /*ns since query start*/ ,
                "stop": 20 /*ns since query start*/ ,
                "start_epoch": 100000000123 /*ns since epoch*/ ,
                "stop_epoch": 100000000133 /*ns since epoch*/
            },
            "counters": {
                /* these values are explicitly measured through hardware counters */
                "PAPI_TOT_INS": 10,
                "PAPI_TOT_CYC": 20,
                "PAPI_L2_TCM": 30
            },
            "input": {
                "tables": []
            },
            "output": {
                /* same as input */
            },
            "custom": {
                /* op may emit whatever JSON it deems sensible */
            }
        }
    }
}

In the request, we specify what we want to log:

{ "operations": [ /*...*/ ], "logging": ["papi", "input", "output", "timing"] }

Issues with this proposal: "counters" would currently only count what happens in executePlanOperation

Logfile Management (DESIGN DISCUSSION)

Currently, all transactions log into a single logfile, which consequently grows over time and never gets deleted or truncated. Also, once a table is merged, its previous log entries must be removed/invalidated to avoid redo recovery in case of failure.

Solution 1: One logfile per table. Problems: increased logging overhead, commits must either be written into all logfiles or in a separate commit log.

Solution 2: One logfile for all tables, checkpoint entries when table is merged (i.e. "ignore previous log entries for table X"), new logfile after merge. When all tables have been merged once, the first logfile can be deleted, the second after all tables were merged twice and so on. Problems: Might take a while until all tables are merged.

Solution 3: One logfile for all tables, checkpoint entries when table is merged (i.e. "ignore previous log entries for table X"), new logfile after merge. Separate worker reads old logfiles on every table merge, removes entries and writes truncated log back. Problems: Potentially costly operation, IO overhead
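
The bookkeeping behind Solution 2 can be sketched as follows. This is one possible reading of the proposal, not an agreed design: every merge closes the active logfile and starts a new one, and a closed logfile becomes deletable once every table has been merged at or after the point where that logfile was closed. The class LogfileTracker and its methods are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical tracker for Solution 2. Logfiles are numbered 0, 1, 2, ...
class LogfileTracker {
 public:
  explicit LogfileTracker(const std::vector<std::string>& tables) {
    for (const auto& table : tables) _covered[table] = 0;
  }

  // Called when `table` is merged: writes the checkpoint (not modeled here),
  // closes the active logfile, and switches to a new one.
  void on_merge(const std::string& table) {
    const auto it = _covered.find(table);
    assert(it != _covered.end());
    ++_active;             // logfiles [0, _active) are now closed
    it->second = _active;  // entries of `table` in [0, _active) are obsolete
  }

  // Number of leading logfiles in which the entries of ALL tables are obsolete.
  std::size_t deletable_prefix() const {
    std::size_t result = _active;
    for (const auto& entry : _covered) result = std::min(result, entry.second);
    return result;
  }

 private:
  std::map<std::string, std::size_t> _covered;  // table -> obsolete logfile prefix
  std::size_t _active = 0;                      // index of the active logfile
};
```

The sketch also makes the stated drawback visible: deletable_prefix() stays at 0 until the last table has been merged at least once.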

Binutils-dev missing in vagrant/chef setup

I ran vagrant up. On the virtual box I tried to compile Hyrise, but the binutils-dev package is missing (the -lbfd flag fails in the linker). A simple apt-get install binutils-dev rectified the situation.

MergeJoin does not merge correctly

Entries in the right-hand table only appear once in the result table, namely for the first match. Instead, they should appear for every matching entry in the left-hand table.
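
For reference, the expected behavior can be sketched on plain sorted integer vectors. This is a simplified illustration with made-up names (merge_join_sorted, PosPair), not the MergeJoin operator itself: for every run of equal keys, it emits the full cross product of left and right positions, so a right-hand entry appears once per matching left-hand entry.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

using PosPair = std::pair<std::size_t, std::size_t>;  // (left position, right position)

// Joins two ascending key columns and returns all matching position pairs.
inline std::vector<PosPair> merge_join_sorted(const std::vector<int>& left,
                                              const std::vector<int>& right) {
  assert(std::is_sorted(left.begin(), left.end()));
  assert(std::is_sorted(right.begin(), right.end()));
  std::vector<PosPair> result;
  std::size_t l = 0;
  std::size_t r = 0;
  while (l < left.size() && r < right.size()) {
    if (left[l] < right[r]) {
      ++l;
    } else if (right[r] < left[l]) {
      ++r;
    } else {
      // Find the end of the run of equal keys on both sides.
      const int key = left[l];
      std::size_t l_end = l;
      while (l_end < left.size() && left[l_end] == key) ++l_end;
      std::size_t r_end = r;
      while (r_end < right.size() && right[r_end] == key) ++r_end;
      // Emit every combination, not just the first match.
      for (std::size_t i = l; i < l_end; ++i) {
        for (std::size_t j = r; j < r_end; ++j) result.emplace_back(i, j);
      }
      l = l_end;
      r = r_end;
    }
  }
  return result;
}
```

Joining the left keys {1, 2, 2} with the right keys {2, 3} yields the pairs (1, 0) and (2, 0): the single right-hand 2 is matched with both left-hand 2s.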

Cleanup pointer calculator design

  • Replace pointer based fields and pos_list with non-pointer members
  • Replace const pos_list * with const pos_list&

Increase ease of use of PointerCalculator.

UUID assignment unnecessarily broad and slow!

The generic assignment of UUIDs to all containers through placement in the AbstractTable places an unnecessary burden on the overall system.

tests before uuid: make test  3.34s user 1.19s system 77% cpu 5.843 total
tests after uuid: make test  8.84s user 4.61s system 103% cpu 12.975 total

There is no reason that every single AbstractTable needs a UUID. It might even be debatable if every Store needs a UUID.

Potential fixes:

  • replace with cheaper ID-mechanism (atomic is probably faster than drawing a new number out of a mersenne_twister RNG)
  • lazily generate when needed
  • move to stores(?)
  • replace with just using the pointer cast to size_t - the pointer will be unique between one server start and the next
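
The first two ideas (a cheaper ID mechanism and lazy generation) could be combined roughly as follows. This is a sketch under assumptions: TableId, next_table_id, and LazyId are hypothetical names, and the IDs are only unique within one server run, not across restarts.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

using TableId = std::uint64_t;

// Draws the next ID from a process-wide atomic counter. IDs start at 1 so that
// 0 can serve as "not assigned yet".
inline TableId next_table_id() {
  static std::atomic<TableId> counter{0};
  return counter.fetch_add(1, std::memory_order_relaxed) + 1;
}

// Assigns an ID only when it is first requested.
class LazyId {
 public:
  TableId get() const {
    TableId current = _id.load(std::memory_order_acquire);
    if (current != 0) return current;
    const TableId candidate = next_table_id();
    TableId expected = 0;
    // If another thread won the race, `expected` holds its ID and the
    // candidate is simply discarded.
    if (_id.compare_exchange_strong(expected, candidate)) return candidate;
    assert(expected != 0);
    return expected;
  }

 private:
  mutable std::atomic<TableId> _id{0};
};
```

An object that never asks for its ID never touches the shared counter, which addresses the concern that not every AbstractTable needs an ID.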

A way to write self-contained JSON tests

A problem with sharing error cases is that you also need to share a file or a database connection to be able to run the test.

If instead, we could load a table from JSON only, this would reduce the work needed to reproduce bugs.

So we should either enhance the existing string loader so that it can also provide a table, or provide a JSON-based loader that allows loading from the operation parameters, i.e.:

  {
    "type": "JsonTable",
    "table": { "header" : ["a", "b"],
               "types" : ["INT", "STRING"],
               /*optional*/ "partitions": [ "1_R", "2_R" ],
               "data" : [ [ 1, "USA" ], [ 2, "GERMANY" ] ] }
  }

Store: Delta resize is not thread-safe

While only one thread may resize the delta, other threads working on delta at the time of resizing may end up working on deallocated memory, since the underlying vectors may change during a reserve.
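
The simplest way to rule out this race is to serialize resizing with every other access. The sketch below uses a single coarse mutex and a hypothetical GuardedDelta class holding plain integers; it is not the Store implementation and trades throughput for safety. A reader-writer lock or a chunked container that never moves existing elements would be less restrictive alternatives.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <vector>

// Hypothetical delta wrapper: no caller ever holds a raw pointer or reference
// into the vector, so a reallocation during reserve() cannot leave a dangling one.
class GuardedDelta {
 public:
  void reserve(std::size_t capacity) {
    std::lock_guard<std::mutex> lock(_mutex);
    _values.reserve(capacity);  // may reallocate and move all elements
  }

  // Appends a value and returns its position.
  std::size_t append(int value) {
    std::lock_guard<std::mutex> lock(_mutex);
    _values.push_back(value);
    return _values.size() - 1;
  }

  // Returns a copy, never a reference into the underlying storage.
  int at(std::size_t pos) const {
    std::lock_guard<std::mutex> lock(_mutex);
    assert(pos < _values.size());
    return _values[pos];
  }

 private:
  mutable std::mutex _mutex;
  std::vector<int> _values;
};
```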

Remove vname from operators

Since 6576e8d, much boilerplate isn't necessary any longer when implementing operators.

Thus, we should remove all name/vname methods and use registerPlanOperation<Type>("name"); instead.

Add Meta Information about currently loaded Tables

There should be a meta table that shows which tables are currently loaded. To follow the SQL spec, it should also contain information about the columns etc.

Something like information_schema etc from MySQL & Co.

Update Documentation

The current documentation predates C++11 and our shared_ptr usage, so we should find some time to update it.

Remove TPCCHQ1Scan PlanOp?

The TPCCHQ1Scan Plan Operation seems unnecessary and can potentially be removed unless required for special purposes

Radix Join Hashing for non-int columns

Radix Join should work on non-int columns as well. The RadixCluster plan operation has to be templated so that it works independently of the dictionary type.

Enhance Expression Support

Currently, only rudimentary expression support is available for HYRISE. It should be extended to support:

  • Expressions: mod( (a * b), 1000)
  • Expressions: like - string matching
  • Expressions: exists with subselect
  • Expressions: substr()
  • Expressions: ascii()
  • Expressions: extract (year from data)

Hyrise server does not shut down

Hyrise occasionally does not shut down on CTRL-C, and we need to figure out why that is the case.

Add Immutable Table Type

The default table type should be immutable and disallow changes. Currently, however, all table types are mutable.

Implement client-side driven transactions

Instead of implicitly assuming every request encloses a transaction, a transaction may last for several requests. Thus, we need:

  • BeginTransaction Operator -> returns txid (maybe even full context?)
  • Let queries run in a user specified transaction context
