hyrise / hyrise-v1

HYRISE In-Memory Hybrid Storage Engine (archived, now developed in hyrise/hyrise repo)

Home Page: https://github.com/hyrise/hyrise

License: MIT License

Makefile 0.82% CSS 2.61% JavaScript 2.06% Python 3.98% Shell 0.27% C++ 87.74% C 0.69% Ruby 1.15% OpenEdge ABL 0.05% HTML 0.12% Perl 0.51%

hyrise-v1's Issues

A way to write self-contained JSON tests

A problem with sharing error cases is that you also need to share a file or a database connection to be able to run the test.

If we could instead load a table from JSON alone, reproducing bugs would take less work.

So either enhance the existing string loader so that it can also provide a table, or provide a JSON-based loader that loads the table directly from the operation parameters, i.e.:

  {
    "type": "JsonTable",
    "table": { "header" : ["a", "b"],
               "types" : ["INT", "STRING"],
               /* optional */ "partitions" : ["1_R", "2_R"],
               "data" : [ [ 1, "USA" ], [ 2, "GERMANY" ] ] }
  }
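
A loader for such an operation would probably want to reject malformed tables before building anything. The following is a minimal sketch of that check, not Hyrise code: `JsonTableSpec`, `Cell` and `is_consistent` are hypothetical names, JSON parsing is left out, and only the two type names from the example ("INT" and "STRING") are assumed to exist.

```cpp
#include <cstddef>
#include <string>
#include <variant>
#include <vector>

// Hypothetical in-memory form of the "table" object from the proposal.
using Cell = std::variant<long, std::string>;

struct JsonTableSpec {
  std::vector<std::string> header;      // column names
  std::vector<std::string> types;       // "INT" or "STRING", one per column
  std::vector<std::vector<Cell>> data;  // rows, in header order
};

// Returns true if every row matches the declared column count and types.
bool is_consistent(const JsonTableSpec& spec) {
  if (spec.header.size() != spec.types.size()) return false;
  for (const auto& row : spec.data) {
    if (row.size() != spec.header.size()) return false;
    for (std::size_t col = 0; col < row.size(); ++col) {
      const bool is_int = std::holds_alternative<long>(row[col]);
      if (spec.types[col] == "INT" && !is_int) return false;
      if (spec.types[col] == "STRING" && is_int) return false;
    }
  }
  return true;
}
```

The example table above would pass this check, while a row such as [ "x", "USA" ] would be rejected because column "a" is declared as INT.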

Add Meta Information about currently loaded Tables

There should be a meta table that lists the currently loaded tables. To follow the SQL spec, it should also contain information about their columns and similar metadata.

Something like information_schema in MySQL and similar systems.

Remove TPCCHQ1Scan PlanOp?

The TPCCHQ1Scan plan operation seems unnecessary and can potentially be removed, unless it is required for a special purpose.

Fails on clang++ 3.2

Fails in AbstractCoreBoundTask::launchThread when taking address of executeTask with "pure virtual function called".

Add Immutable Table Type

The default table type should be immutable and disallow changes. Currently, however, all table types are mutable.

Radix Join Hashing for non-int columns

Radix Join should work on non-int columns as well. The RadixCluster plan operation has to be adapted (templated) so that it works independently of the dictionary type.
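
To illustrate what "independent of the dictionary type" could look like, here is a minimal sketch of a templated partitioning step. It is not the RadixCluster implementation: `radix_partition` and `radix_histogram` are hypothetical names, and `std::hash` stands in for whatever hash function the plan operation would actually use. `bits` is assumed to be smaller than the bit width of `std::size_t`.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Maps a value of any hashable type to one of 2^bits partitions.
template <typename T>
std::size_t radix_partition(const T& value, unsigned bits) {
  const std::size_t mask = (std::size_t{1} << bits) - 1;
  return std::hash<T>{}(value) & mask;
}

// Counts how many values fall into each partition; a histogram like this
// is typically the first pass of a radix clustering step.
template <typename T>
std::vector<std::size_t> radix_histogram(const std::vector<T>& column, unsigned bits) {
  std::vector<std::size_t> histogram(std::size_t{1} << bits, 0);
  for (const auto& value : column) {
    ++histogram[radix_partition(value, bits)];
  }
  return histogram;
}
```

Because the column type is a template parameter, the same code handles `std::vector<int>` and `std::vector<std::string>` without a separate code path per type.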

RFC: Improve logging to explicitly distinguish hardware counters and make the results clearer

{
    /*... other stuff ... */
    "performance_data": {
        /* ... other operands, too */
        "operandId": {
            "timing": { /* these values are measured explicitly, not through hardware counters */
                "start": 10 /*ns since query start*/ ,
                "stop": 20 /*ns since query start*/ ,
                "start_epoch": 100000000123 /*ns since epoch*/ ,
                "stop_epoch": 100000000133 /*ns since epoch*/
            },
            "counters": {
                /* these values are explicitly measured through hardware counters */
                "PAPI_TOT_INS": 10,
                "PAPI_TOT_CYC": 20,
                "PAPI_L2_TCM": 30
            },
            "input": {
                "tables": []
            },
            "output": {
                /* same as input */
            },
            "custom": {
                /* op may emit whatever JSON it deems sensible */
            }
        }
    }
}

In the request, we specify what we want to log:

{ "operations": [ /*...*/ ], "logging": ["papi", "input", "output", "timing"] }

Issue with this proposal: "counters" would currently only count what happens in executePlanOperation.

UUID assignment unnecessarily broad and slow!

The generic assignment of UUIDs to all containers through placement in the AbstractTable places an unnecessary burden on the overall system.

tests before uuid: make test  3.34s user 1.19s system 77% cpu 5.843 total
tests after uuid: make test  8.84s user 4.61s system 103% cpu 12.975 total

There is no reason that every single AbstractTable needs a UUID. It might even be debatable if every Store needs a UUID.

Potential fixes:

  • replace with cheaper ID-mechanism (atomic is probably faster than drawing a new number out of a mersenne_twister RNG)
  • lazily generate when needed
  • move to stores(?)
  • replace with just the pointer cast to size_t - the pointer is unique between one server start and the next
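
The first two fixes can be combined. The following is a minimal sketch, not Hyrise code: `table_id_t`, `next_table_id` and `LazyIdTable` are hypothetical names, and the value 0 is assumed to be free to mean "no ID assigned yet".

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical ID type; the issue does not prescribe one.
using table_id_t = std::uint64_t;

// Process-wide counter: one atomic increment per ID instead of an RNG draw.
inline table_id_t next_table_id() {
  static std::atomic<table_id_t> counter{1};
  return counter.fetch_add(1, std::memory_order_relaxed);
}

// Hypothetical stand-in for a table that only pays for an ID on first use.
class LazyIdTable {
 public:
  table_id_t id() const {
    table_id_t current = _id.load(std::memory_order_acquire);
    if (current != 0) return current;
    const table_id_t fresh = next_table_id();
    // If another thread assigned an ID first, compare_exchange_strong fails
    // and stores that thread's ID in `current`; we return it instead of ours.
    if (_id.compare_exchange_strong(current, fresh, std::memory_order_acq_rel)) {
      return fresh;
    }
    return current;
  }

 private:
  mutable std::atomic<table_id_t> _id{0};  // 0 means "not assigned yet"
};
```

Tables that are never asked for their ID never touch the counter, so intermediate results pay nothing. A lost race wastes one counter value, which only leaves a gap in the ID sequence.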

Barrier has weird semantics

Barrier behaves oddly: it uses the length n of _field_definition to forward the first n elements, regardless of the values actually set in _field_definition.

We should make sure that its semantics are simply output == input.

Partitioning of tables for parallel execution

The distribute method in PlanOperation returns first and last as inclusive array positions (0 and 999 for a table with 1000 elements). Following iterator style, you would expect last to be one past the final element, so that it can serve as the exit condition when iterating over the input table.
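
A minimal sketch of the iterator-style alternative, with a half-open range [first, last). This is a free function with a hypothetical signature, not the existing PlanOperation method, and it assumes count > 0.

```cpp
#include <cstddef>
#include <utility>

// Returns the half-open row range [first, last) for partition `part`
// out of `count` partitions over a table with `size` rows.
// Assumes count > 0 and part < count.
std::pair<std::size_t, std::size_t> distribute(std::size_t size, std::size_t part,
                                               std::size_t count) {
  const std::size_t base = size / count;
  const std::size_t extra = size % count;  // the first `extra` parts get one more row
  const std::size_t first = part * base + (part < extra ? part : extra);
  const std::size_t last = first + base + (part < extra ? 1 : 0);
  return {first, last};
}
```

With this convention a table of 1000 elements in a single partition yields first = 0 and last = 1000, so `for (row = first; row < last; ++row)` visits every row, and the `last` of one partition equals the `first` of the next.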

Inconsistent State in MVCC

In the current implementation, MVCC allows for inconsistent reads. For a record deleted by another transaction, it is impossible to determine whether it would have been valid or not in the context of the current transaction. Example:

  1. T1 starts with last_CID=6, does things
  2. T2 starts, inserts record A, commits with CID=7
  3. T3 starts, deletes record A, commits with CID=8
  4. T1 reads A with valid=0 and CID (8) > last_CID (6) ==> current implementation assumes that value was valid for T1's read
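
One common way to make this decidable is to keep two commit IDs per record, one for the insert and one for the delete, instead of a single CID plus a valid flag. The sketch below shows that visibility rule under those assumptions. It is not the Hyrise implementation, all names are hypothetical, and it ignores a transaction's own uncommitted writes.

```cpp
#include <cstdint>
#include <limits>

using cid_t = std::uint64_t;
constexpr cid_t kInfinity = std::numeric_limits<cid_t>::max();

// Hypothetical per-record metadata: one CID for the insert, one for the delete.
struct RecordVersion {
  cid_t begin_cid = kInfinity;  // commit ID of the inserting transaction
  cid_t end_cid = kInfinity;    // commit ID of the deleting transaction, if any
};

// A record is visible to a snapshot taken at last_cid if it was inserted
// at or before last_cid and not deleted at or before last_cid.
bool is_visible(const RecordVersion& record, cid_t last_cid) {
  return record.begin_cid <= last_cid && record.end_cid > last_cid;
}
```

In the example above, record A would carry begin_cid = 7 and end_cid = 8. For T1 with last_CID = 6 the insert itself is too new, so A is not visible, which is the answer the single-CID scheme cannot derive.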

Update Documentation

The documentation predates C++11 and our use of shared_ptr, so we should find some time to update it.

Binutils-dev missing in vagrant/chef setup

I ran vagrant up. On the virtual box I tried to compile Hyrise, but the binutils-dev package is missing (the -lbfd flag fails in the linker). A simple apt-get install binutils-dev rectified the situation.

Radix Join and PCs

Currently the RadixJoin does not work on PointerCalculator objects, since the class does not implement the getAttributeVectors() interface. Even if it did, the join would break, since the positions inside the vector might not be the same as the input positions.

Possible solutions: If the input is a pointer calculator, rewrite the positions to match the input vector and extend the handling to multiple horizontal partitions.

This is related to #18

Remove vname from operators

Since 6576e8d, much boilerplate isn't necessary any longer when implementing operators.

Thus, we should remove all name/vname methods and use registerPlanOperation<Type>("name"); instead.
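
For readers unfamiliar with the pattern, a name-based registration helper of this shape can be sketched as follows. Only the call `registerPlanOperation<Type>("name")` comes from the issue; the base class, the registry and the return type are assumptions made for illustration and may differ from what commit 6576e8d actually introduced.

```cpp
#include <functional>
#include <map>
#include <memory>
#include <string>

// Hypothetical minimal base class for operators.
struct PlanOperation {
  virtual ~PlanOperation() = default;
};

using Factory = std::function<std::unique_ptr<PlanOperation>()>;

// Single registry instance, created on first use.
inline std::map<std::string, Factory>& registry() {
  static std::map<std::string, Factory> instance;
  return instance;
}

// Registers Type under `name`; returns false if the name is already taken.
template <typename Type>
bool registerPlanOperation(const std::string& name) {
  return registry()
      .emplace(name, [] { return std::unique_ptr<PlanOperation>(new Type()); })
      .second;
}

// Hypothetical operator that needs no name()/vname() methods of its own.
struct NoOp : PlanOperation {};
```

The operator class carries no name at all; the string exists only in the registry, which is why per-operator name/vname methods become redundant.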

Implement client-side driven transactions

Instead of implicitly assuming every request encloses a transaction, a transaction may last for several requests. Thus, we need:

  • BeginTransaction Operator -> returns txid (maybe even full context?)
  • Let queries run in a user specified transaction context

Duplicate inputs to operators are currently removed

This makes it painful to write operators that expect the same data input twice (such as self joins).

Does anyone know of a case where it makes sense to filter inputs for duplicates? Please let me know; otherwise, this filter will be removed.

Store: Delta resize is not thread-safe

While only one thread may resize the delta, other threads working on delta at the time of resizing may end up working on deallocated memory, since the underlying vectors may change during a reserve.
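
The simplest way to rule this out is to make every access go through the same lock as the resize. The sketch below shows only that idea: `GuardedDelta` is a hypothetical stand-in for one delta column, not the Store class, and a plain mutex is assumed to be acceptable even though it serializes readers.

```cpp
#include <cstddef>
#include <mutex>
#include <vector>

// Hypothetical stand-in for a single delta column.
class GuardedDelta {
 public:
  // May reallocate the underlying storage, so it must exclude all other access.
  void reserve(std::size_t capacity) {
    std::lock_guard<std::mutex> lock(_mutex);
    _values.reserve(capacity);
  }

  void append(int value) {
    std::lock_guard<std::mutex> lock(_mutex);
    _values.push_back(value);
  }

  // Returns a copy, so the caller never holds a reference into storage
  // that a later reserve() could deallocate.
  int at(std::size_t row) const {
    std::lock_guard<std::mutex> lock(_mutex);
    return _values.at(row);
  }

  std::size_t size() const {
    std::lock_guard<std::mutex> lock(_mutex);
    return _values.size();
  }

 private:
  mutable std::mutex _mutex;
  std::vector<int> _values;
};
```

Locking on every read is expensive for a column store. A reader/writer lock, or a chunked container that never moves existing elements, would be a natural next step if this proved too slow.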

Add caveats section to readme

It might be nice to add a caveat section on the intended audience and what someone can expect from this chunk of code.

Simplify Table Loading

There should be a way to access the tables stored on disk without always having to specify the load operation explicitly.

Hyrise server does not shutdown

The Hyrise server occasionally fails to shut down on CTRL-C, and we need to figure out why.

Logfile Management (DESIGN DISCUSSION)

Currently, all transactions log into a single logfile, which consequently grows over time and never gets deleted or truncated. Also, once a table is merged, its previous log entries must be removed/invalidated to avoid redo recovery in case of failure.

Solution 1: One logfile per table. Problems: increased logging overhead, commits must either be written into all logfiles or in a separate commit log.

Solution 2: One logfile for all tables, checkpoint entries when table is merged (i.e. "ignore previous log entries for table X"), new logfile after merge. When all tables have been merged once, the first logfile can be deleted, the second after all tables were merged twice and so on. Problems: Might take a while until all tables are merged.

Solution 3: One logfile for all tables, checkpoint entries when table is merged (i.e. "ignore previous log entries for table X"), new logfile after merge. Separate worker reads old logfiles on every table merge, removes entries and writes truncated log back. Problems: Potentially costly operation, IO overhead
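
The deletion rule in Solution 2 reduces to a single number: logfile k can go once every table has been merged at least k times. A minimal sketch of that bookkeeping, assuming a per-table merge counter exists (the function name and the counter are hypothetical):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// merge_counts[i] is how often table i has been merged so far.
// Under Solution 2, the k-th logfile is obsolete once every table has been
// merged at least k times, so the number of deletable logfiles is the
// minimum merge count over all tables.
std::size_t deletable_logfiles(const std::vector<std::size_t>& merge_counts) {
  if (merge_counts.empty()) return 0;  // no tables known: delete nothing
  return *std::min_element(merge_counts.begin(), merge_counts.end());
}
```

This also makes the stated problem visible: a single table that is rarely merged keeps the minimum at its own count and blocks deletion of every later logfile.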

Cleanup pointer calculator design

  • Replace pointer based fields and pos_list with non-pointer members
  • Replace const pos_list * with const pos_list&

Increase ease of use of PointerCalculator.

Enhance Expression Support

Currently only rudimentary expression support is available in HYRISE. It should be extended to support:

  • Expressions: mod( (a * b), 1000)
  • Expressions: like - string matching
  • Expressions: exists with subselect
  • Expressions: substr()
  • Expressions: ascii()
  • Expressions: extract (year from data)

MergeJoin does not merge correctly

Entries in the right-hand table only appear once in the result table, namely for the first match. Instead, they should appear for every matching entry in the left-hand table.
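
The expected behaviour can be sketched as a merge join that handles runs of equal keys on both sides. This is a standalone illustration over sorted int columns, not the MergeJoin plan operation; the function name and the index-pair result format are assumptions.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Joins two key columns that are sorted in ascending order and returns
// (left_row, right_row) index pairs for every matching combination.
std::vector<std::pair<std::size_t, std::size_t>> merge_join(
    const std::vector<int>& left, const std::vector<int>& right) {
  std::vector<std::pair<std::size_t, std::size_t>> result;
  std::size_t l = 0;
  std::size_t r = 0;
  while (l < left.size() && r < right.size()) {
    if (left[l] < right[r]) {
      ++l;
    } else if (right[r] < left[l]) {
      ++r;
    } else {
      // Find the run of equal keys on each side.
      const int key = left[l];
      std::size_t l_end = l;
      std::size_t r_end = r;
      while (l_end < left.size() && left[l_end] == key) ++l_end;
      while (r_end < right.size() && right[r_end] == key) ++r_end;
      // Emit the cross product of both runs, so a right-hand entry
      // appears once per matching left-hand entry.
      for (std::size_t i = l; i < l_end; ++i) {
        for (std::size_t j = r; j < r_end; ++j) {
          result.emplace_back(i, j);
        }
      }
      l = l_end;
      r = r_end;
    }
  }
  return result;
}
```

For left keys {1, 2, 2} and right keys {2, 3}, the right-hand row 0 is paired with both left-hand rows 1 and 2, rather than only with the first match.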

Assign unique table ids for logging

For logging, we need to identify to which table a certain entry was written. This could be a unique table id.

  • We need to discuss the scope of such an ID. Is it a Store or a Table? What information do we need to unambiguously identify where a logged value id belongs?
  • In the following, "table" means the logical construct, not the Table class
  • Table IDs must be unique over restarts of Hyrise. When I create the table "customers" and it gets the ID 5, it should have 5 when I restart Hyrise.
  • For this, we also need to save meta information (column [names], ...) about the table
  • We only need to log data in the delta - the main is persisted using snapshots
  • It would be great to have small table ids, not GUIDs
  • The table id shall be stored close to the data (in the Store class?). We should not need to go to the StorageManager every time we want to log something
