hyrise-v1 issue discussions from hyrise

A way to write self-contained JSON tests

A problem with sharing error cases is that you also need to share a file or a database connection to be able to run the test.

If instead, we could load a table from JSON only, this would reduce the work needed to reproduce bugs.

So either enhance the existing string loader so that it can also provide a table, or provide a JSON-based loader that allows loading from the operation parameters, i.e.:

  {
    "type": "JsonTable",
    "table": { "header" : ["a", "b"],
               "types" : ["INT", "STRING"],
               /*optional*/ "partitions" : [ "1_R", "2_R" ],
               "data" : [ [ 1, "USA" ], [ 2, "GERMANY" ] ] }
  }

Add Meta Information about currently loaded Tables

There should be a meta table that shows which tables exist. To follow the SQL spec, it should also contain information about the columns etc.

Something like information_schema etc from MySQL & Co.

Remove TPCCHQ1Scan PlanOp?

The TPCCHQ1Scan Plan Operation seems unnecessary and can potentially be removed unless it is required for special purposes.

Fails on clang++ 3.2

Fails in AbstractCoreBoundTask::launchThread when taking address of executeTask with "pure virtual function called".

Add Immutable Table Type

The default table type should be immutable and not allow changes. However, currently all table types are in general mutable.

segfault in MysqlTableloader during parallel execution

Radix Join Hashing for non-int columns

Radix Join should work on non-int columns as well. The RadixCluster plan operation has to be adapted (templated) to work independently of the dictionary type.

RFC: Improve logging to explicitly distinguish hardware counters and make the results clearer

{
    /*... other stuff ... */
    "performance_data": {
        /* ... other operands, too */
        "operandId": {
            "timing": { /* all these values are explicitly not measured through hardware counters */
                "start": 10 /*ns since query start*/ ,
                "stop": 20 /*ns since query start*/ ,
                "start_epoch": 100000000123 /*ns since epoch*/ ,
                "stop_epoch": 100000000133 /*ns since epoch*/
            },
            "counters": {
                /* these values are explicitly measured through hardware counters */
                "PAPI_TOT_INS": 10,
                "PAPI_TOT_CYC": 20,
                "PAPI_L2_TCM": 30
            },
            "input": {
                "tables": []
            },
            "output": {
                /* same as input */
            },
            "custom": {
                /* op may emit whatever JSON it deems sensible */
            }
        }
    }
}

In the request, we specify what we want for logging:

{ "operations": [ /*...*/ ], "logging": ["papi", "input", "output", "timing"] }

Issues with this proposal: "counters" would currently only count what happens in executePlanOperation

Fully namespace everything in src/lib or completely remove namespaces

The current half/half situation doesn't really help anyone.

UUID assignment unnecessarily broad and slow!

The generic assignment of UUIDs to all containers through placement in the AbstractTable places an unnecessary burden on the overall system.

tests before uuid: make test 3.34s user 1.19s system 77% cpu 5.843 total
tests after uuid: make test 8.84s user 4.61s system 103% cpu 12.975 total

There is no reason that every single AbstractTable needs a UUID. It might even be debatable if every Store needs a UUID.

Potential fixes:

replace with cheaper ID-mechanism (atomic is probably faster than drawing a new number out of a mersenne_twister RNG)
lazily generate when needed
move to stores(?)
replace with just using the pointer cast to size_t - the pointer will be unique for a given cycle of server start to next server start
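
The first two potential fixes could be combined roughly as in the sketch below. It is only an illustration under assumptions, not Hyrise code: `next_resource_id` and `LazyId` are hypothetical names, and it assumes an ID only has to be unique within one server process.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical helper: hands out process-wide unique, increasing IDs.
// A relaxed fetch_add suffices because only uniqueness matters here.
inline std::uint64_t next_resource_id() {
  static std::atomic<std::uint64_t> counter{1};  // 0 is reserved for "no ID yet"
  return counter.fetch_add(1, std::memory_order_relaxed);
}

// Hypothetical wrapper that draws an ID only when somebody asks for it.
class LazyId {
 public:
  std::uint64_t get() {
    std::uint64_t current = _id.load(std::memory_order_acquire);
    if (current != 0) return current;  // already assigned
    const std::uint64_t fresh = next_resource_id();
    assert(fresh != 0);  // 0 is the "not assigned" sentinel
    // If another thread assigned an ID in the meantime, keep that one;
    // compare_exchange_strong then stores the winning value in `current`.
    if (_id.compare_exchange_strong(current, fresh, std::memory_order_acq_rel)) {
      return fresh;
    }
    return current;
  }

 private:
  std::atomic<std::uint64_t> _id{0};
};
```

A table that never has its ID requested never touches the shared counter. When two threads race on `get()`, the ID drawn by the losing thread is simply skipped.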

Barrier has weird semantics

Well Barrier is kind of weird, because it uses the length n of _field_definition to forward the first n elements, regardless of actually set values in _field_definition.

We should make sure that its semantics are basically output == input.

Partitioning of tables for parallel execution

The distribute method in PlanOperation returns first and last as inclusive array positions (0 and 999 for a table with 1000 elements). Following iterator style, you would expect last to work as an exit condition when iterating over the input table.
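
An iterator-style variant might look like the following sketch. `distribute_half_open` is a hypothetical free function, not the actual PlanOperation method or its signature. It returns a half-open range [first, last), so last can be used directly as the loop exit condition.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>

// Hypothetical helper: row range [first, last) for part `part` out of `count` parts.
inline std::pair<std::size_t, std::size_t> distribute_half_open(std::size_t rows,
                                                                std::size_t part,
                                                                std::size_t count) {
  assert(count > 0 && part < count);
  const std::size_t base = rows / count;   // rows every part gets
  const std::size_t extra = rows % count;  // the first `extra` parts get one more row
  const std::size_t first = part * base + (part < extra ? part : extra);
  const std::size_t last = first + base + (part < extra ? 1 : 0);
  return {first, last};
}
```

With this convention a table with 1000 elements and a single part yields first = 0 and last = 1000, and a consumer iterates with `for (std::size_t row = first; row < last; ++row)`.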

Add HAVING operator

Make Commit Parallel

Currently only one TX can be committed at a time; this should be improved :)

Inconsistent State in MVCC

In the current implementation, MVCC allows for inconsistent reads. For a record deleted by another transaction, it is impossible to determine whether it would have been valid or not in the context of the current transaction. Example:

T1 starts with last_CID=6, does things
T2 starts, inserts record A, commits with CID=7
T3 starts, deletes record A, commits with CID=8
T1 reads A with valid=0 and CID (8) > last_CID (6) ==> current implementation assumes that value was valid for T1's read
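
One way to make the example above decidable is to keep, per row, both the CID of the inserting commit and the CID of the deleting commit. The sketch below assumes exactly that. `RowVersion`, `begin_cid`, `end_cid`, `mark_deleted` and `visible_to` are hypothetical names, not the current Hyrise layout, and uncommitted rows are ignored for brevity.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

using cid_t = std::uint64_t;
constexpr cid_t kNeverDeleted = std::numeric_limits<cid_t>::max();

// Hypothetical per-row metadata (an assumption, not the existing Store layout).
struct RowVersion {
  cid_t begin_cid;  // CID of the commit that inserted the row
  cid_t end_cid;    // CID of the commit that deleted it, or kNeverDeleted
};

// Records the deleting commit on a copy of the row.
inline RowVersion mark_deleted(RowVersion row, cid_t commit_cid) {
  assert(commit_cid >= row.begin_cid);  // a row cannot be deleted before it exists
  row.end_cid = commit_cid;
  return row;
}

// A row is visible to a reader whose snapshot is `last_cid` if it was inserted
// at or before the snapshot and had not been deleted by then.
inline bool visible_to(const RowVersion& row, cid_t last_cid) {
  return row.begin_cid <= last_cid && row.end_cid > last_cid;
}
```

For record A (inserted with CID=7, deleted with CID=8), a reader with last_CID=6 gets `false`, because the insert happened after its snapshot. A reader with last_CID=7 would still see the row.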

Update Documentation

The current state of the documentation predates C++11 and our shared_ptr usage, so we should find some time to update it.

Github Pages and Documentation

Can we use GitHub Pages to host the most recent version of the documentation? I think this should be possible...

Binutils-dev missing in vagrant/chef setup

I ran vagrant up. On the virtual box I tried to compile Hyrise, but the binutils-dev package is missing (the -lbfd flag fails in the linker). A simple apt-get install binutils-dev rectified the situation.

TaskSchedulerAdjustment, ThreadpoolAdjustment, SettingsOperation redundant

The Plan Operations TaskSchedulerAdjustment and ThreadpoolAdjustment do exactly the same thing; SettingsOperation also does the same but leaves out one step. It therefore seems that only one of these three is actually necessary and the other two should be removed.

How to handle Pointer Calculator and Tables in Operators?

prototype for typeswitches

Correct Dictionary cannot be found in Horizontal table, if tables with diff dictionaries are unified

When getvalue searches for the right dictionary, both column AND row need to be taken into account.

TransactionManager: Aborting transaction does not clean up rows that have been committed

The commit implementation allows Store->updateCommitId to fail in some way, but does not clean up any changes that have been made.

setTableName in CreateIndex is misleading

Should be "setIndexName" or similar.

Radix Join and PCs

Currently the RadixJoin does not work on PointerCalculator objects since the class does not implement the getAttributeVectors() interface. Even if it did, it would break since the positions inside the vector might not be the same as the input positions.

Possible solutions: If the input is a pointer calculator, rewrite the positions to match the input vector and extend the handling to multiple horizontal partitions.

This is related to #18

Remove vname from operators

Since 6576e8d, much boilerplate isn't necessary any longer when implementing operators.

Thus, we should remove all name/vname methods and use registerPlanOperation<Type>("name"); instead.

Implement client-side driven transactions

Instead of implicitly assuming every request encloses a transaction, a transaction may last for several requests. Thus, we need:

BeginTransaction Operator -> returns txid (maybe even full context?)
Let queries run in a user specified transaction context

Occasional failures in autojson: string_header_load.json

C++ exception with description "StorageManager: Table 'revenue' does not exist" thrown in the test body.

Not sure about the underlying issue yet. The test fails occasionally on our build infrastructure.

Extract taskscheduler unit tests into own binary

Create tests for limit/offset parameters in ResponseTask

Duplicate inputs to operators are currently removed

This makes writing Operators that expect identical data inputs twice (such as self joins) a pain.

Does anyone know of a case where it makes sense to filter inputs for duplicates? Please let me know; otherwise, this filter will be removed.

Add AbstractResource concept & unify operations data handling

Implement a base class AbstractResource that is used to transport resources between different operations in a plan.

This should replace the multiple lists of resource types in OperationData.

SegFault enhancement with backtrace and stack frames

JoinType being ignored in JoinScan

Setting _join_type in JoinScan does not have any effect on the join being performed.

Add support for "IN subquery" predicates

HYRISE should be able to support filter predicates based on subqueries. Something like

SELECT * FROM table where attr in (SELECT id FROM othertab);

Store: Delta resize is not thread-safe

While only one thread may resize the delta, other threads working on delta at the time of resizing may end up working on deallocated memory, since the underlying vectors may change during a reserve.

StorageManager distinction between tables and indexes is awkward

Move towards a storage manager that only stores AbstractResource objects with a name ready for retrieval. (Also remove the loading shortcuts in StorageManager while we are at it).

Scheduler/SchedulerTest.dont_block_test/0 fails occasionally

Stacktrace

Value of: long_block_test(scheduler)
Actual: false
Expected: true

@JWUST any hints on that one?

Remove getSlice/getSliceWidth and adjacent methods from AbstractTable

Remove template-based Allocators in favor of polymorphic allocators.

Currently, the facilities surrounding template-based allocators are completely underused and are not actually necessary. Instead, it would be good to use polymorphic allocators similar to what is implemented in bsl https://github.com/bloomberg/bsl/wiki/BDE-Allocator-model and will be part of the c++14 standard.

Improve Papi Initialization

PAPI initialization takes a while for small ops. Maybe only initialize once per thread.
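
The "once per thread" idea could be sketched with a thread_local flag as below. `ensure_thread_initialized` is a hypothetical helper. No PAPI calls are shown, since the callback would wrap whatever per-thread PAPI setup the operations currently perform.

```cpp
#include <cassert>
#include <functional>

// Hypothetical helper: runs `init` at most once per calling thread.
// Returns true if this call performed the initialization.
inline bool ensure_thread_initialized(const std::function<void()>& init) {
  assert(init);  // an empty callback would be a programming error
  thread_local bool initialized = false;  // one flag per thread
  if (initialized) return false;
  init();
  initialized = true;
  return true;
}
```

Each plan operation would call the helper before starting its counters. Only the first operation executed on a given worker thread pays the setup cost.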

Add caveats section to readme

It might be nice to add a caveat section on the intended audience and what someone can expect from this chunk of code.

Simplify Table Loading

There should be a way to simply access the tables stored on disk without always having to specify the load operation explicitly.

Hyrise server does not shutdown

The issue here is that hyrise occasionally doesn't shut down on CTRL-C, and we need to figure out why that is the case.

Logfile Management (DESIGN DISCUSSION)

Currently, all transactions log into a single logfile, which consequently grows over time and never gets deleted or truncated. Also, once a table is merged, its previous log entries must be removed/invalidated to avoid redo recovery in case of failure.

Solution 1: One logfile per table. Problems: increased logging overhead, commits must either be written into all logfiles or in a separate commit log.

Solution 2: One logfile for all tables, checkpoint entries when table is merged (i.e. "ignore previous log entries for table X"), new logfile after merge. When all tables have been merged once, the first logfile can be deleted, the second after all tables were merged twice and so on. Problems: Might take a while until all tables are merged.

Solution 3: One logfile for all tables, checkpoint entries when table is merged (i.e. "ignore previous log entries for table X"), new logfile after merge. Separate worker reads old logfiles on every table merge, removes entries and writes truncated log back. Problems: Potentially costly operation, IO overhead

Cleanup pointer calculator design

Replace pointer based fields and pos_list with non-pointer members
Replace const pos_list * with const pos_list&

Increase ease of use of PointerCalculator.

Enhance Expression Support

Currently only rudimentary expression support is available for HYRISE. It should be extended to support:

Expressions: mod( (a * b), 1000)
Expressions: like - string matching
Expressions: exists with subselect
Expressions: substr()
Expressions: ascii()
Expressions: extract (year from data)

MergeJoin does not merge correctly

Entries in the right-hand table only appear once in the result table, namely for the first match. Instead, they should appear for every matching entry in the left-hand table.
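
The expected behavior can be illustrated with a textbook sort-merge join that emits the cross product of each run of equal keys. This is a minimal sketch over plain sorted int vectors with a hypothetical `merge_join` function. It is not the MergeJoin plan operation itself.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Equi-join of two sorted key vectors; returns (left position, right position) pairs.
inline std::vector<std::pair<std::size_t, std::size_t>> merge_join(
    const std::vector<int>& left, const std::vector<int>& right) {
  assert(std::is_sorted(left.begin(), left.end()));
  assert(std::is_sorted(right.begin(), right.end()));
  std::vector<std::pair<std::size_t, std::size_t>> result;
  std::size_t l = 0, r = 0;
  while (l < left.size() && r < right.size()) {
    if (left[l] < right[r]) {
      ++l;
    } else if (right[r] < left[l]) {
      ++r;
    } else {
      // Find the end of the run of equal keys on both sides.
      std::size_t l_end = l, r_end = r;
      while (l_end < left.size() && left[l_end] == left[l]) ++l_end;
      while (r_end < right.size() && right[r_end] == right[r]) ++r_end;
      // Every left entry of the run matches every right entry of the run.
      for (std::size_t i = l; i < l_end; ++i) {
        for (std::size_t j = r; j < r_end; ++j) result.emplace_back(i, j);
      }
      l = l_end;
      r = r_end;
    }
  }
  return result;
}
```

For left keys 1, 2, 2, 3 and right keys 2, 2, 4, both right-hand entries with key 2 appear once per matching left-hand entry, which gives four result pairs.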

Assign unique table ids for logging

For logging, we need to identify to which table a certain entry was written. This could be a unique table id.

We need to discuss the scope of such an ID. Is it a Store or a Table? What information do we need to non-ambiguously identify where a logged value id belongs?
In the following, "table" means the logical construct, not the Table class
Table IDs must be unique over restarts of Hyrise. When I create the table "customers" and it gets the ID 5, it should have 5 when I restart Hyrise.
For this, we also need to save meta information (column [names], ...) about the table
We only need to log data in the delta - the main is persisted using snapshots
It would be great to have small table ids, not GUIDs
The table id shall be stored close to the data (in the Store class?). We should not need to go to the StorageManager every time we want to log something

Server terminates with logic_error if empty query is sent

To reproduce:
send empty query to server twice
curl -X POST --data-urlencode "query@xxx" http://localhost:5000
