
go-mysql-server's Introduction

Notice: This repository is no longer actively maintained; no further updates will be made, and issues/PRs will not be answered or attended to. An actively maintained alternative can be found at https://github.com/dolthub/go-mysql-server.

go-mysql-server


go-mysql-server is a SQL engine that parses standard SQL (based on MySQL syntax) and resolves and optimizes queries. It provides simple interfaces to allow custom tabular data source implementations.

go-mysql-server also provides a server implementation compatible with the MySQL wire protocol. That means it is compatible with MySQL ODBC, JDBC, or the default MySQL client shell interface.

Scope of this project

These are the goals of go-mysql-server:

  • Be a generic extensible SQL engine that performs queries on your data sources.
  • Provide interfaces so you can implement your own custom data sources (the only one the project ships is the mem data source, which is used for testing purposes).
  • Have a runnable server you can use on your specific implementation.
  • Parse and optimize queries while still allowing specific implementations to add their own analysis steps and optimizations.
  • Provide some common index driver implementations so the user does not have to bring their own, while still being able to do so if they need to.

What are not the goals of go-mysql-server:

  • Be a drop-in MySQL database replacement.
  • Be an application/server you can use directly.
  • Provide any kind of backend implementation (other than the mem one used for testing) such as json, csv, yaml, ... That's for clients to implement and use.

What's the use case of go-mysql-server?

You have data in another format (git repositories, for example) that you want to query as tabular data using SQL. gitbase is an example of this.

Installation

The import path for the package is github.com/src-d/go-mysql-server.

To install it, run:

go get github.com/src-d/go-mysql-server

Documentation

SQL syntax

We are continuously adding more functionality to go-mysql-server. We support a subset of what MySQL supports; to see what is currently included, check the SUPPORTED file.

Third-party clients

We support and actively test against certain third-party clients to ensure compatibility between them and go-mysql-server. You can check out the list of supported third-party clients in the SUPPORTED_CLIENTS file, along with some examples of how to connect to go-mysql-server using them.

Available functions

Name Description
ARRAY_LENGTH(json) if the json representation is an array, this function returns its size.
AVG(expr) returns the average value of expr in all rows.
CEIL(number) returns the smallest integer value that is greater than or equal to number.
CEILING(number) returns the smallest integer value that is greater than or equal to number.
CHAR_LENGTH(str) returns the length of the string in characters.
COALESCE(...) returns the first non-null value in a list.
CONCAT(...) concatenates any group of fields into a single string.
CONCAT_WS(sep, ...) concatenates any group of fields into a single string. The first argument is the separator for the rest of the arguments. The separator is added between the strings to be concatenated. The separator can be a string, as can the rest of the arguments. If the separator is NULL, the result is NULL.
CONNECTION_ID() returns the current connection ID.
COUNT(expr) returns a count of the number of non-NULL values of expr in the rows retrieved by a SELECT statement.
DATE_ADD(date, interval) adds the interval to the given date.
DATE_SUB(date, interval) subtracts the interval from the given date.
DAY(date) is a synonym for DAYOFMONTH().
DATE(date) returns the date part of the given date.
DAYOFMONTH(date) returns the day of the month (0-31).
DAYOFWEEK(date) returns the day of the week of the given date.
DAYOFYEAR(date) returns the day of the year of the given date.
FIRST(expr) returns the first value in a sequence of elements of an aggregation.
FLOOR(number) returns the largest integer value that is less than or equal to number.
FROM_BASE64(str) decodes the base64-encoded string str.
GREATEST(...) returns the greatest numeric or string value.
HOUR(date) returns the hours of the given date.
IFNULL(expr1, expr2) if expr1 is not NULL, it returns expr1; otherwise it returns expr2.
IS_BINARY(blob) returns whether a blob is a binary file or not.
JSON_EXTRACT(json_doc, path, ...) extracts data from a json document using json paths. Extracting a string will result in that string being quoted. To avoid this, use JSON_UNQUOTE(JSON_EXTRACT(json_doc, path, ...)).
JSON_UNQUOTE(json) unquotes JSON value and returns the result as a utf8mb4 string.
LAST(expr) returns the last value in a sequence of elements of an aggregation.
LEAST(...) returns the smallest numeric or string value.
LENGTH(str) returns the length of the string in bytes.
LN(X) returns the natural logarithm of X.
LOG(X), LOG(B, X) if called with one parameter, this function returns the natural logarithm of X. If called with two parameters, this function returns the logarithm of X to the base B. If X is less than or equal to 0, or if B is less than or equal to 1, then NULL is returned.
LOG10(X) returns the base-10 logarithm of X.
LOG2(X) returns the base-2 logarithm of X.
LOWER(str) returns the string str with all characters in lower case.
LPAD(str, len, padstr) returns the string str, left-padded with the string padstr to a length of len characters.
LTRIM(str) returns the string str with leading space characters removed.
MAX(expr) returns the maximum value of expr in all rows.
MID(str, pos, [len]) returns a substring from the provided string starting at pos with a length of len characters. If no len is provided, all characters from pos until the end will be taken.
MIN(expr) returns the minimum value of expr in all rows.
MINUTE(date) returns the minutes of the given date.
MONTH(date) returns the month of the given date.
NOW() returns the current timestamp.
NULLIF(expr1, expr2) returns NULL if expr1 = expr2 is true, otherwise returns expr1.
POW(X, Y) returns the value of X raised to the power of Y.
REGEXP_MATCHES(text, pattern, [flags]) returns an array with the matches of the pattern in the given text. Flags can be given to control certain behaviours of the regular expression. Currently, only the i flag is supported, to make the comparison case insensitive.
REPEAT(str, count) returns a string consisting of the string str repeated count times.
REPLACE(str,from_str,to_str) returns the string str with all occurrences of the string from_str replaced by the string to_str.
REVERSE(str) returns the string str with the order of the characters reversed.
ROUND(number, decimals) rounds the number to decimals decimal places.
RPAD(str, len, padstr) returns the string str, right-padded with the string padstr to a length of len characters.
RTRIM(str) returns the string str with trailing space characters removed.
SECOND(date) returns the seconds of the given date.
SLEEP(seconds) waits for the specified number of seconds (can be fractional).
SOUNDEX(str) returns the soundex of a string.
SPLIT(str,sep) returns the parts of the string str split by the separator sep as a JSON array of strings.
SQRT(X) returns the square root of a nonnegative number X.
SUBSTR(str, pos, [len]) returns a substring from the string str starting at pos with a length of len characters. If no len is provided, all characters from pos until the end will be taken.
SUBSTRING(str, pos, [len]) returns a substring from the string str starting at pos with a length of len characters. If no len is provided, all characters from pos until the end will be taken.
SUM(expr) returns the sum of expr in all rows.
TO_BASE64(str) encodes the string str in base64 format.
TRIM(str) returns the string str with leading and trailing spaces removed.
UPPER(str) returns the string str with all characters in upper case.
WEEKDAY(date) returns the weekday of the given date.
YEAR(date) returns the year of the given date.
YEARWEEK(date, mode) returns year and week for a date. The year in the result may be different from the year in the date argument for the first and the last week of the year.

Configuration

The behaviour of certain parts of go-mysql-server can be configured using either environment variables or session variables.

Session variables are set using the following SQL queries:

SET <variable name> = <value>
Name Type Description
INMEMORY_JOINS environment If set, it will perform all joins in memory. Default is off.
inmemory_joins session If set, it will perform all joins in memory. Default is off. This has precedence over INMEMORY_JOINS.
MAX_MEMORY environment The maximum amount of memory, in megabytes, that can be consumed by go-mysql-server. Any in-memory caches or computations will no longer try to use memory when the limit is reached. Note that this may cause certain queries to fail if there is not enough memory available, such as queries using DISTINCT, ORDER BY or GROUP BY with groupings.
DEBUG_ANALYZER environment If set, the analyzer will print debug messages. Default is off.
PILOSA_INDEX_THREADS environment Number of threads used in index creation. Default is the number of cores available in the machine.
pilosa_index_threads session Number of threads used in index creation. Default is the number of cores available in the machine. This has precedence over PILOSA_INDEX_THREADS.
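
For example, a minimal sketch of setting the environment variables from Go before the engine starts (the values are illustrative; per the descriptions above, the boolean-style flags only need to be set at all):

package main

import "os"

func init() {
    // These must be set before the engine is instantiated.
    os.Setenv("INMEMORY_JOINS", "on")      // any value enables in-memory joins
    os.Setenv("MAX_MEMORY", "2048")        // megabytes
    os.Setenv("PILOSA_INDEX_THREADS", "4") // threads used for index creation
}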

Example

go-mysql-server contains both a SQL engine and a server implementation. So, if you want to start a server, first instantiate the engine and register your sql.Database implementation.

The engine will be in charge of handling all the logic to retrieve the data from your source. Here you can see an example using the in-memory database implementation:

package main

import (
    "time"

    sqle "github.com/src-d/go-mysql-server"
    "github.com/src-d/go-mysql-server/auth"
    "github.com/src-d/go-mysql-server/memory"
    "github.com/src-d/go-mysql-server/server"
    "github.com/src-d/go-mysql-server/sql"
)

func main() {
    driver := sqle.NewDefault()
    driver.AddDatabase(createTestDatabase())

    config := server.Config{
        Protocol: "tcp",
        Address:  "localhost:3306",
        Auth:     auth.NewNativeSingle("user", "pass", auth.AllPermissions),
    }

    s, err := server.NewDefaultServer(config, driver)
    if err != nil {
        panic(err)
    }

    s.Start()
}

func createTestDatabase() *memory.Database {
    const (
        dbName    = "test"
        tableName = "mytable"
    )

    db := memory.NewDatabase(dbName)
    table := memory.NewTable(tableName, sql.Schema{
        {Name: "name", Type: sql.Text, Nullable: false, Source: tableName},
        {Name: "email", Type: sql.Text, Nullable: false, Source: tableName},
        {Name: "phone_numbers", Type: sql.JSON, Nullable: false, Source: tableName},
        {Name: "created_at", Type: sql.Timestamp, Nullable: false, Source: tableName},
    })

    db.AddTable(tableName, table)
    ctx := sql.NewEmptyContext()

    rows := []sql.Row{
        sql.NewRow("John Doe", "[email protected]", []string{"555-555-555"}, time.Now()),
        sql.NewRow("John Doe", "[email protected]", []string{}, time.Now()),
        sql.NewRow("Jane Doe", "[email protected]", []string{}, time.Now()),
        sql.NewRow("Evil Bob", "[email protected]", []string{"555-666-555", "666-666-666"}, time.Now()),
    }

    for _, row := range rows {
        table.Insert(ctx, row)
    }

    return db
}

...

Then, you can connect to the server with any MySQL client:

> mysql --host=127.0.0.1 --port=3306 -u user -ppass test -e "SELECT * FROM mytable"
+----------+-------------------+-------------------------------+---------------------+
| name     | email             | phone_numbers                 | created_at          |
+----------+-------------------+-------------------------------+---------------------+
| John Doe | [email protected]      | ["555-555-555"]               | 2018-04-18 10:42:58 |
| John Doe | [email protected]   | []                            | 2018-04-18 10:42:58 |
| Jane Doe | [email protected]      | []                            | 2018-04-18 10:42:58 |
| Evil Bob | [email protected] | ["555-666-555","666-666-666"] | 2018-04-18 10:42:58 |
+----------+-------------------+-------------------------------+---------------------+

See the complete example here.
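
You can also connect to the server programmatically. A hedged sketch using Go's database/sql (any MySQL-protocol client library should work; github.com/go-sql-driver/mysql is one common choice and is not part of this project):

package main

import (
    "database/sql"
    "fmt"

    _ "github.com/go-sql-driver/mysql"
)

func main() {
    // Same address and credentials as the server example above.
    db, err := sql.Open("mysql", "user:pass@tcp(127.0.0.1:3306)/test")
    if err != nil {
        panic(err)
    }
    defer db.Close()

    rows, err := db.Query("SELECT name FROM mytable")
    if err != nil {
        panic(err)
    }
    defer rows.Close()

    for rows.Next() {
        var name string
        if err := rows.Scan(&name); err != nil {
            panic(err)
        }
        fmt.Println(name)
    }
}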

Queries examples

SELECT count(name) FROM mytable
+---------------------+
| COUNT(mytable.name) |
+---------------------+
|                   4 |
+---------------------+

SELECT name,year(created_at) FROM mytable
+----------+--------------------------+
| name     | YEAR(mytable.created_at) |
+----------+--------------------------+
| John Doe |                     2018 |
| John Doe |                     2018 |
| Jane Doe |                     2018 |
| Evil Bob |                     2018 |
+----------+--------------------------+

SELECT email FROM mytable WHERE name = 'Evil Bob'
+-------------------+
| email             |
+-------------------+
| [email protected] |
+-------------------+

Custom data source implementation

To be able to create your own data source implementation you need to implement the following interfaces:

  • sql.Database interface. This interface will provide tables from your data source.

    • If your database implementation supports adding more tables, you might want to add support for the sql.Alterable interface
  • sql.Table interface. It will be in charge of transforming any kind of data into an iterator of Rows. Depending on how much you want to optimize the queries, you can also implement other interfaces on your tables:

    • sql.PushdownProjectionTable interface will provide a way to get only the columns needed for the executed query.
    • sql.PushdownProjectionAndFiltersTable interface will provide the same functionality described before, but will also push down the filters used in the executed query. This allows filtering data in advance and speeds up queries.
    • sql.Indexable adds index capabilities to your table. By implementing this interface you can create and use indexes on this table.
    • sql.Inserter can be implemented if your data source tables allow insertions.
  • If you need some custom tree modifications, you can also implement your own analyzer.Rules.

You can see a really simple data source implementation in our mem package.
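
If you just need tabular access to existing data, a quick way to prototype is to load it into the memory backend shown in the example above; a real custom data source would implement sql.Database and sql.Table over the original storage instead of copying rows. A minimal sketch (the CSV file and its two-column layout are hypothetical):

package main

import (
    "encoding/csv"
    "os"

    "github.com/src-d/go-mysql-server/memory"
    "github.com/src-d/go-mysql-server/sql"
)

// csvDatabase loads a two-column CSV file into the memory backend so it
// can be queried with SQL. It assumes every record has two fields.
func csvDatabase(path string) (*memory.Database, error) {
    f, err := os.Open(path)
    if err != nil {
        return nil, err
    }
    defer f.Close()

    records, err := csv.NewReader(f).ReadAll()
    if err != nil {
        return nil, err
    }

    db := memory.NewDatabase("csvdb")
    table := memory.NewTable("people", sql.Schema{
        {Name: "name", Type: sql.Text, Source: "people"},
        {Name: "email", Type: sql.Text, Source: "people"},
    })
    db.AddTable("people", table)

    ctx := sql.NewEmptyContext()
    for _, rec := range records {
        if err := table.Insert(ctx, sql.NewRow(rec[0], rec[1])); err != nil {
            return nil, err
        }
    }
    return db, nil
}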

Indexes

go-mysql-server exposes a series of interfaces that allow you to implement your own indexes so you can speed up your queries.

Taking a look at the main index interface, you must note a couple of constraints:

  • This abstraction lets you create an index for multiple columns (one or more) or for a single expression (e.g. a function applied on multiple columns).
  • If you want to index an expression that is not a column, you can only index one expression at a time.

Custom index driver implementation

Index drivers provide different backends for storing and querying indexes. To implement a custom index driver you need to implement a few things:

  • sql.IndexDriver interface, which will be the driver itself. Note that your driver must return a unique ID in the ID method. This ID is unique for your driver and should not clash with any other registered driver. It's the driver's responsibility to be fault tolerant and be able to automatically detect and recover from corruption in indexes.
  • sql.Index interface, returned by your driver when an index is loaded or created.
    • Your sql.Index may optionally implement the sql.AscendIndex and/or sql.DescendIndex interfaces, if you want to support more comparison operators like >, <, >=, <= or BETWEEN.
  • sql.IndexLookup interface, returned by your index in any of the implemented operations to get a subset of the indexed values.
    • Your sql.IndexLookup may optionally implement the sql.Mergeable and sql.SetOperations interfaces if you want to support set operations to merge your index lookups.
  • sql.IndexValueIter interface, which will be returned by your sql.IndexLookup and should return the values of the index.
  • Don't forget to register the index driver in your sql.Catalog using catalog.RegisterIndexDriver(mydriver) to be able to use it.
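
A hedged sketch of the registration step, assuming the engine exposes its catalog as a Catalog field (myDriver stands in for your sql.IndexDriver implementation):

engine := sqle.NewDefault()

// Register the driver so the engine can create and load indexes with it.
// The value returned by the driver's ID() method must be unique.
engine.Catalog.RegisterIndexDriver(myDriver)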

To create indexes using your custom index driver you need to use USING driverid in the index creation query. For example:

CREATE INDEX foo ON table USING driverid (col1, col2)

You can see an example of a driver implementation inside the sql/index/pilosa package, where the pilosa driver is implemented.

Index creation is synchronous by default; to make it asynchronous, use WITH (async = true), for example:

CREATE INDEX foo ON table USING driverid (col1, col2) WITH (async = true)

Old pilosalib driver

The pilosalib driver was renamed to pilosa, and the pilosa driver no longer requires an external pilosa server. Note that pilosa is not supported on Windows.

Metrics

go-mysql-server uses the github.com/go-kit/kit/metrics module to expose metrics (counters, gauges, histograms) for certain packages (so far: engine, analyzer, regex, pilosa). If you already have a metrics server (Prometheus, statsd/statsite, InfluxDB, etc.) and want to gather metrics from go-mysql-server components as well, you need to initialize some global variables with implementations that satisfy the following interfaces:

// Counter describes a metric that accumulates values monotonically.
type Counter interface {
	With(labelValues ...string) Counter
	Add(delta float64)
}

// Gauge describes a metric that takes specific values over time.
type Gauge interface {
	With(labelValues ...string) Gauge
	Set(value float64)
	Add(delta float64)
}

// Histogram describes a metric that takes repeated observations of the same
// kind of thing, and produces a statistical summary of those observations,
// typically expressed as quantiles or buckets.
type Histogram interface {
	With(labelValues ...string) Histogram
	Observe(value float64)
}

You can use one of the go-kit implementations or write your own. For instance, to expose metrics to a Prometheus server, we have to set up the following variables before starting the MySQL engine:

import (
    "github.com/go-kit/kit/metrics/prometheus"
    promopts "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

//....

// engine metrics
sqle.QueryCounter = prometheus.NewCounterFrom(promopts.CounterOpts{
    Namespace: "go_mysql_server",
    Subsystem: "engine",
    Name:      "query_counter",
}, []string{
    "query",
})
sqle.QueryErrorCounter = prometheus.NewCounterFrom(promopts.CounterOpts{
    Namespace: "go_mysql_server",
    Subsystem: "engine",
    Name:      "query_error_counter",
}, []string{
    "query",
    "error",
})
sqle.QueryHistogram = prometheus.NewHistogramFrom(promopts.HistogramOpts{
    Namespace: "go_mysql_server",
    Subsystem: "engine",
    Name:      "query_histogram",
}, []string{
    "query",
    "duration",
})

// analyzer metrics
analyzer.ParallelQueryCounter = prometheus.NewCounterFrom(promopts.CounterOpts{
    Namespace: "go_mysql_server",
    Subsystem: "analyzer",
    Name:      "parallel_query_counter",
}, []string{
    "parallelism",
})

// regex metrics
regex.CompileHistogram = prometheus.NewHistogramFrom(promopts.HistogramOpts{
    Namespace: "go_mysql_server",
    Subsystem: "regex",
    Name:      "compile_histogram",
}, []string{
    "regex",
    "duration",
})
regex.MatchHistogram = prometheus.NewHistogramFrom(promopts.HistogramOpts{
    Namespace: "go_mysql_server",
    Subsystem: "regex",
    Name:      "match_histogram",
}, []string{
    "string",
    "duration",
})

// pilosa index driver metrics
pilosa.RowsGauge = prometheus.NewGaugeFrom(promopts.GaugeOpts{
    Namespace: "go_mysql_server",
    Subsystem: "index",
    Name:      "indexed_rows_gauge",
}, []string{
    "driver",
})
pilosa.TotalHistogram = prometheus.NewHistogramFrom(promopts.HistogramOpts{
    Namespace: "go_mysql_server",
    Subsystem: "index",
    Name:      "index_created_total_histogram",
}, []string{
    "driver",
    "duration",
})
pilosa.MappingHistogram = prometheus.NewHistogramFrom(promopts.HistogramOpts{
    Namespace: "go_mysql_server",
    Subsystem: "index",
    Name:      "index_created_mapping_histogram",
}, []string{
    "driver",
    "duration",
})
pilosa.BitmapHistogram = prometheus.NewHistogramFrom(promopts.HistogramOpts{
    Namespace: "go_mysql_server",
    Subsystem: "index",
    Name:      "index_created_bitmap_histogram",
}, []string{
    "driver",
    "duration",
})

One important note: internally we set some labels for metrics, which is why we have to pass keys like "duration", "query", "driver", and so on when we register the metrics in Prometheus. Other systems may have different requirements.
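
The import block above already pulls in promhttp but never uses it. A minimal sketch of exposing the registered metrics over HTTP so Prometheus can scrape them (the port is arbitrary, and "net/http" must be imported):

// go-kit's prometheus constructors register the metrics with the default
// Prometheus registry, so promhttp.Handler() serves them automatically.
http.Handle("/metrics", promhttp.Handler())
go func() {
    if err := http.ListenAndServe(":2112", nil); err != nil {
        panic(err)
    }
}()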

Powered by go-mysql-server

  • gitbase: a SQL interface to git repositories, built on go-mysql-server.

License

Apache License 2.0, see LICENSE


go-mysql-server's Issues

Filters are not being pushed down, but are removed from the filter node

This query works as expected:

SELECT *
FROM refs 
INNER JOIN commits c 
	ON refs.name = 'HEAD' 
	AND history_idx(refs.hash, c.hash) >= 0

This one doesn't (and it should):

SELECT *
FROM refs
INNER JOIN commits c 
	ON history_idx(refs.hash, c.hash) >= 0
WHERE refs.name = 'HEAD'

The only difference is that the name check here is a filter and not a join condition.

Tree is the following:

Project(repository_id, name, hash, hash, author_name, author_email, author_when, committer_name, committer_email, committer_when, message, tree_hash)
 └─ InnerJoin(history_idx(refs.hash, commits.hash) >= 0)
     ├─ PushdownProjectionAndFiltersTable
     │   ├─ Columns(refs.hash)
     │   ├─ Filters()
     │   └─ Table(refs)
     └─ TableAlias(c)
         └─ PushdownProjectionAndFiltersTable
             ├─ Columns(commits.hash)
             ├─ Filters()
             └─ Table(commits)

We can see the filter has been lost.

I still don't know why the filter is not being pushed down, so I have to investigate.

TPC-H queries

TPC-H is a set of queries to benchmark decision support on databases. The queries are business oriented.

This issue is an umbrella issue to group all the bugs that we need to fix to be able to run this set of queries.

Bugs

  • #38 Support more than two tables on JOINS
  • #40 Unsupported syntax that should be supported
  • #58 NOT REGEXP

Missing features

  • #46 Aliased tables
  • #45 DATE type
  • #43 YEAR(date) function
  • #42 SUBSTRING function
  • #41 Implement ParentExpr syntax
  • #56 HAVING
  • #57 INNER JOIN
  • #59 SUBQUERY
  • #60 List of values
  • #61 AVG() function
  • #62 Between

This issue will be updated with new needed features and bugs.

ambiguous columns cannot be used

Imagine we have table foo with column a and table bar with column a as well. Right now, it's not possible to use both columns.

NOTE: you can SELECT * and get a result with ambiguous columns; you just cannot reference them explicitly, e.g. select foo.a, bar.a.

NOTE 2: PR to fix this should start from #82

Improve error messages

Right now, if a query fails during the resolution process, an error like this is obtained:

    Error:          Received unexpected error:
                        2 errors occurred:
                        
                        * plan is not resolved because of node '*plan.Project'
                        * plan is not resolved because of node '*plan.Filter'

Improve the error messages to add more information about what is failing in those nodes.

Add Walk function to plan

Problem

We have many rules in the analyzer that need to gather some data from the whole tree before making any transformations. To do that, right now they make use of TransformUp, which creates a lot of nodes that are unnecessary, since we are not transforming anything, just gathering data.

Solution

It would be very handy to have a Walk function in the plan that allowed us to traverse the plan nodes.

Why a function instead of a method? So we don't add more and more methods to the Node interface. Besides, we don't need all nodes to implement it, only the ones we care about. For example, we don't need table implementations to provide their own Walk method. Plan nodes can only be generated by us, since we are the ones parsing.

Besides, we have the Children method in the Node interface, which makes having a per-node implementation of Walk irrelevant.

Ideal API would be the following:

plan.Walk(node, func(sql.Node) plan.Visitor {
        // do something with the node
})

The callback should return a Visitor with perhaps custom logic for the given node or nil to stop.

Then the shortcut:

plan.Inspect(node, func(sql.Node) bool {
        // do something with the node
})

The callback should return a boolean. If the result is false, it will stop walking into that node's children.

It should work like go/ast#Inspect and go/ast#Walk
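
A hedged sketch of what that could look like, following go/ast's shape and relying only on the Children method already present in the Node interface (this is illustrative, not code from the repository):

// Visitor mirrors go/ast.Visitor: Visit returns the visitor to use for the
// node's children, or nil to stop descending.
type Visitor interface {
    Visit(node sql.Node) Visitor
}

// Walk traverses the plan depth-first using Node's existing Children method.
func Walk(v Visitor, node sql.Node) {
    if v = v.Visit(node); v == nil {
        return
    }
    for _, child := range node.Children() {
        Walk(v, child)
    }
}

type inspector func(sql.Node) bool

func (f inspector) Visit(node sql.Node) Visitor {
    if f(node) {
        return f
    }
    return nil
}

// Inspect calls f for every node; if f returns false, the node's children
// are skipped.
func Inspect(node sql.Node, f func(sql.Node) bool) {
    Walk(inspector(f), node)
}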

Ignore SET keyword

MySQL Workbench executes SET commands on start. These should be ignored:

set autocommit=1
SET SESSION TRANSACTION ISOLATION LEVEL REPEATABLE READ

Expression and Node should have a Walk method

Motivation

During the analysis phase, sometimes we need to traverse the tree for reading purposes, that is, inspecting it and extracting some information without actually changing the tree.
The thing is, no matter whether you change the children or not, TransformUp, TransformExpressionsUp, and so on create new nodes every time you call them. So, for read-only operations we are making a lot of allocations that are not really needed.

Proposed solution

Add a Walk(func(sql.Node) bool) method to the Node interface and a Walk(func(Expression) bool) method to the Expression interface.

That way, we can walk and inspect the tree without making unnecessary new allocations.

Benefits

We get to examine the tree without needing to create new nodes and make really expensive allocations.

Drawbacks

Every expression and node would need a new method, which adds more boilerplate on top of the already large amount we have to write for every single node we create.

Pushdown Projections to Tables

To avoid returning all the columns of a table when it is not necessary, we should push down the needed projection to the table iterators so they can filter beforehand.

Add configuration per engine instance and per session

We need a way to set configuration at the engine (i.e., database) and session levels. This configuration should be of two types: one to configure go-mysql-server parameters (log level, number of rows to return per batch on a connection), and one specific to the database implementation (in the gitbase case, the max size of files to return in the blob table).

Create String() method on Node Interface

Create a String() method on the Node interface to be able to print an entire tree. This can be really useful when something fails and we throw an error, or for debugging.

Expression has a Name() method that should be changed to String() too.

Type Compare method should also return an error

Right now, the Type interface has the following method:

Compare(a, b interface{}) int

Ideally, this should also return an error if the types don't match. Otherwise, how do we report that the types don't match?
As it stands, everything that uses Compare needs to implement its own type-matching logic, scattering it across the codebase instead of keeping it in a single place.

The method would finally look like this:

Compare(a, b interface{}) (int, error)

Thoughts?

NOTE: by mismatch, I don't mean failing if Int64 is not Int32, but failing only if a and/or b cannot be converted to the type T using Convert.

star is resolved, but columns are not qualified, so they are not pushed down

Project(repository_id, name, hash, hash, author_name, author_email, author_when, committer_name, committer_email, committer_when, message, tree_hash)
 └─ InnerJoin(history_idx(refs.hash, commits.hash) >= 0)
     ├─ PushdownProjectionAndFiltersTable
     │   ├─ Columns(refs.hash)
     │   ├─ Filters()
     │   └─ Table(refs)
     └─ TableAlias(c)
         └─ PushdownProjectionAndFiltersTable
             ├─ Columns(commits.hash)
             ├─ Filters()
             └─ Table(commits)

As we can see in the previous tree, * resolves to GetField, which is not qualified because it's already resolved. It should resolve to UnresolvedColumns instead, or at least have the source added to it. Otherwise, columns will not be pushed down.

Proposal: Specify function arity instead of using reflection

Current state

Right now, we pass an interface{} as the UDF, which is then inspected and called using reflect.

Proposal

Instead, I propose to have a fixed arity specified by the user.

Benefits of this:

  • does not use reflection, so more maintainable code
  • just looking at the registry you can see the arity of the function
  • does not allow random signatures, so it's more type-safe

The implementation would look like the following:

type Function interface {
        Arity() int
}

type Function1 func(sql.Expression) sql.Expression
type Function2 func(sql.Expression, sql.Expression) sql.Expression
// ...
type FunctionN func(sql.Expression, sql.Expression, sql.Expression, sql.Expression) sql.Expression

func (Function1) Arity() int { return 1 }
func (Function2) Arity() int { return 2 }
// ...
func (FunctionN) Arity() int { return n }

// Call could be converted to Call(...sql.Expression) 
func Call(fn Function, exprs ...sql.Expression) (sql.Expression, error) {
        if fn.Arity() != len(exprs) { /* throw error */ }
        switch fn := fn.(type) {
        case Function1: return fn(exprs[0]), nil
        // ...
        }
}

Then in the registry:

var Functions = map[string]Function{
        "func": Function1(NewMyFunc),
}

Plan pretty printer

We have #95 to add a String() method to all nodes so they can be printed, but that will not generate a tree-like structure to pretty-print the plan as such.

We might want to have a way to pretty-print the plan in a tree structure.

When we have the Children method in expressions, as described in #107, and the String function in all nodes, we could write a generic function to pretty-print the plan as a tree.

Something like

Project(foo)
|- Filter(yada yada)
    |- CrossJoin
        |- Table(bar)
        |- Table(baz)

The API could be like:

result := plan.Pretty(node)

Pass session around

We should have a Session to pass around to all nodes that need to be evaluated. With this session, we can pass around dependencies (such as the repository pool for gitquery) that are needed elsewhere in the code, along with user data, if any.

Possible use cases for this:

  • request-scoped metrics
  • dependency injection
  • passing of session data

Any thoughts? /cc @ajnavarro @mcarmonaa @jfontan

Proposal:

A way of implementing this that's not very disruptive to the current API would be the following:

  • Implement a Session type. It could either be a concrete type or an interface with a base implementation that can be extended to provide extra features (this option could be really powerful for gitquery).
type SessionData struct { /* yada yada */ }

type Session interface {
	Data() SessionData
	// yada yada
}

type BaseSession struct {
	data SessionData
}

func (s BaseSession) Data() SessionData {
	return s.data
}

// another file or even another package

type CustomSession struct {
	BaseSession
	mything MyType
}

// inside an UDF, for example
customSess, ok := session.(*CustomSession)
if ok {
	customSess.mything.foo()
}
  • Change the signature of Expression.Eval to Eval(Session, Row) (interface{}, error)
  • Change the signature of Node.RowIter to RowIter(Session) (RowIter, error)
  • Change the signature of AggregationExpression.Update to Update(session Session, buffer, row Row) error
  • Change the signature of AggregationExpression.Merge to Merge(session Session, buffer, partial Row) error
  • Change the signature of Aggregation.Update to Update(Session, Row) (Row, error)
  • Change the signature of Aggregation.Merge to Merge(Session, Row)
  • Change the signature of Aggregation.Eval to Eval(Session) (interface{}, error) (and make it return an error, while we are at it)

That's about all the places that would need the session passed around. Then it's the job of whoever calls these methods to provide them the session and so on.

Split UDFs into several files in their own package

Right now, we have all functions inside sql/expression in a file called function.go. Since each function is not just an isolated function but a struct with many methods, this can get out of hand once we start adding a lot of functions, even though it seems manageable right now.

I propose adding a new package sql/expression/function and putting every function in a different file. That way it's easier to find a specific function, and they're not all lying around in a single file.

Same with aggregations when we have enough of them that having them in a file starts to be unmanageable.

Make named types for Transform callbacks

We should add named types for all function types passed as callbacks to transforms.

  • No need to change all of them when signature changes
  • Current ones are huge, e.g. func(sql.Expression) (sql.Expression, error)

I propose the following:

type (
  TransformNodeFunc func(Node) (Node, error)
  TransformExprFunc func(Expression) (Expression, error)
)

type Node interface {
  // ...
  TransformUp(TransformNodeFunc) (Node, error)
  TransformExpressionsUp(TransformExprFunc) (Node, error)
}

type Expression interface {
  // ...
  TransformUp(TransformExprFunc) (Expression, error)
}

Narrow star expression

SELECT * and SELECT c.* currently behave the same, but with the second, only the columns from c should be selected.

Implement null logic

All the types and expressions should support null handling with any type. Right now nulls are not supported.

Add TPC-H queries as engine tests

We should test the engine using the Memory datasource implementation.

This will help us check whether the plans are created correctly and the results are correct. We can also compare engine performance across different versions.

Invalid queries

Right now, the engine parses and returns results for some invalid queries.

Having this table:

!> select * from refs;

--> Executing query: select * from refs;

+------------------------------+----------------------+--------------------------------------------+---------------------+-----------+---------+-----------+--------+
|             NAME             |         TYPE         |                    HASH                    |       TARGET        | IS BRANCH | IS NOTE | IS REMOTE | IS TAG |
+------------------------------+----------------------+--------------------------------------------+---------------------+-----------+---------+-----------+--------+
| "refs/heads/master"          | "hash-reference"     | "f54d0f2c6e12a135e0f8762cca88a6d80741a35f" | NULL                | true      | false   | false     | false  |
| "refs/remotes/origin/master" | "hash-reference"     | "f54d0f2c6e12a135e0f8762cca88a6d80741a35f" | NULL                | false     | false   | true      | false  |
| "HEAD"                       | "symbolic-reference" | NULL                                       | "refs/heads/master" | false     | false   | false     | false  |
+------------------------------+----------------------+--------------------------------------------+---------------------+-----------+---------+-----------+--------+

You can perform the query SELECT type,name FROM refs GROUP BY type, which uses the column name in the SELECT even though it is neither an aggregation nor present in the GROUP BY.

!> select type,name from refs group by type;

--> Executing query: select type,name from refs group by type;

+----------------------+---------------------+
|         TYPE         |        NAME         |
+----------------------+---------------------+
| "hash-reference"     | "refs/heads/master" |
| "symbolic-reference" | "HEAD"              |
+----------------------+---------------------+

It's returning the FIRST value found in the group of rows for each type.

Another example is SELECT type,name,hash FROM refs GROUP BY count(type), which uses an aggregation in the GROUP BY:

!> select type,name,hash from refs group by count(type);

--> Executing query: select type,name,hash from refs group by count(type);

+----------------------+------------------------------+--------------------------------------------+
|         TYPE         |             NAME             |                    HASH                    |
+----------------------+------------------------------+--------------------------------------------+
| "hash-reference"     | "refs/heads/master"          | "f54d0f2c6e12a135e0f8762cca88a6d80741a35f" |
| "hash-reference"     | "refs/remotes/origin/master" | "f54d0f2c6e12a135e0f8762cca88a6d80741a35f" |
| "symbolic-reference" | "HEAD"                       | NULL                                       |
+----------------------+------------------------------+--------------------------------------------+

The simplest example of an invalid query is SELECT type,count(name) FROM refs:

+------------------+-------------+
|       TYPE       | COUNT(NAME) |
+------------------+-------------+
| "hash-reference" |           3 |
+------------------+-------------+

Add Walk function to expression

Same as #106, but for expressions.

The only thing I would change here is to make Expression have a Children method, as Node does. With this, we can have a function that works generically on all expressions without having to force all Expression implementations to implement their own Walk method.

Panic comparing literals and numeric columns

If we have the following query:
SELECT a from b where a > 10
It will panic because a is an int32 and the literal 10 is an int64 by default. The relevant line in parse.go:

return expression.NewLiteral(n, sql.Int64), nil

The standard solution for this is to add a CAST node surrounding the literal, casting it to the type of the other expression in the comparison. With this, all of the following queries will be executable:
SELECT a from b where a > 10
SELECT a from b where a > "10"
SELECT a from b where a > 10.0
SELECT a from b where 10 < a
