I looked at the code 2-3 weeks ago when it was first announced with the 2018 big-data-on-Rust blog post, and took another cursory look today. I am going through a similar exercise in Rust, scratching my own itch after working 4+ years in the same domain on JVM/Scala/etc.
I would like this project to succeed, so here are some observations that may help in the long run (caveat: I may have missed some details in your code).
The streaming model is more general and can easily be relaxed to batch processing; the inverse is always hard. On the other hand, batch processing is more efficient than processing event by event. In my experience something like micro-batching works best in terms of flexibility and performance.
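To illustrate the micro-batching idea in miniature (a toy sketch, not your API): an operator that sees slices of up to N events pays its per-operator overhead once per batch instead of once per event.

```rust
fn main() {
    // A stream of events, here just integers for illustration.
    let events: Vec<u64> = (0..10).collect();

    // Micro-batching: the operator (here a sum) runs once per chunk of
    // up to 4 events, amortizing dispatch overhead across the batch.
    let sums: Vec<u64> = events
        .chunks(4)
        .map(|batch| batch.iter().sum())
        .collect();

    assert_eq!(sums, vec![6, 22, 17]);
}
```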
Column-oriented processing is more efficient on the current crop of hardware, as it plays better with dispatch overhead, cache locality and data prefetching. It is also a natural extension of the micro-batching from the previous point. My own benchmarks show a ~20x difference with batches of 1024 tuples: around 2-3 cycles per arithmetic operation on f64, including SQL NULL correctness.
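As a minimal sketch of what columnar evaluation with SQL NULL semantics might look like (the `Column` struct and the `Vec<bool>` validity mask are my own strawman; a real engine would use bitmaps):

```rust
// Strawman columnar batch: a vector of values plus a validity mask.
struct Column {
    values: Vec<f64>,
    valid: Vec<bool>, // false => SQL NULL
}

// Element-wise addition over the whole batch: a tight loop the compiler
// can vectorize; NULL propagates per SQL three-valued logic.
fn add(lhs: &Column, rhs: &Column) -> Column {
    Column {
        values: lhs.values.iter().zip(&rhs.values).map(|(a, b)| a + b).collect(),
        valid: lhs.valid.iter().zip(&rhs.valid).map(|(a, b)| *a && *b).collect(),
    }
}

fn main() {
    let a = Column { values: vec![1.0, 2.0, 3.0], valid: vec![true, true, false] };
    let b = Column { values: vec![10.0, 20.0, 30.0], valid: vec![true, false, true] };
    let c = add(&a, &b);
    assert_eq!(c.values[0], 11.0);
    assert_eq!(c.valid, vec![true, false, false]); // NULL + x = NULL
}
```

The point is that the per-value work is a branch-free loop over contiguous memory; the type and NULL bookkeeping are hoisted out of the hot path.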
You definitely don't want to dispatch on types in the leaves, e.g. in the function/expression bodies like https://github.com/andygrove/datafusion-rs/blob/master/src/functions/math.rs#L14 - the data types are a compile-time concept and should not exist at run-time. You can use generics, type erasure and columnar dispatch to achieve this. Think about it in the same vein as Rust vs. Ruby: Rust is faster because it does not need to check the run-time type on each operation; the compiler guarantees it.
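A small sketch of what I mean by using generics instead of run-time type checks (`gteq_col` is a hypothetical name, not from your code): the comparison is monomorphized per concrete type, so the inner loop carries no enum-tag match.

```rust
// Generic columnar comparison: the type is resolved at compile time,
// so the hot loop has zero per-value type dispatch.
fn gteq_col<T: PartialOrd + Copy>(lhs: &[T], rhs: &[T]) -> Vec<bool> {
    lhs.iter().zip(rhs).map(|(a, b)| a >= b).collect()
}

fn main() {
    // Monomorphized separately for i64 and f64; no run-time match on type.
    assert_eq!(gteq_col(&[1i64, 5, 3], &[2, 4, 3]), vec![false, true, true]);
    assert_eq!(gteq_col(&[1.0f64], &[0.5]), vec![true]);
}
```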
Instead of a switch-on-type interpreter, where you pay a dispatch cost on each event, you may want to think about expressions in a more functional way, e.g.
fn gteq(lhs: Expr, rhs: Expr) -> Box<dyn Fn(&Tuple) -> Value>
or if you get into columnar approach:
fn gteq(lhs: Expr, rhs: Expr) -> Box<dyn Fn(&Frame) -> Column>
and compose the whole computation just once before executing it: following a function pointer once per 1000 tuples is a lot cheaper than going through a match on each tuple.
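A compile-once/execute-per-batch sketch of that idea (all names here - `Frame`, `Column`, `col` - are strawman types for illustration, not your API):

```rust
// Strawman batch types: a frame of f64 columns and a compiled predicate.
type Column = Vec<f64>;
struct Frame { cols: Vec<Column> }

// A column reference, compiled to a closure over the frame.
fn col(idx: usize) -> Box<dyn Fn(&Frame) -> Column> {
    Box::new(move |f| f.cols[idx].clone())
}

// Compose >= once; at execution time we follow one boxed-fn pointer
// per batch instead of matching on the expression tree per tuple.
fn gteq(
    lhs: Box<dyn Fn(&Frame) -> Column>,
    rhs: Box<dyn Fn(&Frame) -> Column>,
) -> Box<dyn Fn(&Frame) -> Vec<bool>> {
    Box::new(move |f| {
        let (l, r) = (lhs(f), rhs(f));
        l.iter().zip(&r).map(|(a, b)| a >= b).collect()
    })
}

fn main() {
    let expr = gteq(col(0), col(1)); // composed once, before execution
    let batch = Frame { cols: vec![vec![1.0, 5.0], vec![2.0, 4.0]] };
    assert_eq!(expr(&batch), vec![false, true]);
}
```

The dispatch cost (the virtual call through the `Box<dyn Fn>`) is paid once per batch; the per-tuple work inside is a plain loop.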
You can take a look at a minimalistic example I put together some time ago: https://gist.github.com/luben/95c1c05f36ec56a57f5624c1b40e9f11