query optimization pushes column projection and filtering over to AWS (aka "predicate pushdown"), so less data has to be transferred out of S3 and the runtime computing and consuming the output record set has a smaller network/storage/memory footprint
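for illustration only (the users table and its columns are made up, and the exact expression the driver generates depends on its optimizer): a JDBC query like the first statement below could be served by pushing roughly the second statement to SelectObjectContent, which addresses the backing object as S3Object in S3 Select's SQL dialect

    -- query issued through JDBC
    SELECT name, age FROM users WHERE age > 30;

    -- roughly what could be pushed to S3 Select for the backing object
    -- (CSV fields arrive as strings, hence the CAST)
    SELECT s.name, s.age FROM S3Object s WHERE CAST(s.age AS INT) > 30;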
make sure you've already run the AWS CLI's configure command to add credentials to whatever environment you dev in
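for example, with the AWS CLI installed:

    $ aws configure
    AWS Access Key ID [None]: AKIA...
    AWS Secret Access Key [None]: ...
    Default region name [None]: us-east-1
    Default output format [None]: json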
in the test folder, replace all occurrences of "build.cauldron.tools" with the name of an existing S3 bucket that your previously configured AWS credentials actually have read/write access to
use the LakeDriver.getConnection(...) methods to create JDBC connections (a sketch follows the scan list below)
pass a list of TableSpecification objects defining all the "external tables" your query references
(optional) specify one of the following Scan classes to configure behavior
LakeS3GetScan uses GetObject; full tables are downloaded, and both projection and filtering are performed in memory
LakeS3SelectScan uses SelectObjectContent; only the required projected columns are downloaded, and filtering is done in memory
LakeS3SelectWhereScan (default) uses SelectObjectContent; both projection and filtering are done on AWS, the results are downloaded, and any remaining untranslated filters are applied in memory
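putting the above together, a minimal sketch (the TableSpecification constructor arguments and the exact getConnection overload are assumptions here; check the test folder for the real signatures):

    import java.sql.Connection;
    import java.sql.ResultSet;
    import java.sql.Statement;
    import java.util.List;

    public class LakeQueryExample {
        public static void main(String[] args) throws Exception {
            // hypothetical TableSpecification shape: logical table name plus the
            // S3 location of the backing object
            List<TableSpecification> tables = List.of(
                    new TableSpecification("users", "s3://build.cauldron.tools/users.csv"));

            // hypothetical overload: the external tables plus an optional Scan class;
            // LakeS3SelectWhereScan is the default, so passing it here is redundant
            try (Connection conn = LakeDriver.getConnection(tables, LakeS3SelectWhereScan.class);
                 Statement stmt = conn.createStatement();
                 // projection (name) and the WHERE filter are pushed to S3 Select
                 ResultSet rs = stmt.executeQuery("SELECT name FROM users WHERE age > 30")) {
                while (rs.next()) {
                    System.out.println(rs.getString("name"));
                }
            }
        }
    }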
todo
improve WHERE push-down
performance profiling, optimization
smarter, more comprehensive testing
mixed scan mode: some table scans are better served by GetObject, others by SelectObjectContent
integrate and test the Parquet compression support and save cash
get rid of AmazonS3URI.java dependency
figure out a way to get S3 Select working on AWS SDK v2