
DuckDB is an in-process SQL OLAP Database Management System

Home Page: http://www.duckdb.org

License: MIT License

Topics: sql, database, olap, analytics, embedded-database

duckdb's Introduction

DuckDB logo


DuckDB

DuckDB is a high-performance analytical database system. It is designed to be fast, reliable, portable, and easy to use. DuckDB provides a rich SQL dialect, with support far beyond basic SQL. DuckDB supports arbitrary and nested correlated subqueries, window functions, collations, complex types (arrays, structs), and more. For more information on using DuckDB, please refer to the DuckDB documentation.

Installation

If you want to install and use DuckDB, please see our website for installation and usage instructions.

Data Import

For CSV files and Parquet files, data import is as simple as referencing the file in the FROM clause:

SELECT * FROM 'myfile.csv';
SELECT * FROM 'myfile.parquet';

Refer to our Data Import section for more information.

SQL Reference

The website contains a reference of functions and SQL constructs available in DuckDB.

Development

For development, DuckDB requires CMake, Python 3, and a C++11-compliant compiler. Run make in the root directory to compile the sources; use make debug to build a non-optimized debug version. After making changes, run make unit and make allunit to verify that your build works correctly. To test performance, build with the benchmarks enabled:

BUILD_BENCHMARK=1 BUILD_TPCH=1 make

and then run the standard benchmarks from the root directory:

./build/release/benchmark/benchmark_runner

The details of the benchmarks are in our Benchmark Guide.

Please also refer to our Build Guide and Contribution Guide.

Support

See the Support Options page.


duckdb's Issues

Rewrite NULL Handling Logic

Special values in the domain are not nice: they have to be handled all over the place, in every single loop, which makes the code ugly, and they are not efficient. Tim and I talked about this and propose the following change:

Every vector has an optional pointer to a bitmask of 1024 bits (128 bytes, or 16 8-byte integers, of overhead), which is relatively negligible for most data types.

This can be implemented with C++ bitsets (http://www.cplusplus.com/reference/bitset/bitset/), which handle most of the nasty code for us and should be quite efficient thanks to template magic. If they turn out not to be efficient, we can always roll our own implementation.

In regular loops (e.g. addition and so on), we completely ignore NULL values and just loop over all the data. This is nice because it (1) allows SIMD, (2) does away with any branching, and (3) simplifies our loop code, which results in smaller code size.

The actual NULLs in the data can be computed separately, depending on the operator type. For regular math operators (+, -, *, /), we OR the two bitmasks together if both sides have one, simply take the other side's bitmask if only one side has one, and produce no bitmask at all if neither side has one. This should be pretty much free if both sides have no NULLs, and cheap even when they do, because the bitmask OR is just an OR of 16 8-byte integers.
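A minimal sketch of this scheme, assuming a fixed vector size of 1024 and 64-bit integer data; Vector, NullMask, and Add are illustrative names, not DuckDB's actual classes:

    #include <bitset>
    #include <cstddef>
    #include <cstdint>
    #include <memory>

    constexpr size_t VECTOR_SIZE = 1024;
    using NullMask = std::bitset<VECTOR_SIZE>; // bit set = value is NULL

    // Hypothetical vector: data plus an optional NULL bitmask.
    struct Vector {
        int64_t data[VECTOR_SIZE];
        std::unique_ptr<NullMask> nulls; // nullptr when no NULLs are present
    };

    // Addition ignores NULLs entirely in the hot loop: branch-free, SIMD-friendly.
    void Add(const Vector &a, const Vector &b, Vector &result) {
        for (size_t i = 0; i < VECTOR_SIZE; i++) {
            result.data[i] = a.data[i] + b.data[i];
        }
        // The result mask is computed separately: NULL if either input is NULL.
        if (a.nulls && b.nulls) {
            result.nulls.reset(new NullMask(*a.nulls | *b.nulls));
        } else if (a.nulls) {
            result.nulls.reset(new NullMask(*a.nulls));
        } else if (b.nulls) {
            result.nulls.reset(new NullMask(*b.nulls));
        } // neither side has a mask: the result has no mask either
    }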

The difficulty with this approach will be aggregations, since they still have to check the NULL values. Also, because we are performing operations on the "hidden" NULL values (even if the results of those operations are not used), the hidden values should be somewhat sane. I propose 0 as the hidden value, because:

  • We already have to check for 0 in divisions, so it creates no extra problems
  • With the hidden value set to 0, no NULL check is necessary in the SUM computation

For other aggregations (MIN and MAX), we still need to check the bitmask to see whether a value is a genuine NULL or a fake NULL. For this, we could have separate code paths depending on whether or not there is a NULL mask. However, this only affects a small number of functions, not every single function in the pipeline.
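A sketch contrasting the two cases, under the same assumptions as above (hidden NULL slots hold 0; names are illustrative):

    #include <bitset>
    #include <cstddef>
    #include <cstdint>
    #include <limits>

    using NullMask = std::bitset<1024>; // bit set = value is NULL

    // SUM needs no mask check when hidden NULL slots hold 0.
    int64_t Sum(const int64_t *data, size_t count) {
        int64_t total = 0;
        for (size_t i = 0; i < count; i++) {
            total += data[i]; // NULL slots contribute 0, so no branch is needed
        }
        return total;
    }

    // MIN must consult the mask to skip the fake values behind genuine NULLs.
    int64_t Min(const int64_t *data, const NullMask *nulls, size_t count) {
        int64_t result = std::numeric_limits<int64_t>::max();
        for (size_t i = 0; i < count; i++) {
            if (nulls && (*nulls)[i]) {
                continue; // genuine NULL: ignore the hidden 0
            }
            if (data[i] < result) {
                result = data[i];
            }
        }
        return result;
    }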

(Meta) Issue Handling

Pedro made a good suggestion yesterday regarding the issues. We should have clearer issues that are broken down into small components, rather than big meta issues. That makes it easier for newcomers to pick a small issue and work on it to get started on the database. This is not so important right now, but it will become more important after we open-source the project and more people start working on it.

I propose breaking existing issues into smaller components and labeling them with the components/parts of the system that they relate to.

For example:
Core Components
(Execution)
(Optimizer)
(Parser)
(Planner)
(Storage)
Larger Issues
(Core Design)
(SQL)
Meta Issues
(Documentation)
(Code Quality)
(Meta)
Tools/Extensions
(Shell)

Optimize ORDER BY clause (physical_orderby)

The current ORDER BY implementation has very poor performance because it compares generic Value objects, which means type checking is done for every single comparison. Performance can be greatly improved by switching to templated comparisons.
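A sketch of the templated approach, where the element type is resolved once per column instead of once per comparison; the names are illustrative, not the actual physical_orderby code:

    #include <algorithm>
    #include <cstddef>
    #include <cstdint>

    // The comparison is inlined for the concrete type T inside the sort loop,
    // with no per-element type dispatch.
    template <class T>
    void TemplatedSort(T *data, size_t count) {
        std::sort(data, data + count); // operator< specialized for T
    }

    // The operator switches on the column type once, then runs the
    // specialized loop.
    void SortColumn(void *data, size_t count, bool is_integer) {
        if (is_integer) {
            TemplatedSort(static_cast<int64_t *>(data), count);
        } else {
            TemplatedSort(static_cast<double *>(data), count);
        }
    }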

(Code Quality) Code Refactor/Cleanup

  • Split up internal_types.cpp into separate files

  • Use map<> to handle the ENUM <-> STRING mappings instead of IF/ELSE statements (see the sketch after this list)

  • Split up Transformer into separate files

  • Wrap Transformer functions in Transformer class rather than being top-level functions

  • Split up ExpressionExecutor into separate files (one file per expression type)

  • Organize logical_operator, physical_operator, and expression into separate directories (e.g. joins, scans, etc.)

  • expression_rules/logical_rules -> rules/expression/, rules/logical/

  • Consistent plural/singular -> directories are singular (“function” not “functions”, etc)

  • Prefix all test files with "test_"

  • Exclude test files from code sanity analysis (?)
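For the ENUM <-> STRING item above, a table-driven sketch; TypeId and the mapping table are illustrative, not DuckDB's actual type enum:

    #include <map>
    #include <stdexcept>
    #include <string>

    enum class TypeId { INTEGER, VARCHAR, BOOLEAN }; // illustrative subset

    // One lookup table instead of an IF/ELSE chain per mapping.
    const std::map<TypeId, std::string> TYPE_TO_STRING = {
        {TypeId::INTEGER, "INTEGER"},
        {TypeId::VARCHAR, "VARCHAR"},
        {TypeId::BOOLEAN, "BOOLEAN"},
    };

    std::string TypeToString(TypeId id) {
        auto entry = TYPE_TO_STRING.find(id);
        if (entry == TYPE_TO_STRING.end()) {
            throw std::runtime_error("unrecognized type");
        }
        return entry->second;
    }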

Simple Storage/Transaction Model

For the first release, we should do a simple storage and transaction model. Core ideas:

  1. On BEGIN TRANSACTION, acquire a lock on the whole database.
  2. COMMIT releases that lock and flushes the data to disk, using a log file.
  3. Storage can be a simple hierarchical set of directories, with uncompressed binary data sitting in files split into "blocks of columns"; one block holds up to 20-100 vectors of a single column (we should measure at what point the block size stops mattering).
  4. There is one main "index" file that keeps track of which data is where for which columns, as well as metadata and statistics. On COMMIT, we flush to disk by atomically overwriting the index file. We should probably also do something with log files and such. A rough sketch of this model follows.
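A minimal sketch of the locking and atomic-overwrite parts, assuming a mutex for the whole-database lock and a write-temp-then-rename scheme; WriteIndexFile and the file names are hypothetical:

    #include <cstdio>
    #include <mutex>

    std::mutex database_lock;

    void WriteIndexFile(const char *path) {
        FILE *file = std::fopen(path, "wb");
        // ... serialize block locations, metadata, and statistics (omitted) ...
        std::fclose(file);
    }

    void BeginTransaction() {
        database_lock.lock(); // (1) lock the whole database
    }

    void Commit() {
        // (4) write the new index to a temporary file, then atomically rename
        // it over the old one, so a crash leaves either the old or the new
        // consistent state on disk.
        WriteIndexFile("index.tmp");
        std::rename("index.tmp", "index");
        database_lock.unlock(); // (2) release the lock
    }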

More In-Depth Component Documentation

I suggest adding more in-depth documentation for the different components in the form of a README.md file inside each subdirectory. For example, there could be a README.md inside the src/transaction directory explaining the MVCC implementation, one in the execution directory explaining the execution model, and so on, just to make it easier to understand the different components and how they relate to each other.

UTF8

Make sure all strings going into DuckDB through any route (INSERT, constants, CSV load, etc.) are valid UTF8.
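A sketch of a validity check that could run at every string ingress point; it is simplified (it does not reject overlong encodings or surrogates) and illustrative only, not DuckDB's actual code:

    #include <cstddef>
    #include <cstdint>

    bool IsValidUTF8(const uint8_t *s, size_t len) {
        size_t i = 0;
        while (i < len) {
            size_t extra; // number of continuation bytes expected
            uint8_t c = s[i];
            if (c < 0x80) {
                extra = 0; // ASCII
            } else if ((c & 0xE0) == 0xC0) {
                extra = 1;
            } else if ((c & 0xF0) == 0xE0) {
                extra = 2;
            } else if ((c & 0xF8) == 0xF0) {
                extra = 3;
            } else {
                return false; // invalid leading byte (e.g. lone continuation)
            }
            if (i + extra >= len) {
                return false; // truncated sequence at end of string
            }
            for (size_t j = 1; j <= extra; j++) {
                if ((s[i + j] & 0xC0) != 0x80) {
                    return false; // expected a continuation byte (10xxxxxx)
                }
            }
            i += extra + 1;
        }
        return true;
    }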

Query Optimizer

  • Join Ordering (Statistics; Sampling)
  • Query Flattening
  • Common Subexpression Elimination (CSE)
  • Simple Rewrite Rules
  • Access Path Selection

Prepared statements

  • New expression type: Placeholder
  • Create the logical plan and optimize it
  • Store the plan in the client context
  • Return a fake result set with the result-set type and the parameter types
  • Open question: when should we invalidate the stored plan? (Probably on schema changes.)
  • On execution: copy the plan, replace the placeholders, and execute (sketched below)
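A rough, self-contained sketch of the prepare/execute flow; the expression classes here are illustrative stand-ins for logical plan nodes, not DuckDB's actual classes:

    #include <cstddef>
    #include <cstdio>
    #include <memory>
    #include <vector>

    struct Expr {
        virtual ~Expr() = default;
        virtual std::unique_ptr<Expr> Copy() const = 0;
        virtual int Eval(const std::vector<int> &params) const = 0;
    };

    // The new Placeholder expression: it stores a parameter index and is only
    // resolved against concrete values at execution time.
    struct Placeholder : Expr {
        size_t index;
        explicit Placeholder(size_t index_p) : index(index_p) {}
        std::unique_ptr<Expr> Copy() const override {
            return std::unique_ptr<Expr>(new Placeholder(index));
        }
        int Eval(const std::vector<int> &params) const override {
            return params.at(index); // substituted parameter value
        }
    };

    int main() {
        // PREPARE: build (and optimize) the plan once; store it in the context.
        std::unique_ptr<Expr> stored_plan(new Placeholder(0));
        // EXECUTE: copy the stored plan and evaluate it with bound parameters.
        std::unique_ptr<Expr> plan_copy = stored_plan->Copy();
        std::printf("%d\n", plan_copy->Eval({42})); // prints 42
        return 0;
    }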

Checksumming

The buffer pool should verify a checksum for every chunk it reads from disk.
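A hypothetical sketch of such a check; DuckDB's actual block layout and checksum function may differ, and FNV-1a is used here only as a simple stand-in:

    #include <cstddef>
    #include <cstdint>
    #include <stdexcept>

    uint64_t Checksum(const uint8_t *data, size_t len) {
        uint64_t hash = 1469598103934665603ULL; // FNV-1a offset basis
        for (size_t i = 0; i < len; i++) {
            hash = (hash ^ data[i]) * 1099511628211ULL; // FNV prime
        }
        return hash;
    }

    // Called by the buffer pool after reading a chunk from disk, before use.
    void VerifyChunk(const uint8_t *chunk, size_t len, uint64_t stored_checksum) {
        if (Checksum(chunk, len) != stored_checksum) {
            throw std::runtime_error("checksum mismatch: chunk is corrupt on disk");
        }
    }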

Storage

  • Schema
  • Data (File, Chunks)

License

Which license should DuckDB have?

  • MIT
  • Public Domain
  • ?

Indexes

Use B+-trees.

Create them automatically for primary keys.
