
DuckDB is an in-process SQL OLAP Database Management System

Home Page: http://www.duckdb.org

License: MIT License

Topics: sql, database, olap, analytics, embedded-database

duckdb's Introduction

DuckDB logo


DuckDB

DuckDB is a high-performance analytical database system. It is designed to be fast, reliable, portable, and easy to use. DuckDB provides a rich SQL dialect, with support far beyond basic SQL. DuckDB supports arbitrary and nested correlated subqueries, window functions, collations, complex types (arrays, structs), and more. For more information on using DuckDB, please refer to the DuckDB documentation.

Installation

If you want to install and use DuckDB, please see our website for installation and usage instructions.

Data Import

For CSV files and Parquet files, data import is as simple as referencing the file in the FROM clause:

SELECT * FROM 'myfile.csv';
SELECT * FROM 'myfile.parquet';

Refer to our Data Import section for more information.

SQL Reference

The website contains a reference of functions and SQL constructs available in DuckDB.

Development

For development, DuckDB requires CMake, Python 3, and a C++11-compliant compiler. Run make in the root directory to compile the sources; use make debug to build a non-optimized debug version. After making changes, run make unit and make allunit to verify that your build works correctly. To test performance, build with the benchmarks enabled:

BUILD_BENCHMARK=1 BUILD_TPCH=1 make

and then run the standard benchmarks from the root directory:

./build/release/benchmark/benchmark_runner

The details of the benchmarks are in our Benchmark Guide.

Please also refer to our Build Guide and Contribution Guide.

Support

See the Support Options page.


duckdb's Issues

Rewrite NULL Handling Logic

Special values in the domain are not nice: they have to be handled all over the place, in every single loop, which makes the code ugly, and they are not efficient. Tim and I talked about this and propose the following change:

Every vector has an optional pointer to a bitmask of 1024 bits (128 bytes, or 16 8-byte integers, of overhead), which is relatively negligible for most data types.

This can be implemented with C++ bitsets (http://www.cplusplus.com/reference/bitset/bitset/), which handle most of the nasty code for us and should be quite efficient thanks to template magic. If they turn out not to be efficient, we can always roll our own implementation.

In regular loops (e.g. addition and so on), we completely ignore NULL values and just loop over all the data. This is nice because it (1) allows SIMD, (2) does away with any branching, and (3) simplifies our loop code, which results in smaller code size.

The actual NULLs in the data can be computed separately, depending on the operator type. For regular math operators (+, -, *, /), we OR the two bitmasks together if both sides have one, simply take the other side's bitmask if only one side has one, and produce no bitmask at all if neither side has one. This should be pretty much free if both sides have no NULLs, and cheap even when they do, because the bitmask OR is just an OR of 16 8-byte integers.
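A minimal sketch of this scheme, assuming a fixed vector size of 1024 and 64-bit integer data; Vector, NullMask, and Add are illustrative names, not DuckDB's actual classes:

    #include <bitset>
    #include <cstddef>
    #include <cstdint>
    #include <memory>

    constexpr size_t VECTOR_SIZE = 1024;
    using NullMask = std::bitset<VECTOR_SIZE>; // bit set = value is NULL

    // Hypothetical vector: data plus an optional NULL bitmask.
    struct Vector {
        int64_t data[VECTOR_SIZE];
        std::unique_ptr<NullMask> nulls; // nullptr when no NULLs are present
    };

    // Addition ignores NULLs entirely in the hot loop: branch-free, SIMD-friendly.
    void Add(const Vector &a, const Vector &b, Vector &result) {
        for (size_t i = 0; i < VECTOR_SIZE; i++) {
            result.data[i] = a.data[i] + b.data[i];
        }
        // The result mask is computed separately: NULL if either input is NULL.
        if (a.nulls && b.nulls) {
            result.nulls.reset(new NullMask(*a.nulls | *b.nulls));
        } else if (a.nulls) {
            result.nulls.reset(new NullMask(*a.nulls));
        } else if (b.nulls) {
            result.nulls.reset(new NullMask(*b.nulls));
        } // neither side has a mask: the result has no mask either
    }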

The difficulty with this approach will be aggregations, since they still have to check the NULL values. Also, because we are performing operations on the "hidden" NULL values (even if the results of those operations are not used), the hidden values should be somewhat sane. I propose 0 as the hidden value, because:

  • We already have to check for 0 in divisions, so it creates no extra problems
  • With the hidden value set to 0, no NULL check is necessary in the SUM computation

For other aggregations (MIN and MAX), we still need to check the bitmask to see whether a value is a genuine NULL or a fake NULL. For this, we could have separate code paths depending on whether or not there is a NULL mask. However, this only affects a small number of functions, not every single function in the pipeline.
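A sketch contrasting the two cases, under the same assumptions as above (hidden NULL slots hold 0; names are illustrative):

    #include <bitset>
    #include <cstddef>
    #include <cstdint>
    #include <limits>

    using NullMask = std::bitset<1024>; // bit set = value is NULL

    // SUM needs no mask check when hidden NULL slots hold 0.
    int64_t Sum(const int64_t *data, size_t count) {
        int64_t total = 0;
        for (size_t i = 0; i < count; i++) {
            total += data[i]; // NULL slots contribute 0, so no branch is needed
        }
        return total;
    }

    // MIN must consult the mask to skip the fake values behind genuine NULLs.
    int64_t Min(const int64_t *data, const NullMask *nulls, size_t count) {
        int64_t result = std::numeric_limits<int64_t>::max();
        for (size_t i = 0; i < count; i++) {
            if (nulls && (*nulls)[i]) {
                continue; // genuine NULL: ignore the hidden 0
            }
            if (data[i] < result) {
                result = data[i];
            }
        }
        return result;
    }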

(Meta) Issue Handling

Pedro made a good suggestion yesterday regarding the issues. We should have clearer issues that are broken down into small components, rather than big meta issues. That makes it easier for newcomers to pick a small issue and work on it to get started on the database. This is not so important right now, but it will become more important after we open-source the project and more people start working on it.

I propose breaking existing issues into smaller components and labeling them with the components/parts of the system that they relate to.

For example:
Core Components
(Execution)
(Optimizer)
(Parser)
(Planner)
(Storage)
Larger Issues
(Core Design)
(SQL)
Meta Issues
(Documentation)
(Code Quality)
(Meta)
Tools/Extensions
(Shell)

Optimize ORDER BY clause (physical_orderby)

The current ORDER BY implementation has very poor performance because it compares generic Value objects, which means type checking is done for every single comparison. Performance can be greatly improved by switching to templated comparisons.
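A sketch of the templated approach, where the element type is resolved once per column instead of once per comparison; the names are illustrative, not the actual physical_orderby code:

    #include <algorithm>
    #include <cstddef>
    #include <cstdint>

    // The comparison is inlined for the concrete type T inside the sort loop,
    // with no per-element type dispatch.
    template <class T>
    void TemplatedSort(T *data, size_t count) {
        std::sort(data, data + count); // operator< specialized for T
    }

    // The operator switches on the column type once, then runs the
    // specialized loop.
    void SortColumn(void *data, size_t count, bool is_integer) {
        if (is_integer) {
            TemplatedSort(static_cast<int64_t *>(data), count);
        } else {
            TemplatedSort(static_cast<double *>(data), count);
        }
    }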

(Code Quality) Code Refactor/Cleanup

  • Split up internal_types.cpp into separate files

  • Use map<> to handle the ENUM <-> STRING mappings instead of IF/ELSE statements (see the sketch after this list)

  • Split up Transformer into separate files

  • Wrap Transformer functions in Transformer class rather than being top-level functions

  • Split up ExpressionExecutor into separate files (one file per expression type)

  • Organize logical_operator, physical_operator, and expression into separate directories (e.g. joins, scans, etc.)

  • expression_rules/logical_rules -> rules/expression/, rules/logical/

  • Consistent plural/singular -> directories are singular (“function” not “functions”, etc)

  • Prefix all test files with "test_"

  • Exclude test files from code sanity analysis (?)
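For the ENUM <-> STRING item above, a table-driven sketch; TypeId and the mapping table are illustrative, not DuckDB's actual type enum:

    #include <map>
    #include <stdexcept>
    #include <string>

    enum class TypeId { INTEGER, VARCHAR, BOOLEAN }; // illustrative subset

    // One lookup table instead of an IF/ELSE chain per mapping.
    const std::map<TypeId, std::string> TYPE_TO_STRING = {
        {TypeId::INTEGER, "INTEGER"},
        {TypeId::VARCHAR, "VARCHAR"},
        {TypeId::BOOLEAN, "BOOLEAN"},
    };

    std::string TypeToString(TypeId id) {
        auto entry = TYPE_TO_STRING.find(id);
        if (entry == TYPE_TO_STRING.end()) {
            throw std::runtime_error("unrecognized type");
        }
        return entry->second;
    }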

Simple Storage/Transaction Model

For the first release, we should do a simple storage and transaction model. Core ideas:

  1. On BEGIN TRANSACTION, acquire a lock on the whole database.
  2. COMMIT releases that lock and flushes the data to disk, using a log file.
  3. Storage can be a simple hierarchical set of directories, with uncompressed binary data sitting in files split into "blocks of columns"; one block holds up to 20-100 vectors of a single column (we should measure at what point the block size stops mattering).
  4. There is one main "index" file that keeps track of which data is where for which columns, as well as metadata and statistics. On COMMIT, we flush to disk by atomically overwriting the index file. We should probably also do something with log files and such. A rough sketch of this model follows.
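A minimal sketch of the locking and atomic-overwrite parts, assuming a mutex for the whole-database lock and a write-temp-then-rename scheme; WriteIndexFile and the file names are hypothetical:

    #include <cstdio>
    #include <mutex>

    std::mutex database_lock;

    void WriteIndexFile(const char *path) {
        FILE *file = std::fopen(path, "wb");
        // ... serialize block locations, metadata, and statistics (omitted) ...
        std::fclose(file);
    }

    void BeginTransaction() {
        database_lock.lock(); // (1) lock the whole database
    }

    void Commit() {
        // (4) write the new index to a temporary file, then atomically rename
        // it over the old one, so a crash leaves either the old or the new
        // consistent state on disk.
        WriteIndexFile("index.tmp");
        std::rename("index.tmp", "index");
        database_lock.unlock(); // (2) release the lock
    }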

More In-Depth Component Documentation

I suggest adding more in-depth documentation for the different components in the form of a README.md file inside each subdirectory. For example, there could be a README.md inside the src/transaction directory explaining the MVCC implementation, one in the execution directory explaining the execution model, and so on, just to make it easier to understand the different components and how they relate to each other.

UTF8

Make sure all strings going into DuckDB through any route (INSERT, constants, CSV load, etc.) are valid UTF8.
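A sketch of a validity check that could run at every string ingress point; it is simplified (it does not reject overlong encodings or surrogates) and illustrative only, not DuckDB's actual code:

    #include <cstddef>
    #include <cstdint>

    bool IsValidUTF8(const uint8_t *s, size_t len) {
        size_t i = 0;
        while (i < len) {
            size_t extra; // number of continuation bytes expected
            uint8_t c = s[i];
            if (c < 0x80) {
                extra = 0; // ASCII
            } else if ((c & 0xE0) == 0xC0) {
                extra = 1;
            } else if ((c & 0xF0) == 0xE0) {
                extra = 2;
            } else if ((c & 0xF8) == 0xF0) {
                extra = 3;
            } else {
                return false; // invalid leading byte (e.g. lone continuation)
            }
            if (i + extra >= len) {
                return false; // truncated sequence at end of string
            }
            for (size_t j = 1; j <= extra; j++) {
                if ((s[i + j] & 0xC0) != 0x80) {
                    return false; // expected a continuation byte (10xxxxxx)
                }
            }
            i += extra + 1;
        }
        return true;
    }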

Query Optimizer

  • Join Ordering (Statistics; Sampling)
  • Query Flattening
  • Common Subexpression Elimination (CSE)
  • Simple Rewrite Rules
  • Access Path Selection

Prepared statements

  • New expression type: Placeholder
  • Create the logical plan and optimize it
  • Store the plan in the client context
  • Return a fake result set with the result-set type and the parameter types
  • Open question: when should we invalidate the stored plan? (Probably on schema changes.)
  • On execution: copy the plan, replace the placeholders, and execute (sketched below)
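A rough, self-contained sketch of the prepare/execute flow; the expression classes here are illustrative stand-ins for logical plan nodes, not DuckDB's actual classes:

    #include <cstddef>
    #include <cstdio>
    #include <memory>
    #include <vector>

    struct Expr {
        virtual ~Expr() = default;
        virtual std::unique_ptr<Expr> Copy() const = 0;
        virtual int Eval(const std::vector<int> &params) const = 0;
    };

    // The new Placeholder expression: it stores a parameter index and is only
    // resolved against concrete values at execution time.
    struct Placeholder : Expr {
        size_t index;
        explicit Placeholder(size_t index_p) : index(index_p) {}
        std::unique_ptr<Expr> Copy() const override {
            return std::unique_ptr<Expr>(new Placeholder(index));
        }
        int Eval(const std::vector<int> &params) const override {
            return params.at(index); // substituted parameter value
        }
    };

    int main() {
        // PREPARE: build (and optimize) the plan once; store it in the context.
        std::unique_ptr<Expr> stored_plan(new Placeholder(0));
        // EXECUTE: copy the stored plan and evaluate it with bound parameters.
        std::unique_ptr<Expr> plan_copy = stored_plan->Copy();
        std::printf("%d\n", plan_copy->Eval({42})); // prints 42
        return 0;
    }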

Checksumming

The buffer pool should verify a checksum for every chunk it reads from disk.
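A hypothetical sketch of such a check; DuckDB's actual block layout and checksum function may differ, and FNV-1a is used here only as a simple stand-in:

    #include <cstddef>
    #include <cstdint>
    #include <stdexcept>

    uint64_t Checksum(const uint8_t *data, size_t len) {
        uint64_t hash = 1469598103934665603ULL; // FNV-1a offset basis
        for (size_t i = 0; i < len; i++) {
            hash = (hash ^ data[i]) * 1099511628211ULL; // FNV prime
        }
        return hash;
    }

    // Called by the buffer pool after reading a chunk from disk, before use.
    void VerifyChunk(const uint8_t *chunk, size_t len, uint64_t stored_checksum) {
        if (Checksum(chunk, len) != stored_checksum) {
            throw std::runtime_error("checksum mismatch: chunk is corrupt on disk");
        }
    }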

Storage

  • Schema
  • Data (File, Chunks)

License

Which license should DuckDB have?

  • MIT
  • Public Domain
  • ?

Indexes

Use B+-trees.

Create them automatically for primary keys.
