duckdb / duckdb-web

DuckDB website and documentation

License: MIT License

HTML 7.84% CSS 0.15% JavaScript 65.86% Python 6.27% Shell 1.02% SCSS 12.99% Ruby 0.73% Dockerfile 0.05% TeX 4.92% Makefile 0.01% Java 0.08% R 0.07%

duckdb-web's Introduction

DuckDB

DuckDB is a high-performance analytical database system. It is designed to be fast, reliable, portable, and easy to use. DuckDB provides a rich SQL dialect, with support far beyond basic SQL. DuckDB supports arbitrary and nested correlated subqueries, window functions, collations, complex types (arrays, structs, maps), and several extensions designed to make SQL easier to use.

DuckDB is available as a standalone CLI application and has clients for Python, R, Java, Wasm, etc., with deep integrations with packages such as pandas and dplyr.

For more information on using DuckDB, please refer to the DuckDB documentation.

Installation

If you want to install DuckDB, please see our installation page for instructions.

Data Import

For CSV files and Parquet files, data import is as simple as referencing the file in the FROM clause:

SELECT * FROM 'myfile.csv';
SELECT * FROM 'myfile.parquet';

Refer to our Data Import section for more information.

SQL Reference

The documentation contains a SQL introduction and reference.

Development

For development, DuckDB requires CMake, Python3 and a C++11 compliant compiler. Run make in the root directory to compile the sources. For development, use make debug to build a non-optimized debug version. You should run make unit and make allunit to verify that your version works properly after making changes. To test performance, you can run BUILD_BENCHMARK=1 BUILD_TPCH=1 make and then perform several standard benchmarks from the root directory by executing ./build/release/benchmark/benchmark_runner. The details of benchmarks are in our Benchmark Guide.

Please also refer to our Build Guide and Contribution Guide.

Support

See the Support Options page.


duckdb-web's Issues

DuckDB Live Demo sometimes fails on "SELECT distinct l_shipinstruct FROM lineitem"

I tried running "SELECT distinct l_shipinstruct FROM lineitem" and "SELECT distinct l_shipinstruct FROM lineitem order by 1 desc" on the https://duckdb.org/demo/ website and I either get an incorrectly rendered result or HTTP 503 errors.

But the query does work occasionally.

Request URL: https://duckdbdemo.project.cwi.nl//fetch?callback=jQuery3510355319707602344_1629874848617&ref=53knwZQ5mH&_=1629874848637
Request Method: GET
Status Code: 502 Proxy Error

Request URL: https://duckdbdemo.project.cwi.nl//query?callback=jQuery351010110534406317129_1629874584547&q=SELECT%20distinct%20l_shipinstruct%20FROM%20lineitem%20order%20by%201%20desc&_=1629874584573
Request Method: GET
Status Code: 503 Service Temporarily Unavailable

Docs show 0.3.3 as latest release

Hey Folks!

It looks like the docs are still showing 0.3.3 as the latest release. How can I upgrade that to 0.3.4?

I searched through this repo for a way to increment the docs to the next version, but I couldn't find any past PRs that showed it.

Thanks!
-Alex

Documentation on transferring data from other systems

We should add some documentation on how to import data from other database systems.

e.g. sqlite -> duckdb, postgres -> duckdb, etc

Document MAKE_TEMPORAL functions

We support MAKE_DATE, MAKE_TIME and MAKE_TIMESTAMP in the main code.

"PREDECING" typo

Hi,

Thanks for an awesome package. Just a small typo that confused me more than it should as I was copy+pasting from examples:

https://duckdb.org/docs/sql/window_functions

SELECT points,
    SUM(points) OVER (
        ROWS BETWEEN 1 PREDECING
                 AND 1 FOLLOWING) we
FROM results

"PREDECING" should be "PRECEDING"

cli alternative: sqlline

Just a tip: there is no real need to build a CLI; you can reuse sqlline if needed.
I understand the integration will not be as tight as possible, but it works, and it lets you use the same CLI for all kinds of databases:

A combination of the DuckDB JDBC driver and sqlline
see: https://duckdb.org/docs/data/parquet, and sqlline https://github.com/julianhyde/sqlline

sqlline -u "jdbc:duckdb:" -d "org.duckdb.DuckDBDriver" -n '' -p '' -e "select * from 'userdata1.parquet';"

add a search box on the website

A search box would make it easy to find the documentation for a feature or function.

Update minvalue/maxvalue for CREATE SEQUENCE

Update minvalue/maxvalue once duckdb/duckdb#1056 has been resolved.

See discussion in #13 (comment)

Document DESCRIBE clause

Feedback to #117

The DESCRIBE clause is mentioned in the documentation but the command itself is not documented.

Example:

D create table t(x int primary key, y varchar);
D describe t;
┌───────┬─────────┬──────┬─────┬─────────┬───────┐
│ Field │  Type   │ Null │ Key │ Default │ Extra │
├───────┼─────────┼──────┼─────┼─────────┼───────┤
│ x     │ INTEGER │ NO   │     │         │       │
│ y     │ VARCHAR │ YES  │     │         │       │
└───────┴─────────┴──────┴─────┴─────────┴───────┘

(I believe the output above is not correct: x should be a Key)

Benchmarks pages blank

The benchmark pages are showing as blank for me. I am getting a 503 (Service Temporarily Unavailable) from the duckdbdemo.project.cwi.nl calls.

Best way to use Python Threads

Hey folks!

Before I write up a how-to-guide, would you mind taking a look at this approach to using Python threads? Is this the best practice? I couldn't get it to work with cursors, so if that is a better method than not checking the same thread, I'm open to changing this!

Thanks!

import duckdb
from threading import Thread, current_thread
import pandas as pd

def insert_from_thread(duckdb_con, results_df_dict):
  # Insert a row with the name of the thread
  thread_name = str(current_thread().name)
  results_df_dict[thread_name] = duckdb_con.execute("""INSERT INTO my_inserts VALUES (?)""", (thread_name,)).df()

duckdb_con = duckdb.connect(check_same_thread=False) # In Memory DuckDB
duckdb_con.execute("""CREATE OR REPLACE TABLE my_inserts (thread_name varchar)""")

thread_count = 10
threads = []
results_df_dict = {}

# Kick off multiple threads (in the same process) 
# Pass in the same connection as an argument, and an object to store the results
for i in range(thread_count):
    threads.append(Thread(target=insert_from_thread,
                            args=(duckdb_con, results_df_dict,),
                            name='my_thread_'+str(i)))

for i in range(thread_count):
    threads[i].start()

for i in range(thread_count):
    threads[i].join()

print(results_df_dict)
print(duckdb_con.execute("""SELECT * FROM my_inserts""").df())

Download link for CLI and C++ not sensitive to system detected

System detected seems to be hardcoded in the links:

https://github.com/cwida/duckdb-web/blob/82d4855e69dcbaafbccf2699f7d1b0b8072d852e/index.html#L143-L144

On Windows 10 Firefox 86.0a1

Similarly on Edge 87.0.664.66

Document QUALIFY clause

The QUALIFY clause is missing in the documentation

Add UnNest to Nested Types Page

I just wanted to document this here before I forgot! I'm happy to make this change.

I think that the UnNest function should be mentioned in the Nested types section in addition to the SELECT overview.

In general, I think there might be a few other Postgres functions that aren't in the docs just yet, but we can track those in separate issues.

CodeMirror SQL highlighting shows {}'s as errors

We are also missing some of the newer keywords like Qualify and Having. Is there a good way to pull a list of keywords with a DuckDB script where we could build a DuckDB-specific dialect for CodeMirror?

Thanks!

Missing padding above headings

At https://duckdb.org/docs/sql/functions/dateformat, the headings are missing padding.

Revise and extend WITH RECURSIVE examples

A while ago, I contributed the examples for WITH RECURSIVE (#158 #188)
https://duckdb.org/docs/sql/query_syntax/with

I have since found that these queries are quite inefficient, especially the bidirectional search if there is no path between the start and end nodes.

I would also like to add a proper bidirectional search algorithm, where the two BFS frontiers are advanced simultaneously from the start and the end node.

I don't have time to tackle this now, so I'm opening this issue as a reminder and will get back to it later this spring.

Font ligatures make comparison operators page confusing

<=, >= and != look like Unicode ≤, ≥ and ≠.
Consider turning off ligatures or using a font without them.
Now:

With font-variant-ligatures: none;:

Document CLI commands

Feedback to #117

The DuckDB CLI has quite a few features (e.g. .tables) which are not yet documented.

I started writing this earlier this year but realized it's a larger task and abandoned it. I got this far:

## Installation

DuckDB can be installed as a binary. Please see the [installation page](/docs/installation?environment=cli) for details.

Other than SQL commands, SQLite-like instructions can be used:

    .help

* `.tables`: Print a list of tables

## History

The command history is saved in the home directory in `.duckdb_history`.

Benchmark logs are empty

In the Benchmarks page, the stdout logs are empty.

The link in the Profiling page is broken; it points to:

https://duckdb.org/benchmarks/logs/e7eb7154848be520159d9e1ee744989b25d4c987-graph.html?name=Q20

Add imgbot

imgbot automatically sends PRs to optimize images and compress them significantly. Could be useful for blog posts.

See for example cmudig/cmudig.github.io#160

the map function document is out of order

https://duckdb.org/docs/sql/functions/nested

Map Functions
| Function | Description | Example | Result |
|:---|:---|:---|:---|
| map[entry] | Alias for element_at | map([100, 5], ['a', 'b'])[100] | 42 |
| element_at(map, key) | Return a list containing the value for a given key or an empty list if the key is not contained in the map. The type of the key provided in the second parameter must match the type of the map’s keys else an error is returned. | SELECT element_at(map([100, 5], [42, 43]),100); | 42 |
| cardinality(map) | Return the size of the map (or the number of entries in the map). | cardinality( map([4, 2], ['a', 'b']) ); | 2 |
| map() | Returns an empty map. | map() | {} |

https://duckdb.org/docs/installation/ R GitHub master (bleeding edge) link broken

The instructions for R at https://duckdb.org/docs/installation/ display:

install.packages("https://github.com/duckdb/duckdb/releases/download/master-builds/duckdb_r_src.tar.gz", repos = NULL)

which does not work because https://github.com/duckdb/duckdb/releases/download/master-builds/duckdb_r_src.tar.gz returns a 404 error.

Clarify use and differences of composite and nested types

I had some questions which I don't think are answered by the docs. The "Data Types Overview" briefly mentions ROW, MAP, and ARRAY, but not LIST and STRUCT, which are only mentioned in the "Nested Types" page.

What is the difference (if any) between ARRAY, INT[], and LIST?
Can the exact size of an array be used in a DDL statement (INT[3])? Are these limits respected?
How is the type of a LIST or ARRAY specified in a DDL statement?
Can LIST expressions ([1, 2, 3]) be used to insert into ARRAYs?
What is the difference (if any) between STRUCT, ROW, and MAP?
Can STRUCT expressions ({'foo': ...}) be used to insert into MAPs or ROWs?

Presumably they work like PG or standard SQL, but the docs could be expanded.

PLEASE RE-CLONE / RE-FORK THIS REPO

In the context of #123 we have rewritten the history of this repository to remove (many) redundant large files. This has reduced the size of this repo from 1.3GB to 75MB. However, if you have an existing clone or fork, the old history will still be present there. We recommend that you re-clone or re-fork this repository to prevent that from causing problems.

https://duckdb.org/docs/sql/functions/timestamp underlines multiple menu items

I suspect the check for underlining should be stricter.

Improve documentation on CTEs

See duckdb/duckdb#2551
I'm "self-assigning" this since I'm working with recursive CTEs quite a lot.

Add hierarchical CTE example
Add generic graph query example
Check whether nested CTEs work (if not, document it)

Interval functions documentation

Interval functions

Should be like

to_milliseconds(integer) | Construct a millisecond interval | to_milliseconds(5) | INTERVAL 5 MILLISECOND
to_microseconds(integer) | Construct a microsecond interval | to_microseconds(5) | INTERVAL 5 MICROSECOND

RSS feed?

The Jekyll docs mention a home-grown solution https://jekyllrb.com/tutorials/convert-site-to-jekyll/#10-rss-feed, but also link to: https://github.com/jekyll/jekyll-feed

add compile document

Please add a detailed compile guide for building the .so/.dll file and for a program that calls the functions in the .so file.

something like this https://sqlite.org/howtocompile.html

Repository size is very large

@Alex-Monahan Currently, the size of this repository is massive, approx. 1.2 GB. Even when cloning just the last commit, the resulting directory is 81 MB:

$ git clone --depth 1 git@github.com:duckdb/duckdb-web.git
$ du -hd0 duckdb-web
81M	duckdb-web

Here's a list of the top 50k largest files committed, generated with this script.

To nuke old commits, the BFG repository cleaner can be used. This will still require a force push and thus break the commit tree but there are not too many forks of this repository yet.

Document the `storage_info` pragma

Hey folks!

On Discord, @hannes mentioned the storage_info pragma as a way to get the storage footprint of a table:

PRAGMA storage_info('my_table');

The columns provide a ton of info:

row_group_id,
column_name,
column_id,
column_path, 
segment_id,
segment_type,
start,
count, 
compression,
stats,
has_updates,
persistent,
block_id,
block_offset

I'm looking to calculate the storage size of a table, so I assume that means summing count here.

This would be an awesome pragma to document if possible! Very useful for those of us telling the machine to do a lot of big calculations in DuckDB.

Filter Clause Documentation in SQL Sidebar

The documentation page for the FILTER clause is available in search engines, but is not accessible from the SQL sidebar of the DuckDB documentation.

Question: Can I use the theme skeleton for my website?

Hello DuckDB team,

I really like the theme and I would like to use it to build a website for my open source project. I tried to find license notes but I couldn't find any in this repository.

Do you think I can use this project as boilerplate for my website, where I will use only the layout but not the content (text and images)?

If yes, then I can perhaps add a link to the original theme.

CLI API Documentation

Hi,

When can we expect the documentation for the link below?

https://duckdb.org/docs/api/cli

Documentation Roadmap Discussion

I wanted to build a central list of documentation tasks in order to prioritize them. I've tied in existing issues where applicable. Feedback is welcome! Please let me know what I forgot or if the order of importance should be changed.

Tactical items:

Walkthroughs / tutorials (These could be blog posts maybe?)

Using various IDEs with DuckDB (DBeaver, Jupyter Lab SQL?, SQLite IDEs?)
Integrate with additional Python data tools (Dask, Modin, Vaex)
Integrate with Python orchestrators (Prefect, Dagster, Airflow)
This can be a small or large effort. We could just do a demo, or integrate directly (Prefect has the ability to build custom connectors and has Postgres and SQLite already.)
Integrate with visualization engines (Redash, Metabase, Apache Superset). These likely require building small connectors for each library
Getting started with DuckDB for folks coming from a purely SQL background (Ex: PostgreSQL or SQL Server)

Larger items:

Integrate DuckDB WASM documentation in some way (Basic installation steps / simple example? Or just link to the other repo?)
duckdb/duckdb-wasm#438
duckdb/duckdb-wasm#375
Document the node.js Client API (#138) (This may be easier for me than the others since I've used it a little more, so I put it first)
Document the CLI (#125)
Document the Python Relational API (Is this likely to change much or is it ok to document now?)
Add examples for trickier functions, especially aggregates, etc.
Build user-editable examples using DuckDB WASM
Ex: Interactive SQL IDE that can be pre-populated with examples from docs

How to do aggregation ignoring nulls in some window functions?

Is it possible at the moment to ignore nulls while computing some aggregations like count, first/last value etc. in window functions like Redshift does?

Thanks for the great work btw 🙏

add documents of some functions

Please document functions such as encode, decode and COALESCE, or add a miscellaneous category for them.

Tutorials and How-To Guides

I found this guide on writing good documentation. It essentially advocates for splitting documentation into four distinct groups:

Tutorials
How-To Guides
References
Explanations

Most of the documentation we have is in the form of Reference, which, while useful, is perhaps not sufficient particularly for beginners.

What do you guys think about splitting the documentation into three separate sections: Tutorials, How-To Guides and Reference? Most of the current material would go under the Reference section, and we could write up a few language-specific tutorials for getting started and how-to guides for accomplishing common tasks in each of these.

Thoughts?

create a node.js page in the client API section

The node module isn't mentioned in the documentation. I totally get that it's not as much a "data friendly" language as Python and R but there is a large class of data apps that could leverage node in the future. At the minimum a link to the GH readme would suffice.

document the `INTERVAL` data type

Hi folks! Thank you for taking documentation seriously. Reading through the docs has helped onboard me to duckdb very quickly.

One thing I noticed: INTERVAL functions are documented but INTERVAL isn't. It could be that the function definitions are enough to get started, but it took me a GitHub issue discussion and a look at the source to understand the data type.

Feature: Use `markdownlint` for consistent markdown formatting

See https://github.com/DavidAnson/markdownlint for more info

This could be enabled as a GitHub Action for PRs

Document list functions

List functions such as unnest and string_split_regex are currently undocumented. An example for their use:

D insert into emails values (1, 'a@b.com;c@d.com'), (2, 'e@f.com'), (3, '');
Error: Catalog Error: Table with name emails does not exist!
Did you mean "x"?
D create table emails(id int, addresses varchar);
D insert into emails values (1, 'a@b.com;c@d.com'), (2, 'e@f.com'), (3, '');
D select id, unnest(string_split_regex(addresses, ';')) from emails;
D select id, unnest(string_split_regex(addresses, ';')) as email from emails;
┌────┬─────────┐
│ id │  email  │
├────┼─────────┤
│ 1  │ a@b.com │
│ 1  │ c@d.com │
│ 2  │ e@f.com │
│ 3  │         │
└────┴─────────┘

Document that `SELECT` is not parallel by default but adding `ORDER BY` parallelises

I read in some GitHub issue that a simple SELECT query is not parallelized, but that adding ORDER BY works around this limitation.

I think it'd be great if this trick was documented to help discoverability. Unfortunately I can't find the GitHub issue to link to, but this trick definitely works.

Adding Webpage URL to `About:` section

This would simplify navigation, as no link is visible on the start page, in the README, or elsewhere.

Document `CREATE TEMP TABLE...`

I cannot find any description of what a TEMP or TEMPORARY table is in CREATE TEMP TABLE... statements. A brief description on this page would be helpful: https://duckdb.org/docs/sql/statements/create_table

One use case I have is that I am considering it for "caching" aggregates:
create temp table tmp as select user, count(*) v ... and then running queries like select city, sum(v) v ... from tmp.

Is it materialised in memory? What if it does not fit in memory?
Is it removed automatically? When? How?
What is the intended use case for temporary tables?

Demo appears to be down (aug 12) ie https://duckdb.org/demo (shows duck-spinner)

The live demo on https://duckdb.org/demo/ appears to be down. Firefox complains:

Loading failed for the <script> with source “https://duckdbdemo.project.cwi.nl//query?callback=jQuery35104049208308842104_1628770206616&q=SELECT%20*%20FROM%20lineitem&_=1628770206618”.

Chromium has a problem resolving the hostname duckdbdemo.project.cwi.nl... DNS issue?

The website does not have a link to the blog posts

I found the 'efficient SQL on pandas' blog post when someone posted it on Reddit. When I wanted to go back to it I could not find any links on the website that will take me to the blogs. If you want people to read the blog posts then I suggest you put a link to them on the home page.

How to nest query?

I want to test nested queries with DuckDB. First I use CREATE and INSERT to create a table with a map field.
Then I want to query the map data, but I didn't see any example in the documentation. I tried many times, but it did not work. Can someone help me?

import duckdb
import time

if __name__ == "__main__":
    con = duckdb.connect()
    start = time.time()

    con.execute("create table mcule(id INTEGER, map_col MAP(VARCHAR, VARCHAR))")
    con.execute("insert into mcule VALUES (1, map(['asia'], ['asdfa']))")

    # con.execute("select * from mcule where map(map_col)(['asia'],['asdfa']"))
    # con.execute("select element_at(map_col,['asia'])")
    # print(con.fetchall())
    # con.execute("copy (select * from mcule) to 'nesttest.csv' (FORMAT 'CSV')")
    # con.execute("select people.name from 'nesttest.parquet'")
    con.execute("select * from mcule")
    print(con.fetchall())
    end = time.time()
    print("test time: " + str(end - start))

error

Traceback (most recent call last):
  File "createNesTTest.py", line 12, in <module>
    con.execute("select * from mcule where map(map_col)['asia']='asdfa'")
RuntimeError: We need exactly two lists for a map


Traceback (most recent call last):
  File "createNesTTest.py", line 12, in <module>
    con.execute("select * from mcule where map(['asia'],['asdfa'])")
RuntimeError: Conversion Error: Unimplemented type for cast (MAP<VARCHAR, VARCHAR> -> BOOLEAN)