duckdblabs / db-benchmark

This project forked from h2oai/db-benchmark


reproducible benchmark of database-like ops

Home Page: https://duckdblabs.github.io/db-benchmark/

License: Mozilla Public License 2.0

Shell 9.36% Python 33.10% R 51.38% Julia 6.11% HTML 0.05%

db-benchmark's People

Contributors

bkamins, dandandan, gvelasq, hannes, jangorecki, mattdowle, michaelchirico, milesgranger, moelf, mroeschke, nalimilan, pallharaldsson, qoega, ravwojdyla, ritchie46, sebkrantz, sl-solution, szarnyasg, themantra108, tmonster, trivialfis, vincentarelbundock


db-benchmark's Issues

Inconsistent multicore use across languages?

Probably a question for @jangorecki, since this applies to data.table's thread-detection function. But it will affect other R implementations relying on getDTthreads, e.g. the newly proposed r-collapse PR (#33).

Here's an example from my own EC2 server.

First, basic machine characteristics.

[~]$ grep -E -w 'VERSION|NAME|PRETTY_NAME' /etc/os-release    
NAME="Amazon Linux"
VERSION="2"
PRETTY_NAME="Amazon Linux 2"

[~]$ lscpu | grep -E '^Thread|^Core|^Socket|^CPU\('  
CPU(s):              96
Thread(s) per core:  2
Core(s) per socket:  24
Socket(s):           2

Python correctly reports 96 CPUs.

[~]$ python3 -c 'import multiprocessing as mp; print(mp.cpu_count())'
96

But data.table only recognises half of the available cores (physical only?).

[~]$ Rscript -e "data.table::getDTthreads()" 
[1] 48
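
For reference, a minimal R sketch of the knobs involved (behaviour as documented for recent data.table releases; whether the benchmark should override data.table's default of 50% of logical CPUs is exactly the question here):

library(data.table)
getDTthreads(verbose = TRUE)   # prints detected CPUs and the environment variables consulted
setDTthreads(0)                # 0 = use all logical CPUs, per the documentation
getDTthreads()                 # expected to report 96 on the machine above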

`=~` matching in run.sh causes "duckdb-latest" to match "duckdb"

Commit 17d41db, which introduced the "duckdb-latest" solution, causes the following line in run.sh

if [[ "$RUN_SOLUTIONS" =~ "duckdb" ]]; then ./duckdb/ver-duckdb.sh; fi;

to match even when duckdb is not being run and only duckdb-latest is. As a result the benchmark may fail with the following message:

Error in read.dcf(system.file(package = "duckdb", lib.loc = "./duckdb/r-duckdb",  : 
  cannot open the connection
In addition: Warning message:
In read.dcf(system.file(package = "duckdb", lib.loc = "./duckdb/r-duckdb",  :
  cannot open compressed file '', probable reason 'No such file or directory'
Execution halted

To resolve this we should use a more precise comparison than =~ (for example, exact matching against the whitespace-delimited solution names). Until then, let's keep this issue open so that anyone hitting this error can find it when searching.

Inclusion of Julia in the benchmarks

Are you open to re-include Julia DataFrames.jl in the benchmarks?

The scripts in https://github.com/duckdblabs/db-benchmark/tree/master/juliadf should just work. However, I can also update them to the latest Julia and related package versions. It would roughly be h2oai#232.

A related big question is precompilation. In the H2O benchmark the compilation time for Julia was included in the benchmark timings (that is why there are two runs per test). There was also a requirement that no named functions be created and called from the code performing the operations. Do you want to keep these two restrictions?

suspicious timings for groupby q10

groupby q10 is quite a specific query, meant to stress aggregation heavily.

Looking at the current results, I observe that duckdb and clickhouse handle this question suspiciously well (0.14s, 0.47s) compared to the remaining solutions (1.56s).

My guess, which I would like to clarify in this issue, is that those databases keep statistics on the tables and use them to run queries faster. Could you comment on whether that is a likely scenario? @Tmonster for duckdb, @qoega for clickhouse?

If that is the case, then:

  • there should be a note about that, probably even the first note just below the timings;
  • we should look for corresponding features in other solutions. For example, in the data.table test script we could set indices (data.table's native statistics, which can be enabled explicitly) or a key; those optimizations are currently not used. A minimal sketch follows below.
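
For illustration, a hedged data.table sketch of what enabling such native statistics could look like in the groupby test script (x stands for the loaded input table; whether this is a fair comparison against databases that collect statistics automatically is part of the question):

library(data.table)
# x is the groupby input table, e.g. G1_1e9_1e2_0_0 loaded via fread()
setkey(x, id1, id2, id3, id4, id5, id6)      # a key: physically sorts the table
# setindex(x, id1, id2, id3, id4, id5, id6)  # or a secondary index, without reordering rows
# a q10-style aggregation (sum v3, count by id1:id6) can then use that ordering
ans <- x[, .(v3 = sum(v3), count = .N),
         by = c("id1", "id2", "id3", "id4", "id5", "id6")]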

Question

Hi,
I wanted to know how often the benchmark will be updated.
Just curious to know how some new solutions are performing.
Thank you

Add one Digit to Benchmark Summary

Minor issue, but I'd like to suggest adding one digit to the summary table for better display of results, especially for the 0.5 GB and 5 GB benchmarks.

many solutions are missing from join 1e9

It is a bit confusing that join 1e9 has so few solutions, and there are no notes below the benchplot explaining why. Ideally all solutions would be included, and whenever something is not working, an exception would be defined in the benchplot dictionary that refers to the relevant GitHub issue (#num) of the particular project. You will find multiple examples in the current benchplot dictionary, because that was how we handled failures and exceptions in different solutions. This way you also allow users to eventually fix the problems behind those issues.

datafusion does not correctly make chunk results

tagging @Dandandan since they added the solution.

While in the process of running the benchmarks again, the output of datafusion has caused some issues during report generation.

Specifically, when write_log is called and a chunk is written using chk=make_chk([chk]), the chunk output is incorrectly formatted.

Running the 500 MB groupby benchmark on G1_1e7_1e2_0_0 logs the following result for one of the queries:

{{codename}},1696851068,1696851075.3744261,groupby,G1_1e7_1e2_0_0,10000000,sum v1 mean v3 by id3,100000,3,datafusion,31.0.0,,.groupby,1,0.269,1.877,TRUE,[29998789.          4999719.62234443],5.398,,FALSE

The column with the value [29998789. 4999719.62234443] is the offending one; it should contain semicolon-separated answers.
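
For reference, a hedged sketch (not the repo's actual helper) of the expected shape of the chk field, namely one ";"-separated string of the answer values rather than a printed array:

# hypothetical helper illustrating the expected format
make_chk_sketch <- function(values) {
  paste(format(values, digits = 15, scientific = FALSE, trim = TRUE), collapse = ";")
}
make_chk_sketch(c(29998789, 4999719.62234443))   # "29998789;4999719.62234443"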

The report is ready to be published, and I would like to have the datafusion results included.

Recommended Benchmarking Arrangement

I would like to thank the DuckDB team for keeping this benchmark going! It can be expensive to run benchmark tests for many software packages. As a single developer working on my dataframe project, I run into problems if I spend too much time learning how to configure other software for benchmarking.

However, I suggest that each software's developers write their own script; the results would be fairer, since different script settings produce different performance, and the developers are the ones able to configure their script for optimal performance.

It is unbelievable that none of the software in the benchmark can complete the JoinTable test.

I use a 67 GB CSV fact table of 1 billion rows x 14 columns joined to a master table of 99,696 rows, as a full join on a 3-column composite key, returning a 90 GB CSV file. Peaks can complete this join using only 32 GB of memory. See https://github.com/hkpeaks/peaks-consolidation/tree/main/Polars-PeaksBenchmarking for more information.

You can use the Peaks pre-release to see if it can complete the join test on your much better machine: https://github.com/hkpeaks/peaks-consolidation/releases . I would like to know the results, as I do not have a Google Cloud account; I am still exploring which cloud service accepts PayPal for prepayment, since using a credit card for online payment is very risky for me.

I suggest that your upcoming benchmark test the following categories, selecting all records (except for the filter test):

  • Select all records for a sorting test, for CSV and Parquet (covering 3+ sort columns)
  • Select all records for a groupby test, for CSV and Parquet (covering 3+ groupby columns and 3+ aggregate columns)
  • Select all records for a distinct test, for CSV and Parquet (covering 3+ distinct columns)
  • Select all records for a join test, for CSV and Parquet (covering a 3+ column composite key)
  • Filter tests with different comparison operators (match/unmatch), for CSV and Parquet (covering filters that retain at least 50% of the input volume)

This way it can be more easily understood by business users.

Add additional data wrangling methods

Thank you DuckDB team for keeping this benchmark going!!!

I see there are a lot of variations on group bys and joins; however, I think it would be highly beneficial to incorporate additional data wrangling methods. A few that come to mind (others should feel free to add to this list) include:

  • Unions
  • Subsetting data
  • Sampling data
  • Rolling joins (see data.table)
  • Pivots long and wide
  • Rolling / windowing operations by groups over time, such as lags and moving averages
  • Differencing data by groups based on a time column
  • Updating records in a data frame / table
  • Categorical encoding methods: target encoding, James-Stein encoding
  • Column type conversions

I believe a broader set of operations serves several purposes. For one, I would like to know whether a particular framework can actually do the operation. Secondly, I would like to see benchmarks of its performance. Lastly, I think it would be a huge community benefit to see what the best-performing code actually looks like, which isn't always available through documentation or Stack Overflow (a small sketch of a couple of these operations follows below).
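
For illustration only, a hedged data.table sketch of two of the suggested operations (a union, and lag/moving-average by group over time); the column names are made up:

library(data.table)
dt1 <- data.table(id = rep(1:3, each = 4), t = rep(1:4, 3), v = runif(12))
dt2 <- copy(dt1)
unioned <- rbindlist(list(dt1, dt2))           # union of two tables
setorder(dt1, id, t)                           # sort by group and time
dt1[, `:=`(v_lag = shift(v, 1),                # lag within each group
           v_ma3 = frollmean(v, 3)), by = id]  # 3-period moving average by group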

Thanks in advance,
Adrian

I created a benchmark but DuckDB run times are super slow and not sure why

Here's a link to the repo: https://github.com/AdrianAntico/Benchmarks

I use 3 datasets: one with a million rows, one with 10M, and one with 100M. I am currently just running a sum aggregation, with a varying number of grouping variables and aggregated numeric variables. One main difference from the datasets used here is that I make use of a Date type column in all aggregations; it also seems that Polars has a harder time with that data type. I'm seeing data.table as the fastest for all queries except the single Date type aggregation (which is where DuckDB wins).

I copied some of the code from this repo. Hoping someone can take a look, because the results were a bit unexpected (a minimal sketch of the query shape is below).
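
For clarity, a hypothetical R sketch of the query shape being described (not code from either repository): a sum aggregation grouped on a Date-typed column.

library(data.table)
n  <- 1e6
dt <- data.table(date = as.Date("2021-01-01") + sample(0:364, n, replace = TRUE),
                 v1   = runif(n))
agg <- dt[, .(v1 = sum(v1)), by = date]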

Clean up the `/tmp` directory

I notice that on-disk solutions may create large temporary files during their runs but not clean them up afterward (e.g. polars creates .ipc files). This may cause unrelated failures for other solutions that run within the same session (a possible cleanup sketch is below).
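
A possible mitigation, sketched in R and assuming the leftovers are the .ipc files mentioned above, would be a cleanup step between solution runs:

# remove leftover .ipc spill files from /tmp so one solution's temporary files
# cannot exhaust disk space for the solutions that run after it
leftovers <- list.files("/tmp", pattern = "\\.ipc$", full.names = TRUE)
unlink(leftovers)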

Incorrect timings in groupby 50 GB charts

Data in this chart (groupby 50 GB) is incorrect. Timings for each question seem far too small (smaller than for 5 GB) and do not add up to the total timing at the top of the chart. E.g. DuckDB's top timing is 143s, but the per-question timings are all below 0.5s.

Published duckdb results are not reproducible

Hi.
I created the environment you use for the benchmarks and tried to reproduce the currently published results.

curl http://169.254.169.254/latest/meta-data/instance-type
c6id.metal

The benchmark data is stored on a local NVMe disk:

~/nvme/h2oai-db-benchmark$ df -h .
Filesystem      Size  Used Avail Use% Mounted on
/dev/nvme2n1    1.8T  510G  1.3T  29% /home/ubuntu/nvme

lsblk | grep -v loop
NAME         MAJ:MIN RM   SIZE RO TYPE MOUNTPOINTS
nvme4n1      259:0    0  1000G  0 disk
├─nvme4n1p1  259:1    0 999.9G  0 part /
├─nvme4n1p14 259:2    0     4M  0 part
└─nvme4n1p15 259:3    0   106M  0 part /boot/efi
nvme2n1      259:4    0   1.7T  0 disk /home/ubuntu/nvme
nvme3n1      259:5    0   1.7T  0 disk
nvme1n1      259:6    0   1.7T  0 disk /var/lib/clickhouse
nvme0n1      259:7    0   1.7T  0 disk /nvme

Group by G1_1e9_1e2_5_0 fails with OOM for duckdb 0.8.1.3

cat run_duckdb_groupby_G1_1e9_1e2_5_0.err
Error: rapi_execute: Failed to run query
Error: Out of Memory Error: could not allocate block of size 262KB (216.2GB/216.2GB used)
Database is launched in in-memory mode and no temporary directory is specified.
Unused blocks cannot be offloaded to disk.

Launch the database with a persistent storage back-end
Or set PRAGMA temp_directory='/path/to/tmp.tmp'
Timing stopped at: 768 538.4 33.28
Execution halted
Warning messages:
1: Connection is garbage-collected, use dbDisconnect() to avoid this.
2: Database is garbage-collected, use dbDisconnect(con, shutdown=TRUE) or duckdb::duckdb_shutdown(drv) to avoid this.
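
Following the error message's own suggestions, a hedged R sketch of what the fix could look like in the solution's connection setup (the paths are placeholders, not the benchmark's actual configuration):

library(DBI)
# open DuckDB with persistent storage so blocks can be offloaded to disk,
# and point its spill space at a directory on the large NVMe volume
con <- dbConnect(duckdb::duckdb(), dbdir = "/path/to/benchmark.duckdb")
dbExecute(con, "PRAGMA temp_directory='/path/to/tmp.tmp'")
# ... run the groupby queries ...
dbDisconnect(con, shutdown = TRUE)   # also avoids the garbage-collection warnings above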

Suggestion: Include reshape benchmarks

Stoked to see this back up and running!

(As an aside, the relentless performance gains of DuckDB are truly impressive.)

Two suggestions:

  1. Please consider the collapse R package (link). In my own set of benchmarks, collapse is typically at or near the top of various groupby operations for datasets on the order of 0.5-5 GB. (I haven't tested larger than that, and should also say it doesn't support join operations yet.) I can add a PR if interested. Closed via #33.
  2. There was talk over at the old repo of adding a set of reshape benchmarks. Personally, I think this would be great to have. See: h2oai#175 (a small illustration of the operation follows below).
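
For reference, a hedged data.table sketch of the kind of reshape operations discussed (made-up column names, not the proposed benchmark queries):

library(data.table)
long <- data.table(id = rep(1:4, each = 3),
                   measure = rep(c("a", "b", "c"), 4),
                   value = rnorm(12))
wide <- dcast(long, id ~ measure, value.var = "value")   # long -> wide
back <- melt(wide, id.vars = "id",                       # wide -> long
             variable.name = "measure", value.name = "value")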

Thanks again for all effort in resurrecting this.

duckdb-latest fails

When trying to set up duckdb-latest I get an error:

~/db-benchmark$ bash duckdb-latest/setup-duckdb-latest.sh
Downloading GitHub repo duckdb/duckdb@HEAD
Error: Failed to install 'duckdb' from GitHub:
  Does not appear to be an R package (no DESCRIPTION)
Execution halted

The docs page recommends doing the following:

install.packages('duckdb', repos=c('https://duckdb.r-universe.dev', 'https://cloud.r-project.org'))

I'm probably doing something wrong, as I'm not very familiar with R.

Benchmark with Parquet

Is there any interest in using Parquet datasets for the benchmark, particularly for the 50 GB case? Parquet is very common for large-scale data analytics, and as far as I know, most if not all of the libraries tested support Parquet files. Furthermore, some libraries have special support for Parquet datasets, such as reading encoded columns and performing analytics on them directly in memory.

I think it still makes sense to run the 0.5 GB and 5 GB benchmarks on CSV, but it would be helpful to have an additional 50 GB benchmark on Parquet, or even to replace the existing one so that the benchmark doesn't take much longer overall. A small sketch of what querying Parquet directly looks like is below.
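
For illustration, a hedged R sketch (the file name is hypothetical) of querying a Parquet file directly with one of the tested solutions, DuckDB, without materialising it as a data frame first:

library(DBI)
con <- dbConnect(duckdb::duckdb())
# DuckDB can scan a Parquet file referenced directly in the FROM clause
res <- dbGetQuery(con, "SELECT id1, sum(v1) AS v1 FROM 'G1_1e9_1e2_0_0.parquet' GROUP BY id1")
dbDisconnect(con, shutdown = TRUE)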

Add cuDF

Any particular reason why cuDF is left out, especially since it's included in the original repo?
