Comments (6)
Same as #4295 ?
> Same as #4295 ?

Technically yes, but I've also noticed that every filter is removed, preventing any possible pushdown to the files.
Hi @MPizzotti,
Thank you for filing the issue. I've been trying to reproduce it, but haven't found a simple reproducible example yet. In the meantime, can you let me know if it is reproducible without reading from Azure? You can check by copying a sample of your data with

COPY (SELECT * FROM ... USING SAMPLE 10%) TO 'simple_parquet_partitioning' (FORMAT PARQUET, PARTITION_BY (a, b));
other sampling methods are available here.
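For reference, a sketch of a few sampling variants based on DuckDB's `USING SAMPLE` syntax (`my_table` is a placeholder; check the sampling docs for the exact forms supported by your version):

```sql
-- Percentage sampling (system sampling by default): fast, cluster-based
SELECT * FROM my_table USING SAMPLE 10%;

-- Bernoulli sampling: each row is kept independently with 10% probability
SELECT * FROM my_table USING SAMPLE 10% (bernoulli);

-- Reservoir sampling: a fixed number of rows, chosen uniformly
SELECT * FROM my_table USING SAMPLE 10000 ROWS;
```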
Looking at the explain output, it looks like the hive-partitioned columns are VARCHAR columns, but you are checking for equality with an int, so a cast is introduced. Can you also try a query like
EXPLAIN ANALYZE {redacted_query}
FROM
  {redacted_parquet}
WHERE
  path1 = '7' AND path2 = '2047' -- these are hive-partitioned (file filter pushdown)
  AND column1 = 'k1' -- 7 and 2047 are now strings
and let me know if the problem persists? This can tell me whether the issue is somewhere in how we initialize our scan using network data with the HEAD request, or whether it is exclusively a problem in the duckdb core filter pushdown logic. This should also clean up the EXPLAIN ANALYZE output, since there will be no more casting.
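Why the partition values compare as strings can be illustrated in plain Python (a simplified sketch, not DuckDB's actual implementation; `parse_hive_partitions` is a hypothetical helper): hive-style partition values live in directory names, so they are text until explicitly cast.

```python
# Simplified illustration (not DuckDB's code): hive-partitioned column
# values are encoded in directory names of the file path, so they are
# read back as strings unless a cast is applied.
from urllib.parse import unquote

def parse_hive_partitions(path: str) -> dict:
    """Extract key=value partition pairs from a hive-style file path."""
    parts = {}
    for segment in path.split("/"):
        key, sep, value = segment.partition("=")
        if sep:  # only segments of the form key=value are partitions
            parts[key] = unquote(value)  # value is a str, not an int
    return parts

keys = parse_hive_partitions("container/path1=7/path2=2047/data.parquet")
print(keys)  # {'path1': '7', 'path2': '2047'}
```

Comparing such a string column against an integer literal forces a cast on one side, which is what shows up in the File Filters of the plan.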
Thanks,
Tom Ebergen
Edit: Added note about network HEAD request vs duckdb-core filter pushdown functionality.
Hi @Tmonster,
Thank you for investigating the issue. I've run the query as suggested, importing the data onto my local PC with a sample size of 5%:
COPY (
    SELECT *
    FROM read_parquet("azure://<redacted>/*/*/*.parquet", hive_partitioning = true)
    USING SAMPLE 5%
) TO 'C:<redacted>' (FORMAT PARQUET, PARTITION_BY (path1, path2));
Once loaded, I've tried to run the same query both locally and on Azure (with the full dataset instead of the 5% sample).
Here's the EXPLAIN ANALYZE query result, redacted due to sensitive data:
ON AZURE BLOB STORAGE (Blob Container, ADLS Gen2), WITHOUT THE IN FILTER:
┌─────────────────────────────────────┐
│┌───────────────────────────────────┐│
││ Query Profiling Information ││
│└───────────────────────────────────┘│
└─────────────────────────────────────┘
EXPLAIN ANALYZE FROM
read_parquet('azure:<redacted>/*/*/*.parquet', hive_partitioning = true)
WHERE
path1 = 1 AND
path2 = 1 AND
column1 = '202152' AND
column2 = 'K1'
┌─────────────────────────────────────┐
│┌───────────────────────────────────┐│
││ Total Time: 2.45s ││
│└───────────────────────────────────┘│
└─────────────────────────────────────┘
┌───────────────────────────┐
│ RESULT_COLLECTOR │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ 0 │
│ (0.00s) │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ EXPLAIN_ANALYZE │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ 0 │
│ (0.00s) │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ PROJECTION │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ <redacted> │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ 38 │
│ (0.00s) │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ READ_PARQUET │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ <redacted> │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ Filters: column1 │
│=202152 AND date_group... │
│ NULL │
│ column2=K1 AND │
│ column2 IS NOT NULL│
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ File Filters: (CAST │
│(path1 AS INTEGER... │
│(path2 AS INTEGER)... │
│ ) │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ EC: 18467054 │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ 38 │
│ (8.39s) │
└───────────────────────────┘
ON AZURE BLOB STORAGE (Blob Container, ADLS Gen2), WITH THE IN FILTER:
┌─────────────────────────────────────┐
│┌───────────────────────────────────┐│
││ Query Profiling Information ││
│└───────────────────────────────────┘│
└─────────────────────────────────────┘
EXPLAIN ANALYZE FROM
read_parquet('azure:<redacted>/*/*/*.parquet', hive_partitioning = true)
WHERE
path1 = 1 AND
path2 = 1 AND
column1 = '202152' AND
column2 IN ('K1', 'K900')
┌─────────────────────────────────────┐
│┌───────────────────────────────────┐│
││ Total Time: 298.70s ││
│└───────────────────────────────────┘│
└─────────────────────────────────────┘
┌───────────────────────────┐
│ RESULT_COLLECTOR │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ 0 │
│ (0.00s) │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ EXPLAIN_ANALYZE │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ 0 │
│ (0.00s) │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ PROJECTION │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ <redacted> │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ 267 │
│ (0.00s) │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ FILTER │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│((column2 = 'K1' │
│ ) OR (column2 = ' │
│ K900')) │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ EC: 3693410 │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ 267 │
│ (0.58s) │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ READ_PARQUET │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ <redacted> │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ Filters: column1 │
│=202152 AND date_group... │
│ NULL │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ File Filters: (CAST │
│(path1 AS INTEGER... │
│(path2 AS INTEGER)... │
│ ) │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ EC: 18467054 │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ 677527 │
│ (2381.50s) │
└───────────────────────────┘
ON LOCAL MACHINE, SAMPLED AT 5%, WITHOUT THE IN FILTER:
┌─────────────────────────────────────┐
│┌───────────────────────────────────┐│
││ Query Profiling Information ││
│└───────────────────────────────────┘│
└─────────────────────────────────────┘
EXPLAIN ANALYZE FROM
read_parquet('C:<redacted>\*\*\*.parquet', hive_partitioning = true)
WHERE
path1 = 1 AND
path2 = 1 AND
column1 = '202152' AND
column2 = 'K1'
┌─────────────────────────────────────┐
│┌───────────────────────────────────┐│
││ Total Time: 0.0868s ││
│└───────────────────────────────────┘│
└─────────────────────────────────────┘
┌───────────────────────────┐
│ RESULT_COLLECTOR │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ 0 │
│ (0.00s) │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ EXPLAIN_ANALYZE │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ 0 │
│ (0.00s) │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ PROJECTION │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ <redacted> │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ 7 │
│ (0.00s) │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ READ_PARQUET │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ <redacted> │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ Filters: column1 │
│=202152 AND date_group... │
│ NULL │
│ column2=K1 AND │
│ column2 IS NOT NULL│
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│File Filters: (path1 =│
│ 1)(path2 = 1) │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ EC: 931615 │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ 7 │
│ (0.27s) │
└───────────────────────────┘
ON LOCAL MACHINE, SAMPLED AT 5%, WITH THE IN FILTER:
┌─────────────────────────────────────┐
│┌───────────────────────────────────┐│
││ Query Profiling Information ││
│└───────────────────────────────────┘│
└─────────────────────────────────────┘
EXPLAIN ANALYZE FROM
read_parquet('C:<redacted>\*\*\*.parquet', hive_partitioning = true)
WHERE
path1 = 1 AND
path2 = 1 AND
column1 = '202152' AND
column2 IN ('K1', 'K900')
┌─────────────────────────────────────┐
│┌───────────────────────────────────┐│
││ Total Time: 2.22s ││
│└───────────────────────────────────┘│
└─────────────────────────────────────┘
┌───────────────────────────┐
│ RESULT_COLLECTOR │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ 0 │
│ (0.00s) │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ EXPLAIN_ANALYZE │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ 0 │
│ (0.00s) │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ PROJECTION │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ <redacted> │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ 7 │
│ (0.00s) │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ FILTER │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│((column2 = 'K1' │
│ ) OR (column2 = ' │
│ K900')) │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ EC: 186323 │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ 7 │
│ (0.03s) │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ READ_PARQUET │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ <redacted> │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ Filters: column1 │
│=202152 AND date_group... │
│ NULL │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│File Filters: (path1 =│
│ 1)(path2 = 1) │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ EC: 931615 │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ 32529 │
│ (16.64s) │
└───────────────────────────┘
Running these queries I've also noticed that the two columns path1 and path2 are correctly filtered locally, but not in the Azure environment. Both columns are INTEGER types because they are the result of a GROUPING_ID() function applied to the initial dataset. But, as you suggested, when comparing against VARCHAR literals on Azure, the filter is applied correctly, even though the data is actually of type INTEGER:
┌─────────────────────────────────────┐
│┌───────────────────────────────────┐│
││ Query Profiling Information ││
│└───────────────────────────────────┘│
└─────────────────────────────────────┘
EXPLAIN ANALYZE FROM
read_parquet('azure:<redacted>/*/*/*.parquet', hive_partitioning = true)
WHERE
path1 = '1' AND
path2 = '1' AND
column1 = '202152' AND
column2 = 'K1'
┌─────────────────────────────────────┐
│┌───────────────────────────────────┐│
││ Total Time: 1.60s ││
│└───────────────────────────────────┘│
└─────────────────────────────────────┘
┌───────────────────────────┐
│ RESULT_COLLECTOR │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ 0 │
│ (0.00s) │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ EXPLAIN_ANALYZE │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ 0 │
│ (0.00s) │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ PROJECTION │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ <redacted> │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ 38 │
│ (0.00s) │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ READ_PARQUET │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ <redacted> │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ Filters: column1 │
│=202152 AND date_group... │
│ NULL │
│ column2=K1 AND │
│ column2 IS NOT NULL │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│File Filters: (path1 = │
│ '1')(path2 = '1') │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ EC: 18467054 │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ 38 │
│ (6.38s) │
└───────────────────────────┘
In any case, running the query with the IN filter is still extremely slow:
┌─────────────────────────────────────┐
│┌───────────────────────────────────┐│
││ Query Profiling Information ││
│└───────────────────────────────────┘│
└─────────────────────────────────────┘
EXPLAIN ANALYZE FROM
read_parquet('azure:<redacted>/*/*/*.parquet', hive_partitioning = true)
WHERE
path1 = '1' AND
path2 = '1' AND
column1 = '202152' AND
column2 IN ('K1', 'K900')
┌─────────────────────────────────────┐
│┌───────────────────────────────────┐│
││ Total Time: 466.98s ││
│└───────────────────────────────────┘│
└─────────────────────────────────────┘
┌───────────────────────────┐
│ RESULT_COLLECTOR │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ 0 │
│ (0.00s) │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ EXPLAIN_ANALYZE │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ 0 │
│ (0.00s) │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ PROJECTION │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ <redacted> │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ 267 │
│ (0.00s) │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ FILTER │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│((column2 = 'K1' │
│ ) OR (column2 = ' │
│ K900')) │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ EC: 3693410 │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ 267 │
│ (0.87s) │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ READ_PARQUET │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ <redacted> │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ Filters: column1 │
│=202152 AND date_group... │
│ NULL │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│File Filters: (path1 =│
│ '1')(path2 = '1') │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ EC: 18467054 │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ 677527 │
│ (3717.91s) │
└───────────────────────────┘
Hi @MPizzotti,
Thank you for your response.
> Running these queries I've also noticed that the 2 columns path1 and path2 are correctly filtered in local, but not on Azure Environment.

Looking at the explain output for all queries, it seems all plans have File Filters; these are responsible for filtering out the files that do not satisfy the path1 = '1' AND path2 = '1' conditions. They are present in the Azure environment queries you pasted, leading me to believe they are correctly applied. Maybe this is a copy-paste error? Unless the cardinality of the READ_PARQUET function (677527 read tuples) is the same as a count(*) of the same query without filters?
If the File Filters are properly applied in all cases, then I think it is a duplicate of #4295. We don't support pushing OR filters into Table/File reads yet.
In your original post, there seems to be a case where the File Filters disappear when you add the IN filter. That is indeed a bug. If you can reproduce that case consistently (preferably locally, with no Azure), I will continue investigating.
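Until OR/IN pushdown is supported, one possible workaround sketch (using the redacted paths and the column names from the queries above; not a confirmed fix) is to split the IN into separate equality branches so that each scan keeps its own filter pushdown:

```sql
-- Each branch has a single equality predicate, so it can be pushed
-- down into the parquet scan; UNION ALL combines the results.
SELECT * FROM read_parquet('azure://<redacted>/*/*/*.parquet', hive_partitioning = true)
WHERE path1 = '1' AND path2 = '1' AND column1 = '202152' AND column2 = 'K1'
UNION ALL
SELECT * FROM read_parquet('azure://<redacted>/*/*/*.parquet', hive_partitioning = true)
WHERE path1 = '1' AND path2 = '1' AND column1 = '202152' AND column2 = 'K900';
```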
Hi @Tmonster,
thanks again for the feedback. Regarding the following:

> In your original post, there seems to be a case where the File Filters disappear when you add the IN filter. That is indeed a bug. If you can reproduce that case consistently (preferably locally with no azure) I will continue investigating.
I can confirm that it was not a copy-paste error, but it's very inconsistent. Today I was unable to replicate it, but when I opened the ticket I'm 100% sure the bug occurred.

To try to replicate it, I worked with the following environment:

DuckDB 0.10.3
Env: Databricks VM

The blob storage is mounted directly inside the Databricks file system, so there is no need for the Azure plugin. Moreover, I also tried bumping DuckDB to the latest version (1.0.0) and the problem hasn't occurred yet.

Thanks again for the support. This is probably a duplicate of #4295, but the error regarding the incorrect casting persists and is not mentioned in that issue.