What happens? when using the IN / OR operator against a file parqu

Same as <a class="issue-link js-issue-link" data-error-text="Failed to load title" dat

Same as <a class="issue-link js-issue-link" data-error-text="Failed to lo

Hi <a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="

IN/OR operator removes completely Parquet Filter Pushdown on other filters,about duckdb/duckdb

Comments (6)

mauropagano commented on June 27, 2024

Same as #4295 ?

from duckdb.

MPizzotti commented on June 27, 2024

Same as #4295 ?

Technically Yes, But also I've noticed that every filter is removed to any possibile pushdown to file.

from duckdb.

Tmonster commented on June 27, 2024

Hi @MPizzotti ,

Thank you for filing the issue. I've been trying to reproduce it, but haven't found a simple reproducible example yet. In the meantime, can you let me know if it is reproducible without reading from azure? You can check by copying a sample of you data with
copy (select * from ... USING SAMPLE 10%) TO 'simple_parquet_partitioning' (FORMAT PARQUET, PARTITION_BY (a, b));
other sampling methods are available here.

Looking at the explain output, it looks like the the hive partitioned columns are varcher column, but you are checking for an equality with an int so a cast is introduced. Can you also try a query like

explain analyze {redacted_query}
     FROM
			{redacted_parquet}
	WHERE
       path1 = '7' AND path2='2047' -- these are hive_partitioned (file filter pushdown)
	   AND column1= 'k1' -- 7 and 2047 are now strings.
 )

and let me know if the problem persists? This can tell me if the issue is somewhere in how we initialize our scan using network data with the HEAD request, or if it is exclusively a problem in the duckdb core filter pushdown logic. This should also clean up the EXPLAIN ANALYZE output since there will be no more casting

Thanks,
Tom Ebergen

Edit: Added not about network HEAD request vs duckdb-core filter pushdown functionality.

from duckdb.

MPizzotti commented on June 27, 2024

Hi @Tmonster,
Thank you for investigating on the issue,
I've run the query as suggested, importing the data on my local pc, with a sample size of 5%:

COPY(
    SELECT *
    FROM read_parquet("azure://<redacted>/*/*/*.parquet",  hive_partitioning = true)
    USING SAMPLE 5%
) TO 'C:<redacted>' (FORMAT PARQUET, PARTITION_BY (path1, path2));

Once loaded, I've tried to run the same query on both local and on Azure (with the full dataset instead of 5% sample size).
Here's the EXPLAIN ANALYZE query result, redacted due to sensitive data:

ON AZURE BLOB STORAGE (Blob Container (ADLS Gen2)) WITHOUT `IN` QUERY FILTER:

┌─────────────────────────────────────┐
│┌───────────────────────────────────┐│
││    Query Profiling Information    ││
│└───────────────────────────────────┘│
└─────────────────────────────────────┘
EXPLAIN ANALYZE FROM
     read_parquet('azure:<redacted>/*/*/*.parquet',  hive_partitioning = true)
     WHERE
     	path1 = 1 AND
        path2 = 1 AND
        column1 = '202152' AND
        column2 = 'K1'
┌─────────────────────────────────────┐
│┌───────────────────────────────────┐│
││         Total Time: 2.45s         ││
│└───────────────────────────────────┘│
└─────────────────────────────────────┘
┌───────────────────────────┐
│      RESULT_COLLECTOR     │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│             0             │
│          (0.00s)          │
└─────────────┬─────────────┘                             
┌─────────────┴─────────────┐
│      EXPLAIN_ANALYZE      │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│             0             │
│          (0.00s)          │
└─────────────┬─────────────┘                             
┌─────────────┴─────────────┐
│         PROJECTION        │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│         <redacted>        │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│             38            │
│          (0.00s)          │
└─────────────┬─────────────┘                             
┌─────────────┴─────────────┐
│       READ_PARQUET        │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│         <redacted>        │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│   Filters: column1  │
│=202152 AND date_group...  │
│            NULL           │
│ column2=K1 AND     │
│ column2 IS NOT NULL│
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│    File Filters: (CAST    │
│(path1 AS INTEGER...  │
│(path2 AS INTEGER)...  │
│             )             │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│        EC: 18467054       │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│             38            │
│          (8.39s)          │
└───────────────────────────┘

ON AZURE BLOB STORAGE (Blob Container (ADLS Gen2)) WITH `IN` QUERY FILTER:

┌─────────────────────────────────────┐
│┌───────────────────────────────────┐│
││    Query Profiling Information    ││
│└───────────────────────────────────┘│
└─────────────────────────────────────┘
EXPLAIN ANALYZE FROM
     read_parquet('azure:<redacted>/*/*/*.parquet',  hive_partitioning = true)
     WHERE
     	path1 = 1 AND
        path2 = 1 AND
        column1 = '202152' AND
        column2 IN ('K1', 'K900')
┌─────────────────────────────────────┐
│┌───────────────────────────────────┐│
││        Total Time: 298.70s        ││
│└───────────────────────────────────┘│
└─────────────────────────────────────┘
┌───────────────────────────┐
│      RESULT_COLLECTOR     │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│             0             │
│          (0.00s)          │
└─────────────┬─────────────┘                             
┌─────────────┴─────────────┐
│      EXPLAIN_ANALYZE      │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│             0             │
│          (0.00s)          │
└─────────────┬─────────────┘                             
┌─────────────┴─────────────┐
│         PROJECTION        │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│         <redacted>        │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│            267            │
│          (0.00s)          │
└─────────────┬─────────────┘                             
┌─────────────┴─────────────┐
│           FILTER          │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│((column2 = 'K1'    │
│ ) OR (column2 = '  │
│          K900'))          │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│        EC: 3693410        │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│            267            │
│          (0.58s)          │
└─────────────┬─────────────┘                             
┌─────────────┴─────────────┐
│       READ_PARQUET        │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│         <redacted>        │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│   Filters: column1  │
│=202152 AND date_group...  │
│            NULL           │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│    File Filters: (CAST    │
│(path1 AS INTEGER...  │
│(path2 AS INTEGER)...  │
│             )             │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│        EC: 18467054       │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│           677527          │
│         (2381.50s)        │
└───────────────────────────┘

ON LOCAL MACHINE, SAMPLED BY 5%, WITHOUT `IN` QUERY FILTER:

┌─────────────────────────────────────┐
│┌───────────────────────────────────┐│
││    Query Profiling Information    ││
│└───────────────────────────────────┘│
└─────────────────────────────────────┘
EXPLAIN ANALYZE FROM
     read_parquet('C:<redacted>\*\*\*.parquet',  hive_partitioning = true)
     WHERE
     	path1 = 1 AND
        path2 = 1 AND
        column1 = '202152' AND
        column2 = 'K1'
┌─────────────────────────────────────┐
│┌───────────────────────────────────┐│
││        Total Time: 0.0868s        ││
│└───────────────────────────────────┘│
└─────────────────────────────────────┘
┌───────────────────────────┐
│      RESULT_COLLECTOR     │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│             0             │
│          (0.00s)          │
└─────────────┬─────────────┘                             
┌─────────────┴─────────────┐
│      EXPLAIN_ANALYZE      │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│             0             │
│          (0.00s)          │
└─────────────┬─────────────┘                             
┌─────────────┴─────────────┐
│         PROJECTION        │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│         <redacted>        │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│             7             │
│          (0.00s)          │
└─────────────┬─────────────┘                             
┌─────────────┴─────────────┐
│       READ_PARQUET        │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│         <redacted>        │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│   Filters: column1  │
│=202152 AND date_group...  │
│            NULL           │
│ column2=K1 AND     │
│ column2 IS NOT NULL│
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│File Filters: (path1 =│
│    1)(path2 = 1)      │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│         EC: 931615        │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│             7             │
│          (0.27s)          │
└───────────────────────────┘

ON LOCAL MACHINE, SAMPLED BY 5%, WITH `IN` QUERY FILTER:

┌─────────────────────────────────────┐
│┌───────────────────────────────────┐│
││    Query Profiling Information    ││
│└───────────────────────────────────┘│
└─────────────────────────────────────┘
EXPLAIN ANALYZE FROM
     read_parquet('C:<redacted>\*\*\*.parquet',  hive_partitioning = true)
     WHERE
     	path1 = 1 AND
         path2 = 1 AND
         column1 = '202152' AND
         column2 IN ('K1', 'K900')
┌─────────────────────────────────────┐
│┌───────────────────────────────────┐│
││         Total Time: 2.22s         ││
│└───────────────────────────────────┘│
└─────────────────────────────────────┘
┌───────────────────────────┐
│      RESULT_COLLECTOR     │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│             0             │
│          (0.00s)          │
└─────────────┬─────────────┘                             
┌─────────────┴─────────────┐
│      EXPLAIN_ANALYZE      │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│             0             │
│          (0.00s)          │
└─────────────┬─────────────┘                             
┌─────────────┴─────────────┐
│         PROJECTION        │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│         <redacted>        │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│             7             │
│          (0.00s)          │
└─────────────┬─────────────┘                             
┌─────────────┴─────────────┐
│           FILTER          │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│((column2 = 'K1'    │
│ ) OR (column2 = '  │
│          K900'))          │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│         EC: 186323        │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│             7             │
│          (0.03s)          │
└─────────────┬─────────────┘                             
┌─────────────┴─────────────┐
│       READ_PARQUET        │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│         <redacted>        │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│   Filters: column1  │
│=202152 AND date_group...  │
│            NULL           │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│File Filters: (path1 =│
│    1)(path2 = 1)      │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│         EC: 931615        │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│           32529           │
│          (16.64s)         │
└───────────────────────────┘

Running these queries I've also noticed that the 2 columns path1 and path2 are correctly filtered in local, but not on Azure Environment.
both colums are INTEGER TYPES because they are the result of a GROUPING_ID() fuction applied to the initial dataset.

But, as you suggested, when casting the data to type VARCHAR against azure, the casting operation is executed CORRECTLY,
even if the data is actually of type INTEGER:

┌─────────────────────────────────────┐
│┌───────────────────────────────────┐│
││    Query Profiling Information    ││
│└───────────────────────────────────┘│
└─────────────────────────────────────┘
EXPLAIN ANALYZE FROM
     read_parquet('azure:<redacted>/*/*/*.parquet',  hive_partitioning = true)
     WHERE
     	path1 = '1' AND
         path2 = '1' AND
         column1 = '202152' AND
         column2 = 'K1'
┌─────────────────────────────────────┐
│┌───────────────────────────────────┐│
││         Total Time: 1.60s         ││
│└───────────────────────────────────┘│
└─────────────────────────────────────┘
┌───────────────────────────┐
│      RESULT_COLLECTOR     │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│             0             │
│          (0.00s)          │
└─────────────┬─────────────┘                             
┌─────────────┴─────────────┐
│      EXPLAIN_ANALYZE      │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│             0             │
│          (0.00s)          │
└─────────────┬─────────────┘                             
┌─────────────┴─────────────┐
│         PROJECTION        │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│         <redacted>        │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│             38            │
│          (0.00s)          │
└─────────────┬─────────────┘                             
┌─────────────┴─────────────┐
│       READ_PARQUET        │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│         <redacted>        │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│   Filters: column1        │
│=202152 AND date_group...  │
│            NULL           │
│ column2=K1 AND            │
│ column2 IS NOT NULL       │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│File Filters: (path1 =     │
│  '1')(path2 = '1')        │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│        EC: 18467054       │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│             38            │
│          (6.38s)          │
└───────────────────────────┘

in any case, running the query with IN filter is still extremely slow:

┌─────────────────────────────────────┐
│┌───────────────────────────────────┐│
││    Query Profiling Information    ││
│└───────────────────────────────────┘│
└─────────────────────────────────────┘
EXPLAIN ANALYZE FROM
     read_parquet('azure:<redacted>/*/*/*.parquet',  hive_partitioning = true)
     WHERE
     	path1 = '1' AND
        path2 = '1' AND
        column1 = '202152' AND
        column2 IN ('K1', 'K900')
┌─────────────────────────────────────┐
│┌───────────────────────────────────┐│
││        Total Time: 466.98s        ││
│└───────────────────────────────────┘│
└─────────────────────────────────────┘
┌───────────────────────────┐
│      RESULT_COLLECTOR     │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│             0             │
│          (0.00s)          │
└─────────────┬─────────────┘                             
┌─────────────┴─────────────┐
│      EXPLAIN_ANALYZE      │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│             0             │
│          (0.00s)          │
└─────────────┬─────────────┘                             
┌─────────────┴─────────────┐
│         PROJECTION        │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│         <redacted>        │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│            267            │
│          (0.00s)          │
└─────────────┬─────────────┘                             
┌─────────────┴─────────────┐
│           FILTER          │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│((column2 = 'K1'    │
│ ) OR (column2 = '  │
│          K900'))          │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│        EC: 3693410        │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│            267            │
│          (0.87s)          │
└─────────────┬─────────────┘                             
┌─────────────┴─────────────┐
│       READ_PARQUET        │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│         <redacted>        │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│   Filters: column1  │
│=202152 AND date_group...  │
│            NULL           │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│File Filters: (path1 =│
│  '1')(path2 = '1')    │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│        EC: 18467054       │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│           677527          │
│         (3717.91s)        │
└───────────────────────────┘

from duckdb.

Tmonster commented on June 27, 2024

Hi @MPizzotti ,

Thank you for your response.

Running these queries I've also noticed that the 2 columns path1 and path2 are correctly filtered in local, but not on Azure Environment.

Looking at the explain output for all queries, it seems all plans have File Filters, these are responsible for filtering out files that do not satisfy the path1 = '1' AND path2 = '1' conditions. These are present on the on the Azure environment queries you pasted, leading me to believe they are correctly applied. Maybe this is a copy paste error? Unless the cardinality of the READ_PARQUET function resulting in 677527 read tuples is the same as a count(*) of the same query without filters?

If the File Filters are properly applied in all cases, then I think it is a duplicate of #4295. We don't support pushing OR filters into Table/File reads yet.

In your original post, there seems to be a case where the File Filters disappear when you add the IN Filter. That is indeed a bug. If you can reproduce that case consistently (preferably locally with no azure) I will continue investigating.

from duckdb.

MPizzotti commented on June 27, 2024

Hi @Tmonster,
thanks again for the feedback,
in merit to the following:

In your original post, there seems to be a case where the File Filters disappear when you add the IN Filter. That is indeed a bug. If you can reproduce that case consistently (preferably locally with no azure) I will continue investigating.

I can confirm that it was not a copy-paste error, but it's very inconsistent. Today i was unable to replicate it,
But when i opened the ticket, I'm sure 100% the bug as occurred.

For trying to replicate, i worked with the following environment:
DuckDB 0.10.3
Env: Databricks VM

The blob storage is mounted directly inside Databricks file system, so there isn't a need for "Azure" plugin.

Moreover, i also tried bumping duckDB to the latest version (1.0.0) and the problem hasn't occurred yet.

Thanks again for the support. probably is duplicate of #4295,

But still, the error regarding incorrect casting persist and is not mentioned on the above issue

from duckdb.

IN/OR operator removes completely Parquet Filter Pushdown on other filters about duckdb HOT 6 OPEN

Comments (6)

ON AZURE BLOB STORAGE (Blob Container (ADLS Gen2)) WITHOUT `IN` QUERY FILTER:

ON AZURE BLOB STORAGE (Blob Container (ADLS Gen2)) WITH `IN` QUERY FILTER:

ON LOCAL MACHINE, SAMPLED BY 5%, WITHOUT `IN` QUERY FILTER:

ON LOCAL MACHINE, SAMPLED BY 5%, WITH `IN` QUERY FILTER:

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent

Comments (6)

ON AZURE BLOB STORAGE (Blob Container (ADLS Gen2)) WITHOUT IN QUERY FILTER:

ON AZURE BLOB STORAGE (Blob Container (ADLS Gen2)) WITH IN QUERY FILTER:

ON LOCAL MACHINE, SAMPLED BY 5%, WITHOUT IN QUERY FILTER:

ON LOCAL MACHINE, SAMPLED BY 5%, WITH IN QUERY FILTER:

Related Issues (20)

Recommend Projects

Recommend Topics

Recommend Org

ON AZURE BLOB STORAGE (Blob Container (ADLS Gen2)) WITHOUT `IN` QUERY FILTER:

ON AZURE BLOB STORAGE (Blob Container (ADLS Gen2)) WITH `IN` QUERY FILTER:

ON LOCAL MACHINE, SAMPLED BY 5%, WITHOUT `IN` QUERY FILTER:

ON LOCAL MACHINE, SAMPLED BY 5%, WITH `IN` QUERY FILTER: