Coder Social home page Coder Social logo

Comments (6)

mauropagano avatar mauropagano commented on June 27, 2024

Same as #4295 ?

from duckdb.

MPizzotti avatar MPizzotti commented on June 27, 2024

Same as #4295 ?

Technically Yes, But also I've noticed that every filter is removed to any possibile pushdown to file.

from duckdb.

Tmonster avatar Tmonster commented on June 27, 2024

Hi @MPizzotti ,

Thank you for filing the issue. I've been trying to reproduce it, but haven't found a simple reproducible example yet. In the meantime, can you let me know if it is reproducible without reading from azure? You can check by copying a sample of you data with
copy (select * from ... USING SAMPLE 10%) TO 'simple_parquet_partitioning' (FORMAT PARQUET, PARTITION_BY (a, b));
other sampling methods are available here.

Looking at the explain output, it looks like the the hive partitioned columns are varcher column, but you are checking for an equality with an int so a cast is introduced. Can you also try a query like

explain analyze {redacted_query}
     FROM
			{redacted_parquet}
	WHERE
       path1 = '7' AND path2='2047' -- these are hive_partitioned (file filter pushdown)
	   AND column1= 'k1' -- 7 and 2047 are now strings.
 )

and let me know if the problem persists? This can tell me if the issue is somewhere in how we initialize our scan using network data with the HEAD request, or if it is exclusively a problem in the duckdb core filter pushdown logic. This should also clean up the EXPLAIN ANALYZE output since there will be no more casting

Thanks,
Tom Ebergen

Edit: Added not about network HEAD request vs duckdb-core filter pushdown functionality.

from duckdb.

MPizzotti avatar MPizzotti commented on June 27, 2024

Hi @Tmonster,
Thank you for investigating on the issue,
I've run the query as suggested, importing the data on my local pc, with a sample size of 5%:

COPY(
    SELECT *
    FROM read_parquet("azure://<redacted>/*/*/*.parquet",  hive_partitioning = true)
    USING SAMPLE 5%
) TO 'C:<redacted>' (FORMAT PARQUET, PARTITION_BY (path1, path2));

Once loaded, I've tried to run the same query on both local and on Azure (with the full dataset instead of 5% sample size).
Here's the EXPLAIN ANALYZE query result, redacted due to sensitive data:

ON AZURE BLOB STORAGE (Blob Container (ADLS Gen2)) WITHOUT IN QUERY FILTER:

┌─────────────────────────────────────┐
│┌───────────────────────────────────┐│
││    Query Profiling Information    ││
│└───────────────────────────────────┘│
└─────────────────────────────────────┘
EXPLAIN ANALYZE FROM
     read_parquet('azure:<redacted>/*/*/*.parquet',  hive_partitioning = true)
     WHERE
     	path1 = 1 AND
        path2 = 1 AND
        column1 = '202152' AND
        column2 = 'K1'
┌─────────────────────────────────────┐
│┌───────────────────────────────────┐│
││         Total Time: 2.45s         ││
│└───────────────────────────────────┘│
└─────────────────────────────────────┘
┌───────────────────────────┐
│      RESULT_COLLECTOR     │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│             0             │
│          (0.00s)          │
└─────────────┬─────────────┘                             
┌─────────────┴─────────────┐
│      EXPLAIN_ANALYZE      │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│             0             │
│          (0.00s)          │
└─────────────┬─────────────┘                             
┌─────────────┴─────────────┐
│         PROJECTION        │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│         <redacted>        │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│             38            │
│          (0.00s)          │
└─────────────┬─────────────┘                             
┌─────────────┴─────────────┐
│       READ_PARQUET        │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│         <redacted>        │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│   Filters: column1  │
│=202152 AND date_group...  │
│            NULL           │
│ column2=K1 AND     │
│ column2 IS NOT NULL│
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│    File Filters: (CAST    │
│(path1 AS INTEGER...  │
│(path2 AS INTEGER)...  │
│             )             │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│        EC: 18467054       │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│             38            │
│          (8.39s)          │
└───────────────────────────┘                             

ON AZURE BLOB STORAGE (Blob Container (ADLS Gen2)) WITH IN QUERY FILTER:

┌─────────────────────────────────────┐
│┌───────────────────────────────────┐│
││    Query Profiling Information    ││
│└───────────────────────────────────┘│
└─────────────────────────────────────┘
EXPLAIN ANALYZE FROM
     read_parquet('azure:<redacted>/*/*/*.parquet',  hive_partitioning = true)
     WHERE
     	path1 = 1 AND
        path2 = 1 AND
        column1 = '202152' AND
        column2 IN ('K1', 'K900')
┌─────────────────────────────────────┐
│┌───────────────────────────────────┐│
││        Total Time: 298.70s        ││
│└───────────────────────────────────┘│
└─────────────────────────────────────┘
┌───────────────────────────┐
│      RESULT_COLLECTOR     │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│             0             │
│          (0.00s)          │
└─────────────┬─────────────┘                             
┌─────────────┴─────────────┐
│      EXPLAIN_ANALYZE      │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│             0             │
│          (0.00s)          │
└─────────────┬─────────────┘                             
┌─────────────┴─────────────┐
│         PROJECTION        │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│         <redacted>        │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│            267            │
│          (0.00s)          │
└─────────────┬─────────────┘                             
┌─────────────┴─────────────┐
│           FILTER          │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│((column2 = 'K1'    │
│ ) OR (column2 = '  │
│          K900'))          │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│        EC: 3693410        │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│            267            │
│          (0.58s)          │
└─────────────┬─────────────┘                             
┌─────────────┴─────────────┐
│       READ_PARQUET        │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│         <redacted>        │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│   Filters: column1  │
│=202152 AND date_group...  │
│            NULL           │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│    File Filters: (CAST    │
│(path1 AS INTEGER...  │
│(path2 AS INTEGER)...  │
│             )             │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│        EC: 18467054       │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│           677527          │
│         (2381.50s)        │
└───────────────────────────┘                             

ON LOCAL MACHINE, SAMPLED BY 5%, WITHOUT IN QUERY FILTER:

┌─────────────────────────────────────┐
│┌───────────────────────────────────┐│
││    Query Profiling Information    ││
│└───────────────────────────────────┘│
└─────────────────────────────────────┘
EXPLAIN ANALYZE FROM
     read_parquet('C:<redacted>\*\*\*.parquet',  hive_partitioning = true)
     WHERE
     	path1 = 1 AND
        path2 = 1 AND
        column1 = '202152' AND
        column2 = 'K1'
┌─────────────────────────────────────┐
│┌───────────────────────────────────┐│
││        Total Time: 0.0868s        ││
│└───────────────────────────────────┘│
└─────────────────────────────────────┘
┌───────────────────────────┐
│      RESULT_COLLECTOR     │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│             0             │
│          (0.00s)          │
└─────────────┬─────────────┘                             
┌─────────────┴─────────────┐
│      EXPLAIN_ANALYZE      │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│             0             │
│          (0.00s)          │
└─────────────┬─────────────┘                             
┌─────────────┴─────────────┐
│         PROJECTION        │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│         <redacted>        │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│             7             │
│          (0.00s)          │
└─────────────┬─────────────┘                             
┌─────────────┴─────────────┐
│       READ_PARQUET        │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│         <redacted>        │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│   Filters: column1  │
│=202152 AND date_group...  │
│            NULL           │
│ column2=K1 AND     │
│ column2 IS NOT NULL│
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│File Filters: (path1 =│
│    1)(path2 = 1)      │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│         EC: 931615        │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│             7             │
│          (0.27s)          │
└───────────────────────────┘                       

ON LOCAL MACHINE, SAMPLED BY 5%, WITH IN QUERY FILTER:

┌─────────────────────────────────────┐
│┌───────────────────────────────────┐│
││    Query Profiling Information    ││
│└───────────────────────────────────┘│
└─────────────────────────────────────┘
EXPLAIN ANALYZE FROM
     read_parquet('C:<redacted>\*\*\*.parquet',  hive_partitioning = true)
     WHERE
     	path1 = 1 AND
         path2 = 1 AND
         column1 = '202152' AND
         column2 IN ('K1', 'K900')
┌─────────────────────────────────────┐
│┌───────────────────────────────────┐│
││         Total Time: 2.22s         ││
│└───────────────────────────────────┘│
└─────────────────────────────────────┘
┌───────────────────────────┐
│      RESULT_COLLECTOR     │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│             0             │
│          (0.00s)          │
└─────────────┬─────────────┘                             
┌─────────────┴─────────────┐
│      EXPLAIN_ANALYZE      │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│             0             │
│          (0.00s)          │
└─────────────┬─────────────┘                             
┌─────────────┴─────────────┐
│         PROJECTION        │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│         <redacted>        │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│             7             │
│          (0.00s)          │
└─────────────┬─────────────┘                             
┌─────────────┴─────────────┐
│           FILTER          │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│((column2 = 'K1'    │
│ ) OR (column2 = '  │
│          K900'))          │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│         EC: 186323        │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│             7             │
│          (0.03s)          │
└─────────────┬─────────────┘                             
┌─────────────┴─────────────┐
│       READ_PARQUET        │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│         <redacted>        │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│   Filters: column1  │
│=202152 AND date_group...  │
│            NULL           │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│File Filters: (path1 =│
│    1)(path2 = 1)      │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│         EC: 931615        │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│           32529           │
│          (16.64s)         │
└───────────────────────────┘                             

Running these queries I've also noticed that the 2 columns path1 and path2 are correctly filtered in local, but not on Azure Environment.
both colums are INTEGER TYPES because they are the result of a GROUPING_ID() fuction applied to the initial dataset.

But, as you suggested, when casting the data to type VARCHAR against azure, the casting operation is executed CORRECTLY,
even if the data is actually of type INTEGER:

┌─────────────────────────────────────┐
│┌───────────────────────────────────┐│
││    Query Profiling Information    ││
│└───────────────────────────────────┘│
└─────────────────────────────────────┘
EXPLAIN ANALYZE FROM
     read_parquet('azure:<redacted>/*/*/*.parquet',  hive_partitioning = true)
     WHERE
     	path1 = '1' AND
         path2 = '1' AND
         column1 = '202152' AND
         column2 = 'K1'
┌─────────────────────────────────────┐
│┌───────────────────────────────────┐│
││         Total Time: 1.60s         ││
│└───────────────────────────────────┘│
└─────────────────────────────────────┘
┌───────────────────────────┐
│      RESULT_COLLECTOR     │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│             0             │
│          (0.00s)          │
└─────────────┬─────────────┘                             
┌─────────────┴─────────────┐
│      EXPLAIN_ANALYZE      │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│             0             │
│          (0.00s)          │
└─────────────┬─────────────┘                             
┌─────────────┴─────────────┐
│         PROJECTION        │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│         <redacted>        │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│             38            │
│          (0.00s)          │
└─────────────┬─────────────┘                             
┌─────────────┴─────────────┐
│       READ_PARQUET        │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│         <redacted>        │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│   Filters: column1        │
│=202152 AND date_group...  │
│            NULL           │
│ column2=K1 AND            │
│ column2 IS NOT NULL       │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│File Filters: (path1 =     │
│  '1')(path2 = '1')        │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│        EC: 18467054       │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│             38            │
│          (6.38s)          │
└───────────────────────────┘

in any case, running the query with IN filter is still extremely slow:

┌─────────────────────────────────────┐
│┌───────────────────────────────────┐│
││    Query Profiling Information    ││
│└───────────────────────────────────┘│
└─────────────────────────────────────┘
EXPLAIN ANALYZE FROM
     read_parquet('azure:<redacted>/*/*/*.parquet',  hive_partitioning = true)
     WHERE
     	path1 = '1' AND
        path2 = '1' AND
        column1 = '202152' AND
        column2 IN ('K1', 'K900')
┌─────────────────────────────────────┐
│┌───────────────────────────────────┐│
││        Total Time: 466.98s        ││
│└───────────────────────────────────┘│
└─────────────────────────────────────┘
┌───────────────────────────┐
│      RESULT_COLLECTOR     │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│             0             │
│          (0.00s)          │
└─────────────┬─────────────┘                             
┌─────────────┴─────────────┐
│      EXPLAIN_ANALYZE      │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│             0             │
│          (0.00s)          │
└─────────────┬─────────────┘                             
┌─────────────┴─────────────┐
│         PROJECTION        │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│         <redacted>        │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│            267            │
│          (0.00s)          │
└─────────────┬─────────────┘                             
┌─────────────┴─────────────┐
│           FILTER          │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│((column2 = 'K1'    │
│ ) OR (column2 = '  │
│          K900'))          │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│        EC: 3693410        │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│            267            │
│          (0.87s)          │
└─────────────┬─────────────┘                             
┌─────────────┴─────────────┐
│       READ_PARQUET        │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│         <redacted>        │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│   Filters: column1  │
│=202152 AND date_group...  │
│            NULL           │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│File Filters: (path1 =│
│  '1')(path2 = '1')    │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│        EC: 18467054       │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│           677527          │
│         (3717.91s)        │
└───────────────────────────┘   

from duckdb.

Tmonster avatar Tmonster commented on June 27, 2024

Hi @MPizzotti ,

Thank you for your response.

Running these queries I've also noticed that the 2 columns path1 and path2 are correctly filtered in local, but not on Azure Environment.

Looking at the explain output for all queries, it seems all plans have File Filters, these are responsible for filtering out files that do not satisfy the path1 = '1' AND path2 = '1' conditions. These are present on the on the Azure environment queries you pasted, leading me to believe they are correctly applied. Maybe this is a copy paste error? Unless the cardinality of the READ_PARQUET function resulting in 677527 read tuples is the same as a count(*) of the same query without filters?

If the File Filters are properly applied in all cases, then I think it is a duplicate of #4295. We don't support pushing OR filters into Table/File reads yet.

In your original post, there seems to be a case where the File Filters disappear when you add the IN Filter. That is indeed a bug. If you can reproduce that case consistently (preferably locally with no azure) I will continue investigating.

from duckdb.

MPizzotti avatar MPizzotti commented on June 27, 2024

Hi @Tmonster,
thanks again for the feedback,
in merit to the following:

In your original post, there seems to be a case where the File Filters disappear when you add the IN Filter. That is indeed a bug. If you can reproduce that case consistently (preferably locally with no azure) I will continue investigating.

I can confirm that it was not a copy-paste error, but it's very inconsistent. Today i was unable to replicate it,
But when i opened the ticket, I'm sure 100% the bug as occurred.

For trying to replicate, i worked with the following environment:
DuckDB 0.10.3
Env: Databricks VM
image
The blob storage is mounted directly inside Databricks file system, so there isn't a need for "Azure" plugin.

Moreover, i also tried bumping duckDB to the latest version (1.0.0) and the problem hasn't occurred yet.

Thanks again for the support. probably is duplicate of #4295,

But still, the error regarding incorrect casting persist and is not mentioned on the above issue

from duckdb.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.