ZacBlanco commented on June 26, 2024

Some more follow-up information:

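Comparison to Trino

I also tested Trino's Iceberg implementation and found that its data sizes for variable-width Iceberg columns are also generally wrong. In the output below, SHOW STATS reports a data_size of 554688 for the comment column, while the actual summed data size is 727364.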

trino:tpch> SHOW STATS FOR (SELECT comment from orders);
 column_name | data_size | distinct_values_count | nulls_fraction | row_count | low_value | high_value
-------------+-----------+-----------------------+----------------+-----------+-----------+------------
 comment     |  554688.0 |               14756.0 |            0.0 |      NULL | NULL      | NULL
 NULL        |      NULL |                  NULL |           NULL |   15000.0 | NULL      | NULL
(2 rows)

----

trino:tpch> select "$internal$sum_data_size_for_stats"(comment) from orders;
 _col0
--------
 727364
(1 row)

Understanding the Iceberg code

After digging into the Iceberg library, I found that (at least for the Parquet format) the column stats are generated by this line:

https://github.com/apache/iceberg/blob/560b72344350816eb31f9a165c2947caa7381a9b/parquet/src/main/java/org/apache/iceberg/parquet/ParquetUtil.java#L127

The call used here is the Parquet file footer's getTotalSize method, which represents the bytes on disk. However, Parquet footers also have a getTotalUncompressedSize method. I tested how this value compares when used to generate the column statistics and found that it is much closer to the true value: in the Presto output below, SHOW STATS reports 745598 against an actual data size of 727364.
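To make the difference between the two footer values concrete, here is a minimal sketch in Python rather than Iceberg's Java. The `ColumnChunk` class, the `column_sizes` function, and all of the numbers are hypothetical stand-ins for illustration; this is not Iceberg's actual code, only the general idea of summing one per-chunk size per field across row groups.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ColumnChunk:
    """Hypothetical stand-in for one column chunk entry in a Parquet footer."""
    field_id: int
    total_size: int               # bytes on disk (what getTotalSize reports)
    total_uncompressed_size: int  # bytes after decompression


def column_sizes(chunks, use_uncompressed=False):
    """Sum the chosen per-chunk size for every field id across all row groups."""
    sizes = defaultdict(int)
    for chunk in chunks:
        if use_uncompressed:
            sizes[chunk.field_id] += chunk.total_uncompressed_size
        else:
            sizes[chunk.field_id] += chunk.total_size
    return dict(sizes)


# Made-up numbers: two row groups, each holding a chunk for fields 1 and 2.
chunks = [
    ColumnChunk(field_id=1, total_size=300, total_uncompressed_size=400),
    ColumnChunk(field_id=2, total_size=80, total_uncompressed_size=80),
    ColumnChunk(field_id=1, total_size=250, total_uncompressed_size=350),
    ColumnChunk(field_id=2, total_size=90, total_uncompressed_size=90),
]

on_disk = column_sizes(chunks)
uncompressed = column_sizes(chunks, use_uncompressed=True)
```

In this toy data, field 1 compresses well, so its on-disk total understates the size of the values, while field 2 is stored uncompressed and both totals agree.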

presto:tpch> SHOW STATS FOR (select comment from orders);
 column_name | data_size | distinct_values_count | nulls_fraction | row_count | low_value | high_value
-------------+-----------+-----------------------+----------------+-----------+-----------+------------
 comment     |  745598.0 | NULL                  |            0.0 | NULL      | NULL      | NULL
 NULL        | NULL      | NULL                  | NULL           |   15000.0 | NULL      | NULL
(2 rows)

Query 20240316_002952_00016_muzsz, FINISHED, 1 node
Splits: 1 total, 1 done (100.00%)
[Latency: client-side: 76ms, server-side: 61ms] [0 rows, 0B] [0 rows/s, 0B/s]

presto:tpch> select sum_data_size_for_stats(comment) from orders;
 _col0
--------
 727364
(1 row)
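As a quick sanity check on the numbers above, the following sketch computes how far each reported data_size is from the actual summed size. The only inputs are the three values from the query outputs in this comment; the function and variable names are mine, and it assumes Trino's 554688 reflects the on-disk size.

```python
def relative_error(estimate: float, actual: float) -> float:
    """Signed relative error of an estimate against the actual value."""
    return (estimate - actual) / actual


ACTUAL = 727364        # sum_data_size_for_stats(comment)
ON_DISK = 554688       # data_size reported by Trino
UNCOMPRESSED = 745598  # data_size reported when using getTotalUncompressedSize

on_disk_err = relative_error(ON_DISK, ACTUAL)
uncompressed_err = relative_error(UNCOMPRESSED, ACTUAL)

print(f"on-disk estimate:      {on_disk_err:+.1%}")       # -23.7%
print(f"uncompressed estimate: {uncompressed_err:+.1%}")  # +2.5%
```

So the on-disk figure underestimates the column by roughly a quarter, while the uncompressed figure overshoots by a few percent.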

I think we should be able to contribute this change to the Iceberg community, but it might be hard to get them to accept it, for two reasons:

  1. The spec states that the data size field in the manifest should be the on-disk size. I don't think this is a very useful metric for the optimizer unless we can consistently convert this to a "true" size value.
  2. Other file formats don't have support for reading the uncompressed size - e.g. in Iceberg's ORC implementation, they call getBytesOnDisk and there isn't a corresponding method for uncompressed or "in memory" size.


ZacBlanco commented on June 26, 2024

cc: @aaneja @ClarenceThreepwood
