ZacBlanco commented on June 26, 2024

[Iceberg] variable-width column data sizes are generally wrong

Comments (2)

ZacBlanco commented on June 26, 2024

Some more follow-up information:

Comparison to Trino

I also tested Trino's Iceberg implementation and found that its data sizes for variable-width Iceberg columns are also generally wrong. Below, the reported data_size (554688) is well under the actual summed size of the column (727364).

trino:tpch> SHOW STATS FOR (SELECT comment from orders);
 column_name | data_size | distinct_values_count | nulls_fraction | row_count | low_value | high_value
-------------+-----------+-----------------------+----------------+-----------+-----------+------------
 comment     |  554688.0 |               14756.0 |            0.0 |      NULL | NULL      | NULL
 NULL        |      NULL |                  NULL |           NULL |   15000.0 | NULL      | NULL
(2 rows)

----

trino:tpch> select "$internal$sum_data_size_for_stats"(comment) from orders;
 _col0
--------
 727364
(1 row)

Understanding the Iceberg code

After digging into the Iceberg library, I found that (at least for the Parquet format) the column stats are generated by this line:

https://github.com/apache/iceberg/blob/560b72344350816eb31f9a165c2947caa7381a9b/parquet/src/main/java/org/apache/iceberg/parquet/ParquetUtil.java#L127

The call used there is the Parquet file footer's getTotalSize method, which reports the bytes on disk. However, Parquet footers also have a getTotalUncompressedSize method. I tested how that value compares when it is used to generate the column statistics instead and found that it is much closer to the true value:

presto:tpch> SHOW STATS FOR (select comment from orders);
 column_name | data_size | distinct_values_count | nulls_fraction | row_count | low_value | high_value
-------------+-----------+-----------------------+----------------+-----------+-----------+------------
 comment     |  745598.0 | NULL                  |            0.0 | NULL      | NULL      | NULL
 NULL        | NULL      | NULL                  | NULL           |   15000.0 | NULL      | NULL
(2 rows)

Query 20240316_002952_00016_muzsz, FINISHED, 1 node
Splits: 1 total, 1 done (100.00%)
[Latency: client-side: 76ms, server-side: 61ms] [0 rows, 0B] [0 rows/s, 0B/s]

presto:tpch> select sum_data_size_for_stats(comment) from orders;
 _col0
--------
 727364
(1 row)
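
To put the two estimates side by side, here is a minimal Python sketch that computes each estimate's signed error relative to the actual size. The numbers are copied from the query output above; the constant and function names are illustrative only, and the sketch assumes the 554688 figure reported by Trino reflects the on-disk getTotalSize value.

```python
# Values copied from the SHOW STATS and sum_data_size_for_stats output above.
ACTUAL_SIZE = 727364               # sum_data_size_for_stats(comment)
ON_DISK_ESTIMATE = 554688.0        # data_size assumed to come from getTotalSize
UNCOMPRESSED_ESTIMATE = 745598.0   # data_size using getTotalUncompressedSize


def relative_error(estimate: float, actual: float) -> float:
    """Signed error of the estimate as a fraction of the actual value."""
    return (estimate - actual) / actual


on_disk_error = relative_error(ON_DISK_ESTIMATE, ACTUAL_SIZE)
uncompressed_error = relative_error(UNCOMPRESSED_ESTIMATE, ACTUAL_SIZE)

print(f"on-disk estimate:      {on_disk_error:+.1%}")
print(f"uncompressed estimate: {uncompressed_error:+.1%}")
```

For this one column, the on-disk figure underestimates the actual size by roughly 24%, while the uncompressed figure overestimates it by roughly 2.5%.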

I think we should be able to contribute this change to the Iceberg community, but it might be hard to get them to accept it:

The spec states that the data size field in the manifest should be the on-disk size. I don't think this is a very useful metric for the optimizer unless we can consistently convert it to a "true" size value.
Other file formats don't support reading the uncompressed size. For example, Iceberg's ORC implementation calls getBytesOnDisk, and there isn't a corresponding method for the uncompressed or "in memory" size.
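
As a rough illustration of what "converting" an on-disk size could look like, the Python sketch below scales the on-disk size by an expansion ratio calibrated from a column whose true size is known. This is purely hypothetical: the function names are invented for illustration, nothing like this exists in Presto or Iceberg as far as this issue shows, and the ratio depends on codec, encoding, and data, which is exactly why a consistent conversion is hard.

```python
def calibrate_expansion_ratio(actual_bytes: float, on_disk_bytes: float) -> float:
    """Ratio of true data size to on-disk size for a column with known true size."""
    if on_disk_bytes <= 0:
        raise ValueError("on_disk_bytes must be positive")
    return actual_bytes / on_disk_bytes


def estimate_true_size(on_disk_bytes: float, expansion_ratio: float) -> float:
    """Scale an on-disk size by a previously calibrated ratio (hypothetical)."""
    return on_disk_bytes * expansion_ratio


# Calibrate on the orders.comment numbers from this issue.
ratio = calibrate_expansion_ratio(actual_bytes=727364, on_disk_bytes=554688)
print(f"expansion ratio for orders.comment: {ratio:.3f}")

# Applying the ratio back to the same column trivially recovers the true size;
# for any other column or file the ratio would only be a guess.
print(estimate_true_size(554688, ratio))
```

The ratio here comes out to about 1.31 for this column; a different column, codec, or file format could give a very different value.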

ZacBlanco commented on June 26, 2024

cc: @aaneja @ClarenceThreepwood
