icelake's People

Contributors

chenzl25, liurenjie1024, nooberfsh, psiace, rinchannowww, tennyzhuang, wangrunji0408, xuanwo, xudong963, youngsofun, zenotme

icelake's Issues

Discussion: Is 'table metadata not loaded yet' reasonable?

I find that the current_table_metadata interface checks self.current_version and returns "table metadata not loaded yet" when it is 0.

    pub fn current_table_metadata(&self) -> Result<&types::TableMetadata> {
        if self.current_version == 0 {
            return Err(anyhow!("table metadata not loaded yet"));
        }
        ...
    }

But I find that load is called automatically in open or open_with_op, so I guess the error will only occur when self.current_version = metadata.last_updated_ms assigns a zero value (i.e. metadata.last_updated_ms = 0).
And we expect the user to fix this error like this:

let table = Table::open("");
table.current_table_metadata(); // get error
table.load(); // retry load 

But why not return an error directly from open when we can't load a valid current version? Then we can assert that current_version is valid in current_table_metadata, like:

let table = Table::open(""); // Will return an error if metadata.last_updated_ms == 0

pub fn current_table_metadata(&self) -> Result<&types::TableMetadata> {
    assert!(self.current_version != 0);
    ...
}

I think this way is more friendly for the user 🤔 When open succeeds, the user can be confident that the version is valid and doesn't need to check and handle this error in every later operation.
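A minimal sketch of the proposed behaviour, assuming a simplified Table type (the load placeholder and error message here are illustrative, not icelake's actual implementation):

use anyhow::{anyhow, Result};

struct TableMetadata {
    last_updated_ms: i64,
}

struct Table {
    current_version: i64,
    metadata: TableMetadata,
}

impl Table {
    // open fails fast instead of deferring the error to later calls.
    pub fn open(path: &str) -> Result<Self> {
        let metadata = Self::load(path)?;
        if metadata.last_updated_ms == 0 {
            return Err(anyhow!("invalid table metadata for {}", path));
        }
        Ok(Self {
            current_version: metadata.last_updated_ms,
            metadata,
        })
    }

    pub fn current_table_metadata(&self) -> &TableMetadata {
        // open() guarantees a valid version, so this is an internal invariant,
        // not an error the caller has to handle.
        debug_assert!(self.current_version != 0);
        &self.metadata
    }

    // Placeholder for actually reading metadata from storage.
    fn load(_path: &str) -> Result<TableMetadata> {
        Ok(TableMetadata { last_updated_ms: 1 })
    }
}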

Build spark image with dependencies rather than downloading them for every run.

In our integrations setup for spark: https://github.com/icelake-io/icelake/blob/c07e4f13c65cee1a20b01f73bc67f74a8d93f4f1/testdata/docker/spark-script/spark-connect-server.sh

This script downloads the dependent packages every time the container starts. It's convenient that we only need to depend on the official image, but the download is quite time consuming. We should maintain an image with the dependencies already downloaded to speed up tests.

Friendly write test framework

For now, our write tests are hard to extend to different kinds of tables: the schema is hard-coded.

So I propose a way to make the tests easier to extend.

We can describe the test in a YAML file like the following:

schema:
	init_sql:
		CREATE SCHEMA IF NOT EXISTS s1

src:  
	name: t1
	root: demo/s1/t1
	init_sql: 
		DROP TABLE IF EXISTS s1.t1,
        CREATE TABLE s1.t1
        (
            id long,
            v_int int,
            v_long long,
            v_float float,
            v_double double,
            v_varchar string,
            v_bool boolean,
            v_date date,
            v_timestamp timestamp,
            v_decimal decimal(36, 10),
            v_ts_ntz timestamp_ntz
        ) USING iceberg
        TBLPROPERTIES ('format-version'='2');


dst:
	name: tmp
	root: demo/s1/tmp
	init_sql: 
		DROP TABLE IF EXISTS s1.tmp,
        CREATE TABLE s1.tmp
        (
            id long,
            v_int int,
            v_long long,
            v_float float,
            v_double double,
            v_varchar string,
            v_bool boolean,
            v_date date,
            v_timestamp timestamp,
            v_decimal decimal(36, 10),
            v_ts_ntz timestamp_ntz
        ) USING iceberg
        TBLPROPERTIES ('format-version'='2');

write_data:
	schema: long,int,float,double,string,boolean,date,timestamp ..
	data:
		1,1,1000,1.1,1.11,1-1,true,2022-11-01,2022-11-01 11:03:02.123456+04:00,389.11111,2022-11-01 11:03:02.123456
		2,2,2000,2.2,2.22,2-2,false,2022-11-02,2022-11-02 11:03:02.123456+04:00,389.2222,2022-11-02 11:03:02.123456
		3,3,3000,3.3,3.33,3-3,true,2022-11-03,2022-11-03 11:03:02.123456+04:00,389.3333,2022-11-03 11:03:02.123456
		4,4,4000,4.4,4.44,4-4,false,2022-11-04,2022-11-04 11:04:02.123456+04:00,389.4444,2022-11-04 11:04:02.123456
		5,5,5000,5.5,5.55,5-5,true,2022-11-05,2022-11-05 11:05:02.123456+04:00,389.5555,2022-11-05 11:05:02.123456

query_sql: 
		select * from s1.t1;
		select * from s1.tmp;

query_sql:
		select * from s1.t1.partitions;
		select * from s1.tmp.partitions;
  1. In the init phase, the test framework will execute the init sql.
  2. In the write phase, the test framework will parse the schema and write the data into t1 and tmp separately, using icelake and another client (like spark-sql).
  3. In the check phase, the test framework will execute the two queries and check whether the outputs are the same.

The benefits of this approach:

  1. supports all kinds of schemas
  2. describes the whole test process in an external file.
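To make this concrete, here is a minimal sketch of how such a YAML test case could be loaded, assuming serde/serde_yaml and hypothetical struct and field names that mirror the description above (not icelake's existing test code; query_sql is modelled as a list of SQL strings):

use serde::Deserialize;

// Hypothetical structs mirroring the YAML description above.
#[derive(Debug, Deserialize)]
struct TestCase {
    schema: InitSection,
    src: TableSection,
    dst: TableSection,
    write_data: WriteData,
    query_sql: Vec<String>,
}

#[derive(Debug, Deserialize)]
struct InitSection {
    init_sql: String,
}

#[derive(Debug, Deserialize)]
struct TableSection {
    name: String,
    root: String,
    init_sql: String,
}

#[derive(Debug, Deserialize)]
struct WriteData {
    schema: String,
    data: String,
}

fn load_case(path: &str) -> anyhow::Result<TestCase> {
    let content = std::fs::read_to_string(path)?;
    Ok(serde_yaml::from_str(&content)?)
}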

support transform expression

We need the transform expression to support partition write.

At a high level, we need an interface like:

fn partition_records(record_batch, partition_spec) -> HashMap<StructValue, RecordBatch>;

and the implementation computes the transform in a process that looks like:

fn partition_records(record_batch, partition_spec) -> HashMap<StructValue, RecordBatch> {
    ...
    for partition_field in partition_spec {
        // 1. Get the transform expression
        let expr = transform_expr(partition_field.transform);
        // 2. Get the source column
        let column = record_batch.column(partition_field.source_field);
        // 3. Compute the transform
        let res_column = expr.eval(column);
    }
    ...
}

So we need the expression interface, and to implement it for the various kinds of transform, like:

trait Expr {
    fn eval(&self, array: ArrayRef) -> ArrayRef;
}

impl Expr for Identity {
    fn eval(&self, array: ArrayRef) -> ArrayRef {
        array
    }
}

...

For the implementation of the expressions, we can make use of arrow's compute module. I'm investigating it.
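As an illustration, a rough sketch of one transform built on arrow's primitive arrays, reusing the Expr trait from above (the TruncateInt32 type and its details are assumptions for this sketch, not icelake's actual implementation; an arrow compute kernel could replace the manual loop):

use std::sync::Arc;

use arrow::array::{Array, ArrayRef, Int32Array};

trait Expr {
    fn eval(&self, array: ArrayRef) -> ArrayRef;
}

// Hypothetical truncate transform for int32 source columns: v -> v - (v mod W),
// following the iceberg spec's definition of truncate for integers.
struct TruncateInt32 {
    width: i32,
}

impl Expr for TruncateInt32 {
    fn eval(&self, array: ArrayRef) -> ArrayRef {
        let input = array
            .as_any()
            .downcast_ref::<Int32Array>()
            .expect("truncate transform expects an int32 column");
        // Element-wise transform; null values stay null.
        let output: Int32Array = input
            .iter()
            .map(|v| v.map(|v| v - v.rem_euclid(self.width)))
            .collect();
        Arc::new(output)
    }
}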

Incompatible schema using apache-avro rust

I tried to support writing decimals in avro and found that there is an incompatibility when using the apache-avro rust schema directly.

In the iceberg spec, decimal is represented in avro as:

{ 
  "type": "fixed",
  "size": minBytesRequired(P),
  "logicalType": "decimal",
  "precision": P,
  "scale": S 
}

But for Schema::Decimal in apache-avro rust, the serialization makes the type look like:

{ 
  "type": {
      "type": "fixed",
      "size": minBytesRequired(P)
  },
  "logicalType": "decimal",
  "precision": P,
  "scale": S 
}

And it's incompatible with the type spec in iceberg.

I have tried modifying the serialization locally and verified that it can work.

            Schema::Decimal(DecimalSchema {
                ref scale,
                ref precision,
                ref inner,
            }) => {
                let mut map = serializer.serialize_map(None)?;
                map.serialize_entry("type", "fixed")?;
                map.serialize_entry("name", "decimal_36_10")?;
                map.serialize_entry("size", &16)?;
                map.serialize_entry("logicalType", "decimal")?;
                map.serialize_entry("scale", scale)?;
                map.serialize_entry("precision", precision)?;
                map.end()
            }

There seems to be no way provided by avro-rust to customize the serialization. We could fork avro-rust and modify it to make this work, but that solution is not great. Maybe we can start a conversation with the apache-avro rust community.

Expose metrics to users.

As a library, we should be able to expose metrics to external users for better observability. Any suggestions?

Add Table Operations to allow reading data via IceLake

icelake 0.0.1 added most types that we need to load table metadata. We need to implement table operations to allow users to read data via us.

Tasks

  • Get current table metadata
  • Get all snapshots of current table metadata
  • Get all manifest lists from current snapshot
  • #31

Notes

  • We will not add partition support before v0.1, so we can ignore partition spec and sorting order.
  • The API should be polished afterwards.

Supports rest catalog.

The current implementation only supports the filesystem catalog. In production, jdbc/hive metastore catalogs are widely used. To support them, we need to support the rest catalog so that we don't need to interact with them directly.

Implement append only task writer.

An append-only task writer accepts an optional partitioner and a file appender factory as arguments. When it receives records, it dispatches them to different file writers according to the partition key (generated by the partitioner) and inserts them. When it is finished, it returns the generated data file structs.

Note that this will be the API used directly by compute engines such as RisingWave and Ballista. We can refer to the following implementation as an example:

https://github.com/apache/iceberg/blob/e340ad5be04e902398c576f431810c3dfa4fe717/core/src/main/java/org/apache/iceberg/io/PartitionedFanoutWriter.java#L28
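A minimal sketch of the fan-out shape described above, using simplified stand-in types (Record, DataFile and FileAppender here are placeholders for illustration, not icelake's actual types):

use std::collections::HashMap;

#[derive(Debug)]
struct Record {
    partition: String,
    payload: String,
}

#[derive(Debug)]
struct DataFile {
    partition: String,
    record_count: usize,
}

// A toy file appender that just counts records per partition.
struct FileAppender {
    partition: String,
    count: usize,
}

impl FileAppender {
    fn write(&mut self, _record: &Record) {
        self.count += 1;
    }
    fn close(self) -> DataFile {
        DataFile { partition: self.partition, record_count: self.count }
    }
}

// Append-only task writer: fan records out to one appender per partition key.
struct AppendOnlyTaskWriter {
    appenders: HashMap<String, FileAppender>,
}

impl AppendOnlyTaskWriter {
    fn new() -> Self {
        Self { appenders: HashMap::new() }
    }

    fn write(&mut self, record: Record) {
        let appender = self
            .appenders
            .entry(record.partition.clone())
            .or_insert_with(|| FileAppender { partition: record.partition.clone(), count: 0 });
        appender.write(&record);
    }

    // Close all appenders and return the generated data file structs.
    fn close(self) -> Vec<DataFile> {
        self.appenders.into_values().map(FileAppender::close).collect()
    }
}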

Remove openapi generated code.

In the development of #181, I found that the openapi generated code is useless and has some bugs, so I tried to rewrite it with reqwest, and surprisingly it works well. So I'll remove the openapi generated code.

Integration test of file writer.

After we finish #37, #38 and #39, we should be able to create iceberg tables. We need to add integration tests to verify that they can be read by other libraries such as spark.

discussion: sync with the apache Iceberg community

Hi there,

There is a thread regarding C++/Rust support for Iceberg on the community's dev mailing list. Here is the link: https://lists.apache.org/thread/7njzq6b0m9qbjmbtgrtkhmj2nnmhso4t I wonder if you folks could reply to this thread and ask for feedback from the Iceberg community? I think most community members would welcome this project. However, there will be a lot to discuss in terms of the technical design, where you may need professional advice from the community. With that advice and feedback, we will be able to see whether this project is heading in the right direction.

discussion: use ordered_float instead of f32, f64

To support partition write, we need something like HashMap<StructValue, _>. This means that we need to impl Eq for AnyValue. To do this, we need to replace f32 and f64 with ordered floats (e.g. https://crates.io/crates/ordered-float).

As https://internals.rust-lang.org/t/f32-f64-should-implement-hash/5436/33#:~:text=If%20you%20want%20to%20have%20a%20content%20hash%20of%20a%20struct%20that%20includes%20floats says, we could also use serde to get a content hash of a struct that includes floats. But I guess there may be other scenarios where we need Eq.
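A minimal sketch of what this could look like with the ordered-float crate, using simplified stand-ins for icelake's value types (the AnyValue/StructValue definitions below are illustrative only):

use std::collections::HashMap;

use ordered_float::OrderedFloat; // crates.io/crates/ordered-float

// Simplified value types; OrderedFloat gives floats Eq, Ord and Hash.
#[derive(Debug, PartialEq, Eq, Hash)]
enum AnyValue {
    Int(i32),
    Long(i64),
    Float(OrderedFloat<f32>),
    Double(OrderedFloat<f64>),
}

#[derive(Debug, PartialEq, Eq, Hash)]
struct StructValue(Vec<AnyValue>);

fn main() {
    // With Eq + Hash derived, StructValue can be used as a HashMap key,
    // which is what partitioned writes need.
    let mut partitions: HashMap<StructValue, Vec<u64>> = HashMap::new();
    let key = StructValue(vec![AnyValue::Double(OrderedFloat(1.5))]);
    partitions.entry(key).or_default().push(42);
    println!("{partitions:?}");
}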

RoadMap of IceLake v0.1

Iceberg is an open table format designed for analytic datasets. However, the lack of a mature Rust binding for Iceberg makes it difficult to integrate with databases like Databend.

IceLake intends to fill this gap. By developing icelake, I expect to build up an open ecosystem in which:

  • Users can read/write iceberg tables on ANY storage service, like s3, gcs, azblob, hdfs and so on.
  • ANY database can integrate with icelake to facilitate reading and writing of iceberg tables.
  • NATIVE support for transmuting to and from arrow is provided.
  • Bindings are provided so that other languages can operate on iceberg tables powered by the rust core.

For IceLake v0.1, I expect to implement the following features:

  • Set up the project layout and build a development loop so that the community can take part.
  • Support reading data for iceberg v2 from storage services (only limited file formats will be supported).
  • Evaluate our design by integrating it with databend.

This project is sponsored by Databend Labs

Tracking: Support partition write

BTW, do we plan to support partitions in the 0.1 version? It seems the whole API can be built without partitions.

Yes, it's necessary for practical use.

As discussed, supporting partition write is important. To support it, we need to:

  • add partition field in data file
  • support partition writer #153
  • #102

Refactor integration tests to use docker compose.

In our integration tests, we need to set up several docker images (for minio, spark, etc.) to verify results. I previously introduced testcontainers-rs to set them up in code, with isolation between different tests. However, I realized that there are several problems with it:

  1. Difficult to maintain for complex setups. It's hard to see the dependencies from the code.
  2. Difficult to debug. It's hard to control the lifecycle of containers manually, which is painful when there are problems in the container settings.

The only advantage is that it provides isolation between different tests, but I found that docker compose also provides isolation, so we can run several projects in parallel as long as we follow the rules below:

  1. Give each run a project name by -p.
  2. Don't specify container names explicitly. Docker compose gives each container a name like this: <project name>-<service name>-1.
  3. Don't specify port mapping, but expose ports only.
  4. Don't specify network explicitly. Docker compose will add a network for each run.

An example can be found at https://github.com/icelake-io/icelake/blob/main/icelake/tests/rest_catalog_tests.rs

The only remaining test is https://github.com/icelake-io/icelake/blob/6035b7bdc41e04b676bc9017c0e457c536936afd/icelake/tests/insert_tests.rs

We should refactor to use docker compose and remove testcontainers.
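For instance, a test could follow rule 1 by giving each run its own project name when shelling out to docker compose. This is a rough sketch, not the actual helper used in the repo; the project name, compose file path and flags are assumptions:

use std::process::Command;

// Bring up a compose project with an isolated, per-run project name.
fn compose_up(project: &str, compose_file: &str) -> std::io::Result<()> {
    let status = Command::new("docker")
        .args(["compose", "-p", project, "-f", compose_file, "up", "-d", "--wait"])
        .status()?;
    assert!(status.success(), "docker compose up failed for project {project}");
    Ok(())
}

// Tear the project down and remove its volumes when the test finishes.
fn compose_down(project: &str, compose_file: &str) -> std::io::Result<()> {
    let status = Command::new("docker")
        .args(["compose", "-p", project, "-f", compose_file, "down", "-v"])
        .status()?;
    assert!(status.success(), "docker compose down failed for project {project}");
    Ok(())
}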
