icelake's People

Contributors

chenzl25, liurenjie1024, nooberfsh, psiace, rinchannowww, tennyzhuang, wangrunji0408, xuanwo, xudong963, youngsofun, zenotme

icelake's Issues

Discussion: Is 'table metadata not loaded yet' reasonable?

I find that the current_table_metadata interface checks self.current_version and returns "table metadata not loaded yet" when it is 0.

    pub fn current_table_metadata(&self) -> Result<&types::TableMetadata> {
        if self.current_version == 0 {
            return Err(anyhow!("table metadata not loaded yet"));
        }
        ...
    }

But I find that load is called automatically in open or open_with_op, so I guess the error will only occur when self.current_version = metadata.last_updated_ms assigns a zero value (i.e. metadata.last_updated_ms = 0).
And we expect the user to fix this error like this:

let table = Table::open("");
table.current_table_metadata(); // get error
table.load(); // retry load 

But why not return an error directly from open when we can't load a valid current version? Then we can assert that current_version is valid in current_table_metadata, like:

let table = Table::open(""); // Will return an error if metadata.last_updated_ms == 0

pub fn current_table_metadata(&self) -> Result<&types::TableMetadata> {
    assert!(self.current_version != 0);
    ...
}

I think this way is more friendly for the user 🤔 When open succeeds, the user can be confident that the version is valid and doesn't need to check and handle this error in every later operation.
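A minimal sketch of the proposed behaviour, assuming a simplified Table type (the load placeholder and error message here are illustrative, not icelake's actual implementation):

use anyhow::{anyhow, Result};

struct TableMetadata {
    last_updated_ms: i64,
}

struct Table {
    current_version: i64,
    metadata: TableMetadata,
}

impl Table {
    // open fails fast instead of deferring the error to later calls.
    pub fn open(path: &str) -> Result<Self> {
        let metadata = Self::load(path)?;
        if metadata.last_updated_ms == 0 {
            return Err(anyhow!("invalid table metadata for {}", path));
        }
        Ok(Self {
            current_version: metadata.last_updated_ms,
            metadata,
        })
    }

    pub fn current_table_metadata(&self) -> &TableMetadata {
        // open() guarantees a valid version, so this is an internal invariant,
        // not an error the caller has to handle.
        debug_assert!(self.current_version != 0);
        &self.metadata
    }

    // Placeholder for actually reading metadata from storage.
    fn load(_path: &str) -> Result<TableMetadata> {
        Ok(TableMetadata { last_updated_ms: 1 })
    }
}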

Build spark image with dependencies rather than downloading them for every run.

In our integrations setup for spark: https://github.com/icelake-io/icelake/blob/c07e4f13c65cee1a20b01f73bc67f74a8d93f4f1/testdata/docker/spark-script/spark-connect-server.sh

This script downloads the dependent packages every time the container starts. It's convenient that we only need to depend on the official image, but the download is quite time consuming. We should maintain an image with the dependencies already downloaded to speed up tests.

Friendly write test framework

For now, our write tests are hard to extend to different kinds of tables: the schema is hard-coded.

So I propose a way to make the tests easier to extend.

We can describe the test in a YAML file like the following:

schema:
	init_sql:
		CREATE SCHEMA IF NOT EXISTS s1

src:  
	name: t1
	root: demo/s1/t1
	init_sql: 
		DROP TABLE IF EXISTS s1.t1,
        CREATE TABLE s1.t1
        (
            id long,
            v_int int,
            v_long long,
            v_float float,
            v_double double,
            v_varchar string,
            v_bool boolean,
            v_date date,
            v_timestamp timestamp,
            v_decimal decimal(36, 10),
            v_ts_ntz timestamp_ntz
        ) USING iceberg
        TBLPROPERTIES ('format-version'='2');


dst:
	name: tmp
	root: demo/s1/tmp
	init_sql: 
		DROP TABLE IF EXISTS s1.tmp,
        CREATE TABLE s1.tmp
        (
            id long,
            v_int int,
            v_long long,
            v_float float,
            v_double double,
            v_varchar string,
            v_bool boolean,
            v_date date,
            v_timestamp timestamp,
            v_decimal decimal(36, 10),
            v_ts_ntz timestamp_ntz
        ) USING iceberg
        TBLPROPERTIES ('format-version'='2');

write_data:
	schema: long,int,float,double,string,boolean,date,timestamp ..
	data:
		1,1,1000,1.1,1.11,1-1,true,2022-11-01,2022-11-01 11:03:02.123456+04:00,389.11111,2022-11-01 11:03:02.123456
		2,2,2000,2.2,2.22,2-2,false,2022-11-02,2022-11-02 11:03:02.123456+04:00,389.2222,2022-11-02 11:03:02.123456
		3,3,3000,3.3,3.33,3-3,true,2022-11-03,2022-11-03 11:03:02.123456+04:00,389.3333,2022-11-03 11:03:02.123456
		4,4,4000,4.4,4.44,4-4,false,2022-11-04,2022-11-04 11:04:02.123456+04:00,389.4444,2022-11-04 11:04:02.123456
		5,5,5000,5.5,5.55,5-5,true,2022-11-05,2022-11-05 11:05:02.123456+04:00,389.5555,2022-11-05 11:05:02.123456

query_sql: 
		select * from s1.t1;
		select * from s1.tmp;

query_sql:
		select * from s1.t1.partitions;
		select * from s1.tmp.partitions;
  1. In the init phase, the test framework will execute the init sql.
  2. In the write phase, the test framework will parse the schema and write the data into t1 and tmp separately, using icelake and another client (like spark-sql).
  3. In the check phase, the test framework will execute the two queries and check whether the outputs are the same.

The benefits of this approach:

  1. supports all kinds of schemas
  2. describes the whole test process in an external file.
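To make this concrete, here is a minimal sketch of how such a YAML test case could be loaded, assuming serde/serde_yaml and hypothetical struct and field names that mirror the description above (not icelake's existing test code; query_sql is modelled as a list of SQL strings):

use serde::Deserialize;

// Hypothetical structs mirroring the YAML description above.
#[derive(Debug, Deserialize)]
struct TestCase {
    schema: InitSection,
    src: TableSection,
    dst: TableSection,
    write_data: WriteData,
    query_sql: Vec<String>,
}

#[derive(Debug, Deserialize)]
struct InitSection {
    init_sql: String,
}

#[derive(Debug, Deserialize)]
struct TableSection {
    name: String,
    root: String,
    init_sql: String,
}

#[derive(Debug, Deserialize)]
struct WriteData {
    schema: String,
    data: String,
}

fn load_case(path: &str) -> anyhow::Result<TestCase> {
    let content = std::fs::read_to_string(path)?;
    Ok(serde_yaml::from_str(&content)?)
}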

support transform expression

We need the transform expression to support partition write.

At a high level, we need an interface like:

fn partition_records(record_batch, partition_spec) -> HashMap<StructValue, RecordBatch>;

and the implementation computes the transform in a process that looks like:

fn partition_records(record_batch, partition_spec) -> HashMap<StructValue, RecordBatch> {
    ...
    for partition_field in partition_spec {
        // 1. Get the transform expression
        let expr = transform_expr(partition_field.transform);
        // 2. Get the source column
        let column = record_batch.column(partition_field.source_field);
        // 3. Compute the transform
        let res_column = expr.eval(column);
    }
    ...
}

So we need the expression interface, and to implement it for the various kinds of transform, like:

trait Expr {
    fn eval(&self, array: ArrayRef) -> ArrayRef;
}

impl Expr for Identity {
    fn eval(&self, array: ArrayRef) -> ArrayRef {
        array
    }
}

...

For the implementation of the expressions, we can make use of arrow's compute module. I'm investigating it.
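As an illustration, a rough sketch of one transform built on arrow's primitive arrays, reusing the Expr trait from above (the TruncateInt32 type and its details are assumptions for this sketch, not icelake's actual implementation; an arrow compute kernel could replace the manual loop):

use std::sync::Arc;

use arrow::array::{Array, ArrayRef, Int32Array};

trait Expr {
    fn eval(&self, array: ArrayRef) -> ArrayRef;
}

// Hypothetical truncate transform for int32 source columns: v -> v - (v mod W),
// following the iceberg spec's definition of truncate for integers.
struct TruncateInt32 {
    width: i32,
}

impl Expr for TruncateInt32 {
    fn eval(&self, array: ArrayRef) -> ArrayRef {
        let input = array
            .as_any()
            .downcast_ref::<Int32Array>()
            .expect("truncate transform expects an int32 column");
        // Element-wise transform; null values stay null.
        let output: Int32Array = input
            .iter()
            .map(|v| v.map(|v| v - v.rem_euclid(self.width)))
            .collect();
        Arc::new(output)
    }
}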

Incompatible schema using apache-avro rust

I tried to support writing decimals in avro and found that there is an incompatibility when using the apache-avro rust schema directly.

In the iceberg spec, decimal is represented in avro as:

{ 
  "type": "fixed",
  "size": minBytesRequired(P),
  "logicalType": "decimal",
  "precision": P,
  "scale": S 
}

But for Schema::Decimal in apache-avro rust, the serialization makes the type look like:

{ 
  "type": {
      "type": "fixed",
      "size": minBytesRequired(P)
  },
  "logicalType": "decimal",
  "precision": P,
  "scale": S 
}

And it's incompatible with the type spec in iceberg.

I have tried modifying the serialization locally and verified that it can work.

            Schema::Decimal(DecimalSchema {
                ref scale,
                ref precision,
                ref inner,
            }) => {
                let mut map = serializer.serialize_map(None)?;
                map.serialize_entry("type", "fixed")?;
                map.serialize_entry("name", "decimal_36_10")?;
                map.serialize_entry("size", &16)?;
                map.serialize_entry("logicalType", "decimal")?;
                map.serialize_entry("scale", scale)?;
                map.serialize_entry("precision", precision)?;
                map.end()
            }

There seems to be no way provided by avro-rust to customize the serialization. We could fork avro-rust and modify it to make this work, but that solution is not great. Maybe we can start a conversation with the apache-avro rust community.

Expose metrics to users.

As a library, we should be able to expose metrics to external users for better observability. Any suggestions?

Add Table Operations to allow reading data via IceLake

icelake 0.0.1 added most types that we need to load table metadata. We need to implement table operations to allow users to read data via us.

Tasks

  • Get current table metadata
  • Get all snapshots of current table metadata
  • Get all manifest lists from current snapshot
  • #31

Notes

  • We will not add partition support before v0.1, so we can ignore partition spec and sorting order.
  • The API should be polished afterwards.

Supports rest catalog.

The current implementation only supports the filesystem catalog. In production, jdbc/hive metastore catalogs are widely used. To support them, we need to support the rest catalog so that we don't need to interact with them directly.

Implement append only task writer.

An append-only task writer accepts an optional partitioner and a file appender factory as arguments. When it receives records, it dispatches them to different file writers according to the partition key (generated by the partitioner) and inserts them. When it is finished, it returns the generated data file structs.

Note that this will be the API used directly by compute engines such as RisingWave and Ballista. We can refer to the following implementation as an example:

https://github.com/apache/iceberg/blob/e340ad5be04e902398c576f431810c3dfa4fe717/core/src/main/java/org/apache/iceberg/io/PartitionedFanoutWriter.java#L28
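A minimal sketch of the fan-out shape described above, using simplified stand-in types (Record, DataFile and FileAppender here are placeholders for illustration, not icelake's actual types):

use std::collections::HashMap;

#[derive(Debug)]
struct Record {
    partition: String,
    payload: String,
}

#[derive(Debug)]
struct DataFile {
    partition: String,
    record_count: usize,
}

// A toy file appender that just counts records per partition.
struct FileAppender {
    partition: String,
    count: usize,
}

impl FileAppender {
    fn write(&mut self, _record: &Record) {
        self.count += 1;
    }
    fn close(self) -> DataFile {
        DataFile { partition: self.partition, record_count: self.count }
    }
}

// Append-only task writer: fan records out to one appender per partition key.
struct AppendOnlyTaskWriter {
    appenders: HashMap<String, FileAppender>,
}

impl AppendOnlyTaskWriter {
    fn new() -> Self {
        Self { appenders: HashMap::new() }
    }

    fn write(&mut self, record: Record) {
        let appender = self
            .appenders
            .entry(record.partition.clone())
            .or_insert_with(|| FileAppender { partition: record.partition.clone(), count: 0 });
        appender.write(&record);
    }

    // Close all appenders and return the generated data file structs.
    fn close(self) -> Vec<DataFile> {
        self.appenders.into_values().map(FileAppender::close).collect()
    }
}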

Remove openapi generated code.

In the development of #181, I found that the openapi generated code is useless and has some bugs, so I tried to rewrite it with reqwest, and surprisingly it works well. So I'll remove the openapi generated code.

Integration test of file writer.

After we finish #37, #38 and #39, we should be able to create iceberg tables. We need to add integration tests to verify that they can be read by other libraries such as spark.

discussion: sync with the apache Iceberg community

Hi there,

There is a thread regarding C++/Rust support for Iceberg on the community's dev mailing list. Here is the link: https://lists.apache.org/thread/7njzq6b0m9qbjmbtgrtkhmj2nnmhso4t I wonder if you folks could reply to this thread and ask for feedback from the Iceberg community? I think most community members would welcome this project. However, there will be a lot to discuss in terms of the technical design, where you may need professional advice from the community. With that advice and feedback, we will be able to see whether this project is heading in the right direction.

discussion: use ordered_float instead of f32, f64

To support partition write, we need something like HashMap<StructValue, _>. This means that we need to impl Eq for AnyValue. To do this, we need to replace f32 and f64 with ordered floats (e.g. https://crates.io/crates/ordered-float).

As https://internals.rust-lang.org/t/f32-f64-should-implement-hash/5436/33#:~:text=If%20you%20want%20to%20have%20a%20content%20hash%20of%20a%20struct%20that%20includes%20floats says, we could also use serde to get a content hash of a struct that includes floats. But I guess there may be other scenarios where we need Eq.
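A minimal sketch of what this could look like with the ordered-float crate, using simplified stand-ins for icelake's value types (the AnyValue/StructValue definitions below are illustrative only):

use std::collections::HashMap;

use ordered_float::OrderedFloat; // crates.io/crates/ordered-float

// Simplified value types; OrderedFloat gives floats Eq, Ord and Hash.
#[derive(Debug, PartialEq, Eq, Hash)]
enum AnyValue {
    Int(i32),
    Long(i64),
    Float(OrderedFloat<f32>),
    Double(OrderedFloat<f64>),
}

#[derive(Debug, PartialEq, Eq, Hash)]
struct StructValue(Vec<AnyValue>);

fn main() {
    // With Eq + Hash derived, StructValue can be used as a HashMap key,
    // which is what partitioned writes need.
    let mut partitions: HashMap<StructValue, Vec<u64>> = HashMap::new();
    let key = StructValue(vec![AnyValue::Double(OrderedFloat(1.5))]);
    partitions.entry(key).or_default().push(42);
    println!("{partitions:?}");
}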

RoadMap of IceLake v0.1

Iceberg is an open table format designed for analytic datasets. However, the lack of a mature Rust binding for Iceberg makes it difficult to integrate with databases like Databend.

IceLake intends to fill this gap. By developing icelake, I expect to build up an open ecosystem in which:

  • Users can read/write iceberg tables on ANY storage service, like s3, gcs, azblob, hdfs and so on.
  • ANY database can integrate with icelake to facilitate reading and writing of iceberg tables.
  • NATIVE support for transmuting to and from arrow is provided.
  • Bindings are provided so that other languages can operate on iceberg tables powered by the rust core.

For IceLake v0.1, I expect to implement the following features:

  • Set up the project layout and build a development loop so that the community can take part.
  • Support reading data for iceberg v2 from storage services (only limited file formats will be supported).
  • Evaluate our design by integrating it with databend.

This project is sponsored by Databend Labs

Tracking: Support partition write

BTW, do we plan to support partitions in the 0.1 version? It seems the whole API can be built without partitions.

Yes, it's necessary for practical use.

As discussed, supporting partition write is important. To support it, we need to:

  • add partition field in data file
  • support partition writer #153
  • #102

Refactor integration tests to use docker compose.

In our integration tests, we need to set up several docker images (for minio, spark, etc.) to verify results. I previously introduced testcontainers-rs to set them up in code, with isolation between different tests. However, I realized that there are several problems with it:

  1. Difficult to maintain for complex setups. It's hard to see the dependencies from the code.
  2. Difficult to debug. It's hard to control the lifecycle of containers manually, which is painful when there are problems in the container settings.

The only advantage is that it provides isolation between different tests, but I found that docker compose also provides isolation, so we can run several projects in parallel as long as we follow the rules below:

  1. Give each run a project name by -p.
  2. Don't specify container names explicitly. Docker compose gives each container a name like this: <project name>-<service name>-1.
  3. Don't specify port mapping, but expose ports only.
  4. Don't specify network explicitly. Docker compose will add a network for each run.

An example can be found at https://github.com/icelake-io/icelake/blob/main/icelake/tests/rest_catalog_tests.rs

The only remaining test is https://github.com/icelake-io/icelake/blob/6035b7bdc41e04b676bc9017c0e457c536936afd/icelake/tests/insert_tests.rs

We should refactor to use docker compose and remove testcontainers.
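For instance, a test could follow rule 1 by giving each run its own project name when shelling out to docker compose. This is a rough sketch, not the actual helper used in the repo; the project name, compose file path and flags are assumptions:

use std::process::Command;

// Bring up a compose project with an isolated, per-run project name.
fn compose_up(project: &str, compose_file: &str) -> std::io::Result<()> {
    let status = Command::new("docker")
        .args(["compose", "-p", project, "-f", compose_file, "up", "-d", "--wait"])
        .status()?;
    assert!(status.success(), "docker compose up failed for project {project}");
    Ok(())
}

// Tear the project down and remove its volumes when the test finishes.
fn compose_down(project: &str, compose_file: &str) -> std::io::Result<()> {
    let status = Command::new("docker")
        .args(["compose", "-p", project, "-f", compose_file, "down", "-v"])
        .status()?;
    assert!(status.success(), "docker compose down failed for project {project}");
    Ok(())
}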
