Comments (12)

danny0405 commented on September 23, 2024

You may need to read this doc first: https://www.yuque.com/yuzhao-my9fz/kb/flqll8?

from hudi.

danny0405 commented on September 23, 2024

You can use bulk_insert for the historical data and a regular upsert for the incremental streaming ingestion.
Note that if you choose the flink_state index instead of the bucket index, the incremental streaming pipeline needs to enable index bootstrap through the index.bootstrap.enabled option.
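
For illustration, the two jobs could look roughly like the Flink SQL below. This is only a sketch: the table names (hudi_sink, history_source, stream_source), the schema, and the path are made-up placeholders, and the SET syntax may differ depending on the Flink version or notebook you use.

```sql
-- Hypothetical sink table; schema, path and table type are placeholders.
CREATE TABLE hudi_sink (
  id STRING,
  payload STRING,
  ts TIMESTAMP(3),
  dt STRING,
  PRIMARY KEY (id) NOT ENFORCED
) PARTITIONED BY (dt) WITH (
  'connector' = 'hudi',
  'path' = 's3://my-bucket/hudi/hudi_sink',  -- placeholder path
  'table.type' = 'MERGE_ON_READ',
  'index.type' = 'FLINK_STATE'
);

-- Batch job: load the historical data with bulk_insert.
SET 'execution.runtime-mode' = 'batch';
INSERT INTO hudi_sink /*+ OPTIONS('write.operation' = 'bulk_insert') */
SELECT id, payload, ts, dt FROM history_source;

-- Streaming job: regular upsert, with index bootstrap enabled so that
-- the Flink state index is loaded from the existing table.
SET 'execution.runtime-mode' = 'streaming';
INSERT INTO hudi_sink /*+ OPTIONS(
  'write.operation' = 'upsert',
  'index.bootstrap.enabled' = 'true') */
SELECT id, payload, ts, dt FROM stream_source;
```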

ChiehFu commented on September 23, 2024

@danny0405 thanks for replying.

I tried enabling index.bootstrap.enabled and noticed that an index_bootstrap task was created in my Flink pipeline accordingly. Initially I thought the index_bootstrap task would complete and reach the finished state after indexing the existing table, but it continued to run without releasing its resources. Is it expected behavior that the bootstrap task will run continuously?

danny0405 commented on September 23, 2024

> Is it expected behavior that the bootstrap task will run continuously?

Yes. After a checkpoint succeeds, you can disable the bootstrap by setting index.bootstrap.enabled to false.

ChiehFu commented on September 23, 2024

Would the steps be like: checkpoint succeeds -> cancel the Flink job -> restart the Flink job from the checkpoint with bootstrap disabled?

Currently I am running the Flink job via Flink SQL in a Zeppelin notebook. Do you know how I can perform the steps above?

danny0405 commented on September 23, 2024

> Currently I am running the Flink job via Flink SQL in a Zeppelin notebook

You can disable it with a Flink SQL hint: /*+ OPTIONS('index.bootstrap.enabled'='false') */.
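
In context, the hint goes right after the sink table name in the INSERT statement. A minimal sketch, assuming a sink table called hudi_sink and a source table called stream_source (both names are placeholders):

```sql
-- Same streaming upsert as before, but with index bootstrap switched off
-- for this statement only; the table definition itself is unchanged.
INSERT INTO hudi_sink /*+ OPTIONS('index.bootstrap.enabled' = 'false') */
SELECT id, payload, ts, dt FROM stream_source;
```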

ChiehFu commented on September 23, 2024

@danny0405 Thanks for the answer. The SQL hint worked for me for disabling index bootstrap.

danny0405 commented on September 23, 2024

Closing this now; feel free to reopen it if you still think it is an issue.

ChiehFu commented on September 23, 2024

@danny0405 I have some follow-up questions.
Say I run the following steps to set up my data pipeline:

  1. Run batch job 1 to bulk_insert historical data into a Hudi table
  2. Run Flink streaming job 2 with index bootstrap enabled, and terminate the job after a checkpoint succeeds
  3. Run Flink streaming job 3 with index bootstrap disabled, restoring from the checkpoint that job 2 created
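
The hand-over from job 2 to job 3 might be sketched as follows in the Flink SQL client. This is an assumption about the setup rather than a tested recipe: the checkpoint path is a placeholder, the table names are made up, and a Zeppelin interpreter may expose these settings through its own properties instead of SET statements.

```sql
-- Job 2: retain checkpoints so that one survives cancelling the job.
SET 'execution.checkpointing.interval' = '5 min';
SET 'execution.checkpointing.externalized-checkpoint-retention' = 'RETAIN_ON_CANCELLATION';
INSERT INTO hudi_sink /*+ OPTIONS('index.bootstrap.enabled' = 'true') */
SELECT id, payload, ts, dt FROM stream_source;
-- Wait for at least one completed checkpoint, then cancel job 2.

-- Job 3: restore from job 2's retained checkpoint (placeholder path),
-- this time with index bootstrap disabled.
SET 'execution.savepoint.path' = 'hdfs:///flink/checkpoints/<job-2-id>/chk-<n>';
INSERT INTO hudi_sink /*+ OPTIONS('index.bootstrap.enabled' = 'false') */
SELECT id, payload, ts, dt FROM stream_source;
```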

My questions are:

  • Would the checkpoint of job 3 contain all the index information retrieved by the index bootstrap process in job 2? I am asking because I noticed a significant size difference between the checkpoints of job 2 and job 3 (500GB in job 2 vs. < 50GB in job 3).
  • If job 3 fails and I need to start a job 4 from job 3's latest checkpoint, do I need to have index bootstrap enabled?

danny0405 commented on September 23, 2024

> Would the checkpoint of job 3 contain all the index information retrieved by the index bootstrap process in job 2? I am asking because I noticed a significant size difference between the checkpoints of job 2 and job 3 (500GB in job 2 vs. < 50GB in job 3).

Yes, one successful checkpoint indicates that the bootstrap has finished.

> If job 3 fails and I need to start a job 4 from job 3's latest checkpoint, do I need to have index bootstrap enabled?

No need to do that.

BTW, if your dataset is large, the BUCKET index is preferable.
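
A table using the bucket index could be declared roughly like this. It is a sketch only: the table name, schema, and path are placeholders, and the bucket count of 64 is an arbitrary assumption that would need to be sized for the actual data volume per partition.

```sql
-- Hypothetical table definition using the bucket index.
CREATE TABLE hudi_sink_bucket (
  id STRING,
  payload STRING,
  ts TIMESTAMP(3),
  dt STRING,
  PRIMARY KEY (id) NOT ENFORCED
) PARTITIONED BY (dt) WITH (
  'connector' = 'hudi',
  'path' = 's3://my-bucket/hudi/hudi_sink_bucket',  -- placeholder path
  'table.type' = 'MERGE_ON_READ',
  'index.type' = 'BUCKET',
  'hoodie.bucket.index.num.buckets' = '64'          -- assumed value
);
```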

ChiehFu commented on September 23, 2024

@danny0405

Is it expected that the checkpoint size of the bucket_assigner operator drops significantly, from 500GB in job 2 to less than 50GB in job 3, as mentioned above?

The Hudi sink table has 9.6 billion records at around 570.3 GB, and is currently partitioned by a time attribute, so the size of each partition is expected to grow dynamically to keep the data evenly distributed. Would my use case benefit from the Bucket index compared to the Flink State index? Also, do I need to enable the index bootstrap task if I switch to the Bucket index?

ChiehFu commented on September 23, 2024

In addition, I found some duplicates written by my bulk_insert batch job 1 and my upsert streaming job 2 (the one that had index bootstrap enabled).

The bulk_insert batch job had write.precombine set to true, so shouldn't the result table be free of duplicates?

The upsert streaming job had write.precombine set to true, and the index bootstrap task had its parallelism set to 480. I found a previous issue, #4881, which suggests that duplicates can happen when the index bootstrap task parallelism is > 1. Is that still the case in Hudi 0.14.1? The table that needs to be index bootstrapped is large, so I am not sure whether setting the parallelism to 1 would work.
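
For reference, a duplicate check of this kind can be written as a batch query along the following lines. The table and column names (hudi_sink, id, dt) are placeholders for the real record key and partition columns.

```sql
-- Batch read of the Hudi table: list record keys that occur more than once
-- within the same partition.
SET 'execution.runtime-mode' = 'batch';
SELECT dt, id, COUNT(*) AS copies
FROM hudi_sink
GROUP BY dt, id
HAVING COUNT(*) > 1;
```

Dropping dt from the SELECT and GROUP BY would instead surface keys that appear in more than one partition.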
