Comments (12)
You may need to read this doc first: https://www.yuque.com/yuzhao-my9fz/kb/flqll8?
from hudi.
You can use bulk_insert for historical data and regular upsert for incremental streaming ingestion.
Note that when you choose the flink_state index instead of the bucket index, the incremental streaming pipeline needs to enable index bootstrap through the index.bootstrap.enabled option.
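
A minimal Flink SQL sketch of the setup described above. The option names (write.operation, index.type, index.bootstrap.enabled) are Hudi Flink options; the table names, schema, and path are hypothetical placeholders, and the two INSERT statements are assumed to run as separate jobs.

```sql
-- Hypothetical Hudi table; name, schema, and path are placeholders.
CREATE TABLE hudi_sink (
  id STRING,
  ts TIMESTAMP(3),
  dt STRING,
  PRIMARY KEY (id) NOT ENFORCED
) PARTITIONED BY (dt) WITH (
  'connector' = 'hudi',
  'path' = '<table-base-path>',
  'index.type' = 'FLINK_STATE',   -- state-backed index, as opposed to BUCKET
  'write.operation' = 'upsert'
);

-- Batch job: load the historical data with bulk_insert.
-- history_source is a hypothetical source table.
INSERT INTO hudi_sink /*+ OPTIONS('write.operation' = 'bulk_insert') */
SELECT id, ts, dt FROM history_source;

-- Streaming job: regular upsert, with index bootstrap enabled so the
-- Flink state index is loaded from the records already in the table.
-- stream_source is a hypothetical source table.
INSERT INTO hudi_sink /*+ OPTIONS('index.bootstrap.enabled' = 'true') */
SELECT id, ts, dt FROM stream_source;
```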
@danny0405 thanks for replying.
I tried enabling index.bootstrap.enabled and noticed that an index_bootstrap task was created in my Flink pipeline accordingly. Initially I thought the bootstrap task would complete and reach the finished state after indexing the existing table, but it continued to run without releasing its resources. Is it expected behavior that the bootstrap task runs continuously?
> Is it expected behavior that the bootstrap task runs continuously?

Yes. After a checkpoint succeeds, you can disable the bootstrap by setting index.bootstrap.enabled to false.
Would the steps be like: checkpoint succeeds -> cancel the Flink job -> restart the Flink job from the checkpoint with bootstrap disabled?
I am currently running the Flink job via Flink SQL in a Zeppelin notebook. Do you know how I can perform the steps above?
> Currently I am running Flink job via Flink SQL on Zeppelin notebook

You can disable it with a Flink SQL hint: /*+ OPTIONS('index.bootstrap.enabled'='false') */.
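
As a rough sketch of how the hint might be combined with a restart from a checkpoint, assuming the Flink SQL client rather than Zeppelin (notebook interpreters may expose restore settings differently). The table names and the checkpoint path are hypothetical placeholders.

```sql
-- Assumption: the previous job's checkpoint was retained after cancellation.
-- Point the new job at it before submitting the INSERT.
SET 'execution.savepoint.path' = '<path-to-retained-checkpoint>';

-- Resubmit the streaming write with index bootstrap turned off via the hint.
-- hudi_sink and stream_source are hypothetical table names.
INSERT INTO hudi_sink /*+ OPTIONS('index.bootstrap.enabled' = 'false') */
SELECT * FROM stream_source;
```

The OPTIONS hint relies on Flink's dynamic table options; on older Flink versions this may need to be switched on first with SET 'table.dynamic-table-options.enabled' = 'true'.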
@danny0405 Thanks for the answer. The SQL hint worked for me for disabling index bootstrap.
Closing this now; feel free to reopen it if you still think it is an issue.
@danny0405 I have some follow-up questions.
Say I run the following steps to set up my data pipeline:
- Run a batch job 1 to bulk_insert historical data into a Hudi table
- Run a Flink stream job 2 with index bootstrap enabled and terminate the job after a checkpoint succeeds
- Run a Flink stream job 3 with index bootstrap disabled, restoring from the checkpoint that job 2 created
My questions are:
- Would the checkpoint of job 3 contain all index information retrieved from the index bootstrap process in job 2? I am asking because I noticed a significant size difference between the checkpoints of job 2 and job 3 (500GB in job 2 vs < 50GB in job 3).
- If job 3 fails and I need to start a job 4 using job 3's latest checkpoint, do I need to have index bootstrap enabled?
> Would the checkpoint of job 3 contain all index information retrieved from the index bootstrap process in job 2? I am asking because I noticed a significant size difference between the checkpoints of job 2 and job 3 (500GB in job 2 vs < 50GB in job 3).

Yes, one successful checkpoint indicates that the bootstrap has finished.

> If job 3 fails and I need to start a job 4 using job 3's latest checkpoint, do I need to have index bootstrap enabled?

No need to do that.

BTW, if your dataset is large, the BUCKET index is preferable.
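
A minimal sketch of what a table definition using the BUCKET index might look like. The option names index.type and hoodie.bucket.index.num.buckets are Hudi Flink options; the table name, schema, path, and bucket count are hypothetical and would need to be chosen for the actual data volume.

```sql
-- Hypothetical Hudi table using the bucket index instead of Flink state.
CREATE TABLE hudi_sink_bucket (
  id STRING,
  ts TIMESTAMP(3),
  dt STRING,
  PRIMARY KEY (id) NOT ENFORCED
) PARTITIONED BY (dt) WITH (
  'connector' = 'hudi',
  'path' = '<table-base-path>',
  'index.type' = 'BUCKET',
  'hoodie.bucket.index.num.buckets' = '64'   -- illustrative value only
);
```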
Is it expected that the checkpoint size of the bucket_assigner operator changes significantly, from 500GB in job 2 to less than 50GB in job 3 mentioned above?
The Hudi sink table has 9.6 billion records totaling around 570.3 GB and is currently partitioned by a time attribute, so the size of a partition is expected to grow dynamically to store data for an even data distribution. Would my use case benefit from using the Bucket index compared to the Flink State index? Also, do I need to enable the index bootstrap task if I switch to the Bucket index?
In addition, I found some duplicates written by my bulk_insert batch job 1 and upsert stream job 2 (the one that had index bootstrap enabled).
The bulk_insert batch job had write.precombine set to true, so shouldn't the result table be free of duplicates?
The upsert stream job had write.precombine set to true, and the index bootstrap task had its parallelism set to 480. I found a previous issue, #4881, which suggests that duplicates can happen when the index bootstrap task parallelism is > 1. Is that still the case in Hudi 0.14.1? The table that needs to be index bootstrapped is large, so I am not sure whether setting the parallelism to 1 would work.