Comments (16)
I disabled metadata on the reader side, so the reader job just treats the files as plain Parquet and skips the metadata table.
I also disabled metadata on the writer side; after a few commits, I re-enabled it on the writer side.
But an exception is thrown at the beginning of every day. The writer job is scheduled hourly and insert-overwrites the whole day's partition, because the event data has no record key.
So it does not look like a writer/reader version compatibility issue.
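The toggling described above uses the standard metadata-table switch; a minimal sketch of the option maps (illustrative only, assuming the Spark datasource reader/writer):

```scala
// Reader side: ignore the metadata table and list files directly from storage.
val readerOpts = Map("hoodie.metadata.enable" -> "false")

// Writer side: metadata was disabled first, then re-enabled a few commits later.
val writerOptsInitial   = Map("hoodie.metadata.enable" -> "false")
val writerOptsReenabled = Map("hoodie.metadata.enable" -> "true")
```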
from hudi.
@michael1991 Async table services do not make much sense with Spark core and SQL writes anyway, so you should use inline table services only.
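A minimal sketch of the inline-table-services setup this suggests, using standard Hudi write configs (values illustrative):

```scala
// Run table services inline with each commit instead of asynchronously.
val inlineServiceOpts = Map(
  "hoodie.clean.automatic" -> "true",  // clean inline after commits
  "hoodie.clean.async"     -> "false", // no separate async cleaner
  "hoodie.archive.async"   -> "false"  // archive inline as well
)
```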
from hudi.
how many writers are there, there might be multiple cleaners.
from hudi.
We have only one writer here, with multiple readers. So we have a single cleaner inside the writer, right?
from hudi.
yeah, can you check the metadata file state under the .hoodie dir?
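For example, a quick check for zero-byte metadata files (a local sketch; for GCS you would list the gs:// path, e.g. with gsutil, instead):

```scala
import java.nio.file.{Files, Path, Paths}
import scala.jdk.CollectionConverters._

// Walk .hoodie/metadata/files and print any zero-byte log files.
// The relative path is a local stand-in for the table's .hoodie directory.
val metadataDir: Path = Paths.get(".hoodie/metadata/files")
val emptyFiles: List[Path] = Files.walk(metadataDir).iterator().asScala
  .filter(p => Files.isRegularFile(p) && Files.size(p) == 0L)
  .toList
emptyFiles.foreach(println)
```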
from hudi.
Oh, I found some empty files there.
And if I enable metadata on the reader side, the following warnings are thrown:
24/03/22 06:30:09 WARN AbstractHoodieLogRecordReader: TargetInstantTime 20240322043545698 invalid or extra rollback command block in gs://bucket/tables_prod/ods/hudi_log_events/.hoodie/metadata/files/.files-0000-0_20240322033640958001.log.5_1-0-1
24/03/22 06:30:10 WARN BaseTableMetadata: Metadata record for log_date=2024-03-20 encountered some files to be deleted which was not added before. Ignoring the spurious deletes as the `_hoodie.metadata.ignore.spurious.deletes` config is set to true
java.io.FileNotFoundException: File not found: gs://bucket/tables_prod/ods/hudi_log_events/log_date=2024-03-20/62c503f0-0fca-463a-8643-c1fa688412e5-0_0-8-211_20240320203557276.parquet
How can I avoid this error?
from hudi.
@michael1991 Why are we trying to read data written by a newer Hudi version with an older Hudi version? 0.12.3 and 0.14.1 use different table versions and different log formats anyway.
Also, can you post your writer configuration too?
from hudi.
- Because we are ingesting event log data without a record key, and before 0.14 the writer required a record key to be set. Meanwhile, Hudi v0.12.3 is the default version on GCP Dataproc, so we use the 0.12.3 bundle to read data generated by the 0.14.1 bundle.
- Our writer configuration:
lazy val EVENT_HUDI_CONF_MAP: Map[String, String] = Map(
  "hoodie.database.name" -> "database",
  "hoodie.table.name" -> "log_events",
  "hoodie.schema.on.read.enable" -> "true",
  "hoodie.combine.before.upsert" -> "false",
  "hoodie.datasource.write.partitionpath.field" -> "log_date",
  "hoodie.datasource.write.operation" -> "insert_overwrite",
  "hoodie.clean.async" -> "true",
  "hoodie.cleaner.commits.retained" -> "5",
  "hoodie.parquet.compression.codec" -> "snappy",
  "hoodie.copyonwrite.record.size.estimate" -> "64",
  "hoodie.datasource.write.hive_style_partitioning" -> "true"
)
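For context, a hedged sketch of how a map like this is typically passed to the datasource writer (`df` and the write mode are placeholders, not from the original report):

```scala
// Hypothetical hourly write: df holds one day's events.
df.write
  .format("hudi")
  .options(EVENT_HUDI_CONF_MAP)
  .mode("overwrite")
  .save("gs://bucket/tables_prod/ods/hudi_log_events")
```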
from hudi.
@michael1991 I don't think that's the right way. You must use 0.14.1 to read the data as well. You can use the OSS Hudi bundle on Dataproc to read the data back and see whether the same issue occurs there too. Thanks.
from hudi.
Ok, let me try. We also have a Spark job that needs to read the base data and the event data and join them to generate a new dataset. The base data has been generated by the 0.12.3 bundle for a year, and the event data by the 0.14.1 bundle since last week. So can we use the 0.14.1 bundle to read two datasets with different table versions, i.e. is the reader backward compatible?
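If reading with the newer bundle works, the join job might look like this sketch (the base-table path and join key are assumptions):

```scala
// Read both datasets with the newer (0.14.1) bundle only, per the advice above.
val baseDf  = spark.read.format("hudi").load("gs://bucket/tables_prod/ods/base_table")
val eventDf = spark.read.format("hudi").load("gs://bucket/tables_prod/ods/hudi_log_events")
val joined  = baseDf.join(eventDf, Seq("log_date"))
```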
from hudi.
spark-hudi.log
I got more details on this error: it actually occurs after this rollback error, but from the log I cannot see any error before the rollback happens. Hope this is helpful.
I'm not sure whether async cleaning can be enabled while a metadata-table rollback is happening.
from hudi.
Root cause: a rollback and an async clean both need to delete the same file, so a "deadlock" happens.
Disabling async cleaning works.
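The workaround maps to flipping the cleaner settings in the writer config posted earlier (a sketch using standard Hudi configs):

```scala
// Run cleaning inline so the cleaner cannot race a concurrent rollback
// over the same files.
val cleanOpts = Map(
  "hoodie.clean.async"     -> "false", // was "true" in the writer config
  "hoodie.clean.automatic" -> "true"   // clean inline after each commit
)
```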
from hudi.
Root cause: a rollback and an async clean both need to delete the same file, so a "deadlock" happens. Disabling async cleaning works.
Nice finding, this looks like a bug. Can you file a fix for it?
from hudi.
@michael1991 Thanks for identifying the root cause. Do you have a fix in mind? I created a tracking Jira for it: https://issues.apache.org/jira/browse/HUDI-7560
Are you using Spark Structured Streaming to write, or HudiStreamer?
from hudi.
I'm using Spark core and SQL to write, not Structured Streaming or HudiStreamer. In the future we will try Flink for streaming writes.
from hudi.