Comments (16)
I disabled metadata on the reader side, so the reader job just treats the files as plain Parquet and skips the metadata table.
I also disabled metadata on the writer side; after a few commits, I re-enabled it on the writer side.
But an exception is thrown at the beginning of every day. The writer job is scheduled hourly and insert-overwrites the whole day's partition, because the event data has no record key.
So it does not look like a writer/reader version compatibility issue.
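The toggling described above uses the standard metadata-table switch; a minimal sketch of the option maps (illustrative only, assuming the Spark datasource reader/writer):

```scala
// Reader side: ignore the metadata table and list files directly from storage.
val readerOpts = Map("hoodie.metadata.enable" -> "false")

// Writer side: metadata was disabled first, then re-enabled a few commits later.
val writerOptsInitial   = Map("hoodie.metadata.enable" -> "false")
val writerOptsReenabled = Map("hoodie.metadata.enable" -> "true")
```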
from hudi.
@michael1991 Async table services do not make much sense with Spark core and SQL writes anyway, so you should use inline table services only.
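A minimal sketch of the inline-table-services setup this suggests, using standard Hudi write configs (values illustrative):

```scala
// Run table services inline with each commit instead of asynchronously.
val inlineServiceOpts = Map(
  "hoodie.clean.automatic" -> "true",  // clean inline after commits
  "hoodie.clean.async"     -> "false", // no separate async cleaner
  "hoodie.archive.async"   -> "false"  // archive inline as well
)
```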
from hudi.
how many writers are there, there might be multiple cleaners.
from hudi.
We have only one writer here, with multiple readers. So we have a single cleaner inside the writer, right?
from hudi.
yeah, can you check the metadata file state under the .hoodie dir?
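For example, a quick check for zero-byte metadata files (a local sketch; for GCS you would list the gs:// path, e.g. with gsutil, instead):

```scala
import java.nio.file.{Files, Path, Paths}
import scala.jdk.CollectionConverters._

// Walk .hoodie/metadata/files and print any zero-byte log files.
// The relative path is a local stand-in for the table's .hoodie directory.
val metadataDir: Path = Paths.get(".hoodie/metadata/files")
val emptyFiles: List[Path] = Files.walk(metadataDir).iterator().asScala
  .filter(p => Files.isRegularFile(p) && Files.size(p) == 0L)
  .toList
emptyFiles.foreach(println)
```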
from hudi.
Oh, I found some empty files there.
And if I enable metadata on the reader side, the following warnings are thrown:
24/03/22 06:30:09 WARN AbstractHoodieLogRecordReader: TargetInstantTime 20240322043545698 invalid or extra rollback command block in gs://bucket/tables_prod/ods/hudi_log_events/.hoodie/metadata/files/.files-0000-0_20240322033640958001.log.5_1-0-1
24/03/22 06:30:10 WARN BaseTableMetadata: Metadata record for log_date=2024-03-20 encountered some files to be deleted which was not added before. Ignoring the spurious deletes as the `_hoodie.metadata.ignore.spurious.deletes` config is set to true
java.io.FileNotFoundException: File not found: gs://bucket/tables_prod/ods/hudi_log_events/log_date=2024-03-20/62c503f0-0fca-463a-8643-c1fa688412e5-0_0-8-211_20240320203557276.parquet
How can I avoid this error?
from hudi.
@michael1991 Why are we trying to read data written by a newer Hudi version with an older Hudi version? 0.12.3 and 0.14.1 use different table versions and different log formats anyway.
Also, can you post your writer configuration too?
from hudi.
- Because we are ingesting event log data without a record key, and before 0.14 the writer required a record key to be set. Meanwhile, Hudi v0.12.3 is the default version on GCP Dataproc, so we use the 0.12.3 bundle to read data generated by the 0.14.1 bundle.
- Our writer configuration:
lazy val EVENT_HUDI_CONF_MAP: Map[String, String] = Map(
  "hoodie.database.name" -> "database",
  "hoodie.table.name" -> "log_events",
  "hoodie.schema.on.read.enable" -> "true",
  "hoodie.combine.before.upsert" -> "false",
  "hoodie.datasource.write.partitionpath.field" -> "log_date",
  "hoodie.datasource.write.operation" -> "insert_overwrite",
  "hoodie.clean.async" -> "true",
  "hoodie.cleaner.commits.retained" -> "5",
  "hoodie.parquet.compression.codec" -> "snappy",
  "hoodie.copyonwrite.record.size.estimate" -> "64",
  "hoodie.datasource.write.hive_style_partitioning" -> "true"
)
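For context, a hedged sketch of how a map like this is typically passed to the datasource writer (`df` and the write mode are placeholders, not from the original report):

```scala
// Hypothetical hourly write: df holds one day's events.
df.write
  .format("hudi")
  .options(EVENT_HUDI_CONF_MAP)
  .mode("overwrite")
  .save("gs://bucket/tables_prod/ods/hudi_log_events")
```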
from hudi.
@michael1991 I don't think that's the right way. You must use 0.14.1 to read the data as well. You can use the OSS Hudi bundle on Dataproc to read the data back and see whether the same issue occurs there too. Thanks.
from hudi.
Ok, let me try. We also have a Spark job that needs to read the base data and the event data and join them to generate a new dataset. The base data has been generated by the 0.12.3 bundle for a year, and the event data by the 0.14.1 bundle since last week. So can we use the 0.14.1 bundle to read two datasets with different table versions, i.e. is the reader backward compatible?
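If reading with the newer bundle works, the join job might look like this sketch (the base-table path and join key are assumptions):

```scala
// Read both datasets with the newer (0.14.1) bundle only, per the advice above.
val baseDf  = spark.read.format("hudi").load("gs://bucket/tables_prod/ods/base_table")
val eventDf = spark.read.format("hudi").load("gs://bucket/tables_prod/ods/hudi_log_events")
val joined  = baseDf.join(eventDf, Seq("log_date"))
```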
from hudi.
spark-hudi.log
I got more details on this error: it actually occurs after this rollback error, but from the log I cannot see any error before the rollback happens. Hope this is helpful.
I'm not sure whether async cleaning can be enabled while a metadata-table rollback is happening.
from hudi.
Root cause: a rollback and an async clean both need to delete the same file, so a "deadlock" happens.
Disabling async cleaning works.
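The workaround maps to flipping the cleaner settings in the writer config posted earlier (a sketch using standard Hudi configs):

```scala
// Run cleaning inline so the cleaner cannot race a concurrent rollback
// over the same files.
val cleanOpts = Map(
  "hoodie.clean.async"     -> "false", // was "true" in the writer config
  "hoodie.clean.automatic" -> "true"   // clean inline after each commit
)
```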
from hudi.
Root cause: a rollback and an async clean both need to delete the same file, so a "deadlock" happens. Disabling async cleaning works.
Nice finding, this looks like a bug. Can you file a fix for it?
from hudi.
@michael1991 Thanks for identifying the root cause. Do you have a fix in mind? I created a tracking Jira for it: https://issues.apache.org/jira/browse/HUDI-7560
Are you using Spark Structured Streaming to write, or HudiStreamer?
from hudi.
I'm using Spark core and SQL to write, not Structured Streaming or HudiStreamer. In the future we will try Flink for streaming writes.
from hudi.