[SUPPORT] IllegalArgumentException at org.apache.hudi.common.util.ValidationUtils.checkArgument(ValidationUtils.java:33)

michael1991 commented on September 23, 2024
[SUPPORT] IllegalArgumentException at org.apache.hudi.common.util.ValidationUtils.checkArgument(ValidationUtils.java:33)

Comments (16)

michael1991 commented on September 23, 2024

I disabled the metadata table on the reader side, so the reader job just treats the files as normal parquet files and skips the metadata.
Meanwhile, I disabled the metadata table on the writer side too; after a few commits, I re-enabled it on the writer side.
But the exception is still thrown at the beginning of every day. The writer job is scheduled hourly and insert-overwrites the whole day's partition, because the event data has no record key.
So it does not seem to be a writer/reader version compatibility issue.
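
For reference, a minimal sketch of that toggle with the Spark datasource, assuming a SparkSession named spark, a DataFrame named eventsDf, and the table path that appears in the logs below (hoodie.metadata.enable is the standard key on both sides):

// Reader side: with the metadata table disabled, file listing falls
// back to listing the parquet files directly.
val readDf = spark.read
  .format("hudi")
  .option("hoodie.metadata.enable", "false")
  .load("gs://bucket/tables_prod/ods/hudi_log_events")

// Writer side: the same key controls whether each commit also updates
// the metadata table under .hoodie/metadata; set "true" to re-enable.
eventsDf.write
  .format("hudi")
  .option("hoodie.metadata.enable", "false")
  .mode("append")
  .save("gs://bucket/tables_prod/ods/hudi_log_events")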


ad1happy2go commented on September 23, 2024

@michael1991 Async table services do not make sense with Spark core and SQL writes anyway, so you should use inline table services only (sketched below).
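
For illustration, a hedged sketch of the inline-service options this suggests (these are standard Hudi config keys; everything else in the write stays unchanged):

// Run table services inline with each commit instead of in
// background threads that can race with the writer.
val inlineServiceOpts = Map(
  "hoodie.clean.automatic"   -> "true",  // clean as part of each commit
  "hoodie.clean.async"       -> "false", // ...synchronously, not in a background thread
  "hoodie.archive.automatic" -> "true"   // archive the timeline inline as well
)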


danny0405 commented on September 23, 2024

How many writers are there? There might be multiple cleaners.


michael1991 commented on September 23, 2024

We have only one writer here, but multiple readers. So we have one cleaner within the writer, right?


danny0405 commented on September 23, 2024

Yeah, can you check the metadata file state under the .hoodie dir?
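
One way to check this from code, as a sketch: list the metadata table's files partition and flag zero-length files, using the Hadoop FileSystem API and the table path that appears in the logs below (a GCS connector is assumed to be on the classpath for gs:// paths):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

// List .hoodie/metadata/files and print each file's length,
// flagging zero-byte (empty) files.
val metaDir = new Path("gs://bucket/tables_prod/ods/hudi_log_events/.hoodie/metadata/files")
val fs = metaDir.getFileSystem(new Configuration())
fs.listStatus(metaDir).sortBy(_.getPath.getName).foreach { st =>
  val flag = if (st.getLen == 0) "  <-- empty" else ""
  println(f"${st.getLen}%12d  ${st.getPath.getName}$flag")
}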


michael1991 commented on September 23, 2024

Yeah, can you check the metadata file state under the .hoodie dir?

Oh, I found some empty files there.
(screenshot of the empty metadata files omitted)

And if I enable metadata on the reader side, warnings are thrown as below:

24/03/22 06:30:09 WARN AbstractHoodieLogRecordReader: TargetInstantTime 20240322043545698 invalid or extra rollback command block in gs://bucket/tables_prod/ods/hudi_log_events/.hoodie/metadata/files/.files-0000-0_20240322033640958001.log.5_1-0-1
24/03/22 06:30:10 WARN BaseTableMetadata: Metadata record for log_date=2024-03-20 encountered some files to be deleted which was not added before. Ignoring the spurious deletes as the `_hoodie.metadata.ignore.spurious.deletes` config is set to true
java.io.FileNotFoundException: 
File not found: gs://bucket/tables_prod/ods/hudi_log_events/log_date=2024-03-20/62c503f0-0fca-463a-8643-c1fa688412e5-0_0-8-211_20240320203557276.parquet

How can we avoid this error?


ad1happy2go commented on September 23, 2024

@michael1991 Why are we trying to read data written with a newer Hudi version using an older Hudi version? 0.12.3 and 0.14.1 use different table versions and have different log formats anyway.

Also, can you post the writer configurations too?


michael1991 commented on September 23, 2024
  • Because we are ingesting event log data without a record key, and before 0.14 we had to set a record key for the writer. Meanwhile, on GCP Dataproc, Hudi v0.12.3 is the default version, so we use the 0.12.3 bundle to read data generated by the 0.14.1 bundle.
  • Our writer configurations (a usage sketch follows the map):
lazy val EVENT_HUDI_CONF_MAP: Map[String, String] = Map(
    "hoodie.database.name" -> "database",
    "hoodie.table.name" -> "log_events",
    "hoodie.schema.on.read.enable" -> "true",
    "hoodie.combine.before.upsert" -> "false",
    "hoodie.datasource.write.partitionpath.field" -> "log_date",
    "hoodie.datasource.write.operation" -> "insert_overwrite",
    "hoodie.clean.async" -> "true",
    "hoodie.cleaner.commits.retained" -> "5",
    "hoodie.parquet.compression.codec" -> "snappy",
    "hoodie.copyonwrite.record.size.estimate" -> "64",
    "hoodie.datasource.write.hive_style_partitioning" -> "true"
  )
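
For context, a sketch of how a map like this is typically applied, assuming a DataFrame named eventsDf holding one day's events and the table path from the logs above:

// With insert_overwrite, append mode replaces just the matching
// log_date partitions rather than the whole table.
eventsDf.write
  .format("hudi")
  .options(EVENT_HUDI_CONF_MAP)
  .mode("append")
  .save("gs://bucket/tables_prod/ods/hudi_log_events")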


ad1happy2go commented on September 23, 2024

@michael1991 I don't think that's the right way. We must use 0.14.1 to read the data as well. You can use the OSS Hudi bundle on Dataproc to read the data back and see if the same issue occurs there too. Thanks.


michael1991 commented on September 23, 2024

OK, let me give it a try. Also, we have a Spark job that needs to read the base data and the event data, then join them to generate a new dataset. The base data has been generated by the 0.12.3 bundle for a year, and the event data has been generated by the 0.14.1 bundle since last week. So can we use the 0.14.1 bundle to read two datasets with different table versions, i.e. is the reader side backward compatible?


michael1991 commented on September 23, 2024

spark-hudi.log
I got more details on this error: it actually occurs after the rollback error. But from the log, I could not see any error before the rollback happens. Hope this is helpful.
I'm not sure whether async cleaning should be enabled while a metadata rollback is happening.


michael1991 commented on September 23, 2024

The root cause is: rollback and async clean are two actions that need to delete the same file, and then a "deadlock" happens.
Disabling async cleaning works.
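
As a minimal sketch, the workaround is a one-key change to the configuration map above; everything else stays the same:

// Workaround: run the cleaner inline so it cannot race a rollback
// over the same files.
val patchedConf = EVENT_HUDI_CONF_MAP + ("hoodie.clean.async" -> "false")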


danny0405 commented on September 23, 2024

The root cause is: rollback and async clean are two actions that need to delete the same file, and then a "deadlock" happens. Disabling async cleaning works.

Nice finding; this seems like a bug. Can you file a fix for it?


ad1happy2go commented on September 23, 2024

@michael1991 Thanks for identifying the root cause. Do you have a fix in mind? I created a tracking JIRA for this: https://issues.apache.org/jira/browse/HUDI-7560

Are you using Spark Structured Streaming or HudiStreamer to write?


michael1991 commented on September 23, 2024

@michael1991 Thanks for identifying the root cause. Do you have a fix in mind? I created a tracking JIRA for this: https://issues.apache.org/jira/browse/HUDI-7560

Are you using Spark Structured Streaming or HudiStreamer to write?

I'm using Spark core and SQL to write, not Structured Streaming or HudiStreamer. In the future, we will try Flink to write streaming data.

