support for spark 3.x ? about jpmml-evaluator-spark HOT 5 CLOSED

jpmml commented on July 17, 2024

support for spark 3.x ?

from jpmml-evaluator-spark.

Comments (5)

vruusmann commented on July 17, 2024

Is there any plan to support evaluator in spark 3.x?

The underlying JPMML-Evaluator library and this JPMML-Evaluator-Spark wrapper library are both written in the Java language and should therefore be agnostic towards Scala and Apache Spark ML versions.

According to the GitHub log, I haven't touched this codebase for three years. I wonder what has changed or broken API-wise in that time.

Or maybe I could attempt to build jpmml-evaluator-spark on spark 3.x on my own?

I haven't marked this codebase as "Archived", so I do have some interest in reviving it. But it's not a high-priority item for me personally.

Please try to deploy the current version on your target Apache Spark ML version (3.2.X perhaps?), and report back all the issues that you're experiencing. Also, if you can suggest immediate fixes to those issues, please do share those as well.

lcx517 commented on July 17, 2024

I rebuilt the project on Spark 3.1.1, and pmml-evaluator now runs successfully on Spark 3.1.1.
I have opened pull request #44 for this version.

I encountered several compatibility problems. The last one has not been solved yet.

untyped Scala UDF

ERROR Instrumentation: org.apache.spark.sql.AnalysisException: You're using untyped Scala UDF, which does not have the input type information. Spark may blindly pass null to the Scala closure with primitive-type argument, and the closure will see the default value of the Java type for the null argument, e.g. `udf((x: Int) => x, IntegerType)`, the result is 0 for null input. To get rid of this error, you could:
1. use typed Scala UDF APIs(without return type parameter), e.g. `udf((x: Int) => x)`
2. use Java UDF APIs, e.g. `udf(new UDF1[String, Integer] { override def call(s: String): Integer = s.length() }, IntegerType)`, if input types are all non primitive
3. set spark.sql.legacy.allowUntypedScalaUDF to true and use this API with caution
	at org.apache.spark.sql.functions$.udf(functions.scala:5021)
	at org.jpmml.evaluator.spark.PMMLTransformer.transform(PMMLTransformer.scala:99)
	at org.apache.spark.ml.PipelineModel.$anonfun$transform$4(Pipeline.scala:311)
	at org.apache.spark.ml.MLEvents.withTransformEvent(events.scala:146)
	at org.apache.spark.ml.MLEvents.withTransformEvent$(events.scala:139)
	at org.apache.spark.ml.util.Instrumentation.withTransformEvent(Instrumentation.scala:42)
	at org.apache.spark.ml.PipelineModel.$anonfun$transform$3(Pipeline.scala:311)
	at scala.collection.IndexedSeqOptimized.foldLeft(IndexedSeqOptimized.scala:60)
	at scala.collection.IndexedSeqOptimized.foldLeft$(IndexedSeqOptimized.scala:68)
	at scala.collection.mutable.ArrayOps$ofRef.foldLeft(ArrayOps.scala:198)
	at org.apache.spark.ml.PipelineModel.$anonfun$transform$2(Pipeline.scala:310)
	at org.apache.spark.ml.MLEvents.withTransformEvent(events.scala:146)
	at org.apache.spark.ml.MLEvents.withTransformEvent$(events.scala:139)
	at org.apache.spark.ml.util.Instrumentation.withTransformEvent(Instrumentation.scala:42)
	at org.apache.spark.ml.PipelineModel.$anonfun$transform$1(Pipeline.scala:308)
	at org.apache.spark.ml.util.Instrumentation$.$anonfun$instrumented$1(Instrumentation.scala:191)
	at scala.util.Try$.apply(Try.scala:213)
	at org.apache.spark.ml.util.Instrumentation$.instrumented(Instrumentation.scala:191)
	at org.apache.spark.ml.PipelineModel.transform(Pipeline.scala:307)

I found this in the Spark migration guide:

In Spark 3.0, using org.apache.spark.sql.functions.udf(AnyRef, DataType) is not allowed by default.

I searched for a workaround and found that adding the following resolves it:

sparkSession.sql("set spark.sql.legacy.allowUntypedScalaUDF=true")

When an output column name contains ".", the transform function runs escapeColumnName(name) and wraps the column name in backquotes, which may cause an error like:

org.apache.spark.sql.AnalysisException: No such struct field `probability(0.0)` in y, pmml(prediction), prediction, probability(0.0), probability(1.0)

My workaround is to not add backquotes to the column name and to replace the backquotes with underscores instead.
This modification is not in my pull request, since I don't have a better idea for this problem.

vruusmann commented on July 17, 2024

untyped Scala UDF

In JPMML-Evaluator 1.6.X development branch, the signature of the main evaluation method was changed to:

Map<String, ?> evaluate(Map<String, ?> arguments);

The value type of both the arguments map and the results map is java.lang.Object. In Java land, it is impossible to insert a primitive value (e.g. int, double) into such a Map. I don't know whether it is possible in Scala land.

The main point is that the UDF should keep null references unchanged (instead of replacing them with primitive-like 0 or 0.0 values), because the JPMML-Evaluator uses the null reference for denoting missing values.

Ideally, the Apache Spark UDF could have a signature that states: "send Map<String, Object> in, and get Map<String, Object> back. If there are any null values in the arguments or results maps, keep them as-is".
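
The null-handling requirement above can be sketched in plain Java. This is only an illustration under stated assumptions: the class NullPreservingArguments and the methods buildArguments and echoEvaluate are hypothetical stand-ins that mirror the shape of the evaluate signature quoted earlier; they are not part of JPMML-Evaluator or Apache Spark.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class NullPreservingArguments {

    // Hypothetical helper: boxed parameter types (Double, Integer) allow a
    // missing value to be passed as null instead of a primitive default.
    public static Map<String, Object> buildArguments(Double x1, Integer x2) {
        Map<String, Object> arguments = new LinkedHashMap<>();
        arguments.put("x1", x1);
        arguments.put("x2", x2);
        return arguments;
    }

    // Hypothetical stand-in with the same shape as
    // Map<String, ?> evaluate(Map<String, ?> arguments).
    // It copies every entry unchanged, so null values stay null.
    public static Map<String, ?> echoEvaluate(Map<String, ?> arguments) {
        return new LinkedHashMap<String, Object>(arguments);
    }

    public static void main(String[] args) {
        Map<String, ?> results = echoEvaluate(buildArguments(null, 3));
        System.out.println(results.containsKey("x1")); // prints true
        System.out.println(results.get("x1"));         // prints null
        System.out.println(results.get("x2"));         // prints 3
    }
}
```

The key "x1" is still present after the round trip and its value is still null, which is the behaviour a UDF wrapper would need to keep so that missing values are not silently turned into 0 or 0.0.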

When output columns String contains ".", transform function will run escapeColumnName(name) and add back quote to column name.

The org.jpmml.evaluator.ModelEvaluatorBuilder class has setResultMapper(org.jpmml.evaluator.ResultMapper) method, which lets you "customize" result field names on the fly.

In the current case, you could replace the problematic dot character (.) with some other character, such as the underscore character (_), or delete it altogether:

ModelEvaluatorBuilder evaluatorBuilder = new ModelEvaluatorBuilder(...)
  .setResultMapper(new ResultMapper(){
    @Override
    public String apply(String pmmlName){
      return pmmlName.replace(".", "_");
    }
  });
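
To see the effect of such a renaming on the column names from the error above, here is a minimal standalone sketch. It models the mapper with a plain java.util.function.Function so that it runs without JPMML-Evaluator on the classpath; the class DotFreeResultNames and the helper renameKeys are hypothetical names used for illustration only.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

public class DotFreeResultNames {

    // Same replacement as in the snippet above, expressed as a plain Function.
    public static final Function<String, String> DOT_TO_UNDERSCORE =
        name -> name.replace(".", "_");

    // Hypothetical helper: applies the mapper to every key of a results map,
    // keeping the values and the entry order unchanged.
    public static Map<String, Object> renameKeys(Map<String, ?> results,
                                                 Function<String, String> mapper) {
        Map<String, Object> renamed = new LinkedHashMap<>();
        for (String key : results.keySet()) {
            renamed.put(mapper.apply(key), results.get(key));
        }
        return renamed;
    }

    public static void main(String[] args) {
        Map<String, Object> results = new LinkedHashMap<>();
        results.put("prediction", 1.0);
        results.put("probability(0.0)", 0.25);
        results.put("probability(1.0)", 0.75);
        // prints [prediction, probability(0_0), probability(1_0)]
        System.out.println(renameKeys(results, DOT_TO_UNDERSCORE).keySet());
    }
}
```

With dot-free names there is nothing left for Spark to interpret as a struct field access, so the backquote escaping would no longer be needed for these columns.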

IIRC, the whole model evaluator builder pattern wasn't properly integrated into this codebase (three years ago). I should do it here and now.

vruusmann commented on July 17, 2024

This issue is by no means done (aka closed) - I haven't written a single line of code yet!

lcx517 commented on July 17, 2024

Oh, sorry. I'm looking forward to your new version.
