
Comments (5)

vruusmann commented on July 17, 2024

Is there any plan to support evaluator in spark 3.x?

The underlying JPMML-Evaluator library, and this JPMML-Evaluator-Python wrapper library are both written in the Java language and should therefore be totally agnostic towards Scala and Apache Spark ML versions.

According to the GitHub log, I haven't touched this codebase for three years. I wonder what has changed or broken API-wise in that timeframe?

Or maybe I could attempt to build jpmml-evaluator-spark on spark 3.x on my own?

I haven't marked this codebase as "Archived", so I do have some interest in reviving it. But it's not a high-priority item for me personally.

Please try to deploy the current version on your target Apache Spark ML version (3.2.X perhaps?), and report back all the issues that you're experiencing. Also, if you can suggest immediate fixes to those issues, please do share those as well.

from jpmml-evaluator-spark.

lcx517 commented on July 17, 2024

I rebuilt the project against Spark 3.1.1, and pmml-evaluator now runs successfully on Spark 3.1.1.
I have opened pull request #44 for this version.

I encountered several compatibility problems. The last one has not been solved yet.

  1. untyped Scala UDF
ERROR Instrumentation: org.apache.spark.sql.AnalysisException: You're using untyped Scala UDF, which does not have the input type information. Spark may blindly pass null to the Scala closure with primitive-type argument, and the closure will see the default value of the Java type for the null argument, e.g. `udf((x: Int) => x, IntegerType)`, the result is 0 for null input. To get rid of this error, you could:
1. use typed Scala UDF APIs(without return type parameter), e.g. `udf((x: Int) => x)`
2. use Java UDF APIs, e.g. `udf(new UDF1[String, Integer] { override def call(s: String): Integer = s.length() }, IntegerType)`, if input types are all non primitive
3. set spark.sql.legacy.allowUntypedScalaUDF to true and use this API with caution
	at org.apache.spark.sql.functions$.udf(functions.scala:5021)
	at org.jpmml.evaluator.spark.PMMLTransformer.transform(PMMLTransformer.scala:99)
	at org.apache.spark.ml.PipelineModel.$anonfun$transform$4(Pipeline.scala:311)
	at org.apache.spark.ml.MLEvents.withTransformEvent(events.scala:146)
	at org.apache.spark.ml.MLEvents.withTransformEvent$(events.scala:139)
	at org.apache.spark.ml.util.Instrumentation.withTransformEvent(Instrumentation.scala:42)
	at org.apache.spark.ml.PipelineModel.$anonfun$transform$3(Pipeline.scala:311)
	at scala.collection.IndexedSeqOptimized.foldLeft(IndexedSeqOptimized.scala:60)
	at scala.collection.IndexedSeqOptimized.foldLeft$(IndexedSeqOptimized.scala:68)
	at scala.collection.mutable.ArrayOps$ofRef.foldLeft(ArrayOps.scala:198)
	at org.apache.spark.ml.PipelineModel.$anonfun$transform$2(Pipeline.scala:310)
	at org.apache.spark.ml.MLEvents.withTransformEvent(events.scala:146)
	at org.apache.spark.ml.MLEvents.withTransformEvent$(events.scala:139)
	at org.apache.spark.ml.util.Instrumentation.withTransformEvent(Instrumentation.scala:42)
	at org.apache.spark.ml.PipelineModel.$anonfun$transform$1(Pipeline.scala:308)
	at org.apache.spark.ml.util.Instrumentation$.$anonfun$instrumented$1(Instrumentation.scala:191)
	at scala.util.Try$.apply(Try.scala:213)
	at org.apache.spark.ml.util.Instrumentation$.instrumented(Instrumentation.scala:191)
	at org.apache.spark.ml.PipelineModel.transform(Pipeline.scala:307)

The Spark migration guide explains:

In Spark 3.0, using org.apache.spark.sql.functions.udf(AnyRef, DataType) is not allowed by default.

I searched for a workaround and found that it can be re-enabled by adding:

sparkSession.sql("set spark.sql.legacy.allowUntypedScalaUDF=true")

  2. When an output column name contains ".", the transform function runs escapeColumnName(name) and wraps the column name in backticks, which may cause an error like:
org.apache.spark.sql.AnalysisException: No such struct field `probability(0.0)` in y, pmml(prediction), prediction, probability(0.0), probability(1.0)

My workaround is to stop adding backticks to the column name and to use underscores instead.
This modification is not in my pull request, since I don't have a better idea for this problem yet.


vruusmann commented on July 17, 2024
  1. untyped Scala UDF

In JPMML-Evaluator 1.6.X development branch, the signature of the main evaluation method was changed to:

Map<String, ?> evaluate(Map<String, ?> arguments);

The value type of both the arguments map and the results map is java.lang.Object. In Java land, it is impossible to insert a primitive value (e.g. int, double) into such a Map. I don't know whether it is possible in Scala land.

The main point is that the UDF should keep null references unchanged (instead of replacing them with primitive-like 0 or 0.0 values), because the JPMML-Evaluator uses the null reference for denoting missing values.

Ideally, the Apache Spark UDF could have a signature that states: "send Map<String, Object> in, and get Map<String, Object> back. If there are any null values in the arguments or results maps, keep them as-is".
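
A minimal sketch of that contract in plain Java, with neither Spark nor JPMML-Evaluator on the classpath. The class name `NullPreservingSketch` and both methods are hypothetical, chosen only for illustration; `passThrough` stands in for the evaluation step and is not the library's `evaluate` method. It shows that a `Map<String, Object>` can carry a null value through unchanged, whereas a primitive-typed parameter has to substitute a default:

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class NullPreservingSketch {

    // Hypothetical stand-in for an evaluation step: copies every entry into a
    // new map and leaves null (missing) values exactly as they are.
    static Map<String, Object> passThrough(Map<String, ?> arguments) {
        Map<String, Object> results = new LinkedHashMap<>();
        for (Map.Entry<String, ?> entry : arguments.entrySet()) {
            results.put(entry.getKey(), entry.getValue());
        }
        return results;
    }

    // Mimics what happens when a null reaches a primitive-typed closure:
    // the missing value is silently replaced by the type's default.
    static int toPrimitiveWithDefault(Integer boxed) {
        return (boxed != null) ? boxed : 0;
    }

    public static void main(String[] args) {
        Map<String, Object> arguments = new LinkedHashMap<>();
        arguments.put("x1", 1.5d);
        arguments.put("x2", null); // a missing value

        Map<String, Object> results = passThrough(arguments);

        // The key is still present and its value is still null
        System.out.println(results.containsKey("x2") + " " + results.get("x2"));

        // The same missing value, forced through a primitive, becomes 0
        System.out.println(toPrimitiveWithDefault(null));
    }
}
```

The point of the contrast is that once a missing value has been turned into 0 or 0.0, it can no longer be told apart from a genuine zero.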

When output columns String contains ".", transform function will run escapeColumnName(name) and add back quote to column name.

The org.jpmml.evaluator.ModelEvaluatorBuilder class has a setResultMapper(org.jpmml.evaluator.ResultMapper) method, which lets you "customize" result field names on the fly.

In the current case, you could replace the problematic dot character (.) with some other character, such as the underscore character (_), or delete it altogether:

ModelEvaluatorBuilder evaluatorBuilder = new ModelEvaluatorBuilder(...)
  .setResultMapper(new ResultMapper(){
    @Override
    public String apply(String pmmlName){
      return pmmlName.replace(".", "_");
    }
  });
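
The effect of such a mapper can also be sketched without JPMML-Evaluator at all, for example to check which column names it would produce. In the sketch below, the class `ColumnNameSketch` and the method `renameKeys` are hypothetical helpers, the probability values are made up, and `java.util.function.Function` merely stands in for the mapper type; the collision check is an added assumption, since two distinct field names could in principle map to the same sanitized name:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

final class ColumnNameSketch {

    // Applies a name-mapping function to every key of a results map,
    // keeping the values (including nulls) and the original order.
    static Map<String, Object> renameKeys(Map<String, ?> results, Function<String, String> mapper) {
        Map<String, Object> renamed = new LinkedHashMap<>();
        for (Map.Entry<String, ?> entry : results.entrySet()) {
            String newName = mapper.apply(entry.getKey());
            // Refuse to silently overwrite a column if two names collide
            if (renamed.containsKey(newName)) {
                throw new IllegalStateException("Name collision after renaming: " + newName);
            }
            renamed.put(newName, entry.getValue());
        }
        return renamed;
    }

    public static void main(String[] args) {
        // Illustrative values only; the field names follow the error message in this thread
        Map<String, Object> results = new LinkedHashMap<>();
        results.put("prediction", 1.0d);
        results.put("probability(0.0)", 0.25d);
        results.put("probability(1.0)", 0.75d);

        Map<String, Object> renamed = renameKeys(results, name -> name.replace(".", "_"));
        System.out.println(renamed.keySet());
    }
}
```

With this mapping, `probability(0.0)` becomes `probability(0_0)`, so the name no longer contains a dot and should not need backtick escaping for that reason.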

IIRC, the whole model evaluator builder pattern wasn't properly integrated into this codebase (three years ago). I should do it here and now.


vruusmann commented on July 17, 2024

This issue is by no means done (aka closed) - I haven't written a single line of code yet!


lcx517 commented on July 17, 2024

Oh, sorry. I'm looking forward to your new version~
