Comments (5)
Is there any plan to support evaluator in spark 3.x?
The underlying JPMML-Evaluator library, and this JPMML-Evaluator-Python wrapper library are both written in the Java language and should therefore be totally agnostic towards Scala and Apache Spark ML versions.
According to the GitHub log, I haven't touched this codebase for three years. I wonder what has changed or broken API-wise in this timeframe?
Or maybe I could attempt to build jpmml-evaluator-spark on spark 3.x on my own?
I haven't marked this codebase as "Archived", so I do have some interest in reviving it. But it's not a high-priority item for me personally.
Please try to deploy the current version on your target Apache Spark ML version (3.2.X perhaps?), and report back all the issues that you're experiencing. Also, if you can suggest immediate fixes to those issues, please do share those as well.
from jpmml-evaluator-spark.
I rebuilt the project against Spark 3.1.1, and pmml-evaluator now runs successfully on Spark 3.1.1.
I have a pull request #44 for this version.
I encountered several compatibility problems. The last one has not been solved yet.
- untyped Scala UDF
ERROR Instrumentation: org.apache.spark.sql.AnalysisException: You're using untyped Scala UDF, which does not have the input type information. Spark may blindly pass null to the Scala closure with primitive-type argument, and the closure will see the default value of the Java type for the null argument, e.g. `udf((x: Int) => x, IntegerType)`, the result is 0 for null input. To get rid of this error, you could:
1. use typed Scala UDF APIs(without return type parameter), e.g. `udf((x: Int) => x)`
2. use Java UDF APIs, e.g. `udf(new UDF1[String, Integer] { override def call(s: String): Integer = s.length() }, IntegerType)`, if input types are all non primitive
3. set spark.sql.legacy.allowUntypedScalaUDF to true and use this API with caution
at org.apache.spark.sql.functions$.udf(functions.scala:5021)
at org.jpmml.evaluator.spark.PMMLTransformer.transform(PMMLTransformer.scala:99)
at org.apache.spark.ml.PipelineModel.$anonfun$transform$4(Pipeline.scala:311)
at org.apache.spark.ml.MLEvents.withTransformEvent(events.scala:146)
at org.apache.spark.ml.MLEvents.withTransformEvent$(events.scala:139)
at org.apache.spark.ml.util.Instrumentation.withTransformEvent(Instrumentation.scala:42)
at org.apache.spark.ml.PipelineModel.$anonfun$transform$3(Pipeline.scala:311)
at scala.collection.IndexedSeqOptimized.foldLeft(IndexedSeqOptimized.scala:60)
at scala.collection.IndexedSeqOptimized.foldLeft$(IndexedSeqOptimized.scala:68)
at scala.collection.mutable.ArrayOps$ofRef.foldLeft(ArrayOps.scala:198)
at org.apache.spark.ml.PipelineModel.$anonfun$transform$2(Pipeline.scala:310)
at org.apache.spark.ml.MLEvents.withTransformEvent(events.scala:146)
at org.apache.spark.ml.MLEvents.withTransformEvent$(events.scala:139)
at org.apache.spark.ml.util.Instrumentation.withTransformEvent(Instrumentation.scala:42)
at org.apache.spark.ml.PipelineModel.$anonfun$transform$1(Pipeline.scala:308)
at org.apache.spark.ml.util.Instrumentation$.$anonfun$instrumented$1(Instrumentation.scala:191)
at scala.util.Try$.apply(Try.scala:213)
at org.apache.spark.ml.util.Instrumentation$.instrumented(Instrumentation.scala:191)
at org.apache.spark.ml.PipelineModel.transform(Pipeline.scala:307)
The Spark migration guide says:
In Spark 3.0, using org.apache.spark.sql.functions.udf(AnyRef, DataType) is not allowed by default.
I searched for a workaround and found that adding the following setting gets past the error:
sparkSession.sql("set spark.sql.legacy.allowUntypedScalaUDF=true")
- When an output column name contains ".", the transform function runs escapeColumnName(name) and wraps the column name in backticks, which may cause an error like:
org.apache.spark.sql.AnalysisException: No such struct field `probability(0.0)` in y, pmml(prediction), prediction, probability(0.0), probability(1.0)
My workaround is to stop adding backticks to the column name and to substitute underscores instead.
This modification is not in my pull request, since I have no better idea for this problem.
from jpmml-evaluator-spark.
- untyped Scala UDF
In the JPMML-Evaluator 1.6.X development branch, the signature of the main evaluation method was changed to:
Map<String, ?> evaluate(Map<String, ?> arguments);
The value type of both the arguments map and the results map is java.lang.Object. In Java land, it is impossible to insert a primitive value (e.g. int, double) into such a Map. I don't know whether it is possible in Scala land.
The main point is that the UDF should keep null references unchanged (instead of replacing them with primitive-like 0 or 0.0 values), because JPMML-Evaluator uses the null reference for denoting missing values.
Ideally, the Apache Spark UDF could have a signature that states: "send Map<String, Object> in, and get Map<String, Object> back. If there are any null values in the arguments or results maps, keep them as-is".
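To make the null-handling point concrete, here is a minimal sketch in plain Java, with no Spark or JPMML dependency. The class name, the `toArguments` helper and the echoing `evaluate` method are hypothetical stand-ins that only mimic the `Map<String, ?>` in, `Map<String, ?>` out shape described above; they are not the JPMML-Evaluator API. The sketch assumes the caller passes boxed types (`Integer`, `Double`), so a missing value arrives as `null` rather than as `0` or `0.0`.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class NullSafeArguments {

    // Hypothetical stand-in with the Map<String, ?> -> Map<String, ?> shape.
    // It only echoes its inputs; it is NOT the JPMML-Evaluator implementation.
    static Map<String, ?> evaluate(Map<String, ?> arguments) {
        Map<String, Object> results = new LinkedHashMap<>();
        for (Map.Entry<String, ?> entry : arguments.entrySet()) {
            // A null value is copied as null, never replaced by 0 or 0.0
            results.put("echo(" + entry.getKey() + ")", entry.getValue());
        }
        return results;
    }

    // Builds the arguments map from boxed values, so a missing value stays null
    static Map<String, Object> toArguments(Integer age, Double income) {
        Map<String, Object> arguments = new LinkedHashMap<>();
        arguments.put("age", age);       // stored as-is, including null
        arguments.put("income", income);
        return arguments;
    }

    public static void main(String[] args) {
        // "age" is missing, "income" is present
        Map<String, ?> results = evaluate(toArguments(null, 1250.5));
        System.out.println(results);
    }
}
```

The key is still present in the results map, but its value is `null`, which is what lets downstream code tell "missing" apart from a genuine zero.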
When output columns String contains ".", transform function will run escapeColumnName(name) and add back quote to column name.
The org.jpmml.evaluator.ModelEvaluatorBuilder class has a setResultMapper(org.jpmml.evaluator.ResultMapper) method, which lets you "customize" result field names on the fly.
In the current case, you could replace the problematic dot character (.) with some other character, such as the underscore character (_), or delete it altogether:
ModelEvaluatorBuilder evaluatorBuilder = new ModelEvaluatorBuilder(...)
    .setResultMapper(new ResultMapper(){
        @Override
        public String apply(String pmmlName){
            return pmmlName.replace(".", "_");
        }
    });
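For readers who want to see the effect of such a renaming without a JPMML setup, the following is a rough, dependency-free sketch. It uses a plain `java.util.function.Function<String, String>` as a stand-in for the result mapper; the class name `ResultNameMapping`, the `renameKeys` helper and the duplicate-name check are illustrative assumptions, not part of JPMML-Evaluator.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

class ResultNameMapping {

    // Stand-in for a result mapper: replaces every dot in a field name with an underscore
    static final Function<String, String> DOT_TO_UNDERSCORE = name -> name.replace(".", "_");

    // Returns a copy of the results with every key renamed; values (including null) are untouched
    static Map<String, Object> renameKeys(Map<String, ?> results, Function<String, String> mapper) {
        Map<String, Object> renamed = new LinkedHashMap<>();
        for (Map.Entry<String, ?> entry : results.entrySet()) {
            String newName = mapper.apply(entry.getKey());
            if (renamed.containsKey(newName)) {
                // Two distinct names may collapse into one after the replacement
                throw new IllegalArgumentException("Duplicate field name after mapping: " + newName);
            }
            renamed.put(newName, entry.getValue());
        }
        return renamed;
    }

    public static void main(String[] args) {
        Map<String, Object> results = new LinkedHashMap<>();
        results.put("prediction", 1.0);
        results.put("probability(0.0)", 0.25);
        results.put("probability(1.0)", 0.75);
        // The renamed keys no longer contain a dot, so no backtick escaping is needed
        System.out.println(renameKeys(results, DOT_TO_UNDERSCORE).keySet());
    }
}
```

The duplicate check is a design choice worth keeping in any real variant: a replacement like this can map two different PMML names onto the same column name, and silently overwriting one of them would be hard to debug.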
IIRC, the whole model evaluator builder pattern wasn't properly integrated into this codebase (three years ago). I should do it here and now.
from jpmml-evaluator-spark.
This issue is by no means done (aka closed) - I haven't written a single line of code yet!
from jpmml-evaluator-spark.
Oh, sorry. I'm looking forward to your new version.
from jpmml-evaluator-spark.