hydrospheredata / spark-ml-serving

Spark ML Lib serving library

License: Apache License 2.0

Scala 98.74% Shell 1.26%

spark serving scoring inference

spark-ml-serving's Introduction

Spark-ml-serving

Contextless ML implementation of Spark ML.

Proposal

Serving small ML pipelines does not require creating a SparkContext or using cluster-related features. This project provides its own implementations of the ML Transformers. Some of them call context-independent Spark methods.

Structure

Instead of using DataFrames, we implemented a simple LocalData class so that no SparkContext is needed. All Transformers are rewritten to accept LocalData.
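To illustrate the idea of a column-oriented container that replaces a DataFrame, here is a minimal sketch. The names SketchColumn, SketchData and lowercase are hypothetical stand-ins invented for this example. They are not the library's LocalDataColumn or LocalData classes, whose actual fields and methods may differ.

```scala
object LocalDataSketch {
  // Hypothetical stand-in for a named column of values (not the library's class).
  final case class SketchColumn(name: String, data: Seq[Any])

  // Hypothetical stand-in for a small in-memory table: just a list of columns.
  final case class SketchData(columns: List[SketchColumn]) {
    def column(name: String): Option[SketchColumn] =
      columns.find(_.name == name)

    // Adds the column, replacing any existing column with the same name.
    def withColumn(col: SketchColumn): SketchData =
      SketchData(columns.filterNot(_.name == col.name) :+ col)
  }

  // A toy "transformer": reads one column and appends a derived column.
  // If the input column is missing, the data is returned unchanged.
  def lowercase(input: String, output: String)(data: SketchData): SketchData =
    data.column(input) match {
      case Some(col) =>
        val lowered = col.data.map {
          case s: String => s.toLowerCase
          case other     => other
        }
        data.withColumn(SketchColumn(output, lowered))
      case None => data
    }
}
```

Because the container is plain Scala data, a transformer of this shape is an ordinary function call with no session or cluster setup.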

How to use

Import this project as a dependency:

scalaVersion := "2.11.8"
// The artifact name depends on the Spark version used for model training.
// Keep only the block that matches your Spark version.
// Spark 2.0.x
libraryDependencies ++= Seq(
  "io.hydrosphere" %% "spark-ml-serving-2_0" % "0.3.0",
  "org.apache.spark" %% "spark-mllib" % "2.0.2"
)
// Spark 2.1.x
libraryDependencies ++= Seq(
  "io.hydrosphere" %% "spark-ml-serving-2_1" % "0.3.0",
  "org.apache.spark" %% "spark-mllib" % "2.1.2"
)
// Spark 2.2.x
libraryDependencies ++= Seq(
  "io.hydrosphere" %% "spark-ml-serving-2_2" % "0.3.0",
  "org.apache.spark" %% "spark-mllib" % "2.2.0"
)

Use it: example

import io.hydrosphere.spark_ml_serving._
import LocalPipelineModel._

// ....
val model = LocalPipelineModel.load("PATH_TO_MODEL") // Load
val columns = List(LocalDataColumn("text", Seq("Hello!")))
val localData = LocalData(columns)
val result = model.transform(localData) // Transformed result

More examples for different ML models are in the tests.

spark-ml-serving's Issues

Pipelines with more than 9 stages don't load correctly

If a pipeline has more than 9 stages, the saved stages cannot be loaded. I think the code builds the stage path with a single-digit index prefix, while the saved directories use a zero-padded one. Please check out the logs below.

Saved model at: [./target/test_models/2.3.1/regextokenizer-...-indextostring]
Loading model from: [./target/test_models/2.3.1/regextokenizer-...-indextostring]

File target/test_models/2.3.1/regextokenizer-...-indextostring/stages/0_regexTok_a4348c4a1fd1/metadata/part-00000 does not exist
java.io.FileNotFoundException: File target/test_models/2.3.1/regextokenizer-...-indextostring/stages/0_regexTok_a4348c4a1fd1/metadata/part-00000 does not exist
	at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:539)
	at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:752)
	at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:529)

However, the files are there, as shown below:

samik@samik-lap:~/git/spark-ml-serving/target/test_models/2.3.1/regextokenizer-...-indextostring/stages$ l
total 76K
drwxr-xr-x 19 samik samik 4.0K Sep  5 18:07 .
drwxr-xr-x  4 samik samik 4.0K Sep  5 18:07 ..
drwxr-xr-x  3 samik samik 4.0K Sep  5 18:07 00_regexTok_a4348c4a1fd1
drwxr-xr-x  3 samik samik 4.0K Sep  5 18:07 01_stopWords_4061e58af19e
drwxr-xr-x  3 samik samik 4.0K Sep  5 18:07 02_ngram_3eebcba67f89
drwxr-xr-x  3 samik samik 4.0K Sep  5 18:07 03_ngram_00c144c7006b
drwxr-xr-x  3 samik samik 4.0K Sep  5 18:07 04_ngram_c8cda5f15e23
drwxr-xr-x  3 samik samik 4.0K Sep  5 18:07 05_hashingTF_da56ada55a20
drwxr-xr-x  3 samik samik 4.0K Sep  5 18:07 06_hashingTF_426b130af12d
drwxr-xr-x  3 samik samik 4.0K Sep  5 18:07 07_hashingTF_3fdadf69e9c4
drwxr-xr-x  3 samik samik 4.0K Sep  5 18:07 08_hashingTF_5e81e16aa1df
drwxr-xr-x  4 samik samik 4.0K Sep  5 18:07 09_idf_91a0d9e2bd25
drwxr-xr-x  4 samik samik 4.0K Sep  5 18:07 10_idf_ff7e432c7377
drwxr-xr-x  4 samik samik 4.0K Sep  5 18:07 11_idf_297287d8b400
drwxr-xr-x  4 samik samik 4.0K Sep  5 18:07 12_idf_45a0afd1fe51
drwxr-xr-x  3 samik samik 4.0K Sep  5 18:07 13_vecAssembler_7a2ee0c762c5
drwxr-xr-x  4 samik samik 4.0K Sep  5 18:07 14_strIdx_b7fbad214c7e
drwxr-xr-x  9 samik samik 4.0K Sep  5 18:07 15_oneVsRest_a54f608345aa
drwxr-xr-x  3 samik samik 4.0K Sep  5 18:07 16_idxToStr_0def5421e41c

The test code is available in my fork: https://github.com/samikrc/spark-ml-serving
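
The mismatch above (the loader asks for 0_regexTok_... while the directory on disk is 00_regexTok_...) is consistent with a naming scheme that zero-pads the stage index to the number of digits in the stage count. The following is a minimal sketch of such a scheme, inferred only from the listing above. The object and function names are hypothetical, and this is not taken from the library's or Spark's source.

```scala
object StagePathSketch {
  // Pads the stage index with zeros so that every index has as many digits
  // as the total number of stages (assumption inferred from the listing above).
  def stageDirName(stageIdx: Int, numStages: Int, stageUid: String): String = {
    val width = numStages.toString.length
    s"%0${width}d".format(stageIdx) + "_" + stageUid
  }
}
```

With 17 stages, stageDirName(0, 17, "regexTok_a4348c4a1fd1") yields "00_regexTok_a4348c4a1fd1", while a loader that always uses a single digit would look for "0_regexTok_a4348c4a1fd1" and fail as in the log.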

Add support for Gradient-boosted tree classifier

Please implement support for:
https://spark.apache.org/docs/2.2.0/ml-classification-regression.html#gradient-boosted-tree-classifier

Add Spark 2.2 ML models

http://spark.apache.org/releases/spark-release-2-2-0.html

Pyspark models fail to import

PySpark models that contain data fail to import. The data parameters are exported as an array and not a list; I assume this is an issue between Scala and Python. So far, logistic regression and random forest are affected. It looks like the data util class expects a list of impurityStats, but PySpark exports it as an array. Logistic regression fails on colPtrs in Spark 2.2 classification.

LDA models are missing. LocalLDAModel and DistributedLDAModel

I am trying to use an LDA algorithm, but I can't find it in this library. Can you give me an example of how to add something like this?

Vector assembler

Vector assembler is missing. Can you please add it?

Add new models

How would you add a new model? I am trying to run a pipeline that uses a regex tokenizer. I saw in the code that you have a regular tokenizer, but not a regex tokenizer. I can add that class and compile the code successfully, but how do I get the runtime to use my locally compiled code instead of the published artifact? Does that make sense?
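
As a starting point for the regex tokenizer mentioned above, here is a minimal sketch of the tokenization step on its own. It follows the documented parameters of Spark's RegexTokenizer (pattern, gaps, minTokenLength, toLowercase) and their defaults. The object name is hypothetical, and wiring the function into this library's transformer and LocalData classes is not shown, since that depends on the project's internals.

```scala
object RegexTokenizerSketch {
  // Splits `text` into tokens.
  //  - gaps = true:  `pattern` matches the separators between tokens.
  //  - gaps = false: `pattern` matches the tokens themselves.
  // Tokens shorter than `minTokenLength` are dropped.
  def tokenize(
      text: String,
      pattern: String = "\\s+",
      gaps: Boolean = true,
      minTokenLength: Int = 1,
      toLowercase: Boolean = true
  ): List[String] = {
    val re  = pattern.r
    val str = if (toLowercase) text.toLowerCase else text
    val tokens =
      if (gaps) re.split(str).toList
      else re.findAllIn(str).toList
    tokens.filter(_.length >= minTokenLength)
  }
}
```

For example, RegexTokenizerSketch.tokenize("Hello  Spark ML") splits on whitespace after lowercasing the input.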

Working with a model outside pipelines

Currently there is only one way to serve models locally: load them from a stored pipeline. If a model is saved with just model.save() (without a pipeline), there is no way to load it.

Thread-safety of `transform` methods

We need to analyze the internal Spark ML methods to decide whether it is safe to call LocalTransformer.transform from multiple threads.
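
Until that analysis is done, one conservative option is to serialize calls behind a lock. The sketch below assumes nothing about the library: Guarded, runConcurrently and guardedUpper are hypothetical names, and the wrapped function is a toy stand-in for a transform call whose thread-safety is unknown. It trades throughput for safety and says nothing about whether the real methods are thread-safe.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global

object ThreadSafetySketch {
  // Wraps a function so that at most one thread runs it at a time.
  final class Guarded[A, B](transform: A => B) extends (A => B) {
    private val lock = new Object
    def apply(input: A): B = lock.synchronized(transform(input))
  }

  // Calls `f` on every input from the global thread pool and collects the
  // results in input order. Fails if the calls take longer than 30 seconds.
  def runConcurrently[A, B](f: A => B, inputs: Seq[A]): Seq[B] = {
    val futures = inputs.map(in => Future(f(in)))
    Await.result(Future.sequence(futures), 30.seconds)
  }

  // Toy stand-in for a transform call (not a real transformer).
  val guardedUpper: String => String =
    new Guarded[String, String](_.toUpperCase)
}
```

Comparing the output of runConcurrently with a sequential run over the same inputs gives a cheap smoke test, although passing it does not prove thread-safety.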