spark-ml-serving's Introduction

Spark-ml-serving

A contextless implementation of Spark ML.

Proposal

Serving small ML pipelines does not require creating a SparkContext or using cluster-related features. This project provides its own implementations of the ML Transformers, some of which call context-independent Spark methods.

Structure

Instead of using DataFrames, we implemented a simple LocalData class to remove the dependency on SparkContext. All Transformers are rewritten to accept LocalData.
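
To make the idea concrete, here is a minimal sketch of a column-oriented container and a transformer that works on it without any SparkContext. The names `SketchColumn`, `SketchData`, and `SketchTokenizer` are hypothetical stand-ins invented for illustration; they are not the library's actual LocalData classes, and the real implementation may differ.

```scala
// Hypothetical, simplified stand-ins; NOT the library's actual classes.
final case class SketchColumn[T](name: String, data: List[T])

final case class SketchData(columns: List[SketchColumn[_]]) {
  // Look up a column by name.
  def column(name: String): Option[SketchColumn[_]] =
    columns.find(_.name == name)

  // Add a column, replacing any existing column with the same name.
  def withColumn(col: SketchColumn[_]): SketchData =
    SketchData(columns.filterNot(_.name == col.name) :+ col)
}

// A toy transformer: lowercases text and splits it on whitespace.
// It only touches in-memory lists, so no SparkContext is involved.
object SketchTokenizer {
  def transform(data: SketchData, inputCol: String, outputCol: String): SketchData =
    data.column(inputCol) match {
      case Some(col) =>
        val tokens = col.data.map(_.toString.toLowerCase.split("\\s+").toList)
        data.withColumn(SketchColumn(outputCol, tokens))
      case None =>
        data // input column missing: return the data unchanged
    }
}
```

Each transformer takes a data container and returns a new one with an extra column, which roughly mirrors how a DataFrame-based Transformer appends its output column.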

How to use

  1. Import this project as a dependency:
scalaVersion := "2.11.8"
// The artifact name depends on which version of Spark you used for model training.
// Use only the block that matches your Spark version.
// spark 2.0.x
libraryDependencies ++= Seq(
  "io.hydrosphere" %% "spark-ml-serving-2_0" % "0.3.0",
  "org.apache.spark" %% "spark-mllib" % "2.0.2"
)
// spark 2.1.x
libraryDependencies ++= Seq(
  "io.hydrosphere" %% "spark-ml-serving-2_1" % "0.3.0",
  "org.apache.spark" %% "spark-mllib" % "2.1.2"
)
// spark 2.2.x
libraryDependencies ++= Seq(
  "io.hydrosphere" %% "spark-ml-serving-2_2" % "0.3.0",
  "org.apache.spark" %% "spark-mllib" % "2.2.0"
)
  2. Use it, for example:
import io.hydrosphere.spark_ml_serving._
import LocalPipelineModel._

// ....
val model = LocalPipelineModel.load("PATH_TO_MODEL") // Load
val columns = List(LocalDataColumn("text", Seq("Hello!")))
val localData = LocalData(columns)
val result = model.transform(localData) // Transformed result

More examples of different ML models are in tests.

spark-ml-serving's People

Contributors

dos65, kineticcookie

spark-ml-serving's Issues

Pipelines with more than 9 stages don't load correctly

If a pipeline has more than 9 stages, the saved stages cannot be loaded. I think the code builds the stage path with a single-digit index prefix (0_...), while the saved directories use a zero-padded one (00_...). Please check out the logs below.

Saved model at: [./target/test_models/2.3.1/regextokenizer-...-indextostring]
Loading model from: [./target/test_models/2.3.1/regextokenizer-...-indextostring]

File target/test_models/2.3.1/regextokenizer-...-indextostring/stages/0_regexTok_a4348c4a1fd1/metadata/part-00000 does not exist
java.io.FileNotFoundException: File target/test_models/2.3.1/regextokenizer-...-indextostring/stages/0_regexTok_a4348c4a1fd1/metadata/part-00000 does not exist
	at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:539)
	at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:752)
	at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:529)

However, the files are there, as shown below:

samik@samik-lap:~/git/spark-ml-serving/target/test_models/2.3.1/regextokenizer-...-indextostring/stages$ l
total 76K
drwxr-xr-x 19 samik samik 4.0K Sep  5 18:07 .
drwxr-xr-x  4 samik samik 4.0K Sep  5 18:07 ..
drwxr-xr-x  3 samik samik 4.0K Sep  5 18:07 00_regexTok_a4348c4a1fd1
drwxr-xr-x  3 samik samik 4.0K Sep  5 18:07 01_stopWords_4061e58af19e
drwxr-xr-x  3 samik samik 4.0K Sep  5 18:07 02_ngram_3eebcba67f89
drwxr-xr-x  3 samik samik 4.0K Sep  5 18:07 03_ngram_00c144c7006b
drwxr-xr-x  3 samik samik 4.0K Sep  5 18:07 04_ngram_c8cda5f15e23
drwxr-xr-x  3 samik samik 4.0K Sep  5 18:07 05_hashingTF_da56ada55a20
drwxr-xr-x  3 samik samik 4.0K Sep  5 18:07 06_hashingTF_426b130af12d
drwxr-xr-x  3 samik samik 4.0K Sep  5 18:07 07_hashingTF_3fdadf69e9c4
drwxr-xr-x  3 samik samik 4.0K Sep  5 18:07 08_hashingTF_5e81e16aa1df
drwxr-xr-x  4 samik samik 4.0K Sep  5 18:07 09_idf_91a0d9e2bd25
drwxr-xr-x  4 samik samik 4.0K Sep  5 18:07 10_idf_ff7e432c7377
drwxr-xr-x  4 samik samik 4.0K Sep  5 18:07 11_idf_297287d8b400
drwxr-xr-x  4 samik samik 4.0K Sep  5 18:07 12_idf_45a0afd1fe51
drwxr-xr-x  3 samik samik 4.0K Sep  5 18:07 13_vecAssembler_7a2ee0c762c5
drwxr-xr-x  4 samik samik 4.0K Sep  5 18:07 14_strIdx_b7fbad214c7e
drwxr-xr-x  9 samik samik 4.0K Sep  5 18:07 15_oneVsRest_a54f608345aa
drwxr-xr-x  3 samik samik 4.0K Sep  5 18:07 16_idxToStr_0def5421e41c

The test code is available in my fork: https://github.com/samikrc/spark-ml-serving
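
The directory listing suggests that each stage directory is named with its index zero-padded to the number of digits in the total stage count, followed by the stage uid. A minimal sketch of that naming scheme, assuming this is how the saved names are formed; the object and method names here are hypothetical and are not taken from this library:

```scala
object StagePathSketch {
  // Builds a stage directory name such as "00_regexTok_a4348c4a1fd1".
  // The index is zero-padded to as many digits as numStages has,
  // so 17 stages give a two-digit prefix and 9 stages a one-digit prefix.
  def stageDirName(stageIdx: Int, numStages: Int, stageUid: String): String = {
    val digits = numStages.toString.length
    val idxFormat = s"%0${digits}d"
    idxFormat.format(stageIdx) + "_" + stageUid
  }
}
```

Under this assumption, a loader that always formats the index without padding would look for `0_regexTok_a4348c4a1fd1` and miss the `00_regexTok_a4348c4a1fd1` directory that was actually written.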

PySpark models fail to import

Any PySpark model that contains data fails to import. The data parameters are exported as an array and not a list; I am assuming this is a mismatch between Scala and Python. So far, logistic regression and random forest have issues. It looks like the data util class is looking for a list of impurityStats, but PySpark exports it as an array. Logistic regression fails on colPtrs in Spark 2.2 classification.
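
One possible way to tolerate both shapes is to normalize the value before using it. The sketch below is only an illustration of that idea, assuming the loaded parameter arrives as either an Array or some Seq; `ParamCoercionSketch` and `toList` are hypothetical names, not part of this library, and this is not a confirmed fix for the issue.

```scala
object ParamCoercionSketch {
  // Accepts either an Array or any Seq (List, WrappedArray, ...) and
  // returns its elements as a List. Any other input is rejected explicitly.
  def toList(value: Any): List[Any] = value match {
    case arr: Array[_] => arr.toList
    case seq: Seq[_]   => seq.toList
    case other =>
      throw new IllegalArgumentException(
        s"Expected an array or a sequence, got: $other")
  }
}
```

With a helper like this, code that expects a list of impurityStats or colPtrs could accept the array form as well.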

Add new models

How would you add a new model? I am trying to run a pipeline that uses a regex tokenizer. I saw in the code that you have a regular tokenizer, but not a regex tokenizer. I can add that class and compile the code successfully, but how do I get the runtime to use my locally compiled code instead of the prebuilt artifact from the repository? Does that make sense?

Working with a model outside pipelines

Currently there is only one way to serve models locally: load them from a stored pipeline. If a model is saved with just model.save() (without a pipeline), there is no way to load it.
