This is a scalable, high-performance machine learning package for Spark
.
This package is maintained independently from MLLib
in Spark
.
There's a chance that some of the algorithms would later get merged with MLLib
.
The algorithms listed here are going through continuous development, maintenance and updates and thus are provided without guarantees.
Use at your own risk.
This package is provided under the Apache license 2.0.
The current version is 0.2
.
Any comments or questions should be directed to [email protected]
The currently available algorithms are:
Clone this repository. And run ./sbt assembly
at the project directory. To create an assembly.
- Get
Spark
version 1.0.1. A pre-built version can be downloaded from here for some of Hadoop variants. For different Hadoop versions, you'll have to build it after cloning it from github. E.g., to buildSpark
for Apache Hadoop 2.0.5-alpha withYARN
support, you could do the following. git clone https://github.com/apache/spark.git
git checkout tags/v1.0.1
SPARK_HADOOP_VERSION=2.0.5-alpha SPARK_YARN=true sbt/sbt assembly
orSPARK_HADOOP_VERSION=2.0.5-alpha SPARK_YARN=true sbt/sbt -Dsbt.override.build.repos=true assembly
- If you want to run this against a different
Spark
version, you should modifyproject/build.scala
and change versions ofspark-core
andspark-mllib
to appropriate versions. Of course, you'll also need to build a matching version ofSpark
. - Additionally, by default, this package builds against
hadoop-client
version1.0.4
. This will have to change, for instance if you want to build this against different Hadoop versions that are not protocol-compatible with this version. Refer to thisSpark
page to find out about different Hadoop versions. - Clone this repository, and run
./sbt assembly
. - In order to connect to Hadoop clusters, you should have Hadoop configurations stored somewhere. E.g., if your Hadoop configurations are stored in
/home/me/hd-config
, then make sure to have the following environment variables.
export HADOOP_CONF_DIR=/home/me/hd-config
export YARN_CONF_DIR=/home/me/hd-config
- Find the location of the
Spark
assembly jar. E.g., it might beassembly/target/scala-2.10/spark-assembly-1.0.1-hadoop2.0.5-alpha.jar
under theSpark
directory. Runexport SPARK_JAR=jar_location
. - Have some data you want to train on in HDFS. A couple of data sets are provided in this package under the
data
directory for quick testing. E.g., copymnist.tsv.gz
andmnist.t.tsv.gz
to a HDFS directory (E.g./Datasets/
). - To train a classifier using
YARN
, run the following.SPARK_DIR
should be replaced with the directory ofSpark
andSPARK_ML_DIR
should be replaced with the directory where this package resides.
SPARK_DIR/bin/spark-submit --master yarn --deploy-mode cluster --class spark_ml.sequoia_forest.SequoiaForestRunner --name SequoiaForestRunner --driver-memory 4G --executor-memory 4G --num-executors 10 SPARK_ML_DIR/target/scala-2.10/spark_ml-assembly-0.1.jar --inputPath /Datasets/mnist.tsv.gz --outputPath /ModelOutputs/mnist --numTrees 100 --numPartitions 10 --labelIndex 780
- This will train a classification forest with 100 trees using the column 780 as the label and all the other columns as numeric features. The final model would be stored in
/ModelOutputs/mnist
in HDFS.
- Check status of training through the Hadoop job tracker page.
Spark
also provides its internal progress report when you click on the job's application master link from the job tracker page. - Once training is finished, you can predict on a new data set using the following command.
SPARK_DIR/bin/spark-submit --master yarn --deploy-mode cluster --class spark_ml.sequoia_forest.SequoiaForestPredictor --name SequoiaForestPredictor --driver-memory 4G --executor-memory 4G --num-executors 4 SPARK_ML_DIR/target/scala-2.10/spark_ml-assembly-0.1.jar --inputPath /Datasets/mnist.t.tsv.gz --forestPath /ModelOutputs/mnist --outputPath /ModelOutputs/mnistpredictions --labelIndex 780 --outputFieldIndices 780 --pauseDuration 100
- The above command would predict on
mnist.t.tsv.gz
using the previously trained model in/ModelOutputs/mnist
and write predictions under/ModelOutputs/mnistpredictions
. It'll also write the value of the column 780 (which happens to be the label in this case) along with the predicted value. In the standard output log of the driver, you should also be able to see computed accuracy since the label is given in this case.
- Training regression requires adding an argument
--forestType Variance
. Likewise, using categorical features requires adding an argument like--categoricalFeatureIndices 5,6
. This would mean that columns 5 and 6 are to be treated as categorical features. For other options, refer to the command line arguments described below.