A learning-to-rank recommender system built on Apache Spark. This page shows how to set up the development environment; a wiki associated with this project covers the technical details.
This project uses Scala as the main development language, which Spark itself is written in. To collaborate with the team on the project, you will also need Git set up on your local machine (to run the program on Yarn, Git must also be set up on the gateway machine in order to clone and pull code from GitHub).
The following installations will help you learn Scala and Spark. Note that these two steps are not required to develop and run the project; the Scala and Spark used by the project will be downloaded separately.
- Download and install Scala. The Scala version should be compatible with the Spark version installed in the next step; for example, Spark 1.0.0 and Spark 0.9.1 use Scala 2.10. After installation, typing `scala` on the command line should take you to the Scala interactive shell, which is the best way to learn Scala.
- Download and install Spark. The latest Spark version to date is 1.0.0. Spark provides a Spark shell that includes an instance of `val sc: SparkContext`, the main entry point of Spark. The Spark shell is the best way to learn Spark.
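Once the Scala shell works, a few lines worth typing into it are plain collection transformations, since these are the same idioms that Spark's RDD API mirrors. The values below are purely illustrative:

```scala
// Collection transformations to try in the Scala interactive shell.
// Spark's RDD API (map, groupBy, reduce, ...) follows the same style.
val words = List("spark", "scala", "rank")
val lengths = words.map(_.length)   // List(5, 5, 4)
val total = lengths.sum             // 14
// Key-value pairs grouped by key, mirroring Spark's pair-RDD operations
val counts = words.map(w => (w.take(1), 1)).groupBy(_._1).mapValues(_.map(_._2).sum)
```

The last line counts words by first letter, the classic word-count shape that most introductory Spark programs use.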
Currently the development environment is Eclipse. To use Eclipse as the IDE, follow these steps:
- Download and install Eclipse. The latest release at the time of writing is Kepler (4.3).
- Install the following plugins in Eclipse:
- Install Scala IDE. The recommended installation is via an Update Site in Eclipse (`Help` > `Install New Software...` > `Add...`). Note that the Scala IDE version should be consistent with the Scala version used with Spark. For Scala 2.10.4, the Scala IDE 3.0.3/3.0.4 update site is http://download.scala-ide.org/sdk/helium/e38/scala210/stable/site
- Install m2eclipse. The Spark project is managed by the Maven build system. The recommended installation is dragging the installation icon into Eclipse.
If you are creating a Spark project from scratch on GitHub and want to use Eclipse, the most efficient way to do this (but hold off until you have read this entire section) is as follows:
- Create a repository on GitHub.
- Clone the repository using a Git client on your local machine.
- Create a plain Scala (or Python) project using the folder cloned from GitHub. Eclipse should recognize that the folder belongs to a repository (an additional cylinder appears in the project icon).
- Add a Maven dependency on `spark-core` of the corresponding version. To do this, right-click the project, choose `Configure` > `Convert to Maven Project`, and follow the instructions to add dependencies. For Spark 1.0.0, the Maven coordinates are `groupId = org.apache.spark`, `artifactId = spark-core_2.10`, `version = 1.0.0`.
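Expressed directly in `pom.xml`, the coordinates above become a single dependency element; a minimal sketch:

```xml
<!-- Spark core for Scala 2.10, as given in the coordinates above -->
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.10</artifactId>
  <version>1.0.0</version>
</dependency>
```

This goes inside the `<dependencies>` section of the `pom.xml`; the Eclipse dialog produces the equivalent entry.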
However, this may cause problems for deployment because it may not package a jar file (according to Yuan Zhang). An alternative is therefore to start from a Maven project:
- Create a repository on GitHub.
- Clone the repository using a Git client on your local machine.
- Create a Maven project, using the name of the folder cloned from GitHub as the artifact name. Eclipse should recognize that the folder belongs to a repository (an additional cylinder appears in the project icon).
- Add the Scala nature via `Configure` > `Add Scala Nature` to enable Scala. At this point you should be able to compile Scala files.
- To use Spark 1.0.0, add the dependency `groupId = org.apache.spark`, `artifactId = spark-core_2.10`, `version = 1.0.0`. Since the `spark-core_2.10` dependency already includes Scala, you may remove the Scala Library added by Eclipse in Step 4 from the `Java Build Path` in the project `Properties`.
- More often than not, the default Maven JVM setting, `J2SE-1.5`, does not match the compatibility level of the system, which produces a distracting warning. Follow this page to adjust the `compiler compliance level` under `Java Compiler` in the project `Properties`.
- The `pom.xml` file must be set up according to the project's requirements. One reference is the one in this project. This might be the most time-consuming part; if yours is not working, copy the one from this project and make minor changes (group ID, artifact ID, etc.).
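The compliance-level warning in the step above can also be silenced from the build itself. A hedged sketch of the relevant `pom.xml` fragment, configuring `maven-compiler-plugin` to target a newer Java level than the J2SE-1.5 default (the plugin version and the 1.6 level here are assumptions; match them to your system):

```xml
<!-- Sketch: raise the Java level from the J2SE-1.5 default to avoid the
     compiler compliance warning. Adjust source/target to your JDK. -->
<build>
  <plugins>
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-compiler-plugin</artifactId>
      <version>3.1</version>
      <configuration>
        <source>1.6</source>
        <target>1.6</target>
      </configuration>
    </plugin>
  </plugins>
</build>
```

After editing `pom.xml`, right-click the project and run `Maven` > `Update Project...` so Eclipse picks up the new compiler settings.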