A learning-to-rank recommender system built on Apache Spark. This page shows how to set up the development environment; a wiki associated with this project covers the technical details.
This project uses Scala as the main development language, which Spark itself is written in. To collaborate with the team on the project, you will also need Git set up on your local machine (to run the program on Yarn, Git must also be set up on the gateway machine in order to clone and pull code from GitHub).
The following installations will help you learn Scala and Spark. Note that these two steps are not required to develop and run the project; the Scala and Spark used by the project will be downloaded separately.
- Download and install Scala. The Scala version should be compatible with the Spark version installed in the next step; for example, Spark 1.0.0 and Spark 0.9.1 use Scala 2.10. After installation, typing `scala` on the command line should take you to the Scala interactive shell, which is the best way to learn Scala.
- Download and install Spark. The latest Spark version to date is 1.0.0. Spark provides a Spark shell that includes an instance of `val sc: SparkContext`, the main entry point of Spark. The Spark shell is the best way to learn Spark.
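Once the Scala shell works, a few lines worth typing into it are plain collection transformations, since these are the same idioms that Spark's RDD API mirrors. The values below are purely illustrative:

```scala
// Collection transformations to try in the Scala interactive shell.
// Spark's RDD API (map, groupBy, reduce, ...) follows the same style.
val words = List("spark", "scala", "rank")
val lengths = words.map(_.length)   // List(5, 5, 4)
val total = lengths.sum             // 14
// Key-value pairs grouped by key, mirroring Spark's pair-RDD operations
val counts = words.map(w => (w.take(1), 1)).groupBy(_._1).mapValues(_.map(_._2).sum)
```

The last line counts words by first letter, the classic word-count shape that most introductory Spark programs use.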
Currently the development environment is Eclipse. To use Eclipse as the IDE, follow these steps:
- Download and install Eclipse. The latest release at the time of writing is Kepler (4.3).
- Install the following plugins in Eclipse:
- Install Scala IDE. The recommended installation is via an Update Site in Eclipse (`Help` > `Install New Software...` > `Add...`). Note that the Scala IDE version should be consistent with the Scala version used with Spark. For Scala 2.10.4, the Scala IDE 3.0.3/3.0.4 update site is http://download.scala-ide.org/sdk/helium/e38/scala210/stable/site
- Install m2eclipse. The Spark project is managed by the Maven build system. The recommended installation is dragging the installation icon into Eclipse.
If you are creating a Spark project from scratch on GitHub and want to use Eclipse, the most efficient way to do this (but hold off until you have read this entire section) is as follows:
- Create a repository on GitHub.
- Clone the repository using a Git client on your local machine.
- Create a plain Scala (or Python) project using the folder cloned from GitHub. Eclipse should recognize that the folder belongs to a repository (an additional cylinder appears in the project icon).
- Add a Maven dependency on `spark-core` of the corresponding version. To do this, right-click the project, choose `Configure` > `Convert to Maven Project`, and follow the instructions to add dependencies. For Spark 1.0.0, the Maven coordinates are `groupId = org.apache.spark`, `artifactId = spark-core_2.10`, `version = 1.0.0`.
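Expressed directly in `pom.xml`, the coordinates above become a single dependency element; a minimal sketch:

```xml
<!-- Spark core for Scala 2.10, as given in the coordinates above -->
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.10</artifactId>
  <version>1.0.0</version>
</dependency>
```

This goes inside the `<dependencies>` section of the `pom.xml`; the Eclipse dialog produces the equivalent entry.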
However, this may cause problems for deployment because it may not package a jar file (according to Yuan Zhang). An alternative is therefore to start from a Maven project:
- Create a repository on GitHub.
- Clone the repository using a Git client on your local machine.
- Create a Maven project, using the name of the folder cloned from GitHub as the artifact name. Eclipse should recognize that the folder belongs to a repository (an additional cylinder appears in the project icon).
- Add the Scala nature via `Configure` > `Add Scala Nature` to enable Scala. At this point you should be able to compile Scala files.
- To use Spark 1.0.0, add the dependency `groupId = org.apache.spark`, `artifactId = spark-core_2.10`, `version = 1.0.0`. Since the `spark-core_2.10` dependency already includes Scala, you may remove the Scala Library added by Eclipse in Step 4 from the `Java Build Path` in the project `Properties`.
- More often than not, the default Maven JVM setting, `J2SE-1.5`, does not match the compatibility level of the system, which produces a distracting warning. Follow this page to adjust the `compiler compliance level` under `Java Compiler` in the project `Properties`.
- The `pom.xml` file must be set up according to the project's requirements. One reference is the one in this project. This might be the most time-consuming part; if yours is not working, copy the one from this project and make minor changes (group ID, artifact ID, etc.).
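The compliance-level warning in the step above can also be silenced from the build itself. A hedged sketch of the relevant `pom.xml` fragment, configuring `maven-compiler-plugin` to target a newer Java level than the J2SE-1.5 default (the plugin version and the 1.6 level here are assumptions; match them to your system):

```xml
<!-- Sketch: raise the Java level from the J2SE-1.5 default to avoid the
     compiler compliance warning. Adjust source/target to your JDK. -->
<build>
  <plugins>
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-compiler-plugin</artifactId>
      <version>3.1</version>
      <configuration>
        <source>1.6</source>
        <target>1.6</target>
      </configuration>
    </plugin>
  </plugins>
</build>
```

After editing `pom.xml`, right-click the project and run `Maven` > `Update Project...` so Eclipse picks up the new compiler settings.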