
spark-deployer

  • A Scala tool that helps you deploy an Apache Spark stand-alone cluster and submit jobs.
  • Currently only supports Amazon EC2 and Spark 1.4.0.
  • This project contains three parts: a core library, an sbt plugin, and a simple command line tool.
  • Since the project is still experimental, the spec may change rapidly in the future.

How to use the SBT plugin

  • Set the environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY for AWS.
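For example, on Linux or macOS you can export them in the shell before starting sbt (the values below are placeholders):
export AWS_ACCESS_KEY_ID=<your-access-key-id>
export AWS_SECRET_ACCESS_KEY=<your-secret-access-key>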
  • In your sbt project, add the plugin to project/plugins.sbt:
addSbtPlugin("net.pishen" % "spark-deployer-sbt" % "0.5.1")
  • Prepare your build.sbt, marking spark-core as provided (the cluster supplies Spark at runtime):
lazy val root = (project in file("."))
  .settings(
    name := "my-project-name",
    version := "0.1",
    scalaVersion := "2.10.5",
    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core" % "1.4.0" % "provided"
    )
  )
  • Write your job's algorithm in src/main/scala/mypackage/Main.scala:
package mypackage

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

object Main {
  def main(args: Array[String]): Unit = {
    // set up Spark
    val sc = new SparkContext(new SparkConf())
    // your algorithm
    sc.textFile("s3n://my-bucket/input.gz").collect().foreach(println)
  }
}
  • Create the Spark cluster with sbt "sparkCreateCluster <number-of-workers>". You can also start sbt first and type sparkCreateCluster <number-of-workers> in the sbt console. Type spark and hit TAB to see all available commands.
  • Once the cluster is created, submit your job with sparkSubmitJob <job-args>. A minimal end-to-end session is sketched below.
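Putting the steps together, a minimal session might look like this (a sketch only: the worker count and the job arguments arg0 arg1 are hypothetical, and a spark-deployer.conf as described below is assumed to be in place):
sbt
> sparkCreateCluster 2
> sparkSubmitJob arg0 arg1
> sparkDestroyCluster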

Other available commands

  • sparkCreateMaster
  • sparkAddWorkers <number-of-workers> dynamically adds more workers to an existing cluster.
  • sparkRemoveWorkers <number-of-workers> dynamically removes workers to scale down the cluster.
  • sparkDestroyCluster
  • sparkShowMachines
  • sparkUploadJar uploads the job's jar file to the master.
  • sparkRemoveS3Dir <dir-name> removes the S3 directory along with its _$folder$ marker file (e.g. sparkRemoveS3Dir s3://bucket_name/middle_folder/target_folder).
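For example, to inspect and scale a running cluster and then clean up an intermediate S3 directory (the bucket and path here are hypothetical), you could type in the sbt console:
> sparkShowMachines
> sparkAddWorkers 3
> sparkRemoveWorkers 2
> sparkRemoveS3Dir s3://my-bucket/tmp/old-output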

How to use the command line tool

For environments where sbt cannot be installed, here is a command line tool (a jar file) that only requires Java to be installed.

  • Set the environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY for AWS.
  • Clone this project and build the jar file with sbt cmd/assembly. The output file will be at cmd/target/scala-2.10/spark-deployer-cmd-assembly-x.x.x.jar.
  • Provide the cluster configuration file spark-deployer.conf in the same directory as spark-deployer-cmd-assembly-x.x.x.jar.
  • Provide your Spark job's jar file spark-job.jar (the jar file built by sbt-assembly from a Spark project).
  • Use the following commands to create the cluster, submit the job, and destroy the cluster:
java -jar spark-deployer-cmd-assembly-x.x.x.jar --create-cluster <number-of-workers>
java -jar spark-deployer-cmd-assembly-x.x.x.jar --submit-job spark-job.jar <job-args>
java -jar spark-deployer-cmd-assembly-x.x.x.jar --destroy-cluster

Cluster configuration file

  • For the library to work, you need to provide a configuration file spark-deployer.conf:
cluster-name = "pishen-spark"

keypair = "pishen"
pem = "/home/pishen/.ssh/pishen.pem"

region = "us-west-2"

ami = "ami-e7527ed7"

master {
  instance-type = "m3.medium"
  # EBS disk size in GB; the volume type is currently fixed to gp2 SSD.
  disk-size = 8
  # The driver-memory used at spark-submit (optional)
  driver-memory = "2G"
}

worker {
  instance-type = "c4.xlarge"
  disk-size = 40
  # The executor-memory used at spark-submit (optional)
  executor-memory = "6G"
}

# Max number of attempts when trying to connect to a machine (default is 10, one attempt per minute).
ssh-connection-attempts = 8

# URL for downloading the pre-built Spark tarball.
spark-tgz-url = "http://d3kbcqa49mib13.cloudfront.net/spark-1.4.0-bin-hadoop1.tgz"

main-class = "mypackage.Main"

# Below are optional settings
app-name = "my-app-name"

security-group-ids = ["sg-xxxxxxxx", "sg-yyyyyyyy"]

subnet-id = "subnet-xxxxxxxx"
use-private-ip = true
  • The ami should be an HVM EBS-backed image with Java 7 installed; you may pick one from the Amazon Linux AMIs or build one yourself.
  • Currently tested instance types are t2.medium, m3.medium, and c4.xlarge. All M3, C4, and C3 types should work; please report an issue if you encounter a problem.
  • The disk-size is in GB. It sets the size of the root partition, which is used by both the OS and Spark.
  • More information about spark-tgz-url:
    • You can find a URL on Spark's website or host one yourself.
    • You may choose an older version of Spark or a different version of Hadoop, but these are not tested; use at your own risk.
    • The URL must end with /<spark-folder-name>.tgz for the automatic deployment to work.
  • More information about security-group-ids (a rough setup is sketched after this list):
    • Since Akka uses random ports to connect with the master, the security groups should allow all traffic between cluster machines.
    • Allow port 22 for SSH login.
    • Allow ports 8080, 8081, and 4040 for the web consoles (optional).
    • Please check the Spark security page for more information about port settings.
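As a rough sketch of a security group that satisfies these rules (assuming the AWS CLI is installed and configured for the same region; the group name spark-deployer-sg is hypothetical, and this is not part of spark-deployer itself):
aws ec2 create-security-group --group-name spark-deployer-sg --description "spark-deployer cluster"
# allow all traffic between machines in the same group (Akka uses random ports)
aws ec2 authorize-security-group-ingress --group-name spark-deployer-sg --protocol all --source-group spark-deployer-sg
# allow SSH login
aws ec2 authorize-security-group-ingress --group-name spark-deployer-sg --protocol tcp --port 22 --cidr 0.0.0.0/0
# optional: Spark web consoles
aws ec2 authorize-security-group-ingress --group-name spark-deployer-sg --protocol tcp --port 8080-8081 --cidr 0.0.0.0/0
aws ec2 authorize-security-group-ingress --group-name spark-deployer-sg --protocol tcp --port 4040 --cidr 0.0.0.0/0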
