spark-demo

A simple project intended to demo Spark and get developers up and running quickly.

Note: This project uses Gradle. You must install Gradle (1.12). If you would rather not install Gradle locally, you can use the Gradle Wrapper by replacing all references to gradle with gradlew.

How To Build:

  1. Execute gradle build
  2. Find the artifact jars in './build/libs/'

IntelliJ Project Setup:

  1. Execute gradle idea
  2. Open the project folder in IntelliJ or open the generated .ipr file

Note: If you have any issues in IntelliJ, a good first troubleshooting step is to execute gradle cleanIdea idea

Eclipse Project Setup:

  1. Execute gradle eclipse
  2. Open the project folder in Eclipse

Note: If you have any issues in Eclipse, a good first troubleshooting step is to execute gradle cleanEclipse eclipse

Key Spark Links:

Using The Project:

Note: This guide has only been tested on Mac OS X and may assume tools that are specific to it. If you are working in another OS, you may need to substitute equivalent tools, but they should be readily available.

Step 1 - Build the Project:

  1. Run gradle build

Step 2 - Run the Demos in Local mode:

The demos generally take the first argument as the Spark Master URL. Setting this value to 'local' runs the demo in local mode. The trailing number in brackets '[#]' indicates the number of cores to use (e.g. 'local[2]' runs locally with 2 cores).
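
For illustration, here is a minimal sketch of how a demo following this convention might consume that first argument. The object name and the trivial job are hypothetical and not one of the project's demos; only the argument convention is taken from above.

    import org.apache.spark.{SparkConf, SparkContext}

    // Hypothetical sketch: illustrates the argument convention only.
    // args(0) is the Spark Master URL, e.g. "local[2]" or "spark://host:7077".
    object MasterUrlExample {
      def main(args: Array[String]): Unit = {
        val master = args(0)
        val conf = new SparkConf().setMaster(master).setAppName("MasterUrlExample")
        val sc = new SparkContext(conf)

        // A trivial job so the sketch runs end-to-end.
        val evens = sc.parallelize(1 to 100).filter(_ % 2 == 0).count()
        println(s"Even numbers: $evens")

        sc.stop()
      }
    }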

This project has a Gradle task called 'runSpark' that manages the runtime classpath for you. This simplifies running spark jobs, ensures the same classpath is used in all modes, and shortens the development feedback loop.

The 'runSpark' Gradle task takes two arguments '-PsparkMain' and '-PsparkArgs':

  • -PsparkMain: The main class to run.
  • -PsparkArgs: The arguments to be passed to the main class. See each class for documentation on which arguments are expected.

Below are some sample commands for a few simple demos:

  • SparkPi: gradle runSpark -PsparkMain="com.cloudera.sa.SparkPi" -PskipHadoopJar -PsparkArgs="local[2] 100"
  • Sessionize: gradle runSpark -PsparkMain="com.cloudera.sa.Sessionize" -PskipHadoopJar -PsparkArgs="local[2]"
  • HdfsWordCount: gradle runSpark -PsparkMain="com.cloudera.sa.HdfsWordCount" -PskipHadoopJar -PsparkArgs="local[2] streaming-input"
  • NetworkWordCount: gradle runSpark -PsparkMain="com.cloudera.sa.NetworkWordCount" -PskipHadoopJar -PsparkArgs="local[2] localhost 9999"

Note: The remaining steps are only required for running demos in "pseudo-distributed" mode and on a cluster.

Step 3 - Install Spark:

  1. Install Spark 1.0 using Homebrew: brew install apache-spark
  2. Add SPARK_HOME to your .bash_profile: export SPARK_HOME=/usr/local/Cellar/apache-spark/1.0.0/libexec
  3. Add SCALA_HOME and JAVA_HOME to your .bash_profile

Note: You may also install Spark on your own by following the Spark Documentation.

Step 4 - Configure & Start Spark:

  1. The defaults should work for now. However, see the Cluster Launch Scripts documentation for more information on configuring your pseudo cluster.
  2. Start your Spark cluster: $SPARK_HOME/sbin/start-all.sh
  3. Validate that the master and worker are running in the Spark Master WebUI
  4. Note the master URL shown in the Spark Master WebUI. It will be used when submitting jobs.
  5. Shutdown when done: $SPARK_HOME/sbin/stop-all.sh

Step 5 - Run the Demos in Pseudo-Distributed mode:

Running in pseudo-distributed mode is almost exactly the same as running in local mode. Note: Please review Step 2 before continuing.

To run in pseudo-distributed mode, just replace 'local[#]' in the Spark Master URL argument with the URL from Step 4.

Below are some sample commands for each demo:

Note: You will need to substitute in your own Spark Master URL.

  • SparkPi: gradle runSpark -PsparkMain="com.cloudera.sa.SparkPi" -PsparkArgs="spark://example:7077 100"
  • Sessionize: gradle runSpark -PsparkMain="com.cloudera.sa.Sessionize" -PsparkArgs="spark://example:7077"
  • HdfsWordCount: gradle runSpark -PsparkMain="com.cloudera.sa.HdfsWordCount" -PsparkArgs="spark://example:7077 streaming-input"
  • NetworkWordCount: gradle runSpark -PsparkMain="com.cloudera.sa.NetworkWordCount" -PsparkArgs="spark://example:7077 localhost 9999"

Step 6 - Run the Demos on a cluster:

The build creates a fat jar tagged with '-hadoop' that contains all dependencies needed to run on the cluster. The jar can be found in './build/libs/'.

TODO: Test this and fill out steps.

Step 7 - Develop your own Demos:

Develop demos of your own and send a pull request!

Notable Tools & Frameworks:

Todo List:

  • Create trait/class with generic context, smart defaults, and unified arg parsing (see spark-submit script for ref); a rough sketch of one possible shape follows this list
  • Document what's demonstrated in each demo (Avro, Parquet, Kryo, etc.) and its usage
  • Add module level readme and docs
  • Tune logging output configuration (Redirect verbose logs into a rolling file)
  • Speed up HadoopJar task (and runSpark will follow)
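
For the first item above, one possible shape is sketched below. This is purely illustrative: the trait name, method names, and defaults are hypothetical and do not exist in the project yet.

    import org.apache.spark.{SparkConf, SparkContext}

    // Hypothetical sketch of a shared "generic context" trait: it centralizes
    // arg parsing and context setup so each demo only implements run().
    trait SparkDemo {
      def appName: String = getClass.getSimpleName.stripSuffix("$")

      // Demo logic goes here; demoArgs excludes the master URL.
      def run(sc: SparkContext, demoArgs: Array[String]): Unit

      def main(args: Array[String]): Unit = {
        val master = if (args.nonEmpty) args(0) else "local[2]" // smart default
        val conf = new SparkConf().setMaster(master).setAppName(appName)
        val sc = new SparkContext(conf)
        try run(sc, args.drop(1)) finally sc.stop()
      }
    }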

Demos Being Worked On:
