vinceyang / spark-example-project

This project forked from snowplow/spark-example-project

A Spark WordCountJob example as a standalone SBT project with Specs2 tests, runnable on Amazon EMR

Home Page: http://snowplowanalytics.com

Scala 83.56% Shell 16.44%

Spark Example Project

Introduction

This is a simple word count job written in Scala for the Spark cluster computing platform, with instructions for running on Amazon Elastic MapReduce (EMR) in non-interactive mode. The code is ported directly from Twitter's WordCountJob for Scalding.

This was built by the Professional Services team at Snowplow Analytics, who use Spark on their data pipelines and algorithms projects.

See also: Scalding Example Project | Cascalog Example Project

Building

Assuming you already have SBT installed:

$ git clone https://github.com/snowplow/spark-example-project.git
$ cd spark-example-project
$ sbt assembly

The 'fat jar' is now available as:

target/spark-example-project-0.2.0.jar

Unit testing

The assembly command above runs the test suite, but you can also run the tests manually with:

$ sbt test
<snip>
[info] + A WordCount job should
[info]   + count words correctly
[info] Passed: : Total 1, Failed 0, Errors 0, Passed 1, Skipped 0

Running on Amazon EMR

Prepare

Assuming you have already assembled the jar file (see above), upload it to an Amazon S3 bucket and make the file publicly accessible.

Next, upload the data file data/hello.txt to S3.
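
The two uploads above could be scripted roughly as follows. This is a minimal sketch, not part of the original instructions: it assumes the AWS CLI is installed and configured, and the bucket names are placeholders. Whether --acl public-read is accepted depends on your bucket's access settings. By default the script only prints the commands; set RUN=1 to execute them.

```shell
#!/bin/sh
# Assumption: the AWS CLI ("aws") is installed and configured with credentials.
# JAR_BUCKET and IN_BUCKET are hypothetical bucket names; replace with your own.
JAR_BUCKET="${JAR_BUCKET:-my-jar-bucket}"
IN_BUCKET="${IN_BUCKET:-my-input-bucket}"

# Print each command instead of running it unless RUN=1 is set,
# so the sketch can be inspected safely first.
run() {
  if [ "${RUN:-0}" = "1" ]; then "$@"; else echo "$*"; fi
}

upload_assets() {
  # Upload the fat jar and request a public-read ACL on the object.
  run aws s3 cp target/spark-example-project-0.2.0.jar "s3://$JAR_BUCKET/" --acl public-read
  # Upload the sample input file.
  run aws s3 cp data/hello.txt "s3://$IN_BUCKET/"
}

upload_assets
```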

Run

Finally, you are ready to run this job using the Amazon Ruby EMR client:

$ elastic-mapreduce --create --name "Spark Example Project" --instance-type m1.xlarge --instance-count 3 \
  --bootstrap-action s3://elasticmapreduce/samples/spark/0.8.1/install-spark-shark.sh --bootstrap-name "Install Spark/Shark" \
  --jar s3://elasticmapreduce/libs/script-runner/script-runner.jar --step-name "Run Spark Example Project" \
  --step-action TERMINATE_JOB_FLOW \
  --arg s3://snowplow-hosted-assets/common/spark/run-spark-job-0.1.0.sh \
  --arg s3://{{JAR_BUCKET}}/spark-example-project-0.2.0.jar \
  --arg com.snowplowanalytics.spark.WordCountJob \
  --arg s3n://{{IN_BUCKET}}/hello.txt \
  --arg s3n://{{OUT_BUCKET}}/results

Replace {{JAR_BUCKET}}, {{IN_BUCKET}} and {{OUT_BUCKET}} with the appropriate paths.
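
If you keep the command in a template file, the substitution can be done with sed. The following is a sketch only: fill_placeholders is a hypothetical helper name, and it assumes the bucket paths contain neither | nor & characters, which sed would treat specially here.

```shell
#!/bin/sh
# Hypothetical helper: reads text on stdin and replaces the three
# {{...}} placeholders with the arguments, in this order:
#   $1 = jar bucket, $2 = input bucket, $3 = output bucket
fill_placeholders() {
  sed -e "s|{{JAR_BUCKET}}|$1|g" \
      -e "s|{{IN_BUCKET}}|$2|g" \
      -e "s|{{OUT_BUCKET}}|$3|g"
}

# Example usage (bucket names are made up, emr-command.txt is a hypothetical file):
#   fill_placeholders my-jars my-input my-output < emr-command.txt
```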

Inspect

Once the job has completed, you should see a folder structure like this in your output bucket:

 results
 |
 +- _SUCCESS
 +- part-00000
 +- part-00001

Download the files and check that part-00000 contains:

(hello,1)
(world,2)

while part-00001 contains:

(goodbye,1)
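
To cross-check the downloaded part files without inspecting them by eye, you could recompute the counts locally and compare. The sketch below assumes words are separated by plain whitespace; the job's own tokenization may differ (for example in case or punctuation handling), so treat a mismatch as a prompt to look closer rather than as proof of an error. count_words is a hypothetical helper name.

```shell
#!/bin/sh
# Count whitespace-separated words in a text file and print one
# "(word,count)" line per distinct word, in sorted line order.
count_words() {
  tr -s '[:space:]' '\n' < "$1" | grep -v '^$' | sort | uniq -c |
    awk '{ printf "(%s,%d)\n", $2, $1 }' | sort
}

# Example usage, assuming the part files were downloaded into ./results
# and the input file is at ./data/hello.txt:
#   count_words data/hello.txt > expected.txt
#   cat results/part-* | sort > actual.txt
#   diff expected.txt actual.txt && echo "output matches"
```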

Running on your own Spark cluster

If you have successfully run this on your own Spark cluster, we would welcome a pull-request updating the instructions in this section.

Next steps

Fork this project and adapt it into your own custom Spark job.

To invoke/schedule your Spark job on EMR, check out:

Roadmap

  • Bump to Spark 0.9.x when this is supported by EMR (#1)
  • Change output from tuples to TSV (#2)

Further reading

Copyright and license

Copyright 2013-2014 Snowplow Analytics Ltd.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this software except in compliance with the License.

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
