spark-koans's Introduction

Overview

A koan is an incomplete test. Complete it, and find enlightenment.

This is an interactive tutorial on Apache Spark with Scala. It consists of a series of unit tests: some already pass, while others require you to fill in the gaps to make them pass. Where you see __, replace it with the correct value, and where you see ???, replace it with a function body. Each test class has a Spark context called sc, created by the TestSparkContext trait, which gives you access to Spark's functionality.
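To give a feel for the two kinds of gap, here is a minimal sketch of a solved koan. It is an illustration only: the object name KoanSketch, the local SparkContext (standing in for the sc that TestSparkContext provides) and the plain assert calls are assumptions, and the real koans may be structured differently.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object KoanSketch {
  def main(args: Array[String]): Unit = {
    // Stand-in for the `sc` supplied by the TestSparkContext trait (assumed setup).
    val conf = new SparkConf().setMaster("local[2]").setAppName("koan-sketch")
    val sc = new SparkContext(conf)
    try {
      val numbers = sc.parallelize(Seq(1, 2, 3, 4))

      // Unsolved, a value koan might read:  assert(numbers.count() == __)
      // Solved, the blank is replaced by the correct value:
      assert(numbers.count() == 4)

      // Unsolved, a function koan might read:  def double(x: Int): Int = ???
      // Solved, the ??? is replaced by a function body:
      def double(x: Int): Int = x * 2
      assert(numbers.map(double).collect().toSeq == Seq(2, 4, 6, 8))
    } finally {
      sc.stop() // release the local Spark context
    }
  }
}
```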

While it may be possible to complete these exercises with no knowledge of Scala, it is assumed that you already have some familiarity with Scala and Scala collections.

Inspired by many other koan-style projects, which I guess all started with the Ruby koans.

Requirements

It should be possible to complete these exercises with only Scala and SBT installed. All dependencies, including Spark itself, should be downloaded by SBT.
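For orientation, a build definition that lets SBT fetch Spark might look roughly like the sketch below. The version numbers and the choice of ScalaTest are assumptions for illustration and are not taken from this project's build file.

```scala
// build.sbt (sketch only; all versions below are assumptions)
scalaVersion := "2.12.18"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"  % "3.5.1", // RDDs and SparkContext
  "org.apache.spark" %% "spark-mllib" % "3.5.1", // vectors, matrices, statistics
  "org.scalatest"    %% "scalatest"   % "3.2.18" % Test
)

// Run test suites one at a time so they do not compete for a SparkContext.
Test / parallelExecution := false
```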

Apache Spark

Apache Spark is an open source (Apache license) cluster computing engine. Put plainly, it's a tool for analysing large amounts of data in order to learn something about that data, and its strengths lie in its speed, versatility and language bindings. It can be used standalone or within Apache Hadoop and comes with bindings for Scala, Java and Python. Spark is designed to unify batch processing, stream processing and interactive (query-based) analytics in one framework, which it does through its built-in libraries:

  • Spark SQL - a SQL interface for querying structured data
  • Spark Streaming - tools for processing real-time data streams
  • MLlib - a collection of machine learning algorithms: classification, regression, clustering, etc.
  • GraphX - tools for analysing graphs (the vertex-edge kind)

List of koans

Manipulating RDDs (resilient distributed datasets)

sbt "testOnly AboutRDDs"
  • Build an RDD from a parallelized collection
  • Build an RDD from a file
  • Partitioning
  • Map, reduce and filter
  • Counting
  • Zipping
  • House prices
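As a taste of the operations these koans cover, the following sketch builds an RDD from a local collection and applies filter, map, reduce, count and zip. The object name RddSketch and the sample data are made up for illustration; it is not one of the koans themselves.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RddSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("rdd-sketch"))
    try {
      // Build an RDD from a parallelized collection, split into 2 partitions.
      val numbers = sc.parallelize(1 to 10, numSlices = 2)
      assert(numbers.getNumPartitions == 2)

      // filter and map are lazy transformations; reduce and count are actions.
      val evens = numbers.filter(_ % 2 == 0)
      val squares = evens.map(n => n * n)
      assert(evens.count() == 5L)
      assert(squares.reduce(_ + _) == 220) // 4 + 16 + 36 + 64 + 100

      // zip pairs elements positionally; both RDDs must have the same
      // partitioning and the same number of elements per partition,
      // which holds here because squares is derived from evens by map.
      val zipped = evens.zip(squares).collect().toList
      assert(zipped == List((2, 4), (4, 16), (6, 36), (8, 64), (10, 100)))
    } finally {
      sc.stop()
    }
  }
}
```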

Using key-value pairs

sbt "testOnly AboutKeyValuePairs"
  • Key-value pairs
  • Mapping values; reducing keys
  • Grouping by key
  • Sorting by key
  • Counting words
  • Joins
  • Subtract by key (set difference)
  • Co-group
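The pair operations listed above can be sketched as follows. The data and the object name PairSketch are invented for illustration, so treat this as a rough guide to the API rather than the content of the koans.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object PairSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("pair-sketch"))
    try {
      // Counting words: map each word to (word, 1), then sum the values per key.
      val lines = sc.parallelize(Seq("to be or not", "to be"))
      val counts = lines
        .flatMap(_.split(" "))
        .map(word => (word, 1))
        .reduceByKey(_ + _)
      val sorted = counts.sortByKey().collect().toList
      assert(sorted == List(("be", 2), ("not", 1), ("or", 1), ("to", 2)))

      val ages = sc.parallelize(Seq(("ann", 31), ("bob", 45)))
      val towns = sc.parallelize(Seq(("ann", "Leeds"), ("cat", "York")))

      // Inner join keeps only keys present on both sides.
      assert(ages.join(towns).collect().toList == List(("ann", (31, "Leeds"))))

      // subtractByKey keeps pairs whose key is absent from the other RDD.
      assert(ages.subtractByKey(towns).collect().toList == List(("bob", 45)))

      // cogroup collects, for every key, the values from each side.
      val grouped = ages.cogroup(towns)
        .mapValues { case (a, t) => (a.toList, t.toList) }
        .collectAsMap()
      val expectedCat = (List.empty[Int], List("York"))
      assert(grouped("cat") == expectedCat)
    } finally {
      sc.stop()
    }
  }
}
```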

MLlib: vectors and matrices

sbt "testOnly AboutVectors"
  • Local vectors
  • Local matrices
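A short sketch of local vectors and matrices, assuming the RDD-based org.apache.spark.mllib.linalg package (the koans may use a different Spark version or package). Local types live on a single machine, so no Spark context is needed here.

```scala
import org.apache.spark.mllib.linalg.{Matrices, Vectors}

object VectorSketch {
  def main(args: Array[String]): Unit = {
    // A dense vector stores every entry; a sparse one stores only
    // the non-zero entries as (size, indices, values).
    val dense = Vectors.dense(1.0, 0.0, 3.0)
    val sparse = Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0))
    assert(dense.size == 3)
    assert(sparse.toArray.sameElements(dense.toArray))

    // Dense matrices are stored in column-major order:
    // the first column is (1, 2, 3), the second is (4, 5, 6).
    val m = Matrices.dense(3, 2, Array(1.0, 2.0, 3.0, 4.0, 5.0, 6.0))
    assert(m.numRows == 3 && m.numCols == 2)
    assert(m(0, 1) == 4.0) // row 0, column 1
  }
}
```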

MLlib: statistics

sbt "testOnly AboutStatistics"
  • Summary statistics
  • Correlations
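Summary statistics and correlations might be exercised along these lines, again assuming the RDD-based org.apache.spark.mllib.stat.Statistics API. The object name StatsSketch and the numbers are made up for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.Statistics

object StatsSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("stats-sketch"))
    try {
      // Each vector is one observation; each position is one column.
      val rows = sc.parallelize(Seq(
        Vectors.dense(1.0, 10.0),
        Vectors.dense(2.0, 20.0),
        Vectors.dense(3.0, 30.0)))

      // Column-wise summary statistics.
      val summary = Statistics.colStats(rows)
      assert(summary.count == 3L)
      assert(math.abs(summary.mean(0) - 2.0) < 1e-9)
      assert(math.abs(summary.mean(1) - 20.0) < 1e-9)
      assert(summary.max(1) == 30.0)

      // Pearson correlation of two series; both RDDs need the same
      // number of partitions and elements per partition.
      val x = sc.parallelize(Seq(1.0, 2.0, 3.0, 4.0), numSlices = 2)
      val y = sc.parallelize(Seq(2.0, 4.0, 6.0, 8.0), numSlices = 2)
      val r = Statistics.corr(x, y, "pearson")
      assert(math.abs(r - 1.0) < 1e-9) // y is an exact linear function of x
    } finally {
      sc.stop()
    }
  }
}
```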

Sources of inspiration
