# Testing Spark Streaming and Storm

# Table of Contents

  • Introduction
  • Testing Setup
  • Data
  • Use case
  • Cluster Setup
  • Instructions to Set Up this Pipeline
  • Presentation

# Introduction

In this data engineering project we tested two stream-processing frameworks: Spark Streaming and Storm. The two main goals were:

  • To understand the differences between Spark Streaming and Storm.
  • To test both frameworks under different loads and measure their throughput.

# Testing Setup

  • Testing conditions: each framework was set up in a separate 4-node cluster on AWS.
  • Metrics measured: records processed per second (throughput).
  • Time: tests were run for approximately 10 minutes in both pipelines.
  • Caveat: the results must be interpreted with caution because of processing and semantic differences between Spark Streaming and Storm (e.g., Spark Streaming processes micro-batches while Storm processes tuples one at a time).

# Data

The data was generated by a producer that selected words from a list and built a comma-separated string, which was streamed through Kafka to consumers in Spark/Storm.
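
For reference, a minimal sketch of what such a producer might look like, assuming kafka-python's KafkaProducer and a broker on localhost:9092; the topic name (`words`), word list, and message rate are illustrative and not taken from the actual spark/producer.py.

```python
import random
import time

from kafka import KafkaProducer

# Illustrative word list and topic name; the real producer.py may differ.
WORDS = ["spark", "storm", "kafka", "stream", "cluster", "node"]

producer = KafkaProducer(bootstrap_servers="localhost:9092")

while True:
    # Build a comma-separated string of randomly chosen words...
    message = ",".join(random.choice(WORDS) for _ in range(10))
    # ...and stream it through Kafka to the Spark/Storm consumers.
    producer.send("words", message.encode("utf-8"))
    time.sleep(0.01)
```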

# Use case

Word count is a popular choice when testing stream-processing frameworks. Other use cases I would like to implement are a sorting algorithm and/or a graph algorithm.

# Cluster Setup

Two distributed AWS clusters of four EC2 m3.medium nodes were used. The ingestion components and processing frameworks were configured and run in distributed mode, with one master and three workers for Spark Streaming, and one Nimbus and three Supervisors for Storm.

# Instructions to Set Up this Pipeline

  • Spin up 4 EC2 nodes and install Hadoop, Zookeeper, Kafka, Spark, and Storm.

  • Start Zookeeper and Kafka

  • Select which processing framework you want to test, and start it.

  • Install Python packages: `sudo pip install kafka-python`

  • Run the Kafka producer: `python spark/producer.py`

  • Run the PySpark script: `$SPARK_HOME/bin/spark-submit metrics.py` (a sketch of such a consumer is shown after these steps).
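
A minimal sketch of what a word-count consumer like metrics.py could look like, assuming the receiver-based KafkaUtils.createStream API from pyspark.streaming.kafka (Spark 1.x era) with the matching spark-streaming-kafka package on the spark-submit classpath; the Zookeeper address, topic name, and batch interval are assumptions.

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="KafkaWordCount")
ssc = StreamingContext(sc, 1)  # 1-second micro-batches (illustrative)

# Receiver-based Kafka stream: Zookeeper quorum, consumer group, {topic: partitions}
stream = KafkaUtils.createStream(ssc, "localhost:2181", "spark-consumer", {"words": 1})

lines = stream.map(lambda kv: kv[1])   # keep only the message value
lines.count().pprint()                 # records per batch

counts = (lines.flatMap(lambda line: line.split(","))
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()

ssc.start()
ssc.awaitTermination()
```

Dividing the per-batch record count by the batch interval gives a records-per-second throughput figure.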

Or, to test the Storm pipeline instead:

  • Install Python packages: `sudo pip install pyleus`

  • Build the Storm topology: `pyleus build word_topology.yaml` (a sketch of a word-count bolt for such a topology follows these steps).

  • Test the topology by running it locally: `pyleus local word_topology.jar -d`

  • Submit the pyleus topology to the cluster: `pyleus submit -n <public-dns> word_topology.jar`
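
A minimal sketch of a word-count bolt that such a topology could reference, based on pyleus's SimpleBolt interface; the class name, output fields, and counting logic are illustrative and would need to match the spouts and bolts declared in word_topology.yaml.

```python
from collections import defaultdict

from pyleus.storm import SimpleBolt


class WordCountBolt(SimpleBolt):
    OUTPUT_FIELDS = ["word", "count"]

    def initialize(self):
        # Running per-word counts held in bolt-local memory.
        self.counts = defaultdict(int)

    def process_tuple(self, tup):
        word, = tup.values
        self.counts[word] += 1
        # Emit the updated count, anchored to the input tuple for reliability.
        self.emit((word, self.counts[word]), anchors=[tup])


if __name__ == "__main__":
    WordCountBolt().run()
```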

Note: Zookeeper and Kafka must be running before starting the producer, running metrics.py, or submitting the Storm topology.

# Presentation

The presentation slides are available here: gchoy.github.io
