
Online Converter

For this project, we designed an online converter that translates batch processing code into stream processing code. The tool is available at: https://streamprocessing.herokuapp.com/simulateStream/

The frontend source code is on the frontend branch.

The source code for the auto-refactoring component is in the "Auto-refactor" folder.

Kafka is used to test the generated stream processing code. To set up Kafka, follow the steps below.

Spark Streaming + Kafka Integration

A Kafka producer reads a .csv file and pushes its records onto a topic. Spark Streaming then acts as the consumer, receiving the streamed data and processing it.

Environment

  • Python version: 3.7 (pyspark only supports Python 3.7)
  • Spark: 2.4.7
  • Java: 1.8

Run your program

Preparation

Please check the spark-up.sh and spark-stop.sh scripts to make sure the paths are correct for your project.

Run the following commands once, before the first time you use spark-up.sh and spark-stop.sh, to make the scripts executable:

chmod +x ./spark-up.sh
chmod +x ./spark-stop.sh

Step 1: Start up ZooKeeper and Kafka, then the Spark master and workers

Run spark-up.sh.
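For orientation, spark-up.sh is expected to boil down to something like the following (KAFKA_HOME and SPARK_HOME are placeholders for your installation paths; check the script itself for the authoritative version):

# Start ZooKeeper and Kafka in the background.
$KAFKA_HOME/bin/zookeeper-server-start.sh -daemon $KAFKA_HOME/config/zookeeper.properties
$KAFKA_HOME/bin/kafka-server-start.sh -daemon $KAFKA_HOME/config/server.properties

# Start the Spark master and the workers configured in spark-env.sh.
$SPARK_HOME/sbin/start-master.sh
$SPARK_HOME/sbin/start-slave.sh spark://localhost:7077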

Step 2: Push data onto a topic through the producer

Run python ticket_producer.py, which sends records to a new topic. A sketch of such a producer follows.
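A minimal sketch of what ticket_producer.py might look like, using kafka-python (the file name tickets.csv and topic name tickets are assumptions; match them to your data and consumer):

import csv
from kafka import KafkaProducer

# Connect to the local Kafka broker started by spark-up.sh.
producer = KafkaProducer(bootstrap_servers="localhost:9092")

# Send each CSV row to the topic as a comma-joined, UTF-8 encoded record.
with open("tickets.csv") as f:
    for row in csv.reader(f):
        producer.send("tickets", ",".join(row).encode("utf-8"))

producer.flush()  # block until all buffered records are delivered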

Cluster overview

Application      URL             Description
Spark Driver     localhost:4040  Spark Driver web UI
Spark Master     localhost:8080  Spark Master node
Spark Worker I   localhost:8081  Spark Worker node with 2 cores and 2g of memory (default)
Spark Worker II  localhost:8082  Spark Worker node with 2 cores and 2g of memory (default)

Step 3: Start your program

(Use the spark-streaming-kafka package from https://mvnrepository.com/artifact/org.apache.spark/spark-streaming-kafka-0-8.)

spark-submit --packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.4.7 dstream.py
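For reference, a minimal sketch of what a dstream.py consumer can look like with this package (the topic name tickets and 5-second batch interval are assumptions; adjust to your producer):

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="TicketStream")
ssc = StreamingContext(sc, 5)  # 5-second micro-batches

# Subscribe directly to the Kafka topic the producer writes to.
stream = KafkaUtils.createDirectStream(
    ssc, ["tickets"], {"metadata.broker.list": "localhost:9092"})

# Records arrive as (key, value) pairs; print the values of each batch.
stream.map(lambda kv: kv[1]).pprint()

ssc.start()
ssc.awaitTermination()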

To shut down workers and servers

Run spark-stop.sh.

Press Ctrl+C in the Kafka server terminal, then in the ZooKeeper server terminal.
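For reference, spark-stop.sh is expected to stop the Spark processes roughly like this (the path is a placeholder; Kafka and ZooKeeper are stopped by hand as described above):

$SPARK_HOME/sbin/stop-slave.sh
$SPARK_HOME/sbin/stop-master.sh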

Kafka

  1. Install Kafka (https://kafka.apache.org/quickstart)

  2. In the Kafka folder, start the ZooKeeper and Kafka servers:
    bin/zookeeper-server-start.sh config/zookeeper.properties
    bin/kafka-server-start.sh config/server.properties

  3. Install the Python Kafka client: pip install kafka-python

  4. Run python ticket_producer.py, which sends records to a new topic (see the producer sketch under Step 2 above)

  5. Run python ticket_consumer.py to retrieve the data; a minimal consumer sketch follows this list
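A minimal sketch of what ticket_consumer.py might look like, also using kafka-python (the topic name tickets is an assumption; use the one the producer writes to):

from kafka import KafkaConsumer

# Subscribe to the topic and start from the earliest available offset.
consumer = KafkaConsumer(
    "tickets",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
)

# Message values arrive as raw bytes; decode before printing.
for message in consumer:
    print(message.value.decode("utf-8"))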

Other useful commands:

  • Create new topic
    bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic topic_name

  • List all existing topics
    bin/kafka-topics.sh --list --zookeeper localhost:2181

  • Delete topic
    bin/kafka-topics.sh --zookeeper localhost:2181 --delete --topic remove-me
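  • Describe a topic (shows its partitions and replicas; this is the same standard kafka-topics.sh tool used above)
    bin/kafka-topics.sh --describe --zookeeper localhost:2181 --topic topic_name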

Spark

Version

  • Python: 3.7
  • Spark: 2.4.7
  • Java: 1.8

(Reference: https://maelfabien.github.io/bigdata/SparkInstall/#)

Step 1: Installation

  • Install Java
  • Install Scala
  • Install Spark 2.4.7
    1. Download https://www.apache.org/dyn/closer.lua/spark/spark-2.4.7/spark-2.4.7-bin-hadoop2.7.tgz
    2. Unzip: tar xvf spark-2.4.7-bin-hadoop2.7.tgz
    3. Rename folder "spark-2.4.7-bin-hadoop2.7" to "2.4.7"
    4. Relocate the folder: mv 2.4.7 /usr/local/Cellar/apache-spark/ (the later steps assume Spark lives at /usr/local/Cellar/apache-spark/2.4.7)
  • Install pyspark
    pip install pyspark==2.4.7
  • Set PATH
    (Adjust these paths to wherever your files are stored.)
    export JAVA_HOME=/Library/Java/JavaVirtualMachines/adoptopenjdk-8.jdk/Contents/Home
    export JRE_HOME=/Library/Java/JavaVirtualMachines/adoptopenjdk-8.jdk/Contents/Home/jre/
    export SPARK_HOME=/usr/local/Cellar/apache-spark/2.4.7
    export PATH=/usr/local/Cellar/apache-spark/2.4.7/bin:$PATH
    export PYSPARK_PYTHON=/Users/yourMac/anaconda3/bin/python
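To sanity-check the installation (assuming the variables above are set in your shell):

python -c "import pyspark; print(pyspark.__version__)"   # should print 2.4.7
spark-submit --version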

Step 2: Set configuration for the Spark master and workers

(Reference from: https://mallikarjuna_g.gitbooks.io/spark/content/spark-standalone-example-2-workers-on-1-node-cluster.html)

Configuration:

  • 2 workers on 1 node
  • Each worker with 2 cores
  • Each worker with 2g memory

Go to the conf directory to create a new configuration file:
cd /usr/local/Cellar/apache-spark/2.4.7/conf/

Create spark-env.sh with the following contents and save it:

SPARK_WORKER_CORES=2
SPARK_WORKER_INSTANCES=2
SPARK_WORKER_MEMORY=2g
SPARK_MASTER_HOST=localhost
SPARK_LOCAL_IP=localhost

Step 3: Start the Spark master and workers

cd /usr/local/Cellar/apache-spark/2.4.7/sbin/
/usr/local/Cellar/apache-spark/2.4.7/sbin/start-master.sh
/usr/local/Cellar/apache-spark/2.4.7/sbin/start-slave.sh spark://localhost:7077
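If the cluster started correctly, the master web UI at localhost:8080 should list two workers, each with 2 cores and 2g of memory, matching the spark-env.sh configuration above.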

Step 4: Start your program

(Uses the spark-streaming-kafka package, https://mvnrepository.com/artifact/org.apache.spark/spark-streaming-kafka-0-8; see the dstream.py sketch under the Kafka integration section above.)

spark-submit --packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.4.7 dstream.py

Helpful references:

  1. Set master ip: https://stackoverflow.com/questions/31166851/spark-standalone-cluster-slave-not-connecting-to-master
  2. Spark Standalone mode: https://spark.apache.org/docs/2.4.7/spark-standalone.html
  3. Spark Streaming Guide: https://spark.apache.org/docs/2.4.7/streaming-programming-guide.html
  4. Spark Streaming + Kafka Integration: https://spark.apache.org/docs/2.4.7/streaming-kafka-0-8-integration.html


Troubleshooting

If spark-submit fails with "Spark Streaming's Kafka libraries not found in class path", try one of the following (match the version number to your Spark installation, e.g. 2.4.7 here):

  1. Include the Kafka library and its dependencies in the spark-submit command:

    $ bin/spark-submit --packages org.apache.spark:spark-streaming-kafka-0-8:2.4.4 ...

  2. Download the JAR of the artifact from Maven Central (http://search.maven.org/), with Group Id = org.apache.spark, Artifact Id = spark-streaming-kafka-0-8-assembly, Version = 2.4.4. Then include the JAR in the spark-submit command:

    $ bin/spark-submit --jars <spark-streaming-kafka-0-8-assembly.jar> ...
