DataStreaming - SF Crime Statistics with Spark

A real-world dataset, extracted from Kaggle, on San Francisco crime incidents using Apache Spark Structured Streaming


Project Overview

In this project, you will be provided with a real-world dataset, extracted from Kaggle, on San Francisco crime incidents, and you will provide statistical analyses of the data using Apache Spark Structured Streaming. You will draw on the skills and knowledge you've learned in this course to create a Kafka server to produce data and ingest data through Spark Structured Streaming.

You can try to answer the following questions with the dataset:

  • What are the top types of crimes in San Francisco?
  • What is the crime density by location?
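
Once the data is loaded into a Spark DataFrame, the first question can be answered with a simple aggregation. The sketch below is a batch (non-streaming) exploration, and it assumes the provided calls-for-service file is a single JSON array whose records carry an original_crime_type_name field:

# explore_calls.py - a quick exploratory sketch, not part of the starter code
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("SFCrimeExploration").getOrCreate()

# multiLine is needed because the file is assumed to be one large JSON array,
# not newline-delimited JSON
calls = spark.read.option("multiLine", True).json("police-department-calls-for-service.json")

# Top types of crimes: count calls per original_crime_type_name
calls.groupBy("original_crime_type_name") \
     .count() \
     .orderBy(F.desc("count")) \
     .show(10, truncate=False)

spark.stop()

The second question can be answered the same way by grouping on an address or location field instead.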

Development Environment

You may choose to create your project in the workspace we provide here. If you wish to develop your project locally instead, you will need to set up your environment as described below:

  • Spark 2.4.3
  • Scala 2.11.x
  • Java 1.8.x
  • Kafka build with Scala 2.11.x
  • Python 3.6.x or 3.7.x

Environment Setup (Only Necessary if You Want to Work on the Project Locally on Your Own Machine)

For Macs or Linux:
  • Download Spark from https://spark.apache.org/downloads.html. Choose "Prebuilt for Apache Hadoop 2.7 and later."
  • Unpack Spark in one of your folders (I usually put all my dev requirements in /home/users/user/dev).
  • Download the Kafka binary from https://kafka.apache.org/downloads, with Scala 2.11, version 2.3.0. Unzip it in the local directory where you unzipped your Spark binary. Exploring the Kafka folder, you'll see the scripts to execute in the bin folder and the config files under the config folder. You'll need to modify zookeeper.properties and server.properties.
  • Download Scala from the official site, or for Mac users, you can also use brew install scala, but make sure you download version 2.11.x.
  • Run below to verify correct versions:
    java -version
    scala -version
    
  • Make sure your ~/.bash_profile looks like below (might be different depending on your directory):
    export SPARK_HOME=/Users/dev/spark-2.4.3-bin-hadoop2.7
    export JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk1.8.0_181.jdk/Contents/Home
    export SCALA_HOME=/usr/local/scala/
    export PATH=$JAVA_HOME/bin:$SPARK_HOME/bin:$SCALA_HOME/bin:$PATH
    
For Windows:

Please follow the directions found in this helpful StackOverflow post: https://stackoverflow.com/questions/25481325/how-to-set-up-spark-on-windows

SF Crime Data

Project Directions

Starter Code

You can find three Python files that are starter code, the project dataset, and some other necessary resources in a zip file called "SF Crime Data Project Files" in the Resources tab in the left sidebar of your classroom:

  • producer_server.py
  • kafka_server.py
  • data_stream.py
  • police-department-calls-for-service.json
  • radio_code.json
  • start.sh
  • requirements.txt

These files are also included in the Project Workspace.

Files You Need to Edit in Your Project Work

These starter code files should be edited:

  • producer_server.py
  • data_stream.py
  • kafka_server.py

The following file should be created separately for you to check if your kafka_server.py is working properly:

  • consumer_server.py
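
A minimal sketch of what consumer_server.py can look like is shown below. It uses kafka-python, and the topic name, port, and group id are only the example values used elsewhere in this README - adjust them to your own choices:

# consumer_server.py - simple sanity-check consumer for the topic produced by
# kafka_server.py
from kafka import KafkaConsumer

def run_consumer(topic="com.sf.police.event.calls",
                 bootstrap_servers="localhost:9092"):
    consumer = KafkaConsumer(
        topic,
        bootstrap_servers=bootstrap_servers,
        auto_offset_reset="earliest",   # read the topic from the beginning
        group_id="sf-crime-consumer"
    )
    for message in consumer:
        # message.value holds the raw bytes sent by the producer
        print(message.value.decode("utf-8"))

if __name__ == "__main__":
    run_consumer()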

Create a GitHub Repository

Create a new repo that will contain all these files for your project. You will submit a link to this repo as a key part of your project submission. If you complete the project in the classroom workspace here, just download the files you worked on and add them to your repo.

Beginning the Project

This project requires creating topics, starting Zookeeper and Kafka servers, and running your Kafka bootstrap server. You'll need to choose a port number (e.g., 9092, 9093) for your Kafka broker, come up with a Kafka topic name, and modify zookeeper.properties and server.properties appropriately.
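
The exact keys vary slightly between Kafka versions, but the lines you typically touch look something like the sketch below (the port, paths, and values are examples, not requirements):

# zookeeper.properties
dataDir=/tmp/zookeeper
clientPort=2181

# server.properties
broker.id=0
listeners=PLAINTEXT://localhost:9092
log.dirs=/tmp/kafka-logs
zookeeper.connect=localhost:2181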

Local Environment

  • Install requirements using ./start.sh if you use conda for Python. If you use pip rather than conda, then use pip install -r requirements.txt.

  • Use the commands below to start the Zookeeper and Kafka servers. You can find the bin and config folder in the Kafka binary that you have downloaded and unzipped.

bin/zookeeper-server-start.sh config/zookeeper.properties
bin/kafka-server-start.sh config/server.properties
  • You can start the bootstrap server using this Python command: python producer_server.py.

Workspace Environment

  • Modify the zookeeper.properties and server.properties files provided to suit the topic and port number of your choice. Start up these servers in the terminal using the commands:
    
    /usr/bin/zookeeper-server-start /etc/kafka/zookeeper.properties
    kafka-server-start /etc/kafka/server.properties
    
  • You’ll need to open up two terminal tabs to execute each command.

  • Install requirements using the provided ./start.sh script. This needs to be done every time you re-open the workspace, or anytime after you've refreshed, woken up, or reset the workspace, or used the "Get New Content" button.

  • In the terminal, to install other packages that you think are necessary to complete the project, use conda install <package_name>. You may need to reinstall these packages under the same circumstances.

 

Step 1

  • The first step is to build a simple Kafka server.
  • Complete the code for the server in producer_server.py and kafka_server.py.
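
A minimal sketch of the producer side is shown below. It uses kafka-python; the class and method names follow the starter file's structure as I understood it, so treat them as assumptions and adapt them to the actual TODOs:

# producer_server.py - sketch: read the calls-for-service JSON file and send
# each record to the Kafka topic as a UTF-8 encoded JSON string
import json
import time
from kafka import KafkaProducer

class ProducerServer(KafkaProducer):
    def __init__(self, input_file, topic, **kwargs):
        super().__init__(**kwargs)
        self.input_file = input_file
        self.topic = topic

    def generate_data(self):
        with open(self.input_file) as f:
            records = json.load(f)        # the file is assumed to be a JSON array
        for record in records:
            self.send(self.topic, self.dict_to_binary(record))
            time.sleep(1)                 # throttle so the stream is easy to follow

    def dict_to_binary(self, json_dict):
        # Kafka expects bytes, so serialize each dict before sending
        return json.dumps(json_dict).encode("utf-8")

kafka_server.py then only needs to wire the producer to the input file, topic, and broker (again a sketch with assumed names and values):

# kafka_server.py - sketch: configure and start the producer defined above
import producer_server

def run_kafka_server():
    input_file = "police-department-calls-for-service.json"
    return producer_server.ProducerServer(
        input_file=input_file,
        topic="com.sf.police.event.calls",
        bootstrap_servers="localhost:9092",
        client_id="sf-crime-producer"
    )

if __name__ == "__main__":
    producer = run_kafka_server()
    producer.generate_data()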

Local Environment
To check whether you implemented the server correctly, use the command below and inspect the output:

bin/kafka-console-consumer.sh --bootstrap-server localhost:<your-port-number> --topic <your-topic-name> --from-beginning 

Workspace Environment

  • Set up the Udacity Workspace
    ./start.sh
  • Start the Zookeeper server
    /usr/bin/zookeeper-server-start /etc/kafka/zookeeper.properties
  • Start the Kafka server
    kafka-server-start /etc/kafka/server.properties
  • Create the topic
    kafka-topics --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic com.sf.police.event.calls
  • Check that the topic "com.sf.police.event.calls" now exists
    /usr/bin/kafka-topics --list --zookeeper localhost:2181
  • Run kafka-console-producer with dummy JSON values
    kafka-console-producer --broker-list localhost:9092 --topic com.sf.police.event.calls

    { "crime_id": "183653763", "original_crime_type_name": "Traffic Stop", "report_date": "2018-12-31T00:00:00.000","call_date": "2018-12-31T00:00:00.000","offense_date": "2018-12-31T00:00:00.000","call_time": "23:57","call_date_time": "2018-12-31T23:57:00.000","disposition": "ADM","address": "Geary Bl/divisadero St","city": "San Francisco","state": "CA","agency_id": "1","address_type": "Intersection","common_location": "" }

    {"crime_id":"183653745","original_crime_type_name":"Audible Alarm","report_date":"2018-12-31T00:00:00.000","call_date":"2018-12-31T00:00:00.000","offense_date":"2018-12-31T00:00:00.000","call_time":"23:47","call_date_time":"2018-12-31T23:47:00.000","disposition":"PAS","address":"1900 Block Of 18th Av","city":"San Francisco","state":"CA","agency_id":"1","address_type":"Premise Address","common_location":""}

    {"crime_id":"183653706","original_crime_type_name":"Passing Call","report_date":"2018-12-31T00:00:00.000","call_date":"2018-12-31T00:00:00.000","offense_date":"2018-12-31T00:00:00.000","call_time":"23:34","call_date_time":"2018-12-31T23:34:00.000","disposition":"Not recorded","address":"1500 Block Of Haight St","city":"San Francisco","state":"CA","agency_id":"1","address_type":"Common Location","common_location":"Haight St Corridor, Sf"}

  • TIP: use this tool to convert a multiline JSON layout to a single line: https://tools.knowledgewalls.com/online-multiline-to-single-line-converter (a small Python helper for the same task is sketched after this list)
  • Run kafka-console-consumer
    kafka-console-consumer --bootstrap-server localhost:9092 --topic com.sf.police.event.calls --from-beginning
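
If you prefer to do the multiline-to-single-line conversion offline, a few lines of Python do the same job. This is a throwaway helper (the file name is made up), and it assumes the input file is a JSON array of records:

# single_line_json.py - print each record of a JSON array file on its own line
import json
import sys

with open(sys.argv[1]) as f:
    records = json.load(f)

for record in records:
    print(json.dumps(record))

Run it as python single_line_json.py police-department-calls-for-service.json and paste the resulting lines into kafka-console-producer.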


Sample Kafka Consumer Console Output (Screenshot)



Step 2

  • Apache Spark already has an integration with Kafka brokers, so we would not normally need a separate Kafka consumer. However, we are going to ask you to create one anyway. Why? We'd like you to create the consumer to demonstrate your understanding of creating a complete Kafka Module (producer and consumer) from scratch. In production, you might have to create a dummy producer or consumer to just test out your theory and this will be great practice for that.
  • Implement all the TODO items in data_stream.py. You may need to explore the dataset beforehand using a Jupyter Notebook.
  • Do a spark-submit using this command: spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.3.4 --master local[*] data_stream.py.
  • Take a screenshot of your progress reporter after executing a Spark job. You will need to include this screenshot as part of your project submission.
  • Take a screenshot of the Spark Streaming UI as the streaming continues. You will need to include this screenshot as part of your project submission.
  • Run the following to view Spark UI
    wget "http://localhost:3000"
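
For orientation, here is a minimal sketch of the shape data_stream.py ends up taking. The schema is trimmed to a few of the fields from the sample records above, and the topic, port, and option values are assumptions to replace with your own:

# data_stream.py - sketch: subscribe to the Kafka topic with Structured Streaming,
# deserialize the JSON payload, and run a windowed count per crime type
from pyspark.sql import SparkSession
import pyspark.sql.functions as psf
import pyspark.sql.types as pst

# Only a few of the fields from the sample records are declared here
schema = pst.StructType([
    pst.StructField("crime_id", pst.StringType(), True),
    pst.StructField("original_crime_type_name", pst.StringType(), True),
    pst.StructField("call_date_time", pst.StringType(), True),
    pst.StructField("disposition", pst.StringType(), True),
])

spark = SparkSession.builder \
    .master("local[*]") \
    .appName("KafkaSparkStructuredStreaming") \
    .getOrCreate()

# Raw Kafka stream: the payload arrives in the binary "value" column
df = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "com.sf.police.event.calls") \
    .option("startingOffsets", "earliest") \
    .option("maxOffsetsPerTrigger", 200) \
    .load()

service_table = df.selectExpr("CAST(value AS STRING)") \
    .select(psf.from_json(psf.col("value"), schema).alias("DF")) \
    .select("DF.*") \
    .withColumn("call_date_time", psf.to_timestamp("call_date_time"))

# Count of calls per crime type over a sliding one-hour window
counts_df = service_table \
    .groupBy(psf.window("call_date_time", "60 minutes", "10 minutes"),
             "original_crime_type_name") \
    .count()

query = counts_df.writeStream \
    .outputMode("complete") \
    .format("console") \
    .start()

query.awaitTermination()

Submit it with the spark-submit command shown above.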

Step 3

  1. How did changing values on the SparkSession property parameters affect the throughput and latency of the data?

    • Yes, altering them impacted the time it took to complete jobs/tasks.
    • Altering the number of cores used, i.e. master("local[*]"), had the most significant impact. Reducing the number of cores appeared to reduce the processing of the 200 tasks I had it process; I believe this may have been the result of reduced Shuffle Read and Shuffle Write.
    • Altering maxRatePerPartition and maxOffsetsPerTrigger also seemed to affect throughput and latency.
  2. What were the 2-3 most efficient SparkSession property key/value pairs? Through testing multiple variations on values, how can you tell these were the most optimal?

    Property              Value
    maxRatePerPartition   10
    maxOffsetsPerTrigger  100
    master                local[1]

From looking at the Executors tab in Spark's Web UI, this was evident in the following columns:

  • Task Time
  • Shuffle Read
  • Shuffle Write

The screenshots below show the differences in performance between two separate Spark session configurations.

Screenshot 1 - Spark Session Properties

Property              Value
maxRatePerPartition   100
maxOffsetsPerTrigger  200
master                local[*]

 


 Screenshot 2 - Spark Session Properties

Property              Value
maxRatePerPartition   10
maxOffsetsPerTrigger  100
master                local[1]

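
For reference, the properties compared above live in two different places: master and maxRatePerPartition on the SparkSession, and maxOffsetsPerTrigger as an option on the Kafka source. A sketch, with values taken from the configuration reported above as most efficient and with maxRatePerPartition assumed to be passed as spark.streaming.kafka.maxRatePerPartition:

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master("local[1]") \
    .appName("KafkaSparkStructuredStreaming") \
    .config("spark.streaming.kafka.maxRatePerPartition", 10) \
    .getOrCreate()

df = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "com.sf.police.event.calls") \
    .option("maxOffsetsPerTrigger", 100) \
    .load()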

