DataStreaming - SF Crime Statistics with Spark

A real-world dataset, extracted from Kaggle, on San Francisco crime incidents using Apache Spark Structured Streaming


Project Overview

In this project, you will be provided with a real-world dataset, extracted from Kaggle, on San Francisco crime incidents, and you will provide statistical analyses of the data using Apache Spark Structured Streaming. You will draw on the skills and knowledge you've learned in this course to create a Kafka server to produce data and ingest data through Spark Structured Streaming.

You can try to answer the following questions with the dataset:

  • What are the top types of crimes in San Francisco?
  • What is the crime density by location?
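
Once the data is loaded into a Spark DataFrame, the first question can be answered with a simple aggregation. The sketch below is a batch (non-streaming) exploration, and it assumes the provided calls-for-service file is a single JSON array whose records carry an original_crime_type_name field:

# explore_calls.py - a quick exploratory sketch, not part of the starter code
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("SFCrimeExploration").getOrCreate()

# multiLine is needed because the file is assumed to be one large JSON array,
# not newline-delimited JSON
calls = spark.read.option("multiLine", True).json("police-department-calls-for-service.json")

# Top types of crimes: count calls per original_crime_type_name
calls.groupBy("original_crime_type_name") \
     .count() \
     .orderBy(F.desc("count")) \
     .show(10, truncate=False)

spark.stop()

The second question can be answered the same way by grouping on an address or location field instead.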

Development Environment

You may choose to create your project in the workspace we provide here. If you wish to develop your project locally instead, you will need to set up your environment as described below:

  • Spark 2.4.3
  • Scala 2.11.x
  • Java 1.8.x
  • Kafka build with Scala 2.11.x
  • Python 3.6.x or 3.7.x

Environment Setup (Only Necessary if You Want to Work on the Project Locally on Your Own Machine)

For Macs or Linux:
  • Download Spark from https://spark.apache.org/downloads.html. Choose "Prebuilt for Apache Hadoop 2.7 and later."
  • Unpack Spark in one of your folders (I usually put all my dev requirements in /home/users/user/dev).
  • Download the Kafka binary from https://kafka.apache.org/downloads, with Scala 2.11, version 2.3.0. Unzip it in the local directory where you unzipped your Spark binary. Exploring the Kafka folder, you'll see the scripts to execute in the bin folder and the config files under the config folder. You'll need to modify zookeeper.properties and server.properties.
  • Download Scala from the official site, or for Mac users, you can also use brew install scala, but make sure you download version 2.11.x.
  • Run below to verify correct versions:
    java -version
    scala -version
    
  • Make sure your ~/.bash_profile looks like below (might be different depending on your directory):
    export SPARK_HOME=/Users/dev/spark-2.4.3-bin-hadoop2.7
    export JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk1.8.0_181.jdk/Contents/Home
    export SCALA_HOME=/usr/local/scala/
    export PATH=$JAVA_HOME/bin:$SPARK_HOME/bin:$SCALA_HOME/bin:$PATH
    
For Windows:

Please follow the directions found in this helpful StackOverflow post: https://stackoverflow.com/questions/25481325/how-to-set-up-spark-on-windows

SF Crime Data

Project Directions

Starter Code

You can find three Python files that are starter code, the project dataset, and some other necessary resources in a zip file called "SF Crime Data Project Files" in the Resources tab in the left sidebar of your classroom:

  • producer_server.py
  • kafka_server.py
  • data_stream.py
  • police-department-calls-for-service.json
  • radio_code.json
  • start.sh
  • requirements.txt

These files are also included in the Project Workspace.

Files You Need to Edit in Your Project Work

These starter code files should be edited:

  • producer_server.py
  • data_stream.py
  • kafka_server.py

The following file should be created separately for you to check if your kafka_server.py is working properly:

  • consumer_server.py
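
A minimal sketch of what consumer_server.py can look like is shown below. It uses kafka-python, and the topic name, port, and group id are only the example values used elsewhere in this README - adjust them to your own choices:

# consumer_server.py - simple sanity-check consumer for the topic produced by
# kafka_server.py
from kafka import KafkaConsumer

def run_consumer(topic="com.sf.police.event.calls",
                 bootstrap_servers="localhost:9092"):
    consumer = KafkaConsumer(
        topic,
        bootstrap_servers=bootstrap_servers,
        auto_offset_reset="earliest",   # read the topic from the beginning
        group_id="sf-crime-consumer"
    )
    for message in consumer:
        # message.value holds the raw bytes sent by the producer
        print(message.value.decode("utf-8"))

if __name__ == "__main__":
    run_consumer()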

Create a GitHub Repository

Create a new repo that will contain all these files for your project. You will submit a link to this repo as a key part of your project submission. If you complete the project in the classroom workspace here, just download the files you worked on and add them to your repo.

Beginning the Project

This project requires creating topics, starting Zookeeper and Kafka servers, and running your Kafka bootstrap server. You'll need to choose a port number (e.g., 9092, 9093) for your Kafka broker, come up with a Kafka topic name, and modify zookeeper.properties and server.properties appropriately.
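
The exact keys vary slightly between Kafka versions, but the lines you typically touch look something like the sketch below (the port, paths, and values are examples, not requirements):

# zookeeper.properties
dataDir=/tmp/zookeeper
clientPort=2181

# server.properties
broker.id=0
listeners=PLAINTEXT://localhost:9092
log.dirs=/tmp/kafka-logs
zookeeper.connect=localhost:2181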

Local Environment

  • Install requirements using ./start.sh if you use conda for Python. If you use pip rather than conda, then use pip install -r requirements.txt.

  • Use the commands below to start the Zookeeper and Kafka servers. You can find the bin and config folder in the Kafka binary that you have downloaded and unzipped.

bin/zookeeper-server-start.sh config/zookeeper.properties
bin/kafka-server-start.sh config/server.properties
  • You can start the bootstrap server using this Python command: python producer_server.py.

Workspace Environment

  • Modify the zookeeper.properties and server.properties files provided to suit the topic and port number of your choice. Start up these servers in the terminal using the commands:
    
    /usr/bin/zookeeper-server-start /etc/kafka/zookeeper.properties
    kafka-server-start /etc/kafka/server.properties
    
  • You’ll need to open up two terminal tabs to execute each command.

  • Install requirements using the provided ./start.sh script. This needs to be done every time you re-open the workspace, or anytime after you've refreshed, woken up, or reset the workspace, or used the "Get New Content" button.

  • In the terminal, to install other packages that you think are necessary to complete the project, use conda install <package_name>. You may need to reinstall these packages under the same circumstances.

 

Step 1

  • The first step is to build a simple Kafka server.
  • Complete the code for the server in producer_server.py and kafka_server.py.
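
A minimal sketch of the producer side is shown below. It uses kafka-python; the class and method names follow the starter file's structure as I understood it, so treat them as assumptions and adapt them to the actual TODOs:

# producer_server.py - sketch: read the calls-for-service JSON file and send
# each record to the Kafka topic as a UTF-8 encoded JSON string
import json
import time
from kafka import KafkaProducer

class ProducerServer(KafkaProducer):
    def __init__(self, input_file, topic, **kwargs):
        super().__init__(**kwargs)
        self.input_file = input_file
        self.topic = topic

    def generate_data(self):
        with open(self.input_file) as f:
            records = json.load(f)        # the file is assumed to be a JSON array
        for record in records:
            self.send(self.topic, self.dict_to_binary(record))
            time.sleep(1)                 # throttle so the stream is easy to follow

    def dict_to_binary(self, json_dict):
        # Kafka expects bytes, so serialize each dict before sending
        return json.dumps(json_dict).encode("utf-8")

kafka_server.py then only needs to wire the producer to the input file, topic, and broker (again a sketch with assumed names and values):

# kafka_server.py - sketch: configure and start the producer defined above
import producer_server

def run_kafka_server():
    input_file = "police-department-calls-for-service.json"
    return producer_server.ProducerServer(
        input_file=input_file,
        topic="com.sf.police.event.calls",
        bootstrap_servers="localhost:9092",
        client_id="sf-crime-producer"
    )

if __name__ == "__main__":
    producer = run_kafka_server()
    producer.generate_data()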

Local Environment
To check whether you implemented the server correctly, use the command below and inspect the output:

bin/kafka-console-consumer.sh --bootstrap-server localhost:<your-port-number> --topic <your-topic-name> --from-beginning 

Workspace Environment

  • Set up the Udacity Workspace
    ./start.sh
  • Start the Zookeeper server
    /usr/bin/zookeeper-server-start /etc/kafka/zookeeper.properties
  • Start the Kafka server
    kafka-server-start /etc/kafka/server.properties
  • Create the topic
    kafka-topics --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic com.sf.police.event.calls
  • Check that the topic "com.sf.police.event.calls" now exists
    /usr/bin/kafka-topics --list --zookeeper localhost:2181
  • Run kafka-console-producer with dummy JSON values
    kafka-console-producer --broker-list localhost:9092 --topic com.sf.police.event.calls

    { "crime_id": "183653763", "original_crime_type_name": "Traffic Stop", "report_date": "2018-12-31T00:00:00.000","call_date": "2018-12-31T00:00:00.000","offense_date": "2018-12-31T00:00:00.000","call_time": "23:57","call_date_time": "2018-12-31T23:57:00.000","disposition": "ADM","address": "Geary Bl/divisadero St","city": "San Francisco","state": "CA","agency_id": "1","address_type": "Intersection","common_location": "" }

    {"crime_id":"183653745","original_crime_type_name":"Audible Alarm","report_date":"2018-12-31T00:00:00.000","call_date":"2018-12-31T00:00:00.000","offense_date":"2018-12-31T00:00:00.000","call_time":"23:47","call_date_time":"2018-12-31T23:47:00.000","disposition":"PAS","address":"1900 Block Of 18th Av","city":"San Francisco","state":"CA","agency_id":"1","address_type":"Premise Address","common_location":""}

    {"crime_id":"183653706","original_crime_type_name":"Passing Call","report_date":"2018-12-31T00:00:00.000","call_date":"2018-12-31T00:00:00.000","offense_date":"2018-12-31T00:00:00.000","call_time":"23:34","call_date_time":"2018-12-31T23:34:00.000","disposition":"Not recorded","address":"1500 Block Of Haight St","city":"San Francisco","state":"CA","agency_id":"1","address_type":"Common Location","common_location":"Haight St Corridor, Sf"}

  • TIP: use this tool to convert a multiline JSON layout to a single line: https://tools.knowledgewalls.com/online-multiline-to-single-line-converter (a small Python helper for the same task is sketched after this list)
  • Run kafka-console-consumer
    kafka-console-consumer --bootstrap-server localhost:9092 --topic com.sf.police.event.calls --from-beginning
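
If you prefer to do the multiline-to-single-line conversion offline, a few lines of Python do the same job. This is a throwaway helper (the file name is made up), and it assumes the input file is a JSON array of records:

# single_line_json.py - print each record of a JSON array file on its own line
import json
import sys

with open(sys.argv[1]) as f:
    records = json.load(f)

for record in records:
    print(json.dumps(record))

Run it as python single_line_json.py police-department-calls-for-service.json and paste the resulting lines into kafka-console-producer.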


Sample Kafka Consumer Console Output (Screenshot)



Step 2

  • Apache Spark already has an integration with Kafka brokers, so we would not normally need a separate Kafka consumer. However, we are going to ask you to create one anyway. Why? We'd like you to create the consumer to demonstrate your understanding of creating a complete Kafka Module (producer and consumer) from scratch. In production, you might have to create a dummy producer or consumer to just test out your theory and this will be great practice for that.
  • Implement all the TODO items in data_stream.py. You may need to explore the dataset beforehand using a Jupyter Notebook.
  • Do a spark-submit using this command: spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.3.4 --master local[*] data_stream.py.
  • Take a screenshot of your progress reporter after executing a Spark job. You will need to include this screenshot as part of your project submission.
  • Take a screenshot of the Spark Streaming UI as the streaming continues. You will need to include this screenshot as part of your project submission.
  • Run the following to view Spark UI
    wget "http://localhost:3000"
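
For orientation, here is a minimal sketch of the shape data_stream.py ends up taking. The schema is trimmed to a few of the fields from the sample records above, and the topic, port, and option values are assumptions to replace with your own:

# data_stream.py - sketch: subscribe to the Kafka topic with Structured Streaming,
# deserialize the JSON payload, and run a windowed count per crime type
from pyspark.sql import SparkSession
import pyspark.sql.functions as psf
import pyspark.sql.types as pst

# Only a few of the fields from the sample records are declared here
schema = pst.StructType([
    pst.StructField("crime_id", pst.StringType(), True),
    pst.StructField("original_crime_type_name", pst.StringType(), True),
    pst.StructField("call_date_time", pst.StringType(), True),
    pst.StructField("disposition", pst.StringType(), True),
])

spark = SparkSession.builder \
    .master("local[*]") \
    .appName("KafkaSparkStructuredStreaming") \
    .getOrCreate()

# Raw Kafka stream: the payload arrives in the binary "value" column
df = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "com.sf.police.event.calls") \
    .option("startingOffsets", "earliest") \
    .option("maxOffsetsPerTrigger", 200) \
    .load()

service_table = df.selectExpr("CAST(value AS STRING)") \
    .select(psf.from_json(psf.col("value"), schema).alias("DF")) \
    .select("DF.*") \
    .withColumn("call_date_time", psf.to_timestamp("call_date_time"))

# Count of calls per crime type over a sliding one-hour window
counts_df = service_table \
    .groupBy(psf.window("call_date_time", "60 minutes", "10 minutes"),
             "original_crime_type_name") \
    .count()

query = counts_df.writeStream \
    .outputMode("complete") \
    .format("console") \
    .start()

query.awaitTermination()

Submit it with the spark-submit command shown above.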

Step 3

  1. How did changing values on the SparkSession property parameters affect the throughput and latency of the data?

    • Yes, altering them impacted the time it took to complete jobs/tasks.
    • Altering the number of cores used, i.e. master("local[*]"), had the most significant impact. Reducing the number of cores appeared to reduce the processing of the 200 tasks I had it process; I believe this may have been the result of reduced Shuffle Read and Shuffle Write.
    • Altering maxRatePerPartition and maxOffsetsPerTrigger also seemed to affect throughput and latency.
  2. What were the 2-3 most efficient SparkSession property key/value pairs? Through testing multiple variations on values, how can you tell these were the most optimal?

    Property              Value
    maxRatePerPartition   10
    maxOffsetsPerTrigger  100
    master                local[1]

From looking at the Executors tab in Spark's Web UI, this was evident in the following columns:

  • Task Time
  • Shuffle Read
  • Shuffle Write

The screenshots below show the differences in performance between two separate Spark session configurations.

Screenshot 1 - Spark Session Properties

Property              Value
maxRatePerPartition   100
maxOffsetsPerTrigger  200
master                local[*]

 


 Screenshot 2 - Spark Session Properties

Property              Value
maxRatePerPartition   10
maxOffsetsPerTrigger  100
master                local[1]

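
For reference, the properties compared above live in two different places: master and maxRatePerPartition on the SparkSession, and maxOffsetsPerTrigger as an option on the Kafka source. A sketch, with values taken from the configuration reported above as most efficient and with maxRatePerPartition assumed to be passed as spark.streaming.kafka.maxRatePerPartition:

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master("local[1]") \
    .appName("KafkaSparkStructuredStreaming") \
    .config("spark.streaming.kafka.maxRatePerPartition", 10) \
    .getOrCreate()

df = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "com.sf.police.event.calls") \
    .option("maxOffsetsPerTrigger", 100) \
    .load()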

