This git repo is a collection of introductory tutorials and code samples on Apache Spark. The code samples are written in Python, so we are effectively using PySpark.
The goals are to:
- Build expertise with Spark DataFrames
- Read from and write to AWS S3
- Apply feature engineering to the data read from AWS S3 with Spark
- Write the resulting features back to AWS S3 (an end-to-end sketch follows this list)
- Learn to use AWS EMR to execute all of the above steps (see the EMR sketch below)
- Become familiar with Spark MLlib
- Become familiar with Spark Structured Streaming with Kafka
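
To make the S3 read/write and feature-engineering goals concrete, here is a minimal sketch; the bucket, paths, and the `price`/`quantity` columns are placeholders for illustration, not files from this repo.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# In a standalone script you build the session yourself;
# in the pyspark shell on EMR one is provided for you.
spark = SparkSession.builder.appName("s3-feature-engineering").getOrCreate()

# Read a CSV from S3 into a DataFrame. The s3:// path is a placeholder;
# on EMR the s3:// scheme is backed by EMRFS.
df = spark.read.csv("s3://my-bucket/raw/orders.csv", header=True, inferSchema=True)

# A trivial engineered feature from the placeholder columns price and quantity.
features = df.withColumn("order_value", F.col("price") * F.col("quantity"))

# Write the features back to S3 as Parquet, a common columnar format for features.
features.write.mode("overwrite").parquet("s3://my-bucket/features/orders/")
```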
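One common way to run such a script on EMR is to submit it as a step to an already-running cluster. The sketch below does this with boto3; the cluster ID, region, and script location are placeholders.

```python
import boto3

# Placeholder region; the script itself must already be uploaded to S3.
emr = boto3.client("emr", region_name="us-east-1")

response = emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",  # placeholder EMR cluster ID
    Steps=[{
        "Name": "feature-engineering",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            # command-runner.jar lets an EMR step invoke spark-submit.
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "s3://my-bucket/scripts/features.py"],
        },
    }],
)
print(response["StepIds"])
```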
The technologies covered are:
- Apache Spark 2.4 with PySpark
- AWS S3 for data storage
- AWS EMR (Elastic MapReduce)
- Spark DataFrames
- Spark MLlib (low priority; a minimal pipeline sketch follows this list)
- Spark Structured Streaming with Kafka
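
For the MLlib item, a minimal pipeline might look as follows; the tiny in-memory DataFrame and its column names are made up for illustration.

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("mllib-sketch").getOrCreate()

# A tiny in-memory DataFrame standing in for features read from S3.
df = spark.createDataFrame(
    [(1.0, 2.0, 0.0), (3.0, 4.0, 1.0)],
    ["f1", "f2", "label"],
)

# MLlib estimators expect a single vector column, so assemble the raw columns.
assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="label")

pipeline = Pipeline(stages=[assembler, lr])
model = pipeline.fit(df)
model.transform(df).select("f1", "f2", "prediction").show()
```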
Please follow this Databricks tutorial if you are interested in Spark Structured Streaming with Kafka. Although the tutorial is written in Scala, you can easily follow it in Python once you have completed the steps above.
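
To show that the Scala tutorial translates directly to Python, here is a minimal sketch of reading from Kafka with Structured Streaming; the broker address and topic are placeholders, and the Kafka connector must be on the classpath (e.g. `--packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.0` for Spark 2.4).

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-streaming").getOrCreate()

# Subscribe to a placeholder Kafka topic on a placeholder broker.
stream = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")
          .option("subscribe", "events")
          .load())

# Kafka messages expose key/value as binary; cast value to a string to inspect it.
messages = stream.selectExpr("CAST(value AS STRING) AS value")

# Write the stream to the console for demonstration purposes.
query = (messages.writeStream
         .format("console")
         .outputMode("append")
         .start())

query.awaitTermination()
```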