This project demonstrates how to get started with PySpark and the DataFrame API for basic data analysis, including:
- reading in data
- performing aggregations and joins using the Spark SQL module
- calculating summary statistics

We will use the MovieLens 20M Dataset on movie ratings to find out:
- What are the most popular movies?
- What are the top-rated movies?
- Which movies are the most polarising?