The purpose of this project is to analyze a dataset of Amazon book reviews. We use PySpark to extract, transform, and load the data into a PostgreSQL database on an Amazon Web Services Relational Database Service (AWS RDS) instance, which we access through pgAdmin. We also use PySpark to determine whether members of the paid Amazon Vine program leave more positive reviews than other reviewers in this dataset.
- There were 5,012 Vine reviews.
- There were 109,297 non-Vine reviews.
- 2,031 Vine reviews were five stars.
- 49,967 non-Vine reviews were five stars.
- Approximately 40.52% of Vine reviews were five stars.
- Approximately 45.72% of non-Vine reviews were five stars.
| | Vine Reviews | Non-Vine Reviews |
|---|---|---|
| Total Reviews | 5,012 | 109,297 |
| Number of Five Stars | 2,031 | 49,967 |
| Percentage of Five Stars | 40.52% | 45.72% |
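
The percentages reported here can be reproduced from the raw counts with a few lines of Python. This is a minimal sketch rather than the project's actual PySpark code; the helper name `five_star_percentage` is hypothetical, and the counts are the ones listed above.

```python
def five_star_percentage(five_star_count: int, total_count: int) -> float:
    """Return the share of five-star reviews as a percentage, rounded to two decimals."""
    if total_count <= 0:
        raise ValueError("total_count must be positive")
    return round(100 * five_star_count / total_count, 2)


# Counts taken from the results reported in this README.
vine_pct = five_star_percentage(2031, 5012)
non_vine_pct = five_star_percentage(49967, 109297)

print(f"Vine: {vine_pct}%")          # Vine: 40.52%
print(f"Non-Vine: {non_vine_pct}%")  # Non-Vine: 45.72%
```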
Based on the calculations above, positivity bias from members of the Vine program is unlikely in this dataset. The percentage of five-star Vine reviews (40.52%) was about 5.2 percentage points lower than the percentage of five-star non-Vine reviews (45.72%). Additional analysis could examine the full distribution of star ratings by calculating the percentages of Vine reviews and non-Vine reviews at each star rating.
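
The suggested follow-up analysis could be sketched as follows. This plain-Python example is illustrative only: the function name `star_distribution` is hypothetical, the `sample` rows are made-up toy data rather than values from the dataset, and the `"Y"`/`"N"` Vine flags are an assumption about how the Vine column is encoded. In PySpark, the same idea would likely be a group-by on the Vine flag and star rating.

```python
from collections import Counter


def star_distribution(reviews):
    """Map each Vine flag to {star_rating: percentage of that group's reviews}.

    `reviews` is an iterable of (vine_flag, star_rating) pairs.
    """
    pair_counts = Counter(reviews)  # number of reviews per (vine_flag, star_rating)
    group_totals = Counter()        # number of reviews per vine_flag
    for (vine_flag, _star), n in pair_counts.items():
        group_totals[vine_flag] += n

    distribution = {}
    for (vine_flag, star), n in pair_counts.items():
        pct = round(100 * n / group_totals[vine_flag], 2)
        distribution.setdefault(vine_flag, {})[star] = pct
    return distribution


# Toy data for illustration only (not from the Amazon dataset).
sample = [("Y", 5), ("Y", 4), ("Y", 5), ("Y", 3),
          ("N", 5), ("N", 1), ("N", 5), ("N", 5)]

dist = star_distribution(sample)
for vine_flag in sorted(dist):
    print(vine_flag, dict(sorted(dist[vine_flag].items())))
```

Comparing the two resulting distributions side by side would show whether Vine reviews skew toward higher or lower ratings overall, not just at five stars.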
Data Source:
https://s3.amazonaws.com/amazon-reviews-pds/tsv/amazon_reviews_us_Books_v1_00.tsv.gz
Software:
- Google Colaboratory notebook
- Python MapReduce library mrjob
- PySpark
Email: [email protected]
LinkedIn: https://www.linkedin.com/in/s-k-wang