Big Data using PySpark, Amazon Web Services (AWS), Google Colaboratory, and pgAdmin
Data analysts were tasked with analyzing Amazon reviews written by members of the paid Amazon Vine program.
Data from Amazon's Shoes department was analyzed to determine if having a paid Vine review makes a difference in the percentage of 5-star reviews.
The Extract, Transform, and Load (ETL) process was used on the Amazon Shoes dataset.
pgAdmin was used to connect to AWS, and PySpark and PostgreSQL were used on the dataset to create four separate DataFrames that match the table schema in pgAdmin.
The transformed data was then uploaded into AWS RDS.
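The transform and load steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual notebook: the table names, the column lists, and the helper names (`TABLE_COLUMNS`, `jdbc_url`, `split_tables`, `load_tables`) are assumptions, since the four tables are not listed here. The column names follow the layout of the public Amazon review files and may need adjusting.

```python
# Hypothetical table layout: the four table names and their columns are
# assumptions made for illustration only.
TABLE_COLUMNS = {
    "review_id_table": ["review_id", "customer_id", "product_id",
                        "product_parent", "review_date"],
    "products_table": ["product_id", "product_title"],
    "customers_table": ["customer_id"],
    "vine_table": ["review_id", "star_rating", "helpful_votes",
                   "total_votes", "vine", "verified_purchase"],
}


def jdbc_url(host, database, port=5432):
    """Build a PostgreSQL JDBC URL for an RDS endpoint."""
    return f"jdbc:postgresql://{host}:{port}/{database}"


def split_tables(reviews_df):
    """Return one de-duplicated PySpark DataFrame per target table."""
    return {
        name: reviews_df.select(*columns).dropDuplicates()
        for name, columns in TABLE_COLUMNS.items()
    }


def load_tables(tables, url, user, password, mode="append"):
    """Write each DataFrame to the PostgreSQL table of the same name."""
    properties = {
        "user": user,
        "password": password,
        "driver": "org.postgresql.Driver",
    }
    for name, table_df in tables.items():
        table_df.write.jdbc(url=url, table=name, mode=mode,
                            properties=properties)


# Example usage (needs a running SparkSession with the PostgreSQL JDBC
# driver on its classpath; the path, endpoint, and credentials below are
# placeholders):
#
#   reviews_df = spark.read.csv(path_to_tsv, sep="\t", header=True,
#                               inferSchema=True)
#   tables = split_tables(reviews_df)
#   load_tables(tables, jdbc_url("<rds-endpoint>", "<database>"),
#               user="<user>", password="<password>")
```

Keeping the column lists in one dictionary makes it easy to compare the DataFrames against the schema created in pgAdmin before anything is written to RDS.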
The total number of reviews was 4,366,916.
The total number of 5-star reviews was 2,639,935.
The total number of Vine 5-star reviews was 13.
The total number of non-Vine 5-star reviews was 14,475.
The percentage of Vine 5-star reviews was 59%.
The percentage of non-Vine 5-star reviews was 54%.
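Each percentage is a 5-star count divided by the matching total review count. The sketch below shows that arithmetic with a hypothetical helper, `five_star_percentage`. The Vine and non-Vine review totals for the selected sample are not listed above, so only the overall share is computed from the reported figures.

```python
def five_star_percentage(five_star_count, total_count):
    """Return the share of 5-star reviews as a percentage."""
    if total_count <= 0:
        raise ValueError("total_count must be positive")
    return 100.0 * five_star_count / total_count


# Overall share, using the two totals reported above.
overall = five_star_percentage(2639935, 4366916)
print(f"Overall 5-star share: {overall:.1f}%")
```

The same function would apply to the Vine and non-Vine groups once their total review counts are known.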
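Based on analysis of the sample selected from the Amazon Shoes reviews, Vine status did not appear to have a meaningful effect on the share of 5-star reviews. The Vine percentage is slightly higher (59% versus 54%), but it rests on only 13 Vine 5-star reviews; by count, far more 5-star reviews come from unpaid reviewers (14,475).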
The same analysis could be repeated on the other Amazon review datasets, using the parameters that were used for the Shoes dataset.