This is a project for the course Massive Processing of Data (CC5212) using Apache Pig and Hadoop.
First, we got our data from Kaggle: https://www.kaggle.com/code/jeanpierrrerio/spotify-dataset-1921-2020-160k-tracks
Next, download the data and preprocess it with pandas DataFrames in Python; see spotify_preprocessing.ipynb.
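As a rough illustration of this kind of preprocessing (this is not the content of spotify_preprocessing.ipynb), the sketch below keeps a subset of columns, drops incomplete and duplicate rows, and writes a tab-separated file without a header, which is a convenient input for Pig's `PigStorage('\t')`. The function names, the column names (`name`, `artists`, `year`), and the choice of a tab-separated output are all assumptions made for the example.

```python
import pandas as pd


def preprocess(df: pd.DataFrame, columns: list, text_columns: list) -> pd.DataFrame:
    """Hypothetical cleaning step: select columns, drop bad rows, tidy text fields."""
    # Keep only the requested columns, then remove incomplete and repeated rows.
    out = df[columns].dropna().drop_duplicates().copy()
    for col in text_columns:
        # Tabs and newlines inside a field would break a tab-delimited file,
        # so collapse them into single spaces.
        out[col] = (
            out[col]
            .astype(str)
            .str.replace(r"[\t\r\n]+", " ", regex=True)
            .str.strip()
        )
    return out.reset_index(drop=True)


def to_tsv(df: pd.DataFrame) -> str:
    """Serialize without header or index, one record per line, tab-delimited."""
    return df.to_csv(sep="\t", header=False, index=False)
```

A usage along the lines of `to_tsv(preprocess(pd.read_csv("data.csv"), ["name", "artists", "year"], ["name", "artists"]))` would produce the text to save and upload; `data.csv` is a placeholder file name.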
Then, using Hadoop, load our samples into HDFS and execute the Pig scripts.
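The upload-and-run step can be sketched as a small Python helper that assembles the usual `hdfs dfs` and `pig` command lines. This is only an illustration under assumptions: the paths, the script name, and the `input` parameter are hypothetical, and the actual scripts in this repository may expect different arguments. By default the helper only prints the commands instead of running them.

```python
import subprocess


def build_commands(local_file, hdfs_dir, pig_script, params=None):
    """Return the command lines to upload a file to HDFS and run a Pig script."""
    commands = [
        # Create the target directory if it does not exist yet.
        ["hdfs", "dfs", "-mkdir", "-p", hdfs_dir],
        # Copy the local sample into HDFS, overwriting any previous copy.
        ["hdfs", "dfs", "-put", "-f", local_file, hdfs_dir],
    ]
    pig = ["pig"]
    for key, value in (params or {}).items():
        # Each -param defines a $key substitution inside the Pig script.
        pig += ["-param", f"{key}={value}"]
    pig.append(pig_script)
    commands.append(pig)
    return commands


def run_all(commands, dry_run=True):
    """Print the commands, or execute them in order when dry_run is False."""
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            # check=True stops at the first command that fails.
            subprocess.run(cmd, check=True)
```

For example, `run_all(build_commands("tracks.tsv", "/user/me/spotify", "analysis.pig", {"input": "/user/me/spotify/tracks.tsv"}))` prints the three commands without touching the cluster; all four argument values here are placeholders.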
Our presentation and conclusions (in Spanish) can be found at: https://docs.google.com/presentation/d/16A-DacDEsXz5IaoW_r4kUtBe8o2rtReC3MCJ0ZO03g0/edit?usp=sharing