1. Discuss the purpose of this database in the context of the startup, Sparkify, and their analytical goals.
This database is about storing music and artist records. Data is extracted from the source, then transformed using Pandas DataFrame, and finally loaded into the database. There are two sets of data used in the ETL process: song and log data. Song data provides song and artist information, while log data is provides covers song, artist and some metadata of each song.
Source data stands into json files
Fact Table:
songplays - records in log data associated with song plays i.e. records with page NextSong
songplay_id, start_time, user_id, level, song_id, artist_id, session_id, location, user_agent
Dimension Tables:
users - users in the app
user_id, first_name, last_name, gender, level
songs - songs in music database
song_id, title, artist_id, year, duration
artists - artists in music database
artist_id, name, location, lattitude, longitude
time - timestamps of records in songplays broken down into specific units
start_time, hour, day, week, month, year, weekday
Execute "create_tables.py".
Execute "etl.py".
test.ipynb displays the first rows of each table to check database.
create_tables.py drops and creates your tables.
etl.ipynb reads and processes a single file from song_data and log_data and loads the data into your tables. Thi
etl.py reads and processes files from song_data and log_data and loads them into your tables.
sql_queries.py contains all sql queries, and is imported into the last three files above.