Coder Social home page Coder Social logo

sparkify-etl-postgres's Introduction

Sparkify ETL

Table of contents

Introduction

A startup called Sparkify wants to analyze the data they've been collecting on songs and user activity on their new music streaming app.

The analytics team is particularly interested in understanding what songs users are listening to.

We can extract the following data and more

  • The most played music in the entire app
  • The top 5 artists in the app according to their songs
  • The most day of the week people listen to songs
  • and more....

The ETL extracts the data from songs and log files into a single database as read-only database which runs with the star schema model for analysis.

The database management system used in the ETL is postgres which is realtional (SQL) database

Tools

pythonnumpypandaspsycopgPostgresql

Usage

  • To run the ETL, you have to create the database and the tables so we have to execute create_tables.py file.

  • Now we can run etl.py to transfer our data from songs and log files into the database tables.

Important: update the database and account credentials in the files to your own credentials

Congrats you executed your ETL successfully

Database Schema Design

Sparkify Schema is a star schema built for analysis and we can update it only with the ETL but users can not edit it.

Star Schema consists of facts and dimensions

Database Schema

Facts

Songplays fact

  • Records in log data associated with song plays
  • Records with page NextSong

Songplays

Members:-

  • songplay_id
  • start_time
  • user_id
  • level
  • song_id
  • artist_id
  • session_id
  • location
  • user_agent

Fact table contains the business event of the analysis which is listening to a song transactions

Fact table contains foreign keys for the dimensions around it.

Dimensions

Dimensions is the business entities around the business event

We have 4 dimensions which is songs, artists, users, time

Time dimension

  • timestamps of records in songplays broken down into specific units

Time

Members:-

  • start_time
  • hour
  • day
  • week
  • month
  • year
  • weekday

The Time dimension helps us to make the data consistant in the schema and remove redunduncies.

Songs dimension

  • Songs in music database

Songs

Members:-

  • song_id
  • title
  • artist_id
  • year
  • duration

Artists dimension

  • Artists in music database

Artists

Members:-

  • artist_id
  • name
  • location
  • latitude
  • longitude

Users diemension

  • Users in the app

Users

Members:-

  • user_id
  • first_name
  • last_name
  • gender
  • level

Sparkify Project Files

sql_queries.py

The file contains all create, drop, insert and select queries which the create_tables.py executes to create the database schema and integrates the data.

create_tables.py

We have 3 functions in this python file and main function which executes the functions

  • create_database()

    • Creates and connects to the sparkifydb
    • Returns the connection and cursor to sparkifydb
  • drop_tables(cur, conn)

    • Drops each table using the queries in drop_table_queries list in sql_queries.py
  • create_tables(cur, conn)

    • Creates each table using the queries in create_table_queries` list in sql_queries.py

etl.py

The file processes all the files and runs the inserts to insert data into the fact and dimensions and it consists of 3 functions

  • process_song_file(cur, filepath)

    • reads the song files from its path
    • inserts the songs data and artists data to songs and artists dimensions
  • process_log_file(cur, filepath)

    • reads the log files data from its path filtred by NextSong page
    • inserts the time records from the log files into time dimension
    • inserts the user records from the log files into users dimensions
    • inserts the transactions which is the songs listening to songplays fact
  • process_data(cur, conn, filepath, func)

    • Get all the files with .json extension which are the songs files and log files

etl.ipynb

This jupyter notebook file used for development to get the data from the first file only and ensure that the data is valid for insertion

test.ipynb

This file contains test queries and sanity tests for checking database schema and it's quality The file used sql magic to execute their queries

data folder

Contains log files in json and songs files and every file of songs files contains one song data

Future work

Create a dashboard for analytic queries on sparkify database in Microsoft Power BI

sparkify-etl-postgres's People

Contributors

amrelauoty avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.