Coder Social home page Coder Social logo

sparkify-postgresql-data-modeling's Introduction

Sparkify Project - Data Modeling with Postgres

1. Summary of the project

1.1 Problem

A startup called Sparkify wants to analyze the data they have been collecting about songs and user activity on their new music streaming app. The company is interested in finding out what songs users are listening to. The analysts don't have a were to query the data that is in a directory of JSON records about user activity in the app, as well as a directory with JSON metadata about the songs in their app.

1.2 Solution

I developed a solution using Python and SQL that collects and organizes client usage data and the application's song collection. I defined a star schema and the fact and dimension tables to receive the data. Finally, I wrote an ETL pipeline that transfers data from JSON files in two local directories to these tables in Postgres using Python and SQL.

1.2.1 Schema

Star Schema

2. Python scripts

The solution has 3 Python files with the following functions:

  • create_tables.py: This code drops and creates the tables. You need run this file to reset the tables before each time you run your ETL scripts.

  • etl.py: reads and processes the files from song_data and log_data and loads them into the tables.

  • sql_queries.py: contains all project sql queries. Its contain is imported into the other scheme Python files.

3. Project files

The project has, in addition to the Phyton files mentioned above, two notebooks and the JSON directories with the original data.

3.1 Notebooks:

  • etl.ipynb: Reads and processes a single file from song_data and log_data and loads the data into the tables. This notebook contains detailed instructions on the ETL process for each of the tables.

  • test.ipynb: displays the first few rows of each table to let you check your database.

  • sample_queries.ipynb: contains sample queries for analysis.

3.2 Data directories

3.2.1 Song Dataset:

The first dataset is a subset of real data from the Million Song Dataset. Each file is in JSON format and contains metadata about a song and the artist of that song. The files are partitioned by the first three letters of each song's track ID.

Song Dataset Song Dataset sample

3.2.2 Log Dataset

The second dataset consists of log files in JSON format generated by this event simulator based on the songs in the dataset above. These simulate activity logs from a music streaming app based on specified configurations.

Log Dataset Log Dataset sample

4. Running project

The correct order to run the project is the bellow:

  1. Run create_tables.py to drop existing tables and create the final tables.
  2. Run test.ipynd to verify if the tables was created correctly.
  3. Run etl.py to read and process files from song_data and log_data and load the data into your tables.
  4. Run test.ipynd again to verify if the data was loaded correctly.

5. Sample queries

Examples of requests made to the database for analysis.

Number of users by account type and gender.

User_gender_account

Top 5 locations with the most played songs.

5_most_players_loc

Top 5 locations with the most music played by paid users.

5_most_players_loc_paid

Top 5 users.

top_5_users

sparkify-postgresql-data-modeling's People

Contributors

egoliveira1 avatar

Stargazers

 avatar  avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.