
Data Modeling and ETL with Spark on AWS S3

Objective

In this project I aimed to extract data from S3, perform ETL with Spark, and load the transformed data back into S3 for Sparkify, a fictional startup and music streaming platform. Sparkify has been collecting user and song activity on its new music streaming app, and its data analysts are particularly interested in knowing which songs users are listening to. As the user base and song database have grown, the company wants to move its data warehouse to a data lake, and that is where my skills as a data engineer were employed. My role in this project was to create a star schema and an ETL pipeline from both the log data and the song data.

Briefly:

  1. Set up a database from the log and song datasets.

Both datasets are in JSON format. The log dataset is generated with Eventsim, a program that simulates activity logs from a music streaming app for testing and demo purposes. The song dataset comes from the Million Song Dataset, a freely available collection of audio features and metadata for a million contemporary popular music tracks. A short Spark snippet after this list shows how the raw files can be inspected.

  2. Develop an ETL pipeline that pulls and processes data from the log and song datasets, writes the processed data to parquet format, and loads it to AWS S3.
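
Because both datasets are plain JSON, Spark can infer their schemas directly. The following is a minimal sketch for inspecting the raw files; the bucket name and the song_data/log_data path layout are assumptions for illustration, not copied from the project code:

    # Minimal sketch: inspect the raw Sparkify datasets with Spark.
    # Bucket name and path layout are assumptions, not project code.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("sparkify-explore").getOrCreate()

    # Song metadata files (Million Song Dataset subset)
    song_df = spark.read.json("s3a://udacity-dend/song_data/*/*/*/*.json")
    song_df.printSchema()

    # Eventsim activity logs (one JSON object per line)
    log_df = spark.read.json("s3a://udacity-dend/log_data/*/*/*.json")
    log_df.printSchema()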

Method

To attain this objective, I performed the following steps:

  1. Set up an AWS IAM role that grants the necessary access to extract, read and load data on AWS S3.

  2. With the aid of my Python script, accomplish the following (a condensed PySpark sketch follows this list):
     i. develop a pipeline that connects to Sparkify's AWS S3 bucket and extracts the data from S3
     ii. parse the JSON-formatted files into comprehensive dataframes using Spark
     iii. most importantly, prepare a star schema comprised of:
        a. one fact table: songplays
        b. four dimension tables: users, songs, artists, and time
     iv. write the aforementioned dataframes to an S3 bucket in parquet format.
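
Below is a condensed sketch of what such a pipeline can look like in PySpark. The fact and dimension tables follow the star schema described above; the bucket paths, app name and exact column choices are illustrative assumptions rather than the project's actual code:

    # Sketch of the Spark ETL described above. S3 paths are placeholders.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("sparkify-etl").getOrCreate()

    input_data = "s3a://udacity-dend/"        # assumed source bucket
    output_data = "s3a://my-sparkify-lake/"   # hypothetical target bucket

    # --- song data -> songs and artists dimension tables ---
    song_df = spark.read.json(input_data + "song_data/*/*/*/*.json")

    songs = song_df.select("song_id", "title", "artist_id", "year", "duration") \
                   .dropDuplicates(["song_id"])
    songs.write.mode("overwrite").partitionBy("year", "artist_id") \
         .parquet(output_data + "songs/")

    artists = song_df.select(
        "artist_id",
        F.col("artist_name").alias("name"),
        F.col("artist_location").alias("location"),
        F.col("artist_latitude").alias("latitude"),
        F.col("artist_longitude").alias("longitude"),
    ).dropDuplicates(["artist_id"])
    artists.write.mode("overwrite").parquet(output_data + "artists/")

    # --- log data -> users, time dimensions and songplays fact table ---
    log_df = spark.read.json(input_data + "log_data/*/*/*.json") \
                  .filter(F.col("page") == "NextSong")

    users = log_df.select(
        F.col("userId").alias("user_id"),
        F.col("firstName").alias("first_name"),
        F.col("lastName").alias("last_name"),
        "gender",
        "level",
    ).dropDuplicates(["user_id"])
    users.write.mode("overwrite").parquet(output_data + "users/")

    # Eventsim timestamps are epoch milliseconds; convert to a timestamp column.
    log_df = log_df.withColumn("start_time", (F.col("ts") / 1000).cast("timestamp"))

    time = log_df.select("start_time").dropDuplicates().select(
        "start_time",
        F.hour("start_time").alias("hour"),
        F.dayofmonth("start_time").alias("day"),
        F.weekofyear("start_time").alias("week"),
        F.month("start_time").alias("month"),
        F.year("start_time").alias("year"),
        F.dayofweek("start_time").alias("weekday"),
    )
    time.write.mode("overwrite").partitionBy("year", "month") \
        .parquet(output_data + "time/")

    songplays = (
        log_df.join(
            song_df,
            (log_df.song == song_df.title) & (log_df.artist == song_df.artist_name),
            "left",
        )
        .select(
            F.monotonically_increasing_id().alias("songplay_id"),
            "start_time",
            F.col("userId").alias("user_id"),
            "level",
            "song_id",
            "artist_id",
            F.col("sessionId").alias("session_id"),
            "location",
            F.col("userAgent").alias("user_agent"),
        )
        .withColumn("year", F.year("start_time"))
        .withColumn("month", F.month("start_time"))
    )
    songplays.write.mode("overwrite").partitionBy("year", "month") \
             .parquet(output_data + "songplays/")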

Result

This project yielded the following:

  1. dl.cfg contains the necessary IAM role configuration (see the sketch after this list).
  2. etl.py extracts the datasets from AWS S3, processes them using Spark, and loads the data back into S3 as a set of dimensional tables.
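
dl.cfg is an INI-style file read with Python's configparser. A minimal sketch of what it might contain and how etl.py could load it follows; the section and key names are assumptions, and real credentials should never be committed:

    # dl.cfg (values redacted; section/key names assumed):
    # [AWS]
    # AWS_ACCESS_KEY_ID=...
    # AWS_SECRET_ACCESS_KEY=...

    import configparser
    import os

    config = configparser.ConfigParser()
    config.read("dl.cfg")

    # Expose the credentials to Spark/Hadoop through the environment.
    os.environ["AWS_ACCESS_KEY_ID"] = config["AWS"]["AWS_ACCESS_KEY_ID"]
    os.environ["AWS_SECRET_ACCESS_KEY"] = config["AWS"]["AWS_SECRET_ACCESS_KEY"]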

Installation and usage

In a terminal, run etl.py as indicated below:

    python etl.py
