imsanjoykb / etl-project

The goal of this project is to illustrate Extract, Transform, Load (ETL) using Python and SQL. ETL is a common data-processing workflow that takes raw data, cleans it, and stores it for later use. The extract phase targets and retrieves the data. The transform phase manipulates and cleans it. The load phase stores it, typically in a data warehouse.

Home Page: https://imsanjoykb.github.io/

License: MIT License

HTML 0.01% Jupyter Notebook 100.00%
etl datawarehouse datalake etl-pipeline data-engineering database etl-automation etl-solutions

etl-project's Introduction

Covid-19 Infection ETL Project

By Sanjoy Biswas

Project Proposal

Based on the data compiled by Johns Hopkins University, I want to explore ''' Insert reasons here'''. This will be done by extracting the CSV data and migrating it to a PostgreSQL database.

Project Description

I found data on data.humdata.org that had been compiled by Johns Hopkins University. I filtered the data for March 2020 and evaluated the number of confirmed cases, deaths, and recoveries.

Finding Data

All of the data I used came from https://data.humdata.org/dataset/novel-coronavirus-2019-ncov-cases, where it was compiled by the Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE) from various sources, including the World Health Organization, the Hong Kong Department of Health, and the European Centre for Disease Prevention and Control. The data is updated continually, so I narrowed the scope to March 2020.

Data Cleanup and Analysis

  • TRANSFORMATION STEPS I needed to clean the data so that it was readable, presentable, and easy to query in the later stages. This was done by:

    • Developing a cleaning function in Python that selects the data for March 2020. This was applied to every dataset.
    • Unpivoting the date columns of each dataset into rows, so that each date becomes a value, with the pd.melt function in Pandas.
    • Computing the daily increase for each table. This value was converted from a float to an integer.
  • LOADING STEPS

    • I established a connection to a local PostgreSQL server on my desktop to store the data
    • I have a schema that creates the tables, which can be confirmed through engine.table_names()
    • I pushed the Pandas DataFrames to the local PostgreSQL server so I can retrieve and query the data in my Jupyter Notebook
  • Analysis / SQL Queries In this part, I want to find:

    • Top 5 countries with the most/least confirmed cases
    • Top 5 countries with the most/least deaths
    • Top 5 countries with the most/least recovered
    • Date in March 2020 with the most confirmed cases, deaths, and recoveries
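
The transformation, loading, and query steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the notebook's actual code: the sample rows are made up, the `Country/Region` column name and the one-column-per-date layout with cumulative counts are assumptions about the CSV, the function name `clean_march_2020` and the table name `confirmed` are hypothetical, and an in-memory SQLite database stands in for the local PostgreSQL server so the example runs without a database install.

```python
import sqlite3

import pandas as pd

# Made-up sample in the assumed wide layout: one column per date holding
# cumulative counts. Country names and numbers are illustrative only.
wide = pd.DataFrame({
    "Country/Region": ["Aland", "Borduria"],
    "2/29/20": [10, 0],
    "3/1/20": [12, 1],
    "3/2/20": [15, 4],
    "4/1/20": [90, 40],
})


def clean_march_2020(df: pd.DataFrame, value_name: str) -> pd.DataFrame:
    """Unpivot date columns into rows, add a daily increase, keep March 2020."""
    df = df.rename(columns={"Country/Region": "country"})
    long = df.melt(id_vars=["country"], var_name="date", value_name=value_name)
    long["date"] = pd.to_datetime(long["date"], format="%m/%d/%y")
    long = long.sort_values(["country", "date"]).reset_index(drop=True)
    # diff() returns floats (NaN on each country's first row), so fill and cast.
    long["daily_increase"] = (
        long.groupby("country")[value_name].diff().fillna(0).astype(int)
    )
    # Filter last so the increase on 1 March can use the value from 29 February.
    in_march = long["date"].between(
        pd.Timestamp("2020-03-01"), pd.Timestamp("2020-03-31")
    )
    return long[in_march].reset_index(drop=True)


march = clean_march_2020(wide, "confirmed")

# Load step: SQLite in memory is a stand-in for the PostgreSQL server.
conn = sqlite3.connect(":memory:")
march.assign(date=march["date"].dt.strftime("%Y-%m-%d")).to_sql(
    "confirmed", conn, index=False
)

# Counts are cumulative, so a country's March total is its largest value.
top = pd.read_sql_query(
    "SELECT country, MAX(confirmed) AS total FROM confirmed "
    "GROUP BY country ORDER BY total DESC LIMIT 5",
    conn,
)

# Date with the largest number of new cases summed over all countries.
peak = pd.read_sql_query(
    "SELECT date, SUM(daily_increase) AS new_cases FROM confirmed "
    "GROUP BY date ORDER BY new_cases DESC LIMIT 1",
    conn,
)
```

Against PostgreSQL, the same `to_sql` and `read_sql_query` calls would take a SQLAlchemy engine in place of the SQLite connection. Switching `DESC` to `ASC` gives the "least" variants of the queries.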
