
minhky2185 / healthcare_data_pipeline


An end-to-end data pipeline for building a data lake and supporting reports using Apache Spark.

Python 100.00%
analytics big-data data data-engineering data-engineering-pipeline data-lake emr-cluster mysql postgresql powerbi


Healthcare Data Pipeline

This project builds a single source of truth for large healthcare datasets using Apache Spark and Amazon S3. It also includes several dashboards for visualization.

Tech Stack

  • Data lake: Amazon S3

  • Data source: PostgreSQL

  • Data read storage: MySQL on Amazon RDS

  • Processing layer: Apache Spark on EMR

  • Visualization: Power BI

Architecture

The architecture of this project is presented as follows:

architecture_2

  • Data is sourced from PostgreSQL and ingested into the raw zone of the data lake hosted on S3.
  • Raw data is cleansed and standardized before moving to the cleansed zone.
  • Cleansed data is transformed into a reportable form and loaded into the curated zone.
  • Data from the curated zone is published to the data read storage (MySQL), which gives better report performance when a BI tool connects to it.
  • Reports are created in Power BI from the data in MySQL.
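
The zone flow above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the bucket name, the zone prefix names, and the helpers `zone_path`, `next_zone`, and `promote` are all hypothetical, and Parquet is assumed as the storage format.

```python
# Hypothetical bucket and zone names; the real project layout may differ.
BUCKET = "healthcare-data-lake"
ZONES = ("raw", "cleansed", "curated")


def zone_path(zone: str, table: str, bucket: str = BUCKET) -> str:
    """Build the S3 location of a table inside one data lake zone."""
    if zone not in ZONES:
        raise ValueError(f"unknown zone: {zone!r}")
    return f"s3://{bucket}/{zone}/{table}/"


def next_zone(zone: str) -> str:
    """Return the zone that data moves to after the given one."""
    idx = ZONES.index(zone)  # raises ValueError for an unknown zone
    if idx == len(ZONES) - 1:
        raise ValueError("curated is the last zone; its data is published to MySQL")
    return ZONES[idx + 1]


def promote(spark, table: str, zone: str, transform) -> str:
    """Read a table from one zone, transform it, and write it to the next zone.

    `spark` is an existing SparkSession; `transform` is any function that
    takes and returns a DataFrame (cleansing, aggregation, ...).
    """
    target = zone_path(next_zone(zone), table)
    df = spark.read.parquet(zone_path(zone, table))
    transform(df).write.mode("overwrite").parquet(target)
    return target
```

Under these assumptions, `promote(spark, "prescriber", "raw", cleanse)` would read the raw copy of the table and overwrite its cleansed copy, where `cleanse` stands for whatever cleansing function the pipeline defines.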

Data Source

  • The raw data comes from CMS (Centers for Medicare & Medicaid Services); the dataset used is Medicare Part D.
  • The data source in PostgreSQL has 4 tables with a total size of around 10 GB:
    • Prescriber_drug: ~25M rows
    • Prescriber: ~1.1M rows
    • Drug: ~115K rows
    • State: ~30K rows
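
For a table the size of Prescriber_drug, a single-connection JDBC read is usually slow, so Spark's JDBC source is often given a partition column and bounds so that it reads in parallel. The sketch below only illustrates that idea: the host, database name, partition column, and bounds are assumptions, not values taken from the project.

```python
def jdbc_read_options(table, partition_column, lower, upper, num_partitions=16,
                      host="localhost", port=5432, database="healthcare"):
    """Build options for a parallel Spark JDBC read from PostgreSQL.

    `partition_column` must be a numeric, date, or timestamp column.
    Spark splits the range [lower, upper] into `num_partitions` strides
    and issues one query per stride. Host and database are placeholders.
    """
    return {
        "url": f"jdbc:postgresql://{host}:{port}/{database}",
        "driver": "org.postgresql.Driver",
        "dbtable": table,
        "partitionColumn": partition_column,
        "lowerBound": str(lower),
        "upperBound": str(upper),
        "numPartitions": str(num_partitions),
        "fetchsize": "10000",  # rows fetched per round trip
    }


def read_table(spark, options, user, password):
    """Load a table through JDBC using an existing SparkSession."""
    return (
        spark.read.format("jdbc")
        .options(**options)
        .option("user", user)
        .option("password", password)
        .load()
    )
```

Note that the bounds only decide how the range is split; rows outside them are still read, by the first and last partitions.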

Visualization

Some dashboards created from the data in the data read storage:

  • Drug report

drug_report

  • Prescriber report

prescriber_report

Learning Achievements

Apache Spark

  • Components of Spark and how Spark works.
  • How to adjust resources (RAM, CPU, instances, ...) to optimize Spark performance and cost.
  • Tuning Spark applications by using partitioning.
  • Using Spark to implement a full data pipeline.
  • Fundamentals of writing correct Spark code.
  • Managing JAR files for JDBC connections.
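
As a rough illustration of the partition tuning and JAR management points above, the sketch below derives a partition count from the dataset size and collects the related Spark settings. The 128 MB target partition size is a common rule of thumb rather than a value from this project, and the JAR paths in the usage note are hypothetical.

```python
import math


def estimate_partitions(dataset_bytes, target_partition_bytes=128 * 1024**2):
    """Estimate a partition count so each partition is near the target size."""
    return max(1, math.ceil(dataset_bytes / target_partition_bytes))


def spark_conf(dataset_bytes, jdbc_jars):
    """Collect Spark settings for shuffle partitioning and JDBC driver JARs."""
    return {
        "spark.sql.shuffle.partitions": str(estimate_partitions(dataset_bytes)),
        # spark.jars takes a comma-separated list of JAR paths.
        "spark.jars": ",".join(jdbc_jars),
    }
```

For the roughly 10 GB source, `estimate_partitions(10 * 1024**3)` gives 80. Each key/value pair from `spark_conf(10 * 1024**3, ["/opt/jars/postgresql.jar", "/opt/jars/mysql-connector-j.jar"])` can be passed to `SparkSession.builder.config(key, value)` or to `spark-submit --conf`.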

Project setup

  • Implementing logging to a log file to track the Spark application.
  • Testing the project in local mode before running it on a cluster.
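
A minimal sketch of both points, using only Python's standard `logging` module. The helper names `get_logger` and `spark_master`, the log format, and the `"local"` environment label are assumptions for illustration, not the project's actual setup.

```python
import logging


def get_logger(name, log_file, level=logging.INFO):
    """Return a logger that writes to both a log file and the console."""
    logger = logging.getLogger(name)
    logger.setLevel(level)
    if not logger.handlers:  # avoid duplicate handlers on repeated calls
        fmt = logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
        for handler in (logging.FileHandler(log_file), logging.StreamHandler()):
            handler.setFormatter(fmt)
            logger.addHandler(handler)
    return logger


def spark_master(env):
    """Pick the Spark master: all local cores for testing, YARN on a cluster."""
    return "local[*]" if env == "local" else "yarn"
```

The returned master string can be passed to `SparkSession.builder.master(...)`, so the same application code runs locally during testing and on YARN when deployed to EMR.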

AWS

  • Set up EMR for Spark
  • Track the resource utilization in EMR
