
DS4A Data Engineering Capstone: ETL Pipeline and Data Warehouse Deployment

ETL pipelines and data warehouse construction and orchestration for the DS4A Data Engineering capstone project.

Overview

For this section of the capstone project, I used the following tools to build our data pipelines and set up our data warehouse:

  • Python 3
  • Prefect
  • AWS S3
  • AWS CloudFormation Stacks
  • Databricks Workspace + Data Warehouse
  • Dask (on the Saturn Cloud Data Science platform)

Pipeline Orchestration Process

Prerequisites

  • Prefect account
  • Saturn Cloud Account
  • AWS Account
  • Databricks Account

Environment Setup

The code in this project uses Prefect for orchestration (figuring out what to do, and in what order) and a Dask cluster for execution (doing the things).

While Prefect is our orchestration tool of choice and our Dask cluster has been configured to execute the orchestrated tasks, we still have to set up our local environment to run the scripts that define those tasks, as shown below.

Create a new conda virtual environment:

$ conda create -n data_pipeline_env python=3.9

And activate it:

$ conda activate data_pipeline_env

Then, install the project dependencies:

$ pip install -r requirements.txt

Next, set the environment variable that holds the "read-safe" credential for our Saturn Cloud account (replace the placeholder value with your own username):

$ export SATURN_USERNAME='SATURN_USERNAME'

Finally, we authenticate our Prefect account with our API key secret:

$ prefect auth login --key PREFECT_API_KEY

Prefect Cloud Project Management

Since Prefect organizes flows into projects, I created a project called "data-pipeline-warehouse-t23" within my account and initialized the Prefect client within the Saturn setup script in preparation for the upcoming tasks:

import prefect
from prefect_saturn import PrefectCloudIntegration

PREFECT_CLOUD_PROJECT_NAME = "data-pipeline-warehouse-t23"
client = prefect.Client()
client.create_project(project_name=PREFECT_CLOUD_PROJECT_NAME)
integration = PrefectCloudIntegration(prefect_cloud_project_name=PREFECT_CLOUD_PROJECT_NAME)

Data Pipeline Tasks (Per Dataset)

Prefect organizes groups of tasks into "flows" (workflows), which are defined programmatically as Python functions. The following is a breakdown of each task we defined for the scripts below, covering the extraction, transformation, and loading of the data running from the source APIs through our pipeline (a minimal sketch of such a flow follows this list):

  • extract: Retrieving the data needed from the respective API
  • transform: Transforming the returned response from the API into a dataframe that matches our warehouse table schemas
  • load: Loading the data into CSVs locally, and into S3 buckets that are synchronized to our Databricks Workspace + Warehouse Cluster
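
As a concrete illustration, here is a minimal sketch of what one of these flows looks like with the Prefect 1.x API. The task bodies, the API URL, the flow name, and the output path are placeholders for illustration only, not the project's actual code:

import pandas as pd
import requests
from prefect import Flow, task

TREASURY_API_URL = "https://example.com/api/treasury"  # placeholder, not the real endpoint

@task
def extract():
    # extract: pull raw records from the source API
    response = requests.get(TREASURY_API_URL)
    response.raise_for_status()
    return response.json()

@task
def transform(records):
    # transform: reshape the API response to match the warehouse table schema
    return pd.DataFrame(records)

@task
def load(df):
    # load: write a local CSV; the real task also pushes the file to S3
    df.to_csv("data/treasury-data/treasury.csv", index=False)

with Flow("treasury-financial-flow") as flow:
    load(transform(extract()))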

The integration sections that are commented out instantiate the flows in the Saturn <> Prefect integration, and then hand the scheduled runs off to Prefect Cloud for execution under our project.
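
For reference, a sketch of what those commented-out sections plausibly look like, continuing from the flow object above and based on the prefect-saturn and Prefect 1.x APIs; the daily schedule is an illustrative assumption:

from datetime import timedelta

from prefect.schedules import IntervalSchedule
from prefect_saturn import PrefectCloudIntegration

PREFECT_CLOUD_PROJECT_NAME = "data-pipeline-warehouse-t23"

# Hand the flow to the Saturn Cloud <> Prefect Cloud integration so it runs on
# the Dask cluster, then register it with the Prefect Cloud project.
flow.schedule = IntervalSchedule(interval=timedelta(hours=24))  # illustrative schedule
integration = PrefectCloudIntegration(prefect_cloud_project_name=PREFECT_CLOUD_PROJECT_NAME)
flow = integration.register_flow_with_saturn(flow=flow)
flow.register(project_name=PREFECT_CLOUD_PROJECT_NAME)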

(Optional) Local Directory Setup for CSV Generation

For the sake of demonstration, you can also create local directories to house the resulting CSV files:

$ mkdir -p data/treasury-data
$ mkdir -p data/natl-poverty-data
$ mkdir -p data/small-area-poverty-data
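
Once the load tasks write CSVs into these directories, they also push the files to the warehouse-synced S3 bucket. A minimal boto3 sketch of that step is shown below; the bucket name and key are hypothetical placeholders, not the project's actual configuration:

import boto3

# Hypothetical bucket name; the real bucket is the one linked to the
# CloudFormation stack backing the Databricks workspace.
S3_BUCKET = "ds4a-capstone-warehouse-data"

s3 = boto3.client("s3")
s3.upload_file(
    Filename="data/treasury-data/treasury.csv",
    Bucket=S3_BUCKET,
    Key="treasury-data/treasury.csv",
)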

Review

Prefect Flows

Output

When the Prefect flows are successfully created and scheduled in the cloud (via the commented-out code for each script above), the output will look like this:

Treasury Dataset

Treasury Financial Flow

National Poverty Dataset

Census National Poverty Flow

Small Area Poverty Dataset

Census Small Area Poverty Flow

In the Project Dashboard (Cloud)

In the Prefect Cloud dashboard, the flows will appear like so:

Prefect Dashboard Flows

AWS CloudFormation -> Databricks Workspace + Warehouse

Our S3 bucket is linked to our CloudFormation stack, which houses our Databricks Workspace and Warehouse:

AWS CloudFormation

Databricks Workspace + Warehouse

Databricks

Table in Databricks Warehouse

Warehouse Table

Databricks Workspace SQL Dashboard

From the Databricks Workspace, every table in the warehouse is easily accessible via SQL from a single unified dashboard:

Warehouse SQL Dash
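
As an illustration of how the S3-synced CSVs can surface as warehouse tables, here is a minimal PySpark sketch for a Databricks notebook; the S3 path and table name are hypothetical placeholders:

# Runs inside a Databricks notebook, where the `spark` session is provided.
df = (
    spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv("s3a://ds4a-capstone-warehouse-data/treasury-data/")  # hypothetical path
)
df.write.mode("overwrite").saveAsTable("treasury_financial_data")  # hypothetical table name

# The table is then queryable from the SQL dashboard, e.g.:
spark.sql("SELECT * FROM treasury_financial_data LIMIT 10").show()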

