
This project was forked from chandlergregg/stock-data-pipeline.


stock-data-pipeline

Overview

This project is a prototype of a data pipeline that consumes raw stock market data, processes the data using Python and Spark, outputs the data into Azure file storage, and tracks the process using Postgres. While the volume of data consumed in the project is quite low, the pipeline is designed to scale very quickly due to its cloud-based architecture and the use of Spark.

This project is part of the Springboard Data Engineering program curriculum.

Pipeline steps

Input format: a mixture of CSV and JSON event files that log various stock data throughout the trading day.

  1. Ingests data into Spark after parsing the CSV / JSON files with a Python function passed to PySpark
  2. Preprocesses the data by deduplicating it and writing it to Spark SQL tables
  3. Runs analytic queries on the data and outputs the results

Output format: Parquet file / table containing analytical data
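The parsing step above hinges on a Python function that can handle a line of either format. As a rough sketch (the field names and column order here are hypothetical, not the project's actual schema), such a function might look like:

```python
import csv
import io
import json

def parse_record(line: str) -> dict:
    """Parse one raw event line, which may be JSON or CSV.

    The field layout below is a placeholder -- the real pipeline
    defines its own schema for stock events.
    """
    line = line.strip()
    if line.startswith("{"):
        # JSON event: parse directly into a dict
        return json.loads(line)
    # Otherwise treat as CSV with assumed columns: symbol, price, volume
    row = next(csv.reader(io.StringIO(line)))
    return {"symbol": row[0], "price": float(row[1]), "volume": int(row[2])}

# In the pipeline, a function like this would be mapped over the raw text,
# e.g.:
#   parsed = spark.sparkContext.textFile(path).map(parse_record)
#   df = spark.createDataFrame(parsed).dropDuplicates()
#   df.write.parquet(out_path)
```

Deduplication then falls out naturally from `dropDuplicates()` on the resulting DataFrame before writing Parquet.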

Architecture

All pipeline coordination is done in Python. Python runs Spark jobs using Databricks Connect - future iterations could package the Python code to run directly on the Spark cluster and use a separate orchestration technology (e.g. Airflow) to manage pipeline runs.

The Python coordinator writes the status of pipeline steps and runs to Postgres for tracking.
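The tracking side can be sketched as a thin wrapper over a DB-API connection. This is a minimal illustration, not the project's tracker.py: the table and column names are assumptions (the real schema lives in postgres_queries.sql), and the demo below uses sqlite3 in place of psycopg2 so it runs without a Postgres server.

```python
import sqlite3  # stand-in for the demo; the project uses psycopg2 against Postgres

class StatusTracker:
    """Minimal sketch of a pipeline status tracker.

    Table and column names are hypothetical, not taken from the
    project's postgres_queries.sql.
    """

    def __init__(self, conn, placeholder="%s"):
        self.conn = conn
        self.ph = placeholder  # "%s" for psycopg2, "?" for sqlite3

    def record(self, job_id: str, step: str, status: str) -> None:
        # Insert one status row and commit immediately
        sql = (
            f"INSERT INTO pipeline_status (job_id, step, status) "
            f"VALUES ({self.ph}, {self.ph}, {self.ph})"
        )
        cur = self.conn.cursor()
        cur.execute(sql, (job_id, step, status))
        self.conn.commit()
```

Because both psycopg2 and sqlite3 follow the DB-API, the same wrapper works against either backend with only the parameter placeholder swapped.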

The coordinator doesn't interact with Azure storage - Spark reads and writes data directly from object storage, as it would in a Spark-Hadoop or Spark-data lake architecture.
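As a rough illustration of Spark talking to Azure storage directly, the configuration looks something like the fragment below. The account, container, and key values are placeholders, and the exact Hadoop-Azure settings depend on the storage setup (this is a sketch, not the project's actual configuration):

```python
# Hypothetical ABFS configuration -- all <...> names are placeholders
spark.conf.set(
    "fs.azure.account.key.<storage-account>.dfs.core.windows.net",
    "<access-key>",
)

raw = spark.read.text("abfss://<container>@<storage-account>.dfs.core.windows.net/input/")
# ... process ...
# result.write.parquet("abfss://<container>@<storage-account>.dfs.core.windows.net/output/")
```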

Components:

  • Python coordinator (run locally)
  • Spark cluster, accessed via Databricks Connect
  • Azure object storage for input and output data
  • Postgres database for pipeline run tracking

System requirements

Reproducing the project locally requires the cloud-hosted components mentioned above in "Architecture" as well as the following:

  • Requirement / dependency / package management:
    • Pipenv (can be installed with pip): handles all Python packages and dependencies for project
    • Postgres (required by psycopg2, the Python-Postgres driver package; psycopg2 can be installed without a local Postgres installation, but installing Postgres first is easier)
    • Databricks Connect
  • Handled by Pipenv:
    • Python 3.10
    • See Pipfile for detailed list of Python packages that are installed by Pipenv
  • Cloud configuration:
    • Config file: needs to be named config.cfg - see example_config.txt for an example of how this configuration file should look, with fields to be filled in according to the above components
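Since config.cfg is in INI-style format, it can be loaded with the standard library's configparser. The snippet below is a minimal sketch; the section and key names used in the test are placeholders, and the actual field names are documented in example_config.txt:

```python
import configparser

def load_config(path: str = "config.cfg") -> configparser.ConfigParser:
    """Load the pipeline config file, failing loudly if it is missing."""
    cfg = configparser.ConfigParser()
    # ConfigParser.read returns the list of files successfully read
    if not cfg.read(path):
        raise FileNotFoundError(f"missing config file: {path}")
    return cfg
```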

To set up the Python environment for running the code, run the following commands:

pipenv install
pipenv shell

To set up the Postgres tracker, use postgres_queries.sql (in pyspark_etl_pipeline) to initialize the database and table referenced by tracker.py.

Project structure

The project is contained in the pyspark_etl_pipeline folder. Other folders in the project contain Python files / iPython notebooks used to develop the PySpark code.

Pipfile and Pipfile.lock are included in the root directory.

Make sure to create a config.cfg file (using example_config.txt as template) to connect to Spark, Postgres, and Azure object storage. (The actual config file is not included as it contains identifiers and passwords for the project Azure infrastructure.)

Contributors

chandlergregg
