Building-data-pipeline-with-reddit_API

A data pipeline to extract Reddit data from the subreddit r/dataengineering.

Motivation

This is a personal project based on an interest in Data Engineering and the types of posts found on the official subreddit, https://www.reddit.com/r/dataengineering/. It also provided a good opportunity to develop skills and experience with a range of tools. As such, the project is more complex than strictly required, utilising Terraform, Airflow, Docker and cloud-based technologies such as Redshift and S3.

Technologies Used

AWS S3 (https://aws.amazon.com/s3/)

AWS Redshift (https://aws.amazon.com/redshift/)

AWS IAM (https://aws.amazon.com/iam/)

Terraform (https://www.terraform.io/)

Docker (https://www.docker.com/)

Airflow (https://airflow.apache.org/)

Summary of Project

Essentially there is one pipeline which extracts Reddit data from its API using Python's PRAW API wrapper. It is set up to extract data from the past 24 hours and store it in a CSV with fields such as post ID, author name, etc. This CSV is then loaded directly into an AWS S3 bucket (cloud storage), before being copied to AWS Redshift (cloud data warehouse). This entire process is orchestrated with Apache Airflow running in Docker, which saves us from having to manually set up Airflow. We use Terraform (Infrastructure as Code) to build the AWS services. Two further components make up this project that are not controlled with Airflow. First, we use dbt to connect to our data warehouse and transform the data; we're only really using dbt to gain some familiarity with it and build our skills. Second, we connect a BI tool to our warehouse and create some visualisations.

Architecture

(Architecture diagram)

Docker & Airflow

The pipeline is scheduled daily. Each day, we extract the top Reddit posts from r/DataEngineering. Because LIMIT is set to None in the Reddit extraction script, it should in theory return all posts from the past 24 hours.
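For context, a minimal sketch of what that extraction call might look like with PRAW is below; the credentials and names are placeholders, not taken from the repo. Passing limit=None asks PRAW to keep paginating rather than stopping at a fixed number of posts.

```python
# Hypothetical sketch of the top-posts request (assumes PRAW 7); credentials are placeholders.
import praw

reddit = praw.Reddit(
    client_id="YOUR_CLIENT_ID",
    client_secret="YOUR_CLIENT_SECRET",
    user_agent="reddit-pipeline",
)

LIMIT = None  # None -> PRAW pages through everything Reddit returns for the day
posts = reddit.subreddit("dataengineering").top(time_filter="day", limit=LIMIT)
for post in posts:
    print(post.id, post.title, post.num_comments)
```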

We navigate to http://localhost:8080 to access the Airflow web interface. This runs within one of the Docker containers, whose port is mapped to our local machine; if nothing shows up, give it a few more minutes. The username and password are both airflow. The DAG etl_reddit_pipeline is set to start running automatically once the containers are created (this option is set within the docker-compose file), so it may have already finished by the time you log in. The next DAG run will be at midnight. If you click on the DAG and look under the Tree view, all boxes should be dark green if the DAG run was successful.

If you check in the airflow/dags folder, you'll find a file titled elt_pipeline.py. This is our DAG which you saw in Airflow's UI.

It's a very simple DAG. All it's doing is running three tasks, one after the other. This DAG runs every day at midnight, and also runs once as soon as you create the Docker containers. The tasks use the BashOperator, meaning that they run a bash command. Here, each bash command calls an external Python script (these Python scripts are also available within our Docker container through the use of volumes). You'll find them under the extraction folder.
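As a rough illustration (not a copy of elt_pipeline.py), a three-task BashOperator DAG of this shape could look like the following; the task ids match the sections below, but the script paths are assumptions.

```python
# Illustrative Airflow 2 DAG sketch; script paths are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="etl_reddit_pipeline",
    schedule_interval="@daily",        # runs at midnight each day
    start_date=datetime(2022, 1, 1),
    catchup=False,
) as dag:
    extract_reddit_data_task = BashOperator(
        task_id="extract_reddit_data_task",
        bash_command="python /opt/airflow/extraction/extract_reddit_data.py",
    )
    upload_to_s3 = BashOperator(
        task_id="upload_to_s3",
        bash_command="python /opt/airflow/extraction/upload_to_s3.py",
    )
    copy_to_redshift = BashOperator(
        task_id="copy_to_redshift",
        bash_command="python /opt/airflow/extraction/copy_to_redshift.py",
    )

    # Run the three tasks strictly one after the other
    extract_reddit_data_task >> upload_to_s3 >> copy_to_redshift
```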

(Screenshot of the DAG in the Airflow UI)

How it works

extract_reddit_data_task

This is extracting Reddit data. Specifically, it's taking the top posts of the day from r/DataEngineering and collecting a few different attributes, like the number of comments. It's then saving this to a CSV within the /tmp folder.
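A simplified sketch of this step is shown below; the exact field list is an assumption based on the attributes mentioned in this README (post ID, author, number of comments), and the real script may collect more.

```python
# Hypothetical extraction helper: collect a few attributes per post and save them to /tmp.
import csv
from datetime import datetime, timezone

FIELDS = ["id", "title", "author", "num_comments", "score", "created_utc"]

def save_posts_to_csv(posts, path="/tmp/reddit_dataengineering.csv"):
    """Write one row per Reddit submission to a CSV file."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        for post in posts:
            writer.writerow({
                "id": post.id,
                "title": post.title,
                "author": str(post.author),
                "num_comments": post.num_comments,
                "score": post.score,
                "created_utc": datetime.fromtimestamp(post.created_utc, tz=timezone.utc),
            })
```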

upload_to_s3

This is uploading the newly created CSV to AWS S3 for storage within the bucket Terraform created.
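A minimal sketch of this step with boto3 is below; the bucket name and object key are placeholders, not the values used by the project.

```python
# Hypothetical S3 upload step; bucket and key are placeholders.
import boto3

def upload_to_s3(local_path="/tmp/reddit_dataengineering.csv",
                 bucket="reddit-pipeline-bucket",
                 key="reddit_dataengineering.csv"):
    """Upload the extracted CSV to the S3 bucket created by Terraform."""
    s3 = boto3.client("s3")
    s3.upload_file(local_path, bucket, key)
```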

copy_to_redshift

This is creating a table in Redshift if it doesn't already exist. It then uses the COPY command to copy data from the newly uploaded CSV file in S3 into Redshift. This is designed to avoid duplicate data based on post id: if the same post id appears in a later DAG run's load, the warehouse will be updated with that record. See the AWS documentation for information on the COPY command.
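One way this kind of load can be done is sketched below, using a staging-table pattern with psycopg2; the table layout, connection details and IAM role are placeholders, and the real script may differ.

```python
# Hypothetical load script; connection details, table layout and the IAM role are
# placeholders. It stages the new file, deletes matching post ids from the target
# table, then inserts, so a later load updates a record rather than duplicating it.
import psycopg2

CREATE_SQL = """
CREATE TABLE IF NOT EXISTS reddit (
    id           VARCHAR PRIMARY KEY,
    title        VARCHAR(400),
    author       VARCHAR,
    num_comments INT,
    score        INT,
    created_utc  TIMESTAMP
);
"""

LOAD_SQL = """
CREATE TEMP TABLE reddit_staging (LIKE reddit);

COPY reddit_staging
FROM 's3://reddit-pipeline-bucket/reddit_dataengineering.csv'
IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-s3-read'
CSV IGNOREHEADER 1;

DELETE FROM reddit USING reddit_staging WHERE reddit.id = reddit_staging.id;
INSERT INTO reddit SELECT * FROM reddit_staging;
"""

with psycopg2.connect(
    host="your-cluster.abc123.eu-west-1.redshift.amazonaws.com",
    port=5439, dbname="dev", user="awsuser", password="your-password",
) as conn:
    with conn.cursor() as cur:
        cur.execute(CREATE_SQL)
        cur.execute(LOAD_SQL)
```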

(Screenshot of the loaded data in Redshift)

Termination

I terminated my Docker containers and Airflow after 24 hours and also paused AWS Redshift, since I didn't want to incur costs. The following commands were used:

Terminate AWS resources

terraform destroy

Stop and delete containers, delete volumes with database data and remove downloaded images. To do so, navigate to the airflow directory where you first ran docker-compose up and run the following:

docker-compose down --volumes --rmi all

To remove all stopped containers, dangling images and all dangling build cache:

docker system prune

Improvements

This project needs constant improvement as I am still on a learning journey. In summary, the following improvements could be made:

Notifications

I noticed Airflow can be used to send emails when a failure happens. I would also simplify or refactor the pipeline with a production environment in mind.
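If this were added, it could be as simple as the default_args change sketched below (plus SMTP settings in Airflow's configuration); the email address is a placeholder and this is not part of the current project.

```python
# Possible notification setup (not in the current project); requires SMTP
# configuration in airflow.cfg or the corresponding environment variables.
default_args = {
    "owner": "airflow",
    "email": ["you@example.com"],   # placeholder address
    "email_on_failure": True,
    "email_on_retry": False,
    "retries": 1,
}

# ...then pass default_args=default_args when constructing the DAG.
```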

Testing

I need to implement better validation checks to make sure duplicate data is removed.
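A simple check of this kind (an illustration, not existing code) could run on the CSV before it is uploaded:

```python
# Hypothetical validation step: fail fast if the extracted CSV contains duplicate post ids.
import pandas as pd

df = pd.read_csv("/tmp/reddit_dataengineering.csv")
dupes = df[df.duplicated(subset="id", keep=False)]
if not dupes.empty:
    raise ValueError(f"Found {len(dupes)} rows sharing a post id")
```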

Simplify Process

The use of Airflow here is overkill. Alternative ways to run this pipeline could be Cron for orchestration and PostgreSQL or SQLite for storage. I would also look for performance improvements, reduce code redundancy, and implement software engineering best practices. For example, consider using the Parquet file format over CSV, or consider whether the warehouse data could be modeled as a star schema.
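For instance, switching the intermediate file to Parquet could be as small as the change below (illustrative only; requires pyarrow or fastparquet):

```python
# Hypothetical swap of the intermediate format from CSV to Parquet.
import pandas as pd

df = pd.read_csv("/tmp/reddit_dataengineering.csv")
df.to_parquet("/tmp/reddit_dataengineering.parquet", index=False)
```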
