

ETL-Pipeline-With-Apache-Airflow-Snowflake-and-AWS

We build an ETL pipeline using Apache Airflow, Snowflake, and AWS services. Using Airflow, the pipeline fetches all NBA news articles from the last month, sorted by most recent, and saves the data to AWS S3 in Parquet format. We configure a storage integration that stores a generated identity and access management (IAM) entity for AWS, and create an external table as an interface between Snowflake and S3.


1. Set up an EC2 instance (Ubuntu image) and run Airflow on it:

Create an EC2 instance

Create an IAM role that allows the EC2 instance to access S3, and attach the role to the instance

Edit the security group's inbound rules to allow traffic from anywhere (at minimum, port 22 for SSH and port 8080 for the Airflow web UI must be reachable)

SSH into the EC2 instance, then run the following commands:

 # Install system packages and the AWS CLI
 sudo apt update
 sudo apt install python3-pip
 sudo apt install sqlite3
 sudo apt-get install libpq-dev
 pip3 install --upgrade awscli

 # Create and activate a virtual environment
 pip3 install virtualenv
 sudo apt install python3-virtualenv
 virtualenv venv
 source venv/bin/activate

 # Install Airflow 2.5.0 with the Postgres extra, the Snowflake provider, and the Parquet libraries.
 # Note: the constraint file must be the raw requirements.txt; a github.com "blob" URL serves HTML and breaks pip.
 pip install "apache-airflow[postgres]==2.5.0" --constraint "https://raw.githubusercontent.com/gakas14/ETL-Pipeline-With-Apache-Airflow-Snowflake-and-AWS/main/requirements.txt"
 pip install pandas apache-airflow-providers-snowflake==2.1.0 snowflake-connector-python==2.5.1 snowflake-sqlalchemy==1.2.5
 pip3 install pyarrow fastparquet

 # Initialise Airflow (this creates ~/airflow with a default SQLite config),
 # then install PostgreSQL for the Airflow metadata database
 airflow db init
 sudo apt-get install postgresql postgresql-contrib
 sudo -i -u postgres

 # Create the metadata database and user
 psql
 CREATE DATABASE airflow;
 CREATE USER airflow WITH PASSWORD 'airflow';
 GRANT ALL PRIVILEGES ON DATABASE airflow TO airflow;
 exit

 # Exit the postgres user shell as well, so the remaining commands run as the ubuntu user
 exit

 # Point Airflow at the Postgres database and switch to the LocalExecutor
 ls
 cd airflow
 sed -i 's#sqlite:////home/ubuntu/airflow/airflow.db#postgresql+psycopg2://airflow:airflow@localhost/airflow#g' airflow.cfg
 sed -i 's#SequentialExecutor#LocalExecutor#g' airflow.cfg

 # Re-initialise the metadata database against Postgres and create an admin user
 airflow db init
 airflow users create -u airflow -f airflow -l airflow -r Admin -e [email protected]

 # The command prompts for a password; the web UI login is then
 # user id: airflow, password: airflow (or whatever you set at the prompt)


 mkdir /home/ubuntu/dags
 cd ~/airflow
 vi airflow.cfg

 # Change the following properties in airflow.cfg:
 dags_folder = /home/ubuntu/dags
 load_examples = False

 # Now put the airflow_script.py and sport_news.py files in the dags folder,
 # e.g. by copying them from your local machine with scp:
 scp -i /directory/to/abc.pem /your/local/file/to/copy [email protected]:path/to/file
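For reference, below is a minimal sketch of what a DAG such as sport_news.py might look like. It is not the repository's actual code: the NewsAPI endpoint, the API key, the bucket name, the S3 key prefix, and the DAG/task ids are placeholders or assumptions, and it assumes boto3 is installed (the IAM role attached to the EC2 instance supplies the S3 credentials).

# sport_news.py -- illustrative sketch only, not the repository's actual DAG
from datetime import datetime, timedelta
import io

import boto3
import pandas as pd
import requests
from airflow import DAG
from airflow.operators.python import PythonOperator

NEWS_API_KEY = "<your-newsapi-key>"   # placeholder
S3_BUCKET = "<s3-bucket-name>"        # placeholder


def fetch_nba_news_to_s3(**context):
    """Fetch last month's NBA articles and store them in S3 as Parquet."""
    since = (datetime.utcnow() - timedelta(days=30)).strftime("%Y-%m-%d")
    resp = requests.get(
        "https://newsapi.org/v2/everything",
        params={"q": "NBA", "from": since, "sortBy": "publishedAt",
                "language": "en", "apiKey": NEWS_API_KEY},
        timeout=30,
    )
    resp.raise_for_status()
    articles = resp.json().get("articles", [])
    df = pd.DataFrame(articles)

    # Write the Parquet file to an in-memory buffer, then upload it to S3;
    # the EC2 instance's IAM role provides the credentials.
    buf = io.BytesIO()
    df.to_parquet(buf, index=False)
    boto3.client("s3").put_object(
        Bucket=S3_BUCKET,
        Key=f"nba_news/{context['ds']}.parquet",
        Body=buf.getvalue(),
    )


with DAG(
    dag_id="sport_news_etl",
    start_date=datetime(2024, 5, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    PythonOperator(task_id="fetch_nba_news", python_callable=fetch_nba_news_to_s3)

Once a file like this sits in /home/ubuntu/dags, the scheduler picks it up automatically.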

Run the Airflow server.

 Execute the commands below to start the web server:

 airflow db init
 airflow webserver &    # the & runs the web server in the background

Open a new terminal and execute the following:

 source venv/bin/activate
 airflow scheduler

2. Set up the Snowflake objects (database, file format, storage integration, and stage)

-- Use the warehouse
USE WAREHOUSE COMPUTE_WH;

-- Create the database
create database if not exists News_db;

--use the database
use news_db;

CREATE OR REPLACE file format News_DB.Public.parquet_format
    type = parquet;

-- Create a storage integration object
create or replace storage integration s3_int
  TYPE = EXTERNAL_STAGE
  STORAGE_PROVIDER = S3
  ENABLED = TRUE
  STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::***********:role/Snowflake_role'
  STORAGE_ALLOWED_LOCATIONS = ('s3://<s3-bucket-name>/');
  --COMMENT = 'This is an optional comment'

-- Describe the integration to fetch STORAGE_AWS_IAM_USER_ARN and STORAGE_AWS_EXTERNAL_ID,
-- then add them to the trust policy of the IAM role in AWS
DESC integration s3_int;

-- Create stage object with integration object & file format object
CREATE OR REPLACE stage News_DB.Public.my_s3_parquet_stage
    URL = 's3://<s3-bucket-name>/'
    STORAGE_INTEGRATION = s3_int
    FILE_FORMAT = News_DB.Public.parquet_format;

    
-- Check the data present in S3
list @News_DB.PUBLIC.my_s3_parquet_stage;
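The introduction mentions an external table as the interface between Snowflake and S3, which the SQL above stops short of creating. Below is a hedged sketch of how a DAG such as airflow_script.py might create and refresh that table using the Snowflake provider installed earlier; the connection id snowflake_conn, the DAG/task ids, and the table name nba_news_ext are assumptions rather than values taken from the repository.

# airflow_script.py -- illustrative sketch only, not the repository's actual DAG
from datetime import datetime

from airflow import DAG
from airflow.providers.snowflake.operators.snowflake import SnowflakeOperator

# Create an external table over the stage; with no column list defined,
# Snowflake exposes each Parquet record as a single VARIANT column named VALUE.
CREATE_EXTERNAL_TABLE_SQL = """
CREATE OR REPLACE EXTERNAL TABLE News_DB.Public.nba_news_ext
    WITH LOCATION = @News_DB.Public.my_s3_parquet_stage
    AUTO_REFRESH = FALSE
    FILE_FORMAT = (FORMAT_NAME = 'News_DB.Public.parquet_format')
"""

with DAG(
    dag_id="snowflake_external_table",
    start_date=datetime(2024, 5, 1),
    schedule_interval=None,   # trigger manually once the news DAG has written to S3
    catchup=False,
) as dag:
    create_external_table = SnowflakeOperator(
        task_id="create_external_table",
        snowflake_conn_id="snowflake_conn",   # the Airflow connection created in step 3
        warehouse="COMPUTE_WH",
        sql=CREATE_EXTERNAL_TABLE_SQL,
    )

    refresh_external_table = SnowflakeOperator(
        task_id="refresh_external_table",
        snowflake_conn_id="snowflake_conn",
        warehouse="COMPUTE_WH",
        sql="ALTER EXTERNAL TABLE News_DB.Public.nba_news_ext REFRESH",
    )

    create_external_table >> refresh_external_table

Alternatively, the same CREATE EXTERNAL TABLE statement can be run once by hand in a Snowflake worksheet.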

3. Create an Airflow connection to Snowflake.

(Screenshots of the connection configuration in the Airflow UI.)

Run the Airflow DAGs.

(Screenshot of the DAG run in the Airflow UI.)
