
Learning Apache Airflow

Apache Airflow is a platform to define data pipelines, monitor execution and handle workflow orchestration. If you are familiar with schedulers, consumers, and queues, Airflow is a great tool to explore.

Airflow solves several problems, such as managing scheduled jobs and handling dependencies between tasks. It also provides a great UI to monitor and manage workflows.

This repository is part of a course on applied Apache Airflow. It is meant to be used as a reference for the course and not as a standalone guide.

Lesson 1: Installation

There are many different ways you can install and use Airflow, from building the project from source to using a hosted (ready-to-use) service. In this course we will explore installing from the Python Package Index (PyPI) as well as using Docker Compose.

PyPI

Always refer to the official installation guide. You'll need to have Python 3 installed. Use only pip to install Airflow; the many other ways the Python community has come up with to install packages, including Poetry and pip-tools, can cause issues.

Create a temporary shell script called constraint.sh that sets the Airflow version and constraint URL:

AIRFLOW_VERSION=2.7.1

# Extract the version of Python you have installed. If your Python version is not
# supported by this Airflow release, set PYTHON_VERSION manually to a supported one.
PYTHON_VERSION="$(python --version | cut -d " " -f 2 | cut -d "." -f 1-2)"

CONSTRAINT_URL="https://raw.githubusercontent.com/apache/airflow/constraints-${AIRFLOW_VERSION}/constraints-${PYTHON_VERSION}.txt"
# For example this would install 2.7.1 with python 3.8: https://raw.githubusercontent.com/apache/airflow/constraints-2.7.1/constraints-3.8.txt

Then source it and install Airflow:

source constraint.sh
pip install "apache-airflow==${AIRFLOW_VERSION}" --constraint "${CONSTRAINT_URL}"

Once the installation completes, run the standalone subcommand to populate the database and start all components:

airflow standalone

Open localhost:8080 in your browser and you should see the Airflow UI.

Docker Compose

For Apache Airflow 2.7.1 you can fetch a pre-made docker-compose.yaml file from the documentation:

curl -LfO 'https://airflow.apache.org/docs/apache-airflow/2.7.1/docker-compose.yaml'

Change the version in the URL if you want to use a different release. The Docker Compose method is meant to provide an all-in-one setup for development and testing; it isn't recommended for production environments.

Initialize the database before starting the rest of the containers. This step is required: without it the environment will not be set up correctly, including populating the database with its initial data:

docker compose up airflow-init

Then start the rest of the containers with:

docker compose up

Access the environment at localhost:8080. Log in with the default username airflow and password airflow.

Lesson 2: Apache Airflow Fundamentals

Airflow has several components that are useful to understand before diving into the code. Start by exploring the simple example that adds a Python task to a DAG. Run the task, then explore the logs and the UI.
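
Below is a minimal sketch of what such a DAG could look like using the TaskFlow API; the DAG id, schedule, and task body are placeholders rather than the exact example from the course.

from datetime import datetime

from airflow.decorators import dag, task


@dag(schedule=None, start_date=datetime(2023, 1, 1), catchup=False)
def hello_airflow():
    # A single Python task; anything it prints ends up in the task logs.
    @task
    def say_hello():
        print("Hello from Airflow!")

    say_hello()


hello_airflow()

Save the file in your DAGs folder (the dags/ directory under AIRFLOW_HOME by default), trigger the DAG from the UI, and open the task's log to see the output.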

Lesson 3: Creating and running a Pipeline

Creating a pipeline in Airflow helps you get comfortable with core data engineering concepts. In this example we will create a pipeline that downloads a file from the internet, cleans the dataset using Pandas, and then persists specific data to a database. Each of these actions is performed as a separate task.
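
A minimal sketch of that three-step pipeline is shown below; the URL, file paths, and table name are placeholders, not the exact ones used in the course.

from datetime import datetime
import sqlite3

import pandas as pd
from airflow.decorators import dag, task


@dag(schedule=None, start_date=datetime(2023, 1, 1), catchup=False)
def example_pipeline():
    @task
    def download():
        # Download a CSV file (placeholder URL) and save it locally.
        df = pd.read_csv("https://example.com/data.csv")
        df.to_csv("/tmp/raw.csv", index=False)
        return "/tmp/raw.csv"

    @task
    def clean(path: str):
        # Drop incomplete rows with Pandas and save the cleaned dataset.
        df = pd.read_csv(path).dropna()
        df.to_csv("/tmp/clean.csv", index=False)
        return "/tmp/clean.csv"

    @task
    def load(path: str):
        # Persist the cleaned data to a local SQLite database.
        df = pd.read_csv(path)
        with sqlite3.connect("/tmp/course.db") as conn:
            df.to_sql("records", conn, if_exists="replace", index=False)

    load(clean(download()))


example_pipeline()

Each task runs as a separate process, so the tasks pass a file path between them (via XComs) instead of keeping the DataFrame in memory.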

Lesson 4: Practice Lab

Use the included practice lab to build a data pipeline using Apache Airflow to extract census data, transform it, and load it into a database based on certain conditions. Follow the steps in the lab to complete the exercise in your own repository.
