
yannibenoit / airflow-duckdb


This project forked from hussein-awala/airflow-duckdb


A package to run DuckDB queries from Apache Airflow.

License: Apache License 2.0


airflow-duckdb's Introduction

Airflow DuckDB on Kubernetes

DuckDB is an in-process analytical database for running analytical queries on large data sets.

Apache Airflow is an open-source platform for developing, scheduling, and monitoring batch-oriented workflows.

Apache Airflow is not an ETL tool itself; it is a workflow scheduler used to schedule and monitor ETL jobs. Airflow users create DAGs that run jobs on Spark, Hive, Athena, Trino, BigQuery, and other engines to process their data.

By using DuckDB with Airflow, users can run analytical queries on large local or remote data sets and store the results without needing these ETL tools.

To use DuckDB with Airflow, users can use the PythonOperator with the DuckDB Python library, the BashOperator with the DuckDB CLI, or one of the available Airflow operators that support DuckDB (e.g. airflow-provider-duckdb developed by Astronomer). All of these operators run in the worker pod and are limited by its resources. For that reason, some users use the Kubernetes Executor to run each task in a dedicated Kubernetes pod, so that a task can request more resources when needed.

Setting up the Kubernetes Executor can be challenging for some users, especially maintaining the worker Docker image. This project provides an alternative way to run DuckDB with Airflow, using the KubernetesPodOperator.
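As a rough illustration of the BashOperator route mentioned above, the sketch below builds a shell-safe command line for the DuckDB CLI. It relies on the CLI's -c option to run a single command and exit; the helper name build_duckdb_cli_command is hypothetical and not part of this project or of Airflow.

```python
import shlex


def build_duckdb_cli_command(query: str) -> str:
    """Return a shell-safe command that runs one query with the DuckDB CLI.

    Hypothetical helper for illustration; `duckdb -c "<sql>"` runs the
    statement against an in-memory database and exits.
    """
    args = ["duckdb", "-c", query]
    # Quote every argument so quotes inside the SQL survive the shell.
    return " ".join(shlex.quote(arg) for arg in args)


command = build_duckdb_cli_command("SELECT 42 AS answer;")
# In a DAG, this string could be passed as BashOperator(bash_command=command).
```

A task built this way still runs inside the worker and shares its CPU and memory, which is the limitation described above.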

How to use

The operator is built entirely on the KubernetesPodOperator, so it needs the cncf-kubernetes provider to be installed in the Airflow environment (preferably the latest version, to benefit from all the features).

Install the package

To use the operator, you need to install the package in your Airflow environment. You can install the package using pip:

pip install airflow-duckdb

Use the operator

The operator supports all the parameters of the KubernetesPodOperator and adds some parameters that simplify the usage of DuckDB.

Here is an example of how to use the operator:

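from airflow import DAG
from kubernetes.client import models as k8s

# DuckDBPodOperator and S3FSConfig are provided by the airflow-duckdb package.

with DAG("duckdb_dag", ...) as dag:
    DuckDBPodOperator(
        task_id="duckdb_task",
        # max_col1 is an illustrative alias for the aggregated column.
        query="SELECT MAX(col1) AS max_col1 FROM READ_PARQUET('s3://my_bucket/data.parquet');",
        # Push the query result to XCom for downstream tasks.
        do_xcom_push=True,
        # S3 credentials are rendered from the Airflow connection "duckdb_s3".
        s3_fs_config=S3FSConfig(
            access_key_id="{{ conn.duckdb_s3.login }}",
            secret_access_key="{{ conn.duckdb_s3.password }}",
        ),
        # Resources requested for the pod that runs the query.
        container_resources=k8s.V1ResourceRequirements(
            requests={"cpu": "1", "memory": "8Gi"},
            limits={"cpu": "1", "memory": "8Gi"},
        ),
    )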
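Since the pod in this example is capped at 8Gi, one option is to tell DuckDB explicitly to stay below that limit with its memory_limit setting. The sketch below converts a Kubernetes memory quantity into such a statement. Both helper names and the 80% default are assumptions made for illustration, only integer quantities with common suffixes are handled, and nothing here claims the operator does this on its own.

```python
# Subset of the suffixes Kubernetes accepts for memory quantities.
_UNITS = {
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12,
}


def k8s_memory_to_bytes(quantity: str) -> int:
    """Convert an integer quantity such as "8Gi" or "512M" to bytes."""
    # Check two-character suffixes first so "Mi" is not mistaken for "M".
    for suffix in sorted(_UNITS, key=len, reverse=True):
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * _UNITS[suffix]
    return int(quantity)  # A bare number is already in bytes.


def duckdb_memory_limit_statement(quantity: str, percent: int = 80) -> str:
    """Build a SET memory_limit statement using a share of the pod memory."""
    budget_bytes = k8s_memory_to_bytes(quantity) * percent // 100
    # Express the budget in decimal megabytes, rounded down.
    return f"SET memory_limit='{budget_bytes // 10**6}MB';"


statement = duckdb_memory_limit_statement("8Gi")
```

The resulting statement could be placed in front of the query text, leaving some headroom in the pod for the process itself.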

Features

The current version of the operator supports the following features:

  • Running one or more DuckDB queries in a Kubernetes pod
  • Configuring the pod resources (requests and limits) to run the queries
  • Configuring the S3 credentials securely with a Kubernetes secret to read and write data from/to S3 (AWS S3, MinIO or GCS with S3 compatibility)
  • Using Jinja templating to configure the query
  • Loading the queries from a file
  • Pushing the query result to XCom
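For S3-compatible stores such as MinIO, DuckDB's httpfs extension usually needs a custom endpoint and path-style URLs in addition to credentials. The sketch below generates the corresponding non-secret SET statements (s3_endpoint, s3_use_ssl, s3_url_style). The helper names and the endpoint are hypothetical, httpfs is assumed to be loaded, and whether S3FSConfig exposes equivalent options is not stated here, so treat this as background on the DuckDB side rather than a description of the operator.

```python
def sql_literal(value: str) -> str:
    """Quote a string as a SQL literal, doubling embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"


def s3_compat_settings(endpoint: str, use_ssl: bool = True, path_style: bool = True) -> str:
    """Return DuckDB SET statements for an S3-compatible endpoint.

    Credentials are deliberately left out; they should come from a
    secret rather than from generated SQL text.
    """
    lines = [
        f"SET s3_endpoint={sql_literal(endpoint)};",
        f"SET s3_use_ssl={'true' if use_ssl else 'false'};",
        f"SET s3_url_style={sql_literal('path' if path_style else 'vhost')};",
    ]
    return "\n".join(lines)


# Hypothetical MinIO service reachable without TLS inside the cluster.
settings_sql = s3_compat_settings("minio.example.internal:9000", use_ssl=False)
```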

The project also provides a Docker image with the DuckDB CLI and some extensions, for use with Airflow.

airflow-duckdb's People

Contributors

hussein-awala
