sarit-si / docker-pdi-airflow-kafka

Data pipeline involving Pentaho DI, Airflow and Kafka. Run all services together using Docker.


Description

Airflow, Kafka and Postgres (target DB) services are spawned using Docker Compose. Airflow orchestrates the data pipeline, including spawning PDI containers (via the DockerOperator). The producer.ktr file uses the built-in Pentaho Kafka Producer step to publish messages to the Kafka container, whereas the consumer.ktr file runs a Python script via the CPython Script Executor plugin. Reasons for not using the built-in PDI Kafka Consumer (a sketch of the consumer logic follows the list):

  1. The built-in PDI consumer cannot be stopped when there are no messages; as a result, the Airflow DAG stays in the running state forever.
  2. A way was needed to assign a specific topic partition for consumer.ktr to work on.
  3. Offsets should be committed only after the messages are successfully processed and inserted into the database.
  4. Python adds flexibility for future customizations to the consumer.
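
The consumer script itself is not reproduced on this page, so the following is only a minimal sketch of the logic described above, assuming the kafka-python client; the bootstrap server, group id, topic, partition number and the process_and_insert helper are illustrative placeholders, not names taken from the repository.

    from kafka import KafkaConsumer, TopicPartition

    # Manual partition assignment and manual offset commits, per points 1-3 above.
    consumer = KafkaConsumer(
        bootstrap_servers="kafka:9092",   # assumed Kafka service name in docker compose
        group_id="pdi-consumer",          # illustrative consumer group id
        enable_auto_commit=False,         # commit manually, only after a successful DB insert
        consumer_timeout_ms=10000,        # stop iterating if no messages arrive for 10 s
    )
    consumer.assign([TopicPartition("demo-topic", 0)])  # pin one specific partition

    for message in consumer:              # loop exits after consumer_timeout_ms of inactivity
        process_and_insert(message)       # placeholder: transform + insert into Postgres
        consumer.commit()                 # offset committed only after the insert succeeds
    consumer.close()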

Architecture


Process flow:

  • Two PDI transformation (.ktr) files are used: one serving as the producer and the other as the consumer.
  • The producer transformation reads the input data, builds key-message pairs and sends them to a Kafka topic.
  • The consumer transformation reads each message from the Kafka topic and processes it before loading into the Postgres DB.
  • Both the producer and consumer transformations are triggered from Airflow via the DockerOperator (see the DAG sketch after this list).
  • Source code files such as KTRs and DAGs are mounted from the host into the containers rather than baked into the images, which removes the need to rebuild the Docker images on every code change.
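
To illustrate the DockerOperator wiring, here is a minimal DAG sketch, assuming Airflow 2.x with the Docker provider installed; it is not the repository's actual DAG, and the DAG id, image tag, KTR path and pan.sh invocation are assumptions for illustration only.

    from datetime import datetime
    from airflow import DAG
    from airflow.providers.docker.operators.docker import DockerOperator

    with DAG(
        dag_id="pdi_kafka_pipeline",      # illustrative DAG id
        start_date=datetime(2021, 1, 1),
        schedule_interval=None,           # trigger manually
        catchup=False,
    ) as dag:
        producer = DockerOperator(
            task_id="run_producer",
            image="pdi:latest",           # assumed tag of the PDI image built by setup.sh
            command="pan.sh -file=/ktr/producer.ktr",  # Pentaho's pan.sh runs a .ktr transformation
            docker_url="unix://var/run/docker.sock",
            auto_remove=True,             # clean up the PDI container once it exits
        )
        consumer = DockerOperator(
            task_id="run_consumer",
            image="pdi:latest",
            command="pan.sh -file=/ktr/consumer.ktr",
            docker_url="unix://var/run/docker.sock",
            auto_remove=True,
        )
        producer >> consumer              # consume only after the producer has published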

Pre-requisites

  • Docker and Docker Compose installed on the host.

Installation

All the required setup steps for this demo are scripted in the .sh file below. If required, update the SETUP PARAMETERS section of the shell script. Navigate to the Git repository folder and run the following two commands.

bash setup.sh

  • Creates a .env file and sets all the environment variables required by the services in Docker Compose.
  • Builds the required Docker images.

NOTE: The shell script also creates a jdbc.properties file for the PDI containers. Add any required DB connection strings to this file. It is recommended to add it to .gitignore.

docker-compose up -d

Docker commands

  • Start all services: docker-compose up
  • Stop all services and remove containers: docker-compose down
  • List all the running containers: docker ps (add -a to include stopped containers as well)
  • Get inside a running container: docker exec -it <container-name> bash
  • Show disk usage of Docker objects: docker system df
  • For more specific commands, please check the Reference section.
  • Container logs: docker logs [CONTAINER NAME], e.g. docker logs airflow-webserver. Add -f to follow the log.
  • To increase the number of Airflow workers: docker-compose up --scale airflow-worker=3 (for 3 workers). Note: the more workers, the greater the pressure on system resources.

Process Monitoring

  • Airflow logs: click a task on the UI > Logs
  • Kafdrop: inspect topic messages, their offsets and consumer lag
  • Docker: the containers can be monitored using the Docker commands above; third-party container monitoring services can also be added to the docker-compose file.

Reference
