mesaketk / lasagna_datapipeline

This project forked from gmrqs/lasagna

A Docker Compose template that builds an interactive development environment for PySpark with Jupyter Lab, MinIO as object storage, Hive Metastore, Trino and Kafka

Shell 0.02% Python 40.88% Jupyter Notebook 57.71% Dockerfile 1.40%

lasagna_datapipeline's Introduction

Lasagna (or pastabricks) is an interactive development environment I built to learn and practice PySpark.

It is built from a Docker Compose template that provisions Jupyter Lab, a two-worker Spark Standalone cluster, MinIO object storage, a Hive Standalone Metastore, Trino, and a Kafka cluster for simulating events.

Requirements:

  • Docker Desktop
  • Docker Compose

To use it you just have to clone this repository and execute the following:

docker compose up -d

Docker will build the images by itself. I recommend having a wired internet connection for this.

After all containers are up and running, execute the following to get the Jupyter Lab access link:

 docker logs workspace 2>&1 | grep http://127.0.0.1

(You can also find the link in the Docker Desktop logs.)

Click the link http://127.0.0.1:8888/lab?token=<token_gigante_super_seguro>
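
The log lookup above can also be scripted. The following is a minimal sketch, assuming the container is named workspace (as in the command above) and that the log contains a URL in the form shown. The helper names `find_lab_urls` and `read_workspace_logs` are hypothetical and not part of Lasagna.

```python
import re
import subprocess

# Matches the Jupyter Lab URL format shown above, including the token query string.
LAB_URL = re.compile(r"http://127\.0\.0\.1:\d+/lab\?token=\w+")


def find_lab_urls(log_text: str) -> list[str]:
    """Return the distinct Jupyter Lab URLs found in the log text, in order of appearance."""
    return list(dict.fromkeys(LAB_URL.findall(log_text)))


def read_workspace_logs(container: str = "workspace") -> str:
    """Capture stdout and stderr of `docker logs <container>` as one string."""
    result = subprocess.run(
        ["docker", "logs", container],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # same effect as 2>&1 in the shell command
        text=True,
        check=True,
    )
    return result.stdout
```

Calling `find_lab_urls(read_workspace_logs())` on the host should return the same link that the grep command prints, without duplicates.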

To start the Kafka broker you need to go to the kafka folder and execute the following:

docker compose up -d

What does Lasagna create?

The docker-compose.yml template creates a series of containers:

📙 Workspace

A Jupyter Lab client for interactive development sessions, featuring:

  • A work directory to persist your scripts and notebooks;
  • spark-defaults.conf pre-configured to make Spark Sessions easier to create;
  • Dedicated kernels for PySpark with Hive, Iceberg or Delta.

👀 Use the %SparkSession command to easily configure a Spark Session

📂 MinIO Object Storage

A single MinIO instance to serve as object storage:

  • Web UI accessible at localhost:9090 (user: admin, password: password);
  • s3a protocol API available at port 9000;
  • mount/minio and mount/minio-config directories mounted to persist data between sessions.
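
For sessions created outside the pre-configured workspace, the s3a settings can be sketched as below. This is an illustration under assumptions, not the project's own configuration: the hostname `minio` is a guess at the container name, and the access keys are assumed to match the web UI credentials listed above. The property names are the standard Hadoop s3a options; the helper names are hypothetical.

```python
def minio_s3a_conf(
    endpoint: str = "http://minio:9000",  # assumed container hostname
    access_key: str = "admin",            # assumed to match the web UI user
    secret_key: str = "password",         # assumed to match the web UI password
) -> dict[str, str]:
    """Spark properties that point the Hadoop s3a connector at a MinIO endpoint."""
    return {
        "spark.hadoop.fs.s3a.endpoint": endpoint,
        "spark.hadoop.fs.s3a.access.key": access_key,
        "spark.hadoop.fs.s3a.secret.key": secret_key,
        # MinIO addresses buckets as URL paths rather than subdomains.
        "spark.hadoop.fs.s3a.path.style.access": "true",
        # Enable TLS only when the endpoint itself uses https.
        "spark.hadoop.fs.s3a.connection.ssl.enabled": str(endpoint.startswith("https")).lower(),
    }


def apply_conf(builder, conf: dict[str, str]):
    """Apply each property to a SparkSession builder and return the builder."""
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder
```

With PySpark installed, `apply_conf(SparkSession.builder, minio_s3a_conf()).getOrCreate()` would yield a session that can read and write `s3a://` paths, provided the bucket exists.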

✨ Spark Cluster

A standalone Spark cluster for workload processing:

  • 1 Master node (master at port 7077, web-ui at localhost:5050)
  • 2 Worker nodes (web-ui at localhost:5051 and localhost:5052)
  • All the necessary dependencies for MinIO connection;
  • Connectivity with MinIO @ port 9000.
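
A quick way to confirm the web UIs listed so far are reachable is a plain TCP check from the host. The sketch below uses only the Python standard library; the port numbers come from the lists above, while the labels and function names are hypothetical.

```python
import socket

# Web UI ports stated above as reachable on localhost; labels are for the report only.
WEB_UIS = {
    "MinIO web UI": 9090,
    "Spark master web UI": 5050,
    "Spark worker 1 web UI": 5051,
    "Spark worker 2 web UI": 5052,
}


def is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def report(services: dict[str, int], host: str = "localhost") -> dict[str, bool]:
    """Map each service label to whether its port accepts connections."""
    return {name: is_open(host, port) for name, port in services.items()}
```

Running `report(WEB_UIS)` after `docker compose up -d` gives one boolean per service. An open port only shows that something is listening, not that the service behind it is healthy.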

🐝 Hive Standalone Metastore

A Hive Standalone Metastore instance using PostgreSQL as its back-end database, allowing table metadata to persist between sessions.

  • mount/postgres directory to persist tables between development sessions;
  • Connectivity with the Spark cluster through the Thrift protocol at port 9083;
  • Connectivity with PostgreSQL through JDBC at port 5432.
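
To illustrate the Thrift connection, here is a hedged sketch of a Spark session bound to the metastore. The default hostname `hive-metastore` is an assumption about the container name, and both function names are hypothetical. The workspace's spark-defaults.conf is described as pre-configured, so this would only matter for sessions created elsewhere.

```python
def metastore_uri(host: str = "hive-metastore", port: int = 9083) -> str:
    """Thrift URI for a Hive Metastore; the default host name is an assumption."""
    return f"thrift://{host}:{port}"


def build_session(app_name: str = "metastore-demo", host: str = "hive-metastore", port: int = 9083):
    """Create a Hive-enabled SparkSession bound to the given metastore."""
    from pyspark.sql import SparkSession  # imported lazily; requires PySpark

    return (
        SparkSession.builder.appName(app_name)
        .config("spark.hadoop.hive.metastore.uris", metastore_uri(host, port))
        .enableHiveSupport()  # use the Hive catalog instead of the in-memory one
        .getOrCreate()
    )
```

Once a session is created this way, `spark.sql("SHOW DATABASES").show()` should list databases stored in the metastore, and they should survive a restart of the containers.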

🐰 Trino

A single Trino instance to serve as query engine.

  • Hive, Delta and Iceberg catalogs configured. All tables created using PySpark are accessible with Trino;
  • Standard service available at port 8080.

👀 Don't forget you can use the %trino magic command in your notebooks!
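
Outside a notebook, Trino can also be queried over its HTTP client protocol. The sketch below uses only the standard library and assumes port 8080 is reachable at `localhost` from where the script runs; the user name `lasagna` is an arbitrary placeholder and the function names are hypothetical. It is not how the %trino magic is implemented.

```python
import json
import urllib.request


def statement_request(sql: str, base_url: str = "http://localhost:8080", user: str = "lasagna"):
    """Build the POST request that submits a SQL statement to Trino's REST API."""
    return urllib.request.Request(
        url=f"{base_url}/v1/statement",
        data=sql.encode("utf-8"),
        headers={"X-Trino-User": user},
        method="POST",
    )


def run_query(sql: str, **kwargs) -> list[list]:
    """Submit a statement and follow nextUri links until all rows are collected."""
    rows = []
    request = statement_request(sql, **kwargs)
    while request is not None:
        with urllib.request.urlopen(request) as response:
            page = json.load(response)
        if "error" in page:
            raise RuntimeError(page["error"].get("message", "query failed"))
        rows.extend(page.get("data", []))
        # Trino returns results in pages; each page points at the next one.
        next_uri = page.get("nextUri")
        request = urllib.request.Request(next_uri) if next_uri else None
    return rows
```

For example, `run_query("SHOW CATALOGS")` should return one row per configured catalog. The exact catalog names depend on the Trino configuration shipped with the template.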

🌊 Kafka

A separate Docker Compose template with a single-node ZooKeeper + Kafka instance to mock data streams with a Python producer.

  • Uses the same network that the Lasagna Docker Compose template creates;
  • A kafka-producer notebook/script is available to create random events with the Faker library;
  • Accessible at kafka:29092.
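
The producer idea can be sketched as follows. This is not the repository's kafka-producer script: the event fields, the topic name `events`, and the choice of the kafka-python client are all assumptions made for illustration. Only the Faker library and the kafka:29092 address come from the description above.

```python
import json


def encode_event(event: dict) -> bytes:
    """Serialize an event as UTF-8 JSON, the value format assumed for the topic."""
    return json.dumps(event, sort_keys=True).encode("utf-8")


def fake_event(fake) -> dict:
    """Build one random event from a Faker instance; the field names are hypothetical."""
    return {
        "user": fake.user_name(),
        "email": fake.email(),
        "created_at": fake.iso8601(),
    }


def produce(n_events: int, topic: str = "events", servers: str = "kafka:29092") -> None:
    """Send n_events random events to the topic, then flush the producer."""
    from faker import Faker          # third-party: Faker
    from kafka import KafkaProducer  # third-party: kafka-python (assumed client)

    fake = Faker()
    producer = KafkaProducer(bootstrap_servers=servers, value_serializer=encode_event)
    for _ in range(n_events):
        producer.send(topic, value=fake_event(fake))
    producer.flush()  # block until buffered messages are delivered
```

Because kafka:29092 is an address on the shared Docker network, `produce(10)` is meant to be run from inside a container on that network, such as the workspace.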

lasagna_datapipeline's People

Contributors

gmrqs
