acidonspark-etl's Introduction

AcidOnSpark-ETL

Watch about it here:

Read about it here:

Technologies

The main technologies are:

  • Python3
  • Docker
  • Spark
  • Airflow
  • MinIo
  • DeltaTable
  • Hive
  • Mariadb
  • Presto
  • Superset

How does it work?

In this section, each part of this ETL pipeline is described:

Spark

Spark is used to read and write data in a distributed and scalable manner.

make spark

This starts the Spark master and one worker instance.

make scale-spark

This scales out the Spark workers.

Airflow

Airflow is a workflow management tool, used here to schedule and orchestrate the Spark jobs.

make airflow
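
As a rough illustration of how an Airflow task might hand a job to the Spark cluster, the sketch below builds a spark-submit command line that a task could execute. The master URL spark://spark-master:7077, the script path, and the helper name build_spark_submit are assumptions made for this example; they are not taken from this repository.

```python
import shlex
from typing import Dict, List, Optional


def build_spark_submit(
    app_path: str,
    master: str = "spark://spark-master:7077",  # assumed master URL
    conf: Optional[Dict[str, str]] = None,
    app_args: Optional[List[str]] = None,
) -> str:
    """Return a shell-safe spark-submit command string (hypothetical helper)."""
    parts = ["spark-submit", "--master", master]
    # Each Spark setting is passed as its own --conf key=value pair.
    for key, value in sorted((conf or {}).items()):
        parts += ["--conf", f"{key}={value}"]
    parts.append(app_path)
    parts += app_args or []
    # Quote every token so the string can be handed to a shell unchanged.
    return " ".join(shlex.quote(p) for p in parts)


command = build_spark_submit(
    "/opt/jobs/etl_job.py",  # hypothetical job script
    conf={"spark.executor.memory": "1g"},
    app_args=["--date", "2021-01-01"],
)
print(command)
```

A string like this could be passed to a shell-based Airflow task; the actual DAGs in the project may submit jobs differently.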

MinIo

An open-source, distributed, high-performance object store, used here for the data lake files and the Hive tables.

make minio

DeltaTable (Deltalake)

An open-source table format that stores data as columnar Parquet files with Snappy compression, alongside a transaction log. Delta supports update and delete operations, which plain Parquet files do not. All the jar files needed for Delta and S3 object support are added to the Hive and Spark Docker images.
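
To make the Delta and S3 wiring more concrete, here is a minimal sketch of the Spark settings typically needed to use Delta Lake on an S3-compatible store such as MinIO. The configuration keys are standard Spark, Delta, and Hadoop S3A settings; the endpoint, the credentials, and the helper name delta_minio_conf are placeholders assumed for this example, not values from this repository.

```python
from typing import Dict


def delta_minio_conf(endpoint: str, access_key: str, secret_key: str) -> Dict[str, str]:
    """Collect Spark settings for Delta Lake on an S3-compatible store (hypothetical helper)."""
    return {
        # Enable Delta's SQL extensions and catalog implementation.
        "spark.sql.extensions": "io.delta.sql.DeltaSparkSessionExtension",
        "spark.sql.catalog.spark_catalog": "org.apache.spark.sql.delta.catalog.DeltaCatalog",
        # Point the S3A filesystem at the object store instead of AWS.
        "spark.hadoop.fs.s3a.endpoint": endpoint,
        "spark.hadoop.fs.s3a.access.key": access_key,
        "spark.hadoop.fs.s3a.secret.key": secret_key,
        # Path-style addressing (http://host/bucket/key) rather than virtual-host buckets.
        "spark.hadoop.fs.s3a.path.style.access": "true",
    }


settings = delta_minio_conf("http://minio:9000", "ACCESS_KEY", "SECRET_KEY")  # placeholder values
for key, value in settings.items():
    print(f"{key}={value}")
```

Each pair can be applied with SparkSession.builder.config(key, value) before the session is created.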

Hive and Mariadb

To create tables and run Spark SQL on Delta tables, Spark needs a Hive metastore, and Hive needs MariaDB as its metastore database. MariaDB is also used as the data warehouse, so that queries run faster and dashboards can be built on it.

make hive

This creates the Hive and MariaDB instances.
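
As an illustration of what registering a Delta table in the Hive metastore can look like, the sketch below builds a Spark SQL statement that declares a table over an existing Delta location. The database, table, and bucket names and the helper name delta_table_ddl are hypothetical; the repository may register its tables differently.

```python
import re

# Conservative pattern for database and table names used in the statement.
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def delta_table_ddl(database: str, table: str, location: str) -> str:
    """Build a CREATE TABLE statement over an existing Delta path (hypothetical helper)."""
    for name in (database, table):
        if not _IDENTIFIER.match(name):
            raise ValueError(f"invalid identifier: {name!r}")
    if "'" in location:
        raise ValueError("location must not contain single quotes")
    return (
        f"CREATE TABLE IF NOT EXISTS {database}.{table} "
        f"USING DELTA LOCATION '{location}'"
    )


ddl = delta_table_ddl("lake", "events", "s3a://datalake/events")  # assumed names
print(ddl)
```

The statement would then be run with spark.sql(ddl) on a session created with enableHiveSupport(), so that the table definition is stored in the metastore.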

Presto

To access Delta tables without Spark, Presto is used as a distributed query engine. It works with Superset and with the Hive tables. Presto is open source and scalable, and it can connect to many data sources through its connectors.

make presto-cluster

This command creates a Presto coordinator and a worker; the workers can scale horizontally. To query Delta tables using Presto:

make presto-cli

In presto-cli, just as in Spark SQL, any query can be run.

Superset

Superset is an open-source, widely used tool for building dashboards and exploring databases. It supports many databases and offers many dashboard styles.

make superset
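
Superset connects to databases through SQLAlchemy URIs, so linking it to Presto mostly comes down to entering the right connection string. The sketch below assembles one in the presto://host:port/catalog/schema form; the host name presto-coordinator, port 8080, the hive catalog, and the helper name presto_uri are assumptions for illustration and may not match this project's setup.

```python
from typing import Optional


def presto_uri(host: str, port: int = 8080, catalog: str = "hive",
               schema: Optional[str] = None) -> str:
    """Build a Presto SQLAlchemy URI for a Superset database entry (hypothetical helper)."""
    uri = f"presto://{host}:{port}/{catalog}"
    # The schema segment is optional; omit it to browse every schema in the catalog.
    if schema:
        uri += f"/{schema}"
    return uri


uri = presto_uri("presto-coordinator", schema="default")  # assumed host and schema
print(uri)
```

The resulting string is what would be entered as the SQLAlchemy URI when adding a database in Superset.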

acidonspark-etl's People

Contributors

arezamoosavi
