
velib-v2: an ETL pipeline that employs batch and streaming jobs using Spark, Kafka, Airflow, and other tools


End-to-end ETL pipeline - jcdecaux API


(Architecture diagram)


The architecture consists of two main pipelines:

  • Batch pipeline: once the data has been served to Kafka, the Spark analytics engine transforms and processes it in batches and loads the resulting tables into a data warehouse.
  • Stream pipeline: Spark Streaming fetches and filters the data that is served in our web application in near real time.

JCDecaux API:

A single call to this API returns the most recent information about all existing stations. "Most recent" can mean the last update from 1–2 minutes ago, which is not truly real-time data; there is nothing we can do to improve that, it is simply how the API works. To make the API act like a stream source without overloading the server, a script scheduled in Airflow fetches data every 30 seconds and sends it to the Kafka cluster using a Kafka producer.
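The fetch-and-produce step just described can be sketched as follows. This is a minimal illustration, not the repo's actual script: the endpoint path, the `JCDECAUX_API_KEY` environment variable, and the use of the kafka-python client are assumptions; only the topic name and the 30-second cadence come from this README.

```python
import json
import os

# Assumed endpoint; the real script may use different parameters
API_URL = "https://api.jcdecaux.com/vls/v1/stations"
TOPIC = "velib_data"  # topic name from this README

def to_messages(stations):
    """Serialize station records to (key, value) byte pairs, keyed by station number."""
    return [(str(s["number"]).encode(), json.dumps(s).encode()) for s in stations]

def fetch_and_produce(producer):
    """One scheduled run: fetch the current snapshot and push every station to Kafka."""
    import requests  # imported lazily so to_messages stays testable without it
    resp = requests.get(
        API_URL,
        params={"apiKey": os.environ["JCDECAUX_API_KEY"]},  # assumed env var
        timeout=10,
    )
    resp.raise_for_status()
    for key, value in to_messages(resp.json()):
        producer.send(TOPIC, key=key, value=value)
    producer.flush()

if __name__ == "__main__":
    from kafka import KafkaProducer  # kafka-python
    fetch_and_produce(KafkaProducer(bootstrap_servers="kafka:9092"))
```

Airflow would invoke this script (or its `fetch_and_produce` callable) on the 30-second schedule described above.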

Kafka:

Kafka receives the data and stores it in a topic called "velib_data", where it waits to be polled. Two consumer groups are configured to consume data from the Kafka broker in parallel:

  1. batch-consumer: using spark-connect, a Spark client script polls data from Kafka in batches, transforms it, and creates the tables that are sent to the data warehouse. Airflow is configured to run this job every day at 5 am.
  2. stream-consumer: Spark Structured Streaming is used here to consume data from the Kafka broker in real time; the transformed data is then served to the web application. Note: spark-connect doesn't support streaming operations yet, which means that in our case the stream-consumer script lives in the same environment as the Spark cluster.
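The stream-consumer side can be sketched like this: a Structured Streaming read from the "velib_data" topic, parsed into columns. This is a sketch, not the repo's code — it assumes the spark-sql-kafka connector is on the classpath and uses an assumed subset of the JCDecaux station fields; only the topic name comes from this README.

```python
TOPIC = "velib_data"  # topic name from this README

def kafka_options(bootstrap_servers, topic):
    """Option dict for Spark's Kafka source (spark-sql-kafka connector)."""
    return {
        "kafka.bootstrap.servers": bootstrap_servers,
        "subscribe": topic,
        "startingOffsets": "latest",
    }

def build_stream(spark):
    """Return a streaming DataFrame of parsed station records."""
    # pyspark is imported lazily so kafka_options stays testable without it
    from pyspark.sql import functions as F
    from pyspark.sql.types import IntegerType, StringType, StructField, StructType

    # Assumed subset of the JCDecaux station fields
    schema = StructType([
        StructField("number", IntegerType()),
        StructField("contract_name", StringType()),
        StructField("name", StringType()),
        StructField("available_bikes", IntegerType()),
        StructField("available_bike_stands", IntegerType()),
    ])

    reader = spark.readStream.format("kafka")
    for key, value in kafka_options("kafka:9092", TOPIC).items():
        reader = reader.option(key, value)

    return (
        reader.load()
        .selectExpr("CAST(value AS STRING) AS json")  # Kafka values arrive as bytes
        .select(F.from_json("json", schema).alias("s"))
        .select("s.*")
    )
```

The returned DataFrame would then be filtered and handed to a `writeStream` sink that feeds the web application.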

Data warehouse:

Choosing the right tool for the data warehouse and OLAP system is out of scope for this project. In our case we use SQL Server.

Web application:

There are plenty of ways to build a web application; the easiest one to learn and implement, in my opinion, is Streamlit. You don't need to be a web developer to use it: with a single Python script your application is ready to be served.
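To give an idea of how little Streamlit requires, here is a minimal sketch of such a single-script app. It is illustrative only: in the real app the records would come from the stream-consumer's output, while here a single hard-coded record (the Gare Centrale example shown later in this README) stands in for the live data.

```python
def station_summary(station):
    """One display line per station record."""
    return (
        f"{station['name']} ({station['contract_name']}): "
        f"{station['available_bikes']} bikes, "
        f"{station['available_bike_stands']} free docks"
    )

def main():
    import streamlit as st  # imported lazily so station_summary is testable without Streamlit
    st.title("Velib stations - live")
    # Placeholder record; the real app would read from the streaming pipeline's output
    st.write(station_summary({
        "name": "Gare Centrale",
        "contract_name": "bruxelles",
        "available_bikes": 24,
        "available_bike_stands": 11,
    }))

if __name__ == "__main__":
    main()
```

A script like this would be launched with `streamlit run app.py` and served on port 8501 by default.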


Environment

  • The whole architecture can be installed locally in one go using Docker Compose. You need at least 8 GB of RAM for it to run.

  • ZooKeeper, Kafka, Spark, and SQL Server each run in their own separate container.

  • Airflow is configured to use the LocalExecutor to enable parallel tasks, which requires setting up a PostgreSQL database as its backend. For more options, see https://airflow.apache.org/docs/apache-airflow/stable/howto/set-up-database.html

  • The web application could have run on its own server, but the architecture already takes enough memory and CPU, which is why the app lives in the same container as the Spark engine.
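The LocalExecutor setup mentioned above boils down to two settings in `airflow.cfg` (or the matching `AIRFLOW__...` environment variables). The credentials and hostname below are assumptions for illustration, not this project's actual values:

```ini
[core]
# LocalExecutor runs tasks as parallel local processes (requires a real DB, not SQLite)
executor = LocalExecutor

[database]
# PostgreSQL metadata DB; in Airflow < 2.3 this key lives under [core] instead
sql_alchemy_conn = postgresql+psycopg2://airflow:airflow@postgres/airflow
```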


How to run

After cloning the repository, open a terminal, cd into its directory, and run

docker-compose -f dockercompose.yaml up

This will take a few minutes, because Docker first pulls the base images and then builds the customized ones. After all containers are up, you can view the Airflow UI in your browser at

localhost:8080

Notes:

  • The Airflow webserver may take extra time to start, so be patient!
  • The login credentials are admin:admin.
  • The DAGs are configured not to start automatically when the servers are up; you have to start them manually using the toggle on the left, but you can change this behaviour if needed.
  • The first DAG to run is api-to-kafka.
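The api-to-kafka DAG described above could look roughly like this. Only the DAG id, the 30-second cadence, and the paused-by-default behaviour come from this README; the task id, dates, and callable are placeholders.

```python
from datetime import datetime, timedelta

FETCH_INTERVAL = timedelta(seconds=30)  # matches the 30-second cadence described earlier

def fetch_to_kafka():
    """Placeholder for the real fetch-and-produce callable."""

def build_dag():
    # airflow is imported lazily so the constants above are importable without it
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    with DAG(
        dag_id="api-to-kafka",
        start_date=datetime(2023, 1, 1),     # placeholder date
        schedule=FETCH_INTERVAL,             # Airflow 2.4+; older versions use schedule_interval
        catchup=False,                       # don't backfill missed 30-second windows
        is_paused_upon_creation=True,        # DAGs start paused, matching the note above
    ) as dag:
        PythonOperator(task_id="fetch_and_produce", python_callable=fetch_to_kafka)
    return dag
```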

(Screenshot: the Airflow UI)

Wait at least 2 minutes, then open the web application at

localhost:8501

(Screenshot: the web application)


Here we go! The app shows that the Gare Centrale station in Brussels has 24 available bikes and 11 free docks.
