satyam245 / airflow-logistics-data-pipeline

Streamline logistics data orchestration with Apache Airflow on Google Cloud Platform. Automate ingestion, transformation, and storage of CSV files in Google Cloud Storage (GCS) into Hive tables on Google Cloud Dataproc. Utilizes dynamic partitioning for scalability and efficiency.

Python 100.00%

airflow-logistics-data-pipeline's Introduction

Airflow Data Orchestration for Logistics Data Pipeline

Overview

This Apache Airflow project ingests logistics data from CSV files in Google Cloud Storage (GCS) into Hive tables on Google Cloud Dataproc. GCS sensors detect newly arrived files and start the extraction and transformation steps. Hive queries create and manage the databases and tables, and dynamic partitioning keeps data storage scalable and efficient to query. Processed files are moved to the archive folder to avoid duplicate processing. The DAG triggers itself upon completion, so processing continues as new files arrive.
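
As an illustration of the Hive side of this flow, the following is a minimal sketch of the statements such a pipeline might submit: create a database, expose the CSV files in GCS as an external table, and load them into a partitioned table using dynamic partitioning. The table names, column names, partition column, and the assumption that each CSV has a header row are all hypothetical and are not taken from the repository.

```python
def build_hive_statements(database: str, bucket: str, input_prefix: str) -> list[str]:
    """Return HiveQL statements for a dynamic-partition load (illustrative sketch)."""
    # Hypothetical table names; the real project may use different ones.
    raw = f"{database}.logistics_raw"
    partitioned = f"{database}.logistics_partitioned"
    return [
        f"CREATE DATABASE IF NOT EXISTS {database}",
        # External table over the CSV files in GCS (hypothetical columns).
        (
            f"CREATE EXTERNAL TABLE IF NOT EXISTS {raw} ("
            "delivery_id INT, origin STRING, destination STRING, delivery_date STRING) "
            "ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' "
            "STORED AS TEXTFILE "
            f"LOCATION 'gs://{bucket}/{input_prefix}/' "
            # Assumes each CSV starts with one header row.
            "TBLPROPERTIES ('skip.header.line.count'='1')"
        ),
        # Target table, partitioned by a hypothetical date column.
        (
            f"CREATE TABLE IF NOT EXISTS {partitioned} ("
            "delivery_id INT, origin STRING, destination STRING) "
            "PARTITIONED BY (delivery_date STRING)"
        ),
        # Let Hive derive partition values from the data itself.
        "SET hive.exec.dynamic.partition=true",
        "SET hive.exec.dynamic.partition.mode=nonstrict",
        # The partition column must come last in the SELECT list.
        (
            f"INSERT INTO TABLE {partitioned} PARTITION (delivery_date) "
            f"SELECT delivery_id, origin, destination, delivery_date FROM {raw}"
        ),
    ]


if __name__ == "__main__":
    for statement in build_hive_statements("logistics_db", "my-bucket", "input_data"):
        print(statement + ";")
```

With dynamic partitioning, Hive creates one partition per distinct value of the partition column found in the incoming rows, so no partition has to be declared ahead of time.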

Components

  • Google Cloud Composer: orchestrates the data pipeline tasks.
  • Google Cloud Storage (GCS): stores incoming CSV files and processed data.
  • Google Cloud Dataproc: executes Hive queries and manages Hive tables.
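
To show how these components might connect, here is a small sketch of two helpers a Composer DAG could use: one builds a Hive job payload in the shape of the Dataproc Job resource (the kind of dictionary that an operator such as `DataprocSubmitJobOperator` accepts as its `job` argument), and one computes the archive location for a processed file. The function names, the `archive` prefix default, and the example file name are assumptions made for illustration, not the repository's actual code.

```python
import posixpath


def build_hive_job(project_id: str, cluster_name: str, queries: list[str]) -> dict:
    """Build a Dataproc Hive job payload (illustrative helper, hypothetical name)."""
    return {
        "reference": {"project_id": project_id},
        "placement": {"cluster_name": cluster_name},
        # Each entry in the list is submitted as one Hive statement.
        "hive_job": {"query_list": {"queries": list(queries)}},
    }


def archive_object_name(object_name: str, archive_prefix: str = "archive") -> str:
    """Map a processed GCS object to its name under the archive folder."""
    # GCS object names use forward slashes, so posixpath is the right tool.
    return posixpath.join(archive_prefix, posixpath.basename(object_name))


if __name__ == "__main__":
    job = build_hive_job("my-project", "my-cluster", ["SHOW DATABASES"])
    print(job)
    print(archive_object_name("input_data/logistics_20230906.csv"))
```

Keeping these helpers free of Airflow imports makes them easy to unit test outside Composer; the DAG itself would only pass their results to the relevant operators.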
