Coder Social home page Coder Social logo

michailparaskevopoulos / covid19-airtraffic-spark-bigquery Goto Github PK

View Code? Open in Web Editor NEW
3.0 1.0 0.0 120 KB

✈ A Spark-based ETL Pipeline for the OpenSky and OpenFlights Datasets

License: Apache License 2.0

Python 73.07% Shell 26.93%
opensky-network openflights bigquery pyspark dataproc

covid19-airtraffic-spark-bigquery's Introduction

✈ A Spark-based ETL Pipeline for the OpenSky and OpenFlights Datasets

Motivation

The OpenSky dataset spans flights since January 2019 has been instrumental in analyzing the effect of COVID-19 on the aviation industry. This repository contains a PySpark pipeline to extract and transform the OpenSky dataset, and load it on to Bigquery as a fact table. PySpark pipelines are also included to process data from the OpenFlights dataset to construct dimensional tables with airline and aircraft data. Combined together in a single Data Warehouse, the OpenSky and OpenFlights datasets not only can be used to analyze trends in airtraffic during the COVID-19 pandemic, but also to explore patterns between different airlines and aircraft types.

Description of Files

  • envSetup.sh Shell script that executes datasetFetch.py, downloads and unzips the OpenSky and OpenFlights CSV files, and copies them to Cloud Storage.
  • datasetFetch.py Python script that uses BeautifulSoup to scrap the file paths for the OpenSky dataset (26 files as of March 2021) and dumps them into a JSON file
  • clusterSetup.sh Shell script that enables the required Google Cloud APIs, creates a Dataproc cluster, and submits the PySpark jobs
  • pyspark_airtraffic.py Main PySpark file for the ETL of the OpenSky dataset:
    • The OpenSky dataset files are loaded from the Cloud Storage into a single PySpark dataframe
    • Rows with null values are dropped
    • Date format is modified to be compatible with the BigQuery date data type
    • The month part of the date of each flight is copied to a new column to be used a partioning field
    • A regex expression is used to extract the airline ICAO identifier from the callsign of each flight
    • The origin and destination coordinates are used to calculate the approximate travel distance for each flight and categorize it either as a short-, medium-, or long-haul flight
    • Reduntant columns are dropped
    • The trasnformed dataframe is loaded into BigQuery as a partitioned table based on the month column to imrpove query performance
  • pyspark_airlines.py and pyspark_aircraft_types.py PySpark files for the ETL of dimensional tables for airlines and aircraft types

Data Sources

OpenSky:
Data from the OpenSky Network 2020 were used to construct the airtraffic and aircraft type tables.

Matthias Schäfer, Martin Strohmeier, Vincent Lenders, Ivan Martinovic and Matthias Wilhelm. "Bringing Up OpenSky: A Large-scale ADS-B Sensor Network for Research". In Proceedings of the 13th IEEE/ACM International Symposium on Information Processing in Sensor Networks (IPSN), pages 83-94, April 2014.

Xavier Olive. "traffic, a toolbox for processing and analysing air traffic data." Journal of Open Source Software 4(39), July 2019.

OpenFlights:
Data from OpenFlights were used to construct the airlines dimensional table.

covid19-airtraffic-spark-bigquery's People

Stargazers

 avatar  avatar  avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.