vmois / miwaitway

Calculate average wait time on bus stops using GTFS real-time vehicle location and display it on a map

Home Page: https://vmois.dev/miwaitway-average-wait-time-on-stop/

Dockerfile 1.77% Python 97.12% Shell 1.11%
airflow gtfs python bigquery gtfs-rt mississauga public-transport

miwaitway's Introduction

MiWaitWay

A web app that shows average wait time on a bus stop for MiWay (Mississauga Transit) agency using their public GTFS feed. Work in progress.

Motivation

I like public transport, and I enjoy software engineering. For a long time, I wanted to build a project that worked with data end-to-end: from data ingestion, through analysis, to presentation to an external end user. The only thing that stopped me was not finding an analysis topic I would want to dive into (low-effort excuse, I know, but it is what it is). As my interest in public transportation grows and I use it day to day, I have recently found a topic I would like to explore: the average wait time at a stop. You can track the project's progress by reading the "MiWaitWay" series on my engineering blog.

Deploying Airflow on Google Cloud VM

  1. Pull repository
  2. Start services with docker compose up -d
  3. Create an SSH tunnel from your local computer to the VM instance. This way, you do not need to expose the Airflow UI to the web.
gcloud compute ssh airflow-and-web \
    --project miwaitway \
    --zone us-central1-c \
    -- -NL 8080:localhost:8080

miwaitway's Issues

Deploy Airflow on Google Cloud Compute

Cloud Composer is quite expensive for personal use (around 200 dollars per month at the minimum); therefore, it is better to deploy Airflow on a cheap Google Cloud compute instance, ideally under 15-20 dollars per month. Pricing for different machines can be found here.

e2-small (2 vCPU, 2 GB memory) costs around 13 dollars per month. It should be enough for a local Airflow plus a web server to host the dashboard. Around 30 GB of storage space should be enough. So, in total, we are looking at about 15 dollars per month. Not that bad.

Things to do in this issue:

  • Update docker-compose and other configs to be able to deploy
  • Install the necessary software on the server
  • Deploy Airflow on the server
  • Write a guide on how to set up the server and Airflow

Resolving duplication issues in raw data

Even after solving de-duplication issues in the vehicle position ingestor, problems still caused the MERGE request to fail when running a DAG. After some investigation, I found that two CSV files containing multiple chunks of vehicle positions still had an overlap by vehicle_id and timestamp. Because of that, the MERGE step couldn't determine which row to use to insert into the final raw table.

The solution is to load each file one by one. Load a single file to a stage, merge the stage into raw, clean the stage, and repeat. In this way, duplicates do not end up in the same table but will be loaded batch by batch, and with MERGE, they will match and simply update the values.

One minor issue persists: for now, the order of the CSV files in GCS is random. Duplicated rows (by vehicle_id and timestamp) can differ slightly, where the newer row has, for example, occupancy specified. Because the files are loaded in random order, an older row might overwrite a newer one, and we might lose this data. However, it is not critical for our desired feature and can be addressed later.
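The file-by-file loading described above can be sketched in plain Python. This is a simplified in-memory stand-in, not the actual DAG: `merge_stage_into_raw` and `load_files_one_by_one` are hypothetical names, the raw table is modeled as a dict keyed by (vehicle_id, timestamp), and rejecting duplicate keys in the stage is only an approximation of the MERGE failure described in this issue.

```python
from typing import Dict, Iterable, List, Tuple

Row = Dict[str, object]
Key = Tuple[object, object]


def merge_stage_into_raw(stage: List[Row], raw: Dict[Key, Row]) -> None:
    """Upsert stage rows into raw, keyed by (vehicle_id, timestamp).

    Simplified stand-in for MERGE: a stage holding two rows with the
    same key is rejected, because it is unclear which row should win.
    """
    seen = set()
    for row in stage:
        key = (row["vehicle_id"], row["timestamp"])
        if key in seen:
            raise ValueError(f"ambiguous merge: duplicate key {key} in stage")
        seen.add(key)
    for row in stage:
        # Matched keys are updated, unmatched keys are inserted.
        raw[(row["vehicle_id"], row["timestamp"])] = row


def load_files_one_by_one(files: Iterable[List[Row]]) -> Dict[Key, Row]:
    """Load a file to the stage, merge into raw, clear the stage, repeat."""
    raw: Dict[Key, Row] = {}
    for file_rows in files:
        stage = list(file_rows)
        merge_stage_into_raw(stage, raw)
        stage.clear()
    return raw
```

With two files that overlap on one key, loading them separately succeeds and the later file's values win, which is also why the random file order mentioned above can overwrite a richer row with a poorer one.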

Transform raw data to prod to be used in features later

BQ has powerful functions that work with GEOGRAPHY data. I will need them to calculate the average wait time on a stop. To use those functions, I need to convert raw latitude and longitude coordinates to BQ native Geography Point. In particular, I care about two tables: vehicle_position and stops. vehicle_position contains bus GPS coordinates + timestamp, and stops contains the GPS coordinates of the stops.

The aim of this epic is to design DAGs that transform lat and lon columns in vehicle_position and stops tables to BQ Geography Point. Transformed data will be saved as new tables in the prod stage (together with not-transformed data). This aligns with the second column in the diagram from the data flow article - https://vmois.dev/data-flow-bigquery-miwaitway/. Later, using those prod tables, I will be able to calculate average wait time, etc.
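As a rough illustration of the transformation, the helper below builds a BigQuery SQL statement that copies a raw table to a prod table and adds a Geography Point column. BigQuery's `ST_GEOGPOINT` takes longitude first, then latitude. Everything else here is an assumption for the sketch: the function name `geography_point_sql`, the `raw.stops` and `prod.stops` table names, and the `geog_point` column name are hypothetical, and `stop_lat` / `stop_lon` are the standard GTFS column names in stops.txt, which may differ from the columns in the actual raw table.

```python
def geography_point_sql(
    source_table: str,
    target_table: str,
    lat_column: str,
    lon_column: str,
) -> str:
    """Build a statement that adds a GEOGRAPHY point column to a copy of a table.

    All original columns are kept, so the untransformed lat/lon values
    stay available next to the new point column.
    """
    return (
        f"CREATE OR REPLACE TABLE `{target_table}` AS\n"
        # ST_GEOGPOINT expects (longitude, latitude), in that order.
        f"SELECT *, ST_GEOGPOINT({lon_column}, {lat_column}) AS geog_point\n"
        f"FROM `{source_table}`"
    )


# Hypothetical table names; adjust to the real datasets.
stops_sql = geography_point_sql("raw.stops", "prod.stops", "stop_lat", "stop_lon")
```

The same helper could be reused for vehicle_position by passing that table's latitude and longitude column names.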

Make raw static and real-time bus position data available in BigQuery

Create necessary DAGs and tables in BigQuery to load static and real-time data. Static GTFS data needs to be loaded once per day; real-time needs continuous ingestion.

I want to publish two articles about static and real-time ingestion as part of the epic. It can be more if needed.

Tasks (in-order):

  • finalize importing static GTFS data using Airflow and deploy it (#10)
  • #12 and #14
  • #8

Improve reliability of vehicle position ingestor

The vehicle position ingestor has some reliability issues that must be addressed before I can continue the analysis. Some issues:

  • All 100 batches (around 15-20 minutes of GTFS real-time data) are kept in memory. In case of a crash, we lose them and have a gap of 20 minutes in bus positions. A good solution is to introduce Write-Ahead-File so that after the restart, the ingestor can pick up the lost data.

  • Sometimes, the vehicle_id field is missing, and the ingestor crashes. In addition to losing all batches, without having a Protobuf file, it is hard to understand what happened and how to fix it. It would be good to catch polars.exceptions.ColumnNotFoundError errors and save the Protobuf file + error message for debugging (and unit tests). Failure in one batch should not cause a loss of all batches.

  • If unusual errors occur (like a missing vehicle_id field), some monitoring/alerting is needed. I want to try Sentry for that. In addition to saving the context of an error in the ingestor, I need to send an alert to Sentry.

  • Sometimes I see errors such as polars.exceptions.ComputeError: could not append value: 1 of type: i64 to the builder; make sure that all rows have the same schema or consider increasing infer_schema_length. It usually happens when creating a data frame from all batches (pl.DataFrame(flattened_data)). Possible ideas: reduce the number of batches from 100 to 50 and double-check how the schema is inferred. If this error is unavoidable, we need to dump the batches to a file and send a notification, e.g., to Sentry.
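Two of the ideas above, the write-ahead file and isolating a failure to a single batch, can be sketched with the standard library only. This is a minimal sketch under assumptions, not the ingestor's code: `WriteAheadFile` and `parse_batch_safely` are hypothetical names, batches are assumed to be JSON-serializable lists, and the caller supplies the `parse` function. In the real ingestor, the caught error type would be polars.exceptions.ColumnNotFoundError, and a Sentry alert could be sent where the debug files are written.

```python
import json
from pathlib import Path
from typing import Callable, List, Optional, Tuple, Type


class WriteAheadFile:
    """Append-only JSON-lines file that mirrors the in-memory batches."""

    def __init__(self, path: Path) -> None:
        self.path = path

    def append(self, batch: list) -> None:
        # Persist the batch before relying on the in-memory copy.
        with self.path.open("a", encoding="utf-8") as f:
            f.write(json.dumps(batch) + "\n")

    def recover(self) -> List[list]:
        # Return batches left behind by a previous run, if any.
        if not self.path.exists():
            return []
        batches = []
        with self.path.open(encoding="utf-8") as f:
            for line in f:
                try:
                    batches.append(json.loads(line))
                except json.JSONDecodeError:
                    # Skip a line that was cut off by a crash mid-write.
                    continue
        return batches

    def clear(self) -> None:
        # Call once the batches have been flushed successfully.
        self.path.unlink(missing_ok=True)


def parse_batch_safely(
    payload: bytes,
    parse: Callable[[bytes], list],
    debug_dir: Path,
    name: str,
    errors: Tuple[Type[BaseException], ...] = (Exception,),
) -> Optional[list]:
    """Parse one batch; on failure, save payload and error, return None."""
    try:
        return parse(payload)
    except errors as exc:
        debug_dir.mkdir(parents=True, exist_ok=True)
        (debug_dir / f"{name}.pb").write_bytes(payload)
        (debug_dir / f"{name}.error.txt").write_text(repr(exc), encoding="utf-8")
        return None
```

On startup, the ingestor would call `recover()` to reload batches from the previous run, and a batch for which `parse_batch_safely` returns `None` would simply be skipped, so the other batches survive.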

Calculate the average wait time at bus stops

Now that we have the necessary prod data after #21, we can calculate the average wait time at bus stops. The end goal for this issue is to have a final table in BigQuery that contains stop_id, date, direction_id, and line_id (bus number). We might also split the day into a few chunks because bus service is frequently reduced at night.
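The issue does not say how the average is computed. One common definition, assumed here and not necessarily the one this project uses, treats passengers as arriving at the stop uniformly at random. For consecutive bus arrivals with headways h_i, the expected wait is then sum(h_i^2) / (2 * sum(h_i)). The function name `average_wait_minutes` and the input format (arrival times in minutes for one stop, line, and direction) are hypothetical.

```python
from typing import Sequence


def average_wait_minutes(arrivals: Sequence[float]) -> float:
    """Expected wait for a passenger arriving uniformly at random.

    `arrivals` holds bus arrival times at one stop, in minutes.
    """
    ordered = sorted(arrivals)
    # Headway = gap between two consecutive bus arrivals.
    headways = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    total = sum(headways)
    if total <= 0:
        raise ValueError("need at least two distinct arrival times")
    # Longer gaps catch more passengers, so they are weighted by their length.
    return sum(h * h for h in headways) / (2 * total)
```

For evenly spaced buses this reduces to half the headway; uneven spacing increases the average, which is one reason to compute it from observed vehicle positions and not from the schedule.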
