
gobble's Introduction

gobble


(Screenshot of gobble in action)

Gobble is a service that reads all train and bus events from the MBTA V3 Streaming API and writes them out in a format that the TransitMatters Data Dashboard can understand.

Requirements to develop locally

  • Python 3
  • Poetry

Development Instructions

  1. Duplicate config/template.json into config/local.json, and replace the null with your MBTA V3 API key.
  2. In the root directory, run poetry install to install dependencies.
  3. Run poetry run python3 src/gobble.py to start.
  4. Output will be written to data/ in your current working directory. Good luck!
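
For reference, your config/local.json might end up looking something like this (the actual key name comes from config/template.json; "mbta_v3_api_key" is just a placeholder assumption here):

{
    "mbta_v3_api_key": "your-api-key-here"
}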

Linting

You can run the linters against any code changes with the following commands:

$ poetry run flake8 src
$ poetry run black --check src

Support TransitMatters

If you've found this app helpful or interesting, please consider donating to TransitMatters to help support our mission to provide data-driven advocacy for a more reliable, sustainable, and equitable transit system in Metropolitan Boston.

gobble's People

Contributors

dependabot[bot], devinmatte, hamima-halim, hhalim1, idreyn, mathcolo, nathan-weinberg


Forkers

nathan-weinberg

gobble's Issues

Commuter Rail missing stops

When looking at the commuter rail data collected by stop, we seem to be missing several inbound and outbound stops

https://github.com/transitmatters/t-performance-dash/blob/d5913dd893bf29bb419868632ce60681f5a2615f/common/constants/cr_constants/cr-fairmount.json

It seems especially true at the ends of lines, but not exclusively:

https://github.com/transitmatters/t-performance-dash/blob/d5913dd893bf29bb419868632ce60681f5a2615f/common/constants/cr_constants/cr-worcester.json

Is the issue on our end or the MBTA side?

(Note that these lists are generated based on the list of stops gobble puts in S3.)

Add Commuter Rail lines

To be able to populate the data dashboard with commuter rail data, it would be nice to start processing at least one commuter rail line, so we can start setting up the frontend against it 🚆

Crash due to nonexistent stop id

Jan 08 02:51:27 ip-172-31-95-22 poetry[3252872]: Traceback (most recent call last):
Jan 08 02:51:27 ip-172-31-95-22 poetry[3252872]:   File "/home/ubuntu/gobble/src/gobble.py", line 45, in <module>
Jan 08 02:51:27 ip-172-31-95-22 poetry[3252872]:     main()
Jan 08 02:51:27 ip-172-31-95-22 poetry[3252872]:   File "/home/ubuntu/gobble/src/gobble.py", line 40, in main
Jan 08 02:51:27 ip-172-31-95-22 poetry[3252872]:     process_event(update, current_stop_state, gtfs_service_date, scheduled_trips, scheduled_stop_times, stops)
Jan 08 02:51:27 ip-172-31-95-22 poetry[3252872]:   File "/home/ubuntu/.cache/pypoetry/virtualenvs/gobble-i42h0hpV-py3.11/lib/python3.11/site-packages/ddtrace/tracer.py", line 975, in func_wrapper
Jan 08 02:51:27 ip-172-31-95-22 poetry[3252872]:     return f(*args, **kwargs)
Jan 08 02:51:27 ip-172-31-95-22 poetry[3252872]:            ^^^^^^^^^^^^^^^^^^
Jan 08 02:51:27 ip-172-31-95-22 poetry[3252872]:   File "/home/ubuntu/gobble/src/event.py", line 85, in process_event
Jan 08 02:51:27 ip-172-31-95-22 poetry[3252872]:     ) = reduce_update_event(update)
Jan 08 02:51:27 ip-172-31-95-22 poetry[3252872]:         ^^^^^^^^^^^^^^^^^^^^^^^^^^^
Jan 08 02:51:27 ip-172-31-95-22 poetry[3252872]:   File "/home/ubuntu/gobble/src/event.py", line 49, in reduce_update_event
Jan 08 02:51:27 ip-172-31-95-22 poetry[3252872]:     stop_id = update["relationships"]["stop"]["data"]["id"]
Jan 08 02:51:27 ip-172-31-95-22 poetry[3252872]:               ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^
Jan 08 02:51:27 ip-172-31-95-22 poetry[3252872]: TypeError: 'NoneType' object is not subscriptable
Jan 08 02:51:28 ip-172-31-95-22 systemd[1]: gobble.service: Main process exited, code=exited, status=1/FAILURE
Jan 08 02:51:28 ip-172-31-95-22 systemd[1]: gobble.service: Failed with result 'exit-code'.
Jan 08 02:51:28 ip-172-31-95-22 systemd[1]: gobble.service: Consumed 45min 9.973s CPU time.
Jan 08 02:51:33 ip-172-31-95-22 systemd[1]: gobble.service: Scheduled restart job, restart counter is at 3963.
Jan 08 02:51:33 ip-172-31-95-22 systemd[1]: Stopped gobble.
Jan 08 02:51:33 ip-172-31-95-22 systemd[1]: gobble.service: Consumed 45min 9.973s CPU time.
Jan 08 02:51:33 ip-172-31-95-22 systemd[1]: Started gobble.
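
A possible fix is to guard against the null stop relationship before subscripting into it. A hypothetical sketch (safe_stop_id is not an existing function in the codebase):

def safe_stop_id(update: dict) -> str | None:
    # The traceback shows update["relationships"]["stop"]["data"] can be
    # None; return None in that case so the caller can skip the event.
    stop_data = update["relationships"]["stop"]["data"]
    if stop_data is None:
        return None
    return stop_data["id"]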

Add more bus lines to gobble

Gobble currently queries and uploads realtime data for the route 1 bus only. We'll want to add at least the bus lines currently available on the dashboard so that we have parity with the ~monthly backfill data.

Calculate scheduled headway per event

In order to display colored dots in the data dashboard, scheduled_headway needs to be filled in.

Unfortunately, we don't immediately know what the scheduled headway is from a particular vehicle, so we need to calculate it ourselves from GTFS. It also might be possible to request them on-demand, once per day, from the MBTA v3 API.
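
One hedged sketch of the GTFS approach: given the scheduled departures at a stop (filtered to the same route and direction from stop_times.txt), the scheduled headway for an event is the gap between the scheduled departure nearest the event and the one before it. All names below are hypothetical:

from datetime import datetime
from typing import List, Optional

def scheduled_headway(scheduled_departures: List[datetime], event_time: datetime) -> Optional[float]:
    # Find the first scheduled departure at or after this event and
    # return the gap (in seconds) to the scheduled departure before it.
    times = sorted(scheduled_departures)
    for prev, curr in zip(times, times[1:]):
        if curr >= event_time:
            return (curr - prev).total_seconds()
    return None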

Check for new GTFS bundle every day

At the moment the list of GTFS bundles is read into memory once at launch, but we should check every day whether that's still the bundle we should be using. (Or something?)
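
A minimal sketch of that daily check, with a hypothetical load_bundle_list refresher standing in for however the list is actually loaded today:

import datetime as dt

_last_checked: dt.date | None = None

def maybe_refresh_gtfs(load_bundle_list) -> None:
    # Re-read the bundle list at most once per calendar day,
    # instead of only once at launch.
    global _last_checked
    today = dt.date.today()
    if _last_checked != today:
        load_bundle_list()  # assumed to refresh the in-memory bundle list
        _last_checked = today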

Launch into a private subnet by default

#71 going in means gobble doesn't need a public IPv4 address. But the current public subnets auto-assign public IPv4 addresses (and need to for NTT and RRE), so we should:

a) Create a private subnet
b) Codify that private subnet in gobble's CF.

Split `state.json` into one file per route

The problem: writes to state.json are dominating our process_event calls:

(profiler screenshot)

The file contains JSON that grows over the course of the day by about 30kb/hr as more trips are added. That's a lot of data for a single JSON file, but not that much. We hypothesize that our event-processing threads (one each for CR and rapid transit, many for bus) are all trying to write this file simultaneously, and that most of the apparent write time is really time spent waiting on a lock (see this thread).

The proposed solution is to split state.json into one file per route (or at least, one file per thread). Here's a sketch of the change:

1. Create a new trip_state.py file with this API:

from dataclasses import dataclass, field
from datetime import datetime
from typing import Dict, Optional

@dataclass
class TripState:
    # Holds the current state for a single trip
    stop_sequence: int
    stop_id: str
    updated_at: datetime
    event_type: str

@dataclass
class RouteTripsState:
    # Holds the current state for all trips in a route
    route_id: str
    service_date: str
    trips: Dict[str, TripState] = field(default_factory=dict)

    def __post_init__(self):
        # Loads the latest trip state from ./data/trip_states/{self.route_id}.json
        ...

    def update_trip_state(self, trip_id: str, trip_state: TripState) -> None:
        # Updates the in-memory representation of a specific trip's state
        # and writes the result to disk at ./data/trip_states/{route_id}.json
        # 🤔 do we really need to do this each time?
        ...

    def get_trip_state(self, trip_id: str) -> Optional[TripState]:
        # Returns the latest trip state for this trip (from memory)
        ...

    def purge_trip_state(self) -> None:
        # Purges trip state in-memory and on-disk (done overnight)
        ...

2. Instantiate one RouteTripsState per route

This could be done with a global dict, but we might as well keep it local to the thread that manages the route's trips.

3. Make sure the trips state is purged nightly

Using logic similar to the GTFS rollover calculations:

# check for new day
gtfs_service_date = util.service_date(datetime.now(util.EASTERN_TIME))
updated_at = datetime.fromisoformat(update["attributes"]["updated_at"])
service_date = util.service_date(updated_at)

if gtfs_service_date != service_date:
    # Purge trip state
    ...

(We should probably extract a helper method for this check, since we'll need it in a number of places; a sketch follows.)
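
A possible shape for that helper, reusing util.service_date and util.EASTERN_TIME from the snippet above:

from datetime import datetime

import util  # the project's util module, per the snippet above

def is_past_service_date(update: dict) -> bool:
    # True when the update belongs to an earlier GTFS service date
    # than the current one, i.e. the trip state should be purged.
    gtfs_service_date = util.service_date(datetime.now(util.EASTERN_TIME))
    updated_at = datetime.fromisoformat(update["attributes"]["updated_at"])
    return util.service_date(updated_at) != gtfs_service_date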

Optimize S3 Calls

S3 charges $0.005 per 1,000 requests, and right now we make one request per stop per half hour.

We should see if there are ways to optimize the calls slightly without reducing the frequency.

https://github.com/transitmatters/gobble/blob/main/src/s3_upload.py#L52-L56

We now upload all files from the current day on every upload pass. We should maybe limit this to files updated in the last hour again? That would help reduce uploads for commuter rail, weird bus stops, and rapid transit during shutdowns (a possible filter is sketched below).

These would help reduce upload time as well as cost.
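
A hedged sketch of that filter, assuming the event files are CSVs under the local data directory:

import time
from pathlib import Path

def recently_updated(data_dir: Path, max_age_s: int = 3600) -> list[Path]:
    # Only consider files whose mtime falls within the last hour,
    # rather than everything written so far today.
    cutoff = time.time() - max_age_s
    return [p for p in data_dir.rglob("*.csv") if p.stat().st_mtime >= cutoff]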

Duplicate events

Maybe this bus became stalled or something? gobble wrote the same event to disk a bunch of times because the current stop kept flip-flopping. (A possible dedup guard is sketched after the sample rows.)

2023-11-03,66,58403721,1,2553,6,0,1734,DEP,2023-11-04T03:34:51.000Z,0,0
2023-11-03,66,58403721,1,2553,6,0,1734,DEP,2023-11-04T03:34:54.000Z,0,0
2023-11-03,66,58403721,1,2553,6,0,1734,DEP,2023-11-04T03:34:54.000Z,0,0
2023-11-03,66,58403721,1,2553,6,0,1734,DEP,2023-11-04T03:34:57.000Z,0,0
2023-11-03,66,58403721,1,2553,6,0,1734,DEP,2023-11-04T03:35:00.000Z,0,0
2023-11-03,66,58403721,1,2553,6,0,1734,DEP,2023-11-04T03:35:03.000Z,0,0
2023-11-03,66,58403721,1,2553,6,0,1734,DEP,2023-11-04T03:35:03.000Z,0,0
2023-11-03,66,58403721,1,2553,6,0,1734,DEP,2023-11-04T03:35:06.000Z,0,0
2023-11-03,66,58403721,1,2553,6,0,1734,DEP,2023-11-04T03:35:11.000Z,0,0
2023-11-03,66,58403721,1,2553,6,0,1734,DEP,2023-11-04T03:35:39.000Z,0,0
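
One possible guard, hypothetical and untested: before appending a row, compare it to the last row written for the trip, ignoring the timestamp column (index 9 in the sample above):

def is_duplicate_event(prev_row: list[str] | None, row: list[str], ts_index: int = 9) -> bool:
    # Treat the event as a duplicate if every column except the
    # timestamp matches the last row written for this trip.
    if prev_row is None:
        return False
    return all(a == b for i, (a, b) in enumerate(zip(prev_row, row)) if i != ts_index)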

Store Rapid transit data without prefix

Right now we're storing data as Green-B_0_70121, when the data dashboard expects to be able to simply read 70121.

This makes using gobble rapid transit data in the dashboard more complicated than it needs to be.
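
If the prefixed format is consistently {route}_{direction}_{stop}, a hypothetical normalization could be as simple as:

def normalize_stop_id(raw_id: str) -> str:
    # "Green-B_0_70121" -> "70121"; ids without a prefix pass through unchanged
    return raw_id.rsplit("_", 1)[-1]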

Data Quality: Headways

When viewing commuter rail data in the dashboard, headways are often missing points on the graph, yet the points that do exist read out realistic headway numbers.

(Screenshot 2023-12-26 at 12.06.07 PM)

Some charts look very strange

(Screenshot 2023-12-26 at 12.07.28 PM)

Add more bus routes to tracking

YEAR OF THE BUS! Let's add a bunch more of the bus routes that don't share a schedule with any other bus routes:

  • line-7
  • line-8
  • line-10
  • line-11
  • line-14
  • line-18
  • line-26
  • line-29
  • line-30
  • line-31
  • line-35
  • line-36
  • line-37
  • line-38
  • line-42
  • line-43
  • line-44
  • line-51
  • line-100
  • line-101
  • line-105
  • line-106
  • line-108
  • line-110
  • line-112
  • line-119
  • line-134
  • line-195
  • line-215
  • line-230
  • line-236
  • line-238
  • line-240
  • line-354
  • line-429
  • line-455
  • line-69
  • line-80
  • line-83
  • line-87
  • line-88
  • line-90
  • line-93
  • line-94
  • line-95
  • line-96
  • line-97
  • line-99

Make data available to the dashboard

So right now events are being written to disk on the instance. The data dashboard needs to get a hold of them somehow, though. I see a few distinct options...

a) Upload the events files to S3 overnight every night, accepting that we just won't have live bus or CR. (lame)
b) Every time we append an events.csv on disk, upload the entire thing to S3. (maybe, but like...no)
c) Serve live events over http that the dashboard can request on-demand.

My hunch is that we want to do (c), with some (a) sprinkled in. It's cool when things are live, and we shouldn't give that up. So rough steps:

  1. In a new process, create an express server that serves up events from ./output (a rough stand-in is sketched after this list).
  2. Throw a load balancer in front of it, and wire up the load balancer to a .labs DNS record with the wildcard cert for https.
  3. Maybe add pre-shared key auth, since this is for internal use only?
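
For illustration only, a minimal Python stand-in for step 1 (the issue calls for an express server; this just serves whatever is in ./output over plain HTTP):

from functools import partial
from http.server import HTTPServer, SimpleHTTPRequestHandler

# Serve files from ./output on port 8080 (both choices are arbitrary here)
handler = partial(SimpleHTTPRequestHandler, directory="./output")
HTTPServer(("0.0.0.0", 8080), handler).serve_forever()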

FAQ
a) Why can't the dashboard talk to the EC2 instance via its private IP, so that we can keep the EC2 instance off the public internet? That's possible, but it's a pain.
b) If the EC2 instance has a public IP address, can the dashboard lambda just talk to that? Yes, it could. But the load balancer option lets us easily add https using the wildcard cert, which, even with no private data involved, is good citizenship.

Intuit stop ID for vehicles that don't report them

Fairly regularly (5000-ish times, maybe), we get GPS pings from vehicles that report their route information but not their upcoming stop. Here's a map of ~2000 such pings over the course of a day: https://www.google.com/maps/d/u/0/edit?mid=1ttstvWGxhXTY62ZOA7YYQr-o3Srnbj8&usp=sharing

We currently ignore these pings, which still gives us decent headway calculations but can produce holes in our records. As far as I can tell, there are three types of stop ID outages we tend to see:

  1. Short-stretch outages, which last less than a minute and tend to happen at the beginning/end of the stop. These are short enough that we could probably ignore them and still have reasonable calculations, even if they happen in the middle of a trip.
  2. Medium-stretch outages, which last maybe 2-10 minutes at a time. We see these a lot on the 39 (potentially caused by a glitchy AVL), and they can cause us to lose information for a couple of stops.
  3. Long-stretch outages, which might be because the AVL for a vehicle wasn't turned on but GPS was still reporting info.

If a vehicle has been dark for more than ~a minute, we should start trying to interpolate its progress along its shape, if possible (a crude starting point is sketched after the list below). There will probably be some complexity with respect to:

  • small route diversions
  • figuring out the inbound/outbound direction of the trip if that's null (we might be able to grab this from GTFS, or just use previous ping info)
  • how to store shape information in memory for quick enough calculations without ballooning memory usage (maybe just cache the more problematic route shapes)
  • monitoring the duration of a vehicle's outage in order to kick off this calculation
  • determining whether an event is an arrival or a departure (or neither); I haven't seen any STOPPED_AT events in these outages, and it's unclear whether the vehicles are actually making stops and not reporting them
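
As a crude starting point (a hypothetical sketch that ignores the route shape entirely), a dark vehicle's ping could be attributed to its closest stop by straight-line distance:

from math import hypot
from typing import Dict, Tuple

def nearest_stop_id(ping: Tuple[float, float], stops: Dict[str, Tuple[float, float]]) -> str:
    # stops maps stop_id -> (lat, lon); real interpolation along the
    # trip's shape, as described above, would be more accurate.
    lat, lon = ping
    return min(stops, key=lambda s: hypot(lat - stops[s][0], lon - stops[s][1]))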

Ensure service is always running

We can't afford for the service to stop running for long stretches of time, so we need ways to ensure it stays up:

  • Restarts
    • We should ensure that the service restarts every night when the T isn't running to clear any potential memory leaks and things that can be fixed with a restart
  • Monitoring
    • Datadog should notify us when the service isn't currently running so we can intervene

Reduce data loss on deploy

When we push a new deploy, a few things happen or can happen:

  1. Service restarts, missing a few data points

In rare cases, the EC2 instance needs to be recreated, and then we're going to miss a lot of data points. Things we can do to reduce that:

  1. Force data to be pushed to S3 during a deploy so we lose the minimal amount.
  2. When the new EC2 instance starts up, pull down all of today's data from S3 so the service doesn't overwrite earlier data on its first upload (sketched below).
  3. Only apply these updates at 3am 🙃
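
A hedged sketch of step 2 with boto3 (bucket and prefix names are placeholders):

from pathlib import Path

import boto3

def restore_todays_data(bucket: str, prefix: str, dest: Path) -> None:
    # On startup, pull today's files back down from S3 so the first
    # upload after a redeploy doesn't clobber earlier data.
    s3 = boto3.client("s3")
    for page in s3.get_paginator("list_objects_v2").paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            target = dest / obj["Key"]
            target.parent.mkdir(parents=True, exist_ok=True)
            s3.download_file(bucket, obj["Key"], str(target))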

Unit Tests

We want to make sure things work as we expect, and continue to work. We should have a few unit tests.
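
A hypothetical first test, assuming src/ is on the import path and that util.service_date maps early-morning times to the previous GTFS service date (a common transit convention, unverified here):

from datetime import datetime

import util

def test_service_date_rollover():
    early = datetime(2024, 1, 8, 2, 30, tzinfo=util.EASTERN_TIME)
    noon_before = datetime(2024, 1, 7, 12, 0, tzinfo=util.EASTERN_TIME)
    assert util.service_date(early) == util.service_date(noon_before)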
