
gobble's Introduction

gobble


(Screenshot of gobble in action)

Gobble is a service that reads all train and bus events from the MBTA V3 Streaming API and writes them out in a format that the TransitMatters Data Dashboard can understand.

Requirements to develop locally

  • Python 3
  • Poetry

Development Instructions

  1. Duplicate config/template.json into config/local.json, and replace the null with your MBTA V3 API key.
  2. In the root directory, run poetry install to install dependencies.
  3. Run poetry run python3 src/gobble.py to start.
  4. Output will be written to data/ in your current working directory. Good luck!
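
For reference, your config/local.json might end up looking something like this (the actual key name comes from config/template.json; "mbta_v3_api_key" is just a placeholder assumption here):

{
    "mbta_v3_api_key": "your-api-key-here"
}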

Linting

You can run the linters against any code changes with the following commands:

$ poetry run flake8 src
$ poetry run black --check src

Support TransitMatters

If you've found this app helpful or interesting, please consider donating to TransitMatters to help support our mission to provide data-driven advocacy for a more reliable, sustainable, and equitable transit system in Metropolitan Boston.

gobble's People

Contributors

dependabot[bot], devinmatte, hamima-halim, hhalim1, idreyn, mathcolo, nathan-weinberg


Forkers

nathan-weinberg

gobble's Issues

Commuter Rail missing stops

When looking at the commuter rail data collected by stop, we seem to be missing several inbound and outbound stops

https://github.com/transitmatters/t-performance-dash/blob/d5913dd893bf29bb419868632ce60681f5a2615f/common/constants/cr_constants/cr-fairmount.json

It seems especially true at the ends of lines, but not exclusively:

https://github.com/transitmatters/t-performance-dash/blob/d5913dd893bf29bb419868632ce60681f5a2615f/common/constants/cr_constants/cr-worcester.json

Is the issue on our end or the MBTA side?

(Note that these lists are generated based on the list of stops gobble puts in S3.)

Add Commuter Rail lines

To be able to populate the data dashboard with commuter rail data, it would be nice to start processing at least one commuter rail line, so we can start setting up the frontend against it 🚆

Crash due to nonexistent stop id

Jan 08 02:51:27 ip-172-31-95-22 poetry[3252872]: Traceback (most recent call last):
Jan 08 02:51:27 ip-172-31-95-22 poetry[3252872]:   File "/home/ubuntu/gobble/src/gobble.py", line 45, in <module>
Jan 08 02:51:27 ip-172-31-95-22 poetry[3252872]:     main()
Jan 08 02:51:27 ip-172-31-95-22 poetry[3252872]:   File "/home/ubuntu/gobble/src/gobble.py", line 40, in main
Jan 08 02:51:27 ip-172-31-95-22 poetry[3252872]:     process_event(update, current_stop_state, gtfs_service_date, scheduled_trips, scheduled_stop_times, stops)
Jan 08 02:51:27 ip-172-31-95-22 poetry[3252872]:   File "/home/ubuntu/.cache/pypoetry/virtualenvs/gobble-i42h0hpV-py3.11/lib/python3.11/site-packages/ddtrace/tracer.py", line 975, in func_wrapper
Jan 08 02:51:27 ip-172-31-95-22 poetry[3252872]:     return f(*args, **kwargs)
Jan 08 02:51:27 ip-172-31-95-22 poetry[3252872]:            ^^^^^^^^^^^^^^^^^^
Jan 08 02:51:27 ip-172-31-95-22 poetry[3252872]:   File "/home/ubuntu/gobble/src/event.py", line 85, in process_event
Jan 08 02:51:27 ip-172-31-95-22 poetry[3252872]:     ) = reduce_update_event(update)
Jan 08 02:51:27 ip-172-31-95-22 poetry[3252872]:         ^^^^^^^^^^^^^^^^^^^^^^^^^^^
Jan 08 02:51:27 ip-172-31-95-22 poetry[3252872]:   File "/home/ubuntu/gobble/src/event.py", line 49, in reduce_update_event
Jan 08 02:51:27 ip-172-31-95-22 poetry[3252872]:     stop_id = update["relationships"]["stop"]["data"]["id"]
Jan 08 02:51:27 ip-172-31-95-22 poetry[3252872]:               ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^
Jan 08 02:51:27 ip-172-31-95-22 poetry[3252872]: TypeError: 'NoneType' object is not subscriptable
Jan 08 02:51:28 ip-172-31-95-22 systemd[1]: gobble.service: Main process exited, code=exited, status=1/FAILURE
Jan 08 02:51:28 ip-172-31-95-22 systemd[1]: gobble.service: Failed with result 'exit-code'.
Jan 08 02:51:28 ip-172-31-95-22 systemd[1]: gobble.service: Consumed 45min 9.973s CPU time.
Jan 08 02:51:33 ip-172-31-95-22 systemd[1]: gobble.service: Scheduled restart job, restart counter is at 3963.
Jan 08 02:51:33 ip-172-31-95-22 systemd[1]: Stopped gobble.
Jan 08 02:51:33 ip-172-31-95-22 systemd[1]: gobble.service: Consumed 45min 9.973s CPU time.
Jan 08 02:51:33 ip-172-31-95-22 systemd[1]: Started gobble.
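
A possible fix is to guard against the null stop relationship before subscripting into it. A hypothetical sketch (safe_stop_id is not an existing function in the codebase):

def safe_stop_id(update: dict) -> str | None:
    # The traceback shows update["relationships"]["stop"]["data"] can be
    # None; return None in that case so the caller can skip the event.
    stop_data = update["relationships"]["stop"]["data"]
    if stop_data is None:
        return None
    return stop_data["id"]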

Add more bus lines to gobble

Gobble currently queries and uploads realtime data for the route 1 bus only. We'll want to add at least the bus lines currently available on the dashboard so that we have parity with the ~monthly backfill data.

Calculate scheduled headway per event

In order to display colored dots in the data dashboard, scheduled_headway needs to be filled in.

Unfortunately, we don't immediately know what the scheduled headway is from a particular vehicle, so we need to calculate it ourselves from GTFS. It also might be possible to request them on-demand, once per day, from the MBTA v3 API.
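
One hedged sketch of the GTFS approach: given the scheduled departures at a stop (filtered to the same route and direction from stop_times.txt), the scheduled headway for an event is the gap between the scheduled departure nearest the event and the one before it. All names below are hypothetical:

from datetime import datetime
from typing import List, Optional

def scheduled_headway(scheduled_departures: List[datetime], event_time: datetime) -> Optional[float]:
    # Find the first scheduled departure at or after this event and
    # return the gap (in seconds) to the scheduled departure before it.
    times = sorted(scheduled_departures)
    for prev, curr in zip(times, times[1:]):
        if curr >= event_time:
            return (curr - prev).total_seconds()
    return None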

Check for new GTFS bundle every day

At the moment the list of GTFS bundles is read into memory once at launch, but we should check every day whether that's still the bundle we should be using. (Or something?)
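
A minimal sketch of that daily check, with a hypothetical load_bundle_list refresher standing in for however the list is actually loaded today:

import datetime as dt

_last_checked: dt.date | None = None

def maybe_refresh_gtfs(load_bundle_list) -> None:
    # Re-read the bundle list at most once per calendar day,
    # instead of only once at launch.
    global _last_checked
    today = dt.date.today()
    if _last_checked != today:
        load_bundle_list()  # assumed to refresh the in-memory bundle list
        _last_checked = today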

Launch into a private subnet by default

#71 going in means gobble doesn't need a public IPv4 address. But the current public subnets auto-assign public IPv4 addresses (and need to for NTT and RRE), so we should:

a) Create a private subnet
b) Codify that private subnet in gobble's CF.

Split `state.json` into one file per route

The problem: writes to state.json are dominating our process_event calls:

(profiler screenshot)

The file contains JSON that grows over the course of the day by about 30kb/hr as more trips are added. That's a lot of data for a single JSON file, but not that much. We hypothesize that our event-processing threads (one each for CR and rapid transit, many for bus) are all trying to write this file simultaneously, and that most of the apparent write time is really time spent waiting on a lock (see this thread).

The proposed solution is to split state.json into one file per route (or at least, one file per thread). Here's a sketch of the change:

1. Create a new trip_state.py file with this API:

from dataclasses import dataclass, field
from datetime import datetime
from typing import Dict, Optional

@dataclass
class TripState:
    # Holds the current state for a single trip
    stop_sequence: int
    stop_id: str
    updated_at: datetime
    event_type: str

@dataclass
class RouteTripsState:
    # Holds the current state for all trips in a route
    route_id: str
    service_date: str
    trips: Dict[str, TripState] = field(default_factory=dict)

    def __post_init__(self):
        # Loads the latest trip state from ./data/trip_states/{self.route_id}.json
        ...

    def update_trip_state(self, trip_id: str, trip_state: TripState) -> None:
        # Updates the in-memory representation of a specific trip's state
        # and writes the result to disk at ./data/trip_states/{route_id}.json
        # 🤔 do we really need to do this each time?
        ...

    def get_trip_state(self, trip_id: str) -> Optional[TripState]:
        # Returns the latest trip state for this trip (from memory)
        ...

    def purge_trip_state(self) -> None:
        # Purges trip state in-memory and on-disk (done overnight)
        ...

2. Instantiate one RouteTripsState per route

This could be done with a global dict, but we might as well keep it local to the thread that manages the route's trips.

3. Make sure the trips state is purged nightly

Using logic similar to the GTFS rollover calculations:

# check for new day
gtfs_service_date = util.service_date(datetime.now(util.EASTERN_TIME))
updated_at = datetime.fromisoformat(update["attributes"]["updated_at"])
service_date = util.service_date(updated_at)

if gtfs_service_date != service_date:
    # Purge trip state
    ...

(We should probably extract a helper method for this check, since we'll need it in a number of places; a sketch follows.)
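
A possible shape for that helper, reusing util.service_date and util.EASTERN_TIME from the snippet above:

from datetime import datetime

import util  # the project's util module, per the snippet above

def is_past_service_date(update: dict) -> bool:
    # True when the update belongs to an earlier GTFS service date
    # than the current one, i.e. the trip state should be purged.
    gtfs_service_date = util.service_date(datetime.now(util.EASTERN_TIME))
    updated_at = datetime.fromisoformat(update["attributes"]["updated_at"])
    return util.service_date(updated_at) != gtfs_service_date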

Optimize S3 Calls

S3 charges $0.005 per 1,000 requests, and right now we make one request per stop per half hour.

We should see if there are ways to optimize the calls slightly without reducing the frequency.

https://github.com/transitmatters/gobble/blob/main/src/s3_upload.py#L52-L56

We now upload all files from the current day on every upload pass. We should maybe limit this to files updated in the last hour again? That would help reduce uploads for commuter rail, weird bus stops, and rapid transit during shutdowns (a possible filter is sketched below).

These would help reduce upload time as well as cost.
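
A hedged sketch of that filter, assuming the event files are CSVs under the local data directory:

import time
from pathlib import Path

def recently_updated(data_dir: Path, max_age_s: int = 3600) -> list[Path]:
    # Only consider files whose mtime falls within the last hour,
    # rather than everything written so far today.
    cutoff = time.time() - max_age_s
    return [p for p in data_dir.rglob("*.csv") if p.stat().st_mtime >= cutoff]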

Duplicate events

Maybe this bus became stalled or something? gobble wrote the same event to disk a bunch of times because the current stop kept flip-flopping. (A possible dedup guard is sketched after the sample rows.)

2023-11-03,66,58403721,1,2553,6,0,1734,DEP,2023-11-04T03:34:51.000Z,0,0
2023-11-03,66,58403721,1,2553,6,0,1734,DEP,2023-11-04T03:34:54.000Z,0,0
2023-11-03,66,58403721,1,2553,6,0,1734,DEP,2023-11-04T03:34:54.000Z,0,0
2023-11-03,66,58403721,1,2553,6,0,1734,DEP,2023-11-04T03:34:57.000Z,0,0
2023-11-03,66,58403721,1,2553,6,0,1734,DEP,2023-11-04T03:35:00.000Z,0,0
2023-11-03,66,58403721,1,2553,6,0,1734,DEP,2023-11-04T03:35:03.000Z,0,0
2023-11-03,66,58403721,1,2553,6,0,1734,DEP,2023-11-04T03:35:03.000Z,0,0
2023-11-03,66,58403721,1,2553,6,0,1734,DEP,2023-11-04T03:35:06.000Z,0,0
2023-11-03,66,58403721,1,2553,6,0,1734,DEP,2023-11-04T03:35:11.000Z,0,0
2023-11-03,66,58403721,1,2553,6,0,1734,DEP,2023-11-04T03:35:39.000Z,0,0
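
One possible guard, hypothetical and untested: before appending a row, compare it to the last row written for the trip, ignoring the timestamp column (index 9 in the sample above):

def is_duplicate_event(prev_row: list[str] | None, row: list[str], ts_index: int = 9) -> bool:
    # Treat the event as a duplicate if every column except the
    # timestamp matches the last row written for this trip.
    if prev_row is None:
        return False
    return all(a == b for i, (a, b) in enumerate(zip(prev_row, row)) if i != ts_index)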

Store Rapid transit data without prefix

Right now we're storing data as Green-B_0_70121, when the data dashboard expects to be able to simply read 70121.

This makes using gobble rapid transit data in the dashboard more complicated than it needs to be.
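
If the prefixed format is consistently {route}_{direction}_{stop}, a hypothetical normalization could be as simple as:

def normalize_stop_id(raw_id: str) -> str:
    # "Green-B_0_70121" -> "70121"; ids without a prefix pass through unchanged
    return raw_id.rsplit("_", 1)[-1]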

Data Quality: Headways

When viewing commuter rail data in the dashboard, headways are often missing points on the graph, yet the points that do exist read out realistic headway numbers.

(Screenshot 2023-12-26 at 12.06.07 PM)

Some charts look very strange

(Screenshot 2023-12-26 at 12.07.28 PM)

Add more bus routes to tracking

YEAR OF THE BUS! Let's add a bunch more of the bus routes that don't share a schedule with any other bus routes:

  • line-7
  • line-8
  • line-10
  • line-11
  • line-14
  • line-18
  • line-26
  • line-29
  • line-30
  • line-31
  • line-35
  • line-36
  • line-37
  • line-38
  • line-42
  • line-43
  • line-44
  • line-51
  • line-100
  • line-101
  • line-105
  • line-106
  • line-108
  • line-110
  • line-112
  • line-119
  • line-134
  • line-195
  • line-215
  • line-230
  • line-236
  • line-238
  • line-240
  • line-354
  • line-429
  • line-455
  • line-69
  • line-80
  • line-83
  • line-87
  • line-88
  • line-90
  • line-93
  • line-94
  • line-95
  • line-96
  • line-97
  • line-99

Make data available to the dashboard

So right now events are being written to disk on the instance. The data dashboard needs to get a hold of them somehow, though. I see a few distinct options...

a) Upload the events files to S3 overnight every night, accepting that we just won't have live bus or CR. (lame)
b) Every time we append an events.csv on disk, upload the entire thing to S3. (maybe, but like...no)
c) Serve live events over http that the dashboard can request on-demand.

My hunch is that we want to do (c), with some (a) sprinkled in. It's cool when things are live, and we shouldn't give that up. So rough steps:

  1. In a new process, create an express server that serves up events from ./output (a rough stand-in is sketched after this list).
  2. Throw a load balancer in front of it, and wire up the load balancer to a .labs DNS record with the wildcard cert for https.
  3. Maybe add pre-shared key auth, since this is for internal use only?
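
For illustration only, a minimal Python stand-in for step 1 (the issue calls for an express server; this just serves whatever is in ./output over plain HTTP):

from functools import partial
from http.server import HTTPServer, SimpleHTTPRequestHandler

# Serve files from ./output on port 8080 (both choices are arbitrary here)
handler = partial(SimpleHTTPRequestHandler, directory="./output")
HTTPServer(("0.0.0.0", 8080), handler).serve_forever()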

FAQ
a) Why can't the dashboard talk to the EC2 instance via its private IP, so that we can keep the EC2 instance off the public internet? That's possible, but it's a pain.
b) If the EC2 instance has a public IP address, can the dashboard lambda just talk to that? Yes, it could. But the load balancer option lets us easily add https using the wildcard cert, which, even with no private data involved, is good citizenship.

Intuit stop ID for vehicles that don't report them

Fairly regularly (5000-ish times, maybe), we get GPS pings from vehicles that report their route information but not their upcoming stop. Here's a map of ~2000 such pings over the course of a day: https://www.google.com/maps/d/u/0/edit?mid=1ttstvWGxhXTY62ZOA7YYQr-o3Srnbj8&usp=sharing

We currently ignore these pings, which still gives us decent headway calculations but can produce holes in our records. As far as I can tell, there are three types of stop ID outages we tend to see:

  1. Short-stretch outages, which last less than a minute and tend to happen at the beginning/end of the stop. These are short enough that we could probably ignore them and still have reasonable calculations, even if they happen in the middle of a trip.
  2. Medium-stretch outages, which last maybe 2-10 minutes at a time. We see these a lot on the 39 (potentially caused by a glitchy AVL), and they can cause us to lose information for a couple of stops.
  3. Long-stretch outages, which might be because the AVL for a vehicle wasn't turned on but GPS was still reporting info.

If a vehicle has been dark for more than ~a minute, we should start trying to interpolate its progress along its shape, if possible (a crude starting point is sketched after the list below). There will probably be some complexity with respect to:

  • small route diversions
  • figuring out the inbound/outbound direction of the trip if that's null (we might be able to grab this from GTFS, or just use previous ping info)
  • how to store shape information in memory for quick enough calculations without ballooning memory usage (maybe just cache the more problematic route shapes)
  • monitoring the duration of a vehicle's outage in order to kick off this calculation
  • determining whether an event is an arrival or a departure (or neither); I haven't seen any STOPPED_AT events in these outages, and it's unclear whether the vehicles are actually making stops and not reporting them
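
As a crude starting point (a hypothetical sketch that ignores the route shape entirely), a dark vehicle's ping could be attributed to its closest stop by straight-line distance:

from math import hypot
from typing import Dict, Tuple

def nearest_stop_id(ping: Tuple[float, float], stops: Dict[str, Tuple[float, float]]) -> str:
    # stops maps stop_id -> (lat, lon); real interpolation along the
    # trip's shape, as described above, would be more accurate.
    lat, lon = ping
    return min(stops, key=lambda s: hypot(lat - stops[s][0], lon - stops[s][1]))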

Ensure service is always running

We can't afford for the service to stop running for long stretches of time, so we need ways to ensure it stays up:

  • Restarts
    • We should ensure that the service restarts every night when the T isn't running to clear any potential memory leaks and things that can be fixed with a restart
  • Monitoring
    • Datadog should notify us when the service isn't currently running so we can intervene

Reduce data loss on deploy

When we push a new deploy, a few things happen or can happen:

  1. Service restarts, missing a few data points

In rare cases, the EC2 instance needs to be recreated, and then we're going to miss a lot of data points. Things we can do to reduce that:

  1. Force data to be pushed to S3 during a deploy so we lose the minimal amount.
  2. When the new EC2 instance starts up, pull down all of today's data from S3 so the service doesn't overwrite earlier data on its first upload (sketched below).
  3. Only apply these updates at 3am 🙃
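
A hedged sketch of step 2 with boto3 (bucket and prefix names are placeholders):

from pathlib import Path

import boto3

def restore_todays_data(bucket: str, prefix: str, dest: Path) -> None:
    # On startup, pull today's files back down from S3 so the first
    # upload after a redeploy doesn't clobber earlier data.
    s3 = boto3.client("s3")
    for page in s3.get_paginator("list_objects_v2").paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            target = dest / obj["Key"]
            target.parent.mkdir(parents=True, exist_ok=True)
            s3.download_file(bucket, obj["Key"], str(target))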

Unit Tests

We want to make sure things work as we expect, and continue to work. We should have a few unit tests.
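
A hypothetical first test, assuming src/ is on the import path and that util.service_date maps early-morning times to the previous GTFS service date (a common transit convention, unverified here):

from datetime import datetime

import util

def test_service_date_rollover():
    early = datetime(2024, 1, 8, 2, 30, tzinfo=util.EASTERN_TIME)
    noon_before = datetime(2024, 1, 7, 12, 0, tzinfo=util.EASTERN_TIME)
    assert util.service_date(early) == util.service_date(noon_before)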
