transitmatters / gobble
📦 Process MBTA events into a format that can be consumed by the Data Dashboard
License: MIT License
To populate the data dashboard with commuter rail data, it would be nice to start processing at least one commuter rail line, so there's something to start setting up the frontend against.
Maybe this bus stalled or something? gobble wrote the same event to disk a bunch of times because the current stop kept flip-flopping:
2023-11-03,66,58403721,1,2553,6,0,1734,DEP,2023-11-04T03:34:51.000Z,0,0
2023-11-03,66,58403721,1,2553,6,0,1734,DEP,2023-11-04T03:34:54.000Z,0,0
2023-11-03,66,58403721,1,2553,6,0,1734,DEP,2023-11-04T03:34:54.000Z,0,0
2023-11-03,66,58403721,1,2553,6,0,1734,DEP,2023-11-04T03:34:57.000Z,0,0
2023-11-03,66,58403721,1,2553,6,0,1734,DEP,2023-11-04T03:35:00.000Z,0,0
2023-11-03,66,58403721,1,2553,6,0,1734,DEP,2023-11-04T03:35:03.000Z,0,0
2023-11-03,66,58403721,1,2553,6,0,1734,DEP,2023-11-04T03:35:03.000Z,0,0
2023-11-03,66,58403721,1,2553,6,0,1734,DEP,2023-11-04T03:35:06.000Z,0,0
2023-11-03,66,58403721,1,2553,6,0,1734,DEP,2023-11-04T03:35:11.000Z,0,0
2023-11-03,66,58403721,1,2553,6,0,1734,DEP,2023-11-04T03:35:39.000Z,0,0
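One possible fix is to dedupe before writing: skip an event if it matches the last one we recorded for that trip, ignoring the timestamp. A minimal sketch (function and field names are assumptions, not gobble's actual API):

def should_write_event(event: dict, last_written: dict) -> bool:
    # Key on everything that identifies the event except its timestamp,
    # so flip-flopping pings don't produce duplicate rows like the ones above
    key = (event["stop_id"], event["stop_sequence"], event["event_type"])
    if last_written.get(event["trip_id"]) == key:
        return False
    last_written[event["trip_id"]] = key
    return True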
Right now we're storing data as Green-B_0_70121, but the data dashboard expects to be able to simply read 70121. This makes using gobble's rapid transit data in the dashboard more complicated than it needs to be.
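For reference, a tiny hedged sketch of recovering the bare stop ID from the composite key (assuming the format is always {route_id}_{direction_id}_{stop_id}):

def bare_stop_id(composite_key: str) -> str:
    # rsplit from the right so route IDs containing underscores stay intact
    return composite_key.rsplit("_", 1)[-1]

assert bare_stop_id("Green-B_0_70121") == "70121"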
S3 charges $0.005 per 1,000 requests, and right now we make one request per stop per half hour.
We should see if there are ways we can optimize the calls slightly without reducing the frequency:
https://github.com/transitmatters/gobble/blob/main/src/s3_upload.py#L52-L56
We now upload all files from the current day on every upload cycle. We should maybe limit this to files updated in the last hour again? That would help reduce uploads for commuter rail, weird bus stops, and rapid transit during shutdowns.
These changes would help reduce upload time as well as cost.
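One hedged sketch of the "last hour" filter (the directory layout and file extension are assumptions):

import time
from pathlib import Path

def recently_updated_files(data_dir: str, max_age_seconds: int = 3600) -> list[Path]:
    # Only upload files modified within the last hour,
    # instead of every file from the current service day
    cutoff = time.time() - max_age_seconds
    return [p for p in Path(data_dir).rglob("*.csv") if p.stat().st_mtime >= cutoff]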
Fairly regularly (5,000-ish times, maybe), we get GPS pings from vehicles that report their route information but not their upcoming stop. Here's a map of ~2,000 such pings over the course of a day: https://www.google.com/maps/d/u/0/edit?mid=1ttstvWGxhXTY62ZOA7YYQr-o3Srnbj8&usp=sharing
We currently ignore these pings, which does give us decent headway calculations but can produce holes in our records. As far as I can tell, there are three types of stop ID outages we tend to see.
If a vehicle has been dark for more than about a minute, we should start trying to interpolate its progress along its shape if possible (see the sketch below). There will probably be some complexity with respect to STOPPED_AT events in these outages, and it's unclear whether the vehicle is actually making stops without reporting them.
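A rough sketch of what that interpolation could look like, assuming constant speed between the last two pings (shapely is not currently a stated gobble dependency; all names here are assumptions):

from shapely.geometry import LineString, Point

def interpolate_dark_position(
    shape: LineString, prev_ping: Point, last_ping: Point,
    seconds_between_pings: float, seconds_dark: float,
) -> Point:
    # Project both pings onto the route shape to get distance-along-shape,
    # then extrapolate at the vehicle's most recently observed speed
    d_prev = shape.project(prev_ping)
    d_last = shape.project(last_ping)
    speed = (d_last - d_prev) / seconds_between_pings
    return shape.interpolate(d_last + speed * seconds_dark)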
Hello! I hope this isn't an unnecessary question from a first-timer, but I didn't see any other related issues from a simple search.
When poking around the MBTA API docs for GTFS, I saw this:
As of February 21, 2019, the Massport shuttles will no longer be a part of the MBTA GTFS Feed. Developers can find that data from Trillium Transit. Please note that stop IDs and route IDs are different between the two feeds.
It seems like there's an interest in adding new modes of transit (at least I assume so, given the existence of #77). Would including this data be something useful?
I realize it probably isn't a good first issue given the note about the different IDs, but I figured I'd at least ask so that the answer is documented for the future.
When we push a new deploy, a few things happen or can happen:
In a rare case, the EC2 instance needs to be recreated; when that happens, we're going to miss a lot of data points. Things we can do to reduce that:
Last step in us having full observability for gobble: https://docs.datadoghq.com/tracing/other_telemetry/connect_logs_and_traces/python/
We should have Datadog monitoring both at the EC2 instance level (agent) and the Python code level (APM).
This will allow us to track EC2 resources, network usage, and code performance:
https://docs.datadoghq.com/agent/basic_agent_usage/ansible/
https://docs.datadoghq.com/tracing/trace_collection/dd_libraries/python/
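Per the first link, connecting logs and traces in Python mostly amounts to log injection. A minimal sketch (format string follows the Datadog docs; exact setup may differ in our service):

import logging

from ddtrace import patch

# Patching logging makes ddtrace inject trace/span IDs into log records,
# which the format below emits so Datadog can correlate logs with traces
patch(logging=True)

FORMAT = (
    "%(asctime)s %(levelname)s [%(name)s] "
    "[dd.trace_id=%(dd.trace_id)s dd.span_id=%(dd.span_id)s] %(message)s"
)
logging.basicConfig(format=FORMAT)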
We can't afford for the service to stop running for a long period of time, so we need ways to ensure it is running
Gobble currently queries and uploads realtime data for the 1 bus only. We'll want to add at least the bus lines currently available on the dashboard so that we have parity with the ~monthly backfill data.
The problem: writes to state.json are dominating our process_event calls.
The file contains JSON that grows in size by about 30 KB/hr over the course of the day as more trips are added. That's a lot of data for a single JSON file, but not that much. We hypothesize that our event processing threads (one each for CR and rapid transit, many for bus) are all trying to write this file simultaneously, and most of the apparent write time is really waiting on a lock (see this thread).
The proposed solution is to split state.json into one file per route (or at least, one file per thread). Here's a sketch of the change:
A new trip_state.py file with this API:

from dataclasses import dataclass, field
from datetime import datetime
from typing import Dict, Optional

@dataclass
class TripState:
    # Holds the current state for a single trip
    stop_sequence: int
    stop_id: str
    updated_at: datetime
    event_type: str

@dataclass
class RouteTripsState:
    # Holds the current state for all trips in a route (Dict[str, TripState])
    route_id: str
    service_date: str
    trips: Dict[str, TripState] = field(default_factory=dict)

    def __post_init__(self):
        # Loads the latest trip state from ./data/trip_states/{self.route_id}.json
        ...

    def update_trip_state(self, trip_id: str, trip_state: TripState) -> None:
        # Updates the in-memory representation of a specific trip's state
        # and writes the result to disk at ./data/trip_states/{route_id}.json
        # 🤔 do we really need to do this each time?
        ...

    def get_trip_state(self, trip_id: str) -> Optional[TripState]:
        # Returns the latest trip state for this trip (from memory)
        ...

    def purge_trip_state(self) -> None:
        # Purges trip state in memory and on disk (done overnight)
        ...
One RouteTripsState per route. This could be done with a global dict, but we might as well keep it local to the thread that manages trips.
Using logic similar to the GTFS rollover calculations:
# check for new day
gtfs_service_date = util.service_date(datetime.now(util.EASTERN_TIME))
updated_at = datetime.fromisoformat(update["attributes"]["updated_at"])
service_date = util.service_date(updated_at)
if gtfs_service_date != service_date:
    # Purge trip state
    ...
(We should probably extract a helper method for this check since we'll do it in a number of places)
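Something like this, reusing util from the snippet above (the helper name is a placeholder):

from datetime import datetime

def is_past_rollover(update: dict) -> bool:
    # True when the update's service date no longer matches today's
    # GTFS service date, i.e. we've rolled over to a new day
    gtfs_service_date = util.service_date(datetime.now(util.EASTERN_TIME))
    updated_at = datetime.fromisoformat(update["attributes"]["updated_at"])
    return util.service_date(updated_at) != gtfs_service_date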
We want to make sure things work as we expect, and continue to work, so we should have a few unit tests.
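For example, a first test against the sketched API above (pytest style; values borrowed from the CSV sample earlier, details are assumptions):

from datetime import datetime

def test_update_then_get_round_trips():
    state = RouteTripsState(route_id="66", service_date="2023-11-03")
    trip = TripState(stop_sequence=6, stop_id="1734",
                     updated_at=datetime.now(), event_type="DEP")
    state.update_trip_state("58403721", trip)
    # The state we just wrote should come straight back out of memory
    assert state.get_trip_state("58403721") == trip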
YEAR OF THE BUS: let's add a bunch more of the bus routes that don't share a schedule with any other bus routes.
We've recently added more bus routes to the data dashboard; we should add the same routes here.
The live list to reference is https://github.com/transitmatters/t-performance-dash/blob/main/common/types/lines.ts#L12
At the moment, the list of GTFS bundles is read into memory once at launch, but we should check every day to see whether that's still the one we should be using. (Or something?)
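A hedged sketch of the daily re-check (load_bundles stands in for whatever currently runs once at launch):

import threading

def schedule_daily_bundle_refresh(state: dict, interval: float = 86400.0) -> None:
    # Reload the GTFS bundle list now, then again once a day on a daemon
    # timer so a long-running gobble process picks up new bundles
    state["bundles"] = load_bundles()
    timer = threading.Timer(interval, schedule_daily_bundle_refresh, [state, interval])
    timer.daemon = True
    timer.start()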
#71 going in means gobble doesn't need a public IPv4 address. But the current public subnets auto-assign public IPv4 addresses (and need to for NTT and RRE), so we should:
a) Create a private subnet
b) Codify that private subnet in gobble's CF.
When looking at the commuter rail data collected by stop, we seem to be missing several inbound and outbound stops.
This seems especially true for the ends of lines, but not exclusively.
Is the issue on our end or the MBTA side?
(Note that these lists are generated based on the list of stops gobble puts in S3.)
In order to display colored dots in the data dashboard, scheduled_headway needs to be filled in.
Unfortunately, we don't immediately know what the scheduled headway is from a particular vehicle, so we need to calculate it ourselves from GTFS. It also might be possible to request them on-demand, once per day, from the MBTA v3 API.
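If we go the GTFS route, the calculation is basically: group scheduled departures by (route, direction, stop), sort, and diff. A sketch under assumed field names:

from collections import defaultdict

def scheduled_headways(stop_times: list[dict]) -> dict[tuple, list[int]]:
    # departure_seconds is assumed to be seconds after midnight from stop_times.txt
    departures = defaultdict(list)
    for st in stop_times:
        key = (st["route_id"], st["direction_id"], st["stop_id"])
        departures[key].append(st["departure_seconds"])
    # Headway at a stop = gap between consecutive scheduled departures
    return {
        key: [b - a for a, b in zip(times, times[1:])]
        for key, times in ((k, sorted(v)) for k, v in departures.items())
    }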
So right now events are being written to disk on the instance. The data dashboard needs to get a hold of them somehow, though. I see a few distinct options...
a) Upload the events files to S3 overnight every night, accepting that we just won't have live bus or CR. (lame)
b) Every time we append an events.csv on disk, upload the entire thing to S3. (maybe, but like...no)
c) Serve live events over http that the dashboard can request on-demand.
My hunch is that we want to do (c), with some (a) sprinkled in. It's cool when things are live, and we shouldn't give that up. So, rough steps:
An express server that serves up events from ./output (see the sketch after the FAQ)
A ..labs DNS record with the wildcard cert for https
FAQ
a) Why can't the dashboard talk to the EC2 instance via its private IP, such that we can keep the EC2 instance off the public internet? That's possible, but it's a pain.
b) If the EC2 instance has a public IP address, can the dashboard lambda just talk to that? Yes, it could. But the load balancer option lets us easily add https using the wildcard cert, which, even with no private data involved, is good citizenship.
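The steps above suggest express; here's the same idea as a Python sketch instead (Flask, to match the rest of the repo's language; the route and port are assumptions):

from flask import Flask, send_from_directory

app = Flask(__name__)

@app.route("/events/<path:filename>")
def events(filename: str):
    # Serve event files straight out of gobble's output directory
    return send_from_directory("./output", filename)

if __name__ == "__main__":
    app.run(port=8080)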
Jan 08 02:51:27 ip-172-31-95-22 poetry[3252872]: Traceback (most recent call last):
Jan 08 02:51:27 ip-172-31-95-22 poetry[3252872]: File "/home/ubuntu/gobble/src/gobble.py", line 45, in <module>
Jan 08 02:51:27 ip-172-31-95-22 poetry[3252872]: main()
Jan 08 02:51:27 ip-172-31-95-22 poetry[3252872]: File "/home/ubuntu/gobble/src/gobble.py", line 40, in main
Jan 08 02:51:27 ip-172-31-95-22 poetry[3252872]: process_event(update, current_stop_state, gtfs_service_date, scheduled_trips, scheduled_stop_times, stops)
Jan 08 02:51:27 ip-172-31-95-22 poetry[3252872]: File "/home/ubuntu/.cache/pypoetry/virtualenvs/gobble-i42h0hpV-py3.11/lib/python3.11/site-packages/ddtrace/tracer.py", line 975, in func_wrapper
Jan 08 02:51:27 ip-172-31-95-22 poetry[3252872]: return f(*args, **kwargs)
Jan 08 02:51:27 ip-172-31-95-22 poetry[3252872]: ^^^^^^^^^^^^^^^^^^
Jan 08 02:51:27 ip-172-31-95-22 poetry[3252872]: File "/home/ubuntu/gobble/src/event.py", line 85, in process_event
Jan 08 02:51:27 ip-172-31-95-22 poetry[3252872]: ) = reduce_update_event(update)
Jan 08 02:51:27 ip-172-31-95-22 poetry[3252872]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^
Jan 08 02:51:27 ip-172-31-95-22 poetry[3252872]: File "/home/ubuntu/gobble/src/event.py", line 49, in reduce_update_event
Jan 08 02:51:27 ip-172-31-95-22 poetry[3252872]: stop_id = update["relationships"]["stop"]["data"]["id"]
Jan 08 02:51:27 ip-172-31-95-22 poetry[3252872]: ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^
Jan 08 02:51:27 ip-172-31-95-22 poetry[3252872]: TypeError: 'NoneType' object is not subscriptable
Jan 08 02:51:28 ip-172-31-95-22 systemd[1]: gobble.service: Main process exited, code=exited, status=1/FAILURE
Jan 08 02:51:28 ip-172-31-95-22 systemd[1]: gobble.service: Failed with result 'exit-code'.
Jan 08 02:51:28 ip-172-31-95-22 systemd[1]: gobble.service: Consumed 45min 9.973s CPU time.
Jan 08 02:51:33 ip-172-31-95-22 systemd[1]: gobble.service: Scheduled restart job, restart counter is at 3963.
Jan 08 02:51:33 ip-172-31-95-22 systemd[1]: Stopped gobble.
Jan 08 02:51:33 ip-172-31-95-22 systemd[1]: gobble.service: Consumed 45min 9.973s CPU time.
Jan 08 02:51:33 ip-172-31-95-22 systemd[1]: Started gobble.
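The failing line in the traceback is the stop lookup in reduce_update_event; some vehicle updates simply have a null stop relationship. A hedged guard (update shape inferred from the traceback and the MBTA v3 API):

from typing import Optional

def extract_stop_id(update: dict) -> Optional[str]:
    # relationships.stop.data can be null when a vehicle reports no upcoming
    # stop; return None instead of raising TypeError like the crash above
    stop = (update.get("relationships") or {}).get("stop") or {}
    data = stop.get("data")
    return None if data is None else data["id"]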