cal-itp / data-analyses
Place for sharing quick reports, and works in progress
Home Page: https://analysis.calitp.org
This came from Greg Newmark, who's working with Cal-ITP to analyze RT schedule adherence. He's seeing how well actual arrival times match scheduled arrival times, and how accurate those arrival estimates are for various periods of time leading up to the actual arrival. For example, how good are we at estimating arrival times when the vehicle is 5, 10, or 20 minutes away? Specifically, he's asked us how far in advance we might care to study. This is a good question for our GTFS Schedule dataset.
Answering this question means exploring both how long a rider has to wait between vehicles, as well as how long that vehicle has to travel between stops. If the route is hourly, but each stop is five minutes apart, then we're only looking at a max of five minutes between estimation and arrival. Conversely, on a route that arrives every five minutes but the stops are an hour apart, a rider would only care about the estimation for five minutes until the next bus arrives.
To answer this question, I propose we produce a plot of stop_times with the following axes:
Note that there are a ton of stop_times in any one dataset, let alone in the state database, so this needs to be filtered.
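The headway-versus-travel-time reasoning above could be sketched roughly as follows. This is only an illustration under stated assumptions: the function names and the `horizon_min` column are hypothetical, rows are plain dicts using the standard stop_times.txt field names, and headways are computed per stop without separating routes or directions.

```python
from collections import defaultdict


def to_seconds(gtfs_time):
    """Convert a GTFS HH:MM:SS string (hours may exceed 24) to seconds."""
    hours, minutes, seconds = (int(part) for part in gtfs_time.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def estimation_horizons(stop_times):
    """Return headway, run time, and their minimum (in minutes) per stop_time.

    Rows with no earlier arrival at the stop, or no earlier stop on the trip,
    are skipped because one of the two quantities is undefined for them.
    """
    rows = [dict(row, secs=to_seconds(row["arrival_time"])) for row in stop_times]

    # Run time: minutes since the previous stop on the same trip.
    by_trip = defaultdict(list)
    for row in rows:
        by_trip[row["trip_id"]].append(row)
    for trip_rows in by_trip.values():
        trip_rows.sort(key=lambda r: int(r["stop_sequence"]))
        for prev, cur in zip(trip_rows, trip_rows[1:]):
            cur["run_time_min"] = (cur["secs"] - prev["secs"]) / 60

    # Headway: minutes since the previous arrival at the same stop
    # (assumption: all trips serving a stop are pooled together).
    by_stop = defaultdict(list)
    for row in rows:
        by_stop[row["stop_id"]].append(row)
    for stop_rows in by_stop.values():
        stop_rows.sort(key=lambda r: r["secs"])
        for prev, cur in zip(stop_rows, stop_rows[1:]):
            cur["headway_min"] = (cur["secs"] - prev["secs"]) / 60

    return [
        {
            "trip_id": r["trip_id"],
            "stop_id": r["stop_id"],
            "headway_min": r["headway_min"],
            "run_time_min": r["run_time_min"],
            # The shorter of the two bounds how far ahead an estimate matters.
            "horizon_min": min(r["headway_min"], r["run_time_min"]),
        }
        for r in rows
        if "headway_min" in r and "run_time_min" in r
    ]


# Made-up example: an hourly route whose stops are five minutes apart.
sample_stop_times = [
    {"trip_id": "T1", "stop_id": "A", "stop_sequence": "1", "arrival_time": "08:00:00"},
    {"trip_id": "T1", "stop_id": "B", "stop_sequence": "2", "arrival_time": "08:05:00"},
    {"trip_id": "T2", "stop_id": "A", "stop_sequence": "1", "arrival_time": "09:00:00"},
    {"trip_id": "T2", "stop_id": "B", "stop_sequence": "2", "arrival_time": "09:05:00"},
]
horizons = estimation_horizons(sample_stop_times)
```

Plotting `headway_min` against `run_time_min` for a filtered sample of rows would be one way to produce the proposed chart.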
Meeting scheduled for Feb 24. Can build off of district presentations and show a variety of existing data statewide.
The QuVR MD template below will be filled out by a member of the data services team.
This allows us to describe the request, in a way that is easy to hand-off for analysis.
After the research phase, we will sync with the asker to figure out if the metric and dashboard pieces are needed.
Before starting research:
views.gtfs_schedule_fact_daily_feed_files
After reviewing research with the asker:
Is your feature request related to a problem? Please describe.
As analytics requests increase, there's a need to store outside data sources, as well as significant overlap in datasets. Along with establishing GCS buckets (#526), we should be able to catalog our canonical data sources somewhere and everyone imports the same file for their analyses.
Describe the solution you'd like
Use intake to catalog our various data sources, determine canonical datasets for outside sources like Census, CA open data portal, etc.
Describe alternatives you've considered
Additional context
See data catalogs in City of LA repos planning-entitlements.
Level of bus service for census tracts by pop density/jobs and CalEnvironScreen.
For Chad Edison, CalSTA
gtfs_schedule_dim_stop_times, gtfs_schedule_fact_daily_trips (let's filter to Wed or Thurs for weekday), gtfs_schedule_dim_stops for lat/lon
RAC All Jobs Excluding Federal Jobs and WAC All Jobs Excluding Federal Jobs?
Given an arbitrary area and time (i.e., a particular section of the state highway network, or the city of San Mateo), compute how many minutes of delay buses and rail have.
Will need to impute the time of entry, time of exit from schedule data and guess "how many minutes should this take" vs on average, how many minutes does this take
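One possible way to impute entry and exit times and compare them with observed travel times is sketched below. It assumes linear interpolation by distance between the scheduled stops on either side of the area boundary; the function names and all numbers are hypothetical.

```python
from statistics import mean


def interpolate_time(dist_before, time_before, dist_after, time_after, boundary_dist):
    """Estimate the scheduled time (minutes) at a boundary between two stops.

    Assumes the vehicle moves at constant speed between the two stops.
    """
    fraction = (boundary_dist - dist_before) / (dist_after - dist_before)
    return time_before + fraction * (time_after - time_before)


def average_delay_minutes(scheduled_minutes, observed_minutes):
    """Average observed traversal time minus the scheduled traversal time."""
    return mean(observed_minutes) - scheduled_minutes


# Made-up trip: times are minutes after midnight, distances are along the shape.
entry = interpolate_time(0, 480, 10, 500, boundary_dist=2.5)
exit_ = interpolate_time(20, 520, 30, 530, boundary_dist=25)
scheduled = exit_ - entry  # how many minutes this should take

# Made-up observed traversal times (minutes) for the same segment.
delay = average_delay_minutes(scheduled, [44, 46, 48])
```

Summing `delay` over all trips crossing the area in the chosen time window would give total minutes of delay.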
Create 2 datasets to upload to ArcGIS for Traffic Ops, eventually move this to Airflow to be scheduled to overwrite AGOL dataset at some frequency:
shapes.txt or creating one from stops.txt
Make sure each row represents what is needed, get rid of "duplicates".
Don't use AGOL hosted feature service and use credits. Use some public-facing geoportal?
gtfs_schedule tables: stops, stop_times, trips, routes, agencies
Is your feature request related to a problem? Please describe.
Implement an analytics / visualization style guide based on Cal-ITP's branded resources for slides / docs.
Describe the solution you'd like
Build a package part of shared_utils that will handle most of the styling for visualizations made within Jupyter Notebooks. Make it available for all the major charting packages: altair, matplotlib, seaborn, plotnine.
Describe alternatives you've considered
Additional context
Add any other context or screenshots about the feature request here.
The context is a query that came through the GTFS Helpdesk. A vendor provides 30 second updates, while the Guidelines specify 20 seconds or less. I want to help each transit provider understand where they fit in among other providers in California. What's "normal" or typical for an update frequency. What percentile would they be in with 20, 30, or 60 second updates? What's the shortest update frequency, and how common is it?
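The percentile questions above could be answered with a small helper like this one. It is a sketch only: the interval values are invented placeholders rather than real provider data, and "percentile" is taken to mean the share of feeds updating at least as often as a given interval.

```python
def percent_at_or_below(intervals, threshold):
    """Percent of feeds whose update interval (seconds) is <= threshold."""
    return 100 * sum(1 for value in intervals if value <= threshold) / len(intervals)


# Hypothetical update intervals in seconds, one per feed.
sample_intervals = [10, 15, 20, 20, 30, 30, 30, 60, 60, 120]

shares = {t: percent_at_or_below(sample_intervals, t) for t in (20, 30, 60)}

shortest = min(sample_intervals)
shortest_share = 100 * sample_intervals.count(shortest) / len(sample_intervals)
```

Here `shares[30]` would be the share of feeds that update every 30 seconds or faster, and `shortest_share` shows how common the shortest interval is.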
This issue looks at the organizations who applied for 5311/5311(f)/CMAQ, 5339(a), or LCTOP using the new Consolidated Application process. Applicants just need to complete one application for the funds above, once a year.
Identify key data points in MST payments that detail current adoption of contactless payments.
As a Cal-ITP program manager or more senior Caltrans executive,
I want to know how many transit agencies are being assessed in California
so that I can have a baseline for calculating other metrics like the one described in cal-itp/data-infra#984.
This question lays the foundation for answering a variety of questions that high-level executives ask, such as "what percent of transit agencies have GTFS Schedule data?" or "what percent of transit agencies have Fares v2?" or "what percent of transit agencies are GTFS-compliant?"
Research should be performed with various stakeholders to determine how to define and filter the data we have in airtable about organizations, services and potentially other items. Part of this task should include a document detailing how to filter the data in airtable in order to provide this baseline for measurement. If none of the stakeholders can give a clear answer about how to calculate this baseline, a deliverable of this report should propose at least one recommended option for calculating this baseline.
Given the data Cal-ITP has collected about transit agencies with respect to how they are funded, what kind of service they operate, and any other relevant criteria
When applying all relevant criteria about what qualifies as a transit agency for reporting purposes
Then a number should be calculated.
The deliverable of this should include:
The document Cal-ITP Transit Provider Categorization + Activities is a detailed document about the various ways that transit agencies could be categorized, but it does not include a recommendation for how to establish a baseline for reporting.
There already exists a filter within airtable that seems to relate to filtering assessed operators. Research should be done to determine if this is relevant. Screenshots of this filter are shown below:
Meeting scheduled for March 1. Before then, need to have speedmaps and metrics generated for District 11 transit operators, as well as a polished presentation.
With respect to avg speed for buses, how does it compare with avg speed on trolley or fixed rail services? When I was involved with deploying the first two BRT services in SD, I was surprised at how low the avg speed is on trolley due to lack of grade separations and number of stops. The avg bus speeds seemed low but in fact were higher than avg speeds of trolley or coaster. This data highlights the importance of transit priority projects: managed lanes conversions, signal priority, bus on shoulders, etc. and will help support purpose and need for these kinds of projects.
Currently, agencies are required to apply for funding various times throughout the year. Due to this existing structure, there is a large administrative burden put on agencies, to get all these applications in and reviewed. This research will identify the frequency of the application cycles across divisions to answer the following:
Is your feature request related to a problem? Please describe.
The analytics tools section now serves 2 distinct audiences:
Describe the solution you'd like
csvkit and writing from GCS to Big Query via command line) -- Charlie (cal-itp/calitp-py#54)
Describe alternatives you've considered
Used City of LA best-practices repo, which is now private and can't be accessed.
Additional context
Add any other context or screenshots about the feature request here.
For Caltrans and CalSTA exec board. For Gillian.
Deliverables:
Drive > Team Workspaces > data services
Desired output:
| Operator Type | Weekday | Sat | Sun |
|---|---|---|---|
| Urban | list of x agencies and service hrs and trips/runs | | |
| Suburban | list of y agencies ... | | |
| Rural | list of z agencies... | | |
Pollution Burden and Population Characteristic scores??
gtfs_schedule_dim_stop_times, gtfs_schedule_fact_daily_trips (let's filter to Wed or Thurs for weekday), gtfs_schedule_dim_stops for lat/lon
Similar to the LOSSAN question @edasmalchi answered, Gillian would like to know "how many validators would be needed" to outfit the Thruway buses. Can he find the Thruway buses that connect to CCJPA, LOSSAN and SJRRA?
How many validators would be needed for the Thruway bus services
Some nuance here is needed, since Amtrak isn't the operator / owner of many of its buses. Gillian is looking up ways to get a diff estimate of # of buses in the fleet, but we can guesstimate for now
How many feeds have data about physical accessibility?
This question can be answered by resolving cal-itp/data-infra#561 and cal-itp/data-infra#562, therefore it is blocked for now.
The exact criteria deciding whether a feed "has data about physical accessibility" could be determined in various ways. See proposed idea for an MVP or a more rigorous analysis.
MVP:
stops#wheelchair_boarding field
trips#wheelchair_accessible field
More rigorous analysis:
wheelchair_boarding field set to "not unknown" value
wheelchair_accessible field set to "not unknown" value
stops#wheelchair_boarding
trips#wheelchair_accessible
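The two levels of checking could look something like the sketch below. It assumes stops.txt and trips.txt have already been read into lists of dicts (for example with `csv.DictReader`), and it treats an empty value or `0` as "unknown", following the GTFS definition of both fields. The function names are hypothetical.

```python
# In GTFS, 0 or an empty value means "no accessibility information".
UNKNOWN_VALUES = {"", "0", None}


def has_field(rows, field):
    """MVP check: the column exists in the file at all."""
    return any(field in row for row in rows)


def share_known(rows, field):
    """Percent of rows where the field is set to a 'not unknown' value."""
    if not rows:
        return 0.0
    known = sum(1 for row in rows if row.get(field) not in UNKNOWN_VALUES)
    return 100 * known / len(rows)


def feed_has_accessibility_data(stops, trips, rigorous=False):
    """Apply either the MVP or the more rigorous definition to one feed."""
    if not rigorous:
        return has_field(stops, "wheelchair_boarding") and has_field(
            trips, "wheelchair_accessible"
        )
    return (
        share_known(stops, "wheelchair_boarding") > 0
        and share_known(trips, "wheelchair_accessible") > 0
    )


# Made-up feed: columns are present, but stops carry no real information.
demo_stops = [
    {"stop_id": "s1", "wheelchair_boarding": "0"},
    {"stop_id": "s2", "wheelchair_boarding": ""},
]
demo_trips = [{"trip_id": "t1", "wheelchair_accessible": "1"}]
```

With this made-up feed the MVP check passes while the rigorous check fails, which is the gap between the two definitions.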
Get PEMS data, see which dataset should be downloaded, maybe pushed into BigQuery? PEMS data may help us answer how fast are cars traveling along the streets bus routes are also traveling along. This helps us better understand traffic speeds by various times of day to help calculate car travel times when traveling along parallel-to-SHN bus route.
DLA notebooks for districts is a good prototype for papermill and parameterizing reports.
A prototype notebook and this script here is a good place to start.
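As a rough illustration of parameterizing a report per district, the sketch below builds one papermill job per district. The template path, output directory, and the `district` parameter name are hypothetical; `papermill.execute_notebook` is the library's entry point for running a notebook with injected parameters, and it is imported lazily so the job list can be built without papermill installed.

```python
from pathlib import Path


def build_jobs(template, out_dir, districts):
    """Return one job description per district for a parameterized notebook."""
    return [
        {
            "input_path": template,
            "output_path": (Path(out_dir) / f"district_{d:02d}.ipynb").as_posix(),
            "parameters": {"district": d},  # hypothetical parameter name
        }
        for d in districts
    ]


def run_jobs(jobs):
    """Execute each job with papermill (third-party, imported only when run)."""
    import papermill as pm

    for job in jobs:
        pm.execute_notebook(
            job["input_path"], job["output_path"], parameters=job["parameters"]
        )


# Hypothetical template notebook and output folder.
jobs = build_jobs("report_template.ipynb", "reports", districts=[7, 11])
```

The template notebook would need a cell tagged `parameters` that defines a default `district` value for papermill to override.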
How many feeds that we track have Fares v2 data?
The total number of feeds that have a non-empty fare_leg_rules.txt file.
A number of new files are being proposed to be added to the static GTFS specification together called "Fares v2". See reference doc. The MVP for checking this however is to simply assert that the fare_leg_rules.txt file is present and it contains at least one potentially empty row.
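A minimal version of that check against a zipped feed might look like the sketch below. It reads "non-empty" as "at least one non-blank row after the header", which is an assumption about the intended definition, and the helper names are hypothetical.

```python
import csv
import io
import zipfile


def has_fares_v2(feed_zip):
    """True if fare_leg_rules.txt exists and has at least one data row.

    feed_zip may be a path or a file-like object containing a GTFS zip.
    """
    with zipfile.ZipFile(feed_zip) as zf:
        if "fare_leg_rules.txt" not in zf.namelist():
            return False
        with zf.open("fare_leg_rules.txt") as raw:
            reader = csv.reader(io.TextIOWrapper(raw, encoding="utf-8-sig"))
            next(reader, None)  # skip the header row
            return any(row for row in reader)  # blank lines do not count


def make_zip(files):
    """Build an in-memory zip from {name: text}; only used for the demo."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as zf:
        for name, text in files.items():
            zf.writestr(name, text)
    buffer.seek(0)
    return buffer


with_rules = make_zip({"fare_leg_rules.txt": "leg_group_id,fare_product_id\nlg1,fp1\n"})
header_only = make_zip({"fare_leg_rules.txt": "leg_group_id,fare_product_id\n"})
no_file = make_zip({"stops.txt": "stop_id\ns1\n"})
```

Counting the feeds for which `has_fares_v2` returns true would give the number this question asks for.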
How many assessed operators have a complete set of associated GTFS-Realtime feeds?
Name:
Role:
Reports to:
Google Workspace Email Address:
GitHub Username:
Slack Username:
Set-up:
Technical Onboarding call scheduled
Added to tools:
Added to meetings:
Added to Slack channels:
A user story is implemented as well as it is communicated.
If the context and the goals are made clear, it will be easier for everyone to implement it, test it, refer to it.
A user story should typically have a summary structured this way:
For analysis, there's probably a set of steps in data cleaning or data visualization that we do repeatedly. We encounter them both within a research question and across research questions. Why reinvent the wheel?
I would want to document some of these shared utility functions to standardize and make it easier for analysts to do these steps. The utility functions would be importable across all directories in the data-analyses repo and can be called within a Jupyter notebook or Python script.
See 2 examples from City of LA work: covid19-indicators and planning-entitlements
Ex 1: exporting a geodataframe to geoparquet and save to GCS bucket. cal-itp/data-infra#698. Solution was to create a function that would write a geoparquet locally, upload to GCS bucket, then erase the local file. This is a repeated step that many analysts would come across using the JupyterHub + GCS, and the typical way to export doesn't work.
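The write-locally, upload, then erase pattern from Ex 1 could be sketched generically as below. This is not the function that was actually written for the repo: the name `export_via_local_file` is hypothetical, and the writer and uploader are passed in as callables so the sketch runs without geopandas or GCS access.

```python
import os
import tempfile
from pathlib import Path


def export_via_local_file(write_local, upload, remote_path, suffix=".parquet"):
    """Write a temporary local file, upload it, then always delete it.

    write_local(local_path) writes the file; upload(local_path, remote_path)
    copies it to remote storage.
    """
    fd, local_path = tempfile.mkstemp(suffix=suffix)
    os.close(fd)  # only the path is needed; the writer reopens the file
    try:
        write_local(local_path)
        upload(local_path, remote_path)
    finally:
        os.remove(local_path)  # erase the local copy even if the upload fails
    return remote_path


# Demo with stand-ins. In practice (assumption) write_local might be
# gdf.to_parquet and upload might be gcsfs.GCSFileSystem().put.
uploaded = {}
local_paths = []


def fake_upload(local_path, remote_path):
    local_paths.append(local_path)
    uploaded[remote_path] = Path(local_path).read_bytes()


result = export_via_local_file(
    lambda path: Path(path).write_bytes(b"demo"),
    fake_upload,
    "gs://hypothetical-bucket/demo.parquet",
)
```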
Ex 2: aggregating by geography (census tract, Caltrans district, zip code, etc). There's a pandas function that helps us aggregate and take the sum, count, count unique values, etc. Currently, to do a mix of these, you could merge your dataframes back together using df.groupby().agg() or get wonky column names using df.pivot_table(). A common function would wrap all this aggregation up and be paired with attaching geometry back, as the geometry column throws errors when you aggregate.
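A wrapper along the lines of Ex 2 might look like this sketch, which uses pandas named aggregation to mix sums, counts, and unique counts in one pass with predictable column names. The function names and the sample columns are hypothetical, and the geometry column is a plain string stand-in; with geopandas the merged result would be wrapped back into a GeoDataFrame.

```python
import pandas as pd


def aggregate_by_geography(df, group_cols, sum_cols=(), count_cols=(), nunique_cols=()):
    """Aggregate with a mix of sum / count / nunique in a single groupby."""
    named = {}
    for col in sum_cols:
        named[f"{col}_sum"] = (col, "sum")
    for col in count_cols:
        named[f"{col}_count"] = (col, "count")
    for col in nunique_cols:
        named[f"{col}_nunique"] = (col, "nunique")
    return df.groupby(list(group_cols), as_index=False).agg(**named)


def attach_geometry(aggregated, geometry_df, group_cols):
    """Merge one geometry row per geography back onto the aggregated table."""
    return aggregated.merge(
        geometry_df, on=list(group_cols), how="left", validate="m:1"
    )


# Made-up data: stop counts and routes by census tract.
stops = pd.DataFrame(
    {
        "tract": ["A", "A", "B"],
        "num_stops": [3, 2, 4],
        "route_id": ["r1", "r1", "r2"],
    }
)
tract_shapes = pd.DataFrame({"tract": ["A", "B"], "geometry": ["POLYGON A", "POLYGON B"]})

summary = aggregate_by_geography(
    stops, ["tract"], sum_cols=["num_stops"], nunique_cols=["route_id"]
)
summary_with_geometry = attach_geometry(summary, tract_shapes, ["tract"])
```

Aggregating the non-geometry columns first and merging geometry afterwards sidesteps the errors that come from aggregating the geometry column directly.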
Other examples would be common charts or maps. We'll add more as we come across more use cases of generalizable functions!
We can start with Tiffany / Eric's transit service research, and expand as more analyses are done by others.
Also, here are a few points that need to be addressed:
shared_utils is importable across all directories, since analysts are making their own folders within data-analyses to store their work.
docker-compose.yml: pip install -e ...where should this go?
Initial work here: https://github.com/cal-itp/data-analyses/tree/shared-utils
Over the past 6 months, the Cal-ITP payments team has been conducting a contactless payments demonstration with select agency partners across the State of California. Now that data and experience have been gathered, we are ready to begin synthesizing quantitative and qualitative learnings and communicating them out to stakeholders across the State.
The document Potential Questions for Payments Data generated by the payments team has been used to scope out questions to be answered during the process of the demonstration, and the team has been iterating the document against their experiences. The areas covered by this document should be used to inform this review and include, but are not limited to:
views in the warehouse
payments_rides
payments dataset
stg_cleaned_customers
stg_cleaned_micropayment_adjustments
stg_enriched_micropayments
This will be filled out by a member of the data services team.
This allows us to describe the request, in a way that is easy to hand-off for analysis.
After the research phase, we will sync with the asker to figure out if the metric and dashboard pieces are needed.
Before starting research:
views related to payments in the warehouse
payments dataset
After reviewing research with the asker:
About what would it cost to provide GTFS-RT for small/rural operators in CA? (to help Gillian with the budget process)
Ask Lauren for an agency to start with, or use something in Humboldt County as a default.
Identify existing relationship between existing NTD data products, and how Caltrans can improve reporting processes.
What % of California (and Californians) has (open to the public) transit coverage in GTFS?
By area:
By Population:
By Employment (optional):
stops.txt for GTFS datasets and shapes.txt for continuous stops and locations.geojson for flexible services.
What can we learn from the four demos re: open loop payments?
With the new views.gtfs_schedule_dim_shapes_geo table now available, update the traffic_ops scripts related to creating routes and stops data.
Which transit routes are considered parallel to the State Highway Network (SHN)?
Is your feature request related to a problem? Please describe.
Caltrans staff and other stakeholders need self-serve access to our GTFS-RT speed and delay data in the coming weeks to enable work on the pending innovation challenge. This will likely occur before the RT pipeline is fully built out.
Describe the solution you'd like
An interactive webpage based on existing GTFS-RT speed/delay work, with functionality such as:
Describe alternatives you've considered
Additional context
Eric to meet with @atvaccaro tomorrow about suitability of jupyter/papermill as a tool for this
What % of Californians have access to GTFS-Flex data?
If we were able to support a select number of agencies in providing GTFS accessibility data and GTFS-RT, which would make the biggest impact in overall coverage statewide (as measured in #169 and #170)?
Currently, we have all of the GTFS validator results stored in validation_notices, broken down by type, number of errors, and if they are resolved month to month.
@mcplanner has put together a helpful list of "severity" of GTFS notices. We have metrics stored in this spreadsheet.
Let's set up a notebook or similar to track these over time.
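Tracking notices over time by severity could start from a tally like the one sketched here. The record layout, notice codes, and severity mapping are all hypothetical placeholders; the real ones would come from validation_notices and the severity list mentioned above.

```python
from collections import Counter


def notices_by_month_and_severity(notices, severity_by_code):
    """Count notices per (YYYY-MM, severity), sorted by month then severity."""
    counts = Counter()
    for notice in notices:
        month = notice["date"][:7]  # assumes ISO dates such as 2022-01-15
        severity = severity_by_code.get(notice["code"], "unclassified")
        counts[(month, severity)] += notice.get("n_notices", 1)
    return dict(sorted(counts.items()))


# Made-up records and mapping.
sample_notices = [
    {"date": "2022-01-15", "code": "code_a", "n_notices": 3},
    {"date": "2022-01-20", "code": "code_b", "n_notices": 2},
    {"date": "2022-02-01", "code": "code_a", "n_notices": 1},
]
sample_severity = {"code_a": "error"}

monthly = notices_by_month_and_severity(sample_notices, sample_severity)
```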
Use bus_service_increase project as a prototype of how to set up scripts to run data cleaning, data assembly, visualization (charts + maps), save in sub-directories for various districts, counties, etc.
Currently, work is done in notebooks, and while it calls functions, the entire workflow to produce visualizations can benefit from more automation.
Long-term goal: create analytics portfolio for districts to access similar set of metrics and visualizations related to bus_service_increase, dla grants, drmt grants, parallel corridors, rt delay.
Take existing work that lives in notebooks and generalize to accommodate various subsets of data (by district, by MPO, etc) and produce similar set of outputs.
Similar to the one we have in the notebooks repo.
test test
How can ongoing RT speed/delay work best support Caltrans districts and other jurisdictions in improving the bus experience, statewide?
Meeting scheduled for Feb 22. Before then, need to have speedmaps and metrics generated for District 10 transit operators, as well as a polished presentation.
What % of California has (open to the public) transit coverage in GTFS which is explicitly wheelchair accessible?
By area:
The % of non-water area of California that is within 1/4 mi of a bus stop or 1 mi of a ferry/rail stop that is explicitly wheelchair accessible (and if in a station, that station has explicit pathways coding), and that is served by a publicly funded, open to the general public transit service with GTFS Schedule data that is explicitly wheelchair accessible
By Population:
The % of Californians that are within 1/4 mi of a bus stop or 1 mi of a ferry/rail stop that is explicitly wheelchair accessible (and if in a station, that station has explicit pathways coding), and that is served by a publicly funded, open to the general public transit service with GTFS Schedule data that is explicitly wheelchair accessible
By Employment (optional):
The % of jobs that are within 1/4 mi of a bus stop or 1 mi of a ferry/rail stop that is explicitly wheelchair accessible (and if in a station, that station has explicit pathways coding), and that is served by a publicly funded, open to the general public transit service with GTFS Schedule data that is explicitly wheelchair accessible
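The population and employment versions of this metric could be approximated as in the sketch below. It makes several simplifying assumptions that are not part of the request: census blocks are represented by their centroids instead of buffered areas, each stop already carries a hypothetical `mode` value joined from routes, and the pathways and service-level accessibility conditions are left out.

```python
import math

QUARTER_MILE_M = 402.336
ONE_MILE_M = 1609.344


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two lat/lon points."""
    radius = 6371008.8  # mean Earth radius in meters
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius * math.asin(math.sqrt(a))


def covered_share(blocks, stops, value_key="population"):
    """Percent of population (or jobs) near an explicitly accessible stop."""
    # wheelchair_boarding == "1" means boarding is explicitly possible.
    accessible = [s for s in stops if s.get("wheelchair_boarding") == "1"]
    total = sum(block[value_key] for block in blocks)
    covered = 0
    for block in blocks:
        for stop in accessible:
            limit = ONE_MILE_M if stop["mode"] in ("rail", "ferry") else QUARTER_MILE_M
            distance = haversine_m(block["lat"], block["lon"], stop["lat"], stop["lon"])
            if distance <= limit:
                covered += block[value_key]
                break  # count each block once
    return 100 * covered / total if total else 0.0


# Made-up blocks: one at the stop, one roughly 1.1 km to the north.
demo_blocks = [
    {"lat": 38.58, "lon": -121.49, "population": 100},
    {"lat": 38.59, "lon": -121.49, "population": 300},
]
bus_stop = {"lat": 38.58, "lon": -121.49, "mode": "bus", "wheelchair_boarding": "1"}
rail_stop = dict(bus_stop, mode="rail")
unknown_stop = dict(bus_stop, wheelchair_boarding="0")
```

Passing `value_key="jobs"` with job counts per block would give the employment version of the same share.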
NOTE: - I don't think we need to or should interface this with any census data about having a disability.
Thoughts on this:
This issue looks at the organizations who applied for 5311/5311(f)/CMAQ, 5339(a), and/or LCTOP using the new Consolidated Application process. Applicants only need to complete one application for the funds above, once a year. The deadlines are in late March for LCTOP and late April for the other programs.
As part of the Caltrans Strategic Plan, Performance Plan, one of the big strategic goals related to multimodal transportation is P-01, to increase the total amount of service on the SHN and the reliability of that service by 2024.
Meeting on 3/9/22, @edasmalchi and @tiffanychu90 to present.
bus-service-increase and parallel-corridors