cal-itp / data-analyses

Place for sharing quick reports, and works in progress

Home Page: https://analysis.calitp.org

Smarty 0.01% Jupyter Notebook 96.81% HTML 2.97% Makefile 0.01% Python 0.22%

data-analyses's People

Contributors

aly-medina, amandaha8, atvaccaro, benjaminbressette, charlie-costanzo, csuyat-dot, edasmalchi, evansiroky, j-meelah, juliama0780, katrinamkaiser, lauriemerrell, machow, mdsaifulislamfahim, mjumbewu, natam1, nkdiaz, noah-ca, shweta487, thekaveman, tiffanychu90, vevetron

data-analyses's Issues

Research Question: how far into the future do RT updates matter?

This came from Greg Newmark, who's working with Cal-ITP to analyze RT schedule adherence. He's examining how well actual arrival times match scheduled arrival times, and how accurate arrival estimates are at various lead times before the actual arrival. For example, how good are we at estimating arrival times when the vehicle is 5, 10, or 20 minutes away? Specifically, he's asked us how far in advance we might care to study. This is a good question for our GTFS Schedule dataset.

Answering this question means exploring both how long a rider has to wait between vehicles, as well as how long that vehicle has to travel between stops. If the route is hourly, but each stop is five minutes apart, then we're only looking at a max of five minutes between estimation and arrival. Conversely, on a route that arrives every five minutes but the stops are an hour apart, a rider would only care about the estimation for five minutes until the next bus arrives.

To answer this question, I propose we produce a plot of stop_times with the following axes:

  • On the X axis, the number of minutes since the previous trip of that same route arrived at that stop. This means we exclude the first trip for each route of the day. So for example, for an hourly route, it's 60 minutes between trips, even if another route also stops at that stop location.
  • On the Y axis, the number of minutes since the previous stop in the stop sequence of that stop_time's trip. So for example, if the bus makes a stop at 1:05, 1:10, and 1:30, it would have entries for 5 and 20 minutes.

Note that there are a ton of stop_times in any one dataset, let alone in the state database, so this needs to be filtered.

  • Rather than running this for every valid calendar date in every feed, run for only the present day.
  • Exclude the first trip of each route in that day.
  • Exclude any data point (X/Y pair) that represents fewer than 0.5% of all stop_times.
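
The proposed X/Y pairs could be computed along these lines. This is a minimal sketch, not the project's implementation: the row layout, the `STOP_TIMES` sample, and the function name `headway_vs_travel` are hypothetical, and arrival times are assumed to be already converted to minutes after midnight for a single service day.

```python
from collections import Counter, defaultdict

# Hypothetical, simplified stop_times rows:
# (route_id, trip_id, stop_id, stop_sequence, arrival_minutes_after_midnight)
STOP_TIMES = [
    ("R1", "T1", "A", 1, 60), ("R1", "T1", "B", 2, 65), ("R1", "T1", "C", 3, 85),
    ("R1", "T2", "A", 1, 120), ("R1", "T2", "B", 2, 125), ("R1", "T2", "C", 3, 145),
]


def headway_vs_travel(stop_times, min_share=0.005):
    """Return {(x, y): count}, dropping pairs below min_share of all stop_times."""
    # Y: minutes since the previous stop within the same trip.
    by_trip = defaultdict(list)
    for route, trip, stop, seq, arrival in stop_times:
        by_trip[trip].append((seq, arrival))
    travel = {}
    for trip, rows in by_trip.items():
        rows.sort()
        for (_, prev_arrival), (seq, arrival) in zip(rows, rows[1:]):
            travel[(trip, seq)] = arrival - prev_arrival

    # X: minutes since the previous trip of the same route arrived at this stop.
    # Pairing each arrival with its predecessor drops the first trip of the day.
    by_route_stop = defaultdict(list)
    for route, trip, stop, seq, arrival in stop_times:
        by_route_stop[(route, stop)].append((arrival, trip, seq))
    points = []
    for rows in by_route_stop.values():
        rows.sort()
        for (prev_arrival, _, _), (arrival, trip, seq) in zip(rows, rows[1:]):
            if (trip, seq) in travel:  # the first stop of a trip has no Y value
                points.append((arrival - prev_arrival, travel[(trip, seq)]))

    counts = Counter(points)
    return {pt: n for pt, n in counts.items() if n / len(stop_times) >= min_share}


print(headway_vs_travel(STOP_TIMES))
```

The resulting dictionary maps each (X, Y) pair to the number of stop_times it represents, which could feed a scatter or heatmap directly.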

Research: GTFS-RT Presentation Demo for CARB

Question

Meeting scheduled for Feb 24. Can build off of district presentations and show a variety of existing data statewide.

Metrics

  • Presentation drafted
  • Presentation reviewed and complete

Data sources

  • GTFS-RT vehicle positions (existing raw table in warehouse)
  • GTFS Schedule data (in warehouse)

(Data Services Team to Copy and Fill Out Below)

The QuVR MD template below will be filled out by a member of the data services team.
This allows us to describe the request, in a way that is easy to hand-off for analysis.
After the research phase, we will sync with the asker to figure out if the metric and dashboard pieces are needed.

Before starting research:

  • Question: Question written as a single sentence.
  • View: E.g. views.gtfs_schedule_fact_daily_feed_files.
  • Research:
    • How should the results be presented?
    • When are they needed by?

After reviewing research with the asker:

  • Metric: what specific calculations are needed?
  • Dashboard: where should we put the result?

Use intake for importing data in Jupyter Notebooks

Is your feature request related to a problem? Please describe.

As analytics requests increase, there is a growing need to store outside data sources, and the datasets used across analyses overlap significantly. Along with establishing GCS buckets (#526), we should catalog our canonical data sources in one place so that everyone imports the same file for their analyses.

Describe the solution you'd like

Use intake to catalog our various data sources, determine canonical datasets for outside sources like Census, CA open data portal, etc.

Describe alternatives you've considered

Additional context
See data catalogs in City of LA repos planning-entitlements.

Research: Bus Service Opportunities by Census Tracts

Question

Level of bus service for census tracts by population density/jobs and CalEnviroScreen.

For Chad Edison, CalSTA

Metrics

  • the # bus stop-visits per sq mi (each bus stop multiplied by # times bus visits the stop per day, normalized to per sq mi for census tract)
  • census tracts categorized as low to high job or population density vs census tracts categorized into CalEnviroScreen bins
  • break apart into weekday / weekend and peak / off-peak
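
The stop-visits metric might be computed roughly as follows. This is a sketch under stated assumptions: the stop-to-tract lookup is taken as given (in practice it would come from a spatial join), and the function and argument names are hypothetical.

```python
from collections import defaultdict


def stop_visits_per_sq_mi(stop_to_tract, daily_visits, tract_area_sq_mi):
    """Sum daily bus visits over the stops in each tract, then divide by tract area.

    stop_to_tract:    {stop_id: tract_id}, assumed to come from a spatial join
    daily_visits:     {stop_id: number of times a bus visits the stop per day}
    tract_area_sq_mi: {tract_id: area in square miles}
    """
    visits_by_tract = defaultdict(int)
    for stop_id, visits in daily_visits.items():
        tract = stop_to_tract.get(stop_id)
        if tract is not None:
            visits_by_tract[tract] += visits
    # Tracts with no stops get 0.0, so "no service" tracts remain visible.
    return {
        tract: visits_by_tract[tract] / area
        for tract, area in tract_area_sq_mi.items()
        if area > 0
    }
```

Running it separately on weekday/weekend and peak/off-peak subsets of `daily_visits` would give the requested breakdowns.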

Data sources

  • GTFS schedule: gtfs_schedule_dim_stop_times, gtfs_schedule_fact_daily_trips (let's filter to Wed or Thurs for weekday), gtfs_schedule_dim_stops for lat/lon
  • CalEnviroScreen 4.0 for census tract geometry, population, equity metrics
  • jobs by census tracts from LEHD origin-destination. Possibly use 1 or 2 from Urban Institute -- RAC All Jobs Excluding Federal Jobs and WAC All Jobs Excluding Federal Jobs?
  • Park-and-ride locations for intercity connectivity

Deliverable

  • Interactive map of CA census tracts: do we need a bivariate legend?
  • Add park-and-ride locations as a points layer

Research: Level of Delay, GTFS-RT

Question

Given an arbitrary area and time (i.e., a particular section of the state highway network, or the city of San Mateo), compute how many minutes of delay buses and rail have.

Metrics

We will need to impute the time of entry and time of exit from schedule data, then compare "how many minutes should this take" against how many minutes it takes on average.
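
One way to frame the calculation, as a rough sketch: interpolate the scheduled time at the area boundary between the two nearest stops, then compare scheduled and observed traversal times. Linear interpolation by distance is an assumption here, not a method stated in this issue, and all function names are hypothetical.

```python
def interpolate_time(t_prev, t_next, dist_prev, dist_next, dist_boundary):
    """Estimate the time at a boundary lying between two stops.

    Times are minutes after midnight; distances are measured along the route.
    Assumes the vehicle moves at constant speed between the two stops.
    """
    fraction = (dist_boundary - dist_prev) / (dist_next - dist_prev)
    return t_prev + fraction * (t_next - t_prev)


def delay_minutes(scheduled_entry, scheduled_exit, actual_entry, actual_exit):
    """Delay = observed traversal time minus scheduled traversal time."""
    return (actual_exit - actual_entry) - (scheduled_exit - scheduled_entry)


def average_delay(traversals):
    """Average delay over (sched_entry, sched_exit, actual_entry, actual_exit) tuples."""
    delays = [delay_minutes(*t) for t in traversals]
    return sum(delays) / len(delays)
```

For example, a trip scheduled to cross the area in 12 minutes that actually takes 17 minutes contributes 5 minutes of delay.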

Data sources

  • GTFS-RT Parquet Timing Rectangles
  • GTFS Schedule Data

Research: routes and stops shapefiles for public-facing portal

Question

Create 2 datasets to upload to ArcGIS for Traffic Ops, eventually move this to Airflow to be scheduled to overwrite AGOL dataset at some frequency:

  • every stop + what route at those stops
  • every route, with a line representing the route either from shapes.txt or creating one from stops.txt

Metrics

Make sure each row represents what is needed, get rid of "duplicates".
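
A minimal sketch of the "one row per stop, with its routes, no duplicates" idea follows. The inputs are simplified pairs rather than the real gtfs_schedule tables, and `routes_by_stop` is a hypothetical name.

```python
def routes_by_stop(stop_times, trips):
    """Return {stop_id: sorted list of distinct route_ids serving that stop}.

    stop_times: iterable of (trip_id, stop_id) pairs
    trips:      iterable of (trip_id, route_id) pairs
    """
    trip_to_route = dict(trips)
    routes = {}
    for trip_id, stop_id in stop_times:
        # A set absorbs the many repeated (stop, route) combinations
        # produced by multiple trips of the same route.
        routes.setdefault(stop_id, set()).add(trip_to_route[trip_id])
    return {stop_id: sorted(r) for stop_id, r in sorted(routes.items())}
```

Each key then corresponds to exactly one output row in the stops dataset.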

Avoid an AGOL hosted feature service, which uses credits. Use a public-facing geoportal instead?

Data sources

  • Use gtfs_schedule tables: stops, stop_times, trips, routes, agencies

Analytics Style Guide

Is your feature request related to a problem? Please describe.

Implement an analytics / visualization style guide based on Cal-ITP's branded resources for slides / docs.

Describe the solution you'd like
Build a package part of shared_utils that will handle most of the styling for visualizations made within Jupyter Notebooks. Make it available for all the major charting packages: altair, matplotlib, seaborn, plotnine.

Describe alternatives you've considered

Additional context
Add any other context or screenshots about the feature request here.

Research Question: how often do GTFS-RT feeds get updated?

The context is a query that came through the GTFS Helpdesk. A vendor provides 30 second updates, while the Guidelines specify 20 seconds or less. I want to help each transit provider understand where they fit in among other providers in California. What's "normal" or typical for an update frequency? What percentile would they be in with 20, 30, or 60 second updates? What's the shortest update frequency, and how common is it?
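
These questions could be approached with something like the sketch below. It assumes each feed's update timestamps (in seconds) are already available, and it summarizes each feed by its median interval; both that summary choice and the function names are assumptions for illustration.

```python
from statistics import median


def update_intervals(timestamps):
    """Seconds between consecutive distinct feed updates."""
    ts = sorted(set(timestamps))
    return [later - earlier for earlier, later in zip(ts, ts[1:])]


def share_at_or_below(value, feed_medians):
    """Percent of feeds whose median update interval is at most `value` seconds."""
    return 100 * sum(1 for m in feed_medians if m <= value) / len(feed_medians)


# Hypothetical median intervals (seconds) for five feeds.
medians = [10, 20, 20, 30, 60]
for seconds in (20, 30, 60):
    print(seconds, share_at_or_below(seconds, medians))
```

The minimum of the per-feed medians, and how many feeds share it, would answer the "shortest update frequency" part.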

Research: Consolidated Application

This issue looks at the organizations that applied for 5311/5311(f)/CMAQ, 5339(a), or LCTOP using the new Consolidated Application process. Applicants just need to complete one application for the funds above, once a year.

Data Questions:

  1. Which organizations applied?
  2. Which funds did they apply for?
  3. What are people interested in?

Metrics:

  • TBD

Data sources:

Tasks and Goals:

  • Create a Tableau dashboard.

MST Payments Adoption slide deck

Question

Identify key data points in MST payments that detail current adoption of contactless payments.

Metrics

  • Contactless versus traditional fare
  • Was fare capping implemented?
  • Of contactless, how many paid less than nominal fare?

Data sources

  • MST trip data

Research: How many operators is Cal-ITP assessing?

User Story

As a Cal-ITP program manager or more senior Caltrans executive,
I want to know how many transit agencies are being assessed in California
so that I can have a baseline for calculating other metrics like the one described in cal-itp/data-infra#984.

Additional Context

The gist of this question lays the foundation for answering a variety of questions that high-level executives ask, such as "what percent of transit agencies have GTFS Schedule data?" or "what percent of transit agencies have Fares v2?" or "what percent of transit agencies are GTFS-compliant?"

Research should be performed with various stakeholders to determine how to define and filter the data we have in Airtable about organizations, services and potentially other items. Part of this task should include a document detailing how to filter the data in Airtable in order to provide this baseline for measurement. If none of the stakeholders can give a clear answer about how to calculate this baseline, a deliverable of this report should propose at least one recommended option for calculating this baseline.

Acceptance Criteria

Given the data Cal-ITP has collected about transit agencies with respect to how they are funded, what kind of service they operate, and any other relevant criteria
When applying all relevant criteria about what qualifies as a transit agency for reporting purposes
Then a number should be calculated.

The deliverable of this should include:

  1. A memo containing the precise, quantifiable and measurable definition of what qualifies as a transit agency for answering the above-mentioned high-level questions
  2. A metabase question that simply shows the resulting number of transit agencies when applying the criteria to the data in airtable

Sprint Ready Checklist

  • Acceptance criteria defined
  • Team understands acceptance criteria
  • Team has defined solution / steps to satisfy acceptance criteria
  • Acceptance criteria is verifiable / testable
  • Dependencies identified

Appendix

The document Cal-ITP Transit Provider Categorization + Activities is a detailed document about the various ways that transit agencies could be categorized, but it does not include a recommendation for how to establish a baseline for reporting.

There already exists a filter within Airtable that appears to be related to filtering assessed operators. Research should be done to determine if this is relevant. Screenshots of this filter are shown below:

Overall airtable filter

Screen Shot 2022-01-24 at 2 16 29 PM

Reporting Category

Screen Shot 2022-01-26 at 10 32 56 AM

Currently Operating

Screen Shot 2022-01-26 at 10 32 52 AM

Service Type

Screen Shot 2022-01-26 at 10 32 47 AM

Additional service type filter

Screen Shot 2022-01-26 at 10 32 28 AM

Research: GTFS-RT Speedmaps and Presentation Ready for D11

Question

Meeting scheduled for March 1. Before then, need to have speedmaps and metrics generated for District 11 transit operators, as well as a polished presentation.

Metrics

  • Speedmaps and other metrics run for each D11 operator with available RT data
  • Maps and metrics highlight trolley/fixed rail vs. bus to the extent available, per D11 director ask
  • Presentation drafted
  • Presentation reviewed and complete

With respect to avg speed for buses, how does it compare with avg speed on trolley or fixed rail services? When I was involved with deploying the first two BRT services in SD, I was surprised at how low the avg speed is on trolley due to lack of grade separations and number of stops. The avg bus speeds seemed low but in fact were higher than avg speeds of trolley or coaster. This data highlights the importance of transit priority projects: managed lanes conversions, signal priority, bus on shoulders, etc. and will help support purpose and need for these kinds of projects.

Data sources

  • GTFS-RT vehicle positions (existing raw table in warehouse)
  • GTFS Schedule data (in warehouse)

Research: Funding Application Frequencies

Question

Currently, agencies are required to apply for funding various times throughout the year. Due to this existing structure, there is a large administrative burden put on agencies, to get all these applications in and reviewed. This research will identify the frequency of the application cycles across divisions to answer the following:

  • How often do agencies apply for funding across the transportation-related grant programs?

Metrics

  • For each agency, how many times do they apply?
    • By month
    • By quarter
    • By year
  • Average days between applications by agency
  • Where are the grant applications concentrated between the grant groupings?
  • For each quarter in a FY, how many applications fall into the different grant groupings?
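
Two of these metrics could be sketched as follows. The input shape (agency, application date) and both function names are hypothetical, and calendar quarters are used for simplicity; fiscal-year quarters would need an offset that this issue does not specify.

```python
from collections import Counter, defaultdict
from datetime import date


def average_days_between(applications):
    """Average gap in days between consecutive applications, per agency.

    applications: iterable of (agency, datetime.date) pairs.
    Agencies with a single application have no gap and are omitted.
    """
    by_agency = defaultdict(list)
    for agency, applied_on in applications:
        by_agency[agency].append(applied_on)
    result = {}
    for agency, dates in by_agency.items():
        dates.sort()
        gaps = [(later - earlier).days for earlier, later in zip(dates, dates[1:])]
        if gaps:
            result[agency] = sum(gaps) / len(gaps)
    return result


def applications_per_quarter(applications):
    """Count applications by (agency, year, calendar quarter)."""
    return Counter(
        (agency, d.year, (d.month - 1) // 3 + 1) for agency, d in applications
    )
```

Adding a grant-grouping field to each tuple and to the `Counter` key would extend this to the grouping questions.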

Data sources

Deliverables:

Acronyms/IDs for agencies in DLA data

Question

  • How can we connect the agencies in this dataset to different databases in the Cal-ITP warehouse?
  • What unique identifiers exist, and which need to be created?

Metrics

  • Research existing datasets in the Cal-ITP warehouse and in DLA warehouse
  • If needed, create a new unique agency identifier that can be linked between the datasets

Data sources

Docs: add tutorials and centralized knowledge for analyst reference

Is your feature request related to a problem? Please describe.

The analytics tools section now serves 2 distinct audiences:

  • new analysts - in need of self-serving materials to go from zero to hero
  • current analysts - in need of references

Describe the solution you'd like

  • Work tutorials into a separate "chapter" of the analytics tools docs for new analysts. -- Amanda (cal-itp/data-infra#1351)
  • Add some reference section for analysts to find reference materials for common packages used, interesting code tidbits to lift (sorting order on charts to not be alphabetical, getting labels to display $ and millions / thousands, adjust for inflation, etc). Live in docs section, but need to flesh out section on contributing to docs so analysts can continue to add to this body of knowledge -- all analysts contribute Tiffany, Natalie, compiled resources
  • Add in new section that reflects new things in analyst workflow (csvkit and writing from GCS to Big Query via command line) -- Charlie (cal-itp/calitp-py#54)

Describe alternatives you've considered

Used City of LA best-practices repo, which is now private and can't be accessed.

Additional context
Add any other context or screenshots about the feature request here.

Research: Transit (Bus) Service Increase

Question

  1. How many service hours and trips/runs need to be added by operator-agency-service type to reach desired levels of service?
  2. Which census tracts, grouped by CalEnviroScreen categories, have no service?

For Caltrans and CalSTA exec board. For Gillian.

Deliverables:
Drive > Team Workspaces > data services

Metrics

  1. Service hrs and trips/runs
  • LOS for urban (once every 15 min) / suburban (30 min) / rural (60 min) -- use census tract criteria to categorize
  • If route runs through urban/suburban/rural, use a simple cut-off for categorizing
  • Aggregate to the operator-route-type of service, possibly to county later on
  • Pick a couple of representative dates for weekday, Sat, Sun service

Desired output:

Operator Type | Weekday | Sat | Sun
Urban | list of x agencies and service hrs and trips/runs
Suburban | list of y agencies ...
Rural | list of z agencies...
  2. CalEnviroScreen
  • Subset by Pollution Burden and Population Characteristic scores??

Data sources

  • GTFS schedule: gtfs_schedule_dim_stop_times, gtfs_schedule_fact_daily_trips (let's filter to Wed or Thurs for weekday), gtfs_schedule_dim_stops for lat/lon
  • Census tract: check CA open data portal, need geometry and population, use pop density to do urban/suburban/rural cut-off
  • CalEnviroScreen: 3.0 available on open data portal and also has population OR go with 4.0

Methodology

  • Find geographic extent of route
  • Classify route as urban/rural/suburban based on census tract category containing the most route length
  • Pick some stops along each route: stop_sequence (min, max, midpoint)
  • Filter by time range in the day, skip overnight hours of service
  • Observe for that stop, how many trips it makes per hour (once every ___ min).
  • Calculate the average service hours it takes per trip, and scale up to see how many more trips (and therefore hours) operator would need to add along that route to bring to desired frequency.
  • Clarify: along a corridor, do we want all the routes to run at 15 min intervals overall, or each individual route to run at 15 min intervals? (@hunterowens)
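
The scaling step in the methodology could look roughly like this. It is a sketch only: it treats each route independently (one of the two readings raised in the clarification above), and the function and parameter names are hypothetical.

```python
import math


def additional_service(current_trips_per_hour, target_headway_min,
                       avg_service_hours_per_trip, span_hours):
    """Estimate extra trips and service hours needed to reach a target headway.

    target_headway_min: desired minutes between trips (15 urban, 30 suburban, 60 rural)
    span_hours:         hours of the day being evaluated (overnight excluded)
    """
    target_trips_per_hour = 60 / target_headway_min
    extra_per_hour = max(0.0, target_trips_per_hour - current_trips_per_hour)
    extra_trips = math.ceil(extra_per_hour * span_hours)
    return extra_trips, extra_trips * avg_service_hours_per_trip


# An urban route with 2 trips/hour over a 12-hour span, 0.75 service hours per trip.
print(additional_service(2, 15, 0.75, 12))
```

Routes that already meet the target return zero extra trips rather than a negative number.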

Research: How many validators for Thruway Buses

Question

Similar to the LOSSAN question @edasmalchi answered, Gillian would like to know how many validators would be needed to outfit the Thruway buses that connect to CCJPA, LOSSAN and SJRRA.

Metrics

How many validators would be needed for the Thruway bus services

Data sources

  • NTD
  • GTFS schedule

Some nuance here is needed, since Amtrak isn't the operator / owner of many of its buses. Gillian is looking up ways to get a diff estimate of # of buses in the fleet, but we can guesstimate for now

MSD Dashboard Metric: Number of feeds with data about physical accessibility

Question

How many feeds have data about physical accessibility?

This question can be answered by resolving cal-itp/data-infra#561 and cal-itp/data-infra#562, therefore it is blocked for now.

Metrics

The exact criteria deciding whether a feed "has data about physical accessibility" could be determined in various ways. See the proposed ideas below for an MVP and for a more rigorous analysis.

MVP:

  • Presence of stops#wheelchair_boarding field
  • Presence of trips#wheelchair_accessible field
  • Presence of non-empty pathways table when at least one child_stop within parent_stop exists in stops
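
The MVP criteria could be checked along these lines. This sketch assumes all three criteria must hold together, which the issue does not state explicitly, and it reads "child stop" as a stops row with a non-empty GTFS `parent_station` value. The function name is hypothetical.

```python
def has_accessibility_data(stops, trips, pathways):
    """MVP check on one feed; each argument is a list of dict rows
    (for example, as produced by csv.DictReader on the feed's .txt files)."""
    has_boarding_field = any("wheelchair_boarding" in row for row in stops)
    has_accessible_field = any("wheelchair_accessible" in row for row in trips)
    # Pathways are only required when some stop sits inside a parent station.
    has_child_stops = any(row.get("parent_station") for row in stops)
    pathways_ok = bool(pathways) or not has_child_stops
    return has_boarding_field and has_accessible_field and pathways_ok
```

The more rigorous analysis would replace the field-presence tests with minimum-share thresholds on non-unknown values.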

More rigorous analysis:

  • Require minimum percent of:
    • Rows in stops table with wheelchair_boarding field set to "not unknown" value
    • Rows in trips table with wheelchair_accessible field set to "not unknown" value
    • Child_stops within a parent_stop that can reach every other child_stop within parent_stop when simulating travel as an able-bodied person

Data sources

  • Static GTFS
    • stops#wheelchair_boarding
    • trips#wheelchair_accessible
    • Pathways data as it relates to child and parent stops

Research: downloading PEMS data

Question

Get PEMS data and determine which dataset should be downloaded and whether it should be pushed into BigQuery. PEMS data may help us answer how fast cars travel along the streets that bus routes also use. This helps us better understand traffic speeds at various times of day, so we can calculate car travel times along bus routes that run parallel to the SHN.

Metrics

  • Is data available only along detectors on the freeway / SHN?
  • Is data available for local streets and roads?
  • How often should we download? What tools to download gzipped files, unzip, basic data cleaning, then push to BQ?
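
For the unzip and basic-cleaning step, the Python standard library may be enough. The sketch below assumes the download is a gzipped, comma-separated text file already held in memory as bytes; the actual PEMS file layout is not described in this issue, so no column names are assumed, and the download and BigQuery upload steps are left out.

```python
import csv
import gzip
import io


def read_gzipped_csv(raw_bytes):
    """Decompress gzipped CSV bytes and return cleaned rows.

    Cleaning here is deliberately minimal: strip whitespace from each cell
    and drop rows that are entirely empty.
    """
    text = gzip.decompress(raw_bytes).decode("utf-8")
    reader = csv.reader(io.StringIO(text))
    return [
        [cell.strip() for cell in row]
        for row in reader
        if any(cell.strip() for cell in row)
    ]
```

The cleaned rows could then be written back out as a CSV or loaded into a dataframe before being pushed to the warehouse.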

Data sources

  • PEMS

User Story: parameterized reports with papermill

User stories

DLA notebooks for districts are a good prototype for papermill and parameterizing reports.

A prototype notebook and this script here are a good place to start.


Summary

  1. Figure out if GitHub Pages or something similar is where the notebook, once converted to HTML, should live.
  2. See what limited interactivity / user interactivity the HTML page can handle - tooltips, hovering, selecting lines, highlighting, etc
  3. Long-term: consider accessibility needs, including browser vs mobile devices, do we get rid of interactive elements, etc

MSD Dashboard Metric: Number of feeds with Fares v2 data

Question

How many feeds that we track have Fares v2 data?

Metrics

The total number of feeds that have a non-empty fare_leg_rules.txt file.

Data sources

A number of new files are being proposed to be added to the static GTFS specification together called "Fares v2". See reference doc. The MVP for checking this however is to simply assert that the fare_leg_rules.txt file is present and it contains at least one potentially empty row.
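
The MVP check might be sketched as below for a single feed zip held in memory. Interpreting "at least one row" as "a header line plus at least one further line" is an assumption, and `has_fares_v2` and `make_feed_zip` are hypothetical names; the latter only exists to build sample input.

```python
import io
import zipfile


def has_fares_v2(feed_zip_bytes):
    """True if fare_leg_rules.txt exists and has a header plus at least one more line."""
    with zipfile.ZipFile(io.BytesIO(feed_zip_bytes)) as feed:
        if "fare_leg_rules.txt" not in feed.namelist():
            return False
        # utf-8-sig tolerates a byte-order mark at the start of the file.
        lines = feed.read("fare_leg_rules.txt").decode("utf-8-sig").splitlines()
    return len(lines) >= 2


def make_feed_zip(files):
    """Build an in-memory zip from {filename: text}, for demonstration only."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name, content in files.items():
            archive.writestr(name, content)
    return buffer.getvalue()
```

Summing the result over all tracked feeds would give the metric.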

MSD Dashboard Metric: Number of feeds with GTFS-Realtime

Question

How many assessed operators have a complete set of associated GTFS-Realtime feeds?

Metrics

  • An assessed operator is considered to have a complete set of associated GTFS-Realtime feeds if the existence of all three types of GTFS-Realtime feeds (Trip Updates, Vehicle Positions, Service Alerts) can be confirmed.
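
The completeness rule reduces to a set comparison. In this sketch the feed-type labels and the (operator, feed_type) input shape are hypothetical; the real values would come from the Airtable data in the warehouse.

```python
# Hypothetical labels for the three GTFS-Realtime feed types.
REQUIRED_FEED_TYPES = {"trip_updates", "vehicle_positions", "service_alerts"}


def operators_with_complete_rt(feeds):
    """Return operators for which all three feed types are confirmed.

    feeds: iterable of (operator, feed_type) pairs.
    """
    seen = {}
    for operator, feed_type in feeds:
        seen.setdefault(operator, set()).add(feed_type)
    return sorted(
        operator for operator, types in seen.items() if REQUIRED_FEED_TYPES <= types
    )
```

The length of the returned list is the metric; dividing by the number of assessed operators gives a percentage.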

Data sources

  • Airtable via data warehouse

Depends on

New Team Member - [test]

Name:
Role:
Reports to:

Google Workspace Email Address:
GitHub Username:
Slack Username:

Set-up:

  • Technical Onboarding call scheduled

  • Added to tools:

    • Github
      • Organization: Cal-ITP
      • Team: jarvus
      • Team: warehouse-users
    • JupyterLab
    • Google Cloud Console
    • Metabase
    • Slack
  • Added to meetings:

    • Daily Stand-ups
    • Lunch n' Learn
  • Added to Slack channels:

    • #data-analyses
    • #data-office-hours
    • #lunch-n-learn

User Story: shared utility functions for analysis

User stories

A user story is implemented as well as it is communicated.
If the context and the goals are made clear, it will be easier for everyone to implement it, test it, refer to it.


Summary

A user story should typically have a summary structured this way:

For analysis, there's probably a set of steps in data cleaning or data visualization that we do repeatedly. We encounter them both within a research question and across research questions. Why reinvent the wheel?

I would want to document some of these shared utility functions to standardize and make it easier for analysts to do these steps. The utility functions would be importable across all directories in the data-analyses repo and can be called within a Jupyter notebook or Python script.

See 2 examples from City of LA work: covid19-indicators and planning-entitlements

Ex 1: exporting a geodataframe to geoparquet and save to GCS bucket. cal-itp/data-infra#698. Solution was to create a function that would write a geoparquet locally, upload to GCS bucket, then erase the local file. This is a repeated step that many analysts would come across using the JupyterHub + GCS, and the typical way to export doesn't work.
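
The write-locally, upload, then erase pattern from Ex 1 could be sketched generically as below. This is not the shared_utils implementation: `export_via_local_file` is a hypothetical name, and both the writer and the uploader are passed in as callables so the sketch makes no assumptions about the GCS client. With geopandas, the writer might be something like `lambda path: gdf.to_parquet(path)`.

```python
import os
import tempfile


def export_via_local_file(write_local, upload, filename):
    """Write a file to a temporary directory, upload it, then clean up.

    write_local: callable taking a local path and writing the file there
    upload:      callable taking a local path and copying it to remote storage
    """
    with tempfile.TemporaryDirectory() as tmp_dir:
        local_path = os.path.join(tmp_dir, filename)
        write_local(local_path)
        upload(local_path)
    # Leaving the with-block deletes the directory and the local copy,
    # even if the upload raised an exception.
```

Using a temporary directory means the local file never lingers in the analyst's working folder.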

Ex 2: aggregating by geography (census tract, Caltrans district, zip code, etc). There's a pandas function that helps us aggregate and take the sum, count, count unique values, etc. Currently, to do a mix of these, you could merge your dataframes back together using df.groupby().agg() or get wonky column names using df.pivot_table(). A common function would wrap all this aggregation up and be paired with attaching geometry back, as the geometry column throws errors when you aggregate.

Other examples would be common charts or maps. We'll add more as we come across more use cases of generalizable functions!

Acceptance Criteria

We can start with Tiffany / Eric's transit service research, and expand as more analyses are done by others.

Also, here are a few points that need to be addressed:

  1. Need to test that shared_utils is importable across all directories, since analysts are making their own folders within data-analyses to store their work.
  2. Would like the python files to be editable. This was in a previous docker-compose.yml: pip install -e...where should this go?

Notes

Initial work here: https://github.com/cal-itp/data-analyses/tree/shared-utils

Tester [Stakeholder]

  1. @tiffanychu90

Sprint Ready Checklist

    • Acceptance criteria defined
    • Team understands acceptance criteria
    • Team has defined solution / steps to satisfy acceptance criteria
    • Acceptance criteria is verifiable / testable
    • External / 3rd Party dependencies identified

Research: Contactless Payments Demonstration Review

Contactless Payments Demonstration Review

Over the past 6 months, the Cal-ITP payments team has been conducting a contactless payments demonstration with select agency partners across the State of California. Now that data and experience have been gathered, we are ready to begin synthesizing quantitative and qualitative learnings and communicating them out to stakeholders across the State.

Metrics

The document Potential Questions for Payments Data generated by the payments team has been used to scope out questions to be answered during the process of the demonstration, and the team has been iterating the document against their experiences. The areas covered by this document should be used to inform this review and include, but are not limited to:

  • Rate and speed of conversion to contactless payments
  • Revenue generated by contactless payments, relative to traditional payments
  • Behavior compared between full fare and discount fare customers
  • Influence on new transit ridership
  • Potential areas of focus: marketing effects, distribution of fare-capping, mobile-preference, rewards, eligibility, cash deposits, safety

Data sources

  • Payments Dashboard - Metabase
  • views in the warehouse
    • payments_rides
  • payments dataset
    • stg_cleaned_customers
    • stg_cleaned_micropayment_adjustments
    • stg_enriched_micropayments
  • Payments 101 Content - Lilly & Jenny
  • Anecdotal experience - Lilly, Jenny, Mjumbe, Ben

(Data Services Team to Copy and Fill Out Below)

This will be filled out by a member of the data services team.
This allows us to describe the request, in a way that is easy to hand-off for analysis.
After the research phase, we will sync with the asker to figure out if the metric and dashboard pieces are needed.

Before starting research:

  • Question: (Refine as needed) How has the introduction of contactless payments through the Contactless Payments Demonstration impacted transit agencies and their ridership from both a quantitative and anecdotal perspective?
  • View:
    • Payments Dashboard - Metabase
    • views related to payments in the warehouse
    • payments dataset
  • Research:
    • How should the results be presented?
    • When are they needed by?

After reviewing research with the asker:

  • Metric: what specific calculations are needed?
  • Dashboard: where should we put the result?

Research: Cost to provide GTFS-RT for small/rural operators

Question

About what would it cost to provide GTFS-RT for small/rural operators in CA? (to help Gillian with the budget process)

Metrics

  • number of (small/rural) operators
  • fleet sizes, both for current service and an estimate based on transit service increase work
  • total cost estimates based on GRaaS costs per vehicle and per operator
  • percentage of routes likely to not have cell signal (not in this analysis per Gillian's request)
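The total cost metric above combines a per-operator cost and a per-vehicle cost. A small sketch of that arithmetic follows; the class, the function, and every dollar figure in the usage note are placeholders, not actual GRaaS costs or fleet sizes.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Operator:
    name: str
    fleet_size: int


def estimate_total_cost(
    operators: Iterable[Operator],
    cost_per_vehicle: float,
    cost_per_operator: float,
) -> float:
    """Sum a fixed per-operator cost plus a per-vehicle cost across operators."""
    return sum(
        cost_per_operator + op.fleet_size * cost_per_vehicle
        for op in operators
    )
```

For example, two hypothetical operators with 10 and 5 vehicles, at a placeholder 1000 per operator and 100 per vehicle, give 2 * 1000 + 15 * 100 = 3500. Running the same function on current fleet sizes and on fleet sizes from the transit service increase work would give the two estimates listed in the metrics.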

Data sources

  • Data Warehouse/GTFS
  • Transit service increase estimates
  • Transit stacks/NTD
  • GRaaS costs from GRaaS team

(Data Services Team to Copy and Fill Out Below)

The QuVR MD template below will be filled out by a member of the data services team.
This allows us to describe the request, in a way that is easy to hand-off for analysis.
After the research phase, we will sync with the asker to figure out if the metric and dashboard pieces are needed.

Before starting research:

  • Question: Question written as a single sentence.
  • View: E.g. views.gtfs_schedule_fact_daily_feed_files.
  • Research:
    • How should the results be presented?
    • When are they needed by?

After reviewing research with the asker:

  • Metric: what specific calculations are needed?
  • Dashboard: where should we put the result?

NTD Process Mapping and Reporting Modernization

Question

Identify the relationships between existing NTD data products, and how Caltrans can improve its reporting processes.

Metrics

  • Number of issues coming back from NTD
  • Data sources of these issues - can GTFS or other sources improve this?
  • Are there other examples of NTD modernization?

Data sources

  • NTD data products
  • Caltrans data reports and communications with NTD
  • Internal process documents to identify reporting procedures

MSD Dashboard Metric: general population public transit gtfs coverage

Question

What % of California (and Californians) has (open to the public) transit coverage in GTFS?

Metrics

By area:

  • The % of the non-water area of California that is within 1/4 mi of a bus stop or 1 mi of a ferry/rail stop that is served by a publicly funded, open-to-the-general-public transit service with GTFS Schedule data

By Population:

  • The % of Californians that are within 1/4 mi of a bus stop or 1 mi of a ferry/rail stop that is served by a publicly funded, open-to-the-general-public transit service with GTFS Schedule data

By Employment (optional):

  • The % of jobs that are within 1/4 mi of a bus stop or 1 mi of a ferry/rail stop that is served by a publicly funded, open-to-the-general-public transit service with GTFS Schedule data
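A rough sketch of the by-population metric, under a simplifying assumption: each census block is represented by a single centroid point, and a block counts as covered if that point is within the mode's buffer distance of any qualifying stop. A real analysis would more likely use buffered geometries and area weighting; the input layout and names here are hypothetical.

```python
import math

EARTH_RADIUS_MI = 3958.8
# Buffer distances from the metric definition: 1/4 mi for bus, 1 mi for ferry/rail.
BUFFER_MI = {"bus": 0.25, "rail": 1.0, "ferry": 1.0}


def haversine_mi(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_MI * math.asin(math.sqrt(a))


def pct_population_covered(blocks, stops):
    """blocks: iterable of (lat, lon, population); stops: iterable of (lat, lon, mode)."""
    blocks = list(blocks)
    stops = list(stops)
    total = sum(pop for _, _, pop in blocks)
    if total == 0:
        return 0.0
    covered = sum(
        pop
        for lat, lon, pop in blocks
        if any(
            haversine_mi(lat, lon, s_lat, s_lon) <= BUFFER_MI[mode]
            for s_lat, s_lon, mode in stops
        )
    )
    return 100 * covered / total
```

The same function could be reused for the employment metric by passing job counts in place of population. The stops passed in are assumed to be already filtered to publicly funded, open-to-the-public services with GTFS Schedule data.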

Data sources

  • Census data
  • Airtable database listing transit services, eligibilities, and GTFS data
  • stops.txt for GTFS datasets and shapes.txt for continuous stops and locations.geojson for flexible services.

(Data Services Team to Copy and Fill Out Below)

The QuVR MD template below will be filled out by a member of the data services team.
This allows us to describe the request, in a way that is easy to hand-off for analysis.
After the research phase, we will sync with the asker to figure out if the metric and dashboard pieces are needed.

Before starting research:

  • Question: Question written as a single sentence.
  • View: E.g. views.gtfs_schedule_fact_daily_feed_files.
  • Research:
    • How should the results be presented?
    • When are they needed by?

After reviewing research with the asker:

  • Metric: what specific calculations are needed?
  • Dashboard: where should we put the result?

Open loop payment demos

Question

What can we learn from the four demos re: open loop payments?

Metrics

  • Differences in contactless versus nominal fare

Data sources

  • Payments data

Research: Update routes / stops shapefile scripts

Question

With the new views.gtfs_schedule_dim_shapes_geo table now available, update the traffic_ops scripts related to creating routes and stops data.

previous GH issue

Metrics

Data sources

(Data Services Team to Copy and Fill Out Below)

The QuVR MD template below will be filled out by a member of the data services team.
This allows us to describe the request, in a way that is easy to hand-off for analysis.
After the research phase, we will sync with the asker to figure out if the metric and dashboard pieces are needed.

Before starting research:

  • Question: Question written as a single sentence.
  • View: E.g. views.gtfs_schedule_fact_daily_feed_files.
  • Research:
    • How should the results be presented?
    • When are they needed by?

After reviewing research with the asker:

  • Metric: what specific calculations are needed?
  • Dashboard: where should we put the result?

Research: Parallel transit corridors to State Highway Network

Question

Which transit routes are considered parallel to the State Highway Network (SHN)?

Metrics

  • Find transit routes within 1 mile of SHN
  • Define parallel vs intersecting routes (based on the % of the transit route that falls within 1 mile of the SHN and the % of the highway that this overlap represents)
  • What improvements in service do these routes need to be competitive with cars?
  • Use Google Directions API to constrain car to travel along bus route, see which routes are viable competitive routes that can be targeted for improved service
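The parallel vs intersecting definition above rests on two percentages. The sketch below shows how those two shares could drive a classification once the overlap length has been measured; the thresholds (50% of the route, 10% of the highway) are hypothetical placeholders, not values decided in this issue, and the category labels are illustrative.

```python
def classify_route(
    route_length: float,
    overlap_length: float,
    highway_length: float,
    pct_route_threshold: float = 0.5,    # hypothetical cutoff
    pct_highway_threshold: float = 0.1,  # hypothetical cutoff
) -> str:
    """Classify a transit route relative to one highway.

    overlap_length is the length of the route that falls inside the
    1 mile buffer around the highway, in the same units as the other lengths.
    """
    if route_length <= 0 or highway_length <= 0:
        raise ValueError("Lengths must be positive.")
    if overlap_length <= 0:
        return "no overlap"
    pct_route = overlap_length / route_length
    pct_highway = overlap_length / highway_length
    if pct_route >= pct_route_threshold and pct_highway >= pct_highway_threshold:
        return "parallel"
    return "intersecting"
```

Requiring both shares to pass keeps a short route that merely crosses a long highway from being labeled parallel, and keeps a long route that only briefly follows the highway in the intersecting group.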

Data sources

Outputs / Presentations

(Data Services Team to Copy and Fill Out Below)

The QuVR MD template below will be filled out by a member of the data services team.
This allows us to describe the request, in a way that is easy to hand-off for analysis.
After the research phase, we will sync with the asker to figure out if the metric and dashboard pieces are needed.

Before starting research:

  • Question: Question written as a single sentence.
  • View: E.g. views.gtfs_schedule_fact_daily_feed_files.
  • Research:
    • How should the results be presented?
    • When are they needed by?

After reviewing research with the asker:

  • Metric: what specific calculations are needed?
  • Dashboard: where should we put the result?

Initial GTFS-RT self-serve speed/delay tools via Jupyter/Papermill

Is your feature request related to a problem? Please describe.
Caltrans staff and other stakeholders need self-serve access to our GTFS-RT speed and delay data in the coming weeks to enable work on the pending innovation challenge. This will likely occur before the RT pipeline is fully built out.

Describe the solution you'd like
An interactive webpage based on existing GTFS-RT speed/delay work, with functionality such as:

  • ability to select a Caltrans district and view maps+charts for operators in that district
  • interactive maps and charts allowing filtering by time of day, route, etc within a single, pre-processed day of analysis
  • a basic geospatial data export capability

Describe alternatives you've considered

Additional context
Eric to meet with @atvaccaro tomorrow about suitability of jupyter/papermill as a tool for this

Research: 5311 Agencies

Data Questions:

  1. How many 5311 agencies are there in California? By district? By county?
  2. What is the average fleet size and number of doors?
  3. How old is the fleet? When are their 10 or 12 year cycles up for their existing fleet? With this information, we can estimate which agencies will potentially apply for 5339 and/or TIRCP.
  4. How many 5311 agencies have a GTFS Status?
  5. How many 5311 agencies have an existing CAD/AVL Vendor? If so, when are the contract dates up?

Metrics:

  • Grouping vehicle types and age into bins.
  • Calculating how many agencies overlap in the data sets we use.
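Grouping vehicle ages into bins could look like the sketch below. The bin edges are an assumption: 10 and 12 are chosen to line up with the 10 or 12 year cycles mentioned in the questions, and the 5-year edge is arbitrary. Ages are assumed to be whole years.

```python
from bisect import bisect_right
from collections import Counter

# Hypothetical bin edges in whole years; each edge starts a new bin.
AGE_BIN_EDGES = [5, 10, 12]
AGE_BIN_LABELS = ["0-4", "5-9", "10-11", "12+"]


def age_bin(age_years: int) -> str:
    """Return the label of the bin that a vehicle age falls into."""
    return AGE_BIN_LABELS[bisect_right(AGE_BIN_EDGES, age_years)]


def count_by_age_bin(ages):
    """Count vehicles per age bin."""
    return Counter(age_bin(age) for age in ages)
```

With pandas, pd.cut over the same edges would produce an equivalent grouping directly on a dataframe column.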

Data sources:

Tasks and Goals:

  • Create a crosswalk for the Rural Reporters in California containing the NTD and Cal-ITP IDs
  • Determine aggregation method for the data
  • Create functions for cleaning, analysis, and visualizations

Research: TIRCP grants

Questions

  • What is the health of the TIRCP program, looking at the cycle in which award recipients received their money & their expenditure percentage?
  • How can Caltrans streamline the presenting and reporting of this data?
  • What other data sources can be merged in to look at other funding sources the agencies have received and to track environmental goals?
  • Use Python to automate two reports (Semi Annual and Program Allocation Plan) that are currently created manually, incorporating ADA standards as well.

Metrics

  • Signaling progress of award recipients through categories: behind, ahead, no expenditure recorded, on track.
  • Looking at percentage of expended, allocated, and TIRCP amounts through tables and charts.
  • Grouping project details to track the types of projects TIRCP is funding.
  • Looking at GHG reduction/emissions.
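The four progress categories in the first metric could be assigned with a rule like the one sketched here. How "expected" spending is defined is not stated in the issue, so the expected_pct input and the 10 percentage point tolerance are assumptions made purely for illustration.

```python
def expenditure_status(
    allocated: float,
    expended: float,
    expected_pct: float,
    tolerance: float = 0.10,  # hypothetical band around the expected share
) -> str:
    """Label a project by comparing its expended share to an expected share.

    expected_pct is the share of the allocation (0 to 1) that would be spent
    by now if the project were on schedule; how to derive it is left open.
    """
    if allocated <= 0:
        raise ValueError("allocated must be positive.")
    if expended <= 0:
        return "no expenditure recorded"
    pct_expended = expended / allocated
    if pct_expended < expected_pct - tolerance:
        return "behind"
    if pct_expended > expected_pct + tolerance:
        return "ahead"
    return "on track"
```

Applying one function like this to every row of the project tracking workbook would keep the category definitions consistent across the Semi Annual and Program Allocation Plan reports.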

Data sources

  • TIRCP project tracking Excel workbook.

MSD Dashboard Metric: Paratransit-using Californian GTFS coverage

Question

What % of Californians have access to GTFS-Flex data?

Metrics

  • The % of Californians that are within areas defined in the locations.geojson file of feeds with GTFS-Flex data.

Data sources

  • Census data
  • Feeds of providers

(Data Services Team to Copy and Fill Out Below)

The QuVR MD template below will be filled out by a member of the data services team.
This allows us to describe the request, in a way that is easy to hand-off for analysis.
After the research phase, we will sync with the asker to figure out if the metric and dashboard pieces are needed.

Before starting research:

  • Question: Question written as a single sentence.
  • View: E.g. views.gtfs_schedule_fact_daily_feed_files.
  • Research:
    • How should the results be presented?
    • When are they needed by?

After reviewing research with the asker:

  • Metric: what specific calculations are needed?
  • Dashboard: where should we put the result?

Research: MSD Dashboard, model which agencies would most improve RT and accessibility coverage

Question

If we were able to support a select number of agencies in providing GTFS accessibility data and GTFS-RT, which would make the biggest impact in overall coverage statewide (as measured in #169 and #170)?

Metrics

  • estimated coverage increase by agency
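One way to estimate coverage increase by agency is to count only the population in census blocks that are not already covered, then pick agencies greedily. This is a heuristic sketch, not necessarily the approach used for this issue; the data layout (block IDs mapped to population, agencies mapped to sets of block IDs) is assumed.

```python
def rank_agencies_by_marginal_coverage(
    block_population, already_covered, agency_blocks, n_agencies
):
    """Greedily pick agencies that add the most newly covered population.

    block_population: dict of block id -> population
    already_covered: iterable of block ids that already have coverage
    agency_blocks: dict of agency -> iterable of block ids it would cover
    Returns a list of (agency, added_population) in pick order.
    """
    covered = set(already_covered)
    remaining = {agency: set(blocks) for agency, blocks in agency_blocks.items()}
    picks = []
    for _ in range(min(n_agencies, len(remaining))):
        # Marginal gain: population in blocks not yet covered by anyone.
        gains = {
            agency: sum(block_population[b] for b in blocks - covered)
            for agency, blocks in remaining.items()
        }
        best = max(gains, key=gains.get)
        if gains[best] == 0:
            break  # no remaining agency adds new coverage
        picks.append((best, gains[best]))
        covered |= remaining.pop(best)
    return picks
```

Greedy selection does not guarantee the best possible set of agencies, but it accounts for overlap between service areas, which a simple ranking by each agency's own population would double count.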

Data sources

  • Data Warehouse (GTFS/RT)
  • Census (block level from 2020?)

(Data Services Team to Copy and Fill Out Below)

The QuVR MD template below will be filled out by a member of the data services team.
This allows us to describe the request, in a way that is easy to hand-off for analysis.
After the research phase, we will sync with the asker to figure out if the metric and dashboard pieces are needed.

Before starting research:

  • Question: Question written as a single sentence.
  • View: E.g. views.gtfs_schedule_fact_daily_feed_files.
  • Research:
    • How should the results be presented?
    • When are they needed by?

After reviewing research with the asker:

  • Metric: what specific calculations are needed?
  • Dashboard: where should we put the result?

Research: Prototype functions, scripts for analytics portfolio

Question

Use bus_service_increase project as a prototype of how to set up scripts to run data cleaning, data assembly, visualization (charts + maps), save in sub-directories for various districts, counties, etc.

Currently, work is done in notebooks, and while it calls functions, the entire workflow to produce visualizations can benefit from more automation.

Long-term goal: create analytics portfolio for districts to access similar set of metrics and visualizations related to bus_service_increase, dla grants, drmt grants, parallel corridors, rt delay.

Metrics

Take existing work that lives in notebooks and generalize to accommodate various subsets of data (by district, by MPO, etc) and produce similar set of outputs.
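A minimal sketch of generalizing a notebook workflow across subsets: group the records by a subset column, run the same output function on each group, and save each result in its own sub-directory. The function name, the make_output callable, and the summary.txt file name are all hypothetical stand-ins for the real cleaning, assembly, and visualization steps.

```python
from pathlib import Path


def run_for_subsets(records, subset_key, make_output, output_root):
    """Run the same output step for each subset and save it in a sub-directory.

    records: iterable of dicts; subset_key: e.g. "district" or "county"
    make_output: callable taking the subset's rows and returning text to save
    Returns the list of paths written, ordered by subset name.
    """
    groups = {}
    for record in records:
        groups.setdefault(record[subset_key], []).append(record)

    written = []
    for name, rows in sorted(groups.items()):
        out_dir = Path(output_root) / str(name)
        out_dir.mkdir(parents=True, exist_ok=True)
        out_path = out_dir / "summary.txt"
        out_path.write_text(make_output(rows))
        written.append(out_path)
    return written
```

The same loop structure would apply if make_output wrote charts or maps instead of text, or if each subset were passed as a parameter to a notebook run.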

Research: All GTFS-RT Speedmaps and Tools available for Better Bus Challenge

Question

How can ongoing RT speed/delay work best support Caltrans districts and other jurisdictions in improving the bus experience, statewide?

Metrics

  • Speedmaps and other metrics run for each operator with available RT data
  • Results segmented by district, reproducible, and accessible alongside other ongoing analysis work (989, etc)
  • Instructions/documentation drafted
  • Instructions/documentation reviewed and complete

Data sources

  • GTFS-RT vehicle positions (existing raw table in warehouse)
  • GTFS Schedule data (in warehouse)

Research: GTFS-RT Speedmaps and Presentation Ready for D10

Question

Meeting scheduled for Feb 22. Before then, need to have speedmaps and metrics generated for District 10 transit operators, as well as a polished presentation.

Metrics

  • Speedmaps and other metrics run for each D10 operator with available RT data
  • Presentation drafted
  • Presentation reviewed and complete

Data sources

  • GTFS-RT vehicle positions (existing raw table in warehouse)
  • GTFS Schedule data (in warehouse)

MSD Dashboard Metric: GTFS coverage for a wheelchair-user

Question

What % of California has (open to the public) transit coverage in GTFS which is explicitly wheelchair accessible?

Metrics

By area:
The % of the non-water area of California that is within 1/4 mi of a bus stop or 1 mi of a ferry/rail stop that is explicitly wheelchair accessible (and, if in a station, that station has explicit pathways coding), and that is served by a publicly funded, open-to-the-general-public transit service that has GTFS Schedule data and is itself explicitly wheelchair accessible

By Population:
The % of Californians that are within 1/4 mi of a bus stop or 1 mi of a ferry/rail stop that is explicitly wheelchair accessible (and, if in a station, that station has explicit pathways coding), and that is served by a publicly funded, open-to-the-general-public transit service that has GTFS Schedule data and is itself explicitly wheelchair accessible

By Employment (optional):
The % of jobs that are within 1/4 mi of a bus stop or 1 mi of a ferry/rail stop that is explicitly wheelchair accessible (and, if in a station, that station has explicit pathways coding), and that is served by a publicly funded, open-to-the-general-public transit service that has GTFS Schedule data and is itself explicitly wheelchair accessible

NOTE: - I don't think we need to or should interface this with any census data about having a disability.

Thoughts on this:

  1. we should assume that anybody could be disabled at any time (especially as people age in place).
  2. people using wheelchairs shouldn't be limited in where they go to where they have already self-selected to go. Per various ADA, Unruh, etc, we should be evaluating based on what access the general public has.
  3. The data isn't really there in a great way to support it anyway.

Data sources

  • Census data
  • Airtable database listing transit services, eligibilities, and GTFS data.
  • stops.txt for GTFS datasets and shapes.txt for continuous stops and locations.geojson for flexible services.

(Data Services Team to Copy and Fill Out Below)

The QuVR MD template below will be filled out by a member of the data services team.
This allows us to describe the request, in a way that is easy to hand-off for analysis.
After the research phase, we will sync with the asker to figure out if the metric and dashboard pieces are needed.

Before starting research:

  • Question: Question written as a single sentence.
  • View: E.g. views.gtfs_schedule_fact_daily_feed_files.
  • Research:
    • How should the results be presented?
    • When are they needed by?

After reviewing research with the asker:

  • Metric: what specific calculations are needed?
  • Dashboard: where should we put the result?

Research: Consolidated Application Applicants

Data Questions:

This issue looks at the organizations that applied for 5311/5311(f)/CMAQ, 5339(a), and/or LCTOP using the new Consolidated Application process. Applicants only need to complete one application for the funds above, once a year. The deadlines are in late March for LCTOP and late April for the other programs.

Metrics:

  1. Which organizations applied?
  2. Which funds are the most vied for?
  3. How much funding did the organizations receive?
  4. What are they planning to use the funds for?
  5. What mix of funds did organizations apply for?

Data sources:

Research: Planning and Modal Advisory Committee prep

Question

As part of the Caltrans Strategic Plan, Performance Plan, one of the big strategic goals related to multimodal transportation is P-01, to increase the total amount of service on the SHN and the reliability of that service by 2024.

Meeting on 3/9/22, @edasmalchi and @tiffanychu90 to present.

Metrics

  • Take expansive definition of "service on the SHN" to include all transit routes within 1 mile buffer, use parallel-corridors analysis
  • Count number of service hours and routes provided on typical weekday total, on SHN, for other intersecting routes that aren't parallel but somewhat touch SHN, and those that are not at all parallel
  • Show district breakdown of parallel routes and service
  • Show RT maps for subset of parallel routes to measure reliability of service

Data sources

  • Build on existing work in bus-service-increase and parallel-corridors
  • GTFS scheduled
  • GTFS RT

(Data Services Team to Copy and Fill Out Below)

The QuVR MD template below will be filled out by a member of the data services team.
This allows us to describe the request, in a way that is easy to hand-off for analysis.
After the research phase, we will sync with the asker to figure out if the metric and dashboard pieces are needed.

Before starting research:

  • Question: Question written as a single sentence.
  • View: E.g. views.gtfs_schedule_fact_daily_feed_files.
  • Research:
    • How should the results be presented?
    • When are they needed by?

After reviewing research with the asker:

  • Metric: what specific calculations are needed?
  • Dashboard: where should we put the result?

Division of Local Assistance (DLA) Data Driven Grant Management

Questions

  • Where can data help to standardize grant management?
  • Which grants are the most important in the eyes of the customer? If any, where is the overlap?
  • Where in the application process is the customer at any given moment?

Metrics

  • Number of grants by grant type, location and awardee, amount awarded

Data Sources:

Goals/Tasks:

  • Develop a unified schema for grant tracking and write automated scripts for programs already using databases
  • Determine the data structure of the database
  • Create a cohesive list of active DLA grants issued by State and Federal programs
  • Produce geographies of interest for existing and potential grant applicants
  • Center the grant process around the customer
