mobilitydata / mobility-database-catalogs
The Catalogs of Sources of the Mobility Database.
License: Apache License 2.0
What problem are we trying to solve?
Users need to get the latest dataset of a source, which requires modifying the current data pipeline to extract the latest dataset and its bounding box.
How will we know when this is done?
There is a latest dataset URL available for each source.
A cronjob runs daily to check for the most recent update.
Constraints
What problem are we trying to solve?
Users unfamiliar with GitHub need an easy way to request new sources and source updates.
How will we know when this is done?
The pre-existing Google form includes the additional fields that are needed to populate the catalogs.
Same as here
Displaying municipality and subdivision as unknown made sense when they were required, but now they are optional. They should show up as blank if null.
What problem is your feature request trying to solve?
The filename and last URL of a source should not contain accents, even if the source's provider, name and location contain them.
Describe the solution you'd like
Add an additional step to the normalization to remove the accents. Example here.
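One way to sketch such a normalization step, using only the Python standard library (the repo's actual normalization code may differ):

```python
import unicodedata

# Possible accent-removal step for provider, name, and location strings.
# This is a sketch; the catalog's real normalization function may differ.
def strip_accents(value):
    # NFKD decomposition splits accented characters into their base
    # character plus combining marks, which are then dropped.
    return "".join(
        c for c in unicodedata.normalize("NFKD", value)
        if not unicodedata.combining(c)
    )
```

For example, `strip_accents("Société de transport de Montréal")` yields `"Societe de transport de Montreal"`, which is safe to use in filenames and URLs.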
Is your feature request related to a problem? Please describe.
The verification of the source URL should only be done on added or modified source files; the other files should not be checked.
Describe the solution you'd like
Modify the workflow to check that the changed files are in the catalogs directory.
What problem are we trying to solve?
We want to make the data publicly available to the community in an organized way, and test the pipeline to ensure everything is working before launch.
How will we know when this is done?
200 sources have been added to the catalogs, with bounding box and latest dataset information.
MobilityData is sharing a roadmap for the Mobility Database, a central hub for mobility datasets across the world. This roadmap is informed by user interviews conducted in December 2021 and January 2022, the product research MobilityData shared in spring 2021, and the support requests received through OpenMobilityData. These outcomes and priorities have been revised to reflect our most recent conversations with the community.
The roadmap’s features are organized around 3 main outcomes for the Mobility Database, which are shared in detail below.
This roadmap will be regularly developed through annual feedback and inputs from the international mobility community, with public quarterly prioritization meetings to assess goals. You can learn more about how to contribute to this work throughout 2022 below.
Mobility Database provides reliable and up-to-date GTFS Schedule and GTFS Realtime data.
All stakeholders need access to up-to-date datasets and a stable URL for the datasets so the information they’re using is reliable and accurate. This is the first critical outcome that must be achieved.
It’s easy for data producers and consumers to collaborate to improve data on Mobility Database.
In order for Mobility Database to be the central hub for all GTFS datasets across the world, data producers and consumers need to be able to easily add and improve data together. Governance will be worked on in stages. Currently, stakeholders can request source updates, and later we’ll implement a process for data producers to easily verify that their source has been updated.
Shared identifiers allow data producers and consumers to reduce duplication and increase discoverability.
A lack of consistency in stop, trip, route, agency and source IDs makes it difficult to link static and realtime sources together and reduce duplication in what trip planning data travellers see. It also makes representing complex relationships, like interagency fare transfers, incredibly labor intensive. There is a need for shared identifiers across providers’ datasets to maintain consistency and reduce duplication of effort across the industry. This is a critical gap Mobility Database can fill as the canonical database for open mobility data.
What problem is your feature request trying to solve?
There is no test verifying the uniqueness of the GTFS Realtime URLs, as there currently is for GTFS Schedule direct download URLs.
Describe the solution you'd like
Add new tests.
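A hedged sketch of what such a uniqueness check could look like, extended to the GTFS Realtime URL fields (the field names below are assumptions, not necessarily the repo's schema):

```python
# Sketch of a uniqueness assertion over a list of source dicts.
# url_keys would name the GTFS Realtime URL fields to check, e.g.
# realtime_vehicle_positions, realtime_trip_updates, realtime_alerts.
def assert_urls_unique(sources, url_keys):
    seen = set()
    for source in sources:
        for key in url_keys:
            url = source.get("urls", {}).get(key)
            if url is None:
                continue
            assert url not in seen, f"Duplicate URL: {url}"
            seen.add(url)
```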
What problem is your feature request trying to solve?
Status checks cannot currently be made required because their job names are re-used across the workflow files (many jobs are called "build").
Describe the solution you'd like
We need to modify the job names so that they are distinctive, then make the status checks required for the unit tests and integration tests in the settings, so that a PR will require the latest version of main to be merged into the branch before it can be merged. The settings for status checks are located at: Settings --> Branches --> Branch Protection Rules --> Main --> Protect Matching Branches --> Require branches to be up to date before merging.
Describe the bug
The export_to_csv.yml workflow fails because the columns static_reference, urls.realtime_vehicle_positions, urls.realtime_trip_updates and urls.realtime_alerts are missing from the CSV. The problem occurs because we removed the GTFS Realtime source for Barrie Transit in #114; the process to export the CSV requires each column (or feature) to exist in at least one source in order to work.
To Reproduce
Run the export_to_csv.yml workflow.
Expected behavior
The process should not fail if a column is missing; the column should simply not be added to the CSV. We should modify the code so that each column is added only if it exists in at least one source.
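A minimal sketch of the proposed fix, assuming the export builds its column list before writing the CSV (the function and field names here are illustrative, not the repo's actual code):

```python
# Only include a candidate column when at least one source defines it,
# so removing the last source with a given field no longer breaks the export.
def collect_columns(sources, candidate_columns):
    return [
        column for column in candidate_columns
        if any(column in source for source in sources)
    ]
```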
What problem are we trying to solve?
We want JSON file names to be easily scannable, even when provider names are long or there are multiple providers in one source.
How will we know when this is done?
Describe the bug
The cronjob is failing, probably due to a memory error (the action run page does not load).
To Reproduce
Go to the workflow page. It is not possible to open the run page (the artifacts may be too big), so we can't restart the failing job to try to reproduce the failure.
Expected behavior
The cronjob should work properly.
What problem is your feature request trying to solve?
We extract the coordinates of the bounding box from a given dataset and keep these coordinates without checking their values. Sometimes, we end up extracting invalid coordinates (out of range). The coordinates should be set to null in such a case.
Describe the solution you'd like
Make sure that the extracted coordinates are valid. If not, set the coordinates to null.
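A sketch of the validation described above, assuming the box is represented as (minimum latitude, maximum latitude, minimum longitude, maximum longitude); the function name is illustrative:

```python
# Validate extracted bounding-box coordinates: latitude must be in
# [-90, 90] and longitude in [-180, 180]. Any out-of-range value
# nulls out the whole box (None becomes null in the JSON output).
def validate_bounding_box(min_lat, max_lat, min_lon, max_lon):
    lats_ok = all(-90.0 <= v <= 90.0 for v in (min_lat, max_lat))
    lons_ok = all(-180.0 <= v <= 180.0 for v in (min_lon, max_lon))
    if lats_ok and lons_ok:
        return min_lat, max_lat, min_lon, max_lon
    return None, None, None, None
```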
Update the schedule schema to include API key permission (yes/no) and where users can go to find the API key authentication steps. This can match the realtime schema's authentication information.
What problem are we trying to solve?
Users need to be able to pull the catalogs' sources based on criteria like location, the latest dataset, etc.
How will we know when this is done?
Allow users to:
add_source
update_source
get_sources
get_sources_by_bounding_box
get_latest_datasets
JSON file names are country-code-subdivision-code-provider-numerical-id.json
Is your feature request related to a problem? Please describe.
A new source should have a stable URL different from those of the other sources present in the database. If not, it should not be added to the database.
Describe the solution you'd like
Add a workflow to verify that a submitted source has a URL that is not already in the database. If the workflow fails, the URL already exists for a source in the database.
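A hypothetical check behind such a workflow: scan every catalog JSON file and flag a submitted URL that is already present. The directory layout and the urls.direct_download field are assumptions for illustration.

```python
import glob
import json

# Return the path of the catalog file that already uses candidate_url,
# or None if the URL is new. Field names are illustrative.
def find_duplicate_url(candidate_url, catalogs_dir):
    for path in glob.glob(f"{catalogs_dir}/**/*.json", recursive=True):
        with open(path, encoding="utf-8") as f:
            source = json.load(f)
        if source.get("urls", {}).get("direct_download") == candidate_url:
            return path  # the URL already exists for this source
    return None
```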
What problem are we trying to solve?
Users need to be able to add, update, and get GTFS realtime sources.
How will we know when this is done?
These operations are readily available on the repo.
What problem are we trying to solve?
As discussed here, the concept of a transit provider needs a clearer definition within the schema.
After further review and consideration, it makes more sense to include transit provider information when there is an independent catalog for providers that can highlight relationships between services and organizations. For V1, we plan to only include agency names that can be identified in agency.txt.
How do we know when this is done?
What problem are we trying to solve?
Users need to know the scope of the catalogs, how to contribute to the code and the data, and how to use the scripts.
How will we know when this is done?
There are README and CONTRIBUTE documents that outline what user stories the catalogs cover, how users can use the catalogs, and how they can contribute to them.
What problem are we trying to solve?
We want to change the auto-discovery URL because "discovery" is not a commonly used term or convention in GTFS. (The original term was inspired by GBFS' systems.csv).
We intend to change this to "direct download URL" so it's clear that this link downloads a source.
How will we know when this is done?
All references to the auto-discovery URL are removed from the JSON schema, operations, and CSV export, and the direct download URL is used instead.
How do we know when this is done?
This information is populated into the URLs document.
If the WikiBase ID exists in the URLs document, do not extract it.
Output provided includes
Describe the bug
The store_latest_dataset_cronjob.yml workflow is failing. See error logs here.
The problem is at lines 20 to 22: the step "Get added and modified files" using jitterbit/get-changed-files@v1 should not be there. It was copied by mistake from the "On approval" workflow. This GitHub action is incompatible with schedule events and is irrelevant for this workflow, so it should be removed.
To Reproduce
Let the scheduled job run.
Expected behavior
The scheduled job should run correctly.
What problem is your feature request trying to solve?
As we migrate from the Wikibase instance, we may need to update the Mobility Archives hierarchy/file system to reflect our current needs.
The current bucket and file system for a source is
Storage level. For each source:
The system was built this way because we wanted to give each object a different [lifecycle](https://cloud.google.com/storage/docs/lifecycle) and we thought it should be done per bucket rather than per object. After re-reading the documentation, I believe the lifecycle rules apply to a bucket, but each file in the bucket is assigned a lifecycle class based on how often it is used.
So we should restructure this, and also remove the dependency on the archives_id values that are linked to the Wikibase instance.
Describe the solution you'd like
Storage level:
Then, if we have a map from the Wikibase entity IDs to the Mobility Catalogs MDB source IDs, we will be able to restructure and keep what we previously did (store the historical data).
Describe alternatives you've considered
How will we know when this is done?
Additional context
How do we know when this is done?
What problem are we trying to solve?
We have approximately 1300 sources that have been harvested from different data platforms. However:
We want this data to be easily discoverable and follow the conventions stated in the schema before importing.
How will we know when this is done?
What problem are we trying to solve?
We want to make the data publicly available to the community in an organized way, and test the pipeline to ensure everything is working before launch.
How will we know when this is done?
A second round of sources (100) has been added to the catalogs, with bounding box and latest dataset information.
What problem is your feature request trying to solve?
The accents appear as their Unicode escape sequences in the JSON files, e.g. á is \u00e1.
Describe the solution you'd like
Load and dump the JSON file without forcing ASCII characters, by passing ensure_ascii=False. See here.
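A minimal illustration of the fix: with ensure_ascii=False, json.dump writes accented characters literally instead of as \uXXXX escapes. The file path and function name below are placeholders, not the repo's actual code.

```python
import json

# Re-dump a catalog dict so accented characters are written as-is.
# With the default ensure_ascii=True, "á" would be written as \u00e1.
def dump_json_utf8(data, path):
    with open(path, "w", encoding="utf-8") as f:
        json.dump(data, f, ensure_ascii=False, indent=4)
```

For comparison, `json.dumps({"p": "á"})` produces `'{"p": "\u00e1"}'`, while `json.dumps({"p": "á"}, ensure_ascii=False)` produces `'{"p": "á"}'`.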
Review Transitfeeds, Transit.land, Interlines, etc
What problem are we trying to solve?
We want users to be able to easily search for data by location and provider.
How will we know when this is done?
The JSON schema is updated to match the following field breakdowns.
The CSV artifact is updated to match the fields.
Field Name | Required from users | Definition |
---|---|---|
MDB Source ID | No - system generated | Unique identifier following the structure: mdbsrc-provider-subdivisionname-countrycode-numericalid. 3-character minimum and 63-character maximum, based on Google Cloud Storage. |
Data Type | Yes | The data format that the source uses, e.g. GTFS, GTFS-RT. |
Country Code | Yes | ISO 3166-1 alpha-2 code designating the country where the system is located. For a list of valid codes, see here. |
Subdivision name | Yes | ISO 3166-2 subdivision name designating the subdivision (e.g. province, state, region) where the system is located. For a list of valid names, see here. |
Municipality | Yes | Primary municipality in which the transit system is located. |
Provider | Yes | Name of the transit provider. |
Name | Optional | An optional description of the data source, e.g. to specify whether the data source is an aggregate of multiple providers, or which network is represented by the source. |
Auto-Discovery URL | Yes | URL that automatically opens the source. |
Latest dataset URL | No - system generated | A stable URL for the latest dataset of a source. |
License URL | Optional | The transit provider’s license information. |
Bounding box | No - system generated | The bounding box of the data source when it was first added to the catalog, including the date and timestamp the bounding box was extracted, in UTC. |
Describe the bug
In tools.helpers.is_readable, when an exception is detected in the try/except block, a TypeError is raised: "TypeError: exceptions must derive from BaseException". We expect an exception to be raised, but we always get this same TypeError because of a syntax problem in the try/except block: the correct syntax is raise Exception("message"), not raise ("message"). Correcting the syntax will correct the behaviour.
To Reproduce
Import tools.helpers.is_readable and test it with a URL that is not a GTFS Schedule source.
Expected behavior
The raised exception should come with a distinctive message.
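An illustrative reconstruction of the bug, not the actual is_readable code: raising a bare string triggers "TypeError: exceptions must derive from BaseException", masking the intended message, while raise Exception(...) propagates it.

```python
# Sketch of the pattern described in the bug report; the real
# tools.helpers.is_readable implementation may differ.
def is_readable(url, load_func):
    try:
        load_func(url)
    except Exception as e:
        # Buggy form: raise (f"Exception {e} found while reading the URL")
        # raises a str, which Python rejects with a TypeError.
        # Correct form derives from BaseException:
        raise Exception(f"Exception {e!r} found while reading the URL {url}.")
    return True
```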
This is an issue for planning the 1.0 release of the catalogs. It will be updated as we plan and develop the next release.
Please share with us below your needs and ideas! 💡
You can see the current status of work by looking at the issues tagged as V1 in our sprint board. If you'd like to contribute to this release, please get in touch with Emma at [email protected].
The working document used to design this feature is available at https://bit.ly/v1-catalogs-working-doc. Feel free to leave comments directly in it!
The release’s priorities were based on the user research analysis MobilityData conducted in late 2021 and early 2022.
What problem are we trying to solve?
How will we know when this is done?
What problem are we trying to solve?
Data producers unfamiliar with JSON need an easy way to search the catalog sources and identify if their source information is correct.
How will we know when this is done?
There is a stable URL to a CSV export that we can link to in the source update Google form.
We need to import the 1300+ sources we've harvested through the data pipeline and share them in the catalogs.
What problem are we trying to solve?
As partially discussed here, municipality and subdivision are not always relevant for large rail systems or aggregate sources, so they should be made optional fields.
How do we know when this is done?
What is the problem we're trying to solve?
In running tests on this PR after I forked the repo, this error kept recurring:
Error: google-github-actions/auth failed with: the GitHub Action workflow must specify exactly one of "workload_identity_provider" or "credentials_json"! If you are specifying input values via GitHub secrets, ensure the secret is being injected into the environment. By default, secrets are not passed to workflows triggered from forks, including Dependabot.
It looks like this is because secrets are not injected on PRs from forks.
This needs to be fixed in order for us to accept contributions.
How do we know when this is done?
PRs from forked repos can be accepted without the error.
What problem are we trying to solve?
Users need GTFS realtime information and an easy way to identify which GTFS schedule source is associated with it.
How will we know when this is done?
There is a GTFS realtime schema available with 2 example files in the repository. We've received feedback on the schema from at least one external party before implementing the first iteration.
The realtime schema is added to the README
Schema includes
What problem are we trying to solve?
To understand the scalability of using the ISO standard to define location in the catalogs, we want to see what libraries are available to generate an ISO subdivision from the bounding box.
How will we know when this is done?
We have identified the options and tradeoffs associated with different approaches to automatically generating this information.
Right now, the names of the .json files containing the dataset information are a concatenation of:
{Country}-{Subdivision Name}-{Provider Name}-{dataset type}-{id}
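Purely for illustration, a filename following the quoted pattern might be composed like this; the lowercasing and space-to-hyphen handling are assumptions, not documented conventions.

```python
# Compose a catalog filename from its parts, following the pattern
# {Country}-{Subdivision Name}-{Provider Name}-{dataset type}-{id}.json.
# Empty parts are skipped; normalization rules here are assumptions.
def make_filename(country, subdivision, provider, dataset_type, numerical_id):
    parts = [country, subdivision, provider, dataset_type, str(numerical_id)]
    return "-".join(p.lower().replace(" ", "-") for p in parts if p) + ".json"
```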
Is the ID unique to the combo of:
In addition to adding the definition of what the ID is "identifying" to the documentation, I'm interested in what happens in the following (real life) use cases: