mobilitydata / mobility-database-catalogs
The Catalogs of Sources of the Mobility Database.
License: Apache License 2.0
What problem are we trying to solve?
Users need to get the latest dataset of a source, which requires modifying the current data pipeline to extract the latest dataset and its bounding box.
How will we know when this is done?
There is a latest dataset URL available for each source.
A cronjob runs daily to check for the most recent update.
Constraints
What problem are we trying to solve?
Users unfamiliar with GitHub need an easy way to request new sources and source updates.
How will we know when this is done?
The pre-existing Google form includes the additional fields that are needed to populate the catalogs.
Same as here
Displaying municipality and subdivision as unknown made sense when they were required, but now they are optional. They should show up as blank if null.
What problem is your feature request trying to solve?
The filename and last URL of a source should not contain accents, even if the source's provider, name and location contain them.
Describe the solution you'd like
Add an additional step to the normalization to remove the accents. Example here.
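One way to sketch such a normalization step, using only the Python standard library (the repo's actual normalization code may differ):

```python
import unicodedata

# Possible accent-removal step for provider, name, and location strings.
# This is a sketch; the catalog's real normalization function may differ.
def strip_accents(value):
    # NFKD decomposition splits accented characters into their base
    # character plus combining marks, which are then dropped.
    return "".join(
        c for c in unicodedata.normalize("NFKD", value)
        if not unicodedata.combining(c)
    )
```

For example, `strip_accents("Société de transport de Montréal")` yields `"Societe de transport de Montreal"`, which is safe to use in filenames and URLs.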
Is your feature request related to a problem? Please describe.
The verification of the source URL should only be done on added or modified source files; the other files should not be checked.
Describe the solution you'd like
Modify the workflow to check that the changed files are in the catalogs directory.
What problem are we trying to solve?
We want to make the data publicly available to the community in an organized way, and test the pipeline to ensure everything is working before launch.
How will we know when this is done?
200 sources have been added to the catalogs, with bounding box and latest dataset information.
MobilityData is sharing a roadmap for the Mobility Database, a central hub for mobility datasets across the world. This roadmap is informed by user interviews conducted in December 2021 and January 2022, the product research MobilityData shared in spring 2021, and the support requests received through OpenMobilityData. These outcomes and priorities have been revised to reflect our most recent conversations with the community.
The roadmap’s features are organized around 3 main outcomes for the Mobility Database, which are shared in detail below.
This roadmap will be regularly developed through annual feedback and inputs from the international mobility community, with public quarterly prioritization meetings to assess goals. You can learn more about how to contribute to this work throughout 2022 below.
Mobility Database provides reliable and up-to-date GTFS Schedule and GTFS Realtime data.
All stakeholders need access to up-to-date datasets and a stable URL for the datasets so the information they’re using is reliable and accurate. This is the first critical outcome that must be achieved.
It’s easy for data producers and consumers to collaborate to improve data on Mobility Database.
In order for Mobility Database to be the central hub for all GTFS datasets across the world, data producers and consumers need to be able to easily add and improve data together. Governance will be worked on in stages. Currently, stakeholders can request source updates, and later we’ll implement a process for data producers to easily verify that their source has been updated.
Shared identifiers allow data producers and consumers to reduce duplication and increase discoverability.
A lack of consistency in stop, trip, route, agency and source IDs makes it difficult to link static and realtime sources together and reduce duplication in what trip planning data travellers see. It also makes representing complex relationships, like interagency fare transfers, incredibly labor intensive. There is a need for shared identifiers across providers’ datasets to maintain consistency and reduce duplication of effort across the industry. This is a critical gap Mobility Database can fill as the canonical database for open mobility data.
What problem is your feature request trying to solve?
There is no test verifying the uniqueness of the GTFS Realtime URLs, as there currently is for GTFS Schedule direct download URLs.
Describe the solution you'd like
Add new tests.
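A hedged sketch of what such a uniqueness check could look like, extended to the GTFS Realtime URL fields (the field names below are assumptions, not necessarily the repo's schema):

```python
# Sketch of a uniqueness assertion over a list of source dicts.
# url_keys would name the GTFS Realtime URL fields to check, e.g.
# realtime_vehicle_positions, realtime_trip_updates, realtime_alerts.
def assert_urls_unique(sources, url_keys):
    seen = set()
    for source in sources:
        for key in url_keys:
            url = source.get("urls", {}).get(key)
            if url is None:
                continue
            assert url not in seen, f"Duplicate URL: {url}"
            seen.add(url)
```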
What problem is your feature request trying to solve?
Status checks cannot currently be made required because their job names are re-used across the workflow files (many jobs are called "build").
Describe the solution you'd like
We need to modify the job names so that they are distinctive, then make the status checks required for the unit tests and integration tests in the settings, so that a PR will require the latest version of main to be merged into the branch before it can be merged. The settings for status checks are located at: Settings --> Branches --> Branch Protection Rules --> Main --> Protect Matching Branches --> Require branches to be up to date before merging.
Describe the bug
The export_to_csv.yml workflow fails because the columns static_reference, urls.realtime_vehicle_positions, urls.realtime_trip_updates and urls.realtime_alerts are missing from the CSV. The problem occurs because we removed the GTFS Realtime source for Barrie Transit in #114; the process to export the CSV requires each column (or feature) to exist in at least one source in order to work.
To Reproduce
Run the export_to_csv.yml workflow.
Expected behavior
The process should not fail if a column is missing; the column should simply not be added to the CSV. We should modify the code so that each column is added only if it exists in at least one source.
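A minimal sketch of the proposed fix, assuming the export builds its column list before writing the CSV (the function and field names here are illustrative, not the repo's actual code):

```python
# Only include a candidate column when at least one source defines it,
# so removing the last source with a given field no longer breaks the export.
def collect_columns(sources, candidate_columns):
    return [
        column for column in candidate_columns
        if any(column in source for source in sources)
    ]
```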
What problem are we trying to solve?
We want JSON file names to be easily scannable, even when provider names are long or there are multiple providers in one source.
How will we know when this is done?
Describe the bug
The cronjob is failing, probably due to a memory error (the action run page does not load).
To Reproduce
Go to the workflow page. It is not possible to open the run page (the artifacts may be too big), so we can't restart the failing job to try to reproduce the failure.
Expected behavior
The cronjob should work properly.
What problem is your feature request trying to solve?
We extract the coordinates of the bounding box from a given dataset and keep these coordinates without checking their values. Sometimes, we end up extracting invalid coordinates (out of range). The coordinates should be set to null in such a case.
Describe the solution you'd like
Make sure that the extracted coordinates are valid. If not, set the coordinates to null.
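A sketch of the validation described above, assuming the box is represented as (minimum latitude, maximum latitude, minimum longitude, maximum longitude); the function name is illustrative:

```python
# Validate extracted bounding-box coordinates: latitude must be in
# [-90, 90] and longitude in [-180, 180]. Any out-of-range value
# nulls out the whole box (None becomes null in the JSON output).
def validate_bounding_box(min_lat, max_lat, min_lon, max_lon):
    lats_ok = all(-90.0 <= v <= 90.0 for v in (min_lat, max_lat))
    lons_ok = all(-180.0 <= v <= 180.0 for v in (min_lon, max_lon))
    if lats_ok and lons_ok:
        return min_lat, max_lat, min_lon, max_lon
    return None, None, None, None
```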
Update the schedule schema to include API key permission (yes/no) and where users can go to find the API key authentication steps. This can match the realtime schema's authentication information.
What problem are we trying to solve?
Users need to be able to pull the catalogs' sources based on criteria like location, the latest dataset, etc.
How will we know when this is done?
Allow users to:
add_source
update_source
get_sources
get_sources_by_bounding_box
get_latest_datasets
JSON file names are country-code-subdivision-code-provider-numerical-id.json
Is your feature request related to a problem? Please describe.
A new source should have a stable URL different from those of the other sources present in the database. If not, it should not be added to the database.
Describe the solution you'd like
Add a workflow to verify that a submitted source has a URL that is not already in the database. If the workflow fails, the URL already exists for a source in the database.
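A hypothetical check behind such a workflow: scan every catalog JSON file and flag a submitted URL that is already present. The directory layout and the urls.direct_download field are assumptions for illustration.

```python
import glob
import json

# Return the path of the catalog file that already uses candidate_url,
# or None if the URL is new. Field names are illustrative.
def find_duplicate_url(candidate_url, catalogs_dir):
    for path in glob.glob(f"{catalogs_dir}/**/*.json", recursive=True):
        with open(path, encoding="utf-8") as f:
            source = json.load(f)
        if source.get("urls", {}).get("direct_download") == candidate_url:
            return path  # the URL already exists for this source
    return None
```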
What problem are we trying to solve?
Users need to be able to add, update, and get GTFS realtime sources.
How will we know when this is done?
These operations are readily available on the repo.
What problem are we trying to solve?
As discussed here, the concept of a transit provider needs a clearer definition within the schema.
After further review and consideration, it makes more sense to include transit provider information when there is an independent catalog for providers that can highlight relationships between services and organizations. For V1, we plan to only include agency names that can be identified in agency.txt.
How do we know when this is done?
What problem are we trying to solve?
Users need to know the scope of the catalogs, how to contribute to the code and the data, and how to use the scripts.
How will we know when this is done?
There are README and CONTRIBUTE documents that outline what user stories the catalogs cover, how users can use the catalogs, and how they can contribute to them.
What problem are we trying to solve?
We want to change the auto-discovery URL because "discovery" is not a commonly used term or convention in GTFS. (The original term was inspired by GBFS' systems.csv).
We intend to change this to "direct download URL" so it's clear that this link downloads a source.
How will we know when this is done?
All references to the auto-discovery URL are removed from the JSON schema, operations, and CSV export, and the direct download URL is used instead.
How do we know when this is done?
This information is populated into the URLs document.
If the WikiBase ID exists in the URLs document, do not extract it.
Output provided includes
Describe the bug
The store_latest_dataset_cronjob.yml workflow is failing. See error logs here.
The problem is at lines 20 to 22: the step "Get added and modified files" using jitterbit/get-changed-files@v1 should not be there. It was copied by mistake from the "On approval" workflow. This GitHub action is incompatible with schedule events and is irrelevant for this workflow, so it should be removed.
To Reproduce
Let the scheduled job run.
Expected behavior
The scheduled job should run correctly.
What problem is your feature request trying to solve?
As we migrate from the Wikibase instance, we may need to update the Mobility Archives hierarchy/file system to reflect our current needs.
The current bucket and file system for a source is
Storage level. For each source:
The system was built this way because we wanted to give each object a different [lifecycle](https://cloud.google.com/storage/docs/lifecycle) and we thought it should be done per bucket rather than per object. After re-reading the documentation, I believe the lifecycle rules apply to a bucket, but each file in the bucket is assigned a lifecycle class based on how often it is used.
So we should restructure this, and also remove the dependency on the archives_id values that are linked to the Wikibase instance.
Describe the solution you'd like
Storage level:
Then, if we have a map from the Wikibase entity IDs to the Mobility Catalogs MDB source IDs, we will be able to restructure and keep what we previously did (store the historical data).
Describe alternatives you've considered
How will we know when this is done?
Additional context
How do we know when this is done?
What problem are we trying to solve?
We have approximately 1300 sources that have been harvested from different data platforms. However:
We want this data to be easily discoverable and follow the conventions stated in the schema before importing.
How will we know when this is done?
What problem are we trying to solve?
We want to make the data publicly available to the community in an organized way, and test the pipeline to ensure everything is working before launch.
How will we know when this is done?
A second round of sources (100) has been added to the catalogs, with bounding box and latest dataset information.
What problem is your feature request trying to solve?
The accents appear as their Unicode escape sequences in the JSON files, e.g. á is \u00e1.
Describe the solution you'd like
Load and dump the JSON file without forcing ASCII characters, by passing ensure_ascii=False. See here.
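A minimal illustration of the fix: with ensure_ascii=False, json.dump writes accented characters literally instead of as \uXXXX escapes. The file path and function name below are placeholders, not the repo's actual code.

```python
import json

# Re-dump a catalog dict so accented characters are written as-is.
# With the default ensure_ascii=True, "á" would be written as \u00e1.
def dump_json_utf8(data, path):
    with open(path, "w", encoding="utf-8") as f:
        json.dump(data, f, ensure_ascii=False, indent=4)
```

For comparison, `json.dumps({"p": "á"})` produces `'{"p": "\u00e1"}'`, while `json.dumps({"p": "á"}, ensure_ascii=False)` produces `'{"p": "á"}'`.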
Review Transitfeeds, Transit.land, Interlines, etc
What problem are we trying to solve?
We want users to be able to easily search for data by location and provider.
How will we know when this is done?
The JSON schema is updated to match the following field breakdowns.
The CSV artifact is updated to match the fields.
Field Name | Required from users | Definition |
---|---|---|
MDB Source ID | No - system generated | Unique identifier following the structure: mdbsrc-provider-subdivisionname-countrycode-numericalid. 3-character minimum and 63-character maximum, based on Google Cloud Storage. |
Data Type | Yes | The data format that the source uses, e.g. GTFS, GTFS-RT. |
Country Code | Yes | ISO 3166-1 alpha-2 code designating the country where the system is located. For a list of valid codes, see here. |
Subdivision name | Yes | ISO 3166-2 subdivision name designating the subdivision (e.g. province, state, region) where the system is located. For a list of valid names, see here. |
Municipality | Yes | Primary municipality in which the transit system is located. |
Provider | Yes | Name of the transit provider. |
Name | Optional | An optional description of the data source, e.g. to specify whether the data source is an aggregate of multiple providers, or which network is represented by the source. |
Auto-Discovery URL | Yes | URL that automatically opens the source. |
Latest dataset URL | No - system generated | A stable URL for the latest dataset of a source. |
License URL | Optional | The transit provider’s license information. |
Bounding box | No - system generated | The bounding box of the data source when it was first added to the catalog, including the date and timestamp the bounding box was extracted, in UTC. |
Describe the bug
In tools.helpers.is_readable, when an exception is detected in the try/except block, a TypeError is raised: "TypeError: exceptions must derive from BaseException". We expect an exception to be raised, but we always get this same TypeError because of a syntax problem in the try/except block: the correct syntax is raise Exception("message"), not raise ("message"). Correcting the syntax will correct the behaviour.
To Reproduce
Import tools.helpers.is_readable and test it with a URL that is not a GTFS Schedule source.
Expected behavior
The raised exception should come with a distinctive message.
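An illustrative reconstruction of the bug, not the actual is_readable code: raising a bare string triggers "TypeError: exceptions must derive from BaseException", masking the intended message, while raise Exception(...) propagates it.

```python
# Sketch of the pattern described in the bug report; the real
# tools.helpers.is_readable implementation may differ.
def is_readable(url, load_func):
    try:
        load_func(url)
    except Exception as e:
        # Buggy form: raise (f"Exception {e} found while reading the URL")
        # raises a str, which Python rejects with a TypeError.
        # Correct form derives from BaseException:
        raise Exception(f"Exception {e!r} found while reading the URL {url}.")
    return True
```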
This is an issue for planning the 1.0 release of the catalogs. It will be updated as we plan and develop the next release.
Please share with us below your needs and ideas! 💡
You can see the current status of work by looking at the issues tagged as V1 in our sprint board. If you'd like to contribute to this release, please get in touch with Emma at [email protected].
The working document used to design this feature is available at https://bit.ly/v1-catalogs-working-doc. Feel free to leave comments directly in it!
The release’s priorities were based on the user research analysis MobilityData conducted in late 2021 and early 2022.
What problem are we trying to solve?
How will we know when this is done?
What problem are we trying to solve?
Data producers unfamiliar with JSON need an easy way to search the catalog sources and identify if their source information is correct.
How will we know when this is done?
There is a stable URL to a CSV export that we can link to in the source update Google form.
We need to import the 1300+ sources we've harvested through the data pipeline and share them in the catalogs.
What problem are we trying to solve?
As partially discussed here, municipality and subdivision are not always relevant for large rail systems or aggregate sources, so they should be made optional fields.
How do we know when this is done?
What is the problem we're trying to solve?
In running tests on this PR after I forked the repo, this error kept recurring:
Error: google-github-actions/auth failed with: the GitHub Action workflow must specify exactly one of "workload_identity_provider" or "credentials_json"! If you are specifying input values via GitHub secrets, ensure the secret is being injected into the environment. By default, secrets are not passed to workflows triggered from forks, including Dependabot.
It looks like this is because secrets are not injected on PRs from forks.
This needs to be fixed in order for us to accept contributions.
How do we know when this is done?
PRs from forked repos can be accepted without the error.
What problem are we trying to solve?
Users need GTFS realtime information and an easy way to identify which GTFS schedule source is associated with it.
How will we know when this is done?
There is a GTFS realtime schema available with 2 example files in the repository. We've received feedback on the schema from at least one external party before implementing the first iteration.
The realtime schema is added to the README
Schema includes
What problem are we trying to solve?
To understand the scalability of using the ISO standard to define location in the catalogs, we want to see what libraries are available to generate an ISO subdivision from the bounding box.
How will we know when this is done?
We have identified the options and tradeoffs associated with different approaches to automatically generating this information.
Right now, the names of the .json files containing the dataset information are a concatenation of:
{Country}-{Subdivision Name}-{Provider Name}-{dataset type}-{id}
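Purely for illustration, a filename following the quoted pattern might be composed like this; the lowercasing and space-to-hyphen handling are assumptions, not documented conventions.

```python
# Compose a catalog filename from its parts, following the pattern
# {Country}-{Subdivision Name}-{Provider Name}-{dataset type}-{id}.json.
# Empty parts are skipped; normalization rules here are assumptions.
def make_filename(country, subdivision, provider, dataset_type, numerical_id):
    parts = [country, subdivision, provider, dataset_type, str(numerical_id)]
    return "-".join(p.lower().replace(" ", "-") for p in parts if p) + ".json"
```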
Is the ID unique to the combo of:
In addition to adding the definition of what the ID is "identifying" to the documentation, I'm interested in what happens in the following (real life) use cases: