gtfs-aggregator-checker's Issues
Rename this repo to "gtfs-aggregator-checker"
Given some internal discussion, we should rename this repo to gtfs-aggregator-checker
so that 3rd parties can better recognize the purpose of this library.
- Change the repository name
- Rename `feed_checker.py`
- Update the docs in the README
Add note to Readme about requirement of transit.land API key
The README should note that a transit.land API key is required and should show example usage. For example:

```
TRANSITLAND_API_KEY=SECRET python feed_checker.py --csv-file temp.csv --verbose
```
Check for non-realtime URLs too
The current code seems to check only realtime URLs, not the `gtfs_schedule_url` in agencies.yml. When the script reads in the Cal-ITP agencies.yml file, it should also check the URL in the `gtfs_schedule_url` field.
Output should indicate URL presence by aggregator
The current output doesn't indicate when a URL was found in one feed aggregator but not the other. The output should report, for each URL, whether it was present in each of the aggregators.
Refactor to use transit.land v1 API
The current code seems to use a combination of a GraphQL API and scraping to check for the presence of feeds. To make sure we are responsibly querying the data, we should be using their API to check for feed presence.

It seems like the way this can be done is by making a query to get all operators in California using this URL: https://api.transit.land/api/v1/operators?apikey=API_KEY&limit=1000&sort_key=id&sort_order=asc&state=US-CA&total=true

In the response, we can iterate through each operator and collect the values in `represented_in_feed_onestop_ids`. Then, for each of those values, we can make a request to https://api.transit.land/api/v1/feeds/FEED_ID?apikey=API_KEY and check that response for the value(s) in the `url` or `urls` field.
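The flow above can be sketched as follows. This is a minimal, unverified sketch: the function names are hypothetical, and the response shapes (`operators`, `represented_in_feed_onestop_ids`, `url`/`urls`) are assumed from the issue text rather than confirmed against the live API.

```python
import json
from urllib.request import urlopen

API_BASE = "https://api.transit.land/api/v1"

def operators_url(api_key, state="US-CA", limit=1000):
    # Operators query from the issue: all operators in a state, sorted by id.
    return (f"{API_BASE}/operators?apikey={api_key}&limit={limit}"
            f"&sort_key=id&sort_order=asc&state={state}&total=true")

def feed_url(api_key, feed_id):
    return f"{API_BASE}/feeds/{feed_id}?apikey={api_key}"

def collect_feed_urls(api_key, state="US-CA"):
    # Fetch operators, gather the feed ids they reference, then collect every
    # value found in each feed's `url`/`urls` fields.
    with urlopen(operators_url(api_key, state)) as resp:
        operators = json.load(resp)["operators"]
    feed_ids = {fid for op in operators
                for fid in op.get("represented_in_feed_onestop_ids", [])}
    found = set()
    for fid in sorted(feed_ids):
        with urlopen(feed_url(api_key, fid)) as resp:
            feed = json.load(resp)
        if feed.get("url"):
            found.add(feed["url"])
        urls = feed.get("urls") or {}
        # Hedge on the shape of `urls`: accept either a mapping or a list.
        found.update(urls.values() if isinstance(urls, dict) else urls)
    return found
```

One request per feed keeps the query volume predictable, and deduplicating feed ids up front avoids refetching feeds shared by multiple operators.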
Additional input formats
This script should have the ability to accept different inputs in addition to the Cal-ITP agencies.yml file. The input options should be:
- A single URL passed as a CLI argument
- A CSV of URLs, with one URL per line. There should be a command line option to specify this kind of input and the location of the input file.
- The Cal-ITP agencies.yml file. There should be a command line option to specify this kind of input and the location of the input file.
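The three input options could be wired up roughly like this. The flag names here are hypothetical (loosely following the `--csv-file` spelling used in the README issue above), not the tool's actual interface:

```python
import argparse

def build_parser():
    # One mutually exclusive flag per input type described above.
    parser = argparse.ArgumentParser(prog="gtfs_aggregator_checker")
    group = parser.add_mutually_exclusive_group(required=True)
    group.add_argument("--url", help="check a single feed URL")
    group.add_argument("--csv-file", help="path to a CSV with one URL per line")
    group.add_argument("--agencies-yml", help="path to a Cal-ITP agencies.yml file")
    return parser
```

Making the group mutually exclusive and required means exactly one input source must be given, and `argparse` produces the usage error for free.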
More permissive matching of URLs
In testing this out, I noticed that some URLs were not being matched since they were not exact matches, although they were functionally the same. feed-checker should be able to match URLs when the following situations occur:

- feed-checker should match `http` or `https`. Ex: `https://www.bart.gov/dev/schedules/google_transit.zip` should match `http://www.bart.gov/dev/schedules/google_transit.zip`.
- feed-checker should be agnostic to the order of the URL query parameters. Ex: `http://example.com/feed?a=1&b=2` should match `http://example.com/feed?b=2&a=1`.
- feed-checker should omit API keys when checking URL query parameters. Ex: `http://api.511.org/transit/datafeeds?api_key={{ MTC_511_API_KEY}}&operator_id=AM` should match `http://api.511.org/transit/datafeeds?operator_id=AM`. The sensitive query parameters should be removed from both the input feed URL and the aggregated feed URL when doing a comparison. By default the query params `token` and `api_key` should be omitted.
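All three rules reduce to canonicalizing both URLs before comparing. A minimal sketch, not the project's actual implementation:

```python
from urllib.parse import urlsplit, parse_qsl

# Query params stripped before comparison, per the defaults above.
SENSITIVE_PARAMS = {"token", "api_key"}

def url_key(url, ignored_params=SENSITIVE_PARAMS):
    """Reduce a URL to a comparison key: scheme-insensitive, query-order-
    insensitive, and with sensitive query parameters removed."""
    parts = urlsplit(url.strip())
    query = sorted((k, v) for k, v in parse_qsl(parts.query)
                   if k.lower() not in ignored_params)
    return (parts.netloc.lower(), parts.path, tuple(query))

def urls_match(a, b):
    return url_key(a) == url_key(b)
```

Dropping the scheme from the key handles the `http`/`https` case, sorting the query pairs handles parameter order, and filtering `ignored_params` handles API keys, so one key function covers all three bullets.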
Feature request: have the ability to return URLs not in input data within a certain region
User Story (Cal-ITP)
As a research data analyst,
I want to know if there are more up-to-date GTFS URLs found on feed aggregator websites than the GTFS URLs that Cal-ITP has
so that I can maintain a database of the GTFS URLs of the CA transit agencies
and so that I can have additional sources of information indicating which GTFS URLs transit agencies have
User Story (Community User)
As a transit application developer,
I want to get a list of all GTFS URLs on all feed aggregator websites for a particular region
so that I can have a complete list of all GTFS URLs to download data from to power my transit application
Acceptance Criteria

Given:
- The input GTFS URLs given to any of the command-line input options of this program
- The input aggregator regions to check in
- The GTFS URLs found on the aggregator websites for their respective regions

For transitland, it seems like the agencies can be queried to determine where they operate, and that can be compared with the feeds found based on the input URLs. The command line argument could look something like `--transit-land-adm1_iso=US-CA`.

For transitfeeds, the hardcoded location could be made configurable via a command line argument: `--transit-feeds-location=67-california-usa`.

Then: the URLs found on the aggregator websites that weren't within the input list of URLs should be output in a separate section of the results.
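The "Then" step amounts to a set difference between the region's aggregator URLs and the input URLs. A minimal sketch, with a hypothetical `(url, metadata)` shape for the aggregator results:

```python
def split_extra_urls(input_urls, region_feeds):
    # region_feeds: iterable of (url, metadata) pairs discovered on an
    # aggregator for the configured region (shape assumed for illustration).
    input_set = set(input_urls)
    return [{"url": url, "transitfeeds_metadata": meta}
            for url, meta in region_feeds
            if url not in input_set]
```

In practice the membership test would use the same permissive URL matching described in the issue above, rather than exact string equality.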
Example:
When searching for all transitfeeds URLs in Saskatchewan, Canada, while also checking against a single input URL, the CLI input and result could be as follows:
CLI Input

```
python -m gtfs_aggregator_checker --url https://opengis.regina.ca/reginagtfs/google_transit.zip --output results.json --transit-feeds-location=196-saskatchewan-canada
```
JSON Output

```json
{
  "input_url_results": {
    "https://opengis.regina.ca/reginagtfs/google_transit.zip": {
      "transitfeeds": {
        "public_web_url": "https://transitfeeds.com/p/the-city-of-regina/830",
        "status": "present"
      },
      "transitland": {
        "public_web_url": "https://www.transit.land/feeds/f-c8vx-thecityofregina",
        "status": "present"
      }
    }
  },
  "additional_aggregator_urls_in_region_not_in_input_list": [
    {
      "transitfeeds_metadata": {
        "name": "Saskatoon Transit GTFS",
        "public_web_url": "https://transitfeeds.com/p/city-of-saskatoon/264",
        "type": "GTFS Schedule"
      },
      "url": "http://apps2.saskatoon.ca/app/data/google_transit.zip"
    },
    {
      "transitfeeds_metadata": {
        "name": "Saskatoon Transit Service Alerts",
        "public_web_url": "https://transitfeeds.com/p/city-of-saskatoon/842",
        "type": "GTFS Realtime Service Alerts"
      },
      "url": "http://apps2.saskatoon.ca/app/data/Alert/Alerts.pb"
    },
    {
      "transitfeeds_metadata": {
        "name": "Saskatoon Transit Trip Updates",
        "public_web_url": "https://transitfeeds.com/p/city-of-saskatoon/841",
        "type": "GTFS Realtime Trip Updates"
      },
      "url": "http://apps2.saskatoon.ca/app/data/TripUpdate/TripUpdates.pb"
    },
    {
      "transitfeeds_metadata": {
        "name": "Saskatoon Transit Vehicle Positions",
        "public_web_url": "https://transitfeeds.com/p/city-of-saskatoon/840",
        "type": "GTFS Realtime Vehicle Positions"
      },
      "url": "http://apps2.saskatoon.ca/app/data/Vehicle/VehiclePositions.pb"
    }
  ]
}
```
Changes to work with airflow dag
When I designed this I built it with CLI usage in mind. There are a few tweaks I need to make to have this work inside a Python script:

- `check_feeds` should return results
- Ability to disable the cache. I don't think this should run with the cache on in production; make it so setting `cache_dir` to 0 disables caching.
- Move stdout (print calls) to the `__main__` file
- Move the `--output` flag to the main file
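The shape of that split might look like the sketch below. The result structure is a placeholder, not the tool's real output; the point is that the library function returns data while the `__main__`-style wrapper owns all printing and file writing:

```python
import json

def check_feeds(urls, cache_dir=None):
    # Library entry point: returns results instead of printing them.
    # A falsy cache_dir (e.g. 0) disables caching, per the issue.
    use_cache = bool(cache_dir)
    # Placeholder result shape standing in for the real aggregator checks.
    return {url: {"cached": use_cache} for url in urls}

def main(urls, output=None):
    # All stdout and --output handling lives here, not in the library.
    results = check_feeds(urls, cache_dir=0)  # cache off in production
    text = json.dumps(results, indent=2)
    if output:
        with open(output, "w") as f:
            f.write(text)
    else:
        print(text)
```

With this split, an Airflow task can call `check_feeds` directly and decide for itself what to do with the returned dict.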
v1 release
JSON output format
This script should have the ability to output a JSON result so that the results file can be ingested by another system.