gtfs-aggregator-checker's People

Contributors

atvaccaro, chriscauley, evansiroky


gtfs-aggregator-checker's Issues

Rename this repo to "gtfs-aggregator-checker"

Given some internal discussion, we should rename this repo to gtfs-aggregator-checker so that third parties can better recognize the purpose of this library.

  • change repository name
  • rename feed_checker.py
  • update the docs in README

Check for non-realtime URLs too

The current code checks only realtime URLs, not the gtfs_schedule_url in agencies.yml. The script should be modified so that when it reads the Cal-ITP agencies.yml file, it also checks the URL in the gtfs_schedule_url field.
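
One way the collection step could look, assuming the file has already been loaded with yaml.safe_load and that agencies hold a list of feed dicts whose *_url fields contain feed URLs (the schema here is an assumption, not the actual Cal-ITP layout):

```python
# Sketch: collect both schedule and realtime URLs from parsed agencies.yml
# data. The shape (agencies -> feeds -> *_url fields) is assumed.
def extract_urls(agencies):
    """Return every GTFS URL, including gtfs_schedule_url values."""
    urls = []
    for agency in agencies.values():
        for feed in agency.get("feeds", []):
            for key, value in feed.items():
                # Picks up gtfs_schedule_url alongside the gtfs_rt_* URLs.
                if key.endswith("_url") and value:
                    urls.append(value)
    return urls
```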

Output should indicate URL presence by aggregator

The current output doesn't indicate whether a URL was found in one feed aggregator but not the other. The output should be modified to report, for each aggregator, whether each URL was present.

Refactor to use transit.land v1 API

The current code uses a combination of a GraphQL API and scraping to check for the presence of feeds. To make sure we are querying the data responsibly, we should use the transit.land API to check for feed presence.

This can be done by querying for all operators in California using this URL: https://api.transit.land/api/v1/operators?&apikey=API_KEY&limit=1000&sort_key=id&sort_order=asc&state=US-CA&total=true. In the response, iterate through each operator and collect the values in represented_in_feed_onestop_ids. Then, for each of those values, make a request to https://api.transit.land/api/v1/feeds/FEED_ID?apikey=API_KEY and check that response for the value(s) in the url or urls field.
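
The two-step lookup could be split into pure response-parsing helpers plus a small fetch wrapper, roughly as below. Field names follow the issue text; treat the exact response shapes as assumptions:

```python
# Sketch of the transit.land v1 lookup described above: collect feed IDs
# from /operators, then pull url/urls values from each /feeds/FEED_ID.
import json
import urllib.parse
import urllib.request

BASE = "https://api.transit.land/api/v1"

def collect_feed_ids(operators_response):
    """Gather feed onestop IDs from an /operators response body."""
    feed_ids = set()
    for op in operators_response.get("operators", []):
        feed_ids.update(op.get("represented_in_feed_onestop_ids", []))
    return feed_ids

def extract_feed_urls(feed_response):
    """Pull the url / urls values out of a /feeds/FEED_ID response body."""
    urls = set()
    if feed_response.get("url"):
        urls.add(feed_response["url"])
    urls.update(v for v in (feed_response.get("urls") or {}).values() if v)
    return urls

def fetch_json(path, **params):
    """GET a transit.land endpoint and decode the JSON body."""
    query = urllib.parse.urlencode(params)
    with urllib.request.urlopen(f"{BASE}/{path}?{query}") as resp:
        return json.load(resp)
```

Keeping the parsing separate from the HTTP calls also makes the rate-limit-friendly behavior easy to test without hitting the API.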

Additional input formats

This script should have the ability to accept different inputs in addition to the Cal-ITP agencies.yml file. The input options should be:

  1. A single URL to check, passed as a CLI argument
  2. A CSV of URLs, with one URL per line. There should be a command line option to specify this kind of input and the location of the input file.
  3. The Cal-ITP agencies.yml file. There should be a command line option to specify this kind of input and the location of the input file.
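
The three options above could be wired up with argparse along these lines (flag names are illustrative, not the project's actual CLI):

```python
# Hypothetical CLI wiring for the three input formats.
import argparse
import csv

def parse_args(argv=None):
    parser = argparse.ArgumentParser(prog="gtfs_aggregator_checker")
    group = parser.add_mutually_exclusive_group(required=True)
    group.add_argument("--url", help="check a single feed URL")
    group.add_argument("--csv", help="path to a CSV with one URL per line")
    group.add_argument("--yml", help="path to a Cal-ITP agencies.yml file")
    return parser.parse_args(argv)

def urls_from_csv(path):
    """Read one URL per line from a CSV file."""
    with open(path, newline="") as f:
        return [row[0] for row in csv.reader(f) if row]
```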

More permissive matching of URLs

In testing this out, I noticed that some URLs were not being matched since they were not exact matches, although they were functionally the same. feed-checker should be able to match URLs when the following situations occur:

  1. feed-checker should match http or https. Ex: https://www.bart.gov/dev/schedules/google_transit.zip should match http://www.bart.gov/dev/schedules/google_transit.zip.
  2. feed-checker should be agnostic to the order of the URL query parameters. Ex: http://example.com/feed?a=1&b=2 should match http://example.com/feed?b=2&a=1
  3. feed-checker should omit API keys when checking URL query parameters. Ex: http://api.511.org/transit/datafeeds?api_key={{ MTC_511_API_KEY}}&operator_id=AM should match http://api.511.org/transit/datafeeds?operator_id=AM. The sensitive query parameters should be removed in both the input feed URL and the aggregated feed URL when doing a comparison. By default the query params token and api_key should be omitted.
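
The three rules above amount to a normalization pass before comparing. A minimal sketch using Python's urllib.parse (the function names are mine, not feed-checker's):

```python
# Normalize a URL so functionally-equivalent URLs compare equal:
# 1. treat http and https the same, 2. sort query parameters,
# 3. drop sensitive parameters (token, api_key by default) on both sides.
from urllib.parse import urlsplit, parse_qsl, urlencode, urlunsplit

OMITTED_PARAMS = {"token", "api_key"}

def normalize(url, omitted=OMITTED_PARAMS):
    parts = urlsplit(url)
    scheme = "http" if parts.scheme == "https" else parts.scheme
    params = sorted((k, v) for k, v in parse_qsl(parts.query)
                    if k not in omitted)
    return urlunsplit((scheme, parts.netloc, parts.path,
                       urlencode(params), parts.fragment))

def urls_match(a, b):
    return normalize(a) == normalize(b)
```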

Feature request: have the ability to return URLs not in input data within a certain region

User Story (Cal-ITP)

As a research data analyst,
I want to know if there are more up-to-date GTFS URLs found on feed aggregator websites than the GTFS URLs that Cal-ITP has
so that I can maintain a database of the GTFS URLs of the CA transit agencies
and so that I can have additional sources of information indicating which GTFS URLs transit agencies have

User Story (Community User)

As a transit application developer,
I want to get a list of all GTFS URLs on all feed aggregator websites for a particular region
so that I can have a complete list of all GTFS URLs to download data from to power my transit application

Acceptance Criteria

Given

  1. The input GTFS URLs given to any of the command-line input options of this program
  2. The input aggregator regions to check in

For transitland, the agencies can be queried to determine where they operate, and the results compared with the feeds found based on the input URLs. The command line arguments could look something like this:

--transit-land-adm1_iso=US-CA

For transitfeeds, the hardcoded location could be made configurable via a command line argument:

--transit-feeds-location=67-california-usa

  3. The GTFS URLs found on the aggregator websites for their respective regions

Then the URLs found on the aggregator websites that weren't in the input list should be output in a separate section of the output.
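
Assembling that extra section is essentially a set difference between the region's aggregator URLs and the input list. A minimal sketch (real code would apply the same permissive URL matching described in the earlier issue; plain string comparison is assumed here):

```python
# Region aggregator URLs that weren't in the input list.
def additional_urls(region_urls, input_urls):
    seen = set(input_urls)
    return sorted(u for u in region_urls if u not in seen)
```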

Example:

When searching for all transitfeeds URLs in Saskatchewan, Canada, while also checking against a single input URL, the CLI input and result could be as follows:

CLI Input

python -m gtfs_aggregator_checker --url https://opengis.regina.ca/reginagtfs/google_transit.zip --output results.json --transit-feeds-location=196-saskatchewan-canada

JSON Output

{
  "input_url_results": {
    "https://opengis.regina.ca/reginagtfs/google_transit.zip": {
      "transitfeeds": {
        "public_web_url": "https://transitfeeds.com/p/the-city-of-regina/830",
        "status": "present"
      },
      "transitland": {
        "public_web_url": "https://www.transit.land/feeds/f-c8vx-thecityofregina",
        "status": "present"
      }
    }
  },
  "additional_aggregator_urls_in_region_not_in_input_list": [
    {
      "transitfeeds_metadata": {
        "name": "Saskatoon Transit GTFS",
        "public_web_url": "https://transitfeeds.com/p/city-of-saskatoon/264",
        "type": "GTFS Schedule"
      },
      "url": "http://apps2.saskatoon.ca/app/data/google_transit.zip"
    },
    {
      "transitfeeds_metadata": {
        "name": "Saskatoon Transit Service Alerts",
        "public_web_url": "https://transitfeeds.com/p/city-of-saskatoon/842",
        "type": "GTFS Realtime Service Alerts"
      },
      "url": "http://apps2.saskatoon.ca/app/data/Alert/Alerts.pb"
    },
    {
      "transitfeeds_metadata": {
        "name": "Saskatoon Transit Trip Updates",
        "public_web_url": "https://transitfeeds.com/p/city-of-saskatoon/841",
        "type": "GTFS Realtime Trip Updates"
      },
      "url": "http://apps2.saskatoon.ca/app/data/TripUpdate/TripUpdates.pb"
    },
    {
      "transitfeeds_metadata": {
        "name": "Saskatoon Transit Vehicle Positions",
        "public_web_url": "https://transitfeeds.com/p/city-of-saskatoon/840",
        "type": "GTFS Realtime Vehicle Positions"
      },
      "url": "http://apps2.saskatoon.ca/app/data/Vehicle/VehiclePositions.pb"
    }
  ]
}

Changes to work with airflow dag

When I designed this, I built it with CLI usage in mind. There are a few tweaks needed to have this work inside a Python script.

  • check_feeds should return results
  • Ability to disable cache - I don't think this should run with the cache on in production. Make it so setting the cache_dir to 0 disables caching.
  • Move stdout (print calls) to the __main__ file.
  • Move --output flag to the main file
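
The library-friendly shape described above could look roughly like this; the function and parameter names mirror the checklist, but the lookup helper is a hypothetical placeholder for the real aggregator query:

```python
# Sketch: check_feeds returns results as data instead of printing, and
# cache_dir=0 disables caching. Printing/--output live in __main__.
def lookup(url, use_cache):
    """Hypothetical stand-in for the real aggregator lookup."""
    return {"status": "unknown", "cached": use_cache}

def check_feeds(urls, cache_dir="cache"):
    use_cache = bool(cache_dir)  # cache_dir=0 turns the cache off
    return {url: lookup(url, use_cache=use_cache) for url in urls}
```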

v1 release

After all initial issues are completed, this is ready for a v1 release! Once that happens, let's create the needed items to publish to PyPI and then publish a v1 release there.

JSON output format

This script should have the ability to output a JSON result so that the results file can be ingested by another system.
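A minimal sketch of such an output step, assuming an --output path and the result-dict shape shown in the earlier JSON example (the function name is illustrative):

```python
# Serialize results to a JSON file when an output path is given,
# otherwise print the JSON to stdout.
import json
import sys

def write_results(results, output=None):
    text = json.dumps(results, indent=2, sort_keys=True)
    if output:
        with open(output, "w") as f:
            f.write(text)
    else:
        sys.stdout.write(text + "\n")
```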
