r-transit / tidytransit

R package for working with GTFS data

Home Page: https://r-transit.github.io/tidytransit/

tidytransit's Introduction

tidytransit

Use tidytransit to map transit stops and routes, calculate travel times and transit frequencies, and validate transit feeds. Tidytransit reads the General Transit Feed Specification (GTFS) into tidyverse and simple features data frames.

Have a look at the vignettes on the home page to see how tidytransit can be used to analyse a feed.

Installation

This package requires a working installation of sf.

Install tidytransit from CRAN:

install.packages('tidytransit')

For the development version from GitHub:

# install.packages("devtools")
devtools::install_github("r-transit/tidytransit")

GTFS-related packages

gtfsio: R package to read and write GTFS feeds; tidytransit builds on gtfsio
gtfstools: tools for editing and analysing transit feeds
gtfsrouter: package for public transport routing
gtfs2gps: converts public transport data from GTFS format to GPS-like records

Contributing

Please feel free to issue a pull request or open an issue.

tidytransit's Issues

Package description on github

@tbuckl Should we change the package description on github? I don't think the package being sf compatible is its main focus. I'd suggest something like on tidytransit.r-transit.org:
"tidytransit reads the General Transit Feed Specification (GTFS) into tidyverse and simple features dataframes. Use tidytransit to map transit stops and routes, calculate transit frequencies, and validate transit feeds."

write a vignette on calculated tables and their relationship to standard tables

make route_type (as text value) easily accessible for filtering/analysis

route_type is useful for analysis and can be easily pulled from the routes_df

type descriptions:

https://gist.github.com/derhuerst/b0243339e22c310bee2386388151e11e

deprecated types:

https://sites.google.com/site/gtfschanges/proposals/route-type
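A minimal sketch of what a text-valued route type could look like, using base R only. The `routes` table and the `route_type_name` column are made up for illustration, and the lookup covers only the basic route_type codes from the GTFS reference, not the extended types linked above.

```r
# Illustrative routes table; a real one would come from routes.txt.
routes <- data.frame(
  route_id = c("R1", "R2", "R3"),
  route_type = c(3L, 1L, 0L),
  stringsAsFactors = FALSE
)

# Basic route_type codes from the GTFS reference (extended types omitted).
route_type_names <- c(
  "0" = "Tram, Streetcar, Light rail",
  "1" = "Subway, Metro",
  "2" = "Rail",
  "3" = "Bus",
  "4" = "Ferry",
  "5" = "Cable tram",
  "6" = "Aerial lift",
  "7" = "Funicular",
  "11" = "Trolleybus",
  "12" = "Monorail"
)

# Look up by name (character key), not by position, so that codes such as
# 11 and 12 map to the right label.
routes$route_type_name <- unname(route_type_names[as.character(routes$route_type)])

# Filtering by the text value is then straightforward.
bus_routes <- subset(routes, route_type_name == "Bus")
print(bus_routes)
```

Codes that are not in the lookup come back as NA, which makes unknown or extended types easy to spot.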

clarify how the most frequent service id is identified

perhaps as its own function.

this is confusing:

https://github.com/r-transit/tidytransit/blob/master/R/frequencies.R#L34-L59
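One possible reading of "most frequent service id" is the service_id that has the most trips attached to it. The sketch below shows that idea as a standalone function; `most_frequent_service` is a hypothetical name, the `trips` table is made up, and this is not necessarily what the linked lines in frequencies.R do.

```r
# Illustrative trips table; a real one would come from trips.txt.
trips <- data.frame(
  trip_id = paste0("T", 1:6),
  service_id = c("WKDY", "WKDY", "WKDY", "SAT", "SAT", "SUN"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: return the service_id with the largest number of trips.
# On ties, which.max() keeps the first id in alphabetical order.
most_frequent_service <- function(trips) {
  counts <- table(trips$service_id)
  names(counts)[which.max(counts)]
}

top_service <- most_frequent_service(trips)
print(top_service)
```

Keeping this in its own function would let the rule (and its tie-breaking) be documented and tested separately from the headway calculation.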

consistently name functions that modify and return a full gtfs object

as discussed here perhaps prefixing them with a gtfs_*

make the frequency calculation error message more helpful

see #72

in this case, the message suggests filtering by service id, but that doesn't help.

investigate potential benchmark for transit frequency (and source for walkability)

https://www.epa.gov/smartgrowth/smart-location-mapping#Trans45

consider removing/renaming `_sf` suffix from calculated simple features dataframes

it could be that there's a more intuitive/descriptive way of naming these dataframes.

improve the documentation of the headway/frequency functions

move usage examples from main readme to vignettes

this should make contribution simpler.

Remove _df and _sf suffix from dataframes and simple features data frames

Is there a compelling reason to use the _df suffix for the data frames (e.g. gtfs$stops_df)?

read non-gtfs-spec column names in parse_gtfs

When using the google sample feed, the resulting stop_times_df looks like

$stop_times_df
# A tibble: 28 x 9
   trip_id arrival_time departure_time stop_id        stop_sequence stop_headsign pickup_type X8    shape_dist_traveled
   <chr>   <chr>        <chr>          <chr>                  <int> <chr>               <int> <chr>               <dbl>
 1 STBA    6:00:00      6:00:00        STAGECOACH                 1 NA                     NA NA                     NA
 2 STBA    6:20:00      6:20:00        BEATTY_AIRPORT             2 NA                     NA NA                     NA
 3 CITY1   6:00:00      6:00:00        STAGECOACH                 1 NA                     NA NA                     NA
 4 CITY1   6:05:00      6:07:00        NANAA                      2 NA                     NA NA                     NA
 5 CITY1   6:12:00      6:14:00        NADAV                      3 NA                     NA NA                     NA
 6 CITY1   6:19:00      6:21:00        DADAN                      4 NA                     NA NA                     NA
 7 CITY1   6:26:00      6:28:00        EMSI                       5 NA                     NA NA                     NA
 8 CITY2   6:28:00      6:30:00        EMSI                       1 NA                     NA NA                     NA
 9 CITY2   6:35:00      6:37:00        DADAN                      2 NA                     NA NA                     NA
10 CITY2   6:42:00      6:44:00        NADAV                      3 NA                     NA NA                     NA

The problem seems to be that the column "drop_off_time" is not a valid column name which is pretty strange for an example feed. Also there are commas missing from line 17 on but that's not the point.

My question is: Why are only required/expected columns read in import.R#348? Why don't we simply read the whole file as a simple csv and check validity afterwards? The column X8 we get isn't really helpful anyways.
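The "read everything, validate afterwards" idea could look roughly like this in base R. The CSV text is a made-up fragment in the spirit of the sample feed, and `spec_cols` is a partial, illustrative list of stop_times.txt fields rather than the package's own definition.

```r
# A made-up stop_times fragment with one non-spec column (drop_off_time).
csv_text <- "trip_id,arrival_time,departure_time,stop_id,stop_sequence,drop_off_time
STBA,6:00:00,6:00:00,STAGECOACH,1,
STBA,6:20:00,6:20:00,BEATTY_AIRPORT,2,"

# Partial list of stop_times.txt fields from the GTFS reference.
spec_cols <- c(
  "trip_id", "arrival_time", "departure_time", "stop_id", "stop_sequence",
  "stop_headsign", "pickup_type", "drop_off_type", "shape_dist_traveled"
)

# Step 1: read the whole file as plain CSV, keeping every column as character.
stop_times <- utils::read.csv(
  text = csv_text,
  colClasses = "character",
  check.names = FALSE
)

# Step 2: check validity afterwards instead of dropping columns while reading.
unknown <- setdiff(names(stop_times), spec_cols)
if (length(unknown) > 0) {
  message("Columns not in the spec list: ", paste(unknown, collapse = ", "))
}
```

With this approach the unexpected column keeps its real name (drop_off_time) instead of turning into a placeholder such as X8, so the validation message can point at the actual problem.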

put calculated feeds in a sublist of the gtfs_obj

Use data.table to read feeds!

library (tidytransit)
library (magrittr)

# Find the feed in the working directory.
f <- list.files (getwd (), full.names = TRUE)
filename <- f [grep ("VBB", f)] # GTFS for Berlin-Brandenburg Transport - it's huge!

# Read every table in the zip with data.table::fread, streaming each file
# through `unzip -p` (needs an unzip executable on the PATH).
get_df <- function (filename)
{
    flist <- utils::unzip (filename, list = TRUE)$Name
    res <- list ()
    for (i in seq_along (flist))
    {
        cmd <- paste0 ("unzip -p \"", filename, "\" \"", flist [i], "\"")
        res [[i]] <- data.table::fread (cmd = cmd, showProgress = FALSE) %>%
            as.data.frame ()
    }
    # Name each table after its file, without the .txt extension.
    names (res) <- sub ("\\.txt$", "", flist)
    return (res)
}

rbenchmark::benchmark (
                       dat <- read_gtfs (filename, local = TRUE),
                       dat <- get_df (filename),
                       replications = 1)
#>                                       test replications elapsed relative user.self sys.self user.child sys.child
#> 2                  dat <- get_df(filename)            1   3.463     1.00     6.678    0.432      1.745     0.312
#> 1 dat <- read_gtfs(filename, local = TRUE)            1  31.411     9.07    30.701    0.645      0.000     0.000

Created on 2019-02-01 by the reprex package (v0.2.1)

GTFS feeds can be enormous, and data.table makes a pretty huge difference - it'll read a feed nearly ten times faster!

This is also by way of starting a separate conversation about the potential future merging of gtfs-router into this package. It seems like the obvious place for it, and the primary usage for tidytransit if it were available is surely likely to be transit routing? You could then check out your transit options from within the comfort of your R session!

release next version to CRAN

@mpadge if there's anything that would make this more usable for you let me know.

make plot(gtfs_obj) plot routes with frequencies by default

Reimplement hms frequency tests

See #45 and #46

update docs on website

remove optional gtfs files on read?

While working on #6 I came across specifications for files that are not defined in gtfs reference like directions and stop_attributes.

I guess there is an extended or additional specification I'm not aware of?

Plot example not working

Based on this stackoverflow question.

local_gtfs_path <- system.file("extdata", 
                              "google_transit_nyc_subway.zip", 
                              package = "tidytransit")
nyc <- read_gtfs(local_gtfs_path, 
                local=TRUE)
plot(nyc)

with the plot function:

tidytransit:::plot.gtfs <- function (x, ...) {
    dots = list(...)
    routes_sf_frequencies <- x$routes_sf %>% dplyr::inner_join(x$routes_frequency_df, 
        by = "route_id") %>% dplyr::select(median_headways, mean_headways, 
        st_dev_headways, stop_count)
    plot(routes_sf_frequencies)
}

The problem seems to be twofold: routes_sf is missing by default and the headway calculations haven't been done.
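A guard along these lines would turn the failure into a readable message. This is only a sketch: `missing_plot_inputs` is a hypothetical helper, the element names are the ones used in the plot function quoted above, and the feed here is a plain list standing in for a gtfs object.

```r
# Hypothetical helper: which of the elements the plot method relies on
# are absent from the object?
missing_plot_inputs <- function(x, needed = c("routes_sf", "routes_frequency_df")) {
  setdiff(needed, names(x))
}

# A plain list standing in for a feed read without geometry or frequencies.
feed <- list(routes_df = data.frame(route_id = "1", stringsAsFactors = FALSE))

absent <- missing_plot_inputs(feed)
if (length(absent) > 0) {
  message("Cannot plot route frequencies; missing: ", paste(absent, collapse = ", "))
}
```

A plot method could run this check first and either stop with the message or compute the missing pieces before joining.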

benchmark memory limits

report of memory limitations. unclear what feed:

https://twitter.com/teebuckl/status/1033066113368616960

review and send new release to CRAN with @polettif date changes

general discussion of repository structure

Hey, I appreciate your effort to separate the different "modules" and I'm looking forward to seeing where this package might end up. As I don't know much about developing R packages, I don't yet understand how this repository and the others (trread, ...) are connected? Is it the same code duplicated and separate or is it automatically cloned/imported in some way?

Add CONTRIBUTING.md and CONDUCT.md

It's always good to have something explicit! Use ggplot2 CONTRIBUTING.md and sf CONDUCT.md as examples.

message printed during frequency calculation should not default to 6 am to 10 pm

Incorrect route linestrings in `.$routes_sf`

Great idea for a package! However I am noticing some issues with the output on my first use, looking at L train routes in Chicago. A reproducible example:

library(tidytransit)
library(mapview)

chicago_gtfs <- read_gtfs("http://www.transitchicago.com/downloads/sch_data/google_transit.zip")

routes <- chicago_gtfs$routes_sf

mapview(routes[routes$route_id == "Pink", ])

This is the Yellow Line, not the Pink Line. I'm wondering if some of the rows are getting shuffled when they are converted to simple features? I've forked and will go through the code but let me know if you have any suggestions. Thanks!
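To illustrate the row-shuffling hypothesis (not a diagnosis of the actual bug), the toy example below shows how attaching geometries by row position can mislabel routes when the two tables are ordered differently, and how joining on route_id avoids it. All data here is made up, and the `shape` column holds placeholder strings rather than real linestrings.

```r
routes <- data.frame(
  route_id = c("Pink", "Y", "Red"),
  route_long_name = c("Pink Line", "Yellow Line", "Red Line"),
  stringsAsFactors = FALSE
)

# Geometries computed separately, in a different row order.
shapes <- data.frame(
  route_id = c("Red", "Pink", "Y"),
  shape = c("red-geom", "pink-geom", "yellow-geom"),
  stringsAsFactors = FALSE
)

# Binding by position silently pairs "Pink" with the Red Line geometry.
by_position <- cbind(routes, shape = shapes$shape)

# Joining on the key keeps each route with its own geometry.
by_key <- merge(routes, shapes, by = "route_id")

print(by_position[by_position$route_id == "Pink", ])
print(by_key[by_key$route_id == "Pink", ])
```

If the conversion to simple features reorders rows at any point, a key-based join like the second one stays correct regardless of order.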

memory limitations for large feeds--prompt users

write a vignette using tidycensus

based on #15 it's clear that we need better examples to review and test functionality.

a vignette using tidycensus might be great for this.

Encourage people to download development version on README.md

The version on CRAN is a bit different in terms of the functions available, and we want people to keep up with our newest developments!

add a read_gtfs "name" parameter based on name in transitfeeds feedlist

it would be nice to be able to just:

gtfs_feed <- read_gtfs(name="Translink")

test get_route_frequency on all the gtfs feeds in transitfeeds

this would be a good way to estimate how well it's working.

think about more efficient representations of gtfs

https://medium.com/transit-app/how-we-shrank-our-trip-planner-till-it-didnt-need-data-84984ca56663

memory limitations for large feeds--use data.table

general discussion around routing packages in r

Sorry for the potentially naive question, but can this package route from A to B with GTFS data?

Context: @mem48 and I are currently using OpenTripPlanner for this but it has quite a lot of overheads, but is very good for multi-modal routing (perhaps one day that will be possible in R).

update package for new dplyr release

from rstudio:

This is an automated email to let you know that:

A new version of dplyr is ready to go to CRAN. dplyr is
currently at version 0.7.8 and will become 0.8.0 upon release.
tidytransit uses dplyr and has problems with the new version.
We plan to submit dplyr to CRAN on February 1.

This release represents about 9 months of development, detailed in
this blog post:
https://www.tidyverse.org/articles/2018/12/dplyr-0-8-0-release-candidate/

I need your help to keep tidytransit and dplyr working together
smoothly. In the next weeks, can you please:

Read about the changes to dplyr at
https://github.com/tidyverse/dplyr/blob/master/NEWS.md#dplyr-080.
This page includes a list of breaking changes, the reasoning behind
them, and how to update your code.
Carefully inspect the failing checks listed at the bottom of this email.
For each failing check, either update your package, or tell me
that I have a bug. If you have made changes to your package, please
submit an update to CRAN before February 1.

If you have discovered a bug in dplyr, please file an issue (ideally
with a small reprex that illustrates the problem) at
https://github.com/tidyverse/dplyr/issues. If you're not sure whether
or not you've found a bug, please file an issue and we'll help you figure
it out. Breaking changes that are not listed qualify as bugs.

Please respond to this message if you have any questions.

Thanks,

Romain Francois

== CHECK RESULTS ========================================

checking examples ... ERROR

Running examples in ‘tidytransit-Ex.R’ failed
The error most likely occurred in:

Name: get_route_frequency

Title: Get Route Frequency

Aliases: get_route_frequency

** Examples

data(gtfs_obj)
gtfs_obj <- get_route_frequency(gtfs_obj)
Calculating route and stop headways using defaults (6 am to 10 pm
for weekday service).
Error in n() : could not find function "n"
Calls: get_route_frequency ... -> ->
mutate.tbl_df -> mutate_impl
Execution halted

checking tests ... ERROR

Running the tests in ‘tests/testthat.R’ failed.
Last 13 lines of output:
  16: `_fseq`(`_lhs`)
  17: freduce(value, `_function_list`)
  18: function_list[[i]](value)
  19: dplyr::mutate(., service_trips = n())
  20: mutate.tbl_df(., service_trips = n()) at /Users/romain/git/tidyverse/dplyr/R/manip.r:416
  21: mutate_impl(.data, dots) at /Users/romain/git/tidyverse/dplyr/R/tbl-df.r:91

  ══ testthat results ═══════════════════════════════════════════════════════════
  OK: 3 SKIPPED: 7 FAILED: 3
  1. Error: Stop frequencies (headways) for included data are as expected (@test_headways.R#4)
  2. Error: Route frequencies (headways) for included data are as expected (@test_headways.R#11)
  3. Error: Route frequencies (headways) can be calculated for included data for a particular service id (@test_headways.R#17)

  Error: testthat unit tests failed
  Execution halted

change lubridate::hms calls to hms::hms calls

as per the suggestions in #42
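Whichever time class is used, it has to cope with the fact that GTFS stop times may exceed 24:00:00 for trips that run past midnight. The sketch below parses such strings into seconds with base R; `gtfs_time_to_seconds` is a hypothetical helper, and wrapping the result with `hms::hms(seconds = ...)` is shown only as an optional step when the hms package is installed.

```r
# Hypothetical helper: convert "H:MM:SS" or "HH:MM:SS" strings to seconds
# since the start of the service day. Hours above 23 are allowed.
gtfs_time_to_seconds <- function(x) {
  parts <- strsplit(x, ":", fixed = TRUE)
  vapply(parts, function(p) {
    p <- as.numeric(p)
    p[1] * 3600 + p[2] * 60 + p[3]
  }, numeric(1))
}

secs <- gtfs_time_to_seconds(c("6:05:00", "25:30:00"))
print(secs)

# Optional: wrap the seconds in an hms object if the package is available.
if (requireNamespace("hms", quietly = TRUE)) {
  print(hms::hms(seconds = secs))
}
```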

make geometry processing an optional argument in the read_gtfs function

I think we can find a way to rewrite the functions so that there's a parameter in function calls for sf=TRUE. Otherwise, we can just show lat and long in the dataframes that are returned, because that's what's in GTFS data anyway.

Will look into this later.

calculate route and stop frequencies in read_gtfs

so that they are just immediately available.

if a frequencies data frame exists, then use that by default instead of calculating another one

see #72

make get_* function names more consistent

make routes_df_as_sf match the naming of get_route_frequency. this would be more logical/easier to understand. so, for example, get_route_geometry.

identify other use cases for reading zip files of tables and, if found, move read_zip to generic r package.

these might be more generically useful than just GTFS:

https://github.com/r-transit/tidytransit/blob/master/R/import.R#L180-L241

https://github.com/r-transit/tidytransit/blob/master/R/import.R#L301-L316

something like:

dataframes <- read_zip("zip_of_csvs.zip")
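A generic version could be as small as the sketch below, which uses only base R. The function name matches the call above, but the signature, the `pattern` argument, and the choice of `read.csv` are assumptions for illustration, not the code at the linked lines of import.R.

```r
# Hypothetical generic helper: read every delimited text file in a zip
# archive into a named list of data frames.
read_zip <- function(zip_path, pattern = "\\.(txt|csv)$") {
  # Extract into a temporary directory that is removed when we are done.
  exdir <- tempfile("read_zip_")
  dir.create(exdir)
  on.exit(unlink(exdir, recursive = TRUE), add = TRUE)

  # unzip() returns the paths of the extracted files.
  files <- utils::unzip(zip_path, exdir = exdir)
  files <- files[grepl(pattern, files, ignore.case = TRUE)]

  tables <- lapply(files, utils::read.csv, stringsAsFactors = FALSE)
  # Name each table after its file, without the extension.
  names(tables) <- tools::file_path_sans_ext(basename(files))
  tables
}
```

Called as `dataframes <- read_zip("zip_of_csvs.zip")`, this would give one data frame per file, named after the file. Anything GTFS-specific (expected files, column types) would then sit in a layer on top.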

submit new cran release

there have been enough changes that it's about time

specify how to make this package more useful to package developers

hey @mpadge is this the right issue title for your question about how to manage "merging" this package with other packages?

i think tidytransit ended up being more oriented toward users than developers, partially by the prompting of @angela-li to consider that a user would not want to have to import and think about multiple packages to just do basic mapping and frequency/schedule analysis.

i think that intuition was right, but as you work on gtfs-router i expect you'll develop better approaches across a number of problems.

another way to think about this issue is how to deprecate this package gracefully as you advance your work on gtfs-router. could just be managed by having a similar api.

add basic write_gtfs function

rename import_gtfs to read_gtfs

to more closely mirror APIs like read_csv

Stops and routes frequency calculation fails when routes have their own unique service IDs (e.g. Barcelona)

Hello everyone, I am trying to plot the frequency of buses that pass through each stop.

The library is supposed to add this data when I set frequency = TRUE, as below:
gtfs <-read_gtfs("gtfs.zip", local=TRUE, geometry = TRUE, frequency=TRUE)

In theory I should get new data with the frequency of buses per stop. However, I get this warning in the console:

Warning message:
In get_route_frequency(gtfs_obj) : failed to calculate frequency--
try passing a service_id from calendar_df

Because of this I can't get that information (I get the data frame, but without that data). I have also tried to get it using "get_stop_frequency", without luck.

Can someone help me?
Thank you all.
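For reference, the kind of calculation involved can be sketched in base R as below: filter stop times to the trips of one service_id, count departures per stop in a time window, and divide the window length by that count. `stop_frequency` is a hypothetical function, the tables are made up, and the 6 am to 10 pm window mirrors the default mentioned in the package's message; this is not the package's implementation. In a feed where every route has its own service_id, a single service_id only covers part of the network, which may be why passing one does not help.

```r
trips <- data.frame(
  trip_id = c("A1", "A2", "A3", "B1"),
  service_id = c("S_A", "S_A", "S_A", "S_B"),
  stringsAsFactors = FALSE
)
stop_times <- data.frame(
  trip_id = c("A1", "A2", "A3", "B1"),
  stop_id = c("STOP1", "STOP1", "STOP1", "STOP1"),
  departure_time = c("06:00:00", "07:00:00", "08:00:00", "06:30:00"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: departures and average headway per stop for one
# service_id within [start_hour, end_hour).
stop_frequency <- function(stop_times, trips, service_id,
                           start_hour = 6, end_hour = 22) {
  keep_trips <- trips$trip_id[trips$service_id == service_id]
  st <- stop_times[stop_times$trip_id %in% keep_trips, ]

  # Hour of departure is the part before the first colon.
  hour <- as.numeric(sub(":.*$", "", st$departure_time))
  st <- st[hour >= start_hour & hour < end_hour, ]

  departures <- table(st$stop_id)
  data.frame(
    stop_id = names(departures),
    departures = as.integer(departures),
    headway_minutes = (end_hour - start_hour) * 60 / as.integer(departures),
    stringsAsFactors = FALSE
  )
}

freq <- stop_frequency(stop_times, trips, "S_A")
print(freq)
```

Running the function once per service_id, or over a set of service_ids that are active on the same day, would be one way to cover a feed like the one described here.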