ropensci / tidyhydat
An R package to import Water Survey of Canada hydrometric data and make it tidy
Home Page: https://docs.ropensci.org/tidyhydat
License: Apache License 2.0
I'm not sure CRAN will be able to run those examples (specifically in the download_realtime_ws function). You will probably need to wrap them in \dontrun{} like:
https://github.com/USGS-R/dataRetrieval/blob/master/R/getWebServiceData.R#L30
Not entirely reproducible, but it does seem like the process is going too fast for the web service. One possible fix is to use Sys.sleep() to give httr::POST time to make the connection.
> get_ws_token(username = Sys.getenv("WS_USRNM"), password = Sys.getenv("WS_PWD"))
Error in get_ws_token(username = Sys.getenv("WS_USRNM"), password = Sys.getenv("WS_PWD")) :
403 Forbidden: the web service is denying your request. Try any of the following options: ensure you are not currently using all 5 tokens,
wait a few minutes and try again or copy the get_ws_token code and paste it directly into the console.
> get_ws_token(username = Sys.getenv("WS_USRNM"), password = Sys.getenv("WS_PWD"))
This token will expire at 10:55:35
This does seem to work every time for me, though it is a silly way to do it:
> debugonce(get_ws_token)
> token <- get_ws_token(username = Sys.getenv("WS_USRNM"), password = Sys.getenv("WS_PWD"))
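The Sys.sleep() idea could be sketched as a retry wrapper around the token request. This is a hypothetical helper (get_ws_token_retry is not part of the package); it assumes the web service simply throttles rapid requests.

```r
# Hypothetical sketch: retry the token request with a pause between
# attempts, on the assumption that the web service rejects requests
# that arrive too quickly after one another.
get_ws_token_retry <- function(username, password, tries = 3, wait = 5) {
  for (i in seq_len(tries)) {
    token <- tryCatch(
      get_ws_token(username = username, password = password),
      error = function(e) NULL
    )
    if (!is.null(token)) return(token)
    Sys.sleep(wait)  # give the web service time before the next httr::POST
  }
  stop("Could not obtain a token after ", tries, " attempts")
}
```
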
An early attempt was made here
From the review:
Lastly, I would consider revising your tidyhydat-package documentation. This is frequently the first interaction a user might have with the package. Currently it is very focused on the data format and output style (tidy), not the data itself. This documentation (along with a vignette) would be an excellent place to give a better overview of the data available. What data are available through this package? Stream/river level, flow, and, it seems, sediment flux estimates at daily and monthly timescales (also real-time in some cases)?
On the overview side, I'd recommend, if possible, pointing to further HYDAT documentation. A lot of the documentation here relies on the user's knowledge of the HYDAT database. For example, much of the function documentation refers to the functions as "Provides wrapper to turn the XXX table in HYDAT into a tidy data frame." which implies knowledge of the HYDAT database.
Also from the review:
Generally, I'm a big fan of every exported function having at least one functional example. It helps immensely, especially when teaching new R users. A package such as this could be very useful for new R users with a flavor for Canadian hydrology.
It would be really cool to have (maybe another) vignette that works through a complete analysis. Maybe from looking for data, to getting the data, to visualizing the data. Alternatively, maybe expanding the help files a bit to describe more completely what is returned from each function/data set. It wasn't obvious to me until I ran the function and looked at the result.
Possibly doing nothing is the best answer, but the time is currently output in UTC. It would be helpful to provide a means of converting it to avoid confusion.
PROV_TERR_STATE_LOC = "BC" isn't very useful for the data functions (DLY_FLOWS, DLY_LEVELS, etc.); such a big query will crash RStudio.
In download.R, there's a message: stop("No exists for this station query")... you might want to change it to "No data...".
None of the download_realtime_ws examples worked for me (I did get the token working). Each example ended with "No exists for this station query".
Also:
At least for the download tests, you'll probably want to add testthat::skip_on_cran()
Example for SED_MONTHLY_LOADS is not working. I played around a bit with it but couldn't get it to run. This may require better output of empty return values as it seems like it should be ok that a query returns no data.
> SED_MONTHLY_LOADS(PROV_TERR_STATE_LOC = "PE", hydat_path = "H:/Hydat.sqlite3", start_date = "2001-01-01", end_date = "2011-01-01")
Error: !length(data) || is_named(data) is not TRUE
11. stop(msg, call. = FALSE, domain = NA)
10. stopifnot(!length(data) || is_named(data))
9. env_bind_impl(env, dots_list(...))
8. child_env(env_, ...)
7. env_bury(overscope_top, !(!(!data)))
6. vars_select_eval(.vars, quos)
5. tidyselect::vars_select(names(data), !(!(!quos)))
4. unname(tidyselect::vars_select(names(data), !(!(!quos))))
3. gather.data.frame(sed_monthly_loads, variable, temp, -(STATION_NUMBER:NO_DAYS))
2. tidyr::gather(sed_monthly_loads, variable, temp, -(STATION_NUMBER:NO_DAYS)) at SED_MONTHLY_LOADS.R#107
1. SED_MONTHLY_LOADS(PROV_TERR_STATE_LOC = "PE", hydat_path = "H:/Hydat.sqlite3", start_date = "2001-01-01", end_date = "2011-01-01")
Possibly related from other review:
I noticed
ifelse(FULL_MONTH == 1, "Yes", "No") in SED_MONTHLY_SUSCON
...wouldn't it make more sense for it to be a logical? i.e.:
sed_monthly_suscon <- dplyr::mutate(sed_monthly_suscon, FULL_MONTH = FULL_MONTH == 1)
Running the tests locally, I got:
1. Error: SED_MONTHLY_LOADS accepts single and multiple province arguments (@test_SED_MONTHLY_LOADS.R#5)
2. Error: SED_MONTHLY_LOADS accepts single and multiple province arguments (@test_SED_MONTHLY_LOADS.R#12)
3. Error: SED_MONTHLY_LOADS can accept both arguments for backward compatability (@test_SED_MONTHLY_LOADS.R#22)
4. Error: download_realtime_ws returns the correct data header (@test_download_realtime.R#6)
When a user has set hydat_path it is tempting to omit the argument name, like this:
STN_DATA_RANGE("08MF005")
Error in rsqlite_send_query(conn@ptr, statement) :
no such table: STATIONS
This error message is uninformative; the function should fail earlier with a message noting that the hydat path is the first argument.
Leave an R script in data-raw that pulls the data from HYDAT and overwrites the existing internal data; it can be run when there are HYDAT updates.
Error:
> DLY_FLOWS(STATION_NUMBER = "08HA009")
No start and end dates specified. All dates available will be returned.
Error in enc2utf8(col_names(col_labels, sep = sep)) :
argument is not a character vector
Could be a simple function that just internally calls each one. It could be called hy_daily().
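A minimal sketch of what such a hy_daily() wrapper could look like, assuming the existing DLY_FLOWS and DLY_LEVELS functions and that their outputs can be row-bound; this is a hypothetical implementation, not the package's.

```r
# Hypothetical sketch: hy_daily() calls each daily function in turn
# and binds the results into one tidy data frame.
hy_daily <- function(station_number, ...) {
  flows <- DLY_FLOWS(STATION_NUMBER = station_number, ...)
  levels <- DLY_LEVELS(STATION_NUMBER = station_number, ...)
  dplyr::bind_rows(flows, levels)
}
```
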
From bcmaps:
.onAttach <- function(libname, pkgname) {
if (!.bcmaps_env$has_data) {
msg <- has_no_data_msg()
} else {
msg <- check_for_data_pkg_update()
}
if (!is.null(msg)) packageStartupMessage(msg)
}
Only occurs when DAY = NA for instances like TOTAL and MEAN
If a function is called like below rather than with library(tidyhydat)
the following error message is received. This is likely due to calling internal data rather than functions.
> tidyhydat::DLY_FLOWS(STATION_NUMBER = "08MF005")
No start and end dates specified. All dates available will be returned.
Error in tbl_vars(y) : object 'DATA_SYMBOLS' not found
Options to fix include:
Intent is to remove warnings like this:
Warning message:
closing unused connection 3 (http://dd.weather.gc.ca/hydrometric/csv/BC/hourly/BC_07ED003_hourly_hydrometric.csv)
For example:
> DLY_FLOWS(STATION_NUMBER = "08HF014")
No start and end dates specified. All dates available will be returned.
Error in enc2utf8(col_names(col_labels, sep = sep)) :
argument is not a character vector
This can probably be rectified with here somewhere: https://github.com/bcgov/tidyhydat/blob/f5c90197ead1cdc4b880ba0eb3522302619ad1cc/R/internal.R#L11-L51
Each HYDAT function uses the same code to parse, based on arguments, which stations we are querying. That code could be wrapped in an internal function to make debugging and code maintenance easier.
Either:
On line 29 of zzz.R I would drop the #' comments, which result in R trying to build an Rd file for .onLoad.
Or
The dplyr coding within the functions looks pretty straightforward... so for that reason I probably wouldn't change it, but it might be a good time to look into some of the newer features of dplyr that don't require the utils::globalVariables call in .onLoad. There's a good collection of blogs and resources here:
https://maraaverick.rbind.io/2017/08/tidyeval-resource-roundup/
All web service tests have currently been commented out because data sources are password protected.
The download_realtime_ws
issue can likely be solved using methods outlined here: http://httr.r-lib.org/articles/secrets.html#travis
From the review:
Another overarching issue (which is somewhat subjective) is that, to the outside reviewer, the choice of function capitalization seems arbitrary and not helpful. While the choices of function style are subject to personal preferences, I would suggest the usage of a consistent capitalization scheme, preferably a lowercase-based one. I am personally especially partial to the consistent prefix-style function naming, which is used in a number of rOpenSci and other related packages. It eases function discovery as well as makes packages seem more unified and consistent.
Recommends switching from DLY_FLOWS() to hydat_dly_flows() with all other HYDAT functions following a similar convention.
From one review:
Notes on the implications of focusing on "tidy":
When I saw the name of the package, I thought that there was going to be something like: find_stations(bbox = c(w, x, y, z)) %>% get_data(parameter = "flow") %>% calc_stats("annual")
I see what was meant is that it converts the data from messy to tidy... but just a side note that it would be cool to get some tidy hydro tools too... maybe another day!
https://bookdown.org/rdpeng/RProgDA/gaining-your-tidyverse-citizenship.html
This could be done in a few ways. Possibly something like this:
CODE TO GET A VECTOR OF STATIONS %>%
map(DLY_FLOWS)
Or possibly the DLY_FLOWS function could be modified so that it accepts a vector in a pipe for the STATION_NUMBER argument.
Result is that users don't always have to specify where hydat is
ifelse and dplyr::case_when
In most places where the path to the hydat.sqlite3 db is constructed, paste0(hy_dir(), "\\Hydat.sqlite3") is used, but this doesn't work on a Mac (nor, I don't think, would it on Linux). I'd suggest: file.path(hy_dir(), "Hydat.sqlite3")
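The advantage of file.path() is that it builds the path the same way on every platform (and never embeds a Windows-only backslash separator). A small illustration, with a made-up drive letter:

```r
# file.path() joins path components with "/" on all platforms,
# which both Windows and Unix-like systems accept.
db <- file.path("H:", "Hydat.sqlite3")
print(db)  # "H:/Hydat.sqlite3"
```
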
This is a bug.
It is possibly due to the default dates; maybe they should be NULL.
> flow_data <- DLY_FLOWS(STATION_NUMBER = c("08MF005","07EF001","08NE049"), start_date = "2001-01-01", end_date = Sys.Date())
Error in as.POSIXlt.character(x, tz, ...) :
character string is not in a standard unambiguous format
This results in tidy code as well as tidy output.
Using this as my main reference: http://style.tidyverse.org/index.html
This tool will be useful for these purposes:
https://lorenzwalthert.github.io/gsoc2017/
Exceptions, if any, can be listed in this issue.
This is because of instances of is.na(TIME_ZONE)
where there is a time:
dplyr::tbl(hydat_con, "ANNUAL_INSTANT_PEAKS") %>%
filter(!is.na(MINUTE) & is.na(TIME_ZONE))
Or where TIME_ZONE %in% c("*", "0"):
dplyr::tbl(hydat_con, "ANNUAL_INSTANT_PEAKS") %>%
filter(TIME_ZONE %in% c("*","0"))
Options are to either format two columns, Date and Datetime, OR to leave items as separate columns. The issue right now is that there are some instances where there is a full date and time but no time zone.
library(tidyhydat)
> STN_DATA_RANGE(STATION_NUMBER = "08NM146")
Error in dplyr::tbl(hydat_con, "STATIONS") : object 'hydat_con' not found
download_realtime_dd(STATION_NUMBER = "07ED003")
returns this error:
Error in open.connection(con, "rb") : 07ED003 cannot be found
This makes sense when it is only one station. But it might make more sense to output a dataframe of NA's so that something like this:
download_realtime_dd(STATION_NUMBER = c("08MF005", "07ED003"))
would return data for at least 08MF005 and leave 07ED003 with NA's.
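The NA-filling idea could look something like this in base R; na_station_row is a hypothetical helper and the column layout assumes the tidy output shown elsewhere in this thread (STATION_NUMBER, Date, Parameter, Value, Symbol).

```r
# Hypothetical sketch: build a one-row data frame of NAs for a station
# that returned no data, so multi-station queries still return a row
# for every requested station.
na_station_row <- function(station) {
  data.frame(
    STATION_NUMBER = station,
    Date = as.Date(NA),
    Parameter = NA_character_,
    Value = NA_real_,
    Symbol = NA_character_,
    stringsAsFactors = FALSE
  )
}

missing_row <- na_station_row("07ED003")
```

The downloader could then rbind these placeholder rows onto whatever data was successfully retrieved.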
Unique case as it is drawn from the DLY_FLOWS
table
Something like:
differ <- setdiff(unique(stns), unique(annual_statistics$STATION_NUMBER))
if (length(differ) != 0) {
  if (length(differ) <= 10) {
    message("The following station(s) were not retrieved: ", paste0(differ, collapse = " "))
    message("Check station number typos or if it is a valid station in the network")
  } else {
    message("More than 10 stations were not retrieved")
  }
} else {
  message("All stations successfully retrieved")
}
If 0 rows are returned from HYDAT then proceed to realtime
Add an argument in DLY_FLOWS, DLY_LEVELS and likely SED_DLY_LOAD (possibly other SED functions as well) to specify whether the symbol should be translated or kept as a single-character symbol. Default behaviour will be the single character to save space.
Review:
One thing about the examples. Using devtools I can run them, but a bunch of them are hard-coded to use h:/Hydat.sqlite3. It would be better if these examples followed the style outlined in the github README which uses an .Renviron variable to set the path (which would then allow anyone to quickly use/run examples).
Other review:
I agree with Luke that it was frustrating to have the examples use a path to the database that I did not use. If there was a cleaner way to deal with this (he gave a few example packages to check out), that would make life for the user easier. You could probably have the examples work if you pointed them to:
tinyflow <- DLY_FLOWS(STATION_NUMBER = c("08LA001","08NL071"), hydat_path = system.file("test_db/tinyhydat.sqlite3", package = "tidyhydat"))
of course...you'd need to adjust stations and what-not that are in the tiny db.
All tests have currently been commented out because data sources are either stored locally or password protected.
The download_realtime_ws issue can likely be solved using methods outlined here: http://httr.r-lib.org/articles/secrets.html#travis
When a vignette is built using devtools::build_vignette(), all built files are moved over to here. Normally those files are part of .gitignore. However, Travis was throwing a warning if built files weren't in the repo. So I built the vignette and removed inst/doc from .gitignore.
This has the unintended effect of removing tidyhydat.md, which was produced using this method.
Here is what I want: build the vignette so that a plain markdown file is generated that includes all the references from the bibtex file, and figure out a way so that Travis doesn't expect the inst/doc directory.
For the STATION_NUMBER parameter in the functions, I suggest giving it the ability to look up stations that contain text, not just matching text (other than "ALL"). For example, one could type in just "08NM" and pull all data from all stations containing 08NM in the number.
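In base R the prefix-matching idea could be sketched like this (the station vector below is made up for illustration; in the package the equivalent would presumably be a SQL LIKE '08NM%' clause against the STATIONS table):

```r
# Sketch: treat the supplied STATION_NUMBER as a prefix and select
# every station whose number starts with it.
all_stations <- c("08NM146", "08NM116", "08MF005")
matched <- all_stations[startsWith(all_stations, "08NM")]
print(matched)  # "08NM146" "08NM116"
```
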
To make the metadata table more user-friendly, I suggest changing the STATIONS() table with the following.
HYD_STATUS | A = ACTIVE; D = DISCONTINUED
SED_STATUS | A = ACTIVE; D = DISCONTINUED
RHBN | 0 = FALSE; 1 = TRUE
REAL_TIME | 0 = FALSE; 1 = TRUE
REGIONAL_OFFICE_ID | replace with REGIONAL_OFFICE text from look-up table
CONTRIBUTOR_ID | replace with CONTRIBUTOR text from look-up table
OPERATOR_ID | replace with OPERATOR text from look-up table
DATUM_ID | replace with DATUM text from look-up table
The HYDAT package uses the following to convert the first 4 listed above, which could be applied.
metadata$RHBN <- ifelse(metadata$RHBN==1, TRUE, FALSE)
metadata$REAL_TIME <- ifelse(metadata$REAL_TIME==1, TRUE, FALSE)
metadata$HYD_STATUS <- ifelse(metadata$HYD_STATUS=="A","ACTIVE",
ifelse(metadata$HYD_STATUS=="D", "DISCONTINUED", NA))
metadata$SED_STATUS <- ifelse(metadata$SED_STATUS=="A","ACTIVE",
ifelse(metadata$SED_STATUS=="D", "DISCONTINUED", NA))
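The same conversions read more cleanly with dplyr::case_when (mentioned elsewhere in this thread) than with nested ifelse calls. A sketch on a made-up metadata frame, assuming dplyr is available:

```r
library(dplyr)

# Toy metadata frame standing in for the STATIONS table columns.
metadata <- data.frame(
  HYD_STATUS = c("A", "D", NA),
  RHBN = c(1, 0, 1)
)

metadata <- metadata %>%
  mutate(
    RHBN = RHBN == 1,  # 0/1 flag becomes a logical directly
    HYD_STATUS = case_when(
      HYD_STATUS == "A" ~ "ACTIVE",
      HYD_STATUS == "D" ~ "DISCONTINUED",
      TRUE ~ NA_character_
    )
  )
```
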
Here is the error:
Warning: 380 parsing failures.
# A tibble: 5 x 5
    row   col   expected               actual  file
 1 4600  Value  no trailing characters  .54
 2 4601  Value  no trailing characters  .51
 3 4602  Value  no trailing characters  .48
 4 4603  Value  no trailing characters  .4
 5 4604  Value  no trailing characters  .33
See problems(...) for more details.
All station successfully retrieved
Warning message:
In rbind(names(probs), probs_f) :
number of columns of result is not a multiple of vector length (arg 1)
I suspect that these lines:
are restricting the Value column to be integer only, so when temperature or another parameter with decimals is requested, the parsing gets a little angry.
From one reviewer:
There seems to be no indication of the units for any of the data. The only unit I could find is in the DATA_TYPES table, though that was only for Sediment. This is something I feel is important for data packages such as this as units are critical to understanding the data, and are often the one piece of information I find myself spending excessive time hunting for (or trying to reverse engineer).
From the review:
One issue I encountered was the database download, with personal handling of the database path, is awkward. It seems that much of the work of downloading the database, and then storing the path could be automated. The storage location could be supported by the rappdirs package. There are existing examples of such file downloading and use, such as getlandsat, hydrolinks, and LAGOS to name a few with various implementations. This would make it easier for users to quickly get started using the data and eliminate the need to muck around in the Renviron file.
Other review:
The example:
file.edit("~/.Renviron")
That path does not automatically work for Windows users, I think. You might want to clarify that you will probably want a .Renviron file in the same directory as your RStudio project (assuming you use RStudio)... or something like that.
> hy_daily_flows(c("08HA002"))
No start and end dates specified. All dates available will be returned.
All station successfully retrieved
# A tibble: 30,286 x 5
STATION_NUMBER Date Parameter Value Symbol
<chr> <date> <chr> <dbl> <chr>
1 08HA002 1913-03-01 FLOW 49.3 <NA>
2 08HA002 1913-03-02 FLOW 47.3 <NA>
3 08HA002 1913-03-03 FLOW 45.3 <NA>
4 08HA002 1913-03-04 FLOW 43.3 <NA>
5 08HA002 1913-03-05 FLOW 43.3 <NA>
6 08HA002 1913-03-06 FLOW 41.3 <NA>
7 08HA002 1913-03-07 FLOW 41.3 <NA>
8 08HA002 1913-03-08 FLOW 39.6 <NA>
9 08HA002 1913-03-09 FLOW 39.6 <NA>
10 08HA002 1913-03-10 FLOW 39.6 <NA>
# ... with 30,276 more rows
while:
> hy_annual_stats(c("08HA002"))
All station successfully retrieved
# A tibble: 267 x 7
STATION_NUMBER Parameter Year Sum_stat Value Date Symbol
<chr> <chr> <int> <chr> <dbl> <date> <chr>
1 08HA002 Flow 1913 MEAN NA NA <NA>
2 08HA002 Flow 1913 MIN 6.510 1913-08-30 <NA>
3 08HA002 Flow 1913 MAX NA NA <NA>
4 08HA002 Flow 1914 MEAN 53.200 NA <NA>
5 08HA002 Flow 1914 MIN 2.460 1914-09-06 <NA>
6 08HA002 Flow 1914 MAX 207.000 1914-01-07 <NA>
7 08HA002 Flow 1915 MEAN 39.800 NA <NA>
8 08HA002 Flow 1915 MIN 0.906 1915-09-30 <NA>
9 08HA002 Flow 1915 MAX 141.000 1915-12-09 <NA>
10 08HA002 Flow 1916 MEAN 41.900 NA <NA>
# ... with 257 more rows
R version 3.4.2 (2017-09-28)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1
Matrix products: default
locale:
[1] LC_COLLATE=English_Canada.1252 LC_CTYPE=English_Canada.1252 LC_MONETARY=English_Canada.1252
[4] LC_NUMERIC=C LC_TIME=English_Canada.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] bindrcpp_0.2 RcppRoll_0.2.2 lubridate_1.7.1 ggplot2_2.2.1.9000 stringr_1.2.0 forcats_0.2.0
[7] tidyr_0.7.2 dplyr_0.7.4 sf_0.5-5 bcmaps_0.13.0 tidyhydat_0.3.0 testthat_1.0.2
[13] devtools_1.13.4
loaded via a namespace (and not attached):
[1] hrbrthemes_0.1.0 tidyselect_0.2.3 purrr_0.2.4 colorspace_1.3-2 viridisLite_0.2.0
[6] yaml_2.1.14 blob_1.1.0 rlang_0.1.4 e1071_1.6-8 glue_1.2.0
[11] withr_2.1.0.9000 DBI_0.7 rappdirs_0.3.1 bit64_0.9-7 dbplyr_1.1.0
[16] hunspell_2.7 bindr_0.1 plyr_1.8.4 bcmaps.rdata_0.1.0 munsell_0.4.3
[21] gtable_0.2.0 memoise_1.1.0 labeling_0.3 extrafont_0.17 curl_3.0
[26] class_7.3-14 Rttf2pt1_1.3.4 Rcpp_0.12.13 readr_1.1.1 udunits2_0.13
[31] scales_0.5.0.9000 classInt_0.1-24 bit_1.1-12 hms_0.3 digest_0.6.12
[36] stringi_1.1.6 grid_3.4.2 tools_3.4.2 magrittr_1.5 lazyeval_0.2.1
[41] tibble_1.3.4 RSQLite_2.0 crayon_1.3.4 extrafontdb_1.0 pkgconfig_2.0.1
[46] assertthat_0.2.0 httr_1.3.1 rstudioapi_0.7.0-9000 R6_2.2.2 units_0.4-6
[51] compiler_3.4.2