ropensci / tidyhydat
An R package to import Water Survey of Canada hydrometric data and make it tidy
Home Page: https://docs.ropensci.org/tidyhydat
License: Apache License 2.0
I'm not sure CRAN will be able to run those examples (specifically in the download_realtime_ws function). You will probably need to wrap them in \dontrun{} like:
https://github.com/USGS-R/dataRetrieval/blob/master/R/getWebServiceData.R#L30
Not entirely reproducible, but it does seem like the process is going too fast for the web service. One possible fix is to use Sys.sleep() to give httr::POST time to make the connection.
> get_ws_token(username = Sys.getenv("WS_USRNM"), password = Sys.getenv("WS_PWD"))
Error in get_ws_token(username = Sys.getenv("WS_USRNM"), password = Sys.getenv("WS_PWD")) :
403 Forbidden: the web service is denying your request. Try any of the following options: ensure you are not currently using all 5 tokens,
wait a few minutes and try again or copy the get_ws_token code and paste it directly into the console.
> get_ws_token(username = Sys.getenv("WS_USRNM"), password = Sys.getenv("WS_PWD"))
This token will expire at 10:55:35
This does seem to work every time for me, though it is a silly way to do it:
> debugonce(get_ws_token)
> token <- get_ws_token(username = Sys.getenv("WS_USRNM"), password = Sys.getenv("WS_PWD"))
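The Sys.sleep() idea could be sketched as a retry wrapper around the token request. This is a hypothetical helper (get_ws_token_retry is not part of the package); it assumes the web service simply throttles rapid requests.

```r
# Hypothetical sketch: retry the token request with a pause between
# attempts, on the assumption that the web service rejects requests
# that arrive too quickly after one another.
get_ws_token_retry <- function(username, password, tries = 3, wait = 5) {
  for (i in seq_len(tries)) {
    token <- tryCatch(
      get_ws_token(username = username, password = password),
      error = function(e) NULL
    )
    if (!is.null(token)) return(token)
    Sys.sleep(wait)  # give the web service time before the next httr::POST
  }
  stop("Could not obtain a token after ", tries, " attempts")
}
```
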
An early attempt was made here
From the review:
Lastly, I would consider revising your tidyhydat-package documentation. This is frequently the first interaction a user might have with the package. Currently it is very focused on the data format and output style (tidy), not the data itself. This documentation (along with a vignette) would be an excellent place to give a better overview of the data available. What data are available through this package? Stream/river level, flow, and, it seems, sediment flux estimates at daily and monthly timescales (also real-time in some cases)?
On the overview side, I'd recommend, if possible, pointing to further HYDAT documentation. A lot of the documentation here relies on the user's knowledge of the HYDAT database. For example, much of the function documentation refers to the functions as "Provides wrapper to turn the XXX table in HYDAT into a tidy data frame." which implies knowledge of the HYDAT database.
Also from the review:
Generally, I'm a big fan of every exported function having at least one functional example. It helps immensely, especially when teaching new R users. A package such as this could be very useful for new R users with a flavor for Canadian hydrology.
It would be really cool to have (maybe another) vignette that works through a complete analysis. Maybe from looking for data, to getting the data, to visualizing the data. Alternatively, maybe expanding the help files a bit to describe more completely what is returned from each function/data set. It wasn't obvious to me until I ran the function and looked at the result.
Possibly doing nothing is the best answer, but the time is currently output in UTC. It would be helpful to provide a means of converting it to avoid confusion.
PROV_TERR_STATE_LOC = "BC" isn't very useful for the data functions (DLY_FLOWS, DLY_LEVELS, etc.); such a big query will crash RStudio.
In download.R, there's a message: stop("No exists for this station query")... you might want to change it to "No data...".
None of the download_realtime_ws examples worked for me (I did get the token working). Each example ended with "No exists for this station query".
Also:
At least for the download tests, you'll probably want to add testthat::skip_on_cran()
Example for SED_MONTHLY_LOADS is not working. I played around a bit with it but couldn't get it to run. This may require better output of empty return values as it seems like it should be ok that a query returns no data.
> SED_MONTHLY_LOADS(PROV_TERR_STATE_LOC = "PE", hydat_path = "H:/Hydat.sqlite3", start_date = "2001-01-01", end_date = "2011-01-01")
Error: !length(data) || is_named(data) is not TRUE
11. stop(msg, call. = FALSE, domain = NA)
10. stopifnot(!length(data) || is_named(data))
9. env_bind_impl(env, dots_list(...))
8. child_env(env_, ...)
7. env_bury(overscope_top, !(!(!data)))
6. vars_select_eval(.vars, quos)
5. tidyselect::vars_select(names(data), !(!(!quos)))
4. unname(tidyselect::vars_select(names(data), !(!(!quos))))
3. gather.data.frame(sed_monthly_loads, variable, temp, -(STATION_NUMBER:NO_DAYS))
2. tidyr::gather(sed_monthly_loads, variable, temp, -(STATION_NUMBER:NO_DAYS)) at SED_MONTHLY_LOADS.R#107
1. SED_MONTHLY_LOADS(PROV_TERR_STATE_LOC = "PE", hydat_path = "H:/Hydat.sqlite3", start_date = "2001-01-01", end_date = "2011-01-01")
Possibly related from other review:
I noticed
ifelse(FULL_MONTH == 1, "Yes", "No") in SED_MONTHLY_SUSCON
...wouldn't it make more sense for it to be a logical? i.e.:
sed_monthly_suscon <- dplyr::mutate(sed_monthly_suscon, FULL_MONTH = FULL_MONTH == 1)
Running the tests locally, I got:
1. Error: SED_MONTHLY_LOADS accepts single and multiple province arguments (@test_SED_MONTHLY_LOADS.R#5)
2. Error: SED_MONTHLY_LOADS accepts single and multiple province arguments (@test_SED_MONTHLY_LOADS.R#12)
3. Error: SED_MONTHLY_LOADS can accept both arguments for backward compatability (@test_SED_MONTHLY_LOADS.R#22)
4. Error: download_realtime_ws returns the correct data header (@test_download_realtime.R#6)
When a user has set hydat_path it is tempting to omit the argument name, like this:
STN_DATA_RANGE("08MF005")
Error in rsqlite_send_query(conn@ptr, statement) :
no such table: STATIONS
This error message is uninformative; the function should fail earlier with a message noting that the hydat path is the first argument.
Leave an R script in data-raw that pulls the data from HYDAT and overwrites the existing internal data; it can be run when there are HYDAT updates.
Error:
> DLY_FLOWS(STATION_NUMBER = "08HA009")
No start and end dates specified. All dates available will be returned.
Error in enc2utf8(col_names(col_labels, sep = sep)) :
argument is not a character vector
Could be a simple function that just internally calls each one. It could be called hy_daily().
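A minimal sketch of what such a hy_daily() wrapper could look like, assuming the existing DLY_FLOWS and DLY_LEVELS functions and that their outputs can be row-bound; this is a hypothetical implementation, not the package's.

```r
# Hypothetical sketch: hy_daily() calls each daily function in turn
# and binds the results into one tidy data frame.
hy_daily <- function(station_number, ...) {
  flows <- DLY_FLOWS(STATION_NUMBER = station_number, ...)
  levels <- DLY_LEVELS(STATION_NUMBER = station_number, ...)
  dplyr::bind_rows(flows, levels)
}
```
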
From bcmaps:
.onAttach <- function(libname, pkgname) {
if (!.bcmaps_env$has_data) {
msg <- has_no_data_msg()
} else {
msg <- check_for_data_pkg_update()
}
if (!is.null(msg)) packageStartupMessage(msg)
}
Only occurs when DAY = NA for instances like TOTAL and MEAN
If a function is called like below rather than with library(tidyhydat)
the following error message is received. This is likely due to calling internal data rather than functions.
> tidyhydat::DLY_FLOWS(STATION_NUMBER = "08MF005")
No start and end dates specified. All dates available will be returned.
Error in tbl_vars(y) : object 'DATA_SYMBOLS' not found
Options to fix include:
Intent is to remove warnings like this:
Warning message:
closing unused connection 3 (http://dd.weather.gc.ca/hydrometric/csv/BC/hourly/BC_07ED003_hourly_hydrometric.csv)
For example:
> DLY_FLOWS(STATION_NUMBER = "08HF014")
No start and end dates specified. All dates available will be returned.
Error in enc2utf8(col_names(col_labels, sep = sep)) :
argument is not a character vector
This can probably be rectified with here somewhere: https://github.com/bcgov/tidyhydat/blob/f5c90197ead1cdc4b880ba0eb3522302619ad1cc/R/internal.R#L11-L51
Each HYDAT function uses the same code to parse, based on arguments, which stations we are querying. That code could be wrapped in an internal function to make debugging and code maintenance easier.
Either:
On line 29 of zzz.R I would drop the #' comments, which result in R trying to build an Rd file for .onLoad.
Or
The dplyr coding within the functions looks pretty straightforward... so for that reason I probably wouldn't change it, but it might be a good time to look into some of the newer features of dplyr that don't require the utils::globalVariables call in .onLoad. There's a good collection of blogs and resources here:
https://maraaverick.rbind.io/2017/08/tidyeval-resource-roundup/
All web service tests have currently been commented out because data sources are password protected.
The download_realtime_ws
issue can likely be solved using methods outlined here: http://httr.r-lib.org/articles/secrets.html#travis
From the review:
Another overarching issue (which is somewhat subjective) is that, to the outside reviewer, the choice of function capitalization seems arbitrary and not helpful. While the choices of function style are subject to personal preferences, I would suggest the usage of a consistent capitalization scheme, preferably a lowercase-based one. I am personally especially partial to the consistent prefix-style function naming, which is used in a number of rOpenSci and other related packages. It eases function discovery as well as makes packages seem more unified and consistent.
Recommends switching from DLY_FLOWS() to hydat_dly_flows() with all other HYDAT functions following a similar convention.
From one review:
Notes on the implications of focusing on "tidy":
When I saw the name of the package, I thought that there was going to be something like: find_stations(bbox = c(w, x, y, z)) %>% get_data(parameter = "flow") %>% calc_stats("annual")
I see what was meant is that it converts the data from messy to tidy... but just a side note that it would be cool to get some tidy hydro tools too... maybe another day!
https://bookdown.org/rdpeng/RProgDA/gaining-your-tidyverse-citizenship.html
This could be done in a few ways. Possibly something like this:
CODE TO GET A VECTOR OF STATIONS %>%
map(DLY_FLOWS)
Or possibly the DLY_FLOWS function could be modified so that it accepts a vector in a pipe for the STATION_NUMBER argument.
Result is that users don't always have to specify where hydat is
ifelse and dplyr::case_when
In most places where the path to the hydat.sqlite3 db is constructed, paste0(hy_dir(), "\\Hydat.sqlite3") is used, but this doesn't work on a Mac (nor, I don't think, would it on Linux). I'd suggest: file.path(hy_dir(), "Hydat.sqlite3")
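The advantage of file.path() is that it builds the path the same way on every platform (and never embeds a Windows-only backslash separator). A small illustration, with a made-up drive letter:

```r
# file.path() joins path components with "/" on all platforms,
# which both Windows and Unix-like systems accept.
db <- file.path("H:", "Hydat.sqlite3")
print(db)  # "H:/Hydat.sqlite3"
```
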
This is a bug.
It is possibly due to the default dates; maybe they should be NULL.
> flow_data <- DLY_FLOWS(STATION_NUMBER = c("08MF005","07EF001","08NE049"), start_date = "2001-01-01", end_date = Sys.Date())
Error in as.POSIXlt.character(x, tz, ...) :
character string is not in a standard unambiguous format
This results in tidy code as well as tidy output.
Using this as my main reference: http://style.tidyverse.org/index.html
This tool will be useful for these purposes:
https://lorenzwalthert.github.io/gsoc2017/
Exceptions, if any, can be listed in this issue.
This is because of instances of is.na(TIME_ZONE)
where there is a time:
dplyr::tbl(hydat_con, "ANNUAL_INSTANT_PEAKS") %>%
filter(!is.na(MINUTE) & is.na(TIME_ZONE))
Or where TIME_ZONE %in% c("*", "0"):
dplyr::tbl(hydat_con, "ANNUAL_INSTANT_PEAKS") %>%
filter(TIME_ZONE %in% c("*","0"))
Options are to either format two columns, Date and Datetime, OR to leave items as separate columns. The issue right now is that there are some instances where there is a full date and time but no time zone.
library(tidyhydat)
> STN_DATA_RANGE(STATION_NUMBER = "08NM146")
Error in dplyr::tbl(hydat_con, "STATIONS") : object 'hydat_con' not found
download_realtime_dd(STATION_NUMBER = "07ED003")
returns this error:
Error in open.connection(con, "rb") : 07ED003 cannot be found
This makes sense when it is only one station. But it might make more sense to output a dataframe of NA's so that something like this:
download_realtime_dd(STATION_NUMBER = c("08MF005", "07ED003"))
would return data for at least 08MF005 and leave 07ED003 with NA's.
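The NA-filling idea could look something like this in base R; na_station_row is a hypothetical helper and the column layout assumes the tidy output shown elsewhere in this thread (STATION_NUMBER, Date, Parameter, Value, Symbol).

```r
# Hypothetical sketch: build a one-row data frame of NAs for a station
# that returned no data, so multi-station queries still return a row
# for every requested station.
na_station_row <- function(station) {
  data.frame(
    STATION_NUMBER = station,
    Date = as.Date(NA),
    Parameter = NA_character_,
    Value = NA_real_,
    Symbol = NA_character_,
    stringsAsFactors = FALSE
  )
}

missing_row <- na_station_row("07ED003")
```

The downloader could then rbind these placeholder rows onto whatever data was successfully retrieved.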
Unique case as it is drawn from the DLY_FLOWS
table
Something like:
differ <- setdiff(unique(stns), unique(annual_statistics$STATION_NUMBER))
if (length(differ) != 0) {
  if (length(differ) <= 10) {
    message("The following station(s) were not retrieved: ", paste0(differ, collapse = " "))
    message("Check station number typos or if it is a valid station in the network")
  } else {
    message("More than 10 stations were not retrieved")
  }
} else {
  message("All stations successfully retrieved")
}
If 0 rows are returned from HYDAT then proceed to realtime
Add an argument in DLY_FLOWS, DLY_LEVELS and likely SED_DLY_LOAD (possibly other SED functions as well) to specify whether the symbol should be translated or kept as a single-character symbol. Default behaviour will be the single character to save space.
Review:
One thing about the examples. Using devtools I can run them, but a bunch of them are hard-coded to use h:/Hydat.sqlite3. It would be better if these examples followed the style outlined in the github README which uses an .Renviron variable to set the path (which would then allow anyone to quickly use/run examples).
Other review:
I agree with Luke that it was frustrating to have the examples use a path to the database that I did not use. If there was a cleaner way to deal with this (he gave a few example packages to check out), that would make life for the user easier. You could probably have the examples work if you pointed them to:
tinyflow <- DLY_FLOWS(STATION_NUMBER = c("08LA001","08NL071"), hydat_path = system.file("test_db/tinyhydat.sqlite3", package = "tidyhydat"))
of course...you'd need to adjust stations and what-not that are in the tiny db.
All tests have currently been commented out because data sources are either stored locally or password protected.
The download_realtime_ws issue can likely be solved using methods outlined here: http://httr.r-lib.org/articles/secrets.html#travis
When a vignette is built using devtools::build_vignette(), all built files are moved over to here. Normally those files are part of .gitignore. However, Travis was throwing a warning if built files weren't in the repo. So I built the vignette and removed inst/doc from .gitignore.
This has the unintended effect of removing tidyhydat.md, which was produced using this method.
Here is what I want: build the vignette so that a plain markdown file is generated that includes all the references from the bibtex file, and figure out a way so that Travis doesn't expect the inst/doc directory.
For the STATION_NUMBER parameter in the functions, I suggest giving it the ability to look up stations that contain text, not just matching text (other than "ALL"). For example, one could type in just "08NM" and pull all data from all stations containing 08NM in the number.
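In base R the prefix-matching idea could be sketched like this (the station vector below is made up for illustration; in the package the equivalent would presumably be a SQL LIKE '08NM%' clause against the STATIONS table):

```r
# Sketch: treat the supplied STATION_NUMBER as a prefix and select
# every station whose number starts with it.
all_stations <- c("08NM146", "08NM116", "08MF005")
matched <- all_stations[startsWith(all_stations, "08NM")]
print(matched)  # "08NM146" "08NM116"
```
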
To make the metadata table more user-friendly, I suggest changing the STATIONS() table with the following.
HYD_STATUS | A = ACTIVE; D = DISCONTINUED
SED_STATUS | A = ACTIVE; D = DISCONTINUED
RHBN | 0 = FALSE; 1 = TRUE
REAL_TIME | 0 = FALSE; 1 = TRUE
REGIONAL_OFFICE_ID | replace with REGIONAL_OFFICE text from look-up table
CONTRIBUTOR_ID | replace with CONTRIBUTOR text from look-up table
OPERATOR_ID | replace with OPERATOR text from look-up table
DATUM_ID | replace with DATUM text from look-up table
The HYDAT package uses the following to convert the first 4 listed above, which could be applied.
metadata$RHBN <- ifelse(metadata$RHBN==1, TRUE, FALSE)
metadata$REAL_TIME <- ifelse(metadata$REAL_TIME==1, TRUE, FALSE)
metadata$HYD_STATUS <- ifelse(metadata$HYD_STATUS=="A","ACTIVE",
ifelse(metadata$HYD_STATUS=="D", "DISCONTINUED", NA))
metadata$SED_STATUS <- ifelse(metadata$SED_STATUS=="A","ACTIVE",
ifelse(metadata$SED_STATUS=="D", "DISCONTINUED", NA))
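The same conversions read more cleanly with dplyr::case_when (mentioned elsewhere in this thread) than with nested ifelse calls. A sketch on a made-up metadata frame, assuming dplyr is available:

```r
library(dplyr)

# Toy metadata frame standing in for the STATIONS table columns.
metadata <- data.frame(
  HYD_STATUS = c("A", "D", NA),
  RHBN = c(1, 0, 1)
)

metadata <- metadata %>%
  mutate(
    RHBN = RHBN == 1,  # 0/1 flag becomes a logical directly
    HYD_STATUS = case_when(
      HYD_STATUS == "A" ~ "ACTIVE",
      HYD_STATUS == "D" ~ "DISCONTINUED",
      TRUE ~ NA_character_
    )
  )
```
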
Here is the error:
Warning: 380 parsing failures.
# A tibble: 5 x 5
    row   col   expected               actual  file
 1 4600  Value  no trailing characters  .54
 2 4601  Value  no trailing characters  .51
 3 4602  Value  no trailing characters  .48
 4 4603  Value  no trailing characters  .4
 5 4604  Value  no trailing characters  .33
See problems(...) for more details.
All station successfully retrieved
Warning message:
In rbind(names(probs), probs_f) :
number of columns of result is not a multiple of vector length (arg 1)
I suspect that these lines:
are restricting the Value column to be integer only, so when temperature or another parameter with decimals is requested, the parsing gets a little angry.
From one reviewer:
There seems to be no indication of the units for any of the data. The only unit I could find is in the DATA_TYPES table, though that was only for Sediment. This is something I feel is important for data packages such as this as units are critical to understanding the data, and are often the one piece of information I find myself spending excessive time hunting for (or trying to reverse engineer).
From the review:
One issue I encountered was the database download, with personal handling of the database path, is awkward. It seems that much of the work of downloading the database, and then storing the path could be automated. The storage location could be supported by the rappdirs package. There are existing examples of such file downloading and use, such as getlandsat, hydrolinks, and LAGOS to name a few with various implementations. This would make it easier for users to quickly get started using the data and eliminate the need to muck around in the Renviron file.
Other review:
The example:
file.edit("~/.Renviron")
That path does not automatically work for Windows users, I think. You might want to clarify that you will probably want a .Renviron file in the same directory as your RStudio project (assuming you use RStudio)... or something like that.
> hy_daily_flows(c("08HA002"))
No start and end dates specified. All dates available will be returned.
All station successfully retrieved
# A tibble: 30,286 x 5
STATION_NUMBER Date Parameter Value Symbol
<chr> <date> <chr> <dbl> <chr>
1 08HA002 1913-03-01 FLOW 49.3 <NA>
2 08HA002 1913-03-02 FLOW 47.3 <NA>
3 08HA002 1913-03-03 FLOW 45.3 <NA>
4 08HA002 1913-03-04 FLOW 43.3 <NA>
5 08HA002 1913-03-05 FLOW 43.3 <NA>
6 08HA002 1913-03-06 FLOW 41.3 <NA>
7 08HA002 1913-03-07 FLOW 41.3 <NA>
8 08HA002 1913-03-08 FLOW 39.6 <NA>
9 08HA002 1913-03-09 FLOW 39.6 <NA>
10 08HA002 1913-03-10 FLOW 39.6 <NA>
# ... with 30,276 more rows
while:
> hy_annual_stats(c("08HA002"))
All station successfully retrieved
# A tibble: 267 x 7
STATION_NUMBER Parameter Year Sum_stat Value Date Symbol
<chr> <chr> <int> <chr> <dbl> <date> <chr>
1 08HA002 Flow 1913 MEAN NA NA <NA>
2 08HA002 Flow 1913 MIN 6.510 1913-08-30 <NA>
3 08HA002 Flow 1913 MAX NA NA <NA>
4 08HA002 Flow 1914 MEAN 53.200 NA <NA>
5 08HA002 Flow 1914 MIN 2.460 1914-09-06 <NA>
6 08HA002 Flow 1914 MAX 207.000 1914-01-07 <NA>
7 08HA002 Flow 1915 MEAN 39.800 NA <NA>
8 08HA002 Flow 1915 MIN 0.906 1915-09-30 <NA>
9 08HA002 Flow 1915 MAX 141.000 1915-12-09 <NA>
10 08HA002 Flow 1916 MEAN 41.900 NA <NA>
# ... with 257 more rows
R version 3.4.2 (2017-09-28)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1
Matrix products: default
locale:
[1] LC_COLLATE=English_Canada.1252 LC_CTYPE=English_Canada.1252 LC_MONETARY=English_Canada.1252
[4] LC_NUMERIC=C LC_TIME=English_Canada.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] bindrcpp_0.2 RcppRoll_0.2.2 lubridate_1.7.1 ggplot2_2.2.1.9000 stringr_1.2.0 forcats_0.2.0
[7] tidyr_0.7.2 dplyr_0.7.4 sf_0.5-5 bcmaps_0.13.0 tidyhydat_0.3.0 testthat_1.0.2
[13] devtools_1.13.4
loaded via a namespace (and not attached):
[1] hrbrthemes_0.1.0 tidyselect_0.2.3 purrr_0.2.4 colorspace_1.3-2 viridisLite_0.2.0
[6] yaml_2.1.14 blob_1.1.0 rlang_0.1.4 e1071_1.6-8 glue_1.2.0
[11] withr_2.1.0.9000 DBI_0.7 rappdirs_0.3.1 bit64_0.9-7 dbplyr_1.1.0
[16] hunspell_2.7 bindr_0.1 plyr_1.8.4 bcmaps.rdata_0.1.0 munsell_0.4.3
[21] gtable_0.2.0 memoise_1.1.0 labeling_0.3 extrafont_0.17 curl_3.0
[26] class_7.3-14 Rttf2pt1_1.3.4 Rcpp_0.12.13 readr_1.1.1 udunits2_0.13
[31] scales_0.5.0.9000 classInt_0.1-24 bit_1.1-12 hms_0.3 digest_0.6.12
[36] stringi_1.1.6 grid_3.4.2 tools_3.4.2 magrittr_1.5 lazyeval_0.2.1
[41] tibble_1.3.4 RSQLite_2.0 crayon_1.3.4 extrafontdb_1.0 pkgconfig_2.0.1
[46] assertthat_0.2.0 httr_1.3.1 rstudioapi_0.7.0-9000 R6_2.2.2 units_0.4-6
[51] compiler_3.4.2