edgararuiz-zz / dbplot Goto Github PK
View Code? Open in Web Editor NEWSimplifies plotting of database and sparklyr data
Home Page: https://edgararuiz.github.io/dbplot/
Simplifies plotting of database and sparklyr data
Home Page: https://edgararuiz.github.io/dbplot/
I'd like to produce an image like the one below, but if I am not mistaken, currently a color
aesthetic is not supported in dbplot_boxplot()
as well as db_compute_boxplot()
. Would you consider adding this feature? Alternatively, if the argument x
in db_compute_boxplot()
could take multiple grouping variables, that would already be an improvement. Thanks for maintaining this package.
User reported, they started from a clean Windows machine installing everything from CRAN, this was using sparklyr 1.0.0
, so maybe related but don't remember
library(dbplot)
sc <- sparklyr::spark_connect(master = "local")
cars <- copt_to(sc, mtcars)
cars %>%
dbplot::dbplot_histogram(mpg, binwidth = 3)
## Error in call[[1]] : object of type 'symbol' is not subsettable
Could not reproduce with sparklyr 1.0.0
and neither dbplot 0.3.0
... I worry there is something broken that we are not being able to reproduce.
Thank you very much for developing this package, it looks very interesting.
I can reproduce the examples in the documentation, however when I try the function on real data I get the following spark error:
Error: org.apache.spark.SparkException: Job 1 cancelled because SparkContext was shut down
The table I am trying to plot is quite big - but even after trying head %>% compute or head %>% coalesce, the error remain.
Do you have any suggestion on how to debug this, or which parameters could be adjusted?
Thanks
In the "Calculation functions" section you show a histogram with a scale on the left and a color legend on the right. This color legend is not needed and is not conventional with histograms (IMHO). The color legend doesn't add any information that's not in the scale at the left.
But how can I change the bars in the histogram to some constant color? Why doesn't something like fill="red" work?
When used via dbplyr
, anything that uses the db_bin()
function will return the following error message: unused argument (na.rm = TRUE)
This is due to using dbplyr
development version when developing. In order to use the functions in this package, there is no need to connect via dbplyr
, so there was no dependency noted. In real-life, this package would not be used unless in combination with dbplyr
, so I must note its dependency and make the appropriate change.
Hi @edgararuiz
Very nice package, very clever. I'm waiting the opportunity to use it.
Why not use your lower levels functions in your higher level functions, so the maintenance is a bit easier due you don't repeat code. So, as example:
dbplot_histogram <- function(data, x, bins = 30, binwidth = NULL){
db_compute_bins (data, x, bins = bins, binwidth = binwidth) %>%
ggplot() +
geom_col(aes(x, n)) +
labs(x = x,
y = "count")
}
Thanks again for your package.
Great package! :)
It would be nice to be able to facet plots computed in the database, e.g. via facet_wrap()
or on a grouped table. Or is there a straightforward approach already that I am missing?
Big Query does not like the default n()
or any other default name for a calculation, such as sum(x)
, need to create a sensible default, just like dbplot_histogram()
uses the name count
as the default:
con <- DBI::dbConnect(
dbi_driver(),
project = "bigquery-public-data",
dataset = "nlm_rxnorm",
billing = "",
use_legacy_sql = FALSE
)
tbl(con, "rxn_all_pathways_current") %>%
filter(
SOURCE_TTY == "BN",
TARGET_TTY == "IN"
) %>%
group_by(SOURCE_NAME) %>%
tally() %>%
dbplot_bar(n)
It is not possible to add a legend to a boxplot with dbplot.
https://stackoverflow.com/questions/56008148/dont-know-how-to-add-o-to-a-plot-with-r-package-dbplot
This functionality would be great with MS SQL Server as well. I realized that the newly released quantile functions could be leveraged to standardize this computation. I more or less copied the existing funciton into what I think could be the backend-agnostic version. Note that MS SQL requires quantile functions to be used with a mutate
rather than summarise
, hence the mutate %>% select %>% distinct
chain.
compute_boxplot = function(x, var, coef = 1.5) {
var = enquo(var)
x %>%
mutate(
lower = quantile(!!var, .25),
middle = quantile(!!var, .5),
upper = quantile(!!var, .75),
min_raw = min(!!var, na.rm = T),
max_raw = max(!!var, na.rm = T),
iqr = (upper - lower) * coef,
min_iqr = lower - iqr,
max_iqr = upper + iqr,
ymax = ifelse(max_raw > max_iqr, max_iqr, max_raw),
ymin = ifelse(min_raw < min_iqr, min_iqr, min_raw)) %>%
select(ymin, lower, middle, upper, ymax) %>%
distinct()
}
Furthermore, if the grouping variable isn't hardcoded into this function, any multi-variable grouping can be applied outside of this call.
First of all thank you for this nice (and in my case, life-saving) package.
I have a dataset of records (sightings) with their latitude and longitude, and I am using the function db_compute_raster(lon, lat, fill = n(), resolution = 500)
to calculate density over a spatial grid. I am using this package because of its seamless integration with sparklyr.
The computation is done correctly but the output drops the bins - i.e. the lat-lon pairs - where the count was zero. This means that I will have empty areas when visualising with ggplot geom_raster
. Is there the possibility of adding the choice of not dropping them?
Thanks for the work on this project.
I've noticed that the calculation of the histogram bins is done using rlang and sparlyr/dbplyr verbs.
Have you considered using the histogram_numeric function from hive?
For some reasons, summarise(histrogram_numeric(column_name, nbins)) doesn't work. However it should be possible to encode this as a SQL statement directly.
@edgararuiz I very much enjoyed your webinar on "Best practices for working with databases". Using SQL server isn't that convenient for many of our students, however, so I wanted to adapt your databases-and-R.Rmd file to work with SQLite.
Unfortunately lots of errors popped up with dbplot and, for example, the sampling example. I just wanted to check with you if that is expected. I know that SQLite is more limited as a database so perhaps this is expected.
db_flights %>%
filter(!is.na(arrdelay)) %>%
dbplot_histogram(arrdelay)
Error in new_result(connection@ptr, statement) : nanodbc/nanodbc.cpp:1344: HY000: [SQLite]near "OVER": syntax error (1)
29.
stop(structure(list(message = "nanodbc/nanodbc.cpp:1344: HY000: [SQLite]near \"OVER\": syntax error (1) ", call = new_result(connection@ptr, statement), cppstack = NULL), class = c("nanodbc::database_error", "C++Error", "error", "condition")))
28.
new_result(connection@ptr, statement)
27.
OdbcResult(connection = conn, statement = statement)
26.
dbSendQuery(con, sql)
25.
dbSendQuery(con, sql)
24.
db_collect.DBIConnection(x$src$con, sql, n = n, warn_incomplete = warn_incomplete)
23.
db_collect(x$src$con, sql, n = n, warn_incomplete = warn_incomplete)
29. | stop(structure(list(message = "nanodbc/nanodbc.cpp:1344: HY000: [SQLite]near \"OVER\": syntax error (1) ", call = new_result(connection@ptr, statement), cppstack = NULL), class = c("nanodbc::database_error", "C++Error", "error", "condition")))
-- | --
...
flights_sample <- db_flights %>%
filter(cancelled == 0, year == 2006) %>%
arrange(dayofmonth) %>%
mutate(row = row_number()) %>%
head() %>%
collect()
Error in new_result(connection@ptr, statement) : nanodbc/nanodbc.cpp:1344: HY000: [SQLite]near "(": syntax error (1)
17.
stop(structure(list(message = "nanodbc/nanodbc.cpp:1344: HY000: [SQLite]near \"(\": syntax error (1) ", call = new_result(connection@ptr, statement), cppstack = NULL), class = c("nanodbc::database_error", "C++Error", "error", "condition")))
16.
new_result(connection@ptr, statement)
15.
OdbcResult(connection = conn, statement = statement)
14.
dbSendQuery(con, sql)
13.
dbSendQuery(con, sql)
12.
db_collect.DBIConnection(x$src$con, sql, n = n, warn_incomplete = warn_incomplete)
...
How can I use this after a call to make a JDBC connection with DBI::dbConnect()
? I would love to have an example vignette that I can follow. Thanks!
> tbl(con, in_schema("datawarehouse", "flight")) %>%
+ group_by(month) %>%
+ tally() %>%
+ collect()
# A tibble: 12 x 2
month n
<dbl> <S3: integer64>
1 10 556205
2 4 598126
3 12 544958
4 5 606293
5 2 569236
6 7 627931
I'm using dbplot v0.3.3 with hive (version Hive 1.2.1000.2.6.5.0-292).
db_compute_boxplot is using percentile rather than percentile_approx, this limits the function to integers only.
Here is some sample code
library(DBI)
library(odbc)
library(dplyr)
library(dbplyr)
library(ggplot2)
library(dbplot)
con <- dbConnect(odbc(), "QoE", user = "foo",
pwd = "bar", bigint = "numeric")
Write Iris to Hive table removing '.' from col names
foo<-iris
colnames(foo)<-gsub(".", "_", colnames(foo), fixed = TRUE)
dbWriteTable(con,"iris",foo,overwrite=TRUE)
Run db_compute_boxplot - throws data type error for UDAF Percentile
db_iris<-tbl(con,in_schema("default","iris"))
db_iris %>% db_compute_boxplot(species,sepal_length)
Error in new_result(connection@ptr, statement) : nanodbc/nanodbc.cpp:1344: HY000: [Simba][Hardy] (80) Syntax or semantic analysis error thrown in server while executing query. Error message from server:
Error while compiling statement:
FAILED: NoMatchingMethodException
No matching method for class org.apache.hadoop.hive.ql.udf.UDAFPercentile with (double, double). Possible choices: _FUNC_(bigint, array<double>) _FUNC_(bigint, double)
Run again casting Sepal_Length to Integer
db_iris %>% db_compute_boxplot(species,as.integer(sepal_length)) %>%
ggplot()+geom_boxplot(aes(x=species,
middle=middle,
lower=lower,
upper=upper,
ymin=ymin,
ymax=ymax,
color=species),stat='identity')
Cordial saludo, de forma atenta me comunico porque estoy estudiando temas de big data con R studio+Spark y presento problemas para instalar el paquete "dbplot", es posible puedas ayudarme con este error. Por su amable atención y pronta respuesta, gracias
I enjoy using this library, but recently the histograms were not working as should. There were no bars for numeric values, only an empty graph. For integers there was no problem...
Solution: change line 92 in https://github.com/edgararuiz/dbplot/blob/master/R/histogram.R
92 (old) geom_col(aes(x, count)) +
92 (new) geom_col(aes(x, count), orientation="x", width = binwidth) +
This works as a charm when binwidths are known and it produces proper histograms at the same time, since there should be no gap between bars in a histogram!
Although this does not work when the binwidth is not set, maybe the binwidth calculation from (https://github.com/edgararuiz/dbplot/blob/master/R/dbbin.R) could be included somehow?
Thanks in advance!
Cool package! I've noticed that your DESCRIPTION doesn't link to this repo and to the issue page, it'd be worth adding the URL and BugReports fields. :-)
A declarative, efficient, and flexible JavaScript library for building user interfaces.
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google ❤️ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.