edgararuiz-zz / dbplot Goto Github PK

View Code? Open in Web Editor NEW

123.0 6.0 20.0 3.19 MB

Simplifies plotting of database and sparklyr data

Home Page: https://edgararuiz.github.io/dbplot/

R 100.00%

visualization r ggplot2 databases sparklyr dbplot rlang

dbplot's Introduction

dbplot

Installation
Connecting to a data source
Example
ggplot
- Histogram
- Raster
- Bar Plot
- Line plot
- Boxplot
Calculation functions
db_bin()

Leverages dplyr to process the calculations of a plot inside a database. This package provides helper functions that abstract the work at three levels:

Functions that ouput a ggplot2 object
Functions that outputs a data.frame object with the calculations
Creates the formula needed to calculate bins for a Histogram or a Raster plot

Installation

You can install the released version from CRAN:

# install.packages("dbplot")

Or the the development version from GitHub, using the remotes package:

# install.packages("remotes")
# remotes::install_github("edgararuiz/dbplot")

Connecting to a data source

For more information on how to connect to databases, including Hive, please visit http://db.rstudio.com
To use Spark, please visit the sparklyr official website: http://spark.rstudio.com

Example

In addition to database connections, the functions work with sparklyr. A local RSQLite database will be used for the examples in this README.

library(DBI)
library(odbc)
library(dplyr)

con <- dbConnect(RSQLite::SQLite(), ":memory:")
db_flights <- copy_to(con, nycflights13::flights, "flights")

`ggplot`

Histogram

By default dbplot_histogram() creates a 30 bin histogram

library(ggplot2)

db_flights %>% 
  dbplot_histogram(distance)

Use binwidth to fix the bin size

db_flights %>% 
  dbplot_histogram(distance, binwidth = 400)

Because it outputs a ggplot2 object, more customization can be done

db_flights %>% 
  dbplot_histogram(distance, binwidth = 400) +
  labs(title = "Flights - Distance traveled") +
  theme_bw()

Raster

To visualize two continuous variables, we typically resort to a Scatter plot. However, this may not be practical when visualizing millions or billions of dots representing the intersections of the two variables. A Raster plot may be a better option, because it concentrates the intersections into squares that are easier to parse visually.

A Raster plot basically does the same as a Histogram. It takes two continuous variables and creates discrete 2-dimensional bins represented as squares in the plot. It then determines either the number of rows inside each square or processes some aggregation, like an average.

If no fill argument is passed, the default calculation will be count, n()

db_flights %>%
  dbplot_raster(sched_dep_time, sched_arr_time)

Pass an aggregation formula that can run inside the database

db_flights %>%
  dbplot_raster(
    sched_dep_time, 
    sched_arr_time, 
    mean(distance, na.rm = TRUE)
    )

Increase or decrease for more, or less, definition. The resolution argument controls that, it defaults to 100

db_flights %>%
  dbplot_raster(
    sched_dep_time, 
    sched_arr_time, 
    mean(distance, na.rm = TRUE),
    resolution = 20
    )

Bar Plot

dbplot_bar() defaults to a tally() of each value in a discrete variable

db_flights %>%
  dbplot_bar(origin)

Pass a formula, and column name, that will be operated for each value in the discrete variable

db_flights %>%
  dbplot_bar(origin, avg_delay =  mean(dep_delay, na.rm = TRUE))

Line plot

dbplot_line() defaults to a tally() of each value in a discrete variable

db_flights %>%
  dbplot_line(month)

Pass a formula that will be operated for each value in the discrete variable

db_flights %>%
  dbplot_line(month, avg_delay = mean(dep_delay, na.rm = TRUE))

Boxplot

It expects a discrete variable to group by, and a continuous variable to calculate the percentiles and IQR. It doesn’t calculate outliers. It has been tested with the following connections:

MS SQL Server
PostgreSQL
Oracle
sparklyr

Here is an example using dbplot_boxplot() with a local data frame:

nycflights13::flights %>%
  dbplot_boxplot(origin, distance)

Calculation functions

If a more customized plot is needed, the data the underpins the plots can also be accessed:

db_compute_bins() - Returns a data frame with the bins and count per bin
db_compute_count() - Returns a data frame with the count per discrete value
db_compute_raster() - Returns a data frame with the results per x/y intersection
db_compute_raster2() - Returns same as db_compute_raster() function plus the coordinates of the x/y boxes
db_compute_boxplot() - Returns a data frame with boxplot calculations

db_flights %>%
  db_compute_bins(arr_delay) 
#> # A tibble: 28 x 2
#>    arr_delay  count
#>        <dbl>  <int>
#>  1     NA      9430
#>  2    -86      5325
#>  3    -40.7  207999
#>  4      4.53  79784
#>  5     49.8   19063
#>  6     95.1    7890
#>  7    140.     3746
#>  8    186.     1742
#>  9    231.      921
#> 10    276.      425
#> # … with 18 more rows

The data can be piped to a plot

db_flights %>%
  filter(arr_delay < 100 , arr_delay > -50) %>%
  db_compute_bins(arr_delay) %>%
  ggplot() +
  geom_col(aes(arr_delay, count, fill = count))

`db_bin()`

Uses ‘rlang’ to build the formula needed to create the bins of a numeric variable in an un-evaluated fashion. This way, the formula can be then passed inside a dplyr verb.

db_bin(var)
#> (((max(var, na.rm = TRUE) - min(var, na.rm = TRUE))/30) * ifelse(as.integer(floor((var - 
#>     min(var, na.rm = TRUE))/((max(var, na.rm = TRUE) - min(var, 
#>     na.rm = TRUE))/30))) == 30, as.integer(floor((var - min(var, 
#>     na.rm = TRUE))/((max(var, na.rm = TRUE) - min(var, na.rm = TRUE))/30))) - 
#>     1, as.integer(floor((var - min(var, na.rm = TRUE))/((max(var, 
#>     na.rm = TRUE) - min(var, na.rm = TRUE))/30))))) + min(var, 
#>     na.rm = TRUE)

db_flights %>%
  group_by(x = !! db_bin(arr_delay)) %>%
  tally()
#> # Source:   lazy query [?? x 2]
#> # Database: sqlite 3.29.0 [:memory:]
#>         x      n
#>     <dbl>  <int>
#>  1  NA      9430
#>  2 -86      5325
#>  3 -40.7  207999
#>  4   4.53  79784
#>  5  49.8   19063
#>  6  95.1    7890
#>  7 140.     3746
#>  8 186.     1742
#>  9 231.      921
#> 10 276.      425
#> # … with more rows

db_flights %>%
  filter(!is.na(arr_delay)) %>%
  group_by(x = !! db_bin(arr_delay)) %>%
  tally()%>%
  collect %>%
  ggplot() +
  geom_col(aes(x, n))

dbDisconnect(con)

dbplot's People

Contributors

Stargazers

Watchers

dbplot's Issues

Support color aesthetic for boxplots

I'd like to produce an image like the one below, but if I am not mistaken, currently a color aesthetic is not supported in dbplot_boxplot() as well as db_compute_boxplot(). Would you consider adding this feature? Alternatively, if the argument x in db_compute_boxplot() could take multiple grouping variables, that would already be an improvement. Thanks for maintaining this package.

How to color bars in histogram?

In the "Calculation functions" section you show a histogram with a scale on the left and a color legend on the right. This color legend is not needed and is not conventional with histograms (IMHO). The color legend doesn't add any information that's not in the scale at the left.

But how can I change the bars in the histogram to some constant color? Why doesn't something like fill="red" work?

Rstudio webinar - use with SQLite

@edgararuiz I very much enjoyed your webinar on "Best practices for working with databases". Using SQL server isn't that convenient for many of our students, however, so I wanted to adapt your databases-and-R.Rmd file to work with SQLite.

Unfortunately lots of errors popped up with dbplot and, for example, the sampling example. I just wanted to check with you if that is expected. I know that SQLite is more limited as a database so perhaps this is expected.

db_flights %>%
  filter(!is.na(arrdelay)) %>%
  dbplot_histogram(arrdelay)

Error in new_result(connection@ptr, statement) : nanodbc/nanodbc.cpp:1344: HY000: [SQLite]near "OVER": syntax error (1)
29.
stop(structure(list(message = "nanodbc/nanodbc.cpp:1344: HY000: [SQLite]near \"OVER\": syntax error (1) ", call = new_result(connection@ptr, statement), cppstack = NULL), class = c("nanodbc::database_error", "C++Error", "error", "condition")))
28.
new_result(connection@ptr, statement)
27.
OdbcResult(connection = conn, statement = statement)
26.
dbSendQuery(con, sql)
25.
dbSendQuery(con, sql)
24.
db_collect.DBIConnection(x$src$con, sql, n = n, warn_incomplete = warn_incomplete)
23.
db_collect(x$src$con, sql, n = n, warn_incomplete = warn_incomplete)
29. | stop(structure(list(message = "nanodbc/nanodbc.cpp:1344: HY000: [SQLite]near \"OVER\": syntax error (1) ", call = new_result(connection@ptr, statement), cppstack = NULL), class = c("nanodbc::database_error", "C++Error", "error", "condition")))
-- | --
...

flights_sample <- db_flights %>%
  filter(cancelled == 0, year == 2006) %>%
  arrange(dayofmonth) %>%
  mutate(row = row_number()) %>%
  head() %>%
  collect()

Error in new_result(connection@ptr, statement) : nanodbc/nanodbc.cpp:1344: HY000: [SQLite]near "(": syntax error (1)
17.
stop(structure(list(message = "nanodbc/nanodbc.cpp:1344: HY000: [SQLite]near \"(\": syntax error (1) ", call = new_result(connection@ptr, statement), cppstack = NULL), class = c("nanodbc::database_error", "C++Error", "error", "condition")))
16.
new_result(connection@ptr, statement)
15.
OdbcResult(connection = conn, statement = statement)
14.
dbSendQuery(con, sql)
13.
dbSendQuery(con, sql)
12.
db_collect.DBIConnection(x$src$con, sql, n = n, warn_incomplete = warn_incomplete)
...

Using lower levels functions in higher level functions

Hi @edgararuiz

Very nice package, very clever. I'm waiting the opportunity to use it.

Why not use your lower levels functions in your higher level functions, so the maintenance is a bit easier due you don't repeat code. So, as example:

dbplot_histogram <- function(data, x, bins = 30, binwidth = NULL){

  db_compute_bins (data, x, bins = bins, binwidth = binwidth)  %>%
    ggplot() +
    geom_col(aes(x, n)) +
    labs(x = x,
         y = "count")
}

Thanks again for your package.

Use of histogram_numeric function from Hive?

Thanks for the work on this project.
I've noticed that the calculation of the histogram bins is done using rlang and sparlyr/dbplyr verbs.

Have you considered using the histogram_numeric function from hive?

https://cwiki.apache.org/confluence/display/Hive/StatisticsAndDataMining

For some reasons, summarise(histrogram_numeric(column_name, nbins)) doesn't work. However it should be possible to encode this as a SQL statement directly.

unused argument (na.rm = TRUE)

When used via dbplyr, anything that uses the db_bin() function will return the following error message: unused argument (na.rm = TRUE)

This is due to using dbplyr development version when developing. In order to use the functions in this package, there is no need to connect via dbplyr, so there was no dependency noted. In real-life, this package would not be used unless in combination with dbplyr , so I must note its dependency and make the appropriate change.

Add support for quantile plot

object of type 'symbol' is not subsettable

User reported, they started from a clean Windows machine installing everything from CRAN, this was using sparklyr 1.0.0, so maybe related but don't remember

library(dbplot)
sc <- sparklyr::spark_connect(master = "local")
cars <- copt_to(sc, mtcars)
cars %>%
  dbplot::dbplot_histogram(mpg, binwidth = 3)

## Error in call[[1]] : object of type 'symbol' is not subsettable

Could not reproduce with sparklyr 1.0.0 and neither dbplot 0.3.0... I worry there is something broken that we are not being able to reproduce.

Improve default naming

Big Query does not like the default n() or any other default name for a calculation, such as sum(x), need to create a sensible default, just like dbplot_histogram() uses the name count as the default:

con <- DBI::dbConnect(
  dbi_driver(),
  project = "bigquery-public-data",
  dataset = "nlm_rxnorm",
  billing = "",
  use_legacy_sql = FALSE
)

tbl(con, "rxn_all_pathways_current") %>%
  filter(
    SOURCE_TTY == "BN",
    TARGET_TTY == "IN"
  ) %>%
  group_by(SOURCE_NAME) %>%
  tally() %>%
  dbplot_bar(n)

dbplot_histogram resulting in empty plots + proper histograms

I enjoy using this library, but recently the histograms were not working as should. There were no bars for numeric values, only an empty graph. For integers there was no problem...

Solution: change line 92 in https://github.com/edgararuiz/dbplot/blob/master/R/histogram.R

92 (old) geom_col(aes(x, count)) +
92 (new) geom_col(aes(x, count), orientation="x", width = binwidth) +

This works as a charm when binwidths are known and it produces proper histograms at the same time, since there should be no gap between bars in a histogram!
Although this does not work when the binwidth is not set, maybe the binwidth calculation from (https://github.com/edgararuiz/dbplot/blob/master/R/dbbin.R) could be included somehow?

Thanks in advance!

Error en cargar el paquete dbplot

Cordial saludo, de forma atenta me comunico porque estoy estudiando temas de big data con R studio+Spark y presento problemas para instalar el paquete "dbplot", es posible puedas ayudarme con este error. Por su amable atención y pronta respuesta, gracias

Job cancelled because SparkContext was shut down

Thank you very much for developing this package, it looks very interesting.

I can reproduce the examples in the documentation, however when I try the function on real data I get the following spark error:

Error: org.apache.spark.SparkException: Job 1 cancelled because SparkContext was shut down

The table I am trying to plot is quite big - but even after trying head %>% compute or head %>% coalesce, the error remain.

Do you have any suggestion on how to debug this, or which parameters could be adjusted?

Thanks

Add support for MSSQL connections

This functionality would be great with MS SQL Server as well. I realized that the newly released quantile functions could be leveraged to standardize this computation. I more or less copied the existing funciton into what I think could be the backend-agnostic version. Note that MS SQL requires quantile functions to be used with a mutate rather than summarise, hence the mutate %>% select %>% distinct chain.

compute_boxplot = function(x, var, coef = 1.5) {
  
  var = enquo(var)
  
  x %>%
    mutate(
      lower   = quantile(!!var, .25),
      middle  = quantile(!!var, .5),
      upper   = quantile(!!var, .75),
      min_raw = min(!!var, na.rm = T),
      max_raw = max(!!var, na.rm = T),
      iqr     = (upper - lower) * coef,
      min_iqr = lower - iqr,
      max_iqr = upper + iqr,
      ymax    = ifelse(max_raw > max_iqr, max_iqr, max_raw),
      ymin    = ifelse(min_raw < min_iqr, min_iqr, min_raw)) %>%
    select(ymin, lower, middle, upper, ymax) %>% 
    distinct()
}

Furthermore, if the grouping variable isn't hardcoded into this function, any multi-variable grouping can be applied outside of this call.

Add support for `int64` results

> tbl(con, in_schema("datawarehouse", "flight")) %>%
+     group_by(month) %>%
+     tally() %>%
+     collect() 
# A tibble: 12 x 2
   month               n
   <dbl> <S3: integer64>
 1    10          556205
 2     4          598126
 3    12          544958
 4     5          606293
 5     2          569236
 6     7          627931

Add possibility to facet plots

Great package! :)

It would be nice to be able to facet plots computed in the database, e.g. via facet_wrap() or on a grouped table. Or is there a straightforward approach already that I am missing?

Adding legend to boxplot

It is not possible to add a legend to a boxplot with dbplot.
https://stackoverflow.com/questions/56008148/dont-know-how-to-add-o-to-a-plot-with-r-package-dbplot

URL and BugReports fields in DESCRIPTION

Cool package! I've noticed that your DESCRIPTION doesn't link to this repo and to the issue page, it'd be worth adding the URL and BugReports fields. :-)

add vignette example of how to use with a DBIConnection?

How can I use this after a call to make a JDBC connection with DBI::dbConnect()? I would love to have an example vignette that I can follow. Thanks!

db_compute_boxplot in Hive is using percentile instead of percentile approx

I'm using dbplot v0.3.3 with hive (version Hive 1.2.1000.2.6.5.0-292).

db_compute_boxplot is using percentile rather than percentile_approx, this limits the function to integers only.

Here is some sample code

    library(DBI)
    library(odbc)
    library(dplyr)
    library(dbplyr)
    library(ggplot2)
    library(dbplot)

con <- dbConnect(odbc(), "QoE", user = "foo", 
    pwd = "bar", bigint = "numeric")

Write Iris to Hive table removing '.' from col names

foo<-iris
colnames(foo)<-gsub(".", "_", colnames(foo), fixed = TRUE)
dbWriteTable(con,"iris",foo,overwrite=TRUE)

Run db_compute_boxplot - throws data type error for UDAF Percentile

db_iris<-tbl(con,in_schema("default","iris"))
db_iris %>% db_compute_boxplot(species,sepal_length)

Error in new_result(connection@ptr, statement) : nanodbc/nanodbc.cpp:1344: HY000: [Simba][Hardy] (80) Syntax or semantic analysis error thrown in server while executing query. Error message from server: 
Error while compiling statement: 
FAILED: NoMatchingMethodException 
No matching method for class org.apache.hadoop.hive.ql.udf.UDAFPercentile with (double, double). Possible choices: _FUNC_(bigint, array<double>) _FUNC_(bigint, double)

Run again casting Sepal_Length to Integer

db_iris %>% db_compute_boxplot(species,as.integer(sepal_length)) %>% 
  ggplot()+geom_boxplot(aes(x=species,
                            middle=middle,
                            lower=lower,
                            upper=upper,
                            ymin=ymin,
                            ymax=ymax,
                            color=species),stat='identity')

Add support for density plots

db_compute_raster dropping empty bins

First of all thank you for this nice (and in my case, life-saving) package.

I have a dataset of records (sightings) with their latitude and longitude, and I am using the function db_compute_raster(lon, lat, fill = n(), resolution = 500) to calculate density over a spatial grid. I am using this package because of its seamless integration with sparklyr.

The computation is done correctly but the output drops the bins - i.e. the lat-lon pairs - where the count was zero. This means that I will have empty areas when visualising with ggplot geom_raster. Is there the possibility of adding the choice of not dropping them?