Coder Social home page Coder Social logo

dbplot's Introduction

dbplot

Build Status CRAN_Status_Badge Coverage status

Leverages dplyr to process the calculations of a plot inside a database. This package provides helper functions that abstract the work at three levels:

  1. Functions that ouput a ggplot2 object
  2. Functions that outputs a data.frame object with the calculations
  3. Creates the formula needed to calculate bins for a Histogram or a Raster plot

Installation

You can install the released version from CRAN:

# install.packages("dbplot")

Or the the development version from GitHub, using the remotes package:

# install.packages("remotes")
# remotes::install_github("edgararuiz/dbplot")

Connecting to a data source

Example

In addition to database connections, the functions work with sparklyr. A local RSQLite database will be used for the examples in this README.

library(DBI)
library(odbc)
library(dplyr)

con <- dbConnect(RSQLite::SQLite(), ":memory:")
db_flights <- copy_to(con, nycflights13::flights, "flights")

ggplot

Histogram

By default dbplot_histogram() creates a 30 bin histogram

library(ggplot2)

db_flights %>% 
  dbplot_histogram(distance)

Use binwidth to fix the bin size

db_flights %>% 
  dbplot_histogram(distance, binwidth = 400)

Because it outputs a ggplot2 object, more customization can be done

db_flights %>% 
  dbplot_histogram(distance, binwidth = 400) +
  labs(title = "Flights - Distance traveled") +
  theme_bw()

Raster

To visualize two continuous variables, we typically resort to a Scatter plot. However, this may not be practical when visualizing millions or billions of dots representing the intersections of the two variables. A Raster plot may be a better option, because it concentrates the intersections into squares that are easier to parse visually.

A Raster plot basically does the same as a Histogram. It takes two continuous variables and creates discrete 2-dimensional bins represented as squares in the plot. It then determines either the number of rows inside each square or processes some aggregation, like an average.

  • If no fill argument is passed, the default calculation will be count, n()
db_flights %>%
  dbplot_raster(sched_dep_time, sched_arr_time) 

  • Pass an aggregation formula that can run inside the database
db_flights %>%
  dbplot_raster(
    sched_dep_time, 
    sched_arr_time, 
    mean(distance, na.rm = TRUE)
    ) 

  • Increase or decrease for more, or less, definition. The resolution argument controls that, it defaults to 100
db_flights %>%
  dbplot_raster(
    sched_dep_time, 
    sched_arr_time, 
    mean(distance, na.rm = TRUE),
    resolution = 20
    ) 

Bar Plot

  • dbplot_bar() defaults to a tally() of each value in a discrete variable
db_flights %>%
  dbplot_bar(origin)

  • Pass a formula, and column name, that will be operated for each value in the discrete variable
db_flights %>%
  dbplot_bar(origin, avg_delay =  mean(dep_delay, na.rm = TRUE))

Line plot

  • dbplot_line() defaults to a tally() of each value in a discrete variable
db_flights %>%
  dbplot_line(month)

  • Pass a formula that will be operated for each value in the discrete variable
db_flights %>%
  dbplot_line(month, avg_delay = mean(dep_delay, na.rm = TRUE))

Boxplot

It expects a discrete variable to group by, and a continuous variable to calculate the percentiles and IQR. It doesn’t calculate outliers. It has been tested with the following connections:

  • MS SQL Server
  • PostgreSQL
  • Oracle
  • sparklyr

Here is an example using dbplot_boxplot() with a local data frame:

nycflights13::flights %>%
  dbplot_boxplot(origin, distance)

Calculation functions

If a more customized plot is needed, the data the underpins the plots can also be accessed:

  1. db_compute_bins() - Returns a data frame with the bins and count per bin
  2. db_compute_count() - Returns a data frame with the count per discrete value
  3. db_compute_raster() - Returns a data frame with the results per x/y intersection
  4. db_compute_raster2() - Returns same as db_compute_raster() function plus the coordinates of the x/y boxes
  5. db_compute_boxplot() - Returns a data frame with boxplot calculations
db_flights %>%
  db_compute_bins(arr_delay) 
#> # A tibble: 28 x 2
#>    arr_delay  count
#>        <dbl>  <int>
#>  1     NA      9430
#>  2    -86      5325
#>  3    -40.7  207999
#>  4      4.53  79784
#>  5     49.8   19063
#>  6     95.1    7890
#>  7    140.     3746
#>  8    186.     1742
#>  9    231.      921
#> 10    276.      425
#> # … with 18 more rows

The data can be piped to a plot

db_flights %>%
  filter(arr_delay < 100 , arr_delay > -50) %>%
  db_compute_bins(arr_delay) %>%
  ggplot() +
  geom_col(aes(arr_delay, count, fill = count))

db_bin()

Uses ‘rlang’ to build the formula needed to create the bins of a numeric variable in an un-evaluated fashion. This way, the formula can be then passed inside a dplyr verb.

db_bin(var)
#> (((max(var, na.rm = TRUE) - min(var, na.rm = TRUE))/30) * ifelse(as.integer(floor((var - 
#>     min(var, na.rm = TRUE))/((max(var, na.rm = TRUE) - min(var, 
#>     na.rm = TRUE))/30))) == 30, as.integer(floor((var - min(var, 
#>     na.rm = TRUE))/((max(var, na.rm = TRUE) - min(var, na.rm = TRUE))/30))) - 
#>     1, as.integer(floor((var - min(var, na.rm = TRUE))/((max(var, 
#>     na.rm = TRUE) - min(var, na.rm = TRUE))/30))))) + min(var, 
#>     na.rm = TRUE)
db_flights %>%
  group_by(x = !! db_bin(arr_delay)) %>%
  tally()
#> # Source:   lazy query [?? x 2]
#> # Database: sqlite 3.29.0 [:memory:]
#>         x      n
#>     <dbl>  <int>
#>  1  NA      9430
#>  2 -86      5325
#>  3 -40.7  207999
#>  4   4.53  79784
#>  5  49.8   19063
#>  6  95.1    7890
#>  7 140.     3746
#>  8 186.     1742
#>  9 231.      921
#> 10 276.      425
#> # … with more rows
db_flights %>%
  filter(!is.na(arr_delay)) %>%
  group_by(x = !! db_bin(arr_delay)) %>%
  tally()%>%
  collect %>%
  ggplot() +
  geom_col(aes(x, n))

dbDisconnect(con)

dbplot's People

Contributors

dominickg avatar edgararuiz avatar edgararuiz-zz avatar lionel- avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar

dbplot's Issues

Support color aesthetic for boxplots

I'd like to produce an image like the one below, but if I am not mistaken, currently a color aesthetic is not supported in dbplot_boxplot() as well as db_compute_boxplot(). Would you consider adding this feature? Alternatively, if the argument x in db_compute_boxplot() could take multiple grouping variables, that would already be an improvement. Thanks for maintaining this package.
boxplots

How to color bars in histogram?

In the "Calculation functions" section you show a histogram with a scale on the left and a color legend on the right. This color legend is not needed and is not conventional with histograms (IMHO). The color legend doesn't add any information that's not in the scale at the left.

But how can I change the bars in the histogram to some constant color? Why doesn't something like fill="red" work?

Rstudio webinar - use with SQLite

@edgararuiz I very much enjoyed your webinar on "Best practices for working with databases". Using SQL server isn't that convenient for many of our students, however, so I wanted to adapt your databases-and-R.Rmd file to work with SQLite.

Unfortunately lots of errors popped up with dbplot and, for example, the sampling example. I just wanted to check with you if that is expected. I know that SQLite is more limited as a database so perhaps this is expected.

db_flights %>%
  filter(!is.na(arrdelay)) %>%
  dbplot_histogram(arrdelay)
Error in new_result(connection@ptr, statement) : nanodbc/nanodbc.cpp:1344: HY000: [SQLite]near "OVER": syntax error (1)
29.
stop(structure(list(message = "nanodbc/nanodbc.cpp:1344: HY000: [SQLite]near \"OVER\": syntax error (1) ", call = new_result(connection@ptr, statement), cppstack = NULL), class = c("nanodbc::database_error", "C++Error", "error", "condition")))
28.
new_result(connection@ptr, statement)
27.
OdbcResult(connection = conn, statement = statement)
26.
dbSendQuery(con, sql)
25.
dbSendQuery(con, sql)
24.
db_collect.DBIConnection(x$src$con, sql, n = n, warn_incomplete = warn_incomplete)
23.
db_collect(x$src$con, sql, n = n, warn_incomplete = warn_incomplete)
29. | stop(structure(list(message = "nanodbc/nanodbc.cpp:1344: HY000: [SQLite]near \"OVER\": syntax error (1) ", call = new_result(connection@ptr, statement), cppstack = NULL), class = c("nanodbc::database_error", "C++Error", "error", "condition")))
-- | --
...
flights_sample <- db_flights %>%
  filter(cancelled == 0, year == 2006) %>%
  arrange(dayofmonth) %>%
  mutate(row = row_number()) %>%
  head() %>%
  collect()
Error in new_result(connection@ptr, statement) : nanodbc/nanodbc.cpp:1344: HY000: [SQLite]near "(": syntax error (1)
17.
stop(structure(list(message = "nanodbc/nanodbc.cpp:1344: HY000: [SQLite]near \"(\": syntax error (1) ", call = new_result(connection@ptr, statement), cppstack = NULL), class = c("nanodbc::database_error", "C++Error", "error", "condition")))
16.
new_result(connection@ptr, statement)
15.
OdbcResult(connection = conn, statement = statement)
14.
dbSendQuery(con, sql)
13.
dbSendQuery(con, sql)
12.
db_collect.DBIConnection(x$src$con, sql, n = n, warn_incomplete = warn_incomplete)
...

Using lower levels functions in higher level functions

Hi @edgararuiz

Very nice package, very clever. I'm waiting the opportunity to use it.

Why not use your lower levels functions in your higher level functions, so the maintenance is a bit easier due you don't repeat code. So, as example:

dbplot_histogram <- function(data, x, bins = 30, binwidth = NULL){

  db_compute_bins (data, x, bins = bins, binwidth = binwidth)  %>%
    ggplot() +
    geom_col(aes(x, n)) +
    labs(x = x,
         y = "count")
}

Thanks again for your package.

unused argument (na.rm = TRUE)

When used via dbplyr, anything that uses the db_bin() function will return the following error message: unused argument (na.rm = TRUE)

This is due to using dbplyr development version when developing. In order to use the functions in this package, there is no need to connect via dbplyr, so there was no dependency noted. In real-life, this package would not be used unless in combination with dbplyr , so I must note its dependency and make the appropriate change.

object of type 'symbol' is not subsettable

User reported, they started from a clean Windows machine installing everything from CRAN, this was using sparklyr 1.0.0, so maybe related but don't remember

library(dbplot)
sc <- sparklyr::spark_connect(master = "local")
cars <- copt_to(sc, mtcars)
cars %>%
  dbplot::dbplot_histogram(mpg, binwidth = 3)
## Error in call[[1]] : object of type 'symbol' is not subsettable

Could not reproduce with sparklyr 1.0.0 and neither dbplot 0.3.0... I worry there is something broken that we are not being able to reproduce.

Improve default naming

Big Query does not like the default n() or any other default name for a calculation, such as sum(x), need to create a sensible default, just like dbplot_histogram() uses the name count as the default:

con <- DBI::dbConnect(
  dbi_driver(),
  project = "bigquery-public-data",
  dataset = "nlm_rxnorm",
  billing = "",
  use_legacy_sql = FALSE
)

tbl(con, "rxn_all_pathways_current") %>%
  filter(
    SOURCE_TTY == "BN",
    TARGET_TTY == "IN"
  ) %>%
  group_by(SOURCE_NAME) %>%
  tally() %>%
  dbplot_bar(n)

dbplot_histogram resulting in empty plots + proper histograms

I enjoy using this library, but recently the histograms were not working as should. There were no bars for numeric values, only an empty graph. For integers there was no problem...

Solution: change line 92 in https://github.com/edgararuiz/dbplot/blob/master/R/histogram.R

92 (old) geom_col(aes(x, count)) +
92 (new) geom_col(aes(x, count), orientation="x", width = binwidth) +

This works as a charm when binwidths are known and it produces proper histograms at the same time, since there should be no gap between bars in a histogram!
Although this does not work when the binwidth is not set, maybe the binwidth calculation from (https://github.com/edgararuiz/dbplot/blob/master/R/dbbin.R) could be included somehow?

Thanks in advance!

Error en cargar el paquete dbplot

Cordial saludo, de forma atenta me comunico porque estoy estudiando temas de big data con R studio+Spark y presento problemas para instalar el paquete "dbplot", es posible puedas ayudarme con este error. Por su amable atención y pronta respuesta, gracias

Job cancelled because SparkContext was shut down

Thank you very much for developing this package, it looks very interesting.

I can reproduce the examples in the documentation, however when I try the function on real data I get the following spark error:

Error: org.apache.spark.SparkException: Job 1 cancelled because SparkContext was shut down

The table I am trying to plot is quite big - but even after trying head %>% compute or head %>% coalesce, the error remain.

Do you have any suggestion on how to debug this, or which parameters could be adjusted?

Thanks

Add support for MSSQL connections

This functionality would be great with MS SQL Server as well. I realized that the newly released quantile functions could be leveraged to standardize this computation. I more or less copied the existing funciton into what I think could be the backend-agnostic version. Note that MS SQL requires quantile functions to be used with a mutate rather than summarise, hence the mutate %>% select %>% distinct chain.

compute_boxplot = function(x, var, coef = 1.5) {
  
  var = enquo(var)
  
  x %>%
    mutate(
      lower   = quantile(!!var, .25),
      middle  = quantile(!!var, .5),
      upper   = quantile(!!var, .75),
      min_raw = min(!!var, na.rm = T),
      max_raw = max(!!var, na.rm = T),
      iqr     = (upper - lower) * coef,
      min_iqr = lower - iqr,
      max_iqr = upper + iqr,
      ymax    = ifelse(max_raw > max_iqr, max_iqr, max_raw),
      ymin    = ifelse(min_raw < min_iqr, min_iqr, min_raw)) %>%
    select(ymin, lower, middle, upper, ymax) %>% 
    distinct()
}

Furthermore, if the grouping variable isn't hardcoded into this function, any multi-variable grouping can be applied outside of this call.

Add support for `int64` results

> tbl(con, in_schema("datawarehouse", "flight")) %>%
+     group_by(month) %>%
+     tally() %>%
+     collect() 
# A tibble: 12 x 2
   month               n
   <dbl> <S3: integer64>
 1    10          556205
 2     4          598126
 3    12          544958
 4     5          606293
 5     2          569236
 6     7          627931

Add possibility to facet plots

Great package! :)

It would be nice to be able to facet plots computed in the database, e.g. via facet_wrap() or on a grouped table. Or is there a straightforward approach already that I am missing?

db_compute_boxplot in Hive is using percentile instead of percentile approx

I'm using dbplot v0.3.3 with hive (version Hive 1.2.1000.2.6.5.0-292).

db_compute_boxplot is using percentile rather than percentile_approx, this limits the function to integers only.

Here is some sample code

    library(DBI)
    library(odbc)
    library(dplyr)
    library(dbplyr)
    library(ggplot2)
    library(dbplot)
con <- dbConnect(odbc(), "QoE", user = "foo", 
    pwd = "bar", bigint = "numeric")

Write Iris to Hive table removing '.' from col names

foo<-iris
colnames(foo)<-gsub(".", "_", colnames(foo), fixed = TRUE)
dbWriteTable(con,"iris",foo,overwrite=TRUE)

Run db_compute_boxplot - throws data type error for UDAF Percentile

db_iris<-tbl(con,in_schema("default","iris"))
db_iris %>% db_compute_boxplot(species,sepal_length)
Error in new_result(connection@ptr, statement) : nanodbc/nanodbc.cpp:1344: HY000: [Simba][Hardy] (80) Syntax or semantic analysis error thrown in server while executing query. Error message from server: 
Error while compiling statement: 
FAILED: NoMatchingMethodException 
No matching method for class org.apache.hadoop.hive.ql.udf.UDAFPercentile with (double, double). Possible choices: _FUNC_(bigint, array<double>) _FUNC_(bigint, double)

Run again casting Sepal_Length to Integer

db_iris %>% db_compute_boxplot(species,as.integer(sepal_length)) %>% 
  ggplot()+geom_boxplot(aes(x=species,
                            middle=middle,
                            lower=lower,
                            upper=upper,
                            ymin=ymin,
                            ymax=ymax,
                            color=species),stat='identity')

image

db_compute_raster dropping empty bins

First of all thank you for this nice (and in my case, life-saving) package.

I have a dataset of records (sightings) with their latitude and longitude, and I am using the function db_compute_raster(lon, lat, fill = n(), resolution = 500) to calculate density over a spatial grid. I am using this package because of its seamless integration with sparklyr.

The computation is done correctly but the output drops the bins - i.e. the lat-lon pairs - where the count was zero. This means that I will have empty areas when visualising with ggplot geom_raster. Is there the possibility of adding the choice of not dropping them?

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.