edgararuiz-zz / dbplot Goto Github PK

Simplifies plotting of database and sparklyr data

Home Page: https://edgararuiz.github.io/dbplot/

R 100.00%

visualization r ggplot2 databases sparklyr dbplot rlang

dbplot's Issues

Support color aesthetic for boxplots

I'd like to produce an image like the one below, but if I am not mistaken, currently a color aesthetic is not supported in dbplot_boxplot() as well as db_compute_boxplot(). Would you consider adding this feature? Alternatively, if the argument x in db_compute_boxplot() could take multiple grouping variables, that would already be an improvement. Thanks for maintaining this package.

object of type 'symbol' is not subsettable

User reported, they started from a clean Windows machine installing everything from CRAN, this was using sparklyr 1.0.0, so maybe related but don't remember

library(dbplot)
sc <- sparklyr::spark_connect(master = "local")
cars <- copt_to(sc, mtcars)
cars %>%
  dbplot::dbplot_histogram(mpg, binwidth = 3)

## Error in call[[1]] : object of type 'symbol' is not subsettable

Could not reproduce with sparklyr 1.0.0 and neither dbplot 0.3.0... I worry there is something broken that we are not being able to reproduce.

Job cancelled because SparkContext was shut down

Thank you very much for developing this package, it looks very interesting.

I can reproduce the examples in the documentation, however when I try the function on real data I get the following spark error:

Error: org.apache.spark.SparkException: Job 1 cancelled because SparkContext was shut down

The table I am trying to plot is quite big - but even after trying head %>% compute or head %>% coalesce, the error remain.

Do you have any suggestion on how to debug this, or which parameters could be adjusted?

Thanks

How to color bars in histogram?

In the "Calculation functions" section you show a histogram with a scale on the left and a color legend on the right. This color legend is not needed and is not conventional with histograms (IMHO). The color legend doesn't add any information that's not in the scale at the left.

But how can I change the bars in the histogram to some constant color? Why doesn't something like fill="red" work?

Add support for density plots

unused argument (na.rm = TRUE)

When used via dbplyr, anything that uses the db_bin() function will return the following error message: unused argument (na.rm = TRUE)

This is due to using dbplyr development version when developing. In order to use the functions in this package, there is no need to connect via dbplyr, so there was no dependency noted. In real-life, this package would not be used unless in combination with dbplyr , so I must note its dependency and make the appropriate change.

Using lower levels functions in higher level functions

Hi @edgararuiz

Very nice package, very clever. I'm waiting the opportunity to use it.

Why not use your lower levels functions in your higher level functions, so the maintenance is a bit easier due you don't repeat code. So, as example:

dbplot_histogram <- function(data, x, bins = 30, binwidth = NULL){

  db_compute_bins (data, x, bins = bins, binwidth = binwidth)  %>%
    ggplot() +
    geom_col(aes(x, n)) +
    labs(x = x,
         y = "count")
}

Thanks again for your package.

Add possibility to facet plots

Great package! :)

It would be nice to be able to facet plots computed in the database, e.g. via facet_wrap() or on a grouped table. Or is there a straightforward approach already that I am missing?

Improve default naming

Big Query does not like the default n() or any other default name for a calculation, such as sum(x), need to create a sensible default, just like dbplot_histogram() uses the name count as the default:

con <- DBI::dbConnect(
  dbi_driver(),
  project = "bigquery-public-data",
  dataset = "nlm_rxnorm",
  billing = "",
  use_legacy_sql = FALSE
)

tbl(con, "rxn_all_pathways_current") %>%
  filter(
    SOURCE_TTY == "BN",
    TARGET_TTY == "IN"
  ) %>%
  group_by(SOURCE_NAME) %>%
  tally() %>%
  dbplot_bar(n)

Adding legend to boxplot

It is not possible to add a legend to a boxplot with dbplot.
https://stackoverflow.com/questions/56008148/dont-know-how-to-add-o-to-a-plot-with-r-package-dbplot

Add support for MSSQL connections

This functionality would be great with MS SQL Server as well. I realized that the newly released quantile functions could be leveraged to standardize this computation. I more or less copied the existing funciton into what I think could be the backend-agnostic version. Note that MS SQL requires quantile functions to be used with a mutate rather than summarise, hence the mutate %>% select %>% distinct chain.

compute_boxplot = function(x, var, coef = 1.5) {
  
  var = enquo(var)
  
  x %>%
    mutate(
      lower   = quantile(!!var, .25),
      middle  = quantile(!!var, .5),
      upper   = quantile(!!var, .75),
      min_raw = min(!!var, na.rm = T),
      max_raw = max(!!var, na.rm = T),
      iqr     = (upper - lower) * coef,
      min_iqr = lower - iqr,
      max_iqr = upper + iqr,
      ymax    = ifelse(max_raw > max_iqr, max_iqr, max_raw),
      ymin    = ifelse(min_raw < min_iqr, min_iqr, min_raw)) %>%
    select(ymin, lower, middle, upper, ymax) %>% 
    distinct()
}

Furthermore, if the grouping variable isn't hardcoded into this function, any multi-variable grouping can be applied outside of this call.

db_compute_raster dropping empty bins

First of all thank you for this nice (and in my case, life-saving) package.

I have a dataset of records (sightings) with their latitude and longitude, and I am using the function db_compute_raster(lon, lat, fill = n(), resolution = 500) to calculate density over a spatial grid. I am using this package because of its seamless integration with sparklyr.

The computation is done correctly but the output drops the bins - i.e. the lat-lon pairs - where the count was zero. This means that I will have empty areas when visualising with ggplot geom_raster. Is there the possibility of adding the choice of not dropping them?

Use of histogram_numeric function from Hive?

Thanks for the work on this project.
I've noticed that the calculation of the histogram bins is done using rlang and sparlyr/dbplyr verbs.

Have you considered using the histogram_numeric function from hive?

https://cwiki.apache.org/confluence/display/Hive/StatisticsAndDataMining

For some reasons, summarise(histrogram_numeric(column_name, nbins)) doesn't work. However it should be possible to encode this as a SQL statement directly.

Rstudio webinar - use with SQLite

@edgararuiz I very much enjoyed your webinar on "Best practices for working with databases". Using SQL server isn't that convenient for many of our students, however, so I wanted to adapt your databases-and-R.Rmd file to work with SQLite.

Unfortunately lots of errors popped up with dbplot and, for example, the sampling example. I just wanted to check with you if that is expected. I know that SQLite is more limited as a database so perhaps this is expected.

db_flights %>%
  filter(!is.na(arrdelay)) %>%
  dbplot_histogram(arrdelay)

Error in new_result(connection@ptr, statement) : nanodbc/nanodbc.cpp:1344: HY000: [SQLite]near "OVER": syntax error (1)
29.
stop(structure(list(message = "nanodbc/nanodbc.cpp:1344: HY000: [SQLite]near \"OVER\": syntax error (1) ", call = new_result(connection@ptr, statement), cppstack = NULL), class = c("nanodbc::database_error", "C++Error", "error", "condition")))
28.
new_result(connection@ptr, statement)
27.
OdbcResult(connection = conn, statement = statement)
26.
dbSendQuery(con, sql)
25.
dbSendQuery(con, sql)
24.
db_collect.DBIConnection(x$src$con, sql, n = n, warn_incomplete = warn_incomplete)
23.
db_collect(x$src$con, sql, n = n, warn_incomplete = warn_incomplete)
29. | stop(structure(list(message = "nanodbc/nanodbc.cpp:1344: HY000: [SQLite]near \"OVER\": syntax error (1) ", call = new_result(connection@ptr, statement), cppstack = NULL), class = c("nanodbc::database_error", "C++Error", "error", "condition")))
-- | --
...

flights_sample <- db_flights %>%
  filter(cancelled == 0, year == 2006) %>%
  arrange(dayofmonth) %>%
  mutate(row = row_number()) %>%
  head() %>%
  collect()

Error in new_result(connection@ptr, statement) : nanodbc/nanodbc.cpp:1344: HY000: [SQLite]near "(": syntax error (1)
17.
stop(structure(list(message = "nanodbc/nanodbc.cpp:1344: HY000: [SQLite]near \"(\": syntax error (1) ", call = new_result(connection@ptr, statement), cppstack = NULL), class = c("nanodbc::database_error", "C++Error", "error", "condition")))
16.
new_result(connection@ptr, statement)
15.
OdbcResult(connection = conn, statement = statement)
14.
dbSendQuery(con, sql)
13.
dbSendQuery(con, sql)
12.
db_collect.DBIConnection(x$src$con, sql, n = n, warn_incomplete = warn_incomplete)
...

add vignette example of how to use with a DBIConnection?

How can I use this after a call to make a JDBC connection with DBI::dbConnect()? I would love to have an example vignette that I can follow. Thanks!

Add support for `int64` results

> tbl(con, in_schema("datawarehouse", "flight")) %>%
+     group_by(month) %>%
+     tally() %>%
+     collect() 
# A tibble: 12 x 2
   month               n
   <dbl> <S3: integer64>
 1    10          556205
 2     4          598126
 3    12          544958
 4     5          606293
 5     2          569236
 6     7          627931

Add support for quantile plot

db_compute_boxplot in Hive is using percentile instead of percentile approx

I'm using dbplot v0.3.3 with hive (version Hive 1.2.1000.2.6.5.0-292).

db_compute_boxplot is using percentile rather than percentile_approx, this limits the function to integers only.

Here is some sample code

    library(DBI)
    library(odbc)
    library(dplyr)
    library(dbplyr)
    library(ggplot2)
    library(dbplot)

con <- dbConnect(odbc(), "QoE", user = "foo", 
    pwd = "bar", bigint = "numeric")

Write Iris to Hive table removing '.' from col names

foo<-iris
colnames(foo)<-gsub(".", "_", colnames(foo), fixed = TRUE)
dbWriteTable(con,"iris",foo,overwrite=TRUE)

Run db_compute_boxplot - throws data type error for UDAF Percentile

db_iris<-tbl(con,in_schema("default","iris"))
db_iris %>% db_compute_boxplot(species,sepal_length)

Error in new_result(connection@ptr, statement) : nanodbc/nanodbc.cpp:1344: HY000: [Simba][Hardy] (80) Syntax or semantic analysis error thrown in server while executing query. Error message from server: 
Error while compiling statement: 
FAILED: NoMatchingMethodException 
No matching method for class org.apache.hadoop.hive.ql.udf.UDAFPercentile with (double, double). Possible choices: _FUNC_(bigint, array<double>) _FUNC_(bigint, double)

Run again casting Sepal_Length to Integer

db_iris %>% db_compute_boxplot(species,as.integer(sepal_length)) %>% 
  ggplot()+geom_boxplot(aes(x=species,
                            middle=middle,
                            lower=lower,
                            upper=upper,
                            ymin=ymin,
                            ymax=ymax,
                            color=species),stat='identity')

Error en cargar el paquete dbplot

Cordial saludo, de forma atenta me comunico porque estoy estudiando temas de big data con R studio+Spark y presento problemas para instalar el paquete "dbplot", es posible puedas ayudarme con este error. Por su amable atención y pronta respuesta, gracias

dbplot_histogram resulting in empty plots + proper histograms

I enjoy using this library, but recently the histograms were not working as should. There were no bars for numeric values, only an empty graph. For integers there was no problem...

Solution: change line 92 in https://github.com/edgararuiz/dbplot/blob/master/R/histogram.R

92 (old) geom_col(aes(x, count)) +
92 (new) geom_col(aes(x, count), orientation="x", width = binwidth) +

This works as a charm when binwidths are known and it produces proper histograms at the same time, since there should be no gap between bars in a histogram!
Although this does not work when the binwidth is not set, maybe the binwidth calculation from (https://github.com/edgararuiz/dbplot/blob/master/R/dbbin.R) could be included somehow?

Thanks in advance!

URL and BugReports fields in DESCRIPTION

Cool package! I've noticed that your DESCRIPTION doesn't link to this repo and to the issue page, it'd be worth adding the URL and BugReports fields. :-)

edgararuiz-zz / dbplot Goto Github PK

dbplot's Issues

Recommend Projects

Recommend Topics

Recommend Org