Coder Social home page Coder Social logo

dbplot's Issues

Support color aesthetic for boxplots

I'd like to produce an image like the one below, but if I am not mistaken, currently a color aesthetic is not supported in dbplot_boxplot() as well as db_compute_boxplot(). Would you consider adding this feature? Alternatively, if the argument x in db_compute_boxplot() could take multiple grouping variables, that would already be an improvement. Thanks for maintaining this package.
boxplots

object of type 'symbol' is not subsettable

User reported, they started from a clean Windows machine installing everything from CRAN, this was using sparklyr 1.0.0, so maybe related but don't remember

library(dbplot)
sc <- sparklyr::spark_connect(master = "local")
cars <- copt_to(sc, mtcars)
cars %>%
  dbplot::dbplot_histogram(mpg, binwidth = 3)
## Error in call[[1]] : object of type 'symbol' is not subsettable

Could not reproduce with sparklyr 1.0.0 and neither dbplot 0.3.0... I worry there is something broken that we are not being able to reproduce.

Job cancelled because SparkContext was shut down

Thank you very much for developing this package, it looks very interesting.

I can reproduce the examples in the documentation, however when I try the function on real data I get the following spark error:

Error: org.apache.spark.SparkException: Job 1 cancelled because SparkContext was shut down

The table I am trying to plot is quite big - but even after trying head %>% compute or head %>% coalesce, the error remain.

Do you have any suggestion on how to debug this, or which parameters could be adjusted?

Thanks

How to color bars in histogram?

In the "Calculation functions" section you show a histogram with a scale on the left and a color legend on the right. This color legend is not needed and is not conventional with histograms (IMHO). The color legend doesn't add any information that's not in the scale at the left.

But how can I change the bars in the histogram to some constant color? Why doesn't something like fill="red" work?

unused argument (na.rm = TRUE)

When used via dbplyr, anything that uses the db_bin() function will return the following error message: unused argument (na.rm = TRUE)

This is due to using dbplyr development version when developing. In order to use the functions in this package, there is no need to connect via dbplyr, so there was no dependency noted. In real-life, this package would not be used unless in combination with dbplyr , so I must note its dependency and make the appropriate change.

Using lower levels functions in higher level functions

Hi @edgararuiz

Very nice package, very clever. I'm waiting the opportunity to use it.

Why not use your lower levels functions in your higher level functions, so the maintenance is a bit easier due you don't repeat code. So, as example:

dbplot_histogram <- function(data, x, bins = 30, binwidth = NULL){

  db_compute_bins (data, x, bins = bins, binwidth = binwidth)  %>%
    ggplot() +
    geom_col(aes(x, n)) +
    labs(x = x,
         y = "count")
}

Thanks again for your package.

Add possibility to facet plots

Great package! :)

It would be nice to be able to facet plots computed in the database, e.g. via facet_wrap() or on a grouped table. Or is there a straightforward approach already that I am missing?

Improve default naming

Big Query does not like the default n() or any other default name for a calculation, such as sum(x), need to create a sensible default, just like dbplot_histogram() uses the name count as the default:

con <- DBI::dbConnect(
  dbi_driver(),
  project = "bigquery-public-data",
  dataset = "nlm_rxnorm",
  billing = "",
  use_legacy_sql = FALSE
)

tbl(con, "rxn_all_pathways_current") %>%
  filter(
    SOURCE_TTY == "BN",
    TARGET_TTY == "IN"
  ) %>%
  group_by(SOURCE_NAME) %>%
  tally() %>%
  dbplot_bar(n)

Add support for MSSQL connections

This functionality would be great with MS SQL Server as well. I realized that the newly released quantile functions could be leveraged to standardize this computation. I more or less copied the existing funciton into what I think could be the backend-agnostic version. Note that MS SQL requires quantile functions to be used with a mutate rather than summarise, hence the mutate %>% select %>% distinct chain.

compute_boxplot = function(x, var, coef = 1.5) {
  
  var = enquo(var)
  
  x %>%
    mutate(
      lower   = quantile(!!var, .25),
      middle  = quantile(!!var, .5),
      upper   = quantile(!!var, .75),
      min_raw = min(!!var, na.rm = T),
      max_raw = max(!!var, na.rm = T),
      iqr     = (upper - lower) * coef,
      min_iqr = lower - iqr,
      max_iqr = upper + iqr,
      ymax    = ifelse(max_raw > max_iqr, max_iqr, max_raw),
      ymin    = ifelse(min_raw < min_iqr, min_iqr, min_raw)) %>%
    select(ymin, lower, middle, upper, ymax) %>% 
    distinct()
}

Furthermore, if the grouping variable isn't hardcoded into this function, any multi-variable grouping can be applied outside of this call.

db_compute_raster dropping empty bins

First of all thank you for this nice (and in my case, life-saving) package.

I have a dataset of records (sightings) with their latitude and longitude, and I am using the function db_compute_raster(lon, lat, fill = n(), resolution = 500) to calculate density over a spatial grid. I am using this package because of its seamless integration with sparklyr.

The computation is done correctly but the output drops the bins - i.e. the lat-lon pairs - where the count was zero. This means that I will have empty areas when visualising with ggplot geom_raster. Is there the possibility of adding the choice of not dropping them?

Rstudio webinar - use with SQLite

@edgararuiz I very much enjoyed your webinar on "Best practices for working with databases". Using SQL server isn't that convenient for many of our students, however, so I wanted to adapt your databases-and-R.Rmd file to work with SQLite.

Unfortunately lots of errors popped up with dbplot and, for example, the sampling example. I just wanted to check with you if that is expected. I know that SQLite is more limited as a database so perhaps this is expected.

db_flights %>%
  filter(!is.na(arrdelay)) %>%
  dbplot_histogram(arrdelay)
Error in new_result(connection@ptr, statement) : nanodbc/nanodbc.cpp:1344: HY000: [SQLite]near "OVER": syntax error (1)
29.
stop(structure(list(message = "nanodbc/nanodbc.cpp:1344: HY000: [SQLite]near \"OVER\": syntax error (1) ", call = new_result(connection@ptr, statement), cppstack = NULL), class = c("nanodbc::database_error", "C++Error", "error", "condition")))
28.
new_result(connection@ptr, statement)
27.
OdbcResult(connection = conn, statement = statement)
26.
dbSendQuery(con, sql)
25.
dbSendQuery(con, sql)
24.
db_collect.DBIConnection(x$src$con, sql, n = n, warn_incomplete = warn_incomplete)
23.
db_collect(x$src$con, sql, n = n, warn_incomplete = warn_incomplete)
29. | stop(structure(list(message = "nanodbc/nanodbc.cpp:1344: HY000: [SQLite]near \"OVER\": syntax error (1) ", call = new_result(connection@ptr, statement), cppstack = NULL), class = c("nanodbc::database_error", "C++Error", "error", "condition")))
-- | --
...
flights_sample <- db_flights %>%
  filter(cancelled == 0, year == 2006) %>%
  arrange(dayofmonth) %>%
  mutate(row = row_number()) %>%
  head() %>%
  collect()
Error in new_result(connection@ptr, statement) : nanodbc/nanodbc.cpp:1344: HY000: [SQLite]near "(": syntax error (1)
17.
stop(structure(list(message = "nanodbc/nanodbc.cpp:1344: HY000: [SQLite]near \"(\": syntax error (1) ", call = new_result(connection@ptr, statement), cppstack = NULL), class = c("nanodbc::database_error", "C++Error", "error", "condition")))
16.
new_result(connection@ptr, statement)
15.
OdbcResult(connection = conn, statement = statement)
14.
dbSendQuery(con, sql)
13.
dbSendQuery(con, sql)
12.
db_collect.DBIConnection(x$src$con, sql, n = n, warn_incomplete = warn_incomplete)
...

Add support for `int64` results

> tbl(con, in_schema("datawarehouse", "flight")) %>%
+     group_by(month) %>%
+     tally() %>%
+     collect() 
# A tibble: 12 x 2
   month               n
   <dbl> <S3: integer64>
 1    10          556205
 2     4          598126
 3    12          544958
 4     5          606293
 5     2          569236
 6     7          627931

db_compute_boxplot in Hive is using percentile instead of percentile approx

I'm using dbplot v0.3.3 with hive (version Hive 1.2.1000.2.6.5.0-292).

db_compute_boxplot is using percentile rather than percentile_approx, this limits the function to integers only.

Here is some sample code

    library(DBI)
    library(odbc)
    library(dplyr)
    library(dbplyr)
    library(ggplot2)
    library(dbplot)
con <- dbConnect(odbc(), "QoE", user = "foo", 
    pwd = "bar", bigint = "numeric")

Write Iris to Hive table removing '.' from col names

foo<-iris
colnames(foo)<-gsub(".", "_", colnames(foo), fixed = TRUE)
dbWriteTable(con,"iris",foo,overwrite=TRUE)

Run db_compute_boxplot - throws data type error for UDAF Percentile

db_iris<-tbl(con,in_schema("default","iris"))
db_iris %>% db_compute_boxplot(species,sepal_length)
Error in new_result(connection@ptr, statement) : nanodbc/nanodbc.cpp:1344: HY000: [Simba][Hardy] (80) Syntax or semantic analysis error thrown in server while executing query. Error message from server: 
Error while compiling statement: 
FAILED: NoMatchingMethodException 
No matching method for class org.apache.hadoop.hive.ql.udf.UDAFPercentile with (double, double). Possible choices: _FUNC_(bigint, array<double>) _FUNC_(bigint, double)

Run again casting Sepal_Length to Integer

db_iris %>% db_compute_boxplot(species,as.integer(sepal_length)) %>% 
  ggplot()+geom_boxplot(aes(x=species,
                            middle=middle,
                            lower=lower,
                            upper=upper,
                            ymin=ymin,
                            ymax=ymax,
                            color=species),stat='identity')

image

Error en cargar el paquete dbplot

Cordial saludo, de forma atenta me comunico porque estoy estudiando temas de big data con R studio+Spark y presento problemas para instalar el paquete "dbplot", es posible puedas ayudarme con este error. Por su amable atención y pronta respuesta, gracias

dbplot_histogram resulting in empty plots + proper histograms

I enjoy using this library, but recently the histograms were not working as should. There were no bars for numeric values, only an empty graph. For integers there was no problem...

Solution: change line 92 in https://github.com/edgararuiz/dbplot/blob/master/R/histogram.R

92 (old) geom_col(aes(x, count)) +
92 (new) geom_col(aes(x, count), orientation="x", width = binwidth) +

This works as a charm when binwidths are known and it produces proper histograms at the same time, since there should be no gap between bars in a histogram!
Although this does not work when the binwidth is not set, maybe the binwidth calculation from (https://github.com/edgararuiz/dbplot/blob/master/R/dbbin.R) could be included somehow?

Thanks in advance!

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.