crplyr's Introduction

crplyr: A 'dplyr' Interface for Crunch

dplyr defines "a grammar of data manipulation" that is popular among R users. To facilitate analysis of datasets hosted by Crunch, this package implements dplyr methods on top of the Crunch backend. The usual methods "select", "filter", "group_by", "summarize", and "collect" are implemented so that as much computation as possible runs on the server and as little data as possible is pulled locally.

With a local data.frame, you might chain together a series of manipulations and create a table, such as:

> library(dplyr)
> data(mtcars)
> mtcars %>%
    filter(vs == 1) %>%
    group_by(gear) %>%
    summarize(horses=mean(hp), sd_horses=sd(hp), count=n())

## # A tibble: 3 × 4
##    gear horses sd_horses count
##   <dbl>  <dbl>     <dbl> <int>
## 1     3  104.0  6.557439     3
## 2     4   85.4 26.596575    10
## 3     5  113.0        NA     1

With crplyr, you can do the same operations, except that the dataset you're working with sits in the Crunch platform, and Crunch is doing the aggregations in the cloud:

> library(crplyr)
[crunch] > mtcars <- loadDataset("mtcars from R")
[crunch] > mtcars %>%
    filter(vs == 1) %>%
    group_by(gear) %>%
    summarize(horses=mean(hp), sd_horses=sd(hp), count=n())

## # A tibble: 3 × 4
##    gear horses sd_horses count
##  <fctr>  <dbl>     <dbl> <dbl>
## 1     3  104.0  6.557439     3
## 2     4   85.4 26.596575    10
## 3     5  113.0        NA     1

Running the calculations remotely matters little for a tiny dataset like "mtcars", but Crunch lets you work with datasets too large to fit in memory on your machine, and it lets you collaborate naturally with others on the same dataset.

Installing

Install the CRAN release of crplyr with

install.packages("crplyr")

The pre-release version of the package can be pulled from GitHub using the remotes package:

# install.packages("remotes")
remotes::install_github("Crunch-io/crplyr")

For developers

The repository includes a Makefile to facilitate some common tasks, if you're into that sort of thing.

Running tests

$ make test. Requires the httptest package. You can also run a specific test file or files by adding a "file=" argument, like $ make test file=select. test_package will do a regular-expression pattern match within the file names; see its documentation in the testthat package.

Updating documentation

$ make doc. Requires the roxygen2 package.

crplyr's People

Contributors

gergness, gshotwell, jonkeane, malecki, nealrichardson, romainfrancois


crplyr's Issues

Issue from CRAN

Suggested packages should be used conditionally: see §1.1.3.1 of
'Writing R Extensions'. Some of the requirements of vdiffr are hard to
install on a platform without X11 such as M1 Macs: see the logs at
https://www.stats.ox.ac.uk/pub/bdr/M1mac/.

In some cases there are other suggested packages not used conditionally:
you can check all of them by setting environment variable
R_CHECK_DEPENDS_ONLY=true -- see
https://cran.r-project.org/doc/manuals/r-devel/R-ints.html#Tools .

Please correct ASAP and before 2021-01-12 to safely retain the package
on CRAN.
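
One common way to meet this kind of request is to guard each use of a Suggested package with requireNamespace(). The sketch below is only an illustration of that pattern, not the fix that crplyr shipped; run_if_available() is a hypothetical helper name, and "notARealPackage123" is a made-up package name used to show the skipped branch.

```r
# Hypothetical helper: evaluate `code` only if package `pkg` is installed.
# requireNamespace() is base R; it returns TRUE/FALSE without attaching.
run_if_available <- function(pkg, code) {
  if (requireNamespace(pkg, quietly = TRUE)) {
    # `code` is a lazily evaluated argument, so it only runs here.
    force(code)
  } else {
    message("Skipping: package '", pkg, "' is not installed")
    invisible(NULL)
  }
}

# 'stats' ships with R, so this branch always runs.
result <- run_if_available("stats", stats::median(c(1, 3, 5)))

# A made-up package name: the code argument is never evaluated.
skipped <- run_if_available("notARealPackage123", stop("never evaluated"))
```

Inside testthat test files, testthat::skip_if_not_installed("vdiffr") serves the same purpose for a whole test.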

better dplyr compatibility for `tbl_crunch_cube`

783af14 removes some workarounds that allowed bind_cols() to work on tbl_crunch_cube because they aren't available in the current CRAN version of vctrs. We could eventually add them back in.

This work wasn't complete, though: attributes were still lost after bind_cols(). They aren't lost in sf, so there must be a way to do even better.

In dplyr 1.0:

library(crplyr)

ds <- loadDataset("temp cola fork")

cube <- ds %>% 
  group_by(d1) %>%
  summarize(d3 = mean(d3))

bind_cols(cube, tibble(x = seq_len(nrow(cube))))
#>  Error: No common type for `..1` <tbl_crunch_cube<>> and `..2` <tbl_df<>>.
library(sf)
nc <- read_sf(system.file("shape/nc.shp", package="sf"))
bind_cols(nc, tibble(x = seq_len(nrow(nc))))
# works fine

CRAN warning

Version: 0.3.9 
Check: S3 generic/method consistency 
Result: WARN 
    autoplot:
     function(object, ...)
    autoplot.NumericVariable:
     function(x, ...)
    
    autoplot:
     function(object, ...)
    autoplot.DatetimeVariable:
     function(x, ...)
    
    autoplot:
     function(object, ...)
    autoplot.CategoricalArrayVariable:
     function(x, ...)
    
    autoplot:
     function(object, ...)
    autoplot.CategoricalVariable:
     function(x, ...)
    
    autoplot:
     function(object, ...)
    autoplot.CrunchCubeCalculation:
     function(x, plot_type, ...)
    
    autoplot:
     function(object, ...)
    autoplot.tbl_crunch_cube:
     function(x, plot_type, measure)
    
    autoplot:
     function(object, ...)
    autoplot.CrunchCube:
     function(x, ...)
    
    autoplot:
     function(object, ...)
    autoplot.MultipleResponseVariable:
     function(x, ...)
    See section ‘Generic functions and methods’ in the ‘Writing R
    Extensions’ manual. 

https://cran.rstudio.com/web/checks/check_results_crplyr.html
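
The warning above is raised because each method names its first argument x while the generic names it object. A minimal sketch of a consistent method follows. The generic is redefined locally, with the signature shown in the check output, only so the example runs without extra packages; the method body and the toy object are assumptions for illustration.

```r
# Stand-in generic with the signature reported in the check output.
# (In the package the generic is imported; it is redefined here only
# so the sketch has no dependencies.)
autoplot <- function(object, ...) UseMethod("autoplot")

# Method whose arguments match the generic: the first argument is
# `object`, and `...` is kept so extra arguments are still accepted.
autoplot.NumericVariable <- function(object, ...) {
  paste("plotting a", class(object)[1])
}

# Toy object standing in for a Crunch variable.
v <- structure(list(), class = "NumericVariable")
out <- autoplot(v)

# The formal arguments of generic and method now agree.
generic_args <- names(formals(autoplot))
method_args <- names(formals(autoplot.NumericVariable))
```

Methods that take extra arguments, such as plot_type, would presumably keep them after object and would still need a trailing ... to match the generic.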

Forthcoming release of ggplot2 and crplyr

We are contacting you because you are the maintainer of crplyr, which imports ggplot2 and uses vdiffr to manage visual test cases. The upcoming release of ggplot2 includes several improvements to plot rendering, including the ability to specify lineend and linejoin in geom_rect() and geom_tile(), and improved rendering of text. These improvements will result in subtle changes to your vdiffr doppelgangers when the new version is released.

Because vdiffr test cases do not run on CRAN by default, your CRAN checks will still pass. However, we suggest updating your visual test cases with the new version of ggplot2 as soon as possible to avoid confusion. You can install the development version of ggplot2 using remotes::install_github("tidyverse/ggplot2").

If you have any questions, let me know!

Plotting vignette failing on master

The plotting vignette is failing to build on master which is causing the cube calculations PR to fail. This seems to be coming from a mocking problem. When rendering the vignette I get a GET error looking for api/projects which isn't in the vignette mocks. I've tried the following:

  • Delete the vignette data and re-render it to generate fresh mocks
  • Ensure that I have the development version of httptest installed
  • Ensure that I have the CRAN version of crunch installed (1.214)

collect() gives error on hidden variables even if editor of dataset

Tried using ds %>% select(var, ...) %>% collect().
The documentation for collect() does not mention a workaround.

unlock(ds)
df <- ds %>%
  select('identity', 'gender', 'age', 'age4', 'race4', 'educ4', 'presvote16x', 'e14_presvote12', 'pid3', 'ideo3', 'region', 'votereg2', 'app_dtrmp') %>%
  collect() %>%
  mutate(
    race3 = recode_factor(race4, 'White' = 'White/Other', 'Other' = 'White/Other', 'Black' = 'Black', 'Hispanic' = 'Hispanic'),
    educ3 = recode_factor(educ4, 'HS or less' = 'HS or less', 'Some college' = 'Some college', 'College grad' = 'College degree', 'Postgrad' = 'College degree'),
    educ2 = recode_factor(educ3, 'HS or less' = 'No degree', 'Some college' = 'No degree', 'College degree' = 'College grad')
  ) %>%
  rename(presvote12 = e14_presvote12) %>%
  filter(complete.cases(.))

Ends up with this error:
Error: Unknown column identity

Add better error message when user calls unimplemented functions like `mutate` `rename` etc.

This stems from a discussion about using as.data.frame(..., force=TRUE).

The use case here is to do external weighting. I need to be able to manipulate a data.frame object in order to use a raking script on the dataset. I don't need or want to create variables in the actual client-facing dataset. If I use as.data.frame(..., force = TRUE) I get a data.frame object that I can manipulate. If I use as.data.frame(..., force = FALSE) I cannot manipulate the data.frame to do common recodings.

Yet, I have been told to use crplyr with as.data.frame(..., force = FALSE) to get the same functionality. That doesn't appear to be the case.

Should we expect as.data.frame(..., force=FALSE) to have the same level of functionality as force=TRUE?

dt <- as.data.frame(ds[c("identity", "gender", "age", "age4", "race4", "educ4", "presvote16x", "e14_presvote12", "pid3", "ideo3", "region", "votereg2", "app_dtrmp")], include.hidden = TRUE, force = FALSE) %>%
  mutate(
    race3 = recode_factor(race4, 'White' = 'White/Other', 'Other' = 'White/Other', 'Black' = 'Black', 'Hispanic' = 'Hispanic'),
    educ3 = recode_factor(educ4, 'HS or less' = 'HS or less', 'Some college' = 'Some college', 'College grad' = 'College degree', 'Postgrad' = 'College degree'),
    educ2 = recode_factor(educ3, 'HS or less' = 'No degree', 'Some college' = 'No degree', 'College degree' = 'College grad'))

Produces the following error:

Error in UseMethod("mutate_") :
no applicable method for 'mutate_' applied to an object of class "CrunchDataFrame"
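
The improvement requested in this issue's title could take the form of an S3 method that fails with guidance instead of the generic "no applicable method" message. The sketch below is an assumption about one possible shape, not crplyr's implementation. mutate_sketch() is a hypothetical stand-in generic so the example runs without dplyr or a Crunch connection, and the wording of the message is invented for illustration.

```r
# Stand-in generic so the sketch runs without dplyr installed.
# (Hypothetical name; a real fix would register a method on the real verb.)
mutate_sketch <- function(.data, ...) UseMethod("mutate_sketch")

# Method for the Crunch class that stops with an informative message.
mutate_sketch.CrunchDataFrame <- function(.data, ...) {
  stop(
    "mutate() is not implemented for CrunchDataFrame. ",
    "Convert to a local data.frame first, e.g. with ",
    "as.data.frame(..., force = TRUE).",
    call. = FALSE
  )
}

# Toy object standing in for a CrunchDataFrame.
cdf <- structure(list(), class = "CrunchDataFrame")

# Capture the error message to show what a user would see.
msg <- tryCatch(mutate_sketch(cdf), error = function(e) conditionMessage(e))
```

With a method like this, the user in the report above would be pointed toward force = TRUE rather than left to decode a dispatch failure.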
