rstudio / pointblank

Data quality assessment and metadata reporting for data frames and database tables

Home Page: https://rstudio.github.io/pointblank/

License: MIT

Topics: data-validation, database-tables, data-dictionaries, easy-to-understand, data-frames, reporting-tool, data-profiler, data-management, schema-validation, data-verification

pointblank's Introduction





With the pointblank package it’s really easy to methodically validate your data, whether it takes the form of a data frame or a database table. On top of the validation toolset, the package gives you the means to define, and keep up to date, the information that describes your tables.

For table validation, the agent object works with a large collection of simple (yet powerful!) validation functions. Much more sophisticated checks are possible through custom expressions, segmentation of the data, and selective mutations of the target table. The suite of validation functions works the same way whether your table is a data frame or a database table.

Sometimes, we want to maintain table information and update it when the table changes. For that, we can use an informant object plus associated functions to define the metadata entries and present them as a data dictionary. Just as with validation, pointblank offers easy ways to keep the metadata updated so that this important documentation doesn't become stale.


TABLE VALIDATIONS WITH AN AGENT AND DATA QUALITY REPORTING

Data validation can be carried out in a Data Quality Reporting workflow, ultimately resulting in the production of a data quality analysis report. This is most useful in a non-interactive mode where data quality for database tables and on-disk data files must be periodically checked. The pointblank agent is given a collection of validation functions that define validation steps. We can get extracts of the data rows that failed validation, set up custom functions that are invoked when failure thresholds are exceeded, and more. Want to email the report regularly (or only if certain conditions are met)? Yep, you can do all that.

Here is an example of how to use pointblank to validate a local table with an agent.

# Generate a simple `action_levels` object to
# set the `warn` state if a validation step
# has a single 'fail' test unit
al <- action_levels(warn_at = 1)

# Create a pointblank `agent` object, with the
# tibble as the target table. Use three validation
# functions, then, `interrogate()`. The agent will
# then have some useful intel.
agent <- 
  dplyr::tibble(
    a = c(5, 7, 6, 5, NA, 7),
    b = c(6, 1, 0, 6,  0, 7)
  ) %>%
  create_agent(
    label = "A very *simple* example.",
    actions = al
  ) %>%
  col_vals_between(
    vars(a), 1, 9,
    na_pass = TRUE
  ) %>%
  col_vals_lt(
    vars(c), 12,
    preconditions = ~ . %>% dplyr::mutate(c = a + b)
  ) %>%
  col_is_numeric(vars(a, b)) %>%
  interrogate()

The reporting’s pretty sweet. We can get a gt-based report by printing the agent.

The pointblank package is designed to be straightforward yet powerful. And fast! Local data frames don’t take long to validate extensively, and all validation checks on remote tables are done entirely in-database. So we can add dozens or even hundreds of validation steps without long waits for reporting.

Should you want to perform validation checks on database or Spark tables, provide a tbl_dbi or tbl_spark object to create_agent(). The pointblank package currently supports PostgreSQL, MySQL, MariaDB, Microsoft SQL Server, Google BigQuery, DuckDB, SQLite, and Spark DataFrames (through the sparklyr package).
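
Here is a sketch of what that looks like (the driver, credentials, and table name below are placeholders; any DBI-backed source works the same way):

# Connect to a database and hand a remote table to `create_agent()`
con <- 
  DBI::dbConnect(
    RPostgres::Postgres(),
    dbname = "analytics",
    host = "localhost",
    user = "user",
    password = "password"
  )

agent <- 
  dplyr::tbl(con, "revenue") %>%
  create_agent(label = "Remote table validation") %>%
  col_vals_not_null(vars(revenue)) %>%
  interrogate()

DBI::dbDisconnect(con)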

Here are some validation reports for the considerably larger intendo::intendo_revenue table.

postgres    mysql    duckdb


VALIDATIONS DIRECTLY ON DATA

The Pipeline Data Validation workflow uses the same collection of validation functions but without the need for an agent. This is useful in an ETL process where we want to periodically check data and trigger warnings, raise errors, or write out logs when specified failure thresholds are exceeded. It’s a cinch to perform checks on import of the data and at key points during the transformation process, perhaps stopping the data flow if data quality is unacceptable.

The following example uses the same three validation functions as before but, this time, we use them directly on the data. The validation functions act as a filter, passing data through unless execution is stopped by failing validations beyond the set threshold. In this workflow, by default, an error will occur if there is a single ‘fail’ test unit in any validation step:

dplyr::tibble(
    a = c(5, 7, 6, 5, NA, 7),
    b = c(6, 1, 0, 6,  0, 7)
  ) %>%
  col_vals_between(
    a, 1, 9,
    na_pass = TRUE
  ) %>%
  col_vals_lt(
    c, 12,
    preconditions = ~ . %>% dplyr::mutate(c = a + b)
  ) %>%
  col_is_numeric(c(a, b))
Error: Exceedance of failed test units where values in `c` should have been < `12`.
The `col_vals_lt()` validation failed beyond the absolute threshold level (1).
* failure level (2) >= failure threshold (1) 

We can downgrade this error to a warning with the warn_on_fail() helper function (assigning it to actions). In this way, the data will always be returned, but warnings will appear.

# The `warn_on_fail()` function is a nice
# shortcut for `action_levels(warn_at = 1)`;
# it works great in this data checking workflow
# (and the threshold can still be adjusted)
dplyr::tibble(
    a = c(5, 7, 6, 5, NA, 7),
    b = c(6, 1, 0, 6,  0, 7)
  ) %>%
  col_vals_between(
    a, 1, 9,
    na_pass = TRUE,
    actions = warn_on_fail()
  ) %>%
  col_vals_lt(
    c, 12,
    preconditions = ~ . %>% dplyr::mutate(c = a + b),
    actions = warn_on_fail()
  ) %>%
  col_is_numeric(
    c(a, b),
    actions = warn_on_fail()
  )
#> # A tibble: 6 x 2
#>       a     b
#>   <dbl> <dbl>
#> 1     5     6
#> 2     7     1
#> 3     6     0
#> 4     5     6
#> 5    NA     0
#> 6     7     7

Warning message:
Exceedance of failed test units where values in `c` should have been < `12`.
The `col_vals_lt()` validation failed beyond the absolute threshold level (1).
* failure level (2) >= failure threshold (1) 

Should you need more fine-grained thresholds and resultant actions, the action_levels() function can be used to specify multiple failure thresholds and side effects for each failure state. However, with warn_on_fail() and stop_on_fail() (applied by default, with stop_at = 1), you should have good enough options for this validation workflow.
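
For example, here is a sketch of graduated thresholds (the values are illustrative; fractional values are read as proportions of failing test units, integers as absolute counts):

al <- 
  action_levels(
    warn_at = 0.10,    # warn when 10% of a step's test units fail
    stop_at = 0.25,    # stop when 25% fail
    notify_at = 0.35   # notify when 35% fail
  )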


VALIDATIONS IN R MARKDOWN DOCUMENTS

Using pointblank in an R Markdown workflow is enabled by default once the pointblank library is loaded. The framework allows for validation testing within specialized validation code chunks where the validate = TRUE option is set. Using pointblank validation functions on data in these marked code chunks will flag overall failure if the stop threshold is exceeded anywhere. All errors are reported in the validation code chunk after rendering the document to HTML, where green or red status buttons indicate whether all validations succeeded or failures occurred. Click them to reveal the otherwise hidden validation statements and any associated error messages.
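
A minimal sketch of such a chunk, using pointblank's built-in small_table dataset (the column choices are arbitrary):

```{r, validate = TRUE}
small_table %>%
  col_vals_gt(vars(d), 100) %>%
  col_vals_in_set(vars(f), c("low", "mid", "high"))
```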

An R Markdown document demonstrating this workflow is available as a template in the RStudio IDE; it’s called Pointblank Validation.


TABLE INFORMATION

Table information can be synthesized in an information management workflow, giving us a snapshot of a data table we care to collect information on. The pointblank informant is fed a series of info_*() functions to define bits of information about a table. This info text can pertain to individual columns, the table as a whole, and whatever else makes sense for your organization. We can even glean little snippets of information (like column stats or sample values) from the target table with info_snippet() and the snip_*() functions and mix them into the data dictionary wherever they're needed.

Here is an example of how to use pointblank to incorporate pieces of info text into an informant object.

# Create a pointblank `informant` object, with the
# tibble as the target table. Use a few information
# functions and end with `incorporate()`. The informant
# will then show you information about the tibble.
informant <- 
  dplyr::tibble(
    a = c(5, 7, 6, 5, NA, 7),
    b = c(6, 1, 0, 6,  0, 7)
  ) %>%
  create_informant(
    label = "A very *simple* example.",
    tbl_name = "example_tbl"
  ) %>%
  info_tabular(
    description = "This two-column table is nothing all that
    interesting, but, it's fine for examples on **GitHub**
    `README` pages. Column names are `a` and `b`. ((Cool stuff))"
  ) %>%
  info_columns(
    columns = a,
    info = "This column has an `NA` value. [[Watch out!]]<<color: red;>>"
  ) %>%
  info_columns(
    columns = a,
    info = "Mean value is `{a_mean}`."
  ) %>%
  info_columns(
    columns = b,
    info = "Like column `a`. The lowest value is `{b_lowest}`."
  ) %>%
  info_columns(
    columns = b,
    info = "The highest value is `{b_highest}`."
  ) %>%
  info_snippet(
    snippet_name = "a_mean",
    fn = ~ . %>% .$a %>% mean(na.rm = TRUE) %>% round(2)
  ) %>%
  info_snippet(snippet_name = "b_lowest", fn = snip_lowest("b")) %>%
  info_snippet(snippet_name = "b_highest", fn = snip_highest("b")) %>%
  info_section(
    section_name = "further information", 
    `examples and documentation` = "Examples for how to use the
    `info_*()` functions (and many more) are available at the
    [**pointblank** site](https://rstudio.github.io/pointblank/)."
  ) %>%
  incorporate()

By printing the informant we get the table information report.

Here is a link to a hosted information report for the intendo::intendo_revenue table:

Information Report for intendo::intendo_revenue


TABLE SCANS

We can use the scan_data() function to generate a comprehensive summary of a tabular dataset. This allows us to quickly understand what's in the dataset and helps us determine if there are any peculiarities within the data. Scanning the dplyr::storms dataset with scan_data(tbl = dplyr::storms) gives us an interactive HTML report. Here are a few such reports, published in RPubs:

Table Scan of dplyr::storms

Table Scan of pointblank::game_revenue

Database tables can be used with scan_data() as well. Here are two examples using (1) the full_region table of the Rfam database (hosted publicly at mysql-rfam-public.ebi.ac.uk) and (2) the assembly table of the Ensembl database (hosted publicly at ensembldb.ensembl.org).

Rfam: full_region

Ensembl: assembly
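
As a sketch, the Rfam scan can be reproduced with a standard DBI connection (the parameters below are Rfam's publicly documented ones and may change):

con <- 
  DBI::dbConnect(
    RMariaDB::MariaDB(),
    dbname = "Rfam",
    host = "mysql-rfam-public.ebi.ac.uk",
    port = 4497,
    user = "rfamro",
    password = ""
  )

scan_data(tbl = dplyr::tbl(con, "full_region"))

DBI::dbDisconnect(con)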


OVERVIEW OF PACKAGE FUNCTIONS

There are many functions available in pointblank for understanding data quality and creating data documentation. Here is an overview of all of them, grouped by family. For much more information on these, visit the documentation website or take a Test Drive in the Posit Cloud project.


DISCUSSIONS

Let's talk about data validation and data documentation in pointblank Discussions! It's a great place to ask questions about how to use the package, discuss some ideas, engage with others, and much more!

INSTALLATION

Want to try this out? The pointblank package is available on CRAN:

install.packages("pointblank")

You can also install the development version of pointblank from GitHub:

devtools::install_github("rstudio/pointblank")

If you encounter a bug, have usage questions, or want to share ideas to make this package better, feel free to file an issue.


Code of Conduct

Please note that the pointblank project is released with a contributor code of conduct.
By participating in this project you agree to abide by its terms.

📄 License

pointblank is licensed under the MIT license. See the LICENSE.md file for more details.

© Posit Software, PBC.

🏛️ Governance

This project is primarily maintained by Rich Iannone. Other authors may occasionally assist with some of these duties.


pointblank's People

Contributors

brancengregory, davzim, ekothe, gadenbuie, kierisi, ldalby, mayeulk, mikejohnpage, nutterb, pachadotdev, rich-iannone, yjunechoe


pointblank's Issues

Add dataset to package

Currently there are no datasets in the package but one or two would be useful for examples and vignettes.

Create an actions and levels info strip

This is necessary for creating any schematics of the validation plan and for reporting post-interrogation. It should indicate the settings for the object returned by the action_levels() helper function and applied to the validation step.

Release pointblank 0.2.1

Prepare for release:

  • devtools::check()
  • devtools::check_win_devel()
  • rhub::check_for_cran()
  • revdepcheck::revdep_check(num_workers = 4)
  • Polish NEWS
  • Polish pkgdown reference index

Submit to CRAN:

  • usethis::use_version('patch')
  • Update cran-comments.md
  • devtools::submit_cran()
  • Approve email

Wait for CRAN...

  • Accepted 🎉
  • usethis::use_github_release()
  • usethis::use_dev_version()
  • Tweet

Add manual tests for a variety of database types

We need a standardized test suite that exercises all of the validation step functions with a variety of database types. The databases and their drivers should be: MySQL (with RMariaDB), PostgreSQL (with RPostgres), SQLite (with RSQLite).

Add an `active` option to all validation step functions

Each validation step function will get the argument active, which will accept a logical value (defaulting to TRUE).

If step functions are working with an agent, FALSE will make the step inactive (still reporting its presence and keeping indexes for the steps unchanged).

If the step functions are operating directly on data, then any step with active = FALSE will simply pass the data through, no longer acting as a filter (internally, just returning the data early).

A valid use case for this is setting a global switch on some or all validation steps depending on the context (e.g., in production or not).
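
A sketch of that use case (the environment-variable gate and validation steps are illustrative; `active` is the argument proposed above):

# Only run the more expensive checks in production
in_prod <- identical(Sys.getenv("APP_ENV"), "production")

agent <- 
  create_agent(tbl = small_table) %>%
  col_vals_gt(vars(d), 100) %>%
  col_vals_not_null(vars(date), active = in_prod) %>%
  interrogate()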

`focus_on` can fail to get the right local dataframe

Here's an example:

library(pointblank)
# Copied from the docs, but wrapped in a function
fn <- function() {
  my_df <- data.frame(a = c(5, 4, 3, 5, 1, 2))
  
  agent <- create_agent() %>%
    focus_on(tbl_name = "my_df") %>%
    col_vals_lt(
      column = a,
      value = 6) %>%
    interrogate()
  
  all_passed(agent)
}
fn()
#> Error in get(tbl_name): object 'my_df' not found

Created on 2019-09-21 by the reprex package (v0.3.0)

Note that if a my_df object existed in the global scope, focus_on would use the global object instead. I think this behavior is happening because focus_on uses get instead of dynGet here.

col_vals_in_set passes even when values are not in the set

The col_vals_in_set test appears to pass regardless of whether or not values are actually in the set. See the reproducible example below:

Create a simple two column data frame

df <-
  data.frame(
    a = c(1, 2, 3, 4),
    b = c("one", "two", "three", "four"),
    stringsAsFactors = FALSE)

Validate that all numerical values in column a belong to a numerical set, and, that all values in column b belong to a set of string values. Note that none of the values in either validation set should pass.

agent <-
  create_agent() %>%
  focus_on(tbl_name = "df") %>%
  col_vals_in_set(
    column = a,
    set = 10:20) %>%
  col_vals_in_set(
    column = b,
    set = c("mouse", "dog", "cat", "pig")) %>%
  interrogate()

However, all validation checks are reported as passed

all_passed(agent)
[1] TRUE

welcome page example not working

>  create_agent() %>%             # (1)
+   focus_on(
+     tbl_name = "tbl_1") %>%      # (2)
+   col_vals_gt(
+     column = "a",
+     value = 0)
Error in bind_rows_(x, .id) : 
  Evaluation error: Argument 6: list can't contain data frames.
> sessionInfo()
R version 3.3.1 (2016-06-21)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.12.4 (Sierra)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] bindrcpp_0.1     pointblank_0.1   rlang_0.0.0.9018

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.10.2      bindr_0.1           knitr_1.15.1        magrittr_1.5        hms_0.3             devtools_1.12.0    
 [7] R6_2.2.0            stringr_1.2.0       httr_1.2.1          dplyr_0.5.0.9004    tools_3.3.1         DBI_0.6-11         
[13] git2r_0.15.0        withr_1.0.2         htmltools_0.3.5     lazyeval_0.2.0.9000 RPostgreSQL_0.4-1   assertthat_0.2.0   
[19] digest_0.6.12       rprojroot_1.2       tibble_1.3.0        tidyr_0.6.1         purrr_0.2.2         readr_1.1.0        
[25] curl_2.3            memoise_1.0.0       glue_1.0.0          evaluate_0.10       rmarkdown_1.5       stringi_1.1.5      
[31] backports_1.0.5   

Add option to use environment variables for DB connections

There needs to be a convenient method for passing in references to environment variables that hold DB credentials. A bonus function would be for testing environment variables (i.e., do the supplied environment variables result in a successful connection?).
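
Until such a helper exists, here is a minimal sketch with base R's Sys.getenv() that covers both needs (the driver and variable names are illustrative):

# Build the connection from environment variables
connect_from_env <- function() {
  DBI::dbConnect(
    RPostgres::Postgres(),
    dbname = Sys.getenv("DB_NAME"),
    host = Sys.getenv("DB_HOST"),
    user = Sys.getenv("DB_USER"),
    password = Sys.getenv("DB_PASSWORD")
  )
}

# Test whether the supplied environment variables yield a
# successful connection
env_vars_work <- tryCatch({
  con <- connect_from_env()
  DBI::dbDisconnect(con)
  TRUE
}, error = function(e) FALSE)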

Add functionality for simple validations (e.g., `df %>% col_vals_gt(...)`)

The idea is to pass a data object directly to a validation function and get a re-usable output (e.g., vector of logical values) that can be used in other functions. This would be very useful for joint validations where we could have:

df %>% <validation_function>(...) & 
df %>% <validation_function>(...)

And the resultant vector of logicals could show which rows jointly passed (of course, one has to ensure that the input is passed to the validation functions unchanged).

This shouldn't affect the existing API that much. The first argument of any validation function will change from agent to ... where each function will internally sort out whether to use an agent object or immediately interrogate. The ... will also be useful if we decide to wrap inputs in some helper function (e.g., jointly(), etc.).

Potential unclosed connections in Redshift

When using the package to validate tables in Redshift, the number of unclosed connections shows a definite increase. Look into solutions for closing connections and re-using existing connections efficiently.
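
One general mitigation (plain DBI hygiene, not a pointblank API; the cluster details are placeholders) is to tie the connection's lifetime to the function performing the validation:

validate_redshift_tbl <- function() {
  con <- DBI::dbConnect(
    RPostgres::Redshift(),
    host = "example-cluster.redshift.amazonaws.com",
    dbname = "db",
    user = "user",
    password = "pass"
  )
  # Close the connection even if interrogation fails
  on.exit(DBI::dbDisconnect(con), add = TRUE)
  
  dplyr::tbl(con, "events") %>%
    create_agent() %>%
    col_vals_not_null(vars(event_id)) %>%
    interrogate()
}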

all_passed and n_passed handle NA values differently for some interrogation tests

Different parts of the interrogation report handle NA values differently. This is apparent for col_vals_between and col_vals_in_set (and it presumably affects other tests I don't currently use).

I have a simple dataframe that includes missing values and check that all values of column B are between 0 and 20.

df<-structure(list(A = c(NA, 2L, 3L, 4L, 5L, 6L, NA, 8L, 9L, NA), 
               B = c(11L, 12L, NA, NA, NA, NA, 17L, 18L, 19L, 20L), 
               C = c(NA, NA, NA, 24L, NA, 26L, NA, 28L, NA, 30L)), 
               .Names = c("A",  "B", "C"), row.names = c(NA, -10L), class = "data.frame")

agent <-
  create_agent() %>%
  focus_on(tbl_name = "df") %>%
  col_vals_between(
    column = B,
    left = 0,
    right = 20
  ) %>%
  interrogate()

all_passed(agent)
get_interrogation_summary(agent)

all_passed() returns TRUE but get_interrogation_summary() reports that 60% of rows are not within range.

# A tibble: 1 x 12
  tbl_name db_type  assertion_type   column value regex all_passed     n n_passed f_passed action brief            
  <chr>    <chr>    <chr>            <chr>  <dbl> <chr> <lgl>      <dbl>    <dbl>    <dbl> <chr>  <chr>            
1 df       local_df col_vals_between B         NA NA    TRUE          10        6      0.6 NA     Expect that valu~

This occurs because the NAs are counted as failing in some calculations but not in others.

I can partially control this behaviour by adding a pre-condition to only apply this test to rows where B is not NA.

agent <-
  create_agent() %>%
  focus_on(tbl_name = "df") %>%
  col_vals_between(
    column = B,
    left = 0,
    right = 20,
    preconditions = is.na(B) == FALSE) %>%
  interrogate()

But this becomes tedious when testing a large number of columns where NAs should not be counted as failing (since each would require a pre-condition that refers to the relevant column by name).

It's also not apparent whether I could specify that NAs should be counted as not in range (except by using a different test). This is a problem if I want to trigger a warning or notification based on the joint failure rate of NAs and out-of-range values.

The ability to include or exclude NAs from any given test (as in the example below) would improve the usability of this function and add consistency between the n_passed and all_passed values.

agent <-
  create_agent() %>%
  focus_on(tbl_name = "df") %>%
  col_vals_between(
    column = B,
    left = 0,
    right = 20,
    na_as_in_range = TRUE) %>%
  interrogate()

Release pointblank 0.3.0

Prepare for release:

  • devtools::check()
  • devtools::check_win_devel()
  • rhub::check_for_cran()
  • revdepcheck::revdep_check(num_workers = 4)
  • Polish NEWS
  • Polish pkgdown reference index
  • Draft blog post

Submit to CRAN:

  • usethis::use_version('minor')
  • Update cran-comments.md
  • devtools::submit_cran()
  • Approve email

Wait for CRAN...

  • Accepted 🎉
  • usethis::use_github_release()
  • usethis::use_dev_version()
  • Finish blog post
  • Tweet
  • Add link to blog post in pkgdown news menu

Include a notification function that integrates with Slack

Thanks for making this package! It's very helpful and I got a few validation processes running smoothly in production (no issues with that at all).

I like the email notifier that's included and something along the same lines is a Slack notifier, which would be a great addition! Would that be a feature you're willing to add in?

col_exists handling of multiple columns

The description for col_exists implies that it can handle multiple columns. However, the method for doing this is not apparent.

col_exists(column = c(start_date, age)) looks for a column called "c(start_date, age)"

col_exists(column = c("start_date", "age")) looks for a column called c('start_date', 'age')

Given this function isn't fully documented I'm not sure if I'm missing something. Is there a way to pass a vector of column names to this function?
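
For what it's worth, other validation functions in this README take multiple columns via vars() (e.g., col_is_numeric(vars(a, b)) in the agent example), so the following sketch may apply here as well (assuming a package version that supports it):

# Hypothetical data frame with the columns in question
df <- data.frame(start_date = Sys.Date(), age = 30L)

df %>% col_exists(columns = vars(start_date, age))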

Update README

The README hasn't been touched in quite a long time and it could be a little shorter. Goal, I think, is to talk about the main workflows and problems that can be solved. All the other little details can go into vignettes/articles.

Use a `values` list column in the `validation_set` object

This is needed to simplify the model for validation steps. With a list column we can accommodate any type, so any values put in value, set, or regex would simply go into the values list column.

This also makes it easier to have non-numeric comparisons so dates or date-times could then be specified and used.
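
A minimal sketch of the idea (the column names are illustrative): a list column holds values of any type, one element per validation step:

validation_set <-
  dplyr::tibble(
    assertion_type = c("col_vals_lt", "col_vals_in_set", "col_vals_between"),
    values = list(
      12,                                     # a single number
      c("mouse", "dog", "cat"),               # a character set
      as.Date(c("2020-01-01", "2020-12-31"))  # a date range
    )
  )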

`preconditions` should be a list of expressions

Presently, any preconditions just filter the data before performing a validation. It would be much better to accept a list of expressions that manipulate the data. In this way, the user could mutate the table and perhaps generate a new column that would undergo validation (among other possibilities).
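
For comparison, here is the formula-based form used in the agent example earlier in this README, where a new column is created and then validated (shown here in the direct-on-data workflow):

dplyr::tibble(a = c(5, 7), b = c(6, 1)) %>%
  col_vals_lt(
    vars(c), 12,
    preconditions = ~ . %>% dplyr::mutate(c = a + b)
  )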

Have options to sort agent report by failure condition

Currently, the agent report provides line entries for all validation steps in the given order. There should be options for sorting by severity of the failure conditions and for limiting/omitting the passing steps. This will make for more succinct reporting, especially in an email context.

Reporting across interrogations

I've been using this package (along with your gt package!) for about a month and it's really helped out a lot at work. We're trying to get our data quality under control and this package solves that problem perfectly (it's actually amazing how everything just seems to work without problems).

A feature request that I have is making a sort of combined report of interrogations for the same table but at different times. We want to have these to see (in a simple table) where things have improved or gotten worse.

I honestly wouldn't be surprised if you weren't thinking about this already, so, if this is something you were planning it would be a great next step.

Thank you so much for all your great packages. You're making my life easier!

Have some of the step functions use columns (not just numbers) as comparisons

Love this package! I'm setting up all sorts of validations and one thing I think would be useful is to enable a direct comparison of one column to another.

For example if we wanted to validate that column a is always greater than column b, we should be able to use col_vals_gt(vars(a), vars(b)). What do you think?

Again, this package is incredible. Thanks!

Checks for sf-objects

I intended to check key properties of sf (and sfc) objects using rows_not_duplicated(). The check was supposed to ignore the geometry column of the object (cf. the 2nd example in the reprex).

It seems that interrogate() ran into an error because of the way summarize() works on these objects.

Reprex example:

library(pointblank)
library(sf)
#> Linking to GEOS 3.6.1, GDAL 2.1.3, PROJ 4.9.3

# Geometry object with 2 features
g <- rep(st_sfc(st_point(1:2)), 2)

# vector with 2 entries
v <- c("a", "b")

# object including both objects
mixed_obj <- st_sf("vector" = v, "points" = g)
mixed_obj
#> Simple feature collection with 2 features and 1 field
#> geometry type:  POINT
#> dimension:      XY
#> bbox:           xmin: 1 ymin: 2 xmax: 1 ymax: 2
#> epsg (SRID):    NA
#> proj4string:    NA
#>   vector      points
#> 1      a POINT (1 2)
#> 2      b POINT (1 2)

agent <- create_agent()
agent %>% 
  focus_on("mixed_obj") %>% 
  rows_not_duplicated() %>% 
  interrogate()
#> Error: Can't coerce element 2 from a list to a double

# It already happens, when I only check if column "vector" is duplicated 
# (likely because `sf`-objects have "sticky geometries")
agent <- create_agent()
agent %>% 
  focus_on("mixed_obj") %>% 
  rows_not_duplicated(cols = vector) %>% 
  interrogate()
#> Error: Can't coerce element 2 from a list to a double

Created on 2019-02-12 by the reprex package (v0.2.1)

I think it happens at the following chunk in interrogate() in the section "# Judge tables on expectation of non-duplicated rows":

      # Get total count of rows
      row_count <-
        table %>%
        dplyr::group_by() %>%
        dplyr::summarize(row_count = n()) %>%
        dplyr::as_tibble() %>%
        purrr::flatten_dbl()

My expectation would be that

  1. in the first case of the reprex (rows_not_duplicated(), without specifying columns) each whole row, including the geometry column, would be compared with the others.
  2. in the second case (rows_not_duplicated(cols = vector)) the check would be done only for the column "vector".

Perhaps a solution might be to call as_tibble() before group_by() and summarize()?

CC: @krlmlr

col_vals_in_set() broken

Seeing different behavior for the CRAN version, an intermediate version, and master:

CRAN

library(tibble)
library(pointblank)

data <-
  tibble(text = c("a", "b", "C", NA))

set <- letters

create_agent() %>%
  focus_on("data") %>%
  col_vals_in_set(text, set) %>%
  interrogate()
#> pointblank agent // <agent_2019-02-12_17:28:44>
#> 
#> tables of focus: data/local_df (1).
#> number of validation steps: 1
#> 
#> interrogation (2019-02-12 17:28:44) resulted in:
#>   - 1 passing validation
#>   - no failing validations   more info: `get_interrogation_summary()`

Created on 2019-02-12 by the reprex package (v0.2.1.9000)

5f7b88a (last good revision, parent of b2541da)

library(tibble)
library(pointblank)

data <-
  tibble(text = c("a", "b", "C", NA))

set <- letters

create_agent() %>%
  focus_on("data") %>%
  col_vals_in_set(text, set) %>%
  interrogate()
#> Warning: Prefixing `UQ()` with the rlang namespace is deprecated as of rlang 0.3.0.
#> Please use the non-prefixed form or `!!` instead.
#> 
#>   # Bad:
#>   rlang::expr(mean(rlang::UQ(var) * 100))
#> 
#>   # Ok:
#>   rlang::expr(mean(UQ(var) * 100))
#> 
#>   # Good:
#>   rlang::expr(mean(!!var * 100))
#> 
#> This warning is displayed once per session.
#> pointblank agent // <agent_2019-02-12_17:29:47>
#> 
#> tables of focus: data/local_df (1).
#> number of validation steps: 1
#> 
#> interrogation (2019-02-12 17:29:48) resulted in:
#>   - no passing validations
#>   - 1 failing validation   more info: `get_interrogation_summary()`

Created on 2019-02-12 by the reprex package (v0.2.1.9000)

b2541da up to master

library(tibble)
library(pointblank)

data <-
  tibble(text = c("a", "b", "C", NA))

set <- letters

create_agent() %>%
  focus_on("data") %>%
  col_vals_in_set(text, set) %>%
  interrogate()
#> Error in create_autobrief(agent = agent, assertion_type = "col_vals_in_set", : argument "set" is missing, with no default

Created on 2019-02-12 by the reprex package (v0.2.1.9000)

Consider less stringent R version dependence.

The dependence on R version >= 3.4.0 means I cannot install this package at work. (We are stuck on R 3.2, and it's not going to change.) I have looked at the dependencies of pointblank, and none of them seem to require 3.4, although perhaps some of their dependencies do. I might suggest modifying the .travis.yml file to try builds on older releases as a test, cf. R versions.

Add print method for the x-list

As a way to make the x-list object a bit more visually appealing in the console (and less annoying), a print method should be added.
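
A generic sketch of the S3 approach (the class name and components are illustrative, not pointblank's actual internals):

# A compact console display for x-list objects
print.x_list <- function(x, ...) {
  cat("pointblank x-list\n")
  cat("components: ", paste(names(x), collapse = ", "), "\n", sep = "")
  invisible(x)
}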

Additional language support for message parts

I really like that you added multilingual support for the report outputs. One place where that is currently missing (I think) is in the stock message parts for the emailing of the pointblank report. Could you add those in?
