rstudio / pointblank

Data quality assessment and metadata reporting for data frames and database tables

Home Page: https://rstudio.github.io/pointblank/

License: MIT

Topics: data-validation, database-tables, data-dictionaries, easy-to-understand, data-frames, reporting-tool, data-profiler, data-management, schema-validation, data-verification

pointblank's Introduction





With the pointblank package it’s really easy to methodically validate your data, whether it takes the form of a data frame or a database table. On top of the validation toolset, the package gives you the means to define, and keep up to date, the information that describes your tables.

For table validation, the agent object works with a large collection of simple (yet powerful!) validation functions. Much more sophisticated checks are possible through custom expressions, segmentation of the data, and selective mutations of the target table. The suite of validation functions works the same way whether your table is a data frame or a database table.

Sometimes, we want to maintain table information and update it when the table changes. For that, we can use an informant object plus associated functions to define the metadata entries and present them as a data dictionary. Just as with validation, pointblank offers easy ways to keep the metadata updated so that this important documentation doesn't become stale.


TABLE VALIDATIONS WITH AN AGENT AND DATA QUALITY REPORTING

Data validation can be carried out in a Data Quality Reporting workflow, ultimately resulting in the production of a data quality analysis report. This is most useful in a non-interactive mode where data quality for database tables and on-disk data files must be periodically checked. The pointblank agent is given a collection of validation functions that define validation steps. We can get extracts of the data rows that failed validation, set up custom functions that are invoked when failure thresholds are exceeded, and more. Want to email the report regularly (or only if certain conditions are met)? Yep, you can do all that.

Here is an example of how to use pointblank to validate a local table with an agent.

# Generate a simple `action_levels` object to
# set the `warn` state if a validation step
# has a single 'fail' test unit
al <- action_levels(warn_at = 1)

# Create a pointblank `agent` object, with the
# tibble as the target table. Use three validation
# functions, then, `interrogate()`. The agent will
# then have some useful intel.
agent <- 
  dplyr::tibble(
    a = c(5, 7, 6, 5, NA, 7),
    b = c(6, 1, 0, 6,  0, 7)
  ) %>%
  create_agent(
    label = "A very *simple* example.",
    actions = al
  ) %>%
  col_vals_between(
    vars(a), 1, 9,
    na_pass = TRUE
  ) %>%
  col_vals_lt(
    vars(c), 12,
    preconditions = ~ . %>% dplyr::mutate(c = a + b)
  ) %>%
  col_is_numeric(vars(a, b)) %>%
  interrogate()

The reporting’s pretty sweet. We can get a gt-based report by printing the agent.

The pointblank package is designed to be straightforward yet powerful. And fast! Local data frames don’t take long to validate extensively, and all validation checks on remote tables are done entirely in-database. So we can add dozens or even hundreds of validation steps without long waits for reporting.

Should you want to perform validation checks on database or Spark tables, provide a tbl_dbi or tbl_spark object to create_agent(). The pointblank package currently supports PostgreSQL, MySQL, MariaDB, Microsoft SQL Server, Google BigQuery, DuckDB, SQLite, and Spark DataFrames (through the sparklyr package).
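
Here is a sketch of what that looks like (the driver, credentials, and table name below are placeholders; any DBI-backed source works the same way):

# Connect to a database and hand a remote table to `create_agent()`
con <- 
  DBI::dbConnect(
    RPostgres::Postgres(),
    dbname = "analytics",
    host = "localhost",
    user = "user",
    password = "password"
  )

agent <- 
  dplyr::tbl(con, "revenue") %>%
  create_agent(label = "Remote table validation") %>%
  col_vals_not_null(vars(revenue)) %>%
  interrogate()

DBI::dbDisconnect(con)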

Here are some validation reports for the considerably larger intendo::intendo_revenue table.

postgres    mysql    duckdb


VALIDATIONS DIRECTLY ON DATA

The Pipeline Data Validation workflow uses the same collection of validation functions but without the need for an agent. This is useful in an ETL process where we want to periodically check data and trigger warnings, raise errors, or write out logs when specified failure thresholds are exceeded. It’s a cinch to perform checks on import of the data and at key points during the transformation process, perhaps stopping the data flow if data quality is unacceptable.

The following example uses the same three validation functions as before but, this time, we use them directly on the data. The validation functions act as a filter, passing data through unless execution is stopped by failing validations beyond the set threshold. In this workflow, by default, an error will occur if there is a single ‘fail’ test unit in any validation step:

dplyr::tibble(
    a = c(5, 7, 6, 5, NA, 7),
    b = c(6, 1, 0, 6,  0, 7)
  ) %>%
  col_vals_between(
    a, 1, 9,
    na_pass = TRUE
  ) %>%
  col_vals_lt(
    c, 12,
    preconditions = ~ . %>% dplyr::mutate(c = a + b)
  ) %>%
  col_is_numeric(c(a, b))
Error: Exceedance of failed test units where values in `c` should have been < `12`.
The `col_vals_lt()` validation failed beyond the absolute threshold level (1).
* failure level (2) >= failure threshold (1) 

We can downgrade this error to a warning with the warn_on_fail() helper function (assigning it to actions). In this way, the data will always be returned, but warnings will appear.

# The `warn_on_fail()` function is a nice
# shortcut for `action_levels(warn_at = 1)`;
# it works great in this data checking workflow
# (and the threshold can still be adjusted)
dplyr::tibble(
    a = c(5, 7, 6, 5, NA, 7),
    b = c(6, 1, 0, 6,  0, 7)
  ) %>%
  col_vals_between(
    a, 1, 9,
    na_pass = TRUE,
    actions = warn_on_fail()
  ) %>%
  col_vals_lt(
    c, 12,
    preconditions = ~ . %>% dplyr::mutate(c = a + b),
    actions = warn_on_fail()
  ) %>%
  col_is_numeric(
    c(a, b),
    actions = warn_on_fail()
  )
#> # A tibble: 6 x 2
#>       a     b
#>   <dbl> <dbl>
#> 1     5     6
#> 2     7     1
#> 3     6     0
#> 4     5     6
#> 5    NA     0
#> 6     7     7

Warning message:
Exceedance of failed test units where values in `c` should have been < `12`.
The `col_vals_lt()` validation failed beyond the absolute threshold level (1).
* failure level (2) >= failure threshold (1) 

Should you need more fine-grained thresholds and resultant actions, the action_levels() function can be used to specify multiple failure thresholds and side effects for each failure state. However, with warn_on_fail() and stop_on_fail() (applied by default, with stop_at = 1), you should have good enough options for this validation workflow.
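
For example, here is a sketch of graduated thresholds (the values are illustrative; fractional values are read as proportions of failing test units, integers as absolute counts):

al <- 
  action_levels(
    warn_at = 0.10,    # warn when 10% of a step's test units fail
    stop_at = 0.25,    # stop when 25% fail
    notify_at = 0.35   # notify when 35% fail
  )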


VALIDATIONS IN R MARKDOWN DOCUMENTS

Using pointblank in an R Markdown workflow is enabled by default once the pointblank library is loaded. The framework allows for validation testing within specialized validation code chunks where the validate = TRUE option is set. Using pointblank validation functions on data in these marked code chunks will flag overall failure if the stop threshold is exceeded anywhere. All errors are reported in the validation code chunk after rendering the document to HTML, where green or red status buttons indicate whether all validations succeeded or failures occurred. Click them to reveal the otherwise hidden validation statements and any associated error messages.
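
A minimal sketch of such a chunk, using pointblank's built-in small_table dataset (the column choices are arbitrary):

```{r, validate = TRUE}
small_table %>%
  col_vals_gt(vars(d), 100) %>%
  col_vals_in_set(vars(f), c("low", "mid", "high"))
```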

An R Markdown document demonstrating this workflow is available as a template in the RStudio IDE; it’s called Pointblank Validation.


TABLE INFORMATION

Table information can be synthesized in an information management workflow, giving us a snapshot of a data table we care to collect information on. The pointblank informant is fed a series of info_*() functions to define bits of information about a table. This info text can pertain to individual columns, the table as a whole, and whatever else makes sense for your organization. We can even glean little snippets of information (like column stats or sample values) from the target table with info_snippet() and the snip_*() functions and mix them into the data dictionary wherever they're needed.

Here is an example of how to use pointblank to incorporate pieces of info text into an informant object.

# Create a pointblank `informant` object, with the
# tibble as the target table. Use a few information
# functions and end with `incorporate()`. The informant
# will then show you information about the tibble.
informant <- 
  dplyr::tibble(
    a = c(5, 7, 6, 5, NA, 7),
    b = c(6, 1, 0, 6,  0, 7)
  ) %>%
  create_informant(
    label = "A very *simple* example.",
    tbl_name = "example_tbl"
  ) %>%
  info_tabular(
    description = "This two-column table is nothing all that
    interesting, but, it's fine for examples on **GitHub**
    `README` pages. Column names are `a` and `b`. ((Cool stuff))"
  ) %>%
  info_columns(
    columns = a,
    info = "This column has an `NA` value. [[Watch out!]]<<color: red;>>"
  ) %>%
  info_columns(
    columns = a,
    info = "Mean value is `{a_mean}`."
  ) %>%
  info_columns(
    columns = b,
    info = "Like column `a`. The lowest value is `{b_lowest}`."
  ) %>%
  info_columns(
    columns = b,
    info = "The highest value is `{b_highest}`."
  ) %>%
  info_snippet(
    snippet_name = "a_mean",
    fn = ~ . %>% .$a %>% mean(na.rm = TRUE) %>% round(2)
  ) %>%
  info_snippet(snippet_name = "b_lowest", fn = snip_lowest("b")) %>%
  info_snippet(snippet_name = "b_highest", fn = snip_highest("b")) %>%
  info_section(
    section_name = "further information", 
    `examples and documentation` = "Examples for how to use the
    `info_*()` functions (and many more) are available at the
    [**pointblank** site](https://rstudio.github.io/pointblank/)."
  ) %>%
  incorporate()

By printing the informant we get the table information report.

Here is a link to a hosted information report for the intendo::intendo_revenue table:

Information Report for intendo::intendo_revenue


TABLE SCANS

We can use the scan_data() function to generate a comprehensive summary of a tabular dataset. This allows us to quickly understand what's in the dataset and helps us determine if there are any peculiarities within the data. Scanning the dplyr::storms dataset with scan_data(tbl = dplyr::storms) gives us an interactive HTML report. Here are a few such reports, published in RPubs:

Table Scan of dplyr::storms

Table Scan of pointblank::game_revenue

Database tables can be used with scan_data() as well. Here are two examples using (1) the full_region table of the Rfam database (hosted publicly at mysql-rfam-public.ebi.ac.uk) and (2) the assembly table of the Ensembl database (hosted publicly at ensembldb.ensembl.org).

Rfam: full_region

Ensembl: assembly
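
As a sketch, the Rfam scan can be reproduced with a standard DBI connection (the parameters below are Rfam's publicly documented ones and may change):

con <- 
  DBI::dbConnect(
    RMariaDB::MariaDB(),
    dbname = "Rfam",
    host = "mysql-rfam-public.ebi.ac.uk",
    port = 4497,
    user = "rfamro",
    password = ""
  )

scan_data(tbl = dplyr::tbl(con, "full_region"))

DBI::dbDisconnect(con)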


OVERVIEW OF PACKAGE FUNCTIONS

There are many functions available in pointblank for understanding data quality and creating data documentation. Here is an overview of all of them, grouped by family. For much more information on these, visit the documentation website or take a Test Drive in the Posit Cloud project.


DISCUSSIONS

Let's talk about data validation and data documentation in pointblank Discussions! It's a great place to ask questions about how to use the package, discuss some ideas, engage with others, and much more!

INSTALLATION

Want to try this out? The pointblank package is available on CRAN:

install.packages("pointblank")

You can also install the development version of pointblank from GitHub:

devtools::install_github("rstudio/pointblank")

If you encounter a bug, have usage questions, or want to share ideas to make this package better, feel free to file an issue.


Code of Conduct

Please note that the pointblank project is released with a contributor code of conduct.
By participating in this project you agree to abide by its terms.

📄 License

pointblank is licensed under the MIT license. See the LICENSE.md file for more details.

© Posit Software, PBC.

🏛️ Governance

This project is primarily maintained by Rich Iannone. Other authors may occasionally assist with some of these duties.


pointblank's People

Contributors

brancengregory, davzim, ekothe, gadenbuie, kierisi, ldalby, mayeulk, mikejohnpage, nutterb, pachadotdev, rich-iannone, yjunechoe


pointblank's Issues

Add dataset to package

Currently there are no datasets in the package but one or two would be useful for examples and vignettes.

Create an actions and levels info strip

This is necessary for creating any schematics of the validation plan and for reporting post-interrogation. It should indicate the settings for the object returned by the action_levels() helper function and applied to the validation step.

Release pointblank 0.2.1

Prepare for release:

  • devtools::check()
  • devtools::check_win_devel()
  • rhub::check_for_cran()
  • revdepcheck::revdep_check(num_workers = 4)
  • Polish NEWS
  • Polish pkgdown reference index

Submit to CRAN:

  • usethis::use_version('patch')
  • Update cran-comments.md
  • devtools::submit_cran()
  • Approve email

Wait for CRAN...

  • Accepted 🎉
  • usethis::use_github_release()
  • usethis::use_dev_version()
  • Tweet

Add manual tests for a variety of database types

We need a standardized test suite that exercises all of the validation step functions with a variety of database types. The databases and their drivers should be: MySQL (with RMariaDB), PostgreSQL (with RPostgres), SQLite (with RSQLite).

Add an `active` option to all validation step functions

Each validation step function will get the argument active, which will accept a logical value (defaulting to TRUE).

If step functions are working with an agent, FALSE will make the step inactive (still reporting its presence and keeping indexes for the steps unchanged).

If the step functions are operating directly on data, then any step with active = FALSE will simply pass the data through, no longer acting as a filter (internally, just returning the data early).

A valid use case for this is setting a global switch on some or all validation steps depending on the context (e.g., in production or not).
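
A sketch of that use case (the environment-variable gate and validation steps are illustrative; `active` is the argument proposed above):

# Only run the more expensive checks in production
in_prod <- identical(Sys.getenv("APP_ENV"), "production")

agent <- 
  create_agent(tbl = small_table) %>%
  col_vals_gt(vars(d), 100) %>%
  col_vals_not_null(vars(date), active = in_prod) %>%
  interrogate()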

`focus_on` can fail to get the right local dataframe

Here's an example:

library(pointblank)
# Copied from the docs, but wrapped in a function
fn <- function() {
  my_df <- data.frame(a = c(5, 4, 3, 5, 1, 2))
  
  agent <- create_agent() %>%
    focus_on(tbl_name = "my_df") %>%
    col_vals_lt(
      column = a,
      value = 6) %>%
    interrogate()
  
  all_passed(agent)
}
fn()
#> Error in get(tbl_name): object 'my_df' not found

Created on 2019-09-21 by the reprex package (v0.3.0)

Note that if a my_df object existed in the global scope, focus_on would use the global object instead. I think this behavior is happening because focus_on uses get instead of dynGet here.

col_vals_in_set passes even when values are not in the set

The col_vals_in_set test appears to pass regardless of whether or not values are actually in the set. See the reproducible example below:

Create a simple two column data frame

df <-
  data.frame(
    a = c(1, 2, 3, 4),
    b = c("one", "two", "three", "four"),
    stringsAsFactors = FALSE)

Validate that all numerical values in column a belong to a numerical set, and, that all values in column b belong to a set of string values. Note that none of the values in either validation set should pass.

agent <-
  create_agent() %>%
  focus_on(tbl_name = "df") %>%
  col_vals_in_set(
    column = a,
    set = 10:20) %>%
  col_vals_in_set(
    column = b,
    set = c("mouse", "dog", "cat", "pig")) %>%
  interrogate()

However, all validation checks are reported as passed

all_passed(agent)
[1] TRUE

welcome page example not working

>  create_agent() %>%             # (1)
+   focus_on(
+     tbl_name = "tbl_1") %>%      # (2)
+   col_vals_gt(
+     column = "a",
+     value = 0)
Error in bind_rows_(x, .id) : 
  Evaluation error: Argument 6: list can't contain data frames.
> sessionInfo()
R version 3.3.1 (2016-06-21)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.12.4 (Sierra)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] bindrcpp_0.1     pointblank_0.1   rlang_0.0.0.9018

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.10.2      bindr_0.1           knitr_1.15.1        magrittr_1.5        hms_0.3             devtools_1.12.0    
 [7] R6_2.2.0            stringr_1.2.0       httr_1.2.1          dplyr_0.5.0.9004    tools_3.3.1         DBI_0.6-11         
[13] git2r_0.15.0        withr_1.0.2         htmltools_0.3.5     lazyeval_0.2.0.9000 RPostgreSQL_0.4-1   assertthat_0.2.0   
[19] digest_0.6.12       rprojroot_1.2       tibble_1.3.0        tidyr_0.6.1         purrr_0.2.2         readr_1.1.0        
[25] curl_2.3            memoise_1.0.0       glue_1.0.0          evaluate_0.10       rmarkdown_1.5       stringi_1.1.5      
[31] backports_1.0.5   

Add option to use environment variables for DB connections

There needs to be a convenient method for passing in references to environment variables that hold DB credentials. A bonus function would be for testing environment variables (i.e., do the supplied environment variables result in a successful connection?).
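
Until such a helper exists, here is a minimal sketch with base R's Sys.getenv() that covers both needs (the driver and variable names are illustrative):

# Build the connection from environment variables
connect_from_env <- function() {
  DBI::dbConnect(
    RPostgres::Postgres(),
    dbname = Sys.getenv("DB_NAME"),
    host = Sys.getenv("DB_HOST"),
    user = Sys.getenv("DB_USER"),
    password = Sys.getenv("DB_PASSWORD")
  )
}

# Test whether the supplied environment variables yield a
# successful connection
env_vars_work <- tryCatch({
  con <- connect_from_env()
  DBI::dbDisconnect(con)
  TRUE
}, error = function(e) FALSE)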

Add functionality for simple validations (e.g., `df %>% col_vals_gt(...)`)

The idea is to pass a data object directly to a validation function and get a re-usable output (e.g., vector of logical values) that can be used in other functions. This would be very useful for joint validations where we could have:

df %>% <validation_function>(...) & 
df %>% <validation_function>(...)

And the resultant vector of logicals could show which rows jointly passed (of course, one has to ensure that the input is passed to the validation functions unchanged).

This shouldn't affect the existing API that much. The first argument of any validation function will change from agent to ... where each function will internally sort out whether to use an agent object or immediately interrogate. The ... will also be useful if we decide to wrap inputs in some helper function (e.g., jointly(), etc.).

Potential unclosed connections in Redshift

When using the package to validate tables in Redshift, the number of unclosed connections shows a definite increase. Look into solutions for closing connections and re-using existing connections efficiently.
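
One general mitigation (plain DBI hygiene, not a pointblank API; the cluster details are placeholders) is to tie the connection's lifetime to the function performing the validation:

validate_redshift_tbl <- function() {
  con <- DBI::dbConnect(
    RPostgres::Redshift(),
    host = "example-cluster.redshift.amazonaws.com",
    dbname = "db",
    user = "user",
    password = "pass"
  )
  # Close the connection even if interrogation fails
  on.exit(DBI::dbDisconnect(con), add = TRUE)
  
  dplyr::tbl(con, "events") %>%
    create_agent() %>%
    col_vals_not_null(vars(event_id)) %>%
    interrogate()
}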

all_passed and n_passed handle NA values differently for some interrogation tests

Different parts of the interrogation report handle NA values differently. This is apparent for col_vals_between and col_vals_in_set (and it presumably affects other tests I don't currently use).

I have a simple dataframe that includes missing values and check that all values of column B are between 0 and 20.

df<-structure(list(A = c(NA, 2L, 3L, 4L, 5L, 6L, NA, 8L, 9L, NA), 
               B = c(11L, 12L, NA, NA, NA, NA, 17L, 18L, 19L, 20L), 
               C = c(NA, NA, NA, 24L, NA, 26L, NA, 28L, NA, 30L)), 
               .Names = c("A",  "B", "C"), row.names = c(NA, -10L), class = "data.frame")

agent <-
  create_agent() %>%
  focus_on(tbl_name = "df") %>%
  col_vals_between(
    column = B,
    left = 0,
    right = 20
  ) %>%
  interrogate()

all_passed(agent)
get_interrogation_summary(agent)

all_passed() returns TRUE but get_interrogation_summary() reports that 60% of rows are not within range.

# A tibble: 1 x 12
  tbl_name db_type  assertion_type   column value regex all_passed     n n_passed f_passed action brief            
  <chr>    <chr>    <chr>            <chr>  <dbl> <chr> <lgl>      <dbl>    <dbl>    <dbl> <chr>  <chr>            
1 df       local_df col_vals_between B         NA NA    TRUE          10        6      0.6 NA     Expect that valu~

This occurs because the NAs are counted as failing in some calculations but not in others.

I can partially control this behaviour by adding a pre-condition to only apply this test to rows where B is not NA.

agent <-
  create_agent() %>%
  focus_on(tbl_name = "df") %>%
  col_vals_between(
    column = B,
    left = 0,
    right = 20,
    preconditions = is.na(B) == FALSE) %>%
  interrogate()

But this becomes tedious when testing a large number of columns where NAs should not be counted as failing (since each would require a pre-condition that refers to the relevant column by name).

It's also not apparent whether I could specify that NAs should be counted as not in range (except by using a different test). This is a problem if I want to trigger a warning or notification based on the joint failure rate of NAs and out-of-range values.

The ability to include or exclude NAs from any given test (as in the example below) would improve the usability of this function and add consistency between the n_passed and all_passed values.

agent <-
  create_agent() %>%
  focus_on(tbl_name = "df") %>%
  col_vals_between(
    column = B,
    left = 0,
    right = 20,
    na_as_in_range = TRUE) %>%
  interrogate()

Release pointblank 0.3.0

Prepare for release:

  • devtools::check()
  • devtools::check_win_devel()
  • rhub::check_for_cran()
  • revdepcheck::revdep_check(num_workers = 4)
  • Polish NEWS
  • Polish pkgdown reference index
  • Draft blog post

Submit to CRAN:

  • usethis::use_version('minor')
  • Update cran-comments.md
  • devtools::submit_cran()
  • Approve email

Wait for CRAN...

  • Accepted 🎉
  • usethis::use_github_release()
  • usethis::use_dev_version()
  • Finish blog post
  • Tweet
  • Add link to blog post in pkgdown news menu

Include a notification function that integrates with Slack

Thanks for making this package! It's very helpful and I got a few validation processes running smoothly in production (no issues with that at all).

I like the email notifier that's included and something along the same lines is a Slack notifier, which would be a great addition! Would that be a feature you're willing to add in?

col_exists handling of multiple columns

The description for col_exists implies that it can handle multiple columns. However, the method for doing this is not apparent.

col_exists(column = c(start_date, age)) looks for a column called "c(start_date, age)"

col_exists(column = c("start_date", "age")) looks for a column called c('start_date', 'age')

Given this function isn't fully documented I'm not sure if I'm missing something. Is there a way to pass a vector of column names to this function?
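
For what it's worth, other validation functions in this README take multiple columns via vars() (e.g., col_is_numeric(vars(a, b)) in the agent example), so the following sketch may apply here as well (assuming a package version that supports it):

# Hypothetical data frame with the columns in question
df <- data.frame(start_date = Sys.Date(), age = 30L)

df %>% col_exists(columns = vars(start_date, age))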

Update README

The README hasn't been touched in quite a long time and it could be a little shorter. Goal, I think, is to talk about the main workflows and problems that can be solved. All the other little details can go into vignettes/articles.

Use a `values` list column in the `validation_set` object

This is needed to simplify the model for validation steps. With a list column we can accommodate any type, so any values put in value, set, or regex would simply go into the values list column.

This also makes it easier to have non-numeric comparisons so dates or date-times could then be specified and used.
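
A minimal sketch of the idea (the column names are illustrative): a list column holds values of any type, one element per validation step:

validation_set <-
  dplyr::tibble(
    assertion_type = c("col_vals_lt", "col_vals_in_set", "col_vals_between"),
    values = list(
      12,                                     # a single number
      c("mouse", "dog", "cat"),               # a character set
      as.Date(c("2020-01-01", "2020-12-31"))  # a date range
    )
  )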

`preconditions` should be a list of expressions

Presently, any preconditions just filter the data before performing a validation. It would be much better to accept a list of expressions that manipulate the data. In this way, the user could mutate the table and perhaps generate a new column that would undergo validation (among other possibilities).
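
For comparison, here is the formula-based form used in the agent example earlier in this README, where a new column is created and then validated (shown here in the direct-on-data workflow):

dplyr::tibble(a = c(5, 7), b = c(6, 1)) %>%
  col_vals_lt(
    vars(c), 12,
    preconditions = ~ . %>% dplyr::mutate(c = a + b)
  )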

Have options to sort agent report by failure condition

Currently, the agent report provides line entries for all validation steps in the given order. There should be options for sorting by severity of the failure conditions and for limiting/omitting the passing steps. This will make for more succinct reporting, especially in an email context.

Reporting across interrogations

I've been using this package (along with your gt package!) for about a month and it's really helped out a lot at work. We're trying to get our data quality under control and this package solves that problem perfectly (it's actually amazing how everything just seems to work without problems).

A feature request that I have is making a sort of combined report of interrogations for the same table but at different times. We want to have these to see (in a simple table) where things have improved or gotten worse.

I honestly wouldn't be surprised if you weren't thinking about this already, so, if this is something you were planning it would be a great next step.

Thank you so much for all your great packages. You're making my life easier!

Have some of the step functions use columns (not just numbers) as comparisons

Love this package! I'm setting up all sorts of validations and one thing I think would be useful is to enable a direct comparison of one column to another.

For example if we wanted to validate that column a is always greater than column b, we should be able to use col_vals_gt(vars(a), vars(b)). What do you think?

Again, this package is incredible. Thanks!

Checks for sf-objects

I intended to check key properties of sf (and sfc) objects using rows_not_duplicated(). The check was supposed to ignore the geometry column of the object (cf. the 2nd example in the reprex).

It seems that interrogate() ran into an error because of the way summarize() works on these objects.

Reprex example:

library(pointblank)
library(sf)
#> Linking to GEOS 3.6.1, GDAL 2.1.3, PROJ 4.9.3

# Geometry object with 2 features
g <- rep(st_sfc(st_point(1:2)), 2)

# vector with 2 entries
v <- c("a", "b")

# object including both objects
mixed_obj <- st_sf("vector" = v, "points" = g)
mixed_obj
#> Simple feature collection with 2 features and 1 field
#> geometry type:  POINT
#> dimension:      XY
#> bbox:           xmin: 1 ymin: 2 xmax: 1 ymax: 2
#> epsg (SRID):    NA
#> proj4string:    NA
#>   vector      points
#> 1      a POINT (1 2)
#> 2      b POINT (1 2)

agent <- create_agent()
agent %>% 
  focus_on("mixed_obj") %>% 
  rows_not_duplicated() %>% 
  interrogate()
#> Error: Can't coerce element 2 from a list to a double

# It already happens, when I only check if column "vector" is duplicated 
# (likely because `sf`-objects have "sticky geometries")
agent <- create_agent()
agent %>% 
  focus_on("mixed_obj") %>% 
  rows_not_duplicated(cols = vector) %>% 
  interrogate()
#> Error: Can't coerce element 2 from a list to a double

Created on 2019-02-12 by the reprex package (v0.2.1)

I think it happens at the following chunk in interrogate() in the section "# Judge tables on expectation of non-duplicated rows":

      # Get total count of rows
      row_count <-
        table %>%
        dplyr::group_by() %>%
        dplyr::summarize(row_count = n()) %>%
        dplyr::as_tibble() %>%
        purrr::flatten_dbl()

My expectation would be that

  1. in the first case of the reprex (rows_not_duplicated(), without specifying columns) each whole row, including the geometry column, would be compared with the others.
  2. in the second case (rows_not_duplicated(cols = vector)) the check would be done only for the column "vector".

Perhaps a solution might be to call as_tibble() before group_by() and summarize()?

CC: @krlmlr

col_vals_in_set() broken

Seeing different behavior for the CRAN version, an intermediate version, and master:

CRAN

library(tibble)
library(pointblank)

data <-
  tibble(text = c("a", "b", "C", NA))

set <- letters

create_agent() %>%
  focus_on("data") %>%
  col_vals_in_set(text, set) %>%
  interrogate()
#> pointblank agent // <agent_2019-02-12_17:28:44>
#> 
#> tables of focus: data/local_df (1).
#> number of validation steps: 1
#> 
#> interrogation (2019-02-12 17:28:44) resulted in:
#>   - 1 passing validation
#>   - no failing validations   more info: `get_interrogation_summary()`

Created on 2019-02-12 by the reprex package (v0.2.1.9000)

5f7b88a (last good revision, parent of b2541da)

library(tibble)
library(pointblank)

data <-
  tibble(text = c("a", "b", "C", NA))

set <- letters

create_agent() %>%
  focus_on("data") %>%
  col_vals_in_set(text, set) %>%
  interrogate()
#> Warning: Prefixing `UQ()` with the rlang namespace is deprecated as of rlang 0.3.0.
#> Please use the non-prefixed form or `!!` instead.
#> 
#>   # Bad:
#>   rlang::expr(mean(rlang::UQ(var) * 100))
#> 
#>   # Ok:
#>   rlang::expr(mean(UQ(var) * 100))
#> 
#>   # Good:
#>   rlang::expr(mean(!!var * 100))
#> 
#> This warning is displayed once per session.
#> pointblank agent // <agent_2019-02-12_17:29:47>
#> 
#> tables of focus: data/local_df (1).
#> number of validation steps: 1
#> 
#> interrogation (2019-02-12 17:29:48) resulted in:
#>   - no passing validations
#>   - 1 failing validation   more info: `get_interrogation_summary()`

Created on 2019-02-12 by the reprex package (v0.2.1.9000)

b2541da up to master

library(tibble)
library(pointblank)

data <-
  tibble(text = c("a", "b", "C", NA))

set <- letters

create_agent() %>%
  focus_on("data") %>%
  col_vals_in_set(text, set) %>%
  interrogate()
#> Error in create_autobrief(agent = agent, assertion_type = "col_vals_in_set", : argument "set" is missing, with no default

Created on 2019-02-12 by the reprex package (v0.2.1.9000)

Consider less stringent R version dependence.

The dependence on R version >= 3.4.0 means I cannot install this package at work. (We are stuck on R 3.2, and it's not going to change.) I have looked at the dependencies of pointblank, and none of them seem to require 3.4, although perhaps some of their dependencies do. I might suggest modifying the .travis.yml file to try builds on older releases as a test, cf. R versions.

Add print method for the x-list

As a way to make the x-list object a bit more visually appealing in the console (and less annoying), a print method should be added.
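
A generic sketch of the S3 approach (the class name and components are illustrative, not pointblank's actual internals):

# A compact console display for x-list objects
print.x_list <- function(x, ...) {
  cat("pointblank x-list\n")
  cat("components: ", paste(names(x), collapse = ", "), "\n", sep = "")
  invisible(x)
}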

Additional language support for message parts

I really like that you added multilingual support for the report outputs. One place where that is currently missing (I think) is in the stock message parts for the emailing of the pointblank report. Could you add those in?
