alastairrushworth / inspectdf
🛠️ 📊 Tools for Exploring and Comparing Data Frames
Home Page: https://alastairrushworth.github.io/inspectdf/
👋🏽 I maintain the cran checks badges. Please change to the new cran checks badge URL (e.g., https://badges.cranchecks.info/worst/dplyr.svg). Old badges (e.g., https://cranchecks.info/badges/worst/dplyr) will be unavailable as of Jan 1st 2023.
Hi Alastair,
I like the idea and the visual, but the binning implementation seems wrong:
library(inspectdf)
data1 <- data.frame(x = c(rep(0, 250), rep(1, 250)))
data2 <- data.frame(x = c(rep(0, 230), rep(1, 270)))
show_plot(inspect_num(data1, data2))
Created on 2021-06-09 by the reprex package (v1.0.0)
It even occurs for identical data:
show_plot(inspect_num(data1, data1))
Hi
This is on my work PC.
compilation terminated.
make: *** [C:/PROGRA~1/R/R-35~1.1/etc/i386/Makeconf:215: RcppExports.o] Error 1
ERROR: compilation failed for package 'inspectdf'
Hi @alastairrushworth and thanks for this awesome package! I turn to it frequently to get a sense of new datasets.
One point of friction for me is that show_plot() doesn't return the ggplot2 object created by lower-level functions like plot_cat(). Currently, I believe that if type$method == "types" in show_plot() the result will be the ggplot2 object but otherwise, because of the if statements throughout show_plot(), the result will always be NULL.
library(dplyr)
library(inspectdf)
g <- starwars %>%
  inspect_cat() %>%
  show_plot()
g
#> NULL
This makes it difficult for users who would like to work with the ggplot2 object, to add or change styles, for example, because they need to fall back to using ::: to access inspectdf:::plot_cat() or similar. Unfortunately for these users, the default values for plot_cat() are handled by show_plot(), further increasing friction.
g2 <- starwars %>%
  inspect_cat() %>%
  inspectdf:::plot_cat()
#> Error in lapply(lvl_df$levels, merge_card, high_cardinality = high_cardinality): argument "high_cardinality" is missing, with no default
If I provide the default values to the lower level functions, then I can gain access to the created ggplot2 object, but it's clear that plot_cat() isn't designed for end user consumption.
g2 <- starwars %>%
  inspect_cat() %>%
  inspectdf:::plot_cat(
    df_names = list(df1 = "starwars"),
    high_cardinality = 10,
    col_palette = 0,
    text_labels = TRUE,
    label_thresh = 0.1
  )
g2
#> Warning: Stacking not well defined when not anchored on the axis
Personally, I would prefer that show_plot() simply return the ggplot2 object and that default printing rules are used to display the plot rather than explicitly calling print() internally. In this way, show_plot() would work as in the last example, but without automatically printing the plot. Doing this would give the user more control over where and how the inspectdf plots are used.
Created on 2019-07-23 by the reprex package (v0.2.1)
Is there any way to change the color and size of text labels on the plots?
To improve spotting differences between datasets visually
(especially when there are many columns) it would be helpful if one could sort the categorical columns by the Jensen–Shannon divergence.
The code below tries to do so but it seems to distort the labels on the y-axis. Also, in case the jsd column contains missing values, those variables are deleted from the graph.
library(inspectdf)
library(dplyr)
inspect_cat(starwars, starwars[1:20, ]) %>%
  arrange(desc(jsd)) %>%
  show_plot()
Created on 2020-04-01 by the reprex package (v0.3.0)
Hi team,
Thanks for the great package. Just noticed there's a change in the expected plot for show_plot(inspect_types(df1, df2)) between v0.0.9 and v0.0.12 and wanted to let you know in case it wasn't intended.
Code:
library(dplyr)      # sample_n()
library(ggplot2)    # diamonds dataset
library(inspectdf)
set.seed(2019)
diamonds_1 <- sample_n(diamonds, 50)
diamonds_2 <- sample_n(diamonds, 50)
show_plot(inspect_types(diamonds_1, diamonds_2))
Edited to add - sorry, I think this was covered in the news for v 0.0.10 https://cran.r-project.org/web/packages/inspectdf/news/news.html
Thanks for the great package; I noticed that the inspect_num function hits an error when hist gets a column with exclusively NAs, like this.
inspect_num(data.frame(a = 1:100, b = rep(NA_real_, 100)))
Here's the error I'm seeing:
Error in hist.default(df_num[[breaks_tbl$col_name[i]]], plot = FALSE, : character(0)
In addition: Warning messages:
1: In min(value, na.rm = T) : no non-missing arguments to min; returning Inf
2: In max(value, na.rm = T) : no non-missing arguments to max; returning -Inf
Cheers!
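One possible workaround, until this is handled inside the package, is to drop columns that contain only missing values before calling inspect_num(). The sketch below uses base R only; the helper name drop_all_na_cols is hypothetical and not part of inspectdf.

```r
# Hypothetical helper (not part of inspectdf): drop columns that are entirely NA.
drop_all_na_cols <- function(df) {
  keep <- vapply(df, function(col) !all(is.na(col)), logical(1))
  df[, keep, drop = FALSE]
}

# Same shape as the reprex above: one usable column, one all-NA column.
df <- data.frame(a = 1:100, b = rep(NA_real_, 100))
df_clean <- drop_all_na_cols(df)
names(df_clean)

# The cleaned data frame could then be passed on, e.g. inspect_num(df_clean).
```

This only sidesteps the error; it does not report the dropped columns, which a fix inside the package would presumably want to do.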
Happens when comparing two data frames
Suggests plotly
inspect_cat: tooltip for proportion, number and label
inspect_types: ?
inspect_num: ?
Hi Alastair.
Thanks for the package.
I've tried to do categorical-comparison plots between two data-frames (the two being partitions of some training data based on target-values).
Some example data might explain my problem a bit better:
Reprex:
library(tibble)
library(dplyr)
library(inspectdf)
df <- tibble(
  a = c(rep("x", 4), rep("y", 2), rep("x", 1), rep("y", 5)),
  target = c(rep(0, 6), rep(1, 6))
)
inspect_cat(
  df %>% filter(target == 0),
  df %>% filter(target == 1)
) %>%
  show_plot()
This results in the following image:
For category "a", I was wondering whether the level reordering is supposed to work as it does in the figure (x first for the first data frame, y first for the second) or whether this might be a bug. Do you think it might make more sense for the levels to be ordered by their frequency across the combined data frames? (There are 7 ys and 5 xs here, so maybe y should come first for both data frames.)
My original aim was to quickly identify categorical vars that distinguish positive from negative samples, but this is a bit obscured when scanning down the figure (for a dozen categories), because the levels are presented in an inconsistent order for the two data-frames that are being compared.
Aside: am I correct in thinking that the planned grouped-df API would allow the above without needing to partition the original data frame, that is, something like df %>% group_by(target) %>% inspect_cat() %>% show_plot()?
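To make the suggested ordering concrete, here is a minimal base R sketch that orders the levels of "a" by their frequency across both partitions and applies that shared order to each one. It only illustrates the idea in the question; it is not how inspect_cat() orders levels internally, and the variable names are made up.

```r
# Values of `a` in each partition of the reprex above.
a0 <- c(rep("x", 4), rep("y", 2))  # rows where target == 0
a1 <- c(rep("x", 1), rep("y", 5))  # rows where target == 1

# Count levels over the combined data, most frequent first.
combined_counts <- sort(table(c(a0, a1)), decreasing = TRUE)
shared_levels <- names(combined_counts)

# Apply the same level order to both partitions.
f0 <- factor(a0, levels = shared_levels)
f1 <- factor(a1, levels = shared_levels)
levels(f0)
```

With a shared order, the same level occupies the same position in both bars, which is what makes scanning down a dozen categories easier.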
While running the full dataset it gives the following errors.
Data frame characteristics:
df = 78550 rows 10 columns
str(df)
Classes ‘spec_tbl_df’, ‘tbl_df’, ‘tbl’ and 'data.frame': 78550 obs. of 10 variables:
$ date : Date, format: "2015-02-15" "2015-02-15" "2015-02-15" ...
$ author : Factor w/ 5 levels " vazio","joni",..: 2 2 3 3 2 3 3 2 3 2 ...
$ message : chr "Oi Nubi" "Bom dia" "Bom dia!" "\U0001f60a" ...
$ msn_lengh : int 7 7 9 1 52 30 11 34 24 31 ...
$ day : int 15 15 15 15 15 15 15 15 15 15 ...
$ week : num 7 7 7 7 7 7 7 7 7 7 ...
$ month : num 2 2 2 2 2 2 2 2 2 2 ...
$ year : num 2015 2015 2015 2015 2015 ...
$ question_flag: chr "N" "N" "N" "N" ...
$ laughs : chr "N" "N" "N" "N" ...
inspect_cat(df)
Column (2/5): author
Error: Tibble columns must have consistent lengths, only values of length one are recycled:
value
prop
rlang::last_error() to see a backtrace
When I sample it to 10k rows it works. Still looking into the problem.
Many thanks for this great package. Trying to install it on a MacBook Pro (macOS Mojave 10.14.2), I am getting the following error:
** libs
/usr/local/opt/llvm/bin/clang++ -fopenmp -I"/Library/Frameworks/R.framework/Resources/include" -DNDEBUG -I"/Library/Frameworks/R.framework/Versions/3.5/Resources/library/Rcpp/include" -I/usr/local/opt/gettext/include -I/usr/local/opt/llvm/include -fPIC -g -O3 -Wall -pedantic -std=c++11 -mtune=native -pipe -c RcppExports.cpp -o RcppExports.o
In file included from RcppExports.cpp:4:
In file included from /Library/Frameworks/R.framework/Versions/3.5/Resources/library/Rcpp/include/Rcpp.h:27:
In file included from /Library/Frameworks/R.framework/Versions/3.5/Resources/library/Rcpp/include/RcppCommon.h:29:
In file included from /Library/Frameworks/R.framework/Versions/3.5/Resources/library/Rcpp/include/Rcpp/r/headers.h:59:
In file included from /Library/Frameworks/R.framework/Versions/3.5/Resources/library/Rcpp/include/Rcpp/platform/compiler.h:100:
In file included from /usr/local/Cellar/llvm/6.0.0/include/c++/v1/cmath:305:
/usr/local/Cellar/llvm/6.0.0/include/c++/v1/math.h:301:15: fatal error: 'math.h' file not found
#include_next <math.h>
^~~~~~~~
1 error generated.
make: *** [RcppExports.o] Error 1
ERROR: compilation failed for package ‘inspectdf’
* removing ‘/Library/Frameworks/R.framework/Versions/3.5/Resources/library/inspectdf’
Error in i.p(...) :
(converted from warning) installation of package ‘/var/folders/dr/93kfhwds3l91jn94w45p2vc00000gp/T//RtmpHQU9BJ/file20f6d693429/inspectdf_0.0.0.9000.tar.gz’ had non-zero exit status
Any help would be much appreciated.
Here is my sessionInfo() in case this would be helpful:
sessionInfo()
R version 3.5.0 (2018-04-23)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS 10.14.4
Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] bindrcpp_0.2.2 roomba_0.1.0 ssh.utils_1.0 rio_0.5.16 fs_1.2.6 rebus_0.1-3 getPass_0.2-2 httr_1.4.0 jsonlite_1.6 sjmisc_2.7.6
[11] naniar_0.4.1 readxl_1.1.0 janitor_1.1.1 data.table_1.11.8 forcats_0.3.0 stringr_1.3.1 dplyr_0.7.8 purrr_0.2.5 readr_1.3.0 tidyr_0.8.2
[21] tibble_1.4.2 ggplot2_3.1.0 tidyverse_1.2.1 opencgaR_1.4.0
loaded via a namespace (and not attached):
[1] nlme_3.1-137 usethis_1.4.0 lubridate_1.7.4 devtools_2.0.1 rprojroot_1.3-2 tools_3.5.0 backports_1.1.2 R6_2.3.0
[9] sjlabelled_1.0.15 rebus.base_0.0-3 lazyeval_0.2.1 colorspace_1.3-2 withr_2.1.2 tidyselect_0.2.5 prettyunits_1.0.2 processx_3.2.1
[17] curl_3.2 compiler_3.5.0 cli_1.0.1 rvest_0.3.2 xml2_1.2.0 desc_1.2.0 scales_1.0.0 callr_3.1.0
[25] digest_0.6.18 foreign_0.8-70 rebus.unicode_0.0-2 stringdist_0.9.5.1 base64enc_0.1-3 pkgconfig_2.0.2 htmltools_0.3.6 sessioninfo_1.1.1
[33] rlang_0.3.0.1 rstudioapi_0.8 shiny_1.2.0 bindr_0.1.1 generics_0.0.2 zip_1.0.0 magrittr_1.5 Rcpp_1.0.0
[41] munsell_0.5.0 prediction_0.3.6.1 visdat_0.5.2 stringi_1.2.4 snakecase_0.9.2 pkgbuild_1.0.2 plyr_1.8.4 grid_3.5.0
[49] parallel_3.5.0 promises_1.0.1 crayon_1.3.4 rebus.datetimes_0.0-1 miniUI_0.1.1.1 lattice_0.20-35 haven_2.0.0 hms_0.4.2
[57] ps_1.2.1 pillar_1.3.0 rebus.numbers_0.0-1 pkgload_1.0.2 glue_1.3.0 packrat_0.5.0 remotes_2.0.2 modelr_0.1.2
[65] httpuv_1.4.5 testthat_2.0.1 cellranger_1.1.0 gtable_0.2.0 assertthat_0.2.0 openxlsx_4.1.0 mime_0.6 xtable_1.8-3
[73] broom_0.5.1 later_0.7.5 memoise_1.1.0
Hi Alastair,
inspectdf Pkg is Great!.
This show_plot() example works fine:
x <- inspect_num(starwars)
show_plot(x)
But... this ex. does not work:
(it just shows 4 empty histograms...)
x <- inspect_num(iris)
show_plot(x)
Neither does this example:
(it just shows 11 empty histograms...)
x <- inspect_num(mtcars)
show_plot(x)
Help! What am I missing?.
SFd99
San Francisco
Ubuntu Linux, R 351, Rstudio 1.1.463,
inspectdf PKG ver 0.0.2 (installed from CRAN)
----
Related to #6. Remove duplicated factor levels internally with a warning to user.
In cases where two cat columns have levels that do not appear in the other, bundle together the non-shared values?
In plot_na.R there is a typo, "Prevalance of NAs" should be "Prevalence of NAs".
Cheers.
Hi. The color for the last factor level is white; especially for groups with two levels, the first one is coloured but the second is white. It would be nice if the second were not white but a lighter shade of the first color, meaning the color palette should not go all the way to white. Thanks.
When running the following code
withr::with_options(
  list(
    warnPartialMatchArgs = TRUE,
    warnPartialMatchDollar = TRUE,
    warnPartialMatchAttr = TRUE
  ),
  {
    iris |>
      inspectdf::inspect_mem() |>
      inspectdf::show_plot()
  }
)
I get multiple instances of the following warning:
Warning in format.object_size(size, standard = "auto", unit = "auto", digits = 2L) :
partial argument match of 'unit' to 'units'
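The warning says that a call passes unit = where the formatting method for object sizes expects units =. A minimal base R sketch of the fully spelled-out call is below; where exactly the partial match occurs inside inspectdf is an assumption based on the warning text only.

```r
# Turn on the same check as in the report, remembering the old setting.
old_opts <- options(warnPartialMatchArgs = TRUE)

size <- object.size(iris)

# Spelling the argument out in full (`units`, not `unit`) avoids the
# partial-match warning.
out <- format(size, standard = "auto", units = "auto", digits = 2L)
out

# Restore the previous option value.
options(old_opts)
```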
Check whether functions cope with column names with spaces. If ok, add tests.
For scalability, when working with data with many columns, it would be great if there would be an option to rotate the inspect_imb() plot. Currently, the best way to achieve this is by using ggplot2::coord_flip(), but that messes up the label positioning (vertical and overlapping) and it's not possible to reverse the levels on the discrete y-axis.
# Load
library(inspectdf)
library(dplyr)
library(ggplot2)
# Imbalance plot
inspect_imb(starwars) %>%
show_plot()
# Rotate
inspect_imb(starwars) %>%
show_plot() +
coord_flip()
Created on 2020-02-24 by the reprex package (v0.3.0)
Is there a way to make the y axis always go from 0 to 100%? This would be ideal when comparing different plots from different data frames. Thanks!
Is there a way of not sorting the plots when enabling the option show_plot = TRUE? It would be really useful.
Thanks and congrats on the package!
I liked very much your correlation plot!
It would be nice to have the option of choosing alternative methods for calculating the correlation in inspect_cor (e.g. "spearman" and "kendall").
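For reference, base R's cor() already exposes these alternatives through its method argument, so the request amounts to passing that choice through. The sketch below uses stats::cor() directly on two mtcars columns; it makes no claim about what inspect_cor() accepts.

```r
x <- mtcars$mpg
y <- mtcars$wt

# The three methods supported by stats::cor().
methods <- c("pearson", "spearman", "kendall")

# One coefficient per method, named by method.
cors <- vapply(
  methods,
  function(m) cor(x, y, method = m, use = "pairwise.complete.obs"),
  numeric(1)
)
cors
```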
Hi @alastairrushworth, thanks for the great package!
Would you be interested in a pull request that adds automatic text sizing to the plots?
plot_cat() for mpg currently looks like this:
If plot_cat() instead used geom_fit_text() from my package ggfittext, the text could be automatically sized like this:
geom_fit_text() supports options for hiding text below a minimum size etc. that could be passed on through show_plot().
If you think this would be a worthwhile addition I can submit a pull request that adds geom_fit_text() to all the plot_*() functions.
Hello,
Firstly, thanks for your excellent and very handy package. I use it in my normal modeling flow and when I teach, I highlight it as the de-facto solution for automatic EDA.
When using it in rMarkdown reports, the histograms for the numerical variables get very small.
I think it would be a lot better if the number of columns for the charts could be parametrized: instead of maximizing the plotting area, extending the output lengthwise would make the charts bigger and more readable.
Thanks again,
Carlos.
Great package for exploratory data analysis, thanks for sharing!
Could the package enable filtering results before plotting? I'd love to be able to do something like this:
# Load packages
library(inspectdf)
library(dplyr)
# Single dataframe summary
inspect_cor(starwars)
#> # A tibble: 3 x 7
#> col_1 col_2 corr p_value lower upper pcnt_nna
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 birth_year mass 0.478 0.00602 0.177 0.697 41.4
#> 2 birth_year height -0.400 0.0114 -0.625 -0.113 49.4
#> 3 mass height 0.134 0.316 -0.127 0.377 67.8
# Filter
inspect_cor(starwars) %>%
filter(abs(corr) > 0.2)
#> # A tibble: 2 x 7
#> col_1 col_2 corr p_value lower upper pcnt_nna
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 birth_year mass 0.478 0.00602 0.177 0.697 41.4
#> 2 birth_year height -0.400 0.0114 -0.625 -0.113 49.4
# Filter and plot
inspect_cor(starwars) %>%
filter(abs(corr) > 0.2) %>%
show_plot()
#> Error: Tibble columns must have consistent sizes, only values of size one are recycled:
#> * Size 2: Existing data
#> * Size 3: Column `pair`
Created on 2020-02-21 by the reprex package (v0.3.0)
When you have a lot of features, you want to focus only on relevant correlations and avoid clutter.
Hello,
show_plot() is failing to render a plot out of inspect_num() when the dataset is grouped.
The mid object in the code at plot_num.R#L173 is actually not defined.
library(dplyr)
library(inspectdf)
mtcars %>% dplyr::group_by(am) %>% inspect_num() %>% show_plot()
#> Error in layer(data = data, mapping = mapping, stat = stat, geom = GeomBar, : object 'mid' not found
Created on 2021-08-18 by the reprex package (v2.0.1)
Using the latest package version on github, I'm getting errors for inspect_imb() when working with factors:
# Load
library(inspectdf)
library(dplyr)
# Change character to factor
starwars_factor <- starwars %>%
mutate_if(is.character, as.factor)
# Imbalance plot
inspect_imb(starwars_factor)
#> Error: Tibble columns must have consistent sizes, only values of size one are recycled:
#> * Size 13: Existing data
#> * Size 17: Column `prop`
Created on 2020-02-24 by the reprex package (v0.3.0)
Encountering the following error when using
inspect_num(train, valid) %>%
  show_plot()
Error message in R:
Error in grid.Call(C_convert, x, as.integer(whatfrom), as.integer(whatto), :
Viewport has zero dimension(s)
Hi Alastair,
thank you for the package! I think that inspect_cat / show_plot would benefit from the possibility of declaring a categorical variable or integer variable ordered, so that its levels are plotted in a prespecified order (like in an ordered factor). This would be especially beneficial for features with only a few ordered categories, such as very satisfied, satisfied, ..., dissatisfied.
Best, Ulrike
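For reference, this is how a prespecified level order is declared in base R. Whether inspect_cat() / show_plot() would honour it is exactly the open question of this request; the example values are made up.

```r
# Illustrative survey responses (made-up data).
responses <- c("satisfied", "dissatisfied", "very satisfied", "satisfied")

# The order in which the categories should appear.
lvl_order <- c("very satisfied", "satisfied", "dissatisfied")

# An ordered factor carries the level order with the data.
responses_ord <- factor(responses, levels = lvl_order, ordered = TRUE)

# Tabulation follows the declared order rather than alphabetical order.
table(responses_ord)
```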
Thank you for the great package. I've recently observed a bug where errors like the following are thrown when a data.frame with a logical column is provided.
randnums <- c(0.2, 0.48, 0.91, -1.93, 0.75)
booleans <- c(TRUE, FALSE, FALSE, TRUE, FALSE)
df_without_na <- data.frame(a = randnums, b = booleans)
df_with_na <- data.frame(a = c(NA_real_, randnums, NA_real_), b = c(NA, booleans, NA))
df_without_na
#> a b
#> 1 0.20 TRUE
#> 2 0.48 FALSE
#> 3 0.91 FALSE
#> 4 -1.93 TRUE
#> 5 0.75 FALSE
df_with_na
#> a b
#> 1 NA NA
#> 2 0.20 TRUE
#> 3 0.48 FALSE
#> 4 0.91 FALSE
#> 5 -1.93 TRUE
#> 6 0.75 FALSE
#> 7 NA NA
library(magrittr)
library(inspectdf)
inspect_na(df_without_na) %>%
show_plot()
#> Error in if (!("ymin" %in% names(data)) | (all(data$ymin == data$ymax) & : missing value where TRUE/FALSE needed
inspect_na(df_with_na) %>%
show_plot()
#> Error in if (!("ymin" %in% names(data)) | (all(data$ymin == data$ymax) & : missing value where TRUE/FALSE needed
Hi Alastair,
I ran into this issue where inspect_num() is not able to handle cases when the numeric variable has a different range in the comparison data set df2. It seems the histogram breaks are computed on the range seen in df1 alone and then applied to df2 rather than computed on the range of df1 and df2 jointly.
Here's a minimal reprex:
library(inspectdf)
data("starwars", package = "dplyr")
starwars1 <- starwars[, "height"]
starwars2 <- starwars[, "height"] + 100
inspect_num(starwars1, starwars2)
#> Error in hist.default(col_i, plot = FALSE, right = TRUE, breaks = hist_breaks): some 'x' not counted; maybe 'breaks' do not span range of 'x'
Created on 2023-07-07 with reprex v2.0.2
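The joint-range idea described above can be sketched in base R as follows. This is an illustration with made-up values, not the package's actual binning code; pretty() is just one reasonable way to pick breaks that span the combined range.

```r
# Made-up values standing in for the same variable in df1 and df2.
x1 <- c(150, 165, 172, 180, 196)
x2 <- x1 + 100

# Compute the breaks once, on the range of both vectors together.
joint_range <- range(c(x1, x2), na.rm = TRUE)
hist_breaks <- pretty(joint_range, n = 20)

# The same breaks now span every value in either vector, so hist() no
# longer fails with "some 'x' not counted".
h1 <- hist(x1, breaks = hist_breaks, plot = FALSE, right = TRUE)
h2 <- hist(x2, breaks = hist_breaks, plot = FALSE, right = TRUE)
```

Because both histograms share identical bins, their counts can be compared bin by bin.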
It appears that "" values cause the following error:
Error in if (tg$gp$fontsize < x$min.size) return() :
missing value where TRUE/FALSE needed
I am able to update all of "" to "DATA NOT PROVIDED" and it runs fine.
Thanks for the great tool!
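The workaround described above (replacing "" with a placeholder label) could be applied to every character column like this. The helper name recode_empty is hypothetical and not part of inspectdf; NA values are deliberately left untouched.

```r
# Hypothetical helper (not part of inspectdf): replace empty strings in
# character vectors with a placeholder label, leaving NA as NA.
recode_empty <- function(x, label = "DATA NOT PROVIDED") {
  if (is.character(x)) {
    x[!is.na(x) & x == ""] <- label
  }
  x
}

# Made-up example data with one empty string and one NA.
df <- data.frame(
  id = 1:4,
  grp = c("a", "", "b", NA),
  stringsAsFactors = FALSE
)

# Apply to all columns while keeping the data frame structure.
df[] <- lapply(df, recode_empty)
df$grp
```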
Hi Alastair,
the inspectdf PKG is really USEFUL!.
But a show_plot() quirk...
try:
unique(mtcars$carb)
[1] 4 1 2 3 6 8
inspect_num(mtcars) %>% show_plot()
See?.
The vertical bars in the CARB plot
are not "aligned" with the unique value markers below,
(in the x-axis).
The bars are all slightly "displaced" to the right...
(not "on top" of the unique CARB values: 4 1 2 3 6 8 ).
Even zooming the size of the Rstudio [Plots] Panel
doesn't help.
Same problem with columns for:
AM, GEAR , VS and CYL ...etc
Hope you can help.
Thanks Alastair!
sfd99
San Francisco
latest Rstudio/R/Ubuntu Linux
inspectdf 0.0.11
I'm not quite sure why, but inspect_imb fails a lot for me with the same error. See below for a simple example.
library(inspectdf)
inspect_imb(iris)
# Error in sapply(df_cat_fact, are_lvls_unq) :
# object 'df_cat_fact' not found