dcomtois / summarytools
R Package to Quickly and Neatly Summarize Data
When I use the ctable function with the pipe operator %$% from the package magrittr, an error occurs: Error: $ operator is invalid for atomic vectors
library(summarytools)
library(magrittr)
tobacco %$% ctable(smoker, diseased)
Traceback
14. na.omit(c(parse_info_y$var_names, deparse(dnn[[2]]))) at ctable.R#194
13. ctable(smoker, diseased)
12. eval(substitute(expr), data, enclos = parent.frame())
11. eval(substitute(expr), data, enclos = parent.frame())
10. with.default(., ctable(smoker, diseased))
9. with(., ctable(smoker, diseased))
8. function_list[k]
7. withVisible(function_list[k])
6. freduce(value, `_function_list`)
5. `_fseq`(`_lhs`)
4. eval(quote(`_fseq`(`_lhs`)), env, env)
3. eval(quote(`_fseq`(`_lhs`)), env, env)
2. withVisible(eval(quote(`_fseq`(`_lhs`)), env, env))
The error is raised by the na.omit call on line 194 of ctable.R:
y_name <- na.omit(c(parse_info_y$var_names, deparse(dnn[[2]])))[1]
Many thanks.
This is a feature request. It would be great to add an argument to limit the statistics (mean, sd, etc.). For example, if someone only wants to return mean, median and sd, then the argument could be something like
stats = c('mean', 'median', 'sd')
The final descriptive table would return only the statistics listed above instead of all of them. The default could be stats = "all".
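To illustrate the request above, here is a minimal base R sketch of what a stats argument could look like. The function describe_numeric and its set of available statistics are hypothetical and are not part of summarytools; only the stats argument name and its "all" default come from the request.

```r
# Hypothetical helper (not summarytools code): compute only the
# requested statistics for a numeric vector.
describe_numeric <- function(x, stats = "all") {
  available <- list(
    mean   = function(v) mean(v, na.rm = TRUE),
    median = function(v) median(v, na.rm = TRUE),
    sd     = function(v) sd(v, na.rm = TRUE),
    min    = function(v) min(v, na.rm = TRUE),
    max    = function(v) max(v, na.rm = TRUE)
  )
  # "all" expands to every statistic defined above
  if (identical(stats, "all")) {
    stats <- names(available)
  }
  # Fail early on names that are not supported
  unknown <- setdiff(stats, names(available))
  if (length(unknown) > 0) {
    stop("Unknown statistic(s): ", paste(unknown, collapse = ", "))
  }
  # Named numeric vector, in the order the caller asked for
  vapply(available[stats], function(f) f(x), numeric(1))
}

print(describe_numeric(mtcars$mpg, stats = c("mean", "median", "sd")))
```

Keeping the statistics in a named list makes it easy to validate the requested names and to preserve the caller's ordering.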
Dear Dominic,
I found the package "summarytools" very useful!
However, I also found that CV values are calculated inappropriately in the package. When viewing the relevant code contained in "descr.R", I found that CV values are calculated using
ifelse("cv" %in% stats, variable.mean / variable.sd, NA)
As you know, the correct formula for the coefficient of variation is CV = Standard Deviation (σ) / Mean (μ), which is why this chunk needs to be replaced by
ifelse("cv" %in% stats, variable.sd / variable.mean, NA)
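For reference, the formula quoted above can be checked in base R with a small example. The helper name coef_var is an assumption made for this sketch; it is not a summarytools function.

```r
# Coefficient of variation: standard deviation divided by the mean.
# coef_var is a hypothetical helper name used only in this sketch.
coef_var <- function(x) {
  sd(x, na.rm = TRUE) / mean(x, na.rm = TRUE)
}

x <- c(2, 4, 4, 4, 5, 5, 7, 9)   # mean 5, sample variance 32/7
print(coef_var(x))               # sd / mean, about 0.43
print(mean(x) / sd(x))           # the inverted ratio, about 2.34
```

The two ratios differ by a large factor here, so an inverted formula is easy to spot on data where the mean is well above the standard deviation.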
Best regards,
Payam
New to the package. Very interesting contribution! I may have missed this: is there a way to select the columns that freq returns? I can remove NAs with report.nas = FALSE. I know I can drop the Totals row with totals = FALSE. Is there an option of the freq function to keep/drop the percentage column and/or the cumulative percentage column? Something like report.cum = FALSE and report.pct = FALSE...
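Until such options exist, the idea can be sketched in base R. The function freq_table below is hypothetical, and report.pct / report.cum are the argument names proposed in this request, not existing freq arguments.

```r
# Hypothetical frequency table with optional percentage columns.
# This is a base R sketch, not the summarytools implementation.
freq_table <- function(x, report.pct = TRUE, report.cum = TRUE) {
  counts <- table(x, useNA = "no")            # missing values are dropped
  out <- data.frame(
    value = names(counts),
    freq  = as.vector(counts),
    stringsAsFactors = FALSE
  )
  if (report.pct) {
    out$pct <- 100 * out$freq / sum(out$freq)
  }
  if (report.cum) {
    out$cum_pct <- 100 * cumsum(out$freq) / sum(out$freq)
  }
  out
}

print(freq_table(mtcars$cyl, report.cum = FALSE))
```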
I love the summaries this tool generates in RStudio. Thanks!
My problem is that using this with Jupyter doesn't seem to work. Reproduction below:
Inspecting the data frame:
ddd = summarytools::dfSummary(mtcars)
ddd$Variable
Produces this:
[1] "mpg\\\n[numeric]" "cyl\\\n[numeric]" "disp\\\n[numeric]" "hp\\\n[numeric]" "drat\\\n[numeric]"
[6] "wt\\\n[numeric]" "qsec\\\n[numeric]" "vs\\\n[numeric]" "am\\\n[numeric]" "gear\\\n[numeric]"
[11] "carb\\\n[numeric]"
Which works great in RStudio or the command line, poorly in Jupyter.
I am struggling to figure out whether there is simply a parameter I am missing. Or maybe there is a method I can pipe this output through to unescape those characters?
If I figure it out I'll post a solution.
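One possible workaround, assuming the strings contain a literal backslash followed by a newline (which is how the printed output above reads), is to replace that pair with a space using base R. This is a sketch on hand-written sample strings, not a documented summarytools option.

```r
# Sample strings mirroring the printed output above: in R source,
# "\\\n" is a literal backslash followed by a newline character.
labels <- c("mpg\\\n[numeric]", "cyl\\\n[numeric]")

# fixed = TRUE matches the two characters literally instead of as a regex
clean <- gsub("\\\n", " ", labels, fixed = TRUE)
print(clean)
```

The same gsub call could be applied to any character column of the summary before handing it to the notebook's display function.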
In the data frame summary, if a column contains 115 distinct values (such as countries) and 99% of the values are one specific country, it would be very useful to show which country is the most frequent. In general, I believe it is useful to display the most frequent values.
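A rough sketch of the suggestion above in base R follows. The helper most_frequent is hypothetical, and percentages are computed against all entries, including missing ones.

```r
# Hypothetical helper: the n most frequent values of a vector,
# with their counts and share of all entries.
most_frequent <- function(x, n = 3) {
  counts <- sort(table(x), decreasing = TRUE)
  top <- head(counts, n)
  data.frame(
    value = names(top),
    freq  = as.vector(top),
    pct   = round(100 * as.vector(top) / length(x), 1),
    stringsAsFactors = FALSE
  )
}

countries <- c(rep("NL", 99), "BE")
print(most_frequent(countries, n = 2))
```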
Thanks again for your great package. Is it possible to add some suggestion on how to render the output in word or html using rmarkdown in RStudio?
Best
Hello,
I have a .csv data file that I am reading into a data frame. When I run the dfSummary() function in the console or render on RMarkdown, although some integers are only two distinct values with 100% valid entries, the frequencies(%) are not printed on the output. Interestingly, some integers with <10 values will have printed out frequencies, but there really isn't any notable pattern to why these will print whereas the majority will not. When using an older version of summarytools (0.6.5), this frequency issue is not a problem. Is there something I can do besides go through all of my variables and convert them to factors to resolve this issue? Thanks and please let me know if I need to clarify anything. I'm relatively new to programming and R. :)
Thanks for your great package. As a suggestion, I would like to propose adding a character vector parameter, with default values, to the descr function to state explicitly which statistics are tabulated.
When using the data frame summary, I encountered a dataset with rows in which every column is empty (NA). It would be handy to mention this at the top of the data frame summary when it occurs.
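Detecting such rows is straightforward in base R; the following is a sketch on a made-up data frame, not summarytools code.

```r
# Made-up example data: the second row has no values at all
df <- data.frame(a = c(1, NA, 3), b = c("x", NA, NA))

# A row is "empty" when it has zero non-missing cells
empty_rows <- rowSums(!is.na(df)) == 0

print(sum(empty_rows))      # number of completely empty rows
print(which(empty_rows))    # their positions
```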
I'm wondering whether it is possible to control which Stats to be shown in the case of numerical variables when using dfSummary().
Being able to control which stats are shown for numerical variables is close to a necessity, particularly in the case of CV, because CV values should not be calculated for data on a logarithmic scale!
data(exams)
with(exams, by(english, gender, descr))
returns descriptive statistics for "english" for each gender. However, the statistics table shows Var1 as the column name instead of showing the actual named variable (which would be english, in this case).
Var1
Mean 76.66
Std.Dev 9.35
Min 55.9
Max 93.2
Median 77.1
mad 7.56
IQR 8.2
CV 8.2
Skewness -0.25
SE.Skewness 0.58
Kurtosis -0.25
Was it intentional? If not, it would probably be a good idea to display the actual variable name
Getting error "Error in ctable... : Could not find function "ctable""
Dear Dominic,
first of all, I want to say that your package is great! Thank you!!!
Second, I have noticed that the two options of dfSummary do not seem to work when set to FALSE.
Am I doing something wrong?
here is an example with iris
view(dfSummary(iris, varnumbers = FALSE, valid.col = FALSE, na.col = FALSE , omit.headings=TRUE))
Under some circumstances, the html graph can take a (very) long time to generate. If you do not need the graphs, just set graph.col = FALSE until the issue is resolved. Thanks to Adam Medcalf for pointing this out.
In the Data Frame Summary it would be very useful to identify which column contains the 'primary key' (as it is called in databases). A column could be the primary key when the number of rows in the data frame equals the number of distinct values. Of course not every table has a primary key, but that is also useful to mention.
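The rule described above (number of distinct values equals the number of rows) can be sketched in base R. The helper candidate_keys is hypothetical, and excluding columns with missing values is an added assumption, since a key normally cannot be NA.

```r
# Hypothetical helper: names of columns that could serve as a primary key,
# i.e. columns with no missing values and as many distinct values as rows.
candidate_keys <- function(df) {
  is_key <- vapply(df, function(col) {
    !anyNA(col) && length(unique(col)) == nrow(df)
  }, logical(1))
  names(df)[is_key]
}

orders <- data.frame(
  order_id = 1:4,
  customer = c("a", "a", "b", "b")
)
print(candidate_keys(orders))
```

A column that passes this check is only a candidate: uniqueness in the current data does not prove that it was designed as a key.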
I read a clean dataset in from SQL, and tried the below:
library(summarytools)
view(dfSummary(df))
Error in plot.window(xlim, ylim, "", ...) : need finite 'ylim' values
In addition: Warning messages:
1: In n * h : NAs produced by integer overflow
2: In breaks[-1L] + breaks[-nB] : NAs produced by integer overflow
3: In n * h : NAs produced by integer overflow
4: In breaks[-1L] + breaks[-nB] : NAs produced by integer overflow
The descr() function from the R package summarytools generates common central tendency statistics and measures of dispersion for numerical data in R.
When I use descr() with by() in a Shiny app, the names of the variables (features) contained in the data disappear and are not displayed. Instead, the names are replaced by Var1, Var2, Var3, etc.
I do not really understand why the names disappear when I run this code in the Shiny app (see below).
source("https://bioconductor.org/biocLite.R")
biocLite("ALL")
biocLite("Biobase")
install.packages('devtools')
library(devtools)
install_github('dcomtois/summarytools')
library(shiny)
library("summarytools")
library(Biobase)
library("ALL")
server <- function(input, output, session) {
  output$summaryTable <- renderUI({
    #-- Load the ALL data
    data(ALL)
    #-- Subset
    eset_object <- ALL[1:3, ] # choose only 3 variables
    #-- The group of interest
    eset_groups <- "BT"
    ALL_stats_by_BT <- by(data = as.data.frame(t(exprs(eset_object))),
                          INDICES = (pData(eset_object)[, eset_groups]),
                          FUN = descr, stats = "all",
                          transpose = TRUE)
    view(ALL_stats_by_BT,
         method = 'render',
         omit.headings = FALSE,
         bootstrap.css = FALSE)
  })
}
ui <- fluidPage(theme = "dfSummary.css",
  fluidRow(
    uiOutput("summaryTable")
  )
)
As a side note, if one assigns the data to a global variable (eset_object <<- ALL [1:3,]), the variable names are displayed. But this is not a solution to the problem, as it is wise to avoid global variables!
When I run
view(dfSummary(data))
in the console I get something like this
but when I put
view(dfSummary(data), method = "render")
in my Rmd (html_output), I get this:
I think adding a <br> at the end of each line in Stats / Values and Freqs, to get the same result as in the RStudio Viewer, would be very good :)
Thanks for your package !
Especially in the context of rendering html for markdown: right now the size of the graphs responds to the window size, and the graph.magnif parameter doesn't enforce the actual wanted size.
I found that the name of the Group variable (and the group level) is not retrieved or updated on the UI when selecting a new Group variable in the app. It should be noted that the corresponding table (calculations) does update when a new group variable is selected.
Moreover, the group variable is displayed in a static manner. The Shiny app below illustrates the issue to some extent. For example, the Group variable is displayed on the UI as shown below:
Group: (pData(eset_object)[, eset_groups]) = B
# Install packages
source("https://bioconductor.org/biocLite.R")
biocLite("ALL")
biocLite("Biobase")
install.packages('devtools')
devtools::install_github('dcomtois/summarytools')
# Load packages
library(shiny)
library(summarytools)
library(Biobase)
library(ALL)
# Shiny Server
server <- function(input, output, session) {
  output$summaryTable <- renderUI({
    #-- Load the ALL data
    data(ALL)
    #-- Subset
    eset_object <- ALL[1:3, ] # choose only 3 variables
    #-- The group of interest
    eset_groups <- "BT"
    # print(rownames (eset_object)) # print variable names
    ALL_stats_by_BT <- by(data = as.data.frame(t(exprs(eset_object))),
                          INDICES = (pData(eset_object)[, eset_groups]),
                          FUN = descr, stats = "all",
                          transpose = TRUE)
    view(ALL_stats_by_BT,
         method = 'render',
         omit.headings = FALSE,
         bootstrap.css = FALSE)
  })
}
# Shiny UI
ui <- fluidPage(theme = "dfSummary.css",
  fluidRow(
    uiOutput("summaryTable")
  )
)
# Launch
shinyApp(ui, server)
Of note, if you replace eSet_object <- relevant_est() with eSet_object <<- relevant_est() (that is, assign it in the global environment), the Data Frame option is retrieved and displayed on the UI, as shown below:
Data Frame: as.data.frame(t(exprs(eSet_object)))
Group: (pData(eSet_object)[, eSet_groups]) = B
In the data frame summary, when a column contains, for example, an amount in euros, I would suggest also adding the sum of the values to the data frame summary.
Maybe add plots like in vcd?
The current version of the Data Frame Summary shows the number of rows. In many cases it is very useful to know how many unique rows there are. For example, the iris dataset contains 150 rows, but there is one duplicate row (nrow(unique(iris)) gives 149). It would be very helpful to add this to the top of the report.
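The iris figures quoted above can be reproduced with base R; this is only a sketch of the counts such a report line would need.

```r
# Total rows, unique rows, and duplicated rows in the built-in iris data
n_rows      <- nrow(iris)
n_unique    <- nrow(unique(iris))
n_duplicate <- sum(duplicated(iris))

print(c(rows = n_rows, unique = n_unique, duplicates = n_duplicate))

# Show the duplicated row itself
print(iris[duplicated(iris), ])
```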
When in the data frame summary an integer column contains for example 110 distinct values (0 < 43 < 109) it is useful to note that it is the sequence 0:109.
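A check for this case can be sketched in base R. The helper is_full_sequence is hypothetical: it reports whether the distinct non-missing values cover every integer between the minimum and the maximum.

```r
# Hypothetical helper: TRUE when the distinct values of x form an
# unbroken integer sequence min(x):max(x).
is_full_sequence <- function(x) {
  x <- x[!is.na(x)]
  if (length(x) == 0 || any(x != round(x))) {
    return(FALSE)
  }
  vals <- unique(x)
  length(vals) == (max(vals) - min(vals) + 1)
}

print(is_full_sequence(sample(0:109)))   # shuffled 0:109 still qualifies
print(is_full_sequence(c(0, 2, 3)))      # 1 is missing
```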
I keep getting this error using dfSummary -- and it has happened for all of my data. All of the code worked before...
x was converted to a data frame
Error in sect_title[[2]] : subscript out of bounds
view(dfSummary(hehe, graph.col = FALSE), file = "data_summary.html", append = TRUE, footnote = NA)
Error in plot.window(xlim, ylim, "", ...) : need finite 'ylim' values
When using the data frame summary it is very handy when there are two columns that contain an ID and a description of something. For example when column 3 has distinct ID's 1 and 2 and column 123 contains the distinct values "MALE" and "FEMALE" it is very practical to mention that column 3 and column 123 are related.
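One way to test whether two columns are related in this sense is to check for a one-to-one mapping between their values. The helper is_one_to_one is a hypothetical name used for this base R sketch.

```r
# Hypothetical helper: TRUE when every value of a pairs with exactly one
# value of b and vice versa (e.g. an ID column and its description).
is_one_to_one <- function(a, b) {
  pairs <- unique(data.frame(a = a, b = b, stringsAsFactors = FALSE))
  !any(duplicated(pairs$a)) && !any(duplicated(pairs$b))
}

id    <- c(1, 2, 1, 2, 2)
label <- c("MALE", "FEMALE", "MALE", "FEMALE", "FEMALE")
print(is_one_to_one(id, label))
```

Running this over every pair of columns grows quadratically with the number of columns, so a real implementation would probably restrict it to columns with few distinct values.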
hola!
excited so a couple more suggestions:
hth
Using summarytools 0.8.6, I am getting an error on some variables where everything is either 0 or 1 and there is also a missing value. I have other character and factor vectors with missing values, and those are being handled correctly.
Error in prettyNum(.Internal(format(x, trim, digits, nsmall, width, 3L, :
invalid 'nsmall' argument
Reproducible example
dt <- data.frame(finger.involved = c(0, 0, 1, 1, 0, 0, 0, 0), toe.involved = c(0, 0, 1, 1, 1, 1, 0, 0))
dfSummary(dt)
#so far so good, but then look what happens when an NA is inserted
dt <- data.frame(finger.involved = c(0, 0, 1, 1, 0, 0, 0, NA), toe.involved = c(0, 0, 1, 1, 1, 1, 0, 0))
dfSummary(dt)
Especially when using the data frame summary with a lot of columns it is handy to mention the number of columns on the top of the page, e.g. next to the number of rows.
In the data frame summary, when a column contains date/time information, I would suggest giving a distribution of the days (e.g. Monday 6%, Tuesday 12%, etc.), months, hours, etc. This easily reveals seasonal patterns, workday behavior, etc.
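As an illustration of the suggestion above, a weekday distribution can be computed in base R. The dates are made up, and the "%u" format code (ISO weekday number, Monday = 1) is used instead of weekday names so the result does not depend on the locale.

```r
# Made-up data: 14 consecutive days starting on Monday 2018-01-01
dates <- as.Date("2018-01-01") + 0:13

# ISO weekday number as a factor, so days with no data still show up as 0
weekday <- factor(format(dates, "%u"), levels = as.character(1:7))

weekday_counts <- table(weekday)
print(round(100 * prop.table(weekday_counts), 1))   # percent per weekday
```

The same pattern works for months with format(dates, "%m") and, for date-time values, for hours with format(x, "%H").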
I am getting an error using ctable every time I try and set useNA to "no". It works just fine for "always" or "ifany". Error is below:
Error in ctable(stjean$pnc5_new, stjean$preterm, prop = "t", useNA = "no") :
'useNA' must be one of 'ifany', 'always', or 'no'
I see that freq always prints NA and Cumulative Valid. I would suggest adding a boolean option, ignore.na = FALSE, that, when TRUE, ignores NA (and also does not print the "Valid" frequency columns).
It would be useful to have a distinct count of unique values of either factor or character column.
For example, if I have a column labeled email, I would like to know how many unique emails I have in that column.
Here is an example of my output.
When the field type is Integer, you get a distinct count of values, but when it's a character/factor, it counts frequencies but not the number of unique values.
Thank you
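In the meantime, the distinct count asked for above can be obtained in base R. The helper name n_distinct_values and the sample addresses are made up for this sketch.

```r
# Hypothetical helper: number of distinct values in a character or
# factor column, ignoring missing values by default.
n_distinct_values <- function(x, na.rm = TRUE) {
  if (na.rm) {
    x <- x[!is.na(x)]
  }
  length(unique(x))
}

emails <- c("a@example.com", "b@example.com", "a@example.com", NA)
print(n_distinct_values(emails))
print(n_distinct_values(factor(emails)))
```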
Hi,
When I was trying to generate a dfSummary of a new dataset, I could not because of an error. I could replicate the bug by running this function on the iris dataset. The error occurs when a whole factor column consists of NAs.
This works:
data(iris)
dfSummary(iris)
Now, when I set a factor column to NA it doesn't.
iris$Species <- as.factor(rep(NA, nrow(iris)))
dfSummary(iris)
This is the error, identical to my dataset.
Error in png(img_png <- tempfile(fileext = ".png"), width = 150, height = 26 * :
invalid 'height' argument
In addition: Warning messages:
1: In max(counts) : no non-missing arguments to max; returning -Inf
2: In max(props * 100) : no non-missing arguments to max; returning -Inf
Regards,
Victor
including cumsum etc.
https://github.com/TysonStanley/furniture
A user has requested that feature, will be working on it soon.
Hi, is it also possible to specify the 25th and 75th percentiles (as Q1 and Q3)? They are frequently used in descriptive reporting. Best
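For reference, the two percentiles can be computed with the base R quantile function. The Q1 / Q3 labels follow the request above, and the default quantile algorithm (type 7) is assumed.

```r
# First and third quartiles with base R's quantile() (default type 7)
x <- 1:9
quartiles <- setNames(
  quantile(x, probs = c(0.25, 0.75), na.rm = TRUE),
  c("Q1", "Q3")
)
print(quartiles)
```

Other quantile definitions can be selected through the type argument of quantile, which matters when results need to match another statistics package.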
I assume the percentage should always sum to 100% but in the screenshot below the "other" level gets 103.1% Not sure what is going on there. A link to the dataset used is provided below.
https://github.com/radiant-rstats/radiant.data/raw/master/data/titanic.rda
When analyzing a data set with, e.g., client IDs, it is very useful to know how often unique IDs appear in the dataset, e.g. 90% appear once, 5% appear twice, etc. (data frame summary)
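This "frequency of frequencies" can be sketched in base R by tabulating a table. The IDs below are made up.

```r
# Made-up client IDs
ids <- c("a", "b", "b", "c", "d", "d", "d")

per_id      <- table(ids)       # how often each ID occurs
occurrences <- table(per_id)    # how many IDs occur once, twice, ...

print(occurrences)
print(round(100 * prop.table(occurrences), 1))   # share of IDs per count
```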
When a data set contains an ID which has a checksum, this is very useful to know. E.g. when bar codes are used (EAN https://en.wikipedia.org/wiki/International_Article_Number) it is very useful to know, especially when column names are not obvious.
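As an example of such a check, the EAN-13 check digit rule (weights 1 and 3 alternating over the first twelve digits) can be written in base R. The helper is_valid_ean13 is hypothetical and handles one code at a time.

```r
# Hypothetical helper: validate a single EAN-13 code given as a string.
is_valid_ean13 <- function(code) {
  if (!grepl("^[0-9]{13}$", code)) {
    return(FALSE)
  }
  digits  <- as.integer(strsplit(code, "")[[1]])
  weights <- rep(c(1L, 3L), 6)                 # 1, 3, 1, 3, ... for digits 1-12
  total   <- sum(digits[1:12] * weights)
  check   <- (10L - total %% 10L) %% 10L       # expected 13th digit
  check == digits[13]
}

print(is_valid_ean13("4006381333931"))   # check digit matches
print(is_valid_ean13("4006381333932"))   # last digit altered
```

A summary could apply such a test to a sample of values and report the share that passes; a share close to 100% would suggest the column holds EAN codes.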
A user requested that feature, applicable in calls such as this one (analysing only one variable)
with(iris, by(Petal.Width, Species, descr))
The output of dfSummary would look nice in the data documentation created by roxygen2 (r-lib/roxygen2#307). Converting a data frame to .Rd is straightforward, but the data frames created by dfSummary contain embedded newlines, which makes it a bit more difficult.
There's an error with the link for the recommendation vignette. I'll create a PR that solves this.
The error is here:
The following vignettes complements this page: [Recommendations for
Using summarytools With
Rmarkdown](https://cdn.rawgit.com/dcomtois/summarytools/dev-current/inst/doc/Recommendations-rmarkdown.html)
I just installed summarytools 0.8.3 from CRAN with no error messages.
packageVersion("summarytools")
[1] ‘0.8.3’
> library(summarytools)
Error in get(method, envir = home) :
lazy-load database 'xxx/summarytools/R/summarytools.rdb' is corrupt
In addition: Warning messages:
1: In .registerS3method(fin[i, 1], fin[i, 2], fin[i, 3], fin[i, 4], :
restarting interrupted promise evaluation
2: In get(method, envir = home) :
restarting interrupted promise evaluation
3: In get(method, envir = home) : internal error -3 in R_decompress1
Error: package or namespace load failed for ‘summarytools’
Installing from github gives the same results.
Session info ------------------------------------------------------------------------------------------
setting value
version R version 3.3.2 (2016-10-31)
system x86_64, linux-gnu
ui RStudio (1.1.447)
language (EN)
collate en_US.UTF-8
tz America/New_York
date 2018-04-27
Hey,
Great package!
I think code for descr etc. can be radically simplified using dplyr.
For instance:
library(readr)
library(dplyr)
library(tidyr)
iris <- read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data", col_names = FALSE)
# One column per variable/statistic pair, e.g. X1_mean
iris_num <- iris %>%
  summarize_if(is.numeric, funs(mean = mean, median = median, min = min, max = max, missing = sum(is.na(.))))
# Reshape to one row per statistic and one column per variable
iris_num_long <- iris_num %>%
  gather(key = "key", value = "words") %>%
  separate(key, into = c("var", "statistic")) %>%
  spread(key = "var", value = "words")
produces
iris_num_long
# A tibble: 5 x 5
statistic X1 X2 X3 X4
* <chr> <dbl> <dbl> <dbl> <dbl>
1 max 7.90 4.40 6.90 2.50
2 mean 5.84 3.05 3.76 1.20
3 median 5.80 3.00 4.35 1.30
4 min 4.30 2.00 1.00 0.100
5 missing 0 0 0 0
and this allows you to pass arbitrary functions to summarize easily
fyi, I get the following error:
Error in isTRUE(extra_space) : object 'extra_space' not found
Will try to post a reproducible example
In the data frame summary, if an integer column contains only 0s and 1s, I believe it is not very useful to describe "mean (sd) : 0.23 (0.42) min < med < max : 0 < 0 < 1 IQR (CV) : 0 (1.82)". I suggest it is more useful to show how many 0 and 1 values occur.
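The suggestion above can be sketched in base R: detect a 0/1 column and report counts instead of moments. The helper is_binary and the sample vector are assumptions made for this sketch.

```r
# Hypothetical helper: TRUE when all non-missing values are 0 or 1
is_binary <- function(x) {
  x <- x[!is.na(x)]
  length(x) > 0 && all(x %in% c(0, 1))
}

x <- c(0, 0, 1, 1, 0, 0, 0, NA)

if (is_binary(x)) {
  counts <- table(x, useNA = "ifany")   # counts of 0, 1 and missing
  print(counts)
} else {
  print(summary(x))
}
```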