dcomtois / summarytools

R Package to Quickly and Neatly Summarize Data

R 97.42% CSS 2.02% HTML 0.42% TeX 0.14%

descriptive-statistics frequency-table html-report markdown pander pandoc pandoc-markdown r rmarkdown rstats rstudio

summarytools's Issues

Error in ctable function: $ operator is invalid for atomic vectors

When I use the ctable function with the pipe operator %$% from the package magrittr an error occurs: Error: $ operator is invalid for atomic vectors

library(summarytools)
library(magrittr)

tobacco %$% ctable(smoker, diseased)

Traceback
14. na.omit(c(parse_info_y$var_names, deparse(dnn[[2]]))) at ctable.R#194
13. ctable(smoker, diseased)
12. eval(substitute(expr), data, enclos = parent.frame())
11. eval(substitute(expr), data, enclos = parent.frame())
10. with.default(., ctable(smoker, diseased))
9. with(., ctable(smoker, diseased))
8. function_list[k]
7. withVisible(function_list[k])
6. freduce(value, _function_list)
5. _fseq(_lhs)
4. eval(quote(_fseq(_lhs)), env, env)
3. eval(quote(_fseq(_lhs)), env, env)
2. withVisible(eval(quote(_fseq(_lhs)), env, env))

tobacco %$% ctable(smoker, diseased)

Error appears in na.omit function in line 194, ctable.R file.

y_name  <- na.omit(c(parse_info_y$var_names, deparse(dnn[[2]])))[1]

Many thanks.

limiting the statistics in descr()

This is a feature request. It would be great to add an argument to limit the statistics (mean, sd, etc.). For example, if someone only wants to return mean, median and sd , then the argument could be something like

stats = c('mean', 'median', 'sd')

The final descriptive table would only return the above listed statistics instead of all of them. The default could be stats = "all".
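
A minimal base R sketch of what such an argument could look like. `descr_subset()` is a hypothetical helper written for illustration; it is not part of summarytools and does not reflect how `descr()` is implemented.

```r
# Hypothetical helper (not a summarytools function): compute only the
# statistics named in `stats`; "all" selects every available statistic.
descr_subset <- function(x, stats = "all") {
  available <- list(
    mean   = function(v) mean(v, na.rm = TRUE),
    median = function(v) median(v, na.rm = TRUE),
    sd     = function(v) sd(v, na.rm = TRUE),
    min    = function(v) min(v, na.rm = TRUE),
    max    = function(v) max(v, na.rm = TRUE)
  )
  if (identical(stats, "all")) stats <- names(available)
  unknown <- setdiff(stats, names(available))
  if (length(unknown) > 0) {
    stop("Unknown statistics: ", paste(unknown, collapse = ", "))
  }
  # Returns a named numeric vector, in the order requested
  vapply(available[stats], function(f) f(x), numeric(1))
}

selected <- descr_subset(iris$Sepal.Length, stats = c("mean", "median", "sd"))
```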

Found issue with coefficient of variation (CV)

Dear Dominic,
I found the package "summarytools" very useful!

However, I also found that CV values are calculated inappropriately in the package. When viewing the relevant code contained in "descr.R", I found that CV values are calculated using
ifelse("cv" %in% stats, variable.mean / variable.sd, NA)
As you know, the correct formula for the coefficient of variation is CV = Standard Deviation (σ) / Mean (μ), so this chunk needs to be replaced by
ifelse("cv" %in% stats, variable.sd / variable.mean, NA)

Best regards,
Payam
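
For reference, the formula quoted in this report can be checked in base R. `coef_var()` is a name chosen here for illustration, not a summarytools function.

```r
# Coefficient of variation as stated in the report: standard deviation / mean.
# coef_var() is an illustrative name, not part of summarytools.
coef_var <- function(x, na.rm = TRUE) {
  sd(x, na.rm = na.rm) / mean(x, na.rm = na.rm)
}

x <- c(2, 4, 4, 4, 5, 5, 7, 9)
cv <- coef_var(x)
```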

feature: select columns in freq

New to the package. Very interesting contribution! I may have missed this: is there a way to select the columns that freq returns? I can remove NAs with report.nas = FALSE. I know I can drop the Totals row with totals = FALSE. Is there an option of the freq function to keep/drop the percentage column and/or the cumulative percentage column?

Something like report.cum = FALSE and report.pct = FALSE ...

Escape Characters Causing Ugly Display in Jupyter

I love the summaries this tool generates in RStudio. Thanks!

My problem is that using this with Jupyter doesn't seem to work. Reproduction below:

Inspecting the data frame:

ddd = summarytools::dfSummary(mtcars)
ddd$Variable

Produces this:

[1] "mpg\\\n[numeric]"  "cyl\\\n[numeric]"  "disp\\\n[numeric]" "hp\\\n[numeric]"   "drat\\\n[numeric]"
 [6] "wt\\\n[numeric]"   "qsec\\\n[numeric]" "vs\\\n[numeric]"   "am\\\n[numeric]"   "gear\\\n[numeric]"
[11] "carb\\\n[numeric]"

Which works great in RStudio or the command line, poorly in Jupyter.

I am struggling to figure out whether there is simply a parameter I am missing. Or maybe there is a method I can pipe this output through to unescape those characters?

If I figure it out I'll post a solution.
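
One possible workaround, sketched in base R: strip the backslash-plus-newline pairs from the strings before displaying them. The vector below is copied from the output above rather than produced by `dfSummary()`, and whether this gives an acceptable display in Jupyter is an assumption.

```r
# Two of the strings shown above: each contains a literal backslash
# followed by a newline character.
variable_labels <- c("mpg\\\n[numeric]", "cyl\\\n[numeric]")

# fixed = TRUE matches the two characters literally instead of as a regex
clean_labels <- gsub("\\\n", " ", variable_labels, fixed = TRUE)
```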

Suggestion: mention most frequent value

In the data frame summary, if a column contains 115 distinct values (such as countries) and 99% of the values are one specific country, it would be very useful to mention which country is the most frequent. In general, I believe it is useful to display the most frequent values.
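
A short base R sketch of the idea. `most_frequent()` is a hypothetical helper, not a summarytools function.

```r
# Hypothetical helper: return the most frequent value and its share of
# the non-missing values.
most_frequent <- function(x) {
  counts <- table(x)                 # NA values are dropped by default
  top <- counts[which.max(counts)]   # first maximum if there are ties
  list(value = names(top), share = as.numeric(top) / sum(counts))
}

country <- c(rep("NL", 99), "BE")
top_country <- most_frequent(country)
```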

Winword output suggestion

Thanks again for your great package. Is it possible to add some suggestion on how to render the output in word or html using rmarkdown in RStudio?
Best

dfSummary: Freqs(% Valid) numerical vectors and integer proportions

Hello,

I have a .csv data file that I am reading into a data frame. When I run the dfSummary() function in the console or render on RMarkdown, although some integers are only two distinct values with 100% valid entries, the frequencies(%) are not printed on the output. Interestingly, some integers with <10 values will have printed out frequencies, but there really isn't any notable pattern to why these will print whereas the majority will not. When using an older version of summarytools (0.6.5), this frequency issue is not a problem. Is there something I can do besides go through all of my variables and convert them to factors to resolve this issue? Thanks and please let me know if I need to clarify anything. I'm relatively new to programming and R. :)
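
Until the underlying behaviour is clarified, the factor conversion can at least be automated. The sketch below uses a hypothetical helper, `few_values_to_factor()`, and an arbitrary threshold; it does not explain why the frequencies are missing.

```r
# Hypothetical helper: convert numeric columns with at most `max_distinct`
# distinct non-missing values to factors, leaving other columns untouched.
few_values_to_factor <- function(df, max_distinct = 10) {
  is_candidate <- vapply(
    df,
    function(col) is.numeric(col) && length(unique(col[!is.na(col)])) <= max_distinct,
    logical(1)
  )
  df[is_candidate] <- lapply(df[is_candidate], factor)
  df
}

converted <- few_values_to_factor(mtcars, max_distinct = 3)
```

In `mtcars`, this converts `cyl`, `vs`, `am` and `gear`, which each have two or three distinct values.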

parameter to select which statistics to print

Thanks for your great package. As a suggestion, I would like to propose adding a character vector parameter, with default values, to the descr function so that users can specify explicitly which statistics are tabulated.

Suggestion: mention rows with all NA's

When using the data frame summary I encountered a dataset which had rows with only empty columns (NA's). It would be handy to mention this when this occurs on the top of the page at the data frame summary.
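
Such rows can be counted in base R as follows; the example data frame is made up for illustration.

```r
df <- data.frame(a = c(1, NA, 3), b = c("x", NA, NA))

# A row is "all NA" when its number of missing cells equals the column count
all_na_rows <- unname(rowSums(is.na(df)) == ncol(df))
n_all_na <- sum(all_na_rows)
```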

Controlling which stats to use in the case of numerical variables when using dfSummary()

I'm wondering whether it is possible to control which Stats to be shown in the case of numerical variables when using dfSummary().
Being able to control which Stats are used for numerical variables is almost a necessity, particularly in the case of CV, because CV values should not be calculated for data on a logarithmic scale!

with() returns Var1 instead of the named variable

data(exams)
with(exams, by(english, gender, descr))

returns descriptive statistics for "english" for each gender. However, the statistics table shows Var1 as the column name instead of showing the actual named variable (which would be english, in this case).

                Var1

         Mean  76.66
      Std.Dev   9.35
          Min   55.9
          Max   93.2
       Median   77.1
          mad   7.56
          IQR    8.2
           CV    8.2
     Skewness  -0.25
  SE.Skewness   0.58
     Kurtosis  -0.25

Was it intentional? If not, it would probably be a good idea to display the actual variable name

Getting error "Error in ctable(... : Could not find function "ctable""

Getting error "Error in ctable... : Could not find function "ctable""

dfSummary: options valid.col & na.col

Dear Dominic,

first of all, I want to say that your package is great! Thank you!!!

Second I have noticed that the two options of dfSummary do not seem to work when set to false.
Am I doing something wrong?
here is an example with iris
view(dfSummary(iris, varnumbers = FALSE, valid.col = FALSE, na.col = FALSE , omit.headings=TRUE))

dfSummary graphs slow to generate when number of breaks is high

Under some circumstances, the html graph can take a (very) long time to generate. If you do not need the graphs, just set graph.col = FALSE until the issue is resolved. Thanks to Adam Medcalf for pointing this out.

suggestion: identify primary key of dataframe

In the Data Frame Summary it would be very useful to identify which column contains the 'primary key' (as it is called in databases). A column could be the primary key when the number of rows in the data frame equals the number of distinct values. Of course not every table has a primary key, but that is also useful to mention.
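
The rule described here (as many distinct values as rows) can be sketched in base R. `candidate_keys()` is a hypothetical helper; it additionally requires that the column has no missing values, which is an assumption borrowed from database practice.

```r
# Hypothetical helper: names of columns that have no missing values
# and no duplicated values.
candidate_keys <- function(df) {
  is_key <- vapply(
    df,
    function(col) !anyNA(col) && anyDuplicated(col) == 0,
    logical(1)
  )
  names(df)[is_key]
}

orders <- data.frame(order_id = 1:4, customer = c("a", "b", "a", "c"))
keys <- candidate_keys(orders)
```

For a data frame such as `iris`, where every column contains repeated values, the result is an empty character vector, which covers the "no primary key" case.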

Error from view(dfSummary(df))

I read a clean dataset in from SQL, and tried the below:
library(summarytools)

view(dfSummary(df))
Error in plot.window(xlim, ylim, "", ...) : need finite 'ylim' values
In addition: Warning messages:
1: In n * h : NAs produced by integer overflow
2: In breaks[-1L] + breaks[-nB] : NAs produced by integer overflow
3: In n * h : NAs produced by integer overflow
4: In breaks[-1L] + breaks[-nB] : NAs produced by integer overflow

Name of variable (features) disappears when using descr() with by() in Shiny app

The descr() function from R-package summarytools generates common central tendency statistics and measures of dispersion for numerical data in R.

When I use descr() with by() in a Shiny app, the names of the variables (features) contained in the data disappear and are not displayed. Instead, the names are replaced by Var1, Var2, Var3, etc.

I do not really understand why the names disappear when I implement this code in the Shiny app (see below).

# Install packages
source("https://bioconductor.org/biocLite.R")
biocLite("ALL")
biocLite("Biobase")
install.packages('devtools')
library(devtools)
install_github('dcomtois/summarytools')

# Load packages
library(shiny)
library("summarytools")
library(Biobase)
library("ALL")

# Shiny Server
server <- function(input, output, session) {
  output$summaryTable <- renderUI({
    #-- Load the ALL data
    data(ALL)
    #-- Subset
    eset_object <- ALL [1:3,] # choose only 3 variables
    #-- The group of interest
    eset_groups <-"BT"
    ALL_stats_by_BT <- by(data = as.data.frame(t(exprs(eset_object))),
                          INDICES = (pData(eset_object)[,eset_groups]),
                          FUN = descr, stats ="all",
                          transpose = TRUE)

    view(ALL_stats_by_BT,
         method = 'render',
         omit.headings = FALSE,
         bootstrap.css = FALSE)
  })
}

# Shiny UI
ui <- fluidPage(theme = "dfSummary.css",
                fluidRow(
                  uiOutput("summaryTable")
                )
)

As a side note, if one reads in the data as Global variable: eset_object <<- ALL [1:3,], the variable names will be displayed. But this is not a solution to the problem as it is wise to avoid global variables!

Suggestion : add some <br> in the view(dfSummary(data),method = "render")

When I run
view(dfSummary(data)) in the console I get something like this

but when I put
view(dfSummary(data), method = "render") in my Rmd (html_output) , I get this :

I think adding some <br> at the end of each line in Stats / Values and Freqs, to get the same result as in the RStudio Viewer, could be very good :)

Thanks for your package !

Specify column widths

Especially in the context of rendering html for markdown; right now the size of graphs responds to windows size and the graph.magnif parameter doesn't enforce actual wanted size.

Name of Group variable is not updated when using descr() with by() in Shiny app

I found that the name of the Group variable (and the group level) is not retrieved or updated on the UI when a new Group variable is selected in the app. It should be noted that the corresponding table (calculations) does update when a new group variable is selected.
Moreover, the group variable is also displayed in a static manner. Using the Shiny App below, the issue could be exemplified to some extent. For example, the Group variable is displayed on UI, as shown below:

Group: (pData(eset_object)[, eset_groups]) = B


# Install packages
source("https://bioconductor.org/biocLite.R")
biocLite("ALL")
biocLite("Biobase")
install.packages('devtools')
devtools::install_github('dcomtois/summarytools')

# Load packages
library(summarytools)
library(Biobase)
library(ALL) 

# Shiny Server
server <- function(input, output, session) {
  output$summaryTable <- renderUI({
    #-- Load the ALL data
    data(ALL)  
    #-- Subset
    eset_object <- ALL [1:3,] # choose only 3 variables 
    #-- The group of interest 
    eset_groups <-"BT"
    # print(rownames (eset_object)) # print variable names
    ALL_stats_by_BT <- by(data = as.data.frame(t(exprs(eset_object))), 
                          INDICES = (pData(eset_object)[,eset_groups]), 
                          FUN = descr, stats ="all", 
                          transpose = TRUE)

    view(ALL_stats_by_BT,
         method = 'render',
         omit.headings = FALSE,
         bootstrap.css = FALSE)
  })
}

# Shiny UI
ui <- fluidPage(theme = "dfSummary.css",
                fluidRow(
                  uiOutput("summaryTable")
                )
)

# Launch
shinyApp(ui, server)

Of note, if you replace eSet_object <- relevant_est() to eSet_object <<- relevant_est() (that is Global Env) the option Data Frame will be retrieved and displayed on UI, as presented below:

Data Frame: as.data.frame(t(exprs(eSet_object)))
Group: (pData(eSet_object)[, eSet_groups]) = B

Suggestion: mention also the sum of the values

In the data frame summary, when a column contains, for example, an amount in euros, I would suggest also adding the sum of the values.

Add mosaic plots as a feature.

Maybe add plots like in vcd?

Suggestion to add number of unique rows

The current version of the Data Frame Summary shows the number of rows. In many cases it is very useful to know how many unique rows there are. For example, the iris dataset contains 150 rows, but there is one duplicate row (e.g. nrow(unique(iris)) gives 149). It would be very helpful to add this to the top of the report.
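
The counts mentioned above can be reproduced in base R:

```r
n_rows       <- nrow(iris)             # total number of rows
n_unique     <- nrow(unique(iris))     # rows remaining after dropping duplicates
n_duplicated <- sum(duplicated(iris))  # rows that repeat an earlier row
```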

Suggestion: mention it when an integer is a sequence

When in the data frame summary an integer column contains for example 110 distinct values (0 < 43 < 109) it is useful to note that it is the sequence 0:109.
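
A possible check, sketched in base R. `is_full_sequence()` is a hypothetical helper; it treats a column as a sequence when its distinct values are consecutive, regardless of the order in which they appear.

```r
# Hypothetical helper: TRUE when the distinct non-missing values form an
# unbroken run of consecutive numbers (step of 1).
is_full_sequence <- function(x) {
  values <- sort(unique(x[!is.na(x)]))
  length(values) > 1 && all(diff(values) == 1)
}

seq_check <- is_full_sequence(0:109)
gap_check <- is_full_sequence(c(0, 1, 3))
```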

Error in sect_title[[2]] : subscript out of bounds

I keep getting this error using dfSummary -- and it has happened for all of my data. All of the code worked before...

x was converted to a data frame
Error in sect_title[[2]] : subscript out of bounds

even when graph.col = FALSE, gives graph error

view(dfSummary(hehe, graph.col = FALSE), file = "data_summary.html", append = TRUE, footnote = NA)
Error in plot.window(xlim, ylim, "", ...) : need finite 'ylim' values

Suggestion: identify when there is an ID - Description relation between two columns

When using the data frame summary it is very handy when there are two columns that contain an ID and a description of something. For example when column 3 has distinct ID's 1 and 2 and column 123 contains the distinct values "MALE" and "FEMALE" it is very practical to mention that column 3 and column 123 are related.
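
One way to test a pair of columns for such a relation is to check that every value of one column maps to exactly one value of the other, and vice versa. The helper name and the example data below are illustrative assumptions.

```r
# Hypothetical helper: TRUE when the distinct (a, b) pairs form a
# one-to-one mapping between the two columns.
is_one_to_one <- function(a, b) {
  pairs <- unique(data.frame(a = a, b = b))
  anyDuplicated(pairs$a) == 0 && anyDuplicated(pairs$b) == 0
}

sex_id    <- c(1, 2, 1, 2)
sex_label <- c("MALE", "FEMALE", "MALE", "FEMALE")
related   <- is_one_to_one(sex_id, sex_label)
```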

special handling for dates

hola!

excited so a couple more suggestions:

i think it would be useful to allow for special handling of date and time vars.
for categorical vars./char. vars. with more than 10 unique values, it may be useful to present a breakdown of the 9 most common (as you do, I think) and then the 10th can be 'other or all else' (which totals up everything in the other char. values).

hth
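
The "9 most common plus other" breakdown could look roughly like this in base R. `lump_top()` and its label are hypothetical and are not how dfSummary builds its frequencies.

```r
# Hypothetical helper: keep the n most frequent values and pool the rest
# into a single "(other)" count.
lump_top <- function(x, n = 9, other_label = "(other)") {
  counts <- table(x)
  counts <- sort(setNames(as.vector(counts), names(counts)), decreasing = TRUE)
  if (length(counts) <= n + 1) return(counts)   # nothing worth pooling
  c(counts[seq_len(n)], setNames(sum(counts[-seq_len(n)]), other_label))
}

# 12 distinct letters with frequencies 12, 11, ..., 1
x <- rep(letters[1:12], times = 12:1)
breakdown <- lump_top(x, n = 9)
```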

Calculation of percentage for "other" category seems incorrect

In a dataset with 75K unique identifiers (i.e., id). I see the output below. Shouldn't the percentage for "74990 other" be 100% or something very close to that?

Error in prettyNum if missing value

Using summarytools 0.8.6, I am getting an error on some variables where every value is either 0 or 1 and there is also a missing value. I have other character and factor vectors with missing values and those are being handled correctly.

Error in prettyNum(.Internal(format(x, trim, digits, nsmall, width, 3L, :
invalid 'nsmall' argument

Reproducible example

dt <- data.frame(finger.involved = c(0, 0, 1, 1, 0, 0, 0, 0), toe.involved = c(0, 0, 1, 1, 1, 1, 0, 0))
dfSummary(dt)

#so far so good, but then look what happens when an NA is inserted

dt <- data.frame(finger.involved = c(0, 0, 1, 1, 0, 0, 0, NA), toe.involved = c(0, 0, 1, 1, 1, 1, 0, 0))
dfSummary(dt)

Suggestion: mention the number of columns in the top of the data frame summary

Especially when using the data frame summary with a lot of columns it is handy to mention the number of columns on the top of the page, e.g. next to the number of rows.

Suggestion: give information about date-time columns

In the data frame summary when a column contains date/time information I would suggest to give a distribution of the days (e.g. monday 6%, tuesday 12%, etc..), months, hours, etc. This reveals easily season patterns, workday behavior, etc.
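
A base R sketch of such distributions, using made-up timestamps. The `%u` format code (weekday number, 1 = Monday) is used instead of day names so the result does not depend on the locale.

```r
timestamps <- as.POSIXct(
  c("2018-04-23 09:15:00", "2018-04-24 14:30:00",
    "2018-04-24 16:45:00", "2018-04-27 09:05:00"),
  tz = "UTC"
)

# Share of observations per weekday number, month and hour of day
weekday_share <- prop.table(table(format(timestamps, "%u")))
month_share   <- prop.table(table(format(timestamps, "%m")))
hour_share    <- prop.table(table(format(timestamps, "%H")))
```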

useNA in ctable

I am getting an error using ctable every time I try and set useNA to "no". It works just fine for "always" or "ifany". Error is below:

Error in ctable(stjean$pnc5_new, stjean$preterm, prop = "t", useNA = "no") :
'useNA' must be one of 'ifany', 'always', or 'no'

handling NA in freq

I see that freq always prints NA and Cumulative Valid. I would suggest adding a boolean option, ignore.na=FALSE, that, when TRUE, ignores NA (and also does not print the "Valid" frequencies columns).

Suggestion: Distinct count of factor/character column

It would be useful to have a distinct count of unique values of either factor or character column.
For example, if I have a column labeled email, I would like to know how many unique emails I have in that column.
Here is an example of my output.
When the field type is Integer, you get a distinct count of values, but when it's a character/factor, it counts frequencies but not the number of unique values.

Thank you
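
In the meantime, the distinct counts can be computed separately in base R. The helper name and the example data frame are illustrative.

```r
# Number of distinct non-missing values in a vector
count_distinct <- function(x) length(unique(x[!is.na(x)]))

contacts <- data.frame(
  email = c("a@example.com", "b@example.com", "a@example.com", NA),
  plan  = factor(c("free", "pro", "free", "free")),
  stringsAsFactors = FALSE
)

# Apply only to character and factor columns
is_text <- vapply(contacts, function(col) is.character(col) || is.factor(col), logical(1))
distinct_counts <- vapply(contacts[is_text], count_distinct, integer(1))
```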

dfSummary fails when a whole factor column is NA

Hi,

When I was trying to generate a dfSummary of a new dataset, I could not due to an error. I could replicate the bug by running this function on the iris dataset. The error occurs when a whole factor column consists of NAs.

This works:

data(iris)
dfSummary(iris)

Now, when I set a factor column to NA it doesn't.

iris$Species <- as.factor(rep(NA, nrow(iris)))
dfSummary(iris)

This is the error, identical to my dataset.
Error in png(img_png <- tempfile(fileext = ".png"), width = 150, height = 26 * :
invalid 'height' argument
In addition: Warning messages:
1: In max(counts) : no non-missing arguments to max; returning -Inf
2: In max(props * 100) : no non-missing arguments to max; returning -Inf

Regards,
Victor

fyi on other functionality from another package

including cumsum etc.
https://github.com/TysonStanley/furniture

Option to omit Totals row in freq()

A user has requested that feature, will be working on it soon.

Q1 and Q3

Hi, is it also possible to specify 25th and 75th percentiles (as Q1 and Q3) maybe? Cause they are frequently used as descriptive reporting. Best
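
Until then, the quartiles can be obtained with base R's `quantile()`:

```r
x <- 1:9

# 25th and 75th percentiles (Q1 and Q3), using R's default quantile type
quartiles <- quantile(x, probs = c(0.25, 0.75), names = FALSE)
```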

strange result for character variable in dfSummary

I assume the percentage should always sum to 100% but in the screenshot below the "other" level gets 103.1% Not sure what is going on there. A link to the dataset used is provided below.

https://github.com/radiant-rstats/radiant.data/raw/master/data/titanic.rda

Suggestion: how often do ID values occur

When analyzing a data set with e.g. client IDs, it is very useful to know how often unique IDs appear in the dataset, e.g. 90% appear once, 5% appear twice, etc. (data frame summary)
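
This "frequency of frequencies" can be sketched in base R by tabulating twice; the IDs below are made up.

```r
client_id <- c(1, 2, 2, 3, 4, 5, 5, 5, 6, 7)

# First table: how many times each ID occurs
occurrences <- table(client_id)

# Second table: share of IDs that occur once, twice, three times, ...
share_by_count <- prop.table(table(as.vector(occurrences)))
```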

Suggestion: identify when an ID column contains a checksum

When a data set contains an ID which has a checksum, this is very useful to know. E.g. when bar codes are used (EAN https://en.wikipedia.org/wiki/International_Article_Number) it is very useful to know, especially when column names are not obvious.
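
For the EAN-13 case, a check could be sketched as follows: the first 12 digits are weighted alternately 1 and 3, and the 13th digit must bring the weighted sum to a multiple of 10. `ean13_is_valid()` is a hypothetical helper, and treating "most values pass the check" as evidence of a checksum column is an assumption.

```r
# Hypothetical helper: TRUE when `code` is a 13-digit string whose last
# digit matches the EAN-13 check digit of the first 12.
ean13_is_valid <- function(code) {
  if (!grepl("^[0-9]{13}$", code)) return(FALSE)
  digits  <- as.integer(strsplit(code, "")[[1]])
  weights <- rep(c(1, 3), times = 6)   # weights for the first 12 digits
  check   <- (10 - sum(digits[1:12] * weights) %% 10) %% 10
  check == digits[13]
}

ids <- c("4006381333931", "4006381333932", "12345")
valid_share <- mean(vapply(ids, ean13_is_valid, logical(1)))
```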

In descr(), display a single table when a by-group is specified

A user requested that feature, applicable in calls such as this one (analysing only one variable)

with(iris, by(Petal.Width, Species, descr))

Rd formatting

The output of dfSummary would look nice in the data documentation created by roxygen2 (r-lib/roxygen2#307). Converting a data frame to .Rd is straightforward, but the data frames created by dfSummary contain embedded newlines -- this makes it a bit more difficult.

Wrong link for the recommendation vignette

There's an error with the link for the recommendation vignette. I'll create a PR that solves this.

The error is here:

The following vignettes complements this page: [Recommendations for
Using summarytools With
Rmarkdown](https://cdn.rawgit.com/dcomtois/summarytools/dev-current/inst/doc/Recommendations-rmarkdown.html)

Error loading summarytools

I just installed summarytools 0.8.3 from CRAN with no error messages.

packageVersion("summarytools")
[1] ‘0.8.3’

> library(summarytools)
Error in get(method, envir = home) : 
  lazy-load database 'xxx/summarytools/R/summarytools.rdb' is corrupt
In addition: Warning messages:
1: In .registerS3method(fin[i, 1], fin[i, 2], fin[i, 3], fin[i, 4],  :
  restarting interrupted promise evaluation
2: In get(method, envir = home) :
  restarting interrupted promise evaluation
3: In get(method, envir = home) : internal error -3 in R_decompress1
Error: package or namespace load failed for ‘summarytools’

Installing from github gives the same results.

Session info ------------------------------------------------------------------------------------------
 setting  value                       
 version  R version 3.3.2 (2016-10-31)
 system   x86_64, linux-gnu           
 ui       RStudio (1.1.447)           
 language (EN)                        
 collate  en_US.UTF-8                 
 tz       America/New_York            
 date     2018-04-27

simplify code with dplyr

Hey,

Great package!

I think code for descr etc. can be radically simplified using dplyr.

For instance:

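library(readr)   # read_csv()
library(dplyr)   # summarize_if(), funs(), %>%
library(tidyr)   # gather(), separate(), spread()

iris <- read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data", col_names = F)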
iris_num <- iris %>%
   summarize_if(is.numeric, funs(mean = mean, median = median, min = min, max = max, missing = sum(is.na(.)))) 
iris_num_long <- iris_num %>%
  gather(key = "key", value = "words") %>%
  separate(key, into = c("var", "statistic")) %>%
  spread(key = "var", value = "words")

produces

 iris_num_long
# A tibble: 5 x 5
  statistic    X1    X2    X3    X4
* <chr>     <dbl> <dbl> <dbl> <dbl>
1 max        7.90  4.40  6.90 2.50 
2 mean       5.84  3.05  3.76 1.20 
3 median     5.80  3.00  4.35 1.30 
4 min        4.30  2.00  1.00 0.100
5 missing    0     0     0    0

and this allows you to pass arbitrary functions to summarize easily

bug

fyi, I get the following error:

Error in isTRUE(extra_space) : object 'extra_space' not found

Will try to post a reproducible example

Suggestion: treat binary integers differently

In the data frame summary, if an integer column contains only 0s and 1s, I believe it is not very useful to describe "mean (sd) : 0.23 (0.42) min < med < max : 0 < 0 < 1 IQR (CV) : 0 (1.82)". I suggest it is more useful to mention how many 0 and 1 values occur.
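
A base R sketch of the suggested behaviour. `is_binary()` is a hypothetical helper; when it returns TRUE, counts are shown instead of mean and standard deviation.

```r
# Hypothetical helper: TRUE when every non-missing value is 0 or 1
is_binary <- function(x) {
  values <- unique(x[!is.na(x)])
  length(values) > 0 && all(values %in% c(0, 1))
}

flag <- c(0L, 1L, 0L, 0L, NA, 1L)

# For binary columns, report how many 0s, 1s and NAs occur
flag_counts <- if (is_binary(flag)) table(flag, useNA = "ifany") else NULL
```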