dcomtois / summarytools

R Package to Quickly and Neatly Summarize Data

R 97.42% CSS 2.02% HTML 0.42% TeX 0.14%
descriptive-statistics frequency-table html-report markdown pander pandoc pandoc-markdown r rmarkdown rstats rstudio

summarytools's People

Contributors

brunaw, cmrnp, dcomtois, emraher, faviovazquez, iago-pssjd, jonmcalder, mcanouil, rprrr

summarytools's Issues

Error in ctable function: $ operator is invalid for atomic vectors

When I use the ctable function with the pipe operator %$% from the package magrittr an error occurs: Error: $ operator is invalid for atomic vectors

library(summarytools)
library(magrittr)

tobacco %$% ctable(smoker, diseased)

Traceback
14. na.omit(c(parse_info_y$var_names, deparse(dnn[[2]]))) at ctable.R#194
13. ctable(smoker, diseased)
12. eval(substitute(expr), data, enclos = parent.frame())
11. eval(substitute(expr), data, enclos = parent.frame())
10. with.default(., ctable(smoker, diseased))
9. with(., ctable(smoker, diseased))
8. function_list[k]
7. withVisible(function_list[k])
6. freduce(value, _function_list)
5. _fseq(_lhs)
4. eval(quote(_fseq(_lhs)), env, env)
3. eval(quote(_fseq(_lhs)), env, env)
2. withVisible(eval(quote(_fseq(_lhs)), env, env))

1. tobacco %$% ctable(smoker, diseased)

The error appears in the na.omit call at line 194 of the ctable.R file:

y_name  <- na.omit(c(parse_info_y$var_names, deparse(dnn[[2]])))[1]

Many thanks.

limiting the statistics in descr()

This is a feature request. It would be great to add an argument to limit the statistics (mean, sd, etc.). For example, if someone only wants to return mean, median and sd, then the argument could be something like

stats = c('mean', 'median', 'sd')

The final descriptive table would only return the above listed statistics instead of all of them. The default could be stats = "all".
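To make the request concrete, here is a minimal base R sketch of how such an argument could behave. It is not taken from summarytools: the helper name `describe_subset` and its internal list of statistics are hypothetical and only illustrate the idea of a `stats` argument defaulting to "all".

```r
# Hypothetical helper (not part of summarytools): compute only the
# statistics requested through `stats`; "all" returns every one available.
describe_subset <- function(x, stats = "all") {
  available <- list(
    mean   = function(v) mean(v, na.rm = TRUE),
    median = function(v) median(v, na.rm = TRUE),
    sd     = function(v) sd(v, na.rm = TRUE),
    min    = function(v) min(v, na.rm = TRUE),
    max    = function(v) max(v, na.rm = TRUE)
  )
  if (identical(stats, "all")) stats <- names(available)
  # Fail early on names that are not in the list above
  unknown <- setdiff(stats, names(available))
  if (length(unknown) > 0) {
    stop("Unknown statistic(s): ", paste(unknown, collapse = ", "))
  }
  # Named numeric vector, in the order the caller asked for
  vapply(available[stats], function(f) f(x), numeric(1))
}

describe_subset(c(1, 2, 3, 4, 10), stats = c("mean", "median", "sd"))
```

Keeping the statistics in a named list means the same vector of names can be used both to validate the request and to order the output.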

Found issue with coefficient of variation (CV)

Dear Dominic,
I found the package "summarytools" very useful!

However, I also found that CV values are calculated inappropriately in the package. When viewing the relevant code contained in "descr.R", I found that CV values are calculated using
ifelse("cv" %in% stats, variable.mean / variable.sd, NA)
As you know, the correct formula for the coefficient of variation is CV = Standard Deviation (σ) / Mean (μ), which is why this chunk needs to be replaced by
ifelse("cv" %in% stats, variable.sd / variable.mean, NA)
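A quick numeric check of the two expressions, using a made-up vector (the data and variable names below are illustrative and not from the package):

```r
# Illustrative data: mean is 5, sample standard deviation is sqrt(32 / 7)
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

cv      <- sd(x) / mean(x)   # coefficient of variation: sd relative to the mean
inverse <- mean(x) / sd(x)   # what the reported code computes instead

c(cv = cv, inverse = inverse)
```

The two values are reciprocals of each other, so the error is easy to miss on data where the standard deviation and the mean are of similar size.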

Best regards,
Payam

feature: select columns in freq

New to the package. Very interesting contribution! I may have missed this: is there a way to select the columns that freq returns? I can remove NAs with report.nas = FALSE. I know I can drop the Totals row with totals = FALSE. Is there an option of the freq function to keep/drop the percentage column and/or the cumulative percentage column?

Something like report.cum = FALSE and report.pct = FALSE ...

Escape Characters Causing Ugly Display in Jupyter

I love the summaries this tool generates in RStudio. Thanks!

My problem is that using this with Jupyter doesn't seem to work. Reproduction below:

Inspecting the data frame:

ddd = summarytools::dfSummary(mtcars)
ddd$Variable

Produces this:

[1] "mpg\\\n[numeric]"  "cyl\\\n[numeric]"  "disp\\\n[numeric]" "hp\\\n[numeric]"   "drat\\\n[numeric]"
 [6] "wt\\\n[numeric]"   "qsec\\\n[numeric]" "vs\\\n[numeric]"   "am\\\n[numeric]"   "gear\\\n[numeric]"
[11] "carb\\\n[numeric]"

This works well in RStudio or at the command line, but poorly in Jupyter.

I am struggling to figure out whether there is simply a parameter I am missing. Or maybe there is a function I can pipe this output through to unescape those characters?

If I figure it out I'll post a solution.
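One possible post-processing step, sketched in base R under the assumption that the unwanted sequence is always a backslash followed by a newline, as in the output above. It only cleans the strings after the fact; whether dfSummary offers a parameter for this is not stated in this thread.

```r
# Two values copied from the printed output above. In R source, "\\\n" is a
# literal backslash followed by a newline character.
vars <- c("mpg\\\n[numeric]", "cyl\\\n[numeric]")

# fixed = TRUE matches those two characters literally (no regex escaping);
# replacing them with a space is an arbitrary choice for this sketch.
clean <- gsub("\\\n", " ", vars, fixed = TRUE)
clean
```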

Suggestion: mention most frequent value

In the data frame summary, if a column contains 115 distinct values (such as countries) and 99% of the values are one specific country, it would be very useful to mention which country is the most frequent. In general, I believe it is useful to display the most frequent values.
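As an illustration of the information being asked for, the most frequent value and its share can be computed in base R as follows (the `country` vector is invented for the example):

```r
# Invented example: one dominant value among 100 observations
country <- c(rep("Canada", 99), "France")

counts    <- sort(table(country), decreasing = TRUE)  # most frequent first
top_value <- names(counts)[1]
top_share <- as.numeric(counts[1]) / length(country)

sprintf("Most frequent: %s (%.0f%%)", top_value, 100 * top_share)
```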

Winword output suggestion

Thanks again for your great package. Would it be possible to add some suggestions on how to render the output in Word or HTML using rmarkdown in RStudio?
Best

dfSummary: Freqs(% Valid) numerical vectors and integer proportions

Hello,

I have a .csv data file that I am reading into a data frame. When I run the dfSummary() function in the console or render it in RMarkdown, the frequencies (%) are not printed in the output for some integer variables, even though they have only two distinct values and 100% valid entries. Interestingly, some integer variables with fewer than 10 distinct values do have their frequencies printed, but I can't see any pattern explaining why these print whereas the majority do not. With an older version of summarytools (0.6.5), this frequency issue does not occur. Is there something I can do to resolve this, besides going through all of my variables and converting them to factors? Thanks, and please let me know if I need to clarify anything. I'm relatively new to programming and R. :)
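If converting to factors turns out to be the workaround, it does not have to be done by hand. The sketch below uses base R only; the helper name `few_values_to_factor` and the `max_distinct` threshold are assumptions made for this example, not summarytools options.

```r
# Hypothetical helper: convert numeric columns with at most `max_distinct`
# distinct non-missing values into factors, leaving other columns untouched.
few_values_to_factor <- function(df, max_distinct = 10) {
  is_candidate <- vapply(
    df,
    function(col) is.numeric(col) &&
      length(unique(col[!is.na(col)])) <= max_distinct,
    logical(1)
  )
  df[is_candidate] <- lapply(df[is_candidate], factor)
  df
}

dat <- data.frame(flag = c(0L, 1L, 1L, 0L), score = c(1.5, 2.5, 3.5, 4.5))
str(few_values_to_factor(dat, max_distinct = 2))
```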

parameter to select which statistics to print

Thanks for your great package. As a suggestion, I would like to propose adding a character vector parameter (with default values) to the descr function to specify which statistics are tabulated.

Suggestion: mention rows with all NA's

When using the data frame summary, I encountered a dataset that had rows in which every column was empty (NA). It would be handy to mention this at the top of the data frame summary when it occurs.
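For reference, such rows can be detected in base R as sketched below (the small data frame is invented for the example):

```r
# Invented example: row 2 is NA in every column
df <- data.frame(a = c(1, NA, 3), b = c("x", NA, NA))

# A row is "all NA" when its count of missing cells equals the column count
all_na_rows <- rowSums(is.na(df)) == ncol(df)

sum(all_na_rows)    # number of fully empty rows
which(all_na_rows)  # their positions
```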

with() returns Var1 instead of the named variable

data(exams)
with(exams, by(english, gender, descr))

returns descriptive statistics for "english" for each gender. However, the statistics table shows Var1 as the column name instead of showing the actual named variable (which would be english, in this case).

                Var1

         Mean  76.66
      Std.Dev   9.35
          Min   55.9
          Max   93.2
       Median   77.1
          mad   7.56
          IQR    8.2
           CV    8.2
     Skewness  -0.25
  SE.Skewness   0.58
     Kurtosis  -0.25

Was it intentional? If not, it would probably be a good idea to display the actual variable name.

dfSummary: options valid.col & na.col

Dear Dominic,

first of all, I want to say that your package is great! Thank you!!!

Second, I have noticed that the two dfSummary options valid.col and na.col do not seem to work when set to FALSE.
Am I doing something wrong?
Here is an example with iris:
view(dfSummary(iris, varnumbers = FALSE, valid.col = FALSE, na.col = FALSE , omit.headings=TRUE))

suggestion: identify primary key of dataframe

In the Data Frame Summary it would be very useful to identify which column contains the 'primary key' (as it is called in databases). A column could be the primary key when the number of rows in the data frame equals the number of distinct values. Of course not every table has a primary key, but that is also useful to mention.
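The rule described above (number of rows equals number of distinct values) can be sketched in base R. The function name `candidate_keys` is hypothetical, and the extra requirement that a key column contain no NA is an assumption added for this example.

```r
# Hypothetical helper: names of columns whose values are all distinct and
# non-missing, i.e. columns that could serve as a primary key.
candidate_keys <- function(df) {
  is_key <- vapply(
    df,
    function(col) !anyNA(col) && anyDuplicated(col) == 0L,
    logical(1)
  )
  names(df)[is_key]
}

orders <- data.frame(
  id    = 1:4,
  group = c("a", "a", "b", "b"),
  code  = c("w", "x", "y", "z")
)
candidate_keys(orders)  # columns with one distinct value per row
candidate_keys(iris)    # no single column qualifies
```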

Error from view(dfSummary(df))

I read a clean dataset in from SQL, and tried the below:
library(summarytools)

view(dfSummary(df))
Error in plot.window(xlim, ylim, "", ...) : need finite 'ylim' values
In addition: Warning messages:
1: In n * h : NAs produced by integer overflow
2: In breaks[-1L] + breaks[-nB] : NAs produced by integer overflow
3: In n * h : NAs produced by integer overflow
4: In breaks[-1L] + breaks[-nB] : NAs produced by integer overflow

Name of variable (features) disappears when using descr() with by() in Shiny app

The descr() function from R-package summarytools generates common central tendency statistics and measures of dispersion for numerical data in R.

When I use descr() with by() in a Shiny app, the names of the variables (features) contained in the data disappear and are not displayed. Instead, the names are replaced by Var1, Var2, Var3, etc.

I do not really understand why the names disappear when I use this code in the Shiny app (see below).

Install packages

source("https://bioconductor.org/biocLite.R")
biocLite("ALL")
source("https://bioconductor.org/biocLite.R")
biocLite("Biobase")
install.packages('devtools')
library(devtools)
install_github('dcomtois/summarytools')

Load packages

library("summarytools")
library(Biobase)
library("ALL")

Shiny Server

server <- function(input, output, session) {
  output$summaryTable <- renderUI({
    #-- Load the ALL data
    data(ALL)
    #-- Subset
    eset_object <- ALL [1:3,] # choose only 3 variables
    #-- The group of interest
    eset_groups <-"BT"
    ALL_stats_by_BT <- by(data = as.data.frame(t(exprs(eset_object))),
                          INDICES = (pData(eset_object)[,eset_groups]),
                          FUN = descr, stats ="all",
                          transpose = TRUE)

    view(ALL_stats_by_BT,
         method = 'render',
         omit.headings = FALSE,
         bootstrap.css = FALSE)
  })
}

Shiny UI

ui <- fluidPage(theme = "dfSummary.css",
                fluidRow(
                  uiOutput("summaryTable")
                )
)

As a side note, if one reads in the data as a global variable (eset_object <<- ALL [1:3,]), the variable names are displayed. But this is not a solution to the problem, as it is wise to avoid global variables!

Suggestion : add some <br> in the view(dfSummary(data),method = "render")

When I run
view(dfSummary(data)) in the console I get something like this
2018-12-14 11_03_08-data frame summary

but when I put
view(dfSummary(data), method = "render") in my Rmd (html_output) , I get this :

2018-12-14 11_03_44-rapport sur dig reporting de decembre

I think adding a <br> at the end of each line in Stats / Values and Freqs, to get the same result as in the RStudio Viewer, would be very good :)

Thanks for your package !

Specify column widths

This matters especially when rendering HTML for markdown: right now the size of the graphs responds to the window size, and the graph.magnif parameter doesn't enforce the size actually wanted.

Name of Group variable is not updated when using descr() with by() in Shiny app

I found that the name of the group variable (and the group level) is not retrieved or updated in the UI when a new group variable is selected in the app. Note that the corresponding table (the calculations) does update when a new group variable is selected.
Moreover, the group variable is displayed in a static manner. The Shiny app below illustrates the issue to some extent. For example, the group variable is displayed in the UI as shown below:

Group: (pData(eset_object)[, eset_groups]) = B


# Install packages
source("https://bioconductor.org/biocLite.R")
biocLite("ALL")
biocLite("Biobase")
install.packages('devtools')
devtools::install_github('dcomtois/summarytools')

# Load packages
library(summarytools)
library(Biobase)
library(ALL) 

# Shiny Server
server <- function(input, output, session) {
  output$summaryTable <- renderUI({
    #-- Load the ALL data
    data(ALL)  
    #-- Subset
    eset_object <- ALL [1:3,] # choose only 3 variables 
    #-- The group of interest 
    eset_groups <-"BT"
    # print(rownames (eset_object)) # print variable names
    ALL_stats_by_BT <- by(data = as.data.frame(t(exprs(eset_object))), 
                          INDICES = (pData(eset_object)[,eset_groups]), 
                          FUN = descr, stats ="all", 
                          transpose = TRUE)

    view(ALL_stats_by_BT,
         method = 'render',
         omit.headings = FALSE,
         bootstrap.css = FALSE)
  })
}

# Shiny UI
ui <- fluidPage(theme = "dfSummary.css",
                fluidRow(
                  uiOutput("summaryTable")
                )
)

# Launch
shinyApp(ui, server)

Of note, if you replace eSet_object <- relevant_est() with eSet_object <<- relevant_est() (that is, the global environment), the Data Frame entry is retrieved and displayed in the UI, as shown below:

Data Frame: as.data.frame(t(exprs(eSet_object)))
Group: (pData(eSet_object)[, eSet_groups]) = B

Suggestion to add number of unique rows

The current version of the Data Frame Summary shows the number of rows. In many cases it is very useful to know how many unique rows there are. For example, the iris dataset contains 150 rows, but there is one duplicate row (nrow(unique(iris)) gives 149). It would be very helpful to add this to the top of the report.
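The iris figures quoted above can be reproduced with base R:

```r
n_rows       <- nrow(iris)             # total number of rows
n_unique     <- nrow(unique(iris))     # rows left after dropping exact duplicates
n_duplicated <- sum(duplicated(iris))  # rows that repeat an earlier row

c(rows = n_rows, unique = n_unique, duplicated = n_duplicated)
```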

Error in sect_title[[2]] : subscript out of bounds

I keep getting this error using dfSummary -- and it has happened for all of my data. All of the code worked before...

x was converted to a data frame
Error in sect_title[[2]] : subscript out of bounds

special handling for dates

hola!

excited so a couple more suggestions:

  1. i think it would be useful to allow for special handling of date and time vars.
  2. for categorical vars./char. vars. with more than 10 unique values, it may be useful to present a breakdown of the 9 most common (as you do, I think) and then the 10th can be 'other or all else' (which totals up everything for the other char. values).
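The second suggestion could look like this in base R. The helper name `lump_top` and the "(other)" label are hypothetical; note that in this sketch missing values are also counted under the "other" label.

```r
# Hypothetical helper: keep the n most common values and lump every other
# value (including NA, in this sketch) into a single "other" category.
lump_top <- function(x, n = 9, other = "(other)") {
  counts <- sort(table(x), decreasing = TRUE)
  keep   <- names(counts)[seq_len(min(n, length(counts)))]
  lumped <- ifelse(x %in% keep, as.character(x), other)
  sort(table(lumped), decreasing = TRUE)
}

x <- c(rep("a", 5), rep("b", 3), "c", "d")
lump_top(x, n = 2)
```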

hth

Error in prettyNum if missing value

Using summarytools 0.8.6, I get an error on some variables where every value is either 0 or 1 and there is also a missing value. I have other character and factor vectors with missing values, and those are handled correctly.

Error in prettyNum(.Internal(format(x, trim, digits, nsmall, width, 3L, :
invalid 'nsmall' argument

Reproducible example

dt <- data.frame(finger.involved = c(0, 0, 1, 1, 0, 0, 0, 0), toe.involved = c(0, 0, 1, 1, 1, 1, 0, 0))
dfSummary(dt)

#so far so good, but then look what happens when an NA is inserted

dt <- data.frame(finger.involved = c(0, 0, 1, 1, 0, 0, 0, NA), toe.involved = c(0, 0, 1, 1, 1, 1, 0, 0))
dfSummary(dt)

Suggestion: give information about date-time columns

In the data frame summary, when a column contains date/time information, I would suggest giving a distribution of the days (e.g. Monday 6%, Tuesday 12%, etc.), months, hours, etc. This easily reveals seasonal patterns, workday behavior, etc.
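A base R sketch of the kind of breakdown suggested here, on invented dates. It uses the "%u" format code (ISO weekday number, 1 = Monday) instead of day names so the result does not depend on the locale.

```r
# Invented example: two full weeks starting on Monday 2018-01-01
dates <- as.Date("2018-01-01") + 0:13

# Weekday as a factor with all seven levels, so days with no data show as 0
weekday <- factor(format(dates, "%u"), levels = as.character(1:7))
weekday_pct <- round(100 * prop.table(table(weekday)), 1)
weekday_pct

# The same idea applies to months ("%m") or, for POSIXct values, hours ("%H")
month_pct <- round(100 * prop.table(table(format(dates, "%m"))), 1)
month_pct
```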

useNA in ctable

I am getting an error using ctable every time I try and set useNA to "no". It works just fine for "always" or "ifany". Error is below:

Error in ctable(stjean$pnc5_new, stjean$preterm, prop = "t", useNA = "no") :
'useNA' must be one of 'ifany', 'always', or 'no'

handling NA in freq

I see that freq always prints the NA row and the cumulative valid column. I would suggest adding a boolean option, ignore.na = FALSE, that, when TRUE, ignores NA values (and also does not print the "Valid" frequency columns).

Suggestion: Distinct count of factor/character column

It would be useful to have a distinct count of unique values of either factor or character column.
For example, if I have a column labeled email, I would like to know how many unique emails I have in that column.
Here is an example of my output.
When the field type is integer, you get a distinct count of values, but when it's character/factor, the summary shows frequencies but not the count of unique values.
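For illustration, the requested number can be computed in base R like this (the `email` vector is invented; excluding NA from the count is a choice made for this sketch):

```r
# Invented example: three non-missing entries, two of them identical
email <- c("a@example.com", "b@example.com", "a@example.com", NA)

# Count distinct values, ignoring missing entries
n_distinct_email <- length(unique(email[!is.na(email)]))
n_distinct_email
```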

Thank you

dfSummary fails when a whole factor column is NA

Hi,

When I tried to generate a dfSummary of a new dataset, it failed with an error. I could replicate the bug by running this function on the iris dataset. The error occurs when an entire factor column is NA.

This works:

data(iris)
dfSummary(iris)

Now, when I set a factor column to NA it doesn't.

iris$Species <- as.factor(rep(NA, nrow(iris)))
dfSummary(iris)

This is the error, identical to my dataset.
Error in png(img_png <- tempfile(fileext = ".png"), width = 150, height = 26 * :
invalid 'height' argument
In addition: Warning messages:
1: In max(counts) : no non-missing arguments to max; returning -Inf
2: In max(props * 100) : no non-missing arguments to max; returning -Inf

Regards,
Victor

Q1 and Q3

Hi, would it also be possible to report the 25th and 75th percentiles (as Q1 and Q3)? They are frequently used in descriptive reporting. Best
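For reference, base R computes these two values with quantile(). The example below uses R's default quantile definition (type 7); other definitions can give slightly different numbers.

```r
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)

# 25th and 75th percentiles with R's default method (type = 7)
q <- quantile(x, probs = c(0.25, 0.75), names = FALSE)
q1 <- q[1]
q3 <- q[2]

c(Q1 = q1, Q3 = q3, IQR = q3 - q1)
```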

Suggestion: how often do ID values occur

When analyzing a data set with, e.g., client IDs, it is very useful to know how often each unique ID appears in the dataset, e.g. 90% appear once, 5% appear twice, etc. (data frame summary)
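This "frequency of frequencies" can be sketched in base R by tabulating twice (the `ids` vector is invented for the example):

```r
# Invented example: five distinct IDs, some repeated
ids <- c(101, 102, 102, 103, 104, 104, 104, 105)

per_id       <- table(ids)      # how many times each ID appears
freq_of_freq <- table(per_id)   # how many IDs appear once, twice, ...

# Share of IDs by number of appearances, in percent
round(100 * prop.table(freq_of_freq), 1)
```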

Rd formatting

The output of dfSummary would look nice in the data documentation created by roxygen2 (r-lib/roxygen2#307). Converting a data frame to .Rd is straightforward, but the data frames created by dfSummary contain embedded newlines -- this makes it a bit more difficult.

Wrong link for the recommendation vignette

There's an error with the link for the recommendation vignette. I'll create a PR that solves this.

The error is here:

The following vignettes complements this page: [Recommendations for
Using summarytools With
Rmarkdown](https://cdn.rawgit.com/dcomtois/summarytools/dev-current/inst/doc/Recommendations-rmarkdown.html)

Error loading summarytools

I just installed summarytools 0.8.3 from CRAN with no error messages.

packageVersion("summarytools")
[1] ‘0.8.3’
> library(summarytools)
Error in get(method, envir = home) : 
  lazy-load database 'xxx/summarytools/R/summarytools.rdb' is corrupt
In addition: Warning messages:
1: In .registerS3method(fin[i, 1], fin[i, 2], fin[i, 3], fin[i, 4],  :
  restarting interrupted promise evaluation
2: In get(method, envir = home) :
  restarting interrupted promise evaluation
3: In get(method, envir = home) : internal error -3 in R_decompress1
Error: package or namespace load failed for ‘summarytools’

Installing from github gives the same results.

Session info ------------------------------------------------------------------------------------------
 setting  value                       
 version  R version 3.3.2 (2016-10-31)
 system   x86_64, linux-gnu           
 ui       RStudio (1.1.447)           
 language (EN)                        
 collate  en_US.UTF-8                 
 tz       America/New_York            
 date     2018-04-27                  

simplify code with dplyr

Hey,

Great package!

I think code for descr etc. can be radically simplified using dplyr.

For instance:

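library(readr)  # read_csv
library(dplyr)  # summarize_if, funs, %>%
library(tidyr)  # gather, separate, spread
# Read the raw data without a header row; columns are auto-named X1..X5
iris <- read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data", col_names = FALSE)
# One summary column per variable/statistic pair, e.g. X1_mean
iris_num <- iris %>%
   summarize_if(is.numeric, funs(mean = mean, median = median, min = min, max = max, missing = sum(is.na(.)))) 
# Reshape so that rows are statistics and columns are variables
iris_num_long <- iris_num %>%
  gather(key = "key", value = "words") %>%
  separate(key, into = c("var", "statistic")) %>%
  spread(key = "var", value = "words")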

produces

 iris_num_long
# A tibble: 5 x 5
  statistic    X1    X2    X3    X4
* <chr>     <dbl> <dbl> <dbl> <dbl>
1 max        7.90  4.40  6.90 2.50 
2 mean       5.84  3.05  3.76 1.20 
3 median     5.80  3.00  4.35 1.30 
4 min        4.30  2.00  1.00 0.100
5 missing    0     0     0    0    

and this allows you to pass arbitrary functions to summarize easily

bug

fyi, I get the following error:

Error in isTRUE(extra_space) : object 'extra_space' not found

Will try to post a reproducible example

Suggestion: treat binary integers differently

In the data frame summary, if an integer column contains only 0s and 1s, I believe it is not very useful to describe it as "mean (sd) : 0.23 (0.42) min < med < max : 0 < 0 < 1 IQR (CV) : 0 (1.82)". I suggest it would be more useful to report how many 0 and 1 values occur.
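A base R sketch of the suggested behaviour: detect a 0/1 column and report counts instead of moments. The helper name `is_binary` is hypothetical and not a summarytools function.

```r
# Hypothetical helper: TRUE when every non-missing value is 0 or 1
is_binary <- function(x) is.numeric(x) && all(x[!is.na(x)] %in% c(0, 1))

x <- c(0L, 1L, 0L, 0L, NA, 1L)

# For binary columns, show counts of each value (and of NA, if any)
if (is_binary(x)) {
  binary_counts <- table(x, useNA = "ifany")
  print(binary_counts)
}
```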
