
aeturrell / skimpy

361 stars · 10 watchers · 17 forks · 3.77 MB

skimpy is a lightweight tool that provides summary statistics about the variables in data frames, within the console.

Home Page: https://aeturrell.github.io/skimpy/

License: Other

Languages: Python 98.59%, Makefile 1.41%
Topics: eda, pandas, data-science, exploratory-data-analysis, summary-statistics, statistics

skimpy's People

Contributors

aeturrell · dependabot[bot] · galenseilis · rumiallbert


skimpy's Issues

Skewness & kurtosis?

Hey Arthur!
Are you planning to add skewness & kurtosis to the summary stats?
Thanks!
Pedro
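
While this request is open, pandas can already compute both statistics for the numeric columns; a minimal interim sketch (using skimpy's generate_test_data purely for illustration):

# Interim sketch, not part of skimpy: compute skewness and kurtosis per numeric
# column directly with pandas, alongside the skim() summary.
from skimpy import generate_test_data

df = generate_test_data()
print(df.skew(numeric_only=True))      # skewness of each numeric column
print(df.kurtosis(numeric_only=True))  # kurtosis of each numeric column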

adding support for datetime.date object types

Hi,

The package is super useful. However, support for some key data types frequently used with pandas seems to be missing.
It would be great if you could add support for datetime.date, datetime.month, datetime.year and so on.

For example, datetime64 is supported, but if one keeps only the date part:
dt['date'] = dt['datetime'].dt.date

skim then gives the error "data type 'date' not understood".

Thank you
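
A possible workaround while native support is missing (a sketch; the dt frame below is constructed only for illustration): cast the datetime.date column back to datetime64 just before summarising.

# Workaround sketch: skim handles datetime64 columns, so convert the object
# column of datetime.date values back to datetime64 before calling skim.
import pandas as pd
from skimpy import skim

dt = pd.DataFrame({"datetime": pd.date_range("2021-01-01", periods=3, freq="D")})
dt["date"] = dt["datetime"].dt.date      # object column of datetime.date values
dt["date"] = pd.to_datetime(dt["date"])  # back to datetime64[ns] for skim
skim(dt)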

IndexError: list index out of range

A Colab notebook, including the data needed to reproduce the error, is here:

https://github.com/Mjboothaus/Jupyter/blob/master/cleanup_beach_data.ipynb

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-9-d37235d13c7a> in <module>()
----> 1 skim(df)

/usr/local/lib/python3.7/dist-packages/typeguard/__init__.py in wrapper(*args, **kwargs)
   1031         memo = _CallMemo(python_func, _localns, args=args, kwargs=kwargs)
   1032         check_argument_types(memo)
-> 1033         retval = func(*args, **kwargs)
   1034         try:
   1035             check_return_type(retval, memo)

/usr/local/lib/python3.7/dist-packages/skimpy/__init__.py in skim(df, header_style, **colour_kwargs)
    543         grid.add_row(sum_tab)
    544     # Weirdly, iteration over list of tabs misses last entry
--> 545     grid.add_row(list_of_tabs[-1])
    546     console.print(Panel(grid, title="skimpy summary", subtitle="End"))
    547 

IndexError: list index out of range
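
Judging from the traceback, list_of_tabs appears to be empty when the final grid.add_row call runs, which would mean no column was matched to a supported dtype. A diagnostic sketch (the file name is a hypothetical stand-in for the notebook's data):

# Diagnostic sketch: inspect what pandas thinks each column is before calling
# skim; columns inferred as "mixed" or "unknown-array" are likely culprits and
# can be cast explicitly (e.g. to string or numeric) first.
import pandas as pd

df = pd.read_csv("cleanup_beach_data.csv")  # hypothetical stand-in for the notebook's data
print(df.dtypes)
print({col: pd.api.types.infer_dtype(df[col]) for col in df.columns})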

Error in _infer_datatypes due to 'cannot convert NA to integer'

Running skimpy.skim(df) returns an error:

    662     df = _delete_unsupported_columns(df)
    663     # Perform inference of datatypes
--> 664     df = _infer_datatypes(df)

/python3.9/site-packages/skimpy/__init__.py in _infer_datatypes(df)
    137             continue
    138         # There is no else statement here because logic should never get to this point.
--> 139         df[col[0]] = df[col[0]].astype(data_type)
    140     return df
    141 

I have many columns, so the error message does not usefully indicate which column needs fixing.

I also tried cleaning my df (b10_r) with the code below, and I still get the error.

import pandas

# collect the columns whose inferred dtype looks ambiguous
kols = []
for column in b10_r.columns:
    ty = pandas.api.types.infer_dtype(b10_r[column])
    print("{} - {}".format(column, ty))
    if ty in ["mixed-integer", "mixed", "mixed-integer-float", "unknown-array"]:
        kols.append(column)

# drop those columns before calling skim
for k in kols:
    del b10_r[k]

But the same error still appears.
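
A possible workaround (a sketch; soften_integer_columns is a hypothetical helper, not part of skimpy), assuming the failure happens because an integer-like column containing NA gets cast to int inside _infer_datatypes:

import pandas as pd
from skimpy import skim

def soften_integer_columns(df: pd.DataFrame) -> pd.DataFrame:
    # Cast integer-like columns that contain missing values to float so that
    # skimpy's internal .astype(...) cannot hit "cannot convert NA to integer".
    out = df.copy()
    for col in out.columns:
        kind = pd.api.types.infer_dtype(out[col], skipna=True)
        if kind in ("integer", "mixed-integer", "mixed-integer-float") and out[col].isna().any():
            out[col] = pd.to_numeric(out[col], errors="coerce").astype(float)
    return out

skim(soften_integer_columns(b10_r))  # b10_r is the reporter's dataframe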

Skim output cannot be recorded or exported to HTML or SVG

I have tried to export skim results. I tried recording them with Console(record=True) before calling the function; however, I got a NoneType object. The expected result is an object that can be exported to HTML or SVG to share the results.
I also tested the Console.capture() method and got the same behaviour. Did I do something wrong?

from rich.console import Console
from skimpy import skim, generate_test_data

console = Console(record=True)
df = generate_test_data()
skim(df)
console.save_html("demo.html")

The demo.html is empty. Thanks for your support
Regards
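
The behaviour above is consistent with skim() printing through its own internal rich Console, so a Console created by the caller never receives the output. A plain-text capture sketch (it produces text rather than HTML or SVG, and assumes rich falls back to the redirected stdout):

import contextlib
import io
from skimpy import skim, generate_test_data

df = generate_test_data()
buffer = io.StringIO()
with contextlib.redirect_stdout(buffer):
    skim(df)  # the rendered table goes to the redirected stdout
with open("skim_output.txt", "w", encoding="utf-8") as f:
    f.write(buffer.getvalue())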

Truncated variable names could be longer (a lot of empty space is present too)

It seems that skimpy truncates variable names to 20 characters. This seems unreasonable, as there is a lot of unused empty space (indicated with yellow squares in the screenshots below). That space could be reclaimed to give longer names more room.

[screenshot 1: skim output with truncated variable names and unused space highlighted in yellow]

[screenshot 2: skim output where truncated names become ambiguous, highlighted in red]

  1. How can we control the maximum length of variable names? (A possible interim workaround is sketched after this list.)
  2. Can the empty space be removed to allow longer names?
  3. Can an ellipsis (the single symbol "…") be used to signal that a variable name was truncated?
  4. There should be a way to identify variables that become ambiguous after their names are truncated (see the second figure, red squares).
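
A possible interim workaround (a sketch; shorten_columns is a hypothetical helper, not part of skimpy): shorten and de-duplicate the column names yourself before calling skim, so truncation never produces ambiguous names.

import pandas as pd
from skimpy import skim

def shorten_columns(df: pd.DataFrame, width: int = 20) -> pd.DataFrame:
    # Truncate long names to `width` characters, mark truncation with an
    # ellipsis, and append a counter when two shortened names collide.
    seen = {}
    new_names = []
    for name in df.columns:
        short = name if len(name) <= width else name[: width - 1] + "…"
        seen[short] = seen.get(short, 0) + 1
        if seen[short] > 1:
            short = f"{short[: width - 2]}#{seen[short]}"
        new_names.append(short)
    return df.set_axis(new_names, axis="columns")

skim(shorten_columns(df))  # df is your dataframe with long column names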

Citation

I'm using skimpy in a project and would love to have details for a BibTeX reference, thank you!

Inline histogram distorts the output of the layout

Uneven inline histogram bar widths distort the layout of the output:

[screenshot: skim output with a distorted layout caused by uneven histogram bar widths]

This happens because the Unicode block symbols that form the histogram have different widths in some fonts. I noticed that in R's skimr the 4th and 8th (the narrowest) symbols are excluded in some cases (a sketch of this mitigation appears after the question below):

https://github.com/ropensci/skimr/blob/d5126aa020e703f37740af7ee56a4acb5830fd08/R/stats.R#L133-L136

My question:

  • Can an option be added to remove the histogram? In its place, an option to include the median, which is currently missing, could be added.
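
A sketch of the skimr-style mitigation referenced above (not skimpy's implementation; which glyphs render narrow varies by terminal and font, so the skipped set is an assumption): build the inline histogram only from block glyphs of consistent width.

import numpy as np

# U+2581–U+2588 minus the 4th (▄) and 8th (█) glyphs, mirroring the linked skimr code
BLOCKS = "▁▂▃▅▆▇"

def sparkline(values, bins=10):
    # Bin the data, scale each bin count to an index into BLOCKS, and join.
    counts, _ = np.histogram(np.asarray(values, dtype=float), bins=bins)
    levels = np.floor(counts / max(counts.max(), 1) * (len(BLOCKS) - 1)).astype(int)
    return "".join(BLOCKS[i] for i in levels)

print(sparkline(np.random.default_rng(0).normal(size=500)))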

Consider Pandas 2.0+ support?

Hi there. Pandas 2 came out a few months back and your installation dependency is pinned at pandas ^1.3.2. Would you consider checking for Pandas 2 compatibility?
I'm going to mention skimpy in my newsletter (to 1,600 data scientists); I know that many of them have already upgraded to Pandas 2 (given recent conference talks I've given on Pandas 2 and Polars), so hopefully that would open the door to a new base of users for you.
ydata-profiling (née pandas-profiling) just added Pandas 2 support too: https://github.com/ydataai/ydata-profiling/releases/tag/v4.3.0
Cheers, Ian.

Broken Contributing Link

The link at the top of the home page points to contributing.html, but the page is called CONTRIBUTING.html, hence the link is broken.

Make output friendly to Quarto documents when there is any R code being executed in the .qmd file too

It would be helpful to have a quarto-friendly output option, so that tables generated from skim render in markdown instead of rendering as code-like objects.

For instance, take a file like this, with a Python skim(df) statement and an R skim(df) statement:

(You'll have to add the .qmd extension; GitHub won't let me upload a .qmd file.)
test.qmd

renders as

[Screenshot of the rendered HTML file showing the Python skim output as code-like text and the R skim output as HTML tables.]

Thank you for making this package, by the way; it has made it much easier to teach my students R and Python simultaneously, since so many packages have parallel functions and syntax between the two.

Support for polars

Polars is an increasingly popular data frame package. Although polars users can currently convert to pandas to run skimpy (see the sketch below), would it be better if support were native?
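
A minimal interop sketch (assuming skimpy only accepts pandas DataFrames; polars' to_pandas() typically requires pyarrow to be installed):

import polars as pl
from skimpy import skim

pl_df = pl.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})
skim(pl_df.to_pandas())  # convert to pandas before summarising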

Be able to handle time delta

Currently, this data type is converted to strings.

For example:

import pandas as pd
from skimpy import skim

df_check = pd.DataFrame(
    {
        "header": [pd.Timedelta(365, "d"), pd.Timedelta(-19, "d")],
        "header_1": ["length_one", "length_two"],
    }
)
skim(df_check)

should produce a table with a time difference section.

A lot of Jupyter dependencies

This tool looks very useful. However, when I install it into my kernel environment it pulls in a lot of dependencies, including Jupyter and all the associated server packages. Perhaps these should be dev dependencies? I can't see where they are used otherwise. I can't see where ipykernel is used either (initially I thought you might need to import from IPython.display).

bash$ poetry add git+https://github.com/aeturrell/skimpy.git

Updating dependencies
Resolving dependencies... (7.9s)

Package operations: 34 installs, 0 updates, 0 removals

• Installing types-python-dateutil (2.8.19.20240106)
• Installing arrow (1.3.0)
• Installing fqdn (1.5.1)
• Installing isoduration (20.11.0)
• Installing jsonpointer (2.4)
• Installing rfc3339-validator (0.1.4)
• Installing rfc3986-validator (0.1.1)
• Installing uri-template (1.3.0)
• Installing webcolors (1.13)
• Installing argon2-cffi-bindings (21.2.0)
• Installing python-json-logger (2.0.7)
• Installing terminado (0.18.0)
• Installing anyio (4.2.0)
• Installing argon2-cffi (23.1.0)
• Installing jupyter-events (0.9.0)
• Installing jupyter-server-terminals (0.5.1)
• Installing overrides (7.4.0)
• Installing send2trash (1.8.2)
• Installing websocket-client (1.7.0)
• Installing babel (2.14.0)
• Installing json5 (0.9.14)
• Installing jupyter-server (2.12.4)
• Installing async-lru (2.0.4)
• Installing jupyter-lsp (2.2.1)
• Installing jupyterlab-server (2.25.2)
• Installing notebook-shim (0.2.3)
• Installing jupyterlab (4.0.10)
• Installing qtpy (2.4.1)
• Installing jupyter-console (6.6.3)
• Installing notebook (7.0.6)
• Installing qtconsole (5.5.1)
• Installing jupyter (1.0.0)
• Installing typeguard (4.1.5)
• Installing skimpy (0.0.11 556aff6)

Writing lock file

Please make it available on Polars as well.

First of all, thank you for creating such a wonderful package.

I was able to quickly understand the characteristics of data using skim in R; thank you for making that possible in Python as well.

Polars, another DataFrame package, has been growing rapidly in popularity recently.
You can use the skim function with polars via its to_pandas() method.
However, it would be better if polars were supported directly in skimpy.
Also, pandas has been updated to version 2.x, but installing skimpy downgrades the pandas version. It would be nice if pandas 2.x were supported as well.

Thank you

TypeError: import_optional_dependency() got an unexpected keyword argument 'errors'

From my Python (3.7) dev VM at work I pull data from Vertica into a pandas df, and I get what smells like a dependency issue. If this stems from a pandas dependency, is it possible to use skimpy with a different version of pandas via some older version of skimpy?

I run:
skim(df)

and get the following error:
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
/tmp/ipykernel_xyz/myscript.py in <module>
----> 1 skim(shipments)
      2 # shipments.describe()

~/.venv/asdf/lib/python3.7/site-packages/typeguard/__init__.py in wrapper(*args, **kwargs)
   1031         memo = _CallMemo(python_func, _localns, args=args, kwargs=kwargs)
   1032         check_argument_types(memo)
-> 1033         retval = func(*args, **kwargs)
   1034         try:
   1035             check_return_type(retval, memo)

~/.venv/asdf/lib/python3.7/site-packages/skimpy/__init__.py in skim(df, header_style, **colour_kwargs)
    527         xf = df.select_dtypes(col_type)
    528         if not xf.empty:
--> 529             sum_df = summary_func(xf)
    530             list_of_tabs.append(
    531                 dataframe_to_rich_table(

~/.venv/asdf/lib/python3.7/site-packages/typeguard/__init__.py in wrapper(*args, **kwargs)
   1031         memo = _CallMemo(python_func, _localns, args=args, kwargs=kwargs)
   1032         check_argument_types(memo)
-> 1033         retval = func(*args, **kwargs)
   1034         try:
   1035             check_return_type(retval, memo)

~/.venv/asdf/lib/python3.7/site-packages/skimpy/__init__.py in numeric_variable_summary_table(xf)
    306     data_dict = {
    307         "missing": count_nans_vec,
--> 308         "complete rate": 1 - count_nans_vec / xf.shape[0],
    309         NUM_COL_MEAN: xf.mean(),
    310         "sd": xf.std(),

~/.venv/asdf/lib/python3.7/site-packages/pandas/core/ops/common.py in new_method(self, other)
     63             break
     64         if isinstance(other, cls):
---> 65             return NotImplemented
     66
     67     other = item_from_zerodim(other)

~/.venv/asdf/lib/python3.7/site-packages/pandas/core/arraylike.py in __truediv__(self, other)
    111     def __rmul__(self, other):
    112         return self._arith_method(other, roperator.rmul)
--> 113
    114     @unpack_zerodim_and_defer("__truediv__")
    115     def __truediv__(self, other):

~/.venv/asdf/lib/python3.7/site-packages/pandas/core/series.py in _arith_method(self, other, op)
   4996     0     True
   4997     1     True
-> 4998     2     True
   4999     3    False
   5000     4     True

~/.venv/asdf/lib/python3.7/site-packages/pandas/core/ops/array_ops.py in arithmetic_op(left, right, op)
    187     Evaluate an arithmetic operation +, -, *, /, //, %, **, ...
    188
--> 189     Note: the caller is responsible for ensuring that numpy warnings are
    190     suppressed (with np.errstate(all="ignore")) if needed.
    191

~/.venv/asdf/lib/python3.7/site-packages/pandas/core/ops/array_ops.py in _na_arithmetic_op(left, right, op, is_cmp)
    137
    138 def _na_arithmetic_op(left, right, op, is_cmp: bool = False):
--> 139     """
    140     Return the result of evaluating op on the passed in values.
    141

~/.venv/asdf/lib/python3.7/site-packages/pandas/core/computation/expressions.py in <module>
     17 from pandas._typing import FuncType
     18
---> 19 from pandas.core.computation.check import NUMEXPR_INSTALLED
     20 from pandas.core.ops import roperator
     21

~/.venv/data_analyses/lib/python3.7/site-packages/pandas/core/computation/check.py in <module>
      1 from pandas.compat._optional import import_optional_dependency
      2
----> 3 ne = import_optional_dependency("numexpr", errors="warn")
      4 NUMEXPR_INSTALLED = ne is not None
      5 if NUMEXPR_INSTALLED:

TypeError: import_optional_dependency() got an unexpected keyword argument 'errors'
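
The mismatch above suggests that check.py comes from a newer pandas (which passes errors=) while _optional.py comes from an older one, which usually points to a mixed or partially upgraded pandas install in the virtual environment. A diagnostic sketch to confirm (purely illustrative, not part of skimpy):

import inspect
import pandas as pd
from pandas.compat._optional import import_optional_dependency

print(pd.__version__)
print(inspect.signature(import_optional_dependency))
# If errors= is absent from the printed signature, reinstalling pandas cleanly
# in the virtual environment should remove the inconsistency.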

Skimpy ignores other data types except for Float and Integer

Hi,

I tried skimpy for the first time today and it seems I have found a bug. I used skimpy on a sample dataframe and it only returns the summary for the float and integer columns, while the others are ignored.

This is the sample code:

from skimpy import skim
import datetime
import pandas as pd

data = ([datetime.datetime(2021, 1, 1), None, 'as', 6],
        [datetime.datetime(2021, 1, 2), 5.2, 'asd', 7],
        [None, 6.3, 'adasda', 8])

df = pd.DataFrame(data, columns=['date', 'float', 'string', 'integer'])

skim(df)

This is the result I got:

[screenshot of the skim output showing summaries only for the float and integer columns]

Column name colour - how can we change / customise?

Hi there, great package. I'm wondering if there is an easy way to change the colour used for the column names in the output. The current default uses pink; unfortunately I have a grey terminal background, and a pink foreground font is pretty much impossible to see...

Thanks
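
The tracebacks earlier on this page show the signature skim(df, header_style, **colour_kwargs), so at least the header colour can be restyled. A sketch (the specific style string is illustrative, and any per-column colour keyword beyond header_style is an assumption that may differ by version):

from skimpy import skim, generate_test_data

df = generate_test_data()
# header_style takes a rich style string; pick one that contrasts with a grey background
skim(df, header_style="italic cyan")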
