The skimpy from aeturrell

Skewness & kurtosis?

Hey Arthur!
Are you planning to add skewness & kurtosis to the summary stats?
Thanks!
Pedro

skim raises exception with multiindexes

The culprit appears to be the _infer_datatypes function.

skimpy/src/skimpy/__init__.py

Line 95 in ad48d11

def _infer_datatypes(df: pd.DataFrame) -> pd.DataFrame:

The workaround appears to be replacing the above function with pandas' built-in infer_objects method.
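
A minimal sketch of what that workaround relies on, assuming a small made-up frame with MultiIndex columns. infer_objects() lets pandas choose the dtypes itself; flattening the column index afterwards is an extra, hypothetical precaution and is not something the report confirms skimpy needs.

```python
import pandas as pd

# Made-up frame: MultiIndex columns, everything stored as object dtype.
columns = pd.MultiIndex.from_tuples([("a", "x"), ("a", "y"), ("b", "z")])
df = pd.DataFrame(
    [[1, 1.5, "u"], [2, 2.5, "v"]],
    columns=columns,
    dtype=object,
)

# Let pandas infer better dtypes for the object columns.
inferred = df.infer_objects()

# Optional precaution (an assumption, not from the report):
# flatten the MultiIndex into plain string column names.
flat = inferred.copy()
flat.columns = ["_".join(map(str, col)) for col in inferred.columns]

print(inferred.dtypes)
print(list(flat.columns))
```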

Explore jupyter notebook for readme.rst generation

jupyter nbconvert --to rst README.ipynb

adding support for datetime.date object types

Hi,

The package is super useful. However, support for some key data types frequently used with pandas seems to be missing.
It would be great if you could add support for datetime.date, datetime.month, datetime.year and so on.

For example, it supports datetime64, but if one wants to keep only the date part:
dt['date'] = dt['datetime'].dt.date

it gives the error "data type 'date' not understood".

Thank you
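
A possible stopgap, sketched with made-up data: .dt.date produces an object column of datetime.date values, whereas .dt.normalize() keeps the datetime64 dtype with the time set to midnight, and pd.to_datetime can convert an existing date column back. This only illustrates pandas behaviour; it is an assumption that a datetime64 column is what skim needs here.

```python
import pandas as pd

dt = pd.DataFrame(
    {"datetime": pd.to_datetime(["2021-01-01 10:30:00", "2021-01-02 23:15:00"])}
)

# As in the report: an object column holding datetime.date values.
dt["date"] = dt["datetime"].dt.date

# Alternative: drop the time of day but keep the datetime64 dtype.
dt["date_only"] = dt["datetime"].dt.normalize()

# Or convert an existing date column back to datetime64.
dt["date_restored"] = pd.to_datetime(dt["date"])

print(dt.dtypes)
```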

A neat option to export to a well-formatted table for onward inclusion in reports and figures

Practically, this will need to be something like a JSON given the structure of the results table.

IndexError: list index out of range

A Colab notebook, including data, that reproduces the error is here:

https://github.com/Mjboothaus/Jupyter/blob/master/cleanup_beach_data.ipynb

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-9-d37235d13c7a> in <module>()
----> 1 skim(df)

/usr/local/lib/python3.7/dist-packages/typeguard/__init__.py in wrapper(*args, **kwargs)
   1031         memo = _CallMemo(python_func, _localns, args=args, kwargs=kwargs)
   1032         check_argument_types(memo)
-> 1033         retval = func(*args, **kwargs)
   1034         try:
   1035             check_return_type(retval, memo)

/usr/local/lib/python3.7/dist-packages/skimpy/__init__.py in skim(df, header_style, **colour_kwargs)
    543         grid.add_row(sum_tab)
    544     # Weirdly, iteration over list of tabs misses last entry
--> 545     grid.add_row(list_of_tabs[-1])
    546     console.print(Panel(grid, title="skimpy summary", subtitle="End"))
    547 

IndexError: list index out of range

Have an export to pandas option

This would mostly be straightforward.

For examples of how to do the charts within a pandas dataframe, see https://twitter.com/jonathanrlarkin/status/1503591106939867137?s=11.

Error on __infer_datatypes due to 'cannot convert NA to integer'

Running skimpy.skim(df) returns an error:

    662     df = _delete_unsupported_columns(df)
    663     # Perform inference of datatypes
--> 664     df = _infer_datatypes(df)

/python3.9/site-packages/skimpy/__init__.py in _infer_datatypes(df)
    137             continue
    138         # There is no else statement here because logic should never get to this point.
--> 139         df[col[0]] = df[col[0]].astype(data_type)
    140     return df
    141

I have a lot of columns, so the message does not usefully describe what to fix.

I also cleaned my df (b10_r) with the code below, and I still get that error.

import pandas

# Collect columns whose inferred dtype is mixed or unknown
kols = []
for column in b10_r.columns:
    ty = pandas.api.types.infer_dtype(b10_r[column])
    print("{} - {}".format(column, ty))
    if ty in ["mixed-integer", "mixed", "mixed-integer-float", "unknown-array"]:
        kols.append(column)

# Drop the collected columns
for k in kols:
    del b10_r[k]

But I still get the error.
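
For context, a small sketch of one way this kind of error can arise in plain pandas, using a made-up Series: casting a column that contains missing values to a NumPy integer dtype fails, while the nullable "Int64" extension dtype accepts them. Whether this is the exact cause inside _infer_datatypes is an assumption.

```python
import numpy as np
import pandas as pd

# Made-up column: whole numbers plus one missing value.
s = pd.Series([1.0, np.nan, 3.0])

# Casting to a NumPy integer dtype fails because NaN has no integer value.
error = None
try:
    s.astype("int64")
except (ValueError, TypeError) as exc:
    error = exc

# The nullable integer dtype keeps the missing value as <NA>.
nullable = s.astype("Int64")

print(type(error).__name__)
print(nullable)
```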

Code coverage stats are not appearing

Skim output is not able to be recorded or exported to html or svg

I have tried to export skim results. I tried to record it using Console(record=True) before calling the function; however, I got a NoneType object. The expected result is an object to be exported via html or svg to share the results obtained.
I also tested the Console.capture() method obtaining the same behavior. Did I do something wrong?

from rich.console import Console
from skimpy import skim, generate_test_data

console = Console(record=True)
df = generate_test_data()
skim(df)
console.save_html("demo.html")

The demo.html is empty. Thanks for your support
Regards
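
For reference, Rich records only what is printed through the Console instance created with record=True. In the snippet above nothing is printed through that console, which would be consistent with the empty file. A minimal sketch with a made-up table (the table contents and the StringIO sink are illustrative assumptions):

```python
import io

from rich.console import Console
from rich.table import Table

# Record everything printed through this console; write to a buffer
# instead of the terminal so the example stays quiet.
console = Console(record=True, file=io.StringIO(), width=80)

table = Table()
table.add_column("name")
table.add_row("value_1")

# Only output printed via this console instance is recorded.
console.print(table)

# export_html returns the recorded output as an HTML string;
# console.save_html("demo.html") would write the same content to disk.
html = console.export_html()
```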

Truncated names should be longer (as a lot of empty space is present too)

It seems that skimpy truncates variable names to 20 characters. This seems unreasonable, as there is a lot of unused empty space (indicated with yellow squares). This empty space could be removed to leave more room for longer names.

How can we control the maximum length of variable names?
Can this empty space be removed to allow longer names?
Can an ellipsis (the single symbol "…") be used to signal that a variable name was truncated?
There should be a way to identify variables that become ambiguous after their names are truncated (see the second figure, red squares).
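
A rough sketch of the last two requests, using hypothetical helper names (truncate_name and ambiguous_names are not part of skimpy): truncate with a single-character ellipsis, then group the original names by their truncated form to spot collisions.

```python
from collections import defaultdict


def truncate_name(name: str, max_len: int = 20) -> str:
    """Shorten name to max_len characters, ending in an ellipsis if cut."""
    if max_len < 1:
        raise ValueError("max_len must be at least 1")
    if len(name) <= max_len:
        return name
    return name[: max_len - 1] + "…"


def ambiguous_names(names, max_len: int = 20) -> dict:
    """Map each truncated name shared by several columns to the originals."""
    groups = defaultdict(list)
    for name in names:
        groups[truncate_name(name, max_len)].append(name)
    return {short: full for short, full in groups.items() if len(full) > 1}


names = ["total_revenue_2021_q1", "total_revenue_2021_q2", "id"]
print([truncate_name(n) for n in names])
print(ambiguous_names(names))
```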

Bug in string word counts?

skimpy/src/skimpy/__init__.py

Line 414 in 910eb80

xf[xf.columns[0]].str.count(" ").add(1).sum()

Noticed some weird behavior with the word counts in skimpy output - should this be using col to subset xf rather than xf.columns[0]?
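
A small sketch of the difference, with made-up data: indexing with xf.columns[0] always counts words in the first column, whereas indexing with the loop variable counts each column separately.

```python
import pandas as pd

xf = pd.DataFrame(
    {
        "first": ["one two", "three"],
        "second": ["a b c d", "e f"],
    }
)

# Behaviour described in the report: every column gets the first column's count.
always_first = {
    col: int(xf[xf.columns[0]].str.count(" ").add(1).sum()) for col in xf.columns
}

# Suggested fix: subset with col so each column is counted on its own.
per_column = {col: int(xf[col].str.count(" ").add(1).sum()) for col in xf.columns}

print(always_first)
print(per_column)
```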

Citation

I'm using skimpy in a project and would love to have details for a Bibtex reference, thank you!

Inline histogram distorts the output of the layout

Uneven inline histogram bar widths distort the layout of the output:

This happens because the UTF-8 symbols (squares) that form the histogram have different widths. I noticed that in R, in some cases, the 4th and 8th (the narrowest) symbols are excluded:

https://github.com/ropensci/skimr/blob/d5126aa020e703f37740af7ee56a4acb5830fd08/R/stats.R#L133-L136

My question:

Can an option be added to remove the histogram? Instead, an option to include the median, which is missing, could be added.
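
To illustrate the idea of leaving out the 4th and 8th block symbols, here is a hedged sketch of an inline histogram built from the remaining six characters. The function name inline_hist and the scaling rule are assumptions for illustration, not skimpy's or skimr's implementation.

```python
import numpy as np

# The eight block characters are ▁▂▃▄▅▆▇█; this set skips the 4th and 8th.
BARS = "▁▂▃▅▆▇"


def inline_hist(values, bins: int = 6) -> str:
    """Return a one-line histogram using only the six characters in BARS."""
    counts, _ = np.histogram(values, bins=bins)
    if counts.max() == 0:
        return BARS[0] * bins
    # Scale each bin count to an index into BARS.
    idx = np.round(counts / counts.max() * (len(BARS) - 1)).astype(int)
    return "".join(BARS[i] for i in idx)


print(inline_hist([0, 0, 0, 1], bins=2))
```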

add timedelta to the generated test data

Consider Pandas 2.0+ support?

Hi there. Pandas 2 came out a few months back and your installation dependency is at pandas ^1.3.2. Would you consider checking for Pandas 2 compatibility?
I'm going to mention skimpy in my newsletter (to 1,600 data scientists). I know that a bunch have upgraded to Pandas 2 already (given recent conference talks I've given on Pandas 2 and Polars), so hopefully that'd open the door to a new base of users for you.
ydata-profiling (née pandas-profiling) just added Pandas 2 support too: https://github.com/ydataai/ydata-profiling/releases/tag/v4.3.0
Cheers Ian.

`skim`: changing number of columns to be summarized

skim summarizes 20 columns by default. I couldn't find a way to change this default behaviour.

Suppress 'sum(cleaned)} column names have been cleaned' Message

Could you please add a parameter to suppress the 'sum(cleaned)} column names have been cleaned' message?
I would really appreciate it!
Thank you
Regards

Broken Contributing Link

The link at the top of the home page points to contributing.html, but the page is called CONTRIBUTING.html, hence the link is broken.

Reports not properly generated with a single dataframe column

MRE:

import pandas as pd
from numpy.random import Generator, PCG64
from skimpy import skim

seed = 34729
rng = Generator(PCG64(seed))
len_df = 1000
df = pd.DataFrame()
df["length"] = rng.beta(0.5, 0.5, size=len_df)
skim(df)

Add citation

And an option to submit notification of use.

Make output friendly to Quarto documents when there is any R code being executed in the .qmd file too

It would be helpful to have a quarto-friendly output option, so that tables generated from skim render in markdown instead of rendering as code-like objects.

For instance, a file like this, with a python skim(df) statement and an R skim(df) statement

(you'll have to add the qmd extension, github won't let me upload a qmd file)
test.qmd

renders as

Thank you for making this package, btw - it has made it much easier to teach my students R and python simultaneously when there are so many packages that have parallel functions and syntax between them.

Round numbers to sensible number of significant figures

Use something like

i = 32.1123
print(f'{float(f"{i:.2g}"):g}')

Add tests for compilation of docs

Support for polars

Polars is an increasingly popular data frame package. Although polars users can currently convert to pandas to run skimpy, would it be better if support were native?

Remove decimals and trailing zeros on whole numbers

It would look better to remove decimals and trailing zeros on whole numbers. Something like s.rstrip('0').rstrip('.') if '.' in s else s could work.
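
A short sketch of that suggestion wrapped in two hypothetical helpers (strip_trailing_zeros and format_number are illustrative names, not skimpy functions), assuming numbers are first formatted to a fixed number of decimals:

```python
def strip_trailing_zeros(s: str) -> str:
    """Drop trailing zeros after a decimal point, and the point if left bare."""
    return s.rstrip("0").rstrip(".") if "." in s else s


def format_number(x: float, decimals: int = 2) -> str:
    """Format x with a fixed number of decimals, then tidy the result."""
    return strip_trailing_zeros(f"{x:.{decimals}f}")


for value in (3.0, 3.10, 100.0, 2.456):
    print(format_number(value))
```

The check for "." matters: without it, a string such as "100" would lose its zeros.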

Be able to handle time delta

Currently, this data type is converted to strings.

For example:

import pandas as pd
from skimpy import skim

# One timedelta column and one string column
df_check = pd.DataFrame(
    {
        "header": [pd.Timedelta(365, "d"), pd.Timedelta(-19, "d")],
        "header_1": ["length_one", "length_two"],
    }
)
skim(df_check)

should produce a table with a time difference section.
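
As a sketch of what such a section could be built from, plain pandas can already select timedelta columns and compute statistics on them. The choice of min, max and mean below is an assumption about what the summary might contain.

```python
import pandas as pd

df_check = pd.DataFrame(
    {
        "header": [pd.Timedelta(365, "d"), pd.Timedelta(-19, "d")],
        "header_1": ["length_one", "length_two"],
    }
)

# Pick out only the timedelta columns, leaving the string column aside.
td_cols = df_check.select_dtypes(include="timedelta")

# Statistics that remain meaningful for time differences.
summary = {
    col: {"min": s.min(), "max": s.max(), "mean": s.mean()}
    for col, s in td_cols.items()
}

print(summary)
```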

Words per row and Word count improper when we have multiple text columns

This is the result I got for the Titanic dataset, and it looks incorrect.

A lot of Jupyter dependencies

This tool looks very useful. However, when I try installing it into my kernel environment, it pulls in a lot of dependencies, including Jupyter and all the associated server dependencies. Perhaps these need to be dev dependencies? I can't see where the dependency is used otherwise. I can't see where ipykernel is used either (initially I thought you might need to import from IPython.display).

bash$ poetry add git+https://github.com/aeturrell/skimpy.git

Updating dependencies
Resolving dependencies... (7.9s)

Package operations: 34 installs, 0 updates, 0 removals

• Installing types-python-dateutil (2.8.19.20240106)
• Installing arrow (1.3.0)
• Installing fqdn (1.5.1)
• Installing isoduration (20.11.0)
• Installing jsonpointer (2.4)
• Installing rfc3339-validator (0.1.4)
• Installing rfc3986-validator (0.1.1)
• Installing uri-template (1.3.0)
• Installing webcolors (1.13)
• Installing argon2-cffi-bindings (21.2.0)
• Installing python-json-logger (2.0.7)
• Installing terminado (0.18.0)
• Installing anyio (4.2.0)
• Installing argon2-cffi (23.1.0)
• Installing jupyter-events (0.9.0)
• Installing jupyter-server-terminals (0.5.1)
• Installing overrides (7.4.0)
• Installing send2trash (1.8.2)
• Installing websocket-client (1.7.0)
• Installing babel (2.14.0)
• Installing json5 (0.9.14)
• Installing jupyter-server (2.12.4)
• Installing async-lru (2.0.4)
• Installing jupyter-lsp (2.2.1)
• Installing jupyterlab-server (2.25.2)
• Installing notebook-shim (0.2.3)
• Installing jupyterlab (4.0.10)
• Installing qtpy (2.4.1)
• Installing jupyter-console (6.6.3)
• Installing notebook (7.0.6)
• Installing qtconsole (5.5.1)
• Installing jupyter (1.0.0)
• Installing typeguard (4.1.5)
• Installing skimpy (0.0.11 556aff6)

Writing lock file

Please make it available on Polars as well.

First of all, thank you for creating such a wonderful package.

I was able to quickly understand the characteristics of the data using skim in R, and thank you for making it possible in Python as well.

Polars, the DataFrame package, has been growing rapidly in popularity recently.
You can use the skim function with Polars by calling to_pandas() first.
However, it would be better if Polars were supported directly in skimpy.
Also, Pandas has been updated to version 2.x, but installing skimpy downgrades the Pandas version. It would be nice if skimpy were also updated to support Pandas 2.x.

Thank you

Explore switching docs to Quartodoc

https://github.com/machow/quartodoc

The main advantage is to remove the hackiness of the current solution.

TypeError: import_optional_dependency() got an unexpected keyword argument 'errors'

From my python (v 3.7) dev vm at work I pull data from Vertica (mysql) into a pandas df, and I get what smells like a dependency issue. If this is based on a pandas dependency, is it possible to use skimpy with a different version of pandas through some older version of skimpy?

I run:
skim(df)

I get the issue:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
/tmp/ipykernel_xyz/myscript.py in
----> 1 skim(shipments)
2 # shipments.describe()

~/.venv/asdf/lib/python3.7/site-packages/typeguard/__init__.py in wrapper(*args, **kwargs)
1031 memo = _CallMemo(python_func, _localns, args=args, kwargs=kwargs)
1032 check_argument_types(memo)
-> 1033 retval = func(*args, **kwargs)
1034 try:
1035 check_return_type(retval, memo)

~/.venv/asdf/lib/python3.7/site-packages/skimpy/__init__.py in skim(df, header_style, **colour_kwargs)
527 xf = df.select_dtypes(col_type)
528 if not xf.empty:
--> 529 sum_df = summary_func(xf)
530 list_of_tabs.append(
531 dataframe_to_rich_table(

~/.venv/asdf/lib/python3.7/site-packages/typeguard/__init__.py in wrapper(*args, **kwargs)
1031 memo = _CallMemo(python_func, _localns, args=args, kwargs=kwargs)
1032 check_argument_types(memo)
-> 1033 retval = func(*args, **kwargs)
1034 try:
1035 check_return_type(retval, memo)

~/.venv/asdf/lib/python3.7/site-packages/skimpy/__init__.py in numeric_variable_summary_table(xf)
306 data_dict = {
307 "missing": count_nans_vec,
--> 308 "complete rate": 1 - count_nans_vec / xf.shape[0],
309 NUM_COL_MEAN: xf.mean(),
310 "sd": xf.std(),

~/.venv/asdf/lib/python3.7/site-packages/pandas/core/ops/common.py in new_method(self, other)
63 break
64 if isinstance(other, cls):
---> 65 return NotImplemented
66
67 other = item_from_zerodim(other)

~/.venv/asdf/lib/python3.7/site-packages/pandas/core/arraylike.py in truediv(self, other)
111 def rmul(self, other):
112 return self._arith_method(other, roperator.rmul)
--> 113
114 @unpack_zerodim_and_defer("truediv")
115 def truediv(self, other):

~/.venv/asdf/lib/python3.7/site-packages/pandas/core/series.py in _arith_method(self, other, op)
4996 0 True
4997 1 True
-> 4998 2 True
4999 3 False
5000 4 True

~/.venv/asdf/lib/python3.7/site-packages/pandas/core/ops/array_ops.py in arithmetic_op(left, right, op)
187 Evaluate an arithmetic operation +, -, *, /, //, %, **, ...
188
--> 189 Note: the caller is responsible for ensuring that numpy warnings are
190 suppressed (with np.errstate(all="ignore")) if needed.
191

~/.venv/asdf/lib/python3.7/site-packages/pandas/core/ops/array_ops.py in _na_arithmetic_op(left, right, op, is_cmp)
137
138 def _na_arithmetic_op(left, right, op, is_cmp: bool = False):
--> 139 """
140 Return the result of evaluating op on the passed in values.
141

~/.venv/asdflib/python3.7/site-packages/pandas/core/computation/expressions.py in
17 from pandas._typing import FuncType
18
---> 19 from pandas.core.computation.check import NUMEXPR_INSTALLED
20 from pandas.core.ops import roperator
21

~/.venv/data_analyses/lib/python3.7/site-packages/pandas/core/computation/check.py in
1 from pandas.compat._optional import import_optional_dependency
2
----> 3 ne = import_optional_dependency("numexpr", errors="warn")
4 NUMEXPR_INSTALLED = ne is not None
5 if NUMEXPR_INSTALLED:

TypeError: import_optional_dependency() got an unexpected keyword argument 'errors'

Add export data to features / quick start

A kwarg in skim function

Remove Jupyter book dependency

Wrong number of NA rows in the output?

Hi,
first of all thank you for this great tool.

If I run skimpy on this 999-row CSV, it reports 1000 NA rows.

Thank you

Add doc tests and examples

Use format as in: https://github.com/Erotemic/xdoctest

Skimpy ignores other data types except for Float and Integer

Hi,

I tried Skimpy for the first time today, and it seems I have found a bug. I used Skimpy on my sample dataframe, and it only returns the summary for the Float and Integer columns, while the others are ignored.

This is the sample code:

from skimpy import skim
import datetime
import pandas as pd

data = ([datetime.datetime(2021, 1, 1), None, 'as', 6],
        [datetime.datetime(2021, 1, 2), 5.2, 'asd', 7],
        [None, 6.3, 'adasda', 8])

df = pd.DataFrame(data, columns=['date', 'float', 'string', 'integer'])

skim(df)

This is the result I got:

Column name colour - how can we change / customise?

Hi there, great package. I'm wondering if there is an easy way to change the colour used for the column names in the output. The current default is pink; unfortunately I have a grey terminal background, and a pink foreground font is pretty much impossible to see.

Thanks
