aeturrell / skimpy
skimpy is a lightweight tool that provides summary statistics about variables in data frames within the console.
Home Page: https://aeturrell.github.io/skimpy/
License: Other
Hey Arthur!
Are you planning to add skewness & kurtosis to the summary stats?
Thanks!
Pedro
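In the meantime, both statistics can be computed directly with pandas. The following is only a sketch of what such columns could contain, not skimpy's implementation; the helper name skew_kurtosis_table is made up for illustration.

```python
import pandas as pd


def skew_kurtosis_table(df: pd.DataFrame) -> pd.DataFrame:
    """Return skewness and excess kurtosis for each numeric column.

    Hypothetical helper for illustration; it is not part of skimpy.
    """
    numeric = df.select_dtypes(include="number")
    return pd.DataFrame(
        {
            "skewness": numeric.skew(),
            "kurtosis": numeric.kurt(),  # pandas reports excess (Fisher) kurtosis
        }
    )


df = pd.DataFrame(
    {
        "symmetric": [1.0, 2.0, 3.0, 4.0, 5.0],
        "skewed": [1.0, 1.0, 1.0, 2.0, 10.0],
        "label": list("abcde"),  # non-numeric columns are ignored
    }
)
stats = skew_kurtosis_table(df)
print(stats)
```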
The culprit appears to be the _infer_datatypes function.
Line 95 in ad48d11
The workaround appears to be replacing the above function with pandas' built-in infer_objects method.
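For reference, a minimal sketch of what infer_objects does on its own, using a made-up frame whose columns all start as object dtype. Whether it can replace _infer_datatypes inside skimpy is not something this example establishes.

```python
import pandas as pd

# Illustrative data: every column is forced to dtype "object" on construction.
df = pd.DataFrame(
    {"a": [1, 2, 3], "b": [1.5, 2.5, 3.5], "c": ["x", "y", "z"]},
    dtype="object",
)

# infer_objects attempts a soft conversion of object columns to better dtypes.
converted = df.infer_objects()

print(df.dtypes)
print(converted.dtypes)
```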
jupyter nbconvert --to rst README.ipynb
Hi,
The package is super useful. However, support for some key data types frequently used with pandas seems to be missing.
It would be great if you could add support for datetime.date, datetime.month, datetime.year and so on.
For example, it supports datetime64, but if one wants to keep only the date part:
dt['date'] = dt['datetime'].dt.date
it gives the error "data type 'date' not understood".
Thank you
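A possible stopgap, sketched with made-up data: .dt.date turns the column into Python date objects (object dtype), whereas .dt.normalize() drops the time of day but keeps datetime64. Since datetime64 is reported above as supported, this may be enough until date columns are handled directly.

```python
import pandas as pd

df = pd.DataFrame(
    {"datetime": pd.to_datetime(["2021-01-01 09:30", "2021-01-02 17:45"])}
)

# .dt.date yields datetime.date objects, so the column dtype becomes "object".
df["date_objects"] = df["datetime"].dt.date

# .dt.normalize() sets the time to midnight but keeps the datetime64 dtype.
df["date_normalized"] = df["datetime"].dt.normalize()

print(df.dtypes)
```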
Practically, this will need to be something like a JSON given the structure of the results table.
A Colab notebook, including the data needed to reproduce the error, is here:
https://github.com/Mjboothaus/Jupyter/blob/master/cleanup_beach_data.ipynb
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-9-d37235d13c7a> in <module>()
----> 1 skim(df)
/usr/local/lib/python3.7/dist-packages/typeguard/__init__.py in wrapper(*args, **kwargs)
1031 memo = _CallMemo(python_func, _localns, args=args, kwargs=kwargs)
1032 check_argument_types(memo)
-> 1033 retval = func(*args, **kwargs)
1034 try:
1035 check_return_type(retval, memo)
/usr/local/lib/python3.7/dist-packages/skimpy/__init__.py in skim(df, header_style, **colour_kwargs)
543 grid.add_row(sum_tab)
544 # Weirdly, iteration over list of tabs misses last entry
--> 545 grid.add_row(list_of_tabs[-1])
546 console.print(Panel(grid, title="skimpy summary", subtitle="End"))
547
IndexError: list index out of range
This would mostly be straightforward.
For examples of how to do the charts within a pandas dataframe, see https://twitter.com/jonathanrlarkin/status/1503591106939867137?s=11.
See also the plottable package
Running skimpy.skim(df) returns an error:
662 df = _delete_unsupported_columns(df)
663 # Perform inference of datatypes
--> 664 df = _infer_datatypes(df)
/python3.9/site-packages/skimpy/__init__.py in _infer_datatypes(df)
137 continue
138 # There is no else statement here because logic should never get to this point.
--> 139 df[col[0]] = df[col[0]].astype(data_type)
140 return df
141
I have a bunch of columns, so the message does not usefully describe how to fix the problem.
I also cleaned my df (b10_r) with the following, and I still get that error:
import pandas

kols = []  # columns whose inferred dtype is mixed or unknown
for column in b10_r.columns:
    ty = pandas.api.types.infer_dtype(b10_r[column])
    print("{} - {}".format(column, ty))
    if ty in ["mixed-integer", "mixed", "mixed-integer-float", "unknown-array"]:
        kols.append(column)
# Drop the flagged columns from the frame
for k in kols:
    del b10_r[k]
But I still get it.
I have tried to export skim results. I tried to record the output using Console(record=True) before calling the function; however, I got a NoneType object. The expected result is an object that can be exported to HTML or SVG to share the results obtained.
I also tested the Console.capture() method and got the same behaviour. Did I do something wrong?
from rich.console import Console
from skimpy import skim, generate_test_data
console = Console(record=True)
df = generate_test_data()
skim(df)
console.save_html("demo.html")
The demo.html is empty. Thanks for your support.
Regards
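A likely explanation, offered as an assumption rather than a confirmed diagnosis: a Rich console created with record=True only records what is printed through that same console, and the traceback earlier on this page suggests skim prints through a console of its own. The sketch below uses only Rich (no skimpy) and a made-up table to show the difference.

```python
import io

from rich.console import Console
from rich.table import Table

# record=True keeps a copy of everything printed through THIS console.
# file=io.StringIO() just keeps the demo from writing to the terminal.
recording = Console(record=True, file=io.StringIO(), width=80)

table = Table(title="demo summary")
table.add_column("column")
table.add_column("mean")
table.add_row("length", "0.5")

recording.print(table)
text = recording.export_text(clear=False)  # keep the record for the HTML export
html = recording.export_html()

# A second recording console that nothing was printed through stays empty,
# which matches the empty demo.html described above.
unused = Console(record=True, file=io.StringIO(), width=80)
unused_text = unused.export_text()

print(text)
```

Exporting skim output this way would need skim to print through a recording console, which depends on skimpy's internals.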
It seems that skimpy truncates variable names to 20 characters. This seems unreasonable, as there is a lot of unused empty space (indicated with yellow squares). That space could be removed to leave more room for longer names.
Line 414 in 910eb80
Noticed some weird behavior with the word counts in skimpy output - should this be using col to subset xf rather than xf.columns[0]?
I'm using skimpy in a project and would love to have details for a Bibtex reference, thank you!
Uneven inline histogram bar widths distort the layout of the output:
This happens because the UTF-8 symbols (squares) that form the histogram have different widths. I noticed that in R, the 4th and 8th (the narrowest) symbols are excluded in some cases:
https://github.com/ropensci/skimr/blob/d5126aa020e703f37740af7ee56a4acb5830fd08/R/stats.R#L133-L136
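The idea of leaving out selected glyphs could be sketched as follows. This is not skimpy's or skimr's code: the function name inline_hist and its skip parameter are hypothetical, and the default of dropping the 4th and 8th glyphs simply follows the description above rather than a check of the linked R source.

```python
import numpy as np

BLOCKS = "▁▂▃▄▅▆▇█"  # U+2581 to U+2588, lowest to highest


def inline_hist(values, bins=6, skip=(3, 7)):
    """Build a one-line histogram, leaving out the glyphs at the `skip` indices.

    Hypothetical helper; `skip` uses zero-based positions in BLOCKS.
    """
    glyphs = [g for i, g in enumerate(BLOCKS) if i not in skip]
    counts, _ = np.histogram(values, bins=bins)
    top = counts.max()
    if top == 0:
        return glyphs[0] * bins
    # Scale each bin count onto the remaining glyph levels.
    levels = np.rint(counts / top * (len(glyphs) - 1)).astype(int)
    return "".join(glyphs[level] for level in levels)


rng = np.random.default_rng(0)
sample_hist = inline_hist(rng.normal(size=1000))
print(sample_hist)
```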
My question:
Hi there. Pandas 2 came out a few months back and your installation dependency is at pandas ^1.3.2. Would you consider checking for Pandas 2 compatibility?
I'm going to mention skimpy on my newsletter (to 1,600 data scientists). I know that a bunch have upgraded to Pandas 2 already (given recent conference talks I've given on Pandas 2 and Polars), so hopefully that'd open the door to a new base of users for you.
ydata-profiling (née pandas-profiling) just added Pandas 2 support too: https://github.com/ydataai/ydata-profiling/releases/tag/v4.3.0
Cheers, Ian.
skim summarizes 20 columns by default. I couldn't find a way to change this default behaviour.
Could you please add a parameter to suppress the '{sum(cleaned)} column names have been cleaned' message?
I would really appreciate it!
Thank you
Regards
The link at the top of the home page points to contributing.html, but the page is called CONTRIBUTING.html, hence the link is broken.
MRE:
import pandas as pd
from numpy.random import Generator, PCG64
from skimpy import skim
seed = 34729
rng = Generator(PCG64(seed))
len_df = 1000
df = pd.DataFrame()
df["length"] = rng.beta(0.5, 0.5, size=len_df)
skim(df)
And an option to submit notification of use.
It would be helpful to have a quarto-friendly output option, so that tables generated from skim render in markdown instead of rendering as code-like objects.
For instance, a file like this, with a python skim(df) statement and an R skim(df) statement
(you'll have to add the qmd extension, github won't let me upload a qmd file)
test.qmd
renders as
Thank you for making this package, btw - it has made it much easier to teach my students R and python simultaneously when there are so many packages that have parallel functions and syntax between them.
Use something like
i = 32.1123
print(f'{float(f"{i:.2g}"):g}')
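Wrapped up as a function, the same trick might look like this. The name round_sig and its sig parameter are illustrative only, not an existing skimpy API.

```python
def round_sig(x: float, sig: int = 2) -> str:
    """Round x to `sig` significant figures and format it compactly."""
    # First pass: round to the requested number of significant figures.
    rounded = float(f"{x:.{sig}g}")
    # Second pass: general format without padding or trailing zeros.
    return f"{rounded:g}"


for value in (32.1123, 0.000123456, 1234.5):
    print(value, "->", round_sig(value))
```

Note that the general format switches to scientific notation for very large or very small magnitudes, so this would need a decision on how such values should appear in the table.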
Polars is an increasingly popular data frame package. Although Polars users can currently convert to pandas to run skimpy, would it be better if support was native?
It would look better to remove decimals and trailing zeros on whole numbers. Something like s.rstrip('0').rstrip('.') if '.' in s else s could work.
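As a small self-contained check of that expression (the function name strip_trailing_zeros is only for illustration):

```python
def strip_trailing_zeros(s: str) -> str:
    """Drop trailing zeros after a decimal point, and the point itself if bare."""
    # Strings without a decimal point are returned unchanged, so "100" stays "100".
    return s.rstrip("0").rstrip(".") if "." in s else s


for text in ("5.00", "3.10", "100", "0.50"):
    print(text, "->", strip_trailing_zeros(text))
```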
Currently, the timedelta data type is converted to strings.
E.g.
import pandas as pd
from skimpy import skim

df_check = pd.DataFrame(
    {
        "header": [pd.Timedelta(365, "d"), pd.Timedelta(-19, "d")],
        "header_1": ["length_one", "length_two"],
    }
)
skim(df_check)
should produce a table with a time difference section.
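As a rough idea of what such a section could be built from, here is a pandas-only sketch that selects the timedelta columns and computes a few statistics. The choice of mean, min and max is an assumption for illustration, not a description of skimpy's output.

```python
import pandas as pd

df_check = pd.DataFrame(
    {
        "header": [pd.Timedelta(365, "d"), pd.Timedelta(-19, "d")],
        "header_1": ["length_one", "length_two"],
    }
)

# Pick out only the timedelta columns instead of casting them to strings.
timedeltas = df_check.select_dtypes(include="timedelta64")

# Illustrative summary statistics; one row per timedelta column.
summary = pd.DataFrame(
    {
        "mean": timedeltas.mean(),
        "min": timedeltas.min(),
        "max": timedeltas.max(),
    }
)
print(summary)
```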
This tool looks very useful. However, when I try installing it into my kernel environment, it pulls in a lot of dependencies, including Jupyter and all the associated server dependencies. Perhaps these need to be dev dependencies? I can't see where the dependency is used otherwise. I can't see where ipykernel is used either (initially I thought you might need to import from IPython.display).
bash$ poetry add git+https://github.com/aeturrell/skimpy.git
Updating dependencies
Resolving dependencies... (7.9s)
Package operations: 34 installs, 0 updates, 0 removals
• Installing types-python-dateutil (2.8.19.20240106)
• Installing arrow (1.3.0)
• Installing fqdn (1.5.1)
• Installing isoduration (20.11.0)
• Installing jsonpointer (2.4)
• Installing rfc3339-validator (0.1.4)
• Installing rfc3986-validator (0.1.1)
• Installing uri-template (1.3.0)
• Installing webcolors (1.13)
• Installing argon2-cffi-bindings (21.2.0)
• Installing python-json-logger (2.0.7)
• Installing terminado (0.18.0)
• Installing anyio (4.2.0)
• Installing argon2-cffi (23.1.0)
• Installing jupyter-events (0.9.0)
• Installing jupyter-server-terminals (0.5.1)
• Installing overrides (7.4.0)
• Installing send2trash (1.8.2)
• Installing websocket-client (1.7.0)
• Installing babel (2.14.0)
• Installing json5 (0.9.14)
• Installing jupyter-server (2.12.4)
• Installing async-lru (2.0.4)
• Installing jupyter-lsp (2.2.1)
• Installing jupyterlab-server (2.25.2)
• Installing notebook-shim (0.2.3)
• Installing jupyterlab (4.0.10)
• Installing qtpy (2.4.1)
• Installing jupyter-console (6.6.3)
• Installing notebook (7.0.6)
• Installing qtconsole (5.5.1)
• Installing jupyter (1.0.0)
• Installing typeguard (4.1.5)
• Installing skimpy (0.0.11 556aff6)
Writing lock file
First of all, thank you for creating such a wonderful package.
I was able to quickly understand the characteristics of data using skim in R, and thank you for making that possible in Python as well.
Polars, a DataFrame package, has been growing rapidly in popularity recently.
You can use the skim function with Polars by calling to_pandas() first.
However, it would be better if Polars was supported directly in skimpy.
Also, pandas has been updated to version 2.x, but installing skimpy downgrades the pandas version. It would be nice if the pandas dependency were also updated to support 2.x.
Thank you
https://github.com/machow/quartodoc
Main advantage is to remove hackyness of current solution.
From my python (v 3.7) dev vm at work I pull data from Vertica (mysql) into a pandas df, and I get what smells like a dependency issue. If this is based on a pandas dependency, is it possible to use skimpy with a different version of pandas through some older version of skimpy?
I run:
skim(df)
I get the issue:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
/tmp/ipykernel_xyz/myscript.py in <module>
----> 1 skim(shipments)
2 # shipments.describe()
~/.venv/asdf/lib/python3.7/site-packages/typeguard/__init__.py in wrapper(*args, **kwargs)
1031 memo = _CallMemo(python_func, _localns, args=args, kwargs=kwargs)
1032 check_argument_types(memo)
-> 1033 retval = func(*args, **kwargs)
1034 try:
1035 check_return_type(retval, memo)
~/.venv/asdf/lib/python3.7/site-packages/skimpy/__init__.py in skim(df, header_style, **colour_kwargs)
527 xf = df.select_dtypes(col_type)
528 if not xf.empty:
--> 529 sum_df = summary_func(xf)
530 list_of_tabs.append(
531 dataframe_to_rich_table(
~/.venv/asdf/lib/python3.7/site-packages/typeguard/__init__.py in wrapper(*args, **kwargs)
1031 memo = _CallMemo(python_func, _localns, args=args, kwargs=kwargs)
1032 check_argument_types(memo)
-> 1033 retval = func(*args, **kwargs)
1034 try:
1035 check_return_type(retval, memo)
~/.venv/asdf/lib/python3.7/site-packages/skimpy/__init__.py in numeric_variable_summary_table(xf)
306 data_dict = {
307 "missing": count_nans_vec,
--> 308 "complete rate": 1 - count_nans_vec / xf.shape[0],
309 NUM_COL_MEAN: xf.mean(),
310 "sd": xf.std(),
~/.venv/asdf/lib/python3.7/site-packages/pandas/core/ops/common.py in new_method(self, other)
63 break
64 if isinstance(other, cls):
---> 65 return NotImplemented
66
67 other = item_from_zerodim(other)
~/.venv/asdf/lib/python3.7/site-packages/pandas/core/arraylike.py in __truediv__(self, other)
111 def __rmul__(self, other):
112 return self._arith_method(other, roperator.rmul)
--> 113
114 @unpack_zerodim_and_defer("__truediv__")
115 def __truediv__(self, other):
~/.venv/asdf/lib/python3.7/site-packages/pandas/core/series.py in _arith_method(self, other, op)
4996 0 True
4997 1 True
-> 4998 2 True
4999 3 False
5000 4 True
~/.venv/asdf/lib/python3.7/site-packages/pandas/core/ops/array_ops.py in arithmetic_op(left, right, op)
187 Evaluate an arithmetic operation `+`, `-`, `*`, `/`, `//`, `%`, `**`, ...
188
--> 189 Note: the caller is responsible for ensuring that numpy warnings are
190 suppressed (with np.errstate(all="ignore")) if needed.
191
~/.venv/asdf/lib/python3.7/site-packages/pandas/core/ops/array_ops.py in _na_arithmetic_op(left, right, op, is_cmp)
137
138 def _na_arithmetic_op(left, right, op, is_cmp: bool = False):
--> 139 """
140 Return the result of evaluating op on the passed in values.
141
~/.venv/asdflib/python3.7/site-packages/pandas/core/computation/expressions.py in <module>
17 from pandas._typing import FuncType
18
---> 19 from pandas.core.computation.check import NUMEXPR_INSTALLED
20 from pandas.core.ops import roperator
21
~/.venv/data_analyses/lib/python3.7/site-packages/pandas/core/computation/check.py in <module>
1 from pandas.compat._optional import import_optional_dependency
2
----> 3 ne = import_optional_dependency("numexpr", errors="warn")
4 NUMEXPR_INSTALLED = ne is not None
5 if NUMEXPR_INSTALLED:
TypeError: import_optional_dependency() got an unexpected keyword argument 'errors'
A kwarg in skim function
Hi,
first of all thank you for this great tool.
If I run skimpy on this 999-row CSV, it reports 1000 NA rows.
Thank you
Use format as in: https://github.com/Erotemic/xdoctest
Hi,
I tried skimpy for the first time today and it seems I found a bug. I used skimpy on my sample dataframe and it only returns the summary for the float and integer columns, while the others are ignored.
This is the sample code:
from skimpy import skim
import datetime
import pandas as pd
data = ([datetime.datetime(2021, 1, 1), None, 'as', 6],
[datetime.datetime(2021, 1, 2), 5.2, 'asd', 7],
[None, 6.3, 'adasda', 8])
df = pd.DataFrame(data, columns=['date', 'float', 'string', 'integer'])
skim(df)
Hi there, great package here. I'm wondering if there is an easy way to change the colour used for the column names in the output. The current default is pink, and unfortunately I have a grey terminal background, so the pink foreground font is pretty much impossible to see.
Thanks