akanz1 / klib

Easy to use Python library of customized functions for cleaning and analyzing data.

Home Page: https://medium.com/p/97191d320f80

License: MIT License

Python 100.00%
data-science data-analysis klib data-visualization python feature-selection data-cleaning data-preprocessing

klib's Introduction

klib is a Python library for importing, cleaning, analyzing and preprocessing data. Explanations on key functionalities can be found on Medium / TowardsDataScience and in the examples section. Additionally, there are great introductions and overviews of the functionality on PythonBytes or on YouTube (Data Professor).

Installation

Use the package manager pip to install klib.

pip install -U klib

Alternatively, to install this package with conda run:

conda install -c conda-forge klib

Usage

import klib
import pandas as pd

df = pd.DataFrame(data)

# klib.describe - functions for visualizing datasets
- klib.cat_plot(df) # returns a visualization of the number and frequency of categorical features
- klib.corr_mat(df) # returns a color-encoded correlation matrix
- klib.corr_plot(df) # returns a color-encoded heatmap, ideal for correlations
- klib.corr_interactive_plot(df, split="neg").show() # returns an interactive correlation plot using plotly
- klib.dist_plot(df) # returns a distribution plot for every numeric feature
- klib.missingval_plot(df) # returns a figure containing information about missing values

# klib.clean - functions for cleaning datasets
- klib.data_cleaning(df) # performs data cleaning (drop duplicates & empty rows/cols, adjust dtypes, ...)
- klib.clean_column_names(df) # cleans and standardizes column names, also called inside data_cleaning()
- klib.convert_datatypes(df) # converts existing to more efficient dtypes, also called inside data_cleaning()
- klib.drop_missing(df) # drops missing values, also called in data_cleaning()
- klib.mv_col_handling(df) # drops features with high ratio of missing vals based on informational content
- klib.pool_duplicate_subsets(df) # pools subset of cols based on duplicates with min. loss of information
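
To make the `data` placeholder above concrete, here is a minimal pandas-only sketch of the steps named in the data_cleaning() comment (dropping empty rows/columns and duplicates). It is not klib's implementation; the toy `data` dictionary and the variable names are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical toy data: one duplicated row, one empty row, one empty column.
data = {
    "A": [1.0, 1.0, None, 4.0],
    "B": ["x", "x", None, "y"],
    "C": [None, None, None, None],
}
df = pd.DataFrame(data)

# Plain-pandas approximation of the steps listed in the comment above.
cleaned = (
    df.dropna(axis=1, how="all")   # drop columns that are entirely empty
      .dropna(axis=0, how="all")   # drop rows that are entirely empty
      .drop_duplicates()           # drop repeated rows, keeping the first
      .reset_index(drop=True)
)
print(cleaned)
```

In this toy frame, column C, the empty third row, and the repeated second row are removed, leaving two rows and two columns.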

Examples

Find all available examples as well as applications of the functions in klib.clean() with detailed descriptions here.

klib.missingval_plot(df) # default representation of missing values in a DataFrame, plenty of settings are available

Missingvalue Plot Example

klib.corr_plot(df, split='pos') # displaying only positive correlations, other settings include threshold, cmap...
klib.corr_plot(df, split='neg') # displaying only negative correlations

Corr Plot Example

klib.corr_plot(df, target='wine') # default representation of correlations with the feature column

Target Corr Plot Example

klib.corr_interactive_plot(df, split="neg").show()

# The interactive plot has the same parameters as the corr_plot, but with additional Plotly heatmap graph object kwargs.
klib.corr_interactive_plot(df, split="neg", zmax=0)

Interactive Corr Plot Simple Example

Interactive Corr Plot with zmax kwarg Example

# Since corr_interactive_plot returns a Graph Object Figure, it supports the update_layout chain method.
klib.corr_interactive_plot(df, split="neg").update_layout(template="simple_white")

Interactive Corr Plot Chained Example

klib.dist_plot(df) # default representation of a distribution plot, other settings include fill_range, histogram, ...

Dist Plot Example

klib.cat_plot(data, top=4, bottom=4) # representation of the 4 most & least common values in each categorical column

Cat Plot Example

Further examples, as well as applications of the functions in klib.clean() can be found here.

Contributing

Pull requests and ideas, especially for further functions are welcome. For major changes or feedback, please open an issue first to discuss what you would like to change.

License

MIT

klib's People

Contributors

akanz1, deepsourcebot, dependabot[bot], hasan-alper, jrrmcalcio, m-marqx, px39n, snyk-bot, withshubh

klib's Issues

CI updates

  • add python 3.12-dev to CI
  • replace flake8/pylint/reorder_imports with ruff
  • dependency updates

[BUG] - Cannot set non-string value {value} into a StringArray with klib.cat_plot

When running klib.cat_plot(df_cleaned), I get this error:
Cannot set non-string value '2' into a StringArray.

Screen Shot 2020-10-05 at 11 01 32 AM

Python version: Python 3.6.9
Pandas version: 1.1.2

My data look like this:
0 vmid string
1 subscriptionid category
2 deploymentid category
3 vmcreated int32
4 vmdeleted int32
5 maxcpu float32
6 avgcpu float32
7 p95maxcpu float32
8 vmcategory category
9 vmcorecountbucket category
10 vmmemorybucket category
11 lifetime_h float32
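
The error message itself can be reproduced with pandas alone, which may help narrow the report down. The sketch below uses a hypothetical stand-in for the `vmid` column and shows that string arrays reject non-string values. The cast to object dtype at the end is only a possible workaround, not a confirmed fix for klib.cat_plot.

```python
import pandas as pd

# Hypothetical stand-in for the "vmid" column, which uses the pandas "string" dtype.
vmid = pd.array(["a1", "b2", "a1"], dtype="string")

try:
    vmid[0] = 2  # a non-string value is rejected by string arrays
    assignment_failed = False
except (TypeError, ValueError) as exc:  # the exception type differs between pandas versions
    assignment_failed = True
    print(f"{type(exc).__name__}: {exc}")

# Possible workaround (an assumption, not a confirmed fix): cast the column to
# plain object dtype before passing the frame to a plotting function.
df = pd.DataFrame({"vmid": vmid})
df["vmid"] = df["vmid"].astype(object)
print(df.dtypes)
```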

[BUG] - numpy overflow encountered in reduce

Thanks for sharing this package, I'm loving it!

I did run into a bug today. When I try to run dist_plot on my dataset, I get the following message:

\numpy\core\_methods.py:160: RuntimeWarning:
overflow encountered in reduce

I isolated it down to one particular series in my dataframe. It's not one I really care about, but maybe someone else will run into it for a series they DO care about. Here's a describe() after running it through klib's data_cleaning function:

df.created_at.describe()
count 5.213400e+04
mean 1.610795e+12
std 4.225043e+08
min 1.609891e+12
25% 1.610552e+12
50% 1.610838e+12
75% 1.611198e+12
max 1.611274e+12
Name: created_at, dtype: float64

Meanwhile, info() reports something different:

df.info()
...
2 created_at 52134 non-null float32
...

Notice one reports float32 while the other says float64... Seems fishy.

I'm using miniconda on Windows 10.
conda v4.9.2
numpy v1.19.5
klib v0.1.0

If you need me to provide my dataset, I can do so.
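
One plausible mechanism for this warning, not confirmed for klib or its plotting dependencies, is a reduction over powers of large float32 values. The sketch below uses a hypothetical array with the same length and magnitude as `created_at`. It shows that float32 values near 1.6e12 sit on a coarse grid, and that summing their cubes overflows in float32 while staying finite in float64.

```python
import numpy as np

# Hypothetical column of the same size and magnitude as `created_at` above.
values64 = np.full(52134, 1.610795e12, dtype=np.float64)
values32 = values64.astype(np.float32)

# float32 has a 24-bit significand, so neighbouring representable values
# are far apart at this magnitude.
spacing32 = np.spacing(values32[0])

# Each cube fits in float32, but their sum exceeds the float32 maximum (~3.4e38).
with np.errstate(over="ignore"):  # silence the RuntimeWarning for this demo
    cube_sum32 = np.sum(values32 ** 3)
cube_sum64 = np.sum(values64 ** 3)

print(spacing32, cube_sum32, cube_sum64)
```

If this is the cause, converting the affected column back to float64 before plotting should avoid the overflow.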

[BUG] - data cleaning sometimes returns float32 instead of float64

Describe the bug
Hi @akanz1, first of all, thanks for this amazing package. I am not sure whether this is actually a bug.
The cleaning function sometimes converts data to float32 instead of float64, and the dist_plot function then raises ValueError: data type <class 'numpy.object_'> not inexact. If I manually convert the data with .astype(float), everything works fine.

Here is the data that produces the error; you can create a DataFrame column from it and try it out:

[0.0,
2.0331595,
2.0331595,
0.0,
0.0,
2.0331595,
2.0331595,
0.0,
0.0,
2.0331595,
2.0331595,
2.0331595,
1.0165797,
2.0331595,
2.0331595,
0.0,
2.0331595,
1.0165797,
0.0,
2.0331595,
2.0331595,
0.0,
0.0,
2.0331595,
2.0331595,
0.0,
0.0,
2.0331595,
2.0331595,
2.0331595,
1.0165797,
2.0331595,
2.0331595,
0.0,
2.0331595,
1.0165797,
0.0,
2.0331595,
2.0331595,
0.0,
0.0,
2.0331595,
2.0331595,
0.0,
0.0,
2.0331595,
2.0331595,
2.0331595,
1.0165797,
2.0331595,
2.0331595,
0.0,
2.0331595,
1.0165797,
0.0,
2.0331595,
2.0331595,
0.0,
0.0,
2.0331595,
2.0331595,
0.0,
0.0,
2.0331595,
2.0331595,
2.0331595,
1.0165797,
2.0331595,
2.0331595,
0.0,
2.0331595,
1.0165797,
0.0,
2.0331595,
2.0331595,
0.0,
0.0,
2.0331595,
2.0331595,
0.0,
0.0,
2.0331595,
2.0331595,
2.0331595,
1.0165797,
2.0331595,
2.0331595,
0.0,
2.0331595,
1.0165797]
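
The manual conversion mentioned above can be sketched as follows. The column name `col` is hypothetical, only three of the listed values are used, and the explicit cast to float32 merely simulates what the cleaning step is reported to do.

```python
import pandas as pd

# Three of the values from the list above; the column name "col" is hypothetical.
df = pd.DataFrame({"col": [0.0, 2.0331595, 1.0165797]})

# Simulate a cleaning step that downcasts the column to float32 (assumption).
df["col"] = df["col"].astype("float32")
dtype_after_downcast = df["col"].dtype

# The workaround described above: convert back to 64-bit floats before plotting.
df["col"] = df["col"].astype(float)
dtype_after_fix = df["col"].dtype

print(dtype_after_downcast, "->", dtype_after_fix)
```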

[BUG] - ... The command klib.dist_plot(df) does not plot the distribution for all the numeric features of a DataFrame

Describe the bug
The issue is that the command klib.dist_plot(df) does not plot the distribution for all the numeric features of a DataFrame; it plots the distribution for the first numeric feature only.
To Reproduce
Steps to reproduce the behavior:

  1. Open a Python notebook and import klib.
  2. Create a DataFrame from any dataset; in my case I used "df = pd.read_csv('https://raw.githubusercontent.com/agconti/kaggle-titanic/master/data/train.csv')".
  3. Call klib.dist_plot(df) to plot the distribution.
  4. Observe that only the first numeric feature is plotted, even when showall is set to True.

Screenshots
image
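
When checking a report like this, it can help to list the numeric columns first, so the expected number of distribution plots is known. The sketch below uses a hypothetical miniature frame with Titanic-like column names instead of downloading the CSV.

```python
import pandas as pd

# Hypothetical miniature frame with Titanic-like columns (no download needed).
df = pd.DataFrame(
    {
        "Survived": [0, 1, 1],
        "Age": [22.0, 38.0, 26.0],
        "Fare": [7.25, 71.2833, 7.925],
        "Name": ["Braund", "Cumings", "Heikkinen"],
    }
)

# Numeric columns are the candidates for a distribution plot.
numeric_cols = df.select_dtypes(include="number").columns.tolist()
print(numeric_cols)
```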

[BUG] - Plots not showing up in Jupyter Notebooks on Mac M1

Describe the bug

Plots not showing up in jupyter notebooks on mac m1

To Reproduce
Steps to reproduce the behavior:

  1. Install the library in a fresh conda environment (MacBook Air M1, Big Sur).
  2. Run the Jupyter notebook.
  3. Import data from seaborn.
  4. Plot the charts (the corr_mat does show up).

Expected behavior
The plots should show up as per the library homepage.

Screenshots
image

Desktop (please complete the following information):

  • OS: Mac Big Sur
  • Browser: Chrome/VS Code + Jupyter Notebooks

[BUG] - missingval_plot method returns ValueError

Describe the bug
While trying to plot missing values, I get the following error:
ValueError: rotation must be 'vertical', 'horizontal' or a number, not 90

Here is the error traceback:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In [24], line 1
----> 1 klib.missingval_plot(data_interim_df)

File ~/venv/lib/python3.10/site-packages/klib/describe.py:689, in missingval_plot(data, cmap, figsize, sort, spine_color)
    687 for rect, label in zip(ax1.patches, mv_cols):
    688     height = rect.get_height()
--> 689     ax1.text(
    690         rect.get_x() + rect.get_width() / 2,
    691         height + max(np.log(1 + height / 6), 0.075),
    692         label,
    693         ha="center",
    694         va="bottom",
    695         rotation="90",
    696         alpha=0.5,
    697         fontsize="11",
    698     )
    700 ax1.set_frame_on(True)
    701 for _, spine in ax1.spines.items():

File /shared-libs/python3.10/py/lib/python3.10/site-packages/matplotlib/axes/_axes.py:678, in Axes.text(self, x, y, s, fontdict, **kwargs)
    617 """
    618 Add text to the Axes.
    619 
   (...)
    668     >>> text(x, y, s, bbox=dict(facecolor='red', alpha=0.5))
    669 """
    670 effective_kwargs = {
    671     'verticalalignment': 'baseline',
    672     'horizontalalignment': 'left',
   (...)
    676     **kwargs,
    677 }
--> 678 t = mtext.Text(x, y, text=s, **effective_kwargs)
    679 t.set_clip_path(self.patch)
    680 self._add_text(t)

File /shared-libs/python3.10/py/lib/python3.10/site-packages/matplotlib/_api/deprecation.py:454, in make_keyword_only.<locals>.wrapper(*args, **kwargs)
    448 if len(args) > name_idx:
    449     warn_deprecated(
    450         since, message="Passing the %(name)s %(obj_type)s "
    451         "positionally is deprecated since Matplotlib %(since)s; the "
    452         "parameter will become keyword-only %(removal)s.",
    453         name=name, obj_type=f"parameter of {func.__name__}()")
--> 454 return func(*args, **kwargs)

File /shared-libs/python3.10/py/lib/python3.10/site-packages/matplotlib/text.py:178, in Text.__init__(self, x, y, text, color, verticalalignment, horizontalalignment, multialignment, fontproperties, rotation, linespacing, rotation_mode, usetex, wrap, transform_rotates_text, parse_math, **kwargs)
    176 self.set_horizontalalignment(horizontalalignment)
    177 self._multialignment = multialignment
--> 178 self.set_rotation(rotation)
    179 self._transform_rotates_text = transform_rotates_text
    180 self._bbox_patch = None  # a FancyBboxPatch instance

File /shared-libs/python3.10/py/lib/python3.10/site-packages/matplotlib/text.py:1197, in Text.set_rotation(self, s)
   1195     self._rotation = 90.
   1196 else:
-> 1197     raise ValueError("rotation must be 'vertical', 'horizontal' or "
   1198                      f"a number, not {s}")
   1199 self.stale = True

ValueError: rotation must be 'vertical', 'horizontal' or a number, not 90
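
The traceback shows rotation="90" being passed as a string, while the error message asks for 'vertical', 'horizontal' or a number. A minimal sketch of one way to guard against this is shown below. `normalize_rotation` is a hypothetical helper, not part of klib or Matplotlib; it turns numeric strings into floats before the value is handed to a text call.

```python
from numbers import Real


def normalize_rotation(rotation):
    """Return a rotation value in a form the error message above allows.

    Hypothetical helper: numeric strings such as "90" become floats, while
    numbers and the keywords 'vertical' / 'horizontal' pass through unchanged.
    """
    if isinstance(rotation, Real):
        return rotation
    if rotation in ("vertical", "horizontal"):
        return rotation
    return float(rotation)  # raises if the value is not numeric


print(normalize_rotation("90"))
print(normalize_rotation("vertical"))
```

The simpler fix at the call site is to pass the number 90 instead of the string "90".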

[BUG] - Broken missing values plot for small percentage of missing values

Describe the bug
The missing value plot gets broken for a small missing values percentage.

To Reproduce

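import numpy as np
import pandas as pd
import klib
df = pd.DataFrame.from_dict({'col': np.ones(1000)})
df.loc[:2] = np.nan  # np.NaN was removed in NumPy 2.0; np.nan works in all versions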
klib.missingval_plot(df)
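
For scale, the reproduction above marks only three of 1000 rows as missing. The pandas-only sketch below computes that ratio; the variable names are chosen for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"col": np.ones(1000)})
df.loc[:2] = np.nan  # .loc slicing is label-inclusive, so rows 0, 1 and 2 are set

missing_count = int(df["col"].isna().sum())
missing_ratio = missing_count / len(df)
print(f"{missing_count} missing values ({missing_ratio:.1%})")
```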

Expected behavior
Appropriate positioning of the text, y axes labels

Screenshots
image

I have a fix, but I'm not allowed to push my branch. How can I do it?

P.S. awesome library, thanks for the work!

Check dependencies

  • Check if jinja2 is still required.
  • Check dependencies
  • Check dev dependencies
  • upgrade dev dependencies
