koaning / scikit-lego

Extra blocks for scikit-learn pipelines.

Home Page: https://koaning.github.io/scikit-lego/

License: MIT License

Topics: scikit-learn, machine-learning, common-sense

scikit-lego's Introduction


scikit-lego

We love scikit-learn, but very often we find ourselves writing custom transformers, metrics and models. The goal of this project is to consolidate these into a single package that offers code quality and testing. The project started as a collaboration between multiple companies in the Netherlands but has since received contributions from around the globe. It was initiated by Matthijs Brouns and Vincent D. Warmerdam as a tool to teach people how to contribute to open source.

Note that we're not formally affiliated with the scikit-learn project at all, but we aim to strictly adhere to their standards.

The same holds for LEGO. LEGO® is a trademark of the LEGO Group of companies, which does not sponsor, authorize or endorse this project.

Installation

Install scikit-lego via pip with

python -m pip install scikit-lego

Via conda with

conda install -c conda-forge scikit-lego

Alternatively, to edit and contribute you can fork/clone the repository and run an editable install:

python -m pip install -e ".[dev]"

Documentation

The documentation can be found at https://koaning.github.io/scikit-lego/.

Usage

We offer custom metrics, models and transformers. You can import them just like you would in scikit-learn.

# the scikit-learn stuff we love
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

# the scikit-lego stuff we add
from sklego.preprocessing import RandomAdder
from sklego.mixture import GMMClassifier

...

mod = Pipeline([
    ("scale", StandardScaler()),
    ("random_noise", RandomAdder()),
    ("model", GMMClassifier())
])

...
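A minimal sketch of how such a pipeline could be fitted, filling the ellipses with synthetic data from scikit-learn (make_classification here is just a stand-in for real data):

# a minimal sketch: fit the pipeline above on synthetic data
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

from sklego.preprocessing import RandomAdder
from sklego.mixture import GMMClassifier

X, y = make_classification(n_samples=200, n_features=4, random_state=42)

mod = Pipeline([
    ("scale", StandardScaler()),
    ("random_noise", RandomAdder()),
    ("model", GMMClassifier())
])

mod.fit(X, y)
print(mod.predict(X)[:5])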

Features

Here's a list of features that this library currently offers:

  • sklego.datasets.load_abalone loads in the abalone dataset
  • sklego.datasets.load_arrests loads in a dataset with fairness concerns
  • sklego.datasets.load_chicken loads in the joyful chickweight dataset
  • sklego.datasets.load_heroes loads a heroes of the storm dataset
  • sklego.datasets.load_hearts loads a dataset about hearts
  • sklego.datasets.load_penguins loads a lovely dataset about penguins
  • sklego.datasets.fetch_creditcard fetches a fraud dataset from OpenML
  • sklego.datasets.make_simpleseries makes a simulated timeseries
  • sklego.pandas_utils.add_lags adds lag values in a pandas dataframe
  • sklego.pandas_utils.log_step a useful decorator to log your pipeline steps (see the sketch after this list)
  • sklego.dummy.RandomRegressor dummy benchmark that predicts random values
  • sklego.linear_model.DeadZoneRegressor experimental feature that has a deadzone in the cost function
  • sklego.linear_model.DemographicParityClassifier logistic classifier constrained on demographic parity
  • sklego.linear_model.EqualOpportunityClassifier logistic classifier constrained on equal opportunity
  • sklego.linear_model.ProbWeightRegression linear model that treats coefficients as probabilistic weights
  • sklego.linear_model.LowessRegression locally weighted linear regression
  • sklego.linear_model.LADRegression least absolute deviation regression
  • sklego.linear_model.QuantileRegression linear quantile regression, generalizes LADRegression
  • sklego.linear_model.ImbalancedLinearRegression punish over/under-estimation of a model directly
  • sklego.naive_bayes.GaussianMixtureNB classifies by training a 1D GMM per column per class
  • sklego.naive_bayes.BayesianGaussianMixtureNB classifies by training a bayesian 1D GMM per class
  • sklego.mixture.BayesianGMMClassifier classifies by training a bayesian GMM per class
  • sklego.mixture.BayesianGMMOutlierDetector detects outliers based on a trained bayesian GMM
  • sklego.mixture.GMMClassifier classifies by training a GMM per class
  • sklego.mixture.GMMOutlierDetector detects outliers based on a trained GMM
  • sklego.meta.ConfusionBalancer experimental feature that allows you to balance the confusion matrix
  • sklego.meta.DecayEstimator adds decay to the sample_weight that the model accepts
  • sklego.meta.EstimatorTransformer adds a model output as a feature
  • sklego.meta.OutlierClassifier turns outlier models into classifiers for gridsearch
  • sklego.meta.GroupedPredictor can split the data into groups and run a model on each group
  • sklego.meta.GroupedTransformer can split the data into groups and run a transformer on each group
  • sklego.meta.SubjectiveClassifier experimental feature to add a prior to your classifier
  • sklego.meta.Thresholder meta model that allows you to gridsearch over the threshold
  • sklego.meta.RegressionOutlierDetector meta model that finds outliers by adding a threshold to regression
  • sklego.meta.ZeroInflatedRegressor predicts zero or applies a regression based on a classifier
  • sklego.preprocessing.ColumnCapper limits extreme values of the model features
  • sklego.preprocessing.ColumnDropper drops a column from pandas
  • sklego.preprocessing.ColumnSelector selects columns based on column name
  • sklego.preprocessing.InformationFilter transformer that can de-correlate features
  • sklego.preprocessing.IdentityTransformer returns the same data, allows for concatenating pipelines
  • sklego.preprocessing.OrthogonalTransformer makes all features linearly independent
  • sklego.preprocessing.PandasTypeSelector selects columns based on pandas type
  • sklego.preprocessing.RandomAdder adds randomness in training
  • sklego.preprocessing.RepeatingBasisFunction repeating feature engineering, useful for timeseries
  • sklego.preprocessing.DictMapper assigns numeric values to categorical columns
  • sklego.preprocessing.OutlierRemover experimental method to remove outliers during training
  • sklego.model_selection.GroupTimeSeriesSplit timeseries KFold for groups with different numbers of observations per group
  • sklego.model_selection.KlusterFoldValidation experimental feature that does K folds based on clustering
  • sklego.model_selection.TimeGapSplit timeseries KFold with a gap between train/test
  • sklego.pipeline.DebugPipeline adds debug information to make debugging easier
  • sklego.pipeline.make_debug_pipeline shorthand function to create a debuggable pipeline
  • sklego.metrics.correlation_score calculates correlation between model output and feature
  • sklego.metrics.equal_opportunity_score calculates equal opportunity metric
  • sklego.metrics.p_percent_score proxy for model fairness with regard to a sensitive attribute
  • sklego.metrics.subset_score calculates a score on a subset of your data (meant for fairness tracking)
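As a small taste of the utilities above, a hedged sketch of the log_step decorator on a pandas pipeline step (the exact log format may differ between versions):

import pandas as pd
from sklego.pandas_utils import log_step

@log_step
def drop_missing(df):
    return df.dropna()

df = pd.DataFrame({"a": [1.0, None, 3.0]})
df.pipe(drop_missing)  # logs the step name, runtime and resulting shape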

New Features

We want to be rather open here in what we accept, but we do demand three things before a feature is added to the project:

  1. any new feature contributes towards a demonstrable real-world use case
  2. any new feature passes standard unit tests (we use the ones from scikit-learn)
  3. the feature has been discussed in the issue list beforehand

We automate all of our testing and use pre-commit hooks to keep the code working.

scikit-lego's People

Contributors

amrrs, anopsy, arose13, arthurpaulino, carlolepelaars, daimonie, david26694, dependabot[bot], fbruzzesi, fritshermans, garve, glevv, greatsharma, jankeromnes, jczuurmond, k-moun, kayhoogland, koaning, ktiamur, maxibor, mbrouns, menziess, pim-hoeven, pverheijen, rensdimmendaal, sandervandorsten, skylarbpayne, stephanecollot, tcacastelijns, tomasborrella


scikit-lego's Issues

feature request: state-space models

Add state-space models, in discrete form:

x(k+1) = A * x(k) + B * u(k)
y(k) = C * x(k) + D * u(k)

where:
x(k) - internal state vector at timestamp k
u(k) - input vector at timestamp k
y(k) - output at timestamp k

Initial implementation would be with a given size of state vector x (i.e. you know the dimension of the underlying system). A second iteration could also estimate the length of this vector x, but that's probably not doable in a single day.

Must admit: I haven't seen many use cases that would be best solved using a state-space model, so I wonder how useful this would be.
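To pin the semantics down, a hypothetical simulation of the recursion above (nothing here exists in sklego; simulate and all the matrices are illustrative):

# roll the discrete system x(k+1) = A x(k) + B u(k), y(k) = C x(k) + D u(k)
import numpy as np

def simulate(A, B, C, D, u, x0):
    x, ys = x0, []
    for uk in u:
        ys.append(C @ x + D @ uk)   # output at timestamp k
        x = A @ x + B @ uk          # advance the internal state
    return np.array(ys)

A = np.array([[0.9, 0.1], [0.0, 0.8]])
B = np.array([[0.0], [1.0]])
C = np.array([[1.0, 0.0]])
D = np.array([[0.0]])
u = np.ones((50, 1))                # constant input sequence
y = simulate(A, B, C, D, u, x0=np.zeros(2))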

feature request: only apply random noise in `RandomAdder` to training data

Currently, RandomAdder adds noise both at training and at prediction time. This makes predictions non-deterministic, and it offers no clear benefit in most cases I can think of.

I suggest changing the default behaviour of the transformer to only add random noise to the training data, and optionally (through a constructor flag) also to the prediction data.
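A hypothetical sketch of one way to get that behaviour (illustrative only, not how sklego ended up implementing it): pipelines call fit_transform while fitting and plain transform while predicting, so noise added only in fit_transform stays confined to training.

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class TrainOnlyRandomAdder(BaseEstimator, TransformerMixin):
    def __init__(self, noise=1.0, random_state=None):
        self.noise = noise
        self.random_state = random_state

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X  # predict time: pass the data through unchanged

    def fit_transform(self, X, y=None):
        # fit time: add gaussian noise to the training data only
        rng = np.random.default_rng(self.random_state)
        return X + rng.normal(0, self.noise, size=np.asarray(X).shape)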

[FEATURE] Statsmodels wrapper class

This would help if you want to use statsmodels in a sklearn pipeline, for example for regression.

Example

import statsmodels.api as sm
from sklearn.base import BaseEstimator, RegressorMixin

class SMWrapper(BaseEstimator, RegressorMixin):
    """A universal sklearn-style wrapper for statsmodels regressors."""
    def __init__(self, model_class, fit_intercept=True, sample_weight=None):
        self.model_class = model_class
        self.fit_intercept = fit_intercept
        self.sample_weight = sample_weight

    def fit(self, X, y):
        if self.fit_intercept:
            X = sm.add_constant(X)
        # only forward sample_weight when given; not every statsmodels model accepts it
        kwargs = {} if self.sample_weight is None else {"sample_weight": self.sample_weight}
        self.model_ = self.model_class(y, X, **kwargs)
        # for an elastic-net regularized fit, use fit_regularized instead:
        # self.results_ = self.model_.fit_regularized(alpha=10, L1_wt=0.5)
        self.results_ = self.model_.fit()
        return self  # sklearn conventions expect fit to return the estimator

    def predict(self, X):
        if self.fit_intercept:
            X = sm.add_constant(X)
        return self.results_.predict(X)
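A hedged usage sketch, assuming statsmodels is installed:

import numpy as np
import statsmodels.api as sm

X = np.random.rand(100, 2)
y = X @ np.array([1.5, -2.0]) + 0.1 * np.random.randn(100)

preds = SMWrapper(sm.OLS).fit(X, y).predict(X)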

[FEATURE] Time Series Split with gap and column parameter

Time Series Split with a gap parameter between train and testing

We want a gap between the train window and the test window, to simulate that in production you need to wait x days before you can create a target that looks x days ahead (e.g. when you want to predict a value x days out).

[figure: train and test windows separated by a gap]

Also, sometimes you have multiple samples per day; the current scikit-learn implementation doesn't support specifying a date column.
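A hypothetical sketch of the mechanics with plain index arithmetic (gapped_splits is illustrative, not the TimeGapSplit API sklego later shipped, and it ignores the date-column part of the request):

import numpy as np

def gapped_splits(n_samples, n_splits, gap):
    # yield expanding train indices and test indices separated by `gap` rows
    fold = n_samples // (n_splits + 1)
    for i in range(1, n_splits + 1):
        train_end = i * fold
        test_start = train_end + gap
        yield np.arange(train_end), np.arange(test_start, min(test_start + fold, n_samples))

for train_idx, test_idx in gapped_splits(100, n_splits=3, gap=5):
    print(train_idx[-1], test_idx[0])  # 5 skipped rows sit between train and test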

feature request: monotonic models

it would be awesome if you could specify (per column) whether the feature should be monotonically increasing, decreasing, up-down-up, down-up-down or free. forcing this in a simple linear regression would already be kind of sweet.
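For the increasing/decreasing/free cases, scikit-learn itself nowadays offers per-feature monotonic constraints in its histogram gradient boosting models. A hedged sketch (requires a reasonably recent scikit-learn; the up-down-up shapes above would still need something custom):

import numpy as np
from sklearn.ensemble import HistGradientBoostingRegressor

X = np.random.rand(500, 3)
y = 2 * X[:, 0] - 3 * X[:, 1] + 0.1 * np.random.randn(500)

# 1 = monotonically increasing, -1 = decreasing, 0 = unconstrained
model = HistGradientBoostingRegressor(monotonic_cst=[1, -1, 0])
model.fit(X, y)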

feature request: RBF Features

this is like the repeating RBF features except that this one won't repeat. it will simply span the entire space of a variable.
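A hypothetical sketch of the idea (rbf_features is illustrative, not an sklego API): place Gaussian bumps on evenly spaced centers across the variable's range.

import numpy as np

def rbf_features(x, n_centers=5, width=None):
    # expand a 1D array into similarities to evenly spaced gaussian centers
    centers = np.linspace(x.min(), x.max(), n_centers)
    if width is None:
        width = (x.max() - x.min()) / n_centers
    return np.exp(-((x[:, None] - centers[None, :]) ** 2) / (2 * width ** 2))

x = np.linspace(0, 10, 100)
X_rbf = rbf_features(x)  # shape (100, 5)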

feature request: EstimatorTransformer

is it possible to make a transformer that takes the output of an estimator and adds it to the values used for prediction? do we want it?
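This idea later landed as sklego.meta.EstimatorTransformer (listed above). A hedged usage sketch, gluing the original columns and the model output together with a FeatureUnion:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import FunctionTransformer
from sklego.meta import EstimatorTransformer

union = FeatureUnion([
    ("original", FunctionTransformer()),                    # identity passthrough
    ("lr_pred", EstimatorTransformer(LinearRegression())),  # predictions as a column
])

X = np.random.rand(100, 3)
y = X @ np.array([1.0, 2.0, 3.0])
X_aug = union.fit_transform(X, y)  # original columns plus one prediction column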

[FEATURE] Feature capping

Use case:

In ML your features sometimes have extremely large or even infinite values (np.inf); we want to cap those values with a feature transformer. A sketch of the idea follows the parameter list.

Parameters:

  • feature(s) to cap (and/or features not to cap?)
  • min value
  • max value
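A hypothetical sketch of the idea with np.clip (this request later became sklego.preprocessing.ColumnCapper; Capper and its parameters are illustrative):

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class Capper(BaseEstimator, TransformerMixin):
    def __init__(self, min_value=None, max_value=None):
        self.min_value = min_value
        self.max_value = max_value

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # np.clip also tames np.inf / -np.inf values
        return np.clip(X, self.min_value, self.max_value)

Capper(max_value=10).fit_transform(np.array([[1.0, np.inf], [5.0, 12.0]]))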

feature request: BoosterPipeline

The idea is to have a pipeline where you might have more than one model in sequence. Model 2 would try to improve on the residuals of Model 1, and so forth.
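A hypothetical two-step sketch of the mechanics (illustrative; no BoosterPipeline exists in sklego): the second model is fitted on the residuals of the first.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

X = np.random.rand(200, 2)
y = np.sin(4 * X[:, 0]) + X[:, 1]

first = LinearRegression().fit(X, y)
residuals = y - first.predict(X)
second = DecisionTreeRegressor(max_depth=3).fit(X, residuals)

y_hat = first.predict(X) + second.predict(X)  # boosted prediction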

[FEATURE] PandasPipeTransformer

It might be cool to have things that are a "lambda" in pandas like this:

df.pipe(func, kw1="a", kw2="b")

to be applied in a pipeline.
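A hedged sketch: scikit-learn's FunctionTransformer already gets close, forwarding keyword arguments to the wrapped function via kw_args (func and its keywords here are illustrative):

import pandas as pd
from sklearn.preprocessing import FunctionTransformer

def func(df, kw1=None, kw2=None):
    return df.assign(combined=df["a"].astype(str) + kw1 + kw2)

step = FunctionTransformer(func, kw_args={"kw1": "a", "kw2": "b"})
step.fit_transform(pd.DataFrame({"a": [1, 2]}))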

feature request: decay model

The idea is to pass a parameter decay that will automatically decay the weight of past samples using exponential decay, such that the sample_weight param can be optimised in a grid search.

It might be good to discuss what other methods of feature decay we might want.
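A hypothetical sketch of the exponential variant (this idea later became sklego.meta.DecayEstimator; the weighting below is illustrative):

import numpy as np
from sklearn.linear_model import LinearRegression

X = np.random.rand(100, 2)
y = X.sum(axis=1)

decay = 0.99
# the oldest row gets decay**(n-1), the newest row gets decay**0 == 1
weights = decay ** np.arange(len(X) - 1, -1, -1)

model = LinearRegression().fit(X, y, sample_weight=weights)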

feature request: timeseries features

it might be nice to be able to accept a datetime column and generate lots of relevant features from it that can be used in an sklearn pipeline.

think: day_of_week, hour, etc.
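A hypothetical sketch of such a step (datetime_features and the column name are illustrative):

import pandas as pd
from sklearn.preprocessing import FunctionTransformer

def datetime_features(df, column="timestamp"):
    dt = pd.to_datetime(df[column])
    return df.assign(day_of_week=dt.dt.dayofweek, hour=dt.dt.hour)

step = FunctionTransformer(datetime_features)
df = pd.DataFrame({"timestamp": ["2024-01-01 08:00", "2024-01-02 17:30"]})
step.fit_transform(df)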

feature request: grouped model

sometimes you'd like to group the dataset into separate parts and run a model on each part. the idea of this model would be that you can add a classifier/regressor of your own and this model will make sure it gets run per group.

note that this model could actually work quite well in combination with a sklearn.dummy model.
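A hypothetical sketch of the mechanics (this idea later became sklego.meta.GroupedPredictor; the dict-of-models below is illustrative), using sklearn.dummy as suggested:

import pandas as pd
from sklearn.dummy import DummyRegressor

df = pd.DataFrame({
    "group": ["a", "a", "b", "b"],
    "x": [1.0, 2.0, 3.0, 4.0],
    "y": [1.0, 2.0, 30.0, 40.0],
})

# fit one estimator per group
models = {g: DummyRegressor().fit(part[["x"]], part["y"]) for g, part in df.groupby("group")}
preds = {g: m.predict(df.loc[df["group"] == g, ["x"]]) for g, m in models.items()}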

feature request: FeatureSmoother/ConstrainedSmoother


it might be epic if you could smooth out every column in X with regard to y, as a transformer, before it goes into an estimator. when looking at the loe(w)ess model this seems to be exactly what i want. not sure if it is super useful tho.
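A hedged sketch of smoothing a single column against y with a lowess fit (assuming statsmodels is installed; purely illustrative):

import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

X = np.random.rand(200, 1)
y = np.sin(6 * X[:, 0]) + 0.2 * np.random.randn(200)

# returns (x, smoothed_y) pairs sorted by x; the smoothed curve could
# replace the raw column before it reaches the final estimator
smoothed = lowess(y, X[:, 0], frac=0.3)
x_sorted, y_smooth = smoothed[:, 0], smoothed[:, 1]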

[FEATURE] DebugFeatureUnion

Similar to #46, but here for the FeatureUnion.

Description:
Have a log statement in between the steps of a feature union.
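A hypothetical sketch of the requested behaviour (LogShape is illustrative; no DebugFeatureUnion exists): append a small transformer to each branch so its output shape gets logged.

import logging
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import FeatureUnion, make_pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

logging.basicConfig(level=logging.INFO)

class LogShape(BaseEstimator, TransformerMixin):
    def __init__(self, name=""):
        self.name = name

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        logging.info("%s -> shape %s", self.name, np.asarray(X).shape)
        return X

union = FeatureUnion([
    ("std", make_pipeline(StandardScaler(), LogShape("std"))),
    ("minmax", make_pipeline(MinMaxScaler(), LogShape("minmax"))),
])
union.fit_transform(np.random.rand(10, 3))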

[FEATURE] Feature selector by name

When you use pandas, you often want to quickly specify which features to keep and/or which ones to drop.

Example

from sklearn.base import BaseEstimator, TransformerMixin

class FeatureSelector(BaseEstimator, TransformerMixin):
    def __init__(self, keep=None, drop=None):
        self.keep = keep
        self.drop = drop

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        if self.keep:
            self.feature_names = self.keep
        else:
            # note: set difference does not preserve column order
            self.feature_names = list(set(X.columns) - set(self.drop))
        return X[self.feature_names]

    def get_feature_names(self):
        return self.feature_names
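A hedged usage sketch of the snippet above:

import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})
FeatureSelector(keep=["a", "c"]).fit(df).transform(df)
FeatureSelector(drop=["b"]).fit(df).transform(df)  # same columns, order may vary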
