koaning / scikit-lego

Extra blocks for scikit-learn pipelines.

Home Page: https://koaning.github.io/scikit-lego/

License: MIT License

Topics: scikit-learn, machine-learning, common-sense

scikit-lego's Introduction


scikit-lego

We love scikit-learn, but very often we find ourselves writing custom transformers, metrics and models. The goal of this project is to consolidate these into a single package that offers code quality and testing. The project started as a collaboration between multiple companies in the Netherlands but has since received contributions from around the globe. It was initiated by Matthijs Brouns and Vincent D. Warmerdam as a tool to teach people how to contribute to open source.

Note that we're not formally affiliated with the scikit-learn project at all, but we aim to strictly adhere to their standards.

The same holds for LEGO. LEGO® is a trademark of the LEGO Group of companies, which does not sponsor, authorize or endorse this project.

Installation

Install scikit-lego via pip with

python -m pip install scikit-lego

Via conda with

conda install -c conda-forge scikit-lego

Alternatively, to edit and contribute you can fork/clone the repository and run an editable install:

python -m pip install -e ".[dev]"

Documentation

The documentation can be found at https://koaning.github.io/scikit-lego/.

Usage

We offer custom metrics, models and transformers. You can import them just like you would in scikit-learn.

# the scikit-learn stuff we love
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

# the scikit-lego stuff we add
from sklego.preprocessing import RandomAdder
from sklego.mixture import GMMClassifier

...

mod = Pipeline([
    ("scale", StandardScaler()),
    ("random_noise", RandomAdder()),
    ("model", GMMClassifier())
])

...
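A minimal sketch of how such a pipeline could be fitted, filling the ellipses with synthetic data from scikit-learn (make_classification here is just a stand-in for real data):

# a minimal sketch: fit the pipeline above on synthetic data
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

from sklego.preprocessing import RandomAdder
from sklego.mixture import GMMClassifier

X, y = make_classification(n_samples=200, n_features=4, random_state=42)

mod = Pipeline([
    ("scale", StandardScaler()),
    ("random_noise", RandomAdder()),
    ("model", GMMClassifier())
])

mod.fit(X, y)
print(mod.predict(X)[:5])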

Features

Here's a list of features that this library currently offers:

  • sklego.datasets.load_abalone loads in the abalone dataset
  • sklego.datasets.load_arrests loads in a dataset with fairness concerns
  • sklego.datasets.load_chicken loads in the joyful chickweight dataset
  • sklego.datasets.load_heroes loads a heroes of the storm dataset
  • sklego.datasets.load_hearts loads a dataset about hearts
  • sklego.datasets.load_penguins loads a lovely dataset about penguins
  • sklego.datasets.fetch_creditcard fetches a fraud dataset from OpenML
  • sklego.datasets.make_simpleseries makes a simulated timeseries
  • sklego.pandas_utils.add_lags adds lag values in a pandas dataframe
  • sklego.pandas_utils.log_step a useful decorator to log your pipeline steps (see the sketch after this list)
  • sklego.dummy.RandomRegressor dummy benchmark that predicts random values
  • sklego.linear_model.DeadZoneRegressor experimental feature that has a deadzone in the cost function
  • sklego.linear_model.DemographicParityClassifier logistic classifier constrained on demographic parity
  • sklego.linear_model.EqualOpportunityClassifier logistic classifier constrained on equal opportunity
  • sklego.linear_model.ProbWeightRegression linear model that treats coefficients as probabilistic weights
  • sklego.linear_model.LowessRegression locally weighted linear regression
  • sklego.linear_model.LADRegression least absolute deviation regression
  • sklego.linear_model.QuantileRegression linear quantile regression, generalizes LADRegression
  • sklego.linear_model.ImbalancedLinearRegression punish over/under-estimation of a model directly
  • sklego.naive_bayes.GaussianMixtureNB classifies by training a 1D GMM per column per class
  • sklego.naive_bayes.BayesianGaussianMixtureNB classifies by training a bayesian 1D GMM per class
  • sklego.mixture.BayesianGMMClassifier classifies by training a bayesian GMM per class
  • sklego.mixture.BayesianGMMOutlierDetector detects outliers based on a trained bayesian GMM
  • sklego.mixture.GMMClassifier classifies by training a GMM per class
  • sklego.mixture.GMMOutlierDetector detects outliers based on a trained GMM
  • sklego.meta.ConfusionBalancer experimental feature that allows you to balance the confusion matrix
  • sklego.meta.DecayEstimator adds decay to the sample_weight that the model accepts
  • sklego.meta.EstimatorTransformer adds a model output as a feature
  • sklego.meta.OutlierClassifier turns outlier models into classifiers for gridsearch
  • sklego.meta.GroupedPredictor can split the data into groups and run a model on each group
  • sklego.meta.GroupedTransformer can split the data into groups and run a transformer on each group
  • sklego.meta.SubjectiveClassifier experimental feature to add a prior to your classifier
  • sklego.meta.Thresholder meta model that allows you to gridsearch over the threshold
  • sklego.meta.RegressionOutlierDetector meta model that finds outliers by adding a threshold to regression
  • sklego.meta.ZeroInflatedRegressor predicts zero or applies a regression based on a classifier
  • sklego.preprocessing.ColumnCapper limits extreme values of the model features
  • sklego.preprocessing.ColumnDropper drops a column from pandas
  • sklego.preprocessing.ColumnSelector selects columns based on column name
  • sklego.preprocessing.InformationFilter transformer that can de-correlate features
  • sklego.preprocessing.IdentityTransformer returns the same data, allows for concatenating pipelines
  • sklego.preprocessing.OrthogonalTransformer makes all features linearly independent
  • sklego.preprocessing.PandasTypeSelector selects columns based on pandas type
  • sklego.preprocessing.RandomAdder adds randomness in training
  • sklego.preprocessing.RepeatingBasisFunction repeating feature engineering, useful for timeseries
  • sklego.preprocessing.DictMapper assigns numeric values to categorical columns
  • sklego.preprocessing.OutlierRemover experimental method to remove outliers during training
  • sklego.model_selection.GroupTimeSeriesSplit timeseries KFold for groups with different numbers of observations per group
  • sklego.model_selection.KlusterFoldValidation experimental feature that does K folds based on clustering
  • sklego.model_selection.TimeGapSplit timeseries KFold with a gap between train/test
  • sklego.pipeline.DebugPipeline adds debug information to make debugging easier
  • sklego.pipeline.make_debug_pipeline shorthand function to create a debuggable pipeline
  • sklego.metrics.correlation_score calculates correlation between model output and feature
  • sklego.metrics.equal_opportunity_score calculates equal opportunity metric
  • sklego.metrics.p_percent_score proxy for model fairness with regard to a sensitive attribute
  • sklego.metrics.subset_score calculates a score on a subset of your data (meant for fairness tracking)
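As a small taste of the utilities above, a hedged sketch of the log_step decorator on a pandas pipeline step (the exact log format may differ between versions):

import pandas as pd
from sklego.pandas_utils import log_step

@log_step
def drop_missing(df):
    return df.dropna()

df = pd.DataFrame({"a": [1.0, None, 3.0]})
df.pipe(drop_missing)  # logs the step name, runtime and resulting shape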

New Features

We want to be rather open here in what we accept, but we do demand three things before a feature is added to the project:

  1. any new feature contributes towards a demonstrable real-world use case
  2. any new feature passes standard unit tests (we use the ones from scikit-learn)
  3. the feature has been discussed in the issue list beforehand

We automate all of our testing and use pre-commit hooks to keep the code working.

scikit-lego's People

Contributors

amrrs, anopsy, arose13, arthurpaulino, carlolepelaars, daimonie, david26694, dependabot[bot], fbruzzesi, fritshermans, garve, glevv, greatsharma, jankeromnes, jczuurmond, k-moun, kayhoogland, koaning, ktiamur, maxibor, mbrouns, menziess, pim-hoeven, pverheijen, rensdimmendaal, sandervandorsten, skylarbpayne, stephanecollot, tcacastelijns, tomasborrella


scikit-lego's Issues

feature request: state-space models

Add state-space models, in discrete form:

x(k+1) = A * x(k) + B * u(k)
y(k) = C * x(k) + D * u(k)

where:
x(k) - internal state vector at timestamp k
u(k) - input vector at timestamp k
y(k) - output at timestamp k

Initial implementation would be with a given size of state vector x (i.e. you know the dimension of the underlying system). A second iteration could also estimate the length of this vector x, but that's probably not doable in a single day.

Must admit: I haven't seen many use cases that would be best solved using a state-space model, so I wonder how useful this would be.
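To pin the semantics down, a hypothetical simulation of the recursion above (nothing here exists in sklego; simulate and all the matrices are illustrative):

# roll the discrete system x(k+1) = A x(k) + B u(k), y(k) = C x(k) + D u(k)
import numpy as np

def simulate(A, B, C, D, u, x0):
    x, ys = x0, []
    for uk in u:
        ys.append(C @ x + D @ uk)   # output at timestamp k
        x = A @ x + B @ uk          # advance the internal state
    return np.array(ys)

A = np.array([[0.9, 0.1], [0.0, 0.8]])
B = np.array([[0.0], [1.0]])
C = np.array([[1.0, 0.0]])
D = np.array([[0.0]])
u = np.ones((50, 1))                # constant input sequence
y = simulate(A, B, C, D, u, x0=np.zeros(2))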

feature request: only apply random noise in `RandomAdder` to training data

Currently, RandomAdder adds noise both at training and at prediction time. This makes predictions non-deterministic, and it offers no clear benefit in most cases I can think of.

I suggest changing the default behaviour of the transformer to only add random noise to the training data, and optionally (through a constructor flag) also to the prediction data.
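A hypothetical sketch of one way to get that behaviour (illustrative only, not how sklego ended up implementing it): pipelines call fit_transform while fitting and plain transform while predicting, so noise added only in fit_transform stays confined to training.

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class TrainOnlyRandomAdder(BaseEstimator, TransformerMixin):
    def __init__(self, noise=1.0, random_state=None):
        self.noise = noise
        self.random_state = random_state

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X  # predict time: pass the data through unchanged

    def fit_transform(self, X, y=None):
        # fit time: add gaussian noise to the training data only
        rng = np.random.default_rng(self.random_state)
        return X + rng.normal(0, self.noise, size=np.asarray(X).shape)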

[FEATURE] Statsmodels wrapper class

This would help if you want to use statsmodels in a sklearn pipeline, for example for regression.

Example

import statsmodels.api as sm
from sklearn.base import BaseEstimator, RegressorMixin

class SMWrapper(BaseEstimator, RegressorMixin):
    """A universal sklearn-style wrapper for statsmodels regressors."""
    def __init__(self, model_class, fit_intercept=True, sample_weight=None):
        self.model_class = model_class
        self.fit_intercept = fit_intercept
        self.sample_weight = sample_weight

    def fit(self, X, y):
        if self.fit_intercept:
            X = sm.add_constant(X)
        # only forward sample_weight when given; not every statsmodels model accepts it
        kwargs = {} if self.sample_weight is None else {"sample_weight": self.sample_weight}
        self.model_ = self.model_class(y, X, **kwargs)
        # for an elastic-net regularized fit, use fit_regularized instead:
        # self.results_ = self.model_.fit_regularized(alpha=10, L1_wt=0.5)
        self.results_ = self.model_.fit()
        return self  # sklearn conventions expect fit to return the estimator

    def predict(self, X):
        if self.fit_intercept:
            X = sm.add_constant(X)
        return self.results_.predict(X)
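A hedged usage sketch, assuming statsmodels is installed:

import numpy as np
import statsmodels.api as sm

X = np.random.rand(100, 2)
y = X @ np.array([1.5, -2.0]) + 0.1 * np.random.randn(100)

preds = SMWrapper(sm.OLS).fit(X, y).predict(X)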

[FEATURE] Time Series Split with gap and column parameter

Time Series Split with a gap parameter between train and testing

We want a gap between the train window and the test window, to simulate that in production you need to wait x days before you can create a target that looks x days ahead (e.g. when you want to predict a value x days out).

[figure: train and test windows separated by a gap]

Also, sometimes you have multiple samples per day; the current scikit-learn implementation doesn't support specifying a date column.
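A hypothetical sketch of the mechanics with plain index arithmetic (gapped_splits is illustrative, not the TimeGapSplit API sklego later shipped, and it ignores the date-column part of the request):

import numpy as np

def gapped_splits(n_samples, n_splits, gap):
    # yield expanding train indices and test indices separated by `gap` rows
    fold = n_samples // (n_splits + 1)
    for i in range(1, n_splits + 1):
        train_end = i * fold
        test_start = train_end + gap
        yield np.arange(train_end), np.arange(test_start, min(test_start + fold, n_samples))

for train_idx, test_idx in gapped_splits(100, n_splits=3, gap=5):
    print(train_idx[-1], test_idx[0])  # 5 skipped rows sit between train and test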

feature request: monotonic models

it would be awesome if you could specify (per column) whether the feature should be monotonically increasing, decreasing, up-down-up, down-up-down or free. forcing this in a simple linear regression would already be kind of sweet.
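For the increasing/decreasing/free cases, scikit-learn itself nowadays offers per-feature monotonic constraints in its histogram gradient boosting models. A hedged sketch (requires a reasonably recent scikit-learn; the up-down-up shapes above would still need something custom):

import numpy as np
from sklearn.ensemble import HistGradientBoostingRegressor

X = np.random.rand(500, 3)
y = 2 * X[:, 0] - 3 * X[:, 1] + 0.1 * np.random.randn(500)

# 1 = monotonically increasing, -1 = decreasing, 0 = unconstrained
model = HistGradientBoostingRegressor(monotonic_cst=[1, -1, 0])
model.fit(X, y)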

feature request: RBF Features

this is like the repeating RBF features except that this one won't repeat. it will simply span the entire space of a variable.
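A hypothetical sketch of the idea (rbf_features is illustrative, not an sklego API): place Gaussian bumps on evenly spaced centers across the variable's range.

import numpy as np

def rbf_features(x, n_centers=5, width=None):
    # expand a 1D array into similarities to evenly spaced gaussian centers
    centers = np.linspace(x.min(), x.max(), n_centers)
    if width is None:
        width = (x.max() - x.min()) / n_centers
    return np.exp(-((x[:, None] - centers[None, :]) ** 2) / (2 * width ** 2))

x = np.linspace(0, 10, 100)
X_rbf = rbf_features(x)  # shape (100, 5)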

feature request: EstimatorTransformer

is it possible to make a transformer that takes the output of an estimator and adds it to the values used for prediction? do we want it?
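This idea later landed as sklego.meta.EstimatorTransformer (listed above). A hedged usage sketch, gluing the original columns and the model output together with a FeatureUnion:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import FunctionTransformer
from sklego.meta import EstimatorTransformer

union = FeatureUnion([
    ("original", FunctionTransformer()),                    # identity passthrough
    ("lr_pred", EstimatorTransformer(LinearRegression())),  # predictions as a column
])

X = np.random.rand(100, 3)
y = X @ np.array([1.0, 2.0, 3.0])
X_aug = union.fit_transform(X, y)  # original columns plus one prediction column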

[FEATURE] Feature capping

Use case:

In ML your features sometimes have extremely large or even infinite values (np.inf); we want to cap those values with a feature transformer. A sketch of the idea follows the parameter list.

Parameters:

  • feature(s) to cap (and/or features not to cap?)
  • min value
  • max value
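A hypothetical sketch of the idea with np.clip (this request later became sklego.preprocessing.ColumnCapper; Capper and its parameters are illustrative):

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class Capper(BaseEstimator, TransformerMixin):
    def __init__(self, min_value=None, max_value=None):
        self.min_value = min_value
        self.max_value = max_value

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # np.clip also tames np.inf / -np.inf values
        return np.clip(X, self.min_value, self.max_value)

Capper(max_value=10).fit_transform(np.array([[1.0, np.inf], [5.0, 12.0]]))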

feature request: BoosterPipeline

The idea is to have a pipeline where you might have more than one model in sequence. Model 2 would try to improve on the residuals of Model 1, and so forth.
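A hypothetical two-step sketch of the mechanics (illustrative; no BoosterPipeline exists in sklego): the second model is fitted on the residuals of the first.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

X = np.random.rand(200, 2)
y = np.sin(4 * X[:, 0]) + X[:, 1]

first = LinearRegression().fit(X, y)
residuals = y - first.predict(X)
second = DecisionTreeRegressor(max_depth=3).fit(X, residuals)

y_hat = first.predict(X) + second.predict(X)  # boosted prediction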

[FEATURE] PandasPipeTransformer

It might be cool to have things that are a "lambda" in pandas like this:

df.pipe(func, kw1="a", kw2="b")

to be applied in a pipeline.
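A hedged sketch: scikit-learn's FunctionTransformer already gets close, forwarding keyword arguments to the wrapped function via kw_args (func and its keywords here are illustrative):

import pandas as pd
from sklearn.preprocessing import FunctionTransformer

def func(df, kw1=None, kw2=None):
    return df.assign(combined=df["a"].astype(str) + kw1 + kw2)

step = FunctionTransformer(func, kw_args={"kw1": "a", "kw2": "b"})
step.fit_transform(pd.DataFrame({"a": [1, 2]}))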

feature request: decay model

The idea is to pass a parameter decay that will automatically decay the weight of past samples using exponential decay, such that the sample_weight param can be optimised in a grid search.

It might be good to discuss what other methods of feature decay we might want.
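A hypothetical sketch of the exponential variant (this idea later became sklego.meta.DecayEstimator; the weighting below is illustrative):

import numpy as np
from sklearn.linear_model import LinearRegression

X = np.random.rand(100, 2)
y = X.sum(axis=1)

decay = 0.99
# the oldest row gets decay**(n-1), the newest row gets decay**0 == 1
weights = decay ** np.arange(len(X) - 1, -1, -1)

model = LinearRegression().fit(X, y, sample_weight=weights)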

feature request: timeseries features

it might be nice to be able to accept a datetime column and generate lots of relevant features from it that can be used in an sklearn pipeline.

think: day_of_week, hour, etc.
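A hypothetical sketch of such a step (datetime_features and the column name are illustrative):

import pandas as pd
from sklearn.preprocessing import FunctionTransformer

def datetime_features(df, column="timestamp"):
    dt = pd.to_datetime(df[column])
    return df.assign(day_of_week=dt.dt.dayofweek, hour=dt.dt.hour)

step = FunctionTransformer(datetime_features)
df = pd.DataFrame({"timestamp": ["2024-01-01 08:00", "2024-01-02 17:30"]})
step.fit_transform(df)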

feature request: grouped model

sometimes you'd like to group the dataset into separate parts and run a model on each part. the idea of this model would be that you can add a classifier/regressor of your own and this model will make sure it gets run per group.

note that this model could actually work quite well in combination with a sklearn.dummy model.
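A hypothetical sketch of the mechanics (this idea later became sklego.meta.GroupedPredictor; the dict-of-models below is illustrative), using sklearn.dummy as suggested:

import pandas as pd
from sklearn.dummy import DummyRegressor

df = pd.DataFrame({
    "group": ["a", "a", "b", "b"],
    "x": [1.0, 2.0, 3.0, 4.0],
    "y": [1.0, 2.0, 30.0, 40.0],
})

# fit one estimator per group
models = {g: DummyRegressor().fit(part[["x"]], part["y"]) for g, part in df.groupby("group")}
preds = {g: m.predict(df.loc[df["group"] == g, ["x"]]) for g, m in models.items()}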

feature request: FeatureSmoother/ConstrainedSmoother


it might be epic if you could smooth out every column in X with regard to y, as a transformer, before it goes into an estimator. when looking at the loe(w)ess model this seems to be exactly what i want. not sure if it is super useful tho.
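A hedged sketch of smoothing a single column against y with a lowess fit (assuming statsmodels is installed; purely illustrative):

import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

X = np.random.rand(200, 1)
y = np.sin(6 * X[:, 0]) + 0.2 * np.random.randn(200)

# returns (x, smoothed_y) pairs sorted by x; the smoothed curve could
# replace the raw column before it reaches the final estimator
smoothed = lowess(y, X[:, 0], frac=0.3)
x_sorted, y_smooth = smoothed[:, 0], smoothed[:, 1]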

[FEATURE] DebugFeatureUnion

Similar to #46, but here for the FeatureUnion.

Description:
Have a log statement in between the steps of a feature union.
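A hypothetical sketch of the requested behaviour (LogShape is illustrative; no DebugFeatureUnion exists): append a small transformer to each branch so its output shape gets logged.

import logging
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import FeatureUnion, make_pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

logging.basicConfig(level=logging.INFO)

class LogShape(BaseEstimator, TransformerMixin):
    def __init__(self, name=""):
        self.name = name

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        logging.info("%s -> shape %s", self.name, np.asarray(X).shape)
        return X

union = FeatureUnion([
    ("std", make_pipeline(StandardScaler(), LogShape("std"))),
    ("minmax", make_pipeline(MinMaxScaler(), LogShape("minmax"))),
])
union.fit_transform(np.random.rand(10, 3))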

[FEATURE] Feature selector by name

When you use pandas, you often want to quickly specify which features to keep and/or which ones to drop.

Example

from sklearn.base import BaseEstimator, TransformerMixin

class FeatureSelector(BaseEstimator, TransformerMixin):
    def __init__(self, keep=None, drop=None):
        self.keep = keep
        self.drop = drop

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        if self.keep:
            self.feature_names = self.keep
        else:
            # note: set difference does not preserve column order
            self.feature_names = list(set(X.columns) - set(self.drop))
        return X[self.feature_names]

    def get_feature_names(self):
        return self.feature_names
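A hedged usage sketch of the snippet above:

import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})
FeatureSelector(keep=["a", "c"]).fit(df).transform(df)
FeatureSelector(drop=["b"]).fit(df).transform(df)  # same columns, order may vary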
