koaning / scikit-lego
Extra blocks for scikit-learn pipelines.
Home Page: https://koaning.github.io/scikit-lego/
License: MIT License
It might be nice to have a variant of the voting classifier. One that goes a bit further than mere voting and takes the uncertainty of the separate classifiers into account.
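A minimal sketch of what that could look like, averaging `predict_proba` outputs so that a doubtful classifier (flat probabilities) moves the vote less than a confident one. The `ProbabilisticVoter` name and design are assumptions, not a settled API:

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin, clone


class ProbabilisticVoter(BaseEstimator, ClassifierMixin):
    """Votes with predicted probabilities instead of hard labels."""

    def __init__(self, estimators):
        self.estimators = estimators

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.estimators_ = [clone(e).fit(X, y) for e in self.estimators]
        return self

    def predict_proba(self, X):
        # confident classifiers push the average harder than doubtful ones
        return np.mean([e.predict_proba(X) for e in self.estimators_], axis=0)

    def predict(self, X):
        return self.classes_[np.argmax(self.predict_proba(X), axis=1)]
```

Note that scikit-learn's `VotingClassifier(voting="soft")` already does this averaging; the interesting extension would be weighting each vote by a proper per-classifier uncertainty estimate.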
Selects columns based on a name. Accepts an Iterable(str) or a str (which is converted to an iterable of length 1).
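A sketch of such a selector under those assumptions:

```python
from sklearn.base import BaseEstimator, TransformerMixin


class ColumnSelector(BaseEstimator, TransformerMixin):
    """Selects columns of a pd.DataFrame by name."""

    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # a single string is promoted to a list of length 1
        columns = [self.columns] if isinstance(self.columns, str) else list(self.columns)
        return X[columns]
```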
Time Series Split with a gap parameter between train and test.
Between the blue and the red we want to have a gap, to simulate that in production you need to wait x days before creating your target that looks x days ahead (e.g. the case where you want to predict the value in x days).
Also, sometimes you have multiple samples per day; the current scikit-learn implementation doesn't support specifying a date column.
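A sketch of the gap part, trimming the tail of each training fold (class and parameter names are guesses):

```python
from sklearn.model_selection import TimeSeriesSplit


class GapTimeSeriesSplit:
    """TimeSeriesSplit that keeps `gap` samples between train and test."""

    def __init__(self, n_splits=5, gap=0):
        self.cv = TimeSeriesSplit(n_splits=n_splits)
        self.gap = gap

    def get_n_splits(self, X=None, y=None, groups=None):
        return self.cv.get_n_splits(X, y, groups)

    def split(self, X, y=None, groups=None):
        for train_idx, test_idx in self.cv.split(X, y, groups):
            # drop the last `gap` train indices: the model never sees
            # the waiting period just before the test window
            yield train_idx[: len(train_idx) - self.gap], test_idx
```

Recent scikit-learn versions (0.24+) have since grown a `gap` argument on TimeSeriesSplit itself; the date-column part would still need custom work.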
Currently, RandomAdder adds noise to the data both at training and at prediction time. This makes predictions non-deterministic and offers no clear benefit in most cases I can think of.
I suggest changing the default behaviour of the transformer to only add random noise to the training data and, optionally via a constructor flag, to the prediction data as well.
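A sketch of that default, using the fact that training goes through fit_transform while prediction only calls transform (the flag name is a guess):

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.utils import check_random_state


class RandomAdder(BaseEstimator, TransformerMixin):
    def __init__(self, noise=1.0, transform_test=False, random_state=None):
        self.noise = noise
        self.transform_test = transform_test
        self.random_state = random_state

    def fit(self, X, y=None):
        return self

    def fit_transform(self, X, y=None):
        # fit_transform only runs at training time: always add noise here
        self.fit(X, y)
        rs = check_random_state(self.random_state)
        return X + rs.normal(0, self.noise, size=np.shape(X))

    def transform(self, X):
        # transform runs at prediction time: pass through unless asked
        if self.transform_test:
            rs = check_random_state(self.random_state)
            return X + rs.normal(0, self.noise, size=np.shape(X))
        return X
```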
It might be nice to be able to accept a datetime column and to generate lots of relevant features from it that can be used in an sklearn pipeline.
Think: day_of_week, hour, etc.
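A sketch, assuming the input is a single-column DataFrame holding the datetime (the exact feature set is a guess):

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class DatetimeFeaturizer(BaseEstimator, TransformerMixin):
    """Expands a single datetime column into calendar features."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        dt = pd.to_datetime(X.iloc[:, 0])
        return pd.DataFrame({
            "day_of_week": dt.dt.dayofweek,
            "hour": dt.dt.hour,
            "month": dt.dt.month,
            "is_weekend": (dt.dt.dayofweek >= 5).astype(int),
        })
```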
Selecting Pandas columns by their dtype
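A sketch, leaning on pandas' own select_dtypes:

```python
from sklearn.base import BaseEstimator, TransformerMixin


class DtypeSelector(BaseEstimator, TransformerMixin):
    """Keeps only columns whose dtype matches, e.g. np.number."""

    def __init__(self, include):
        self.include = include

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # pandas does the heavy lifting here
        return X.select_dtypes(include=self.include)
```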
It might be cool to have things that are a "lambda" in pandas, like this:
df.pipe(func, kw1="a", kw2="b")
to be applied in a pipeline.
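A sketch of a transformer with pipe semantics (names are made up):

```python
from sklearn.base import BaseEstimator, TransformerMixin


class PandasPipeTransformer(BaseEstimator, TransformerMixin):
    """Applies df.pipe(func, **kw_args) as a pipeline step."""

    def __init__(self, func, kw_args=None):
        self.func = func
        self.kw_args = kw_args

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X.pipe(self.func, **(self.kw_args or {}))
```

For what it's worth, scikit-learn's FunctionTransformer(func, validate=False, kw_args={...}) already gets close, since func(X, **kw_args) is exactly what X.pipe(func, **kw_args) does.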
dev dependencies in `./setup.`
pip install sphinx
pip install sphinx_rtd_theme
Not every user will appreciate python logging, so it might make sense to have a page in the documentation that gives an example of how a DebugPipeline might allow you to discover a bug.
It is probably a good idea to link to the blog post that inspired the feature too: https://tomaugspurger.github.io/method-chaining
A wrapper class to log a fit/transform/fit_transform of an estimator or a transformer
Please add basic information for the documentation.
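A sketch of what such a wrapper could log (the class name and messages are placeholders):

```python
import logging

from sklearn.base import BaseEstimator, TransformerMixin

logger = logging.getLogger(__name__)


class LoggingWrapper(BaseEstimator, TransformerMixin):
    """Wraps an estimator/transformer and logs every fit/transform call."""

    def __init__(self, estimator):
        self.estimator = estimator

    def fit(self, X, y=None):
        logger.info("fit: X has shape %s", getattr(X, "shape", "?"))
        self.estimator.fit(X, y)
        return self

    def transform(self, X):
        Xt = self.estimator.transform(X)
        logger.info("transform: %s -> %s", X.shape, Xt.shape)
        return Xt
```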
The idea is to pass a parameter decay that will automatically decay past samples using exponential decay, such that the sample_weights param can be optimised in a grid search.
It might be good to discuss what other methods of decay we might want; preferably this is a setting.
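A sketch of the meta-estimator idea, assuming rows are ordered oldest-first and the wrapped model accepts sample_weight (class and parameter names are guesses):

```python
import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin, clone


class DecayEstimator(BaseEstimator, RegressorMixin):
    """Fits the wrapped model with exponentially decayed sample weights."""

    def __init__(self, model, decay=0.999):
        self.model = model
        self.decay = decay

    def fit(self, X, y):
        n = len(y)
        # the most recent sample gets weight 1, older ones decay ** age
        weights = self.decay ** np.arange(n - 1, -1, -1)
        self.model_ = clone(self.model).fit(X, y, sample_weight=weights)
        return self

    def predict(self, X):
        return self.model_.predict(X)
```

With `decay` exposed as a constructor parameter, it can go straight into a grid search.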
This is Travis, but with support for Windows. https://www.appveyor.com/pricing/
If you want to use statsmodels, for example for regression in a sklearn pipeline:
Example
```python
import statsmodels.api as sm
from sklearn.base import BaseEstimator, RegressorMixin


class SMWrapper(BaseEstimator, RegressorMixin):
    """A universal sklearn-style wrapper for statsmodels regressors."""

    def __init__(self, model_class, fit_intercept=True, sample_weight=None):
        self.model_class = model_class
        self.fit_intercept = fit_intercept
        self.sample_weight = sample_weight

    def fit(self, X, y):
        if self.fit_intercept:
            X = sm.add_constant(X)
        if self.sample_weight is not None:
            # the keyword depends on the model class, e.g. sm.WLS expects `weights`
            self.model_ = self.model_class(y, X, weights=self.sample_weight)
        else:
            self.model_ = self.model_class(y, X)
        # for an elastic-net regularized fit, use fit_regularized instead:
        # self.results_ = self.model_.fit_regularized(alpha=10, L1_wt=0.5)
        self.results_ = self.model_.fit()
        return self  # sklearn requires fit to return self

    def predict(self, X):
        if self.fit_intercept:
            X = sm.add_constant(X)
        return self.results_.predict(X)
```
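The wrapper then drops into a pipeline like any other regressor, e.g.:

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([("scale", StandardScaler()), ("ols", SMWrapper(sm.OLS))])
```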
When you use pandas, you want to quickly specify which features to keep and/or which ones you want to drop.
Example
```python
from sklearn.base import BaseEstimator, TransformerMixin


class FeatureSelector(BaseEstimator, TransformerMixin):
    """Keeps or drops pandas columns by name."""

    def __init__(self, keep=None, drop=None):
        self.keep = keep
        self.drop = drop

    def fit(self, X, y=None):
        if self.keep is not None:
            self.feature_names_ = list(self.keep)
        else:
            # preserve the original column order when dropping
            self.feature_names_ = [c for c in X.columns if c not in set(self.drop or [])]
        return self

    def transform(self, X):
        return X[self.feature_names_]

    def get_feature_names(self):
        return self.feature_names_
```
Is it possible to make a transformer that takes the output of an estimator and adds it to the values used for prediction? Do we want it?
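It is possible; a sketch (names made up):

```python
from sklearn.base import BaseEstimator, TransformerMixin, clone


class EstimatorTransformer(BaseEstimator, TransformerMixin):
    """Uses an estimator's predictions as a feature column."""

    def __init__(self, estimator):
        self.estimator = estimator

    def fit(self, X, y):
        self.estimator_ = clone(self.estimator).fit(X, y)
        return self

    def transform(self, X):
        # the wrapped model's predictions become a feature column
        return self.estimator_.predict(X).reshape(-1, 1)
```

Wrapped in a FeatureUnion, the prediction column gets concatenated onto the original features.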
This is like the repeating RBF features, except that this one... won't repeat. It will simply span the entire space of a variable.
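A sketch, placing evenly spaced Gaussian bumps across the observed range (parameter names are guesses):

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class RBFFeatures(BaseEstimator, TransformerMixin):
    """Places n_centers Gaussian bumps across the observed range of x."""

    def __init__(self, n_centers=10, width=1.0):
        self.n_centers = n_centers
        self.width = width

    def fit(self, X, y=None):
        x = np.asarray(X).ravel()
        self.centers_ = np.linspace(x.min(), x.max(), self.n_centers)
        self.scale_ = self.width * (x.max() - x.min()) / self.n_centers
        return self

    def transform(self, X):
        x = np.asarray(X).ravel()
        # one smooth bump per center, spanning the whole variable range
        return np.exp(-(((x[:, None] - self.centers_[None, :]) / self.scale_) ** 2))
```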
Add state-space models, in discrete form:
x(k+1) = A * x(k) + B * u(k)
y(k) = C * x(k) + D * u(k)
where:
x(k) - internal state vector at timestep k
u(k) - input vector at timestep k
y(k) - output at timestep k
An initial implementation would be with a given size of the state vector x (e.g. you know the dimension of the underlying system). A second iteration could also estimate the length of this vector x, but that's probably not doable in a single day.
I must admit: I haven't seen many use-cases that would be best solved using a state-space model and thus wonder how useful this can be. Then again, I haven't seen many use-cases in general.
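Simulating the recursion with known matrices is the easy half; a sketch of that part (estimating A, B, C, D from data, i.e. system identification, is where the real work would be):

```python
import numpy as np


def simulate_state_space(A, B, C, D, u, x0):
    """Rolls forward: y(k) = C x(k) + D u(k), then x(k+1) = A x(k) + B u(k)."""
    x = np.asarray(x0, dtype=float)
    ys = []
    for u_k in u:
        ys.append(C @ x + D @ u_k)  # output at timestep k
        x = A @ x + B @ u_k         # advance the internal state
    return np.array(ys)
```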
The idea is to force the effect of the parameters of the model to be either increasing or decreasing.
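For a linear model, one way to get this is to bound the coefficient signs. A sketch with scipy; the signs encoding (+1 increasing, -1 decreasing, 0 free) is my own:

```python
import numpy as np
from scipy.optimize import lsq_linear

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5])

# per-coefficient sign constraints: +1 increasing, -1 decreasing, 0 free
signs = np.array([1, -1, 0])
lower = np.where(signs > 0, 0.0, -np.inf)
upper = np.where(signs < 0, 0.0, np.inf)

coef = lsq_linear(X, y, bounds=(lower, upper)).x  # bounded least squares
```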
Once #21 is merged there is a randomregression; this is a dummy model used for benchmarking.
The idea is to have a pipeline where you might have more than one model in sequence. Model 2 would try to improve on the residuals of Model 1, and so forth.
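A sketch of such a chain (boosting-flavoured; names made up):

```python
from sklearn.base import BaseEstimator, RegressorMixin, clone


class ResidualChain(BaseEstimator, RegressorMixin):
    """Each model is fit on the residuals left by the previous ones."""

    def __init__(self, models):
        self.models = models

    def fit(self, X, y):
        residual = y
        self.fitted_ = []
        for model in self.models:
            m = clone(model).fit(X, residual)
            residual = residual - m.predict(X)
            self.fitted_.append(m)
        return self

    def predict(self, X):
        # the chain's prediction is the sum of all partial fits
        return sum(m.predict(X) for m in self.fitted_)
```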
Similar to #46, but here for the FeatureUnion.
Description:
Have a log statement in between the steps of a feature union.
A pipeline that logs extra information before/after every step, which is useful for debugging.
Sometimes you'd like to group the dataset into separate parts and run a model on each part. The idea of this model would be that you can bring a classifier/regressor of your own, but this model will make sure it gets run per group.
Note that this model could actually work quite well in combination with a sklearn.dummy model.
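A sketch of the grouped estimator, assuming a pandas DataFrame with a group column (names are guesses):

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, clone


class GroupedPredictor(BaseEstimator):
    """Fits one clone of the inner estimator per value of a group column."""

    def __init__(self, estimator, group_col):
        self.estimator = estimator
        self.group_col = group_col

    def fit(self, X, y):
        y = pd.Series(np.asarray(y), index=X.index)
        self.models_ = {
            group: clone(self.estimator).fit(
                subset.drop(columns=self.group_col), y.loc[subset.index]
            )
            for group, subset in X.groupby(self.group_col)
        }
        return self

    def predict(self, X):
        preds = pd.Series(index=X.index, dtype=float)
        for group, subset in X.groupby(self.group_col):
            preds.loc[subset.index] = self.models_[group].predict(
                subset.drop(columns=self.group_col)
            )
        return preds.to_numpy()
```

Plugging in sklearn.dummy.DummyRegressor would indeed give per-group baselines almost for free.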
nuff said.
me: plz dont do this
also me: but it so funny
This is a bit of a silly request, but it is an interesting idea to cross-validate on clusters instead of k-folds.
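Not that silly at all; a quick sketch using cluster labels as the CV groups:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score

X, y = make_classification(n_samples=500, random_state=42)

# every cluster takes one turn as the held-out "fold"
groups = KMeans(n_clusters=5, random_state=42).fit_predict(X)
scores = cross_val_score(
    LogisticRegression(max_iter=1000), X, y, groups=groups, cv=LeaveOneGroupOut()
)
```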
A decorator that logs how a pandas dataframe is modified. Information:
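A sketch of such a decorator, logging the shape before and after each step (exactly which information to log is open):

```python
import functools
import logging

logger = logging.getLogger(__name__)


def log_step(func):
    """Logs the shape change made by a dataframe -> dataframe function."""

    @functools.wraps(func)
    def wrapper(df, *args, **kwargs):
        result = func(df, *args, **kwargs)
        logger.info("%s: shape %s -> %s", func.__name__, df.shape, result.shape)
        return result

    return wrapper
```

This plays nicely with the method-chaining style: decorate each step and every `df.pipe(step)` call logs itself.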
Training a umap/tsne embedding before passing the data to the algorithm would be nice.
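A sketch, assuming the umap-learn package; UMAP works here where t-SNE doesn't, because scikit-learn's TSNE has no transform method:

```python
import umap  # assumes the umap-learn package
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

# UMAP exposes fit/transform, so it can sit in front of any estimator
pipe = Pipeline([
    ("embed", umap.UMAP(n_components=2, random_state=42)),
    ("clf", KNeighborsClassifier()),
])
```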
Selects columns from a pd.DataFrame or np.ndarray based on indices. Can be used in a sklearn Pipeline.
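A sketch covering both input types:

```python
from sklearn.base import BaseEstimator, TransformerMixin


class IndexSelector(BaseEstimator, TransformerMixin):
    """Selects columns of a DataFrame or ndarray by integer position."""

    def __init__(self, indices):
        self.indices = indices

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        if hasattr(X, "iloc"):       # pd.DataFrame
            return X.iloc[:, self.indices]
        return X[:, self.indices]    # np.ndarray
```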
Use case:
In ML, sometimes your features have extremely large values or even infinite values (np.inf); we want to cap those values with a feature transformer.
Parameters:
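A sketch of such a capper, assuming quantile-based caps learned at fit time (the quantile parameter is my guess):

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class OutlierCapper(BaseEstimator, TransformerMixin):
    """Clips every feature to quantile caps learned on the training data."""

    def __init__(self, quantile=0.99):
        self.quantile = quantile

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        # treat +/-inf as missing so they don't wreck the quantiles
        X = np.where(np.isfinite(X), X, np.nan)
        self.lower_ = np.nanquantile(X, 1 - self.quantile, axis=0)
        self.upper_ = np.nanquantile(X, self.quantile, axis=0)
        return self

    def transform(self, X):
        # clip also maps +/-inf onto the learned caps
        return np.clip(np.asarray(X, dtype=float), self.lower_, self.upper_)
```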
It would be awesome if you could specify (per column) whether the feature should be monotonically increasing, decreasing, up-down-up, down-up-down or free. Forcing this in a simple linear regression would already be kind of sweet.
The make_pipeline function uses the (standard) Pipeline of sklearn. Refactor this function such that another pipeline, like DebugPipeline, can be passed as the preferred pipeline.
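A sketch of the refactor; it leans on sklearn's private _name_estimators helper, which is what make_pipeline itself uses:

```python
from sklearn.pipeline import Pipeline, _name_estimators


def make_pipeline(*steps, pipeline_class=Pipeline, **kwargs):
    """Like sklearn's make_pipeline, but the Pipeline class is pluggable."""
    return pipeline_class(_name_estimators(steps), **kwargs)
```

Calling make_pipeline(StandardScaler(), LinearRegression(), pipeline_class=DebugPipeline) would then hand back a DebugPipeline.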
After the unit tests have run we get about 433 warnings. Let's investigate this.
Feature generation that can be used for time series. A trick from the London talk.
A conference paper that was downloaded 700+ times can't be wrong.
https://link.springer.com/chapter/10.1007/978-3-319-00651-2_19
I've always wanted to have loess regression in python. R has a cool version of it, but in python it has always been missing. This would be a great model to host here.
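A sketch of a lowess-backed regressor using statsmodels' implementation, with simple interpolation for unseen points (the class is my own wrapping, not an existing API):

```python
import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin
from statsmodels.nonparametric.smoothers_lowess import lowess


class LowessRegressor(BaseEstimator, RegressorMixin):
    """Loess/lowess for 1-d inputs, interpolating between smoothed points."""

    def __init__(self, frac=0.3):
        self.frac = frac

    def fit(self, X, y):
        x = np.asarray(X).ravel()
        # lowess returns the smoothed curve as an (n, 2) array sorted by x
        smoothed = lowess(y, x, frac=self.frac)
        self.x_, self.y_ = smoothed[:, 0], smoothed[:, 1]
        return self

    def predict(self, X):
        # linear interpolation between the smoothed training points
        return np.interp(np.asarray(X).ravel(), self.x_, self.y_)
```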
The EstimatorTransformer is complicated enough to warrant an .rst document. Might be nice to check whether we can automatically test this as well.
As an alternative to isolation forests.
LagAdder(colname, lags)
or LagAdder(idx_col, lags)
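A sketch for the LagAdder(colname, lags) flavour; the idx_col variant would shift against a date index instead:

```python
from sklearn.base import BaseEstimator, TransformerMixin


class LagAdder(BaseEstimator, TransformerMixin):
    """Adds shifted copies of a column as new lag features."""

    def __init__(self, colname, lags):
        self.colname = colname
        self.lags = lags

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X = X.copy()
        for lag in self.lags:
            # lag k holds the value of the column k rows earlier
            X[f"{self.colname}_lag_{lag}"] = X[self.colname].shift(lag)
        return X
```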