
vecxoz / vecstack


Python package for stacking (machine learning technique)

License: Other

Python 100.00%
bagging blending ensemble ensemble-learning ensembling explain-stacking machine-learning stacked-generalization stacking stacking-tutorial

vecstack's Issues

N-dimensional input (stacking for convolutional nets)

Hi,
I tried to do stacking using KerasClassifier with a CNN, but I get this error:
ValueError: Found array with dim 4. Estimator expected <= 2.

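This is my code for the CNN:

# Imports assumed here (standalone Keras)
from keras.models import Sequential
from keras.layers import Conv2D, Activation, MaxPooling2D, Flatten, Dense, Dropout

def model1():
    model = Sequential()
    # Input shape comes from the training images, e.g. (height, width, channels)
    model.add(Conv2D(16, (3, 3), activation='relu', padding="same", input_shape=(train_files.shape[1:])))
    model.add(Activation('relu'))
    model.add(MaxPooling2D(pool_size=(2, 2)))

    model.add(Conv2D(32, (3, 3)))
    model.add(Activation('relu'))
    model.add(MaxPooling2D(pool_size=(2, 2)))

    model.add(Conv2D(64, (3, 3)))
    model.add(Activation('relu'))
    model.add(MaxPooling2D(pool_size=(2, 2)))
    model.add(Flatten())
    model.add(Dense(64))
    model.add(Activation('relu'))
    model.add(Dropout(0.5))
    # Single sigmoid unit for binary classification
    model.add(Dense(1))
    model.add(Activation('sigmoid'))
    model.compile(loss='binary_crossentropy',
                  optimizer='rmsprop',
                  metrics=['accuracy'])
    # The build function must return the compiled model
    return model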
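One possible workaround, sketched with plain NumPy and assuming the "dim 4" error comes from a 2-D-only input check: flatten every image into one row before stacking and restore the image shape inside the model (in Keras, a `Reshape` layer placed first could play that role). The helper names `flatten_images` and `restore_images` are hypothetical and not part of vecstack.

```python
import numpy as np


def flatten_images(X):
    """Collapse (n, h, w, c) into (n, h*w*c) so a 2-D-only check accepts it."""
    return X.reshape(X.shape[0], -1)


def restore_images(X_flat, image_shape):
    """Undo flatten_images; image_shape is (h, w, c)."""
    return X_flat.reshape((X_flat.shape[0],) + tuple(image_shape))


# Toy batch: 2 images of 4x4 pixels with 3 channels
X = np.arange(2 * 4 * 4 * 3, dtype=np.float32).reshape(2, 4, 4, 3)

X_flat = flatten_images(X)                    # what would be passed to stacking
X_back = restore_images(X_flat, X.shape[1:])  # what the CNN needs internally
```

Reshaping only changes the view of the data, so the round trip is lossless.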

A question about transform

When I use stacking and StackingTransformer for the 1st layer, the OOF predictions of one specific model (AdaBoost) differ between the two, e.g. (0.9, 0.805), (1.3, 1.34), and I really do not know what causes this.

Using different data transformations and fit parameters for different models

Hi Igor,

Congratulations on your package. I've been searching for a stacking package and this nails it (both for simplicity and effectiveness). Thanks for your contribution.

Is there any possibility to stack already trained models with your package? There are 2 reasons for this:
- People might want to pass fit arguments to the models (currently not possible, as the stacking function trains the models itself)
- We might want to use different data scaling and preprocessing techniques for different algorithms (label encoding for tree-based methods and one-hot encoding for linear ones)

For example, H2O stacking allows users to stack already trained models:
http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/stacked-ensembles.html

I would love to contribute to your package but unfortunately my technical level would be too dangerous for your code :P

issue with Keras custom layer

Hi @vecxoz,
vecstack doesn't seem to allow using a custom layer when initializing a Keras model.

For example, if we define an external layer class (inheriting from the Layer superclass) and pass it to the model initializer, this exception is raised:
"ValueError: Unknown layer: custom_layer".

Is there a way to do that?

Thanks!

Edit: SOLVED by using KerasClassifier (the sklearn API) as a wrapper.

Missing values

Hi, great function! What if I have data with missing values and I want to leave them as missing for an XGBClassifier in an ensemble, but I also want to include an sklearn classifier that requires missing values to be filled, e.g. Random Forest? Basically, my training data would be different for different models in the ensemble.

Another related example is the encoding of categorical variables. For LGBMClassifier I may want to label encode, versus one-hot encode for XGBoost, so the training set would have different dimensions for each classifier in this example.
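One way to get per-model preprocessing, sketched with scikit-learn only: wrap each model that needs it in its own pipeline, so every model receives the same raw matrix and transforms it privately. This assumes the stacking routine only relies on `fit` and `predict`; the data and the name `rf_with_imputer` are made up for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline

rng = np.random.RandomState(0)
X = rng.rand(60, 4)
y = (X[:, 0] > 0.5).astype(int)       # labels computed before values go missing
X[rng.rand(60, 4) < 0.1] = np.nan     # knock out roughly 10% of the entries

# The imputation lives inside the model, so the raw X keeps its NaNs
# for any other model that can handle them natively.
rf_with_imputer = make_pipeline(
    SimpleImputer(strategy="median"),
    RandomForestClassifier(n_estimators=20, random_state=0),
)
rf_with_imputer.fit(X, y)
pred = rf_with_imputer.predict(X)
```

The same pattern would cover encoders: a one-hot encoder inside one pipeline and a label-style encoder inside another.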

Using the functional API for training only

There doesn't seem to be a way to use the functional API just for training a model - since X_test= is a required argument. However, if I've already tested my 2nd level model, I think I should be able to train a model on the full data set.

To be clear, I would like to be able to just do the following:

from vecstack import stacking

# Get your data

# Initialize 1st level estimators
models = [LinearRegression(),
          Ridge(random_state=0)]

# Get your stacked features in a single line
S_train = stacking(models, X_train, y_train, regression=True, verbose=2)

# Use 2nd level estimator with stacked features

Am I missing something?
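For reference, train-only stacked features can be sketched without any test set using plain scikit-learn. This is not vecstack's API; it only illustrates the out-of-fold idea with `cross_val_predict`, on synthetic data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import KFold, cross_val_predict

rng = np.random.RandomState(0)
X_train = rng.rand(100, 3)
y_train = X_train @ np.array([1.0, -2.0, 0.5]) + 0.01 * rng.randn(100)

models = [LinearRegression(), Ridge(random_state=0)]
cv = KFold(n_splits=4, shuffle=True, random_state=0)

# One column of out-of-fold predictions per 1st level model
S_train = np.column_stack(
    [cross_val_predict(m, X_train, y_train, cv=cv) for m in models]
)

# 2nd level estimator trained on the stacked features only
meta = LinearRegression().fit(S_train, y_train)
```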

Ability to use different features in each model.

I have a model whose most predictive features are the most noisy. To compensate, I train 1 model on those features, and a separate model on all the other features. By combining these models, I can quickly and easily prevent strange outlier predictions.

Simple stacking / voting is okay, but I imagine the model would generalize better were I to implement vecstack instead.
Is there any feasible way we could add different X (column-wise) per model to vecstack? I.e. multiple X that are the same length, but have different widths.

Thank you for your time!
-Nathan
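A sketch of one way to give each model its own columns while still passing a single X: put a column selector in front of each estimator. It uses scikit-learn's `ColumnTransformer`; the helper `on_columns`, the column indices, and the data are hypothetical.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline


def on_columns(columns, estimator):
    """Return a pipeline that feeds only `columns` of X to `estimator`."""
    # Columns not listed are dropped (ColumnTransformer's default remainder)
    selector = ColumnTransformer([("keep", "passthrough", columns)])
    return make_pipeline(selector, estimator)


rng = np.random.RandomState(0)
X = rng.rand(80, 5)
y = X[:, 0] + 2 * X[:, 3]

model_noisy = on_columns([0, 1], Ridge())    # the noisy but predictive features
model_rest = on_columns([2, 3, 4], Ridge())  # everything else

model_noisy.fit(X, y)
model_rest.fit(X, y)
```

Both pipelines accept the full-width X, so they can sit side by side in one list of models.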

Python 2.7 DeprecationWarning

import vecstack
# DeprecationWarning: The module is deprecated in version 0.21 and will be removed in version 0.23 since we've dropped support for Python 2.7. Please rely on the official version of six (https://pypi.org/project/six/).
#  "(https://pypi.org/project/six/).", DeprecationWarning)

from sklearn.externals import six

metric=auc

Great job!
Can I use metric=auc when I want to use a classifier?

Automatic saving (`save_dir`) doesn't work on Kaggle

Hi,
Unfortunately vecstack is unusable on Kaggle, because the output file names contain invalid characters, as reported in the kernel's log file (so those files can't be saved as kernel output).
Is there some trick to work around this?

Thanks

IndexError: tuple index out of range

Hi there

Nice package. Just a quick one: at line 409 of core.py there is:

X_train = np.array(X_train)
y_train = np.array(y_train).ravel()
X_test = np.array(X_test)

I am not sure that is necessary. For example, with sparse matrices this returns an array with the sparse matrix "inside" (i.e. no shape) rather than an array THAT IS the sparse matrix. This throws an error at line 502:

 IndexError: tuple index out of range

as

 In [31]: X_train.shape
 Out[31]: ()

At the moment I am running it with those lines simply commented out, with no problem.

So maybe consider commenting out those lines or adding an if statement for sparse matrices?

Thanks
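The "if statement for sparse matrices" suggested above could look roughly like this. It is a sketch, not the package's code, and the helper name `as_array_unless_sparse` is made up.

```python
import numpy as np
from scipy import sparse


def as_array_unless_sparse(X):
    """Convert to ndarray, but pass SciPy sparse matrices through untouched."""
    if sparse.issparse(X):
        return X
    return np.asarray(X)


X_sparse = sparse.csr_matrix(np.eye(3))
X_list = [[1, 2], [3, 4]]

kept = as_array_unless_sparse(X_sparse)      # still sparse, shape preserved
converted = as_array_unless_sparse(X_list)   # plain list becomes an ndarray
```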

Nested cross val for Hyperparameter tuning

Vecstack, as many people have pointed out, fills a nice niche: it solves a complex problem in a single line of code.

I am aware that stacking is very similar to cross-validation, as it works with k-fold and OOF predictions. I wonder how it works with hyperparameter tuning, e.g. GridSearchCV or RandomizedSearchCV?

Would you suggest tuning the models before using them as the level 1 models for stacking, or can this be done together?
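Both orders seem workable. Tuning first is simpler; doing it together amounts to nested cross-validation, where the search object itself acts as the level 1 model and is re-run inside every outer fold. A minimal scikit-learn sketch of the nested variant, with a synthetic dataset and an arbitrary grid:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold, cross_val_predict

rng = np.random.RandomState(0)
X = rng.rand(120, 4)
y = X @ np.array([1.0, 0.0, -1.0, 2.0]) + 0.05 * rng.randn(120)

# Inner loop: the search picks alpha using only the data it is fitted on
tuned_ridge = GridSearchCV(Ridge(), {"alpha": [0.01, 0.1, 1.0]}, cv=3)

# Outer loop: out-of-fold predictions, with the search repeated per fold
outer_cv = KFold(n_splits=4, shuffle=True, random_state=0)
oof = cross_val_predict(tuned_ridge, X, y, cv=outer_cv)
```

The nested variant costs more compute, since the full search runs once per outer fold.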

How to use SHAP with a vecstack model

Thanks for your awesome repo.

I used it to build my model as follows:

S_train, S_test = stacking(models,
                           x_res, y_res, X_test,
                           regression=False,
                           mode='oof_pred_bag',
                           needs_proba=False,
                           save_dir=None,
                           metric=metrics.accuracy_score,
                           n_folds=10,
                           stratified=True,
                           shuffle=True,
                           verbose=2)

I need to interpret my stacking model using SHAP, which recommends this way for models with folds. My issue is that I can't access the folds in the stacking model; I'm thinking of using this way.

My question: is there a way to use SHAP with a stacking model?

How to predict

I have created my models and am happy with the results. How can I save the models and use them to predict on real-life data?

Question about usage...

I am trying to predict housing prices, where I have a train data set and a test data set. The train data has a label, and I need to train on it so that I can later use the trained model to predict the label for the test data, which does not have a label. So, I followed your process on my train data set, performed the stacking, and applied the second level to the S_train and S_test variables as indicated in your instructions.
Now that I have done that, how do I proceed to predict the label on the test (unknown) dataset?
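If S_test was built from the unlabeled test set, then the second-level model's `predict` on S_test should already be the final prediction. The whole flow, sketched with plain scikit-learn on synthetic data (not vecstack's internals; `X_new` stands in for the unlabeled set):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import KFold, cross_val_predict

rng = np.random.RandomState(0)
X_train = rng.rand(100, 3)
y_train = X_train.sum(axis=1)
X_new = rng.rand(10, 3)          # unlabeled data to predict

base_models = [Ridge(), RandomForestRegressor(n_estimators=20, random_state=0)]
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# Level 1 features for train: out-of-fold predictions (fitted on clones)
S_train = np.column_stack(
    [cross_val_predict(m, X_train, y_train, cv=cv) for m in base_models]
)

# Level 1 features for new data: refit each model on all of train, then predict
S_new = np.column_stack(
    [m.fit(X_train, y_train).predict(X_new) for m in base_models]
)

# Level 2: train on stacked train features, predict on stacked new features
meta = LinearRegression().fit(S_train, y_train)
y_new_pred = meta.predict(S_new)
```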

Support for custom Cross Validation strategies

The package looks amazing, but from what I saw, one can not pass a cross-validation sklearn object, only the number of folds, and enable/disable shuffling and stratification. This is an issue when trying to work with time series data, and using TimeSeriesSplit from sklearn. Would you consider adding maybe another toggle, like time_series={True, False} or even changing the API a bit, and instead of passing the number of folds and shuffle and stratified to have only one argument, like cv and pass a separate object from sklearn in there?
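To illustrate what a `cv`-style argument would enable, here is a small out-of-fold loop that accepts any scikit-learn splitter, including TimeSeriesSplit or a group-aware one. It is a sketch of the idea and not a proposed patch; the function name `oof_predictions` is hypothetical.

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import Ridge
from sklearn.model_selection import TimeSeriesSplit


def oof_predictions(model, X, y, splitter, groups=None):
    """Out-of-fold predictions for any splitter with a split() method."""
    # Rows that never land in a test fold stay NaN
    oof = np.full(len(y), np.nan)
    for train_idx, test_idx in splitter.split(X, y, groups):
        fold_model = clone(model).fit(X[train_idx], y[train_idx])
        oof[test_idx] = fold_model.predict(X[test_idx])
    return oof


rng = np.random.RandomState(0)
X = rng.rand(60, 2)
y = X[:, 0] - X[:, 1]

oof = oof_predictions(Ridge(), X, y, TimeSeriesSplit(n_splits=5))
```

With TimeSeriesSplit the earliest block of rows is only ever used for training, so a second-level model would have to be fitted on the rows that did receive a prediction.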

sklearn.cross_validation is deprecated

sklearn.cross_validation has been replaced by sklearn.model_selection.

Maybe we should update the StratifiedKFold and KFold usage, since their parameters changed, to avoid any subtle bugs.
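For context, the main difference is where the data goes: the old `sklearn.cross_validation.StratifiedKFold` took the labels and `n_folds` in the constructor, while the `sklearn.model_selection` version takes `n_splits` and receives the data in `split()`. A short example of the newer style, on toy data:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.zeros((12, 2))
y = np.array([0, 1] * 6)   # 6 samples per class

# The constructor holds only the configuration; the data goes to split()
skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
folds = list(skf.split(X, y))   # list of (train_index, test_index) pairs
```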

Pipeline model is too large

I trained a stacking model with AdaBoost, XGBoost, and GBDT as the first layer and a Keras model as the second layer, but the size of the pipeline model is 45 GB. When I load the pipeline model, it often raises MemoryError, whether my computer has 16 GB or 64 GB of RAM. Is there some way to solve this problem?

Would it be possible to use Vecstack with a Neural Network?

Hi,

I used Vecstack to perform a regression with 12 regressors and got a pretty good prediction after exhaustively tuning each of the 12 estimators. However, I reached a point where adding a 13th estimator starts to degrade the score (it might be overfitting at this point).

I was able to run a Keras neural network on the same data, but it is not performing very well and my predictions are not very accurate.

So I was wondering whether I could now add a Keras neural network into the mix to see if I can increase the accuracy of the predictions for a housing price dataset from Kaggle. If that is possible, how would I go about it?

Catboost classifier stacking

The issue arises from using Catboost classifier (https://catboost.ai/docs/concepts/python-reference_catboostclassifier.html) stacking. I believe the output of the classifier is not compatible with vecstack. If the classifier is used with the stacking of models and roc_auc_score (or roc_curve) as the metric the following error is generated:

ValueError: y should be a 1d array, got an array of shape (7000,2) instead.

Code generating the output:

models = [
#   ('model_LR', LogisticRegression(C=1e4, multi_class='ovr', penalty='l2',solver='lbfgs', max_iter=1000,random_state=42)),
  ('model_CatB', CatBoostClassifier(silent=True)),
  ('model_xgb', xgboost.XGBClassifier(n_estimators=500)),
  ('model_RF', RandomForestClassifier(n_estimators=500)),
  ('model_lgbm', LGBMClassifier())
#   ('model_SVM', svm.SVC()),  
  
]

model = [x[1] for x in models]

S_train, S_test = stacking(model, X_train, Y_train, X_test,
                           regression=False,
                           mode = 'oof_pred_bag',
                           needs_proba=True,
                           save_dir = None,
                           metric=roc_curve,
                           n_folds = 5,
                           stratified=True,
                           shuffle=True,
                           random_state=2021,
                           verbose=2
                          )

Full error output:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-16-2ce2cf1f93c3> in <module>
     21                            shuffle=True,
     22                            random_state=2021,
---> 23                            verbose=2
     24                           )

C:\ProgramData\Anaconda\lib\site-packages\vecstack\core.py in stacking(models, X_train, y_train, X_test, sample_weight, regression, transform_target, transform_pred, mode, needs_proba, save_dir, metric, n_folds, stratified, shuffle, random_state, verbose)
    595                 if mode in ['oof', 'oof_pred', 'B', 'oof_pred_bag', 'A']:
    596                     if save_dir is not None or verbose > 0:
--> 597                         score = metric(y_te, S_train[te_index, col_slice_model])
    598                         scores = np.append(scores, score)
    599                         fold_str = '    fold %2d:  [%.8f]' % (fold_counter, score)

C:\ProgramData\Anaconda\lib\site-packages\sklearn\utils\validation.py in inner_f(*args, **kwargs)
     70                           FutureWarning)
     71         kwargs.update({k: arg for k, arg in zip(sig.parameters, args)})
---> 72         return f(**kwargs)
     73     return inner_f
     74 

C:\ProgramData\Anaconda\lib\site-packages\sklearn\metrics\_ranking.py in roc_curve(y_true, y_score, pos_label, sample_weight, drop_intermediate)
    774     """
    775     fps, tps, thresholds = _binary_clf_curve(
--> 776         y_true, y_score, pos_label=pos_label, sample_weight=sample_weight)
    777 
    778     # Attempt to drop thresholds corresponding to points in between and

C:\ProgramData\Anaconda\lib\site-packages\sklearn\metrics\_ranking.py in _binary_clf_curve(y_true, y_score, pos_label, sample_weight)
    541     check_consistent_length(y_true, y_score, sample_weight)
    542     y_true = column_or_1d(y_true)
--> 543     y_score = column_or_1d(y_score)
    544     assert_all_finite(y_true)
    545     assert_all_finite(y_score)

C:\ProgramData\Anaconda\lib\site-packages\sklearn\utils\validation.py in inner_f(*args, **kwargs)
     70                           FutureWarning)
     71         kwargs.update({k: arg for k, arg in zip(sig.parameters, args)})
---> 72         return f(**kwargs)
     73     return inner_f
     74 

C:\ProgramData\Anaconda\lib\site-packages\sklearn\utils\validation.py in column_or_1d(y, warn)
    845     raise ValueError(
    846         "y should be a 1d array, "
--> 847         "got an array of shape {} instead.".format(shape))
    848 
    849 

ValueError: y should be a 1d array, got an array of shape (6029, 2) instead.
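The traceback shows the metric receiving a two-column slice, which suggests that with needs_proba=True it gets one probability column per class. Note also that `roc_curve` returns three arrays rather than a single score. A possible workaround is a small wrapper that keeps only the positive-class column and returns a scalar AUC; the function name `auc_positive_class` is made up for this sketch.

```python
import numpy as np
from sklearn.metrics import roc_auc_score


def auc_positive_class(y_true, y_proba):
    """ROC AUC that also accepts an (n, 2) array of class probabilities."""
    y_proba = np.asarray(y_proba)
    if y_proba.ndim == 2 and y_proba.shape[1] == 2:
        y_proba = y_proba[:, 1]   # keep the positive-class column
    return float(roc_auc_score(y_true, y_proba))


y_true = np.array([0, 0, 1, 1])
proba = np.array([[0.9, 0.1],
                  [0.6, 0.4],
                  [0.65, 0.35],
                  [0.2, 0.8]])
score = auc_positive_class(y_true, proba)
```

Such a function could then be passed as `metric=auc_positive_class` in place of `roc_curve`.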

Multiclass Error (string labels)

I have the following error: "could not convert string to float: 'H'". In my problem I have 3 classes: 'H', 'D', 'A'. I can't find out whether it is possible to build the stacking model with them.
Can you help me with this issue?
This is the code I was running:

RANDOM_model = RandomForestClassifier(class_weight=None,criterion='entropy',max_depth=17, max_features='auto',
max_leaf_nodes=32,min_samples_leaf= 1,min_samples_split= 2,n_estimators=54,random_state= 1)
LR_model = linear_model.LogisticRegression(C= 10000000000.0,class_weight= None,fit_intercept= True,multi_class='auto',
penalty='l2',solver='liblinear')

models = [RANDOM_model,LR_model]
Strain, Stest = stacking(models,Xtrain, ytrain, Xtest, regression=False, mode='oof_pred_bag',
needs_proba=False,save_dir=None, metric=accuracy_score,n_folds=5,
random_state=1,stratified=False,shuffle=True,verbose=2)

model_stack = XGBClassifier(colsample_bytree= 1,gamma= 0.5,learning_rate= 1,max_depth= 2,min_child_weight=1,n_estimators= 11,random_state= 1)

model_stack = model_stack.fit(Strain, ytrain)
y_pred_stack = model_stack.predict(Stest)
print('Final prediction score: [%.8f]' % accuracy_score(ytest, y_pred_stack))
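A likely cause is that predicted class labels end up in a numeric array, which cannot hold strings such as 'H'. Assuming that is the case, a common workaround is to encode the labels as integers before stacking and decode the final predictions afterwards, for example with scikit-learn's `LabelEncoder`:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

ytrain = np.array(["H", "D", "A", "H", "A"])

encoder = LabelEncoder()
ytrain_encoded = encoder.fit_transform(ytrain)   # integer codes, one per class

# ... run stacking and the 2nd level model on ytrain_encoded ...

# Map integer predictions back to the original string labels
decoded = encoder.inverse_transform(ytrain_encoded)
```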

Why is 'acc' getting worse after stacking?

task: [classification]
n_classes: [3]
metric: [accuracy_score]
mode: [oof_pred_bag]
n_models: [4]

model 0: [ExtraTreesClassifier]
fold 0: [0.84220100]
fold 1: [0.83712466]
fold 2: [0.85259327]
fold 3: [0.83750569]
fold 4: [0.83925319]
----
MEAN: [0.84173556] + [0.00571701]
FULL: [0.84173644]

model 1: [RandomForestClassifier]
fold 0: [0.87585266]
fold 1: [0.87170155]
fold 2: [0.88307552]
fold 3: [0.87437415]
fold 4: [0.86930783]
----
MEAN: [0.87486234] + [0.00468014]
FULL: [0.87486349]

model 2: [XGBClassifier]
fold 0: [0.90995907]
fold 1: [0.91446770]
fold 2: [0.91128298]
fold 3: [0.91169777]
fold 4: [0.90983607]
----
MEAN: [0.91144872] + [0.00167472]
FULL: [0.91144885]

model 3: [GradientBoostingClassifier]
fold 0: [0.90950432]
fold 1: [0.91037307]
fold 2: [0.91765241]
fold 3: [0.90350478]
fold 4: [0.90391621]
----
MEAN: [0.90899016] + [0.00515841]
FULL: [0.90899163]

Final prediction score: [0.90554191]

Allow user to pass custom folds (GroupKFold)

As far as I can tell, in the current implementation, you can only pass in the number of folds. What if the user wants to pass in a custom folds object (e.g. sklearn.model_selection.GroupKFold)?

If this is of interest, I can submit a pull request.

pipeline refit/partial_fit

Is there a way of doing a partial_fit or refit in the sklearn pipeline api for incremental learning?

Best regards

How to combine early stopping?

Thanks for your contribution. I was looking for a good API for stacking and then found your package.

I am wondering whether it is possible to combine early_stopping in LightGBM or EarlyStopping in Keras with vecstack (I don't know how to do it)?

Keras as a L1 model?

Hi, thanks again.
I am wondering whether Keras can be used as an L1 model.
The only annoying thing is that the Keras fit method has an epochs argument, which is not standard compared with other sklearn models.

How would you implement this?
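One way to hide non-standard fit arguments such as `epochs` is a thin wrapper that stores them and passes them along on every `fit`. The sketch below is an assumption-laden illustration: `FixedFitParams` is a hypothetical class, and `TinyIterativeRegressor` is a stand-in for a Keras model so that the example runs without Keras.

```python
import numpy as np
from sklearn.base import BaseEstimator, clone


class TinyIterativeRegressor(BaseEstimator):
    """Stand-in for a Keras model: its fit() takes an `epochs` argument."""

    def fit(self, X, y, epochs=1):
        self.epochs_run_ = epochs
        self.mean_ = float(np.mean(y))
        return self

    def predict(self, X):
        return np.full(len(X), self.mean_)


class FixedFitParams(BaseEstimator):
    """Wrap an estimator so fit(X, y) always forwards the stored fit_params."""

    def __init__(self, estimator, fit_params=None):
        self.estimator = estimator
        self.fit_params = fit_params

    def fit(self, X, y):
        # Fit a fresh copy so repeated fits (one per fold) do not share state
        self.estimator_ = clone(self.estimator)
        self.estimator_.fit(X, y, **(self.fit_params or {}))
        return self

    def predict(self, X):
        return self.estimator_.predict(X)


X = np.arange(10, dtype=float).reshape(5, 2)
y = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

wrapped = FixedFitParams(TinyIterativeRegressor(), fit_params={"epochs": 5})
wrapped.fit(X, y)
pred = wrapped.predict(X)
```

For a real Keras model, the wrapped object would need to be something `clone` understands, such as the KerasClassifier wrapper mentioned in another issue on this page.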

high variability of StackingTransformer on training data

Hi, I was wondering if you could help. I am using a blended model based on StackingTransformer with 4 base models. For some reason the features I created for the model in production are slightly different, on the order of 1e-7. This causes the prediction results to be very different. I've set random_state for the data splitting, the base models, and the StackingTransformer. Do you have any suggestion on why this high variability is happening and how to reduce it?
Thanks in any case!
Keren

Stack already fitted models?

Hi, is it possible to stack already fitted models? I can't find a reference in the documentation. Essentially, I have multiple fitted models (based on different X transformations, with ML and DL). Thanks in advance.
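With models that are already fitted, out-of-fold predictions are no longer available, but a holdout-based blend is still possible: predict on data the models have not seen and train the second-level model on those predictions. A minimal scikit-learn sketch with synthetic data, where the two fitted models stand in for the pre-trained ones:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = rng.rand(150, 3)
y = X[:, 0] + X[:, 1] ** 2

X_fit, y_fit = X[:100], y[:100]      # data the base models were trained on
X_hold, y_hold = X[100:], y[100:]    # holdout the base models never saw

fitted_models = [
    Ridge().fit(X_fit, y_fit),
    DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_fit, y_fit),
]

# Level 2 is trained on holdout predictions only, to avoid leakage
S_hold = np.column_stack([m.predict(X_hold) for m in fitted_models])
meta = LinearRegression().fit(S_hold, y_hold)


def predict_stack(X_new):
    """Predict with the fitted base models, then combine with the meta model."""
    S_new = np.column_stack([m.predict(X_new) for m in fitted_models])
    return meta.predict(S_new)
```

If each model expects its own transformation of X, the per-model `predict` call is the place to apply it.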

Memory error in footprint for sparse matrix

X is
<239761x68891 sparse matrix of type '<class 'numpy.float64'>' with 8726453 stored elements in Compressed Sparse Row format>

Specifically, choice function is crashing, because n==16517375051
Error is:

~/.local/lib/python3.5/site-packages/vecstack/coresk.py in _get_footprint(self, X, n_items)
    863             # np.random.seed(0) # for development
--> 864             ids = np.random.choice(n, min(n_items, n), replace=False)
    865 

mtrand.pyx in mtrand.RandomState.choice()

mtrand.pyx in mtrand.RandomState.permutation()

MemoryError: 

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
<ipython-input-24-4be4f86278e7> in <module>()
      8                             verbose=2)             
      9 t = targets[0]
---> 10 stack = stack.fit(X, y[t])

~/.local/lib/python3.5/site-packages/vecstack/coresk.py in fit(self, X, y, sample_weight)
    393             self.n_classes_ = None
    394         self.n_estimators_ = len(self.estimators_)
--> 395         self.train_footprint_ = self._get_footprint(X)
    396 
    397         # ---------------------------------------------------------------------

~/.local/lib/python3.5/site-packages/vecstack/coresk.py in _get_footprint(self, X, n_items)
    872 
    873         except Exception:
--> 874             raise ValueError('Internal error. '
    875                              'Please save traceback and inform developers.')
    876 

ValueError: Internal error. Please save traceback and inform developers.
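The traceback passes through `permutation()`, which suggests that `np.random.choice(n, k, replace=False)` materializes all n indices before picking k of them. As an illustration of sampling without that cost, and not a patch for coresk.py, Python's standard library can draw k distinct indices from a lazy `range`; the helper name `sample_ids` is hypothetical.

```python
import random


def sample_ids(n, n_items, seed=0):
    """Draw min(n_items, n) distinct integers from [0, n) without building all n."""
    rng = random.Random(seed)
    # range(n) is lazy, so only the sampled indices are ever stored
    return rng.sample(range(n), min(n_items, n))


n = 16517375051          # the value of n reported in the issue
ids = sample_ids(n, 1000)
```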

Vecstack and RMSE

Hi,
I've been using your library vecstack for quite some time now and it has become my favorite tool for stacking models. Right now, though, I am facing an error when implementing a custom-defined metric. My code example is:

models = [
    ExtraTreesRegressor(random_state=0, n_jobs=-1,
                        n_estimators=100, max_depth=3),

    RidgeCV(),

    XGBRegressor(random_state=0, n_jobs=-1, learning_rate=0.1,
                 n_estimators=100, max_depth=3)
]

and this is my code for RMSE:

def rmse(y_true, y_pred):
    return K.sqrt(K.mean(K.square(y_pred - y_true), axis=-1))

but when I run the stacking I get the traceback below.
My code works with the default metric (MAE) but shows errors with this one.

TypeError Traceback (most recent call last)
in ()
11 shuffle=True, # shuffle the data
12 random_state=0, # ensure reproducibility
---> 13 verbose=2) # print all info

/opt/conda/lib/python3.6/site-packages/vecstack/core.py in stacking(models, X_train, y_train, X_test, sample_weight, regression, transform_target, transform_pred, mode, needs_proba, save_dir, metric, n_folds, stratified, shuffle, random_state, verbose)
595 score = metric(y_te, S_train[te_index, col_slice_model])
596 scores = np.append(scores, score)
--> 597 fold_str = ' fold %2d: [%.8f]' % (fold_counter, score)
598 if save_dir is not None:
599 models_folds_str += fold_str + '\n'

TypeError: must be real number, not Tensor

Code for fitting the model is:

S_train, S_test = stacking(models,                          # list of models
                           train_final, target, test_final, # data
                           regression=True,                 # regression task (if you need
                                                            # classification - set to False)
                           mode='oof_pred_bag',             # mode: oof for train set, predict test
                                                            # set in each fold and find mean
                           save_dir=None,                   # do not save result and log (to save
                                                            # in current dir - set to '.')
                           metric=rmse,                     # metric: callable
                           n_folds=4,                       # number of folds
                           shuffle=True,                    # shuffle the data
                           random_state=0,                  # ensure reproducibility
                           verbose=2)                       # print all info
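The TypeError says the score is a Tensor, which is what the Keras backend functions return, while the log line formats it with `%.8f` and therefore needs a plain number. A NumPy version of the metric that returns a Python float might look like this:

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error as a plain Python float."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_pred - y_true) ** 2)))


score = rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])

# The same kind of formatting that failed with a Tensor works with a float
line = "fold %2d:  [%.8f]" % (0, score)
```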

How to save model

I trained a stacking regression model that has two levels. How can I save it to predict on new data, the way I can save a RandomForest model with joblib? Can I save the 1st level and 2nd level models separately?
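Saving both levels together is one option: keep the fitted 1st level models and the 2nd level model in one object and serialize it. The sketch below uses the standard library's `pickle` with scikit-learn models on synthetic data; `joblib.dump` and `joblib.load` can be used in the same way. The dictionary layout and the file name are arbitrary choices.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.RandomState(0)
X = rng.rand(50, 3)
y = X.sum(axis=1)

# Fitted 1st level models and a 2nd level model trained on their predictions
base_models = [Ridge(alpha=0.1).fit(X, y), Ridge(alpha=10.0).fit(X, y)]
S = np.column_stack([m.predict(X) for m in base_models])
meta_model = LinearRegression().fit(S, y)

# Save both levels in a single file
bundle = {"base_models": base_models, "meta_model": meta_model}
path = os.path.join(tempfile.mkdtemp(), "stack.pkl")
with open(path, "wb") as f:
    pickle.dump(bundle, f)

# Later: load and predict on new data
with open(path, "rb") as f:
    restored = pickle.load(f)

X_new = rng.rand(5, 3)
S_new = np.column_stack([m.predict(X_new) for m in restored["base_models"]])
y_new = restored["meta_model"].predict(S_new)
```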

Error in `python': free(): invalid next size (normal)

Using any model except GaussianNB causes an error in stacking():
task: [classification]
n_classes: [2]
metric: [log_loss]
mode: [oof_pred_bag]
n_models: [1]
model 0: [LogisticRegression]
/opt/conda/lib/python3.6/site-packages/sklearn/linear_model/base.py:297: RuntimeWarning: overflow encountered in exp
np.exp(prob, prob)
----
MEAN: [0.56676799] + [0.01295934]
FULL: [0.56677227]

*** Error in `python': free(): invalid next size (normal): 0x0000564aaa718ea0 ***
How can I debug this to find the cause of the error?
