vecxoz / vecstack

Python package for stacking (machine learning technique)

License: Other

Python 100.00%

bagging blending ensemble ensemble-learning ensembling explain-stacking machine-learning stacked-generalization stacking stacking-tutorial

vecstack's Issues

N-dimensional input (stacking for convolutional nets)

Hi,
I tried to do stacking using KerasClassifier with a CNN, but I get this error:
ValueError: Found array with dim 4. Estimator expected <= 2.

This is my code for the CNN:

def model1():
    model = Sequential()
    model.add(Conv2D(16, (3, 3), activation='relu', padding="same", input_shape=(train_files.shape[1:])))
    model.add(Activation('relu'))
    model.add(MaxPooling2D(pool_size=(2, 2)))

    model.add(Conv2D(32, (3, 3)))
    model.add(Activation('relu'))
    model.add(MaxPooling2D(pool_size=(2, 2)))

    model.add(Conv2D(64, (3, 3)))
    model.add(Activation('relu'))
    model.add(MaxPooling2D(pool_size=(2, 2)))
    model.add(Flatten())
    model.add(Dense(64))
    model.add(Activation('relu'))
    model.add(Dropout(0.5))
    model.add(Dense(1))
    model.add(Activation('sigmoid'))
    model.compile(loss='binary_crossentropy',
                  optimizer='rmsprop',
                  metrics=['accuracy'])
    # Returning the model is required if model1 is used as a build function.
    return model
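
The error message indicates that the input is being validated as at most 2-D. One possible workaround, sketched here under that assumption, is to flatten each image into a row before the stacking call and restore the original shape inside a thin wrapper around the CNN. `ReshapeWrapper`, `flatten_images` and `image_shape` are hypothetical names, not part of vecstack or Keras.

```python
import numpy as np
from sklearn.base import BaseEstimator


class ReshapeWrapper(BaseEstimator):
    """Hypothetical adapter: accepts 2-D rows, feeds N-d arrays to `estimator`."""

    def __init__(self, estimator, image_shape):
        self.estimator = estimator      # e.g. a KerasClassifier built from model1
        self.image_shape = image_shape  # e.g. (height, width, channels)

    def _restore(self, X):
        # Turn (n_samples, h*w*c) back into (n_samples, h, w, c).
        return np.asarray(X).reshape((-1,) + tuple(self.image_shape))

    def fit(self, X, y, **fit_params):
        self.estimator.fit(self._restore(X), y, **fit_params)
        return self

    def predict(self, X):
        return self.estimator.predict(self._restore(X))

    def predict_proba(self, X):
        return self.estimator.predict_proba(self._restore(X))


def flatten_images(X):
    """Flatten (n_samples, ...) to (n_samples, n_features) for the stacking call."""
    X = np.asarray(X)
    return X.reshape(X.shape[0], -1)
```

The flattened array would be passed as X_train and X_test, and the wrapped estimator listed among the first-level models.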

Can I use GPU with Vecstack to speed up the process?

Hi, thank you for your code. It is GREAT!!! Is it possible to use my GPU to stack / ensemble my models, to speed up processing and generate the predictions?

ImportError: cannot import name 'stacking', same issue with StackingTransformer

I can't import Vecstack. I pip3 installed it, uninstalled all the dependencies and installed vecstack again. Still, no luck.

A question about transform

When I use stacking and StackingTransformer for the 1st layer, the OOF values of one specific model (AdaBoost) differ between the two, e.g. (0.9, 0.805) and (1.3, 1.34), and I really do not know what causes this.

Using different data transformations and fit parameters for different models

Hi Igor,

Congratulations on your package. I've been searching for a stacking package and this nails it (both for simplicity and effectiveness). Thanks for your contribution.

Is there any possibility to stack already trained models with your package? There are 2 reasons for this:
- People might want to pass fit arguments to the models (currently not available, as the stacking function actually trains the models)
- We might want to use different data scaling and preprocessing techniques for different algorithms (label encoding for tree-based methods and one-hot encoding for linear ones)

For example, H2O stacking allows users to stack already trained models:
http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/stacked-ensembles.html

I would love to contribute to your package but unfortunately my technical level would be too dangerous for your code :P

issue with Keras custom layer

Hi @vecxoz,
vecstack doesn't seem to allow using a custom layer when initializing a Keras model.

For example, if we define an external layer class (inheriting from the Layer superclass) and pass it to the model initializer, this exception is raised:
"ValueError: Unknown layer: custom_layer".

Is there a way to do that?

Thanks!

Edit: SOLVED by using KerasClassifier (the sklearn API) as a wrapper.

Missing values

Hi, great function! What if I have data with missing values and I want to leave them as missing for an XGBClassifier in the ensemble, but I also want to include a scikit-learn classifier that requires missing values to be filled, e.g. Random Forest? Basically, my training data would be different for different models in the ensemble.

Another related example is the encoding of categorical variables. For LGBMClassifier I may want label encoding, versus one-hot encoding for XGBoost, so the training set would have different dimensions for each classifier in this example.
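
Since a scikit-learn Pipeline is itself an estimator, one option is to put the model-specific preprocessing inside each first-level model, so that every model still receives the same raw X. The following is a minimal sketch on toy data, assuming the stacking call accepts any object with fit and predict; the same pattern would apply to encoders instead of imputers.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline

# Toy data with missing values (illustrative only).
X = np.array([[1.0, np.nan], [2.0, 0.5], [np.nan, 1.5], [4.0, 2.0]])
y = np.array([0, 0, 1, 1])

# The imputer is fitted only on the rows this pipeline is trained on,
# so inside cross-validation each fold gets its own imputation statistics.
rf_with_imputer = make_pipeline(
    SimpleImputer(strategy="median"),
    RandomForestClassifier(n_estimators=10, random_state=0),
)

models = [
    rf_with_imputer,
    # XGBClassifier(),  # could be listed as-is and receive the NaNs directly
]

rf_with_imputer.fit(X, y)
pred = rf_with_imputer.predict(X)
```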

Using the functional API for training only

There doesn't seem to be a way to use the functional API just for training a model - since X_test= is a required argument. However, if I've already tested my 2nd level model, I think I should be able to train a model on the full data set.

To be clear, I would like to be able to just do the following:

from vecstack import stacking

# Get your data

# Initialize 1st level estimators
models = [LinearRegression(),
          Ridge(random_state=0)]

# Get your stacked features in a single line
S_train = stacking(models, X_train, y_train, regression=True, verbose=2)

# Use 2nd level estimator with stacked features

Am I missing something?
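
Independently of vecstack, out-of-fold features for the training set alone can be produced with scikit-learn's cross_val_predict, which needs no test set. This is a sketch of the general idea on synthetic data, not a description of how vecstack works internally.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import KFold, cross_val_predict

X_train, y_train = make_regression(n_samples=60, n_features=5, noise=0.1, random_state=0)

models = [LinearRegression(), Ridge(random_state=0)]
cv = KFold(n_splits=4, shuffle=True, random_state=0)

# One out-of-fold prediction column per first-level model.
S_train = np.column_stack(
    [cross_val_predict(m, X_train, y_train, cv=cv) for m in models]
)

# Second-level estimator trained on the stacked features.
meta = LinearRegression().fit(S_train, y_train)
```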

Ability to use different features in each model.

I have a model whose most predictive features are the most noisy. To compensate, I train 1 model on those features, and a separate model on all the other features. By combining these models, I can quickly and easily prevent strange outlier predictions.

Simple stacking / voting is okay, but I imagine the model would generalize better were I to implement vecstack instead.
Is there any feasible way we could add different X (column-wise) per model to vecstack? I.e. multiple X that are the same length, but have different widths.

Thank you for your time!
-Nathan
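
One way to approximate this without changing the stacking call is to pass the full X to every model and let each model select its own columns inside a Pipeline. A sketch with made-up column indices; `on_columns` is a hypothetical helper.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline


def on_columns(columns, estimator):
    """Hypothetical helper: an estimator that only sees the given column indices."""
    selector = ColumnTransformer([("keep", "passthrough", columns)], remainder="drop")
    return make_pipeline(selector, estimator)


rng = np.random.RandomState(0)
X = rng.rand(30, 5)
y = X[:, 0] + 2 * X[:, 3]

noisy_model = on_columns([0, 1], Ridge())       # the noisy but predictive features
stable_model = on_columns([2, 3, 4], Ridge())   # all remaining features
models = [noisy_model, stable_model]

for m in models:
    m.fit(X, y)
```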

Python 2.7 DeprecationWarning

import vecstack
# DeprecationWarning: The module is deprecated in version 0.21 and will be removed in version 0.23 since we've dropped support for Python 2.7. Please rely on the official version of six (https://pypi.org/project/six/).
#  "(https://pypi.org/project/six/).", DeprecationWarning)

vecstack/vecstack/coresk.py

Line 57 in 4909830

from sklearn.externals import six

metric=auc

Great job!
Can I use metric=auc when I want to use a classifier?

Automatic saving (`save_dir`) doesn't work on Kaggle

Hi,
Unfortunately vecstack is unusable on Kaggle because the output file names contain characters that are reported as invalid in the kernel's log file, so those files can't be saved as kernel output.
Is there a trick to work around this?

Thanks

What is the difference between vecstack and sklearn StackingClassifier?

The new versions of sklearn include StackingClassifier for implementing stacking ensembles. What is the difference between using vecstack and sklearn's StackingClassifier?

IndexError: tuple index out of range

Hi there

nice package. Just a quick one, in line 409 at core.py there is:

X_train = np.array(X_train)
y_train = np.array(y_train).ravel()
X_test = np.array(X_test)

I am not sure that is necessary. For example, with sparse matrices this returns an array with the sparse matrix "inside" (i.e. no shape) rather than an array THAT IS the sparse matrix. This throws an error at line 502:

 IndexError: tuple index out of range

 In [31]: X_train.shape
 Out[31]: ()

At the moment I am running it with those lines commented out, with no problem.

Maybe consider removing those lines or adding an if statement for sparse matrices?

Thanks
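
The guard suggested above could look roughly like this. It is a sketch, not the package's code, and `to_array_keep_sparse` is a hypothetical name.

```python
import numpy as np
from scipy import sparse


def to_array_keep_sparse(X):
    """Leave scipy sparse inputs untouched; convert everything else to ndarray."""
    if sparse.issparse(X):
        return X
    return np.asarray(X)


X_sparse = sparse.csr_matrix(np.eye(3))
X_dense = [[1, 2], [3, 4]]

kept = to_array_keep_sparse(X_sparse)        # still sparse, shape (3, 3)
converted = to_array_keep_sparse(X_dense)    # ndarray, shape (2, 2)
```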

Nested cross val for Hyperparameter tuning

Vecstack, as many people have pointed out, fits a nice niche: it solves a complex problem in a single line of code.

I am aware that stacking is very similar to cross-validation, as it works with k-fold and OOF predictions. I wonder how it works with hyperparameter tuning, e.g. GridSearchCV or RandomizedSearchCV?

Would you suggest tuning the models before using them as level 1 models for stacking, or can this be done together?
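
A common arrangement, offered here as one option and not as the package author's recommendation, is to tune each first-level model separately and then pass the tuned estimators on to stacking. The sketch below uses synthetic data and a tiny grid.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=80, n_features=6, random_state=0)

search = GridSearchCV(
    RandomForestClassifier(n_estimators=20, random_state=0),
    param_grid={"max_depth": [2, 4]},
    cv=3,
)
search.fit(X, y)

# The tuned estimator can then be listed among the first-level models.
tuned_rf = search.best_estimator_
models = [tuned_rf]
```

Listing the GridSearchCV object itself as a model would instead repeat the search inside every stacking fold, which is closer to nested cross-validation but considerably slower.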

How to use SHAP with a vecstack model

Thanks for your awesome repo.

I used it to build my model as follows:

S_train, S_test = stacking(models,
                           x_res, y_res, X_test,
                           regression=False,
                           mode='oof_pred_bag',
                           needs_proba=False,
                           save_dir=None,
                           metric=metrics.accuracy_score,
                           n_folds=10,
                           stratified=True,
                           shuffle=True,
                           verbose=2)

I need to interpret my stacking model using SHAP; they recommend this way for models with folds. My issue is that I can't access the folds in the stacking model. I'm thinking of using this way

My question: is there a way to use SHAP with a stacking model?

How to predict

I have created my models and am happy with the results. How can I save the models and use them to predict on real-life data?

Question about usage...

I am trying to predict housing prices, where I have a train data set and a test data set. The train data has a label, and I need to train on it so that I can later use the trained model to predict the label for the test data, which does not have a label. So, I followed your process on my train data set, performed the stacking, and applied the second level to the S_train and S_test variables as indicated in your instructions.
Now that I have done that, how do I proceed to predict the label on the test (unknown) dataset?
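
In general terms, predicting on unseen data means running the new rows through fitted first-level models, stacking their outputs column-wise in the same order as in S_train, and passing the result to the second-level model. The sketch below shows that flow with plain scikit-learn on synthetic data. It refits the first-level models on all training rows, which is one possible variant and an assumption of this example; `predict_new` is a hypothetical helper.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=100, n_features=4, noise=1.0, random_state=0)
X_train, y_train, X_new = X[:80], y[:80], X[80:]

first_level = [Ridge(), DecisionTreeRegressor(max_depth=3, random_state=0)]

# Out-of-fold predictions become the training features of the second level.
S_train = np.column_stack(
    [cross_val_predict(m, X_train, y_train, cv=4) for m in first_level]
)
second_level = LinearRegression().fit(S_train, y_train)

# Refit every first-level model on all training rows, then reuse it for new data.
fitted = [m.fit(X_train, y_train) for m in first_level]


def predict_new(X_unseen):
    """Build stacked features for unseen rows and apply the second-level model."""
    S_new = np.column_stack([m.predict(X_unseen) for m in fitted])
    return second_level.predict(S_new)


y_new = predict_new(X_new)
```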

Support for custom Cross Validation strategies

The package looks amazing, but from what I saw, one can not pass a cross-validation sklearn object, only the number of folds, and enable/disable shuffling and stratification. This is an issue when trying to work with time series data, and using TimeSeriesSplit from sklearn. Would you consider adding maybe another toggle, like time_series={True, False} or even changing the API a bit, and instead of passing the number of folds and shuffle and stratified to have only one argument, like cv and pass a separate object from sklearn in there?
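
To illustrate what a `cv` argument would have to support, here is a sketch of an out-of-fold loop that accepts any scikit-learn splitter. `oof_with_splitter` is a hypothetical helper and not part of vecstack. Note that TimeSeriesSplit never places the earliest block in a test fold, so those rows get no out-of-fold prediction.

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import Ridge
from sklearn.model_selection import TimeSeriesSplit


def oof_with_splitter(model, X, y, cv, groups=None):
    """Out-of-fold predictions for any scikit-learn splitter.

    Rows that never land in a test fold stay NaN.
    """
    oof = np.full(len(y), np.nan)
    for train_idx, test_idx in cv.split(X, y, groups):
        fold_model = clone(model).fit(X[train_idx], y[train_idx])
        oof[test_idx] = fold_model.predict(X[test_idx])
    return oof


rng = np.random.RandomState(0)
X = rng.rand(50, 3)
y = X @ np.array([1.0, -2.0, 0.5])

oof = oof_with_splitter(Ridge(), X, y, TimeSeriesSplit(n_splits=4))
covered = ~np.isnan(oof)  # rows usable as second-level training data
```

The `groups` argument is forwarded to the splitter, so the same loop would also work with a group-aware splitter such as GroupKFold.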

sklearn.cross_validation is deprecated

sklearn.cross_validation has been replaced by sklearn.model_selection.

Maybe we should update the use of StratifiedKFold and KFold, whose parameters have changed, to avoid any subtle bugs.

Pipeline model is too large

I trained a stacking model with AdaBoost, XGBoost, and GBDT as the first layer and a Keras model as the second layer, but the pipeline model is 45 GB in size. When I load the pipeline model, it often raises MemoryError, whether my computer has 16 GB or 64 GB of RAM. Is there a way to solve this problem?

Metric parameter failing for f1_score metric for multi class classification

Recently I faced this issue while using vecstack on a multiclass classification dataset.

It occurs because I am not able to specify that my metric should use the argument average='weighted'.

I had a look at the source code and I think I can fix this issue.

@vecxoz Can I go ahead and submit a pull request?
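
Until such a change lands, a small wrapper with the usual (y_true, y_pred) signature can fix the averaging mode in advance. A sketch: a plain function is used instead of functools.partial because the logs in other issues print the metric's name, which a partial object does not carry.

```python
from sklearn.metrics import f1_score


def weighted_f1(y_true, y_pred):
    """f1_score with average='weighted', callable as metric(y_true, y_pred)."""
    return f1_score(y_true, y_pred, average="weighted")


y_true = [0, 1, 2, 2, 1, 0]
y_pred = [0, 2, 2, 2, 1, 0]
score = weighted_f1(y_true, y_pred)
```

The wrapper would then be passed as `metric=weighted_f1`.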

Would it be possible to use Vecstack with a Neural Network?

Hi,

I used Vecstack to perform a regression with 12 regressors and got a pretty good prediction after exhaustively tuning each of the 12 estimators. However, I reached a point where adding a 13th estimator starts to degrade the score (it might be overfitting at this point).

I was able to run a Keras neural network on the same data, but it does not perform very well and my predictions are not very accurate.

So I was wondering whether I could now add a Keras neural network into the mix to see if I can increase the accuracy of the predictions for a housing price dataset from Kaggle. If that is possible, how would I go about it?

Catboost classifier stacking

The issue arises from using Catboost classifier (https://catboost.ai/docs/concepts/python-reference_catboostclassifier.html) stacking. I believe the output of the classifier is not compatible with vecstack. If the classifier is used with the stacking of models and roc_auc_score (or roc_curve) as the metric the following error is generated:

ValueError: y should be a 1d array, got an array of shape (7000,2) instead.

Code generating the output:

models = [
#   ('model_LR', LogisticRegression(C=1e4, multi_class='ovr', penalty='l2',solver='lbfgs', max_iter=1000,random_state=42)),
  ('model_CatB', CatBoostClassifier(silent=True)),
  ('model_xgb', xgboost.XGBClassifier(n_estimators=500)),
  ('model_RF', RandomForestClassifier(n_estimators=500)),
  ('model_lgbm', LGBMClassifier())
#   ('model_SVM', svm.SVC()),  
  
]

model = [x[1] for x in models]

S_train, S_test = stacking(model, X_train, Y_train, X_test,
                           regression=False,
                           mode = 'oof_pred_bag',
                           needs_proba=True,
                           save_dir = None,
                           metric=roc_curve,
                           n_folds = 5,
                           stratified=True,
                           shuffle=True,
                           random_state=2021,
                           verbose=2
                          )

Full error output:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-16-2ce2cf1f93c3> in <module>
     21                            shuffle=True,
     22                            random_state=2021,
---> 23                            verbose=2
     24                           )

C:\ProgramData\Anaconda\lib\site-packages\vecstack\core.py in stacking(models, X_train, y_train, X_test, sample_weight, regression, transform_target, transform_pred, mode, needs_proba, save_dir, metric, n_folds, stratified, shuffle, random_state, verbose)
    595                 if mode in ['oof', 'oof_pred', 'B', 'oof_pred_bag', 'A']:
    596                     if save_dir is not None or verbose > 0:
--> 597                         score = metric(y_te, S_train[te_index, col_slice_model])
    598                         scores = np.append(scores, score)
    599                         fold_str = '    fold %2d:  [%.8f]' % (fold_counter, score)

C:\ProgramData\Anaconda\lib\site-packages\sklearn\utils\validation.py in inner_f(*args, **kwargs)
     70                           FutureWarning)
     71         kwargs.update({k: arg for k, arg in zip(sig.parameters, args)})
---> 72         return f(**kwargs)
     73     return inner_f
     74 

C:\ProgramData\Anaconda\lib\site-packages\sklearn\metrics\_ranking.py in roc_curve(y_true, y_score, pos_label, sample_weight, drop_intermediate)
    774     """
    775     fps, tps, thresholds = _binary_clf_curve(
--> 776         y_true, y_score, pos_label=pos_label, sample_weight=sample_weight)
    777 
    778     # Attempt to drop thresholds corresponding to points in between and

C:\ProgramData\Anaconda\lib\site-packages\sklearn\metrics\_ranking.py in _binary_clf_curve(y_true, y_score, pos_label, sample_weight)
    541     check_consistent_length(y_true, y_score, sample_weight)
    542     y_true = column_or_1d(y_true)
--> 543     y_score = column_or_1d(y_score)
    544     assert_all_finite(y_true)
    545     assert_all_finite(y_score)

C:\ProgramData\Anaconda\lib\site-packages\sklearn\utils\validation.py in inner_f(*args, **kwargs)
     70                           FutureWarning)
     71         kwargs.update({k: arg for k, arg in zip(sig.parameters, args)})
---> 72         return f(**kwargs)
     73     return inner_f
     74 

C:\ProgramData\Anaconda\lib\site-packages\sklearn\utils\validation.py in column_or_1d(y, warn)
    845     raise ValueError(
    846         "y should be a 1d array, "
--> 847         "got an array of shape {} instead.".format(shape))
    848 
    849 

ValueError: y should be a 1d array, got an array of shape (6029, 2) instead.
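
Two things stand out in the traceback: roc_curve returns three arrays instead of a single score, and with needs_proba=True the metric receives a two-column probability array (shape (6029, 2)), which suggests the error may not be specific to CatBoost. A scalar metric that picks the positive-class column might look like the sketch below. `binary_auc` is a hypothetical name, and the column order is assumed to follow the predict_proba convention (column 1 holds the positive class).

```python
import numpy as np
from sklearn.metrics import roc_auc_score


def binary_auc(y_true, y_proba):
    """ROC AUC that accepts 1-D scores or an (n_samples, 2) probability array."""
    y_proba = np.asarray(y_proba)
    if y_proba.ndim == 2:
        y_proba = y_proba[:, 1]  # assumed: column 1 holds P(class 1)
    return roc_auc_score(y_true, y_proba)


y_true = np.array([0, 0, 1, 1])
proba = np.array([[0.9, 0.1], [0.6, 0.4], [0.35, 0.65], [0.2, 0.8]])
auc = binary_auc(y_true, proba)
```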

Multiclass Error (string labels)

I get the following error: "could not convert string to float: 'H'". My problem has 3 classes: 'H', 'D', 'A'. I can't find out whether it's possible to build the stacking model with them.
Can you help me with this issue?
This is the code I was running:

from sklearn import linear_model
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from vecstack import stacking
from xgboost import XGBClassifier

# First-level models
RANDOM_model = RandomForestClassifier(class_weight=None, criterion='entropy', max_depth=17, max_features='auto',
                                      max_leaf_nodes=32, min_samples_leaf=1, min_samples_split=2,
                                      n_estimators=54, random_state=1)
LR_model = linear_model.LogisticRegression(C=10000000000.0, class_weight=None, fit_intercept=True,
                                           multi_class='auto', penalty='l2', solver='liblinear')

models = [RANDOM_model, LR_model]
Strain, Stest = stacking(models, Xtrain, ytrain, Xtest, regression=False, mode='oof_pred_bag',
                         needs_proba=False, save_dir=None, metric=accuracy_score, n_folds=5,
                         random_state=1, stratified=False, shuffle=True, verbose=2)

# Second-level model
model_stack = XGBClassifier(colsample_bytree=1, gamma=0.5, learning_rate=1, max_depth=2,
                            min_child_weight=1, n_estimators=11, random_state=1)

model_stack = model_stack.fit(Strain, ytrain)
y_pred_stack = model_stack.predict(Stest)
print('Final prediction score: [%.8f]' % accuracy_score(ytest, y_pred_stack))
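
The message suggests that the string labels are being converted to floats somewhere in the call. A common workaround, sketched here on made-up labels, is to encode the classes as integers before stacking and to map the predictions back afterwards.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

ytrain_raw = np.array(["H", "D", "A", "H", "A", "D"])

encoder = LabelEncoder()
ytrain = encoder.fit_transform(ytrain_raw)  # classes are sorted: A -> 0, D -> 1, H -> 2

# ... run stacking and the second-level model on the integer labels ...
y_pred_int = np.array([2, 0, 1])  # placeholder standing in for model output

# Map integer predictions back to the original string labels.
y_pred = encoder.inverse_transform(y_pred_int)
```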

Why does 'acc' get worse after stacking?

task: [classification]
n_classes: [3]
metric: [accuracy_score]
mode: [oof_pred_bag]
n_models: [4]

model 0: [ExtraTreesClassifier]
fold 0: [0.84220100]
fold 1: [0.83712466]
fold 2: [0.85259327]
fold 3: [0.83750569]
fold 4: [0.83925319]
----
MEAN: [0.84173556] + [0.00571701]
FULL: [0.84173644]

model 1: [RandomForestClassifier]
fold 0: [0.87585266]
fold 1: [0.87170155]
fold 2: [0.88307552]
fold 3: [0.87437415]
fold 4: [0.86930783]
----
MEAN: [0.87486234] + [0.00468014]
FULL: [0.87486349]

model 2: [XGBClassifier]
fold 0: [0.90995907]
fold 1: [0.91446770]
fold 2: [0.91128298]
fold 3: [0.91169777]
fold 4: [0.90983607]
----
MEAN: [0.91144872] + [0.00167472]
FULL: [0.91144885]

model 3: [GradientBoostingClassifier]
fold 0: [0.90950432]
fold 1: [0.91037307]
fold 2: [0.91765241]
fold 3: [0.90350478]
fold 4: [0.90391621]
----
MEAN: [0.90899016] + [0.00515841]
FULL: [0.90899163]

Final prediction score: [0.90554191]

Allow user to pass custom folds (GroupKFold)

As far as I can tell, in the current implementation, you can only pass in the number of folds. What if the user wants to pass in a custom folds object (e.g. sklearn.model_selection.GroupKFold)?

If this is of interest, I can submit a pull request.

pipeline refit/partial_fit

Is there a way of doing a partial_fit or refit in the sklearn pipeline api for incremental learning?

Best regards

How does `vecstack.StackingTransformer` differ from `sklearn.ensemble.StackingClassifier`?

This might be useful to add to the readme

How to combine early stopping?

Thanks for your contribution. I was looking for a good API for stacking and found your package.

I am wondering whether it is possible to combine early_stopping in lightgbm or EarlyStopping in Keras with vecstack (I don't know how to do it)?

Keras as a L1 model?

Hi, thanks again.
I am wondering whether Keras can be used as an L1 model.
The only annoying thing is that the Keras fit method takes an epochs argument, which is not standard for sklearn models.

How would you implement this?

high variability of StackingTransformer on training data

Hi, I was wondering if you could help. I am using a blended model based on StackingTransformer with 4 base models. For some reason the features I create for the model in production are slightly different, on the order of 1e-7. This causes the prediction results to be very different. I've set random_state during data splitting, on the base models, and on the StackingTransformer. Do you have any suggestion as to why this high variability is happening and how to reduce it?
Thanks in any case!
Keren

Stack already fitted models?

Hi, is it possible to stack already fitted models? I can't find a reference in the documentation. Essentially, I have multiple fitted models (based on different X transformations, with ML and DL). Thanks in advance

Memory error in footprint for sparse matrix

X is
<239761x68891 sparse matrix of type '<class 'numpy.float64'>' with 8726453 stored elements in Compressed Sparse Row format>

Specifically, the choice function crashes because n == 16517375051.
The error is:

~/.local/lib/python3.5/site-packages/vecstack/coresk.py in _get_footprint(self, X, n_items)
    863             # np.random.seed(0) # for development
--> 864             ids = np.random.choice(n, min(n_items, n), replace=False)
    865 

mtrand.pyx in mtrand.RandomState.choice()

mtrand.pyx in mtrand.RandomState.permutation()

MemoryError: 

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
<ipython-input-24-4be4f86278e7> in <module>()
      8                             verbose=2)             
      9 t = targets[0]
---> 10 stack = stack.fit(X, y[t])

~/.local/lib/python3.5/site-packages/vecstack/coresk.py in fit(self, X, y, sample_weight)
    393             self.n_classes_ = None
    394         self.n_estimators_ = len(self.estimators_)
--> 395         self.train_footprint_ = self._get_footprint(X)
    396 
    397         # ---------------------------------------------------------------------

~/.local/lib/python3.5/site-packages/vecstack/coresk.py in _get_footprint(self, X, n_items)
    872 
    873         except Exception:
--> 874             raise ValueError('Internal error. '
    875                              'Please save traceback and inform developers.')
    876 

ValueError: Internal error. Please save traceback and inform developers.
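
The reported n equals 239761 × 68891, the total number of cells in X, and the traceback shows RandomState.choice calling RandomState.permutation, which would explain the MemoryError for a population of that size. A memory-friendly alternative, offered as a sketch and not as the package's fix, is to sample from a lazy range with the standard library and convert the flat indices to (row, column) pairs. The function name and the `n_items` default are assumptions.

```python
import random


def sample_cell_positions(n_rows, n_cols, n_items=1000, seed=0):
    """Pick up to n_items distinct (row, col) positions without allocating n_rows * n_cols values."""
    n = n_rows * n_cols
    rng = random.Random(seed)
    # range(n) is lazy, so sampling from it keeps memory use small.
    flat_ids = rng.sample(range(n), min(n_items, n))
    return [divmod(i, n_cols) for i in flat_ids]


positions = sample_cell_positions(239761, 68891, n_items=5)
```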

Can we use this library also for multi label classification (such as scikit-multilearn)?

Vecstack and RMSE

Hi,
I've been using your library vecstack for quite some time now and it has become my favorite tool for stacking models. Right now, though, I'm facing an error when implementing a custom-defined metric. My models are:
models = [
    ExtraTreesRegressor(random_state=0, n_jobs=-1,
                        n_estimators=100, max_depth=3),

    RidgeCV(),

    XGBRegressor(random_state=0, n_jobs=-1, learning_rate=0.1,
                 n_estimators=100, max_depth=3)
]

and this is my code for rmse:
def rmse(y_true, y_pred):
    return K.sqrt(K.mean(K.square(y_pred - y_true), axis=-1))

but when I fit it I get the traceback below.
My code works with the default metric (MAE) but raises an error with this one.

TypeError Traceback (most recent call last)
in ()
11 shuffle=True, # shuffle the data
12 random_state=0, # ensure reproducibility
---> 13 verbose=2) # print all info

/opt/conda/lib/python3.6/site-packages/vecstack/core.py in stacking(models, X_train, y_train, X_test, sample_weight, regression, transform_target, transform_pred, mode, needs_proba, save_dir, metric, n_folds, stratified, shuffle, random_state, verbose)
595 score = metric(y_te, S_train[te_index, col_slice_model])
596 scores = np.append(scores, score)
--> 597 fold_str = ' fold %2d: [%.8f]' % (fold_counter, score)
598 if save_dir is not None:
599 models_folds_str += fold_str + '\n'

TypeError: must be real number, not Tensor

Code for fitting the model is
S_train, S_test = stacking(models,                           # list of models
                           train_final, target, test_final,  # data
                           regression=True,                  # regression task (if you need
                                                             #     classification - set to False)
                           mode='oof_pred_bag',              # mode: oof for train set, predict test
                                                             #     set in each fold and find mean
                           save_dir=None,                    # do not save result and log (to save
                                                             #     in current dir - set to '.')
                           metric=rmse,                      # metric: callable
                           n_folds=4,                        # number of folds
                           shuffle=True,                     # shuffle the data
                           random_state=0,                   # ensure reproducibility
                           verbose=2)                        # print all info
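
The TypeError is raised where the score is formatted with %.8f, so the metric has to return a plain number, whereas the Keras-backend version returns a Tensor. A NumPy version of the same metric, as a sketch:

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error returned as a Python float."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_pred - y_true) ** 2)))


score = rmse([3.0, 5.0, 7.0], [2.0, 5.0, 9.0])
formatted = '%.8f' % score  # the kind of formatting that failed with a Tensor
```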

How to save model

After training a two-level stacking regression model, how can I save it to predict on new data, the way I can save a RandomForest with joblib and use it to predict later? Can I save the 1st-level and 2nd-level models separately?
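
With fitted scikit-learn-style estimators at both levels, joblib can persist them together or separately. The sketch below uses toy data and stores both levels in one file so they are always loaded as a matching pair; for brevity the second level is trained on in-sample predictions, which a real stack would replace with out-of-fold features.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.RandomState(0)
X, y = rng.rand(40, 3), rng.rand(40)

first_level = [Ridge().fit(X, y), LinearRegression().fit(X, y)]
S = np.column_stack([m.predict(X) for m in first_level])  # in-sample, for brevity only
second_level = LinearRegression().fit(S, y)

# Persist both levels in a single file.
path = os.path.join(tempfile.mkdtemp(), "stack.joblib")
joblib.dump({"first_level": first_level, "second_level": second_level}, path)

# Later: load the pair and predict on new rows.
loaded = joblib.load(path)
S_new = np.column_stack([m.predict(X[:5]) for m in loaded["first_level"]])
y_new = loaded["second_level"].predict(S_new)
```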

Error in `python': free(): invalid next size (normal)

Using any model except GaussianNB causes an error in stacking():
task: [classification]
n_classes: [2]
metric: [log_loss]
mode: [oof_pred_bag]
n_models: [1]
model 0: [LogisticRegression]
/opt/conda/lib/python3.6/site-packages/sklearn/linear_model/base.py:297: RuntimeWarning: overflow encountered in exp
np.exp(prob, prob)
----
MEAN: [0.56676799] + [0.01295934]
FULL: [0.56677227]

*** Error in `python': free(): invalid next size (normal): 0x0000564aaa718ea0 ***
How can I debug this to find the cause of the error?