civisanalytics / civisml-extensions
scikit-learn-compatible estimators from Civis Analytics
License: BSD 3-Clause "New" or "Revised" License
Sometimes, datasets will accidentally include columns of categoricals in which every value is unique (for example, if an index column gets included with the feature array). This is not useful for modeling, and will usually cause the program to fail as it runs out of memory. The DataFrameETL should give a warning if it finds categorical columns with an excessive number of levels. I think the main purpose here would be to help users diagnose the data quality issues that caused their models to fail, so the warning threshold could be very high. Perhaps warn if there are more than 500 levels in a column?
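A minimal sketch of what that check could look like (the threshold constant and helper name here are illustrative, not part of DataFrameETL):

import warnings

MAX_LEVELS_WARNING = 500  # illustrative threshold

def warn_on_high_cardinality(df, cols_to_expand):
    # Count levels including NaN, since dummy_na adds a level for it.
    for col in cols_to_expand:
        n_levels = df[col].nunique(dropna=False)
        if n_levels > MAX_LEVELS_WARNING:
            warnings.warn(
                "Column '{}' has {} levels; it may be an index or ID "
                "column, and expanding it could exhaust memory.".format(
                    col, n_levels))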
The requirements specify scikit-learn>=0.18.1,<0.20, and the newest release of scikit-learn is 0.20.2. Is this package incompatible with v0.20? If so, can we make it compatible with >=0.18.1? If it's already compatible, we should update the requirements.
Hyperband currently tells users how many combinations of parameters it's trying, but that information is in a print gated by an if self.verbose > 0. We should also emit that information via log.debug, regardless of the verbosity level. This will assist in debugging while still letting users avoid a print which is usually unnecessary.
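A minimal sketch of the suggested change, assuming a module-level logger (the helper name and message wording are illustrative):

import logging

log = logging.getLogger(__name__)

def report_grid_size(n_candidates, n_splits, verbose):
    # Always record the grid size at debug level, so it is recoverable
    # from logs regardless of the verbosity setting.
    log.debug("Fitting %d folds for each of %d candidates, totalling %d fits",
              n_splits, n_candidates, n_candidates * n_splits)
    if verbose > 0:
        # Keep the existing user-facing print for verbose runs.
        print("Fitting {} folds for each of {} candidates, totalling "
              "{} fits".format(n_splits, n_candidates, n_candidates * n_splits))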
ModuleNotFoundError: No module named 'sklearn.externals.joblib'
Python37\lib\site-packages\sklearn\externals\joblib\__init__.py:15: DeprecationWarning: sklearn.externals.joblib is deprecated in 0.21 and will be removed in 0.23. Please import this functionality directly from joblib, which can be installed with: pip install joblib. If this warning is raised when loading pickled models, you may need to re-serialize those models with scikit-learn 0.21+.
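One common workaround (assuming the importing code only needs Parallel and delayed) is to import joblib directly and fall back to the vendored copy on older scikit-learn versions:

try:
    # joblib installed as a standalone package (scikit-learn >= 0.21)
    from joblib import Parallel, delayed
except ImportError:
    # vendored copy, deprecated in 0.21 and removed in 0.23
    from sklearn.externals.joblib import Parallel, delayed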
If users request DataFrame output from the preprocessing.DataFrameETL, then the output DataFrame is missing the index of the input. In addition, any non-expanded columns will either be full of missing values or scrambled.
Without an index:
df = pd.DataFrame({'a': [1, 2, 3], 'b': ['x', 'y', 'x']})
DataFrameETL(dataframe_output=True).fit_transform(df)
a b_x b_y b_NaN
0 1.0 1.0 0.0 0.0
1 2.0 0.0 1.0 0.0
2 3.0 1.0 0.0 0.0
With an index:
df = pd.DataFrame({'a': [1, 2, 3], 'b': ['x', 'y', 'x']}, index=[11, 12, 0])
DataFrameETL(dataframe_output=True).fit_transform(df)
a b_x b_y b_NaN
0 3.0 1.0 0.0 0.0
1 NaN 0.0 1.0 0.0
2 NaN 1.0 0.0 0.0
The problem is in the if self.dataframe_output: block of DataFrameETL.transform. Perhaps we could use the index of the input X instead of creating a new index?
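A minimal sketch of that suggestion; the out and columns names are illustrative stand-ins for whatever transform actually builds:

import pandas as pd

def to_dataframe(out, columns, X):
    # Reuse the input's index so rows stay aligned with X instead of
    # being reindexed (and scrambled) against a fresh RangeIndex.
    return pd.DataFrame(out, columns=columns, index=X.index)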
I am trying to run a stacked regression. When n_jobs is 1 it runs fine; however, whenever I set n_jobs to 2 it crashes with the error below. I looked into similar issues, but none actually solved my error.
The code:
# Imports implied by the snippet below (Keras import paths assumed
# from the era of this report):
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.wrappers.scikit_learn import KerasRegressor

from civismlext.stacking import StackedRegressor
from civismlext.nonnegative import NonNegativeLinearRegression
def create_model():
model = Sequential()
model.add(Dense(150, activation='softmax', kernel_initializer='VarianceScaling', input_dim=456, name='HL1'))
model.add(Dropout(0.25, name="Dropout1"))
model.add(Dense(150, kernel_initializer='VarianceScaling', activation='softmax', name='HL2'))
model.add(Dropout(0.25, name="Dropout2"))
model.add(Dense(1, name='Output_Layer'))
model.compile(optimizer='adam', loss='mae', metrics=['mae', 'mean_squared_error'])
return model
mlp_model = KerasRegressor(build_fn=create_model, epochs=50, batch_size=75, validation_split=0.2, verbose=True)
# rf and gb (the random forest and gradient boosting regressors) are
# defined elsewhere in the notebook.
super_learner = StackedRegressor([
('pipe_mlp', mlp_model),
('rf', rf),
('xgb', gb),
('meta', NonNegativeLinearRegression())
], cv=5, n_jobs=2, verbose=5)
The error:
MaybeEncodingError Traceback (most recent call last)
<ipython-input-7-1d4b04377633> in <module>()
1 # fitting the model
----> 2 super_learner.fit(X_train[:50], y_train[:50])
~/anaconda3/lib/python3.6/site-packages/civismlext/stacking.py in fit(self, X, y, **fit_params)
163 self.meta_estimator.fit(Xmeta, ymeta, **meta_params)
164 # Now fit base estimators again, this time on full training set
--> 165 self._base_est_fit(X, y, **fit_params)
166
167 return self
~/anaconda3/lib/python3.6/site-packages/civismlext/stacking.py in _base_est_fit(self, X, y, **fit_params)
220 n_jobs=self.n_jobs,
221 verbose=self.verbose,
--> 222 pre_dispatch=self.pre_dispatch)(_jobs)
223
224 for name, _ in self.estimator_list[:-1]:
~/anaconda3/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py in __call__(self, iterable)
787 # consumption.
788 self._iterating = False
--> 789 self.retrieve()
790 # Make sure that we get a last message telling us we are done
791 elapsed_time = time.time() - self._start_time
~/anaconda3/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py in retrieve(self)
697 try:
698 if getattr(self._backend, 'supports_timeout', False):
--> 699 self._output.extend(job.get(timeout=self.timeout))
700 else:
701 self._output.extend(job.get())
~/anaconda3/lib/python3.6/multiprocessing/pool.py in get(self, timeout)
642 return self._value
643 else:
--> 644 raise self._value
645
646 def _set(self, i, obj):
MaybeEncodingError: Error sending result: '[<keras.callbacks.History object at 0x7f93fe43c7b8>]'. Reason: 'TypeError("can't pickle _thread.lock objects",)'
The reason behind it is that TensorFlow models cannot be shared across processes. This happens because of this line. Do you have any ideas how to work around it?
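Not a fix, but a workaround consistent with the report above: keep everything in one process with n_jobs=1, so the compiled Keras model never has to be pickled across process boundaries.

super_learner = StackedRegressor([
    ('pipe_mlp', mlp_model),
    ('rf', rf),
    ('xgb', gb),
    ('meta', NonNegativeLinearRegression()),
], cv=5, n_jobs=1, verbose=5)  # n_jobs=1 avoids pickling the Keras model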
The HyperbandSearchCV class depends on the MaskedArray class, which was added to sklearn.utils.fixes in version 0.18.1. Attempts to import civismlext fail when using sklearn v0.18. Here's what that looks like:
In [1]: from civismlext import NonNegativeLinearRegression
---------------------------------------------------------------------------
ImportError Traceback (most recent call last)
<ipython-input-1-86087081ae3b> in <module>()
----> 1 from civismlext import NonNegativeLinearRegression
/Users/kcrum/miniconda3/envs/sandbox/lib/python3.5/site-packages/civismlext/__init__.py in <module>()
2 from civismlext.stacking import StackedClassifier # NOQA
3 from civismlext.nonnegative import NonNegativeLinearRegression # NOQA
----> 4 from civismlext.hyperband import HyperbandSearchCV # NOQA
5 from civismlext.preprocessing import DataFrameETL # NOQA
/Users/kcrum/miniconda3/envs/sandbox/lib/python3.5/site-packages/civismlext/hyperband.py in <module>()
18
19 from sklearn.externals.joblib import Parallel, delayed
---> 20 from sklearn.utils.fixes import MaskedArray
21 from sklearn.utils.validation import indexable
22 from sklearn.metrics.scorer import check_scoring
ImportError: cannot import name 'MaskedArray'
In [2]: from sklearn.utils.fixes import MaskedArray
---------------------------------------------------------------------------
ImportError Traceback (most recent call last)
<ipython-input-2-f7d936768fbc> in <module>()
----> 1 from sklearn.utils.fixes import MaskedArray
ImportError: cannot import name 'MaskedArray'
In [3]: import sklearn
In [4]: sklearn.__version__
Out[4]: '0.18'
Is it possible to dump the stacking ensemble using joblib dump, just as with scikit-learn estimators? Will this store all the estimators within as well?
On another note, it would also be useful to have a save method that iterates through the estimators and stores each one individually (it should handle pipelines as well).
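For the first question, a minimal sketch, assuming a fitted StackedRegressor named model: since the fitted base estimators are attributes of the instance, serializing the instance should capture them too, provided each one is itself picklable.

import joblib

joblib.dump(model, 'stacked_model.joblib')  # serializes nested estimators
restored = joblib.load('stacked_model.joblib')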
Saw the very instructive and clear talk by @kcrum, just wondering where the slides can be found? Awesome stuff, would love to share with others!
I was wondering if it would be possible to handle multi-class classification with Hyperband? It worked with binary classification and regression tasks, but supporting multi-class as well would be immensely helpful.
Otherwise, maybe we should include documentation somewhere that emphasizes that Hyperband can't handle the multi-class case; I had to discover this by actually trying it.
~/.virtualenvs/dsmodels/lib/python3.7/site-packages/sklearn/metrics/ranking.py in roc_auc_score(y_true, y_score, average, sample_weight, max_fpr)
354 return _average_binary_score(
355 _binary_roc_auc_score, y_true, y_score, average,
--> 356 sample_weight=sample_weight)
357
358
~/.virtualenvs/dsmodels/lib/python3.7/site-packages/sklearn/metrics/base.py in _average_binary_score(binary_metric, y_true, y_score, average, sample_weight)
72 y_type = type_of_target(y_true)
73 if y_type not in ("binary", "multilabel-indicator"):
---> 74 raise ValueError("{0} format is not supported".format(y_type))
75
76 if y_type == "binary":
ValueError: multiclass format is not supported
import numpy as np
import xgboost as xgb
from scipy.stats import randint as sp_randint
from scipy.stats import uniform as sp_float
# XGBoost with Hyperband Hyperparameter Optimization
clf = xgb.XGBRegressor()
clf.set_params(**{"n_jobs": 4})
# Hyperparameter search boundaries
param_grid = {
# Parameters for Tree Booster
'eta': sp_float(0, 1),
'gamma': sp_randint(0, 100),
'max_depth': sp_randint(1, 3),
'learning_rate': sp_float(.001, .005),
'n_estimators': sp_randint(5000, 40000),
'min_child_weight': sp_randint(0, 50),
'max_delta_step': sp_randint(0, int(np.log(upper_limit))),  # upper_limit is defined elsewhere
'subsample': sp_float(0, 1),
# Family of parameters for subsampling of columns
'colsample_bytree': sp_float(0.2, 1),
'colsample_bylevel': sp_float(0.2, 1),
'colsample_bynode': sp_float(0.2, 1),
# Regularization Params
'lambda': sp_randint(1, 10),
'alpha': sp_randint(0, 100),
}
from civismlext.hyperband import HyperbandSearchCV
tuned_model = HyperbandSearchCV(clf,
param_distributions=param_grid,
cost_parameter_max={'n_estimators': 20000},
cost_parameter_min={'n_estimators': 2000},
n_jobs=4,
cv=2)
Somehow I got an out-of-bounds error when I tried to set the range for colsample_by* as (0.2, 1), but when I changed it back to (0, 1) it worked.
Seems like it might be an async/distributed computing issue?
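One likely cause worth checking before suspecting the parallelism: scipy's uniform(loc, scale) samples from [loc, loc + scale], so sp_float(0.2, 1) can draw colsample values as large as 1.2, which is out of bounds for XGBoost. A sketch of the bounded version:

from scipy.stats import uniform as sp_float

# uniform(loc, scale) samples from [loc, loc + scale]; to stay within
# [0.2, 1.0] the width must be 0.8, not 1:
colsample_dist = sp_float(0.2, 0.8)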
We should give a debug log emit before expanding categoricals in the DataFrameETL. It's useful to know how big of an array we create, especially if the expansion fails because of memory constraints.
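A minimal sketch of such an emit, with illustrative names; it assumes the fitted levels are available as a dict mapping each categorical column to its list of levels:

import logging

log = logging.getLogger(__name__)

def log_expansion_size(levels, n_rows):
    # Report how large the expanded array will be before allocating it.
    n_indicator_cols = sum(len(lv) for lv in levels.values())
    log.debug("Expanding %d categorical columns into %d indicator columns "
              "(roughly %d x %d output)",
              len(levels), n_indicator_cols, n_rows, n_indicator_cols)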
I believe cv.split(X, y) might give different results when called at different times; there might be randomness involved, like shuffling. This is problematic because ybase, which is based on the old train & test, may not coincide with the new test and therefore with y[test].
Old train & test:
for train, test in cv.split(X, y):
    for name, est in self.estimator_list[:-1]:
        # adapted from sklearn.model_selection._fit_and_predict
        # Adjust length of sample weights
        fit_params_est_adjusted = dict([
            (k, _index_param_value(X, v, train))
            for k, v in fit_params_ests[name].items()])
        # Fit estimator on training set and score out-of-sample
        _jobs.append(delayed(_fit_predict)(
            clone(est),
            X[train],
            y[train],
            X[test],
            **fit_params_est_adjusted))
New train & test:
# Extract the results from joblib
Xmeta, ymeta = None, None
for train, test in cv.split(X, y):
    ybase = np.empty((y[test].shape[0], 0))
    for name, est in self.estimator_list[:-1]:
        # Build design matrix out of out-of-sample predictions
        ybase = np.hstack((ybase, _out.pop(0)))
    # Append the test outputs to what will eventually be the features
    # for the meta-estimator.
    if Xmeta is not None:
        ymeta = np.concatenate((ymeta, y[test]))
        Xmeta = np.vstack((Xmeta, ybase))
    else:
        Xmeta = ybase
        ymeta = y[test]
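A minimal sketch of one possible fix: materialize the folds once and iterate over the same list in both places, so a stochastic splitter cannot disagree with itself between the two loops.

# Compute the folds a single time up front...
splits = list(cv.split(X, y))

# ...then use the same list in both loops instead of calling
# cv.split(X, y) twice:
for train, test in splits:
    pass  # fit/predict, then build Xmeta/ymeta, as before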
The following fails under v0.1.5 (the most recent release):
import numpy as np
import pandas as pd
from civismlext.preprocessing import DataFrameETL

raw = pd.concat([
pd.Series([1.0, np.NaN, 3.0], dtype='float', name='fruits'),
pd.Series([500, 1000, 1000], dtype='category', name='intcat'),
], axis=1)
expander = DataFrameETL(cols_to_expand='auto', dummy_na=True)
tfm = expander.fit_transform(raw)
The error is "ValueError: fill value must be in categories".
It looks like this is due to DataFrameETL._flag_numeric incorrectly marking a pd.Categorical as "numeric" when every level happens to be an integer.
We have a _version.py file which defines __version__, but we never import __version__, so there's no civismlext.__version__ attribute. We need a from ._version import __version__ in the civismlext __init__.
I'm having trouble understanding how to perform a grid search over all estimator parameters in my StackedClassifier (base estimators + meta estimator). Do I need to pass the parameters to the .fit method of the StackedClassifier? Or do I need to wrap my classifier in a CV class, as below?
import numpy as np

from civismlext.stacking import StackedClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

X, y = ...
estimator_list = [('rf', RandomForestClassifier()), ('meta', LogisticRegression())]
cv = RandomizedSearchCV(
estimator=StackedClassifier(
estimator_list=estimator_list
),
param_distributions={
'rf__n_estimators': [10, 100, 1000, 10000],
'rf__max_features': [None, 'sqrt', 'auto', 'log2'],
'rf__criterion': ['gini', 'entropy'],
'rf__class_weight': ['balanced_subsample', 'balanced'],
'meta__l1_ratio': np.logspace(-5, 0, 6),
'meta__C': np.logspace(-5, 5, 11),
},
scoring='roc_auc',
n_iter=10
)
cv.fit(X, y)
This approach seems flawed to me because the grid search will perform CV in addition to the CV the StackedClassifier performs for us, which leads to data sparsity in the base models.