civisanalytics / civisml-extensions
scikit-learn-compatible estimators from Civis Analytics
License: BSD 3-Clause "New" or "Revised" License
Sometimes, datasets will accidentally include columns of categoricals in which every value is unique (for example, if an index column gets included with the feature array). This is not useful for modeling, and will usually cause the program to fail as it runs out of memory. The DataFrameETL should give a warning if it finds categorical columns with an excessive number of levels. I think the main purpose here would be to help users diagnose the data quality issues that caused their models to fail, so the warning threshold could be very high. Perhaps warn if there are more than 500 levels in a column?
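A minimal sketch of what that check could look like (the threshold constant and helper name here are illustrative, not part of DataFrameETL):

import warnings

MAX_LEVELS_WARNING = 500  # illustrative threshold

def warn_on_high_cardinality(df, cols_to_expand):
    # Count levels including NaN, since dummy_na adds a level for it.
    for col in cols_to_expand:
        n_levels = df[col].nunique(dropna=False)
        if n_levels > MAX_LEVELS_WARNING:
            warnings.warn(
                "Column '{}' has {} levels; it may be an index or ID "
                "column, and expanding it could exhaust memory.".format(
                    col, n_levels))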
The requirements specify scikit-learn>=0.18.1,<0.20, and the newest release of scikit-learn is 0.20.2. Is this package incompatible with v0.20? If so, can we make it compatible with >=0.18.1? If it's already compatible, we should update the requirements.
Hyperband currently tells users how many combinations of parameters it's trying, but that information is in a print gated by an if self.verbose > 0. We should also emit that information via log.debug, regardless of the verbosity level. This will assist in debugging while still letting users avoid a print which is usually unnecessary.
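A minimal sketch of the suggested change, assuming a module-level logger (the helper name and message wording are illustrative):

import logging

log = logging.getLogger(__name__)

def report_grid_size(n_candidates, n_splits, verbose):
    # Always record the grid size at debug level, so it is recoverable
    # from logs regardless of the verbosity setting.
    log.debug("Fitting %d folds for each of %d candidates, totalling %d fits",
              n_splits, n_candidates, n_candidates * n_splits)
    if verbose > 0:
        # Keep the existing user-facing print for verbose runs.
        print("Fitting {} folds for each of {} candidates, totalling "
              "{} fits".format(n_splits, n_candidates, n_candidates * n_splits))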
ModuleNotFoundError: No module named 'sklearn.externals.joblib'
Python37\lib\site-packages\sklearn\externals\joblib\__init__.py:15: DeprecationWarning: sklearn.externals.joblib is deprecated in 0.21 and will be removed in 0.23. Please import this functionality directly from joblib, which can be installed with: pip install joblib. If this warning is raised when loading pickled models, you may need to re-serialize those models with scikit-learn 0.21+.
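One common workaround (assuming the importing code only needs Parallel and delayed) is to import joblib directly and fall back to the vendored copy on older scikit-learn versions:

try:
    # joblib installed as a standalone package (scikit-learn >= 0.21)
    from joblib import Parallel, delayed
except ImportError:
    # vendored copy, deprecated in 0.21 and removed in 0.23
    from sklearn.externals.joblib import Parallel, delayed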
If users request DataFrame output from the preprocessing.DataFrameETL, then the output DataFrame is missing the index of the input. In addition, any non-expanded columns will either be full of missing values or scrambled.
Without an index:
df = pd.DataFrame({'a': [1, 2, 3], 'b': ['x', 'y', 'x']})
DataFrameETL(dataframe_output=True).fit_transform(df)
a b_x b_y b_NaN
0 1.0 1.0 0.0 0.0
1 2.0 0.0 1.0 0.0
2 3.0 1.0 0.0 0.0
With an index:
df = pd.DataFrame({'a': [1, 2, 3], 'b': ['x', 'y', 'x']}, index=[11, 12, 0])
DataFrameETL(dataframe_output=True).fit_transform(df)
a b_x b_y b_NaN
0 3.0 1.0 0.0 0.0
1 NaN 0.0 1.0 0.0
2 NaN 1.0 0.0 0.0
The problem is in the if self.dataframe_output: block of DataFrameETL.transform. Perhaps we could use the index of the input X instead of creating a new index?
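A minimal sketch of that suggestion; the out and columns names are illustrative stand-ins for whatever transform actually builds:

import pandas as pd

def to_dataframe(out, columns, X):
    # Reuse the input's index so rows stay aligned with X instead of
    # being reindexed (and scrambled) against a fresh RangeIndex.
    return pd.DataFrame(out, columns=columns, index=X.index)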
I am trying to run a stacked regression. When n_jobs is 1 it runs fine; however, whenever I set n_jobs to 2 it crashes with the error below. I looked into similar issues, but none actually solved my error.
The code:
# Imports implied by the snippet below (Keras import paths assumed
# from the era of this report):
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.wrappers.scikit_learn import KerasRegressor

from civismlext.stacking import StackedRegressor
from civismlext.nonnegative import NonNegativeLinearRegression
def create_model():
model = Sequential()
model.add(Dense(150, activation='softmax', kernel_initializer='VarianceScaling', input_dim=456, name='HL1'))
model.add(Dropout(0.25, name="Dropout1"))
model.add(Dense(150, kernel_initializer='VarianceScaling', activation='softmax', name='HL2'))
model.add(Dropout(0.25, name="Dropout2"))
model.add(Dense(1, name='Output_Layer'))
model.compile(optimizer='adam', loss='mae', metrics=['mae', 'mean_squared_error'])
return model
mlp_model = KerasRegressor(build_fn=create_model, epochs=50, batch_size=75, validation_split=0.2, verbose=True)
# rf and gb (the random forest and gradient boosting regressors) are
# defined elsewhere in the notebook.
super_learner = StackedRegressor([
('pipe_mlp', mlp_model),
('rf', rf),
('xgb', gb),
('meta', NonNegativeLinearRegression())
], cv=5, n_jobs=2, verbose=5)
The error:
MaybeEncodingError Traceback (most recent call last)
<ipython-input-7-1d4b04377633> in <module>()
1 # fitting the model
----> 2 super_learner.fit(X_train[:50], y_train[:50])
~/anaconda3/lib/python3.6/site-packages/civismlext/stacking.py in fit(self, X, y, **fit_params)
163 self.meta_estimator.fit(Xmeta, ymeta, **meta_params)
164 # Now fit base estimators again, this time on full training set
--> 165 self._base_est_fit(X, y, **fit_params)
166
167 return self
~/anaconda3/lib/python3.6/site-packages/civismlext/stacking.py in _base_est_fit(self, X, y, **fit_params)
220 n_jobs=self.n_jobs,
221 verbose=self.verbose,
--> 222 pre_dispatch=self.pre_dispatch)(_jobs)
223
224 for name, _ in self.estimator_list[:-1]:
~/anaconda3/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py in __call__(self, iterable)
787 # consumption.
788 self._iterating = False
--> 789 self.retrieve()
790 # Make sure that we get a last message telling us we are done
791 elapsed_time = time.time() - self._start_time
~/anaconda3/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py in retrieve(self)
697 try:
698 if getattr(self._backend, 'supports_timeout', False):
--> 699 self._output.extend(job.get(timeout=self.timeout))
700 else:
701 self._output.extend(job.get())
~/anaconda3/lib/python3.6/multiprocessing/pool.py in get(self, timeout)
642 return self._value
643 else:
--> 644 raise self._value
645
646 def _set(self, i, obj):
MaybeEncodingError: Error sending result: '[<keras.callbacks.History object at 0x7f93fe43c7b8>]'. Reason: 'TypeError("can't pickle _thread.lock objects",)'
The reason behind it is that TensorFlow models cannot be shared across processes. This happens because of this line. Do you have any ideas how to work around it?
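Not a fix, but a workaround consistent with the report above: keep everything in one process with n_jobs=1, so the compiled Keras model never has to be pickled across process boundaries.

super_learner = StackedRegressor([
    ('pipe_mlp', mlp_model),
    ('rf', rf),
    ('xgb', gb),
    ('meta', NonNegativeLinearRegression()),
], cv=5, n_jobs=1, verbose=5)  # n_jobs=1 avoids pickling the Keras model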
The HyperbandSearchCV class depends on the MaskedArray class, which was added to sklearn.utils.fixes in version 0.18.1. Attempts to import civismlext fail when using sklearn v0.18. Here's what that looks like:
In [1]: from civismlext import NonNegativeLinearRegression
---------------------------------------------------------------------------
ImportError Traceback (most recent call last)
<ipython-input-1-86087081ae3b> in <module>()
----> 1 from civismlext import NonNegativeLinearRegression
/Users/kcrum/miniconda3/envs/sandbox/lib/python3.5/site-packages/civismlext/__init__.py in <module>()
2 from civismlext.stacking import StackedClassifier # NOQA
3 from civismlext.nonnegative import NonNegativeLinearRegression # NOQA
----> 4 from civismlext.hyperband import HyperbandSearchCV # NOQA
5 from civismlext.preprocessing import DataFrameETL # NOQA
/Users/kcrum/miniconda3/envs/sandbox/lib/python3.5/site-packages/civismlext/hyperband.py in <module>()
18
19 from sklearn.externals.joblib import Parallel, delayed
---> 20 from sklearn.utils.fixes import MaskedArray
21 from sklearn.utils.validation import indexable
22 from sklearn.metrics.scorer import check_scoring
ImportError: cannot import name 'MaskedArray'
In [2]: from sklearn.utils.fixes import MaskedArray
---------------------------------------------------------------------------
ImportError Traceback (most recent call last)
<ipython-input-2-f7d936768fbc> in <module>()
----> 1 from sklearn.utils.fixes import MaskedArray
ImportError: cannot import name 'MaskedArray'
In [3]: import sklearn
In [4]: sklearn.__version__
Out[4]: '0.18'
Is it possible to dump the stacking ensemble using joblib dump, just as with scikit-learn estimators? Will this store all the estimators within as well?
On another note, it would also be useful to have a save method that iterates through the estimators and stores each one individually (it should handle pipelines as well).
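For the first question, a minimal sketch, assuming a fitted StackedRegressor named model: since the fitted base estimators are attributes of the instance, serializing the instance should capture them too, provided each one is itself picklable.

import joblib

joblib.dump(model, 'stacked_model.joblib')  # serializes nested estimators
restored = joblib.load('stacked_model.joblib')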
Saw the very instructive and clear talk by @kcrum, just wondering where the slides can be found? Awesome stuff, would love to share with others!
I was wondering if it would be possible to handle multi-class classification with Hyperband? It worked with binary classification and regression tasks, but supporting multi-class as well would be immensely helpful.
Otherwise, maybe we should include documentation somewhere that emphasizes that Hyperband can't handle the multi-class case; I had to discover this by actually trying it.
~/.virtualenvs/dsmodels/lib/python3.7/site-packages/sklearn/metrics/ranking.py in roc_auc_score(y_true, y_score, average, sample_weight, max_fpr)
354 return _average_binary_score(
355 _binary_roc_auc_score, y_true, y_score, average,
--> 356 sample_weight=sample_weight)
357
358
~/.virtualenvs/dsmodels/lib/python3.7/site-packages/sklearn/metrics/base.py in _average_binary_score(binary_metric, y_true, y_score, average, sample_weight)
72 y_type = type_of_target(y_true)
73 if y_type not in ("binary", "multilabel-indicator"):
---> 74 raise ValueError("{0} format is not supported".format(y_type))
75
76 if y_type == "binary":
ValueError: multiclass format is not supported
import numpy as np
import xgboost as xgb
from scipy.stats import randint as sp_randint
from scipy.stats import uniform as sp_float
# XGBoost with Hyperband Hyperparameter Optimization
clf = xgb.XGBRegressor()
clf.set_params(**{"n_jobs": 4})
# Hyperparameter search boundaries
param_grid = {
# Parameters for Tree Booster
'eta': sp_float(0, 1),
'gamma': sp_randint(0, 100),
'max_depth': sp_randint(1, 3),
'learning_rate': sp_float(.001, .005),
'n_estimators': sp_randint(5000, 40000),
'min_child_weight': sp_randint(0, 50),
'max_delta_step': sp_randint(0, int(np.log(upper_limit))),  # upper_limit is defined elsewhere
'subsample': sp_float(0, 1),
# Family of parameters for subsampling of columns
'colsample_bytree': sp_float(0.2, 1),
'colsample_bylevel': sp_float(0.2, 1),
'colsample_bynode': sp_float(0.2, 1),
# Regularization Params
'lambda': sp_randint(1, 10),
'alpha': sp_randint(0, 100),
}
from civismlext.hyperband import HyperbandSearchCV
tuned_model = HyperbandSearchCV(clf,
param_distributions=param_grid,
cost_parameter_max={'n_estimators': 20000},
cost_parameter_min={'n_estimators': 2000},
n_jobs=4,
cv=2)
Somehow I got an out-of-bounds error when I tried to set the range for colsample_by* as (0.2, 1), but when I changed it back to (0, 1) it worked.
Seems like it might be an async/distributed computing issue?
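One likely cause worth checking before suspecting the parallelism: scipy's uniform(loc, scale) samples from [loc, loc + scale], so sp_float(0.2, 1) can draw colsample values as large as 1.2, which is out of bounds for XGBoost. A sketch of the bounded version:

from scipy.stats import uniform as sp_float

# uniform(loc, scale) samples from [loc, loc + scale]; to stay within
# [0.2, 1.0] the width must be 0.8, not 1:
colsample_dist = sp_float(0.2, 0.8)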
We should give a debug log emit before expanding categoricals in the DataFrameETL. It's useful to know how big of an array we create, especially if the expansion fails because of memory constraints.
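A minimal sketch of such an emit, with illustrative names; it assumes the fitted levels are available as a dict mapping each categorical column to its list of levels:

import logging

log = logging.getLogger(__name__)

def log_expansion_size(levels, n_rows):
    # Report how large the expanded array will be before allocating it.
    n_indicator_cols = sum(len(lv) for lv in levels.values())
    log.debug("Expanding %d categorical columns into %d indicator columns "
              "(roughly %d x %d output)",
              len(levels), n_indicator_cols, n_rows, n_indicator_cols)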
I believe cv.split(X, y) might give different results when called at different times; there might be randomness involved, like shuffling. This is problematic because ybase, which is based on the old train & test, may not coincide with the new test and therefore with y[test].
Old train & test:
for train, test in cv.split(X, y):
    for name, est in self.estimator_list[:-1]:
        # adapted from sklearn.model_selection._fit_and_predict
        # Adjust length of sample weights
        fit_params_est_adjusted = dict([
            (k, _index_param_value(X, v, train))
            for k, v in fit_params_ests[name].items()])
        # Fit estimator on training set and score out-of-sample
        _jobs.append(delayed(_fit_predict)(
            clone(est),
            X[train],
            y[train],
            X[test],
            **fit_params_est_adjusted))
New train & test:
# Extract the results from joblib
Xmeta, ymeta = None, None
for train, test in cv.split(X, y):
    ybase = np.empty((y[test].shape[0], 0))
    for name, est in self.estimator_list[:-1]:
        # Build design matrix out of out-of-sample predictions
        ybase = np.hstack((ybase, _out.pop(0)))
    # Append the test outputs to what will eventually be the features
    # for the meta-estimator.
    if Xmeta is not None:
        ymeta = np.concatenate((ymeta, y[test]))
        Xmeta = np.vstack((Xmeta, ybase))
    else:
        Xmeta = ybase
        ymeta = y[test]
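A minimal sketch of one possible fix: materialize the folds once and iterate over the same list in both places, so a stochastic splitter cannot disagree with itself between the two loops.

# Compute the folds a single time up front...
splits = list(cv.split(X, y))

# ...then use the same list in both loops instead of calling
# cv.split(X, y) twice:
for train, test in splits:
    pass  # fit/predict, then build Xmeta/ymeta, as before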
The following fails under v0.1.5 (the most recent release):
import numpy as np
import pandas as pd
from civismlext.preprocessing import DataFrameETL

raw = pd.concat([
pd.Series([1.0, np.NaN, 3.0], dtype='float', name='fruits'),
pd.Series([500, 1000, 1000], dtype='category', name='intcat'),
], axis=1)
expander = DataFrameETL(cols_to_expand='auto', dummy_na=True)
tfm = expander.fit_transform(raw)
The error is "ValueError: fill value must be in categories".
It looks like this is due to DataFrameETL._flag_numeric incorrectly marking a pd.Categorical as "numeric" when every level happens to be an integer.
We have a _version.py file which defines __version__, but we never import __version__, so there's no civismlext.__version__ attribute. We need a from ._version import __version__ in the civismlext __init__.
I'm having trouble understanding how to perform a grid search over all estimator parameters in my StackedClassifier (base estimators + meta estimator). Do I need to pass the parameters to the .fit method of the StackedClassifier? Or do I need to wrap my classifier in a CV class, as below?
import numpy as np

from civismlext.stacking import StackedClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

X, y = ...
estimator_list = [('rf', RandomForestClassifier()), ('meta', LogisticRegression())]
cv = RandomizedSearchCV(
estimator=StackedClassifier(
estimator_list=estimator_list
),
param_distributions={
'rf__n_estimators': [10, 100, 1000, 10000],
'rf__max_features': [None, 'sqrt', 'auto', 'log2'],
'rf__criterion': ['gini', 'entropy'],
'rf__class_weight': ['balanced_subsample', 'balanced'],
'meta__l1_ratio': np.logspace(-5, 0, 6),
'meta__C': np.logspace(-5, 5, 11),
},
scoring='roc_auc',
n_iter=10
)
cv.fit(X, y)
This approach seems flawed to me because the grid search will perform CV in addition to the CV the StackedClassifier performs for us, which leads to data sparsity in the base models.