microsoft / flaml
A fast library for AutoML and tuning. Join our Discord: https://discord.gg/Cppx2vSPVP.
Home Page: https://microsoft.github.io/FLAML/
License: MIT License
My team is working on a multiclass classification model for predicting the workload type (POC, Prod, Dev, Test) of Azure services such as SQLDW, Synapse and SQLDB. We replaced GridSearch/XGBoost with FLAML XGBoost for better performance. Because this is multiclass classification, we implemented additional metrics for our final model, namely a normalized confusion matrix, precision-recall curves and ROC curves, using OneVsRestClassifier to binarize the labels. This lets us measure prediction performance for each individual workload type, in addition to the accuracy, precision and recall of the overall model. This seems like a common requirement that other FLAML users might have, and it would be valuable to add these features for multiclass classification models.
The link to access the jupyter notebook for multiclass classification is
https://microsoft.sharepoint.com/:u:/t/AzureDataUXBA-DataEngineeringandAnalysis/ETY_DWyvPXBEl2S-R5C6rVUBFa0fvbnE9V7KSzAC3H8uMQ?e=hgxKmb
The last section of the file (5. Metrics) implements the metrics above.
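As a rough illustration of the per-class view described in this request, the sketch below computes a row-normalized confusion matrix in plain Python. The label names come from the issue; the sample predictions and the helper name `normalized_confusion_matrix` are made up for illustration and are not part of FLAML or of the linked notebook.

```python
from collections import Counter


def normalized_confusion_matrix(y_true, y_pred, labels):
    """Return {true_label: {pred_label: fraction}}; each non-empty row sums to 1."""
    counts = Counter(zip(y_true, y_pred))
    matrix = {}
    for t in labels:
        row_total = sum(counts[(t, p)] for p in labels)
        matrix[t] = {
            p: (counts[(t, p)] / row_total if row_total else 0.0)
            for p in labels
        }
    return matrix


# Hypothetical labels and predictions, for illustration only.
labels = ["POC", "Prod", "Dev", "Test"]
y_true = ["POC", "POC", "Prod", "Prod", "Dev", "Dev", "Test", "Test"]
y_pred = ["POC", "Dev", "Prod", "Prod", "Dev", "Test", "Test", "Test"]

cm = normalized_confusion_matrix(y_true, y_pred, labels)
for t in labels:
    print(t, [round(cm[t][p], 2) for p in labels])
```

With row normalization, the diagonal entry of each row is the recall for that workload type, which is one way to read per-class performance at a glance.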
fit() is stopping early with the following message:
[flaml.automl: 06-17 12:55:08] {1013} INFO - iteration 41, current learner lrl1
No low-cost partial config given to the search algorithm. For cost-frugal search, consider providing low-cost values for cost-related hps via 'low_cost_partial_config'.
/usr/local/lib/python3.7/dist-packages/sklearn/linear_model/_sag.py:329: ConvergenceWarning: The max_iter was reached which means the coef_ did not converge
"the coef_ did not converge", ConvergenceWarning)
I cannot find any details in the documentation as to what exactly low_cost_partial_config is, or what I should be setting it to. Any pointers or guidance would be appreciated.
Hi,
I am trying to learn how FLAML handles categorical features - i.e., which encoding methods (e.g., OneHotEncoding, OrdinalEncoding) are used.
I looked through the following code:
Line 187 in c4c15f5
but I can't see where the categorical features are actually encoded.
Also, I was wondering if different estimators will use different encodings? E.g., OrdinalEncoder for lgbm and OHE for RandomForest?
Hi:
get_output_from_log returns empty objects for me with no error message:
from flaml.data import get_output_from_log
get_output_from_log(filename = 'test.log', time_budget = 600)
([], [], [], [], [])
However, when I ran your sample codes in the notebook, this function worked well. Moreover, my test log file exists and can be accessed in Jupyter using other methods.
I am using 0.2.5 and my settings are:
settings = {
    "time_budget": 3600,
    "eval_method": 'cv',
    "max_iter": 100,
    "n_splits": 5,
    "log_type": 'all',
    "metric": 'roc_auc',
    "task": 'classification',
    "log_file_name": 'test.log',
    "log_training_metric": False
}
Any ideas?
Appreciate your help! So far the feedback of FLAML from our users is overwhelmingly positive. Great work!
After running FLAML for several hours, I noticed in the log that the best model was xgboost with n_estimators set to 32768:
{"record_id": 24, "iter_per_learner": 55, "logged_metric": false, "trial_time": 1509.8762745857239, "total_search_time": 41334.96948957443, "validation_loss": 0.25501297475819773, "config": {"n_estimators": 32768.0, "max_leaves": 186.0, "min_child_weight": 0.22536063808245474, "learning_rate": 0.05398963108662436, "subsample": 0.9173715591862044, "colsample_bylevel": 0.9005345477364418, "colsample_bytree": 0.6104797018735161, "reg_alpha": 0.0009765625, "reg_lambda": 1.92166667176985, "FLAML_sample_size": 55408}, "best_validation_loss": 0.25501297475819773, "best_config": {"n_estimators": 32768.0, "max_leaves": 186.0, "min_child_weight": 0.22536063808245474, "learning_rate": 0.05398963108662436, "subsample": 0.9173715591862044, "colsample_bylevel": 0.9005345477364418, "colsample_bytree": 0.6104797018735161, "reg_alpha": 0.0009765625, "reg_lambda": 1.92166667176985, "FLAML_sample_size": 55408}, "learner": "xgboost", "sample_size": 55408}
{"curr_best_record_id": 24}
But that seems excessively large to me and will surely result in an overfit model. (Indeed, the model achieves 99.9% F1 score on the training data set, but only about 75% on a held-out test data set.)
I see in the code that 32768 is set as the upper bound for n_estimators:
Lines 307 to 313 in 0604570
I'm just wondering if this upper bound is intentionally set so high, or if this is an oversight.
Hi Dr. Wang:
Does this algorithm work with weighted datasets? I haven't seen any parameter like 'sample_weight'. Or shall I create customized learners for weighted datasets as you suggested for monotonicity?
Thank you.
Currently logs are produced in Tab-separated row format. Consider JSON log format because:
add test/nni to illustrate how to use flaml in the nni framework.
Is it possible to let users specify the final_estimator and passthrough for the ensemble, please? In practice sometimes the only meta learner can be accepted by the business is GLM. Single boosting models are OK but a boosting model of boosting models is just too complicated for the legal team and regulators. Regarding the passthrough, there is no guarantee that one way will be better than the other so perhaps it is better to let the users decide.
Appreciate your help!
Originally posted by @flippercy in #47 (comment)
A static site for API doc.
When FLAML is building the ensemble at the end of a run, I am seeing the following error message:
[flaml.automl: 07-10 06:16:01] {1153} INFO - at 9912.2s, best xgboost's error=0.1893, best xgboost's error=0.1893
[flaml.automl: 07-10 06:16:01] {1001} INFO - iteration 183, current learner lgbm
[flaml.automl: 07-10 06:17:00] {1153} INFO - at 9970.5s, best lgbm's error=0.1928, best xgboost's error=0.1893
[flaml.automl: 07-10 06:17:00] {1193} INFO - selected model: XGBClassifier(base_score=0.5, booster='gbtree',
colsample_bylevel=0.6553649023281938, colsample_bynode=1,
colsample_bytree=0.5733906723952086, gamma=0, gpu_id=-1,
grow_policy='lossguide', importance_type='gain',
interaction_constraints='', learning_rate=0.03981439313350194,
max_delta_step=0, max_depth=0, max_leaves=1130,
min_child_weight=5.542464309441731, missing=nan,
monotone_constraints='()', n_estimators=123, n_jobs=10,
num_parallel_tree=1, objective='multi:softprob', random_state=0,
reg_alpha=0.0059793400625186045, reg_lambda=7.330769622156848,
scale_pos_weight=None, subsample=1.0, tree_method='hist',
use_label_encoder=False, validate_parameters=1, verbosity=0)
[flaml.automl: 07-10 06:17:00] {1203} INFO - [('xgboost', <flaml.model.XGBoostSklearnEstimator object at 0x7f3e68c55048>), ('lgbm', <flaml.model.LGBMEstimator object at 0x7f3ea4da8cc0>), ('rf', <flaml.model.RandomForestEstimator object at 0x7f3ef11bc278>), ('extra_tree', <flaml.model.ExtraTreeEstimator object at 0x7f3ea4fb9240>), ('catboost', <flaml.model.CatBoostEstimator object at 0x7f3ef24f5c50>)]
joblib.externals.loky.process_executor._RemoteTraceback:
"""
Traceback (most recent call last):
File "/global/home/hpc3552/autotext/flaml_env/lib/python3.6/site-packages/joblib/externals/loky/process_executor.py", line 357, in _sendback_result
exception=exception))
File "/global/home/hpc3552/autotext/flaml_env/lib/python3.6/site-packages/joblib/externals/loky/backend/queues.py", line 247, in put
self._writer.send_bytes(obj)
File "/opt/python/anaconda3/lib/python3.6/multiprocessing/connection.py", line 200, in send_bytes
self._send_bytes(m[offset:offset + size])
File "/opt/python/anaconda3/lib/python3.6/multiprocessing/connection.py", line 393, in _send_bytes
header = struct.pack("!i", n)
struct.error: 'i' format requires -2147483648 <= number <= 2147483647
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "search.py", line 230, in <module>
main()
File "search.py", line 226, in main
data_sheet = run_data_sheet(data_sheet, target_col, id_col, data_dir, out_dir, eval_metric)
File "search.py", line 181, in run_data_sheet
pipe.fit(X_train, y_train, **automl_settings)
File "/global/home/hpc3552/autotext/flaml_env/lib/python3.6/site-packages/flaml/automl.py", line 950, in fit
self._search()
File "/global/home/hpc3552/autotext/flaml_env/lib/python3.6/site-packages/flaml/automl.py", line 1222, in _search
**self._state.fit_kwargs)
File "/global/home/hpc3552/autotext/flaml_env/lib/python3.6/site-packages/sklearn/ensemble/_stacking.py", line 441, in fit
return super().fit(X, self._le.transform(y), sample_weight)
File "/global/home/hpc3552/autotext/flaml_env/lib/python3.6/site-packages/sklearn/ensemble/_stacking.py", line 149, in fit
for est in all_estimators if est != 'drop'
File "/global/home/hpc3552/autotext/flaml_env/lib/python3.6/site-packages/joblib/parallel.py", line 1054, in __call__
self.retrieve()
File "/global/home/hpc3552/autotext/flaml_env/lib/python3.6/site-packages/joblib/parallel.py", line 933, in retrieve
self._output.extend(job.get(timeout=self.timeout))
File "/global/home/hpc3552/autotext/flaml_env/lib/python3.6/site-packages/joblib/_parallel_backends.py", line 542, in wrap_future_result
return future.result(timeout=timeout)
File "/opt/python/anaconda3/lib/python3.6/concurrent/futures/_base.py", line 432, in result
return self.__get_result()
File "/opt/python/anaconda3/lib/python3.6/concurrent/futures/_base.py", line 384, in __get_result
raise self._exception
struct.error: 'i' format requires -2147483648 <= number <= 2147483647
Interestingly, this error does not occur every time, but only sometimes.
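The innermost frame of the first traceback is `header = struct.pack("!i", n)`, which packs a message length as a signed 32-bit integer. The minimal standard-library sketch below reproduces the same kind of exception for a length just over that limit. Reading this as "a worker tried to send back a result larger than about 2 GiB" is an assumption drawn from the traceback, not a confirmed diagnosis.

```python
import struct

INT32_MAX = 2**31 - 1

# A length that fits in a signed 32-bit integer packs into a 4-byte header.
header = struct.pack("!i", INT32_MAX)
print(len(header))

# One more than the limit cannot be represented and raises struct.error.
try:
    struct.pack("!i", INT32_MAX + 1)
    overflowed = False
except struct.error as exc:
    overflowed = True
    print("struct.error:", exc)
```

If that reading is right, it would also be compatible with the error appearing only sometimes, since the size of the fitted estimators returned by the workers can differ from run to run.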
Hi,
I used pip install and
from flaml import AutoML
gave me the error of
FileNotFoundError: [Errno 2] No such file or directory: 'C:\Users\windwine8700\anaconda3\lib\site-packages\settings.json'
I am using python 3.83. Is there a way to solve this issue? Thanks.
Best,
Jiaqi
I noticed that FLAML will only retrain on the full dataset if the eval_method parameter is set to 'holdout':
Lines 910 to 911 in b04b00d
Why not retrain on the full dataset for other eval_methods, such as 'cv'?
Thanks for this wonderful promising AutoML stack.
May I suggest adding an "export to ONNX/ONNXML" method?
How would I export the best model pipeline to ONNXML now? Using sklearn-onnx?
Currently, the model training logs are written into a log file. Instead of writing into a log file, is there a way to log parameters and metrics using mlflow?
Create features from columns of datetime dtype in DataTransformer's fit_transform() and transform() methods. Right now a simple conversion to float type is used.
Originally posted by @sonichi in #66 (comment)
There are open compliance tasks that need to be reviewed for your FLAML repo.
To bring this repository to the standard required for 2021, we require administrators of this and all Microsoft GitHub repositories to complete a small set of tasks within the next 60 days. This is critical work to ensure the compliance and security of your microsoft GitHub organization.
Please take a few minutes to complete the tasks at: https://repos.opensource.microsoft.com/orgs/microsoft/repos/FLAML/compliance
You can close this work item once you have completed the compliance tasks, or it will automatically close within a day of taking action.
If you no longer need this repository, it might be quickest to delete the repo, too.
More information about GitHub inside Microsoft and the new GitHub AE product can be found at https://aka.ms/gim or by contacting [email protected]
FYI: current admins at Microsoft include @ekzhu, @markusweimer, @qingyun-wu, @sonichi
sklearn's f1_score uses average='binary' by default. This means that for a multiclass problem, the average parameter must be changed to one of ['micro', 'macro', 'weighted', 'samples']. In the ml module, sklearn_metric_loss_score is called without specifying that parameter. Consequently, at the moment, a problem arises if the f1 metric is chosen for a multiclass problem.
One solution could be to set average='samples' when the task is multiclass:softmax.
However, the right choice among the options above depends on the nature of the labels (balanced/unbalanced). It may be interesting to automate the choice of averaging by inspecting the label distribution.
settings = {
"time_budget": TIME_BUDGET,
"metric": 'f1',
"estimator_list": ['lgbm'],
"task": 'classification',
"log_file_name": 'flaml_lgb.log',
}
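To make the difference between averaging modes concrete, here is a dependency-free sketch that computes per-class F1 and then the macro and support-weighted averages by hand. In scikit-learn the equivalent calls would be `f1_score(y_true, y_pred, average="macro")` and `average="weighted"`. The helper names and the toy labels are illustrative only, and this is not how FLAML computes its metric.

```python
def per_class_f1(y_true, y_pred, labels):
    """F1 for each class, treating that class as the positive one."""
    scores = {}
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores[c] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(scores):
    """Unweighted mean: every class counts equally, however rare."""
    return sum(scores.values()) / len(scores)


def weighted_f1(scores, y_true):
    """Mean weighted by class support: frequent classes dominate."""
    return sum(scores[c] * y_true.count(c) for c in scores) / len(y_true)


# Hypothetical imbalanced three-class example.
y_true = [0, 0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 0, 1, 1, 2, 2]

scores = per_class_f1(y_true, y_pred, labels=[0, 1, 2])
macro = macro_f1(scores)
weighted = weighted_f1(scores, y_true)
print(scores, round(macro, 4), round(weighted, 4))
```

On imbalanced labels the two averages diverge, which is why the choice matters. As far as the scikit-learn documentation describes it, `average='samples'` is intended for multilabel targets, so 'macro', 'micro' or 'weighted' may be the more natural candidates for a single-label multiclass task.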
Hi Dr. Wang:
Got a few questions from my team on the content in the log of FLAML.
This is part of the log from one of our tests on FLAML (all the numbers on loss are redacted for compliance reasons):
{"record_id": 0, "iter_per_learner": 1, "logged_metric": false, "trial_time": 1756.8860552310944, "total_search_time": 2590.4430527687073, "validation_loss": XXX, "config": {"max_depth": 6, "n_estimators": 100, "min_child_weight": 10, "subsample": 0.67, "colsample_bylevel": 0.9, "gamma": 0, "learning_rate": 0.07435893300587489}, "best_validation_loss": XXX, "best_config": {"max_depth": 6, "n_estimators": 100, "min_child_weight": 10, "subsample": 0.67, "colsample_bylevel": 0.9, "gamma": 0, "learning_rate": 0.07435893300587489}, "learner": "MonotonicXgboostGBTree", "sample_size": 784536}
{"record_id": 1, "iter_per_learner": 5, "logged_metric": false, "trial_time": 1537.3765320777893, "total_search_time": 13424.922722578049, "validation_loss": XXX, "config": {"max_depth": 4, "n_estimators": 110, "min_child_weight": 1, "subsample": 0.5954399576961257, "colsample_bylevel": 1.0, "gamma": 1e-14, "learning_rate": 0.10828032871243709}, "best_validation_loss": XXX, "best_config": {"max_depth": 4, "n_estimators": 110, "min_child_weight": 1, "subsample": 0.5954399576961257, "colsample_bylevel": 1.0, "gamma": 1e-14, "learning_rate": 0.10828032871243709}, "learner": "MonotonicXgboostGBTree", "sample_size": 784536}
{"record_id": 2, "iter_per_learner": 13, "logged_metric": false, "trial_time": 340.0606036186218, "total_search_time": 34851.914006233215, "validation_loss": XXX, "config": {"max_depth": 5, "num_leaves": 23, "n_estimators": 157, "min_child_weight": 1, "subsample": 0.5112583180636173, "colsample_bylevel": 0.9863382485941592, "min_split_gain": 1e-14, "learning_rate": 0.05875161500234584}, "best_validation_loss": XXX, "best_config": {"max_depth": 5, "num_leaves": 23, "n_estimators": 157, "min_child_weight": 1, "subsample": 0.5112583180636173, "colsample_bylevel": 0.9863382485941592, "min_split_gain": 1e-14, "learning_rate": 0.05875161500234584}, "learner": "MonotonicLightGBMGBDT", "sample_size": 784536}
{"record_id": 3, "iter_per_learner": 18, "logged_metric": false, "trial_time": 270.2408003807068, "total_search_time": 41368.91024374962, "validation_loss": XXX, "config": {"max_depth": 4, "n_estimators": 338, "min_data_in_leaf": 56, "subsample": 0.6614322871324126, "colsample_bylevel": 0.9458919560564311, "learning_rate": 0.23062756268773424}, "best_validation_loss": XXX, "best_config": {"max_depth": 4, "n_estimators": 338, "min_data_in_leaf": 56, "subsample": 0.6614322871324126, "colsample_bylevel": 0.9458919560564311, "learning_rate": 0.23062756268773424}, "learner": "MonotonicCatboost", "sample_size": 784536}
{"record_id": 4, "iter_per_learner": 22, "logged_metric": false, "trial_time": 366.1694631576538, "total_search_time": 43080.46155285835, "validation_loss": XXX, "config": {"max_depth": 4, "n_estimators": 448, "min_data_in_leaf": 46, "subsample": 0.6950654501710251, "colsample_bylevel": 0.956150967914549, "learning_rate": 0.4527543463119874}, "best_validation_loss": XXX, "best_config": {"max_depth": 4, "n_estimators": 448, "min_data_in_leaf": 46, "subsample": 0.6950654501710251, "colsample_bylevel": 0.956150967914549, "learning_rate": 0.4527543463119874}, "learner": "MonotonicCatboost", "sample_size": 784536}
{"record_id": 5, "iter_per_learner": 23, "logged_metric": false, "trial_time": 343.4558777809143, "total_search_time": 45475.49441862106, "validation_loss": XXX, "config": {"max_depth": 4, "n_estimators": 405, "min_data_in_leaf": 58, "subsample": 0.734450014003538, "colsample_bylevel": 0.9644762947991873, "learning_rate": 0.3151376812002405}, "best_validation_loss": XXX, "best_config": {"max_depth": 4, "n_estimators": 405, "min_data_in_leaf": 58, "subsample": 0.734450014003538, "colsample_bylevel": 0.9644762947991873, "learning_rate": 0.3151376812002405}, "learner": "MonotonicCatboost", "sample_size": 784536}
I am wondering:
What does 'iter_per_learner' mean? My understanding is that the output in the log was generated in batch. For example, for record_id 2, does it include 13 or 8 (13-5 from record_id 1) MonotonicLightGBMGBDT models with different sets of hyperparameters?
What does 'trial_time' mean? How is it different from 'total_search_time'?
What is the difference between 'config' and 'best_config' in each record? They all look the same.
If the process reaches the time budget in the middle of an iteration, will it stop immediately or finish the current iteration first before stopping?
Appreciate your help! As you can see from the log, our dataset is quite large (780000+ records and thousands of predictors). Although the fitting is far from over yet, the current optimal result is already as good as what we got using BayesOpt.
Best,
Feedback from sebhrusen (from the automlbenchmark)
CatboostEstimator is creating and filling a catboost_info subfolder in the running directory. We should be able to pass a 'train_dir' param to Catboost to avoid that.
For example at AutoML level, accept a tmpdir and pass it to each algo supporting an equivalent property (or pass a dedicated subfolder, for example tmpdir/catboost for Catboost and so on).
Reference:
openml/automlbenchmark#270
When I set ensemble=True, and my data has categorical features, I get the following error at the end of the FLAML run:
[flaml.automl: 07-08 09:40:44] {1141} INFO - at 9373.5s, best extra_tree's error=0.2056, best rf's error=0.1950
[flaml.automl: 07-08 09:40:44] {993} INFO - iteration 52, current learner rf
[flaml.automl: 07-08 09:41:42] {1141} INFO - at 9431.7s, best rf's error=0.1950, best rf's error=0.1950
[flaml.automl: 07-08 09:41:42] {993} INFO - iteration 53, current learner rf
[flaml.automl: 07-08 09:42:11] {1141} INFO - at 9460.7s, best rf's error=0.1950, best rf's error=0.1950
[flaml.automl: 07-08 09:42:11] {993} INFO - iteration 54, current learner rf
[flaml.automl: 07-08 09:50:15] {1141} INFO - at 9944.4s, best rf's error=0.1949, best rf's error=0.1949
[flaml.automl: 07-08 09:50:15] {1187} INFO - selected model: RandomForestClassifier(criterion='entropy', max_features=0.7294599478674504,
n_estimators=347, n_jobs=10)
[flaml.automl: 07-08 09:50:15] {1197} INFO - [('rf', <flaml.model.RandomForestEstimator object at 0x7fca69effaf0>), ('extra_tree', <flaml.model.ExtraTreeEstimator object at 0x7fca8cc1f8e0>), ('lgbm', <flaml.model.LGBMEstimator object at 0x7fc799985190>), ('catboost', <flaml.model.CatBoostEstimator object at 0x7fca8cc884f0>), ('xgboost', <flaml.model.XGBoostSklearnEstimator object at 0x7fca8cd0e610>)]
/global/home/hpc3552/.conda/envs/myenv/lib/python3.8/site-packages/xgboost/sklearn.py:888: UserWarning: The use of label encoder in XGBClassifier is deprecated and will be removed in a future release. To remove this warning, do the following: 1) Pass option use_label_encoder=False when constructing XGBClassifier object; and 2) Encode your labels (y) as integers starting with 0, i.e. 0, 1, 2, ..., [num_class - 1].
warnings.warn(label_encoder_deprecation_msg, UserWarning)
/global/home/hpc3552/.conda/envs/myenv/lib/python3.8/site-packages/xgboost/sklearn.py:888: UserWarning: The use of label encoder in XGBClassifier is deprecated and will be removed in a future release. To remove this warning, do the following: 1) Pass option use_label_encoder=False when constructing XGBClassifier object; and 2) Encode your labels (y) as integers starting with 0, i.e. 0, 1, 2, ..., [num_class - 1].
warnings.warn(label_encoder_deprecation_msg, UserWarning)
Traceback (most recent call last):
File "search.py", line 212, in <module>
dump_json(data_sheet_file, data_sheet)
File "search.py", line 208, in main
with open(data_sheet_file) as f:
File "search.py", line 163, in run_data_sheet
run['flaml_settings'] = jsonpickle.encode(automl_settings, unpicklable=False, keys=True)
File "/global/home/hpc3552/.conda/envs/myenv/lib/python3.8/site-packages/flaml/automl.py", line 943, in fit
self._search()
File "/global/home/hpc3552/.conda/envs/myenv/lib/python3.8/site-packages/flaml/automl.py", line 1212, in _search
stacker.fit(self._X_train_all, self._y_train_all,
File "/global/home/hpc3552/.conda/envs/myenv/lib/python3.8/site-packages/sklearn/ensemble/_stacking.py", line 441, in fit
return super().fit(X, self._le.transform(y), sample_weight)
File "/global/home/hpc3552/.conda/envs/myenv/lib/python3.8/site-packages/sklearn/ensemble/_stacking.py", line 196, in fit
_fit_single_estimator(self.final_estimator_, X_meta, y,
File "/global/home/hpc3552/.conda/envs/myenv/lib/python3.8/site-packages/sklearn/ensemble/_base.py", line 39, in _fit_single_estimator
estimator.fit(X, y)
File "/global/home/hpc3552/.conda/envs/myenv/lib/python3.8/site-packages/flaml/model.py", line 296, in fit
self._fit(X_train, y_train, **kwargs)
File "/global/home/hpc3552/.conda/envs/myenv/lib/python3.8/site-packages/flaml/model.py", line 78, in _fit
model.fit(X_train, y_train, **kwargs)
File "/global/home/hpc3552/.conda/envs/myenv/lib/python3.8/site-packages/sklearn/ensemble/_forest.py", line 304, in fit
X, y = self._validate_data(X, y, multi_output=True,
File "/global/home/hpc3552/.conda/envs/myenv/lib/python3.8/site-packages/sklearn/base.py", line 433, in _validate_data
X, y = check_X_y(X, y, **check_params)
File "/global/home/hpc3552/.conda/envs/myenv/lib/python3.8/site-packages/sklearn/utils/validation.py", line 63, in inner_f
return f(*args, **kwargs)
File "/global/home/hpc3552/.conda/envs/myenv/lib/python3.8/site-packages/sklearn/utils/validation.py", line 871, in check_X_y
X = check_array(X, accept_sparse=accept_sparse,
File "/global/home/hpc3552/.conda/envs/myenv/lib/python3.8/site-packages/sklearn/utils/validation.py", line 63, in inner_f
return f(*args, **kwargs)
File "/global/home/hpc3552/.conda/envs/myenv/lib/python3.8/site-packages/sklearn/utils/validation.py", line 673, in check_array
array = np.asarray(array, order=order, dtype=dtype)
File "/global/home/hpc3552/.conda/envs/myenv/lib/python3.8/site-packages/numpy/core/_asarray.py", line 83, in asarray
return array(a, dtype, copy=False, order=order)
ValueError: could not convert string to float: '__OTHER__'
This error does not occur if ensemble=False or if I remove (or encode) the categorical features from my dataset.
My guess is that FLAML properly encodes categorical features when training the base estimators (LGBM, RF, etc), but not when training the stacking classifier.
add a task type 'forecast', and at least one forecasting learner, like greykite.
Is there a way to persist the best model in the mlflow runs?
After upgrading to the newest version of FLAML, I am running into the following error when I set ensemble=True:
Traceback (most recent call last):
File "search.py", line 229, in <module>
main()
File "search.py", line 225, in main
data_sheet = run_data_sheet(data_sheet, target_col, id_col, data_dir, out_dir, eval_metric)
File "search.py", line 180, in run_data_sheet
pipe.fit(X_train, y_train, **automl_settings)
File "/global/home/hpc3552/autotext/flaml_env/lib/python3.6/site-packages/flaml/automl.py", line 962, in fit
self._search()
File "/global/home/hpc3552/autotext/flaml_env/lib/python3.6/site-packages/flaml/automl.py", line 1232, in _search
**self._state.fit_kwargs)
File "/global/home/hpc3552/autotext/flaml_env/lib/python3.6/site-packages/sklearn/ensemble/_stacking.py", line 441, in fit
return super().fit(X, self._le.transform(y), sample_weight)
File "/global/home/hpc3552/autotext/flaml_env/lib/python3.6/site-packages/sklearn/ensemble/_stacking.py", line 149, in fit
for est in all_estimators if est != 'drop'
File "/global/home/hpc3552/autotext/flaml_env/lib/python3.6/site-packages/joblib/parallel.py", line 1054, in __call__
self.retrieve()
File "/global/home/hpc3552/autotext/flaml_env/lib/python3.6/site-packages/joblib/parallel.py", line 933, in retrieve
self._output.extend(job.get(timeout=self.timeout))
File "/global/home/hpc3552/autotext/flaml_env/lib/python3.6/site-packages/joblib/_parallel_backends.py", line 542, in wrap_future_result
return future.result(timeout=timeout)
File "/opt/python/anaconda3/lib/python3.6/concurrent/futures/_base.py", line 432, in result
return self.__get_result()
File "/opt/python/anaconda3/lib/python3.6/concurrent/futures/_base.py", line 384, in __get_result
raise self._exception
TypeError: __init__() got an unexpected keyword argument '_estimator_type'
My call to FLAML:
automl_settings = {
"time_budget": search_time,
"task": 'classification',
"log_file_name": "{}/flaml-{}.log".format(out_dir, runname),
"n_jobs": 10,
"estimator_list": ['lgbm', 'xgboost', 'rf', 'extra_tree', 'catboost'],
"model_history": True,
"eval_method": "cv",
"n_splits": 3,
"metric": eval_metric,
"log_training_metric": True,
"verbose": 1,
"ensemble": True,
}
pipe = AutoML()
pipe.fit(X_train, y_train, **automl_settings)
This issue goes away if I change ensemble to False.
Here are my environment details:
$ pip list
Package Version
------------------ --------
catboost 0.26
ConfigSpace 0.4.19
cycler 0.10.0
Cython 0.29.23
FLAML 0.5.6
graphviz 0.16
importlib-metadata 4.6.1
joblib 1.0.1
jsonpickle 2.0.0
kiwisolver 1.3.1
lightgbm 3.2.1
matplotlib 3.3.4
numpy 1.19.5
pandas 1.1.5
Pillow 8.3.1
pip 21.1.3
plotly 5.1.0
pyparsing 2.4.7
python-dateutil 2.8.1
pytz 2021.1
scikit-learn 0.24.2
scipy 1.5.4
setuptools 40.6.2
six 1.16.0
tenacity 8.0.0
threadpoolctl 2.1.0
typing-extensions 3.10.0.0
wheel 0.36.2
xgboost 1.4.2
zipp 3.5.0
$ python --version
Python 3.6.8 :: Anaconda custom (64-bit)
In code, step size is reduced with the following:
if self._num_proposedby_incumbent == self.dir and (
        not self._resource or self._resource == self.max_resource):
    # check stuck condition if using max resource
    if self.step >= self.step_lower_bound:
        # decrease step size
        self._oldK = self._K if self._K else self._iter_best_config
        self._K = self.trial_count_proposed + 1
        self.step *= np.sqrt(self._oldK / self._K)
    self._num_proposedby_incumbent -= 2
However, the algorithm description in the FLOW2 paper shows that:
From this, we can see that k' (_oldK in code) is only changed whenever a new best score is obtained. However, in the current implementation, k' always becomes the previous k instead. This seems counter-intuitive to me, as the step size multiplier will reduce much slower than in the paper implementation, thus making FLOW2 spend more time evaluating a configuration that has most likely already converged.
I believe that the implementation consistent with the paper would be:
if self._num_proposedby_incumbent == self.dir and (
        not self._resource or self._resource == self.max_resource):
    # check stuck condition if using max resource
    if self.step >= self.step_lower_bound:
        # decrease step size
        self._oldK = self._iter_best_config  # change here
        self._K = self.trial_count_proposed + 1
        self.step *= np.sqrt(self._oldK / self._K)
    self._num_proposedby_incumbent -= 2
I have run some trials with this change and it seems to be working as intended, at least for my purposes: converged configurations are eliminated more aggressively.
Am I understanding all of this correctly? Is this an oversight in the code, or has this been changed after the paper was published?
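The difference between the two update rules can be seen numerically with a small standalone sketch. It only models the multiplier `sqrt(k' / K)`; the iteration numbers are hypothetical, the function name is made up, and it assumes `_K` starts out falsy so that the first shrink uses the best-config iteration in both variants. The rest of the FLOW2 bookkeeping is ignored.

```python
import math


def step_sizes(initial_step, best_iter, shrink_iters, paper_rule):
    """Return the step size after each shrink event.

    best_iter    : iteration at which the current best config was found (k').
    shrink_iters : iterations K at which the step size is reduced.
    paper_rule   : True  -> k' stays fixed at best_iter (proposed change);
                   False -> k' becomes the previous K (behavior in the issue).
    """
    step = initial_step
    old_k = best_iter
    history = []
    for k in shrink_iters:
        if paper_rule:
            old_k = best_iter
        step *= math.sqrt(old_k / k)
        history.append(step)
        old_k = k  # only matters when paper_rule is False
    return history


# Hypothetical numbers: best config found at iteration 10, three shrink events.
current = step_sizes(1.0, best_iter=10, shrink_iters=[20, 30, 40], paper_rule=False)
proposed = step_sizes(1.0, best_iter=10, shrink_iters=[20, 30, 40], paper_rule=True)
print([round(s, 4) for s in current])
print([round(s, 4) for s in proposed])
```

Under the first rule the factors telescope, so after the last event the step is simply `sqrt(best_iter / K_last)` times the initial step. Under the second rule every factor is measured from `best_iter`, so the product shrinks noticeably faster, which matches the more aggressive elimination reported above.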
At line 394 of flaml/model.py, the train() method does not accept weight as a legitimate argument.
Instead, the weight should be specified at line 390 when creating dtrain.
I am using FLAML in Django views:
X_train, X_test, y_train, y_test = train_test_split(df.copy(), train_size=selectedTrainingPercentage)
automl = AutoML()
settings = {
"time_budget": 60, # total running time in seconds
"metric": 'r2', # primary metrics for regression can be chosen from: ['mae','mse','r2']
# list of ML learners; we tune xgboost in this example
"task": 'regression', # task type
}
print('fitting')
automl.fit(X_train=X_train, y_train=y_train, **settings)
print('fit complete')
And the fitting stops at iteration 0:
However, it works completely fine if I change the metric to mae or mse rather than r2.
When DataTransformer's fit_transform method is called and some columns have a datetime format, an error is raised in sklearn\utils\validation.py.
I fixed it by converting any datetime columns to ordinals with datetime.toordinal.
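A minimal sketch of that workaround, using only the standard library, is shown below. The sample timestamps are invented, and the optional calendar features are just one possible alternative representation, not what DataTransformer does.

```python
from datetime import datetime

# Hypothetical values standing in for one datetime column.
timestamps = [
    datetime(2021, 1, 1, 12, 30),
    datetime(2021, 6, 15),
    datetime(2021, 12, 31, 23, 59),
]

# toordinal() maps each value to a day count (proleptic Gregorian calendar),
# which is numeric and therefore acceptable to sklearn's input validation.
# Note that the time of day is discarded.
ordinals = [ts.toordinal() for ts in timestamps]

# Alternative: expand each value into simple calendar features.
features = [
    {
        "year": ts.year,
        "month": ts.month,
        "day": ts.day,
        "dayofweek": ts.weekday(),  # Monday is 0
        "hour": ts.hour,
    }
    for ts in timestamps
]

print(ordinals)
print(features[0])
```

The ordinal form preserves ordering and spacing between dates in a single column, while the expanded form exposes seasonality (month, weekday, hour) that a tree model cannot easily recover from a day count.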
Hi:
There is an error message when fitting models using customized monotonic learners with ensemble = True:
RuntimeError: Cannot clone object <__main__.MyMonotonicLightGBMGBDTClassifier object at 0x7f9ef2999310>, as the constructor either does not set or modifies parameter monotone_constraints
I assume it is due to the monotone_constraints added to self.params. Any suggestion on how to fix it?
Usually we won't implement an ensemble of boosting models but would be great if we can figure out a solution!
Thank you.
Hi,
While training the learner, a console output is generated, which can take up huge space in the notebook if the time_budget is made large. If I wish to suppress the console output while training my learner, how do I do that? In keras, sklearn, etc., setting verbose = 0 suppresses the console output.
Thanks!
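One generic way to quiet such output is through Python's `logging` module. The sketch below assumes, based only on the `[flaml.automl: ...] INFO` prefix seen in the logs quoted on this page, that the messages are emitted through a logger named `flaml.automl`. That name is an inference, and if the library sets the logger level itself during fit() this setting could be overridden.

```python
import logging

# Assumption: the progress lines come from a logger named "flaml.automl".
logger = logging.getLogger("flaml.automl")

# Raise the threshold so INFO-level progress lines are dropped
# while warnings and errors still get through.
logger.setLevel(logging.WARNING)

info_enabled = logger.isEnabledFor(logging.INFO)
warning_enabled = logger.isEnabledFor(logging.WARNING)
print(info_enabled, warning_enabled)
```

The settings dictionaries quoted in other issues on this page include a "verbose" key, so passing a lower value there may be the more direct control where it is supported.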
Hi:
I've received the error message below with lrl1 when using FLAML in RStudio via reticulate:
[flaml.automl: 04-02 14:24:16] {986} INFO - iteration 0 current learner lrl1
NameError: name '_' is not defined
Interestingly, the same code ran well in Jupyter. The versions of scikit-learn in the two environments are the same.
Any ideas?
Thank you.
Some helpful documentation for future contributors, may include:
Congratulations on releasing FLAML and I look forward to contributing to it. Would it be possible for you to enable discussions?
Here is how you enable it :
https://docs.github.com/en/discussions/quickstart
Thanks,
Sandeep
Hi everyone!!
I've received the attribute error message below when using FLAML with XGBoost (this error occurs with other algorithms too):
[flaml.automl: 07-01 10:45:34] {908} INFO - Evaluation method: cv
[flaml.automl: 07-01 10:45:34] {607} INFO - Using StratifiedKFold
[flaml.automl: 07-01 10:45:34] {929} INFO - Minimizing error metric: 1-roc_auc
[flaml.automl: 07-01 10:45:34] {949} INFO - List of ML learners in AutoML Run: ['xgboost']
[flaml.automl: 07-01 10:45:34] {1013} INFO - iteration 0, current learner xgboost
Traceback (most recent call last):
File "ft2.py", line 33, in <module>
automl.fit(X_train=X, y_train=y, **settings)
File "/scratch/luizhemelo/anaconda3/lib/python3.7/site-packages/flaml/automl.py", line 962, in fit
self._search()
File "/scratch/luizhemelo/anaconda3/lib/python3.7/site-packages/flaml/automl.py", line 1081, in _search
use_ray=False)
File "/scratch/luizhemelo/anaconda3/lib/python3.7/site-packages/flaml/tune/tune.py", line 270, in run
search_alg.set_search_properties(metric, mode, config={
AttributeError: 'ConcurrencyLimiter' object has no attribute 'set_search_properties'
Parameters used:
settings = {
"time_budget": 108000,
"metric": 'roc_auc',
"task": 'classification',
"n_jobs": -1,
"estimator_list": ['xgboost'],
"n_splits": 5,
"log_file_name": 'ft.log',
}
Specifications:
Python 3.7.10
FLAML 0.5.4 (installed via PiP)
XGBoost 1.4.0 (installed via conda)
Any ideas?
Thanks! :D
Hi,
I'm trying to tune lightgbm for a regression problem and need to use groupKFold for cross-validation.
By default, automl.fit() uses RepeatedKFold as the split_type. I looked through the documentation but couldn't find details about this, or about how to pass the groups argument to it.
Thanks in advance.
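For readers unfamiliar with grouped cross-validation, here is a minimal pure-Python sketch of the idea behind the question: every row that shares a group label lands in the same fold, so no group appears on both the training and validation side. The function name `group_k_fold` and the sample `groups` list are hypothetical and used only for illustration; this is not FLAML's API, and scikit-learn's `GroupKFold` is the usual production implementation of the same idea.

```python
from collections import defaultdict


def group_k_fold(groups, n_splits):
    """Yield (train_idx, val_idx) pairs in which no group spans both sides."""
    # Collect the row indices that belong to each group label.
    rows_by_group = defaultdict(list)
    for idx, label in enumerate(groups):
        rows_by_group[label].append(idx)
    if n_splits > len(rows_by_group):
        raise ValueError("n_splits cannot exceed the number of distinct groups")

    # Greedy balancing: place the largest groups first, always into the
    # fold that currently holds the fewest rows.
    folds = [[] for _ in range(n_splits)]
    for label in sorted(rows_by_group, key=lambda g: -len(rows_by_group[g])):
        smallest = min(folds, key=len)
        smallest.extend(rows_by_group[label])

    for k in range(n_splits):
        val_idx = sorted(folds[k])
        train_idx = sorted(
            i for j, fold in enumerate(folds) if j != k for i in fold
        )
        yield train_idx, val_idx


# Hypothetical group labels, one per training row.
groups = ["a", "a", "b", "b", "b", "c", "d", "d"]
splits = list(group_k_fold(groups, n_splits=2))
for train_idx, val_idx in splits:
    print("train:", train_idx, "val:", val_idx)
```

Whether and how a given FLAML version accepts such a splitter or a `groups` argument is exactly what the question asks, so the sketch deliberately stays outside FLAML.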
I fit a model with both the RGF in the sample codes and a few other default learners:
settings = {
    "time_budget": 120, # total running time in seconds
    "metric": 'roc_auc',
    "estimator_list": ['lgbm', 'rf', 'RGF'], # list of ML learners
    "task": 'classification', # task type
    "sample": True, # whether to subsample training data
    "log_file_name": 'airlines_experiment_with_ensemble.log', # cache directory of flaml log files
    "log_training_metric": True, # whether to log training metric
}
automl.fit(X_train = X_train, y_train = y_train, ensemble=True, **settings)
I received an error message: TypeError: __init__() got an unexpected keyword argument '_estimator_type'
I got similar results when using other customized learners with unique hyperparameters.
Moreover, how can I pull the details of the ensemble? I did not see it in the log file.
Thank you.
Hi Chi
Amazing work! Could you create an R library for it, too? There is still a large portion of potential users working in R.
Best,
I would like to define & use a custom evaluation metric.
Is there a way to handle imbalanced datasets in the automl?
I'm trying to use a custom metric. I'm using the one from the test case:
Line 92 in 0604570
This works fine when eval_method is set to its default value of "holdout". But if I change this to "cv", I get an error as follows:
Hi:
Is there a way to pull the number of iterations completed by automl() for each learner, please? I know it can be found in the log if I set log_type to 'all', but can I pull it directly?
Assuming all the default learners are used, it would be great to get the information in a table like the one below:
| Learner | Iterations Completed |
|---|---|
| Xgboost | 100 |
| LightGBM | 200 |
| Catboost | 150 |
| RF | 50 |
Thank you!
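Until a direct accessor is confirmed, one workaround is to count the "iteration N, current learner X" lines that FLAML prints to its console log, as seen in the log excerpts elsewhere on this page. The sketch below assumes that line format stays stable. The function name `iterations_per_learner` is hypothetical, and the second and third sample lines are made up for illustration. Note that it counts iterations started, which may differ by one from iterations completed if the run is interrupted.

```python
import re
from collections import Counter

# Matches lines such as:
# [flaml.automl: 07-01 10:45:34] {1013} INFO - iteration 0, current learner xgboost
LINE_RE = re.compile(r"iteration (\d+), current learner (\S+)")


def iterations_per_learner(log_text):
    """Count how many search iterations were started for each learner."""
    counts = Counter()
    for match in LINE_RE.finditer(log_text):
        counts[match.group(2)] += 1
    return dict(counts)


# The first line is copied from a report on this page; the others are invented.
sample_log = """\
[flaml.automl: 07-01 10:45:34] {1013} INFO - iteration 0, current learner xgboost
[flaml.automl: 07-01 10:45:35] {1013} INFO - iteration 1, current learner lgbm
[flaml.automl: 07-01 10:45:36] {1013} INFO - iteration 2, current learner xgboost
"""

counts = iterations_per_learner(sample_log)
print(counts)
```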
Dear all,
I have been trying FLAML for a few days now and I believe I stumbled across a bug in the evaluation of the model when using cross-validation (eval_method="cv").
I believe that only the last fold is taken into account in the function evaluate_model_CV (ml.py). The list of validation scores (val_loss_list) is updated only on the last fold, or when the remaining budget is no longer sufficient, so val_loss_list contains a single item in all cases. Moreover, what is appended to the list is not the validation score of the current fold, but the mean of the validation scores of the first "valid_fold_num" folds.
I would suggest the following to replace lines 220--226 in ml.py:
    val_loss_list.append(val_loss_i)
    if valid_fold_num == n:
        total_val_loss = valid_fold_num = 0
    elif time.time() - start_time >= budget:
        break
val_loss = np.max(val_loss_list)
One might also consider changing the last line of the snippet above, or making it configurable. As written, it takes the maximum of the per-fold validation scores. Another common choice is the average of the per-fold scores. This could be a user option, but it is not a bug per se, and I am also fine with keeping the maximum. (Note that the current code effectively uses the mean across folds, since it divides total_val_loss by the number of folds.)
Best
David
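The behaviour proposed above can be sketched independently of FLAML as follows: record one loss per fold, stop early if the time budget runs out, and only then aggregate with either the mean or the maximum. This is a minimal illustration under assumptions, not the code in ml.py. The function name `cross_validation_loss`, the `fold_evaluators` argument and the sample loss values are all hypothetical.

```python
import time
from statistics import mean


def cross_validation_loss(fold_evaluators, budget, aggregate=mean):
    """Evaluate folds one by one and aggregate their validation losses.

    fold_evaluators: iterable of zero-argument callables, each returning
        the validation loss of one fold (hypothetical interface).
    budget: time limit in seconds; remaining folds are skipped once exceeded.
    aggregate: function applied to the list of per-fold losses.
    """
    start_time = time.time()
    val_loss_list = []
    for evaluate_fold in fold_evaluators:
        # Record the loss of this fold on its own, not a running mean.
        val_loss_list.append(evaluate_fold())
        if time.time() - start_time >= budget:
            break
    return aggregate(val_loss_list), val_loss_list


# Three made-up folds with fixed losses.
folds = [lambda: 0.10, lambda: 0.30, lambda: 0.20]
mean_loss, per_fold = cross_validation_loss(folds, budget=60)
worst_loss, _ = cross_validation_loss(folds, budget=60, aggregate=max)
print(per_fold, mean_loss, worst_loss)
```

Passing the aggregation function as a parameter is one way to offer the mean-versus-max choice discussed above without changing the loop itself.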
Currently, pandas.DataFrame input is cleaned.
Consider performing the same preparation steps for numpy.ndarray inputs.
Hi Chi:
Thank you for the cool work! Could I enforce monotonicity in the main automl.fit() function? If so, what algorithms can be chosen in the estimator list?
Best,
If I leave out X_val and y_val, automl works fine. But if I specify these values, it crashes with the following error:
----> 7 automl.fit(X_train= xtrain,y_train=ytrain,X_val=xvalid,y_val=yvalid,**automl_settings)
~\anaconda3\lib\site-packages\flaml\automl.py in fit(self, X_train, y_train, dataframe, label, metric, task, n_jobs, log_file_name, estimator_list, time_budget, max_iter, sample, ensemble, eval_method, log_type, model_history, split_ratio, n_splits, log_training_metric, mem_thres, X_val, y_val, sample_weight_val, retrain_full, split_type, learner_selector, hpo_method, **fit_kwargs)
832 self._state.fit_kwargs = fit_kwargs
833 self._state.weight_val = sample_weight_val
--> 834 self._validate_data(X_train, y_train, dataframe, label, X_val, y_val)
835 self._search_states = {} #key: estimator name; value: SearchState
836 self._random = np.random.RandomState(RANDOM_SEED)
~\anaconda3\lib\site-packages\flaml\automl.py in _validate_data(self, X_train_all, y_train_all, dataframe, label, X_val, y_val)
434 "# rows in X_val must match length of y_val.")
435 if self._transformer:
--> 436 self._state.X_val = self._transformer.transform(X_val)
437 else:
438 self._state.X_val = X_val
~\anaconda3\lib\site-packages\flaml\data.py in transform(self, X)
251 X[cat_columns] = X[cat_columns].astype('category')
252 if num_columns:
--> 253 X[num_columns].fillna(np.nan, inplace=True)
254 X[num_columns] = self.transformer.transform(X)
255 return X
~\anaconda3\lib\site-packages\pandas\core\frame.py in fillna(self, value, method, axis, inplace, limit, downcast)
4315 downcast=None,
4316 ) -> Optional["DataFrame"]:
-> 4317 return super().fillna(
4318 value=value,
4319 method=method,
~\anaconda3\lib\site-packages\pandas\core\generic.py in fillna(self, value, method, axis, inplace, limit, downcast)
6086 result = self._constructor(new_data)
6087 if inplace:
-> 6088 return self._update_inplace(result)
6089 else:
6090 return result.__finalize__(self, method="fillna")
~\anaconda3\lib\site-packages\pandas\core\generic.py in _update_inplace(self, result, verify_is_copy)
3962 self._clear_item_cache()
3963 self._mgr = result._mgr
-> 3964 self._maybe_update_cacher(verify_is_copy=verify_is_copy)
3965
3966 def add_prefix(self: FrameOrSeries, prefix: str) -> FrameOrSeries:
~\anaconda3\lib\site-packages\pandas\core\generic.py in _maybe_update_cacher(self, clear, verify_is_copy)
3243
3244 if verify_is_copy:
-> 3245 self._check_setitem_copy(stacklevel=5, t="referant")
3246
3247 if clear:
~\anaconda3\lib\site-packages\pandas\core\generic.py in _check_setitem_copy(self, stacklevel, t, force)
3679
3680 if value == "raise":
-> 3681 raise com.SettingWithCopyError(t)
3682 elif value == "warn":
3683 warnings.warn(t, com.SettingWithCopyWarning, stacklevel=stacklevel)
SettingWithCopyError:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
Hi:
Our team has explored the ensemble option in the fit function of automl and got a few errors:
from flaml.data import load_openml_dataset
X_train, X_test, y_train, y_test = load_openml_dataset(dataset_id = 1169, data_dir = './')
settings = {
    "time_budget": 40,
    "metric": 'roc_auc',
    "task": 'classification',
    "estimator_list": ['lrl1', 'lrl2', 'lgbm', 'xgboost'],
    "log_file_name": 'airlines_experiment.log',
}
automl.fit(X_train = X_train, y_train = y_train, ensemble=True, **settings)
[flaml.automl: 03-18 17:34:40] {1157} INFO - [('xgboost', <flaml.model.XGBoostSklearnEstimator object at 0x7f61f8659ed0>), ('lgbm', <flaml.model.LGBMEstimator object at 0x7f61f8687350>), ('lrl2', <flaml.model.LRL2Classifier object at 0x7f61f8687090>), ('lrl1', <flaml.model.LRL1Classifier object at 0x7f61f8654150>)]
RuntimeError: Cannot clone object <flaml.model.LRL2Classifier object at 0x7f84877a1c10>, as the constructor either does not set or modifies parameter penalty.
This is similar to the error we've discussed before.
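For context on the RuntimeError above: scikit-learn's clone rebuilds an estimator from its constructor parameters and then checks that each parameter was stored unchanged. The sketch below is a simplified imitation of that check in plain Python, written only to illustrate the contract; it is neither scikit-learn's nor FLAML's actual code. The names `simple_clone`, `Good` and `Bad` are hypothetical.

```python
import inspect


def init_param_names(cls):
    """Names of the explicit keyword parameters of cls.__init__."""
    signature = inspect.signature(cls.__init__)
    return [
        name
        for name, param in signature.parameters.items()
        if name != "self"
        and param.kind is inspect.Parameter.POSITIONAL_OR_KEYWORD
    ]


def simple_clone(estimator):
    """Rebuild an estimator from its constructor parameters (simplified)."""
    cls = type(estimator)
    params = {name: getattr(estimator, name) for name in init_param_names(cls)}
    new_estimator = cls(**params)
    for name, value in params.items():
        # The constructor must store each argument as-is under the same name.
        if getattr(new_estimator, name) is not value:
            raise RuntimeError(
                f"constructor either does not set or modifies parameter {name}"
            )
    return new_estimator


class Good:
    def __init__(self, penalty="l2", C=1.0):
        self.penalty = penalty  # stored unchanged
        self.C = C


class Bad:
    def __init__(self, penalty="l2", C=1.0):
        self.penalty = [penalty]  # modified before storing: breaks the contract
        self.C = C


clone_ok = simple_clone(Good(penalty="l1"))
try:
    simple_clone(Bad(penalty="l1"))
    bad_error = None
except RuntimeError as exc:
    bad_error = str(exc)
print(clone_ok.penalty, "|", bad_error)
```

An estimator whose constructor rewrites or drops an argument, as `Bad` does here, fails this kind of check, which is consistent with the "does not set or modifies parameter penalty" message reported above.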
class MyMonotonicXGBGBTreeClassifier(BaseEstimator):
    def __init__(self, task = 'binary:logistic', n_jobs = num_cores, **params):
        super().__init__(task, **params)
        self.estimator_class = XGBClassifier
        # convert to int for integer hyperparameters
        self.params = {
            'n_jobs': params['n_jobs'] if 'n_jobs' in params else num_cores,
            'booster': params['booster'] if 'booster' in params else 'gbtree',
            'learning_rate': params['learning_rate'],
            'gamma': params['gamma'],
            'max_depth': int(params['max_depth']),
            'min_child_weight': int(params['min_child_weight']),
            'subsample': params['subsample'],
            'colsample_bylevel':params['colsample_bylevel'],
            'n_estimators':int(params['n_estimators']),
            'reg_lambda': params['reg_lambda'],
            'reg_alpha': params['reg_alpha'],
            'random_state': params['random_state'] if 'random_state' in params else randomseed,
            "monotone_constraints": params['monotone_constraints'] if 'monotone_constraints' in params else monotone,
        }
    @classmethod
    def search_space(cls, data_size, task):
        space = {
            'max_depth': {'domain': tune.uniform(lower=4, upper=15), 'init_value': 8},
            'n_estimators': {'domain': tune.uniform(lower = 50, upper = 800), 'init_value': 200},
            'min_child_weight': {'domain': tune.uniform(lower = 1, upper = 1000), 'init_value': 100},
            'subsample': {'domain': tune.uniform(lower = 0.7, upper = 1), 'init_value': 0.7},
            'colsample_bylevel': {'domain': tune.uniform(lower = 0.6, upper = 1), 'init_value': 0.8},
            'learning_rate': {'domain': tune.loguniform(lower = 0.001, upper = 1), 'init_value': 0.1},
            'gamma': {'domain': tune.loguniform(lower = 0.000000000001, upper = 0.001), 'init_value': 0.00001},
            'reg_lambda': {'domain': tune.loguniform(lower = 0.000000000001, upper = 1), 'init_value': 1},
            'reg_alpha': {'domain': tune.loguniform(lower = 0.000000000001, upper = 1), 'init_value': 0.000000000001},
        }
        return space
class MyMonotonicLightGBMGBDTClassifier(BaseEstimator):
    def __init__(self, task = 'binary:logistic', n_jobs = num_cores, **params):
        super().__init__(task, **params)
        self.estimator_class = LGBMClassifier
        # convert to int for integer hyperparameters
        self.params = {
            'n_jobs': params['n_jobs'] if 'n_jobs' in params else num_cores,
            'boosting_type':params['boosting_type'] if 'boosting_type' in params else 'gbdt',
            'learning_rate': params['learning_rate'],
            'min_split_gain': params['min_split_gain'],
            'max_depth': int(params['max_depth']),
            'min_data_in_leaf': int(params['min_data_in_leaf']),
            'min_sum_hessian_in_leaf': params['min_sum_hessian_in_leaf'],
            'subsample': params['subsample'],
            'colsample_bytree':params['colsample_bytree'],
            'n_estimators':int(params['n_estimators']),
            'subsample_freq':int(params['subsample_freq']),
            'reg_lambda': params['reg_lambda'],
            'reg_alpha': params['reg_alpha'],
            'random_state': params['random_state'] if 'random_state' in params else randomseed,
            "monotone_constraints":params['monotone_constraints'] if 'monotone_constraints' in params else monotone,
        }
    @classmethod
    def search_space(cls, data_size, task):
        space = {
            'max_depth': {'domain': tune.uniform(lower=4, upper=15), 'init_value': 8},
            'subsample_freq': {'domain': tune.uniform(lower=1, upper=10), 'init_value': 5},
            'n_estimators': {'domain': tune.uniform(lower = 50, upper = 800), 'init_value': 200},
            'min_data_in_leaf': {'domain': tune.uniform(lower = 1, upper = 1000), 'init_value': 100},
            'min_sum_hessian_in_leaf': {'domain': tune.loguniform(lower = 0.000001, upper = 0.1), 'init_value': 0.001},
            'subsample': {'domain': tune.uniform(lower = 0.5, upper = 1), 'init_value': 0.67},
            'colsample_bytree': {'domain': tune.uniform(lower = 0.5, upper = 1), 'init_value': 0.9},
            'learning_rate': {'domain': tune.loguniform(lower = 0.001, upper = 1), 'init_value': 0.1},
            'min_split_gain': {'domain': tune.loguniform(lower = 0.000000000001, upper = 0.001), 'init_value': 0.00001},
            'reg_lambda': {'domain': tune.loguniform(lower = 0.000000000001, upper = 1), 'init_value': 1},
            'reg_alpha': {'domain': tune.loguniform(lower = 0.000000000001, upper = 1), 'init_value': 0.000000000001},
        }
        return space
Without the ensemble, both worked well as individual learners. However, when we set ensemble=True, the monotonic xgboost learner still worked well but the process always crashed if the monotonic lightGBM learner was included in the list of estimators. The kernel of Jupyter just went dead without any error message. In the .out file generated at the backend, there is an error message:
[LightGBM] [Fatal] Check failed: static_cast<size_t>(num_total_features_) == io_config.monotone_constraints.size() at /__w/1/s/python-package/compile/src/io/dataset.cpp, line 314
What does it mean? It seems that something is wrong with the monotone_constraints but the size of the constraints matches the number of variables.
This error can be replicated using the airlines data; to make it easier just let monotone=(0, 0, 0, 0, 0, 0, 0).
Appreciate your help.