microsoft / flaml

A fast library for AutoML and tuning. Join our Discord: https://discord.gg/Cppx2vSPVP.

Home Page: https://microsoft.github.io/FLAML/

License: MIT License

Python 21.69% Jupyter Notebook 77.56% Dockerfile 0.03% JavaScript 0.12% CSS 0.06% MDX 0.54%

automl hyperparam automated-machine-learning machine-learning data-science python jupyter-notebook hyperparameter-optimization random-forest scikit-learn

flaml's Issues

FLAML: New feature

My team is working on a multiclass classification model for predicting the workload type (POC, Prod, Dev, Test) of Azure services like SQLDW, Synapse and SQLDB. We replaced GridSearch/XGBoost with FLAML XGBoost for better performance. Since this is multiclass classification, we implemented additional metrics for our final model, namely a normalized confusion matrix, a precision-recall curve and a ROC curve, using OneVsRestClassifier to binarize the labels. This lets us measure prediction performance for each individual workload type, in addition to the accuracy, precision and recall of the overall model. This seems like a common requirement that other FLAML users might have, and it would be valuable to add these features for multiclass classification models.

The link to access the jupyter notebook for multiclass classification is
https://microsoft.sharepoint.com/:u:/t/AzureDataUXBA-DataEngineeringandAnalysis/ETY_DWyvPXBEl2S-R5C6rVUBFa0fvbnE9V7KSzAC3H8uMQ?e=hgxKmb

The implementation of the above metrics is in the last section of the notebook (5. Metrics).
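For readers without access to the notebook, here is a minimal sketch of the kind of per-class metrics described above. It is not the notebook's code: the data is synthetic, the classifier is a placeholder, and it uses label_binarize rather than OneVsRestClassifier to get one-vs-rest labels.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import label_binarize

# Synthetic 4-class data standing in for the workload types (POC, Prod, Dev, Test).
X, y = make_classification(n_samples=600, n_features=12, n_informative=6,
                           n_classes=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

# Placeholder model; any classifier exposing predict_proba would do.
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
proba = clf.predict_proba(X_test)

# Row-normalized confusion matrix: each row sums to 1, the diagonal is per-class recall.
cm = confusion_matrix(y_test, clf.predict(X_test), normalize="true")

# One-vs-rest ROC AUC per class, computed from binarized labels.
classes = clf.classes_
y_bin = label_binarize(y_test, classes=classes)
per_class_auc = {int(c): roc_auc_score(y_bin[:, i], proba[:, i])
                 for i, c in enumerate(classes)}

print(np.round(cm, 2))
print(per_class_auc)
```

The same binarized labels can be passed to precision_recall_curve or roc_curve, one class at a time, to draw the per-class curves.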

Write a blog post

More details on low_cost_partial_config?

fit() is stopping early with the following message:

[flaml.automl: 06-17 12:55:08] {1013} INFO - iteration 41, current learner lrl1
No low-cost partial config given to the search algorithm. For cost-frugal search, consider providing low-cost values for cost-related hps via 'low_cost_partial_config'.
/usr/local/lib/python3.7/dist-packages/sklearn/linear_model/_sag.py:329: ConvergenceWarning: The max_iter was reached which means the coef_ did not converge
  "the coef_ did not converge", ConvergenceWarning)

I cannot find any details in the documentation as to what exactly low_cost_partial_config is, or what I should be setting it to. Any pointers or guidance would be appreciated.

Question: how does FLAML handle categorical features?

Hi,

I am trying to learn how FLAML handles categorical features - i.e., which encoding methods (e.g., OneHotEncoding, OrdinalEncoding) are used.

I looked through the following code:

FLAML/flaml/data.py

Line 187 in c4c15f5

class DataTransformer:

but I can't see where the categorical features are actually encoded.

Also, I was wondering if different estimators will use different encodings? E.g., OrdinalEncoder for lgbm and OHE for RandomForest?
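To make the two encodings named in the question concrete, here is a small scikit-learn sketch. It only illustrates what OrdinalEncoder and OneHotEncoder produce for a toy column. It does not show what FLAML's DataTransformer does internally, which is exactly what this issue is asking about.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Toy categorical column (hypothetical data).
df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# Ordinal encoding: one integer code per category, kept in a single column.
ordinal = OrdinalEncoder().fit_transform(df[["color"]])

# One-hot encoding: one indicator column per category.
# The default output is a sparse matrix, so convert it for display.
onehot = OneHotEncoder().fit_transform(df[["color"]]).toarray()

print(ordinal.ravel())
print(onehot)
```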

create a community

and an email

get_output_from_log returns empty objects

Hi:

get_output_from_log returns empty objects for me with no error message:

from flaml.data import get_output_from_log
get_output_from_log(filename = 'test.log', time_budget = 600)

([], [], [], [], [])

However, when I ran your sample codes in the notebook, this function worked well. Moreover, my test log file exists and can be accessed in Jupyter using other methods.

I am using 0.2.5 and my settings are:

settings = {
    "time_budget": 3600,
    'eval_method': 'cv',
    'max_iter': 100,
    'n_splits': 5,
    'log_type': 'all',
    "metric": 'roc_auc',
    "task": 'classification',
    "log_file_name": 'test.log',
    "log_training_metric": False
}

Any ideas?

Appreciate your help! So far the feedback of FLAML from our users is overwhelmingly positive. Great work!

n_estimators of best model is really large (32768)

After running FLAML for several hours, I noticed in the log that the best model was xgboost with n_estimators set to 32768:

{"record_id": 24, "iter_per_learner": 55, "logged_metric": false, "trial_time": 1509.8762745857239, "total_search_time": 41334.96948957443, "validation_loss": 0.25501297475819773, "config": {"n_estimators": 32768.0, "max_leaves": 186.0, "min_child_weight": 0.22536063808245474, "learning_rate": 0.05398963108662436, "subsample": 0.9173715591862044, "colsample_bylevel": 0.9005345477364418, "colsample_bytree": 0.6104797018735161, "reg_alpha": 0.0009765625, "reg_lambda": 1.92166667176985, "FLAML_sample_size": 55408}, "best_validation_loss": 0.25501297475819773, "best_config": {"n_estimators": 32768.0, "max_leaves": 186.0, "min_child_weight": 0.22536063808245474, "learning_rate": 0.05398963108662436, "subsample": 0.9173715591862044, "colsample_bylevel": 0.9005345477364418, "colsample_bytree": 0.6104797018735161, "reg_alpha": 0.0009765625, "reg_lambda": 1.92166667176985, "FLAML_sample_size": 55408}, "learner": "xgboost", "sample_size": 55408}
{"curr_best_record_id": 24}

But that seems excessively large to me and will surely result in an overfit model. (Indeed, the model achieves 99.9% F1 score on the training data set, but only about 75% on a held-out test data set.)

I see in the code that 32768 is set as the upper bound for n_estimators:

FLAML/flaml/model.py

Lines 307 to 313 in 0604570

    
    upper = min(32768, int(data_size))
    return {
        'n_estimators': {
            'domain': tune.qloguniform(lower=4, upper=upper, q=1),
            'init_value': 4,
            'low_cost_init_value': 4,
        },

I'm just wondering if this upper bound is intentionally set so high, or if this is an oversight.
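As a rough illustration of the bound in the quoted snippet, the sketch below reproduces the min(32768, data_size) computation and draws from a quantized log-uniform range with plain NumPy. The helper names are hypothetical and the sampler is only a stand-in for tune.qloguniform. It shows the shape of the domain, not how FLAML's search actually moves through it.

```python
import numpy as np

def n_estimators_upper(data_size, cap=32768):
    # Mirrors the quoted line: upper = min(32768, int(data_size))
    return min(cap, int(data_size))

def sample_qloguniform(lower, upper, size, rng):
    # Hypothetical stand-in for a quantized log-uniform draw with q=1:
    # sample uniformly in log space, then round to the nearest integer.
    draws = np.exp(rng.uniform(np.log(lower), np.log(upper), size=size))
    return np.clip(np.round(draws), lower, upper).astype(int)

rng = np.random.default_rng(0)
upper = n_estimators_upper(55408)  # sample size taken from the log record above
samples = sample_qloguniform(4, upper, 10_000, rng)

print(upper, int(np.median(samples)))
```

With 55408 training rows the cap of 32768 is the binding limit, so the value reported in the log sits exactly at the top of the domain.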

Does it work with weighted datasets?

Hi Dr. Wang:

Does this algorithm work with weighted datasets? I haven't seen any parameter like 'sample_weight'. Or shall I create customized learners for weighted datasets as you suggested for monotonicity?

Thank you.

Output logs in JSON format

Currently logs are produced in Tab-separated row format. Consider JSON log format because:

Easier integration with log analysis tools such as Elasticsearch/Logstash
JSON is typed, better for parsing in Python
Easier to combine log output from different future versions due to log schema change -- JSON schema is not position sensitive like the current format.
Model configuration is already dumped in JSON.
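A minimal sketch of what the parsing argument looks like in practice, assuming a log with one JSON object per line. The records below are hypothetical and abbreviated, and the field names follow the JSON log records quoted in other issues on this page.

```python
import json

# Hypothetical, abbreviated log lines: one JSON object per line.
log_lines = [
    '{"record_id": 0, "trial_time": 1.5, "validation_loss": 0.31, "learner": "lgbm"}',
    '{"record_id": 1, "trial_time": 2.0, "validation_loss": 0.27, "learner": "xgboost"}',
    '{"curr_best_record_id": 1}',
]

records = [json.loads(line) for line in log_lines]

# Trial records carry a "record_id"; the trailing pointer line does not.
trials = [r for r in records if "record_id" in r]
best = min(trials, key=lambda r: r["validation_loss"])

print(best["learner"], best["validation_loss"])
```

Values come back already typed (floats, ints, nested dicts), and a field added in a later version does not shift the position of the existing ones.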

Write example test for using nni tuning interface

add test/nni to illustrate how to use flaml in the nni framework.

let users specify the final_estimator and passthrough for the ensemble

Is it possible to let users specify the final_estimator and passthrough for the ensemble, please? In practice, sometimes the only meta-learner the business can accept is a GLM. Single boosting models are OK, but a boosting model of boosting models is just too complicated for the legal team and regulators. Regarding passthrough, there is no guarantee that one way will be better than the other, so perhaps it is better to let users decide.

Appreciate your help!

Originally posted by @flippercy in #47 (comment)

API Doc Website

A static site for API doc.

struct error during ensemble

When FLAML is building the ensemble at the end of a run, I am seeing the following error message:

[flaml.automl: 07-10 06:16:01] {1153} INFO -  at 9912.2s,       best xgboost's error=0.1893,    best xgboost's error=0.1893
[flaml.automl: 07-10 06:16:01] {1001} INFO - iteration 183, current learner lgbm
[flaml.automl: 07-10 06:17:00] {1153} INFO -  at 9970.5s,       best lgbm's error=0.1928,       best xgboost's error=0.1893
[flaml.automl: 07-10 06:17:00] {1193} INFO - selected model: XGBClassifier(base_score=0.5, booster='gbtree',
              colsample_bylevel=0.6553649023281938, colsample_bynode=1,
              colsample_bytree=0.5733906723952086, gamma=0, gpu_id=-1,
              grow_policy='lossguide', importance_type='gain',
              interaction_constraints='', learning_rate=0.03981439313350194,
              max_delta_step=0, max_depth=0, max_leaves=1130,
              min_child_weight=5.542464309441731, missing=nan,
              monotone_constraints='()', n_estimators=123, n_jobs=10,
              num_parallel_tree=1, objective='multi:softprob', random_state=0,
              reg_alpha=0.0059793400625186045, reg_lambda=7.330769622156848,
              scale_pos_weight=None, subsample=1.0, tree_method='hist',
              use_label_encoder=False, validate_parameters=1, verbosity=0)
[flaml.automl: 07-10 06:17:00] {1203} INFO - [('xgboost', <flaml.model.XGBoostSklearnEstimator object at 0x7f3e68c55048>), ('lgbm', <flaml.model.LGBMEstimator object at 0x7f3ea4da8cc0>), ('rf', <flaml.model.RandomForestEstimator object at 0x7f3ef11bc278>), ('extra_tree', <flaml.model.ExtraTreeEstimator object at 0x7f3ea4fb9240>), ('catboost', <flaml.model.CatBoostEstimator object at 0x7f3ef24f5c50>)]
joblib.externals.loky.process_executor._RemoteTraceback:
"""
Traceback (most recent call last):
  File "/global/home/hpc3552/autotext/flaml_env/lib/python3.6/site-packages/joblib/externals/loky/process_executor.py", line 357, in _sendback_result
    exception=exception))
  File "/global/home/hpc3552/autotext/flaml_env/lib/python3.6/site-packages/joblib/externals/loky/backend/queues.py", line 247, in put
    self._writer.send_bytes(obj)
  File "/opt/python/anaconda3/lib/python3.6/multiprocessing/connection.py", line 200, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/opt/python/anaconda3/lib/python3.6/multiprocessing/connection.py", line 393, in _send_bytes
    header = struct.pack("!i", n)
struct.error: 'i' format requires -2147483648 <= number <= 2147483647
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "search.py", line 230, in <module>
    main()
  File "search.py", line 226, in main
    data_sheet = run_data_sheet(data_sheet, target_col, id_col, data_dir, out_dir, eval_metric)
  File "search.py", line 181, in run_data_sheet
    pipe.fit(X_train, y_train, **automl_settings)
  File "/global/home/hpc3552/autotext/flaml_env/lib/python3.6/site-packages/flaml/automl.py", line 950, in fit
    self._search()
  File "/global/home/hpc3552/autotext/flaml_env/lib/python3.6/site-packages/flaml/automl.py", line 1222, in _search
    **self._state.fit_kwargs)
  File "/global/home/hpc3552/autotext/flaml_env/lib/python3.6/site-packages/sklearn/ensemble/_stacking.py", line 441, in fit
    return super().fit(X, self._le.transform(y), sample_weight)
  File "/global/home/hpc3552/autotext/flaml_env/lib/python3.6/site-packages/sklearn/ensemble/_stacking.py", line 149, in fit
    for est in all_estimators if est != 'drop'
  File "/global/home/hpc3552/autotext/flaml_env/lib/python3.6/site-packages/joblib/parallel.py", line 1054, in __call__
    self.retrieve()
  File "/global/home/hpc3552/autotext/flaml_env/lib/python3.6/site-packages/joblib/parallel.py", line 933, in retrieve
    self._output.extend(job.get(timeout=self.timeout))
  File "/global/home/hpc3552/autotext/flaml_env/lib/python3.6/site-packages/joblib/_parallel_backends.py", line 542, in wrap_future_result
    return future.result(timeout=timeout)
  File "/opt/python/anaconda3/lib/python3.6/concurrent/futures/_base.py", line 432, in result
    return self.__get_result()
  File "/opt/python/anaconda3/lib/python3.6/concurrent/futures/_base.py", line 384, in __get_result
    raise self._exception
struct.error: 'i' format requires -2147483648 <= number <= 2147483647

Interestingly, this error does not occur every time, but only sometimes.
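The first traceback fails inside multiprocessing's _send_bytes, on struct.pack("!i", n), where n is the size of the message being sent between processes. The sketch below reproduces that failure in isolation. It suggests, without proving it for this run, that a result passed back from a joblib worker was 2 GiB or larger, which would also explain why the error only appears sometimes.

```python
import struct

INT32_MAX = 2**31 - 1

# A length that fits in a signed 32-bit integer packs into a 4-byte header.
header = struct.pack("!i", INT32_MAX)

# One byte more cannot be represented, which is the error in the traceback.
try:
    struct.pack("!i", INT32_MAX + 1)
    overflowed = False
except struct.error as exc:
    overflowed = True
    print(exc)
```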

import package error

Hi,

I used pip install and
from flaml import AutoML
gave me the error of

FileNotFoundError: [Errno 2] No such file or directory: 'C:\Users\windwine8700\anaconda3\lib\site-packages\settings.json'

I am using python 3.83. Is there a way to solve this issue? Thanks.

Best,

Jiaqi

Question: Why not retrain on full dataset if eval_method = cv?

I noticed that FLAML will only retrain on the full dataset if the eval_method parameter is set to 'holdout':

FLAML/flaml/automl.py

Lines 910 to 911 in b04b00d

    
    self._retrain_full = retrain_full and (
        eval_method == 'holdout' and self._state.X_val is None)

Why not retrain on full dataset for other eval_methods, such as 'cv'?

ONNX/ONNXML export

Thanks for this wonderful promising AutoML stack.
May I suggest adding an "export to ONNX/ONNXML" method?
How would I export the best model pipeline to ONNXML now? Using sklearn-onnx?

Log parameters, metrics using mlflow instead of writing into a log file

Currently, the model training logs are written to a log file. Instead of writing to a log file, is there a way to log parameters and metrics using mlflow?

feature engineering for datetime columns

Create features from columns of datetime dtype in DataTransformer's fit_transform() and transform() methods. Right now a simple conversion to float type is used.
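One possible shape for such a transformation, as a sketch only: the function name and the choice of derived features are assumptions, not the implementation that was adopted in DataTransformer.

```python
import pandas as pd

def expand_datetime_columns(df):
    """Replace each datetime64 column with a few calendar-derived numeric columns."""
    out = df.copy()
    for col in out.select_dtypes(include=["datetime64"]).columns:
        dt = out.pop(col).dt
        out[f"{col}_year"] = dt.year
        out[f"{col}_month"] = dt.month
        out[f"{col}_day"] = dt.day
        out[f"{col}_dayofweek"] = dt.dayofweek  # Monday=0 ... Sunday=6
        out[f"{col}_hour"] = dt.hour
    return out

# Hypothetical input frame with one datetime column and one numeric column.
frame = pd.DataFrame({
    "ts": pd.to_datetime(["2021-07-08 09:40:44", "2021-07-10 06:16:01"]),
    "x": [1.0, 2.0],
})
features = expand_datetime_columns(frame)
print(features)
```

Compared with a single float conversion, these columns let tree-based learners split on seasonality (month, weekday, hour) directly.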

Originally posted by @sonichi in #66 (comment)

ACTION REQUIRED: Microsoft needs this private repository to complete compliance info

There are open compliance tasks that need to be reviewed for your FLAML repo.

Action required: 4 compliance tasks

To bring this repository to the standard required for 2021, we require administrators of this and all Microsoft GitHub repositories to complete a small set of tasks within the next 60 days. This is critical work to ensure the compliance and security of your microsoft GitHub organization.

Please take a few minutes to complete the tasks at: https://repos.opensource.microsoft.com/orgs/microsoft/repos/FLAML/compliance

The GitHub AE (GitHub inside Microsoft) migration survey has not been completed for this private repository
No Service Tree mapping has been set for this repo. If this team does not use Service Tree, they can also opt-out of providing Service Tree data in the Compliance tab.
No repository maintainers are set. The Open Source Maintainers are the decision-makers and actionable owners of the repository, irrespective of administrator permission grants on GitHub.
Classification of the repository as production/non-production is missing in the Compliance tab.

You can close this work item once you have completed the compliance tasks, or it will automatically close within a day of taking action.

If you no longer need this repository, it might be quickest to delete the repo, too.

GitHub inside Microsoft program information

More information about GitHub inside Microsoft and the new GitHub AE product can be found at https://aka.ms/gim or by contacting [email protected]

FYI: current admins at Microsoft include @ekzhu, @markusweimer, @qingyun-wu, @sonichi

sklearn f1_score method has 'binary' as average default

sklearn's f1_score uses 'binary' as the default value of the average parameter. This means that for a multiclass problem, average must be changed to one of ['micro', 'macro', 'weighted', 'samples']. In the ml module, sklearn_metric_loss_score is called without specifying that parameter. Consequently, at the moment, a problem arises if the f1 metric is chosen for a multiclass problem.

One solution could be to set average='samples' when the task is multiclass:softmax.
However, the right choice from the list above depends on the nature of the labels (balanced/unbalanced). It may be interesting to automate the choice of metric based on the nature of the labels.

Any idea?

settings = {
"time_budget": TIME_BUDGET,
"metric": 'f1',
"estimator_list": ['lgbm'],
"task": 'classification',
"log_file_name": 'flaml_lgb.log',
}
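The behavior described in this issue can be reproduced with scikit-learn alone. The labels below are made up. 'samples' is left out of the sketch because, as far as I know, it is intended for multilabel targets rather than single-label multiclass ones.

```python
from sklearn.metrics import f1_score

# Made-up 3-class labels.
y_true = [0, 1, 2, 2, 1, 0]
y_pred = [0, 2, 2, 2, 1, 0]

# The default average='binary' is rejected for multiclass targets.
try:
    f1_score(y_true, y_pred)
    default_failed = False
except ValueError as exc:
    default_failed = True
    print(exc)

# An explicit multiclass averaging mode works.
scores = {avg: f1_score(y_true, y_pred, average=avg)
          for avg in ("micro", "macro", "weighted")}
print(scores)
```

'macro' weights every class equally, while 'weighted' weights classes by support, so the two diverge on unbalanced labels. That is the trade-off the issue refers to.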

Questions on the output in the log

Hi Dr. Wang:

Got a few questions from my team on the content in the log of FLAML.

This is part of the log from one of our tests on FLAML (all the numbers on loss are redacted for compliance reasons):

{"record_id": 0, "iter_per_learner": 1, "logged_metric": false, "trial_time": 1756.8860552310944, "total_search_time": 2590.4430527687073, "validation_loss": XXX, "config": {"max_depth": 6, "n_estimators": 100, "min_child_weight": 10, "subsample": 0.67, "colsample_bylevel": 0.9, "gamma": 0, "learning_rate": 0.07435893300587489}, "best_validation_loss": XXX, "best_config": {"max_depth": 6, "n_estimators": 100, "min_child_weight": 10, "subsample": 0.67, "colsample_bylevel": 0.9, "gamma": 0, "learning_rate": 0.07435893300587489}, "learner": "MonotonicXgboostGBTree", "sample_size": 784536}

{"record_id": 1, "iter_per_learner": 5, "logged_metric": false, "trial_time": 1537.3765320777893, "total_search_time": 13424.922722578049, "validation_loss": XXX, "config": {"max_depth": 4, "n_estimators": 110, "min_child_weight": 1, "subsample": 0.5954399576961257, "colsample_bylevel": 1.0, "gamma": 1e-14, "learning_rate": 0.10828032871243709}, "best_validation_loss": XXX, "best_config": {"max_depth": 4, "n_estimators": 110, "min_child_weight": 1, "subsample": 0.5954399576961257, "colsample_bylevel": 1.0, "gamma": 1e-14, "learning_rate": 0.10828032871243709}, "learner": "MonotonicXgboostGBTree", "sample_size": 784536}

{"record_id": 2, "iter_per_learner": 13, "logged_metric": false, "trial_time": 340.0606036186218, "total_search_time": 34851.914006233215, "validation_loss": XXX, "config": {"max_depth": 5, "num_leaves": 23, "n_estimators": 157, "min_child_weight": 1, "subsample": 0.5112583180636173, "colsample_bylevel": 0.9863382485941592, "min_split_gain": 1e-14, "learning_rate": 0.05875161500234584}, "best_validation_loss": XXX, "best_config": {"max_depth": 5, "num_leaves": 23, "n_estimators": 157, "min_child_weight": 1, "subsample": 0.5112583180636173, "colsample_bylevel": 0.9863382485941592, "min_split_gain": 1e-14, "learning_rate": 0.05875161500234584}, "learner": "MonotonicLightGBMGBDT", "sample_size": 784536}

{"record_id": 3, "iter_per_learner": 18, "logged_metric": false, "trial_time": 270.2408003807068, "total_search_time": 41368.91024374962, "validation_loss": XXX, "config": {"max_depth": 4, "n_estimators": 338, "min_data_in_leaf": 56, "subsample": 0.6614322871324126, "colsample_bylevel": 0.9458919560564311, "learning_rate": 0.23062756268773424}, "best_validation_loss": XXX, "best_config": {"max_depth": 4, "n_estimators": 338, "min_data_in_leaf": 56, "subsample": 0.6614322871324126, "colsample_bylevel": 0.9458919560564311, "learning_rate": 0.23062756268773424}, "learner": "MonotonicCatboost", "sample_size": 784536}

{"record_id": 4, "iter_per_learner": 22, "logged_metric": false, "trial_time": 366.1694631576538, "total_search_time": 43080.46155285835, "validation_loss": XXX, "config": {"max_depth": 4, "n_estimators": 448, "min_data_in_leaf": 46, "subsample": 0.6950654501710251, "colsample_bylevel": 0.956150967914549, "learning_rate": 0.4527543463119874}, "best_validation_loss": XXX, "best_config": {"max_depth": 4, "n_estimators": 448, "min_data_in_leaf": 46, "subsample": 0.6950654501710251, "colsample_bylevel": 0.956150967914549, "learning_rate": 0.4527543463119874}, "learner": "MonotonicCatboost", "sample_size": 784536}

{"record_id": 5, "iter_per_learner": 23, "logged_metric": false, "trial_time": 343.4558777809143, "total_search_time": 45475.49441862106, "validation_loss": XXX, "config": {"max_depth": 4, "n_estimators": 405, "min_data_in_leaf": 58, "subsample": 0.734450014003538, "colsample_bylevel": 0.9644762947991873, "learning_rate": 0.3151376812002405}, "best_validation_loss": XXX, "best_config": {"max_depth": 4, "n_estimators": 405, "min_data_in_leaf": 58, "subsample": 0.734450014003538, "colsample_bylevel": 0.9644762947991873, "learning_rate": 0.3151376812002405}, "learner": "MonotonicCatboost", "sample_size": 784536}

I am wondering:

What does 'iter_per_learner' mean? My understanding is that the output in the log was generated in batch. For example, for record_id 2, does it include 13 or 8 (13-5 from record_id 1) MonotonicLightGBMGBDT models with different sets of hyperparameters?
What does 'trial_time' mean? How is it different from 'total_search_time'?
What is the difference between 'config' and 'best_config' in each record? They all look the same.
If the process reaches the time budget in the middle of an iteration, will it stop immediately or finish the current iteration first before stopping?

Appreciate your help! As you can see from the log, our dataset is quite large (780000+ records and thousands of predictors). Although the fitting is far from over yet, the current optimal result is already as good as what we got using BayesOpt.

Best,

Redirect the catboost_info subfolder created by CatboostEstimator

Feedback from sebhrusen (from the automlbenchmark)
CatboostEstimator is creating and filling a catboost_info subfolder in the running directory. We should be able to pass a 'train_dir' param to Catboost to avoid that.

For example at AutoML level, accept a tmpdir and pass it to each algo supporting an equivalent property (or pass a dedicated subfolder, for example tmpdir/catboost for Catboost and so on).

Reference:
openml/automlbenchmark#270

Crash with ValueError when ensemble=True

When I set ensemble=True, and my data has categorical features, I get the following error at the end of the FLAML run:

[flaml.automl: 07-08 09:40:44] {1141} INFO -  at 9373.5s,       best extra_tree's error=0.2056, best rf's error=0.1950
[flaml.automl: 07-08 09:40:44] {993} INFO - iteration 52, current learner rf
[flaml.automl: 07-08 09:41:42] {1141} INFO -  at 9431.7s,       best rf's error=0.1950, best rf's error=0.1950
[flaml.automl: 07-08 09:41:42] {993} INFO - iteration 53, current learner rf
[flaml.automl: 07-08 09:42:11] {1141} INFO -  at 9460.7s,       best rf's error=0.1950, best rf's error=0.1950
[flaml.automl: 07-08 09:42:11] {993} INFO - iteration 54, current learner rf
[flaml.automl: 07-08 09:50:15] {1141} INFO -  at 9944.4s,       best rf's error=0.1949, best rf's error=0.1949
[flaml.automl: 07-08 09:50:15] {1187} INFO - selected model: RandomForestClassifier(criterion='entropy', max_features=0.7294599478674504,
                       n_estimators=347, n_jobs=10)
[flaml.automl: 07-08 09:50:15] {1197} INFO - [('rf', <flaml.model.RandomForestEstimator object at 0x7fca69effaf0>), ('extra_tree', <flaml.model.ExtraTreeEstimator object at 0x7fca8cc1f8e0>), ('lgbm', <flaml.model.LGBMEstimator object at 0x7fc799985190>), ('catboost', <flaml.model.CatBoostEstimator object at 0x7fca8cc884f0>), ('xgboost', <flaml.model.XGBoostSklearnEstimator object at 0x7fca8cd0e610>)]
/global/home/hpc3552/.conda/envs/myenv/lib/python3.8/site-packages/xgboost/sklearn.py:888: UserWarning: The use of label encoder in XGBClassifier is deprecated and will be removed in a future release. To remove this warning, do the following: 1) Pass option use_label_encoder=False when constructing XGBClassifier object; and 2) Encode your labels (y) as integers starting with 0, i.e. 0, 1, 2, ..., [num_class - 1].
  warnings.warn(label_encoder_deprecation_msg, UserWarning)
/global/home/hpc3552/.conda/envs/myenv/lib/python3.8/site-packages/xgboost/sklearn.py:888: UserWarning: The use of label encoder in XGBClassifier is deprecated and will be removed in a future release. To remove this warning, do the following: 1) Pass option use_label_encoder=False when constructing XGBClassifier object; and 2) Encode your labels (y) as integers starting with 0, i.e. 0, 1, 2, ..., [num_class - 1].
  warnings.warn(label_encoder_deprecation_msg, UserWarning)
Traceback (most recent call last):
  File "search.py", line 212, in <module>
    dump_json(data_sheet_file, data_sheet)
  File "search.py", line 208, in main
    with open(data_sheet_file) as f:
  File "search.py", line 163, in run_data_sheet
    run['flaml_settings'] = jsonpickle.encode(automl_settings, unpicklable=False, keys=True)
  File "/global/home/hpc3552/.conda/envs/myenv/lib/python3.8/site-packages/flaml/automl.py", line 943, in fit
    self._search()
  File "/global/home/hpc3552/.conda/envs/myenv/lib/python3.8/site-packages/flaml/automl.py", line 1212, in _search
    stacker.fit(self._X_train_all, self._y_train_all,
  File "/global/home/hpc3552/.conda/envs/myenv/lib/python3.8/site-packages/sklearn/ensemble/_stacking.py", line 441, in fit
    return super().fit(X, self._le.transform(y), sample_weight)
  File "/global/home/hpc3552/.conda/envs/myenv/lib/python3.8/site-packages/sklearn/ensemble/_stacking.py", line 196, in fit
    _fit_single_estimator(self.final_estimator_, X_meta, y,
  File "/global/home/hpc3552/.conda/envs/myenv/lib/python3.8/site-packages/sklearn/ensemble/_base.py", line 39, in _fit_single_estimator
    estimator.fit(X, y)
  File "/global/home/hpc3552/.conda/envs/myenv/lib/python3.8/site-packages/flaml/model.py", line 296, in fit
    self._fit(X_train, y_train, **kwargs)
  File "/global/home/hpc3552/.conda/envs/myenv/lib/python3.8/site-packages/flaml/model.py", line 78, in _fit
    model.fit(X_train, y_train, **kwargs)
  File "/global/home/hpc3552/.conda/envs/myenv/lib/python3.8/site-packages/sklearn/ensemble/_forest.py", line 304, in fit
    X, y = self._validate_data(X, y, multi_output=True,
  File "/global/home/hpc3552/.conda/envs/myenv/lib/python3.8/site-packages/sklearn/base.py", line 433, in _validate_data
    X, y = check_X_y(X, y, **check_params)
  File "/global/home/hpc3552/.conda/envs/myenv/lib/python3.8/site-packages/sklearn/utils/validation.py", line 63, in inner_f
    return f(*args, **kwargs)
  File "/global/home/hpc3552/.conda/envs/myenv/lib/python3.8/site-packages/sklearn/utils/validation.py", line 871, in check_X_y
    X = check_array(X, accept_sparse=accept_sparse,
  File "/global/home/hpc3552/.conda/envs/myenv/lib/python3.8/site-packages/sklearn/utils/validation.py", line 63, in inner_f
    return f(*args, **kwargs)
  File "/global/home/hpc3552/.conda/envs/myenv/lib/python3.8/site-packages/sklearn/utils/validation.py", line 673, in check_array
    array = np.asarray(array, order=order, dtype=dtype)
  File "/global/home/hpc3552/.conda/envs/myenv/lib/python3.8/site-packages/numpy/core/_asarray.py", line 83, in asarray
    return array(a, dtype, copy=False, order=order)
ValueError: could not convert string to float: '__OTHER__'

This error does not occur if ensemble=False or if I remove (or encode) the categorical features from my dataset

My guess is that FLAML properly encodes categorical features when training the base estimators (LGBM, RF, etc), but not when training the stacking classifier.
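The last frames of the traceback are scikit-learn's check_array converting the input to a float array. A minimal reproduction of that conversion, on a hypothetical frame with one string column, plus an ordinal encoding applied beforehand:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical frame with one unencoded categorical column.
X = pd.DataFrame({"cat": ["__OTHER__", "a", "b"], "num": [1.0, 2.0, 3.0]})

# What the validation step effectively does: coerce everything to float.
try:
    np.asarray(X, dtype=float)
    failed = False
except ValueError as exc:
    failed = True
    print(exc)

# Encoding the string column first makes the same conversion succeed.
X_enc = X.copy()
X_enc["cat"] = OrdinalEncoder().fit_transform(X[["cat"]]).ravel()
as_float = np.asarray(X_enc, dtype=float)
print(as_float)
```

This matches the observation that the crash disappears once the categorical features are removed or encoded before fitting.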

support time series forecasting

add a task type 'forecast', and at least one forecasting learner, like greykite.

Log best model as part of MLflow run

Is there a way to persist the best model in the mlflow runs?

Error when ensemble=true

After upgrading to the newest version of FLAML, I am running into the following error when I set ensemble=True:

Traceback (most recent call last):
  File "search.py", line 229, in <module>
    main()
  File "search.py", line 225, in main
    data_sheet = run_data_sheet(data_sheet, target_col, id_col, data_dir, out_dir, eval_metric)
  File "search.py", line 180, in run_data_sheet
    pipe.fit(X_train, y_train, **automl_settings)
  File "/global/home/hpc3552/autotext/flaml_env/lib/python3.6/site-packages/flaml/automl.py", line 962, in fit
    self._search()
  File "/global/home/hpc3552/autotext/flaml_env/lib/python3.6/site-packages/flaml/automl.py", line 1232, in _search
    **self._state.fit_kwargs)
  File "/global/home/hpc3552/autotext/flaml_env/lib/python3.6/site-packages/sklearn/ensemble/_stacking.py", line 441, in fit
    return super().fit(X, self._le.transform(y), sample_weight)
  File "/global/home/hpc3552/autotext/flaml_env/lib/python3.6/site-packages/sklearn/ensemble/_stacking.py", line 149, in fit
    for est in all_estimators if est != 'drop'
  File "/global/home/hpc3552/autotext/flaml_env/lib/python3.6/site-packages/joblib/parallel.py", line 1054, in __call__
    self.retrieve()
  File "/global/home/hpc3552/autotext/flaml_env/lib/python3.6/site-packages/joblib/parallel.py", line 933, in retrieve
    self._output.extend(job.get(timeout=self.timeout))
  File "/global/home/hpc3552/autotext/flaml_env/lib/python3.6/site-packages/joblib/_parallel_backends.py", line 542, in wrap_future_result
    return future.result(timeout=timeout)
  File "/opt/python/anaconda3/lib/python3.6/concurrent/futures/_base.py", line 432, in result
    return self.__get_result()
  File "/opt/python/anaconda3/lib/python3.6/concurrent/futures/_base.py", line 384, in __get_result
    raise self._exception
TypeError: __init__() got an unexpected keyword argument '_estimator_type'

My call to FLAML:

        automl_settings = {
            "time_budget": search_time,
            "task": 'classification',
            "log_file_name": "{}/flaml-{}.log".format(out_dir, runname),
            "n_jobs": 10,
            "estimator_list": ['lgbm', 'xgboost', 'rf', 'extra_tree', 'catboost'],
            "model_history": True,
            "eval_method": "cv",
            "n_splits": 3,
            "metric": eval_metric,
            "log_training_metric": True,
            "verbose": 1,
            "ensemble": True,
        }

        pipe = AutoML()
        pipe.fit(X_train, y_train, **automl_settings)

This issue goes away if I change ensemble to False.

Here are my environment details:

$ pip list
Package            Version
------------------ --------
catboost           0.26
ConfigSpace        0.4.19
cycler             0.10.0
Cython             0.29.23
FLAML              0.5.6
graphviz           0.16
importlib-metadata 4.6.1
joblib             1.0.1
jsonpickle         2.0.0
kiwisolver         1.3.1
lightgbm           3.2.1
matplotlib         3.3.4
numpy              1.19.5
pandas             1.1.5
Pillow             8.3.1
pip                21.1.3
plotly             5.1.0
pyparsing          2.4.7
python-dateutil    2.8.1
pytz               2021.1
scikit-learn       0.24.2
scipy              1.5.4
setuptools         40.6.2
six                1.16.0
tenacity           8.0.0
threadpoolctl      2.1.0
typing-extensions  3.10.0.0
wheel              0.36.2
xgboost            1.4.2
zipp               3.5.0

$ python --version
Python 3.6.8 :: Anaconda custom (64-bit)

Possible error/inconsistency with the paper in FLOW2 step size reduction

In code, step size is reduced with the following:

        if self._num_proposedby_incumbent == self.dir and (
            not self._resource or self._resource == self.max_resource):
                # check stuck condition if using max resource
                if self.step >= self.step_lower_bound:
                    # decrease step size
                    self._oldK = self._K if self._K else self._iter_best_config
                    self._K = self.trial_count_proposed + 1
                    self.step *= np.sqrt(self._oldK / self._K)
                self._num_proposedby_incumbent -= 2

However, the algorithm description in the FLOW2 paper shows that:

From this, we can see that k' (_oldK in code) is only changed whenever a new best score is obtained. However, in the current implementation, k' always becomes the previous k instead. This seems counter-intuitive to me, as the step size multiplier will reduce much slower than in the paper implementation, thus making FLOW2 spend more time evaluating a configuration that has most likely already converged.

I believe that the implementation consistent with the paper would be:

        if self._num_proposedby_incumbent == self.dir and (
            not self._resource or self._resource == self.max_resource):
                # check stuck condition if using max resource
                if self.step >= self.step_lower_bound:
                    # decrease step size
                    self._oldK = self._iter_best_config  # change here
                    self._K = self.trial_count_proposed + 1
                    self.step *= np.sqrt(self._oldK / self._K)
                self._num_proposedby_incumbent -= 2

I have run some trials with this change and it seems to be working as intended, at least for my purposes: converged configurations are eliminated more aggressively.
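To put numbers on the difference between the two update rules, here is a small worked sketch. The function names and trial counts are hypothetical, and it assumes no new best configuration is found between the step reductions.

```python
import math

def step_multiplier_telescoping(k_best, stuck_trial_counts):
    """Rule in the quoted code: after the first reduction, old K is the previous K."""
    multiplier, old_k = 1.0, k_best
    for k in stuck_trial_counts:
        multiplier *= math.sqrt(old_k / k)
        old_k = k  # the next reduction is measured from the previous K
    return multiplier

def step_multiplier_from_best(k_best, stuck_trial_counts):
    """Proposed rule: old K always stays at the iteration of the best config."""
    multiplier = 1.0
    for k in stuck_trial_counts:
        multiplier *= math.sqrt(k_best / k)
    return multiplier

# Hypothetical numbers: best config found at trial 10, stuck at trials 20, 30, 40.
current = step_multiplier_telescoping(10, [20, 30, 40])
proposed = step_multiplier_from_best(10, [20, 30, 40])
print(current, proposed)
```

Under the first rule the factors telescope, so the cumulative multiplier after any number of reductions is just sqrt(k_best / K_last). Under the second rule every reduction multiplies in a fresh sqrt(k_best / K) factor, so the step shrinks faster.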

Am I understanding all of this correctly? Is this an oversight in the code, or has this been changed after the paper was published?

Error when specifying sample_weight with xgboost

At line 394 of flaml/model.py, the train() method does not accept weight as a legitimate argument.
Instead, the weight should be specified at line 390, when dtrain is created.

fit stops after Iteration 0 when metric is r2

I am using FLAML in Django views:

            X_train, X_test, y_train, y_test = train_test_split(df.copy(), train_size=selectedTrainingPercentage)

            automl = AutoML()

            settings = {
                "time_budget": 60,      # total running time in seconds
                "metric": 'r2',         # primary metrics for regression can be chosen from: ['mae','mse','r2']
                                        # list of ML learners; we tune xgboost in this example
                "task": 'regression',   # task type
            }
            print('fitting')

            automl.fit(X_train=X_train, y_train=y_train, **settings)
            print('fit complete')

And the fitting stops at iteration 0:

However, it works completely fine if I change the metric to mae or mse rather than r2.

datetime64[ns] dtype in dataframe

When DataTransformer's fit_transform method is called and some columns have a datetime dtype, an error is raised from sklearn\utils\validation.py.

I fixed it by converting every datetime column to its ordinal value with datetime.toordinal.
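
A minimal sketch of that workaround, applied before the data is handed to fit(). The helper name `datetimes_to_ordinal` and the sample frame are made up for illustration, and the sketch assumes the datetime columns contain no missing values.

```python
import pandas as pd


def datetimes_to_ordinal(df):
    """Return a copy of df with datetime64 columns replaced by integer ordinals."""
    out = df.copy()
    for col in out.select_dtypes(include=["datetime64"]).columns:
        # Timestamp.toordinal() gives the proleptic Gregorian day number;
        # the time of day is discarded.
        out[col] = out[col].map(lambda ts: ts.toordinal())
    return out


frame = pd.DataFrame({
    "when": pd.to_datetime(["2021-01-01", "2021-01-02"]),
    "value": [1.5, 2.5],
})
converted = datetimes_to_ordinal(frame)
```

Ordinals keep the ordering of dates but drop the time of day, so this only suits columns where day resolution is enough.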

Errors when ensemble = True

Hi:

There is an error message when fitting models using customized monotonic learners with ensemble = True:

RuntimeError: Cannot clone object <__main__.MyMonotonicLightGBMGBDTClassifier object at 0x7f9ef2999310>, as the constructor either does not set or modifies parameter monotone_constraints

I assume it is due to the monotone_constraints added to self.params. Any suggestion on how to fix it?

Usually we wouldn't build an ensemble of boosting models, but it would be great if we could figure out a solution!

Thank you.
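
For context, this message is raised by scikit-learn's clone(), which rebuilds an estimator from get_params() and then checks that the constructor stored each argument unchanged. The sketch below reproduces the message with two toy classes that subclass scikit-learn's BaseEstimator directly (not FLAML's wrapper of the same name). Both class names are hypothetical, and whether this is the exact cause inside FLAML is an assumption.

```python
from sklearn.base import BaseEstimator, clone


class StoresAsGiven(BaseEstimator):
    def __init__(self, monotone_constraints=(0, 0)):
        # The argument is stored untouched, which is what clone() expects.
        self.monotone_constraints = monotone_constraints


class ModifiesInInit(BaseEstimator):
    def __init__(self, monotone_constraints=(0, 0)):
        # Converting here makes the stored object differ from the argument.
        self.monotone_constraints = list(monotone_constraints)


cloned = clone(StoresAsGiven(monotone_constraints=(1, -1)))  # succeeds

try:
    clone(ModifiesInInit(monotone_constraints=(1, -1)))
    clone_failed = False
except RuntimeError:
    clone_failed = True
```

If the custom learner falls into the second pattern, storing the constraint exactly as received and converting it later, at fit time, is one possible way around the error.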

Verbose argument in model.fit()

Hi,

While training the learner, console output is generated, which can take up a lot of space in the notebook if time_budget is large. How do I suppress the console output while training? In keras, sklearn, etc., setting verbose = 0 suppresses it.

Thanks!
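
One thing that might be worth trying is raising the level of the logger that emits these lines. This is only a sketch: it assumes the messages go through Python's standard logging module under a logger named "flaml.automl", as the "[flaml.automl: ...]" prefix in the output suggests. If the library resets the level inside fit(), it will have no effect.

```python
import logging

# Assumption: the progress lines come from a standard-library logger named
# "flaml.automl". Raising its level to WARNING drops INFO-level messages.
flaml_logger = logging.getLogger("flaml.automl")
flaml_logger.setLevel(logging.WARNING)
```

Warnings printed by other libraries, such as scikit-learn's ConvergenceWarning, are not affected by this setting.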

Error message with lrl1 when using FLAML via reticulate: NameError: name '_' is not defined

Hi:

I've received the error message below with lrl1 when using FLAML in RStudio via reticulate:

[flaml.automl: 04-02 14:24:16] {986} INFO - iteration 0 current learner lrl1
NameError: name '_' is not defined

Interestingly, the same code ran well in Jupyter. The versions of scikit-learn in the two environments are the same.

Any ideas?

Thank you.

Guide for contributors

Some helpful documentation for future contributors, which may include:

A general description of the classes and how they work
How to run unit tests
etc.

Publish package to Pypi

Enable Discussion

Congratulations on releasing FLAML and I look forward to contributing to it. Would it be possible for you to enable discussions?
Here is how to enable it:

https://docs.github.com/en/discussions/quickstart

Thanks,
Sandeep

AttributeError message during fit

Hi everyone!!

I've received the AttributeError message below when using FLAML with XGBoost (this error occurs with other algorithms too):

[flaml.automl: 07-01 10:45:34] {908} INFO - Evaluation method: cv
[flaml.automl: 07-01 10:45:34] {607} INFO - Using StratifiedKFold
[flaml.automl: 07-01 10:45:34] {929} INFO - Minimizing error metric: 1-roc_auc
[flaml.automl: 07-01 10:45:34] {949} INFO - List of ML learners in AutoML Run: ['xgboost']
[flaml.automl: 07-01 10:45:34] {1013} INFO - iteration 0, current learner xgboost
Traceback (most recent call last):
  File "ft2.py", line 33, in <module>
    automl.fit(X_train=X, y_train=y, **settings)
  File "/scratch/luizhemelo/anaconda3/lib/python3.7/site-packages/flaml/automl.py", line 962, in fit
    self._search()
  File "/scratch/luizhemelo/anaconda3/lib/python3.7/site-packages/flaml/automl.py", line 1081, in _search
    use_ray=False)
  File "/scratch/luizhemelo/anaconda3/lib/python3.7/site-packages/flaml/tune/tune.py", line 270, in run
    search_alg.set_search_properties(metric, mode, config={
AttributeError: 'ConcurrencyLimiter' object has no attribute 'set_search_properties'

Parameters used:

settings = {
    "time_budget": 108000,
    "metric": 'roc_auc',
    "task": 'classification',
    "n_jobs": -1,
    "estimator_list": ['xgboost'],
    "n_splits": 5,
    "log_file_name": 'ft.log',
}

Specifications:
Python 3.7.10
FLAML 0.5.4 (installed via pip)
XGBoost 1.4.0 (installed via conda)

Any ideas?

Thanks! :D

Option for groupKFold for regression problems

Hi,

I'm trying to tune lightgbm for a regression problem and need to use groupKFold for cross-validation.
By default, automl.fit() uses repeatedkfold as the split_type. I looked through the documentation but couldn't find details on this, or on how to pass the groups argument to it.

Thanks in advance.
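
For reference, this is how a groups array is passed to scikit-learn's GroupKFold on its own. It is a plain scikit-learn sketch with made-up data and does not show any FLAML option. How (or whether) automl.fit() accepts groups is exactly what the question asks, and is not answered here.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

X = np.arange(24, dtype=float).reshape(12, 2)
y = np.arange(12, dtype=float)
groups = np.repeat([0, 1, 2, 3], 3)  # hypothetical group label for each row

splitter = GroupKFold(n_splits=4)
folds = list(splitter.split(X, y, groups=groups))
for train_idx, test_idx in folds:
    # Rows sharing a group label never straddle the train/test boundary.
    assert set(groups[train_idx]).isdisjoint(groups[test_idx])
```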

python 3.9 support?

It seems only Python up to 3.8 is supported. Is 3.9 supported, or will it be?

What about 3.10, which is being released soon?

The ensemble option in the main fit function does not work with customized learners

I fit a model with both the RGF learner from the sample code and a few other default learners:

settings = {
    "time_budget": 120,  # total running time in seconds
    "metric": 'roc_auc',
    "estimator_list": ['lgbm', 'rf', 'RGF'],  # list of ML learners
    "task": 'classification',  # task type
    "sample": True,  # whether to subsample training data
    "log_file_name": 'airlines_experiment_with_ensemble.log',  # cache directory of flaml log files
    "log_training_metric": True,  # whether to log training metric
}
automl.fit(X_train=X_train, y_train=y_train, ensemble=True, **settings)

I received an error message: TypeError: __init__() got an unexpected keyword argument '_estimator_type'

I got similar results when using other customized learners with unique hyperparameters.

Moreover, how can I pull the details of the ensemble? I did not see it in the log file.

Thank you.

Could you create an R library, too?

Hi Chi

Amazing work! Could you create an R library for it, too? There is still a large portion of potential users working in R.

Best,

Is there a way to use a custom evaluation metric?

I would like to define & use a custom evaluation metric.

Handling imbalanced dataset in automl

Is there a way to handle imbalanced datasets in the automl?
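
One generic option is to weight rows inversely to their class frequency and pass the result as sample_weight (an argument that appears elsewhere on this page). The sketch below computes such weights with NumPy, using the same heuristic scikit-learn calls "balanced" class weights. The helper name is made up, and whether every FLAML learner honors sample_weight is an assumption.

```python
import numpy as np


def balanced_sample_weight(y):
    """Weight each row by n_samples / (n_classes * count of its class)."""
    y = np.asarray(y)
    classes, inverse, counts = np.unique(y, return_inverse=True, return_counts=True)
    class_weight = y.size / (classes.size * counts)
    return class_weight[inverse]


# Three rows of class 0 and one row of class 1.
weights = balanced_sample_weight([0, 0, 0, 1])
```

With these weights each class contributes the same total weight, and the weights sum to the number of rows.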

Error with custom metric when using eval_method = "cv"

I'm trying to use a custom metric. I'm using the one from the test case:

FLAML/test/test_automl.py

Line 92 in 0604570

def custom_metric(X_test, y_test, estimator, labels, X_train, y_train,

This works fine when eval_method is set to its default value of "holdout". But if I change this to "cv", I get the following error:

How to pull the number of iterations completed for each learner?

Hi:

Is there a way to pull the number of iterations completed by automl() for each learner, please? I know it can be found in the log if I set log_type to 'all' but can I pull it directly?

Assuming all the default learners are used, it would be great to get the information as a table like the one below:

Learner	Iterations Completed
Xgboost	100
LightGBM	200
Catboost	150
RF	50

Thank you!
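
Until there is a direct accessor, one workaround is to count the "iteration N, current learner X" lines in the captured console output. This is a sketch only: the regular expression is modeled on the log lines quoted on this page, the second and third sample lines are made up, and the count reflects iterations that were started, not necessarily completed.

```python
import re
from collections import Counter

# Matches lines such as
# "[flaml.automl: 06-17 12:55:08] {1013} INFO - iteration 41, current learner lrl1"
ITERATION_LINE = re.compile(r"iteration \d+,? current learner (\S+)")


def count_iterations(log_text):
    """Count how many iterations were started for each learner."""
    return Counter(ITERATION_LINE.findall(log_text))


sample = """\
[flaml.automl: 07-01 10:45:34] {1013} INFO - iteration 0, current learner xgboost
[flaml.automl: 07-01 10:45:35] {1013} INFO - iteration 1, current learner lgbm
[flaml.automl: 07-01 10:45:36] {1013} INFO - iteration 2, current learner xgboost
"""
counts = count_iterations(sample)
```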

Bug in Cross-Validation estimation

Dear all,

I have been trying FLAML for a few days now and I believe I stumbled across a bug in the evaluation of the model when using cross-validation (eval_method="cv").

I believe that only the last fold is taken into account in the function evaluate_model_CV (ml.py). The list of validation scores (val_loss_list) is updated only at the last fold or when the budget is no longer sufficient, so val_loss_list contains a single item in all cases. Moreover, what is appended to the list is not the validation score of the current fold, but the mean of the validation scores of the first "valid_fold_num" folds.

I would suggest the following to replace lines 220--226 in ml.py:

        val_loss_list.append(val_loss_i)
        if valid_fold_num == n:
            total_val_loss = valid_fold_num = 0
        elif time.time() - start_time >= budget:
            break
    val_loss = np.max(val_loss_list)

One might also consider changing the last line of the above snippet, or making it an option. Here the maximum of the per-fold validation scores is taken. Another common choice is the average of the per-fold validation scores. This could be an option for the user, but it is not a bug per se. I am also OK with keeping the max of all validation scores as it is now. (Note that the current code effectively uses the mean over the folds, since it divides total_val_loss by the number of folds.)

Best

David
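
The collection logic described above can be sketched independently of ml.py as follows. `cv_loss` and `mean` are hypothetical helpers, and the per-fold losses are passed in directly here; in real use each value would come from fitting and scoring one fold.

```python
import time


def cv_loss(fold_losses, budget, reduce=max):
    """Collect one loss per fold until the budget runs out, then reduce them."""
    start_time = time.time()
    val_loss_list = []
    for val_loss_i in fold_losses:
        val_loss_list.append(val_loss_i)  # one entry per completed fold
        if time.time() - start_time >= budget:
            break
    return reduce(val_loss_list), len(val_loss_list)


def mean(values):
    return sum(values) / len(values)


worst, n_folds = cv_loss([0.25, 0.5, 0.75], budget=60.0)
average, _ = cv_loss([0.25, 0.5, 0.75], budget=60.0, reduce=mean)
```

Passing the reducer as an argument is one way to expose the max-versus-mean choice as an option.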

Data cleaning/preparation for Numpy matrix input

Currently, pandas.DataFrame input is cleaned.

Consider performing the same preparation steps for numpy.ndarray inputs.

How to enforce monotonicity in FLAML?

Hi Chi:

Thank you for the cool work! Could I enforce monotonicity in the main automl.fit() function? If so, what algorithms can be chosen in the estimator list?

Best,

Using pandas validation data gives an error

If I leave out X_val and y_val, automl works fine. But if I specify these values, it crashes with the following error:

----> 7 automl.fit(X_train= xtrain,y_train=ytrain,X_val=xvalid,y_val=yvalid,**automl_settings)

~\anaconda3\lib\site-packages\flaml\automl.py in fit(self, X_train, y_train, dataframe, label, metric, task, n_jobs, log_file_name, estimator_list, time_budget, max_iter, sample, ensemble, eval_method, log_type, model_history, split_ratio, n_splits, log_training_metric, mem_thres, X_val, y_val, sample_weight_val, retrain_full, split_type, learner_selector, hpo_method, **fit_kwargs)
    832         self._state.fit_kwargs = fit_kwargs
    833         self._state.weight_val = sample_weight_val
--> 834         self._validate_data(X_train, y_train, dataframe, label, X_val, y_val)
    835         self._search_states = {}  #key: estimator name; value: SearchState
    836         self._random = np.random.RandomState(RANDOM_SEED)

~\anaconda3\lib\site-packages\flaml\automl.py in _validate_data(self, X_train_all, y_train_all, dataframe, label, X_val, y_val)
    434             "# rows in X_val must match length of y_val.")
    435             if self._transformer:
--> 436                 self._state.X_val = self._transformer.transform(X_val)
    437             else:
    438                 self._state.X_val = X_val

~\anaconda3\lib\site-packages\flaml\data.py in transform(self, X)
    251                 X[cat_columns] = X[cat_columns].astype('category')
    252             if num_columns:
--> 253                 X[num_columns].fillna(np.nan, inplace=True)
    254                 X[num_columns] = self.transformer.transform(X)
    255         return X

~\anaconda3\lib\site-packages\pandas\core\frame.py in fillna(self, value, method, axis, inplace, limit, downcast)
   4315         downcast=None,
   4316     ) -> Optional["DataFrame"]:
-> 4317         return super().fillna(
   4318             value=value,
   4319             method=method,

~\anaconda3\lib\site-packages\pandas\core\generic.py in fillna(self, value, method, axis, inplace, limit, downcast)
   6086         result = self._constructor(new_data)
   6087         if inplace:
-> 6088             return self._update_inplace(result)
   6089         else:
   6090             return result.__finalize__(self, method="fillna")

~\anaconda3\lib\site-packages\pandas\core\generic.py in _update_inplace(self, result, verify_is_copy)
   3962         self._clear_item_cache()
   3963         self._mgr = result._mgr
-> 3964         self._maybe_update_cacher(verify_is_copy=verify_is_copy)
   3965 
   3966     def add_prefix(self: FrameOrSeries, prefix: str) -> FrameOrSeries:

~\anaconda3\lib\site-packages\pandas\core\generic.py in _maybe_update_cacher(self, clear, verify_is_copy)
   3243 
   3244         if verify_is_copy:
-> 3245             self._check_setitem_copy(stacklevel=5, t="referant")
   3246 
   3247         if clear:

~\anaconda3\lib\site-packages\pandas\core\generic.py in _check_setitem_copy(self, stacklevel, t, force)
   3679 
   3680         if value == "raise":
-> 3681             raise com.SettingWithCopyError(t)
   3682         elif value == "warn":
   3683             warnings.warn(t, com.SettingWithCopyWarning, stacklevel=stacklevel)

SettingWithCopyError: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
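
The traceback ends in pandas' check for writes to a slice of another DataFrame, triggered by an inplace fillna on a column selection. A possible workaround on the caller's side, assuming the validation frame was produced by slicing a larger frame (the report does not confirm this), is to pass an explicit copy. The frame and column names below are made up.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({"x": [1.0, np.nan, 3.0, 4.0], "y": [0, 1, 0, 1]})

# .copy() makes each frame independent of the frame it was sliced from.
xtrain = data.iloc[:2][["x"]].copy()
xvalid = data.iloc[2:][["x"]].copy()

# Assigning the result back avoids calling an inplace method on a
# temporary column selection.
num_columns = ["x"]
xtrain[num_columns] = xtrain[num_columns].fillna(0.0)
```

The same assign-back pattern would also apply inside a transform method, in place of an inplace fillna on a column selection.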

A couple of errors when building the ensemble

Hi:

Our team has explored the ensemble option in the fit function of automl and got a few errors:

An error occurs when both the GLMs (LRL1/LRL2) and ML learners are used for the ensemble. For example:

from flaml import AutoML
from flaml.data import load_openml_dataset

X_train, X_test, y_train, y_test = load_openml_dataset(dataset_id=1169, data_dir='./')

settings = {
    "time_budget": 40,
    "metric": 'roc_auc',
    "task": 'classification',
    "estimator_list": ['lrl1', 'lrl2', 'lgbm', 'xgboost'],
    "log_file_name": 'airlines_experiment.log',
}

automl = AutoML()
automl.fit(X_train=X_train, y_train=y_train, ensemble=True, **settings)

[flaml.automl: 03-18 17:34:40] {1157} INFO - [('xgboost', <flaml.model.XGBoostSklearnEstimator object at 0x7f61f8659ed0>), ('lgbm', <flaml.model.LGBMEstimator object at 0x7f61f8687350>), ('lrl2', <flaml.model.LRL2Classifier object at 0x7f61f8687090>), ('lrl1', <flaml.model.LRL1Classifier object at 0x7f61f8654150>)]

RuntimeError: Cannot clone object <flaml.model.LRL2Classifier object at 0x7f84877a1c10>, as the constructor either does not set or modifies parameter penalty.

This is similar to the error we've discussed before.

The other error is more confusing. We've created a few customized ML learners with monotone constraints and used them with automl. For example, below is the code for a monotonic xgboost learner and a monotonic lightGBM learner, both using GBDT as the booster:

class MyMonotonicXGBGBTreeClassifier(BaseEstimator):

    def __init__(self, task='binary:logistic', n_jobs=num_cores, **params):

        super().__init__(task, **params)

        self.estimator_class = XGBClassifier

        # convert to int for integer hyperparameters
        self.params = {
            'n_jobs': params['n_jobs'] if 'n_jobs' in params else num_cores,
            'booster': params['booster'] if 'booster' in params else 'gbtree',
            'learning_rate': params['learning_rate'],
            'gamma': params['gamma'],
            'max_depth': int(params['max_depth']),
            'min_child_weight': int(params['min_child_weight']),
            'subsample': params['subsample'],
            'colsample_bylevel': params['colsample_bylevel'],
            'n_estimators': int(params['n_estimators']),
            'reg_lambda': params['reg_lambda'],
            'reg_alpha': params['reg_alpha'],
            'random_state': params['random_state'] if 'random_state' in params else randomseed,
            'monotone_constraints': params['monotone_constraints'] if 'monotone_constraints' in params else monotone,
        }

    @classmethod
    def search_space(cls, data_size, task):

        space = {
            'max_depth': {'domain': tune.uniform(lower=4, upper=15), 'init_value': 8},
            'n_estimators': {'domain': tune.uniform(lower=50, upper=800), 'init_value': 200},
            'min_child_weight': {'domain': tune.uniform(lower=1, upper=1000), 'init_value': 100},
            'subsample': {'domain': tune.uniform(lower=0.7, upper=1), 'init_value': 0.7},
            'colsample_bylevel': {'domain': tune.uniform(lower=0.6, upper=1), 'init_value': 0.8},
            'learning_rate': {'domain': tune.loguniform(lower=0.001, upper=1), 'init_value': 0.1},
            'gamma': {'domain': tune.loguniform(lower=0.000000000001, upper=0.001), 'init_value': 0.00001},
            'reg_lambda': {'domain': tune.loguniform(lower=0.000000000001, upper=1), 'init_value': 1},
            'reg_alpha': {'domain': tune.loguniform(lower=0.000000000001, upper=1), 'init_value': 0.000000000001},
        }
        return space

class MyMonotonicLightGBMGBDTClassifier(BaseEstimator):

    def __init__(self, task='binary:logistic', n_jobs=num_cores, **params):

        super().__init__(task, **params)

        self.estimator_class = LGBMClassifier

        # convert to int for integer hyperparameters
        self.params = {
            'n_jobs': params['n_jobs'] if 'n_jobs' in params else num_cores,
            'boosting_type': params['boosting_type'] if 'boosting_type' in params else 'gbdt',
            'learning_rate': params['learning_rate'],
            'min_split_gain': params['min_split_gain'],
            'max_depth': int(params['max_depth']),
            'min_data_in_leaf': int(params['min_data_in_leaf']),
            'min_sum_hessian_in_leaf': params['min_sum_hessian_in_leaf'],
            'subsample': params['subsample'],
            'colsample_bytree': params['colsample_bytree'],
            'n_estimators': int(params['n_estimators']),
            'subsample_freq': int(params['subsample_freq']),
            'reg_lambda': params['reg_lambda'],
            'reg_alpha': params['reg_alpha'],
            'random_state': params['random_state'] if 'random_state' in params else randomseed,
            'monotone_constraints': params['monotone_constraints'] if 'monotone_constraints' in params else monotone,
        }

    @classmethod
    def search_space(cls, data_size, task):

        space = {
            'max_depth': {'domain': tune.uniform(lower=4, upper=15), 'init_value': 8},
            'subsample_freq': {'domain': tune.uniform(lower=1, upper=10), 'init_value': 5},
            'n_estimators': {'domain': tune.uniform(lower=50, upper=800), 'init_value': 200},
            'min_data_in_leaf': {'domain': tune.uniform(lower=1, upper=1000), 'init_value': 100},
            'min_sum_hessian_in_leaf': {'domain': tune.loguniform(lower=0.000001, upper=0.1), 'init_value': 0.001},
            'subsample': {'domain': tune.uniform(lower=0.5, upper=1), 'init_value': 0.67},
            'colsample_bytree': {'domain': tune.uniform(lower=0.5, upper=1), 'init_value': 0.9},
            'learning_rate': {'domain': tune.loguniform(lower=0.001, upper=1), 'init_value': 0.1},
            'min_split_gain': {'domain': tune.loguniform(lower=0.000000000001, upper=0.001), 'init_value': 0.00001},
            'reg_lambda': {'domain': tune.loguniform(lower=0.000000000001, upper=1), 'init_value': 1},
            'reg_alpha': {'domain': tune.loguniform(lower=0.000000000001, upper=1), 'init_value': 0.000000000001},
        }
        return space

Without the ensemble, both worked well as individual learners. However, when we set ensemble=True, the monotonic xgboost learner still worked well, but the process always crashed if the monotonic lightGBM learner was included in the list of estimators. The Jupyter kernel simply died without any error message. The .out file generated by the backend contains this error message:

[LightGBM] [Fatal] Check failed: static_cast<size_t>(num_total_features_) == io_config.monotone_constraints.size() at /__w/1/s/python-package/compile/src/io/dataset.cpp, line 314

What does it mean? It seems that something is wrong with the monotone_constraints but the size of the constraints matches the number of variables.

This error can be replicated using the airlines data; to make it easier, just set monotone=(0, 0, 0, 0, 0, 0, 0).

Appreciate your help.

	upper = min(32768, int(data_size))
	return {
	'n_estimators': {
	'domain': tune.qloguniform(lower=4, upper=upper, q=1),
	'init_value': 4,
	'low_cost_init_value': 4,
	},

	self._retrain_full = retrain_full and (
	eval_method == 'holdout' and self._state.X_val is None)

Action required: 4 compliance tasks

GitHub inside Microsoft program information
