naiveautoml's Introduction

Naive AutoML

naiveautoml is a tool to find optimal machine learning pipelines for

  • classification tasks (binary, multi-class, or multi-label) and
  • regression tasks.

Unlike most AutoML tools, naiveautoml defines no timeouts by default, not even implicit ones. While timeouts can optionally be provided, naiveautoml will simply stop as soon as it believes that no better pipeline can be found; this can be surprisingly quick.

Python

Install via pip install naiveautoml. The current version is 0.1.2.

We highly recommend checking out the usage example Python notebook.

Finding an optimal model for your data is then as easy as running:

import naiveautoml
import sklearn.datasets
naml = naiveautoml.NaiveAutoML()
X, y = sklearn.datasets.load_iris(return_X_y=True)
naml.fit(X, y)
print(naml.chosen_model)

The task type (here classification) is derived automatically. To be sure, it can also be specified explicitly via task_type, whose accepted values are classification, regression, and multilabel-indicator.

To get the history of considered pipelines, together with a (relative) timestamp and internal validation scores, you can access the history:

print(naml.history)

Want to limit the number of candidates considered during hyper-parameter tuning?

naml = naiveautoml.NaiveAutoML(max_hpo_iterations=20)

Want to set a timeout? Specify it in seconds (it should always be greater than 10s to avoid strange side effects).

naml = naiveautoml.NaiveAutoML(timeout_overall=20)

You can set the timeout for single pipeline evaluations with

naml = naiveautoml.NaiveAutoML(timeout_candidate=20)

However, be aware that for many pipelines this timeout is not enforced, since this is not safely possible without memory leakage or malfunction.

Timeouts can also be combined with max_hpo_iterations.

Want to see the progress bar for the optimization process?

naml = naiveautoml.NaiveAutoML(show_progress=True)

Want logging?

# configure logger
import logging
logger = logging.getLogger('naiveautoml')
logger.setLevel(logging.INFO)
ch = logging.StreamHandler()
ch.setLevel(logging.INFO)
formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
ch.setFormatter(formatter)
logger.addHandler(ch)

Scoring functions

By default, log-loss is used for classification (AUROC in the case of binary classification). To use a custom scoring function, pass it in the constructor:

naml = naiveautoml.NaiveAutoML(scoring="accuracy")

To additionally evaluate other scoring functions (not used to rank candidates), you can use a list of passive_scorings:

naml = naiveautoml.NaiveAutoML(scoring="accuracy", passive_scorings=["neg_log_loss", "f1_score"])

You can also pass a custom scoring function through a dictionary:

import numpy as np

scorer = make_scorer(**{
            "name": "accuracy",
            # accuracy is the fraction of correct predictions
            "score_func": lambda y, y_pred: np.mean(y == y_pred),
            "greater_is_better": True,
            "needs_proba": False,
            "needs_threshold": False
        })
naml = naiveautoml.NaiveAutoML(scoring=scorer)

Custom Categorical Features

Naive AutoML determines the categorical attributes automatically as far as possible. However, sometimes even columns consisting only of numbers should be treated as categorical attributes. To pass an explicit list of attributes that should be treated as categoricals, use the categorical_features parameter in the fit function:

naml.fit(X_df, y, categorical_features=["name_of_first_categorical_attribute", "name_of_second_categorical_attribute"])

Alternatively (or if your data is a numpy array), you can use the indices of the columns:

naml.fit(X_df, y, categorical_features=[4, 9])

Citing Naive AutoML

Please use the reference from the Machine Learning Journal to cite Naive AutoML:

https://link.springer.com/article/10.1007/s10994-022-06200-0#article-info

@article{mohr2022naive,
  title={{Naive Automated Machine Learning}},
  author={Mohr, Felix and Wever, Marcel},
  journal={Machine Learning},
  pages={1131--1170},
  year={2022},
  publisher={Springer},
  volume={112},
  issue={4}
}

naiveautoml's People

Contributors

angelg14, fmohr, pgijsbers

naiveautoml's Issues

Version is out of sync

PyPI lists naiveautoml as 0.0.15.

The repository lists it as 0.0.13.

It would make sense if the repo had a newer, yet unreleased version, but as is, I don't know how the two relate.

It would also be incredibly helpful if the naiveautoml module provided a __version__, as is customary for many packages, e.g.:

>>> import sklearn; sklearn.__version__
'1.2.1'
>>> import pandas; pandas.__version__
'1.5.3'
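Until the module exposes __version__, the version of the installed distribution can be read from the package metadata with the standard library. The following is a minimal sketch; the helper name installed_version is hypothetical, and the value reflects what pip installed, which may differ from the version stated in the repository.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(distribution: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


# Prints the installed version string, or None if naiveautoml is not installed.
print(installed_version("naiveautoml"))
```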

Add support for class weights

It would be great if one could specify the class weights in the initialization so that algorithms that have support for this parameter will be fed with it, such as LR, DT, SVMs, etc.
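One way this could work is to forward the weights only to learners whose constructor declares a class_weight parameter. The sketch below is not naiveautoml's implementation; build_learner, WeightedLearner, and PlainLearner are hypothetical names used for illustration only.

```python
import inspect


def build_learner(learner_cls, class_weight=None, **params):
    """Instantiate learner_cls, passing class_weight only if its constructor accepts it."""
    accepted = inspect.signature(learner_cls.__init__).parameters
    if class_weight is not None and "class_weight" in accepted:
        params["class_weight"] = class_weight
    return learner_cls(**params)


# Hypothetical stand-ins for learners with and without class-weight support.
class WeightedLearner:
    def __init__(self, C=1.0, class_weight=None):
        self.C = C
        self.class_weight = class_weight


class PlainLearner:
    def __init__(self, n_neighbors=5):
        self.n_neighbors = n_neighbors


weights = {0: 1.0, 1: 5.0}
weighted = build_learner(WeightedLearner, class_weight=weights, C=0.5)
plain = build_learner(PlainLearner, class_weight=weights)  # weights are silently skipped
```

Since scikit-learn estimators list their parameters explicitly in __init__, the same signature check should carry over to them, though that is not verified here.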

Enrich history by additional fields

introduce:

  • pipeline as string for easier reproduction (again)
  • one column that says (for each pipeline slot) whether default hyperparameters were used
  • maybe include the mandatory pre-processing steps as well

HPO phase is currently always strictly naive

In the HPO phase, all the slots are optimized independently. This is fine for research purposes, but, in practice, one would expect that the pipeline consisting of the best individual candidates of the AS phase would be taken and optimized jointly (no reason to not do that).

`Cannot step inactive HPO Process` when setting `max_hpo_iterations`

When running

from naiveautoml import NaiveAutoML
from sklearn.datasets import load_iris

if __name__ == "__main__":
    naml = NaiveAutoML(timeout=60, execution_timeout=6, max_hpo_iterations=1e10)
    x, y = load_iris(return_X_y=True, as_frame=True)
    naml.fit(x, y)

I frequently (but not always!) get the error

An error occurred in the HPO step: Cannot step inactive HPO Process
Traceback (most recent call last):
  File "/Users/pietergijsbers/Library/Application Support/JetBrains/PyCharm2022.3/scratches/scratch_37.py", line 7, in <module>
    naml.fit(x, y)
  File "/Users/pietergijsbers/repositories/automlbenchmark/frameworks/NaiveAutoML/venv/lib/python3.9/site-packages/naiveautoml/naiveautoml.py", line 507, in fit
    self.tune_parameters(X, y)
  File "/Users/pietergijsbers/repositories/automlbenchmark/frameworks/NaiveAutoML/venv/lib/python3.9/site-packages/naiveautoml/naiveautoml.py", line 391, in tune_parameters
    res = hpo.step(remaining_time)
  File "/Users/pietergijsbers/repositories/automlbenchmark/frameworks/NaiveAutoML/venv/lib/python3.9/site-packages/naiveautoml/commons.py", line 1098, in step
    raise Exception("Cannot step inactive HPO Process")

py3.9, naml 0.0.15

Naive AutoML can get stuck in infinite loop if no working pipeline is found

This piece of code is supposed to try and salvage a pipeline as a last resort:

while True:
    try:
        self.pl.fit(X, y)
        break
    except:
        self.logger.warning("There was a problem in building the pipeline, cutting it one down!")
        self.pl = Pipeline(steps=self.pl.steps[1:])
        self.logger.warning("new pipeline is:", self.pl)

However, since slicing an empty list as [][1:] is legal and produces an empty list [], when there are no more steps in the pipeline, there is no longer a way to break out of the loop:

  • fit fails
  • the pipeline is shrunk to the same empty-list pipeline
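A possible fix is to stop as soon as no steps are left. The sketch below shows the idea on plain lists rather than on a scikit-learn Pipeline; fit_with_fallback and fit_unless_scaler are hypothetical names, and raising a RuntimeError is only one way to report the failure.

```python
def fit_with_fallback(steps, fit):
    """Try fit(steps); on failure drop the first step and retry.

    Raises RuntimeError once no steps are left, instead of looping forever.
    """
    steps = list(steps)
    while steps:
        try:
            fit(steps)
            return steps
        except Exception:
            # Cut the pipeline down by one step, as in the original loop.
            steps = steps[1:]
    raise RuntimeError("No working pipeline could be built.")


# Hypothetical fit function that only succeeds without the "scaler" step.
def fit_unless_scaler(steps):
    if "scaler" in steps:
        raise ValueError("scaler failed")


remaining = fit_with_fallback(["scaler", "pca", "learner"], fit_unless_scaler)
```

The loop condition guarantees termination: each failure removes one step, so at most len(steps) attempts are made.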

Conditional Candidate Evaluation

Some evaluations are syntactically possible but do not make sense, e.g., decision trees with scalers. The user should be able to pass a callable with access to the history, which returns True if it allows the execution of a candidate and False otherwise.
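Such a callable might look as follows. This is only a sketch of the proposal: the signature (candidate, history), the dictionary representation of a candidate, and the function names are assumptions, while the slot names follow the pipeline printouts elsewhere on this page.

```python
from typing import Callable, Dict, List

Candidate = Dict[str, str]          # maps pipeline slot -> component name
History = List[Dict[str, object]]   # previously evaluated candidates


def no_scaler_before_tree(candidate: Candidate, history: History) -> bool:
    """Return False for decision trees combined with a scaler, True otherwise."""
    is_tree = candidate.get("learner") == "DecisionTreeClassifier"
    has_scaler = candidate.get("data-pre-processor", "").endswith("Scaler")
    return not (is_tree and has_scaler)


def filter_candidates(candidates, history, is_allowed: Callable[[Candidate, History], bool]):
    """Keep only the candidates the callable allows to be executed."""
    return [c for c in candidates if is_allowed(c, history)]


candidates = [
    {"data-pre-processor": "StandardScaler", "learner": "DecisionTreeClassifier"},
    {"data-pre-processor": "StandardScaler", "learner": "SVC"},
    {"learner": "DecisionTreeClassifier"},
]
allowed = filter_candidates(candidates, history=[], is_allowed=no_scaler_before_tree)
```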

Invalid scoring value results in "silent"

I think it would be good to do input validation on the scoring hyperparameter, and/or at least provide documentation on which words are valid (I assume scikit-learn?). I was trying various functions, made a typo, and my terminal got blasted with error messages :)

import naiveautoml
import sklearn.datasets
naml = naiveautoml.NaiveAutoML(scoring="acuracy")  # note the typo, but any non-supported metric seems to do
X, y = sklearn.datasets.load_iris(return_X_y=True)
naml.fit(X, y)
print(naml.chosen_model)

Repeats the following --- Logging error --- ad infinitum (or maybe until time expires):

There was a problem in building the pipeline, cutting it one down!

--- Logging error ---
Traceback (most recent call last):
  File "/Users/pietergijsbers/tmp/venv/lib/python3.9/site-packages/naiveautoml/naiveautoml.py", line 518, in fit
    self.pl.fit(X, y)
  File "/Users/pietergijsbers/tmp/venv/lib/python3.9/site-packages/sklearn/pipeline.py", line 401, in fit
    Xt = self._fit(X, y, **fit_params_steps)
  File "/Users/pietergijsbers/tmp/venv/lib/python3.9/site-packages/sklearn/pipeline.py", line 339, in _fit
    self._validate_steps()
  File "/Users/pietergijsbers/tmp/venv/lib/python3.9/site-packages/sklearn/pipeline.py", line 215, in _validate_steps
    names, estimators = zip(*self.steps)
ValueError: not enough values to unpack (expected 2, got 0)

During handling of the above exception, another exception occurred:
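The input validation suggested in this issue could fail fast, before any pipeline is built. A minimal sketch using only the standard library; validate_scoring is a hypothetical helper, and KNOWN_SCORINGS is an illustrative subset, not the set of names naiveautoml actually accepts.

```python
import difflib

# Illustrative subset of scoring names; the real set of accepted names is an assumption.
KNOWN_SCORINGS = ("accuracy", "neg_log_loss", "roc_auc")


def validate_scoring(scoring: str, known=KNOWN_SCORINGS) -> str:
    """Return scoring unchanged if it is known, else raise a ValueError with a hint."""
    if scoring in known:
        return scoring
    suggestions = difflib.get_close_matches(scoring, known, n=1)
    hint = f" Did you mean '{suggestions[0]}'?" if suggestions else ""
    raise ValueError(f"Unknown scoring '{scoring}'.{hint}")


try:
    validate_scoring("acuracy")
except ValueError as error:
    message = str(error)
```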

Example broken with `ConfigSpace==0.7.1`

MWE (note that this installs pandas, see #19):

python -m venv venv
source venv/bin/activate
python -m pip install --upgrade pip
python -m pip install naiveautoml
python -m pip install pandas
python example.py

python example:

import naiveautoml
import sklearn.datasets
naml = naiveautoml.NaiveAutoML()
X, y = sklearn.datasets.load_iris(return_X_y=True)
naml.fit(X, y)
print(naml.chosen_model)

output:

Successfully installed pandas-2.0.2 python-dateutil-2.8.2 pytz-2023.3 six-1.16.0 tzdata-2023.3

Traceback (most recent call last):
  File "/Users/pietergijsbers/tmp/example.py", line 5, in <module>
    naml.fit(X, y)
  File "/Users/pietergijsbers/tmp/venv/lib/python3.9/site-packages/naiveautoml/naiveautoml.py", line 503, in fit
    self.choose_algorithms(X, y)
  File "/Users/pietergijsbers/tmp/venv/lib/python3.9/site-packages/naiveautoml/naiveautoml.py", line 253, in choose_algorithms
    pl = self.get_pipeline_for_decision_in_step(step_name, comp, X, y, decisions)
  File "/Users/pietergijsbers/tmp/venv/lib/python3.9/site-packages/naiveautoml/naiveautoml.py", line 176, in get_pipeline_for_decision_in_step
    steps_tmp.append((step_name, build_estimator(comp, None, X, y)))
  File "/Users/pietergijsbers/tmp/venv/lib/python3.9/site-packages/naiveautoml/commons.py", line 216, in build_estimator
    params = {"kernel": config_json.read(json.dumps(comp["params"])).get_hyperparameter("kernel").value}
  File "/Users/pietergijsbers/tmp/venv/lib/python3.9/site-packages/ConfigSpace/read_and_write/json.py", line 444, in read
    _construct_hyperparameter(
  File "/Users/pietergijsbers/tmp/venv/lib/python3.9/site-packages/ConfigSpace/read_and_write/json.py", line 490, in _construct_hyperparameter
    q=hyperparameter["q"],
KeyError: 'q'

Resolved by installing an older version of ConfigSpace:

python -m pip install "ConfigSpace<0.7.1"
python example.py

output:

(venv) pietergijsbers@TUE027303 tmp % python example.py                      
Pipeline(steps=[('data-pre-processor', PowerTransformer()),
                ('feature-pre-processor', FastICA()),
                ('learner', QuadraticDiscriminantAnalysis())])

Package has a dependency on pandas that is not specified in `setup.py`

MWE:

python -m venv venv
source venv/bin/activate
python -m pip install --upgrade pip
python -m pip install naiveautoml
python example.py

python example:

import naiveautoml
import sklearn.datasets
naml = naiveautoml.NaiveAutoML()
X, y = sklearn.datasets.load_iris(return_X_y=True)
naml.fit(X, y)
print(naml.chosen_model)

output:

Successfully installed configspace-0.7.1 func-timeout-4.3.5 joblib-1.2.0 more-itertools-9.1.0 naiveautoml-0.0.15 numpy-1.25.0 psutil-5.9.5 pyparsing-3.1.0 scikit-learn-1.2.2 scipy-1.10.1 threadpoolctl-3.1.0 tqdm-4.65.0 typing-extensions-4.6.3

Traceback (most recent call last):
  File "/Users/pietergijsbers/tmp/example.py", line 1, in <module>
    import naiveautoml
  File "/Users/pietergijsbers/tmp/venv/lib/python3.9/site-packages/naiveautoml/__init__.py", line 1, in <module>
    from naiveautoml.naiveautoml import NaiveAutoML
  File "/Users/pietergijsbers/tmp/venv/lib/python3.9/site-packages/naiveautoml/naiveautoml.py", line 3, in <module>
    import pandas as pd
ModuleNotFoundError: No module named 'pandas'

Enable three-way pipelines for transformations only on training data

The main problem with the standard logic of pipelines is that fit_transform, which is applied to all pre-processors in the pipeline, first applies fit and then transform, where transform uses the same logic on the training data as for other data that would pass the pipeline later in a standard transform call.

Some pre-processors of a pipeline should only be used in the transform step coupled to the fit step, i.e., only in fit_transform but not in an ordinary transform. One solution is to use three different methods: fit, transform_fitted_data, transform.

A classical example is SMOTE, whose job is to do the following things during the different phases:

  1. fit: Memorizes the data
  2. transform_fitted_data: Applies upsampling based on the given data
  3. transform: Does nothing

Alternatively, one could extend the signature of the transform function with an optional parameter fitted_data: bool. The pipeline can then set this parameter to true when the fit_transform function is used. If the parameter is absent, no distinction should be made between the fitted data and other data.
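The three-method variant can be sketched in plain Python. The Upsampler below is a toy stand-in that merely duplicates minority-class rows (it is not SMOTE), and fit_pipeline is a hypothetical helper, not part of scikit-learn or naiveautoml.

```python
class Upsampler:
    """Toy stand-in for SMOTE: duplicates rows of the minority class while fitting."""

    def fit(self, X, y):
        # Remember which class is the minority class.
        counts = {label: y.count(label) for label in set(y)}
        self.minority_ = min(counts, key=counts.get)
        return self

    def transform_fitted_data(self, X, y):
        # Only the training data is upsampled.
        extra = [(x, label) for x, label in zip(X, y) if label == self.minority_]
        return X + [x for x, _ in extra], y + [label for _, label in extra]

    def transform(self, X):
        return X  # does nothing on data seen after fitting


def fit_pipeline(steps, X, y):
    """Fit each step, using transform_fitted_data when a step offers it."""
    for step in steps:
        step.fit(X, y)
        if hasattr(step, "transform_fitted_data"):
            X, y = step.transform_fitted_data(X, y)
        else:
            X = step.transform(X)
    return X, y


X_train, y_train = fit_pipeline([Upsampler()], [[1], [2], [3]], [0, 0, 1])
```

Here the training data grows by the duplicated minority row, while a later transform call on new data leaves it untouched.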

Segmentation Fault

When working with larger data and runtimes, I sometimes run into segmentation faults. These seem to happen during garbage collection, and not during a specific naiveautoml call:

import gc
from naiveautoml import NaiveAutoML
import openml

import logging
logger = logging.getLogger('naiveautoml')
logger.setLevel(logging.INFO)
ch = logging.StreamHandler()
ch.setLevel(logging.INFO)
formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
ch.setFormatter(formatter)
logger.addHandler(ch)

if __name__ == "__main__":
    census_income = openml.datasets.get_dataset(4535)
    x, y, *_ = census_income.get_data(target='V42')

    naml = NaiveAutoML(
        timeout=3600,
        execution_timeout=600,
        max_hpo_iterations=1e10,
    )
    naml.fit(x, y)
    p = naml.predict(x)
    pp = naml.predict_proba(x)

    del naml
    gc.collect()

output:

.... search and so ...
2023-06-29 15:53:21,415 - naiveautoml - INFO - --------------------------------------------------
2023-06-29 15:53:21,415 - naiveautoml - INFO - Search Completed. Building final pipeline.
2023-06-29 15:53:21,415 - naiveautoml - INFO - --------------------------------------------------
2023-06-29 15:53:21,425 - naiveautoml - INFO - Pipeline(steps=[('impute_and_binarize',
                 ColumnTransformer(transformers=[('num',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer(strategy='median'))]),
                                                  ['V1', 'V3', 'V4', 'V6',
                                                   'V17', 'V18', 'V19', 'V25',
                                                   'V31', 'V37', 'V39', 'V40',
                                                   'V41']),
                                                 ('cat',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer(strategy='most_frequent')),
                                                                  ('binarizer',
                                                                   OneHotEncoder(handle_unknown='ignore',
                                                                                 sparse=False))]),
                                                  ['V2', 'V5', 'V7', 'V8', 'V9',
                                                   'V10', 'V11', 'V12', 'V13',
                                                   'V14', 'V15', 'V16', 'V20',
                                                   'V21', 'V22', 'V23', 'V24',
                                                   'V26', 'V27', 'V28', 'V29',
                                                   'V30', 'V32', 'V33', 'V34',
                                                   'V35', 'V36', 'V38'])])),
                ('learner', LinearDiscriminantAnalysis())])
2023-06-29 15:53:21,434 - naiveautoml - INFO - Now fitting the pipeline with all given data.
2023-06-29 16:08:34,527 - naiveautoml - INFO - Runtime was 4503.308254957199 seconds
zsh: segmentation fault  python 

Improve history

create columns for:

  • the choice of each pipeline slot
  • status with possible values success, fail, timeout
  • evaltime for the total time in seconds required for the evaluation
  • one column for each item in side_scores
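The columns listed above could be modelled as one record per evaluated pipeline. This is a sketch of the proposal only; HistoryEntry, its field names other than status, evaltime, and side_scores, and the as_row method are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

VALID_STATUS = ("success", "fail", "timeout")


@dataclass
class HistoryEntry:
    choices: Dict[str, str]              # chosen component per pipeline slot
    status: str                          # one of VALID_STATUS
    evaltime: float                      # total evaluation time in seconds
    score: Optional[float] = None        # main validation score, if any
    side_scores: Dict[str, float] = field(default_factory=dict)

    def __post_init__(self):
        if self.status not in VALID_STATUS:
            raise ValueError(f"status must be one of {VALID_STATUS}")

    def as_row(self):
        """Flatten into one dict: one column per slot and one per side score."""
        row = {"status": self.status, "evaltime": self.evaltime, "score": self.score}
        row.update(self.choices)
        row.update(self.side_scores)
        return row


entry = HistoryEntry(
    choices={"learner": "LinearDiscriminantAnalysis"},
    status="success",
    evaltime=1.5,
    score=0.9,
    side_scores={"accuracy": 0.95},
)
row = entry.as_row()
```

A list of such rows can be turned into a table directly, for instance with pandas.DataFrame(rows).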

Assure that SVMs are not executed with probabilities.

This implies that SVMs cannot be used by default when the main metric is log loss or something that depends on probabilities. However, the time overhead to fit for probabilities (and the fact that those are not necessarily proper) implies that we maybe should not use them.

Enable pandas dataframes for X

Currently X must be a numpy array, but it would be great if it could also be a pandas dataframe. Types could be read from the df.

Support for upsampling required

Naive AutoML needs support for up-sampling like SMOTE. This cannot be done a priori by the user, because this would disturb the validation process.
