openml-labs / gama

An automated machine learning tool aimed to facilitate AutoML research.

Home Page: https://openml-labs.github.io/gama/master/

License: Apache License 2.0

Languages: Python 99.09%, TeX 0.91%
Topics: automl, hyperparameter-optimization, research-tool

gama's People

Contributors

bilgecelik, chclam, darigovresearch, dependabot[bot], himanshu007-creator, joaquinvanschoren, pgijsbers


gama's Issues

Create shorthands for handling logs to stdout and file

Build in two common uses for logging: printing to stdout and logging to file.
All that needs to be done is set up the right stream handlers for the 'gama' log.
Make sure to update the documentation in User Guide/Things To Know/Logging.
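For illustration, a minimal sketch of what the two shorthands would configure by hand today, using only the standard library (handler choices and file name are assumptions):

import logging
import sys

gama_log = logging.getLogger('gama')
gama_log.setLevel(logging.DEBUG)

# Shorthand 1: print log records to stdout.
gama_log.addHandler(logging.StreamHandler(sys.stdout))

# Shorthand 2: append log records to a file.
gama_log.addHandler(logging.FileHandler('gama.log'))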

Memory consumption

Hello,

I would like to ask about the memory consumption of gama. In auto-sklearn, there are parameters (ml_memory_limit and ensemble_memory_limit) that allow the user to set the amount of memory the AutoML system uses. I searched the docs of gama but could not find such options, and the only mention of memory is in the release notes ("Memory usage of all GAMA's processes is logged."). Where is it logged exactly? I guess in gama.log, but searching for "memory" returned no results. What is gama's general policy on memory consumption? I am asking because I will be running experiments on an HPC and need to know how much memory to allocate to the task in order to have gama use ~50 GB.

Thank you.

Create a structured way to perform mutation/create new individuals.

Currently, creating a new individual or mutating one is done at random.
Only after creation do we check that the individual is actually new, and if not, we start over.
This means the operation may need to be applied multiple times in the hope of getting a good result, which is not an elegant solution (and one that might still produce no new individual at all).

The desired behavior is that creating or mutating an individual takes previously tried individuals into account. I don't know a good way to do this. Perhaps something with keeping a graph of created individuals? Then an individual could be created/mutated by traversing or inspecting the graph. Would this require too many resources (either CPU or memory)?
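For illustration, one structured option sketched with a hypothetical helper (enumerate_mutations is not an existing GAMA function): enumerate the candidate mutations up front and sample only from those not seen before.

import random

seen = set()  # string representations of all individuals created so far

def mutate_structured(individual, enumerate_mutations):
    # Filter the full mutation neighborhood against previously created individuals.
    candidates = [c for c in enumerate_mutations(individual) if str(c) not in seen]
    if not candidates:
        return None  # every possible mutation was already tried
    choice = random.choice(candidates)
    seen.add(str(choice))
    return choice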

OpenML integration

Automatically run GAMA on OpenML tasks by adding an optional dependency on the openml API. Specifics need to be decided on (a sketch follows the list below), e.g.:

  • full trace integration?
  • allow ensembles?
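A sketch of what the integration might look like with the openml package (task id and time budget are arbitrary):

import openml
from gama import GamaClassifier

task = openml.tasks.get_task(31)  # e.g. the credit-g task
X, y = task.get_X_and_y()

automl = GamaClassifier(max_total_time=180)
automl.fit(X, y)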

Make logs more readable?

I'm not sure whether the log is meant to ever be read by a human?
I tried understanding what is going on in the pipeline construction, but this is impossible given the large amount of output, much of which seems diagnostic or just object prints:

PLE;EVAL;2018-11-13 09:12:09,412014;1.6411040000000128;(-0.3984387878651146, -1);49dbdcef-d34c-412b-8136-a1789b54e838;ExtraTreesClassifier(data, ExtraTreesClassifier.bootstrap=True, ExtraTreesClassifier.criterion='gini', ExtraTreesClassifier.max_features=0.8, min_samples_leaf=4, min_samples_split=14, ExtraTreesClassifier.n_estimators=100);2018-11-13 09:12:11,517865;END!
Evaluation;1.6411;(-0.3984387878651146, -1);ExtraTreesClassifier(data, ExtraTreesClassifier.bootstrap=True, ExtraTreesClassifier.criterion='gini', ExtraTreesClassifier.max_features=0.8, min_samples_leaf=4, min_samples_split=14, ExtraTreesClassifier.n_estimators=100)
PLE;RMV_IND;[[<deap.gp.Primitive object at 0x1a15718548>, <deap.gp.Primitive object at 0x1a156af4a8>, <deap.gp.Terminal object at 0x1a156a7120>, <deap.gp.Terminal object at 0x1a1571a900>, <deap.gp.Terminal object at 0x1a1571aa68>, <deap.gp.Terminal object at 0x1a1571af30>, <deap.gp.Terminal object at 0x1a156fedc8>, <deap.gp.Terminal object at 0x1a15700ab0>, <deap.gp.Terminal object at 0x1a1571e288>]];2018-11-13 09:12:11,523995;END!

Extend observable behavior and/or reconsider the behavior

The current subject-observer pattern only supports evaluations (here).
However, I think instead of directly adding the interface to the GAMA object, there could be a completely isolated object which just parses the GAMA log and performs callbacks.

The advantage is more isolation. However, this means that the objects passed along with these events are no longer the same objects being handled in the evolutionary algorithm, but rather parsed interpretations.
Also, I don't know whether it is good or bad practice to use the logging module for this kind of isolation.

Either way, more events should be exposed, e.g. mutation, selection and elimination. As such, it might also be more sensible to link it to the operator set.
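A sketch of the isolated-observer idea (event names follow the log excerpt shown above; the parsing is illustrative, not GAMA's actual format handling):

class LogObserver:
    """Parses a GAMA log and performs callbacks for registered events."""

    def __init__(self, logfile):
        self.logfile = logfile
        self.callbacks = {'EVAL': [], 'RMV_IND': []}

    def on(self, event, callback):
        self.callbacks[event].append(callback)

    def run(self):
        with open(self.logfile) as fh:
            for line in fh:
                for event, callbacks in self.callbacks.items():
                    if ';{};'.format(event) in line:
                        for callback in callbacks:
                            callback(line.strip())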

Use base class docstrings where possible

The Gama base class shares docstrings with both GamaClassifier and GamaRegressor, but currently these have to be duplicated to be rendered successfully. It should be possible to write the docstrings in the base class but have them rendered for the child classes?

Same goes for e.g. fit and predict.
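A minimal sketch of one way to do this in plain Python (a hypothetical class decorator, not something GAMA currently ships):

def inherit_docstrings(cls):
    """Copy missing docstrings onto overridden methods from base classes."""
    for name, member in vars(cls).items():
        if callable(member) and not member.__doc__:
            for base in cls.__mro__[1:]:
                base_member = getattr(base, name, None)
                if base_member is not None and base_member.__doc__:
                    member.__doc__ = base_member.__doc__
                    break
    return cls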

score function

Adding a score() function would make gama more of a drop-in replacement for scikit-learn classifiers in scripts.
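A minimal sketch of such a method, assuming accuracy as the metric (the actual metric should probably follow the scoring hyperparameter instead):

from sklearn.metrics import accuracy_score

def score(self, x, y):
    """Return the accuracy of self.predict(x) with respect to y."""
    return accuracy_score(y, self.predict(x))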

Dynamically pick evaluation strategy

A static 5-fold CV is not good. The evaluation strategy should ideally be dynamic, ranging from e.g. repeated k-fold for small datasets to hold-out for large ones.
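A sketch of what "dynamic" could mean, with illustrative thresholds (the cut-offs are assumptions, not tuned values):

from sklearn.model_selection import KFold, RepeatedKFold, ShuffleSplit

def pick_evaluation_strategy(n_samples):
    # Thresholds below are for illustration only.
    if n_samples < 1_000:
        return RepeatedKFold(n_splits=5, n_repeats=3)
    if n_samples < 50_000:
        return KFold(n_splits=5)
    return ShuffleSplit(n_splits=1, test_size=0.2)  # hold-out for large data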

Improve code formatting of code export

PR #69 introduced a code export function that writes the best found model definition to a Python file.
The formatting of the exported code, however, is very poor. It should be improved to respect normal indentation levels and newline placement.
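One option would be to run the generated source through a code formatter such as black before writing it out; a sketch, assuming the black package is available (the input string is illustrative):

import black

ugly = "pipeline=Pipeline([('imputer',SimpleImputer()),('clf',ExtraTreesClassifier(n_estimators=100))])"
pretty = black.format_str(ugly, mode=black.FileMode())
print(pretty)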

Upgrade front-end with other visualization package?

The current dashboard implementation is in Dash.
Development is quite slow because of a few complications:

  • files have to be "uploaded". This is problematic when starting GAMA from the Dashboard, as the data unnecessarily needs to be loaded into Dash before calling GAMA. With the new logging structure (separate files), this is also annoying, as one cannot point to a directory (instead, each file needs to be "uploaded").
  • the callback architecture of Dash can be frustrating to work with at times.
  • web development, in particular placement and layout of elements, can be very time consuming.

Additionally I think it will not extend well to a more flexible semi-AutoML workflow where there is interactive data manipulation.

Suggestions are welcome.

Use DASK to support caching and parallelism

Perhaps gama could switch from evaluating multiple pipelines single-core in parallel to evaluating one pipeline multi-core sequentially. This could lead to improved performance due to e.g. less memory use.
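A sketch of that pattern using joblib's dask backend (requires the dask.distributed package; this is not GAMA's current integration): one pipeline at a time, with its cross-validation folds evaluated in parallel on a local cluster.

import joblib
from dask.distributed import Client
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

client = Client()  # local multi-process scheduler
X, y = load_digits(return_X_y=True)
with joblib.parallel_backend('dask'):
    # a single pipeline, its folds evaluated in parallel
    scores = cross_val_score(RandomForestClassifier(), X, y, cv=5, n_jobs=-1)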

Creating multiple `Gama` instances will register multiple streamhandlers to the Gama logger.

This affects users only if they create more than one Gama instance in a session.

import logging
from gama import GamaClassifier

GamaClassifier(verbosity=logging.DEBUG)
GamaClassifier(verbosity=logging.DEBUG)

produces

Using GAMA version 19.01.0.
GamaClassifier(cache_dir=None,keep_analysis_log=gama.log,verbosity=10,n_jobs=1,max_eval_time=300,max_total_time=3600,population_size=50,random_state=None,regularize_length=True,scoring=neg_log_loss)

three times instead of twice, because the second time the message is sent to the gama logger, it is handled by both stream handlers.

I'm not entirely sure how best to remedy this. Checking whether handlers are already registered is one option. Registering only a single handler raises the issue of conflicts when one GamaClassifier is created with logging level DEBUG and another with level WARNING.
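A sketch of the first option (checking for existing handlers before registering a new one):

import logging

gama_log = logging.getLogger('gama')
# Only attach a new stdout handler if none is registered yet.
if not any(isinstance(h, logging.StreamHandler) for h in gama_log.handlers):
    gama_log.addHandler(logging.StreamHandler())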

Increase code unit test coverage

Not all code is currently covered by unit and/or system tests.
In some cases this does not matter (e.g. not every ValueError scenario needs to be automatically checked, I think), but other functionality still needs coverage (e.g. the time-out behavior in evaluation.py).

Support ARFF

This allows for more meta-data to be used, such as proper handling of categorical variables.
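For reference, loading ARFF with the liac-arff package exposes the attribute types alongside the data (file name is illustrative):

import arff  # the liac-arff package

with open('dataset.arff') as fh:
    dataset = arff.load(fh)

feature_types = dataset['attributes']  # [(name, type), ...]; categorical features list their categories
rows = dataset['data']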

A 2-dimensional y-vector (i.e. (N,1)) is invalid input

GAMA will crash if given a 2-dimensional y-vector; minimal working example:

from sklearn.datasets import load_iris
from gama import GamaClassifier

automl = GamaClassifier()
x, y = load_iris(return_X_y=True)
automl.fit(x, y.reshape(-1, 1))

Expected behavior: correctly format y into a pandas.Series as it would a (N,)-shaped array, as long as the second dimension is size 1.
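A sketch of that normalization (not the actual GAMA code):

import numpy as np
import pandas as pd

def format_y(y):
    y = np.asarray(y)
    if y.ndim == 2 and y.shape[1] == 1:
        y = y.squeeze(axis=1)  # treat (N, 1) the same as (N,)
    return pd.Series(y)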

Parallelize Ensemble Construction

Constructing the ensemble, i.e. picking the models and assigning their weights, can be parallelized.
Multiple models can be considered for adding in parallel.
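A sketch of the parallel step with concurrent.futures; score_addition(ensemble, candidate) is a hypothetical helper that returns the ensemble's score after adding the candidate.

from concurrent.futures import ProcessPoolExecutor
from functools import partial

def best_addition(ensemble, candidates, score_addition, n_workers=4):
    # Score every candidate's addition to the current ensemble in parallel.
    with ProcessPoolExecutor(max_workers=n_workers) as pool:
        scores = list(pool.map(partial(score_addition, ensemble), candidates))
    best_index = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best_index]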

Files being written at the end of fit procedure may be corrupted

When fit-time runs out and one of the evaluation processes is killed, the process might still be writing to a file. This seems to result in 0-sized files, but there is no official specification of this behavior.

This should either be prevented somehow, or we should be able to have a guarantee that all incomplete files are exactly 0-sized.

Variables type definition

Hello, I am reading the docs and I came across this:

This means that GAMA might use a wrong feature transformation for the data (e.g. one-hot encoding on a numeric feature or scaling on a categorical feature). Note that this is not unique to GAMA, but any framework which accepts numeric input without meta-data.

I know arff is a format that encapsulates the data types in its structure, but I, like many others, use .csv data files, so we do not provide data type information with our original data. My question is: if my data is in .csv format and I use pandas, is there a way to ensure no mistakes are made when inferring the data types? And is there a way to manually define the data types?

Thank you
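For reference, a general pandas sketch (column names are illustrative, not from any particular dataset): pass dtype to read_csv, or cast columns afterwards.

import pandas as pd

df = pd.read_csv('data.csv', dtype={'age': 'float64', 'gender': 'category'})
df['city'] = df['city'].astype('category')  # cast after loading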

Runtime warning

I'm getting a lot of runtime warnings on the minimal example:

from sklearn.datasets import load_digits
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from gama import GamaClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)
automl = GamaClassifier(max_total_time=300, n_jobs=4)
automl.fit(X_train, y_train)
predictions = automl.predict(X_test)
print('accuracy', accuracy_score(y_test, predictions))

This produces:
anaconda3/lib/python3.6/site-packages/sklearn/utils/deprecation.py:58: DeprecationWarning: Class Imputer is deprecated; Imputer was deprecated in version 0.20 and will be removed in 0.22. Import impute.SimpleImputer from sklearn instead.
  warnings.warn(msg, category=DeprecationWarning)
Could not create a new individual from 50 iterations of mate_new
anaconda3/lib/python3.6/importlib/_bootstrap.py:219: ImportWarning: can't resolve package from __spec__ or __package__, falling back on __name__ and __path__
  return f(*args, **kwds)
Could not create a new individual from 50 iterations of mate_new
Could not create a new individual from 50 iterations of mate_new
Could not create a new individual from 50 iterations of mate_new

The DeprecationWarning can be fixed easily, I assume? The second I'm not so sure about.

'Could not create a new individual' is not very descriptive. Does this mean that there is something wrong with the algorithm? Is there something the user should do (e.g. use different hyperparameters)?

Automatically exclude some methods from being included in documentation

For the search_methods module, I don't want to include the dynamic_defaults and search methods, but they seem to be added by default anyway in the .. automodule:: command.
I tried to circumvent this by adding :no-undoc-members: and tinkering with the defaults but neither worked.
Not including undocumented members should be default behavior as far as I could tell, so I probably configured something wrong somewhere.
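For reference, autodoc's :exclude-members: option should hide specific members explicitly (module path assumed):

.. automodule:: gama.search_methods
   :members:
   :exclude-members: dynamic_defaults, search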

timeout exception

I have 96k samples of dimension 270 to fit. When I try to fit them following the classification examples, a timeout exception is reported, as shown in the image below.

[screenshot of the timeout exception]

Add scikit-learn 0.20 to CI

Currently the tests use the latest version of scikit-learn, which at the moment is 0.21. This means that new changes are not tested against scikit-learn 0.20, even though it is officially supported. We should explicitly test against both 0.20 and 0.21.

Allow CSV files

Despite the limitations of CSV files (in particular, no feature type annotation), I think we should support a best-effort attempt at working with CSV files. In the future this can integrate with the Dashboard tool to annotate the csv/convert it to arff for a more robust experience.

Add type hints

Add type hints to function signatures to make the code easier to read, as per PEP 484.
Currently I touch up old code whenever I happen to work on it, but this means there is still work left to do.
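A before/after illustration on a hypothetical helper:

def n_primitives(individual):  # before
    return len(individual.primitives)

def n_primitives(individual: 'Individual') -> int:  # after, per PEP 484
    return len(individual.primitives)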

Visualization of lineage

I really like the visualization in Fig. 2 here. Add such a visualization to the visualization notebook. I think it can already be done by parsing the log as-is.

Import warnings

On import, a lot of warnings are thrown:

anaconda3/lib/python3.6/site-packages/deap/tools/_hypervolume/pyhv.py:33: ImportWarning: Falling back to the python version of hypervolume module. Expect this to be very slow.
  "module. Expect this to be very slow.", ImportWarning)
anaconda3/lib/python3.6/importlib/_bootstrap.py:219: ImportWarning: can't resolve package from __spec__ or __package__, falling back on __name__ and __path__
  return f(*args, **kwds)

Is there something we can do about this? Is this specific to my anaconda setup?
Especially the first warning doesn't look so good.

Explanation for `FutureWarning: is_categorical is deprecated and will be removed in a future version.`

When using pandas>=1.1.0 you will encounter the following warning:

...\site-packages\category_encoders\utils.py:21: 
FutureWarning: is_categorical is deprecated and will be removed in a future version.  
Use is_categorical_dtype instead.
elif pd.api.types.is_categorical(cols):

this is due to the category_encoders package calling a function which is deprecated as of pandas>=1.1.0, and is not caused by GAMA directly.
There is a PR with a fix for the category_encoders package here.
Hopefully it will be integrated in a PyPI release soon.
Until then, to avoid the warnings you can either use pandas<1.1.0 or add the following lines of code (from this blog):

# import warnings filter
from warnings import simplefilter
# ignore all future warnings
simplefilter(action='ignore', category=FutureWarning)

Here are the related Python docs.

Clear and structured support for any optimization metric(s)

Currently the way to specify optimization metrics is a little messy. There is the scikit-learn-like scoring hyperparameter, which specifies multi-objective optimization towards the given metric and pipeline length. There is the objectives hyperparameter, which lets you specify more than one optimization metric. And there is optimize_strategy, which lets you specify minimization and/or maximization.

This should be unified into one clear way to set metric(s) to optimize towards.

Find alternative to using __str__ for pipeline representation.

For particularly verbose pipelines (many steps and/or hyperparameters), the str and repr representations omit part of the pipeline, e.g.:
Pipeline(memory=None, steps=[('FeatureAgglomeration0', FeatureAgglomeration(affinity='cosine', compute_full_tree='auto', connectivity=None, linkage='average', memory=None, n_clusters=2, pooling_func=<function mean at 0x00000182D2EFA9D8>)), ('RBFSampler0', RBFSampler(gamma=0.65, n_components=100, ran...imators=100, n_jobs=1, oob_score=False, random_state=None, verbose=0, warm_start=False))])

So an alternative is needed.
One alternative would be to pass individuals along instead of pipelines.
However, that might require a workaround as that is a generated class.

This is only an issue for log output. The current visualization script ignores this case.
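A sketch of a non-truncating alternative that builds the representation from the pipeline's own steps and parameters (a generic scikit-learn helper, not GAMA's eventual solution):

def full_repr(pipeline):
    """Render every step with all of its hyperparameters, untruncated."""
    parts = []
    for name, step in pipeline.steps:
        params = ', '.join('{}={!r}'.format(k, v) for k, v in step.get_params().items())
        parts.append('{}: {}({})'.format(name, type(step).__name__, params))
    return '\n'.join(parts)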

Replay Functionality

GAMA is currently not reproducible due to randomness in the timing of asynchronous tasks. While this may be alleviated by opting for different forms of parallelism, this is out of scope for now. It would be useful to have replay functionality in GAMA, meaning that given its log-file, GAMA can retrace its steps. From there, it should also be possible to pause GAMA arbitrarily and resume with different operators (mutation, selection, crossover, etc.).

cannot import name 'Gama Classifier'

I tried running

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import log_loss, accuracy_score
from gama import GamaClassifier

if __name__ == "__main__":
    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, stratify=y, random_state=0
    )

    automl = GamaClassifier(max_total_time=180, store="nothing", n_jobs=1)
    print("Starting `fit` which will take roughly 3 minutes.")
    automl.fit(X_train, y_train)

    label_predictions = automl.predict(X_test)
    probability_predictions = automl.predict_proba(X_test)

    print("accuracy:", accuracy_score(y_test, label_predictions))
    print("log loss:", log_loss(y_test, probability_predictions))

from https://pgijsbers.github.io/gama/master/user_guide/index.html#examples and I get "ImportError: cannot import name 'GamaClassifier'." I have tried this on both Linux and Windows and get the same error. Here is my pip freeze output for both OSs:

appdirs==1.4.4
attrs==20.3.0
black==19.10b0
category-encoders==2.2.2
click==7.1.2
Cython==0.29.21
gama==20.2.2
joblib==1.0.0
liac-arff==2.5.0
numpy==1.19.4
pandas==1.0.5
pathspec==0.8.1
patsy==0.5.1
pkg-resources==0.0.0
psutil==5.8.0
python-dateutil==2.8.1
pytz==2020.5
regex==2020.11.13
scikit-learn==0.24.0
scipy==1.5.4
six==1.15.0
statsmodels==0.12.1
stopit==1.1.2
threadpoolctl==2.1.0
toml==0.10.2
typed-ast==1.4.1

Cache best fit pipeline for auto ensemble.

Currently, the best found pipeline is fit before starting the auto ensemble procedure.
This gives a fallback model in case the auto ensemble does not finish within the specified time limit.
Two things need to be changed from the current implementation:

  • bug: currently, the time used to fit the best pipeline is not subtracted from the time left for the auto ensemble procedure.
  • enhancement: the best pipeline is always included in the ensemble. This means it is trained once as the fallback model and again later for the ensemble. With some caching, this training should only have to happen once.

Invalid individuals can crash gama (custom search spaces only)

This issue should not affect Gama under its default configuration

During evaluation, individuals are compiled. But under a custom configuration, the different components may be defined such that individuals are generated that cannot be compiled. There is no safety net in place for catching this, so the search procedure may exit with the raised error.

The default configuration should not be affected by this, as the primitives are selected s.t. they all adhere to the scikit-learn rules of transformers/estimators.
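A sketch of the missing safety net; compile_individual, evaluate, and worst_fitness are hypothetical stand-ins for GAMA's internals.

def safe_evaluate(individual, compile_individual, evaluate, worst_fitness):
    try:
        pipeline = compile_individual(individual)
    except Exception as error:
        # Treat an uncompilable individual as a failed evaluation
        # instead of crashing the search procedure.
        return worst_fitness, error
    return evaluate(pipeline), None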

`queue.Full` may be raised by workerthread of ProcessPoolExecutor

This is likely because we terminate the child processes directly, which should not really be done. The reason for shutting the processes down directly is that it immediately frees the resources they use, instead of waiting for the currently running jobs to finish. Unfortunately, this interferes with the clean-up logic of concurrent.futures.ProcessPoolExecutor.

The error isn't always raised, and I don't think it has any major consequences. But I'd like to find a way to terminate the processes and the worker thread correctly but abruptly.

Traceback (most recent call last):
  File "/opt/python/3.6.3/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/opt/python/3.6.3/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/python/3.6.3/lib/python3.6/concurrent/futures/process.py", line 295, in _queue_management_worker
    shutdown_worker()
  File "/opt/python/3.6.3/lib/python3.6/concurrent/futures/process.py", line 253, in shutdown_worker
    call_queue.put_nowait(None)
  File "/opt/python/3.6.3/lib/python3.6/multiprocessing/queues.py", line 129, in put_nowait
    return self.put(obj, False)
  File "/opt/python/3.6.3/lib/python3.6/multiprocessing/queues.py", line 83, in put
    raise Full
queue.Full

Look into good default restarting behavior.

Because evolutionary algorithms tend to optimize locally, it is often useful to execute the algorithm multiple times with different initial populations and pick the best solution produced across restarts.

Currently, GAMA does not stop searching to restart with a new population, so this 'outer loop' is still left to the end user. While it is possible to restart the optimization process by giving a custom restart function, this is not used by default.

As an AutoML system, GAMA should be able to decide when to restart its optimization process to (on average) provide a better global solution.

Training estimation report

Hello,

I would like to ask about the training performance reported once the model is trained. The API states "Should take 3 minutes to run and give the output below (exact performance might differ)", so where can I find the actual performance estimate?

Thanks in advance

Centralize version, make it attribute of module

There are currently two issues with keeping track of gama's version:

  • it is located in two files (gama\gama.py and setup.py)
  • it is not directly available with import gama; gama.__version__

todo: create a __version__ file and refer to it from gama\__init__.py and setup.py.
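A sketch of that todo (version string is illustrative):

# gama/__version__.py -- the single source of truth:
__version__ = '19.01.0'

# gama/__init__.py then re-exports it:
#     from .__version__ import __version__

# setup.py can read the file without importing the package:
version_ns = {}
with open('gama/__version__.py') as fh:
    exec(fh.read(), version_ns)
# setup(..., version=version_ns['__version__'])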
