skplumber's People

Contributors

epeters3


skplumber's Issues

Add Time-Constrained Optimization

Add logic for skplumber to optimize intelligently given a time budget. Specifically, in the pipeline sampling phase, use extreme value theory and a running average of pipeline fit+score times to estimate the time remaining, always leaving enough time for the flexga package to complete at least one generation of hyperparameter tuning. The time one generation will take can be estimated as the fit+score time of the best pipeline sampled so far, multiplied by the number of hyperparameters in that pipeline to tune, multiplied by 10 (since that's flexga's population size).
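
A minimal sketch of that arithmetic, with illustrative names (none of these are skplumber's actual internals):

FLEXGA_POP_SIZE = 10  # flexga's population size, per the estimate above

def estimate_ga_generation_seconds(best_fit_score_seconds, n_tunable_hyperparams):
    """Estimated time for flexga to complete one tuning generation."""
    return best_fit_score_seconds * n_tunable_hyperparams * FLEXGA_POP_SIZE

def can_sample_again(elapsed, budget, avg_sample_seconds, ga_seconds):
    """Sample another pipeline only if doing so still leaves room for
    at least one generation of hyperparameter tuning."""
    return elapsed + avg_sample_seconds + ga_seconds < budget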

This will make the SKPlumber.crank method even a little higher level, with fewer knobs to tune, which is okay because the lower-level components of the package (e.g. sampling, tuning, pipeline) are still available to the user.

Add `Pipeline` Functionality to Readme

Currently, only a basic example for SKPlumber.crank() is provided in the readme. The Pipeline class API is slightly lower level than SKPlumber and has some nice features. Also, since SKPlumber.crank() returns a Pipeline instance, it's important to know how to use that pipeline downstream.

Don't Allow Infeasible Hyperparameter Combinations

In logistic regression, this solver/penalty combination is not supported, so make sure it cannot be tried in skplumber:

  File "/usr/local/lib/python3.6/dist-packages/sklearn/linear_model/_logistic.py", line 445, in _check_solver
    "got %s penalty." % (solver, penalty))
ValueError: Solver lbfgs supports only 'l2' or 'none' penalties, got elasticnet penalty.

Here is another one for SVM:

  File "/usr/local/lib/python3.6/dist-packages/sklearn/svm/_base.py", line 793, in _get_liblinear_solver_type
    % (error_string, penalty, loss, dual))
ValueError: Unsupported set of arguments: The combination of penalty='l1' and loss='squared_hinge' are not supported when 
dual=True, Parameters: penalty='l1', loss='squared_hinge', dual=True
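
One possible guard, sketched with illustrative names (not skplumber's actual internals): keep a blocklist of known-bad combinations and check each candidate against it before fitting.

INFEASIBLE_COMBOS = [
    {"solver": "lbfgs", "penalty": "elasticnet"},  # the logistic regression case
    {"penalty": "l1", "loss": "squared_hinge", "dual": True},  # the SVM case
]

def is_feasible(hyperparams):
    """Return False if hyperparams matches any known-bad combination."""
    return not any(
        all(hyperparams.get(k) == v for k, v in combo.items())
        for combo in INFEASIBLE_COMBOS
    )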

Exit When Budget is Used Up Even Before Progress Can Report

In the SKPlumber._sampler_callback function, whether the system should exit sampling early is not checked until the SKPlumber.progress object is able to report. Sometimes the whole time budget is used up before the progress object is ever able to report.
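
A possible shape of the fix, with assumed attribute names (start_time, budget, and the stop signal are illustrations, not skplumber's actual internals): check the budget at the very top of the callback, before any progress reporting.

import time

def _sampler_callback(self, *args, **kwargs):
    # Check the time budget first, before waiting on self.progress.
    if time.time() - self.start_time >= self.budget:
        return True  # hypothetical "stop sampling now" signal
    self.progress.report()  # hypothetical; reporting comes second
    ...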

Add Good Encoder

The encoder should have fit and produce methods, so the output columns are always the same. Also, if too many unique values are found in a column, it should cap the feature expansion and only encode the most common values, putting all others in an “Other” column.
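
A minimal sketch of such an encoder, assuming pandas inputs (the class and parameter names are illustrative, not skplumber's actual API):

import pandas as pd

class CappedOneHotEncoder:
    def __init__(self, max_values=10):
        self.max_values = max_values
        self.top_vals = {}

    def fit(self, X: pd.DataFrame) -> None:
        # Re-learn from scratch, keeping only each column's most common values.
        self.top_vals = {
            col: X[col].value_counts().index[: self.max_values].tolist()
            for col in X.columns
        }

    def produce(self, X: pd.DataFrame) -> pd.DataFrame:
        # Output columns depend only on what was learned at fit time,
        # so they are always the same for a given fit.
        out = {}
        for col, vals in self.top_vals.items():
            for val in vals:
                out[f"{col}={val}"] = (X[col] == val).astype(int)
            out[f"{col}=Other"] = (~X[col].isin(vals)).astype(int)
        return pd.DataFrame(out, index=X.index)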

Always Normalize Features

Many optimization strategies do best when the features are all normalized. Rather than relying on a normalization preprocessor happening to be found during sampling, always insert a normalization preprocessing step, with an option to skip it (e.g. have normalize=True be a default in the crank API).
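
For instance, a sketch of the always-on step (the choice of StandardScaler is an assumption; any standard scaler would do):

import pandas as pd
from sklearn.preprocessing import StandardScaler

def normalize_features(X: pd.DataFrame) -> pd.DataFrame:
    """Scale all features to zero mean and unit variance."""
    scaled = StandardScaler().fit_transform(X)
    return pd.DataFrame(scaled, columns=X.columns, index=X.index)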

Make access to primitives more natural

Currently, all primitives in the package must be accessed by key through primitive dictionaries. It would be better to have them be accessible as objects directly, e.g. instead of:

from skplumber.primitives import classifiers
prim = classifiers["RandomForestClassifierPrimitive"]

It would be more natural to say:

from skplumber.primitives.classifiers import RandomForestClassifierPrimitive

Furthermore, it would be good to eliminate the Primitive suffix from all the primitive names; it's a little redundant.
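
One way this could be implemented, sketched under the assumption that it runs inside the classifiers module where the existing classifiers dict is defined:

# Hypothetical sketch: promote each dict entry to a module-level name,
# dropping the redundant "Primitive" suffix along the way.
for _key, _primitive in classifiers.items():
    globals()[_key.replace("Primitive", "")] = _primitive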

Primitives should support sequential dataset fit

The same instantiated primitive should be able to be fit on one dataset and then another. All the sklearn primitives should already support this, but the custom primitives do not. E.g. the one-hot encoder, when fit, keeps track of all the categorical columns, but when fit to a new dataset, it does not clear out the old columns it was tracking, so the columns it produces for the new dataset end up being the union of the old dataset's columns and the new dataset's columns.

Add a test case for this by fitting on one dataset, then another, to make sure no errors occur.
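
A sketch of that test (the primitive's name is an assumption for illustration):

import pandas as pd

def test_sequential_fit():
    enc = OneHotEncoderPrimitive()  # hypothetical primitive name
    first = pd.DataFrame({"color": ["red", "blue", "red"]})
    second = pd.DataFrame({"shape": ["square", "circle", "circle"]})

    enc.fit(first)
    enc.fit(second)  # refit on a dataset with entirely different columns
    out = enc.produce(second)

    # The output should reflect only the most recent fit, not the union.
    assert not any(col.startswith("color") for col in out.columns)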

Add Sample Training Timeout

It would be useful to have an option to limit the max amount of time spent fitting a pipeline when searching for good solutions to a problem. Adding a timeout option to the Pipeline.fit method would do the trick.
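
A sketch of one way to implement it (signal-based, so POSIX-only and main-thread-only; the timeout handling shown here is an illustration, not skplumber's actual code):

import signal

class FitTimeoutError(Exception):
    pass

def _on_alarm(signum, frame):
    raise FitTimeoutError("pipeline fit exceeded its time budget")

def fit_with_timeout(pipeline, X, y, timeout_seconds):
    """Abort pipeline.fit if it runs longer than timeout_seconds."""
    signal.signal(signal.SIGALRM, _on_alarm)
    signal.alarm(timeout_seconds)
    try:
        pipeline.fit(X, y)
    finally:
        signal.alarm(0)  # always cancel the pending alarm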

Pass Instances of Strategies to Plumber

Currently, the name of a search strategy is what's passed to the plumber. It would be better to pass an instance of a search strategy, so the user can configure the strategy directly instead of routing all of its parameters through the plumber API.
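
In other words (the class names and signatures below are illustrative):

# before: a name plus pass-through parameters
plumber = SKPlumber(strategy="random", n_samples=20)

# after: configure the strategy directly, then hand the instance over
strategy = RandomSamplingStrategy(n_samples=20)
plumber = SKPlumber(strategy=strategy)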

Support Custom Pipeline Evaluation

Currently, the performance of all candidate pipelines can only be evaluated via k-fold cross validation. That is a great method, but for large datasets especially, k-fold is impractical and sometimes unnecessary. SKPlumber.crank should expose an API for passing in a custom evaluation strategy. It would also be good to use a sensible default and to provide both basic train/test split evaluation and k-fold cross validation utilities for the user.
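
A sketch of what a pluggable evaluator could look like (the signature is an assumption):

from sklearn.model_selection import train_test_split

def train_test_evaluator(pipeline, X, y, metric, test_size=0.25):
    """Evaluate with a single train/test split instead of k-fold."""
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size)
    pipeline.fit(X_train, y_train)
    return metric(y_test, pipeline.predict(X_test))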

Imputer primitive fails when there are no known vals

Traceback (most recent call last):
  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/users/grads/epeter92/code/big-data-course/project/mldb/__main__.py", line 31, in <module>
    main()
  File "/users/grads/epeter92/code/big-data-course/project/mldb/__main__.py", line 27, in main
    results = ray.get(result_id)
  File "/users/grads/epeter92/.local/lib/python3.6/site-packages/ray/worker.py", line 1504, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(ValueError): ray::mldb.model_running.do_run() (pid=3548, ip=192.168.36.150)
  File "python/ray/_raylet.pyx", line 452, in ray._raylet.execute_task
  File "/users/grads/epeter92/code/big-data-course/project/mldb/model_running.py", line 31, in do_run
    pipe.fit(X_train, y_train)
  File "/users/grads/epeter92/.local/lib/python3.6/site-packages/skplumber/pipeline.py", line 96, in fit
    self._run(X, y, fit=True)
  File "/users/grads/epeter92/.local/lib/python3.6/site-packages/skplumber/pipeline.py", line 69, in _run
    step_outputs = step.primitive.produce(step_inputs)
  File "/users/grads/epeter92/.local/lib/python3.6/site-packages/skplumber/primitives/custom_primitives/preprocessing.py", line 93, in produce
    np.random.choice(known_vals.index, p=known_vals, size=len(result.index))
  File "mtrand.pyx", line 902, in numpy.random.mtrand.RandomState.choice
ValueError: 'a' cannot be empty unless no samples are taken

I think this happens when training on datasets where a column has no known values at all.
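
A guarded version of the sampling step, sketched as a standalone helper (the constant fallback is one option among several):

import numpy as np
import pandas as pd

def impute_column(col: pd.Series) -> pd.Series:
    """Fill missing values by sampling from the column's empirical
    distribution, falling back to a constant when nothing is known."""
    known = col.dropna().value_counts(normalize=True)
    if known.empty:
        # No known values to sample from: the case that crashes above.
        return col.fillna(0)
    out = col.copy()
    mask = col.isna()
    out[mask] = np.random.choice(known.index, p=known.values, size=mask.sum())
    return out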

Cross Validate Then Refit Best on Full Train Set

With the random train/test splits the package currently uses, there is a risk that the best model being returned is influenced in part by the luck of the split. In other words, the best model might have just gotten a lucky train/test split that was easy to learn. Adding k-fold cross validation will smooth out that variance by getting multiple sample points from the distribution of the problem being learned.

Once the best model is identified via cross validation, that model can be refit using the full training data set before being returned by the package. Then it will have the benefit of learning from as much data as it can before being deployed into the wild.
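
A sketch of the intended flow using sklearn's utilities (this assumes Pipeline is, or becomes, BaseEstimator-compatible, per the hyperparameter issue further below):

from sklearn.model_selection import cross_val_score

def select_and_refit(candidates, X_train, y_train, cv=5):
    """Pick the best candidate by k-fold CV, then refit it on the
    full training set before returning it."""
    best = max(
        candidates,
        key=lambda pipe: cross_val_score(pipe, X_train, y_train, cv=cv).mean(),
    )
    best.fit(X_train, y_train)
    return best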

Add Stacking Search Strategy

Add a strategy that randomly samples a layer of primitives that all take the input data, then adds a model to ensemble the outputs of that layer of primitives.
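
A fit-time sketch of the structure (names are illustrative):

import pandas as pd

def fit_stack(layer_primitives, ensembler, X, y):
    """Every first-layer primitive sees the raw input; the ensembler
    is trained on the concatenation of their outputs."""
    outputs = []
    for i, prim in enumerate(layer_primitives):
        prim.fit(X, y)
        outputs.append(pd.Series(prim.predict(X), name=f"prim_{i}"))
    ensembler.fit(pd.concat(outputs, axis=1), y)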

Express Datatypes and Fundamental Characteristics of Hyperparameters

All primitives should implement the appropriate sklearn base class, which provides the standard hyperparameter getting and setting methods.

In addition, all primitives should programmatically document datatypes of each hyperparameter. In the case of numeric hyperparameters, a way to compute the bounds for a given dataset should be provided. Bounds can be dependent on features of the dataset being trained on (e.g. number of instances, number of features, etc). In the case of categorical hyperparameters, all possible values should be enumerated.

Finally, the Pipeline class should implement the BaseEstimator API.

Doing this will enable SKPlumber to support hyperparameter search, whether it be through sampling or optimization.
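
For example, a sketch of what such a programmatic description could look like (the schema is an assumption, not skplumber's actual format):

# One spec per primitive; numeric bounds may be computed from the
# dataset being trained on.
HYPERPARAM_SPEC = {
    "n_estimators": {
        "type": "int",
        "bounds": lambda X, y: (1, 500),
    },
    "max_features": {
        "type": "float",
        "bounds": lambda X, y: (1 / X.shape[1], 1.0),  # depends on n features
    },
    "criterion": {
        "type": "categorical",
        "values": ["gini", "entropy"],  # all possible values enumerated
    },
}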
