skplumber's People

Contributors

epeters3


skplumber's Issues

Add Time-Constrained Optimization

Add logic for skplumber to optimize intelligently given a time budget. Specifically, in the pipeline sampling phase, use extreme value theory and a running average of pipeline fit+score times to estimate the time remaining, always leaving enough time for the flexga package to complete at least one generation of hyperparameter tuning. The time one generation will take can be estimated as the fit+score time of the best pipeline sampled so far, multiplied by the number of hyperparameters in that pipeline to tune, multiplied by 10 (since that's flexga's population size).
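
A minimal sketch of that arithmetic, with illustrative names (none of these are skplumber's actual internals):

FLEXGA_POP_SIZE = 10  # flexga's population size, per the estimate above

def estimate_ga_generation_seconds(best_fit_score_seconds, n_tunable_hyperparams):
    """Estimated time for flexga to complete one tuning generation."""
    return best_fit_score_seconds * n_tunable_hyperparams * FLEXGA_POP_SIZE

def can_sample_again(elapsed, budget, avg_sample_seconds, ga_seconds):
    """Sample another pipeline only if doing so still leaves room for
    at least one generation of hyperparameter tuning."""
    return elapsed + avg_sample_seconds + ga_seconds < budget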

This will make the SKPlumber.crank method even a little higher level, with fewer knobs to tune, which is okay because the lower-level components of the package (e.g. sampling, tuning, pipeline) are still available to the user.

Add `Pipeline` Functionality to Readme

Currently, only a basic example for SKPlumber.crank() is provided in the readme. The Pipeline class API is slightly lower level than SKPlumber and has some nice features. Also, since SKPlumber.crank() returns a Pipeline instance, it's important to know how to use that pipeline downstream.

Don't Allow Infeasible Hyperparameter Combinations

In logistic regression, this solver/penalty combination is not supported, so make sure it cannot be tried in skplumber:

  File "/usr/local/lib/python3.6/dist-packages/sklearn/linear_model/_logistic.py", line 445, in _check_solver
    "got %s penalty." % (solver, penalty))
ValueError: Solver lbfgs supports only 'l2' or 'none' penalties, got elasticnet penalty.

Here is another one for SVM:

  File "/usr/local/lib/python3.6/dist-packages/sklearn/svm/_base.py", line 793, in _get_liblinear_solver_type
    % (error_string, penalty, loss, dual))
ValueError: Unsupported set of arguments: The combination of penalty='l1' and loss='squared_hinge' are not supported when 
dual=True, Parameters: penalty='l1', loss='squared_hinge', dual=True
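
One possible guard, sketched with illustrative names (not skplumber's actual internals): keep a blocklist of known-bad combinations and check each candidate against it before fitting.

INFEASIBLE_COMBOS = [
    {"solver": "lbfgs", "penalty": "elasticnet"},  # the logistic regression case
    {"penalty": "l1", "loss": "squared_hinge", "dual": True},  # the SVM case
]

def is_feasible(hyperparams):
    """Return False if hyperparams matches any known-bad combination."""
    return not any(
        all(hyperparams.get(k) == v for k, v in combo.items())
        for combo in INFEASIBLE_COMBOS
    )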

Exit When Budget is Used Up Even Before Progress Can Report

In the SKPlumber._sampler_callback function, whether the system should exit sampling early is not checked until the SKPlumber.progress object is able to report. Sometimes the whole time budget is used up before the progress object is ever able to report.
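
A possible shape of the fix, with assumed attribute names (start_time, budget, and the stop signal are illustrations, not skplumber's actual internals): check the budget at the very top of the callback, before any progress reporting.

import time

def _sampler_callback(self, *args, **kwargs):
    # Check the time budget first, before waiting on self.progress.
    if time.time() - self.start_time >= self.budget:
        return True  # hypothetical "stop sampling now" signal
    self.progress.report()  # hypothetical; reporting comes second
    ...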

Add Good Encoder

The encoder should have fit and produce methods, so the output columns are always the same. Also, if too many unique values are found in a column, it should cap the feature expansion and only encode the most common values, putting all others in an “Other” column.
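
A minimal sketch of such an encoder, assuming pandas inputs (the class and parameter names are illustrative, not skplumber's actual API):

import pandas as pd

class CappedOneHotEncoder:
    def __init__(self, max_values=10):
        self.max_values = max_values
        self.top_vals = {}

    def fit(self, X: pd.DataFrame) -> None:
        # Re-learn from scratch, keeping only each column's most common values.
        self.top_vals = {
            col: X[col].value_counts().index[: self.max_values].tolist()
            for col in X.columns
        }

    def produce(self, X: pd.DataFrame) -> pd.DataFrame:
        # Output columns depend only on what was learned at fit time,
        # so they are always the same for a given fit.
        out = {}
        for col, vals in self.top_vals.items():
            for val in vals:
                out[f"{col}={val}"] = (X[col] == val).astype(int)
            out[f"{col}=Other"] = (~X[col].isin(vals)).astype(int)
        return pd.DataFrame(out, index=X.index)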

Always Normalize Features

Many optimization strategies do best when the features are all normalized. Rather than relying on a normalization preprocessor happening to be found during sampling, always insert a normalization preprocessing step, with an option to skip it (e.g. have normalize=True be a default in the crank API).
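
For instance, a sketch of the always-on step (the choice of StandardScaler is an assumption; any standard scaler would do):

import pandas as pd
from sklearn.preprocessing import StandardScaler

def normalize_features(X: pd.DataFrame) -> pd.DataFrame:
    """Scale all features to zero mean and unit variance."""
    scaled = StandardScaler().fit_transform(X)
    return pd.DataFrame(scaled, columns=X.columns, index=X.index)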

Make access to primitives more natural

Currently, all primitives in the package must be accessed by key through primitive dictionaries. It would be better to have them be accessible as objects directly, e.g. instead of:

from skplumber.primitives import classifiers
prim = classifiers["RandomForestClassifierPrimitive"]

It would be more natural to say:

from skplumber.primitives.classifiers import RandomForestClassifierPrimitive

Furthermore, it would be good to eliminate the Primitive suffix from all the primitive names; it's a little redundant.
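
One way this could be implemented, sketched under the assumption that it runs inside the classifiers module where the existing classifiers dict is defined:

# Hypothetical sketch: promote each dict entry to a module-level name,
# dropping the redundant "Primitive" suffix along the way.
for _key, _primitive in classifiers.items():
    globals()[_key.replace("Primitive", "")] = _primitive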

Primitives should support sequential dataset fit

The same instantiated primitive should be able to be fit on one dataset and then another. All the sklearn primitives should already support this, but the custom primitives do not. E.g. the one-hot encoder, when fit, keeps track of all the categorical columns, but when fit to a new dataset, it does not clear out the old columns it was tracking, so the columns it produces for the new dataset end up being the union of the old dataset's columns and the new dataset's columns.

Add a test case for this by fitting on one dataset, then another, to make sure no errors occur.
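
A sketch of that test (the primitive's name is an assumption for illustration):

import pandas as pd

def test_sequential_fit():
    enc = OneHotEncoderPrimitive()  # hypothetical primitive name
    first = pd.DataFrame({"color": ["red", "blue", "red"]})
    second = pd.DataFrame({"shape": ["square", "circle", "circle"]})

    enc.fit(first)
    enc.fit(second)  # refit on a dataset with entirely different columns
    out = enc.produce(second)

    # The output should reflect only the most recent fit, not the union.
    assert not any(col.startswith("color") for col in out.columns)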

Add Sample Training Timeout

It would be useful to have an option to limit the max amount of time spent fitting a pipeline when searching for good solutions to a problem. Adding a timeout option to the Pipeline.fit method would do the trick.
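
A sketch of one way to implement it (signal-based, so POSIX-only and main-thread-only; the timeout handling shown here is an illustration, not skplumber's actual code):

import signal

class FitTimeoutError(Exception):
    pass

def _on_alarm(signum, frame):
    raise FitTimeoutError("pipeline fit exceeded its time budget")

def fit_with_timeout(pipeline, X, y, timeout_seconds):
    """Abort pipeline.fit if it runs longer than timeout_seconds."""
    signal.signal(signal.SIGALRM, _on_alarm)
    signal.alarm(timeout_seconds)
    try:
        pipeline.fit(X, y)
    finally:
        signal.alarm(0)  # always cancel the pending alarm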

Pass Instances of Strategies to Plumber

Currently, the name of a search strategy is what's passed to the plumber. It would be better to pass an instance of a search strategy, so the user can configure the strategy directly instead of routing all of its parameters through the plumber API.
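
In other words (the class names and signatures below are illustrative):

# before: a name plus pass-through parameters
plumber = SKPlumber(strategy="random", n_samples=20)

# after: configure the strategy directly, then hand the instance over
strategy = RandomSamplingStrategy(n_samples=20)
plumber = SKPlumber(strategy=strategy)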

Support Custom Pipeline Evaluation

Currently, the performance of all candidate pipelines can only be evaluated via k-fold cross validation. That is a great method, but for large datasets especially, k-fold is impractical and sometimes unnecessary. SKPlumber.crank should expose an API for passing in a custom evaluation strategy. It would also be good to use a sensible default and to provide both basic train/test split evaluation and k-fold cross validation utilities for the user.
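
A sketch of what a pluggable evaluator could look like (the signature is an assumption):

from sklearn.model_selection import train_test_split

def train_test_evaluator(pipeline, X, y, metric, test_size=0.25):
    """Evaluate with a single train/test split instead of k-fold."""
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size)
    pipeline.fit(X_train, y_train)
    return metric(y_test, pipeline.predict(X_test))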

Imputer primitive fails when there are no known vals

Traceback (most recent call last):
  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/users/grads/epeter92/code/big-data-course/project/mldb/__main__.py", line 31, in <module>
    main()
  File "/users/grads/epeter92/code/big-data-course/project/mldb/__main__.py", line 27, in main
    results = ray.get(result_id)
  File "/users/grads/epeter92/.local/lib/python3.6/site-packages/ray/worker.py", line 1504, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(ValueError): ray::mldb.model_running.do_run() (pid=3548, ip=192.168.36.150)
  File "python/ray/_raylet.pyx", line 452, in ray._raylet.execute_task
  File "/users/grads/epeter92/code/big-data-course/project/mldb/model_running.py", line 31, in do_run
    pipe.fit(X_train, y_train)
  File "/users/grads/epeter92/.local/lib/python3.6/site-packages/skplumber/pipeline.py", line 96, in fit
    self._run(X, y, fit=True)
  File "/users/grads/epeter92/.local/lib/python3.6/site-packages/skplumber/pipeline.py", line 69, in _run
    step_outputs = step.primitive.produce(step_inputs)
  File "/users/grads/epeter92/.local/lib/python3.6/site-packages/skplumber/primitives/custom_primitives/preprocessing.py", line 93, in produce
    np.random.choice(known_vals.index, p=known_vals, size=len(result.index))
  File "mtrand.pyx", line 902, in numpy.random.mtrand.RandomState.choice
ValueError: 'a' cannot be empty unless no samples are taken

I think this happens when training on datasets where a column has no known values at all.
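
A guarded version of the sampling step, sketched as a standalone helper (the constant fallback is one option among several):

import numpy as np
import pandas as pd

def impute_column(col: pd.Series) -> pd.Series:
    """Fill missing values by sampling from the column's empirical
    distribution, falling back to a constant when nothing is known."""
    known = col.dropna().value_counts(normalize=True)
    if known.empty:
        # No known values to sample from: the case that crashes above.
        return col.fillna(0)
    out = col.copy()
    mask = col.isna()
    out[mask] = np.random.choice(known.index, p=known.values, size=mask.sum())
    return out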

Cross Validate Then Refit Best on Full Train Set

With the random train/test splits the package currently uses, there is a risk that the best model being returned is influenced in part by the luck of the split. In other words, the best model might have just gotten a lucky train/test split that was easy to learn. Adding k-fold cross validation will smooth out that variance by getting multiple sample points from the distribution of the problem being learned.

Once the best model is identified via cross validation, that model can be refit using the full training data set before being returned by the package. Then it will have the benefit of learning from as much data as it can before being deployed into the wild.
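
A sketch of the intended flow using sklearn's utilities (this assumes Pipeline is, or becomes, BaseEstimator-compatible, per the hyperparameter issue further below):

from sklearn.model_selection import cross_val_score

def select_and_refit(candidates, X_train, y_train, cv=5):
    """Pick the best candidate by k-fold CV, then refit it on the
    full training set before returning it."""
    best = max(
        candidates,
        key=lambda pipe: cross_val_score(pipe, X_train, y_train, cv=cv).mean(),
    )
    best.fit(X_train, y_train)
    return best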

Add Stacking Search Strategy

Add a strategy that randomly samples a layer of primitives that all take the input data, then adds a model to ensemble the outputs of that layer of primitives.
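
A fit-time sketch of the structure (names are illustrative):

import pandas as pd

def fit_stack(layer_primitives, ensembler, X, y):
    """Every first-layer primitive sees the raw input; the ensembler
    is trained on the concatenation of their outputs."""
    outputs = []
    for i, prim in enumerate(layer_primitives):
        prim.fit(X, y)
        outputs.append(pd.Series(prim.predict(X), name=f"prim_{i}"))
    ensembler.fit(pd.concat(outputs, axis=1), y)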

Express Datatypes and Fundamental Characteristics of Hyperparameters

All primitives should implement the appropriate sklearn base class, which provides the standard hyperparameter getting and setting methods.

In addition, all primitives should programmatically document datatypes of each hyperparameter. In the case of numeric hyperparameters, a way to compute the bounds for a given dataset should be provided. Bounds can be dependent on features of the dataset being trained on (e.g. number of instances, number of features, etc). In the case of categorical hyperparameters, all possible values should be enumerated.

Finally, the Pipeline class should implement the BaseEstimator API.

Doing this will enable SKPlumber to support hyperparameter search, whether it be through sampling or optimization.
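
For example, a sketch of what such a programmatic description could look like (the schema is an assumption, not skplumber's actual format):

# One spec per primitive; numeric bounds may be computed from the
# dataset being trained on.
HYPERPARAM_SPEC = {
    "n_estimators": {
        "type": "int",
        "bounds": lambda X, y: (1, 500),
    },
    "max_features": {
        "type": "float",
        "bounds": lambda X, y: (1 / X.shape[1], 1.0),  # depends on n features
    },
    "criterion": {
        "type": "categorical",
        "values": ["gini", "entropy"],  # all possible values enumerated
    },
}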
