eharmony / spotz
Spark Parameter Optimization and Tuning
Hi,
We're interested in Spotz, but we need a param to be able to depend on previously-sampled ones (e.g. if mode is ON, vary from 0 to 5, otherwise from 5 to 10).
The change would probably need to happen where, instead of mapping over the params list, we'd fold over it, passing the previously-sampled params along with the rng (the current argument) to each sampler. As it stands this would be a breaking change (every sampler's apply function would need to receive an extra parameter), but maybe there's a way to make it smoother?
We'd provide a PR if that's acceptable.
I'm interested in your suggestions, if there are other, simpler ways. Thanks!
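A minimal sketch of the fold-based sampling described above, assuming a hypothetical sampler shape (this is illustrative and not Spotz's actual API): each sampler receives the rng and the params sampled so far, so later params can condition on earlier ones.

```scala
import scala.util.Random

// Hypothetical sampler signature: rng plus the previously-sampled params.
type Sampler = (Random, Map[String, Double]) => Double

// Fold the params list, threading the accumulated samples through.
def sample(space: Seq[(String, Sampler)], rng: Random): Map[String, Double] =
  space.foldLeft(Map.empty[String, Double]) { case (acc, (name, sampler)) =>
    acc + (name -> sampler(rng, acc))
  }

// "mode" is sampled first; "x" conditions its range on it.
val space: Seq[(String, Sampler)] = Seq(
  "mode" -> ((rng, _) => if (rng.nextBoolean()) 1.0 else 0.0),
  "x" -> ((rng, prev) =>
    if (prev("mode") == 1.0) rng.nextDouble() * 5.0   // mode ON: [0, 5)
    else 5.0 + rng.nextDouble() * 5.0)                // mode OFF: [5, 10)
)

val params = sample(space, new Random(42))
```

Because the fold threads the accumulator in order, samplers can only depend on params listed before them, which is the natural constraint for this kind of conditional space.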
There is overlap in common code that can be reused
Implement a VW objective that accepts a training set and a test set on which to train and evaluate a VW model.
Write up details about the various supported use cases for VW
If a spotz optimization run is killed intentionally, crashes, or otherwise stops before normal completion, allow the optimizer to resume where it left off.
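One way to support resumption is to periodically checkpoint the optimizer's best-so-far state to disk. The sketch below is a hypothetical approach using plain Java serialization; the state fields, paths, and format are illustrative, not Spotz's actual design.

```scala
import java.io.{File, FileInputStream, FileOutputStream, ObjectInputStream, ObjectOutputStream}

// Hypothetical optimizer state: enough to resume a run where it stopped.
case class OptimizerState(
  trialsCompleted: Int,
  bestLoss: Double,
  bestParams: Map[String, Double]
) extends Serializable

def saveCheckpoint(state: OptimizerState, path: String): Unit = {
  val out = new ObjectOutputStream(new FileOutputStream(path))
  try out.writeObject(state) finally out.close()
}

def loadCheckpoint(path: String): Option[OptimizerState] = {
  val f = new File(path)
  if (!f.exists) None
  else {
    val in = new ObjectInputStream(new FileInputStream(f))
    try Some(in.readObject().asInstanceOf[OptimizerState]) finally in.close()
  }
}
```

On startup the optimizer would call `loadCheckpoint` and, if a state exists, skip the already-completed trials and seed the search with the saved best point.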
There's a slowdown with VW cache distribution at the beginning of the Spark job. Refactor this logic to zip and distribute the VW dataset to the executors before VW cache generation begins.
This is primarily for VW feature interactions
Partition a dataset into k folds and create VW train and test cache files for every fold. Distribute these cache files to the executors so that they can be used by the objective function.
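The fold assignment itself can be sketched in plain Scala (the real task would additionally build VW cache files per fold and ship them to the executors; `kFolds` is an illustrative helper, not Spotz code):

```scala
// Split data into k (train, test) pairs; example i goes to test fold i % k.
def kFolds[T](data: Seq[T], k: Int): Seq[(Seq[T], Seq[T])] =
  (0 until k).map { fold =>
    val (test, train) = data.zipWithIndex.partition { case (_, i) => i % k == fold }
    (train.map(_._1), test.map(_._1))
  }

val folds = kFolds(1 to 10, 3)
```

Each example lands in exactly one test fold and in the train split of every other fold, so every fold's train/test pair covers the whole dataset.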
Refactor functionality to allow mixing in the backend compute framework, so that users can choose Spark or something else, e.g. plain JVM executors.
Currently, hyperparameter values are materialized through the sample method of the Space trait. Look into possibly implementing RDDs that materialize the values instead of invoking sc.parallelize() in conjunction with the sample method.
Tune the batch size adaptively such that the user does not need to specify it. The batch size becomes important when the caller desires the optimizer to finish within some maximum duration. Too large a batch size will delay duration checks while processing occurs on the cluster. Too small a batch size will cause frequent return trips back to the driver which incur some constant time overhead.
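One simple heuristic for the trade-off above is to rescale the batch size after each batch so the next batch's wall-clock time approaches a target duration. This is a hypothetical sketch, not Spotz's actual logic; the function name and bounds are illustrative.

```scala
// Scale batch size proportionally: if the last batch took twice the target
// time, halve the batch; if it finished in half the time, double it.
def nextBatchSize(current: Int, elapsedMs: Long, targetMs: Long,
                  minSize: Int = 1, maxSize: Int = 100000): Int = {
  val scaled = (current.toLong * targetMs / math.max(elapsedMs, 1L)).toInt
  math.min(math.max(scaled, minSize), maxSize)
}
```

Clamping to `[minSize, maxSize]` keeps one noisy timing measurement from collapsing the batch to nothing or blowing it up past what the cluster can absorb.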
Particle Swarm
Tree of Parzen Estimators
Nelder-Mead Simplex
CMA-ES
Sobol Sequences