glimr's People

Contributors

cooperlab, lawrence-chillrud, marinayad, raminnateghi, tilly-s


glimr's Issues

Tuning search space should not be large

I recently ran an experiment using the Rastrigin function to see how well Ray performs as the dimensionality of the search space increases. Rastrigin has many local minima, but its global minimum is at 0 [Fig. 1].
The results show that as the dimensionality of the search space increases, the results deteriorate. This raises the concern of whether Ray can find the best config when there are many tunable hyperparameters. Maybe we need to define trimming tools to reduce the dimensionality of the search space, or we can hold some hyperparameters constant, especially those we think are unlikely to provide any benefit.

[Fig. 1: 3D surface plot of the Rastrigin function]

Code:

from ray import train, tune
from ray.tune.schedulers import PopulationBasedTraining
import matplotlib.pyplot as plt  # needed for the surface plot below
import numpy as np

# Rastrigin function.
def rastrigin(config):
    x = list(config.values())
    n = len(x)
    score = 10 * n + sum([xi**2 - 10 * np.cos(2 * np.pi * xi) for xi in x])
    return {"score": score}

# Plot Rastrigin in 3D.
# Note: its global minimum is at the origin.
x = np.linspace(-5.12, 5.12, 100)
y = np.linspace(-5.12, 5.12, 100)
X, Y = np.meshgrid(x, y)
Z = rastrigin({"a": X, "b": Y})

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.plot_surface(X, Y, Z['score'], cmap='viridis')
ax.set_xlabel('x')
ax.set_ylabel('y')
ax.set_zlabel('z')
plt.show()


# Run a Ray experiment to find the global minimum of the n-D Rastrigin function.
max_dim = 10
scores = []
for d in range(1, max_dim):

    search_space = {f"var_{i}": tune.quniform(-2, 2, 0.05) for i in range(d)}
    scheduler = PopulationBasedTraining(
        time_attr="training_iteration",
        hyperparam_mutations=search_space,
        metric="score",
        mode="min",  # we are minimizing the Rastrigin score
    )

    tuner = tune.Tuner(
        rastrigin,
        param_space=search_space,
        tune_config=tune.TuneConfig(
            num_samples=50,
            scheduler=scheduler,
        ),
    )

    results = tuner.fit()
    scores.append(results.get_best_result(metric="score", mode="min").metrics["score"])

The resulting loss value vs. the dimensionality of the search space:

[Figure: best score found at each search-space dimensionality]

Dictionary Key Indexing

Python returns a dict_keys object when calling a dictionary's .keys() method, and dict_keys does not support indexing.

The following lines (line 116 in Search.py) need to be modified

task_name = space["tasks"].keys()[0]
metric_name = space["tasks"][task_name].keys()[0]

to

task_name = list(space["tasks"].keys())[0]
metric_name = list(space["tasks"][task_name].keys())[0]

Fix trial numbering mismatch in the experiment_table function

Here is the suggested code fix:

import os
import re

import pandas as pd

# exp_dir is the experiment directory passed to experiment_table
trials = []
subdirs = os.listdir(exp_dir)
for subdir in subdirs:
    if subdir.startswith("trainable") and os.path.isdir(
        os.path.join(exp_dir, subdir)
    ):
        # extract the trial number from the fourth underscore-separated token
        trial_num = re.search(r"(?:[^_]*_){3}(\d+)", subdir).group(1)
        result_path = os.path.join(exp_dir, subdir, "result.json")
        if os.path.exists(result_path):
            trial = pd.read_json(result_path, lines=True)
            trial.insert(0, "trial #", trial_num)
            trial.insert(1, "subdir", subdir)
            trial.insert(2, "exp_dir", exp_dir)
            trials.append(trial)
df = pd.concat(trials, ignore_index=True)

Integrate K-fold cross-validation into glimr tuning

@cooperlab, currently for tuning we use a config dictionary that lets us search through different configurations and select the best model. However, the data pipeline is fixed across all trials, which means every trial uses the same train/validation sets [Fig. 1].
It would be interesting to resample the data automatically during tuning, i.e., generate train/validation subsets per trial [Fig. 2].
This would allow us to train various models independently on distinct training/validation sets, enabling ensemble tuning, a valuable approach for dealing with overfitting, particularly in situations with limited data.

[Fig. 1: all trials share a fixed train/validation split]

[Fig. 2: train/validation subsets resampled per trial during tuning]

The most common resampling technique is k-fold cross-validation (CV), but other resampling methods could be used.
I think we need to consider the following items to implement this (a minimal sketch of item 1 follows the list):

  1. Update the search space so we can resample data during tuning using k-fold CV. This can be done by passing a fold index into the data loader.
  2. Write a wrapper to analyze the logs generated by trials to extract fold-specific information, models, etc.
  3. Each trial should run on a specific resampled dataset.
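A minimal sketch of item 1, assuming a hypothetical dataloader(fold, k) that returns a deterministic train/validation split for a given fold index (names are illustrative, not glimr's API):

from ray import tune

k = 5
search_space = {
    "learning_rate": tune.loguniform(1e-4, 1e-2),
    # grid_search over fold indices crosses every sampled config with every fold
    "fold": tune.grid_search(list(range(k))),
}

def trainable(config):
    # the fold index selects which resampled split this trial trains on
    train_ds, val_ds = dataloader(fold=config["fold"], k=k)
    ...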

Investigate conditional search spaces

See if conditional search spaces are compatible with PBT and ASHA schedulers.

They are not compatible with most search algorithms, but search algorithms are not currently working anyway.
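For reference, a conditional search space in Ray Tune is typically written with tune.sample_from, where one parameter's value depends on another (an illustrative sketch; parameter names are made up):

from ray import tune

search_space = {
    "layers": tune.choice([1, 2, 3]),
    # "units" is conditioned on the sampled value of "layers"
    "units": tune.sample_from(lambda spec: 64 // spec.config.layers),
}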

Allow overriding of hyperparameter notation

Users should be able to specify samplers from ray.tune.search.sample in place of the list or set conventions, which provide more limited sampling options.

This can be enabled in glimr.utils.set_hyperparameter by checking whether a hyperparameter is callable and whether fun.__module__ == "ray.tune.search.sample", or by checking against a list of known sampler function names.
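A hedged sketch of that check, assuming it lives in glimr.utils.set_hyperparameter (the helper name is illustrative):

from ray.tune.search.sample import Domain

def is_ray_sampler(value):
    # tune.uniform(...), tune.choice(...), etc. return Domain instances
    if isinstance(value, Domain):
        return True
    # fall back to the module check described above for bare callables
    return callable(value) and getattr(value, "__module__", "") == "ray.tune.search.sample"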

remove functions from search space using prune_constants, applied to conditional tuning

To define a conditional search space, we can employ tune.sample_from, where conditions are expressed as functions using lambda expressions.

However, we will not be able to use some schedulers like PBT if the search space contains functions. To fix that, and in fact to enable glimr to use PBT for conditional search spaces, we can update prune_constants so it also removes functions (defined by tune.sample_from) from the search space.

Removing conditional functions from PBT's mutation space does not mean they no longer exist in the search space; it means they are no longer mutable, which makes sense, because functions are not mutable quantities.
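A minimal sketch of the proposed prune_constants extension, assuming the mutations dict maps hyperparameter names to Ray Tune objects (the helper name is illustrative):

from ray.tune.search.sample import Domain, Function

def prune_functions(mutations):
    # keep mutable domains; drop tune.sample_from entries, which are Function domains
    return {k: v for k, v in mutations.items() if isinstance(v, Domain) and not isinstance(v, Function)}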

Prevent callables from using kwargs

Using kwargs with callable losses via functools.partial prevents saving of models (TensorFlow error). For now, raise an error when kwargs are provided with losses; later, upgrade losses to classes to enable kwargs.
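A sketch of the interim guard (hypothetical helper; glimr's actual signature may differ):

def check_loss_kwargs(loss, kwargs):
    # reject kwargs on plain callables; wrapping them with functools.partial breaks model saving
    if kwargs and callable(loss) and not isinstance(loss, type):
        raise ValueError(
            "kwargs are not supported for callable losses; use a loss class instead"
        )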

Allow multiple losses per output

Currently glimr.keras.keras_losses only allows one loss per output. Enable multiple losses per output, similar to glimr.keras.keras_metrics.
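One possible approach, sketched here without reference to glimr's actual API: merge several losses for a single output into one callable that Keras can consume.

import tensorflow as tf

def combine_losses(losses, weights=None):
    # returns a single loss callable summing the weighted component losses
    weights = weights or [1.0] * len(losses)
    def combined(y_true, y_pred):
        return tf.add_n([w * fn(y_true, y_pred) for w, fn in zip(weights, losses)])
    return combined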

restore is broken

PR #31 breaks the ability to restore halted experiments.

Failure #1 (occurred at 2023-07-18_14-54-14)
ray::ImplicitFunc.train() (pid=8512, ip=127.0.0.1, repr=trainable)
  File "/Users/lac5440/anaconda3/lib/python3.10/site-packages/ray/tune/trainable/trainable.py", line 384, in train
    raise skipped from exception_cause(skipped)
  File "/Users/lac5440/anaconda3/lib/python3.10/site-packages/ray/tune/trainable/function_trainable.py", line 336, in entrypoint
    return self._trainable_func(
  File "/Users/lac5440/anaconda3/lib/python3.10/site-packages/ray/tune/trainable/function_trainable.py", line 653, in _trainable_func
    output = fn()
  File "/Users/lac5440/anaconda3/lib/python3.10/site-packages/glimr/search.py", line 311, in trainable
    model, losses, loss_weights, metrics = config["builder"](config)
TypeError: 'tuple' object is not callable

Failures #2 through #6 (2023-07-18_14-54-24 through 14-54-45; pids 8570, 8582, 8515, 8586, 8596) are identical apart from the pid and timestamp.

PBT TuneError in perturbing config

TuneError: Traceback (most recent call last):
  File "/home/lgc2035/miniconda3/envs/kidney/lib/python3.8/site-packages/ray/tune/execution/tune_controller.py", line 853, in _on_result
    on_result(trial, *args, **kwargs)
  File "/home/lgc2035/miniconda3/envs/kidney/lib/python3.8/site-packages/ray/tune/execution/trial_runner.py", line 735, in _on_training_result
    self._process_trial_results(trial, result)
  File "/home/lgc2035/miniconda3/envs/kidney/lib/python3.8/site-packages/ray/tune/execution/trial_runner.py", line 748, in _process_trial_results
    decision = self._process_trial_result(trial, result)
  File "/home/lgc2035/miniconda3/envs/kidney/lib/python3.8/site-packages/ray/tune/execution/trial_runner.py", line 791, in _process_trial_result
    decision = self._scheduler_alg.on_trial_result(
  File "/home/lgc2035/miniconda3/envs/kidney/lib/python3.8/site-packages/ray/tune/schedulers/pbt.py", line 545, in on_trial_result
    self._checkpoint_or_exploit(
  File "/home/lgc2035/miniconda3/envs/kidney/lib/python3.8/site-packages/ray/tune/schedulers/pbt.py", line 652, in _checkpoint_or_exploit
    self._exploit(trial_runner, trial, trial_to_clone)
  File "/home/lgc2035/miniconda3/envs/kidney/lib/python3.8/site-packages/ray/tune/schedulers/pbt.py", line 819, in _exploit
    new_config, operations = self._get_new_config(trial, trial_to_clone)
  File "/home/lgc2035/miniconda3/envs/kidney/lib/python3.8/site-packages/ray/tune/schedulers/pbt.py", line 709, in _get_new_config
    return _explore(
  File "/home/lgc2035/miniconda3/envs/kidney/lib/python3.8/site-packages/ray/tune/schedulers/pbt.py", line 80, in _explore
    nested_new_config, nested_ops = _explore(
  File "/home/lgc2035/miniconda3/envs/kidney/lib/python3.8/site-packages/ray/tune/schedulers/pbt.py", line 80, in _explore
    nested_new_config, nested_ops = _explore(
  File "/home/lgc2035/miniconda3/envs/kidney/lib/python3.8/site-packages/ray/tune/schedulers/pbt.py", line 125, in _explore
    new_config[key] = config[key] * perturbation_factor
TypeError: can't multiply sequence by non-int of type 'float'

Seems to be an error in the PBT scheduler's attempt to generate a new config for a new trial. Regular trials were training just fine until this error popped up when it was time for the PBT scheduler to perturb a config.
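The failing operation is easy to reproduce directly; any sequence-valued config entry hitting PBT's continuous perturbation path raises exactly this error:

# minimal reproduction of the operation at pbt.py line 125 with a list-valued config entry
config_value = [32, 64]
perturbation_factor = 1.2
config_value * perturbation_factor  # TypeError: can't multiply sequence by non-int of type 'float'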

restore is broken again

Values for the dataloader and model builder functions in the config are being received by trials as placeholders. This relates to Ray internals and may be difficult to fix. Check Ray versions.

Testing

Revisit testing and hit 80% coverage.

Create a function to trim constants from the PBT mutations dict

Constant values that are not mutable should be removed from the hyperparam_mutations argument of PopulationBasedTraining.

Traceback (most recent call last):
  File "i-score.py", line 59, in <module>
    attempt_tuning(
  File "/renal_allograft/code/utils/attempt_tuning.py", line 35, in attempt_tuning
    scheduler = PopulationBasedTraining(
  File "/usr/local/lib/python3.8/dist-packages/ray/tune/schedulers/pbt.py", line 360, in __init__
    raise TypeError(
TypeError: hyperparam_mutation values must be either a List, Tuple, Dict, a tune search space object, or a callable.
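A hedged sketch of the trimming function (the name matches the issue title; internals are assumptions):

from ray.tune.search.sample import Domain

def prune_constants(space):
    # PBT accepts lists, tuples, dicts, tune domains, or callables; drop plain constants
    allowed = (Domain, list, tuple, dict)
    return {k: v for k, v in space.items() if isinstance(v, allowed) or callable(v)}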

update notebooks

Illustrate how to use the ResultGrid object and Lawrence's top_k function to analyze results upon completion of PR #40.
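For reference, a minimal ResultGrid sketch continuing from a fitted Tuner (Ray's public API; the metric name is illustrative, and top_k is not shown):

results = tuner.fit()          # returns a ray.tune.ResultGrid
df = results.get_dataframe()   # one row per trial with its last reported metrics
best = results.get_best_result(metric="val_loss", mode="min")
print(best.config, best.metrics["val_loss"])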

Allow non-class losses

A valid loss can be a callable/function and doesn't have to be a class.

Change keras_losses to handle callable loss objects too.
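A sketch of the dispatch keras_losses could perform (the helper name and structure are assumptions):

import inspect

def resolve_loss(loss, kwargs=None):
    if inspect.isclass(loss):
        return loss(**(kwargs or {}))  # loss class: instantiate
    if callable(loss):
        return loss                    # plain function: pass through to Keras
    raise TypeError(f"unsupported loss: {loss!r}")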

Allow for custom stopper object

Update the Search class to include a method for setting a custom trial or experiment stopper. It should probably include some error handling as well, e.g. checking that the stopper passed is of an allowable type.
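A sketch of the proposed setter (method and attribute names are illustrative):

from ray.tune.stopper import Stopper

def set_stopper(self, stopper):
    # restrict to Ray's Stopper interface so the tuner can accept it
    if not isinstance(stopper, Stopper):
        raise TypeError("stopper must be a ray.tune.stopper.Stopper instance")
    self.stopper = stopper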

Support metric and loss kwargs

Support passing of class definitions and kwargs in configurations.

Currently, we use a mapper argument to map strings to metrics and losses, since most class instances cannot be passed to trials (they are not picklable). This will be replaced with passing class definitions or callables with kwargs.

For metrics, a list of dicts as below (metrics do not contain hyperparameters and can be stored in a list):

[{"name": str, "metric": class/callable, "kwargs": dict}]

For losses, a single dict (multiple losses are not supported):

{"name": str, "loss": class/callable, "kwargs": dict}
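A sketch of how a trial might resolve these specs (hypothetical helper, not glimr's confirmed implementation):

def build_from_spec(spec, key):
    obj = spec[key]                      # class or callable
    kwargs = spec.get("kwargs", {})
    # instantiate classes with kwargs; pass bare callables through unchanged
    return obj(**kwargs) if isinstance(obj, type) else obj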

Multi-output models raise errors

Some single-task networks can have multiple outputs (e.g., an attention model that outputs both predictions and attention scores). In this case, checking the task count fails to predict how Keras will name the output metrics. Use len(model.outputs) instead.
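A small example of the failure mode and the proposed check:

import tensorflow as tf

inputs = tf.keras.Input(shape=(16,))
hidden = tf.keras.layers.Dense(8)(inputs)
pred = tf.keras.layers.Dense(1, name="prediction")(hidden)
attn = tf.keras.layers.Dense(16, name="attention")(hidden)
model = tf.keras.Model(inputs, [pred, attn])

# one task, two outputs: Keras prefixes metric names per output,
# so counting tasks under-predicts the metric names; count outputs instead
assert len(model.outputs) == 2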

Update get_top_k_trials for losses

Add an argument "mode" to allow filtering of trials by max or min values. This permits finding trials with the highest metric value or the lowest loss.
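A sketch of the proposed signature, assuming trials are summarized in a pandas DataFrame (the internals of get_top_k_trials are assumptions):

def get_top_k_trials(df, metric, k=5, mode="max"):
    ascending = mode == "min"  # lowest-loss trials first when mode="min"
    return df.sort_values(metric, ascending=ascending).head(k)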

Support kwargs for metrics and losses

Many metrics and losses have parameters that users may want to specify or tune in the space definition. These should be supported by enhancing the space specification. This requires changes to glimr.keras.keras and testing with single-task and multi-task models.
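An illustrative space fragment with a tunable loss kwarg (the schema shown is an assumption, not glimr's current specification):

import tensorflow as tf
from ray import tune

space = {
    "loss": {
        "name": "focal",
        "loss": tf.keras.losses.BinaryFocalCrossentropy,
        "kwargs": {"gamma": tune.uniform(1.0, 3.0)},  # sampled per trial
    }
}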

Model averaging of top trials

We could average the top trials or compose an ensemble of sensitive and specific models for prediction. This should improve accuracy and also enable calculation of uncertainties. Perhaps this cannot be generalized across all applications and should be handled in the application libraries instead.
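A generic ensemble-averaging sketch (not glimr API); the spread across models doubles as a rough uncertainty estimate:

import numpy as np

def ensemble_predict(models, x):
    preds = np.stack([m.predict(x) for m in models])
    return preds.mean(axis=0), preds.std(axis=0)  # prediction, uncertainty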

Support dataloader kwargs

Users should be able to specify dataloader kwargs and hyperparameters in their space. This can support operations like augmentation, which are application-specific, or training with different feature sets stored in different directories.

The batch size should also be integrated into a data dictionary in the search space.
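An illustrative space fragment (the "data" key and the kwargs shown are assumptions about the proposed schema):

from ray import tune

space = {
    "data": {
        "batch_size": tune.choice([16, 32, 64]),
        "kwargs": {
            "augment": tune.choice([True, False]),  # application-specific augmentation
            "feature_dir": "/path/to/features",     # e.g. alternate feature sets
        },
    }
}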
