awslabs / syne-tune

Large-scale and asynchronous hyperparameter and architecture optimization at your fingertips.

Home Page: https://syne-tune.readthedocs.io

License: Apache License 2.0

Topics: hyperparameter-optimization, machine-learning, multi-objective-optimization, hyperparameter-tuning, bayesian-optimization, sagemaker, neural-architecture-search

syne-tune's Introduction

Syne Tune: Large-Scale and Reproducible Hyperparameter Optimization



Documentation | Tutorials | API Reference | PyPI | Latest Blog Post

Syne Tune provides state-of-the-art algorithms for hyperparameter optimization (HPO) with the following key features:

  • Lightweight and platform-agnostic: Syne Tune is designed to work with different execution backends, so you are not locked into a particular distributed system architecture. Syne Tune runs with minimal dependencies.
  • Wide coverage of different HPO methods: Syne Tune supports more than 20 different optimization methods across multi-fidelity HPO, constrained HPO, multi-objective HPO, transfer learning, cost-aware HPO, and population-based training.
  • Simple, modular design: Rather than wrapping other HPO frameworks, Syne Tune provides simple APIs and scheduler templates, which can easily be extended to your specific needs. Studying the code will allow you to understand what the different algorithms are doing, and how they differ from each other.
  • Industry-strength Bayesian optimization: Syne Tune has comprehensive support for Gaussian Process-based Bayesian optimization. The same code powers modalities such as multi-fidelity HPO, constrained HPO, and cost-aware HPO, and has been tried and tested in production for several years.
  • Support for distributed workloads: Syne Tune lets you move fast, thanks to the parallel compute resources AWS SageMaker offers. Syne Tune allows ML/AI practitioners to easily set up and run studies with many experiments running in parallel. Run on different compute environments (locally, AWS, simulation) by changing just one line of code; see the sketch after this list.
  • Out-of-the-box tabulated benchmarks: Tabulated benchmarks let you simulate results in seconds while preserving the real dynamics of asynchronous or synchronous HPO with any number of workers.
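As a minimal sketch of the backend swap mentioned above (the estimator settings are illustrative, following the SageMakerBackend examples further down this page; everything else passed to the Tuner stays the same):

# Local execution:
from syne_tune.backend import LocalBackend

trial_backend = LocalBackend(entry_point='train_height_simple.py')

# SageMaker execution -- only the backend changes:
from sagemaker.pytorch import PyTorch
from sagemaker import get_execution_role
from syne_tune.backend import SageMakerBackend

trial_backend = SageMakerBackend(
    sm_estimator=PyTorch(
        entry_point='train_height_simple.py',
        instance_type='ml.c5.4xlarge',  # illustrative instance type
        instance_count=1,
        role=get_execution_role(),
        framework_version='1.6',
        py_version='py3',
    ),
)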

Syne Tune is developed in collaboration with the team behind the Automatic Model Tuning service.

Installing

To install Syne Tune via pip, run:

pip install 'syne-tune[basic]'

or to install the latest version from source:

git clone https://github.com/awslabs/syne-tune.git
cd syne-tune
python3 -m venv st_venv
. st_venv/bin/activate
pip install --upgrade pip
pip install -e '.[basic]'

This installs everything in a virtual environment st_venv. Remember to activate this environment before working with Syne Tune. We also recommend building the virtual environment from scratch now and then, in particular when you pull a new release, as dependencies may have changed.

See our change log to see what changed in the latest version.

Getting started

To enable tuning, you have to report metrics from a training script so that they can be communicated to Syne Tune. This can be accomplished by calling report(epoch=epoch, loss=loss), as shown in the example below:

# train_height_simple.py
import logging
import time

from syne_tune import Reporter
from argparse import ArgumentParser

if __name__ == '__main__':
    root = logging.getLogger()
    root.setLevel(logging.INFO)
    parser = ArgumentParser()
    parser.add_argument('--epochs', type=int)
    parser.add_argument('--width', type=float)
    parser.add_argument('--height', type=float)
    args, _ = parser.parse_known_args()
    report = Reporter()
    for step in range(args.epochs):
        time.sleep(0.1)
        dummy_score = 1.0 / (0.1 + args.width * step / 100) + args.height * 0.1
        # Feed the score back to Syne Tune.
        report(epoch=step + 1, mean_loss=dummy_score)

Once you have a training script reporting a metric, you can launch a tuning as follows:

# launch_height_simple.py
from syne_tune import Tuner, StoppingCriterion
from syne_tune.backend import LocalBackend
from syne_tune.config_space import randint
from syne_tune.optimizer.baselines import ASHA

# hyperparameter search space to consider
config_space = {
    'width': randint(1, 20),
    'height': randint(1, 20),
    'epochs': 100,
}

tuner = Tuner(
    trial_backend=LocalBackend(entry_point='train_height_simple.py'),
    scheduler=ASHA(
        config_space,
        metric='mean_loss',
        resource_attr='epoch',
        max_resource_attr="epochs",
        search_options={'debug_log': False},
    ),
    stop_criterion=StoppingCriterion(max_wallclock_time=30),
    n_workers=4,  # how many trials are evaluated in parallel
)
tuner.run()

The above example runs ASHA with 4 asynchronous workers on a local machine.
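After tuner.run() returns, results can be inspected with load_experiment; a minimal sketch, following the experiment-analysis snippets that appear later on this page (the experiment name below is a placeholder):

from syne_tune.experiments import load_experiment

tuning_experiment = load_experiment("experiment-xxxxxxxx")  # placeholder name
print(tuning_experiment)
tuning_experiment.plot()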

Experimentation with Syne Tune

If you plan to use advanced features of Syne Tune, such as different execution backends or running experiments remotely, writing launcher scripts like examples/launch_height_simple.py can become tedious. Syne Tune provides an advanced experimentation framework, which you can learn about in this tutorial or in this one.

Supported HPO methods

The following hyperparameter optimization (HPO) methods are available in Syne Tune:

| Method | Reference | Searcher | Asynchronous? | Multi-fidelity? | Transfer? |
| --- | --- | --- | --- | --- | --- |
| Grid Search | | deterministic | yes | no | no |
| Random Search | Bergstra, et al. (2011) | random | yes | no | no |
| Bayesian Optimization | Snoek, et al. (2012) | model-based | yes | no | no |
| BORE | Tiao, et al. (2021) | model-based | yes | no | no |
| CQR | Salinas, et al. (2023) | model-based | yes | no | no |
| MedianStoppingRule | Golovin, et al. (2017) | any | yes | yes | no |
| SyncHyperband | Li, et al. (2018) | random | no | yes | no |
| SyncBOHB | Falkner, et al. (2018) | model-based | no | yes | no |
| SyncMOBSTER | Klein, et al. (2020) | model-based | no | yes | no |
| ASHA | Li, et al. (2019) | random | yes | yes | no |
| BOHB | Falkner, et al. (2018) | model-based | yes | yes | no |
| MOBSTER | Klein, et al. (2020) | model-based | yes | yes | no |
| DEHB | Awad, et al. (2021) | evolutionary | no | yes | no |
| HyperTune | Li, et al. (2022) | model-based | yes | yes | no |
| DyHPO* | Wistuba, et al. (2022) | model-based | yes | yes | no |
| ASHABORE | Tiao, et al. (2021) | model-based | yes | yes | no |
| ASHACQR | Salinas, et al. (2023) | model-based | yes | yes | no |
| PASHA | Bohdal, et al. (2022) | random or model-based | yes | yes | no |
| REA | Real, et al. (2019) | evolutionary | yes | no | no |
| KDE | Falkner, et al. (2018) | model-based | yes | no | no |
| PBT | Jaderberg, et al. (2017) | evolutionary | no | yes | no |
| ZeroShotTransfer | Wistuba, et al. (2015) | deterministic | yes | no | yes |
| ASHA-CTS | Salinas, et al. (2021) | random | yes | yes | yes |
| RUSH | Zappella, et al. (2021) | random | yes | yes | yes |
| BoundingBox | Perrone, et al. (2019) | any | yes | yes | yes |

*: We implement the model-based scheduling logic of DyHPO, but use the same Gaussian process surrogate models as MOBSTER and HyperTune. The original source code for the paper is here.

The searchers fall into four broad categories: deterministic, random, evolutionary, and model-based. Random searchers sample candidate hyperparameter configurations uniformly at random, while model-based searchers sample them non-uniformly at random, according to a model (e.g., Gaussian process, density ratio estimator) and an acquisition function. Evolutionary searchers make use of an evolutionary algorithm.

Syne Tune also supports BoTorch searchers.

Supported multi-objective optimization methods

| Method | Reference | Searcher | Asynchronous? | Multi-fidelity? | Transfer? |
| --- | --- | --- | --- | --- | --- |
| Constrained Bayesian Optimization | Gardner, et al. (2014) | model-based | yes | no | no |
| MOASHA | Schmucker, et al. (2021) | random | yes | yes | no |
| NSGA-2 | Deb, et al. (2002) | evolutionary | no | no | no |
| Multi Objective Multi Surrogate (MSMOS) | Guerrero-Viu, et al. (2021) | model-based | no | no | no |
| MSMOS with random scalarization | Paria, et al. (2018) | model-based | no | no | no |

The HPO methods listed above can also be used in a multi-objective setting via scalarization or non-dominated sorting. See multiobjective_priority.py for details.

Examples

You will find many examples in the examples/ folder illustrating different functionality provided by Syne Tune.

Examples for Experimentation and Benchmarking

You will find many examples for experimentation and benchmarking in benchmarking/examples/ and in benchmarking/nursery/.

FAQ and Tutorials

Check our FAQ to learn more about Syne Tune's functionality.

Do you want to know more? Here are a number of tutorials.

Blog Posts

Videos

Security

See CONTRIBUTING for more information.

Citing Syne Tune

If you use Syne Tune in a scientific publication, please cite the following paper:

"Syne Tune: A Library for Large Scale Hyperparameter Tuning and Reproducible Research" First Conference on Automated Machine Learning, 2022.

@inproceedings{salinas2022syne,
  title={Syne Tune: A Library for Large Scale Hyperparameter Tuning and Reproducible Research},
  author={David Salinas and Matthias Seeger and Aaron Klein and Valerio Perrone and Martin Wistuba and Cedric Archambeau},
  booktitle={International Conference on Automated Machine Learning, AutoML 2022},
  year={2022},
  url={https://proceedings.mlr.press/v188/salinas22a.html}
}

License

This project is licensed under the Apache-2.0 License.

syne-tune's People

Contributors

610v4nn1, aaronkl, amazon-auto, austinmw, banyikun, dependabot[bot], duck105, eddiebergman, geoalgo, hfurkanbozkurt, iaroslav-ai, jgolebiowski, jjaeyeon, lostella, ltiao, master, mina-ghashami, mlblack, mseeger, ondrejbohdal, rsnirwan, sighellan, talesa, trellixvulnteam, valavanca, wesk, wistuba, ystein


syne-tune's Issues

Container build fails

Hi, when running the container build script, it fails at the following:

Step 12/12 : RUN python -m pip install --no-cache-dir --upgrade -r /tmp/packages/requirements.txt
 ---> Running in 67f84184ab3e
ERROR: Extras after version '>=1.3ray[tune]'.
The command '/bin/sh -c python -m pip install --no-cache-dir --upgrade -r /tmp/packages/requirements.txt' returned a non-zero code: 1

Grid search in syne-tune

Hey folks,
would you be interested in grid search being implemented in syne-tune? I had a few offline discussions with some of you already, and it seems that you are not against adding grid search to syne-tune, but I want to keep a record of that here.

Additionally, would you have any pointers as to what would be the best way to add grid search to syne-tune?
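For reference, a plain-Python sketch of what grid enumeration could look like, independent of any Syne Tune API (the grid below is made up for illustration):

from itertools import product

# Hypothetical grid: each hyperparameter maps to a finite list of values.
grid = {
    'width': [1, 5, 10, 20],
    'height': [1, 10, 20],
}

# Enumerate all configurations in a deterministic order.
names = list(grid.keys())
configs = [
    dict(zip(names, values))
    for values in product(*(grid[name] for name in names))
]
# configs[0] == {'width': 1, 'height': 1}; 12 configurations in total.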

Simulator results are ignored

When running (main branch) e.g.

python benchmarking/nursery/benchmark_automl/benchmark_main.py --num_seeds 1 --method ASHA --benchmark fcnet-protein

I get the following warnings. Is this expected? Does anyone know what's going on?

WARNING:syne_tune.backend.simulator_backend.simulator_backend:The following trials reported results, but are not covered by trial_ids. These results will be ignored:
  trial_id 38: status = Stopped, num_results = 1
WARNING:syne_tune.backend.simulator_backend.simulator_backend:The following trials reported results, but are not covered by trial_ids. These results will be ignored:
  trial_id 44: status = Stopped, num_results = 1
[... the same warning repeats for trial_ids 49, 77, 86, 113, 121, 142, 169, 186, 188, 229, 247, 252, 260, 255, 264, 297, 309 and 314 ...]

Doc mismatch leads to ImportError: cannot import name 'report' ?

Hi,

I'm using Syne Tune from a SageMaker-managed EC2 instance (notebook instance)

As indicated here, I'm using this code in my backend script:
from syne_tune.report import report
which raises ImportError: cannot import name 'report'.

When I look in report.py, I can see a Reporter class but no report function.

This blog post, however, proposes different code:

from syne_tune.report import Reporter
report = Reporter()

Could the README be clarified?

(Note that I cannot check the version: import syne_tune; syne_tune.__version__ raises AttributeError: module 'syne_tune' has no attribute '__version__'.)

thanks!

[Feature Request] Attach to SageMakerBackend logging

Hi, could you add a method to attach to the logs for the SageMakerBackend management estimator? For example, RemoteLauncher.logs so we can simply do remote.logs()?

Some customers can't access the console to view CloudWatch logs, so this would be easier for them than fiddling with boto3.
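As a stopgap, something along these lines should work with the SageMaker Python SDK (a sketch; the job name is a placeholder):

import sagemaker

session = sagemaker.Session()
# Stream the CloudWatch logs of a training job to stdout,
# without needing access to the AWS console.
session.logs_for_job("my-training-job-name", wait=False)  # placeholder job name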

Numeric and Log-Scale Choice

There is no equivalent of choice for numeric values. E.g., in the FCNet blackbox, the learning rate is defined as 'hp_init_lr': choice([0.0005, 0.001, 0.005, 0.01, 0.05, 0.1]). This will not allow model-based approaches to encode this hyperparameter correctly. It would be great to identify such hyperparameters as numeric and also indicate whether a log transform is needed.

Warning: Python 3.6

Hi, I receive this warning using the docker image:

PythonDeprecationWarning: Boto3 will no longer support Python 3.6 starting May 30, 2022. To continue receiving service updates, bug fixes, and security updates please upgrade to Python 3.7 or later. More information can be found here: https://aws.amazon.com/blogs/developer/python-support-policy-updates-for-aws-sdks-and-tools/

Since this is only a couple of weeks away it might be a good idea to update the Dockerfile to Python 3.7 now.

How to set a custom tuner_path ?

Hi,

How do I set a custom tuner_path?

I'm launching long-running experiments as remote SageMaker jobs, and I'd like to set the tuner metadata path to /opt/ml/checkpoints (a local path on those transient VMs), so that the metadata is sent to S3 upon updates.

ExperimentResult plot warning

Hi, with recent versions of matplotlib, ExperimentResult.plot() gives me the warning:

WARNING:matplotlib.legend:No artists with labels found to put in legend. Note that artists whose label start with an underscore are ignored when legend() is called with no argument.

Experiment Results Contain Random Rows

In my experiment, the results dataframe contains multiple rows with trial id 1 that have the same content as the next row, the only difference being the config. This causes problems, since sometimes the best config is now attributed to trial id 1, showing a config which did not achieve the best performance.

See this example: the true trial id 1 performance is 81% (row 4), but trial id 1 also shows up in row 10 with the highest accuracy.
I've added a simple example to reproduce this behavior.


from pathlib import Path

from sagemaker import get_execution_role
from sagemaker.pytorch import PyTorch

from syne_tune import StoppingCriterion, Tuner
from syne_tune.backend import SageMakerBackend
from syne_tune.config_space import randint
from syne_tune.optimizer.schedulers.fifo import FIFOScheduler

entry_point = Path('examples') / "training_scripts" / "height_example" / "train_height.py"
assert entry_point.is_file(), 'File unknown'
mode = "min"
metric = "mean_loss"
instance_type = 'ml.c5.4xlarge'
instance_count = 1
instance_max_time = 999
n_workers = 20

config_space = {
    "steps": 1,
    "width": randint(0, 20),
    "height": randint(-100, 100)
}

backend = SageMakerBackend(
    sm_estimator=PyTorch(
        entry_point=str(entry_point),
        instance_type=instance_type,
        instance_count=instance_count,
        role=get_execution_role(),
        max_run=instance_max_time,
        py_version='py3',
        framework_version='1.6',
    ),
    metrics_names=[metric],
)

# Random search without stopping
scheduler = FIFOScheduler(
    config_space=config_space,
    searcher='random',
    mode=mode,
    metric=metric,
)

tuner = Tuner(
    trial_backend=backend,
    scheduler=scheduler,
    stop_criterion=StoppingCriterion(max_wallclock_time=300),
    n_workers=n_workers,
)

tuner.run()

SageMaker ResourceLimitExceeded

Hi, I have a limit of 8 ml.g5.12xlarge instances, and although I set Tuner.n_workers = 5 I still got a ResourceLimitExceeded error. Is there a way to make sure that jobs are fully stopped when using SageMakerBackend before launching new ones?

Also, when using RemoteLauncher, in situations where the management instance does error out (for example due to ResourceLimitExceeded), is there a way to make sure the management instance sends a stop signal to all tuning jobs before exiting? Maybe something like:

try:
    # manage tuning jobs
except:
    # raise error
finally:
    # stop any trials still running
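A slightly more concrete sketch of that pattern, wrapped around the tuner itself (stop_trial is the per-trial stop method on the trial backend; how to list the still-running trials is left as an assumption here):

try:
    tuner.run()
finally:
    # Hypothetical cleanup: stop any trials still running before the
    # management instance exits. `running_trial_ids` is a placeholder;
    # the real API for listing running trials may differ.
    for trial_id in running_trial_ids:
        tuner.trial_backend.stop_trial(trial_id)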

Custom results directory

Dear creators, thank you again for your great work, and perhaps sorry for being annoying with my suggestions/questions. Is it possible to change the home directory of different runs, so that it is not ~/syne-tune but a custom path? Thanks!

Custom arguments packaging

Dear creators, thank you for your great work. Is there a way to specify custom packaging of the input hyperparameters for our main script? E.g., in our project we do not pass hyperparameters directly, as in python3 train.py --width 1, but through python3 train.py --hyperparameters='{"width": 1}', to avoid adding a new argument to our parser and to avoid clutter each time we would like to change something. I have checked the FAQ but have not found anything related. Thank you for your input!
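On the training-script side, a plain-Python sketch of that packaging (no Syne Tune specifics involved; note that Syne Tune by default passes one command-line argument per hyperparameter, as the logs further down this page show):

# train.py -- sketch of accepting packed hyperparameters
import json
from argparse import ArgumentParser

parser = ArgumentParser()
parser.add_argument('--hyperparameters', type=str, default='{}')
args, _ = parser.parse_known_args()

hyperparameters = json.loads(args.hyperparameters)  # e.g. {"width": 1}
width = hyperparameters.get('width')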

Issue with running launch_sagemaker_backend.py: No module named 'benchmarks'

Hello!
When running https://github.com/awslabs/syne-tune/blob/main/docs/tutorials/basics/scripts/launch_sagemaker_backend.py (python docs/tutorials/basics/scripts/launch_sagemaker_backend.py) on the main branch I get an error within the spawned SageMaker training jobs:

Traceback (most recent call last):
  File "traincode_report_withcheckpointing.py", line 29, in <module>
    from benchmarks.checkpoint import resume_from_checkpointed_model, \
ModuleNotFoundError: No module named 'benchmarks'

I'm including the full log below.
I’m not certain if it’s due to my AWS environment setup (although I am generally able to run SageMaker training jobs) or an issue with the code, could you please have a look?

Best wishes,
Adam

Full log:

showing log of sagemaker job: traincode-report-withcheckpointing-2022-01-18-16-26-35-248-4
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2022-01-18 16:34:35,020 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training
2022-01-18 16:34:35,023 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2022-01-18 16:34:35,035 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2022-01-18 16:34:36,465 sagemaker_pytorch_container.training INFO     Invoking user training script.
2022-01-18 16:34:37,061 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2022-01-18 16:34:37,076 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2022-01-18 16:34:37,090 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2022-01-18 16:34:37,103 sagemaker-training-toolkit INFO     Invoking user script
Training Env:
{
    "additional_framework_parameters": {},
    "channel_input_dirs": {},
    "current_host": "algo-1",
    "framework_module": "sagemaker_pytorch_container.training:main",
    "hosts": [
        "algo-1"
    ],
    "hyperparameters": {
        "batch_size": 126,
        "weight_decay": 0.7744002774231975,
        "st_checkpoint_dir": "/opt/ml/checkpoints",
        "st_instance_count": 1,
        "n_units_2": 322,
        "dataset_path": "./",
        "n_units_1": 107,
        "dropout_2": 0.20979101632756325,
        "dropout_1": 0.4715702331554363,
        "epochs": 81,
        "learning_rate": 0.0029903699075321814,
        "st_instance_type": "ml.m4.10xlarge"
    },
    "input_config_dir": "/opt/ml/input/config",
    "input_data_config": {},
    "input_dir": "/opt/ml/input",
    "is_master": true,
    "job_name": "traincode-report-withcheckpointing-2022-01-18-16-26-35-248-4",
    "log_level": 20,
    "master_hostname": "algo-1",
    "model_dir": "/opt/ml/model",
    "module_dir": "s3://sagemaker-us-west-2-640549960621/traincode-report-withcheckpointing-2022-01-18-16-26-35-248-4/source/sourcedir.tar.gz",
    "module_name": "traincode_report_withcheckpointing",
    "network_interface_name": "eth0",
    "num_cpus": 40,
    "num_gpus": 0,
    "output_data_dir": "/opt/ml/output/data",
    "output_dir": "/opt/ml/output",
    "output_intermediate_dir": "/opt/ml/output/intermediate",
    "resource_config": {
        "current_host": "algo-1",
        "hosts": [
            "algo-1"
        ],
        "network_interface_name": "eth0"
    },
    "user_entry_point": "traincode_report_withcheckpointing.py"
}
Environment variables:
SM_HOSTS=["algo-1"]
SM_NETWORK_INTERFACE_NAME=eth0
SM_HPS={"batch_size":126,"dataset_path":"./","dropout_1":0.4715702331554363,"dropout_2":0.20979101632756325,"epochs":81,"learning_rate":0.0029903699075321814,"n_units_1":107,"n_units_2":322,"st_checkpoint_dir":"/opt/ml/checkpoints","st_instance_count":1,"st_instance_type":"ml.m4.10xlarge","weight_decay":0.7744002774231975}
SM_USER_ENTRY_POINT=traincode_report_withcheckpointing.py
SM_FRAMEWORK_PARAMS={}
SM_RESOURCE_CONFIG={"current_host":"algo-1","hosts":["algo-1"],"network_interface_name":"eth0"}
SM_INPUT_DATA_CONFIG={}
SM_OUTPUT_DATA_DIR=/opt/ml/output/data
SM_CHANNELS=[]
SM_CURRENT_HOST=algo-1
SM_MODULE_NAME=traincode_report_withcheckpointing
SM_LOG_LEVEL=20
SM_FRAMEWORK_MODULE=sagemaker_pytorch_container.training:main
SM_INPUT_DIR=/opt/ml/input
SM_INPUT_CONFIG_DIR=/opt/ml/input/config
SM_OUTPUT_DIR=/opt/ml/output
SM_NUM_CPUS=40
SM_NUM_GPUS=0
SM_MODEL_DIR=/opt/ml/model
SM_MODULE_DIR=s3://sagemaker-us-west-2-640549960621/traincode-report-withcheckpointing-2022-01-18-16-26-35-248-4/source/sourcedir.tar.gz
SM_TRAINING_ENV={"additional_framework_parameters":{},"channel_input_dirs":{},"current_host":"algo-1","framework_module":"sagemaker_pytorch_container.training:main","hosts":["algo-1"],"hyperparameters":{"batch_size":126,"dataset_path":"./","dropout_1":0.4715702331554363,"dropout_2":0.20979101632756325,"epochs":81,"learning_rate":0.0029903699075321814,"n_units_1":107,"n_units_2":322,"st_checkpoint_dir":"/opt/ml/checkpoints","st_instance_count":1,"st_instance_type":"ml.m4.10xlarge","weight_decay":0.7744002774231975},"input_config_dir":"/opt/ml/input/config","input_data_config":{},"input_dir":"/opt/ml/input","is_master":true,"job_name":"traincode-report-withcheckpointing-2022-01-18-16-26-35-248-4","log_level":20,"master_hostname":"algo-1","model_dir":"/opt/ml/model","module_dir":"s3://sagemaker-us-west-2-640549960621/traincode-report-withcheckpointing-2022-01-18-16-26-35-248-4/source/sourcedir.tar.gz","module_name":"traincode_report_withcheckpointing","network_interface_name":"eth0","num_cpus":40,"num_gpus":0,"output_data_dir":"/opt/ml/output/data","output_dir":"/opt/ml/output","output_intermediate_dir":"/opt/ml/output/intermediate","resource_config":{"current_host":"algo-1","hosts":["algo-1"],"network_interface_name":"eth0"},"user_entry_point":"traincode_report_withcheckpointing.py"}
SM_USER_ARGS=["--batch_size","126","--dataset_path","./","--dropout_1","0.4715702331554363","--dropout_2","0.20979101632756325","--epochs","81","--learning_rate","0.0029903699075321814","--n_units_1","107","--n_units_2","322","--st_checkpoint_dir","/opt/ml/checkpoints","--st_instance_count","1","--st_instance_type","ml.m4.10xlarge","--weight_decay","0.7744002774231975"]
SM_OUTPUT_INTERMEDIATE_DIR=/opt/ml/output/intermediate
SM_HP_BATCH_SIZE=126
SM_HP_WEIGHT_DECAY=0.7744002774231975
SM_HP_ST_CHECKPOINT_DIR=/opt/ml/checkpoints
SM_HP_ST_INSTANCE_COUNT=1
SM_HP_N_UNITS_2=322
SM_HP_DATASET_PATH=./
SM_HP_N_UNITS_1=107
SM_HP_DROPOUT_2=0.20979101632756325
SM_HP_DROPOUT_1=0.4715702331554363
SM_HP_EPOCHS=81
SM_HP_LEARNING_RATE=0.0029903699075321814
SM_HP_ST_INSTANCE_TYPE=ml.m4.10xlarge
PYTHONPATH=/opt/ml/code:/opt/conda/bin:/opt/conda/lib/python36.zip:/opt/conda/lib/python3.6:/opt/conda/lib/python3.6/lib-dynload:/opt/conda/lib/python3.6/site-packages
Invoking script with the following command:
/opt/conda/bin/python3.6 traincode_report_withcheckpointing.py --batch_size 126 --dataset_path ./ --dropout_1 0.4715702331554363 --dropout_2 0.20979101632756325 --epochs 81 --learning_rate 0.0029903699075321814 --n_units_1 107 --n_units_2 322 --st_checkpoint_dir /opt/ml/checkpoints --st_instance_count 1 --st_instance_type ml.m4.10xlarge --weight_decay 0.7744002774231975
Traceback (most recent call last):
  File "traincode_report_withcheckpointing.py", line 29, in <module>
    from benchmarks.checkpoint import resume_from_checkpointed_model, \
ModuleNotFoundError: No module named 'benchmarks'
2022-01-18 16:34:38,444 sagemaker-training-toolkit ERROR    ExecuteUserScriptError:
Command "/opt/conda/bin/python3.6 traincode_report_withcheckpointing.py --batch_size 126 --dataset_path ./ --dropout_1 0.4715702331554363 --dropout_2 0.20979101632756325 --epochs 81 --learning_rate 0.0029903699075321814 --n_units_1 107 --n_units_2 322 --st_checkpoint_dir /opt/ml/checkpoints --st_instance_count 1 --st_instance_type ml.m4.10xlarge --weight_decay 0.7744002774231975"
Traceback (most recent call last):
  File "traincode_report_withcheckpointing.py", line 29, in <module>
    from benchmarks.checkpoint import resume_from_checkpointed_model, \
ModuleNotFoundError: No module named 'benchmarks'

Add plateau stopper

Add a stopping criterion that stops the HPO process if it hasn’t improved for N consecutive steps
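A rough sketch of what this could look like as a custom stop_criterion callable; here, the way the current best metric is read off the tuning status is injected as a function rather than assumed:

from typing import Callable, Optional

class PlateauStopper:
    """Stop tuning once the best metric has not improved for
    `patience` consecutive status updates (sketch only)."""

    def __init__(self, get_best: Callable, patience: int, mode: str = "min"):
        # `get_best` maps the tuning status to the current best metric
        # value (or None if no results yet); its implementation is left open.
        self.get_best = get_best
        self.patience = patience
        self.sign = 1.0 if mode == "min" else -1.0
        self.best: Optional[float] = None
        self.count = 0

    def __call__(self, status) -> bool:
        value = self.get_best(status)
        if value is None:
            return False
        if self.best is None or self.sign * value < self.sign * self.best:
            self.best, self.count = value, 0
        else:
            self.count += 1
        return self.count >= self.patience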

Duplicate SM training job names with SagemakerBackend

Running python docs/tutorials/basics/scripts/launch_sagemaker_backend.py produces SM training jobs named None-0, None-1, etc., which do not depend on tuner_name.
Rerunning the example leads to duplicate SM training job names and hence failure of the script.
This is because tuner_name inside the SagemakerBackend object is only ever set to None in the constructor.

Cross-ref: #112 and points by mseeger in #113.

How to set and get experiment name?

Hi,

I see in the blog that one can query experiments by name to access the metrics:

from syne_tune.experiments import load_experiment
tuning_experiment = load_experiment("train-cifar100-2021-11-05-15-22-27-531")
tuning_experiment.plot()

How do we set and get an experiment name?
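For what it's worth, a sketch of how this could look (assuming the Tuner accepts a tuner_name argument, to which a unique suffix is appended, and exposes the resulting name):

tuner = Tuner(
    trial_backend=trial_backend,
    scheduler=scheduler,
    stop_criterion=StoppingCriterion(max_wallclock_time=300),
    n_workers=4,
    tuner_name="train-cifar100",  # assumption: prefix of the experiment name
)
tuner.run()
# Assumption: the full, unique experiment name to pass to load_experiment
print(tuner.name)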

Implement independent GP surrogate model and Hyper-Tune

What. Implement Hyper-Tune as extension of asynchronous Hyperband (ASHA)
Why. Very competitive method, according to the paper. We lack async HB methods that do a good job with bracket sampling
Done. Some unit tests, comparison with baselines

Promotion Logic Bug

There seems to be a problem with the Hyperband promotion logic.

How to reproduce:
Add type="promotion" to https://github.com/awslabs/syne-tune/blob/main/benchmarking/nursery/benchmark_automl/baselines.py#L69

Run python benchmarking/nursery/benchmark_automl/benchmark_main.py --num_seeds 1 --method ASHA --benchmark lcbench-airlines

  File "/syne-tune/benchmarking/nursery/benchmark_automl/benchmark_main.py", line 209, in <module>
    tuner.run()
  File "/syne-tune/syne_tune/tuner.py", line 240, in run
    raise e
  File "/syne-tune/syne_tune/tuner.py", line 175, in run
    new_done_trial_statuses, new_results = self._process_new_results(
  File "/syne-tune/syne_tune/tuner.py", line 345, in _process_new_results
    done_trials_statuses = self._update_running_trials(
  File "/syne-tune/syne_tune/tuner.py", line 465, in _update_running_trials
    decision = self.scheduler.on_trial_result(trial=trial, result=result)
  File "/syne-tune/syne_tune/optimizer/schedulers/hyperband.py", line 779, in on_trial_result
    task_info = self.terminator.on_task_report(trial_id, result)
  File "/syne-tune/syne_tune/optimizer/schedulers/hyperband.py", line 1124, in on_task_report
    rung_sys.on_task_report(trial_id, result, skip_rungs=skip_rungs)
  File "/syne-tune/syne_tune/optimizer/schedulers/hyperband_promotion.py", line 221, in on_task_report
    assert resource == milestone, (
AssertionError: trial_id 1: resource = 4 > 3 milestone. Make sure to report time attributes covering all milestones

[BUG] LocalBackend: Evaluation Failed!

Hi, I am using LocalBackend to train a couple of Hugging Face models on a sample dataset (still WIP).

However, I ran into the following errors:

INFO:syne_tune.optimizer.schedulers.hyperband:trial_id 1 starts (first milestone = 1)
INFO:root:running subprocess with command: /opt/conda/bin/python huggingface_on_excel.py --model_type google/electra-base-discriminator --learning_rate 8.018154654725304e-05 --weight_decay 1.3591419560772573e-07 --dataset_path /DATA/jin/  --CUDA_VISIBLE_DEVICES 2 --train_batch_size 8 --valid_batch_size 8 --epochs 1 --output_dir output/ --eval_steps 100 --st_checkpoint_dir /root/syne-tune/test-hugging/1/checkpoints
INFO:syne_tune.tuner:(trial 1) - scheduled config {'model_type': 'google/electra-base-discriminator', 'learning_rate': 8.018154654725304e-05, 'weight_decay': 1.3591419560772573e-07, 'dataset_path': '/DATA/jin/', 'CUDA_VISIBLE_DEVICES': '2', 'train_batch_size': 8, 'valid_batch_size': 8, 'epochs': 1, 'output_dir': 'output/', 'eval_steps': 100}
INFO:syne_tune.tuner:Trial trial_id 1 was stopped independently of the scheduler.
INFO:syne_tune.optimizer.schedulers.fifo:trial_id 1: Evaluation failed!

Some of the debugging methods I have tried:

  1. Setting debug_mode: True in the tuner did not reveal the bug.
  2. I am able to run the exact subprocess command manually without running into any issue or bug.

Any advice will be appreciated. Thank you!

sp.number_choice?

Hello!
ST already has sp.choice for categorical variables, and sp.finrange and sp.logfinrange for numerical values. However, I feel that sometimes it is easier to manually specify the elements (as with sp.choice), while still having them treated as numerical values by the GP models and the blackbox surrogate models. Hence I'm wondering about implementing something like sp.number_choice, mostly for convenience. What do you think?
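For illustration, the proposed API could look like the following (hypothetical; sp.number_choice does not exist, and the log flag is equally speculative):

import syne_tune.search_space as sp

config_space = {
    # Values are specified manually, as with sp.choice, but would be
    # treated as ordered numerical values by GP and surrogate models.
    'hp_init_lr': sp.number_choice([0.0005, 0.001, 0.005, 0.01, 0.05, 0.1], log=True),
}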

INFO:root:Detected 2 GPUs on an EC2 m5d.12xlarge that has no GPU ?

Hi,

I'm running Syne Tune on the conda_python3 Jupyter kernel of a SageMaker-managed EC2 instance (ml.m5d.12xlarge notebook instance), that has no GPUs.
However, in the Syne Tune logs I see:

INFO:root:Detected 2 GPUs

and then few lines below

DEBUG:root:Free GPUs: {0, 1}
DEBUG:root:Assigned GPU 0 to trial_id 0

But an m5d.12xlarge is not expected to have GPUs, right?

ModuleNotFoundError when running example_syne_tune_for_hf.ipynb notebook

When I run the example_syne_tune_for_hf.ipynb notebook, the first cell after the !pip install commands results in a ModuleNotFoundError: No module named 'syne_tune.config_space' error.

Cell:

import matplotlib as mpl  # mpl.use('pgf')
import os

%matplotlib inline
import matplotlib.pyplot as plt
import logging
logging.basicConfig(level=logging.INFO)
from pathlib import Path

from syne_tune.backend.local_backend import LocalBackend
from syne_tune.tuner import Tuner
from syne_tune.search_space import uniform, loguniform, choice, randint
from syne_tune.stopping_criterion import StoppingCriterion
from syne_tune.optimizer.baselines import ASHA, MOBSTER, BayesianOptimization, RandomSearch, MOASHA
from syne_tune.constants import ST_WORKER_TIME
from syne_tune.backend.sagemaker_backend.instance_info import select_instance_type
from syne_tune.backend.sagemaker_backend.sagemaker_backend import SagemakerBackend
from syne_tune.backend.sagemaker_backend.sagemaker_utils import get_execution_role


TASK2METRICSMODE = {
    "cola": {'metric': 'matthews_correlation', 'mode': 'max'},
    "mnli": {'metric': 'accuracy', 'mode': 'max'},
    "mrpc": {'metric': 'f1', 'mode': 'max'},
    "qnli": {'metric': 'accuracy', 'mode': 'max'},
    "qqp": {'metric': 'f1', 'mode': 'max'},
    "rte": {'metric': 'accuracy', 'mode': 'max'},
    "sst2": {'metric': 'accuracy', 'mode': 'max'},
    "stsb": {'metric': 'spearmanr', 'mode': 'max'},
    "wnli": {'metric': 'accuracy', 'mode': 'max'},
}

Full Logs:

---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
<ipython-input-5-ad89febf37d1> in <module>
     10 from syne_tune.backend.local_backend import LocalBackend
     11 from syne_tune.tuner import Tuner
---> 12 from syne_tune.config_space import uniform, loguniform, choice, randint
     13 from syne_tune.stopping_criterion import StoppingCriterion
     14 from syne_tune.optimizer.baselines import ASHA, MOBSTER, BayesianOptimization, RandomSearch, MOASHA

ModuleNotFoundError: No module named 'syne_tune.config_space'

ImportError for BotorchSearcher

Test (3.8) fails with:

____________ ERROR collecting tst/schedulers/test_schedulers_api.py ____________
ImportError while importing test module '/home/runner/work/syne-tune/syne-tune/tst/schedulers/test_schedulers_api.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
/opt/hostedtoolcache/Python/3.8.13/x64/lib/python3.8/importlib/__init__.py:127: in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
tst/schedulers/test_schedulers_api.py:47: in <module>
    from syne_tune.optimizer.schedulers.botorch.botorch_searcher import BotorchSearcher
syne_tune/optimizer/schedulers/botorch/botorch_searcher.py:18: in <module>
    from botorch.models import SingleTaskGP
/opt/hostedtoolcache/Python/3.8.13/x64/lib/python3.8/site-packages/botorch/__init__.py:7: in <module>
    from botorch import (
/opt/hostedtoolcache/Python/3.8.13/x64/lib/python3.8/site-packages/botorch/acquisition/__init__.py:7: in <module>
    from botorch.acquisition.acquisition import (
/opt/hostedtoolcache/Python/3.8.13/x64/lib/python3.8/site-packages/botorch/acquisition/acquisition.py:16: in <module>
    from botorch.models.model import Model
/opt/hostedtoolcache/Python/3.8.13/x64/lib/python3.8/site-packages/botorch/models/__init__.py:7: in <module>
    from botorch.models.approximate_gp import (
/opt/hostedtoolcache/Python/3.8.13/x64/lib/python3.8/site-packages/botorch/models/approximate_gp.py:35: in <module>
    from botorch.models.gpytorch import GPyTorchModel
/opt/hostedtoolcache/Python/3.8.13/x64/lib/python3.8/site-packages/botorch/models/gpytorch.py:23: in <module>
    from botorch.acquisition.objective import PosteriorTransform
/opt/hostedtoolcache/Python/3.8.13/x64/lib/python3.8/site-packages/botorch/acquisition/objective.py:18: in <module>
    from botorch.models.model import Model
/opt/hostedtoolcache/Python/3.8.13/x64/lib/python3.8/site-packages/botorch/models/model.py:24: in <module>
    from botorch.models.utils.assorted import fantasize as fantasize_flag
/opt/hostedtoolcache/Python/3.8.13/x64/lib/python3.8/site-packages/botorch/models/utils/__init__.py:7: in <module>
    from botorch.models.utils.assorted import (
/opt/hostedtoolcache/Python/3.8.13/x64/lib/python3.8/site-packages/botorch/models/utils/assorted.py:21: in <module>
    from gpytorch.utils.broadcasting import _mul_broadcast_shape
E   ImportError: cannot import name '_mul_broadcast_shape' from 'gpytorch.utils.broadcasting' (/opt/hostedtoolcache/Python/3.8.13/x64/lib/python3.8/site-packages/gpytorch/utils/broadcasting.py)

Is there a tuner.best_config() API?

After a tuner.run() execution, I'd like to be able to programmatically get the best config, either from the tuner or from its data folder, e.g.:

tuner.best_config()

or

tuning_experiment = load_experiment("experiment-xxxxxxxx")
tuning_experiment.best_config()

Is there an API for this?
If not, I suggest adding it to the roadmap.
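In the meantime, a sketch of extracting the best configuration from the results dataframe exposed by load_experiment (the metric name, mode, and the config_ column prefix are assumptions for the example):

from syne_tune.experiments import load_experiment

tuning_experiment = load_experiment("experiment-xxxxxxxx")  # placeholder name
df = tuning_experiment.results
# Assuming metric "mean_loss" with mode "min", and hyperparameter
# columns carrying a "config_" prefix in the results dataframe.
best_row = df.loc[df["mean_loss"].idxmin()]
best_config = {k: v for k, v in best_row.items() if k.startswith("config_")}
print(best_config)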

Config_space for full pipeline optimization

Hi, would it be possible to direct the config_space search with conditional sets, so that it can create a multi-step pipeline?
Something that would prevent invalid pipelines from being activated, based on algorithm and hyperparameter config variables.

Tutorial: Multi-fidelity HNAS in Syne Tune

What. Longer step-by-step tutorial on how to run experiments with our async and sync multi-fidelity HPO methods, both using tabulated blackboxes and a real DNN tuning problem (Hugging Face?).
Why. The way in which variants of different algos are implemented and available in ST could be a real advantage, but is right now hidden and undocumented. A tutorial would be most accessible, and would clarify important concepts (sync/async)
Done. Tutorial tested with volunteer outside the team, feedback incorporated

A second part of the tutorial could be for developers: how to implement a new scheduler, or a variant of an existing one.

Failed trials have out of date metrics

Hi, I'm using SageMaker as a backend with the remote launcher. I noticed that if a job errors out during training, the latest performance logs are not captured.

For example, in my HPO experiment on the CIFAR-10 dataset, one trial (number 8) was reported in the Syne Tune results dataframe as achieving a validation accuracy of 0.8478 at epoch 22:


However my CloudWatch logs show that the validation accuracy actually reached 0.926 at epoch 60 before crashing:


Interestingly, the job shows as Stopped rather than Failed in the SageMaker console. Does Syne Tune notice an exception and stop the job before it exits with a failure?

make QuantileBasedSurrogateSearcher import in baselines optional

Right now, importing syne_tune.optimizer.baselines fails when only core dependencies are installed, because it imports QuantileBasedSurrogateSearcher, which in turn requires additional dependencies such as XGBoost or scikit-learn. I would suggest making its import optional to avoid exceptions.

AttributeError: 'NoneType' object has no attribute 'scheduler'

Hi,

I launched a Syne Tune experiment a few hours ago (experiment-2022-01-11-10-57-17-491), then stopped it and launched another one.

While experiment-2022-01-11-10-57-17-491 was running, I could see its chart using

from syne_tune.experiments import load_experiment

tuning_experiment = load_experiment("experiment-2022-01-11-10-57-17-491")
tuning_experiment.plot()

Now, when I do the same from the same machine, I get:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-5-0c0bfae5f6de> in <module>
      1 # metric over time
----> 2 tuning_experiment.plot()

~/anaconda3/envs/python3/lib/python3.6/site-packages/syne_tune/experiments.py in plot(self, **plt_kwargs)
     51         import matplotlib.pyplot as plt
     52 
---> 53         scheduler = self.tuner.scheduler
     54         metric = self.metric_name()
     55         df = self.results

AttributeError: 'NoneType' object has no attribute 'scheduler'

What is wrong? Can the graph only be accessed while the tuner is running?

[Feature Request] Parallel Categories Plot

(Apologies for creating multiple recent GitHub issues, this is the last one, I promise!)

I took the DataFrame from my experiment results and used Plotly's plotly.express.parallel_categories plot to visualize hyperparameter interactions, dropping any features that have only one unique value. This is an interactive plot, and you can wrap it in a function that refreshes periodically when new data is available:


This has been super useful for me, so I thought it may be useful to others as well if it were added as a plotting capability to the library? Although I'd understand if it's not desirable to add another dependency. Just thought I'd share!
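For reference, a sketch of the plot described above (assuming an experiment results dataframe in which hyperparameter columns carry a config_ prefix; the metric name is illustrative):

import plotly.express as px
from syne_tune.experiments import load_experiment

df = load_experiment("experiment-xxxxxxxx").results  # placeholder name
# Keep only hyperparameter columns with more than one unique value.
dims = [
    col for col in df.columns
    if col.startswith("config_") and df[col].nunique() > 1
]
fig = px.parallel_categories(df, dimensions=dims, color="mean_loss")
fig.show()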

sp.(log)finrange throws an error when sample(size=1)

This is caused by self._uniform_int.sample(spec, size=1, random_state) returning an int rather than an iterable, which in turn comes from this piece of code:

def _sanitize_sample_result(items, domain: Domain):
    if len(items) > 1:
        return [domain.cast(x) for x in items]
    else:
        return domain.cast(items[0])

import syne_tune.search_space as sp
fr = sp.finrange(1, 2, 2)
fr.sample(size=2)
> Out[4]: [1.0, 2.0]

fr.sample(size=1)
> Traceback (most recent call last):
>   ...
>   File "/Users/awgol/code/syne-tune/syne_tune/search_space.py", line 592, in sample
>     for x in self._uniform_int.sample(spec, size, random_state)]
> TypeError: 'int' object is not iterable
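A possible fix, sketched: normalize the sampling result to a list before iterating, so the size=1 case no longer hands a bare scalar to a caller that expects an iterable (names follow the snippet above; the surrounding method body is paraphrased):

def _as_list(items):
    # _sanitize_sample_result returns a bare scalar when size == 1;
    # normalize so that callers can always iterate.
    return items if isinstance(items, list) else [items]

# Inside the finrange sampling code (paraphrased):
#     values = _as_list(self._uniform_int.sample(spec, size, random_state))
#     return [... for x in values]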

Gracefully deal with SageMaker Failures

A SageMaker training job failed for some random reason, which seems to break the tuner:

File "/opt/conda/lib/python3.8/site-packages/syne_tune/tuner.py", line 152, in run
    new_done_trial_statuses, new_results = self._process_new_results(
  File "/opt/conda/lib/python3.8/site-packages/syne_tune/tuner.py", line 282, in _process_new_results
    done_trials_statuses = self._update_running_trials(trial_status_dict, new_results, callbacks=self.callbacks)
  File "/opt/conda/lib/python3.8/site-packages/syne_tune/tuner.py", line 437, in _update_running_trials
    assert trial_id in self.last_seen_result_per_trial, \
AssertionError: trial 35 completed and no metrics got observed

It would be great to retry jobs, or at least ignore the failure and continue somehow.
