
hpo's Introduction

Kruize HPO

Goal

Provide Kruize Hyper Parameter Optimization (HPO) to choose optimal values for the hyperparameters provided by the user, for any model.

Background

While talking to Kruize Autotune users, we came across a number of scenarios where Hyper Parameter Optimization would be useful outside of the Autotune (Kubernetes) context, including on bare metal and even in containerized but non-Kubernetes scenarios. This led us to separate the HPO part of Autotune into an independent service with a well-defined API, allowing this feature to be used much more broadly.

Motivation

Machine learning is the process of teaching a system to make accurate predictions based on the data it is fed. Hyperparameter optimization (or tuning) helps choose the right set of parameters for a learning algorithm. HPO can use different methods such as manual search, random search, grid search, and Bayesian optimization. Kruize HPO currently uses Bayesian optimization because of the multiple advantages it provides.

There are a number of Open Source modules / projects that provide hyperparameter optimization functions (e.g., Optuna, Hyperopt). Some modules are better suited to particular problems than others. However, every module has a different API and supports varied workflows. This repo provides a thin API layer (REST, gRPC) that simplifies choosing both the module / project and the specific algorithm. It hides the complexity of understanding individual HPO modules and their intricacies, while providing an easy-to-use interface that only requires the search space data to be provided in JSON format.

HPO Basics

What is HPO?

Hyperparameter optimization (HPO) is the process of choosing a set of optimal hyperparameters that yields optimal performance based on a predefined objective function.

Definitions

  • Search space: The list of tunables, with the ranges over which to optimize them.
  • Experiment: A set of trials to find the optimal set of tunable values for a given objective function.
  • Trials: Each trial is an execution of the objective function, obtained by running a benchmark / application with the configuration generated by Kruize HPO.
  • Objective function: Typically an algebraic expression that needs to be either maximized or minimized, e.g., maximize throughput or minimize cost.

Kruize HPO Architecture

The current architecture of Kruize HPO consists of a thin abstraction layer that provides a common REST API and gRPC interface. It provides an interface for integrating with Open Source projects / modules that provide HPO functionality; currently only the Optuna OSS project is supported. A simple HTTP server exposes the REST APIs.

Kruize HPO supports the following ways of deployment:

  • Bare Metal
  • Container
  • Kubernetes (Minikube / Openshift)

REST API

See the API README for more details on the Kruize HPO REST API.

Workflow of Kruize HPO

  • Step 1: Arrive at an objective function for your specific performance goal and capture it as a single algebraic expression.
  • Step 2: Capture the relevant tunables and the ranges within which they operate, and create a search space JSON with the tunable details.
  • Step 3: Start an experiment by POSTing the search space JSON to the URL mentioned in the REST API above. On success, this returns a “trial_number”.
  • Step 4: Query Kruize HPO for the “trial config” associated with the “trial_number”.
  • Step 5: Start a benchmark run with the “trial config”.
  • Step 6: POST the results of the trial back to Kruize HPO.
  • Step 7: Generate a subsequent trial.
  • Step 8: Loop through Steps 4 to 7 for the remaining trials of the experiment.
  • Step 9: Examine the results log to determine the best result for your experiment. (A minimal client sketch of this loop follows below.)
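To make the loop concrete, below is a minimal Python client sketch of Steps 3 to 8 against the REST API, using the /experiment_trials endpoint and the operation names that appear in the curl examples later on this page. The search space values, the run_benchmark stand-in, and parsing the response body as a bare trial number are illustrative assumptions, not the definitive client.

import requests

# Assumes a local native deployment listening on port 8085.
BASE = "http://localhost:8085/experiment_trials"

search_space = {
    "experiment_name": "exp1", "experiment_id": "a123",
    "total_trials": 5, "parallel_trials": 1, "value_type": "double",
    "hpo_algo_impl": "optuna_tpe",
    "objective_function": "transaction_response_time",
    "direction": "minimize",
    "tunables": [
        {"name": "memoryRequest", "value_type": "double",
         "lower_bound": 150, "upper_bound": 300, "step": 1},
        {"name": "cpuRequest", "value_type": "double",
         "lower_bound": 1, "upper_bound": 3, "step": 0.01},
    ],
}

def run_benchmark(config):
    # Step 5 stand-in: run your benchmark with the trial config and
    # return the measured objective value.
    return 98.7

# Step 3: start the experiment; on success the body is the first trial number.
resp = requests.post(BASE, json={"operation": "EXP_TRIAL_GENERATE_NEW",
                                 "search_space": search_space})
trial = int(resp.text)

for _ in range(search_space["total_trials"]):
    # Step 4: fetch the trial config for this trial number.
    config = requests.get(BASE, params={
        "experiment_name": search_space["experiment_name"],
        "trial_number": trial}).json()
    result = run_benchmark(config)
    # Step 6: POST the trial result back to Kruize HPO.
    requests.post(BASE, json={
        "experiment_name": search_space["experiment_name"],
        "trial_number": trial, "trial_result": "success",
        "result_value_type": "double", "result_value": result,
        "operation": "EXP_TRIAL_RESULT"})
    # Step 7: generate the subsequent trial (error handling omitted).
    trial = int(requests.post(BASE, json={
        "experiment_name": search_space["experiment_name"],
        "operation": "EXP_TRIAL_GENERATE_SUBSEQUENT"}).text)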

Supported Modules & Algorithms

Currently Kruize HPO supports only Optuna, an Open Source framework that implements many HPO algorithms. Here are a few of the algorithms supported through Optuna:

  • Optuna
    • TPE: Tree-structured Parzen Estimator sampler. (Default)
    • TPE with multivariate
    • optuna-scikit

The algorithms mentioned above support Bayesian optimization, which belongs to a class of sequential model-based optimization (SMBO) algorithms that use the results of previous trials to improve the next one.
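As an illustration of SMBO, here is a minimal, self-contained sketch of a TPE-driven loop written directly against Optuna, the library Kruize HPO builds on. The tunables mirror the search space examples elsewhere on this page; run_benchmark is a hypothetical stand-in for an actual trial run.

import optuna

def run_benchmark(config):
    # Stand-in for a real trial: return the measured objective value.
    return config["memoryRequest"] / 100 + config["cpuRequest"]

study = optuna.create_study(direction="minimize",
                            sampler=optuna.samplers.TPESampler())
for _ in range(5):  # total_trials
    trial = study.ask()  # TPE proposes the next config from past results
    config = {
        "memoryRequest": trial.suggest_float("memoryRequest", 150, 300, step=1),
        "cpuRequest": trial.suggest_float("cpuRequest", 1, 3, step=0.01),
    }
    study.tell(trial, run_benchmark(config))  # the result guides the next proposal

print(study.best_params, study.best_value)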

Installation

You can make use of the Kruize HPO Operate First instance without installing anything locally by running the following command:

$ ./deploy_hpo.sh -c operate-first

Alternatively, Kruize HPO can be installed natively on Linux, as a container, or in minikube / openshift:

  1. Native: $ ./deploy_hpo.sh -c native
  2. Container: $ ./deploy_hpo.sh -c docker
  3. Minikube: $ ./deploy_hpo.sh -c minikube
  4. Openshift: $ ./deploy_hpo.sh -c openshift

You can run a specific version of the Kruize HPO container: $ ./deploy_hpo.sh -c minikube -o image:tag

Operate First

We have deployed Kruize HPO on the Operate First community cloud, in the namespace 'openshift-tuning', to promote open operations. Operate First is a community of open source contributors, including developers, data scientists, and SREs, where developers and operators collaborate on a production community cloud to address operational considerations for their code and other artifacts. For more information on Operate First, please visit https://www.operate-first.cloud/

You can access HPO on Operate First by running the following command: $ ./deploy_hpo.sh -c operate-first

How to make use of Kruize HPO for my use case?

We would recommend that you start with the hpo_demo_setup.sh script and customize it for your use case.

Contributing

We welcome your contributions! See CONTRIBUTING.md for more details.

License

Apache License 2.0, see LICENSE.


hpo's Issues

Separate API for Objective function and function variables

The experiment trials API currently accepts a search space JSON which contains the objective function as well. The objective function need not be part of the search space and has to be cleaned up.

Also, we need a separate API that accepts the objective function and the function variables and returns the result after calculation. (A hypothetical sketch follows.)
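For illustration only, a hypothetical evaluator behind such an API might look like the sketch below; the restriction to basic arithmetic and the function shape are assumptions, not the proposed design.

import ast
import operator

OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
       ast.Mult: operator.mul, ast.Div: operator.truediv}

def evaluate(expression, variables):
    # e.g. evaluate("(1000 / throughput) + cost",
    #               {"throughput": 250.0, "cost": 4.2})
    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.BinOp):
            return OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.Name):
            return float(variables[node.id])
        if isinstance(node, ast.Constant):
            return float(node.value)
        raise ValueError("Unsupported element in objective function")
    return walk(ast.parse(expression, mode="eval"))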

CrashLoopBackoff when using ./deploy_hpo.sh -c openshift

Container failing to come up when using ./deploy_hpo.sh -c openshift

seeing in the logs of container:

Traceback (most recent call last):
  File "/home/hpo/app/src/service.py", line 18, in <module>
    import rest_service, grpc_service
  File "/home/hpo/app/src/rest_service.py", line 24, in <module>
    from json_validate import validate_trial_generate_json
  File "/home/hpo/app/src/json_validate.py", line 1, in <module>
    import jsonschema
ModuleNotFoundError: No module named 'jsonschema'

Speed up CI tests

By slightly re-architecting the testsuite, I have reduced the time taken to run the testsuite in CI from
a) sanity-tests: 13min -> 1min
b) full testsuite: ~55min -> ~2min

Part of this work requires the PR to delete experiments (#57). Another speed-up is obtained by splitting the Docker build into a multi-stage build: a first stage builds a base image with all the required libraries etc., and a second stage copies the service files into a second image. This reduces the Docker build time in CI from 8min 42s to 33s, as the base image does not need to be rebuilt for every test. The only downside is that a separate process needs to build and push the base image to quay.io. The base image should only need to change for component upgrades.

Before I open a PR, does anyone have an opinion on these changes?

Incorrect validation error for property data type mismatch in the search space JSON

The expected data type for the experiment_id property is string. However, if an integer or double value is provided in the search space (e.g., in the JSON below), it results in a validation error with the message "Parameters cannot be empty or null!". The correct validation error message should indicate that the data type of "experiment_id" should be a string, i.e., "Parameter 'experiment_id' 123 is not of type 'string'". (A jsonschema sketch reproducing the expected message follows the JSON.)

{
  "operation": "EXP_TRIAL_GENERATE_NEW",
  "search_space": {
    "experiment_name": "exp1",
    "experiment_id": 123,
    "total_trials": 5 ,
    "parallel_trials": 1,
    "value_type": "double",
    "hpo_algo_impl": "optuna_tpe",
    "objective_function": "transaction_response_time",
    "tunables": [
      {
        "value_type": "double",
        "lower_bound": 150,
        "name": "memoryRequest",
        "upper_bound": 300,
        "step": 1
      },
      {
        "value_type": "double",
        "lower_bound": 1,
        "name": "cpuRequest",
        "upper_bound": 3,
        "step": 0.01
      }
    ],
    "direction": "minimize"
  }
}
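For reference, the jsonschema library (which the service already imports, per the CrashLoopBackoff issue above) exposes the failing instance, so the expected message can be built directly; the schema fragment and message formatting below are illustrative.

import jsonschema

schema = {
    "type": "object",
    "properties": {"experiment_id": {"type": "string"}},
}
try:
    jsonschema.validate(instance={"experiment_id": 123}, schema=schema)
except jsonschema.ValidationError as err:
    # err.instance is 123; err.message is "123 is not of type 'string'"
    print("Parameter 'experiment_id' %s is not of type 'string'" % err.instance)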

Include validation checks for invalid values in grpc service

We need to include validation checks for invalid values in the grpc service. For example, posting an experiment with an empty name or no name doesn't give any error message.

python ../../src/grpc_client.py new --file empty-name.json 
 Adding new experiment:  
Trial Number: 0

python ../../src/grpc_client.py new --file no-name.json 
 Adding new experiment: 
Trial Number: 0

HPO doesn't start in native with only the REST service, as it still looks for gRPC.

Description:
When HPO is started with only the REST service, it still tries to import gRPC and fails to start.

./deploy_hpo.sh -c native --rest

###   Installing HPO as a native App


### Installing dependencies..........


### Starting the service...

Traceback (most recent call last):
  File "src/service.py", line 18, in <module>
    import rest_service, grpc_service
  File "/root/kchalasa/kruize-demos/checkHPO/hpo/src/grpc_service.py", line 22, in <module>
    import grpc
ModuleNotFoundError: No module named 'grpc'

HPO doesn't throw any error on posting the experiment result with invalid values

HPO doesn't throw any error on posting the experiment result with invalid values for trial_result, result_value, result_value_type in the experiment results JSON

To recreate the issue:

  • Run ./deploy_hpo.sh -c native
  • Post an experiment using the below curl command
curl -s -H 'Content-Type: application/json' http://localhost:8085/experiment_trials -d '{"operation":"EXP_TRIAL_GENERATE_NEW","search_space":{"experiment_name":"petclinic-sample-2-75884c5549-npvgd","total_trials":5,"parallel_trials":1,"experiment_id":"a123","value_type":"double","hpo_algo_impl":"optuna_tpe","objective_function":"transaction_response_time","tunables":[{"value_type":"double","lower_bound":150,"name":"memoryRequest","upper_bound":300,"step":1},{"value_type":"double","lower_bound":1,"name":"cpuRequest","upper_bound":3,"step":0.01}],"slo_class":"response_time","direction":"minimize"}}'
  • Post the experiment result json using the below curl command
curl -s -H 'Content-Type: application/json' http://localhost:8085/experiment_trials -d '{"experiment_name" : "petclinic-sample-2-75884c5549-npvgd", "trial_number": 0, "trial_result": "xyz", "result_value_type": "double", "result_value": 98.78, "operation" : "EXP_TRIAL_RESULT"}'

HPO service does not shutdown gracefully

At present, when users attempt to shut down a running service with a SIGINT signal, the process does not exit cleanly. The current behaviour:

  • Requires 2 SIGINT signals to be sent to the process,
  • Prints a stack trace from an unhandled KeyboardInterrupt error:
Traceback (most recent call last):
  File "/working/projects/redHat/autotune/hpo/src/service.py", line 31, in <module>
    main()
  File "/working/projects/redHat/autotune/hpo/src/service.py", line 27, in main
    restService.join()
  File "/usr/lib64/python3.10/threading.py", line 1089, in join
    self._wait_for_tstate_lock()
  File "/usr/lib64/python3.10/threading.py", line 1109, in _wait_for_tstate_lock
    if lock.acquire(block, timeout):
KeyboardInterrupt

When the HPO process receives a single SIGINT signal, it should shut down cleanly, without displaying stack traces. (A sketch of one possible fix follows.)
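A minimal sketch of one possible fix, assuming the main thread currently blocks in a bare join() as the traceback suggests; the thread and handler names here are illustrative, not the actual service code.

import signal
import threading

shutdown = threading.Event()

def handle_sigint(signum, frame):
    # First SIGINT requests a clean stop instead of raising KeyboardInterrupt.
    shutdown.set()

signal.signal(signal.SIGINT, handle_sigint)

rest_thread = threading.Thread(target=shutdown.wait, daemon=True)
rest_thread.start()

# Join with a timeout so the signal handler gets a chance to run and the
# process can exit after a single SIGINT, without a stack trace.
while rest_thread.is_alive() and not shutdown.is_set():
    rest_thread.join(timeout=0.5)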

Operator for HPO

Currently HPO does not have an operator. This issue tracks the feature enhancement of including an Operator.

Get config always returns the current config generated irrespective of the trial number

Get config always returns the last config generated, irrespective of the trial number, in the case of the gRPC service. With the REST service, get config returns -1 for all trial numbers except the current trial, for which it shows the config.

To recreate follow the below steps:

  • Deploy HPO service - ./deploy_hpo.sh -c native
  • Post a new experiment - python grpc_client.py new --file=../tests/resources/searchspace_jsons/newExperiment.json
  • Get the config using - python grpc_client.py config --name petclinic-sample-2-75884c5549-npvgd --trial 0
  • Post the result - python grpc_client.py result --name petclinic-sample-2-75884c5549-npvgd --trial 0 --result SUCCESS --value_type double --value 5745.33
  • Post the next experiment - python grpc_client.py next --name petclinic-sample-2-75884c5549-npvgd
  • Get the config using both trial numbers or any random trial number that has not been generated so far
    python grpc_client.py config --name petclinic-sample-2-75884c5549-npvgd --trial 10
I have posted the results for trial 0 and trial 1; trial 2 is the current config

(base) [csubrama@csubrama src]$ python grpc_client.py config --name petclinic-sample-2-75884c5549-npvgd --trial 2
{
  "config": [
    {
      "name": "memoryRequest",
      "value": 209.0
    },
    {
      "name": "cpuRequest",
      "value": 1.89
    }
  ]
}
(base) [csubrama@csubrama src]$ python grpc_client.py config --name petclinic-sample-2-75884c5549-npvgd --trial 0
{
  "config": [
    {
      "name": "memoryRequest",
      "value": 209.0
    },
    {
      "name": "cpuRequest",
      "value": 1.89
    }
  ]
}
(base) [csubrama@csubrama src]$ python grpc_client.py config --name petclinic-sample-2-75884c5549-npvgd --trial 1
{
  "config": [
    {
      "name": "memoryRequest",
      "value": 209.0
    },
    {
      "name": "cpuRequest",
      "value": 1.89
    }
  ]
}
(base) [csubrama@csubrama src]$ python grpc_client.py config --name petclinic-sample-2-75884c5549-npvgd --trial 10
{
  "config": [
    {
      "name": "memoryRequest",
      "value": 209.0
    },
    {
      "name": "cpuRequest",
      "value": 1.89
    }
  ]
}

Standard Frameworks

I wanted to ask some questions about the architecture of the project in general. Is there a reason to not use standard python frameworks/test runners/build tools for HPO?

For example, rather than hand-crafting a REST endpoint based on an HTTP server, would it make sense to use something like Django Rest/Flask?

Similarly with unit/integration tests, would it make sense to use a standard python library for creating and running tests?
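Purely as an illustration of the Flask suggestion above, the existing /experiment_trials POST handler re-expressed in Flask might look like the sketch below; the dispatch logic is elided and this is not the actual code.

from flask import Flask, request

app = Flask(__name__)

@app.route("/experiment_trials", methods=["POST"])
def experiment_trials():
    payload = request.get_json()
    if payload.get("operation") == "EXP_TRIAL_GENERATE_NEW":
        # ... delegate to the existing experiment-creation logic ...
        return "0", 200  # first trial number
    return "Invalid operation", 400

if __name__ == "__main__":
    app.run(port=8085)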

Failures in full testsuite

As I was experimenting for #58 I noticed there are a few failures when running the full testsuite, e.g.

########### Results Summary of the test suite hpo_api_tests ##########
hpo_api_tests took 2562 seconds
Number of tests performed 56
Number of tests passed 42
Number of tests failed 14

~~~~~~~~~~~~~~~~~~~~~~~ hpo_api_tests failed ~~~~~~~~~~~~~~~~~~~~~~~~~~
Failed cases are :
		  invalid-id
		  empty-id
		  invalid-searchspace
		  empty-name
		  null-name
		  invalid-trial-result
		  empty-trial-result
		  null-trial-result
		  invalid-result-value-type
		  empty-result-value-type
		  null-result-value-type
		  invalid-result-value
		  null-result-value
		  additional-field

Is this something that someone is currently investigating?

Onboard Kruize HPO service in OperateFirst cloud.

Goal: Make HPO as a Service available as part of the Operate First cloud, where it will be offered to the community cloud as an open service.

Kruize HPOaaS provides an interface for integrating with Open Source projects / modules that provide HPO functionality.

HPO for Openshift:

  • Status : Deployed and tested the app on Openshift using QuickLab.

TO DO:

  • Creating a namespace: adding a namespace, an OCP group, and a resource quota on the namespace, and giving the group access to this namespace.
  • Creating a PR and sharing this namespace
  • Onboarding doc link for Operate First

Experiment/Trial state is ephemeral

From discussion on #39

"In optuna we aren't using persistent storage at present"

At present all experiment / trial config data is stored in memory and not in any form of persistent storage. This creates a number of challenges:

  1. If HPO is running as a service (esp. in a K8s environment) any containers will be ephemeral. K8s can decide to restart pods, in which case current experiments would be lost and would have to be restarted.
  2. There is no persistent history of experiments, trials or results.
  3. There is no ability to restart an experiment that was already under way. This is currently difficult due to the synchronous architecture of the optuna library.

Are there plans to introduce persistent storage of experiment/trial data?
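For what it's worth, Optuna itself offers an RDB storage backend that would address all three points; a minimal sketch, with an illustrative SQLite URL (a database server would be more appropriate inside Kubernetes):

import optuna

study = optuna.create_study(
    study_name="petclinic-sample-2-75884c5549-npvgd",
    storage="sqlite:///hpo_experiments.db",  # illustrative; persists trials
    direction="minimize",
    load_if_exists=True,  # resume the same study after a pod restart
)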

Create SCC for running HPO on Openshift in restricted manner

Currently HPO has issues while running on Openshift due to the additional restrictions Openshift imposes on running applications. We need to add a security policy to deploy our app successfully.

Openshift has a Security Context Constraint (SCC) feature which needs to be implemented for this purpose.

Ownership of app/src folder is not hpo user.

Description:
The folders in /home/hpo are owned by root, although they are copied as the hpo user. (A possible Dockerfile fix is sketched after the listing below.)

[hpo@da33248ee069 app]$ ls -lrt
total 8
-rw-rw-r--. 1 root root 1330 Apr 12 09:43 index.html
-rw-rw-r--. 1 root root   73 Aug  5 10:28 requirements.txt
drwxr-xr-x. 5 root root   57 Aug 23 06:17 src
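One possible fix, assuming the image currently uses a plain COPY, is to set ownership at copy time; the paths are taken from the listing above and the exact Dockerfile lines are illustrative.

# Illustrative Dockerfile change: copy sources owned by the hpo user, not root.
COPY --chown=hpo:hpo src/ /home/hpo/app/src/
COPY --chown=hpo:hpo requirements.txt index.html /home/hpo/app/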

Remove SCC dependency for Openshift deployment

Currently we're using an SCC yaml to deploy the HPO application on Openshift successfully. This requires special privileges when we deploy it on other cloud platforms like Operate First.
We need to make changes in the Dockerfile to remove the dependency on the SCC, which in turn will ease our deployment on other platforms.

Logging infra for HPO

  1. Currently we do not have support for different logging levels in HPO.
  2. Also add support for the user to change logging levels from the deploy script (see the sketch after this list).
  3. The format of logged messages needs to be fixed (refer to the Autotune log message format).
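A minimal sketch of items 1 and 2, assuming the deploy script exports an environment variable (HPO_LOG_LEVEL is an assumed name); the format string matches the service log lines shown elsewhere on this page.

import logging
import os

logging.basicConfig(
    level=os.environ.get("HPO_LOG_LEVEL", "INFO"),
    format="%(asctime)s - %(levelname)s - %(name)s - %(message)s",
    datefmt="%Y-%m-%d %H:%M:%S",
)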

HPO doesn't throw any error on passing an empty string

Passing an empty string in some of the fields (experiment_id, trial_result, result_value_type) of the HPO experiment or results JSON doesn't produce any error message.

To recreate the issue:

  • Run ./deploy_hpo.sh -c native
  • Post the experiment json using the curl command
  • Below is an example curl command to post an experiment with empty string for experiment_id
curl -s -H 'Content-Type: application/json' http://localhost:8085/experiment_trials -d '{"operation":"EXP_TRIAL_GENERATE_NEW","search_space":{"experiment_name":"petclinic-sample-2-75884c5549-npvgd","total_trials":5,"parallel_trials":1,"experiment_id":" ","value_type":"double","hpo_algo_impl":"optuna_tpe","objective_function":"transaction_response_time","tunables":[{"value_type":"double","lower_bound":150,"name":"memoryRequest","upper_bound":300,"step":1},{"value_type":"double","lower_bound":1,"name":"cpuRequest","upper_bound":3,"step":0.01}],"slo_class":"response_time","direction":"minimize"}}'

Posting experiment result for an ongoing trial failed

Posting experiment result for an ongoing trial failed with the below error when experiments are run in parallel

*********************************** Experiment petclinic-sample-87 and trial_number 3 *************************************

Generate the config for experiment 87 and trial 3...

command used to query the experiment_trial API = curl -s -H 'Accept: application/json' 'http://worker001-r630:31645/experiment_trials?experiment_name=petclinic-sample-87&trial_number=3' -w '\n%{http_code}'
[{"tunable_name": "memoryRequest", "tunable_value": 189.0}, {"tunable_name": "cpuRequest", "tunable_value": 2.39}]

Post the experiment result for experiment petclinic-sample-87 and trial 3...


Command used to post the experiment result= curl -s -H 'Content-Type: application/json' http://worker001-r630:31645/experiment_trials -d {"experiment_name":"petclinic-sample-87","trial_number":3,"trial_result":"success","result_value_type":"double","result_value":98.7,"operation":"EXP_TRIAL_RESULT"} -w '\n%{http_code}'

Requested trial exceeds the completed trial limit!
400
Response is Requested trial exceeds the completed trial limit!
http_code = 400 response = Requested trial exceeds the completed trial limit!
Post experiment result for experiment petclinic-sample-87 and trial 3 failed - http_code is not as expected, http_code = 400 expected code = 200

Generate subsequent config for experiment petclinic-sample-87 after trial 3 ...


Curl command used to post the experiment = curl -s -H 'Content-Type: application/json' http://worker001-r630:31645/experiment_trials -d '{"experiment_name":"petclinic-sample-87","operation":"EXP_TRIAL_GENERATE_SUBSEQUENT"}'  -w '\n%{http_code}'

4
200
Response is 4
http_code is 200 Response is 4

Fix package versions to avoid compatibility issue

Currently we're only specifying a fixed version for the protobuf package in our requirements file, i.e. 4.21.8. We faced issues while running the gRPC service in Docker because it pulled the latest grpcio package, which is incompatible with protobuf version 4.21.8.

So, to fix these issues permanently, we need to pin the versions of all the packages in the requirements file, along with the compatible Python version. (An illustrative sketch follows.)
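An illustrative shape for the pinned requirements file; every version below other than protobuf's is a placeholder to be replaced with releases verified against protobuf 4.21.8, not a recommendation.

# requirements.txt sketch: pin every package, not only protobuf.
protobuf==4.21.8
grpcio==<verified-compatible-version>
jsonschema==<verified-compatible-version>
optuna==<verified-compatible-version>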

HPO negative tests failures on Openshift

Post experiment tests failed due to the expected log messages not being present in service.log

Failed cases are :
empty-id
null-id
empty-name
no-name
null-name
no-operation
additional-field
generate-subsequent
invalid-searchspace
post_duplicate_experiments

plot API fails to generate plots and defaults to optimization_history when only exp_name is specified

When only the experiment name is specified, the type doesn't default to tunables_importance and no plots are generated.

2022-08-23 16:09:53 - INFO - rest_service - Total Trials = 5
2022-08-23 16:09:53 - INFO - rest_service - Parallel Trials = 1
2022-08-23 16:09:53 - INFO - rest_service - Starting Experiment: petclinic-sample-2-75884c5549-npvgd
127.0.0.1 - - [23/Aug/2022 16:09:53] "POST /experiment_trials HTTP/1.1" 200 -
2022-08-23 16:09:53 - INFO - rest_service - Experiment_Name = petclinic-sample-2-75884c5549-npvgd
2022-08-23 16:09:53 - INFO - rest_service - Trial_Number = 0
127.0.0.1 - - [23/Aug/2022 16:09:53] "GET /experiment_trials?experiment_name=petclinic-sample-2-75884c5549-npvgd&trial_number=0 HTTP/1.1" 200 -
127.0.0.1 - - [23/Aug/2022 16:09:54] "POST /experiment_trials HTTP/1.1" 200 -
127.0.0.1 - - [23/Aug/2022 16:09:54] "POST /experiment_trials HTTP/1.1" 200 -
2022-08-23 16:09:54 - INFO - rest_service - Experiment_Name = petclinic-sample-2-75884c5549-npvgd
2022-08-23 16:09:54 - INFO - rest_service - Trial_Number = 1
127.0.0.1 - - [23/Aug/2022 16:09:54] "GET /experiment_trials?experiment_name=petclinic-sample-2-75884c5549-npvgd&trial_number=1 HTTP/1.1" 200 -
127.0.0.1 - - [23/Aug/2022 16:09:54] "POST /experiment_trials HTTP/1.1" 200 -
127.0.0.1 - - [23/Aug/2022 16:09:54] "POST /experiment_trials HTTP/1.1" 200 -
2022-08-23 16:09:54 - INFO - rest_service - Experiment_Name = petclinic-sample-2-75884c5549-npvgd
2022-08-23 16:09:54 - INFO - rest_service - Trial_Number = 2
127.0.0.1 - - [23/Aug/2022 16:09:54] "GET /experiment_trials?experiment_name=petclinic-sample-2-75884c5549-npvgd&trial_number=2 HTTP/1.1" 200 -
127.0.0.1 - - [23/Aug/2022 16:09:55] "POST /experiment_trials HTTP/1.1" 200 -
127.0.0.1 - - [23/Aug/2022 16:09:55] "POST /experiment_trials HTTP/1.1" 200 -
2022-08-23 16:09:55 - INFO - rest_service - Experiment_Name = petclinic-sample-2-75884c5549-npvgd
2022-08-23 16:09:55 - INFO - rest_service - Trial_Number = 3
127.0.0.1 - - [23/Aug/2022 16:09:55] "GET /experiment_trials?experiment_name=petclinic-sample-2-75884c5549-npvgd&trial_number=3 HTTP/1.1" 200 -
127.0.0.1 - - [23/Aug/2022 16:09:55] "POST /experiment_trials HTTP/1.1" 200 -
127.0.0.1 - - [23/Aug/2022 16:09:55] "POST /experiment_trials HTTP/1.1" 200 -
2022-08-23 16:09:55 - INFO - rest_service - Experiment_Name = petclinic-sample-2-75884c5549-npvgd
2022-08-23 16:09:55 - INFO - rest_service - Trial_Number = 4
127.0.0.1 - - [23/Aug/2022 16:09:55] "GET /experiment_trials?experiment_name=petclinic-sample-2-75884c5549-npvgd&trial_number=4 HTTP/1.1" 200 -
127.0.0.1 - - [23/Aug/2022 16:09:56] "POST /experiment_trials HTTP/1.1" 200 -
2022-08-23 16:09:56 - INFO - bayes_optuna.optuna_hpo - BEST PARAMETER: {'memoryRequest': 202.0, 'cpuRequest': 1.78}
2022-08-23 16:09:56 - INFO - bayes_optuna.optuna_hpo - BEST VALUE: 98.7
2022-08-23 16:09:56 - INFO - bayes_optuna.optuna_hpo - BEST TRIAL: FrozenTrial(number=0, values=[98.7], datetime_start=datetime.datetime(2022, 8, 23, 16, 9, 53, 648072), datetime_complete=datetime.datetime(2022, 8, 23, 16, 9, 54, 136858), params={'memoryRequest': 202.0, 'cpuRequest': 1.78}, distributions={'memoryRequest': DiscreteUniformDistribution(high=300.0, low=150.0, q=1.0), 'cpuRequest': DiscreteUniformDistribution(high=3.0, low=1.0, q=0.01)}, user_attrs={}, system_attrs={}, intermediate_values={}, trial_id=0, state=TrialState.COMPLETE, value=None)
2022-08-23 16:09:56 - WARNING - bayes_optuna.optuna_hpo - Experiment stopped: petclinic-sample-2-75884c5549-npvgd
2022-08-23 16:09:56 - INFO - rest_service - Plot type not defined. Defaulting it to optimization_history
127.0.0.1 - - [23/Aug/2022 16:09:56] "GET /plot?experiment_name="petclinic-sample-2-75884c5549-npvgd" HTTP/1.1" 404 -
2022-08-23 16:09:56 - ERROR - rest_service - Experiment not found!
127.0.0.1 - - [23/Aug/2022 16:09:56] "GET /plot?experiment_name="petclinic-sample-2-75884c5549-npvgd" HTTP/1.1" 400 -

Subsequent trial is not generated at times when running experiments in parallel

A subsequent trial is not generated at times when running experiments in parallel. This was observed while running 10 experiments in parallel.

*********************************** Experiment petclinic-sample-10 and trial_number 0 *************************************

Generate the config for experiment 10 and trial 0...

command used to query the experiment_trial API = curl -s -H 'Accept: application/json' 'http://worker001-r630:32600/experiment_trials?experiment_name=petclinic-sample-10&trial_number=0' -w '\n%{http_code}'
[{"tunable_name": "memoryRequest", "tunable_value": 292.0}, {"tunable_name": "cpuRequest", "tunable_value": 2.6500000000000004}]

Post the experiment result for experiment petclinic-sample-10 and trial 0...


Command used to post the experiment result= curl -s -H 'Content-Type: application/json' http://worker001-r630:32600/experiment_trials -d {"experiment_name":"petclinic-sample-10","trial_number":0,"trial_result":"success","result_value_type":"double","result_value":98.7,"operation":"EXP_TRIAL_RESULT"} -w '\n%{http_code}'

Result posted successfully!
200
Response is Result posted successfully!
http_code = 200 response = Result posted successfully!

Generate subsequent config for experiment petclinic-sample-10 after trial 0 ...


Curl command used to post the experiment = curl -s -H 'Content-Type: application/json' http://worker001-r630:32600/experiment_trials -d '{"experiment_name":"petclinic-sample-10","operation":"EXP_TRIAL_GENERATE_SUBSEQUENT"}'  -w '\n%{http_code}'

0
200
Response is 0
http_code is 200 Response is 0

*********************************** Experiment petclinic-sample-10 and trial_number 1 *************************************

Generate the config for experiment 10 and trial 1...

command used to query the experiment_trial API = curl -s -H 'Accept: application/json' 'http://worker001-r630:32600/experiment_trials?experiment_name=petclinic-sample-10&trial_number=1' -w '\n%{http_code}'
Requested trial exceeds the completed trial limit!
Get config from hpo for experiment petclinic-sample-10 and trial 1 failed - http_code is not as expected, http_code = 400 expected code = 200

Terminating HPO in minikube shouldn't require setting environment variables for secret creation.

Description:
When terminating HPO in minikube, it asks for environment variables to be set, which should not be required.

 ./deploy_hpo.sh -c minikube -t
You need to set the environment variables first for Kubernetes secret creation

Usage:
 -a | --non_interactive: interactive (default)
 -s | --start: start(default) the app
 -t | --terminate: terminate the app
 -c | --cluster_type: cluster type [docker|minikube|native|openshift]]
 -o | --container_image: build with specific hpo container image name [Default - kruize/hpo:<version>]
 -n | --namespace : Namespace to which hpo is deployed [Default - monitoring namespace for cluster type minikube]
 -d | --configmaps_dir : Config maps directory [Default - manifests/configmaps]
 --both: install both REST and the gRPC service
 --rest: install REST only
 Environment Variables to be set: REGISTRY, REGISTRY_EMAIL, REGISTRY_USERNAME, REGISTRY_PASSWORD
 [Example - REGISTRY: docker.io, quay.io, etc]

Unhandled exception when the status of all the trials in the experiment is prune

Below is the exception observed when the status of all the trials in the experiment is 'prune' (a guard sketch follows the traceback):

Exception in thread Thread-1:
Traceback (most recent call last):
  File "/usr/lib64/python3.6/threading.py", line 919, in _bootstrap_inner
    self.run()
  File "/usr/lib64/python3.6/threading.py", line 867, in run
    self._target(*self._args, **self._kwargs)
  File "/home/kchalasa/Desktop/em-hpo-scripts/hpo/hpo-l/autotune-demo/hpo/src/bayes_optuna/optuna_hpo.py", line 145, in recommend
    logger.info("Best parameter: " + str(study.best_params))
  File "/home/kchalasa/.local/lib/python3.6/site-packages/optuna/study/study.py", line 60, in best_params
    return self.best_trial.params
  File "/home/kchalasa/.local/lib/python3.6/site-packages/optuna/study/study.py", line 97, in best_trial
    return copy.deepcopy(self._storage.get_best_trial(self._study_id))
  File "/home/kchalasa/.local/lib/python3.6/site-packages/optuna/storages/_in_memory.py", line 311, in get_best_trial
    raise ValueError("No trials are completed yet.")
ValueError: No trials are completed yet.
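A minimal sketch of a guard for this case, checking for completed trials before touching study.best_params; the function shape is illustrative, not the actual recommend() code.

from optuna.trial import TrialState

def log_best(study, logger):
    completed = [t for t in study.trials if t.state == TrialState.COMPLETE]
    if not completed:
        logger.warning("No trials completed (all pruned); skipping best params")
        return
    logger.info("Best parameter: %s", study.best_params)
    logger.info("Best value: %s", study.best_value)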

HPO doesn't throw any error message on passing null for result_value

HPO doesn't throw any error on posting the experiment result with trial_result, result_value, result_value_type being null in the experiment results JSON

To recreate the issue:

  • Run ./deploy_hpo.sh -c native
  • Post an experiment using the below curl command
curl -s -H 'Content-Type: application/json' http://localhost:8085/experiment_trials -d '{"operation":"EXP_TRIAL_GENERATE_NEW","search_space":{"experiment_name":"petclinic-sample-2-75884c5549-npvgd","total_trials":5,"parallel_trials":1,"experiment_id":"a123","value_type":"double","hpo_algo_impl":"optuna_tpe","objective_function":"transaction_response_time","tunables":[{"value_type":"double","lower_bound":150,"name":"memoryRequest","upper_bound":300,"step":1},{"value_type":"double","lower_bound":1,"name":"cpuRequest","upper_bound":3,"step":0.01}],"slo_class":"response_time","direction":"minimize"}}'
  • Post the experiment result json using the below curl command
curl -s -H 'Content-Type: application/json' http://localhost:8085/experiment_trials -d '{"experiment_name" : "petclinic-sample-2-75884c5549-npvgd", "trial_number": 0, "trial_result": null, "result_value_type": "double", "result_value": 98.78, "operation" : "EXP_TRIAL_RESULT"}'

optuna: n_jobs is getting deprecated

It appears that with the latest Optuna, parallelism within an experiment (viz. n_jobs) is being deprecated. This issue is to explore the implications and research alternatives. (One documented alternative is sketched below.)
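One documented Optuna alternative is process-level parallelism over a shared storage backend instead of thread-level n_jobs; a minimal sketch follows (the study name and storage URL are illustrative, and an RDB server rather than SQLite is advised for real parallel runs).

import optuna

def objective(trial):
    x = trial.suggest_float("x", -10, 10)  # toy objective for illustration
    return x * x

# Each worker process loads the same study and contributes trials to it.
study = optuna.load_study(study_name="shared-study",
                          storage="sqlite:///hpo_experiments.db")
study.optimize(objective, n_trials=10)  # launch this script N times in parallel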

Recommendation API

Add an API that returns the best / recommended config. This should return the "current" best during the experiment run and the "absolute" best after all the trials complete.

GRPC sanity tests fail on minikube / openshift

GRPC sanity tests fail on minikube with the below error on the client side

Posting a new experiment...
Adding new experiment: petclinic-sample-2-75884c5549-npvgd
Error: An error occurred executing command: failed to connect to all addresses
Post new experiment failed!

From the service log:

2022-07-15 09:12:26 - INFO - hpo-service - Starting HPO service
2022-07-15 09:12:26 - INFO - rest_service - Access server at http://localhost:8085
2022-07-15 09:12:26 - INFO - grpc_service - Starting gRPC server at http://0.0.0.0:50051

I have exported HPO_HOST & PORT to the minikube IP & port number of the HPO service. Not sure if I'm missing anything here or if any changes are required to the grpc service.

Shouldn't the IP & port number in the INFO statements in the service log display the cluster IP and port?

HPO service needs to be exposed

Expose the HPO service so that it can be accessed outside the cluster using a route rather than the port.

oc expose svc/< hpo service > -n < namespace >
