
hpo's Introduction

Kruize HPO

Goal

Provide Kruize Hyper Parameter Optimization (HPO) to choose optimal values for the hyperparameters provided by the user, for any model.

Background

While talking to Kruize Autotune users, we came across a number of scenarios where Hyper Parameter Optimization would be useful outside of the Autotune (Kubernetes) context, including on bare metal and even in containerized but non-Kubernetes scenarios. This led us to separate the HPO part of Autotune into an independent service with a well-defined API, allowing this feature to be used much more broadly.

Motivation

Machine learning is the process of teaching a system to make accurate predictions based on the data it is fed. Hyperparameter optimization (or tuning) helps choose the right set of parameters for a learning algorithm. HPO can use different methods such as manual search, random search, grid search, and Bayesian optimization. Kruize HPO currently uses Bayesian optimization because of the multiple advantages it provides.

There are a number of Open Source modules / projects that provide hyperparameter optimization functions (e.g., Optuna, Hyperopt). Some modules are better suited to particular problems than others. However, every module has a different API and supports varied workflows. This repo provides a thin API layer (REST, gRPC) that simplifies choosing both the module / project and the specific algorithm. It hides the complexity of understanding individual HPO modules and their intricacies, while providing an easy-to-use interface that only requires the search space data to be provided in JSON format.

HPO Basics

What is HPO?

Hyperparameter optimization (HPO) is the process of choosing a set of optimal hyperparameters that yields optimal performance based on a predefined objective function.

Definitions

  • Search space: The list of tunables, with the ranges over which to optimize them.
  • Experiment: A set of trials to find the optimal set of tunable values for a given objective function.
  • Trials: Each trial is an execution of the objective function, obtained by running a benchmark / application with the configuration generated by Kruize HPO.
  • Objective function: Typically an algebraic expression that needs to be either maximized or minimized, e.g., maximize throughput or minimize cost.

Kruize HPO Architecture

The current architecture of Kruize HPO consists of a thin abstraction layer that provides a common REST API and gRPC interface. It provides an interface for integrating with Open Source projects / modules that provide HPO functionality; currently only the Optuna OSS project is supported. A simple HTTP server exposes the REST APIs.

Kruize HPO supports the following ways of deployment:

  • Bare Metal
  • Container
  • Kubernetes (Minikube / Openshift)

REST API

See the API README for more details on the Kruize HPO REST API.

Workflow of Kruize HPO

  • Step 1: Arrive at an objective function for your specific performance goal and capture it as a single algebraic expression.
  • Step 2: Capture the relevant tunables and the ranges within which they operate, and create a search space JSON with the tunable details.
  • Step 3: Start an experiment by POSTing the search space JSON to the URL mentioned in the REST API above. On success, this returns a “trial_number”.
  • Step 4: Query Kruize HPO for the “trial config” associated with the “trial_number”.
  • Step 5: Start a benchmark run with the “trial config”.
  • Step 6: POST the results of the trial back to Kruize HPO.
  • Step 7: Generate a subsequent trial.
  • Step 8: Loop through Steps 4 to 7 for the remaining trials of the experiment.
  • Step 9: Examine the results log to determine the best result for your experiment. (A minimal client sketch of this loop follows below.)
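To make the loop concrete, below is a minimal Python client sketch of Steps 3 to 8 against the REST API, using the /experiment_trials endpoint and the operation names that appear in the curl examples later on this page. The search space values, the run_benchmark stand-in, and parsing the response body as a bare trial number are illustrative assumptions, not the definitive client.

import requests

# Assumes a local native deployment listening on port 8085.
BASE = "http://localhost:8085/experiment_trials"

search_space = {
    "experiment_name": "exp1", "experiment_id": "a123",
    "total_trials": 5, "parallel_trials": 1, "value_type": "double",
    "hpo_algo_impl": "optuna_tpe",
    "objective_function": "transaction_response_time",
    "direction": "minimize",
    "tunables": [
        {"name": "memoryRequest", "value_type": "double",
         "lower_bound": 150, "upper_bound": 300, "step": 1},
        {"name": "cpuRequest", "value_type": "double",
         "lower_bound": 1, "upper_bound": 3, "step": 0.01},
    ],
}

def run_benchmark(config):
    # Step 5 stand-in: run your benchmark with the trial config and
    # return the measured objective value.
    return 98.7

# Step 3: start the experiment; on success the body is the first trial number.
resp = requests.post(BASE, json={"operation": "EXP_TRIAL_GENERATE_NEW",
                                 "search_space": search_space})
trial = int(resp.text)

for _ in range(search_space["total_trials"]):
    # Step 4: fetch the trial config for this trial number.
    config = requests.get(BASE, params={
        "experiment_name": search_space["experiment_name"],
        "trial_number": trial}).json()
    result = run_benchmark(config)
    # Step 6: POST the trial result back to Kruize HPO.
    requests.post(BASE, json={
        "experiment_name": search_space["experiment_name"],
        "trial_number": trial, "trial_result": "success",
        "result_value_type": "double", "result_value": result,
        "operation": "EXP_TRIAL_RESULT"})
    # Step 7: generate the subsequent trial (error handling omitted).
    trial = int(requests.post(BASE, json={
        "experiment_name": search_space["experiment_name"],
        "operation": "EXP_TRIAL_GENERATE_SUBSEQUENT"}).text)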

Supported Modules & Algorithms

Currently Kruize HPO supports only Optuna, an Open Source framework that implements many HPO algorithms. Here are a few of the algorithms supported through Optuna:

  • Optuna
    • TPE: Tree-structured Parzen Estimator sampler. (Default)
    • TPE with multivariate
    • optuna-scikit

The algorithms mentioned above support Bayesian optimization, which belongs to a class of sequential model-based optimization (SMBO) algorithms that use the results of previous trials to improve the next one.
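As an illustration of SMBO, here is a minimal, self-contained sketch of a TPE-driven loop written directly against Optuna, the library Kruize HPO builds on. The tunables mirror the search space examples elsewhere on this page; run_benchmark is a hypothetical stand-in for an actual trial run.

import optuna

def run_benchmark(config):
    # Stand-in for a real trial: return the measured objective value.
    return config["memoryRequest"] / 100 + config["cpuRequest"]

study = optuna.create_study(direction="minimize",
                            sampler=optuna.samplers.TPESampler())
for _ in range(5):  # total_trials
    trial = study.ask()  # TPE proposes the next config from past results
    config = {
        "memoryRequest": trial.suggest_float("memoryRequest", 150, 300, step=1),
        "cpuRequest": trial.suggest_float("cpuRequest", 1, 3, step=0.01),
    }
    study.tell(trial, run_benchmark(config))  # the result guides the next proposal

print(study.best_params, study.best_value)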

Installation

You can make use of the Kruize HPO Operate First instance without installing anything locally by running the following command:

$ ./deploy_hpo.sh -c operate-first

Alternatively, Kruize HPO can be installed natively on Linux, as a container, or in minikube / openshift:

  1. Native: $ ./deploy_hpo.sh -c native
  2. Container: $ ./deploy_hpo.sh -c docker
  3. Minikube: $ ./deploy_hpo.sh -c minikube
  4. Openshift: $ ./deploy_hpo.sh -c openshift

You can run a specific version of the Kruize HPO container: $ ./deploy_hpo.sh -c minikube -o image:tag

Operate First

We have deployed Kruize HPO on the Operate First community cloud, in the namespace 'openshift-tuning', to promote open operations. Operate First is a community of open source contributors, including developers, data scientists, and SREs, where developers and operators collaborate on a production community cloud to address operational considerations for their code and other artifacts. For more information on Operate First, please visit https://www.operate-first.cloud/

You can access HPO on Operate First by running the following command: $ ./deploy_hpo.sh -c operate-first

How to make use of Kruize HPO for my use case?

We would recommend that you start with the hpo_demo_setup.sh script and customize it for your use case.

Contributing

We welcome your contributions! See CONTRIBUTING.md for more details.

License

Apache License 2.0, see LICENSE.


hpo's Issues

Separate API for Objective function and function variables

The experiment trials API currently accepts a search space JSON which contains the objective function as well. The objective function need not be part of the search space and has to be cleaned up.

Also, we need a separate API that accepts the objective function and the function variables and returns the result after calculation. (A hypothetical sketch follows.)
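For illustration only, a hypothetical evaluator behind such an API might look like the sketch below; the restriction to basic arithmetic and the function shape are assumptions, not the proposed design.

import ast
import operator

OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
       ast.Mult: operator.mul, ast.Div: operator.truediv}

def evaluate(expression, variables):
    # e.g. evaluate("(1000 / throughput) + cost",
    #               {"throughput": 250.0, "cost": 4.2})
    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.BinOp):
            return OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.Name):
            return float(variables[node.id])
        if isinstance(node, ast.Constant):
            return float(node.value)
        raise ValueError("Unsupported element in objective function")
    return walk(ast.parse(expression, mode="eval"))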

CrashLoopBackoff when using ./deploy_hpo.sh -c openshift

Container failing to come up when using ./deploy_hpo.sh -c openshift

seeing in the logs of container:

Traceback (most recent call last):
  File "/home/hpo/app/src/service.py", line 18, in <module>
    import rest_service, grpc_service
  File "/home/hpo/app/src/rest_service.py", line 24, in <module>
    from json_validate import validate_trial_generate_json
  File "/home/hpo/app/src/json_validate.py", line 1, in <module>
    import jsonschema
ModuleNotFoundError: No module named 'jsonschema'

Speed up CI tests

By slightly re-architecting the testsuite, I have reduced the time taken to run the testsuite in CI from
a) sanity-tests: 13min -> 1min
b) full testsuite: ~55min -> ~2min

Part of this work requires the PR to delete experiments (#57). Another speed-up is obtained by splitting the Docker build into a multi-stage build: a first stage builds a base image with all the required libraries etc., and a second stage copies the service files into a second image. This reduces the Docker build time in CI from 8min 42s to 33s, as the base image does not need to be rebuilt for every test. The only downside is that a separate process needs to build and push the base image to quay.io. The base image should only need to change for component upgrades.

Before I open a PR, does anyone have an opinion on these changes?

Incorrect validation error for property data type mismatch in the search space JSON

The expected data type for the experiment_id property is string. However, if an integer or double value is provided in the search space (e.g., in the JSON below), it results in a validation error with the message "Parameters cannot be empty or null!". The correct validation error message should indicate that the data type of "experiment_id" should be a string, i.e., "Parameter 'experiment_id' 123 is not of type 'string'". (A jsonschema sketch reproducing the expected message follows the JSON.)

{
  "operation": "EXP_TRIAL_GENERATE_NEW",
  "search_space": {
    "experiment_name": "exp1",
    "experiment_id": 123,
    "total_trials": 5 ,
    "parallel_trials": 1,
    "value_type": "double",
    "hpo_algo_impl": "optuna_tpe",
    "objective_function": "transaction_response_time",
    "tunables": [
      {
        "value_type": "double",
        "lower_bound": 150,
        "name": "memoryRequest",
        "upper_bound": 300,
        "step": 1
      },
      {
        "value_type": "double",
        "lower_bound": 1,
        "name": "cpuRequest",
        "upper_bound": 3,
        "step": 0.01
      }
    ],
    "direction": "minimize"
  }
}
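For reference, the jsonschema library (which the service already imports, per the CrashLoopBackoff issue above) exposes the failing instance, so the expected message can be built directly; the schema fragment and message formatting below are illustrative.

import jsonschema

schema = {
    "type": "object",
    "properties": {"experiment_id": {"type": "string"}},
}
try:
    jsonschema.validate(instance={"experiment_id": 123}, schema=schema)
except jsonschema.ValidationError as err:
    # err.instance is 123; err.message is "123 is not of type 'string'"
    print("Parameter 'experiment_id' %s is not of type 'string'" % err.instance)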

Include validation checks for invalid values in grpc service

We need to include validation checks for invalid values in the grpc service. For example, posting an experiment with an empty name or no name doesn't give any error message.

python ../../src/grpc_client.py new --file empty-name.json 
 Adding new experiment:  
Trial Number: 0

python ../../src/grpc_client.py new --file no-name.json 
 Adding new experiment: 
Trial Number: 0

HPO doesn't start in native with only the REST service, as it still looks for gRPC.

Description:
When HPO is started with only the REST service, it still tries to import gRPC and fails to start.

./deploy_hpo.sh -c native --rest

###   Installing HPO as a native App


### Installing dependencies..........


### Starting the service...

Traceback (most recent call last):
  File "src/service.py", line 18, in <module>
    import rest_service, grpc_service
  File "/root/kchalasa/kruize-demos/checkHPO/hpo/src/grpc_service.py", line 22, in <module>
    import grpc
ModuleNotFoundError: No module named 'grpc'

HPO doesn't throw any error on posting the experiment result with invalid values

HPO doesn't throw any error on posting the experiment result with invalid values for trial_result, result_value, result_value_type in the experiment results JSON

To recreate the issue:

  • Run ./deploy_hpo.sh -c native
  • Post an experiment using the below curl command
curl -s -H 'Content-Type: application/json' http://localhost:8085/experiment_trials -d '{"operation":"EXP_TRIAL_GENERATE_NEW","search_space":{"experiment_name":"petclinic-sample-2-75884c5549-npvgd","total_trials":5,"parallel_trials":1,"experiment_id":"a123","value_type":"double","hpo_algo_impl":"optuna_tpe","objective_function":"transaction_response_time","tunables":[{"value_type":"double","lower_bound":150,"name":"memoryRequest","upper_bound":300,"step":1},{"value_type":"double","lower_bound":1,"name":"cpuRequest","upper_bound":3,"step":0.01}],"slo_class":"response_time","direction":"minimize"}}'
  • Post the experiment result json using the below curl command
curl -s -H 'Content-Type: application/json' http://localhost:8085/experiment_trials -d '{"experiment_name" : "petclinic-sample-2-75884c5549-npvgd", "trial_number": 0, "trial_result": "xyz", "result_value_type": "double", "result_value": 98.78, "operation" : "EXP_TRIAL_RESULT"}'

HPO service does not shutdown gracefully

At present, when users attempt to shut down a running service with a SIGINT signal, the process does not exit cleanly. The current behaviour:

  • Requires 2 SIGINT signals to be sent to the process,
  • Prints a stack trace from an unhandled KeyboardInterrupt error:
Traceback (most recent call last):
  File "/working/projects/redHat/autotune/hpo/src/service.py", line 31, in <module>
    main()
  File "/working/projects/redHat/autotune/hpo/src/service.py", line 27, in main
    restService.join()
  File "/usr/lib64/python3.10/threading.py", line 1089, in join
    self._wait_for_tstate_lock()
  File "/usr/lib64/python3.10/threading.py", line 1109, in _wait_for_tstate_lock
    if lock.acquire(block, timeout):
KeyboardInterrupt

When the HPO process receives a single SIGINT signal, it should shut down cleanly, without displaying stack traces. (A sketch of one possible fix follows.)
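A minimal sketch of one possible fix, assuming the main thread currently blocks in a bare join() as the traceback suggests; the thread and handler names here are illustrative, not the actual service code.

import signal
import threading

shutdown = threading.Event()

def handle_sigint(signum, frame):
    # First SIGINT requests a clean stop instead of raising KeyboardInterrupt.
    shutdown.set()

signal.signal(signal.SIGINT, handle_sigint)

rest_thread = threading.Thread(target=shutdown.wait, daemon=True)
rest_thread.start()

# Join with a timeout so the signal handler gets a chance to run and the
# process can exit after a single SIGINT, without a stack trace.
while rest_thread.is_alive() and not shutdown.is_set():
    rest_thread.join(timeout=0.5)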

Operator for HPO

Currently HPO does not have an operator. This issue tracks the feature enhancement of including an Operator.

Get config always returns the current config generated irrespective of the trial number

Get config always returns the last config generated, irrespective of the trial number, in the case of the gRPC service. With the REST service, get config returns -1 for all trial numbers except the current trial, for which it shows the config.

To recreate follow the below steps:

  • Deploy HPO service - ./deploy_hpo.sh -c native
  • Post a new experiment - python grpc_client.py new --file=../tests/resources/searchspace_jsons/newExperiment.json
  • Get the config using - python grpc_client.py config --name petclinic-sample-2-75884c5549-npvgd --trial 0
  • Post the result - python grpc_client.py result --name petclinic-sample-2-75884c5549-npvgd --trial 0 --result SUCCESS --value_type double --value 5745.33
  • Post the next experiment - python grpc_client.py next --name petclinic-sample-2-75884c5549-npvgd
  • Get the config using both trial numbers or any random trial number that has not been generated so far
    python grpc_client.py config --name petclinic-sample-2-75884c5549-npvgd --trial 10
I have posted the results for trial 0 and trial 1; trial 2 is the current config

(base) [csubrama@csubrama src]$ python grpc_client.py config --name petclinic-sample-2-75884c5549-npvgd --trial 2
{
  "config": [
    {
      "name": "memoryRequest",
      "value": 209.0
    },
    {
      "name": "cpuRequest",
      "value": 1.89
    }
  ]
}
(base) [csubrama@csubrama src]$ python grpc_client.py config --name petclinic-sample-2-75884c5549-npvgd --trial 0
{
  "config": [
    {
      "name": "memoryRequest",
      "value": 209.0
    },
    {
      "name": "cpuRequest",
      "value": 1.89
    }
  ]
}
(base) [csubrama@csubrama src]$ python grpc_client.py config --name petclinic-sample-2-75884c5549-npvgd --trial 1
{
  "config": [
    {
      "name": "memoryRequest",
      "value": 209.0
    },
    {
      "name": "cpuRequest",
      "value": 1.89
    }
  ]
}
(base) [csubrama@csubrama src]$ python grpc_client.py config --name petclinic-sample-2-75884c5549-npvgd --trial 10
{
  "config": [
    {
      "name": "memoryRequest",
      "value": 209.0
    },
    {
      "name": "cpuRequest",
      "value": 1.89
    }
  ]
}

Standard Frameworks

I wanted to ask some questions about the architecture of the project in general. Is there a reason to not use standard python frameworks/test runners/build tools for HPO?

For example, rather than hand-crafting a REST endpoint based on an HTTP server, would it make sense to use something like Django Rest/Flask?

Similarly with unit/integration tests, would it make sense to use a standard python library for creating and running tests?
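Purely as an illustration of the Flask suggestion above, the existing /experiment_trials POST handler re-expressed in Flask might look like the sketch below; the dispatch logic is elided and this is not the actual code.

from flask import Flask, request

app = Flask(__name__)

@app.route("/experiment_trials", methods=["POST"])
def experiment_trials():
    payload = request.get_json()
    if payload.get("operation") == "EXP_TRIAL_GENERATE_NEW":
        # ... delegate to the existing experiment-creation logic ...
        return "0", 200  # first trial number
    return "Invalid operation", 400

if __name__ == "__main__":
    app.run(port=8085)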

Failures in full testsuite

As I was experimenting for #58 I noticed there are a few failures when running the full testsuite, e.g.

########### Results Summary of the test suite hpo_api_tests ##########
hpo_api_tests took 2562 seconds
Number of tests performed 56
Number of tests passed 42
Number of tests failed 14

~~~~~~~~~~~~~~~~~~~~~~~ hpo_api_tests failed ~~~~~~~~~~~~~~~~~~~~~~~~~~
Failed cases are :
		  invalid-id
		  empty-id
		  invalid-searchspace
		  empty-name
		  null-name
		  invalid-trial-result
		  empty-trial-result
		  null-trial-result
		  invalid-result-value-type
		  empty-result-value-type
		  null-result-value-type
		  invalid-result-value
		  null-result-value
		  additional-field

Is this something that someone is currently investigating?

Onboard Kruize HPO service in OperateFirst cloud.

Goal: Make HPO as a Service available as part of the Operate First cloud, where it will be offered to the community cloud as an open service.

Kruize HPOaaS provides an interface for integrating with Open Source projects / modules that provide HPO functionality.

HPO for Openshift:

  • Status : Deployed and tested the app on Openshift using QuickLab.

TO DO:

  • Creating a namespace: adding a namespace, an OCP group, and a resource quota on the namespace, and giving the group access to this namespace.
  • Creating a PR and sharing this namespace
  • Onboarding doc link for Operate First

Experiment/Trial state is ephemeral

From discussion on #39

"In optuna we aren't using persistent storage at present"

At present all experiment / trial config data is stored in memory and not in any form of persistent storage. This creates a number of challenges:

  1. If HPO is running as a service (esp. in a K8s environment) any containers will be ephemeral. K8s can decide to restart pods, in which case current experiments would be lost and would have to be restarted.
  2. There is no persistent history of experiments, trials or results.
  3. There is no ability to restart an experiment that was already under way. This is currently difficult due to the synchronous architecture of the optuna library.

Are there plans to introduce persistent storage of experiment/trial data?
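For what it's worth, Optuna itself offers an RDB storage backend that would address all three points; a minimal sketch, with an illustrative SQLite URL (a database server would be more appropriate inside Kubernetes):

import optuna

study = optuna.create_study(
    study_name="petclinic-sample-2-75884c5549-npvgd",
    storage="sqlite:///hpo_experiments.db",  # illustrative; persists trials
    direction="minimize",
    load_if_exists=True,  # resume the same study after a pod restart
)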

Create SCC for running HPO on Openshift in restricted manner

Currently HPO has issues while running on Openshift due to the additional restrictions Openshift imposes on running applications. We need to add a security policy to deploy our app successfully.

Openshift has a Security Context Constraint (SCC) feature which needs to be implemented for this purpose.

Ownership of app/src folder is not hpo user.

Description:
The folders in /home/hpo are owned by root, although they are copied as the hpo user. (A possible Dockerfile fix is sketched after the listing below.)

[hpo@da33248ee069 app]$ ls -lrt
total 8
-rw-rw-r--. 1 root root 1330 Apr 12 09:43 index.html
-rw-rw-r--. 1 root root   73 Aug  5 10:28 requirements.txt
drwxr-xr-x. 5 root root   57 Aug 23 06:17 src
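One possible fix, assuming the image currently uses a plain COPY, is to set ownership at copy time; the paths are taken from the listing above and the exact Dockerfile lines are illustrative.

# Illustrative Dockerfile change: copy sources owned by the hpo user, not root.
COPY --chown=hpo:hpo src/ /home/hpo/app/src/
COPY --chown=hpo:hpo requirements.txt index.html /home/hpo/app/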

Remove SCC dependency for Openshift deployment

Currently we're using an SCC yaml to deploy the HPO application on Openshift successfully. This requires special privileges when we deploy it on other cloud platforms like Operate First.
We need to make changes in the Dockerfile to remove the dependency on the SCC, which in turn will ease our deployment on other platforms.

Logging infra for HPO

  1. Currently we do not have support for different logging levels in HPO.
  2. Also add support for the user to change logging levels from the deploy script (see the sketch after this list).
  3. The format of logged messages needs to be fixed (refer to the Autotune log message format).
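A minimal sketch of items 1 and 2, assuming the deploy script exports an environment variable (HPO_LOG_LEVEL is an assumed name); the format string matches the service log lines shown elsewhere on this page.

import logging
import os

logging.basicConfig(
    level=os.environ.get("HPO_LOG_LEVEL", "INFO"),
    format="%(asctime)s - %(levelname)s - %(name)s - %(message)s",
    datefmt="%Y-%m-%d %H:%M:%S",
)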

HPO doesn't throw any error on passing an empty string

Passing an empty string in some of the fields (experiment_id, trial_result, result_value_type) of the HPO experiment or results JSON doesn't produce any error message.

To recreate the issue:

  • Run ./deploy_hpo.sh -c native
  • Post the experiment json using the curl command
  • Below is an example curl command to post an experiment with empty string for experiment_id
curl -s -H 'Content-Type: application/json' http://localhost:8085/experiment_trials -d '{"operation":"EXP_TRIAL_GENERATE_NEW","search_space":{"experiment_name":"petclinic-sample-2-75884c5549-npvgd","total_trials":5,"parallel_trials":1,"experiment_id":" ","value_type":"double","hpo_algo_impl":"optuna_tpe","objective_function":"transaction_response_time","tunables":[{"value_type":"double","lower_bound":150,"name":"memoryRequest","upper_bound":300,"step":1},{"value_type":"double","lower_bound":1,"name":"cpuRequest","upper_bound":3,"step":0.01}],"slo_class":"response_time","direction":"minimize"}}'

Posting experiment result for an ongoing trial failed

Posting experiment result for an ongoing trial failed with the below error when experiments are run in parallel

*********************************** Experiment petclinic-sample-87 and trial_number 3 *************************************

Generate the config for experiment 87 and trial 3...

command used to query the experiment_trial API = curl -s -H 'Accept: application/json' 'http://worker001-r630:31645/experiment_trials?experiment_name=petclinic-sample-87&trial_number=3' -w '\n%{http_code}'
[{"tunable_name": "memoryRequest", "tunable_value": 189.0}, {"tunable_name": "cpuRequest", "tunable_value": 2.39}]

Post the experiment result for experiment petclinic-sample-87 and trial 3...


Command used to post the experiment result= curl -s -H 'Content-Type: application/json' http://worker001-r630:31645/experiment_trials -d {"experiment_name":"petclinic-sample-87","trial_number":3,"trial_result":"success","result_value_type":"double","result_value":98.7,"operation":"EXP_TRIAL_RESULT"} -w '\n%{http_code}'

Requested trial exceeds the completed trial limit!
400
Response is Requested trial exceeds the completed trial limit!
http_code = 400 response = Requested trial exceeds the completed trial limit!
Post experiment result for experiment petclinic-sample-87 and trial 3 failed - http_code is not as expected, http_code = 400 expected code = 200

Generate subsequent config for experiment petclinic-sample-87 after trial 3 ...


Curl command used to post the experiment = curl -s -H 'Content-Type: application/json' http://worker001-r630:31645/experiment_trials -d '{"experiment_name":"petclinic-sample-87","operation":"EXP_TRIAL_GENERATE_SUBSEQUENT"}'  -w '\n%{http_code}'

4
200
Response is 4
http_code is 200 Response is 4

Fix package versions to avoid compatibility issue

Currently we're only specifying a fixed version for the protobuf package in our requirements file, i.e. 4.21.8. We faced issues while running the gRPC service in Docker because it pulled the latest grpcio package, which is incompatible with protobuf version 4.21.8.

So, to fix these issues permanently, we need to pin the versions of all the packages in the requirements file, along with the compatible Python version. (An illustrative sketch follows.)
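An illustrative shape for the pinned requirements file; every version below other than protobuf's is a placeholder to be replaced with releases verified against protobuf 4.21.8, not a recommendation.

# requirements.txt sketch: pin every package, not only protobuf.
protobuf==4.21.8
grpcio==<verified-compatible-version>
jsonschema==<verified-compatible-version>
optuna==<verified-compatible-version>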

HPO negative tests failures on Openshift

Post experiment tests failed due to the expected log messages not being present in service.log

Failed cases are :
empty-id
null-id
empty-name
no-name
null-name
no-operation
additional-field
generate-subsequent
invalid-searchspace
post_duplicate_experiments

plot API fails to generate plots and defaults to optimization_history when only exp_name is specified

When only the experiment name is specified, the type doesn't default to tunables_importance and no plots are generated.

2022-08-23 16:09:53 - INFO - rest_service - Total Trials = 5
2022-08-23 16:09:53 - INFO - rest_service - Parallel Trials = 1
2022-08-23 16:09:53 - INFO - rest_service - Starting Experiment: petclinic-sample-2-75884c5549-npvgd
127.0.0.1 - - [23/Aug/2022 16:09:53] "POST /experiment_trials HTTP/1.1" 200 -
2022-08-23 16:09:53 - INFO - rest_service - Experiment_Name = petclinic-sample-2-75884c5549-npvgd
2022-08-23 16:09:53 - INFO - rest_service - Trial_Number = 0
127.0.0.1 - - [23/Aug/2022 16:09:53] "GET /experiment_trials?experiment_name=petclinic-sample-2-75884c5549-npvgd&trial_number=0 HTTP/1.1" 200 -
127.0.0.1 - - [23/Aug/2022 16:09:54] "POST /experiment_trials HTTP/1.1" 200 -
127.0.0.1 - - [23/Aug/2022 16:09:54] "POST /experiment_trials HTTP/1.1" 200 -
2022-08-23 16:09:54 - INFO - rest_service - Experiment_Name = petclinic-sample-2-75884c5549-npvgd
2022-08-23 16:09:54 - INFO - rest_service - Trial_Number = 1
127.0.0.1 - - [23/Aug/2022 16:09:54] "GET /experiment_trials?experiment_name=petclinic-sample-2-75884c5549-npvgd&trial_number=1 HTTP/1.1" 200 -
127.0.0.1 - - [23/Aug/2022 16:09:54] "POST /experiment_trials HTTP/1.1" 200 -
127.0.0.1 - - [23/Aug/2022 16:09:54] "POST /experiment_trials HTTP/1.1" 200 -
2022-08-23 16:09:54 - INFO - rest_service - Experiment_Name = petclinic-sample-2-75884c5549-npvgd
2022-08-23 16:09:54 - INFO - rest_service - Trial_Number = 2
127.0.0.1 - - [23/Aug/2022 16:09:54] "GET /experiment_trials?experiment_name=petclinic-sample-2-75884c5549-npvgd&trial_number=2 HTTP/1.1" 200 -
127.0.0.1 - - [23/Aug/2022 16:09:55] "POST /experiment_trials HTTP/1.1" 200 -
127.0.0.1 - - [23/Aug/2022 16:09:55] "POST /experiment_trials HTTP/1.1" 200 -
2022-08-23 16:09:55 - INFO - rest_service - Experiment_Name = petclinic-sample-2-75884c5549-npvgd
2022-08-23 16:09:55 - INFO - rest_service - Trial_Number = 3
127.0.0.1 - - [23/Aug/2022 16:09:55] "GET /experiment_trials?experiment_name=petclinic-sample-2-75884c5549-npvgd&trial_number=3 HTTP/1.1" 200 -
127.0.0.1 - - [23/Aug/2022 16:09:55] "POST /experiment_trials HTTP/1.1" 200 -
127.0.0.1 - - [23/Aug/2022 16:09:55] "POST /experiment_trials HTTP/1.1" 200 -
2022-08-23 16:09:55 - INFO - rest_service - Experiment_Name = petclinic-sample-2-75884c5549-npvgd
2022-08-23 16:09:55 - INFO - rest_service - Trial_Number = 4
127.0.0.1 - - [23/Aug/2022 16:09:55] "GET /experiment_trials?experiment_name=petclinic-sample-2-75884c5549-npvgd&trial_number=4 HTTP/1.1" 200 -
127.0.0.1 - - [23/Aug/2022 16:09:56] "POST /experiment_trials HTTP/1.1" 200 -
2022-08-23 16:09:56 - INFO - bayes_optuna.optuna_hpo - BEST PARAMETER: {'memoryRequest': 202.0, 'cpuRequest': 1.78}
2022-08-23 16:09:56 - INFO - bayes_optuna.optuna_hpo - BEST VALUE: 98.7
2022-08-23 16:09:56 - INFO - bayes_optuna.optuna_hpo - BEST TRIAL: FrozenTrial(number=0, values=[98.7], datetime_start=datetime.datetime(2022, 8, 23, 16, 9, 53, 648072), datetime_complete=datetime.datetime(2022, 8, 23, 16, 9, 54, 136858), params={'memoryRequest': 202.0, 'cpuRequest': 1.78}, distributions={'memoryRequest': DiscreteUniformDistribution(high=300.0, low=150.0, q=1.0), 'cpuRequest': DiscreteUniformDistribution(high=3.0, low=1.0, q=0.01)}, user_attrs={}, system_attrs={}, intermediate_values={}, trial_id=0, state=TrialState.COMPLETE, value=None)
2022-08-23 16:09:56 - WARNING - bayes_optuna.optuna_hpo - Experiment stopped: petclinic-sample-2-75884c5549-npvgd
2022-08-23 16:09:56 - INFO - rest_service - Plot type not defined. Defaulting it to optimization_history
127.0.0.1 - - [23/Aug/2022 16:09:56] "GET /plot?experiment_name="petclinic-sample-2-75884c5549-npvgd" HTTP/1.1" 404 -
2022-08-23 16:09:56 - ERROR - rest_service - Experiment not found!
127.0.0.1 - - [23/Aug/2022 16:09:56] "GET /plot?experiment_name="petclinic-sample-2-75884c5549-npvgd" HTTP/1.1" 400 -

Subsequent trial is not generated at times when running experiments in parallel

A subsequent trial is not generated at times when running experiments in parallel. This was observed while running 10 experiments in parallel.

*********************************** Experiment petclinic-sample-10 and trial_number 0 *************************************

Generate the config for experiment 10 and trial 0...

command used to query the experiment_trial API = curl -s -H 'Accept: application/json' 'http://worker001-r630:32600/experiment_trials?experiment_name=petclinic-sample-10&trial_number=0' -w '\n%{http_code}'
[{"tunable_name": "memoryRequest", "tunable_value": 292.0}, {"tunable_name": "cpuRequest", "tunable_value": 2.6500000000000004}]

Post the experiment result for experiment petclinic-sample-10 and trial 0...


Command used to post the experiment result= curl -s -H 'Content-Type: application/json' http://worker001-r630:32600/experiment_trials -d {"experiment_name":"petclinic-sample-10","trial_number":0,"trial_result":"success","result_value_type":"double","result_value":98.7,"operation":"EXP_TRIAL_RESULT"} -w '\n%{http_code}'

Result posted successfully!
200
Response is Result posted successfully!
http_code = 200 response = Result posted successfully!

Generate subsequent config for experiment petclinic-sample-10 after trial 0 ...


Curl command used to post the experiment = curl -s -H 'Content-Type: application/json' http://worker001-r630:32600/experiment_trials -d '{"experiment_name":"petclinic-sample-10","operation":"EXP_TRIAL_GENERATE_SUBSEQUENT"}'  -w '\n%{http_code}'

0
200
Response is 0
http_code is 200 Response is 0

*********************************** Experiment petclinic-sample-10 and trial_number 1 *************************************

Generate the config for experiment 10 and trial 1...

command used to query the experiment_trial API = curl -s -H 'Accept: application/json' 'http://worker001-r630:32600/experiment_trials?experiment_name=petclinic-sample-10&trial_number=1' -w '\n%{http_code}'
Requested trial exceeds the completed trial limit!
Get config from hpo for experiment petclinic-sample-10 and trial 1 failed - http_code is not as expected, http_code = 400 expected code = 200

Terminating HPO in minikube shouldn't require setting environment variables for secret creation.

Description:
When terminating HPO in minikube, it asks for environment variables to be set, which should not be required.

 ./deploy_hpo.sh -c minikube -t
You need to set the environment variables first for Kubernetes secret creation

Usage:
 -a | --non_interactive: interactive (default)
 -s | --start: start(default) the app
 -t | --terminate: terminate the app
 -c | --cluster_type: cluster type [docker|minikube|native|openshift]]
 -o | --container_image: build with specific hpo container image name [Default - kruize/hpo:<version>]
 -n | --namespace : Namespace to which hpo is deployed [Default - monitoring namespace for cluster type minikube]
 -d | --configmaps_dir : Config maps directory [Default - manifests/configmaps]
 --both: install both REST and the gRPC service
 --rest: install REST only
 Environment Variables to be set: REGISTRY, REGISTRY_EMAIL, REGISTRY_USERNAME, REGISTRY_PASSWORD
 [Example - REGISTRY: docker.io, quay.io, etc]

Unhandled exception when the status of all the trials in the experiment is prune

Below is the exception observed when the status of all the trials in the experiment is 'prune' (a guard sketch follows the traceback):

Exception in thread Thread-1:
Traceback (most recent call last):
  File "/usr/lib64/python3.6/threading.py", line 919, in _bootstrap_inner
    self.run()
  File "/usr/lib64/python3.6/threading.py", line 867, in run
    self._target(*self._args, **self._kwargs)
  File "/home/kchalasa/Desktop/em-hpo-scripts/hpo/hpo-l/autotune-demo/hpo/src/bayes_optuna/optuna_hpo.py", line 145, in recommend
    logger.info("Best parameter: " + str(study.best_params))
  File "/home/kchalasa/.local/lib/python3.6/site-packages/optuna/study/study.py", line 60, in best_params
    return self.best_trial.params
  File "/home/kchalasa/.local/lib/python3.6/site-packages/optuna/study/study.py", line 97, in best_trial
    return copy.deepcopy(self._storage.get_best_trial(self._study_id))
  File "/home/kchalasa/.local/lib/python3.6/site-packages/optuna/storages/_in_memory.py", line 311, in get_best_trial
    raise ValueError("No trials are completed yet.")
ValueError: No trials are completed yet.
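A minimal sketch of a guard for this case, checking for completed trials before touching study.best_params; the function shape is illustrative, not the actual recommend() code.

from optuna.trial import TrialState

def log_best(study, logger):
    completed = [t for t in study.trials if t.state == TrialState.COMPLETE]
    if not completed:
        logger.warning("No trials completed (all pruned); skipping best params")
        return
    logger.info("Best parameter: %s", study.best_params)
    logger.info("Best value: %s", study.best_value)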

HPO doesn't throw any error message on passing null for result_value

HPO doesn't throw any error on posting the experiment result with trial_result, result_value, result_value_type being null in the experiment results JSON

To recreate the issue:

  • Run ./deploy_hpo.sh -c native
  • Post an experiment using the below curl command
curl -s -H 'Content-Type: application/json' http://localhost:8085/experiment_trials -d '{"operation":"EXP_TRIAL_GENERATE_NEW","search_space":{"experiment_name":"petclinic-sample-2-75884c5549-npvgd","total_trials":5,"parallel_trials":1,"experiment_id":"a123","value_type":"double","hpo_algo_impl":"optuna_tpe","objective_function":"transaction_response_time","tunables":[{"value_type":"double","lower_bound":150,"name":"memoryRequest","upper_bound":300,"step":1},{"value_type":"double","lower_bound":1,"name":"cpuRequest","upper_bound":3,"step":0.01}],"slo_class":"response_time","direction":"minimize"}}'
  • Post the experiment result json using the below curl command
curl -s -H 'Content-Type: application/json' http://localhost:8085/experiment_trials -d '{"experiment_name" : "petclinic-sample-2-75884c5549-npvgd", "trial_number": 0, "trial_result": null, "result_value_type": "double", "result_value": 98.78, "operation" : "EXP_TRIAL_RESULT"}'

optuna: n_jobs is getting deprecated

It appears that with the latest Optuna, parallelism within an experiment (viz. n_jobs) is being deprecated. This issue is to explore the implications and research alternatives. (One documented alternative is sketched below.)
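One documented Optuna alternative is process-level parallelism over a shared storage backend instead of thread-level n_jobs; a minimal sketch follows (the study name and storage URL are illustrative, and an RDB server rather than SQLite is advised for real parallel runs).

import optuna

def objective(trial):
    x = trial.suggest_float("x", -10, 10)  # toy objective for illustration
    return x * x

# Each worker process loads the same study and contributes trials to it.
study = optuna.load_study(study_name="shared-study",
                          storage="sqlite:///hpo_experiments.db")
study.optimize(objective, n_trials=10)  # launch this script N times in parallel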

Recommendation API

Add an API that returns the best / recommended config. This should return the "current" best during the experiment run and the "absolute" best after all the trials complete.

GRPC sanity tests fail on minikube / openshift

GRPC sanity tests fail on minikube with the below error on the client side

Posting a new experiment...
Adding new experiment: petclinic-sample-2-75884c5549-npvgd
Error: An error occurred executing command: failed to connect to all addresses
Post new experiment failed!

From the service log:

2022-07-15 09:12:26 - INFO - hpo-service - Starting HPO service
2022-07-15 09:12:26 - INFO - rest_service - Access server at http://localhost:8085
2022-07-15 09:12:26 - INFO - grpc_service - Starting gRPC server at http://0.0.0.0:50051

I have exported HPO_HOST & PORT to the minikube IP & port number of the HPO service. Not sure if I'm missing anything here or if any changes are required to the grpc service.

Shouldn't the IP & port number in the INFO statements in the service log display the cluster IP and port?

HPO service needs to be exposed

Expose the HPO service so that it can be accessed outside the cluster using a route rather than the port.

oc expose svc/< hpo service > -n < namespace >
