project-codeflare / instascale
On-demand Kubernetes/OpenShift cluster scaling and aggregated resource provisioning
License: Apache License 2.0
I noticed that when I submit a sample AppWrapper, it causes a CrashLoopBackOff of the InstaScale pod.
oc get pods -n opendatahub |grep insta
instascale-instascale-6bb58b6559-m8snx 0/1 CrashLoopBackOff 62 (2m20s ago) 16h
And in the log I see:
I0322 18:03:14.944406 1 appwrapper_controller.go:409] Pending AppWrapper defaultaw-schd-spec-with-timeout-104 needs scaling
E0322 18:03:14.944519 1 runtime.go:78] Observed a panic: runtime.boundsError{x:0, y:0, signed:true, code:0x0} (runtime error: index out of range [0] with length 0)
goroutine 348 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic({0x1641a80?, 0xc000b4db78})
/go/pkg/mod/k8s.io/[email protected]/pkg/util/runtime/runtime.go:74 +0x86
k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0x415ed0?})
/go/pkg/mod/k8s.io/[email protected]/pkg/util/runtime/runtime.go:48 +0x75
panic({0x1641a80, 0xc000b4db78})
/usr/local/go/src/runtime/panic.go:884 +0x213
github.com/project-codeflare/instascale/controllers.discoverInstanceTypes(0xc000e98c00)
/workspace/controllers/appwrapper_controller.go:287 +0x265
github.com/project-codeflare/instascale/controllers.onUpdate({0x14f0360?, 0xc000e5d8f0?}, {0x16ed9c0?, 0xc000e98c00?})
/workspace/controllers/appwrapper_controller.go:262 +0x17c
k8s.io/client-go/tools/cache.ResourceEventHandlerFuncs.OnUpdate(...)
/go/pkg/mod/k8s.io/[email protected]/tools/cache/controller.go:238
k8s.io/client-go/tools/cache.FilteringResourceEventHandler.OnUpdate({0x18081e0?, {0x1958fa0?, 0xc0009e8918?}}, {0x16ed9c0, 0xc000dd2c00}, {0x16ed9c0, 0xc000e98c00})
/go/pkg/mod/k8s.io/[email protected]/tools/cache/controller.go:273 +0xe2
k8s.io/client-go/tools/cache.(*processorListener).run.func1()
/go/pkg/mod/k8s.io/[email protected]/tools/cache/shared_informer.go:785 +0xf7
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x30?)
/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:155 +0x3e
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc0000b6f38?, {0x1944860, 0xc0005a6030}, 0x1, 0xc000a7a060)
/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:156 +0xb6
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0x0?, 0x3b9aca00, 0x0, 0x31?, 0xc0000b6fb0?)
/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:133 +0x89
k8s.io/apimachinery/pkg/util/wait.Until(...)
/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:90
k8s.io/client-go/tools/cache.(*processorListener).run(0xc000206300)
/go/pkg/mod/k8s.io/[email protected]/tools/cache/shared_informer.go:781 +0x6b
k8s.io/apimachinery/pkg/util/wait.(*Group).Start.func1()
/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:73 +0x5a
created by k8s.io/apimachinery/pkg/util/wait.(*Group).Start
/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:71 +0x85
panic: runtime error: index out of range [0] with length 0 [recovered]
panic: runtime error: index out of range [0] with length 0
goroutine 348 [running]:
k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0x415ed0?})
/go/pkg/mod/k8s.io/[email protected]/pkg/util/runtime/runtime.go:55 +0xd7
panic({0x1641a80, 0xc000b4db78})
/usr/local/go/src/runtime/panic.go:884 +0x213
github.com/project-codeflare/instascale/controllers.discoverInstanceTypes(0xc000e98c00)
/workspace/controllers/appwrapper_controller.go:287 +0x265
github.com/project-codeflare/instascale/controllers.onUpdate({0x14f0360?, 0xc000e5d8f0?}, {0x16ed9c0?, 0xc000e98c00?})
/workspace/controllers/appwrapper_controller.go:262 +0x17c
k8s.io/client-go/tools/cache.ResourceEventHandlerFuncs.OnUpdate(...)
/go/pkg/mod/k8s.io/[email protected]/tools/cache/controller.go:238
k8s.io/client-go/tools/cache.FilteringResourceEventHandler.OnUpdate({0x18081e0?, {0x1958fa0?, 0xc0009e8918?}}, {0x16ed9c0, 0xc000dd2c00}, {0x16ed9c0, 0xc000e98c00})
/go/pkg/mod/k8s.io/[email protected]/tools/cache/controller.go:273 +0xe2
k8s.io/client-go/tools/cache.(*processorListener).run.func1()
/go/pkg/mod/k8s.io/[email protected]/tools/cache/shared_informer.go:785 +0xf7
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x30?)
/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:155 +0x3e
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc0000b6f38?, {0x1944860, 0xc0005a6030}, 0x1, 0xc000a7a060)
/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:156 +0xb6
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0x0?, 0x3b9aca00, 0x0, 0x31?, 0xc0000b6fb0?)
/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:133 +0x89
k8s.io/apimachinery/pkg/util/wait.Until(...)
/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:90
k8s.io/client-go/tools/cache.(*processorListener).run(0xc000206300)
/go/pkg/mod/k8s.io/[email protected]/tools/cache/shared_informer.go:781 +0x6b
k8s.io/apimachinery/pkg/util/wait.(*Group).Start.func1()
/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:73 +0x5a
created by k8s.io/apimachinery/pkg/util/wait.(*Group).Start
/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:71 +0x85
Existing format: orderedinstance: m5.4xlarge_g4dn.xlarge
Proposed format: orderedinstance: m5.4xlarge_g4dn.xlarge:255
We currently use _ to separate instance types. Our proposal is to add :<int> as a valid portion of the string to allow users to easily specify the number of nodes they require of a given type.
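A minimal sketch of how the proposed suffix could be parsed; the function and type names are illustrative, not the actual InstaScale code, and a missing suffix is treated here as one node (an assumption):

package main

import (
	"fmt"
	"strconv"
	"strings"
)

// instanceRequest pairs an instance type with the number of nodes requested.
type instanceRequest struct {
	InstanceType string
	Count        int
}

// parseOrderedInstances splits the label value on "_" and then on an
// optional ":<int>" suffix; a missing suffix is assumed to mean one node.
func parseOrderedInstances(label string) ([]instanceRequest, error) {
	var requests []instanceRequest
	for _, part := range strings.Split(label, "_") {
		if part == "" {
			continue
		}
		req := instanceRequest{InstanceType: part, Count: 1}
		if idx := strings.LastIndex(part, ":"); idx >= 0 {
			count, err := strconv.Atoi(part[idx+1:])
			if err != nil {
				return nil, fmt.Errorf("invalid node count in %q: %w", part, err)
			}
			req.InstanceType = part[:idx]
			req.Count = count
		}
		requests = append(requests, req)
	}
	return requests, nil
}

func main() {
	reqs, _ := parseOrderedInstances("m5.4xlarge_g4dn.xlarge:255")
	fmt.Println(reqs) // prints [{m5.4xlarge 1} {g4dn.xlarge 255}]
}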
InstaScale performance testing is needed with several queued jobs to determine the resources (CPU, memory) required to run it.
InstaScale in its current state gets aggregated resources from the provider sequentially. We need to enable InstaScale to send requests to the cloud provider in parallel so that aggregated resources are obtained faster.
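A sketch of the fan-out idea only: scaleInstanceType below is a hypothetical stand-in for the existing per-type provisioning call, not the real function.

package main

import (
	"context"
	"fmt"

	"golang.org/x/sync/errgroup"
)

// scaleInstanceType is a placeholder for the real call to the cloud provider.
func scaleInstanceType(ctx context.Context, instanceType string) error {
	fmt.Println("requesting", instanceType)
	return nil
}

// scaleAll issues one provisioning request per instance type concurrently
// instead of one after another.
func scaleAll(ctx context.Context, instanceTypes []string) error {
	g, ctx := errgroup.WithContext(ctx)
	for _, it := range instanceTypes {
		it := it // capture the loop variable for the goroutine
		g.Go(func() error {
			return scaleInstanceType(ctx, it)
		})
	}
	// Wait blocks until every request finishes and returns the first error.
	return g.Wait()
}

func main() {
	_ = scaleAll(context.Background(), []string{"m5.4xlarge", "g4dn.xlarge"})
}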
We are currently using a large number of API-specific objects to make client calls like get/put and so on. We should be able to do all of these using the default Kubernetes client. https://github.com/opendatahub-io/modelmesh-serving/blob/main/controllers/servingruntime_controller.go#L134-L151 has an example for secrets, but similar logic should work for MachineSets and the like.
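For example, the same generic controller-runtime Get call used in that secrets example should also work for MachineSets once the OpenShift machine API scheme is registered; this is only a sketch of the idea, not the actual refactor:

import (
	"context"

	machinev1beta1 "github.com/openshift/api/machine/v1beta1"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// getMachineSet fetches a MachineSet with the default controller-runtime
// client instead of an API-specific clientset.
func getMachineSet(ctx context.Context, c client.Client, namespace, name string) (*machinev1beta1.MachineSet, error) {
	ms := &machinev1beta1.MachineSet{}
	if err := c.Get(ctx, types.NamespacedName{Namespace: namespace, Name: name}, ms); err != nil {
		return nil, err
	}
	return ms, nil
}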
Manage the OCM secret via Kustomize and possibly mount it into the controller deployment. In the CodeFlare operator it should then be exposed in the InstaScale API as an object reference. See project-codeflare/codeflare-operator#148
Follow on to #81
The action should perform all steps described in #75.
Part of this task is to switch from defining the version in the VERSION file to providing the version as part of the GitHub action trigger, or retrieving the version from the tag.
also when there are 0 elements
Install InstaScale through the CodeFlare operator on OKD and make sure the machine* APIs work.
The repository is missing documentation providing instructions for releasing a new version. There is a brief mention of image creation in https://github.com/project-codeflare/instascale#image-creation , however it is not clear whether this is the only step needed to release a new version or whether other steps are required.
The README should be adjusted to describe the steps taken to release a new InstaScale version, possibly mentioning any external dependencies.
MachineSets acquire resources from the cloud provider. When an underlying node fails, such nodes should be released and new nodes should be added.
It is possible that the admin has two MachineSets of the same instance type configured under different names. It is important to identify such MachineSets, add them to the same group, and use one of them to get resources. If one member of the MachineSet group fails to provide resources within the desired time limit, another MachineSet from the group should be tried to acquire resources.
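A rough sketch of the grouping idea, assuming the instance type can be read directly from the MachineSet; in practice the lookup may need to decode the provider spec, and the label key below is only illustrative:

import machinev1beta1 "github.com/openshift/api/machine/v1beta1"

// groupByInstanceType buckets MachineSets that provide the same instance
// type so that a failing member can be swapped for another in its group.
func groupByInstanceType(machineSets []machinev1beta1.MachineSet) map[string][]machinev1beta1.MachineSet {
	groups := map[string][]machinev1beta1.MachineSet{}
	for _, ms := range machineSets {
		// Illustrative lookup; the real instance type may live in the
		// provider-specific part of the spec rather than in a label.
		instanceType := ms.Labels["machine.openshift.io/instance-type"]
		groups[instanceType] = append(groups[instanceType], ms)
	}
	return groups
}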
When a cloud provider fails to provision a machine, InstaScale should be able to handle it gracefully: either by removing the failed machine and retrying, or by scaling down the whole request if it is incomplete (or both, with a configurable timeout).
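A sketch of the configurable-timeout part, using the apimachinery wait helper InstaScale already depends on; machinesProvisioned is a hypothetical check, not an existing function:

import (
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// waitForMachines polls until all machines of a scale-up request are
// provisioned or the configurable timeout expires; on timeout the caller
// can delete the failed machines and retry, or scale the request down.
func waitForMachines(pollInterval, timeout time.Duration, machinesProvisioned func() (bool, error)) error {
	return wait.PollImmediate(pollInterval, timeout, machinesProvisioned)
}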
There needs to be documentation for users to understand how to set up InstaScale for actual use with the desired machine types (how to set up MachineSets and MachinePools).
A ROSA HyperShift cluster does not expose the MachineSet API, therefore InstaScale cannot contact it to scale up resources.
Currently the release GitHub action supports releasing from the main branch only.
The action should be able to release from any branch.
Explore the use of Kustomize or a Helm chart to deploy InstaScale.
An AppWrapper carries aggregated resource requests used to get instances for the workload. It may be the case that the cluster already has CPU resources available. Should the InstaScale controller gather partial requests in such a scenario?
While working through the quick start guide, the InstaScale logs showed that the instascale-ocm-secret could not be found.
We need to update the RBACs to include secrets and also report the related error correctly.
The quick start guide needs to be updated to include creation of the required secret.
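A minimal sketch of both pieces, assuming a kubebuilder-style controller: the RBAC marker grants read access to Secrets, and a NotFound error is turned into a clear message instead of an opaque failure. The secret name comes from this issue; the helper itself is hypothetical.

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

//+kubebuilder:rbac:groups="",resources=secrets,verbs=get;list;watch

// readOCMSecret fetches the OCM token secret and reports a missing secret
// with an actionable error message.
func readOCMSecret(ctx context.Context, c client.Client, namespace string) (*corev1.Secret, error) {
	secret := &corev1.Secret{}
	err := c.Get(ctx, types.NamespacedName{Namespace: namespace, Name: "instascale-ocm-secret"}, secret)
	if apierrors.IsNotFound(err) {
		return nil, fmt.Errorf("instascale-ocm-secret not found in namespace %q; create it as described in the quick start guide: %w", namespace, err)
	}
	return secret, err
}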
We need to set up an automated test setup that will run on each user-submitted PR.
Much of the Makefile content, as well as the entirety of the config and bin dirs (alongside other miscellaneous pieces), exists for potential future development with operator-sdk and is not currently utilized.
While bundle and catalog images can successfully be built with the Makefile, they currently hold no use.
I used Twistlock to scan the instascale-controller:latest image, and it has a few known vulnerabilities:
6 total vulnerabilities
4 High
2 Medium
I tried changing the Dockerfile to use the latest golang image:
#FROM golang:1.17 as builder
FROM golang:1.20.1 as builder
And that fixed all the vulnerabilities but one: https://nvd.nist.gov/vuln/detail/CVE-2022-21698
Looking to see how that can potentially be solved tomorrow. Full description of the issue is below:
packagePath = /manager
packageName = github.com/prometheus/client_golang
packageVersion = v1.11.0
status = fixed in 1.11.1
description = client_golang is the instrumentation library for Go applications in Prometheus, and the promhttp package in client_golang provides tooling around HTTP servers and clients. In client_golang prior to version 1.11.1, the HTTP server is susceptible to a Denial of Service through unbounded cardinality, and potential memory exhaustion, when handling requests with non-standard HTTP methods. In order to be affected, an instrumented software must use any of the promhttp.InstrumentHandler* middleware except RequestsInFlight; not filter any specific methods (e.g. GET) before the middleware; pass a metric with the method label name to the middleware; and not have any firewall/LB/proxy that filters away requests with unknown methods. client_golang version 1.11.1 contains a patch for this issue. Several workarounds are available, including removing the method label name from the counter/gauge used in the InstrumentHandler; turning off affected promhttp handlers; adding custom middleware before the promhttp handler that will sanitize the request method given by Go http.Request; and using a reverse proxy or web application firewall, configured to only allow a limited set of methods.
We have custom informers set up for InstaScale to watch events on AppWrappers. As far as I can tell, the informer logic isn't doing anything special -- we should be able to just rely on the default reconciliation loop. Using the inbuilt controller-runtime reconciler should suffice. Refer to https://github.com/opendatahub-io/odh-model-controller/blob/main/controllers/inferenceservice_controller.go#L93 for a simple example.
This will reduce the complexity of our code.
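A minimal sketch of what the switch could look like, assuming the MCAD v1beta1 AppWrapper API that InstaScale already imports (import path shown as used by MCAD; adjust if it differs):

import (
	"context"

	arbv1 "github.com/project-codeflare/multi-cluster-app-dispatcher/pkg/apis/controller/v1beta1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

type AppWrapperReconciler struct {
	client.Client
}

// Reconcile replaces the onAdd/onUpdate informer callbacks: every change to
// an AppWrapper lands here through the default work queue.
func (r *AppWrapperReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	aw := &arbv1.AppWrapper{}
	if err := r.Get(ctx, req.NamespacedName, aw); err != nil {
		// Deleted AppWrappers simply drop out of the queue.
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}
	// Scaling decisions previously made in the informer handlers move here.
	return ctrl.Result{}, nil
}

func (r *AppWrapperReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&arbv1.AppWrapper{}).
		Complete(r)
}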
InstaScale crashes when receiving an AppWrapper that does not have specified machine types (which is the case in any AppWrapper not intending to use InstaScale)
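The panic in the logs above is an index-out-of-range on element 0 of an empty list, so a guard along these lines would let such AppWrappers be skipped instead of crashing the pod; the function name and shape are illustrative, not the actual fix:

// firstInstanceType returns the first requested machine type, or reports
// that the AppWrapper carries none so the caller can skip scaling.
func firstInstanceType(instanceTypes []string) (string, bool) {
	if len(instanceTypes) == 0 {
		// No machine types specified: nothing for InstaScale to do.
		return "", false
	}
	return instanceTypes[0], true
}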
Create basic tests that can be run against an OpenShift cluster to verify InstaScale functionality.
We should make sure to verify both MachineSet and MachinePool functionality.
For now, these tests would assume that you already have an OpenShift cluster spun up and configured correctly for InstaScale functionality. A follow-on to this will be running the tests in CI.
The MachinePool API returns success once the desired OpenShift objects are created. There is a high chance that the node acquisition step fails on the cloud provider side. We need to investigate whether the MachinePool API performs retries to get the desired replicas.
Frameworks like Ray and Spark have autoscaling capability; there was a feature request from a user to tie framework autoscaling to InstaScale cluster autoscaling.
@KPostOffice should we close this issue in a different repo: openshift/machine-api-operator#1138
We have a large number of variables unnecessarily initialized in the Reconcile loop. We should move these initializations out.
When aggregated resources are being acquired, labels are reapplied as each of the resources is obtained. A fix would be to acquire all the resources first and apply labels only when the aggregated resources are in the READY state.
A container registry is needed to store InstaScale images.
Add logging in source code with desired level and log persistence
Using Twistlock or another container vulnerability scanner, check the quay.io/project-codeflare/instascale-controller:v0.0.3 image for the latest vulnerabilities.
When InstaScale scales up an m5.xlarge or g4dn.xlarge machine type and 16G of memory is requested, the jobs fail because only 15.35G is actually available.
Infrastructure resources acquired from the cloud provider are released as soon as the job is completed. If jobs with the same user resource requirements are pending in the queue, such infrastructure resources could be reused, saving time before the next pending workload can run.
Update https://github.com/project-codeflare/instascale/blob/main/controllers/appwrapper_controller.go#L172 - we want to fail the reconciliation and update the status of the reconciled resource, e.g. in a condition, so it is clear to the end user that there is an issue with the secret configuration.
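A hedged sketch of the kind of change meant here, assuming the reconciled resource exposes metav1-style status conditions; the condition type and reason strings are illustrative:

import (
	"fmt"

	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	ctrl "sigs.k8s.io/controller-runtime"
)

// failWithSecretCondition records why reconciliation failed on the resource
// status and returns the error so the reconcile itself is marked as failed.
func failWithSecretCondition(conditions *[]metav1.Condition, err error) (ctrl.Result, error) {
	meta.SetStatusCondition(conditions, metav1.Condition{
		Type:    "SecretConfigured", // illustrative condition type
		Status:  metav1.ConditionFalse,
		Reason:  "OCMSecretMissing", // illustrative reason
		Message: fmt.Sprintf("unable to read OCM secret: %v", err),
	})
	// The caller is expected to update the status subresource before returning.
	return ctrl.Result{}, err
}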
Follow on to #83 #81
Automated strategy to cut releases for InstaScale
InstaScale will not function in managed OpenShift environments such as ROSA or OSD. This is due to the following:
We will most likely need to figure out how to interact with the OCM API to update "machine pools". This way Hive itself would control the nodes that are being created/deleted.
I just realized that the InstaScale pod is crashing and restarting itself after I issue a cluster.up(). My current cluster config is the following:
cluster = Cluster(ClusterConfiguration(name='jim-mnisttest', min_worker=2, max_worker=2, min_cpus=2, max_cpus=4, min_memory=8, max_memory=8, gpu=0, instascale=False, auth=auth))
And after the cluster.up() is submitted, if I follow the InstaScale pod, I see it panic and then restart, like this:
oc logs -f instascale-9dcf85dcf-9cfzc
I0223 19:48:50.119654 1 request.go:665] Waited for 1.033932486s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/apis/discovery.k8s.io/v1?timeout=32s
1.6771817321736054e+09 INFO controller-runtime.metrics Metrics server is starting to listen {"addr": ":8080"}
1.677181732174286e+09 INFO setup starting manager
1.6771817321773903e+09 INFO Starting server {"path": "/metrics", "kind": "metrics", "addr": "[::]:8080"}
1.6771817321773977e+09 INFO Starting server {"kind": "health probe", "addr": "[::]:8081"}
1.67718173217752e+09 INFO controller.appwrapper Starting EventSource {"reconciler group": "mcad.ibm.com", "reconciler kind": "AppWrapper", "source": "kind source: *v1beta1.AppWrapper"}
1.6771817321775486e+09 INFO controller.appwrapper Starting Controller {"reconciler group": "mcad.ibm.com", "reconciler kind": "AppWrapper"}
1.6771817322781909e+09 INFO controller.appwrapper Starting workers {"reconciler group": "mcad.ibm.com", "reconciler kind": "AppWrapper", "worker count": 1}
I0223 19:48:52.281672 1 appwrapper_controller.go:129] Got config map named: instascale-config that configures max nodes in cluster to value 15
I0223 19:48:52.384790 1 appwrapper_controller.go:223] Found Appwrapper named jim-mnisttest that has status Running
I0223 19:50:06.416947 1 appwrapper_controller.go:420] Appwrapper deleted scale-down machineset: jim-mnisttest
I0223 19:50:30.335775 1 appwrapper_controller.go:223] Found Appwrapper named jim-mnisttest that has status
E0223 19:50:30.335862 1 runtime.go:78] Observed a panic: runtime.boundsError{x:0, y:0, signed:true, code:0x0} (runtime error: index out of range [0] with length 0)
goroutine 490 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic({0x15eeba0, 0xc000175770})
/go/pkg/mod/k8s.io/[email protected]/pkg/util/runtime/runtime.go:74 +0x7d
k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0xc00043e898})
/go/pkg/mod/k8s.io/[email protected]/pkg/util/runtime/runtime.go:48 +0x75
panic({0x15eeba0, 0xc000175770})
/usr/local/go/src/runtime/panic.go:1038 +0x215
github.com/project-codeflare/instascale/controllers.discoverInstanceTypes(0xc000501800)
/workspace/controllers/appwrapper_controller.go:287 +0x285
github.com/project-codeflare/instascale/controllers.onAdd({0x16970c0, 0xc000501800})
/workspace/controllers/appwrapper_controller.go:226 +0x11e
k8s.io/client-go/tools/cache.ResourceEventHandlerFuncs.OnAdd(...)
/go/pkg/mod/k8s.io/[email protected]/tools/cache/controller.go:231
k8s.io/client-go/tools/cache.FilteringResourceEventHandler.OnAdd({0x1798b20, {0x18e63f8, 0xc000116540}}, {0x16970c0, 0xc000501800})
/go/pkg/mod/k8s.io/[email protected]/tools/cache/controller.go:264 +0x64
k8s.io/client-go/tools/cache.(*processorListener).run.func1()
/go/pkg/mod/k8s.io/[email protected]/tools/cache/shared_informer.go:787 +0x9f
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x7fd823cf9fb8)
/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:155 +0x67
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc00007cf38, {0x18bfba0, 0xc00016bc50}, 0x1, 0xc0006167e0)
/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:156 +0xb6
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc0000a8cc0, 0x3b9aca00, 0x0, 0x57, 0xc00007cf88)
/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:133 +0x89
k8s.io/apimachinery/pkg/util/wait.Until(...)
/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:90
k8s.io/client-go/tools/cache.(*processorListener).run(0xc00004c280)
/go/pkg/mod/k8s.io/[email protected]/tools/cache/shared_informer.go:781 +0x6b
k8s.io/apimachinery/pkg/util/wait.(*Group).Start.func1()
/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:73 +0x5a
created by k8s.io/apimachinery/pkg/util/wait.(*Group).Start
/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:71 +0x88
panic: runtime error: index out of range [0] with length 0 [recovered]
panic: runtime error: index out of range [0] with length 0
goroutine 490 [running]:
k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0xc00043e898})
/go/pkg/mod/k8s.io/[email protected]/pkg/util/runtime/runtime.go:55 +0xd8
panic({0x15eeba0, 0xc000175770})
/usr/local/go/src/runtime/panic.go:1038 +0x215
github.com/project-codeflare/instascale/controllers.discoverInstanceTypes(0xc000501800)
/workspace/controllers/appwrapper_controller.go:287 +0x285
github.com/project-codeflare/instascale/controllers.onAdd({0x16970c0, 0xc000501800})
/workspace/controllers/appwrapper_controller.go:226 +0x11e
k8s.io/client-go/tools/cache.ResourceEventHandlerFuncs.OnAdd(...)
/go/pkg/mod/k8s.io/[email protected]/tools/cache/controller.go:231
k8s.io/client-go/tools/cache.FilteringResourceEventHandler.OnAdd({0x1798b20, {0x18e63f8, 0xc000116540}}, {0x16970c0, 0xc000501800})
/go/pkg/mod/k8s.io/[email protected]/tools/cache/controller.go:264 +0x64
k8s.io/client-go/tools/cache.(*processorListener).run.func1()
/go/pkg/mod/k8s.io/[email protected]/tools/cache/shared_informer.go:787 +0x9f
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x7fd823cf9fb8)
/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:155 +0x67
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc00007cf38, {0x18bfba0, 0xc00016bc50}, 0x1, 0xc0006167e0)
/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:156 +0xb6
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc0000a8cc0, 0x3b9aca00, 0x0, 0x57, 0xc00007cf88)
/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:133 +0x89
k8s.io/apimachinery/pkg/util/wait.Until(...)
/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:90
k8s.io/client-go/tools/cache.(*processorListener).run(0xc00004c280)
/go/pkg/mod/k8s.io/[email protected]/tools/cache/shared_informer.go:781 +0x6b
k8s.io/apimachinery/pkg/util/wait.(*Group).Start.func1()
/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:73 +0x5a
created by k8s.io/apimachinery/pkg/util/wait.(*Group).Start
/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:71 +0x88
I'm still investigating and will post results here. Want to see if this happens with instascale=True, for example.
InstaScale always searches for a ConfigMap in the kube-system namespace. The controller should be updated to accept a parameter allowing users to specify which namespace the ConfigMap is located in.
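A small sketch of that parameter, assuming the controller already parses flags in its main package; the flag name and default are illustrative, with kube-system as the default to keep the current behaviour:

package main

import "flag"

// configMapNamespace lets deployments point InstaScale at the namespace
// holding the instascale-config ConfigMap instead of hard-coding kube-system.
var configMapNamespace = flag.String(
	"configmaps-namespace",
	"kube-system",
	"namespace containing the instascale-config ConfigMap",
)

func main() {
	flag.Parse()
	// Pass *configMapNamespace down to wherever the ConfigMap is read.
	_ = *configMapNamespace
}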
InstaScale has integration with MachineSets. In general, this should work without any changes; we should create an OpenShift cluster on VMware infrastructure and perform tests.
The latest Red Hat supported UBI for Go is 1.18. We should upgrade the Go version used by InstaScale to bring it more in line with the CodeFlare Operator.
Add usage and deployment documentation for InstaScale
Enable automated builds on target releases
The current methodology for supplying machine types to be used with InstaScale is by adding a MachineSet template. It may be a better UX to instead define these machine types in a ConfigMap.
Design and test InstaScale interaction with on-prem environments or providers
Test InstaScale integration with MCAD quota management