project-codeflare / instascale
On-demand Kubernetes/OpenShift cluster scaling and aggregated resource provisioning
License: Apache License 2.0
I noticed that when I submit a sample AppWrapper, it causes a CrashLoopBackOff of the InstaScale pod.
oc get pods -n opendatahub |grep insta
instascale-instascale-6bb58b6559-m8snx 0/1 CrashLoopBackOff 62 (2m20s ago) 16h
And in the log I see:
I0322 18:03:14.944406 1 appwrapper_controller.go:409] Pending AppWrapper defaultaw-schd-spec-with-timeout-104 needs scaling
E0322 18:03:14.944519 1 runtime.go:78] Observed a panic: runtime.boundsError{x:0, y:0, signed:true, code:0x0} (runtime error: index out of range [0] with length 0)
goroutine 348 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic({0x1641a80?, 0xc000b4db78})
/go/pkg/mod/k8s.io/[email protected]/pkg/util/runtime/runtime.go:74 +0x86
k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0x415ed0?})
/go/pkg/mod/k8s.io/[email protected]/pkg/util/runtime/runtime.go:48 +0x75
panic({0x1641a80, 0xc000b4db78})
/usr/local/go/src/runtime/panic.go:884 +0x213
github.com/project-codeflare/instascale/controllers.discoverInstanceTypes(0xc000e98c00)
/workspace/controllers/appwrapper_controller.go:287 +0x265
github.com/project-codeflare/instascale/controllers.onUpdate({0x14f0360?, 0xc000e5d8f0?}, {0x16ed9c0?, 0xc000e98c00?})
/workspace/controllers/appwrapper_controller.go:262 +0x17c
k8s.io/client-go/tools/cache.ResourceEventHandlerFuncs.OnUpdate(...)
/go/pkg/mod/k8s.io/[email protected]/tools/cache/controller.go:238
k8s.io/client-go/tools/cache.FilteringResourceEventHandler.OnUpdate({0x18081e0?, {0x1958fa0?, 0xc0009e8918?}}, {0x16ed9c0, 0xc000dd2c00}, {0x16ed9c0, 0xc000e98c00})
/go/pkg/mod/k8s.io/[email protected]/tools/cache/controller.go:273 +0xe2
k8s.io/client-go/tools/cache.(*processorListener).run.func1()
/go/pkg/mod/k8s.io/[email protected]/tools/cache/shared_informer.go:785 +0xf7
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x30?)
/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:155 +0x3e
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc0000b6f38?, {0x1944860, 0xc0005a6030}, 0x1, 0xc000a7a060)
/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:156 +0xb6
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0x0?, 0x3b9aca00, 0x0, 0x31?, 0xc0000b6fb0?)
/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:133 +0x89
k8s.io/apimachinery/pkg/util/wait.Until(...)
/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:90
k8s.io/client-go/tools/cache.(*processorListener).run(0xc000206300)
/go/pkg/mod/k8s.io/[email protected]/tools/cache/shared_informer.go:781 +0x6b
k8s.io/apimachinery/pkg/util/wait.(*Group).Start.func1()
/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:73 +0x5a
created by k8s.io/apimachinery/pkg/util/wait.(*Group).Start
/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:71 +0x85
panic: runtime error: index out of range [0] with length 0 [recovered]
panic: runtime error: index out of range [0] with length 0
goroutine 348 [running]:
k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0x415ed0?})
/go/pkg/mod/k8s.io/[email protected]/pkg/util/runtime/runtime.go:55 +0xd7
panic({0x1641a80, 0xc000b4db78})
/usr/local/go/src/runtime/panic.go:884 +0x213
github.com/project-codeflare/instascale/controllers.discoverInstanceTypes(0xc000e98c00)
/workspace/controllers/appwrapper_controller.go:287 +0x265
github.com/project-codeflare/instascale/controllers.onUpdate({0x14f0360?, 0xc000e5d8f0?}, {0x16ed9c0?, 0xc000e98c00?})
/workspace/controllers/appwrapper_controller.go:262 +0x17c
k8s.io/client-go/tools/cache.ResourceEventHandlerFuncs.OnUpdate(...)
/go/pkg/mod/k8s.io/[email protected]/tools/cache/controller.go:238
k8s.io/client-go/tools/cache.FilteringResourceEventHandler.OnUpdate({0x18081e0?, {0x1958fa0?, 0xc0009e8918?}}, {0x16ed9c0, 0xc000dd2c00}, {0x16ed9c0, 0xc000e98c00})
/go/pkg/mod/k8s.io/[email protected]/tools/cache/controller.go:273 +0xe2
k8s.io/client-go/tools/cache.(*processorListener).run.func1()
/go/pkg/mod/k8s.io/[email protected]/tools/cache/shared_informer.go:785 +0xf7
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x30?)
/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:155 +0x3e
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc0000b6f38?, {0x1944860, 0xc0005a6030}, 0x1, 0xc000a7a060)
/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:156 +0xb6
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0x0?, 0x3b9aca00, 0x0, 0x31?, 0xc0000b6fb0?)
/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:133 +0x89
k8s.io/apimachinery/pkg/util/wait.Until(...)
/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:90
k8s.io/client-go/tools/cache.(*processorListener).run(0xc000206300)
/go/pkg/mod/k8s.io/[email protected]/tools/cache/shared_informer.go:781 +0x6b
k8s.io/apimachinery/pkg/util/wait.(*Group).Start.func1()
/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:73 +0x5a
created by k8s.io/apimachinery/pkg/util/wait.(*Group).Start
/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:71 +0x85
Existing format: orderedinstance: m5.4xlarge_g4dn.xlarge
Proposed format: orderedinstance: m5.4xlarge_g4dn.xlarge:255
We currently use _ to separate instance types. Our proposal is to add :<int> as a valid portion of the string to allow users to easily specify the number of nodes they require of a given type.
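A minimal sketch of how the proposed suffix could be parsed; the function and type names are illustrative, not the actual InstaScale code, and a missing suffix is treated here as one node (an assumption):

package main

import (
	"fmt"
	"strconv"
	"strings"
)

// instanceRequest pairs an instance type with the number of nodes requested.
type instanceRequest struct {
	InstanceType string
	Count        int
}

// parseOrderedInstances splits the label value on "_" and then on an
// optional ":<int>" suffix; a missing suffix is assumed to mean one node.
func parseOrderedInstances(label string) ([]instanceRequest, error) {
	var requests []instanceRequest
	for _, part := range strings.Split(label, "_") {
		if part == "" {
			continue
		}
		req := instanceRequest{InstanceType: part, Count: 1}
		if idx := strings.LastIndex(part, ":"); idx >= 0 {
			count, err := strconv.Atoi(part[idx+1:])
			if err != nil {
				return nil, fmt.Errorf("invalid node count in %q: %w", part, err)
			}
			req.InstanceType = part[:idx]
			req.Count = count
		}
		requests = append(requests, req)
	}
	return requests, nil
}

func main() {
	reqs, _ := parseOrderedInstances("m5.4xlarge_g4dn.xlarge:255")
	fmt.Println(reqs) // prints [{m5.4xlarge 1} {g4dn.xlarge 255}]
}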
InstaScale performance testing is needed with several queued jobs to determine the resources (CPU, memory) required to run it.
InstaScale in its current state gets aggregated resources from the provider sequentially. We need to enable InstaScale to send requests to the cloud provider in parallel so that aggregated resources are obtained faster.
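A sketch of the fan-out idea only: scaleInstanceType below is a hypothetical stand-in for the existing per-type provisioning call, not the real function.

package main

import (
	"context"
	"fmt"

	"golang.org/x/sync/errgroup"
)

// scaleInstanceType is a placeholder for the real call to the cloud provider.
func scaleInstanceType(ctx context.Context, instanceType string) error {
	fmt.Println("requesting", instanceType)
	return nil
}

// scaleAll issues one provisioning request per instance type concurrently
// instead of one after another.
func scaleAll(ctx context.Context, instanceTypes []string) error {
	g, ctx := errgroup.WithContext(ctx)
	for _, it := range instanceTypes {
		it := it // capture the loop variable for the goroutine
		g.Go(func() error {
			return scaleInstanceType(ctx, it)
		})
	}
	// Wait blocks until every request finishes and returns the first error.
	return g.Wait()
}

func main() {
	_ = scaleAll(context.Background(), []string{"m5.4xlarge", "g4dn.xlarge"})
}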
We are currently using a large number of API-specific objects to make client calls like get/put and so on. We should be able to do all of these using the default Kubernetes client. https://github.com/opendatahub-io/modelmesh-serving/blob/main/controllers/servingruntime_controller.go#L134-L151 has an example for secrets, but similar logic should work for MachineSets and the like.
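For example, the same generic controller-runtime Get call used in that secrets example should also work for MachineSets once the OpenShift machine API scheme is registered; this is only a sketch of the idea, not the actual refactor:

import (
	"context"

	machinev1beta1 "github.com/openshift/api/machine/v1beta1"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// getMachineSet fetches a MachineSet with the default controller-runtime
// client instead of an API-specific clientset.
func getMachineSet(ctx context.Context, c client.Client, namespace, name string) (*machinev1beta1.MachineSet, error) {
	ms := &machinev1beta1.MachineSet{}
	if err := c.Get(ctx, types.NamespacedName{Namespace: namespace, Name: name}, ms); err != nil {
		return nil, err
	}
	return ms, nil
}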
Manage the OCM secret via Kustomize and possibly mount it into the controller deployment. In the CodeFlare operator it should then be exposed in the InstaScale API as an object reference. See project-codeflare/codeflare-operator#148
Follow on to #81
The action should perform all steps described in #75.
Part of this task is to switch from defining the version in the VERSION file to providing the version as part of the GitHub action trigger, or retrieving the version from the tag.
also when there are 0 elements
Install InstaScale through the CodeFlare operator on OKD and make sure the machine* APIs work.
The repository is missing documentation providing instructions for releasing a new version. There is a brief mention of image creation in https://github.com/project-codeflare/instascale#image-creation , however it is not clear whether this is the only step needed to release a new version or whether other steps are required.
The README should be adjusted to describe the steps taken to release a new InstaScale version, possibly mentioning any external dependencies.
MachineSets acquire resources from the cloud provider. When an underlying node fails, such nodes should be released and new nodes should be added.
It is possible that the admin has two MachineSets of the same instance type configured under different names. It is important to identify such MachineSets, add them to the same group, and use one of them to get resources. If one member of the MachineSet group fails to provide resources within the desired time limit, another MachineSet from the group should be tried to acquire resources.
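A rough sketch of the grouping idea, assuming the instance type can be read directly from the MachineSet; in practice the lookup may need to decode the provider spec, and the label key below is only illustrative:

import machinev1beta1 "github.com/openshift/api/machine/v1beta1"

// groupByInstanceType buckets MachineSets that provide the same instance
// type so that a failing member can be swapped for another in its group.
func groupByInstanceType(machineSets []machinev1beta1.MachineSet) map[string][]machinev1beta1.MachineSet {
	groups := map[string][]machinev1beta1.MachineSet{}
	for _, ms := range machineSets {
		// Illustrative lookup; the real instance type may live in the
		// provider-specific part of the spec rather than in a label.
		instanceType := ms.Labels["machine.openshift.io/instance-type"]
		groups[instanceType] = append(groups[instanceType], ms)
	}
	return groups
}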
When a cloud provider fails to provision a machine, InstaScale should be able to handle it gracefully: either by removing the failed machine and retrying, or by scaling down the whole request if it is incomplete (or both, with a configurable timeout).
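A sketch of the configurable-timeout part, using the apimachinery wait helper InstaScale already depends on; machinesProvisioned is a hypothetical check, not an existing function:

import (
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// waitForMachines polls until all machines of a scale-up request are
// provisioned or the configurable timeout expires; on timeout the caller
// can delete the failed machines and retry, or scale the request down.
func waitForMachines(pollInterval, timeout time.Duration, machinesProvisioned func() (bool, error)) error {
	return wait.PollImmediate(pollInterval, timeout, machinesProvisioned)
}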
There needs to be documentation for users to understand how to set up InstaScale for actual use with the desired machine types (how to set up MachineSets and MachinePools).
A ROSA HyperShift cluster does not expose the MachineSet API, therefore InstaScale cannot contact it to scale up resources.
Currently the release GitHub action supports releasing from the main branch only.
The action should be able to release from any branch.
Explore the use of Kustomize or a Helm chart to deploy InstaScale.
An AppWrapper carries aggregated resource requests used to get instances for the workload. It may be the case that the cluster already has CPU resources available. Should the InstaScale controller gather partial requests in such a scenario?
While working through the quick start guide, the InstaScale logs showed that the instascale-ocm-secret could not be found.
We need to update the RBACs to include secrets and also report the related error correctly.
The quick start guide needs to be updated to include creation of the required secret.
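A minimal sketch of both pieces, assuming a kubebuilder-style controller: the RBAC marker grants read access to Secrets, and a NotFound error is turned into a clear message instead of an opaque failure. The secret name comes from this issue; the helper itself is hypothetical.

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

//+kubebuilder:rbac:groups="",resources=secrets,verbs=get;list;watch

// readOCMSecret fetches the OCM token secret and reports a missing secret
// with an actionable error message.
func readOCMSecret(ctx context.Context, c client.Client, namespace string) (*corev1.Secret, error) {
	secret := &corev1.Secret{}
	err := c.Get(ctx, types.NamespacedName{Namespace: namespace, Name: "instascale-ocm-secret"}, secret)
	if apierrors.IsNotFound(err) {
		return nil, fmt.Errorf("instascale-ocm-secret not found in namespace %q; create it as described in the quick start guide: %w", namespace, err)
	}
	return secret, err
}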
We need to set up an automated test setup that will run on each user-submitted PR.
Much of the Makefile content, as well as the entirety of the config and bin dirs (alongside other miscellaneous pieces), exists for potential future development with operator-sdk and is not currently utilized.
While bundle and catalog images can successfully be built with the Makefile, they currently hold no use.
I used Twistlock to scan the instascale-controller:latest image, and it has a few known vulnerabilities:
6 total vulnerabilities
4 High
2 Medium
I tried changing the Dockerfile to use the latest golang image:
#FROM golang:1.17 as builder
FROM golang:1.20.1 as builder
And that fixed all the vulnerabilities but one: https://nvd.nist.gov/vuln/detail/CVE-2022-21698
Looking to see how that can potentially be solved tomorrow. Full description of the issue is below:
packagePath = /manager
packageName = github.com/prometheus/client_golang
packageVersion = v1.11.0
status = fixed in 1.11.1
description = client_golang is the instrumentation library for Go applications in Prometheus, and the promhttp package in client_golang provides tooling around HTTP servers and clients. In client_golang prior to version 1.11.1, the HTTP server is susceptible to a Denial of Service through unbounded cardinality, and potential memory exhaustion, when handling requests with non-standard HTTP methods. In order to be affected, an instrumented software must use any of the promhttp.InstrumentHandler* middleware except RequestsInFlight; not filter any specific methods (e.g. GET) before the middleware; pass a metric with the method label name to the middleware; and not have any firewall/LB/proxy that filters away requests with unknown methods. client_golang version 1.11.1 contains a patch for this issue. Several workarounds are available, including removing the method label name from the counter/gauge used in the InstrumentHandler; turning off affected promhttp handlers; adding custom middleware before the promhttp handler that will sanitize the request method given by Go http.Request; and using a reverse proxy or web application firewall, configured to only allow a limited set of methods.
We have custom informers set up for InstaScale to watch events on AppWrappers. As far as I can tell, the informer logic isn't doing anything special -- we should be able to just rely on the default reconciliation loop. Using the inbuilt controller-runtime reconciler should suffice. Refer to https://github.com/opendatahub-io/odh-model-controller/blob/main/controllers/inferenceservice_controller.go#L93 for a simple example.
This will reduce the complexity of our code.
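A minimal sketch of what the switch could look like, assuming the MCAD v1beta1 AppWrapper API that InstaScale already imports (import path shown as used by MCAD; adjust if it differs):

import (
	"context"

	arbv1 "github.com/project-codeflare/multi-cluster-app-dispatcher/pkg/apis/controller/v1beta1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

type AppWrapperReconciler struct {
	client.Client
}

// Reconcile replaces the onAdd/onUpdate informer callbacks: every change to
// an AppWrapper lands here through the default work queue.
func (r *AppWrapperReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	aw := &arbv1.AppWrapper{}
	if err := r.Get(ctx, req.NamespacedName, aw); err != nil {
		// Deleted AppWrappers simply drop out of the queue.
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}
	// Scaling decisions previously made in the informer handlers move here.
	return ctrl.Result{}, nil
}

func (r *AppWrapperReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&arbv1.AppWrapper{}).
		Complete(r)
}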
InstaScale crashes when receiving an AppWrapper that does not have specified machine types (which is the case in any AppWrapper not intending to use InstaScale)
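The panic in the logs above is an index-out-of-range on element 0 of an empty list, so a guard along these lines would let such AppWrappers be skipped instead of crashing the pod; the function name and shape are illustrative, not the actual fix:

// firstInstanceType returns the first requested machine type, or reports
// that the AppWrapper carries none so the caller can skip scaling.
func firstInstanceType(instanceTypes []string) (string, bool) {
	if len(instanceTypes) == 0 {
		// No machine types specified: nothing for InstaScale to do.
		return "", false
	}
	return instanceTypes[0], true
}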
Create basic tests that can be run against an OpenShift cluster to verify InstaScale functionality.
We should make sure to verify both MachineSet and MachinePool functionality.
For now, these tests would assume that you already have an OpenShift cluster spun up and configured correctly for InstaScale functionality. A follow-on to this will be running the tests in CI.
The MachinePool API returns success once the desired OpenShift objects are created. There is a high chance that the node acquisition step fails on the cloud provider side. We need to investigate whether the MachinePool API performs retries to get the desired replicas.
Frameworks like Ray and Spark have autoscaling capability; there was a feature request from a user to tie framework autoscaling to InstaScale cluster autoscaling.
@KPostOffice should we close this issue in a different repo: openshift/machine-api-operator#1138
We have a large number of variables unnecessarily initialized in the Reconcile loop. We should move these initializations out.
When aggregated resources are being acquired, labels are reapplied as each of the resources is obtained. A fix would be to acquire all the resources first and apply labels only when the aggregated resources are in the READY state.
A container registry is needed to store InstaScale images.
Add logging in source code with desired level and log persistence
Using Twistlock or another container vulnerability scanner, check the quay.io/project-codeflare/instascale-controller:v0.0.3 image for the latest vulnerabilities.
When InstaScale scales up an m5.xlarge or g4dn.xlarge machine type and 16G of memory is requested, the jobs fail because only 15.35G is actually available.
Infrastructure resources acquired from the cloud provider are released as soon as the job is completed. If jobs with the same user resource requirements are pending in the queue, such infrastructure resources could be reused, saving time before the next pending workload can run.
Update https://github.com/project-codeflare/instascale/blob/main/controllers/appwrapper_controller.go#L172 - we want to fail the reconciliation and update the status of the reconciled resource, e.g. in a condition, so it is clear to the end user that there is an issue with the secret configuration.
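A hedged sketch of the kind of change meant here, assuming the reconciled resource exposes metav1-style status conditions; the condition type and reason strings are illustrative:

import (
	"fmt"

	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	ctrl "sigs.k8s.io/controller-runtime"
)

// failWithSecretCondition records why reconciliation failed on the resource
// status and returns the error so the reconcile itself is marked as failed.
func failWithSecretCondition(conditions *[]metav1.Condition, err error) (ctrl.Result, error) {
	meta.SetStatusCondition(conditions, metav1.Condition{
		Type:    "SecretConfigured", // illustrative condition type
		Status:  metav1.ConditionFalse,
		Reason:  "OCMSecretMissing", // illustrative reason
		Message: fmt.Sprintf("unable to read OCM secret: %v", err),
	})
	// The caller is expected to update the status subresource before returning.
	return ctrl.Result{}, err
}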
Follow on to #83 #81
Automated strategy to cut releases for InstaScale
InstaScale will not function in managed OpenShift environments such as ROSA or OSD. This is due to the following:
We will most likely need to figure out how to interact with the OCM API to update "machine pools". This way Hive itself would control the nodes that are being created/deleted.
I just realized that the InstaScale pod is crashing and restarting itself after I issue a cluster.up(). My current cluster config is the following:
cluster = Cluster(ClusterConfiguration(name='jim-mnisttest', min_worker=2, max_worker=2, min_cpus=2, max_cpus=4, min_memory=8, max_memory=8, gpu=0, instascale=False, auth=auth))
And after the cluster.up() is submitted, if I follow the InstaScale pod, I see it panic and then restart, like this:
oc logs -f instascale-9dcf85dcf-9cfzc
I0223 19:48:50.119654 1 request.go:665] Waited for 1.033932486s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/apis/discovery.k8s.io/v1?timeout=32s
1.6771817321736054e+09 INFO controller-runtime.metrics Metrics server is starting to listen {"addr": ":8080"}
1.677181732174286e+09 INFO setup starting manager
1.6771817321773903e+09 INFO Starting server {"path": "/metrics", "kind": "metrics", "addr": "[::]:8080"}
1.6771817321773977e+09 INFO Starting server {"kind": "health probe", "addr": "[::]:8081"}
1.67718173217752e+09 INFO controller.appwrapper Starting EventSource {"reconciler group": "mcad.ibm.com", "reconciler kind": "AppWrapper", "source": "kind source: *v1beta1.AppWrapper"}
1.6771817321775486e+09 INFO controller.appwrapper Starting Controller {"reconciler group": "mcad.ibm.com", "reconciler kind": "AppWrapper"}
1.6771817322781909e+09 INFO controller.appwrapper Starting workers {"reconciler group": "mcad.ibm.com", "reconciler kind": "AppWrapper", "worker count": 1}
I0223 19:48:52.281672 1 appwrapper_controller.go:129] Got config map named: instascale-config that configures max nodes in cluster to value 15
I0223 19:48:52.384790 1 appwrapper_controller.go:223] Found Appwrapper named jim-mnisttest that has status Running
I0223 19:50:06.416947 1 appwrapper_controller.go:420] Appwrapper deleted scale-down machineset: jim-mnisttest
I0223 19:50:30.335775 1 appwrapper_controller.go:223] Found Appwrapper named jim-mnisttest that has status
E0223 19:50:30.335862 1 runtime.go:78] Observed a panic: runtime.boundsError{x:0, y:0, signed:true, code:0x0} (runtime error: index out of range [0] with length 0)
goroutine 490 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic({0x15eeba0, 0xc000175770})
/go/pkg/mod/k8s.io/[email protected]/pkg/util/runtime/runtime.go:74 +0x7d
k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0xc00043e898})
/go/pkg/mod/k8s.io/[email protected]/pkg/util/runtime/runtime.go:48 +0x75
panic({0x15eeba0, 0xc000175770})
/usr/local/go/src/runtime/panic.go:1038 +0x215
github.com/project-codeflare/instascale/controllers.discoverInstanceTypes(0xc000501800)
/workspace/controllers/appwrapper_controller.go:287 +0x285
github.com/project-codeflare/instascale/controllers.onAdd({0x16970c0, 0xc000501800})
/workspace/controllers/appwrapper_controller.go:226 +0x11e
k8s.io/client-go/tools/cache.ResourceEventHandlerFuncs.OnAdd(...)
/go/pkg/mod/k8s.io/[email protected]/tools/cache/controller.go:231
k8s.io/client-go/tools/cache.FilteringResourceEventHandler.OnAdd({0x1798b20, {0x18e63f8, 0xc000116540}}, {0x16970c0, 0xc000501800})
/go/pkg/mod/k8s.io/[email protected]/tools/cache/controller.go:264 +0x64
k8s.io/client-go/tools/cache.(*processorListener).run.func1()
/go/pkg/mod/k8s.io/[email protected]/tools/cache/shared_informer.go:787 +0x9f
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x7fd823cf9fb8)
/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:155 +0x67
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc00007cf38, {0x18bfba0, 0xc00016bc50}, 0x1, 0xc0006167e0)
/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:156 +0xb6
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc0000a8cc0, 0x3b9aca00, 0x0, 0x57, 0xc00007cf88)
/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:133 +0x89
k8s.io/apimachinery/pkg/util/wait.Until(...)
/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:90
k8s.io/client-go/tools/cache.(*processorListener).run(0xc00004c280)
/go/pkg/mod/k8s.io/[email protected]/tools/cache/shared_informer.go:781 +0x6b
k8s.io/apimachinery/pkg/util/wait.(*Group).Start.func1()
/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:73 +0x5a
created by k8s.io/apimachinery/pkg/util/wait.(*Group).Start
/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:71 +0x88
panic: runtime error: index out of range [0] with length 0 [recovered]
panic: runtime error: index out of range [0] with length 0
goroutine 490 [running]:
k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0xc00043e898})
/go/pkg/mod/k8s.io/[email protected]/pkg/util/runtime/runtime.go:55 +0xd8
panic({0x15eeba0, 0xc000175770})
/usr/local/go/src/runtime/panic.go:1038 +0x215
github.com/project-codeflare/instascale/controllers.discoverInstanceTypes(0xc000501800)
/workspace/controllers/appwrapper_controller.go:287 +0x285
github.com/project-codeflare/instascale/controllers.onAdd({0x16970c0, 0xc000501800})
/workspace/controllers/appwrapper_controller.go:226 +0x11e
k8s.io/client-go/tools/cache.ResourceEventHandlerFuncs.OnAdd(...)
/go/pkg/mod/k8s.io/[email protected]/tools/cache/controller.go:231
k8s.io/client-go/tools/cache.FilteringResourceEventHandler.OnAdd({0x1798b20, {0x18e63f8, 0xc000116540}}, {0x16970c0, 0xc000501800})
/go/pkg/mod/k8s.io/[email protected]/tools/cache/controller.go:264 +0x64
k8s.io/client-go/tools/cache.(*processorListener).run.func1()
/go/pkg/mod/k8s.io/[email protected]/tools/cache/shared_informer.go:787 +0x9f
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x7fd823cf9fb8)
/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:155 +0x67
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc00007cf38, {0x18bfba0, 0xc00016bc50}, 0x1, 0xc0006167e0)
/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:156 +0xb6
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc0000a8cc0, 0x3b9aca00, 0x0, 0x57, 0xc00007cf88)
/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:133 +0x89
k8s.io/apimachinery/pkg/util/wait.Until(...)
/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:90
k8s.io/client-go/tools/cache.(*processorListener).run(0xc00004c280)
/go/pkg/mod/k8s.io/[email protected]/tools/cache/shared_informer.go:781 +0x6b
k8s.io/apimachinery/pkg/util/wait.(*Group).Start.func1()
/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:73 +0x5a
created by k8s.io/apimachinery/pkg/util/wait.(*Group).Start
/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:71 +0x88
I'm still investigating and will post results here. Want to see if this happens with instascale=True, for example.
InstaScale always searches for a ConfigMap in the kube-system namespace. The controller should be updated to accept a parameter allowing users to specify which namespace the ConfigMap is located in.
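A small sketch of that parameter, assuming the controller already parses flags in its main package; the flag name and default are illustrative, with kube-system as the default to keep the current behaviour:

package main

import "flag"

// configMapNamespace lets deployments point InstaScale at the namespace
// holding the instascale-config ConfigMap instead of hard-coding kube-system.
var configMapNamespace = flag.String(
	"configmaps-namespace",
	"kube-system",
	"namespace containing the instascale-config ConfigMap",
)

func main() {
	flag.Parse()
	// Pass *configMapNamespace down to wherever the ConfigMap is read.
	_ = *configMapNamespace
}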
InstaScale has integration with MachineSets. In general, this should work without any changes; we should create an OpenShift cluster on VMware infrastructure and perform tests.
The latest Red Hat supported UBI for Go is 1.18. We should upgrade the Go version used by InstaScale to bring it more in line with the CodeFlare Operator.
Add usage and deployment documentation for InstaScale
Enable automated builds on target releases
The current methodology for supplying machine types to be used with InstaScale is by adding a MachineSet template. It may be a better UX to instead define these machine types in a ConfigMap.
Design and test InstaScale interaction with on-prem environments or providers
Test InstaScale integration with MCAD quota management