operator-framework / operator-lib
Helpers for Operator developers.
This is a library to help Operator developers.
License: Apache License 2.0
Users might be able to use the upstream implementation instead. See: https://github.com/kubernetes/kubernetes/pull/92717/files
Is your feature request related to a problem? Please describe.
Yes. I found that the hard-coded max backoff interval of 16s was too long for production environments. It introduced unnecessary leaderless time during deployment.
Describe the solution you'd like
Make maxBackoffInterval configurable, which can be passed in via Become's Option.
What did you do?
I wanted to configure the Prune structure, but its log field is not exported and is never set.
What did you expect to see?
Either that the field is automatically populated or that it is exported.
What did you see instead? Under which circumstances?
A field that is not exported and not automatically set.
Possible Solution
Although I am not in favor of using global variables for loggers, things should be consistent inside a library. The other features provided by operator-lib, for instance the event handler and leader election, use a preset variable:
var log = logf.Log.WithName("event_handler")
var log = logf.Log.WithName("leader")
logf is a package of the controller-runtime library, which contains the logr.Logger named Log. This assumes that SetLogger was called earlier, which is the same expectation set by Kubebuilder/controller-runtime.
This makes the library less suitable for clients or other apps not based on controller-runtime, but again, things should be consistent.
Another thing I would like to mention (a second issue could be opened for it) is that the naming/structure could be improved. It seems a bit unconventional to have a method like Execute on a structure named Config. I would expect that on a structure called Pruner, with Config being what users provide to get a Pruner that fits their needs.
What did you do?
While building an operator using operator-lib v0.11.0 and controller-gen v0.15.0, the following build error is thrown:
# github.com/operator-framework/operator-lib/internal/annotation
../../gopath/pkg/mod/github.com/operator-framework/operator-lib@v0.11.0/internal/annotation/filter.go:90:15: cannot use func(evt event.CreateEvent, q workqueue.RateLimitingInterface) {…} (value of type func(evt event.CreateEvent, q workqueue.RateLimitingInterface)) as func(context.Context, event.CreateEvent, workqueue.RateLimitingInterface) value in struct literal
../../gopath/pkg/mod/github.com/operator-framework/operator-lib@v0.11.0/internal/annotation/filter.go:92:24: not enough arguments in call to f.hdlr.Create
have (event.CreateEvent, workqueue.RateLimitingInterface)
want (context.Context, event.CreateEvent, workqueue.RateLimitingInterface)
../../gopath/pkg/mod/github.com/operator-framework/operator-lib@v0.11.0/internal/annotation/filter.go:95:15: cannot use func(evt event.UpdateEvent, q workqueue.RateLimitingInterface) {…} (value of type func(evt event.UpdateEvent, q workqueue.RateLimitingInterface)) as func(context.Context, event.UpdateEvent, workqueue.RateLimitingInterface) value in struct literal
../../gopath/pkg/mod/github.com/operator-framework/operator-lib@v0.11.0/internal/annotation/filter.go:97:24: not enough arguments in call to f.hdlr.Update
have (event.UpdateEvent, workqueue.RateLimitingInterface)
want (context.Context, event.UpdateEvent, workqueue.RateLimitingInterface)
../../gopath/pkg/mod/github.com/operator-framework/operator-lib@v0.11.0/internal/annotation/filter.go:100:15: cannot use func(evt event.DeleteEvent, q workqueue.RateLimitingInterface) {…} (value of type func(evt event.DeleteEvent, q workqueue.RateLimitingInterface)) as func(context.Context, event.DeleteEvent, workqueue.RateLimitingInterface) value in struct literal
../../gopath/pkg/mod/github.com/operator-framework/operator-lib@v0.11.0/internal/annotation/filter.go:102:24: not enough arguments in call to f.hdlr.Delete
have (event.DeleteEvent, workqueue.RateLimitingInterface)
want (context.Context, event.DeleteEvent, workqueue.RateLimitingInterface)
../../gopath/pkg/mod/github.com/operator-framework/operator-lib@v0.11.0/internal/annotation/filter.go:105:16: cannot use func(evt event.GenericEvent, q workqueue.RateLimitingInterface) {…} (value of type func(evt event.GenericEvent, q workqueue.RateLimitingInterface)) as func(context.Context, event.GenericEvent, workqueue.RateLimitingInterface) value in struct literal
../../gopath/pkg/mod/github.com/operator-framework/operator-lib@v0.11.0/internal/annotation/filter.go:107:25: not enough arguments in call to f.hdlr.Generic
have (event.GenericEvent, workqueue.RateLimitingInterface)
want (context.Context, event.GenericEvent, workqueue.RateLimitingInterface)
Possible Solution
I think this was solved in #114, but we need a new release of the package.
Is your feature request related to a problem? Please describe.
I'm migrating an operator to Operator SDK 1.x and following the migration guide, which suggests using the operator-lib/handler package to have the enqueue methods instrumented with metrics. The operator I'm migrating needs to watch a secondary resource whose instances I used to enqueue with controller-runtime's handler.EnqueueRequestForOwner. I don't see an equivalent instrumented function in operator-lib's handler package.
Describe the solution you'd like
I'm wondering whether there is an easy way to achieve what I just described. Otherwise, I'd like to see a method InstrumentedEnqueueRequestForOwner in the handler package that instruments the enqueue action for secondary resources with metrics, similar to InstrumentedEnqueueRequestForObject.
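A rough sketch of the wrapping involved, using stub types; the real version would wrap controller-runtime's handler.EnqueueRequestForOwner and record the same Prometheus metric that InstrumentedEnqueueRequestForObject records. The instrumented type and its counter here are illustrative only:

```go
package main

import "fmt"

// Stub event and handler types standing in for controller-runtime's.
type Event struct{ Name string }

type Handler interface{ Create(Event) }

// enqueueForOwner stands in for handler.EnqueueRequestForOwner.
type enqueueForOwner struct{ created []string }

func (e *enqueueForOwner) Create(evt Event) { e.created = append(e.created, evt.Name) }

// instrumented wraps a delegate Handler and counts enqueues, mirroring
// the requested (hypothetical) InstrumentedEnqueueRequestForOwner.
type instrumented struct {
	delegate Handler
	enqueues int
}

func (i *instrumented) Create(evt Event) {
	i.enqueues++ // record the metric before delegating
	i.delegate.Create(evt)
}

func main() {
	inner := &enqueueForOwner{}
	h := &instrumented{delegate: inner}
	h.Create(Event{Name: "secondary-resource"})
	fmt.Println(h.enqueues, len(inner.created))
}
```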
Is your feature request related to a problem? Please describe.
From leader/leader.go, leader re-election happens only after the default 5-minute timeout, since the leader pod is deleted only when the condition Pod.Status.Phase == "Failed" && Pod.Status.Reason == "Evicted" holds after a worker node fails.
In my opinion, leader re-election could happen almost immediately if the condition also checked the status of the node where the leader pod is running.
Describe the solution you'd like
Check the condition of the node where the leader pod is running ([Node.Type == "NodeReady" && Node.Status != "ConditionTrue"], i.e. the node's Ready condition is not True). When the node has failed, delete the leader pod (which only gets marked 'Terminating', because the node it runs on has failed) and the ConfigMap lock whose OwnerReference is the leader pod.
The pictures below show the test of the node check for leader re-election (test-operator-xxx-xxxqb was the leader pod).
Making --pod-eviction-timeout short could be another approach. However, I am sure the approach above brings more reliability, since we don't know the appropriate timeout.
Also, are there any drawbacks to making --pod-eviction-timeout very short?
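A minimal sketch of the proposed node check, with stand-in types for the corev1.Node condition fields; the exact integration point in leader.go is left out:

```go
package main

import "fmt"

// Minimal stand-ins for corev1.Node condition fields.
type NodeCondition struct {
	Type   string
	Status string
}

type Node struct{ Conditions []NodeCondition }

// isNodeUnreachable reports whether the node's Ready condition is
// anything other than "True", i.e. the proposed trigger for deleting
// the leader pod and its ConfigMap lock immediately.
func isNodeUnreachable(n Node) bool {
	for _, c := range n.Conditions {
		if c.Type == "Ready" {
			return c.Status != "True"
		}
	}
	// No Ready condition reported: treat the node as unhealthy.
	return true
}

func main() {
	failed := Node{Conditions: []NodeCondition{{Type: "Ready", Status: "Unknown"}}}
	healthy := Node{Conditions: []NodeCondition{{Type: "Ready", Status: "True"}}}
	fmt.Println(isNodeUnreachable(failed), isNodeUnreachable(healthy))
}
```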
What did you do?
Created a wrapper function in our operator code that finds the name of the associated OLM OperatorCondition and then uses the InClusterFactory API to set the upgradeable condition. Then we started writing tests to test this function.
What did you expect to see?
Tests to pass or fail depending on whether the wrapper function was written correctly and called the correct InClusterFactory and Condition code.
What did you see instead? Under which circumstances?
The error: get operator condition namespace: namespace not found for current environment.
Since the unit tests use a fake runtime client and run locally, this was to be expected, since InClusterFactory.GetNamespacedName() calls utils.GetOperatorNamespace(). However, what was unexpected was that in our tests there is no way to override this, since:
- readNamespace
- Environment
Possible Solutions
- Make readNamespace public
- Allow GetNamespacedName to be overridden
- Use request.namespace (it might be a nice alternative?)
Additional context
Is your feature request related to a problem? Please describe.
Yes. It isn't possible to use leader-for-life leader election with controller-runtime's manager when also using liveness and readiness probes.
Using controller-runtime's manager out of the box, the following sequence of events happens when manager.Start() is called: 1) the liveness/readiness probe servers start, and then 2) leader election runs.
When using leader-for-life from this repo, it must be called prior to manager.Start(), since controller-runtime doesn't support pluggable leader election implementations. The sequence of events in this case is: 1) leader election runs, and then 2) the liveness/readiness probe servers start.
Notice that 1) and 2) are swapped. This swap causes deadlocks when upgrading operator deployments that use leader-for-life. When the deployment is attempting to rollout a new version, the new pod starts up and first attempts to become the leader, failing indefinitely until the old pod relinquishes ownership. However the old pod will not relinquish ownership until it disappears and it won't disappear until the new pod reports that it's healthy. Unfortunately the new pod will never be able to report that it's healthy because it needs to be the leader before it starts its liveness and readiness probe servers.
Describe the solution you'd like
To work upstream to make controller-runtime support a pluggable leader election implementation such that leader-for-life can be used by the manager.
Is your feature request related to a problem? Please describe.
I noticed that here it is stated that pkg/status and pkg/leader were deleted from the operator-sdk because they were moved to this repo. However, I only see pkg/leader here; pkg/status is still missing. Is this intended?
Describe the solution you'd like
Add pkg/status to this repo.
I added the code for checking the node status for leader election (#3498), but I don't know how to test my code on the CI platform (just as the operator-sdk repository uses Travis CI for integration tests).
Do you have any contribution guide for this repository?
Now that the auto-pruning library has shipped, it needs better Go docs:
- docs.go with a longer explanation and examples.
Is your feature request related to a problem? Please describe.
I'd like to check release notes with the changes made for each version.
Describe the solution you'd like
Add SDK changelog solution and release notes in MD doc.
Leader election function fails to detect non-running leader pod on Ready node.
What did you do?
Rebooted several nodes of a vSphere 7 Kubernetes cluster. After some time, all nodes came back up and Ready. Operator pods that existed before the reboot were stuck with the ProviderFailed status reason and Ready condition = false.
What did you expect to see?
Operator pods restarted by the deployment, or removed by the leader election function, with leader election succeeding for one of the pods.
What did you see instead? Under which circumstances?
New pods were created, all waiting for the leader lock to be released. The previous leader pod is not evicted, and since its node is Ready, leader deletion never happens.
Environment
Possible Solution
If current leader pod is not Ready after some significant timeout (2-3 minutes, maybe configurable), perform delete and re-elect.
Additional context
ProviderFailed status is platform-specific (and is likely to be a bug), but operator is expected to detect failure based on pod conditions.
Currently, the Go version used is 1.13; it should be updated to the version used in go.mod (1.15).
/area dependency
What did you do?
The operator pod was preempted from the cluster to make room for another high-priority pod.
What did you expect to see?
The operator pod to be rescheduled and become the leader.
What did you see instead? Under which circumstances?
The new operator pod did not become the leader; instead, it waited because the previously preempted pod was still the leader.
Environment
Possible Solution
Check for pod preemption along with the eviction check.
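A sketch of extending the existing eviction check, with a stand-in for the pod status fields that leader.go inspects. Note the "Preempting" reason string is an assumption; the exact value reported for a preempted pod depends on the kubelet/scheduler version:

```go
package main

import "fmt"

// Minimal stand-in for the pod status fields the leader package checks.
type PodStatus struct {
	Phase  string
	Reason string
}

// isPodUnavailable extends the current eviction-only check with a
// preemption check, so a preempted leader also releases the lock.
func isPodUnavailable(s PodStatus) bool {
	if s.Phase != "Failed" {
		return false
	}
	// "Preempting" is an assumed reason string, not a verified constant.
	return s.Reason == "Evicted" || s.Reason == "Preempting"
}

func main() {
	fmt.Println(isPodUnavailable(PodStatus{Phase: "Failed", Reason: "Preempting"}))
	fmt.Println(isPodUnavailable(PodStatus{Phase: "Running"}))
}
```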
Additional context
What did you do?
leader.Become claims in its documentation:
If run outside a cluster, it will skip leader election and return nil.
I therefore called leader.Become in a Go program run on my local machine via go run.
What did you expect to see?
leader.Become should return a nil error.
What did you see instead? Under which circumstances?
leader.Become returned the error namespace not found for current environment.
Environment
Possible Solution
leader.ErrNoNamespace is returned; leader.Become, when checking for an error from readNamespace(), should map ErrNoNamespace to nil when returning.
Orthogonally, it would be helpful if it were possible to specify an explicit namespace to allow bypassing the call to readNamespace. This would also allow leader-for-life election when running outside of a Kubernetes pod, which is helpful when testing locally via go run while another instance might be running in a different terminal.
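A sketch of the proposed mapping, with a stub readNamespace standing in for the real internal helper:

```go
package main

import (
	"errors"
	"fmt"
)

// Stand-in for leader.ErrNoNamespace.
var ErrNoNamespace = errors.New("namespace not found for current environment")

// readNamespace is a stub; outside a cluster it fails with ErrNoNamespace.
func readNamespace() (string, error) {
	return "", ErrNoNamespace
}

// become sketches the proposed fix: map ErrNoNamespace to nil so that
// running outside a cluster skips leader election, as documented.
func become(lockName string) error {
	_, err := readNamespace()
	if errors.Is(err, ErrNoNamespace) {
		// Not running in a cluster: skip leader election.
		return nil
	}
	if err != nil {
		return err
	}
	// ... acquire the leader-for-life lock here ...
	return nil
}

func main() {
	fmt.Println(become("my-operator-lock"))
}
```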
What did you do?
Trying to use operator-condition in HCO. Getting errors in the log when trying to set the condition:
operatorconditions.operators.coreos.com "kubevirt-hyperconverged-operator.v1.5.0" is forbidden: User "system:serviceaccount:kubevirt-hyperconverged:hyperconverged-cluster-operator" cannot update resource "operatorconditions/status" in API group "operators.coreos.com" in the namespace "kubevirt-hyperconverged"
What did you expect to see?
The operator-condition is set to the required value
Bug source
The library still uses c.client.Status().Update() instead of c.client.Update().
https://github.com/operator-framework/operator-lib/blob/main/conditions/conditions.go#L101
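A stripped-down sketch of the one-line fix described above, with stub client types standing in for the controller-runtime client (the real change belongs in the conditions package):

```go
package main

import "fmt"

// Minimal stand-ins for the controller-runtime client shape.
type Object struct{ Conditions []string }

type statusWriter struct{ updates *[]string }

func (s statusWriter) Update(obj *Object) error {
	*s.updates = append(*s.updates, "status")
	return nil
}

type client struct{ updates []string }

func (c *client) Update(obj *Object) error {
	c.updates = append(c.updates, "main")
	return nil
}

func (c *client) Status() statusWriter { return statusWriter{updates: &c.updates} }

// setCondition sketches the fix: update the main resource rather than
// the status subresource, which the ServiceAccount is not allowed to
// write (operatorconditions/status is forbidden in the error above).
func setCondition(c *client, obj *Object, cond string) error {
	obj.Conditions = append(obj.Conditions, cond)
	return c.Update(obj) // was: c.Status().Update(obj)
}

func main() {
	c := &client{}
	_ = setCondition(c, &Object{}, "Upgradeable=True")
	fmt.Println(c.updates)
}
```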
Environment
0.50.0
Possible Solution
Additional context
Is your feature request related to a problem? Please describe.
Currently, the repository contains a leader-for-life based election model that ensures a single leader is elected for life in an HA setup.
To briefly describe what leader for life and leader for lease approaches are:
(1) Leader for life: a leader pod is selected for life, until it is garbage collected. This ensures there is only one leader at any instant of time.
(2) Leader for lease: Leader election happens periodically at a defined interval which can be tweaked. When the current leader is not able to renew the lease, a new leader is elected.
More details on both of these approaches are available here.
Approach (2) is implemented upstream, in client-go and is scaffolded by default for the past 30+ releases of SDK, since it is implemented as a part of setting up the manager in controller-runtime. It guarantees faster election of leaders, less downtime, recovery from disconnected/frozen node failures. However, it does not eliminate the split brain scenario - where more than a single leader is available at an instant of time.
Approach (1), on the other hand, was developed long before leader-for-lease was implemented upstream. Though it solves the split-brain scenario, it guarantees neither recovery from node failure nor faster recovery. We also have issues integrating it with controller-runtime (#48). It is neither maintained nor used as widely as leader-for-lease.
Here is a detailed comment explaining the preference of (2) over the other, wherein users would prefer a faster recovery even though there is split brain scenario intermittently, rather than an implementation that does not guarantee faster recovery.
Describe the solution you'd like
Since the leader-for-life approach is not widely used and does not work seamlessly with controller-runtime, it is better to adopt a well-tested upstream library than to depend on what is currently available as an option.
The solution for this is:
Importing and using operator-lib
What did you do?
I ran go get github.com/operator-framework/operator-lib and added the following to my main.go operator code:
import ( olib "github.com/operator-framework/operator-lib" )
olib.GetOperatorNamespace()
What did you expect to see?
go mod tidy succeeding.
What did you see instead? Under which circumstances?
go: finding module for package github.com/operator-framework/operator-lib
XXXXXXXXXX imports
github.com/operator-framework/operator-lib: module github.com/operator-framework/operator-lib@latest found (v0.4.1), but does not contain package github.com/operator-framework/operator-lib
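The module root of operator-lib contains no Go package, only subpackages, so the root import path cannot be resolved. A likely fix is to import a subpackage instead, for example:

```go
// The module root has no importable package; import a subpackage such as:
import (
	"github.com/operator-framework/operator-lib/leader"
)
```

Note also that GetOperatorNamespace lives under an internal package, so it is not importable from outside the module regardless of the import path used.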
Environment
Is your feature request related to a problem? Please describe.
It seems that the implementation of the prune feature could be improved. As part of the fix for PR #100, I noticed the following:
- Config.Execute does not sound right. Config may be passed to a factory creating a Pruner or a Processor structure, but the Execute method should be on the latter structure: Pruner.Execute() or Processor.Execute() makes more sense.
- func pruneStrategy(resources []unstructured.Unstructured) (resourcesToRemove [][]unstructured.Unstructured, err error)
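A sketch of the suggested restructuring, with a stand-in Unstructured type and the strategy signature simplified to a flat slice for brevity:

```go
package main

import "fmt"

// Unstructured is a stand-in for unstructured.Unstructured.
type Unstructured struct{ Name string }

// StrategyFunc selects which resources to remove, following the
// strategy-function idea proposed in the issue (simplified here).
type StrategyFunc func(resources []Unstructured) (resourcesToRemove []Unstructured, err error)

// Config is what users provide.
type Config struct {
	Strategy StrategyFunc
}

// Pruner is created from a Config by a factory and carries Execute.
type Pruner struct{ cfg Config }

func NewPruner(cfg Config) *Pruner { return &Pruner{cfg: cfg} }

// Execute runs the strategy; moving it off Config and onto Pruner is
// the restructuring suggested above.
func (p *Pruner) Execute(resources []Unstructured) ([]Unstructured, error) {
	return p.cfg.Strategy(resources)
}

func main() {
	// Example strategy: keep the first resource, remove the rest.
	keepFirst := func(rs []Unstructured) ([]Unstructured, error) {
		if len(rs) <= 1 {
			return nil, nil
		}
		return rs[1:], nil
	}
	p := NewPruner(Config{Strategy: keepFirst})
	toRemove, _ := p.Execute([]Unstructured{{Name: "a"}, {Name: "b"}})
	fmt.Println(len(toRemove))
}
```

With this split, Config stays a plain user-facing options struct while the behavior lives on Pruner, which matches the naming concern raised earlier.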
Is your feature request related to a problem? Please describe.
It allows us to check lint issues in the PR, which makes review and the contribution process easier.
See: https://github.com/golangci/golangci-lint-action
Describe the solution you'd like
Call golangci-lint via GitHub Action
Is your feature request related to a problem? Please describe.
I would like a way to retrieve the namespace where the operator is installed from a controller.
I can set up the namespace from the kustomize script (like here: https://github.com/operator-framework/operator-sdk-samples/blob/master/go/memcached-operator/config/default/kustomization.yaml#L2), and obviously I can hardcode that same value in my code; however, I think it would be nice if I could retrieve the value already configured, either at runtime or with some code generation.
Note: in Slack it was suggested to expose utils.GetOperatorNamespace(), which is currently internal: https://github.com/operator-framework/operator-lib/blob/main/internal/utils/utils.go#L34
What did you do?
Check: https://coveralls.io/builds/32575225/source?filename=leader/leader.go
What did you expect to see?
The code of leader.go showing what is covered and what is not, e.g. https://coveralls.io/jobs/66045064/source_files/1735508317
What did you see instead? Under which circumstances?
Possible Solution
I suspect there may be an issue generating the output file with the coverage data sent to the Coveralls integration, or in the integration dependency used to send it.
The NewCondition method (and others) indirectly calls the readSAFile method. This method uses a hard-coded path for the operator namespace file:
operator-lib/internal/utils/utils.go
Line 29 in c0ba7dc
This implementation blocks the ability to run the operator locally for development and debugging.
Please consider one of the following, or both (preferred):
Environment
What did you do?
The operator pod went into the Succeeded state and a new pod was started by controller-runtime.
What did you expect to see?
The operator pod to be rescheduled and become the leader.
What did you see instead? Under which circumstances?
The new operator pod did not become the leader; instead, it waited because the previous Succeeded pod was still the leader.
Environment
Possible Solution
Check for pod completion along with the eviction and preemption checks.
Additional context
This issue mirrors operator-framework/operator-sdk#2770
Feature Request
Is your feature request related to a problem? Please describe.
There are several frameworks out there that aim at making it easier to develop operators by providing implementations for common tasks such as:
- https://github.com/redhat-cop/operator-utils
- https://github.com/RHsyseng/operator-utils
- https://github.com/halkyonio/operator-framework
- ...
Describe the solution you'd like
It would be interesting to investigate these frameworks and see which features (if any) could be included in operator-sdk to make it easier to develop operators.
Proposed contributions
- Composable operator API: #2340