
operator-lib's Introduction

operator-lib


Helpers for Operator developers.

operator-lib's People

Contributors

alexnpavel, asmacdo, awgreene, camilamacedo86, caueasantos, darkowlzz, dbenque-1a, dependabot[bot], estroz, everettraven, fabianvf, fgiloux, grosser, hongchaodeng, hyungjune, jagpreetstamber, jmrodri, joelanford, lilic, m1kola, mcharrel, mhrivnak, ncdc, neo2308, nunnatsa, oceanc80, rashmigottipati, tiraboschi, varshaprasad96, venkat19967

operator-lib's Issues

Configurable max backoff interval in leader-for-life

Feature Request

Is your feature request related to a problem? Please describe.
Yes. I found that the hard-coded max backoff interval of 16s was too long for production environments. It introduced unnecessary leaderless time during deployment.

Describe the solution you'd like
Make maxBackoffInterval configurable so it can be passed in via an Option to Become.
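
A minimal sketch of what the requested option could look like, assuming the functional-option pattern the request implies (WithMaxBackoffInterval and the MaxBackoffInterval field are hypothetical, not the current API):

package leader

import "time"

// Config collects the knobs Become would consume.
type Config struct {
	MaxBackoffInterval time.Duration
}

// Option mutates the configuration used by Become.
type Option func(*Config) error

// WithMaxBackoffInterval (hypothetical) caps the exponential backoff
// between retries while waiting for the existing lock to be released.
func WithMaxBackoffInterval(d time.Duration) Option {
	return func(c *Config) error {
		c.MaxBackoffInterval = d
		return nil
	}
}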

Prune logging cannot be set

Bug Report

What did you do?

I wanted to configure the Prune structure, but the log field is not exported and is never set.

What did you expect to see?

Either that the field is automatically populated or that it is exported.

What did you see instead? Under which circumstances?

A field that is not exported and not automatically set.

Possible Solution

Although I am not in favor of using global variables for loggers, things should be consistent within a library. The other features provided by operator-lib, for instance the event handler and leader election, use a preset package-level variable:

var log = logf.Log.WithName("event_handler")
var log = logf.Log.WithName("leader")

logf is a package of the controller-runtime library that provides the logr.Logger variable Log. This assumes that SetLogger was called earlier, which is the same expectation set by Kubebuilder/controller-runtime.
This makes the library less suitable for clients or other apps not based on controller-runtime, but again, things should be consistent.
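
A minimal sketch of the consistent approach would be for the prune package to adopt the same pattern as the variables above (this is the suggestion, not the current code):

var log = logf.Log.WithName("prune")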

Another thing worth mentioning (a second issue could be opened for it) is that the naming/structure could be improved. It seems a bit unconventional to have a method like Execute on a structure named Config; I would expect it on a structure called Pruner, with Config being what users provide to obtain Pruners that fit their needs.

operator-lib 0.11.0 is not compatible with controller-runtime v0.15.0

Bug Report

What did you do?
While building an operator using operator-lib 0.11.0 and controller-runtime v0.15.0, the following build error is thrown:

# github.com/operator-framework/operator-lib/internal/annotation
../../gopath/pkg/mod/github.com/operator-framework/operator-lib@v0.11.0/internal/annotation/filter.go:90:15: cannot use func(evt event.CreateEvent, q workqueue.RateLimitingInterface) {…} (value of type func(evt event.CreateEvent, q workqueue.RateLimitingInterface)) as func(context.Context, event.CreateEvent, workqueue.RateLimitingInterface) value in struct literal
../../gopath/pkg/mod/github.com/operator-framework/operator-lib@v0.11.0/internal/annotation/filter.go:92:24: not enough arguments in call to f.hdlr.Create
	have (event.CreateEvent, workqueue.RateLimitingInterface)
	want (context.Context, event.CreateEvent, workqueue.RateLimitingInterface)
../../gopath/pkg/mod/github.com/operator-framework/operator-lib@v0.11.0/internal/annotation/filter.go:95:15: cannot use func(evt event.UpdateEvent, q workqueue.RateLimitingInterface) {…} (value of type func(evt event.UpdateEvent, q workqueue.RateLimitingInterface)) as func(context.Context, event.UpdateEvent, workqueue.RateLimitingInterface) value in struct literal
../../gopath/pkg/mod/github.com/operator-framework/operator-lib@v0.11.0/internal/annotation/filter.go:97:24: not enough arguments in call to f.hdlr.Update
	have (event.UpdateEvent, workqueue.RateLimitingInterface)
	want (context.Context, event.UpdateEvent, workqueue.RateLimitingInterface)
../../gopath/pkg/mod/github.com/operator-framework/operator-lib@v0.11.0/internal/annotation/filter.go:100:15: cannot use func(evt event.DeleteEvent, q workqueue.RateLimitingInterface) {…} (value of type func(evt event.DeleteEvent, q workqueue.RateLimitingInterface)) as func(context.Context, event.DeleteEvent, workqueue.RateLimitingInterface) value in struct literal
../../gopath/pkg/mod/github.com/operator-framework/operator-lib@v0.11.0/internal/annotation/filter.go:102:24: not enough arguments in call to f.hdlr.Delete
	have (event.DeleteEvent, workqueue.RateLimitingInterface)
	want (context.Context, event.DeleteEvent, workqueue.RateLimitingInterface)
../../gopath/pkg/mod/github.com/operator-framework/operator-lib@v0.11.0/internal/annotation/filter.go:105:16: cannot use func(evt event.GenericEvent, q workqueue.RateLimitingInterface) {…} (value of type func(evt event.GenericEvent, q workqueue.RateLimitingInterface)) as func(context.Context, event.GenericEvent, workqueue.RateLimitingInterface) value in struct literal
../../gopath/pkg/mod/github.com/operator-framework/operator-lib@v0.11.0/internal/annotation/filter.go:107:25: not enough arguments in call to f.hdlr.Generic
	have (event.GenericEvent, workqueue.RateLimitingInterface)
	want (context.Context, event.GenericEvent, workqueue.RateLimitingInterface)

Possible Solution
I think this was solved in #114, but we need a new release of the package.

handler.InstrumentedEnqueueRequestForOwner

Feature Request

Is your feature request related to a problem? Please describe.
I'm migrating an operator to Operator SDK 1.x and following the migration guide, which suggests using the operator-lib/handler package to have the enqueue methods instrumented with metrics. The operator I'm migrating needs to watch a secondary resource whose instances I used to enqueue with controller-runtime's handler.EnqueueRequestForOwner. I don't see an equivalent instrumented function in operator-lib's handler package.

Describe the solution you'd like
I'm wondering whether there is an easy way to achieve what I just described. Otherwise, I'd like to see an InstrumentedEnqueueRequestForOwner in the handler package that instruments the enqueue action for secondary resources with metrics, similar to InstrumentedEnqueueRequestForObject.
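
A minimal sketch of what such a handler could look like, wrapping an owner-based handler built elsewhere (e.g. with controller-runtime's handler.EnqueueRequestForOwner). The type name, the metric hooks, and the v0.15-era method signatures are assumptions, not the library's API:

package handler

import (
	"context"

	"k8s.io/client-go/util/workqueue"
	"sigs.k8s.io/controller-runtime/pkg/event"
	crhandler "sigs.k8s.io/controller-runtime/pkg/handler"
)

// InstrumentedEnqueueRequestForOwner (hypothetical) delegates every event
// to an inner owner-based handler while updating the same kind of metric
// that InstrumentedEnqueueRequestForObject maintains for primary resources.
type InstrumentedEnqueueRequestForOwner struct {
	Inner crhandler.EventHandler
}

func (h InstrumentedEnqueueRequestForOwner) Create(ctx context.Context, e event.CreateEvent, q workqueue.RateLimitingInterface) {
	// a metric update would go here, keyed by the object's name/namespace/GVK
	h.Inner.Create(ctx, e, q)
}

func (h InstrumentedEnqueueRequestForOwner) Update(ctx context.Context, e event.UpdateEvent, q workqueue.RateLimitingInterface) {
	h.Inner.Update(ctx, e, q)
}

func (h InstrumentedEnqueueRequestForOwner) Delete(ctx context.Context, e event.DeleteEvent, q workqueue.RateLimitingInterface) {
	// metric cleanup would go here
	h.Inner.Delete(ctx, e, q)
}

func (h InstrumentedEnqueueRequestForOwner) Generic(ctx context.Context, e event.GenericEvent, q workqueue.RateLimitingInterface) {
	h.Inner.Generic(ctx, e, q)
}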

Check the node status when leader re-election.

Feature Request

Is your feature request related to a problem? Please describe.

From leader/leader.go, leader re-election happens only after the default 5-minute timeout, once the condition
Pod.Status.Phase == "Failed" && Pod.Status.Reason == "Evicted" is met after a worker node fails.
In my opinion, leader re-election could happen almost immediately if the condition also checked the status of the node where the leader pod is running.

Describe the solution you'd like

Check the condition of the node where the leader pod is running (Node condition Type == "NodeReady" with Status != "ConditionTrue"). When the node has failed, delete the leader pod (which is only marked Terminating, since its node has failed) and the ConfigMap lock whose OwnerReference points to the leader pod.
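
A minimal sketch of the proposed node check, using the standard corev1 condition types (the surrounding pod/ConfigMap deletion logic is omitted):

import corev1 "k8s.io/api/core/v1"

// nodeIsFailed reports whether the node hosting the leader pod should be
// treated as failed: its Ready condition is missing or not True.
func nodeIsFailed(node *corev1.Node) bool {
	for _, cond := range node.Status.Conditions {
		if cond.Type == corev1.NodeReady {
			return cond.Status != corev1.ConditionTrue
		}
	}
	return true // no Ready condition reported at all
}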

(Screenshots of the node-check test for leader re-election omitted; test-operator-xxx-xxxqb was the leader pod.)

Shortening --pod-eviction-timeout could be another approach. However, I am sure the approach above brings more reliability, since we do not know an appropriate timeout in advance.

Also, are there any drawbacks to making --pod-eviction-timeout very short?

@HyungJune

Can't unit test own usage of InClusterFactory

Bug Report

What did you do?
Created a wrapper function in our operator code that finds the name of the associated OLM OperatorCondition and then uses the InClusterFactory API to set the upgradeable condition. Then we started writing tests for this function.

What did you expect to see?
Tests to pass or fail depending on whether the wrapper function was written correctly and calls the correct InClusterFactory and Condition code.

What did you see instead? Under which circumstances?
The error: get operator condition namespace: namespace not found for current environment.
Since the unit tests use a fake runtime client and run locally, this was expected, because InClusterFactory.GetNamespacedName() calls utils.GetOperatorNamespace(). What was unexpected is that there is no way to override this in our tests, since:

  • the utils package is internal, so it is private
  • readNamespace is private
  • the library's own tests can override it, since they share the same package as readNamespace.

Environment

  • operator-lib version: 0.9.x
  • github.com/operator-framework/operator-lib v0.9.0

Possible Solutions

  • make readNamespace public
  • provide an interface allowing GetNamespacedName to be overridden
  • allow the namespace to be injected into InClusterFactory (since the operator provides the namespace through request.namespace, this might be a nice alternative; see the sketch after this list)
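
A minimal sketch of the last option (the Namespace field and the fallback error are illustrative, not the current InClusterFactory API):

import "errors"

// Factory resolves the operator's namespace, preferring an injected value
// so unit tests can set it explicitly.
type Factory struct {
	Namespace string
}

func (f Factory) operatorNamespace() (string, error) {
	if f.Namespace != "" {
		return f.Namespace, nil
	}
	// fall back to the in-cluster discovery the library performs today
	return "", errors.New("namespace not found for current environment")
}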


Make leader-for-life leader election more integrated with controller-runtime

Feature Request

Is your feature request related to a problem? Please describe.
Yes. It isn't possible to use leader-for-life leader election with controller-runtime's manager when also using liveness and readiness probes.

Using controller-runtime's manager out of the box, the following sequence of events happens when manager.Start() is called:

  1. Liveness and readiness probes are started
  2. Leader election is started.
  3. Controllers are started.

When using leader-for-life from this repo, it must be called prior to manager.Start() since controller-runtime doesn't support pluggable leader election implementations. The sequence of events in this case is:

  1. Leader election is started.
  2. Liveness and readiness probes are started
  3. Controllers are started.

Notice that 1) and 2) are swapped. This swap causes deadlocks when upgrading operator deployments that use leader-for-life. When the deployment attempts to roll out a new version, the new pod starts up and first attempts to become the leader, failing indefinitely until the old pod relinquishes ownership. However, the old pod will not relinquish ownership until it disappears, and it won't disappear until the new pod reports that it's healthy. Unfortunately, the new pod can never report that it's healthy, because it needs to become the leader before it starts its liveness and readiness probe servers.
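
For reference, a minimal sketch of the usage pattern that produces this ordering (assuming a manager created earlier with a health probe address configured):

// Become blocks until this pod owns the lock...
if err := leader.Become(ctx, "my-operator-lock"); err != nil {
	return err
}
// ...but the liveness/readiness endpoints only start serving inside
// Start, so a not-yet-leader pod can never report itself healthy.
return mgr.Start(ctx)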

Describe the solution you'd like
Work upstream to make controller-runtime support a pluggable leader election implementation, so that leader-for-life can be used with the manager.

Status pkg missing

Feature Request

Is your feature request related to a problem? Please describe.
I noticed it is stated here that pkg/status and pkg/leader were deleted from the operator-sdk because they were moved to this repo. However, I only see pkg/leader here; pkg/status is still missing. Is this intended?

Describe the solution you'd like
Add pkg/status in this repo.

Any contribution guide?

I added code for checking the node status for leader election (#3498), but I don't know how to test my code on the CI platform (the way the operator-sdk repository uses Travis CI for integration tests).

Do you have a contribution guide for this repository?

Flesh out go docs for prune

Now that the auto-pruning library has shipped, it needs better Go docs.

  • All exposed functions and types
  • A docs.go with a longer explanation and examples.

Add changelog/release notes for this project

Feature Request

Is your feature request related to a problem? Please describe.
I'd like to check release notes with the changes made for each version.

Describe the solution you'd like
Adopt the SDK's changelog solution and add release notes as Markdown docs.

Leader election cannot complete after node reboot on vSphere

Bug Report

The leader election function fails to detect a non-running leader pod on a Ready node.

What did you do?
Rebooted several nodes of a vSphere 7 Kubernetes cluster. After some time, all nodes came back up and were Ready. Operator pods that existed before the reboot were stuck with the ProviderFailed status reason and a Ready condition of false.

What did you expect to see?
Operator pods restarted by the deployment, or removed by the leader election function, with leader election succeeding for one of the pods.

What did you see instead? Under which circumstances?
New pods were created, and all of them waited for the leader lock indefinitely. The previous leader pod is not evicted, and since its node is Ready, leader deletion never happens.

Environment

  • operator-lib version: v0.6.0
  • vSphere 7 U3

Possible Solution
If the current leader pod is not Ready after some significant timeout (2-3 minutes, maybe configurable), delete it and re-elect.

Additional context
The ProviderFailed status is platform-specific (and likely a bug), but the operator is expected to detect failure based on pod conditions.

Leader for life lock not released in case pod is preempted

Bug Report

What did you do?
The operator pod was preempted from the cluster to make room for a higher-priority pod.

What did you expect to see?
The operator pod was rescheduled and should have become the leader.

What did you see instead? Under which circumstances?
The new operator pod did not become the leader; instead it kept waiting, because the previously preempted pod was still the leader.

Environment

  • operator-lib version: v0.12.0

Possible Solution
Check for pod preemption along with eviction; a sketch follows.
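
A minimal sketch extending the existing eviction test (the exact Reason string the kubelet sets for a preempted pod is an assumption that needs verifying):

import corev1 "k8s.io/api/core/v1"

func isPodEvictedOrPreempted(pod corev1.Pod) bool {
	if pod.Status.Phase != corev1.PodFailed {
		return false
	}
	// "Evicted" is what the current check matches; "Preempted" is assumed
	return pod.Status.Reason == "Evicted" || pod.Status.Reason == "Preempted"
}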


leader.Become documentation is wrong

Bug Report

What did you do?
leader.Become claims in its documentation:

If run outside a cluster, it will skip leader election and return nil.

I therefore called leader.Become in a Go program run on my local machine via go run.

What did you expect to see?
leader.Become should return a nil error.

What did you see instead? Under which circumstances?
leader.Become returned the error namespace not found for current environment

Environment

  • operator-lib version: v0.4.0

Possible Solution

  • Option 1: change the documentation to state that leader.ErrNoNamespace is returned when run outside of a Kubernetes pod.
  • Option 2: make the branch in leader.Become that checks the error from readNamespace() map ErrNoNamespace to nil when returning (see the sketch after this list).
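
A minimal sketch of Option 2, assuming readNamespace and ErrNoNamespace as they appear in the leader package:

ns, err := readNamespace()
if errors.Is(err, ErrNoNamespace) {
	// outside a cluster: skip leader election, as documented
	return nil
} else if err != nil {
	return err
}
// continue leader election in namespace ns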

Orthogonally, it would be helpful to be able to specify an explicit namespace, bypassing the call to readNamespace. This would also allow leader-for-life election when running outside of a Kubernetes pod, which is helpful when testing locally via go run while another instance might be running in a different terminal.

Permission error when trying to set the operator condition

Bug Report

What did you do?
Trying to use operator-condition in HCO. Getting errors in the log when trying to set the condition:

operatorconditions.operators.coreos.com "kubevirt-hyperconverged-operator.v1.5.0" is forbidden: User "system:serviceaccount:kubevirt-hyperconverged:hyperconverged-cluster-operator" cannot update resource "operatorconditions/status" in API group "operators.coreos.com" in the namespace "kubevirt-hyperconverged"

What did you expect to see?
The operator-condition is set to the required value

Bug source
The library still uses c.client.Status().Update() instead of c.client.Update().

https://github.com/operator-framework/operator-lib/blob/main/conditions/conditions.go#L101

Environment

  • operator-lib version: 0.50.0


Deprecate Leader for life based leader election

Feature Request

Is your feature request related to a problem? Please describe.
Currently, the repository contains a leader-for-life election model that ensures a single leader is elected for life in an HA deployment.

To briefly describe the two approaches:
(1) Leader for life: a leader pod is selected for life, until it is garbage collected. This ensures there is only one leader at any instant.
(2) Leader for lease: leader election happens periodically at a configurable interval. When the current leader cannot renew the lease, a new leader is elected.

More details on both of these approaches are available here.

Approach (2) is implemented upstream in client-go and has been scaffolded by default for the past 30+ releases of the SDK, since it is set up as part of the manager in controller-runtime. It guarantees faster election of leaders, less downtime, and recovery from disconnected/frozen node failures. However, it does not eliminate the split-brain scenario, where more than a single leader is active at an instant.

Approach (1), on the other hand, was developed long before leader-for-lease was implemented upstream. Though it solves the split-brain scenario, it guarantees neither recovery from node failure nor faster recovery. We also have issues integrating it with controller-runtime (#48). It is neither actively maintained nor used as widely as leader-for-lease.

Here is a detailed comment explaining the preference for (2) over (1): users would rather have faster recovery, even with an intermittent split-brain scenario, than an implementation that does not guarantee fast recovery.

Describe the solution you'd like
Since the leader-for-life approach is not widely used and does not integrate seamlessly with controller-runtime, it is better to adopt a well-tested upstream library than to depend on what is currently available as an option.

The solution for this is:

  1. Deprecate and remove leader for life in future releases of the Operator SDK.
  2. Bring this up upstream (in controller-runtime) for easier integration. This has already been raised upstream (kubernetes-sigs/controller-runtime#1963), but there was no response; see the sketch after this list for how approach (2) is enabled today.
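
For comparison, approach (2) is enabled through controller-runtime's manager options; a minimal sketch:

import ctrl "sigs.k8s.io/controller-runtime"

mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
	LeaderElection:   true,
	LeaderElectionID: "my-operator-lock", // illustrative ID
})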

How to properly import the library

Type of question

Importing and using operator-lib

Question

What did you do?
I ran go get github.com/operator-framework/operator-lib and added the following to my operator's main.go:
import ( olib "github.com/operator-framework/operator-lib" )

olib.GetOperatorNamespace()

What did you expect to see?
go mod tidy succeeding.

What did you see instead? Under which circumstances?
go: finding module for package github.com/operator-framework/operator-lib
XXXXXXXXXX imports
github.com/operator-framework/operator-lib: module github.com/operator-framework/operator-lib@latest found (v0.4.1), but does not contain package github.com/operator-framework/operator-lib

Environment

  • operator-lib version: 0.4.1
    Mac OS
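
For reference, the module root contains no Go package, which is what the go error reports; importing one of the subpackages resolves it (note that GetOperatorNamespace lives under internal/ and is not importable):

import (
	"context"

	"github.com/operator-framework/operator-lib/leader"
)

func run(ctx context.Context) error {
	return leader.Become(ctx, "my-operator-lock")
}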

Improve prune feature implementation

Feature Request

Is your feature request related to a problem? Please describe.
It seems that the implementation of the prune feature could be improved. While working on the fix for PR #100, I noticed the following:

  • Naming issue: Config is the main struct for the prune logic, which leads to methods like
    Config.Execute. That does not read right. Config may be passed to a factory creating a Pruner or Processor structure, but the Execute method should live on the latter: Pruner.Execute() or Processor.Execute() makes more sense.
  • The Config structure seems suboptimal in that, for instance, the same Strategy has to be applied to all resources. If the logic is to be configured once for multiple resources, it may make more sense to have a slice or a map of substructures containing the specifics for each resource: GVK, namespace, strategy, pre-deletion hook. The user could then select a different strategy for each resource.
  • The ResourceInfo structure seems limiting. Using Unstructured would allow working with any type; the additional logic could then work with fields specific to the type being processed.
  • The pod deletion strategy only considers succeeded pods. It should also consider failed pods: basically all pods in a terminated state, possibly letting the user select which phases are candidates for pruning.
  • It may make more sense to use a closure for the strategy rather than a strategyConfig structure and a StrategyFunc. Doing so would unify the interface between what the library implements (pruneByMaxAge, pruneByMaxCount) and what the user provides (CustomStrategy). They would all be funcs, possibly passed as a Config field, with this signature (see the sketch after this list):
    func pruneStrategy(resources []unstructured.Unstructured) (resourcesToRemove []unstructured.Unstructured, err error)
    
    Specific parameters, held in the Config, would be made available through closure variables. pruneByMaxCount would then not need Config.Strategy, which carries information meant for all possible strategies. This would make things easier to extend, and the implementation would be more performant: no need for the switch on config.Strategy.Mode in prune.go. All prune strategies would have the same signature.
  • The context is passed to the functions in the library, but my feeling is that we have two use cases:
    • The prune logic is called from a reconciliation loop, where the context may provide more information
    • The prune logic is called from a cron job (as per the documentation), where the context will provide little information
      I haven't seen it really used. Is it thought of more as a safety net (if anything is needed that was not foreseen, get it through the context)? Could this be addressed through the closure approach?
  • We should also aim not to limit the logic to pods and jobs, and I believe the points listed above would help with that.
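
A minimal sketch of the closure-based strategy described above (the names are illustrative, not the library's API):

import "k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"

// PruneStrategy returns the subset of resources that should be removed.
type PruneStrategy func(resources []unstructured.Unstructured) ([]unstructured.Unstructured, error)

// MaxCount keeps at most n resources; the parameter reaches the strategy
// through a closure variable instead of a shared Config.Strategy field.
func MaxCount(n int) PruneStrategy {
	return func(resources []unstructured.Unstructured) ([]unstructured.Unstructured, error) {
		if len(resources) <= n {
			return nil, nil
		}
		// assumes the caller sorted resources newest-first
		return resources[n:], nil
	}
}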

Expose operator namespace to controllers

Feature Request

Is your feature request related to a problem? Please describe.
I would like a way to retrieve the namespace where the operator is installed from within a controller.
I can set the namespace in the kustomize script (like here: https://github.com/operator-framework/operator-sdk-samples/blob/master/go/memcached-operator/config/default/kustomization.yaml#L2), and I could obviously hardcode that same value in my code, but I think it would be nice to retrieve the already-configured value, either at runtime or with some code generation.

Note: in Slack it was suggested to expose utils.GetOperatorNamespace(), which is currently internal: https://github.com/operator-framework/operator-lib/blob/main/internal/utils/utils.go#L34
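
A minimal sketch of consuming the value in a controller once it is retrievable (POD_NAMESPACE via the downward API and the reconciler field are illustrative assumptions):

ns := os.Getenv("POD_NAMESPACE") // injected via fieldRef: metadata.namespace
r := &MyReconciler{
	Client:            mgr.GetClient(),
	OperatorNamespace: ns,
}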

Coveralls integration is not showing the code which is covered by the test

Bug Report

What did you do?
Check: https://coveralls.io/builds/32575225/source?filename=leader/leader.go

What did you expect to see?
The code of leader.go showing what is covered and what is not, e.g. https://coveralls.io/jobs/66045064/source_files/1735508317

What did you see instead? Under which circumstances?
Coveralls does not show which lines of leader.go are covered.

Possible Solution
I suspect there is an issue either in generating the output file with the data that is sent to Coveralls, or in the integration dependency used to send it.

Can't run operators locally, when using the conditions package

Bug Report

The NewCondition (and other) methods call (indirectly) the readSAFile method, which uses a hard-coded path for the operator namespace file:

return ioutil.ReadFile("/var/run/secrets/kubernetes.io/serviceaccount/namespace")

This implementation blocks the ability to run the operator locally for development and debugging.

Please consider one of the following, or both (preferred):

  1. Replace the hard-coded path with an environment variable, defaulting to the current value.
  2. Use an environment variable to store the namespace, and read the file only if that variable is not set (see the sketch after this list).
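
A minimal sketch of option 2 (the variable name OPERATOR_NAMESPACE is an assumption, not an existing knob):

import "os"

func readSAFile() ([]byte, error) {
	// prefer the environment variable when set, e.g. for local runs
	if ns, ok := os.LookupEnv("OPERATOR_NAMESPACE"); ok {
		return []byte(ns), nil
	}
	// otherwise fall back to the in-cluster service-account file
	return os.ReadFile("/var/run/secrets/kubernetes.io/serviceaccount/namespace")
}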

Environment

  • operator-lib version:

Leader for life lock not released in case the pod is Succeeded

Bug Report

What did you do?
The operator pod went into the Succeeded state, and a new pod was started by controller-runtime.

What did you expect to see?
The operator pod was rescheduled and should have become the leader.

What did you see instead? Under which circumstances?
The new operator pod did not become the leader; instead it kept waiting, because the previous Succeeded pod was still the leader.

Environment

  • operator-lib version: v0.12.0

Possible Solution
Check for pod completion along with eviction and preemption.


[mirror] Investigate which features (if any) could be included to ease operator development

This issue mirrors operator-framework/operator-sdk#2770

Feature Request

Is your feature request related to a problem? Please describe.
There are several frameworks out there that aim to make it easier to develop operators by providing implementations for common tasks.

Describe the solution you'd like
It would be interesting to investigate these frameworks and see which features (if any) could be included in operator-sdk to make it easier to develop operators.

Proposed contributions

  • Composable operator API: #2340
