boskos

Background

βοσκός - shepherd in greek!

boskos is a resource manager service that handles different kinds of resources and manages their transitions between states.

Introduction

Boskos is initialized with a resource configuration, a list of resources by name. It is passed in via -config, usually as a ConfigMap.

Boskos supports two types of resources: static and dynamic. Static resources correspond to actual physical resources, meaning someone needs to create them and add them to the list. Dynamic resources may depend on static resources. In the example below, aws-account is a static resource, and aws-cluster is a dynamic resource that depends on having an aws-account. Once a cluster is created, AWS resources are in use, so an admin might want to always keep a minimum number of clusters available for testing, while allowing more clusters to be created for spikes in usage.

---
resources:
  # Static
  - type: "aws-account"
    state: free
    names:
    - "account1"
    - "account2"
  # Dynamic
  - type: "aws-cluster"
    state: dirty
    min-count: 1
    max-count: 2
    lifespan: 48h
    needs:
      aws-account: 1
    config:
      type: AWSClusterCreator
      content: "..."

Type can be a GCP project, a cluster, or even a Dota 2 server: anything that you want to manage as a group of resources. Name is a unique identifier of the resource. State is a string describing the current status of the resource.

User Data exists for customization. In Mason, for example, we create new resources from existing ones (creating a cluster inside a GCP project), but in order to acquire the right resources, we need to store some information in the final resource's UserData. It is up to the implementation to parse the string into the right struct. UserData can be updated using the update API call. All resource user data is returned as part of acquisition (calling acquire or acquirebystate).

Dynamic Resources

As explained in the introduction, dynamic resources were introduced to reduce cost.

If all resources are currently in use and the count of resources is below max-count, Boskos will create new resources on Acquire. To take advantage of this, users need to specify a request ID in Acquire and keep using the same request ID until the resource is available.

Boskos takes care of naming and creating resources (if the current count is below min-count) and deleting resources when they expire (the lifespan option) or exceed max-count.

Any resource being deleted (due to a config update or expiration) is first marked ToBeDeleted. The cleaner component then marks it Tombstone so that it can be safely deleted by Boskos. The cleaner ensures that dynamic resources release any other leased resources associated with them, to prevent leaks.
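
A hedged sketch of acquiring a dynamic resource with a request ID, assuming a Boskos instance listening on localhost:8080 and the aws-cluster type from the sample config above (uuidgen is only one way to produce a stable identifier):

# Reuse the same request_id on every retry so the request keeps its place in the
# queue while Boskos creates a new dynamic resource (up to max-count).
REQUEST_ID=$(uuidgen)
curl -X POST "http://localhost:8080/acquire?type=aws-cluster&state=free&dest=busy&owner=$(whoami)&request_id=${REQUEST_ID}"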

API

POST /acquire

Use /acquire when you want to get hold of some resource.

Required Parameters

Name Type Description
type string type of requested resource
state string current state of the requested resource
dest string destination state of the requested resource
owner string requester of the resource

Optional Parameters

Name Type Description
request_id string request id to use to keep your priority rank

Example: /acquire?type=gce-project&state=free&dest=busy&owner=user.

On a successful request, /acquire will return HTTP 200 and a valid Resource JSON object.
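
A hedged curl equivalent of the example above, assuming a Boskos instance listening on localhost:8080 (for instance one started with -in_memory as described under "Local test" below):

curl -X POST "http://localhost:8080/acquire?type=gce-project&state=free&dest=busy&owner=$(whoami)"
# On success, the body is a single Resource JSON object, similar to:
# {"type":"gce-project","name":"...","state":"busy","owner":"...","lastupdate":"...","userdata":null}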

POST /acquirebystate

Use /acquirebystate when you want to get hold of a set of resources in a given state.

Required Parameters

Name Type Description
state string current state of the requested resource
dest string destination state of the requested resource
owner string requester of the resource
names string comma separated list of resource names

Example: /acquirebystate?state=free&dest=busy&owner=user&names=res1,res2.

On a successful request, /acquirebystate will return HTTP 200 and a valid JSON list of Resource objects.
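
A hedged curl version of the example above (localhost:8080 assumed as before; res1 and res2 must already exist in the given state):

curl -X POST "http://localhost:8080/acquirebystate?state=free&dest=busy&owner=$(whoami)&names=res1,res2"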

POST /release

Use /release when you are finished using a resource. The owner must match the current owner.

Required Parameters

Name Type Description
name string name of finished resource
owner string owner of the resource
dest string destination state of the released resource

Example: /release?name=k8s-jkns-foo&dest=dirty&owner=user
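
A hedged curl version of the example above (localhost:8080 assumed; the owner must match the owner that acquired the resource):

curl -X POST "http://localhost:8080/release?name=k8s-jkns-foo&dest=dirty&owner=$(whoami)"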

POST /update

Use /update to update a resource's last-update timestamp. The owner must match the current owner.

Required Parameters

Name Type Description
name string name of target resource
owner string owner of the resource
state string current state of the resource

Optional Parameters

To update user data, marshal the user data into the request body.

Example: /update?name=k8s-jkns-foo&state=free&owner=user
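
A hedged curl sketch (localhost:8080 assumed; the JSON body and its keys are purely illustrative, see "Adding UserData to a resource" below for a concrete walkthrough):

# Heartbeat only: refresh the resource's last-update timestamp.
curl -X POST "http://localhost:8080/update?name=k8s-jkns-foo&state=free&owner=$(whoami)"

# Heartbeat plus user data: the marshalled JSON body updates the resource's UserData.
curl -X POST -d '{"some-key":"some-value"}' \
  "http://localhost:8080/update?name=k8s-jkns-foo&state=free&owner=$(whoami)"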

POST /reset

Use /reset to reset a group of expired resources to a certain state.

Required Parameters

Name Type Description
type string type of resource in interest
state string current state of the expired resource
dest string destination state of the expired resource
expire durationStr resource has not been updated since before expire

Note: durationStr is any string that can be parsed by time.ParseDuration().

On a successful request, /reset will return HTTP 200 and a list of [Owner:Resource] pairs, which can be unmarshalled into map[string]string{}

Example: /reset?type=gce-project&state=busy&dest=dirty&expire=20m
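
A hedged curl version of the example above (localhost:8080 assumed; the response shown is illustrative):

curl -X POST "http://localhost:8080/reset?type=gce-project&state=busy&dest=dirty&expire=20m"
# Example response, an owner-to-resource map:
# {"some-owner":"some-project"}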

GET /metric

Use /metric to retrieve a metric.

Required Parameters

Name Type Description
type string type of requested resource

On a successful request, /metric will return HTTP 200 and a JSON object containing the count of projects in each state, the count of projects with each owner (or without an owner), and the sum of state moved to after /done (Todo). A sample object will look like:

{
        "type" : "project",
        "Current":
        {
                "total"   : 35,
                "free"    : 20,
                "dirty"   : 10,
                "injured" : 5
        },
        "Owners":
        {
                "fejta" : 1,
                "Senlu" : 1,
                "sig-testing" : 20,
                "Janitor" : 10,
                "None" : 20
        }
}
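
A hedged curl example for fetching the metric above (localhost:8080 assumed; note that this endpoint is a GET):

curl "http://localhost:8080/metric?type=project"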

Config update:

  1. Edit resources.yaml, and send a PR.

  2. After the PR is approved, make sure your branch is synced up with master.

  3. Run make update-config to update the ConfigMap (a manual equivalent is sketched after this list).

  4. Boskos updates its config every 10 minutes. Newly added resources will be available after the next update cycle. Deleted resources will be removed in a future update cycle once they are no longer owned by any user.
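
A hedged manual equivalent of make update-config, assuming the ConfigMap is named resources in the test-pods namespace as in the "K8s test" section below (the actual Makefile target may differ):

kubectl create configmap -n test-pods resources \
  --from-file=config=resources.yaml --dry-run=client -o yaml | kubectl apply -f -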

Other Components:

Reaper looks for resources that are owned but have not been updated for a period of time, and resets these stale resources to the dirty state so the Janitor component can pick them up. This prevents state leaks if a client process is killed unexpectedly.

Janitor looks for dirty resources from Boskos, kicks off a sub-janitor process to clean up each resource, and finally returns them to Boskos in a free state.

Metrics is a separate service that can display JSON metric results and exposes an HTTP endpoint for Prometheus monitoring.

Mason updates virtual resources using existing resources. An example is a cluster: in order to create a GKE cluster you need a GCP project. Mason looks for specific resources, releases leased resources as dirty (so the Janitor can pick them up), and requests brand new resources in order to convert them into the final resource state. Mason comes with its own client to ease usage. The Mason client takes care of acquiring and releasing all the right resources based on the User Data information.

Cleaner marks resources with status ToBeDeleted as Tombstone so they can be safely deleted by Boskos. This is important for dynamic resources, so that all associated resources can be released before deletion to prevent leaks.

Storage: there could be multiple implementations of how resources and Mason config are stored. Since multiple components have storage needs, there is now a shared storage implementation: in memory, and in-cluster via Kubernetes custom resource definitions.

crds is a general client library for storing data in Kubernetes custom resource definitions. In theory it could be used outside of Boskos.

For the boskos server that handles k8s e2e jobs, the status is available from the Prow monitoring dashboard

Adding UserData to a resource

  1. Check it out:

    curl -X POST "http://localhost:8080/acquire?type=my-resource&state=free&dest=busy&owner=$(whoami)"
    {"type":"my-resource","name":"resource1","state":"busy","owner":"user","lastupdate":"2019-02-07T22:33:38.01350902Z","userdata":null}
  2. Add the data:

    curl -X POST -d '{"access-key-id":"17","secret-access-key":"18"}' "http://localhost:8080/update?name=resource1&state=busy&owner=$(whoami)"
  3. Check it back in:

    curl -X POST "http://localhost:8080/release?name=resource1&dest=free&owner=$(whoami)"

Local test:

  1. Start Boskos with a fake config.yaml: go run boskos.go -in_memory -config=/path/to/config.yaml

  2. Send some local requests to Boskos:

curl 'http://127.0.0.1:8080/acquire?type=project&state=free&dest=busy&owner=user'

K8s test:

  1. Create and navigate to your own cluster

  2. make server-deployment

  3. make service

  4. Create the ConfigMap: kubectl create configmap -n test-pods resources --from-file=config=cfg.yaml. See boskos-resources.yaml for an example of how the config file should look.

  5. kubectl describe svc -n test-pods boskos to make sure boskos is running

  6. Test from another pod within the cluster

kubectl run curl --image=radial/busyboxplus:curl -i --tty
Waiting for pod default/curl-XXXXX to be running, status is Pending, pod ready: false
If you don't see a command prompt, try pressing enter.
[ root@curl-XXXXX:/ ]$ curl -X POST 'http://boskos.test-pods.svc.cluster.local/acquire?type=project&state=free&dest=busy&owner=user'

Community, discussion, contribution, and support

Learn how to engage with the Kubernetes community on the community page.

You can reach the maintainers of this project at:

Code of conduct

Participation in the Kubernetes community is governed by the Kubernetes Code of Conduct.


boskos's Issues

Boskos client ReleaseOne is not usable if process is restarted

Originally filed as kubernetes/test-infra#15910 by @chemikadze

What happened:

ReleaseOne performs a local check for whether the resource was previously allocated by the same client, and fails if it was not. After release, it makes the client forget that association.
Release does not perform this check and does not make sure the association is removed.

At the same time, all Allocate methods add associations. This means there is no consistent way to release a resource when the client has been recreated for some reason (for example, a process restart). One known workaround to make sure the object is released and there is no memory leak is to use Release when ReleaseOne fails, covering the restart case; however, this is quite error-prone without knowledge of the internals.

What you expected to happen:

ReleaseOne should not fail if client state is out of sync, or Release should clean up client state.

How to reproduce it (as minimally and precisely as possible):

Create two boskos clients, allocate in one client, and run release from another client.

Handwritten CRD DeepCopy methods are not deepcopying

For some reason we have handwritten rather than generated DeepCopy methods on our CRDs, and they definitely do not deep-copy. Something like the following is incorrect, because nested pointers just get copied over, so if someone changes the value that is pointed to, it changes on both the original and the "deepcopied" version:

out.Spec = in.Spec

/kind bug

aws-janitor: ensure IDs are unique across resources and regions for set.Mark

Some AWS resources don't have an ARN, so we're currently just using the ID of the resource, which may not necessarily be globally unique for the same resource and/or across resource types.

For several resource types, we generate a fake ARN which includes the resource type and region. We should ensure all resource types are doing something similar.

/kind cleanup

Release blocking tests cannot acquire project from boskos

gce-cos-master-scalability-100 release blocking tests are failing with:

2022/03/29 11:16:38 main.go:331: Something went wrong: failed to prepare test environment: --provider=gce boskos failed to acquire project: Post "http://boskos.test-pods.svc.cluster.local./acquire?dest=busy&owner=ci-kubernetes-e2e-gci-gce-scalability&request_id=eda4a4c1-65ce-460e-8906-35595e3e8d6f&state=free&type=scalability-project": dial tcp 10.35.241.148:80: connect: connection refused

Boskos CRD uses deprecated API group to be removed in v1.22

https://kubernetes.io/docs/reference/using-api/deprecation-guide/#customresourcedefinition-v122

The apiextensions.k8s.io/v1beta1 API version of CustomResourceDefinition will no longer be served in v1.22.

  • Migrate manifests and API clients to use the apiextensions.k8s.io/v1 API version, available since v1.16.
  • All existing persisted objects are accessible via the new API
  • Notable changes:
    • spec.scope is no longer defaulted to Namespaced and must be explicitly specified
    • spec.version is removed in v1; use spec.versions instead
    • spec.validation is removed in v1; use spec.versions[*].schema instead
    • spec.subresources is removed in v1; use spec.versions[*].subresources instead
    • spec.additionalPrinterColumns is removed in v1; use spec.versions[*].additionalPrinterColumns instead
    • spec.conversion.webhookClientConfig is moved to spec.conversion.webhook.clientConfig in v1
    • spec.conversion.conversionReviewVersions is moved to spec.conversion.webhook.conversionReviewVersions in v1
    • spec.versions[*].schema.openAPIV3Schema is now required when creating v1 CustomResourceDefinition objects, and must be a structural schema
    • spec.preserveUnknownFields: true is disallowed when creating v1 CustomResourceDefinition objects; it must be specified within schema definitions as x-kubernetes-preserve-unknown-fields: true
    • In additionalPrinterColumns items, the JSONPath field was renamed to jsonPath in v1 (fixes #66531)

AWS Janitor: Add support for ECR Public

In e2e CI jobs in CAPA, we're creating an extremely temporary ECR public repo to deploy a container image from the codebase into a created EC2 instance. These registries should be mopped up after use.

Note that ecr-public is its own API, distinct from normal ecr.

GCP janitor failing when trying to clean up logging sinks

At some point, it seems like GCP added two new Cloud Logging logs router sinks to projects:

  • _Default
  • _Required

These cannot be deleted, and this recently started causing cleanup of projects to fail, with error messages like the following:

ERROR: (gcloud.logging.sinks.delete) PERMISSION_DENIED: Sink _Default cannot be deleted. Consider disabling instead
ERROR: (gcloud.logging.sinks.delete) PERMISSION_DENIED: Sink _Required cannot be deleted
Error try to delete resources sinks: CalledProcessError()
Error try to delete resources sinks: CalledProcessError()

Migrate aws-janitor to use Go AWS SDK v2

The Go AWS SDK v1 (which is used by this project) is moving to maintenance mode in July 2024 and will be completely out of support in July 2025 [1].

We should update aws-janitor (and any other Boskos-related code) to use the v2 SDK.

AWS has published a migration guide [2] that we can use to understand the changes needed.

Footnotes

  1. https://aws.amazon.com/blogs/developer/announcing-end-of-support-for-aws-sdk-for-go-v1-on-july-31-2025/

  2. https://aws.github.io/aws-sdk-go-v2/docs/migrating/

Updates silently fail if configuration file is renamed

Boskos should fail on startup if the path to the config file is invalid.

We've discovered a deployment where Boskos is not updating its configuration.

What I think happened is that the --config flag pointed to a valid configuration file, and then at some point, the filename inside ConfigMap changed, resulting in the original file being deleted.

It appears that viper simply silently stops its file watch when a file is deleted, so there's no good way to detect if this error has occurred until the pod restarts.

I don't have a good idea for how to resolve this - should we periodically check the config file and crash if it's no longer valid? That'd also help a bit with #20. (If we didn't crash, we'd at least get error messages that might be helpful for users trying to figure out why their configs aren't being updated.)

cc @coryrc

reaper: support different expiration times based on state

Currently, the reaper only resets resources in state busy, cleaning, or leased. Furthermore, it uses the same expiration time for each.

One use case that isn't supported by this model is human inspection of failed resources. For example, if a test fails, a team might want to look at the state of the resource before cleaning it up. The tests could move this resource into a new state (perhaps purgatory), but then it will never be cleaned up. Ideally we'd be able to set a longer expiration time on this new state.

Tangential note: why do we need a separate reaper binary at all? Would it be simpler to have a setting in the main boskos configuration map that controls whether leases expire, and have boskos core do that itself? Putting configuration there would allow easy per-resource overrides, too.

gcp_janitor.py isn't thread-safe: multiple threaded invocations can corrupt GCloud config file

In one of my Boskos instances, I've observed failures when invoking gcp_janitor.py of the following form:

failed to clean up project asm-boskos-shared-vpc-svc-188, error info: ERROR: gcloud failed to load: Source contains parsing errors: '/root/.config/gcloud/configurations/config_default'
	[line 13]: 'oogleapis.com/\n'
    parsed_config.read(properties_path)
    self._read(fp, filename)
    raise e
	[line 13]: 'oogleapis.com/\n'

My analysis of gcp_janitor.py makes me believe it's not thread-safe, specifically when running gcloud config set.

The first place we run that is in line 511, where we run:

gcloud config set billing/quota-project <xyz>

The second is in line 588, where we run:

gcloud config set api_endpoint_overrides/gkehub https://<gkehub-url>/

The janitor itself, janitor.go, invokes gcp_janitor.py inside Goroutines, which run in parallel; I believe that if multiple threads attempt to run gcloud config set simultaneously, it can corrupt the GCloud config file (in my case, /root/.config/gcloud/configurations/config_default), which is shared among all threads. This renders any future attempts to run gcp_janitor.py futile, because the GCloud config file is irrevocably corrupted.

I believe I have a fix for this, involving setting os.environ rather than running gcloud config set. Specifically, you can replace the commands with environment variables like this:

gcloud config set billing/quota-project <xyz>
-> CLOUDSDK_BILLING_QUOTA_PROJECT=<xyz>

gcloud config set api_endpoint_overrides/gkehub https://<gkehub-url>/
-> CLOUDSDK_API_ENDPOINT_OVERRIDES_GKEHUB=https://<gkehub-url>

Looks as though the code already makes use of environment variables like these in line 436, where we do:

os.environ['CLOUDSDK_API_ENDPOINT_OVERRIDES_CONTAINER'] = endpoint

I'll put up a PR to modify those gcloud config set commands with os.environ assignments.

GCP projects stuck in Cleaning block other projects from cleaning

If I understand this logic correctly, the Janitor tries to acquire and clean all dirty resources sequentially, type by type; acquisition is also throttled by the size of a channel. If cleaning fails, the janitor returns the resource to the dirty state, and if acquisition was throttled, the loop will try to get the same resource again. So if the janitor(s) put resources back to dirty fast enough, it is possible for a janitor to get stuck on one resource type and do nothing for the other resource types.

This was observed previously with Filestore cleanups in a private install: multiple resource types were affected after one type accumulated more projects stuck in cleaning than the aggregate capacity of the channels across all janitor instances.

aws-janitor overly eagerly deleting IAM Role

On AWS, IAM Roles are identified only by name (there is no unique UUID). They do have a creation timestamp, however.

Our test jobs are creating IAM roles with the same name. The aws-janitor runs periodically, and if the timings work out "just so", aws-janitor will observe different IAM roles with the same name for the entire TTL window. It will then delete an IAM role, thinking that it is no longer in use, but in fact it has seen multiple different IAM roles with the same name.

I propose using CreationTimestamp to differentiate.

GCP janitor: support arbitrary cleanup commands instead of simple delete

Some gcloud cleanup commands are not standard delete commands - https://cloud.google.com/sdk/gcloud/reference/beta/compute/shared-vpc/associated-projects/remove as an example, thus cannot be simply added into the map in https://github.com/kubernetes-sigs/boskos/blob/master/cmd/janitor/gcp_janitor.py#L32-L91.

One option to support this is to add custom functions similar as clean_gke_cluster, but there's probably a better approach?

/cc @ixdy

REST client without depending on kubernetes staging libraries

This is a feature request asking that we consider publishing a go module that does not import Kubernetes staging libraries, at least for the purposes of talking to the boskos API.

We need a package like this somewhere for projects like kubetest (kubernetes/test-infra#20422); having importable packages that depend on these is a bit of a nightmare.

Also: Having an independent client module might help resolve circular dependencies with any tools in test-infra that talk to boskos while boskos is importing test-infra ...

Deleting static resource may not take effect until next config update or container restart

Static resources are deleted only on SyncConfig, which is triggered by a ConfigMap update, and deletion of in-use resources is delayed until the next SyncConfig. But syncs happen only on container restart or config change, so if a resource was in use, its deletion may be significantly delayed.

Dynamic resources, however, are updated at the same time as static resources and also on dynamic-resource-update-period (10 minutes by default). Should static resources be updated at a similar cadence as well?

Update controller-runtime to v0.15.0

Controller-runtime v0.15.0 was released on 5/23/2023.

https://github.com/kubernetes-sigs/controller-runtime/releases/tag/v0.15.0

This new version has many breaking changes, and if someone imports this repo as a dependency, it blocks them from bumping controller-runtime to v0.15.0:

could not import sigs.k8s.io/boskos/crds (-: # sigs.k8s.io/boskos/crds
vendor/sigs.k8s.io/boskos/crds/client.go:89:24: cannot use func(_ *rest.Config) (meta.RESTMapper, error) {…} (value of type func(_ *rest.Config) (meta.RESTMapper, error)) as func(c *rest.Config, httpClient *http.Client) (meta.RESTMapper, error) value in struct literal
vendor/sigs.k8s.io/boskos/crds/client.go:94:15: cannot use func(_ cache.Cache, _ *rest.Config, _ ctrlruntimeclient.Options, _ ...ctrlruntimeclient.Object) (ctrlruntimeclient.Client, error) {…} (value of type func(_ "sigs.k8s.io/controller-runtime/pkg/cache".Cache, _ *rest.Config, _
 client.Options, _ ...client.Object) (client.Client, error)) as client.NewClientFunc value in struct literal) (typecheck)
        "sigs.k8s.io/boskos/crds" 

flaky test: cleaner TestRecycleResources/noLeasedResources

Seen failing here: https://prow.k8s.io/view/gcs/kubernetes-jenkins/pr-logs/pull/kubernetes-sigs_boskos/24/pull-boskos-build-test-verify/1270814151844827142

=== FAIL: cleaner TestRecycleResources/noLeasedResources (0.05s)
time="2020-06-10T20:26:08Z" level=info msg="Cleaner started"
time="2020-06-10T20:26:08Z" level=info msg="Resource dynamic_2 is being recycled"
time="2020-06-10T20:26:08Z" level=info msg="Resource dynamic_2 is being recycled"
time="2020-06-10T20:26:08Z" level=info msg="Resource dynamic_2 is being recycled"
time="2020-06-10T20:26:08Z" level=info msg="Released dynamic_2 as tombstone"
time="2020-06-10T20:26:08Z" level=info msg="Released dynamic_2 as tombstone"
time="2020-06-10T20:26:08Z" level=info msg="Resource dynamic_2 is being recycled"
time="2020-06-10T20:26:08Z" level=info msg="Released dynamic_2 as tombstone"
time="2020-06-10T20:26:08Z" level=error msg="Release failed" error="owner mismatch request by cleaner, currently owned by "
time="2020-06-10T20:26:08Z" level=error msg="failed to release dynamic_2 as tombstone" error="owner mismatch request by cleaner, currently owned by "
time="2020-06-10T20:26:08Z" level=info msg="Resource dynamic_2 is being recycled"
    TestRecycleResources/noLeasedResources: cleaner_test.go:212: resource dynamic_2 state cleaning does not match expected tombstone
time="2020-06-10T20:26:08Z" level=info msg="Stopping Cleaner"
time="2020-06-10T20:26:08Z" level=info msg="Exiting recycleAll Thread"
time="2020-06-10T20:26:08Z" level=info msg="Released dynamic_2 as tombstone"
time="2020-06-10T20:26:08Z" level=info msg="Exiting recycleAll Thread"
time="2020-06-10T20:26:08Z" level=info msg="Exiting recycleAll Thread"
time="2020-06-10T20:26:08Z" level=info msg="Exiting recycleAll Thread"
time="2020-06-10T20:26:08Z" level=info msg="Exiting recycleAll Thread"
time="2020-06-10T20:26:08Z" level=info msg="Cleaner stopped"
    --- FAIL: TestRecycleResources/noLeasedResources (0.05s)
=== FAIL: cleaner TestRecycleResources (0.21s)

/kind bug

static resources removed from the configuration may never be deleted

Originally filed as kubernetes/test-infra#17282

I mentioned this tangentially in kubernetes/test-infra#16047 (comment), but I want to pull it out into a separate issue to highlight it more easily.

Boskos doesn't delete static resources that are removed from the configuration if they are in use, to ensure that jobs don't fail, and to ensure that such resources are properly cleaned up by the janitor.

Originally, this was a reasonable decision, since Boskos periodically synced its storage against the configuration, and most likely such resources would eventually be free and thus deleted from storage.

After kubernetes/test-infra#13990, Boskos only syncs its storage against the configuration when the configuration changes (or when Boskos restarts). As a result, it may take a long time for static resources to be deleted, if ever.

There was a similar issue for DRLCs that I addressed in kubernetes/test-infra#16021, effectively by putting the DRLCs into lame-duck mode.

There isn't a clear way to indicate that static resources are in lame-duck mode, though.

Possible ways to address this bug, in increasing order of complexity:

  1. Just delete static resources, regardless of what state they're in.
  2. Periodically sync storage against the config. It's probably less expensive now, due to the improvements around locking.
  3. Somehow indicate that resources are in lame-duck mode to prevent them from being leased, and then delete them once free:
    a. Add a field into the UserData for static resources. (Currently UserData is not used for static resources.)
    b. Set an ExpirationDate on static resources. (Currently ExpirationDate is not used for static resources.)
    c. Add a new field on the ResourceStatus indicating resources are in lame-duck mode.

Workaround until this bug is fixed: admins with access to the cluster where Boskos is running can just delete the resources manually using kubectl.

GCP Janitor: support clean up GCP resources in the additional zones

Currently there is a list of GCP zones configured in GCP Janitor - https://github.com/kubernetes-sigs/boskos/blob/master/cmd/janitor/gcp_janitor.py#L94-L184. When the script is run, Janitor will clean up all the GCP resources in these zones. However, the list is not exclusive and we cannot add the zones that are not publicly launched (e.g. us-east1-a) since the cleanups will fail for projects that cannot access these zones.

One solution would be adding an extra --additional-zones flag to gcp_janitor.py that allows extending the list of zones. For Janitor instances that manage internal GCP projects, pass the internal zones so that it can also clean up GCP resources in these zones.

/assign

Finish setting up this repository

A number of tasks remain. In no particular order:

Logger is not configured correctly

Looking at logs from a recent boskos build (from this repo) reveals that some things are not configured appropriately:

boskos:

{"component":"unset","file":"/go/pkg/mod/github.com/sirupsen/[email protected]/entry.go:314","func":"github.com/sirupsen/logrus.(*Entry).Logf","handler":"handleUpdate","level":"info","msg":"From 10.44.2.234:42282","time":"2020-06-05T23:01:09Z"}

cleaner:

{"component":"unset","file":"/go/pkg/mod/github.com/sirupsen/[email protected]/logger.go:192","func":"github.com/sirupsen/logrus.(*Logger).Log","level":"info","msg":"Cleaner started","time":"2020-06-05T22:47:27Z"}

etc.

The component is unset because we aren't setting the variables in k8s.io/test-infra/prow/version.
We can fairly easily add linker flags for this, though it'll require a bit of work in the Makefile (or maybe we'll just want to write a wrapper script).

I'm not sure why we're getting useless file and function annotations, though.

/kind bug

Unexpected behavior of boskos in acquirestate when destination and current state are both free.

acquirestate loses a resource's state when the destination state and the current state of a resource are both free. This means that if the same acquirestate request is called twice sequentially, Boskos returns an error the second time. Only after a release request does acquirestate work again. The following are two proposed solutions:

  1. return an error when the destination and current state are the same (recommended)
  2. disallow the destination state from being free

Support multiple config files

Currently, Boskos reads a single configuration file, but maintaining a single config file can be painful, and there has been some desire to support multiple configuration files.

Internally, Boskos is using viper, which only supports one configuration file per instance. We could experiment to see if multiple instances would be a feasible approach, or if there is some other way to address this request.

/kind feature

changing type of a static resource in config doesn't update storage

Originally filed as kubernetes/test-infra#16047

What happened:
We renamed the type of some of our static resources to work around a bug in the janitor.
(That bug: we had a group of projects that didn't end in -project, and thus the janitor was passing the wrong flag: https://github.com/kubernetes/test-infra/blob/761c11f53ddb7dde3fcc4073a7e3b9015554fe7f/boskos/janitor/janitor.go#L92-L99
)

After applying the config, the old type still remained in storage (in the Kubernetes objects).

What you expected to happen:
Boskos would update storage (in the Kubernetes objects) reflecting the new type.

How to reproduce it (as minimally and precisely as possible):
I wrote a simple unit test that reproduces this failure: ixdy/kubernetes-test-infra@d6714a6

It looks like when updating static resources, we just check whether the resources specified in the config exist in storage and vice versa, looking only at the resource names. We do not consider that other metadata (such as type) may have changed.

gcp_janitor deletes instances first causing some recreates

Originally filed as kubernetes/test-infra#16965 by @oxddr

What happened:
gcp_janitor deleted all instances first. Some of them belonged to managed instance groups, so they were recreated and then deleted a few seconds later when the IGM was deleted.

What you expected to happen:
Only non-managed instances should be deleted first or managed instance groups should be deleted before instances.

How to reproduce it (as minimally and precisely as possible):
Create clusters on GCE using kube-up and clean them up with gcp_janitor.

Please provide links to example occurrences, if any:
n/a

Anything else we need to know?:
-

janitor fails to clean up some resources in a timely manner if dirty rates are unequal

Originally filed as kubernetes/test-infra#15925

Creating a one-sentence summary of this issue is hard, but the basic bug is fairly easy to understand.

Assume a Boskos instance has three resource types, A, B, and C. A has 5 resources, B has 10, and C has 100. A Boskos janitor has been configured to clean all three types.

Currently, the janitor loops through all resource types, iteratively cleaning one resource of each type. If the janitor finds that one of the types has no dirty resources, it stops querying that resource type until all resources have been cleaned, at which point it waits a minute and then starts over with the complete list again.

In our hypothetical case (as well as observed in practice), what this means is that the janitor will finish cleaning resources of type A (and possibly B), while still having many more C resources to clean. Additionally, given that C is such a large pool, there will likely be many jobs making more C resources dirty. As a result, it will be quite some time before the janitor attempts to clean A resources, and the pool will probably fill up with dirty resources.

Possible ways to mitigate the issue (in increasing complexity):

  • increase the number of janitor replicas
  • segment the janitors (i.e. have separate janitors for each type)
  • remove the optimization in the janitor loop, continuing to attempt to acquire all resource types (this will likely result in more /acquire RPCs to Boskos)
  • use Boskos metrics to select which resources to attempt to clean. This could even be prioritized (e.g. focus on whichever type is closest to running out of resources), though that might lead to different issues with starvation. Additionally, a failing cleanup could mean the janitor might get completely stuck.

Boskos client does not distinguish between incorrect resource type and no resources available

Currently, when acquiring a resource using AcquireWait(), the client makes no distinction between a resource type that does not exist (which could arise from a user typo) and a situation where all resources are busy.

Unfortunately, the Boskos server sends the 404 status code in both situations (see the server code here). The Boskos client then looks at the status code here.

One easy and backwards-compatible fix would be to add a new option to the client that, when set, distinguishes between the two situations based on the text returned from the HTTP call. The response text for the two situations is:

  • Acquire failed: resource type "my-resource-type" does not exist
  • Acquire failed: no available resource my-resource-type, try again later

The current problem is that if the user accidentally asks for a resource type that does not exist, they will be stuck in a loop forever with no hope of unblocking except for a Context timeout, which, in my opinion, is not an acceptable situation.

Release blocking tests cannot acquire project from boskos

Similar to: #118

gce-cos-master-scalability-100 release blocking tests are failing with:

2022/04/06 10:36:20 main.go:331: Something went wrong: failed to prepare test environment: --provider=gce boskos failed to acquire project: Post "http://boskos.test-pods.svc.cluster.local./acquire?dest=busy&owner=ci-kubernetes-e2e-gci-gce-scalability&request_id=e01699c7-977c-4045-81e5-2d8825e132c6&state=free&type=scalability-project": dial tcp 10.35.241.148:80: connect: connection refused

Allow running Boskos in HA mode

Currently it is not possible to run Boskos with more than one replica, because it maintains an in-memory FIFO queue to make sure leases are handed out in the order they were requested. This makes it impossible to run in HA, resulting in downtime for the whole service if the single replica goes down for whatever reason.

janitor: track when cleanup fails repeatedly for the same resource

Originally filed as kubernetes/test-infra#15866

Due to programming errors, the janitor may continuously fail to clean up a resource. Two examples I just discovered:

possibly an order-of-deletion issue:

{"error":"exit status 1","level":"info","msg":"failed to clean up project kube-gke-upg-1-2-1-3-upg-clu-n, error info: Activated service account credentials for: [[email protected]]\nERROR: (gcloud.compute.networks.delete) Could not fetch resource:\n - The network resource 'projects/kube-gke-upg-1-2-1-3-upg-clu-n/global/networks/jenkins-e2e' is already being used by 'projects/kube-gke-upg-1-2-1-3-upg-clu-n/global/routes/default-route-92807148d5aa60d1'\n\nError try to delete resources networks: CalledProcessError()\n[=== Start Janitor on project 'kube-gke-upg-1-2-1-3-upg-clu-n' ===]\n[=== Activating service_account /etc/service-account/service-account.json ===]\n[=== Finish Janitor on project 'kube-gke-upg-1-2-1-3-upg-clu-n' with status 1 ===]\n","time":"2020-01-10T21:03:14Z"}

likely incorrect flags (gcloud changed but we didn't?):

{"error":"exit status 1","level":"info","msg":"failed to clean up project k8s-jkns-e2e-gke-ci-canary, error info: Activated service account credentials for: [[email protected]]\nERROR: (gcloud.compute.disks.delete) unrecognized arguments: --global \n\nTo search the help text of gcloud commands, run:\n  gcloud help -- SEARCH_TERMS\nError try to delete resources disks: CalledProcessError()\nERROR: (gcloud.compute.disks.delete) unrecognized arguments: --region=https://www.googleapis.com/compute/v1/projects/k8s-jkns-e2e-gke-ci-canary/regions/us-central1 \n\nTo search the help text of gcloud commands, run:\n  gcloud help -- SEARCH_TERMS\nError try to delete resources disks: CalledProcessError()\n[=== Start Janitor on project 'k8s-jkns-e2e-gke-ci-canary' ===]\n[=== Activating service_account /etc/service-account/service-account.json ===]\n[=== Finish Janitor on project 'k8s-jkns-e2e-gke-ci-canary' with status 1 ===]\n","time":"2020-01-10T21:18:55Z"}

It'd be good to have some way of detecting when we're repeatedly failing to clean up a resource.
Not sure yet what the best way would be to track that.
