boskos

Background

βοσκός - shepherd in greek!

boskos is a resource manager service that handles different kinds of resources and manages their transitions between states.

Introduction

Boskos is initialized with a resource configuration, a list of resources by name. It is passed in via -config, usually as a ConfigMap.

Boskos supports two types of resources: static and dynamic. Static resources correspond to actual physical resources, meaning someone needs to create them and add them to the list. Dynamic resources may depend on static resources. In the example below, aws-account is a static resource, and aws-cluster is a dynamic resource that depends on having an aws-account. Once a cluster is created, AWS resources are in use, so an admin might want to always keep a minimum number of clusters available for testing, while allowing more clusters to be created for spikes in usage.

---
resources:
  # Static
  - type: "aws-account"
    state: free
    names:
    - "account1"
    - "account2"
  # Dynamic
  - type: "aws-cluster"
    state: dirty
    min-count: 1
    max-count: 2
    lifespan: 48h
    needs:
      aws-account: 1
    config:
      type: AWSClusterCreator
      content: "..."

Type can be a GCP project, a cluster, or even a Dota 2 server: anything that you want to manage as a group of resources. Name is a unique identifier of the resource. State is a string describing the current status of the resource.

User Data exists for customization. In Mason, for example, we create new resources from existing ones (creating a cluster inside a GCP project), but in order to acquire the right resources, we need to store some information in the final resource's UserData. It is up to the implementation to parse the string into the right struct. UserData can be updated using the update API call. All resource user data is returned as part of acquisition (calling acquire or acquirebystate).

Dynamic Resources

As explained in the introduction, dynamic resources were introduced to reduce cost.

If all resources are currently in use and the count of resources is below max-count, Boskos will create new resources on Acquire. To take advantage of this, users need to specify a request ID in Acquire and keep using the same request ID until the resource is available.

Boskos takes care of naming and creating resources (if the current count is below min-count) and deleting resources when they expire (the lifespan option) or exceed max-count.

Any resource being deleted (due to a config update or expiration) is first marked ToBeDeleted. The cleaner component then marks it Tombstone so that it can be safely deleted by Boskos. The cleaner ensures that dynamic resources release any other leased resources associated with them, to prevent leaks.
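
A hedged sketch of acquiring a dynamic resource with a request ID, assuming a Boskos instance listening on localhost:8080 and the aws-cluster type from the sample config above (uuidgen is only one way to produce a stable identifier):

# Reuse the same request_id on every retry so the request keeps its place in the
# queue while Boskos creates a new dynamic resource (up to max-count).
REQUEST_ID=$(uuidgen)
curl -X POST "http://localhost:8080/acquire?type=aws-cluster&state=free&dest=busy&owner=$(whoami)&request_id=${REQUEST_ID}"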

API

POST /acquire

Use /acquire when you want to get hold of some resource.

Required Parameters

Name Type Description
type string type of requested resource
state string current state of the requested resource
dest string destination state of the requested resource
owner string requester of the resource

Optional Parameters

Name Type Description
request_id string request id to use to keep your priority rank

Example: /acquire?type=gce-project&state=free&dest=busy&owner=user.

On a successful request, /acquire will return HTTP 200 and a valid Resource JSON object.
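
A hedged curl equivalent of the example above, assuming a Boskos instance listening on localhost:8080 (for instance one started with -in_memory as described under "Local test" below):

curl -X POST "http://localhost:8080/acquire?type=gce-project&state=free&dest=busy&owner=$(whoami)"
# On success, the body is a single Resource JSON object, similar to:
# {"type":"gce-project","name":"...","state":"busy","owner":"...","lastupdate":"...","userdata":null}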

POST /acquirebystate

Use /acquirebystate when you want to get hold of a set of resources in a given state.

Required Parameters

Name Type Description
state string current state of the requested resource
dest string destination state of the requested resource
owner string requester of the resource
names string comma separated list of resource names

Example: /acquirebystate?state=free&dest=busy&owner=user&names=res1,res2.

On a successful request, /acquirebystate will return HTTP 200 and a valid JSON list of Resource objects.
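
A hedged curl version of the example above (localhost:8080 assumed as before; res1 and res2 must already exist in the given state):

curl -X POST "http://localhost:8080/acquirebystate?state=free&dest=busy&owner=$(whoami)&names=res1,res2"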

POST /release

Use /release when you are finished using a resource. The owner must match the current owner.

Required Parameters

Name Type Description
name string name of finished resource
owner string owner of the resource
dest string destination state of the released resource

Example: /release?name=k8s-jkns-foo&dest=dirty&owner=user
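
A hedged curl version of the example above (localhost:8080 assumed; the owner must match the owner that acquired the resource):

curl -X POST "http://localhost:8080/release?name=k8s-jkns-foo&dest=dirty&owner=$(whoami)"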

POST /update

Use /update to update a resource's last-update timestamp. The owner must match the current owner.

Required Parameters

Name Type Description
name string name of target resource
owner string owner of the resource
state string current state of the resource

Optional Parameters

To update user data, marshal the user data into the request body.

Example: /update?name=k8s-jkns-foo&state=free&owner=user
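
A hedged curl sketch (localhost:8080 assumed; the JSON body and its keys are purely illustrative, see "Adding UserData to a resource" below for a concrete walkthrough):

# Heartbeat only: refresh the resource's last-update timestamp.
curl -X POST "http://localhost:8080/update?name=k8s-jkns-foo&state=free&owner=$(whoami)"

# Heartbeat plus user data: the marshalled JSON body updates the resource's UserData.
curl -X POST -d '{"some-key":"some-value"}' \
  "http://localhost:8080/update?name=k8s-jkns-foo&state=free&owner=$(whoami)"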

POST /reset

Use /reset to reset a group of expired resources to a certain state.

Required Parameters

Name Type Description
type string type of resource in interest
state string current state of the expired resource
dest string destination state of the expired resource
expire durationStr resource has not been updated since before expire

Note: durationStr is any string that can be parsed by time.ParseDuration().

On a successful request, /reset will return HTTP 200 and a list of [Owner:Resource] pairs, which can be unmarshalled into map[string]string{}

Example: /reset?type=gce-project&state=busy&dest=dirty&expire=20m
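
A hedged curl version of the example above (localhost:8080 assumed; the response shown is illustrative):

curl -X POST "http://localhost:8080/reset?type=gce-project&state=busy&dest=dirty&expire=20m"
# Example response, an owner-to-resource map:
# {"some-owner":"some-project"}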

GET /metric

Use /metric to retrieve a metric.

Required Parameters

Name Type Description
type string type of requested resource

On a successful request, /metric will return HTTP 200 and a JSON object containing the count of projects in each state, the count of projects with each owner (or without an owner), and the sum of state moved to after /done (Todo). A sample object will look like:

{
        "type" : "project",
        "Current":
        {
                "total"   : 35,
                "free"    : 20,
                "dirty"   : 10,
                "injured" : 5
        },
        "Owners":
        {
                "fejta" : 1,
                "Senlu" : 1,
                "sig-testing" : 20,
                "Janitor" : 10,
                "None" : 20
        }
}
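
A hedged curl example for fetching the metric above (localhost:8080 assumed; note that this endpoint is a GET):

curl "http://localhost:8080/metric?type=project"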

Config update:

  1. Edit resources.yaml, and send a PR.

  2. After the PR is approved, make sure your branch is synced up with master.

  3. Run make update-config to update the ConfigMap (a manual equivalent is sketched after this list).

  4. Boskos updates its config every 10 minutes. Newly added resources will be available after the next update cycle. Deleted resources will be removed in a future update cycle once they are no longer owned by any user.
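
A hedged manual equivalent of make update-config, assuming the ConfigMap is named resources in the test-pods namespace as in the "K8s test" section below (the actual Makefile target may differ):

kubectl create configmap -n test-pods resources \
  --from-file=config=resources.yaml --dry-run=client -o yaml | kubectl apply -f -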

Other Components:

Reaper looks for resources that are owned but have not been updated for a period of time, and resets these stale resources to the dirty state so the Janitor component can pick them up. This prevents state leaks if a client process is killed unexpectedly.

Janitor looks for dirty resources from Boskos, kicks off a sub-janitor process to clean up each resource, and finally returns them to Boskos in a free state.

Metrics is a separate service that can display JSON metric results and exposes an HTTP endpoint for Prometheus monitoring.

Mason updates virtual resources using existing resources. An example is a cluster: in order to create a GKE cluster you need a GCP project. Mason looks for specific resources, releases leased resources as dirty (so the Janitor can pick them up), and requests brand new resources in order to convert them into the final resource state. Mason comes with its own client to ease usage. The Mason client takes care of acquiring and releasing all the right resources based on the User Data information.

Cleaner marks resources with status ToBeDeleted as Tombstone so they can be safely deleted by Boskos. This is important for dynamic resources, so that all associated resources can be released before deletion to prevent leaks.

Storage: there could be multiple implementations of how resources and Mason config are stored. Since multiple components have storage needs, there is now a shared storage implementation: in memory, and in-cluster via Kubernetes custom resource definitions.

crds is a general client library for storing data in Kubernetes custom resource definitions. In theory it could be used outside of Boskos.

For the boskos server that handles k8s e2e jobs, the status is available from the Prow monitoring dashboard

Adding UserData to a resource

  1. Check it out:

    curl -X POST "http://localhost:8080/acquire?type=my-resource&state=free&dest=busy&owner=$(whoami)"
    {"type":"my-resource","name":"resource1","state":"busy","owner":"user","lastupdate":"2019-02-07T22:33:38.01350902Z","userdata":null}
  2. Add the data:

    curl -X POST -d '{"access-key-id":"17","secret-access-key":"18"}' "http://localhost:8080/update?name=resource1&state=busy&owner=$(whoami)"
  3. Check it back in:

    curl -X POST "http://localhost:8080/release?name=resource1&dest=free&owner=$(whoami)"

Local test:

  1. Start Boskos with a fake config.yaml: go run boskos.go -in_memory -config=/path/to/config.yaml

  2. Send some local requests to Boskos:

curl 'http://127.0.0.1:8080/acquire?type=project&state=free&dest=busy&owner=user'

K8s test:

  1. Create and navigate to your own cluster

  2. make server-deployment

  3. make service

  4. Create the ConfigMap: kubectl create configmap -n test-pods resources --from-file=config=cfg.yaml. See boskos-resources.yaml for an example of how the config file should look.

  5. kubectl describe svc -n test-pods boskos to make sure boskos is running

  6. Test from another pod within the cluster

kubectl run curl --image=radial/busyboxplus:curl -i --tty
Waiting for pod default/curl-XXXXX to be running, status is Pending, pod ready: false
If you don't see a command prompt, try pressing enter.
[ root@curl-XXXXX:/ ]$ curl -X POST 'http://boskos.test-pods.svc.cluster.local/acquire?type=project&state=free&dest=busy&owner=user'

Community, discussion, contribution, and support

Learn how to engage with the Kubernetes community on the community page.

You can reach the maintainers of this project at:

Code of conduct

Participation in the Kubernetes community is governed by the Kubernetes Code of Conduct.


boskos's Issues

Boskos client ReleaseOne is not usable if process is restarted

Originally filed as kubernetes/test-infra#15910 by @chemikadze

What happened:

ReleaseOne performs a local check for whether the resource was previously allocated by the same client, and fails if it was not. After release, it makes the client forget that association.
Release does not perform this check and does not make sure the association is removed.

At the same time, all Allocate methods add associations. This means there is no consistent way to release a resource when the client has been recreated for some reason (for example, a process restart). One known workaround to make sure the object is released and there is no memory leak is to use Release when ReleaseOne fails, covering the restart case; however, this is quite error-prone without knowledge of the internals.

What you expected to happen:

ReleaseOne should not fail if client state is out of sync, or Release should clean up client state.

How to reproduce it (as minimally and precisely as possible):

Create two boskos clients, allocate in one client, and run release from another client.

Handwritten CRD DeepCopy methods are not deepcopying

For some reason we have handwritten rather than generated DeepCopy methods on our CRDs, and they definitely do not deep-copy. Something like the following is incorrect, because nested pointers just get copied over, so if someone changes the value that is pointed to, it changes on both the original and the "deepcopied" version:

out.Spec = in.Spec

/kind bug

aws-janitor: ensure IDs are unique across resources and regions for set.Mark

Some AWS resources don't have an ARN, so we're currently just using the ID of the resource, which may not necessarily be globally unique for the same resource and/or across resource types.

For several resource types, we generate a fake ARN which includes the resource type and region. We should ensure all resource types are doing something similar.

/kind cleanup

Release blocking tests cannot acquire project from boskos

gce-cos-master-scalability-100 release blocking tests are failing with:

2022/03/29 11:16:38 main.go:331: Something went wrong: failed to prepare test environment: --provider=gce boskos failed to acquire project: Post "http://boskos.test-pods.svc.cluster.local./acquire?dest=busy&owner=ci-kubernetes-e2e-gci-gce-scalability&request_id=eda4a4c1-65ce-460e-8906-35595e3e8d6f&state=free&type=scalability-project": dial tcp 10.35.241.148:80: connect: connection refused

Boskos CRD uses deprecated API group to be removed in v1.22

https://kubernetes.io/docs/reference/using-api/deprecation-guide/#customresourcedefinition-v122

The apiextensions.k8s.io/v1beta1 API version of CustomResourceDefinition will no longer be served in v1.22.

  • Migrate manifests and API clients to use the apiextensions.k8s.io/v1 API version, available since v1.16.
  • All existing persisted objects are accessible via the new API
  • Notable changes:
    • spec.scope is no longer defaulted to Namespaced and must be explicitly specified
    • spec.version is removed in v1; use spec.versions instead
    • spec.validation is removed in v1; use spec.versions[*].schema instead
    • spec.subresources is removed in v1; use spec.versions[*].subresources instead
    • spec.additionalPrinterColumns is removed in v1; use spec.versions[*].additionalPrinterColumns instead
    • spec.conversion.webhookClientConfig is moved to spec.conversion.webhook.clientConfig in v1
    • spec.conversion.conversionReviewVersions is moved to spec.conversion.webhook.conversionReviewVersions in v1
    • spec.versions[*].schema.openAPIV3Schema is now required when creating v1 CustomResourceDefinition objects, and must be a structural schema
    • spec.preserveUnknownFields: true is disallowed when creating v1 CustomResourceDefinition objects; it must be specified within schema definitions as x-kubernetes-preserve-unknown-fields: true
    • In additionalPrinterColumns items, the JSONPath field was renamed to jsonPath in v1 (fixes #66531)

AWS Janitor: Add support for ECR Public

In e2e CI jobs in CAPA, we're creating an extremely temporary ECR public repo to deploy a container image from the codebase into a created EC2 instance. These registries should be mopped up after use.

Note that ecr-public is its own API, distinct from normal ecr.

GCP janitor failing when trying to clean up logging sinks

At some point, it seems like GCP added two new Cloud Logging logs router sinks to projects:

  • _Default
  • _Required

These cannot be deleted, and this recently started causing cleanup of projects to fail, with error messages like the following:

ERROR: (gcloud.logging.sinks.delete) PERMISSION_DENIED: Sink _Default cannot be deleted. Consider disabling instead
ERROR: (gcloud.logging.sinks.delete) PERMISSION_DENIED: Sink _Required cannot be deleted
Error try to delete resources sinks: CalledProcessError()
Error try to delete resources sinks: CalledProcessError()

Migrate aws-janitor to use Go AWS SDK v2

The Go AWS SDK v1 (which is used by this project) is moving to maintenance mode in July 2024 and will be completely out of support in July 2025 [1].

We should update aws-janitor (and any other Boskos-related code) to use the v2 SDK.

AWS has published a migration guide [2] that we can use to understand the changes needed.

Footnotes

  1. https://aws.amazon.com/blogs/developer/announcing-end-of-support-for-aws-sdk-for-go-v1-on-july-31-2025/

  2. https://aws.github.io/aws-sdk-go-v2/docs/migrating/

Updates silently fail if configuration file is renamed

Boskos should fail on startup if the path to the config file is invalid.

We've discovered a deployment where Boskos is not updating its configuration.

What I think happened is that the --config flag pointed to a valid configuration file, and then at some point, the filename inside ConfigMap changed, resulting in the original file being deleted.

It appears that viper simply silently stops its file watch when a file is deleted, so there's no good way to detect if this error has occurred until the pod restarts.

I don't have a good idea for how to resolve this - should we periodically check the config file and crash if it's no longer valid? That'd also help a bit with #20. (If we didn't crash, we'd at least get error messages that might be helpful for users trying to figure out why their configs aren't being updated.)

cc @coryrc

reaper: support different expiration times based on state

Currently, the reaper only resets resources in state busy, cleaning, or leased. Furthermore, it uses the same expiration time for each.

One use case that isn't supported by this model is human inspection of failed resources. For example, if a test fails, a team might want to look at the state of the resource before cleaning it up. The tests could move this resource into a new state (perhaps purgatory), but then it will never be cleaned up. Ideally we'd be able to set a longer expiration time on this new state.

Tangential note: why do we need a separate reaper binary at all? Would it be simpler to have a setting in the main boskos configuration map that controls whether leases expire, and have boskos core do that itself? Putting configuration there would allow easy per-resource overrides, too.

gcp_janitor.py isn't thread-safe: multiple threaded invocations can corrupt GCloud config file

In one of my Boskos instances, I've observed failures when invoking gcp_janitor.py of the following form:

failed to clean up project asm-boskos-shared-vpc-svc-188, error info: ERROR: gcloud failed to load: Source contains parsing errors: '/root/.config/gcloud/configurations/config_default'
	[line 13]: 'oogleapis.com/\n'
    parsed_config.read(properties_path)
    self._read(fp, filename)
    raise e
	[line 13]: 'oogleapis.com/\n'

My analysis of gcp_janitor.py makes me believe it's not thread-safe, specifically when running gcloud config set.

The first place we run that is in line 511, where we run:

gcloud config set billing/quota-project <xyz>

The second is in line 588, where we run:

gcloud config set api_endpoint_overrides/gkehub https://<gkehub-url>/

The janitor itself, janitor.go, invokes gcp_janitor.py inside Goroutines, which run in parallel; I believe that if multiple threads attempt to run gcloud config set simultaneously, it can corrupt the GCloud config file (in my case, /root/.config/gcloud/configurations/config_default), which is shared among all threads. This renders any future attempts to run gcp_janitor.py futile, because the GCloud config file is irrevocably corrupted.

I believe I have a fix for this, involving setting os.environ rather than running gcloud config set. Specifically, you can replace the commands with environment variables like this:

gcloud config set billing/quota-project <xyz>
-> CLOUDSDK_BILLING_QUOTA_PROJECT=<xyz>

gcloud config set api_endpoint_overrides/gkehub https://<gkehub-url>/
-> CLOUDSDK_API_ENDPOINT_OVERRIDES_GKEHUB=https://<gkehub-url>

Looks as though the code already makes use of environment variables like these in line 436, where we do:

os.environ['CLOUDSDK_API_ENDPOINT_OVERRIDES_CONTAINER'] = endpoint

I'll put up a PR to modify those gcloud config set commands with os.environ assignments.

GCP projects stuck in Cleaning block other projects from cleaning

If I understand this logic correctly, the Janitor tries to acquire and clean all dirty resources sequentially, type by type; acquisition is also throttled by the size of a channel. If cleaning fails, the janitor returns the resource to the dirty state, and if acquisition was throttled, the loop will try to get the same resource again. So if the janitor(s) put resources back to dirty fast enough, it is possible for a janitor to get stuck on one resource type and do nothing for the other resource types.

This was observed previously with Filestore cleanups in a private install: multiple resource types were affected after one type accumulated more projects stuck in cleaning than the aggregate capacity of the channels across all janitor instances.

aws-janitor overly eagerly deleting IAM Role

On AWS, IAM Roles are identified only by name (there is no unique UUID). They do have a creation timestamp, however.

Our test jobs are creating IAM roles with the same name. The aws-janitor runs periodically, and if the timings work out "just so", aws-janitor will observe different IAM roles with the same name for the entire TTL window. It will then delete an IAM role, thinking that it is no longer in use, but in fact it has seen multiple different IAM roles with the same name.

I propose using CreationTimestamp to differentiate.

GCP janitor: support arbitrary cleanup commands instead of simple delete

Some gcloud cleanup commands are not standard delete commands - https://cloud.google.com/sdk/gcloud/reference/beta/compute/shared-vpc/associated-projects/remove as an example, thus cannot be simply added into the map in https://github.com/kubernetes-sigs/boskos/blob/master/cmd/janitor/gcp_janitor.py#L32-L91.

One option to support this is to add custom functions similar as clean_gke_cluster, but there's probably a better approach?

/cc @ixdy

REST client without depending on kubernetes staging libraries

This is a feature request asking that we consider publishing a go module that does not import Kubernetes staging libraries, at least for the purposes of talking to the boskos API.

We need a package like this somewhere for projects like kubetest (kubernetes/test-infra#20422); having importable packages that depend on these is a bit of a nightmare.

Also: Having an independent client module might help resolve circular dependencies with any tools in test-infra that talk to boskos while boskos is importing test-infra ...

Deleting static resource may not take effect until next config update or container restart

Static resources are deleted only on SyncConfig, which is triggered by a ConfigMap update, and deletion of in-use resources is delayed until the next SyncConfig. But syncs happen only on container restart or config change, so if a resource was in use, its deletion may be significantly delayed.

Dynamic resources, however, are updated at the same time as static resources and also on dynamic-resource-update-period (10 minutes by default). Should static resources be updated at a similar cadence as well?

Update controller-runtime to v0.15.0

Controller-runtime v0.15.0 was released on 5/23/2023.

https://github.com/kubernetes-sigs/controller-runtime/releases/tag/v0.15.0

This new version has many breaking changes, and if someone imports this repo as a dependency, it blocks them from bumping controller-runtime to v0.15.0:

could not import sigs.k8s.io/boskos/crds (-: # sigs.k8s.io/boskos/crds
vendor/sigs.k8s.io/boskos/crds/client.go:89:24: cannot use func(_ *rest.Config) (meta.RESTMapper, error) {…} (value of type func(_ *rest.Config) (meta.RESTMapper, error)) as func(c *rest.Config, httpClient *http.Client) (meta.RESTMapper, error) value in struct literal
vendor/sigs.k8s.io/boskos/crds/client.go:94:15: cannot use func(_ cache.Cache, _ *rest.Config, _ ctrlruntimeclient.Options, _ ...ctrlruntimeclient.Object) (ctrlruntimeclient.Client, error) {…} (value of type func(_ "sigs.k8s.io/controller-runtime/pkg/cache".Cache, _ *rest.Config, _
 client.Options, _ ...client.Object) (client.Client, error)) as client.NewClientFunc value in struct literal) (typecheck)
        "sigs.k8s.io/boskos/crds" 

flaky test: cleaner TestRecycleResources/noLeasedResources

Seen failing here: https://prow.k8s.io/view/gcs/kubernetes-jenkins/pr-logs/pull/kubernetes-sigs_boskos/24/pull-boskos-build-test-verify/1270814151844827142

=== FAIL: cleaner TestRecycleResources/noLeasedResources (0.05s)
time="2020-06-10T20:26:08Z" level=info msg="Cleaner started"
time="2020-06-10T20:26:08Z" level=info msg="Resource dynamic_2 is being recycled"
time="2020-06-10T20:26:08Z" level=info msg="Resource dynamic_2 is being recycled"
time="2020-06-10T20:26:08Z" level=info msg="Resource dynamic_2 is being recycled"
time="2020-06-10T20:26:08Z" level=info msg="Released dynamic_2 as tombstone"
time="2020-06-10T20:26:08Z" level=info msg="Released dynamic_2 as tombstone"
time="2020-06-10T20:26:08Z" level=info msg="Resource dynamic_2 is being recycled"
time="2020-06-10T20:26:08Z" level=info msg="Released dynamic_2 as tombstone"
time="2020-06-10T20:26:08Z" level=error msg="Release failed" error="owner mismatch request by cleaner, currently owned by "
time="2020-06-10T20:26:08Z" level=error msg="failed to release dynamic_2 as tombstone" error="owner mismatch request by cleaner, currently owned by "
time="2020-06-10T20:26:08Z" level=info msg="Resource dynamic_2 is being recycled"
    TestRecycleResources/noLeasedResources: cleaner_test.go:212: resource dynamic_2 state cleaning does not match expected tombstone
time="2020-06-10T20:26:08Z" level=info msg="Stopping Cleaner"
time="2020-06-10T20:26:08Z" level=info msg="Exiting recycleAll Thread"
time="2020-06-10T20:26:08Z" level=info msg="Released dynamic_2 as tombstone"
time="2020-06-10T20:26:08Z" level=info msg="Exiting recycleAll Thread"
time="2020-06-10T20:26:08Z" level=info msg="Exiting recycleAll Thread"
time="2020-06-10T20:26:08Z" level=info msg="Exiting recycleAll Thread"
time="2020-06-10T20:26:08Z" level=info msg="Exiting recycleAll Thread"
time="2020-06-10T20:26:08Z" level=info msg="Cleaner stopped"
    --- FAIL: TestRecycleResources/noLeasedResources (0.05s)
=== FAIL: cleaner TestRecycleResources (0.21s)

/kind bug

static resources removed from the configuration may never be deleted

Originally filed as kubernetes/test-infra#17282

I mentioned this tangentially in kubernetes/test-infra#16047 (comment), but I want to pull it out into a separate issue to highlight it more easily.

Boskos doesn't delete static resources that are removed from the configuration if they are in use, to ensure that jobs don't fail, and to ensure that such resources are properly cleaned up by the janitor.

Originally, this was a reasonable decision, since Boskos periodically synced its storage against the configuration, and most likely such resources would eventually be free and thus deleted from storage.

After kubernetes/test-infra#13990, Boskos only syncs its storage against the configuration when the configuration changes (or when Boskos restarts). As a result, it may take a long time for static resources to be deleted, if ever.

There was a similar issue for DRLCs that I addressed in kubernetes/test-infra#16021, effectively by putting the DRLCs into lame-duck mode.

There isn't a clear way to indicate that static resources are in lame-duck mode, though.

Possible ways to address this bug, in increasing order of complexity:

  1. Just delete static resources, regardless of what state they're in.
  2. Periodically sync storage against the config. It's probably less expensive now, due to the improvements around locking.
  3. Somehow indicate that resources are in lame-duck mode to prevent them from being leased, and then delete them once free:
    a. Add a field into the UserData for static resources. (Currently UserData is not used for static resources.)
    b. Set an ExpirationDate on static resources. (Currently ExpirationDate is not used for static resources.)
    c. Add a new field on the ResourceStatus indicating resources are in lame-duck mode.

Workaround until this bug is fixed: admins with access to the cluster where Boskos is running can just delete the resources manually using kubectl.

GCP Janitor: support clean up GCP resources in the additional zones

Currently there is a list of GCP zones configured in GCP Janitor - https://github.com/kubernetes-sigs/boskos/blob/master/cmd/janitor/gcp_janitor.py#L94-L184. When the script is run, Janitor will clean up all the GCP resources in these zones. However, the list is not exclusive and we cannot add the zones that are not publicly launched (e.g. us-east1-a) since the cleanups will fail for projects that cannot access these zones.

One solution would be adding an extra --additional-zones flag to gcp_janitor.py that allows extending the list of zones. For Janitor instances that manage internal GCP projects, pass the internal zones so that it can also clean up GCP resources in these zones.

/assign

Finish setting up this repository

A number of tasks remain. In no particular order:

Logger is not configured correctly

Looking at logs from a recent boskos build (from this repo) reveals that some things are not configured appropriately:

boskos:

{"component":"unset","file":"/go/pkg/mod/github.com/sirupsen/[email protected]/entry.go:314","func":"github.com/sirupsen/logrus.(*Entry).Logf","handler":"handleUpdate","level":"info","msg":"From 10.44.2.234:42282","time":"2020-06-05T23:01:09Z"}

cleaner:

{"component":"unset","file":"/go/pkg/mod/github.com/sirupsen/[email protected]/logger.go:192","func":"github.com/sirupsen/logrus.(*Logger).Log","level":"info","msg":"Cleaner started","time":"2020-06-05T22:47:27Z"}

etc.

The component is unset because we aren't setting the variables in k8s.io/test-infra/prow/version.
We can fairly easily add linker flags for this, though it'll require a bit of work in the Makefile (or maybe we'll just want to write a wrapper script).

I'm not sure why we're getting useless file and function annotations, though.

/kind bug

Unexpected behavior of boskos in acquirestate when destination and current state are both free.

acquirestate loses a resource's state when the destination state and the current state of a resource are both free. This means that if the same acquirestate request is called twice sequentially, Boskos returns an error the second time. Only after a release request does acquirestate work again. The following are two proposed solutions:

  1. return an error when the destination and current state are the same (recommended)
  2. disallow the destination state from being free

Support multiple config files

Currently, Boskos reads a single configuration file, but maintaining a single config file can be painful, and there has been some desire to support multiple configuration files.

Internally, Boskos is using viper, which only supports one configuration file per instance. We could experiment to see if multiple instances would be a feasible approach, or if there is some other way to address this request.

/kind feature

changing type of a static resource in config doesn't update storage

Originally filed as kubernetes/test-infra#16047

What happened:
We renamed the type of some of our static resources to work around a bug in the janitor.
(That bug: we had a group of projects that didn't end in -project, and thus the janitor was passing the wrong flag: https://github.com/kubernetes/test-infra/blob/761c11f53ddb7dde3fcc4073a7e3b9015554fe7f/boskos/janitor/janitor.go#L92-L99
)

After applying the config, the old type still remained in storage (in the Kubernetes objects).

What you expected to happen:
Boskos would update storage (in the Kubernetes objects) reflecting the new type.

How to reproduce it (as minimally and precisely as possible):
I wrote a simple unit test that reproduces this failure: ixdy/kubernetes-test-infra@d6714a6

It looks like when updating static resources, we just check whether the resources specified in the config exist in storage and vice versa, looking only at the resource names. We do not consider that other metadata (such as type) may have changed.

gcp_janitor deletes instances first causing some recreates

Originally filed as kubernetes/test-infra#16965 by @oxddr

What happened:
gcp_janitor deleted all instances first. Some of them belonged to managed instance groups, so they were recreated and then deleted a few seconds later when the IGM was deleted.

What you expected to happen:
Only non-managed instances should be deleted first or managed instance groups should be deleted before instances.

How to reproduce it (as minimally and precisely as possible):
Create clusters on GCE using kube-up and clean them up with gcp_janitor.

Please provide links to example occurrences, if any:
n/a

Anything else we need to know?:
-

janitor fails to clean up some resources in a timely manner if dirty rates are unequal

Originally filed as kubernetes/test-infra#15925

Creating a one-sentence summary of this issue is hard, but the basic bug is fairly easy to understand.

Assume a Boskos instance has three resource types, A, B, and C. A has 5 resources, B has 10, and C has 100. A Boskos janitor has been configured to clean all three types.

Currently, the janitor loops through all resource types, iteratively cleaning one resource of each type. If the janitor finds that one of the types has no dirty resources, it stops querying that resource type until all resources have been cleaned, at which point it waits a minute and then starts over with the complete list again.

In our hypothetical case (as well as observed in practice), what this means is that the janitor will finish cleaning resources of type A (and possibly B), while still having many more C resources to clean. Additionally, given that C is such a large pool, there will likely be many jobs making more C resources dirty. As a result, it will be quite some time before the janitor attempts to clean A resources, and the pool will probably fill up with dirty resources.

Possible ways to mitigate the issue (in increasing complexity):

  • increase the number of janitor replicas
  • segment the janitors (i.e. have separate janitors for each type)
  • remove the optimization in the janitor loop, continuing to attempt to acquire all resource types (this will likely result in more /acquire RPCs to Boskos)
  • use Boskos metrics to select which resources to attempt to clean. This could even be prioritized (e.g. focus on whichever type is closest to running out of resources), though that might lead to different issues with starvation. Additionally, a failing cleanup could mean the janitor might get completely stuck.

Boskos client does not distinguish between incorrect resource type and no resources available

Currently, when acquiring a resource using AcquireWait(), the client makes no distinction between a resource type that does not exist (which could arise from a user typo) and a situation where all resources are busy.

Unfortunately, the Boskos server sends the 404 status code in both situations (see the server code here). The Boskos client then looks at the status code here.

One easy and backwards-compatible fix would be to add a new option to the client that, when set, distinguishes between the two situations based on the text returned from the HTTP call. The response text for the two situations is:

  • Acquire failed: resource type "my-resource-type" does not exist
  • Acquire failed: no available resource my-resource-type, try again later

The current problem is that if the user accidentally asks for a resource type that does not exist, they will be stuck in a loop forever with no hope of unblocking except for a Context timeout, which, in my opinion, is not an acceptable situation.

Release blocking tests cannot acquire project from boskos

Similar to: #118

gce-cos-master-scalability-100 release blocking tests are failing with:

2022/04/06 10:36:20 main.go:331: Something went wrong: failed to prepare test environment: --provider=gce boskos failed to acquire project: Post "http://boskos.test-pods.svc.cluster.local./acquire?dest=busy&owner=ci-kubernetes-e2e-gci-gce-scalability&request_id=e01699c7-977c-4045-81e5-2d8825e132c6&state=free&type=scalability-project": dial tcp 10.35.241.148:80: connect: connection refused

Allow running Boskos in HA mode

Currently it is not possible to run Boskos with more than one replica, because it maintains an in-memory FIFO queue to make sure leases are handed out in the order they were requested. This makes it impossible to run in HA, resulting in downtime for the whole service if the single replica goes down for whatever reason.

janitor: track when cleanup fails repeatedly for the same resource

Originally filed as kubernetes/test-infra#15866

Due to programming errors, the janitor may continuously fail to clean up a resource. Two examples I just discovered:

possibly an order-of-deletion issue:

{"error":"exit status 1","level":"info","msg":"failed to clean up project kube-gke-upg-1-2-1-3-upg-clu-n, error info: Activated service account credentials for: [[email protected]]\nERROR: (gcloud.compute.networks.delete) Could not fetch resource:\n - The network resource 'projects/kube-gke-upg-1-2-1-3-upg-clu-n/global/networks/jenkins-e2e' is already being used by 'projects/kube-gke-upg-1-2-1-3-upg-clu-n/global/routes/default-route-92807148d5aa60d1'\n\nError try to delete resources networks: CalledProcessError()\n[=== Start Janitor on project 'kube-gke-upg-1-2-1-3-upg-clu-n' ===]\n[=== Activating service_account /etc/service-account/service-account.json ===]\n[=== Finish Janitor on project 'kube-gke-upg-1-2-1-3-upg-clu-n' with status 1 ===]\n","time":"2020-01-10T21:03:14Z"}

likely incorrect flags (gcloud changed but we didn't?):

{"error":"exit status 1","level":"info","msg":"failed to clean up project k8s-jkns-e2e-gke-ci-canary, error info: Activated service account credentials for: [[email protected]]\nERROR: (gcloud.compute.disks.delete) unrecognized arguments: --global \n\nTo search the help text of gcloud commands, run:\n  gcloud help -- SEARCH_TERMS\nError try to delete resources disks: CalledProcessError()\nERROR: (gcloud.compute.disks.delete) unrecognized arguments: --region=https://www.googleapis.com/compute/v1/projects/k8s-jkns-e2e-gke-ci-canary/regions/us-central1 \n\nTo search the help text of gcloud commands, run:\n  gcloud help -- SEARCH_TERMS\nError try to delete resources disks: CalledProcessError()\n[=== Start Janitor on project 'k8s-jkns-e2e-gke-ci-canary' ===]\n[=== Activating service_account /etc/service-account/service-account.json ===]\n[=== Finish Janitor on project 'k8s-jkns-e2e-gke-ci-canary' with status 1 ===]\n","time":"2020-01-10T21:18:55Z"}

It'd be good to have some way of detecting when we're repeatedly failing to clean up a resource.
Not sure yet what the best way would be to track that.
