
isovalent / gke-test-cluster-operator


An operator for managing ephemeral clusters in GKE

License: Other

Shell 0.95% Dockerfile 0.55% Makefile 0.32% Go 31.26% CUE 66.93%
integration-testing kubernetes kubernetes-deployment kubernetes-operator kubernetes-test-operator performance-testing testing

gke-test-cluster-operator's People

Contributors

errordeveloper, kaworu, nebril


gke-test-cluster-operator's Issues

node pool creation time is suboptimal

When Config Connector sees a set of new objects, it builds a dependency graph to create the underlying GCP objects in the right order.

Namely, when it sees ContainerCluster, ContainerNodePool, ComputeNetwork and ComputeSubnetwork, it creates ComputeNetwork first, then ComputeSubnetwork once ComputeNetwork is ready. The dependency readiness checks appear to use exponential back-off. ComputeNetwork and ComputeSubnetwork are fairly quick, but ContainerCluster takes around 5 minutes, so I have seen ContainerNodePool wait for 10 minutes before being created.

I've reported an issue upstream regarding this: GoogleCloudPlatform/k8s-config-connector#161.

copy labels and annotations from TestClusterGKE to jobs

This will make log extraction easier and generally simplify what it takes to find the job that belongs to a particular run, especially using GitHub metadata.
Adding some of the labels to the Prometheus metrics would be very handy, e.g. so that one can easily find the metrics by PR/commit.
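
A minimal sketch of how the label propagation could look when building the runner Job, assuming controller-runtime style types; the NewRunnerJob helper and the API import path are illustrative:

package job

import (
	batchv1 "k8s.io/api/batch/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

	// assumed import path for the TestClusterGKE API types
	"github.com/isovalent/gke-test-cluster-operator/api/v1alpha2"
)

// NewRunnerJob is an illustrative builder that copies labels and annotations
// from the TestClusterGKE object onto the Job, so logs and metrics can be
// correlated with the originating PR/commit.
func NewRunnerJob(cluster *v1alpha2.TestClusterGKE) *batchv1.Job {
	labels := map[string]string{}
	for k, v := range cluster.Labels {
		labels[k] = v
	}
	annotations := map[string]string{}
	for k, v := range cluster.Annotations {
		annotations[k] = v
	}
	return &batchv1.Job{
		ObjectMeta: metav1.ObjectMeta{
			Name:        cluster.Name + "-runner",
			Namespace:   cluster.Namespace,
			Labels:      labels,
			Annotations: annotations,
		},
		// Spec would be built from cluster.Spec.JobSpec as today.
	}
}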

use-case: cluster TTL

Right now clusters that are associated with a test job are deleted upon job completion (whether it failed or succeeded), and clusters without a job need to be deleted manually. This needs to be refined.

Generally speaking, any cluster should have a TTL, so that clusters don't sit around forever in case of a cleanup failure. For developers there should be a way of requesting a long-lived cluster, though perhaps with a maximum TTL. Cost is the primary concern, but there is also no good reason for a long-lived cluster to ever be managed by this operator.

There are a few different use-cases that should eventually be covered (a rough TTL-enforcement sketch follows the list):

  • retention of CI clusters for debugging, with an option to hold them
  • ephemeral dev clusters - delete after a week by default, but allow longer TTL
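
A rough sketch of how TTL enforcement could look in the reconciler, assuming a hypothetical spec.ttl field and controller-runtime types; none of these names exist in the current API:

package controllers

import (
	"context"
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"

	"github.com/isovalent/gke-test-cluster-operator/api/v1alpha2" // assumed import path
)

const defaultTTL = 7 * 24 * time.Hour // assumed default, e.g. one week for dev clusters

type ttlEnforcer struct {
	client client.Client
}

// enforceTTL deletes clusters whose (hypothetical) TTL has expired and
// requeues the rest so they are re-checked once the TTL elapses.
func (t *ttlEnforcer) enforceTTL(ctx context.Context, cluster *v1alpha2.TestClusterGKE) (ctrl.Result, error) {
	ttl := defaultTTL
	if cluster.Spec.TTL != nil { // hypothetical field, not in the current API
		ttl = cluster.Spec.TTL.Duration
	}
	expiry := cluster.CreationTimestamp.Add(ttl)
	if time.Now().After(expiry) {
		return ctrl.Result{}, t.client.Delete(ctx, cluster)
	}
	return ctrl.Result{RequeueAfter: time.Until(expiry)}, nil
}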

pick a zone automatically to avoid congestion

Right now the user can specify a zone and region, or rely on the defaults. Instead, they should only need to specify which region they want and whether they want a regional (#17) or single-zone cluster (the default); the zone should then be selected automatically (at random) for them. This should also provide some degree of zonal distribution, so that the operator doesn't create congestion and isn't overly reliant on a specific zone.
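
A minimal sketch of random zone selection; the hard-coded zone list stands in for what could be fetched from the Compute API, and all names are illustrative:

package controllers

import (
	"fmt"
	"math/rand"
)

// zonesByRegion is illustrative; in practice the zone list could be looked up
// via the Compute API rather than hard-coded.
var zonesByRegion = map[string][]string{
	"europe-west2": {"europe-west2-a", "europe-west2-b", "europe-west2-c"},
}

// pickZone selects a zone at random within the requested region, spreading
// clusters across zones so the operator doesn't lean on a single one.
func pickZone(region string) (string, error) {
	zones := zonesByRegion[region]
	if len(zones) == 0 {
		return "", fmt.Errorf("no known zones for region %q", region)
	}
	return zones[rand.Intn(len(zones))], nil
}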

bubble up cluster creation errors to PR status or as a comment

Issues like #11 currently don't get caught in any way and are not visible from the PR status. Perhaps the requester could catch some of these, as it may be generally useful for the requester to wait a little before returning. Or perhaps the GitHub integration could handle it as well.
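
One possible shape for the GitHub side, sketched with go-github and a commit status; the owner/repo/SHA would come from the metadata discussed in the labels issue above, and the client setup and module version are assumptions:

package githubstatus

import (
	"context"

	"github.com/google/go-github/v32/github" // module version is an assumption
)

// reportClusterFailure posts a failing commit status so that cluster creation
// errors become visible on the PR instead of being silently swallowed.
func reportClusterFailure(ctx context.Context, gh *github.Client, owner, repo, sha, msg string) error {
	_, _, err := gh.Repositories.CreateStatus(ctx, owner, repo, sha, &github.RepoStatus{
		State:       github.String("failure"),
		Context:     github.String("test-clusters/gke"),
		Description: github.String(msg),
	})
	return err
}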

refactor grafana dashboards

The Grafana dashboards are generated for each test cluster, but they are specific to Cilium and should be implemented in a configurable fashion.

GKE API changes - automatic upgrades and repairs

It looks like there were API changes in GKE, and automatic upgrades and repairs are now required in the regular release channel...

Here's an example:

apiVersion: clusters.ci.cilium.io/v1alpha2
kind: TestClusterGKE
metadata:
  creationTimestamp: "2021-02-03T02:17:46Z"
  generation: 1
  name: test-66gx7
  namespace: test-clusters
  resourceVersion: "155013016"
  selfLink: /apis/clusters.ci.cilium.io/v1alpha2/namespaces/test-clusters/testclustersgke/test-66gx7
  uid: acd5f665-2fd6-4bc2-b401-417227e5ce91
spec:
  configTemplate: basic
  jobSpec:
    runner:
      command:
      - /usr/local/bin/cilium-test-gke.sh
      - quay.io/cilium/cilium:latest
      - quay.io/cilium/operator-generic:latest
      - quay.io/cilium/hubble-relay:latest
      - NightlyPolicyStress
      image: cilium/cilium-test-dev:7cdf8024e
      initImage: quay.io/isovalent/gke-test-cluster-initutil:854733411778d633350adfa1ae66bf11ba658a3f
  location: europe-west2-b
  machineType: n1-standard-4
  nodes: 2
  project: cilium-ci
  region: europe-west2
status:
  clusterName: test-66gx7-c5lsf
  conditions:
  - lastTransitionTime: "2021-02-03T16:52:07Z"
    message: Some dependencies are not ready yet
    reason: DependenciesNotReady
    status: "False"
    type: Ready
  dependencyConditions:
    ContainerCluster:test-clusters/test-66gx7-c5lsf:
    - lastTransitionTime: "2021-02-03T16:52:07Z"
      message: The resource is up to date
      reason: UpToDate
      status: "True"
      type: Ready
    ContainerNodePool:test-clusters/test-66gx7-c5lsf:
    - lastTransitionTime: "2021-02-03T02:17:46Z"
      message: 'Update call failed: error applying desired state: summary: error creating
        NodePool: googleapi: Error 400: Auto_upgrade and auto_repair cannot be false
        when release_channel REGULAR is set., badRequest, detail: '
      reason: UpdateFailed
      status: "False"
      type: Ready

Originally this was intentional, as for testing purposes it's actually best to have these features disabled.
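
A small sketch of the likely fix, setting the management block on the generated ContainerNodePool; this is shown against an unstructured object, the field names are assumed from the Config Connector ContainerNodePool spec, and dropping the release channel instead remains an option:

package controllers

import (
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
)

// enableAutoManagement turns node auto-repair and auto-upgrade back on for
// the generated ContainerNodePool, which GKE now requires whenever the
// REGULAR release channel is set.
func enableAutoManagement(nodePool *unstructured.Unstructured) error {
	management := map[string]interface{}{
		"autoRepair":  true,
		"autoUpgrade": true,
	}
	return unstructured.SetNestedMap(nodePool.Object, management, "spec", "management")
}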

use-case: pool of pre-built clusters

Currently cluster provisioning is done on-demand, so it takes at least 5 minutes to obtain a cluster.
Since cluster configuration is based on well-known templates, and for testing purposes region and scale shouldn't matter, it should be possible to maintain a pool of pre-built clusters so that CI jobs can be provisioned faster.

It should be possible to control the pool size based on business needs, e.g. scale up during business hours and scale down out-of-hours; a predictive model could even be implemented if so desired.

TODOs

  • design a basic TestClusterPoolGKE object (see the type sketch after this list)
    • make it namespaced
    • initially it should support only type: static, but may later be extended to e.g. type: businessHours
    • it should support a subset of the parameters that TestClusterGKE offers, since e.g. location and scale shouldn't matter
    • it may need to provide options like scaling nodes on job assignment
  • implement and test it
    • refactor how the job is created, i.e. it will need to be done when a cluster is taken from the pool
  • deploy it in CI
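
A sketch of what the TestClusterPoolGKE object mentioned in the first TODO could look like; every field name here is illustrative:

package v1alpha1

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// TestClusterPoolGKESpec is an illustrative spec for a namespaced pool of
// pre-built clusters; only a static size is supported initially.
type TestClusterPoolGKESpec struct {
	// Type selects the scaling policy; initially only "static", later
	// possibly "businessHours".
	Type string `json:"type"`
	// Size is the number of pre-built clusters to keep available.
	Size int `json:"size"`
	// ConfigTemplate mirrors the subset of TestClusterGKE parameters that
	// matter for pooled clusters (location and scale do not).
	ConfigTemplate string `json:"configTemplate"`
}

// TestClusterPoolGKE is the namespaced pool object itself.
type TestClusterPoolGKE struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec TestClusterPoolGKESpec `json:"spec,omitempty"`
}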

regional clusters

The operator currently assumes a single-zone use-case for its cost advantage; however, it should be possible for the user to request a regional cluster.

events

Events can be quite helpful in troubleshooting the lifecycle of a resource; it would be good to send events on major status updates (e.g. #14) and on any key dependency events, especially errors.
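
A minimal sketch assuming a record.EventRecorder wired into the reconciler; the helper names and import path are illustrative:

package controllers

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/tools/record"

	"github.com/isovalent/gke-test-cluster-operator/api/v1alpha2" // assumed import path
)

// eventer wraps an EventRecorder so the reconcilers can report major
// lifecycle transitions and dependency errors as Kubernetes events.
type eventer struct {
	recorder record.EventRecorder
}

func (e *eventer) clusterReady(cluster *v1alpha2.TestClusterGKE) {
	e.recorder.Event(cluster, corev1.EventTypeNormal, "ClusterReady", "cluster and node pool are ready")
}

func (e *eventer) dependencyError(cluster *v1alpha2.TestClusterGKE, err error) {
	e.recorder.Event(cluster, corev1.EventTypeWarning, "DependencyError", err.Error())
}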

retry creating cluster in different zone when one is out of resources

This is related to #18, but is actually a separate issue.

Sometimes a zone is short of resources, and GKE yields:

  Warning  UpdateFailed        12m (x4 over 23m)   containercluster-controller  Update call failed: error applying desired state: summary: Error waiting for creating GKE cluster: Try a different location, or try again later: Google Compute Engine does not have enough resources available to fulfill request: europe-west2-b., detail:

One of the purposes of this operator was precisely to cater for this type of error and retry.
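
A sketch of the detection and re-selection logic; matching on the condition message text is an assumption, since GKE only surfaces this error as a string:

package controllers

import "strings"

// outOfZonalResources recognises the zonal stock-out error that GKE reports
// in the ContainerCluster condition message.
func outOfZonalResources(conditionMessage string) bool {
	return strings.Contains(conditionMessage, "does not have enough resources available to fulfill request")
}

// nextZone picks another zone in the same region for the retry, skipping the
// zone that just failed; ok is false when no alternative is left.
func nextZone(regionZones []string, failedZone string) (zone string, ok bool) {
	for _, z := range regionZones {
		if z != failedZone {
			return z, true
		}
	}
	return "", false
}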

document deployment

Management cluster setup and operator deployment need to be fully documented.

move defaults out of api/*/testclustergke_webhook.go

The primary need for this is to remove hardcoded defaults for:

c.Spec.Project = "cilium-ci" 
c.Spec.Location = "europe-west2-b"
c.Spec.Region = "europe-west2"
c.Spec.JobSpec.Runner.Image = "quay.io/isovalent/gke-test-cluster-gcloud:803ff83d3786eb38ef05c95768060b0c7ae0fc4d"
c.Spec.JobSpec.Runner.InitImage = "quay.io/isovalent/gke-test-cluster-initutil:854733411778d633350adfa1ae66bf11ba658a3f"

There should be a per-namespace object that defines the defaults, to allow for multi-project setups etc.
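
A sketch of what such a per-namespace defaults object could look like; all type and field names are illustrative:

package v1alpha2

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// TestClusterGKEDefaults is an illustrative per-namespace object that the
// defaulting webhook could read instead of using hardcoded values.
type TestClusterGKEDefaults struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec TestClusterGKEDefaultsSpec `json:"spec,omitempty"`
}

type TestClusterGKEDefaultsSpec struct {
	Project     string `json:"project,omitempty"`
	Location    string `json:"location,omitempty"`
	Region      string `json:"region,omitempty"`
	RunnerImage string `json:"runnerImage,omitempty"`
	InitImage   string `json:"initImage,omitempty"`
}

The webhook would look up this object in the cluster's namespace and only fall back to built-in values when none is found, which also opens the door to multi-project setups.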
