
isovalent / gke-test-cluster-operator


An operator for managing ephemeral clusters in GKE

License: Other

Shell 0.95% Dockerfile 0.55% Makefile 0.32% Go 31.26% CUE 66.93%
integration-testing kubernetes kubernetes-deployment kubernetes-operator kubernetes-test-operator performance-testing testing

gke-test-cluster-operator's People

Contributors

errordeveloper, kaworu, nebril


gke-test-cluster-operator's Issues

node pool creation time is suboptimal

When Config Connector sees a set of new objects, it builds a dependency graph to create the underlying GCP objects in the right order.

Namely, when it sees ContainerCluster, ContainerNodePool, ComputeNetwork and ComputeSubnetwork, it creates ComputeNetwork first, then ComputeSubnetwork once ComputeNetwork is ready. The dependency readiness checks appear to use exponential back-off. ComputeNetwork and ComputeSubnetwork are fairly quick, but ContainerCluster takes around 5 minutes, so I have seen ContainerNodePool wait for 10 minutes before being created.

I've reported an issue upstream regarding this: GoogleCloudPlatform/k8s-config-connector#161.

copy labels and annotations from TestClusterGKE to jobs

This will make log extraction easier and generally simplify what it takes to find the job that belongs to a particular run, especially using GitHub metadata.
Adding some of the labels to the Prometheus metrics would be very handy, e.g. so that one can easily find the metrics by PR/commit.
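
A minimal sketch of how the label propagation could look when building the runner Job, assuming controller-runtime style types; the NewRunnerJob helper and the API import path are illustrative:

package job

import (
	batchv1 "k8s.io/api/batch/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

	// assumed import path for the TestClusterGKE API types
	"github.com/isovalent/gke-test-cluster-operator/api/v1alpha2"
)

// NewRunnerJob is an illustrative builder that copies labels and annotations
// from the TestClusterGKE object onto the Job, so logs and metrics can be
// correlated with the originating PR/commit.
func NewRunnerJob(cluster *v1alpha2.TestClusterGKE) *batchv1.Job {
	labels := map[string]string{}
	for k, v := range cluster.Labels {
		labels[k] = v
	}
	annotations := map[string]string{}
	for k, v := range cluster.Annotations {
		annotations[k] = v
	}
	return &batchv1.Job{
		ObjectMeta: metav1.ObjectMeta{
			Name:        cluster.Name + "-runner",
			Namespace:   cluster.Namespace,
			Labels:      labels,
			Annotations: annotations,
		},
		// Spec would be built from cluster.Spec.JobSpec as today.
	}
}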

use-case: cluster TTL

Right now clusters that are associated with a test job are deleted upon job completion (whether it failed or succeeded), and clusters without a job need to be deleted manually. This needs to be refined.

Generally speaking, any cluster should have a TTL, so that clusters don't sit around forever in case of a cleanup failure. For developers there should be a way of requesting a long-lived cluster, though perhaps with a maximum TTL. Cost is the primary concern, but there is also no good reason for a long-lived cluster to ever be managed by this operator.

There are a few different use-cases that should eventually be covered (a rough TTL-enforcement sketch follows the list):

  • retention of CI clusters for debugging, with an option to hold them
  • ephemeral dev clusters - delete after a week by default, but allow longer TTL
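
A rough sketch of how TTL enforcement could look in the reconciler, assuming a hypothetical spec.ttl field and controller-runtime types; none of these names exist in the current API:

package controllers

import (
	"context"
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"

	"github.com/isovalent/gke-test-cluster-operator/api/v1alpha2" // assumed import path
)

const defaultTTL = 7 * 24 * time.Hour // assumed default, e.g. one week for dev clusters

type ttlEnforcer struct {
	client client.Client
}

// enforceTTL deletes clusters whose (hypothetical) TTL has expired and
// requeues the rest so they are re-checked once the TTL elapses.
func (t *ttlEnforcer) enforceTTL(ctx context.Context, cluster *v1alpha2.TestClusterGKE) (ctrl.Result, error) {
	ttl := defaultTTL
	if cluster.Spec.TTL != nil { // hypothetical field, not in the current API
		ttl = cluster.Spec.TTL.Duration
	}
	expiry := cluster.CreationTimestamp.Add(ttl)
	if time.Now().After(expiry) {
		return ctrl.Result{}, t.client.Delete(ctx, cluster)
	}
	return ctrl.Result{RequeueAfter: time.Until(expiry)}, nil
}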

pick a zone automatically to avoid congestion

Right now the user can specify a zone and region, or rely on the defaults. Instead, they should only need to specify which region they want and whether they want a regional (#17) or single-zone cluster (the default); the zone should then be selected automatically (at random) for them. This should also provide some degree of zonal distribution, so that the operator doesn't create congestion and isn't overly reliant on a specific zone.
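
A minimal sketch of random zone selection; the hard-coded zone list stands in for what could be fetched from the Compute API, and all names are illustrative:

package controllers

import (
	"fmt"
	"math/rand"
)

// zonesByRegion is illustrative; in practice the zone list could be looked up
// via the Compute API rather than hard-coded.
var zonesByRegion = map[string][]string{
	"europe-west2": {"europe-west2-a", "europe-west2-b", "europe-west2-c"},
}

// pickZone selects a zone at random within the requested region, spreading
// clusters across zones so the operator doesn't lean on a single one.
func pickZone(region string) (string, error) {
	zones := zonesByRegion[region]
	if len(zones) == 0 {
		return "", fmt.Errorf("no known zones for region %q", region)
	}
	return zones[rand.Intn(len(zones))], nil
}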

bubble up cluster creation errors to PR status or as a comment

Issues like #11 currently don't get caught in any way and are not visible from the PR status. Perhaps the requester could catch some of these, as it may be generally useful for the requester to wait a little before returning. Or perhaps the GitHub integration could handle it as well.
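
One possible shape for the GitHub side, sketched with go-github and a commit status; the owner/repo/SHA would come from the metadata discussed in the labels issue above, and the client setup and module version are assumptions:

package githubstatus

import (
	"context"

	"github.com/google/go-github/v32/github" // module version is an assumption
)

// reportClusterFailure posts a failing commit status so that cluster creation
// errors become visible on the PR instead of being silently swallowed.
func reportClusterFailure(ctx context.Context, gh *github.Client, owner, repo, sha, msg string) error {
	_, _, err := gh.Repositories.CreateStatus(ctx, owner, repo, sha, &github.RepoStatus{
		State:       github.String("failure"),
		Context:     github.String("test-clusters/gke"),
		Description: github.String(msg),
	})
	return err
}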

refactor grafana dashboards

The Grafana dashboards are generated for each test cluster, but they are specific to Cilium and should be implemented in a configurable fashion.

GKE API changes - automatic upgrades and repairs

It looks like there were API changes in GKE, and automatic upgrades and repairs are now required in the regular release channel...

Here's an example:

apiVersion: clusters.ci.cilium.io/v1alpha2
kind: TestClusterGKE
metadata:
  creationTimestamp: "2021-02-03T02:17:46Z"
  generation: 1
  name: test-66gx7
  namespace: test-clusters
  resourceVersion: "155013016"
  selfLink: /apis/clusters.ci.cilium.io/v1alpha2/namespaces/test-clusters/testclustersgke/test-66gx7
  uid: acd5f665-2fd6-4bc2-b401-417227e5ce91
spec:
  configTemplate: basic
  jobSpec:
    runner:
      command:
      - /usr/local/bin/cilium-test-gke.sh
      - quay.io/cilium/cilium:latest
      - quay.io/cilium/operator-generic:latest
      - quay.io/cilium/hubble-relay:latest
      - NightlyPolicyStress
      image: cilium/cilium-test-dev:7cdf8024e
      initImage: quay.io/isovalent/gke-test-cluster-initutil:854733411778d633350adfa1ae66bf11ba658a3f
  location: europe-west2-b
  machineType: n1-standard-4
  nodes: 2
  project: cilium-ci
  region: europe-west2
status:
  clusterName: test-66gx7-c5lsf
  conditions:
  - lastTransitionTime: "2021-02-03T16:52:07Z"
    message: Some dependencies are not ready yet
    reason: DependenciesNotReady
    status: "False"
    type: Ready
  dependencyConditions:
    ContainerCluster:test-clusters/test-66gx7-c5lsf:
    - lastTransitionTime: "2021-02-03T16:52:07Z"
      message: The resource is up to date
      reason: UpToDate
      status: "True"
      type: Ready
    ContainerNodePool:test-clusters/test-66gx7-c5lsf:
    - lastTransitionTime: "2021-02-03T02:17:46Z"
      message: 'Update call failed: error applying desired state: summary: error creating
        NodePool: googleapi: Error 400: Auto_upgrade and auto_repair cannot be false
        when release_channel REGULAR is set., badRequest, detail: '
      reason: UpdateFailed
      status: "False"
      type: Ready

Originally this was intentional, as for testing purposes it's actually best to have these features disabled.
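
A small sketch of the likely fix, setting the management block on the generated ContainerNodePool; this is shown against an unstructured object, the field names are assumed from the Config Connector ContainerNodePool spec, and dropping the release channel instead remains an option:

package controllers

import (
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
)

// enableAutoManagement turns node auto-repair and auto-upgrade back on for
// the generated ContainerNodePool, which GKE now requires whenever the
// REGULAR release channel is set.
func enableAutoManagement(nodePool *unstructured.Unstructured) error {
	management := map[string]interface{}{
		"autoRepair":  true,
		"autoUpgrade": true,
	}
	return unstructured.SetNestedMap(nodePool.Object, management, "spec", "management")
}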

use-case: pool of pre-built clusters

Currently cluster provisioning is done on-demand, so it takes at least 5 minutes to obtain a cluster.
Since cluster configuration is based on well-known templates, and for testing purposes region and scale shouldn't matter, it should be possible to maintain a pool of pre-built clusters so that CI jobs can be provisioned faster.

It should be possible to control the pool size based on business needs, e.g. scale up during business hours and scale down out-of-hours; a predictive model could even be implemented if so desired.

TODOs

  • design a basic TestClusterPoolGKE object (see the type sketch after this list)
    • make it namespaced
    • initially it should support only type: static, but may later be extended to e.g. type: businessHours
    • it should support a subset of the parameters that TestClusterGKE offers, since e.g. location and scale shouldn't matter
    • it may need to provide options like scaling nodes on job assignment
  • implement and test it
    • refactor how the job is created, i.e. it will need to be done when a cluster is taken from the pool
  • deploy it in CI
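
A sketch of what the TestClusterPoolGKE object mentioned in the first TODO could look like; every field name here is illustrative:

package v1alpha1

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// TestClusterPoolGKESpec is an illustrative spec for a namespaced pool of
// pre-built clusters; only a static size is supported initially.
type TestClusterPoolGKESpec struct {
	// Type selects the scaling policy; initially only "static", later
	// possibly "businessHours".
	Type string `json:"type"`
	// Size is the number of pre-built clusters to keep available.
	Size int `json:"size"`
	// ConfigTemplate mirrors the subset of TestClusterGKE parameters that
	// matter for pooled clusters (location and scale do not).
	ConfigTemplate string `json:"configTemplate"`
}

// TestClusterPoolGKE is the namespaced pool object itself.
type TestClusterPoolGKE struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec TestClusterPoolGKESpec `json:"spec,omitempty"`
}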

regional clusters

The operator currently assumes a single-zone use-case for its cost advantage; however, it should be possible for the user to request a regional cluster.

events

Events can be quite helpful in troubleshooting the lifecycle of a resource; it would be good to send events on major status updates (e.g. #14) and on any key dependency events, especially errors.
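
A minimal sketch assuming a record.EventRecorder wired into the reconciler; the helper names and import path are illustrative:

package controllers

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/tools/record"

	"github.com/isovalent/gke-test-cluster-operator/api/v1alpha2" // assumed import path
)

// eventer wraps an EventRecorder so the reconcilers can report major
// lifecycle transitions and dependency errors as Kubernetes events.
type eventer struct {
	recorder record.EventRecorder
}

func (e *eventer) clusterReady(cluster *v1alpha2.TestClusterGKE) {
	e.recorder.Event(cluster, corev1.EventTypeNormal, "ClusterReady", "cluster and node pool are ready")
}

func (e *eventer) dependencyError(cluster *v1alpha2.TestClusterGKE, err error) {
	e.recorder.Event(cluster, corev1.EventTypeWarning, "DependencyError", err.Error())
}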

retry creating cluster in different zone when one is out of resources

This is related to #18, but is actually a separate issue.

Sometimes a zone is short of resources, and GKE yields:

  Warning  UpdateFailed        12m (x4 over 23m)   containercluster-controller  Update call failed: error applying desired state: summary: Error waiting for creating GKE cluster: Try a different location, or try again later: Google Compute Engine does not have enough resources available to fulfill request: europe-west2-b., detail:

One of the purposes of this operator was precisely to cater for this type of error and retry.
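
A sketch of the detection and re-selection logic; matching on the condition message text is an assumption, since GKE only surfaces this error as a string:

package controllers

import "strings"

// outOfZonalResources recognises the zonal stock-out error that GKE reports
// in the ContainerCluster condition message.
func outOfZonalResources(conditionMessage string) bool {
	return strings.Contains(conditionMessage, "does not have enough resources available to fulfill request")
}

// nextZone picks another zone in the same region for the retry, skipping the
// zone that just failed; ok is false when no alternative is left.
func nextZone(regionZones []string, failedZone string) (zone string, ok bool) {
	for _, z := range regionZones {
		if z != failedZone {
			return z, true
		}
	}
	return "", false
}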

document deployment

Management cluster setup and operator deployment need to be fully documented.

move defaults out of api/*/testclustergke_webhook.go

The primary need for this is to remove hardcoded defaults for:

c.Spec.Project = "cilium-ci" 
c.Spec.Location = "europe-west2-b"
c.Spec.Region = "europe-west2"
c.Spec.JobSpec.Runner.Image = "quay.io/isovalent/gke-test-cluster-gcloud:803ff83d3786eb38ef05c95768060b0c7ae0fc4d"
c.Spec.JobSpec.Runner.InitImage = "quay.io/isovalent/gke-test-cluster-initutil:854733411778d633350adfa1ae66bf11ba658a3f"

There should be a per-namespace object that defines the defaults, to allow for multi-project setups etc.
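
A sketch of what such a per-namespace defaults object could look like; all type and field names are illustrative:

package v1alpha2

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// TestClusterGKEDefaults is an illustrative per-namespace object that the
// defaulting webhook could read instead of using hardcoded values.
type TestClusterGKEDefaults struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec TestClusterGKEDefaultsSpec `json:"spec,omitempty"`
}

type TestClusterGKEDefaultsSpec struct {
	Project     string `json:"project,omitempty"`
	Location    string `json:"location,omitempty"`
	Region      string `json:"region,omitempty"`
	RunnerImage string `json:"runnerImage,omitempty"`
	InitImage   string `json:"initImage,omitempty"`
}

The webhook would look up this object in the cluster's namespace and only fall back to built-in values when none is found, which also opens the door to multi-project setups.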
