garm-operator's Introduction

garm-operator

✨ What is the garm-operator?

garm-operator is a Kubernetes® operator that manages the lifecycle of garm objects by creating/updating/deleting corresponding objects in the Kubernetes cluster.

garm components overview

🔀 Versioning

Garm Version

garm-operator uses garm-api-client to talk with garm servers. The supported garm server version is determined by garm-api-client.

Kubernetes Version

garm-operator uses client-go to talk with Kubernetes clusters. The supported Kubernetes cluster version is determined by client-go. The compatibility matrix for client-go and Kubernetes cluster can be found here.

🚀 Installation

Prerequisites

  1. A Kubernetes cluster into which you want to deploy the garm-operator.
  2. As we use ValidatingWebhooks for validation, cert-manager must be installed. (You can find the installation instructions here.)
  3. A garm server that is up and running and reachable from within the Kubernetes cluster into which you want to deploy the garm-operator.

Deployment

garm-operator

We release the garm-operator as a container image together with the corresponding Kubernetes manifests. You can find the latest release here.

These manifests can be used to deploy the garm-operator into your Kubernetes cluster.

export GARM_OPERATOR_VERSION=<garm-operator-version>
export GARM_SERVER_URL=<garm-server-url> 
export GARM_SERVER_USERNAME=<garm-server-username>
export GARM_SERVER_PASSWORD=<garm-server-password>
export OPERATOR_WATCH_NAMESPACE=<operator-watch-namespace>
curl -L https://github.com/mercedes-benz/garm-operator/releases/download/${GARM_OPERATOR_VERSION}/garm-operator-all.yaml | envsubst | kubectl apply -f -

The full configuration parsing documentation can be found in the configuration parsing guide.

Custom Resources

The CRD documentation can also be viewed via docs.crds.dev.

The config/samples folder contains a few basic examples of Pools, Images, and the corresponding Repositories, Organizations, or Enterprises.

💻 Development

For local development, please read the development guide.

📋 ADRs

To make some assumptions and corresponding decisions transparent, we use ADRs (Architecture Decision Records) to document them.

All ADRs can be found here.

Contributing

We welcome any contributions. If you want to contribute to this project, please read the contributing guide.

Code of Conduct

Please read our Code of Conduct as it is our base for interaction.

License

This project is licensed under the MIT LICENSE.

Provider Information

Please visit https://www.mercedes-benz-techinnovation.com/en/imprint/ for information on the provider.

Notice: Before you use the program in productive use, please take all necessary precautions, e.g. testing and verifying the program with regard to your specific use. The program was tested solely for our own use cases, which might differ from yours.

garm-operator's People

Contributors

bavarianbidi, dependabot[bot], h777k, maigl, rafalgalaw

Forkers

gabriel-samfira

garm-operator's Issues

`--field-selector` should work when getting `runners`

What is the feature you would like to have?

A k get runner --field-selector status.poolId=os023-small should print only the runners with poolId=os023-small.

Currently we get the following error:

Error from server (BadRequest): Unable to find "garm-operator.mercedes-benz.com/v1alpha1, Resource=runners" that match label selector "", field selector "status.poolId=os023-small": field label not supported: status.poolId
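Custom resources typically accept only a small set of field selectors (such as metadata.name and metadata.namespace), which is consistent with the error above. Until status.poolId is supported server-side, the same result can be approximated by filtering on the client. The following is a minimal Go sketch under stated assumptions: the Runner type is a simplified, hypothetical stand-in for the real custom resource, and filterByPoolID is an illustrative helper, not part of garm-operator.

```go
package main

import "fmt"

// Runner is a simplified, hypothetical stand-in for the Runner custom resource.
type Runner struct {
	Name   string
	PoolID string // corresponds to status.poolId on the CR
}

// filterByPoolID returns only the runners whose PoolID equals poolID.
func filterByPoolID(runners []Runner, poolID string) []Runner {
	var out []Runner
	for _, r := range runners {
		if r.PoolID == poolID {
			out = append(out, r)
		}
	}
	return out
}

func main() {
	runners := []Runner{
		{Name: "runner-a", PoolID: "os023-small"},
		{Name: "runner-b", PoolID: "os023-large"},
		{Name: "runner-c", PoolID: "os023-small"},
	}
	// Print the names of all runners that belong to the requested pool.
	for _, r := range filterByPoolID(runners, "os023-small") {
		fmt.Println(r.Name)
	}
}
```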

Anything else you would like to add?

No response

Helper method for updating CR Status

What is the feature you would like to have?

Right now, when updating the status of a resource, we do not compare the new and old status in a unified way, which has caused reconcile spikes due to unnecessary status updates. We should have a common method for updating the status, so that we don't patch it unnecessarily in all controllers.
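One possible shape for such a helper is sketched below in Go. It is only an illustration: PoolStatus is a simplified, hypothetical status type, and the patch callback stands in for whatever client call a controller would use to write the status. The helper compares the two values with reflect.DeepEqual and skips the patch when nothing changed.

```go
package main

import (
	"fmt"
	"reflect"
)

// PoolStatus is a simplified, hypothetical status type used for illustration.
type PoolStatus struct {
	ID          string
	IdleRunners int
}

// updateStatusIfChanged calls patch only when desired differs from current.
// It reports whether a patch was attempted.
func updateStatusIfChanged[T any](current, desired T, patch func(T) error) (bool, error) {
	if reflect.DeepEqual(current, desired) {
		return false, nil // nothing to do, avoid an unnecessary status update
	}
	return true, patch(desired)
}

func main() {
	current := PoolStatus{ID: "pool-1", IdleRunners: 2}
	patch := func(s PoolStatus) error {
		fmt.Printf("patching status: %+v\n", s)
		return nil
	}

	// Identical status: no patch is sent.
	changed, _ := updateStatusIfChanged(current, current, patch)
	fmt.Println("changed:", changed)

	// Different status: exactly one patch is sent.
	changed, _ = updateStatusIfChanged(current, PoolStatus{ID: "pool-1", IdleRunners: 3}, patch)
	fmt.Println("changed:", changed)
}
```

Because the helper is generic, every controller could share it regardless of its status type.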

Anything else you would like to add?

No response

Feature toggle for reflecting runners as CR & make runner polling interval configurable

What is the feature you would like to have?

Right now, reflecting GitHub Actions Runner Instances from garm as CustomResource into the cluster is enabled by default and there is no possibility to toggle the feature on or off. Also the configured polling interval of syncing runners as CR into the cluster is 5 seconds and not configurable.

As an operator admin, I want to be able to supply a feature flag in order to enable or disable reflecting runners and also configure the polling interval, in order to reduce load on garm- and the k8s-api-server:

--operator-sync-runners=true
--operator-sync-runners-interval="20s"
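As a rough illustration, the two proposed flags could be parsed as in the Go sketch below. It uses the standard library flag package only to stay dependency-free; the syncConfig type and parseSyncFlags function are hypothetical names, and the defaults (enabled, 5 seconds) simply mirror the current behaviour described above.

```go
package main

import (
	"flag"
	"fmt"
	"os"
	"time"
)

// syncConfig holds the two proposed settings (hypothetical type).
type syncConfig struct {
	SyncRunners bool
	Interval    time.Duration
}

// parseSyncFlags parses the proposed flags from args.
func parseSyncFlags(args []string) (syncConfig, error) {
	var cfg syncConfig
	fs := flag.NewFlagSet("garm-operator", flag.ContinueOnError)
	fs.BoolVar(&cfg.SyncRunners, "operator-sync-runners", true,
		"reflect garm runners as custom resources")
	fs.DurationVar(&cfg.Interval, "operator-sync-runners-interval", 5*time.Second,
		"polling interval for syncing runners")
	if err := fs.Parse(args); err != nil {
		return cfg, err
	}
	return cfg, nil
}

func main() {
	cfg, err := parseSyncFlags(os.Args[1:])
	if err != nil {
		os.Exit(2)
	}
	fmt.Printf("sync runners: %v, interval: %s\n", cfg.SyncRunners, cfg.Interval)
}
```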

Anything else you would like to add?

No response

Automatic Auth Token Refresh & GARM Init

What is the feature you would like to have?

Right now garm-operator does a login request on every reconcile loop to prevent the auth token from expiring, as there is no refresh-token API endpoint on the garm-server side, as addressed in this ADR.

As we are polling runners every 5 seconds and also need to improve self-healing capabilities in case the garm-server dies and gets restarted, the operator should be capable of automatically refreshing the auth token and initializing the garm-server again.
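The refresh part of this request can be pictured as a small token cache that logs in again only when the cached token is missing or close to expiry. The Go sketch below is an assumption-laden illustration: loginFunc is a hypothetical stand-in for the garm login call, expiry times are plain Unix seconds, and the 60-second leeway is an arbitrary example value.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// loginFunc is a hypothetical stand-in for a garm login call. It returns a
// token and the token's expiry as Unix seconds.
type loginFunc func() (token string, expiresAt int64, err error)

// tokenCache caches a token and refreshes it shortly before it expires.
type tokenCache struct {
	mu        sync.Mutex
	login     loginFunc
	now       func() int64 // injectable clock, Unix seconds
	leeway    int64        // refresh this many seconds before expiry
	token     string
	expiresAt int64
}

// Token returns the cached token and logs in again only when the token is
// missing or about to expire.
func (c *tokenCache) Token() (string, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.token != "" && c.now()+c.leeway < c.expiresAt {
		return c.token, nil
	}
	token, expiresAt, err := c.login()
	if err != nil {
		return "", err
	}
	c.token, c.expiresAt = token, expiresAt
	return c.token, nil
}

func main() {
	logins := 0
	cache := &tokenCache{
		login: func() (string, int64, error) {
			logins++
			return fmt.Sprintf("token-%d", logins), time.Now().Add(time.Hour).Unix(), nil
		},
		now:    func() int64 { return time.Now().Unix() },
		leeway: 60,
	}
	// Three consecutive "reconciles" share one login.
	for i := 0; i < 3; i++ {
		if _, err := cache.Token(); err != nil {
			panic(err)
		}
	}
	fmt.Println("logins:", logins) // prints: logins: 1
}
```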

Anything else you would like to add?

No response

allow `pool` creation even if the referenced `image` doesn't exist yet

What is the feature you would like to have?

At the moment it's not possible to create a pool CR when the referenced image CR doesn't exist.
This might cause some confusion, as you need to know that the image must exist before a pool is created.

The common pattern for such cases in Kubernetes is to requeue the reconciliation and try again.

  • a pool CR can be created even if the referenced image doesn't exist
  • once the referenced image CR is created, the pool gets created in garm
  • image update still works (updating the tag in the image CR will update all pools where the image is used)
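The requeue pattern described above can be sketched as follows in Go. Everything here is illustrative: the result type merely mimics the shape of a controller reconcile result, getImage and createPool are hypothetical callbacks, and the 30-second delay is an arbitrary example.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errImageNotFound signals that the referenced image CR does not exist yet.
var errImageNotFound = errors.New("image not found")

// result is a hypothetical local type that mimics a reconcile result.
type result struct {
	RequeueAfter time.Duration
}

// reconcilePool creates the pool once the referenced image exists. A missing
// image is not treated as an error; the reconciliation is simply retried later.
func reconcilePool(imageName string, getImage func(string) (string, error), createPool func(tag string) error) (result, error) {
	tag, err := getImage(imageName)
	if errors.Is(err, errImageNotFound) {
		return result{RequeueAfter: 30 * time.Second}, nil
	}
	if err != nil {
		return result{}, err
	}
	return result{}, createPool(tag)
}

func main() {
	create := func(tag string) error {
		fmt.Println("creating pool with image tag", tag)
		return nil
	}
	missing := func(string) (string, error) { return "", errImageNotFound }
	present := func(string) (string, error) { return "v1", nil }

	res, err := reconcilePool("runner-default", missing, create)
	fmt.Println("requeue after:", res.RequeueAfter, "err:", err)

	res, err = reconcilePool("runner-default", present, create)
	fmt.Println("requeue after:", res.RequeueAfter, "err:", err)
}
```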

Anything else you would like to add?

Similar to a pod spec: it's possible to create a pod even if the referenced image isn't available in the registry.
The kubelet keeps retrying the image pull (with an exponential backoff), and once the image is available in the registry, the pod starts.

Set `lastSyncTime` annotation on Custom Resources

What steps did you take and what happened?

It would be cool to have a lastSyncTime annotation on our CRs. Previously we had such a field in our pool.Status, which caused countless reconcile loops.

What did you expect to happen?

Set annotation like in the following reference implementation:
kubernetes-sigs/cluster-api
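A minimal Go sketch of setting such an annotation is shown below. The annotation key is a hypothetical choice (only its domain prefix appears elsewhere in this project), and the helper works on a plain map rather than on a real Kubernetes object.

```go
package main

import (
	"fmt"
	"time"
)

// lastSyncTimeAnnotation is a hypothetical annotation key.
const lastSyncTimeAnnotation = "garm-operator.mercedes-benz.com/last-sync-time"

// setLastSyncTime stores t as an RFC 3339 UTC timestamp in the annotations
// map, creating the map if necessary.
func setLastSyncTime(annotations map[string]string, t time.Time) map[string]string {
	if annotations == nil {
		annotations = map[string]string{}
	}
	annotations[lastSyncTimeAnnotation] = t.UTC().Format(time.RFC3339)
	return annotations
}

// touchLastSyncTime records the current time as the last sync time.
func touchLastSyncTime(annotations map[string]string) map[string]string {
	return setLastSyncTime(annotations, time.Now())
}

func main() {
	annotations := touchLastSyncTime(nil)
	fmt.Println(lastSyncTimeAnnotation, "=", annotations[lastSyncTimeAnnotation])
}
```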

garm version

v0.1.3

garm-operator version

v0.1.0

Kubernetes version

Kubernetes 1.25.5

Anything else you would like to add?

No response

Show Runners

What steps did you take and what happened?

I can't see the currently active runners with k9s :)

What did you expect to happen?

The garm operator should reflect the currently active runners.

garm version

all

garm-operator version

all

Kubernetes version

all

Anything else you would like to add?

no

Add kube-state-metric config

What is the feature you would like to have?

To provide meaningful metrics on the state of garm-operator owned resources, one can deploy the kube-state-metrics chart like this:

$ helm repo add prometheus-community https://prometheus-community.github.io/helm-charts && helm repo update

followed by a

helm upgrade --install garm prometheus-community/kube-state-metrics -f ./helm/kube-state-metrics/values.yaml -n kube-state-metrics --create-namespace

It would be nice if garm-operator added the kube-state-metrics ConfigMap as a release manifest to observe all CRs:

apiVersion: v1
kind: ConfigMap
metadata:
  name: garm-kube-state-metrics-customresourcestate-config
  namespace: kube-state-metrics
  labels:
    helm.sh/chart: kube-state-metrics-5.15.2
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/component: metrics
    app.kubernetes.io/part-of: kube-state-metrics
    app.kubernetes.io/name: kube-state-metrics
    app.kubernetes.io/instance: garm
    app.kubernetes.io/version: "2.10.1"
data:
  config.yaml: |
    kind: CustomResourceStateMetrics
    spec:
      resources:
      - commonLabels:
          crd_type: enterprise
        groupVersionKind:
          group: garm-operator.mercedes-benz.com
          kind: Enterprise
          version: v1alpha1
        labelsFromPath:
          name:
          - metadata
          - name
        metricNamePrefix: garm_operator
        metrics:
        - each:
            gauge:
              path:
              - metadata
              - creationTimestamp
            type: Gauge
          help: Unix creation timestamp.
          name: enterprise_created
        - each:
            gauge:
              nilIsZero: true
              path:
              - status
              - poolManagerIsRunning
            type: Gauge
          help: Whether the enterprise's poolManager is running.
          name: enterprise_pool_manager_running
        - each:
            info:
              labelsFromPath:
                credentialsName:
                - spec
                - credentialsName
                id:
                - status
                - id
                webhookSecretRefKey:
                - spec
                - webhookSecretRef
                - key
                webhookSecretRefName:
                - spec
                - webhookSecretRef
                - name
            type: Info
          help: Information about an enterprise.
          name: enterprise_info
        - each:
            info:
              labelsFromPath:
                paused_value: []
              path:
              - metadata
              - annotations
              - garm-operator.mercedes-benz.com/paused
            type: Info
          help: Whether the enterprise reconciliation is paused.
          name: enterprise_annotation_paused_info
        namespace:
        - metadata
        - namespace
      - commonLabels:
          crd_type: organization
        groupVersionKind:
          group: garm-operator.mercedes-benz.com
          kind: Organization
          version: v1alpha1
        labelsFromPath:
          name:
          - metadata
          - name
        metricNamePrefix: garm_operator
        metrics:
        - each:
            gauge:
              path:
              - metadata
              - creationTimestamp
            type: Gauge
          help: Unix creation timestamp.
          name: org_created
        - each:
            gauge:
              nilIsZero: true
              path:
              - status
              - poolManagerIsRunning
            type: Gauge
          help: Whether the org's poolManager is running.
          name: org_pool_manager_running
        - each:
            info:
              labelsFromPath:
                credentialsName:
                - spec
                - credentialsName
                id:
                - status
                - id
                webhookSecretRefKey:
                - spec
                - webhookSecretRef
                - key
                webhookSecretRefName:
                - spec
                - webhookSecretRef
                - name
            type: Info
          help: Information about an organization.
          name: org_info
        - each:
            info:
              labelsFromPath:
                paused_value: []
              path:
              - metadata
              - annotations
              - garm-operator.mercedes-benz.com/paused
            type: Info
          help: Whether the org reconciliation is paused.
          name: org_annotation_paused_info
        namespace:
        - metadata
        - namespace
      - commonLabels:
          crd_type: repository
        groupVersionKind:
          group: garm-operator.mercedes-benz.com
          kind: Repository
          version: v1alpha1
        labelsFromPath:
          name:
          - metadata
          - name
        metricNamePrefix: garm_operator
        metrics:
        - each:
            gauge:
              path:
              - metadata
              - creationTimestamp
            type: Gauge
          help: Unix creation timestamp.
          name: repo_created
        - each:
            gauge:
              nilIsZero: true
              path:
              - status
              - poolManagerIsRunning
            type: Gauge
          help: Whether the repository's poolManager is running.
          name: repo_pool_manager_running
        - each:
            info:
              labelsFromPath:
                credentialsName:
                - spec
                - credentialsName
                id:
                - status
                - id
                owner:
                - spec
                - owner
                webhookSecretRefKey:
                - spec
                - webhookSecretRef
                - key
                webhookSecretRefName:
                - spec
                - webhookSecretRef
                - name
            type: Info
          help: Information about a repository.
          name: repo_info
        - each:
            info:
              labelsFromPath:
                paused_value: []
              path:
              - metadata
              - annotations
              - garm-operator.mercedes-benz.com/paused
            type: Info
          help: Whether the repo reconciliation is paused.
          name: repo_annotation_paused_info
        namespace:
        - metadata
        - namespace
      - commonLabels:
          crd_type: pool
        groupVersionKind:
          group: garm-operator.mercedes-benz.com
          kind: Pool
          version: v1alpha1
        labelsFromPath:
          name:
          - metadata
          - name
        metricNamePrefix: garm_operator
        metrics:
        - each:
            gauge:
              path:
              - metadata
              - creationTimestamp
            type: Gauge
          help: Unix creation timestamp.
          name: pool_created
        - each:
            gauge:
              path:
              - spec
              - minIdleRunners
            type: Gauge
          help: Minimum number of idle runners configured for the pool.
          name: pool_min_idle_runner
        - each:
            info:
              labelsFromPath:
                enabled:
                - spec
                - enabled
                githubRunnerGroup:
                - spec
                - githubRunnerGroup
                id:
                - status
                - id
                imageName:
                - spec
                - imageName
                maxRunners:
                - spec
                - maxRunners
                minIdleRunners:
                - spec
                - minIdleRunners
                osArch:
                - spec
                - osArch
                osType:
                - spec
                - osType
                providerName:
                - spec
                - providerName
                runnerBootstrapTimeout:
                - spec
                - runnerBootstrapTimeout
                runnerPrefix:
                - spec
                - runnerPrefix
                scopeKind:
                - spec
                - githubScopeRef
                - kind
                scopeName:
                - spec
                - githubScopeRef
                - name
                tags:
                - spec
                - tags
            type: Info
          help: Information about a pool.
          name: pool_info
        - each:
            info:
              labelsFromPath:
                paused_value: []
              path:
              - metadata
              - annotations
              - garm-operator.mercedes-benz.com/paused
            type: Info
          help: Whether the pool reconciliation is paused.
          name: pool_annotation_paused_info
        namespace:
        - metadata
        - namespace
      - commonLabels:
          crd_type: image
        groupVersionKind:
          group: garm-operator.mercedes-benz.com
          kind: Image
          version: v1alpha1
        labelsFromPath:
          name:
          - metadata
          - name
        metricNamePrefix: garm_operator
        metrics:
        - each:
            gauge:
              path:
              - metadata
              - creationTimestamp
            type: Gauge
          help: Unix creation timestamp.
          name: image_created
        - each:
            info:
              labelsFromPath:
                tag:
                - spec
                - tag
            type: Info
          help: Information about an image.
          name: image_info
        namespace:
        - metadata
        - namespace

Anything else you would like to add?

Is it enough to just add the config map?
Or should we provide ready to install kube-state-metrics deploy manifests?
Where in the repo should this be maintained?

Usage of `garm-operator` is not quite clear

What steps did you take and what happened?

With the provided examples it's not quite clear how to use the operator (or which objects should get created in which order).

What did you expect to happen?

Clear documentation with examples of what the created resources look like in Kubernetes and on the garm side.

garm version

v0.1.3

garm-operator version

v0.1.0

Kubernetes version

Kubernetes 1.25.5

Anything else you would like to add?

No response

kubeconfig flag is missing

Since we merged PR #24, we can no longer use the --kubeconfig flag to specify the path to a kubeconfig if we want to use the operator outside of a Kubernetes cluster.

We should implement this flag again.

Fix Pool samples `spec.extraSpecs`

What steps did you take and what happened?

The Pool samples still contain a wrong "" (empty string) value for spec.extraSpecs, which causes the validating webhook to deny the pool resource.

What did you expect to happen?

The value should be replaced by extraSpecs: '{}'.

garm version

v0.1.3

garm-operator version

v0.1.0

Kubernetes version

Kubernetes 1.25.5

Anything else you would like to add?

No response

reduce the number of API calls to get a pool during pool reconciliation

What is the feature you would like to have?

#35 introduced a second API call to the GARM server for each pool.

With some refactoring it should be possible to reduce the number of API calls, and with that it should be easy to get rid of some functions.

Anything else you would like to add?

No response

Integration tests

What is the feature you would like to have?

It would be nice to have some integration tests with a "real" garm-server in the backend.
At the moment, all tests are unit tests with a mocked garm server.

Anything else you would like to add?

No response

`garm-operator` should provide different ways for configuration

What steps did you take and what happened?

It would be great to have more ways to configure garm-operator.
At the moment, configuration is possible by defining flags, or by setting a subset of the flags via environment variables.

What did you expect to happen?

I would like to use a framework like viper to allow configuration via flags, environment variables, or e.g. a YAML-based configuration file.
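The idea of layered configuration can be illustrated without any framework. The Go sketch below is not viper and does not use its API; it only shows one plausible precedence order (flags, then environment variables, then file values, then a default). The key names and the OPERATOR_ prefix are hypothetical.

```go
package main

import (
	"fmt"
	"os"
)

// source is one configuration layer; ok is false when the layer has no value
// for the given key.
type source func(key string) (value string, ok bool)

// resolve returns the value from the first layer that defines key, or def if
// no layer does. Layers are consulted in the order given.
func resolve(key, def string, layers ...source) string {
	for _, layer := range layers {
		if v, ok := layer(key); ok {
			return v
		}
	}
	return def
}

// fromMap builds a layer from an in-memory map, e.g. parsed flags or a
// parsed configuration file.
func fromMap(m map[string]string) source {
	return func(key string) (string, bool) {
		v, ok := m[key]
		return v, ok
	}
}

// fromEnv builds a layer that reads environment variables with a prefix.
func fromEnv(prefix string) source {
	return func(key string) (string, bool) {
		return os.LookupEnv(prefix + key)
	}
}

func main() {
	flags := fromMap(map[string]string{"GARM_SERVER": "http://from-flag"})
	file := fromMap(map[string]string{"GARM_SERVER": "http://from-file", "GARM_USERNAME": "admin"})
	env := fromEnv("OPERATOR_")

	fmt.Println(resolve("GARM_SERVER", "", flags, env, file))
	fmt.Println(resolve("GARM_USERNAME", "", flags, env, file))
}
```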

garm version

v0.1.3

garm-operator version

v0.1.0

Kubernetes version

Kubernetes 1.25.5

Anything else you would like to add?

No response

Recreate `Enterprise`, `Org` and `Repo` Resources and sync new ID if garm-server gets restarted

What is the feature you would like to have?

Right now, if an Enterprise, Org or Repo CR is applied, it gets persisted as a record inside the garm DB and its ID is synced back to the .Status.ID field of the CR. However, if the garm-server gets restarted, the CRs still have the old ID synced, and therefore the operator does not attempt to sync the CRs back to the garm-server. The Pool controller already has such behaviour built in. So the Enterprise, Org and Repo controllers should attempt to recreate these resources on the garm-server side, even if an ID is synced but no longer found inside garm.
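The requested behaviour amounts to treating "not found" for a stored ID as a signal to recreate the resource. A minimal Go sketch, assuming hypothetical get and create callbacks in place of the real garm client calls:

```go
package main

import (
	"errors"
	"fmt"
)

// errNotFound signals that garm does not know the given ID (hypothetical error).
var errNotFound = errors.New("not found")

// ensureRemote returns the ID that should be stored in the CR status.
// If statusID is empty or no longer known to garm, the resource is recreated
// and the new ID is returned.
func ensureRemote(statusID string, get func(id string) error, create func() (string, error)) (string, error) {
	if statusID != "" {
		err := get(statusID)
		if err == nil {
			return statusID, nil // resource still exists, keep the ID
		}
		if !errors.Is(err, errNotFound) {
			return "", err // unrelated failure, do not recreate
		}
	}
	return create()
}

func main() {
	known := map[string]bool{} // simulates an empty garm DB after a restart
	get := func(id string) error {
		if known[id] {
			return nil
		}
		return errNotFound
	}
	create := func() (string, error) {
		known["new-id"] = true
		return "new-id", nil
	}

	id, err := ensureRemote("stale-id", get, create)
	fmt.Println("id:", id, "err:", err)
}
```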

Anything else you would like to add?

No response

Expose garm's JWT auth token `ExpiresAt` property as metric

What is the feature you would like to have?

With the newly added auto-init and ensure-auth feature, it would be great to track when the JWT obtained to authenticate with garm expires, and to expose this as a metric.
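To illustrate where such a value could come from, the Go sketch below reads the standard exp claim from a JWT payload using only the standard library. It deliberately does not verify the signature and is only meant for reading the expiry; the jwtExpiry function name is hypothetical, and wiring the result into a metrics gauge is left out.

```go
package main

import (
	"encoding/base64"
	"encoding/json"
	"errors"
	"fmt"
	"strings"
)

// jwtExpiry extracts the "exp" claim (Unix seconds) from a JWT.
// The signature is NOT verified; this is only suitable for reporting.
func jwtExpiry(token string) (int64, error) {
	parts := strings.Split(token, ".")
	if len(parts) != 3 {
		return 0, errors.New("token does not have three segments")
	}
	payload, err := base64.RawURLEncoding.DecodeString(parts[1])
	if err != nil {
		return 0, fmt.Errorf("decode payload: %w", err)
	}
	var claims struct {
		Exp *float64 `json:"exp"`
	}
	if err := json.Unmarshal(payload, &claims); err != nil {
		return 0, fmt.Errorf("parse claims: %w", err)
	}
	if claims.Exp == nil {
		return 0, errors.New("token has no exp claim")
	}
	return int64(*claims.Exp), nil
}

func main() {
	// Build an unsigned example token; header and signature are placeholders.
	payload := base64.RawURLEncoding.EncodeToString([]byte(`{"exp":1700000000}`))
	token := "header." + payload + ".signature"

	exp, err := jwtExpiry(token)
	if err != nil {
		panic(err)
	}
	// The returned Unix timestamp could be set on a gauge metric.
	fmt.Println("token expires at (unix):", exp)
}
```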

Anything else you would like to add?

No response

log flags are missing

Since we merged PR #24, we can no longer use the --log* and -v flags to specify the logging behaviour of the garm-operator.

In release v0.1.2 there were the following log flags:

./bin/manager -h
Usage of ./bin/manager:
      --add_dir_header                     If true, adds the file directory to the header of the log messages
      --alsologtostderr                    log to standard error as well as files (no effect when -logtostderr=true)
      --garm-password string               The password for the GARM server
      --garm-server string                 The address of the GARM server
      --garm-username string               The username for the GARM server
      --health-probe-bind-address string   The address the probe endpoint binds to. (default ":8081")
      --kubeconfig string                  Paths to a kubeconfig. Only required if out-of-cluster.
      --leader-elect                       Enable leader election for controller manager. Enabling this will ensure there is only one active controller manager.
      --log_backtrace_at traceLocation     when logging hits line file:N, emit a stack trace (default :0)
      --log_dir string                     If non-empty, write log files in this directory (no effect when -logtostderr=true)
      --log_file string                    If non-empty, use this log file (no effect when -logtostderr=true)
      --log_file_max_size uint             Defines the maximum size a log file can grow to (no effect when -logtostderr=true). Unit is megabytes. If the value is 0, the maximum file size is unlimited. (default 1800)
      --logtostderr                        log to standard error instead of files (default true)
      --metrics-bind-address string        The address the metric endpoint binds to. (default ":8080")
      --namespace string                   Namespace that the controller watches to reconcile garm objects. If unspecified, the controller watches for garm objects across all namespaces.
      --one_output                         If true, only write logs to their native severity level (vs also writing to each lower severity level; no effect when -logtostderr=true)
      --skip_headers                       If true, avoid header prefixes in the log messages
      --skip_log_headers                   If true, avoid headers when opening log files (no effect when -logtostderr=true)
      --stderrthreshold severity           logs at or above this threshold go to stderr when writing to files and stderr (no effect when -logtostderr=true or -alsologtostderr=false) (default 2)
      --sync-period duration               The minimum interval at which watched resources are reconciled (e.g. 15m) (default 5m0s)
  -v, --v Level                            number for the log level verbosity
      --vmodule moduleSpec                 comma-separated list of pattern=N settings for file-filtered logging
pflag: help requested

We probably don't need all log flags, so we should consider which ones are necessary and should be implemented.

creating multiple pools with the same spec should be forbidden

What steps did you take and what happened?

Creating two pool objects with different names (but the same spec) results in a single pool on the garm side.
In Kubernetes, both pools will have the same status.id.

apiVersion: garm-operator.mercedes-benz.com/v1alpha1
kind: Pool
metadata:
  name: openstack-default-runner-os01
  namespace: garm-prod
spec:
  githubScopeRef:
    apiGroup: garm-operator.mercedes-benz.com
    kind: Enterprise
    name: mercedes-benz-group-ag
  enabled: true
  extraSpecs: '{"garm_image_type":"runner-default","garm_stage":"prod"}'
  flavor: m1.large
  githubRunnerGroup: ""
  imageName: runner-roadkit
  maxRunners: 10
  minIdleRunners: 1
  osArch: amd64
  osType: linux
  providerName: os01.fra-prod3
  runnerBootstrapTimeout: 20
  runnerPrefix: "road-runner-os013"
  tags:
  - ubuntu
---
apiVersion: garm-operator.mercedes-benz.com/v1alpha1
kind: Pool
metadata:
  name: openstack-default-runner-os02
  namespace: garm-prod
spec:
  githubScopeRef:
    apiGroup: garm-operator.mercedes-benz.com
    kind: Enterprise
    name: mercedes-benz-group-ag
  enabled: true
  extraSpecs: '{"garm_image_type":"runner-default","garm_stage":"prod"}'
  flavor: m1.large
  githubRunnerGroup: ""
  imageName: runner-roadkit
  maxRunners: 10
  minIdleRunners: 1
  osArch: amd64
  osType: linux
  providerName: os01.fra-prod3
  runnerBootstrapTimeout: 20
  runnerPrefix: "road-runner-os013"
  tags:
  - ubuntu

What did you expect to happen?

The pool webhook should reject the creation of an object if another object with the same spec already exists.

The current implementation already accounts for this case but doesn't block the second creation request (https://github.com/mercedes-benz/garm-operator/blob/main/api/v1alpha1/pool_webhook.go#L71-L82).
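The expected check can be sketched as a plain function that rejects a candidate pool when another pool with an identical spec exists. This is a Go illustration only: PoolSpec is reduced to a few fields taken from the example above, and validateUniqueSpec is a hypothetical helper, not the code behind the linked webhook.

```go
package main

import (
	"fmt"
	"reflect"
)

// PoolSpec is a reduced, hypothetical version of the pool spec.
type PoolSpec struct {
	ImageName    string
	Flavor       string
	ProviderName string
	ScopeKind    string
	ScopeName    string
	Tags         []string
}

// Pool pairs a name with its spec.
type Pool struct {
	Name string
	Spec PoolSpec
}

// validateUniqueSpec returns an error if another pool (with a different name)
// already has exactly the same spec as candidate.
func validateUniqueSpec(candidate Pool, existing []Pool) error {
	for _, p := range existing {
		if p.Name != candidate.Name && reflect.DeepEqual(p.Spec, candidate.Spec) {
			return fmt.Errorf("pool %q has the same spec as existing pool %q", candidate.Name, p.Name)
		}
	}
	return nil
}

func main() {
	spec := PoolSpec{
		ImageName:    "runner-roadkit",
		Flavor:       "m1.large",
		ProviderName: "os01.fra-prod3",
		ScopeKind:    "Enterprise",
		ScopeName:    "mercedes-benz-group-ag",
		Tags:         []string{"ubuntu"},
	}
	existing := []Pool{{Name: "openstack-default-runner-os01", Spec: spec}}
	candidate := Pool{Name: "openstack-default-runner-os02", Spec: spec}

	// The second pool has an identical spec, so the check reports an error.
	fmt.Println(validateUniqueSpec(candidate, existing))
}
```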

garm version

v0.1.3

garm-operator version

v0.1.0

Kubernetes version

Kubernetes 1.25.5

Anything else you would like to add?

No response

metrics about the state of the CRs

What is the feature you would like to have?

I would like to have a metrics endpoint that exposes some additional information about the existing garm-operator based CRs.

Anything else you would like to add?

Either using KSM with a customresource-definition-config or built-in metrics; both should be fine.
