mercedes-benz / garm-operator
a k8s operator to run garm
License: MIT License
With the newly added auto-init and ensure-auth features, it would be great to track when the JWT obtained to authenticate with garm will expire and to expose this as a metric.
Right now, reflecting GitHub Actions runner instances from garm as CustomResources into the cluster is enabled by default, and there is no possibility to toggle the feature on or off. Also, the polling interval for syncing runners as CRs into the cluster is fixed at 5 seconds and not configurable.
As an operator admin, I want to be able to supply a feature flag to enable or disable reflecting runners, and to configure the polling interval, in order to reduce load on garm and the k8s-api-server:
--operator-sync-runners=true
--operator-sync-runners-interval="20s"
Right now garm-operator performs a login request on every reconcile loop to prevent the auth token from expiring, as there is no refresh-token API endpoint on the garm-server side, as addressed in this ADR.
Since we poll runners every 5 seconds and also need to improve self-healing capabilities in case the garm-server dies and gets restarted, the operator should be capable of automatically refreshing the auth token and re-initializing the garm-server.
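A stdlib-only sketch of how the token expiry could be obtained for such logic (and for the metric mentioned above), by decoding the exp claim of the JWT without verifying its signature; the token literal below is fabricated for illustration:

```go
package main

import (
	"encoding/base64"
	"encoding/json"
	"fmt"
	"strings"
	"time"
)

// tokenExpiry extracts the exp claim from a JWT without verifying the
// signature -- enough to decide when a re-login against garm is due.
func tokenExpiry(token string) (time.Time, error) {
	parts := strings.Split(token, ".")
	if len(parts) != 3 {
		return time.Time{}, fmt.Errorf("not a JWT: expected 3 segments, got %d", len(parts))
	}
	payload, err := base64.RawURLEncoding.DecodeString(parts[1])
	if err != nil {
		return time.Time{}, err
	}
	var claims struct {
		Exp int64 `json:"exp"`
	}
	if err := json.Unmarshal(payload, &claims); err != nil {
		return time.Time{}, err
	}
	return time.Unix(claims.Exp, 0), nil
}

func main() {
	// Fabricated, unsigned token whose payload is {"exp":1700000000}.
	token := "eyJhbGciOiJIUzI1NiJ9.eyJleHAiOjE3MDAwMDAwMDB9.sig"
	exp, err := tokenExpiry(token)
	if err != nil {
		panic(err)
	}
	fmt.Println(exp.Unix()) // 1700000000
	// A refresh could then be triggered once time.Until(exp) falls below a threshold.
}
```

A controller could call this once after login and schedule the next login shortly before the returned time.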
Creating two Pool objects with different names (but the same spec) will result in only one pool on the garm side. On Kubernetes, both pools will have the same status.id.
apiVersion: garm-operator.mercedes-benz.com/v1alpha1
kind: Pool
metadata:
  name: openstack-default-runner-os01
  namespace: garm-prod
spec:
  githubScopeRef:
    apiGroup: garm-operator.mercedes-benz.com
    kind: Enterprise
    name: mercedes-benz-group-ag
  enabled: true
  extraSpecs: '{"garm_image_type":"runner-default","garm_stage":"prod"}'
  flavor: m1.large
  githubRunnerGroup: ""
  imageName: runner-roadkit
  maxRunners: 10
  minIdleRunners: 1
  osArch: amd64
  osType: linux
  providerName: os01.fra-prod3
  runnerBootstrapTimeout: 20
  runnerPrefix: "road-runner-os013"
  tags:
    - ubuntu
---
apiVersion: garm-operator.mercedes-benz.com/v1alpha1
kind: Pool
metadata:
  name: openstack-default-runner-os02
  namespace: garm-prod
spec:
  githubScopeRef:
    apiGroup: garm-operator.mercedes-benz.com
    kind: Enterprise
    name: mercedes-benz-group-ag
  enabled: true
  extraSpecs: '{"garm_image_type":"runner-default","garm_stage":"prod"}'
  flavor: m1.large
  githubRunnerGroup: ""
  imageName: runner-roadkit
  maxRunners: 10
  minIdleRunners: 1
  osArch: amd64
  osType: linux
  providerName: os01.fra-prod3
  runnerBootstrapTimeout: 20
  runnerPrefix: "road-runner-os013"
  tags:
    - ubuntu
The Pool webhook should reject the creation of an object if one with the same spec already exists.
The current implementation already checks for this case but doesn't block the second creation request (https://github.com/mercedes-benz/garm-operator/blob/main/api/v1alpha1/pool_webhook.go#L71-L82).
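A sketch of the missing deny-path, using a trimmed-down spec type rather than the operator's real one:

```go
package main

import (
	"fmt"
	"reflect"
)

// PoolSpec is a trimmed-down stand-in for the operator's real Pool spec.
type PoolSpec struct {
	Flavor       string
	ImageName    string
	ProviderName string
	Tags         []string
}

// isDuplicate reports whether a spec deeply equal to newSpec already exists;
// a validating webhook could deny the admission request in that case instead
// of letting a second Pool through.
func isDuplicate(newSpec PoolSpec, existing []PoolSpec) bool {
	for _, s := range existing {
		if reflect.DeepEqual(s, newSpec) {
			return true
		}
	}
	return false
}

func main() {
	existing := []PoolSpec{{Flavor: "m1.large", ImageName: "runner-roadkit", ProviderName: "os01.fra-prod3", Tags: []string{"ubuntu"}}}
	second := PoolSpec{Flavor: "m1.large", ImageName: "runner-roadkit", ProviderName: "os01.fra-prod3", Tags: []string{"ubuntu"}}
	fmt.Println(isDuplicate(second, existing)) // true -> deny the request
}
```

In the real webhook, the existing specs would come from listing all Pool CRs in the cluster before admitting the new one.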
v0.1.3
v0.1.0
Kubernetes 1.25.5
It would be great to have more ways of configuring garm-operator.
At the moment this is possible by defining some flags, or a subset of the flags via environment variables.
I would like to have a framework like viper to allow configuration via flags, environment variables, or e.g. a YAML-based configuration file.
v0.1.3
v0.1.0
Kubernetes 1.25.5
I can't see the currently active runners with k9s :)
The garm-operator should reflect the currently active runners.
A k get runner --field-selector status.poolId=os023-small should print only the runners with poolId=os023-small.
Currently we get the following error:
Error from server (BadRequest): Unable to find "garm-operator.mercedes-benz.com/v1alpha1, Resource=runners" that match label selector "", field selector "status.poolId=os023-small": field label not supported: status.poolId
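One way such field selectors could work for CRs is the CustomResourceFieldSelectors feature (alpha in Kubernetes 1.30, enabled by default from 1.31), which lets the CRD declare which fields the API server accepts as selectors; older clusters cannot support this, and whether .status paths are accepted depends on the API server version. A sketch of the relevant CRD fragment (schema abbreviated):

```yaml
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: runners.garm-operator.mercedes-benz.com
spec:
  group: garm-operator.mercedes-benz.com
  names:
    kind: Runner
    listKind: RunnerList
    plural: runners
    singular: runner
  scope: Namespaced
  versions:
    - name: v1alpha1
      served: true
      storage: true
      # Fields listed here become usable with --field-selector.
      selectableFields:
        - jsonPath: .status.poolId
      schema:
        openAPIV3Schema:
          type: object
          x-kubernetes-preserve-unknown-fields: true
```

An alternative that works on any cluster version would be filtering by a label instead of a status field.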
I would like to have a metrics endpoint which exposes some additional information about the existing garm-operator-based CRs.
Either using KSM with a customresource-definition-config or built-in metrics - both should be fine.
Right now, when updating the status of a resource, we do not compare the new and old status in a unified way, which caused some reconcile spikes due to unnecessary status updates. We should have a common method for updating the status, so we don't patch it unnecessarily in all controllers.
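A minimal sketch of such a unified guard (the struct fields are illustrative, not the operator's actual status type):

```go
package main

import (
	"fmt"
	"reflect"
)

// PoolStatus stands in for a CR status sub-resource; the fields are
// illustrative, not the operator's actual status type.
type PoolStatus struct {
	ID          string
	IdleRunners int
}

// patchStatusIfChanged invokes patch only when the status actually differs,
// so identical statuses no longer produce update events and reconcile spikes.
func patchStatusIfChanged(old, desired PoolStatus, patch func(PoolStatus) error) (bool, error) {
	if reflect.DeepEqual(old, desired) {
		return false, nil
	}
	return true, patch(desired)
}

func main() {
	current := PoolStatus{ID: "abc", IdleRunners: 1}
	changed, _ := patchStatusIfChanged(current, current, func(PoolStatus) error { return nil })
	fmt.Println(changed) // false -> no unnecessary status update
}
```

Each controller would call this one helper instead of patching unconditionally.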
Right now, if an Enterprise, Org or Repo CR is applied, it gets persisted as a record inside the garm DB and its ID is synced back to the .Status.ID field of the CR. However, if the garm-server gets restarted, the CRs still have the old ID synced, and therefore the operator does not attempt to sync the CRs back to the garm-server. The Pool controller already has such behaviour built in. The Enterprise, Org and Repo controllers should likewise attempt to recreate these resources on the garm-server side, even if an ID is synced but no longer found inside garm.
With #35, a second API call towards the GARM server for each pool was introduced.
With some refactoring it should be possible to reduce the number of API calls and, with that, to get rid of some functions.
At the moment it's not possible to create a Pool CR when the referenced Image CR doesn't exist.
This might cause some confusion, as you need to know that the Image must exist before a Pool gets created.
The common pattern for such cases in Kubernetes is to requeue the reconciliation and try again.
Similar to a Pod spec, it's possible to create a Pod even if the referenced image isn't available in the registry. The Pod controller reconciles the pod creation (with an exponential backoff), and once the image is available in the registry, the Pod gets scheduled.
As a pool on garm contains more tags than defined (self-hosted, osArch and CPU), we should reflect the additional tags in the status.
It would be nice to have some integration tests with a "real" garm-server in the backend.
At the moment all the tests are unit tests with a mocked garm server.
With the provided examples it's not quite clear how to use the operator (or which objects should get created in which order).
We need clear documentation, with examples of how the created resources will look in Kubernetes and also on the garm side.
v0.1.3
v0.1.0
Kubernetes 1.25.5
To provide meaningful metrics on the state of garm-operator-owned resources, one can deploy the kube-state-metrics chart like:
$ helm repo add prometheus-community https://prometheus-community.github.io/helm-charts && helm repo update
followed by:
$ helm upgrade --install garm prometheus-community/kube-state-metrics -f ./helm/kube-state-metrics/values.yaml -n kube-state-metrics --create-namespace
It would be nice if garm-operator added the kube-state-metrics ConfigMap as a release manifest to observe all CRs:
apiVersion: v1
kind: ConfigMap
metadata:
  name: garm-kube-state-metrics-customresourcestate-config
  namespace: kube-state-metrics
  labels:
    helm.sh/chart: kube-state-metrics-5.15.2
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/component: metrics
    app.kubernetes.io/part-of: kube-state-metrics
    app.kubernetes.io/name: kube-state-metrics
    app.kubernetes.io/instance: garm
    app.kubernetes.io/version: "2.10.1"
data:
  config.yaml: |
    kind: CustomResourceStateMetrics
    spec:
      resources:
        - commonLabels:
            crd_type: enterprise
          groupVersionKind:
            group: garm-operator.mercedes-benz.com
            kind: Enterprise
            version: v1alpha1
          labelsFromPath:
            name:
              - metadata
              - name
            namespace:
              - metadata
              - namespace
          metricNamePrefix: garm_operator
          metrics:
            - each:
                gauge:
                  path:
                    - metadata
                    - creationTimestamp
                type: Gauge
              help: Unix creation timestamp.
              name: enterprise_created
            - each:
                gauge:
                  nilIsZero: true
                  path:
                    - status
                    - poolManagerIsRunning
                type: Gauge
              help: Whether the enterprises poolManager is running.
              name: enterprise_pool_manager_running
            - each:
                info:
                  labelsFromPath:
                    credentialsName:
                      - spec
                      - credentialsName
                    id:
                      - status
                      - id
                    webhookSecretRefKey:
                      - spec
                      - webhookSecretRef
                      - key
                    webhookSecretRefName:
                      - spec
                      - webhookSecretRef
                      - name
                type: Info
              help: Information about an enterprise.
              name: enterprise_info
            - each:
                info:
                  labelsFromPath:
                    paused_value: []
                  path:
                    - metadata
                    - annotations
                    - garm-operator.mercedes-benz.com/paused
                type: Info
              help: Whether the enterprise reconciliation is paused.
              name: enterprise_annotation_paused_info
        - commonLabels:
            crd_type: organization
          groupVersionKind:
            group: garm-operator.mercedes-benz.com
            kind: Organization
            version: v1alpha1
          labelsFromPath:
            name:
              - metadata
              - name
            namespace:
              - metadata
              - namespace
          metricNamePrefix: garm_operator
          metrics:
            - each:
                gauge:
                  path:
                    - metadata
                    - creationTimestamp
                type: Gauge
              help: Unix creation timestamp.
              name: org_created
            - each:
                gauge:
                  nilIsZero: true
                  path:
                    - status
                    - poolManagerIsRunning
                type: Gauge
              help: Whether the orgs poolManager is running.
              name: org_pool_manager_running
            - each:
                info:
                  labelsFromPath:
                    credentialsName:
                      - spec
                      - credentialsName
                    id:
                      - status
                      - id
                    webhookSecretRefKey:
                      - spec
                      - webhookSecretRef
                      - key
                    webhookSecretRefName:
                      - spec
                      - webhookSecretRef
                      - name
                type: Info
              help: Information about an organization.
              name: org_info
            - each:
                info:
                  labelsFromPath:
                    paused_value: []
                  path:
                    - metadata
                    - annotations
                    - garm-operator.mercedes-benz.com/paused
                type: Info
              help: Whether the org reconciliation is paused.
              name: org_annotation_paused_info
        - commonLabels:
            crd_type: repository
          groupVersionKind:
            group: garm-operator.mercedes-benz.com
            kind: Repository
            version: v1alpha1
          labelsFromPath:
            name:
              - metadata
              - name
            namespace:
              - metadata
              - namespace
          metricNamePrefix: garm_operator
          metrics:
            - each:
                gauge:
                  path:
                    - metadata
                    - creationTimestamp
                type: Gauge
              help: Unix creation timestamp.
              name: repo_created
            - each:
                gauge:
                  nilIsZero: true
                  path:
                    - status
                    - poolManagerIsRunning
                type: Gauge
              help: Whether the repositories poolManager is running.
              name: repo_pool_manager_running
            - each:
                info:
                  labelsFromPath:
                    credentialsName:
                      - spec
                      - credentialsName
                    id:
                      - status
                      - id
                    owner:
                      - spec
                      - owner
                    webhookSecretRefKey:
                      - spec
                      - webhookSecretRef
                      - key
                    webhookSecretRefName:
                      - spec
                      - webhookSecretRef
                      - name
                type: Info
              help: Information about a repository.
              name: repo_info
            - each:
                info:
                  labelsFromPath:
                    paused_value: []
                  path:
                    - metadata
                    - annotations
                    - garm-operator.mercedes-benz.com/paused
                type: Info
              help: Whether the repo reconciliation is paused.
              name: repo_annotation_paused_info
        - commonLabels:
            crd_type: pool
          groupVersionKind:
            group: garm-operator.mercedes-benz.com
            kind: Pool
            version: v1alpha1
          labelsFromPath:
            name:
              - metadata
              - name
            namespace:
              - metadata
              - namespace
          metricNamePrefix: garm_operator
          metrics:
            - each:
                gauge:
                  path:
                    - metadata
                    - creationTimestamp
                type: Gauge
              help: Unix creation timestamp.
              name: pool_created
            - each:
                gauge:
                  path:
                    - status
                    - creationTimestamp
                type: Gauge
              help: Unix creation timestamp.
              name: pool_min_idle_runner
            - each:
                info:
                  labelsFromPath:
                    enabled:
                      - spec
                      - enabled
                    githubRunnerGroup:
                      - spec
                      - githubRunnerGroup
                    id:
                      - status
                      - id
                    imageName:
                      - spec
                      - imageName
                    maxRunners:
                      - spec
                      - maxRunners
                    minIdleRunners:
                      - spec
                      - minIdleRunners
                    osArch:
                      - spec
                      - osArch
                    osType:
                      - spec
                      - osType
                    providerName:
                      - spec
                      - providerName
                    runnerBootstrapTimeout:
                      - spec
                      - runnerBootstrapTimeout
                    runnerPrefix:
                      - spec
                      - runnerPrefix
                    scopeKind:
                      - spec
                      - githubScopeRef
                      - kind
                    scopeName:
                      - spec
                      - githubScopeRef
                      - name
                    tags:
                      - spec
                      - tags
                type: Info
              help: Information about a pool.
              name: pool_info
            - each:
                info:
                  labelsFromPath:
                    paused_value: []
                  path:
                    - metadata
                    - annotations
                    - garm-operator.mercedes-benz.com/paused
                type: Info
              help: Whether the pool reconciliation is paused.
              name: pool_annotation_paused_info
        - commonLabels:
            crd_type: image
          groupVersionKind:
            group: garm-operator.mercedes-benz.com
            kind: Image
            version: v1alpha1
          labelsFromPath:
            name:
              - metadata
              - name
            namespace:
              - metadata
              - namespace
          metricNamePrefix: garm_operator
          metrics:
            - each:
                gauge:
                  path:
                    - metadata
                    - creationTimestamp
                type: Gauge
              help: Unix creation timestamp.
              name: image_created
            - each:
                info:
                  labelsFromPath:
                    tag:
                      - spec
                      - tag
                type: Info
              help: Information about an image.
              name: image_info
Is it enough to just add the config map?
Or should we provide ready to install kube-state-metrics deploy manifests?
Where in the repo should this be maintained?
Since we have merged PR #24, we can no longer use the --log* and -v flags to specify the logging behaviour of the garm-operator.
In release v0.1.2 there were the following log flags:
./bin/manager -h
Usage of ./bin/manager:
--add_dir_header If true, adds the file directory to the header of the log messages
--alsologtostderr log to standard error as well as files (no effect when -logtostderr=true)
--garm-password string The password for the GARM server
--garm-server string The address of the GARM server
--garm-username string The username for the GARM server
--health-probe-bind-address string The address the probe endpoint binds to. (default ":8081")
--kubeconfig string Paths to a kubeconfig. Only required if out-of-cluster.
--leader-elect Enable leader election for controller manager. Enabling this will ensure there is only one active controller manager.
--log_backtrace_at traceLocation when logging hits line file:N, emit a stack trace (default :0)
--log_dir string If non-empty, write log files in this directory (no effect when -logtostderr=true)
--log_file string If non-empty, use this log file (no effect when -logtostderr=true)
--log_file_max_size uint Defines the maximum size a log file can grow to (no effect when -logtostderr=true). Unit is megabytes. If the value is 0, the maximum file size is unlimited. (default 1800)
--logtostderr log to standard error instead of files (default true)
--metrics-bind-address string The address the metric endpoint binds to. (default ":8080")
--namespace string Namespace that the controller watches to reconcile garm objects. If unspecified, the controller watches for garm objects across all namespaces.
--one_output If true, only write logs to their native severity level (vs also writing to each lower severity level; no effect when -logtostderr=true)
--skip_headers If true, avoid header prefixes in the log messages
--skip_log_headers If true, avoid headers when opening log files (no effect when -logtostderr=true)
--stderrthreshold severity logs at or above this threshold go to stderr when writing to files and stderr (no effect when -logtostderr=true or -alsologtostderr=false) (default 2)
--sync-period duration The minimum interval at which watched resources are reconciled (e.g. 15m) (default 5m0s)
-v, --v Level number for the log level verbosity
--vmodule moduleSpec comma-separated list of pattern=N settings for file-filtered logging
pflag: help requested
We probably don't need all log flags, so we should consider which ones are necessary and should be implemented.
The Pool samples still contain a wrong "" (empty string) value, which causes the validating webhook to deny the Pool resource.
It should be replaced by extraSpecs: '{}'
v0.1.3
v0.1.0
Kubernetes 1.25.5
Since we have merged PR #24, we can no longer use the --kubeconfig flag to specify the path to a kubeconfig if we want to use the operator outside of a Kubernetes cluster.
We should implement this flag again.
It would be cool to have a lastSyncTime annotation on our CRs. Previously we had such a field in our pool.Status, which caused countless reconcile loops.
Set the annotation like in the following reference implementation: kubernetes-sigs/cluster-api
v0.1.3
v0.1.0
Kubernetes 1.25.5