
Infrastructure manager

Overview

This project manages the Kyma cluster infrastructure. It's built using the kubebuilder framework.

It's currently responsible for generating and rotating Secrets containing dynamic kubeconfigs.

Prerequisites

  • Access to a k8s cluster. You can use k3d to get a local cluster for testing or run against a remote cluster.
  • kubectl

Installation

  1. Clone the project.
git clone https://github.com/kyma-project/infrastructure-manager.git && cd infrastructure-manager/
  2. Set the infrastructure-manager image name.
export IMG=custom-infrastructure-manager:0.0.1
export K3D_CLUSTER_NAME=infrastructure-manager-demo
  3. Build the project.
make build
  4. Build the image.
make docker-build
  5. Push the image to the registry.
     • For k3d:
k3d cluster create $K3D_CLUSTER_NAME
k3d image import $IMG -c $K3D_CLUSTER_NAME
     • For a globally available Docker registry:
make docker-push
  6. Deploy.
make deploy
  7. Create a Secret with the Gardener credentials.
export GARDENER_KUBECONFIG_PATH=<kubeconfig file for Gardener project>
make gardener-secret-deploy

Usage

Infrastructure Manager is responsible for creating and rotating kubeconfig Secrets for clusters defined in GardenerCluster custom resources (CRs). A sample CR is available here.

Time-based rotation

Secrets are rotated based on the kubeconfig-expiration-time parameter. See Configuration for more details.
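
For illustration, a minimal sketch of how this might be configured, assuming that kubeconfig-expiration-time and gardener-kubeconfig-path are exposed as command-line flags of the manager binary (the flag names and default values are an assumption here; the Configuration section is authoritative):

# Hypothetical invocation: rotate kubeconfigs with a 24h expiration time.
./bin/manager --gardener-kubeconfig-path=/path/to/gardener-project-kubeconfig.yaml --kubeconfig-expiration-time=24h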

Force rotation

It's possible to force the Secret rotation before the time-based rotation kicks in. To do that, add the operator.kyma-project.io/force-kubeconfig-rotation: "true" annotation to the GardenerCluster CR.
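
For example, assuming the CR lives in the kcp-system namespace (as in the issues below), the annotation can be added with kubectl:

kubectl annotate gardenercluster <GARDENER_CLUSTER_CR_NAME> -n kcp-system operator.kyma-project.io/force-kubeconfig-rotation="true"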

Contributing

See CONTRIBUTING.md

Code of Conduct

See CODE_OF_CONDUCT.md

Licensing

See the LICENSE file

infrastructure-manager's People

Contributors

akgalwas, dellagustin-sap, dependabot[bot], disper, grego952, koala7659, kyma-bot, m00g3n, mvshao, pbochynski, tobiscr


infrastructure-manager's Issues

Enable Registry Cache Extension

Description
It is now possible to run a Docker registry cache in a Gardener cluster. The new feature is described here: https://github.com/gardener/gardener-extension-registry-cache/blob/main/docs/usage/configuration.md
Provide a way to enable and configure the registry cache in the shoot cluster, as our customers do not have access to the shoot spec to do it on their own.
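
For context, a rough sketch of what enabling the extension in the shoot spec could look like, based on the linked documentation (the providerConfig apiVersion and fields may differ between extension versions, so treat the snippet as an assumption; KIM would set this on behalf of the customer):

# Sketch only - a merge patch replaces the whole extensions list, so this is for illustration, not production use.
cat > registry-cache-patch.yaml <<EOF
spec:
  extensions:
    - type: registry-cache
      providerConfig:
        apiVersion: registry.extensions.gardener.cloud/v1alpha3
        kind: RegistryConfig
        caches:
          - upstream: docker.io
EOF
kubectl patch shoot <SHOOT_NAME> -n garden-<PROJECT_NAME> --type=merge --patch-file registry-cache-patch.yaml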

Reasons
A local registry cache can decrease the cost of traffic from the Docker registry. It can also be a solution for Docker registries that are not designed to support high load with guaranteed high availability (e.g., Artifactory).

Create proposal for Cluster CR we expect from KEB

Description

The technical contract between KEB and KIM will be based on the Cluster CR. As consumers of this resource, we have to come up with a proposal for the mandatory data required in this CR.

Also consider upcoming features and new requirements. We should be able to extend the model so that new features can be introduced into the structure without the need to change already existing fields.

Fields which are not dynamic and can be determined by KIM should not be part of the CRD but should instead land in the KIM configuration and be handled by KIM's logic.

Planned new features are:

AC:

  • Review a generated Gardener Shoot-Spec and extract all mandatory values we have to receive from KEB
  • Define an initial draft for a Cluster CRD
    • Present the draft in the team and gather their feedback
    • Show how this model could be extended in the future when new features (see the list above) are introduced
    • KIM has to track the status of each cluster. This should be tracked in the conditions field of the CR (errors should also be listed there), as KLM follows the same approach for Kyma CRs. - extracted to #193
    • It could happen that the Gardener shoot is not fully in sync with the latest updates from KEB because the last update failed (e.g., the list of administrators was invalid and got rejected). The contract has to include a field / status which reflects such cases. - separate issue #198
  • Share the CRD with Gophers in a call and explain what the purpose of the different fields is (only if they are not already self-explanatory)

Relates to
#127

Safe deletion of Kyma Clusters

Description

Instead of deleting a cluster immediately, we can hibernate it and delete it a few days later. An accidental deletion can then be recovered. Such a cluster is not reconciled (the Kyma resource can be deleted). Deletion of the Kyma resource should not cause module deletion - it is just an opt-out from lifecycle management.

Note: customer data is still present in the hibernated cluster, so we should not keep it for too long, and we need to make sure we do not violate data privacy policies.

Implementation idea:

  • the cluster resource could stay in the deleting state for a longer period, and the operator would remove the shoot and the finalizer after a defined timeout
  • recovery would be a manual process: copy the resource, remove the finalizer, and recreate the resource from the copy (see the sketch below)
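
A rough sketch of the manual recovery described above, assuming a GardenerCluster-like CR in the kcp-system namespace (resource kind, names, and the finalizer handling are placeholders; the actual flow depends on the implementation of this issue):

# 1. Back up the resource before touching it.
kubectl get gardenercluster <CR_NAME> -n kcp-system -o yaml > gardenercluster-backup.yaml
# 2. Remove the finalizers so the pending deletion can complete.
kubectl patch gardenercluster <CR_NAME> -n kcp-system --type=merge -p '{"metadata":{"finalizers":[]}}'
# 3. Recreate the resource from the backup (strip resourceVersion, uid, and deletionTimestamp first).
kubectl create -f gardenercluster-backup.yaml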

Reasons
We should protect our customers as much as possible against unintentional or malicious actions that cause data loss.

Attachments

Migrate infrastructure related logic from KEB to KIM

Description

A goal of KIM is to establish a domain for infrastructure-related tasks (primarily cluster creation) within Kyma. At the moment, KEB is heavily involved in this area, as it manages several decisions about cluster creation (e.g., which region has to be used).

To establish KEB as a pure orchestration service for Kyma backends, all infrastructure-related logic in KEB should be extracted and become part of KIM.

AC:

  • Get in contact with KEB team and
    • verify the different steps applied by KEB during the cluster creation process
    • decide together which logic should be removed from KEB and integrated into KIM
  • Extract infrastructure related logic from KEB and integrate it into KIM
    • Identify scope of the migration (define work packages, identify features)
    • Agree on the migration strategy (avoid a big-bang migration, make it granular to migrate things step by step)

Depends on
#125 + #134

RBAC kubeconfigs for Clusters

Description

There should be a possibility to issue a kubeconfig for the cluster with limited access/privileges.

Kubernetes allows for creating kubeconfigs for specific ServiceAccounts. Having such SA-based kubeconfig makes it possible to limit its use with proper Roles/ClusterRoles.

Suggestions

this is just a proposal, feel free to refine/change/adapt it as you like

One of the options would be to have a new CRD used for issuing kubeconfigs - it could include ServiceAccount information along with the Role/ClusterRole assigned to that ServiceAccount. Based on this, Infrastructure Manager could create the SA and (Cluster)Role, issue the kubeconfig, and save it as a Secret in the KCP.

Such a solution would require introducing a controller for handling those CRs, but it would be a universal solution supporting multiple kubeconfigs issued for a single cluster (e.g., for KEB, KLM, and other KCP controllers that require cluster access).

Regarding the deletion logic - it can be solved with a finalizer that is set on all the CRs; when the deletion timestamp is picked up by the controller, the cluster resources (SAs, Roles, etc.) are dropped and the finalizer is removed.
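
A purely hypothetical sketch of what such a request CR could look like (the kind, version, and all field names below are made up for illustration; defining the actual CRD is the point of this issue):

kubectl apply -n kcp-system -f - <<EOF
apiVersion: infrastructuremanager.kyma-project.io/v1alpha1   # hypothetical version for a new kind
kind: RBACKubeconfigRequest                                   # hypothetical kind
metadata:
  name: keb-readonly-access
spec:
  serviceAccount:
    name: keb-reader
    namespace: kyma-system
  clusterRole:
    name: keb-readonly
    rules:
      - apiGroups: [""]
        resources: ["pods", "services"]
        verbs: ["get", "list", "watch"]
EOF

The controller would then create the ServiceAccount, the (Cluster)Role, and the binding in the SKR, issue the kubeconfig, store it in a Secret in KCP, and reference that Secret in the CR status.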

Reasons

It is generally recommended to keep the required privileges minimal for specific roles. Right now, the issued kubeconfigs are for the cluster-admin role, which allows unconstrained actions to be taken using the kubeconfig. From the security perspective, it would also be beneficial to differentiate between entities connecting to the SKR. Separate kubeconfigs for KEB or KLM would make it transparent from the audit-log perspective which component took which action in the cluster.

Acceptance Criteria

this is just a proposal, feel free to refine those as you like

  • It is possible to request RBAC Kubeconfig
    • ServiceAccount spec is passed as part of the request
    • Role/ClusterRole is passed as part of the request
  • Requested resources are created in the SKR cluster
    • ServiceAccount
    • Role/ClusterRole
    • RoleBinding/ClusterRoleBinding
  • Kubeconfig is issued for the created ServiceAccount
  • Kubeconfig is saved as a K8S Secret
  • The K8s Secret containing the kubeconfig is referenced as part of the status of the request
  • Infrastructure Manager supports "graceful" deletion of deployed resources

Infrastructure Manager - create initial project structure

Description

Create a minimal structure for Cluster Inventory Infrastructure Manager.

Acceptance criteria:

Stretch:

Reasons

In order to kick off the implementation, we need to define the code structure and create the pipelines. We also need to define the interface for Kyma Environment Broker that is supposed to create Cluster CRs.

Ensure that relevant secret is removed when CR is deleted

Reason
When the Pod is down (even for a short duration like 10 seconds) and the GardenerCluster CR is removed by KEB, the IM controller will not receive an event, and the corresponding Secret will not be cleaned up.

What
A mechanism (e.g., owner references or finalizers) should be introduced to ensure that when the GardenerCluster CR is removed, the corresponding Secret is also removed.
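
For illustration, if an owner reference points from the kubeconfig Secret to its GardenerCluster CR, the Kubernetes garbage collector removes the Secret even when the controller misses the deletion event. A hedged way to check this once implemented (Secret and CR names are placeholders following the naming seen in the logs below):

# Inspect whether the kubeconfig Secret is owned by its GardenerCluster CR.
kubectl get secret kubeconfig-<RUNTIME_ID> -n kcp-system -o jsonpath='{.metadata.ownerReferences}'
# Deleting the owning CR should then also remove the Secret via garbage collection.
kubectl delete gardenercluster <RUNTIME_ID> -n kcp-system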

Documentation improvements

Description

Acceptance Criteria

  • Improve the part on what has to be configured for IM to work
  • Describe the time rotation feature
  • Describe the force rotation feature

Replace Provisioner by Kyma Infrastructure Manager [EPIC]

Description

The Provisioner has to be replaced by the Kyma Infrastructure Manager. The logic of the Provisioner has to be migrated into the Infrastructure Manager, also taking already planned new features into account. This could require rethinking the current software architecture to ensure a flexible, extensible, and maintainable software structure for the Infrastructure Manager.

AC:

Reasons

Replacing the old Kyma Provisioner with the Kyma Infrastructure Manager follows the new KCP architectural paradigm (K8s-native application).

Attachments

Depends on
#134

[Threat Modelling] Configure audit logs to track changes applied on CRs and secrets

Reason
Those important IM resources should be audit logged.

Acceptance Criteria

Ensure the following cases are recorded in the audit log:

  • If an agent (an app or a user) edits the GardenerCluster CR - we should see an audit log of that action
  • If an agent (an app or a user) edits the secrets - we should see an audit log of that action
  • If an agent (an app or a user) accesses the Gardener secret - we should see an audit log of that action
  • If any of the above is not recorded, consult the situation with security experts and prepare a mitigation plan

Review and extend troubleshooting and on-call guides for KIM

Description

To be ready for the go-live, we have to create an on-call guide for the Infrastructure Manager. This is also a prerequisite for the Microdelivery of the Infrastructure Manager.

Possible location for the on-call guide: https://github.tools.sap/kyma/documentation/tree/main/kyma-internal/on-call-guides/mps

AC:

  • Document the common use-cases / possible incidents we have to expect when the Infrastructure Manager runs in a productive context

Area

  • Infrastructure Manager

Reasons

Mandatory prerequisite before we can go live and part of the SAP Product Standards.

Assignees

@kyma-project/technical-writers

Attachments

Define and document KIM architecture

Description

KIM will become a critical component of the Kyma backend and will be responsible for all infrastructure-related tasks (especially cluster creation).

We already see a growing number of expectations and requirements for KIM. This requires a well-defined architecture which supports:

  • easy extensibility (e.g. any value of the shoot-spec should be configurable by customer values or use default config values as fallback)
  • good testability (e.g. mock the Gardener API to simulate any possible use-case in unit- or E2E-tests)
  • good performance (e.g., KIM has to be able to manage 10,000 clusters with linearly scaling processing time)
  • comprehensive monitoring (e.g., central KPIs of KIM have to be exposed, e.g., failure rate, processing time, throughput, resource consumption)
  • acceptable maintenance effort (e.g., most of the identified incidents can be covered by the software/architecture without human intervention)

Take already known requirements into consideration when designing the new architecture:

AC:

  • Create ADRs for KIM which show what the software architecture will look like

    • Add to each ADR a table and evaluate how well the non-functional requirements (see the list above) are supported
    • Present the results of the ADR and evaluation to @kyma-project/framefrog
  • When the ADR is merged, close the following issues:

  • Implement a POC of the ADR and demonstrate it to the team - akgalwas#1

Dependency health checking

Description

Implement periodic health checking of the Gardener cluster API dependency by periodically querying the version or health non-resource endpoint via the Gardener kubeclient in a separate goroutine, and keep the latest check result up to date. Expose the current (up-to-date) health check result on the Prometheus metrics endpoint via series like:

{app}_{subsys}_gardener_health{url="..", status="healthy"} 1
{app}_{subsys}_gardener_health{url="..", status="error"} 0
{app}_{subsys}_gardener_health{url="..", status="unknown"} 0
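
Assuming the metrics are served on the standard kubebuilder metrics port and the controller runs as a Deployment in kcp-system (namespace, Deployment name, and port are assumptions), the series could be inspected like this:

# Port-forward the metrics endpoint (give it a moment to establish), then filter for the health series.
kubectl -n kcp-system port-forward deployment/infrastructure-manager 8080:8080 &
curl -s http://localhost:8080/metrics | grep gardener_health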

Reasons

Ability to cross-correlate infrastructure-manager errors with Gardener API (dependency) errors.

Attachments

Configurable networking filter for ingress traffic (geo-blocking) [EPIC]

Description
Add the possibility to enable ingress filtering in Kyma Runtime that utilizes shoot-networking-filter. The filter allows blocking certain IP addresses or even regions (geo-blocking). The filter should be applied only when explicitly configured by the user (suggestion: Kyma Runtime service instance parameter).
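
For context, a rough sketch of what enabling the filter with ingress blocking (blackholing) could look like in the shoot spec, based on the shoot-networking-filter documentation (the providerConfig apiVersion and fields are an assumption and may differ between extension versions; in practice this would be driven by the service instance parameter mentioned above):

# Sketch only - a merge patch replaces the whole extensions list, so this is for illustration, not production use.
cat > networking-filter-patch.yaml <<EOF
spec:
  extensions:
    - type: shoot-networking-filter
      providerConfig:
        apiVersion: service.filter.extensions.gardener.cloud/v1alpha1
        kind: Configuration
        egressFilter:
          blackholingEnabled: true
EOF
kubectl patch shoot <SHOOT_NAME> -n garden-<PROJECT_NAME> --type=merge --patch-file networking-filter-patch.yaml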

Reasons
Kyma Runtime utilizes shoot-networking-filter from Gardener. The default setup enables only the egress filter. Applications running on Kyma that use external authentication services (like SAP IAS or XSUAA) comply with geo-blocking regulations out of the box. Those external services not only block access from embargoed countries but also permanently block user accounts. However, there are use cases where applications hosted in Kyma Runtime are accessed by service accounts (system-to-system communication), and in that case geo-blocking has to be enabled in the Kyma cluster.
Be aware that the ingress filter should not be enabled if the application is accessed by end users directly, as the blackholing will block the redirect to IAS/XSUAA, and user activity in the embargoed country cannot be tracked.
That's why the ingress filter should be enabled by the Kyma Runtime customer on demand, as a conscious decision, only if the application exposes an API accessible exclusively by other systems.

Attachments

Force rotation should update the condition's reason to KubeconfigSecretRotated

Description
An invalid condition.reason is set when the rotation is forced.

Expected result

After force rotation, the status is set to KubeconfigSecretRotated.

Actual result

After force rotation, the status is set to KubeconfigSecretRotated, then reverts to KubeconfigSecretCreated almost immediately.

Steps to reproduce

  1. kubectl annotate gardenercluster -n kcp-system {GARDENER_CLUSTER_CR_NAME} operator.kyma-project.io/force-kubeconfig-rotation=true

Troubleshooting

IM's logs around the force rotation:

2024-03-04T08:21:54Z    INFO    Rotation of secret kubeconfig-runtimeid-md-im in namespace kcp-system forced.   {"GardenerCluster": "runtimeid-md-im", "Namespace": "kcp-system"}
2024-03-04T08:21:54Z    ERROR   status update failed    {"error": "Operation cannot be fulfilled on gardenerclusters.infrastructuremanager.kyma-project.io \"runtimeid-md-im\": the object has been modified; please apply your changes to the latest version and try again"}
2024-03-04T08:21:54Z    ERROR   Reconciler error        {"controller": "gardenercluster", "controllerGroup":"infrastructuremanager.kyma-project.io", "controllerKind": "GardenerCluster", "GardenerCluster": {"name":"runtimeid-md-im","namespace":"kcp-system"}, "namespace": "kcp-system", "name": "runtimeid-md-im", "reconcileID": "48a96634-001b-47c0-94d8-11277eee7798", "error": "Operation cannot be fulfilled on gardenerclusters.infrastructuremanager.kyma-project.io \"runtimeid-md-im\": the object has been modified; please apply your changes to the latest version and try again"}
2024-03-04T08:21:54Z    INFO    Starting reconciliation.        {"GardenerCluster": "runtimeid-md-im", "Namespace": "kcp-system"}
2024-03-04T08:21:54Z    INFO    rotation params {"GardenerCluster": "runtimeid-md-im", "Namespace": "kcp-system", "lastSync": "0001-01-01 00:00:00", "requeueAfter": "1m54s"}
2024-03-04T08:21:54Z    INFO    Secret kubeconfig-runtimeid-md-im has been updated in kcp-system namespace.     {"GardenerCluster": "runtimeid-md-im", "Namespace": "kcp-system"}
2024-03-04T08:21:54Z    INFO    Starting reconciliation.        {"GardenerCluster": "runtimeid-md-im", "Namespace": "kcp-system"}
2024-03-04T08:21:54Z    INFO    rotation params {"GardenerCluster": "runtimeid-md-im", "Namespace": "kcp-system", "lastSync": "2024-03-04 08:21:54", "requeueAfter": "1m53.028304778s"}
2024-03-04T08:21:56Z    INFO    Secret kubeconfig-runtimeid-md-im in namespace kcp-system does not need to be rotated yet.      {"GardenerCluster": "runtimeid-md-im", "Namespace": "kcp-system"

Errors are being thrown in logs when using force rotation.

Description
Errors are being thrown in logs when using force rotation.

Expected result

No errors should be thrown in logs when using force rotation.

Actual result

Errors are being thrown in logs when using force rotation.

2023-12-20T12:29:44Z    INFO    Rotation of secret kubeconfig-01568d6b-e96f-4106-b8f5-f5a745f0390d in namespace kcp-system forced. {"GardenerCluster": "01568d6b-e96f-4106-b8f5-f5a745f0390d", "Namespace": "kcp-system"}
2023-12-20T12:29:44Z    ERROR   status update failed    {"error": "Operation cannot be fulfilled on gardenerclusters.infrastructuremanager.kyma-project.io \"01568d6b-e96f-4106-b8f5-f5a745f0390d\": the object has been modified; please apply your changes to the latest version and try again"}
2023-12-20T12:29:44Z    ERROR   Reconciler error        {"controller": "gardenercluster", "controllerGroup": "infrastructuremanager.kyma-project.io", "controllerKind": "GardenerCluster", "GardenerCluster": {"name":"01568d6b-e96f-4106-b8f5-f5a745f0390d","namespace":"kcp-system"}, "namespace": "kcp-system", "name": "01568d6b-e96f-4106-b8f5-f5a745f0390d", "reconcileID": "f1f60c6e-15c4-45cb-bcde-a3c60b8ce864", "error": "Operation cannot be fulfilled on gardenerclusters.infrastructuremanager.kyma-project.io \"01568d6b-e96f-4106-b8f5-f5a745f0390d\": the object has been modified; please apply your changes to the latest version and try again"}
2023-12-20T12:29:44Z    INFO    Starting reconciliation.        {"GardenerCluster": "01568d6b-e96f-4106-b8f5-f5a745f0390d", "Namespace": "kcp-system"}
2023-12-20T12:29:44Z    INFO    rotation params {"GardenerCluster": "01568d6b-e96f-4106-b8f5-f5a745f0390d", "Namespace": "kcp-system", "lastSync": "0001-01-01 00:00:00", "requeueAfter": "6h50m24s"}

Steps to reproduce

  1. (Probably not important) The cluster was first updated to k8s 1.27.6 and then hibernated before the rotation was forced.
  2. Force certificate rotation
  3. Check IM logs

/kind bug

Multiple worker groups

Description
Enable the possibility to create multiple worker groups with different machine types, volume types, node labels, annotations, and taints.

See Gardener specs:

Current example shoot from Provisioner:

 workers:
      - cri:
          name: containerd
        name: cpu-worker-0
        machine:
          type: m5.xlarge
          image:
            name: gardenlinux
            version: 1.2.3
          architecture: amd64
        maximum: 1
        minimum: 1
        maxSurge: 1
        maxUnavailable: 0
        volume:
          type: gp2
          size: 50Gi
        zones:
          - eu-central-1a
        systemComponents:
          allow: true
    workersSettings:
      sshAccess:
        enabled: true

Reasons
One size doesn't fit all. Many applications require specific nodes for particular services.

Identify and implement business critical metrics / KPIs, define an action plan and configure alerting rules

Description

With #11 we are able to make the Infrastructure Manager transparent and also simplify our operational life by establishing smart metrics and alerting rules.

The goal of this task is to identify which metrics / KPIs are business relevant and what their critical thresholds are. We also have to define an action plan for when such a threshold is reached, which triggers a required action to bring our business back on track. Finally, alerting rules have to be configured that inform us as soon as one of the thresholds is reached.

AC:

  • Think about technical and business critical metrics / KPIs which give a clear indication of the quality and health of the Infrastructure Manager
    • Define the reason why this metric is relevant and what it represents.
    • Define the thresholds (min/max, etc.) which indicate a service degradation or health issue of the Infrastructure Manager. If a metric has no threshold, verify whether it is still helpful for us to measure this value.
    • Specify the required action that has to be applied if a threshold is reached to recover the Infrastructure Manager into a productive and healthy state
    • Present the results in the team to collect the feedback of the colleagues.
  • Implement the identified business metrics in the Infrastructure Manager
  • Configure alerting rules which inform the team as soon as one of the thresholds is reached

Reasons

Improve operational quality and simplify on-call shifts by establishing proper metrics/KPI measurement and alerting.

Extends #11

Attachments

Set force-deletion flag when creating shoot-cluster

Description

Gardener now supports an option to force the deletion of a cluster, which avoids longer waiting periods during de-provisioning (e.g., when the K8s cluster couldn't be gracefully stopped because of hanging finalizers).

We agreed to use this feature, and the Infrastructure Manager / Provisioner should set this flag properly.

AC:

  • The flag confirmation.gardener.cloud/force-deletion is set in the shoot specs of Gardener clusters (see the example below).
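
A hedged example of how this could be applied manually, assuming the flag is set as an annotation on the Shoot in the Gardener project namespace (names are placeholders; how KIM sets it programmatically is part of this issue):

kubectl annotate shoot <SHOOT_NAME> -n garden-<PROJECT_NAME> confirmation.gardener.cloud/force-deletion=true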

Reasons

Enable/accept non-graceful shutdowns of Gardener clusters to avoid longer waiting periods during the de-provisioning.

Attachments

[Moved from Provisioner to KIM]

Infrastructure Manager - Prepare migration script/Go program that will create GardenerCluster for each existing cluster

Description

Prepare a Go program/script that will iterate over Kyma resources. For each Kyma resource it will:

  1. Read labels from the Kyma resource
  2. Create GardenerCluster CR

The GardenerCluster CR must contain the fields defined here. The Kyma resource is created by KEB, and the labels it adds can be found here. Note that the Secret name is also defined by KEB.

Reasons
To migrate to the architecture in which Infrastructure Manager is responsible for dynamic kubeconfig creation, some additional steps must be performed in the environment. When Infrastructure Manager is deployed on the target environment, the existing Kyma clusters must be handled. The migration script is needed to make sure Infrastructure Manager controls all the runtimes.

Define testing concept for Kyma Infrastructure Manager

Description

For our release management and to fulfil SAP product standards, we have to document what our testing strategy for the Infrastructure Manager looks like.

Some example links to such documentation are available here: https://wiki.one.int.sap/wiki/display/kyma/Testing+Strategy+-+Link+summary

The testing strategy should cover:

  • How the different layers of the application are tested (e.g., unit tests for structs and packages, integration tests for inter-module objects, end-to-end tests for APIs, performance tests for KPI measurement, etc.)
  • When tests are executed and what action gets triggered if a test fails
  • How the test coverage can be measured, what meaningful thresholds are, and what happens if a threshold is not reached

AC:

Area
Infrastructure Manager

Reasons

Mandatory part of the delivery process and required for a fast creation of Microdeliveries.

Assignees

@kyma-project/technical-writers

Attachments

Improve unit testing in the main reconciliation loop

Description

While working on #95, #97, and #99, and making bigger changes in the corresponding code, we've noticed that the tests require improvement.

AC:

  • Enhance ENV-test to avoid regressions of #95 + #97
  • Implement unit tests for #99 to detect regressions afterwards

Reasons

This is a crucial part of Infrastructure Manager that has to be tested correctly so that future enhancements or bug fixes do not cause regressions.

Attachments

  • related PR with some initial unit tests improvements #107

Infrastructure Manager - Perform load and stress test to verify operator's behaviour under load

Description

We should verify how the operator behaves under load. To increase the stability and reliability of the Infrastructure Manager, a performance test has to be implemented which verifies common use cases. The goal is to regularly measure our internally defined performance KPIs (benchmarking/load test), verify the limits of the application (stress test), and detect performance-critical behaviour (e.g., memory leaks) before the Infrastructure Manager gets deployed on a productive landscape.

Acceptance criteria:

  • Identify the most relevant use-cases of the Infrastructure Manager
    • define input parameters (e.g. execute the test for 100, 1000, and 5000 CRDs)
    • specify the execution context/boundaries (how often the use case will be applied in parallel, limits for CPU/RAM consumption, max. execution time per test case etc.)
    • share the collected use cases and the defined boundaries with the team and collect their feedback
  • Learn what is the recommended way of load testing Kubebuilder projects
  • Implement the use-cases in a load test using one of the mainstream load testing tools (e.g. Grafana K6). This test has to cover
    • the creation of a load test landscape (e.g. by using a local K3d cluster or provisioning a Gardener Cluster) and deployment of a particular Infrastructure Manager version
    • ensure metrics of the Infrastructure Manager are recorded during the test execution
    • visualisation of the measured metrics in a Dashboard (e.g. Plutono)
    • mocks for 3rd party systems to avoid an overload of external systems (e.g. Gardener service)
  • Run the load test and increase the number of (parallel) workers until the application becomes unstable or crashes, to detect our maximum performance capacity.
    • Document test results
  • Integrate the load test into the release process to detect critical performance changes between releases

Reasons
Before deploying the operator to production, we must know its performance characteristics.

[Threat Modelling] Limit access to Gardener project kubeconfig

Reason

We're using a kubeconfig defined in gardener-kubeconfig-path. We should limit access to it to prevent unauthorized access to the Gardener project.

Acceptance criteria

  • review access rights to the gardener project kubeconfig and adjust them if needed

Migrate Prow jobs to Github Actions

Description

As Prow will be discontinued in 2024, we have to move the Prow jobs used for the Provisioner to an alternative CI/CD system. In our case, GitHub Actions is the preferred choice.

Overview of all existing Prow-jobs is listed here: https://github.com/search?q=repo%3Akyma-project%2Ftest-infra+framefrog&type=code&p=1

AC:

  • Identify which of the jobs listed in the URL above are required during the Infrastructure Manager development lifecycle and relevant in the long term (these have to be migrated)
  • Infrastructure Manager-related Prow jobs are migrated to GitHub Actions, except main-infrastructure-manager-build and pull-infrastructure-manager-build, which take care of signing

Reasons

Migrate CI/CD jobs from Prow to Github Actions as Prow will be discontinued in 2024.

Attachments

Setup end-2-end monitoring of KIM to detect service degradations and fire alerts

Description

As Infrastructure Manager is a critical backend service of Kyma, monitoring its availability is critical to react in time to service degradations.

The goal is to set up an end-to-end test case for the Infrastructure Manager which verifies the correct functionality of this service on KCP. The test should be executed at intervals (e.g., hourly), create a full-fledged Gardener cluster, and destroy it afterwards.

If the cluster creation isn't possible, an alert should be fired (e.g., via the SRE monitoring system) to inform the Framefrog team about the service degradation.

AC:

  • Get in touch with SREs and verify how a full-fledged test case could be integrated into the existing monitoring solution in Kyma
  • Implement a test case which requests KIM to create a Gardener cluster and finally also deletes it:
    • The test has to verify that the cluster got successfully created in Gardener
    • Check whether the cluster is accessible using the received kubeconfig from Gardener
    • Finally destroy the created Kyma cluster
  • Ensure a cleanup mechanism is in place which removes orphan clusters in cases where the test mechanism wasn't able to handle the cleanup as part of the test run.
  • Integrate the test case into the monitoring system (based on the guidance from SREs, see step 1) and ensure alerts are fired in case of KIM service degradation

Reasons

Ensure high quality and proactive service monitoring.

Attachments

Infrastructure Manager - implement kubeconfig secret management

Description

The Infrastructure Manager must manage dynamic kubeconfigs.

Acceptance criteria:

  • Infrastructure Manager can be installed on a Gardener cluster.
  • #37
  • #48
  • #39
  • Infrastructure Manager is periodically triggered to ensure secrets are rotated when needed.
  • It is possible to force a secret rotation with an annotation added to the secret.

Reasons

In the long term the Infrastructure Manager will replace Provisioner. In the first step it will be responsible for kubeconfig management in the Kyma Control Plane.

Infrastructure Manager - Add metrics, and alerts to improve observability

Description

The Infrastructure Manager should provide metrics to allow early issue detection.

Reasons

Infrastructure Manager is a component that in the long run will be responsible for cluster creation. In case of downtime, the impact on the Kyma Control Plane will be significant. We must prevent that by increasing observability.

Acceptance criteria

  • Add metrics proposed by Benjamin
    • Gardener Cluster CR metrics
    • Kubeconfig expiration metrics - #163
      • should expose shootName
      • DEV bump
      • configure DEV plutono
  • Have alerting set in place that will fire based on the expiration metrics [we also have to figure out when]
  • Contact SRE when the metrics are available on DEV
  • Track failed requests to the Gardener API server to measure the failure rate (helps detect Gardener interruptions)
    • There is a separate issue #138
  • Delete the kube-rbac-proxy sidecar, similar to the change done in https://github.com/kyma-project/compass-manager/pull/55/files - #164
  • Metrics should still be valid after Pod restarts

Infrastructure Manager - Dynamic kubeconfigs e2e test

Description

How it's going to be implemented is yet to be defined.

Reasons

Assure that the dynamic kubeconfigs feature is working e2e.

Acceptance criteria

  • Prepare Go code/bash script performing the test
  • Prepare changes in configuration/makefile to allow running in CI/CD pipeline

Attachments

/area control-plane
/kind feature
