metal-stack / gardener-extension-provider-metal

Implementation of the gardener-extension-controller for metal-stack

License: MIT License

Smarty 0.74% Go 97.46% Shell 0.66% Dockerfile 0.08% Makefile 1.06%
gardener-extension gardener kubernetes-controller bare-metal

gardener-extension-provider-metal's Introduction

metal-stack

We believe Kubernetes runs best on bare metal. metal-stack is all about providing metal as a service.

gardener-extension-provider-metal's People

Contributors

chbmuc, dergeberl, droid42, gehoern, gerrit91, iljarotar, kolsa, majst01, mreiger, mwennrich, mwindower, rfranzke, robertvolkmann, timebertt, vknabel


gardener-extension-provider-metal's Issues

firewall infrastructure does not get reconciled

How to reproduce:

  • create cluster
  • cluster has two workers and one firewall:
metalctl machine list --project ed53d0dc-fd1c-41f9-a534-bf84c61c98c9
  ID                                         LAST EVENT   WHEN  AGE      HOSTNAME                        PROJECT                               SIZE           IMAGE         PARTITION
  fafd0c00-7090-11e9-8000-efbeaddeefbe       Phoned Home  2s    23m 9s   shoot--p7l8m...-firewall-fcbab  ed53d0dc-fd1c-41f9-a534-bf84c61c98c9  c1-xlarge-x86  Firewall 1    nbg-w8101
  00000000-beef-beef-0011-efbeaddeefbe       Phoned Home  3s    18m 52s  shoot--p7l8m...dfb8b55dc-f9zxg  ed53d0dc-fd1c-41f9-a534-bf84c61c98c9  c1-xlarge-x86  Ubuntu 19.04  nbg-w8101
  00000000-beef-beef-0001-efbeaddeefbe       Phoned Home  1s    6m 36s   shoot--p7l8m...dfb8b55dc-nfvv9  ed53d0dc-fd1c-41f9-a534-bf84c61c98c9  c1-xlarge-x86  Ubuntu 19.04  nbg-w8101

  • destroy firewall
    metalctl machine rm fafd0c00-7090-11e9-8000-efbeaddeefbe
  • reconcile cluster
    cloudctl cluster reconcile c1605f39-1668-11ea-853c-42ced4c4a306

=> reconcile freezes at 86%:

  UID                                   TENANT  PROJECT                               NAME      VERSION  PARTITION  OPERATION   PROGRESS         API   CONTROL  NODES  SYSTEM  SIZE  AGE
  c1605f39-1668-11ea-853c-42ced4c4a306  fits    ed53d0dc-fd1c-41f9-a534-bf84c61c98c9  mwen4895  1.14.3   nbg-w8101  Processing  86% [Reconcile]  True  True     False  False   2/2   54m 7s

project still without firewall:

  ID                                         LAST EVENT   WHEN  AGE      HOSTNAME                        PROJECT                               SIZE           IMAGE         PARTITION
  00000000-beef-beef-0011-efbeaddeefbe       Phoned Home  15s   49m 6s   shoot--p7l8m...dfb8b55dc-f9zxg  ed53d0dc-fd1c-41f9-a534-bf84c61c98c9  c1-xlarge-x86  Ubuntu 19.04  nbg-w8101
  00000000-beef-beef-0001-efbeaddeefbe       Phoned Home  17s   36m 50s  shoot--p7l8m...dfb8b55dc-nfvv9  ed53d0dc-fd1c-41f9-a534-bf84c61c98c9  c1-xlarge-x86  Ubuntu 19.04  nbg-w8101

Provide general provider-specific configuration in cloud profile

To enhance validations, reduce the complexity of the shoot spec, and simplify the integration of the metal provider into the Gardener dashboard, we need to provide some provider configuration in the cloud profile:

  • Add control plane IAM config into cloud profile
    • Populate control plane deployment charts dynamically with the values from the cloud profile
    • Still allow overriding these cloud defaults with the existing shoot iam provider config spec
    • Validate shoot spec by checking against cloud profile as well (allow them to be unset in shoot spec)
  • Add available firewall images to cloud profile and add to validation
  • Add available firewall networks to cloud profile and add to validation
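
A purely hypothetical sketch of what such cloud profile defaults could look like (all field names below are invented for illustration; the actual CloudProfileConfig schema may differ):

```yaml
providerConfig:
  apiVersion: metal.provider.extensions.gardener.cloud/v1alpha1
  kind: CloudProfileConfig
  iamConfig:                # defaults, still overridable per shoot
    issuerConfig:
      url: https://oidc.example.com
  firewallImages:           # validated against the shoot spec
    - firewall-ubuntu-2.0
  firewallNetworks:
    - internet
```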

droptailer connect not working

Feb 11 19:36:24 shoot--p5mlcn--<masked>-firewall-eeeb8 ip[1412]: 2021/02/11 19:36:24 unable to send dropentry:rpc error: code = Unavailable desc = connection error: desc = "transport: authentication handshake failed: x509: certificate relies on legacy Common Name field, use SANs or temporarily enable Common Name matching with GODEBUG=x509ignoreCN=0"
k logs -n firewall droptailer-667bdd5598-rmrm7 -f
2021/02/11 18:35:22 listening on 50051


Audit events have node IP address as source address for requests

Audit events from the kube-apiserver contain a field for the source IP that the requests came from. Example:

audittailer-768f964b78-t4hcs audittailer {"kind":"Event","apiVersion":"audit.k8s.io/v1","level":"Request","auditID":"39d36d5d-cae5-4b0c-8ef2-8dc8013f49d1","stage":"ResponseComplete","requestURI":"/api/v1/namespaces/default/pods?limit=500","verb":"list","user":{"username":"oidc:IZ00242","uid":"IZ00242","groups":["oidc:all-cadm","system:authenticated"]},"sourceIPs":["10.67.48.2"],"userAgent":"kubectl/v1.21.1 (linux/amd64) kubernetes/5e58841","objectRef":{"resource":"pods","namespace":"default","apiVersion":"v1"},"responseStatus":{"metadata":{},"code":200},"requestReceivedTimestamp":"2021-05-27T17:30:52.228925Z","stageTimestamp":"2021-05-27T17:30:52.231553Z","annotations":{"authorization.k8s.io/decision":"allow","authorization.k8s.io/reason":"RBAC: allowed by ClusterRoleBinding \"oidc-all-cadm\" of ClusterRole \"cluster-admin\" to Group \"oidc:all-cadm\""}}

Unfortunately, the "sourceIPs":["10.67.48.2"] is the node IP address of one of the nodes in the seed cluster. This seems to be the expected behaviour, since the apiserver is exposed as a service of type LoadBalancer with externalTrafficPolicy: Cluster.

From an audit point of view this is not ideal because it hides the real source address from which an event originated.
Changing the externalTrafficPolicy of the kube-apiserver service manually to Local fixes this temporarily, until the service gets reconciled again. Example audit event:

{"kind":"Event","apiVersion":"audit.k8s.io/v1","level":"Request","auditID":"952889bc-8879-43f6-9d91-e465cae3c76e","stage":"ResponseComplete","requestURI":"/api/v1/namespaces/audit/pods/audittailer-768f964b78-zg8jk/log","verb":"get","user":{"username":"oidc:IZ00242","uid":"IZ00242","groups":["oidc:all-cadm","system:authenticated"]},"sourceIPs":["95.117.118.243"],"userAgent":"kubectl/v1.21.1 (linux/amd64) kubernetes/5e58841","objectRef":{"resource":"pods","namespace":"audit","name":"audittailer-768f964b78-zg8jk","apiVersion":"v1","subresource":"log"},"responseStatus":{"metadata":{},"code":200},"requestReceivedTimestamp":"2021-05-27T17:45:49.837864Z","stageTimestamp":"2021-05-27T17:45:51.099244Z","annotations":{"authorization.k8s.io/decision":"allow","authorization.k8s.io/reason":"RBAC: allowed by ClusterRoleBinding \"oidc-all-cadm\" of ClusterRole \"cluster-admin\" to Group \"oidc:all-cadm\""}}

This seemed to have no ill effect on the cluster during the short time until the policy was reset, so I suggest we set the externalTrafficPolicy of the kube-apiserver service to Local in this extension provider.
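
As a sketch, the proposed change amounts to the following on the kube-apiserver Service rendered by this extension (type and externalTrafficPolicy are standard Kubernetes Service fields; the metadata here is illustrative):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: kube-apiserver
spec:
  type: LoadBalancer
  # Local preserves the client source IP in audit events instead of
  # SNATing requests to a seed node IP.
  externalTrafficPolicy: Local
```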

Default networking provider config with shoot mutator

For end users, the experience would improve if the calico networking config for the metal provider were defaulted. This can be achieved with a mutating webhook on the shoot resource that fills in the required values whenever the networking provider config is left unset by the user.

change all imagePullPolicy to IfNotPresent

1s Warning Failed pod/csi-lvm-controller-759f7b6c6b-dgp6v Failed to pull image "metalstack/csi-lvm-controller:v0.6.1": rpc error: code = Unknown desc = Error response from daemon: toomanyrequests: You have reached your pull rate limit. You may increase the limit by authenticating and upgrading: https://www.docker.com/increase-rate-limit
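
A sketch of the change for one of the affected charts (the container shown is taken from the log above; the surrounding chart structure is abbreviated):

```yaml
containers:
  - name: csi-lvm-controller
    image: metalstack/csi-lvm-controller:v0.6.1
    # only pull when the image is not already cached on the node,
    # which avoids hitting the Docker Hub pull rate limit
    imagePullPolicy: IfNotPresent
```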

Remove IAM config from shoot spec

At the moment we have the IAM configuration for our authn-webhook stored in the shoot spec.
Gardener designed the shoot resource to be user-facing (editable by the user), so placing the IAM config in the shoot control plane provider config has some downsides:

  • Maybe parts of this configuration should not be configurable by the user at all (e.g. why should the user be able to set an arbitrary value for NamespaceMaxLength or ExcludedNamespaces?)
  • The configuration is hard for a user to provide correctly because it requires knowledge of the tenant-specific authorization backend
  • For dashboard integration... where should the data come from?

Admittedly, we built our own API to hide the shoot resource from the dashboard, but I think it makes more sense to do things in line with Gardener practices, because it will reduce the maintenance overhead of our API and improve the usability of our cloud for third parties.

#26 already aims to mitigate the problems this configuration causes for dashboard integration by allowing the IAM config to be defaulted in the cloud profile. Defaulting will not be a final solution though, as the IAM config will vary across different tenants in the near future.

Instead, the IAM configuration should be provided by the gardener-extension-provider-metal and not by our own API (just like we moved the reconciliation of the node network to this controller). This would actually be possible:

  • The gardener-extension-provider-metal is capable of finding out the Gardener project of the cluster in the control plane controller, as the cluster namespace contains the unique project name (which is not modifiable by a user)
  • From the Gardener project we can deduce the owner id of the cluster (which again cannot be changed by the user)
  • From the owner we can find out the tenant
  • With the tenant we can retrieve the proper IAM configuration from the masterdata-api, even from extension-provider-metal (assuming the masterdata-api had a public API)

There are requests to let users define different OIDC backends per shoot in the future. We can think about leaving a special field in the control plane config for this and hiding the specific webhook configuration from the user in the same way as described above. So, if a user needs something special like that, it is acceptable that they have to add a small amount of configuration by hand to the shoot spec.

Internal version used for `infrastructureStatus`

The InfrastructureStatus is used with the internal API version, e.g.

infrastructureProviderStatus:
      Firewall:
        MachineID: metal:///...
        Succeeded: true
      apiVersion: metal.provider.extensions.gardener.cloud/__internal
      kind: InfrastructureStatus

Instead, a released API version (e.g. v1alpha1) should be used, as this information is client-facing.

Support shoot control plane migration

This basically works already, except for the infrastructure resource, which loses its status field. For this reason, the infrastructure controller assumes that there is no firewall and tries to create another one.

Instead, when the status field is empty, we should try to find an existing firewall for this cluster. If there is exactly one, we don't create anything but only update the infrastructure status:

$ k get infrastructure -o yaml
...
  status:
    ...
    providerStatus:
      apiVersion: metal.provider.extensions.gardener.cloud/v1alpha1
      firewall:
        machineID: metal:///fra-equ01/00000000-0000-0000-0000-ac1f6bd390b2
        succeeded: true
...

Create switches for deploying optional parts like accounting, auth, ...

Not everyone will need our accounting solution, group role binding controller, and authn webhook. The same will be true for the splunk webhook that we will introduce.

We should add switches to the controller to enable or disable the deployment of these controllers and webhooks. The switches should be settable in the deployment of the gardener extension controller.

shooted seeds did not get duros-controller deployed

gardener-extension-provider-metal-6cc85bb4bd-g97nx gardener-extension-provider-metal {"level":"info","ts":"2021-07-06T06:36:04.849Z","logger":"metal-controlplane-controller.metal-values-provider","msg":"skipping duros storage deployment because no storage configuration found for seed","seed":"prod"}

Hitting Github API rate limits

Our controller can die because it queries the GitHub API too often when retrieving the firewall release asset:

...
panic: GET https://api.github.com/repos/metal-stack/firewall-controller/releases: 403 API rate limit exceeded for 185.153.67.16. (But here's the good news: Authenticated requests get a higher rate limit. Check out the documentation for more details.) [rate reset in 1m32s] [recovered]
        panic: GET https://api.github.com/repos/metal-stack/firewall-controller/releases: 403 API rate limit exceeded for 185.153.67.16. (But here's the good news: Authenticated requests get a higher rate limit. Check out the documentation for more details.) [rate reset in 1m32s]
...

It even dies with a panic, which we should avoid under all circumstances.

Remove limit-validating-webhook

This was quite an old idea we had for accounting, and we won't use it anymore. It's still deployed and failing, so we should simply remove everything related to it.

droptailer client cannot send data

Dec 03 20:27:09 shoot--pcfgbt--gerrit-firewall-1afab ip[106868]: 2020/12/03 20:27:09 unable to send dropentry:rpc error: code = Unavailable desc = connection error: desc = "transport: authentication handshake failed: x509: certificate relies on legacy Common Name field, use SANs or temporarily enable Common Name matching with GODEBUG=x509ignoreCN=0"
Dec 03 20:27:12 shoot--pcfgbt--gerrit-firewall-1afab ip[106868]: 2020/12/03 20:27:12 unable to send dropentry:rpc error: code = Unavailable desc = connection error: desc = "transport: authentication handshake failed: x509: certificate relies on legacy Common Name field, use SANs or temporarily enable Common Name matching with GODEBUG=x509ignoreCN=0"

Seems to be related to go 1.15
jaegertracing/jaeger#2435, this points to the real solution:
golang/go#39568 (comment)

adding:

Environment=GODEBUG=x509ignoreCN=0

to droptailer.service immediately makes the connection work again.

I think we must change the way the certs are created in https://github.com/metal-stack/gardener-extension-provider-metal.

@mwindower we must fix this with the next fw image version.

TTL mechanism for finished Jobs

For automatic deletion of finished jobs there is an alpha feature. Please check whether we can activate it.

ttlSecondsAfterFinished
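
A minimal Job manifest using the field (while the feature is alpha, the TTLAfterFinished feature gate must be enabled for it to take effect):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: example-job
spec:
  ttlSecondsAfterFinished: 300  # delete the Job 5 minutes after it finishes
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: main
          image: busybox
          command: ["true"]
```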

Metal API URL should be part of the `CloudProfile`

Today, the URL to the metal-api is part of the cloudprovider secret: https://github.com/metal-stack/gardener-extension-provider-metal/blob/master/example/30-infrastructure.yaml#L11.

This requires every user to enter it into their secret, although it's always the same for a given metal-stack environment.
It'd be more convenient and would improve the user experience if the operator could configure it in the CloudProfile instead, similar to the Keystone URL for OpenStack.
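
A hypothetical shape for this, analogous to OpenStack's Keystone URL (the field name metalAPIURL is invented for illustration):

```yaml
providerConfig:
  apiVersion: metal.provider.extensions.gardener.cloud/v1alpha1
  kind: CloudProfileConfig
  metalAPIURL: https://api.metal-stack.example.com
```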

ControlPlane controller complains after latest changes if no providerConfig is given

With the latest changes, the iamconfig is part of the CloudProfileConfig. I assumed that you can now leave the controlPlaneConfig in Shoot resources empty, but then the ControlPlane controller complains with:

    lastOperation:
      description: 'Error reconciling controlplane: provider config is not set on
        the control plane resource'

After specifying

controlPlaneConfig:
  apiVersion: metal.provider.extensions.gardener.cloud/v1alpha1
  kind: ControlPlaneConfig

it could successfully reconcile. It would be nicer if controlPlaneConfig were completely optional now.

Remove dependency on cloud-go

Temporarily, we were not able to retrieve project information via the metal-api, so we could not configure the accounting exporter. For that reason, we added the cloud-go dependency to this project. Retrieving project information is now possible again via metal-go, so we can remove the cloud-go dependency.

Fix tests

Since the beginning of the project we have had test files copied over from another reference implementation of an extension controller. We should make them work or remove them from the project, and then run the tests regularly in CI.

providerID is malformed

./metal-stack/gardener-extension-provider-metal/pkg/controller/infrastructure/machine.go
9: return fmt.Sprintf("metal:///%s/%s", partition, machineID)

./gardener-extension-provider-metal/pkg/controller/infrastructure/machine.go
9: return fmt.Sprintf("metal:///%s/%s", partition, machineID)

Metal-API access for authn-webhook

The new version of the authn-webhook needs read-access to the metal-api to read the tenant list.

Introduce and fill these new env-variables:

  • METAL_URL - URL of the metal-api to read tenants from
  • METAL_HMAC - HMAC for metal-api access
  • METAL_AUTHTYPE - "User" for HMAC, e.g. "Metal-View"

Add metallb and droptailer to image vector

Currently, the docker images and tags are hard-coded into the shoot-control-plane charts.

We should make this configurable and add the images to the image vector, such that we can override them from our controller registrations.
