metal-stack / gardener-extension-provider-metal
Implementation of the gardener-extension-controller for metal-stack
License: MIT License
The InfrastructureStatus is used with the internal API version, e.g.
infrastructureProviderStatus:
  Firewall:
    MachineID: metal:///...
    Succeeded: true
  apiVersion: metal.provider.extensions.gardener.cloud/__internal
  kind: InfrastructureStatus
Instead, a versioned API (e.g. v1alpha1) should be used, as this information is client-facing.
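With a versioned API, the same status would be serialized in its client-facing form, roughly like this (values elided as in the example above):

infrastructureProviderStatus:
  apiVersion: metal.provider.extensions.gardener.cloud/v1alpha1
  kind: InfrastructureStatus
  firewall:
    machineID: metal:///...
    succeeded: true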
When an error occurs during infrastructure network deletion, the 30s backoff somehow does not take effect; instead, the deletion is retried every second.
So, this line does not seem to have any effect: https://github.com/metal-stack/gardener-extension-provider-metal/blob/master/pkg/controller/infrastructure/actuator_delete.go#L161
metalpod/gardener-extension-provider-metal:v0.5.0
Our controller can die because it consumes the GitHub API too often when retrieving the firewall release asset:
...
panic: GET https://api.github.com/repos/metal-stack/firewall-controller/releases: 403 API rate limit exceeded for 185.153.67.16. (But here's the good news: Authenticated requests get a higher rate limit. Check out the documentation for more details.) [rate reset in 1m32s] [recovered]
panic: GET https://api.github.com/repos/metal-stack/firewall-controller/releases: 403 API rate limit exceeded for 185.153.67.16. (But here's the good news: Authenticated requests get a higher rate limit. Check out the documentation for more details.) [rate reset in 1m32s]
...
It even dies with a panic, which we should avoid under all circumstances.
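Besides handling the error instead of panicking, the rate limit itself could be avoided by authenticating the GitHub requests. A minimal sketch of how a token might be wired into the controller deployment; the secret and environment variable names are assumptions, nothing like this exists yet:

containers:
- name: gardener-extension-provider-metal
  env:
  - name: GITHUB_TOKEN              # assumed variable name, not currently read by the controller
    valueFrom:
      secretKeyRef:
        name: github-api-token      # assumed secret provided by the operator
        key: token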
This was quite an old idea we had for accounting and we won't use it anymore in the future. It's still deployed and failing. We should just remove everything related to it.
How to reproduce:
metalctl machine list --project ed53d0dc-fd1c-41f9-a534-bf84c61c98c9
ID LAST EVENT WHEN AGE HOSTNAME PROJECT SIZE IMAGE PARTITION
fafd0c00-7090-11e9-8000-efbeaddeefbe Phoned Home 2s 23m 9s shoot--p7l8m...-firewall-fcbab ed53d0dc-fd1c-41f9-a534-bf84c61c98c9 c1-xlarge-x86 Firewall 1 nbg-w8101
00000000-beef-beef-0011-efbeaddeefbe Phoned Home 3s 18m 52s shoot--p7l8m...dfb8b55dc-f9zxg ed53d0dc-fd1c-41f9-a534-bf84c61c98c9 c1-xlarge-x86 Ubuntu 19.04 nbg-w8101
00000000-beef-beef-0001-efbeaddeefbe Phoned Home 1s 6m 36s shoot--p7l8m...dfb8b55dc-nfvv9 ed53d0dc-fd1c-41f9-a534-bf84c61c98c9 c1-xlarge-x86 Ubuntu 19.04 nbg-w8101
metalctl machine rm fafd0c00-7090-11e9-8000-efbeaddeefbe
cloudctl cluster reconcile c1605f39-1668-11ea-853c-42ced4c4a306
=> reconcile freezes at 86%:
UID TENANT PROJECT NAME VERSION PARTITION OPERATION PROGRESS API CONTROL NODES SYSTEM SIZE AGE
c1605f39-1668-11ea-853c-42ced4c4a306 fits ed53d0dc-fd1c-41f9-a534-bf84c61c98c9 mwen4895 1.14.3 nbg-w8101 Processing 86% [Reconcile] True True False False 2/2 54m 7s
project still without firewall:
ID LAST EVENT WHEN AGE HOSTNAME PROJECT SIZE IMAGE PARTITION
00000000-beef-beef-0011-efbeaddeefbe Phoned Home 15s 49m 6s shoot--p7l8m...dfb8b55dc-f9zxg ed53d0dc-fd1c-41f9-a534-bf84c61c98c9 c1-xlarge-x86 Ubuntu 19.04 nbg-w8101
00000000-beef-beef-0001-efbeaddeefbe Phoned Home 17s 36m 50s shoot--p7l8m...dfb8b55dc-nfvv9 ed53d0dc-fd1c-41f9-a534-bf84c61c98c9 c1-xlarge-x86 Ubuntu 19.04 nbg-w8101
For automatic deletion of jobs there is an alpha feature, ttlSecondsAfterFinished. Please check whether you can activate this feature.
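A minimal sketch of a Job using ttlSecondsAfterFinished (the job name and image are placeholders; the TTLAfterFinished feature gate has to be enabled while the feature is alpha):

apiVersion: batch/v1
kind: Job
metadata:
  name: example-job                  # placeholder
spec:
  ttlSecondsAfterFinished: 120       # delete the Job and its pods 2 minutes after it finishes
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: task
        image: busybox               # placeholder
        command: ["sh", "-c", "echo done"]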
Feb 11 19:36:24 shoot--p5mlcn--<masked>-firewall-eeeb8 ip[1412]: 2021/02/11 19:36:24 unable to send dropentry:rpc error: code = Unavailable desc = connection error: desc = "transport: authentication handshake failed: x509: certificate relies on legacy Common Name field, use SANs or temporarily enable Common Name matching with GODEBUG=x509ignoreCN=0"
k logs -n firewall droptailer-667bdd5598-rmrm7 -f
2021/02/11 18:35:22 listening on 50051
It now seems possible to validate the shoot spec with regard to our provider-specific config; we should do that as well:
Not everyone will need our accounting solution, group rolebinding controller, and authn webhook. The same will be the case for the Splunk webhook that we will introduce.
We should add switches to the controller to enable or disable the deployment of these controllers and webhooks. The switches should be settable in the deployment of the gardener extension controller.
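A rough sketch of how such switches could look in the values of the controller deployment; all keys below are hypothetical and would have to match whatever the charts end up using:

# hypothetical values switches for optional components
accountingExporter:
  enabled: false
groupRolebindingController:
  enabled: true
authnWebhook:
  enabled: true
splunkAuditWebhook:
  enabled: false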
Once oot-mcm (the out-of-tree machine-controller-manager) is configured on all shoots.
/cc @mwennrich @Gerrit91
Temporarily, we were not able to retrieve project information via the metal-api, so we could not configure the accounting exporter. For that reason we added the cloud-go dependency to this project. Retrieving project information is now possible again via metal-go, so we can remove the cloud-go dependency.
For end users it would improve the experience if the calico networking config for the metal provider were defaulted. This can be achieved with a mutating webhook on the shoot resource that defaults the required values if the networking provider config is left unconfigured by the user.
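As an illustration only, the webhook could inject something like the following when the user leaves the networking providerConfig empty; the concrete defaults still have to be defined by us, and the backend value below is purely an assumption:

spec:
  networking:
    type: calico
    providerConfig:
      apiVersion: calico.networking.extensions.gardener.cloud/v1alpha1
      kind: NetworkConfig
      backend: none                  # assumed default, to be decided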
For every reconcile we query all our metal networks, which is quite an expensive call.
To enhance validations, reduce the complexity of the shoot spec and simplify the integration of the metal provider into the Gardener dashboard, we need to provide some provider config into the cloud profile:
With the latest changes, the iamconfig is part of the CloudProfileConfig. I assumed that you can now leave controlPlaneConfig in Shoot resources empty, but then the ControlPlane controller complains with:
lastOperation:
  description: 'Error reconciling controlplane: provider config is not set on
    the control plane resource'
After specifying
controlPlaneConfig:
  apiVersion: metal.provider.extensions.gardener.cloud/v1alpha1
  kind: ControlPlaneConfig
it could successfully reconcile. It would be nicer if controlPlaneConfig were completely optional now.
This is already done in gardener as well; see gardener/gardener#3739 for reference.
In the extension provider there is a reference here: gardener/gardener-extension-provider-aws#316
Currently, the docker image and tag are hard-coded into the shoot-control-plane charts.
We should make them configurable and add the images to the image vector so that we can override them from our controller registrations.
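As a sketch, an entry in the chart image vector (charts/images.yaml) could look roughly like this; the concrete image name, repository, and tag are placeholders:

images:
- name: some-control-plane-component                            # placeholder
  sourceRepository: github.com/metal-stack/example-component    # placeholder
  repository: ghcr.io/metal-stack/some-control-plane-component  # placeholder
  tag: "v0.1.0"                                                 # placeholder, overridable via controller registration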
It should be listed explicitly which verbs on pods are allowed for csi-lvm, instead of using the wildcard (*).
This is actually set here: https://github.com/metal-stack/gardener-extension-provider-metal/blob/master/charts/internal/shoot-storageclasses/templates/storageclasses.yaml
I propose instead to limit this to: create, delete, get, list, patch, update, watch
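A sketch of the tightened RBAC rule with the proposed verbs (whether it lives in a Role or ClusterRole depends on the existing chart):

rules:
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["create", "delete", "get", "list", "patch", "update", "watch"]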
The firewall object is not allowed to be changed by end users.
The new version of the authn-webhook needs read-access to the metal-api to read the tenant list.
Introduce and fill these new env-variables:
We must check our whole code base for missing DNSNames, like it was fixed for the droptailer certificate in #136.
1s Warning Failed pod/csi-lvm-controller-759f7b6c6b-dgp6v Failed to pull image "metalstack/csi-lvm-controller:v0.6.1": rpc error: code = Unknown desc = Error response from daemon: toomanyrequests: You have reached your pull rate limit. You may increase the limit by authenticating and upgrading: https://www.docker.com/increase-rate-limit
It would be nice if the status of the firewall contributed to the overall shoot health through the health controller.
Not yet sure how this can happen, but it can happen when updating to a new firewall. We ended up with four firewalls in one cluster after an update.
Follow-up of #136.
Currently, this is always enabled, and users have to pin explicit image versions if they do not want auto-updates.
Since the beginning of the project we have had test files copied over from another reference implementation of an extension controller. We should either make them work or remove them from the project, and then run the tests regularly in CI.
It does not look like any action is required, because we don't use the Layer 2 features.
Once all firewalls in production are up to date, we should remove all remaining configuration for the deprecated firewall-policy-controller.
./metal-stack/gardener-extension-provider-metal/pkg/controller/infrastructure/machine.go
9: return fmt.Sprintf("metal:///%s/%s", partition, machineID)
./gardener-extension-provider-metal/pkg/controller/infrastructure/machine.go
9: return fmt.Sprintf("metal:///%s/%s", partition, machineID)
Today, the URL to the MetalStack API is part of the cloudprovider secret: https://github.com/metal-stack/gardener-extension-provider-metal/blob/master/example/30-infrastructure.yaml#L11.
This requires every user to enter it into their secret, although it is always the same for one MetalStack environment.
It would be more convenient and improve the user experience if the operator could configure it in the CloudProfile instead, similar to the Keystone URL of OpenStack.
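A hypothetical extension of the CloudProfileConfig; the field name and value below are assumptions and do not exist in the current API:

apiVersion: metal.provider.extensions.gardener.cloud/v1alpha1
kind: CloudProfileConfig
metalAPIURL: https://api.metal.example.com   # assumed field name and value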
Steps:
gardener-extensions made huge improvements to the monitoring with gardener-attic/gardener-extensions#299 and gardener-attic/gardener-extensions#344.
We should do that for us as well.
Related to metal-stack/firewall-controller#67
At the moment we have the IAM configuration for our authn-webhook stored in the shoot spec.
Gardener designed the shoot resource to be user-facing (editable by the user), so the placement of the IAM Config in the shoot control plane provider config has some downsides:
(NamespaceMaxLength or ExcludedNamespaces)
Admittedly, we built our own API to hide the shoot resource from the dashboard, but I think it makes more sense to do things in line with the Gardener practices, because it will reduce the maintenance overhead of our API and improve the usability of our cloud for third parties.
#26 already aims to mitigate the problems of this configuration for the dashboard integration by allowing the IAM config to be defaulted within the cloud profile. Defaulting, though, will not be a final solution, as the IAM config will vary across different tenants in the near future.
Instead, the IAM configuration should be provided by the os-extension-provider-metal and not by our own API (just like we moved the reconciliation of the node network to this controller). This would actually be possible:
There are requests to let users define different OIDC backends per shoot in the future. We can think about leaving a special field in the control plane config for this and hiding the specific webhook configuration from the user in the same way as described above. So, if the user needs something special like that, we can say it's acceptable that the user has to add a small amount of configuration by hand to the shoot spec.
Dec 03 20:27:09 shoot--pcfgbt--gerrit-firewall-1afab ip[106868]: 2020/12/03 20:27:09 unable to send dropentry:rpc error: code = Unavailable desc = connection error: desc = "transport: authentication handshake failed: x509: certificate relies on legacy Common Name field, use SANs or temporarily enable Common Name matching with GODEBUG=x509ignoreCN=0"
Dec 03 20:27:12 shoot--pcfgbt--gerrit-firewall-1afab ip[106868]: 2020/12/03 20:27:12 unable to send dropentry:rpc error: code = Unavailable desc = connection error: desc = "transport: authentication handshake failed: x509: certificate relies on legacy Common Name field, use SANs or temporarily enable Common Name matching with GODEBUG=x509ignoreCN=0"
Seems to be related to Go 1.15: jaegertracing/jaeger#2435. This points to the real solution: golang/go#39568 (comment)
Adding
Environment=GODEBUG=x509ignoreCN=0
to droptailer.service immediately makes the connection work again.
I think we must change the way the certs are created in https://github.com/metal-stack/gardener-extension-provider-metal
@mwindower we must fix this with the next fw image version.
In the controllers we should use the common client context offered by Gardener, like:
This basically works already except for the infrastructure resource, which loses the status field. For this reason, the infrastructure controller assumes that there is no firewall and tries to create another one.
Instead, when the status field is empty, we should try to find an existing firewall for this cluster; if there is exactly one, we do nothing except update the infrastructure status:
$ k get infrastructure -o yaml
...
status:
  ...
  providerStatus:
    apiVersion: metal.provider.extensions.gardener.cloud/v1alpha1
    firewall:
      machineID: metal:///fra-equ01/00000000-0000-0000-0000-ac1f6bd390b2
      succeeded: true
  ...
Audit events from the kube-apiserver contain a field for the source IP that the requests came from. Example:
audittailer-768f964b78-t4hcs audittailer {"kind":"Event","apiVersion":"audit.k8s.io/v1","level":"Request","auditID":"39d36d5d-cae5-4b0c-8ef2-8dc8013f49d1","stage":"ResponseComplete","requestURI":"/api/v1/namespaces/default/pods?limit=500","verb":"list","user":{"username":"oidc:IZ00242","uid":"IZ00242","groups":["oidc:all-cadm","system:authenticated"]},"sourceIPs":["10.67.48.2"],"userAgent":"kubectl/v1.21.1 (linux/amd64) kubernetes/5e58841","objectRef":{"resource":"pods","namespace":"default","apiVersion":"v1"},"responseStatus":{"metadata":{},"code":200},"requestReceivedTimestamp":"2021-05-27T17:30:52.228925Z","stageTimestamp":"2021-05-27T17:30:52.231553Z","annotations":{"authorization.k8s.io/decision":"allow","authorization.k8s.io/reason":"RBAC: allowed by ClusterRoleBinding \"oidc-all-cadm\" of ClusterRole \"cluster-admin\" to Group \"oidc:all-cadm\""}}
Unfortunately, the "sourceIPs":["10.67.48.2"] is the node IP address of one of the nodes in the seed cluster. This seems to be the correct behaviour, since the apiserver is exposed as a service of type LoadBalancer with externalTrafficPolicy: Cluster.
From an audit point of view this is not ideal because it hides the real source address from which an event originated.
Changing the externalTrafficPolicy of the kube-apiserver service manually to Local fixes this temporarily, until the service gets reconciled again. Example audit event:
{"kind":"Event","apiVersion":"audit.k8s.io/v1","level":"Request","auditID":"952889bc-8879-43f6-9d91-e465cae3c76e","stage":"ResponseComplete","requestURI":"/api/v1/namespaces/audit/pods/audittailer-768f964b78-zg8jk/log","verb":"get","user":{"username":"oidc:IZ00242","uid":"IZ00242","groups":["oidc:all-cadm","system:authenticated"]},"sourceIPs":["95.117.118.243"],"userAgent":"kubectl/v1.21.1 (linux/amd64) kubernetes/5e58841","objectRef":{"resource":"pods","namespace":"audit","name":"audittailer-768f964b78-zg8jk","apiVersion":"v1","subresource":"log"},"responseStatus":{"metadata":{},"code":200},"requestReceivedTimestamp":"2021-05-27T17:45:49.837864Z","stageTimestamp":"2021-05-27T17:45:51.099244Z","annotations":{"authorization.k8s.io/decision":"allow","authorization.k8s.io/reason":"RBAC: allowed by ClusterRoleBinding \"oidc-all-cadm\" of ClusterRole \"cluster-admin\" to Group \"oidc:all-cadm\""}}
This seemed to have no ill effect on the cluster during the short time until the policy was reset, so I suggest we set the externalTrafficPolicy of the kube-apiserver to Local in this extension provider.
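For illustration, the relevant part of the kube-apiserver service would then look like this:

apiVersion: v1
kind: Service
metadata:
  name: kube-apiserver
spec:
  type: LoadBalancer
  externalTrafficPolicy: Local   # preserve the client source IP for audit events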
Add https://github.com/metal-stack/csi-driver-lvm to shoot-storageclasses.
Check if a csi-migration path from csi-lvm to csi-driver-lvm is possible.
gardener-extension-provider-metal-6cc85bb4bd-g97nx gardener-extension-provider-metal {"level":"info","ts":"2021-07-06T06:36:04.849Z","logger":"metal-controlplane-controller.metal-values-provider","msg":"skipping duros storage deployment because no storage configuration found for seed","seed":"prod"}
3m40s Warning FailedCreate replicaset/csi-lvm-controller-7bd679d598 Error creating: pods "csi-lvm-controller-7bd679d598-" is forbidden: pods with system-cluster-critical priorityClass is not permitted in csi-lvm namespace