metal-stack / gardener-extension-provider-metal
Implementation of the gardener-extension-controller for metal-stack
License: MIT License
The InfrastructureStatus is used with the internal API version, e.g.
infrastructureProviderStatus:
  Firewall:
    MachineID: metal:///...
    Succeeded: true
  apiVersion: metal.provider.extensions.gardener.cloud/__internal
  kind: InfrastructureStatus
Instead, a versioned API (e.g. v1alpha1) should be used, as this information is client-facing.
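With a versioned API, the same status would be serialized in its client-facing form, roughly like this (values elided as in the example above):

infrastructureProviderStatus:
  apiVersion: metal.provider.extensions.gardener.cloud/v1alpha1
  kind: InfrastructureStatus
  firewall:
    machineID: metal:///...
    succeeded: true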
When an error occurs during infrastructure network deletion, the 30s backoff somehow does not take effect; instead, the deletion is retried every second.
So, this line does not seem to have any effect: https://github.com/metal-stack/gardener-extension-provider-metal/blob/master/pkg/controller/infrastructure/actuator_delete.go#L161
metalpod/gardener-extension-provider-metal:v0.5.0
Our controller can die because it consumes the GitHub API too often when retrieving the firewall release asset:
...
panic: GET https://api.github.com/repos/metal-stack/firewall-controller/releases: 403 API rate limit exceeded for 185.153.67.16. (But here's the good news: Authenticated requests get a higher rate limit. Check out the documentation for more details.) [rate reset in 1m32s] [recovered]
panic: GET https://api.github.com/repos/metal-stack/firewall-controller/releases: 403 API rate limit exceeded for 185.153.67.16. (But here's the good news: Authenticated requests get a higher rate limit. Check out the documentation for more details.) [rate reset in 1m32s]
...
It even dies with a panic, which we should avoid under all circumstances.
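Besides handling the error instead of panicking, the rate limit itself could be avoided by authenticating the GitHub requests. A minimal sketch of how a token might be wired into the controller deployment; the secret and environment variable names are assumptions, nothing like this exists yet:

containers:
- name: gardener-extension-provider-metal
  env:
  - name: GITHUB_TOKEN              # assumed variable name, not currently read by the controller
    valueFrom:
      secretKeyRef:
        name: github-api-token      # assumed secret provided by the operator
        key: token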
This was quite an old idea we had for accounting and we won't use it anymore in the future. It's still deployed and failing. We should just remove everything related to it.
How to reproduce:
metalctl machine list --project ed53d0dc-fd1c-41f9-a534-bf84c61c98c9
ID LAST EVENT WHEN AGE HOSTNAME PROJECT SIZE IMAGE PARTITION
fafd0c00-7090-11e9-8000-efbeaddeefbe Phoned Home 2s 23m 9s shoot--p7l8m...-firewall-fcbab ed53d0dc-fd1c-41f9-a534-bf84c61c98c9 c1-xlarge-x86 Firewall 1 nbg-w8101
00000000-beef-beef-0011-efbeaddeefbe Phoned Home 3s 18m 52s shoot--p7l8m...dfb8b55dc-f9zxg ed53d0dc-fd1c-41f9-a534-bf84c61c98c9 c1-xlarge-x86 Ubuntu 19.04 nbg-w8101
00000000-beef-beef-0001-efbeaddeefbe Phoned Home 1s 6m 36s shoot--p7l8m...dfb8b55dc-nfvv9 ed53d0dc-fd1c-41f9-a534-bf84c61c98c9 c1-xlarge-x86 Ubuntu 19.04 nbg-w8101
metalctl machine rm fafd0c00-7090-11e9-8000-efbeaddeefbe
cloudctl cluster reconcile c1605f39-1668-11ea-853c-42ced4c4a306
=> reconcile freezes at 86%:
UID TENANT PROJECT NAME VERSION PARTITION OPERATION PROGRESS API CONTROL NODES SYSTEM SIZE AGE
c1605f39-1668-11ea-853c-42ced4c4a306 fits ed53d0dc-fd1c-41f9-a534-bf84c61c98c9 mwen4895 1.14.3 nbg-w8101 Processing 86% [Reconcile] True True False False 2/2 54m 7s
project still without firewall:
ID LAST EVENT WHEN AGE HOSTNAME PROJECT SIZE IMAGE PARTITION
00000000-beef-beef-0011-efbeaddeefbe Phoned Home 15s 49m 6s shoot--p7l8m...dfb8b55dc-f9zxg ed53d0dc-fd1c-41f9-a534-bf84c61c98c9 c1-xlarge-x86 Ubuntu 19.04 nbg-w8101
00000000-beef-beef-0001-efbeaddeefbe Phoned Home 17s 36m 50s shoot--p7l8m...dfb8b55dc-nfvv9 ed53d0dc-fd1c-41f9-a534-bf84c61c98c9 c1-xlarge-x86 Ubuntu 19.04 nbg-w8101
For automatic deletion of jobs there is an alpha feature, ttlSecondsAfterFinished. Please check whether you can activate this feature.
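A minimal sketch of a Job using ttlSecondsAfterFinished (the job name and image are placeholders; the TTLAfterFinished feature gate has to be enabled while the feature is alpha):

apiVersion: batch/v1
kind: Job
metadata:
  name: example-job                  # placeholder
spec:
  ttlSecondsAfterFinished: 120       # delete the Job and its pods 2 minutes after it finishes
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: task
        image: busybox               # placeholder
        command: ["sh", "-c", "echo done"]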
Feb 11 19:36:24 shoot--p5mlcn--<masked>-firewall-eeeb8 ip[1412]: 2021/02/11 19:36:24 unable to send dropentry:rpc error: code = Unavailable desc = connection error: desc = "transport: authentication handshake failed: x509: certificate relies on legacy Common Name field, use SANs or temporarily enable Common Name matching with GODEBUG=x509ignoreCN=0"
k logs -n firewall droptailer-667bdd5598-rmrm7 -f
2021/02/11 18:35:22 listening on 50051
It now seems possible to validate the shoot spec with regard to our provider-specific config; we should do that as well:
Not everyone will need our accounting solution, group rolebinding controller, and authn webhook. The same will be the case for the Splunk webhook that we will introduce.
We should add switches to the controller to enable or disable the deployment of these controllers and webhooks. The switches should be settable in the deployment of the gardener extension controller.
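A rough sketch of how such switches could look in the values of the controller deployment; all keys below are hypothetical and would have to match whatever the charts end up using:

# hypothetical values switches for optional components
accountingExporter:
  enabled: false
groupRolebindingController:
  enabled: true
authnWebhook:
  enabled: true
splunkAuditWebhook:
  enabled: false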
Once oot-mcm (the out-of-tree machine-controller-manager) is configured on all shoots.
/cc @mwennrich @Gerrit91
Temporarily, we were not able to retrieve project information via the metal-api, so we could not configure the accounting exporter. For that reason we added the cloud-go dependency to this project. Retrieving project information is now possible again via metal-go, so we can remove the cloud-go dependency.
For end users it would improve the experience if the calico networking config for the metal provider were defaulted. This can be achieved with a mutating webhook on the shoot resource that defaults the required values if the networking provider config is left unconfigured by the user.
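As an illustration only, the webhook could inject something like the following when the user leaves the networking providerConfig empty; the concrete defaults still have to be defined by us, and the backend value below is purely an assumption:

spec:
  networking:
    type: calico
    providerConfig:
      apiVersion: calico.networking.extensions.gardener.cloud/v1alpha1
      kind: NetworkConfig
      backend: none                  # assumed default, to be decided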
For every reconcile we query all our metal networks, which is quite an expensive call.
To enhance validations, reduce the complexity of the shoot spec and simplify the integration of the metal provider into the Gardener dashboard, we need to provide some provider config into the cloud profile:
With the latest changes, the iamconfig is part of the CloudProfileConfig. I assumed that you can now leave controlPlaneConfig in Shoot resources empty, but then the ControlPlane controller complains with:
lastOperation:
  description: 'Error reconciling controlplane: provider config is not set on
    the control plane resource'
After specifying
controlPlaneConfig:
  apiVersion: metal.provider.extensions.gardener.cloud/v1alpha1
  kind: ControlPlaneConfig
it could successfully reconcile. It would be nicer if controlPlaneConfig were completely optional now.
This is already done in gardener as well; see gardener/gardener#3739 for reference.
In the extension provider there is a reference here: gardener/gardener-extension-provider-aws#316
Currently, the docker image and tag are hard-coded into the shoot-control-plane charts.
We should make them configurable and add the images to the image vector so that we can override them from our controller registrations.
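As a sketch, an entry in the chart image vector (charts/images.yaml) could look roughly like this; the concrete image name, repository, and tag are placeholders:

images:
- name: some-control-plane-component                            # placeholder
  sourceRepository: github.com/metal-stack/example-component    # placeholder
  repository: ghcr.io/metal-stack/some-control-plane-component  # placeholder
  tag: "v0.1.0"                                                 # placeholder, overridable via controller registration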
It should be listed explicitly which verbs on pods are allowed for csi-lvm, instead of using the wildcard (*).
This is actually set here: https://github.com/metal-stack/gardener-extension-provider-metal/blob/master/charts/internal/shoot-storageclasses/templates/storageclasses.yaml
I propose instead to limit this to: create, delete, get, list, patch, update, watch
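A sketch of the tightened RBAC rule with the proposed verbs (whether it lives in a Role or ClusterRole depends on the existing chart):

rules:
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["create", "delete", "get", "list", "patch", "update", "watch"]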
The firewall object is not allowed to be changed by end users.
The new version of the authn-webhook needs read-access to the metal-api to read the tenant list.
Introduce and fill these new env-variables:
We must check our whole code base for missing DNSNames, like it was fixed for the droptailer certificate in #136.
1s Warning Failed pod/csi-lvm-controller-759f7b6c6b-dgp6v Failed to pull image "metalstack/csi-lvm-controller:v0.6.1": rpc error: code = Unknown desc = Error response from daemon: toomanyrequests: You have reached your pull rate limit. You may increase the limit by authenticating and upgrading: https://www.docker.com/increase-rate-limit
It would be nice if the status of the firewall contributed to the overall shoot health through the health controller.
Not yet sure how this can happen, but it can happen when updating to a new firewall. We ended up with four firewalls in one cluster after an update.
Follow-up of #136.
Currently, this is always enabled, and users have to pin explicit image versions if they do not want auto-updates.
Since the beginning of the project we have had test files copied over from another reference implementation of an extension controller. We should either make them work or remove them from the project, and then run the tests regularly in CI.
It does not look like any action is required, because we don't use the Layer 2 features.
Once all firewalls in production are up to date, we should remove all remaining configuration for the deprecated firewall-policy-controller.
./metal-stack/gardener-extension-provider-metal/pkg/controller/infrastructure/machine.go
9: return fmt.Sprintf("metal:///%s/%s", partition, machineID)
./gardener-extension-provider-metal/pkg/controller/infrastructure/machine.go
9: return fmt.Sprintf("metal:///%s/%s", partition, machineID)
Today, the URL to the MetalStack API is part of the cloudprovider secret: https://github.com/metal-stack/gardener-extension-provider-metal/blob/master/example/30-infrastructure.yaml#L11.
This requires every user to enter it into their secret, although it is always the same for one MetalStack environment.
It would be more convenient and improve the user experience if the operator could configure it in the CloudProfile instead, similar to the Keystone URL of OpenStack.
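A hypothetical extension of the CloudProfileConfig; the field name and value below are assumptions and do not exist in the current API:

apiVersion: metal.provider.extensions.gardener.cloud/v1alpha1
kind: CloudProfileConfig
metalAPIURL: https://api.metal.example.com   # assumed field name and value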
Steps:
gardener-extensions made huge improvements to the monitoring with gardener-attic/gardener-extensions#299 and gardener-attic/gardener-extensions#344.
We should do that for us as well.
Related to metal-stack/firewall-controller#67
At the moment we have the IAM configuration for our authn-webhook stored in the shoot spec.
Gardener designed the shoot resource to be user-facing (editable by the user), so the placement of the IAM Config in the shoot control plane provider config has some downsides:
(NamespaceMaxLength or ExcludedNamespaces)
Admittedly, we built our own API to hide the shoot resource from the dashboard, but I think it makes more sense to do things in line with the Gardener practices, because it will reduce the maintenance overhead of our API and improve the usability of our cloud for third parties.
#26 already aims to mitigate the problems of this configuration for the dashboard integration by allowing the IAM config to be defaulted within the cloud profile. Defaulting, though, will not be a final solution, as the IAM config will vary across different tenants in the near future.
Instead, the IAM configuration should be provided by the os-extension-provider-metal and not by our own API (just like we moved the reconciliation of the node network to this controller). This would actually be possible:
There are requests to let users define different OIDC backends per shoot in the future. We can think about leaving a special field in the control plane config for this and hiding the specific webhook configuration from the user in the same way as described above. So, if the user needs something special like that, we can say it's acceptable that the user has to add a small amount of configuration by hand to the shoot spec.
Dec 03 20:27:09 shoot--pcfgbt--gerrit-firewall-1afab ip[106868]: 2020/12/03 20:27:09 unable to send dropentry:rpc error: code = Unavailable desc = connection error: desc = "transport: authentication handshake failed: x509: certificate relies on legacy Common Name field, use SANs or temporarily enable Common Name matching with GODEBUG=x509ignoreCN=0"
Dec 03 20:27:12 shoot--pcfgbt--gerrit-firewall-1afab ip[106868]: 2020/12/03 20:27:12 unable to send dropentry:rpc error: code = Unavailable desc = connection error: desc = "transport: authentication handshake failed: x509: certificate relies on legacy Common Name field, use SANs or temporarily enable Common Name matching with GODEBUG=x509ignoreCN=0"
Seems to be related to Go 1.15: jaegertracing/jaeger#2435. This points to the real solution: golang/go#39568 (comment)
Adding
Environment=GODEBUG=x509ignoreCN=0
to droptailer.service immediately makes the connection work again.
I think we must change the way the certs are created in https://github.com/metal-stack/gardener-extension-provider-metal
@mwindower we must fix this with the next fw image version.
In the controllers we should use the common client context offered by Gardener, like:
This basically works already except for the infrastructure resource, which loses the status field. For this reason, the infrastructure controller assumes that there is no firewall and tries to create another one.
Instead, when the status field is empty, we should try to find an existing firewall for this cluster; if there is exactly one, we do nothing except update the infrastructure status:
$ k get infrastructure -o yaml
...
status:
  ...
  providerStatus:
    apiVersion: metal.provider.extensions.gardener.cloud/v1alpha1
    firewall:
      machineID: metal:///fra-equ01/00000000-0000-0000-0000-ac1f6bd390b2
      succeeded: true
  ...
Audit events from the kube-apiserver contain a field for the source IP that the requests came from. Example:
audittailer-768f964b78-t4hcs audittailer {"kind":"Event","apiVersion":"audit.k8s.io/v1","level":"Request","auditID":"39d36d5d-cae5-4b0c-8ef2-8dc8013f49d1","stage":"ResponseComplete","requestURI":"/api/v1/namespaces/default/pods?limit=500","verb":"list","user":{"username":"oidc:IZ00242","uid":"IZ00242","groups":["oidc:all-cadm","system:authenticated"]},"sourceIPs":["10.67.48.2"],"userAgent":"kubectl/v1.21.1 (linux/amd64) kubernetes/5e58841","objectRef":{"resource":"pods","namespace":"default","apiVersion":"v1"},"responseStatus":{"metadata":{},"code":200},"requestReceivedTimestamp":"2021-05-27T17:30:52.228925Z","stageTimestamp":"2021-05-27T17:30:52.231553Z","annotations":{"authorization.k8s.io/decision":"allow","authorization.k8s.io/reason":"RBAC: allowed by ClusterRoleBinding \"oidc-all-cadm\" of ClusterRole \"cluster-admin\" to Group \"oidc:all-cadm\""}}
Unfortunately, the "sourceIPs":["10.67.48.2"] is the node IP address of one of the nodes in the seed cluster. This seems to be the correct behaviour, since the apiserver is exposed as a service of type LoadBalancer with externalTrafficPolicy: Cluster.
From an audit point of view this is not ideal because it hides the real source address from which an event originated.
Changing the externalTrafficPolicy of the kube-apiserver service manually to Local fixes this temporarily, until the service gets reconciled again. Example audit event:
{"kind":"Event","apiVersion":"audit.k8s.io/v1","level":"Request","auditID":"952889bc-8879-43f6-9d91-e465cae3c76e","stage":"ResponseComplete","requestURI":"/api/v1/namespaces/audit/pods/audittailer-768f964b78-zg8jk/log","verb":"get","user":{"username":"oidc:IZ00242","uid":"IZ00242","groups":["oidc:all-cadm","system:authenticated"]},"sourceIPs":["95.117.118.243"],"userAgent":"kubectl/v1.21.1 (linux/amd64) kubernetes/5e58841","objectRef":{"resource":"pods","namespace":"audit","name":"audittailer-768f964b78-zg8jk","apiVersion":"v1","subresource":"log"},"responseStatus":{"metadata":{},"code":200},"requestReceivedTimestamp":"2021-05-27T17:45:49.837864Z","stageTimestamp":"2021-05-27T17:45:51.099244Z","annotations":{"authorization.k8s.io/decision":"allow","authorization.k8s.io/reason":"RBAC: allowed by ClusterRoleBinding \"oidc-all-cadm\" of ClusterRole \"cluster-admin\" to Group \"oidc:all-cadm\""}}
This seemed to have no ill effect on the cluster during the short time until the policy was reset, so I suggest we set the externalTrafficPolicy of the kube-apiserver to Local in this extension provider.
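For illustration, the relevant part of the kube-apiserver service would then look like this:

apiVersion: v1
kind: Service
metadata:
  name: kube-apiserver
spec:
  type: LoadBalancer
  externalTrafficPolicy: Local   # preserve the client source IP for audit events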
Add https://github.com/metal-stack/csi-driver-lvm to shoot-storageclasses.
Check if a csi-migration path from csi-lvm to csi-driver-lvm is possible.
gardener-extension-provider-metal-6cc85bb4bd-g97nx gardener-extension-provider-metal {"level":"info","ts":"2021-07-06T06:36:04.849Z","logger":"metal-controlplane-controller.metal-values-provider","msg":"skipping duros storage deployment because no storage configuration found for seed","seed":"prod"}
3m40s Warning FailedCreate replicaset/csi-lvm-controller-7bd679d598 Error creating: pods "csi-lvm-controller-7bd679d598-" is forbidden: pods with system-cluster-critical priorityClass is not permitted in csi-lvm namespace