humio / humio-operator
Kubernetes Operator for Humio
License: Apache License 2.0
Need to go through helm chart pods and add anything we’re missing to the pods created by the operator. Not sure if it fits here, but we also want to use groups rather than zone/region name in zookeeper prefix.
Right now we have fields InitServiceAccountName and AuthServiceAccountName on the Spec of HumioCluster, but we lack tests verifying that they work and documentation for how you would use them (and whether we require anything specific). During work done in #161 I added a test for at least one of them, but it started failing because of problems accessing the k8s secret holding the token.
So we need to figure out what is wrong, fix it, add the tests to validate it works, and also ensure we have some sort of documentation for what is expected of users who want to use an existing service account rather than relying on the operator to create it for them.
We should document how to leverage an existing service account and probably recommend this approach for production use cases. For development/sandbox clusters, I think it is fine to just let the operator create the service accounts.
We need to expose environment variables in the cluster spec and then merge those with the required env vars.
See #4 for more context.
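A rough sketch of what this could look like on the user side; the environmentVariables field name and merge behavior here are assumptions, not confirmed parts of the spec:
# Hypothetical HumioCluster snippet; environmentVariables is an assumed field name.
spec:
  environmentVariables:
    - name: HUMIO_JVM_ARGS
      value: "-Xms2g -Xmx4g"
    - name: PUBLIC_URL
      value: "https://humio.example.com"
The operator would append these to the env vars it always sets, with the required env vars taking precedence on conflicts.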
There is at least one case where we need to roll the cluster following a change. See #116. If we update the annotations on the service account, then we need to recreate all the humio pods that use the service account.
Ideally we'd have a way to roll the cluster as well as perform non-rolling restarts in cases where those are required, e.g. when updating the humio version.
The Humio cluster spec uses the gp2 storage class, as defined below:
....
dataVolumePersistentVolumeClaimSpecTemplate:
  storageClassName: gp2
  accessModes: [ReadWriteOnce]
  resources:
    requests:
      storage: 500Gi
...
AWS EKS version: 1.17
EBS driver plugin version:
amazon/aws-ebs-csi-driver:v0.4.0
Storage classes on the cluster:
NAME PROVISIONER RECLAIMPOLICY VOLUMEBINDINGMODE ALLOWVOLUMEEXPANSION AGE
cloud-storage ebs.csi.aws.com Delete WaitForFirstConsumer true 36d
gp2 (default) kubernetes.io/aws-ebs Delete WaitForFirstConsumer false 48d
When the humio pod is spinning up, it creates 2 volumes on the AWS console with the name kubernetes-dynamic-pvc-d4a79a5f-dc0b-40af-b903-04e9fc2d84cf; however, only one is in the 'In-use' status and the other one is always in the 'Available' status as seen in AWS. The pod then goes into a CrashLoopBackOff due to a FailedAttachVolume warning, or in some cases 1 out of 3 pods is always stuck in the Init phase.
ubuntu@ip-10-150-1-220:/tmp/humio-operator/oauth$ kch
portal-humio-core-alpnzq 1/2 CrashLoopBackOff 6 12m
humio-operator-5559ddb85-nzcql 1/1 Running 0 6h2m
48s Normal Provisioning persistentvolumeclaim/portal-humio-core-gtydvy External provisioner is provisioning volume for claim "cloudops/portal-humio-core-gtydvy"
48s Normal WaitForFirstConsumer persistentvolumeclaim/portal-humio-core-gtydvy waiting for first consumer to be created before binding
14s Normal WaitForFirstConsumer persistentvolumeclaim/portal-humio-core-lgxuum waiting for first consumer to be created before binding
14s Normal WaitForFirstConsumer persistentvolumeclaim/portal-humio-core-nwmjpk waiting for first consumer to be created before binding
43s Normal ProvisioningSucceeded persistentvolumeclaim/portal-humio-core-gtydvy Successfully provisioned volume pvc-d4a79a5f-dc0b-40af-b903-04e9fc2d84cf using kubernetes.io/aws-ebs
3m34s Normal Scheduled pod/portal-humio-core-alpnzq Successfully assigned cloudops/portal-humio-core-alpnzq to ip-10-150-12-138.us-west-2.compute.internal
3m29s Normal SuccessfulAttachVolume pod/portal-humio-core-alpnzq AttachVolume.Attach succeeded for volume "pvc-d4a79a5f-dc0b-40af-b903-04e9fc2d84cf"
3m6s Normal Started pod/portal-humio-core-alpnzq Started container zookeeper-prefix
3m6s Normal Created pod/portal-humio-core-alpnzq Created container zookeeper-prefix
3m6s Normal Pulled pod/portal-humio-core-alpnzq Container image "humio/humio-operator-helper:0.0.7" already present on machine
3m4s Normal Started pod/portal-humio-core-alpnzq Started container auth
70s Normal Created pod/portal-humio-core-alpnzq Created container humio
70s Normal Started pod/portal-humio-core-alpnzq Started container humio
70s Normal Pulled pod/portal-humio-core-alpnzq Container image "humio/humio-core:1.13.4" already present on machine
3m4s Normal Created pod/portal-humio-core-alpnzq Created container auth
3m4s Normal Pulled pod/portal-humio-core-alpnzq Container image "humio/humio-operator-helper:0.0.7" already present on machine
51s Warning FailedAttachVolume pod/portal-humio-core-alpnzq AttachVolume.Attach failed for volume "pvc-d4a79a5f-dc0b-40af-b903-04e9fc2d84cf" : volume is still being detached from the node
44s Warning BackOff pod/portal-humio-core-alpnzq Back-off restarting failed container
The auth container logs
ubuntu@ip-10-150-1-220:/tmp/humio-operator/oauth$ kc logs portal-humio-core-alpnzq -c auth
Starting humio-operator-helper version 0.0.7
waiting on files /data/humio-data/local-admin-token.txt, /data/humio-data/global-data-snapshot.json
waiting on files /data/humio-data/local-admin-token.txt, /data/humio-data/global-data-snapshot.json
waiting on files /data/humio-data/local-admin-token.txt, /data/humio-data/global-data-snapshot.json
Create a configmap with extra kafka options and mount it to the humio pods. Similar to what is done in humio/humio-helm-charts@b399cd5.
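A rough sketch of how the user-facing side might look; the extraKafkaConfigs field name and the properties shown are assumptions modeled on the helm chart approach:
# Hypothetical HumioCluster snippet; field name and mount behavior are assumptions.
spec:
  extraKafkaConfigs: |
    security.protocol=SSL
    ssl.truststore.location=/var/lib/kafka/truststore.jks
The operator would render the value into a configmap, mount it into the humio pods, and point Humio at the mounted file (e.g. via EXTRA_KAFKA_CONFIGS_FILE).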
The problem is that if a HumioCluster resource is initially configured in a way that prevents a humio pod from bootstrapping (i.e. the pod gets into CrashLoopBackOff), the controller continually waits for the pod to become ready (which it never does), even after the HumioCluster resource is fixed in a way that should allow the pod to start.
This is because the bootstrap portion of the control loop happens before the mismatched pods deletion logic. The pod deletion logic should happen either before the bootstrap logic, or the bootstrap logic should have some timeout or way to recognize and ignore pods that are crashing.
Just SAML or SAML and OAuth?
When ingress is enabled but TLS is disabled, the logic here https://github.com/humio/humio-operator/blob/master/pkg/controller/humiocluster/humiocluster_controller.go#L1183-L1191 causes the ingress certs to be deleted. This doesn't happen once TLS is enabled.
We should adjust the logic to limit the cleanup job to the TLS certs for the nodes only.
When setting the service account annotations required to use IAM for pod auth via oidc, such as:
humioServiceAccountAnnotations:
  eks.amazonaws.com/role-arn: "arn:aws:iam::account_id:role/role_name"
And then adding a cluster, the service account and annotations appear to be set appropriately; however, the humio pods fail to start due to:
ERROR c.h.m.ServerRunner$ - Got exception starting pid=1 com.amazonaws.services.securitytoken.model.AWSSecurityTokenServiceException: Not authorized to perform sts:AssumeRoleWithWebIdentity (Service: AWSSecurityTokenService; Status Code: 403; Error Code: AccessDenied; Request ID: c59017b1-8ce2-4afe-8815-7d7c9bae0f86; Proxy: null)
If instead I use humioServiceAccountName: humio, where humio is the service account created by the helm chart, it works.
Expose the service type for the service resource that is created by the operator. Currently this is hard-coded to ClusterIP.
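A minimal sketch of how this could be exposed; humioServiceType is a hypothetical field name used here only for illustration:
# Hypothetical HumioCluster snippet; humioServiceType is an assumed field name.
spec:
  humioServiceType: LoadBalancer   # falls back to ClusterIP when unset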
As part of #5 we now support using emptyDir and hostPath volumes, but this does not make it possible to leverage persistent volume types such as EBS volumes.
The main issue is that we currently pass the given volume source directly to the volumes in the pod specs. In order to support persistent volume types we would need to create PersistentVolumeClaims, one for each pod we create. When creating pods, we can then pick a random unbound PVC and mount that in.
Original context here: #4
The main question is: can we make humio zone aware and then move the functionality of balancing partitions across zones (and ensuring replicas are in different AZs) into humio core? This is a question that must be raised with core. I think we can probably follow the init container pattern that is used in the helm charts to get the availability zone from the host where the pod is scheduled, but there would obviously still be code changes required in core.
The next shorter-term solution that was suggested is to move the balancing functionality into the cli/client api. This way the operator can use it but it can also be used by anyone not running the operator.
The open question I have is: should we put zone awareness in the operator and then migrate it to the cli, or start by adding it to the cli first?
helm install the SCC and update the users list in the SecurityContextConstraint to include service accounts as needed: #99
We currently use it in these places:
Will this be informational to the user only or will the controller use the state to determine actions? It may be useful to rely on these states in some cases, e.g. allow us to not do operations if the cluster is in a bad state.
Possible cluster states (examples): Bootstrapping, Running, Shrinking, Expanding, Upgrading, Degraded, MissingSegments, Unavailable.
Maybe auto-detect if it is openshift? Or, just use a flag to enable it similar to nginx ingress?
I've noticed that waitForNewPod() seems to be failing when being called from ensurePodsExist() when creating new pods. Somehow we keep listing pods but the number of pods never increases, so we end up timing out.
Not sure if this is only when doing make test, but it is a significant slowdown of tests, and is why the for loop in waitForNewPod() only waits 3 seconds before timing out compared to the previous 30 seconds.
We should start with at least hostPath. But as per Mike's comments here #4, we may also want support for emptyDir{} for local testing.
Additionally, we should expect that we will extend this support to PVCs, etc., so it should be designed for extensibility/flexibility. Following the Kubernetes volume spec may be one way to achieve this.
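Following the core/v1 volume source shape would let users choose any supported type; a sketch, assuming a field named dataVolumeSource that is passed through to the pod spec:
# Hypothetical HumioCluster snippets; dataVolumeSource is an assumed field name.
spec:
  dataVolumeSource:
    hostPath:
      path: /mnt/disks/humio
      type: Directory
# or, for local testing:
spec:
  dataVolumeSource:
    emptyDir: {}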
We need to be able to mount the gcp storage account json file like the helm chart does: https://github.com/humio/humio-helm-charts/blob/master/charts/humio-core/templates/statefulset.yaml#L173-L178
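A sketch of what the resulting mount could look like on the humio container; the secret name, key, and mount path below are assumptions, not taken from the helm chart:
# Illustrative pod spec fragment; secret name, key, and mount path are assumptions.
volumes:
  - name: gcp-storage-account
    secret:
      secretName: humio-gcp-storage-account
containers:
  - name: humio
    volumeMounts:
      - name: gcp-storage-account
        mountPath: /var/lib/humio/gcp-storage-account.json
        subPath: gcp-storage-account.json
        readOnly: true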
Humio doesn't have service accounts, so in order to grant the operator access to the cluster, we need to start up in single-user mode, set a developer user password and store it as a secret, and then login and get the token.
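A sketch of the single-user bootstrap configuration; the env var names follow Humio's single-user setup, while the secret name is an assumption:
# Sketch only; the secret name is an assumption.
env:
  - name: AUTHENTICATION_METHOD
    value: single-user
  - name: SINGLE_USER_PASSWORD
    valueFrom:
      secretKeyRef:
        name: humio-admin-password   # generated and stored by the operator
        key: password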
Should we use zookeeper and kafka operators rather than the cp-helm-charts for running locally or our e2e tests? If so, we should update our readme and scripts accordingly.
Some users still rely on the older RBAC implementation via JSON. Add support to the operator for starting Humio pods with a configmap that uses extraVolumes and extraVolumeMounts, similar to the extraKafkaConfigs option.
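A sketch of what this could look like; the volume/configmap names and mount path are assumptions used only for illustration:
# Hypothetical HumioCluster snippet; names and mount path are assumptions.
spec:
  extraVolumes:
    - name: rbac-permissions
      configMap:
        name: humio-rbac-permissions   # holds the legacy JSON RBAC file
  extraVolumeMounts:
    - name: rbac-permissions
      mountPath: /existing-rbac
      readOnly: true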
Right now we have our init container figure out what zone the pod was deployed in, and then it passes that on to Humio using a shared volume which is read by the Humio container before starting up Humio itself.
We should be able to do something similar to what I describe above to set the ZONE configuration option, so we can leverage the new functionality in Humio 1.15+ that can suggest partition layouts for storage and digest partitions.
With the recent merge of #174 we now set ZONE, but we still need to set the replication factor options to leverage suggested partition layouts.
Remove the int suffix from pod names; use generateName instead
Make labels consistent. Use the same label method to generate and list pods
Goal is to:
charts/humio-operator/Chart.yaml
version/version.go
Ability to add or remove instances and migrate partitions as needed
Use standard AZ annotations
Rack awareness
Ingest
Storage and bucket storage
Once a HumioCluster is created along with a service account, that service account cannot be updated by adding humioServiceAccountAnnotations. Additionally, deleting the service account manually is not enough, as the secret also needs to be updated.
Should we have an ensureMismatchedServiceAccountIsDeleted for the case where a user adds humioServiceAccountAnnotations to the HumioCluster spec after the initial creation? If we do this, we'd also need to re-create the secret.
You can specify the name of the service account(s) to use when stitching together a HumioCluster resource. If not specified the operator will create them automatically.
We should document how to leverage an existing service account and probably recommend this approach for production use cases. For development/sandbox clusters, I think it is fine to just let the operator create the service accounts.
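For example (the field names mirror the service account fields mentioned above; the values are illustrative):
# Illustrative values only.
spec:
  humioServiceAccountName: humio
  initServiceAccountName: humio-init
  authServiceAccountName: humio-auth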
Create an OLM package, test the installation, update the readme with install instructions, and submit to operatorhub.io.
Humio supports oauth, and the operator can support this via environment variables that are exposed in the cluster spec. See https://docs.humio.com/cluster-management/security/oauth.
The part that we do not have is a way to use dynamic oidc registration as per https://openid.net/specs/openid-connect-registration-1_0.html.
We'll need to determine either a generic or plug-able solution to this, such that either we have one solution that would work with all oidc providers, or have a convention such as running a separate service, job, etc that could be customized in order to support any provider.
Here are a few examples:
https://www.ibm.com/support/knowledgecenter/SSFC4F_1.2.0/iam/3.4.0/auth_onboard.html
https://developer.okta.com/docs/reference/api/oauth-clients/
https://auth0.com/docs/api-auth/dynamic-client-registration
The log format is pretty bad right now and we use at least two different loggers. We should decide on one and stick to it, and also improve the format.
For the alpha release, the controller should be able to update humio in a way that takes all the pods down and then brings them back up on the new version.
We need a table in the readme that matches the supported operator version with the version of humio. The major and minor versions should be locked in. We should have a way in the operator to error and not run in the case where the major and minor versions of the operator and humio do not match.
For patch updates to humio, we should eventually be able to do rolling updates. But we need core to commit that patch updates will not contain schema changes.
We should attach a PodDisruptionBudget to the humio pods: https://kubernetes.io/docs/tasks/run-application/configure-pdb.
How the PDB is configured may depend on how many storage/digest replicas are configured.
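A minimal sketch of such a PDB; the selector labels and minAvailable value are assumptions and would need to match the labels the operator sets on the humio pods:
# Sketch only; selector labels and minAvailable are assumptions.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: example-humiocluster-pdb
spec:
  minAvailable: 2          # could be derived from the storage/digest replication factors
  selector:
    matchLabels:
      humio_cr: example-humiocluster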
I would like to access my humio URL using a different URL prefix. Changing the path in the humio operator didn't work, but this is probably possible using nginx rewrite rules.
The nginx.ingress.kubernetes.io/rewrite-target annotation can be used to achieve the /logs path; documentation for that annotation is available here: https://kubernetes.github.io/ingress-nginx/examples/rewrite/
Issue to track either documenting the rewrite rule or adding support for the path in the Humio Operator. Preferably, being able to do this via the path in the Humio operator would be ideal.
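For reference, the documented rewrite approach looks roughly like this; the host, path, service name, and port are illustrative:
# Illustrative ingress using the rewrite-target annotation from the linked docs.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: humio-logs
  annotations:
    nginx.ingress.kubernetes.io/rewrite-target: /$2
spec:
  ingressClassName: nginx
  rules:
    - host: example.com
      http:
        paths:
          - path: /logs(/|$)(.*)
            pathType: ImplementationSpecific
            backend:
              service:
                name: example-humiocluster
                port:
                  number: 8080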
Do we run the nginx ingress controller by default and use our existing config in US Cloud?
We should have at least one e2e test and some additional scenarios for the reconciler