
humio-operator's Issues

Parity with helm chart pods

Need to go through the helm chart pods and add anything we're missing to the pods created by the operator. Not sure if it fits here, but we also want to use groups rather than the zone/region name in the Zookeeper prefix.

Documentation and tests for using existing service accounts

Right now we have the fields InitServiceAccountName and AuthServiceAccountName on the Spec of HumioCluster, but we are lacking tests that they work and documentation for how you would use them (and whether we require anything specific). During the work done in #161 I added a test for at least one of them, but it started failing because of problems accessing the k8s secret holding the token.

So we need to figure out what is wrong, fix it, add tests to validate it works, and ensure we have some documentation for what is expected from users who want to use an existing service account rather than relying on the operator to create it for them.

We should document how to leverage an existing service account and probably recommend this approach for production use cases. For development/sandbox clusters, I think it is fine to just let the operator create the service accounts.
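
A minimal sketch of what using pre-created service accounts could look like in the resource, assuming the camelCase JSON field names that correspond to the Go fields above (the field names and example account names are assumptions):

apiVersion: core.humio.com/v1alpha1
kind: HumioCluster
metadata:
  name: example-humiocluster
spec:
  # Service accounts created ahead of time by the cluster administrator.
  # Field names assume camelCase JSON tags for InitServiceAccountName and
  # AuthServiceAccountName; the account names are illustrative.
  initServiceAccountName: humio-init
  authServiceAccountName: humio-auth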

Expose environment variables

We need to expose environment variables in the cluster spec and then merge those with the required env vars.

See #4 for more context.
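
A minimal sketch of how this could look in the HumioCluster spec, assuming a field named environmentVariables that takes standard Kubernetes env var entries (the field name and the specific variables shown are assumptions):

spec:
  # User-supplied variables; the operator would merge these with the
  # env vars it always sets.
  environmentVariables:
    - name: PUBLIC_URL
      value: "https://humio.example.com"
    - name: HUMIO_JVM_ARGS
      value: "-Xss2m -Xms256m -Xmx1536m"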

Add option to restart humio pods

There is at least one case where we need to roll the cluster following a change. See #116. If we update the annotations on the service account, then we need to recreate all the humio pods that use the service account.

Ideally we'd have a way to roll the cluster as well as perform non-rolling restarts in cases where those are required, for example by a Humio version update.

Two persistent volumes get created for each pod on AWS

The Humio cluster spec uses the gp2 storage class as defined below:

....
  dataVolumePersistentVolumeClaimSpecTemplate:
    storageClassName: gp2
    accessModes: [ReadWriteOnce]
    resources:
      requests:
        storage: 500Gi
...

AWS EKS version: 1.17

EBS driver plugin version:
amazon/aws-ebs-csi-driver:v0.4.0

Storage classes on the cluster:

NAME            PROVISIONER             RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION   AGE
cloud-storage   ebs.csi.aws.com         Delete          WaitForFirstConsumer   true                   36d
gp2 (default)   kubernetes.io/aws-ebs   Delete          WaitForFirstConsumer   false                  48d

When the Humio pod is spinning up, it creates 2 volumes in the AWS console with the name kubernetes-dynamic-pvc-d4a79a5f-dc0b-40af-b903-04e9fc2d84cf. However, only one is in the 'In-use' status; the other one always stays in the 'Available' status as seen in AWS. The pod then goes into a CrashLoopBackOff due to a FailedAttachVolume warning, or in some cases 1 out of 3 pods is always stuck in the Init phase.

ubuntu@ip-10-150-1-220:/tmp/humio-operator/oauth$ kch
portal-humio-core-alpnzq                            1/2     CrashLoopBackOff   6          12m
humio-operator-5559ddb85-nzcql                         1/1     Running            0          6h2m
48s         Normal    Provisioning             persistentvolumeclaim/portal-humio-core-gtydvy   External provisioner is provisioning volume for claim "cloudops/portal-humio-core-gtydvy"
48s         Normal    WaitForFirstConsumer     persistentvolumeclaim/portal-humio-core-gtydvy   waiting for first consumer to be created before binding
14s         Normal    WaitForFirstConsumer     persistentvolumeclaim/portal-humio-core-lgxuum   waiting for first consumer to be created before binding
14s         Normal    WaitForFirstConsumer     persistentvolumeclaim/portal-humio-core-nwmjpk   waiting for first consumer to be created before binding
43s         Normal    ProvisioningSucceeded    persistentvolumeclaim/portal-humio-core-gtydvy   Successfully provisioned volume pvc-d4a79a5f-dc0b-40af-b903-04e9fc2d84cf using kubernetes.io/aws-ebs
3m34s       Normal    Scheduled                pod/portal-humio-core-alpnzq                     Successfully assigned cloudops/portal-humio-core-alpnzq to ip-10-150-12-138.us-west-2.compute.internal
3m29s       Normal    SuccessfulAttachVolume   pod/portal-humio-core-alpnzq                     AttachVolume.Attach succeeded for volume "pvc-d4a79a5f-dc0b-40af-b903-04e9fc2d84cf"
3m6s        Normal    Started                  pod/portal-humio-core-alpnzq                     Started container zookeeper-prefix
3m6s        Normal    Created                  pod/portal-humio-core-alpnzq                     Created container zookeeper-prefix
3m6s        Normal    Pulled                   pod/portal-humio-core-alpnzq                     Container image "humio/humio-operator-helper:0.0.7" already present on machine
3m4s        Normal    Started                  pod/portal-humio-core-alpnzq                     Started container auth
70s         Normal    Created                  pod/portal-humio-core-alpnzq                     Created container humio
70s         Normal    Started                  pod/portal-humio-core-alpnzq                     Started container humio
70s         Normal    Pulled                   pod/portal-humio-core-alpnzq                     Container image "humio/humio-core:1.13.4" already present on machine
3m4s        Normal    Created                  pod/portal-humio-core-alpnzq                     Created container auth
3m4s        Normal    Pulled                   pod/portal-humio-core-alpnzq                     Container image "humio/humio-operator-helper:0.0.7" already present on machine
51s         Warning   FailedAttachVolume       pod/portal-humio-core-alpnzq                     AttachVolume.Attach failed for volume "pvc-d4a79a5f-dc0b-40af-b903-04e9fc2d84cf" : volume is still being detached from the node
44s         Warning   BackOff                  pod/portal-humio-core-alpnzq                     Back-off restarting failed container

The auth container logs:

ubuntu@ip-10-150-1-220:/tmp/humio-operator/oauth$ kc logs portal-humio-core-alpnzq -c auth
Starting humio-operator-helper version 0.0.7
waiting on files /data/humio-data/local-admin-token.txt, /data/humio-data/global-data-snapshot.json
waiting on files /data/humio-data/local-admin-token.txt, /data/humio-data/global-data-snapshot.json
waiting on files /data/humio-data/local-admin-token.txt, /data/humio-data/global-data-snapshot.json

Pods that are initially configured incorrectly require manual intervention to fix

The problem is that if a HumioCluster resource is initially configured in a way that prevents a Humio pod from bootstrapping (i.e. the pod gets into a CrashLoopBackOff), the controller continually waits for the pod to become ready (which it never does), even if the HumioCluster resource is later fixed in a way that should allow the pod to start.

This is because the bootstrap portion of the control loop happens before the mismatched-pod deletion logic. The pod deletion logic should happen before the bootstrap logic, or the bootstrap logic should have a timeout or some way to recognize and ignore pods that are crashing.

IAM auth via oidc and the service account created by the operator does not appear to work

When setting the service account annotations required to use IAM for pod auth via oidc, such as:

humioServiceAccountAnnotations:
  eks.amazonaws.com/role-arn: "arn:aws:iam::account_id:role/role_name"

and then adding a cluster, the service account and annotations appear to be set appropriately; however, the Humio pods fail to start due to:

ERROR c.h.m.ServerRunner$ - Got exception starting pid=1 com.amazonaws.services.securitytoken.model.AWSSecurityTokenServiceException: Not authorized to perform sts:AssumeRoleWithWebIdentity (Service: AWSSecurityTokenService; Status Code: 403; Error Code: AccessDenied; Request ID: c59017b1-8ce2-4afe-8815-7d7c9bae0f86; Proxy: null)

If I instead use humioServiceAccountName: humio, where humio is the service account created by the helm chart, it works.

Support persistent data volumes

As part of #5 we now support using emptyDir and hostPath volumes, but this does not make it possible to leverage persistent volume types such as EBS volumes.

The main issue is that we currently pass the given volume source directly into the volumes in the pod spec. In order to support persistent volume types, we would need to create PersistentVolumeClaims, one for each pod we create. When creating pods, we can then pick a random unbound PVC and mount it in.

Balance partitions across zones

Original context here: #4

The main question is whether we can make Humio zone aware and then move the functionality of balancing partitions across zones (and ensuring replicas are in different AZs) into Humio core. This is a question that must be raised with core. I think we can probably follow the init container pattern that is used in the helm charts to get the availability zone from the host where the pod is scheduled, but there would obviously still be code changes required in core.

The next, shorter-term solution that was suggested is to move the balancing functionality into the CLI/client API. This way the operator can use it, but it can also be used by anyone not running the operator.

The open question I have is whether we should put zone awareness in the operator and then migrate it to the CLI, or start by adding it to the CLI first.

Ensure operator can manage multiple clusters

  • Ensure service account names don't conflict if the operator manages service accounts and two HumioClusters are deployed in the same namespace: #97
  • Ensure the operator can watch and manage HumioCluster CRs in multiple namespaces: #98
  • Install a per-helm-install SCC and update the users list in the SecurityContextConstraint to include service accounts as needed: #99

Replace use of image humio/strix:latest

We currently use it in these places:

  • initContainer: uses kubectl to grab AZ information and pass it on as the Zookeeper prefix to Humio (a sketch of this pattern follows the list)
  • sidecar: uses the local admin token to create a new admin user, grabs the API token for the new admin user from global, and stores it in a k8s secret so the humio-operator can use it
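
A rough sketch of the init container pattern referenced in the first bullet, assuming the node name is injected via the downward API and the zone is read from the node's topology label (the label key, shared paths, and helper image tag are either taken from the surrounding context or illustrative):

initContainers:
  - name: zookeeper-prefix
    image: humio/humio-operator-helper:0.0.7  # candidate replacement for humio/strix:latest
    env:
      - name: NODE_NAME
        valueFrom:
          fieldRef:
            fieldPath: spec.nodeName
    command:
      - sh
      - -c
      # Look up the zone label on the node the pod landed on and hand it to
      # the Humio container through a shared volume.
      - >
        kubectl get node "$NODE_NAME"
        -o jsonpath='{.metadata.labels.topology\.kubernetes\.io/zone}'
        > /shared/zookeeper-prefix
    volumeMounts:
      - name: shared
        mountPath: /shared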

Different cluster states

Will this be informational to the user only, or will the controller use the state to determine actions? It may be useful to rely on these states in some cases, e.g. to avoid performing operations while the cluster is in a bad state.

Possible cluster states (examples): Bootstrapping, Running, Shrinking, Expanding, Upgrading, Degraded, MissingSegments, Unavailable.

Timeout in waitForNewPod during tests

I've noticed that waitForNewPod() seems to be failing when being called from ensurePodsExist() when creating new pods. Somehow we keep listing pods but the number of pods never increases, so we end up timing out.

Not sure if this only happens when doing make test, but it is a significant slowdown of the tests, and is why the for loop in waitForNewPod() now only waits 3 seconds before timing out, compared to the previous 30 seconds.

Support data volumes

We should start with at least hostPath, but as per Mike's comments in #4, we may also want support for emptyDir: {} for local testing.

Additionally, we should expect to extend this support to PVCs, etc., so it should be designed for extensibility/flexibility. Following the Kubernetes volume spec may be one way to obtain this.
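
A minimal sketch of what following the Kubernetes volume spec could look like, assuming the HumioCluster spec exposes a field (called dataVolumeSource here, which is an assumption) that accepts a standard core VolumeSource:

spec:
  # Any core VolumeSource should be accepted here, which leaves room for
  # adding PVC support later without changing the field.
  dataVolumeSource:
    hostPath:
      path: /mnt/humio-data
      type: DirectoryOrCreate
  # ...or, for local testing:
  # dataVolumeSource:
  #   emptyDir: {}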

Auth within the cluster

Humio doesn't have service accounts, so in order to grant the operator access to the cluster, we need to start up in single-user mode, set a developer user password and store it as a secret, and then log in and get the token.
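
A minimal sketch of that bootstrap, assuming Humio's single-user mode is driven by the AUTHENTICATION_METHOD and SINGLE_USER_PASSWORD settings and that the developer password lives in a Kubernetes Secret (the secret name and key are illustrative):

apiVersion: v1
kind: Secret
metadata:
  name: developer-user-password
type: Opaque
stringData:
  password: "change-me"
---
# Environment for the Humio container; after startup, the operator logs in
# with this password and stores the resulting API token for its own use.
env:
  - name: AUTHENTICATION_METHOD
    value: single-user
  - name: SINGLE_USER_PASSWORD
    valueFrom:
      secretKeyRef:
        name: developer-user-password
        key: password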

Leverage Humio's suggested partition layouts

Right now we have our init container figure out what zone the pod was deployed in, and it then passes that on to Humio using a shared volume, which is read by the Humio container before starting Humio itself.

We should be able to do something similar to what I describe above to set the ZONE configuration option, so we can leverage the new functionality in Humio 1.15+ that can suggest partition layouts for storage and digest partitions.

With the recent merge of #174 we now set ZONE, but still need to set the replication factor options to leverage suggested partition layouts.

Adjust release process

The goal is to:

  • Release a new helm chart when the chart version changes in charts/humio-operator/Chart.yaml (a workflow sketch follows this list)
  • Release new operator container images when the version changes in version/version.go
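
One possible way to wire this up, sketched as a GitHub Actions push trigger; the workflow layout, branch name, and job contents are assumptions, and only the two watched paths come from the list above:

# .github/workflows/chart-release.yaml (illustrative)
on:
  push:
    branches: [master]
    paths:
      - charts/humio-operator/Chart.yaml  # publish a new helm chart release
jobs:
  release-chart:
    runs-on: ubuntu-latest
    steps:
      # Placeholder for the actual chart packaging/publishing steps.
      - uses: actions/checkout@v2
# A second workflow would watch version/version.go the same way and
# build/push new operator container images when it changes.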

Cluster Scaling

Ability to add or remove instances and migrate partitions as needed

Service account not updated when humioServiceAccountAnnotations are added

Once a HumioCluster is created along with a service account, that service account cannot be updated by adding humioServiceAccountAnnotations. Additionally, deleting the service account manually is not enough as the secret also needs to be updated.

Should we have an ensureMismatchedServiceAccountIsDeleted in the case where a user adds humioServiceAccountAnnotations to the HumioCluster spec after the initial creation? If we do this, we'd also need to re-create the secret.

Document how to use an existing service account

You can specify the name of the service account(s) to use when stitching together a HumioCluster resource. If not specified, the operator will create them automatically.

We should document how to leverage an existing service account and probably recommend this approach for production use cases. For development/sandbox clusters, I think it is fine to just let the operator create the service accounts.

Package for OLM Support

Create an OLM package, test the installation, update the readme with install instructions, and submit it to operatorhub.io.

Support generic oidc auth integration

Humio supports oauth, and the operator can support this via environment variables that are exposed in the cluster spec. See https://docs.humio.com/cluster-management/security/oauth.

The part that we do not have is a way to use dynamic oidc registration as per https://openid.net/specs/openid-connect-registration-1_0.html.

We'll need to determine either a generic or pluggable solution to this, such that we either have one solution that works with all OIDC providers, or have a convention, such as running a separate service, job, etc., that could be customized in order to support any provider.

Here are a few examples:
https://www.ibm.com/support/knowledgecenter/SSFC4F_1.2.0/iam/3.4.0/auth_onboard.html
https://developer.okta.com/docs/reference/api/oauth-clients/
https://auth0.com/docs/api-auth/dynamic-client-registration

Improve logging

The log format is pretty bad right now and we use at least two different loggers. We should decide on one and stick to it, and also improve the format.

Handle Humio Updates

For the alpha release, the controller should be able to update humio in a way that takes all the pods down and then brings them back up on the new version.

We need a table in the readme that matches each supported operator version with the version of Humio it supports. The major and minor versions should be locked in. We should have a way for the operator to error out and not run when the major and minor versions of the operator and Humio do not match.

For patch updates to Humio, we should eventually be able to do rolling updates, but we need core to commit that patch updates will not contain schema changes.

Being able to configure path or documenting nginx rewrite rules for using alternate URL prefixes

I would like to access my Humio URL using a different URL prefix. Changing the path in the humio-operator didn't work, but this is probably possible using nginx rewrite rules.

The nginx.ingress.kubernetes.io/rewrite-target annotation can be used to achieve the /logs path; documentation for that annotation is available here: https://kubernetes.github.io/ingress-nginx/examples/rewrite/

This issue tracks either documenting the rewrite rule or adding support for the path in the Humio Operator. Preferably, being able to do this via the path in the Humio Operator would be ideal.
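
A hedged example of the rewrite approach, assuming an ingress-nginx controller and that Humio is reachable through a service named humio-core on port 8080 (host, service name, and port are illustrative):

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: humio-logs-prefix
  annotations:
    # Strip the /logs prefix before proxying to Humio, which expects
    # requests at the root path.
    nginx.ingress.kubernetes.io/rewrite-target: /$2
spec:
  ingressClassName: nginx
  rules:
    - host: humio.example.com
      http:
        paths:
          - path: "/logs(/|$)(.*)"
            pathType: ImplementationSpecific
            backend:
              service:
                name: humio-core
                port:
                  number: 8080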

Cluster Ingress

Do we run the nginx ingress controller by default and use our existing config in US Cloud?
