
humio-operator's Issues

Parity with helm chart pods

Need to go through the helm chart pods and add anything we're missing to the pods created by the operator. Not sure if it fits here, but we also want to use groups rather than the zone/region name in the Zookeeper prefix.

Documentation and tests for using existing service accounts

Right now we have the fields InitServiceAccountName and AuthServiceAccountName on the Spec of HumioCluster, but we are lacking tests that they work and documentation for how you would use them (and whether we require anything specific). During the work done in #161 I added a test for at least one of them, but it started failing because of problems accessing the k8s secret holding the token.

So we need to figure out what is wrong, fix it, add tests to validate it works, and ensure we have some documentation for what is expected from users who want to use an existing service account rather than relying on the operator to create it for them.

We should document how to leverage an existing service account and probably recommend this approach for production use cases. For development/sandbox clusters, I think it is fine to just let the operator create the service accounts.
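
A minimal sketch of what using pre-created service accounts could look like in the resource, assuming the camelCase JSON field names that correspond to the Go fields above (the field names and example account names are assumptions):

apiVersion: core.humio.com/v1alpha1
kind: HumioCluster
metadata:
  name: example-humiocluster
spec:
  # Service accounts created ahead of time by the cluster administrator.
  # Field names assume camelCase JSON tags for InitServiceAccountName and
  # AuthServiceAccountName; the account names are illustrative.
  initServiceAccountName: humio-init
  authServiceAccountName: humio-auth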

Expose environment variables

We need to expose environment variables in the cluster spec and then merge those with the required env vars.

See #4 for more context.
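
A minimal sketch of how this could look in the HumioCluster spec, assuming a field named environmentVariables that takes standard Kubernetes env var entries (the field name and the specific variables shown are assumptions):

spec:
  # User-supplied variables; the operator would merge these with the
  # env vars it always sets.
  environmentVariables:
    - name: PUBLIC_URL
      value: "https://humio.example.com"
    - name: HUMIO_JVM_ARGS
      value: "-Xss2m -Xms256m -Xmx1536m"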

Add option to restart humio pods

There is at least one case where we need to roll the cluster following a change. See #116. If we update the annotations on the service account, then we need to recreate all the humio pods that use the service account.

Ideally we'd have a way to roll the cluster as well as perform non-rolling restarts in cases where those are required, for example by a Humio version update.

Two persistent volumes get created for each pod on AWS

The Humio cluster spec uses the gp2 storage class as defined below:

....
  dataVolumePersistentVolumeClaimSpecTemplate:
    storageClassName: gp2
    accessModes: [ReadWriteOnce]
    resources:
      requests:
        storage: 500Gi
...

AWS EKS version: 1.17

EBS driver plugin version:
amazon/aws-ebs-csi-driver:v0.4.0

Storage classes on the cluster:

NAME            PROVISIONER             RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION   AGE
cloud-storage   ebs.csi.aws.com         Delete          WaitForFirstConsumer   true                   36d
gp2 (default)   kubernetes.io/aws-ebs   Delete          WaitForFirstConsumer   false                  48d

When the Humio pod is spinning up, it creates 2 volumes in the AWS console with the name kubernetes-dynamic-pvc-d4a79a5f-dc0b-40af-b903-04e9fc2d84cf. However, only one is in the 'In-use' status; the other one always stays in the 'Available' status as seen in AWS. The pod then goes into a CrashLoopBackOff due to a FailedAttachVolume warning, or in some cases 1 out of 3 pods is always stuck in the Init phase.

ubuntu@ip-10-150-1-220:/tmp/humio-operator/oauth$ kch
portal-humio-core-alpnzq                            1/2     CrashLoopBackOff   6          12m
humio-operator-5559ddb85-nzcql                         1/1     Running            0          6h2m
48s         Normal    Provisioning             persistentvolumeclaim/portal-humio-core-gtydvy   External provisioner is provisioning volume for claim "cloudops/portal-humio-core-gtydvy"
48s         Normal    WaitForFirstConsumer     persistentvolumeclaim/portal-humio-core-gtydvy   waiting for first consumer to be created before binding
14s         Normal    WaitForFirstConsumer     persistentvolumeclaim/portal-humio-core-lgxuum   waiting for first consumer to be created before binding
14s         Normal    WaitForFirstConsumer     persistentvolumeclaim/portal-humio-core-nwmjpk   waiting for first consumer to be created before binding
43s         Normal    ProvisioningSucceeded    persistentvolumeclaim/portal-humio-core-gtydvy   Successfully provisioned volume pvc-d4a79a5f-dc0b-40af-b903-04e9fc2d84cf using kubernetes.io/aws-ebs
3m34s       Normal    Scheduled                pod/portal-humio-core-alpnzq                     Successfully assigned cloudops/portal-humio-core-alpnzq to ip-10-150-12-138.us-west-2.compute.internal
3m29s       Normal    SuccessfulAttachVolume   pod/portal-humio-core-alpnzq                     AttachVolume.Attach succeeded for volume "pvc-d4a79a5f-dc0b-40af-b903-04e9fc2d84cf"
3m6s        Normal    Started                  pod/portal-humio-core-alpnzq                     Started container zookeeper-prefix
3m6s        Normal    Created                  pod/portal-humio-core-alpnzq                     Created container zookeeper-prefix
3m6s        Normal    Pulled                   pod/portal-humio-core-alpnzq                     Container image "humio/humio-operator-helper:0.0.7" already present on machine
3m4s        Normal    Started                  pod/portal-humio-core-alpnzq                     Started container auth
70s         Normal    Created                  pod/portal-humio-core-alpnzq                     Created container humio
70s         Normal    Started                  pod/portal-humio-core-alpnzq                     Started container humio
70s         Normal    Pulled                   pod/portal-humio-core-alpnzq                     Container image "humio/humio-core:1.13.4" already present on machine
3m4s        Normal    Created                  pod/portal-humio-core-alpnzq                     Created container auth
3m4s        Normal    Pulled                   pod/portal-humio-core-alpnzq                     Container image "humio/humio-operator-helper:0.0.7" already present on machine
51s         Warning   FailedAttachVolume       pod/portal-humio-core-alpnzq                     AttachVolume.Attach failed for volume "pvc-d4a79a5f-dc0b-40af-b903-04e9fc2d84cf" : volume is still being detached from the node
44s         Warning   BackOff                  pod/portal-humio-core-alpnzq                     Back-off restarting failed container

The auth container logs:

ubuntu@ip-10-150-1-220:/tmp/humio-operator/oauth$ kc logs portal-humio-core-alpnzq -c auth
Starting humio-operator-helper version 0.0.7
waiting on files /data/humio-data/local-admin-token.txt, /data/humio-data/global-data-snapshot.json
waiting on files /data/humio-data/local-admin-token.txt, /data/humio-data/global-data-snapshot.json
waiting on files /data/humio-data/local-admin-token.txt, /data/humio-data/global-data-snapshot.json

Pods that are initially configured incorrectly require manual intervention to fix

The problem is that if a HumioCluster resource is initially configured in a way that prevents a Humio pod from bootstrapping (i.e. the pod gets into a CrashLoopBackOff), the controller continually waits for the pod to become ready (which it never does), even if the HumioCluster resource is later fixed in a way that should allow the pod to start.

This is because the bootstrap portion of the control loop happens before the mismatched-pod deletion logic. The pod deletion logic should happen before the bootstrap logic, or the bootstrap logic should have a timeout or some way to recognize and ignore pods that are crashing.

IAM auth via oidc and the service account created by the operator does not appear to work

When setting the service account annotations required to use IAM for pod auth via oidc, such as:

humioServiceAccountAnnotations:
  eks.amazonaws.com/role-arn: "arn:aws:iam::account_id:role/role_name"

and then adding a cluster, the service account and annotations appear to be set appropriately; however, the Humio pods fail to start due to:

ERROR c.h.m.ServerRunner$ - Got exception starting pid=1 com.amazonaws.services.securitytoken.model.AWSSecurityTokenServiceException: Not authorized to perform sts:AssumeRoleWithWebIdentity (Service: AWSSecurityTokenService; Status Code: 403; Error Code: AccessDenied; Request ID: c59017b1-8ce2-4afe-8815-7d7c9bae0f86; Proxy: null)

If I instead use humioServiceAccountName: humio, where humio is the service account created by the helm chart, it works.

Support persistent data volumes

As part of #5 we now support using emptyDir and hostPath volumes, but this does not make it possible to leverage persistent volume types such as EBS volumes.

The main issue is that we currently pass the given volume source directly into the volumes in the pod spec. In order to support persistent volume types, we would need to create PersistentVolumeClaims, one for each pod we create. When creating pods, we can then pick a random unbound PVC and mount it in.

Balance partitions across zones

Original context here: #4

The main question is whether we can make Humio zone aware and then move the functionality of balancing partitions across zones (and ensuring replicas are in different AZs) into Humio core. This is a question that must be raised with core. I think we can probably follow the init container pattern that is used in the helm charts to get the availability zone from the host where the pod is scheduled, but there would obviously still be code changes required in core.

The next, shorter-term solution that was suggested is to move the balancing functionality into the CLI/client API. This way the operator can use it, but it can also be used by anyone not running the operator.

The open question I have is whether we should put zone awareness in the operator and then migrate it to the CLI, or start by adding it to the CLI first.

Ensure operator can manage multiple clusters

  • Ensure service account names don't conflict if the operator manages service accounts and two HumioClusters are deployed in the same namespace: #97
  • Ensure the operator can watch and manage HumioCluster CRs in multiple namespaces: #98
  • Install a per-helm-install SCC and update the users list in the SecurityContextConstraint to include service accounts as needed: #99

Replace use of image humio/strix:latest

We currently use it in these places:

  • initContainer: uses kubectl to grab AZ information and pass it on as the Zookeeper prefix to Humio (a sketch of this pattern follows the list)
  • sidecar: uses the local admin token to create a new admin user, grabs the API token for the new admin user from global, and stores it in a k8s secret so the humio-operator can use it
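
A rough sketch of the init container pattern referenced in the first bullet, assuming the node name is injected via the downward API and the zone is read from the node's topology label (the label key, shared paths, and helper image tag are either taken from the surrounding context or illustrative):

initContainers:
  - name: zookeeper-prefix
    image: humio/humio-operator-helper:0.0.7  # candidate replacement for humio/strix:latest
    env:
      - name: NODE_NAME
        valueFrom:
          fieldRef:
            fieldPath: spec.nodeName
    command:
      - sh
      - -c
      # Look up the zone label on the node the pod landed on and hand it to
      # the Humio container through a shared volume.
      - >
        kubectl get node "$NODE_NAME"
        -o jsonpath='{.metadata.labels.topology\.kubernetes\.io/zone}'
        > /shared/zookeeper-prefix
    volumeMounts:
      - name: shared
        mountPath: /shared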

Different cluster states

Will this be informational to the user only, or will the controller use the state to determine actions? It may be useful to rely on these states in some cases, e.g. to avoid performing operations while the cluster is in a bad state.

Possible cluster states (examples): Bootstrapping, Running, Shrinking, Expanding, Upgrading, Degraded, MissingSegments, Unavailable.

Timeout in waitForNewPod during tests

I've noticed that waitForNewPod() seems to be failing when being called from ensurePodsExist() when creating new pods. Somehow we keep listing pods but the number of pods never increases, so we end up timing out.

Not sure if this only happens when doing make test, but it is a significant slowdown of the tests, and is why the for loop in waitForNewPod() now only waits 3 seconds before timing out, compared to the previous 30 seconds.

Support data volumes

We should start with at least hostPath, but as per Mike's comments in #4, we may also want support for emptyDir: {} for local testing.

Additionally, we should expect to extend this support to PVCs, etc., so it should be designed for extensibility/flexibility. Following the Kubernetes volume spec may be one way to obtain this.
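
A minimal sketch of what following the Kubernetes volume spec could look like, assuming the HumioCluster spec exposes a field (called dataVolumeSource here, which is an assumption) that accepts a standard core VolumeSource:

spec:
  # Any core VolumeSource should be accepted here, which leaves room for
  # adding PVC support later without changing the field.
  dataVolumeSource:
    hostPath:
      path: /mnt/humio-data
      type: DirectoryOrCreate
  # ...or, for local testing:
  # dataVolumeSource:
  #   emptyDir: {}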

Auth within the cluster

Humio doesn't have service accounts, so in order to grant the operator access to the cluster, we need to start up in single-user mode, set a developer user password and store it as a secret, and then log in and get the token.
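
A minimal sketch of that bootstrap, assuming Humio's single-user mode is driven by the AUTHENTICATION_METHOD and SINGLE_USER_PASSWORD settings and that the developer password lives in a Kubernetes Secret (the secret name and key are illustrative):

apiVersion: v1
kind: Secret
metadata:
  name: developer-user-password
type: Opaque
stringData:
  password: "change-me"
---
# Environment for the Humio container; after startup, the operator logs in
# with this password and stores the resulting API token for its own use.
env:
  - name: AUTHENTICATION_METHOD
    value: single-user
  - name: SINGLE_USER_PASSWORD
    valueFrom:
      secretKeyRef:
        name: developer-user-password
        key: password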

Leverage Humio's suggested partition layouts

Right now we have our init container figure out what zone the pod was deployed in, and it then passes that on to Humio using a shared volume, which is read by the Humio container before starting Humio itself.

We should be able to do something similar to what I describe above to set the ZONE configuration option, so we can leverage the new functionality in Humio 1.15+ that can suggest partition layouts for storage and digest partitions.

With the recent merge of #174 we now set ZONE, but still need to set the replication factor options to leverage suggested partition layouts.

Adjust release process

The goal is to:

  • Release a new helm chart when the chart version changes in charts/humio-operator/Chart.yaml (a workflow sketch follows this list)
  • Release new operator container images when the version changes in version/version.go
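
One possible way to wire this up, sketched as a GitHub Actions push trigger; the workflow layout, branch name, and job contents are assumptions, and only the two watched paths come from the list above:

# .github/workflows/chart-release.yaml (illustrative)
on:
  push:
    branches: [master]
    paths:
      - charts/humio-operator/Chart.yaml  # publish a new helm chart release
jobs:
  release-chart:
    runs-on: ubuntu-latest
    steps:
      # Placeholder for the actual chart packaging/publishing steps.
      - uses: actions/checkout@v2
# A second workflow would watch version/version.go the same way and
# build/push new operator container images when it changes.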

Cluster Scaling

Ability to add or remove instances and migrate partitions as needed

Service account not updated when humioServiceAccountAnnotations are added

Once a HumioCluster is created along with a service account, that service account cannot be updated by adding humioServiceAccountAnnotations. Additionally, deleting the service account manually is not enough as the secret also needs to be updated.

Should we have an ensureMismatchedServiceAccountIsDeleted in the case where a user adds humioServiceAccountAnnotations to the HumioCluster spec after the initial creation? If we do this, we'd also need to re-create the secret.

Document how to use an existing service account

You can specify the name of the service account(s) to use when stitching together a HumioCluster resource. If not specified, the operator will create them automatically.

We should document how to leverage an existing service account and probably recommend this approach for production use cases. For development/sandbox clusters, I think it is fine to just let the operator create the service accounts.

Package for OLM Support

Create an OLM package, test the installation, update the readme with install instructions, and submit it to operatorhub.io.

Support generic oidc auth integration

Humio supports oauth, and the operator can support this via environment variables that are exposed in the cluster spec. See https://docs.humio.com/cluster-management/security/oauth.

The part that we do not have is a way to use dynamic oidc registration as per https://openid.net/specs/openid-connect-registration-1_0.html.

We'll need to determine either a generic or pluggable solution to this, such that we either have one solution that works with all OIDC providers, or have a convention, such as running a separate service, job, etc., that could be customized in order to support any provider.

Here are a few examples:
https://www.ibm.com/support/knowledgecenter/SSFC4F_1.2.0/iam/3.4.0/auth_onboard.html
https://developer.okta.com/docs/reference/api/oauth-clients/
https://auth0.com/docs/api-auth/dynamic-client-registration

Improve logging

The log format is pretty bad right now and we use at least two different loggers. We should decide on one and stick to it, and also improve the format.

Handle Humio Updates

For the alpha release, the controller should be able to update humio in a way that takes all the pods down and then brings them back up on the new version.

We need a table in the readme that matches each supported operator version with the version of Humio it supports. The major and minor versions should be locked in. We should have a way for the operator to error out and not run when the major and minor versions of the operator and Humio do not match.

For patch updates to Humio, we should eventually be able to do rolling updates, but we need core to commit that patch updates will not contain schema changes.

Being able to configure path or documenting nginx rewrite rules for using alternate URL prefixes

I would like to access my Humio URL using a different URL prefix. Changing the path in the humio-operator didn't work, but this is probably possible using nginx rewrite rules.

The nginx.ingress.kubernetes.io/rewrite-target annotation can be used to achieve the /logs path; documentation for that annotation is available here: https://kubernetes.github.io/ingress-nginx/examples/rewrite/

This issue tracks either documenting the rewrite rule or adding support for the path in the Humio Operator. Preferably, being able to do this via the path in the Humio Operator would be ideal.
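
A hedged example of the rewrite approach, assuming an ingress-nginx controller and that Humio is reachable through a service named humio-core on port 8080 (host, service name, and port are illustrative):

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: humio-logs-prefix
  annotations:
    # Strip the /logs prefix before proxying to Humio, which expects
    # requests at the root path.
    nginx.ingress.kubernetes.io/rewrite-target: /$2
spec:
  ingressClassName: nginx
  rules:
    - host: humio.example.com
      http:
        paths:
          - path: "/logs(/|$)(.*)"
            pathType: ImplementationSpecific
            backend:
              service:
                name: humio-core
                port:
                  number: 8080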

Cluster Ingress

Do we run the nginx ingress controller by default and use our existing config in US Cloud?
