druid-operator's People

Contributors

achetronic, adheipsingh, avtarops, beni20, camerondavison, christian-schlichtherle, cintosunny, cyril-corbon, dependabot[bot], eandrewjones, farhadf, gurjotkaur20, harinirajendran, himanshug, itamar-marom, jwitko, kentakozuka, layoaster, mrlarssonjr, nitisht, rbankar7, renatocron, roelofkuijpers, samwheating, satyakuppam, schmichri, vladislavpv, youngwookim, yurmix, zhangluva

druid-operator's Issues

Can only run 1 task at a time?

I followed the installation steps to create the operator and tiny cluster.

helm repo add datainfra https://charts.datainfra.io
helm repo update
kubectl create namespace druid
helm -n druid-operator-system upgrade -i --create-namespace --set env.WATCH_NAMESPACE="druid" namespaced-druid-operator datainfra/druid-operator
helm -n druid-operator-system upgrade -i --create-namespace --set env.DENY_LIST="kube-system" namespaced-druid-operator datainfra/druid-operator
kubectl apply -f tiny-cluster-zk.yaml -n druid
kubectl apply -f tiny-cluster.yaml -n druid

The big problem is that I can only run one task at a time. For example, if I'm running a Kafka supervisor, then all other tasks, such as index_parallel or other index_kafka tasks, are stuck in a "pending" status.
I never had this problem with the previous druid-io repo. Any help is appreciated.


Example of pending kafka task:

{
  "id": "index_kafka_crypto_bulk_15m_6fa92e3a918604c_kfcgggfh",
  "groupId": "index_kafka_crypto_bulk_15m",
  "type": "index_kafka",
  "createdTime": "2023-09-15T03:58:34.787Z",
  "queueInsertionTime": "1970-01-01T00:00:00.000Z",
  "statusCode": "RUNNING",
  "status": "RUNNING",
  "runnerStatusCode": "PENDING",
  "duration": -1,
  "location": {
    "host": null,
    "port": -1,
    "tlsPort": -1
  },
  "dataSource": "crypto_bulk_15m",
  "errorMsg": null
}
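A hedged note for anyone hitting the same wall: the tiny-cluster example (shown in full later on this page) runs the coordinator as overlord with druid.indexer.runner.type=local, so concurrent task capacity is whatever that local runner allows, and a single running index_kafka task can exhaust the available slots. One knob to check, assuming that setup, is druid.worker.capacity in the coordinator's runtime.properties; the value below is illustrative.

  runtime.properties: |
    druid.service=druid/coordinator
    druid.coordinator.asOverlord.enabled=true
    druid.coordinator.asOverlord.overlordService=druid/overlord
    druid.indexer.runner.type=local
    # assumption: raise the number of task slots available to the local runner
    druid.worker.capacity=4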

deleteOrphanPvc deleted PVC in use

I'm trying to create a Druid cluster using druid-operator on AWS EKS. I'm using EBS GP2 for the persistent volume.

When trying to scale up the historical pods (e.g. from 4 to 8), the first pod is stuck in Pending while the remaining 7 pods work fine. The first PVC was mistakenly deleted as an orphan PVC even though it was still in use.

druid-operator log:
1.6798315940261655e+09 INFO druid_operator_handler Deleted orphaned pvc [data-volume-druid-workload-historicals-4:default] successfully {"name": "workload", "namespace": "default"}
1.679831594026486e+09 DEBUG events Normal {"object": {"kind":"Druid","namespace":"default","name":"workload","uid":"2c6b92b9-73cb-408f-a670-a3ee7fc307ff","apiVersion":"druid.apache.org/v1alpha1","resourceVersion":"3088566"}, "reason": "DruidOperatorDeleteSuccess", "message": "Successfully deleted object [data-volume-druid-workload-historicals-4:PersistentVolumeClaim] in namespace [default]"}

This issue is reproducible in the following environments:
druid-operator (0.0.9), kubernetes (1.23).

Storage Class:
Name: gp2
IsDefaultClass: Yes
Annotations: kubectl.kubernetes.io/last-applied-configuration={"apiVersion":"storage.k8s.io/v1","kind":"StorageClass","metadata":{"annotations":{"storageclass.kubernetes.io/is-default-class":"true"},"name":"gp2"},"parameters":{"fsType":"ext4","type":"gp2"},"provisioner":"kubernetes.io/aws-ebs","volumeBindingMode":"WaitForFirstConsumer"}
,storageclass.kubernetes.io/is-default-class=true
Provisioner: kubernetes.io/aws-ebs
Parameters: fsType=ext4,type=gp2
AllowVolumeExpansion:
MountOptions:
ReclaimPolicy: Delete
VolumeBindingMode: WaitForFirstConsumer
Events:
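A hedged stop-gap while the race is investigated: the cleanup seen in the log above is driven by a deleteOrphanPvc flag in the Druid spec (the exact field name is an assumption), so turning it off during scale-ups keeps a slow-to-schedule pod's PVC from being reaped. The rest of the spec is elided.

apiVersion: "druid.apache.org/v1alpha1"
kind: "Druid"
metadata:
  name: workload
spec:
  # disable automatic orphan-PVC deletion until all replicas are scheduled
  deleteOrphanPvc: false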

Documentation arrangement

We need to write much better documentation for the operator and arrange its topics and order.
@AdheipSingh, should we use mkdocs, or do you want to do some integration between the project and Datainfra?

Using Kubebuilder markers for object validation

Kubebuilder supports object validation markers. Instead of validating the object inside the reconcile function (in the verifyDruidSpec function), we should let Kubebuilder do it.
There are some validations that we cannot express with markers (like the cluster-level image vs. node-level image check); those should live in a Kubernetes validating admission webhook.
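A minimal sketch of marker-based validation on the API types; the struct and field names below are illustrative, not the operator's actual definitions, and cross-field checks still need the webhook mentioned above.

package v1alpha1

// IllustrativeNodeSpec shows the kind of checks Kubebuilder markers can
// express instead of verifyDruidSpec.
type IllustrativeNodeSpec struct {
	// +kubebuilder:validation:Enum=Deployment;StatefulSet
	// Kind restricts which workload kind a node spec may request.
	Kind string `json:"kind,omitempty"`

	// +kubebuilder:validation:Minimum=1
	// Replicas must be at least one.
	Replicas int32 `json:"replicas"`

	// +kubebuilder:validation:MinLength=1
	// NodeConfigMountPath must not be empty.
	NodeConfigMountPath string `json:"nodeConfigMountPath"`
}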

Enhance e2e tests

  • E2e tests - deploy a full cluster using kind.
  • TODO
  1. indexing job
  2. query datasource

No amd64 docker image for 1.1.1

This happened while attempting to upgrade from 1.1.0 to 1.1.1:

I0518 19:10:59.514979 1 main.go:218] Valid token audiences:
I0518 19:10:59.515043 1 main.go:344] Generating self signed cert as no cert is provided
I0518 19:11:00.480061 1 main.go:394] Starting TCP socket on 0.0.0.0:8443
I0518 19:11:00.480301 1 main.go:401] Listening securely on 0.0.0.0:8443
exec /manager: exec format error

I just checked Docker Hub and it looks like the image was only built for arm64. There is no amd64 image.
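A hedged sketch of how a multi-arch image is usually published with Docker Buildx; the tag and the exact release pipeline are assumptions, not the project's actual process.

# assumption: run from the repo root with a buildx builder available
docker buildx create --use --name druid-operator-builder
docker buildx build \
  --platform linux/amd64,linux/arm64 \
  -t datainfrahq/druid-operator:v1.1.1 \
  --push .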

Help getting started with `tiny-cluster`

Hi, I am trying to set up druid-operator in minikube (for now), which is running on an EC2 instance.

Here's what I have done:

cd druid-operator
git checkout -b v1.0.0 v1.0.0

k create ns druid-ns
k config set-context --current --namespace=druid-ns

k create -f deploy/service_account.yaml
k create -f deploy/role.yaml
k create -f deploy/role_binding.yaml
k create -f deploy/crds/druid.apache.org_druids.yaml
k create -f deploy/operator.yaml

# k apply -f examples/tiny-cluster-zk.yaml
k apply -f examples/tiny-cluster.yaml

I initially started with https://github.com/datainfrahq/druid-operator/blob/master/docs/getting_started.md, but that did not bring up all the services for me; the steps above did.

Now I wish to access the web console.

Here's the output of k get all:

$ k get all
NAME                                    READY   STATUS             RESTARTS        AGE
pod/druid-operator-7ccbfc66b-2mm7q      1/1     Running            0               6m36s
pod/druid-tiny-cluster-brokers-0        0/1     Running            1 (2m49s ago)   6m30s
pod/druid-tiny-cluster-coordinators-0   0/1     Running            3 (69s ago)     6m30s
pod/druid-tiny-cluster-historicals-0    0/1     CrashLoopBackOff   5 (2m26s ago)   6m30s
pod/druid-tiny-cluster-routers-0        1/1     Running            0               6m30s

NAME                                      TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)    AGE
service/druid-tiny-cluster-brokers        ClusterIP   None         <none>        8088/TCP   6m30s
service/druid-tiny-cluster-coordinators   ClusterIP   None         <none>        8088/TCP   6m30s
service/druid-tiny-cluster-historicals    ClusterIP   None         <none>        8088/TCP   6m30s
service/druid-tiny-cluster-routers        ClusterIP   None         <none>        8088/TCP   6m30s

NAME                             READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/druid-operator   1/1     1            1           6m36s

NAME                                       DESIRED   CURRENT   READY   AGE
replicaset.apps/druid-operator-7ccbfc66b   1         1         1       6m36s

NAME                                               READY   AGE
statefulset.apps/druid-tiny-cluster-brokers        0/1     6m30s
statefulset.apps/druid-tiny-cluster-coordinators   0/1     6m30s
statefulset.apps/druid-tiny-cluster-historicals    0/1     6m30s
statefulset.apps/druid-tiny-cluster-routers        1/1     6m30s

I tried running:

k port-forward service/druid-tiny-cluster-routers --address 0.0.0.0 8888:8088

And then tried to access the console using http://ec2-public-ip:8888 but that did not work.

I am pretty sure I'm doing something wrong but I am a k8s beginner so any help would be really great.
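A hedged first diagnostic step for the CrashLoopBackOff historicals (the console will stay incomplete while they are down), using the same namespace as the commands above:

k -n druid-ns describe pod druid-tiny-cluster-historicals-0
k -n druid-ns logs druid-tiny-cluster-historicals-0 --previous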

Example single node cluster fails to apply

Running on kubernetes 1.25.9 through Docker Desktop local cluster mode.

$ kubectl apply -f examples/tiny-cluster.yaml
Error from server (BadRequest): error when creating "examples/tiny-cluster.yaml": Druid in version "v1alpha1" cannot be handled as a Druid: strict decoding error: unknown field "spec.nodes.brokers.volumeClaimTemplates[0].metadata.name"

Pulled the latest commit off the main branch of the repo to apply the operator (e010411).
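One likely cause of a strict-decoding "unknown field" error is that the CRD installed in the cluster is older than the example being applied. A hedged way to check, grepping for the field named in the error message (the CRD path is an assumption based on the Kubebuilder layout):

kubectl get crd druids.druid.apache.org -o yaml | grep -n volumeClaimTemplates
# if the field is missing, re-apply the CRDs from the same commit as the example:
kubectl apply --server-side -f config/crd/bases/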

Dump of the pod object:

apiVersion: v1
kind: Pod
metadata:
  annotations:
    kubectl.kubernetes.io/default-container: manager
  creationTimestamp: "2023-05-03T13:32:21Z"
  generateName: druid-operator-controller-manager-f4bf77f54-
  labels:
    control-plane: controller-manager
    pod-template-hash: f4bf77f54
  name: druid-operator-controller-manager-f4bf77f54-tw4wd
  namespace: druid-operator-system
  ownerReferences:
  - apiVersion: apps/v1
    blockOwnerDeletion: true
    controller: true
    kind: ReplicaSet
    name: druid-operator-controller-manager-f4bf77f54
    uid: 5311981e-0ced-4a2d-96c3-f6d76aeb41bf
  resourceVersion: "27927"
  uid: 260ab173-6aaa-49c2-9003-db8b59e9e552
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: kubernetes.io/arch
            operator: In
            values:
            - amd64
            - arm64
            - ppc64le
            - s390x
          - key: kubernetes.io/os
            operator: In
            values:
            - linux
  containers:
  - args:
    - --secure-listen-address=0.0.0.0:8443
    - --upstream=http://127.0.0.1:8080/
    - --logtostderr=true
    - --v=0
    image: gcr.io/kubebuilder/kube-rbac-proxy:v0.13.1
    imagePullPolicy: IfNotPresent
    name: kube-rbac-proxy
    ports:
    - containerPort: 8443
      name: https
      protocol: TCP
    resources:
      limits:
        cpu: 500m
        memory: 128Mi
      requests:
        cpu: 5m
        memory: 64Mi
    securityContext:
      allowPrivilegeEscalation: false
      capabilities:
        drop:
        - ALL
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-z75tr
      readOnly: true
  - args:
    - --health-probe-bind-address=:8081
    - --metrics-bind-address=127.0.0.1:8080
    - --leader-elect
    command:
    - /manager
    image: datainfrahq/druid-operator:latest
    imagePullPolicy: Always
    livenessProbe:
      failureThreshold: 3
      httpGet:
        path: /healthz
        port: 8081
        scheme: HTTP
      initialDelaySeconds: 15
      periodSeconds: 20
      successThreshold: 1
      timeoutSeconds: 1
    name: manager
    readinessProbe:
      failureThreshold: 3
      httpGet:
        path: /readyz
        port: 8081
        scheme: HTTP
      initialDelaySeconds: 5
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 1
    resources:
      limits:
        cpu: 500m
        memory: 128Mi
      requests:
        cpu: 10m
        memory: 64Mi
    securityContext:
      allowPrivilegeEscalation: false
      capabilities:
        drop:
        - ALL
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-z75tr
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  nodeName: docker-desktop
  preemptionPolicy: PreemptLowerPriority
  priority: 0
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext:
    runAsNonRoot: true
  serviceAccount: druid-operator-controller-manager
  serviceAccountName: druid-operator-controller-manager
  terminationGracePeriodSeconds: 10
  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  volumes:
  - name: kube-api-access-z75tr
    projected:
      defaultMode: 420
      sources:
      - serviceAccountToken:
          expirationSeconds: 3607
          path: token
      - configMap:
          items:
          - key: ca.crt
            path: ca.crt
          name: kube-root-ca.crt
      - downwardAPI:
          items:
          - fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
            path: namespace
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2023-05-03T13:32:21Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2023-05-03T13:32:32Z"
    status: "True"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2023-05-03T13:32:32Z"
    status: "True"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2023-05-03T13:32:21Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: docker://36d70093e7494a286098150203296ebb247ad207d9f3a0df0b5ffa7df9d7cf97
    image: gcr.io/kubebuilder/kube-rbac-proxy:v0.13.1
    imageID: docker-pullable://gcr.io/kubebuilder/kube-rbac-proxy@sha256:d4883d7c622683b3319b5e6b3a7edfbf2594c18060131a8bf64504805f875522
    lastState: {}
    name: kube-rbac-proxy
    ready: true
    restartCount: 0
    started: true
    state:
      running:
        startedAt: "2023-05-03T13:32:22Z"
  - containerID: docker://dc215fd45f7bef2036c832184c8b4a8ec37e986807a516b3784cc804933bae6d
    image: datainfrahq/druid-operator:latest
    imageID: docker-pullable://datainfrahq/druid-operator@sha256:c5dc3f12f28695fea7c3849ffd4e83a729b5b049a0fa13f20ad7904c58410256
    lastState: {}
    name: manager
    ready: true
    restartCount: 0
    started: true
    state:
      running:
        startedAt: "2023-05-03T13:32:26Z"
  hostIP: 192.168.65.4
  phase: Running
  podIP: 10.1.0.126
  podIPs:
  - ip: 10.1.0.126
  qosClass: Burstable
  startTime: "2023-05-03T13:32:21Z"

Dump of the current CRD object I'm trying to apply:

# This spec only works on a single node kubernetes cluster(e.g. typical k8s cluster setup for dev using kind/minikube or single node AWS EKS cluster etc)
# as it uses local disk as "deep storage".
#
apiVersion: "druid.apache.org/v1alpha1"
kind: "Druid"
metadata:
  name: tiny-cluster
spec:
  image: apache/druid:25.0.0
  # Optionally specify image for all nodes. Can be specify on nodes also
  # imagePullSecrets:
  # - name: tutu
  startScript: /druid.sh
  podLabels:
    environment: stage
    release: alpha
  podAnnotations:
    dummykey: dummyval
  readinessProbe:
    httpGet:
      path: /status/health
      port: 8088
  securityContext:
    fsGroup: 1000
    runAsUser: 1000
    runAsGroup: 1000
  services:
    - spec:
        type: ClusterIP
        clusterIP: None
  commonConfigMountPath: "/opt/druid/conf/druid/cluster/_common"
  jvm.options: |-
    -server
    -XX:MaxDirectMemorySize=10240g
    -Duser.timezone=UTC
    -Dfile.encoding=UTF-8
    -Dlog4j.debug
    -Djava.util.logging.manager=org.apache.logging.log4j.jul.LogManager
    -Djava.io.tmpdir=/druid/data
  log4j.config: |-
    <?xml version="1.0" encoding="UTF-8" ?>
    <Configuration status="WARN">
        <Appenders>
            <Console name="Console" target="SYSTEM_OUT">
                <PatternLayout pattern="%d{ISO8601} %p [%t] %c - %m%n"/>
            </Console>
        </Appenders>
        <Loggers>
            <Root level="info">
                <AppenderRef ref="Console"/>
            </Root>
        </Loggers>
    </Configuration>
  common.runtime.properties: |

    # Zookeeper
    druid.zk.service.host=tiny-cluster-zk-0.tiny-cluster-zk
    druid.zk.paths.base=/druid
    druid.zk.service.compress=false

    # Metadata Store
    druid.metadata.storage.type=derby
    druid.metadata.storage.connector.connectURI=jdbc:derby://localhost:1527/druid/data/derbydb/metadata.db;create=true
    druid.metadata.storage.connector.host=localhost
    druid.metadata.storage.connector.port=1527
    druid.metadata.storage.connector.createTables=true

    # Deep Storage
    druid.storage.type=local
    druid.storage.storageDirectory=/druid/deepstorage
    #
    # Extensions
    #
    druid.extensions.loadList=["druid-kafka-indexing-service"]

    #
    # Service discovery
    #
    druid.selectors.indexing.serviceName=druid/overlord
    druid.selectors.coordinator.serviceName=druid/coordinator

    druid.indexer.logs.type=file
    druid.indexer.logs.directory=/druid/data/indexing-logs
    druid.lookup.enableLookupSyncOnStartup=false

  metricDimensions.json: |-
    {
      "query/time" : { "dimensions" : ["dataSource", "type"], "type" : "timer"},
      "query/bytes" : { "dimensions" : ["dataSource", "type"], "type" : "count"},
      "query/node/time" : { "dimensions" : ["server"], "type" : "timer"},
      "query/node/ttfb" : { "dimensions" : ["server"], "type" : "timer"},
      "query/node/bytes" : { "dimensions" : ["server"], "type" : "count"},
      "query/node/backpressure": { "dimensions" : ["server"], "type" : "timer"},
      "query/intervalChunk/time" : { "dimensions" : [], "type" : "timer"},

      "query/segment/time" : { "dimensions" : [], "type" : "timer"},
      "query/wait/time" : { "dimensions" : [], "type" : "timer"},
      "segment/scan/pending" : { "dimensions" : [], "type" : "gauge"},
      "query/segmentAndCache/time" : { "dimensions" : [], "type" : "timer" },
      "query/cpu/time" : { "dimensions" : ["dataSource", "type"], "type" : "timer" },

      "query/count" : { "dimensions" : [], "type" : "count" },
      "query/success/count" : { "dimensions" : [], "type" : "count" },
      "query/failed/count" : { "dimensions" : [], "type" : "count" },
      "query/interrupted/count" : { "dimensions" : [], "type" : "count" },
      "query/timeout/count" : { "dimensions" : [], "type" : "count" },

      "query/cache/delta/numEntries" : { "dimensions" : [], "type" : "count" },
      "query/cache/delta/sizeBytes" : { "dimensions" : [], "type" : "count" },
      "query/cache/delta/hits" : { "dimensions" : [], "type" : "count" },
      "query/cache/delta/misses" : { "dimensions" : [], "type" : "count" },
      "query/cache/delta/evictions" : { "dimensions" : [], "type" : "count" },
      "query/cache/delta/hitRate" : { "dimensions" : [], "type" : "count", "convertRange" : true },
      "query/cache/delta/averageBytes" : { "dimensions" : [], "type" : "count" },
      "query/cache/delta/timeouts" : { "dimensions" : [], "type" : "count" },
      "query/cache/delta/errors" : { "dimensions" : [], "type" : "count" },

      "query/cache/total/numEntries" : { "dimensions" : [], "type" : "gauge" },
      "query/cache/total/sizeBytes" : { "dimensions" : [], "type" : "gauge" },
      "query/cache/total/hits" : { "dimensions" : [], "type" : "gauge" },
      "query/cache/total/misses" : { "dimensions" : [], "type" : "gauge" },
      "query/cache/total/evictions" : { "dimensions" : [], "type" : "gauge" },
      "query/cache/total/hitRate" : { "dimensions" : [], "type" : "gauge", "convertRange" : true },
      "query/cache/total/averageBytes" : { "dimensions" : [], "type" : "gauge" },
      "query/cache/total/timeouts" : { "dimensions" : [], "type" : "gauge" },
      "query/cache/total/errors" : { "dimensions" : [], "type" : "gauge" },

      "ingest/events/thrownAway" : { "dimensions" : ["dataSource"], "type" : "count" },
      "ingest/events/unparseable" : { "dimensions" : ["dataSource"], "type" : "count" },
      "ingest/events/duplicate" : { "dimensions" : ["dataSource"], "type" : "count" },
      "ingest/events/processed" : { "dimensions" : ["dataSource", "taskType", "taskId"], "type" : "count" },
      "ingest/events/messageGap" : { "dimensions" : ["dataSource"], "type" : "gauge" },
      "ingest/rows/output" : { "dimensions" : ["dataSource"], "type" : "count" },
      "ingest/persists/count" : { "dimensions" : ["dataSource"], "type" : "count" },
      "ingest/persists/time" : { "dimensions" : ["dataSource"], "type" : "timer" },
      "ingest/persists/cpu" : { "dimensions" : ["dataSource"], "type" : "timer" },
      "ingest/persists/backPressure" : { "dimensions" : ["dataSource"], "type" : "gauge" },
      "ingest/persists/failed" : { "dimensions" : ["dataSource"], "type" : "count" },
      "ingest/handoff/failed" : { "dimensions" : ["dataSource"], "type" : "count" },
      "ingest/merge/time" : { "dimensions" : ["dataSource"], "type" : "timer" },
      "ingest/merge/cpu" : { "dimensions" : ["dataSource"], "type" : "timer" },

      "ingest/kafka/lag" : { "dimensions" : ["dataSource"], "type" : "gauge" },
      "ingest/kafka/maxLag" : { "dimensions" : ["dataSource"], "type" : "gauge" },
      "ingest/kafka/avgLag" : { "dimensions" : ["dataSource"], "type" : "gauge" },

      "task/success/count" : { "dimensions" : ["dataSource"], "type" : "count" },
      "task/failed/count" : { "dimensions" : ["dataSource"], "type" : "count" },
      "task/running/count" : { "dimensions" : ["dataSource"], "type" : "gauge" },
      "task/pending/count" : { "dimensions" : ["dataSource"], "type" : "gauge" },
      "task/waiting/count" : { "dimensions" : ["dataSource"], "type" : "gauge" },

      "taskSlot/total/count" : { "dimensions" : [], "type" : "gauge" },
      "taskSlot/idle/count" : { "dimensions" : [], "type" : "gauge" },
      "taskSlot/busy/count" : { "dimensions" : [], "type" : "gauge" },
      "taskSlot/lazy/count" : { "dimensions" : [], "type" : "gauge" },
      "taskSlot/blacklisted/count" : { "dimensions" : [], "type" : "gauge" },

      "task/run/time" : { "dimensions" : ["dataSource", "taskType"], "type" : "timer" },
      "segment/added/bytes" : { "dimensions" : ["dataSource", "taskType"], "type" : "count" },
      "segment/moved/bytes" : { "dimensions" : ["dataSource", "taskType"], "type" : "count" },
      "segment/nuked/bytes" : { "dimensions" : ["dataSource", "taskType"], "type" : "count" },

      "segment/assigned/count" : { "dimensions" : ["tier"], "type" : "count" },
      "segment/moved/count" : { "dimensions" : ["tier"], "type" : "count" },
      "segment/dropped/count" : { "dimensions" : ["tier"], "type" : "count" },
      "segment/deleted/count" : { "dimensions" : ["tier"], "type" : "count" },
      "segment/unneeded/count" : { "dimensions" : ["tier"], "type" : "count" },
      "segment/unavailable/count" : { "dimensions" : ["dataSource"], "type" : "gauge" },
      "segment/underReplicated/count" : { "dimensions" : ["dataSource", "tier"], "type" : "gauge" },
      "segment/cost/raw" : { "dimensions" : ["tier"], "type" : "count" },
      "segment/cost/normalization" : { "dimensions" : ["tier"], "type" : "count" },
      "segment/cost/normalized" : { "dimensions" : ["tier"], "type" : "count" },
      "segment/loadQueue/size" : { "dimensions" : ["server"], "type" : "gauge" },
      "segment/loadQueue/failed" : { "dimensions" : ["server"], "type" : "gauge" },
      "segment/loadQueue/count" : { "dimensions" : ["server"], "type" : "gauge" },
      "segment/dropQueue/count" : { "dimensions" : ["server"], "type" : "gauge" },
      "segment/size" : { "dimensions" : ["dataSource"], "type" : "gauge" },
      "segment/overShadowed/count" : { "dimensions" : [], "type" : "gauge" },

      "segment/max" : { "dimensions" : [], "type" : "gauge"},
      "segment/used" : { "dimensions" : ["dataSource", "tier", "priority"], "type" : "gauge" },
      "segment/usedPercent" : { "dimensions" : ["dataSource", "tier", "priority"], "type" : "gauge", "convertRange" : true },
      "segment/pendingDelete" : { "dimensions" : [], "type" : "gauge"},

      "jvm/pool/committed" : { "dimensions" : ["poolKind", "poolName"], "type" : "gauge" },
      "jvm/pool/init" : { "dimensions" : ["poolKind", "poolName"], "type" : "gauge" },
      "jvm/pool/max" : { "dimensions" : ["poolKind", "poolName"], "type" : "gauge" },
      "jvm/pool/used" : { "dimensions" : ["poolKind", "poolName"], "type" : "gauge" },
      "jvm/bufferpool/count" : { "dimensions" : ["bufferpoolName"], "type" : "gauge" },
      "jvm/bufferpool/used" : { "dimensions" : ["bufferpoolName"], "type" : "gauge" },
      "jvm/bufferpool/capacity" : { "dimensions" : ["bufferpoolName"], "type" : "gauge" },
      "jvm/mem/init" : { "dimensions" : ["memKind"], "type" : "gauge" },
      "jvm/mem/max" : { "dimensions" : ["memKind"], "type" : "gauge" },
      "jvm/mem/used" : { "dimensions" : ["memKind"], "type" : "gauge" },
      "jvm/mem/committed" : { "dimensions" : ["memKind"], "type" : "gauge" },
      "jvm/gc/count" : { "dimensions" : ["gcName", "gcGen"], "type" : "count" },
      "jvm/gc/cpu" : { "dimensions" : ["gcName", "gcGen"], "type" : "count" },

      "ingest/events/buffered" : { "dimensions" : ["serviceName", "bufferCapacity"], "type" : "gauge"},

      "sys/swap/free" : { "dimensions" : [], "type" : "gauge"},
      "sys/swap/max" : { "dimensions" : [], "type" : "gauge"},
      "sys/swap/pageIn" : { "dimensions" : [], "type" : "gauge"},
      "sys/swap/pageOut" : { "dimensions" : [], "type" : "gauge"},
      "sys/disk/write/count" : { "dimensions" : ["fsDevName"], "type" : "count"},
      "sys/disk/read/count" : { "dimensions" : ["fsDevName"], "type" : "count"},
      "sys/disk/write/size" : { "dimensions" : ["fsDevName"], "type" : "count"},
      "sys/disk/read/size" : { "dimensions" : ["fsDevName"], "type" : "count"},
      "sys/net/write/size" : { "dimensions" : [], "type" : "count"},
      "sys/net/read/size" : { "dimensions" : [], "type" : "count"},
      "sys/fs/used" : { "dimensions" : ["fsDevName", "fsDirName", "fsTypeName", "fsSysTypeName", "fsOptions"], "type" : "gauge"},
      "sys/fs/max" : { "dimensions" : ["fsDevName", "fsDirName", "fsTypeName", "fsSysTypeName", "fsOptions"], "type" : "gauge"},
      "sys/mem/used" : { "dimensions" : [], "type" : "gauge"},
      "sys/mem/max" : { "dimensions" : [], "type" : "gauge"},
      "sys/storage/used" : { "dimensions" : ["fsDirName"], "type" : "gauge"},
      "sys/cpu" : { "dimensions" : ["cpuName", "cpuTime"], "type" : "gauge"},

      "coordinator-segment/count" : { "dimensions" : ["dataSource"], "type" : "gauge" },
      "historical-segment/count" : { "dimensions" : ["dataSource", "tier", "priority"], "type" : "gauge" },

      "jetty/numOpenConnections" : { "dimensions" : [], "type" : "gauge" },
      "query/cache/caffeine/total/requests" : { "dimensions" : [], "type" : "gauge" },
      "query/cache/caffeine/total/loadTime" : { "dimensions" : [], "type" : "gauge" },
      "query/cache/caffeine/total/evictionBytes" : { "dimensions" : [], "type" : "gauge" },
      "query/cache/memcached/total" : { "dimensions" : ["[MEM] Reconnecting Nodes (ReconnectQueue)",
        "[MEM] Request Rate: All",
        "[MEM] Average Bytes written to OS per write",
        "[MEM] Average Bytes read from OS per read",
        "[MEM] Response Rate: All (Failure + Success + Retry)",
        "[MEM] Response Rate: Retry",
        "[MEM] Response Rate: Failure",
        "[MEM] Response Rate: Success"],
        "type" : "gauge" },
      "query/cache/caffeine/delta/requests" : { "dimensions" : [], "type" : "count" },
      "query/cache/caffeine/delta/loadTime" : { "dimensions" : [], "type" : "count" },
      "query/cache/caffeine/delta/evictionBytes" : { "dimensions" : [], "type" : "count" },
      "query/cache/memcached/delta" : { "dimensions" : ["[MEM] Reconnecting Nodes (ReconnectQueue)",
        "[MEM] Request Rate: All",
        "[MEM] Average Bytes written to OS per write",
        "[MEM] Average Bytes read from OS per read",
        "[MEM] Response Rate: All (Failure + Success + Retry)",
        "[MEM] Response Rate: Retry",
        "[MEM] Response Rate: Failure",
        "[MEM] Response Rate: Success"],
        "type" : "count" }
    }

  volumeMounts:
    - mountPath: /druid/data
      name: data-volume
    - mountPath: /druid/deepstorage
      name: deepstorage-volume
  volumes:
    - name: data-volume
      emptyDir: {}
    - name: deepstorage-volume
      hostPath:
        path: /tmp/druid/deepstorage
        type: DirectoryOrCreate
  env:
    - name: POD_NAME
      valueFrom:
        fieldRef:
          fieldPath: metadata.name
    - name: POD_NAMESPACE
      valueFrom:
        fieldRef:
          fieldPath: metadata.namespace

  nodes:
    brokers:
      # Optionally specify for running broker as Deployment
      # kind: Deployment
      nodeType: "broker"
      # Optionally specify for broker nodes
      # imagePullSecrets:
      # - name: tutu
      druid.port: 8088
      nodeConfigMountPath: "/opt/druid/conf/druid/cluster/query/broker"
      replicas: 1
      volumeClaimTemplates:
       - metadata:
           name: data-volume
         spec:
           accessModes:
           - ReadWriteOnce
           resources:
             requests:
               storage: 2Gi
           storageClassName: standard
      runtime.properties: |
        druid.service=druid/broker
        # HTTP server threads
        druid.broker.http.numConnections=5
        druid.server.http.numThreads=10
        # Processing threads and buffers
        druid.processing.buffer.sizeBytes=1
        druid.processing.numMergeBuffers=1
        druid.processing.numThreads=1
        druid.sql.enable=true
      extra.jvm.options: |-
        -Xmx512M
        -Xms512M

    coordinators:
      # Optionally specify for running coordinator as Deployment
      # kind: Deployment
      nodeType: "coordinator"
      druid.port: 8088
      nodeConfigMountPath: "/opt/druid/conf/druid/cluster/master/coordinator-overlord"
      replicas: 1
      runtime.properties: |
        druid.service=druid/coordinator

        # HTTP server threads
        druid.coordinator.startDelay=PT30S
        druid.coordinator.period=PT30S

        # Configure this coordinator to also run as Overlord
        druid.coordinator.asOverlord.enabled=true
        druid.coordinator.asOverlord.overlordService=druid/overlord
        druid.indexer.queue.startDelay=PT30S
        druid.indexer.runner.type=local
      extra.jvm.options: |-
        -Xmx512M
        -Xms512M

    historicals:
      nodeType: "historical"
      druid.port: 8088
      nodeConfigMountPath: "/opt/druid/conf/druid/cluster/data/historical"
      replicas: 1
      runtime.properties: |
        druid.service=druid/historical
        druid.server.http.numThreads=5
        druid.processing.buffer.sizeBytes=536870912
        druid.processing.numMergeBuffers=1
        druid.processing.numThreads=1

        # Segment storage
        druid.segmentCache.locations=[{\"path\":\"/druid/data/segments\",\"maxSize\":10737418240}]
        druid.server.maxSize=10737418240
      extra.jvm.options: |-
        -Xmx512M
        -Xms512M
          
    routers:
      nodeType: "router"
      druid.port: 8088
      nodeConfigMountPath: "/opt/druid/conf/druid/cluster/query/router"
      replicas: 1
      runtime.properties: |
        druid.service=druid/router

        # HTTP proxy
        druid.router.http.numConnections=10
        druid.router.http.readTimeout=PT5M
        druid.router.http.numMaxThreads=10
        druid.server.http.numThreads=10

        # Service discovery
        druid.router.defaultBrokerServiceName=druid/broker
        druid.router.coordinatorServiceName=druid/coordinator

        # Management proxy to coordinator / overlord: required for unified web console.
        druid.router.managementProxy.enabled=true       
      extra.jvm.options: |-
        -Xmx512M
        -Xms512M

Configuration question for running Druid with the Operator

From the druid-operator Slack channel:

  • What are you doing with the -Djava.io.tmpdir= configuration?
  • How do you handle TLS?
  • Using autoscaling? on which component and how?
  • What Kubernetes kind are you using for each component?
  • Using ZooKeeper-less?
  • Using MiddleManager-less?
  • What are your -Xmx and -Xms values? How do they compare to the pods' resource requests/limits?
  • Setting CPU limit?
  • How do you spread your pods across the cluster?
  • Using Karpenter? What is your Provisioner and Launch Template
  • What are you doing with these configurations: druid.segmentCache.locations and druid.server.maxSize
  • Created Service objects? for which component?

Question: Can I install the druid-operator from operatorhub via Helm?

Hi team,

I have spent some time looking for a way to install druid-operator from OperatorHub via Helm. Do we happen to publish the druid-operator Helm chart anywhere other than GitHub? The reason is that I want to deploy druid-operator from a pipeline, and I don't want to clone the git repo every time for the deployment.

Thanks
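A hedged pointer: whether or not it is listed on OperatorHub, the chart repository used in the first issue on this page (https://charts.datainfra.io) can be added directly from a pipeline without cloning the repo; the release name and namespace below are illustrative.

helm repo add datainfra https://charts.datainfra.io
helm repo update
helm -n druid-operator-system upgrade -i --create-namespace druid-operator datainfra/druid-operator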

Using additionalContainer not able to add extension libraries

To configure MySQL as the metadata storage I need to add the mysql-connector-java library to the Druid extensions, but I am not able to figure out the best way to do so. I tried using additionalContainer as below, but the container fails to start.

additionalContainer:
  - containerName: download-mysql-connector
    image: apache/druid:25.0.0
    command: ["sh", "-c", "wget -O /tmp/mysql-connector-j-8.0.32.tar.gz https://downloads.mysql.com/archives/get/p/3/file/mysql-connector-j-8.0.32.tar.gz && cd /tmp && tar -xf /tmp/mysql-connector-j-8.0.32.tar.gz && cp /tmp/mysql-connector-j-8.0.32/mysql-connector-j-8.0.32.jar /opt/druid/extensions/mysql-metadata-storage/mysql-connector-java.jar"]
    volumeMounts:
      - name: mysql-connector-jar
        mountPath: /opt/druid/extensions/mysql-connector

If there were a way to add an initContainer per node type, we wouldn't have to add libraries to all containers, only to the specific services that need them.
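A hedged sketch of one pattern that avoids baking the library into every image, assuming the additionalContainer containers run alongside the Druid container and that the chosen image ships the mysql-metadata-storage extension directory plus wget/tar: stage the whole extension together with the connector jar in a shared emptyDir, then mount that emptyDir over the extension directory in the Druid containers (mounting a volume straight over /opt/druid/extensions/mysql-metadata-storage would otherwise hide the extension's bundled jars). The additionalContainer field names follow the snippet above; spec.volumes / spec.volumeMounts follow the tiny-cluster example later on this page, and the paths are illustrative.

spec:
  additionalContainer:
    - containerName: stage-mysql-connector
      image: apache/druid:25.0.0
      command:
        - sh
        - -c
        - >
          cp -r /opt/druid/extensions/mysql-metadata-storage/. /shared/ &&
          wget -O /tmp/mysql.tar.gz https://downloads.mysql.com/archives/get/p/3/file/mysql-connector-j-8.0.32.tar.gz &&
          tar -xf /tmp/mysql.tar.gz -C /tmp &&
          cp /tmp/mysql-connector-j-8.0.32/mysql-connector-j-8.0.32.jar /shared/mysql-connector-java.jar
      volumeMounts:
        - name: mysql-metadata-storage-ext
          mountPath: /shared
  volumeMounts:
    # shadows the image's extension directory in every Druid container
    - name: mysql-metadata-storage-ext
      mountPath: /opt/druid/extensions/mysql-metadata-storage
  volumes:
    - name: mysql-metadata-storage-ext
      emptyDir: {}

If additionalContainer containers are plain sidecars rather than init containers, the Druid process can race the staging step, which is exactly why a per-node initContainer option as requested above would be the cleaner fix.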

Druid components autoscaling best practices

Started in this slack thread

We need an answer for how to scale each component of Druid.
Middle Managers are on the way to becoming dynamically provisioned, which will solve this for them.
The biggest problem is autoscaling historicals where we should also take storage into our calculation.
Should the operator handle that? Should we have a smart third-party auto scaler (like KEDA)?
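For the stateless query tier the usual answer is a plain HPA (or KEDA driving one); below is a minimal sketch of a native autoscaling/v2 HPA against the brokers Deployment, using the object names from the tiny-cluster status shown further down this page. Thresholds and replica counts are illustrative, and none of this addresses the storage-aware logic historicals would need.

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: druid-tiny-cluster-brokers
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: druid-tiny-cluster-brokers
  minReplicas: 2
  maxReplicas: 6
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70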

K8s v1.26 is not supported

We have upgraded our K8s cluster to v1.26. We had set up druid-operator v1.0.0 on K8s v1.25 for testing, and it kept working fine even after the move to v1.26, but now when we try to set it up again it fails because of the HPA API version.

The druid-operator / K8s compatibility matrix shows that druid-operator v1.0.0 works on K8s v1.25 and above, which seems ambiguous.
(screenshot of the compatibility matrix)

How can we run Druid on K8s v1.26? We are completely blocked now.

Any help would really be appreciated, Thanks in advance.
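Kubernetes v1.26 removed the autoscaling/v2beta2 HPA API (autoscaling/v2 is its replacement), which is the usual culprit when an operator that creates HPAs starts failing right after such an upgrade. A quick way to see what the cluster still serves:

kubectl api-versions | grep autoscaling
# expected on v1.26: autoscaling/v1 and autoscaling/v2 only; v2beta2 is gone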

Additional ConfigMap For HDFS and Core site xml

I want to set up a Druid cluster that uses HDFS for deep storage. As documented in the Druid documentation, I need to add the core-site.xml and hdfs-site.xml files to the Druid classpath.

I searched the documentation for a way to mount a ConfigMap into /opt/druid/conf/druid/cluster/_common but didn't find any spec for it. Is this doable?

Not able to see PVC in CR Status

Not able to see PVC in CR Status.
Here is the list of PVC:

$ kubectl get pvc
NAME                                           STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
data-volume-druid-tiny-cluster-historicals-0   Bound    pvc-eab29f7e-73ec-4b92-ac7a-6f2add38647f   2Gi        RWO            standard       10m
data-volume-druid-tiny-cluster-historicals-1   Bound    pvc-8cd53678-49e1-4eca-a2e6-b62a954ffc40   2Gi        RWO            standard       10m

CR status:

status:
  configMaps:
  - druid-tiny-cluster-brokers-config
  - druid-tiny-cluster-coordinators-config
  - druid-tiny-cluster-historicals-config
  - druid-tiny-cluster-routers-config
  - tiny-cluster-druid-common-config
  deployments:
  - druid-tiny-cluster-brokers
  druidNodeStatus:
    druidNode: All
    druidNodeConditionStatus: "True"
    druidNodeConditionType: DruidClusterReady
    reason: All Druid Nodes are in Ready Condition
  hpAutoscalers:
  - druid-tiny-cluster-brokers
  ingress:
  - druid-tiny-cluster-routers
  podDisruptionBudgets:
  - druid-tiny-cluster-brokers
  pods:
  - druid-tiny-cluster-brokers-f58678f48-snvbd
  - druid-tiny-cluster-coordinators-0
  - druid-tiny-cluster-historicals-0
  - druid-tiny-cluster-historicals-1
  - druid-tiny-cluster-routers-0
  services:
  - druid-tiny-cluster-brokers
  - druid-tiny-cluster-coordinators
  - druid-tiny-cluster-historicals
  - druid-tiny-cluster-routers
  statefulSets:
  - druid-tiny-cluster-coordinators
  - druid-tiny-cluster-historicals
  - druid-tiny-cluster-routers

Default readiness on historicals breaks operator.

In PR #72, default probes are set for various components.

This breaks the operator because, while /druid/historical/v1/readiness is fine to call, /druid/historical/v1/loadstatus is a privileged call that requires authentication. Since the operator sets this readinessProbe by default when you don't define one, the only way to avoid it is to hard-wire your own probe against another endpoint.

curl -vvv http://127.0.0.1:4124/druid/historical/v1/readiness
HTTP/1.1 200 OK

curl -vvv http://127.0.0.1:4124/druid/historical/v1/loadstatus
HTTP/1.1 401 Unauthorized
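Until the default changes, a hedged workaround is to pin the historicals' probe to the unauthenticated endpoint shown above; the node-level readinessProbe field and the port follow the examples elsewhere on this page.

  nodes:
    historicals:
      nodeType: "historical"
      druid.port: 8088
      readinessProbe:
        httpGet:
          path: /druid/historical/v1/readiness
          port: 8088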

Support node-specific startupProbes

Since operator 1.2.0 the setup of node-specific probes doesn't work any more. In my example, the config of a coordinator looks like the following:

nodes:
    coordinator:
      nodeType: "coordinator"
      druid.port: 8281
      nodeConfigMountPath: "/opt/druid/conf/druid/cluster/master/coordinator-overlord"
      replicas: 2
      podManagementPolicy: OrderedReady
      updateStrategy:
        type: RollingUpdate
      livenessProbe:
        failureThreshold: 60
        periodSeconds: 10
        httpGet:
          path: /status/health
          port: 8281
          scheme: HTTPS
      readinessProbe:
        failureThreshold: 60
        periodSeconds: 10
        httpGet:
          path: /status/health
          port: 8281
          scheme: HTTPS
      runtime.properties: |
        druid.service=druid/coordinator
        .... 

As you can see, the cluster has TLS enabled, so I've used the default TLS ports: 8281 for the coordinator, 9088 for the router, 8283 for historicals, etc.

With operator <= 1.1.1 this works fine. Since operator 1.2.0 the default probes (#98) are in place and the node-specific probe config stopped working.

Even when switching off the default probes with spec.defaultProbes: false, a default startupProbe is still set on the coordinator pod (via the StatefulSet, of course):

      livenessProbe:
        httpGet:
          path: /status/health
          port: 8281
          scheme: HTTPS
        timeoutSeconds: 1
        periodSeconds: 10
        successThreshold: 1
        failureThreshold: 60
      readinessProbe:
        httpGet:
          path: /status/health
          port: 8281
          scheme: HTTPS
        timeoutSeconds: 1
        periodSeconds: 10
        successThreshold: 1
        failureThreshold: 60
      startupProbe:
        httpGet:
          path: /status/health
          port: 8281
          scheme: HTTP
        initialDelaySeconds: 5
        timeoutSeconds: 5
        periodSeconds: 10
        successThreshold: 1
        failureThreshold: 10

The latest CRD doesn't allow setting node-specific startupProbes:

failed to create typed patch object (druid/fqmdruid; druid.apache.org/v1alpha1, Kind=Druid): .spec.nodes.coordinator.startupProbe: field not declared in schema

The change request is to allow setting node-specific startupProbes in the CRD.
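A hedged sketch of what the requested API change could look like, mirroring the existing node-level livenessProbe / readinessProbe fields; the exact field name, JSON casing, and placement inside the node spec type are assumptions (v1 here is k8s.io/api/core/v1).

// StartUpProbe for this node type, overriding the operator's default.
// +optional
StartUpProbe *v1.Probe `json:"startUpProbe,omitempty"`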

Fail to create controller manager container

commit 2a5d9f5
Go version 1.20
K8s version 1.23.12

I tried to use the master branch to deploy Druid on my company's dev K8s cluster.

I used make deploy to deploy.
I found that the manager container stays in the CreateContainerConfigError status.
When I described the controller manager, I found an error.

I got advice on the Slack channel suggesting I fix it with the security settings on the pod.
Then, I tried to change

druid-operator/config/manager/manager.yaml

securityContext:
   runAsUser: 1000
   runAsNonRoot: true

When I applied the new security config, I got a new error message in the manager container. I still can't successfully create the manager container.

flag provided but not defined: -metrics-bind-address
Usage of /manager:
  -enable-leader-election
        Enable leader election for controller manager. Enabling this will ensure there is only one active controller manager.
  -health-probe-bind-address string
        The address the probe endpoint binds to. (default ":8081")
  -kubeconfig string
        Paths to a kubeconfig. Only required if out-of-cluster.
  -metrics-addr string
        The address the metric endpoint binds to. (default ":8080")

Scale PVC for multi-tier component is broken

I noticed on my cluster that the operator is trying to resize PVCs (and failing).

We have two tiers of historicals, called histolder and histrecent. Each tier's spec requests the right size, but the operator seems to mix up the storage specs for the PVCs: it tries to resize one tier's PVCs to the other tier's size...

It seems the pvcLabels here are too loose for matching the PVCs:

pvcLabels := map[string]string{
		"component": nodeSpec.NodeType,
	}

	pvcList, err := readers.List(ctx, sdk, drd, pvcLabels, emitEvent, func() objectList { return &v1.PersistentVolumeClaimList{} }, func(listObj runtime.Object) []object {
		items := listObj.(*v1.PersistentVolumeClaimList).Items
		result := make([]object, len(items))
		for i := 0; i < len(items); i++ {
			result[i] = &items[i]
		}
		return result
	})
	if err != nil {
		return nil
	}

It matches all historical nodes, regardless of the node name.

Here is an excerpt of my config:

  scalePvcSts: true
  nodes:
    histrecent:
      nodeType: "historical"
      #...
      volumeClaimTemplates:
        - metadata:
            name: data
          spec:
            accessModes:
            - ReadWriteOnce
            resources:
              requests:
                storage: 400Gi
            storageClassName: csi-cinder-high-speed
      volumeMounts:
        - mountPath: /druid/data
          name: data
      volumes:
        - name: data
          emptyDir: {}

    histolder:
      nodeType: "historical"
      #...
      volumeClaimTemplates:
        - metadata:
            name: data
          spec:
            accessModes:
            - ReadWriteOnce
            resources:
              requests:
                storage: 1200Gi
            storageClassName: csi-cinder-classic
      volumeMounts:
        - mountPath: /druid/data
          name: data
      volumes:
        - name: data
          emptyDir: {}

And here is the bit of operator log that caught my eye:

2023-09-11T15:04:22Z    ERROR   Reconciler error        {"controller": "druid", "controllerGroup": "druid.apache.org", "controllerKind": "Druid", "Druid": {"name":"x-cluster","namespace":"default"}, "namespace": "default", "name": "x-cluster", "reconcileID": "f8be13e6-fd34-428a-8f14-e3916e2e3112", "error": "PersistentVolumeClaim \"data-druid-x-cluster-histrecent-0\" is invalid: spec: Forbidden: spec is immutable after creation except resources.requests for bound claims\n  core.PersistentVolumeClaimSpec{\n  \tAccessModes: {\"ReadWriteOnce\"},\n  \tSelector:    nil,\n  \tResources: core.ResourceRequirements{\n  \t\tLimits:   nil,\n- \t\tRequests: core.ResourceList{s\"storage\": {i: resource.int64Amount{value: 429496729600}, Format: \"BinarySI\"}},\n+ \t\tRequests: core.ResourceList{s\"storage\": {i: resource.int64Amount{value: 1288490188800}, Format: \"BinarySI\"}},\n  \t},\n  \tVolumeName:       \"\",\n  \tStorageClassName: &\"csi-cinder-high-speed\",\n  \t... // 3 identical fields\n  }\n"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
        /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:329
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
        /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:274
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
        /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:235
2023-09-11T15:04:22Z    DEBUG   events  Error patching object [data-druid-x-cluster-histrecent-0:*v1.PersistentVolumeClaim] in namespace [default] due to [PersistentVolumeClaim "data-druid-x-cluster-histrecent-0" is invalid: spec: Forbidden: spec is immutable after creation except resources.requests for bound claims
  core.PersistentVolumeClaimSpec{
        AccessModes: {"ReadWriteOnce"},
        Selector:    nil,
        Resources: core.ResourceRequirements{
                Limits:   nil,
-               Requests: core.ResourceList{s"storage": {i: resource.int64Amount{value: 429496729600}, Format: "BinarySI"}},
+               Requests: core.ResourceList{s"storage": {i: resource.int64Amount{value: 1288490188800}, Format: "BinarySI"}},
        },
        VolumeName:       "",
        StorageClassName: &"csi-cinder-high-speed",
        ... // 3 identical fields
  }

Here are two successive outputs (a few minutes apart) from k get pvc:

data-druid-x-cluster-histolder-8          Bound    xxx     RWO            csi-cinder-classic      2m7s
data-druid-x-cluster-histolder-9          Bound    xxx   1200Gi     RWO            csi-cinder-classic      2m6s
data-druid-x-cluster-histrecent-0         Bound    xxx   400Gi      RWO            csi-cinder-high-speed   2m8s
data-druid-x-cluster-histrecent-1         Bound    xxx   400Gi      RWO            csi-cinder-high-speed   2m8s
data-druid-x-cluster-histolder-8          Bound    xxx   1200Gi     RWO            csi-cinder-classic      25m
data-druid-x-cluster-histolder-9          Bound    xxx   1200Gi     RWO            csi-cinder-classic      25m
data-druid-x-cluster-histrecent-0         Bound    xxx   1200Gi     RWO            csi-cinder-high-speed   25m
data-druid-x-cluster-histrecent-1         Bound    xxx   1200Gi     RWO            csi-cinder-high-speed   25m

As you can see, it ended up scaling all PVCs to 1200Gi because it keeps mixing up the two values.
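A hedged sketch of the kind of tightening that could prevent the cross-tier match, assuming the per-node-spec name (the part that distinguishes histrecent from histolder in the StatefulSet name) is available where the listed PVCs are processed; the identifiers below are illustrative, not the operator's actual helpers.

// Illustrative only: after listing PVCs by the loose "component" label,
// keep only the claims that belong to this node spec's StatefulSet.
// nodeSpecUniqueStr stands in for the per-node name the operator already
// computes (e.g. "druid-x-cluster-histrecent"); "strings" must be imported.
filtered := make([]object, 0, len(pvcList))
for _, pvc := range pvcList {
	if strings.Contains(pvc.GetName(), nodeSpecUniqueStr) {
		filtered = append(filtered, pvc)
	}
}
pvcList = filtered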

Compatibility e2e tests

We're missing visibility into the combinations of these:

  • version of the operator
  • version of Druid
  • version of Kubernetes

We need to create an e2e test that runs on every supported combination.
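A hedged sketch of what that matrix could look like as a kind-based GitHub Actions job; the version strings are placeholders and the make e2e target is an assumption, not a statement of what the project actually supports today.

jobs:
  compat-e2e:
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false
      matrix:
        kind_node: ["kindest/node:v1.25.11", "kindest/node:v1.26.6"]
        druid_version: ["25.0.0", "26.0.0"]
        operator_ref: ["v1.1.1", "v1.2.0"]
    steps:
      - uses: actions/checkout@v4
        with:
          ref: ${{ matrix.operator_ref }}
      - uses: helm/kind-action@v1
        with:
          node_image: ${{ matrix.kind_node }}
      - run: make e2e DRUID_VERSION=${{ matrix.druid_version }}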

Support TLS Certificates

While looking at the Druid CRD, I don't see any information on how we can pass a CA certificate to the operator.

Either via a K8s Secret or $CLOUD provider method to get the secret passed in.

I was thinking of using an ExternalSecret --> $CLOUD provider secret manager and then referencing the Secret via the operator, but no such luck.

Is this correct?

Thanks,
Shawn

Coordinator is not created by druid operator

We recently performed an upgrade of the Druid operator from version 1.0.0 to version 1.2.0, and during the process, we encountered an issue when attempting to create a new Druid cluster. It's worth noting that there were no changes made to the cluster manifest.

The specific problem we encountered was the absence of a coordinator created by the Druid operator. Upon inspecting the resource list, we noticed that there was no coordinator statefulset present. Strangely, there were no error messages recorded in the Druid operator log. This issue appears to be intermittent, as we have successfully used the Druid operator to create multiple clusters without encountering this problem, and it was only observed in one particular cluster.

Additionally, we observed that the Druid operator log does not seem to contain particularly useful information, and there is a lack of valuable info in the pod logs.

Question: Kubebuilder structure

Why did you choose to move away from Kubebuilder's default directory structure? As soon as you need to add a new API, you will need to make lots of changes to fit the current custom structure. I'm mainly talking about the deploy directory, but the Makefile too.

Welcome Cyril as Collaborator

@cyril-corbon is one of the core contributors to the project and is active in helping the community. He has also helped evangelize the druid-operator at CNCF community events.
Welcome @cyril-corbon as a collaborator; it is a pleasure to have you.

Ability to add custom files in _common directory

We currently can't add other files to the CommonConfigMountPath.

I thought about mounting files as subPath, but the Kubernetes documentation states that a container using a ConfigMap as a subPath volume will not receive ConfigMap updates (https://kubernetes.io/docs/concepts/storage/volumes/#using-subpath).

My suggestion is to add a new field:

// References to ConfigMaps holding more files to mount to the CommonConfigMountPath.
// +optional
ExtraCommonConfig []*v1.ObjectReference `json:"extraCommonConfig"`

It will give customers the flexibility to create and arrange ConfigMaps with extra files inside them, and the operator will mount them together with the CommonRuntimeProperties.

This will be done by changing the makeCommonConfigMap function to also attach the files in those extra config maps.
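If the field lands as proposed, usage from the cluster spec side could look like the sketch below; the ConfigMap names are illustrative.

spec:
  commonConfigMountPath: "/opt/druid/conf/druid/cluster/_common"
  extraCommonConfig:
    - name: hadoop-client-xml        # e.g. a ConfigMap holding core-site.xml and hdfs-site.xml
      namespace: druid
    - name: extra-metrics-config
      namespace: druid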

Support Hadoop Indexing

I'm not sure I'm right, but in order to support Hadoop indexing (at least in my company's setup), the following files are needed under /opt/druid/conf/druid/cluster/_common/:

  • capacity-scheduler.xml
  • common.runtime.properties
  • core-site.xml
  • hadoop-policy.xml
  • hdfs-site.xml
  • hive-site.xml
  • httpfs-site.xml
  • kms-acls.xml
  • kms-site.xml
  • mapred-site.xml
  • metric-dimensions.json
  • yarn-site.xml

I don't think we should support these in the CRD; we should have the ability to mount ConfigMaps and Secrets as files under /opt/druid/conf/druid/cluster/_common/.

[proposal] Setup default probe for each nodes types

Context & goal

All the Druid components deployed by the operator are deployed without probes by default.
IMHO, an operator should configure this kind of setting out of the box.

Probes should be configured for each node type, with any user-defined probe overriding the default.

All the probes should use the Druid API reference.

Probe definition

coordinator, overlord, middlemanager and router

      livenessProbe:
        httpGet:
          path: /status/health
          port: $druid.port
        failureThreshold: 20
        initialDelaySeconds: 5
        periodSeconds: 10
        successThreshold: 1
        timeoutSeconds: 5
      readinessProbe:
        httpGet:
          path: /status/health
          port: $druid.port
        failureThreshold: 10
        initialDelaySeconds: 5
        periodSeconds: 10
        successThreshold: 1
        timeoutSeconds: 5

broker

      livenessProbe:
        httpGet:
          path: /status/health
          port: $druid.port
        failureThreshold: 20
        initialDelaySeconds: 5
        periodSeconds: 10
        successThreshold: 1
        timeoutSeconds: 5
      readinessProbe:
        httpGet:
          path: /druid/broker/v1/readiness
          port: $druid.port
        failureThreshold: 10
        initialDelaySeconds: 5
        periodSeconds: 10
        successThreshold: 1
        timeoutSeconds: 5

historical

      livenessProbe:
        httpGet:
          path: /status/health
          port: $druid.port
        failureThreshold: 20
        initialDelaySeconds: 5
        periodSeconds: 10
        successThreshold: 1
        timeoutSeconds: 5
      readinessProbe:
        httpGet:
          path: /status/health
          port: $druid.port
        failureThreshold: 10
        initialDelaySeconds: 5
        periodSeconds: 10
        successThreshold: 1
        timeoutSeconds: 5
      startUpProbes:
        httpGet:
          path: /druid/historical/v1/loadstatus
          port: $druid.port
        failureThreshold: 20
        initialDelaySeconds: 180
        periodSeconds: 30
        successThreshold: 1
        timeoutSeconds: 10
      

Testing migration to Kubebuilder native framework

As part of the migration from the Operator SDK to Kubebuilder, we also need to refactor the tests into Kubebuilder's framework and structure in order to be aligned with the project.
That means generating suite_test.go from Kubebuilder and adding our current tests using the Ginkgo and Gomega frameworks.
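A minimal sketch of the Kubebuilder-style suite_test.go this would produce, assuming the standard envtest layout; the package name and CRD path are assumptions.

package controllers

import (
	"testing"

	. "github.com/onsi/ginkgo/v2"
	. "github.com/onsi/gomega"

	"k8s.io/client-go/kubernetes/scheme"
	"k8s.io/client-go/rest"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/envtest"
)

var (
	cfg       *rest.Config
	k8sClient client.Client
	testEnv   *envtest.Environment
)

// TestAPIs wires the Go test binary into the Ginkgo suite.
func TestAPIs(t *testing.T) {
	RegisterFailHandler(Fail)
	RunSpecs(t, "Druid Controller Suite")
}

var _ = BeforeSuite(func() {
	// Start a local control plane with the Druid CRDs loaded;
	// the CRD path is an assumption about the repo layout.
	testEnv = &envtest.Environment{
		CRDDirectoryPaths:     []string{"../config/crd/bases"},
		ErrorIfCRDPathMissing: true,
	}

	var err error
	cfg, err = testEnv.Start()
	Expect(err).NotTo(HaveOccurred())

	// Register the Druid API types with AddToScheme before creating the client.
	k8sClient, err = client.New(cfg, client.Options{Scheme: scheme.Scheme})
	Expect(err).NotTo(HaveOccurred())
})

var _ = AfterSuite(func() {
	Expect(testEnv.Stop()).To(Succeed())
})

Existing tests would then move into regular *_test.go files in the same package as Ginkgo Describe/It blocks.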
