The druid-operator from stackabletech

Evaluate whether we can use the "Environment Variable Dynamic Config Provider"

https://druid.apache.org/docs/latest/operations/dynamic-config-provider.html

This can possibly be used to directly get mounted env vars into the config file, for things like zookeeper discovery. If it works we should use it.

Enable HDFS for deep storage

As a user of druid services I want to be able to use HDFS for deep storage.

Using local storage for deep storage only works if the storage location is reachable by all relevant druid services (middle-manager, historical server).

This would be possible with e.g. Persistent Volume and Persistent Volume Claims, but would only realistically be used for prototype/demo cases
a better solution is to use HDFS and deactivate local-deep-storage as we should offer HDFS anyway, and it lends itself more easily to integration testing (using our own operator).

Loading data locally fails when the s3 extension is loaded

I've noticed this problem here and have alread reported it upstream, I wanted to document the problem in our repo as well.

TLDR: If the s3 extension is loaded but not actually used (local deep storage, loading data locally) errors appear that a connection to s3 cannot be made.

We could possibly add a switch to decide whether the extension should be loaded or not. If proper credentials are provided as well, it works.

There is however a feature where you load s3 data from a bucket with credentials that you just provide in the Web UI. To allow this ad-hoc s3 access, the extension must be loaded at all times. If we instead provide a switch in the CRD whether it should be loaded or not, that's not possible anymore.

Model Server Types (Master/Query/Data) with Pod Affinity

Each process is a seperate pod, and some should run together on the same node (Historical & MiddleManager, Overlord & Coordinator) and that can be modelled with pod affinity.

Useful info in the official druid-operator: https://github.com/druid-io/druid-operator

Eliminate duplicate dependency

Properly support S3 structs

When implementing support for S3 in the operators out there we took some shortcuts like ignoring accessStyle and tls settings for now. We need to honor them and not just silently ignore them.

What needs to be done is described in stackabletech/issues#226

Write Kubernetes Events for errors during reconciliation

We would like to emit Kubernetes events for all errors, please see the epic for details stackabletech/issues#158

Document supported product version

The documentation (in the docs/ directory) needs to contain a section on the versions of the products we explicitly support. Please see the epic stackabletech/issues#139 for details

Support specifying a namespace to watch

Currently this operator watches resources in all namespaces.
I'd like this to be configurable so I can specify which namespace to watch.

This should be a clap argument (which then can be provided on the command line or in an env var) called --watch-namespace.
It is okay to only take a single namespace for now.

Implement the above description
Document the default behavior and the new parameter

See stackabletech/issues#162 for the overarching epic

Enable Prometheus scraping

Annotation prometheus.io/scrape: "true" should be added to role services

Document service discovery

As a user of services deployed by this operator I'd like to know how to discover its connection details.

It's done when

documentation is available on all the objects (e.g. ConfigMaps) created by this operator and the circumstances under which they are created and
documentation on the contents of the created objects is available.

NOTE: This ticket is part of an epic and autocreated for all our operators. It might not apply to this operator in particular, in that case please comment and close
stackabletech/documentation#86

Document Druid Operator

Acceptance Criteria

It is documented how to get started with Druid Operator in the Getting Started guide

Add correct probes to Druid containers

We could use the /status/health endpoint or others (/druid/broker/v1/readiness for the broker and historical) to implement readiness probes.

Refactoring config.rs (hard-coded config properties)

To clean up config.rs we should check all the properties that are set per process and remove the ones that have a default entirely (for now). The ones that are left over should be moved to the product config with a sensible default. At the end, the file should be entirely removed.

We do not map anything into config properties in the resource for now, until we know which properties we should map. Overriding properties is still possible by using configOverrides.

Configuration Reference (states the defaults for each property): https://druid.apache.org/docs/latest/configuration/index.html

Integration-tests relocation and improved stability

As a user I want to have the druid integration tests moved to the operator repository and for the stability to be improved (e.g. currently the python tests start before the druid components are reachable).

Investigate Necessity of the Local Deep Storage Label

is this label really necessary?

Add deploying ingestion specs from custom resources

Implementation ticket for #168

The Druid operator should be extended to be able to deploy ingestion specs from definitions provided in CRDs.

Ingestion specs can be defined by the user via a customresource, which is watched by a controller in the Druid operator that then provisions these specs.

The crd will contain at least the following:

type of source (stream, flat file, ...)
failure behavior
- none
- retry
- delete and retry
specification of the ingest job
- inline
- reference to configmap
- reference to pvc (or file therein)

These objects will initially be considered read-only, so changes to them will not be propagated to Druid by the controller.

The initial implementation will not be a perfect ingestion task management solution, but rather a first attempt to offer something useful to our users.
The user needs to decide themselves what the appropriate failure behavior is for the spec they provide to the operator, whether duplicate data might be created by retrying etc.
The defined failure options should offer simple solutions for all scenarios:

task is too complex -> none , the user will investigate themselves
no duplicates expected -> retry , task is idempotent and can be retried
duplicates possible -> delete and retry , not idempotent, to be on the safe side delete the target before retrying

RUSTSEC-2020-0016: `net2` crate has been deprecated; use `socket2` instead

net2 crate has been deprecated; use socket2 instead

Details
Status	unmaintained
Package	`net2`
Version	`0.2.37`
URL	deprecrated/net2-rs@`3350e38`
Date	2020-05-01

The net2 crate has been deprecated
and users are encouraged to considered socket2 instead.

See advisory page for additional details.

Add option to enable S3 path style access

Discussed in https://github.com/stackabletech/druid-operator/discussions/213

^{Originally posted by sbernauer April 5, 2022}
In our Docaton we had to use something like this as we were not able to specify s3.pathStyleAccess: true in the CRD
We need to add that attribute to avoid using custom druid settings

apiVersion: druid.stackable.tech/v1alpha1
kind: DruidCluster
metadata:
  name: druid-nytaxidata
spec:
  version: 0.22.1
  zookeeperConfigMapName: simple-druid-znode
  metadataStorageDatabase:
    dbType: postgresql
    connString: jdbc:postgresql://postgresql-druid/druid
    host: postgresql-druid
    port: 5432
    user: druid
    password: druid
  s3:
    endpoint: http://minio:9000
    credentialsSecret: druid-s3-credentials
  deepStorage:
    storageType: s3
    bucket: nytaxidata
    baseKey: storage
  brokers:
    configOverrides:
      runtime.properties:
        druid.s3.enablePathStyleAccess: "true" ### <<< HERE
    roleGroups:
      default:
        selector:
          matchLabels:
            kubernetes.io/os: linux
        config: {}
        replicas: 1
```</div>

As a user I'd like to add Ingestion specs/Datasources programatically

I'd like to have a CustomResource/ConfigMap that I can use to define all my Datasources/Ingestions in Druid.

We decided to just take the verbatim JSON as it is generated by the Druid Web UI and have this in a CustomResource.
I thought about having just a ConfigMap instead of a CR but that opens the question on how to reference the Druid cluster in question.

It could be the other way around: A "config map selector" in Druid that selects all ConfigMaps that should be added but that also doesn't seem very agile. Either way this needs to be authorized later.

Please come up with a final architecture (CustomResource vs. other options)
Bring up the architecture at the weekly architecture meeting for discussion
Implement
Add Integration Tests
Add documentation

Monitoring

Metrics can either be written to a log file or posted to an HTTP endpoint. To get that to work with Prometheus, we should use Druid Exporter, deploy it with druid and use that for metrics.

Once we run on docker that should be easy to package together so I'm blocking this issue on the docker issue.

Package Druid

I'd like v0.22.0 packaged, together with a the stackable startup script.

This is the directory structure that I'm working with right now:

        druid-0.22.0/
        └── apache-druid-0.22.0
            ├── bin
            ├── conf
            ├── lib
            ├── stackable/
                └── run-druid
            └── ...

Standalone overlord

The coordinator supports running the overlord itself, which is how we do it at the moment.

It might be useful to some to have it run in a seperate process/pod.

supposedly seperating them makes Druid more resilient.

Change version from an enum to a string

This allows for more flexibility and means we don't have to release a new operator for a new upstream version.

Remove deploy/crd folder and generation

This should only be done once updated with templating from stackabletech/operator-templating#55.

Disable CRD serialization from build.rs
Delete deploy/crd folder

healthcheck technical user with multiple authenticators

Problem

The k8s healthcheck probes make HTTP requests which get blocked by Druid if Authorization is enabled.

Proposed solution

We need a technical user to make these probe requests, in case authentication and authorization are enabled. Otherwise the endpoints cannot be queried.

Druid supports an authentication and authorization chain, with multiple authenticators/authorizers. We can add a second mini-authenticator for just a single health-check user, or maybe reuse the existing one for basic auth, on top of LDAP. we can then use this user to do our health checks. The user should be created automatically, with generated credentials.

Open Questions

The healthchecks are made from a k8s Probe: https://docs.rs/k8s-openapi/latest/k8s_openapi/api/core/v1/struct.Probe.html
- does the probe support HTTPS? without HTTPS, basic auth isn't safe.

RUSTSEC-2020-0036: failure is officially deprecated/unmaintained

failure is officially deprecated/unmaintained

Details
Status	unmaintained
Package	`failure`
Version	`0.1.8`
URL	rust-lang-deprecated/failure#347
Date	2020-05-02

The failure crate is officially end-of-life: it has been marked as deprecated
by the former maintainer, who has announced that there will be no updates or
maintenance work on it going forward.

The following are some suggested actively developed alternatives to switch to:

See advisory page for additional details.

Rework: Move Operator to new structure

Please see the epic stackabletech/issues#129 for details

Configure Druid to support S3

See Druid documentation
https://druid.apache.org/docs/latest/ingestion/native-batch.html#s3-input-source

Acceptance Criteria

Druid is configured to use S3 as Input Source for data (by Stackable Druid Operator)
Druid is configured to use S3 as deep staorage (by Stackable Druid Operator)
A User may load data stored in S3 into Druid
A User may query these data, at least via REST call: curl -X POST '<queryable_host>:/druid/v2/?pretty' -H 'Content-Type:application/json' -H 'Accept:application/json' -d @<query_json_file>

RUSTSEC-2020-0071: Potential segfault in the time crate

Potential segfault in the time crate

Details
Package	`time`
Version	`0.1.44`
URL	time-rs/time#293
Date	2020-11-18
Patched versions	`>=0.2.23`
Unaffected versions	`=0.2.0,=0.2.1,=0.2.2,=0.2.3,=0.2.4,=0.2.5,=0.2.6`

Impact

Unix-like operating systems may segfault due to dereferencing a dangling pointer in specific circumstances. This requires an environment variable to be set in a different thread than the affected functions. This may occur without the user's knowledge, notably in a third-party library.

The affected functions from time 0.2.7 through 0.2.22 are:

time::UtcOffset::local_offset_at
time::UtcOffset::try_local_offset_at
time::UtcOffset::current_local_offset
time::UtcOffset::try_current_local_offset
time::OffsetDateTime::now_local
time::OffsetDateTime::try_now_local

The affected functions in time 0.1 (all versions) are:

at
at_utc
now

Non-Unix targets (including Windows and wasm) are unaffected.

Patches

Pending a proper fix, the internal method that determines the local offset has been modified to always return None on the affected operating systems. This has the effect of returning an Err on the try_* methods and UTC on the non-try_* methods.

Users and library authors with time in their dependency tree should perform cargo update, which will pull in the updated, unaffected code.

Users of time 0.1 do not have a patch and should upgrade to an unaffected version: time 0.2.23 or greater or the 0.3 series.

Workarounds

No workarounds are known.

References

time-rs/time#293

See advisory page for additional details.

Convert Operator to K8S Architecture

These Acceptance Criteria need to be met:

Pods are changed to use Docker images instead of the current "packages"
hostNetwork is used (for now)
Volumes and VolumeMounts are added (for hostPath/local volumes)
command line templates are fixed

This depends on stackabletech/docker-images#6 which provides initial Docker images but might require further changes to the images.

Support TLS authentication & encryption with provided certificates

This is the same as we did for ZooKeeper in stackabletech/zookeeper-operator#466 but with a new structure according to stackabletech/issues#293.

apiVersion: druid.stackable.tech/v1alpha1
kind: DruidCluster
metadata:
  name: druid
spec:
  version: 24.0.0-stackable0.1.0
  commonConfig:
    tls:
      # client-server encryption (only server requires a trusted certificate)
      serverSecretClass: String # defaults to "tls"
      # server-server encryption
      internalSecretClass: String # defaults to "tls"
    # This should be a Vector. Can be a vector of Strings but preferably an extra struct containing at least a 
    # String to reference the operator-rs AuthenticationClass (plus optional settings if required)
    authentication: 
      # mTLS (client and server require a trusted certificate)
      - authenticationClass: druid-tls-authentication-class # String
    authorization:
      opa:
        configMapName: druid-opa
    # all other top level configuration should be under shared-/global-/cluster-config as well
    zookeeperConfigMapName: simple-druid-znode
    metadataStorageDatabase:
      dbType: postgresql
      connString: jdbc:postgresql://druid-postgresql/druid
      host: druid-postgresql
      port: 5432
      user: druid
      password: druid
    deepStorage: ...

This is done when

Client-Server communication can be encrypted via TLS (on by default)
Server-Server communication can be encrypted via TLS (on by default - can be deactivated if performance impacted heavily)
The common foundations are used (SecretClass, AuthenticationClass)
Documentation has been added and adapted to the new structure
Integration tests have been added and adapted to the new structure
Examples have been added and adapted to the new structure
All top level fields except version or image and stopped are moved to commonConfig (See next for opa config map)
Opa discovery config map name field moved to commonConfig.authorization
Feature tracker has been updated (@lfrancke can do this if needed, ping him)

OPA Authorizer

As an admin I'd like all requests to Druid resources optionally be authorized via OpenPolicyAgent (OPA).

To support this in Druid we need to

Create a project that implements the Druid Authorizer interface
Add integration tests
We need to establish a runbook or similar so we don't forget to test/update the authorizer project when we upgrade Druid
(Optional for now) We should open a ticket upstream and propose donating the code so it gets integrated into Druid itself
We need examples and documentation covering how to use this feature
Package the Authorizer in the Docker image
Make all necessary (unknown to me at the moment if any are needed) changes to the operator to support it

I'm happy to split this up into multiple tickets instead of one big one.
I can help with that or whoever picks it up is welcome to do so.

Consolidate example files/folders

There are currently three directories with multiple files that together (as far as I can tell) make up a single example.

I'd like each file in the examples folder to be useful on its own which means it should include everything that's needed including a comment at the top detailing what's happening in this file for a new user.
This makes it easy for people to get started by issuing a command like kubectl apply -f https://github.com/stackabletech/.....

I'm fine with keeping three directories but I'm also fine with getting rid of the directories.

But each example should contain the ZK cluster definition as well as a Znode and use said Znode configmap instead of the ZooKeeper wide configmap

RUSTSEC-2021-0124: Data race when sending and receiving after closing a `oneshot` channel

Data race when sending and receiving after closing a oneshot channel

Details
Package	`tokio`
Version	`0.1.22`
URL	tokio-rs/tokio#4225
Date	2021-11-16
Patched versions	`>=1.8.4, <1.9.0,>=1.13.1`
Unaffected versions	`<0.1.14`

If a tokio::sync::oneshot channel is closed (via the
oneshot::Receiver::close method), a data race may occur if the
oneshot::Sender::send method is called while the corresponding
oneshot::Receiver is awaited or calling try_recv.

When these methods are called concurrently on a closed channel, the two halves
of the channel can concurrently access a shared memory location, resulting in a
data race. This has been observed to cause memory corruption.

Note that the race only occurs when both halves of the channel are used
after the Receiver half has called close. Code where close is not used, or where the
Receiver is not awaited and try_recv is not called after calling close,
is not affected.

See tokio#4225 for more details.

See advisory page for additional details.

Add the Indexer

The indexer is experimental, and the druid distribution doesn't provide a nice working example to easily add it right away. So this is spun out into a seperate issue now and deferred to later.

Indexer: https://druid.apache.org/docs/0.22.0/design/indexer.html

Refactor S3 configuration to be compabitle with ADR and operator-rs implementation

As a user of the Druid operator I want to refer to an existing S3 connection as specified in stackabletech/documentation#177

This is done when

the operator is compatible with the ADR linked above to configure our S3 storage (as per the example below)
documentation, examples and integration tests have been updated

See stackabletech/operator-rs#377 which is required for this

Implement resource requests and limits for Druid pods

Part of this epic stackabletech/issues#241

Acceptance criteria

Relevant part of the code: https://github.com/stackabletech/druid-operator/blob/129c5e9769f513c9a9f318392f6b508b2a1f2a81/rust/operator-binary/src/config.rs
There are the jvm settings already in there, should be factored out somewhere

Document monitoring

Documentation is missing for Monitoring.

Please see other operators (e.g. ZooKeeper) for the snippet to copy.

RUSTSEC-2020-0159: Potential segfault in `localtime_r` invocations

Potential segfault in localtime_r invocations

Details
Package	`chrono`
Version	`0.4.19`
URL	chronotope/chrono#499
Date	2020-11-10

Impact

Unix-like operating systems may segfault due to dereferencing a dangling pointer in specific circumstances. This requires an environment variable to be set in a different thread than the affected functions. This may occur without the user's knowledge, notably in a third-party library.

Workarounds

No workarounds are known.

References

time-rs/time#293

See advisory page for additional details.

RUSTSEC-2022-0048: xml-rs is Unmaintained

xml-rs is Unmaintained

Details
Status	unmaintained
Package	`xml-rs`
Version	`0.8.4`
URL	https://github.com/netvl/xml-rs/issues
Date	2022-01-26

xml-rs is a XML parser has open issues around parsing including integer
overflows / panics that may or may not be an issue with untrusted data.

Together with these open issues with Unmaintained status xml-rs
may or may not be suited to parse untrusted data.

Alternatives

quick-xml

See advisory page for additional details.

Bootstrap an Apache Druid operator

Implement initial Druid Operator for all Server-/Process Types (https://druid.apache.org/docs/latest/design/processes.html) (ACs: )

Acceptance Criteria

Operator can start/stop/restart a Druid Cluster
Druid configs can be applied and updated
Monitoring is integrated
all Process types are supported (Coordinator, Overlord, Broker, Historical, MiddleManager and Peons, Indexer (Optional), Router (Optional)
all Server types are supported (Master, Query, Data)
support Maturity Level 1 (Is there more todo than in AC 1?)

tbd

~~ships with license compatible JDBC driver for S3 (Does this really apply? Maybe not necessary (https://druid.apache.org/docs/latest/ingestion/native-batch.html#s3-input-source)~~

Dockerize

Use HDFS ConfigMap to get a reference to the HDFS end-point for deep storage

As a user of druid services I want to use the HDFS config-map to reference the HDFS endpoint for druid deep storage. Instead of:

  deepStorage:
    hdfs: 
      configMapName: production
      storageDirectory: /data

I want to use the hdfs config map and the properties contained in the "hdfs-site.xml" key therein.

This is done when

the operator supports the CRD above
the thing under deepStorage is a complex enum that can be extended to support more options later
the hdfs-site and core-site from the referenced configMap are mounted in the classpath of druid as documented here: https://druid.apache.org/docs/latest/development/extensions-core/hdfs.html

RUSTSEC-2021-0139: ansi_term is Unmaintained

ansi_term is Unmaintained

Details
Status	unmaintained
Package	`ansi_term`
Version	`0.12.1`
URL	ogham/rust-ansi-term#72
Date	2021-08-18

The maintainer has adviced this crate is deprecated and will not
receive any maintenance.

The crate does not seem to have much dependencies and may or may not be ok to use as-is.

Last release seems to have been three years ago.

Possible Alternative(s)

The below list has not been vetted in any way and may or may not contain alternatives;

See advisory page for additional details.

Support LDAP authentication

As a user I'd like to use my existing LDAP/AD credentials to log into Druid. This was already done in e.g. NiFi or Trino. This can be especially helpful for writing tests.

The LDAP support should be integrated in the structure from PR #6 which must be finished first.

apiVersion: druid.stackable.tech/v1alpha1
kind: DruidCluster
metadata:
  name: druid
spec:
  version: 24.0.0-stackable0.1.0
  clusterConfig:
    tls:
      serverSecretClass: String # defaults to "tls"
      internalSecretClass: String # defaults to "tls"
    authentication: 
      - authenticationClass: druid-tls-authentication-class # String
      - authenticationClass: druid-ldap-authentication-class # String

This is done when

LDAP is configurable in the CRD using the LDAP structs from operator-rs
A user/admin can configure Druid to use a LDAP server for authentication (while still offering existing authentication methods)
There is documentation on how to configure Druid with LDAP using the Custom Resource
Optional: There is an example demonstrating Druid with LDAP (docs, or example folder)
There are tests which include:
- OpenLDAP is installed and accessible via Druid
- LDAP authenticated access to Druid works
It is added to the feature tracker (ask Lars for help)

This depends on the reference architecture developed in stackabletech/issues#170

Handle stale information/clean up stale resources

Currently our operators will not act on removed information from the CR in some/most/all cases.

One example:
HBase operator has three roles (master, regionServer, restServer). If I create a HBase server CR with a restServer component and then remove it later (entirely, not setting replicas to 0) our operator will not clean up the STS that belongs to this role.

Proposed solution

Use the ClusterResource struct from operator-rs to manage Kubernetes resources belonging to a Cluster object. An example of its usage can be found in the Superset Operator: https://github.com/stackabletech/superset-operator/blob/main/rust/operator-binary/src/superset_controller.rs#L241

Acceptance criteria

This is done when all stale Kubernetes resources are cleaned up. A resource becomes stale when it's not part of the current cluster definition anymore.
ZNode and S3Connection resources are not deleted because the operator cannot know if they are stale or not.
There is at least one test.
Documentation on operator implementation is updated with information regarding handling of stale resources.
Upgrade to the latest version of operator-rs (0.25 at the moment).

NOTE: This is part of an epic (stackabletech/issues#203) and might not apply to this operator. If that is the case please comment on this issue and just close it. This issue was created as part of a special bulk creation operation.

Dependency Dashboard

This issue provides visibility into Renovate updates and their statuses. Learn more

Open

These updates have all been created already. Click a checkbox below to force a retry/rebase of any.

Update EmbarkStudios/cargo-deny-action action to v1.2.9
Update Rust crate clap to 2.34
Update Rust crate snafu to 0.7
Update Rust crate tokio to 1.15
Update Rust crate clap to v3
Click on this checkbox to rebase all open PRs at once

Check this box to trigger a request for Renovate to run again on this repository

Authorize access to Druid

Authorize access to Druid by simple OPA RegoRules

Acceptance Criteria

It is checked if an individual implementation is needed or if we should use Ranger
Druid OPA Authorizer is implemented (with or without Ranger)
Druid Operator is able to write RegoRules

Druid Authentication

As an analyst I want to log in into Druid having READ or WRITE Access
https://druid.apache.org/docs/latest/operations/security-overview.html
https://druid.apache.org/docs/latest/operations/security-user-auth.html

Acceptance Criteria:

Configuring Authentication is supported by Druid Operator
at least 1 auth method is supported

stackabletech / druid-operator Goto Github PK

druid-operator's People

Contributors

Stargazers

Watchers

druid-operator's Issues

Discussed in https://github.com/stackabletech/druid-operator/discussions/213

Problem

Proposed solution

Open Questions

Impact

Patches

Workarounds

References

Acceptance criteria

Impact

Workarounds

References

Alternatives

Possible Alternative(s)

Proposed solution

Acceptance criteria

Open

Recommend Projects

Recommend Topics

Recommend Org