stackabletech / druid-operator

An Operator for Apache Druid for the Stackable Data Platform

License: Other

Dockerfile 0.29% Shell 1.81% Python 7.74% Rust 53.78% Makefile 2.34% Smarty 0.52% Jinja 31.75% Starlark 0.42% Nix 1.22% Just 0.13%

druid-operator's People

Contributors

adwk67, backstreetkiwi, bors[bot], dependabot[bot], fhennig, labrenbe, lfrancke, maleware, maltesander, nicklarsennz, nightkr, razvan, renovate-bot, renovate[bot], sbernauer, siegfriedweber, soenkeliebau, stackable-bot, stefanigel, techassi


druid-operator's Issues

Enable HDFS for deep storage

As a user of Druid services I want to be able to use HDFS for deep storage.

Using local storage for deep storage only works if the storage location is reachable by all relevant Druid services (MiddleManager, Historical server).

  • This would be possible with e.g. PersistentVolumes and PersistentVolumeClaims, but would only realistically be used for prototype/demo cases.
  • A better solution is to use HDFS and deactivate local deep storage: we should offer HDFS anyway, and it lends itself more easily to integration testing (using our own operator).
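Such a deep storage configuration could be surfaced in the CRD roughly like this (a sketch only; the `hdfs` field names follow the discovery-ConfigMap pattern used elsewhere in this repo and are not final):

```yaml
# Sketch of a possible CRD fragment for HDFS deep storage.
# Field names (configMapName, storageDirectory) are assumptions, not final.
deepStorage:
  hdfs:
    configMapName: my-hdfs            # discovery ConfigMap of the HDFS cluster
    storageDirectory: /druid/segments # base path inside HDFS for segments
```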

Loading data locally fails when the s3 extension is loaded

I've noticed this problem here and have already reported it upstream; I wanted to document the problem in our repo as well.

TL;DR: If the S3 extension is loaded but not actually used (local deep storage, loading data locally), errors appear stating that a connection to S3 cannot be made.

We could possibly add a switch to decide whether the extension should be loaded or not. (If proper credentials are provided as well, it works even when the extension is unused.)

There is, however, a feature where you load S3 data from a bucket with credentials that you provide ad hoc in the Web UI. To allow this ad-hoc S3 access, the extension must be loaded at all times. If we instead provide a switch in the CRD controlling whether it is loaded, that is no longer possible.
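Until a dedicated switch exists, the loaded extensions can in principle already be controlled through `configOverrides`, since Druid's standard `druid.extensions.loadList` property determines what gets loaded. A sketch (the extension list shown is just an example, and the surrounding role structure is elided):

```yaml
# Sketch: dropping druid-s3-extensions from the load list via configOverrides.
# druid.extensions.loadList is a standard Druid property; the values are examples.
configOverrides:
  runtime.properties:
    druid.extensions.loadList: '["druid-hdfs-storage", "postgresql-metadata-storage"]'
```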

Properly support S3 structs

When implementing S3 support in our operators we took some shortcuts, like ignoring the accessStyle and TLS settings for now. We need to honor them and not just silently ignore them.

What needs to be done is described in stackabletech/issues#226

Support specifying a namespace to watch

Currently this operator watches resources in all namespaces.
I'd like this to be configurable so I can specify which namespace to watch.

This should be a clap argument (which then can be provided on the command line or in an env var) called --watch-namespace.
It is okay to only take a single namespace for now.

  • Implement the above description
  • Document the default behavior and the new parameter

See stackabletech/issues#162 for the overarching epic
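Since clap arguments can typically also be supplied via environment variables, the operator Deployment could pass the namespace along these lines (a sketch; the environment variable name `WATCH_NAMESPACE` is an assumption, not a decided interface):

```yaml
# Hypothetical: passing the watched namespace to the operator container.
containers:
  - name: druid-operator
    args: ["run", "--watch-namespace=my-namespace"]
    # or, equivalently, via an environment variable (name is an assumption):
    env:
      - name: WATCH_NAMESPACE
        value: my-namespace
```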

Document service discovery

As a user of services deployed by this operator I'd like to know how to discover their connection details.

It's done when

  • documentation is available on all the objects (e.g. ConfigMaps) created by this operator and the circumstances under which they are created and
  • documentation on the contents of the created objects is available.

NOTE: This ticket is part of an epic and was auto-created for all our operators. It might not apply to this operator in particular; in that case please comment and close.
stackabletech/documentation#86

Document Druid Operator

Acceptance Criteria

  • It is documented how to get started with Druid Operator in the Getting Started guide

Refactoring config.rs (hard-coded config properties)

To clean up config.rs we should check all the properties that are set per process and remove the ones that have a default entirely (for now). The ones that are left over should be moved to the product config with a sensible default. At the end, the file should be entirely removed.

We do not map anything into config properties in the resource for now, until we know which properties we should map. Overriding properties is still possible by using configOverrides.

Configuration Reference (states the defaults for each property): https://druid.apache.org/docs/latest/configuration/index.html
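The `configOverrides` escape hatch mentioned above looks like this in a cluster definition (the property shown is just an example of a standard Druid setting; any property from the configuration reference can be overridden this way):

```yaml
# Example: overriding a single Druid property instead of mapping it in the CRD.
brokers:
  configOverrides:
    runtime.properties:
      druid.server.http.numThreads: "100"  # example standard Druid property
```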

Integration-tests relocation and improved stability

As a user I want the Druid integration tests moved to the operator repository, and their stability improved (e.g. currently the Python tests start before the Druid components are reachable).

Add deploying ingestion specs from custom resources

Implementation ticket for #168

The Druid operator should be extended to be able to deploy ingestion specs from definitions provided in CRDs.

Ingestion specs can be defined by the user via a custom resource, which is watched by a controller in the Druid operator that then provisions these specs.

The CRD will contain at least the following:

  • type of source (stream, flat file, ...)
  • failure behavior
    • none
    • retry
    • delete and retry
  • specification of the ingest job
    • inline
    • reference to configmap
    • reference to pvc (or file therein)

These objects will initially be considered read-only, so changes to them will not be propagated to Druid by the controller.

The initial implementation will not be a perfect ingestion task management solution, but rather a first attempt to offer something useful to our users.
The user needs to decide for themselves what the appropriate failure behavior is for the spec they provide to the operator, e.g. whether duplicate data might be created by retrying.
The defined failure options should offer simple solutions for all scenarios:

  • task is too complex -> none: the user will investigate themselves
  • no duplicates expected -> retry: the task is idempotent and can be retried
  • duplicates possible -> delete and retry: not idempotent; to be on the safe side, delete the target before retrying
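Putting the bullet points together, such a custom resource could look roughly like this (everything here is a sketch: the kind, field names and values are placeholders, not a decided design):

```yaml
# Hypothetical custom resource for an ingestion spec; all names are placeholders.
apiVersion: druid.stackable.tech/v1alpha1
kind: DruidIngestionJob          # hypothetical kind
metadata:
  name: my-ingestion
spec:
  clusterRef: druid              # which DruidCluster to submit to
  failureBehavior: retry         # none | retry | deleteAndRetry
  ingestionSpec:
    inline: |                    # alternatively: a ConfigMap or PVC reference
      { "type": "index_parallel", ... }
```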

Add option to enable S3 path style access

Discussed in https://github.com/stackabletech/druid-operator/discussions/213

Originally posted by sbernauer April 5, 2022
In our Docaton we had to use something like this, as we were not able to specify s3.pathStyleAccess: true in the CRD. We need to add that attribute to avoid using custom Druid settings:

apiVersion: druid.stackable.tech/v1alpha1
kind: DruidCluster
metadata:
  name: druid-nytaxidata
spec:
  version: 0.22.1
  zookeeperConfigMapName: simple-druid-znode
  metadataStorageDatabase:
    dbType: postgresql
    connString: jdbc:postgresql://postgresql-druid/druid
    host: postgresql-druid
    port: 5432
    user: druid
    password: druid
  s3:
    endpoint: http://minio:9000
    credentialsSecret: druid-s3-credentials
  deepStorage:
    storageType: s3
    bucket: nytaxidata
    baseKey: storage
  brokers:
    configOverrides:
      runtime.properties:
        druid.s3.enablePathStyleAccess: "true" ### <<< HERE
    roleGroups:
      default:
        selector:
          matchLabels:
            kubernetes.io/os: linux
        config: {}
        replicas: 1

As a user I'd like to add Ingestion specs/Datasources programmatically

I'd like to have a CustomResource/ConfigMap that I can use to define all my Datasources/Ingestions in Druid.

We decided to just take the verbatim JSON as it is generated by the Druid Web UI and have this in a CustomResource.
I thought about having just a ConfigMap instead of a CR, but that opens the question of how to reference the Druid cluster in question.

It could be the other way around: A "config map selector" in Druid that selects all ConfigMaps that should be added but that also doesn't seem very agile. Either way this needs to be authorized later.

  • Please come up with a final architecture (CustomResource vs. other options)
  • Bring up the architecture at the weekly architecture meeting for discussion
  • Implement
  • Add Integration Tests
  • Add documentation

Monitoring

Metrics can either be written to a log file or posted to an HTTP endpoint. To get that to work with Prometheus, we should use the Druid Exporter, deploy it with Druid, and use that for metrics.

Once we run on Docker that should be easy to package together, so I'm blocking this issue on the Docker issue.

Package Druid

I'd like v0.22.0 packaged, together with the Stackable startup script.

This is the directory structure that I'm working with right now:

        druid-0.22.0/
        └── apache-druid-0.22.0/
            ├── bin/
            ├── conf/
            ├── lib/
            ├── stackable/
            │   └── run-druid
            └── ...

Standalone overlord

The Coordinator supports running the Overlord itself, which is how we do it at the moment.

It might be useful to some to have it run in a separate process/pod.

Supposedly, separating them makes Druid more resilient.

Health check technical user with multiple authenticators

Problem

The k8s health check probes make HTTP requests, which get blocked by Druid if authorization is enabled.

Proposed solution

We need a technical user to make these probe requests in case authentication and authorization are enabled; otherwise the endpoints cannot be queried.

Druid supports an authentication and authorization chain with multiple authenticators/authorizers. We can add a second mini-authenticator for just a single health-check user, or maybe reuse the existing basic-auth one, on top of LDAP. We can then use this user to do our health checks. The user should be created automatically, with generated credentials.
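Druid's chain is configured via the `druid.auth.authenticatorChain` property from the basic-security extension. A small basic-auth authenticator next to LDAP could look like this (a sketch using standard Druid property names; the authenticator names "HealthCheck" and "Ldap" are placeholders):

```yaml
# Sketch: a metadata-backed basic-auth authenticator for a health-check user,
# chained before an LDAP-backed one. Authenticator names are placeholders.
configOverrides:
  runtime.properties:
    druid.auth.authenticatorChain: '["HealthCheck", "Ldap"]'
    druid.auth.authenticator.HealthCheck.type: basic
    druid.auth.authenticator.HealthCheck.credentialsValidator.type: metadata
    druid.auth.authenticator.Ldap.type: basic
    druid.auth.authenticator.Ldap.credentialsValidator.type: ldap
```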

Open Questions

RUSTSEC-2020-0036: failure is officially deprecated/unmaintained

failure is officially deprecated/unmaintained

Details
Status unmaintained
Package failure
Version 0.1.8
URL rust-lang-deprecated/failure#347
Date 2020-05-02

The failure crate is officially end-of-life: it has been marked as deprecated
by the former maintainer, who has announced that there will be no updates or
maintenance work on it going forward.

The following are some suggested actively developed alternatives to switch to:

See advisory page for additional details.

Configure Druid to support S3

See Druid documentation
https://druid.apache.org/docs/latest/ingestion/native-batch.html#s3-input-source

Acceptance Criteria

  • Druid is configured to use S3 as input source for data (by the Stackable Druid Operator)
  • Druid is configured to use S3 as deep storage (by the Stackable Druid Operator)
  • A User may load data stored in S3 into Druid
  • A User may query these data, at least via REST call: curl -X POST '<queryable_host>:/druid/v2/?pretty' -H 'Content-Type:application/json' -H 'Accept:application/json' -d @<query_json_file>
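For reference, the ioConfig of a native-batch ingestion task using the S3 input source has the shape documented in the linked Druid docs; bucket, key and format here are examples only:

```json
{
  "type": "index_parallel",
  "inputSource": {
    "type": "s3",
    "uris": ["s3://my-bucket/path/to/data.json"]
  },
  "inputFormat": { "type": "json" }
}
```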

RUSTSEC-2020-0071: Potential segfault in the time crate

Potential segfault in the time crate

Details
Package time
Version 0.1.44
URL time-rs/time#293
Date 2020-11-18
Patched versions >=0.2.23
Unaffected versions =0.2.0,=0.2.1,=0.2.2,=0.2.3,=0.2.4,=0.2.5,=0.2.6

Impact

Unix-like operating systems may segfault due to dereferencing a dangling pointer in specific circumstances. This requires an environment variable to be set in a different thread than the affected functions. This may occur without the user's knowledge, notably in a third-party library.

The affected functions from time 0.2.7 through 0.2.22 are:

  • time::UtcOffset::local_offset_at
  • time::UtcOffset::try_local_offset_at
  • time::UtcOffset::current_local_offset
  • time::UtcOffset::try_current_local_offset
  • time::OffsetDateTime::now_local
  • time::OffsetDateTime::try_now_local

The affected functions in time 0.1 (all versions) are:

  • at
  • at_utc
  • now

Non-Unix targets (including Windows and wasm) are unaffected.

Patches

Pending a proper fix, the internal method that determines the local offset has been modified to always return None on the affected operating systems. This has the effect of returning an Err on the try_* methods and UTC on the non-try_* methods.

Users and library authors with time in their dependency tree should perform cargo update, which will pull in the updated, unaffected code.

Users of time 0.1 do not have a patch and should upgrade to an unaffected version: time 0.2.23 or greater or the 0.3 series.

Workarounds

No workarounds are known.

References

time-rs/time#293

See advisory page for additional details.

Convert Operator to K8S Architecture

These Acceptance Criteria need to be met:

  • Pods are changed to use Docker images instead of the current "packages"
  • hostNetwork is used (for now)
  • Volumes and VolumeMounts are added (for hostPath/local volumes)
  • command line templates are fixed

This depends on stackabletech/docker-images#6 which provides initial Docker images but might require further changes to the images.

Support TLS authentication & encryption with provided certificates

This is the same as we did for ZooKeeper in stackabletech/zookeeper-operator#466 but with a new structure according to stackabletech/issues#293.

apiVersion: druid.stackable.tech/v1alpha1
kind: DruidCluster
metadata:
  name: druid
spec:
  version: 24.0.0-stackable0.1.0
  commonConfig:
    tls:
      # client-server encryption (only server requires a trusted certificate)
      serverSecretClass: String # defaults to "tls"
      # server-server encryption
      internalSecretClass: String # defaults to "tls"
    # This should be a Vector. Can be a vector of Strings but preferably an extra struct containing at least a 
    # String to reference the operator-rs AuthenticationClass (plus optional settings if required)
    authentication: 
      # mTLS (client and server require a trusted certificate)
      - authenticationClass: druid-tls-authentication-class # String
    authorization:
      opa:
        configMapName: druid-opa
    # all other top level configuration should be under shared-/global-/cluster-config as well
    zookeeperConfigMapName: simple-druid-znode
    metadataStorageDatabase:
      dbType: postgresql
      connString: jdbc:postgresql://druid-postgresql/druid
      host: druid-postgresql
      port: 5432
      user: druid
      password: druid
    deepStorage: ...

This is done when

  • Client-Server communication can be encrypted via TLS (on by default)
  • Server-Server communication can be encrypted via TLS (on by default - can be deactivated if performance impacted heavily)
  • The common foundations are used (SecretClass, AuthenticationClass)
  • Documentation has been added and adapted to the new structure
  • Integration tests have been added and adapted to the new structure
  • Examples have been added and adapted to the new structure
  • All top-level fields except version (or image) and stopped are moved to commonConfig (see the next item for the OPA config map)
  • Opa discovery config map name field moved to commonConfig.authorization
  • Feature tracker has been updated (@lfrancke can do this if needed, ping him)

OPA Authorizer

As an admin I'd like all requests to Druid resources optionally be authorized via OpenPolicyAgent (OPA).

To support this in Druid we need to

  • Create a project that implements the Druid Authorizer interface
  • Add integration tests
  • We need to establish a runbook or similar so we don't forget to test/update the authorizer project when we upgrade Druid
  • (Optional for now) We should open a ticket upstream and propose donating the code so it gets integrated into Druid itself
  • We need examples and documentation covering how to use this feature
  • Package the Authorizer in the Docker image
  • Make all necessary (unknown to me at the moment if any are needed) changes to the operator to support it

I'm happy to split this up into multiple tickets instead of one big one.
I can help with that or whoever picks it up is welcome to do so.

Consolidate example files/folders

There are currently three directories with multiple files that together (as far as I can tell) make up a single example.

I'd like each file in the examples folder to be useful on its own which means it should include everything that's needed including a comment at the top detailing what's happening in this file for a new user.
This makes it easy for people to get started by issuing a command like kubectl apply -f https://github.com/stackabletech/.....

I'm fine with keeping three directories but I'm also fine with getting rid of the directories.

But each example should contain the ZK cluster definition as well as a Znode and use said Znode configmap instead of the ZooKeeper wide configmap

RUSTSEC-2021-0124: Data race when sending and receiving after closing a `oneshot` channel

Data race when sending and receiving after closing a oneshot channel

Details
Package tokio
Version 0.1.22
URL tokio-rs/tokio#4225
Date 2021-11-16
Patched versions >=1.8.4, <1.9.0,>=1.13.1
Unaffected versions <0.1.14

If a tokio::sync::oneshot channel is closed (via the
oneshot::Receiver::close method), a data race may occur if the
oneshot::Sender::send method is called while the corresponding
oneshot::Receiver is awaited or calling try_recv.

When these methods are called concurrently on a closed channel, the two halves
of the channel can concurrently access a shared memory location, resulting in a
data race. This has been observed to cause memory corruption.

Note that the race only occurs when both halves of the channel are used
after the Receiver half has called close. Code where close is not used, or where the
Receiver is not awaited and try_recv is not called after calling close,
is not affected.

See tokio#4225 for more details.

See advisory page for additional details.

Implement resource requests and limits for Druid pods

Part of this epic stackabletech/issues#241

Acceptance criteria

  • Resource requests and limits are configurable in CRD using the common structs from operator-rs
  • Resource requests and limits are configured for Kubernetes pods
  • Resource requests and limits are configured in the product (e.g. "-Xmx" etc. for Java based images)
  • Adapt/Add integration tests to specify and test correct amount of resources
  • Adapt/Add examples
  • Adapt documentation: New section in usage.adoc with product specific information and link to common shared resources concept
  • Optional: Use sensible defaults for each role (if reasonable and applicable) and document accordingly in usage.adoc
  • Code contains useful comments
  • Changelog updated
  • Cargo.toml only contains references to git tags (not specific commits or branches)
  • Helm chart can be installed and deployed operator works (or not applicable)
  • Feature Tracker has been updated
  • Followup tickets have been created if needed (e.g. to update demos)

Relevant part of the code: https://github.com/stackabletech/druid-operator/blob/129c5e9769f513c9a9f318392f6b508b2a1f2a81/rust/operator-binary/src/config.rs
The JVM settings are already in there and should be factored out somewhere.
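With the common structs from operator-rs, the role-level configuration could look roughly like this (a sketch; the exact nesting follows the shared resources concept and may differ):

```yaml
# Sketch of resource requests/limits using the operator-rs common structs.
# Field layout is an assumption based on the shared resources concept.
brokers:
  roleGroups:
    default:
      config:
        resources:
          cpu:
            min: 200m
            max: "4"
          memory:
            limit: 2Gi   # would also drive product settings such as -Xmx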

Document monitoring

Documentation is missing for Monitoring.

Please see other operators (e.g. ZooKeeper) for the snippet to copy.

RUSTSEC-2020-0159: Potential segfault in `localtime_r` invocations

Potential segfault in localtime_r invocations

Details
Package chrono
Version 0.4.19
URL chronotope/chrono#499
Date 2020-11-10

Impact

Unix-like operating systems may segfault due to dereferencing a dangling pointer in specific circumstances. This requires an environment variable to be set in a different thread than the affected functions. This may occur without the user's knowledge, notably in a third-party library.

Workarounds

No workarounds are known.

References

See advisory page for additional details.

RUSTSEC-2022-0048: xml-rs is Unmaintained

xml-rs is Unmaintained

Details
Status unmaintained
Package xml-rs
Version 0.8.4
URL https://github.com/netvl/xml-rs/issues
Date 2022-01-26

xml-rs is an XML parser that has open issues around parsing, including integer overflows / panics, which may or may not be an issue with untrusted data.

Together with these open issues and its unmaintained status, xml-rs may or may not be suited to parsing untrusted data.

Alternatives

See advisory page for additional details.

Bootstrap an Apache Druid operator

Implement an initial Druid Operator for all server/process types (https://druid.apache.org/docs/latest/design/processes.html)

Acceptance Criteria

  • Operator can start/stop/restart a Druid Cluster

  • Druid configs can be applied and updated

  • Monitoring is integrated

  • all Process types are supported (Coordinator, Overlord, Broker, Historical, MiddleManager and Peons, Indexer (optional), Router (optional))

  • all Server types are supported (Master, Query, Data)

  • support Maturity Level 1 (is there more to do than in AC 1?)

tbd

Use HDFS ConfigMap to get a reference to the HDFS end-point for deep storage

As a user of Druid services I want to use the HDFS config map to reference the HDFS endpoint for Druid deep storage. Instead of:

  deepStorage:
    hdfs: 
      configMapName: production
      storageDirectory: /data

I want to use the hdfs config map and the properties contained in the "hdfs-site.xml" key therein.

This is done when

RUSTSEC-2021-0139: ansi_term is Unmaintained

ansi_term is Unmaintained

Details
Status unmaintained
Package ansi_term
Version 0.12.1
URL ogham/rust-ansi-term#72
Date 2021-08-18

The maintainer has advised that this crate is deprecated and will not receive any maintenance.

The crate does not seem to have many dependencies and may or may not be OK to use as-is.

Last release seems to have been three years ago.

Possible Alternative(s)

The below list has not been vetted in any way and may or may not contain alternatives;

See advisory page for additional details.

Support LDAP authentication

As a user I'd like to use my existing LDAP/AD credentials to log into Druid. This was already done in e.g. NiFi or Trino. This can be especially helpful for writing tests.

The LDAP support should be integrated in the structure from PR #6 which must be finished first.

apiVersion: druid.stackable.tech/v1alpha1
kind: DruidCluster
metadata:
  name: druid
spec:
  version: 24.0.0-stackable0.1.0
  clusterConfig:
    tls:
      serverSecretClass: String # defaults to "tls"
      internalSecretClass: String # defaults to "tls"
    authentication: 
      - authenticationClass: druid-tls-authentication-class # String
      - authenticationClass: druid-ldap-authentication-class # String

This is done when

  • LDAP is configurable in the CRD using the LDAP structs from operator-rs
  • A user/admin can configure Druid to use a LDAP server for authentication (while still offering existing authentication methods)
  • There is documentation on how to configure Druid with LDAP using the Custom Resource
  • Optional: There is an example demonstrating Druid with LDAP (docs, or example folder)
  • There are tests which include:
    • OpenLDAP is installed and accessible via Druid
    • LDAP authenticated access to Druid works
  • It is added to the feature tracker (ask Lars for help)

This depends on the reference architecture developed in stackabletech/issues#170

Handle stale information/clean up stale resources

Currently our operators will not act on removed information from the CR in some/most/all cases.

One example:
HBase operator has three roles (master, regionServer, restServer). If I create a HBase server CR with a restServer component and then remove it later (entirely, not setting replicas to 0) our operator will not clean up the STS that belongs to this role.

Proposed solution

Use the ClusterResource struct from operator-rs to manage Kubernetes resources belonging to a Cluster object. An example of its usage can be found in the Superset Operator: https://github.com/stackabletech/superset-operator/blob/main/rust/operator-binary/src/superset_controller.rs#L241

Acceptance criteria

  • This is done when all stale Kubernetes resources are cleaned up. A resource becomes stale when it's not part of the current cluster definition anymore.
  • ZNode and S3Connection resources are not deleted because the operator cannot know if they are stale or not.
  • There is at least one test.
  • Documentation on operator implementation is updated with information regarding handling of stale resources.
  • Upgrade to the latest version of operator-rs (0.25 at the moment).

NOTE: This is part of an epic (stackabletech/issues#203) and might not apply to this operator. If that is the case please comment on this issue and just close it. This issue was created as part of a special bulk creation operation.

Dependency Dashboard

This issue provides visibility into Renovate updates and their statuses.

Authorize access to Druid

Authorize access to Druid via simple OPA Rego rules

Acceptance Criteria

  • It is checked whether an individual implementation is needed or if we should use Ranger
  • The Druid OPA Authorizer is implemented (with or without Ranger)
  • The Druid Operator is able to write Rego rules
