thanos-operator's Introduction


Thanos Operator

Thanos Operator is a Kubernetes operator to manage Thanos stack deployment on Kubernetes.

What is Thanos

An open source, highly available Prometheus setup with long-term storage capabilities.

Architecture

Feature highlights

  • Automatic endpoint discovery
  • Manage persistent volumes
  • Metrics configuration
  • Simple TLS configuration

Work in progress

  • Tracing configuration
  • Endpoint validation
  • Certificate management
  • Advanced secret configuration

Documentation

You can find the complete documentation of the Thanos operator here 📘

Commercial support

If you are using the Thanos operator in a production environment and require commercial support, contact Banzai Cloud, the company backing the development of the Thanos operator. If you are looking for the ultimate observability tool for multi-cluster Kubernetes infrastructures to automate the collection, correlation, and storage of logs and metrics, check out One Eye.

Contributing

If you find this project useful, help us:

  • Support the development of this project and star this repo! ⭐
  • If you use the Thanos operator in a production environment, add yourself to the list of production adopters. 🤘
  • Help new users with issues they may encounter πŸ’ͺ
  • Send a pull request with your new features and bug fixes πŸš€

For more information, read the developer documentation.

License

Copyright (c) 2017-2020 Banzai Cloud, Inc.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

thanos-operator's People

Contributors

ahma, asdwsda, bonifaido, bshifter, burningalchemist, dependabot[bot], ecsy, ekarlso, evertonsa, fekete-robert, frizlab, joshuasimon-taulia, kralicky, matthew-beckett, matyix, pavan541cs, pepov, pradels, siliconbrain, tarokkk, ultrafenrir


thanos-operator's Issues

Statefulset metadata label has invalid value (too long)

Describe the bug

The name in the label seems to tag on the remote name:

Warning FailedCreate 14m (x730 over 8d) statefulset-controller create Pod thanos-sausw2-perf-compute-remote-sausw2-perf-compute-rule-0 in StatefulSet thanos-sausw2-perf-compute-remote-sausw2-perf-compute-rule failed error: Pod "thanos-sausw2-perf-compute-remote-sausw2-perf-compute-rule-0" is invalid: metadata.labels: Invalid value: "thanos-sausw2-perf-compute-remote-sausw2-perf-compute-rule-768cbbd9b8": must be no more than 63 characters

This makes the name too long. It seems like these three labels are combined to form the name:

Labels:
app.kubernetes.io/managed-by=thanos-sausw2-perf-general
app.kubernetes.io/name=rule
monitoring.banzaicloud.io/storeendpoint=remote-sausw2-perf-general

Steps to reproduce the issue:

Normal Thanos Operator deployment

Expected behavior
The name should not contain repeated segments that make it unnecessarily long; the metadata label name generation should handle long names more gracefully.

External ServiceMonitor deleted by controller

Describe the bug

The controller assumes that any ServiceMonitor resource whose name matches the Thanos resource is owned by the controller. This causes a ServiceMonitor named foo-compactor to be deleted if there is a Thanos resource in the same namespace named foo.

Steps to reproduce the issue:

Create a Thanos resource named foo with a compactor configuration. The important thing is that it does not have any monitor configuration for the compactor. Create a ServiceMonitor in the same namespace named foo-compactor. Wait until the controller reconciles the Thanos resource and deletes the ServiceMonitor.

Expected behavior

The service monitor created outside of the controller should never be modified or deleted by the controller as it is not the owner of the resource.

Screenshots

N/A

Additional context

I think this occurs because of this logic: it assumes that if there is no service monitor configuration, it should delete any service monitor with the given name, even if the operator did not create it.

delete := &prometheus.ServiceMonitor{
	ObjectMeta: c.getMeta(),
}
return delete, reconciler.StateAbsent, nil

Handle remove receivergroups

Describe the bug
Resources for receiverGroups are currently not removed when a receiverGroup is removed from the CR. This is because receiverGroups is an array, and some extra logic would be required to detect and delete resources for groups that are no longer present in the configuration.

Steps to reproduce the issue:
Create a receiver with two groups, like in the example:
config/samples/monitoring_v1alpha1_receiver.yaml

Expected behavior
The resources that belong to the group are removed.

Additional context
A potential solution would be to use the component reconciler from operator-tools with the receiver as the parent object.

Query and query-frontend can't be scaled up

Describe the bug

The stateless components query and query-frontend have hard-coded replica counts that can't be overridden in deploymentOverrides.

Steps to reproduce the issue:

Create a Thanos instance with spec.queryFrontend.deploymentOverrides.replicas set to 2, and observe that the resulting deployment created by the operator has only one replica.

You can see the hardcoding here: https://github.com/craigfurman/thanos-operator/blob/query-frontend-service-type-configurability/pkg/resources/query_frontend/deployment.go#L40

Expected behavior

Query or QueryFrontend deployments have replica counts equal to the value of the relevant deploymentOverrides.replicas.

Screenshots

Additional context

Typo for this flag --query-frontend.log-queries-longer-than

Describe the bug
Query Frontend supports --query-frontend.log-queries-longer-than flag to log queries running longer than some duration.
The flag is wrongly hardcoded in the Thanos operator: an underscore (_) is used in place of a hyphen (-).
https://github.com/banzaicloud/thanos-operator/blob/master/pkg/sdk/api/v1alpha1/thanos_types.go#L141

Expected behavior
--query-frontend.log-queries-longer-than is the flag expected by Thanos Query Frontend.

Add logging to the manager

Currently the manager does not produce any messages on errors or resource creation.
It would be very useful to have audit-style logs of the manager's ongoing processes and errors, and to be able to configure verbosity.

Ingress API version downgrade breaks helm chart

Describe the bug
It looks to me like this commit downgrades the ingress to v1beta1, but does not undo the field name changes from this commit. This results in the following error when applying:

Error: unable to build kubernetes objects from release manifest: error validating "": error validating data: ValidationError(Ingress.spec.rules[0].http.paths[0].backend): unknown field "service" in io.k8s.api.networking.v1beta1.IngressBackend

Steps to reproduce the issue:
I am applying the helm chart via Terraform:

resource "helm_release" "thanos-operator" {
  repository = "https://kubernetes-charts.banzaicloud.com"
  chart      = "thanos-operator"
  name       = "thanos"
  namespace  = "my-namespace"
  values     = [file("${path.module}/data/values.yaml")]
}

Setting ingress.enabled = true in values.yaml should trigger this bug.

Expected behavior
The ingress should be created.
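For reference, the networking.k8s.io/v1beta1 IngressBackend schema uses serviceName/servicePort rather than the v1 service block, so after the downgrade the chart's template would need to emit something like the following (the service name and port are illustrative):

```yaml
# networking.k8s.io/v1beta1 backend (what the downgraded chart must emit):
backend:
  serviceName: thanos-query
  servicePort: 10902
# networking.k8s.io/v1 backend (only valid if the apiVersion is upgraded again):
# backend:
#   service:
#     name: thanos-query
#     port:
#       number: 10902
```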

Deploy Store memcached

Describe the solution you'd like to see

Thanos Store supports memcached for index caches. It would be useful to manage the memcached instance via the operator.

Describe alternatives you've considered

Running a separate memcached service.
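For context, Thanos Store accepts a memcached cache configuration via --index-cache.config; a sketch of what the operator could render for a managed memcached instance (the service address is illustrative):

```yaml
type: MEMCACHED
config:
  addresses:
    - dnssrv+_memcached._tcp.thanos-store-memcached.monitoring.svc
  timeout: 500ms
  max_idle_connections: 100
```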

Enable flag to merge instead of patch thanos.spec.storeGateway.containerOverrides.volumeMounts

Is your feature request related to a problem? Please describe.

When setting a custom data store in thanos.spec.storeGateway.containerOverrides.volumeMounts, the objectstore configmap gets overwritten.
This largely defeats the purpose of having ObjectStore.

Describe the solution you'd like to see

Add bool merge under thanos.spec.storeGateway.containerOverrides.volumeMounts or set merge/patch as default.

Example

apiVersion: monitoring.banzaicloud.io/v1alpha1
kind: Thanos
metadata:
  name: thanos-sample
spec:
  queryDiscovery: true
  query: {}
  storeGateway:
    containerOverrides:
      volumeMounts:
        - name: thanos-data
          mountPath: ./data
          merge: true
        - mountPath: /etc/config/
          name: objectstore-secret
          readOnly: true
    workloadOverrides:
      volumes:
        - name: thanos-data
          emptyDir: {}
        - name: objectstore-secret
          secret:
            defaultMode: 420
            secretName: thanos-objstore-config

Describe alternatives you've considered

Either make merge/patch the default, or make it opt-in or opt-out.

Additional context

I need to define a location for my volumeMounts because PSP rules forbid writing data to the root directory.

Overriding meta replicas is not working

Describe the bug
If you try to apply the following config:

apiVersion: monitoring.banzaicloud.io/v1alpha1
kind: Thanos
metadata:
  name: thanos
spec:
  clusterDomain: platform-rke-test-env
  enableRecreateWorkloadOnImmutableFieldChange: true
  query:
    metaOverrides:
      replicas: "3"

you still get only 1 query pod.

Steps to reproduce the issue:
Set the following:

  query:
    metaOverrides:
      replicas: "3"

Expected behavior
You should be able to control the number of instances for query and queryFrontend.

Additional context
There is strange behaviour with query: every time I deploy query, even without

   query:
    metaOverrides:
      replicas: "3"

I get 2 instances of query until one of them fails its readiness probe, at which point the second one is marked as terminating.
And if I set replicas to 2 or 3, I get 3 instances of query on startup, but then 2 of them are removed.

The same thing happens with queryFrontend.

Resource quotas and nodeSelectors for the StoreEndpoint spec

Is your feature request related to a problem? Please describe.
When you set up many store-gateways in one cluster, you are not able to set custom nodeSelectors or resource quotas for a particular store (like we can for the compactor, for example).

Describe the solution you'd like to see
Add nodeSelector and resources fields to the StoreEndpoint spec, like those for Compactor and ObjectStore.

Describe alternatives you've considered
Manually editing stores after the operator creates them, but for a large number of stores this is not practical.

Additional context
I can provide a pull request if that is OK with you.

Too many colons in StoreEndpoint url

Describe the bug

When I configure a StoreEndpoint like this:

apiVersion: monitoring.banzaicloud.io/v1alpha1
kind: StoreEndpoint
metadata:
  name: storeendpoint
spec:
  thanos: thanos
  url: https://url.to.my.external.querier:443
  config:
    mountFrom:
      secretKeyRef:
        name: thanos-storage-config
        key: config

I get this error:

fetching store info from https://url.to.my.external.querier:443: rpc error: code = DeadlineExceeded desc = latest connection error: connection error: desc = "transport: Error while dialing dial tcp: address https://url.to.my.external.querier:443: too many colons in address" 

Steps to reproduce the issue:

N/A

Expected behavior

I should have my querier correctly connected

Screenshots

N/A

Additional context

Installed from the helm chart
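The "too many colons in address" error suggests the url field is dialed directly as a gRPC host:port address. Assuming that is the case, dropping the scheme should make the connection work:

```yaml
spec:
  thanos: thanos
  url: url.to.my.external.querier:443
```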

Add support for Affinity/AntiAffinity

Is your feature request related to a problem? Please describe.
I want to be able to configure anti-affinity, for example so pods are not collocated on the same nodes.

Describe the solution you'd like to see
Add affinity/anti-affinity to the supported types.
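For illustration, the spec could accept a standard Kubernetes affinity block such as the following (the label values are examples):

```yaml
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app.kubernetes.io/name: query
        topologyKey: kubernetes.io/hostname
```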

Describe alternatives you've considered

containerOverrides missing in 0.2.* helm chart thanos CRD

Describe the bug

The config to be able to override container configuration for the different parts of thanos is missing.

Steps to reproduce the issue:

Install 0.1.1 helm chart

helm upgrade -i thanos-operator --namespace monitor banzaicloud-stable/thanos-operator --version 0.1.1

Apply a Thanos CR that looks something like this:

apiVersion: monitoring.banzaicloud.io/v1alpha1
kind: Thanos
metadata:
  name: thanos-sample
spec:
  queryDiscovery: true
  query: {}
  rule:
    containerOverrides:
      volumeMounts:
        - name: thanos-data
          mountPath: ./data
        - mountPath: /etc/config/
          name: objectstore-secret
          readOnly: true
    workloadOverrides:
      volumes:
        - name: thanos-data
          emptyDir: {}
        - name: objectstore-secret
          secret:
            defaultMode: 420
            secretName: thanos-objstore-config
  storeGateway:
    containerOverrides:
      volumeMounts:
        - name: thanos-data
          mountPath: ./data
        - mountPath: /etc/config/
          name: objectstore-secret
          readOnly: true
    workloadOverrides:
      volumes:
        - name: thanos-data
          emptyDir: {}
        - name: objectstore-secret
          secret:
            defaultMode: 420
            secretName: thanos-objstore-config

Upgrade to helm chart 0.2.1

helm upgrade thanos-operator --namespace monitor banzaicloud-stable/thanos-operator --version 0.2.1
k apply -f https://raw.githubusercontent.com/banzaicloud/thanos-operator/chart/thanos-operator/0.2.1/charts/thanos-operator/crds/monitoring.banzaicloud.io_objectstores.yaml
k apply -f https://raw.githubusercontent.com/banzaicloud/thanos-operator/chart/thanos-operator/0.2.1/charts/thanos-operator/crds/monitoring.banzaicloud.io_receivers.yaml
k apply -f https://raw.githubusercontent.com/banzaicloud/thanos-operator/chart/thanos-operator/0.2.1/charts/thanos-operator/crds/monitoring.banzaicloud.io_storeendpoints.yaml
k apply -f https://raw.githubusercontent.com/banzaicloud/thanos-operator/chart/thanos-operator/0.2.1/charts/thanos-operator/crds/monitoring.banzaicloud.io_thanos.yaml
k apply -f https://raw.githubusercontent.com/banzaicloud/thanos-operator/chart/thanos-operator/0.2.1/charts/thanos-operator/crds/monitoring.banzaicloud.io_thanosendpoints.yaml
k apply -f https://raw.githubusercontent.com/banzaicloud/thanos-operator/chart/thanos-operator/0.2.1/charts/thanos-operator/crds/monitoring.banzaicloud.io_thanospeers.yaml

Expected behavior

Everything keeps on working like in 0.1.1

Screenshots

Additional context

Instead I get the following CrashLoopBackOff because the pods are unable to write to disk, due to a mutating webhook that applies:

    securityContext:
      allowPrivilegeEscalation: false
      capabilities:
        drop:
        - NET_RAW
      readOnlyRootFilesystem: true

Because of this I get a similar error in both:
statefulsets.apps thanos-sample-storeendpoint-receiver-rule
and
deployment thanos-sample-storeendpoint-receiver-store

➜ k logs thanos-sample-storeendpoint-receiver-store-868b54d476-6qznc              
level=info ts=2021-04-13T08:31:15.0960554Z caller=main.go:152 msg="Tracing will be disabled"
level=info ts=2021-04-13T08:31:15.0962121Z caller=factory.go:46 msg="loading bucket configuration"
level=info ts=2021-04-13T08:31:15.1869422Z caller=inmemory.go:172 msg="created in-memory index cache" maxItemSizeBytes=131072000 maxSizeBytes=262144000 maxItems=maxInt
level=error ts=2021-04-13T08:31:15.1872223Z caller=main.go:186 err="mkdir data: read-only file system\nmeta fetcher\nmain.runStore\n\t/go/src/github.com/thanos-io/thanos/cmd/thanos/store.go:280\nmain.registerStore.func1\n\t/go/src/github.com/thanos-io/thanos/cmd/thanos/store.go:119\nmain.main\n\t/go/src/github.com/thanos-io/thanos/cmd/thanos/main.go:184\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:203\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1373\npreparing store command failed\nmain.main\n\t/go/src/github.com/thanos-io/thanos/cmd/thanos/main.go:186\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:203\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1373"

Allow labels to be added to serviceMonitor resources

Is your feature request related to a problem? Please describe.

When using the prometheus operator, one can configure Prometheus to only look at ServiceMonitors that match a particular label. With the thanos-operator, the serviceMonitor fields are only booleans, and I do not see any way to add labels to them to match what prometheus operator would expect.

Describe the solution you'd like to see

A new field serviceMonitorLabels or something to that effect that would add additional labels to the Service Monitor resource.

Describe alternatives you've considered

Additional context

thanos-store flags: index cache postings compression and relabel configs

Is your feature request related to a problem? Please describe.

We'd like to set the following flags on the thanos store component:

  • --experimental.enable-index-cache-postings-compression
  • --selector.relabel-config / --selector.relabel-config-file

Describe the solution you'd like to see

We could add these options to Thanos.spec.storeGateway.

Describe alternatives you've considered

While no substitute for spec fields, #71 could be a nice escape hatch allowing users to set arbitrary thanos (store) flags before operator support is added.

Additional context

I'm happy to send PRs for this, but it'd be nice to get feedback on #71 first.

Compactor doesn't respect days as units for retention policies

Describe the bug
When I try to set the compactor retention to e.g. 30d, the following error occurs:

E1028 11:06:51.065884 807 reflector.go:178] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:125: Failed to list *v1alpha1.ObjectStore: v1alpha1.ObjectStoreList.Items: []v1alpha1.ObjectStore: v1alpha1.ObjectStore.Spec: v1alpha1.ObjectStoreSpec.Compactor: v1alpha1.Compactor.RetentionResolution5m: RetentionResolution1h: unmarshalerDecoder: time: unknown unit d in duration 30d, error found in #10 byte of ...|n1h":"30d","retentio|..., bigger context ...|or-iam-eks-sbx-eu"}},"retentionResolution1h":"30d","retentionResolution5m":"30d","retentionResolutio|... 28/10/2020 13:07:39

Steps to reproduce the issue:
Set any of retentionResolution1h/retentionResolution5m/retentionResolutionraw as 30d

Expected behavior
It should accept any units (hours, days, years).

Additional context
It works with hours (h).

Query Grafana Datasource Nil pointer

Describe the bug

In the latest release (0.3.3) the option to create grafana datasources for a query was added. This also introduced a nil pointer error when a Thanos instance is created without a query configured.

Steps to reproduce the issue:

Create a Thanos object without a query object configured. Wait for the controller to reconcile the Thanos object and panic due to invalid memory reference.

Expected behavior

Thanos objects without a Query object should be able to be reconciled without the operator crashing.

Screenshots

Additional context

The problem is caused by this code not checking whether the query object is nil.

if q.Thanos.Spec.Query.GrafanaDatasource {

Absence of store endpoints in query with sample configuration

Describe the bug
No store endpoints are registered in query after a single-cluster setup on version 0.1.1.

Steps to reproduce the issue:

Sample configuration from here: https://banzaicloud.com/blog/thanos-operator/#single-cluster-deployment

some secret here

apiVersion: monitoring.banzaicloud.io/v1alpha1
kind: ObjectStore
metadata:
  name: objectstore-sample
spec:
  config:
    mountFrom:
      secretKeyRef:
        name: some-secret 
        key: object-store.yaml
  compactor: {}


apiVersion: monitoring.banzaicloud.io/v1alpha1
kind: StoreEndpoint
metadata:
  name: storeendpoint-sample
spec:
  thanos: thanos-sample
  config:
    mountFrom:
      secretKeyRef:
        name: some-secret
        key: object-store.yaml
  selector: {}


apiVersion: monitoring.banzaicloud.io/v1alpha1
kind: Thanos
metadata:
  name: thanos-sample
spec:
  queryDiscovery: true
  clusterDomain: some-custom-name
  query: {}
  queryFrontend: {}
  storeGateway: {}

Expected behavior
After store-gateway creation it should automatically appear in query "stores"

Wrong discovery domains created when queryDiscovery is set to true

Describe the bug
I followed this blog post to set up the Thanos observer/observee clusters.

When I apply following resource it creates wrong discovery domains:

apiVersion: monitoring.banzaicloud.io/v1alpha1
kind: Thanos
metadata:
  name: query-master
spec:
  query: {}
  queryDiscovery: true

Created result:

apiVersion: apps/v1
kind: Deployment
spec:
  template:
    spec:
      containers:
        - args:
            - query
            - --grpc-address=0.0.0.0:10901
            - --http-address=0.0.0.0:10902
            - --log.level=info
            - --store=dnssrvnoa+_grpc._tcp.thanos-apps-query.monitoring.cluster.local.cluster.local
            - --store=dnssrvnoa+_grpc._tcp.thanos-monitoring-query.monitoring.cluster.local.cluster.local
            - --store=dnssrvnoa+_grpc._tcp.thanos-data-query.monitoring.cluster.local.cluster.local
          image: quay.io/thanos/thanos:v0.15.0

Note the duplicated suffix: cluster.local.cluster.local

Steps to reproduce the issue:

  1. Install the operator:
helm install thanos-operator --namespace monitor banzaicloud-stable/thanos-operator --set manageCrds=false

Installed chart version: 0.1.1

  2. Apply the observer/observee resources from the article -> https://banzaicloud.com/blog/thanos-operator/#observer-cluster-deployment
  3. Run kubectl get deploy thanos-query-master-query -o yaml and look for the wrong discovery domains

Expected behavior
This part:

            - --store=dnssrvnoa+_grpc._tcp.thanos-apps-query.monitoring.cluster.local.cluster.local
            - --store=dnssrvnoa+_grpc._tcp.thanos-monitoring-query.monitoring.cluster.local.cluster.local
            - --store=dnssrvnoa+_grpc._tcp.thanos-data-query.monitoring.cluster.local.cluster.local

Should look like this:

            - --store=dnssrvnoa+_grpc._tcp.thanos-apps-query.monitoring.cluster.local
            - --store=dnssrvnoa+_grpc._tcp.thanos-monitoring-query.monitoring.cluster.local
            - --store=dnssrvnoa+_grpc._tcp.thanos-data-query.monitoring.cluster.local

Node tolerations and node selector on components

Is your feature request related to a problem? Please describe.
Currently we are trying to deploy the compactor to a specific set of nodes with extra disk space. We tried to add workloadOverrides with nodeSelector and tolerations, but they do not seem to be picked up. It's possible we are just using the overrides incorrectly, since the documentation on how to use them is limited.

compactor:
  workloadOverrides:
    tolerations:
      - effect: NoSchedule
        key: dedicated
        value: prometheus
    nodeSelector:
      dedicated: prometheus

Describe the solution you'd like to see
Either have specific fields to add these to the components (compactor and others) or allow the overrides to pull them in.

Describe alternatives you've considered
Patching the deployment after the operator creates it.


store: Implement hashmod sharding

Is your feature request related to a problem? Please describe.

To handle very large TSDB buckets, allow sharding with hashmod

Describe the solution you'd like to see

Specifying a replica count for stores will automatically generate a bucket relabel config that splits with hashmod.

      spec:
        containers:
        - args:
          - store
          - |
            --selector.relabel-config=
              - action: hashmod
                source_labels: ["__block_id"]
                target_label: shard
                modulus: 3
              - action: keep
                source_labels: ["shard"]
                regex: 0

Describe alternatives you've considered

Label or time range sharding is also an option, but IMO this setup would be easier to scale.

Additional context

For example, we have bucket storage with 10k individual TSDB blocks coming from a dozen Prometheus instances. Rather than trying to manually shard based on external labels, we can divide the work reasonably evenly among a number of replicas based on a consistent hash of the block ID.

See the relabelling docs: https://thanos.io/tip/thanos/sharding.md/#relabelling

Ability to configure external caches for thanos store

Is your feature request related to a problem? Please describe.

We use memcached as an external cache for thanos-store, for both the index and chunk caches. See --index-cache.config and --store.caching-bucket.config in https://thanos.io/tip/components/store.md.

Describe the solution you'd like to see

Ideally, there would be configurable fields under Thanos.spec.storeGateway or StoreEndpoint.spec that translate into the YAML snippets that Thanos uses as external cache configs. These could be passed as string literals into flags, or mounted as files from ConfigMaps or Secrets.

Describe alternatives you've considered

Alternatively / in addition, we could add an "extra args" setting to allow users to pass arbitrary flags to thanos store (and other thanos components). It looks like this could either be done in this repo or by adding an "Args" field to ContainerBase in https://github.com/banzaicloud/operator-tools.

Additional context

I'm happy to send a PR, and I'm of course interested in what the maintainers think about the problem, and its proposed solutions.

Make ingresses optional

Is your feature request related to a problem? Please describe.
I do not want to create an ingress for querier, as there's a local grafana instance that will access it in-cluster.

Describe the solution you'd like to see
A bool variable in the CRD spec that disables ingress creation.

Describe alternatives you've considered
N/A

Additional context
N/A

Ingress Annotations

Is your feature request related to a problem? Please describe.

I want to be able to annotate the ingress to let cert-manager and nginx handle the certs etc...

Describe the solution you'd like to see

I'd love to have the ability to do:

HTTPIngress:
  host: query.example.com
  path: /
  annotations:
    "kubernetes.io/ingress.class": nginx
    "cert-manager.io/cluster-issuer": "letsencrypt-prod"
    "kubernetes.io/tls-acme": "true"

Thanos store uses ephemeral storage and fills up the Node storage.

Describe the bug
The Thanos operator is used for deploying the Thanos components (store gateway, query, query frontend).

Thanos store container fills up the ephemeral disk that is only 50GB.

Tried configuring a PVC for the store container, but the operator does not accept the additional PVC volume and gives the error below.

thanos> Reconciler error failed to reconcile resource: failed to create resource: creating resource failed: Deployment.apps "sandbox-us-west-2-thanos-cluster-store" is invalid: spec.template.spec.volumes[0].persistentVolumeClaim: Forbidden: may not specify more than 1 volume type (name: sandbox-us-west-2-thanos-cluster-store, namespace: monitoring, apiVersion: apps/v1, kind: Deployment, name: sandbox-us-west-2-thanos-cluster-store, namespace: monitoring, apiVersion: apps/v1, kind: Deployment) name: sandbox-us-west-2, namespace: monitoring

Here is my Thanos store gateway config given to the Thanos CR:

storeGateway:
  indexCacheSize: 250MB
  deploymentOverrides:
    spec:
      template:
        spec:
          containers:
            - image: thanos/thanos:v0.19.0-rc.0
              imagePullPolicy: Always
              name: store
              volumeMounts:
                - mountPath: /data
                  name: task-pv-storage
          volumes:
            - name: task-pv-storage
              persistentVolumeClaim:
                claimName: clusterstore-pvc

This issue does not happen in other clusters where the ephemeral disk is 100GB. I tried setting indexCacheSize to 10GB, and it still does not take effect.

Has anyone encountered this issue?

How do we configure the Thanos store gateway to use the PVC, or configure the object download limit on the ephemeral disk?

Thanos Operator details:

Thanos Version: thanos:v0.19.0-rc.0

Thanos Operator Version: 0.3.3

Object Storage Provider: S3

Restricting operator permissions to Thanos components/namespaces only

The Helm chart for the operator creates a ClusterRole and ClusterRoleBinding that grant very broad cluster-wide access, including reading all Secrets and manipulating all Deployments. This worries our security folks, who want to minimize attack surface using the principle of least privilege. I wonder whether all this access is really needed, or whether we could get away with a regular Role and RoleBinding inside the Helm release namespace (when the operator only manipulates Thanos components in that namespace), or be able to specify the namespace(s) it may act in.

I do not have an in-depth understanding of the exact Kubernetes permissions the Thanos operator needs, but I think it should be possible to limit it to only managing workloads in Thanos-related namespaces.
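For illustration, a namespace-scoped Role might look like the following; the exact resources and verbs the operator needs are a guess here and would have to be verified against its actual behavior:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: thanos-operator
  namespace: monitoring
rules:
  - apiGroups: ["apps"]
    resources: ["deployments", "statefulsets"]
    verbs: ["create", "get", "list", "watch", "update", "delete"]
  - apiGroups: [""]
    resources: ["secrets", "services", "configmaps"]
    verbs: ["create", "get", "list", "watch", "update", "delete"]
```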

Store option "timeRanges" has no effect

Describe the bug

Setting Thanos.spec.storeGateway.timeRanges, as documented in https://github.com/banzaicloud/thanos-operator/blob/master/docs/types/thanos_types.md, has no effect.

Steps to reproduce the issue:

Upload the following manifests:

apiVersion: monitoring.banzaicloud.io/v1alpha1
kind: Thanos
metadata:
  name: thanos
spec:
  storeGateway:
    timeRanges:
    - maxTime: -24h
---
apiVersion: monitoring.banzaicloud.io/v1alpha1
kind: StoreEndpoint
metadata:
  name: bucket
spec:
  config:
    mountFrom:
      secretKeyRef:
        key: EXAMPLE
        name: EXAMPLE
  thanos: thanos

Observe that the resulting store pods do not have a --max-time flag.

Expected behavior

I'd expect the resultant store pods' Pod.spec.containers.args to have a --max-time=-24h argument.

Screenshots

Additional context

It's possible I've misunderstood the operator code, but looking at this setArgs function it appears this option does nothing. We either need to set the flags via reflection on a thanos struct tag, if that's not too awkward with the nested struct, or write code to handle this field explicitly.

ServiceAccount defaults to 'default' when creating operated resources

Describe the bug
Operated resources are created without a serviceAccount specified, causing the resources to use the default service account for the namespace. In many environments with restrictive pod security policies, service accounts are created with least privilege necessary to instantiate resources.

Steps to reproduce the issue:
Create any object kind in thanos-operator that prompts the generation of a Deployment, and observe that the resulting pods run under the 'default' service account instead of the one installed by the helm chart.

Expected behavior
The thanos-operator would utilize the service account generated by the helm chart, or have the ability to specify the service account to be used when creating operated resources.

Screenshots

ns/monitoring       pod/thanos-operator-6cf7b55df6-jjv6v                                 sa/thanos-operator                               psp/readwritefs                                   state/Running
ns/monitoring       pod/thanos-objstore-bucket-546478d96c-xqzbq                          sa/default                                       psp/restricted                                    state/PendingCreateContainerConfigError
ns/monitoring       pod/thanos-objstore-compactor-5ffd7b764b-9tjjt                       sa/default                                       psp/restricted                                    state/PendingCreateContainerConfigError

Additional context
Utilizing helm-chart version 0.1.0 / operator version banzaicloud/thanos-operator:0.1.0
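
As a sketch of the second option, a per-component field like the serviceAccountName below could let operated workloads opt into a dedicated account. This field is hypothetical (not in the CRD as far as I can tell); the name and placement are illustrative only:

```yaml
apiVersion: monitoring.banzaicloud.io/v1alpha1
kind: Thanos
metadata:
  name: thanos
spec:
  storeGateway:
    # Hypothetical field: run the operated pods under a dedicated,
    # least-privilege service account instead of 'default'.
    serviceAccountName: thanos-components
```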

Error setting query and store response timeouts

Describe the bug

When trying to set:

apiVersion: monitoring.banzaicloud.io/v1alpha1
kind: Thanos
metadata:
  name: thanos
  namespace: monitoring
spec:
  query:
    queryTimeout: 120s
    storeResponseTimeout: 120s

I get the following error from thanos-query pod:

Error parsing commandline arguments: not a valid duration string: "&Duration{Duration:2m0s,}"

Seems like the string value passed to thanos query is not right: the Duration struct appears to be serialized as a whole ("&Duration{Duration:2m0s,}") instead of the plain duration string "2m0s".

Steps to reproduce the issue:

Apply the above yaml for Thanos.

Expected behavior

Expected the query and store response timeouts to be set to 120s.

Screenshots

N/A

Additional context

Please let me know if you need additional information!

Reconciliation for created resources

When you manually change or delete resources created by the operator, nothing happens.
It would be very useful to at least have resources recreated based on the CRD state; even better, the operator would overwrite any manual changes.

Thanos 0.13.0 bucket web has changed arguments

Is your feature request related to a problem? Please describe.

The Thanos operator is spawning bucket web with the wrong args for Thanos 0.13+:

 containers:
  - args:
    - bucket
    - web
    - --log.level=info
    - --http-address=0.0.0.0:10902
    - --objstore.config-file=/etc/config/object-store.yaml
    - --refresh=1800s
    - --timeout=300s
    image: quay.io/thanos/thanos:v0.13.0
    imagePullPolicy: Always
    name: bucket
    ports:
    - containerPort: 10902
      name: http
      protocol: TCP
    resources: {}
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:

Describe the solution you'd like to see

Change the deployment arguments to use the correct subcommand, tools bucket web, instead of the current bucket web.

Additional context

I got a PR coming up after I have tested it.
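
For reference, on Thanos 0.13+ the bucket web UI moved under the tools subcommand, so the corrected container args would start like this (remaining flags unchanged from the deployment above):

```yaml
containers:
- args:
  - tools    # 'bucket web' became 'tools bucket web' in Thanos 0.13
  - bucket
  - web
  - --log.level=info
  - --http-address=0.0.0.0:10902
  - --objstore.config-file=/etc/config/object-store.yaml
  - --refresh=1800s
  - --timeout=300s
```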

Thanos Multi-cluster deployment

Describe the bug
So I'm following this guide: https://banzaicloud.com/docs/one-eye/thanos-operator/quickstarts/multicluster/ on a multi-namespace cluster.
On the peer side I would expect to see 3 pods (or at least 3 containers):
Query, Rule, and a StoreAPI gateway to collect metrics from object storage.
However, only one pod with a single container is set up, for the query deployment.
Using this with the Prometheus operator and Prometheus sidecars, the data is indeed uploaded to object storage,
but I can only query the data from Prometheus itself (shipped via the sidecar).

Steps to reproduce the issue:
Really just copy these YAMLs https://banzaicloud.com/docs/one-eye/thanos-operator/quickstarts/multicluster/

Expected behavior
Have a complete set-up of Query, Ruler and StoreAPI gateway

Ability to deploy Thanos Compactor as a CronJob

Is your feature request related to a problem? Please describe.

The Thanos compactor uses close to no resources except when it is actually compacting data. However, AFAIK with thanos-operator as it stands, the compactor has to run as a Deployment, which means setting CPU/RAM requests and limits sized for active compaction. This can be quite a waste of resources.

Describe the solution you'd like to see

Allow the ability to deploy Thanos compactor as a CronJob instead of a Deployment.

Describe alternatives you've considered

Keeping the Deployment and continuing to waste resources while the compactor is not actually compacting anything.

Additional context

None.
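
A minimal sketch of what such a CronJob could look like, assuming the compactor is invoked in one-shot mode (thanos compact without --wait exits after a single compaction pass). The schedule, names, and flag values are illustrative:

```yaml
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: thanos-compactor
spec:
  schedule: "0 */6 * * *"
  concurrencyPolicy: Forbid   # never run two compactors against the same bucket
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
          - name: compactor
            image: quay.io/thanos/thanos:v0.13.0
            args:
            - compact                 # one-shot without --wait
            - --data-dir=/data
            - --objstore.config-file=/etc/config/object-store.yaml
```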

Helm chart version for 0.1.1

Describe the bug

I see version 0.1.1 of this operator is tagged and available on Dockerhub. The helm chart currently pins the operator's image tag to the Chart.yaml's appVersion, and this can't be overridden.

Would it be possible to release a helm chart for v0.1.1?

Steps to reproduce the issue:

Expected behavior

Screenshots

Additional context

I'd like to try the new query frontend support, which from the dates of various releases looks like it might be available in v0.1.1.

We're currently consuming the helm chart via tanka's helm chart ingestion feature, so I can jsonnet-patch the image tag in the meantime as a workaround.

jsonnet library distribution

Is your feature request related to a problem? Please describe.

The thanos-operator is distributed by helm chart. While the operator's Deployment is simple enough to recreate, there are lengthy, relatively non-customizable CRD and ClusterRole declarations in this repository that helm non-users have to keep up with.

Describe the solution you'd like to see

What do you think of publishing a jsonnet library for the operator's resources, similar to https://github.com/prometheus-operator/kube-prometheus?

This would allow the operator to be easily consumed by jsonnet-based deployment systems, such as https://tanka.dev.

Describe alternatives you've considered

Additional context

If you're interested in this idea, I'm happy to try to PR it.

annotations for kube2iam not working

We are experimenting with Thanos and the Thanos operator in our AWS EKS environment.
We are having an issue with pod annotations for kube2iam; we're not really sure how to set them up using the Thanos config files.

We are using the following configuration, but the annotations do not appear on the pods, and we can see access-denied errors in the logs (attached).

apiVersion: monitoring.banzaicloud.io/v1alpha1
kind: ObjectStore
metadata:
  name: objectstore-ice-2
spec:
  config:
    mountFrom:
      secretKeyRef:
        name: thanos
        key: object-store.yaml
  bucketWeb:
    label: cluster
  compactor:
    workloadMetaOverrides:
      annotations:
        iam.amazonaws.com/role: k8s-thanos-metrics

New Docker Image Tag

Can we get a new Docker tag/version that incorporates the Query Frontend?

Thanks!

queryFrontend doesn't respect flags described in types

Describe the bug
When I try to set up the following configuration:

  queryFrontend:
    queryRangeMaxRetriesPerRequest: 2
    queryRangeMaxQueryParallelism: 14  
    queryRangeSplit: "6h"
    queryRangeResponseCacheMaxFreshness: "5m"

the operator does not add the following args to the deployment:

--query-range.max-retries-per-request
--query-range.max-query-parallelism

Steps to reproduce the issue:
Add the following options to the configuration:

    queryRangeMaxRetriesPerRequest: 
    queryRangeMaxQueryParallelism:  

Expected behavior

--query-range.max-retries-per-request
--query-range.max-query-parallelism

should be added to the deployment args.

Additional context
Maybe it's related to the field type, since other (string-based) options work just fine.

Implement Thanos receive

Is your feature request related to a problem? Please describe.

Describe the solution you'd like to see

Support for receive components.

Describe alternatives you've considered

DIY thanos statefulset installs

Additional context
