zalando-incubator / es-operator
Kubernetes Operator for Elasticsearch
Add periodic kubectl logs output to simplify debugging of the e2e process.
Store observedGeneration on the EDS status field such that it is easy to validate the status of the EDS, e.g. in terms of the current number of replicas.
The module path github.com/go-resty/resty found in your /go.mod doesn't match the actual path gopkg.in/resty.v1 found in the dependency's go.mod. Updating the module path in your go.mod to gopkg.in/resty.v1 should resolve this issue.
Running https://github.com/zalando-incubator/es-operator/blob/master/docs/elasticsearchdataset-vct.yaml should result in a working cluster.
Running https://github.com/zalando-incubator/es-operator/blob/master/docs/elasticsearchdataset-vct.yaml fails with a CrashLoopBackOff caused by a failing Elasticsearch bootstrap check.
The initContainer that sets vm.max_map_count is missing in the VCT example.
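A minimal sketch of the missing initContainer, modeled on the init-sysctl container that appears in the kubectl describe output further down this page (image tag and privileged security context taken from there):
spec:
  template:
    spec:
      initContainers:
      - name: init-sysctl
        image: busybox:1.27.2
        # raise the mmap count so the Elasticsearch bootstrap check passes
        command: ["sysctl", "-w", "vm.max_map_count=262144"]
        securityContext:
          privileged: true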
Deployment to just work
The image can't be pulled from the registry because, I'm assuming, it doesn't exist there.
Warning Failed 40m (x4 over 41m) kubelet, xxxxxxxxxxx-central-1.compute.internal Failed to pull image "pierone.stups.zalan.do/poirot/es-operator:latest": rpc error: code = Unknown desc = Error: image poirot/es-operator:latest not found
Warning Failed 40m (x4 over 41m) kubelet, xxxxxxxxxxx-central-1.compute.internal Error: ErrImagePull
Warning Failed 6m27s (x152 over 41m) kubelet, ip-xxxxxxxxxxx-central-1.compute.internal Error: ImagePullBackOff
Normal BackOff 83s (x173 over 41m) kubelet, ip-xxxxxxxxxxx-central-1.compute.internal Back-off pulling image "pierone.stups.zalan.do/poirot/es-operator:latest"
kubectl apply -f https://raw.githubusercontent.com/zalando-incubator/es-operator/master/docs/deployment.yaml
kubectl describe po es-operator-677d44db9f-98rx5
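A possible workaround, assuming the image is published under the public Zalando registry that a version string later in this list points at, is to switch the deployment to that image:
spec:
  template:
    spec:
      containers:
      - name: es-operator
        # tag taken from the "Image registry.opensource.zalan.do/..." report below
        image: registry.opensource.zalan.do/poirot/es-operator:v0.1.0-17-gd237530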
Offer canary deployment for EDS similar to what is provided by StatefulSets. See https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/#partitions
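For comparison, StatefulSets express canaries via a rolling-update partition in the StatefulSet spec (see the linked docs); an EDS equivalent would need a similar knob:
updateStrategy:
  type: RollingUpdate
  rollingUpdate:
    partition: 3   # only pods with ordinal >= 3 are updated; lower ordinals keep the old revision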
When I specify maxShardsPerNode: X, I would assume that the ES Operator would still allow scaling to that target of X shards per node, not stop before. The boundary is treated as non-inclusive, which means we don't save as much cost as we could. For example, with maxShardsPerNode: 30 and 60 shards, scale-down should be allowed to reach 2 nodes (exactly 30 shards per node) instead of stopping at 3.
It is usually sufficient to scale up replicas of one index in a group: the one with the highest traffic. The benefit is increased efficiency of the scaling operation and fewer wasted resources, since we avoid adding replicas for indices that may not require them.
Implementation would require monitoring per-index or per-node CPU stats to identify the hot spot in the cluster group. The indices allocated on this node are potential candidates for scaling out.
The ES Operator could decide to scale down, but while it is scaling down the thresholds for scaling up may be exceeded, i.e. we reach a point where we should stop scaling down, or even scale back up.
At the moment the scaling operation is bound to finish before a new scaling operation can be started.
Based on our meeting at KubeCon we should see how the Zalando es-operator can work together with the elasticsearch operator: https://github.com/elastic/cloud-on-k8s
What should the interface be?
TBD.
When an index does not exist (anymore) the es-operator should continue to work.
When an index that is referenced in the 'current-scaling-operation' doesn't exist anymore, the scaling fails because ES returns a 404 when trying to update number_of_replicas.
es-operator-85d68b858d-q87bh es-operator time="2019-04-28T16:08:43Z" level=info msg="Setting number_of_replicas for index 'index-a' to 1." endpoint="http://es-data-othera.poirot-test.svc.cluster.local.:9200"
apiVersion: zalando.org/v1
kind: ElasticsearchDataSet
metadata:
  annotations:
    es-operator.zalando.org/current-scaling-operation: '{"ScalingDirection":0,"NodeReplicas":3,"IndexReplicas":[{"index":"index-a","pri":5,"rep":1}],"Description":"Keeping shard-to-node ratio (35.67), and decreasing index replicas."}'
I am having trouble creating Persistent Volume Claims; I would appreciate your support.
This is probably related to "volumeClaimTemplates" in "org_elasticsearchdatasets.yaml". Can you check?
Exception Message:
create Claim -es-data-simple-0 for Pod es-data-simple-0 in StatefulSet es-data-simple failed error: PersistentVolumeClaim "-es-data-simple-0" is invalid: metadata.name: Invalid value: "-es-data-simple-0": a lowercase RFC 1123 subdomain must consist of lower case alphanumeric characters, '-' or '.', and must start and end with an alphanumeric character
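The leading dash in "-es-data-simple-0" indicates the volumeClaimTemplate's metadata.name ended up empty, since claim names are formed as <template-name>-<pod-name>. A sketch of a correctly named template (access mode and size are placeholders):
volumeClaimTemplates:
- metadata:
    name: es-data   # must be non-empty so the claim becomes es-data-<pod-name>
  spec:
    accessModes: [ReadWriteOnce]
    resources:
      requests:
        storage: 10Gi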
The ES Operator doesn't separate system indices from ordinary user indices. Elasticsearch has already added a deprecation warning about the new default behavior for system indices, e.g.:
Deprecation: this request accesses system indices: [.kibana_2], but in a future major version, direct access to system indices will be prevented by default
So it makes sense to exclude system indices from es-operator management.
I have been watching the work you and the team are doing on the Elasticsearch Operator. I am not sure if you've seen the news that Elasticsearch has moved to the proprietary SSPL license and is no longer open source. AWS and others such as logz.io (my employer) have gotten behind a new Apache 2.0 fork which is focused on different goals than building the Elastic business model. We are interested in seeing if we can integrate and somehow use the nice job you've all done on the Operator with OpenSearch. Do you plan on moving to OpenSearch over at Zalando?
Thanks!
When calling kubectl get eds, the desired and current size of the EDS should be reported. Only the desired size is reported.
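One way to surface both values, assuming the current size is tracked on the status subresource, is additionalPrinterColumns on the CRD version; a sketch in apiextensions.k8s.io/v1 style:
additionalPrinterColumns:
- name: Desired
  type: integer
  jsonPath: .spec.replicas
- name: Current
  type: integer
  jsonPath: .status.replicas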
Several people have asked if the operator can somehow avoid node draining in cases where you have a PVC and don't actually need the draining in order to safely move the data around.
If possible we should add a feature where we can prepare a node for update by making sure it doesn't get traffic (not sure how to do this), and then simply delete the pod and let Kubernetes reschedule it and attach the PVC to the new pod. Once ready, it can get traffic again.
Draining would still be needed in case of scaledown to ensure there is no data loss (depending on index configuration).
Our current node-group based index allocation is mainly due to the fact that the traffic pattern for certain indices is similar. This served fairly well in the past, but it has certain limitations.
As a result we can end up with sub-optimal resource utilisation in our cluster: While some nodes may be under-utilised, other nodes could offload some shards there to balance their load, before having to scale up.
The proposed solution may look like this: based on the assumption that all nodes should be utilised equally, we try to manually balance the shard-to-node allocation in es-operator, optimising it against a cost function.
Unable to enable basic authentication for Elasticsearch with username and password.
Is there any other way to enable authentication for Elasticsearch with username and password?
I am trying to enable xpack.security and I am facing errors. I have just added xpack.security.enabled: true in es-config.
apiVersion: v1
kind: ConfigMap
metadata:
  name: es-config
  namespace: es-operator-demo
data:
  elasticsearch.yml: |
    cluster.name: es-cluster
    network.host: "0.0.0.0"
    bootstrap.memory_lock: false
    discovery.seed_hosts: [es-master]
    cluster.initial_master_nodes: [es-master-0]
    xpack.security.enabled: true
I have tried different versions of the Elasticsearch Docker images and also changing ES_JAVA_OPTS. I used all the configuration given in https://github.com/zalando-incubator/es-operator/tree/master/docs.
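One thing worth checking: with xpack.security.enabled: true the built-in users also need a password bootstrapped. For the official Elasticsearch image this can be done via an environment variable on the data node container; a sketch (the secret name is an assumption):
env:
- name: ELASTIC_PASSWORD        # sets the password of the built-in elastic user
  valueFrom:
    secretKeyRef:
      name: es-credentials      # assumed secret holding the password
      key: password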
Creating a cluster definition where the pod template doesn't include either a nodeSelector or an affinity that's compatible with the priority node selector should either result in an error, or in an automatically modified pod template, so the cluster behaves correctly during updates.
If the users don't define a nodeSelector or an affinity in the pod spec, but define a priority node selector in the operator configuration, it's highly likely that a rolling cluster update will not be handled correctly. For example, if the cluster management software is currently draining node A, and the cluster pods live on nodes A, B and C that are all scheduled to be drained, it's possible that the operator will just keep deleting the pod on node B, which would be rescheduled to the same node again and again, and will never actually proceed to node A.
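A sketch of a pod template that is compatible with a priority node selector; the label key and value here are assumptions and must mirror whatever is configured on the operator:
spec:
  template:
    spec:
      nodeSelector:
        lifecycle-status: ready   # assumed label; must match the operator's priority node selector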
First of all, I appreciate you sharing this operator. Currently, I'm gaining some hands-on experience with it and I'm encountering some strange behaviour (to my best knowledge). I'm trying to apply a new configuration to my EDS, but it's having trouble updating the individual pods. Please correct me if I'm doing anything stupid/unsupported.
The logs below show all relevant logging from a single "loop". Notice how it says that it deleted pod demo/es-data1-0, and then directly afterwards that pod demo/es-data1-0 should be updated.
The scaling config is: enabled: true, minReplicas: 1, minIndexReplicas: 0.
time="2021-05-20T13:12:25Z" level=info msg="Ensuring cluster is in green state" endpoint="http://es-data1.demo.svc.cluster.local.:9200"
time="2021-05-20T13:12:25Z" level=info msg="Event(v1.ObjectReference{Kind:\"ElasticsearchDataSet\", Namespace:\"demo\", Name:\"es-data1\", UID:\"22b3dd79-41b6-4165-bc7f-ad78557d7959\", APIVersion:\"zalando.org/v1\", ResourceVersion:\"6271955\", FieldPath:\"\"}): type: 'Normal' reason: 'DrainingPod' Draining Pod 'demo/es-data1-0'"
time="2021-05-20T13:12:25Z" level=info msg="Disabling auto-rebalance" endpoint="http://es-data1.demo.svc.cluster.local.:9200"
time="2021-05-20T13:12:26Z" level=info msg="Excluding pod demo/es-data1-0 from shard allocation" endpoint="http://es-data1.demo.svc.cluster.local.:9200"
time="2021-05-20T13:12:26Z" level=info msg="Waiting for draining to finish" endpoint="http://es-data1.demo.svc.cluster.local.:9200"
time="2021-05-20T13:12:26Z" level=info msg="Found 0 remaining shards on demo/es-data1-0 (10.244.3.147)" endpoint="http://es-data1.demo.svc.cluster.local.:9200"
time="2021-05-20T13:12:26Z" level=info msg="Event(v1.ObjectReference{Kind:\"ElasticsearchDataSet\", Namespace:\"demo\", Name:\"es-data1\", UID:\"22b3dd79-41b6-4165-bc7f-ad78557d7959\", APIVersion:\"zalando.org/v1\", ResourceVersion:\"6271955\", FieldPath:\"\"}): type: 'Normal' reason: 'DrainedPod' Successfully drained Pod 'demo/es-data1-0'"
time="2021-05-20T13:12:26Z" level=info msg="Event(v1.ObjectReference{Kind:\"ElasticsearchDataSet\", Namespace:\"demo\", Name:\"es-data1\", UID:\"22b3dd79-41b6-4165-bc7f-ad78557d7959\", APIVersion:\"zalando.org/v1\", ResourceVersion:\"6271955\", FieldPath:\"\"}): type: 'Normal' reason: 'DeletingPod' Deleting Pod 'demo/es-data1-0'"
time="2021-05-20T13:12:42Z" level=info msg="Event(v1.ObjectReference{Kind:\"ElasticsearchDataSet\", Namespace:\"demo\", Name:\"es-data1\", UID:\"22b3dd79-41b6-4165-bc7f-ad78557d7959\", APIVersion:\"zalando.org/v1\", ResourceVersion:\"6271955\", FieldPath:\"\"}): type: 'Normal' reason: 'DeletedPod' Successfully deleted Pod 'demo/es-data1-0'"
time="2021-05-20T13:12:42Z" level=info msg="Setting exclude list to ''" endpoint="http://es-data1.demo.svc.cluster.local.:9200"
time="2021-05-20T13:12:42Z" level=info msg="Enabling auto-rebalance" endpoint="http://es-data1.demo.svc.cluster.local.:9200"
time="2021-05-20T13:12:43Z" level=info msg="Pod demo/es-data1-0 should be updated. Priority: 5 (NodeSelector,PodOldRevision,STSReplicaDiff)"
time="2021-05-20T13:12:43Z" level=info msg="Pod demo/es-data1-1 should be updated. Priority: 5 (NodeSelector,PodOldRevision,STSReplicaDiff)"
time="2021-05-20T13:12:43Z" level=info msg="Found 2 Pods on StatefulSet demo/es-data1 to update"
time="2021-05-20T13:12:43Z" level=info msg="StatefulSet demo/es-data1 has 1/2 ready replicas"
Logged events in autoscaler.go should have structured context (the EDS). Currently no or only unstructured context is provided.
Does it support Elasticsearch v7.0?
We could potentially find and resolve disk-issues before ES blocks writing to the index. At the moment the only disk-based check we do is to prevent scaling down in case of high disk usage.
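That existing check is driven by the scaling option that already appears in the EDS specs elsewhere in this list:
scaling:
  diskUsagePercentScaledownWatermark: 80   # scale-down is prevented above this disk usage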
From time to time the operator outputs these logs:
time="2019-04-27T09:01:59Z" level=error msg="Failed to operate resource: failed to ensure resources: resource name may not be empty"
time="2019-04-27T09:02:02Z" level=error msg="Failed to operate resource: failed to ensure resources: resource name may not be empty"
time="2019-04-27T09:02:05Z" level=error msg="Failed to operate resource: failed to ensure resources: PodDisruptionBudget poirot/es-data-prio2a is not owned by the ElasticsearchDataSet poirot/es-data-prio2a"
time="2019-04-27T09:02:09Z" level=error msg="Failed to operate resource: failed to ensure resources: resource name may not be empty"
time="2019-04-27T09:02:12Z" level=error msg="Failed to operate resource: failed to ensure resources: resource name may not be empty"
This essentially prevents it from operating on the resource, meaning no scale-up/scale-down can happen. Restarting the operator fixes this for some time.
I fear that we are missing a deepcopy somewhere.
Only log that es-operator is scaling if there is an actual change in replicas.
es-operator keeps logging "Updating desired scaling for EDS .... New desired replicas: 20. Decreasing node replicas to 20.", although the current replicas are already 20 (= minReplicas).
One EDS that shows this behaviour:
spec:
  replicas: 20
  scaling:
    diskUsagePercentScaledownWatermark: 0
    enabled: true
    maxIndexReplicas: 4
    maxReplicas: 40
    maxShardsPerNode: 30
    minIndexReplicas: 4
    minReplicas: 20
    minShardsPerNode: 12
    scaleDownCPUBoundary: 25
    scaleDownCooldownSeconds: 600
    scaleDownThresholdDurationSeconds: 600
    scaleUpCPUBoundary: 40
    scaleUpCooldownSeconds: 120
    scaleUpThresholdDurationSeconds: 60
Image registry.opensource.zalan.do/poirot/es-operator:v0.1.0-17-gd237530
An Elasticsearch version upgrade is a situation where the number of spare instances needs to exceed the number of index replicas in order to allow both primaries and replicas to be allocated on one of the new nodes. This is different from a normal rolling restart, where one extra instance is enough.
To accommodate this, we either need to make the es-operator aware of a version upgrade and treat it specially, or allow users to define the spare instances in the EDS (e.g. spec.maxSurge) to control the es-operator behavior during the rolling restart. Or we don't change anything, and users will need to control the version upgrade by temporarily increasing minReplicas.
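A sketch of the second option; note that spec.maxSurge is a hypothetical field proposed in this issue, not part of the current EDS API:
spec:
  replicas: 3
  maxSurge: 2   # hypothetical: number of spare instances allowed during a rolling restart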
There are situations when ES will refuse to drain a given node (usually allocation constraints like max. number of shards per index and node). This will cause ES Operator to wait indefinitely for the draining to finish. At some point the scale-down event gets superseded by a scale-up event.
This should lead to the previously to-be-drained node being used again.
What happens instead is that the IP stays in the cluster.routing.allocation.exclude._ip list, and the scale-up event only causes the StatefulSet to be updated, spawning new nodes. This leaves the node in a commissioned but unused state.
Query :9200/_cluster/settings to see the IP still being in there.

Given an EDS size of 5 nodes (for whatever reason it scaled to this number...), and an index with 1 replica and 4 primaries (i.e. the current shard-to-node ratio is 8/5 = 1.6), I would expect the next scaling-up operation to snap to a non-fractional shard-to-node ratio of 8/8 = 1.0.
The ES operator reduces the shard-to-node ratio by one, leading to 8/10, making a shard-to-node ratio of 0.8. This is an issue for several reasons:
a) Imbalance of load because some nodes get a different load than others
b) In this case, some nodes don't get any shards allocated at all, although one could mitigate this by setting maxReplicas.
Should be able to handle a lot more load easily
All pods - EDS data and es-master - are failing health checks. The entire cluster is just crashing.
kubectl -n elasticsearch-zalando describe elasticsearchdataset.zalando.org/es-data-zalando gives the following response:
Name: es-storage
Command:
sysctl
-w
vm.max_map_count=262144
Image: busybox:1.27.2
Name: init-sysctl
Resources:
Limits:
Cpu: 50m
Memory: 50Mi
Requests:
Cpu: 50m
Memory: 50Mi
Security Context:
Privileged: true
Service Account Name: operator
Volumes:
Config Map:
Items:
Key: elasticsearch.yml
Path: elasticsearch.yml
Name: es-config
Name: elasticsearch-config
Volume Claim Templates:
Metadata:
Annotations:
Volume . Beta . Kubernetes . Io / Storage - Class: fast
Creation Timestamp: <nil>
Name: es-storage
Spec:
Access Modes:
ReadWriteOnce
Data Source: <nil>
Resources:
Requests:
Storage: 100Gi
Storage Class Name: fast
Status:
Status:
Last Scale Up Started: 2019-09-18T17:30:22Z
Observed Generation: 4
Replicas: 5
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal DrainingPod 4m10s (x278 over 5h) es-operator Draining Pod 'elasticsearch-zalando/es-data-zalando-0'
Explain how to upgrade your Elasticsearch cluster from ES6 to ES7 without downtime.
The service will stop working on the 15th of April, 2020.
https://medium.com/golangci/golangci-com-is-closing-d1fc1bd30e0e
README or docs should contain a complete tutorial to get started, i.e. including how to deploy Elasticsearch master nodes etc. The tutorial could use kind or Minikube.
When an EDS needs to both scale and be rolled because of an update to the EDS or a cluster update, the operator should prefer to scale the EDS rather than rolling the pods, as extra capacity might be more important than e.g. moving pods to new cluster nodes.
Currently the operator always checks if any pods need to be drained before it considers scaling the EDS. This means that if you have an EDS with say 20 pods and a cluster update is ongoing, you could wait for all 20 pods to be upgraded before a potential scale-up could be applied. We saw this in production where an EDS was stuck at 35 pods, but the autoscaler recommended scaling to 48.
I propose that we always favor scale-up over draining pods for a rolling upgrade. That is: if eds.Spec.Replicas > sts.Spec.Replicas, then rescale the STS before doing anything else.
Scale-down should generally also be favored over a rolling upgrade, because it's pointless to upgrade pods which would be scaled down anyway; however, it might make sense to favor moving pods on draining nodes before scaling down, to ensure that a pod is moved before a node is forcefully terminated.
When updating an EDS without auto-scaling, one should be able to set the desired replicas to 0 in order to drain all data from an EDS.
The EDS change is being rejected by the verification of the manifest.
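What one would like to be able to apply, which the manifest validation currently rejects:
spec:
  replicas: 0   # drain all data from the EDS
  scaling:
    enabled: false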
I have an overall question about es-operator. I am migrating from the official ES helm chart (only the data nodes; the other nodes are still deployed with helm). However, I am stuck on the last part: configuring my ingress.
On the official helm chart, I can set up an ingress / services like:
ingress:
  enabled: True
  annotations:
    kubernetes.io/ingress.class: alb
    alb.ingress.kubernetes.io/target-type: instance
    alb.ingress.kubernetes.io/subnets: subnet-XX,subnet-YY
    alb.ingress.kubernetes.io/listen-ports: '[{"HTTP": 9200}]'
    alb.ingress.kubernetes.io/healthcheck-path: '/_cluster/health'
  hosts: [""]
  path: "/*"
  tls: []
service:
  type: LoadBalancer
This will create 2 services (one load balancer and one headless).
However, I cannot find a way to set up this load balancer correctly. From what I saw in the code, the operator needs to own the service and will create a NodePort service.
Is there any way to solve that?
What I will do in the meantime is deploy some coordinating nodes with the load balancer, so I can access the Elasticsearch cluster (or do that on the master nodes for testing).
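A possible workaround until then: a second, user-managed Service that the operator does not own, selecting the same data pods. The selector label here is an assumption and must match the EDS pod template:
apiVersion: v1
kind: Service
metadata:
  name: es-data-lb
spec:
  type: LoadBalancer
  selector:
    application: es-data   # assumed pod label from the EDS pod template
  ports:
  - port: 9200
    targetPort: 9200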
Goals
We could use the new Elasticsearch machine-learning features to predict scaling requirements from traffic (https://www.elastic.co/products/stack/machine-learning). Alternatively, a Kalman filter might work as well.
Shard replicas should not be added before nodes are ready.
As you can see in the actual behavior, because the EDS is updated, I guess the existing operator loop gets cancelled and restarted, but it does not check for readiness of the StatefulSet.
time="2020-09-24T09:00:52Z" level=error msg="Failed to operate resource: failed to rescale StatefulSet: StatefulSet es/data-id-id-v2 is not stable: 2/4 replicas ready"
time="2020-09-24T09:00:52Z" level=info msg="Scaling hint: UP" eds=data-id-id-v2 namespace=es
time="2020-09-24T09:00:52Z" level=info msg="Updating last scaling event in EDS 'es/data-id-id-v2'"
time="2020-09-24T09:00:52Z" level=info msg="Waiting for operation to stop" eds=data-id-id-v2 namespace=es
time="2020-09-24T09:00:52Z" level=info msg="Terminating operator loop." eds=data-id-id-v2 namespace=es
time="2020-09-24T09:00:52Z" level=info msg="Updating desired scaling for EDS 'es/data-id-id-v2'. New desired replicas: 4. Keeping shard-to-node ratio (1.00), and increasing index replicas."
time="2020-09-24T09:00:52Z" level=info msg="Waiting for operation to stop" eds=data-id-id-v2 namespace=es
time="2020-09-24T09:00:52Z" level=info msg="Terminating operator loop." eds=data-id-id-v2 namespace=es
time="2020-09-24T09:00:52Z" level=info msg="Event(v1.ObjectReference{Kind:\"ElasticsearchDataSet\", Namespace:\"es\", Name:\"data-id-id-v2\", UID:\"c63525d9-2b50-4b7d-9a34-102fed5c8327\", APIVersion:\"zalando.org/v1\", ResourceVersion:\"12281498\", FieldPath:\"\"}): type: 'Normal' reason: 'UpdatedStatefulSet' Updated StatefulSet 'es/data-id-id-v2'"
time="2020-09-24T09:00:52Z" level=info msg="Waiting for operation to stop" eds=data-id-id-v2 namespace=es
time="2020-09-24T09:00:52Z" level=info msg="Event(v1.ObjectReference{Kind:\"ElasticsearchDataSet\", Namespace:\"es\", Name:\"data-id-id-v2\", UID:\"c63525d9-2b50-4b7d-9a34-102fed5c8327\", APIVersion:\"zalando.org/v1\", ResourceVersion:\"12281502\", FieldPath:\"\"}): type: 'Normal' reason: 'ChangingReplicas' Changing replicas 2 -> 4 for StatefulSet 'es/data-id-id-v2'"
time="2020-09-24T09:00:52Z" level=info msg="StatefulSet es/data-id-id-v2 has 2/4 ready replicas"
time="2020-09-24T09:00:52Z" level=error msg="Failed to operate resource: failed to rescale StatefulSet: StatefulSet es/data-id-id-v2 is not stable: 2/4 replicas ready"
time="2020-09-24T09:00:52Z" level=info msg="Terminating operator loop." eds=data-id-id-v2 namespace=es
time="2020-09-24T09:00:52Z" level=info msg="Waiting for operation to stop" eds=data-id-id-v2 namespace=es
time="2020-09-24T09:00:52Z" level=info msg="Setting number_of_replicas for index 'id_id_v2' to 1." endpoint="http://data-id-id-v2.es.svc.cluster.local.:9200"
time="2020-09-24T09:00:52Z" level=info msg="Terminating operator loop." eds=data-id-id-v2 namespace=es
time="2020-09-24T09:00:52Z" level=info msg="Waiting for operation to stop" eds=data-id-id-v2 namespace=es
time="2020-09-24T09:00:52Z" level=info msg="Terminating operator loop." eds=data-id-id-v2 namespace=es
time="2020-09-24T09:01:22Z" level=info msg="Not scaling up, currently in cool-down period." eds=data-id-id-v2 namespace=es
Users that have only namespace-wide permissions to define Roles (and are forbidden from defining ClusterRoles) should have an option to play with.
kubectl apply -f docs/cluster-roles.yaml fails when the user does not have cluster-wide privileges. kubectl apply -f docs/cluster-roles.yaml leads to:
serviceaccount/operator created
Error from server (Forbidden): error when retrieving current configuration of:
Resource: "rbac.authorization.k8s.io/v1, Resource=clusterroles", GroupVersionKind: "rbac.authorization.k8s.io/v1, Kind=ClusterRole"
Name: "es-operator", Namespace: ""
Object: &{map["apiVersion":"rbac.authorization.k8s.io/v1" "kind":"ClusterRole" "metadata":map["annotations":map["kubectl.kubernetes.io/last-applied-configuration":""] "name":"es-operator"] "rules":[map["apiGroups":["<xxxx>"] "resources":["elasticsearchdatasets" "elasticsearchdatasets/status" "elasticsearchmetricsets" "elasticsearchmetricsets/status"] "verbs":["get" "list" "watch" "update" "patch"]] map["apiGroups":[""] "resources":["pods" "services"] "verbs":["get" "watch" "list" "create" "update" "patch" "delete"]] map["apiGroups":["apps"] "resources":["statefulsets"] "verbs":["get" "create" "update" "patch" "delete" "watch" "list"]] map["apiGroups":["policy"] "resources":["poddisruptionbudgets"] "verbs":["get" "create" "update" "patch" "delete" "watch" "list"]] map["apiGroups":[""] "resources":["events"] "verbs":["create" "patch" "update"]] map["apiGroups":[""] "resources":["nodes"] "verbs":["get" "list" "watch"]] map["apiGroups":["metrics.k8s.io"] "resources":["pods"] "verbs":["get" "list" "watch"]]]]}
from server for: "cluster-roles.yaml": clusterroles.rbac.authorization.k8s.io "es-operator" is forbidden: User "<xxxx>" cannot get resource "clusterroles" in API group "rbac.authorization.k8s.io" at the cluster scope
Error from server (Forbidden): error when retrieving current configuration of:
Resource: "rbac.authorization.k8s.io/v1, Resource=clusterrolebindings", GroupVersionKind: "rbac.authorization.k8s.io/v1, Kind=ClusterRoleBinding"
Name: "es-operator", Namespace: ""
Object: &{map["apiVersion":"rbac.authorization.k8s.io/v1" "kind":"ClusterRoleBinding" "metadata":map["annotations":map["kubectl.kubernetes.io/last-applied-configuration":""] "name":"es-operator"] "roleRef":map["apiGroup":"rbac.authorization.k8s.io" "kind":"ClusterRole" "name":"es-operator"] "subjects":[map["kind":"ServiceAccount" "name":"operator" "namespace":"kube-system"]]]}
from server for: "cluster-roles.yaml": clusterrolebindings.rbac.authorization.k8s.io "es-operator" is forbidden: User "<xxxx>" cannot get resource "clusterrolebindings" in API group "rbac.authorization.k8s.io" at the cluster scope
In the case where spec.Replicas < scaling.MinReplicas, the spec.Replicas is only automatically updated if there is a scaling operation (UP/DOWN).
This can cause confusion when looking at the EDS, because status.Replicas would be greater than spec.Replicas while everything would be stable.
It would be helpful if the operator automatically adjusted spec.Replicas to at least scaling.MinReplicas when scaling is enabled and scaling.MinReplicas > spec.Replicas.
Scaling down should work, even if two stacks get scaled down simultaneously.
Only the last IP gets persisted in the exclude._ip setting, and as such only one node gets drained correctly. The reason is that excludePodIP is not atomic, but needs to call ES to retrieve the current exclude list and then update it.
Currently we leave all resources after a test has run and only let CDP clean everything up by deleting the namespace. If we instead cleaned up a test's resources as soon as it succeeded, we could save resources, and other tests could use them instead of waiting for a new node in some situations.
It looks like the es-operator respects the minIndexReplicas value only on auto-scaling actions. So there is no convenient way of setting the desired index replica count. This would be very useful, e.g. for pre-scaling for some events.
The minIndexReplicas parameter, and maybe other parameters like minShardsPerNode, could also be applied on config change.
If I change the size of an existing PVC inside of an ElasticsearchDataSet, it should get propagated to the underlying StatefulSet's pod volumes.
The StatefulSet does not change its properties and therefore nothing changes.
ElasticsearchDataSet with a PersistentVolumeClaim
Add example manifests for different persistence options (EBS volume / SSD mount / RAM disk); see the sketches below.
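Sketches for two of those options; storage class names and sizes are placeholders:
# RAM disk: an in-memory emptyDir used as the data volume
volumes:
- name: es-data
  emptyDir:
    medium: Memory
# EBS volume: a volumeClaimTemplate with an EBS-backed storage class
volumeClaimTemplates:
- metadata:
    name: es-data
  spec:
    accessModes: [ReadWriteOnce]
    storageClassName: gp2   # placeholder storage class
    resources:
      requests:
        storage: 100Gi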
In a cluster with a huge variance of usage, it is good to be able to set a different auto-scaling configuration depending on the size of the cluster. It would be good to set different minShardsPerNode, maxShardsPerNode and scaleUpCPUBoundary values depending on the size of the cluster.
Not sure what would be the correct syntax, but one option would be adding a rules or overwrites part, with a selector like replicaLte (replicas less than or equal). The operator could check the overwrite part and fall back to the default if there is none.
Current configuration:
scaling:
  enabled: true
  minReplicas: 2
  maxReplicas: 99
  minIndexReplicas: 1
  maxIndexReplicas: 40
  minShardsPerNode: 3
  maxShardsPerNode: 3
  scaleUpCPUBoundary: 75
  scaleUpThresholdDurationSeconds: 240
  scaleUpCooldownSeconds: 1000
  scaleDownCPUBoundary: 40
  scaleDownThresholdDurationSeconds: 1200
  scaleDownCooldownSeconds: 1200
  diskUsagePercentScaledownWatermark: 80
Proposed configuration with rules:
scaling:
  enabled: true
  minReplicas: 2
  maxReplicas: 99
  minIndexReplicas: 1
  maxIndexReplicas: 40
  minShardsPerNode: 3
  maxShardsPerNode: 3
  scaleUpCPUBoundary: 75
  scaleUpThresholdDurationSeconds: 240
  scaleUpCooldownSeconds: 1000
  scaleDownCPUBoundary: 40
  scaleDownThresholdDurationSeconds: 1200
  scaleDownCooldownSeconds: 1200
  diskUsagePercentScaledownWatermark: 80
  rules:
  - replicaLte: 2
    scaleUpCPUBoundary: 30
  - replicaLte: 4
    scaleUpCPUBoundary: 40
  - replicaLte: 10
    scaleUpCPUBoundary: 60
This is mainly for cost optimization: during the night a cluster can be very small, but in the early morning the cluster needs to be able to scale aggressively, while once the cluster gets big it can scale more slowly.
I am willing to implement this feature if it makes sense for this project.
Release the first version of es-operator.