kubernetes-sigs / descheduler
Descheduler for Kubernetes
Home Page: https://sigs.k8s.io/descheduler
License: Apache License 2.0
The descheduler will be run as a Job in the kube-system namespace, and the command is
Command:
/bin/sh
-ec
/bin/descheduler --policy-config-file /policy-dir/policy.yaml
So there should be a /bin/sh binary in the container, but the image was built from scratch and didn't include one. We can see this in the Dockerfile:
FROM scratch
MAINTAINER Avesh Agarwal <[email protected]>
COPY --from=0 /go/src/github.com/kubernetes-incubator/descheduler/_output/bin/descheduler /bin/descheduler
CMD ["/bin/descheduler", "--help"]
And we got this error:
Error: failed to start container "descheduler": Error response from daemon: oci runtime error: container_linux.go:247: starting container process caused "exec: "/bin/sh": stat /bin/sh: no such file or directory"
This makes the pod run into the ContainerCannotRun state, and the Job immediately creates a new pod. Several minutes later I had hundreds of pods, and my small cluster finally went down, unresponsive.
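For anyone hitting this before a fix lands, a minimal sketch of a workaround Job (image tag and ConfigMap name are placeholders, not taken from this repo): invoke the binary directly instead of wrapping it in a shell, since the scratch image contains only /bin/descheduler.
apiVersion: batch/v1
kind: Job
metadata:
  name: descheduler-job            # placeholder name
  namespace: kube-system
spec:
  backoffLimit: 0                  # avoid the runaway pod churn described above
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: descheduler
        image: descheduler:latest            # placeholder tag
        command: ["/bin/descheduler"]        # call the binary directly; no shell needed
        args: ["--policy-config-file", "/policy-dir/policy.yaml"]
        volumeMounts:
        - name: policy-volume
          mountPath: /policy-dir
      volumes:
      - name: policy-volume
        configMap:
          name: descheduler-policy           # placeholder name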
/cc @aveshagarwal - As per our offline discussion, I think the first step would be to enable profiling. I am planning to add flag(s) to enable it. I will try to avoid starting HTTP-server-based profiling in the initial stages.
I am testing out the LowNodeUtilization policy with the following values:
nodeResourceUtilizationThresholds:
  thresholds:
    cpu: 60
    memory: 60
    pods: 5
  targetThresholds:
    cpu: 100
    memory: 100
    pods: 1000
However, all nodes are considered appropriately utilized. E.g.:
I0514 20:35:09.877111 1 lownodeutilization.go:147] Node "gke-asia-northeast1-std--default-pool-36ae422e-wnp4" is appropriately utilized with usage: api.ResourceThresholds{"memory":48.0697066631997, "pods":12.727272727272727, "cpu":34.64443045940843}
For the above node
"memory":48.0697066631997, < 60
"cpu":34.64443045940843 < 60
But "pods":12.727272727272727 > 5
I checked the code, and it looks like IsNodeWithLowUtilization will return false if usage is above the threshold for any single resource - https://github.com/kubernetes-incubator/descheduler/blob/master/pkg/descheduler/strategies/lownodeutilization.go#L298
This means that usage must be below ALL thresholds instead of ANY of them for a node to count as underutilized. Is that by design?
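For illustration, a simplified sketch of the check being described (not the actual descheduler source): the loop returns false as soon as usage exceeds the threshold for a single resource, so a node counts as underutilized only when ALL thresholds hold.
package main

import "fmt"

// ResourceThresholds maps a resource name to a percentage of node capacity.
type ResourceThresholds map[string]float64

// isNodeWithLowUtilization returns true only if usage is below the
// configured threshold for EVERY resource, not just for any one of them.
func isNodeWithLowUtilization(usage, thresholds ResourceThresholds) bool {
    for name, threshold := range thresholds {
        if usage[name] > threshold {
            return false // a single exceeded threshold disqualifies the node
        }
    }
    return true
}

func main() {
    usage := ResourceThresholds{"cpu": 34.6, "memory": 48.1, "pods": 12.7}
    thresholds := ResourceThresholds{"cpu": 60, "memory": 60, "pods": 5}
    fmt.Println(isNodeWithLowUtilization(usage, thresholds)) // false: pods 12.7 > 5
}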
While using this descheduler, I have noticed that the log shows exactly the same memory utilization number for many nodes. Also, each node shows exactly the same CPU and memory utilization numbers in the log every time. It seems like the descheduler is calculating the utilization from resource requests & limits?
I was hoping it uses kubectl top nodes to calculate current utilization (which would show up in the log as dynamically changing CPU & memory percentages). Please clarify how it calculates the current node resource utilization.
E.g., here is the data I am talking about: in the log I see Node "172.16.4.3" is appropriately utilized with usage: api.ResourceThresholds{"cpu":52.5, "memory":32.080248132547204, "pods":41.25}, but kubectl top node shows 172.16.4.3 2025m 25% 8699Mi 55% - meaning CPU 25% and memory 55% utilized.
Also, many of my pods are showing memory utilization of exactly "memory":4.193744917801392.
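The numbers in the log are consistent with utilization computed from pod resource requests against node allocatable capacity, not from live metrics. A rough sketch of that calculation, under that assumption (simplified, not the actual source):
package main

import "fmt"

// requestUtilizationPercent divides the sum of scheduled pod requests by
// the node's allocatable capacity. Live consumption, which is what
// `kubectl top node` reports, never enters the calculation, which would
// explain the diverging numbers and the values repeating across runs.
func requestUtilizationPercent(sumPodRequests, nodeAllocatable int64) float64 {
    return float64(sumPodRequests) / float64(nodeAllocatable) * 100
}

func main() {
    // e.g. pods requesting 2100m CPU on a node with 4000m allocatable:
    fmt.Printf("%.1f%%\n", requestUtilizationPercent(2100, 4000)) // 52.5%
}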
Hi, I would like to know if it is possible to define descheduler policies only for namespaces with a specific label. Any ideas?
Thanks.
So, if I understood correctly:
- a node below nodeResourceUtilizationThresholds.thresholds is considered underutilized
- a node above nodeResourceUtilizationThresholds.targetThresholds is considered overutilized
If this is correct, the following happens -
I have 4 nodes, 1 master node and 3 worker nodes -
$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
kubernetes-master Ready,SchedulingDisabled <none> 6h v1.10.0-alpha.0.456+f85649c6cd2032-dirty
kubernetes-minion-group-1vp4 Ready <none> 6h v1.10.0-alpha.0.456+f85649c6cd2032-dirty
kubernetes-minion-group-frgx Ready <none> 6h v1.10.0-alpha.0.456+f85649c6cd2032-dirty
kubernetes-minion-group-k7c7 Ready <none> 6h v1.10.0-alpha.0.456+f85649c6cd2032-dirty
I tainted and then uncordoned node kubernetes-minion-group-1vp4, which means there are no pods or Kubernetes resources on that node -
$ kubectl get all -o wide | grep kubernetes-minion-group-1vp4
$
and the allocated resources on this node are -
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
CPU Requests CPU Limits Memory Requests Memory Limits
------------ ---------- --------------- -------------
200m (10%) 0 (0%) 200Mi (2%) 300Mi (4%)
while on the other 2 worker nodes the allocated resources are -
--
CPU Requests CPU Limits Memory Requests Memory Limits
------------ ---------- --------------- -------------
1896m (94%) 446m (22%) 1133952Ki (15%) 1441152Ki (19%)
--
CPU Requests CPU Limits Memory Requests Memory Limits
------------ ---------- --------------- -------------
1840m (92%) 300m (15%) 1130Mi (15%) 1540Mi (21%)
So with the right DeschedulerPolicy, pods should have been descheduled from the nodes that are overutilized and scheduled on the fresh node.
I wrote the following DeschedulerPolicy -
apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
"LowNodeUtilization":
enabled: true
params:
nodeResourceUtilizationThresholds:
thresholds: # any node below the following percentages is considered underutilized
"cpu" : 40
"memory": 40
"pods": 40
targetThresholds: # any node above the following percentages is considered overutilized
"cpu" : 30
"memory": 2
"pods": 1
I ran the descheduler as follows -
$ _output/bin/descheduler --kubeconfig-file /var/run/kubernetes/admin.kubeconfig --policy-config-file examples/policy.yaml -v 5
I1123 22:12:27.298937 9381 reflector.go:198] Starting reflector *v1.Node (1h0m0s) from github.com/kubernetes-incubator/descheduler/pkg/descheduler/node/node.go:83
I1123 22:12:27.299080 9381 node.go:50] node lister returned empty list, now fetch directly
I1123 22:12:27.299230 9381 reflector.go:236] Listing and watching *v1.Node from github.com/kubernetes-incubator/descheduler/pkg/descheduler/node/node.go:83
I1123 22:12:31.596854 9381 lownodeutilization.go:115] Node "kubernetes-master" usage: api.ResourceThresholds{"cpu":95, "memory":11.575631035804197, "pods":8.181818181818182}
I1123 22:12:31.597019 9381 lownodeutilization.go:115] Node "kubernetes-minion-group-1vp4" usage: api.ResourceThresholds{"memory":2.764226588836412, "pods":1.8181818181818181, "cpu":10}
I1123 22:12:31.597508 9381 lownodeutilization.go:115] Node "kubernetes-minion-group-frgx" usage: api.ResourceThresholds{"cpu":94.8, "memory":15.305177094063607, "pods":16.363636363636363}
I1123 22:12:31.597910 9381 lownodeutilization.go:115] Node "kubernetes-minion-group-k7c7" usage: api.ResourceThresholds{"cpu":92, "memory":15.617880226925726, "pods":14.545454545454545}
I1123 22:12:31.597955 9381 lownodeutilization.go:163] evicting pods from node "kubernetes-minion-group-frgx" with usage: api.ResourceThresholds{"cpu":94.8, "memory":15.305177094063607, "pods":16.363636363636363}
I1123 22:12:31.597993 9381 lownodeutilization.go:163] evicting pods from node "kubernetes-minion-group-k7c7" with usage: api.ResourceThresholds{"cpu":92, "memory":15.617880226925726, "pods":14.545454545454545}
I1123 22:12:31.598017 9381 lownodeutilization.go:163] evicting pods from node "kubernetes-master" with usage: api.ResourceThresholds{"cpu":95, "memory":11.575631035804197, "pods":8.181818181818182}
$
It seems like the descheduler made the decisions to evict pods from the overutilized nodes, but when I check the cluster, nothing on the old nodes was terminated and nothing popped up on the fresh node -
$ kubectl get all -o wide | grep kubernetes-minion-group-1vp4
$
What am I doing wrong? :(
If a pod consumes high-value devices (GPUs, hugepages) or gets latency benefits via CPU pinning, we should avoid descheduling it, as it's not known whether the new pod will actually get a better fit.
I went through the README and found that the descheduler is deployed as a Job on Kubernetes, and it only executes once.
So why not make it a CronJob?
Ref:
https://kubernetes.io/docs/concepts/workloads/controllers/cron-jobs/
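If the project went that route, a CronJob manifest might look roughly like this (the schedule, image tag and ConfigMap name are placeholder values, not from this repo):
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: descheduler-cronjob
  namespace: kube-system
spec:
  schedule: "*/30 * * * *"                 # example: run every 30 minutes
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
          - name: descheduler
            image: descheduler:latest      # placeholder tag
            command: ["/bin/descheduler"]
            args: ["--policy-config-file", "/policy-dir/policy.yaml"]
            volumeMounts:
            - name: policy-volume
              mountPath: /policy-dir
          volumes:
          - name: policy-volume
            configMap:
              name: descheduler-policy     # placeholder name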
I went through the README and got
I1130 06:29:15.559480 1 duplicates.go:59] Error when evicting pod: "nginx-1-55kh7" (&errors.StatusError{ErrStatus:v1.Status{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ListMeta:v1.ListMeta{SelfLink:"", ResourceVersion:""}, Status:"Failure", Message:"pods \"nginx-1-55kh7\" is forbidden: User \"system:serviceaccount:kube-system:descheduler-sa\" cannot create pods/eviction in the namespace \"default\": User \"system:serviceaccount:kube-system:descheduler-sa\" cannot create pods/eviction in project \"default\"", Reason:"Forbidden", Details:(*v1.StatusDetails)(0xc4202d77a0), Code:403}})
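The 403 suggests the role bound to descheduler-sa lacks a create rule for pods/eviction. A hedged sketch of a ClusterRole that would cover it (the name is a placeholder):
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: descheduler-cluster-role     # placeholder name
rules:
- apiGroups: [""]
  resources: ["nodes"]
  verbs: ["get", "watch", "list"]
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "watch", "list"]
- apiGroups: [""]
  resources: ["pods/eviction"]       # the rule the 403 above points at
  verbs: ["create"]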
As per the email sent to kubernetes-dev[1], please create a SECURITY_CONTACTS
file.
The template for the file can be found in the kubernetes-template repository[2].
A description for the file is in the steering-committee docs[3], you might need
to search that page for "Security Contacts".
Please feel free to ping me on the PR when you make it, otherwise I will see when
you close this issue. :)
Thanks so much, let me know if you have any questions.
(This issue was generated from a tool, apologies for any weirdness.)
[1] https://groups.google.com/forum/#!topic/kubernetes-dev/codeiIoQ6QE
[2] https://github.com/kubernetes/kubernetes-template-project/blob/master/SECURITY_CONTACTS
[3] https://github.com/kubernetes/community/blob/master/committee-steering/governance/sig-governance-template-short.md
Recently one of the users requested a strategy for taints and tolerations. While I don't have cycles to work on this, I would be more than happy to review if anyone in the community is interested in working on it.
Hello! This feature seems fundamental to strong bin packing, but it's been months since the last update.
Is this project still active, and is there a timeline for merging it into an official Kubernetes release?
As of now, the descheduler doesn't have a cap on the maximum number of pods to be evicted from each node. We should add this feature to ensure that the cluster won't be on fire.
The current configuration types describe NodeResourceUtilizationThresholds.
I think utilization implies observed usage, not what is scheduled or allocated.
If we use the term utilization, it should mean the decision is based on metrics.
If we use the term allocated or node scheduling thresholds, it should mean the decision is based on pod resource requests, and not observed usage.
In a recent discussion on the sig-scheduling mailing list, we decided to move this repo to a sig-sponsored repo. Creating this issue to track the move.
It would be very helpful to get documentation about how to use and set up the descheduler Job within an OpenShift environment.
I tried to follow the README in my OpenShift cluster, but when creating the ClusterRole I get the following error:
error: unable to recognize "STDIN": no matches for rbac.authorization.k8s.io/, Kind=ClusterRole
When calling "make" on my macOS or CentOS machine, the build also fails:
go build -ldflags "-X github.com/kubernetes-incubator/descheduler/cmd/descheduler/app.version=$(git describe --tags) -X github.com/kubernetes-incubator/descheduler/cmd/descheduler/app.buildDate=$(date +%FT%T%z) -X github.com/kubernetes-incubator/descheduler/cmd/descheduler/app.gitCommit=$(git rev-parse HEAD)" -o _output/bin/descheduler github.com/kubernetes-incubator/descheduler/cmd/descheduler
I have this weird issue where the descheduler correctly spots which pods should be evicted, but no pods are actually deleted.
Could it be a permission issue? I'm using RBAC and have set up the roles as described in the README.
I0608 11:47:47.133914 1 reflector.go:202] Starting reflector *v1.Node (1h0m0s) from github.com/kubernetes-incubator/descheduler/pkg/descheduler/node/node.go:84
I0608 11:47:47.133965 1 reflector.go:240] Listing and watching *v1.Node from github.com/kubernetes-incubator/descheduler/pkg/descheduler/node/node.go:84
I0608 11:47:47.234083 1 duplicates.go:50] Processing node: "ip-172-20-34-164.eu-west-2.compute.internal"
I0608 11:47:47.342677 1 duplicates.go:54] "ReplicaSet/notification-service-v1-b66596d89"
I0608 11:47:47.342721 1 duplicates.go:50] Processing node: "ip-172-20-80-152.eu-west-2.compute.internal"
I0608 11:47:47.354665 1 duplicates.go:54] "ReplicaSet/rabbitmq-k8s-0-5b87cfcfbd"
I0608 11:47:47.354688 1 duplicates.go:54] "ReplicaSet/revenue-modeller-data-store-v2-97fcdc568"
I0608 11:47:47.354697 1 duplicates.go:54] "ReplicaSet/entitlement-service-v1-8454ccd585"
I0608 11:47:47.354705 1 duplicates.go:50] Processing node: "ip-172-20-104-255.eu-west-2.compute.internal"
I0608 11:47:47.407949 1 duplicates.go:54] "ReplicaSet/alert-service-v1-7dc6ddcf8d"
I0608 11:47:47.408001 1 duplicates.go:54] "ReplicaSet/hazelcast-k8s-0-7466b7cb4f"
I0608 11:47:47.438606 1 lownodeutilization.go:141] Node "ip-172-20-34-164.eu-west-2.compute.internal" is under utilized with usage: api.ResourceThresholds{"cpu":32.5, "memory":20.6892852865826, "pods":5.454545454545454}
I0608 11:47:47.438649 1 lownodeutilization.go:149] allPods:6, nonRemovablePods:2, bePods:0, bPods:2, gPods:2
I0608 11:47:47.438798 1 lownodeutilization.go:144] Node "ip-172-20-80-152.eu-west-2.compute.internal" is over utilized with usage: api.ResourceThresholds{"cpu":99, "memory":86.26743748074475, "pods":16.363636363636363}
I0608 11:47:47.438821 1 lownodeutilization.go:149] allPods:18, nonRemovablePods:6, bePods:1, bPods:10, gPods:1
I0608 11:47:47.438990 1 lownodeutilization.go:144] Node "ip-172-20-104-255.eu-west-2.compute.internal" is over utilized with usage: api.ResourceThresholds{"cpu":99.5, "memory":92.91303949597655, "pods":15.454545454545455}
I0608 11:47:47.439014 1 lownodeutilization.go:149] allPods:17, nonRemovablePods:8, bePods:0, bPods:6, gPods:3
I0608 11:47:47.439023 1 lownodeutilization.go:65] Criteria for a node under utilization: CPU: 74, Mem: 68, Pods: 12
I0608 11:47:47.439034 1 lownodeutilization.go:72] Total number of underutilized nodes: 1
I0608 11:47:47.439047 1 lownodeutilization.go:89] Criteria for a node above target utilization: CPU: 77, Mem: 75, Pods: 14
I0608 11:47:47.439061 1 lownodeutilization.go:91] Total number of nodes above target utilization: 2
I0608 11:47:47.439077 1 lownodeutilization.go:183] Total capacity to be moved: CPU:1780, Mem:9.083513856e+09, Pods:9.4
I0608 11:47:47.439093 1 lownodeutilization.go:184] ********Number of pods evicted from each node:***********
I0608 11:47:47.439101 1 lownodeutilization.go:191] evicting pods from node "ip-172-20-104-255.eu-west-2.compute.internal" with usage: api.ResourceThresholds{"pods":15.454545454545455, "cpu":99.5, "memory":92.91303949597655}
I0608 11:47:47.439125 1 lownodeutilization.go:202] 0 pods evicted from node "ip-172-20-104-255.eu-west-2.compute.internal" with usage map[cpu:99.5 memory:92.91303949597655 pods:15.454545454545455]
I0608 11:47:47.439152 1 lownodeutilization.go:191] evicting pods from node "ip-172-20-80-152.eu-west-2.compute.internal" with usage: api.ResourceThresholds{"cpu":99, "memory":86.26743748074475, "pods":16.363636363636363}
I0608 11:47:47.439175 1 lownodeutilization.go:202] 0 pods evicted from node "ip-172-20-80-152.eu-west-2.compute.internal" with usage map[cpu:99 memory:86.26743748074475 pods:16.363636363636363]
I0608 11:47:47.439195 1 lownodeutilization.go:94] Total number of pods evicted: 0
I0608 11:47:47.439203 1 pod_antiaffinity.go:45] Processing node: "ip-172-20-34-164.eu-west-2.compute.internal"
I0608 11:47:47.446324 1 pod_antiaffinity.go:45] Processing node: "ip-172-20-80-152.eu-west-2.compute.internal"
I0608 11:47:47.455917 1 pod_antiaffinity.go:45] Processing node: "ip-172-20-104-255.eu-west-2.compute.internal"
I0608 11:47:47.492859 1 node_affinity.go:31] Evicted 0 pods
This is using Kubernetes v1.10.3
As of now, we don't have a version flag for the descheduler. We need one. I will submit a PR for this.
The namespace should not be ignored.
Hi,
I was just playing around with the project.
My policy file looks like -
apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
"LowNodeUtilization":
enabled: true
params:
nodeResourceUtilizationThresholds:
thresholds:
"cpu" : 20
targetThresholds:
"cpu" : 50
and I ran the descheduler like this -
$ _output/bin/descheduler --kubeconfig-file /var/run/kubernetes/admin.kubeconfig --policy-config-file examples/policy.yaml -v 5
I1123 17:14:37.581631 13825 reflector.go:198] Starting reflector *v1.Node (1h0m0s) from github.com/kubernetes-incubator/descheduler/pkg/descheduler/node/node.go:83
I1123 17:14:37.581785 13825 node.go:50] node lister returned empty list, now fetch directly
I1123 17:14:37.582104 13825 reflector.go:236] Listing and watching *v1.Node from github.com/kubernetes-incubator/descheduler/pkg/descheduler/node/node.go:83
I1123 17:14:38.069287 13825 lownodeutilization.go:104] no target resource threshold for pods is configured
The exit code is 0, but I'm not sure whether the descheduler actually went ahead and processed the nodes, because it might have stopped on seeing that there is no targetThreshold for pods.
If this is the case, does it make sense to make all 3 parameters (pods, memory and cpu) mandatory for the descheduler to make decisions? Why can I not set just cpu, or memory, or pods?
Does this make sense?
I have the following config, but the descheduler does not remove any pods:
LowNodeUtilization:
  enabled: true
  params:
    nodeResourceUtilizationThresholds:
      thresholds:
        cpu: 30
        memory: 30
        pods: 30
      targetThresholds:
        cpu: 50
        memory: 50
        pods: 50
I0508 07:06:40.389008 1 lownodeutilization.go:144] Node "gke-asia-northeast1-std--default-pool-290fc974-2mj6" is over utilized with usage: api.ResourceThresholds{"memory":63.35539318178952, "pods":10, "cpu":29.98741346758968}
I0508 07:06:40.389076 1 lownodeutilization.go:149] allPods:11, nonRemovablePods:5, bePods:1, bPods:2, gPods:3
I0508 07:06:40.389225 1 lownodeutilization.go:147] Node "gke-asia-northeast1-std--default-pool-36ae422e-1qsh" is appropriately utilized with usage: api.ResourceThresholds{"cpu":31.623662680931403, "memory":38.651985111461904, "pods":10}
I0508 07:06:40.389265 1 lownodeutilization.go:149] allPods:11, nonRemovablePods:5, bePods:1, bPods:4, gPods:1
I0508 07:06:40.389353 1 lownodeutilization.go:147] Node "gke-asia-northeast1-std--default-pool-36ae422e-8fc1" is appropriately utilized with usage: api.ResourceThresholds{"pods":5.454545454545454, "cpu":33.38577721837634, "memory":46.87748520961707}
I0508 07:06:40.389375 1 lownodeutilization.go:149] allPods:6, nonRemovablePods:4, bePods:0, bPods:1, gPods:1
I0508 07:06:40.389508 1 lownodeutilization.go:144] Node "gke-asia-northeast1-std--default-pool-290fc974-0s07" is over utilized with usage: api.ResourceThresholds{"cpu":43.14033983637508, "memory":61.846487904599, "pods":8.181818181818182}
I0508 07:06:40.389535 1 lownodeutilization.go:149] allPods:9, nonRemovablePods:4, bePods:0, bPods:2, gPods:3
I0508 07:06:40.389712 1 lownodeutilization.go:144] Node "gke-asia-northeast1-std--default-pool-290fc974-nq13" is over utilized with usage: api.ResourceThresholds{"cpu":84.36123348017621, "memory":86.35326222824197, "pods":10}
I0508 07:06:40.389744 1 lownodeutilization.go:149] allPods:11, nonRemovablePods:5, bePods:0, bPods:2, gPods:4
I0508 07:06:40.389853 1 lownodeutilization.go:144] Node "gke-asia-northeast1-std--default-pool-290fc974-sb0v" is over utilized with usage: api.ResourceThresholds{"pods":7.2727272727272725, "cpu":40.308370044052865, "memory":65.15821416455076}
I0508 07:06:40.389879 1 lownodeutilization.go:149] allPods:8, nonRemovablePods:5, bePods:0, bPods:2, gPods:1
I0508 07:06:40.389965 1 lownodeutilization.go:147] Node "gke-asia-northeast1-std--default-pool-36ae422e-3290" is appropriately utilized with usage: api.ResourceThresholds{"cpu":27.72183763373191, "memory":45.02291850404409, "pods":6.363636363636363}
I0508 07:06:40.389988 1 lownodeutilization.go:149] allPods:7, nonRemovablePods:5, bePods:0, bPods:2, gPods:0
I0508 07:06:40.390079 1 lownodeutilization.go:147] Node "gke-asia-northeast1-std--default-pool-36ae422e-7w01" is appropriately utilized with usage: api.ResourceThresholds{"cpu":37.161736941472626, "memory":43.96316610085953, "pods":6.363636363636363}
I0508 07:06:40.390114 1 lownodeutilization.go:149] allPods:7, nonRemovablePods:5, bePods:0, bPods:0, gPods:2
I0508 07:06:40.390311 1 lownodeutilization.go:144] Node "gke-asia-northeast1-std--default-pool-290fc974-pcz2" is over utilized with usage: api.ResourceThresholds{"memory":59.747681387354575, "pods":12.727272727272727, "cpu":67.36941472624292}
I0508 07:06:40.390335 1 lownodeutilization.go:149] allPods:14, nonRemovablePods:6, bePods:1, bPods:5, gPods:2
I0508 07:06:40.390425 1 lownodeutilization.go:147] Node "gke-asia-northeast1-std--default-pool-36ae422e-rw0t" is appropriately utilized with usage: api.ResourceThresholds{"cpu":33.38577721837634, "memory":38.39946598414058, "pods":6.363636363636363}
I0508 07:06:40.390452 1 lownodeutilization.go:149] allPods:7, nonRemovablePods:4, bePods:0, bPods:2, gPods:1
I0508 07:06:40.390580 1 lownodeutilization.go:144] Node "gke-asia-northeast1-std--default-pool-36ae422e-wnp4" is over utilized with usage: api.ResourceThresholds{"cpu":65.48143486469478, "memory":81.88036194839464, "pods":8.181818181818182}
I0508 07:06:40.390617 1 lownodeutilization.go:149] allPods:9, nonRemovablePods:4, bePods:0, bPods:3, gPods:2
I0508 07:06:40.390701 1 lownodeutilization.go:147] Node "gke-asia-northeast1-std--default-pool-36ae422e-150f" is appropriately utilized with usage: api.ResourceThresholds{"cpu":27.847702957835118, "memory":39.989094588917425, "pods":5.454545454545454}
I0508 07:06:40.390722 1 lownodeutilization.go:149] allPods:6, nonRemovablePods:4, bePods:0, bPods:1, gPods:1
I0508 07:06:40.390907 1 lownodeutilization.go:144] Node "gke-asia-northeast1-std--default-pool-290fc974-x9lg" is over utilized with usage: api.ResourceThresholds{"memory":76.15314534759058, "pods":12.727272727272727, "cpu":47.860289490245435}
I0508 07:06:40.390929 1 lownodeutilization.go:149] allPods:14, nonRemovablePods:6, bePods:0, bPods:5, gPods:3
I0508 07:06:40.391013 1 lownodeutilization.go:147] Node "gke-asia-northeast1-std--default-pool-36ae422e-mcvm" is appropriately utilized with usage: api.ResourceThresholds{"cpu":27.72183763373191, "memory":45.02291850404409, "pods":6.363636363636363}
I0508 07:06:40.391032 1 lownodeutilization.go:149] allPods:7, nonRemovablePods:5, bePods:0, bPods:2, gPods:0
I0508 07:06:40.391230 1 lownodeutilization.go:144] Node "gke-asia-northeast1-std--default-pool-290fc974-tpck" is over utilized with usage: api.ResourceThresholds{"cpu":79.32662051604783, "memory":86.72583143248654, "pods":15.454545454545455}
I0508 07:06:40.391254 1 lownodeutilization.go:149] allPods:17, nonRemovablePods:8, bePods:1, bPods:6, gPods:2
I0508 07:06:40.391429 1 lownodeutilization.go:144] Node "gke-asia-northeast1-std--default-pool-290fc974-bftv" is over utilized with usage: api.ResourceThresholds{"cpu":67.0547514159849, "memory":40.020142022604475, "pods":12.727272727272727}
I0508 07:06:40.391454 1 lownodeutilization.go:149] allPods:14, nonRemovablePods:5, bePods:0, bPods:5, gPods:4
I0508 07:06:40.391560 1 lownodeutilization.go:144] Node "gke-asia-northeast1-std--default-pool-290fc974-zdd4" is over utilized with usage: api.ResourceThresholds{"cpu":41.63624921334173, "memory":81.68473886035868, "pods":7.2727272727272725}
I0508 07:06:40.391579 1 lownodeutilization.go:149] allPods:8, nonRemovablePods:7, bePods:0, bPods:0, gPods:1
I0508 07:06:40.391630 1 lownodeutilization.go:144] Node "gke-asia-northeast1-std--default-pool-290fc974-ffq9" is over utilized with usage: api.ResourceThresholds{"pods":4.545454545454546, "cpu":28.351164254247955, "memory":58.79969974544339}
I0508 07:06:40.391659 1 lownodeutilization.go:149] allPods:5, nonRemovablePods:5, bePods:0, bPods:0, gPods:0
I0508 07:06:40.391865 1 lownodeutilization.go:144] Node "gke-asia-northeast1-std--default-pool-290fc974-0plb" is over utilized with usage: api.ResourceThresholds{"cpu":69.57205789804908, "memory":26.510368710913788, "pods":12.727272727272727}
I0508 07:06:40.391891 1 lownodeutilization.go:149] allPods:14, nonRemovablePods:5, bePods:1, bPods:2, gPods:6
I0508 07:06:40.391985 1 lownodeutilization.go:144] Node "gke-asia-northeast1-std--default-pool-36ae422e-32s4" is over utilized with usage: api.ResourceThresholds{"cpu":45.972309628697296, "memory":55.872961663211015, "pods":7.2727272727272725}
I0508 07:06:40.392007 1 lownodeutilization.go:149] allPods:8, nonRemovablePods:4, bePods:0, bPods:3, gPods:1
I0508 07:06:40.392185 1 lownodeutilization.go:144] Node "gke-asia-northeast1-std--default-pool-290fc974-v103" is over utilized with usage: api.ResourceThresholds{"cpu":88.82945248584015, "memory":65.58252909160707, "pods":11.818181818181818}
I0508 07:06:40.392206 1 lownodeutilization.go:149] allPods:13, nonRemovablePods:7, bePods:0, bPods:3, gPods:3
I0508 07:06:40.392274 1 lownodeutilization.go:147] Node "gke-asia-northeast1-std--default-pool-290fc974-pwh6" is appropriately utilized with usage: api.ResourceThresholds{"cpu":22.183763373190686, "memory":36.015023076975325, "pods":5.454545454545454}
I0508 07:06:40.392297 1 lownodeutilization.go:149] allPods:6, nonRemovablePods:5, bePods:0, bPods:1, gPods:0
I0508 07:06:40.392450 1 lownodeutilization.go:144] Node "gke-asia-northeast1-std--default-pool-290fc974-fxpk" is over utilized with usage: api.ResourceThresholds{"cpu":47.23096286972939, "memory":56.65535699212463, "pods":10}
I0508 07:06:40.392479 1 lownodeutilization.go:149] allPods:11, nonRemovablePods:5, bePods:0, bPods:6, gPods:0
I0508 07:06:40.392632 1 lownodeutilization.go:147] Node "gke-asia-northeast1-std--default-pool-36ae422e-fsr8" is appropriately utilized with usage: api.ResourceThresholds{"pods":11.818181818181818, "cpu":35.08495909376967, "memory":38.59402990191275}
I0508 07:06:40.392652 1 lownodeutilization.go:149] allPods:13, nonRemovablePods:4, bePods:1, bPods:7, gPods:1
I0508 07:06:40.392727 1 lownodeutilization.go:144] Node "gke-asia-northeast1-std--default-pool-290fc974-rq6s" is over utilized with usage: api.ResourceThresholds{"cpu":34.64443045940843, "memory":62.24389505579321, "pods":5.454545454545454}
I0508 07:06:40.392753 1 lownodeutilization.go:149] allPods:6, nonRemovablePods:6, bePods:0, bPods:0, gPods:0
I0508 07:06:40.392759 1 lownodeutilization.go:65] Criteria for a node under utilization: CPU: 30, Mem: 30, Pods: 30
I0508 07:06:40.392782 1 lownodeutilization.go:69] No node is underutilized, nothing to do here, you might tune your thersholds further
kubectl top nodes
NAME CPU(cores) CPU% MEMORY(bytes) MEMORY%
gke-asia-northeast1-std--default-pool-36ae422e-32s4 199m 1% 12555Mi 12%
gke-asia-northeast1-std--default-pool-290fc974-pwh6 101m 0% 10892Mi 11%
gke-asia-northeast1-std--default-pool-290fc974-2mj6 218m 1% 8947Mi 9%
gke-asia-northeast1-std--default-pool-290fc974-bftv 372m 2% 17092Mi 17%
gke-asia-northeast1-std--default-pool-290fc974-pcz2 279m 1% 44959Mi 46%
gke-asia-northeast1-std--default-pool-36ae422e-mcvm 286m 1% 29233Mi 30%
gke-asia-northeast1-std--default-pool-290fc974-0s07 120m 0% 9409Mi 9%
gke-asia-northeast1-std--default-pool-290fc974-ffq9 164m 1% 13839Mi 14%
gke-asia-northeast1-std--default-pool-290fc974-sb0v 404m 2% 11927Mi 12%
gke-asia-northeast1-std--default-pool-290fc974-fxpk 211m 1% 30067Mi 31%
gke-asia-northeast1-std--default-pool-290fc974-v103 1337m 8% 42334Mi 43%
gke-asia-northeast1-std--default-pool-36ae422e-wnp4 291m 1% 19506Mi 20%
gke-asia-northeast1-std--default-pool-36ae422e-fsr8 532m 3% 22507Mi 23%
gke-asia-northeast1-std--default-pool-36ae422e-3290 235m 1% 33359Mi 34%
gke-asia-northeast1-std--default-pool-290fc974-rq6s 78m 0% 34039Mi 35%
gke-asia-northeast1-std--default-pool-36ae422e-8fc1 112m 0% 10349Mi 10%
gke-asia-northeast1-std--default-pool-36ae422e-7w01 185m 1% 10906Mi 11%
gke-asia-northeast1-std--default-pool-36ae422e-150f 162m 1% 11357Mi 11%
gke-asia-northeast1-std--default-pool-290fc974-x9lg 333m 2% 13055Mi 13%
gke-asia-northeast1-std--default-pool-290fc974-0plb 137m 0% 22509Mi 23%
gke-asia-northeast1-std--default-pool-36ae422e-1qsh 269m 1% 20021Mi 20%
gke-asia-northeast1-std--default-pool-290fc974-nq13 256m 1% 56451Mi 58%
gke-asia-northeast1-std--default-pool-290fc974-tpck 435m 2% 40776Mi 42%
gke-asia-northeast1-std--default-pool-290fc974-zdd4 88m 0% 27627Mi 28%
gke-asia-northeast1-std--default-pool-36ae422e-rw0t 120m 0% 26677Mi 27%
There is a stark difference between what is reported in the logs vs what is reported by kubectl top:
lownodeutilization.go:144] Node "gke-asia-northeast1-std--default-pool-36ae422e-32s4" is over utilized with usage: api.ResourceThresholds{"cpu":45.972309628697296, "memory":55.872961663211015, "pods":7.2727272727272725}
vs
gke-asia-northeast1-std--default-pool-36ae422e-32s4 199m 1% 12555Mi 12%
Server Version: version.Info{Major:"1", Minor:"9+", GitVersion:"v1.9.4-gke.1", GitCommit:"10e47a740d0036a4964280bd663c8500da58e3aa", GitTreeState:"clean", BuildDate:"2018-03-13T18:00:36Z", GoVersion:"go1.9.3b4", Compiler:"gc", Platform:"linux/amd64"}
It is not uncommon for pods to get scheduled on nodes that are not able to start them.
For example, a node may have network issues and be unable to mount a networked persistent volume, or it cannot pull a docker image, or it has some docker configuration issue that is seen only on container startup.
Another common issue is when a container keeps getting restarted by its liveness check because of some local node issue (e.g. wrong routing table, slow storage, network latency or packet drop). In that case, a pod is unhealthy most of the time and hangs in a restart state forever without a chance of being migrated to another node.
As of now, there is no way to re-schedule pods with faulty containers. It may be helpful to introduce two new strategies:
- evict a pod that has been unhealthy for $notReadyPeriod seconds and whose containers were restarted $maxRestartCount times
- evict a pod that has failed to start within $maxStartupTime seconds
A similar issue is filed against kubernetes: kubernetes/kubernetes#13385
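Purely as a sketch, a hypothetical policy for such strategies might look like this (the strategy name and parameter keys below are invented for illustration and are not part of the descheduler API):
apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "RemovePodsWithFailingContainers":   # hypothetical strategy name
    enabled: true
    params:
      notReadyPeriod: 600              # hypothetical: seconds a pod may stay not ready
      maxRestartCount: 5               # hypothetical: container restart limit
      maxStartupTime: 300              # hypothetical: seconds allowed for startup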
Not too sure if this is an issue for anyone else, but in order to build and run out of the box I had to update the Dockerfile to build FROM debian:stretch-slim so it can run with /bin/sh.
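For reference, the change described here would look roughly like this against the Dockerfile quoted earlier (base image swapped so a shell is present):
FROM debian:stretch-slim
COPY --from=0 /go/src/github.com/kubernetes-incubator/descheduler/_output/bin/descheduler /bin/descheduler
CMD ["/bin/descheduler", "--help"]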
For pods with podAffinity set using preferredDuringSchedulingIgnoredDuringExecution, it might be possible that at the time of scheduling, no pod with the matching labels was running on any node, but the pod still got scheduled on the current node since the affinity was preferred rather than required.
In such a case, if the descheduler is run, it can do the following -
- find pods with podAffinity set using preferredDuringSchedulingIgnoredDuringExecution
- evict them if rescheduling means the podAffinity condition can now be met
Maybe we could have a policy file describing the strategy like -
apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
"RemovePodsViolatingPodAffinity":
enabled: true
Does it make sense to support such a strategy?
When I run the following policy config file -
apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
"RemoveDuplicates":
enabled: true
with the RemoveDuplicates strategy enabled, and there is only one schedulable node available (on which the pods are already running), then the descheduler still evicts the pods, only for them to be scheduled on the same node again. This leads to disruption of service without any gain.
$ descheduler --kubeconfig $KUBECONFIG --policy-config-file policy.yaml -v 5
I0120 02:49:28.828911 13141 reflector.go:202] Starting reflector *v1.Node (1h0m0s) from github.com/kubernetes-incubator/descheduler/pkg/descheduler/node/node.go:83
I0120 02:49:28.828993 13141 reflector.go:240] Listing and watching *v1.Node from github.com/kubernetes-incubator/descheduler/pkg/descheduler/node/node.go:83
I0120 02:49:28.929080 13141 node.go:50] node lister returned empty list, now fetch directly
I0120 02:49:30.343166 13141 duplicates.go:49] Processing node: "kubernetes-master"
I0120 02:49:30.873924 13141 duplicates.go:49] Processing node: "kubernetes-minion-group-5xwf"
I0120 02:49:31.123129 13141 duplicates.go:53] "ReplicaSet/wordpress-57f4bb46bf"
I0120 02:49:31.367105 13141 duplicates.go:61] Evicted pod: "wordpress-57f4bb46bf-fn8qm" (<nil>)
I0120 02:49:31.607484 13141 duplicates.go:61] Evicted pod: "wordpress-57f4bb46bf-k6tqz" (<nil>)
I0120 02:49:31.865925 13141 duplicates.go:61] Evicted pod: "wordpress-57f4bb46bf-rvwd9" (<nil>)
I0120 02:49:32.155498 13141 duplicates.go:61] Evicted pod: "wordpress-57f4bb46bf-v9bzq" (<nil>)
I0120 02:49:32.155526 13141 duplicates.go:49] Processing node: "kubernetes-minion-group-cxg1"
I0120 02:49:32.433999 13141 duplicates.go:49] Processing node: "kubernetes-minion-group-v738"
How about the descheduler only evicting pods if other schedulable nodes are available?
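A simplified sketch of the guard this suggests (not the actual descheduler source): only proceed with eviction when at least one other node is ready and schedulable, so evicted duplicates have somewhere else to land.
package strategies

import v1 "k8s.io/api/core/v1"

// otherSchedulableNodeExists reports whether any node besides the current
// one is ready and not cordoned.
func otherSchedulableNodeExists(nodes []*v1.Node, current *v1.Node) bool {
    for _, node := range nodes {
        if node.Name == current.Name || node.Spec.Unschedulable {
            continue
        }
        for _, cond := range node.Status.Conditions {
            if cond.Type == v1.NodeReady && cond.Status == v1.ConditionTrue {
                return true
            }
        }
    }
    return false
}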
cc @aveshagarwal. As per our offline discussion, we need to parallelize computation in the strategies to reduce the overall time. Coming up with a generic MapReduce framework would be ideal.
Currently the descheduler only checks for the kubernetes.io/created-by annotation in order to proceed with descheduling. It also ignores every pod that has a hostDir volume mounted.
Would it be possible to allow descheduling of pods that have hostDir volumes (maybe configurable, based on the content of kubernetes.io/created-by)?
I have the following nodes -
$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
kubernetes-master Ready,SchedulingDisabled <none> 56m v1.8.4-dirty
kubernetes-minion-group-5rrh Ready <none> 56m v1.8.4-dirty
kubernetes-minion-group-fb8c Ready <none> 56m v1.8.4-dirty
kubernetes-minion-group-t1r3 Ready,SchedulingDisabled <none> 56m v1.8.4-dirty
The worker node kubernetes-minion-group-t1r3 was cordoned and marked as unschedulable; however, it fulfilled the criteria for being an underutilized node according to the following policy file -
apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
"LowNodeUtilization":
enabled: true
params:
nodeResourceUtilizationThresholds:
thresholds: # any node below the following percentages is considered underutilized
"cpu" : 40
"memory": 40
"pods": 40
targetThresholds: # any node above the following percentages is considered overutilized
"cpu" : 30
"memory": 2
"pods": 1
When I ran the descheduler, kubernetes-minion-group-t1r3 (the cordoned node) was taken into account and marked as underutilized, and multiple pods were evicted from other nodes in the hope that the scheduler would schedule them on kubernetes-minion-group-t1r3, but that never happened since the node was cordoned.
Does it make sense to not take a cordoned node into consideration while looking for underutilized nodes?
I ran the descheduler as follows -
$ _output/bin/descheduler --kubeconfig-file /var/run/kubernetes/admin.kubeconfig --policy-config-file examples/custom.yaml -v 5
I1125 18:58:46.014381 2813 reflector.go:198] Starting reflector *v1.Node (1h0m0s) from github.com/kubernetes-incubator/descheduler/pkg/descheduler/node/node.go:83
I1125 18:58:46.016167 2813 node.go:50] node lister returned empty list, now fetch directly
I1125 18:58:46.017010 2813 reflector.go:236] Listing and watching *v1.Node from github.com/kubernetes-incubator/descheduler/pkg/descheduler/node/node.go:83
I1125 18:58:47.834184 2813 lownodeutilization.go:115] Node "kubernetes-master" usage: api.ResourceThresholds{"cpu":95, "memory":11.575631035804197, "pods":8.181818181818182}
I1125 18:58:47.834986 2813 lownodeutilization.go:115] Node "kubernetes-minion-group-5rrh" usage: api.ResourceThresholds{"cpu":90.5, "memory":6.932161992316314, "pods":17.272727272727273}
I1125 18:58:47.835701 2813 lownodeutilization.go:115] Node "kubernetes-minion-group-fb8c" usage: api.ResourceThresholds{"cpu":96.5, "memory":14.0975556030657, "pods":17.272727272727273}
I1125 18:58:47.835783 2813 lownodeutilization.go:115] Node "kubernetes-minion-group-t1r3" usage: api.ResourceThresholds{"cpu":10, "memory":2.764226588836412, "pods":1.8181818181818181}
I1125 18:58:47.835819 2813 lownodeutilization.go:163] evicting pods from node "kubernetes-minion-group-fb8c" with usage: api.ResourceThresholds{"cpu":96.5, "memory":14.0975556030657, "pods":17.272727272727273}
I1125 18:58:48.096681 2813 lownodeutilization.go:194] Evicted pod: "database-6f97f65956-6pxp5" (<nil>)
I1125 18:58:48.098323 2813 lownodeutilization.go:208] updated node usage: api.ResourceThresholds{"cpu":91.5, "memory":14.0975556030657, "pods":16.363636363636363}
I1125 18:58:48.361411 2813 lownodeutilization.go:194] Evicted pod: "wordpress-57f4bb46bf-g27k6" (<nil>)
I1125 18:58:48.361522 2813 lownodeutilization.go:208] updated node usage: api.ResourceThresholds{"cpu":86.5, "memory":14.0975556030657, "pods":15.454545454545455}
I1125 18:58:48.623304 2813 lownodeutilization.go:194] Evicted pod: "wordpress-57f4bb46bf-m62cm" (<nil>)
I1125 18:58:48.623330 2813 lownodeutilization.go:208] updated node usage: api.ResourceThresholds{"cpu":81.5, "memory":14.0975556030657, "pods":14.545454545454547}
I1125 18:58:48.894712 2813 lownodeutilization.go:194] Evicted pod: "wordpress-57f4bb46bf-mblx7" (<nil>)
I1125 18:58:48.894832 2813 lownodeutilization.go:208] updated node usage: api.ResourceThresholds{"cpu":76.5, "memory":14.0975556030657, "pods":13.636363636363638}
I1125 18:58:48.894991 2813 lownodeutilization.go:163] evicting pods from node "kubernetes-master" with usage: api.ResourceThresholds{"cpu":95, "memory":11.575631035804197, "pods":8.181818181818182}
I1125 18:58:48.895063 2813 lownodeutilization.go:163] evicting pods from node "kubernetes-minion-group-5rrh" with usage: api.ResourceThresholds{"cpu":90.5, "memory":6.932161992316314, "pods":17.272727272727273}
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"8+", GitVersion:"v1.8.4-dirty", GitCommit:"9befc2b8928a9426501d3bf62f72849d5cbcd5a3", GitTreeState:"dirty", BuildDate:"2017-11-25T12:04:44Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"8+", GitVersion:"v1.8.4-dirty", GitCommit:"9befc2b8928a9426501d3bf62f72849d5cbcd5a3", GitTreeState:"dirty", BuildDate:"2017-11-25T11:54:10Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}
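A simplified sketch of the filter this question suggests (not the actual descheduler source): drop cordoned nodes before computing underutilization, since the scheduler will never place evicted pods on an unschedulable node.
package strategies

import v1 "k8s.io/api/core/v1"

// filterSchedulableNodes returns only the nodes the scheduler could
// actually place evicted pods onto.
func filterSchedulableNodes(nodes []*v1.Node) []*v1.Node {
    var schedulable []*v1.Node
    for _, node := range nodes {
        if !node.Spec.Unschedulable {
            schedulable = append(schedulable, node)
        }
    }
    return schedulable
}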
I found this feature on the roadmap, and I think it involves real-time scheduling, which is quite different from the current logic.
The pod anti-affinity strategy doesn't check the type of pod to be evicted; it can evict critical or mirror pods. As this is functionality that needs to be respected by all strategies in the descheduler, I am planning to move it to pods.go to avoid code duplication, so that people implementing strategies won't have to think about it.
Hi,
I just want to make this work, and I found that this strategy does not work as I expect.
descheduler version
Descheduler version {Major:0 Minor:4+ GitCommit:d3c2f256852874fdca4682c3c94bc30624979036 GitVersion:v0.4.0 BuildDate:2018-01-10T13:23:09+0800 GoVersion:go1.8.5 Compiler:gc Platform:linux/amd64}
My original attempt was as follows:
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchExpressions:
        - key: key
          operator: In
          values: ["value"]
      topologyKey: kubernetes.io/hostname
Then I found that no pod was evicted.
Then I went through the unit tests to try to find a working demo for this strategy,
and tried to reproduce the test in https://github.com/kubernetes-incubator/descheduler/blob/master/pkg/descheduler/strategies/pod_antiaffinity_test.go
The reproduced steps are:
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchExpressions:
        - key: key
          operator: In
          values: ["value"]
      topologyKey: kubernetes.io/hostname
Then I found that one pod was evicted.
So what I want to discuss here is whether the strategy only covers the one scenario from the unit test.
And why do my original steps not work? Is it a bug?
Thanks!
Right now the control for marking pods as critical is very basic and requires changing annotations on many pods.
Proposal 1 - Non-critical annotation
If I have 100 pods but I want the descheduler to consider only 20 of them "non-critical", that means I have to add annotations to 80 pods. We could instead have a "non-critical" annotation to mark just those 20 pods. This could be controlled with an argument, --non-critical-pod-matcher=true (default false).
Proposal 2 - Consider current labels as critical
If I already have an annotation in my running applications that I know identifies a set of critical pods, it would be nice to be able to say "Pods with this custom annotation and value are considered critical". With this, no changes would have to be applied at all to make descheduler run. Personally, I have an annotation called "layer" with values (backend|monitoring|data|frontend). I consider my data and monitoring Pods critical, if I already have this annotation, why add another?
It could be done with --extra-critical-annotations="layer=data,layer=monitoring,k8s-app=prometheus". And if --non-critical-pod-matcher is set to true, then --extra-non-critical-annotations="...."
Given the following policy:
apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
"LowNodeUtilization":
enabled: true
params:
nodeResourceUtilizationThresholds:
thresholds:
"cpu" : 50
"memory": 50
"pods": 10
targetThresholds:
"cpu" : 50
"memory": 50
"pods": 50
I am confused by this output:
./_output/bin/descheduler --kubeconfig ~/.kube/config --policy-config-file policy.yaml --node-selector beta.kubernetes.io/instance-type=n1-highmem-4 -v 4
I0814 11:56:02.699491 27948 reflector.go:202] Starting reflector *v1.Node (1h0m0s) from github.com/kubernetes-incubator/descheduler/pkg/descheduler/node/node.go:84
I0814 11:56:02.699629 27948 reflector.go:240] Listing and watching *v1.Node from github.com/kubernetes-incubator/descheduler/pkg/descheduler/node/node.go:84
I0814 11:56:02.799560 27948 node.go:51] node lister returned empty list, now fetch directly
I0814 11:56:04.839125 27948 request.go:480] Throttling request took 122.384366ms, request: GET:https://x.x.x.x.x/api/v1/pods?fieldSelector=spec.nodeName%3Dgke-node-bf2a5a1e-mr9b%2Cstatus.phase%21%3DFailed%2Cstatus.phase%21%3DSucceeded
I0814 11:56:05.039313 27948 request.go:480] Throttling request took 82.857788ms, request: GET:https://x.x.x.x.x/api/v1/pods?fieldSelector=spec.nodeName%3Dgke-node-bf2a5a1e-ps8k%2Cstatus.phase%21%3DFailed%2Cstatus.phase%21%3DSucceeded
I0814 11:56:05.239153 27948 request.go:480] Throttling request took 65.339548ms, request: GET:https://x.x.x.x.x/api/v1/pods?fieldSelector=spec.nodeName%3Dgke-node-bf2a5a1e-qcpq%2Cstatus.phase%21%3DFailed%2Cstatus.phase%21%3DSucceeded
I0814 11:56:05.439252 27948 request.go:480] Throttling request took 126.223138ms, request: GET:https://x.x.x.x.x/api/v1/pods?fieldSelector=spec.nodeName%3Dgke-node-bf2a5a1e-qg9g%2Cstatus.phase%21%3DFailed%2Cstatus.phase%21%3DSucceeded
I0814 11:56:05.639253 27948 request.go:480] Throttling request took 128.039815ms, request: GET:https://x.x.x.x.x/api/v1/pods?fieldSelector=spec.nodeName%3Dgke-node-bf2a5a1e-tw62%2Cstatus.phase%21%3DFailed%2Cstatus.phase%21%3DSucceeded
I0814 11:56:05.839299 27948 request.go:480] Throttling request took 111.435987ms, request: GET:https://x.x.x.x.x/api/v1/pods?fieldSelector=spec.nodeName%3Dgke-node-bf2a5a1e-w7n7%2Cstatus.phase%21%3DFailed%2Cstatus.phase%21%3DSucceeded
I0814 11:56:05.912036 27948 lownodeutilization.go:141] Node "gke-node-bf2a5a1e-kfh8" is under utilized with usage: api.ResourceThresholds{"cpu":21.428571428571427, "memory":7.093371019678181, "pods":7.2727272727272725}
I0814 11:56:05.912091 27948 lownodeutilization.go:149] allPods:8, nonRemovablePods:6, bePods:0, bPods:2, gPods:0
I0814 11:56:05.912148 27948 lownodeutilization.go:141] Node "gke-node-bf2a5a1e-mr9b" is under utilized with usage: api.ResourceThresholds{"pods":9.090909090909092, "cpu":33.92857142857143, "memory":3.773439230136872}
I0814 11:56:05.912160 27948 lownodeutilization.go:149] allPods:10, nonRemovablePods:7, bePods:0, bPods:3, gPods:0
I0814 11:56:05.912219 27948 lownodeutilization.go:141] Node "gke-node-bf2a5a1e-qcpq" is under utilized with usage: api.ResourceThresholds{"cpu":15.561224489795919, "memory":7.322094578473096, "pods":9.090909090909092}
I0814 11:56:05.912230 27948 lownodeutilization.go:149] allPods:10, nonRemovablePods:6, bePods:0, bPods:4, gPods:0
I0814 11:56:05.912264 27948 lownodeutilization.go:141] Node "gke-node-bf2a5a1e-qg9g" is under utilized with usage: api.ResourceThresholds{"cpu":27.551020408163264, "memory":7.681984428891371, "pods":7.2727272727272725}
I0814 11:56:05.912273 27948 lownodeutilization.go:149] allPods:8, nonRemovablePods:6, bePods:0, bPods:2, gPods:0
I0814 11:56:05.912513 27948 lownodeutilization.go:147] Node "gke-node-bf2a5a1e-9b15" is appropriately utilized with usage: api.ResourceThresholds{"memory":13.213583436348788, "pods":10.909090909090908, "cpu":22.372448979591837}
I0814 11:56:05.912537 27948 lownodeutilization.go:149] allPods:12, nonRemovablePods:6, bePods:0, bPods:6, gPods:0
I0814 11:56:05.912613 27948 lownodeutilization.go:141] Node "gke-node-bf2a5a1e-cs2l" is under utilized with usage: api.ResourceThresholds{"memory":2.9337443495765223, "pods":7.2727272727272725, "cpu":27.551020408163264}
I0814 11:56:05.912631 27948 lownodeutilization.go:149] allPods:8, nonRemovablePods:6, bePods:0, bPods:2, gPods:0
I0814 11:56:05.912749 27948 lownodeutilization.go:141] Node "gke-node-bf2a5a1e-dk7j" is under utilized with usage: api.ResourceThresholds{"cpu":15.051020408163266, "memory":6.403485637657102, "pods":7.2727272727272725}
I0814 11:56:05.912770 27948 lownodeutilization.go:149] allPods:8, nonRemovablePods:6, bePods:1, bPods:1, gPods:0
I0814 11:56:05.912828 27948 lownodeutilization.go:141] Node "gke-node-bf2a5a1e-ggdw" is under utilized with usage: api.ResourceThresholds{"cpu":19.387755102040817, "memory":1.9656336753240788, "pods":7.2727272727272725}
I0814 11:56:05.912855 27948 lownodeutilization.go:149] allPods:8, nonRemovablePods:6, bePods:0, bPods:2, gPods:0
I0814 11:56:05.912897 27948 lownodeutilization.go:141] Node "gke-node-bf2a5a1e-tw62" is under utilized with usage: api.ResourceThresholds{"cpu":20.918367346938776, "memory":2.4510629446719583, "pods":6.363636363636363}
I0814 11:56:05.912909 27948 lownodeutilization.go:149] allPods:7, nonRemovablePods:5, bePods:0, bPods:2, gPods:0
I0814 11:56:05.920135 27948 lownodeutilization.go:141] Node "gke-node-bf2a5a1e-3rt5" is under utilized with usage: api.ResourceThresholds{"cpu":27.551020408163264, "memory":7.68198179540342, "pods":7.2727272727272725}
I0814 11:56:05.920172 27948 lownodeutilization.go:149] allPods:8, nonRemovablePods:6, bePods:0, bPods:2, gPods:0
I0814 11:56:05.920269 27948 lownodeutilization.go:141] Node "gke-node-bf2a5a1e-50rb" is under utilized with usage: api.ResourceThresholds{"cpu":28.316326530612244, "memory":8.43523410990875, "pods":10}
I0814 11:56:05.920288 27948 lownodeutilization.go:149] allPods:11, nonRemovablePods:6, bePods:0, bPods:5, gPods:0
I0814 11:56:05.920354 27948 lownodeutilization.go:141] Node "gke-node-bf2a5a1e-hmq1" is under utilized with usage: api.ResourceThresholds{"cpu":27.806122448979593, "memory":7.93306590023853, "pods":8.181818181818182}
I0814 11:56:05.920370 27948 lownodeutilization.go:149] allPods:9, nonRemovablePods:6, bePods:0, bPods:3, gPods:0
I0814 11:56:05.920444 27948 lownodeutilization.go:141] Node "gke-node-bf2a5a1e-ps8k" is under utilized with usage: api.ResourceThresholds{"cpu":27.040816326530614, "memory":7.1798135857332, "pods":5.454545454545454}
I0814 11:56:05.920467 27948 lownodeutilization.go:149] allPods:6, nonRemovablePods:6, bePods:0, bPods:0, gPods:0
I0814 11:56:05.920580 27948 lownodeutilization.go:141] Node "gke-node-bf2a5a1e-w7n7" is under utilized with usage: api.ResourceThresholds{"cpu":23.724489795918366, "memory":4.948809588699222, "pods":8.181818181818182}
I0814 11:56:05.920632 27948 lownodeutilization.go:149] allPods:9, nonRemovablePods:5, bePods:0, bPods:4, gPods:0
I0814 11:56:05.920674 27948 lownodeutilization.go:147] Node "gke-node-bf2a5a1e-1t8l" is appropriately utilized with usage: api.ResourceThresholds{"memory":0.8776025543719349, "pods":3.6363636363636362, "cpu":7.653061224489796}
I0814 11:56:05.920690 27948 lownodeutilization.go:149] allPods:4, nonRemovablePods:4, bePods:0, bPods:0, gPods:0
I0814 11:56:05.920733 27948 lownodeutilization.go:141] Node "gke-node-bf2a5a1e-fjbd" is under utilized with usage: api.ResourceThresholds{"pods":5.454545454545454, "cpu":27.040816326530614, "memory":7.1798135857332}
I0814 11:56:05.920745 27948 lownodeutilization.go:149] allPods:6, nonRemovablePods:6, bePods:0, bPods:0, gPods:0
I0814 11:56:05.921125 27948 lownodeutilization.go:141] Node "gke-node-bf2a5a1e-kxb8" is under utilized with usage: api.ResourceThresholds{"cpu":27.29591836734694, "memory":7.43089769056831, "pods":6.363636363636363}
I0814 11:56:05.921145 27948 lownodeutilization.go:149] allPods:7, nonRemovablePods:6, bePods:0, bPods:1, gPods:0
I0814 11:56:05.921205 27948 lownodeutilization.go:141] Node "gke-node-bf2a5a1e-l4xf" is under utilized with usage: api.ResourceThresholds{"memory":2.8898642218579256, "pods":7.2727272727272725, "cpu":22.193877551020407}
I0814 11:56:05.921220 27948 lownodeutilization.go:149] allPods:8, nonRemovablePods:6, bePods:0, bPods:2, gPods:0
I0814 11:56:05.921279 27948 lownodeutilization.go:141] Node "gke-node-bf2a5a1e-cvsc" is under utilized with usage: api.ResourceThresholds{"cpu":8.418367346938776, "memory":1.4256836686734968, "pods":7.2727272727272725}
I0814 11:56:05.921297 27948 lownodeutilization.go:149] allPods:8, nonRemovablePods:5, bePods:1, bPods:1, gPods:1
I0814 11:56:05.921388 27948 lownodeutilization.go:141] Node "gke-node-bf2a5a1e-kntr" is under utilized with usage: api.ResourceThresholds{"cpu":12.525510204081632, "memory":3.879292539212722, "pods":6.363636363636363}
I0814 11:56:05.921407 27948 lownodeutilization.go:149] allPods:7, nonRemovablePods:5, bePods:0, bPods:2, gPods:0
I0814 11:56:05.921478 27948 lownodeutilization.go:141] Node "gke-node-bf2a5a1e-5415" is under utilized with usage: api.ResourceThresholds{"cpu":21.428571428571427, "memory":7.093371019678181, "pods":7.2727272727272725}
I0814 11:56:05.921495 27948 lownodeutilization.go:149] allPods:8, nonRemovablePods:6, bePods:0, bPods:2, gPods:0
I0814 11:56:05.921562 27948 lownodeutilization.go:141] Node "gke-node-bf2a5a1e-5dd0" is under utilized with usage: api.ResourceThresholds{"cpu":27.806122448979593, "memory":7.93306590023853, "pods":8.181818181818182}
I0814 11:56:05.921580 27948 lownodeutilization.go:149] allPods:9, nonRemovablePods:6, bePods:0, bPods:3, gPods:0
I0814 11:56:05.921636 27948 lownodeutilization.go:141] Node "gke-node-bf2a5a1e-74z1" is under utilized with usage: api.ResourceThresholds{"cpu":21.1734693877551, "memory":2.7021479758398677, "pods":7.2727272727272725}
I0814 11:56:05.921652 27948 lownodeutilization.go:149] allPods:8, nonRemovablePods:5, bePods:0, bPods:3, gPods:0
I0814 11:56:05.923369 27948 lownodeutilization.go:141] Node "gke-node-bf2a5a1e-bvsh" is under utilized with usage: api.ResourceThresholds{"cpu":21.1734693877551, "memory":2.7021470495070683, "pods":7.2727272727272725}
I0814 11:56:05.923431 27948 lownodeutilization.go:149] allPods:8, nonRemovablePods:5, bePods:0, bPods:3, gPods:0
I0814 11:56:05.923442 27948 lownodeutilization.go:65] Criteria for a node under utilization: CPU: 50, Mem: 50, Pods: 10
I0814 11:56:05.923478 27948 lownodeutilization.go:72] Total number of underutilized nodes: 22
I0814 11:56:05.923493 27948 lownodeutilization.go:85] all nodes are under target utilization, nothing to do here
According to the descheduler, "Total number of underutilized nodes: 22" and "all nodes are under target utilization", yet "nothing to do here". None of my underutilized nodes get drained.
How can I instruct the descheduler to drain the underutilized nodes?
This is just a starter issue to get ramped up on contributing. It's been a while since I've contributed anything upstream :).
For this issue we'd like to:
Hi
I found two things that
Command:
/bin/descheduler
Args:
--policy-config-file=/policy-dir/policy.yaml
--dry-run
And I checked the help info and found no such things mentioned.
-v, --v Level log level for V logs
# oc logs -f descheduler-cronjob-1523411400-z2z8z
I0411 01:50:37.761513 1 round_trippers.go:436] GET https://172.30.0.1:443/api 200 OK in 124 milliseconds
I0411 01:50:37.882412 1 round_trippers.go:436] GET https://172.30.0.1:443/apis 200 OK in 15 milliseconds
I0411 01:50:37.898564 1 round_trippers.go:436] GET https://172.30.0.1:443/api/v1 200 OK in 15 milliseconds
I0411 01:50:37.903192 1 reflector.go:202] Starting reflector *v1.Node (1h0m0s) from github.com/kubernetes-incubator/descheduler/pkg/descheduler/node/node.go:84
I0411 01:50:37.903215 1 reflector.go:240] Listing and watching *v1.Node from github.com/kubernetes-incubator/descheduler/pkg/descheduler/node/node.go:84
I0411 01:50:37.919025 1 round_trippers.go:436] GET https://172.30.0.1:443/api/v1/nodes?limit=500&resourceVersion=0 200 OK in 15 milliseconds
I0411 01:50:37.963552 1 round_trippers.go:436] GET https://172.30.0.1:443/api/v1/nodes?resourceVersion=8694&timeoutSeconds=481&watch=true 200 OK in 25 milliseconds
I0411 01:50:38.011943 1 duplicates.go:50] Processing node: "ip-172-18-7-158.ec2.internal"
I0411 01:50:38.028600 1 round_trippers.go:436] GET https://172.30.0.1:443/api/v1/pods?fieldSelector=spec.nodeName%3Dip-172-18-7-158.ec2.internal%2Cstatus.phase%21%3DFailed%2Cstatus.phase%21%3DSucceeded 200 OK in 16 milliseconds
I0411 01:50:38.433484 1 duplicates.go:54] "ReplicationController/hello-1"
I0411 01:50:38.433510 1 duplicates.go:50] Processing node: "ip-172-18-14-173.ec2.internal"
I0411 01:50:38.461836 1 round_trippers.go:436] GET https://172.30.0.1:443/api/v1/pods?fieldSelector=spec.nodeName%3Dip-172-18-14-173.ec2.internal%2Cstatus.phase%21%3DFailed%2Cstatus.phase%21%3DSucceeded 200 OK in 28 milliseconds
I0411 01:50:38.479347 1 duplicates.go:54] "ReplicationController/hello-1"
I0411 01:50:38.495887 1 round_trippers.go:436] GET https://172.30.0.1:443/api/v1/pods?fieldSelector=spec.nodeName%3Dip-172-18-7-158.ec2.internal%2Cstatus.phase%21%3DFailed%2Cstatus.phase%21%3DSucceeded 200 OK in 16 milliseconds
I0411 01:50:38.568027 1 round_trippers.go:436] GET https://172.30.0.1:443/api/v1/pods?fieldSelector=spec.nodeName%3Dip-172-18-14-173.ec2.internal%2Cstatus.phase%21%3DFailed%2Cstatus.phase%21%3DSucceeded 200 OK in 15 milliseconds
I0411 01:50:38.569526 1 lownodeutilization.go:141] Node "ip-172-18-7-158.ec2.internal" is under utilized with usage: api.ResourceThresholds{"cpu":30, "memory":14.27776271919991, "pods":5.6}
I0411 01:50:38.569571 1 lownodeutilization.go:149] allPods:14, nonRemovablePods:9, bePods:4, bPods:1, gPods:0
I0411 01:50:38.569603 1 lownodeutilization.go:141] Node "ip-172-18-14-173.ec2.internal" is under utilized with usage: api.ResourceThresholds{"cpu":20, "memory":11.422210175359927, "pods":4.4}
I0411 01:50:38.569616 1 lownodeutilization.go:149] allPods:11, nonRemovablePods:3, bePods:8, bPods:0, gPods:0
I0411 01:50:38.569623 1 lownodeutilization.go:65] Criteria for a node under utilization: CPU: 40, Mem: 40, Pods: 40
I0411 01:50:38.569630 1 lownodeutilization.go:72] Total number of underutilized nodes: 2
I0411 01:50:38.569635 1 lownodeutilization.go:80] all nodes are underutilized, nothing to do here
I0411 01:50:38.569644 1 pod_antiaffinity.go:45] Processing node: "ip-172-18-7-158.ec2.internal"
I0411 01:50:38.585039 1 round_trippers.go:436] GET https://172.30.0.1:443/api/v1/pods?fieldSelector=spec.nodeName%3Dip-172-18-7-158.ec2.internal%2Cstatus.phase%21%3DFailed%2Cstatus.phase%21%3DSucceeded 200 OK in 15 milliseconds
I0411 01:50:38.595997 1 pod_antiaffinity.go:45] Processing node: "ip-172-18-14-173.ec2.internal"
I0411 01:50:38.635903 1 round_trippers.go:436] GET https://172.30.0.1:443/api/v1/pods?fieldSelector=spec.nodeName%3Dip-172-18-14-173.ec2.internal%2Cstatus.phase%21%3DFailed%2Cstatus.phase%21%3DSucceeded 200 OK in 39 milliseconds
I0411 01:50:38.659140 1 node_affinity.go:31] Evicted 0 pods
The project looks interesting! One consideration for the "Future Roadmap" that would be worth thinking about is fault domains and the PVCs that are associated with a pod.
"Evicting" a pod with a PVC due to LowNodeUtilization on another node would not result in actual re-placement of that pod, so it shouldn't be attempted.
Woops
Currently RemoveDuplicates evicts pods according to metadata.annotations.kubernetes.io/created-by, but created-by is deprecated (https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG-1.9.md)
https://github.com/kubernetes-incubator/descheduler/blob/master/pkg/descheduler/pod/pods.go#L113
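A simplified sketch of what a replacement check could look like (not the actual fix), reading metadata.ownerReferences, which supersede the created-by annotation:
package pod

import v1 "k8s.io/api/core/v1"

// ownerKinds collects the kinds of a pod's owners (ReplicaSet,
// ReplicationController, etc.) from metadata.ownerReferences instead of
// the deprecated kubernetes.io/created-by annotation.
func ownerKinds(pod *v1.Pod) []string {
    var kinds []string
    for _, ref := range pod.OwnerReferences {
        kinds = append(kinds, ref.Kind)
    }
    return kinds
}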
Currently, the descheduler only supports cpu, memory and pods, but if we put another resource name or an invalid resource name, then we do not get an error -
apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
"LowNodeUtilization":
enabled: true
params:
nodeResourceUtilizationThresholds:
thresholds:
"cpu" : 40
"memory": 40
"pods": 40
"storage": 25 # unsupported value
targetThresholds:
"cpu" : 30
"memory": 2
"pods": 1
should throw an error, but it does not -
$ _output/bin/descheduler --kubeconfig-file /var/run/kubernetes/admin.kubeconfig --policy-config-file examples/custom.yaml -v 5
I1124 15:07:19.211499 16232 reflector.go:198] Starting reflector *v1.Node (1h0m0s) from github.com/kubernetes-incubator/descheduler/pkg/descheduler/node/node.go:83
I1124 15:07:19.211643 16232 node.go:50] node lister returned empty list, now fetch directly
I1124 15:07:19.211789 16232 reflector.go:236] Listing and watching *v1.Node from github.com/kubernetes-incubator/descheduler/pkg/descheduler/node/node.go:83
I1124 15:07:21.982785 16232 lownodeutilization.go:115] Node "kubernetes-minion-group-j57l" usage: api.ResourceThresholds{"cpu":74.5, "memory":12.991864967531136, "pods":13.636363636363637}
I1124 15:07:21.983109 16232 lownodeutilization.go:115] Node "kubernetes-minion-group-jk62" usage: api.ResourceThresholds{"memory":17.931192353458197, "pods":12.727272727272727, "cpu":87.3}
I1124 15:07:21.983179 16232 lownodeutilization.go:115] Node "kubernetes-minion-group-vkv3" usage: api.ResourceThresholds{"cpu":10, "memory":2.764226588836412, "pods":1.8181818181818181}
I1124 15:07:21.983320 16232 lownodeutilization.go:115] Node "kubernetes-master" usage: api.ResourceThresholds{"pods":8.181818181818182, "cpu":95, "memory":11.575631035804197}
I1124 15:07:21.983374 16232 lownodeutilization.go:163] evicting pods from node "kubernetes-minion-group-jk62" with usage: api.ResourceThresholds{"cpu":87.3, "memory":17.931192353458197, "pods":12.727272727272727}
I1124 15:07:21.983402 16232 lownodeutilization.go:163] evicting pods from node "kubernetes-master" with usage: api.ResourceThresholds{"cpu":95, "memory":11.575631035804197, "pods":8.181818181818182}
I1124 15:07:21.983485 16232 lownodeutilization.go:163] evicting pods from node "kubernetes-minion-group-j57l" with usage: api.ResourceThresholds{"cpu":74.5, "memory":12.991864967531136, "pods":13.636363636363637}
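A sketch of the missing check, assuming validation is run where the policy file is decoded (validateThresholds and supportedResources are hypothetical names, not existing code):

import (
	"fmt"

	v1 "k8s.io/api/core/v1"

	"github.com/kubernetes-incubator/descheduler/pkg/api"
)

var supportedResources = map[v1.ResourceName]struct{}{
	v1.ResourceCPU:    {},
	v1.ResourceMemory: {},
	v1.ResourcePods:   {},
}

// validateThresholds rejects unsupported resource names and out-of-range
// percentages instead of silently ignoring them.
func validateThresholds(thresholds api.ResourceThresholds) error {
	for name, pct := range thresholds {
		if _, ok := supportedResources[name]; !ok {
			return fmt.Errorf("unsupported resource name %q in thresholds", name)
		}
		if pct < 0 || pct > 100 {
			return fmt.Errorf("threshold for %q must be between 0 and 100, got %v", name, pct)
		}
	}
	return nil
}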
Hey!
It looks like PVCs should be marked as non-evictable, or there should be a user-defined flag to allow that behavior.
Greetings.
Now that the 1.9 rebase has happened, we can use priorities (https://kubernetes.io/docs/concepts/configuration/pod-priority-preemption/), which are in alpha, when evicting pods.
From the little that I have read about node affinity, does adding the following strategy make sense?
For pods with node affinity set using preferredDuringSchedulingIgnoredDuringExecution, it is possible that the preferred node was unavailable during scheduling and the pod was placed on another node. In that case, when the descheduler runs, it would check whether the preferred node can now accommodate the pod, and if so evict the pod so that the scheduler can place it there (a rough sketch of this check follows the policy example below).
Maybe we could have a policy file describing the strategy like -
apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
"RemovePodsViolatingNodeAffinity":
enabled: true
@aveshagarwal @ravisantoshgudimetla if this makes sense, can I take a stab at a PoC for this?
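A rough PoC sketch of the check described above, under simplifying assumptions (only the In operator is handled; termMatches and shouldEvict are hypothetical names, not existing code):

import v1 "k8s.io/api/core/v1"

// termMatches reports whether a node's labels satisfy one preferred
// scheduling term; only the "In" operator is handled in this sketch.
func termMatches(term v1.PreferredSchedulingTerm, node *v1.Node) bool {
	for _, expr := range term.Preference.MatchExpressions {
		if expr.Operator != v1.NodeSelectorOpIn {
			return false // simplification: other operators not handled
		}
		matched := false
		for _, want := range expr.Values {
			if node.Labels[expr.Key] == want {
				matched = true
				break
			}
		}
		if !matched {
			return false
		}
	}
	return true
}

// shouldEvict reports whether the pod sits on a node that satisfies none
// of its preferred terms while some other node now satisfies one.
func shouldEvict(pod *v1.Pod, current *v1.Node, others []*v1.Node) bool {
	aff := pod.Spec.Affinity
	if aff == nil || aff.NodeAffinity == nil {
		return false
	}
	for _, term := range aff.NodeAffinity.PreferredDuringSchedulingIgnoredDuringExecution {
		if termMatches(term, current) {
			return false // preference already satisfied where the pod runs
		}
		for _, n := range others {
			if termMatches(term, n) {
				return true // a preferred node is now available
			}
		}
	}
	return false
}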
As of now, we are spinning up a new cluster on GCE for the descheduler e2es; we should explore alternatives to this. https://github.com/kubernetes-sigs/kind seems really easy to set up and run tests against. The problem I notice as of now is that it doesn't support a multi-node cluster, but this feature is actively being worked on in kubernetes-sigs/kind#147; we can wait for that to land before we make the switch.
I was reading the Rescheduler-Design-Implementation document (https://docs.google.com/document/d/1KXw02Q0cOF1MUrdpPNiug0yGZlixvPg2SwBycrT5DkE/edit) and saw that the descheduler should also support a HighNodeUtilization strategy option,
meaning that the descheduler should evict pods from nodes that have reached high thresholds.
This is what I am trying to achieve, rebalancing pods from heavily loaded nodes onto underutilized ones, but I cannot seem to get it to work :(
Any idea how a policy that balances HighNodeUtilization nodes should be defined? Or is it not implemented in the code? Is it a feature that can be added?
Thank you for any kind of help,
Roiy
Does the descheduler evict pods created by a StatefulSet?
I've got 3 pods from the same StatefulSet on one node, but this is not picked up by the duplicates strategy.
Running go test -cover:
? github.com/kubernetes-incubator/descheduler/cmd/descheduler [no test files]
? github.com/kubernetes-incubator/descheduler/cmd/descheduler/app [no test files]
? github.com/kubernetes-incubator/descheduler/cmd/descheduler/app/options [no test files]
? github.com/kubernetes-incubator/descheduler/pkg/api [no test files]
? github.com/kubernetes-incubator/descheduler/pkg/api/install [no test files]
? github.com/kubernetes-incubator/descheduler/pkg/api/v1alpha1 [no test files]
? github.com/kubernetes-incubator/descheduler/pkg/apis/componentconfig [no test files]
? github.com/kubernetes-incubator/descheduler/pkg/apis/componentconfig/install [no test files]
? github.com/kubernetes-incubator/descheduler/pkg/apis/componentconfig/v1alpha1 [no test files]
? github.com/kubernetes-incubator/descheduler/pkg/descheduler [no test files]
? github.com/kubernetes-incubator/descheduler/pkg/descheduler/client [no test files]
ok github.com/kubernetes-incubator/descheduler/pkg/descheduler/evictions 0.273s coverage: 50.0% of statements
? github.com/kubernetes-incubator/descheduler/pkg/descheduler/evictions/utils [no test files]
ok github.com/kubernetes-incubator/descheduler/pkg/descheduler/node 0.268s coverage: 72.5% of statements
ok github.com/kubernetes-incubator/descheduler/pkg/descheduler/pod 0.144s coverage: 33.3% of statements
? github.com/kubernetes-incubator/descheduler/pkg/descheduler/scheme [no test files]
ok github.com/kubernetes-incubator/descheduler/pkg/descheduler/strategies 0.077s coverage: 73.6% of statements
? github.com/kubernetes-incubator/descheduler/pkg/utils [no test files]
? github.com/kubernetes-incubator/descheduler/test [no test files]
Some packages are missing tests completely.
Test coverage is another feature I love about Go :). More about it at https://blog.golang.org/cover
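For an HTML view of which statements the existing tests hit, the standard Go tooling works here too, e.g. for one package:
$ go test -coverprofile=coverage.out ./pkg/descheduler/strategies
$ go tool cover -html=coverage.out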
Currently types.IsCriticalPod(pod) is still restricted to the "kube-system" namespace.
There was a PR addressing this, but it does not seem to work as expected:
kubernetes/kubernetes@5b54626#diff-9fe046de3c6aaa377bb7fa24a34509c9R155
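For reference, the upstream check is roughly shaped like the sketch below (paraphrased from the 1.9-era kubelet types package, not copied verbatim): the namespace gate is hard-coded, so the annotation alone is not enough.

// isCritical paraphrases the vendored check: critical pods are
// recognized only inside kube-system, regardless of the annotation.
func isCritical(namespace string, annotations map[string]string) bool {
	if namespace != "kube-system" {
		return false
	}
	_, ok := annotations["scheduler.alpha.kubernetes.io/critical-pod"]
	return ok
}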
Hi,
My k8s cluster is running on GKE.
I tried using the descheduler, but after compiling I get this error:
$ ./bin/descheduler --dry-run --kubeconfig ~/.kube/config
E0521 16:26:53.978025 4226 server.go:46] No Auth Provider found for name "gcp"
Apparently the code in descheduler/pkg/descheduler/client/client.go needs to import _ "k8s.io/client-go/plugin/pkg/client/auth/gcp" (or _ "k8s.io/client-go/plugin/pkg/client/auth" to support other auth providers).
After adding _ "k8s.io/client-go/plugin/pkg/client/auth/gcp" to the import list I was able to authenticate against GKE.
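For anyone hitting the same error, the fix amounts to a blank import in client.go; the umbrella package registers gcp and the other providers:

import (
	// Registers all client-go auth provider plugins (gcp, oidc, azure);
	// the gcp-only variant is k8s.io/client-go/plugin/pkg/client/auth/gcp.
	_ "k8s.io/client-go/plugin/pkg/client/auth"
)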
It would be great to have a version matrix in the README showing which descheduler versions are compatible with which Kubernetes versions.
The title says it all, really: will the descheduler honor PDBs to ensure we don't evict too many things at once?
https://kubernetes.io/docs/concepts/workloads/pods/disruptions/
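Evictions that go through the eviction subresource (rather than a plain pod delete) are checked against PDBs by the API server itself, which refuses the eviction when a budget would be violated. A minimal sketch of that call with a 2018-era client-go (the evict helper is a hypothetical name):

import (
	v1 "k8s.io/api/core/v1"
	policyv1beta1 "k8s.io/api/policy/v1beta1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// evict uses the eviction subresource, so PodDisruptionBudgets are
// honored server-side; a blocked eviction comes back as an error.
func evict(client kubernetes.Interface, pod *v1.Pod) error {
	eviction := &policyv1beta1.Eviction{
		ObjectMeta: metav1.ObjectMeta{Name: pod.Name, Namespace: pod.Namespace},
	}
	return client.PolicyV1beta1().Evictions(pod.Namespace).Evict(eviction)
}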