Comments (44)
Seems like a reasonable ask. @kabakaev I am planning to defer this to 0.6 release or later. Hope you are ok with that.
I'd imagine an extra descheduler policy, which evicts a pod in status.phase != Running for more than a configured period since metadata.creationTimestamp.
@kabakaev thanks for the info. How about using the PodLifeTime strategy? We would need to add an additional strategy parameter to handle status.phase != Running.
Maybe something like this ...
---
apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "PodLifeTime":
    enabled: true
    params:
      maxPodLifeTimeSeconds: 300
      podStatusPhase:
      - pending
@damemi @ingvagabund @lixiang233 please add any additional ideas you have. Thanks!
@ingvagabund I think the use case is to deschedule pods that are Pending for a short period of time.
Yeah, with such a short period of time it makes sense to limit the phase. Though maybe not to every phase. Pending is the first phase after a pod is accepted. I can't find any field in a pod's status saying when the pod transitioned into a given phase. Also, the other phases (Failed, Succeeded) are completely ignored, which leaves only Running and Unknown. Running is the default one in most cases. A podStatusPhase field is fine, though I would just limit it to Pending and Running right now.
@seanmalloy @ingvagabund @kabakaev Does anyone plan to work on this? If not, I'd love to help with the feature.
@lixiang233 this feature enhancement is all yours. Thanks!
Ref also kubernetes/kubernetes#14796
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale
Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close
@fejta-bot: Closing this issue.
In response to this:
Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
A whole bunch of issues were referred to this one, and then it gets auto-closed. Should users just write a controller and delete pods after too many restarts, etc.?
/reopen
@mbdas: You can't reopen an issue/PR unless you authored it or you are a collaborator.
In response to this:
/reopen
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/reopen
@k82cn: Reopened this issue.
In response to this:
/reopen
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close
@fejta-bot: Closing this issue.
In response to this:
Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/reopen
@ravisantoshgudimetla: Reopened this issue.
In response to this:
/reopen
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
#89 tried addressing this. Let's make sure we're getting this in before the next release.
Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close
@fejta-bot: Closing this issue.
In response to this:
Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/reopen
/remove-lifecycle rotten
@damemi: Reopened this issue.
In response to this:
/reopen
/remove-lifecycle rotten
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/kind feature
Please remember to implement safeguards against a "pod of death" where attempting to start it crashes the kubelet. From what I understand, this would look similar to a pod that simply can't be started on that node.
Regarding pods that restart too often (CrashLoopBackOff), the value of this is questionable, because in the majority of cases where that happens the cause is an application bug.
True, it is an app error most of the time, but there are valid scenarios where restarting a sidecar (assuming that is what crashed) may not fix the pod as a whole and a fresh launch may be needed. In real life, host misconfigurations or a bad upgrade often cause pod failures, and it is good to move away from those bad hosts in the fleet.
This looks like a reasonable proposal. Ephemeral inline volumes have the same problem: a pod gets scheduled onto a node, then the CSI driver's NodePublishVolume finds that it cannot create the volume, and the pod is stuck.
For my own understanding: is the proposal to delete such a failed pod and then let a higher-level controller (like a stateful set) create a new pod, or is the proposal to just move the pod off the node and schedule it again?
If I'm not mistaken, a pod gets scheduled only once for its entire lifetime. So unless it's deleted and replaced by a controller/operator, new scheduling will not happen. Now there is a chance the pod may be scheduled back onto the bad node (for that specific use case), but proper fleet management will eventually remove a node that has a high rate of failure. In most cases it will land on a good node. For use cases where a fresh pod launch is all that's required, any node is fine.
It should also be noted that the descheduler only considers pods for eviction that have an ownerReference (unless this is explicitly overridden), so pods that aren't managed by a controller which would attempt to reschedule them would not be evicted by default.
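For reference, the explicit override mentioned here is, as far as I recall, a per-pod annotation. A minimal sketch of opting a bare pod in for eviction (the pod name and image are just placeholders for illustration):
apiVersion: v1
kind: Pod
metadata:
  name: standalone-pod                          # hypothetical name, for illustration only
  annotations:
    # lets the descheduler evict this pod even though it has no ownerReference
    descheduler.alpha.kubernetes.io/evict: "true"
spec:
  containers:
  - name: app
    image: nginx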
If I'm not mistaken, a pod gets scheduled only once for its entire lifetime.
That's what I thought, thanks for the confirmation.
Now there is a chance the pod may be scheduled back onto the bad node (for that specific use case), but proper fleet management will eventually remove a node that has a high rate of failure.
That may be true for "broken" nodes, but not for a node that simply doesn't have enough storage capacity left for a certain kind of inline ephemeral volume. I was proposing to add capacity tracking to ensure that Kubernetes will eventually pick a node that has enough capacity, but that KEP has been postponed.
New strategy RemovePodsHavingTooManyRestarts was added in PR #254.
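For anyone finding this later, enabling the new strategy looks roughly like this (a sketch based on the descheduler docs; the threshold value is just an example):
---
apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "RemovePodsHavingTooManyRestarts":
    enabled: true
    params:
      podsHavingTooManyRestarts:
        podRestartThreshold: 100      # evict once total container restarts exceed this count
        includingInitContainers: true # count init container restarts as well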
Looking at the original requirements provided in the issue description, there is a request to add a strategy that can ...
pod-startup-failure: a pod was scheduled on a node, but was unable to start all of its containers since $maxStartupTime seconds.
@kabakaev do you still have a need for this feature?
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale
@seanmalloy, I've tested the PodsHavingTooManyRestarts policy on v0.18.0.
It works well for the case of too many restarts triggered by a livenessProbe. Supposedly, it helps with CrashLoopBackoff too. Thus, the first part of the feature request, container-restart-rate, is implemented.
Unfortunately, the second part, pod-startup-failure, is not addressed by PR #254.
I've tested the second case by breaking the CSI node plugin on one of the k8s nodes. It led to a new pod hanging in ContainerCreating state forever. Descheduler did not evict the pod.
(Let me know if a "how to reproduce" guide is needed.)
It seems all the necessary info is already written in the pod object:
- metadata.creationTimestamp;
- ownerReferences;
- status.phase = Pending.
I'd imagine an extra descheduler policy, which evicts a pod in status.phase != Running for more than a configured period since metadata.creationTimestamp.
/remove-lifecycle stale
@seanmalloy I think PodLifeTime already has the ability to evict not-running pods (except Succeeded and Failed pods). Do you mean to only evict pods in a certain StatusPhase?
PodLifeTime checks the meta.CreationTimestamp field, nothing else. Also, any pod that's not Failed and not Succeeded is processed. So PodLifeTime should work from scratch. No need to modify it.
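For illustration, the unmodified strategy as it exists today would be configured roughly like this (a sketch; the one-day value is an arbitrary example):
---
apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "PodLifeTime":
    enabled: true
    params:
      maxPodLifeTimeSeconds: 86400  # evict any non-Succeeded/non-Failed pod older than one day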
@seanmalloy I think PodLifeTime already has the ability to evict not-running pods (except Succeeded and Failed pods). Do you mean to only evict pods in a certain StatusPhase?
@lixiang233 yes, based on my understanding of the problem I think it might be reasonable to have an option to only consider pods with a certain StatusPhase for eviction (i.e. StatusPhase is Pending).
PodLifeTime checks the meta.CreationTimestamp field, nothing else. Also, any pod that's not Failed and not Succeeded is processed. So PodLifeTime should work from scratch. No need to modify it.
@ingvagabund I think the use case is to deschedule pods that are Pending for a short period of time. For example, evict all pods that are in status Pending and are more than 5 minutes old. Right now, configuring PodLifeTime to evict pods that are more than 5 minutes old would probably cause too much disruption in a cluster.
I know we have a lot of recently added configuration options for including/excluding pods based on different criteria (i.e. namespace and priority). But what do you think of adding one more? We could try adding this to only the PodLifeTime strategy. What do you think?
@kabakaev do my above comments make sense to you? Would this newly proposed feature handle your use case?
I think this can be implemented now. The consensus is to add a new strategy parameter to the PodLifeTime strategy. The new strategy parameter will filter pods considered for eviction based on podStatusPhase. This strategy parameter should only allow filtering on the Pending and Running phases.
Maybe something like this ...
---
apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "PodLifeTime":
    enabled: true
    params:
      maxPodLifeTimeSeconds: 300
      podStatusPhases: # <=== this is the default if not specified
      - pending
      - running
Another example ...
---
apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "PodLifeTime":
    enabled: true
    params:
      maxPodLifeTimeSeconds: 300
      podStatusPhases:
      - pending # <==== only evict pending pods
@seanmalloy @ingvagabund @kabakaev Does anyone plan to work on this? If not, I'd love to help with the feature.
/assign
I am not sure if the implementation proposed here is addressing what was actually requested in #62 (comment). Correct me if I'm wrong, but it seems like we're talking about adding a parameter that will only evict Pending pods (and not Running pods), which was added in #393.
Which led to the request of
I'd imagine an extra descheduler policy, which evicts a pod in status.phase != Running
But my understanding of the problem above is more that a pod that was in Pending state didn't get evicted at all. I think that points to a bigger bug, because the current pod selector should only exclude Succeeded and Failed pods:
descheduler/pkg/descheduler/pod/pods.go (line 75 in 03dbc93)
Is there somewhere else in the code where we are only selecting Running pods for eviction?
Also, is there a use case for excluding all Running pods from eviction with this strategy?
If the CNI/CSI plugin failed to set up a pod, or the pod's image is not available on a node, the pod will be in ContainerCreating or ImagePullBackOff status and its status.phase will be Pending. If we want to recreate these pods promptly, I think we can only use the PodLifeTime strategy and set maxPodLifeTimeSeconds to a short period of time, and we should limit the phase to protect running pods, so here we add a new parameter to include only Pending pods.
@damemi Do you mean we should let every strategy customize its excluded phases?
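For that use case, combining the proposed parameter with a short lifetime might look roughly like this (a sketch; the parameter name follows the consensus examples above and the 120-second value is arbitrary):
---
apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "PodLifeTime":
    enabled: true
    params:
      maxPodLifeTimeSeconds: 120  # recreate pods stuck in ContainerCreating/ImagePullBackOff quickly
      podStatusPhases:
      - pending                   # leave Running pods alone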
...a pod that was in Pending state didn't get evicted at all. I think that points to a bigger bug...
@damemi, the first statement is true, but it is because I didn't enable the PodLifeTime strategy while testing this. With PodLifeTime enabled, the pending pod would probably have been deleted as well, after some ridiculous delay configured in maxPodLifeTimeSeconds. So there should be no bug in the descheduler, if you mean my test outcome.
we're talking about adding a parameter that will only evict Pending pods (and not Running pods)
Yes, my understanding is that PodLifeTime with the podStatusPhases=Pending parameter should evict only pending pods.
Ah I see, you want to set a short LifeTime for these pending pods and not be evicting many running pods because of that. Sounds good, I understand now. Thanks for clarifying!