mattmattox / drain-node-on-crash
This app automatically drains a node after a crash when the node has not recovered after 5 minutes.
Leader is failing to pull from Docker Hub
Error message:
ImagePullBackOff: Back-off pulling image "drainnode/leader:v1.0-rc9"
ErrImagePull: rpc error: code = Unknown desc = Error response from daemon: manifest for drainnode/leader:v1.0-rc9 not found: manifest unknown: manifest unknown
Need to fix the permissions for the service account system:serviceaccount:drain-node-on-crash:drain-node
Error message:
F0907 23:42:27.979181 8 main.go:108] failed to create election: endpoints "drain-node-on-crash" is forbidden: User "system:serviceaccount:drain-node-on-crash:drain-node" cannot get resource "endpoints" in API group "" in the namespace "default"
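The error above shows the leader container looking up an `endpoints` object in the `default` namespace, while the service account lives in `drain-node-on-crash`. One possible fix, sketched here under assumptions, is a Role in `default` bound to that service account. The name `drain-node-leader-election` is hypothetical, and only the `get` verb is confirmed by the log; `create` and `update` are assumed on the basis that an endpoints-based election also has to write its lock.

```yaml
# Sketch only: grants the drain-node service account access to endpoints
# in the "default" namespace, where the error says the election is created.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: drain-node-leader-election   # hypothetical name
  namespace: default
rules:
  - apiGroups: [""]                  # core API group, as in the error message
    resources: ["endpoints"]
    verbs: ["get", "create", "update"]   # "get" is from the log; the rest are assumed
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: drain-node-leader-election   # hypothetical name
  namespace: default
subjects:
  - kind: ServiceAccount
    name: drain-node
    namespace: drain-node-on-crash
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: drain-node-leader-election
```

An alternative would be to have the election run in the `drain-node-on-crash` namespace instead, if the leader image supports that; nothing in this report confirms whether it does.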
Please provide info on how to install from master. It looks like there were changes to the build stack.
Chart: v1.0-rc8
Settings: Defaults
k8s: v1.15.12
Error message:
Failed to install app drain-node-on-crash. Error: Deployment.apps "drain-node_manager" is invalid: metadata.name: Invalid value: "drain-node_manager": a DNS-1123 subdomain must consist of lower case alphanumeric characters, '-' or '.', and must start and end with an alphanumeric character (e.g. 'example.com', regex used for validation is '[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*')
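The validation message rejects `drain-node_manager` because an underscore is not allowed in a DNS-1123 subdomain. A minimal fragment with a name that satisfies the quoted regex might look like this; it assumes the fix is simply to swap the underscore for a hyphen in the chart template, which is not confirmed here.

```yaml
# Fragment only: the metadata block of the Deployment, with a valid name.
apiVersion: apps/v1
kind: Deployment
metadata:
  # "drain-node_manager" fails validation: "_" is outside [-a-z0-9.]
  name: drain-node-manager
```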
Replicas are getting set to 1 by default. Need to fix values.yaml and deployment.yaml.
YAML
apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "2"
    meta.helm.sh/release-name: drain-node-on-crash
    meta.helm.sh/release-namespace: drain-node-on-crash
  creationTimestamp: "2020-09-07T23:08:19Z"
  generation: 2
  labels:
    app: drain-node-on-crash
    app.kubernetes.io/managed-by: Helm
    io.cattle.field/appId: drain-node-on-crash
  name: drain-node-manager
  namespace: drain-node-on-crash
  resourceVersion: "6331"
  selfLink: /apis/apps/v1/namespaces/drain-node-on-crash/deployments/drain-node-manager
  uid: a49650e1-d1c0-4133-95e6-21edc1cbb5a8
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: drain-node-on-crash
      io.cattle.field/appId: drain-node-on-crash
  strategy:
    type: Recreate
  template:
    metadata:
      annotations:
        cattle.io/timestamp: "2020-09-07T23:16:27Z"
        field.cattle.io/ports: '[[{"containerPort":4040,"dnsName":"drain-node-manager","hostPort":0,"kind":"ClusterIP","name":"4040tcp02","protocol":"TCP","sourcePort":0}]]'
      creationTimestamp: null
      labels:
        app: drain-node-on-crash
        io.cattle.field/appId: drain-node-on-crash
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - drain-node-on-crash
            topologyKey: kubernetes.io/hostname
      containers:
      - env:
        - name: AUTO_UNCORDON
          value: "true"
        - name: NODE_TIMEOUT
          value: "360"
        image: drainnode/manager:v1.0-rc9
        imagePullPolicy: IfNotPresent
        name: drain
        resources: {}
        securityContext:
          capabilities: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
      - args:
        - --election=drain-node-on-crash
        - --http=0.0.0.0:4040
        image: docker.io/drainnode/leader:v1.0-rc9
        imagePullPolicy: IfNotPresent
        name: leader
        ports:
        - containerPort: 4040
          name: 4040tcp02
          protocol: TCP
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: drain-node
      serviceAccountName: drain-node
      terminationGracePeriodSeconds: 30
status:
  conditions:
  - lastTransitionTime: "2020-09-07T23:08:19Z"
    lastUpdateTime: "2020-09-07T23:08:19Z"
    message: Deployment does not have minimum availability.
    reason: MinimumReplicasUnavailable
    status: "False"
    type: Available
  - lastTransitionTime: "2020-09-07T23:08:19Z"
    lastUpdateTime: "2020-09-07T23:17:12Z"
    message: ReplicaSet "drain-node-manager-9847c878d" is progressing.
    reason: ReplicaSetUpdated
    status: "True"
    type: Progressing
  observedGeneration: 2
  replicas: 1
  unavailableReplicas: 1
  updatedReplicas: 1
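For the replica count reported above, one way it could be made configurable is sketched below. The key name `replicaCount` is hypothetical (the chart's real values.yaml keys are not shown in this report), and the value 2 is only illustrative. Note that the manifest uses a required pod anti-affinity on `kubernetes.io/hostname`, so each replica needs its own node; any replica beyond the number of schedulable nodes would stay Pending.

```yaml
# values.yaml (hypothetical key; the chart's actual key may differ)
replicaCount: 2

# templates/deployment.yaml would then reference it, for example:
#   spec:
#     replicas: {{ .Values.replicaCount }}
```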
Hi,
Any reason you decided to use a Deployment rather than a DaemonSet? DS would be more reliable IMO.
For example, in a two-node cluster, if node A goes down, all pods in the Deployment will be scheduled on node B. Then, when A returns and B goes down, drain-node-on-crash is slower to react because it must first wait for a pod to be scheduled on A. With a DaemonSet there is already a pod running on A, so drain-node-on-crash doesn't suffer this delay.
Thanks and nice work BTW.
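As a rough illustration of the DaemonSet suggestion, the Deployment shown earlier could be restated along these lines. This is a sketch, not the project's manifest: it reuses the images, arguments, environment variables and service account from that Deployment, and drops `replicas`, the `Recreate` strategy and the pod anti-affinity, which a DaemonSet does not need since it runs one pod per eligible node.

```yaml
# Sketch only: a DaemonSet variant of the existing Deployment.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: drain-node-manager
  namespace: drain-node-on-crash
  labels:
    app: drain-node-on-crash
spec:
  selector:
    matchLabels:
      app: drain-node-on-crash
  template:
    metadata:
      labels:
        app: drain-node-on-crash
    spec:
      serviceAccountName: drain-node
      containers:
        - name: drain
          image: drainnode/manager:v1.0-rc9
          env:
            - name: AUTO_UNCORDON
              value: "true"
            - name: NODE_TIMEOUT
              value: "360"
        - name: leader
          image: docker.io/drainnode/leader:v1.0-rc9
          args:
            - --election=drain-node-on-crash
            - --http=0.0.0.0:4040
          ports:
            - containerPort: 4040
```

Nodes that carry taints (control plane nodes, for instance) would only get a pod if matching tolerations were added; whether that is wanted here is left open.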