groundnuty / k8s-wait-for
A simple script that allows you to wait for a k8s service, job, or pods to enter a desired state
License: MIT License
We are running a k8s-wait-for init container in AWS EKS. The Kubernetes server version is:
"serverVersion": {
"major": "1",
"minor": "23+",
"gitVersion": "v1.23.17-eks-0a21954",
"gitCommit": "cd5c12c51b0899612375453f7a7c2e7b6563f5e9",
"gitTreeState": "clean",
"buildDate": "2023-04-15T00:32:27Z",
"goVersion": "go1.19.6",
"compiler": "gc",
"platform": "linux/amd64"
}
But still, the output of the describe job command is:
...
Completion Mode: NonIndexed
Start Time: Mon, 26 Jun 2023 10:55:00 +0000
Pods Statuses: 1 Active / 0 Succeeded / 0 Failed
Pod Template:
...
which means that the script will silently fail to wait for the job.
#63 will solve the immediate problem, but it may be a good idea to be more defensive and exit with an error when the first sed command yields an empty string, as sketched below.
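A minimal sketch of that guard (the variable names are illustrative, not the script's actual ones):
counts=$(kubectl describe job "$job_name" $KUBECTL_ARGS | sed -n 's/^Pods Statuses:[[:space:]]*//p')
if [ -z "$counts" ]; then
    echo "error: could not parse pod statuses for job $job_name" >&2
    exit 1
fi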
I stumbled across this issue when I was not passing a --namespace ... to the script args. My init pods were reporting success:
state:
terminated:
containerID: docker://754eff26e250a9cab94d6a5f49e118bb8975fe93807bc05cbae4bcec6104f722
exitCode: 0
finishedAt: "2019-11-14T20:39:09Z"
reason: Completed
startedAt: "2019-11-14T20:39:08Z"
and the logs from the init container contained:
parse error: Invalid numeric literal at line 1, column 6
service qa-postgres is ready.
In reality, it was querying a service that does not exist. The output from the kubectl command would have been:
$ kubectl get service "qa-postgres" -ojson
Error from server (NotFound): services "qa-postgres" not found
which, when piped to jq, results in:
$ kubectl get service "qa-postgres" -ojson 2>&1 | jq -cr
parse error: Invalid numeric literal at line 1, column 6
However, that failure (as seen in the pod log output) results in the sh script returning 0 (success). It should cause the init pod to fail.
The above is true whenever kubectl outputs non-JSON:
/usr/local/bin # kubectl get pods -n qa
Error from server (Forbidden): pods is forbidden: User "system:serviceaccount:qa:default" cannot list resource "pods" in API group "" in the namespace "qa"
/usr/local/bin # wait_for.sh service qa-postgress -n qa && echo "*** returns success"
parse error: Invalid numeric literal at line 1, column 6
[2019-11-14 23:12:49] service qa-postgress is ready.
*** returns success
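A minimal sketch of a guard for this (illustrative, not the script's current code): check kubectl's exit status before handing anything to jq, so a NotFound or Forbidden error fails the init pod instead of being parsed as JSON.
out=$(kubectl get service "$name" $KUBECTL_ARGS -ojson 2>&1)
if [ $? -ne 0 ]; then
    echo "error: kubectl failed: $out" >&2
    exit 1
fi
echo "$out" | jq -cr '.spec.selector'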
The script does not seem to find pods or services in other namespaces. Adding --all-namespaces to the kubectl commands in the script solves the issue. Due to this change, kubernetes/kubernetes#60210, the Completed pods will then need to be filtered manually.
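For reference, plain kubectl can do both in one call: --all-namespaces widens the search and a field selector excludes the Completed pods (a sketch, not the script's current code):
kubectl get pods --all-namespaces --field-selector=status.phase!=Succeeded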
When starting the pod I got the following error:
Error from server (Forbidden): jobs.batch is forbidden: User system:serviceaccount:blabla:default cannot get resource jobs in API group batch in the namespace blabla
What is the difference between the wait_for.sh shell script and the kubectl wait command?
https://kubernetes.io/docs/reference/generated/kubectl/kubectl-commands#wait
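For comparison, the built-in command for the job case looks like this (standard kubectl usage):
kubectl wait --for=condition=complete job/my-job --timeout=300s
One apparent difference, judging from this page, is that wait_for.sh can also wait on a service by resolving its selector to the backing pods.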
I've got a job that normally succeeds, and we don't see any problems with k8s-wait-for. However, I'm observing a case where the job sometimes fails and retries with a new pod. The job's status ends up like this:
status:
completionTime: "2022-03-03T01:25:56Z"
conditions:
- lastProbeTime: "2022-03-03T01:25:56Z"
lastTransitionTime: "2022-03-03T01:25:56Z"
status: "True"
type: Complete
failed: 1
startTime: "2022-03-03T01:25:14Z"
succeeded: 1
You can see it completed but has one failure. This is fine because the job only needs to succeed once. However, right now k8s-wait-for does not consider this OK and continues to spin forever.
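A possible direction (a sketch using standard kubectl and jq, not the tool's current logic): treat the job as done as soon as its Complete condition is True, regardless of the failed counter.
kubectl get job my-job -ojson \
  | jq -e '.status.conditions[]? | select(.type == "Complete" and .status == "True")' \
  > /dev/null && echo "job my-job is ready."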
Here are some sample logs when using k8s-wait-for:v1.5.1 to wait for a k8s job to complete:
Waiting for job my-job...
Waiting for job my-job...
Waiting for job my-job..
[2021-09-27 14:02:52] job my-job is ready.
It would be awesome if every intermediate log line leading up to the "is ready" line also included a timestamp, so someone watching logs in real time could understand how long the pod has been waiting.
Code xrefs
https://github.com/groundnuty/k8s-wait-for/blob/master/wait_for.sh#L216
https://github.com/groundnuty/k8s-wait-for/blob/master/wait_for.sh#L207
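A minimal sketch of such a helper, matching the timestamp format already used for the final line:
log() { echo "[$(date '+%Y-%m-%d %H:%M:%S')] $*"; }
log "Waiting for job my-job..."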
Currently the image groundnuty/k8s-wait-for:no-root-v2.0 has several security vulnerabilities. Running the command docker scout cves groundnuty/k8s-wait-for:no-root-v2.0 lists all of these.
Here is the summary at the end:
67 vulnerabilities found in 8 packages
LOW 5
MEDIUM 30
HIGH 29
CRITICAL 3
The Trivy scan for this repo has been failing for some time too:
https://github.com/groundnuty/k8s-wait-for/actions/workflows/trivy.yml 💥
I have not looked into this in depth, but maybe the old Alpine base image is part of this?
FROM alpine:3.16.2
Hi,
How can I debug an initContainer with k8s-wait-for? I only see something like this:
Waiting for pod -lapp=xxxx...
Waiting for pod -lapp=xxxx...
Waiting for pod -lapp=xxxx...
Is there any debug switch?
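No debug switch is shown anywhere on this page, but plain shell tracing should work from a shell inside the container (the script's location is taken from the session earlier on this page):
/usr/local/bin # sh -x wait_for.sh pod -lapp=xxxx
This prints every kubectl and jq invocation as it runs, which shows which query keeps returning a not-ready state.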
Instead of multiple initContainers, can the script handle more than one job/service? E.g., it would stop waiting only when all jobs are done (even if some failed and some succeeded); see the workaround sketched after the snippet:
args: [
"job", "job-1",
"job", "job-2",
...
]
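Until something like that exists, a single init container can loop over the targets with a small wrapper (a sketch; the job names are illustrative):
for target in job-1 job-2; do
    wait_for.sh job "$target" || exit 1
done
Note that this waits for the targets sequentially rather than watching them all at once.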
The wait-for script cannot distinguish failed jobs when waiting unless there are 10 or more failed pods shown when describing the job.
$ kubectl get jobs,pods
NAME COMPLETIONS DURATION AGE
job.batch/pi 0/1 22h 22h
NAME READY STATUS RESTARTS AGE
pod/pi-j64gc 0/1 Error 0 22h
When using the script without the -we flag:
# wait_for.sh job pi
[2020-07-03 08:11:50] job pi is ready.
The error is caused by the regex at https://github.com/groundnuty/k8s-wait-for/blob/master/wait_for.sh#L172. It should be changed to:
sed_reg='-e s/^[1-9][[:digit:]]*:[[:digit:]]+:[[:digit:]]+$/1/p -e s/^0:[[:digit:]]+:[1-9][[:digit:]]*$/1/p'
The final part of the second regex, [1-9][[:digit:]]+, will only match two or more digits because of the + sign. The regex above changes the + to a * on that second part, allowing the extra digits to be absent and thus matching 0:0:1 as well as 0:0:10.
After the fix:
/ # wait_for.sh job pi
Waiting for job pi...
Waiting for job pi...
/ # wait_for.sh job-we pi
[2020-07-03 08:59:20] job pi is ready.
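The change is easy to verify in isolation (a quick check of the old and new patterns with sed's -E):
$ echo "0:0:1"  | sed -nE 's/^0:[[:digit:]]+:[1-9][[:digit:]]+$/1/p'   # old pattern: no match
$ echo "0:0:1"  | sed -nE 's/^0:[[:digit:]]+:[1-9][[:digit:]]*$/1/p'   # new pattern
1
$ echo "0:0:10" | sed -nE 's/^0:[[:digit:]]+:[1-9][[:digit:]]*$/1/p'   # new pattern
1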
Love the tool, use it extensively.
Would you consider a move to hosting your images on ghcr.io instead of at Docker Hub? Docker Hub has some quite onerous pull limits nowadays.
Hi. I'm new to this tool and to k8s in general, and I had some problems getting started.
The example in the README does not seem to work with later versions of kubectl:
$ kubectl run --generator=run-pod/v1 k8s-wait-for --rm -it --image groundnuty/k8s-wait-for:v1.3 --restart Never --command /bin/sh
Error: unknown flag: --generator
See 'kubectl run --help' for usage.
$ kubectl version --client
Client Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.3", GitCommit:"c92036820499fedefec0f847e2054d824aea6cd1", GitTreeState:"clean", BuildDate:"2021-10-27T18:41:28Z", GoVersion:"go1.16.9", Compiler:"gc", Platform:"linux/amd64"}
The release notes indicate that the --generator flag was deprecated and then removed in 1.21:
https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG/CHANGELOG-1.21.md
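For kubectl 1.21 and later, simply dropping the flag appears to be enough (an untested suggestion based on current kubectl run syntax):
$ kubectl run k8s-wait-for --rm -it --image groundnuty/k8s-wait-for:v1.3 --restart Never --command -- /bin/sh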
Sometimes pods enter the 'Evicted' state, but the operation of the cluster is not affected.
Could you consider adding this feature? Thank you!
I was expecting this app to wait until a job completed successfully, but it only waited for the job to be ready. Am I misunderstanding something?
This is a portion of my deployment resource and I have verified that my job runs to completion and exits with a status code of 0.
initContainers:
- name: data-migration-init
image: 'groundnuty/k8s-wait-for:v1.7'
args:
- job
- my-data-migration-job
Hi,
It would be very nice to add an option to wait until all pods in a specific namespace are running or have failed,
something like:
wait_for.sh namespace my-namespace
Tx,
Roee.
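As a partial workaround with plain kubectl (this waits for readiness only and does not account for failed pods):
kubectl wait --for=condition=Ready pods --all -n my-namespace --timeout=300s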
Line 17 in 6507200
Is it possible to add a version with a non-root user? This seems similar to #39; however, that issue provides no solution.
I am running k8s-wait-for as an init container in a deployment, waiting for a job to complete.
The container starts fine, but then exits with:
Error from server (Forbidden): jobs.batch "app-migration-a4c7915a7495153bcae396e7bd9e3d66c" is forbidden: User "system:serviceaccount:default:default" cannot get resource "jobs" in API group "batch" in the namespace "default"
It seems to me that the wait-for container is lacking the permissions required to access the API with kubectl. However, I was unable to find any information in the documentation on how to set up the proper credentials.
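A minimal RBAC setup for this case could look like the following (a sketch using standard kubectl; the role name is illustrative, and pods may need to be added to --resource if the script also inspects the job's pods):
kubectl create role k8s-wait-for -n default \
    --verb=get,list,watch --resource=jobs
kubectl create rolebinding k8s-wait-for -n default \
    --role=k8s-wait-for --serviceaccount=default:default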
I have a scenario in which I need to wait for at least one pod to be ready instead of all.
Hi @groundnuty,
Your wait-for service is amazing, but I have a feature request that maybe you can help with.
Issue:
I have a use case where sometimes (depending on the server resources) my job will create a pod that fails, but thanks to backoffLimit it will create a second pod that runs successfully. The thing is that waiting for a job waits for all pods to succeed, not just the last pod of the job.
Expected behavior:
Add the ability to wait for a job where only the last pod needs to succeed, rather than all of the pods.
How to reproduce:
backoffLimit: 1
I got wrecked by #60 because I was using an old image (v1.6). It took me a while to realize that I had just copied the example in the README, but the latest was at v2.0.
Maybe it'd be helpful to update the README, or set up something to update the README when a new package is released?
Or maybe even better, since there are specific k8s versions this package is compatible with, wait_for.sh could detect this and fail loudly?
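A sketch of such a check (the jq field names follow the kubectl version JSON shown at the top of this page; the threshold 23 is purely illustrative):
minor=$(kubectl version -o json | jq -r '.serverVersion.minor' | tr -d '+')
if [ "$minor" -lt 23 ]; then
    echo "error: unsupported Kubernetes server version" >&2
    exit 1
fi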
Currently one container can only wait for one pod or job. Could it support multiple, such as:
wait_for.sh pod -lapp=app -lapp=app2
I'm getting this error:
No resources found.
FalseError from server (BadRequest): Unable to find "/v1, Resource=pods" that match label selector "app=cloud-services,-ltier=postgres", field selector "": unable to parse requirement: invalid label key "-ltier": name part must consist of alphanumeric characters, '-', '_' or '.', and must start and end with an alphanumeric character (e.g. 'MyName', or 'my.name', or '123-abc', regex used for validation is '([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9]')
This is on container image tag v1.2-5-g92c083e and the current script in master.
This appears to be caused by:
get_service_state_selectors=$(kubectl get service "$get_service_state_name" $KUBECTL_ARGS -ojson 2>&1 | jq -cr 'if . | has("items") then .items[]
else . end | [ .spec.selector | to_entries[] | "-l\(.key)=\(.value)" ] | join(",") ')
when the service has more than one selector:
"selector": {
"app": "cloud-services",
"tier": "postgres"
}
The to_entries[] | "-l\(.key)=\(.value)" ] | join(",") appears to be incorrect: it produces
-lapp=cloud-services,-ltier=postgres
when it should not have the additional -l. It should be
-lapp=cloud-services,tier=postgres
The code below that is:
for get_service_state_selector in $get_service_state_selectors ; do
get_service_state_selector=$(echo "$get_service_state_selector" | tr ',' ' ')
get_service_state_state=$(get_pod_state "$get_service_state_selectors")
get_service_state_states="${get_service_state_states}${get_service_state_state}" ;
done
echo "$get_service_state_states"
which looks like it would then replace that comma with a space to make
-lapp=cloud-services -ltier=postgres
except for two issues: 1) get_service_state_selector is never used, only assigned to; 2) I don't think the kubectl command will allow multiple -l arguments anyway.
I think this could be fixed by doing:
get_service_state() {
    get_service_state_name="$1"
    get_service_state_selectors=$(kubectl get service "$get_service_state_name" $KUBECTL_ARGS -ojson 2>&1 | jq -cr 'if . | has("items") then .items[]
    else . end | [ .spec.selector | to_entries[] | "\(.key)=\(.value)" ] | join(",") ')
    get_service_state_states=""
    # one selector string per service item; use the loop variable, not the full list
    for get_service_state_selector in $get_service_state_selectors ; do
        get_service_state_state=$(get_pod_state "-l$get_service_state_selector")
        get_service_state_states="${get_service_state_states}${get_service_state_state}"
    done
    echo "$get_service_state_states"
}
But I'm not sure of all the cases that this has to handle.
If the job first fails and is retried by Kubernetes, k8s-wait-for is not able to recover from it. It waits for the job forever, even after the retry has succeeded.
./wait_for.sh service mongodb-replicaset
expr: syntax error
error: name cannot be provided when a selector is specified
We have hit a problem.
We use k8s-wait-for in initContainers in the workload to wait for database migrations to complete, and we have HPA running.
If ttlSecondsAfterFinished on the Job expires and the Job (along with its pods) is removed, then the new workload pods fail the job check.
What options do we have to solve the problem?
Docker Hub throttles traffic, and for a critical init container it'd be nice if there were a more stable alternative. I'm working around the issue in other ways (re-hosting the image), but just throwing it out there as a suggested improvement.
Line 13 in 1c9c1a6
Hi, thanks for this helpful code! 🚀
We are very interested in using this utility. Still, the current version uses a base image that contains some vulnerabilities classified as critical, which makes it impossible to use in some projects due to security policies.
I would like to know if it is possible to update the Alpine base image version from 3.12.1 to 3.12.19 or the latest Alpine version.
Thanks,
I was wondering why my wait-for was never finishing, and while debugging in the container I saw that my API server was returning a warning message on top of the pod info:
I0624 10:31:01.428761 375 request.go:668] Waited for 1.181153768s due to client-side throttling, not priority and fairness, request: GET:https://<server>/apis/rbac.authorization.k8s.io/v1?timeout=32s
NAME READY STATUS RESTARTS AGE
<pod> 1/1 Running 0 173m
<pod> 1/1 Running 0 173m
<pod> 1/1 Running 0 173m
The script seems to be parsing the first line and ends up not detecting the pod status. I might have some time to prepare a quick fix, but for now I will leave an issue report here.
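One possible quick fix (a sketch, assuming the throttling notice arrives with klog's leading-letter prefix, as in the output above): drop such lines before the status parsing.
kubectl get pods 2>&1 | grep -vE '^[IWEF][0-9]{4} '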
The container fails on Error from server (Forbidden): jobs.batch "init-data-job" is forbidden: User "system:serviceaccount:default:default" cannot get jobs.batch in the namespace "default": Unknown user "system:serviceaccount:default:default"
With kubectl it can be done with something like this: kubectl wait --for=delete pod --selector=<label>=<value>
I'm experiencing the following error when using the latest release v1.5:
/usr/local/bin/wait_for.sh: line 183: syntax error: unexpected "else" (expecting "}")
I think it is related to an unexpected fi statement.
Line 182 in 30e860b
❯ docker pull ghcr.io/groundnuty/k8s-wait-for:v2.0
Error response from daemon: Head "https://ghcr.io/v2/groundnuty/k8s-wait-for/manifests/v2.0": denied: denied
Were the permissions for this changed recently?