postfinance / kubenurse
Kubernetes network monitoring
License: MIT License
We should use httptrace for requests made by kubenurse in order to get more detailed error information, such as DNS failures, and statistics such as time-to-first-byte.
We will log a little more on HTTP errors and get new metrics from the requests. This makes debugging easier in cases like #44, where only an HTTP client error message is present, and gives insight into the different latencies.
There could be a small increase in log messages and metrics.
httptrace is already included via the Prometheus framework, but it needs to be extended.
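A minimal sketch of what such tracing could look like with the standard net/http/httptrace package. The Timings type, tracedGet and demo are hypothetical names for illustration and are not taken from the kubenurse code base; the demo uses a local test server, so no DNS lookup happens there.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"net/http/httptrace"
	"sync"
	"time"
)

// Timings holds per-phase durations for one request (hypothetical type).
type Timings struct {
	DNS       time.Duration // zero if no lookup happened (e.g. IP literal)
	Connect   time.Duration // zero if an existing connection was reused
	FirstByte time.Duration // request start until first response byte
	DNSErr    error         // resolver error, if any
}

// tracedGet performs a GET and records timings through httptrace hooks.
// Hooks may run on other goroutines, so shared state is guarded by a mutex.
func tracedGet(client *http.Client, url string) (Timings, error) {
	if client == nil {
		client = http.DefaultClient
	}
	var (
		mu                  sync.Mutex
		t                   Timings
		dnsStart, connStart time.Time
	)
	start := time.Now()
	locked := func(f func()) { mu.Lock(); defer mu.Unlock(); f() }

	trace := &httptrace.ClientTrace{
		DNSStart: func(httptrace.DNSStartInfo) {
			locked(func() { dnsStart = time.Now() })
		},
		DNSDone: func(info httptrace.DNSDoneInfo) {
			locked(func() { t.DNS, t.DNSErr = time.Since(dnsStart), info.Err })
		},
		// With dual-stack dialing these two hooks can fire more than once;
		// this sketch simply keeps the last observed value.
		ConnectStart: func(network, addr string) {
			locked(func() { connStart = time.Now() })
		},
		ConnectDone: func(network, addr string, err error) {
			locked(func() { t.Connect = time.Since(connStart) })
		},
		GotFirstResponseByte: func() {
			locked(func() { t.FirstByte = time.Since(start) })
		},
	}

	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return Timings{}, err
	}
	req = req.WithContext(httptrace.WithClientTrace(req.Context(), trace))

	resp, err := client.Do(req)
	if resp != nil {
		resp.Body.Close()
	}
	mu.Lock()
	defer mu.Unlock()
	return t, err // timings are returned even on error, so DNSErr is visible
}

// demo runs tracedGet against a local test server.
func demo() (Timings, error) {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	}))
	defer srv.Close()
	return tracedGet(srv.Client(), srv.URL)
}

func main() {
	t, err := demo()
	fmt.Printf("connect=%s first_byte=%s err=%v\n", t.Connect, t.FirstByte, err)
}
```

The recorded durations could then be logged on failure or fed into histograms, depending on how the existing metrics are organised.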
Hi @djboris9,
After I deploy version 3.0, I get the following error. Even when the clusterrole is granted admin permissions, it does not work:
E0415 06:59:52.669811 1 reflector.go:127] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:156: Failed to watch *v1.Node: failed to list *v1.Node: nodes is forbidden: User "system:serviceaccount:kube-system:nurse" cannot list resource "nodes" in API group "" at the cluster scope
E0415 06:59:53.957503 1 reflector.go:127] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:156: Failed to watch *v1.Node: failed to list *v1.Node: nodes is forbidden: User "system:serviceaccount:kube-system:nurse" cannot list resource "nodes" in API group "" at the cluster scope
E0415 06:59:57.066479 1 reflector.go:127] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:156: Failed to watch *v1.Node: failed to list *v1.Node: nodes is forbidden: User "system:serviceaccount:kube-system:nurse" cannot list resource "nodes" in API group "" at the cluster scope
E0415 07:00:02.396648 1 reflector.go:127] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:156: Failed to watch *v1.Node: failed to list *v1.Node: nodes is forbidden: User "system:serviceaccount:kube-system:nurse" cannot list resource "nodes" in API group "" at the cluster scope
It would be great if the tool could be extended to monitor external endpoints (HTTP, TCP and DNS).
This way we could also monitor the outgoing network infrastructure from each k8s node.
I have deployed kubenurse with ArgoCD on an AKS cluster. For the Ingress URL I provided a HOST but no TLS.
It all seems to be working. In the browser I can reach /alive, /alwayshappy (200 response) and /metrics.
But the container logs show the error: failed request for me_ingress with Get "kubenurse.rancher-canary-dev.k8s.digitaldev.nl/alwayshappy": unsupported protocol scheme "".
The ingress does not accept https:// in front of the ingress URL.
I have no clue what I am doing wrong here.
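For context, Go's HTTP client reports unsupported protocol scheme "" when the request URL has no scheme, which matches the URL in the log line above (it starts with the host name, not with http:// or https://). A small sketch of a guard that adds a default scheme; withScheme is a hypothetical helper, not something kubenurse is known to provide, and the host name is a placeholder.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// withScheme prepends defaultScheme when raw has no "scheme://" prefix and
// then validates that the result contains a host. Hypothetical helper.
func withScheme(raw, defaultScheme string) (string, error) {
	if !strings.Contains(raw, "://") {
		raw = defaultScheme + "://" + raw
	}
	u, err := url.Parse(raw)
	if err != nil {
		return "", err
	}
	if u.Host == "" {
		return "", fmt.Errorf("url %q has no host", raw)
	}
	return u.String(), nil
}

func main() {
	// Placeholder host name, used only for illustration.
	fmt.Println(withScheme("kubenurse.example.com/alwayshappy", "http"))
}
```

Whether the scheme belongs in the configured ingress URL or should be added by the tool is a design decision for the maintainers; the sketch only shows where the error comes from.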
The current pod shouldn't be scanned; otherwise a fraction 1/n of the path_
checks are simple localhost queries with a non-representative latency.
In other words, the neighbours check should only check neighbours, not the current pod.
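A rough sketch of the filtering step, assuming the discovery code yields a list of pods with names. The Neighbour type and withoutSelf are hypothetical and simplified for illustration.

```go
package main

import "fmt"

// Neighbour is a hypothetical, simplified view of a discovered kubenurse pod.
type Neighbour struct {
	PodName string
	PodIP   string
}

// withoutSelf returns all neighbours whose pod name differs from selfPodName.
func withoutSelf(all []Neighbour, selfPodName string) []Neighbour {
	out := make([]Neighbour, 0, len(all))
	for _, n := range all {
		if n.PodName == selfPodName {
			continue // skip the pod this instance is running in
		}
		out = append(out, n)
	}
	return out
}

func main() {
	all := []Neighbour{
		{PodName: "kubenurse-a", PodIP: "10.0.0.1"},
		{PodName: "kubenurse-b", PodIP: "10.0.0.2"},
	}
	fmt.Println(withoutSelf(all, "kubenurse-a"))
}
```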
When kubenurse does that, it stays in pending state and all other tests fail. It would be better to do that periodically.
Hello everyone,
We're using kubenurse in our OpenShift environment on-premises and on Azure.
On-premises everything works as expected, but on Azure Red Hat OpenShift (ARO) we see strange behaviour.
Sometimes the request for "me_ingress" fails with the following error message:
2022/07/10 03:03:25 failed request for me_ingress with Get "https://kubenurse.cloud.rohde-schwarz.com/alwayshappy": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
But if we execute a curl command from the terminal at the same time, the command is successful.
~ $ curl https://kubenurse.cloud.rohde-schwarz.com/alwayshappy -v
* Trying 51.105.220.245:443...
* Connected to kubenurse.cloud.rohde-schwarz.com (51.105.220.245) port 443 (#0)
* ALPN: offers h2
* ALPN: offers http/1.1
* CAfile: /etc/ssl/certs/ca-certificates.crt
* CApath: none
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
* TLSv1.3 (IN), TLS handshake, Server hello (2):
* TLSv1.2 (IN), TLS handshake, Certificate (11):
* TLSv1.2 (IN), TLS handshake, Server key exchange (12):
* TLSv1.2 (IN), TLS handshake, Server finished (14):
* TLSv1.2 (OUT), TLS handshake, Client key exchange (16):
* TLSv1.2 (OUT), TLS change cipher, Change cipher spec (1):
* TLSv1.2 (OUT), TLS handshake, Finished (20):
* TLSv1.2 (IN), TLS handshake, Finished (20):
* SSL connection using TLSv1.2 / ECDHE-RSA-AES128-GCM-SHA256
* ALPN: server accepted h2
* Server certificate:
* subject: C=DE; ST=Bayern; L=München; O=Rohde & Schwarz GmbH & Co. KG; CN=*.cloud.rohde-schwarz.com
* start date: Jun 21 00:00:00 2022 GMT
* expire date: Jun 29 23:59:59 2023 GMT
* subjectAltName: host "kubenurse.cloud.rohde-schwarz.com" matched cert's "*.cloud.rohde-schwarz.com"
* issuer: C=US; O=DigiCert Inc; OU=www.digicert.com; CN=Thawte RSA CA 2018
* SSL certificate verify ok.
* Using HTTP2, server supports multiplexing
* Copying HTTP/2 data in stream buffer to connection buffer after upgrade: len=0
* h2h3 [:method: GET]
* h2h3 [:path: /alwayshappy]
* h2h3 [:scheme: https]
* h2h3 [:authority: kubenurse.cloud.rohde-schwarz.com]
* h2h3 [user-agent: curl/7.83.1]
* h2h3 [accept: */*]
* Using Stream ID: 1 (easy handle 0x7f37d3bd51b0)
> GET /alwayshappy HTTP/2
> Host: kubenurse.cloud.rohde-schwarz.com
> user-agent: curl/7.83.1
> accept: */*
>
< HTTP/2 200
< date: Mon, 11 Jul 2022 11:34:03 GMT
< set-cookie: ca2642a40ba64b7363cb7d68301b753a=4d110bca2cf93cb742c45b40f9c7d239; path=/; HttpOnly; Secure; SameSite=None
< content-length: 0
< cache-control: private
< strict-transport-security: max-age=31536000; includesubdomains; preload
<
* Connection #0 to host kubenurse.cloud.rohde-schwarz.com left intact
Sometimes only a restart of the complete node solves the issue.
We have no idea why this error message appears. Is it possible to set a debug log level to get more information?
Regards
Tobias
"user_agent": "curl/7.76.1",
"request_uri": "/alive",
"remote_addr": "127.0.0.1:38864",
"api_server_direct": "ok",
"api_server_dns": "ok",
"me_ingress": "404 Not Found",
"me_service": "Get \"http://kubenurse.kube-system.svc.cluster.local:8080/alwayshappy\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)",
"neighbourhood_state": "ok",
Any idea why /alive and other endpoints return 404 when trying to send requests to the service?
Hi @zbindenren , could you create a gh-pages branch in the repository so I can create a PR for the helm repo? It requires a gh-pages branch, but that is not something I can create through my forked repo (I believe).
Grz Bram
Hi, I was thinking that I could maybe contribute by creating a Helm repo for kubenurse.
I was wondering whether this is already planned by the maintainers, or maybe not wanted.
Any comments or guidelines on how to contribute?
Grz Bram
When an HTTP request produces a status code other than 200, an error is produced.
However, with the changes I introduced in #125, this status error is now silently discarded.
IMO, we should at least log this error, and we could also treat it as a failure and increase the error counter.
@zbindenren what's your take on this?
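One possible shape for this, as a sketch only: turn any non-200 status into an error value so the caller can both log it and count it. statusError and checkStatus are hypothetical helpers, not the functions touched in #125.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// statusError returns an error for any status code other than 200 OK.
func statusError(code int, status string) error {
	if code != http.StatusOK {
		return fmt.Errorf("unexpected status: %s", status)
	}
	return nil
}

// checkStatus performs a GET and reports transport errors as well as
// non-200 responses, so neither is silently discarded.
func checkStatus(client *http.Client, url string) error {
	resp, err := client.Get(url)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	// Drain the body so the underlying connection can be reused.
	_, _ = io.Copy(io.Discard, resp.Body)
	return statusError(resp.StatusCode, resp.Status)
}

func main() {
	// Local test server that always answers 404.
	srv := httptest.NewServer(http.NotFoundHandler())
	defer srv.Close()
	fmt.Println(checkStatus(srv.Client(), srv.URL))
}
```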
Should this be Before rather than After?
Hello
The neighbourhood check is expensive when running at scale. For instance:
My proposal would be:
While the first point is straightforward and does not change the behavior much, the latter requires discussion
Currently kubenurse discovers all running neighbour Pods (see kubediscovery.go). If we perform maintenance on a Node, the kubenurse instance on that node may be unreachable, which is not necessarily a problem. Graphs and metrics might nevertheless show errors (or even trigger false alarms).
Exclude kubenurse instances from Nodes where scheduling is disabled.
Disable checks entirely on a kubenurse instance if the node the instance runs on has scheduling disabled (to avoid possible service check errors for example).
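The first option could be sketched roughly as follows. Node and Pod are simplified, hypothetical stand-ins for the Kubernetes API objects; in the real API the flag lives in the node spec as Unschedulable.

```go
package main

import "fmt"

// Node is a simplified stand-in for a Kubernetes node (hypothetical type).
type Node struct {
	Name          string
	Unschedulable bool // true when scheduling is disabled (cordoned)
}

// Pod is a simplified stand-in for a discovered kubenurse pod.
type Pod struct {
	Name     string
	NodeName string
}

// schedulableNeighbours drops pods that run on nodes with scheduling disabled.
func schedulableNeighbours(pods []Pod, nodes []Node) []Pod {
	cordoned := make(map[string]bool, len(nodes))
	for _, n := range nodes {
		if n.Unschedulable {
			cordoned[n.Name] = true
		}
	}
	out := make([]Pod, 0, len(pods))
	for _, p := range pods {
		if !cordoned[p.NodeName] {
			out = append(out, p)
		}
	}
	return out
}

func main() {
	nodes := []Node{{Name: "node-1"}, {Name: "node-2", Unschedulable: true}}
	pods := []Pod{{Name: "kubenurse-a", NodeName: "node-1"}, {Name: "kubenurse-b", NodeName: "node-2"}}
	fmt.Println(schedulableNeighbours(pods, nodes))
}
```

Note that this needs the node list, so the service account would require permission to list nodes.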
Implement a flag for additional CA certificates, used by the ingress checker
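A sketch of how additional CA certificates could be loaded with the standard library, assuming the flag would supply PEM data. tlsConfigWithExtraCA is a hypothetical helper and the flag itself is not shown.

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"errors"
	"fmt"
)

// tlsConfigWithExtraCA returns a TLS config that trusts the system roots
// plus the certificates found in pemBytes. Hypothetical helper.
func tlsConfigWithExtraCA(pemBytes []byte) (*tls.Config, error) {
	pool, err := x509.SystemCertPool()
	if err != nil || pool == nil {
		// Fall back to an empty pool if system roots are unavailable.
		pool = x509.NewCertPool()
	}
	if !pool.AppendCertsFromPEM(pemBytes) {
		return nil, errors.New("no certificates could be parsed from the PEM data")
	}
	return &tls.Config{RootCAs: pool, MinVersion: tls.VersionTLS12}, nil
}

func main() {
	// Invalid input on purpose, to show the error path.
	_, err := tlsConfigWithExtraCA([]byte("not a certificate"))
	fmt.Println(err)
}
```

The resulting config could be set as TLSClientConfig on the http.Transport used by the ingress check.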
I install kubenurse as a Helm chart by placing these files in the chart's "templates" folder. Next, I run helm install . to install it. Then, from a pod that I kubectl exec into, I send an HTTP request to http://kubenurse.kube-system.svc.cluster.local:8080 with curl and get this response:
<a href="/alive">Moved Permanently</a>.
Currently the /alive endpoint triggers a synchronous kubenurse check. If there are a lot of requests on this endpoint, the network traffic will increase drastically.
Implement a simple cache of checker results and serve it on /alive with a cache TTL.
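A minimal sketch of such a cache, assuming the checker can be wrapped as a function that returns a result. Result, CachedChecker and NewCachedChecker are hypothetical names.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Result is a placeholder for whatever the checker returns.
type Result struct {
	OK     bool
	Detail string
}

// CachedChecker wraps a check function and reuses its last result for ttl.
type CachedChecker struct {
	mu      sync.Mutex
	ttl     time.Duration
	check   func() Result
	last    Result
	expires time.Time
}

// NewCachedChecker creates a cache around check with the given TTL.
func NewCachedChecker(ttl time.Duration, check func() Result) *CachedChecker {
	return &CachedChecker{ttl: ttl, check: check}
}

// Get returns the cached result while it is fresh, otherwise runs the check.
// The mutex is held during the check, so concurrent callers wait for one
// run instead of each starting their own.
func (c *CachedChecker) Get() Result {
	c.mu.Lock()
	defer c.mu.Unlock()
	if time.Now().Before(c.expires) {
		return c.last
	}
	c.last = c.check()
	c.expires = time.Now().Add(c.ttl)
	return c.last
}

func main() {
	runs := 0
	c := NewCachedChecker(3*time.Second, func() Result {
		runs++
		return Result{OK: true, Detail: "all checks passed"}
	})
	c.Get()
	c.Get()
	fmt.Println("check runs:", runs)
}
```

The /alive handler would then call Get instead of running the checks directly, which bounds the check rate to one run per TTL regardless of request volume.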
The total number of requests is not incremented when there is an error. Error queries also need to be counted in the total; leaving them out makes it overly complicated to compute an error rate.
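To illustrate the intended accounting with plain counters rather than the actual metrics code: every request increments the total, failed ones additionally increment the error count, and the error rate becomes a simple ratio. RequestStats is a hypothetical type.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errDemo is a sample failure used for the demonstration only.
var errDemo = errors.New("demo failure")

// RequestStats counts all requests and, separately, the failed ones.
type RequestStats struct {
	mu     sync.Mutex
	total  uint64
	failed uint64
}

// Observe records one finished request; err is nil on success.
func (s *RequestStats) Observe(err error) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.total++ // counted for successes and failures alike
	if err != nil {
		s.failed++
	}
}

// ErrorRate returns failed/total, or 0 when nothing was observed yet.
func (s *RequestStats) ErrorRate() float64 {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.total == 0 {
		return 0
	}
	return float64(s.failed) / float64(s.total)
}

func main() {
	var s RequestStats
	s.Observe(nil)
	s.Observe(errDemo)
	fmt.Println(s.ErrorRate())
}
```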
The documentation contains no Grafana dashboard JSON for kubenurse.
https://github.com/postfinance/kubenurse/blob/master/pkg/checker/checker.go#L64 --> at least the suffix ".cluster.local" should be configurable?
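A sketch of how the suffix could become a parameter with a default. serviceURL and defaultClusterDomain are hypothetical names, and how the value would be supplied (flag, environment variable) is left open.

```go
package main

import "fmt"

// defaultClusterDomain is used when no cluster domain is configured.
const defaultClusterDomain = "cluster.local"

// serviceURL builds the in-cluster service URL. An empty clusterDomain
// falls back to the default.
func serviceURL(service, namespace, clusterDomain string, port int, path string) string {
	if clusterDomain == "" {
		clusterDomain = defaultClusterDomain
	}
	return fmt.Sprintf("http://%s.%s.svc.%s:%d%s", service, namespace, clusterDomain, port, path)
}

func main() {
	fmt.Println(serviceURL("kubenurse", "kube-system", "", 8080, "/alwayshappy"))
	fmt.Println(serviceURL("kubenurse", "kube-system", "example.internal", 8080, "/alwayshappy"))
}
```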
The example deploys a container with image v1.0.0; it should point to v1.1.2.
The following error occurs:
Run kubectl apply -f https://raw.githubusercontent.com/kubernetes/ingress-nginx/master/deploy/static/provider/kind/deploy.yaml
namespace/ingress-nginx created
serviceaccount/ingress-nginx created
configmap/ingress-nginx-controller created
clusterrole.rbac.authorization.k8s.io/ingress-nginx created
clusterrolebinding.rbac.authorization.k8s.io/ingress-nginx created
role.rbac.authorization.k8s.io/ingress-nginx created
error: error validating "https://raw.githubusercontent.com/kubernetes/ingress-nginx/master/deploy/static/provider/kind/deploy.yaml": error validating data: ValidationError(Service.spec.ports[0]): unknown field "appProtocol" in io.k8s.api.core.v1.ServicePort; if you choose to ignore these errors, turn validation off with --validate=false
rolebinding.rbac.authorization.k8s.io/ingress-nginx created
Error: Process completed with exit code 1.
The error is unknown field "appProtocol"
Currently, service checks are run sequentially:
kubenurse/internal/servicecheck/servicecheck.go, lines 75 to 78 in 154f931
Run() of the checks could take up a lot of time. Checks should all be performed in parallel, and WaitGroups should be used for that.
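A sketch of the parallel variant, assuming each check can be expressed as a function returning an error. runAll and the check names used in main are illustrative and do not mirror the actual servicecheck.go code.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errTimeout is a sample failure used for the demonstration only.
var errTimeout = errors.New("timeout")

// runAll starts every check in its own goroutine and waits for all of them.
// Results are collected in a map guarded by a mutex.
func runAll(checks map[string]func() error) map[string]error {
	var (
		wg sync.WaitGroup
		mu sync.Mutex
	)
	results := make(map[string]error, len(checks))
	for name, check := range checks {
		wg.Add(1)
		go func(name string, check func() error) {
			defer wg.Done()
			err := check()
			mu.Lock()
			results[name] = err
			mu.Unlock()
		}(name, check)
	}
	wg.Wait()
	return results
}

func main() {
	results := runAll(map[string]func() error{
		"api_server_direct": func() error { return nil },
		"me_service":        func() error { return errTimeout },
	})
	for name, err := range results {
		fmt.Println(name, err)
	}
}
```

With this structure the total duration is bounded by the slowest check rather than by the sum of all checks.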
This is a proposal to extend the kubenurse in order to discover more cluster and network issues.
Currently we can detect cluster issues that occur at runtime. These are mainly network related and concern kube-proxy, kube-apiserver availability, cluster DNS, Ingress and upstream DNS.
The following example scenarios can occur without the kubenurse detecting them:
Pod gets created and has no networking because of IPAM, CNI or iptables failures
Service isn't working after creation or mutation of the state (Pod readiness, Pod existence)
Ingress doesn't pick up changes of the state, e.g. a Pod is created or a new Ingress record is set
Node cannot start any new Pods due to various issues
Node can start a Pod but it's not usable due to resource pressure
Usually such issues are detected when some mutation (e.g. deployment, scaling) is done or the Chaosmonkey is doing its work.
An elected kubenurse should periodically create a Pod, Service and Ingress.
As soon as the Pod is ready, the parent kubenurse connects to the Pod directly, over Service and over Ingress while recording errors and time.
Additionally the child kubenurse should create some configurable load (CPU, Disk, Memory) after being started. After the connection checks are done or a timeout is reached, all resources have to be garbage collected.
Constraints: