kubenurse's People

Contributors

bramvdklinkenberg, ckittelmann, clementnuss, dependabot[bot], djboris9, domi2120, eli-halych, ghouscht, luanabanana, matthisholleville, matthyx, myaser, opensourcepf, philipsahli, phspagiari, zbindenren

kubenurse's Issues

Proposal: Use httptrace to get advanced metrics and error messages

Proposal

We should use httptrace for requests made by kubenurse in order to get more advanced error details (like DNS failures) and statistics (like time-to-first-byte).

Positive impact

We will log a bit more detail on HTTP errors and get new metrics from the requests. This makes debugging easier in cases like #44, where currently only an HTTP client error message is present, and gives users insight into the different latencies.

Negative impact

There could be a small increase in log messages and metrics.

Notes

httptrace is already included via the Prometheus framework, but it needs to be extended.
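
As a rough illustration, a minimal sketch of how net/http/httptrace hooks could capture DNS timings and time-to-first-byte; the hooks are from the standard library, while wiring the durations into Prometheus metrics is left out and would be project-specific:

package sketch

import (
	"crypto/tls"
	"log"
	"net/http"
	"net/http/httptrace"
	"time"
)

// traceRequest issues a GET and logs the DNS duration, TLS handshake errors
// and time-to-first-byte via net/http/httptrace. Exporting these durations
// as Prometheus histograms instead of log lines would be the next step.
func traceRequest(url string) error {
	var dnsStart, start time.Time

	trace := &httptrace.ClientTrace{
		DNSStart: func(httptrace.DNSStartInfo) { dnsStart = time.Now() },
		DNSDone: func(info httptrace.DNSDoneInfo) {
			log.Printf("dns lookup took %v, err=%v", time.Since(dnsStart), info.Err)
		},
		TLSHandshakeDone: func(_ tls.ConnectionState, err error) {
			log.Printf("tls handshake done, err=%v", err)
		},
		GotFirstResponseByte: func() {
			log.Printf("time to first byte: %v", time.Since(start))
		},
	}

	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return err
	}
	req = req.WithContext(httptrace.WithClientTrace(req.Context(), trace))

	start = time.Now()
	resp, err := http.DefaultTransport.RoundTrip(req)
	if err != nil {
		return err
	}
	return resp.Body.Close()
}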

clusterrole permission

hi @djboris9
After deploying version 3.0, the following error is reported. Even when the ClusterRole permission is bound to admin, it does not work:

E0415 06:59:52.669811       1 reflector.go:127] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:156: Failed to watch *v1.Node: failed to list *v1.Node: nodes is forbidden: User "system:serviceaccount:kube-system:nurse" cannot list resource "nodes" in API group "" at the cluster scope
E0415 06:59:53.957503       1 reflector.go:127] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:156: Failed to watch *v1.Node: failed to list *v1.Node: nodes is forbidden: User "system:serviceaccount:kube-system:nurse" cannot list resource "nodes" in API group "" at the cluster scope
E0415 06:59:57.066479       1 reflector.go:127] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:156: Failed to watch *v1.Node: failed to list *v1.Node: nodes is forbidden: User "system:serviceaccount:kube-system:nurse" cannot list resource "nodes" in API group "" at the cluster scope
E0415 07:00:02.396648       1 reflector.go:127] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:156: Failed to watch *v1.Node: failed to list *v1.Node: nodes is forbidden: User "system:serviceaccount:kube-system:nurse" cannot list resource "nodes" in API group "" at the cluster scope
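
The log says the nurse service account may not list or watch nodes at cluster scope. As a hedged illustration (not the project's shipped manifests), the missing permissions expressed with the k8s.io/api/rbac/v1 Go types; object names are chosen for the example, and in practice the equivalent ClusterRole/ClusterRoleBinding YAML would be applied with kubectl:

package sketch

import (
	rbacv1 "k8s.io/api/rbac/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// nodeReaderRBAC builds a ClusterRole and ClusterRoleBinding that grant the
// kube-system/nurse service account get/list/watch on nodes, which is what
// the reflector error above complains about. Object names are illustrative.
func nodeReaderRBAC() (*rbacv1.ClusterRole, *rbacv1.ClusterRoleBinding) {
	role := &rbacv1.ClusterRole{
		ObjectMeta: metav1.ObjectMeta{Name: "kubenurse-node-reader"},
		Rules: []rbacv1.PolicyRule{{
			APIGroups: []string{""},
			Resources: []string{"nodes"},
			Verbs:     []string{"get", "list", "watch"},
		}},
	}
	binding := &rbacv1.ClusterRoleBinding{
		ObjectMeta: metav1.ObjectMeta{Name: "kubenurse-node-reader"},
		RoleRef: rbacv1.RoleRef{
			APIGroup: "rbac.authorization.k8s.io",
			Kind:     "ClusterRole",
			Name:     "kubenurse-node-reader",
		},
		Subjects: []rbacv1.Subject{{
			Kind:      "ServiceAccount",
			Name:      "nurse",
			Namespace: "kube-system",
		}},
	}
	return role, binding
}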

Monitor external endpoints

It would be great if the tool can be extended to monitor external endpoints (HTTP, TCP and DNS).

This way we could also monitor the outgoing network infrastructure from each k8s node.
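
A minimal sketch of what such external checks could look like; the function names are hypothetical and not part of kubenurse today:

package sketch

import (
	"context"
	"fmt"
	"net"
	"time"
)

// checkTCP and checkDNS are hypothetical external checks. They illustrate
// what probing an external TCP endpoint and an external DNS name from each
// node could look like; an HTTP check would reuse the existing HTTP client.
func checkTCP(addr string, timeout time.Duration) error {
	conn, err := net.DialTimeout("tcp", addr, timeout)
	if err != nil {
		return fmt.Errorf("tcp check %s failed: %w", addr, err)
	}
	return conn.Close()
}

func checkDNS(ctx context.Context, host string) error {
	var r net.Resolver
	if _, err := r.LookupHost(ctx, host); err != nil {
		return fmt.Errorf("dns check %s failed: %w", host, err)
	}
	return nil
}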

failed request for me_ingress with Get "<INGRESS_URL>/alwayshappy": unsupported protocol scheme ""

I have deployed kubenurse with ArgoCD on an AKS cluster. For the ingress URL I provided a host but no TLS.
It all seems to be working: in the browser I can reach /alive, /alwayshappy (200 response) and /metrics.
But the logs of the container show the error failed request for me_ingress with Get "kubenurse.rancher-canary-dev.k8s.digitaldev.nl/alwayshappy": unsupported protocol scheme "".

The ingress does not accept https:// in front of the ingress url.

No clue what I am doing wrong here.
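
The error itself comes from Go's HTTP client, which refuses URLs without an explicit scheme, so the configured ingress URL presumably needs an http:// or https:// prefix even if the ingress only serves plain HTTP. A minimal sketch (the helper name is hypothetical):

package sketch

import (
	"net/url"
	"strings"
)

// normalizeURL is a hypothetical helper: Go's http.Client returns
// `unsupported protocol scheme ""` when the URL has no scheme, so a bare
// host like kubenurse.example.com/alwayshappy must be prefixed with
// http:// or https:// before it can be used for the me_ingress check.
func normalizeURL(raw string) string {
	u, err := url.Parse(raw)
	if err == nil && u.Scheme != "" {
		return raw
	}
	if strings.HasPrefix(raw, "//") {
		return "https:" + raw
	}
	return "https://" + raw
}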

Failed request for me_ingress

Hello everyone,

We're using kubenurse in our Openshift environment on-premises and on Azure.
On-premises everything works as expected, but on Azure RedHat Openshift (ARO) we have a strange behaviour.

Sometimes the request for "me_ingress" fails with the following error message:
2022/07/10 03:03:25 failed request for me_ingress with Get "https://kubenurse.cloud.rohde-schwarz.com/alwayshappy": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)

But if we execute a curl command from the terminal at the same time, the command is successful.

~ $ curl https://kubenurse.cloud.rohde-schwarz.com/alwayshappy -v
*   Trying 51.105.220.245:443...
* Connected to kubenurse.cloud.rohde-schwarz.com (51.105.220.245) port 443 (#0)
* ALPN: offers h2
* ALPN: offers http/1.1
*  CAfile: /etc/ssl/certs/ca-certificates.crt
*  CApath: none
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
* TLSv1.3 (IN), TLS handshake, Server hello (2):
* TLSv1.2 (IN), TLS handshake, Certificate (11):
* TLSv1.2 (IN), TLS handshake, Server key exchange (12):
* TLSv1.2 (IN), TLS handshake, Server finished (14):
* TLSv1.2 (OUT), TLS handshake, Client key exchange (16):
* TLSv1.2 (OUT), TLS change cipher, Change cipher spec (1):
* TLSv1.2 (OUT), TLS handshake, Finished (20):
* TLSv1.2 (IN), TLS handshake, Finished (20):
* SSL connection using TLSv1.2 / ECDHE-RSA-AES128-GCM-SHA256
* ALPN: server accepted h2
* Server certificate:
*  subject: C=DE; ST=Bayern; L=München; O=Rohde & Schwarz GmbH & Co. KG; CN=*.cloud.rohde-schwarz.com
*  start date: Jun 21 00:00:00 2022 GMT
*  expire date: Jun 29 23:59:59 2023 GMT
*  subjectAltName: host "kubenurse.cloud.rohde-schwarz.com" matched cert's "*.cloud.rohde-schwarz.com"
*  issuer: C=US; O=DigiCert Inc; OU=www.digicert.com; CN=Thawte RSA CA 2018
*  SSL certificate verify ok.
* Using HTTP2, server supports multiplexing
* Copying HTTP/2 data in stream buffer to connection buffer after upgrade: len=0
* h2h3 [:method: GET]
* h2h3 [:path: /alwayshappy]
* h2h3 [:scheme: https]
* h2h3 [:authority: kubenurse.cloud.rohde-schwarz.com]
* h2h3 [user-agent: curl/7.83.1]
* h2h3 [accept: */*]
* Using Stream ID: 1 (easy handle 0x7f37d3bd51b0)
> GET /alwayshappy HTTP/2
> Host: kubenurse.cloud.rohde-schwarz.com
> user-agent: curl/7.83.1
> accept: */*
> 
< HTTP/2 200 
< date: Mon, 11 Jul 2022 11:34:03 GMT
< set-cookie: ca2642a40ba64b7363cb7d68301b753a=4d110bca2cf93cb742c45b40f9c7d239; path=/; HttpOnly; Secure; SameSite=None
< content-length: 0
< cache-control: private
< strict-transport-security: max-age=31536000; includesubdomains; preload
< 
* Connection #0 to host kubenurse.cloud.rohde-schwarz.com left intact

Sometimes only a restart of the entire node solves the issue.

We have no idea why this error message appears. Is it possible to set a debug level to get more information?

Regards
Tobias
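
kubenurse does not expose a dedicated debug level as far as this thread shows. One hedged way to narrow the problem down is to reproduce the request from the same node with a short timeout and keep-alives disabled, since curl always opens a fresh connection while a pooled Go transport may try to reuse a connection that has silently died. A rough sketch with illustrative values:

package sketch

import (
	"net/http"
	"time"
)

// probeIngress reproduces the ingress request with a short timeout so it
// can be compared against curl from the same node. The 5-second timeout and
// the disabled keep-alives are illustrative: disabling keep-alives forces a
// fresh connection for every attempt, like curl does.
func probeIngress(url string) error {
	client := &http.Client{
		Timeout: 5 * time.Second,
		Transport: &http.Transport{
			DisableKeepAlives: true, // do not reuse pooled connections
		},
	}
	resp, err := client.Get(url)
	if err != nil {
		return err
	}
	return resp.Body.Close()
}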

404 Not Found service endpoints

"user_agent": "curl/7.76.1",
 "request_uri": "/alive",
 "remote_addr": "127.0.0.1:38864",
 "api_server_direct": "ok",
 "api_server_dns": "ok",
 "me_ingress": "404 Not Found",
 "me_service": "Get \"http://kubenurse.kube-system.svc.cluster.local:8080/alwayshappy\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)",
 "neighbourhood_state": "ok",

Any idea why /alive and other endpoints return 404 when trying to send requests to the service?

Request: Create a gh-pages branch

Hi @zbindenren , could you create a gh-pages branch in the repository so I can create a PR for the helm repo? It requires a gh-pages branch, but that is not something I can create through my forked repo (I believe).

Grz Bram

Question: How to contribute?

Hi, I was thinking that I could perhaps contribute by making a Helm repo for kubenurse.
I was wondering whether this is already planned by the maintainers, or perhaps not wanted.

Any comments/guidelines on how to contribute?

Grz Bram

Write documentation

  • Configuration options
  • Endpoints
  • Metrics with prometheus example
  • DaemonSet, RBAC etc. yamls
  • Use of project

Optimisation for Neighbourhood discovery on scale

Hello

The neighbourhood check is expensive when running at scale. For instance:

  • neighbour discovery creates load on the API server proportional to the cluster size and the frequency of the checks; the same applies to the node watchers
  • we have n^2 TCP/IP handshakes every 5 s, where n is the number of nodes. This also pushes more packets into the node network (FIFO) queues, and regular production traffic could see increased latency

My proposal would be:

  1. configurable check scheduling, so the check frequency can be reduced to once per minute, for example, when needed (see the sketch below)
  2. optimize the way we query the API server for discovery information; one option is a SWIM-based solution like hashicorp/memberlist

While the first point is straightforward and does not change the behavior much, the latter requires discussion.
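
A minimal sketch of the first point, with the check interval taken from a hypothetical CHECK_INTERVAL environment variable:

package sketch

import (
	"context"
	"os"
	"time"
)

// runChecks periodically invokes check() at an interval taken from the
// (hypothetical) CHECK_INTERVAL environment variable, defaulting to 5s.
// Raising it to 1m on large clusters reduces both the API server load and
// the n^2 neighbour handshakes per interval.
func runChecks(ctx context.Context, check func()) {
	interval := 5 * time.Second
	if v := os.Getenv("CHECK_INTERVAL"); v != "" {
		if d, err := time.ParseDuration(v); err == nil {
			interval = d
		}
	}

	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			check()
		}
	}
}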

exclude neighbourhood pods from Nodes where scheduling is disabled

Problem statement

Currently kubenurse discovers all running neighbour Pods (see kubediscovery.go). If we perform maintenance on a Node, it is possible that the kubenurse instance on this node can't be reached, which is not necessarily a problem. Still, graphs/metrics might show errors (or even trigger false alarms).

Proposal

Exclude kubenurse instances from Nodes where scheduling is disabled.

Further enhancement

Disable checks entirely on a kubenurse instance if the node it runs on has scheduling disabled (for example, to avoid possible service check errors).
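
A minimal sketch of the proposed filter, assuming the discovery code has access to the corev1.Node objects (kubectl cordon sets Spec.Unschedulable):

package sketch

import (
	corev1 "k8s.io/api/core/v1"
)

// schedulableNodes returns the names of nodes that are not cordoned.
// Neighbour pods whose node is not in the returned set could then be
// skipped during maintenance, as proposed above.
func schedulableNodes(nodes []corev1.Node) map[string]bool {
	ok := make(map[string]bool, len(nodes))
	for _, n := range nodes {
		if !n.Spec.Unschedulable {
			ok[n.Name] = true
		}
	}
	return ok
}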

"Moved Permanently" response from /alive endpoint.

I install kubenurse as a Helm chart by placing these files in the chart's "templates" folder and installing it with helm install .. Then, from a pod that I kubectl exec into, I try to send an HTTP request to http://kubenurse.kube-system.svc.cluster.local:8080 with curl and get this response:

<a href="/alive">Moved Permanently</a>.

Cache /alive results

Result caching

Issue

Currently the /alive endpoint will trigger a synchronous kubenurse check. If there are a lot of requests on this endpoint, the network traffic will increase drastically.

Proposal

Implement a simple cache of checker results and serve it on /alive with a cache TTL.
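
A minimal sketch of such a cache; the type, field names and the idea of passing in the check function are illustrative:

package sketch

import (
	"sync"
	"time"
)

// resultCache serves the last checker result for up to ttl before running
// the (expensive) check again, so frequent /alive requests do not each
// trigger a full set of network checks.
type resultCache struct {
	mu      sync.Mutex
	ttl     time.Duration
	fetched time.Time
	result  interface{}
}

func (c *resultCache) get(run func() interface{}) interface{} {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.result == nil || time.Since(c.fetched) > c.ttl {
		c.result = run()
		c.fetched = time.Now()
	}
	return c.result
}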

include errors in total count

The total number of requests is not incremented when there is an error. Failed requests also need to be accounted for in the total count; otherwise it is overly complicated to compute an error rate.
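
A minimal sketch of how a single counter could cover both cases; the metric and label names are illustrative, not kubenurse's actual ones:

package sketch

import (
	"github.com/prometheus/client_golang/prometheus"
)

// requestsTotal counts every check request, successful or not, with the
// outcome as a label. The error rate then becomes a simple PromQL ratio:
//   sum(rate(kubenurse_requests_total{result="error"}[5m]))
//     / sum(rate(kubenurse_requests_total[5m]))
// Metric and label names are illustrative.
var requestsTotal = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "kubenurse_requests_total",
		Help: "Total number of check requests, including failed ones.",
	},
	[]string{"type", "result"},
)

func init() {
	prometheus.MustRegister(requestsTotal)
}

// observe increments the counter for both successful and failed checks.
func observe(checkType string, err error) {
	result := "ok"
	if err != nil {
		result = "error"
	}
	requestsTotal.WithLabelValues(checkType, result).Inc()
}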

CI Test fails on ingress deployment

The following error occurs:

Run kubectl apply -f https://raw.githubusercontent.com/kubernetes/ingress-nginx/master/deploy/static/provider/kind/deploy.yaml
namespace/ingress-nginx created
serviceaccount/ingress-nginx created
configmap/ingress-nginx-controller created
clusterrole.rbac.authorization.k8s.io/ingress-nginx created
clusterrolebinding.rbac.authorization.k8s.io/ingress-nginx created
role.rbac.authorization.k8s.io/ingress-nginx created
error: error validating "https://raw.githubusercontent.com/kubernetes/ingress-nginx/master/deploy/static/provider/kind/deploy.yaml": error validating data: ValidationError(Service.spec.ports[0]): unknown field "appProtocol" in io.k8s.api.core.v1.ServicePort; if you choose to ignore these errors, turn validation off with --validate=false
rolebinding.rbac.authorization.k8s.io/ingress-nginx created
Error: Process completed with exit code 1.

The relevant error is the unknown field "appProtocol".

run checks in parallel

Currently, service checks are run sequentially:

res.APIServerDirect = c.measure(c.APIServerDirect, "api_server_direct")
res.APIServerDNS = c.measure(c.APIServerDNS, "api_server_dns")
res.MeIngress = c.measure(c.MeIngress, "me_ingress")
res.MeService = c.measure(c.MeService, "me_service")

However, in case of a network failure, each Run() of the checks could take a lot of time:
assuming a timeout of 5 seconds, we could have a worst case of (4 * 5 s + (number of nodes) * 5 s).

All checks should be performed in parallel, using a sync.WaitGroup (see the sketch below).
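
A minimal sketch of that, with the measure function simplified to take only the check name:

package sketch

import "sync"

// runParallel executes the four service checks concurrently instead of
// sequentially, so the worst case with a 5 s timeout is ~5 s rather than
// 4 * 5 s. The measure func mirrors the snippet above in simplified form;
// error handling and the neighbourhood checks are omitted.
func runParallel(measure func(name string) string) (direct, dns, ingress, service string) {
	var wg sync.WaitGroup
	run := func(dst *string, name string) {
		defer wg.Done()
		*dst = measure(name)
	}

	wg.Add(4)
	go run(&direct, "api_server_direct")
	go run(&dns, "api_server_dns")
	go run(&ingress, "me_ingress")
	go run(&service, "me_service")
	wg.Wait()
	return
}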

WIP: Proposal: Check cluster health with mutations to the state

This is a proposal to extend the kubenurse in order to discover more cluster and network issues.

Current situation

Currently we can detect cluster issues that occur at runtime. These are mainly network related and concern kube-proxy, kube-apiserver-availability, cluster DNS, Ingress and upstream DNS.

Not covered issues

The following example scenarios can occur without kubenurse detecting them:

  • A Pod gets created and has no networking because of IPAM, CNI or iptables failures
  • A Service isn't working after creation or mutation of the state (Pod readiness, Pod existence)
  • An Ingress doesn't pick up changes of the state, e.g. a Pod is created or a new Ingress record is set
  • A Node cannot start any new Pods due to various issues
  • A Node can start a Pod, but it's not usable due to resource pressure

Usually such issues are only detected when some mutation (e.g. a deployment or scaling) is done or the Chaos Monkey is doing its work.

Proposal

An elected kubenurse should periodically create a Pod, a Service and an Ingress.
As soon as the Pod is ready, the parent kubenurse connects to the Pod directly, via the Service and via the Ingress, while recording errors and timings.
Additionally, the child kubenurse should generate some configurable load (CPU, disk, memory) after being started. After the connection checks are done or a timeout is reached, all resources have to be garbage collected.

Constraints:

  • This function is disabled by default
  • All resource names must be customizable (random or static)
  • Ingress path must be customizable (random or static)
  • There is no conflict between kubenurses in the same namespace
  • RBAC needs to be adapted

Challenges

  • The leader election should be between the same group of kubenurses
  • Are Ingress and Pod templates needed, and how should they be implemented?
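
As a rough sketch of the core step, creating a probe Pod with client-go and waiting until it reports Ready; the image, names and timeouts are illustrative, and the Service/Ingress creation, load generation and garbage collection from the proposal are omitted:

package sketch

import (
	"context"
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// createProbePod creates a short-lived Pod and polls until it reports Ready.
// Image, names and timeouts are illustrative; creating the Service/Ingress,
// generating load and garbage collecting the resources are omitted.
func createProbePod(ctx context.Context, cs kubernetes.Interface, ns string) error {
	pod := &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{GenerateName: "kubenurse-probe-", Namespace: ns},
		Spec: corev1.PodSpec{
			RestartPolicy: corev1.RestartPolicyNever,
			Containers: []corev1.Container{{
				Name:  "probe",
				Image: "registry.k8s.io/pause:3.9",
			}},
		},
	}

	created, err := cs.CoreV1().Pods(ns).Create(ctx, pod, metav1.CreateOptions{})
	if err != nil {
		return err
	}

	deadline := time.Now().Add(2 * time.Minute)
	for time.Now().Before(deadline) {
		p, err := cs.CoreV1().Pods(ns).Get(ctx, created.Name, metav1.GetOptions{})
		if err != nil {
			return err
		}
		for _, cond := range p.Status.Conditions {
			if cond.Type == corev1.PodReady && cond.Status == corev1.ConditionTrue {
				return nil // ready: the parent could now connect directly, via Service and via Ingress
			}
		}
		time.Sleep(2 * time.Second)
	}
	return fmt.Errorf("probe pod %s not ready within timeout", created.Name)
}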
