kubenurse's People

Contributors

bramvdklinkenberg, ckittelmann, clementnuss, dependabot[bot], djboris9, domi2120, eli-halych, ghouscht, luanabanana, matthisholleville, matthyx, myaser, opensourcepf, philipsahli, phspagiari, zbindenren

kubenurse's Issues

Proposal: Use httptrace to get advanced metrics and error messages

Proposal

We should use httptrace for requests made by kubenurse in order to get more advanced error details (like DNS failures) and statistics (like time-to-first-byte).

Positive impact

We will log a bit more detail on HTTP errors and get new metrics from the requests. This makes debugging easier in cases like #44, where currently only an HTTP client error message is present, and gives users insight into the different latencies.

Negative impact

There could be a small increase in log messages and metrics.

Notes

httptrace is already included via the Prometheus framework, but it needs to be extended.
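
As a rough illustration, a minimal sketch of how net/http/httptrace hooks could capture DNS timings and time-to-first-byte; the hooks are from the standard library, while wiring the durations into Prometheus metrics is left out and would be project-specific:

package sketch

import (
	"crypto/tls"
	"log"
	"net/http"
	"net/http/httptrace"
	"time"
)

// traceRequest issues a GET and logs the DNS duration, TLS handshake errors
// and time-to-first-byte via net/http/httptrace. Exporting these durations
// as Prometheus histograms instead of log lines would be the next step.
func traceRequest(url string) error {
	var dnsStart, start time.Time

	trace := &httptrace.ClientTrace{
		DNSStart: func(httptrace.DNSStartInfo) { dnsStart = time.Now() },
		DNSDone: func(info httptrace.DNSDoneInfo) {
			log.Printf("dns lookup took %v, err=%v", time.Since(dnsStart), info.Err)
		},
		TLSHandshakeDone: func(_ tls.ConnectionState, err error) {
			log.Printf("tls handshake done, err=%v", err)
		},
		GotFirstResponseByte: func() {
			log.Printf("time to first byte: %v", time.Since(start))
		},
	}

	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return err
	}
	req = req.WithContext(httptrace.WithClientTrace(req.Context(), trace))

	start = time.Now()
	resp, err := http.DefaultTransport.RoundTrip(req)
	if err != nil {
		return err
	}
	return resp.Body.Close()
}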

clusterrole permission

hi @djboris9
After deploying version 3.0, the following error is reported. Even when the ClusterRole permission is bound to admin, it does not work:

E0415 06:59:52.669811       1 reflector.go:127] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:156: Failed to watch *v1.Node: failed to list *v1.Node: nodes is forbidden: User "system:serviceaccount:kube-system:nurse" cannot list resource "nodes" in API group "" at the cluster scope
E0415 06:59:53.957503       1 reflector.go:127] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:156: Failed to watch *v1.Node: failed to list *v1.Node: nodes is forbidden: User "system:serviceaccount:kube-system:nurse" cannot list resource "nodes" in API group "" at the cluster scope
E0415 06:59:57.066479       1 reflector.go:127] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:156: Failed to watch *v1.Node: failed to list *v1.Node: nodes is forbidden: User "system:serviceaccount:kube-system:nurse" cannot list resource "nodes" in API group "" at the cluster scope
E0415 07:00:02.396648       1 reflector.go:127] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:156: Failed to watch *v1.Node: failed to list *v1.Node: nodes is forbidden: User "system:serviceaccount:kube-system:nurse" cannot list resource "nodes" in API group "" at the cluster scope
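
The log says the nurse service account may not list or watch nodes at cluster scope. As a hedged illustration (not the project's shipped manifests), the missing permissions expressed with the k8s.io/api/rbac/v1 Go types; object names are chosen for the example, and in practice the equivalent ClusterRole/ClusterRoleBinding YAML would be applied with kubectl:

package sketch

import (
	rbacv1 "k8s.io/api/rbac/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// nodeReaderRBAC builds a ClusterRole and ClusterRoleBinding that grant the
// kube-system/nurse service account get/list/watch on nodes, which is what
// the reflector error above complains about. Object names are illustrative.
func nodeReaderRBAC() (*rbacv1.ClusterRole, *rbacv1.ClusterRoleBinding) {
	role := &rbacv1.ClusterRole{
		ObjectMeta: metav1.ObjectMeta{Name: "kubenurse-node-reader"},
		Rules: []rbacv1.PolicyRule{{
			APIGroups: []string{""},
			Resources: []string{"nodes"},
			Verbs:     []string{"get", "list", "watch"},
		}},
	}
	binding := &rbacv1.ClusterRoleBinding{
		ObjectMeta: metav1.ObjectMeta{Name: "kubenurse-node-reader"},
		RoleRef: rbacv1.RoleRef{
			APIGroup: "rbac.authorization.k8s.io",
			Kind:     "ClusterRole",
			Name:     "kubenurse-node-reader",
		},
		Subjects: []rbacv1.Subject{{
			Kind:      "ServiceAccount",
			Name:      "nurse",
			Namespace: "kube-system",
		}},
	}
	return role, binding
}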

Monitor external endpoints

It would be great if the tool can be extended to monitor external endpoints (HTTP, TCP and DNS).

This way we could also monitor the outgoing network infrastructure from each k8s node.
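
A minimal sketch of what such external checks could look like; the function names are hypothetical and not part of kubenurse today:

package sketch

import (
	"context"
	"fmt"
	"net"
	"time"
)

// checkTCP and checkDNS are hypothetical external checks. They illustrate
// what probing an external TCP endpoint and an external DNS name from each
// node could look like; an HTTP check would reuse the existing HTTP client.
func checkTCP(addr string, timeout time.Duration) error {
	conn, err := net.DialTimeout("tcp", addr, timeout)
	if err != nil {
		return fmt.Errorf("tcp check %s failed: %w", addr, err)
	}
	return conn.Close()
}

func checkDNS(ctx context.Context, host string) error {
	var r net.Resolver
	if _, err := r.LookupHost(ctx, host); err != nil {
		return fmt.Errorf("dns check %s failed: %w", host, err)
	}
	return nil
}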

failed request for me_ingress with Get "<INGRESS_URL>/alwayshappy": unsupported protocol scheme ""

I have deployed kubenurse with ArgoCD on an AKS cluster. For the ingress URL I provided a host but no TLS.
It all seems to be working: in the browser I can reach /alive, /alwayshappy (200 response) and /metrics.
But the logs of the container show the error failed request for me_ingress with Get "kubenurse.rancher-canary-dev.k8s.digitaldev.nl/alwayshappy": unsupported protocol scheme "".

The ingress does not accept https:// in front of the ingress url.

No clue what I am doing wrong here.
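
The error itself comes from Go's HTTP client, which refuses URLs without an explicit scheme, so the configured ingress URL presumably needs an http:// or https:// prefix even if the ingress only serves plain HTTP. A minimal sketch (the helper name is hypothetical):

package sketch

import (
	"net/url"
	"strings"
)

// normalizeURL is a hypothetical helper: Go's http.Client returns
// `unsupported protocol scheme ""` when the URL has no scheme, so a bare
// host like kubenurse.example.com/alwayshappy must be prefixed with
// http:// or https:// before it can be used for the me_ingress check.
func normalizeURL(raw string) string {
	u, err := url.Parse(raw)
	if err == nil && u.Scheme != "" {
		return raw
	}
	if strings.HasPrefix(raw, "//") {
		return "https:" + raw
	}
	return "https://" + raw
}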

Failed request for me_ingress

Hello everyone,

We're using kubenurse in our Openshift environment on-premises and on Azure.
On-premises everything works as expected, but on Azure RedHat Openshift (ARO) we have a strange behaviour.

Sometimes the request for "me_ingress" fails with the following error message:
2022/07/10 03:03:25 failed request for me_ingress with Get "https://kubenurse.cloud.rohde-schwarz.com/alwayshappy": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)

But if we execute a curl command from the terminal at the same time, the command is successful.

~ $ curl https://kubenurse.cloud.rohde-schwarz.com/alwayshappy -v
*   Trying 51.105.220.245:443...
* Connected to kubenurse.cloud.rohde-schwarz.com (51.105.220.245) port 443 (#0)
* ALPN: offers h2
* ALPN: offers http/1.1
*  CAfile: /etc/ssl/certs/ca-certificates.crt
*  CApath: none
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
* TLSv1.3 (IN), TLS handshake, Server hello (2):
* TLSv1.2 (IN), TLS handshake, Certificate (11):
* TLSv1.2 (IN), TLS handshake, Server key exchange (12):
* TLSv1.2 (IN), TLS handshake, Server finished (14):
* TLSv1.2 (OUT), TLS handshake, Client key exchange (16):
* TLSv1.2 (OUT), TLS change cipher, Change cipher spec (1):
* TLSv1.2 (OUT), TLS handshake, Finished (20):
* TLSv1.2 (IN), TLS handshake, Finished (20):
* SSL connection using TLSv1.2 / ECDHE-RSA-AES128-GCM-SHA256
* ALPN: server accepted h2
* Server certificate:
*  subject: C=DE; ST=Bayern; L=München; O=Rohde & Schwarz GmbH & Co. KG; CN=*.cloud.rohde-schwarz.com
*  start date: Jun 21 00:00:00 2022 GMT
*  expire date: Jun 29 23:59:59 2023 GMT
*  subjectAltName: host "kubenurse.cloud.rohde-schwarz.com" matched cert's "*.cloud.rohde-schwarz.com"
*  issuer: C=US; O=DigiCert Inc; OU=www.digicert.com; CN=Thawte RSA CA 2018
*  SSL certificate verify ok.
* Using HTTP2, server supports multiplexing
* Copying HTTP/2 data in stream buffer to connection buffer after upgrade: len=0
* h2h3 [:method: GET]
* h2h3 [:path: /alwayshappy]
* h2h3 [:scheme: https]
* h2h3 [:authority: kubenurse.cloud.rohde-schwarz.com]
* h2h3 [user-agent: curl/7.83.1]
* h2h3 [accept: */*]
* Using Stream ID: 1 (easy handle 0x7f37d3bd51b0)
> GET /alwayshappy HTTP/2
> Host: kubenurse.cloud.rohde-schwarz.com
> user-agent: curl/7.83.1
> accept: */*
> 
< HTTP/2 200 
< date: Mon, 11 Jul 2022 11:34:03 GMT
< set-cookie: ca2642a40ba64b7363cb7d68301b753a=4d110bca2cf93cb742c45b40f9c7d239; path=/; HttpOnly; Secure; SameSite=None
< content-length: 0
< cache-control: private
< strict-transport-security: max-age=31536000; includesubdomains; preload
< 
* Connection #0 to host kubenurse.cloud.rohde-schwarz.com left intact

Sometimes only a restart of the entire node solves the issue.

We have no idea why this error message appears. Is it possible to set a debug level to get more information?

Regards
Tobias
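
kubenurse does not expose a dedicated debug level as far as this thread shows. One hedged way to narrow the problem down is to reproduce the request from the same node with a short timeout and keep-alives disabled, since curl always opens a fresh connection while a pooled Go transport may try to reuse a connection that has silently died. A rough sketch with illustrative values:

package sketch

import (
	"net/http"
	"time"
)

// probeIngress reproduces the ingress request with a short timeout so it
// can be compared against curl from the same node. The 5-second timeout and
// the disabled keep-alives are illustrative: disabling keep-alives forces a
// fresh connection for every attempt, like curl does.
func probeIngress(url string) error {
	client := &http.Client{
		Timeout: 5 * time.Second,
		Transport: &http.Transport{
			DisableKeepAlives: true, // do not reuse pooled connections
		},
	}
	resp, err := client.Get(url)
	if err != nil {
		return err
	}
	return resp.Body.Close()
}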

404 Not Found service endpoints

"user_agent": "curl/7.76.1",
 "request_uri": "/alive",
 "remote_addr": "127.0.0.1:38864",
 "api_server_direct": "ok",
 "api_server_dns": "ok",
 "me_ingress": "404 Not Found",
 "me_service": "Get \"http://kubenurse.kube-system.svc.cluster.local:8080/alwayshappy\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)",
 "neighbourhood_state": "ok",

Any idea why /alive and other endpoints return 404 when trying to send requests to the service?

Request: Create a gh-pages branch

Hi @zbindenren , could you create a gh-pages branch in the repository so I can create a PR for the helm repo? It requires a gh-pages branch, but that is not something I can create through my forked repo (I believe).

Grz Bram

Question: How to contribute?

Hi, I was thinking that I could perhaps contribute by making a Helm repo for kubenurse.
I was wondering whether this is already planned by the maintainers, or perhaps not wanted.

Any comments/guidelines on how to contribute?

Grz Bram

Write documentation

  • Configuration options
  • Endpoints
  • Metrics with prometheus example
  • DaemonSet, RBAC etc. yamls
  • Use of project

Optimisation for Neighbourhood discovery on scale

Hello

The neighbourhood check is expensive when running at scale. For instance:

  • neighbour discovery creates load on the API server proportional to the cluster size and the frequency of the checks; the same applies to the node watchers
  • we have n^2 TCP/IP handshakes every 5 s, where n is the number of nodes. This also pushes more packets into the node network (FIFO) queues, and regular production traffic could see increased latency

My proposal would be:

  1. configurable check scheduling, so the check frequency can be reduced to once per minute, for example, when needed (see the sketch below)
  2. optimize the way we query the API server for discovery information; one option is a SWIM-based solution like hashicorp/memberlist

While the first point is straightforward and does not change the behavior much, the latter requires discussion.
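
A minimal sketch of the first point, with the check interval taken from a hypothetical CHECK_INTERVAL environment variable:

package sketch

import (
	"context"
	"os"
	"time"
)

// runChecks periodically invokes check() at an interval taken from the
// (hypothetical) CHECK_INTERVAL environment variable, defaulting to 5s.
// Raising it to 1m on large clusters reduces both the API server load and
// the n^2 neighbour handshakes per interval.
func runChecks(ctx context.Context, check func()) {
	interval := 5 * time.Second
	if v := os.Getenv("CHECK_INTERVAL"); v != "" {
		if d, err := time.ParseDuration(v); err == nil {
			interval = d
		}
	}

	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			check()
		}
	}
}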

exclude neighbourhood pods from Nodes where scheduling is disabled

Problem statement

Currently kubenurse discovers all running neighbour Pods (see kubediscovery.go). If we perform maintenance on a Node, it is possible that the kubenurse instance on this node can't be reached, which is not necessarily a problem. Still, graphs/metrics might show errors (or even trigger false alarms).

Proposal

Exclude kubenurse instances from Nodes where scheduling is disabled.

Further enhancement

Disable checks entirely on a kubenurse instance if the node it runs on has scheduling disabled (for example, to avoid possible service check errors).
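
A minimal sketch of the proposed filter, assuming the discovery code has access to the corev1.Node objects (kubectl cordon sets Spec.Unschedulable):

package sketch

import (
	corev1 "k8s.io/api/core/v1"
)

// schedulableNodes returns the names of nodes that are not cordoned.
// Neighbour pods whose node is not in the returned set could then be
// skipped during maintenance, as proposed above.
func schedulableNodes(nodes []corev1.Node) map[string]bool {
	ok := make(map[string]bool, len(nodes))
	for _, n := range nodes {
		if !n.Spec.Unschedulable {
			ok[n.Name] = true
		}
	}
	return ok
}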

"Moved Permanently" response from /alive endpoint.

I install kubenurse as a Helm chart by placing these files in the chart's "templates" folder and installing it with helm install .. Then, from a pod that I kubectl exec into, I try to send an HTTP request to http://kubenurse.kube-system.svc.cluster.local:8080 with curl and get this response:

<a href="/alive">Moved Permanently</a>.

Cache /alive results

Result caching

Issue

Currently the /alive endpoint will trigger a synchronous kubenurse check. If there are a lot of requests on this endpoint, the network traffic will increase drastically.

Proposal

Implement a simple cache of checker results and serve it on /alive with a cache TTL.
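
A minimal sketch of such a cache; the type, field names and the idea of passing in the check function are illustrative:

package sketch

import (
	"sync"
	"time"
)

// resultCache serves the last checker result for up to ttl before running
// the (expensive) check again, so frequent /alive requests do not each
// trigger a full set of network checks.
type resultCache struct {
	mu      sync.Mutex
	ttl     time.Duration
	fetched time.Time
	result  interface{}
}

func (c *resultCache) get(run func() interface{}) interface{} {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.result == nil || time.Since(c.fetched) > c.ttl {
		c.result = run()
		c.fetched = time.Now()
	}
	return c.result
}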

include errors in total count

The total number of requests is not incremented when there is an error. Failed requests also need to be accounted for in the total count; otherwise it is overly complicated to compute an error rate.
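
A minimal sketch of how a single counter could cover both cases; the metric and label names are illustrative, not kubenurse's actual ones:

package sketch

import (
	"github.com/prometheus/client_golang/prometheus"
)

// requestsTotal counts every check request, successful or not, with the
// outcome as a label. The error rate then becomes a simple PromQL ratio:
//   sum(rate(kubenurse_requests_total{result="error"}[5m]))
//     / sum(rate(kubenurse_requests_total[5m]))
// Metric and label names are illustrative.
var requestsTotal = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "kubenurse_requests_total",
		Help: "Total number of check requests, including failed ones.",
	},
	[]string{"type", "result"},
)

func init() {
	prometheus.MustRegister(requestsTotal)
}

// observe increments the counter for both successful and failed checks.
func observe(checkType string, err error) {
	result := "ok"
	if err != nil {
		result = "error"
	}
	requestsTotal.WithLabelValues(checkType, result).Inc()
}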

CI Test fails on ingress deployment

The following error occurs:

Run kubectl apply -f https://raw.githubusercontent.com/kubernetes/ingress-nginx/master/deploy/static/provider/kind/deploy.yaml
namespace/ingress-nginx created
serviceaccount/ingress-nginx created
configmap/ingress-nginx-controller created
clusterrole.rbac.authorization.k8s.io/ingress-nginx created
clusterrolebinding.rbac.authorization.k8s.io/ingress-nginx created
role.rbac.authorization.k8s.io/ingress-nginx created
error: error validating "https://raw.githubusercontent.com/kubernetes/ingress-nginx/master/deploy/static/provider/kind/deploy.yaml": error validating data: ValidationError(Service.spec.ports[0]): unknown field "appProtocol" in io.k8s.api.core.v1.ServicePort; if you choose to ignore these errors, turn validation off with --validate=false
rolebinding.rbac.authorization.k8s.io/ingress-nginx created
Error: Process completed with exit code 1.

The relevant error is the unknown field "appProtocol".

run checks in parallel

Currently, service checks are run sequentially:

res.APIServerDirect = c.measure(c.APIServerDirect, "api_server_direct")
res.APIServerDNS = c.measure(c.APIServerDNS, "api_server_dns")
res.MeIngress = c.measure(c.MeIngress, "me_ingress")
res.MeService = c.measure(c.MeService, "me_service")

However, in case of a network failure, each Run() of the checks could take a lot of time:
assuming a timeout of 5 seconds, we could have a worst case of (4 * 5 s + (number of nodes) * 5 s).

All checks should be performed in parallel, using a sync.WaitGroup (see the sketch below).
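
A minimal sketch of that, with the measure function simplified to take only the check name:

package sketch

import "sync"

// runParallel executes the four service checks concurrently instead of
// sequentially, so the worst case with a 5 s timeout is ~5 s rather than
// 4 * 5 s. The measure func mirrors the snippet above in simplified form;
// error handling and the neighbourhood checks are omitted.
func runParallel(measure func(name string) string) (direct, dns, ingress, service string) {
	var wg sync.WaitGroup
	run := func(dst *string, name string) {
		defer wg.Done()
		*dst = measure(name)
	}

	wg.Add(4)
	go run(&direct, "api_server_direct")
	go run(&dns, "api_server_dns")
	go run(&ingress, "me_ingress")
	go run(&service, "me_service")
	wg.Wait()
	return
}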

WIP: Proposal: Check cluster health with mutations to the state

This is a proposal to extend the kubenurse in order to discover more cluster and network issues.

Current situation

Currently we can detect cluster issues that occur at runtime. These are mainly network related and concern kube-proxy, kube-apiserver-availability, cluster DNS, Ingress and upstream DNS.

Not covered issues

The following example scenarios can occur without kubenurse detecting them:

  • A Pod gets created and has no networking because of IPAM, CNI or iptables failures
  • A Service isn't working after creation or mutation of the state (Pod readiness, Pod existence)
  • An Ingress doesn't pick up changes of the state, e.g. a Pod is created or a new Ingress record is set
  • A Node cannot start any new Pods due to various issues
  • A Node can start a Pod, but it's not usable due to resource pressure

Usually such issues are only detected when some mutation (e.g. a deployment or scaling) is done or the Chaos Monkey is doing its work.

Proposal

An elected kubenurse should periodically create a Pod, a Service and an Ingress.
As soon as the Pod is ready, the parent kubenurse connects to the Pod directly, via the Service and via the Ingress, while recording errors and timings.
Additionally, the child kubenurse should generate some configurable load (CPU, disk, memory) after being started. After the connection checks are done or a timeout is reached, all resources have to be garbage collected.

Constraints:

  • This function is disabled by default
  • All resource names must be customizable (random or static)
  • Ingress path must be customizable (random or static)
  • There is no conflict between kubenurses in the same namespace
  • RBAC needs to be adapted

Challenges

  • The leader election should be between the same group of kubenurses
  • Are Ingress and Pod templates needed, and how should they be implemented?
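
As a rough sketch of the core step, creating a probe Pod with client-go and waiting until it reports Ready; the image, names and timeouts are illustrative, and the Service/Ingress creation, load generation and garbage collection from the proposal are omitted:

package sketch

import (
	"context"
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// createProbePod creates a short-lived Pod and polls until it reports Ready.
// Image, names and timeouts are illustrative; creating the Service/Ingress,
// generating load and garbage collecting the resources are omitted.
func createProbePod(ctx context.Context, cs kubernetes.Interface, ns string) error {
	pod := &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{GenerateName: "kubenurse-probe-", Namespace: ns},
		Spec: corev1.PodSpec{
			RestartPolicy: corev1.RestartPolicyNever,
			Containers: []corev1.Container{{
				Name:  "probe",
				Image: "registry.k8s.io/pause:3.9",
			}},
		},
	}

	created, err := cs.CoreV1().Pods(ns).Create(ctx, pod, metav1.CreateOptions{})
	if err != nil {
		return err
	}

	deadline := time.Now().Add(2 * time.Minute)
	for time.Now().Before(deadline) {
		p, err := cs.CoreV1().Pods(ns).Get(ctx, created.Name, metav1.GetOptions{})
		if err != nil {
			return err
		}
		for _, cond := range p.Status.Conditions {
			if cond.Type == corev1.PodReady && cond.Status == corev1.ConditionTrue {
				return nil // ready: the parent could now connect directly, via Service and via Ingress
			}
		}
		time.Sleep(2 * time.Second)
	}
	return fmt.Errorf("probe pod %s not ready within timeout", created.Name)
}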
