napsty / check_rancher2



Monitoring plugin to check Docker / Kubernetes clusters managed by Rancher 2.x

License: GNU General Public License v2.0

Shell 100.00%

rancher rancher-server monitoring rancher2 monitoring-plugin nagios-plugin docker containers kubernetes

check_rancher2's Introduction

check_rancher2

Monitoring plugin to check Kubernetes container environments in Rancher 2.x

This is the public repository for development.

Latest release and documentation can be found here:

https://www.claudiokuenzler.com/monitoring-plugins/check_rancher2.php

The check_rancher2 monitoring plugin is sponsored by Infiniroot (Hosted Rancher Kubernetes in Switzerland) -> www.infiniroot.com

check_rancher2's People

mat1010 justindavis lbrines devops-corner lopf disaster37 jostmart yzapf steffeneichler coocyg

check_rancher2's Issues

check_rancher2 does not detect pressure situations on nodes

When a Kubernetes node suffers from a pressure situation (e.g. Disk Pressure), the plugin does not detect it. It reads only the "node status", which remains "Active".

The plugin needs to read additional conditions from the API:

DiskPressure
MemoryPressure
NetworkUnavailable
Ready

Work for a PR is already in progress.
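
The evaluation of those four conditions could look roughly like the sketch below. It is not the plugin's implementation: the function names are hypothetical, and it assumes that "Ready" is healthy when its status is True while the other three conditions are healthy when their status is False.

```shell
#!/bin/sh
# Hypothetical helper: decide whether one node condition signals a problem.
# Assumption: Ready must be "True"; pressure/network conditions must be "False".
condition_is_problem() {
  cond_type=$1
  cond_status=$2
  case "$cond_type" in
    Ready)
      [ "$cond_status" != "True" ] ;;
    DiskPressure|MemoryPressure|NetworkUnavailable)
      [ "$cond_status" = "True" ] ;;
    *)
      return 1 ;;  # condition types not listed above are ignored in this sketch
  esac
}

# Read "type status" pairs from stdin and print the problematic ones
# as a single space-separated line.
node_problems() {
  problems=""
  while read -r cond_type cond_status; do
    if condition_is_problem "$cond_type" "$cond_status"; then
      problems="${problems}${cond_type}=${cond_status} "
    fi
  done
  printf '%s\n' "${problems% }"
}
```

The "type status" pairs would have to be extracted from the node data returned by the API first, for example with jq. The exact field layout of that response is not shown in this issue and would need to be checked.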

Allow self-signed certificates

Allow self-signed certificates on the Rancher 2 API.

Mentioned in #2

check_rancher2 not returning data through Nagios

Hello!

I've been able to get the check_rancher2 service check to correctly return data by executing it from the command line. I use the proper token, password and when I attempt to get info, everything works as expected:

However, when I attempt to run the same command through the Nagios (using Check_MK as a UI) wrapper, it never returns any data:

I've tried putting parameters in quotes, I've tried using $USERX$ variables for the token and password, I've tried escaping values, etc, etc. I've tried a lot of different methods that have worked in my 10 years of doing Nagios to get this working, but I can't figure it out.

Command definition:

Service definition:

Have you actually confirmed that you got it working in Nagios and not just Icinga? My only modification to the script was adding a "-k" to all the curl calls since I'm using a self-signed CA Cert chain for the HTTPS rancher endpoint. And like I said, everything works perfectly on the command line but when it's trying to run in Nagios itself, it doesn't return.

Thoughts?

Handle 403 access forbidden

When the API is secured through IP restriction or something similar and the plugin receives a 403 forbidden error, this is currently (in 1.2.2) not considered and the plugin will just show json read errors and return OK.

# ./check_rancher2.sh -H rancher.example.com -S -U token-xxxxx -P secret -t info
json read error: line 1 column 1: '[' or '{' expected near '<'
json read error: line 1 column 1: '[' or '{' expected near '<'
json read error: line 1 column 1: '[' or '{' expected near '<'
json read error: line 1 column 1: '[' or '{' expected near '<'
CHECK_RANCHER2 OK - Found 0 clusters:  and 0 projects: |'clusters'=0;;;; 'projects'=0;;;;

At the same time, finding 0 clusters could also indicate that something is not right.
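
One way to catch this case is to look at the HTTP status code before parsing the body. The following is a minimal sketch, not the fix that went into the plugin: `http_status_to_exit` is a hypothetical helper, and the exit code 3 follows the usual monitoring plugin convention for UNKNOWN.

```shell
#!/bin/sh
# Hypothetical helper: translate an HTTP status code into a plugin result.
# Returns 0 for 2xx responses; prints a message and returns 3 (UNKNOWN) otherwise.
http_status_to_exit() {
  case "$1" in
    2??)
      return 0 ;;
    401|403)
      echo "CHECK_RANCHER2 UNKNOWN - API access denied (HTTP $1)"
      return 3 ;;
    *)
      echo "CHECK_RANCHER2 UNKNOWN - Unexpected HTTP status $1 from API"
      return 3 ;;
  esac
}

# Possible usage (placeholder host and credentials):
#   code=$(curl -s -o /dev/null -w '%{http_code}' -u "token-xxxxx:secret" \
#          "https://rancher.example.com/v3/clusters")
#   http_status_to_exit "$code" || exit $?
```

curl's `-w '%{http_code}'` prints only the numeric status, so the check does not depend on the response body being valid JSON.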

Cluster node check ignoring all nodes down state

Currently my entire cluster is down, and all of its nodes along with it. The script does not seem to care whether the cluster is entirely down: it still reports the cluster state as healthy. Is there something I am doing wrong?

I am trying to monitor whether a cluster has a broken node in it, and this one is down with 6 nodes. Can you point me in the right direction?

Vague error condition leads to "Cluster .. not found"

Hi!

check_rancher2/check_rancher2.sh

Line 225 in d93c4de

if [[ -n $(echo "$api_out_single_cluster" | grep -i "error") ]]

The condition checks for the string "error", which can also appear in a valid configuration, e.g. for ingress:

"options": {
    "custom-http-errors": "404,503",
    "use-forwarded-headers": "true"
}

It's not clear to me what the initial intention was. Can you give an example so that I can improve the error string to not match valid options like custom-http-errors?

Thank you anyway for your check!
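
A narrower match could look like the sketch below. It assumes that an API error response carries a `"type": "error"` field, which is not confirmed anywhere in this issue, and `is_api_error` is a hypothetical name. The point is only that matching a specific key/value pair avoids false hits on option names such as custom-http-errors.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if the response contains "type": "error".
# Assumption: error responses from the API include exactly this key/value pair.
is_api_error() {
  printf '%s' "$1" | grep -Eq '"type"[[:space:]]*:[[:space:]]*"error"'
}

# Possible usage inside the plugin (variable name taken from the line quoted above):
#   if is_api_error "$api_out_single_cluster"; then
#     echo "CHECK_RANCHER2 UNKNOWN - API returned an error"; exit 3
#   fi
```

A text match like this can still hit a nested object that happens to contain the same pair, so comparing the top-level field with a JSON parser would be stricter.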

Ignoring statuses in workloads

Hello all,
we use the plugin for our monitoring environment and in version 1.3.0 the option to ignore some status(es) of nodes was integrated.
I would like to do this also for workloads. Workloads in status "initializing" should not be shown directly as critical.
Is it possible to implement this function? I haven't found a corresponding parameter yet. "-i" refers only to the nodes.

Many thanks already
Best regards
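
The requested behaviour boils down to a membership test against an ignore list. A minimal sketch, assuming the statuses to ignore are passed as a comma-separated string (the helper name `status_is_ignored` is hypothetical):

```shell
#!/bin/sh
# Hypothetical helper: succeed if $1 appears as a whole entry in the
# comma-separated list $2. Wrapping both sides in commas prevents
# partial matches such as "init" matching "initializing".
status_is_ignored() {
  status=$1
  ignore_list=$2
  case ",${ignore_list}," in
    *",${status},"*) return 0 ;;
    *) return 1 ;;
  esac
}

# Possible usage in a workload loop:
#   if status_is_ignored "$workload_state" "initializing,updating"; then
#     continue   # skip this workload instead of raising CRITICAL
#   fi
```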

Cluster error detected but wrong cluster in output

Current situation

When all clusters are checked and one of the discovered clusters has an error, the output usually does not contain the correct cluster name.

Doing a cluster check type on all clusters:

$ ./check_rancher2.sh -H rancher2api.example.com -U token-xxxx -P secret -S -t cluster
CHECK_RANCHER2 CRITICAL - controller-manager in cluster "local" is not healthy -|'clusters_total'=5;;;; 'clusters_errors'=1;;;;

However the local cluster is fine and can be verified in the Rancher UI and also by doing a specific cluster check:

$ ./check_rancher2.sh -H rancher2api.example.com -U token-xxxx -P secret -S -t cluster -c local
CHECK_RANCHER2 OK - Cluster local is healthy|'cluster_healthy'=1;;;; 'cluster_errors'=0;;;;

It's another cluster which has the issue:

$ ./check_rancher2.sh -H rancher2api.example.com -U token-xxxx -P secret -S -t cluster -c c-tx5fl
CHECK_RANCHER2 CRITICAL - Cluster c-tx5fl: controller-manager is not healthy - scheduler is not healthy -|'cluster_healthy'=0;;;; 'cluster_errors'=2;;;;

Expectation

The output contains the correct cluster name with the unhealthy components.

Cluster check issue with older clusters

Hello,

We are using this script to monitor Rancher clusters.
We upgraded it to the latest version (check_rancher2 v 1.9.0 (c) 2018-2022).
We have noticed that the cluster check also throws an error for previously included clusters:

jq: error (at <stdin>:1): Cannot iterate over null (null)
jq: error (at <stdin>:1): Cannot iterate over null (null)
CHECK_RANCHER2 OK - Cluster vh-rke is healthy

On newly registered RKE2 clusters, it runs fine.
We noticed one small thing: the cluster ID of a cluster with a good check is prefixed with 'c-m-', while a cluster that throws an error simply has the prefix 'c-' without the 'm'.
Without a cluster ID, we get two errors for every affected cluster, because each such cluster throws 2 errors.
We have 4 older clusters and 2 newer clusters, so there are 8 (4 x 2) error messages:

$ /usr/lib/nagios/plugins/check_rancher2.sh -H 'rancher.mgmt' -P 'xxxxxxxxxx' -S -U 'token-yyyyy' -t 'cluster'
jq: error (at <stdin>:1): Cannot iterate over null (null)
jq: error (at <stdin>:1): Cannot iterate over null (null)
jq: error (at <stdin>:1): Cannot iterate over null (null)
jq: error (at <stdin>:1): Cannot iterate over null (null)
jq: error (at <stdin>:1): Cannot iterate over null (null)
jq: error (at <stdin>:1): Cannot iterate over null (null)
jq: error (at <stdin>:1): Cannot iterate over null (null)
jq: error (at <stdin>:1): Cannot iterate over null (null)
CHECK_RANCHER2 OK - All clusters (6) are healthy|'clusters_total'=6;;;; 'clusters_errors'=0;;;;

Cluster IDs:

$ /usr/lib/nagios/plugins/check_rancher2.sh -H 'rancher.mgmt' -P 'xxxxxxxxxxxxxxxxxxxxxxx' -S -U 'token-yyyyyy' -t 'info'
CHECK_RANCHER2 OK - Found 6 clusters: c-m-ppqqlmn9 alias bkp-avsz-rke - c-m-vhlmbdms alias prod-avsz-rke - c-nvqh7 alias tst-avsz-rke - c-qvgjw alias vh-rke - c-s7vbs alias tst-vh-rke - local alias rke-rancher -

I ran the script in debug, so I found which part throws the error:

+ clusterstate=active
+ component=($(echo "$api_out_single_cluster" | jq -r '.componentStatuses[].name'))
++ jq -r '.componentStatuses[].name'

+ declare -a component
+ healthstatus=($(echo "$api_out_single_cluster" | jq -r '.componentStatuses[].conditions[].status'))
++ jq -r '.componentStatuses[].conditions[].status'

Could you fix it, please?

Thank you in advance.

Regards,
Adam
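
The trace above points at the two `.componentStatuses[]` filters. The sketch below shows one way to make them tolerant, assuming the affected clusters simply return no `componentStatuses` field (an assumption, the report does not show the API response). jq's optional iterator `[]?` yields nothing instead of raising "Cannot iterate over null". The function names are hypothetical.

```shell
#!/bin/sh
# Hypothetical wrappers around the two jq filters from the debug trace.
# "[]?" suppresses the iteration error when the field is missing or null.
component_names() {
  jq -r '.componentStatuses[]? | .name'
}

component_health() {
  jq -r '.componentStatuses[]? | .conditions[]? | .status'
}

# Possible usage with the variable from the trace:
#   component=($(echo "$api_out_single_cluster" | component_names))
#   healthstatus=($(echo "$api_out_single_cluster" | component_health))
```

Whether a cluster without any component statuses should then count as healthy is a separate decision that this sketch does not make.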

follow redirects

On my rancher instance for some reason there is a redirect when querying cluster status.

Adding -L to curl invocation seems to fix the problem.

Metric for cluster_healthy (single cluster check)

Hi,

check_rancher2/check_rancher2.sh

Line 225 in 64d5349

    
echo "CHECK_RANCHER2 OK - Cluster $clustername is healthy|'cluster_healthy'=0;;;; 'cluster_errors'=${#componenterrors[*]};;;;"

In the line quoted above, the metric cluster_healthy is always 0 when a single cluster is checked, even if the cluster is healthy. Wouldn't it be useful to hardcode it to 1 instead when there are no componenterrors?

Best,
mat1010
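
The suggestion can be sketched as a small helper that derives `cluster_healthy` from the error count instead of hardcoding it. `cluster_perfdata` is a hypothetical name; the perfdata layout is copied from the line quoted above.

```shell
#!/bin/sh
# Hypothetical helper: build the perfdata part of the output line.
# cluster_healthy is 1 when the error count is 0, otherwise 0.
cluster_perfdata() {
  errors=$1
  if [ "$errors" -eq 0 ]; then
    healthy=1
  else
    healthy=0
  fi
  printf "'cluster_healthy'=%s;;;; 'cluster_errors'=%s;;;;\n" "$healthy" "$errors"
}

# Possible usage:
#   echo "CHECK_RANCHER2 OK - Cluster $clustername is healthy|$(cluster_perfdata 0)"
```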

Feature Request: Check workload by namespace

Currently, within my Rancher2 project I have my namespaces divided into environments; i.e. applications-dev1, applications-dev2, etc. I use the same workload name regardless of namespace and when the same workload name exists in multiple namespaces, the workload check fails. I would like the option to specify a namespace when specifying a workload.

Example output of a workload that exists in one namespace:

CHECK_RANCHER2 OK - Workload singlenamepsaceapp is active|'workload_active'=1;;;; 'workload_error'=0;;;; 'workload_warning'=0;;;;

Example output of a workload that exists in multiple namespaces:

CHECK_RANCHER2 CRITICAL - Workload multinamepsaceapp is active|'workload_active'=0;;;; 'workload_error'=1;;;; 'workload_warning'=0;;;;

Perhaps the output could say something like:

CHECK_RANCHER2 OK - Workload multinamepsaceapp in namespace applications-dev1 is active|'workload_active'=1;;;; 'workload_error'=0;;;; 'workload_warning'=0;;;;

CHECK_RANCHER2 CRITICAL - Workload multinamepsaceapp in namespace applications-dev2 is critical|'workload_active'=0;;;; 'workload_error'=1;;;; 'workload_warning'=0;;;;

and if the same workload name is detected in multiple namespaces, perhaps a warning?

CHECK_RANCHER2 UNKNOWN - Identical workload names detected in multiple namespaces. To check a specific workload you must also define the namespace (-n). This will check all workloads within a specific namespace for the given project. or something like that.

Thank you!

Plugin should check for correct API Hostname

When the plugin runs correctly on the Rancher2 API DNS, it returns correct data:

# /usr/lib/nagios/plugins/check_rancher2.sh -H rancher2.example.com -S -U token-xxxxx -P secret -t info
CHECK_RANCHER2 OK - Found 3 clusters: "c-hpb7s" alias "gamma-stage" - "c-scsb6" alias "hugo-stage" - "local" alias "local" - and 6 projects: "c-hpb7s:p-46pd2" alias "Gamma" - "c-hpb7s:p-cngpb" alias "System" - "c-scsb6:p-cxk9s" alias "hugo-stage" - "c-scsb6:p-dqxdx" alias "System" - "local:p-9rwc7" alias "Default" - "local:p-ls96l" alias "System" -|'clusters'=3;;;; 'projects'=6;;;;

However, when an incorrect hostname is given, the plugin prints errors on the command line but the final output is still OK:

# /usr/lib/nagios/plugins/check_rancher2.sh -H localhost -S -U token-xxxxx -P secret -t info
json read error: line 2 column 0: '[' or '{' expected near end of file
json read error: line 2 column 0: '[' or '{' expected near end of file
json read error: line 2 column 0: '[' or '{' expected near end of file
json read error: line 2 column 0: '[' or '{' expected near end of file
CHECK_RANCHER2 OK - Found 0 clusters:  and 0 projects: |'clusters'=0;;;; 'projects'=0;;;;

Expected result: The plugin should return UNKNOWN and output some info about invalid host address/url.
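
A cheap guard for this is to verify that the response looks like JSON before handing it to the parser. This is a minimal sketch with a hypothetical helper name, not the check the plugin actually uses:

```shell
#!/bin/sh
# Hypothetical helper: succeed if the first non-whitespace character of $1
# is "{" or "[". All whitespace is removed before the first character is read.
looks_like_json() {
  first=$(printf '%s' "$1" | tr -d '[:space:]' | cut -c1)
  case "$first" in
    '{'|'[') return 0 ;;
    *) return 1 ;;
  esac
}

# Possible usage (api_out stands for whatever variable holds the curl output):
#   if ! looks_like_json "$api_out"; then
#     echo "CHECK_RANCHER2 UNKNOWN - No JSON response from API, check host address"
#     exit 3
#   fi
```

This covers both the empty response shown here and an HTML error page, since neither starts with a JSON opening character.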

Mentioned in #2

No alert for "succeeded" pods

When a workload is updated or redeployed, the currently running pods of that workload change from "running" to "succeeded" once the new pods have started. With the current version of check_rancher2, this results in a CRITICAL alert:

Additional Info: CHECK_RANCHER2 CRITICAL - Pod "internal-api-cron-test-1541435700-2jqh5" is succeeded - Pod "internal-api-cron-test-1541435760-h5hrh" is succeeded - Pod "internal-api-cron-test-1541435820-t58bs" is succeeded - Pod "internal-api-cron-test-1541435880-5ptfv" is succeeded -

This should result in a WARNING, not CRITICAL.
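
The requested mapping can be sketched as follows. The function name is hypothetical, and treating every state other than "running" and "succeeded" as CRITICAL is an assumption made for the example.

```shell
#!/bin/sh
# Hypothetical helper: map a pod state to a monitoring plugin exit code.
# 0 = OK, 1 = WARNING, 2 = CRITICAL
pod_state_to_exit() {
  case "$1" in
    running)   return 0 ;;
    succeeded) return 1 ;;  # finished pods only warn, as requested above
    *)         return 2 ;;  # assumption: anything else stays CRITICAL
  esac
}
```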

support for "Kubernetes Version: v1.20.8"

Hi, does check_rancher2 support Kubernetes Version: v1.20.8?

We have five clusters and all are visible except for the cluster on Kubernetes Version: v1.20.8. The rest are on Kubernetes Version: v1.19.x.

[root@ilop51 plugins]# ./check_rancher2.sh -H ilrancherha01 -S -U token-xxxxx -S -P xxxxxxxxx -t cluster
CHECK_RANCHER2 OK - All clusters (4) are healthy|'clusters_total'=4;;;; 'clusters_errors'=0;;;;

If I query a project directly, I get:

./check_rancher2.sh -H ilrancherha01 -S -U token-xxxxx -S -P xxxxxxxxx -t workload -p c-4pfth:p-49bns
CHECK_RANCHER2 WARNING - No workloads found in project c-4pfth:p-49bns.

//Andreas

pod check ignores namespaces in project

I wanted to check the pods in the ingress-nginx namespace in System, but the script checked the entire System project and gave me the totals for System.

I have made a small adjustment to the script:

# Append a namespace filter only when a namespace was given
EXTRAURL=""
if [ -n "$namespacename" ]; then
  EXTRAURL="?namespaceId=$namespacename"
fi
api_out_pods=$(curl -L -s ${selfsigned} -u "${apiuser}:${apipass}" "${proto}://${apihost}/v3/project/${projectname}/pods$EXTRAURL")

Apparently you can pass ?namespaceId=somenamespace to the API call to filter by namespace.

Show namespace of monitored workload in output

check_rancher2 correctly identifies when a workload name is used in multiple namespaces:

CHECK_RANCHER2 UNKNOWN - Identical workload names detected in multiple namespaces. To check a specific workload you must also define the namespace (-n).

With the definition of the namespace, using -n, the workload name is then unique within the namespace and can be monitored.

The problem, however, is that the namespace is missing from the output:

$ ./check_rancher2.sh -H rancher2.example.com -U token-xxxxx -P "secret" -S -t workload -p c-xxxxx:p-xxxxx -w workload -n namespace
CHECK_RANCHER2 OK - Workload workload is active|'workload_active'=1;;;; 'workload_error'=0;;;; 'workload_warning'=0;;;;

It would be helpful if the output showed Workload X in namespace Y is active.

Workload Check ignores Namespace name filter

First, thanks for your awesome rancher check plugin.

I found out that the --namespacename (-n) filter is completely ignored for workload checks.
I have a project with 3 namespaces and only want to check one of them, because the others contain only test workloads of this project.

/check_rancher2.sh <connectdata (-H -P -U -S)> -t workload -p '<clusterid>:<projectid>' -n namespace1
/check_rancher2.sh <connectdata (-H -P -U -S)> -t workload -p '<clusterid>:<projectid>' -n namespace2

-> both checks find all workloads in that project

/check_rancher2.sh <connectdata (-H -P -U -S)> -t workload -p '<clusterid>:<projectid>' -n dontexistingnamespace

-> this check also succeeds, even though "dontexistingnamespace" doesn't exist

I use the latest version of the plugin (1.12.1) and connect to Rancher v2.7.9.

Show resource problems in output line

Version 1.8.0 added new resource-related threshold checks (CPU, memory, pod usage). When a threshold is reached, the information about which resource reached it is not shown in the first line (the plugin output) but in a line afterwards. This works, but it would be nice to see it immediately in the output line.

$ ./check_rancher2.sh -H rancher2.example.com -S -U token-xxxxx -P "verylongsecret" -t cluster -c c-xxxxx --cpu-warn 50 --cpu-crit 75
CHECK_RANCHER2 CRITICAL - Cluster my-test has resource problems|'cluster_healthy'=0;;;; 'component_errors'=0;;;; 'cpu'=2380;;;;4000 'memory'=1572864000B;;;0;12469846016 'pods'=55;;;;220 'usage_cpu'=59%;50;75;0;100 'usage_memory'=12%;;;0;100 'usage_pods'=25%;;;0;100
CPU usage 59 higher than warn threshold of 50
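
One way to get the detail into the first line is to build the threshold message as a string first and place it before the `|` separator. The sketch below uses a hypothetical helper, `describe_usage`, and reuses the wording of the detail line shown above. It deliberately does not decide the resulting plugin state.

```shell
#!/bin/sh
# Hypothetical helper: print a detail message when a usage value exceeds
# a threshold. Empty thresholds are skipped. Prints nothing when all is fine.
describe_usage() {
  name=$1
  value=$2
  warn=$3
  crit=$4
  if [ -n "$crit" ] && [ "$value" -gt "$crit" ]; then
    echo "$name usage $value higher than crit threshold of $crit"
  elif [ -n "$warn" ] && [ "$value" -gt "$warn" ]; then
    echo "$name usage $value higher than warn threshold of $warn"
  fi
}

# Possible usage: append the detail to the status text before the perfdata.
#   detail=$(describe_usage CPU 59 50 75)
#   echo "... Cluster my-test has resource problems: ${detail}|..."
```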