outsideit / check_netapp_ontap

:four_leaf_clover: Check NetApp Ontap :four_leaf_clover:

Home Page: https://outsideit.net/monitoring-netapp-ontap/

License: GNU General Public License v3.0

Perl 100.00%

monitoring nagios-plugins netapp-ontap-cluster perl-script health-check health-checks nagios-plugin monitoring-plugins

check_netapp_ontap's Issues

DIMM status check

Hello,
It would be great to add DIMM status check via the filer_hardware_health option:
Output example:

>system controller memory dimm show
			               DIMM    UECC      CECC      CPU                 Slot
			Node           Name    Count     Count     Socket     Channel  Number  Status
			-------------- ------- --------  --------  ---------  -------  ------  ------
			node1          DIMM-1         0         0          0       0        0  ok
			node1          DIMM-NV1       0         0          0       1        1  ok
			node2          DIMM-1         1         0          0       0        0  ok
			node3          DIMM-NV1       0         0          0       1        1  ok
			4 entries were displayed.

BR,
Yannick

ONTAP 8.3: advanced partitioning (ADP), false CRITICAL result

Hi, with ONTAP 8.3 and the new ADP feature
http://www.datacenterdan.com/blog/netapp-dataontap-83-adp-root-disk-slice-deep-dive

the disks and nodes checks give us:

$USER1$/check_nac.py -H $HOSTADDRESS$ -u $ARG1$ -p $ARG2$ -t disks

NAC CRITICAL - disk None is shared disk None is shared disk None is shared disk None is shared disk None is shared disk None is shared disk None is shared disk None is shared disk None is shared disk None is shared disk None is shared disk None is shared

$USER1$/check_nac.py -H $HOSTADDRESS$ -u $ARG1$ -p $ARG2$ -t nodes

NAC CRITICAL - node node1 has critical alarm for aggregate_used aggr0_cluster_01_0 node node2 has critical alarm for aggregate_used aggr0_cluster_02_0

volume_health and offline volume

Hi,

When a volume is offline, volume_health uses the uninitialized value in $hrefVolInfo->{$strVol}->{'space-total'} (line 1202, version 2.5.10) because an offline volume doesn't have space information.

You need to test whether $hrefVolInfo->{$strVol}->{'space-total'} is defined before doing the division.
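
The guard described above could look roughly like the following. This is a minimal sketch, not the plugin's actual code: `used_percent` is a hypothetical helper, and the sample hash only imitates the shape of `$hrefVolInfo` with made-up values.

```perl
use strict;
use warnings;
use POSIX qw(floor);

# Hypothetical helper: returns the rounded used-space percentage, or undef
# when the volume reports no usable space totals (for example when offline).
sub used_percent {
    my ($hrefVol) = @_;
    my $intTotal = $hrefVol->{'space-total'};
    my $intUsed  = $hrefVol->{'space-used'};
    return unless defined($intTotal) && defined($intUsed) && $intTotal > 0;
    return floor(($intUsed / $intTotal) * 100 + 0.5);
}

# Sample data shaped like $hrefVolInfo in the report; the values are made up.
my $hrefVolInfo = {
    'svm/online_vol'  => { 'space-used' => 50, 'space-total' => 200 },
    'svm/offline_vol' => { 'state' => 'offline' },
};

foreach my $strVol (sort keys %$hrefVolInfo) {
    my $intPercent = used_percent($hrefVolInfo->{$strVol});
    if (defined $intPercent) {
        print "$strVol - $intPercent% SPACE USED\n";
    } else {
        print "$strVol - no space information, skipped\n";
    }
}
```

Returning undef instead of dividing lets the caller decide whether an offline volume is skipped or reported separately.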

Test:
I have added some print statements:

	#	if (defined($hrefVolInfo->{$strVol}->{'space-total'})) {
		    print ($hrefThresholds->{'space-percent'} . "\n");
		    print ($strVol . "\n");
		    print ($hrefVolInfo->{$strVol}->{'space-total'} . "\n");
			my $intUsedPercent = ($hrefVolInfo->{$strVol}->{'space-used'} / $hrefVolInfo->{$strVol}->{'space-total'}) * 100;
        	$intUsedPercent = floor($intUsedPercent + 0.5);
        	my $strReadableUsed = space_to_human_readable($hrefVolInfo->{$strVol}->{'space-used'});
        	my $strReadableTotal = space_to_human_readable($hrefVolInfo->{$strVol}->{'space-total'});
        	my $strNewMessage = $strVol . " - " . $strReadableUsed . "/" . $strReadableTotal . " (" . $intUsedPercent . "%) SPACE USED";

=====

perl check_netapp_ontap-debug.pl -o volume_health -H keris.enssat.fr -u xxxx -p xxxx -w offline -n sftestkerb
sftestkerb/vol_04052017_174853 is offline

perl check_netapp_ontap-debug.pl -o volume_health -H keris.enssat.fr -u xxxx -p xxxx -w 50% -c 51% -n sftestkerb
51
sftestkerb/sfTestKerbData
2638827909120
51
sftestkerb/vol_04052017_174853
Use of uninitialized value in concatenation (.) or string at check_netapp_ontap-debug.pl line 1204.

Use of uninitialized value in division (/) at check_netapp_ontap-debug.pl line 1205.
Use of uninitialized value in division (/) at check_netapp_ontap-debug.pl line 1205.
Illegal division by zero at check_netapp_ontap-debug.pl line 1205.

netapp_alarms and root aggregate usage

I've configured the netapp_alarms check and the root aggregates are always making this check critical. They are always 95% on the filer and do not grow.

Can the root volume space be excluded from the check?

Snapshot_health usage percentage based on Data Space size

When using the snapshot_health option, the percentage of snapshot size usage is calculated from the Snapshot Space Used in relation to the Data Space Total.

Should it not be based on the Total Snapshot Reserve? With some volumes the Snapshot Reserve can actually be larger than the Data Space, which leads to a reported snapshot usage of 140%.

So snapshot_health should be calculated within the Total Snapshot Reserve, and volume_health should be calculated within the Data Space Total.
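
To make the difference concrete, here is a small sketch with made-up numbers. `percent_of` and the hash keys are hypothetical names chosen for illustration; they are not taken from the plugin.

```perl
use strict;
use warnings;
use POSIX qw(floor);

# Hypothetical helper: percentage of $intUsed relative to $intBase, rounded to
# the nearest integer; returns undef when the base is missing or zero.
sub percent_of {
    my ($intUsed, $intBase) = @_;
    return unless defined($intUsed) && defined($intBase) && $intBase > 0;
    return floor(($intUsed / $intBase) * 100 + 0.5);
}

# Made-up sizes in GB: the snapshot reserve is larger than the data space.
my %hVolume = (
    'data-total'       => 100,
    'snapshot-reserve' => 200,
    'snapshot-used'    => 140,
);

my $intAgainstData    = percent_of($hVolume{'snapshot-used'}, $hVolume{'data-total'});
my $intAgainstReserve = percent_of($hVolume{'snapshot-used'}, $hVolume{'snapshot-reserve'});
print "against data space:       $intAgainstData%\n";
print "against snapshot reserve: $intAgainstReserve%\n";
```

With these numbers the same snapshot usage reads as 140% of the data space but 70% of the reserve.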

400 Bad Request with ONTAP 9.3

Issue Type

Bug report

Issue Detail

check_netapp_ontap version: v3.01.171611
NetApp Ontap version: 9.3
Monitoring solution: Opsview/Nagios

Expected Behavior
return aggregate usage

Actual Behavior
./check_netapp_ontap.pl -H -u opsview -p -o aggregate_health -m include,n1_aggr0 -w 95% -c 98%
Smartmatch is experimental at ./check_netapp_ontap.pl line 425.
Smartmatch is experimental at ./check_netapp_ontap.pl line 426.
Smartmatch is experimental at ./check_netapp_ontap.pl line 427.
Failed test query: NaServer::parse_xml - Error in parsing xml:
syntax error at line 1, column 49, byte 49:

================================================^

<title>400 Bad Request</title> at /usr/local/nagios/perl/lib/x86_64-linux-gnu-thread-multi/XML/Parser.pm line 187.

How to reproduce Behavior
Run it any time, nothing seems to work. Have configured opsview user with ontapi application readonly role and a password.

Assistance Needed

After following the instructions for installation, I receive the following when executing /check_netapp_ontap.pl -H ip_address -u username -p password -o disk_health:

Failed test query: No elements in API request

I'm extremely unfamiliar with OnTap and any guidance is appreciated.

Feature request: Add performance data to volume_health

Hi willemdh,

Would it be possible to add the performance data to the volume_health check?
The current "OK" Message doesn't allow me to write graphs: "OK - No problems found (1 checked)"

Thank you & best regards
Pascal

Error... Failed test query: Couldn't find end of Start Tag netapp

If you could help with an error I am receiving... I do...
./check_netapp_ontap.pl -H Mycluster --user netappro --password xxxx -option disk_health
I get... Failed test query: Couldn't find end of Start Tag netapp
Notes: The cluster can be pinged, The netappro is a read only user, have tried multiple -option choices, Also happens with 'admin' user.

One additional note: when trying the dev version I get...
./check_netapp_ontap_dev.pl: line 5: syntax error near unexpected token `newline'
./check_netapp_ontap_dev.pl: line 5: `'

Any ideas?

Tom

Script is not working (400 Bad Request) after upgrading to Ontap 9.3P9 (from 9.1x)

Issue Type

Bug report

Issue Detail
After updating to ONTAP 9.3P9 I get the following error when running the script:
--> Failed test query: Server returned HTTP Error: 400 Bad Request

I already read Issue 76 and copied the content of "netapp-manageability-sdk-9.3/lib/perl/NetApp/*" to "/usr/lib64/perl5", but I still get the same error.

The cluster hostname contained underscores, so I also changed it to one with hyphens (from netapp_cluster_01 to netapp-cluster-01).
DNS resolution is working.

check_netapp_ontap version: v3.01.171611
NetApp Ontap version: 9.3P9
Monitoring solution: Nagios 4.3.4 (on CentOS 6.9)

How to reproduce Behavior
For example run /usr/local/nagios/libexec/check_netapp_ontap.pl -H -u -p -o interface_health

Netapp Alarms shows aggregate_used %%. Need to add the aggregate space in GB.

Currently Netapp Alarms returns aggregate usage in percent. Is it possible to add the usage in gigabytes too?
Right now the output is "aggregate_used 92%". I want to see it as "aggregate_used 92% (92GB/100GB)".
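
The requested output format could be produced along these lines. This is only a sketch: `format_aggregate_usage` is a hypothetical function, it assumes the used and total sizes are available in bytes with a non-zero total, and it treats 1 GB as 1024**3 bytes.

```perl
use strict;
use warnings;

# Hypothetical formatter: takes used and total sizes in bytes and returns a
# string such as "aggregate_used 92% (92GB/100GB)".
sub format_aggregate_usage {
    my ($intUsedBytes, $intTotalBytes) = @_;
    my $intGB      = 1024 ** 3;
    my $numPercent = ($intUsedBytes / $intTotalBytes) * 100;
    return sprintf("aggregate_used %.0f%% (%.0fGB/%.0fGB)",
        $numPercent, $intUsedBytes / $intGB, $intTotalBytes / $intGB);
}

print format_aggregate_usage(92 * 1024 ** 3, 100 * 1024 ** 3), "\n";
```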

Thanks

Get error "Can't call method child_get_string" when using option "disk_spare"

using the option "disk_spare" gives the following error:

Can't call method "child_get_string" on an undefined value at /usr/local/nagios/libexec/check_netapp_ontap9.pl line 240.

quota health check doesn't check soft limits if hard limits are defined

We have hard and soft limits, both on space and files, and expected this plugin to raise warnings when soft limits were breached and critical alerts when hard limits were breached. We received the critical alerts but not the warnings by which time issues were already being experienced.

Looking at the code in the calc_quota_health subroutine it will not check the soft limits if a hard limit is defined because of the elsif statements.
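
One way to evaluate both limits is to test the hard limit first and then fall through to the soft limit independently of whether a hard limit exists. The sketch below is an assumption about how that could look; `quota_state` is a hypothetical helper, not the plugin's calc_quota_health, and it uses the usual Nagios codes (0 OK, 1 WARNING, 2 CRITICAL).

```perl
use strict;
use warnings;

# Hypothetical helper: returns 2 (CRITICAL) when the hard limit is reached,
# 1 (WARNING) when only the soft limit is reached, and 0 (OK) otherwise.
# An undefined limit is treated as "not configured".
sub quota_state {
    my ($intUsed, $intSoftLimit, $intHardLimit) = @_;
    return 2 if defined($intHardLimit) && $intUsed >= $intHardLimit;
    return 1 if defined($intSoftLimit) && $intUsed >= $intSoftLimit;
    return 0;
}

# Soft limit 80 is breached while hard limit 100 is defined but not reached.
print quota_state(85, 80, 100), "\n";
```

Because the soft-limit test is not nested under the absence of a hard limit, a quota with both limits can still raise a warning before it becomes critical.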

Need to add iterator code fix for Quota space checking also

The iterator code fix for volume check needs to be added to quota checking also. I copied that code, and also added code for configurable Quota threshold alerting (i.e. alert when quota usage exceeds X% of hard limit, even if no threshold or soft limit is defined). I only use hard limits for my quotas, but still want Nagios to alert me when usage is high, same as is done for volumes. Here are my changes below. I am not too familiar with GitHub, so I'll just paste it here, if you want to incorporate it into the code.
Thanks,
Moshe

793a796
>         my $nahTag = NaElement->new("tag");
804a808,809
>         $nahQuotaIterator->child_add_string("max-records", 100);
>         $nahQuotaIterator->child_add($nahTag);
807c812
<                         $nahQuotaIterator->child_add_string("tag", $strActiveTag);

---
>                         $nahTag->set_content($strActiveTag);
810d814
<                 $nahQuotaIterator->child_add_string("max-records", 200);
846c850,852
<       my $hrefQuotaInfo = shift;

---
>       #my $hrefQuotaInfo = shift;
>       my ($hrefQuotaInfo, $strWarning, $strCritical) = @_;
>       my ($hrefWarnThresholds, $hrefCritThresholds) = space_threshold_converter($strWarning, $strCritical);
866a873,889
>                       # Added for Quota threshold alerting (ML)
>                       if (defined($hrefWarnThresholds->{'space-percent'}) || defined($hrefCritThresholds->{'space-percent'})) {
>                               my $intUsedPercent = ($hrefQuotaInfo->{$strQuota}->{'space-used'} / $hrefQuotaInfo->{$strQuota}->{'space-hard-limit'}) * 100;
>                               $intUsedPercent = floor($intUsedPercent + 0.5);
>                               my $intThreshToBytes = $hrefQuotaInfo->{$strQuota}->{'space-hard-limit'}*1024;
>                               my $strReadableThresh = space_to_human_readable($intThreshToBytes);
>                                 my $strNewMessage = $strQuota . " - " . $strReadableUsed . "/" . $strReadableThresh . " \(" . sprintf("%d",$intUsedPercent) . "\%\)" . " USED";
>                                 if ($intUsedPercent >= $hrefWarnThresholds->{'space-percent'}) {
>                                       $strOutput = get_nagios_description($strOutput, $strNewMessage);
>                                         $intState = get_nagios_state($intState, 2);
>                                 }
>                                 if ($intUsedPercent >= $hrefCritThresholds->{'space-percent'}) {
>                                       $strOutput = get_nagios_description($strOutput, $strNewMessage);
>                                         $intState = get_nagios_state($intState, 1);
>                                 }
>                         }
>                       # Added for Quota threshold alerting (ML)

Failed test query: NaServer::parse_xml - Error in parsing xml

Issue Type
Bug report

Issue Detail
Since we migrated to NetApp version 9.3P2, the plugin returns this error:
Failed test query: NaServer::parse_xml - Error in parsing xml:
syntax error at line 1, column 49, byte 49:

================================================^

<title>400 Bad Request</title> at /usr/lib64/perl5/XML/Parser.pm line 187

check_netapp_ontap version: v3.01.171611
NetApp Ontap version: NetApp Release 9.3P2
Monitoring solution: NAGIOS

Undefined subroutine

Facing the below error when executing the script.

[root@h1dciminagios1 NetApp]# /usr/local/nagios/libexec/check_netapp_ontap_v04.pl -H 10.116.104.46 -u nagios -p Welcome@123 -o interface_health -w 85% -c 90%
Undefined subroutine &XML::Parser called at /usr/local/lib64/perl5/NaServer.pm line 1068.

snapshot_health check has uninitialized values

root@mon:~# perl /usr/local/lib/nagios/plugins/check_netapp_ontap -H netapphost -u adminuser -p adminpassword -o snapshot_health
Use of uninitialized value in string eq at /usr/local/lib/nagios/plugins/check_netapp_ontap line 1020.
Use of uninitialized value in string eq at /usr/local/lib/nagios/plugins/check_netapp_ontap line 1020.
Use of uninitialized value in string eq at /usr/local/lib/nagios/plugins/check_netapp_ontap line 1020.
Use of uninitialized value in string eq at /usr/local/lib/nagios/plugins/check_netapp_ontap line 1020.
OK - No problems found (58 checked)

Failed filer health query: The request was for an API version which is not supported by the appliance

Hello,

Just updated ontap & the filer health query:
netapp_alarms
is returning an error about the API version

Will this require an update to the script?

disk_health ontap 9.1 FAS 2650

Hi, we're using your plugin to monitor a fresh FAS 2650. w/ Ontap 9.1
Works fine but some disks monitored with disk_health always show "yellow" due to scrubbing. This seems to be normal according to NetApp:
https://library.netapp.com/ecmdocs/ECMP1196912/html/GUID-0C6E982A-AC7C-44BA-93BF-4DEA7EE15576.html
Any help?

disk_health netapp FAS8080 cdot 8.3.2 - state is unknown

Hi,
we got a new netapp fas8080 with cdot os version 8.3.2.
When I'm trying to get the disk_health status, the plugin fails with the output:

4.1.8 state is unknown, 2.11.6 state is unknown, 3.10.12 state is unknown, 1.0.0 state is unknown (...)

Is there a chance to get this fixed?
How can I assist you doing that?

Thanks a lot!

Use of uninitialized value $intState in exit at ./check_netapp_ontap.pl line 2137

On the last line of the code, while exiting, I get an error that $intState is uninitialized:

# Print the output and exit with the resulting state.

$strOutput .= "\n";
print $strOutput;
exit $intState;

I am using these options with the plugin:

./check_netapp_ontap.pl -H X.X.X.X -node xyz.zyz.com -user xxx -password xxx –option volume_health
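
A defensive pattern for the exit line shown above is to fall back to UNKNOWN (3 in the Nagios plugin convention) whenever no state was set. The following is a sketch under that assumption; `safe_exit_code` is a hypothetical helper, and the example prints the code instead of calling exit so it can be run on its own.

```perl
use strict;
use warnings;

# Hypothetical helper: maps an undefined or out-of-range state to 3 (UNKNOWN),
# so the final exit never receives an uninitialized value.
sub safe_exit_code {
    my ($intState) = @_;
    return 3 unless defined($intState) && $intState =~ /^[0-3]$/;
    return $intState;
}

my $intState;                                  # never assigned, as in the report
my $strOutput = "No result was produced";
$strOutput .= "\n";
print $strOutput;
print "would exit with ", safe_exit_code($intState), "\n";
```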

snapmirror_health has uninitialized value

This happens with the plugin from both the master and the dev branch (as of today).

./check_netapp_ontap.pl -H mynetapp -u monitoring -p secret -o snapmirror_health -w 1h
Use of uninitialized value in string eq at ./check_netapp_ontap.pl line 720.
Use of uninitialized value in string eq at ./check_netapp_ontap.pl line 720.
OK - No problems found (23 checked)

Line 720 in this case is:

                    if ($nahSM->child_get_string("relationship-control-plane") eq "v2") {
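
The warning appears because the string comparison receives undef when the field is missing. A possible guard, sketched here with a plain hash in place of the SDK element (`get_field` is a hypothetical stand-in for child_get_string), uses the defined-or operator available since Perl 5.10:

```perl
use strict;
use warnings;

# Hypothetical stand-in for $nahSM->child_get_string(...): yields undef when
# the requested field is absent.
sub get_field {
    my ($hrefRecord, $strField) = @_;
    return $hrefRecord->{$strField};
}

# Returns 1 when the relationship reports control plane "v2", otherwise 0.
# The // operator substitutes an empty string for undef before comparing.
sub is_v2_control_plane {
    my ($hrefRecord) = @_;
    my $strPlane = get_field($hrefRecord, 'relationship-control-plane') // '';
    return $strPlane eq 'v2' ? 1 : 0;
}

print is_v2_control_plane({ 'relationship-control-plane' => 'v2' }), "\n";
print is_v2_control_plane({}), "\n";
```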

Getting error when invoking the check_netapp_ontap.pl

Hi,

I am getting this error while invoking the perl command
[root@blr-monitor libexec]# /usr/bin/perl ./check_netapp_ontap.pl -H 10.0.0.20 -u root -p password disk_health

Failed test query: in Zapi::invoke, cannot connect to socket

Please help me out.

Thanks,
Krishna M S

get_port_health gets stuck when number of records > 100

Hello,

The check gets stuck in an endless loop when num-records exceeds max-records (in our case 225 vs 100).
Perhaps the max-records should be made configurable? Alternatively, inform the user about this issue, along with an appropriate Nagios error code, instead of hanging?
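
The general shape of a paged query that cannot hang is sketched below. No NetApp SDK calls are made: `fetch_page` is a hypothetical function that simulates a "-get-iter" style API serving 225 records in pages, and the "tag" is simply the next offset. The point is the loop in `fetch_all`, which stops when no tag is returned and aborts if the tag does not advance.

```perl
use strict;
use warnings;

# Simulated data source: 225 fake records, more than one page of 100.
my @aryAllRecords = map { "port-$_" } 1 .. 225;

# Hypothetical page fetcher: returns a reference to up to $intMax records
# starting at the offset in $strTag, plus the next tag (undef when done).
sub fetch_page {
    my ($intMax, $strTag) = @_;
    my $intStart = defined($strTag) ? $strTag : 0;
    my $intEnd   = $intStart + $intMax - 1;
    $intEnd = $#aryAllRecords if $intEnd > $#aryAllRecords;
    my @aryPage = @aryAllRecords[$intStart .. $intEnd];
    my $strNext = $intEnd < $#aryAllRecords ? $intEnd + 1 : undef;
    return (\@aryPage, $strNext);
}

# Collects every record by following the tag from page to page.
sub fetch_all {
    my ($intMax) = @_;
    my (@aryResult, $strTag);
    while (1) {
        my ($arefPage, $strNext) = fetch_page($intMax, $strTag);
        push @aryResult, @$arefPage;
        last unless defined $strNext;
        die "iterator tag did not advance\n"
            if defined($strTag) && $strNext eq $strTag;
        $strTag = $strNext;
    }
    return @aryResult;
}

my @aryPorts = fetch_all(100);
print scalar(@aryPorts), " records fetched\n";
```

A plugin would turn the die into an UNKNOWN result rather than aborting, but the termination conditions stay the same.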

Do you need any more info about this particular issue we're having?

Thanks

Regards,
Robert

Failed test query: NaServer::parse_xml

Issue Detail

check_netapp_ontap version: 3.01.171611
NetApp Ontap version: 9.3
Monitoring solution: EyesOfNetwork 4.1 / Centos6.5

Hello,

I am trying to use the script, but when I run the following command:
check_netapp_ontap.pl -H HOST_IP -u admin -p admin_pass -o cluster_health
(for example; I get the same issue every time)

I have this error:

Failed test query: NaServer::parse_xml - Error in parsing xml:
syntax error at line 1, column 49, byte 49:
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
================================================^
<html><head>
<title>400 Bad Request</title>
at /usr/lib64/perl5/XML/Parser.pm line 187

I don't really understand what the problem is here.

Maybe my Perl version is too old (perl-5.10.1-136.el6.x86_64) or something like that.
I searched for the requirements but didn't find anything (https://outsideit.net/check-netapp-ontap/ is offline).

thanks for your help

check_netapp_ontap error uninitialized value in string

I'm not sure if it is a bug
Issue Detail

check_netapp_ontap version:
check_netapp_ontapi version: v3.01.171611
By John Murphy, Willem D'Haese, GNU GPL License
NetApp Ontap version: 9.3
I ran this command in order to get information about one volume:
./check_netapp_ontap.pl -H 10.209.49.137 -u nagios_user -p Nagios_User_pwd01 -o volume_health -m include,SGSFILVR06/filenetp8b
Use of uninitialized value in string ne at ./check_netapp_ontap.pl line 1350.
Use of uninitialized value in string ne at ./check_netapp_ontap.pl line 1350.
Use of uninitialized value in string ne at ./check_netapp_ontap.pl line 1350.
Use of uninitialized value in string ne at ./check_netapp_ontap.pl line 1350.
Use of uninitialized value in string ne at ./check_netapp_ontap.pl line 1350.
Use of uninitialized value in string ne at ./check_netapp_ontap.pl line 1350.
Use of uninitialized value in string ne at ./check_netapp_ontap.pl line 1350.
Use of uninitialized value in string ne at ./check_netapp_ontap.pl line 1350.
Use of uninitialized value in string ne at ./check_netapp_ontap.pl line 1350.
Use of uninitialized value in string ne at ./check_netapp_ontap.pl line 1350.
Use of uninitialized value in string ne at ./check_netapp_ontap.pl line 1350.
Use of uninitialized value in string ne at ./check_netapp_ontap.pl line 1350.
Use of uninitialized value in string ne at ./check_netapp_ontap.pl line 1350.
Use of uninitialized value in string ne at ./check_netapp_ontap.pl line 1350.
Use of uninitialized value in string ne at ./check_netapp_ontap.pl line 1350.
Use of uninitialized value in string ne at ./check_netapp_ontap.pl line 1350.
Use of uninitialized value in string ne at ./check_netapp_ontap.pl line 1350.
Use of uninitialized value in string ne at ./check_netapp_ontap.pl line 1350.
Use of uninitialized value in string ne at ./check_netapp_ontap.pl line 1350.
SGSFILVR06/filenetp8b - 4.69TB/5.28TB (89%) SPACE USED | 'SGSFILVR06/filenetp8b_used'=5159984656384B 'SGSFILVR06/filenetp8b_free'=645436739584B

Why do I get so many of these output lines?
Use of uninitialized value in string ne at ./check_netapp_ontap.pl line 1350

Thanks for your help
Emilio

volume_health, Division by zero when offline volumes exist

Running the latest version (2.5.10) of this script; this also occurred in previous versions at different line numbers.

ONTAP version 9.2, with offline volumes present:

# ./check_netapp_ontap.pl -H hostname -u user -p password --option volume_health -w 98%,96%i -c 99%,98%i
Use of uninitialized value in division (/) at ./check_netapp_ontap.pl line 1207.
Use of uninitialized value in division (/) at ./check_netapp_ontap.pl line 1207.
Illegal division by zero at ./check_netapp_ontap.pl line 1207.

It looks like this script needs more logic to ignore offline volumes, because this command works (adding offline as a critical state):

# ./check_netapp_ontap.pl -H hostname -u user -p password --option volume_health -w 98%,96%i -c 99%,98%i,offline
svm1/home1 - 100950109/105382963 (96%) INODES USED, svm2/home1_snap - 100869566/105382963 (96%) INODES USED

Script also works as expected if you bring all volumes back online.

Add warning and critical threshold values for performance data

check_netapp_ontap/check_netapp_ontap.pl

Line 1712 in 6c85ef8

    
           $perfOutput->{"space-$strVol"} = "'" . $strVol . "_usage'=" . $hrefVolInfo->{$strVol}->{'space-used'} . "B;;;0;" . $hrefVolInfo->{$strVol}->{'space-total'};
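
In the Nagios plugin perfdata format, `'label'=value[UOM];[warn];[crit];[min];[max]`, the two empty fields in the line above are the warning and critical slots. A sketch of filling them is shown below; `perfdata_with_thresholds` is a hypothetical helper, and it assumes the thresholds are given as percentages that must be converted to bytes to match the unit of the value.

```perl
use strict;
use warnings;

# Hypothetical helper: builds one perfdata item with warn and crit filled in.
# Thresholds that are undef leave their slot empty, as in the original line.
sub perfdata_with_thresholds {
    my ($strLabel, $intUsed, $intTotal, $intWarnPercent, $intCritPercent) = @_;
    my $strWarn = defined($intWarnPercent) ? int($intTotal * $intWarnPercent / 100) : '';
    my $strCrit = defined($intCritPercent) ? int($intTotal * $intCritPercent / 100) : '';
    return "'" . $strLabel . "_usage'=" . $intUsed . "B;" . $strWarn . ";" . $strCrit . ";0;" . $intTotal;
}

print perfdata_with_thresholds('svm1/vol1', 600, 1000, 80, 90), "\n";
print perfdata_with_thresholds('svm1/vol1', 600, 1000), "\n";
```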

Snapmirror lag time

How can I monitor the snapmirror lag time?

/usr/local/nagios/libexec/check_netapp_ontap.pl -H NETAPP_C-MODE_IP -n SVM_NODE -u USERNAME -p PASSWD -c public -o snapmirror_health -w 1h -c 2h
OK - No problems found (0 checked)

Why does the command above return 0 checked?

Can't locate UserAgent.pm. Compilation Failed

I installed Nagios 4.2.1 on CentOS Linux release 7.2.1511 (Core). Everything worked perfectly. I want to monitor my NetApp Release 8.1.3.7-Mode.

I used the how-to tutorial from http://outsideit.net/check-netapp-ontap/

ran chown nagios:nagios /usr/local/nagios/libexec/check_netapp_ontap.pl
ran chmod 644 /usr/local/nagios/libexec/check_netapp_ontap.pl

copied Netapp/* to /usr/lib64/perl5

But when I try to run the test command

./check_netapp_ontap.pl, the error below appears. Is it because of the NetApp SDK? How can I solve this error?

Can't locate LWP/UserAgent.pm in @INC (@INC contains: /usr/local/lib64/perl5 /u
BEGIN failed--compilation aborted at /usr/lib64/perl5/NaServer.pm line 27.
Compilation failed in require at ./check_netapp_ontap.pl line 26.
BEGIN failed--compilation aborted at ./check_netapp_ontap.pl line 26.
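
The message means that Perl could not find the LWP::UserAgent module, which is part of the libwww-perl distribution and is loaded by NaServer.pm. Before running the plugin, the presence of such dependencies can be checked with a short script like the one below. This is a sketch: `module_available` is a hypothetical helper, and the list of modules is an assumption based on the errors quoted in these reports.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 when the named module can be loaded from @INC.
sub module_available {
    my ($strModule) = @_;
    (my $strFile = "$strModule.pm") =~ s{::}{/}g;
    return eval { require $strFile; 1 } ? 1 : 0;
}

foreach my $strModule (qw(LWP::UserAgent XML::Parser)) {
    printf "%-15s %s\n", $strModule,
        module_available($strModule) ? 'found' : 'MISSING';
}
```

Any module reported as MISSING has to be installed through the distribution's package manager or CPAN before the plugin can compile.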

Use of uninitialized value $strOption in lc

Issue Type
Bug report

Issue Detail

$ sudo -u icinga /usr/lib64/nagios/plugins/contrib/check_netapp_ontap.pl -H 10.193.1.1 --user "USER" --password "PASSWORD" aggregate_health
Use of uninitialized value $strOption in lc at /usr/lib64/nagios/plugins/contrib/check_netapp_ontap.pl line 1946.
Failed test query: NaServer::parse_xml - Error in parsing xml:
syntax error at line 1, column 49, byte 49:
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
================================================^
<html><head>
<title>400 Bad Request</title>
 at /usr/lib64/perl5/vendor_perl/XML/Parser.pm line 187.

check_netapp_ontap version: v2.07.170621
NetApp Ontap version:
Monitoring solution: Icinga 2

Expected Behavior

Actual Behavior

How to reproduce Behavior

volume_health executed in cli shows "usage" performance data, but "inodes" in Icinga 2

Issue Type

Question / Maybe bug

Issue Detail

check_netapp_ontap version: v3.01.171611
NetApp Ontap version: 9.2
Monitoring solution: Icinga 2

Expected Behavior

When executing the check on the cli, the volume usage is shown in the performance data:

# /usr/lib/nagios/plugins/check_netapp_ontap.pl -H netapp -u "user" -p "pass" -o volume_health -m include,v_volume_name1 -v
OK - No problem found (1 checked) | 'cifsserver/v_volume_name1_usage'=160401108992B;;;0;255013683200

So it is expected that Icinga 2 would show cifsserver/v_volume_name1_usage in the performance data.

Actual Behavior

In the Icinga 2 interface, the performance data is cifsserver/v_volume_name1_inodes and shows different values, even though the same volume seems to be used as a base.

How to reproduce Behavior

Simply define a volume health check in Icinga 2.

So the question is: Where does this come from?

"disk_health" considered "admin failed" disk as normal

Issue Type

Bug report

Issue Detail

check_netapp_ontap version: v3.04.201124
NetApp Ontap version: 9.6.0 ( simulator, 2 nodes )
Monitoring solution: LibreNMS

Expected Behavior
When a disk fails, "disk_health" changes to warning because of the reconstruction.
After the reconstruction completes, "disk_health" should stay in the warning state because there is still a failed disk in the list.

Actual Behavior
After the reconstruction completes, "disk_health" returns to normal even though there is a failed disk in the list.

Edit: I've tried adding "-w 1" to set the warning level at 1 failed disk.
Edit2: Sorry, I didn't notice that "disk_health" can't set a warning / critical level, only "disk_spare" can.

How to reproduce Behavior
Fail a disk manually, wait for the reconstruction to complete, and check the exit code from "disk_health".

filter unassigned

Thank you for the plugin.

Would it be possible to filter out unassigned disks like:
-m exclude,state is unassigned

This does not seem to work. Is it currently not possible?

Snapmirror lag time for each

Hi,

Is it possible to check each snapmirror with a "name option" instead of the "all in one" command in the plugin?

Thanks,

Space Health perf data not working when critical

Issue Type

Bug report

Issue Detail

check_netapp_ontap version: 3.01.171611
NetApp Ontap version: 9.3
Monitoring solution: Nagios XI 5.4.13

Expected Behavior
When checking for aggregate_health, we expect to get performance data for all aggregates.

Actual Behavior
This works fine when the state is OK (0) or WARNING (1). However, when one of the aggregates is in CRITICAL state, it disappears from the performance data output, causing problems with Nagios XI PNP performance data engine. In our case, we are checking 3 aggregates, but as soon as 1 goes in Critical state, we only get performance data for 2 aggregates. The RRD file still expects 3 datasources, so we don't see any performance graphs anymore.

I expect this behaviour will also happen with other checks which use the calc_space_health sub, as when an object is critical, it is removed before checking for Warning, and perf data is only added on Warning check.
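
One way to avoid losing perfdata is to record it for every object before any state decision is made. The sketch below illustrates that ordering only; `evaluate_aggregates` and the sample values are hypothetical and do not reproduce calc_space_health.

```perl
use strict;
use warnings;

# Hypothetical evaluation loop: perfdata is pushed for every aggregate first,
# and the state is derived afterwards, so a CRITICAL object stays in the output.
sub evaluate_aggregates {
    my ($hrefAggrs, $intWarn, $intCrit) = @_;
    my ($intState, @aryPerf) = (0);
    foreach my $strAggr (sort keys %$hrefAggrs) {
        my $intPercent = $hrefAggrs->{$strAggr};
        push @aryPerf, "'${strAggr}_percent'=${intPercent}%;$intWarn;$intCrit;0;100";
        my $intObjState = $intPercent >= $intCrit ? 2
                        : $intPercent >= $intWarn ? 1
                        :                           0;
        $intState = $intObjState if $intObjState > $intState;
    }
    return ($intState, join(' ', @aryPerf));
}

my ($intState, $strPerf) = evaluate_aggregates(
    { aggr1 => 50, aggr2 => 96, aggr3 => 99 }, 95, 98);
print "state=$intState | $strPerf\n";
```

With three aggregates checked, three perfdata items are emitted regardless of which state each aggregate ends up in, which keeps the number of RRD datasources stable.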

How to reproduce Behavior
Run the script for aggregate health so that no aggregates are critical. You should see perf data for all aggregates checked. Rerun the check with a critical level such that one or more aggregates are in a critical state; they will then not be included in the performance data.

Would be great if this could be fixed soon.
Thanks
Edward

volume_health has uninitialized value

Hi, when volumes are in offline state, we get an uninitialized value:
/usr/lib64/nagios/plugins/check_netapp_ontap.pl -H onvarjv-ups -u ONVARJV\nagiosmon -p xxxxxxxx -o volume_health --warning 93% --critical 96% -m exclude,worm
Use of uninitialized value in division (/) at /usr/lib64/nagios/plugins/check_netapp_ontap.pl line 1197.
Use of uninitialized value in division (/) at /usr/lib64/nagios/plugins/check_netapp_ontap.pl line 1197.
Illegal division by zero at /usr/lib64/nagios/plugins/check_netapp_ontap.pl line 1197.

I added a check on volume state in 'space_threshold_helper':
sub space_threshold_helper {
# Test the various monitored object values against the thresholds provided by the user.
my ($intState, $strOutput, $hrefVolInfo, $hrefThresholds, $intAlertLevel) = @_;

    foreach my $strVol (keys %$hrefVolInfo) {
            my $bMarkedForRemoval = 0;

            # Test added by Didier Tollenaers 03/04/2015
            if ($hrefVolInfo->{$strVol}->{'state'} ne 'offline')  {

            # Test if various thresholds are defined and if they are then test if the monitored object exceeds them.
            if (defined($hrefThresholds->{'space-percent'}) || defined($hrefThresholds->{'space-count'})) {
                    # Prepare certain variables pre-check to reduce code duplication.

....

It seems better with that test (perhaps not 'elegant' Perl, but I'm not a Perl developer).

Smartmatch is experimental in output

Issue Type

Bug report
Issue Detail

check_netapp_ontap version: v3.01.171611 (latest)
NetApp Ontap version: 9.1 P7
NetApp SDK version: 9.4 (latest)
Monitoring solution: Icinga2

Expected Behavior

"Smartmatch is experimental" does not show up in the output.

Tested with v2.5.10 on the same machine:

# /usr/lib/nagios/plugins/check_netapp_ontap.pl -H mynetapp -u "monitoring" -p "secret" -o filer_hardware_health
OK - No problems found (8 checked)

Actual Behavior

The output "Smartmatch is experimental..." is shown on all checks. Here are two examples:

# /usr/lib/nagios/plugins/check_netapp_ontap.pl -H mynetapp -u "monitoring" -p "secret" -o cluster_health
Smartmatch is experimental at /usr/lib/nagios/plugins/check_netapp_ontap.pl line 414.
Smartmatch is experimental at /usr/lib/nagios/plugins/check_netapp_ontap.pl line 415.
Smartmatch is experimental at /usr/lib/nagios/plugins/check_netapp_ontap.pl line 416.
OK - No problem found (2 checked)


# /usr/lib/nagios/plugins/check_netapp_ontap.pl -H mynetapp -u "monitoring" -p "secret" -o filer_hardware_health
Smartmatch is experimental at /usr/lib/nagios/plugins/check_netapp_ontap.pl line 414.
Smartmatch is experimental at /usr/lib/nagios/plugins/check_netapp_ontap.pl line 415.
Smartmatch is experimental at /usr/lib/nagios/plugins/check_netapp_ontap.pl line 416.
OK - No problem found (8 checked)

Relevant code blocks from plugin:

 411         if (defined $strSuboption) {
 412                 $strCheckLIFStatus = $strCheckLIFHomeNode = $strCheckLIFHomePort = 0;
 413                 my @arySuboption = split(",",$strSuboption);
 414                 if ("status" ~~ @arySuboption) { $strCheckLIFStatus = 1; }
 415                 if ("home-node" ~~ @arySuboption) { $strCheckLIFHomeNode = 1; }
 416                 if ("home-port" ~~ @arySuboption) { $strCheckLIFHomePort = 1; }
 417         }
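
The `~~` operator has emitted this warning since Perl 5.18, when smartmatch was marked experimental. A membership test with grep behaves the same for this purpose and produces no warning. The following is a sketch rather than the project's fix; `in_list` is a hypothetical helper and the suboption value is a sample.

```perl
use strict;
use warnings;

# Hypothetical helper: 1 when $strNeedle equals one of the list elements.
# grep in scalar context returns the number of matching elements.
sub in_list {
    my ($strNeedle, @aryHaystack) = @_;
    return scalar(grep { $_ eq $strNeedle } @aryHaystack) ? 1 : 0;
}

my $strSuboption = 'status,home-port';   # sample value for illustration
my @arySuboption = split(",", $strSuboption);

my $strCheckLIFStatus   = in_list('status',    @arySuboption);
my $strCheckLIFHomeNode = in_list('home-node', @arySuboption);
my $strCheckLIFHomePort = in_list('home-port', @arySuboption);
print "$strCheckLIFStatus $strCheckLIFHomeNode $strCheckLIFHomePort\n";
```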

How to reproduce Behavior

Just execute the plugin.

Ontap 9.1 - is scrubbing

Since we updated our Netapp to the Ontap Version 9.1 (before 8.3) we get warnings from the disk_health check. Some disks are in the status "is scrubbing". Is this a known issue? How can we resolve it?

thanks

Question about "cluster_health" option

Issue Type

Enhancement Request

Issue Detail

check_netapp_ontap version: v3.04.201124
NetApp Ontap version: 9.6.0 ( simulator )
Monitoring solution: LibreNMS

Hi, thanks for the plugin; it has helped me a lot.
I have a NetApp 9.6 simulator (2 nodes) for study and use "cluster_health" for monitoring, but I get an "OK - No problem found (0 checked)" output.
After checking the script, I see that it maps to "cluster peer health show", which differs from the "system health status show" I expected.
Could you help clarify the usage of this option?
Thanks.

restrict checks to "Nodes" is not working

Hi,

I am trying to restrict my "option check" to nodes in a cluster, but it is returning the status from the whole cluster.
Even if we provide a wrong node name, the plugin still runs successfully and the data checked comes from the cluster.
For example, the output with a wrong node name:
./check_netapp_ontap.pl -H CLUSTER1 -c public -u USERID -p PASWORD -o disk_health -n WRONGNODE_NAME
OK - No problems found (552 checked)

May I know how to fix it?

Unable to find API

We have already downloaded the Nagios plugin for checking NetApp ONTAP and installed it on our Centreon server, but found that there is an "Unable to find API" error. Only the result of disk_health is OK. May I know how to solve this kind of problem?

[root@lxapan04 check_netapp_ontap-2.5.10]# ./check_netapp_ontap.pl -H 10.95.6.6 -u root -p P@ssw0rd -o disk_health
OK - No problems found (28 checked)
[root@lxapan04 check_netapp_ontap-2.5.10]# ./check_netapp_ontap.pl -H 10.95.6.6 -u root -p P@ssw0rd -o volume_health
Failed volume query: Unable to find API: volume-get-iter
[root@lxapan04 check_netapp_ontap-2.5.10]# ./check_netapp_ontap.pl -H 10.95.6.6 -u root -p P@ssw0rd -o aggregate_health
Failed volume query: Unable to find API: aggr-get-iter
[root@lxapan04 check_netapp_ontap-2.5.10]# ./check_netapp_ontap.pl -H 10.95.6.6 -u root -p P@ssw0rd -o snapshot_health
Failed volume query: Unable to find API: volume-get-iter
[root@lxapan04 check_netapp_ontap-2.5.10]# ./check_netapp_ontap.pl -H 10.95.6.6 -u root -p P@ssw0rd -o quota_health
Failed volume query: Unable to find API: quota-report-iter
[root@lxapan04 check_netapp_ontap-2.5.10]# ./check_netapp_ontap.pl -H 10.95.6.6 -u root -p P@ssw0rd -o snapmirror_health
Failed volume query: Unable to find API: snapmirror-get-iter
[root@lxapan04 check_netapp_ontap-2.5.10]# ./check_netapp_ontap.pl -H 10.95.6.6 -u root -p P@ssw0rd -o filer_hardware_health
Failed filer health query: Unable to find API: system-node-get-iter
[root@lxapan04 check_netapp_ontap-2.5.10]# ./check_netapp_ontap.pl -H 10.95.6.6 -u root -p P@ssw0rd -o port_health
Failed filer health query: Unable to find API: net-port-get-iter
[root@lxapan04 check_netapp_ontap-2.5.10]# ./check_netapp_ontap.pl -H 10.95.6.6 -u root -p P@ssw0rd -o interface_health
Failed filer health query: Unable to find API: net-interface-get-iter
[root@lxapan04 check_netapp_ontap-2.5.10]# ./check_netapp_ontap.pl -H 10.95.6.6 -u root -p P@ssw0rd -o netapp_alarms
Failed filer health query: Unable to find API: dashboard-alarm-get-iter
[root@lxapan04 check_netapp_ontap-2.5.10]# ./check_netapp_ontap.pl -H 10.95.6.6 -u root -p P@ssw0rd -o cluster_health
Failed filer health query: Unable to find API: cluster-peer-health-info-get-iter

Units for volume sizes should be configurable.

Hi,
We need larger units such as MB and GB, and they should be configurable. Dealing with perf-data in raw bytes, as shown below, is very tedious.
$ /usr/lib64/nagios/plugins/check_netapp_ontap.pl -H XX.XX.XX.XX -u USERNAME -p PASSWORD --option volume_health -n nlcl01_exch -w 80% -c 90%
OK - No problems found (5 checked) | 'nlcl01_exch/vol_EXCH_NL_LOG_used'=1606890414080B 'nlcl01_exch/vol_EXCH_NL_LOG_free'=702084005888B 'nlcl01_exch/nlcl01_exch_root_used'=335872B 'nlcl01_exch/nlcl01_exch_root_free'=1019719680B 'nlcl01_exch/vol_EXCH_NL_DB_used'=2616908759040B 'nlcl01_exch/vol_EXCH_NL_DB_free'=2880649379840B 'nlcl01_exch/vol_EXCH_BE_DB_used'=3450560196608B 'nlcl01_exch/vol_EXCH_BE_DB_free'=4575874686976B 'nlcl01_exch/vol_EXCH_BE_LOG_used'=2307689426944B 'nlcl01_exch/vol_EXCH_BE_LOG_free'=4520277782528B
$

I say configurable (rather than human-readable) because a single, fixed unit is needed when pushing this data to graphing tools.
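
A minimal sketch of what a configurable unit could look like. The names scale_bytes and perf_token, the unit table, and the choice of 1024-based multiples are all assumptions for illustration; none of them come from the plugin itself.

```perl
use strict;
use warnings;

# Hypothetical unit table. Binary (1024-based) multiples are an assumption.
my %UNIT_DIVISOR = (
    B  => 1,
    KB => 1024,
    MB => 1024**2,
    GB => 1024**3,
    TB => 1024**4,
);

# Convert a byte count to the requested unit, rounded to two decimals.
sub scale_bytes {
    my ($bytes, $unit) = @_;
    my $divisor = $UNIT_DIVISOR{ uc $unit }
        or die "Unknown unit '$unit'\n";
    return sprintf('%.2f', $bytes / $divisor);
}

# Build one perfdata token that always uses the same, caller-chosen unit.
sub perf_token {
    my ($label, $bytes, $unit) = @_;
    return sprintf("'%s'=%s%s", $label, scale_bytes($bytes, $unit), uc $unit);
}

print perf_token('nlcl01_exch/nlcl01_exch_root_free', 1019719680, 'MB'), "\n";
```

Because the unit is passed in rather than chosen per value, every metric in one check run ends up on the same scale, which is what a graphing tool needs.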

compilation error in "Dev (#49)"

Hi,

I'm a colleague of Willemdh and he added me to this project for further development.
I am trying to get "Dev (#49)" working, but I get:

Type of arg 1 to keys must be hash (not hash element) at check_netapp_ontap.pl line 270, near "}) "
Execution of check_netapp_ontap.pl aborted due to compilation errors.

If I can get this version to work, I can check it with our new NetApp running ONTAP 9.1 and commit it to the master branch.
Because I don't want to go through the various commits, I am asking for a little assistance :-)

Gtz,
Tony
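
For reference, this compile error is what Perl releases older than 5.14 report when keys() is called directly on a hash reference stored in a hash element. The sketch below shows the portable form. The data structure and the volume_names helper are hypothetical, since the actual content of line 270 is not shown in this report.

```perl
use strict;
use warnings;

# Hypothetical nested structure standing in for the plugin's data.
my $data = { volumes => { vol1 => 1, vol2 => 2 } };

# keys $data->{volumes} only compiles on Perl 5.14 through 5.22, where
# calling keys() on a reference was an experimental feature.
# Dereferencing explicitly works on every Perl 5 release.
sub volume_names {
    my ($href) = @_;
    my @names = sort keys %{ $href->{volumes} };
    return @names;
}

print join(', ', volume_names($data)), "\n";
```

If line 270 follows the same pattern, wrapping the argument in %{ ... } should let the script compile on older interpreters as well.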

[Enhancement Request] Perfdata Total size of Aggregates and Volumes

check_netapp_ontap version: v3.01.171611
Monitoring solution: Icinga2 + InfluxDB/Grafana

Hi all,
Currently the perfdata metrics are *_used and *_free. An additional *_totalsize metric would be nice.
Including the warning and critical thresholds in the perfdata would also be nice.

Regards,
Marcus
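
A rough sketch of the requested output, using the standard plugin perfdata layout 'label'=value[UOM];warn;crit;min;max. The helpers format_perf and volume_perfdata are hypothetical names, and deriving the byte thresholds from percentages of the total is an assumption about how the request might be implemented.

```perl
use strict;
use warnings;

# Format one perfdata token as 'label'=value[UOM];warn;crit;min;max.
# Fields that are not supplied are left empty; trailing empties are dropped.
sub format_perf {
    my (%arg) = @_;
    my @fields = map { defined $arg{$_} ? $arg{$_} : '' } qw(warn crit min max);
    my $token = sprintf("'%s'=%s%s;%s", $arg{label}, $arg{value},
        defined $arg{uom} ? $arg{uom} : '', join(';', @fields));
    $token =~ s/;+$//;
    return $token;
}

# Hypothetical helper: used, free and total size tokens for one volume.
# Thresholds arrive as percentages and are converted to bytes.
sub volume_perfdata {
    my ($name, $used, $total, $warn_pct, $crit_pct) = @_;
    my $warn = int($total * $warn_pct / 100);
    my $crit = int($total * $crit_pct / 100);
    return join ' ',
        format_perf(label => "${name}_used", value => $used, uom => 'B',
                    warn => $warn, crit => $crit, min => 0, max => $total),
        format_perf(label => "${name}_free", value => $total - $used, uom => 'B',
                    min => 0, max => $total),
        format_perf(label => "${name}_totalsize", value => $total, uom => 'B');
}

print volume_perfdata('vs1/vol1', 50, 200, 80, 90), "\n";
```

With thresholds and min/max present in the *_used token, Grafana or a similar tool can draw the limits without any extra configuration.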

Hard Limit Quota isn't working - Checking if used is higher than hard limit (will never be)

Broken Feature - Quota Hard Limit

Bug report

Issue Detail

check_netapp_ontap version: All
NetApp Ontap version: 8.3.2P5

Expected Behavior
I use hard quotas on my qtrees.
When a quota is about to be full, I need an alert.
I expect something like this:
ERROR: volume_name/qtree_name - HARD Quota at 87%

Actual Behavior
The code does this:

if (used >= hard-quota) {
    alert
}

This will not work because the used space/files will never exceed the hard quota.
So the result is always:
OK - No problems found (77 checked)

Suggested Solution
I've done the following:

At the top of the code, added:

# DEFAULTS
my $QUOTA_HARD_LIMIT_ERROR_THRESHOLD = 0.85;

In sub calc_quota_health, changed these lines:

        if ($hrefQuotaInfo->{$strQuota}->{'space-hard-limit'} ne "-") {
            # Bug fix: testing whether space-used exceeds space-hard-limit is wrong, because space-used can never be bigger.
            my $quotaPrecentage = $hrefQuotaInfo->{$strQuota}->{'space-used'} / $hrefQuotaInfo->{$strQuota}->{'space-hard-limit'};
            #if ($hrefQuotaInfo->{$strQuota}->{'space-used'} >= $hrefQuotaInfo->{$strQuota}->{'space-hard-limit'}) {
            if ($quotaPrecentage >= $QUOTA_HARD_LIMIT_ERROR_THRESHOLD) {

and these lines:

        if ($hrefQuotaInfo->{$strQuota}->{'files-hard-limit'} ne "-") {
            # Bug fix: testing whether files-used exceeds files-hard-limit is wrong, because files-used can never be bigger.
            #if ($hrefQuotaInfo->{$strQuota}->{'files-used'} >= $hrefQuotaInfo->{$strQuota}->{'files-hard-limit'}) {
            my $quotaPrecentage = $hrefQuotaInfo->{$strQuota}->{'files-used'} / $hrefQuotaInfo->{$strQuota}->{'files-hard-limit'};
            if ($quotaPrecentage >= $QUOTA_HARD_LIMIT_ERROR_THRESHOLD) {

Result is something like this:
virtual_storage_node/volume/qtree - 146.37GB/170.00GB SPACE USED
with an exit code of 2.

This should resolve the issue, but the ability to set CRITICAL or ERROR thresholds is still missing.
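
One way the missing thresholds could be handled is sketched below. The function quota_state and its $warn and $crit ratio parameters are hypothetical and not part of the plugin. The sketch also skips entries whose hard limit is "-" or 0, since the ratio is undefined for them.

```perl
use strict;
use warnings;

# Return 0 (OK), 1 (WARNING) or 2 (CRITICAL) for one quota entry.
# $warn and $crit are ratios of the hard limit, for example 0.85 and 0.95.
sub quota_state {
    my ($used, $hard_limit, $warn, $crit) = @_;

    # "-" means no hard limit is set; a limit of 0 would divide by zero.
    return 0 if !defined $hard_limit || $hard_limit eq '-' || $hard_limit == 0;

    my $ratio = $used / $hard_limit;
    return 2 if $ratio >= $crit;
    return 1 if $ratio >= $warn;
    return 0;
}

# Figures from the example output above: 146.37GB used of 170.00GB.
print quota_state(146.37, 170.00, 0.85, 0.95), "\n";
```

The same function works for files-used and files-hard-limit, because it only looks at the ratio between the two numbers.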

no output for volume_health

Hi team,
First of all, thanks for this wonderful script.
I have two NetApp storage systems, one with 170 volumes and another with 500 volumes. For the system with 170 volumes everything works fine. For the one with 500 volumes, when I run the command to display volume_health with more than 95%, it runs for more than 1 hour with no output or error. May I know whether the script has any issue with the number of volumes?
Please help me with this.

Error in snapshot_health check if 0 sized volume: Illegal division by zero at line 1464.

Issue Type

Bug report

Issue Detail

check_netapp_ontap version:
NetApp Ontap version:
Monitoring solution: Centreon / Centreon-Broker

Expected Behavior
No error in the snapshot_health check if a volume has size 0

Actual Behavior
Illegal division by zero at check_netapp_ontap.pl line 1464

How to resolve Behavior
At line 1458, add a space-total condition.
From:
if ($hrefVolInfo->{$strVol}->{'state'} eq 'online') {
To:
if ( ($hrefVolInfo->{$strVol}->{'state'} eq 'online') && ($hrefVolInfo->{$strVol}->{'space-total'} != 0) ) {
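
The same guard can be expressed as a filter that runs before any percentage is calculated. This is only a sketch: checkable_volumes is a hypothetical helper, and the hash layout is assumed to match the $hrefVolInfo structure quoted above.

```perl
use strict;
use warnings;

# Return the names of volumes that are online and have a non-zero total
# size, so later percentage calculations cannot divide by zero.
sub checkable_volumes {
    my ($hrefVolInfo) = @_;
    my @names = grep {
        my $vol = $hrefVolInfo->{$_};
        ($vol->{'state'} || '') eq 'online' && ($vol->{'space-total'} || 0) > 0;
    } sort keys %{$hrefVolInfo};
    return @names;
}

my %info = (
    vol_data  => { 'state' => 'online',  'space-total' => 100 },
    vol_empty => { 'state' => 'online',  'space-total' => 0 },
    vol_off   => { 'state' => 'offline', 'space-total' => 50 },
);
print join(', ', checkable_volumes(\%info)), "\n";
```

Volumes with a missing state or a missing space-total are skipped as well, which avoids "uninitialized value" warnings on incomplete API replies.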

Zapi::invoke fails

When attempting to run a check:

./check_netapp_ontap.pl -H x.x.x.x -u username -p pass -o cluster_health

I am getting the following error:

Failed test query: in Zapi::invoke, cannot connect to socket

Through some debug statements, I've found that the message originates on line 655 of NaServer.pm. It looks like it's the result of this if statement:

if (!$sock->connect($that_sockaddr)) {

I'm trying to figure out what could be causing this, but I'm not sure. The server is running NetApp 8.3, C-Mode (not 7-mode as the other issue in GitHub had). I'm sure this is a configuration issue, but I'm not very strong with Perl or Nagios, so I'm having trouble figuring it out.
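
Since the error comes from a failed socket connect, a quick reachability test from the monitoring host can narrow things down. The sketch below uses the core IO::Socket::INET module. The helper name can_connect is hypothetical, and which port the plugin actually uses (80 or 443 are common for the management interface) is an assumption to verify against your setup.

```perl
use strict;
use warnings;
use IO::Socket::INET;

# Return 1 if a TCP connection to $host:$port succeeds within $timeout
# seconds, 0 otherwise.
sub can_connect {
    my ($host, $port, $timeout) = @_;
    my $sock = IO::Socket::INET->new(
        PeerAddr => $host,
        PeerPort => $port,
        Proto    => 'tcp',
        Timeout  => $timeout,
    );
    return 0 unless $sock;
    close $sock;
    return 1;
}

# Example call, with the address and port replaced by your own values:
#   print can_connect('x.x.x.x', 443, 5) ? "reachable\n" : "not reachable\n";
```

If this returns 0, the problem is likely a firewall, a wrong address, or a disabled service rather than anything in the plugin itself.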

Performance data for Volumes that are in Critical state is not included in output

Bug report

When a volume is reporting as Critical, then performance data for that volume is not output by the check.

check_netapp_ontap version: v3.03.200924
NetApp Ontap version: 9
Monitoring solution: OMD 3.31~2020-09-13-labs-edition

Expected Behavior
Performance data should be output for all Volumes regardless of threshold state.

Actual Behavior
Performance data is output only for volumes that are not in a Critical state.

How to reproduce Behavior
Set a threshold which will result in some volumes going into a critical state and check the performance data output. Volumes that are in a critical state will not be included in the performance data output.

The issue is caused by the sub space_threshold_helper, which is called twice, first with $alertlevel = 2 and then with $alertlevel = 1. Volumes are removed from the list of volumes during the first call if they exceed the critical threshold; however, $perfoutput is only produced if ($intAlertLevel == 1). Hence, volumes that exceeded the critical threshold are not processed in the second call to the sub and therefore do not get any perfdata output.
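
A possible restructuring, sketched under assumptions: perfdata is collected for every volume first, and the threshold classification happens in the same pass without removing anything from the list. The function evaluate_volumes and its data layout are hypothetical and do not mirror space_threshold_helper.

```perl
use strict;
use warnings;

# Build perfdata for all volumes, then sort them into alert buckets.
# $volumes maps a volume name to { used => bytes, total => bytes }.
sub evaluate_volumes {
    my ($volumes, $warn_pct, $crit_pct) = @_;
    my (@perf, @critical, @warning);

    for my $name (sort keys %{$volumes}) {
        my ($used, $total) = @{ $volumes->{$name} }{qw(used total)};
        my $free = $total - $used;

        # Perfdata is emitted before any threshold decision is made.
        push @perf, "'${name}_used'=${used}B", "'${name}_free'=${free}B";

        next if $total == 0;    # no percentage for empty volumes
        my $pct = 100 * $used / $total;
        if    ($pct >= $crit_pct) { push @critical, $name }
        elsif ($pct >= $warn_pct) { push @warning,  $name }
    }

    return {
        perf     => join(' ', @perf),
        critical => \@critical,
        warning  => \@warning,
    };
}

my $result = evaluate_volumes(
    { vol_a => { used => 95, total => 100 }, vol_b => { used => 10, total => 100 } },
    80, 90,
);
print "$result->{perf}\n";
```

Here vol_a is critical and still appears in the perfdata string, which is the behavior the report asks for.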

Test: I have added some print statements

Print the output and exit with the resulting state.

I am using this option with the plugin:

./check_netapp_ontap.pl -H x.x.x.x -u username -p pass -o cluster_health
