
outsideit / check_netapp_ontap


:four_leaf_clover: Check NetApp Ontap :four_leaf_clover:

Home Page: https://outsideit.net/monitoring-netapp-ontap/

License: GNU General Public License v3.0

Language: Perl 100.00%
Topics: monitoring, nagios-plugins, netapp-ontap-cluster, perl-script, health-check, health-checks, nagios-plugin, monitoring-plugins

check_netapp_ontap's Introduction

NetApp ONTAPI (ZAPI) will reach end of availability (EOA) in January 2023. For customers using ONTAPI to automate ONTAP data storage management tasks, ONTAP 9.12.1 software, which is expected to release in the fourth quarter (Q4) of calendar year 2022, will be the final version to support ONTAPI. The subsequent release, ONTAP 9.13.1, targeted for Q2 of calendar year 2023, will remove ONTAPI support from the product.

Nagios plugin to check health of a NetApp Ontap cluster

Idea

This Perl script is able to monitor most components of a NetApp Ontap cluster, such as volume, aggregate, snapshot, quota, snapmirror, filer hardware, port, interface, cluster, and disk health.

Status

Deprecated.

How To

This script requires the NetApp Manageability SDK for Perl to be installed. It can be found at https://mysupport.netapp.com/NOW/cgi-bin/software

There are of course numerous ways to monitor your NetApp Ontap storage, but this project focuses on how to achieve quality monitoring with the help of a Nagios plugin, which was originally developed by John Murphy. The plugin definitely has some flaws, so all help is welcome to improve it. Read the post about debugging Perl scripts, make a fork of the project on GitHub, and start experimenting.

The plugin is able to monitor multiple critical NetApp Ontap components, from disks to aggregates to volumes. It can also alert you if it finds any unhealthy components.

How to monitor Netapp Ontap with Nagios?

Download the latest release from GitHub to a temp directory and then navigate to it.

Copy the contents of NetApp/* to your /usr/lib/perl5 or /usr/lib64/perl5 directory to install the required version of the NetApp Perl SDK (confirmed to work with SDK 5.1 and 5.2).

Copy the check_netapp_ontap.pl script to your Nagios libexec folder and configure the correct permissions.
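The steps above can be sketched as shell commands. The SDK unpack location and the Nagios paths below are assumptions, and the sketch runs against a scratch prefix with stand-in files so it is safe to execute; for a real install, drop the scratch prefix and the stand-ins and use your actual paths.

```shell
# Sketch of the install steps; assumptions: the SDK unpacks its Perl
# modules under lib/perl/NetApp, and the plugin goes into the Nagios
# libexec directory. PREFIX is a scratch directory for safe testing.
set -e
PREFIX=$(mktemp -d)
SDK_SRC="$PREFIX/netapp-manageability-sdk/lib/perl/NetApp"
PERL_LIB="$PREFIX/usr/lib64/perl5"
LIBEXEC="$PREFIX/usr/local/nagios/libexec"

# stand-ins for the real SDK modules and the downloaded plugin
mkdir -p "$SDK_SRC" "$PERL_LIB" "$LIBEXEC"
touch "$SDK_SRC/NaServer.pm" "$SDK_SRC/NaElement.pm" "$PREFIX/check_netapp_ontap.pl"

cp -r "$SDK_SRC/." "$PERL_LIB/"                 # step 2: install the SDK modules
cp "$PREFIX/check_netapp_ontap.pl" "$LIBEXEC/"  # step 3: install the plugin
chmod 755 "$LIBEXEC/check_netapp_ontap.pl"      # make it executable for Nagios
```

On a real system you would also make sure the plugin is owned by (or readable and executable for) the user your Nagios daemon runs as.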

Parameters:

  • --hostname, -H => Hostname or address of the cluster administrative interface.
  • --node, -n => Name of a vhost or cluster-node to restrict this query to.
  • --user, -u => Username of a NetApp Ontapi enabled user.
  • --password, -p => Password for the NetApp Ontapi enabled user.
  • --option, -o => The name of the option you want to check. See the option and threshold list at the bottom of this help text.
  • --suboption, -s => If available for the option, specifies the list of checks to perform.
  • --warning, -w => A custom warning threshold value. See the option and threshold list at the bottom of this help text.
  • --critical, -c => A custom critical threshold value. See the option and threshold list at the bottom of this help text.
  • --modifier, -m => This modifier is used to set an inclusive or exclusive filter on what you want to monitor.
  • --help, -h => Display this help text.
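In a Nagios setup these parameters are typically wired up through a command definition. A sketch, assuming the conventional $USER1$ plugin path; the host name, credentials, and thresholds are placeholders, not values from this project:

```
define command {
    command_name check_netapp_ontap
    command_line $USER1$/check_netapp_ontap.pl -H $HOSTADDRESS$ -u $ARG1$ -p $ARG2$ -o $ARG3$ -w $ARG4$ -c $ARG5$
}

define service {
    use                 generic-service
    host_name           netapp-cluster01
    service_description Volume health
    check_command       check_netapp_ontap!monitor-user!monitor-pass!volume_health!80%!90%
}
```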

Options

volume_health

Check the space and inode health of a vServer volume on a NetApp Ontap cluster. If space % and space in *B are both defined, the smaller value of the two will be used when deciding if the volume is in a warning or critical state. This allows you to better accommodate large volume monitoring.

thresh: space % used, space in *B (i.e. MB) remaining, inode count remaining, inode % used (usage example: 80%i), “offline” keyword.
node: The node option restricts this check by vserver name.

  • Examples
    • -w 80% - Warn if volume grows more than 80% full
    • -w 100GB - Warn if volume has less than 100GB free space
    • -w 80%,50GB - Warn if volume is more than 80% used and has less than 50GB free space
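One way to read the "smaller value of the two" rule is that both thresholds are converted into an amount of free space and the smaller amount becomes the effective limit, which is why -w 80%,50GB only warns when both conditions hold. A sketch of that logic (an illustration under that reading, not the plugin's actual code):

```shell
# Illustrative model of the combined space thresholds; all sizes in GB.
space_state() {
    used=$1; total=$2; warn_pct=$3; warn_free=$4
    free=$((total - used))
    # free space implied by the "space % used" threshold
    pct_free=$(( total * (100 - warn_pct) / 100 ))
    # the smaller free-space amount of the two is the effective limit
    limit=$warn_free
    if [ "$pct_free" -lt "$limit" ]; then limit=$pct_free; fi
    if [ "$free" -lt "$limit" ]; then echo WARNING; else echo OK; fi
}

# 1000 GB volume, 900 GB used, -w 80%,50GB: 100 GB free is below the
# 200 GB implied by 80%, but not below 50 GB, so the volume is still OK
space_state 900 1000 80 50
```

For a huge volume the percent threshold implies an enormous amount of free space, so the absolute threshold (the smaller amount) ends up governing the check.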

aggregate_health

Check the space and inode health of a cluster aggregate on a NetApp Ontap cluster. If space % and space in *B are both defined, the smaller value of the two will be used when deciding if the aggregate is in a warning or critical state. This allows you to better accommodate large aggregate monitoring.

thresh: space % used, space in *B (i.e. MB) remaining, inode count remaining, inode % used (usage example: 80%i), “offline” keyword, “is-home” keyword.
node: The node option restricts this check by cluster-node name.

snapshot_health

Check the space and inode health of a vServer snapshot. If space % and space in *B are both defined, the smaller value of the two will be used when deciding if the snapshot is in a warning or critical state. This allows you to better accommodate large snapshot monitoring.

thresh: space % used, space in *B (i.e. MB) remaining, inode count remaining, inode % used (usage example: 80%i), “offline” keyword.
node: The node option restricts this check by vserver name.

quota_health

Check that the space and file thresholds have not been crossed on a quota.

thresh: N/A, storage defined.
node: The node option restricts this check by vserver name.

snapmirror_health

Check the lag time and health flag of the snapmirror relationships.

thresh: Snapmirror lag time (valid intervals are s, m, h, d).
node: The node option restricts this check by snapmirror destination cluster-node name.

filer_hardware_health

Check the environment hardware health of the filers (fan, psu, temperature, battery).

thresh: component name (fan, psu, temperature, battery). There are no default alert levels; they MUST be defined.
node: The node option restricts this check by cluster-node name.

port_health

Checks the state of a physical network port.

thresh: N/A, not customizable.
node: The node option restricts this check by cluster-node name.

vscan_health

Check if vscan is disabled.

node: The node option restricts this check by vserver name.

interface_health

Check that a LIF is in the correctly configured state and that it is on its home node and port. Additionally checks the state of a physical port.

thresh: N/A, not customizable.
node: The node option restricts this check by vserver name.

netapp_alarms

Check for Netapp console alarms.

thresh: N/A, not customizable.
node: The node option restricts this check by cluster-node name.

cluster_health

Check the cluster disks for failure or other potentially undesirable states.

thresh: N/A, not customizable.
node: The node option restricts this check by cluster-node name.

clusternode_health

Check the cluster-nodes for unhealthy conditions.

thresh: N/A, not customizable.
node: The node option restricts this check by cluster-node name.

disk_health

Check the health of the disks in the cluster.

thresh: Not customizable yet.
node: The node option restricts this check by cluster-node name.

disk_spare

Check the number of spare disks.

thresh: Warning / critical required spare disks. Default thresholds are 2 / 1.
node: The node option restricts this check by cluster-node name.

For keyword thresholds, if you want to ignore alerts for a particular keyword, set it at the same threshold level that the alert defaults to.

Help

In case you find a bug or have a feature request, please make an issue on GitHub.

On Nagios Exchange

https://exchange.nagios.org/directory/Plugins/Hardware/Storage-Systems/SAN-and-NAS/NetApp/Check-Netapp-Ontap/details

Copyright

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details at http://www.gnu.org/licenses/.


check_netapp_ontap's Issues

compilation error in "Dev (#49)"

Hi,

I'm a colleague of Willemdh and he added me to this project for further development.
I am trying to get "Dev (#49)" working, but I get:

Type of arg 1 to keys must be hash (not hash element) at check_netapp_ontap.pl line 270, near "}) "
Execution of check_netapp_ontap.pl aborted due to compilation errors.

If I can get this version to work, I can check it with our new NetApp with Ontap 9.1 and commit it to the master branch.
But because I don't want to go through the various commits, I'm asking for a little assistance :-)

Greetings,
Tony

Snapmirror lag time for each

Hi,

Is it possible to check each snapmirror with a "name option" instead of the "all in one" command in the plugin?

Thanks,

Can't locate UserAgent.pm. Compilation Failed

I installed Nagios 4.2.1 on CentOS Linux release 7.2.1511 (Core). Everything worked perfectly. I want to monitor my NetApp Release 8.1.3 7-Mode.

I use tutorial How To from http://outsideit.net/check-netapp-ontap/

ran chown nagios:nagios /usr/local/nagios/libexec/check_netapp_ontap.pl
ran chmod 644 /usr/local/nagios/libexec/check_netapp_ontap.pl

copied NetApp/* to /usr/lib64/perl5

but when I try to run the test command

./check_netapp_ontap.pl, the error below appears. Is it because of the NetApp SDK? How can I solve this error?

Can't locate LWP/UserAgent.pm in @INC (@INC contains: /usr/local/lib64/perl5 /u
BEGIN failed--compilation aborted at /usr/lib64/perl5/NaServer.pm line 27.
Compilation failed in require at ./check_netapp_ontap.pl line 26.
BEGIN failed--compilation aborted at ./check_netapp_ontap.pl line 26.

Feature request: Add performance data to volume_health

Hi willemdh,

Would it be possible to add the performance data to the volume_health check?
The current "OK" Message doesn't allow me to write graphs: "OK - No problems found (1 checked)"

Thank you & best regards
Pascal

restrict checks to "Nodes" is not working

Hi,

I am trying to restrict my "option check" to nodes in a cluster, but it is returning status from the whole cluster.
Even if we provide a wrong "node name", the plugin still runs successfully and the data is checked from the cluster.
For example, output with a wrong node name:
./check_netapp_ontap.pl -H CLUSTER1 -c public -u USERID -p PASWORD -o disk_health -n WRONGNODE_NAME
OK - No problems found (552 checked)

May I know how to fix it?

Undefined subroutine

Facing the below error when executing the script.

[root@h1dciminagios1 NetApp]# /usr/local/nagios/libexec/check_netapp_ontap_v04.pl -H 10.116.104.46 -u nagios -p Welcome@123 -o interface_health -w 85% -c 90%
Undefined subroutine &XML::Parser called at /usr/local/lib64/perl5/NaServer.pm line 1068.

filter unassigned

Thank you for the plugin.

Would it be possible to filter out unassigned disks like:
-m exclude,state is unassigned

It seems not to work, or isn't this possible currently?

Unable to find API

We have already downloaded the Nagios plugin for checking NetApp Ontap and installed it in our Centreon server, but found that there is an "Unable to find API" error. Only the result of disk_health is OK. May I know how to solve these kinds of problems?

[root@lxapan04 check_netapp_ontap-2.5.10]# ./check_netapp_ontap.pl -H 10.95.6.6 -u root -p P@ssw0rd -o disk_health
OK - No problems found (28 checked)
[root@lxapan04 check_netapp_ontap-2.5.10]# ./check_netapp_ontap.pl -H 10.95.6.6 -u root -p P@ssw0rd -o volume_health
Failed volume query: Unable to find API: volume-get-iter
[root@lxapan04 check_netapp_ontap-2.5.10]# ./check_netapp_ontap.pl -H 10.95.6.6 -u root -p P@ssw0rd -o aggregate_health
Failed volume query: Unable to find API: aggr-get-iter
[root@lxapan04 check_netapp_ontap-2.5.10]# ./check_netapp_ontap.pl -H 10.95.6.6 -u root -p P@ssw0rd -o snapshot_health
Failed volume query: Unable to find API: volume-get-iter
[root@lxapan04 check_netapp_ontap-2.5.10]# ./check_netapp_ontap.pl -H 10.95.6.6 -u root -p P@ssw0rd -o quota_health
Failed volume query: Unable to find API: quota-report-iter
[root@lxapan04 check_netapp_ontap-2.5.10]# ./check_netapp_ontap.pl -H 10.95.6.6 -u root -p P@ssw0rd -o snapmirror_health
Failed volume query: Unable to find API: snapmirror-get-iter
[root@lxapan04 check_netapp_ontap-2.5.10]# ./check_netapp_ontap.pl -H 10.95.6.6 -u root -p P@ssw0rd -o filer_hardware_health
Failed filer health query: Unable to find API: system-node-get-iter
[root@lxapan04 check_netapp_ontap-2.5.10]# ./check_netapp_ontap.pl -H 10.95.6.6 -u root -p P@ssw0rd -o port_health
Failed filer health query: Unable to find API: net-port-get-iter
[root@lxapan04 check_netapp_ontap-2.5.10]# ./check_netapp_ontap.pl -H 10.95.6.6 -u root -p P@ssw0rd -o interface_health
Failed filer health query: Unable to find API: net-interface-get-iter
[root@lxapan04 check_netapp_ontap-2.5.10]# ./check_netapp_ontap.pl -H 10.95.6.6 -u root -p P@ssw0rd -o netapp_alarms
Failed filer health query: Unable to find API: dashboard-alarm-get-iter
[root@lxapan04 check_netapp_ontap-2.5.10]# ./check_netapp_ontap.pl -H 10.95.6.6 -u root -p P@ssw0rd -o cluster_health
Failed filer health query: Unable to find API: cluster-peer-health-info-get-iter

Space Health perf data not working when critical

Issue Type

Bug report

Issue Detail

  • check_netapp_ontap version: 3.01.171611
  • NetApp Ontap version: 9.3
  • Monitoring solution: Nagios XI 5.4.13

Expected Behavior
When checking for aggregate_health, we expect to get performance data for all aggregates.

Actual Behavior
This works fine when the state is OK (0) or WARNING (1). However, when one of the aggregates is in a CRITICAL state, it disappears from the performance data output, causing problems with the Nagios XI PNP performance data engine. In our case, we are checking 3 aggregates, but as soon as 1 goes into a Critical state, we only get performance data for 2 aggregates. The RRD file still expects 3 datasources, so we don't see any performance graphs anymore.

I expect this behaviour will also happen with other checks which use the calc_space_health sub, since when an object is critical it is removed before the Warning check, and perf data is only added during the Warning check.

How to reproduce Behavior
Run the script for aggregate health so that no aggregates are critical. You should see perf data for all aggregates checked. Rerun the check with a critical level so that one or more aggregates have a critical state; they will not be included in the performance data then.

Would be great if this could be fixed soon.
Thanks
Edward

check_netapp_ontap error uninitialized value in string

I'm not sure if it is a bug
Issue Detail

  • check_netapp_ontap version:
    check_netapp_ontapi version: v3.01.171611
    By John Murphy [email protected], Willem D'Haese [email protected], GNU GPL License

  • NetApp Ontap version: 9.3
    I ran this command in order to get info from one volume:
    ./check_netapp_ontap.pl -H 10.209.49.137 -u nagios_user -p Nagios_User_pwd01 -o volume_health -m include,SGSFILVR06/filenetp8b
    Use of uninitialized value in string ne at ./check_netapp_ontap.pl line 1350.
    Use of uninitialized value in string ne at ./check_netapp_ontap.pl line 1350.
    Use of uninitialized value in string ne at ./check_netapp_ontap.pl line 1350.
    Use of uninitialized value in string ne at ./check_netapp_ontap.pl line 1350.
    Use of uninitialized value in string ne at ./check_netapp_ontap.pl line 1350.
    Use of uninitialized value in string ne at ./check_netapp_ontap.pl line 1350.
    Use of uninitialized value in string ne at ./check_netapp_ontap.pl line 1350.
    Use of uninitialized value in string ne at ./check_netapp_ontap.pl line 1350.
    Use of uninitialized value in string ne at ./check_netapp_ontap.pl line 1350.
    Use of uninitialized value in string ne at ./check_netapp_ontap.pl line 1350.
    Use of uninitialized value in string ne at ./check_netapp_ontap.pl line 1350.
    Use of uninitialized value in string ne at ./check_netapp_ontap.pl line 1350.
    Use of uninitialized value in string ne at ./check_netapp_ontap.pl line 1350.
    Use of uninitialized value in string ne at ./check_netapp_ontap.pl line 1350.
    Use of uninitialized value in string ne at ./check_netapp_ontap.pl line 1350.
    Use of uninitialized value in string ne at ./check_netapp_ontap.pl line 1350.
    Use of uninitialized value in string ne at ./check_netapp_ontap.pl line 1350.
    Use of uninitialized value in string ne at ./check_netapp_ontap.pl line 1350.
    Use of uninitialized value in string ne at ./check_netapp_ontap.pl line 1350.
    SGSFILVR06/filenetp8b - 4.69TB/5.28TB (89%) SPACE USED | 'SGSFILVR06/filenetp8b_used'=5159984656384B 'SGSFILVR06/filenetp8b_free'=645436739584B

Why do I get a lot of this output line?
Use of uninitialized value in string ne at ./check_netapp_ontap.pl line 1350

Thanks for your help
Emilio
[email protected]

Snapshot_health usage percentage based on Data Space size

When using the snapshot_health option, the percentage of snapshot size usage is calculated from the Snapshot Space Used in relation to the Data Space Total.

Should it not be based on the Total Snapshot Reserve? Because with some volumes the Snapshot Reserve can actually be larger than the Data Space, causing the snapshot to show 140% space used.

So snapshot_health should be calculated within the Total Snapshot Reserve, and volume_health should be calculated within the Data Space Total.

volume_health has uninitialized value

Hi, when volumes are in an offline state, we get an uninitialized value:
/usr/lib64/nagios/plugins/check_netapp_ontap.pl -H onvarjv-ups -u ONVARJV\nagiosmon -p xxxxxxxx -o volume_health --warning 93% --critical 96% -m exclude,worm
Use of uninitialized value in division (/) at /usr/lib64/nagios/plugins/check_netapp_ontap.pl line 1197.
Use of uninitialized value in division (/) at /usr/lib64/nagios/plugins/check_netapp_ontap.pl line 1197.
Illegal division by zero at /usr/lib64/nagios/plugins/check_netapp_ontap.pl line 1197.

I added a check on volume state in 'space_threshold_helper':
sub space_threshold_helper {
# Test the various monitored object values against the thresholds provided by the user.
my ($intState, $strOutput, $hrefVolInfo, $hrefThresholds, $intAlertLevel) = @_;

    foreach my $strVol (keys %$hrefVolInfo) {
            my $bMarkedForRemoval = 0;

            # Test added by Didier Tollenaers 03/04/2015
            if ($hrefVolInfo->{$strVol}->{'state'} ne 'offline')  {

            # Test if various thresholds are defined and if they are then test if the monitored object exceeds them.
            if (defined($hrefThresholds->{'space-percent'}) || defined($hrefThresholds->{'space-count'})) {
                    # Prepare certain variables pre-check to reduce code duplication.

....

It seems better with that test (perhaps not 'elegant' in Perl, but I'm not a Perl developer).

Question about "cluster_health" option

Issue Type

Enhancement Request

Issue Detail

  • check_netapp_ontap version: v3.04.201124
  • NetApp Ontap version: 9.6.0 ( simulator )
  • Monitoring solution: LibreNMS

Hi, thanks for the plugin; it has helped me a lot.
I have a NetApp 9.6 simulator (2 nodes) for study and am using "cluster_health" for monitoring, but I get an "OK - No problem found (0 checked)" output.
After checking the script, it maps to "cluster peer health show", which differs from the "system health status show" I expected.
Could you help clarify the option usage?
Thanks.

disk_health netapp FAS8080 cdot 8.3.2 - state is unknown

Hi,
We got a new NetApp FAS8080 with cDOT OS version 8.3.2.
When I'm trying to get the disk_health status, the plugin fails with the output:

4.1.8 state is unknown, 2.11.6 state is unknown, 3.10.12 state is unknown, 1.0.0 state is unknown (...)

Is there a chance to get this fixed?
How can I assist you doing that?

Thanks a lot!

volume_health executed in cli shows "usage" performance data, but "inodes" in Icinga 2

Issue Type

Question / Maybe bug

Issue Detail

  • check_netapp_ontap version: v3.01.171611
  • NetApp Ontap version: 9.2
  • Monitoring solution: Icinga 2

Expected Behavior

When executing the check on the cli, the volume usage is shown in the performance data:

# /usr/lib/nagios/plugins/check_netapp_ontap.pl -H netapp -u "user" -p "pass" -o volume_health -m include,v_volume_name1 -v
OK - No problem found (1 checked) | 'cifsserver/v_volume_name1_usage'=160401108992B;;;0;255013683200

So it is expected that Icinga 2 would show cifsserver/v_volume_name1_usage in the performance data.

Actual Behavior

In the Icinga 2 interface, the performance data is cifsserver/v_volume_name1_inodes and shows different values, even though the same volume seems to be used as a base.


How to reproduce Behavior

Simply define a volume health check in Icinga 2.

So the question is: Where does this come from?

snapshot_health check has uninitialized values

root@mon:~# perl /usr/local/lib/nagios/plugins/check_netapp_ontap -H netapphost -u adminuser -p adminpassword -o snapshot_health
Use of uninitialized value in string eq at /usr/local/lib/nagios/plugins/check_netapp_ontap line 1020.
Use of uninitialized value in string eq at /usr/local/lib/nagios/plugins/check_netapp_ontap line 1020.
Use of uninitialized value in string eq at /usr/local/lib/nagios/plugins/check_netapp_ontap line 1020.
Use of uninitialized value in string eq at /usr/local/lib/nagios/plugins/check_netapp_ontap line 1020.
OK - No problems found (58 checked)

get_port_health gets stuck when number of records > 100

Hello,

The check gets stuck in an endless loop when num-records exceeds max-records (in our case 225 vs 100).
Perhaps the max-records should be made configurable? Alternatively, inform the user about this issue, along with an appropriate Nagios error code, instead of hanging?

Do you need any more info about this particular issue we're having?

Thanks

Regards,
Robert

Performance data for Volumes that are in Critical state is not included in output

Bug report

When a volume is reporting as Critical, then performance data for that volume is not output by the check.

  • check_netapp_ontap version: v3.03.200924
  • NetApp Ontap version: 9
  • Monitoring solution: OMD 3.31~2020-09-13-labs-edition

Expected Behavior
Performance data should be output for all Volumes regardless of threshold state.

Actual Behavior
Performance data is output for all volumes that are not in a Critical state.

How to reproduce Behavior
Set a threshold which will result in some volumes going into a critical state and check the performance data output; volumes that are in a critical state will not be included in the performance data output.

The issue is caused by the sub space_threshold_helper, which is called twice, first with $alertlevel = 2 and then with $alertlevel = 1. Volumes are removed from the list of volumes during the first call if they exceed the critical threshold; however, $perfoutput is only produced if ($intAlertLevel == 1). Hence, volumes that exceeded the critical threshold are not processed in the second call to the sub and therefore do not get any perfdata output.

Snapmirror lag time

How to monitor snapmirror lag time,

/usr/local/nagios/libexec/check_netapp_ontap.pl -H NETAPP_C-MODE_IP -n SVM_NODE -u USERNAME -p PASSWD -c public -o snapmirror_health -w 1h -c 2h
OK - No problems found (0 checked)

The above returns 0 checked?

400 Bad Request with ONTAP 9.3

Issue Type

Bug report

Issue Detail

  • check_netapp_ontap version: v3.01.171611
  • NetApp Ontap version: 9.3
  • Monitoring solution: Opsview/Nagios

Expected Behavior
return aggregate usage

Actual Behavior
./check_netapp_ontap.pl -H -u opsview -p -o aggregate_health -m include,n1_aggr0 -w 95% -c 98%
Smartmatch is experimental at ./check_netapp_ontap.pl line 425.
Smartmatch is experimental at ./check_netapp_ontap.pl line 426.
Smartmatch is experimental at ./check_netapp_ontap.pl line 427.
Failed test query: NaServer::parse_xml - Error in parsing xml:
syntax error at line 1, column 49, byte 49:

================================================^

<title>400 Bad Request</title> at /usr/local/nagios/perl/lib/x86_64-linux-gnu-thread-multi/XML/Parser.pm line 187.

How to reproduce Behavior
Run it any time; nothing seems to work. I have configured the opsview user with an ontapi application readonly role and a password.

Failed test query: NaServer::parse_xml

Issue Detail

  • check_netapp_ontap version: 3.01.171611
  • NetApp Ontap version: 9.3
  • Monitoring solution: EyesOfNetwork 4.1 / Centos6.5

Hello,

I try to use the script, but when I run the following command:
check_netapp_ontap.pl -H HOST_IP -u admin -p admin_pass -o cluster_health
(for example; it's the same issue every time)

I have this error:

Failed test query: NaServer::parse_xml - Error in parsing xml:
syntax error at line 1, column 49, byte 49:
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
================================================^
<html><head>
<title>400 Bad Request</title>
at /usr/lib64/perl5/XML/Parser.pm line 187

I don't really understand what the problem is here...

Maybe my Perl version is too old (perl-5.10.1-136.el6.x86_64) or something like that.
I searched for requirements but didn't find anything (https://outsideit.net/check-netapp-ontap/ is offline)

thanks for your help

Assistance Needed

After following the instructions for installation, I receive the following when executing /check_netapp_ontap.pl -H ip_address -u username -p password -o disk_health:

Failed test query: No elements in API request

I'm extremely unfamiliar with OnTap and any guidance is appreciated.

volume_health, Division by zero when offline volumes exist

Running the latest version (2.5.10) of this script; this also occurred in previous versions at different line numbers.

Ontap version 9.2, with offline volumes present:

# ./check_netapp_ontap.pl -H hostname -u user -p password --option volume_health -w 98%,96%i -c 99%,98%i
Use of uninitialized value in division (/) at ./check_netapp_ontap.pl line 1207.
Use of uninitialized value in division (/) at ./check_netapp_ontap.pl line 1207.
Illegal division by zero at ./check_netapp_ontap.pl line 1207.

It looks like this script needs more logic to ignore offline volumes, because this command works (adding offline as a critical state):

# ./check_netapp_ontap.pl -H hostname -u user -p password --option volume_health -w 98%,96%i -c 99%,98%i,offline
svm1/home1 - 100950109/105382963 (96%) INODES USED, svm2/home1_snap - 100869566/105382963 (96%) INODES USED 

Script also works as expected if you bring all volumes back online.

Need to add iterator code fix for Quota space checking also

The iterator code fix for the volume check needs to be added to quota checking also. I copied that code and also added code for configurable quota threshold alerting (i.e. alert when quota usage exceeds X% of the hard limit, even if no threshold or soft limit is defined). I only use hard limits for my quotas, but I still want Nagios to alert me when usage is high, the same as is done for volumes. Here are my changes below. I am not too familiar with GitHub, so I'll just paste them here, if you want to incorporate them into the code.
Thanks,
Moshe

793a796
>         my $nahTag = NaElement->new("tag");
804a808,809
>         $nahQuotaIterator->child_add_string("max-records", 100);
>         $nahQuotaIterator->child_add($nahTag);
807c812
<                         $nahQuotaIterator->child_add_string("tag", $strActiveTag);

---
>                         $nahTag->set_content($strActiveTag);
810d814
<                 $nahQuotaIterator->child_add_string("max-records", 200);
846c850,852
<       my $hrefQuotaInfo = shift;

---
>       #my $hrefQuotaInfo = shift;
>       my ($hrefQuotaInfo, $strWarning, $strCritical) = @_;
>       my ($hrefWarnThresholds, $hrefCritThresholds) = space_threshold_converter($strWarning, $strCritical);
866a873,889
>                       # Added for Quota threshold alerting (ML)
>                       if (defined($hrefWarnThresholds->{'space-percent'}) || defined($hrefCritThresholds->{'space-percent'})) {
>                               my $intUsedPercent = ($hrefQuotaInfo->{$strQuota}->{'space-used'} / $hrefQuotaInfo->{$strQuota}->{'space-hard-limit'}) * 100;
>                               $intUsedPercent = floor($intUsedPercent + 0.5);
>                               my $intThreshToBytes = $hrefQuotaInfo->{$strQuota}->{'space-hard-limit'}*1024;
>                               my $strReadableThresh = space_to_human_readable($intThreshToBytes);
>                                 my $strNewMessage = $strQuota . " - " . $strReadableUsed . "/" . $strReadableThresh . " \(" . sprintf("%d",$intUsedPercent) . "\%\)" . " USED";
>                                 if ($intUsedPercent >= $hrefWarnThresholds->{'space-percent'}) {
>                                       $strOutput = get_nagios_description($strOutput, $strNewMessage);
>                                         $intState = get_nagios_state($intState, 2);
>                                 }
>                                 if ($intUsedPercent >= $hrefCritThresholds->{'space-percent'}) {
>                                       $strOutput = get_nagios_description($strOutput, $strNewMessage);
>                                         $intState = get_nagios_state($intState, 1);
>                                 }
>                         }
>                       # Added for Quota threshold alerting (ML)

netapp_alarms and root aggregate usage

I've configured the netapp_alarms check and the root aggregates are always making this check critical. They are always at 95% on the filer and do not grow.

Can the root volume space be excluded from the check?

Use of uninitialized value $strOption in lc

Issue Type
Bug report

Issue Detail

$ sudo -u icinga /usr/lib64/nagios/plugins/contrib/check_netapp_ontap.pl -H 10.193.1.1 --user "USER" --password "PASSWORD" aggregate_health
Use of uninitialized value $strOption in lc at /usr/lib64/nagios/plugins/contrib/check_netapp_ontap.pl line 1946.
Failed test query: NaServer::parse_xml - Error in parsing xml:
syntax error at line 1, column 49, byte 49:
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
================================================^
<html><head>
<title>400 Bad Request</title>
 at /usr/lib64/perl5/vendor_perl/XML/Parser.pm line 187.
  • check_netapp_ontap version: v2.07.170621
  • NetApp Ontap version:
  • Monitoring solution: Icinga 2

Expected Behavior

Actual Behavior

How to reproduce Behavior

volume_health and offline volume

Hi,

When a volume is offline, volume_health uses the uninitialized value in $hrefVolInfo->{$strVol}->{'space-total'} (line 1202, version 2.5.10) because an offline volume doesn't have space information.

You need to test whether $hrefVolInfo->{$strVol}->{'space-total'} is defined before doing the division.

Test:
I have added some print statements:

	#	if (defined($hrefVolInfo->{$strVol}->{'space-total'})) {
		    print ($hrefThresholds->{'space-percent'} . "\n");
		    print ($strVol . "\n");
		    print ($hrefVolInfo->{$strVol}->{'space-total'} . "\n");
			my $intUsedPercent = ($hrefVolInfo->{$strVol}->{'space-used'} / $hrefVolInfo->{$strVol}->{'space-total'}) * 100;
        	$intUsedPercent = floor($intUsedPercent + 0.5);
        	my $strReadableUsed = space_to_human_readable($hrefVolInfo->{$strVol}->{'space-used'});
        	my $strReadableTotal = space_to_human_readable($hrefVolInfo->{$strVol}->{'space-total'});
        	my $strNewMessage = $strVol . " - " . $strReadableUsed . "/" . $strReadableTotal . " (" . $intUsedPercent . "%) SPACE USED";

=====

perl check_netapp_ontap-debug.pl -o volume_health -H keris.enssat.fr -u xxxx -p xxxx -w offline -n sftestkerb
sftestkerb/vol_04052017_174853 is offline

perl check_netapp_ontap-debug.pl -o volume_health -H keris.enssat.fr -u xxxx -p xxxx -w 50% -c 51% -n sftestkerb
51
sftestkerb/sfTestKerbData
2638827909120
51
sftestkerb/vol_04052017_174853
Use of uninitialized value in concatenation (.) or string at check_netapp_ontap-debug.pl line 1204.

Use of uninitialized value in division (/) at check_netapp_ontap-debug.pl line 1205.
Use of uninitialized value in division (/) at check_netapp_ontap-debug.pl line 1205.
Illegal division by zero at check_netapp_ontap-debug.pl line 1205.
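A minimal runnable sketch of the suggested guard, using a hypothetical hash shaped like the plugin's $hrefVolInfo (the volume names and sizes here are illustrative, not the plugin's actual data):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use POSIX qw(floor);

# Offline volumes carry no space information, so 'space-total' is undef.
my %vol_info = (
    'sftestkerb/sfTestKerbData'      => { 'space-used' => 100, 'space-total' => 200 },
    'sftestkerb/vol_04052017_174853' => { },   # offline volume
);

for my $vol (sort keys %vol_info) {
    my $info = $vol_info{$vol};
    # Guard against undef and zero before dividing; this prevents both the
    # "uninitialized value" warnings and the "Illegal division by zero" crash.
    next unless defined $info->{'space-total'} && $info->{'space-total'} != 0;
    my $pct = floor($info->{'space-used'} / $info->{'space-total'} * 100 + 0.5);
    print "$vol - $pct% SPACE USED\n";
}
```

Run standalone, this prints only the online volume ("sftestkerb/sfTestKerbData - 50% SPACE USED") and silently skips the offline one, which matches the behavior the report asks for.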

Hard Limit Quota isn't working - Checking if used is higher than hard limit (will never be)

Broken Feature - Quota Hard Limit

Bug report

Issue Detail

  • check_netapp_ontap version: All
  • NetApp Ontap version: 8.3.2P5

Expected Behavior
I use hard quotas on my qtrees.
When a quota is about to be full I need an alert.
I expect something like this:
ERROR: volume_name/qtree_name - HARD Quota at 87%

Actual Behavior
The code does this:

if (used >= hard-quota) {
    alert
}

This will never trigger, because the used space/files can never exceed the hard quota.
So the result is always:
OK - No problems found (77 checked)

Suggested Solution
I've done the following:

  • Top of the code added:
# DEFAULTS
my $QUOTA_HARD_LIMIT_ERROR_THRESHOLD = 0.85;
  • in sub calc_quota_health changed these:
       if ($hrefQuotaInfo->{$strQuota}->{'space-hard-limit'} ne "-") {
            # Fixing bug - testing that space-used is greater than space-hard-limit is wrong - space-used can never be bigger!
            my $quotaPercentage = $hrefQuotaInfo->{$strQuota}->{'space-used'} / $hrefQuotaInfo->{$strQuota}->{'space-hard-limit'};
            #if ($hrefQuotaInfo->{$strQuota}->{'space-used'} >= $hrefQuotaInfo->{$strQuota}->{'space-hard-limit'}) {
            if ($quotaPercentage >= $QUOTA_HARD_LIMIT_ERROR_THRESHOLD) {
  • and These lines:
        if ($hrefQuotaInfo->{$strQuota}->{'files-hard-limit'} ne "-") {
            # Fixing bug - testing that files-used is greater than files-hard-limit is wrong - files-used can never be bigger!
            #if ($hrefQuotaInfo->{$strQuota}->{'files-used'} >= $hrefQuotaInfo->{$strQuota}->{'files-hard-limit'}) {
            my $quotaPercentage = $hrefQuotaInfo->{$strQuota}->{'files-used'} / $hrefQuotaInfo->{$strQuota}->{'files-hard-limit'};
            if ($quotaPercentage >= $QUOTA_HARD_LIMIT_ERROR_THRESHOLD) {
  • Result is something like this:
    virtual_storage_node/volume/qtree - 146.37GB/170.00GB SPACE USED
With an exit code of 2.

This should resolve the issue (but it still lacks the ability to set the WARNING or CRITICAL thresholds).
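To address the remaining gap, the fixed 0.85 ratio could be replaced with configurable thresholds. A minimal runnable sketch, assuming warn/crit ratios that in the plugin would come from the -w/-c options (the variable names and values here are illustrative):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical warn/crit ratios; in the plugin these would be parsed from -w/-c.
my ($warn_ratio, $crit_ratio) = (0.85, 0.95);

# Returns the Nagios state for one quota, given used and hard-limit values
# ("-" means no hard limit is set, as in the ONTAP quota report).
sub quota_state {
    my ($used, $hard) = @_;
    return 'OK' if $hard eq '-' || $hard == 0;   # no hard limit configured
    my $ratio = $used / $hard;
    return 'CRITICAL' if $ratio >= $crit_ratio;
    return 'WARNING'  if $ratio >= $warn_ratio;
    return 'OK';
}

print quota_state(146.37, 170.00), "\n";   # 86% of the hard limit -> WARNING
```

Mapping the two ratios onto the plugin's existing -w/-c handling would make the quota check consistent with the other checks.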

Getting error when invoking the check_netapp_ontap.pl

Hi,

I am getting this error when invoking the Perl command:
[root@blr-monitor libexec]# /usr/bin/perl ./check_netapp_ontap.pl -H 10.0.0.20 -u root -p password disk_health

Failed test query: in Zapi::invoke, cannot connect to socket

Please help me out.

Thanks,
Krishna M S

"disk_health" considers an "admin failed" disk as normal

Issue Type

Bug report

Issue Detail

  • check_netapp_ontap version: v3.04.201124
  • NetApp Ontap version: 9.6.0 ( simulator, 2 nodes )
  • Monitoring solution: LibreNMS

Expected Behavior
When a disk fails, "disk_health" changes to warning during the reconstruct.
After the reconstruct completes, "disk_health" should keep the warning state because there is still a failed disk in the list.

Actual Behavior
After the reconstruct completes, "disk_health" becomes normal even though there is a failed disk in the list.

Edit: I've tried adding "-w 1" to set the warning level at 1 failed disk.
Edit 2: Sorry, I didn't notice that "disk_health" can't set warning/critical levels; only "disk_spare" can.

How to reproduce Behavior
Fail a disk manually, wait for the reconstruct to complete, then check the exit code from "disk_health".

snapmirror_health has uninitialized value

This happens with the plugin from both the master and the dev branch (as of today).

./check_netapp_ontap.pl -H mynetapp -u monitoring -p secret -o snapmirror_health -w 1h
Use of uninitialized value in string eq at ./check_netapp_ontap.pl line 720.
Use of uninitialized value in string eq at ./check_netapp_ontap.pl line 720.
OK - No problems found (23 checked)

Line 720 in this case is:

                    if ($nahSM->child_get_string("relationship-control-plane") eq "v2") {
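Testing defined() on the return value before comparing silences the warning. A self-contained sketch with stand-in values for what $nahSM->child_get_string("relationship-control-plane") can return (on some relationships the element is missing and the call returns undef):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Stand-ins for the SDK call's possible return values, including undef.
for my $plane ("v2", undef) {
    # Checking defined() first avoids "Use of uninitialized value in string eq".
    if (defined $plane && $plane eq "v2") {
        print "control plane v2\n";
    }
    else {
        print "control plane not v2 (or unset)\n";
    }
}
```

In the plugin, the same pattern would wrap the child_get_string result at line 720.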

quota health check doesn't check soft limits if hard limits are defined

We have hard and soft limits, both on space and files, and expected this plugin to raise warnings when soft limits were breached and critical alerts when hard limits were breached. We received the critical alerts but not the warnings, by which time issues were already being experienced.

Looking at the code in the calc_quota_health subroutine, it will not check the soft limits if a hard limit is defined, because of the elsif statements.
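One possible restructure checks the breach of each limit rather than chaining on whether a hard limit is defined, so a configured hard limit no longer hides a breached soft limit. A runnable sketch with an illustrative quota record (not the plugin's exact code):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Illustrative quota record; "-" means the limit is not set, as in ONTAP.
my %quota = (
    'space-used'       => 90,
    'space-soft-limit' => 80,
    'space-hard-limit' => 100,
);

my $state = 'OK';
# A breached hard limit is critical; otherwise a breached soft limit still
# produces a warning instead of being skipped entirely.
if ($quota{'space-hard-limit'} ne '-'
    && $quota{'space-used'} >= $quota{'space-hard-limit'}) {
    $state = 'CRITICAL';
}
elsif ($quota{'space-soft-limit'} ne '-'
    && $quota{'space-used'} >= $quota{'space-soft-limit'}) {
    $state = 'WARNING';
}
print "$state\n";   # soft limit breached, hard limit not -> WARNING
```

The same structure applies to the files-soft-limit/files-hard-limit pair.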

[Enhancement Request] Perfdata Total size of Aggregates and Volumes

check_netapp_ontap version: v3.01.171611
Monitoring solution: Icinga2 + InfluxDB/Grafana

Hi all,
Currently the perfdata metrics are *_used and *_free. An additional *_totalsize metric would be nice.
The warning and critical thresholds in the perfdata would also be nice.

Regards,
Marcus
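The standard Nagios plugin perfdata format already has fields for both requests: 'label'=value[UOM];warn;crit;min;max, where the max field can carry the total size. A sketch of what an extended metric could look like (the volume name and byte values below are illustrative):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Illustrative values: 80%/90% thresholds on a volume's total size.
my ($vol, $used, $total) = ('nlcl01_exch/vol_EXCH_NL_LOG', 1606890414080, 2308974419968);
my ($warn_pct, $crit_pct) = (80, 90);

# 'label'=value;warn;crit;min;max -- thresholds as absolute bytes, max = total.
printf "'%s_used'=%dB;%d;%d;0;%d\n",
    $vol, $used, $total * $warn_pct / 100, $total * $crit_pct / 100, $total;
```

Graphing tools such as Grafana/InfluxDB can then derive usage percentages and draw threshold lines from the same metric.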

Script is not working (400 Bad Request) after upgrading to Ontap 9.3P9 (from 9.1x)

Issue Type

Bug report

Issue Detail
After updating to ONTAP 9.3P9 I get the following error when running the script:
--> Failed test query: Server returned HTTP Error: 400 Bad Request

I have already read issue #76 and copied the content of "netapp-manageability-sdk-9.3/lib/perl/NetApp/*" to "/usr/lib64/perl5", but I still get the same error.

The cluster hostname contained underscores, so I also changed it to one with hyphens (from netapp_cluster_01 to netapp-cluster-01).
DNS resolving is working.

  • check_netapp_ontap version: v3.01.171611
  • NetApp Ontap version: 9.3P9
  • Monitoring solution: Nagios 4.3.4 (on CentOS 6.9)

How to reproduce Behavior
For example run /usr/local/nagios/libexec/check_netapp_ontap.pl -H -u -p -o interface_health

DIMM status check

Hello,
It would be great to add a DIMM status check via the filer_hardware_health option.
Output example:

>system controller memory dimm show
			               DIMM    UECC      CECC      CPU                 Slot
			Node           Name    Count     Count     Socket     Channel  Number  Status
			-------------- ------- --------  --------  ---------  -------  ------  ------
			node1          DIMM-1         0         0          0       0        0  ok
			node1          DIMM-NV1       0         0          0       1        1  ok
			node2          DIMM-1         1         0          0       0        0  ok
			node3          DIMM-NV1       0         0          0       1        1  ok
			4 entries were displayed.

BR,
Yannick

Zapi::invoke fails

When attempting to run a check:

./check_netapp_ontap.pl -H x.x.x.x -u username -p pass -o cluster_health

I am getting the following error:

Failed test query: in Zapi::invoke, cannot connect to socket

Through some debug statements, I've found that the message originates on line 655 of NaServer.pm; it looks like it's the result of this if statement:

if (!$sock->connect($that_sockaddr)) {

I'm trying to figure out what could be causing this, but I'm not sure. The server is running NetApp 8.3 in C-Mode (not 7-Mode, as in the other GitHub issue). I'm sure this is a configuration issue, but I'm not very strong with Perl or Nagios, so I'm having trouble figuring it out.

"Smartmatch is experimental" in output

Issue Type

Bug report
Issue Detail

  • check_netapp_ontap version: v3.01.171611 (latest)
  • NetApp Ontap version: 9.1 P7
  • NetApp SDK version: 9.4 (latest)
  • Monitoring solution: Icinga2

Expected Behavior

The "Smartmatch is experimental" warning does not show up in the output.

Tested with v2.5.10 on the same machine:

# /usr/lib/nagios/plugins/check_netapp_ontap.pl -H mynetapp -u "monitoring" -p "secret" -o filer_hardware_health
OK - No problems found (8 checked)

Actual Behavior

The "Smartmatch is experimental..." warning is shown on all checks. Here are two examples:

# /usr/lib/nagios/plugins/check_netapp_ontap.pl -H mynetapp -u "monitoring" -p "secret" -o cluster_health
Smartmatch is experimental at /usr/lib/nagios/plugins/check_netapp_ontap.pl line 414.
Smartmatch is experimental at /usr/lib/nagios/plugins/check_netapp_ontap.pl line 415.
Smartmatch is experimental at /usr/lib/nagios/plugins/check_netapp_ontap.pl line 416.
OK - No problem found (2 checked)


# /usr/lib/nagios/plugins/check_netapp_ontap.pl -H mynetapp -u "monitoring" -p "secret" -o filer_hardware_health
Smartmatch is experimental at /usr/lib/nagios/plugins/check_netapp_ontap.pl line 414.
Smartmatch is experimental at /usr/lib/nagios/plugins/check_netapp_ontap.pl line 415.
Smartmatch is experimental at /usr/lib/nagios/plugins/check_netapp_ontap.pl line 416.
OK - No problem found (8 checked)

Relevant code blocks from plugin:

 411         if (defined $strSuboption) {
 412                 $strCheckLIFStatus = $strCheckLIFHomeNode = $strCheckLIFHomePort = 0;
 413                 my @arySuboption = split(",",$strSuboption);
 414                 if ("status" ~~ @arySuboption) { $strCheckLIFStatus = 1; }
 415                 if ("home-node" ~~ @arySuboption) { $strCheckLIFHomeNode = 1; }
 416                 if ("home-port" ~~ @arySuboption) { $strCheckLIFHomePort = 1; }
 417         }

How to reproduce Behavior

Just execute the plugin.
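The warning comes from the smartmatch operator (~~), which Perl has marked experimental since 5.18. A stable replacement is grep with a string comparison; a runnable sketch of the three lines above (the suboption string is illustrative):

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $strSuboption  = "status,home-port";
my @arySuboption  = split(",", $strSuboption);

# grep returns the number of matches; no experimental-feature warning is emitted.
my $strCheckLIFStatus   = (grep { $_ eq "status" }    @arySuboption) ? 1 : 0;
my $strCheckLIFHomeNode = (grep { $_ eq "home-node" } @arySuboption) ? 1 : 0;
my $strCheckLIFHomePort = (grep { $_ eq "home-port" } @arySuboption) ? 1 : 0;

print "$strCheckLIFStatus $strCheckLIFHomeNode $strCheckLIFHomePort\n";   # prints: 1 0 1
```

Swapping grep in at lines 414-416 keeps the behavior identical while silencing the warning on every Perl version.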

Error in snapshot_health check on a zero-sized volume: Illegal division by zero at line 1464

Issue Type

Bug report

Issue Detail

  • check_netapp_ontap version:
  • NetApp Ontap version:
  • Monitoring solution: Centreon / Centreon-Broker

Expected Behavior
No error from the snapshot_health check when a volume is zero-sized

Actual Behavior
Illegal division by zero at check_netapp_ontap.pl line 1464

How to resolve Behavior
At line 1458, add a space-total condition:
From :
if ($hrefVolInfo->{$strVol}->{'state'} eq 'online') {
To :
if ( ($hrefVolInfo->{$strVol}->{'state'} eq 'online') && ($hrefVolInfo->{$strVol}->{'space-total'} ne 0) ) {

Ontap 9.1 - is scrubbing

Since we updated our NetApp to Ontap version 9.1 (previously 8.3), we get warnings from the disk_health check. Some disks are in the status "is scrubbing". Is this a known issue? How can we resolve it?

thanks

no output for volume_health

Hi team,
First of all, thanks for this wonderful script.
I have two NetApp storage systems: one with 170 volumes and another with 500 volumes. For the 170-volume system everything works fine, but on the 500-volume system, when I run the volume_health check with a threshold above 95%, it runs for more than an hour with no output or error. Does the script have an issue with large numbers of volumes?
Please help me with this.

Error... Failed test query: Couldn't find end of Start Tag netapp

I would appreciate help with an error I am receiving. I run:
./check_netapp_ontap.pl -H Mycluster --user netappro --password xxxx -option disk_health
and get: Failed test query: Couldn't find end of Start Tag netapp
Notes: the cluster can be pinged; netappro is a read-only user; I have tried multiple -option choices; it also happens with the 'admin' user.

One additional note: when trying the dev version I get:
./check_netapp_ontap_dev.pl: line 5: syntax error near unexpected token `newline'

Any ideas?

Tom

Ontap 8.3: advanced partitioning (ADP), false CRITICAL result

Hi, with Ontap 8.3 and the new ADP feature:
http://www.datacenterdan.com/blog/netapp-dataontap-83-adp-root-disk-slice-deep-dive

the disk and node checks give us:

$USER1$/check_nac.py -H $HOSTADDRESS$ -u $ARG1$ -p $ARG2$ -t disks

NAC CRITICAL - disk None is shared disk None is shared disk None is shared disk None is shared disk None is shared disk None is shared disk None is shared disk None is shared disk None is shared disk None is shared disk None is shared disk None is shared

$USER1$/check_nac.py -H $HOSTADDRESS$ -u $ARG1$ -p $ARG2$ -t nodes

NAC CRITICAL - node node1 has critical alarm for aggregate_used aggr0_cluster_01_0 node node2 has critical alarm for aggregate_used aggr0_cluster_02_0

Failed test query: NaServer::parse_xml - Error in parsing xml

Issue Type
Bug report

Issue Detail
Since we migrated to NetApp version 9.3P2, the plugin returns this error:
Failed test query: NaServer::parse_xml - Error in parsing xml:
syntax error at line 1, column 49, byte 49:
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
================================================^
<html><head>
<title>400 Bad Request</title>
 at /usr/lib64/perl5/XML/Parser.pm line 187
  • check_netapp_ontap version: v3.01.171611
  • NetApp Ontap version: NetApp Release 9.3P2
  • Monitoring solution: NAGIOS

Scales/units of volume sizes in the perfdata should be configurable

Hi,
We need larger units such as MB or GB, and in a configurable manner; it is sometimes very hard to deal with perfdata like the output below.
$ /usr/lib64/nagios/plugins/check_netapp_ontap.pl -H XX.XX.XX.XX -u USERNAME -p PASSWORD --option volume_health -n nlcl01_exch -w 80% -c 90%
OK - No problems found (5 checked) | 'nlcl01_exch/vol_EXCH_NL_LOG_used'=1606890414080B 'nlcl01_exch/vol_EXCH_NL_LOG_free'=702084005888B 'nlcl01_exch/nlcl01_exch_root_used'=335872B 'nlcl01_exch/nlcl01_exch_root_free'=1019719680B 'nlcl01_exch/vol_EXCH_NL_DB_used'=2616908759040B 'nlcl01_exch/vol_EXCH_NL_DB_free'=2880649379840B 'nlcl01_exch/vol_EXCH_BE_DB_used'=3450560196608B 'nlcl01_exch/vol_EXCH_BE_DB_free'=4575874686976B 'nlcl01_exch/vol_EXCH_BE_LOG_used'=2307689426944B 'nlcl01_exch/vol_EXCH_BE_LOG_free'=4520277782528B
$

Why do I say configurable (and not human readable)? Because when pushing this data to graphing tools we need a single scale/unit.
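A single configurable unit is a small change around the existing byte values; a runnable sketch assuming a hypothetical --perf-unit option (the option name, volume label, and byte count are illustrative):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Binary unit factors; $unit would come from a hypothetical --perf-unit option.
my %factor = (B => 1, MB => 1024**2, GB => 1024**3, TB => 1024**4);
my $unit   = 'GB';

# Converts a raw byte count into the one configured unit for perfdata output.
sub perf_value {
    my ($bytes) = @_;
    return sprintf("%.2f%s", $bytes / $factor{$unit}, $unit);
}

print "'vol_EXCH_NL_LOG_used'=", perf_value(1606890414080), "\n";   # 1496.53GB
```

Because every metric is emitted in the same unit, graphing tools receive a consistent scale instead of mixed magnitudes of raw bytes.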
