opmdg / check_pgactivity
Nagios remote agent
Home Page: http://opm.readthedocs.io/probes/check_pgactivity.html
License: PostgreSQL License
The ready_archives service is fine, but checking the number of WAL files ready to be archived is not fine-grained enough to know whether the archiving process is just lagging or really failing in some way.
We should add a new service able to detect whether archiving is doing fine or not.
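A rough, untested sketch of what such a service could query, assuming PostgreSQL 9.4 or later where pg_stat_archiver is available:
SELECT archived_count, last_archived_wal, last_archived_time,
       failed_count, last_failed_wal, last_failed_time
FROM pg_stat_archiver;
A growing failed_count, or a last_failed_time newer than last_archived_time, would point to the archiver actually failing rather than merely lagging.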
Hi,
When pg_dump_backups is run for the first time, the following error appears:
Remote command execution failed: Use of uninitialized value in subtraction (-) at /home/dalibo/bin/check_pgactivity line 4493.
This error should be silenced.
Counters buffers_backend and buffers_backend_fsync should be in Bps instead of Nps.
Hello,
I have been trying this for the last 2 days and it didn't work. Please guide me.
Thanks
I use last_vacuum and last_analyze, and these are the same function: check_last_maintenance.
In my case, pg_stat_user_tables has NULL values in the last_vacuum & last_analyze columns
because we don't run vacuum and analyze manually.
So only the last_auto${type} column has a value.
So I modified the statement as follows.
SELECT coalesce(min(
extract(epoch FROM current_timestamp -
CASE (last_auto${type} > last_${type}) or (last_${type} is null)
WHEN 't' THEN last_auto${type}
ELSE last_${type}
END
)::float
), 'NaN'::float),
coalesce(sum(${type}_count), 0) AS ${type}_count,
coalesce(sum(auto${type}_count), 0) AS auto${type}_count
FROM pg_stat_user_tables
Please check whether the above is right or not.
Thank you in advance.
Commit befd73c50f11a6c6a6719dae20f0de7b7585bef4 in PostgreSQL core adds the pg_ls_logdir() and pg_ls_waldir() functions. This affects at least the wal_files and archiver probes.
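For reference, a minimal, untested example of what these functions expose on a PostgreSQL 10 server:
SELECT count(*) AS wal_segments FROM pg_ls_waldir();
SELECT name, size, modification FROM pg_ls_logdir();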
Commit 25fff40798fc4ac11a241bfd9ab0c45c085e2212 introduces monitoring roles. We should prepare some documentation for it.
If there was no access since the last check, no perfdata should be returned for the database(s).
Currently, we have no way to determine if some transaction is blocked for a certain period of time.
A service oldest_waiting seems like a good idea.
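A rough sketch of the query such a service could run, assuming the pre-9.6 waiting column (on 9.6 and later, wait_event_type = 'Lock' would be the equivalent filter):
SELECT coalesce(max(extract(epoch FROM current_timestamp - query_start)), 0) AS oldest_waiting
FROM pg_stat_activity
WHERE waiting;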
Hello,
On line 1001, the code is check_pgactivity -h localhost -p 5492 -s last_vacuum -w 30m -c 1h30m; the parameter -p 5492 should be -p 5432, I think.
Thanks for OPM, I can now monitor some services of my database; I added locks, btree_bloat and backends a few days ago, and it seems quite fine. However, when I add a service using a command like check_pgactivity -h localhost -p 5432 -s last_vacuum -w 30m -c 1h30m
in the nagios service definition file, nagios returns "return code of 13 is out of bounds". Do any other command parameters need to be configured? The database_size service returns the same code of 13.
Thanks a lot.
On a 9.1.1 server, it gives me this error message:
ERROR: set-valued function called in context that cannot accept a set
I tried on a 9.1.15, and it works for me. I don't think this is an issue with the minor releases, but it might be worth looking further into it.
The SQL query fired by check_pgactivity errors out when the timeline, which is hexadecimal, contains digits beyond 0-9:
ERROR: invalid input syntax for integer: "0000001B"
This is probably because the query tries to convert the timeline to an integer from a substring:
substring(s.f from 1 for 8)::integer AS tli
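Since the timeline is hexadecimal, one possible fix (untested sketch) is to cast it through a bit string instead of a plain integer cast:
-- the 'x' prefix makes the bit input parser treat the string as hexadecimal
('x' || substring(s.f from 1 for 8))::bit(32)::integer AS tli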
Currently, the message is for instance:
POSTGRES_STREAMING_DELTA CRITICAL: critical flush lag for name@ip
Adding the human-readable lag size would be better. For instance, something like:
POSTGRES_STREAMING_DELTA CRITICAL: critical flush lag XMB for name@ip
Documentation reports the following for the -d | --dbname option:
Database name to connect to (default: "postgres").
Actually, template1 is what is really used.
With logical replication introduced in PostgreSQL 10, there will be active sessions running for each active publication on the sending server, as shown in pg_stat_activity:
-[ RECORD 2 ]----+--------------------------------------------------
datid | 17359
datname | src
pid | 13786
usesysid | 10
usename | thomas
application_name | local
client_addr | ::1
client_hostname |
client_port | 53628
backend_start | 2017-01-27 18:08:19.93667+01
xact_start |
query_start | 2017-01-27 18:08:19.946993+01
state_change | 2017-01-27 18:08:19.947523+01
wait_event_type | Client
wait_event | WalSenderWaitForWAL
state | active
backend_xid |
backend_xmin |
query | walsender
We will have to handle these sessions correctly in backends and backends_status.
Also, we may monitor subscriptions on the receiving side with the pg_stat_subscription view. Some information is available here: https://www.postgresql.org/docs/devel/static/logical-replication-monitoring.html
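As a starting point on the subscriber side, something like this could be used, assuming the PostgreSQL 10 pg_stat_subscription view:
SELECT subname, pid, received_lsn, latest_end_lsn, latest_end_time
FROM pg_stat_subscription;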
Currently, there's no way to check how bloated the indexes on TOAST relations are.
A dedicated service to check such bloat would be nice.
I would expect a warning to be triggered when running check_pgactivity
with the following arguments:
-s database_size -w 10GB -c 20GB
but the output is the following:
POSTGRES_DB_SIZE OK: 19 database(s) checked | sentry=11414456504B;10737418240;21474836480 ...
where obviously 11414456504B (10.63GB) is above threshold 10737418240 (10GB).
Hello,
Would it be welcome to distinguish page and relation locks held as SIReadLocks, used in SSI, in the locks service?
The aim of this question is to help determine which kind of lock is effectively held while using SSI. Thus, it would help to explain some application or PostgreSQL behavior while using SSI.
Thoughts ?
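For what it's worth, a minimal, untested sketch of a query that distinguishes the granularity of SIReadLocks:
SELECT locktype, count(*)
FROM pg_locks
WHERE mode = 'SIReadLock'
GROUP BY locktype;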
The pending_restart column of pg_settings has been introduced in 9.5 to report setting changes that will only apply on the next restart.
This should probably be integrated in the settings service.
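A minimal sketch of the query the settings service could add, assuming pg_settings.pending_restart (available since 9.5):
SELECT name, setting
FROM pg_settings
WHERE pending_restart;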
I want to be able to use /usr/local/lib/monitoring/plugins/check_pgactivity -s $service
without any connection parameters - and it should initiate psql without connection parameters.
The default behaviour for psql is to connect through the unix socket file, found at the compile-time configured default socket directory (/tmp in the official postgres release, but /var/run/postgres under most distributions), and then connect as the logged-in user (I'll add an nrpe user with superuser privileges, or alternatively just do sudo -u postgres).
Two things:
It's required to add at least one connection parameter. This is just an aesthetic issue; I just don't want to pass a "dummy" connection parameter.
The default host is "localhost". This causes a TCP connection, and the user is no longer authenticated by the OS, so one is asked for a password:
$ sudo -u postgres psql -h localhost
Password:
psql: fe_sendauth: no password supplied
Possibly this could be solved by running up an ident server, but I don't want to go that route.
I could add '--host /tmp' or '--host /var/run/postgres', but I don't like to do such hard-coding as it may be distro-specific where the socket file is located.
I could add some config service file on the host with a password ... but I don't want to complicate the setup.
Last, -h "" seems to work, but it gives some warnings, so it's perhaps best to avoid it:
$ sudo -u postgres /usr/local/lib/monitoring/plugins/check_pgactivity -h '' -s autovacuum
Use of uninitialized value in sprintf at /usr/local/lib/monitoring/plugins/check_pgactivity line 879.
POSTGRES_AUTOVACUUM OK: Number of autovacuum: 0 | ANALYZE=0 VACUUM=0 VACUUM_FREEZE=0 VACUUM_ANALYZE=0 oldest_autovacuum=NaNs max_workers=3
I'm considering writing up a pull request on this one.
The line "use v5.8.0;" prevents check_pgactivity from working with Perl 5.10.0. It's obviously a bug as related in : http://www.nntp.perl.org/group/perl.perl5.porters/2008/01/msg132524.html
Adding use 5.006; before this line fix the issue.
On Amazon RDS, pg_statistic is owned by the rdsadmin role, so many checks fail with a permission denied error.
The pg_stats view should be used instead.
I think https://github.com/ioguix/pgsql-bloat-estimation has an updated query, but if this is not the case, please see the Bucardo postgres nagios plugin here, which I can confirm works with Amazon RDS.
Thank you.
When psql warns about something, the warning is captured, shown in the first line of the result, then parsed, leading to various errors that are not handled, as psql exits with rc=0.
This comes from this piece of code where we redirect stderr to stdout:
$psqlcmd = qq{ $args{'psql'} --set "ON_ERROR_STOP=1" }
. qq{ -qXAf $tmpfile -R $RS -F $FS };
$psqlcmd .= qq{ --dbname='$db' } if defined $db;
$psqlcmd .= qq{ -t } unless defined $get_fields;
$res = qx{ $psqlcmd 2>&1 };
$rc = $?;
For instance:
root# sudo -u postgres psql -qXAt --set "ON_ERROR_STOP=1" -c "select 'test'"
could not change directory to "/root": Permission denied
test
root# echo $?
0
A solution to explore for this bug is to redirect stderr to a temp file and check its size and content before getting further in the parsing etc.
I'll work on this later.
Hi,
Thanks.
Not sure if this is a bug or not.
When I use the pg_dump_backup service and the delta is beyond the warning and/or critical values, it gives me the size of the latest backup in the perfdata. It looks like this:
POSTGRES_PGDUMP_BACKUP CRITICAL: 'b1_delta'=2301687B
From the name of the label, I thought it was the size of the delta for my b1 database. But actually, it is the size of the latest backup. Shouldn't it be the size of the delta?
For now, the probe only calculates the delta between the maximum number of connections (max_connections) and the number of backend connections (pg_stat_activity).
Also, should we bother with superuser_reserved_connections? If yes, we should find a way to handle applications connecting as superuser.
Database connection limit and user connection limit could also be considered, but I think they're too specific and can lead to unexpected alerts, so it would be better not to handle them.
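If superuser_reserved_connections were taken into account, the slots still available to non-superusers could be computed roughly like this (sketch, not what the probe currently does):
SELECT current_setting('max_connections')::int
     - current_setting('superuser_reserved_connections')::int
     - count(*) AS available_for_normal_users
FROM pg_stat_activity;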
PostgreSQL will change its numbering policy starting with the next release, 10, dropping the two-digit major release number in favor of a single number.
To keep backward compatibility with the old policy, pg_version_num will stay compatible with its old behavior.
check_pgactivity does not rely on pg_version_num because it only appeared in 8.2, so it recomputes it based on the "human" format. The current code will likely break with the new version scheme.
We should fix it to use pg_version_num first and fall back on the current implementation if pg_version_num fails.
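For reference, assuming the GUC meant here is server_version_num (available since 8.2), the numeric version can be read directly:
SELECT current_setting('server_version_num')::int AS version_num;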
ERROR: recovery is in progress
HINT: WAL control functions cannot be executed during recovery.
pg_last_xlog_receive_location() should be used on standby instead of pg_current_xlog_location().
The case relpages = 0 is not handled:
[...]
100 * (sub.relpages-est_pages)::float / sub.relpages AS bloat_ratio, is_na
[...]
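One possible fix (untested sketch) is to guard the division with NULLIF, which yields NULL instead of a division-by-zero error when relpages is 0:
100 * (sub.relpages - est_pages)::float
    / nullif(sub.relpages, 0) AS bloat_ratio, is_na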
Hello,
Thank you for this useful plugin we are currently putting in place with Icinga to monitor our Postgres DBs.
While setting up the monitoring, I found out that commit_ratio gives performance data which contains units like 'tps'. Unfortunately, Icinga is unable to process this because it follows the Nagios plugin guidelines:
https://nagios-plugins.org/doc/guidelines.html
http://docs.icinga.org/latest/en/perfdata.html
which do not describe tps as a performance data unit.
Could you take a look into this, please?
--- check_pgactivity.orig 2016-03-16 17:47:57.933416750 +0100
+++ check_pgactivity 2016-03-16 17:49:29.766926601 +0100
@@ -2443,9 +2443,9 @@
$global_commits += $commits;
$global_rollbacks += $rollbacks;
- push @perfdata, sprintf "'%s_commit_rate'=%.2ftps", $db, $commit_rate;
- push @perfdata, sprintf "'%s_rollback_rate'=%.2ftps", $db, $rollback_rate;
- push @perfdata, sprintf "'%s_xact_rate'=%.2ftps", $db, $xact_rate;
+ push @perfdata, sprintf "'%s_commit_rate'=%.2f", $db, $commit_rate;
+ push @perfdata, sprintf "'%s_rollback_rate'=%.2f", $db, $rollback_rate;
+ push @perfdata, sprintf "'%s_xact_rate'=%.2f", $db, $xact_rate;
if ( $rollbacks == 0) {
$ratio = 0;
@@ -2464,7 +2464,7 @@
next THRESHOLD_LOOP;
}
if ( $val eq "rollback_rate" and $crit{$val} < $rollback_rate ) {
- push @msg_crit => sprintf "'%s'=%.2ftps", $prefix, $rollback_rate;
+ push @msg_crit => sprintf "'%s'=%.2f", $prefix, $rollback_rate;
next THRESHOLD_LOOP;
}
if ( $val eq "rollback_ratio" and $crit{$val} < $ratio ) {
@@ -2479,7 +2479,7 @@
next THRESHOLD_LOOP;
}
if ( $val eq "rollback_rate" and $warn{$val} < $rollback_rate ) {
- push @msg_warn => sprintf("'%s'=%.2ftps", $prefix, $rollback_rate );
+ push @msg_warn => sprintf("'%s'=%.2f", $prefix, $rollback_rate );
next THRESHOLD_LOOP;
}
if ( $val eq "rollback_ratio" and $warn{$val} < $ratio ) {
I am using check_pgactivity with Icinga2 instead of Nagios and am getting the "File «/usr/lib64/nagios/plugins/check_pgactivity.data» not recognized as a check_pgactivity status file." error on every call to check_pgactivity (for all services using check_pgactivity), and then 1 minute later the call to check_pgactivity works fine. This pattern repeats on every subsequent call.
Here is an example from Icinga2:
Could this be a race condition for all the services competing for read access to the same status file?
Any assistance would be greatly appreciated,
Thanks.
Hello,
This bug only occurs with the human-readable output format. If the unit finishes with "s", the output is automatically converted to a duration. This happens at least with bgwriter with 'Bps' and commit_ratio with 'tps'.
$ ./check_pgactivity -h /tmp -p 5410 -U thomas -F human -s bgwriter
Service : POSTGRES_BGWRITER
Returns : 0 (OK)
Message : No writes
Perfdata : buffers_backend=0s
Perfdata : checkpoint_timed=0s
Perfdata : checkpoint_req=0s
Perfdata : buffers_checkpoint=0s
Perfdata : buffers_clean=0s
Perfdata : maxwritten_clean=0s
Perfdata : buffers_backend_fsync=0s
Perfdata : buffers_alloc=0s
It's OK in Nagios output format :
$ ./check_pgactivity -h /tmp -p 5410 -U thomas -s bgwriter
POSTGRES_BGWRITER OK: No writes | buffers_backend=0Bps checkpoint_timed=0.00249376558603491Nps checkpoint_req=0Nps buffers_checkpoint=0Bps buffers_clean=0Bps maxwritten_clean=0Nps buffers_backend_fsync=0Nps buffers_alloc=0Bps
$ ./check_pgactivity -h /tmp -p 5410 -U thomas -F human -s commit_ratio
Service : POSTGRES_COMMIT_RATIO
Returns : 0 (OK)
Message : Commits: 2 - Rollbacks: 0
Perfdata : template1_commit_rate=0s
Perfdata : template1_rollback_rate=0s
Perfdata : template1_xact_rate=0s
Perfdata : template1_rollback_ratio=0.00%
Perfdata : postgres_commit_rate=0s
Perfdata : postgres_rollback_rate=0s
Perfdata : postgres_xact_rate=0s
Perfdata : postgres_rollback_ratio=0.00%
The problem lies in sub human_output, which doesn't call is_time. We should verify the other output formats for the same issue.
We use Grafana to display the perf measures in a dashboard, but it's difficult when we have a service with duplicated labels. For example with temp_files, each database has 4 perf measures with the same label, the database name.
I think it would be great to add a suffix (or prefix) to the database name indicating the kind of measure.
For example:
postgres_Fpm=0Fpm postgres_Bpm=0Bpm postgres_temp_files=0Files postgres_temp_size=0B
Instead of:
postgres=0Fpm postgres=0Bpm postgres=0Files postgres=0B
The longest_query service uses something like what I am describing.
Is it possible to add a check for unused indexes?
Josh Berkus had a nice blog post about that in:
http://www.databasesoup.com/2014/05/new-finding-unused-indexes-query.html
Probably also a check for sequential scans on a table since the last execution of the check.
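For the unused-index part, a starting point along the lines of that post could be the following untested sketch (it deliberately skips unique indexes, which may be unused yet still needed):
SELECT schemaname, relname, indexrelname,
       pg_relation_size(indexrelid) AS index_size
FROM pg_stat_user_indexes
JOIN pg_index USING (indexrelid)
WHERE idx_scan = 0
  AND NOT indisunique
ORDER BY pg_relation_size(indexrelid) DESC;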
Some tools take a looong time to execute, and they hold a transaction open as long as they run. Yeah, pg_dump is my target :) I shouldn't get an alert if pg_dump holds a transaction for a long time. It looks like this:
03:39:12.951715 | pg_dump | 14555 | COPY public.<bip> (<bip>) TO stdout;
03:39:12.95212 | pg_dump | 14554 | COPY <bip>.<bip> (<bip>) TO stdout;
03:39:12.952371 | pg_dump | 14552 | COPY <bip>.<bip> (<bip>) TO stdout;
03:39:12.952425 | pg_dump | 14550 | COPY <bip>.<bip> (<bip>) TO stdout;
Probably it should filter out pg_dump connection lines (something like "WHERE application_name='pg_dump' ").
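As an untested sketch of that filtering idea, assuming pg_dump reports itself in application_name:
SELECT max(extract(epoch FROM current_timestamp - xact_start)) AS oldest_xact
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
  AND coalesce(application_name, '') <> 'pg_dump';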
Thanks.
Hello,
Thanks for OPM.
I want to show some queries on the OPM UI, like the current number of connections for each user. I used the --custom_query
service and the connection information appeared on the Nagios UI, but I also want to show it on the OPM UI. What should I do?
Thanks.
I'd like to graph the max_connections value. Would it be possible to add the max_connections value as another "database"?
This should do it:
diff --git a/check_pgactivity b/check_pgactivity
index 14bdadb..2781c40 100755
--- a/check_pgactivity
+++ b/check_pgactivity
@@ -1164,6 +1164,9 @@ sub check_backends {
     FROM pg_stat_database AS s
     JOIN pg_database d ON d.oid=s.datid
     WHERE d.datallowconn
+    UNION
+    SELECT 'max_connections', current_setting('max_connections')::integer,
+        current_setting('max_connections')
 };
@@ -1196,7 +1199,7 @@ sub check_backends {
 LOOP_DB: foreach my $db (@rs) {
-    $num_backends += $db->[1];
+    $num_backends += $db->[1] if $db->[0] !~ '^max_connections$';
     push @perfdata,
         "$db->[0]=$db->[1];$args{'warning'};$args{'critical'};0;$db->[2]";
 }
Quick-n-dirty patch, you're warned :)
The thresholds sent to database_size are applied to the database growth since the last query; this may be slightly unexpected given the name of the service.
Further, the perfdata gives the absolute database size and not the growth. I was staring at the numbers for a while believing I had found a bug, until I read the documentation. I don't think it's appropriate that the thresholds apply to one variable while the actual reported data is another variable. And I even have a use-case where I'd actually like to monitor the absolute database size.
Suggestion: mark database_size as deprecated and split it into two: database_size_growth as a rename of the current database_size (but it should return growth values in the perfdata), and database_total_size, database_disk_size or something like that for a service that checks and returns the actual database size. Check the available disk space on the postgres data partition (if possible) and let percentages in the thresholds apply to the ratio database_size / (database_size + free disk space).
When someone drops a column from a table, the table is not rewritten and VACUUM won't reclaim the space. And check_pgactivity won't detect the space that is used for nothing. Wouldn't it be right for check_pgactivity to check for that kind of issue?
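Detecting the situation itself is cheap; here is an untested sketch of a query listing tables that still carry dropped columns (how much space they waste is another matter):
SELECT n.nspname, c.relname, count(*) AS dropped_columns
FROM pg_attribute a
JOIN pg_class c ON c.oid = a.attrelid
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE a.attisdropped
  AND c.relkind = 'r'
GROUP BY n.nspname, c.relname;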
Commit edc4cf8 introduced some basic support for wait events in 9.6. However, it only handles strong locks, not Buffer Pin waits and the two kinds of LWLock waits. We should think about how we treat them.
Commit edc4cf8 simply handles strong locks as queries waiting for locks, which is the "historical" behavior of the waiting column of pg_stat_activity in older versions of PostgreSQL.
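For discussion, a rough sketch of what the full wait-event picture looks like on 9.6:
SELECT wait_event_type, wait_event, count(*)
FROM pg_stat_activity
WHERE wait_event IS NOT NULL
GROUP BY wait_event_type, wait_event;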
The check_postgres.pl probe offers a simple output format which only outputs one number, and returns zero regardless of the defined threshold.
This would be very useful to have in check_pgactivity too, because it could then be used with Zabbix.
A "one number to rule them all" should be defined for each service. Here is a suggestion for most of them; some would require significant changes to provide anything useful, so I put them at the bottom of the list:
Service Name | Returned value |
---|---|
archive_folder | Number of archived wal files OR age of the latest one ? |
archiver | Number of wal files ready to archive |
autovacuum | Age of the oldest running autovacuum (in seconds) |
backup_label_age | Age of backup label file (in seconds) |
btree_bloat | Bloat size for the most bloated index |
commit_ratio | Commit ratio since last execution |
custom_query | First value of the first column of the resultset |
database_size | Total variation of database sizes, using the already implemented dbinclude and dbexclude options |
hit_ratio | Total hit ratio for all databases according to dbinclude and dbexclude options |
invalid_indexes | Number of invalid indexes |
last_analyze | Age of the oldest analyze or autoanalyze (in seconds) |
last_vacuum | Age of the oldest vacuum or autovacuum (in seconds) |
locks | Total number of locks. Optionally, filter by lock type? |
longest_query | Age of the longest running query (in seconds) |
max_freeze_age | Oldest database in transaction age. |
minor_version | Version number (as in PG_NUM_VERSION ?) |
oldest_2pc | Age of the oldest 2PC xact |
oldest_idlexact | Age of the oldest idle xact |
pga_version | Version number of check_pga |
table_bloat | Bloat in bytes of the most bloated table |
table_unlogged | Number of unlogged tables |
temp_files | Rate (in bytes/s) of temp file generation |
wal_files | Total number of WAL files. |
streaming_delta | Delta in bytes between master and standby. By default, replayed delta but could be filtered |
hot_standby_delta | Delta in bytes between master and standby. By default, replayed delta but could be filtered |
backends | Percentage of used connections OR raw number ? dbinclude and dbexclude could be useful |
backends_status | Maybe an option to filter on a set of specific status should be implemented. Then, number of connections in those status |
bgwriter | Maybe an option to filter on a specific counter should be implemented. Then, this counter |
replication_slots | Delta in bytes on the most lagging replication slot. Possibly filtered by replication slot ? |
configuration | No idea |
connection | No idea |
is_hot_standby | No idea |
is_master | No idea |
is_replay_paused | No idea |
pg_dump_backup | No idea |
settings | No idea |
We do not check for errors while saving the status file in the save() sub.
For reference, here is the doc about this: http://perldoc.perl.org/Storable.html#ERROR-REPORTING
I'm trying to run check_pgactivity via NRPE.
Running command: sudo -u postgres /usr/lib/nagios/plugins/check_pgactivity -h /var/run/postgresql/ -s autovacuum
Command completed with return code 1 and output:
Return Code: 1, Output: NRPE: Unable to read output
For some reason NRPE fails to read the output. Could there be a problem with flushing stdout?
The nrpe daemon runs under the nagios user, so I've added sudo -u postgres. The nagios user should be able to connect to the DB without a password.
/etc/sudoers:
nagios ALL=NOPASSWD:/usr/lib/nagios/plugins/check_pgactivity,/usr/bin/psql
Any ideas what could possibly go wrong?
For instance, pg_stat_bgwriter:
Perfdata contains the difference from the pg_stat_bgwriter counters since last execution.
This has no meaning for the user: you never know the exact duration between 2 executions. If you have a downtime, then you'll have a spike at the next collection.
A more rigorous measure would be the difference divided by the elapsed time. You'd get consistent results no matter whether there are 10 seconds or 20 minutes between calls. And the unit would then make sense: buffer allocations per minute, for instance. Currently, it is just "buffers". It's like mixing distance and speed...
The current command is:
curl --silent --max-time=$timeout -H 'Pragma: no-cache' $url
and should be:
'curl' => "curl --silent --max-time $timeout -H 'Pragma: no-cache' $url"
At the very least with curl 7.19.7.
It is quite annoying to receive an alert when a very small backup grows from e.g. 20 to 22 MB.