jobinau / pg_gather
Scan PostgreSQL Instance for potential problems. pg_gather is a SQL-only script leveraging the built-in features of psql.
License: Other
This could happen on a system with a lot of temporary table creation.
VacuumDelay is more of a sleep/delay within the vacuum worker. It shouldn't be a matter of concern, and shouldn't be reported as something on which PostgreSQL is spending its time.
If too many indexes and tables are listed in the report, the browser may hang on devices with fewer resources.
Information like "Connection Since", "Statement Since", and "State Since" goes missing if the data-collection timestamp is not available.
This problem is commonly reported in odd cases where the PostgreSQL version is old (9.6 or below), or when working with PostgreSQL-like software such as Amazon Aurora.
There are two versions of the pg_gather script: one for 9.5 and 9.6, and one for newer versions. This results in unnecessary overhead for the operator. Moreover, when pg_gather is executed (or requested to be executed), the PostgreSQL version is sometimes not known in advance.
There should be a standard way to execute pg_gather without thinking about the underlying complexity. Something like pg_gather.sh -h -P -p -d...
that takes care of checking the target instance's version and runs the required SQL script.
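Since gather.sql already branches on :SERVER_VERSION_NUM internally, one option is to let psql itself pick the right path. A minimal sketch of the idea, where the script names and the version cutoff are hypothetical (note that \if requires a psql 10+ client):

```sql
-- Hypothetical sketch: decide, from within psql, which gather script
-- a wrapper should run, so the operator never needs to know the
-- server version in advance.
SELECT current_setting('server_version_num')::int >= 100000 AS modern \gset
\if :modern
  \echo 'Would run the current gather.sql here'
\else
  \echo 'Would run the legacy 9.5/9.6 script here'
\endif
```

Alternatively, a shell wrapper could run `SHOW server_version_num` once and then exec the appropriate SQL file.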
A few users contacted me directly and reported that they sometimes experience data-import failures due to an additional empty column appearing in the wait-event data for pg_pid_wait.
Even though the issue was reported on a very old version (v7), no one has filed a bug yet, since it was not reproducible.
So I am creating this bug report on behalf of all of them.
In rare cases, importing the data produces an error message as follows at the end:
COPY 1
COPY 1
COPY 236
COPY 2264
COPY 4864
COPY 5
COPY 7
COPY 354
COPY 42928
COPY 16447
COPY 15570
COPY 3855
COPY 6807
COPY 0
COPY 0
COPY 1
COPY 1
psql:<stdin>:95550: ERROR: end-of-copy marker corrupt
CONTEXT: COPY pg_pid_wait, line 2166
When the imported data contains more than ten thousand tables and indexes, PostgreSQL may in rare cases trigger parallel execution of queries, which causes performance issues while generating the report.
As per the earlier design, databases with no blks_fetch information were excluded from the report.
Generally, they are small, newly created databases, so missing them from the report was not a major concern.
However, when we share the report with end users, databases missing from the list can be a problem.
So it is important to list them as well, for the completeness of the report.
When generate_report.sh fails, it may leave a Docker container running. For example, this happens when the target report file is not found.
$ ./generate_report.sh $PWD/CS0018462_out.txt
13: Pulling from library/postgres
Digest: sha256:117c3ea384ce21421541515edfb11f2997b2c853d4fdd58a455b77664c1adc20
Status: Image is up to date for postgres:13
docker.io/library/postgres:13
Docker container is 035c3073ad9ea3f5dd0dc9a46326caa8bd31ff04c517c693d5f8c6c9e137956a; will wait 3 seconds before proceeding
SET
DROP TABLE
...
CREATE TABLE
sed: can't read /home/skuzmichev/projects/pg_gather/CS0018462_out.txt: No such file or directory
$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
035c3073ad9e postgres:13 "docker-entrypoint.s…" 18 seconds ago Up 17 seconds 5432/tcp musing_liskov
...
@psvampa reported that report generation shows an error as follows if data from pg_stat_bgwriter is not available.
psql:gather_report.sql:213: ERROR: division by zero
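Since the failure is a plain division by zero, one low-risk fix is to guard the denominator with NULLIF, so the ratio degrades to NULL instead of erroring out. A sketch against pg_stat_bgwriter; the actual expression at line 213 of gather_report.sql may differ:

```sql
-- NULLIF(x, 0) returns NULL when x is 0, so the division yields NULL
-- instead of raising "division by zero" when no bgwriter data exists.
SELECT checkpoints_timed * 100.0
       / NULLIF(checkpoints_timed + checkpoints_req, 0) AS timed_checkpoint_pct
FROM pg_stat_bgwriter;
```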
Filing this improvement request on behalf of @ylacancellera, who expects net/delay in that table to show up at the database level, which will be easier for users to understand.
If the user runs gather.sql against PostgreSQL 9.6, there is no "backend_type" column and the data import fails.
Hitting this error when running the gather_schema.sql script:
psql:<stdin>:71: ERROR: value "3610288274858318246" is out of range for type integer
CONTEXT: COPY pg_get_activity, line 1, column leader_pid: "3610288274858318246"
Changing the column type to bigint fixed it:
diff --git a/gather_schema.sql b/gather_schema.sql
index f465c32..8bb666f 100644
--- a/gather_schema.sql
+++ b/gather_schema.sql
@@ -68,7 +68,7 @@ CREATE UNLOGGED TABLE pg_get_activity (
gss_auth boolean,
gss_princ text,
gss_enc boolean,
- leader_pid integer
+ leader_pid bigint
)
Currently, pg_gather requires a connection to a specific database in a cluster to get most details. However, it may not be immediately clear which database should be targeted, and moreover there may be dozens of databases in the cluster.
Since we've started planning a wrapper script to run pg_gather, perhaps we can also add a flag like --all-dbs to run pg_gather against every database automatically?
There are global and non-global parts in the resulting report, so we could also discuss separating them if we're aiming for multiple sub-reports. To begin with, however, just having multiple regular reports generated automatically would be a good quality-of-life improvement.
@jobinau, what's your take on this?
\Importing :2021-09-16 02:30:01.492215+00
Partial
Pager usage is off.
COPY 1
psql::39: ERROR: invalid input syntax for type timestamp with time zone: "16405"
CONTEXT: COPY pg_get_activity, line 20, column collect_ts: "16405"
COPY 6198
COPY 9
COPY 0
COPY 0
COPY 1
COPY 1
COPY 8196
Here is a sample error.
COPY pg_gather FROM stdin;
psql:/app/gather.sql:35: ERROR: wal_level must be set to 'logical'
HINT: WAL control functions cannot be executed when wal_level < logical.
Thank you very much for your work!
On the other hand, I get this error when trying to run the report.
Is it normal?
Thanks.
-bash-4.2$ /usr/pgsql-11/bin/psql -X -f gather_report.sql > GatherReport.html
psql:gather_report.sql:180: ERROR: function max(pg_lsn) does not exist
LINE 1: ...TEST((SELECT(current_wal) FROM pg_gather),(SELECT MAX(sent_l...
^
HINT: No function matches the given name and argument types. You might need to add explicit type casts.
When analyzing vacuum-related issues, it's important to know the load profile on the table: is it insert-heavy? Is it update-heavy? How many updates have been done since the last vacuum or analyze? Are most updates HOT?
pg_gather should gather and provide stats similar to what the pg_stat_user_tables view provides:
View "pg_catalog.pg_stat_user_tables"
Column | Type | Collation | Nullable | Default | Storage | Description
---------------------+--------------------------+-----------+----------+---------+---------+-------------
relid | oid | | | | plain |
schemaname | name | | | | plain |
relname | name | | | | plain |
...
n_tup_ins | bigint | | | | plain |
n_tup_upd | bigint | | | | plain |
n_tup_del | bigint | | | | plain |
n_tup_hot_upd | bigint | | | | plain |
n_live_tup | bigint | | | | plain |
n_dead_tup | bigint | | | | plain |
n_mod_since_analyze | bigint | | | | plain |
...
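A sketch of what the collection query could look like, assuming the standard pg_stat_user_tables columns listed above:

```sql
-- Per-table write profile and vacuum-relevant counters.
SELECT schemaname, relname,
       n_tup_ins, n_tup_upd, n_tup_del, n_tup_hot_upd,
       n_live_tup, n_dead_tup, n_mod_since_analyze
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC;
```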
Hello Jobin,
Currently, pg_gather reports the database/relation age. Recently, I have faced a few MultiXactId scenarios. I would like pg_gather to report this as well, if possible.
Best Regards,
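One possible way to surface MultiXactId age is the built-in mxid_age() function (available since 9.5), applied to pg_database.datminmxid; a relation-level variant is sketched as well:

```sql
-- Database-level MultiXactId age.
SELECT datname, mxid_age(datminmxid) AS db_mxid_age
FROM pg_database
ORDER BY db_mxid_age DESC;

-- Relation-level (relminmxid is 0 for relations that don't track one).
SELECT relname, mxid_age(relminmxid) AS rel_mxid_age
FROM pg_class
WHERE relminmxid <> '0'
ORDER BY rel_mxid_age DESC;
```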
It is very difficult to correlate information from the pg_gather report with PostgreSQL logs, as they display timestamp information in different timezones.
So it is important for the pg_gather report to show timestamp information in the same timezone as the PostgreSQL server.
It would be very good to have the time when pg_gather started to collect data logged and displayed in the header of the HTML report. Something as simple as:
select now(); show log_timezone;
which we could later use to match entries in the PostgreSQL log and/or other monitoring systems.
A sample pg_gather (gather.sql) output is:
psql:/app/gather.sql:5: ERROR: syntax error at or near ":"
LINE 1: SELECT ( :SERVER_VERSION_NUM > 120000 ) AS pg12, ( :SERVER_V...
^
psql:/app/gather.sql:7: invalid command \if
SELECT (SELECT count(*) > 1 FROM pg_srvr) AS conlines \gset
\if :conlines
\echo SOMETHING WRONG, EXITING
SOMETHING WRONG, EXITING;
\q
\endif
psql:/app/gather.sql:15: invalid command \endif
Tuples only is on.
\t
\r
psql:/app/gather.sql:27: invalid command \if
COPY pg_srvr FROM stdin;
When the report is loaded, the initial HTML DOM with CSS is displayed as-is.
The analysis data arrives later, on top of that. Currently, jQuery is used to manipulate the DOM, applying CSS for highlighting, and mouseover hints (element.title) are inserted directly into the DOM elements.
All these dynamic changes cause the browser to re-render the HTML page multiple times, causing CPU-usage spikes and sometimes delays. This problem is more visible when the report is large.
Currently, pg_gather doesn't allow telling whether the last vacuum and last analyze were automatic or manual. See https://github.com/jobinau/pg_gather/blob/main/gather.sql#L134
It is important to be able to distinguish between the two types, without adding a lot of noise to the final report.
At the very least, pg_gather should gather the raw data (as much of it as possible), but coalesce the vacuum details when generating the report.
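A sketch of how the report could coalesce the two while keeping the raw columns intact. PostgreSQL's GREATEST ignores NULL arguments, so this works even when only one type of vacuum has ever run:

```sql
SELECT relname,
       last_vacuum,                                    -- manual
       last_autovacuum,                                -- automatic
       GREATEST(last_vacuum, last_autovacuum) AS last_vacuum_any,
       CASE
         WHEN last_vacuum IS NULL AND last_autovacuum IS NULL THEN 'never'
         WHEN last_vacuum IS NULL OR last_autovacuum > last_vacuum THEN 'auto'
         ELSE 'manual'
       END AS last_vacuum_type
FROM pg_stat_user_tables;
```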
Users reported that the temp-generation information doesn't mean much, since it is a cumulative value since the last stats reset.
As per discussion, temp generation per day makes more sense.
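A sketch of deriving a per-day figure from the cumulative counter, using stats_reset from pg_stat_database (the elapsed time is clamped to at least one day, to avoid inflated numbers right after a reset):

```sql
SELECT datname,
       temp_bytes,
       round(temp_bytes
             / GREATEST(EXTRACT(epoch FROM now() - stats_reset) / 86400, 1)
       ) AS temp_bytes_per_day
FROM pg_stat_database
WHERE stats_reset IS NOT NULL;
```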
Analysis of the data present in the table is done using JavaScript.
The columns in the HTML tables are referenced by their indexes; when the number of columns is reduced, the referenced indexes become wrong.
It is possible to override various parameters, like the vacuum scale factor, per table. pg_gather should detect and report on every table that has custom reloptions.
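Detecting such tables is straightforward, since per-table overrides are stored in pg_class.reloptions; a sketch:

```sql
-- Every relation carrying custom storage parameters (reloptions).
SELECT n.nspname AS schema, c.relname, c.reloptions
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE c.reloptions IS NOT NULL;
```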
The parent table of a partitioned setup doesn't store data, so it won't get bloated or aged; it is more of a metadata-only structure.
There is no point in adding it to the analysis.
Having the xmin age from prepared_xacts could help.
I'm thinking of:
select *, (now()-prepared)::INTERVAL as duration, age(transaction) as trx_age from pg_prepared_xacts order by age(transaction) desc;
When we analyze partial data collections during continuous data gathering, parameter values are not collected each time, because that adds unwanted overhead.
But this has an unwanted consequence: the header data goes missing, because it joins with the parameter values.
Hello Jobin!
The Session Details section currently presents wait-event values as cardinal numbers, e.g., WalWriterMain:1937,WALWrite:40,WALSync:16,CPU:7
What do you think about presenting those cardinal numbers as percentages? Wouldn't it be more human-readable/friendly?
Using the cardinal numbers over the data-collection time, we could say something like, "This session was statistically spending ##% of its time doing this, this, and that."
It adds no additional raw information, but it can still help in understanding the output.
Best regards,
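The percentage view could be derived at report time with a window function. A sketch, assuming pg_pid_wait holds one row per sampled (pid, wait_event) pair; the PID below is hypothetical:

```sql
-- Share of samples per wait event for one session.
SELECT wait_event,
       count(*) AS samples,
       round(count(*) * 100.0 / sum(count(*)) OVER (), 1) AS pct
FROM pg_pid_wait
WHERE pid = 12345          -- hypothetical session PID
GROUP BY wait_event
ORDER BY samples DESC;
```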
Compile-time parameter changes can result in hard-to-troubleshoot problems and odd behaviours.
There were historical reasons why the Unix/Linux sed utility is used for cleaning up the collected data before importing it.
Many of those cases have gone away over multiple revisions.
Now, it is conceptually possible to have the collected data clean enough for importing.
So the use of the sed utility can be eliminated.
On a remote connection, sending the same pg_sleep 2000 times is not very desirable.
A feature request: provide a list of extensions that are installed and that are available.
Right now, the only way to guess what's installed/available is by looking at the value of shared_preload_libraries.
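Both lists are available from catalog views, so the collection could be as simple as:

```sql
-- Extensions installed in the connected database.
SELECT extname, extversion FROM pg_extension;

-- Extensions available for CREATE EXTENSION, cluster-wide.
SELECT name, default_version, installed_version
FROM pg_available_extensions;
```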