jobinau / pg_gather
Scan PostgreSQL Instance for potential problems. pg_gather is a SQL-only script leveraging the built-in features of psql.
License: Other
This could happen on a system with a lot of temporary table creation.
VacuumDelay is more of a sleep/delay within the vacuum worker. It shouldn't be a matter of concern, and shouldn't be reported as something on which PostgreSQL is spending its time.
If too many indexes and tables are listed in the report, the browser may hang on devices with fewer resources.
Information like "Connection Since", "Statement Since", and "State Since" goes missing if the data-collection timestamp is not available.
This problem is commonly reported in odd cases where the PostgreSQL version is old (9.6 or below), or when working with PostgreSQL-like software such as Amazon Aurora.
There are two versions of the pg_gather script: one for 9.5 and 9.6, and one for newer versions. This results in unnecessary overhead for the operator. Moreover, when pg_gather is executed (or requested to be executed), the PostgreSQL version is sometimes not known in advance.
There should be a standard way to execute pg_gather without thinking about the underlying complexity. Something like pg_gather.sh -h -P -p -d...
that takes care of checking the target instance's version and runs the required SQL script.
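Since gather.sql already branches on :SERVER_VERSION_NUM internally, one option is to let psql itself pick the right path. A minimal sketch of the idea, where the script names and the version cutoff are hypothetical (note that \if requires a psql 10+ client):

```sql
-- Hypothetical sketch: decide, from within psql, which gather script
-- a wrapper should run, so the operator never needs to know the
-- server version in advance.
SELECT current_setting('server_version_num')::int >= 100000 AS modern \gset
\if :modern
  \echo 'Would run the current gather.sql here'
\else
  \echo 'Would run the legacy 9.5/9.6 script here'
\endif
```

Alternatively, a shell wrapper could run `SHOW server_version_num` once and then exec the appropriate SQL file.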
A few users contacted me directly and reported that they sometimes experience data-import failures due to an additional empty column appearing in the wait-event data for pg_pid_wait.
Even though the issue was reported on a very old version (v7), no one has filed a bug yet, since it was not reproducible.
So I am creating this bug report on behalf of all of them.
In rare cases, importing the data produces an error message as follows at the end:
COPY 1
COPY 1
COPY 236
COPY 2264
COPY 4864
COPY 5
COPY 7
COPY 354
COPY 42928
COPY 16447
COPY 15570
COPY 3855
COPY 6807
COPY 0
COPY 0
COPY 1
COPY 1
psql:<stdin>:95550: ERROR: end-of-copy marker corrupt
CONTEXT: COPY pg_pid_wait, line 2166
When the imported data contains more than ten thousand tables and indexes, PostgreSQL may in rare cases trigger parallel execution of queries, which causes performance issues while generating the report.
As per the earlier design, databases with no blks_fetch information were excluded from the report.
Generally, they are small, newly created databases, so missing them from the report was not a major concern.
However, when we share the report with end users, databases missing from the list can be a problem.
So it is important to list them as well, for the completeness of the report.
When generate_report.sh fails, it may leave a Docker container running. For example, this happens when the target report file is not found.
$ ./generate_report.sh $PWD/CS0018462_out.txt
13: Pulling from library/postgres
Digest: sha256:117c3ea384ce21421541515edfb11f2997b2c853d4fdd58a455b77664c1adc20
Status: Image is up to date for postgres:13
docker.io/library/postgres:13
Docker container is 035c3073ad9ea3f5dd0dc9a46326caa8bd31ff04c517c693d5f8c6c9e137956a; will wait 3 seconds before proceeding
SET
DROP TABLE
...
CREATE TABLE
sed: can't read /home/skuzmichev/projects/pg_gather/CS0018462_out.txt: No such file or directory
$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
035c3073ad9e postgres:13 "docker-entrypoint.s…" 18 seconds ago Up 17 seconds 5432/tcp musing_liskov
...
@psvampa reported that report generation shows an error as follows if data from pg_stat_bgwriter is not available.
psql:gather_report.sql:213: ERROR: division by zero
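Since the failure is a plain division by zero, one low-risk fix is to guard the denominator with NULLIF, so the ratio degrades to NULL instead of erroring out. A sketch against pg_stat_bgwriter; the actual expression at line 213 of gather_report.sql may differ:

```sql
-- NULLIF(x, 0) returns NULL when x is 0, so the division yields NULL
-- instead of raising "division by zero" when no bgwriter data exists.
SELECT checkpoints_timed * 100.0
       / NULLIF(checkpoints_timed + checkpoints_req, 0) AS timed_checkpoint_pct
FROM pg_stat_bgwriter;
```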
Filing this improvement request on behalf of @ylacancellera, who expects net/delay in that table to show up at the database level, which will be easier for users to understand.
If the user runs gather.sql against PostgreSQL 9.6, there is no "backend_type" column and the data import fails.
Hitting this error when running the gather_schema.sql script:
psql:<stdin>:71: ERROR: value "3610288274858318246" is out of range for type integer
CONTEXT: COPY pg_get_activity, line 1, column leader_pid: "3610288274858318246"
Changing the column type to bigint fixed it:
diff --git a/gather_schema.sql b/gather_schema.sql
index f465c32..8bb666f 100644
--- a/gather_schema.sql
+++ b/gather_schema.sql
@@ -68,7 +68,7 @@ CREATE UNLOGGED TABLE pg_get_activity (
gss_auth boolean,
gss_princ text,
gss_enc boolean,
- leader_pid integer
+ leader_pid bigint
)
Currently, pg_gather requires a connection to a specific database in a cluster to get most details. However, it may not be immediately clear which database should be targeted, and moreover there may be dozens of databases in the cluster.
Since we've started planning a wrapper script to run pg_gather, perhaps we can also add a flag like --all-dbs to run pg_gather against every database automatically?
There are global and non-global parts in the resulting report, so we could also discuss separating them if we're aiming for multiple sub-reports. To begin with, however, just having multiple regular reports generated automatically would be a good quality-of-life improvement.
@jobinau, what's your take on this?
\Importing :2021-09-16 02:30:01.492215+00
Partial
Pager usage is off.
COPY 1
psql::39: ERROR: invalid input syntax for type timestamp with time zone: "16405"
CONTEXT: COPY pg_get_activity, line 20, column collect_ts: "16405"
COPY 6198
COPY 9
COPY 0
COPY 0
COPY 1
COPY 1
COPY 8196
Here is a sample error.
COPY pg_gather FROM stdin;
psql:/app/gather.sql:35: ERROR: wal_level must be set to 'logical'
HINT: WAL control functions cannot be executed when wal_level < logical.
Thank you very much for your work!
On the other hand, I get this error when trying to run the report.
Is it normal?
Thanks.
-bash-4.2$ /usr/pgsql-11/bin/psql -X -f gather_report.sql > GatherReport.html
psql:gather_report.sql:180: ERROR: function max(pg_lsn) does not exist
LINE 1: ...TEST((SELECT(current_wal) FROM pg_gather),(SELECT MAX(sent_l...
^
HINT: No function matches the given name and argument types. You might need to add explicit type casts.
When analyzing vacuum-related issues, it's important to know the load profile on the table: is it insert-heavy? Is it update-heavy? How many updates have been done since the last vacuum or analyze? Are most updates HOT?
pg_gather should gather and provide stats similar to what the pg_stat_user_tables view provides:
View "pg_catalog.pg_stat_user_tables"
Column | Type | Collation | Nullable | Default | Storage | Description
---------------------+--------------------------+-----------+----------+---------+---------+-------------
relid | oid | | | | plain |
schemaname | name | | | | plain |
relname | name | | | | plain |
...
n_tup_ins | bigint | | | | plain |
n_tup_upd | bigint | | | | plain |
n_tup_del | bigint | | | | plain |
n_tup_hot_upd | bigint | | | | plain |
n_live_tup | bigint | | | | plain |
n_dead_tup | bigint | | | | plain |
n_mod_since_analyze | bigint | | | | plain |
...
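A sketch of what the collection query could look like, assuming the standard pg_stat_user_tables columns listed above:

```sql
-- Per-table write profile and vacuum-relevant counters.
SELECT schemaname, relname,
       n_tup_ins, n_tup_upd, n_tup_del, n_tup_hot_upd,
       n_live_tup, n_dead_tup, n_mod_since_analyze
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC;
```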
Hello Jobin,
Currently, pg_gather reports the database/relation age. Recently, I have faced a few MultiXactId scenarios. I would like pg_gather to report this as well, if possible.
Best Regards,
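One possible way to surface MultiXactId age is the built-in mxid_age() function (available since 9.5), applied to pg_database.datminmxid; a relation-level variant is sketched as well:

```sql
-- Database-level MultiXactId age.
SELECT datname, mxid_age(datminmxid) AS db_mxid_age
FROM pg_database
ORDER BY db_mxid_age DESC;

-- Relation-level (relminmxid is 0 for relations that don't track one).
SELECT relname, mxid_age(relminmxid) AS rel_mxid_age
FROM pg_class
WHERE relminmxid <> '0'
ORDER BY rel_mxid_age DESC;
```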
It is very difficult to correlate information from the pg_gather report with PostgreSQL logs, as they display timestamp information in different timezones.
So it is important for the pg_gather report to show timestamp information in the same timezone as the PostgreSQL server.
It would be very good to have the time when pg_gather started to collect data logged and displayed in the header of the HTML report. Something as simple as:
select now(); show log_timezone;
which we could later use to match entries in the PostgreSQL log and/or other monitoring systems.
A sample pg_gather (gather.sql) output is:
psql:/app/gather.sql:5: ERROR: syntax error at or near ":"
LINE 1: SELECT ( :SERVER_VERSION_NUM > 120000 ) AS pg12, ( :SERVER_V...
^
psql:/app/gather.sql:7: invalid command \if
SELECT (SELECT count(*) > 1 FROM pg_srvr) AS conlines \gset
\if :conlines
\echo SOMETHING WRONG, EXITING
SOMETHING WRONG, EXITING;
\q
\endif
psql:/app/gather.sql:15: invalid command \endif
Tuples only is on.
\t
\r
psql:/app/gather.sql:27: invalid command \if
COPY pg_srvr FROM stdin;
When the report is loaded, the initial HTML DOM with CSS is displayed as-is.
The analysis data arrives later, on top of that. Currently, jQuery is used to manipulate the DOM, applying CSS for highlighting, and mouseover hints (element.title) are inserted directly into the DOM elements.
All these dynamic changes cause the browser to re-render the HTML page multiple times, causing CPU-usage spikes and sometimes delays. This problem is more visible when the report is large.
Currently, pg_gather doesn't allow telling whether the last vacuum and last analyze were automatic or manual. See https://github.com/jobinau/pg_gather/blob/main/gather.sql#L134
It is important to be able to distinguish between the two types, without adding a lot of noise to the final report.
At the very least, pg_gather should gather the raw data (as much of it as possible), but coalesce the vacuum details when generating the report.
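A sketch of how the report could coalesce the two while keeping the raw columns intact. PostgreSQL's GREATEST ignores NULL arguments, so this works even when only one type of vacuum has ever run:

```sql
SELECT relname,
       last_vacuum,                                    -- manual
       last_autovacuum,                                -- automatic
       GREATEST(last_vacuum, last_autovacuum) AS last_vacuum_any,
       CASE
         WHEN last_vacuum IS NULL AND last_autovacuum IS NULL THEN 'never'
         WHEN last_vacuum IS NULL OR last_autovacuum > last_vacuum THEN 'auto'
         ELSE 'manual'
       END AS last_vacuum_type
FROM pg_stat_user_tables;
```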
Users reported that the temp-generation information doesn't mean much, since it is a cumulative value since the last stats reset.
As per discussion, temp generation per day makes more sense.
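A sketch of deriving a per-day figure from the cumulative counter, using stats_reset from pg_stat_database (the elapsed time is clamped to at least one day, to avoid inflated numbers right after a reset):

```sql
SELECT datname,
       temp_bytes,
       round(temp_bytes
             / GREATEST(EXTRACT(epoch FROM now() - stats_reset) / 86400, 1)
       ) AS temp_bytes_per_day
FROM pg_stat_database
WHERE stats_reset IS NOT NULL;
```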
Analysis of the data present in the table is done using JavaScript.
The columns in the HTML tables are referenced by their indexes; when the number of columns is reduced, the referenced indexes become wrong.
It is possible to override various parameters, like the vacuum scale factor, per table. pg_gather should detect and report on every table that has custom reloptions.
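Detecting such tables is straightforward, since per-table overrides are stored in pg_class.reloptions; a sketch:

```sql
-- Every relation carrying custom storage parameters (reloptions).
SELECT n.nspname AS schema, c.relname, c.reloptions
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE c.reloptions IS NOT NULL;
```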
The parent table of a partitioned setup doesn't store data, so it won't get bloated or aged; it is more of a metadata-only structure.
There is no point in adding it to the analysis.
Having the xmin age from prepared_xacts could help.
I'm thinking of:
select *, (now()-prepared)::INTERVAL as duration, age(transaction) as trx_age from pg_prepared_xacts order by age(transaction) desc;
When we analyze partial data collections during continuous data gathering, parameter values are not collected each time, because that adds unwanted overhead.
But this has an unwanted consequence: the header data goes missing, because it joins with the parameter values.
Hello Jobin!
The Session Details section currently presents wait-event values as cardinal numbers, e.g., WalWriterMain:1937,WALWrite:40,WALSync:16,CPU:7
What do you think about presenting those cardinal numbers as percentages? Wouldn't it be more human-readable/friendly?
Using the cardinal numbers over the data-collection time, we could say something like, "This session was statistically spending ##% of its time doing this, this, and that."
It adds no additional raw information, but it can still help in understanding the output.
Best regards,
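The percentage view could be derived at report time with a window function. A sketch, assuming pg_pid_wait holds one row per sampled (pid, wait_event) pair; the PID below is hypothetical:

```sql
-- Share of samples per wait event for one session.
SELECT wait_event,
       count(*) AS samples,
       round(count(*) * 100.0 / sum(count(*)) OVER (), 1) AS pct
FROM pg_pid_wait
WHERE pid = 12345          -- hypothetical session PID
GROUP BY wait_event
ORDER BY samples DESC;
```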
Compile-time parameter changes can result in hard-to-troubleshoot problems and odd behaviours.
There were historical reasons why the Unix/Linux sed utility is used for cleaning up the collected data before importing it.
Many of those cases have gone away over multiple revisions.
Now, it is conceptually possible to have the collected data clean enough for importing.
So the use of the sed utility can be eliminated.
On a remote connection, sending the same pg_sleep 2000 times is not very desirable.
A feature request: provide a list of extensions that are installed and that are available.
Right now, the only way to guess what's installed/available is by looking at the value of shared_preload_libraries.
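Both lists are available from catalog views, so the collection could be as simple as:

```sql
-- Extensions installed in the connected database.
SELECT extname, extversion FROM pg_extension;

-- Extensions available for CREATE EXTENSION, cluster-wide.
SELECT name, default_version, installed_version
FROM pg_available_extensions;
```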