perforce / p4prometheus Goto Github PK

[Community Supported] Perforce (Helix Core) interface for writing Prometheus metrics from real-time analysis of p4d log files.

License: MIT License

Makefile 1.44% Go 15.20% Python 21.53% Shell 61.04% Dockerfile 0.78%

p4prometheus's People

Contributors

Stargazers

Watchers

p4prometheus's Issues

monitor_metrics.sh - 'p4_errors.prom' metrics are not updated if errors.csv file exists and is empty

We have an alert of files read by node_exporter that alarms if those files are not updated in a timely manner. When 'errors.csv' is empty, the metrics.sh script will not udpate the 'p4_errors.prom' file and this leads to alerts for us: indicating a problem with our monitoring scripts.

'p4_errors.prom' should at least get a new modification time every time monitor_metrics.sh is run.

The cronjobs (monitor_metrics*) get stuck and keep spawning when replica is taking a checkpoint

I have to remove the p4prom installation and the cron jobs from replica because of this.

I have a quite a huge environment where taking a checkpoint would take 12h +
During that time, all perforce command would stuck, including the cron jobs.

Any work around for this?

monitor_metrics.sh is reading the same lines in the p4log and errors.csv on every execution

The way the monitoring script is pulling data that relies on logs seems to have a flaw. It is reading the entire log file ever time it is executed.

This is true for monitor_completed_cmds and monitor_errors functions.

I will be submitting a pull request to fix these problems and others.
To specifically address this problem I used a simple data file to record the last timestamps and line counts from the log files to determine where the last run had left off. It's not perfect but it's a good start

errant quotes in `.prom` files break exporting of data.

When you run a python script that uses p4python in windows it reports its' version with quotes in the string (this does not happen when it is run in Linux) this then does not get sanitized when being pulled by the exporter which in turn breaks the node exporter with the following error level=error ts=2023-04-12T01:00:40.552Z caller=textfile.go:209 collector=textfile msg="failed to collect textfile data" file=p4_cmds.prom err="failed to parse textfile data from \"/var/lib/node_exporter/p4_cmds.prom\": text format parsing error in line 968: unexpected end of label value \"unnamed_p4-python_script_[PY3.9.4/P4PY\""

on the different platforms on my system the revision is reported as on windows. P4PYTHON/"NTX64"/"2022.1"/"2369090" ("2022.1/2361553" API) ("2022"/"11"/"11"). and on Linux as P4PYTHON/LINUX54X86_64/2022.1/2405572 (2022.1/2361553 API) (2023/02/09). to see this on your side open a python instance and

>>> import P4
>>> print(P4.P4.identify())

while there is an argument that this is something that p4python could fix, I am unable to post issues to that repo and this change on the side of P4Prometheus would make it more robust

check_for_replica runs only for non-sdp setup

In monitor_metics.sh, around line 112 the check_for_replica is run only for non-SDP setup. IMO, it should run for both setups, else monitor_pull would fail for SDP setups.

Change needed would be

errors_file=$($p4 configure show | egrep "serverlog.file.*errors.csv" | cut -d= -f2 | sed -e 's/ (.*//')
fi

check_for_replica=$($p4 info | grep -c 'Replica of:')
if [[ "$check_for_replica" -eq "0" ]]; then
     P4REPLICA="FALSE"
else
    P4REPLICA="TRUE"
fi

Handle server=1 logs

Such logs do not have a completed record. Needs to be detected and handled appropriately.

monitor_metrics.py lslocks doesn't show file path

I ran it in centos7 container and opensuse 15 (baremetal), lslocks doesn't show the path of files being locked by p4d. These are locks caused by p4 stats. p4d version 2022.1/2361553

# lslocks -o +BLOCKER
COMMAND           PID  TYPE SIZE MODE  M START END PATH                            BLOCKER
anacron          2723 POSIX   0B WRITE 0     0   0 /var/spool/anacron/cron.weekly  
anacron          2723 POSIX   0B WRITE 0     0   0 /var/spool/anacron/cron.monthly 
p4d              7165 FLOCK   0B READ  0     0   0                                 
p4d              7165 FLOCK   0B READ  0     0   0                                 
p4d              7165 FLOCK   0B READ  0     0   0                                 
rhsmcertd         102 FLOCK   0B WRITE 0     0   0 /run/lock/subsys/rhsmcertd      
lvmetad            97 POSIX   3B WRITE 0     0   0 /run/lvmetad.pid                
crond             101 FLOCK   4B WRITE 0     0   0 /run/crond.pid

lsof actually displays correctly

p4d       7165      root   18uR     REG              259,1  3984678912   28442608 /opt/perforce/depot/db.revdx
p4d       7165      root   19uR     REG              259,1 23456456704   28442609 /opt/perforce/depot/db.revhx
p4d       7165      root   20uR     REG              259,1  8059305984   28442611 /opt/perforce/depot/db.revsx

R=full file read lock, W=full file write lock
You might consider using lsof as it is more reliable.

Make sure config file value output_cmds_by_user: is respected

Set this value to false in p4prometheus.yaml but it wasn't being respected.

Simplify Windows installation (instructions/automated scripts)

This is currently a bit vague.

Create non-SDP version

Make sure all works correctly for non-SDP setup and that it is documented.
Needs a revised monitor_metrics.sh

p4prometheus needs to detect change in server.id (e.g. after failover)

Failed over to a standby and the stats were still being reported against the original serverid

Add alert to detect core dump from p4d/p4broker/p4p.

See this SDP change to standardize location of core dumps. SDP 2023.2 Patch 1 will need to be shipped before location of core dumps is standardized. Review is: https://swarm.workshop.perforce.com/reviews/30128

Consider using 'p4 pull -ls' to get pull queue size in monitor_pull when gathering metrics.

p4prometheus/scripts/monitor_metrics.sh

Line 581 in 7ea8505

$p4 pull -l > "$tmp_pull_queue" 2> /dev/null

On replica/edge servers with large rdb.lbr content, 'p4 pull -l' can read lock rdb.lbr for extended periods of time:

2024/04/22 10:33:10 pid 2952365 bruno@bruno_edge1_ws 127.0.0.1 [p4/2023.2/LINUX26X86_64/2578891] 'user-pull -l'
--- rdb.lbr
--- pages in+out+cached 345601+0+96
--- locks read/write 1/0 rows get+pos+scan put+del 0+0+8775518 0+0
--- total lock wait+held read/write 0ms+199211ms/0ms+0ms

and running this every 60 seconds from the cron can block metadata replication. I believe 'p4 pull -ls' could be used to obtain the pull queue size and seems to have much less of a read-lock impact on rdb.lbr:

2024/04/22 10:37:47 pid 2953239 bruno@bruno_edge1_ws 127.0.0.1 [p4/2023.2/LINUX26X86_64/2578891] 'user-pull -ls'
--- rdb.lbr
--- pages in+out+cached 345601+0+96
--- locks read/write 1/0 rows get+pos+scan put+del 0+0+8775518 0+0
--- total lock wait+held read/write 0ms+5555ms/0ms+0ms

Use a temp file to create prometheus logs

From time to time, I receive bogus metrics.
For example, Prometheus reports data for a none-existing server named “sgn.sgn.1” in addition to the correct server “sgn.1”.
The node_exporter process also occasionally crashes on wrong data in the p4_cmds-*.prom file.
Many errors for this file are logged in te node_exporter logfile.

I think this is because sometimes, this p4_cmds-*.prom file is overwritten while the node_exporter process is still reading it.
The monitor_metrics.sh script handles this correctly in most places.
It defines the name of the prom file in variable fname.
Next, it defines: tmpfname="$fname.$$".
This tmpfname is used as the filename for the new data.
When it is complete, it is moved to the correct name with: mv "$tmpfname" "$fname"

Can something similar be done in the writeMetricsFile function in the p4prom.go file?

Monitor_metrics.sh generates metrics with "unset" in name

This happens if server is down for a bit - detect and avoid.

monitor_metrics.sh causes pull errors when run on non-replica servers

The script does not currently detect if the server is a replica or not, so when the monitor_pull function is executed on a server other then a replica, the error of "Pull only allowed on replica servers."

A simple fix would be to encapsulate the same detection mechanism in the monitor_pull function that's used in the SDP scripts.

monitor_pull () {
    if [[ "${P4REPLICA}" == "TRUE" ]]
        ....existing code
    fi
}

This won't work for non-SDP servers though unless they source a similar vars script.

Gen new version to include latest go-libp4dlog version at v0.9.6

Hi,
Any chance you can generate a new version of the binary to include the latest release of the go-libp4dlog binary, which is now at v0.9.6? It has the MaxRunningCount bumped, which would help me out since I am running into that limit.

[jmillette@ussel-pfse002 ~]$ sudo systemctl status p4prometheus -l
● p4prometheus.service - P4prometheus
   Loaded: loaded (/etc/systemd/system/p4prometheus.service; enabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Mon 2021-06-07 19:47:02 PDT; 18h ago
  Process: 65155 ExecStart=/usr/local/bin/p4prometheus.linux-amd64 --config=/p4/common/config/p4prometheus.yml (code=exited, status=2)
 Main PID: 65155 (code=exited, status=2)

Jun 07 08:00:02 ussel-pfse002 p4prometheus.linux-amd64[65155]: time="2021-06-07T08:00:02-07:00" level=info msg="file was removed, closing and un-watching" fd=7 file=log
Jun 07 19:47:02 ussel-pfse002 p4prometheus.linux-amd64[65155]: panic: ERROR: max running command limit (10000) exceeded. Does this server log have completion records configured (configurable server=3)?
Jun 07 19:47:02 ussel-pfse002 p4prometheus.linux-amd64[65155]: goroutine 23 [running]:
Jun 07 19:47:02 ussel-pfse002 p4prometheus.linux-amd64[65155]: github.com/rcowham/go-libp4dlog.(*P4dFileParser).LogParser.func4(0xc0000e3800, 0x72e1e0, 0xc0000b5080)
Jun 07 19:47:02 ussel-pfse002 p4prometheus.linux-amd64[65155]: /Users/rcowham/go/pkg/mod/github.com/rcowham/[email protected]/p4dlog.go:1421 +0x2a7
Jun 07 19:47:02 ussel-pfse002 p4prometheus.linux-amd64[65155]: created by github.com/rcowham/go-libp4dlog.(*P4dFileParser).LogParser
Jun 07 19:47:02 ussel-pfse002 p4prometheus.linux-amd64[65155]: /Users/rcowham/go/pkg/mod/github.com/rcowham/[email protected]/p4dlog.go:1407 +0x174
Jun 07 19:47:02 ussel-pfse002 systemd[1]: p4prometheus.service: main process exited, code=exited, status=2/INVALIDARGUMENT
Jun 07 19:47:02 ussel-pfse002 systemd[1]: Unit p4prometheus.service entered failed state.
Jun 07 19:47:02 ussel-pfse002 systemd[1]: p4prometheus.service failed.

Is there a Perforce version requirement

Does the integration requires minimum Perforce version or Perforce log level?

Log tailing stops working after an hour

I'm having problems on some servers with the log tailing seeming to stop with no obvious error messages in the log. In this case I restarted the p4prometheus service at 1:34 and it worked until 2:01 at which point all metrics went to zero. The prometheus textfile is being written, but none of the values are changing. Restarting the service gets it working again for a while and then it stops. Some servers which are running the same Ubuntu release and provisioned the same way with Puppet have been working well for days.

I can't seem to correlate this with journal rotations so I don't think it's related to a new log file showing up.

Note that I had this problem with the elastic beat exporter as well, so it's likely a problem with either the go-p4log module or the underlying tailing module.

ug 04 01:34:21 ct-host-u38 p4prometheus[10996]: time="2021-08-04T01:34:21Z" level=info msg="watching new file" directory=/p4/1/logs fd=3 file=log
Aug 04 01:34:25 ct-host-u38 p4prometheus[10996]: time="2021-08-04T01:34:25Z" level=info msg="Resetting running to 923 as encountered server threads message"
Aug 04 02:24:06 ct-host-u38 p4prometheus[10996]: time="2021-08-04T02:24:06Z" level=info msg="watching new file" fd=7 file=log
Aug 04 02:24:06 ct-host-u38 p4prometheus[10996]: time="2021-08-04T02:24:06Z" level=info msg="file was removed, closing and un-watching" fd=3 file=log
Aug 04 06:00:02 ct-host-u38 p4prometheus[10996]: time="2021-08-04T06:00:02Z" level=info msg="watching new file" fd=3 file=log
Aug 04 06:00:02 ct-host-u38 p4prometheus[10996]: time="2021-08-04T06:00:02Z" level=info msg="file was removed, closing and un-watching" fd=7 file=log
Aug 04 07:00:02 ct-host-u38 p4prometheus[10996]: time="2021-08-04T07:00:02Z" level=info msg="watching new file" fd=7 file=log
Aug 04 07:00:02 ct-host-u38 p4prometheus[10996]: time="2021-08-04T07:00:02Z" level=info msg="file was removed, closing and un-watching" fd=3 file=log
Aug 04 15:00:02 ct-host-u38 p4prometheus[10996]: time="2021-08-04T15:00:02Z" level=info msg="watching new file" fd=3 file=log
Aug 04 15:00:02 ct-host-u38 p4prometheus[10996]: time="2021-08-04T15:00:02Z" level=info msg="file was removed, closing and un-watching" fd=7 file=log

monitor_metrics.sh errors out when checkpoint logs don't exist

the function monitor_checkpoint errors out on perforce servers that don't have checkpoint logs. This is due to how the log files are being discovered using ls on line 205

    for f in $(ls -tr /p4/$SDP_INSTANCE/logs/checkpoint.log*);

The error: "ls: cannot access /p4/1/logs/checkpoint.log*: No such file or directory"

I will be submitting a pull request to address this issue and others

monitor_metrics.* script do not properly tolerate running on a single server with multiple SDP instances

We run multiple SDP instances on a shared servers. Some of the temporary/status files in the metrics root directory were getting used by monitor_metrics.py and monitor_metrics.sh scripts leading to confused/incorrect metrics. In particular these files currently are shared improperly between the multiple SDP instances:

$metrics_root/locks.prom
$metrics_root/tmp_info.dat
$metrics_root/tmp_license
$metrics_root/tmp_filesys
/tmp/mon.out
$metrics_root/pullq.out
$metrics_root/pull-lj.out

Handle log file not being present

Have a config option which means log file not being present is not an error.
Instead just poll for it.
Makes life easier when run as a service.

Make monitor_metrics.sh support Windows more easily

Current requirement for git-bash is not ideal.
A rewrite in Golang for cross platform support may be beneficial in various ways.

p4log_lc_last=0 problem

Hello.

when I run $ ./monitor_metrics.sh -p ip:port -u perforce -nosdp command first time.

there's
[p4log][curr timestamp: 1594364793][curr linecount: 78346601]
sed: -e expression #1, char 11: invalid usage of line address 0

there's sed warning message.

I found https://unix.stackexchange.com/questions/107301/how-to-use-variables-in-sed link.

p4log_lc_last=1 is may be right.

One more thing,
last line of monitor_metrics.sh can change 755 to 644
chmod 644 $metrics_root/*.prom

Avoid creation of cruff temp files /p4/metrics if commands fail.

A common workflow in several scripts is to generate data to a temp file first, and move that data to the correct target file name on successful completion, so that only completed metrics data files appear for scraping by the node_exporter.

This enhancement is to better handle the case where commands fail. In that case, the temp files should be deleted.

Perhaps we consider doing something separate to indicate count of failures or some such?

p4prometheus v0.7.5 may have bug for arguments

I am not sure if potentially the required arguments have changed, or if something else is occuring, but after updating one of the perforce instances to the latest p4prometheus release, I am seeing this error frequently:

[root@euodyl-prf001 ~]# systemctl status p4prometheus -l
● p4prometheus.service - P4prometheus
   Loaded: loaded (/etc/systemd/system/p4prometheus.service; enabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Thu 2021-11-04 12:07:01 PDT; 5 days ago
  Process: 127327 ExecStart=/usr/local/bin/p4prometheus.linux-amd64 --config=/p4/common/config/p4prometheus.yml (code=exited, status=2)
 Main PID: 127327 (code=exited, status=2)

Nov 02 09:22:01 euodyl-prf001 p4prometheus.linux-amd64[127327]: time="2021-11-02T09:22:01-07:00" level=info msg="P4Prometheus config: &{Debug:0 ServerID:standby2 SDPInstance:1 UpdateInterval:15s OutputCmdsByUser:true OutputCmdsByUserRegex: OutputCmdsByIP:true CaseSensitiveServer:true}"
Nov 02 09:22:01 euodyl-prf001 p4prometheus.linux-amd64[127327]: time="2021-11-02T09:22:01-07:00" level=info msg="watching new file" directory=/p4/1/logs fd=3 file=log
Nov 02 15:03:10 euodyl-prf001 p4prometheus.linux-amd64[127327]: time="2021-11-02T15:03:10-07:00" level=info msg="Resetting running to 8 as encountered server threads message"
Nov 03 05:00:02 euodyl-prf001 p4prometheus.linux-amd64[127327]: time="2021-11-03T05:00:02-07:00" level=info msg="watching new file" fd=7 file=log
Nov 03 05:00:02 euodyl-prf001 p4prometheus.linux-amd64[127327]: time="2021-11-03T05:00:02-07:00" level=info msg="file was removed, closing and un-watching" fd=3 file=log
Nov 04 05:00:01 euodyl-prf001 p4prometheus.linux-amd64[127327]: time="2021-11-04T05:00:01-07:00" level=info msg="watching new file" fd=3 file=log
Nov 04 05:00:01 euodyl-prf001 p4prometheus.linux-amd64[127327]: time="2021-11-04T05:00:01-07:00" level=info msg="file was removed, closing and un-watching" fd=7 file=log
Nov 04 12:07:01 euodyl-prf001 systemd[1]: p4prometheus.service: main process exited, code=exited, status=2/INVALIDARGUMENT
Nov 04 12:07:01 euodyl-prf001 systemd[1]: Unit p4prometheus.service entered failed state.
Nov 04 12:07:01 euodyl-prf001 systemd[1]: p4prometheus.service failed.

This is my config file

[root@euodyl-prf001 ~]# cat /p4/common/config/p4prometheus.yml 
# Configuration file for p4prometheus
# Normally stored in /p4/common/config

log_path: /p4/1/logs/log
metrics_output: /srv/prometheus/node_exporter/p4_cmds.prom
server_id:
sdp_instance: 1
output_cmds_by_user: true

Any idea what may be causing this?

monitor_metrics.sh generating incorrect noserverid metrics for SDP installations sometimes

Maybe related to "p4 serverid" command - can be replaced for SDP by reading server.id file directly.

Complete full docker-compose demo

Make it easy for people to see how to configure and provide working demo.

monitor_metrics.sh: time between periodic updates of p4_licenses.prom and p4_filesys.prom should be configurable

monitor_metrics.sh only updates p4_licenses.prom and p4_filesys.prom every 60 minutes. Some users of P4Prometheus have alertmanager configurations that warn when files read by node_exporter are not updated frequently enough. The configurations may alert on updates less than every 60 minutes and are shared by more than p4prometheus. To tolerate these cases, monitor_metrics.sh should allow this update time to be configurable.

Implement semaphore mechanism to assure crontab calls stacking up.

Implement a semaphore mechanism in the scripts to assure crontab calls, which run every minute, don't stack up if for some reason the usually quick commands it calls take more than a minute to run.

Massive CPU usage

p4prometheus uses a lot of memory (9 GiB), with no apparent reason:

Is there a way to check what it’s actually doing?

Added Docker-based regression test suite suitable for CI.

Quick Plan:
• v1 – Add files to the p4prometheus to run a local Docker regression test suite.
• v2 – Setup a Jenkins server to run the suite for every change.
• v3 – Expand test coverage.
• v4 – Evolve processes to require new tests when new functionality is added.

Document Windows Usage

It works with Windows together with other utilities to run it as a service.
Or consider wrapping as a Windows service using appropriate go library

upload_grafana_dashboard.sh doesn't work due to clear text json data

original code

# Pull through jq to validate json
payload="$(jq . ${DASHBOARD}) >> $logfile"

# Upload the JSON to Grafana
curl -X POST \
  -H 'Content-Type: application/json' \
  -d "${payload}" \
  "https://api_key:$GRAFANA_API_KEY@$GRAFANA_SERVER/api/dashboards/db" -w "\n" | tee -a "$logfile"

Fix:

# Get path/file parm
DASHBOARD=$1

# Upload the JSON to Grafana
curl -X POST \
  -H 'Content-Type: application/json' \
  -d @${DASHBOARD} \
  "http://api_key:$GRAFANA_API_KEY@$GRAFANA_SERVER/api/dashboards/db" -w "\n" | tee -a "$logfile"

Note: $DASHBOARD is a valid json file generated by create_dashboard.py. There is no need to jq it.

Extend p4prometheus to monitor what's needed for Helix DAM.

Create a version for structured logs

Handle new structured logs from p4d 2019.2:

#1828890 (Bug #95112, #90346, #67029) **
Structured logging improvements.
Structured logs now have a versioned schema that allow new versions
of existing events to be added. Updated versions of events are
represented by having the server version included after a period in
the event type field. The event types with new versions this release
are:
2.49 - CommandEnd
6.49 - Audit
8.49 - NetworkPerformance
9.49 - DatabasePerformance
The 'serverlog.version.N' configurable can be used to pin a
structured log file to a specific server version's format.
To retain the prior structured log events format, set the
'serverlog.version.N' configurable to 48.

perforce / p4prometheus Goto Github PK

p4prometheus's People

Contributors

Stargazers

Watchers

Forkers

p4prometheus's Issues

Recommend Projects

Recommend Topics

Recommend Org