Description
We're using go-livestatus to submit a large number of passive check results to naemon through the livestatus unix socket.
Very important detail: we're opening only one connection to submit 900+ check results.
Many of our probes submit check results for production and test clusters. Sometimes the hosts and services exist in the naemon configuration, sometimes they don't. This is not a real problem for naemon: it simply emits a warning in its logs and ignores the event.
Everything was working fine until one of our probes started to submit a very large number of check results, using livestatus, for services that do not exist in naemon. We then saw the livestatus connection block after ~500 results, and no more check results could be submitted through that connection.
What we think happened
After a little analysis, we were able to identify the problem.
On successful commands, livestatus does NOT respond with anything.
On failed commands, livestatus responds with the error.
The current go-livestatus code does not read command responses from livestatus.
We think that after a certain number of failed commands on the same connection, livestatus blocks if nothing reads its responses.
I didn't dig into the livestatus code, so I'm not 100% sure of the exact reason here.
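To illustrate the mechanism we suspect, here is a minimal sketch that involves neither naemon nor go-livestatus. It assumes a local unix socket where one side keeps writing small replies while the other side never reads. The socket path, the reply text, and the `writeUntilBlocked` helper are all made up for this example. Whether livestatus stalls for this exact reason is still only our guess.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"os"
	"path/filepath"
	"time"
)

// writeUntilBlocked plays the role of a server that answers every command
// with an error line while the client never reads. It returns the number of
// bytes accepted before a write stalled, and whether a stall was observed.
func writeUntilBlocked(limit int) (int, bool, error) {
	dir, err := os.MkdirTemp("", "backpressure")
	if err != nil {
		return 0, false, err
	}
	defer os.RemoveAll(dir)

	ln, err := net.Listen("unix", filepath.Join(dir, "demo.sock"))
	if err != nil {
		return 0, false, err
	}
	defer ln.Close()

	// The client connects but never reads, like a caller that ignores replies.
	client, err := net.Dial("unix", ln.Addr().String())
	if err != nil {
		return 0, false, err
	}
	defer client.Close()

	server, err := ln.Accept()
	if err != nil {
		return 0, false, err
	}
	defer server.Close()

	reply := []byte("error: unknown service\n") // made-up error text
	written := 0
	for written < limit {
		// A short deadline turns "blocked forever" into a detectable timeout.
		if err := server.SetWriteDeadline(time.Now().Add(200 * time.Millisecond)); err != nil {
			return written, false, err
		}
		n, err := server.Write(reply)
		written += n
		if errors.Is(err, os.ErrDeadlineExceeded) {
			return written, true, nil
		}
		if err != nil {
			return written, false, err
		}
	}
	return written, false, nil
}

func main() {
	n, stalled, err := writeUntilBlocked(64 << 20)
	if err != nil {
		panic(err)
	}
	fmt.Printf("bytes accepted before stalling: %d (stalled: %v)\n", n, stalled)
}
```

The number of bytes accepted before the stall depends on the kernel's socket buffer settings, so it will differ between machines.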
Can we reproduce the problem
Yes, the following code can reproduce the problem:
package main

import (
	"fmt"

	livestatus "github.com/vbatoufflet/go-livestatus"
	lnagios "github.com/vbatoufflet/go-livestatus/nagios"
)

func main() {
	// Open a single connection to the livestatus unix socket.
	c := livestatus.NewClient("unix", "/var/cache/naemon/live")
	defer c.Close()

	// Build 900 check results for hosts and services unknown to naemon.
	var lCmds []livestatus.Command
	for i := 0; i < 900; i++ {
		lCmd := lnagios.ProcessServiceCheckResult(
			fmt.Sprintf("hostname-no-exist-%d", i),
			fmt.Sprintf("service-description-no-exist-%d", i),
			0,
			"Output",
		)
		lCmds = append(lCmds, *lCmd)
	}

	// Submit them all over the same connection.
	for idx, cmd := range lCmds {
		resp, err := c.Exec(cmd)
		if err != nil {
			panic(fmt.Sprintf("[%d] failed: %v\n", idx, err))
		}
		fmt.Printf("[%d] resp = %v\n", idx, resp)
	}
}
This program submits 900 check results to naemon for hosts and services that do not exist in the naemon configuration. livestatus then returns an error for each one.
Software versions
naemon server
[root@rocky8 ~]# rpm -qa '*naemon*' | sort
libnaemon-1.3.0-13.16.x86_64
naemon-core-1.3.0-13.16.x86_64
naemon-livestatus-1.3.0-11.16.x86_64
naemon-thruk-1.3.0-10.16.noarch
Go client
# go.mod
[...]
require github.com/vbatoufflet/go-livestatus v0.0.0-20190218065636-65182dd594b0
[...]
❯ go version
go version go1.17.7 linux/amd64
Known workarounds
A simple workaround is to Close() the livestatus connection after submitting a small number of check results and open a new one to continue. This creates other problems (like the livestatus socket being temporarily unavailable) when a burst of check results occurs. But this workaround is available right now and easy to use.
Discussion
We're currently using the workaround of closing and reopening the livestatus socket.
Another solution would be to properly handle the response from livestatus.
The real problem is that, AFAIK, livestatus does not respond with anything if a COMMAND executed successfully. It only responds if an error occurred.
I'm quite curious to know if someone else has already triggered this behavior.
I'm also curious to know if you have any better idea about how to handle this in the go-livestatus library.
Thanks in advance