We have a lot of socket errors on Knot 2.9.4. The most common one is:
OS lacked necessary resources (data: None)
The errors always happen on one of the 12 servers, as we explained previously, so we believe that simply upgrading the Knot version will not solve the issue.
This error occurs so frequently that we have our own documentation and recovery scripts for it. It happens about once every two weeks.
Now we are desperately trying the latest version, 3.1.1, hoping that the issue has disappeared. Unfortunately, we did not find any mention of a related issue in the changelog, and the developer stated that the issue is still there. Nevertheless, we try it anyway.
Preparations
We upgrade all the Knot servers to 3.1.1 and Python libknot to 3.1.2.1, and refactor the agent to match the latest changes in libknot.
First experiment
The error happens directly after the agent starts. We reported the issue through the Knot Gitter channel, and Daniel Salzman (the current Knot DNS maintainer) suggested that we pass B (blocking) to the knotc send_block command.
We agreed:
def send_block(
ttl=ttl,
rtype=rtype,
data=data,
- flags=flags,
+ flags="B",
filter=filter_,
)
resp_ = ctl.receive_block()
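For reference, the change can be isolated in a small helper. This is a minimal sketch, not our agent's actual code: `send_command`, its `blocking` parameter, and the `RecordingCtl` stand-in are hypothetical names, and `ctl` is assumed to be an already connected control object exposing `send_block` and `receive_block` with the keyword arguments shown in the diff above. The command and record values are placeholders.

```python
class RecordingCtl:
    """Stand-in for a connected control object, used only to try the helper
    without a running Knot server. It records what would have been sent."""

    def __init__(self):
        self.sent = []

    def send_block(self, **kwargs):
        self.sent.append(kwargs)

    def receive_block(self):
        return {}


def send_command(ctl, cmd, blocking=True, **params):
    """Send one control command and return the reply (hypothetical helper)."""
    # "B" requests blocking mode, as suggested by the maintainer;
    # None leaves the flags unset, which was our behaviour before the change.
    flags = "B" if blocking else None
    ctl.send_block(cmd=cmd, flags=flags, **params)
    return ctl.receive_block()


# Example with placeholder values; a real agent would pass a connected ctl.
ctl = RecordingCtl()
send_command(ctl, "zone-set", zone="example.com", owner="www",
             ttl="3600", rtype="A", data="192.0.2.1")
```

Keeping the flag behind one parameter makes it easy to run some agents with B and some without, which is useful while we are still unsure whether B is what fixed the error.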
The solution did not work at first, but after we restarted the agent (docker-compose down && docker-compose up -d), the issue disappeared.
We tried to create 3 new domains to reproduce the issue, but no error happened. We then tried to simulate multiple users creating many new domains simultaneously, since we supposed this would be likely to trigger the socket error. We shut down the agent, created 8 new domains, then started the agent. We expected the Kafka broker to push those 8 new-domain messages so fast that the socket would fail. It did not; the socket was fine.
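The burst scenario can also be approximated without Kafka by replaying a backlog of messages back-to-back. The following is a rough sketch under our own assumptions: `replay_backlog`, `fake_handler`, and the message shape are hypothetical, and the handler stands in for whatever the agent does with each new-domain message.

```python
def replay_backlog(messages, handle_message):
    """Feed queued messages to the handler with no pause in between and
    collect every failure instead of stopping at the first one."""
    failures = []
    for message in messages:
        try:
            handle_message(message)
        except Exception as exc:  # broad on purpose: we want to see any error
            failures.append((message, str(exc)))
    return failures


# Eight new-domain messages, as in the experiment (placeholder zone names).
backlog = [{"zone": f"example{i}.com"} for i in range(8)]


def fake_handler(message):
    # Stand-in handler that fails on one zone, only to show what is reported.
    if message["zone"] == "example3.com":
        raise OSError("OS lacked necessary resources")


failures = replay_backlog(backlog, fake_handler)
```

In a real run, `handle_message` would be the agent's own function that talks to the Knot socket, and an empty `failures` list would match what we observed.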
Next, we removed our previous change (the B flag) and repeated both experiments (3 domains step by step, 8 domains simultaneously). If the error had occurred, we could have been sure that B was the solution. Sadly, neither experiment made the socket fail.
We tried to reproduce the error directly using knotc, as this was reproducible in 2.9.2:
[centos@knot-slave-2 knot]$ sudo knotc -t 30 reload
OS lacked necessary resources (data: None)
Unfortunately, this does not happen in 3.1.1. If it had, we would have passed B after the error and checked the result.
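The knotc check can be scripted so that it can be repeated on every server. This is a minimal sketch, assuming knotc is installed and sudo does not prompt for a password; `is_resource_error` and `reload_fails_with_resource_error` are hypothetical helper names, and only the command and the error text come from the session above.

```python
import subprocess

RESOURCE_ERROR = "OS lacked necessary resources"


def is_resource_error(output):
    """True if the given knotc output contains the error we are chasing."""
    return RESOURCE_ERROR in output


def reload_fails_with_resource_error(timeout=30):
    """Run `sudo knotc -t <timeout> reload` and report whether it hit the error."""
    result = subprocess.run(
        ["sudo", "knotc", "-t", str(timeout), "reload"],
        capture_output=True,
        text=True,
    )
    # We are not sure which stream knotc writes the message to, so check both.
    return is_resource_error(result.stdout + result.stderr)
```

Calling `reload_fails_with_resource_error()` in a loop on each server would tell us whether the error can still be triggered from knotc alone, independent of the agent.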
At this point, we are not 100% sure that B was the solution.
Now we are deploying the agents, some with B and some without. We will try to stress the socket harder and see the result.