
Comments (3)

ilyam8 avatar ilyam8 commented on July 28, 2024

Hey, @withinboredom.

> approximately ^ 100's of thousands of times until the node

  • How did you come to this conclusion?
  • Can you show lsof output with a lot of open WAL/db files?
  • And can you show the "Applications -> disk -> apps.files" chart when there are a lot of open WAL/db files? Netdata tracks the number of open files for application groups.
[Screenshot 2023-07-17 at 20:03:03]
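As a cross-check against lsof, the per-process descriptor counts can be read straight from /proc. A minimal sketch, assuming `pgrep` is available and netdata processes are running on the node:

```shell
# Count open file descriptors for each netdata process by listing
# the /proc/<pid>/fd directory the kernel maintains per process.
for pid in $(pgrep netdata); do
    printf '%s: %s open fds\n' "$pid" "$(ls "/proc/$pid/fd" 2>/dev/null | wc -l)"
done
```

If these counts stay flat over time, the descriptors lsof shows are steady-state usage rather than a leak.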

from helmchart.

ktsaou avatar ktsaou commented on July 28, 2024

@withinboredom I am very sorry you had this bad experience with Netdata.

Please help us find the issue and fix it.

In the current nightly version of Netdata we have added two more monitoring functions, based on your suggestions in netdata/netdata#15411:

  1. apps.plugin monitors the open file descriptors of all processes and raises alerts.
  2. proc.plugin monitors the total file descriptors of the system and raises alerts.

But even before these changes, as @ilyam8 says, we were monitoring file descriptors per application with apps.plugin. Keep in mind that apps.plugin reads the file descriptors of applications from /proc, so even for netdata processes it monitors them as understood by the kernel.
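Concretely, "as understood by the kernel" means every open descriptor of a process appears as a symlink under /proc/<pid>/fd whose target is the open file; this is the view apps.plugin aggregates. A quick way to inspect it, shown here for the current shell:

```shell
# Each open fd of a process is a symlink in /proc/<pid>/fd pointing
# at the underlying file (a leaked WAL/db file would show up here).
ls -l /proc/self/fd
```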

From the output of lsof you posted above, we don't see any leaks in file descriptors.

So, although I understand you have removed Netdata from your systems, could you please help us trace the issue?

I have also read your blog post. You state that a process somehow managed to exhaust all file descriptors of the system and that this led to system-wide corruption. To my understanding this is not technically possible: even if an application is leaking file descriptors, the limits of the process are far below the limits of the system, so the process will start misbehaving but it cannot kill or corrupt the entire system. The ability of a single process to corrupt the entire system would be a big issue for Linux systems.
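The two ceilings being contrasted here can be checked directly on any Linux box: the per-process limit is RLIMIT_NOFILE (what `ulimit -n` reports), while the system-wide ceiling on open file handles is `fs.file-max`. A minimal sketch:

```shell
# Per-process fd limits (soft and hard), enforced on each process:
echo "soft limit: $(ulimit -Sn)"
echo "hard limit: $(ulimit -Hn)"
# System-wide ceiling on open file handles, shared by all processes:
echo "fs.file-max: $(cat /proc/sys/fs/file-max)"
```

On typical distributions the soft limit is in the low thousands while `fs.file-max` is in the millions, which is why one leaking process normally hits EMFILE long before the system as a whole runs dry.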

Anyway, please help us verify that Netdata is leaking file descriptors. If it is, we need to find where. You mention WAL files, but we don't see them in the lsof output you posted.

Can you help us?


withinboredom avatar withinboredom commented on July 28, 2024

I ended up installing Netdata again, but directly on the nodes instead of via the Helm chart. I really like Netdata and couldn't find anything nearly as awesome. I lose some internal monitoring of the cluster, but that's OK with me for now.

> From the output of lsof you posted above, we don't see any leaks in file descriptors.

It was essentially as I posted: hundreds of thousands of lines of it (~500k). I probably should have captured the entire output, but didn't think of it at the time. As you can see from the output, it is entirely netdata that has these files open. That output isn't filtered for netdata; it's the raw output of lsof on the node.
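One way to make that attribution explicit from a saved capture is to tally the first column of the lsof dump (the command name). A sketch using a tiny hypothetical sample in place of the real ~500k-line file; the paths below are illustrative, not taken from the actual capture:

```shell
# Build a tiny stand-in for the saved lsof capture.
cat > lsof-sample.txt <<'EOF'
netdata  20554 netdata  10r  REG  8,1  4096  /var/cache/netdata/dbengine/datafile.ndf
netdata  20554 netdata  11r  REG  8,1  4096  /var/cache/netdata/dbengine/journalfile.njf
etcd      1021 root      3u  REG  8,1  4096  /var/lib/etcd/member/wal/0000000000000000.wal
EOF
# Tally lines per command name; the dominant command owns most open files.
awk '{print $1}' lsof-sample.txt | sort | uniq -c | sort -rn
rm -f lsof-sample.txt
```

Run against the full dump, this collapses half a million lines into a handful of per-command counts.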

At first, I thought it ran out of space on the volume, but post-mortem, the volume only had about 20 MB in use (out of 1 GB). I have no idea why netdata did this.

> You mention wal, but we don't see it in the lsof you posted.

You may have to scroll to the right to read the end of the lines (GitHub doesn't wrap code blocks, apparently).

> Can you help us?

Yeah, absolutely. With this new monitoring, I'd feel much safer installing the helm chart again.

I dug through the nodes logs. Here are some things I saw, before running out of file descriptors:

Jul 04 20:55:44 cameo kernel: netdata[20554]: segfault at 30 ip 00007f19a348a2f9 sp 00007f19a0fab510 error 4 in ld-musl-x86_64.so.1[7f19a3447000+4c000]

There are many logs like that before the system fails from running out of file descriptors. Does a segfault leave open files behind if it happens in a container? I know PID 1 is normally responsible for reaping processes, but I don't know the structure of these containers (if there isn't a PID 1 in the container, that could be it ... 🤔).
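On the segfault question: on Linux the kernel closes all of a process's descriptors when the process dies, including death by SIGSEGV, so a crashed process can only "leave" files open if some surviving process (a child that inherited them, or another holder) still has them. A minimal sketch that crashes a child on purpose and then checks whether any process still holds its file open (run outside a container here; assumes a Linux /proc):

```shell
tmp=$(mktemp)
# Child opens the file on fd 3, then delivers SIGSEGV to itself.
sh -c "exec 3<'$tmp'; kill -SEGV \$\$" 2>/dev/null || true
# The kernel tore down the crashed child's fd table, so no
# /proc/<pid>/fd symlink should point at the file any more.
if ls -l /proc/[0-9]*/fd 2>/dev/null | grep -q "$tmp"; then
    echo "still open"
else
    echo "closed by kernel"
fi
rm -f "$tmp"
```

This suggests the leak, if there is one, would come from unreaped or still-running processes holding inherited descriptors rather than from the crash itself.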

Eventually, it looks like etcd loses access to FDs first (though "first" might just mean it is the most active process on the node), followed by netdata repeatedly attempting to ask k8s for a configmap. Then k8s itself fails. Eventually, enough processes are stuck with too many FDs that it overwhelms the system.

At the time, there were only a handful of containers/processes on this node.

I'll see if I can get the full logs here.

