
Comments (3)

ilyam8 avatar ilyam8 commented on July 28, 2024

Hey, @withinboredom.

> approximately ^ 100's of thousands of times until the node

  • How did you come to this conclusion?
  • Can you show lsof output with a lot of open WAL/db files?
  • And can you show the "Applications -> disk -> apps.files" chart when there are a lot of open WAL/db files? Netdata tracks the number of open files for application groups.
[Screenshot 2023-07-17 at 20:03:03]
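As a cross-check against lsof, the per-process descriptor counts can be read straight from /proc. A minimal sketch, assuming `pgrep` is available and netdata processes are running on the node:

```shell
# Count open file descriptors for each netdata process by listing
# the /proc/<pid>/fd directory the kernel maintains per process.
for pid in $(pgrep netdata); do
    printf '%s: %s open fds\n' "$pid" "$(ls "/proc/$pid/fd" 2>/dev/null | wc -l)"
done
```

If these counts stay flat over time, the descriptors lsof shows are steady-state usage rather than a leak.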

from helmchart.

ktsaou avatar ktsaou commented on July 28, 2024

@withinboredom I am very sorry you had this bad experience with Netdata.

Please help us find the issue and fix it.

In the current nightly version of Netdata we have added two more monitoring functions, based on your suggestions in netdata/netdata#15411:

  1. apps.plugin monitors the open file descriptors of all processes and raises alerts.
  2. proc.plugin monitors the total file descriptors of the system and raises alerts.

But even before these changes, as @ilyam8 says, we were monitoring file descriptors per application with apps.plugin. Keep in mind that apps.plugin reads the file descriptors of applications from /proc, so even for netdata processes it monitors them as understood by the kernel.
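Concretely, "as understood by the kernel" means every open descriptor of a process appears as a symlink under /proc/<pid>/fd whose target is the open file; this is the view apps.plugin aggregates. A quick way to inspect it, shown here for the current shell:

```shell
# Each open fd of a process is a symlink in /proc/<pid>/fd pointing
# at the underlying file (a leaked WAL/db file would show up here).
ls -l /proc/self/fd
```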

From the output of lsof you posted above, we don't see any leaks in file descriptors.

So, although I understand you have removed Netdata from your systems, could you please help us trace the issue?

I have also read your blog post. You state that a process somehow managed to exhaust all file descriptors of the system and that this led to system-wide corruption. To my understanding this is not technically possible: even if an application is leaking file descriptors, the limits of the process are far below the limits of the system, so the process will start misbehaving but it cannot kill or corrupt the entire system. The ability of a single process to corrupt the entire system would be a big issue for Linux systems.
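The two ceilings being contrasted here can be checked directly on any Linux box: the per-process limit is RLIMIT_NOFILE (what `ulimit -n` reports), while the system-wide ceiling on open file handles is `fs.file-max`. A minimal sketch:

```shell
# Per-process fd limits (soft and hard), enforced on each process:
echo "soft limit: $(ulimit -Sn)"
echo "hard limit: $(ulimit -Hn)"
# System-wide ceiling on open file handles, shared by all processes:
echo "fs.file-max: $(cat /proc/sys/fs/file-max)"
```

On typical distributions the soft limit is in the low thousands while `fs.file-max` is in the millions, which is why one leaking process normally hits EMFILE long before the system as a whole runs dry.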

Anyway, please help us verify that Netdata is leaking file descriptors. If it is, we need to find where. You mention WAL files, but we don't see them in the lsof output you posted.

Can you help us?


withinboredom avatar withinboredom commented on July 28, 2024

I ended up installing Netdata again, but directly on the nodes instead of via the Helm chart. I really like Netdata and couldn't find anything nearly as awesome. I lose some internal monitoring of the cluster, but that's OK with me for now.

> From the output of lsof you posted above, we don't see any leaks in file descriptors.

It was essentially as I posted: hundreds of thousands of lines of it (~500k). I probably should have captured the entire output, but didn't think of it at the time. As you can see from the output, it is entirely netdata that has these files open. That output isn't filtered for netdata; it's the raw output of lsof on the node.
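One way to make that attribution explicit from a saved capture is to tally the first column of the lsof dump (the command name). A sketch using a tiny hypothetical sample in place of the real ~500k-line file; the paths below are illustrative, not taken from the actual capture:

```shell
# Build a tiny stand-in for the saved lsof capture.
cat > lsof-sample.txt <<'EOF'
netdata  20554 netdata  10r  REG  8,1  4096  /var/cache/netdata/dbengine/datafile.ndf
netdata  20554 netdata  11r  REG  8,1  4096  /var/cache/netdata/dbengine/journalfile.njf
etcd      1021 root      3u  REG  8,1  4096  /var/lib/etcd/member/wal/0000000000000000.wal
EOF
# Tally lines per command name; the dominant command owns most open files.
awk '{print $1}' lsof-sample.txt | sort | uniq -c | sort -rn
rm -f lsof-sample.txt
```

Run against the full dump, this collapses half a million lines into a handful of per-command counts.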

At first, I thought it ran out of space on the volume, but post-mortem, the volume only had about 20 MB in use (out of 1 GB). I have no idea why netdata did this.

> You mention wal, but we don't see it in the lsof you posted.

You may have to scroll to the right to read the end of the lines (GitHub doesn't wrap code blocks, apparently).

> Can you help us?

Yeah, absolutely. With this new monitoring, I'd feel much safer installing the helm chart again.

I dug through the nodes logs. Here are some things I saw, before running out of file descriptors:

Jul 04 20:55:44 cameo kernel: netdata[20554]: segfault at 30 ip 00007f19a348a2f9 sp 00007f19a0fab510 error 4 in ld-musl-x86_64.so.1[7f19a3447000+4c000]

There are many logs like that before the system fails from running out of file descriptors. Does a segfault leave open files behind if it happens in a container? I know PID 1 is normally responsible for reaping processes, but I don't know the structure of these containers (if there isn't a PID 1 in the container, that could be it ... 🤔).
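On the segfault question: on Linux the kernel closes all of a process's descriptors when the process dies, including death by SIGSEGV, so a crashed process can only "leave" files open if some surviving process (a child that inherited them, or another holder) still has them. A minimal sketch that crashes a child on purpose and then checks whether any process still holds its file open (run outside a container here; assumes a Linux /proc):

```shell
tmp=$(mktemp)
# Child opens the file on fd 3, then delivers SIGSEGV to itself.
sh -c "exec 3<'$tmp'; kill -SEGV \$\$" 2>/dev/null || true
# The kernel tore down the crashed child's fd table, so no
# /proc/<pid>/fd symlink should point at the file any more.
if ls -l /proc/[0-9]*/fd 2>/dev/null | grep -q "$tmp"; then
    echo "still open"
else
    echo "closed by kernel"
fi
rm -f "$tmp"
```

This suggests the leak, if there is one, would come from unreaped or still-running processes holding inherited descriptors rather than from the crash itself.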

Eventually, it looks like etcd loses access to FDs first (though "first" might just mean it is the most active process on the node), followed by netdata repeatedly attempting to ask k8s for a configmap. Then k8s itself fails. Eventually, enough processes are stuck with too many FDs that it overwhelms the system.

At the time, there were only a handful of containers/processes on this node.

I'll see if I can get the full logs here.

