Comments (3)
Hey, @withinboredom.
approximately ^ 100's of thousands of times until the node
- How did you come to this conclusion?
- Can you show lsof output with a lot of open WAL/db files?
- And can you show the "Applications -> disk -> apps.files" chart when there are a lot of open WAL/db files? Netdata tracks the number of open files for application groups.
from helmchart.
@withinboredom I am very sorry you had this bad experience with Netdata.
Please help us find the issue and fix it.
In the current nightly version of Netdata we added 2 more monitoring functions, based on your suggestions at netdata/netdata#15411:
- apps.plugin monitors the open file descriptors of all processes and raises alerts.
- proc.plugin monitors the total file descriptors of the system and raises alerts.
But even before these changes, as @ilyam8 says, we were monitoring the file descriptors per application with apps.plugin. Keep in mind that apps.plugin monitors the file descriptors of applications from /proc, so even for Netdata's own processes it monitors them as understood by the kernel.
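For anyone who wants to cross-check those numbers by hand, here is a minimal sketch of counting open file descriptors per process name by reading /proc directly. It is not how apps.plugin is implemented; the function name `count_open_fds` and the grouping by the `comm` name are assumptions made for illustration only.

```python
import os
from collections import Counter


def count_open_fds(proc_root="/proc"):
    """Return a Counter mapping process name -> number of open file descriptors."""
    counts = Counter()
    try:
        entries = os.listdir(proc_root)
    except OSError:
        return counts  # proc_root missing or unreadable
    for entry in entries:
        if not entry.isdigit():
            continue  # only numeric directories are PIDs
        pid_dir = os.path.join(proc_root, entry)
        try:
            with open(os.path.join(pid_dir, "comm")) as f:
                name = f.read().strip()
            # Each entry under /proc/<pid>/fd is one open descriptor.
            fd_count = len(os.listdir(os.path.join(pid_dir, "fd")))
        except OSError:
            continue  # process exited, or we lack permission to inspect it
        counts[name] += fd_count
    return counts


if __name__ == "__main__":
    for name, fd_count in count_open_fds().most_common(10):
        print(f"{fd_count:8d}  {name}")
```

Run as root on the node to see every process; without root, processes owned by other users are skipped silently.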
From the output of lsof you posted above, we don't see any leaks in file descriptors.
So, although I understand you have removed Netdata from your systems, could you please help us trace the issue?
I have also read your blog post. You state that somehow a process managed to exhaust all file descriptors of the system and this led to a system-wide corruption. To my understanding this is not technically possible. Even if an application is leaking file descriptors, the limits of the process are far below the limits of the system. So, the process will start misbehaving but it cannot kill or corrupt the entire system. The ability of a process to corrupt the entire system would be a big issue for Linux systems.
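The per-process and system-wide limits can be compared on the affected node with a short sketch like the one below. It assumes a Linux host; the helper names `read_int_fields` and `fd_limits` are made up for this example, and the values will differ per machine and per container runtime.

```python
import resource


def read_int_fields(path):
    """Read a whitespace-separated file of integers, e.g. /proc/sys/fs/file-nr."""
    with open(path) as f:
        return [int(field) for field in f.read().split()]


def fd_limits():
    """Collect the calling process's fd limits and the system-wide fd counters."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    limits = {"process_soft": soft, "process_hard": hard}
    try:
        # Maximum number of file handles the kernel will allocate.
        limits["system_max"] = read_int_fields("/proc/sys/fs/file-max")[0]
        # file-nr holds: allocated handles, unused handles, maximum.
        allocated, _unused, _maximum = read_int_fields("/proc/sys/fs/file-nr")
        limits["system_allocated"] = allocated
    except (OSError, ValueError, IndexError):
        pass  # not Linux, or the files are not readable here
    return limits


if __name__ == "__main__":
    for key, value in fd_limits().items():
        print(f"{key:18s} {value}")
```

Note that the process limits shown are those of the Python process itself; the limits of another process can be read from /proc/PID/limits.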
Anyway, please help us verify that Netdata is leaking file descriptors. If it does, we need to find where it does. You mention wal, but we don't see it in the lsof you posted.
Can you help us?
I ended up installing netdata back, but directly on the nodes instead of using the helm chart. I really like netdata and couldn't find anything nearly as awesome. I lose some internal monitoring of the cluster, but that's ok with me for now.
From the output of lsof you posted above, we don't see any leaks in file descriptors.
It was essentially as I posted, hundreds and hundreds of thousands of lines of it (~500k). I probably should have gotten you the entire output, but didn't think of it at the time. As you can see from the output, it is entirely netdata that has these open files. This isn't filtered for netdata, it's raw output of lsof on the node.
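If the full output gets captured next time, a tally per command would make a claim like this easy to verify. Below is a rough sketch that assumes lsof's default column layout (COMMAND first, NAME last); the function name `count_by_command` and the optional `needle` substring filter are assumptions for illustration.

```python
from collections import Counter


def count_by_command(lines, needle=None):
    """Count lsof output lines per COMMAND, optionally keeping only lines containing needle."""
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip() or line.startswith("COMMAND"):
            continue  # skip blank lines and the header row
        if needle is not None and needle not in line:
            continue
        # The first whitespace-separated token is the COMMAND column.
        counts[line.split(None, 1)[0]] += 1
    return counts
```

For example, after saving the output with `lsof > lsof.txt` (a hypothetical file name), `count_by_command(open("lsof.txt")).most_common(5)` lists the five commands holding the most entries, and passing `needle="wal"` restricts the tally to lines mentioning wal.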
At first, I thought it ran out of space on the volume, but post-mortem, the volume only had about 20 MB being used (out of 1 GB). I have no idea why netdata did this.
You mention wal, but we don't see it in the lsof you posted.
You may have to scroll to the right to read the end of the lines (github doesn't wrap code blocks, apparently).
Can you help us?
Yeah, absolutely. With this new monitoring, I'd feel much safer installing the helm chart again.
I dug through the node's logs. Here are some things I saw before it ran out of file descriptors:
Jul 04 20:55:44 cameo kernel: netdata[20554]: segfault at 30 ip 00007f19a348a2f9 sp 00007f19a0fab510 error 4 in ld-musl-x86_64.so.1[7f19a3447000+4c000]
There are many logs like that before the system fails due to no more FD. Does a segfault leave open files behind if the segfault happens in a container? I know PID 1 normally is responsible for reaping processes, but I don't know the structure of these containers (if there isn't a PID 1 in the container, that could be it ... 🤔).
Eventually, it looks like etcd loses access to FDs first (well, "first" might be that it is just the most active process in the node), followed by netdata attempting to ask k8s for a configmap repeatedly. Then, k8s itself. Eventually, there are enough processes stuck with too many FDs that it overwhelms the system.
At the time, there were only a handful of containers/processes on this node.
I'll see if I can get the full logs here.