Thorough check of services (check_docker, open issue)

timdaman commented on August 22, 2024
Thorough check of services

from check_docker.

Comments (9)

owk commented on August 22, 2024

I added some functions I'm gonna use for our monitoring setup. I shared them in the gist below:

https://gist.github.com/owk/bab9d45cc60eba6295fe4a3cb8176c6f

It's not the cleanest code, but it works :)

timdaman commented on August 22, 2024

Thanks, that looks interesting. One thing I notice is the logic for selecting the status. I don't work with Swarm much, so I may be missing something, but the way I see it, if the service is not running we are in a critical state. OK means things are working, WARNING means things are working but with less performance margin than desired, and CRITICAL means things are not working. If a service is 'preparing', would it not be non-functional? Normally services start quickly, but if one took a very long time it would clearly be critical until it was running.
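
The status rule above can be sketched as a small mapping from a Swarm task's state to a Nagios plugin exit code. This is a hypothetical helper and does not come from check_docker or the gist. The list of state names is an assumption based on the values Docker reports in a task's `Status.State` field. The exit codes follow the standard Nagios plugin convention (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). WARNING is deliberately unused here: a single task is either running or not, so a "less margin than desired" warning would have to come from a service-level replica count.

```python
from enum import IntEnum


class NagiosStatus(IntEnum):
    """Standard Nagios plugin exit codes."""
    OK = 0
    WARNING = 1
    CRITICAL = 2
    UNKNOWN = 3


# Assumed list of states a Swarm task can report in Status.State.
KNOWN_TASK_STATES = frozenset({
    "new", "pending", "assigned", "accepted", "ready", "preparing",
    "starting", "running", "complete", "failed", "shutdown",
    "rejected", "orphaned", "remove",
})


def task_state_to_status(state: str) -> NagiosStatus:
    """Hypothetical helper: only a running task counts as working."""
    state = state.lower()
    if state == "running":
        return NagiosStatus.OK
    if state in KNOWN_TASK_STATES:
        # 'preparing', 'starting', 'pending', ...: the task is not serving yet,
        # so by the rule above the check stays CRITICAL until it is 'running'.
        return NagiosStatus.CRITICAL
    # A state this sketch does not recognise: report UNKNOWN rather than guess.
    return NagiosStatus.UNKNOWN
```

With this mapping, `task_state_to_status("preparing")` returns `NagiosStatus.CRITICAL`, which matches the suggestion that 'preparing' should not be reported as OK.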

owk commented on August 22, 2024

Good point, maybe I ought to set 'preparing' as a CRITICAL status. I expect one would only see this status if the check interval is quite low, or if there is indeed a problem. Thanks for that!

jkozera commented on August 22, 2024

If I understand the issue right, there's also the option of using the --running check from check_swarm.py. That would only work after a significant change to its logic, so that old instances of the same service are allowed to be in a non-running state. See operasoftware@2b1fc15 for an example implementation.
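
The change of logic described above can be illustrated with two hypothetical predicates over a service's task list. Each task is assumed to be a dict carrying `DesiredState` and `Status.State`, as in the Docker Engine API's task objects. This is not the code from operasoftware@2b1fc15; it is only a sketch of the idea. A naive "every task is running" test fails as soon as Swarm keeps an old, shut-down instance in the task history. Restricting the test to tasks whose desired state is still `running` ignores those old instances.

```python
from typing import Any, Dict, List

Task = Dict[str, Any]


def naive_all_running(tasks: List[Task]) -> bool:
    """Every task the service has ever had must be running (too strict)."""
    return bool(tasks) and all(t["Status"]["State"] == "running" for t in tasks)


def current_tasks_running(tasks: List[Task]) -> bool:
    """Only tasks Swarm still wants running are checked; old instances are ignored."""
    current = [t for t in tasks if t.get("DesiredState") == "running"]
    return bool(current) and all(t["Status"]["State"] == "running" for t in current)


# An old instance that failed and was replaced, plus its running replacement.
history = [
    {"DesiredState": "shutdown", "Status": {"State": "failed"}},
    {"DesiredState": "running", "Status": {"State": "running"}},
]
print(naive_all_running(history))      # False
print(current_tasks_running(history))  # True
```

An empty task list is treated as "not running" by both predicates, so a service with no tasks at all still fails the check.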

FWIW, after I'm done cleaning up these changes, I'm going to open another issue to review all the changes I've made, because they can be useful in general for Swarm setups.

nksupport commented on August 22, 2024

From what I can see in the script, you're only determining service health by the HTTP return code. That code in fact only signals Swarm health, and even as a Swarm health check it is rather limited. Consider this example with a failed service:

$ docker service ls | grep testservice

wapxwn0h5i2l        testservice                         replicated          0/1                 testimage:latest

(The container failed to start because of an unsatisfied node constraint, but the cause could be anything.)

You're doing a simple HTTP API check, similar to this:

$ curl -v --unix-socket /var/run/docker.sock http://localhost/services/testservice
> GET /services/testservice HTTP/1.1
< HTTP/1.1 200 OK
(normal service details follow)

A 200 here doesn't mean the service is green. The service is in fact completely down; the 200 only means that Swarm was able to serve your request.
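
A check that does not stop at the status code has to look at the response body. The sketch below uses only Python's standard library. It repeats the request from the curl example over the Docker Unix socket, then compares the number of running tasks with the desired replica count. Several details are assumptions for illustration and are not check_docker's actual implementation: the helper names (`docker_get`, `service_status`, `check_service`), the 10-second timeout, and the OK/WARNING/CRITICAL thresholds. The sketch also assumes a replicated service (global services have no `Replicas` field) and relies on the `/tasks` endpoint's `service` filter.

```python
import http.client
import json
import socket
from typing import Any, Dict, List, Tuple
from urllib.parse import quote

DOCKER_SOCKET = "/var/run/docker.sock"


class UnixHTTPConnection(http.client.HTTPConnection):
    """HTTPConnection that connects to a Unix domain socket instead of TCP."""

    def __init__(self, socket_path: str, timeout: float = 10.0):
        super().__init__("localhost", timeout=timeout)
        self.socket_path = socket_path

    def connect(self) -> None:
        sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        sock.settimeout(self.timeout)
        sock.connect(self.socket_path)
        self.sock = sock


def docker_get(path: str, socket_path: str = DOCKER_SOCKET) -> Tuple[int, Any]:
    """GET a Docker API path; return (HTTP status, decoded JSON body)."""
    conn = UnixHTTPConnection(socket_path)
    try:
        conn.request("GET", path)
        resp = conn.getresponse()
        return resp.status, json.loads(resp.read())
    finally:
        conn.close()


def service_status(service: Dict[str, Any], tasks: List[Dict[str, Any]]) -> Tuple[int, str]:
    """Map running vs. desired replicas to a Nagios exit code and message."""
    desired = service["Spec"]["Mode"]["Replicated"]["Replicas"]
    running = sum(
        1 for t in tasks
        if t.get("DesiredState") == "running" and t["Status"]["State"] == "running"
    )
    message = f"{running}/{desired} replicas running"
    if running == 0 and desired > 0:
        return 2, "CRITICAL: " + message
    if running < desired:
        return 1, "WARNING: " + message
    return 0, "OK: " + message


def check_service(name: str) -> Tuple[int, str]:
    """Hypothetical end-to-end check; HTTP 200 alone is never treated as OK."""
    code, service = docker_get(f"/services/{quote(name)}")
    if code == 404:
        return 2, f"CRITICAL: service {name} not found"
    if code != 200:
        return 3, f"UNKNOWN: /services/{name} returned HTTP {code}"
    filters = quote(json.dumps({"service": [name]}))
    code, tasks = docker_get(f"/tasks?filters={filters}")
    if code != 200:
        return 3, f"UNKNOWN: /tasks returned HTTP {code}"
    return service_status(service, tasks)
```

On a host with the failing service from the example, `check_service("testservice")` would be expected to report CRITICAL with "0/1 replicas running". The returned code could then be passed to `sys.exit()` in a Nagios plugin.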

The service check is somewhat misleading as it stands. Putting it into a production environment as the primary health check is obviously the technician's fault and not yours, but it will inevitably happen anyway (as it did on the system I'm investigating right now). Perhaps it would be best to remove the service check option until it's fixed, to help people avoid getting hurt?

timdaman commented on August 22, 2024

@nksupport, so sorry to hear this slowed your investigation down; that is the exact opposite of what I had hoped it would do. I do see the issue you mentioned, and it's not good.

This was written for a previous job, so I don't actually use this code much anymore, but given the seriousness of the issue I will see if I can get a test environment up to test fixes.

Thank you for such a detailed explanation of the problem; it really helps.

timdaman commented on August 22, 2024

@nksupport, I have pushed an updated swarm check. It is not fully tested, but I figured I should get it to you quickly given the impact.
https://github.com/timdaman/check_docker/tree/swarm_fix

Below is a direct link to the updated swarm check. You can run it by hand like this: python3 check_swarm.py <Your options>

https://github.com/timdaman/check_docker/blob/edcacedef5ae8e6354962d151dce1cbe50483240/check_docker/check_swarm.py

timdaman commented on August 22, 2024

@nksupport, an update: that check had another bug, which I found when I added unit tests. Here is the updated version, which I intend to ship in the next few days. If you have a moment, I would love to know whether it looks good to you.

https://github.com/timdaman/check_docker/raw/swarm_fix/check_docker/check_swarm.py

timdaman commented on August 22, 2024

I believe version 2.2.1 should resolve this issue. Let me know if you see otherwise.
