
Inconsistent error reporting (sfackler/r2d2, closed)


Comments (7)

sfackler commented on July 30, 2024

The error message included with a checkout timeout error is just a convenience, to hopefully give some context if the pool can no longer connect to the database. The only reason a checkout fails is a timeout; the pool may have attempted to open zero, one, or many connections during that window, but none of those attempts was tied to that specific checkout request. The pool's set of connections is maintained entirely asynchronously from individual checkouts.
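In API terms, a hedged sketch of what that means for callers:

fn fetch<M: r2d2::ManageConnection>(pool: &r2d2::Pool<M>) {
    match pool.get() {
        Ok(_conn) => {
            // use the connection; it returns to the pool on drop
        }
        Err(e) => {
            // always a timeout; any "could not connect" text attached
            // to it is background context, not the cause of this
            // particular checkout failing
            eprintln!("checkout failed: {}", e);
        }
    }
}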


Diggsey commented on July 30, 2024

That makes sense, but it has the unfortunate effect that many kinds of issues are quite hard to debug when using r2d2.

Let me explain a problem and maybe you can think of a way to make these kinds of issues more obvious:

Yesterday I had to deal with a production issue where one or more instances of our application would get into a "bad state" and start failing a certain percentage of requests. Sometimes these instances would recover, whilst other times they would get stuck and have to be manually restarted. The trigger seemed to be a flurry of requests arriving at the same time.

The only error being reported was that r2d2 was failing to obtain connections from the pool. Sometimes these errors would say "could not connect"; most of the time the error was simply empty.

I had a bunch of theories, ranging from connections in the pool becoming "tainted" in some way to the "database proxy" we use malfunctioning. In the end, it turned out to be a rather subtle deadlock: in a couple of places in the code, the same thread would try to lease a connection from the pool whilst a connection was already active further up the call stack.

When several requests came in at exactly the same time and each obtained its "first" connection, they would all then deadlock trying to obtain a "second" connection, since that would exceed the pool limit, so no thread could make progress. As soon as one thread timed out, only one connection would be released, and (due to some apparent unfairness in the way threads are woken?) that connection would be taken by the next incoming request, which continued the cycle indefinitely.
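For illustration, here is a minimal reproduction of the pattern (DummyManager is a hypothetical stand-in for a real connection manager, and the timeout is shortened):

use std::thread;
use std::time::Duration;

// Hypothetical no-op manager, just to make the sketch self-contained.
#[derive(Debug)]
struct DummyManager;

impl r2d2::ManageConnection for DummyManager {
    type Connection = ();
    type Error = std::io::Error;

    fn connect(&self) -> Result<(), std::io::Error> {
        Ok(())
    }

    fn is_valid(&self, _: &mut ()) -> Result<(), std::io::Error> {
        Ok(())
    }

    fn has_broken(&self, _: &mut ()) -> bool {
        false
    }
}

fn main() {
    let pool = r2d2::Pool::builder()
        .max_size(2)
        .connection_timeout(Duration::from_secs(5))
        .build(DummyManager)
        .unwrap();

    // Two threads, two connections: each thread leases its "first"
    // connection, exhausting the pool, then blocks on a re-entrant
    // checkout that can never succeed.
    let handles: Vec<_> = (0..2)
        .map(|_| {
            let pool = pool.clone();
            thread::spawn(move || {
                let _outer = pool.get().unwrap();
                match pool.get() {
                    Ok(_) => println!("inner checkout succeeded"),
                    Err(e) => println!("inner checkout failed: {}", e),
                }
            })
        })
        .collect();

    for handle in handles {
        handle.join().unwrap();
    }
}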

In response to this issue, we put in place some measures to prevent it from happening again:

  • TLS (thread-local storage) safety check: ensuring only one connection is obtained per thread at a time (in our own code); see the sketch after this list
  • Adding logging throughout our fork of r2d2, including the time spent in various stages (how long to establish a connection, how long a connection is leased before being returned, how long it took to get a connection from the pool, etc.)
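A hypothetical sketch of such a thread-local guard (not our actual code; it turns a silent re-entrant checkout into an immediate, debuggable panic):

use std::cell::Cell;
use std::ops::Deref;

thread_local! {
    // Set while this thread holds a pooled connection.
    static HOLDING: Cell<bool> = Cell::new(false);
}

// Wrapper that clears the flag when the lease ends.
struct CheckedConn<M: r2d2::ManageConnection>(r2d2::PooledConnection<M>);

impl<M: r2d2::ManageConnection> Deref for CheckedConn<M> {
    type Target = M::Connection;

    fn deref(&self) -> &M::Connection {
        &self.0
    }
}

impl<M: r2d2::ManageConnection> Drop for CheckedConn<M> {
    fn drop(&mut self) {
        HOLDING.with(|h| h.set(false));
    }
}

// Panics if the current thread already holds a connection.
fn checked_get<M: r2d2::ManageConnection>(
    pool: &r2d2::Pool<M>,
) -> Result<CheckedConn<M>, r2d2::Error> {
    HOLDING.with(|h| {
        assert!(!h.get(), "re-entrant pool checkout on the same thread");
        h.set(true);
    });
    match pool.get() {
        Ok(conn) => Ok(CheckedConn(conn)),
        Err(e) => {
            HOLDING.with(|h| h.set(false));
            Err(e)
        }
    }
}

(A real version would also need to account for the wrapper being dropped on a different thread than the one that checked it out.)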

One thing that exacerbated the issue was that when a connection is released to the pool, it seems to almost always be picked up by the thread that most recently tried to obtain a connection. I don't know whether Condvar is supposed to give any fairness guarantees, but it appears to be very unfair. The result was that threads which failed to obtain a connection on the first attempt could easily be starved. In this case, that is why instances rarely recovered by themselves: the steady influx of new requests caused a self-perpetuating problem, as new requests would take the connections as they were released, so older requests would wait out the full 30s or so and time out.

Anyway, as far as what can actually be done within r2d2: adding logging and finding a way to distribute connections more fairly are both things I think would help.

There are certain things about the state of the pool that are very strong indicators that something is wrong (e.g. long lease times, high contention for connections). In our case, the most important addition was logging the lease time on connections, as we knew our code should only ever lease a connection for a few milliseconds.


sfackler commented on July 30, 2024

I'd be happy to take a PR adding a bunch of debug and trace level logging!

We could switch from the stdlib synchronization primitives to parking lot, which does have a fair unlock API: https://docs.rs/lock_api/0.1.5/lock_api/struct.MutexGuard.html#method.unlock_fair.
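For example, a fair handoff with parking_lot looks roughly like this:

use parking_lot::{Mutex, MutexGuard};

fn main() {
    let lock = Mutex::new(0u32);

    let guard = lock.lock();
    // A fair unlock hands the lock directly to the longest-waiting
    // thread instead of letting the releasing thread (or a newly
    // arriving one) barge back in.
    MutexGuard::unlock_fair(guard);
}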

It would also be nice to report various metrics like checkout and lease times, but I'm not sure how exactly we'd want that API to look. Maybe something like this?

use std::time::Duration;

pub trait ConsumeMetric {
    fn checkout_time(&self, time: Duration) {}

    fn lease_time(&self, time: Duration) {}

    // ...
}
You could then hook that up to whatever metrics library your service is using and feed the times into a histogram or whatever.
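For instance, a hypothetical implementor of the proposed trait that just feeds the log crate:

use std::time::Duration;

#[derive(Debug)]
struct LogMetrics;

impl ConsumeMetric for LogMetrics {
    fn checkout_time(&self, time: Duration) {
        log::debug!("pool checkout took {:?}", time);
    }

    fn lease_time(&self, time: Duration) {
        log::warn!("connection was leased for {:?}", time);
    }
}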


sfackler commented on July 30, 2024

You can also install an error handler, which sees every error produced by the connection manager; that should be more reliable than whatever happens to end up tacked onto the timeout error: https://docs.rs/r2d2/0.8.3/r2d2/struct.Builder.html#method.error_handler
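For example (a hedged sketch; error_handler is the real builder method, the handler type here is hypothetical):

use r2d2::HandleError;

#[derive(Debug)]
struct PrintErrors;

// Called with every error the connection manager produces, as it happens.
impl<E: std::error::Error> HandleError<E> for PrintErrors {
    fn handle_error(&self, err: E) {
        eprintln!("pool error: {}", err);
    }
}

// Wiring it up (manager elided):
//
// let pool = r2d2::Pool::builder()
//     .error_handler(Box::new(PrintErrors))
//     .build(manager)?;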


Diggsey commented on July 30, 2024

Cool, if I get time I'll try to put together a PR adding some logging.

That metric API looks reasonable. I did have a vision for some kind of automatic monitor you could (optionally) wrap around a pool: you would configure expected lease/checkout times, and it would log errors or warnings whenever the behaviour was too unusual, possibly even resetting the pool. But maybe that's over-engineering it... The benefit would be that you could worry less about performance impact, since the monitor would be opt-in, whilst also providing a very resilient default for users of the library.


repnop commented on July 30, 2024

Have there been any additional developments on this? I seem to have run into a similar issue where connections in a pool would die and not be recovered while using diesel's pooled connections. This would happen overnight, and my service that depends on the connections would stop responding and error out with no discernible error message. The first error would be Err("The MySQL server has left"), then I'd just get Err(None), even though the database was running just fine. In the meantime I have removed all of the connection pooling from the service, since it's not high-traffic anyway, but it was quite annoying to track down the issue.


sfackler commented on July 30, 2024

You could register an event handler to track what's going on in the pool and help diagnose it.
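Roughly like this (assuming the HandleEvent API and event types from recent r2d2 0.8.x releases; double-check the names against your version's docs):

use r2d2::event::{CheckoutEvent, TimeoutEvent};
use r2d2::HandleEvent;

#[derive(Debug)]
struct LogEvents;

impl HandleEvent for LogEvents {
    fn handle_checkout(&self, event: CheckoutEvent) {
        log::debug!(
            "checked out connection {} after {:?}",
            event.connection_id(),
            event.duration()
        );
    }

    fn handle_timeout(&self, event: TimeoutEvent) {
        log::warn!("checkout timed out after {:?}", event.timeout());
    }
}

// let pool = r2d2::Pool::builder()
//     .event_handler(Box::new(LogEvents))
//     .build(manager)?;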

