
Inconsistent error reporting (sfackler/r2d2, closed)


Comments (7)

sfackler commented on July 30, 2024

The error message included with a checkout timeout error is just a convenience, to hopefully give some context if the pool can no longer connect to the database. The only reason a checkout fails is a timeout; the pool may have attempted to open zero, one, or many connections during that window, but none of those attempts was tied to that specific checkout request. The pool's set of connections is maintained entirely asynchronously from individual checkouts.
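In API terms, a hedged sketch of what that means for callers:

fn fetch<M: r2d2::ManageConnection>(pool: &r2d2::Pool<M>) {
    match pool.get() {
        Ok(_conn) => {
            // use the connection; it returns to the pool on drop
        }
        Err(e) => {
            // always a timeout; any "could not connect" text attached
            // to it is background context, not the cause of this
            // particular checkout failing
            eprintln!("checkout failed: {}", e);
        }
    }
}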


Diggsey commented on July 30, 2024

That makes sense, but it has the unfortunate effect that many kinds of issues are quite hard to debug when using r2d2.

Let me explain a problem and maybe you can think of a way to make these kinds of issues more obvious:

Yesterday I had to deal with a production issue where one or more instances of our application would get into a "bad state" and start failing a certain percentage of requests. Sometimes these instances would recover, whilst other times they would get stuck and have to be manually restarted. The trigger seemed to be a flurry of requests arriving at the same time.

The only error being reported was that r2d2 was failing to obtain connections from the pool. Sometimes these errors would say "could not connect"; most of the time the error was simply empty.

I had a bunch of theories, ranging from connections in the pool becoming "tainted" in some way to the "database proxy" we use malfunctioning. In the end, it turned out to be a rather subtle deadlock: in a couple of places in the code, the same thread would try to lease a connection from the pool whilst a connection was already active further up the call stack.

When several requests came in at exactly the same time and each obtained its "first" connection, they would all then deadlock trying to obtain a "second" connection, since that would exceed the pool limit, so no thread could make progress. As soon as one thread timed out, only one connection would be released, and (due to some apparent unfairness in the way threads are woken?) that connection would be taken by the next incoming request, which continued the cycle indefinitely.
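For illustration, here is a minimal reproduction of the pattern (DummyManager is a hypothetical stand-in for a real connection manager, and the timeout is shortened):

use std::thread;
use std::time::Duration;

// Hypothetical no-op manager, just to make the sketch self-contained.
#[derive(Debug)]
struct DummyManager;

impl r2d2::ManageConnection for DummyManager {
    type Connection = ();
    type Error = std::io::Error;

    fn connect(&self) -> Result<(), std::io::Error> {
        Ok(())
    }

    fn is_valid(&self, _: &mut ()) -> Result<(), std::io::Error> {
        Ok(())
    }

    fn has_broken(&self, _: &mut ()) -> bool {
        false
    }
}

fn main() {
    let pool = r2d2::Pool::builder()
        .max_size(2)
        .connection_timeout(Duration::from_secs(5))
        .build(DummyManager)
        .unwrap();

    // Two threads, two connections: each thread leases its "first"
    // connection, exhausting the pool, then blocks on a re-entrant
    // checkout that can never succeed.
    let handles: Vec<_> = (0..2)
        .map(|_| {
            let pool = pool.clone();
            thread::spawn(move || {
                let _outer = pool.get().unwrap();
                match pool.get() {
                    Ok(_) => println!("inner checkout succeeded"),
                    Err(e) => println!("inner checkout failed: {}", e),
                }
            })
        })
        .collect();

    for handle in handles {
        handle.join().unwrap();
    }
}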

In response to this issue, we put in place some measures to prevent it from happening again:

  • TLS (thread-local storage) safety check: ensuring only one connection is obtained per thread at a time (in our own code); see the sketch after this list
  • Adding logging throughout our fork of r2d2, including the time spent in various stages (how long to establish a connection, how long a connection is leased before being returned, how long it took to get a connection from the pool, etc.)
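A hypothetical sketch of such a thread-local guard (not our actual code; it turns a silent re-entrant checkout into an immediate, debuggable panic):

use std::cell::Cell;
use std::ops::Deref;

thread_local! {
    // Set while this thread holds a pooled connection.
    static HOLDING: Cell<bool> = Cell::new(false);
}

// Wrapper that clears the flag when the lease ends.
struct CheckedConn<M: r2d2::ManageConnection>(r2d2::PooledConnection<M>);

impl<M: r2d2::ManageConnection> Deref for CheckedConn<M> {
    type Target = M::Connection;

    fn deref(&self) -> &M::Connection {
        &self.0
    }
}

impl<M: r2d2::ManageConnection> Drop for CheckedConn<M> {
    fn drop(&mut self) {
        HOLDING.with(|h| h.set(false));
    }
}

// Panics if the current thread already holds a connection.
fn checked_get<M: r2d2::ManageConnection>(
    pool: &r2d2::Pool<M>,
) -> Result<CheckedConn<M>, r2d2::Error> {
    HOLDING.with(|h| {
        assert!(!h.get(), "re-entrant pool checkout on the same thread");
        h.set(true);
    });
    match pool.get() {
        Ok(conn) => Ok(CheckedConn(conn)),
        Err(e) => {
            HOLDING.with(|h| h.set(false));
            Err(e)
        }
    }
}

(A real version would also need to account for the wrapper being dropped on a different thread than the one that checked it out.)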

One thing that exacerbated the issue was that when a connection is released to the pool, it seems to almost always be picked up by the thread that most recently tried to obtain a connection. I don't know whether Condvar is supposed to give any fairness guarantees, but it appears to be very unfair. The result was that threads which failed to obtain a connection on the first attempt could easily be starved. In this case, that is why instances rarely recovered by themselves: the steady influx of new requests caused a self-perpetuating problem, as new requests would take the connections as they were released, so older requests would wait out the full 30s or so and time out.

Anyway, as far as what can actually be done within r2d2: adding logging and finding a way to distribute connections more fairly are both things I think would help.

There are certain things about the state of the pool that are very strong indicators that something is wrong (e.g. long lease times, high contention for connections). In our case, the most important addition was logging the lease time on connections, as we knew our code should only ever lease a connection for a few milliseconds.


sfackler commented on July 30, 2024

I'd be happy to take a PR adding a bunch of debug and trace level logging!

We could switch from the stdlib synchronization primitives to parking lot, which does have a fair unlock API: https://docs.rs/lock_api/0.1.5/lock_api/struct.MutexGuard.html#method.unlock_fair.
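For example, a fair handoff with parking_lot looks roughly like this:

use parking_lot::{Mutex, MutexGuard};

fn main() {
    let lock = Mutex::new(0u32);

    let guard = lock.lock();
    // A fair unlock hands the lock directly to the longest-waiting
    // thread instead of letting the releasing thread (or a newly
    // arriving one) barge back in.
    MutexGuard::unlock_fair(guard);
}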

It would also be nice to report various metrics like checkout and lease times, but I'm not sure how exactly we'd want that API to look. Maybe something like this?

use std::time::Duration;

pub trait ConsumeMetric {
    fn checkout_time(&self, time: Duration) {}

    fn lease_time(&self, time: Duration) {}

    // ...
}
You could then hook that up to whatever metrics library your service is using and feed the times into a histogram or whatever.
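For instance, a hypothetical implementor of the proposed trait that just feeds the log crate:

use std::time::Duration;

#[derive(Debug)]
struct LogMetrics;

impl ConsumeMetric for LogMetrics {
    fn checkout_time(&self, time: Duration) {
        log::debug!("pool checkout took {:?}", time);
    }

    fn lease_time(&self, time: Duration) {
        log::warn!("connection was leased for {:?}", time);
    }
}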


sfackler commented on July 30, 2024

You can also install an error handler, which sees every error produced by the connection manager; that should be more reliable than whatever happens to end up tacked onto the timeout error: https://docs.rs/r2d2/0.8.3/r2d2/struct.Builder.html#method.error_handler
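For example (a hedged sketch; error_handler is the real builder method, the handler type here is hypothetical):

use r2d2::HandleError;

#[derive(Debug)]
struct PrintErrors;

// Called with every error the connection manager produces, as it happens.
impl<E: std::error::Error> HandleError<E> for PrintErrors {
    fn handle_error(&self, err: E) {
        eprintln!("pool error: {}", err);
    }
}

// Wiring it up (manager elided):
//
// let pool = r2d2::Pool::builder()
//     .error_handler(Box::new(PrintErrors))
//     .build(manager)?;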


Diggsey commented on July 30, 2024

Cool, if I get time I'll try to put together a PR adding some logging.

That metric API looks reasonable. I did have a vision for some kind of automatic monitor you could (optionally) wrap around a pool: you would configure expected lease/checkout times, and it would log errors or warnings whenever the behaviour was too unusual, possibly even resetting the pool. But maybe that's over-engineering it... The benefit would be that you could worry less about performance impact, since the monitor would be opt-in, whilst also providing a very resilient default for users of the library.


repnop commented on July 30, 2024

Have there been any additional developments on this? I seem to have run into a similar issue where connections in a pool would die and not be recovered while using diesel's pooled connections. This would happen overnight, and my service that depends on the connections would stop responding and error out with no discernible error message. The first error would be Err("The MySQL server has left"), then I'd just get Err(None), even though the database was running just fine. In the meantime I have removed all of the connection pooling from the service, since it's not high-traffic anyway, but it was quite annoying to track down the issue.


sfackler commented on July 30, 2024

You could register an event handler to track what's going on in the pool and help diagnose it.
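Roughly like this (assuming the HandleEvent API and event types from recent r2d2 0.8.x releases; double-check the names against your version's docs):

use r2d2::event::{CheckoutEvent, TimeoutEvent};
use r2d2::HandleEvent;

#[derive(Debug)]
struct LogEvents;

impl HandleEvent for LogEvents {
    fn handle_checkout(&self, event: CheckoutEvent) {
        log::debug!(
            "checked out connection {} after {:?}",
            event.connection_id(),
            event.duration()
        );
    }

    fn handle_timeout(&self, event: TimeoutEvent) {
        log::warn!("checkout timed out after {:?}", event.timeout());
    }
}

// let pool = r2d2::Pool::builder()
//     .event_handler(Box::new(LogEvents))
//     .build(manager)?;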

