
Comments (7)

antirez commented on August 25, 2024

Hello @fritzy, I understand the use case for this, and Amazon SQS for example offers a symmetrical feature, but since Disque is distributed there is no global retry counter. This could be roughly addressed by keeping a non-perfect counter in each node, using QUEUED messages to keep the counter updated. However, there is another problem due to the distributed nature of Disque: we also need to coordinate nodes so that we usually get just one dead letter per job that could not be processed; otherwise all the nodes holding a copy may put the job into the dead letter queue multiple times.

This also seems solvable with a protocol that does not guarantee uniqueness but approximates it. It is not a trivial problem, but with some design work we may find a way to solve it.
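To make the non-perfect counter idea above concrete, here is a minimal sketch, under assumptions of my own (the `Node` class, method names, and max-merge rule are hypothetical, not Disque's actual internals): each node keeps a local best-effort count per job and, when it sees a peer's QUEUED message carrying that peer's count, converges by taking the maximum. The result under-counts during partitions but never invents retries no node observed.

```python
class Node:
    """Hypothetical Disque node keeping an approximate per-job retry counter."""

    def __init__(self):
        self.retry_count = {}  # job_id -> local approximate count

    def local_requeue(self, job_id):
        # This node re-queued the job itself: bump its local counter.
        self.retry_count[job_id] = self.retry_count.get(job_id, 0) + 1

    def on_queued_message(self, job_id, peer_count):
        # A peer's QUEUED message carried the peer's count; converge upward
        # by keeping the larger of the two observations.
        self.retry_count[job_id] = max(self.retry_count.get(job_id, 0), peer_count)


# Two nodes observing the same job:
a, b = Node(), Node()
a.local_requeue("job1")   # a sees 1 retry
a.local_requeue("job1")   # a sees 2 retries
b.local_requeue("job1")   # b only saw 1
# a broadcasts QUEUED with its count attached; b converges to 2.
b.on_queued_message("job1", a.retry_count["job1"])
```

The max-merge makes the counter monotonic per node, which is what makes it safe to act on a "retry count greater than N" threshold even when the count is stale.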

A few more thoughts:

  1. This would enlarge the job structure even more; it is already 120 bytes.
  2. Maybe specifying the queue name for dead letters explicitly is not strictly needed? We could just prefix the queue name with something like dead_letter:queue_name. Not sure about that: it saves memory but is less flexible, since some users may want all dead letters to go into the same queue.
  3. It is possible to address this in a much more memory-efficient way at the cost of other things: we are going to have an iterator that walks jobs matching specific attributes, so we could ask a node to iterate all jobs whose approximated_retry_attempt is greater than N. However, this makes it a per-node problem, while the rest of the Disque API is distributed and not tied to a specific node, so I'm not a big fan of this.
  4. The obvious counter-argument to this feature is that it should be up to workers: when they can't process a job, they can add a dead letter themselves for people to inspect. This is how Resque works, I guess? However, this means certain kinds of failures are harder to catch, for example a worker crashing as soon as it starts processing a given message, before it has a chance to post a dead letter.
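The worker-side approach in point 4 can be sketched roughly as follows. This is an assumption-heavy simulation, not real Disque client code: queues are in-memory deques standing in for Disque queues, and `process` is a placeholder that always fails. The known gap discussed above is visible in the structure: if the worker dies between taking the job and the except clause, no dead letter is ever posted.

```python
from collections import deque

# In-memory stand-ins for Disque queues (hypothetical names).
work_queue = deque([{"id": "job1", "body": "payload"}])
dead_letters = deque()

def process(job):
    # Placeholder worker logic that always fails, to exercise the path.
    raise RuntimeError("simulated failure")

def worker_step():
    job = work_queue.popleft()        # would be GETJOB against Disque
    try:
        process(job)
        # would ACKJOB here on success
    except Exception as exc:
        # The worker itself posts the dead letter, augmented with the
        # error it observed (which the server alone could not know).
        dead_letters.append({**job, "error": str(exc)})

worker_step()
```

The upside, as noted in the thread, is that the dead letter can carry worker-side context; the downside is that a crash before the `except` runs leaves no trace.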

No clear ideas on my side yet :-) I need to think about / work on this some more. Any feedback?

from disque.

fritzy commented on August 25, 2024

Three thoughts:

Generated names are fine.

NSQ allows you to listen for requeue events (I believe). If you could do this, a watcher process could manually move the job elsewhere, assuming it could prevent a worker from picking it back up.

The accuracy of the counter isn't terribly important. We're really just looking for a threshold of pain: this job is unlikely to succeed, so we pass it off somewhere else. If we set the limit at 5 failures but don't notice until 10, that's a bit wasteful and consumes resources, but the exact count itself doesn't really matter.


sheerun commented on August 25, 2024

Isn't one global dead-job queue enough? It's up to workers to process dead jobs and report them. Using Disque as a "database" for dead jobs doesn't seem like a good idea, even if that only means sorting them by originating queue.

I think this could be solved by allowing metadata to be attached to each message, as RabbitMQ does. We need it for other use cases too. For example, for failed jobs we want to attach the stack trace and error name before putting them in the dead-job queue, while avoiding any modification of the original job body.


antirez commented on August 25, 2024

@fritzy @sheerun thanks for your comments. A few random arguments to keep the discussion going.

  1. It's interesting to note that the dead letter could be implemented client side if we offered, via GETJOB, a way to also fetch this approximated "retry counter", so that the worker itself could add the job to the dead letter queue. The only issue I see with this, compared to handling it inside the server, is that if a worker crashes as soon as it tries to process the job, it will never be able to post the dead letter.
  2. The Pub/Sub thing would be nice and extremely easy to implement: just a command to listen for events like "job expired before being processed", "job retry limit reached", and so on. However, like any fire-and-forget mechanism, it is not reliable. But maybe it doesn't really need to be reliable if we re-issue the event every time a job is re-queued with a retry count greater than the limit; eventually some listener will notice. So this could be an option.

I'm a bit more biased towards the client-side solution, if you think it could work, both for the sake of keeping the Disque server simple and for the reason @sheerun gives: usually we want to augment the failed job with extra information that only the worker can provide.

So a feature sketch for you to evaluate:

  1. Keep an approximate count of the number of times a job was re-issued, in some useful way. It won't be very reliable during failures, but being reasonably reliable otherwise would be cool. We could piggyback on the QUEUED or WILLQUEUE cluster messages, I guess, or something like that.
  2. Provide a GETJOB option that also fetches the job's retry count, so the worker can act as soon as needed and post a dead letter.
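The two-step sketch above could look like this on the worker side. Everything here is hypothetical: `FakeClient` stands in for a real Disque client, the job id is a placeholder, and no such GETJOB option exists yet; the point is only to show the decision the retry count enables.

```python
RETRY_LIMIT = 5  # example threshold, chosen by the application

def handle(client, job_id, body, retry_count, dead_queue="dead_letters"):
    """Dead-letter the job if the server-reported retry count is past
    the limit; otherwise tell the caller to process it normally."""
    if retry_count > RETRY_LIMIT:
        # Augment the dead letter with worker-side context, then ACK the
        # original so no node re-queues it yet again.
        client.addjob(dead_queue, {"job": body, "retries": retry_count})
        client.ackjob(job_id)
        return "dead-lettered"
    return "process"

class FakeClient:
    """Stand-in recording calls a real Disque client would issue."""
    def __init__(self):
        self.added, self.acked = [], []
    def addjob(self, queue, job):
        self.added.append((queue, job))
    def ackjob(self, job_id):
        self.acked.append(job_id)

c = FakeClient()
over_limit = handle(c, "jobid-1", "payload", retry_count=7)
under_limit = handle(c, "jobid-2", "payload", retry_count=1)
```

Because the counter is only approximate, `RETRY_LIMIT` should be read as "at least this many", which matches @fritzy's point that overshooting the threshold is merely wasteful, not incorrect.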

Sounds good?


sheerun commented on August 25, 2024

Don't get me wrong: I think putting dead jobs into a dead-job queue (or the job_events queue described below) after a timeout (no ACK) is quite important. Without it there's no guarantee that jobs aren't dropped when a worker hangs or doesn't implement the dead-job logic.

I think one clean solution would be to:

  1. Don't handle retry counts at the Disque level at all.
  2. Introduce an unacknowledged / timed-out state for a job.
  3. Create a special job_events queue that both Disque and clients can push to. Clients can subscribe to job events selectively in a pub/sub manner (e.g. on a transition to unacknowledged, Disque adds an entry to job_events stating exactly that).
  4. Clients can process events from job_events and re-schedule dead jobs as needed with an incremented retry counter (stored in the metadata). I still think introducing job metadata is quite important.
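Point 4 can be sketched as a small event handler, assuming both proposed features exist (neither a job_events queue nor per-job metadata is a current Disque feature; the event shape and field names below are invented for illustration):

```python
RETRY_LIMIT = 3  # example limit chosen by the application

def handle_event(event, requeue, give_up):
    """Consume one job_events entry: on an 'unacknowledged' transition,
    re-schedule the job with an incremented retry counter kept in its
    metadata, or hand it off once the limit is reached."""
    if event["type"] != "unacknowledged":
        return  # some other transition; not our concern here
    job = event["job"]
    retries = job.get("metadata", {}).get("retries", 0)
    if retries < RETRY_LIMIT:
        job.setdefault("metadata", {})["retries"] = retries + 1
        requeue(job)      # would ADDJOB back to the original queue
    else:
        give_up(job)      # e.g. push to a dead-job queue for inspection

requeued, abandoned = [], []
handle_event(
    {"type": "unacknowledged",
     "job": {"body": "x", "metadata": {"retries": 2}}},
    requeued.append, abandoned.append)
handle_event(
    {"type": "unacknowledged",
     "job": {"body": "y", "metadata": {"retries": 3}}},
    requeued.append, abandoned.append)
```

This keeps all retry policy in the client, which is the appeal of the design: Disque only has to report the unacknowledged transition.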


sheerun commented on August 25, 2024

On the other hand, the body of the jobs in the job_events queue could itself serve as job metadata: clients could read it and store it somewhere if needed.


antirez commented on August 25, 2024

Closing this in favor of #68 to start a fresh discussion around a different design proposal. My top priority was to explore alternative designs that offer the same practical features without resorting to an actual dead-letter implementation.

