The orchestration of shutdown between QueueProcessor, Queue and Workers is a bit dupli

Cleanup/simplify shutdown orchestration about kanaloa HOT 2 OPEN

iheartradio commented on June 4, 2024

Cleanup/simplify shutdown orchestration

from kanaloa.

Comments (2)

kailuowang commented on June 4, 2024

The Queue tries to finish all work already enqueued, that's the intention.
Upon receiving the Retire, it first does a dispatchWork which will try dispatch all the work in the queue.
The dispatchWork once called, will either exhaust the work queue or the worker queue. Thus, if there are still workers left, then it means all the work in the queue are dispatched, the left workers can be retired, so it sends the NoWorkLeft message to them. If there is still work left in the queue, then there won't be any Worker left to receive that NoWorkLeft message.
Then in the retiring state, the Queue will try to dispatch multiple times to finish the leftover work.

Bottom line is, ShutdownGracefully does mean shutdown gracefully, no work left behind.
Another factor that adds to the confusion is that other actors such as Worker and QueueProcessor watches Queue but that's more a sanity check than part of the shutdown process. If the Queue accidentally dies, there is no reason to keep either QueueProcessor or Workers around. The whole system better be recreated by the supervisor which is the Dispatcher, although we need some tests around that.

from kanaloa.

nsauro commented on June 4, 2024

I've been poking around a bit at this.

I have a basic idea which I just wanted to put down in writing somewhere, so I figured where better than here 😄

So, here's a rundown of things I think we need to address, one way or another:

Currently, there is Work loss potential. IE: If the Queue receives the final RetiringTimeout, and there is Work in the queue still, that Work goes with it. To fix this, we could easily at this point in time, send a message back to the original sender, similar to EnqueueRejected(work). While the message is slightly off, it might be better to just reuse the same public message that is being used now to reject work, vs introducing a new one.
The above idea, of having Work which was enqueued, subsequently get rejected, started to make me think, that we might be missing something from acknowledgement. Currently, the sender can know when Work is enqueued or rejected, but the sender never knows when the Work was completed(one way or another). This might be handy to have, since it allows the sender(not the replyTo) to track the status of Work from start to finish. This would allow the sender to fully react to the status of Work. Def a thing up for debate. In either event, we would want to have messages like these optional to send.
So, the idea behind graceful shutdown is that all work completes, and then all components shut themselves down. The only slight wrinkle I see with this, is that in the retiring state, we are still actively scheduling work(if it is already enqueued). The issue is, we have no control or guarantee that work we schedule while in this state will successfully be completed because of the RetireTimeout. Figure, if during the retiring state, right at T-1(before shutdown), a Worker gets new work, sends a message to its Routee, and then the system shuts down. That message is now in limbo(from the end user of Kanaloa's point of view). Did it go through? Did it timeout? If this was a transaction of some sort, there is no record of what happened(the replyTo and the sender both get nothing). The thing that rubs me the wrong way here is that in graceful shutdown, we can actively be increasing our chances of not having a graceful shutdown(by scheduling more work).

I think the safest thing to do, (as things are implemented currently), is upon retire, immediately reject all queued messages(similar to how things would need to be rejected in the above point). This gives the sender the ability to confidently handle things that we had no guarantee of executing, and it can react accordingly. To the sender, this is just another rejection event, which it already has to handle.

The graceful shutdown would then be a signal for "in flight" operations. IE: did everything that was in flight succeed, and were all unexecuted queued message successfully returned to the sender?

I know this is a bit of a departure from what was originally envisioned, but personally, I feel like we need to do our absolute best to make sure Kanaloa never drops work.

One interesting thing to note, and one thing we might want to explore, is that there actually might be diverging behaviors on shutdown between the pulling and pushing models.

One other thing to note is that based on our earlier chats about this, there actually might be the case for implementing multiple types of shutdown behavior. I think there were many good points bought up about both the current and potential implementations, and they don't need to be mutually exclusive.

from kanaloa.

Cleanup/simplify shutdown orchestration about kanaloa HOT 2 OPEN

Comments (2)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent