Thank you for your work and open-sourcing these new libraries!
I watched your ElixirConf keynote on this and started thinking about graceful shutdown/restart behavior during common deployment setups, especially rolling deployments and blue-green deployments.
For DurableServers that are user-facing or latency-sensitive, it would be useful to minimize the gap between one node shutting down and another node running the same DurableServer.
## Current Behavior
During graceful shutdown, DurableServer already syncs state before termination and marks the node as draining.
For `permanent: true` DurableServers, another node can then restart them through the existing LifecycleManager flow. That is correct and safe, but there can still be a visible gap where the DurableServer is not running anywhere yet.
## Proposal
Add an opt-in supervisor option, `handoff_on_shutdown`.
On shutdown, the supervisor would:
- Mark itself as draining, reusing the existing shutdown behavior.
- Enumerate local DurableServers.
- For each eligible DurableServer, proactively rehome it to another node using existing placement rules.
- Fall back to current graceful shutdown behavior if no eligible node exists, handoff starts failing, or the timeout is reached.
This would reduce the downtime gap for each DurableServer.
## Details
Internally this might be implemented using the existing `rehome_child/3` flow, with bounded concurrency and timeout options.
Example config shape:
```elixir
handoff_on_shutdown: [
  enabled: true,
  timeout_ms: 30_000,
  max_concurrency: 50
]
```
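For illustration only, here is a rough sketch of what the drain step could look like, assuming `rehome_child/3` takes roughly the supervisor, a child id, and a target node (I haven't checked the exact signature), and using placeholder helpers `local_durable_servers/1` and `pick_target_node/1` for listing local children and applying the existing placement rules:

```elixir
# Hypothetical sketch only: local_durable_servers/1, pick_target_node/1, and the
# exact shape of rehome_child/3 are placeholders, not the library's actual API.
defp drain_and_handoff(sup, opts) do
  timeout = Keyword.get(opts, :timeout_ms, 30_000)
  max_concurrency = Keyword.get(opts, :max_concurrency, 50)

  sup
  |> local_durable_servers()
  |> Task.async_stream(
    fn child_id ->
      case pick_target_node(child_id) do
        # No eligible node: leave this child to the normal graceful shutdown.
        nil -> :skip
        node -> rehome_child(sup, child_id, node)
      end
    end,
    max_concurrency: max_concurrency,
    timeout: timeout,
    on_timeout: :kill_task
  )
  # Stop proactive handoff as soon as attempts start failing or timing out;
  # anything not rehomed falls back to the existing graceful shutdown path.
  |> Enum.take_while(&match?({:ok, _}, &1))
end
```

Halting the stream at the first timeout or crash would shut down the remaining in-flight attempts, so the rest of the DurableServers just go through the normal shutdown path.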
This should be best-effort only. If there is no eligible new node running, or once handoff attempts start timing out/failing, the supervisor could stop attempting proactive handoff and fall back to the existing graceful shutdown behavior for the remaining DurableServers.
I think this would be useful in many cases where new nodes overlap with old nodes during deploys and the DurableServers are latency-sensitive. It might be less useful, or even counterproductive, when mass restarts create so much load that there is a resource bottleneck, like in the demo in your keynote where the bottleneck appeared to be the object store.
It could still be extended with configurability/priority per DurableServer module or even per instance, but this is the basic idea.
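Purely as an illustration of that extension (these callback names are made up, not part of the library), per-module configurability could look something like an optional callback:

```elixir
# Purely hypothetical: handoff_priority/0 and handoff?/0 are invented callback
# names (and `use DurableServer` may not be the exact macro), just to show what
# per-module configurability could look like.
defmodule MyApp.CartSession do
  use DurableServer

  # Latency-sensitive servers would be rehomed first while the node drains.
  def handoff_priority, do: 100

  # Modules could also opt out of proactive handoff entirely.
  def handoff?, do: true
end
```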
Would you be open to this direction? If yes, I'd be happy to draft a PR.