Thank you for your work and open-sourcing these new libraries!
I watched your ElixirConf keynote on this and started thinking about graceful shutdown/restart behavior during common deployment setups, especially rolling deployments and blue-green deployments.
For DurableServers that are user-facing or latency-sensitive, it would be useful to minimize the gap between one node shutting down and another node running the same DurableServer.
## Current Behavior
During graceful shutdown, DurableServer already syncs state before termination and marks the node as draining.
For `permanent: true` DurableServers, another node can then restart them through the existing LifecycleManager flow. That is correct and safe, but there can still be a visible gap where the DurableServer is not running anywhere yet.
## Proposal
Add an opt-in supervisor option, `handoff_on_shutdown`.
On shutdown, the supervisor would:
- Mark itself as draining, reusing the existing shutdown behavior.
- Enumerate local DurableServers.
- For each eligible DurableServer, proactively rehome it to another node using existing placement rules.
- Fall back to current graceful shutdown behavior if no eligible node exists, handoff starts failing, or the timeout is reached.
This would reduce the downtime gap for each DurableServer.
## Details
Internally this might be implemented using the existing `rehome_child/3` flow, with bounded concurrency and timeout options.
Example config shape:
```elixir
handoff_on_shutdown: [
  enabled: true,
  timeout_ms: 30_000,
  max_concurrency: 50
]
```
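For illustration only, here is a rough sketch of what the drain step could look like, assuming `rehome_child/3` takes roughly the supervisor, a child id, and a target node (I haven't checked the exact signature), and using placeholder helpers `local_durable_servers/1` and `pick_target_node/1` for listing local children and applying the existing placement rules:

```elixir
# Hypothetical sketch only: local_durable_servers/1, pick_target_node/1, and the
# exact shape of rehome_child/3 are placeholders, not the library's actual API.
defp drain_and_handoff(sup, opts) do
  timeout = Keyword.get(opts, :timeout_ms, 30_000)
  max_concurrency = Keyword.get(opts, :max_concurrency, 50)

  sup
  |> local_durable_servers()
  |> Task.async_stream(
    fn child_id ->
      case pick_target_node(child_id) do
        # No eligible node: leave this child to the normal graceful shutdown.
        nil -> :skip
        node -> rehome_child(sup, child_id, node)
      end
    end,
    max_concurrency: max_concurrency,
    timeout: timeout,
    on_timeout: :kill_task
  )
  # Stop proactive handoff as soon as attempts start failing or timing out;
  # anything not rehomed falls back to the existing graceful shutdown path.
  |> Enum.take_while(&match?({:ok, _}, &1))
end
```

Halting the stream at the first timeout or crash would shut down the remaining in-flight attempts, so the rest of the DurableServers just go through the normal shutdown path.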
This should be best-effort only. If there is no eligible new node running, or once handoff attempts start timing out/failing, the supervisor could stop attempting proactive handoff and fall back to the existing graceful shutdown behavior for the remaining DurableServers.
I think this would be useful in many cases where new nodes overlap with old nodes during deploys and the DurableServers are latency-sensitive. It might be less useful, or even counterproductive, when mass restarts create so much load that there is a resource bottleneck, like in the demo in your keynote where the bottleneck appeared to be the object store.
It could still be extended with configurability/priority per DurableServer module or even per instance, but this is the basic idea.
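Purely as an illustration of that extension (these callback names are made up, not part of the library), per-module configurability could look something like an optional callback:

```elixir
# Purely hypothetical: handoff_priority/0 and handoff?/0 are invented callback
# names (and `use DurableServer` may not be the exact macro), just to show what
# per-module configurability could look like.
defmodule MyApp.CartSession do
  use DurableServer

  # Latency-sensitive servers would be rehomed first while the node drains.
  def handoff_priority, do: 100

  # Modules could also opt out of proactive handoff entirely.
  def handoff?, do: true
end
```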
Would you be open to this direction? If yes, I'd be happy to draft a PR.