Make Core.shutdown idempotent and safe to call concurrently#6891
Conversation
After #6887, Core.shutdown() now runs in the SIGTERM path during host shutdown in addition to the existing host.control reboot/shutdown and backup restore paths. Multiple concurrent callers were possible (e.g. SIGTERM arriving while a reboot API call is mid-flight), so __main__.py debounced the signal handler by stashing the in-flight task in a single- element list and bailing out on the second SIGTERM. Move the idempotency into Core.shutdown() itself, where it belongs: - A second call while shutdown is in progress awaits the in-flight shutdown via an asyncio.Event rather than re-running the sequence. - Calls during STOPPING/CLOSE return early (Supervisor is already going away; the work is moot). - Calls during STARTING_STATES (INITIALIZE/STARTUP/SETUP) return early too. There is nothing coherent to gracefully stop before startup completes, and on the SIGTERM-during-startup path the caller cancels startup_task first, so waiting for it to complete would deadlock. - The sequence is wrapped in try/finally so the completion event is set even when an inner step raises. With that in place the closure workaround in __main__.py collapses to a plain coresys.create_task(stop_supervisor()): repeat SIGTERMs spawn extra tasks but each just observes the in-flight shutdown and waits. Tests cover the four state branches and confirm the event is reset between repeated shutdown cycles (backup restore re-enters RUNNING).
mdegat01
left a comment
There was a problem hiding this comment.
The code is fine, works and definitely helpful.
My comments are just about a seemingly supported use case that confuses me. There is currently no way to return to RUNNING after SHUTDOWN without exiting and restarting the python process in production. Is there a future use case here we're looking to support I'm unaware of? If not we should probably remove the perceived support for that since its a bit confusing.
| # Reset event for this shutdown cycle (supports repeated use, e.g. backup restore) | ||
| self._shutdown_event.clear() |
There was a problem hiding this comment.
The concept of repeated use doesn't really make sense here. Shutdown eventually leads to a stop in the python process which obviously clears the event. And there's no way to go back to CoreState.RUNNING from those closing states. So I'm not really sure when this would have any effect? At the very least the comment should probably be adjusted since its example isn't a real use case.
| async def test_shutdown_event_reset_between_cycles(coresys: CoreSys): | ||
| """Repeated shutdown cycles (e.g. backup restore) work because the event is reset.""" | ||
| await coresys.core.set_state(CoreState.RUNNING) | ||
|
|
||
| await coresys.core.shutdown() | ||
| assert coresys.core._shutdown_event.is_set() | ||
|
|
||
| # Simulate restore returning to RUNNING and shutting down again | ||
| await coresys.core.set_state(CoreState.RUNNING) | ||
|
|
||
| second_entered = False | ||
| original_shutdown = coresys.apps.shutdown | ||
|
|
||
| async def track_app_shutdown(startup): | ||
| nonlocal second_entered | ||
| second_entered = True | ||
| return await original_shutdown(startup) | ||
|
|
||
| with patch.object(coresys.apps, "shutdown", side_effect=track_app_shutdown): | ||
| await coresys.core.shutdown() | ||
|
|
||
| assert second_entered | ||
| assert coresys.core._shutdown_event.is_set() |
There was a problem hiding this comment.
I mean this is very much a fabricated use case, it can't ever happen in production. CoreState cannot get back to RUNNING from SHUTDOWN. Is there a plan to support some kind of live backup and restore for Supervisor I'm unaware of?
|
Please take a look at the requested changes, and use the Ready for review button when you are done, thanks 👍 |
Proposed change
After #6887,
Core.shutdown()runs from the SIGTERM signal handler during host shutdown in addition to the existing call sites (host.control.reboot()/shutdown(), backup restore). To avoid spawning a secondstop_supervisor()task when a repeat SIGTERM arrives,__main__.pydebounced the signal handler by stashing the in-flight task in a single-element list and bailing out on subsequent invocations. That workaround only covers the SIGTERM-vs-SIGTERM race; two API-initiated reboots, or a reboot API call racing with SIGTERM, would still re-enter the shutdown sequence.This PR moves the idempotency into
Core.shutdown()itself, where every caller benefits:asyncio.Eventrather than re-running the sequence.STOPPING/CLOSEreturn early — Supervisor is already going away, the work is moot.STARTING_STATES(INITIALIZE/STARTUP/SETUP) return early too. There is nothing coherent to gracefully stop before startup completes, and on the SIGTERM-during-startup path the caller cancelsstartup_taskfirst, so waiting for it to complete would deadlock (thefinallyinCore.start()cannot reachset_state(RUNNING)once awaits raiseCancelledError).try/finallyso the completion event is set even when an inner step raises.With idempotency in
Core, the closure-with-list workaround in__main__.pycollapses to a plaincoresys.create_task(stop_supervisor()): repeat SIGTERMs spawn extra tasks but each just observes the in-flight shutdown and waits.Inspired by (a subset of) #6631.
Type of change
Additional information
Checklist
ruff format supervisor tests)If API endpoints or add-on configuration are added/changed: