Skip to content

Make Core.shutdown idempotent and safe to call concurrently#6891

Draft
agners wants to merge 1 commit into
mainfrom
core-shutdown-reentrant
Draft

Make Core.shutdown idempotent and safe to call concurrently#6891
agners wants to merge 1 commit into
mainfrom
core-shutdown-reentrant

Conversation

@agners

@agners agners commented May 28, 2026

Copy link
Copy Markdown
Member

Proposed change

After #6887, Core.shutdown() runs from the SIGTERM signal handler during host shutdown in addition to the existing call sites (host.control.reboot()/shutdown(), backup restore). To avoid spawning a second stop_supervisor() task when a repeat SIGTERM arrives, __main__.py debounced the signal handler by stashing the in-flight task in a single-element list and bailing out on subsequent invocations. That workaround only covers the SIGTERM-vs-SIGTERM race; two API-initiated reboots, or a reboot API call racing with SIGTERM, would still re-enter the shutdown sequence.

This PR moves the idempotency into Core.shutdown() itself, where every caller benefits:

  • A second call while a shutdown is in progress awaits the in-flight shutdown via an asyncio.Event rather than re-running the sequence.
  • Calls while state is STOPPING/CLOSE return early — Supervisor is already going away, the work is moot.
  • Calls while state is in STARTING_STATES (INITIALIZE/STARTUP/SETUP) return early too. There is nothing coherent to gracefully stop before startup completes, and on the SIGTERM-during-startup path the caller cancels startup_task first, so waiting for it to complete would deadlock (the finally in Core.start() cannot reach set_state(RUNNING) once awaits raise CancelledError).
  • The sequence is wrapped in try/finally so the completion event is set even when an inner step raises.

With idempotency in Core, the closure-with-list workaround in __main__.py collapses to a plain coresys.create_task(stop_supervisor()): repeat SIGTERMs spawn extra tasks but each just observes the in-flight shutdown and waits.

Inspired by (a subset of) #6631.

Type of change

  • Dependency upgrade
  • Bugfix (non-breaking change which fixes an issue)
  • New feature (which adds functionality to the supervisor)
  • Breaking change (fix/feature causing existing functionality to break)
  • Code quality improvements to existing code or addition of tests

Additional information

Checklist

  • The code change is tested and works locally.
  • Local tests pass. Your PR cannot be merged unless tests pass
  • There is no commented out code in this PR.
  • I have followed the development checklist
  • The code has been formatted using Ruff (ruff format supervisor tests)
  • Tests have been added to verify that the new code works.

If API endpoints or add-on configuration are added/changed:

After #6887, Core.shutdown() now runs in the SIGTERM path during host
shutdown in addition to the existing host.control reboot/shutdown and
backup restore paths. Multiple concurrent callers were possible (e.g.
SIGTERM arriving while a reboot API call is mid-flight), so __main__.py
debounced the signal handler by stashing the in-flight task in a single-
element list and bailing out on the second SIGTERM.

Move the idempotency into Core.shutdown() itself, where it belongs:

- A second call while shutdown is in progress awaits the in-flight
  shutdown via an asyncio.Event rather than re-running the sequence.
- Calls during STOPPING/CLOSE return early (Supervisor is already going
  away; the work is moot).
- Calls during STARTING_STATES (INITIALIZE/STARTUP/SETUP) return early
  too. There is nothing coherent to gracefully stop before startup
  completes, and on the SIGTERM-during-startup path the caller cancels
  startup_task first, so waiting for it to complete would deadlock.
- The sequence is wrapped in try/finally so the completion event is set
  even when an inner step raises.

With that in place the closure workaround in __main__.py collapses to a
plain coresys.create_task(stop_supervisor()): repeat SIGTERMs spawn
extra tasks but each just observes the in-flight shutdown and waits.

Tests cover the four state branches and confirm the event is reset
between repeated shutdown cycles (backup restore re-enters RUNNING).
@agners agners added the refactor A code change that neither fixes a bug nor adds a feature label May 28, 2026
@agners agners requested a review from sairon May 28, 2026 13:42

@mdegat01 mdegat01 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code is fine, works and definitely helpful.

My comments are just about a seemingly supported use case that confuses me. There is currently no way to return to RUNNING after SHUTDOWN without exiting and restarting the python process in production. Is there a future use case here we're looking to support I'm unaware of? If not we should probably remove the perceived support for that since its a bit confusing.

Comment thread supervisor/core.py
Comment on lines +392 to +393
# Reset event for this shutdown cycle (supports repeated use, e.g. backup restore)
self._shutdown_event.clear()

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The concept of repeated use doesn't really make sense here. Shutdown eventually leads to a stop in the python process which obviously clears the event. And there's no way to go back to CoreState.RUNNING from those closing states. So I'm not really sure when this would have any effect? At the very least the comment should probably be adjusted since its example isn't a real use case.

Comment thread tests/test_core.py
Comment on lines +272 to +294
async def test_shutdown_event_reset_between_cycles(coresys: CoreSys):
"""Repeated shutdown cycles (e.g. backup restore) work because the event is reset."""
await coresys.core.set_state(CoreState.RUNNING)

await coresys.core.shutdown()
assert coresys.core._shutdown_event.is_set()

# Simulate restore returning to RUNNING and shutting down again
await coresys.core.set_state(CoreState.RUNNING)

second_entered = False
original_shutdown = coresys.apps.shutdown

async def track_app_shutdown(startup):
nonlocal second_entered
second_entered = True
return await original_shutdown(startup)

with patch.object(coresys.apps, "shutdown", side_effect=track_app_shutdown):
await coresys.core.shutdown()

assert second_entered
assert coresys.core._shutdown_event.is_set()

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean this is very much a fabricated use case, it can't ever happen in production. CoreState cannot get back to RUNNING from SHUTDOWN. Is there a plan to support some kind of live backup and restore for Supervisor I'm unaware of?

@home-assistant

Copy link
Copy Markdown

Please take a look at the requested changes, and use the Ready for review button when you are done, thanks 👍

Learn more about our pull request process.

@home-assistant home-assistant Bot marked this pull request as draft May 28, 2026 19:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cla-signed refactor A code change that neither fixes a bug nor adds a feature

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants