Skip to content

Reap fork children off the proxy loop#1816

Open
xeophon wants to merge 2 commits into
feat/nano-as-v1from
codex/v1-reap-off-loop
Open

Reap fork children off the proxy loop#1816
xeophon wants to merge 2 commits into
feat/nano-as-v1from
codex/v1-reap-off-loop

Conversation

@xeophon

@xeophon xeophon commented Jun 21, 2026

Copy link
Copy Markdown
Member

Overview

Move fork-child reaping and private-CWD deletion off the shared MCP proxy event loop while preserving the fork parent's single-threaded spawn boundary.

Why

A rollout close currently performs blocking waitpid(..., 0) and recursive rmtree(...) calls directly on the proxy loop. A metadata-heavy child directory can therefore pause heartbeats, timers, and requests to otherwise-ready rollout children for the full cleanup duration.

Design

  • Keep child lookup, removal, waiter notification, and SIGKILL ownership on the proxy loop.
  • Serialize cleanup with the existing fork lock so no replacement child can reuse a directory while it is being deleted and no fork() can run while a cleanup thread exists.
  • Run ordered waitpid and rmtree work in one short-lived executor worker, then join that worker before releasing the fork lock.
  • Let requests to already-ready children continue through the lock-free fast path.
  • Finish the cleanup future before propagating cancellation, so a removed child cannot leave its private directory behind.

Performance and resources

A standalone PEP 723 benchmark reaped one real child and deleted an identical tree of 20,000 one-byte files across 21 directories. Three paired macOS/APFS runs produced these medians:

Metric Synchronous Off-loop Change
Maximum proxy-loop gap 0.852173 s 0.001872 s 0.850301 s recovered (99.8%)
Cleanup wall time 0.851710 s 0.896938 s +0.045228 s (+5.3%)
Retained RSS delta 376,832 B 671,744 B +294,912 B (~0.281 MiB)
Peak traced Python allocation 137,918 B 228,641 B +90,723 B (~88.6 KiB)
Maximum asyncio tasks 2 2 unchanged
Active threads during cleanup 1 2 one temporary worker

The optimization targets loop responsiveness rather than deletion throughput: the filesystem work remains the same, and wall time varies with metadata/cache behavior. The temporary worker is fully joined before another child may be forked. The workload opens no network connections, copies no bytes, and removes the same child process, files, directories, and payload in both modes.


Note

Medium Risk
Changes fork-child lifecycle and lock ordering around fork() and directory deletion; mistakes could leak processes/dirs or race spawns, though scope is limited to multiplex.py reap.

Overview
Fork-child teardown no longer blocks the MCP proxy loop. In serve_forked's reap, blocking waitpid and shutil.rmtree on the rollout's private CWD are moved to a single-worker ThreadPoolExecutor, while SIGKILL, map removal, and waiter wakeups stay on the asyncio loop.

Cleanup is serialized with spawning by holding the existing forking lock for the full reap (including joining the executor), so a new child cannot fork() while a cleanup thread is still running and a directory is not deleted under a reused key.

Cancellation still finishes cleanup: the executor future is awaited with asyncio.shield, and on CancelledError the code awaits the future before re-raising so teardown cannot orphan the private CWD.

Reviewed by Cursor Bugbot for commit a0c743e. Bugbot is set up for automated code reviews on this repo. Configure here.

Note

Reap fork children off the proxy event loop in serve_forked

  • The reap coroutine in multiplex.py now acquires the forking lock before reaping, serializing cleanup with concurrent fork operations.
  • Blocking waitpid and directory removal are offloaded to a ThreadPoolExecutor via loop.run_in_executor, keeping the event loop unblocked.
  • asyncio.shield wraps the cleanup future so that cancellation of the coroutine still awaits full cleanup before re-raising.

Macroscope summarized a0c743e.

@cursor cursor Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit 2cb75b6. Configure here.

Comment thread verifiers/v1/mcp/multiplex.py
@macroscopeapp

macroscopeapp Bot commented Jun 21, 2026

Copy link
Copy Markdown

Approvability

Verdict: Needs human review

This PR changes concurrency behavior in fork child reaping - adding new locking, moving blocking operations to a ThreadPoolExecutor, and modifying cancellation semantics. Changes to process lifecycle management and async concurrency patterns warrant human review.

You can customize Macroscope's approvability policy. Learn more.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant