feat(daemon): session idle reaper for automatic cleanup#4833
Conversation
…nected sessions Idle sessions accumulate when clients close browser tabs or crash without calling DELETE /session. Without cleanup, sessions leak memory (EventBus ring ~2-4 MB each) and eventually hit the maxSessions cap (default 20), locking out new sessions entirely. Add a configurable session reaper that periodically scans the in-memory session registry and closes sessions that have no SSE subscribers, no registered clients, no active prompt, and whose last heartbeat exceeds a configurable idle TTL (default 30 minutes). Key design decisions: - Uses existing closeSession path (soft close, not hard kill) - JSONL transcripts on disk are preserved — session/load or session/resume can restore any reaped session - Emits session_closed with reason 'idle_timeout' so clients can distinguish from explicit closes - Reaper timer is .unref()'d and stopped on shutdown/killAllSync - Configurable via --session-reap-interval-ms and --session-idle-timeout-ms CLI flags (0 = disabled) Generated with AI Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>
|
Thanks for the PR! Template: the PR body uses Direction: clearly aligned. Orphaned sessions locking out new users at the Approach: the scope feels right — a Moving on to code review. 🔍 中文说明感谢贡献! 模板: PR 正文使用了 方向: 明确对齐。孤立会话在 方案: 范围合理——桥闭包内的 进入代码审查 🔍 — Qwen Code · qwen3.7-max |
📋 Review SummaryThis PR implements a session idle reaper that automatically cleans up orphaned sessions in the daemon's in-memory registry. The implementation is well-designed, thoroughly tested, and follows existing patterns in the codebase. All 222 existing bridge tests pass with zero regression, and 11 new reaper-specific tests comprehensively cover the feature's behavior. 🔍 General Feedback
🎯 Specific Feedback🟡 High
🟢 Medium
🔵 Low
✅ Highlights
|
| } | ||
|
|
||
| async function closeSessionImpl( | ||
| sessionId: string, |
There was a problem hiding this comment.
[Suggestion] The closeSessionImpl extraction drops four load-bearing comments from the old inline closeSession method. These comments explain why, not what — and the regression suite explicitly does not enforce the invariants they describe.
The most critical is the HAZARD block about channelInfoForEntry(entry) vs module-scoped channelInfo:
"The two diverge during the channel-overlap window — A dying, B freshly spawned — where capturing
channelInfowould (1) skip thesessionIds.delete()sinceB.channel !== entry.channel, and (2) callmarkSessionClosedon B's client instead of A's. The regression test is single-channel smoke only and WILL NOT fail if this reverts to module-scoped channelInfo."
Without this warning, a future maintainer could "simplify" channelInfoForEntry(entry) to channelInfo and silently reintroduce the channel-overlap bug.
Three other dropped comments are also load-bearing:
- Tombstone:
markSessionClosedprevents lateextNotificationfrom seeding the early-event buffer - Ordering:
session_closedmust be published before ACP cancel so late cancellation frames are intentionally dropped - Back-compat:
data.closedByexists for wire-level back-compat; new code should use envelope-leveloriginatorClientId
Please restore these comments verbatim in closeSessionImpl. AGENTS.md states: "don't delete existing [comments] as cleanup."
— qwen3.7-max via Qwen Code /review
| originatorClientId = resolveTrustedClientId(entry, context.clientId); | ||
| } | ||
| writeStderrLine( | ||
| `qwen serve: closing session ${JSON.stringify(sessionId)}` + |
There was a problem hiding this comment.
[Suggestion] The stderr log line here does not include the close reason. When the reaper closes a session, operators see the same log format as a client-initiated close — making it hard to distinguish the two during incident investigation.
| `qwen serve: closing session ${JSON.stringify(sessionId)}` + | |
| writeStderrLine( | |
| `qwen serve: closing session ${JSON.stringify(sessionId)}` + | |
| ` (reason: ${closeOpts?.reason ?? 'client_close'})` + | |
| (originatorClientId | |
| ? ` by client ${JSON.stringify(originatorClientId)}` | |
| : ''), | |
| ); |
— qwen3.7-max via Qwen Code /review
| : ''), | ||
| ); | ||
| telemetry.event('session.close', { | ||
| 'qwen-code.daemon.bridge.operation': 'session.close', |
There was a problem hiding this comment.
[Suggestion] The telemetry event does not include the close reason attribute. This makes it impossible to answer "how many sessions were reaped vs. explicitly closed?" from a telemetry dashboard — a likely first question during incidents involving unexpected session closures.
| 'qwen-code.daemon.bridge.operation': 'session.close', | |
| telemetry.event('session.close', { | |
| 'qwen-code.daemon.bridge.operation': 'session.close', | |
| 'session.id': sessionId, | |
| 'session.close.reason': closeOpts?.reason ?? 'client_close', | |
| }); |
Note: since reason is computed later (line 2128), you'd need to either move this computation up or inline the fallback as shown above.
— qwen3.7-max via Qwen Code /review
| }, | ||
| ); | ||
| } | ||
| }, sessionReapIntervalMs); |
There was a problem hiding this comment.
[Suggestion] String(err) calls toString() which only returns the message, losing the stack trace. If closeSessionImpl fails inside notifyAgentSessionClose (which does async I/O to the ACP child), the operator gets a one-line error with no call stack — making reaper failures hard to debug.
| }, sessionReapIntervalMs); | |
| writeStderrLine( | |
| `qwen serve: session reaper failed to close ` + | |
| `${JSON.stringify(id)}: ${err instanceof Error ? (err.stack ?? err.message) : String(err)}`, | |
| ); |
— qwen3.7-max via Qwen Code /review
…server integration tests - Add 'session.close.reason' attribute to telemetry event so operators can distinguish reaper-initiated closes from client-initiated ones in dashboards - Add test verifying channel idle timer fires after reaper closes the last session on a channel (design doc test #12) - Add server.test.ts integration tests: health endpoint reflects session count changes, DELETE /session passes no close opts - Update fakeBridge.closeSession signature to accept the new CloseSessionOpts third parameter Generated with AI Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>
PR Verification ReportPR: #4833 — feat(daemon): session idle reaper for automatic cleanup Test Results
New Tests (12 reaper tests)
Code ReviewChanges (9 files, +1010/−100):
Key observations:
Caveats
Verdict✅ Ready to merge — Well-designed session reaper with thorough test coverage (12 tests, all passing). Clean Verified by wenshao |
|
|
||
| function resolvePositiveFiniteMs( | ||
| raw: number | undefined, | ||
| fallback: number, |
There was a problem hiding this comment.
[Suggestion] resolvePositiveFiniteMs does not clamp its return value to 2_147_483_647 (max 32-bit signed int), unlike the sibling helper resolvedChannelIdleTimeoutMs (line ~823) which does Math.min(raw, 2_147_483_647). The resolved value feeds setInterval — Node.js treats delays larger than 2^31-1 as 1ms, which would cause the reaper to fire in a tight CPU-burning loop.
| fallback: number, | |
| function resolvePositiveFiniteMs( | |
| raw: number | undefined, | |
| fallback: number, | |
| ): number { | |
| if (raw === undefined) return fallback; | |
| return raw > 0 && Number.isFinite(raw) ? Math.min(raw, 2_147_483_647) : 0; | |
| } |
— qwen3.7-max via Qwen Code /review
| `threshold ${Math.round(sessionIdleTimeoutMs / 1000)}s)`, | ||
| ); | ||
| void closeSessionImpl(id, undefined, { reason: 'idle_timeout' }).catch( | ||
| (err) => { |
There was a problem hiding this comment.
[Suggestion] closeSessionImpl is fire-and-forget (void) with no re-entrancy guard. Since byId.delete(sessionId) happens after await notifyAgentSessionClose(...) (an ACP round-trip that can take seconds), the next reaper tick can find the same session still in byId and fire a duplicate close — producing double notifyAgentSessionClose, double session_closed events, and double telemetry. The same gap also creates a TOCTOU race: a client that reconnects during the async window has its session silently destroyed, violating the design doc's goal of "never destroy a session that has an active prompt."
Consider adding a Set<string> guard plus an idle-predicate re-check:
const reapingSessions = new Set<string>();
// In reaper loop:
if (reapingSessions.has(id)) continue;
reapingSessions.add(id);
void closeSessionImpl(id, undefined, { reason: 'idle_timeout' })
.catch((err) => { ... })
.finally(() => { reapingSessions.delete(id); });
// Inside closeSessionImpl, after the byId.get check:
if (closeOpts?.reason === 'idle_timeout') {
if (entry.activePromptOriginatorClientId !== undefined) return;
if (entry.events.subscriberCount > 0) return;
if (entry.clientIds.size > 0) return;
}This also prevents the race between the reaper and an explicit client DELETE arriving concurrently.
— qwen3.7-max via Qwen Code /review
| if (shuttingDown) return; | ||
| const now = Date.now(); | ||
| for (const [id, entry] of byId) { | ||
| if (entry.activePromptOriginatorClientId !== undefined) continue; |
There was a problem hiding this comment.
[Suggestion] No startup log confirms whether the reaper is active or what thresholds it is using. If resolvePositiveFiniteMs silently converts an invalid CLI flag value (e.g., NaN from --session-idle-timeout-ms=abc) to 0, the reaper is disabled with no diagnostic. During an incident, an oncall engineer has no way to confirm the reaper is running without inspecting process args.
| if (entry.activePromptOriginatorClientId !== undefined) continue; | |
| function startSessionReaper(): void { | |
| if (sessionReapIntervalMs <= 0 || sessionIdleTimeoutMs <= 0) { | |
| writeStderrLine('qwen serve: session reaper disabled'); | |
| return; | |
| } | |
| writeStderrLine( | |
| `qwen serve: session reaper started ` + | |
| `(interval ${sessionReapIntervalMs}ms, ` + | |
| `idle threshold ${sessionIdleTimeoutMs}ms)`, | |
| ); | |
| sessionReaper = setInterval(() => { |
— qwen3.7-max via Qwen Code /review
| it('publishes session_closed with reason idle_timeout via closeSession opts', async () => { | ||
| const handle = makeChannel(); | ||
| const bridge = makeBridge({ | ||
| channelFactory: async () => handle.channel, |
There was a problem hiding this comment.
[Suggestion] No test combines the reaper's natural fire with an assertion on the published session_closed event's reason field. The "idle_timeout reason" test (above) calls bridge.closeSession directly — bypassing the reaper. The "reaps an idle session" test lets the reaper fire but only checks sessionCount === 0, never verifying the event carries reason: 'idle_timeout'. If someone accidentally omits the opts argument in the reaper's closeSessionImpl call, no test would catch it.
Consider adding a test that subscribes to events, lets the reaper fire naturally (fake timers + detach + advance), and asserts event.data.reason === 'idle_timeout'.
— qwen3.7-max via Qwen Code /review
| } | ||
| }); | ||
|
|
||
| it('does NOT reap a session with an active prompt and client', async () => { |
There was a problem hiding this comment.
[Suggestion] The "active prompt" guard test does not isolate the guard it claims to verify. The test sends a prompt but never calls detachClient, so entry.clientIds.size > 0 throughout. The reaper checks clientIds.size > 0 before checking activePromptOriginatorClientId, so if someone accidentally removed the active-prompt guard, this test would still pass because the clientIds guard alone prevents reaping.
To properly isolate the guard: detach the client first (making clientIds.size === 0), then verify the session survives solely because activePromptOriginatorClientId !== undefined.
— qwen3.7-max via Qwen Code /review
…redicate Add close-on-last-detach to detachClient: when clientIds.size drops to 0 AND no SSE subscribers remain, call closeSessionImpl immediately. This handles the normal tab-close path without waiting for the idle reaper. Adjust the idle reaper to NOT check clientIds.size — it now serves as a backstop for the crash path where detach was never sent (clientIds still > 0 but no subscriber and no heartbeat). Add SessionEntry.promptActive boolean flag to reliably detect active prompts regardless of whether an originator clientId was provided, fixing a gap where headless prompts (no clientId context) were invisible to the reaper's activePromptOriginatorClientId check. Update existing heartbeat detach test to use two clients (single-client detach now triggers close-on-last-detach). Add 3 close-on-last-detach tests. Generated with AI Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>
| } | ||
| } | ||
|
|
||
| async function closeSessionImpl( |
There was a problem hiding this comment.
[Suggestion] R3 adds a third call site for closeSessionImpl (the new close-on-last-detach path, alongside the explicit DELETE /session route and the idle reaper), which makes the latent re-entrancy in this function easier to hit.
The R2 round flagged the re-entrancy risk at the reaper level. A per-entry guard in the reaper loop would treat the symptom, but the root cause lives here: byId.delete(sessionId) happens on line 2142, after await notifyAgentSessionClose(...) on line 2135. Compare with killSession (around line 3830), which does the symmetric teardown correctly — byId.delete(sessionId) there is synchronous and precedes await notifyAgentSessionClose(...).
Scenario in R3: a client sends detach, detachClient observes clientIds.size === 0 and enters the new close-on-last-detach branch, which awaits this function. During the notifyAgentSessionClose ACP round-trip (potentially seconds), the reaper tick fires, finds the same entry still in byId, and calls closeSessionImpl again. Both invocations proceed past the byId.get guard and reach the ACP notification, the session_closed publish, and the session.close telemetry event — producing duplicate agent notifications, duplicate telemetry, and a second session_closed publish that only avoids crashing because the entry.events.publish try/catch swallows the "bus already closed" throw.
Suggested fix: mirror the killSession ordering. Move the synchronous teardown block (byId.delete, permissionMediator.forgetSession, pendingPermissionIds.clear, telemetry.metrics?.sessionLifecycle('close'), ci?.client.markSessionClosed) to run before await notifyAgentSessionClose(...). After that, any concurrent caller that does byId.get(sessionId) gets undefined and throws SessionNotFoundError, which every caller already catches. Optionally, also add a closing flag on the entry as defense in depth.
— qwen3.7-max via Qwen Code /review
Summary
closeSessionpath (soft close) — JSONL transcripts on disk are preserved,session/loadorsession/resumecan restore any reaped sessionsession_closedwithreason: 'idle_timeout'so clients can distinguish reaper closes from explicit closes--session-reap-interval-ms(default 60s) and--session-idle-timeout-ms(default 30min) CLI flags;0disablesMotivation
Idle sessions accumulate when clients close browser tabs or crash without calling
DELETE /session. Each orphaned session holds an EventBus replay ring (~2-4 MB). Without cleanup, the daemon eventually hits themaxSessionscap (default 20) and rejects new sessions entirely — a hard availability failure.Design
Full design document included at
docs/design/session-idle-reaper/README.md.Test plan
session_closedevent carries correctreasonfieldcloseSessiondefaults toreason: 'client_close'🤖 Generated with Qwen Code