refactor(server): crash handling and scheduling tests for WorkerPool#465
refactor(server): crash handling and scheduling tests for WorkerPool#46516bit-ykiko wants to merge 3 commits into
Conversation
Replace the round-robin stateless worker dispatch with priority-aware scheduling that prevents background indexing from starving interactive requests and adapts concurrency to memory pressure. Key changes: - Add Priority enum (High/Low) to worker protocol; interactive requests (completion, signature help, format) are High, indexing is Low - WorkerPool tracks busy/idle state per worker and maintains priority queues — High requests are always dispatched before Low - Reserve at least one worker slot for High priority tasks - Adaptive low_limit via memory monitoring (poll every 3s) and AIMD backoff on worker crashes - Set OS nice value on workers based on task priority - Limit stateless workers to one concurrent compilation via UV_THREADPOOL_SIZE=1 - Remove redundant Indexer-layer concurrency control (max_concurrent, monitor_resources) — backpressure is now provided by WorkerPool
Add comprehensive crash handling to WorkerPool: - process_crash() centralizes all crash state transitions for both stateful and stateless workers - Stateful crashes report lost_documents via on_crash callback, letting the caller decide recovery (lazy rebuild via LRU) - Stateless send_stateless() transparently retries up to 2 times on worker crash - AIMD backoff on low_limit after stateless worker crashes - Idempotent release_stateless_slot prevents busy_count underflow when crash handler and StatelessSlot RAII destructor both fire - pick_idle_stateless uses llvm_unreachable instead of silent fallback to index 0 - respawn_worker calls try_dispatch_pending after peer->run() Add 37 unit tests across 4 suites (WorkerPoolStateful, WorkerPoolScheduling, WorkerPoolCrash, WorkerPoolIntegration) covering routing, priority scheduling, crash handling, AIMD backoff, and real-process integration.
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (2)
🚧 Files skipped from review as they are similar to previous changes (2)
📝 WalkthroughWalkthroughIntroduces a ChangesPriority-Aware Stateless Worker Scheduling & Crash Handling
Sequence Diagram(s)sequenceDiagram
participant Compiler
participant WorkerPool
participant acquire_stateless_slot
participant StatelessWorker
participant process_crash
participant on_crash
rect rgba(100, 149, 237, 0.5)
note over Compiler,WorkerPool: High-priority build/format dispatch
Compiler->>WorkerPool: send_stateless(BuildParams{priority=High})
WorkerPool->>acquire_stateless_slot: acquire(Priority::High)
acquire_stateless_slot->>acquire_stateless_slot: bypass low_limit gate, pick idle worker
acquire_stateless_slot-->>WorkerPool: worker_index + StatelessSlot
end
WorkerPool->>StatelessWorker: dispatch build request
StatelessWorker-->>WorkerPool: response (success or crash)
alt worker crashed
rect rgba(220, 80, 80, 0.5)
note over WorkerPool,on_crash: Crash handling path
WorkerPool->>process_crash: process_crash(index, stateful, exit_code, signal)
process_crash->>WorkerPool: decrement alive/busy counters
process_crash->>WorkerPool: apply_crash_backoff() → reduce low_limit
process_crash->>WorkerPool: try_dispatch_pending()
process_crash->>on_crash: WorkerCrashInfo{index, exit_code, ...}
process_crash-->>WorkerPool: should_restart=true
WorkerPool->>WorkerPool: respawn_worker(index)
end
end
WorkerPool->>WorkerPool: ~StatelessSlot → release_stateless_slot
WorkerPool->>WorkerPool: try_dispatch_pending() → signal next waiter
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: a8abe2db72
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| auto result = co_await stateless_workers[idx].peer->send_request(params, opts); | ||
|
|
||
| // Success, or application-level error (worker still alive) — return as-is. | ||
| if(result.has_value() || stateless_workers[idx].alive) |
There was a problem hiding this comment.
Retry against the crashed process, not the reused slot
When a stateless worker dies during send_request, monitor_worker() can run process_crash() and respawn_worker() before this await resumes; respawn_worker() reuses the same slot index and sets alive = true. In that interleaving the result is still the IPC error from the retired peer, but this check sees the newly spawned worker as alive and returns the error instead of performing the intended retry. Capture a per-process generation/restart count or the peer identity before sending rather than checking the current slot state.
Useful? React with 👍 / 👎.
| high_queue.push_back(&pending); | ||
| else | ||
| low_queue.push_back(&pending); | ||
| co_await pending.ready.wait(); |
There was a problem hiding this comment.
Remove cancelled waiters from stateless queues
If an LSP request is cancelled while waiting for a stateless slot, this stack-local PendingStateless is destroyed as the coroutine unwinds, but its address remains in high_queue/low_queue. The next try_dispatch_pending() from a release, crash, or respawn will dereference that dangling pointer and write assigned_worker/signal ready, which can crash or corrupt scheduling. The queued entry needs cancellation cleanup or owned lifetime independent of the waiting frame.
Useful? React with 👍 / 👎.
| if(!can_proceed()) { | ||
| PendingStateless pending(priority); | ||
| if(priority == P::High) | ||
| high_queue.push_back(&pending); | ||
| else | ||
| low_queue.push_back(&pending); | ||
| co_await pending.ready.wait(); |
There was a problem hiding this comment.
Return an error when no stateless workers are alive
When every stateless worker is down, for example after the last one exceeds max_restarts, alive_stateless_count is zero so can_proceed() is false and the request is queued here forever. With no live worker and no future respawn/release to call try_dispatch_pending(), user requests hang instead of getting the previous “all workers are down” style error. Check alive_stateless_count == 0 before enqueueing and fail the request.
Useful? React with 👍 / 👎.
| if(on_crash) | ||
| on_crash(info); |
There was a problem hiding this comment.
Mark sessions dirty after stateful crashes
Stateful crashes now clear the worker ownership, but production code never installs WorkerPool::on_crash (repo-wide search only finds the unit-test fixture). For an already clean open document, Compiler::ensure_compiled() will therefore return true without rebuilding, send_stateful() will assign the path to a fresh worker with no AST, and features like hover/semantic tokens return null until an edit or dependency change dirties the session. Wire this callback in MasterServer to mark lost_documents dirty or otherwise force recompilation.
Useful? React with 👍 / 👎.
|
|
||
| auto pid = stateless_workers[idx].proc.pid(); | ||
| if(pid > 0) { | ||
| kota::sys::set_priority(priority == P::Low ? 10 : 0, pid); |
There was a problem hiding this comment.
Avoid one-way nice changes for worker priority
On normal Linux/macOS runs without elevated scheduling privileges, increasing a process nice value to 10 is allowed but lowering it back to 0 fails; I confirmed this with setpriority in the same environment. Because these calls ignore failure, once a worker is assigned a low-priority request here, later high-priority dispatches that try set_priority(0, pid) cannot restore the worker, so high-priority completions/formatting can continue running at low OS priority after indexing has used every worker.
Useful? React with 👍 / 👎.
| /// - SignatureHelp: + text, version, offset, pch, pcms | ||
| /// - Format: + text, format_range (optional) | ||
| struct BuildParams { | ||
| Priority priority = Priority::Low; |
There was a problem hiding this comment.
Route user-triggered PCM builds as high priority
Defaulting new BuildParams to low priority leaves BuildPCM requests low unless a caller overrides it, and the compile-graph dispatch used by Compiler::ensure_deps() constructs PCM builds without setting priority. When an interactive hover/completion needs module dependencies while background indexing has already filled the low-priority queue, the dependency build waits behind background Index work even though the final user request is marked high priority. Propagate the caller priority into compile-graph PCM builds or split background PCM work from user-triggered dependencies.
Useful? React with 👍 / 👎.
| ++dispatched; | ||
| workers.spawn([&, server_path_id]() -> kota::task<> { | ||
| co_await index_one(server_path_id); |
There was a problem hiding this comment.
Keep background indexing bounded before worker dispatch
With the in_flight gate removed, this loop now spawns one coroutine for every stale queued file before any stateless-worker slot is acquired. On large workspaces those index_one tasks all run far enough to do synchronous per-file work such as module dependency preparation and compile-argument lookup, then pile up in the worker-pool queue, defeating the previous concurrency cap and making pause/resume unable to stop new indexing work until the whole queue has been spawned. Reintroduce a bounded producer or acquire scheduling capacity before launching each per-file task.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/server/worker/stateless_worker.cpp`:
- Around line 299-301: The return values of the _putenv_s function (Windows
platform) and the setenv function (non-Windows platform) are not being checked
after setting the UV_THREADPOOL_SIZE environment variable. If either of these
calls fails, the worker will silently continue with libuv's default thread pool
size, breaking the one-job-per-process scheduling guarantee. Check the return
values of both _putenv_s and setenv immediately after they are called, and if
either returns a non-zero/non-success value, log an appropriate error message
and exit the process early to ensure the initialization failure is caught and
reported.
In `@src/server/worker/worker_pool.cpp`:
- Around line 402-410: The issue is that the stack-local PendingStateless
pending variable's address is stored in either high_queue or low_queue when
can_proceed() returns false, and if the coroutine scope exits before the queue
removes this entry (due to cancellation or early return), subsequent calls to
try_dispatch_pending will dereference a dangling pointer. Fix this by wrapping
the queue insertion in an RAII guard using scope_exit to ensure that the address
of pending is removed from whichever queue it was added to (high_queue or
low_queue) before the pending variable goes out of scope, protecting against
premature coroutine termination.
In `@src/server/worker/worker_pool.h`:
- Around line 109-115: The PendingStateless struct is stack-allocated in
acquire_stateless_slot but its address is stored in priority queues, creating a
dangling pointer risk if the coroutine is cancelled before try_dispatch_pending
can dispatch it. To fix this, implement one of the following approaches:
heap-allocate PendingStateless using std::shared_ptr to extend the object
lifetime beyond the coroutine frame and ensure safe cleanup, or add a
validity/cancellation flag to the PendingStateless struct that
try_dispatch_pending checks before dereferencing to safely skip cancelled
entries, or implement a cancellation hook or destructor that removes the entry
from the queue when the coroutine is cancelled to prevent dereferencing a stale
pointer.
In `@tests/unit/server/worker_pool_tests.cpp`:
- Around line 711-717: The loop that waits for worker restart exits prematurely
when the worker becomes alive again, but before the PID has actually changed due
to the respawn. Modify the break condition around line 713-715 to check not only
that f.worker_alive(0) is true, but also that f.worker_pid(0) has changed from
the original pid variable before breaking from the loop. This ensures the
crash/respawn cycle is fully complete, preventing line 717 from intermittently
failing when it verifies the PID has changed.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: eee3ed09-a6fa-4d4e-ab18-9d1f7de729ea
📒 Files selected for processing (9)
src/server/compiler/compiler.cppsrc/server/compiler/indexer.cppsrc/server/compiler/indexer.hsrc/server/protocol/worker.hsrc/server/service/master_server.cppsrc/server/worker/stateless_worker.cppsrc/server/worker/worker_pool.cppsrc/server/worker/worker_pool.htests/unit/server/worker_pool_tests.cpp
💤 Files with no reviewable changes (3)
- src/server/compiler/indexer.h
- src/server/service/master_server.cpp
- src/server/compiler/indexer.cpp
| _putenv_s("UV_THREADPOOL_SIZE", "1"); | ||
| #else | ||
| ::setenv("UV_THREADPOOL_SIZE", "1", 1); |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
cat -n src/server/worker/stateless_worker.cpp | head -310 | tail -30Repository: clice-io/clice
Length of output: 1207
🏁 Script executed:
rg "_putenv_s|::setenv" src/server/worker/stateless_worker.cpp -B 5 -A 5Repository: clice-io/clice
Length of output: 578
🏁 Script executed:
rg "UV_THREADPOOL_SIZE" src/server/worker/stateless_worker.cpp -B 10 -A 10Repository: clice-io/clice
Length of output: 750
🌐 Web query:
_putenv_s Windows return value error handling MSVC
💡 Result:
The _putenv_s function in the Microsoft C Runtime (CRT) returns an errno_t value to indicate success or failure [1][2]. Return Values: - 0: The function completed successfully [1][3]. - EINVAL: An error occurred [1][3]. This typically happens if either the varname or value_string arguments are NULL [1][4]. Error Handling: - When an error condition occurs (such as passing NULL arguments), the function invokes the CRT invalid parameter handler [1][3]. - If the application allows execution to continue after the handler is invoked, the function returns EINVAL and also sets the global errno variable to EINVAL [1][3]. Important Considerations: - Thread Safety: The _putenv_s function is not thread-safe [1][2]. Concurrent use with _getenv_s or other environment-modifying functions can cause random failures or undefined behavior [1][5]. Synchronization is required if accessed from multiple threads [1][3]. - Scope: Changes made by _putenv_s affect only the environment local to the current process and do not modify the environment segment managed by the operating system [2]. However, these changes are inherited by new processes created via _spawn, _exec, or system [2].
Citations:
- 1: https://learn.microsoft.com/en-us/cpp/c-runtime-library/reference/putenv-s-wputenv-s?view=msvc-170
- 2: https://github.com/Microsoft/cpp-docs/blob/main/docs/c-runtime-library/reference/putenv-s-wputenv-s.md
- 3: https://learn.microsoft.com/en-us/previous-versions/visualstudio/visual-studio-2013/eyw7eyfw(v=vs.120)
- 4: https://learn.microsoft.com/en-us/previous-versions/eyw7eyfw(v=vs.140)
- 5: https://learn.microsoft.com/en-us/cpp/c-runtime-library/reference/getenv-s-wgetenv-s?view=msvc-170
🌐 Web query:
setenv POSIX return value error handling unix
💡 Result:
Under the POSIX standard, the setenv function is used to add or update an environment variable in the calling process [1][2]. Return Values: - Upon successful completion, setenv returns 0 [1][2]. - If an error occurs, it returns -1, and the global errno variable is set to indicate the specific error [1][2]. When an error occurs, the environment remains unchanged [1][2]. Error Handling (errno): The function is required to fail and set errno in the following cases [1][2]: - [EINVAL]: The environment variable name (envname) is an empty string, is a NULL pointer (depending on implementation), or contains an '=' character [1][3][2]. - [ENOMEM]: The system has insufficient memory to add or update the variable in the environment [1][2]. Developers should check the return value against -1 to detect errors and then consult errno to determine the cause [3][2]. Because setenv copies the strings provided to it [1][2], it is generally safer than functions like putenv regarding memory management, though it is not required by POSIX to be thread-safe [2][4].
Citations:
- 1: https://pubs.opengroup.org/onlinepubs/9699919799.2013edition/functions/setenv.html
- 2: https://man7.org/linux/man-pages/man3/setenv.3p.html
- 3: https://man7.org/linux/man-pages/man3/setenv.3.html
- 4: https://man.archlinux.org/man/setenv.3.en.txt
Check return values of _putenv_s and setenv to ensure UV_THREADPOOL_SIZE is set.
Line 299 and Line 301 ignore the return values from these functions. If environment setup fails (e.g., due to memory exhaustion or invalid arguments), the worker will run with libuv's default thread pool size instead of the required single thread, violating the one-job-per-process scheduling guarantee. Early exit with error reporting is appropriate for initialization failures.
Proposed patch
-#include <cstdlib>
+#include <cerrno>
+#include <cstdio>
+#include <cstdlib>
int run_stateless_worker_mode(const std::string& worker_name, const std::string& log_dir) {
// Limit libuv thread pool to 1 thread so each stateless worker executes
// only one compilation at a time. Must be set before any kota::queue call.
`#ifdef` _WIN32
- _putenv_s("UV_THREADPOOL_SIZE", "1");
+ if(_putenv_s("UV_THREADPOOL_SIZE", "1") != 0) {
+ std::fputs("Failed to set UV_THREADPOOL_SIZE=1 via _putenv_s\n", stderr);
+ return 1;
+ }
`#else`
- ::setenv("UV_THREADPOOL_SIZE", "1", 1);
+ if(::setenv("UV_THREADPOOL_SIZE", "1", 1) != 0) {
+ std::perror("setenv(UV_THREADPOOL_SIZE)");
+ return 1;
+ }
`#endif`📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| _putenv_s("UV_THREADPOOL_SIZE", "1"); | |
| #else | |
| ::setenv("UV_THREADPOOL_SIZE", "1", 1); | |
| `#ifdef` _WIN32 | |
| if(_putenv_s("UV_THREADPOOL_SIZE", "1") != 0) { | |
| std::fputs("Failed to set UV_THREADPOOL_SIZE=1 via _putenv_s\n", stderr); | |
| return 1; | |
| } | |
| `#else` | |
| if(::setenv("UV_THREADPOOL_SIZE", "1", 1) != 0) { | |
| std::perror("setenv(UV_THREADPOOL_SIZE)"); | |
| return 1; | |
| } | |
| `#endif` |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/server/worker/stateless_worker.cpp` around lines 299 - 301, The return
values of the _putenv_s function (Windows platform) and the setenv function
(non-Windows platform) are not being checked after setting the
UV_THREADPOOL_SIZE environment variable. If either of these calls fails, the
worker will silently continue with libuv's default thread pool size, breaking
the one-job-per-process scheduling guarantee. Check the return values of both
_putenv_s and setenv immediately after they are called, and if either returns a
non-zero/non-success value, log an appropriate error message and exit the
process early to ensure the initialization failure is caught and reported.
| if(!can_proceed()) { | ||
| PendingStateless pending(priority); | ||
| if(priority == P::High) | ||
| high_queue.push_back(&pending); | ||
| else | ||
| low_queue.push_back(&pending); | ||
| co_await pending.ready.wait(); | ||
| co_return pending.assigned_worker; | ||
| } |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Check if kota::task supports cancellation that could destroy the coroutine frame
# while it's suspended, and whether PendingStateless needs cleanup guards.
# Look for cancellation patterns in the codebase
rg -n -C3 'cancel|Cancel' --type cpp -g '*.h' -g '*.cpp' | head -100
# Check if task_group has cancel semantics
ast-grep --pattern 'class task_group {
$$$
}'
# Look at how PendingStateless is used in tests
rg -n 'PendingStateless' --type cppRepository: clice-io/clice
Length of output: 1980
🌐 Web query:
clice-io/clice kota::task coroutine cancellation
💡 Result:
In the context of the clice-io ecosystem, coroutine cancellation is primarily handled within the kotatsu library, which provides the asynchronous runtime and task management framework used by the clice language server [1][2]. The kotatsu library features a cooperative cancellation model for its task<T, E, C> objects [1]. Key components and mechanisms of this model include: Cancellation Tokens: The library uses cancellation_source and cancellation_token for thread-safe, external triggering of cancellation requests [1]. Explicit Cancellation: Tasks can be explicitly cancelled using co_await cancel, which transitions the task to a cancelled state [1]. Composed Cancellation: The runtime supports structured concurrency patterns such as: when_all: Wait for multiple tasks; if the first task fails, it cancels the remaining tasks [1]. when_any: Race multiple tasks; the winner cancels the losers [1]. task_group: Manage dynamic fan-out of tasks that support joining [1]. Cancellation Handling: Tasks support methods like.catch_cancel to convert cancellation into an explicit outcome channel and.or_fail for short-circuiting error propagation without resuming at the await site [1]. Runtime Integration: The asynchronous runtime includes I/O cancellation (e.g., for timers, signals, and process execution) where cancellation signals trigger immediate cleanup, such as disarming timers or sending termination signals (e.g., SIGKILL) to processes [3]. While clice also implements internal Task structures for managing its own asynchronous operations [4][5], these are built upon or mirror the patterns found in the broader kotatsu runtime [1]. Cancellation logic typically involves checking task flags within the promise_base and conditionally destroying coroutine handles when tasks are both cancelled and disposable [4].
Citations:
- 1: https://github.com/clice-io/kotatsu
- 2: https://github.com/clice-io
- 3: clice-io/kotatsu@cee0720
- 4: MoonChasing@467c735
- 5: tsurumi-yizhou@ce2f355
🌐 Web query:
clice worker pool coroutine queue PendingStateless lifetime
💡 Result:
In the context of the clice C++ language server, the terms you mentioned appear to be associated with different, unrelated systems or project-specific implementations, rather than a single unified "clice worker pool coroutine queue PendingStateless" mechanism. 1. clice (C++ Language Server): This is a modern language server for C++23 [1]. Its architecture uses a multi-process worker pool model [2] where a Master Server coordinates stateful and stateless worker processes [2]. It does not use a type or concept named "PendingStateless" [2][3][1]. Instead, it uses libuv for event-driven asynchronous task scheduling [3] and C++20 coroutines for background tasks like indexing [2][4]. Stateful workers use strands (per-document mutexes) and thread pools to serialize compilation and feature queries [2]. 2. PendingStateless: This term does not appear in the clice source code or documentation [2][3][1]. However, an exact string match for "PendingStateless" exists in an unrelated repository named 'suna' [5][6], which appears to be a Python-based agent/pipeline system. In that context, it relates to managing state transitions and write buffers within a stateless agent pipeline [5][6]. 3. Worker Pool/Coroutine/Queue: In the clice language server, the worker pool manages child processes [2][7]. Tasks (such as compilation or background indexing) are dispatched via a task queue [2][3]. These tasks are implemented using C++20 coroutines [2][3][8], which are scheduled on an event loop or thread pool [3]. Worker processes themselves are managed by the main server process [2][7]. If you are investigating performance or architecture within the clice language server, you should focus on the src/server/worker_pool.cpp and src/server/stateless_worker.cpp files [2][7], which define how the LSP server dispatches one-shot requests to stateless workers via round-robin routing and thread-pool-backed task queues [2]. There is no "PendingStateless" lifecycle component in this architecture [2][3].
Citations:
- 1: https://deepwiki.com/clice-project/clice
- 2: 020c2cb
- 3: https://clice.io/design/architecture
- 4: 1dd94e5
- 5: https://github.com/kortix-ai/suna/blob/f9a9123c/backend/core/agents/pipeline/stateless/state.py
- 6: https://github.com/kortix-ai/suna/blob/f9a9123c/backend/core/agents/pipeline/stateless/__init__.py
- 7: tsurumi-yizhou@b6886d2
- 8: MoonChasing@909fcd5
🌐 Web query:
C++ coroutine frame stack allocation queue pointer lifetime
💡 Result:
In C++20, coroutines are stackless, meaning they do not use the standard call stack for their execution state. Instead, they maintain a coroutine frame that stores local variables, temporaries, and promise objects, which persists across suspension points [1][2][3]. Coroutine Frame Allocation: By default, the compiler allocates the coroutine frame on the heap using operator new [1][4]. The size of this frame is determined by the compiler at compile time based on the coroutine's needs, including local variables, promise object size, and bookkeeping [4][2]. Allocation Elision (Optimization): The compiler can optimize away dynamic heap allocation if it can prove that the coroutine's lifetime is strictly nested within the lifetime of its caller and the frame size is known at the call site [1][2]. In such cases, the coroutine frame is embedded directly into the caller's stack frame or the caller's coroutine frame [1][2]. This optimization is often referred to as "HALO" (Heap Allocation elision optimization) or "CoroElide" [5][6]. Custom Allocation: Programmers can control frame allocation by defining a custom operator new within the promise_type [1][4]. This allows for the use of custom pools, allocators, or memory management strategies [7][8]. Because operator new is invoked before the coroutine body executes, the allocator must be available either via the coroutine's parameter list or through ambient state (e.g., thread-local storage) [7][4]. Pointer Lifetime: The coroutine handle (std::coroutine_handle) serves as a non-owning pointer to the coroutine frame [9][2]. It remains valid as long as the coroutine frame has not been destroyed [2]. The lifetime of the frame is managed by the coroutine itself (typically via explicit destruction or completion) [2]. If allocation elision is performed, the frame exists as a local object within the caller; it is essential that the handle is not used after the caller's scope ends, as the frame's memory will be reclaimed when the caller's activation record is popped from the stack [1][2]. Queue and Memory Lifetime: There is no inherent "queue" mechanism for coroutine frames in the C++ standard [7]. If a user implements a custom queue or pool to manage frame lifetime, they must ensure the frame's memory remains valid until the coroutine completes and is destroyed [7][8]. Using a pointer to a coroutine frame that has already been destroyed leads to undefined behavior, as the frame is the only location storing the information needed to resume or finalize the coroutine [9][2].
Citations:
- 1: https://www.cppreference.com/cpp/language/coroutines
- 2: https://en.cppreference.com/cpp/language/coroutines
- 3: https://cppreference.net/cpp/language/coroutines.html
- 4: https://lewissbaker.github.io/2018/09/05/understanding-the-promise-type
- 5: https://releases.llvm.org/17.0.1/docs/Coroutines.html
- 6: https://open-std.org/jtc1/sc22/wg21/docs/papers/2021/p2477r2.html
- 7: https://isocpp.org/files/papers/P4127R0.pdf
- 8: https://stackoverflow.com/questions/57163510/are-stackless-c20-coroutines-a-problem
- 9: https://devblogs.microsoft.com/oldnewthing/20211007-00/?p=105777
Potential dangling pointer from stack-local address stored in queue.
PendingStateless pending is a stack-local variable whose address is stored in high_queue or low_queue. If the coroutine's scope exits (e.g., via cancellation or early return) before the queue removes the pending entry, subsequent calls to try_dispatch_pending() will dereference a dangling pointer.
The risk is particularly acute if coroutine frame allocation uses elision (stack-embedded), which is common for small coroutines. Even with kotatsu's cooperative cancellation model, ensure that queue removal happens before scope exit, or use RAII guards (e.g., scope_exit) to remove &pending from the queue on coroutine destruction.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/server/worker/worker_pool.cpp` around lines 402 - 410, The issue is that
the stack-local PendingStateless pending variable's address is stored in either
high_queue or low_queue when can_proceed() returns false, and if the coroutine
scope exits before the queue removes this entry (due to cancellation or early
return), subsequent calls to try_dispatch_pending will dereference a dangling
pointer. Fix this by wrapping the queue insertion in an RAII guard using
scope_exit to ensure that the address of pending is removed from whichever queue
it was added to (high_queue or low_queue) before the pending variable goes out
of scope, protecting against premature coroutine termination.
| struct PendingStateless { | ||
| worker::Priority priority; | ||
| kota::event ready{}; | ||
| std::size_t assigned_worker = 0; | ||
|
|
||
| explicit PendingStateless(worker::Priority p) : priority(p) {} | ||
| }; |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Check if kota::event has any cancellation-aware wait() overload or
# if there's a pattern for safe queue cleanup elsewhere in the codebase.
rg -n --type cpp -C3 'co_await.*\.wait\(\)' | head -60Repository: clice-io/clice
Length of output: 4658
🏁 Script executed:
#!/bin/bash
# First, get the structure and context of worker_pool.h
cat -n src/server/worker/worker_pool.h | head -150Repository: clice-io/clice
Length of output: 5947
🏁 Script executed:
#!/bin/bash
# Find the acquire_stateless_slot method implementation
rg -n 'acquire_stateless_slot' --type cpp -A 30 src/server/worker/Repository: clice-io/clice
Length of output: 6495
🏁 Script executed:
#!/bin/bash
# Find try_dispatch_pending implementation
rg -n 'try_dispatch_pending' --type cpp -A 20 src/server/worker/Repository: clice-io/clice
Length of output: 6270
🏁 Script executed:
#!/bin/bash
# Check kota::event definition and if it supports cancellation
find . -name '*.h' -o -name '*.cpp' | xargs grep -l 'class event' | head -5
rg -n 'struct event|class event' --type cpp -A 15 | head -80Repository: clice-io/clice
Length of output: 40
🏁 Script executed:
#!/bin/bash
# Find kota event definition
find . -path ./build -prune -o -name '*.h' -type f -print | xargs grep -l 'struct event\|class event' 2>/dev/null | grep -i kota | head -10Repository: clice-io/clice
Length of output: 40
🏁 Script executed:
#!/bin/bash
# Check if there's a cancellation token or cancellation-aware wait() overload in kota
rg -n 'cancel|Cancel' --type cpp src/server/worker/worker_pool.cpp | head -30Repository: clice-io/clice
Length of output: 40
🏁 Script executed:
#!/bin/bash
# Look for any cleanup or destructors in PendingStateless or surrounding code
sed -n '391,430p' src/server/worker/worker_pool.cppRepository: clice-io/clice
Length of output: 1236
🏁 Script executed:
#!/bin/bash
# Search for kota library or headers in the repo
find . -name "kota" -o -name "*kota*" -type d | head -10Repository: clice-io/clice
Length of output: 40
🏁 Script executed:
#!/bin/bash
# Check if kota is included as a submodule or external dependency
ls -la | grep -i kota
cat CMakeLists.txt 2>/dev/null | grep -i kota | head -20Repository: clice-io/clice
Length of output: 329
🏁 Script executed:
#!/bin/bash
# Search for any coroutine cancellation or exception handling in send_stateless
sed -n '180,210p' src/server/worker/worker_pool.hRepository: clice-io/clice
Length of output: 1279
🏁 Script executed:
#!/bin/bash
# Check if there are any exception handlers or cancellation mechanisms
rg -n 'catch_cancel|cancel|exception' --type cpp src/server/worker/worker_pool.h src/server/worker/worker_pool.cpp | head -40Repository: clice-io/clice
Length of output: 40
🏁 Script executed:
#!/bin/bash
# Look at the full try_dispatch_pending implementation
sed -n '432,460p' src/server/worker/worker_pool.cppRepository: clice-io/clice
Length of output: 1074
🏁 Script executed:
#!/bin/bash
# Search for usage of acquire_stateless_slot and check if it's in try/catch blocks
rg -B5 -A5 'acquire_stateless_slot' --type cpp src/server/Repository: clice-io/clice
Length of output: 2311
🏁 Script executed:
#!/bin/bash
# Check if kota::task supports cancellation or has cancellation tokens
rg -n 'kota::task' --type cpp src/server/worker/worker_pool.h | head -5Repository: clice-io/clice
Length of output: 299
🏁 Script executed:
#!/bin/bash
# Look for any request_options that might include timeout or cancellation
rg -n 'request_options|timeout' --type cpp src/server/ | grep -v test | head -20Repository: clice-io/clice
Length of output: 1047
🏁 Script executed:
#!/bin/bash
# Check if there's any try-catch around the acquire_stateless_slot or send_stateless
sed -n '180,210p' src/server/worker/worker_pool.hRepository: clice-io/clice
Length of output: 1279
🏁 Script executed:
#!/bin/bash
# Search for RAII patterns or scope guards that might clean up queue entries
rg -n 'struct.*Slot|~.*Slot' --type cpp src/server/worker/ -A5Repository: clice-io/clice
Length of output: 814
Dangling pointer risk when coroutine is cancelled.
PendingStateless is a stack-local object in acquire_stateless_slot, and its address is pushed to the priority queues. If the waiting coroutine is cancelled (e.g., request timeout, client disconnect) before being dispatched, the coroutine frame is destroyed while the pointer remains in the queue. try_dispatch_pending would then dereference a dangling pointer.
Consider either:
- Heap-allocate
PendingStatelesswith shared ownership, or - Add a cancellation flag that
try_dispatch_pendingchecks before dereferencing, or - Remove the entry from the queue in a destructor/cancellation hook.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/server/worker/worker_pool.h` around lines 109 - 115, The PendingStateless
struct is stack-allocated in acquire_stateless_slot but its address is stored in
priority queues, creating a dangling pointer risk if the coroutine is cancelled
before try_dispatch_pending can dispatch it. To fix this, implement one of the
following approaches: heap-allocate PendingStateless using std::shared_ptr to
extend the object lifetime beyond the coroutine frame and ensure safe cleanup,
or add a validity/cancellation flag to the PendingStateless struct that
try_dispatch_pending checks before dereferencing to safely skip cancelled
entries, or implement a cancellation hook or destructor that removes the entry
from the queue when the coroutine is cancelled to prevent dereferencing a stale
pointer.
| for(int i = 0; i < 50; ++i) { | ||
| co_await kota::sleep(100); | ||
| if(f.worker_alive(0)) | ||
| break; | ||
| } | ||
| EXPECT_TRUE(f.worker_alive(0)); | ||
| EXPECT_NE(f.worker_pid(0), pid); |
There was a problem hiding this comment.
Wait for PID change before exiting the restart loop.
Line 713 can evaluate true before the crash/respawn cycle is complete, so the loop may exit early and Line 717 can compare against the original PID intermittently.
Proposed fix
- for(int i = 0; i < 50; ++i) {
- co_await kota::sleep(100);
- if(f.worker_alive(0))
- break;
- }
- EXPECT_TRUE(f.worker_alive(0));
- EXPECT_NE(f.worker_pid(0), pid);
+ for(int i = 0; i < 50; ++i) {
+ co_await kota::sleep(100);
+ if(f.worker_alive(0) && f.worker_pid(0) != pid)
+ break;
+ }
+ EXPECT_TRUE(f.worker_alive(0));
+ EXPECT_NE(f.worker_pid(0), pid);📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| for(int i = 0; i < 50; ++i) { | |
| co_await kota::sleep(100); | |
| if(f.worker_alive(0)) | |
| break; | |
| } | |
| EXPECT_TRUE(f.worker_alive(0)); | |
| EXPECT_NE(f.worker_pid(0), pid); | |
| for(int i = 0; i < 50; ++i) { | |
| co_await kota::sleep(100); | |
| if(f.worker_alive(0) && f.worker_pid(0) != pid) | |
| break; | |
| } | |
| EXPECT_TRUE(f.worker_alive(0)); | |
| EXPECT_NE(f.worker_pid(0), pid); |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tests/unit/server/worker_pool_tests.cpp` around lines 711 - 717, The loop
that waits for worker restart exits prematurely when the worker becomes alive
again, but before the PID has actually changed due to the respawn. Modify the
break condition around line 713-715 to check not only that f.worker_alive(0) is
true, but also that f.worker_pid(0) has changed from the original pid variable
before breaking from the loop. This ensures the crash/respawn cycle is fully
complete, preventing line 717 from intermittently failing when it verifies the
PID has changed.
Replace SIGKILL (POSIX-only) with literal 9 in monitor_worker fallback path to fix Windows build. Lower default max_restarts from 5 to 2 since excessive retries are not useful.
Summary
process_crash()for both stateful and stateless workerslost_documentsviaon_crashcallback — caller decides recovery strategy (lazy rebuild via LRU on next access)send_stateless()transparently retries up to 2 times on mid-request worker crashlow_limitafter stateless crashes, with memory-pressure monitoringrelease_stateless_slotpreventsbusy_countunderflow between crash handler andStatelessSlotRAII destructorpick_idle_statelessusesllvm_unreachableinstead of silent fallback to dead workerrespawn_workerdispatches pending requests only afterpeer->run()startsTest plan
WorkerPoolStateful(8),WorkerPoolScheduling(11),WorkerPoolCrash(14),WorkerPoolIntegration(4)low_limitenforcement, RAII slot release, concurrency limitsSummary by CodeRabbit
Release Notes
New Features
Improvements
Changes