refactor(server): crash handling and scheduling tests for WorkerPool by 16bit-ykiko · Pull Request #465 · clice-io/clice

16bit-ykiko · 2026-06-18T14:26:00Z

Summary

Add centralized crash handling via process_crash() for both stateful and stateless workers
Stateful crashes report lost_documents via on_crash callback — caller decides recovery strategy (lazy rebuild via LRU on next access)
Stateless send_stateless() transparently retries up to 2 times on mid-request worker crash
AIMD (Additive Increase / Multiplicative Decrease) backoff on low_limit after stateless crashes, with memory-pressure monitoring
Idempotent release_stateless_slot prevents busy_count underflow between crash handler and StatelessSlot RAII destructor
pick_idle_stateless uses llvm_unreachable instead of silent fallback to dead worker
respawn_worker dispatches pending requests only after peer->run() starts

Test plan

37 unit tests across 4 suites: WorkerPoolStateful (8), WorkerPoolScheduling (11), WorkerPoolCrash (14), WorkerPoolIntegration (4)
Stateful routing: least-loaded assignment, path affinity, LRU, clear on crash
Stateless scheduling: priority queues, low_limit enforcement, RAII slot release, concurrency limits
Crash handling: state cleanup, lost document reporting, idempotent release, AIMD backoff math, restart limits
Integration: real process spawn/stop, stateless request round-trip, crash-and-restart, crash notification callback
Full unit test suite passes (658 tests, 0 failures)

Summary by CodeRabbit

Release Notes

New Features
- Added priority-aware scheduling for stateless build and format tasks, allowing high-priority work to run ahead of lower-priority requests.
Improvements
- Overhauled stateless worker capacity management with priority-aware queuing, dynamic throttling, and more responsive dispatch.
- Enhanced worker crash handling, including structured crash reporting and restart/backoff logic.
Changes
- Background indexing concurrency behavior was simplified, and stateless worker execution now limits libuv’s thread pool to a single thread per process.

Replace the round-robin stateless worker dispatch with priority-aware scheduling that prevents background indexing from starving interactive requests and adapts concurrency to memory pressure. Key changes: - Add Priority enum (High/Low) to worker protocol; interactive requests (completion, signature help, format) are High, indexing is Low - WorkerPool tracks busy/idle state per worker and maintains priority queues — High requests are always dispatched before Low - Reserve at least one worker slot for High priority tasks - Adaptive low_limit via memory monitoring (poll every 3s) and AIMD backoff on worker crashes - Set OS nice value on workers based on task priority - Limit stateless workers to one concurrent compilation via UV_THREADPOOL_SIZE=1 - Remove redundant Indexer-layer concurrency control (max_concurrent, monitor_resources) — backpressure is now provided by WorkerPool

Add comprehensive crash handling to WorkerPool: - process_crash() centralizes all crash state transitions for both stateful and stateless workers - Stateful crashes report lost_documents via on_crash callback, letting the caller decide recovery (lazy rebuild via LRU) - Stateless send_stateless() transparently retries up to 2 times on worker crash - AIMD backoff on low_limit after stateless worker crashes - Idempotent release_stateless_slot prevents busy_count underflow when crash handler and StatelessSlot RAII destructor both fire - pick_idle_stateless uses llvm_unreachable instead of silent fallback to index 0 - respawn_worker calls try_dispatch_pending after peer->run() Add 37 unit tests across 4 suites (WorkerPoolStateful, WorkerPoolScheduling, WorkerPoolCrash, WorkerPoolIntegration) covering routing, priority scheduling, crash handling, AIMD backoff, and real-process integration.

coderabbitai · 2026-06-18T14:26:15Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 80489e48-cd40-42e6-8968-5ae263c4bd4b

📥 Commits

Reviewing files that changed from the base of the PR and between a8abe2d and 0bbb206.

📒 Files selected for processing (2)

src/server/worker/worker_pool.cpp
src/server/worker/worker_pool.h

🚧 Files skipped from review as they are similar to previous changes (2)

src/server/worker/worker_pool.h
src/server/worker/worker_pool.cpp

📝 Walkthrough

Walkthrough

Introduces a Priority enum (High/Low) in the worker protocol and rewrites WorkerPool stateless dispatch with priority-aware slot acquisition, AIMD-based memory pressure monitoring, and a process_crash helper that handles restart decisions and fires an on_crash callback. Compiler callsites set High priority for PCH/build/format. The Indexer's own concurrency controls (set_max_concurrency, monitor_resources, in_flight throttling) are removed as they are superseded by pool-level scheduling. Stateless worker processes now cap libuv's thread pool to one via UV_THREADPOOL_SIZE=1.

Changes

Priority-Aware Stateless Worker Scheduling & Crash Handling

Layer / File(s)	Summary
Priority enum and BuildParams contract `src/server/protocol/worker.h`	Adds `enum class Priority { High, Low }` and a `priority` field (default `Low`) to `BuildParams`.
WorkerPool crash info, options, and scheduling data structures `src/server/worker/worker_pool.h`	Introduces `WorkerCrashInfo`, adds `max_restarts` to `WorkerPoolOptions`, replaces round-robin internals with `PendingStateless`/`StatelessSlot`, high/low priority deques, concurrency counters, and new method declarations.
WorkerPool spawn, start, and respawn lifecycle `src/server/worker/worker_pool.cpp`	`spawn_worker` tracks `alive_stateless_count`; `start` initializes `low_limit` and launches `monitor_memory()`; `respawn_worker` resets busy/alive state and calls `try_dispatch_pending()`.
Worker crash detection and process_crash helper `src/server/worker/worker_pool.cpp`	Refactors `monitor_worker` to normalize exits and delegate to new `process_crash`, which marks workers dead, collects lost stateful documents, decrements stateless counters, applies backoff, and fires `on_crash`.
Priority slot acquisition, release, and pending dispatch `src/server/worker/worker_pool.cpp`	Implements `acquire_stateless_slot` (blocks in high/low queue when capacity unavailable), `release_stateless_slot`, `try_dispatch_pending` (high-before-low with `low_limit` gate), and `pick_idle_stateless`.
Memory monitoring and AIMD crash backoff `src/server/worker/worker_pool.cpp`	Adds `monitor_memory()` periodic loop that adjusts `low_limit` from memory pressure, and `apply_crash_backoff()` for AIMD-style reduction on crashes.
Stateless worker UV_THREADPOOL_SIZE initialization `src/server/worker/stateless_worker.cpp`	Sets `UV_THREADPOOL_SIZE=1` via platform-specific env-var call at the start of `run_stateless_worker_mode`.
Compiler callsites set High priority `src/server/compiler/compiler.cpp`	`ensure_pch`, `forward_build`, and `forward_format` each set `BuildParams::priority = High` before dispatching stateless requests.
Remove Indexer-level concurrency controls `src/server/compiler/indexer.h`, `src/server/compiler/indexer.cpp`, `src/server/service/master_server.cpp`	Removes `set_max_concurrency`, `monitor_resources()`, `max_concurrent`/`baseline_concurrent` fields, and `in_flight`/`slot_available` throttling from the indexer; removes the `set_max_concurrency` call from `MasterServer::initialize()`.
WorkerPool unit and integration tests `tests/unit/server/worker_pool_tests.cpp`	Adds `WorkerPoolFixture` and test suites for stateful scheduling, stateless priority dispatch, crash semantics (counters, lost documents, AIMD backoff), and start/stop/crash-restart integration.

Sequence Diagram(s)

sequenceDiagram
  participant Compiler
  participant WorkerPool
  participant acquire_stateless_slot
  participant StatelessWorker
  participant process_crash
  participant on_crash

  rect rgba(100, 149, 237, 0.5)
    note over Compiler,WorkerPool: High-priority build/format dispatch
    Compiler->>WorkerPool: send_stateless(BuildParams{priority=High})
    WorkerPool->>acquire_stateless_slot: acquire(Priority::High)
    acquire_stateless_slot->>acquire_stateless_slot: bypass low_limit gate, pick idle worker
    acquire_stateless_slot-->>WorkerPool: worker_index + StatelessSlot
  end

  WorkerPool->>StatelessWorker: dispatch build request
  StatelessWorker-->>WorkerPool: response (success or crash)

  alt worker crashed
    rect rgba(220, 80, 80, 0.5)
      note over WorkerPool,on_crash: Crash handling path
      WorkerPool->>process_crash: process_crash(index, stateful, exit_code, signal)
      process_crash->>WorkerPool: decrement alive/busy counters
      process_crash->>WorkerPool: apply_crash_backoff() → reduce low_limit
      process_crash->>WorkerPool: try_dispatch_pending()
      process_crash->>on_crash: WorkerCrashInfo{index, exit_code, ...}
      process_crash-->>WorkerPool: should_restart=true
      WorkerPool->>WorkerPool: respawn_worker(index)
    end
  end

  WorkerPool->>WorkerPool: ~StatelessSlot → release_stateless_slot
  WorkerPool->>WorkerPool: try_dispatch_pending() → signal next waiter

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

clice-io/clice#432: Directly modifies clice::Indexer's set_max_concurrency and monitor_resources scheduling in indexer.h/indexer.cpp, the same code this PR removes.
clice-io/clice#435: Modifies Compiler::ensure_pch in compiler.cpp along the BuildPCH dispatch path, where this PR also adds the priority=High assignment.

Poem

🐇 Hoppity-hop through the priority lane,
High builds go first, low ones must wait and strain.
When a worker crashes with a bump and a fall,
AIMD backoff adjusts the limit for all.
No more indexer juggling its own concurrency knobs—
The pool now handles all the scheduling jobs!
— signed, the rabbit who reviews your code 🥕

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 5.21% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The PR title 'refactor(server): crash handling and scheduling tests for WorkerPool' directly addresses the main objectives of centralizing crash management and introducing comprehensive scheduling tests.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch refactor/worker-pool-scheduling

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a8abe2db72

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-06-18T14:35:35Z

+        auto result = co_await stateless_workers[idx].peer->send_request(params, opts);
+
+        // Success, or application-level error (worker still alive) — return as-is.
+        if(result.has_value() || stateless_workers[idx].alive)


Retry against the crashed process, not the reused slot

When a stateless worker dies during send_request, monitor_worker() can run process_crash() and respawn_worker() before this await resumes; respawn_worker() reuses the same slot index and sets alive = true. In that interleaving the result is still the IPC error from the retired peer, but this check sees the newly spawned worker as alive and returns the error instead of performing the intended retry. Capture a per-process generation/restart count or the peer identity before sending rather than checking the current slot state.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-06-18T14:35:35Z

+            high_queue.push_back(&pending);
+        else
+            low_queue.push_back(&pending);
+        co_await pending.ready.wait();


Remove cancelled waiters from stateless queues

If an LSP request is cancelled while waiting for a stateless slot, this stack-local PendingStateless is destroyed as the coroutine unwinds, but its address remains in high_queue/low_queue. The next try_dispatch_pending() from a release, crash, or respawn will dereference that dangling pointer and write assigned_worker/signal ready, which can crash or corrupt scheduling. The queued entry needs cancellation cleanup or owned lifetime independent of the waiting frame.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-06-18T14:35:35Z

+    if(!can_proceed()) {
+        PendingStateless pending(priority);
+        if(priority == P::High)
+            high_queue.push_back(&pending);
+        else
+            low_queue.push_back(&pending);
+        co_await pending.ready.wait();


Return an error when no stateless workers are alive

When every stateless worker is down, for example after the last one exceeds max_restarts, alive_stateless_count is zero so can_proceed() is false and the request is queued here forever. With no live worker and no future respawn/release to call try_dispatch_pending(), user requests hang instead of getting the previous “all workers are down” style error. Check alive_stateless_count == 0 before enqueueing and fail the request.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-06-18T14:35:35Z

+    if(on_crash)
+        on_crash(info);


Mark sessions dirty after stateful crashes

Stateful crashes now clear the worker ownership, but production code never installs WorkerPool::on_crash (repo-wide search only finds the unit-test fixture). For an already clean open document, Compiler::ensure_compiled() will therefore return true without rebuilding, send_stateful() will assign the path to a fresh worker with no AST, and features like hover/semantic tokens return null until an edit or dependency change dirties the session. Wire this callback in MasterServer to mark lost_documents dirty or otherwise force recompilation.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-06-18T14:35:35Z

+
+    auto pid = stateless_workers[idx].proc.pid();
+    if(pid > 0) {
+        kota::sys::set_priority(priority == P::Low ? 10 : 0, pid);


Avoid one-way nice changes for worker priority

On normal Linux/macOS runs without elevated scheduling privileges, increasing a process nice value to 10 is allowed but lowering it back to 0 fails; I confirmed this with setpriority in the same environment. Because these calls ignore failure, once a worker is assigned a low-priority request here, later high-priority dispatches that try set_priority(0, pid) cannot restore the worker, so high-priority completions/formatting can continue running at low OS priority after indexing has used every worker.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-06-18T14:35:35Z

 ///   - SignatureHelp: + text, version, offset, pch, pcms
 ///   - Format:        + text, format_range (optional)
 struct BuildParams {
+    Priority priority = Priority::Low;


Route user-triggered PCM builds as high priority

Defaulting new BuildParams to low priority leaves BuildPCM requests low unless a caller overrides it, and the compile-graph dispatch used by Compiler::ensure_deps() constructs PCM builds without setting priority. When an interactive hover/completion needs module dependencies while background indexing has already filled the low-priority queue, the dependency build waits behind background Index work even though the final user request is marked high priority. Propagate the caller priority into compile-graph PCM builds or split background PCM work from user-triggered dependencies.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-06-18T14:35:35Z

        ++dispatched;
        workers.spawn([&, server_path_id]() -> kota::task<> {
            co_await index_one(server_path_id);


Keep background indexing bounded before worker dispatch

With the in_flight gate removed, this loop now spawns one coroutine for every stale queued file before any stateless-worker slot is acquired. On large workspaces those index_one tasks all run far enough to do synchronous per-file work such as module dependency preparation and compile-argument lookup, then pile up in the worker-pool queue, defeating the previous concurrency cap and making pause/resume unable to stop new indexing work until the whole queue has been spawned. Reintroduce a bounded producer or acquire scheduling capacity before launching each per-file task.

Useful? React with 👍 / 👎.

coderabbitai

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/server/worker/stateless_worker.cpp`:
- Around line 299-301: The return values of the _putenv_s function (Windows
platform) and the setenv function (non-Windows platform) are not being checked
after setting the UV_THREADPOOL_SIZE environment variable. If either of these
calls fails, the worker will silently continue with libuv's default thread pool
size, breaking the one-job-per-process scheduling guarantee. Check the return
values of both _putenv_s and setenv immediately after they are called, and if
either returns a non-zero/non-success value, log an appropriate error message
and exit the process early to ensure the initialization failure is caught and
reported.

In `@src/server/worker/worker_pool.cpp`:
- Around line 402-410: The issue is that the stack-local PendingStateless
pending variable's address is stored in either high_queue or low_queue when
can_proceed() returns false, and if the coroutine scope exits before the queue
removes this entry (due to cancellation or early return), subsequent calls to
try_dispatch_pending will dereference a dangling pointer. Fix this by wrapping
the queue insertion in an RAII guard using scope_exit to ensure that the address
of pending is removed from whichever queue it was added to (high_queue or
low_queue) before the pending variable goes out of scope, protecting against
premature coroutine termination.

In `@src/server/worker/worker_pool.h`:
- Around line 109-115: The PendingStateless struct is stack-allocated in
acquire_stateless_slot but its address is stored in priority queues, creating a
dangling pointer risk if the coroutine is cancelled before try_dispatch_pending
can dispatch it. To fix this, implement one of the following approaches:
heap-allocate PendingStateless using std::shared_ptr to extend the object
lifetime beyond the coroutine frame and ensure safe cleanup, or add a
validity/cancellation flag to the PendingStateless struct that
try_dispatch_pending checks before dereferencing to safely skip cancelled
entries, or implement a cancellation hook or destructor that removes the entry
from the queue when the coroutine is cancelled to prevent dereferencing a stale
pointer.

In `@tests/unit/server/worker_pool_tests.cpp`:
- Around line 711-717: The loop that waits for worker restart exits prematurely
when the worker becomes alive again, but before the PID has actually changed due
to the respawn. Modify the break condition around line 713-715 to check not only
that f.worker_alive(0) is true, but also that f.worker_pid(0) has changed from
the original pid variable before breaking from the loop. This ensures the
crash/respawn cycle is fully complete, preventing line 717 from intermittently
failing when it verifies the PID has changed.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: eee3ed09-a6fa-4d4e-ab18-9d1f7de729ea

📥 Commits

Reviewing files that changed from the base of the PR and between 8cba654 and a8abe2d.

📒 Files selected for processing (9)

src/server/compiler/compiler.cpp
src/server/compiler/indexer.cpp
src/server/compiler/indexer.h
src/server/protocol/worker.h
src/server/service/master_server.cpp
src/server/worker/stateless_worker.cpp
src/server/worker/worker_pool.cpp
src/server/worker/worker_pool.h
tests/unit/server/worker_pool_tests.cpp

💤 Files with no reviewable changes (3)

src/server/compiler/indexer.h
src/server/service/master_server.cpp
src/server/compiler/indexer.cpp

coderabbitai · 2026-06-18T14:36:54Z

+    _putenv_s("UV_THREADPOOL_SIZE", "1");
+#else
+    ::setenv("UV_THREADPOOL_SIZE", "1", 1);


⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

cat -n src/server/worker/stateless_worker.cpp | head -310 | tail -30

Repository: clice-io/clice

Length of output: 1207

🏁 Script executed:

rg "_putenv_s|::setenv" src/server/worker/stateless_worker.cpp -B 5 -A 5

Repository: clice-io/clice

Length of output: 578

🏁 Script executed:

rg "UV_THREADPOOL_SIZE" src/server/worker/stateless_worker.cpp -B 10 -A 10

Repository: clice-io/clice

Length of output: 750

🌐 Web query:

_putenv_s Windows return value error handling MSVC

💡 Result:

The _putenv_s function in the Microsoft C Runtime (CRT) returns an errno_t value to indicate success or failure [1][2]. Return Values: - 0: The function completed successfully [1][3]. - EINVAL: An error occurred [1][3]. This typically happens if either the varname or value_string arguments are NULL [1][4]. Error Handling: - When an error condition occurs (such as passing NULL arguments), the function invokes the CRT invalid parameter handler [1][3]. - If the application allows execution to continue after the handler is invoked, the function returns EINVAL and also sets the global errno variable to EINVAL [1][3]. Important Considerations: - Thread Safety: The _putenv_s function is not thread-safe [1][2]. Concurrent use with _getenv_s or other environment-modifying functions can cause random failures or undefined behavior [1][5]. Synchronization is required if accessed from multiple threads [1][3]. - Scope: Changes made by _putenv_s affect only the environment local to the current process and do not modify the environment segment managed by the operating system [2]. However, these changes are inherited by new processes created via _spawn, _exec, or system [2].

Citations:

1: https://learn.microsoft.com/en-us/cpp/c-runtime-library/reference/putenv-s-wputenv-s?view=msvc-170

2: https://github.com/Microsoft/cpp-docs/blob/main/docs/c-runtime-library/reference/putenv-s-wputenv-s.md

3: https://learn.microsoft.com/en-us/previous-versions/visualstudio/visual-studio-2013/eyw7eyfw(v=vs.120)

4: https://learn.microsoft.com/en-us/previous-versions/eyw7eyfw(v=vs.140)

5: https://learn.microsoft.com/en-us/cpp/c-runtime-library/reference/getenv-s-wgetenv-s?view=msvc-170

🌐 Web query:

setenv POSIX return value error handling unix

💡 Result:

Under the POSIX standard, the setenv function is used to add or update an environment variable in the calling process [1][2]. Return Values: - Upon successful completion, setenv returns 0 [1][2]. - If an error occurs, it returns -1, and the global errno variable is set to indicate the specific error [1][2]. When an error occurs, the environment remains unchanged [1][2]. Error Handling (errno): The function is required to fail and set errno in the following cases [1][2]: - [EINVAL]: The environment variable name (envname) is an empty string, is a NULL pointer (depending on implementation), or contains an '=' character [1][3][2]. - [ENOMEM]: The system has insufficient memory to add or update the variable in the environment [1][2]. Developers should check the return value against -1 to detect errors and then consult errno to determine the cause [3][2]. Because setenv copies the strings provided to it [1][2], it is generally safer than functions like putenv regarding memory management, though it is not required by POSIX to be thread-safe [2][4].

Citations:

1: https://pubs.opengroup.org/onlinepubs/9699919799.2013edition/functions/setenv.html

2: https://man7.org/linux/man-pages/man3/setenv.3p.html

3: https://man7.org/linux/man-pages/man3/setenv.3.html

4: https://man.archlinux.org/man/setenv.3.en.txt

Check return values of _putenv_s and setenv to ensure UV_THREADPOOL_SIZE is set.

Line 299 and Line 301 ignore the return values from these functions. If environment setup fails (e.g., due to memory exhaustion or invalid arguments), the worker will run with libuv's default thread pool size instead of the required single thread, violating the one-job-per-process scheduling guarantee. Early exit with error reporting is appropriate for initialization failures.

Proposed patch

-#include <cstdlib> +#include <cerrno> +#include <cstdio> +#include <cstdlib> int run_stateless_worker_mode(const std::string& worker_name, const std::string& log_dir) { // Limit libuv thread pool to 1 thread so each stateless worker executes // only one compilation at a time. Must be set before any kota::queue call. `#ifdef` _WIN32 - _putenv_s("UV_THREADPOOL_SIZE", "1"); + if(_putenv_s("UV_THREADPOOL_SIZE", "1") != 0) { + std::fputs("Failed to set UV_THREADPOOL_SIZE=1 via _putenv_s\n", stderr); + return 1; + } `#else` - ::setenv("UV_THREADPOOL_SIZE", "1", 1); + if(::setenv("UV_THREADPOOL_SIZE", "1", 1) != 0) { + std::perror("setenv(UV_THREADPOOL_SIZE)"); + return 1; + } `#endif`

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

_putenv_s("UV_THREADPOOL_SIZE", "1");

#else

::setenv("UV_THREADPOOL_SIZE", "1", 1);

`#ifdef` _WIN32

if(_putenv_s("UV_THREADPOOL_SIZE", "1") != 0) {

std::fputs("Failed to set UV_THREADPOOL_SIZE=1 via _putenv_s\n", stderr);

return 1;

}

`#else`

if(::setenv("UV_THREADPOOL_SIZE", "1", 1) != 0) {

std::perror("setenv(UV_THREADPOOL_SIZE)");

return 1;

}

`#endif`

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/server/worker/stateless_worker.cpp` around lines 299 - 301, The return values of the _putenv_s function (Windows platform) and the setenv function (non-Windows platform) are not being checked after setting the UV_THREADPOOL_SIZE environment variable. If either of these calls fails, the worker will silently continue with libuv's default thread pool size, breaking the one-job-per-process scheduling guarantee. Check the return values of both _putenv_s and setenv immediately after they are called, and if either returns a non-zero/non-success value, log an appropriate error message and exit the process early to ensure the initialization failure is caught and reported.

coderabbitai · 2026-06-18T14:36:54Z

+    if(!can_proceed()) {
+        PendingStateless pending(priority);
+        if(priority == P::High)
+            high_queue.push_back(&pending);
+        else
+            low_queue.push_back(&pending);
+        co_await pending.ready.wait();
+        co_return pending.assigned_worker;
+    }


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash # Check if kota::task supports cancellation that could destroy the coroutine frame # while it's suspended, and whether PendingStateless needs cleanup guards. # Look for cancellation patterns in the codebase rg -n -C3 'cancel|Cancel' --type cpp -g '*.h' -g '*.cpp' | head -100 # Check if task_group has cancel semantics ast-grep --pattern 'class task_group { $$$ }' # Look at how PendingStateless is used in tests rg -n 'PendingStateless' --type cpp

Repository: clice-io/clice

Length of output: 1980

🌐 Web query:

clice-io/clice kota::task coroutine cancellation

💡 Result:

In the context of the clice-io ecosystem, coroutine cancellation is primarily handled within the kotatsu library, which provides the asynchronous runtime and task management framework used by the clice language server [1][2]. The kotatsu library features a cooperative cancellation model for its task<T, E, C> objects [1]. Key components and mechanisms of this model include: Cancellation Tokens: The library uses cancellation_source and cancellation_token for thread-safe, external triggering of cancellation requests [1]. Explicit Cancellation: Tasks can be explicitly cancelled using co_await cancel, which transitions the task to a cancelled state [1]. Composed Cancellation: The runtime supports structured concurrency patterns such as: when_all: Wait for multiple tasks; if the first task fails, it cancels the remaining tasks [1]. when_any: Race multiple tasks; the winner cancels the losers [1]. task_group: Manage dynamic fan-out of tasks that support joining [1]. Cancellation Handling: Tasks support methods like.catch_cancel to convert cancellation into an explicit outcome channel and.or_fail for short-circuiting error propagation without resuming at the await site [1]. Runtime Integration: The asynchronous runtime includes I/O cancellation (e.g., for timers, signals, and process execution) where cancellation signals trigger immediate cleanup, such as disarming timers or sending termination signals (e.g., SIGKILL) to processes [3]. While clice also implements internal Task structures for managing its own asynchronous operations [4][5], these are built upon or mirror the patterns found in the broader kotatsu runtime [1]. Cancellation logic typically involves checking task flags within the promise_base and conditionally destroying coroutine handles when tasks are both cancelled and disposable [4].

Citations:

1: https://github.com/clice-io/kotatsu

2: https://github.com/clice-io

3: clice-io/kotatsu@cee0720

4: MoonChasing@467c735

5: tsurumi-yizhou@ce2f355

🌐 Web query:

clice worker pool coroutine queue PendingStateless lifetime

💡 Result:

In the context of the clice C++ language server, the terms you mentioned appear to be associated with different, unrelated systems or project-specific implementations, rather than a single unified "clice worker pool coroutine queue PendingStateless" mechanism. 1. clice (C++ Language Server): This is a modern language server for C++23 [1]. Its architecture uses a multi-process worker pool model [2] where a Master Server coordinates stateful and stateless worker processes [2]. It does not use a type or concept named "PendingStateless" [2][3][1]. Instead, it uses libuv for event-driven asynchronous task scheduling [3] and C++20 coroutines for background tasks like indexing [2][4]. Stateful workers use strands (per-document mutexes) and thread pools to serialize compilation and feature queries [2]. 2. PendingStateless: This term does not appear in the clice source code or documentation [2][3][1]. However, an exact string match for "PendingStateless" exists in an unrelated repository named 'suna' [5][6], which appears to be a Python-based agent/pipeline system. In that context, it relates to managing state transitions and write buffers within a stateless agent pipeline [5][6]. 3. Worker Pool/Coroutine/Queue: In the clice language server, the worker pool manages child processes [2][7]. Tasks (such as compilation or background indexing) are dispatched via a task queue [2][3]. These tasks are implemented using C++20 coroutines [2][3][8], which are scheduled on an event loop or thread pool [3]. Worker processes themselves are managed by the main server process [2][7]. If you are investigating performance or architecture within the clice language server, you should focus on the src/server/worker_pool.cpp and src/server/stateless_worker.cpp files [2][7], which define how the LSP server dispatches one-shot requests to stateless workers via round-robin routing and thread-pool-backed task queues [2]. There is no "PendingStateless" lifecycle component in this architecture [2][3].

Citations:

1: https://deepwiki.com/clice-project/clice

2: 020c2cb

3: https://clice.io/design/architecture

4: 1dd94e5

5: https://github.com/kortix-ai/suna/blob/f9a9123c/backend/core/agents/pipeline/stateless/state.py

6: https://github.com/kortix-ai/suna/blob/f9a9123c/backend/core/agents/pipeline/stateless/__init__.py

7: tsurumi-yizhou@b6886d2

8: MoonChasing@909fcd5

🌐 Web query:

C++ coroutine frame stack allocation queue pointer lifetime

💡 Result:

In C++20, coroutines are stackless, meaning they do not use the standard call stack for their execution state. Instead, they maintain a coroutine frame that stores local variables, temporaries, and promise objects, which persists across suspension points [1][2][3]. Coroutine Frame Allocation: By default, the compiler allocates the coroutine frame on the heap using operator new [1][4]. The size of this frame is determined by the compiler at compile time based on the coroutine's needs, including local variables, promise object size, and bookkeeping [4][2]. Allocation Elision (Optimization): The compiler can optimize away dynamic heap allocation if it can prove that the coroutine's lifetime is strictly nested within the lifetime of its caller and the frame size is known at the call site [1][2]. In such cases, the coroutine frame is embedded directly into the caller's stack frame or the caller's coroutine frame [1][2]. This optimization is often referred to as "HALO" (Heap Allocation elision optimization) or "CoroElide" [5][6]. Custom Allocation: Programmers can control frame allocation by defining a custom operator new within the promise_type [1][4]. This allows for the use of custom pools, allocators, or memory management strategies [7][8]. Because operator new is invoked before the coroutine body executes, the allocator must be available either via the coroutine's parameter list or through ambient state (e.g., thread-local storage) [7][4]. Pointer Lifetime: The coroutine handle (std::coroutine_handle) serves as a non-owning pointer to the coroutine frame [9][2]. It remains valid as long as the coroutine frame has not been destroyed [2]. The lifetime of the frame is managed by the coroutine itself (typically via explicit destruction or completion) [2]. If allocation elision is performed, the frame exists as a local object within the caller; it is essential that the handle is not used after the caller's scope ends, as the frame's memory will be reclaimed when the caller's activation record is popped from the stack [1][2]. Queue and Memory Lifetime: There is no inherent "queue" mechanism for coroutine frames in the C++ standard [7]. If a user implements a custom queue or pool to manage frame lifetime, they must ensure the frame's memory remains valid until the coroutine completes and is destroyed [7][8]. Using a pointer to a coroutine frame that has already been destroyed leads to undefined behavior, as the frame is the only location storing the information needed to resume or finalize the coroutine [9][2].

Citations:

1: https://www.cppreference.com/cpp/language/coroutines

2: https://en.cppreference.com/cpp/language/coroutines

3: https://cppreference.net/cpp/language/coroutines.html

4: https://lewissbaker.github.io/2018/09/05/understanding-the-promise-type

5: https://releases.llvm.org/17.0.1/docs/Coroutines.html

6: https://open-std.org/jtc1/sc22/wg21/docs/papers/2021/p2477r2.html

7: https://isocpp.org/files/papers/P4127R0.pdf

8: https://stackoverflow.com/questions/57163510/are-stackless-c20-coroutines-a-problem

9: https://devblogs.microsoft.com/oldnewthing/20211007-00/?p=105777

Potential dangling pointer from stack-local address stored in queue.

PendingStateless pending is a stack-local variable whose address is stored in high_queue or low_queue. If the coroutine's scope exits (e.g., via cancellation or early return) before the queue removes the pending entry, subsequent calls to try_dispatch_pending() will dereference a dangling pointer.

The risk is particularly acute if coroutine frame allocation uses elision (stack-embedded), which is common for small coroutines. Even with kotatsu's cooperative cancellation model, ensure that queue removal happens before scope exit, or use RAII guards (e.g., scope_exit) to remove &pending from the queue on coroutine destruction.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/server/worker/worker_pool.cpp` around lines 402 - 410, The issue is that the stack-local PendingStateless pending variable's address is stored in either high_queue or low_queue when can_proceed() returns false, and if the coroutine scope exits before the queue removes this entry (due to cancellation or early return), subsequent calls to try_dispatch_pending will dereference a dangling pointer. Fix this by wrapping the queue insertion in an RAII guard using scope_exit to ensure that the address of pending is removed from whichever queue it was added to (high_queue or low_queue) before the pending variable goes out of scope, protecting against premature coroutine termination.

coderabbitai · 2026-06-18T14:36:54Z

+    struct PendingStateless {
+        worker::Priority priority;
+        kota::event ready{};
+        std::size_t assigned_worker = 0;
+
+        explicit PendingStateless(worker::Priority p) : priority(p) {}
+    };


⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash # Check if kota::event has any cancellation-aware wait() overload or # if there's a pattern for safe queue cleanup elsewhere in the codebase. rg -n --type cpp -C3 'co_await.*\.wait' | head -60

Repository: clice-io/clice

Length of output: 4658

🏁 Script executed:

#!/bin/bash # First, get the structure and context of worker_pool.h cat -n src/server/worker/worker_pool.h | head -150

Repository: clice-io/clice

Length of output: 5947

🏁 Script executed:

#!/bin/bash # Find the acquire_stateless_slot method implementation rg -n 'acquire_stateless_slot' --type cpp -A 30 src/server/worker/

Repository: clice-io/clice

Length of output: 6495

🏁 Script executed:

#!/bin/bash # Find try_dispatch_pending implementation rg -n 'try_dispatch_pending' --type cpp -A 20 src/server/worker/

Repository: clice-io/clice

Length of output: 6270

🏁 Script executed:

#!/bin/bash # Check kota::event definition and if it supports cancellation find . -name '*.h' -o -name '*.cpp' | xargs grep -l 'class event' | head -5 rg -n 'struct event|class event' --type cpp -A 15 | head -80

Repository: clice-io/clice

Length of output: 40

🏁 Script executed:

#!/bin/bash # Find kota event definition find . -path ./build -prune -o -name '*.h' -type f -print | xargs grep -l 'struct event\|class event' 2>/dev/null | grep -i kota | head -10

Repository: clice-io/clice

Length of output: 40

🏁 Script executed:

#!/bin/bash # Check if there's a cancellation token or cancellation-aware wait() overload in kota rg -n 'cancel|Cancel' --type cpp src/server/worker/worker_pool.cpp | head -30

Repository: clice-io/clice

Length of output: 40

🏁 Script executed:

#!/bin/bash # Look for any cleanup or destructors in PendingStateless or surrounding code sed -n '391,430p' src/server/worker/worker_pool.cpp

Repository: clice-io/clice

Length of output: 1236

🏁 Script executed:

#!/bin/bash # Search for kota library or headers in the repo find . -name "kota" -o -name "*kota*" -type d | head -10

Repository: clice-io/clice

Length of output: 40

🏁 Script executed:

#!/bin/bash # Check if kota is included as a submodule or external dependency ls -la | grep -i kota cat CMakeLists.txt 2>/dev/null | grep -i kota | head -20

Repository: clice-io/clice

Length of output: 329

🏁 Script executed:

#!/bin/bash # Search for any coroutine cancellation or exception handling in send_stateless sed -n '180,210p' src/server/worker/worker_pool.h

Repository: clice-io/clice

Length of output: 1279

🏁 Script executed:

#!/bin/bash # Check if there are any exception handlers or cancellation mechanisms rg -n 'catch_cancel|cancel|exception' --type cpp src/server/worker/worker_pool.h src/server/worker/worker_pool.cpp | head -40

Repository: clice-io/clice

Length of output: 40

🏁 Script executed:

#!/bin/bash # Look at the full try_dispatch_pending implementation sed -n '432,460p' src/server/worker/worker_pool.cpp

Repository: clice-io/clice

Length of output: 1074

🏁 Script executed:

#!/bin/bash # Search for usage of acquire_stateless_slot and check if it's in try/catch blocks rg -B5 -A5 'acquire_stateless_slot' --type cpp src/server/

Repository: clice-io/clice

Length of output: 2311

🏁 Script executed:

#!/bin/bash # Check if kota::task supports cancellation or has cancellation tokens rg -n 'kota::task' --type cpp src/server/worker/worker_pool.h | head -5

Repository: clice-io/clice

Length of output: 299

🏁 Script executed:

#!/bin/bash # Look for any request_options that might include timeout or cancellation rg -n 'request_options|timeout' --type cpp src/server/ | grep -v test | head -20

Repository: clice-io/clice

Length of output: 1047

🏁 Script executed:

#!/bin/bash # Check if there's any try-catch around the acquire_stateless_slot or send_stateless sed -n '180,210p' src/server/worker/worker_pool.h

Repository: clice-io/clice

Length of output: 1279

🏁 Script executed:

#!/bin/bash # Search for RAII patterns or scope guards that might clean up queue entries rg -n 'struct.*Slot|~.*Slot' --type cpp src/server/worker/ -A5

Repository: clice-io/clice

Length of output: 814

Dangling pointer risk when coroutine is cancelled.

PendingStateless is a stack-local object in acquire_stateless_slot, and its address is pushed to the priority queues. If the waiting coroutine is cancelled (e.g., request timeout, client disconnect) before being dispatched, the coroutine frame is destroyed while the pointer remains in the queue. try_dispatch_pending would then dereference a dangling pointer.

Consider either:

Heap-allocate PendingStateless with shared ownership, or

Add a cancellation flag that try_dispatch_pending checks before dereferencing, or

Remove the entry from the queue in a destructor/cancellation hook.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/server/worker/worker_pool.h` around lines 109 - 115, The PendingStateless struct is stack-allocated in acquire_stateless_slot but its address is stored in priority queues, creating a dangling pointer risk if the coroutine is cancelled before try_dispatch_pending can dispatch it. To fix this, implement one of the following approaches: heap-allocate PendingStateless using std::shared_ptr to extend the object lifetime beyond the coroutine frame and ensure safe cleanup, or add a validity/cancellation flag to the PendingStateless struct that try_dispatch_pending checks before dereferencing to safely skip cancelled entries, or implement a cancellation hook or destructor that removes the entry from the queue when the coroutine is cancelled to prevent dereferencing a stale pointer.

coderabbitai · 2026-06-18T14:36:54Z

+        for(int i = 0; i < 50; ++i) {
+            co_await kota::sleep(100);
+            if(f.worker_alive(0))
+                break;
+        }
+        EXPECT_TRUE(f.worker_alive(0));
+        EXPECT_NE(f.worker_pid(0), pid);


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Wait for PID change before exiting the restart loop.

Line 713 can evaluate true before the crash/respawn cycle is complete, so the loop may exit early and Line 717 can compare against the original PID intermittently.

Proposed fix

- for(int i = 0; i < 50; ++i) { - co_await kota::sleep(100); - if(f.worker_alive(0)) - break; - } - EXPECT_TRUE(f.worker_alive(0)); - EXPECT_NE(f.worker_pid(0), pid); + for(int i = 0; i < 50; ++i) { + co_await kota::sleep(100); + if(f.worker_alive(0) && f.worker_pid(0) != pid) + break; + } + EXPECT_TRUE(f.worker_alive(0)); + EXPECT_NE(f.worker_pid(0), pid);

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

for(int i = 0; i < 50; ++i) {

co_await kota::sleep(100);

if(f.worker_alive(0))

break;

}

EXPECT_TRUE(f.worker_alive(0));

EXPECT_NE(f.worker_pid(0), pid);

for(int i = 0; i < 50; ++i) {

co_await kota::sleep(100);

if(f.worker_alive(0) && f.worker_pid(0) != pid)

break;

}

EXPECT_TRUE(f.worker_alive(0));

EXPECT_NE(f.worker_pid(0), pid);

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unit/server/worker_pool_tests.cpp` around lines 711 - 717, The loop that waits for worker restart exits prematurely when the worker becomes alive again, but before the PID has actually changed due to the respawn. Modify the break condition around line 713-715 to check not only that f.worker_alive(0) is true, but also that f.worker_pid(0) has changed from the original pid variable before breaking from the loop. This ensures the crash/respawn cycle is fully complete, preventing line 717 from intermittently failing when it verifies the PID has changed.

Replace SIGKILL (POSIX-only) with literal 9 in monitor_worker fallback path to fix Windows build. Lower default max_restarts from 5 to 2 since excessive retries are not useful.

16bit-ykiko added 2 commits June 18, 2026 22:16

chatgpt-codex-connector Bot reviewed Jun 18, 2026

View reviewed changes

coderabbitai Bot reviewed Jun 18, 2026

View reviewed changes

fix(server): SIGKILL undefined on Windows, reduce max_restarts to 2

0bbb206

Replace SIGKILL (POSIX-only) with literal 9 in monitor_worker fallback path to fix Windows build. Lower default max_restarts from 5 to 2 since excessive retries are not useful.

-    _putenv_s("UV_THREADPOOL_SIZE", "1");
-#else
-    ::setenv("UV_THREADPOOL_SIZE", "1", 1);
+`#ifdef` _WIN32
+    if(_putenv_s("UV_THREADPOOL_SIZE", "1") != 0) {
+        std::fputs("Failed to set UV_THREADPOOL_SIZE=1 via _putenv_s\n", stderr);
+        return 1;
+    }
+`#else`
+    if(::setenv("UV_THREADPOOL_SIZE", "1", 1) != 0) {
+        std::perror("setenv(UV_THREADPOOL_SIZE)");
+        return 1;
+    }
+`#endif`

Conversation

16bit-ykiko commented Jun 18, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Test plan

Summary by CodeRabbit

Release Notes

Uh oh!

coderabbitai Bot commented Jun 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related PRs

Poem

❌ Failed checks (1 warning)

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Jun 18, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot Jun 18, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot Jun 18, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot Jun 18, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot Jun 18, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot Jun 18, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot Jun 18, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 18, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 18, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 18, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 18, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

16bit-ykiko commented Jun 18, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Jun 18, 2026 •

edited

Loading