
refactor(server): crash handling and scheduling tests for WorkerPool#465

Open
16bit-ykiko wants to merge 3 commits into main from refactor/worker-pool-scheduling

Conversation


@16bit-ykiko 16bit-ykiko commented Jun 18, 2026

Member

Summary

  • Add centralized crash handling via process_crash() for both stateful and stateless workers
  • Stateful crashes report lost_documents via on_crash callback — caller decides recovery strategy (lazy rebuild via LRU on next access)
  • Stateless send_stateless() transparently retries up to 2 times on mid-request worker crash
  • AIMD (Additive Increase / Multiplicative Decrease) backoff on low_limit after stateless crashes, with memory-pressure monitoring
  • Idempotent release_stateless_slot prevents busy_count underflow between crash handler and StatelessSlot RAII destructor
  • pick_idle_stateless uses llvm_unreachable instead of silent fallback to dead worker
  • respawn_worker dispatches pending requests only after peer->run() starts
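The AIMD behaviour named in the summary can be sketched as follows. This is a minimal model, not the PR's code: the struct name `AimdLimit`, its fields, the halving factor, and the one-slot recovery step are all assumptions chosen for illustration, since the description does not state the actual constants.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical model of an AIMD-controlled limit on concurrent low-priority
// tasks. Names and constants are illustrative, not taken from WorkerPool.
struct AimdLimit {
    std::size_t min_limit = 1;  // assumed floor: never throttle to zero
    std::size_t max_limit = 8;  // assumed cap, e.g. the stateless worker count
    std::size_t current = 8;

    // Multiplicative decrease: halve the limit after a stateless worker crash.
    void on_crash() {
        current = std::max(min_limit, current / 2);
    }

    // Additive increase: recover one slot per healthy observation,
    // for example a memory poll that reports no pressure.
    void on_healthy_tick() {
        current = std::min(max_limit, current + 1);
    }
};
```

Under these assumptions the limit drops quickly after repeated crashes (8, 4, 2, 1) and climbs back one slot at a time once the system looks healthy again.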

Test plan

  • 37 unit tests across 4 suites: WorkerPoolStateful (8), WorkerPoolScheduling (11), WorkerPoolCrash (14), WorkerPoolIntegration (4)
  • Stateful routing: least-loaded assignment, path affinity, LRU, clear on crash
  • Stateless scheduling: priority queues, low_limit enforcement, RAII slot release, concurrency limits
  • Crash handling: state cleanup, lost document reporting, idempotent release, AIMD backoff math, restart limits
  • Integration: real process spawn/stop, stateless request round-trip, crash-and-restart, crash notification callback
  • Full unit test suite passes (658 tests, 0 failures)

Summary by CodeRabbit

Release Notes

  • New Features

    • Added priority-aware scheduling for stateless build and format tasks, allowing high-priority work to run ahead of lower-priority requests.
  • Improvements

    • Overhauled stateless worker capacity management with priority-aware queuing, dynamic throttling, and more responsive dispatch.
    • Enhanced worker crash handling, including structured crash reporting and restart/backoff logic.
  • Changes

    • Background indexing concurrency behavior was simplified, and stateless worker execution now limits libuv’s thread pool to a single thread per process.

Replace the round-robin stateless worker dispatch with priority-aware
scheduling that prevents background indexing from starving interactive
requests and adapts concurrency to memory pressure.

Key changes:
- Add Priority enum (High/Low) to worker protocol; interactive requests
  (completion, signature help, format) are High, indexing is Low
- WorkerPool tracks busy/idle state per worker and maintains priority
  queues — High requests are always dispatched before Low
- Reserve at least one worker slot for High priority tasks
- Adaptive low_limit via memory monitoring (poll every 3s) and AIMD
  backoff on worker crashes
- Set OS nice value on workers based on task priority
- Limit stateless workers to one concurrent compilation via
  UV_THREADPOOL_SIZE=1
- Remove redundant Indexer-layer concurrency control (max_concurrent,
  monitor_resources) — backpressure is now provided by WorkerPool
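
The dispatch order described in the list above (High before Low, with Low gated by `low_limit`) might look roughly like this. The `Scheduler` type, its counters, and the use of integer request ids are hypothetical simplifications; the real pool works with coroutines and worker indices.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <optional>

enum class Priority { High, Low };

// Hypothetical, synchronous model of priority-aware dispatch.
struct Scheduler {
    std::size_t idle_workers = 0;  // workers free to take a task
    std::size_t busy_low = 0;      // low-priority tasks currently running
    std::size_t low_limit = 1;     // cap on concurrent low-priority tasks
    std::deque<int> high_queue;    // pending request ids, FIFO
    std::deque<int> low_queue;

    // Returns the next request allowed to run now, or nullopt if none can.
    std::optional<int> dispatch_next() {
        if(idle_workers == 0)
            return std::nullopt;
        // High-priority requests bypass the low_limit gate.
        if(!high_queue.empty()) {
            int id = high_queue.front();
            high_queue.pop_front();
            --idle_workers;
            return id;
        }
        // Low-priority requests run only while under low_limit.
        if(!low_queue.empty() && busy_low < low_limit) {
            int id = low_queue.front();
            low_queue.pop_front();
            --idle_workers;
            ++busy_low;
            return id;
        }
        return std::nullopt;
    }
};
```

In this model, keeping `low_limit` strictly below the worker count is what leaves at least one worker available for High requests.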
Add comprehensive crash handling to WorkerPool:
- process_crash() centralizes all crash state transitions for both
  stateful and stateless workers
- Stateful crashes report lost_documents via on_crash callback,
  letting the caller decide recovery (lazy rebuild via LRU)
- Stateless send_stateless() transparently retries up to 2 times
  on worker crash
- AIMD backoff on low_limit after stateless worker crashes
- Idempotent release_stateless_slot prevents busy_count underflow
  when crash handler and StatelessSlot RAII destructor both fire
- pick_idle_stateless uses llvm_unreachable instead of silent
  fallback to index 0
- respawn_worker calls try_dispatch_pending after peer->run()
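
A minimal sketch of the idempotent release idea from the list above, assuming a per-worker busy flag guards the shared counter. `Pool` and `Slot` are hypothetical stand-ins for WorkerPool and StatelessSlot, not their real definitions.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical pool that tracks how many workers are busy.
struct Pool {
    struct Worker {
        bool busy = false;
    };

    std::vector<Worker> workers;
    std::size_t busy_count = 0;

    explicit Pool(std::size_t n) : workers(n) {}

    void acquire(std::size_t idx) {
        assert(!workers[idx].busy);
        workers[idx].busy = true;
        ++busy_count;
    }

    // Idempotent: a second call for the same worker is a no-op,
    // so busy_count cannot underflow.
    void release(std::size_t idx) {
        if(!workers[idx].busy)
            return;
        workers[idx].busy = false;
        --busy_count;
    }
};

// RAII handle that releases its slot on destruction.
class Slot {
public:
    Slot(Pool& pool, std::size_t idx) : pool_(&pool), idx_(idx) {
        pool_->acquire(idx_);
    }

    Slot(const Slot&) = delete;
    Slot& operator=(const Slot&) = delete;

    ~Slot() {
        pool_->release(idx_);
    }

private:
    Pool* pool_;
    std::size_t idx_;
};
```

If a crash handler calls `release` first, the later destructor call finds the flag already cleared and does nothing. The sketch does not cover the case where the slot is reacquired by another request before the destructor runs.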

Add 37 unit tests across 4 suites (WorkerPoolStateful,
WorkerPoolScheduling, WorkerPoolCrash, WorkerPoolIntegration)
covering routing, priority scheduling, crash handling, AIMD
backoff, and real-process integration.

coderabbitai Bot commented Jun 18, 2026


No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 80489e48-cd40-42e6-8968-5ae263c4bd4b

📥 Commits

Reviewing files that changed from the base of the PR and between a8abe2d and 0bbb206.

📒 Files selected for processing (2)
  • src/server/worker/worker_pool.cpp
  • src/server/worker/worker_pool.h
🚧 Files skipped from review as they are similar to previous changes (2)
  • src/server/worker/worker_pool.h
  • src/server/worker/worker_pool.cpp

📝 Walkthrough

Introduces a Priority enum (High/Low) in the worker protocol and rewrites WorkerPool stateless dispatch with priority-aware slot acquisition, AIMD-based memory pressure monitoring, and a process_crash helper that handles restart decisions and fires an on_crash callback. Compiler callsites set High priority for PCH/build/format. The Indexer's own concurrency controls (set_max_concurrency, monitor_resources, in_flight throttling) are removed as they are superseded by pool-level scheduling. Stateless worker processes now cap libuv's thread pool to one via UV_THREADPOOL_SIZE=1.

Changes

Priority-Aware Stateless Worker Scheduling & Crash Handling

| Layer / File(s) | Summary |
|---|---|
| **Priority enum and BuildParams contract**<br>`src/server/protocol/worker.h` | Adds enum class Priority { High, Low } and a priority field (default Low) to BuildParams. |
| **WorkerPool crash info, options, and scheduling data structures**<br>`src/server/worker/worker_pool.h` | Introduces WorkerCrashInfo, adds max_restarts to WorkerPoolOptions, replaces round-robin internals with PendingStateless/StatelessSlot, high/low priority deques, concurrency counters, and new method declarations. |
| **WorkerPool spawn, start, and respawn lifecycle**<br>`src/server/worker/worker_pool.cpp` | spawn_worker tracks alive_stateless_count; start initializes low_limit and launches monitor_memory(); respawn_worker resets busy/alive state and calls try_dispatch_pending(). |
| **Worker crash detection and process_crash helper**<br>`src/server/worker/worker_pool.cpp` | Refactors monitor_worker to normalize exits and delegate to new process_crash, which marks workers dead, collects lost stateful documents, decrements stateless counters, applies backoff, and fires on_crash. |
| **Priority slot acquisition, release, and pending dispatch**<br>`src/server/worker/worker_pool.cpp` | Implements acquire_stateless_slot (blocks in high/low queue when capacity unavailable), release_stateless_slot, try_dispatch_pending (high-before-low with low_limit gate), and pick_idle_stateless. |
| **Memory monitoring and AIMD crash backoff**<br>`src/server/worker/worker_pool.cpp` | Adds monitor_memory() periodic loop that adjusts low_limit from memory pressure, and apply_crash_backoff() for AIMD-style reduction on crashes. |
| **Stateless worker UV_THREADPOOL_SIZE initialization**<br>`src/server/worker/stateless_worker.cpp` | Sets UV_THREADPOOL_SIZE=1 via platform-specific env-var call at the start of run_stateless_worker_mode. |
| **Compiler callsites set High priority**<br>`src/server/compiler/compiler.cpp` | ensure_pch, forward_build, and forward_format each set BuildParams::priority = High before dispatching stateless requests. |
| **Remove Indexer-level concurrency controls**<br>`src/server/compiler/indexer.h`, `src/server/compiler/indexer.cpp`, `src/server/service/master_server.cpp` | Removes set_max_concurrency, monitor_resources(), max_concurrent/baseline_concurrent fields, and in_flight/slot_available throttling from the indexer; removes the set_max_concurrency call from MasterServer::initialize(). |
| **WorkerPool unit and integration tests**<br>`tests/unit/server/worker_pool_tests.cpp` | Adds WorkerPoolFixture and test suites for stateful scheduling, stateless priority dispatch, crash semantics (counters, lost documents, AIMD backoff), and start/stop/crash-restart integration. |

Sequence Diagram(s)

sequenceDiagram
  participant Compiler
  participant WorkerPool
  participant acquire_stateless_slot
  participant StatelessWorker
  participant process_crash
  participant on_crash

  rect rgba(100, 149, 237, 0.5)
    note over Compiler,WorkerPool: High-priority build/format dispatch
    Compiler->>WorkerPool: send_stateless(BuildParams{priority=High})
    WorkerPool->>acquire_stateless_slot: acquire(Priority::High)
    acquire_stateless_slot->>acquire_stateless_slot: bypass low_limit gate, pick idle worker
    acquire_stateless_slot-->>WorkerPool: worker_index + StatelessSlot
  end

  WorkerPool->>StatelessWorker: dispatch build request
  StatelessWorker-->>WorkerPool: response (success or crash)

  alt worker crashed
    rect rgba(220, 80, 80, 0.5)
      note over WorkerPool,on_crash: Crash handling path
      WorkerPool->>process_crash: process_crash(index, stateful, exit_code, signal)
      process_crash->>WorkerPool: decrement alive/busy counters
      process_crash->>WorkerPool: apply_crash_backoff() → reduce low_limit
      process_crash->>WorkerPool: try_dispatch_pending()
      process_crash->>on_crash: WorkerCrashInfo{index, exit_code, ...}
      process_crash-->>WorkerPool: should_restart=true
      WorkerPool->>WorkerPool: respawn_worker(index)
    end
  end

  WorkerPool->>WorkerPool: ~StatelessSlot → release_stateless_slot
  WorkerPool->>WorkerPool: try_dispatch_pending() → signal next waiter

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • clice-io/clice#432: Directly modifies clice::Indexer's set_max_concurrency and monitor_resources scheduling in indexer.h/indexer.cpp, the same code this PR removes.
  • clice-io/clice#435: Modifies Compiler::ensure_pch in compiler.cpp along the BuildPCH dispatch path, where this PR also adds the priority=High assignment.

Poem

🐇 Hoppity-hop through the priority lane,
High builds go first, low ones must wait and strain.
When a worker crashes with a bump and a fall,
AIMD backoff adjusts the limit for all.
No more indexer juggling its own concurrency knobs—
The pool now handles all the scheduling jobs!
— signed, the rabbit who reviews your code 🥕

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 5.21%, which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The PR title 'refactor(server): crash handling and scheduling tests for WorkerPool' directly addresses the main objectives of centralizing crash management and introducing comprehensive scheduling tests. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a8abe2db72


auto result = co_await stateless_workers[idx].peer->send_request(params, opts);

// Success, or application-level error (worker still alive) — return as-is.
if(result.has_value() || stateless_workers[idx].alive)


P1 Badge Retry against the crashed process, not the reused slot

When a stateless worker dies during send_request, monitor_worker() can run process_crash() and respawn_worker() before this await resumes; respawn_worker() reuses the same slot index and sets alive = true. In that interleaving the result is still the IPC error from the retired peer, but this check sees the newly spawned worker as alive and returns the error instead of performing the intended retry. Capture a per-process generation/restart count or the peer identity before sending rather than checking the current slot state.
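
One way to express the suggested generation check is sketched below. The `generation` field, `WorkerSlot`, and `should_retry` are hypothetical names introduced for illustration; the PR's worker struct may track restarts differently.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical per-slot state. The generation counter is bumped on every
// respawn so that a reused slot index can be told apart from the process
// that originally received the request.
struct WorkerSlot {
    bool alive = true;
    std::uint64_t generation = 0;

    void crash() {
        alive = false;
    }

    void respawn() {
        alive = true;
        ++generation;
    }
};

// Returns true when a failed request should be retried because the process
// it was sent to no longer exists, even if the slot looks alive again.
inline bool should_retry(const WorkerSlot& slot,
                         std::uint64_t sent_generation,
                         bool request_ok) {
    if(request_ok)
        return false;
    return !slot.alive || slot.generation != sent_generation;
}
```

The caller would record `slot.generation` before awaiting the request and pass it to `should_retry` afterwards, instead of reading `alive` alone.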


Comment on lines +405 to +408
    high_queue.push_back(&pending);
else
    low_queue.push_back(&pending);
co_await pending.ready.wait();


P1 Badge Remove cancelled waiters from stateless queues

If an LSP request is cancelled while waiting for a stateless slot, this stack-local PendingStateless is destroyed as the coroutine unwinds, but its address remains in high_queue/low_queue. The next try_dispatch_pending() from a release, crash, or respawn will dereference that dangling pointer and write assigned_worker/signal ready, which can crash or corrupt scheduling. The queued entry needs cancellation cleanup or owned lifetime independent of the waiting frame.
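
A scope guard along these lines could provide the cancellation cleanup. This is a sketch under assumptions: `Pending` stands in for PendingStateless, `QueueGuard` is a hypothetical helper, and it relies on the coroutine frame destroying locals in reverse declaration order when a suspended coroutine is torn down.

```cpp
#include <algorithm>
#include <cassert>
#include <deque>

// Stand-in for the stack-local waiter record.
struct Pending {
    int assigned_worker = -1;
};

// Hypothetical guard: enqueues a pointer to the waiter and guarantees the
// pointer is gone from the queue before the waiter is destroyed.
class QueueGuard {
public:
    QueueGuard(std::deque<Pending*>& queue, Pending& pending) :
        queue_(queue), pending_(&pending) {
        queue_.push_back(pending_);
    }

    QueueGuard(const QueueGuard&) = delete;
    QueueGuard& operator=(const QueueGuard&) = delete;

    ~QueueGuard() {
        // No-op when the dispatcher has already popped this entry.
        auto it = std::find(queue_.begin(), queue_.end(), pending_);
        if(it != queue_.end())
            queue_.erase(it);
    }

private:
    std::deque<Pending*>& queue_;
    Pending* pending_;
};
```

Declaring the guard after the `Pending` object matters: the guard is then destroyed first, so the queue never holds a pointer to a dead waiter. The linear search is acceptable only if the queues stay short.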


Comment on lines +402 to +408
if(!can_proceed()) {
    PendingStateless pending(priority);
    if(priority == P::High)
        high_queue.push_back(&pending);
    else
        low_queue.push_back(&pending);
    co_await pending.ready.wait();


P2 Badge Return an error when no stateless workers are alive

When every stateless worker is down, for example after the last one exceeds max_restarts, alive_stateless_count is zero so can_proceed() is false and the request is queued here forever. With no live worker and no future respawn/release to call try_dispatch_pending(), user requests hang instead of getting the previous “all workers are down” style error. Check alive_stateless_count == 0 before enqueueing and fail the request.
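
The fail-fast check could be modelled as a three-way decision made before enqueueing. `AcquireResult` and `decide_acquire` are hypothetical names, and the capacity test is simplified to `busy < alive`; the real `can_proceed()` also applies the low_limit gate.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical outcome of a slot request.
enum class AcquireResult { Proceed, Enqueue, NoWorkers };

// Decide what to do with a new stateless request given the current counters.
inline AcquireResult decide_acquire(std::size_t alive, std::size_t busy) {
    // Nothing can ever dispatch this request: fail instead of queueing.
    if(alive == 0)
        return AcquireResult::NoWorkers;
    // At least one live worker is idle.
    if(busy < alive)
        return AcquireResult::Proceed;
    // All live workers are busy: wait for a release or respawn.
    return AcquireResult::Enqueue;
}
```

Requests that are already queued when the last worker is retired would presumably need the same treatment, for example by failing every waiter from the crash path.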


Comment on lines +308 to +309
if(on_crash)
on_crash(info);


P1 Badge Mark sessions dirty after stateful crashes

Stateful crashes now clear the worker ownership, but production code never installs WorkerPool::on_crash (repo-wide search only finds the unit-test fixture). For an already clean open document, Compiler::ensure_compiled() will therefore return true without rebuilding, send_stateful() will assign the path to a fresh worker with no AST, and features like hover/semantic tokens return null until an edit or dependency change dirties the session. Wire this callback in MasterServer to mark lost_documents dirty or otherwise force recompilation.
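
Wiring the callback might look like the sketch below. `CrashInfo`, `Sessions`, and `make_crash_handler` are hypothetical; only the `lost_documents` field name comes from the PR description, and representing documents as path strings is an assumption.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <unordered_map>
#include <vector>

// Stand-in for WorkerCrashInfo, reduced to the field used here.
struct CrashInfo {
    std::vector<std::string> lost_documents;
};

// Stand-in for the server's open-document table: path -> dirty flag.
struct Sessions {
    std::unordered_map<std::string, bool> dirty;

    void mark_dirty(const std::string& path) {
        auto it = dirty.find(path);
        if(it != dirty.end())
            it->second = true;  // ignore documents that are no longer open
    }
};

// Builds a crash handler that forces recompilation of every lost document.
// The returned callable keeps a reference to `sessions`, which must outlive it.
inline std::function<void(const CrashInfo&)> make_crash_handler(Sessions& sessions) {
    return [&sessions](const CrashInfo& info) {
        for(const auto& path: info.lost_documents)
            sessions.mark_dirty(path);
    };
}
```

The server would install the result as the pool's `on_crash` during initialization, so that the next access to a lost document triggers a rebuild on its new worker.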



auto pid = stateless_workers[idx].proc.pid();
if(pid > 0) {
kota::sys::set_priority(priority == P::Low ? 10 : 0, pid);


P2 Badge Avoid one-way nice changes for worker priority

On normal Linux/macOS runs without elevated scheduling privileges, increasing a process nice value to 10 is allowed but lowering it back to 0 fails; I confirmed this with setpriority in the same environment. Because these calls ignore failure, once a worker is assigned a low-priority request here, later high-priority dispatches that try set_priority(0, pid) cannot restore the worker, so high-priority completions/formatting can continue running at low OS priority after indexing has used every worker.
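
Detecting the failure is straightforward with the POSIX calls directly. The helpers below are a POSIX-only sketch with hypothetical names; they do not show how `kota::sys::set_priority` reports errors, which the thread does not specify.

```cpp
#include <cassert>
#include <cerrno>
#include <sys/resource.h>
#include <sys/types.h>

// Returns true if the nice value was applied to the given process.
// A pid of 0 refers to the calling process.
inline bool try_set_nice(pid_t pid, int nice_value) {
    return ::setpriority(PRIO_PROCESS, static_cast<id_t>(pid), nice_value) == 0;
}

// Reads the current nice value. getpriority() can legitimately return -1,
// so errno has to be cleared first and checked afterwards.
inline bool get_nice(pid_t pid, int& out) {
    errno = 0;
    int value = ::getpriority(PRIO_PROCESS, static_cast<id_t>(pid));
    if(value == -1 && errno != 0)
        return false;
    out = value;
    return true;
}
```

A pool that checks `try_set_nice` could remember which workers failed to return to nice 0 and prefer other workers for High requests, or skip raising the nice value at all when it cannot be lowered again.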


/// - SignatureHelp: + text, version, offset, pch, pcms
/// - Format: + text, format_range (optional)
struct BuildParams {
Priority priority = Priority::Low;


P2 Badge Route user-triggered PCM builds as high priority

Defaulting new BuildParams to low priority leaves BuildPCM requests low unless a caller overrides it, and the compile-graph dispatch used by Compiler::ensure_deps() constructs PCM builds without setting priority. When an interactive hover/completion needs module dependencies while background indexing has already filled the low-priority queue, the dependency build waits behind background Index work even though the final user request is marked high priority. Propagate the caller priority into compile-graph PCM builds or split background PCM work from user-triggered dependencies.


Comment on lines 900 to 902
++dispatched;
workers.spawn([&, server_path_id]() -> kota::task<> {
co_await index_one(server_path_id);


P2 Badge Keep background indexing bounded before worker dispatch

With the in_flight gate removed, this loop now spawns one coroutine for every stale queued file before any stateless-worker slot is acquired. On large workspaces those index_one tasks all run far enough to do synchronous per-file work such as module dependency preparation and compile-argument lookup, then pile up in the worker-pool queue, defeating the previous concurrency cap and making pause/resume unable to stop new indexing work until the whole queue has been spawned. Reintroduce a bounded producer or acquire scheduling capacity before launching each per-file task.


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/server/worker/stateless_worker.cpp`:
- Around line 299-301: The return values of the _putenv_s function (Windows
platform) and the setenv function (non-Windows platform) are not being checked
after setting the UV_THREADPOOL_SIZE environment variable. If either of these
calls fails, the worker will silently continue with libuv's default thread pool
size, breaking the one-job-per-process scheduling guarantee. Check the return
values of both _putenv_s and setenv immediately after they are called, and if
either returns a non-zero/non-success value, log an appropriate error message
and exit the process early to ensure the initialization failure is caught and
reported.

In `@src/server/worker/worker_pool.cpp`:
- Around line 402-410: The issue is that the stack-local PendingStateless
pending variable's address is stored in either high_queue or low_queue when
can_proceed() returns false, and if the coroutine scope exits before the queue
removes this entry (due to cancellation or early return), subsequent calls to
try_dispatch_pending will dereference a dangling pointer. Fix this by wrapping
the queue insertion in an RAII guard using scope_exit to ensure that the address
of pending is removed from whichever queue it was added to (high_queue or
low_queue) before the pending variable goes out of scope, protecting against
premature coroutine termination.

In `@src/server/worker/worker_pool.h`:
- Around line 109-115: The PendingStateless struct is stack-allocated in
acquire_stateless_slot but its address is stored in priority queues, creating a
dangling pointer risk if the coroutine is cancelled before try_dispatch_pending
can dispatch it. To fix this, implement one of the following approaches:
heap-allocate PendingStateless using std::shared_ptr to extend the object
lifetime beyond the coroutine frame and ensure safe cleanup, or add a
validity/cancellation flag to the PendingStateless struct that
try_dispatch_pending checks before dereferencing to safely skip cancelled
entries, or implement a cancellation hook or destructor that removes the entry
from the queue when the coroutine is cancelled to prevent dereferencing a stale
pointer.

In `@tests/unit/server/worker_pool_tests.cpp`:
- Around line 711-717: The loop that waits for worker restart exits prematurely
when the worker becomes alive again, but before the PID has actually changed due
to the respawn. Modify the break condition around line 713-715 to check not only
that f.worker_alive(0) is true, but also that f.worker_pid(0) has changed from
the original pid variable before breaking from the loop. This ensures the
crash/respawn cycle is fully complete, preventing line 717 from intermittently
failing when it verifies the PID has changed.
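
The stricter wait condition for the restart test can be sketched as a predicate plus a polling helper. Both `restarted` and `wait_until` are hypothetical; the real test runs inside the project's event loop and would likely yield to it rather than block a thread as this sketch does.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <thread>

// The restart is complete only when the worker is alive AND has a new PID.
inline bool restarted(bool alive, int current_pid, int original_pid) {
    return alive && current_pid != original_pid;
}

// Polls `predicate` until it returns true or `timeout` elapses.
// Returns the final value of the predicate.
inline bool wait_until(const std::function<bool()>& predicate,
                       std::chrono::milliseconds timeout,
                       std::chrono::milliseconds interval = std::chrono::milliseconds(10)) {
    auto deadline = std::chrono::steady_clock::now() + timeout;
    while(std::chrono::steady_clock::now() < deadline) {
        if(predicate())
            return true;
        std::this_thread::sleep_for(interval);
    }
    // One last check so a change right at the deadline is not missed.
    return predicate();
}
```

In the test this would replace the alive-only break condition with something like `restarted(f.worker_alive(0), f.worker_pid(0), pid)`.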

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: eee3ed09-a6fa-4d4e-ab18-9d1f7de729ea

📥 Commits

Reviewing files that changed from the base of the PR and between 8cba654 and a8abe2d.

📒 Files selected for processing (9)
  • src/server/compiler/compiler.cpp
  • src/server/compiler/indexer.cpp
  • src/server/compiler/indexer.h
  • src/server/protocol/worker.h
  • src/server/service/master_server.cpp
  • src/server/worker/stateless_worker.cpp
  • src/server/worker/worker_pool.cpp
  • src/server/worker/worker_pool.h
  • tests/unit/server/worker_pool_tests.cpp
💤 Files with no reviewable changes (3)
  • src/server/compiler/indexer.h
  • src/server/service/master_server.cpp
  • src/server/compiler/indexer.cpp

Comment on lines +299 to +301
_putenv_s("UV_THREADPOOL_SIZE", "1");
#else
::setenv("UV_THREADPOOL_SIZE", "1", 1);


⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

cat -n src/server/worker/stateless_worker.cpp | head -310 | tail -30

Repository: clice-io/clice

Length of output: 1207


🏁 Script executed:

rg "_putenv_s|::setenv" src/server/worker/stateless_worker.cpp -B 5 -A 5

Repository: clice-io/clice

Length of output: 578


🏁 Script executed:

rg "UV_THREADPOOL_SIZE" src/server/worker/stateless_worker.cpp -B 10 -A 10

Repository: clice-io/clice

Length of output: 750


🌐 Web query:

_putenv_s Windows return value error handling MSVC

💡 Result:

The _putenv_s function in the Microsoft C Runtime (CRT) returns an errno_t value to indicate success or failure [1][2]. Return Values: - 0: The function completed successfully [1][3]. - EINVAL: An error occurred [1][3]. This typically happens if either the varname or value_string arguments are NULL [1][4]. Error Handling: - When an error condition occurs (such as passing NULL arguments), the function invokes the CRT invalid parameter handler [1][3]. - If the application allows execution to continue after the handler is invoked, the function returns EINVAL and also sets the global errno variable to EINVAL [1][3]. Important Considerations: - Thread Safety: The _putenv_s function is not thread-safe [1][2]. Concurrent use with _getenv_s or other environment-modifying functions can cause random failures or undefined behavior [1][5]. Synchronization is required if accessed from multiple threads [1][3]. - Scope: Changes made by _putenv_s affect only the environment local to the current process and do not modify the environment segment managed by the operating system [2]. However, these changes are inherited by new processes created via _spawn, _exec, or system [2].

Citations:


🌐 Web query:

setenv POSIX return value error handling unix

💡 Result:

Under the POSIX standard, the setenv function is used to add or update an environment variable in the calling process [1][2]. Return Values: - Upon successful completion, setenv returns 0 [1][2]. - If an error occurs, it returns -1, and the global errno variable is set to indicate the specific error [1][2]. When an error occurs, the environment remains unchanged [1][2]. Error Handling (errno): The function is required to fail and set errno in the following cases [1][2]: - [EINVAL]: The environment variable name (envname) is an empty string, is a NULL pointer (depending on implementation), or contains an '=' character [1][3][2]. - [ENOMEM]: The system has insufficient memory to add or update the variable in the environment [1][2]. Developers should check the return value against -1 to detect errors and then consult errno to determine the cause [3][2]. Because setenv copies the strings provided to it [1][2], it is generally safer than functions like putenv regarding memory management, though it is not required by POSIX to be thread-safe [2][4].

Citations:


Check return values of _putenv_s and setenv to ensure UV_THREADPOOL_SIZE is set.

Line 299 and Line 301 ignore the return values from these functions. If environment setup fails (e.g., due to memory exhaustion or invalid arguments), the worker will run with libuv's default thread pool size instead of the required single thread, violating the one-job-per-process scheduling guarantee. Early exit with error reporting is appropriate for initialization failures.

Proposed patch
-#include <cstdlib>
+#include <cerrno>
+#include <cstdio>
+#include <cstdlib>

 int run_stateless_worker_mode(const std::string& worker_name, const std::string& log_dir) {
     // Limit libuv thread pool to 1 thread so each stateless worker executes
     // only one compilation at a time. Must be set before any kota::queue call.
 #ifdef _WIN32
-    _putenv_s("UV_THREADPOOL_SIZE", "1");
+    if(_putenv_s("UV_THREADPOOL_SIZE", "1") != 0) {
+        std::fputs("Failed to set UV_THREADPOOL_SIZE=1 via _putenv_s\n", stderr);
+        return 1;
+    }
 #else
-    ::setenv("UV_THREADPOOL_SIZE", "1", 1);
+    if(::setenv("UV_THREADPOOL_SIZE", "1", 1) != 0) {
+        std::perror("setenv(UV_THREADPOOL_SIZE)");
+        return 1;
+    }
 #endif
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
_putenv_s("UV_THREADPOOL_SIZE", "1");
#else
::setenv("UV_THREADPOOL_SIZE", "1", 1);
#ifdef _WIN32
if(_putenv_s("UV_THREADPOOL_SIZE", "1") != 0) {
    std::fputs("Failed to set UV_THREADPOOL_SIZE=1 via _putenv_s\n", stderr);
    return 1;
}
#else
if(::setenv("UV_THREADPOOL_SIZE", "1", 1) != 0) {
    std::perror("setenv(UV_THREADPOOL_SIZE)");
    return 1;
}
#endif

Comment on lines +402 to +410
if(!can_proceed()) {
    PendingStateless pending(priority);
    if(priority == P::High)
        high_queue.push_back(&pending);
    else
        low_queue.push_back(&pending);
    co_await pending.ready.wait();
    co_return pending.assigned_worker;
}


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check if kota::task supports cancellation that could destroy the coroutine frame
# while it's suspended, and whether PendingStateless needs cleanup guards.

# Look for cancellation patterns in the codebase
rg -n -C3 'cancel|Cancel' --type cpp -g '*.h' -g '*.cpp' | head -100

# Check if task_group has cancel semantics
ast-grep --pattern 'class task_group {
  $$$
}'

# Look at how PendingStateless is used in tests
rg -n 'PendingStateless' --type cpp

Repository: clice-io/clice

Length of output: 1980


🌐 Web query:

clice-io/clice kota::task coroutine cancellation

💡 Result:

In the context of the clice-io ecosystem, coroutine cancellation is primarily handled within the kotatsu library, which provides the asynchronous runtime and task management framework used by the clice language server [1][2]. The kotatsu library features a cooperative cancellation model for its task<T, E, C> objects [1]. Key components and mechanisms of this model include: Cancellation Tokens: The library uses cancellation_source and cancellation_token for thread-safe, external triggering of cancellation requests [1]. Explicit Cancellation: Tasks can be explicitly cancelled using co_await cancel, which transitions the task to a cancelled state [1]. Composed Cancellation: The runtime supports structured concurrency patterns such as: when_all: Wait for multiple tasks; if the first task fails, it cancels the remaining tasks [1]. when_any: Race multiple tasks; the winner cancels the losers [1]. task_group: Manage dynamic fan-out of tasks that support joining [1]. Cancellation Handling: Tasks support methods like.catch_cancel to convert cancellation into an explicit outcome channel and.or_fail for short-circuiting error propagation without resuming at the await site [1]. Runtime Integration: The asynchronous runtime includes I/O cancellation (e.g., for timers, signals, and process execution) where cancellation signals trigger immediate cleanup, such as disarming timers or sending termination signals (e.g., SIGKILL) to processes [3]. While clice also implements internal Task structures for managing its own asynchronous operations [4][5], these are built upon or mirror the patterns found in the broader kotatsu runtime [1]. Cancellation logic typically involves checking task flags within the promise_base and conditionally destroying coroutine handles when tasks are both cancelled and disposable [4].

Citations:


🌐 Web query:

clice worker pool coroutine queue PendingStateless lifetime

💡 Result:

In the context of the clice C++ language server, the terms you mentioned appear to be associated with different, unrelated systems or project-specific implementations, rather than a single unified "clice worker pool coroutine queue PendingStateless" mechanism. 1. clice (C++ Language Server): This is a modern language server for C++23 [1]. Its architecture uses a multi-process worker pool model [2] where a Master Server coordinates stateful and stateless worker processes [2]. It does not use a type or concept named "PendingStateless" [2][3][1]. Instead, it uses libuv for event-driven asynchronous task scheduling [3] and C++20 coroutines for background tasks like indexing [2][4]. Stateful workers use strands (per-document mutexes) and thread pools to serialize compilation and feature queries [2]. 2. PendingStateless: This term does not appear in the clice source code or documentation [2][3][1]. However, an exact string match for "PendingStateless" exists in an unrelated repository named 'suna' [5][6], which appears to be a Python-based agent/pipeline system. In that context, it relates to managing state transitions and write buffers within a stateless agent pipeline [5][6]. 3. Worker Pool/Coroutine/Queue: In the clice language server, the worker pool manages child processes [2][7]. Tasks (such as compilation or background indexing) are dispatched via a task queue [2][3]. These tasks are implemented using C++20 coroutines [2][3][8], which are scheduled on an event loop or thread pool [3]. Worker processes themselves are managed by the main server process [2][7]. If you are investigating performance or architecture within the clice language server, you should focus on the src/server/worker_pool.cpp and src/server/stateless_worker.cpp files [2][7], which define how the LSP server dispatches one-shot requests to stateless workers via round-robin routing and thread-pool-backed task queues [2]. There is no "PendingStateless" lifecycle component in this architecture [2][3].


🌐 Web query:

C++ coroutine frame stack allocation queue pointer lifetime

💡 Result:

In C++20, coroutines are stackless, meaning they do not use the standard call stack for their execution state. Instead, they maintain a coroutine frame that stores local variables, temporaries, and promise objects, which persists across suspension points [1][2][3].

Coroutine Frame Allocation: By default, the compiler allocates the coroutine frame on the heap using operator new [1][4]. The size of this frame is determined by the compiler at compile time based on the coroutine's needs, including local variables, promise object size, and bookkeeping [4][2].

Allocation Elision (Optimization): The compiler can optimize away dynamic heap allocation if it can prove that the coroutine's lifetime is strictly nested within the lifetime of its caller and the frame size is known at the call site [1][2]. In such cases, the coroutine frame is embedded directly into the caller's stack frame or the caller's coroutine frame [1][2]. This optimization is often referred to as "HALO" (Heap Allocation elision optimization) or "CoroElide" [5][6].

Custom Allocation: Programmers can control frame allocation by defining a custom operator new within the promise_type [1][4]. This allows for the use of custom pools, allocators, or memory management strategies [7][8]. Because operator new is invoked before the coroutine body executes, the allocator must be available either via the coroutine's parameter list or through ambient state (e.g., thread-local storage) [7][4].

Pointer Lifetime: The coroutine handle (std::coroutine_handle) serves as a non-owning pointer to the coroutine frame [9][2]. It remains valid as long as the coroutine frame has not been destroyed [2]. The lifetime of the frame is managed by the coroutine itself (typically via explicit destruction or completion) [2]. If allocation elision is performed, the frame exists as a local object within the caller; it is essential that the handle is not used after the caller's scope ends, as the frame's memory will be reclaimed when the caller's activation record is popped from the stack [1][2].

Queue and Memory Lifetime: There is no inherent "queue" mechanism for coroutine frames in the C++ standard [7]. If a user implements a custom queue or pool to manage frame lifetime, they must ensure the frame's memory remains valid until the coroutine completes and is destroyed [7][8]. Using a pointer to a coroutine frame that has already been destroyed leads to undefined behavior, as the frame is the only location storing the information needed to resume or finalize the coroutine [9][2].


Potential dangling pointer from stack-local address stored in queue.

PendingStateless pending is a stack-local variable whose address is stored in high_queue or low_queue. If the coroutine's scope exits (e.g., via cancellation or early return) before the queue removes the pending entry, subsequent calls to try_dispatch_pending() will dereference a dangling pointer.

The risk is particularly acute if coroutine frame allocation uses elision (stack-embedded), which is common for small coroutines. Even with kotatsu's cooperative cancellation model, ensure that queue removal happens before scope exit, or use RAII guards (e.g., scope_exit) to remove &pending from the queue on coroutine destruction.
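The RAII-guard idea can be sketched roughly as follows. This is not the PR's code: `Pending` and `QueueRegistration` are hypothetical names, the `kota::event` member is left out to keep the sketch dependency-free, and a plain `std::deque` stands in for `high_queue` / `low_queue`. It also assumes the queue is only touched from one thread, as on an event loop.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <deque>

// Hypothetical stand-in for PendingStateless (the real struct also holds a
// kota::event, omitted here).
struct Pending {
    int priority = 0;
    std::size_t assigned_worker = 0;
};

// RAII guard: pushes &entry into the queue on construction and erases it on
// destruction if the dispatcher has not already popped it.
class QueueRegistration {
public:
    QueueRegistration(std::deque<Pending*>& queue, Pending& entry)
        : queue_(queue), entry_(&entry) {
        queue_.push_back(entry_);
    }

    QueueRegistration(const QueueRegistration&) = delete;
    QueueRegistration& operator=(const QueueRegistration&) = delete;

    ~QueueRegistration() {
        // If the entry was already dispatched, find() fails and nothing happens.
        auto it = std::find(queue_.begin(), queue_.end(), entry_);
        if(it != queue_.end())
            queue_.erase(it);
    }

private:
    std::deque<Pending*>& queue_;
    Pending* entry_;
};

// Simulates a waiter whose scope ends (for example through cancellation)
// before any dispatcher popped its entry; returns the resulting queue size.
std::size_t queue_size_after_abandoned_wait(std::deque<Pending*>& queue) {
    const std::size_t before = queue.size();
    {
        Pending pending{1, 0};
        QueueRegistration guard(queue, pending);
        assert(queue.size() == before + 1);
        // Scope exit here destroys guard first, then pending.
    }
    return queue.size();
}
```

Because the guard is declared after the entry it registers, it is destroyed first, so the queue never holds a pointer to a dead object. The linear `std::find` is acceptable for short queues; a longer queue might prefer an intrusive list so removal is constant time.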

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/server/worker/worker_pool.cpp` around lines 402 - 410, The issue is that
the stack-local PendingStateless pending variable's address is stored in either
high_queue or low_queue when can_proceed() returns false, and if the coroutine
scope exits before the queue removes this entry (due to cancellation or early
return), subsequent calls to try_dispatch_pending will dereference a dangling
pointer. Fix this by wrapping the queue insertion in an RAII guard using
scope_exit to ensure that the address of pending is removed from whichever queue
it was added to (high_queue or low_queue) before the pending variable goes out
of scope, protecting against premature coroutine termination.

Comment on lines +109 to +115
struct PendingStateless {
    worker::Priority priority;
    kota::event ready{};
    std::size_t assigned_worker = 0;

    explicit PendingStateless(worker::Priority p) : priority(p) {}
};


⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check if kota::event has any cancellation-aware wait() overload or 
# if there's a pattern for safe queue cleanup elsewhere in the codebase.
rg -n --type cpp -C3 'co_await.*\.wait\(\)' | head -60

Repository: clice-io/clice

Length of output: 4658


🏁 Script executed:

#!/bin/bash
# First, get the structure and context of worker_pool.h
cat -n src/server/worker/worker_pool.h | head -150

Repository: clice-io/clice

Length of output: 5947


🏁 Script executed:

#!/bin/bash
# Find the acquire_stateless_slot method implementation
rg -n 'acquire_stateless_slot' --type cpp -A 30 src/server/worker/

Repository: clice-io/clice

Length of output: 6495


🏁 Script executed:

#!/bin/bash
# Find try_dispatch_pending implementation
rg -n 'try_dispatch_pending' --type cpp -A 20 src/server/worker/

Repository: clice-io/clice

Length of output: 6270


🏁 Script executed:

#!/bin/bash
# Check kota::event definition and if it supports cancellation
find . -name '*.h' -o -name '*.cpp' | xargs grep -l 'class event' | head -5
rg -n 'struct event|class event' --type cpp -A 15 | head -80

Repository: clice-io/clice

Length of output: 40


🏁 Script executed:

#!/bin/bash
# Find kota event definition
find . -path ./build -prune -o -name '*.h' -type f -print | xargs grep -l 'struct event\|class event' 2>/dev/null | grep -i kota | head -10

Repository: clice-io/clice

Length of output: 40


🏁 Script executed:

#!/bin/bash
# Check if there's a cancellation token or cancellation-aware wait() overload in kota
rg -n 'cancel|Cancel' --type cpp src/server/worker/worker_pool.cpp | head -30

Repository: clice-io/clice

Length of output: 40


🏁 Script executed:

#!/bin/bash
# Look for any cleanup or destructors in PendingStateless or surrounding code
sed -n '391,430p' src/server/worker/worker_pool.cpp

Repository: clice-io/clice

Length of output: 1236


🏁 Script executed:

#!/bin/bash
# Search for kota library or headers in the repo
find . -name "kota" -o -name "*kota*" -type d | head -10

Repository: clice-io/clice

Length of output: 40


🏁 Script executed:

#!/bin/bash
# Check if kota is included as a submodule or external dependency
ls -la | grep -i kota
cat CMakeLists.txt 2>/dev/null | grep -i kota | head -20

Repository: clice-io/clice

Length of output: 329


🏁 Script executed:

#!/bin/bash
# Search for any coroutine cancellation or exception handling in send_stateless
sed -n '180,210p' src/server/worker/worker_pool.h

Repository: clice-io/clice

Length of output: 1279


🏁 Script executed:

#!/bin/bash
# Check if there are any exception handlers or cancellation mechanisms
rg -n 'catch_cancel|cancel|exception' --type cpp src/server/worker/worker_pool.h src/server/worker/worker_pool.cpp | head -40

Repository: clice-io/clice

Length of output: 40


🏁 Script executed:

#!/bin/bash
# Look at the full try_dispatch_pending implementation
sed -n '432,460p' src/server/worker/worker_pool.cpp

Repository: clice-io/clice

Length of output: 1074


🏁 Script executed:

#!/bin/bash
# Search for usage of acquire_stateless_slot and check if it's in try/catch blocks
rg -B5 -A5 'acquire_stateless_slot' --type cpp src/server/

Repository: clice-io/clice

Length of output: 2311


🏁 Script executed:

#!/bin/bash
# Check if kota::task supports cancellation or has cancellation tokens
rg -n 'kota::task' --type cpp src/server/worker/worker_pool.h | head -5

Repository: clice-io/clice

Length of output: 299


🏁 Script executed:

#!/bin/bash
# Look for any request_options that might include timeout or cancellation
rg -n 'request_options|timeout' --type cpp src/server/ | grep -v test | head -20

Repository: clice-io/clice

Length of output: 1047


🏁 Script executed:

#!/bin/bash
# Check if there's any try-catch around the acquire_stateless_slot or send_stateless
sed -n '180,210p' src/server/worker/worker_pool.h

Repository: clice-io/clice

Length of output: 1279


🏁 Script executed:

#!/bin/bash
# Search for RAII patterns or scope guards that might clean up queue entries
rg -n 'struct.*Slot|~.*Slot' --type cpp src/server/worker/ -A5

Repository: clice-io/clice

Length of output: 814


Dangling pointer risk when coroutine is cancelled.

PendingStateless is a stack-local object in acquire_stateless_slot, and its address is pushed to the priority queues. If the waiting coroutine is cancelled (e.g., request timeout, client disconnect) before being dispatched, the coroutine frame is destroyed while the pointer remains in the queue. try_dispatch_pending would then dereference a dangling pointer.

Consider either:

  1. Heap-allocate PendingStateless with shared ownership, or
  2. Add a cancellation flag that try_dispatch_pending checks before dereferencing, or
  3. Remove the entry from the queue in a destructor/cancellation hook.
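Options 1 and 2 can be combined. The sketch below is an illustration under stated assumptions, not the PR's implementation: `SharedPending`, `PendingQueue`, and `dispatch_next` are hypothetical names, the `kota::event` member is omitted, and the queue is assumed to be accessed from a single thread. The queue co-owns each entry through `std::shared_ptr`, and the dispatcher skips entries whose waiter has marked them cancelled.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <memory>
#include <utility>

// Hypothetical heap-allocated variant of PendingStateless.
struct SharedPending {
    int priority = 0;
    bool cancelled = false;           // set by the waiter when it gives up
    std::size_t assigned_worker = 0;  // filled in by the dispatcher
};

using PendingQueue = std::deque<std::shared_ptr<SharedPending>>;

// Pops entries until a live one is found and assigns it to `worker`.
// Returns nullptr when the queue holds no live entry.
std::shared_ptr<SharedPending> dispatch_next(PendingQueue& queue, std::size_t worker) {
    while(!queue.empty()) {
        std::shared_ptr<SharedPending> entry = std::move(queue.front());
        queue.pop_front();
        assert(entry != nullptr);
        if(entry->cancelled)
            continue;  // waiter is gone; the entry is freed when `entry` is reset
        entry->assigned_worker = worker;
        return entry;
    }
    return nullptr;
}
```

The trade-off is one heap allocation per queued request, in exchange for the queue never holding a pointer into a destroyed coroutine frame. A cancelled waiter only needs to set `cancelled = true` before dropping its own reference.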
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/server/worker/worker_pool.h` around lines 109 - 115, The PendingStateless
struct is stack-allocated in acquire_stateless_slot but its address is stored in
priority queues, creating a dangling pointer risk if the coroutine is cancelled
before try_dispatch_pending can dispatch it. To fix this, implement one of the
following approaches: heap-allocate PendingStateless using std::shared_ptr to
extend the object lifetime beyond the coroutine frame and ensure safe cleanup,
or add a validity/cancellation flag to the PendingStateless struct that
try_dispatch_pending checks before dereferencing to safely skip cancelled
entries, or implement a cancellation hook or destructor that removes the entry
from the queue when the coroutine is cancelled to prevent dereferencing a stale
pointer.

Comment on lines +711 to +717
for(int i = 0; i < 50; ++i) {
    co_await kota::sleep(100);
    if(f.worker_alive(0))
        break;
}
EXPECT_TRUE(f.worker_alive(0));
EXPECT_NE(f.worker_pid(0), pid);


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Wait for PID change before exiting the restart loop.

Line 713 can evaluate true before the crash/respawn cycle is complete, so the loop may exit early and Line 717 can compare against the original PID intermittently.

Proposed fix
-        for(int i = 0; i < 50; ++i) {
-            co_await kota::sleep(100);
-            if(f.worker_alive(0))
-                break;
-        }
-        EXPECT_TRUE(f.worker_alive(0));
-        EXPECT_NE(f.worker_pid(0), pid);
+        for(int i = 0; i < 50; ++i) {
+            co_await kota::sleep(100);
+            if(f.worker_alive(0) && f.worker_pid(0) != pid)
+                break;
+        }
+        EXPECT_TRUE(f.worker_alive(0));
+        EXPECT_NE(f.worker_pid(0), pid);
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unit/server/worker_pool_tests.cpp` around lines 711 - 717, The loop
that waits for worker restart exits prematurely when the worker becomes alive
again, but before the PID has actually changed due to the respawn. Modify the
break condition around line 713-715 to check not only that f.worker_alive(0) is
true, but also that f.worker_pid(0) has changed from the original pid variable
before breaking from the loop. This ensures the crash/respawn cycle is fully
complete, preventing line 717 from intermittently failing when it verifies the
PID has changed.
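The race can be reproduced without real processes. The sketch below uses a hypothetical `FakeWorker` whose pid only changes on the third poll while it is reported alive throughout, which is an assumption about how the fixture might behave during the window described above. `poll_until` is likewise a made-up synchronous helper standing in for the `co_await kota::sleep(100)` loop.

```cpp
#include <cassert>
#include <functional>

// Hypothetical fake: reported alive the whole time, but the pid only changes
// once the simulated respawn completes on the third poll.
struct FakeWorker {
    int pid = 100;
    bool alive = true;
    int polls_until_respawn = 3;

    void tick() {
        if(polls_until_respawn > 0 && --polls_until_respawn == 0)
            pid = 200;  // respawn finished, new process id
    }
};

// Polls up to max_attempts times; returns the 1-based attempt on which
// `done` first held, or -1 if it never did.
int poll_until(FakeWorker& worker,
               int max_attempts,
               const std::function<bool(const FakeWorker&)>& done) {
    assert(max_attempts > 0);
    for(int i = 1; i <= max_attempts; ++i) {
        worker.tick();  // stands in for one sleep interval
        if(done(worker))
            return i;
    }
    return -1;
}
```

With the predicate `w.alive` alone, the loop stops on the first poll while the pid is still the original one; with `w.alive && w.pid != old_pid` it keeps polling until the respawn is visible.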

Replace SIGKILL (POSIX-only) with literal 9 in monitor_worker
fallback path to fix Windows build. Lower default max_restarts
from 5 to 2 since excessive retries are not useful.