feat(server): unified cache framework and PCH cache keys#453
feat(server): unified cache framework and PCH cache keys#45316bit-ykiko wants to merge 17 commits into
Conversation
Content-addressed blob store with two-phase atomic writes (tmp + fsync + rename), per-namespace lifecycle policies (LRU eviction, Persistent, per-instance Scratch), crash recovery via tmp sweep and directory scan, and a manifest checkpoint carrying last-accessed times across restarts.
canonicalize(args, profile) renders the profile-selected subset of a compile command into a stable string for cache keys. Frontend drops codegen-only and clice-managed options; Preprocessing additionally drops diagnostics-presentation options. Filtering never reorders, unknown options are kept verbatim, and per-file identity (input path, -main-file-name) stays out so identical preambles share keys.
PCH cache keys now hash the preamble text together with the canonical frontend flags, the compile directories and the clang version (fixes the FIXME where -D/-I/-std variants wrongly shared one PCH); PCM keys likewise. Keys use xxh3_128bits. Workers write blobs to store-allocated tmp paths and the master commits them; evicted PCMs are re-validated before each compile. The index moved into the store as a Persistent namespace with mark-and-sweep cleanup; stale pre-framework layouts are discarded on startup. cache.json keeps only validity metadata and lives under the versioned store root.
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughThis PR adds a versioned CacheStore, introduces argument canonicalization for deterministic cache keys, migrates PCH/PCM/index persistence to the store, coordinates workspace/server cache lifecycle and checkpointing, simplifies worker output handling, and updates tests and docs to the versioned cache layout. ChangesUnified cache store and compilation persistence migration
Sequence Diagram(s)sequenceDiagram
participant Client
participant Compiler
participant CacheStore
participant StatelessWorker
Client->>Compiler: invoke compile (args)
Compiler->>Compiler: canonicalize(args) -> key
Compiler->>CacheStore: lookup("pch"/"pcm", key)
alt cache miss
Compiler->>CacheStore: begin_store(ns, key) -> pending.tmp_path
Compiler->>StatelessWorker: build(output_path=pending.tmp_path)
StatelessWorker-->>Compiler: build result (tmp_path)
Compiler->>CacheStore: commit(pending)
else cache hit
CacheStore-->>Compiler: blob path
end
Compiler->>Client: compilation result (uses blob path or built artifact)
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: cb145d38e7
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (1)
src/support/cache_store.h (1)
73-81: ⚖️ Poor tradeoffPendingEntry lacks automatic cleanup.
Dropping a
PendingEntrywithout callingcommit()orabort()leaks the tmp file until the nextopen()sweep. This creates exception-safety risk: if the caller's code throws betweenbegin_store()andcommit(), the tmp file is orphaned.Consider adding an RAII guard (e.g.,
abort()in destructor if not committed) or documenting the leak risk more prominently at the call sites.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/support/cache_store.h` around lines 73 - 81, PendingEntry can leak tmp files if dropped before commit()/abort(); modify the struct to follow RAII by adding a destructor that calls abort() when the entry hasn't been finalized, and add a boolean flag (e.g., finalized or committed) updated by commit() and abort() to prevent double work; update begin_store()/commit()/abort() callers to rely on this RAII behavior and ensure move semantics are handled (delete or define move ctor/assign to transfer ownership of tmp_path and the finalized flag) so exceptions between begin_store() and commit() automatically clean up the tmp file without waiting for open() sweeps.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/server/compiler/indexer.cpp`:
- Around line 133-149: The code currently proceeds to write merged shard blobs
even if storing the "project" blob fails; modify the save sequence in the
function containing store_blob and workspace.project_index so that after
attempting store_blob(..., "project", ...) you check the boolean result and
return early (or abort further saves) when it is false, preventing any writes of
workspace.merged_indices shards; ensure you reference the existing store_blob
call that serializes workspace.project_index and guard the subsequent loop over
workspace.merged_indices (and any use of shard.index.serialize) so shards are
only persisted when the project blob save succeeded.
In `@src/server/service/master_server.cpp`:
- Around line 211-214: The current shutdown persists caches before quiescing
producers, which allows in-flight indexing/PCH work to complete after the
snapshot; change the shutdown order so you first await/quiesce producers and
compilers (co_await kota::when_all(indexer.stop(), compiler.stop()) or
equivalent) to drain in-flight work, then call indexer.save() and
workspace.save_cache() to persist a stable snapshot, and finally stop the pool
(co_await pool.stop()); update calls around the indexer/compile stop/save
sequence to ensure no new work can modify caches before serialization.
---
Nitpick comments:
In `@src/support/cache_store.h`:
- Around line 73-81: PendingEntry can leak tmp files if dropped before
commit()/abort(); modify the struct to follow RAII by adding a destructor that
calls abort() when the entry hasn't been finalized, and add a boolean flag
(e.g., finalized or committed) updated by commit() and abort() to prevent double
work; update begin_store()/commit()/abort() callers to rely on this RAII
behavior and ensure move semantics are handled (delete or define move
ctor/assign to transfer ownership of tmp_path and the finalized flag) so
exceptions between begin_store() and commit() automatically clean up the tmp
file without waiting for open() sweeps.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 742002e5-0580-49a1-8440-017a7f13aca7
📒 Files selected for processing (18)
src/command/argument_parser.cppsrc/command/argument_parser.hsrc/server/compiler/compiler.cppsrc/server/compiler/indexer.cppsrc/server/compiler/indexer.hsrc/server/service/master_server.cppsrc/server/service/master_server.hsrc/server/service/session.hsrc/server/worker/stateless_worker.cppsrc/server/workspace/workspace.cppsrc/server/workspace/workspace.hsrc/support/cache_store.cppsrc/support/cache_store.htests/integration/compilation/test_persistent_cache.pytests/integration/compilation/test_staleness.pytests/integration/utils/cache.pytests/unit/command/canonicalize_tests.cpptests/unit/support/cache_store_tests.cpp
windows.h was included before the LLVM headers without NOMINMAX, so its min/max function-like macros broke SmallVector.h/MathExtras.h macro expansion on the MSVC CI jobs. Also define WIN32_LEAN_AND_MEAN; this is the only direct windows.h include in clice, and the comment documents that any future include needs the same guards.
The benign rename-collision fallback in CacheStore::commit assumed
content-addressed keys, which only holds for LRU namespaces. Persistent
and Scratch keys are mutable (the index rewrites "project"/"{path_id}"
with new content), so on Windows a rename onto an open blob was silently
treated as success and the new data discarded — stale index served
forever. Now only LRU keeps the fallback (and requires a regular file);
Persistent/Scratch remove the stale destination and retry, surfacing an
error if the blob still cannot be published, and dropping the in-memory
entry when the old blob is gone too.
Indexer::save() ran store.commit() — an fsync per blob — synchronously on the event loop, unlike the PCH/PCM paths which already offload via kota::queue. save() is now a task: phase one serializes the ProjectIndex and dirty shards to tmp files with no suspension point, so the batch stays a consistent snapshot even if a merge runs before the commits land; phase two commits each blob on the kota thread pool.
The index lives inside the unified cache store under cache_dir, so the field no longer points at anything. Drop it from the config struct, its default/substitution logic and the docs; the TOML codec skips unknown keys, so configs still setting index_dir keep parsing (covered by a new test). The legacy index/ directory is still cleaned up at store open.
CliceClient now exposes server/kill_server()/stop_io(), giving the pygls internals (_server, _stop_event, _async_tasks) a single home. The kill -9 recovery test, conftest shutdown and the agentic shutdown tests use the new API instead of poking at private members.
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@tests/integration/agentic/test_agentic.py`:
- Around line 447-448: Wrap the call to assert_server_exited_cleanly(c.server)
and await c.stop_io() in a try/finally so stop_io() always runs even if the
assertion fails; specifically, in the test (test_rpc_shutdown) ensure you call
await assert_server_exited_cleanly(c.server) inside a try block and put await
c.stop_io() in the finally block to guarantee IO cleanup and prevent task leaks.
In `@tests/integration/utils/client.py`:
- Around line 129-134: The stop_io method sets _stop_event and cancels tasks but
doesn't await them, leaving pending tasks that can leak; update stop_io to
cancel each Task in self._async_tasks and then await their completion (e.g.,
await asyncio.gather(*self._async_tasks, return_exceptions=True) or await tasks
individually while swallowing asyncio.CancelledError) to ensure all client-side
IO tasks finish before teardown, and clear or reset self._async_tasks afterward.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 7f45079b-1634-44e9-adec-e769cb1dc493
📒 Files selected for processing (16)
docs/clice.tomldocs/en/guide/configuration.mddocs/zh/guide/configuration.mdsrc/server/compiler/indexer.cppsrc/server/compiler/indexer.hsrc/server/service/master_server.cppsrc/server/workspace/config.cppsrc/server/workspace/config.hsrc/support/cache_store.cppsrc/support/cache_store.htests/conftest.pytests/integration/agentic/test_agentic.pytests/integration/compilation/test_persistent_cache.pytests/integration/utils/client.pytests/unit/server/config_tests.cpptests/unit/support/cache_store_tests.cpp
💤 Files with no reviewable changes (2)
- src/server/workspace/config.h
- src/server/workspace/config.cpp
✅ Files skipped from review due to trivial changes (2)
- docs/zh/guide/configuration.md
- docs/en/guide/configuration.md
🚧 Files skipped from review as they are similar to previous changes (6)
- src/server/compiler/indexer.cpp
- src/support/cache_store.h
- src/server/compiler/indexer.h
- src/server/service/master_server.cpp
- tests/unit/support/cache_store_tests.cpp
- tests/integration/compilation/test_persistent_cache.py
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: ec80822213
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
The collision fallback assumed LRU blobs colliding by name share the same content, but even LRU keys are not fully content-addressed: a dependency edit changes PCH content without changing the key input, so on Windows a rename onto an open stale blob was kept while the caller recorded fresh deps metadata — stale PCH served with no rebuild trigger. Keep the survivor only when verified byte-identical; otherwise remove and retry like mutable keys. Also document that the unsigned wraparound in the size accounting is intentional and exact.
Publish shard blobs only together with the ProjectIndex they were built against: bail out when its serialization fails and drop the staged shards when its commit fails, so a restart never pairs new shards with an old project blob. Quiesce compiler/indexer before the shutdown save so the snapshot covers everything that completed.
stop_io now awaits the cancelled IO tasks so none outlive the test, and test_rpc_shutdown reaches it even when the exit assertion fails.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: ade9017146
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
The building-event cleanup (wake waiters, clear the marker) ran inline on each exit path of ensure_pch, so a coroutine cancelled at any of its suspension points unwound without ever signalling the event and later requests for the same file would wait on it forever. Today that only happens at shutdown where the waiters are cancelled too, but the invariant is fragile — any future per-compile cancellation would trip it. An RAII guard now owns the cleanup, covering cancellation as well as the normal paths, and only clears the marker if it still points at its own registration (the entry can be erased by didClose and re- registered by a newer build mid-flight).
Record why the store is synchronous-plus-mutex rather than coroutines over kota's async fs: a started operation always runs to completion, so cancellation never observes a torn mid-operation state. Leave a TODO to re-evaluate the loop-confined coroutine design (no mutex, cancellation points inside operations) once real-world usage settles.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 52ed28fd20
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
llvm::sys::fs::remove_directories is implemented over shell COM on Windows (CoInitializeEx + IFileOperation) with no fallback, and with the default IgnoreErrors=true a COM failure (e.g. the thread already holds a different apartment model) is a silent no-op — unsuitable for the server event loop, which uses it for the version sweep, dead-pid reclamation and shutdown cleanup. Add fs::remove_all, the same plain recursive removal LLVM's Unix implementation uses (directory_iterator + remove, no COM), and use it everywhere. Also document that the commit rename collision is rarer than it looks: sys::fs::rename already moves aside destinations whose holders granted delete sharing, which LLVM-opened files always do.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: eff8fcf930
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
Nothing has shipped yet, so there is no old on-disk layout to clean up after: remove the startup deletion of the pre-store index/ directory and the stale-layout wording around the version sweep (which keeps discarding anything under cache/ that isn't the current version dir).
A stale entry under cache/ that is a symlink to a directory outside the cache root would be recursed into by the version sweep, deleting files the store does not own. lstat the root first: anything that is not a real directory — including symlinks to directories — is unlinked without being followed (children were already handled no-follow).
The eviction preflight ran only before compile_deps, but building dependencies can itself evict another clean module's PCM under LRU budget pressure, reopening the window just closed and handing clang a dangling path via fill_pcm_deps. Scan-dirty-compile now loops until a scan finds nothing evicted (bounded retries).
There was a problem hiding this comment.
💡 Codex Review
clice/src/command/argument_parser.cpp
Lines 218 to 221 in 4f6d17d
For compile commands that differ only by -fpic versus -fPIC, this makes canonicalize(..., ArgsProfile::Frontend) drop the distinguishing flag, so files with the same preamble can reuse the same PCH/PCM even though Clang exposes different predefined macros (clang++ -fpic/-fPIC -dM -E -x c++ /dev/null reports different __PIC__/__pic__ values). Any header that branches on those macros can therefore be compiled from a stale artifact; these flags need to remain in the frontend key.
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
Background
clice persists three kinds of derived artifacts, and before this PR each managed its own files with ad-hoc logic: PCH/PCM blobs under
cache/{pch,pcm}tracked by a hand-rolledcache.json, index shards under a separateindex_dir, and synthesized header-context preambles accumulating underheader_context/with no cleanup at all. None of them had a real reliability story:-D/-I/-stdshared one PCH — wrong macros, wrong diagnostics. This was a long-standing FIXME incompiler.cpp.This PR introduces a unified blob store (
CacheStore) that owns the lifecycle of every on-disk cache artifact, migrates PCH/PCM/index onto it, and fixes the cache-key bug properly.Design
Narrow ownership. The store manages blob lifecycle only: atomic publication, size accounting and eviction, crash recovery, and an access-time checkpoint. It never computes keys and never decides validity — staleness is domain knowledge (deps snapshots, two-layer mtime+hash checks) that stays with the callers, who hand the store opaque filename-safe keys. This keeps the store generic enough that future artifacts (render outputs, preamble files) plug in as namespaces rather than new subsystems.
One invariant instead of many states. A blob is complete iff it exists at its final path. Commit is fsync-then-atomic-rename, so a crash at any instant leaves exactly one kind of residue: tmp files, which live in per-pid directories and are swept at startup by pid-liveness. There are no sidecar metadata files to get out of sync — the filename is the key, size/mtime come from the filesystem, and the manifest is not a source of truth: it only carries last-accessed times across restarts. Registration always reconciles against a directory scan, so a corrupt or missing manifest merely degrades LRU ordering to mtime, and blobs committed after the last checkpoint are still adopted.
Versioned root, zero migration code. Everything lives under
cache/v{N}; anything else undercache/is discarded wholesale at open. Caller-side metadata (cache.json, now reduced to dependency snapshots) lives inside the versioned root, so a format bump retires blobs and metadata together. We will never write or debug migration code for cache data that can simply be rebuilt.Policies are the variation point. Three lifecycle policies cover everything we store:
LRU— rebuildable large artifacts (PCH/PCM, 8 GiB budget each): evicted coldest-first on commit, with the freshly committed blob protected and failed deletions (Windows open files) retried on later rounds.Persistent— data that is expensive to accumulate (the index): never auto-evicted; the store provides key enumeration and the indexer does mark-and-sweep of orphaned shards on load.Scratch— per-instance working files in pid subdirectories, excluded from manifest and LRU; dead-pid directories are reclaimed on registration. This both bounds previously-unbounded preamble accumulation and stops concurrent instances from deleting each other's live files. (The namespace is registered now; the legacy preamble path migrates onto it with the upcoming preamble rework.)Process boundary follows the existing architecture. The master exclusively owns the store. Workers never touch store state: the master allocates a tmp path (
begin_store), the worker compiles into it, and the master commits after the worker reports success. No cross-process coordination protocol — concurrent instances sharing a root get best-effort safety (per-pid isolation), not coordinated LRU accounting.Passive and thread-safe by construction. The store owns no timers and never couples to the event loop. All methods are internally synchronized so the heavy operations — commit's fsync, manifest writes — are offloaded by callers to the kota thread pool while lookups (pure in-memory, zero disk IO) continue on the loop. Periodic checkpointing is scheduled by the server; losing the most recent access times in a crash only blurs LRU ordering, which is an accepted trade.
Cache keys
Keys are a caller convention: hex of
xxh3_128bits. 128 bits because a collision here means silently reusing the wrong PCH — the 64-bit birthday bound (~2³²) is not acceptable for that failure mode, and there is no adversarial model that would justify paying for SHA-256.The PCH key now covers, via the new
canonicalize(args, profile):-g,-fPIC,-flto, …) stay out of the key so they can't cause spurious rebuilds; note that-O*is kept deliberately, since it defines__OPTIMIZE__and changes preprocessing. Filtering never reorders (relative-I/-Dorder is semantic), and unknown options are kept verbatim — over-keying costs sharing, under-keying costs correctness.One subtlety the implementation defends against: even these keys are not fully content-addressed (editing an included header changes PCH content without changing the key input). A rename collision on commit — possible on Windows when the destination blob is open — is therefore treated as benign only after verifying the existing blob is byte-identical; otherwise the stale blob is removed and the publish retried.
Behavior changes
project.index_dirconfig field is gone — the index lives inside the store undercache_dir. There is no legacy-layout cleanup or migration anywhere: nothing has shipped yet, so there are no layouts to migrate from.kill -9at any point leaves no residue that the next start doesn't reclaim, and never a half-written blob under a valid name.cache_dirremains the single user-facing "clear everything" action.Testing
canonicalizekey stability (irrelevant flags keep the key;-D/-I/-std/ordering/unknown flags change it; input file and-main-file-namestay out).-D→ two PCHs; flag change across restart invalidates), pluskill -9mid-build → restart sweeps residue and serves normally.Summary by CodeRabbit
New Features
Refactor
Bug Fixes