feat(index): optimize merge with ContextBitmap and selective merge by 16bit-ykiko · Pull Request #422 · clice-io/clice

16bit-ykiko · 2026-04-11T07:00:31Z

Summary

Introduce ContextBitmap (tagged-pointer inline bitmap for <63 ids) to replace roaring::Roaring in MergedIndex hot paths, eliminating heap allocations in the common case
Cache FileIndex SHA256 hashes with deterministic ordering to avoid recomputation and ensure cross-instance consistency
Add selective ProjectIndex::merge that skips symbols not referencing newly-merged files
MergedIndex::merge now returns bool to indicate cache hits, enabling the selective optimization
Add index_benchmark tool for end-to-end merge/query performance profiling

Test plan

Unit tests: 535 passed (new tests for merge cache hit, hash determinism, selective merge, ContextBitmap edge cases)
Integration tests: 119 passed
Smoke tests: 2/2 passed
Pre-PR triple review (correctness, style, test coverage) — all findings addressed

🤖 Generated with Claude Code

Summary by CodeRabbit

New Features
- Added a new indexing benchmark tool with detailed phase timings, serialization metrics, and query micro-benchmarks.
Performance
- Merge path now avoids redundant work via caching, enabling faster incremental updates.
- Deterministic, cached hashing reduces repeated computation.
- Selective merging updates only affected symbols for more efficient incremental indexing.
- More compact/efficient bitmap storage for index contexts.
Tests
- Added unit tests covering merge cache behavior, selective merging, hashing, and bitmap correctness.

coderabbitai · 2026-04-11T07:00:49Z

📝 Walkthrough

Walkthrough

Adds a new index benchmarking executable; introduces a compact ContextBitmap type and replaces roaring usage in MergedIndex; makes MergedIndex::merge return a cache-hit bool; adds selective ProjectIndex::merge overload; caches deterministic FileIndex hashes; and includes unit tests for these behaviors.

Changes

Cohort / File(s)	Summary
Benchmark Binary `CMakeLists.txt`, `benchmarks/index_benchmark.cpp`	New `index_benchmark` executable measuring parallel per-file compilation/TUIndex construction, single-threaded merging/serialization, micro lookups, and high-level "find references" queries with CLI knobs and percentile reporting.
Bitmap & MergedIndex `src/support/bitmap.h`, `src/index/merged_index.h`, `src/index/merged_index.cpp`	Added `clice::ContextBitmap` (inline/heap hybrid), replaced roaring fields with `ContextBitmap`, adjusted serialization, lookup filtering, and changed `MergedIndex::merge` overloads to return `bool` indicating cache hits.
ProjectIndex Selective Merge `src/index/project_index.h`, `src/index/project_index.cpp`	Added `ProjectIndex::merge(TUIndex&, const Bitmap&)` overload that only merges symbols referenced by file IDs in the provided bitmap; returns mapped file IDs.
FileIndex Hashing `src/index/tu_index.h`, `src/index/tu_index.cpp`	Added cached SHA256 hash storage and deterministic hashing: short-circuit cache, sort relation keys/entries before hashing, store cached_hash.
Unit Tests Added `tests/unit/support/context_bitmap_tests.cpp`, `tests/unit/index/merged_index_tests.cpp`, `tests/unit/index/project_index_tests.cpp`, `tests/unit/index/tu_index_tests.cpp`	New tests covering ContextBitmap inline/heap behavior and semantics, MergedIndex merge cache-hit behavior, ProjectIndex selective-merge semantics, and FileIndex hash caching/determinism.

Sequence Diagram(s)

sequenceDiagram
    actor User
    participant Compiler as Compiler (parallel)
    participant Merger as Merger / ProjectIndex
    participant Merged as MergedIndex (shards)
    participant Query as Query Engine

    User->>Compiler: Phase 1: compile files in parallel
    Compiler->>Compiler: Build TUIndex/FileIndex per file\ncompute deterministic hash
    Compiler-->>Merger: emit per-file indices & file IDs

    Merger->>Merger: Phase 2: selective merge\n(ProjectIndex::merge with Bitmap)
    Merger->>Merged: update shards (MergedIndex::merge)\nreturn cache-hit flags
    Merger->>Merged: serialize shards (measure size/time)

    User->>Query: Phase 3: micro-benchmarks (lookups)
    Query->>Merged: raw occurrence/relation lookups\n(position mapping)
    Query->>Query: Phase 4: find-references scan\nacross project symbols & shards
    Query-->>User: report timings, sizes, coverage

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Possibly related PRs

feat: implement index system with LSP query handlers #382: Modifies the index layer including MergedIndex and ProjectIndex merge/serialization—strong overlap with bitmap and merge changes.
feat: store indices to disk #279: Changes FileIndex/TUIndex hashing and merge/cache semantics; directly related to hashing and merge API updates.

Poem

🐰✨ A Rabbit’s Benchmark Verse
I hopped through files both small and grand,
Cached hashes held tight in my pawed hand.
Bits tucked inline, then leapt to heap,
Merges that whisper, “cache hit—sleep.”
Benchmarks drum; indices hum—what a land! 🥕

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 7.69% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title clearly summarizes the two main changes: introducing ContextBitmap for optimization and adding selective merge capability, which are the core features of this PR.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch feat/symbol-lookup

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

src/index/merged_index.cpp (1)

606-610: ⚠️ Potential issue | 🟠 Major

Refresh header content on cache hits too.

Unlike the compilation overload, this only writes impl->content once. FileIndex::hash() is content-agnostic, so comment/whitespace-only header edits can still hit the cache and leave content() stale, which will skew later position mapping.

Proposed fix

     self.load_in_memory();
-    if(self.impl->content.empty() && !content.empty()) {
+    if(!content.empty()) {
         self.impl->content = content.str();
     }

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@src/index/merged_index.cpp` around lines 606 - 610, The cache merge only sets
impl->content when the entry is initially empty, leaving impl->content stale on
cache hits; update the logic so that when new content is provided (non-empty)
you refresh impl->content even on hits. Locate the call sequence around
load_in_memory(), the merge(...) invocation and the Impl type members
(impl->content) and ensure either the merge callback or the post-merge path
assigns impl->content = content.str() when content is non-empty (so
FileIndex::hash() cache hits also update content() for correct position
mapping).

🧹 Nitpick comments (1)

tests/unit/index/project_index_tests.cpp (1)

203-238: This fixture doesn't prove the filter excludes untouched symbols.

Every symbol in this setup can still reference main.cpp, so project.merge(tu, new_file_ids) may legally copy the whole table and the test would still pass even if the selective filter were bypassed. Add a header-only symbol that is never referenced from main.cpp and assert it stays absent.

🧪 Example tightening

 add_file("header.h", R"(
         `#pragma` once
         inline int shared() { return 1; }
+        inline int header_only() { return 2; }
     )");
@@
     ASSERT_TRUE(has_main_symbols);
+
+    for(auto& [hash, symbol]: tu.symbols) {
+        if(symbol.name == "header_only") {
+            ASSERT_FALSE(project.symbols.contains(hash));
+        }
+    }
 }

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@tests/unit/index/project_index_tests.cpp` around lines 203 - 238, The test
SelectiveMergeFilters currently can pass even if merge ignores the new_file_ids
filter because every symbol is referenced by main.cpp; to fix, add a header-only
symbol in add_file (e.g., a function or variable named something like
unreferenced_header_symbol) that is not used in add_main, keep the rest of the
setup, call project.merge(tu, new_file_ids), then assert that the header-only
symbol is not present in project.symbols (use the symbol's hash/key from
tu.symbols or look it up by name in tu.symbols) to prove untouched symbols are
excluded; reference the existing identifiers tu.symbols, project.symbols,
project.merge, and new_file_ids to locate where to add the new header and
assertion.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@benchmarks/index_benchmark.cpp`:
- Around line 175-179: The code casts CLI values to unsigned types before
validating, so negative inputs like opts.jobs or opts.limit become huge
positives; update the handling for num_jobs and the limit by first reading them
into a signed integer (e.g., int or long), check for <= 0 and handle negatives
by falling back to hardware_concurrency() or default values, then clamp to a
reasonable max before converting to unsigned/size_t for reserve(num_jobs) and
the worker loop; ensure the same validation logic is applied to both the
num_jobs (from opts.jobs) and the limit (from opts.limit) code paths.
- Around line 362-380: The code currently always treats the main TU as a new
file and forces a cache miss by unconditionally calling
new_file_ids.add(main_tu_path_id) and incrementing cache_misses after calling
merged[global_main_id].merge(...); change this to use the actual result from
merged[global_main_id].merge(global_main_id, now, std::move(remapped_locs),
tu_index.main_file_index, content): capture its hit/miss return value and only
add main_tu_path_id to new_file_ids or increment cache_misses when the merge
reports a miss, otherwise treat it as a hit and avoid mutating
new_file_ids/cache_misses so repeated TU paths reflect correct Phase 2
statistics.

In `@src/index/tu_index.h`:
- Around line 56-61: The cached_hash stored in FileIndex can become stale
because mutable members (relations, occurrences) can change after hash()
populates cached_hash, causing MergedIndex::merge to use a wrong canonical key;
fix by either making FileIndex immutable after construction (e.g., move
relations and occurrences to private read-only members and expose only const
accessors or provide a finalize/build method that seals the instance and
prevents further mutation) or by explicitly invalidating cached_hash (reset
cached_hash = std::nullopt) at every mutating code path that modifies relations
or occurrences (add centralized mutator methods or wrap direct mutations to
ensure invalidation); update hash(), constructors, any setters, and
MergedIndex::merge call sites accordingly to use the sealed or invalidation-safe
FileIndex.

In `@src/support/bitmap.h`:
- Around line 134-141: operator== assumes differing storage modes imply
inequality, but remove() can leave an empty heap that should compare equal to an
inline-empty bitmap; update operator==(const ContextBitmap& lhs, const
ContextBitmap& rhs) to handle mixed modes by normalizing comparison: if
is_inline() differs, take the heap side (as_heap()) and verify it contains no
ids >= inline_capacity (i.e., all bits fit in inline_capacity) and that the
low-inline_capacity bits equal the inline side's data; only return false if the
heap has any id >= inline_capacity or the low bits differ. Reference:
operator==, ContextBitmap::is_inline, ContextBitmap::data,
ContextBitmap::as_heap, inline_capacity, and the remove/add behavior.

---

Outside diff comments:
In `@src/index/merged_index.cpp`:
- Around line 606-610: The cache merge only sets impl->content when the entry is
initially empty, leaving impl->content stale on cache hits; update the logic so
that when new content is provided (non-empty) you refresh impl->content even on
hits. Locate the call sequence around load_in_memory(), the merge(...)
invocation and the Impl type members (impl->content) and ensure either the merge
callback or the post-merge path assigns impl->content = content.str() when
content is non-empty (so FileIndex::hash() cache hits also update content() for
correct position mapping).

---

Nitpick comments:
In `@tests/unit/index/project_index_tests.cpp`:
- Around line 203-238: The test SelectiveMergeFilters currently can pass even if
merge ignores the new_file_ids filter because every symbol is referenced by
main.cpp; to fix, add a header-only symbol in add_file (e.g., a function or
variable named something like unreferenced_header_symbol) that is not used in
add_main, keep the rest of the setup, call project.merge(tu, new_file_ids), then
assert that the header-only symbol is not present in project.symbols (use the
symbol's hash/key from tu.symbols or look it up by name in tu.symbols) to prove
untouched symbols are excluded; reference the existing identifiers tu.symbols,
project.symbols, project.merge, and new_file_ids to locate where to add the new
header and assertion.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: eb9b2d7e-023a-4c77-bd4c-a48d92191f47

📥 Commits

Reviewing files that changed from the base of the PR and between 8bafaa8 and 1d1122d.

📒 Files selected for processing (13)

CMakeLists.txt
benchmarks/index_benchmark.cpp
src/index/merged_index.cpp
src/index/merged_index.h
src/index/project_index.cpp
src/index/project_index.h
src/index/tu_index.cpp
src/index/tu_index.h
src/support/bitmap.h
tests/unit/index/merged_index_tests.cpp
tests/unit/index/project_index_tests.cpp
tests/unit/index/tu_index_tests.cpp
tests/unit/support/context_bitmap_tests.cpp

coderabbitai · 2026-04-11T07:10:17Z

+    auto num_jobs = static_cast<unsigned>(*opts.jobs);
+    if(num_jobs == 0)
+        num_jobs = std::thread::hardware_concurrency();
+    if(num_jobs == 0)
+        num_jobs = 4;


⚠️ Potential issue | 🟠 Major

Reject negative --jobs and --limit before casting.

Lines 175-179 and 187-189 convert signed CLI values to unsigned/size_t first. -1 turns into a huge positive number here, which can explode reserve(num_jobs) and the worker loop instead of falling back to defaults.

Proposed fix

- auto num_jobs = static_cast<unsigned>(*opts.jobs); + if(*opts.jobs < 0 || *opts.limit < 0) { + std::println(stderr, "--jobs and --limit must be >= 0"); + return 1; + } + + auto num_jobs = static_cast<unsigned>(*opts.jobs); if(num_jobs == 0) num_jobs = std::thread::hardware_concurrency(); if(num_jobs == 0) num_jobs = 4; @@ - auto limit = static_cast<std::size_t>(*opts.limit); + auto limit = static_cast<std::size_t>(*opts.limit);

Also applies to: 187-189

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@benchmarks/index_benchmark.cpp` around lines 175 - 179, The code casts CLI values to unsigned types before validating, so negative inputs like opts.jobs or opts.limit become huge positives; update the handling for num_jobs and the limit by first reading them into a signed integer (e.g., int or long), check for <= 0 and handle negatives by falling back to hardware_concurrency() or default values, then clamp to a reasonable max before converting to unsigned/size_t for reserve(num_jobs) and the worker loop; ensure the same validation logic is applied to both the num_jobs (from opts.jobs) and the limit (from opts.limit) code paths.

coderabbitai · 2026-04-11T07:10:17Z

+        // Main file is always new.
+        new_file_ids.add(main_tu_path_id);
+        auto global_main_id = project_index.path_pool.path_id(graph.paths[main_tu_path_id]);
+        auto content = read_content(global_main_id, graph.paths[main_tu_path_id]);
+        auto now = std::chrono::duration_cast<std::chrono::milliseconds>(
+            std::chrono::system_clock::now().time_since_epoch());
+
+        std::vector<index::IncludeLocation> remapped_locs;
+        for(auto& loc: graph.locations) {
+            auto remapped = loc;
+            remapped.path_id = project_index.path_pool.path_id(graph.paths[loc.path_id]);
+            remapped_locs.push_back(remapped);
+        }
+        merged[global_main_id].merge(global_main_id,
+                                     now,
+                                     std::move(remapped_locs),
+                                     tu_index.main_file_index,
+                                     content);
+        cache_misses++;


⚠️ Potential issue | 🟠 Major

Use the main-file merge() result instead of forcing a miss.

Lines 362-380 always add main_tu_path_id to new_file_ids and always increment cache_misses, even though the compilation-context overload now returns the same cache-hit signal as the header overload. That makes the Phase 2 hit/miss numbers and the selective ProjectIndex::merge() input wrong for repeated TU paths.

Proposed fix

- // Main file is always new. - new_file_ids.add(main_tu_path_id); auto global_main_id = project_index.path_pool.path_id(graph.paths[main_tu_path_id]); auto content = read_content(global_main_id, graph.paths[main_tu_path_id]); auto now = std::chrono::duration_cast<std::chrono::milliseconds>( std::chrono::system_clock::now().time_since_epoch()); @@ - merged[global_main_id].merge(global_main_id, - now, - std::move(remapped_locs), - tu_index.main_file_index, - content); - cache_misses++; + bool main_hit = merged[global_main_id].merge(global_main_id, + now, + std::move(remapped_locs), + tu_index.main_file_index, + content); + if(main_hit) { + cache_hits++; + } else { + cache_misses++; + new_file_ids.add(main_tu_path_id); + } merged_index_merge_ms += mt.ms();

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@benchmarks/index_benchmark.cpp` around lines 362 - 380, The code currently always treats the main TU as a new file and forces a cache miss by unconditionally calling new_file_ids.add(main_tu_path_id) and incrementing cache_misses after calling merged[global_main_id].merge(...); change this to use the actual result from merged[global_main_id].merge(global_main_id, now, std::move(remapped_locs), tu_index.main_file_index, content): capture its hit/miss return value and only add main_tu_path_id to new_file_ids or increment cache_misses when the merge reports a miss, otherwise treat it as a hit and avoid mutating new_file_ids/cache_misses so repeated TU paths reflect correct Phase 2 statistics.

coderabbitai · 2026-04-11T07:10:17Z

+    /// Compute (and cache) SHA256 hash of this index's content.
+    /// Each FileIndex instance must only be hashed by one thread at a time.
    std::array<std::uint8_t, 32> hash();
+
+private:
+    std::optional<std::array<std::uint8_t, 32>> cached_hash;


⚠️ Potential issue | 🟠 Major

Cached hashes need an immutability or invalidation guarantee.

relations and occurrences are still publicly mutable, but once hash() populates cached_hash it never changes again. MergedIndex::merge uses that digest as the canonical-cache key, so any later mutation can turn into a false cache hit and merge different FileIndex contents under the same canonical id. Please either make FileIndex immutable after build or reset cached_hash from every mutating path.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@src/index/tu_index.h` around lines 56 - 61, The cached_hash stored in FileIndex can become stale because mutable members (relations, occurrences) can change after hash() populates cached_hash, causing MergedIndex::merge to use a wrong canonical key; fix by either making FileIndex immutable after construction (e.g., move relations and occurrences to private read-only members and expose only const accessors or provide a finalize/build method that seals the instance and prevents further mutation) or by explicitly invalidating cached_hash (reset cached_hash = std::nullopt) at every mutating code path that modifies relations or occurrences (add centralized mutator methods or wrap direct mutations to ensure invalidation); update hash(), constructors, any setters, and MergedIndex::merge call sites accordingly to use the sealed or invalidation-safe FileIndex.

coderabbitai · 2026-04-11T07:10:17Z

+    friend bool operator==(const ContextBitmap& lhs, const ContextBitmap& rhs) {
+        if(lhs.is_inline() && rhs.is_inline())
+            return lhs.data == rhs.data;
+        // Different modes implies different sets: a heap bitmap always contains
+        // at least one id >= inline_capacity that an inline bitmap cannot hold.
+        if(lhs.is_inline() != rhs.is_inline())
+            return false;
+        return *lhs.as_heap() == *rhs.as_heap();


⚠️ Potential issue | 🟠 Major

Mixed-mode equality becomes wrong after removing a high id.

remove() never downgrades heap storage back to inline, so a sequence like add(63); remove(63); leaves an empty heap-backed bitmap. This branch will still return false against an inline-empty bitmap, which breaks equality and any serialize/deserialize round-trip that normalizes back to inline.

💡 Suggested fix

friend bool operator==(const ContextBitmap& lhs, const ContextBitmap& rhs) { if(lhs.is_inline() && rhs.is_inline()) return lhs.data == rhs.data; - // Different modes implies different sets: a heap bitmap always contains - // at least one id >= inline_capacity that an inline bitmap cannot hold. - if(lhs.is_inline() != rhs.is_inline()) - return false; + if(lhs.is_inline() != rhs.is_inline()) + return !lhs.any_not_in(rhs) && !rhs.any_not_in(lhs); return *lhs.as_heap() == *rhs.as_heap(); }

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@src/support/bitmap.h` around lines 134 - 141, operator== assumes differing storage modes imply inequality, but remove() can leave an empty heap that should compare equal to an inline-empty bitmap; update operator==(const ContextBitmap& lhs, const ContextBitmap& rhs) to handle mixed modes by normalizing comparison: if is_inline() differs, take the heap side (as_heap()) and verify it contains no ids >= inline_capacity (i.e., all bits fit in inline_capacity) and that the low-inline_capacity bits equal the inline side's data; only return false if the heap has any id >= inline_capacity or the low bits differ. Reference: operator==, ContextBitmap::is_inline, ContextBitmap::data, ContextBitmap::as_heap, inline_capacity, and the remove/add behavior.

…ective merge Introduce ContextBitmap (tagged-pointer inline bitmap for <63 ids) to replace roaring::Roaring in MergedIndex hot paths, eliminating heap allocations in the common case. Cache FileIndex SHA256 hashes to avoid recomputation. Add selective ProjectIndex::merge that skips symbols not referencing newly-merged files. MergedIndex::merge now returns bool to indicate cache hits. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- Fix use-after-partial-move and division-by-zero in benchmark - Use std::exchange in ContextBitmap move operations - Add clarifying comments on operator== mixed mode and hash() sorting - Add tests: compilation-context merge cache hit, hash determinism, any_not_in empty inputs, remove edge cases Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

coderabbitai

Actionable comments posted: 1

♻️ Duplicate comments (3)

benchmarks/index_benchmark.cpp (2)

174-188: ⚠️ Potential issue | 🟠 Major

Negative --jobs / --limit still cast straight to unsigned.

Casting a negative int to unsigned/std::size_t wraps to a huge positive value, which then drives workers.reserve(num_jobs) and entries.take_front(limit). Validate signedness before the cast and fall back to defaults (or reject the input) for negative values.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@benchmarks/index_benchmark.cpp` around lines 174 - 188, Validate the signed
values before casting: check the raw signed integers behind opts.jobs and
opts.limit (e.g., test *opts.jobs <= 0 and *opts.limit < 0) and handle negatives
instead of blindly casting to unsigned; for jobs, if the signed value is <= 0
use std::thread::hardware_concurrency() (and fallback to 4) before assigning
num_jobs, and for limit, if the signed limit is negative treat it as no limit
(or 0) before computing limit and calling entries.take_front(limit). Ensure you
reference the existing symbols num_jobs, limit, opts.jobs and opts.limit and
perform the signedness check prior to the static_cast.

361-380: ⚠️ Potential issue | 🟠 Major

Main-file merge still forces a miss instead of using the returned bool.

Lines 362 and 374-379 unconditionally new_file_ids.add(main_tu_path_id) and increment cache_misses, ignoring the cache-hit return value that the compilation-context overload now provides (the whole point of the API change in MergedIndex::merge). For repeated TU paths this skews the Phase 2 hit/miss totals and causes ProjectIndex::merge(tu, new_file_ids) to receive an overly inclusive bitmap, partly defeating the selective-merge measurement.

Proposed fix

-        // Main file is always new.
-        new_file_ids.add(main_tu_path_id);
         auto global_main_id = project_index.path_pool.path_id(graph.paths[main_tu_path_id]);
         ...
-        merged[global_main_id].merge(global_main_id,
-                                     now,
-                                     std::move(remapped_locs),
-                                     tu_index.main_file_index,
-                                     content);
-        cache_misses++;
+        bool main_hit = merged[global_main_id].merge(global_main_id,
+                                                     now,
+                                                     std::move(remapped_locs),
+                                                     tu_index.main_file_index,
+                                                     content);
+        if(main_hit) {
+            cache_hits++;
+        } else {
+            cache_misses++;
+            new_file_ids.add(main_tu_path_id);
+        }

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@benchmarks/index_benchmark.cpp` around lines 361 - 380, The main-file merge
currently unconditionally calls new_file_ids.add(main_tu_path_id) and increments
cache_misses despite MergedIndex::merge (invoked as
merged[global_main_id].merge(...)) returning a bool indicating whether the merge
was a cache hit; change the call to capture that bool (e.g., bool merged_hit =
merged[global_main_id].merge(global_main_id, now, std::move(remapped_locs),
tu_index.main_file_index, content);) and only call
new_file_ids.add(main_tu_path_id) and increment cache_misses when merged_hit is
false; leave timing aggregation (merged_index_merge_ms += mt.ms()) as-is.

src/support/bitmap.h (1)

91-142: ⚠️ Potential issue | 🟠 Major

Mixed-mode operator== is still incorrect after removing a high id.

remove() (Lines 91-98) never downgrades heap storage back to inline, so the sequence add(63); remove(63); leaves an empty heap-backed bitmap. operator== at Lines 139-140 then returns false when comparing against an inline-empty (or inline-equivalent) bitmap, even though both sets are semantically equal. This also breaks any serialize/deserialize round-trip that normalizes through from_roaring (which picks inline when r.maximum() < inline_capacity).

Two viable fixes:
Fall back to a content comparison when modes differ:
 if(lhs.is_inline() != rhs.is_inline())
-    return false;
+    return !lhs.any_not_in(rhs) && !rhs.any_not_in(lhs);
Or, downgrade heap back to inline in remove() whenever as_heap()->maximum() < inline_capacity, preserving the "heap ⇒ has an id ≥ inline_capacity" invariant the comment at Lines 137-138 currently assumes.
Option 1 is simpler; option 2 also lets is_empty() / any_not_in hit the fast path more often.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/support/bitmap.h` around lines 91 - 142, The bug is that remove() never
downgrades a heap-backed bitmap to inline, so operator== can wrongly treat
semantically-equal bitmaps as different; fix by changing remove(std::uint32_t
id) to, after calling as_heap()->remove(id), check if the heap's maximum() (or
isEmpty()) is below inline_capacity and if so rebuild the inline representation
(set data = bits from the heap), destroy the heap storage and flip the mode back
to inline (use as_heap(), is_inline(), inline_capacity and the same
bit-extraction logic used in from_roaring to populate inline bits); this
preserves the heap⇒contains-high-id invariant and makes operator== and
serialize/deserialize behave correctly.

🧹 Nitpick comments (2)

src/index/tu_index.cpp (1)
179-208: Redundant re-sort of relations inside hash().

Builder::build() already sorts each symbol's relations at lines 120-128 before this function is ever reached (and dedupes via unique). The per-symbol sort at 195-198 here is therefore almost always a no-op except on the first call after external mutation. Since hash() now caches, you can drop the inner sort and rely on the invariant established at build time — or alternatively, move the sort out of build() and make hash() the sole place responsible for canonicalization. Keeping sorts in both places hides which side is authoritative.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/index/tu_index.cpp` around lines 179 - 208, The per-symbol
std::ranges::sort inside hash() is redundant because Builder::build() already
sorts and deduplicates each relations vector; remove the inner sort call that
orders "rels" (the lambda comparing Relation::kind, range.begin, range.end,
target_symbol) so hash() relies on the canonical invariant established by
Builder::build(), keeping the outer sorted_keys construction and the data block
hashing, and ensure cached_hash remains valid for cached results; alternatively,
if you prefer hash() to own canonicalization, move the sort+unique logic from
Builder::build() into hash() and remove the build-time sort to keep a single
authoritative place.
tests/unit/index/project_index_tests.cpp (1)
203-253: Consider tightening SelectiveMergeFilters assertions.

The current check proj_sym.reference_files.cardinality() >= symbol.reference_files.cardinality() would pass even if the selective merge incorrectly inserted extra unrelated files. A stricter assertion would verify equality (or a subset relation) of the reference-file set to actually prove the selective path preserves — and doesn't corrupt — per-symbol reference data. Non-blocking.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unit/index/project_index_tests.cpp` around lines 203 - 253, The test
SelectiveMergeFilters currently uses
ASSERT_TRUE(proj_sym.reference_files.cardinality() >=
symbol.reference_files.cardinality()), which allows extra unrelated files to
slip in; change the assertion to ensure exact preservation of reference-file
sets (e.g., assert equality between proj_sym.reference_files and
symbol.reference_files or that proj_sym.reference_files is equal to the expected
subset) after calling project.merge(tu, new_file_ids) so that comparisons on
tu.symbols -> symbol.reference_files vs project.symbols ->
proj_sym.reference_files confirm no extra paths were added.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@benchmarks/index_benchmark.cpp`:
- Around line 478-508: The division-by-zero occurs when computing queries /
elapsed in the Relation lookup printing (using variables merged, queries,
elapsed, sample_hashes, rounds, and Timer t); change the q/ms expression to
guard the divide (e.g., use elapsed > 0 ? queries / elapsed : 0.0) so it prints
0.0 when elapsed is zero; apply the same guarded expression to the other
benchmarking print that uses queries / elapsed (the similar block later that
also computes queries and elapsed).

---

Duplicate comments:
In `@benchmarks/index_benchmark.cpp`:
- Around line 174-188: Validate the signed values before casting: check the raw
signed integers behind opts.jobs and opts.limit (e.g., test *opts.jobs <= 0 and
*opts.limit < 0) and handle negatives instead of blindly casting to unsigned;
for jobs, if the signed value is <= 0 use std::thread::hardware_concurrency()
(and fallback to 4) before assigning num_jobs, and for limit, if the signed
limit is negative treat it as no limit (or 0) before computing limit and calling
entries.take_front(limit). Ensure you reference the existing symbols num_jobs,
limit, opts.jobs and opts.limit and perform the signedness check prior to the
static_cast.
- Around line 361-380: The main-file merge currently unconditionally calls
new_file_ids.add(main_tu_path_id) and increments cache_misses despite
MergedIndex::merge (invoked as merged[global_main_id].merge(...)) returning a
bool indicating whether the merge was a cache hit; change the call to capture
that bool (e.g., bool merged_hit = merged[global_main_id].merge(global_main_id,
now, std::move(remapped_locs), tu_index.main_file_index, content);) and only
call new_file_ids.add(main_tu_path_id) and increment cache_misses when
merged_hit is false; leave timing aggregation (merged_index_merge_ms += mt.ms())
as-is.

In `@src/support/bitmap.h`:
- Around line 91-142: The bug is that remove() never downgrades a heap-backed
bitmap to inline, so operator== can wrongly treat semantically-equal bitmaps as
different; fix by changing remove(std::uint32_t id) to, after calling
as_heap()->remove(id), check if the heap's maximum() (or isEmpty()) is below
inline_capacity and if so rebuild the inline representation (set data = bits
from the heap), destroy the heap storage and flip the mode back to inline (use
as_heap(), is_inline(), inline_capacity and the same bit-extraction logic used
in from_roaring to populate inline bits); this preserves the
heap⇒contains-high-id invariant and makes operator== and serialize/deserialize
behave correctly.

---

Nitpick comments:
In `@src/index/tu_index.cpp`:
- Around line 179-208: The per-symbol std::ranges::sort inside hash() is
redundant because Builder::build() already sorts and deduplicates each relations
vector; remove the inner sort call that orders "rels" (the lambda comparing
Relation::kind, range.begin, range.end, target_symbol) so hash() relies on the
canonical invariant established by Builder::build(), keeping the outer
sorted_keys construction and the data block hashing, and ensure cached_hash
remains valid for cached results; alternatively, if you prefer hash() to own
canonicalization, move the sort+unique logic from Builder::build() into hash()
and remove the build-time sort to keep a single authoritative place.

In `@tests/unit/index/project_index_tests.cpp`:
- Around line 203-253: The test SelectiveMergeFilters currently uses
ASSERT_TRUE(proj_sym.reference_files.cardinality() >=
symbol.reference_files.cardinality()), which allows extra unrelated files to
slip in; change the assertion to ensure exact preservation of reference-file
sets (e.g., assert equality between proj_sym.reference_files and
symbol.reference_files or that proj_sym.reference_files is equal to the expected
subset) after calling project.merge(tu, new_file_ids) so that comparisons on
tu.symbols -> symbol.reference_files vs project.symbols ->
proj_sym.reference_files confirm no extra paths were added.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 56b57c5c-2fd5-40ca-9fc0-e87490573874

📥 Commits

Reviewing files that changed from the base of the PR and between 1d1122d and e1dde9e.

📒 Files selected for processing (13)

CMakeLists.txt
benchmarks/index_benchmark.cpp
src/index/merged_index.cpp
src/index/merged_index.h
src/index/project_index.cpp
src/index/project_index.h
src/index/tu_index.cpp
src/index/tu_index.h
src/support/bitmap.h
tests/unit/index/merged_index_tests.cpp
tests/unit/index/project_index_tests.cpp
tests/unit/index/tu_index_tests.cpp
tests/unit/support/context_bitmap_tests.cpp

✅ Files skipped from review due to trivial changes (1)

tests/unit/index/tu_index_tests.cpp

🚧 Files skipped from review as they are similar to previous changes (6)

src/index/project_index.h
CMakeLists.txt
src/index/tu_index.h
tests/unit/index/merged_index_tests.cpp
src/index/project_index.cpp
src/index/merged_index.cpp

coderabbitai · 2026-04-18T06:58:02Z

+        auto elapsed = t.ms();
+        auto queries = merged.size() * 20 * rounds;
+        std::println("  Occurrence lookup (raw):  {} queries in {:.1f}ms ({:.0f} q/ms, {} hits)",
+                     queries,
+                     elapsed,
+                     queries / elapsed,
+                     hit_count);
+    }
+
+    // Relation lookup by symbol hash (raw, no position conversion).
+    {
+        std::size_t hit_count = 0;
+        Timer t;
+        constexpr int rounds = 10;
+        for(int r = 0; r < rounds; r++) {
+            for(auto& [path, shard]: merged) {
+                for(auto hash: sample_hashes) {
+                    shard.lookup(hash, RelationKind::Definition, [&](const index::Relation&) {
+                        hit_count++;
+                        return true;
+                    });
+                }
+            }
+        }
+        auto elapsed = t.ms();
+        auto queries = merged.size() * sample_hashes.size() * rounds;
+        std::println("  Relation lookup (raw):    {} queries in {:.1f}ms ({:.0f} q/ms, {} hits)",
+                     queries,
+                     elapsed,
+                     queries / elapsed,
+                     hit_count);


⚠️ Potential issue | 🟡 Minor

Guard against division-by-zero in queries / elapsed.

If merged is empty (e.g., --limit 0 or every compile failing) both queries and elapsed can be zero, yielding NaN in the printed q/ms values. A simple guard (elapsed > 0 ? queries/elapsed : 0.0) keeps the output sane. Same pattern at Line 586.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@benchmarks/index_benchmark.cpp` around lines 478 - 508, The division-by-zero occurs when computing queries / elapsed in the Relation lookup printing (using variables merged, queries, elapsed, sample_hashes, rounds, and Timer t); change the q/ms expression to guard the divide (e.g., use elapsed > 0 ? queries / elapsed : 0.0) so it prints 0.0 when elapsed is zero; apply the same guarded expression to the other benchmarking print that uses queries / elapsed (the similar block later that also computes queries and elapsed).

coderabbitai Bot reviewed Apr 11, 2026

View reviewed changes

16bit-ykiko and others added 2 commits April 18, 2026 14:49

16bit-ykiko force-pushed the feat/symbol-lookup branch from 1d1122d to e1dde9e Compare April 18, 2026 06:53

coderabbitai Bot reviewed Apr 18, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(index): optimize merge with ContextBitmap and selective merge#422

feat(index): optimize merge with ContextBitmap and selective merge#422
16bit-ykiko wants to merge 2 commits into
mainfrom
feat/symbol-lookup

16bit-ykiko commented Apr 11, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented Apr 11, 2026 •

edited

Loading

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related PRs

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot Apr 11, 2026

Uh oh!

coderabbitai Bot Apr 11, 2026

Uh oh!

coderabbitai Bot Apr 11, 2026

Uh oh!

coderabbitai Bot Apr 11, 2026

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot Apr 18, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

16bit-ykiko commented Apr 11, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Test plan

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Apr 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related PRs

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Apr 11, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Apr 11, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Apr 11, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Apr 11, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Apr 18, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

16bit-ykiko commented Apr 11, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Apr 11, 2026 •

edited

Loading