Skip to content

feat(index): optimize merge with ContextBitmap and selective merge#422

Open
16bit-ykiko wants to merge 2 commits into
mainfrom
feat/symbol-lookup
Open

feat(index): optimize merge with ContextBitmap and selective merge#422
16bit-ykiko wants to merge 2 commits into
mainfrom
feat/symbol-lookup

Conversation

@16bit-ykiko

@16bit-ykiko 16bit-ykiko commented Apr 11, 2026

Copy link
Copy Markdown
Member

Summary

  • Introduce ContextBitmap (tagged-pointer inline bitmap for <63 ids) to replace roaring::Roaring in MergedIndex hot paths, eliminating heap allocations in the common case
  • Cache FileIndex SHA256 hashes with deterministic ordering to avoid recomputation and ensure cross-instance consistency
  • Add selective ProjectIndex::merge that skips symbols not referencing newly-merged files
  • MergedIndex::merge now returns bool to indicate cache hits, enabling the selective optimization
  • Add index_benchmark tool for end-to-end merge/query performance profiling

Test plan

  • Unit tests: 535 passed (new tests for merge cache hit, hash determinism, selective merge, ContextBitmap edge cases)
  • Integration tests: 119 passed
  • Smoke tests: 2/2 passed
  • Pre-PR triple review (correctness, style, test coverage) — all findings addressed

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Added a new indexing benchmark tool with detailed phase timings, serialization metrics, and query micro-benchmarks.
  • Performance

    • Merge path now avoids redundant work via caching, enabling faster incremental updates.
    • Deterministic, cached hashing reduces repeated computation.
    • Selective merging updates only affected symbols for more efficient incremental indexing.
    • More compact/efficient bitmap storage for index contexts.
  • Tests

    • Added unit tests covering merge cache behavior, selective merging, hashing, and bitmap correctness.

@coderabbitai

coderabbitai Bot commented Apr 11, 2026

Copy link
Copy Markdown
📝 Walkthrough

Walkthrough

Adds a new index benchmarking executable; introduces a compact ContextBitmap type and replaces roaring usage in MergedIndex; makes MergedIndex::merge return a cache-hit bool; adds selective ProjectIndex::merge overload; caches deterministic FileIndex hashes; and includes unit tests for these behaviors.

Changes

Cohort / File(s) Summary
Benchmark Binary
CMakeLists.txt, benchmarks/index_benchmark.cpp
New index_benchmark executable measuring parallel per-file compilation/TUIndex construction, single-threaded merging/serialization, micro lookups, and high-level "find references" queries with CLI knobs and percentile reporting.
Bitmap & MergedIndex
src/support/bitmap.h, src/index/merged_index.h, src/index/merged_index.cpp
Added clice::ContextBitmap (inline/heap hybrid), replaced roaring fields with ContextBitmap, adjusted serialization, lookup filtering, and changed MergedIndex::merge overloads to return bool indicating cache hits.
ProjectIndex Selective Merge
src/index/project_index.h, src/index/project_index.cpp
Added ProjectIndex::merge(TUIndex&, const Bitmap&) overload that only merges symbols referenced by file IDs in the provided bitmap; returns mapped file IDs.
FileIndex Hashing
src/index/tu_index.h, src/index/tu_index.cpp
Added cached SHA256 hash storage and deterministic hashing: short-circuit cache, sort relation keys/entries before hashing, store cached_hash.
Unit Tests Added
tests/unit/support/context_bitmap_tests.cpp, tests/unit/index/merged_index_tests.cpp, tests/unit/index/project_index_tests.cpp, tests/unit/index/tu_index_tests.cpp
New tests covering ContextBitmap inline/heap behavior and semantics, MergedIndex merge cache-hit behavior, ProjectIndex selective-merge semantics, and FileIndex hash caching/determinism.

Sequence Diagram(s)

sequenceDiagram
    actor User
    participant Compiler as Compiler (parallel)
    participant Merger as Merger / ProjectIndex
    participant Merged as MergedIndex (shards)
    participant Query as Query Engine

    User->>Compiler: Phase 1: compile files in parallel
    Compiler->>Compiler: Build TUIndex/FileIndex per file\ncompute deterministic hash
    Compiler-->>Merger: emit per-file indices & file IDs

    Merger->>Merger: Phase 2: selective merge\n(ProjectIndex::merge with Bitmap)
    Merger->>Merged: update shards (MergedIndex::merge)\nreturn cache-hit flags
    Merger->>Merged: serialize shards (measure size/time)

    User->>Query: Phase 3: micro-benchmarks (lookups)
    Query->>Merged: raw occurrence/relation lookups\n(position mapping)
    Query->>Query: Phase 4: find-references scan\nacross project symbols & shards
    Query-->>User: report timings, sizes, coverage
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Possibly related PRs

Poem

🐰✨ A Rabbit’s Benchmark Verse
I hopped through files both small and grand,
Cached hashes held tight in my pawed hand.
Bits tucked inline, then leapt to heap,
Merges that whisper, “cache hit—sleep.”
Benchmarks drum; indices hum—what a land! 🥕

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 7.69% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly summarizes the two main changes: introducing ContextBitmap for optimization and adding selective merge capability, which are the core features of this PR.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/symbol-lookup

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/index/merged_index.cpp (1)

606-610: ⚠️ Potential issue | 🟠 Major

Refresh header content on cache hits too.

Unlike the compilation overload, this only writes impl->content once. FileIndex::hash() is content-agnostic, so comment/whitespace-only header edits can still hit the cache and leave content() stale, which will skew later position mapping.

Proposed fix
     self.load_in_memory();
-    if(self.impl->content.empty() && !content.empty()) {
+    if(!content.empty()) {
         self.impl->content = content.str();
     }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/index/merged_index.cpp` around lines 606 - 610, The cache merge only sets
impl->content when the entry is initially empty, leaving impl->content stale on
cache hits; update the logic so that when new content is provided (non-empty)
you refresh impl->content even on hits. Locate the call sequence around
load_in_memory(), the merge(...) invocation and the Impl type members
(impl->content) and ensure either the merge callback or the post-merge path
assigns impl->content = content.str() when content is non-empty (so
FileIndex::hash() cache hits also update content() for correct position
mapping).
🧹 Nitpick comments (1)
tests/unit/index/project_index_tests.cpp (1)

203-238: This fixture doesn't prove the filter excludes untouched symbols.

Every symbol in this setup can still reference main.cpp, so project.merge(tu, new_file_ids) may legally copy the whole table and the test would still pass even if the selective filter were bypassed. Add a header-only symbol that is never referenced from main.cpp and assert it stays absent.

🧪 Example tightening
 add_file("header.h", R"(
         `#pragma` once
         inline int shared() { return 1; }
+        inline int header_only() { return 2; }
     )");
@@
     ASSERT_TRUE(has_main_symbols);
+
+    for(auto& [hash, symbol]: tu.symbols) {
+        if(symbol.name == "header_only") {
+            ASSERT_FALSE(project.symbols.contains(hash));
+        }
+    }
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unit/index/project_index_tests.cpp` around lines 203 - 238, The test
SelectiveMergeFilters currently can pass even if merge ignores the new_file_ids
filter because every symbol is referenced by main.cpp; to fix, add a header-only
symbol in add_file (e.g., a function or variable named something like
unreferenced_header_symbol) that is not used in add_main, keep the rest of the
setup, call project.merge(tu, new_file_ids), then assert that the header-only
symbol is not present in project.symbols (use the symbol's hash/key from
tu.symbols or look it up by name in tu.symbols) to prove untouched symbols are
excluded; reference the existing identifiers tu.symbols, project.symbols,
project.merge, and new_file_ids to locate where to add the new header and
assertion.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@benchmarks/index_benchmark.cpp`:
- Around line 175-179: The code casts CLI values to unsigned types before
validating, so negative inputs like opts.jobs or opts.limit become huge
positives; update the handling for num_jobs and the limit by first reading them
into a signed integer (e.g., int or long), check for <= 0 and handle negatives
by falling back to hardware_concurrency() or default values, then clamp to a
reasonable max before converting to unsigned/size_t for reserve(num_jobs) and
the worker loop; ensure the same validation logic is applied to both the
num_jobs (from opts.jobs) and the limit (from opts.limit) code paths.
- Around line 362-380: The code currently always treats the main TU as a new
file and forces a cache miss by unconditionally calling
new_file_ids.add(main_tu_path_id) and incrementing cache_misses after calling
merged[global_main_id].merge(...); change this to use the actual result from
merged[global_main_id].merge(global_main_id, now, std::move(remapped_locs),
tu_index.main_file_index, content): capture its hit/miss return value and only
add main_tu_path_id to new_file_ids or increment cache_misses when the merge
reports a miss, otherwise treat it as a hit and avoid mutating
new_file_ids/cache_misses so repeated TU paths reflect correct Phase 2
statistics.

In `@src/index/tu_index.h`:
- Around line 56-61: The cached_hash stored in FileIndex can become stale
because mutable members (relations, occurrences) can change after hash()
populates cached_hash, causing MergedIndex::merge to use a wrong canonical key;
fix by either making FileIndex immutable after construction (e.g., move
relations and occurrences to private read-only members and expose only const
accessors or provide a finalize/build method that seals the instance and
prevents further mutation) or by explicitly invalidating cached_hash (reset
cached_hash = std::nullopt) at every mutating code path that modifies relations
or occurrences (add centralized mutator methods or wrap direct mutations to
ensure invalidation); update hash(), constructors, any setters, and
MergedIndex::merge call sites accordingly to use the sealed or invalidation-safe
FileIndex.

In `@src/support/bitmap.h`:
- Around line 134-141: operator== assumes differing storage modes imply
inequality, but remove() can leave an empty heap that should compare equal to an
inline-empty bitmap; update operator==(const ContextBitmap& lhs, const
ContextBitmap& rhs) to handle mixed modes by normalizing comparison: if
is_inline() differs, take the heap side (as_heap()) and verify it contains no
ids >= inline_capacity (i.e., all bits fit in inline_capacity) and that the
low-inline_capacity bits equal the inline side's data; only return false if the
heap has any id >= inline_capacity or the low bits differ. Reference:
operator==, ContextBitmap::is_inline, ContextBitmap::data,
ContextBitmap::as_heap, inline_capacity, and the remove/add behavior.

---

Outside diff comments:
In `@src/index/merged_index.cpp`:
- Around line 606-610: The cache merge only sets impl->content when the entry is
initially empty, leaving impl->content stale on cache hits; update the logic so
that when new content is provided (non-empty) you refresh impl->content even on
hits. Locate the call sequence around load_in_memory(), the merge(...)
invocation and the Impl type members (impl->content) and ensure either the merge
callback or the post-merge path assigns impl->content = content.str() when
content is non-empty (so FileIndex::hash() cache hits also update content() for
correct position mapping).

---

Nitpick comments:
In `@tests/unit/index/project_index_tests.cpp`:
- Around line 203-238: The test SelectiveMergeFilters currently can pass even if
merge ignores the new_file_ids filter because every symbol is referenced by
main.cpp; to fix, add a header-only symbol in add_file (e.g., a function or
variable named something like unreferenced_header_symbol) that is not used in
add_main, keep the rest of the setup, call project.merge(tu, new_file_ids), then
assert that the header-only symbol is not present in project.symbols (use the
symbol's hash/key from tu.symbols or look it up by name in tu.symbols) to prove
untouched symbols are excluded; reference the existing identifiers tu.symbols,
project.symbols, project.merge, and new_file_ids to locate where to add the new
header and assertion.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: eb9b2d7e-023a-4c77-bd4c-a48d92191f47

📥 Commits

Reviewing files that changed from the base of the PR and between 8bafaa8 and 1d1122d.

📒 Files selected for processing (13)
  • CMakeLists.txt
  • benchmarks/index_benchmark.cpp
  • src/index/merged_index.cpp
  • src/index/merged_index.h
  • src/index/project_index.cpp
  • src/index/project_index.h
  • src/index/tu_index.cpp
  • src/index/tu_index.h
  • src/support/bitmap.h
  • tests/unit/index/merged_index_tests.cpp
  • tests/unit/index/project_index_tests.cpp
  • tests/unit/index/tu_index_tests.cpp
  • tests/unit/support/context_bitmap_tests.cpp

Comment on lines +175 to +179
auto num_jobs = static_cast<unsigned>(*opts.jobs);
if(num_jobs == 0)
num_jobs = std::thread::hardware_concurrency();
if(num_jobs == 0)
num_jobs = 4;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Reject negative --jobs and --limit before casting.

Lines 175-179 and 187-189 convert signed CLI values to unsigned/size_t first. -1 turns into a huge positive number here, which can explode reserve(num_jobs) and the worker loop instead of falling back to defaults.

Proposed fix
-    auto num_jobs = static_cast<unsigned>(*opts.jobs);
+    if(*opts.jobs < 0 || *opts.limit < 0) {
+        std::println(stderr, "--jobs and --limit must be >= 0");
+        return 1;
+    }
+
+    auto num_jobs = static_cast<unsigned>(*opts.jobs);
     if(num_jobs == 0)
         num_jobs = std::thread::hardware_concurrency();
     if(num_jobs == 0)
         num_jobs = 4;
@@
-    auto limit = static_cast<std::size_t>(*opts.limit);
+    auto limit = static_cast<std::size_t>(*opts.limit);

Also applies to: 187-189

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@benchmarks/index_benchmark.cpp` around lines 175 - 179, The code casts CLI
values to unsigned types before validating, so negative inputs like opts.jobs or
opts.limit become huge positives; update the handling for num_jobs and the limit
by first reading them into a signed integer (e.g., int or long), check for <= 0
and handle negatives by falling back to hardware_concurrency() or default
values, then clamp to a reasonable max before converting to unsigned/size_t for
reserve(num_jobs) and the worker loop; ensure the same validation logic is
applied to both the num_jobs (from opts.jobs) and the limit (from opts.limit)
code paths.

Comment on lines +362 to +380
// Main file is always new.
new_file_ids.add(main_tu_path_id);
auto global_main_id = project_index.path_pool.path_id(graph.paths[main_tu_path_id]);
auto content = read_content(global_main_id, graph.paths[main_tu_path_id]);
auto now = std::chrono::duration_cast<std::chrono::milliseconds>(
std::chrono::system_clock::now().time_since_epoch());

std::vector<index::IncludeLocation> remapped_locs;
for(auto& loc: graph.locations) {
auto remapped = loc;
remapped.path_id = project_index.path_pool.path_id(graph.paths[loc.path_id]);
remapped_locs.push_back(remapped);
}
merged[global_main_id].merge(global_main_id,
now,
std::move(remapped_locs),
tu_index.main_file_index,
content);
cache_misses++;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Use the main-file merge() result instead of forcing a miss.

Lines 362-380 always add main_tu_path_id to new_file_ids and always increment cache_misses, even though the compilation-context overload now returns the same cache-hit signal as the header overload. That makes the Phase 2 hit/miss numbers and the selective ProjectIndex::merge() input wrong for repeated TU paths.

Proposed fix
-        // Main file is always new.
-        new_file_ids.add(main_tu_path_id);
         auto global_main_id = project_index.path_pool.path_id(graph.paths[main_tu_path_id]);
         auto content = read_content(global_main_id, graph.paths[main_tu_path_id]);
         auto now = std::chrono::duration_cast<std::chrono::milliseconds>(
             std::chrono::system_clock::now().time_since_epoch());
@@
-        merged[global_main_id].merge(global_main_id,
-                                     now,
-                                     std::move(remapped_locs),
-                                     tu_index.main_file_index,
-                                     content);
-        cache_misses++;
+        bool main_hit = merged[global_main_id].merge(global_main_id,
+                                                     now,
+                                                     std::move(remapped_locs),
+                                                     tu_index.main_file_index,
+                                                     content);
+        if(main_hit) {
+            cache_hits++;
+        } else {
+            cache_misses++;
+            new_file_ids.add(main_tu_path_id);
+        }
         merged_index_merge_ms += mt.ms();
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@benchmarks/index_benchmark.cpp` around lines 362 - 380, The code currently
always treats the main TU as a new file and forces a cache miss by
unconditionally calling new_file_ids.add(main_tu_path_id) and incrementing
cache_misses after calling merged[global_main_id].merge(...); change this to use
the actual result from merged[global_main_id].merge(global_main_id, now,
std::move(remapped_locs), tu_index.main_file_index, content): capture its
hit/miss return value and only add main_tu_path_id to new_file_ids or increment
cache_misses when the merge reports a miss, otherwise treat it as a hit and
avoid mutating new_file_ids/cache_misses so repeated TU paths reflect correct
Phase 2 statistics.

Comment thread src/index/tu_index.h
Comment on lines +56 to +61
/// Compute (and cache) SHA256 hash of this index's content.
/// Each FileIndex instance must only be hashed by one thread at a time.
std::array<std::uint8_t, 32> hash();

private:
std::optional<std::array<std::uint8_t, 32>> cached_hash;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Cached hashes need an immutability or invalidation guarantee.

relations and occurrences are still publicly mutable, but once hash() populates cached_hash it never changes again. MergedIndex::merge uses that digest as the canonical-cache key, so any later mutation can turn into a false cache hit and merge different FileIndex contents under the same canonical id. Please either make FileIndex immutable after build or reset cached_hash from every mutating path.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/index/tu_index.h` around lines 56 - 61, The cached_hash stored in
FileIndex can become stale because mutable members (relations, occurrences) can
change after hash() populates cached_hash, causing MergedIndex::merge to use a
wrong canonical key; fix by either making FileIndex immutable after construction
(e.g., move relations and occurrences to private read-only members and expose
only const accessors or provide a finalize/build method that seals the instance
and prevents further mutation) or by explicitly invalidating cached_hash (reset
cached_hash = std::nullopt) at every mutating code path that modifies relations
or occurrences (add centralized mutator methods or wrap direct mutations to
ensure invalidation); update hash(), constructors, any setters, and
MergedIndex::merge call sites accordingly to use the sealed or invalidation-safe
FileIndex.

Comment thread src/support/bitmap.h
Comment on lines +134 to +141
friend bool operator==(const ContextBitmap& lhs, const ContextBitmap& rhs) {
if(lhs.is_inline() && rhs.is_inline())
return lhs.data == rhs.data;
// Different modes implies different sets: a heap bitmap always contains
// at least one id >= inline_capacity that an inline bitmap cannot hold.
if(lhs.is_inline() != rhs.is_inline())
return false;
return *lhs.as_heap() == *rhs.as_heap();

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Mixed-mode equality becomes wrong after removing a high id.

remove() never downgrades heap storage back to inline, so a sequence like add(63); remove(63); leaves an empty heap-backed bitmap. This branch will still return false against an inline-empty bitmap, which breaks equality and any serialize/deserialize round-trip that normalizes back to inline.

💡 Suggested fix
 friend bool operator==(const ContextBitmap& lhs, const ContextBitmap& rhs) {
     if(lhs.is_inline() && rhs.is_inline())
         return lhs.data == rhs.data;
-    // Different modes implies different sets: a heap bitmap always contains
-    // at least one id >= inline_capacity that an inline bitmap cannot hold.
-    if(lhs.is_inline() != rhs.is_inline())
-        return false;
+    if(lhs.is_inline() != rhs.is_inline())
+        return !lhs.any_not_in(rhs) && !rhs.any_not_in(lhs);
     return *lhs.as_heap() == *rhs.as_heap();
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/support/bitmap.h` around lines 134 - 141, operator== assumes differing
storage modes imply inequality, but remove() can leave an empty heap that should
compare equal to an inline-empty bitmap; update operator==(const ContextBitmap&
lhs, const ContextBitmap& rhs) to handle mixed modes by normalizing comparison:
if is_inline() differs, take the heap side (as_heap()) and verify it contains no
ids >= inline_capacity (i.e., all bits fit in inline_capacity) and that the
low-inline_capacity bits equal the inline side's data; only return false if the
heap has any id >= inline_capacity or the low bits differ. Reference:
operator==, ContextBitmap::is_inline, ContextBitmap::data,
ContextBitmap::as_heap, inline_capacity, and the remove/add behavior.

16bit-ykiko and others added 2 commits April 18, 2026 14:49
…ective merge

Introduce ContextBitmap (tagged-pointer inline bitmap for <63 ids) to
replace roaring::Roaring in MergedIndex hot paths, eliminating heap
allocations in the common case. Cache FileIndex SHA256 hashes to avoid
recomputation. Add selective ProjectIndex::merge that skips symbols not
referencing newly-merged files. MergedIndex::merge now returns bool to
indicate cache hits.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Fix use-after-partial-move and division-by-zero in benchmark
- Use std::exchange in ContextBitmap move operations
- Add clarifying comments on operator== mixed mode and hash() sorting
- Add tests: compilation-context merge cache hit, hash determinism,
  any_not_in empty inputs, remove edge cases

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (3)
benchmarks/index_benchmark.cpp (2)

174-188: ⚠️ Potential issue | 🟠 Major

Negative --jobs / --limit still cast straight to unsigned.

Casting a negative int to unsigned/std::size_t wraps to a huge positive value, which then drives workers.reserve(num_jobs) and entries.take_front(limit). Validate signedness before the cast and fall back to defaults (or reject the input) for negative values.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@benchmarks/index_benchmark.cpp` around lines 174 - 188, Validate the signed
values before casting: check the raw signed integers behind opts.jobs and
opts.limit (e.g., test *opts.jobs <= 0 and *opts.limit < 0) and handle negatives
instead of blindly casting to unsigned; for jobs, if the signed value is <= 0
use std::thread::hardware_concurrency() (and fallback to 4) before assigning
num_jobs, and for limit, if the signed limit is negative treat it as no limit
(or 0) before computing limit and calling entries.take_front(limit). Ensure you
reference the existing symbols num_jobs, limit, opts.jobs and opts.limit and
perform the signedness check prior to the static_cast.

361-380: ⚠️ Potential issue | 🟠 Major

Main-file merge still forces a miss instead of using the returned bool.

Lines 362 and 374-379 unconditionally new_file_ids.add(main_tu_path_id) and increment cache_misses, ignoring the cache-hit return value that the compilation-context overload now provides (the whole point of the API change in MergedIndex::merge). For repeated TU paths this skews the Phase 2 hit/miss totals and causes ProjectIndex::merge(tu, new_file_ids) to receive an overly inclusive bitmap, partly defeating the selective-merge measurement.

Proposed fix
-        // Main file is always new.
-        new_file_ids.add(main_tu_path_id);
         auto global_main_id = project_index.path_pool.path_id(graph.paths[main_tu_path_id]);
         ...
-        merged[global_main_id].merge(global_main_id,
-                                     now,
-                                     std::move(remapped_locs),
-                                     tu_index.main_file_index,
-                                     content);
-        cache_misses++;
+        bool main_hit = merged[global_main_id].merge(global_main_id,
+                                                     now,
+                                                     std::move(remapped_locs),
+                                                     tu_index.main_file_index,
+                                                     content);
+        if(main_hit) {
+            cache_hits++;
+        } else {
+            cache_misses++;
+            new_file_ids.add(main_tu_path_id);
+        }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@benchmarks/index_benchmark.cpp` around lines 361 - 380, The main-file merge
currently unconditionally calls new_file_ids.add(main_tu_path_id) and increments
cache_misses despite MergedIndex::merge (invoked as
merged[global_main_id].merge(...)) returning a bool indicating whether the merge
was a cache hit; change the call to capture that bool (e.g., bool merged_hit =
merged[global_main_id].merge(global_main_id, now, std::move(remapped_locs),
tu_index.main_file_index, content);) and only call
new_file_ids.add(main_tu_path_id) and increment cache_misses when merged_hit is
false; leave timing aggregation (merged_index_merge_ms += mt.ms()) as-is.
src/support/bitmap.h (1)

91-142: ⚠️ Potential issue | 🟠 Major

Mixed-mode operator== is still incorrect after removing a high id.

remove() (Lines 91-98) never downgrades heap storage back to inline, so the sequence add(63); remove(63); leaves an empty heap-backed bitmap. operator== at Lines 139-140 then returns false when comparing against an inline-empty (or inline-equivalent) bitmap, even though both sets are semantically equal. This also breaks any serialize/deserialize round-trip that normalizes through from_roaring (which picks inline when r.maximum() < inline_capacity).

Two viable fixes:

  1. Fall back to a content comparison when modes differ:
     if(lhs.is_inline() != rhs.is_inline())
    -    return false;
    +    return !lhs.any_not_in(rhs) && !rhs.any_not_in(lhs);
  2. Or, downgrade heap back to inline in remove() whenever as_heap()->maximum() < inline_capacity, preserving the "heap ⇒ has an id ≥ inline_capacity" invariant the comment at Lines 137-138 currently assumes.

Option 1 is simpler; option 2 also lets is_empty() / any_not_in hit the fast path more often.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/support/bitmap.h` around lines 91 - 142, The bug is that remove() never
downgrades a heap-backed bitmap to inline, so operator== can wrongly treat
semantically-equal bitmaps as different; fix by changing remove(std::uint32_t
id) to, after calling as_heap()->remove(id), check if the heap's maximum() (or
isEmpty()) is below inline_capacity and if so rebuild the inline representation
(set data = bits from the heap), destroy the heap storage and flip the mode back
to inline (use as_heap(), is_inline(), inline_capacity and the same
bit-extraction logic used in from_roaring to populate inline bits); this
preserves the heap⇒contains-high-id invariant and makes operator== and
serialize/deserialize behave correctly.
🧹 Nitpick comments (2)
src/index/tu_index.cpp (1)

179-208: Redundant re-sort of relations inside hash().

Builder::build() already sorts each symbol's relations at lines 120-128 before this function is ever reached (and dedupes via unique). The per-symbol sort at 195-198 here is therefore almost always a no-op except on the first call after external mutation. Since hash() now caches, you can drop the inner sort and rely on the invariant established at build time — or alternatively, move the sort out of build() and make hash() the sole place responsible for canonicalization. Keeping sorts in both places hides which side is authoritative.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/index/tu_index.cpp` around lines 179 - 208, The per-symbol
std::ranges::sort inside hash() is redundant because Builder::build() already
sorts and deduplicates each relations vector; remove the inner sort call that
orders "rels" (the lambda comparing Relation::kind, range.begin, range.end,
target_symbol) so hash() relies on the canonical invariant established by
Builder::build(), keeping the outer sorted_keys construction and the data block
hashing, and ensure cached_hash remains valid for cached results; alternatively,
if you prefer hash() to own canonicalization, move the sort+unique logic from
Builder::build() into hash() and remove the build-time sort to keep a single
authoritative place.
tests/unit/index/project_index_tests.cpp (1)

203-253: Consider tightening SelectiveMergeFilters assertions.

The current check proj_sym.reference_files.cardinality() >= symbol.reference_files.cardinality() would pass even if the selective merge incorrectly inserted extra unrelated files. A stricter assertion would verify equality (or a subset relation) of the reference-file set to actually prove the selective path preserves — and doesn't corrupt — per-symbol reference data. Non-blocking.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unit/index/project_index_tests.cpp` around lines 203 - 253, The test
SelectiveMergeFilters currently uses
ASSERT_TRUE(proj_sym.reference_files.cardinality() >=
symbol.reference_files.cardinality()), which allows extra unrelated files to
slip in; change the assertion to ensure exact preservation of reference-file
sets (e.g., assert equality between proj_sym.reference_files and
symbol.reference_files or that proj_sym.reference_files is equal to the expected
subset) after calling project.merge(tu, new_file_ids) so that comparisons on
tu.symbols -> symbol.reference_files vs project.symbols ->
proj_sym.reference_files confirm no extra paths were added.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@benchmarks/index_benchmark.cpp`:
- Around line 478-508: The division-by-zero occurs when computing queries /
elapsed in the Relation lookup printing (using variables merged, queries,
elapsed, sample_hashes, rounds, and Timer t); change the q/ms expression to
guard the divide (e.g., use elapsed > 0 ? queries / elapsed : 0.0) so it prints
0.0 when elapsed is zero; apply the same guarded expression to the other
benchmarking print that uses queries / elapsed (the similar block later that
also computes queries and elapsed).

---

Duplicate comments:
In `@benchmarks/index_benchmark.cpp`:
- Around line 174-188: Validate the signed values before casting: check the raw
signed integers behind opts.jobs and opts.limit (e.g., test *opts.jobs <= 0 and
*opts.limit < 0) and handle negatives instead of blindly casting to unsigned;
for jobs, if the signed value is <= 0 use std::thread::hardware_concurrency()
(and fallback to 4) before assigning num_jobs, and for limit, if the signed
limit is negative treat it as no limit (or 0) before computing limit and calling
entries.take_front(limit). Ensure you reference the existing symbols num_jobs,
limit, opts.jobs and opts.limit and perform the signedness check prior to the
static_cast.
- Around line 361-380: The main-file merge currently unconditionally calls
new_file_ids.add(main_tu_path_id) and increments cache_misses despite
MergedIndex::merge (invoked as merged[global_main_id].merge(...)) returning a
bool indicating whether the merge was a cache hit; change the call to capture
that bool (e.g., bool merged_hit = merged[global_main_id].merge(global_main_id,
now, std::move(remapped_locs), tu_index.main_file_index, content);) and only
call new_file_ids.add(main_tu_path_id) and increment cache_misses when
merged_hit is false; leave timing aggregation (merged_index_merge_ms += mt.ms())
as-is.

In `@src/support/bitmap.h`:
- Around line 91-142: The bug is that remove() never downgrades a heap-backed
bitmap to inline, so operator== can wrongly treat semantically-equal bitmaps as
different; fix by changing remove(std::uint32_t id) to, after calling
as_heap()->remove(id), check if the heap's maximum() (or isEmpty()) is below
inline_capacity and if so rebuild the inline representation (set data = bits
from the heap), destroy the heap storage and flip the mode back to inline (use
as_heap(), is_inline(), inline_capacity and the same bit-extraction logic used
in from_roaring to populate inline bits); this preserves the
heap⇒contains-high-id invariant and makes operator== and serialize/deserialize
behave correctly.

---

Nitpick comments:
In `@src/index/tu_index.cpp`:
- Around line 179-208: The per-symbol std::ranges::sort inside hash() is
redundant because Builder::build() already sorts and deduplicates each relations
vector; remove the inner sort call that orders "rels" (the lambda comparing
Relation::kind, range.begin, range.end, target_symbol) so hash() relies on the
canonical invariant established by Builder::build(), keeping the outer
sorted_keys construction and the data block hashing, and ensure cached_hash
remains valid for cached results; alternatively, if you prefer hash() to own
canonicalization, move the sort+unique logic from Builder::build() into hash()
and remove the build-time sort to keep a single authoritative place.

In `@tests/unit/index/project_index_tests.cpp`:
- Around line 203-253: The test SelectiveMergeFilters currently uses
ASSERT_TRUE(proj_sym.reference_files.cardinality() >=
symbol.reference_files.cardinality()), which allows extra unrelated files to
slip in; change the assertion to ensure exact preservation of reference-file
sets (e.g., assert equality between proj_sym.reference_files and
symbol.reference_files or that proj_sym.reference_files is equal to the expected
subset) after calling project.merge(tu, new_file_ids) so that comparisons on
tu.symbols -> symbol.reference_files vs project.symbols ->
proj_sym.reference_files confirm no extra paths were added.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 56b57c5c-2fd5-40ca-9fc0-e87490573874

📥 Commits

Reviewing files that changed from the base of the PR and between 1d1122d and e1dde9e.

📒 Files selected for processing (13)
  • CMakeLists.txt
  • benchmarks/index_benchmark.cpp
  • src/index/merged_index.cpp
  • src/index/merged_index.h
  • src/index/project_index.cpp
  • src/index/project_index.h
  • src/index/tu_index.cpp
  • src/index/tu_index.h
  • src/support/bitmap.h
  • tests/unit/index/merged_index_tests.cpp
  • tests/unit/index/project_index_tests.cpp
  • tests/unit/index/tu_index_tests.cpp
  • tests/unit/support/context_bitmap_tests.cpp
✅ Files skipped from review due to trivial changes (1)
  • tests/unit/index/tu_index_tests.cpp
🚧 Files skipped from review as they are similar to previous changes (6)
  • src/index/project_index.h
  • CMakeLists.txt
  • src/index/tu_index.h
  • tests/unit/index/merged_index_tests.cpp
  • src/index/project_index.cpp
  • src/index/merged_index.cpp

Comment on lines +478 to +508
auto elapsed = t.ms();
auto queries = merged.size() * 20 * rounds;
std::println(" Occurrence lookup (raw): {} queries in {:.1f}ms ({:.0f} q/ms, {} hits)",
queries,
elapsed,
queries / elapsed,
hit_count);
}

// Relation lookup by symbol hash (raw, no position conversion).
{
std::size_t hit_count = 0;
Timer t;
constexpr int rounds = 10;
for(int r = 0; r < rounds; r++) {
for(auto& [path, shard]: merged) {
for(auto hash: sample_hashes) {
shard.lookup(hash, RelationKind::Definition, [&](const index::Relation&) {
hit_count++;
return true;
});
}
}
}
auto elapsed = t.ms();
auto queries = merged.size() * sample_hashes.size() * rounds;
std::println(" Relation lookup (raw): {} queries in {:.1f}ms ({:.0f} q/ms, {} hits)",
queries,
elapsed,
queries / elapsed,
hit_count);

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Guard against division-by-zero in queries / elapsed.

If merged is empty (e.g., --limit 0 or every compile failing) both queries and elapsed can be zero, yielding NaN in the printed q/ms values. A simple guard (elapsed > 0 ? queries/elapsed : 0.0) keeps the output sane. Same pattern at Line 586.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@benchmarks/index_benchmark.cpp` around lines 478 - 508, The division-by-zero
occurs when computing queries / elapsed in the Relation lookup printing (using
variables merged, queries, elapsed, sample_hashes, rounds, and Timer t); change
the q/ms expression to guard the divide (e.g., use elapsed > 0 ? queries /
elapsed : 0.0) so it prints 0.0 when elapsed is zero; apply the same guarded
expression to the other benchmarking print that uses queries / elapsed (the
similar block later that also computes queries and elapsed).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant