feat(index): optimize merge with ContextBitmap and selective merge#422
feat(index): optimize merge with ContextBitmap and selective merge#42216bit-ykiko wants to merge 2 commits into
Conversation
📝 WalkthroughWalkthroughAdds a new index benchmarking executable; introduces a compact ContextBitmap type and replaces roaring usage in MergedIndex; makes MergedIndex::merge return a cache-hit bool; adds selective ProjectIndex::merge overload; caches deterministic FileIndex hashes; and includes unit tests for these behaviors. Changes
Sequence Diagram(s)sequenceDiagram
actor User
participant Compiler as Compiler (parallel)
participant Merger as Merger / ProjectIndex
participant Merged as MergedIndex (shards)
participant Query as Query Engine
User->>Compiler: Phase 1: compile files in parallel
Compiler->>Compiler: Build TUIndex/FileIndex per file\ncompute deterministic hash
Compiler-->>Merger: emit per-file indices & file IDs
Merger->>Merger: Phase 2: selective merge\n(ProjectIndex::merge with Bitmap)
Merger->>Merged: update shards (MergedIndex::merge)\nreturn cache-hit flags
Merger->>Merged: serialize shards (measure size/time)
User->>Query: Phase 3: micro-benchmarks (lookups)
Query->>Merged: raw occurrence/relation lookups\n(position mapping)
Query->>Query: Phase 4: find-references scan\nacross project symbols & shards
Query-->>User: report timings, sizes, coverage
Estimated code review effort🎯 4 (Complex) | ⏱️ ~75 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/index/merged_index.cpp (1)
606-610:⚠️ Potential issue | 🟠 MajorRefresh header content on cache hits too.
Unlike the compilation overload, this only writes
impl->contentonce.FileIndex::hash()is content-agnostic, so comment/whitespace-only header edits can still hit the cache and leavecontent()stale, which will skew later position mapping.Proposed fix
self.load_in_memory(); - if(self.impl->content.empty() && !content.empty()) { + if(!content.empty()) { self.impl->content = content.str(); }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/index/merged_index.cpp` around lines 606 - 610, The cache merge only sets impl->content when the entry is initially empty, leaving impl->content stale on cache hits; update the logic so that when new content is provided (non-empty) you refresh impl->content even on hits. Locate the call sequence around load_in_memory(), the merge(...) invocation and the Impl type members (impl->content) and ensure either the merge callback or the post-merge path assigns impl->content = content.str() when content is non-empty (so FileIndex::hash() cache hits also update content() for correct position mapping).
🧹 Nitpick comments (1)
tests/unit/index/project_index_tests.cpp (1)
203-238: This fixture doesn't prove the filter excludes untouched symbols.Every symbol in this setup can still reference
main.cpp, soproject.merge(tu, new_file_ids)may legally copy the whole table and the test would still pass even if the selective filter were bypassed. Add a header-only symbol that is never referenced frommain.cppand assert it stays absent.🧪 Example tightening
add_file("header.h", R"( `#pragma` once inline int shared() { return 1; } + inline int header_only() { return 2; } )"); @@ ASSERT_TRUE(has_main_symbols); + + for(auto& [hash, symbol]: tu.symbols) { + if(symbol.name == "header_only") { + ASSERT_FALSE(project.symbols.contains(hash)); + } + } }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unit/index/project_index_tests.cpp` around lines 203 - 238, The test SelectiveMergeFilters currently can pass even if merge ignores the new_file_ids filter because every symbol is referenced by main.cpp; to fix, add a header-only symbol in add_file (e.g., a function or variable named something like unreferenced_header_symbol) that is not used in add_main, keep the rest of the setup, call project.merge(tu, new_file_ids), then assert that the header-only symbol is not present in project.symbols (use the symbol's hash/key from tu.symbols or look it up by name in tu.symbols) to prove untouched symbols are excluded; reference the existing identifiers tu.symbols, project.symbols, project.merge, and new_file_ids to locate where to add the new header and assertion.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@benchmarks/index_benchmark.cpp`:
- Around line 175-179: The code casts CLI values to unsigned types before
validating, so negative inputs like opts.jobs or opts.limit become huge
positives; update the handling for num_jobs and the limit by first reading them
into a signed integer (e.g., int or long), check for <= 0 and handle negatives
by falling back to hardware_concurrency() or default values, then clamp to a
reasonable max before converting to unsigned/size_t for reserve(num_jobs) and
the worker loop; ensure the same validation logic is applied to both the
num_jobs (from opts.jobs) and the limit (from opts.limit) code paths.
- Around line 362-380: The code currently always treats the main TU as a new
file and forces a cache miss by unconditionally calling
new_file_ids.add(main_tu_path_id) and incrementing cache_misses after calling
merged[global_main_id].merge(...); change this to use the actual result from
merged[global_main_id].merge(global_main_id, now, std::move(remapped_locs),
tu_index.main_file_index, content): capture its hit/miss return value and only
add main_tu_path_id to new_file_ids or increment cache_misses when the merge
reports a miss, otherwise treat it as a hit and avoid mutating
new_file_ids/cache_misses so repeated TU paths reflect correct Phase 2
statistics.
In `@src/index/tu_index.h`:
- Around line 56-61: The cached_hash stored in FileIndex can become stale
because mutable members (relations, occurrences) can change after hash()
populates cached_hash, causing MergedIndex::merge to use a wrong canonical key;
fix by either making FileIndex immutable after construction (e.g., move
relations and occurrences to private read-only members and expose only const
accessors or provide a finalize/build method that seals the instance and
prevents further mutation) or by explicitly invalidating cached_hash (reset
cached_hash = std::nullopt) at every mutating code path that modifies relations
or occurrences (add centralized mutator methods or wrap direct mutations to
ensure invalidation); update hash(), constructors, any setters, and
MergedIndex::merge call sites accordingly to use the sealed or invalidation-safe
FileIndex.
In `@src/support/bitmap.h`:
- Around line 134-141: operator== assumes differing storage modes imply
inequality, but remove() can leave an empty heap that should compare equal to an
inline-empty bitmap; update operator==(const ContextBitmap& lhs, const
ContextBitmap& rhs) to handle mixed modes by normalizing comparison: if
is_inline() differs, take the heap side (as_heap()) and verify it contains no
ids >= inline_capacity (i.e., all bits fit in inline_capacity) and that the
low-inline_capacity bits equal the inline side's data; only return false if the
heap has any id >= inline_capacity or the low bits differ. Reference:
operator==, ContextBitmap::is_inline, ContextBitmap::data,
ContextBitmap::as_heap, inline_capacity, and the remove/add behavior.
---
Outside diff comments:
In `@src/index/merged_index.cpp`:
- Around line 606-610: The cache merge only sets impl->content when the entry is
initially empty, leaving impl->content stale on cache hits; update the logic so
that when new content is provided (non-empty) you refresh impl->content even on
hits. Locate the call sequence around load_in_memory(), the merge(...)
invocation and the Impl type members (impl->content) and ensure either the merge
callback or the post-merge path assigns impl->content = content.str() when
content is non-empty (so FileIndex::hash() cache hits also update content() for
correct position mapping).
---
Nitpick comments:
In `@tests/unit/index/project_index_tests.cpp`:
- Around line 203-238: The test SelectiveMergeFilters currently can pass even if
merge ignores the new_file_ids filter because every symbol is referenced by
main.cpp; to fix, add a header-only symbol in add_file (e.g., a function or
variable named something like unreferenced_header_symbol) that is not used in
add_main, keep the rest of the setup, call project.merge(tu, new_file_ids), then
assert that the header-only symbol is not present in project.symbols (use the
symbol's hash/key from tu.symbols or look it up by name in tu.symbols) to prove
untouched symbols are excluded; reference the existing identifiers tu.symbols,
project.symbols, project.merge, and new_file_ids to locate where to add the new
header and assertion.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: eb9b2d7e-023a-4c77-bd4c-a48d92191f47
📒 Files selected for processing (13)
CMakeLists.txtbenchmarks/index_benchmark.cppsrc/index/merged_index.cppsrc/index/merged_index.hsrc/index/project_index.cppsrc/index/project_index.hsrc/index/tu_index.cppsrc/index/tu_index.hsrc/support/bitmap.htests/unit/index/merged_index_tests.cpptests/unit/index/project_index_tests.cpptests/unit/index/tu_index_tests.cpptests/unit/support/context_bitmap_tests.cpp
| auto num_jobs = static_cast<unsigned>(*opts.jobs); | ||
| if(num_jobs == 0) | ||
| num_jobs = std::thread::hardware_concurrency(); | ||
| if(num_jobs == 0) | ||
| num_jobs = 4; |
There was a problem hiding this comment.
Reject negative --jobs and --limit before casting.
Lines 175-179 and 187-189 convert signed CLI values to unsigned/size_t first. -1 turns into a huge positive number here, which can explode reserve(num_jobs) and the worker loop instead of falling back to defaults.
Proposed fix
- auto num_jobs = static_cast<unsigned>(*opts.jobs);
+ if(*opts.jobs < 0 || *opts.limit < 0) {
+ std::println(stderr, "--jobs and --limit must be >= 0");
+ return 1;
+ }
+
+ auto num_jobs = static_cast<unsigned>(*opts.jobs);
if(num_jobs == 0)
num_jobs = std::thread::hardware_concurrency();
if(num_jobs == 0)
num_jobs = 4;
@@
- auto limit = static_cast<std::size_t>(*opts.limit);
+ auto limit = static_cast<std::size_t>(*opts.limit);Also applies to: 187-189
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@benchmarks/index_benchmark.cpp` around lines 175 - 179, The code casts CLI
values to unsigned types before validating, so negative inputs like opts.jobs or
opts.limit become huge positives; update the handling for num_jobs and the limit
by first reading them into a signed integer (e.g., int or long), check for <= 0
and handle negatives by falling back to hardware_concurrency() or default
values, then clamp to a reasonable max before converting to unsigned/size_t for
reserve(num_jobs) and the worker loop; ensure the same validation logic is
applied to both the num_jobs (from opts.jobs) and the limit (from opts.limit)
code paths.
| // Main file is always new. | ||
| new_file_ids.add(main_tu_path_id); | ||
| auto global_main_id = project_index.path_pool.path_id(graph.paths[main_tu_path_id]); | ||
| auto content = read_content(global_main_id, graph.paths[main_tu_path_id]); | ||
| auto now = std::chrono::duration_cast<std::chrono::milliseconds>( | ||
| std::chrono::system_clock::now().time_since_epoch()); | ||
|
|
||
| std::vector<index::IncludeLocation> remapped_locs; | ||
| for(auto& loc: graph.locations) { | ||
| auto remapped = loc; | ||
| remapped.path_id = project_index.path_pool.path_id(graph.paths[loc.path_id]); | ||
| remapped_locs.push_back(remapped); | ||
| } | ||
| merged[global_main_id].merge(global_main_id, | ||
| now, | ||
| std::move(remapped_locs), | ||
| tu_index.main_file_index, | ||
| content); | ||
| cache_misses++; |
There was a problem hiding this comment.
Use the main-file merge() result instead of forcing a miss.
Lines 362-380 always add main_tu_path_id to new_file_ids and always increment cache_misses, even though the compilation-context overload now returns the same cache-hit signal as the header overload. That makes the Phase 2 hit/miss numbers and the selective ProjectIndex::merge() input wrong for repeated TU paths.
Proposed fix
- // Main file is always new.
- new_file_ids.add(main_tu_path_id);
auto global_main_id = project_index.path_pool.path_id(graph.paths[main_tu_path_id]);
auto content = read_content(global_main_id, graph.paths[main_tu_path_id]);
auto now = std::chrono::duration_cast<std::chrono::milliseconds>(
std::chrono::system_clock::now().time_since_epoch());
@@
- merged[global_main_id].merge(global_main_id,
- now,
- std::move(remapped_locs),
- tu_index.main_file_index,
- content);
- cache_misses++;
+ bool main_hit = merged[global_main_id].merge(global_main_id,
+ now,
+ std::move(remapped_locs),
+ tu_index.main_file_index,
+ content);
+ if(main_hit) {
+ cache_hits++;
+ } else {
+ cache_misses++;
+ new_file_ids.add(main_tu_path_id);
+ }
merged_index_merge_ms += mt.ms();🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@benchmarks/index_benchmark.cpp` around lines 362 - 380, The code currently
always treats the main TU as a new file and forces a cache miss by
unconditionally calling new_file_ids.add(main_tu_path_id) and incrementing
cache_misses after calling merged[global_main_id].merge(...); change this to use
the actual result from merged[global_main_id].merge(global_main_id, now,
std::move(remapped_locs), tu_index.main_file_index, content): capture its
hit/miss return value and only add main_tu_path_id to new_file_ids or increment
cache_misses when the merge reports a miss, otherwise treat it as a hit and
avoid mutating new_file_ids/cache_misses so repeated TU paths reflect correct
Phase 2 statistics.
| /// Compute (and cache) SHA256 hash of this index's content. | ||
| /// Each FileIndex instance must only be hashed by one thread at a time. | ||
| std::array<std::uint8_t, 32> hash(); | ||
|
|
||
| private: | ||
| std::optional<std::array<std::uint8_t, 32>> cached_hash; |
There was a problem hiding this comment.
Cached hashes need an immutability or invalidation guarantee.
relations and occurrences are still publicly mutable, but once hash() populates cached_hash it never changes again. MergedIndex::merge uses that digest as the canonical-cache key, so any later mutation can turn into a false cache hit and merge different FileIndex contents under the same canonical id. Please either make FileIndex immutable after build or reset cached_hash from every mutating path.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/index/tu_index.h` around lines 56 - 61, The cached_hash stored in
FileIndex can become stale because mutable members (relations, occurrences) can
change after hash() populates cached_hash, causing MergedIndex::merge to use a
wrong canonical key; fix by either making FileIndex immutable after construction
(e.g., move relations and occurrences to private read-only members and expose
only const accessors or provide a finalize/build method that seals the instance
and prevents further mutation) or by explicitly invalidating cached_hash (reset
cached_hash = std::nullopt) at every mutating code path that modifies relations
or occurrences (add centralized mutator methods or wrap direct mutations to
ensure invalidation); update hash(), constructors, any setters, and
MergedIndex::merge call sites accordingly to use the sealed or invalidation-safe
FileIndex.
| friend bool operator==(const ContextBitmap& lhs, const ContextBitmap& rhs) { | ||
| if(lhs.is_inline() && rhs.is_inline()) | ||
| return lhs.data == rhs.data; | ||
| // Different modes implies different sets: a heap bitmap always contains | ||
| // at least one id >= inline_capacity that an inline bitmap cannot hold. | ||
| if(lhs.is_inline() != rhs.is_inline()) | ||
| return false; | ||
| return *lhs.as_heap() == *rhs.as_heap(); |
There was a problem hiding this comment.
Mixed-mode equality becomes wrong after removing a high id.
remove() never downgrades heap storage back to inline, so a sequence like add(63); remove(63); leaves an empty heap-backed bitmap. This branch will still return false against an inline-empty bitmap, which breaks equality and any serialize/deserialize round-trip that normalizes back to inline.
💡 Suggested fix
friend bool operator==(const ContextBitmap& lhs, const ContextBitmap& rhs) {
if(lhs.is_inline() && rhs.is_inline())
return lhs.data == rhs.data;
- // Different modes implies different sets: a heap bitmap always contains
- // at least one id >= inline_capacity that an inline bitmap cannot hold.
- if(lhs.is_inline() != rhs.is_inline())
- return false;
+ if(lhs.is_inline() != rhs.is_inline())
+ return !lhs.any_not_in(rhs) && !rhs.any_not_in(lhs);
return *lhs.as_heap() == *rhs.as_heap();
}🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/support/bitmap.h` around lines 134 - 141, operator== assumes differing
storage modes imply inequality, but remove() can leave an empty heap that should
compare equal to an inline-empty bitmap; update operator==(const ContextBitmap&
lhs, const ContextBitmap& rhs) to handle mixed modes by normalizing comparison:
if is_inline() differs, take the heap side (as_heap()) and verify it contains no
ids >= inline_capacity (i.e., all bits fit in inline_capacity) and that the
low-inline_capacity bits equal the inline side's data; only return false if the
heap has any id >= inline_capacity or the low bits differ. Reference:
operator==, ContextBitmap::is_inline, ContextBitmap::data,
ContextBitmap::as_heap, inline_capacity, and the remove/add behavior.
…ective merge Introduce ContextBitmap (tagged-pointer inline bitmap for <63 ids) to replace roaring::Roaring in MergedIndex hot paths, eliminating heap allocations in the common case. Cache FileIndex SHA256 hashes to avoid recomputation. Add selective ProjectIndex::merge that skips symbols not referencing newly-merged files. MergedIndex::merge now returns bool to indicate cache hits. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Fix use-after-partial-move and division-by-zero in benchmark - Use std::exchange in ContextBitmap move operations - Add clarifying comments on operator== mixed mode and hash() sorting - Add tests: compilation-context merge cache hit, hash determinism, any_not_in empty inputs, remove edge cases Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1d1122d to
e1dde9e
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (3)
benchmarks/index_benchmark.cpp (2)
174-188:⚠️ Potential issue | 🟠 MajorNegative
--jobs/--limitstill cast straight to unsigned.Casting a negative
inttounsigned/std::size_twraps to a huge positive value, which then drivesworkers.reserve(num_jobs)andentries.take_front(limit). Validate signedness before the cast and fall back to defaults (or reject the input) for negative values.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@benchmarks/index_benchmark.cpp` around lines 174 - 188, Validate the signed values before casting: check the raw signed integers behind opts.jobs and opts.limit (e.g., test *opts.jobs <= 0 and *opts.limit < 0) and handle negatives instead of blindly casting to unsigned; for jobs, if the signed value is <= 0 use std::thread::hardware_concurrency() (and fallback to 4) before assigning num_jobs, and for limit, if the signed limit is negative treat it as no limit (or 0) before computing limit and calling entries.take_front(limit). Ensure you reference the existing symbols num_jobs, limit, opts.jobs and opts.limit and perform the signedness check prior to the static_cast.
361-380:⚠️ Potential issue | 🟠 MajorMain-file merge still forces a miss instead of using the returned bool.
Lines 362 and 374-379 unconditionally
new_file_ids.add(main_tu_path_id)and incrementcache_misses, ignoring the cache-hit return value that the compilation-context overload now provides (the whole point of the API change inMergedIndex::merge). For repeated TU paths this skews the Phase 2 hit/miss totals and causesProjectIndex::merge(tu, new_file_ids)to receive an overly inclusive bitmap, partly defeating the selective-merge measurement.Proposed fix
- // Main file is always new. - new_file_ids.add(main_tu_path_id); auto global_main_id = project_index.path_pool.path_id(graph.paths[main_tu_path_id]); ... - merged[global_main_id].merge(global_main_id, - now, - std::move(remapped_locs), - tu_index.main_file_index, - content); - cache_misses++; + bool main_hit = merged[global_main_id].merge(global_main_id, + now, + std::move(remapped_locs), + tu_index.main_file_index, + content); + if(main_hit) { + cache_hits++; + } else { + cache_misses++; + new_file_ids.add(main_tu_path_id); + }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@benchmarks/index_benchmark.cpp` around lines 361 - 380, The main-file merge currently unconditionally calls new_file_ids.add(main_tu_path_id) and increments cache_misses despite MergedIndex::merge (invoked as merged[global_main_id].merge(...)) returning a bool indicating whether the merge was a cache hit; change the call to capture that bool (e.g., bool merged_hit = merged[global_main_id].merge(global_main_id, now, std::move(remapped_locs), tu_index.main_file_index, content);) and only call new_file_ids.add(main_tu_path_id) and increment cache_misses when merged_hit is false; leave timing aggregation (merged_index_merge_ms += mt.ms()) as-is.src/support/bitmap.h (1)
91-142:⚠️ Potential issue | 🟠 MajorMixed-mode
operator==is still incorrect after removing a high id.
remove()(Lines 91-98) never downgrades heap storage back to inline, so the sequenceadd(63); remove(63);leaves an empty heap-backed bitmap.operator==at Lines 139-140 then returnsfalsewhen comparing against an inline-empty (or inline-equivalent) bitmap, even though both sets are semantically equal. This also breaks any serialize/deserialize round-trip that normalizes throughfrom_roaring(which picks inline whenr.maximum() < inline_capacity).Two viable fixes:
- Fall back to a content comparison when modes differ:
if(lhs.is_inline() != rhs.is_inline()) - return false; + return !lhs.any_not_in(rhs) && !rhs.any_not_in(lhs);- Or, downgrade heap back to inline in
remove()wheneveras_heap()->maximum() < inline_capacity, preserving the "heap ⇒ has an id ≥ inline_capacity" invariant the comment at Lines 137-138 currently assumes.Option 1 is simpler; option 2 also lets
is_empty()/any_not_inhit the fast path more often.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/support/bitmap.h` around lines 91 - 142, The bug is that remove() never downgrades a heap-backed bitmap to inline, so operator== can wrongly treat semantically-equal bitmaps as different; fix by changing remove(std::uint32_t id) to, after calling as_heap()->remove(id), check if the heap's maximum() (or isEmpty()) is below inline_capacity and if so rebuild the inline representation (set data = bits from the heap), destroy the heap storage and flip the mode back to inline (use as_heap(), is_inline(), inline_capacity and the same bit-extraction logic used in from_roaring to populate inline bits); this preserves the heap⇒contains-high-id invariant and makes operator== and serialize/deserialize behave correctly.
🧹 Nitpick comments (2)
src/index/tu_index.cpp (1)
179-208: Redundant re-sort ofrelationsinsidehash().
Builder::build()already sorts each symbol'srelationsat lines 120-128 before this function is ever reached (and dedupes viaunique). The per-symbol sort at 195-198 here is therefore almost always a no-op except on the first call after external mutation. Sincehash()now caches, you can drop the inner sort and rely on the invariant established at build time — or alternatively, move the sort out ofbuild()and makehash()the sole place responsible for canonicalization. Keeping sorts in both places hides which side is authoritative.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/index/tu_index.cpp` around lines 179 - 208, The per-symbol std::ranges::sort inside hash() is redundant because Builder::build() already sorts and deduplicates each relations vector; remove the inner sort call that orders "rels" (the lambda comparing Relation::kind, range.begin, range.end, target_symbol) so hash() relies on the canonical invariant established by Builder::build(), keeping the outer sorted_keys construction and the data block hashing, and ensure cached_hash remains valid for cached results; alternatively, if you prefer hash() to own canonicalization, move the sort+unique logic from Builder::build() into hash() and remove the build-time sort to keep a single authoritative place.tests/unit/index/project_index_tests.cpp (1)
203-253: Consider tighteningSelectiveMergeFiltersassertions.The current check
proj_sym.reference_files.cardinality() >= symbol.reference_files.cardinality()would pass even if the selective merge incorrectly inserted extra unrelated files. A stricter assertion would verify equality (or a subset relation) of the reference-file set to actually prove the selective path preserves — and doesn't corrupt — per-symbol reference data. Non-blocking.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unit/index/project_index_tests.cpp` around lines 203 - 253, The test SelectiveMergeFilters currently uses ASSERT_TRUE(proj_sym.reference_files.cardinality() >= symbol.reference_files.cardinality()), which allows extra unrelated files to slip in; change the assertion to ensure exact preservation of reference-file sets (e.g., assert equality between proj_sym.reference_files and symbol.reference_files or that proj_sym.reference_files is equal to the expected subset) after calling project.merge(tu, new_file_ids) so that comparisons on tu.symbols -> symbol.reference_files vs project.symbols -> proj_sym.reference_files confirm no extra paths were added.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@benchmarks/index_benchmark.cpp`:
- Around line 478-508: The division-by-zero occurs when computing queries /
elapsed in the Relation lookup printing (using variables merged, queries,
elapsed, sample_hashes, rounds, and Timer t); change the q/ms expression to
guard the divide (e.g., use elapsed > 0 ? queries / elapsed : 0.0) so it prints
0.0 when elapsed is zero; apply the same guarded expression to the other
benchmarking print that uses queries / elapsed (the similar block later that
also computes queries and elapsed).
---
Duplicate comments:
In `@benchmarks/index_benchmark.cpp`:
- Around line 174-188: Validate the signed values before casting: check the raw
signed integers behind opts.jobs and opts.limit (e.g., test *opts.jobs <= 0 and
*opts.limit < 0) and handle negatives instead of blindly casting to unsigned;
for jobs, if the signed value is <= 0 use std::thread::hardware_concurrency()
(and fallback to 4) before assigning num_jobs, and for limit, if the signed
limit is negative treat it as no limit (or 0) before computing limit and calling
entries.take_front(limit). Ensure you reference the existing symbols num_jobs,
limit, opts.jobs and opts.limit and perform the signedness check prior to the
static_cast.
- Around line 361-380: The main-file merge currently unconditionally calls
new_file_ids.add(main_tu_path_id) and increments cache_misses despite
MergedIndex::merge (invoked as merged[global_main_id].merge(...)) returning a
bool indicating whether the merge was a cache hit; change the call to capture
that bool (e.g., bool merged_hit = merged[global_main_id].merge(global_main_id,
now, std::move(remapped_locs), tu_index.main_file_index, content);) and only
call new_file_ids.add(main_tu_path_id) and increment cache_misses when
merged_hit is false; leave timing aggregation (merged_index_merge_ms += mt.ms())
as-is.
In `@src/support/bitmap.h`:
- Around line 91-142: The bug is that remove() never downgrades a heap-backed
bitmap to inline, so operator== can wrongly treat semantically-equal bitmaps as
different; fix by changing remove(std::uint32_t id) to, after calling
as_heap()->remove(id), check if the heap's maximum() (or isEmpty()) is below
inline_capacity and if so rebuild the inline representation (set data = bits
from the heap), destroy the heap storage and flip the mode back to inline (use
as_heap(), is_inline(), inline_capacity and the same bit-extraction logic used
in from_roaring to populate inline bits); this preserves the
heap⇒contains-high-id invariant and makes operator== and serialize/deserialize
behave correctly.
---
Nitpick comments:
In `@src/index/tu_index.cpp`:
- Around line 179-208: The per-symbol std::ranges::sort inside hash() is
redundant because Builder::build() already sorts and deduplicates each relations
vector; remove the inner sort call that orders "rels" (the lambda comparing
Relation::kind, range.begin, range.end, target_symbol) so hash() relies on the
canonical invariant established by Builder::build(), keeping the outer
sorted_keys construction and the data block hashing, and ensure cached_hash
remains valid for cached results; alternatively, if you prefer hash() to own
canonicalization, move the sort+unique logic from Builder::build() into hash()
and remove the build-time sort to keep a single authoritative place.
In `@tests/unit/index/project_index_tests.cpp`:
- Around line 203-253: The test SelectiveMergeFilters currently uses
ASSERT_TRUE(proj_sym.reference_files.cardinality() >=
symbol.reference_files.cardinality()), which allows extra unrelated files to
slip in; change the assertion to ensure exact preservation of reference-file
sets (e.g., assert equality between proj_sym.reference_files and
symbol.reference_files or that proj_sym.reference_files is equal to the expected
subset) after calling project.merge(tu, new_file_ids) so that comparisons on
tu.symbols -> symbol.reference_files vs project.symbols ->
proj_sym.reference_files confirm no extra paths were added.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 56b57c5c-2fd5-40ca-9fc0-e87490573874
📒 Files selected for processing (13)
CMakeLists.txtbenchmarks/index_benchmark.cppsrc/index/merged_index.cppsrc/index/merged_index.hsrc/index/project_index.cppsrc/index/project_index.hsrc/index/tu_index.cppsrc/index/tu_index.hsrc/support/bitmap.htests/unit/index/merged_index_tests.cpptests/unit/index/project_index_tests.cpptests/unit/index/tu_index_tests.cpptests/unit/support/context_bitmap_tests.cpp
✅ Files skipped from review due to trivial changes (1)
- tests/unit/index/tu_index_tests.cpp
🚧 Files skipped from review as they are similar to previous changes (6)
- src/index/project_index.h
- CMakeLists.txt
- src/index/tu_index.h
- tests/unit/index/merged_index_tests.cpp
- src/index/project_index.cpp
- src/index/merged_index.cpp
| auto elapsed = t.ms(); | ||
| auto queries = merged.size() * 20 * rounds; | ||
| std::println(" Occurrence lookup (raw): {} queries in {:.1f}ms ({:.0f} q/ms, {} hits)", | ||
| queries, | ||
| elapsed, | ||
| queries / elapsed, | ||
| hit_count); | ||
| } | ||
|
|
||
| // Relation lookup by symbol hash (raw, no position conversion). | ||
| { | ||
| std::size_t hit_count = 0; | ||
| Timer t; | ||
| constexpr int rounds = 10; | ||
| for(int r = 0; r < rounds; r++) { | ||
| for(auto& [path, shard]: merged) { | ||
| for(auto hash: sample_hashes) { | ||
| shard.lookup(hash, RelationKind::Definition, [&](const index::Relation&) { | ||
| hit_count++; | ||
| return true; | ||
| }); | ||
| } | ||
| } | ||
| } | ||
| auto elapsed = t.ms(); | ||
| auto queries = merged.size() * sample_hashes.size() * rounds; | ||
| std::println(" Relation lookup (raw): {} queries in {:.1f}ms ({:.0f} q/ms, {} hits)", | ||
| queries, | ||
| elapsed, | ||
| queries / elapsed, | ||
| hit_count); |
There was a problem hiding this comment.
Guard against division-by-zero in queries / elapsed.
If merged is empty (e.g., --limit 0 or every compile failing) both queries and elapsed can be zero, yielding NaN in the printed q/ms values. A simple guard (elapsed > 0 ? queries/elapsed : 0.0) keeps the output sane. Same pattern at Line 586.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@benchmarks/index_benchmark.cpp` around lines 478 - 508, The division-by-zero
occurs when computing queries / elapsed in the Relation lookup printing (using
variables merged, queries, elapsed, sample_hashes, rounds, and Timer t); change
the q/ms expression to guard the divide (e.g., use elapsed > 0 ? queries /
elapsed : 0.0) so it prints 0.0 when elapsed is zero; apply the same guarded
expression to the other benchmarking print that uses queries / elapsed (the
similar block later that also computes queries and elapsed).
Summary
ContextBitmap(tagged-pointer inline bitmap for <63 ids) to replaceroaring::Roaringin MergedIndex hot paths, eliminating heap allocations in the common caseFileIndexSHA256 hashes with deterministic ordering to avoid recomputation and ensure cross-instance consistencyProjectIndex::mergethat skips symbols not referencing newly-merged filesMergedIndex::mergenow returnsboolto indicate cache hits, enabling the selective optimizationindex_benchmarktool for end-to-end merge/query performance profilingTest plan
🤖 Generated with Claude Code
Summary by CodeRabbit
New Features
Performance
Tests