refactor(index): migrate FlatBuffers from flatc IDL to kotatsu reflection#430
refactor(index): migrate FlatBuffers from flatc IDL to kotatsu reflection#43016bit-ykiko wants to merge 1 commit into
Conversation
📝 WalkthroughWalkthroughReplaces in-repo FlatBuffers schema and manual builders with kotatsu's flatbuffers codec and arena adapters. Removes schema generation from CMake and the Changes
Sequence DiagramsequenceDiagram
participant App as Application
participant Ser as Serialization Layer
participant Adapter as Type Adapters
participant Codec as Kotatsu Codec
participant Buf as FlatBuffer Bytes
rect rgba(100, 149, 237, 0.5)
Note over App,Buf: Serialization Flow (New)
App->>Ser: serialize(index)
Ser->>Adapter: adapt domain types -> wire types
Adapter->>Codec: provide arena-backed structures
Ser->>Codec: kfb::to_flatbuffer(adapted)
Codec->>Buf: emit FlatBuffer bytes
Buf->>App: write bytes to stream
end
rect rgba(144, 238, 144, 0.5)
Note over App,Buf: Deserialization Flow (New)
App->>Buf: read bytes from stream
App->>Ser: deserialize(bytes, size)
Ser->>Codec: kfb::from_flatbuffer(bytes, target)
Codec->>Adapter: expose table_views / wire types
Adapter->>App: reconstruct domain objects
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 7
🧹 Nitpick comments (2)
cmake/package.cmake (1)
30-35: Pin kotatsu to a commit SHA (or tag) rather than a branch for reproducibility.
refactor/flatbuffers-schema-drivenis a moving branch HEAD, so withFETCHCONTENT_UPDATES_DISCONNECTED ONdifferent developers / CI runs may silently resolve to different commits depending on when their local FetchContent cache was first populated. The PR description already acknowledges this is temporary, but until the kotatsu PR merges and you can switch tomain(or a tag), consider pinning to the exact commit.📦 Proposed pin
FetchContent_Declare( kotatsu GIT_REPOSITORY https://github.com/clice-io/kotatsu - GIT_TAG refactor/flatbuffers-schema-driven + # TODO: repoint to main / tagged release after upstream kotatsu PR merges. + GIT_TAG <commit-sha-from-refactor/flatbuffers-schema-driven> GIT_SHALLOW TRUE )Note: with a commit SHA, you'll also want to drop
GIT_SHALLOW TRUEor ensure the SHA is reachable from the default branch, since shallow fetch of a non-tip SHA is not universally supported.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cmake/package.cmake` around lines 30 - 35, The FetchContent_Declare block for kotatsu uses a moving branch in GIT_TAG ("refactor/flatbuffers-schema-driven"); change GIT_TAG to a specific commit SHA (or an immutable tag) so builds are reproducible, and remove or adjust GIT_SHALLOW TRUE (or ensure the SHA is reachable from the default branch) so the pinned SHA can be fetched; update the FetchContent_Declare call for kotatsu accordingly.src/index/wire_types.h (1)
132-141:MergedIndex::contentnow forces eager copy of the full file contents.Previously
contentwas lazily pulled out of the mmapped FlatBuffer root; after this migration,load_in_memorydecodes the entirewire::MergedIndex(includingcontentasstd::string) up front, and bothMergedIndex::content()andneed_update()/lookup()now eagerly callload_in_memory(). For large per-file source contents this means more allocation and memory pressure on what used to be very cheap accessors — especiallyneed_update(), which doesn't needcontentat all.If this turns out to matter at scale, consider splitting
contentout ofwire::MergedIndexinto its own sub-region (or a separate file) so lookup/need_update can decode without materializing it. Not blocking for this PR, but worth tracking.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/index/wire_types.h` around lines 132 - 141, The struct field MergedIndex::content makes decoding the entire wire::MergedIndex (and therefore load_in_memory(), need_update(), lookup()) eagerly allocate the full file contents; to avoid unnecessary eager copies, revert to keeping content out of the hot decoded path by removing or deferring the std::string content from MergedIndex and instead expose a lazy accessor that reads the content from the mmapped FlatBuffer only when explicitly requested (e.g., implement MergedIndex::get_content() that pulls from the FlatBuffer root on first use or store a lightweight view/offset to the content region), update load_in_memory() to not materialize content, and ensure need_update() and lookup() call only the lazy accessor when they actually need the payload; reference MergedIndex, content, load_in_memory, need_update, lookup, and wire::MergedIndex when making the changes.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/index/merged_index.cpp`:
- Around line 410-496: The methods MergedIndex::lookup, MergedIndex::need_update
and content currently call const_cast<Self&>(self).load_in_memory(), introducing
a data race on members buffer/impl; fix by making lazy-init thread-safe: either
mark buffer and impl mutable and change load_in_memory to perform synchronized
one-time initialization using std::once_flag (or a mutex + double-checked init)
so callers no longer need const_cast, or keep them non-mutable and protect
load_in_memory with a mutex/once_flag and remove the const_cast calls in
lookup/need_update/content; ensure you update load_in_memory, remove const_cast
usages, and reference the same buffer and impl symbols consistently to avoid
races.
- Around line 256-262: When kfb::from_flatbuffer(bytes, w) fails inside
load_in_memory, don't silently reset self.buffer and leave impl as an empty
index; instead log the failure (include the index file path and any error/info
from from_flatbuffer) and record a load-status so callers can detect the bad
load. Concretely: in the failure branch after auto result =
kfb::from_flatbuffer(bytes, w) add a processLogger/error log call that prints
the path and result/error, set an explicit flag or enum on the object (e.g.
impl_loaded = false or load_status = LoadStatus::Failed) instead of just leaving
impl empty, and expose an accessor such as is_loaded() or load_status() so
lookup/need_update/content can detect the bad load and trigger a rebuild or
return an appropriate error.
- Around line 405-407: The code currently uses assert on the result of
kfb::to_flatbuffer(w) which is removed in release builds, causing undefined
behavior when bytes is empty; replace the assert with an explicit check of the
optional-like bytes result (from kfb::to_flatbuffer) and handle the failure path
deterministically (e.g., log an error and throw an exception or return an error
code) before calling bytes->data()/bytes->size(); apply the same change pattern
to the analogous sites (project_index.cpp around the bytes/out.write usage and
TUIndex::serialize) so serialization never dereferences an empty value in
release builds.
- Around line 387-400: The inner relation lists are produced by iterating
index.relations[symbol_id] (a DenseMap) which yields nondeterministic order;
before pushing each wire::SymbolRelationsEntry into w.relations, sort its
sre.relations vector by relation (e.g., compare RelationEntry.relation or use
the result of to_wire(relation)) to ensure deterministic on-disk output; apply
the same pattern of collecting then sorting entries when serializing
canonical_cache, header_contexts, and compilation_contexts so each inner
vector/map is stable across runs.
In `@src/index/project_index.cpp`:
- Around line 88-98: The loop writes to pool.paths using deserialized entry.id
without validating bounds, which can crash on malformed or future sparse ids;
change the loop over w.paths so you either (A) validate that entry.id <
pool.paths.size() before doing pool.paths[entry.id] = k and skip or handle
invalid ids (and still update pool.cache.try_emplace(k, entry.id) only for valid
ids), or (B) stop using entry.id entirely and instead assign ids from the vector
position (use the loop index i as the id when calling pool.save and
pool.cache.try_emplace) so writes use a safe, consecutive index; update the
logic around pool.save, pool.paths and pool.cache.try_emplace to reflect the
chosen approach.
In `@src/index/tu_index.cpp`:
- Around line 313-315: Replace the runtime-unsafe assert that assumes
kfb::to_flatbuffer(w) always returns a value: check the returned optional/ptr
(the bytes variable) and if it is null call llvm::report_fatal_error with a
clear message instead of asserting; then proceed to use
bytes->data()/bytes->size() only after the null-check. Add `#include`
"llvm/Support/ErrorHandling.h" to the top of each affected file (tu_index.cpp,
project_index.cpp, merged_index.cpp) and remove the assert(bytes && ...) lines
so release builds won’t dereference a null pointer.
In `@src/index/wire_types.h`:
- Around line 46-53: Remove the unused padding member from the wire type by
deleting the padding field from struct Relation (wire::Relation) in
wire_types.h, and update the serialization/conversion code that maps between
runtime Relation (tu_index) and wire::Relation in tu_index.cpp and
merged_index.cpp so they no longer read, write, or reserve 4 bytes for padding;
adjust any memcpy/size calculations, constructors, and assignment code that
assumed the extra field and update any on-disk/serialization size constants or
header checks that included the padding so the wire layout no longer contains
the unused 4-byte slot.
---
Nitpick comments:
In `@cmake/package.cmake`:
- Around line 30-35: The FetchContent_Declare block for kotatsu uses a moving
branch in GIT_TAG ("refactor/flatbuffers-schema-driven"); change GIT_TAG to a
specific commit SHA (or an immutable tag) so builds are reproducible, and remove
or adjust GIT_SHALLOW TRUE (or ensure the SHA is reachable from the default
branch) so the pinned SHA can be fetched; update the FetchContent_Declare call
for kotatsu accordingly.
In `@src/index/wire_types.h`:
- Around line 132-141: The struct field MergedIndex::content makes decoding the
entire wire::MergedIndex (and therefore load_in_memory(), need_update(),
lookup()) eagerly allocate the full file contents; to avoid unnecessary eager
copies, revert to keeping content out of the hot decoded path by removing or
deferring the std::string content from MergedIndex and instead expose a lazy
accessor that reads the content from the mmapped FlatBuffer only when explicitly
requested (e.g., implement MergedIndex::get_content() that pulls from the
FlatBuffer root on first use or store a lightweight view/offset to the content
region), update load_in_memory() to not materialize content, and ensure
need_update() and lookup() call only the lazy accessor when they actually need
the payload; reference MergedIndex, content, load_in_memory, need_update,
lookup, and wire::MergedIndex when making the changes.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: a56ea3cc-99ce-488e-ac0c-b2f33603a5e0
📒 Files selected for processing (14)
CMakeLists.txtcmake/package.cmakesrc/index/kotatsu_adapters.hsrc/index/merged_index.cppsrc/index/project_index.cppsrc/index/project_index.hsrc/index/schema.fbssrc/index/serialization.hsrc/index/tu_index.cppsrc/index/tu_index.hsrc/index/wire_types.hsrc/server/compiler.cppsrc/server/indexer.cpptests/unit/index/project_index_tests.cpp
💤 Files with no reviewable changes (2)
- src/index/serialization.h
- src/index/schema.fbs
|
|
||
| wire::MergedIndex w; | ||
| auto result = kfb::from_flatbuffer(bytes, w); | ||
| if(!result) { | ||
| self.buffer.reset(); | ||
| return; | ||
| } |
There was a problem hiding this comment.
Silent failure on corrupted/unparseable on-disk index.
If kfb::from_flatbuffer fails, load_in_memory resets the buffer and leaves impl as an empty index — callers (e.g. lookup, need_update, content) then observe an empty index and proceed as if nothing was wrong. For a corrupted or version-mismatched index file this can mask real problems (e.g. a dropped-on-the-floor stale cache is indistinguishable from legitimate empty state). Consider at least logging the failure with the path, and/or exposing a load-status so callers can trigger a rebuild.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/index/merged_index.cpp` around lines 256 - 262, When
kfb::from_flatbuffer(bytes, w) fails inside load_in_memory, don't silently reset
self.buffer and leave impl as an empty index; instead log the failure (include
the index file path and any error/info from from_flatbuffer) and record a
load-status so callers can detect the bad load. Concretely: in the failure
branch after auto result = kfb::from_flatbuffer(bytes, w) add a
processLogger/error log call that prints the path and result/error, set an
explicit flag or enum on the object (e.g. impl_loaded = false or load_status =
LoadStatus::Failed) instead of just leaving impl empty, and expose an accessor
such as is_loaded() or load_status() so lookup/need_update/content can detect
the bad load and trigger a rebuild or return an appropriate error.
| w.relations.reserve(index.relations.size()); | ||
| for(auto& [symbol_id, symbol_relations]: index.relations) { | ||
| wire::SymbolRelationsEntry sre; | ||
| sre.symbol = symbol_id; | ||
| sre.relations.reserve(symbol_relations.size()); | ||
| for(auto& [relation, bitmap]: symbol_relations) { | ||
| wire::RelationEntry re; | ||
| re.relation = to_wire(relation); | ||
| re.context = bitmap; | ||
| sre.relations.push_back(std::move(re)); | ||
| } | ||
| w.relations.push_back(std::move(sre)); | ||
| } | ||
| auto removed = CreateVector(builder, buffer); | ||
|
|
||
| auto content_offset = CreateString(builder, index->content); | ||
| std::ranges::sort(w.relations, {}, [](const auto& e) { return e.symbol; }); |
There was a problem hiding this comment.
Relations within a symbol aren't sorted — on-disk output is nondeterministic.
w.relations is sorted by symbol, but the inner sre.relations vector is populated in hashmap iteration order (from index.relations[symbol_id], a DenseMap<Relation, Roaring>), which isn't stable across runs. The serialized bytes therefore differ for logically-identical indices, which hurts reproducibility, change detection, and round-trip equality checks.
♻️ Sort inner relations before pushing
for(auto& [relation, bitmap]: symbol_relations) {
wire::RelationEntry re;
re.relation = to_wire(relation);
re.context = bitmap;
sre.relations.push_back(std::move(re));
}
+ std::ranges::sort(sre.relations, [](const auto& lhs, const auto& rhs) {
+ return std::tuple(lhs.relation.kind.value(),
+ lhs.relation.range.begin,
+ lhs.relation.range.end,
+ lhs.relation.target_symbol) <
+ std::tuple(rhs.relation.kind.value(),
+ rhs.relation.range.begin,
+ rhs.relation.range.end,
+ rhs.relation.target_symbol);
+ });
w.relations.push_back(std::move(sre));Same concern applies to canonical_cache, header_contexts, and compilation_contexts, which are all iterated out of llvm::SmallDenseMap/StringMap without any subsequent sort.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/index/merged_index.cpp` around lines 387 - 400, The inner relation lists
are produced by iterating index.relations[symbol_id] (a DenseMap) which yields
nondeterministic order; before pushing each wire::SymbolRelationsEntry into
w.relations, sort its sre.relations vector by relation (e.g., compare
RelationEntry.relation or use the result of to_wire(relation)) to ensure
deterministic on-disk output; apply the same pattern of collecting then sorting
entries when serializing canonical_cache, header_contexts, and
compilation_contexts so each inner vector/map is stable across runs.
| auto bytes = kfb::to_flatbuffer(w); | ||
| assert(bytes && "MergedIndex flatbuffer serialization failed"); | ||
| out.write(reinterpret_cast<const char*>(bytes->data()), bytes->size()); |
There was a problem hiding this comment.
Release-build UB when FlatBuffer serialization fails.
assert(...) is compiled out in release builds, so if kfb::to_flatbuffer(w) returns a falsy/empty result, bytes->data() / bytes->size() will dereference an empty/invalid optional-like value and corrupt the output. Since this is on the persistence path, a silent failure here can produce a truncated or malformed on-disk index that later fails to load (or worse, loads partial data).
Handle the error path explicitly even in release:
🛡️ Proposed fix
- auto bytes = kfb::to_flatbuffer(w);
- assert(bytes && "MergedIndex flatbuffer serialization failed");
- out.write(reinterpret_cast<const char*>(bytes->data()), bytes->size());
+ auto bytes = kfb::to_flatbuffer(w);
+ if(!bytes) {
+ LOG_ERROR("MergedIndex flatbuffer serialization failed");
+ return;
+ }
+ out.write(reinterpret_cast<const char*>(bytes->data()), bytes->size());Same pattern also applies to src/index/project_index.cpp:70-72 and the analogous TUIndex::serialize.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/index/merged_index.cpp` around lines 405 - 407, The code currently uses
assert on the result of kfb::to_flatbuffer(w) which is removed in release
builds, causing undefined behavior when bytes is empty; replace the assert with
an explicit check of the optional-like bytes result (from kfb::to_flatbuffer)
and handle the failure path deterministically (e.g., log an error and throw an
exception or return an error code) before calling bytes->data()/bytes->size();
apply the same change pattern to the analogous sites (project_index.cpp around
the bytes/out.write usage and TUIndex::serialize) so serialization never
dereferences an empty value in release builds.
| auto bytes = kfb::to_flatbuffer(w); | ||
| assert(bytes && "TUIndex flatbuffer serialization failed"); | ||
| os.write(reinterpret_cast<const char*>(bytes->data()), bytes->size()); |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Verify the codec result is nullable/checked elsewhere and locate unchecked dereferences.
rg -n -C4 '\bto_flatbuffer\s*\(' --iglob '*.h' --iglob '*.hpp' --iglob '*.cpp'
rg -n -C3 'auto bytes = kfb::to_flatbuffer|assert\(bytes|bytes->data\(\)' --iglob '*.cpp'Repository: clice-io/clice
Length of output: 4711
🏁 Script executed:
#!/bin/bash
# Find the definition of kfb::to_flatbuffer to verify its return type
fd -e h -e hpp . | xargs rg -l 'to_flatbuffer' | head -5
rg -n 'to_flatbuffer' --iglob '*.h' --iglob '*.hpp' -B2 -A5 | head -50Repository: clice-io/clice
Length of output: 40
🏁 Script executed:
#!/bin/bash
# Search more broadly for to_flatbuffer definition
rg 'to_flatbuffer' --type cpp --type h -B2 -A5 | head -80
# Also check what kfb namespace is
rg -n 'namespace.*kfb|using.*kfb' --iglob '*.h' --iglob '*.cpp' | head -20Repository: clice-io/clice
Length of output: 1775
🏁 Script executed:
#!/bin/bash
# Look for to_flatbuffer in kota codec directory
find . -path '*/kota/codec/flatbuffers*' -type f \( -name '*.h' -o -name '*.hpp' \)
# Also search for to_flatbuffer function in the include directory
rg -n 'to_flatbuffer' . --iglob 'serializer*' -A10Repository: clice-io/clice
Length of output: 40
🏁 Script executed:
#!/bin/bash
# Check the include statements in the three affected files
echo "=== tu_index.cpp includes ==="
head -30 src/index/tu_index.cpp | grep -n include
echo -e "\n=== project_index.cpp includes ==="
head -30 src/index/project_index.cpp | grep -n include
echo -e "\n=== merged_index.cpp includes ==="
head -30 src/index/merged_index.cpp | grep -n includeRepository: clice-io/clice
Length of output: 1182
🏁 Script executed:
#!/bin/bash
# Check if llvm::report_fatal_error is used elsewhere
echo "=== Checking llvm::report_fatal_error usage ==="
rg 'report_fatal_error' --type cpp --type h
echo -e "\n=== Checking ErrorHandling.h includes ==="
rg 'ErrorHandling\.h' --type cpp --type h
echo -e "\n=== Check if file_count for scope ==="
echo "Files affected by the issue:"
rg -l 'auto bytes = kfb::to_flatbuffer' --type cppRepository: clice-io/clice
Length of output: 294
Handle serialization failure in all index serialization paths.
assert(bytes) disappears under NDEBUG, but subsequent dereferencing of bytes persists. If kfb::to_flatbuffer() returns null in release builds, this causes a crash. This issue exists in three files:
- src/index/tu_index.cpp (lines 313–315)
- src/index/project_index.cpp (lines 70–72)
- src/index/merged_index.cpp (lines 405–407)
Replace runtime-unsafe assertions with proper error handling. Use if(!bytes) { llvm::report_fatal_error(...); } and add #include "llvm/Support/ErrorHandling.h" to all three files.
Example fix for tu_index.cpp
`#include` "kota/codec/flatbuffers/deserializer.h"
`#include` "kota/codec/flatbuffers/serializer.h"
+#include "llvm/Support/ErrorHandling.h"
`#include` "llvm/Support/SHA256.h" auto bytes = kfb::to_flatbuffer(w);
- assert(bytes && "TUIndex flatbuffer serialization failed");
+ if (!bytes) {
+ llvm::report_fatal_error("TUIndex flatbuffer serialization failed");
+ }
os.write(reinterpret_cast<const char*>(bytes->data()), bytes->size());Apply similar changes to project_index.cpp and merged_index.cpp.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/index/tu_index.cpp` around lines 313 - 315, Replace the runtime-unsafe
assert that assumes kfb::to_flatbuffer(w) always returns a value: check the
returned optional/ptr (the bytes variable) and if it is null call
llvm::report_fatal_error with a clear message instead of asserting; then proceed
to use bytes->data()/bytes->size() only after the null-check. Add `#include`
"llvm/Support/ErrorHandling.h" to the top of each affected file (tu_index.cpp,
project_index.cpp, merged_index.cpp) and remove the assert(bytes && ...) lines
so release builds won’t dereference a null pointer.
| /// Relation carries an annotated `RelationKind` so it is encoded as a table | ||
| /// rather than an inline struct (annotated fields are not supported inline). | ||
| struct Relation { | ||
| km::annotation<RelationKind, kmb::with<adapters::RelationKindAdapter>> kind; | ||
| std::uint32_t padding = 0; | ||
| LocalSourceRange range; | ||
| SymbolHash target_symbol = 0; | ||
| }; |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Inspect the runtime Relation definition to see what `padding` represents.
rg -nP -C2 '\bstruct\s+Relation\b' --type=cpp --type=h
rg -nP -C1 '\bpadding\b' src/index/ src/semantic/Repository: clice-io/clice
Length of output: 1710
🏁 Script executed:
#!/bin/bash
# Get full definition of runtime Relation struct with surrounding context
cat -n src/index/tu_index.h | sed -n '20,35p'
# Check if padding field is used/accessed anywhere other than in serialization
rg 'Relation.*\.padding\b|\.padding.*=' --type=cpp --type=h -n
# Look for any comments about memory layout or alignment near the struct
rg -B5 -A10 'struct Relation' src/index/tu_index.hRepository: clice-io/clice
Length of output: 1496
Remove padding field from wire::Relation struct.
The padding field in the runtime Relation struct (tu_index.h:25) exists solely for in-memory alignment and is never accessed outside of serialization. Persisting it to the wire format wastes 4 bytes per relation (potentially millions across an index) and bakes unused reserved bits into the on-disk format, making it awkward to evolve later. Drop it from the wire type definition at wire_types.h:50 and its conversion functions in tu_index.cpp:179 and merged_index.cpp:195.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/index/wire_types.h` around lines 46 - 53, Remove the unused padding
member from the wire type by deleting the padding field from struct Relation
(wire::Relation) in wire_types.h, and update the serialization/conversion code
that maps between runtime Relation (tu_index) and wire::Relation in tu_index.cpp
and merged_index.cpp so they no longer read, write, or reserve 4 bytes for
padding; adjust any memcpy/size calculations, constructors, and assignment code
that assumed the extra field and update any on-disk/serialization size constants
or header checks that included the padding so the wire layout no longer contains
the unused 4-byte slot.
911702c to
019f422
Compare
There was a problem hiding this comment.
Actionable comments posted: 2
♻️ Duplicate comments (3)
src/index/tu_index.cpp (1)
216-219:⚠️ Potential issue | 🔴 CriticalHandle serialization failure before dereferencing
bytes.
assert(bytes)disappears underNDEBUG, leavingbytes->data()reachable after a failedto_flatbuffer()call.🛡️ Proposed fix
+#include "llvm/Support/ErrorHandling.h" + void TUIndex::serialize(llvm::raw_ostream& os) { auto bytes = kfb::to_flatbuffer(*this); - assert(bytes && "TUIndex flatbuffer serialization failed"); + if(!bytes) { + llvm::report_fatal_error("TUIndex flatbuffer serialization failed"); + } os.write(reinterpret_cast<const char*>(bytes->data()), bytes->size()); }Run this read-only check to verify no unchecked FlatBuffer serialization dereferences remain:
#!/bin/bash rg -n -C3 'auto bytes = kfb::to_flatbuffer|assert\(bytes|bytes->data\(\)' --iglob '*.cpp'🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/index/tu_index.cpp` around lines 216 - 219, TUIndex::serialize currently dereferences the result of kfb::to_flatbuffer without handling a failure when asserts are disabled; update the function to check the returned bytes (the value stored in the local variable bytes from kfb::to_flatbuffer) before calling bytes->data()/bytes->size(), and handle the error path (e.g., return early, log/processLogger.error, or throw an exception) so serialization failure is not dereferenced; ensure the check and error handling are applied in TUIndex::serialize around the call to os.write to safely guard bytes.src/index/merged_index.cpp (2)
219-223:⚠️ Potential issue | 🟡 MinorDon’t silently turn a failed shard decode into an empty index.
If
from_flatbufferfails, this drops the buffer and leaves an emptyimpl, making corrupted or version-incompatible shards indistinguishable from legitimate empty shards. At minimum, log the failure and expose a load-failed state so callers can rebuild the shard.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/index/merged_index.cpp` around lines 219 - 223, The current code silently discards the buffer when kfb::from_flatbuffer(bytes, *self.impl) fails (calling self.buffer.reset()), which hides corrupted or incompatible shard loads; change this to log the decode error (include the result/error details) and set an explicit load-failed state on the shard object instead of making it appear empty — e.g., add or set a boolean like self.load_failed (or an enum state) and leave the buffer intact or mark it as invalid for later inspection, and update callers that check self.impl or buffer to handle the new load-failed state; ensure logging references buffer_bytes, kfb::from_flatbuffer and the failing shard id so operators can rebuild the shard.
261-263:⚠️ Potential issue | 🔴 CriticalHandle serialization failure before dereferencing
bytes.
assert(bytes)is compiled out in release builds, butbytes->data()still runs. Use an explicit runtime check before writing.🛡️ Proposed fix
+#include "llvm/Support/ErrorHandling.h" + auto bytes = kfb::to_flatbuffer(*self.impl); - assert(bytes && "MergedIndex flatbuffer serialization failed"); + if(!bytes) { + llvm::report_fatal_error("MergedIndex flatbuffer serialization failed"); + } out.write(reinterpret_cast<const char*>(bytes->data()), bytes->size());Run this read-only check to verify no unchecked FlatBuffer serialization dereferences remain:
#!/bin/bash rg -n -C3 'auto bytes = kfb::to_flatbuffer|assert\(bytes|bytes->data\(\)' --iglob '*.cpp'🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/index/merged_index.cpp` around lines 261 - 263, The code currently dereferences bytes returned from kfb::to_flatbuffer without a runtime null check (assert(bytes) is compiled out in release); update the block around auto bytes = kfb::to_flatbuffer(*self.impl) to explicitly check if bytes is null before calling bytes->data() or bytes->size(), and handle the failure path (e.g., return an error/false, throw an exception, or log an error via the same logging mechanism) instead of proceeding to out.write(reinterpret_cast<const char*>(bytes->data()), bytes->size()); ensure the check covers both bytes null and empty payload cases so no unchecked bytes->data() dereference remains.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/index/merged_index.cpp`:
- Around line 339-347: The lazy FlatBuffer lookup paths call the callback
unconditionally and so return stale occurrences/relations that have been
removed; update both occurrence and relation lazy loops to mirror the in-memory
Impl logic: after reading the occurrence/relation key and loading the stored
bitmap via entry.template get<1>() (or the relation inner-map bitmap), subtract
the Impl::removed bitmap (root[&Impl::removed]) and if the resulting bitmap is
empty skip invoking callback (i.e. continue), otherwise proceed to call callback
with the occurrence/relation; locate these changes around the read_occurrence
loop and the relations lookup loop to apply the same filtering.
In `@src/index/tu_index.h`:
- Around line 40-42: Replace the defaulted comparison operators with explicit
implementations for Relation: remove the friend defaulted operator== and
operator<=> and implement operator== and operator<=> that compare only the
logical key members (the same fields used by the existing hash/equality and
dedupe logic) and explicitly exclude the padding member; ensure the
implementation uses the same member set as the Relation hashing function so
ordering and equality are consistent when used as map keys.
---
Duplicate comments:
In `@src/index/merged_index.cpp`:
- Around line 219-223: The current code silently discards the buffer when
kfb::from_flatbuffer(bytes, *self.impl) fails (calling self.buffer.reset()),
which hides corrupted or incompatible shard loads; change this to log the decode
error (include the result/error details) and set an explicit load-failed state
on the shard object instead of making it appear empty — e.g., add or set a
boolean like self.load_failed (or an enum state) and leave the buffer intact or
mark it as invalid for later inspection, and update callers that check self.impl
or buffer to handle the new load-failed state; ensure logging references
buffer_bytes, kfb::from_flatbuffer and the failing shard id so operators can
rebuild the shard.
- Around line 261-263: The code currently dereferences bytes returned from
kfb::to_flatbuffer without a runtime null check (assert(bytes) is compiled out
in release); update the block around auto bytes = kfb::to_flatbuffer(*self.impl)
to explicitly check if bytes is null before calling bytes->data() or
bytes->size(), and handle the failure path (e.g., return an error/false, throw
an exception, or log an error via the same logging mechanism) instead of
proceeding to out.write(reinterpret_cast<const char*>(bytes->data()),
bytes->size()); ensure the check covers both bytes null and empty payload cases
so no unchecked bytes->data() dereference remains.
In `@src/index/tu_index.cpp`:
- Around line 216-219: TUIndex::serialize currently dereferences the result of
kfb::to_flatbuffer without handling a failure when asserts are disabled; update
the function to check the returned bytes (the value stored in the local variable
bytes from kfb::to_flatbuffer) before calling bytes->data()/bytes->size(), and
handle the error path (e.g., return early, log/processLogger.error, or throw an
exception) so serialization failure is not dereferenced; ensure the check and
error handling are applied in TUIndex::serialize around the call to os.write to
safely guard bytes.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: c3b75257-94ae-4a5a-bc28-575c7794b72b
📒 Files selected for processing (16)
CMakeLists.txtcmake/package.cmakesrc/index/include_graph.hsrc/index/kotatsu_adapters.hsrc/index/merged_index.cppsrc/index/project_index.cppsrc/index/project_index.hsrc/index/schema.fbssrc/index/serialization.hsrc/index/tu_index.cppsrc/index/tu_index.hsrc/semantic/relation_kind.hsrc/server/compiler.cppsrc/server/indexer.cppsrc/syntax/token.htests/unit/index/project_index_tests.cpp
💤 Files with no reviewable changes (2)
- src/index/serialization.h
- src/index/schema.fbs
✅ Files skipped from review due to trivial changes (1)
- tests/unit/index/project_index_tests.cpp
🚧 Files skipped from review as they are similar to previous changes (5)
- cmake/package.cmake
- src/server/compiler.cpp
- src/index/project_index.h
- src/index/kotatsu_adapters.h
- src/index/project_index.cpp
| for(; lo < count; ++lo) { | ||
| auto entry = entries.at(lo); | ||
| auto occurrence = read_occurrence(entry.template get<0>()); | ||
| if(!occurrence.range.contains(offset)) { | ||
| break; | ||
| } | ||
| if(!callback(occurrence)) { | ||
| break; | ||
| } |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
sed -n '266,408p' src/index/merged_index.cppRepository: clice-io/clice
Length of output: 5476
🏁 Script executed:
#!/bin/bash
# Search for the definition of Impl structure and how occurrences/relations are stored
rg -A 10 "struct Impl" src/index/ -t cpp | head -80Repository: clice-io/clice
Length of output: 542
🏁 Script executed:
#!/bin/bash
# Look for where Impl occurrences and relations are defined to understand the FlatBuffer structure
rg "occurrences|relations" src/index/merged_index.cpp -B 2 -A 2 | head -100Repository: clice-io/clice
Length of output: 3664
🏁 Script executed:
#!/bin/bash
# Search for KFB table definitions or schema for this data structure
fd -e "\.h" -e "\.hpp" src/ | xargs rg "struct Impl" -A 15 | head -150Repository: clice-io/clice
Length of output: 2320
🏁 Script executed:
#!/bin/bash
# Search for KFB/FlatBuffer map access patterns to confirm how to access value
rg "\.template get<1>" src/index/ -t cpp -B 3 -A 3Repository: clice-io/clice
Length of output: 486
🏁 Script executed:
#!/bin/bash
# Look for the complete Impl structure definition in the header file
rg -A 30 "struct Impl" src/index/merged_index.hRepository: clice-io/clice
Length of output: 811
🏁 Script executed:
#!/bin/bash
# Check if there are any other places where removed filtering is done in lazy paths
rg "removed" src/index/merged_index.cpp -B 5 -A 2Repository: clice-io/clice
Length of output: 2254
Mirror the impl path by filtering removed canonical IDs in lazy lookups.
The in-memory Impl paths skip entries whose bitmap becomes empty after subtracting the removed set, but the lazy FlatBuffer paths reconstruct only the occurrence/relation key and call the callback unconditionally. This surfaces stale results from removed contexts until materialization.
Both lazy paths need the same filtering logic:
- In occurrences lookup (lines 339–347): read the bitmap via
entry.template get<1>(), computebitmap - root[&Impl::removed], and skip the callback if empty. - In relations lookup (lines 389–406): similarly filter the inner map entry's bitmap value before invoking the callback.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/index/merged_index.cpp` around lines 339 - 347, The lazy FlatBuffer
lookup paths call the callback unconditionally and so return stale
occurrences/relations that have been removed; update both occurrence and
relation lazy loops to mirror the in-memory Impl logic: after reading the
occurrence/relation key and loading the stored bitmap via entry.template
get<1>() (or the relation inner-map bitmap), subtract the Impl::removed bitmap
(root[&Impl::removed]) and if the resulting bitmap is empty skip invoking
callback (i.e. continue), otherwise proceed to call callback with the
occurrence/relation; locate these changes around the read_occurrence loop and
the relations lookup loop to apply the same filtering.
| friend bool operator==(const Relation&, const Relation&) = default; | ||
|
|
||
| friend auto operator<=>(const Relation&, const Relation&) = default; |
There was a problem hiding this comment.
Keep Relation comparison aligned with its logical key.
The defaulted operators include padding, but existing relation hashing/equality and dedupe ignore it. If the codec uses this ordering for map keys, logically identical relations can be ordered or compared differently when padding differs.
♻️ Proposed fix
- friend bool operator==(const Relation&, const Relation&) = default;
-
- friend auto operator<=>(const Relation&, const Relation&) = default;
+ friend constexpr bool operator==(const Relation& lhs, const Relation& rhs) {
+ return lhs.kind == rhs.kind && lhs.range == rhs.range &&
+ lhs.target_symbol == rhs.target_symbol;
+ }
+
+ friend constexpr auto operator<=>(const Relation& lhs, const Relation& rhs) {
+ if(auto cmp = lhs.kind <=> rhs.kind; cmp != 0) {
+ return cmp;
+ }
+ if(auto cmp = lhs.range <=> rhs.range; cmp != 0) {
+ return cmp;
+ }
+ return lhs.target_symbol <=> rhs.target_symbol;
+ }📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| friend bool operator==(const Relation&, const Relation&) = default; | |
| friend auto operator<=>(const Relation&, const Relation&) = default; | |
| friend constexpr bool operator==(const Relation& lhs, const Relation& rhs) { | |
| return lhs.kind == rhs.kind && lhs.range == rhs.range && | |
| lhs.target_symbol == rhs.target_symbol; | |
| } | |
| friend constexpr auto operator<=>(const Relation& lhs, const Relation& rhs) { | |
| if(auto cmp = lhs.kind <=> rhs.kind; cmp != 0) { | |
| return cmp; | |
| } | |
| if(auto cmp = lhs.range <=> rhs.range; cmp != 0) { | |
| return cmp; | |
| } | |
| return lhs.target_symbol <=> rhs.target_symbol; | |
| } |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/index/tu_index.h` around lines 40 - 42, Replace the defaulted comparison
operators with explicit implementations for Relation: remove the friend
defaulted operator== and operator<=> and implement operator== and operator<=>
that compare only the logical key members (the same fields used by the existing
hash/equality and dedupe logic) and explicitly exclude the padding member;
ensure the implementation uses the same member set as the Relation hashing
function so ordering and equality are consistent when used as map keys.
019f422 to
94cfdd9
Compare
There was a problem hiding this comment.
Actionable comments posted: 2
♻️ Duplicate comments (5)
src/index/merged_index.cpp (2)
339-347:⚠️ Potential issue | 🟠 MajorKeep lazy lookups consistent with removed-context filtering.
The materialized paths skip entries whose bitmap becomes empty after subtracting
removed; these lazy paths still invoke callbacks unconditionally, so stale occurrences/relations can surface until materialization.🛡️ Proposed fix sketch
for(; lo < count; ++lo) { auto entry = entries.at(lo); auto occurrence = read_occurrence(entry.template get<0>()); if(!occurrence.range.contains(offset)) { break; } + auto remaining = entry.template get<1>() - root[&Impl::removed]; + if(remaining.isEmpty()) { + continue; + } if(!callback(occurrence)) { break; } }for(std::size_t i = 0; i < count; ++i) { - auto rel_view = inner.at(i).template get<0>(); + auto inner_entry = inner.at(i); + auto remaining = inner_entry.template get<1>() - root[&Impl::removed]; + if(remaining.isEmpty()) { + continue; + } + auto rel_view = inner_entry.template get<0>();Also applies to: 389-406
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/index/merged_index.cpp` around lines 339 - 347, The lazy-path loop over entries (using entries.at(lo), read_occurrence and callback(occurrence)) currently calls the callback even when the occurrence’s bitmap becomes empty after subtracting the `removed` set, causing stale occurrences; update the loop to apply the same removed-subtraction and empty-check used by the materialized-path before invoking callback (i.e., subtract `removed` from the occurrence's bitmap, skip the entry if the bitmap is empty, and only then call callback), and make the identical change to the other corresponding loop that reads occurrences (the similar block handling the lazy path later in the file).
261-263:⚠️ Potential issue | 🔴 CriticalHandle serialization failure outside
assert.Line 262 is compiled out in release builds; Line 263 can still dereference an empty serialization result and write corrupt output.
🛡️ Proposed fix
+#include "llvm/Support/ErrorHandling.h" + auto bytes = kfb::to_flatbuffer(*self.impl); - assert(bytes && "MergedIndex flatbuffer serialization failed"); + if(!bytes) { + llvm::report_fatal_error("MergedIndex flatbuffer serialization failed"); + } out.write(reinterpret_cast<const char*>(bytes->data()), bytes->size());🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/index/merged_index.cpp` around lines 261 - 263, The code currently uses assert to guard the result of kfb::to_flatbuffer(*self.impl) but then dereferences bytes regardless; replace the assert with a runtime check: call auto bytes = kfb::to_flatbuffer(*self.impl); if (!bytes) handle the error path (e.g., return an error code/throw an exception or log and abort) instead of proceeding, and only call out.write(reinterpret_cast<const char*>(bytes->data()), bytes->size()) when bytes is non-null; reference symbols: kfb::to_flatbuffer, bytes, self.impl, and out.write to locate and fix the code.src/index/tu_index.cpp (1)
216-219:⚠️ Potential issue | 🔴 CriticalHandle serialization failure outside
assert.Line 218 is compiled out under
NDEBUG; Line 219 can still dereference an empty serialization result.🛡️ Proposed fix
+#include "llvm/Support/ErrorHandling.h" + void TUIndex::serialize(llvm::raw_ostream& os) { auto bytes = kfb::to_flatbuffer(*this); - assert(bytes && "TUIndex flatbuffer serialization failed"); + if(!bytes) { + llvm::report_fatal_error("TUIndex flatbuffer serialization failed"); + } os.write(reinterpret_cast<const char*>(bytes->data()), bytes->size()); }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/index/tu_index.cpp` around lines 216 - 219, TUIndex::serialize currently uses assert on the result of kfb::to_flatbuffer (bytes) which is compiled out under NDEBUG and then dereferences bytes; replace the assert with an explicit runtime check: if bytes is null, report the error and abort the operation (e.g. throw std::runtime_error or call llvm::report_fatal_error/emit a clear diagnostic) instead of proceeding, otherwise call os.write with bytes->data() and bytes->size(); update the error message to mention TUIndex flatbuffer serialization failure and include any available context.src/index/tu_index.h (1)
40-42:⚠️ Potential issue | 🟠 MajorExclude
paddingfromRelationequality/order.The defaulted operators include Line 26
padding, but relation hashing/dedupe use onlykind,range, andtarget_symbol. This can make logically identical relations compare differently.♻️ Proposed fix
- friend bool operator==(const Relation&, const Relation&) = default; - - friend auto operator<=>(const Relation&, const Relation&) = default; + friend constexpr bool operator==(const Relation& lhs, const Relation& rhs) { + return lhs.kind == rhs.kind && lhs.range == rhs.range && + lhs.target_symbol == rhs.target_symbol; + } + + friend constexpr auto operator<=>(const Relation& lhs, const Relation& rhs) { + if(auto cmp = lhs.kind <=> rhs.kind; cmp != 0) { + return cmp; + } + if(auto cmp = lhs.range <=> rhs.range; cmp != 0) { + return cmp; + } + return lhs.target_symbol <=> rhs.target_symbol; + }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/index/tu_index.h` around lines 40 - 42, The defaulted friend operators operator== and operator<=> on Relation include the padding member causing inconsistent comparisons vs hashing/dedupe; replace the defaulted operators with custom implementations that compare only Relation::kind, Relation::range, and Relation::target_symbol (explicitly exclude Relation::padding) so equality and ordering match the logic used by hashing/deduplication; update the friend declarations for operator== and operator<=> on the Relation type to call those custom comparisons.src/index/project_index.cpp (1)
43-46:⚠️ Potential issue | 🔴 CriticalHandle serialization failure outside
assert.Line 45 disappears in release builds; if
kfb::to_flatbuffer(self)returns an empty result, Line 46 still dereferences it.🛡️ Proposed fix
+#include "llvm/Support/ErrorHandling.h" + void ProjectIndex::serialize(this ProjectIndex& self, llvm::raw_ostream& os) { auto bytes = kfb::to_flatbuffer(self); - assert(bytes && "ProjectIndex flatbuffer serialization failed"); + if(!bytes) { + llvm::report_fatal_error("ProjectIndex flatbuffer serialization failed"); + } os.write(reinterpret_cast<const char*>(bytes->data()), bytes->size()); }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/index/project_index.cpp` around lines 43 - 46, ProjectIndex::serialize currently uses assert to guard the result of kfb::to_flatbuffer(self), which is removed in release builds and can lead to dereferencing an empty optional; change the function to explicitly check the returned bytes (from kfb::to_flatbuffer) before writing to llvm::raw_ostream& os, and handle the failure path deterministically (e.g., log an error via your logging facility and return/throw an exception) instead of relying on assert so the code never dereferences a null/empty buffer.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/index/kotatsu_adapters.h`:
- Around line 103-108: The deserialize function currently calls
clice::Bitmap::read(reinterpret_cast<const char*>(bytes.data()), false) which
can read past the vector bounds for malformed/untrusted FlatBuffer input;
replace that call with the length-checked clice::Bitmap::readSafe by passing the
buffer pointer and bytes.size() (retain the third parameter false to preserve
the non-portable format) so deserialize safely prevents out-of-bounds reads when
reconstructing a clice::Bitmap from std::vector<std::byte>.
In `@src/index/merged_index.cpp`:
- Around line 226-237: The code rebuilds index.canonical_ref_counts using
inc.canonical_id and ctx.canonical_id without bounds checks, risking
out-of-range access if decoded IDs >= index.max_canonical_id; update the loops
over index.header_contexts and index.compilation_contexts to validate each
canonical ID before incrementing (e.g., check inc.canonical_id <
index.max_canonical_id and ctx.canonical_id < index.max_canonical_id), skip or
log invalid IDs instead of indexing, and ensure any counts use the same
index.canonical_ref_counts container so no UB occurs.
---
Duplicate comments:
In `@src/index/merged_index.cpp`:
- Around line 339-347: The lazy-path loop over entries (using entries.at(lo),
read_occurrence and callback(occurrence)) currently calls the callback even when
the occurrence’s bitmap becomes empty after subtracting the `removed` set,
causing stale occurrences; update the loop to apply the same removed-subtraction
and empty-check used by the materialized-path before invoking callback (i.e.,
subtract `removed` from the occurrence's bitmap, skip the entry if the bitmap is
empty, and only then call callback), and make the identical change to the other
corresponding loop that reads occurrences (the similar block handling the lazy
path later in the file).
- Around line 261-263: The code currently uses assert to guard the result of
kfb::to_flatbuffer(*self.impl) but then dereferences bytes regardless; replace
the assert with a runtime check: call auto bytes =
kfb::to_flatbuffer(*self.impl); if (!bytes) handle the error path (e.g., return
an error code/throw an exception or log and abort) instead of proceeding, and
only call out.write(reinterpret_cast<const char*>(bytes->data()), bytes->size())
when bytes is non-null; reference symbols: kfb::to_flatbuffer, bytes, self.impl,
and out.write to locate and fix the code.
In `@src/index/project_index.cpp`:
- Around line 43-46: ProjectIndex::serialize currently uses assert to guard the
result of kfb::to_flatbuffer(self), which is removed in release builds and can
lead to dereferencing an empty optional; change the function to explicitly check
the returned bytes (from kfb::to_flatbuffer) before writing to
llvm::raw_ostream& os, and handle the failure path deterministically (e.g., log
an error via your logging facility and return/throw an exception) instead of
relying on assert so the code never dereferences a null/empty buffer.
In `@src/index/tu_index.cpp`:
- Around line 216-219: TUIndex::serialize currently uses assert on the result of
kfb::to_flatbuffer (bytes) which is compiled out under NDEBUG and then
dereferences bytes; replace the assert with an explicit runtime check: if bytes
is null, report the error and abort the operation (e.g. throw std::runtime_error
or call llvm::report_fatal_error/emit a clear diagnostic) instead of proceeding,
otherwise call os.write with bytes->data() and bytes->size(); update the error
message to mention TUIndex flatbuffer serialization failure and include any
available context.
In `@src/index/tu_index.h`:
- Around line 40-42: The defaulted friend operators operator== and operator<=>
on Relation include the padding member causing inconsistent comparisons vs
hashing/dedupe; replace the defaulted operators with custom implementations that
compare only Relation::kind, Relation::range, and Relation::target_symbol
(explicitly exclude Relation::padding) so equality and ordering match the logic
used by hashing/deduplication; update the friend declarations for operator== and
operator<=> on the Relation type to call those custom comparisons.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: cb8e84a5-034b-42ac-b319-05455c2be11f
📒 Files selected for processing (16)
CMakeLists.txtcmake/package.cmakesrc/index/include_graph.hsrc/index/kotatsu_adapters.hsrc/index/merged_index.cppsrc/index/project_index.cppsrc/index/project_index.hsrc/index/schema.fbssrc/index/serialization.hsrc/index/tu_index.cppsrc/index/tu_index.hsrc/semantic/relation_kind.hsrc/server/compiler.cppsrc/server/indexer.cppsrc/syntax/token.htests/unit/index/project_index_tests.cpp
💤 Files with no reviewable changes (2)
- src/index/serialization.h
- src/index/schema.fbs
✅ Files skipped from review due to trivial changes (1)
- src/server/indexer.cpp
🚧 Files skipped from review as they are similar to previous changes (7)
- src/server/compiler.cpp
- src/syntax/token.h
- cmake/package.cmake
- src/index/include_graph.h
- tests/unit/index/project_index_tests.cpp
- src/index/project_index.h
- src/semantic/relation_kind.h
| static clice::Bitmap deserialize(std::vector<std::byte> bytes) { | ||
| if(bytes.empty()) { | ||
| return clice::Bitmap(); | ||
| } | ||
| return clice::Bitmap::read(reinterpret_cast<const char*>(bytes.data()), false); | ||
| } |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Description: Inspect Roaring/Bitmap headers for length-checked deserialize APIs.
# Expected: a readSafe/deserialize_safe-style API that accepts both pointer and size.
rg -n -C3 'readSafe|deserialize_safe|read\s*\(' --iglob '*roaring*.h' --iglob '*roaring*.hh' --iglob '*roaring*.hpp' --iglob '*bitmap*.h'
rg -n -C3 'using Bitmap|typedef.*Bitmap|class Bitmap|struct Bitmap' --iglob '*.h' --iglob '*.hpp'Repository: clice-io/clice
Length of output: 275
🏁 Script executed:
# Find roaring-related files and headers in the repo
fd -i roaring . --type fRepository: clice-io/clice
Length of output: 40
🏁 Script executed:
# Check includes and roaring dependency in the relevant files
head -30 src/index/kotatsu_adapters.h
head -30 src/support/bitmap.h
# Search for CMake or build config that might show roaring dependency
fd -i cmake . --type f | head -10
cat CMakeLists.txt 2>/dev/null | grep -i roaringRepository: clice-io/clice
Length of output: 1499
🏁 Script executed:
# Read the exact lines mentioned in the review
sed -n '103,108p' src/index/kotatsu_adapters.h
# Check if there's a roaring.hh header we can inspect or search the web for its API
# First, let's see what roaring methods are called in the codebase
rg -n 'Roaring::' src/ --type h --type hppRepository: clice-io/clice
Length of output: 339
🏁 Script executed:
# Search for any roaring usage patterns in codebase to understand available APIs
rg -n 'Bitmap::|roaring::' src/ -A 2 -B 2Repository: clice-io/clice
Length of output: 1609
🌐 Web query:
roaring C++ library deserialize safe API readSafe length-checked
💡 Result:
The Roaring C++ library (CRoaring) provides safe, length-checked deserialization APIs for untrusted input. These prevent buffer overflows by limiting reads to a specified maximum size. For C API (roaring.h): - roaring_bitmap_portable_deserialize_safe(const char *buf, size_t maxbytes): Safe deserialization of 32-bit bitmap; returns NULL if it would exceed maxbytes or fail. - roaring64_bitmap_portable_deserialize_safe(const char buf, size_t maxbytes): Same for 64-bit. For C++ API (roaring.hh, class roaring::Roaring): - static Roaring readSafe(const char buf, size_t maxbytes, bool portable = true): Safe deserialization; will not read past maxbytes. These APIs follow the portable Roaring format spec. They avoid overflows but do not fully validate content (e.g., sorted containers). For untrusted sources, call roaring_bitmap_internal_validate afterward (added Oct 2024 via PR #675, guaranteeing safe deserialization when combined). Example (C++): uint32_t expectedsize = r1.getSizeInBytes; char *serializedbytes = new char[expectedsize]; r1.write(serializedbytes); Roaring t = Roaring::readSafe(serializedbytes, expectedsize); // safe, length-checked Always free with roaring_bitmap_free or destructor.
Citations:
- 1: safety guarantees about deserialize_safe & internal_validate functions RoaringBitmap/CRoaring#675
- 2: https://github.com/RoaringBitmap/CRoaring
- 3: roaring_bitmap_portable_deserialize_safe avoids overflows during deserialization but it does not validate the input RoaringBitmap/CRoaring#324
- 4: http://roaringbitmap.github.io/CRoaring/
- 5: https://roaringbitmap.github.io/CRoaring/
- 6: https://github.com/RoaringBitmap/CRoaring/blob/master/README.md
- 7: https://github.com/RoaringBitmap/CRoaring/blob/master/cpp/roaring/roaring.hh
- 8: http://roaringbitmap.github.io/CRoaring/dc/d56/roaring_8h.html
🌐 Web query:
roaring C++ non-portable format deserialize safe API CRoaring
💡 Result:
In CRoaring's C++ API (roaring::Roaring class in roaring.hh), the primary deserialization methods are static Roaring read(const char* buf, bool portable = true) and static Roaring readSafe(const char* buf, size_t maxbytes, bool portable = true). These support both portable (cross-platform, default) and non-portable formats (when portable=false). The non-portable format is architecture-specific, faster, but not safe for sharing across different systems or endianness. There is no direct "non-portable format deserialize safe API". The "safe" aspect refers to readSafe limiting reads to maxbytes to prevent buffer overflows, but it assumes valid input and does not fully validate content (use roaring_bitmap_internal_validate afterward for untrusted data). A separate "frozen" deserialization exists for read-only "frozen views" (e.g., roaring_bitmap_portable_deserialize_frozen in C, added to C++ via PR #544), which points directly to the buffer without copying. It is explicitly unsafe: may perform unaligned memory accesses (risky on some platforms) and can buffer overflow if buf is invalid. It uses the portable serialization format but creates a non-copying frozen bitmap, suitable for memory-mapped files but only for readonly operations. Standard deserialize (read/portable_deserialize) is fully safe if input is trusted/valid, copies data into aligned memory. For untrusted input, use *_safe variants + validation. No purely "non-portable safe" API exists due to alignment/portability risks.
Citations:
- 1: https://github.com/RoaringBitmap/CRoaring
- 2: https://docs.rs/croaring-sys/latest/croaring_sys/fn.roaring_bitmap_portable_deserialize_frozen.html
- 3: https://github.com/RoaringBitmap/CRoaring/blob/master/include/roaring/roaring.h
- 4: https://roaringbitmap.github.io/CRoaring/
- 5: https://github.com/RoaringBitmap/CRoaring/blob/master/cpp/roaring/roaring.hh
- 6: https://github.com/RoaringBitmap/CRoaring/blob/master/README.md
- 7: Add support for portable_deserialize_frozen to C++ API RoaringBitmap/CRoaring#544
- 8: Add roaring_bitmap_portable_deserialize_frozen RoaringBitmap/CRoaring#421
🏁 Script executed:
# Check the roaring.hh header if available in includes or search for read() method signature
rg -n 'read\s*\(' src/support/bitmap.h -A 5 -B 5
# Also search for how non-portable is handled
rg -n 'portable' src/index/kotatsu_adapters.h -B 3 -A 3Repository: clice-io/clice
Length of output: 330
Use readSafe for length-checked deserialization of untrusted FlatBuffer data.
Line 107 calls clice::Bitmap::read() without passing buffer length, creating a buffer overflow vulnerability when deserializing untrusted FlatBuffer blobs. The Roaring reader can consume past the vector boundary if the input is malformed.
Switch to readSafe, which accepts both buffer pointer and length and prevents out-of-bounds reads:
Proposed fix
- return clice::Bitmap::read(reinterpret_cast<const char*>(bytes.data()), false);
+ return clice::Bitmap::readSafe(
+ reinterpret_cast<const char*>(bytes.data()),
+ bytes.size(),
+ false);The third parameter (false) preserves the non-portable format used in the wire protocol (as documented on line 83).
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| static clice::Bitmap deserialize(std::vector<std::byte> bytes) { | |
| if(bytes.empty()) { | |
| return clice::Bitmap(); | |
| } | |
| return clice::Bitmap::read(reinterpret_cast<const char*>(bytes.data()), false); | |
| } | |
| static clice::Bitmap deserialize(std::vector<std::byte> bytes) { | |
| if(bytes.empty()) { | |
| return clice::Bitmap(); | |
| } | |
| return clice::Bitmap::readSafe( | |
| reinterpret_cast<const char*>(bytes.data()), | |
| bytes.size(), | |
| false); | |
| } |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/index/kotatsu_adapters.h` around lines 103 - 108, The deserialize
function currently calls clice::Bitmap::read(reinterpret_cast<const
char*>(bytes.data()), false) which can read past the vector bounds for
malformed/untrusted FlatBuffer input; replace that call with the length-checked
clice::Bitmap::readSafe by passing the buffer pointer and bytes.size() (retain
the third parameter false to preserve the non-portable format) so deserialize
safely prevents out-of-bounds reads when reconstructing a clice::Bitmap from
std::vector<std::byte>.
| // Rebuild the ref count table from the already-loaded contexts. | ||
| auto& index = *self.impl; | ||
| index.canonical_ref_counts.clear(); | ||
| index.canonical_ref_counts.resize(index.max_canonical_id, 0); | ||
|
|
||
| for(auto entry: *root->header_contexts()) { | ||
| HeaderContext context; | ||
| auto path = entry->path_id(); | ||
| context.version = entry->version(); | ||
| for(auto include: *entry->includes()) { | ||
| index.canonical_ref_counts[include->canonical_id()] += 1; | ||
| context.includes.emplace_back(*safe_cast<IncludeContext>(include)); | ||
| for(auto& [_, ctx]: index.header_contexts) { | ||
| for(auto& inc: ctx.includes) { | ||
| index.canonical_ref_counts[inc.canonical_id] += 1; | ||
| } | ||
| index.header_contexts.try_emplace(path, std::move(context)); | ||
| } | ||
|
|
||
| for(auto entry: *root->compilation_contexts()) { | ||
| CompilationContext context; | ||
| auto path = entry->path_id(); | ||
| context.version = entry->version(); | ||
| context.canonical_id = entry->canonical_id(); | ||
| context.build_at = entry->build_at(); | ||
| for(auto include: *entry->include_locations()) { | ||
| context.include_locations.emplace_back(*safe_cast<IncludeLocation>(include)); | ||
| } | ||
| index.compilation_contexts.try_emplace(path, std::move(context)); | ||
| } | ||
|
|
||
| // Count ref counts from compilation contexts. | ||
| for(auto entry: *root->compilation_contexts()) { | ||
| index.canonical_ref_counts[entry->canonical_id()] += 1; | ||
| } | ||
|
|
||
| // Deserialize removed bitmap. | ||
| if(root->removed() && root->removed()->size() > 0) { | ||
| index.removed = read_bitmap(root->removed()); | ||
| } | ||
|
|
||
| for(auto entry: *root->occurrences()) { | ||
| index.occurrences.try_emplace(*safe_cast<Occurrence>(entry->occurrence()), | ||
| read_bitmap(entry->context())); | ||
| } | ||
|
|
||
| for(auto entry: *root->relations()) { | ||
| auto& relations = index.relations[entry->symbol()]; | ||
| for(auto relation_entry: *entry->relations()) { | ||
| relations.try_emplace(*safe_cast<Relation>(relation_entry->relation()), | ||
| read_bitmap(relation_entry->context())); | ||
| } | ||
| } | ||
|
|
||
| if(root->content()) { | ||
| index.content = root->content()->str(); | ||
| for(auto& [_, ctx]: index.compilation_contexts) { | ||
| index.canonical_ref_counts[ctx.canonical_id] += 1; | ||
| } |
There was a problem hiding this comment.
Validate canonical IDs before rebuilding ref counts.
inc.canonical_id and ctx.canonical_id come from the decoded buffer. If either is >= index.max_canonical_id, Lines 232 or 236 index past canonical_ref_counts.
🛡️ Proposed fix
for(auto& [_, ctx]: index.header_contexts) {
for(auto& inc: ctx.includes) {
+ if(inc.canonical_id >= index.canonical_ref_counts.size()) {
+ self.impl.reset();
+ self.buffer.reset();
+ return;
+ }
index.canonical_ref_counts[inc.canonical_id] += 1;
}
}
for(auto& [_, ctx]: index.compilation_contexts) {
+ if(ctx.canonical_id >= index.canonical_ref_counts.size()) {
+ self.impl.reset();
+ self.buffer.reset();
+ return;
+ }
index.canonical_ref_counts[ctx.canonical_id] += 1;
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| // Rebuild the ref count table from the already-loaded contexts. | |
| auto& index = *self.impl; | |
| index.canonical_ref_counts.clear(); | |
| index.canonical_ref_counts.resize(index.max_canonical_id, 0); | |
| for(auto entry: *root->header_contexts()) { | |
| HeaderContext context; | |
| auto path = entry->path_id(); | |
| context.version = entry->version(); | |
| for(auto include: *entry->includes()) { | |
| index.canonical_ref_counts[include->canonical_id()] += 1; | |
| context.includes.emplace_back(*safe_cast<IncludeContext>(include)); | |
| for(auto& [_, ctx]: index.header_contexts) { | |
| for(auto& inc: ctx.includes) { | |
| index.canonical_ref_counts[inc.canonical_id] += 1; | |
| } | |
| index.header_contexts.try_emplace(path, std::move(context)); | |
| } | |
| for(auto entry: *root->compilation_contexts()) { | |
| CompilationContext context; | |
| auto path = entry->path_id(); | |
| context.version = entry->version(); | |
| context.canonical_id = entry->canonical_id(); | |
| context.build_at = entry->build_at(); | |
| for(auto include: *entry->include_locations()) { | |
| context.include_locations.emplace_back(*safe_cast<IncludeLocation>(include)); | |
| } | |
| index.compilation_contexts.try_emplace(path, std::move(context)); | |
| } | |
| // Count ref counts from compilation contexts. | |
| for(auto entry: *root->compilation_contexts()) { | |
| index.canonical_ref_counts[entry->canonical_id()] += 1; | |
| } | |
| // Deserialize removed bitmap. | |
| if(root->removed() && root->removed()->size() > 0) { | |
| index.removed = read_bitmap(root->removed()); | |
| } | |
| for(auto entry: *root->occurrences()) { | |
| index.occurrences.try_emplace(*safe_cast<Occurrence>(entry->occurrence()), | |
| read_bitmap(entry->context())); | |
| } | |
| for(auto entry: *root->relations()) { | |
| auto& relations = index.relations[entry->symbol()]; | |
| for(auto relation_entry: *entry->relations()) { | |
| relations.try_emplace(*safe_cast<Relation>(relation_entry->relation()), | |
| read_bitmap(relation_entry->context())); | |
| } | |
| } | |
| if(root->content()) { | |
| index.content = root->content()->str(); | |
| for(auto& [_, ctx]: index.compilation_contexts) { | |
| index.canonical_ref_counts[ctx.canonical_id] += 1; | |
| } | |
| // Rebuild the ref count table from the already-loaded contexts. | |
| auto& index = *self.impl; | |
| index.canonical_ref_counts.clear(); | |
| index.canonical_ref_counts.resize(index.max_canonical_id, 0); | |
| for(auto& [_, ctx]: index.header_contexts) { | |
| for(auto& inc: ctx.includes) { | |
| if(inc.canonical_id >= index.canonical_ref_counts.size()) { | |
| self.impl.reset(); | |
| self.buffer.reset(); | |
| return; | |
| } | |
| index.canonical_ref_counts[inc.canonical_id] += 1; | |
| } | |
| } | |
| for(auto& [_, ctx]: index.compilation_contexts) { | |
| if(ctx.canonical_id >= index.canonical_ref_counts.size()) { | |
| self.impl.reset(); | |
| self.buffer.reset(); | |
| return; | |
| } | |
| index.canonical_ref_counts[ctx.canonical_id] += 1; | |
| } |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/index/merged_index.cpp` around lines 226 - 237, The code rebuilds
index.canonical_ref_counts using inc.canonical_id and ctx.canonical_id without
bounds checks, risking out-of-range access if decoded IDs >=
index.max_canonical_id; update the loops over index.header_contexts and
index.compilation_contexts to validate each canonical ID before incrementing
(e.g., check inc.canonical_id < index.max_canonical_id and ctx.canonical_id <
index.max_canonical_id), skip or log invalid IDs instead of indexing, and ensure
any counts use the same index.canonical_ref_counts container so no UB occurs.
…tion Replace the flatc-generated serialization layer with kotatsu's arena codec driven directly by the in-memory index types. No hand-written DTOs: the on-wire layout is derived from reflection over the existing structs, with type-level customization where needed. - Drop `schema.fbs`, `serialization.h`, and the flatc build step - Delete `wire_types.h` — no more parallel wire representation - Add `kotatsu_adapters.h` with `kota::codec::type_adapter<T>` specializations for RelationKind, SymbolKind, Bitmap, and std::chrono::milliseconds - Mark runtime-only FileID-keyed maps with `kota::meta::skip<>` so they are excluded from reflection slots; serialize via `main_file_index` and `path_file_indices` (keyed by path id) - Restore MergedIndex's dual dispatch: in-memory path when `impl` is live, lazy flatbuffers path via `kfb::table_view<Impl>::from_bytes()` and `root[&Impl::field]` proxy access when only the buffer is held - Add default member initializers to LocalSourceRange, padding field to Relation, and a path_id lookup struct to IncludeLocation so reflection picks up all stored state - Propagate buffer size through `TUIndex::from` / `ProjectIndex::from` (kota codec requires an explicit size for bounds verification) All 551 unit tests pass; 9 environment-gated integration tests skipped. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
94cfdd9 to
d241ea8
Compare
There was a problem hiding this comment.
Actionable comments posted: 2
♻️ Duplicate comments (6)
src/index/tu_index.h (1)
40-42:⚠️ Potential issue | 🟠 MajorKeep
Relationcomparison aligned with its logical key.The defaulted operators include
padding, while relation hashing/equality and dedupe ignore it. That can make logically identical relations compare differently when the codec orders map keys.♻️ Proposed fix
- friend bool operator==(const Relation&, const Relation&) = default; - - friend auto operator<=>(const Relation&, const Relation&) = default; + friend constexpr bool operator==(const Relation& lhs, const Relation& rhs) { + return lhs.kind == rhs.kind && lhs.range == rhs.range && + lhs.target_symbol == rhs.target_symbol; + } + + friend constexpr auto operator<=>(const Relation& lhs, const Relation& rhs) { + if(auto cmp = lhs.kind <=> rhs.kind; cmp != 0) { + return cmp; + } + if(auto cmp = lhs.range <=> rhs.range; cmp != 0) { + return cmp; + } + return lhs.target_symbol <=> rhs.target_symbol; + }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/index/tu_index.h` around lines 40 - 42, Defaulted Relation comparison operators include padding bytes causing mismatches with the existing logical-key-based hashing/equality/dedupe; replace the defaulted friend operator== and operator<=> for Relation with explicit implementations that compare only the logical key members used by hashing/dedupe (i.e., exclude padding and any fields not part of the logical identity), so operator== and operator<=> are consistent with the relation hashing/equality/dedupe behavior.src/index/tu_index.cpp (1)
216-219:⚠️ Potential issue | 🔴 CriticalHandle FlatBuffer serialization failure outside
assert.Line 218 is compiled out in release builds, but Line 219 still dereferences
bytes. Ifkfb::to_flatbuffer()fails, this becomes a release-only crash/corrupt write.🛡️ Proposed fix
`#include` "kota/codec/flatbuffers/deserializer.h" `#include` "kota/codec/flatbuffers/serializer.h" +#include "llvm/Support/ErrorHandling.h" `#include` "llvm/Support/SHA256.h" @@ void TUIndex::serialize(llvm::raw_ostream& os) { auto bytes = kfb::to_flatbuffer(*this); - assert(bytes && "TUIndex flatbuffer serialization failed"); + if(!bytes) { + llvm::report_fatal_error("TUIndex flatbuffer serialization failed"); + } os.write(reinterpret_cast<const char*>(bytes->data()), bytes->size()); }#!/bin/bash # Verify all kotatsu flatbuffer serialization call sites handle a falsy result # before dereferencing bytes->data()/bytes->size(). rg -n -C3 'kfb::to_flatbuffer|assert\(bytes|bytes->data\(\)|bytes->size\(\)' --iglob '*.cpp'🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/index/tu_index.cpp` around lines 216 - 219, TUIndex::serialize currently asserts the result of kfb::to_flatbuffer but still dereferences bytes in release builds (causing crashes); change the logic to check if bytes is non-null before using it: call kfb::to_flatbuffer(*this), if the returned pointer (bytes) is null log or return gracefully (avoid dereferencing bytes->data()/bytes->size()), otherwise call os.write with bytes->data() and bytes->size(); update error handling around TUIndex::serialize (and any other call sites using kfb::to_flatbuffer) to perform a null-check on the returned value instead of relying solely on assert.src/index/merged_index.cpp (4)
219-224:⚠️ Potential issue | 🟡 MinorDon’t silently turn a bad on-disk index into an empty one.
If
kfb::from_flatbuffer()fails, this resetsbufferand leavesimplas an empty index, so callers cannot distinguish corruption/version mismatch from legitimate empty state.Consider logging the failure and recording an explicit load status so
lookup,need_update, andcontentcan trigger rebuild behavior instead of silently proceeding.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/index/merged_index.cpp` around lines 219 - 224, The code currently resets self.buffer but leaves self.impl as an empty index when kfb::from_flatbuffer(bytes, *self.impl) fails, hiding corruption/version errors; update the load path to (1) log the failure (include the error from kfb::from_flatbuffer) and do not silently leave impl as a valid-but-empty index, and (2) set an explicit load status flag on the index object (e.g., enum LoadStatus {OK, CORRUPT, VERSION_MISMATCH, EMPTY}) so that methods like lookup, need_update, and content can detect CORRUPT/VERSION_MISMATCH and trigger rebuild behavior. Locate the error handling around kfb::from_flatbuffer and add the logging and status assignment there, and update the lookup/need_update/content handlers to check that status before treating the index as legitimately empty.
226-237:⚠️ Potential issue | 🟠 MajorValidate decoded canonical IDs before rebuilding ref counts.
Lines 232 and 236 index
canonical_ref_countswith IDs from the decoded buffer. A corrupted or mismatched index file can producecanonical_id >= max_canonical_idand cause out-of-bounds access.🛡️ Proposed fix
for(auto& [_, ctx]: index.header_contexts) { for(auto& inc: ctx.includes) { + if(inc.canonical_id >= index.canonical_ref_counts.size()) { + self.impl.reset(); + self.buffer.reset(); + return; + } index.canonical_ref_counts[inc.canonical_id] += 1; } } for(auto& [_, ctx]: index.compilation_contexts) { + if(ctx.canonical_id >= index.canonical_ref_counts.size()) { + self.impl.reset(); + self.buffer.reset(); + return; + } index.canonical_ref_counts[ctx.canonical_id] += 1; }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/index/merged_index.cpp` around lines 226 - 237, When rebuilding index.canonical_ref_counts, validate decoded canonical IDs before using them to avoid out-of-bounds access: in the loops over index.header_contexts (ctx.includes -> inc.canonical_id) and index.compilation_contexts (ctx.canonical_id) ensure each id is within [0, index.max_canonical_id) (and non-negative) before doing index.canonical_ref_counts[id] += 1; if an id is invalid, skip the increment and emit a debug/error log mentioning the offending canonical id and the owning context to aid debugging.
261-263:⚠️ Potential issue | 🔴 CriticalHandle FlatBuffer serialization failure in release builds.
Line 262 disappears under
NDEBUG, but Line 263 still dereferencesbytes. A failedkfb::to_flatbuffer()can crash or write malformed index data.🛡️ Proposed fix
`#include` "llvm/ADT/DenseSet.h" +#include "llvm/Support/ErrorHandling.h" `#include` "llvm/Support/raw_os_ostream.h" @@ auto bytes = kfb::to_flatbuffer(*self.impl); - assert(bytes && "MergedIndex flatbuffer serialization failed"); + if(!bytes) { + llvm::report_fatal_error("MergedIndex flatbuffer serialization failed"); + } out.write(reinterpret_cast<const char*>(bytes->data()), bytes->size());#!/bin/bash # Verify all kotatsu flatbuffer serialization call sites handle a falsy result # before dereferencing bytes->data()/bytes->size(). rg -n -C3 'kfb::to_flatbuffer|assert\(bytes|bytes->data\(\)|bytes->size\(\)' --iglob '*.cpp'🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/index/merged_index.cpp` around lines 261 - 263, The current code calls kfb::to_flatbuffer(*self.impl) and then unconditionally dereferences bytes->data()/bytes->size() after an assert, which disappears in release builds; modify the MergedIndex serialization site (the bytes local returned from kfb::to_flatbuffer in merged_index.cpp) to explicitly check that bytes is non-null before calling out.write, and on failure either log an error and return/throw an error code (or otherwise abort safe serialization) rather than dereferencing; update the code paths around kfb::to_flatbuffer, the bytes variable and the out.write call to handle a falsy result deterministically in release builds.
339-347:⚠️ Potential issue | 🟠 MajorFilter removed canonical IDs in lazy lookups.
The in-memory paths subtract
removedbefore invoking callbacks, but both lazy paths call the callback using the raw stored bitmap. Removed contexts can therefore leak stale occurrences/relations until materialization.🛡️ Proposed fix sketch
for(; lo < count; ++lo) { auto entry = entries.at(lo); auto occurrence = read_occurrence(entry.template get<0>()); if(!occurrence.range.contains(offset)) { break; } + auto remaining = entry.template get<1>() - root[&Impl::removed]; + if(remaining.isEmpty()) { + continue; + } if(!callback(occurrence)) { break; } } @@ - auto inner = entry->template get<1>(); + auto removed = root[&Impl::removed]; + auto inner = entry->template get<1>(); const std::size_t count = inner.size(); for(std::size_t i = 0; i < count; ++i) { - auto rel_view = inner.at(i).template get<0>(); + auto relation_entry = inner.at(i); + auto rel_view = relation_entry.template get<0>(); @@ if(relation_kind & kind) { + auto remaining = relation_entry.template get<1>() - removed; + if(remaining.isEmpty()) { + continue; + } auto range_view = rel_view[&Relation::range];Also applies to: 389-403
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/index/merged_index.cpp` around lines 339 - 347, The lazy lookup loops in merged_index.cpp call callback with occurrences constructed from the stored bitmap (via read_occurrence) without subtracting the in-memory removed set, allowing removed canonical IDs to leak; update both lazy paths (the loop using read_occurrence at the shown block and the similar loop around lines 389-403) to filter the occurrence's canonical-id bitmap by removing the current removed set before checking contains(offset) and before invoking callback (e.g., apply the same subtraction logic that the in-memory path uses so occurrence.canonical_ids no longer includes removed IDs prior to callback).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/index/merged_index.cpp`:
- Around line 451-456: The code indexes path_mapping[path_id] using a path_id
decoded from on-disk data (from include_locations and used with deps.insert);
add a bounds check before any access to path_mapping: verify 0 <= path_id <
path_mapping.size(), and if out of range avoid dereferencing and treat it as a
stale/on-disk mismatch (e.g., set the "needs update/rebuild" flag or return an
error path), otherwise proceed to call fs::status and continue as before; update
the logic around include_locations, path_id, deps.insert and fs::status so
out-of-range IDs never index path_mapping.
- Around line 325-337: The binary search in the lazy flatbuffer path uses
entries (from Impl::occurrences / DenseMap) and compares
range_view[&LocalSourceRange::end] < offset assuming entries are sorted by range
end, but DenseMap iteration is unordered, so materialize the occurrences into a
vector (similar to occurrences_cache), sort it by the same key used in the
in-memory path (begin, end, target) or at least by range end, then run the
existing binary search/ lower_bound on that sorted vector; update the code paths
around entries, Impl::occurrences, and the binary search loop to operate on the
sorted materialized container so the lookup by LocalSourceRange::end is correct
for nested ranges.
---
Duplicate comments:
In `@src/index/merged_index.cpp`:
- Around line 219-224: The code currently resets self.buffer but leaves
self.impl as an empty index when kfb::from_flatbuffer(bytes, *self.impl) fails,
hiding corruption/version errors; update the load path to (1) log the failure
(include the error from kfb::from_flatbuffer) and do not silently leave impl as
a valid-but-empty index, and (2) set an explicit load status flag on the index
object (e.g., enum LoadStatus {OK, CORRUPT, VERSION_MISMATCH, EMPTY}) so that
methods like lookup, need_update, and content can detect
CORRUPT/VERSION_MISMATCH and trigger rebuild behavior. Locate the error handling
around kfb::from_flatbuffer and add the logging and status assignment there, and
update the lookup/need_update/content handlers to check that status before
treating the index as legitimately empty.
- Around line 226-237: When rebuilding index.canonical_ref_counts, validate
decoded canonical IDs before using them to avoid out-of-bounds access: in the
loops over index.header_contexts (ctx.includes -> inc.canonical_id) and
index.compilation_contexts (ctx.canonical_id) ensure each id is within [0,
index.max_canonical_id) (and non-negative) before doing
index.canonical_ref_counts[id] += 1; if an id is invalid, skip the increment and
emit a debug/error log mentioning the offending canonical id and the owning
context to aid debugging.
- Around line 261-263: The current code calls kfb::to_flatbuffer(*self.impl) and
then unconditionally dereferences bytes->data()/bytes->size() after an assert,
which disappears in release builds; modify the MergedIndex serialization site
(the bytes local returned from kfb::to_flatbuffer in merged_index.cpp) to
explicitly check that bytes is non-null before calling out.write, and on failure
either log an error and return/throw an error code (or otherwise abort safe
serialization) rather than dereferencing; update the code paths around
kfb::to_flatbuffer, the bytes variable and the out.write call to handle a falsy
result deterministically in release builds.
- Around line 339-347: The lazy lookup loops in merged_index.cpp call callback
with occurrences constructed from the stored bitmap (via read_occurrence)
without subtracting the in-memory removed set, allowing removed canonical IDs to
leak; update both lazy paths (the loop using read_occurrence at the shown block
and the similar loop around lines 389-403) to filter the occurrence's
canonical-id bitmap by removing the current removed set before checking
contains(offset) and before invoking callback (e.g., apply the same subtraction
logic that the in-memory path uses so occurrence.canonical_ids no longer
includes removed IDs prior to callback).
In `@src/index/tu_index.cpp`:
- Around line 216-219: TUIndex::serialize currently asserts the result of
kfb::to_flatbuffer but still dereferences bytes in release builds (causing
crashes); change the logic to check if bytes is non-null before using it: call
kfb::to_flatbuffer(*this), if the returned pointer (bytes) is null log or return
gracefully (avoid dereferencing bytes->data()/bytes->size()), otherwise call
os.write with bytes->data() and bytes->size(); update error handling around
TUIndex::serialize (and any other call sites using kfb::to_flatbuffer) to
perform a null-check on the returned value instead of relying solely on assert.
In `@src/index/tu_index.h`:
- Around line 40-42: Defaulted Relation comparison operators include padding
bytes causing mismatches with the existing logical-key-based
hashing/equality/dedupe; replace the defaulted friend operator== and operator<=>
for Relation with explicit implementations that compare only the logical key
members used by hashing/dedupe (i.e., exclude padding and any fields not part of
the logical identity), so operator== and operator<=> are consistent with the
relation hashing/equality/dedupe behavior.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 07696db5-ed3f-4258-909c-75490b2eda17
📒 Files selected for processing (16)
CMakeLists.txtcmake/package.cmakesrc/index/include_graph.hsrc/index/kotatsu_adapters.hsrc/index/merged_index.cppsrc/index/project_index.cppsrc/index/project_index.hsrc/index/schema.fbssrc/index/serialization.hsrc/index/tu_index.cppsrc/index/tu_index.hsrc/semantic/relation_kind.hsrc/server/compiler.cppsrc/server/indexer.cppsrc/syntax/token.htests/unit/index/project_index_tests.cpp
💤 Files with no reviewable changes (2)
- src/index/serialization.h
- src/index/schema.fbs
✅ Files skipped from review due to trivial changes (3)
- src/syntax/token.h
- src/index/project_index.cpp
- src/index/kotatsu_adapters.h
🚧 Files skipped from review as they are similar to previous changes (5)
- src/semantic/relation_kind.h
- src/server/compiler.cpp
- cmake/package.cmake
- src/server/indexer.cpp
- src/index/include_graph.h
| const std::size_t count = entries.size(); | ||
| std::size_t lo = 0; | ||
| std::size_t hi = count; | ||
| while(lo < hi) { | ||
| auto mid = lo + (hi - lo) / 2; | ||
| auto entry = entries.at(mid); | ||
| auto range_view = entry.template get<0>()[&Occurrence::range]; | ||
| if(range_view[&LocalSourceRange::end] < offset) { | ||
| lo = mid + 1; | ||
| } else { | ||
| hi = mid; | ||
| } | ||
| } |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Verify the ordering mismatch between occurrence sort/key order and lower_bound projection.
rg -n -C4 'operator<=>\(const Occurrence|range\.end|lower_bound\(occurrences|auto mid = lo' --iglob '*.h' --iglob '*.cpp'Repository: clice-io/clice
Length of output: 30034
🏁 Script executed:
sed -n '310,345p' src/index/merged_index.cppRepository: clice-io/clice
Length of output: 1471
🏁 Script executed:
rg -n 'struct Impl|flatbuffer.*Impl' src/index/merged_index.cpp | head -20Repository: clice-io/clice
Length of output: 129
🏁 Script executed:
rg -n 'Impl.*occurrences|occurrences.*push_back|\.serialize' src/index/merged_index.cpp | head -30Repository: clice-io/clice
Length of output: 110
🏁 Script executed:
find src/index -name "*.h" | xargs grep -l 'struct Impl'Repository: clice-io/clice
Length of output: 82
🏁 Script executed:
rg -n -A30 'struct Impl' src/index/merged_index.hRepository: clice-io/clice
Length of output: 904
🏁 Script executed:
rg -n -B5 -A25 'struct Impl {' src/index/merged_index.cppRepository: clice-io/clice
Length of output: 147
🏁 Script executed:
rg -n -B5 -A25 'struct Impl' src/index/merged_index.cpp | head -50Repository: clice-io/clice
Length of output: 40
🏁 Script executed:
rg -n 'namespace kfb|reflected_struct' src/index/merged_index.cpp | head -20Repository: clice-io/clice
Length of output: 103
🏁 Script executed:
rg -n 'serialize.*occurrence|occurrence.*serialize' src/index/merged_index.cpp | head -20Repository: clice-io/clice
Length of output: 40
🏁 Script executed:
sed -n '180,280p' src/index/merged_index.cppRepository: clice-io/clice
Length of output: 3190
🏁 Script executed:
sed -n '240,260p' src/index/merged_index.cppRepository: clice-io/clice
Length of output: 540
🏁 Script executed:
sed -n '1,180p' src/index/merged_index.cppRepository: clice-io/clice
Length of output: 5610
🏁 Script executed:
rg -n -A10 'struct Occurrence' src/index/tu_index.hRepository: clice-io/clice
Length of output: 351
🏁 Script executed:
rg -n 'type.*Range|using Range' src/index/tu_index.h | head -5Repository: clice-io/clice
Length of output: 92
🏁 Script executed:
rg -n 'struct LocalSourceRange|class LocalSourceRange' src/ -A15 | head -40Repository: clice-io/clice
Length of output: 903
🏁 Script executed:
rg -n -B5 -A5 'from_flatbuffer.*impl' src/index/merged_index.cppRepository: clice-io/clice
Length of output: 321
Binary search in lazy flatbuffer path assumes range.end ordering but operates on unsorted DenseMap data.
The manual binary search (lines 325-337) compares range_view[&LocalSourceRange::end] < offset, assuming the entries are sorted by range end offset. However, entries is derived from Impl::occurrences, which is a llvm::DenseMap keyed by Occurrence. DenseMap iteration order is arbitrary (hash table order), not sorted by any field. The in-memory path correctly sorts occurrences_cache by (begin, end, target) before binary search, but the lazy path serializes from an unsorted hash table. This breaks binary search correctness for nested ranges like [10,100] then [20,30]—it may skip matches for offset 50.
Materialize and sort occurrences by the search key before the binary search loop, or use std::ranges::lower_bound with the range.end projection (as the in-memory path does).
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/index/merged_index.cpp` around lines 325 - 337, The binary search in the
lazy flatbuffer path uses entries (from Impl::occurrences / DenseMap) and
compares range_view[&LocalSourceRange::end] < offset assuming entries are sorted
by range end, but DenseMap iteration is unordered, so materialize the
occurrences into a vector (similar to occurrences_cache), sort it by the same
key used in the in-memory path (begin, end, target) or at least by range end,
then run the existing binary search/ lower_bound on that sorted vector; update
the code paths around entries, Impl::occurrences, and the binary search loop to
operate on the sorted materialized container so the lookup by
LocalSourceRange::end is correct for nested ranges.
| auto location = include_locations.at(i); | ||
| auto path_id = location[&IncludeLocation::path_id]; | ||
| auto [_, success] = deps.insert(path_id); | ||
| if(success) { | ||
| fs::file_status status; | ||
| if(auto err = fs::status(path_mapping[location->path_id()], status)) { | ||
| if(auto err = fs::status(path_mapping[path_id], status)) { |
There was a problem hiding this comment.
Bounds-check decoded path IDs before indexing path_mapping.
path_id comes from the on-disk buffer. If it is out of range, Line 456 reads past path_mapping; treat that as needing an update/rebuild instead.
🛡️ Proposed fix
auto path_id = location[&IncludeLocation::path_id];
auto [_, success] = deps.insert(path_id);
if(success) {
+ if(path_id >= path_mapping.size()) {
+ return true;
+ }
fs::file_status status;
if(auto err = fs::status(path_mapping[path_id], status)) {
return true;
}🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/index/merged_index.cpp` around lines 451 - 456, The code indexes
path_mapping[path_id] using a path_id decoded from on-disk data (from
include_locations and used with deps.insert); add a bounds check before any
access to path_mapping: verify 0 <= path_id < path_mapping.size(), and if out of
range avoid dereferencing and treat it as a stale/on-disk mismatch (e.g., set
the "needs update/rebuild" flag or return an error path), otherwise proceed to
call fs::status and continue as before; update the logic around
include_locations, path_id, deps.insert and fs::status so out-of-range IDs never
index path_mapping.
Summary
Replace the flatc-generated FlatBuffers serialization with kotatsu's arena codec driven by reflection + wire DTOs.
Relation/Symbol/TUIndex/ProjectIndex/MergedIndex::Impl) stay in their natural LLVM-container shapesrc/index/wire_types.hwith<>adapters insrc/index/kotatsu_adapters.h(RelationKind, SymbolKind, Bitmap, milliseconds)MergedIndex's dual in-memory/buffer dispatch paths collapsed into a single path viaload_in_memory(), simplifyinglookup/need_update/contentTUIndex::from/ProjectIndex::fromnow take(data, size)since the arena codec needs an explicit size for bounds verificationDrops
src/index/schema.fbs(173 lines)src/index/serialization.h(79 lines)CMakeLists.txtflatbuffersFetchContent dep (now comes transitively viakota::codec::flatbuffers)Numbers
Dependencies
Depends on the kotatsu branch currently pinned as
refactor/flatbuffers-schema-drivenincmake/package.cmake— once the corresponding kotatsu PR is merged, we'll repoint this atmain.Test plan
unit_tests --test-dir=./tests/data— 551 passed / 9 skipped🤖 Generated with Claude Code
Summary by CodeRabbit