Skip to content

refactor(index): migrate FlatBuffers from flatc IDL to kotatsu reflection#430

Open
16bit-ykiko wants to merge 1 commit into
mainfrom
migrate/flatbuffers-to-kotatsu-reflection
Open

refactor(index): migrate FlatBuffers from flatc IDL to kotatsu reflection#430
16bit-ykiko wants to merge 1 commit into
mainfrom
migrate/flatbuffers-to-kotatsu-reflection

Conversation

@16bit-ykiko

@16bit-ykiko 16bit-ykiko commented Apr 19, 2026

Copy link
Copy Markdown
Member

Summary

Replace the flatc-generated FlatBuffers serialization with kotatsu's arena codec driven by reflection + wire DTOs.

  • Runtime types (Relation / Symbol / TUIndex / ProjectIndex / MergedIndex::Impl) stay in their natural LLVM-container shape
  • On-wire layout is declared via reflectable structs in src/index/wire_types.h
  • Non-trivial leaves bridged by with<> adapters in src/index/kotatsu_adapters.h (RelationKind, SymbolKind, Bitmap, milliseconds)
  • MergedIndex's dual in-memory/buffer dispatch paths collapsed into a single path via load_in_memory(), simplifying lookup / need_update / content
  • TUIndex::from / ProjectIndex::from now take (data, size) since the arena codec needs an explicit size for bounds verification

Drops

  • src/index/schema.fbs (173 lines)
  • src/index/serialization.h (79 lines)
  • flatc codegen custom target + generated header include dir in CMakeLists.txt
  • flatbuffers FetchContent dep (now comes transitively via kota::codec::flatbuffers)

Numbers

  • Net -264 LoC, 14 files changed
  • 551/551 unit tests pass (9 skipped, matching main)

Dependencies

Depends on the kotatsu branch currently pinned as refactor/flatbuffers-schema-driven in cmake/package.cmake — once the corresponding kotatsu PR is merged, we'll repoint this at main.

Test plan

  • RelWithDebInfo build clean (clice-core + unit_tests)
  • unit_tests --test-dir=./tests/data — 551 passed / 9 skipped
  • Round-trip testing on real-world index directories

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Refactor
    • Switched index serialization to a centralized codec with safer, size-aware deserialization, rebuilt in-memory caches, improved lazy-buffer lookup, and added defaulted ordering for several domain types.
  • Chores
    • Removed local FlatBuffers schema/generation and consolidated FlatBuffers support through the centralized codec provider.
  • API
    • Deserialization APIs now require an explicit size parameter; some serialize methods are no longer const.
  • Tests
    • Updated unit tests to use the size-aware deserialization API.

@coderabbitai

coderabbitai Bot commented Apr 19, 2026

Copy link
Copy Markdown
📝 Walkthrough

Walkthrough

Replaces in-repo FlatBuffers schema and manual builders with kotatsu's flatbuffers codec and arena adapters. Removes schema generation from CMake and the .fbs file, adds codec trait specializations, updates serialization APIs to size-aware signatures, and annotates several runtime-only maps with kota::meta::skip.

Changes

Cohort / File(s) Summary
CMake Build Configuration
CMakeLists.txt, cmake/package.cmake
Removed in-repo FlatBuffers codegen and direct flatbuffers FetchContent; switched kotatsu source to refactor/flatbuffers-schema-driven and enabled KOTA_CODEC_ENABLE_FLATBUFFERS.
Codec Adapter Additions
src/index/kotatsu_adapters.h
New kotatsu arena serialize_traits/deserialize_traits for std::chrono::milliseconds, clice::RelationKind, clice::SymbolKind, and clice::Bitmap implementing wire types and conversion logic.
Index Serialization Refactor
src/index/merged_index.cpp, src/index/project_index.cpp, src/index/tu_index.cpp, src/index/project_index.h, src/index/tu_index.h
Replaced manual FlatBuffers builders/readers with kota::codec::flatbuffers (kfb::to_flatbuffer / kfb::from_flatbuffer); changed from APIs to accept (data, size); switched some runtime-only members to kota::meta::skip; simplified load/serialize logic.
Removed FlatBuffers Artifacts
src/index/schema.fbs, src/index/serialization.h
Deleted the FlatBuffers .fbs schema and internal serialization helpers header; removed manual wire-format helpers and types.
Call Sites & Tests
src/server/compiler.cpp, src/server/indexer.cpp, tests/unit/index/project_index_tests.cpp
Updated deserialization call sites and tests to pass explicit buffer sizes to TUIndex::from / ProjectIndex::from.
Runtime Annotations & Comparisons
src/index/include_graph.h, src/semantic/relation_kind.h, src/syntax/token.h
Wrapped runtime-only maps in kota::meta::skip; added three-way comparison operators (operator<=>) for RelationKind, LocalSourceRange, and defaulted comparisons for Relation/Occurrence.

Sequence Diagram

sequenceDiagram
    participant App as Application
    participant Ser as Serialization Layer
    participant Adapter as Type Adapters
    participant Codec as Kotatsu Codec
    participant Buf as FlatBuffer Bytes

    rect rgba(100, 149, 237, 0.5)
    Note over App,Buf: Serialization Flow (New)
    App->>Ser: serialize(index) 
    Ser->>Adapter: adapt domain types -> wire types
    Adapter->>Codec: provide arena-backed structures
    Ser->>Codec: kfb::to_flatbuffer(adapted)
    Codec->>Buf: emit FlatBuffer bytes
    Buf->>App: write bytes to stream
    end

    rect rgba(144, 238, 144, 0.5)
    Note over App,Buf: Deserialization Flow (New)
    App->>Buf: read bytes from stream
    App->>Ser: deserialize(bytes, size)
    Ser->>Codec: kfb::from_flatbuffer(bytes, target)
    Codec->>Adapter: expose table_views / wire types
    Adapter->>App: reconstruct domain objects
    end
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Poem

🐰
I nudge the bytes beneath the moonlit rug,
Swap generated fields for cozy kotatsu hug.
Wire maps tucked, runtime scraps set free,
I hop, I nibble, then I serialize with glee! 🎉

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 15.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately summarizes the main refactoring work: migrating from flatc-generated FlatBuffers schemas to kotatsu's reflection-driven codec. This is the primary objective reflected throughout the changeset.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch migrate/flatbuffers-to-kotatsu-reflection

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 7

🧹 Nitpick comments (2)
cmake/package.cmake (1)

30-35: Pin kotatsu to a commit SHA (or tag) rather than a branch for reproducibility.

refactor/flatbuffers-schema-driven is a moving branch HEAD, so with FETCHCONTENT_UPDATES_DISCONNECTED ON different developers / CI runs may silently resolve to different commits depending on when their local FetchContent cache was first populated. The PR description already acknowledges this is temporary, but until the kotatsu PR merges and you can switch to main (or a tag), consider pinning to the exact commit.

📦 Proposed pin
 FetchContent_Declare(
     kotatsu
     GIT_REPOSITORY https://github.com/clice-io/kotatsu
-    GIT_TAG refactor/flatbuffers-schema-driven
+    # TODO: repoint to main / tagged release after upstream kotatsu PR merges.
+    GIT_TAG <commit-sha-from-refactor/flatbuffers-schema-driven>
     GIT_SHALLOW TRUE
 )

Note: with a commit SHA, you'll also want to drop GIT_SHALLOW TRUE or ensure the SHA is reachable from the default branch, since shallow fetch of a non-tip SHA is not universally supported.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cmake/package.cmake` around lines 30 - 35, The FetchContent_Declare block for
kotatsu uses a moving branch in GIT_TAG ("refactor/flatbuffers-schema-driven");
change GIT_TAG to a specific commit SHA (or an immutable tag) so builds are
reproducible, and remove or adjust GIT_SHALLOW TRUE (or ensure the SHA is
reachable from the default branch) so the pinned SHA can be fetched; update the
FetchContent_Declare call for kotatsu accordingly.
src/index/wire_types.h (1)

132-141: MergedIndex::content now forces eager copy of the full file contents.

Previously content was lazily pulled out of the mmapped FlatBuffer root; after this migration, load_in_memory decodes the entire wire::MergedIndex (including content as std::string) up front, and both MergedIndex::content() and need_update() / lookup() now eagerly call load_in_memory(). For large per-file source contents this means more allocation and memory pressure on what used to be very cheap accessors — especially need_update(), which doesn't need content at all.

If this turns out to matter at scale, consider splitting content out of wire::MergedIndex into its own sub-region (or a separate file) so lookup/need_update can decode without materializing it. Not blocking for this PR, but worth tracking.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/index/wire_types.h` around lines 132 - 141, The struct field
MergedIndex::content makes decoding the entire wire::MergedIndex (and therefore
load_in_memory(), need_update(), lookup()) eagerly allocate the full file
contents; to avoid unnecessary eager copies, revert to keeping content out of
the hot decoded path by removing or deferring the std::string content from
MergedIndex and instead expose a lazy accessor that reads the content from the
mmapped FlatBuffer only when explicitly requested (e.g., implement
MergedIndex::get_content() that pulls from the FlatBuffer root on first use or
store a lightweight view/offset to the content region), update load_in_memory()
to not materialize content, and ensure need_update() and lookup() call only the
lazy accessor when they actually need the payload; reference MergedIndex,
content, load_in_memory, need_update, lookup, and wire::MergedIndex when making
the changes.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/index/merged_index.cpp`:
- Around line 410-496: The methods MergedIndex::lookup, MergedIndex::need_update
and content currently call const_cast<Self&>(self).load_in_memory(), introducing
a data race on members buffer/impl; fix by making lazy-init thread-safe: either
mark buffer and impl mutable and change load_in_memory to perform synchronized
one-time initialization using std::once_flag (or a mutex + double-checked init)
so callers no longer need const_cast, or keep them non-mutable and protect
load_in_memory with a mutex/once_flag and remove the const_cast calls in
lookup/need_update/content; ensure you update load_in_memory, remove const_cast
usages, and reference the same buffer and impl symbols consistently to avoid
races.
- Around line 256-262: When kfb::from_flatbuffer(bytes, w) fails inside
load_in_memory, don't silently reset self.buffer and leave impl as an empty
index; instead log the failure (include the index file path and any error/info
from from_flatbuffer) and record a load-status so callers can detect the bad
load. Concretely: in the failure branch after auto result =
kfb::from_flatbuffer(bytes, w) add a processLogger/error log call that prints
the path and result/error, set an explicit flag or enum on the object (e.g.
impl_loaded = false or load_status = LoadStatus::Failed) instead of just leaving
impl empty, and expose an accessor such as is_loaded() or load_status() so
lookup/need_update/content can detect the bad load and trigger a rebuild or
return an appropriate error.
- Around line 405-407: The code currently uses assert on the result of
kfb::to_flatbuffer(w) which is removed in release builds, causing undefined
behavior when bytes is empty; replace the assert with an explicit check of the
optional-like bytes result (from kfb::to_flatbuffer) and handle the failure path
deterministically (e.g., log an error and throw an exception or return an error
code) before calling bytes->data()/bytes->size(); apply the same change pattern
to the analogous sites (project_index.cpp around the bytes/out.write usage and
TUIndex::serialize) so serialization never dereferences an empty value in
release builds.
- Around line 387-400: The inner relation lists are produced by iterating
index.relations[symbol_id] (a DenseMap) which yields nondeterministic order;
before pushing each wire::SymbolRelationsEntry into w.relations, sort its
sre.relations vector by relation (e.g., compare RelationEntry.relation or use
the result of to_wire(relation)) to ensure deterministic on-disk output; apply
the same pattern of collecting then sorting entries when serializing
canonical_cache, header_contexts, and compilation_contexts so each inner
vector/map is stable across runs.

In `@src/index/project_index.cpp`:
- Around line 88-98: The loop writes to pool.paths using deserialized entry.id
without validating bounds, which can crash on malformed or future sparse ids;
change the loop over w.paths so you either (A) validate that entry.id <
pool.paths.size() before doing pool.paths[entry.id] = k and skip or handle
invalid ids (and still update pool.cache.try_emplace(k, entry.id) only for valid
ids), or (B) stop using entry.id entirely and instead assign ids from the vector
position (use the loop index i as the id when calling pool.save and
pool.cache.try_emplace) so writes use a safe, consecutive index; update the
logic around pool.save, pool.paths and pool.cache.try_emplace to reflect the
chosen approach.

In `@src/index/tu_index.cpp`:
- Around line 313-315: Replace the runtime-unsafe assert that assumes
kfb::to_flatbuffer(w) always returns a value: check the returned optional/ptr
(the bytes variable) and if it is null call llvm::report_fatal_error with a
clear message instead of asserting; then proceed to use
bytes->data()/bytes->size() only after the null-check. Add `#include`
"llvm/Support/ErrorHandling.h" to the top of each affected file (tu_index.cpp,
project_index.cpp, merged_index.cpp) and remove the assert(bytes && ...) lines
so release builds won’t dereference a null pointer.

In `@src/index/wire_types.h`:
- Around line 46-53: Remove the unused padding member from the wire type by
deleting the padding field from struct Relation (wire::Relation) in
wire_types.h, and update the serialization/conversion code that maps between
runtime Relation (tu_index) and wire::Relation in tu_index.cpp and
merged_index.cpp so they no longer read, write, or reserve 4 bytes for padding;
adjust any memcpy/size calculations, constructors, and assignment code that
assumed the extra field and update any on-disk/serialization size constants or
header checks that included the padding so the wire layout no longer contains
the unused 4-byte slot.

---

Nitpick comments:
In `@cmake/package.cmake`:
- Around line 30-35: The FetchContent_Declare block for kotatsu uses a moving
branch in GIT_TAG ("refactor/flatbuffers-schema-driven"); change GIT_TAG to a
specific commit SHA (or an immutable tag) so builds are reproducible, and remove
or adjust GIT_SHALLOW TRUE (or ensure the SHA is reachable from the default
branch) so the pinned SHA can be fetched; update the FetchContent_Declare call
for kotatsu accordingly.

In `@src/index/wire_types.h`:
- Around line 132-141: The struct field MergedIndex::content makes decoding the
entire wire::MergedIndex (and therefore load_in_memory(), need_update(),
lookup()) eagerly allocate the full file contents; to avoid unnecessary eager
copies, revert to keeping content out of the hot decoded path by removing or
deferring the std::string content from MergedIndex and instead expose a lazy
accessor that reads the content from the mmapped FlatBuffer only when explicitly
requested (e.g., implement MergedIndex::get_content() that pulls from the
FlatBuffer root on first use or store a lightweight view/offset to the content
region), update load_in_memory() to not materialize content, and ensure
need_update() and lookup() call only the lazy accessor when they actually need
the payload; reference MergedIndex, content, load_in_memory, need_update,
lookup, and wire::MergedIndex when making the changes.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: a56ea3cc-99ce-488e-ac0c-b2f33603a5e0

📥 Commits

Reviewing files that changed from the base of the PR and between 17e6801 and 911702c.

📒 Files selected for processing (14)
  • CMakeLists.txt
  • cmake/package.cmake
  • src/index/kotatsu_adapters.h
  • src/index/merged_index.cpp
  • src/index/project_index.cpp
  • src/index/project_index.h
  • src/index/schema.fbs
  • src/index/serialization.h
  • src/index/tu_index.cpp
  • src/index/tu_index.h
  • src/index/wire_types.h
  • src/server/compiler.cpp
  • src/server/indexer.cpp
  • tests/unit/index/project_index_tests.cpp
💤 Files with no reviewable changes (2)
  • src/index/serialization.h
  • src/index/schema.fbs

Comment thread src/index/merged_index.cpp Outdated
Comment on lines +256 to +262

wire::MergedIndex w;
auto result = kfb::from_flatbuffer(bytes, w);
if(!result) {
self.buffer.reset();
return;
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Silent failure on corrupted/unparseable on-disk index.

If kfb::from_flatbuffer fails, load_in_memory resets the buffer and leaves impl as an empty index — callers (e.g. lookup, need_update, content) then observe an empty index and proceed as if nothing was wrong. For a corrupted or version-mismatched index file this can mask real problems (e.g. a dropped-on-the-floor stale cache is indistinguishable from legitimate empty state). Consider at least logging the failure with the path, and/or exposing a load-status so callers can trigger a rebuild.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/index/merged_index.cpp` around lines 256 - 262, When
kfb::from_flatbuffer(bytes, w) fails inside load_in_memory, don't silently reset
self.buffer and leave impl as an empty index; instead log the failure (include
the index file path and any error/info from from_flatbuffer) and record a
load-status so callers can detect the bad load. Concretely: in the failure
branch after auto result = kfb::from_flatbuffer(bytes, w) add a
processLogger/error log call that prints the path and result/error, set an
explicit flag or enum on the object (e.g. impl_loaded = false or load_status =
LoadStatus::Failed) instead of just leaving impl empty, and expose an accessor
such as is_loaded() or load_status() so lookup/need_update/content can detect
the bad load and trigger a rebuild or return an appropriate error.

Comment thread src/index/merged_index.cpp Outdated
Comment on lines +387 to +400
w.relations.reserve(index.relations.size());
for(auto& [symbol_id, symbol_relations]: index.relations) {
wire::SymbolRelationsEntry sre;
sre.symbol = symbol_id;
sre.relations.reserve(symbol_relations.size());
for(auto& [relation, bitmap]: symbol_relations) {
wire::RelationEntry re;
re.relation = to_wire(relation);
re.context = bitmap;
sre.relations.push_back(std::move(re));
}
w.relations.push_back(std::move(sre));
}
auto removed = CreateVector(builder, buffer);

auto content_offset = CreateString(builder, index->content);
std::ranges::sort(w.relations, {}, [](const auto& e) { return e.symbol; });

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Relations within a symbol aren't sorted — on-disk output is nondeterministic.

w.relations is sorted by symbol, but the inner sre.relations vector is populated in hashmap iteration order (from index.relations[symbol_id], a DenseMap<Relation, Roaring>), which isn't stable across runs. The serialized bytes therefore differ for logically-identical indices, which hurts reproducibility, change detection, and round-trip equality checks.

♻️ Sort inner relations before pushing
         for(auto& [relation, bitmap]: symbol_relations) {
             wire::RelationEntry re;
             re.relation = to_wire(relation);
             re.context = bitmap;
             sre.relations.push_back(std::move(re));
         }
+        std::ranges::sort(sre.relations, [](const auto& lhs, const auto& rhs) {
+            return std::tuple(lhs.relation.kind.value(),
+                              lhs.relation.range.begin,
+                              lhs.relation.range.end,
+                              lhs.relation.target_symbol) <
+                   std::tuple(rhs.relation.kind.value(),
+                              rhs.relation.range.begin,
+                              rhs.relation.range.end,
+                              rhs.relation.target_symbol);
+        });
         w.relations.push_back(std::move(sre));

Same concern applies to canonical_cache, header_contexts, and compilation_contexts, which are all iterated out of llvm::SmallDenseMap/StringMap without any subsequent sort.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/index/merged_index.cpp` around lines 387 - 400, The inner relation lists
are produced by iterating index.relations[symbol_id] (a DenseMap) which yields
nondeterministic order; before pushing each wire::SymbolRelationsEntry into
w.relations, sort its sre.relations vector by relation (e.g., compare
RelationEntry.relation or use the result of to_wire(relation)) to ensure
deterministic on-disk output; apply the same pattern of collecting then sorting
entries when serializing canonical_cache, header_contexts, and
compilation_contexts so each inner vector/map is stable across runs.

Comment thread src/index/merged_index.cpp Outdated
Comment on lines +405 to +407
auto bytes = kfb::to_flatbuffer(w);
assert(bytes && "MergedIndex flatbuffer serialization failed");
out.write(reinterpret_cast<const char*>(bytes->data()), bytes->size());

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

Release-build UB when FlatBuffer serialization fails.

assert(...) is compiled out in release builds, so if kfb::to_flatbuffer(w) returns a falsy/empty result, bytes->data() / bytes->size() will dereference an empty/invalid optional-like value and corrupt the output. Since this is on the persistence path, a silent failure here can produce a truncated or malformed on-disk index that later fails to load (or worse, loads partial data).

Handle the error path explicitly even in release:

🛡️ Proposed fix
-    auto bytes = kfb::to_flatbuffer(w);
-    assert(bytes && "MergedIndex flatbuffer serialization failed");
-    out.write(reinterpret_cast<const char*>(bytes->data()), bytes->size());
+    auto bytes = kfb::to_flatbuffer(w);
+    if(!bytes) {
+        LOG_ERROR("MergedIndex flatbuffer serialization failed");
+        return;
+    }
+    out.write(reinterpret_cast<const char*>(bytes->data()), bytes->size());

Same pattern also applies to src/index/project_index.cpp:70-72 and the analogous TUIndex::serialize.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/index/merged_index.cpp` around lines 405 - 407, The code currently uses
assert on the result of kfb::to_flatbuffer(w) which is removed in release
builds, causing undefined behavior when bytes is empty; replace the assert with
an explicit check of the optional-like bytes result (from kfb::to_flatbuffer)
and handle the failure path deterministically (e.g., log an error and throw an
exception or return an error code) before calling bytes->data()/bytes->size();
apply the same change pattern to the analogous sites (project_index.cpp around
the bytes/out.write usage and TUIndex::serialize) so serialization never
dereferences an empty value in release builds.

Comment thread src/index/merged_index.cpp Outdated
Comment thread src/index/project_index.cpp Outdated
Comment thread src/index/tu_index.cpp Outdated
Comment on lines +313 to +315
auto bytes = kfb::to_flatbuffer(w);
assert(bytes && "TUIndex flatbuffer serialization failed");
os.write(reinterpret_cast<const char*>(bytes->data()), bytes->size());

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify the codec result is nullable/checked elsewhere and locate unchecked dereferences.
rg -n -C4 '\bto_flatbuffer\s*\(' --iglob '*.h' --iglob '*.hpp' --iglob '*.cpp'
rg -n -C3 'auto bytes = kfb::to_flatbuffer|assert\(bytes|bytes->data\(\)' --iglob '*.cpp'

Repository: clice-io/clice

Length of output: 4711


🏁 Script executed:

#!/bin/bash
# Find the definition of kfb::to_flatbuffer to verify its return type
fd -e h -e hpp . | xargs rg -l 'to_flatbuffer' | head -5
rg -n 'to_flatbuffer' --iglob '*.h' --iglob '*.hpp' -B2 -A5 | head -50

Repository: clice-io/clice

Length of output: 40


🏁 Script executed:

#!/bin/bash
# Search more broadly for to_flatbuffer definition
rg 'to_flatbuffer' --type cpp --type h -B2 -A5 | head -80
# Also check what kfb namespace is
rg -n 'namespace.*kfb|using.*kfb' --iglob '*.h' --iglob '*.cpp' | head -20

Repository: clice-io/clice

Length of output: 1775


🏁 Script executed:

#!/bin/bash
# Look for to_flatbuffer in kota codec directory
find . -path '*/kota/codec/flatbuffers*' -type f \( -name '*.h' -o -name '*.hpp' \)
# Also search for to_flatbuffer function in the include directory
rg -n 'to_flatbuffer' . --iglob 'serializer*' -A10

Repository: clice-io/clice

Length of output: 40


🏁 Script executed:

#!/bin/bash
# Check the include statements in the three affected files
echo "=== tu_index.cpp includes ==="
head -30 src/index/tu_index.cpp | grep -n include

echo -e "\n=== project_index.cpp includes ==="
head -30 src/index/project_index.cpp | grep -n include

echo -e "\n=== merged_index.cpp includes ==="
head -30 src/index/merged_index.cpp | grep -n include

Repository: clice-io/clice

Length of output: 1182


🏁 Script executed:

#!/bin/bash
# Check if llvm::report_fatal_error is used elsewhere
echo "=== Checking llvm::report_fatal_error usage ==="
rg 'report_fatal_error' --type cpp --type h

echo -e "\n=== Checking ErrorHandling.h includes ==="
rg 'ErrorHandling\.h' --type cpp --type h

echo -e "\n=== Check if file_count for scope ==="
echo "Files affected by the issue:"
rg -l 'auto bytes = kfb::to_flatbuffer' --type cpp

Repository: clice-io/clice

Length of output: 294


Handle serialization failure in all index serialization paths.

assert(bytes) disappears under NDEBUG, but subsequent dereferencing of bytes persists. If kfb::to_flatbuffer() returns null in release builds, this causes a crash. This issue exists in three files:

  • src/index/tu_index.cpp (lines 313–315)
  • src/index/project_index.cpp (lines 70–72)
  • src/index/merged_index.cpp (lines 405–407)

Replace runtime-unsafe assertions with proper error handling. Use if(!bytes) { llvm::report_fatal_error(...); } and add #include "llvm/Support/ErrorHandling.h" to all three files.

Example fix for tu_index.cpp
 `#include` "kota/codec/flatbuffers/deserializer.h"
 `#include` "kota/codec/flatbuffers/serializer.h"
+#include "llvm/Support/ErrorHandling.h"
 `#include` "llvm/Support/SHA256.h"
     auto bytes = kfb::to_flatbuffer(w);
-    assert(bytes && "TUIndex flatbuffer serialization failed");
+    if (!bytes) {
+        llvm::report_fatal_error("TUIndex flatbuffer serialization failed");
+    }
     os.write(reinterpret_cast<const char*>(bytes->data()), bytes->size());

Apply similar changes to project_index.cpp and merged_index.cpp.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/index/tu_index.cpp` around lines 313 - 315, Replace the runtime-unsafe
assert that assumes kfb::to_flatbuffer(w) always returns a value: check the
returned optional/ptr (the bytes variable) and if it is null call
llvm::report_fatal_error with a clear message instead of asserting; then proceed
to use bytes->data()/bytes->size() only after the null-check. Add `#include`
"llvm/Support/ErrorHandling.h" to the top of each affected file (tu_index.cpp,
project_index.cpp, merged_index.cpp) and remove the assert(bytes && ...) lines
so release builds won’t dereference a null pointer.

Comment thread src/index/wire_types.h Outdated
Comment on lines +46 to +53
/// Relation carries an annotated `RelationKind` so it is encoded as a table
/// rather than an inline struct (annotated fields are not supported inline).
struct Relation {
km::annotation<RelationKind, kmb::with<adapters::RelationKindAdapter>> kind;
std::uint32_t padding = 0;
LocalSourceRange range;
SymbolHash target_symbol = 0;
};

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Inspect the runtime Relation definition to see what `padding` represents.
rg -nP -C2 '\bstruct\s+Relation\b' --type=cpp --type=h
rg -nP -C1 '\bpadding\b' src/index/ src/semantic/

Repository: clice-io/clice

Length of output: 1710


🏁 Script executed:

#!/bin/bash
# Get full definition of runtime Relation struct with surrounding context
cat -n src/index/tu_index.h | sed -n '20,35p'

# Check if padding field is used/accessed anywhere other than in serialization
rg 'Relation.*\.padding\b|\.padding.*=' --type=cpp --type=h -n

# Look for any comments about memory layout or alignment near the struct
rg -B5 -A10 'struct Relation' src/index/tu_index.h

Repository: clice-io/clice

Length of output: 1496


Remove padding field from wire::Relation struct.

The padding field in the runtime Relation struct (tu_index.h:25) exists solely for in-memory alignment and is never accessed outside of serialization. Persisting it to the wire format wastes 4 bytes per relation (potentially millions across an index) and bakes unused reserved bits into the on-disk format, making it awkward to evolve later. Drop it from the wire type definition at wire_types.h:50 and its conversion functions in tu_index.cpp:179 and merged_index.cpp:195.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/index/wire_types.h` around lines 46 - 53, Remove the unused padding
member from the wire type by deleting the padding field from struct Relation
(wire::Relation) in wire_types.h, and update the serialization/conversion code
that maps between runtime Relation (tu_index) and wire::Relation in tu_index.cpp
and merged_index.cpp so they no longer read, write, or reserve 4 bytes for
padding; adjust any memcpy/size calculations, constructors, and assignment code
that assumed the extra field and update any on-disk/serialization size constants
or header checks that included the padding so the wire layout no longer contains
the unused 4-byte slot.

@16bit-ykiko 16bit-ykiko force-pushed the migrate/flatbuffers-to-kotatsu-reflection branch from 911702c to 019f422 Compare April 19, 2026 20:03

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

♻️ Duplicate comments (3)
src/index/tu_index.cpp (1)

216-219: ⚠️ Potential issue | 🔴 Critical

Handle serialization failure before dereferencing bytes.

assert(bytes) disappears under NDEBUG, leaving bytes->data() reachable after a failed to_flatbuffer() call.

🛡️ Proposed fix
+#include "llvm/Support/ErrorHandling.h"
+
 void TUIndex::serialize(llvm::raw_ostream& os) {
     auto bytes = kfb::to_flatbuffer(*this);
-    assert(bytes && "TUIndex flatbuffer serialization failed");
+    if(!bytes) {
+        llvm::report_fatal_error("TUIndex flatbuffer serialization failed");
+    }
     os.write(reinterpret_cast<const char*>(bytes->data()), bytes->size());
 }

Run this read-only check to verify no unchecked FlatBuffer serialization dereferences remain:

#!/bin/bash
rg -n -C3 'auto bytes = kfb::to_flatbuffer|assert\(bytes|bytes->data\(\)' --iglob '*.cpp'
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/index/tu_index.cpp` around lines 216 - 219, TUIndex::serialize currently
dereferences the result of kfb::to_flatbuffer without handling a failure when
asserts are disabled; update the function to check the returned bytes (the value
stored in the local variable bytes from kfb::to_flatbuffer) before calling
bytes->data()/bytes->size(), and handle the error path (e.g., return early,
log/processLogger.error, or throw an exception) so serialization failure is not
dereferenced; ensure the check and error handling are applied in
TUIndex::serialize around the call to os.write to safely guard bytes.
src/index/merged_index.cpp (2)

219-223: ⚠️ Potential issue | 🟡 Minor

Don’t silently turn a failed shard decode into an empty index.

If from_flatbuffer fails, this drops the buffer and leaves an empty impl, making corrupted or version-incompatible shards indistinguishable from legitimate empty shards. At minimum, log the failure and expose a load-failed state so callers can rebuild the shard.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/index/merged_index.cpp` around lines 219 - 223, The current code silently
discards the buffer when kfb::from_flatbuffer(bytes, *self.impl) fails (calling
self.buffer.reset()), which hides corrupted or incompatible shard loads; change
this to log the decode error (include the result/error details) and set an
explicit load-failed state on the shard object instead of making it appear empty
— e.g., add or set a boolean like self.load_failed (or an enum state) and leave
the buffer intact or mark it as invalid for later inspection, and update callers
that check self.impl or buffer to handle the new load-failed state; ensure
logging references buffer_bytes, kfb::from_flatbuffer and the failing shard id
so operators can rebuild the shard.

261-263: ⚠️ Potential issue | 🔴 Critical

Handle serialization failure before dereferencing bytes.

assert(bytes) is compiled out in release builds, but bytes->data() still runs. Use an explicit runtime check before writing.

🛡️ Proposed fix
+#include "llvm/Support/ErrorHandling.h"
+
     auto bytes = kfb::to_flatbuffer(*self.impl);
-    assert(bytes && "MergedIndex flatbuffer serialization failed");
+    if(!bytes) {
+        llvm::report_fatal_error("MergedIndex flatbuffer serialization failed");
+    }
     out.write(reinterpret_cast<const char*>(bytes->data()), bytes->size());

Run this read-only check to verify no unchecked FlatBuffer serialization dereferences remain:

#!/bin/bash
rg -n -C3 'auto bytes = kfb::to_flatbuffer|assert\(bytes|bytes->data\(\)' --iglob '*.cpp'
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/index/merged_index.cpp` around lines 261 - 263, The code currently
dereferences bytes returned from kfb::to_flatbuffer without a runtime null check
(assert(bytes) is compiled out in release); update the block around auto bytes =
kfb::to_flatbuffer(*self.impl) to explicitly check if bytes is null before
calling bytes->data() or bytes->size(), and handle the failure path (e.g.,
return an error/false, throw an exception, or log an error via the same logging
mechanism) instead of proceeding to out.write(reinterpret_cast<const
char*>(bytes->data()), bytes->size()); ensure the check covers both bytes null
and empty payload cases so no unchecked bytes->data() dereference remains.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/index/merged_index.cpp`:
- Around line 339-347: The lazy FlatBuffer lookup paths call the callback
unconditionally and so return stale occurrences/relations that have been
removed; update both occurrence and relation lazy loops to mirror the in-memory
Impl logic: after reading the occurrence/relation key and loading the stored
bitmap via entry.template get<1>() (or the relation inner-map bitmap), subtract
the Impl::removed bitmap (root[&Impl::removed]) and if the resulting bitmap is
empty skip invoking callback (i.e. continue), otherwise proceed to call callback
with the occurrence/relation; locate these changes around the read_occurrence
loop and the relations lookup loop to apply the same filtering.

In `@src/index/tu_index.h`:
- Around line 40-42: Replace the defaulted comparison operators with explicit
implementations for Relation: remove the friend defaulted operator== and
operator<=> and implement operator== and operator<=> that compare only the
logical key members (the same fields used by the existing hash/equality and
dedupe logic) and explicitly exclude the padding member; ensure the
implementation uses the same member set as the Relation hashing function so
ordering and equality are consistent when used as map keys.

---

Duplicate comments:
In `@src/index/merged_index.cpp`:
- Around line 219-223: The current code silently discards the buffer when
kfb::from_flatbuffer(bytes, *self.impl) fails (calling self.buffer.reset()),
which hides corrupted or incompatible shard loads; change this to log the decode
error (include the result/error details) and set an explicit load-failed state
on the shard object instead of making it appear empty — e.g., add or set a
boolean like self.load_failed (or an enum state) and leave the buffer intact or
mark it as invalid for later inspection, and update callers that check self.impl
or buffer to handle the new load-failed state; ensure logging references
buffer_bytes, kfb::from_flatbuffer and the failing shard id so operators can
rebuild the shard.
- Around line 261-263: The code currently dereferences bytes returned from
kfb::to_flatbuffer without a runtime null check (assert(bytes) is compiled out
in release); update the block around auto bytes = kfb::to_flatbuffer(*self.impl)
to explicitly check if bytes is null before calling bytes->data() or
bytes->size(), and handle the failure path (e.g., return an error/false, throw
an exception, or log an error via the same logging mechanism) instead of
proceeding to out.write(reinterpret_cast<const char*>(bytes->data()),
bytes->size()); ensure the check covers both bytes null and empty payload cases
so no unchecked bytes->data() dereference remains.

In `@src/index/tu_index.cpp`:
- Around line 216-219: TUIndex::serialize currently dereferences the result of
kfb::to_flatbuffer without handling a failure when asserts are disabled; update
the function to check the returned bytes (the value stored in the local variable
bytes from kfb::to_flatbuffer) before calling bytes->data()/bytes->size(), and
handle the error path (e.g., return early, log/processLogger.error, or throw an
exception) so serialization failure is not dereferenced; ensure the check and
error handling are applied in TUIndex::serialize around the call to os.write to
safely guard bytes.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: c3b75257-94ae-4a5a-bc28-575c7794b72b

📥 Commits

Reviewing files that changed from the base of the PR and between 911702c and 019f422.

📒 Files selected for processing (16)
  • CMakeLists.txt
  • cmake/package.cmake
  • src/index/include_graph.h
  • src/index/kotatsu_adapters.h
  • src/index/merged_index.cpp
  • src/index/project_index.cpp
  • src/index/project_index.h
  • src/index/schema.fbs
  • src/index/serialization.h
  • src/index/tu_index.cpp
  • src/index/tu_index.h
  • src/semantic/relation_kind.h
  • src/server/compiler.cpp
  • src/server/indexer.cpp
  • src/syntax/token.h
  • tests/unit/index/project_index_tests.cpp
💤 Files with no reviewable changes (2)
  • src/index/serialization.h
  • src/index/schema.fbs
✅ Files skipped from review due to trivial changes (1)
  • tests/unit/index/project_index_tests.cpp
🚧 Files skipped from review as they are similar to previous changes (5)
  • cmake/package.cmake
  • src/server/compiler.cpp
  • src/index/project_index.h
  • src/index/kotatsu_adapters.h
  • src/index/project_index.cpp

Comment on lines +339 to +347
for(; lo < count; ++lo) {
auto entry = entries.at(lo);
auto occurrence = read_occurrence(entry.template get<0>());
if(!occurrence.range.contains(offset)) {
break;
}
if(!callback(occurrence)) {
break;
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
sed -n '266,408p' src/index/merged_index.cpp

Repository: clice-io/clice

Length of output: 5476


🏁 Script executed:

#!/bin/bash
# Search for the definition of Impl structure and how occurrences/relations are stored
rg -A 10 "struct Impl" src/index/ -t cpp | head -80

Repository: clice-io/clice

Length of output: 542


🏁 Script executed:

#!/bin/bash
# Look for where Impl occurrences and relations are defined to understand the FlatBuffer structure
rg "occurrences|relations" src/index/merged_index.cpp -B 2 -A 2 | head -100

Repository: clice-io/clice

Length of output: 3664


🏁 Script executed:

#!/bin/bash
# Search for KFB table definitions or schema for this data structure
fd -e "\.h" -e "\.hpp" src/ | xargs rg "struct Impl" -A 15 | head -150

Repository: clice-io/clice

Length of output: 2320


🏁 Script executed:

#!/bin/bash
# Search for KFB/FlatBuffer map access patterns to confirm how to access value
rg "\.template get<1>" src/index/ -t cpp -B 3 -A 3

Repository: clice-io/clice

Length of output: 486


🏁 Script executed:

#!/bin/bash
# Look for the complete Impl structure definition in the header file
rg -A 30 "struct Impl" src/index/merged_index.h

Repository: clice-io/clice

Length of output: 811


🏁 Script executed:

#!/bin/bash
# Check if there are any other places where removed filtering is done in lazy paths
rg "removed" src/index/merged_index.cpp -B 5 -A 2

Repository: clice-io/clice

Length of output: 2254


Mirror the impl path by filtering removed canonical IDs in lazy lookups.

The in-memory Impl paths skip entries whose bitmap becomes empty after subtracting the removed set, but the lazy FlatBuffer paths reconstruct only the occurrence/relation key and call the callback unconditionally. This surfaces stale results from removed contexts until materialization.

Both lazy paths need the same filtering logic:

  • In occurrences lookup (lines 339–347): read the bitmap via entry.template get<1>(), compute bitmap - root[&Impl::removed], and skip the callback if empty.
  • In relations lookup (lines 389–406): similarly filter the inner map entry's bitmap value before invoking the callback.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/index/merged_index.cpp` around lines 339 - 347, The lazy FlatBuffer
lookup paths call the callback unconditionally and so return stale
occurrences/relations that have been removed; update both occurrence and
relation lazy loops to mirror the in-memory Impl logic: after reading the
occurrence/relation key and loading the stored bitmap via entry.template
get<1>() (or the relation inner-map bitmap), subtract the Impl::removed bitmap
(root[&Impl::removed]) and if the resulting bitmap is empty skip invoking
callback (i.e. continue), otherwise proceed to call callback with the
occurrence/relation; locate these changes around the read_occurrence loop and
the relations lookup loop to apply the same filtering.

Comment thread src/index/tu_index.h
Comment on lines +40 to +42
friend bool operator==(const Relation&, const Relation&) = default;

friend auto operator<=>(const Relation&, const Relation&) = default;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Keep Relation comparison aligned with its logical key.

The defaulted operators include padding, but existing relation hashing/equality and dedupe ignore it. If the codec uses this ordering for map keys, logically identical relations can be ordered or compared differently when padding differs.

♻️ Proposed fix
-    friend bool operator==(const Relation&, const Relation&) = default;
-
-    friend auto operator<=>(const Relation&, const Relation&) = default;
+    friend constexpr bool operator==(const Relation& lhs, const Relation& rhs) {
+        return lhs.kind == rhs.kind && lhs.range == rhs.range &&
+               lhs.target_symbol == rhs.target_symbol;
+    }
+
+    friend constexpr auto operator<=>(const Relation& lhs, const Relation& rhs) {
+        if(auto cmp = lhs.kind <=> rhs.kind; cmp != 0) {
+            return cmp;
+        }
+        if(auto cmp = lhs.range <=> rhs.range; cmp != 0) {
+            return cmp;
+        }
+        return lhs.target_symbol <=> rhs.target_symbol;
+    }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
friend bool operator==(const Relation&, const Relation&) = default;
friend auto operator<=>(const Relation&, const Relation&) = default;
friend constexpr bool operator==(const Relation& lhs, const Relation& rhs) {
return lhs.kind == rhs.kind && lhs.range == rhs.range &&
lhs.target_symbol == rhs.target_symbol;
}
friend constexpr auto operator<=>(const Relation& lhs, const Relation& rhs) {
if(auto cmp = lhs.kind <=> rhs.kind; cmp != 0) {
return cmp;
}
if(auto cmp = lhs.range <=> rhs.range; cmp != 0) {
return cmp;
}
return lhs.target_symbol <=> rhs.target_symbol;
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/index/tu_index.h` around lines 40 - 42, Replace the defaulted comparison
operators with explicit implementations for Relation: remove the friend
defaulted operator== and operator<=> and implement operator== and operator<=>
that compare only the logical key members (the same fields used by the existing
hash/equality and dedupe logic) and explicitly exclude the padding member;
ensure the implementation uses the same member set as the Relation hashing
function so ordering and equality are consistent when used as map keys.

@16bit-ykiko 16bit-ykiko force-pushed the migrate/flatbuffers-to-kotatsu-reflection branch from 019f422 to 94cfdd9 Compare April 20, 2026 01:33

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

♻️ Duplicate comments (5)
src/index/merged_index.cpp (2)

339-347: ⚠️ Potential issue | 🟠 Major

Keep lazy lookups consistent with removed-context filtering.

The materialized paths skip entries whose bitmap becomes empty after subtracting removed; these lazy paths still invoke callbacks unconditionally, so stale occurrences/relations can surface until materialization.

🛡️ Proposed fix sketch
         for(; lo < count; ++lo) {
             auto entry = entries.at(lo);
             auto occurrence = read_occurrence(entry.template get<0>());
             if(!occurrence.range.contains(offset)) {
                 break;
             }
+            auto remaining = entry.template get<1>() - root[&Impl::removed];
+            if(remaining.isEmpty()) {
+                continue;
+            }
             if(!callback(occurrence)) {
                 break;
             }
         }
         for(std::size_t i = 0; i < count; ++i) {
-            auto rel_view = inner.at(i).template get<0>();
+            auto inner_entry = inner.at(i);
+            auto remaining = inner_entry.template get<1>() - root[&Impl::removed];
+            if(remaining.isEmpty()) {
+                continue;
+            }
+            auto rel_view = inner_entry.template get<0>();

Also applies to: 389-406

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/index/merged_index.cpp` around lines 339 - 347, The lazy-path loop over
entries (using entries.at(lo), read_occurrence and callback(occurrence))
currently calls the callback even when the occurrence’s bitmap becomes empty
after subtracting the `removed` set, causing stale occurrences; update the loop
to apply the same removed-subtraction and empty-check used by the
materialized-path before invoking callback (i.e., subtract `removed` from the
occurrence's bitmap, skip the entry if the bitmap is empty, and only then call
callback), and make the identical change to the other corresponding loop that
reads occurrences (the similar block handling the lazy path later in the file).

261-263: ⚠️ Potential issue | 🔴 Critical

Handle serialization failure outside assert.

Line 262 is compiled out in release builds; Line 263 can still dereference an empty serialization result and write corrupt output.

🛡️ Proposed fix
+#include "llvm/Support/ErrorHandling.h"
+
     auto bytes = kfb::to_flatbuffer(*self.impl);
-    assert(bytes && "MergedIndex flatbuffer serialization failed");
+    if(!bytes) {
+        llvm::report_fatal_error("MergedIndex flatbuffer serialization failed");
+    }
     out.write(reinterpret_cast<const char*>(bytes->data()), bytes->size());
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/index/merged_index.cpp` around lines 261 - 263, The code currently uses
assert to guard the result of kfb::to_flatbuffer(*self.impl) but then
dereferences bytes regardless; replace the assert with a runtime check: call
auto bytes = kfb::to_flatbuffer(*self.impl); if (!bytes) handle the error path
(e.g., return an error code/throw an exception or log and abort) instead of
proceeding, and only call out.write(reinterpret_cast<const
char*>(bytes->data()), bytes->size()) when bytes is non-null; reference symbols:
kfb::to_flatbuffer, bytes, self.impl, and out.write to locate and fix the code.
src/index/tu_index.cpp (1)

216-219: ⚠️ Potential issue | 🔴 Critical

Handle serialization failure outside assert.

Line 218 is compiled out under NDEBUG; Line 219 can still dereference an empty serialization result.

🛡️ Proposed fix
+#include "llvm/Support/ErrorHandling.h"
+
 void TUIndex::serialize(llvm::raw_ostream& os) {
     auto bytes = kfb::to_flatbuffer(*this);
-    assert(bytes && "TUIndex flatbuffer serialization failed");
+    if(!bytes) {
+        llvm::report_fatal_error("TUIndex flatbuffer serialization failed");
+    }
     os.write(reinterpret_cast<const char*>(bytes->data()), bytes->size());
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/index/tu_index.cpp` around lines 216 - 219, TUIndex::serialize currently
uses assert on the result of kfb::to_flatbuffer (bytes) which is compiled out
under NDEBUG and then dereferences bytes; replace the assert with an explicit
runtime check: if bytes is null, report the error and abort the operation (e.g.
throw std::runtime_error or call llvm::report_fatal_error/emit a clear
diagnostic) instead of proceeding, otherwise call os.write with bytes->data()
and bytes->size(); update the error message to mention TUIndex flatbuffer
serialization failure and include any available context.
src/index/tu_index.h (1)

40-42: ⚠️ Potential issue | 🟠 Major

Exclude padding from Relation equality/order.

The defaulted operators include Line 26 padding, but relation hashing/dedupe use only kind, range, and target_symbol. This can make logically identical relations compare differently.

♻️ Proposed fix
-    friend bool operator==(const Relation&, const Relation&) = default;
-
-    friend auto operator<=>(const Relation&, const Relation&) = default;
+    friend constexpr bool operator==(const Relation& lhs, const Relation& rhs) {
+        return lhs.kind == rhs.kind && lhs.range == rhs.range &&
+               lhs.target_symbol == rhs.target_symbol;
+    }
+
+    friend constexpr auto operator<=>(const Relation& lhs, const Relation& rhs) {
+        if(auto cmp = lhs.kind <=> rhs.kind; cmp != 0) {
+            return cmp;
+        }
+        if(auto cmp = lhs.range <=> rhs.range; cmp != 0) {
+            return cmp;
+        }
+        return lhs.target_symbol <=> rhs.target_symbol;
+    }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/index/tu_index.h` around lines 40 - 42, The defaulted friend operators
operator== and operator<=> on Relation include the padding member causing
inconsistent comparisons vs hashing/dedupe; replace the defaulted operators with
custom implementations that compare only Relation::kind, Relation::range, and
Relation::target_symbol (explicitly exclude Relation::padding) so equality and
ordering match the logic used by hashing/deduplication; update the friend
declarations for operator== and operator<=> on the Relation type to call those
custom comparisons.
src/index/project_index.cpp (1)

43-46: ⚠️ Potential issue | 🔴 Critical

Handle serialization failure outside assert.

Line 45 disappears in release builds; if kfb::to_flatbuffer(self) returns an empty result, Line 46 still dereferences it.

🛡️ Proposed fix
+#include "llvm/Support/ErrorHandling.h"
+
 void ProjectIndex::serialize(this ProjectIndex& self, llvm::raw_ostream& os) {
     auto bytes = kfb::to_flatbuffer(self);
-    assert(bytes && "ProjectIndex flatbuffer serialization failed");
+    if(!bytes) {
+        llvm::report_fatal_error("ProjectIndex flatbuffer serialization failed");
+    }
     os.write(reinterpret_cast<const char*>(bytes->data()), bytes->size());
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/index/project_index.cpp` around lines 43 - 46, ProjectIndex::serialize
currently uses assert to guard the result of kfb::to_flatbuffer(self), which is
removed in release builds and can lead to dereferencing an empty optional;
change the function to explicitly check the returned bytes (from
kfb::to_flatbuffer) before writing to llvm::raw_ostream& os, and handle the
failure path deterministically (e.g., log an error via your logging facility and
return/throw an exception) instead of relying on assert so the code never
dereferences a null/empty buffer.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/index/kotatsu_adapters.h`:
- Around line 103-108: The deserialize function currently calls
clice::Bitmap::read(reinterpret_cast<const char*>(bytes.data()), false) which
can read past the vector bounds for malformed/untrusted FlatBuffer input;
replace that call with the length-checked clice::Bitmap::readSafe by passing the
buffer pointer and bytes.size() (retain the third parameter false to preserve
the non-portable format) so deserialize safely prevents out-of-bounds reads when
reconstructing a clice::Bitmap from std::vector<std::byte>.

In `@src/index/merged_index.cpp`:
- Around line 226-237: The code rebuilds index.canonical_ref_counts using
inc.canonical_id and ctx.canonical_id without bounds checks, risking
out-of-range access if decoded IDs >= index.max_canonical_id; update the loops
over index.header_contexts and index.compilation_contexts to validate each
canonical ID before incrementing (e.g., check inc.canonical_id <
index.max_canonical_id and ctx.canonical_id < index.max_canonical_id), skip or
log invalid IDs instead of indexing, and ensure any counts use the same
index.canonical_ref_counts container so no UB occurs.

---

Duplicate comments:
In `@src/index/merged_index.cpp`:
- Around line 339-347: The lazy-path loop over entries (using entries.at(lo),
read_occurrence and callback(occurrence)) currently calls the callback even when
the occurrence’s bitmap becomes empty after subtracting the `removed` set,
causing stale occurrences; update the loop to apply the same removed-subtraction
and empty-check used by the materialized-path before invoking callback (i.e.,
subtract `removed` from the occurrence's bitmap, skip the entry if the bitmap is
empty, and only then call callback), and make the identical change to the other
corresponding loop that reads occurrences (the similar block handling the lazy
path later in the file).
- Around line 261-263: The code currently uses assert to guard the result of
kfb::to_flatbuffer(*self.impl) but then dereferences bytes regardless; replace
the assert with a runtime check: call auto bytes =
kfb::to_flatbuffer(*self.impl); if (!bytes) handle the error path (e.g., return
an error code/throw an exception or log and abort) instead of proceeding, and
only call out.write(reinterpret_cast<const char*>(bytes->data()), bytes->size())
when bytes is non-null; reference symbols: kfb::to_flatbuffer, bytes, self.impl,
and out.write to locate and fix the code.

In `@src/index/project_index.cpp`:
- Around line 43-46: ProjectIndex::serialize currently uses assert to guard the
result of kfb::to_flatbuffer(self), which is removed in release builds and can
lead to dereferencing an empty optional; change the function to explicitly check
the returned bytes (from kfb::to_flatbuffer) before writing to
llvm::raw_ostream& os, and handle the failure path deterministically (e.g., log
an error via your logging facility and return/throw an exception) instead of
relying on assert so the code never dereferences a null/empty buffer.

In `@src/index/tu_index.cpp`:
- Around line 216-219: TUIndex::serialize currently uses assert on the result of
kfb::to_flatbuffer (bytes) which is compiled out under NDEBUG and then
dereferences bytes; replace the assert with an explicit runtime check: if bytes
is null, report the error and abort the operation (e.g. throw std::runtime_error
or call llvm::report_fatal_error/emit a clear diagnostic) instead of proceeding,
otherwise call os.write with bytes->data() and bytes->size(); update the error
message to mention TUIndex flatbuffer serialization failure and include any
available context.

In `@src/index/tu_index.h`:
- Around line 40-42: The defaulted friend operators operator== and operator<=>
on Relation include the padding member causing inconsistent comparisons vs
hashing/dedupe; replace the defaulted operators with custom implementations that
compare only Relation::kind, Relation::range, and Relation::target_symbol
(explicitly exclude Relation::padding) so equality and ordering match the logic
used by hashing/deduplication; update the friend declarations for operator== and
operator<=> on the Relation type to call those custom comparisons.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: cb8e84a5-034b-42ac-b319-05455c2be11f

📥 Commits

Reviewing files that changed from the base of the PR and between 019f422 and 94cfdd9.

📒 Files selected for processing (16)
  • CMakeLists.txt
  • cmake/package.cmake
  • src/index/include_graph.h
  • src/index/kotatsu_adapters.h
  • src/index/merged_index.cpp
  • src/index/project_index.cpp
  • src/index/project_index.h
  • src/index/schema.fbs
  • src/index/serialization.h
  • src/index/tu_index.cpp
  • src/index/tu_index.h
  • src/semantic/relation_kind.h
  • src/server/compiler.cpp
  • src/server/indexer.cpp
  • src/syntax/token.h
  • tests/unit/index/project_index_tests.cpp
💤 Files with no reviewable changes (2)
  • src/index/serialization.h
  • src/index/schema.fbs
✅ Files skipped from review due to trivial changes (1)
  • src/server/indexer.cpp
🚧 Files skipped from review as they are similar to previous changes (7)
  • src/server/compiler.cpp
  • src/syntax/token.h
  • cmake/package.cmake
  • src/index/include_graph.h
  • tests/unit/index/project_index_tests.cpp
  • src/index/project_index.h
  • src/semantic/relation_kind.h

Comment thread src/index/kotatsu_adapters.h Outdated
Comment on lines +103 to +108
static clice::Bitmap deserialize(std::vector<std::byte> bytes) {
if(bytes.empty()) {
return clice::Bitmap();
}
return clice::Bitmap::read(reinterpret_cast<const char*>(bytes.data()), false);
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Description: Inspect Roaring/Bitmap headers for length-checked deserialize APIs.
# Expected: a readSafe/deserialize_safe-style API that accepts both pointer and size.

rg -n -C3 'readSafe|deserialize_safe|read\s*\(' --iglob '*roaring*.h' --iglob '*roaring*.hh' --iglob '*roaring*.hpp' --iglob '*bitmap*.h'
rg -n -C3 'using Bitmap|typedef.*Bitmap|class Bitmap|struct Bitmap' --iglob '*.h' --iglob '*.hpp'

Repository: clice-io/clice

Length of output: 275


🏁 Script executed:

# Find roaring-related files and headers in the repo
fd -i roaring . --type f

Repository: clice-io/clice

Length of output: 40


🏁 Script executed:

# Check includes and roaring dependency in the relevant files
head -30 src/index/kotatsu_adapters.h
head -30 src/support/bitmap.h

# Search for CMake or build config that might show roaring dependency
fd -i cmake . --type f | head -10
cat CMakeLists.txt 2>/dev/null | grep -i roaring

Repository: clice-io/clice

Length of output: 1499


🏁 Script executed:

# Read the exact lines mentioned in the review
sed -n '103,108p' src/index/kotatsu_adapters.h

# Check if there's a roaring.hh header we can inspect or search the web for its API
# First, let's see what roaring methods are called in the codebase
rg -n 'Roaring::' src/ --type h --type hpp

Repository: clice-io/clice

Length of output: 339


🏁 Script executed:

# Search for any roaring usage patterns in codebase to understand available APIs
rg -n 'Bitmap::|roaring::' src/ -A 2 -B 2

Repository: clice-io/clice

Length of output: 1609


🌐 Web query:

roaring C++ library deserialize safe API readSafe length-checked

💡 Result:

The Roaring C++ library (CRoaring) provides safe, length-checked deserialization APIs for untrusted input. These prevent buffer overflows by limiting reads to a specified maximum size. For C API (roaring.h): - roaring_bitmap_portable_deserialize_safe(const char *buf, size_t maxbytes): Safe deserialization of 32-bit bitmap; returns NULL if it would exceed maxbytes or fail. - roaring64_bitmap_portable_deserialize_safe(const char buf, size_t maxbytes): Same for 64-bit. For C++ API (roaring.hh, class roaring::Roaring): - static Roaring readSafe(const char buf, size_t maxbytes, bool portable = true): Safe deserialization; will not read past maxbytes. These APIs follow the portable Roaring format spec. They avoid overflows but do not fully validate content (e.g., sorted containers). For untrusted sources, call roaring_bitmap_internal_validate afterward (added Oct 2024 via PR #675, guaranteeing safe deserialization when combined). Example (C++): uint32_t expectedsize = r1.getSizeInBytes; char *serializedbytes = new char[expectedsize]; r1.write(serializedbytes); Roaring t = Roaring::readSafe(serializedbytes, expectedsize); // safe, length-checked Always free with roaring_bitmap_free or destructor.

Citations:


🌐 Web query:

roaring C++ non-portable format deserialize safe API CRoaring

💡 Result:

In CRoaring's C++ API (roaring::Roaring class in roaring.hh), the primary deserialization methods are static Roaring read(const char* buf, bool portable = true) and static Roaring readSafe(const char* buf, size_t maxbytes, bool portable = true). These support both portable (cross-platform, default) and non-portable formats (when portable=false). The non-portable format is architecture-specific, faster, but not safe for sharing across different systems or endianness. There is no direct "non-portable format deserialize safe API". The "safe" aspect refers to readSafe limiting reads to maxbytes to prevent buffer overflows, but it assumes valid input and does not fully validate content (use roaring_bitmap_internal_validate afterward for untrusted data). A separate "frozen" deserialization exists for read-only "frozen views" (e.g., roaring_bitmap_portable_deserialize_frozen in C, added to C++ via PR #544), which points directly to the buffer without copying. It is explicitly unsafe: may perform unaligned memory accesses (risky on some platforms) and can buffer overflow if buf is invalid. It uses the portable serialization format but creates a non-copying frozen bitmap, suitable for memory-mapped files but only for readonly operations. Standard deserialize (read/portable_deserialize) is fully safe if input is trusted/valid, copies data into aligned memory. For untrusted input, use *_safe variants + validation. No purely "non-portable safe" API exists due to alignment/portability risks.

Citations:


🏁 Script executed:

# Check the roaring.hh header if available in includes or search for read() method signature
rg -n 'read\s*\(' src/support/bitmap.h -A 5 -B 5
# Also search for how non-portable is handled
rg -n 'portable' src/index/kotatsu_adapters.h -B 3 -A 3

Repository: clice-io/clice

Length of output: 330


Use readSafe for length-checked deserialization of untrusted FlatBuffer data.

Line 107 calls clice::Bitmap::read() without passing buffer length, creating a buffer overflow vulnerability when deserializing untrusted FlatBuffer blobs. The Roaring reader can consume past the vector boundary if the input is malformed.

Switch to readSafe, which accepts both buffer pointer and length and prevents out-of-bounds reads:

Proposed fix
-        return clice::Bitmap::read(reinterpret_cast<const char*>(bytes.data()), false);
+        return clice::Bitmap::readSafe(
+            reinterpret_cast<const char*>(bytes.data()),
+            bytes.size(),
+            false);

The third parameter (false) preserves the non-portable format used in the wire protocol (as documented on line 83).

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
static clice::Bitmap deserialize(std::vector<std::byte> bytes) {
if(bytes.empty()) {
return clice::Bitmap();
}
return clice::Bitmap::read(reinterpret_cast<const char*>(bytes.data()), false);
}
static clice::Bitmap deserialize(std::vector<std::byte> bytes) {
if(bytes.empty()) {
return clice::Bitmap();
}
return clice::Bitmap::readSafe(
reinterpret_cast<const char*>(bytes.data()),
bytes.size(),
false);
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/index/kotatsu_adapters.h` around lines 103 - 108, The deserialize
function currently calls clice::Bitmap::read(reinterpret_cast<const
char*>(bytes.data()), false) which can read past the vector bounds for
malformed/untrusted FlatBuffer input; replace that call with the length-checked
clice::Bitmap::readSafe by passing the buffer pointer and bytes.size() (retain
the third parameter false to preserve the non-portable format) so deserialize
safely prevents out-of-bounds reads when reconstructing a clice::Bitmap from
std::vector<std::byte>.

Comment on lines +226 to 237
// Rebuild the ref count table from the already-loaded contexts.
auto& index = *self.impl;
index.canonical_ref_counts.clear();
index.canonical_ref_counts.resize(index.max_canonical_id, 0);

for(auto entry: *root->header_contexts()) {
HeaderContext context;
auto path = entry->path_id();
context.version = entry->version();
for(auto include: *entry->includes()) {
index.canonical_ref_counts[include->canonical_id()] += 1;
context.includes.emplace_back(*safe_cast<IncludeContext>(include));
for(auto& [_, ctx]: index.header_contexts) {
for(auto& inc: ctx.includes) {
index.canonical_ref_counts[inc.canonical_id] += 1;
}
index.header_contexts.try_emplace(path, std::move(context));
}

for(auto entry: *root->compilation_contexts()) {
CompilationContext context;
auto path = entry->path_id();
context.version = entry->version();
context.canonical_id = entry->canonical_id();
context.build_at = entry->build_at();
for(auto include: *entry->include_locations()) {
context.include_locations.emplace_back(*safe_cast<IncludeLocation>(include));
}
index.compilation_contexts.try_emplace(path, std::move(context));
}

// Count ref counts from compilation contexts.
for(auto entry: *root->compilation_contexts()) {
index.canonical_ref_counts[entry->canonical_id()] += 1;
}

// Deserialize removed bitmap.
if(root->removed() && root->removed()->size() > 0) {
index.removed = read_bitmap(root->removed());
}

for(auto entry: *root->occurrences()) {
index.occurrences.try_emplace(*safe_cast<Occurrence>(entry->occurrence()),
read_bitmap(entry->context()));
}

for(auto entry: *root->relations()) {
auto& relations = index.relations[entry->symbol()];
for(auto relation_entry: *entry->relations()) {
relations.try_emplace(*safe_cast<Relation>(relation_entry->relation()),
read_bitmap(relation_entry->context()));
}
}

if(root->content()) {
index.content = root->content()->str();
for(auto& [_, ctx]: index.compilation_contexts) {
index.canonical_ref_counts[ctx.canonical_id] += 1;
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Validate canonical IDs before rebuilding ref counts.

inc.canonical_id and ctx.canonical_id come from the decoded buffer. If either is >= index.max_canonical_id, Lines 232 or 236 index past canonical_ref_counts.

🛡️ Proposed fix
     for(auto& [_, ctx]: index.header_contexts) {
         for(auto& inc: ctx.includes) {
+            if(inc.canonical_id >= index.canonical_ref_counts.size()) {
+                self.impl.reset();
+                self.buffer.reset();
+                return;
+            }
             index.canonical_ref_counts[inc.canonical_id] += 1;
         }
     }
     for(auto& [_, ctx]: index.compilation_contexts) {
+        if(ctx.canonical_id >= index.canonical_ref_counts.size()) {
+            self.impl.reset();
+            self.buffer.reset();
+            return;
+        }
         index.canonical_ref_counts[ctx.canonical_id] += 1;
     }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
// Rebuild the ref count table from the already-loaded contexts.
auto& index = *self.impl;
index.canonical_ref_counts.clear();
index.canonical_ref_counts.resize(index.max_canonical_id, 0);
for(auto entry: *root->header_contexts()) {
HeaderContext context;
auto path = entry->path_id();
context.version = entry->version();
for(auto include: *entry->includes()) {
index.canonical_ref_counts[include->canonical_id()] += 1;
context.includes.emplace_back(*safe_cast<IncludeContext>(include));
for(auto& [_, ctx]: index.header_contexts) {
for(auto& inc: ctx.includes) {
index.canonical_ref_counts[inc.canonical_id] += 1;
}
index.header_contexts.try_emplace(path, std::move(context));
}
for(auto entry: *root->compilation_contexts()) {
CompilationContext context;
auto path = entry->path_id();
context.version = entry->version();
context.canonical_id = entry->canonical_id();
context.build_at = entry->build_at();
for(auto include: *entry->include_locations()) {
context.include_locations.emplace_back(*safe_cast<IncludeLocation>(include));
}
index.compilation_contexts.try_emplace(path, std::move(context));
}
// Count ref counts from compilation contexts.
for(auto entry: *root->compilation_contexts()) {
index.canonical_ref_counts[entry->canonical_id()] += 1;
}
// Deserialize removed bitmap.
if(root->removed() && root->removed()->size() > 0) {
index.removed = read_bitmap(root->removed());
}
for(auto entry: *root->occurrences()) {
index.occurrences.try_emplace(*safe_cast<Occurrence>(entry->occurrence()),
read_bitmap(entry->context()));
}
for(auto entry: *root->relations()) {
auto& relations = index.relations[entry->symbol()];
for(auto relation_entry: *entry->relations()) {
relations.try_emplace(*safe_cast<Relation>(relation_entry->relation()),
read_bitmap(relation_entry->context()));
}
}
if(root->content()) {
index.content = root->content()->str();
for(auto& [_, ctx]: index.compilation_contexts) {
index.canonical_ref_counts[ctx.canonical_id] += 1;
}
// Rebuild the ref count table from the already-loaded contexts.
auto& index = *self.impl;
index.canonical_ref_counts.clear();
index.canonical_ref_counts.resize(index.max_canonical_id, 0);
for(auto& [_, ctx]: index.header_contexts) {
for(auto& inc: ctx.includes) {
if(inc.canonical_id >= index.canonical_ref_counts.size()) {
self.impl.reset();
self.buffer.reset();
return;
}
index.canonical_ref_counts[inc.canonical_id] += 1;
}
}
for(auto& [_, ctx]: index.compilation_contexts) {
if(ctx.canonical_id >= index.canonical_ref_counts.size()) {
self.impl.reset();
self.buffer.reset();
return;
}
index.canonical_ref_counts[ctx.canonical_id] += 1;
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/index/merged_index.cpp` around lines 226 - 237, The code rebuilds
index.canonical_ref_counts using inc.canonical_id and ctx.canonical_id without
bounds checks, risking out-of-range access if decoded IDs >=
index.max_canonical_id; update the loops over index.header_contexts and
index.compilation_contexts to validate each canonical ID before incrementing
(e.g., check inc.canonical_id < index.max_canonical_id and ctx.canonical_id <
index.max_canonical_id), skip or log invalid IDs instead of indexing, and ensure
any counts use the same index.canonical_ref_counts container so no UB occurs.

…tion

Replace the flatc-generated serialization layer with kotatsu's arena codec
driven directly by the in-memory index types. No hand-written DTOs: the
on-wire layout is derived from reflection over the existing structs, with
type-level customization where needed.

- Drop `schema.fbs`, `serialization.h`, and the flatc build step
- Delete `wire_types.h` — no more parallel wire representation
- Add `kotatsu_adapters.h` with `kota::codec::type_adapter<T>` specializations
  for RelationKind, SymbolKind, Bitmap, and std::chrono::milliseconds
- Mark runtime-only FileID-keyed maps with `kota::meta::skip<>` so they
  are excluded from reflection slots; serialize via `main_file_index` and
  `path_file_indices` (keyed by path id)
- Restore MergedIndex's dual dispatch: in-memory path when `impl` is live,
  lazy flatbuffers path via `kfb::table_view<Impl>::from_bytes()` and
  `root[&Impl::field]` proxy access when only the buffer is held
- Add default member initializers to LocalSourceRange, padding field to
  Relation, and a path_id lookup struct to IncludeLocation so reflection
  picks up all stored state
- Propagate buffer size through `TUIndex::from` / `ProjectIndex::from`
  (kota codec requires an explicit size for bounds verification)

All 551 unit tests pass; 9 environment-gated integration tests skipped.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@16bit-ykiko 16bit-ykiko force-pushed the migrate/flatbuffers-to-kotatsu-reflection branch from 94cfdd9 to d241ea8 Compare April 20, 2026 01:51

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

♻️ Duplicate comments (6)
src/index/tu_index.h (1)

40-42: ⚠️ Potential issue | 🟠 Major

Keep Relation comparison aligned with its logical key.

The defaulted operators include padding, while relation hashing/equality and dedupe ignore it. That can make logically identical relations compare differently when the codec orders map keys.

♻️ Proposed fix
-    friend bool operator==(const Relation&, const Relation&) = default;
-
-    friend auto operator<=>(const Relation&, const Relation&) = default;
+    friend constexpr bool operator==(const Relation& lhs, const Relation& rhs) {
+        return lhs.kind == rhs.kind && lhs.range == rhs.range &&
+               lhs.target_symbol == rhs.target_symbol;
+    }
+
+    friend constexpr auto operator<=>(const Relation& lhs, const Relation& rhs) {
+        if(auto cmp = lhs.kind <=> rhs.kind; cmp != 0) {
+            return cmp;
+        }
+        if(auto cmp = lhs.range <=> rhs.range; cmp != 0) {
+            return cmp;
+        }
+        return lhs.target_symbol <=> rhs.target_symbol;
+    }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/index/tu_index.h` around lines 40 - 42, Defaulted Relation comparison
operators include padding bytes causing mismatches with the existing
logical-key-based hashing/equality/dedupe; replace the defaulted friend
operator== and operator<=> for Relation with explicit implementations that
compare only the logical key members used by hashing/dedupe (i.e., exclude
padding and any fields not part of the logical identity), so operator== and
operator<=> are consistent with the relation hashing/equality/dedupe behavior.
src/index/tu_index.cpp (1)

216-219: ⚠️ Potential issue | 🔴 Critical

Handle FlatBuffer serialization failure outside assert.

Line 218 is compiled out in release builds, but Line 219 still dereferences bytes. If kfb::to_flatbuffer() fails, this becomes a release-only crash/corrupt write.

🛡️ Proposed fix
 `#include` "kota/codec/flatbuffers/deserializer.h"
 `#include` "kota/codec/flatbuffers/serializer.h"
+#include "llvm/Support/ErrorHandling.h"
 `#include` "llvm/Support/SHA256.h"
@@
 void TUIndex::serialize(llvm::raw_ostream& os) {
     auto bytes = kfb::to_flatbuffer(*this);
-    assert(bytes && "TUIndex flatbuffer serialization failed");
+    if(!bytes) {
+        llvm::report_fatal_error("TUIndex flatbuffer serialization failed");
+    }
     os.write(reinterpret_cast<const char*>(bytes->data()), bytes->size());
 }
#!/bin/bash
# Verify all kotatsu flatbuffer serialization call sites handle a falsy result
# before dereferencing bytes->data()/bytes->size().
rg -n -C3 'kfb::to_flatbuffer|assert\(bytes|bytes->data\(\)|bytes->size\(\)' --iglob '*.cpp'
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/index/tu_index.cpp` around lines 216 - 219, TUIndex::serialize currently
asserts the result of kfb::to_flatbuffer but still dereferences bytes in release
builds (causing crashes); change the logic to check if bytes is non-null before
using it: call kfb::to_flatbuffer(*this), if the returned pointer (bytes) is
null log or return gracefully (avoid dereferencing bytes->data()/bytes->size()),
otherwise call os.write with bytes->data() and bytes->size(); update error
handling around TUIndex::serialize (and any other call sites using
kfb::to_flatbuffer) to perform a null-check on the returned value instead of
relying solely on assert.
src/index/merged_index.cpp (4)

219-224: ⚠️ Potential issue | 🟡 Minor

Don’t silently turn a bad on-disk index into an empty one.

If kfb::from_flatbuffer() fails, this resets buffer and leaves impl as an empty index, so callers cannot distinguish corruption/version mismatch from legitimate empty state.

Consider logging the failure and recording an explicit load status so lookup, need_update, and content can trigger rebuild behavior instead of silently proceeding.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/index/merged_index.cpp` around lines 219 - 224, The code currently resets
self.buffer but leaves self.impl as an empty index when
kfb::from_flatbuffer(bytes, *self.impl) fails, hiding corruption/version errors;
update the load path to (1) log the failure (include the error from
kfb::from_flatbuffer) and do not silently leave impl as a valid-but-empty index,
and (2) set an explicit load status flag on the index object (e.g., enum
LoadStatus {OK, CORRUPT, VERSION_MISMATCH, EMPTY}) so that methods like lookup,
need_update, and content can detect CORRUPT/VERSION_MISMATCH and trigger rebuild
behavior. Locate the error handling around kfb::from_flatbuffer and add the
logging and status assignment there, and update the lookup/need_update/content
handlers to check that status before treating the index as legitimately empty.

226-237: ⚠️ Potential issue | 🟠 Major

Validate decoded canonical IDs before rebuilding ref counts.

Lines 232 and 236 index canonical_ref_counts with IDs from the decoded buffer. A corrupted or mismatched index file can produce canonical_id >= max_canonical_id and cause out-of-bounds access.

🛡️ Proposed fix
     for(auto& [_, ctx]: index.header_contexts) {
         for(auto& inc: ctx.includes) {
+            if(inc.canonical_id >= index.canonical_ref_counts.size()) {
+                self.impl.reset();
+                self.buffer.reset();
+                return;
+            }
             index.canonical_ref_counts[inc.canonical_id] += 1;
         }
     }
     for(auto& [_, ctx]: index.compilation_contexts) {
+        if(ctx.canonical_id >= index.canonical_ref_counts.size()) {
+            self.impl.reset();
+            self.buffer.reset();
+            return;
+        }
         index.canonical_ref_counts[ctx.canonical_id] += 1;
     }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/index/merged_index.cpp` around lines 226 - 237, When rebuilding
index.canonical_ref_counts, validate decoded canonical IDs before using them to
avoid out-of-bounds access: in the loops over index.header_contexts
(ctx.includes -> inc.canonical_id) and index.compilation_contexts
(ctx.canonical_id) ensure each id is within [0, index.max_canonical_id) (and
non-negative) before doing index.canonical_ref_counts[id] += 1; if an id is
invalid, skip the increment and emit a debug/error log mentioning the offending
canonical id and the owning context to aid debugging.

261-263: ⚠️ Potential issue | 🔴 Critical

Handle FlatBuffer serialization failure in release builds.

Line 262 disappears under NDEBUG, but Line 263 still dereferences bytes. A failed kfb::to_flatbuffer() can crash or write malformed index data.

🛡️ Proposed fix
 `#include` "llvm/ADT/DenseSet.h"
+#include "llvm/Support/ErrorHandling.h"
 `#include` "llvm/Support/raw_os_ostream.h"
@@
     auto bytes = kfb::to_flatbuffer(*self.impl);
-    assert(bytes && "MergedIndex flatbuffer serialization failed");
+    if(!bytes) {
+        llvm::report_fatal_error("MergedIndex flatbuffer serialization failed");
+    }
     out.write(reinterpret_cast<const char*>(bytes->data()), bytes->size());
#!/bin/bash
# Verify all kotatsu flatbuffer serialization call sites handle a falsy result
# before dereferencing bytes->data()/bytes->size().
rg -n -C3 'kfb::to_flatbuffer|assert\(bytes|bytes->data\(\)|bytes->size\(\)' --iglob '*.cpp'
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/index/merged_index.cpp` around lines 261 - 263, The current code calls
kfb::to_flatbuffer(*self.impl) and then unconditionally dereferences
bytes->data()/bytes->size() after an assert, which disappears in release builds;
modify the MergedIndex serialization site (the bytes local returned from
kfb::to_flatbuffer in merged_index.cpp) to explicitly check that bytes is
non-null before calling out.write, and on failure either log an error and
return/throw an error code (or otherwise abort safe serialization) rather than
dereferencing; update the code paths around kfb::to_flatbuffer, the bytes
variable and the out.write call to handle a falsy result deterministically in
release builds.

339-347: ⚠️ Potential issue | 🟠 Major

Filter removed canonical IDs in lazy lookups.

The in-memory paths subtract removed before invoking callbacks, but both lazy paths call the callback using the raw stored bitmap. Removed contexts can therefore leak stale occurrences/relations until materialization.

🛡️ Proposed fix sketch
         for(; lo < count; ++lo) {
             auto entry = entries.at(lo);
             auto occurrence = read_occurrence(entry.template get<0>());
             if(!occurrence.range.contains(offset)) {
                 break;
             }
+            auto remaining = entry.template get<1>() - root[&Impl::removed];
+            if(remaining.isEmpty()) {
+                continue;
+            }
             if(!callback(occurrence)) {
                 break;
             }
         }
@@
-        auto inner = entry->template get<1>();
+        auto removed = root[&Impl::removed];
+        auto inner = entry->template get<1>();
         const std::size_t count = inner.size();
         for(std::size_t i = 0; i < count; ++i) {
-            auto rel_view = inner.at(i).template get<0>();
+            auto relation_entry = inner.at(i);
+            auto rel_view = relation_entry.template get<0>();
@@
             if(relation_kind & kind) {
+                auto remaining = relation_entry.template get<1>() - removed;
+                if(remaining.isEmpty()) {
+                    continue;
+                }
                 auto range_view = rel_view[&Relation::range];

Also applies to: 389-403

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/index/merged_index.cpp` around lines 339 - 347, The lazy lookup loops in
merged_index.cpp call callback with occurrences constructed from the stored
bitmap (via read_occurrence) without subtracting the in-memory removed set,
allowing removed canonical IDs to leak; update both lazy paths (the loop using
read_occurrence at the shown block and the similar loop around lines 389-403) to
filter the occurrence's canonical-id bitmap by removing the current removed set
before checking contains(offset) and before invoking callback (e.g., apply the
same subtraction logic that the in-memory path uses so occurrence.canonical_ids
no longer includes removed IDs prior to callback).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/index/merged_index.cpp`:
- Around line 451-456: The code indexes path_mapping[path_id] using a path_id
decoded from on-disk data (from include_locations and used with deps.insert);
add a bounds check before any access to path_mapping: verify 0 <= path_id <
path_mapping.size(), and if out of range avoid dereferencing and treat it as a
stale/on-disk mismatch (e.g., set the "needs update/rebuild" flag or return an
error path), otherwise proceed to call fs::status and continue as before; update
the logic around include_locations, path_id, deps.insert and fs::status so
out-of-range IDs never index path_mapping.
- Around line 325-337: The binary search in the lazy flatbuffer path uses
entries (from Impl::occurrences / DenseMap) and compares
range_view[&LocalSourceRange::end] < offset assuming entries are sorted by range
end, but DenseMap iteration is unordered, so materialize the occurrences into a
vector (similar to occurrences_cache), sort it by the same key used in the
in-memory path (begin, end, target) or at least by range end, then run the
existing binary search/ lower_bound on that sorted vector; update the code paths
around entries, Impl::occurrences, and the binary search loop to operate on the
sorted materialized container so the lookup by LocalSourceRange::end is correct
for nested ranges.

---

Duplicate comments:
In `@src/index/merged_index.cpp`:
- Around line 219-224: The code currently resets self.buffer but leaves
self.impl as an empty index when kfb::from_flatbuffer(bytes, *self.impl) fails,
hiding corruption/version errors; update the load path to (1) log the failure
(include the error from kfb::from_flatbuffer) and do not silently leave impl as
a valid-but-empty index, and (2) set an explicit load status flag on the index
object (e.g., enum LoadStatus {OK, CORRUPT, VERSION_MISMATCH, EMPTY}) so that
methods like lookup, need_update, and content can detect
CORRUPT/VERSION_MISMATCH and trigger rebuild behavior. Locate the error handling
around kfb::from_flatbuffer and add the logging and status assignment there, and
update the lookup/need_update/content handlers to check that status before
treating the index as legitimately empty.
- Around line 226-237: When rebuilding index.canonical_ref_counts, validate
decoded canonical IDs before using them to avoid out-of-bounds access: in the
loops over index.header_contexts (ctx.includes -> inc.canonical_id) and
index.compilation_contexts (ctx.canonical_id) ensure each id is within [0,
index.max_canonical_id) (and non-negative) before doing
index.canonical_ref_counts[id] += 1; if an id is invalid, skip the increment and
emit a debug/error log mentioning the offending canonical id and the owning
context to aid debugging.
- Around line 261-263: The current code calls kfb::to_flatbuffer(*self.impl) and
then unconditionally dereferences bytes->data()/bytes->size() after an assert,
which disappears in release builds; modify the MergedIndex serialization site
(the bytes local returned from kfb::to_flatbuffer in merged_index.cpp) to
explicitly check that bytes is non-null before calling out.write, and on failure
either log an error and return/throw an error code (or otherwise abort safe
serialization) rather than dereferencing; update the code paths around
kfb::to_flatbuffer, the bytes variable and the out.write call to handle a falsy
result deterministically in release builds.
- Around line 339-347: The lazy lookup loops in merged_index.cpp call callback
with occurrences constructed from the stored bitmap (via read_occurrence)
without subtracting the in-memory removed set, allowing removed canonical IDs to
leak; update both lazy paths (the loop using read_occurrence at the shown block
and the similar loop around lines 389-403) to filter the occurrence's
canonical-id bitmap by removing the current removed set before checking
contains(offset) and before invoking callback (e.g., apply the same subtraction
logic that the in-memory path uses so occurrence.canonical_ids no longer
includes removed IDs prior to callback).

In `@src/index/tu_index.cpp`:
- Around line 216-219: TUIndex::serialize currently asserts the result of
kfb::to_flatbuffer but still dereferences bytes in release builds (causing
crashes); change the logic to check if bytes is non-null before using it: call
kfb::to_flatbuffer(*this), if the returned pointer (bytes) is null log or return
gracefully (avoid dereferencing bytes->data()/bytes->size()), otherwise call
os.write with bytes->data() and bytes->size(); update error handling around
TUIndex::serialize (and any other call sites using kfb::to_flatbuffer) to
perform a null-check on the returned value instead of relying solely on assert.

In `@src/index/tu_index.h`:
- Around line 40-42: Defaulted Relation comparison operators include padding
bytes causing mismatches with the existing logical-key-based
hashing/equality/dedupe; replace the defaulted friend operator== and operator<=>
for Relation with explicit implementations that compare only the logical key
members used by hashing/dedupe (i.e., exclude padding and any fields not part of
the logical identity), so operator== and operator<=> are consistent with the
relation hashing/equality/dedupe behavior.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 07696db5-ed3f-4258-909c-75490b2eda17

📥 Commits

Reviewing files that changed from the base of the PR and between 94cfdd9 and d241ea8.

📒 Files selected for processing (16)
  • CMakeLists.txt
  • cmake/package.cmake
  • src/index/include_graph.h
  • src/index/kotatsu_adapters.h
  • src/index/merged_index.cpp
  • src/index/project_index.cpp
  • src/index/project_index.h
  • src/index/schema.fbs
  • src/index/serialization.h
  • src/index/tu_index.cpp
  • src/index/tu_index.h
  • src/semantic/relation_kind.h
  • src/server/compiler.cpp
  • src/server/indexer.cpp
  • src/syntax/token.h
  • tests/unit/index/project_index_tests.cpp
💤 Files with no reviewable changes (2)
  • src/index/serialization.h
  • src/index/schema.fbs
✅ Files skipped from review due to trivial changes (3)
  • src/syntax/token.h
  • src/index/project_index.cpp
  • src/index/kotatsu_adapters.h
🚧 Files skipped from review as they are similar to previous changes (5)
  • src/semantic/relation_kind.h
  • src/server/compiler.cpp
  • cmake/package.cmake
  • src/server/indexer.cpp
  • src/index/include_graph.h

Comment on lines +325 to +337
const std::size_t count = entries.size();
std::size_t lo = 0;
std::size_t hi = count;
while(lo < hi) {
auto mid = lo + (hi - lo) / 2;
auto entry = entries.at(mid);
auto range_view = entry.template get<0>()[&Occurrence::range];
if(range_view[&LocalSourceRange::end] < offset) {
lo = mid + 1;
} else {
hi = mid;
}
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify the ordering mismatch between occurrence sort/key order and lower_bound projection.
rg -n -C4 'operator<=>\(const Occurrence|range\.end|lower_bound\(occurrences|auto mid = lo' --iglob '*.h' --iglob '*.cpp'

Repository: clice-io/clice

Length of output: 30034


🏁 Script executed:

sed -n '310,345p' src/index/merged_index.cpp

Repository: clice-io/clice

Length of output: 1471


🏁 Script executed:

rg -n 'struct Impl|flatbuffer.*Impl' src/index/merged_index.cpp | head -20

Repository: clice-io/clice

Length of output: 129


🏁 Script executed:

rg -n 'Impl.*occurrences|occurrences.*push_back|\.serialize' src/index/merged_index.cpp | head -30

Repository: clice-io/clice

Length of output: 110


🏁 Script executed:

find src/index -name "*.h" | xargs grep -l 'struct Impl'

Repository: clice-io/clice

Length of output: 82


🏁 Script executed:

rg -n -A30 'struct Impl' src/index/merged_index.h

Repository: clice-io/clice

Length of output: 904


🏁 Script executed:

rg -n -B5 -A25 'struct Impl {' src/index/merged_index.cpp

Repository: clice-io/clice

Length of output: 147


🏁 Script executed:

rg -n -B5 -A25 'struct Impl' src/index/merged_index.cpp | head -50

Repository: clice-io/clice

Length of output: 40


🏁 Script executed:

rg -n 'namespace kfb|reflected_struct' src/index/merged_index.cpp | head -20

Repository: clice-io/clice

Length of output: 103


🏁 Script executed:

rg -n 'serialize.*occurrence|occurrence.*serialize' src/index/merged_index.cpp | head -20

Repository: clice-io/clice

Length of output: 40


🏁 Script executed:

sed -n '180,280p' src/index/merged_index.cpp

Repository: clice-io/clice

Length of output: 3190


🏁 Script executed:

sed -n '240,260p' src/index/merged_index.cpp

Repository: clice-io/clice

Length of output: 540


🏁 Script executed:

sed -n '1,180p' src/index/merged_index.cpp

Repository: clice-io/clice

Length of output: 5610


🏁 Script executed:

rg -n -A10 'struct Occurrence' src/index/tu_index.h

Repository: clice-io/clice

Length of output: 351


🏁 Script executed:

rg -n 'type.*Range|using Range' src/index/tu_index.h | head -5

Repository: clice-io/clice

Length of output: 92


🏁 Script executed:

rg -n 'struct LocalSourceRange|class LocalSourceRange' src/ -A15 | head -40

Repository: clice-io/clice

Length of output: 903


🏁 Script executed:

rg -n -B5 -A5 'from_flatbuffer.*impl' src/index/merged_index.cpp

Repository: clice-io/clice

Length of output: 321


Binary search in lazy flatbuffer path assumes range.end ordering but operates on unsorted DenseMap data.

The manual binary search (lines 325-337) compares range_view[&LocalSourceRange::end] < offset, assuming the entries are sorted by range end offset. However, entries is derived from Impl::occurrences, which is a llvm::DenseMap keyed by Occurrence. DenseMap iteration order is arbitrary (hash table order), not sorted by any field. The in-memory path correctly sorts occurrences_cache by (begin, end, target) before binary search, but the lazy path serializes from an unsorted hash table. This breaks binary search correctness for nested ranges like [10,100] then [20,30]—it may skip matches for offset 50.

Materialize and sort occurrences by the search key before the binary search loop, or use std::ranges::lower_bound with the range.end projection (as the in-memory path does).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/index/merged_index.cpp` around lines 325 - 337, The binary search in the
lazy flatbuffer path uses entries (from Impl::occurrences / DenseMap) and
compares range_view[&LocalSourceRange::end] < offset assuming entries are sorted
by range end, but DenseMap iteration is unordered, so materialize the
occurrences into a vector (similar to occurrences_cache), sort it by the same
key used in the in-memory path (begin, end, target) or at least by range end,
then run the existing binary search/ lower_bound on that sorted vector; update
the code paths around entries, Impl::occurrences, and the binary search loop to
operate on the sorted materialized container so the lookup by
LocalSourceRange::end is correct for nested ranges.

Comment on lines +451 to +456
auto location = include_locations.at(i);
auto path_id = location[&IncludeLocation::path_id];
auto [_, success] = deps.insert(path_id);
if(success) {
fs::file_status status;
if(auto err = fs::status(path_mapping[location->path_id()], status)) {
if(auto err = fs::status(path_mapping[path_id], status)) {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Bounds-check decoded path IDs before indexing path_mapping.

path_id comes from the on-disk buffer. If it is out of range, Line 456 reads past path_mapping; treat that as needing an update/rebuild instead.

🛡️ Proposed fix
             auto path_id = location[&IncludeLocation::path_id];
             auto [_, success] = deps.insert(path_id);
             if(success) {
+                if(path_id >= path_mapping.size()) {
+                    return true;
+                }
                 fs::file_status status;
                 if(auto err = fs::status(path_mapping[path_id], status)) {
                     return true;
                 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/index/merged_index.cpp` around lines 451 - 456, The code indexes
path_mapping[path_id] using a path_id decoded from on-disk data (from
include_locations and used with deps.insert); add a bounds check before any
access to path_mapping: verify 0 <= path_id < path_mapping.size(), and if out of
range avoid dereferencing and treat it as a stale/on-disk mismatch (e.g., set
the "needs update/rebuild" flag or return an error path), otherwise proceed to
call fs::status and continue as before; update the logic around
include_locations, path_id, deps.insert and fs::status so out-of-range IDs never
index path_mapping.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant