fix(vector): resolve 7 correctness, memory-safety, and quantization bugs (#50) #65

Merged
farhan-syah merged 7 commits into main from bug-to-coverage/issue-50-vector on Apr 16, 2026

@farhan-syah
Contributor

Closes #50.

Summary

Seven independent bugs in nodedb-vector — three silent-wrong-result bugs, one memory-safety UB, one silently no-op feature, and two performance pathologies. Each has dedicated regression coverage in nodedb-vector/tests/.

Fixes

  1. SIMD length parity (UB): the distance::distance() dispatcher now asserts a.len() == b.len() before invoking any AVX2/AVX-512/NEON kernel, with defense-in-depth asserts inside each kernel.
  2. Roaring bitmap ID space: HnswIndex/FlatIndex gained search_filtered_offset / search_with_bitmap_bytes_offset; VectorCollection::search_with_bitmap_bytes passes seg.base_id per segment so global-id bitmaps resolve correctly across sealed segments.
  3. index_type='hnsw_pq' wiring: VectorCollection carries IndexConfig; new with_index_config / with_pq_config constructors; complete_build trains PqCodec when configured; SealedSegment.pq stored; stats() reports Pq / HnswPq. Quantized candidates are now actually scored via PqCodec::asymmetric_distance (and Sq8Codec asymmetric variants) during search, followed by an FP32 rerank.
  4. Soft-deletes across checkpoint: FlatIndex::get_vector returns None for tombstoned slots; added is_deleted, insert_tombstoned; BuildingSnapshot gained deleted: Vec<bool>; from_checkpoint replays tombstones for growing and building segments. SealedSnapshot now serializes pq_bytes + pq_codes so PQ survives restore.
  5. Compact doc-map remap: HnswIndex::compact_with_map() returns (removed, id_map); VectorCollection::compact rewrites doc_id_map and multi_doc_map to new globals.
  6. HNSW layer cap: pub const MAX_LAYER_CAP: usize = 16, applied in random_layer.
  7. k-means collapse: true k-means++ with d²-weighted sampling (seeded Xorshift64) in both quantize/pq.rs and ivf.rs; min_dists init fixed.

Structural

  • Split distance/simd.rs (504 lines, over hard cap) into distance/simd/{mod,runtime,scalar,hamming,avx2,avx512,neon}.rs.
  • Extracted quantizer builders from collection/lifecycle.rs (555 → 494 lines) into collection/quantize.rs.

Test plan

  • cargo nextest run -p nodedb-vector --all-features — 103 passed, 0 failed
  • cargo nextest run --all-features (workspace) — 5228 passed, 0 failed
  • cargo clippy --all-targets --all-features -- -D warnings — clean
  • cargo fmt --all — applied

Move simd.rs into a simd/ directory with dedicated files for each
target (avx2, avx512, neon, scalar, hamming, runtime). Add a
length-mismatch assertion in the top-level distance() entry point
so mismatched vector dimensions panic immediately with a clear
message rather than producing silent wrong results.
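
The length check described above can be sketched as follows. This is a minimal illustration, not the crate's code: the scalar kernel stands in for the AVX2/AVX-512/NEON kernels, and both function signatures are assumptions. The point is that a kernel which walks `a.len()` elements through raw pointers would read past the end of a shorter `b`, so the check has to happen before dispatch.

```rust
/// Scalar squared-L2 kernel; a stand-in for the SIMD kernels (assumption).
fn l2_squared_scalar(a: &[f32], b: &[f32]) -> f32 {
    // Defense-in-depth: the kernel re-checks even though the dispatcher did.
    assert_eq!(a.len(), b.len(), "kernel: vector length mismatch");
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

/// Hypothetical dispatcher: validate lengths once, then pick a kernel.
pub fn distance(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(
        a.len(),
        b.len(),
        "distance(): vector length mismatch ({} vs {})",
        a.len(),
        b.len()
    );
    // A real dispatcher would branch on runtime CPU features here.
    l2_squared_scalar(a, b)
}
```

With this shape, a call such as `distance(&[1.0], &[1.0, 2.0])` panics with both lengths in the message instead of returning a wrong number.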

random_layer() could theoretically produce very large values for
unlucky RNG draws, promoting max_layer to an unbounded height and
making every subsequent search's Phase-1 greedy descent O(max_layer).
Apply a hard cap of 16 layers — standard practice for production HNSW
deployments. Also refactor compact() to expose compact_with_map()
returning the old→new id remapping needed by doc_id_map maintenance.
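
A rough sketch of the cap, assuming the usual HNSW level formula floor(-ln(u) / ln(M)) for a uniform draw u. The helper name `layer_from_uniform` is hypothetical and takes the draw as a parameter so the behavior is easy to check; the PR applies the cap inside random_layer, whose exact body is not shown here.

```rust
pub const MAX_LAYER_CAP: usize = 16;

/// Maps a uniform draw `u` in [0, 1] to an HNSW layer for fan-out `m` (m >= 2).
/// Hypothetical helper; the formula is the common one, not confirmed by the PR.
pub fn layer_from_uniform(u: f64, m: usize) -> usize {
    let ml = 1.0 / (m as f64).ln();
    let raw = (-u.ln() * ml).floor();
    // `as usize` saturates for huge or infinite values, then the cap applies.
    (raw as usize).min(MAX_LAYER_CAP)
}
```

Draws close to zero are the unlucky case: without the final `min`, `u = 1e-300` with `m = 16` would land on layer 249.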

The previous initialization selected the farthest point deterministically
rather than sampling proportionally to squared distance. Replace with
proper weighted d² sampling using a deterministic xorshift RNG so
centroid seeding is stable across runs and converges reliably for skewed
distributions. Also derive MessagePack serialization for PqCodec so
trained codecs survive checkpointing.
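
The seeding change can be illustrated with a small standalone sketch. The function name `kmeans_pp_seed`, its signature, and the xorshift constants (13, 7, 17) are assumptions for illustration; only the idea of d²-weighted sampling driven by a seeded xorshift generator comes from the commit message.

```rust
/// Minimal xorshift64 generator; the state must be non-zero.
struct Xorshift64(u64);

impl Xorshift64 {
    fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }

    /// Uniform f64 in [0, 1), built from the top 53 bits.
    fn next_f64(&mut self) -> f64 {
        (self.next_u64() >> 11) as f64 / (1u64 << 53) as f64
    }
}

fn sq_dist(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

/// Returns the indices of `k` points picked as initial centroids.
pub fn kmeans_pp_seed(points: &[Vec<f32>], k: usize, seed: u64) -> Vec<usize> {
    assert!(k >= 1 && k <= points.len(), "need 1 <= k <= points.len()");
    let mut rng = Xorshift64(seed.max(1));
    let first = (rng.next_u64() % points.len() as u64) as usize;
    let mut chosen = vec![first];
    // min_dists[i]: squared distance from point i to its nearest chosen centroid.
    let mut min_dists: Vec<f32> =
        points.iter().map(|p| sq_dist(p, &points[first])).collect();

    while chosen.len() < k {
        let total: f64 = min_dists.iter().map(|&d| d as f64).sum();
        let next = if total > 0.0 {
            // Sample an index with probability proportional to min_dists[i].
            let mut target = rng.next_f64() * total;
            let mut pick = 0;
            for (i, &d) in min_dists.iter().enumerate() {
                if d > 0.0 {
                    pick = i; // last eligible point doubles as a rounding fallback
                    target -= d as f64;
                    if target < 0.0 {
                        break;
                    }
                }
            }
            pick
        } else {
            // Every point coincides with a centroid: take any unchosen index.
            (0..points.len())
                .find(|i| !chosen.contains(i))
                .expect("k <= points.len()")
        };
        chosen.push(next);
        for (d, p) in min_dists.iter_mut().zip(points) {
            *d = (*d).min(sq_dist(p, &points[next]));
        }
    }
    chosen
}
```

Because the generator is seeded, two calls with the same inputs return the same centroids, and a point already at distance zero from a centroid is never sampled while any other point remains.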
…ent search

Bitmap filters carry global vector ids, but each segment's HNSW nodes
and FlatIndex entries are numbered starting at zero (local ids). Without
an offset the filter tests the wrong bit for every segment after the
first, producing incorrect results for filtered searches over collections
with more than one sealed segment.

Add search_filtered_offset / search_with_bitmap_bytes_offset to HnswIndex
and search_filtered_offset to FlatIndex. Thread id_offset through the
internal search_layer function so bitmap membership is checked against the
correct global id. Update collection/search.rs to pass the per-segment
base_id and the growing segment's growing_base_id when dispatching filtered
searches.
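
To make the id-space mismatch concrete, here is a brute-force sketch. A `BTreeSet<u32>` stands in for the Roaring bitmap, and the `Segment` struct and the free-function form of `search_filtered_offset` are simplifications assumed for illustration; the real methods live on HnswIndex and FlatIndex and their signatures are not shown in this PR text.

```rust
use std::collections::BTreeSet;

/// A sealed segment whose vectors are addressed by local ids 0..n.
pub struct Segment {
    /// Global id of local id 0.
    pub base_id: u32,
    pub vectors: Vec<Vec<f32>>,
}

/// Brute-force filtered search; returns (global id, squared distance), nearest first.
pub fn search_filtered_offset(
    seg: &Segment,
    query: &[f32],
    k: usize,
    allowed: &BTreeSet<u32>,
) -> Vec<(u32, f32)> {
    let mut hits: Vec<(u32, f32)> = seg
        .vectors
        .iter()
        .enumerate()
        .filter_map(|(local, v)| {
            // Membership is tested on the global id, not the local one.
            let global = seg.base_id + local as u32;
            if !allowed.contains(&global) {
                return None;
            }
            let d: f32 = v.iter().zip(query).map(|(x, y)| (x - y) * (x - y)).sum();
            Some((global, d))
        })
        .collect();
    hits.sort_by(|a, b| a.1.total_cmp(&b.1));
    hits.truncate(k);
    hits
}
```

For a second segment with `base_id = 2` and a filter containing only global id 3, the match is local id 1. Testing the local id against the filter would find nothing in that segment.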
…d_map

Checkpoint serialization previously skipped deleted vectors entirely,
so on restore those local ids were missing and all subsequent ids shifted
by one — corrupting the HNSW graph's neighbor adjacency. Fix by capturing
the deleted flag for every vector (growing and building segments) and
replaying tombstones via insert_tombstoned / index.delete() on restore.
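
A simplified picture of the tombstone round trip, assuming a flat index where slot i holds local id i. The method names follow the PR description (get_vector, is_deleted, insert_tombstoned, from_checkpoint, BuildingSnapshot.deleted), but every signature and the `to_checkpoint` helper are assumptions made for this sketch.

```rust
/// Simplified flat index: slot i holds the vector with local id i.
#[derive(Default)]
pub struct FlatIndex {
    vectors: Vec<Vec<f32>>,
    deleted: Vec<bool>,
}

impl FlatIndex {
    pub fn insert(&mut self, v: Vec<f32>) -> usize {
        self.vectors.push(v);
        self.deleted.push(false);
        self.vectors.len() - 1
    }

    /// Inserts a slot that is already soft-deleted, so later ids do not shift.
    pub fn insert_tombstoned(&mut self, v: Vec<f32>) -> usize {
        let id = self.insert(v);
        self.deleted[id] = true;
        id
    }

    /// Marks a live slot as deleted; returns false if it was absent or already dead.
    pub fn delete(&mut self, id: usize) -> bool {
        let was_live = id < self.deleted.len() && !self.deleted[id];
        if was_live {
            self.deleted[id] = true;
        }
        was_live
    }

    pub fn is_deleted(&self, id: usize) -> bool {
        self.deleted.get(id).copied().unwrap_or(false)
    }

    /// Returns None for out-of-range ids and for tombstoned slots.
    pub fn get_vector(&self, id: usize) -> Option<&[f32]> {
        self.vectors
            .get(id)
            .filter(|_| !self.deleted[id])
            .map(|v| v.as_slice())
    }
}

/// Snapshot keeps every slot plus a parallel `deleted` flag.
pub struct BuildingSnapshot {
    pub vectors: Vec<Vec<f32>>,
    pub deleted: Vec<bool>,
}

pub fn to_checkpoint(index: &FlatIndex) -> BuildingSnapshot {
    BuildingSnapshot {
        vectors: index.vectors.clone(),
        deleted: index.deleted.clone(),
    }
}

pub fn from_checkpoint(snap: BuildingSnapshot) -> FlatIndex {
    let mut index = FlatIndex::default();
    for (v, dead) in snap.vectors.into_iter().zip(snap.deleted) {
        if dead {
            index.insert_tombstoned(v);
        } else {
            index.insert(v);
        }
    }
    index
}
```

Keeping the dead slot in the snapshot is what preserves id 2 as id 2 after restore; skipping it would renumber every later vector.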

Separately, compact() previously discarded doc_id_map and multi_doc_map
entries for the segment being compacted. Use compact_with_map() to obtain
the old→new local id remapping and rewrite both maps so that global ids
continue to resolve to the correct document strings after compaction.
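
The remap step might look roughly like this. Both functions are illustrative stand-ins: the real compact_with_map is a method on HnswIndex, the shape of its id_map is not specified in the PR text (a `Vec<Option<u32>>` indexed by old local id is assumed here), and `remap_doc_ids` is a hypothetical helper that assumes the map only holds entries for the segment being compacted.

```rust
use std::collections::HashMap;

/// Drops tombstoned slots in place and returns (removed count, old id -> new id).
pub fn compact_with_map(
    vectors: &mut Vec<Vec<f32>>,
    deleted: &mut Vec<bool>,
) -> (usize, Vec<Option<u32>>) {
    let mut id_map = Vec::with_capacity(vectors.len());
    let mut kept = Vec::new();
    for (v, dead) in vectors.drain(..).zip(deleted.drain(..)) {
        if dead {
            id_map.push(None);
        } else {
            id_map.push(Some(kept.len() as u32));
            kept.push(v);
        }
    }
    let removed = id_map.len() - kept.len();
    *deleted = vec![false; kept.len()];
    *vectors = kept;
    (removed, id_map)
}

/// Rewrites global id -> document id entries for one segment after compaction.
/// Entries whose vector was removed, or that fall outside the segment, are dropped.
pub fn remap_doc_ids(
    doc_id_map: &HashMap<u32, String>,
    base_id: u32,
    id_map: &[Option<u32>],
) -> HashMap<u32, String> {
    doc_id_map
        .iter()
        .filter_map(|(&global, doc)| {
            let local = global.checked_sub(base_id)? as usize;
            let new_local = (*id_map.get(local)?)?;
            Some((base_id + new_local, doc.clone()))
        })
        .collect()
}
```

After compaction removes the middle of three vectors, the document that used to sit at global id `base_id + 2` resolves at `base_id + 1`, which is the lookup the old code lost.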
…lection

Add IndexConfig / IndexType to VectorCollection so PQ-configured collections
train and store PQ codes when sealing a segment rather than always falling
back to SQ8. Split quantizer training helpers into collection/quantize.rs to
keep lifecycle.rs under the 500-line file cap.

Extend the sealed-segment search path to use a unified quantized_search()
function that handles both PQ and SQ8 with proper asymmetric scoring,
widened candidate generation, and exact FP32 reranking. Report PQ
quantization and index type correctly in collection stats.
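
The two-stage flow (approximate scoring on codes, then exact rerank) can be sketched with a toy scalar quantizer. Everything here is an assumption for illustration: `Sq8` uses a single global range rather than whatever Sq8Codec does, candidates come from a full scan instead of HNSW traversal, and the `widen` parameter is a hypothetical name for the candidate-widening factor.

```rust
/// Toy scalar quantizer: one global [min, max] range mapped onto 0..=255.
pub struct Sq8 {
    min: f32,
    scale: f32,
}

impl Sq8 {
    pub fn train(data: &[Vec<f32>]) -> Self {
        assert!(!data.is_empty(), "need at least one training vector");
        let (mut lo, mut hi) = (f32::INFINITY, f32::NEG_INFINITY);
        for &x in data.iter().flatten() {
            lo = lo.min(x);
            hi = hi.max(x);
        }
        let scale = if hi > lo { (hi - lo) / 255.0 } else { 1.0 };
        Sq8 { min: lo, scale }
    }

    pub fn encode(&self, v: &[f32]) -> Vec<u8> {
        v.iter()
            .map(|&x| ((x - self.min) / self.scale).round().clamp(0.0, 255.0) as u8)
            .collect()
    }

    /// Asymmetric distance: FP32 query against a code decoded on the fly.
    pub fn asymmetric_distance(&self, query: &[f32], code: &[u8]) -> f32 {
        query
            .iter()
            .zip(code)
            .map(|(&q, &c)| {
                let x = self.min + c as f32 * self.scale;
                (q - x) * (q - x)
            })
            .sum()
    }
}

fn l2_sq(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

/// Scores all codes approximately, keeps k * widen candidates, reranks in FP32.
pub fn quantized_search(
    codec: &Sq8,
    codes: &[Vec<u8>],
    full: &[Vec<f32>],
    query: &[f32],
    k: usize,
    widen: usize,
) -> Vec<(usize, f32)> {
    // Stage 1: rank by approximate asymmetric distance.
    let mut approx: Vec<(usize, f32)> = codes
        .iter()
        .enumerate()
        .map(|(i, c)| (i, codec.asymmetric_distance(query, c)))
        .collect();
    approx.sort_by(|a, b| a.1.total_cmp(&b.1));
    approx.truncate(k.saturating_mul(widen.max(1)));

    // Stage 2: rescore the survivors with exact FP32 distance.
    let mut exact: Vec<(usize, f32)> = approx
        .into_iter()
        .map(|(i, _)| (i, l2_sq(query, &full[i])))
        .collect();
    exact.sort_by(|a, b| a.1.total_cmp(&b.1));
    exact.truncate(k);
    exact
}
```

Widening matters because quantization can reorder near neighbors: two vectors that encode to the same code are only separated by the FP32 rerank.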

@farhan-syah merged commit 4d18c03 into main on Apr 16, 2026
2 checks passed
@farhan-syah deleted the bug-to-coverage/issue-50-vector branch on April 16, 2026 21:18
Development

Successfully merging this pull request may close these issues.

Vector subsystem — correctness, memory-safety, quantization (7 sub-items)

1 participant