fix: resolve issues #36–#44 — security, robustness, and correctness #49

Merged
farhan-syah merged 10 commits into main from fix/issues-36-44-security-and-robustness
Apr 16, 2026

Conversation

@farhan-syah
Contributor

Summary

Fixes 9 issues (14 sub-problems) spanning WAL crypto, memory safety, OOM, performance, DoS hardening, and pgwire correctness.

Also fixes a silent data-loss bug in encode_row_for_wal that swallowed serialization errors via unwrap_or_default().

Test plan

  • 17 new tests across 7 files — all pass
  • 680/680 shared crate tests pass (nodedb-wal, nodedb-vector, nodedb-columnar, nodedb-query, nodedb-sql)
  • 2834/2835 core crate tests pass (1 pre-existing flaky cluster test)
  • Full SQL test suite (scripts/test.sh) passes — zero regressions across ~400 query results
  • cargo fmt --all clean
  • cargo clippy --all-targets clean

Closes #36, closes #37, closes #38, closes #39, closes #40, closes #41, closes #42, closes #43, closes #44

…tion

Replace the LSN-only nonce derivation with a `[4-byte random epoch][8-byte
LSN]` scheme. A fresh epoch is generated at construction time via getrandom,
ensuring nonces are never reused across WAL lifetimes (process restart,
snapshot restore, segment rotation) even when LSNs restart from 1.

Update segment decryption to pass the epoch from the key so the AAD binding
remains consistent. Refactor mmap_reader and reader for cleaner error paths.
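A minimal sketch of the nonce layout described above (the function name is illustrative, not from the codebase): a 4-byte epoch drawn once at WAL construction, followed by the 8-byte LSN, forming a 12-byte AEAD nonce.

```rust
// Hypothetical sketch: 4-byte random epoch || 8-byte LSN = 12-byte nonce.
// The epoch would come from getrandom at WAL construction time.
fn wal_nonce(epoch: u32, lsn: u64) -> [u8; 12] {
    let mut nonce = [0u8; 12];
    nonce[..4].copy_from_slice(&epoch.to_be_bytes());
    nonce[4..].copy_from_slice(&lsn.to_be_bytes());
    nonce
}

fn main() {
    // After a restart LSNs may begin again at 1, but a fresh random epoch
    // keeps the full nonce distinct from any nonce of the previous lifetime.
    assert_ne!(wal_nonce(0x1111_1111, 1), wal_nonce(0x2222_2222, 1));
}
```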
- Add 10-second timeout on TLS handshakes across all listeners (pgwire,
  native, RESP, ILP) to prevent slow-handshake connection slot exhaustion
- Enforce a 10 MiB per-line limit on ILP ingestion using read_until instead
  of lines(), dropping connections that exceed it
- Limit recursive expression parsing depth to 128 in both the SQL resolver
  and the generated-expression parser to prevent stack overflow on
  deeply nested malformed ASTs
- Cap ef_search at 8192 in HNSW vector search to prevent DoS via unbounded
  beam width
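The recursion-depth guard can be sketched as follows (types and names are hypothetical; the real parsers thread a depth counter through their recursive descent the same way):

```rust
// Illustrative depth guard: recursion bails out past MAX_DEPTH instead of
// overflowing the stack on a deeply nested, attacker-supplied expression.
const MAX_DEPTH: usize = 128;

enum Expr {
    Leaf(i64),
    Neg(Box<Expr>),
}

fn eval(expr: &Expr, depth: usize) -> Result<i64, String> {
    if depth > MAX_DEPTH {
        return Err("expression nesting exceeds limit".into());
    }
    match expr {
        Expr::Leaf(v) => Ok(*v),
        Expr::Neg(inner) => Ok(-eval(inner, depth + 1)?),
    }
}

fn main() {
    // An expression nested 200 deep is rejected by the guard.
    let mut e = Expr::Leaf(7);
    for _ in 0..200 {
        e = Expr::Neg(Box::new(e));
    }
    assert!(eval(&e, 0).is_err());
    assert_eq!(eval(&Expr::Leaf(7), 0), Ok(7));
}
```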
…earch

The quantized two-phase search was scanning all vectors to build candidates
instead of using the HNSW graph. Replace the O(N) brute-force loop with
HNSW graph traversal for O(log N) candidate generation, then rerank with
exact FP32 distance as before.

Also extend mmap_segment with additional helper methods and add unit tests
for SQ8 search correctness via the collection API.
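The two-phase shape can be sketched like this (names hypothetical; phase 1 in the fix is HNSW graph traversal over quantized codes, stubbed here as a candidate-id list, and phase 2 reranks with exact FP32 distance):

```rust
// Exact squared L2 distance used for the rerank phase.
fn l2_sq(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

// Phase 2: rescore approximate candidates with exact FP32 distance and
// keep the best k. Phase 1 (HNSW traversal) supplies `candidates`.
fn rerank(query: &[f32], vectors: &[Vec<f32>], candidates: &[usize], k: usize) -> Vec<usize> {
    let mut scored: Vec<(f32, usize)> = candidates
        .iter()
        .map(|&id| (l2_sq(query, &vectors[id]), id))
        .collect();
    scored.sort_by(|a, b| a.0.total_cmp(&b.0));
    scored.into_iter().take(k).map(|(_, id)| id).collect()
}

fn main() {
    let vectors = vec![vec![0.0, 0.0], vec![1.0, 1.0], vec![0.1, 0.0]];
    // Exact rerank of the candidate set picks the true nearest neighbor.
    assert_eq!(rerank(&[0.2, 0.0], &vectors, &[0, 1, 2], 1), vec![2]);
}
```

The point of the fix is that `candidates` now comes from O(log N) graph traversal rather than a scan of all N vectors; the rerank step itself is unchanged.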
Add richer WAL record variants for columnar operations to support finer-
grained replay. Fix mutation handling for edge cases in segment application.
…ng accumulators

The previous implementation stored all matching raw document bytes grouped
by key, then aggregated at the end — O(total_docs × avg_doc_size) memory.

Replace with per-group streaming accumulators (accum.rs) that retain only
the derived state needed for each aggregate function. Memory is now
O(num_groups × num_aggregates) regardless of how many documents match.

Supported functions: count, sum, avg, min, max, count_distinct,
stddev/variance (Welford), approx_count_distinct (HLL), approx_percentile
(t-digest), approx_topk (space-saving), array_agg, string_agg,
percentile_cont. Array-materializing variants are capped at 10,000 items.

Add aggregate_helpers.rs in nodedb-query for the field-extraction primitives
used by the accumulator path.
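The stddev/variance case illustrates the accumulator pattern well. A Welford-style sketch (struct name illustrative): per-group state is three numbers, independent of how many documents match.

```rust
// Welford's online algorithm: O(1) state per group instead of buffering
// every matching document's bytes.
#[derive(Default)]
struct Welford {
    n: u64,
    mean: f64,
    m2: f64,
}

impl Welford {
    fn push(&mut self, x: f64) {
        self.n += 1;
        let delta = x - self.mean;
        self.mean += delta / self.n as f64;
        self.m2 += delta * (x - self.mean);
    }

    fn sample_variance(&self) -> f64 {
        if self.n < 2 { 0.0 } else { self.m2 / (self.n - 1) as f64 }
    }
}

fn main() {
    let mut acc = Welford::default();
    for x in [1.0, 2.0, 3.0, 4.0, 5.0] {
        acc.push(x);
    }
    assert_eq!(acc.mean, 3.0);
    assert_eq!(acc.sample_variance(), 2.5);
}
```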
…cans

When the columnar memtable reaches the flush threshold, drain it into a
compressed segment and retain the bytes in memory. Extend scan_normalize
to read from flushed segments before falling back to the live memtable,
so queries see all rows regardless of whether they have been flushed.

This makes columnar scan results consistent across memtable boundaries
without requiring a durable write path yet.
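An illustrative sketch of that read path (types and names hypothetical): flushed segments are scanned first, then the live memtable, so every row is visible regardless of flush boundaries.

```rust
// Hypothetical model: rows live either in retained flushed segments or in
// the live memtable; a scan chains both so no rows are dropped at the
// flush boundary.
struct ColumnarTable {
    flushed: Vec<Vec<i64>>,
    memtable: Vec<i64>,
}

impl ColumnarTable {
    fn scan(&self) -> Vec<i64> {
        self.flushed
            .iter()
            .flatten()
            .copied()
            .chain(self.memtable.iter().copied())
            .collect()
    }
}

fn main() {
    let t = ColumnarTable {
        flushed: vec![vec![1, 2], vec![3]],
        memtable: vec![4],
    };
    // Rows 1-3 were flushed, row 4 is still in the memtable; all appear.
    assert_eq!(t.scan(), vec![1, 2, 3, 4]);
}
```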
… result columns

Two issues in the prepared-statement extended-query path:

1. DSL statements (SEARCH, GRAPH, MATCH, UPSERT INTO, etc.) were not
   handled by the Execute phase. Route them through the same DSL dispatcher
   used by simple queries; bound parameters are intentionally ignored for
   DSL.

2. When a statement declares typed result columns via Describe, Execute was
   producing a single-column JSON response against the N-column schema
   described to the client, causing null values for columns 2..N. Add a
   reproject step that parses each JSON object and re-encodes it with one
   pgwire field per declared column, with missing fields sent as SQL NULL.
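The reprojection can be sketched as follows (names hypothetical; the decoded JSON object is modeled as a field-to-value map): one output value per declared column, with missing fields mapped to SQL NULL.

```rust
use std::collections::HashMap;

// Re-encode a decoded row against the column list the client was given in
// Describe: exactly one value per declared column, None for missing fields.
fn reproject(row: &HashMap<String, String>, declared: &[&str]) -> Vec<Option<String>> {
    declared.iter().map(|col| row.get(*col).cloned()).collect()
}

fn main() {
    let mut row = HashMap::new();
    row.insert("id".to_string(), "1".to_string());
    // Column "score" was declared but is absent from this row, so it is
    // sent as NULL rather than the row collapsing into one JSON column.
    assert_eq!(
        reproject(&row, &["id", "score"]),
        vec![Some("1".to_string()), None]
    );
}
```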
…ements

Cover GROUP BY with sum/avg/min/max/count over flushed columnar segments,
and extended-query protocol correctness for typed result columns and DSL
statement passthrough.
Replace nested if-let and if-inside-if-let patterns with let-chain
conditions in AggAccum::accumulate. Eliminates unnecessary nesting,
makes null-skip and dedup conditions read left-to-right, and avoids
a spurious clone of COUNT(*) row in the non-null branch.

Also replace manual .max(64).min(MAX_EF_SEARCH) with .clamp() in
the ef_search sizing helper for clarity.
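The `.clamp()` change behaves identically to the old min/max chain; a sketch (the sizing function name is illustrative, the 8192 constant matches the DoS cap above):

```rust
const MAX_EF_SEARCH: usize = 8192;

fn ef_search_for(requested: usize) -> usize {
    // Equivalent to requested.max(64).min(MAX_EF_SEARCH), but the lower
    // and upper bounds now read together in a single call.
    requested.clamp(64, MAX_EF_SEARCH)
}

fn main() {
    assert_eq!(ef_search_for(10), 64);
    assert_eq!(ef_search_for(500), 500);
    assert_eq!(ef_search_for(100_000), MAX_EF_SEARCH);
}
```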
…ds checks

nodedb-vector/src/mmap_segment.rs: rewrite the mmap offset bounds
check from a single nested checked_add/checked_mul expression into
sequential early-returns via let-else. Each overflow or out-of-bounds
condition is now its own guard, making the failure paths obvious.

nodedb-wal/src/crypto.rs: replace .clone() calls on Copy epoch values
in tests with copy-dereference to avoid unnecessary clone on a type
that implements Copy.
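The let-else style can be sketched like this (function and parameter names hypothetical): each overflow or out-of-bounds case is its own guard instead of one nested `checked_mul`/`checked_add` expression.

```rust
// Sequential early-return bounds check: multiply overflow, add overflow,
// and out-of-range end are each a separate, visible failure path.
fn view(buf: &[u8], offset: usize, count: usize, elem_size: usize) -> Option<&[u8]> {
    let Some(byte_len) = count.checked_mul(elem_size) else { return None };
    let Some(end) = offset.checked_add(byte_len) else { return None };
    if end > buf.len() {
        return None;
    }
    Some(&buf[offset..end])
}

fn main() {
    let buf = [0u8; 10];
    assert_eq!(view(&buf, 2, 2, 2).map(|s| s.len()), Some(4));
    assert!(view(&buf, 8, 4, 1).is_none()); // end past buffer
    assert!(view(&buf, 0, usize::MAX, 2).is_none()); // multiply overflow
}
```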
@farhan-syah farhan-syah merged commit a8acd1a into main Apr 16, 2026
2 checks passed
@farhan-syah farhan-syah deleted the fix/issues-36-44-security-and-robustness branch April 16, 2026 11:38