fix: resolve issues #36–#44 — security, robustness, and correctness #49

Merged
farhan-syah merged 10 commits into main from fix/issues-36-44-security-and-robustness
Apr 16, 2026

Conversation

@farhan-syah
Contributor

Summary

Fixes 9 issues (14 sub-problems) spanning WAL crypto, memory safety, OOM, performance, DoS hardening, and pgwire correctness.

Also fixes a silent data-loss bug in encode_row_for_wal that swallowed serialization errors via unwrap_or_default().

Test plan

  • 17 new tests across 7 files — all pass
  • 680/680 shared crate tests pass (nodedb-wal, nodedb-vector, nodedb-columnar, nodedb-query, nodedb-sql)
  • 2834/2835 core crate tests pass (1 pre-existing flaky cluster test)
  • Full SQL test suite (scripts/test.sh) passes — zero regressions across ~400 query results
  • cargo fmt --all clean
  • cargo clippy --all-targets clean

Closes #36, closes #37, closes #38, closes #39, closes #40, closes #41, closes #42, closes #43, closes #44

…tion

Replace the LSN-only nonce derivation with a `[4-byte random epoch][8-byte
LSN]` scheme. A fresh epoch is generated at construction time via getrandom,
ensuring nonces are never reused across WAL lifetimes (process restart,
snapshot restore, segment rotation) even when LSNs restart from 1.

Update segment decryption to pass the epoch from the key so the AAD binding
remains consistent. Refactor mmap_reader and reader for cleaner error paths.
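A minimal sketch of the nonce layout described above (the function name is illustrative, not from the codebase): a 4-byte epoch drawn once at WAL construction, followed by the 8-byte LSN, forming a 12-byte AEAD nonce.

```rust
// Hypothetical sketch: 4-byte random epoch || 8-byte LSN = 12-byte nonce.
// The epoch would come from getrandom at WAL construction time.
fn wal_nonce(epoch: u32, lsn: u64) -> [u8; 12] {
    let mut nonce = [0u8; 12];
    nonce[..4].copy_from_slice(&epoch.to_be_bytes());
    nonce[4..].copy_from_slice(&lsn.to_be_bytes());
    nonce
}

fn main() {
    // After a restart LSNs may begin again at 1, but a fresh random epoch
    // keeps the full nonce distinct from any nonce of the previous lifetime.
    assert_ne!(wal_nonce(0x1111_1111, 1), wal_nonce(0x2222_2222, 1));
}
```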
- Add 10-second timeout on TLS handshakes across all listeners (pgwire,
  native, RESP, ILP) to prevent slow-handshake connection slot exhaustion
- Enforce a 10 MiB per-line limit on ILP ingestion using read_until instead
  of lines(), dropping connections that exceed it
- Limit recursive expression parsing depth to 128 in both the SQL resolver
  and the generated-expression parser to prevent stack overflow on
  deeply nested malformed ASTs
- Cap ef_search at 8192 in HNSW vector search to prevent DoS via unbounded
  beam width
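The recursion-depth guard can be sketched as follows (types and names are hypothetical; the real parsers thread a depth counter through their recursive descent the same way):

```rust
// Illustrative depth guard: recursion bails out past MAX_DEPTH instead of
// overflowing the stack on a deeply nested, attacker-supplied expression.
const MAX_DEPTH: usize = 128;

enum Expr {
    Leaf(i64),
    Neg(Box<Expr>),
}

fn eval(expr: &Expr, depth: usize) -> Result<i64, String> {
    if depth > MAX_DEPTH {
        return Err("expression nesting exceeds limit".into());
    }
    match expr {
        Expr::Leaf(v) => Ok(*v),
        Expr::Neg(inner) => Ok(-eval(inner, depth + 1)?),
    }
}

fn main() {
    // An expression nested 200 deep is rejected by the guard.
    let mut e = Expr::Leaf(7);
    for _ in 0..200 {
        e = Expr::Neg(Box::new(e));
    }
    assert!(eval(&e, 0).is_err());
    assert_eq!(eval(&Expr::Leaf(7), 0), Ok(7));
}
```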
…earch

The quantized two-phase search was scanning all vectors to build candidates
instead of using the HNSW graph. Replace the O(N) brute-force loop with
HNSW graph traversal for O(log N) candidate generation, then rerank with
exact FP32 distance as before.

Also extend mmap_segment with additional helper methods and add unit tests
for SQ8 search correctness via the collection API.
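The two-phase shape can be sketched like this (names hypothetical; phase 1 in the fix is HNSW graph traversal over quantized codes, stubbed here as a candidate-id list, and phase 2 reranks with exact FP32 distance):

```rust
// Exact squared L2 distance used for the rerank phase.
fn l2_sq(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

// Phase 2: rescore approximate candidates with exact FP32 distance and
// keep the best k. Phase 1 (HNSW traversal) supplies `candidates`.
fn rerank(query: &[f32], vectors: &[Vec<f32>], candidates: &[usize], k: usize) -> Vec<usize> {
    let mut scored: Vec<(f32, usize)> = candidates
        .iter()
        .map(|&id| (l2_sq(query, &vectors[id]), id))
        .collect();
    scored.sort_by(|a, b| a.0.total_cmp(&b.0));
    scored.into_iter().take(k).map(|(_, id)| id).collect()
}

fn main() {
    let vectors = vec![vec![0.0, 0.0], vec![1.0, 1.0], vec![0.1, 0.0]];
    // Exact rerank of the candidate set picks the true nearest neighbor.
    assert_eq!(rerank(&[0.2, 0.0], &vectors, &[0, 1, 2], 1), vec![2]);
}
```

The point of the fix is that `candidates` now comes from O(log N) graph traversal rather than a scan of all N vectors; the rerank step itself is unchanged.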
Add richer WAL record variants for columnar operations to support finer-
grained replay. Fix mutation handling for edge cases in segment application.
…ng accumulators

The previous implementation stored all matching raw document bytes grouped
by key, then aggregated at the end — O(total_docs × avg_doc_size) memory.

Replace with per-group streaming accumulators (accum.rs) that retain only
the derived state needed for each aggregate function. Memory is now
O(num_groups × num_aggregates) regardless of how many documents match.

Supported functions: count, sum, avg, min, max, count_distinct,
stddev/variance (Welford), approx_count_distinct (HLL), approx_percentile
(t-digest), approx_topk (space-saving), array_agg, string_agg,
percentile_cont. Array-materializing variants are capped at 10,000 items.

Add aggregate_helpers.rs in nodedb-query for the field-extraction primitives
used by the accumulator path.
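The stddev/variance case illustrates the accumulator pattern well. A Welford-style sketch (struct name illustrative): per-group state is three numbers, independent of how many documents match.

```rust
// Welford's online algorithm: O(1) state per group instead of buffering
// every matching document's bytes.
#[derive(Default)]
struct Welford {
    n: u64,
    mean: f64,
    m2: f64,
}

impl Welford {
    fn push(&mut self, x: f64) {
        self.n += 1;
        let delta = x - self.mean;
        self.mean += delta / self.n as f64;
        self.m2 += delta * (x - self.mean);
    }

    fn sample_variance(&self) -> f64 {
        if self.n < 2 { 0.0 } else { self.m2 / (self.n - 1) as f64 }
    }
}

fn main() {
    let mut acc = Welford::default();
    for x in [1.0, 2.0, 3.0, 4.0, 5.0] {
        acc.push(x);
    }
    assert_eq!(acc.mean, 3.0);
    assert_eq!(acc.sample_variance(), 2.5);
}
```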
…cans

When the columnar memtable reaches the flush threshold, drain it into a
compressed segment and retain the bytes in memory. Extend scan_normalize
to read from flushed segments before falling back to the live memtable,
so queries see all rows regardless of whether they have been flushed.

This makes columnar scan results consistent across memtable boundaries
without requiring a durable write path yet.
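An illustrative sketch of that read path (types and names hypothetical): flushed segments are scanned first, then the live memtable, so every row is visible regardless of flush boundaries.

```rust
// Hypothetical model: rows live either in retained flushed segments or in
// the live memtable; a scan chains both so no rows are dropped at the
// flush boundary.
struct ColumnarTable {
    flushed: Vec<Vec<i64>>,
    memtable: Vec<i64>,
}

impl ColumnarTable {
    fn scan(&self) -> Vec<i64> {
        self.flushed
            .iter()
            .flatten()
            .copied()
            .chain(self.memtable.iter().copied())
            .collect()
    }
}

fn main() {
    let t = ColumnarTable {
        flushed: vec![vec![1, 2], vec![3]],
        memtable: vec![4],
    };
    // Rows 1-3 were flushed, row 4 is still in the memtable; all appear.
    assert_eq!(t.scan(), vec![1, 2, 3, 4]);
}
```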
… result columns

Two issues in the prepared-statement extended-query path:

1. DSL statements (SEARCH, GRAPH, MATCH, UPSERT INTO, etc.) were not
   handled by the Execute phase. Route them through the same DSL dispatcher
   used by simple queries; bound parameters are intentionally ignored for
   DSL.

2. When a statement declares typed result columns via Describe, Execute was
   producing a single-column JSON response against the N-column schema
   described to the client, causing null values for columns 2..N. Add a
   reproject step that parses each JSON object and re-encodes it with one
   pgwire field per declared column, with missing fields sent as SQL NULL.
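The reprojection can be sketched as follows (names hypothetical; the decoded JSON object is modeled as a field-to-value map): one output value per declared column, with missing fields mapped to SQL NULL.

```rust
use std::collections::HashMap;

// Re-encode a decoded row against the column list the client was given in
// Describe: exactly one value per declared column, None for missing fields.
fn reproject(row: &HashMap<String, String>, declared: &[&str]) -> Vec<Option<String>> {
    declared.iter().map(|col| row.get(*col).cloned()).collect()
}

fn main() {
    let mut row = HashMap::new();
    row.insert("id".to_string(), "1".to_string());
    // Column "score" was declared but is absent from this row, so it is
    // sent as NULL rather than the row collapsing into one JSON column.
    assert_eq!(
        reproject(&row, &["id", "score"]),
        vec![Some("1".to_string()), None]
    );
}
```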
…ements

Cover GROUP BY with sum/avg/min/max/count over flushed columnar segments,
and extended-query protocol correctness for typed result columns and DSL
statement passthrough.
Replace nested if-let and if-inside-if-let patterns with let-chain
conditions in AggAccum::accumulate. Eliminates unnecessary nesting,
makes null-skip and dedup conditions read left-to-right, and avoids
a spurious clone of COUNT(*) row in the non-null branch.

Also replace manual .max(64).min(MAX_EF_SEARCH) with .clamp() in
the ef_search sizing helper for clarity.
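The `.clamp()` change behaves identically to the old min/max chain; a sketch (the sizing function name is illustrative, the 8192 constant matches the DoS cap above):

```rust
const MAX_EF_SEARCH: usize = 8192;

fn ef_search_for(requested: usize) -> usize {
    // Equivalent to requested.max(64).min(MAX_EF_SEARCH), but the lower
    // and upper bounds now read together in a single call.
    requested.clamp(64, MAX_EF_SEARCH)
}

fn main() {
    assert_eq!(ef_search_for(10), 64);
    assert_eq!(ef_search_for(500), 500);
    assert_eq!(ef_search_for(100_000), MAX_EF_SEARCH);
}
```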
…ds checks

nodedb-vector/src/mmap_segment.rs: rewrite the mmap offset bounds
check from a single nested checked_add/checked_mul expression into
sequential early-returns via let-else. Each overflow or out-of-bounds
condition is now its own guard, making the failure paths obvious.

nodedb-wal/src/crypto.rs: replace .clone() calls on Copy epoch values
in tests with copy-dereference to avoid unnecessary clone on a type
that implements Copy.
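The let-else style can be sketched like this (function and parameter names hypothetical): each overflow or out-of-bounds case is its own guard instead of one nested `checked_mul`/`checked_add` expression.

```rust
// Sequential early-return bounds check: multiply overflow, add overflow,
// and out-of-range end are each a separate, visible failure path.
fn view(buf: &[u8], offset: usize, count: usize, elem_size: usize) -> Option<&[u8]> {
    let Some(byte_len) = count.checked_mul(elem_size) else { return None };
    let Some(end) = offset.checked_add(byte_len) else { return None };
    if end > buf.len() {
        return None;
    }
    Some(&buf[offset..end])
}

fn main() {
    let buf = [0u8; 10];
    assert_eq!(view(&buf, 2, 2, 2).map(|s| s.len()), Some(4));
    assert!(view(&buf, 8, 4, 1).is_none()); // end past buffer
    assert!(view(&buf, 0, usize::MAX, 2).is_none()); // multiply overflow
}
```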
@farhan-syah farhan-syah merged commit a8acd1a into main Apr 16, 2026
2 checks passed
@farhan-syah farhan-syah deleted the fix/issues-36-44-security-and-robustness branch April 16, 2026 11:38