feat(cluster): distributed cluster infrastructure#45

Merged
farhan-syah merged 24 commits into main from cluster on Apr 16, 2026
Conversation

@farhan-syah
Contributor

Summary

  • SWIM failure detector with gossip dissemination, UDP transport, membership subscriber hooks, and configurable failure detection
  • Raft consensus hardening — deduplicated votes, configurable election timeout, stable leader election gating
  • Routing & topology — liveness-driven leader hint invalidation, SHOW RANGES/ROUTING/SCHEMA VERSION commands, closed timestamp tracker, follower read gate
  • Graceful node lifecycle — decommission safety checks, coordinated drain flow, reachability-driven circuit-breaker recovery
  • Load-based rebalancer — metrics collection, deterministic planning, CPU backpressure gate, elastic scaling via SWIM membership kicks
  • Transactional DDL — typed DDL AST parser, session-scoped DDL buffer, atomic Raft batch replication, AST-based dispatch
  • Operational endpoints — /health/live and /health/drain, pg_catalog virtual tables, nodedb.read_consistency session parameter
  • Migration executor — corrected learner promotion and replicated Phase 3 cut-over
  • Test infrastructure — readiness polling replaces fixed sleeps, serialized cluster test group, DDL replication correctness tests

Test plan

  • cargo nextest run -p nodedb-cluster — all cluster unit + integration tests pass
  • cargo nextest run --test sql_ddl_cluster — DDL replication correctness across 3-node cluster
  • cargo nextest run -p nodedb-sql — SQL parser and planner tests including const-fold
  • cargo clippy --all-targets -- -D warnings — no warnings
  • cargo fmt --all --check — formatting clean

Introduce a `MembershipSubscriber` trait that the failure detector
calls after every accepted member state transition (insert, Alive →
Suspect, Suspect → Dead, any → Left). Subscribers are synchronous and
must not block; they are suitable for cheap in-memory bookkeeping.

Key changes:
- `subscriber.rs`: defines `MembershipSubscriber` with a single
  `on_state_change(node_id, old, new)` method.
- `detector/runner.rs`: adds `with_subscribers` constructor and
  `apply_and_notify` helper that wraps `apply_and_disseminate` with
  before/after state snapshots; the no-subscriber path is
  zero-overhead (early return on empty slice).
- `bootstrap.rs`: adds `spawn_with_subscribers` alongside the
  existing `spawn` entry-point so callers can inject subscribers
  without breaking the common no-observer signature.
- `mod.rs`: re-exports `MembershipSubscriber` and
  `spawn_with_subscribers`.
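
The shape described above can be sketched as follows. This is an illustrative reduction, not the PR's code: the `on_state_change` name comes from the text, but the concrete types (`u64` node ids, the `notify` helper, `CountingSubscriber`) are assumptions.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// SWIM member states; transitions between these drive notifications.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum MemberState {
    Alive,
    Suspect,
    Dead,
    Left,
}

pub trait MembershipSubscriber {
    /// Called synchronously after every accepted state transition;
    /// implementations must not block.
    fn on_state_change(&self, node_id: u64, old: MemberState, new: MemberState);
}

/// The kind of cheap in-memory bookkeeping the trait is intended for.
pub struct CountingSubscriber(pub AtomicUsize);

impl MembershipSubscriber for CountingSubscriber {
    fn on_state_change(&self, _node: u64, _old: MemberState, _new: MemberState) {
        self.0.fetch_add(1, Ordering::Relaxed);
    }
}

/// The no-subscriber path is zero-overhead: early return on an empty slice.
pub fn notify(subs: &[&dyn MembershipSubscriber], node: u64, old: MemberState, new: MemberState) {
    if subs.is_empty() {
        return;
    }
    for s in subs {
        s.on_state_change(node, old, new);
    }
}
```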

Add `RoutingLivenessHook`, a `MembershipSubscriber` that clears the
leader hint for every Raft group whose leaseholder transitions to
Suspect, Dead, or Left. After the hint is cleared, the next query
against those vShards gets `NotLeader`, triggers leader re-discovery,
and updates the routing table — limiting clients to at most one retry
on node failure.

`NodeIdResolver` is a closure type that maps SWIM `NodeId` to the
numeric routing-table id, keeping the hook storage-agnostic. Seed
placeholders and transient learners that the routing table has not
yet registered are silently ignored.

The integration test (`tests/swim_routing_invalidation.rs`) runs three
real UDP-backed SWIM nodes, shuts down the group leader, and asserts
that the hook clears the routing hint within a few suspicion timeouts.
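
Reduced to its core decision, the hook's invalidation rule looks like this. The function name and enum here are illustrative assumptions; the real hook also resolves SWIM ids via `NodeIdResolver` and mutates the routing table.

```rust
// SWIM member states as described in the PR text.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum MemberState {
    Alive,
    Suspect,
    Dead,
    Left,
}

/// True when a transition into `new` should clear the leader hint for
/// every Raft group whose leaseholder is the affected node.
pub fn invalidates_leader_hint(new: MemberState) -> bool {
    matches!(new, MemberState::Suspect | MemberState::Dead | MemberState::Left)
}
```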

Introduce a `decommission` module with four sub-modules:

- `safety`      — `check_can_decommission` validates that removing a
                  node from every Raft group it belongs to still
                  satisfies the configured replication factor.
- `flow`        — `DecommissionFlow` drives the ordered sequence of
                  metadata proposals (start → leadership transfers →
                  member removal → finish) with per-step error
                  propagation.
- `coordinator` — `DecommissionCoordinator` owns the flow and exposes a
                  single `run()` entry point for callers that hold the
                  metadata Raft handle.
- `observer`    — `DecommissionObserver` watches committed entries and
                  unblocks a waiting coordinator when the final
                  `FinishDecommission` entry is applied.

Supporting changes:

- `metadata_group::entry` — add `RoutingChange::RemoveMember` so the
  decommission flow can strip a draining node from every group it
  belongs to without a compatibility break.
- `metadata_group::applier` — add `with_live_state()` builder that
  attaches live `ClusterTopology` and `RoutingTable` handles; committed
  `TopologyChange` and `RoutingChange` entries now mutate them in place
  in addition to updating the in-memory history log.
- `routing` — add `remove_group_member()` with unit tests covering
  voter removal, learner-only removal, and unknown-group no-ops.
- `lifecycle::plan_decommission` — thin wrapper over
  `decommission::plan_full_decommission`; existing call sites are
  unchanged.
- `lib` — re-export public decommission surface.

Integration test `decommission_flow` exercises the full round-trip on a
three-node in-process cluster: safety check, coordinator run, and
topology/routing state after commit.
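
The safety rule can be sketched as a pure function: removing the node from every group it belongs to must still leave each group at or above the replication factor. The signature and group representation below are assumptions for illustration, not the PR's API.

```rust
use std::collections::HashMap;

/// Validates that decommissioning `node` leaves every group it belongs
/// to with at least `replication_factor` remaining members.
pub fn check_can_decommission(
    groups: &HashMap<u64, Vec<u64>>, // group id -> member node ids
    node: u64,
    replication_factor: usize,
) -> Result<(), String> {
    for (gid, members) in groups {
        if members.contains(&node) && members.len() - 1 < replication_factor {
            return Err(format!(
                "group {gid}: removing node {node} would leave {} < {replication_factor} replicas",
                members.len() - 1
            ));
        }
    }
    Ok(())
}
```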

Introduce a `reachability` module that actively probes peers whose
circuit breakers are stuck in the Open state. Without periodic probes,
a breaker that opens due to a transient failure never transitions back
to HalfOpen because there is no outbound traffic to trigger a
`check()` call — the node stays blacklisted indefinitely.

New pieces:

- `reachability::prober` — `ReachabilityProber` trait with a
  `TransportProber` adapter (wraps any `Transport: Clone + Send`) and a
  `NoopProber` for tests.
- `reachability::driver` — `ReachabilityDriver` runs a Tokio interval
  loop, calls `circuit_breaker.open_peers()` each tick, and fires one
  probe per stuck peer. A successful probe resets the breaker to
  Closed; a failed probe keeps it Open until the next interval.
  `ReachabilityDriverConfig` controls the probe interval and per-tick
  peer cap.

Supporting changes:

- `circuit_breaker::open_peers()` — returns the ids of every peer
  currently in the Open state so the driver can discover them without
  coupling to breaker internals.
- `routing_liveness` — remove stale inline checklist from the module
  doc; the comment now describes what actually happens.
- `lib` — re-export public reachability surface.

Integration test `reachability_loop` spins up a driver backed by a
`NoopProber`, forces three peers into the Open state, advances the
interval, and asserts that all three breakers return to Closed.
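
One tick of the driver loop can be sketched as below, under assumed shapes: a synchronous `probe` returning success, and a helper that reports which stuck peers should reset to Closed. The real driver runs async on a Tokio interval; only the per-tick decision is shown.

```rust
/// Minimal prober trait; the PR's `TransportProber` adapts a real transport.
pub trait ReachabilityProber {
    fn probe(&self, peer: u64) -> bool;
}

/// Always succeeds, as the PR's test prober does.
pub struct NoopProber;

impl ReachabilityProber for NoopProber {
    fn probe(&self, _peer: u64) -> bool {
        true
    }
}

/// Probes up to `per_tick_cap` Open peers and returns those whose
/// breakers should transition back to Closed this tick; peers whose
/// probe fails stay Open until the next interval.
pub fn tick(open_peers: &[u64], prober: &dyn ReachabilityProber, per_tick_cap: usize) -> Vec<u64> {
    open_peers
        .iter()
        .take(per_tick_cap)
        .copied()
        .filter(|&p| prober.probe(p))
        .collect()
}
```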

The SWIM implementation was built in incremental sub-batches whose
labels (E-α, E-β, E-γ, …) were embedded in module docs, inline
comments, and test file headers. Now that all layers are present the
labels no longer help a reader navigate the code and are confusing
when encountered out of context.

Replace each reference with a plain description of what the module or
code path actually does.
…driver loop

Introduces a new `rebalancer` module to the cluster crate that
complements the existing overload-triggered scheduler with a
storage-shape-driven rebalancing path (bytes + qps + vshard count).

Key components:

- `metrics`: `LoadMetrics`, `LoadMetricsProvider` trait, `LoadWeights`,
  and `normalized_score` for per-node pressure scoring
- `plan`: `compute_load_based_plan` produces a bounded set of vshard
  moves from hottest to coldest nodes using configurable imbalance
  thresholds (`RebalancerPlanConfig`)
- `driver`: `RebalancerLoop` orchestrates periodic plan evaluation and
  dispatches moves through the `MigrationDispatcher` trait; the
  `ElectionGate` trait lets callers gate sweeps on Raft leadership;
  `AlwaysReadyGate` is provided for tests

All planning logic is pure and side-effect-free, making it fully
unit-testable without spawning any Tokio tasks. An end-to-end
integration test (`tests/rebalancer_loop.rs`) wires a `StaticProvider`,
a `DirectDispatcher`, and `AlwaysReadyGate` to assert that a full
plan → dispatch → routing-table-mutation cycle works correctly.
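
A plausible form of the per-node scoring is sketched below. The names (`LoadMetrics`, `LoadWeights`, `normalized_score`) come from the PR; the exact formula, normalizing each dimension against the cluster max before weighting, is an assumption.

```rust
pub struct LoadMetrics {
    pub bytes: f64,
    pub qps: f64,
    pub vshards: f64,
}

pub struct LoadWeights {
    pub bytes: f64,
    pub qps: f64,
    pub vshards: f64,
}

/// Normalizes each dimension against the cluster-wide max so that
/// heterogeneous units (bytes, qps, vshard count) combine into one
/// comparable pressure score per node.
pub fn normalized_score(m: &LoadMetrics, max: &LoadMetrics, w: &LoadWeights) -> f64 {
    let norm = |v: f64, hi: f64| if hi > 0.0 { v / hi } else { 0.0 };
    w.bytes * norm(m.bytes, max.bytes)
        + w.qps * norm(m.qps, max.qps)
        + w.vshards * norm(m.vshards, max.vshards)
}
```

With weights summing to 1.0 the score lands in [0, 1], so "hottest" and "coldest" nodes fall out of a simple sort.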
…e 3 cut-over

Phase 1 previously added the target node directly as a voter via
`ConfChangeType::AddNode`. This was incorrect: a node with an empty
log cannot safely participate in elections or voting immediately.
Phase 1 now adds the target as a learner (`AddLearner`) so Raft
replication can bring it up to date without affecting quorum.

Phase 2 detects when the learner's `match_index` reaches the leader's
`commit_index` and promotes it to voter via `ConfChangeType::PromoteLearner`
before returning. This ensures the Raft group has sufficient replicas for
a safe cut-over before Phase 3 begins.

Phase 3 cut-over is updated to route the routing change through the
metadata Raft group when a `MetadataProposer` is attached. The
`LeadershipTransfer` metadata entry is proposed and waited on so every
node applies the routing update atomically at the same commit index.
Without a proposer (test environments without a metadata group) the
executor falls back to a local-only routing mutation via `set_leader`.

A `with_metadata_proposer` builder method is added to `MigrationExecutor`
for production wiring; tests continue to work without it.
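
The Phase 2 promotion condition reduces to a single comparison, sketched here with an assumed signature:

```rust
/// A learner is safe to promote to voter only once Raft replication has
/// caught it up to the leader's commit index (Phase 2's gate before the
/// Phase 3 cut-over begins).
pub fn ready_to_promote(learner_match_index: u64, leader_commit_index: u64) -> bool {
    learner_match_index >= leader_commit_index
}
```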

Introduce `RebalancerKickHook`, a `MembershipSubscriber` that wakes the
rebalancer loop immediately when a node joins or leaves, replacing
passive polling for topology changes.
… commands

Implement three new cluster introspection statements:
- SHOW RANGES: lists vshard distribution with leaseholder and replica info
- SHOW ROUTING <key>: resolves a routing hint to the owning vshard
- SHOW SCHEMA VERSION: reports the current schema version across the cluster

Wire dispatch through the pgwire DDL router and native SQL dispatcher,
and exclude all three from the standard DML fast-path guard.

Add two new health endpoints alongside the existing /healthz and /health:
- GET /health/live: unconditional liveness probe (always 200); suitable as
  a k8s liveness probe since no internal state is checked
- POST /health/drain: signals the shutdown watch to begin graceful drain,
  causing /healthz to return 503 so the service mesh stops routing new
  connections; designed as a k8s preStop hook

Update /healthz to return 503 when the cluster observer reports the node
is draining or decommissioned, and fix the startup gate middleware comment
to reflect all /health* paths being exempt from the startup gate.
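
The updated /healthz decision is a small pure function; the sketch below uses assumed field names for the observer state, and /health/live remains an unconditional 200.

```rust
/// 503 while draining or decommissioned so the service mesh stops
/// routing new connections; 200 otherwise.
pub fn healthz_status(draining: bool, decommissioned: bool) -> u16 {
    if draining || decommissioned {
        503
    } else {
        200
    }
}
```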

Adds a `backpressure_cpu_threshold` field to `RebalancerLoopConfig`
(default 0.80). Before executing a rebalance sweep, the driver checks
all node CPU utilization snapshots; if any node exceeds the threshold
the sweep is skipped and a STATUS-level log is emitted. This prevents
vShard migrations from amplifying cluster load when nodes are already
stressed.

- `LoadMetrics` gains a `cpu_utilization: f64` field (0.0–1.0)
- Integration tests updated to populate the new field
- Unit test `sweep_skipped_under_cpu_backpressure` verifies no
  migration is dispatched when a node reports 95% CPU
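
The gate itself is a one-liner over the CPU snapshots; the function name below is an illustrative assumption:

```rust
/// A sweep is allowed only when no node's CPU utilization snapshot
/// exceeds the configured threshold (default 0.80 in the PR).
pub fn sweep_allowed(cpu_snapshots: &[f64], threshold: f64) -> bool {
    cpu_snapshots.iter().all(|&cpu| cpu <= threshold)
}
```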

Introduces `session/read_consistency.rs` with parsing and validation
for the `nodedb.read_consistency` SET parameter. Accepted values are
`strong`, `bounded_staleness:<secs>`, and `eventual`. Invalid values
are rejected at SET time with SQLSTATE 22023 and a descriptive
error message rather than silently ignored.
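
A parser for the three accepted forms might look like this; the enum and error strings are illustrative, and the SQLSTATE 22023 wiring is omitted.

```rust
#[derive(Debug, PartialEq)]
pub enum ReadConsistency {
    Strong,
    BoundedStaleness(u64), // staleness bound in seconds
    Eventual,
}

/// Parses the value of SET nodedb.read_consistency; invalid values are
/// rejected with a descriptive error (mapped to SQLSTATE 22023).
pub fn parse_read_consistency(raw: &str) -> Result<ReadConsistency, String> {
    match raw {
        "strong" => Ok(ReadConsistency::Strong),
        "eventual" => Ok(ReadConsistency::Eventual),
        other => match other.strip_prefix("bounded_staleness:") {
            Some(secs) => secs
                .parse::<u64>()
                .map(ReadConsistency::BoundedStaleness)
                .map_err(|_| format!("invalid staleness seconds: {secs}")),
            None => Err(format!("invalid read_consistency value: {other}")),
        },
    }
}
```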

Adds `pgwire/pg_catalog/` with a dispatcher and virtual table
definitions that intercept queries against `pg_catalog` relations.
This allows PostgreSQL-compatible clients and drivers that inspect
catalog tables during connection setup to receive well-formed
responses without requiring a full system catalog implementation.

Introduce ClosedTimestampTracker for advancing the closed timestamp
used by bounded-staleness reads, and FollowerReadGate with ReadLevel
to decide whether a follower can serve a read at a given LSN without
forwarding to the leader.
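
The gate's decision can be sketched as below. This is an assumed semantics, not the PR's exact rule: eventual reads are served locally, bounded-staleness reads require the read point to sit under the closed timestamp and already be applied, and strong reads are assumed to always forward to the leader.

```rust
#[derive(Clone, Copy, Debug)]
pub enum ReadLevel {
    Strong,
    BoundedStaleness,
    Eventual,
}

/// Can this follower serve the read locally, without forwarding?
pub fn can_serve_locally(level: ReadLevel, applied_lsn: u64, closed_lsn: u64, read_lsn: u64) -> bool {
    match level {
        ReadLevel::Eventual => true,
        // Read point must be covered by the closed timestamp and already applied.
        ReadLevel::BoundedStaleness => read_lsn <= closed_lsn && applied_lsn >= read_lsn,
        // Assumed: strong reads always go to the leader.
        ReadLevel::Strong => false,
    }
}
```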

Replace direct index access with bounds-checked alternatives in the
auth user DDL handler, and add early-return validation in GRANT/REVOKE
ROLE handlers to return a well-formed syntax error instead of panicking
when the command has too few parts.

Introduce `nodedb_sql::ddl_ast` — a typed representation of every
NodeDB DDL statement as `NodedbStatement` enum variants, with a
whitespace-token parser that converts raw SQL into the typed form.

The AST gives DDL dispatch a compiler-checked match instead of
string-prefix branching, so adding a new DDL command forces handling
in every affected site at compile time rather than at runtime.
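
An illustrative two-variant slice of such an AST and its whitespace-token parser is sketched below; the real enum covers every NodeDB DDL statement, and these variant and field names are assumptions.

```rust
#[derive(Debug, PartialEq)]
pub enum NodedbStatement {
    CreateCollection { name: String, if_not_exists: bool },
    DropCollection { name: String, if_exists: bool },
}

fn kw(a: &str, b: &str) -> bool {
    a.eq_ignore_ascii_case(b)
}

/// Whitespace-token parse into the typed form: dispatch then gets a
/// compiler-checked match instead of string-prefix branching.
pub fn parse_ddl(sql: &str) -> Option<NodedbStatement> {
    let body = sql.trim().trim_end_matches(';');
    let t: Vec<&str> = body.split_whitespace().collect();
    match t.as_slice() {
        [c, k, i, n, e, name]
            if kw(c, "CREATE") && kw(k, "COLLECTION") && kw(i, "IF") && kw(n, "NOT") && kw(e, "EXISTS") =>
        {
            Some(NodedbStatement::CreateCollection { name: name.to_string(), if_not_exists: true })
        }
        [c, k, name] if kw(c, "CREATE") && kw(k, "COLLECTION") => {
            Some(NodedbStatement::CreateCollection { name: name.to_string(), if_not_exists: false })
        }
        [d, k, i, e, name] if kw(d, "DROP") && kw(k, "COLLECTION") && kw(i, "IF") && kw(e, "EXISTS") => {
            Some(NodedbStatement::DropCollection { name: name.to_string(), if_exists: true })
        }
        [d, k, name] if kw(d, "DROP") && kw(k, "COLLECTION") => {
            Some(NodedbStatement::DropCollection { name: name.to_string(), if_exists: false })
        }
        _ => None,
    }
}
```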
…tion

Add a `Batch` variant to `MetadataEntry` that wraps a sequence of
sub-entries under a single Raft log index. The applier recurses into
batch entries via `cascade_live_state`, and the cache applies each
sub-entry in order. This ensures that a transactional DDL block
(`BEGIN; CREATE ...; COMMIT;`) either commits fully or not at all
across the cluster.
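
The batch shape and its recursive application can be sketched as below, with entry payloads simplified to strings for illustration:

```rust
#[derive(Clone, Debug)]
pub enum MetadataEntry {
    Single(String),
    // Many sub-entries under one Raft log index: applied in order,
    // committed all-or-nothing.
    Batch(Vec<MetadataEntry>),
}

/// Recursively applies an entry, preserving sub-entry order.
pub fn apply(entry: &MetadataEntry, applied: &mut Vec<String>) {
    match entry {
        MetadataEntry::Single(p) => applied.push(p.clone()),
        MetadataEntry::Batch(subs) => {
            for sub in subs {
                apply(sub, applied);
            }
        }
    }
}
```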

Wire transactional DDL semantics through the pgwire session layer:

- `ddl_buffer`: thread-local buffer that intercepts `propose_catalog_entry`
  calls while a BEGIN block is active, accumulating encoded payloads
  instead of proposing them immediately.
- `transaction_cmds`: COMMIT flushes the buffer as a single
  `MetadataEntry::Batch` proposal; ROLLBACK discards without proposing.
- `metadata_proposer`: checks the thread-local buffer before proposing,
  returning early when a transaction is active.
- `ast` + router fast path: parse DDL once into `NodedbStatement`,
  handle `IF [NOT] EXISTS` at dispatch before falling through to
  legacy string-prefix handlers.
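
The thread-local interception can be sketched like this, with proposal payloads simplified to byte vectors; the function names and the `Option`-as-active-flag representation are assumptions.

```rust
use std::cell::RefCell;

thread_local! {
    // Some(buffer) while a BEGIN block is active on this session thread.
    static DDL_BUFFER: RefCell<Option<Vec<Vec<u8>>>> = RefCell::new(None);
}

/// BEGIN: start buffering instead of proposing immediately.
pub fn begin() {
    DDL_BUFFER.with(|b| *b.borrow_mut() = Some(Vec::new()));
}

/// Returns true if the payload was buffered (transaction active);
/// false means the caller should propose it immediately.
pub fn try_buffer(payload: Vec<u8>) -> bool {
    DDL_BUFFER.with(|b| {
        if let Some(buf) = b.borrow_mut().as_mut() {
            buf.push(payload);
            true
        } else {
            false
        }
    })
}

/// COMMIT: take the accumulated payloads for one Batch proposal;
/// ROLLBACK would simply discard the returned value.
pub fn take_on_commit() -> Option<Vec<Vec<u8>>> {
    DDL_BUFFER.with(|b| b.borrow_mut().take())
}
```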

Cover the full create/drop lifecycle for collections, sequences,
triggers, and schedules across a 3-node test cluster. Each test
verifies that DDL executed on the leader becomes visible on all
followers within a bounded window, and that IF [NOT] EXISTS branches
complete without error.
…figurable

Use a HashSet keyed by peer ID in handle_request_vote_response so that
duplicate grants from the same peer cannot inflate the vote count and
cause premature leader promotion.
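
The dedup fix reduces to tallying into a set; the type below is an illustrative sketch, not the PR's code:

```rust
use std::collections::HashSet;

/// Vote tally keyed by peer id: a duplicate grant from the same peer
/// cannot inflate the count.
pub struct VoteTally {
    granted: HashSet<u64>,
}

impl VoteTally {
    pub fn new() -> Self {
        Self { granted: HashSet::new() }
    }

    /// Records a granted vote; returns the distinct-vote count.
    pub fn record_grant(&mut self, peer: u64) -> usize {
        self.granted.insert(peer);
        self.granted.len()
    }

    /// Assumes the candidate's own vote counts toward quorum.
    pub fn has_quorum(&self, cluster_size: usize) -> bool {
        self.granted.len() + 1 > cluster_size / 2
    }
}
```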

Add election_timeout_min/max fields to ClusterConfig and propagate them
through bootstrap, join, and restart paths. Expose matching fields in
ClusterTransportTuning (default 2–5 s) and wire them into MultiRaft so
every startup path honours the configured window rather than using a
hardcoded constant. Adjust network tuning defaults from 60/120 s to 2/5 s
to match real-world cluster behaviour.

Start the HealthMonitor task from start_raft so that periodic pings
run for the lifetime of the node. Without it, topology updates relied
solely on the fire-and-forget broadcast during the join flow; when that
broadcast was lost (peer QUIC server not yet accepting), the peer never
converged to the full topology.

On each successful pong, push the current topology to peers whose
reported version is behind ours. This closes the convergence gap for
peers that missed the initial broadcast.

Bound QUIC handshake attempts by rpc_timeout so a hung connection does
not block for the full 30 s idle timeout when a peer is slow to accept.

Expose the cluster catalog via ClusterHandle so the HealthMonitor can
persist topology changes detected during failure/recovery. Increase the
pgwire retry budget from 3 attempts (350 ms) to 5 attempts (750 ms) to
tolerate longer Raft leader drains under load.
…IM probe test

Replace fixed tokio::time::sleep calls in tests with bounded polling
loops that break as soon as the expected condition is met. This makes
tests deterministic under heavy parallel load (500+ unit tests sharing
the same CPU pool) rather than depending on timing assumptions.
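
The polling pattern looks roughly like this. The tests use tokio's async sleep; a synchronous sketch with an assumed helper name:

```rust
use std::time::{Duration, Instant};

/// Polls `cond` until it holds or the budget expires, breaking as soon
/// as the condition is met instead of sleeping a fixed duration.
pub fn wait_for(mut cond: impl FnMut() -> bool, budget: Duration, step: Duration) -> bool {
    let deadline = Instant::now() + budget;
    loop {
        if cond() {
            return true;
        }
        if Instant::now() >= deadline {
            return false;
        }
        std::thread::sleep(step);
    }
}
```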

In the SWIM indirect_ack_saves_target test, remove start_paused — paused
time auto-advances timeouts before channel-woken tasks are polled, making
the indirect path race its own timeout. With real time the 40 ms probe
window is ample for in-memory delivery. Add a recv loop on the local
endpoint to resolve inflight probes when Acks arrive, and handle both
Ping and PingReq in the helper so the test is correct regardless of
which node the scheduler picks as the direct target.

In the cluster harness, replace the 200 ms sleep before peers join with
a topology-readiness poll, and serialise node 2 joining before node 3 to
avoid overwhelming the bootstrap leader's join handler under load.

Extend the descriptor lease planner wait budget from 3 s to 10 s to
match the extended election timeout range used in tests.
@farhan-syah farhan-syah merged commit edafc6b into main Apr 16, 2026
2 checks passed