feat(cluster): distributed cluster infrastructure#45

Merged
farhan-syah merged 24 commits into main from cluster on Apr 16, 2026
Conversation

@farhan-syah
Contributor

Summary

  • SWIM failure detector with gossip dissemination, UDP transport, membership subscriber hooks, and configurable failure detection
  • Raft consensus hardening — deduplicated votes, configurable election timeout, stable leader election gating
  • Routing & topology — liveness-driven leader hint invalidation, SHOW RANGES/ROUTING/SCHEMA VERSION commands, closed timestamp tracker, follower read gate
  • Graceful node lifecycle — decommission safety checks, coordinated drain flow, reachability-driven circuit-breaker recovery
  • Load-based rebalancer — metrics collection, deterministic planning, CPU backpressure gate, elastic scaling via SWIM membership kicks
  • Transactional DDL — typed DDL AST parser, session-scoped DDL buffer, atomic Raft batch replication, AST-based dispatch
  • Operational endpoints — /health/live and /health/drain, pg_catalog virtual tables, nodedb.read_consistency session parameter
  • Migration executor — corrected learner promotion and replicated Phase 3 cut-over
  • Test infrastructure — readiness polling replaces fixed sleeps, serialized cluster test group, DDL replication correctness tests

Test plan

  • cargo nextest run -p nodedb-cluster — all cluster unit + integration tests pass
  • cargo nextest run --test sql_ddl_cluster — DDL replication correctness across 3-node cluster
  • cargo nextest run -p nodedb-sql — SQL parser and planner tests including const-fold
  • cargo clippy --all-targets -- -D warnings — no warnings
  • cargo fmt --all --check — formatting clean

Introduce a `MembershipSubscriber` trait that the failure detector
calls after every accepted member state transition (insert, Alive →
Suspect, Suspect → Dead, any → Left). Subscribers are synchronous and
must not block; they are suitable for cheap in-memory bookkeeping.

Key changes:
- `subscriber.rs`: defines `MembershipSubscriber` with a single
  `on_state_change(node_id, old, new)` method.
- `detector/runner.rs`: adds `with_subscribers` constructor and
  `apply_and_notify` helper that wraps `apply_and_disseminate` with
  before/after state snapshots; the no-subscriber path is
  zero-overhead (early return on empty slice).
- `bootstrap.rs`: adds `spawn_with_subscribers` alongside the
  existing `spawn` entry-point so callers can inject subscribers
  without breaking the common no-observer signature.
- `mod.rs`: re-exports `MembershipSubscriber` and
  `spawn_with_subscribers`.
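
The shape described above can be sketched as follows. This is an illustrative reduction, not the PR's code: the `on_state_change` name comes from the text, but the concrete types (`u64` node ids, the `notify` helper, `CountingSubscriber`) are assumptions.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// SWIM member states; transitions between these drive notifications.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum MemberState {
    Alive,
    Suspect,
    Dead,
    Left,
}

pub trait MembershipSubscriber {
    /// Called synchronously after every accepted state transition;
    /// implementations must not block.
    fn on_state_change(&self, node_id: u64, old: MemberState, new: MemberState);
}

/// The kind of cheap in-memory bookkeeping the trait is intended for.
pub struct CountingSubscriber(pub AtomicUsize);

impl MembershipSubscriber for CountingSubscriber {
    fn on_state_change(&self, _node: u64, _old: MemberState, _new: MemberState) {
        self.0.fetch_add(1, Ordering::Relaxed);
    }
}

/// The no-subscriber path is zero-overhead: early return on an empty slice.
pub fn notify(subs: &[&dyn MembershipSubscriber], node: u64, old: MemberState, new: MemberState) {
    if subs.is_empty() {
        return;
    }
    for s in subs {
        s.on_state_change(node, old, new);
    }
}
```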

Add `RoutingLivenessHook`, a `MembershipSubscriber` that clears the
leader hint for every Raft group whose leaseholder transitions to
Suspect, Dead, or Left. After the hint is cleared, the next query
against those vShards gets `NotLeader`, triggers leader re-discovery,
and updates the routing table — limiting clients to at most one retry
on node failure.

`NodeIdResolver` is a closure type that maps SWIM `NodeId` to the
numeric routing-table id, keeping the hook storage-agnostic. Seed
placeholders and transient learners that the routing table has not
yet registered are silently ignored.

The integration test (`tests/swim_routing_invalidation.rs`) runs three
real UDP-backed SWIM nodes, shuts down the group leader, and asserts
that the hook clears the routing hint within a few suspicion timeouts.
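
Reduced to its core decision, the hook's invalidation rule looks like this. The function name and enum here are illustrative assumptions; the real hook also resolves SWIM ids via `NodeIdResolver` and mutates the routing table.

```rust
// SWIM member states as described in the PR text.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum MemberState {
    Alive,
    Suspect,
    Dead,
    Left,
}

/// True when a transition into `new` should clear the leader hint for
/// every Raft group whose leaseholder is the affected node.
pub fn invalidates_leader_hint(new: MemberState) -> bool {
    matches!(new, MemberState::Suspect | MemberState::Dead | MemberState::Left)
}
```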

Introduce a `decommission` module with four sub-modules:

- `safety`      — `check_can_decommission` validates that removing a
                  node from every Raft group it belongs to still
                  satisfies the configured replication factor.
- `flow`        — `DecommissionFlow` drives the ordered sequence of
                  metadata proposals (start → leadership transfers →
                  member removal → finish) with per-step error
                  propagation.
- `coordinator` — `DecommissionCoordinator` owns the flow and exposes a
                  single `run()` entry point for callers that hold the
                  metadata Raft handle.
- `observer`    — `DecommissionObserver` watches committed entries and
                  unblocks a waiting coordinator when the final
                  `FinishDecommission` entry is applied.

Supporting changes:

- `metadata_group::entry` — add `RoutingChange::RemoveMember` so the
  decommission flow can strip a draining node from every group it
  belongs to without a compatibility break.
- `metadata_group::applier` — add `with_live_state()` builder that
  attaches live `ClusterTopology` and `RoutingTable` handles; committed
  `TopologyChange` and `RoutingChange` entries now mutate them in place
  in addition to updating the in-memory history log.
- `routing` — add `remove_group_member()` with unit tests covering
  voter removal, learner-only removal, and unknown-group no-ops.
- `lifecycle::plan_decommission` — thin wrapper over
  `decommission::plan_full_decommission`; existing call sites are
  unchanged.
- `lib` — re-export public decommission surface.

Integration test `decommission_flow` exercises the full round-trip on a
three-node in-process cluster: safety check, coordinator run, and
topology/routing state after commit.
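
The safety rule can be sketched as a pure function: removing the node from every group it belongs to must still leave each group at or above the replication factor. The signature and group representation below are assumptions for illustration, not the PR's API.

```rust
use std::collections::HashMap;

/// Validates that decommissioning `node` leaves every group it belongs
/// to with at least `replication_factor` remaining members.
pub fn check_can_decommission(
    groups: &HashMap<u64, Vec<u64>>, // group id -> member node ids
    node: u64,
    replication_factor: usize,
) -> Result<(), String> {
    for (gid, members) in groups {
        if members.contains(&node) && members.len() - 1 < replication_factor {
            return Err(format!(
                "group {gid}: removing node {node} would leave {} < {replication_factor} replicas",
                members.len() - 1
            ));
        }
    }
    Ok(())
}
```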

Introduce a `reachability` module that actively probes peers whose
circuit breakers are stuck in the Open state. Without periodic probes,
a breaker that opens due to a transient failure never transitions back
to HalfOpen because there is no outbound traffic to trigger a
`check()` call — the node stays blacklisted indefinitely.

New pieces:

- `reachability::prober` — `ReachabilityProber` trait with a
  `TransportProber` adapter (wraps any `Transport: Clone + Send`) and a
  `NoopProber` for tests.
- `reachability::driver` — `ReachabilityDriver` runs a Tokio interval
  loop, calls `circuit_breaker.open_peers()` each tick, and fires one
  probe per stuck peer. A successful probe resets the breaker to
  Closed; a failed probe keeps it Open until the next interval.
  `ReachabilityDriverConfig` controls the probe interval and per-tick
  peer cap.

Supporting changes:

- `circuit_breaker::open_peers()` — returns the ids of every peer
  currently in the Open state so the driver can discover them without
  coupling to breaker internals.
- `routing_liveness` — remove stale inline checklist from the module
  doc; the comment now describes what actually happens.
- `lib` — re-export public reachability surface.

Integration test `reachability_loop` spins up a driver backed by a
`NoopProber`, forces three peers into the Open state, advances the
interval, and asserts that all three breakers return to Closed.
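
One tick of the driver loop can be sketched as below, under assumed shapes: a synchronous `probe` returning success, and a helper that reports which stuck peers should reset to Closed. The real driver runs async on a Tokio interval; only the per-tick decision is shown.

```rust
/// Minimal prober trait; the PR's `TransportProber` adapts a real transport.
pub trait ReachabilityProber {
    fn probe(&self, peer: u64) -> bool;
}

/// Always succeeds, as the PR's test prober does.
pub struct NoopProber;

impl ReachabilityProber for NoopProber {
    fn probe(&self, _peer: u64) -> bool {
        true
    }
}

/// Probes up to `per_tick_cap` Open peers and returns those whose
/// breakers should transition back to Closed this tick; peers whose
/// probe fails stay Open until the next interval.
pub fn tick(open_peers: &[u64], prober: &dyn ReachabilityProber, per_tick_cap: usize) -> Vec<u64> {
    open_peers
        .iter()
        .take(per_tick_cap)
        .copied()
        .filter(|&p| prober.probe(p))
        .collect()
}
```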

The SWIM implementation was built in incremental sub-batches whose
labels (E-α, E-β, E-γ, …) were embedded in module docs, inline
comments, and test file headers. Now that all layers are present the
labels no longer help a reader navigate the code and are confusing
when encountered out of context.

Replace each reference with a plain description of what the module or
code path actually does.
…driver loop

Introduces a new `rebalancer` module to the cluster crate that
complements the existing overload-triggered scheduler with a
storage-shape-driven rebalancing path (bytes + qps + vshard count).

Key components:

- `metrics`: `LoadMetrics`, `LoadMetricsProvider` trait, `LoadWeights`,
  and `normalized_score` for per-node pressure scoring
- `plan`: `compute_load_based_plan` produces a bounded set of vshard
  moves from hottest to coldest nodes using configurable imbalance
  thresholds (`RebalancerPlanConfig`)
- `driver`: `RebalancerLoop` orchestrates periodic plan evaluation and
  dispatches moves through the `MigrationDispatcher` trait; the
  `ElectionGate` trait lets callers gate sweeps on Raft leadership;
  `AlwaysReadyGate` is provided for tests

All planning logic is pure and side-effect-free, making it fully
unit-testable without spawning any Tokio tasks. An end-to-end
integration test (`tests/rebalancer_loop.rs`) wires a `StaticProvider`,
a `DirectDispatcher`, and `AlwaysReadyGate` to assert that a full
plan → dispatch → routing-table-mutation cycle works correctly.
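
A plausible form of the per-node scoring is sketched below. The names (`LoadMetrics`, `LoadWeights`, `normalized_score`) come from the PR; the exact formula, normalizing each dimension against the cluster max before weighting, is an assumption.

```rust
pub struct LoadMetrics {
    pub bytes: f64,
    pub qps: f64,
    pub vshards: f64,
}

pub struct LoadWeights {
    pub bytes: f64,
    pub qps: f64,
    pub vshards: f64,
}

/// Normalizes each dimension against the cluster-wide max so that
/// heterogeneous units (bytes, qps, vshard count) combine into one
/// comparable pressure score per node.
pub fn normalized_score(m: &LoadMetrics, max: &LoadMetrics, w: &LoadWeights) -> f64 {
    let norm = |v: f64, hi: f64| if hi > 0.0 { v / hi } else { 0.0 };
    w.bytes * norm(m.bytes, max.bytes)
        + w.qps * norm(m.qps, max.qps)
        + w.vshards * norm(m.vshards, max.vshards)
}
```

With weights summing to 1.0 the score lands in [0, 1], so "hottest" and "coldest" nodes fall out of a simple sort.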
…e 3 cut-over

Phase 1 previously added the target node directly as a voter via
`ConfChangeType::AddNode`. This was incorrect: a node with an empty
log cannot safely participate in elections or voting immediately.
Phase 1 now adds the target as a learner (`AddLearner`) so Raft
replication can bring it up to date without affecting quorum.

Phase 2 detects when the learner's `match_index` reaches the leader's
`commit_index` and promotes it to voter via `ConfChangeType::PromoteLearner`
before returning. This ensures the Raft group has sufficient replicas for
a safe cut-over before Phase 3 begins.

Phase 3 cut-over is updated to route the routing change through the
metadata Raft group when a `MetadataProposer` is attached. The
`LeadershipTransfer` metadata entry is proposed and waited on so every
node applies the routing update atomically at the same commit index.
Without a proposer (test environments without a metadata group) the
executor falls back to a local-only routing mutation via `set_leader`.

A `with_metadata_proposer` builder method is added to `MigrationExecutor`
for production wiring; tests continue to work without it.
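
The Phase 2 promotion condition reduces to a single comparison, sketched here with an assumed signature:

```rust
/// A learner is safe to promote to voter only once Raft replication has
/// caught it up to the leader's commit index (Phase 2's gate before the
/// Phase 3 cut-over begins).
pub fn ready_to_promote(learner_match_index: u64, leader_commit_index: u64) -> bool {
    learner_match_index >= leader_commit_index
}
```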

Introduce `RebalancerKickHook`, a `MembershipSubscriber` that wakes the
rebalancer loop immediately when a node joins or leaves, replacing
passive polling for topology changes.
… commands

Implement three new cluster introspection statements:
- SHOW RANGES: lists vshard distribution with leaseholder and replica info
- SHOW ROUTING <key>: resolves a routing hint to the owning vshard
- SHOW SCHEMA VERSION: reports the current schema version across the cluster

Wire dispatch through the pgwire DDL router and native SQL dispatcher,
and exclude all three from the standard DML fast-path guard.

Add two new health endpoints alongside the existing /healthz and /health:
- GET /health/live: unconditional liveness probe (always 200); suitable as
  a k8s liveness probe since no internal state is checked
- POST /health/drain: signals the shutdown watch to begin graceful drain,
  causing /healthz to return 503 so the service mesh stops routing new
  connections; designed as a k8s preStop hook

Update /healthz to return 503 when the cluster observer reports the node
is draining or decommissioned, and fix the startup gate middleware comment
to reflect all /health* paths being exempt from the startup gate.
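
The updated /healthz decision is a small pure function; the sketch below uses assumed field names for the observer state, and /health/live remains an unconditional 200.

```rust
/// 503 while draining or decommissioned so the service mesh stops
/// routing new connections; 200 otherwise.
pub fn healthz_status(draining: bool, decommissioned: bool) -> u16 {
    if draining || decommissioned {
        503
    } else {
        200
    }
}
```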

Adds a `backpressure_cpu_threshold` field to `RebalancerLoopConfig`
(default 0.80). Before executing a rebalance sweep, the driver checks
all node CPU utilization snapshots; if any node exceeds the threshold
the sweep is skipped and a STATUS-level log is emitted. This prevents
vShard migrations from amplifying cluster load when nodes are already
stressed.

- `LoadMetrics` gains a `cpu_utilization: f64` field (0.0–1.0)
- Integration tests updated to populate the new field
- Unit test `sweep_skipped_under_cpu_backpressure` verifies no
  migration is dispatched when a node reports 95% CPU
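
The gate itself is a one-liner over the CPU snapshots; the function name below is an illustrative assumption:

```rust
/// A sweep is allowed only when no node's CPU utilization snapshot
/// exceeds the configured threshold (default 0.80 in the PR).
pub fn sweep_allowed(cpu_snapshots: &[f64], threshold: f64) -> bool {
    cpu_snapshots.iter().all(|&cpu| cpu <= threshold)
}
```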

Introduces `session/read_consistency.rs` with parsing and validation
for the `nodedb.read_consistency` SET parameter. Accepted values are
`strong`, `bounded_staleness:<secs>`, and `eventual`. Invalid values
are rejected at SET time with SQLSTATE 22023 and a descriptive
error message rather than silently ignored.
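
A parser for the three accepted forms might look like this; the enum and error strings are illustrative, and the SQLSTATE 22023 wiring is omitted.

```rust
#[derive(Debug, PartialEq)]
pub enum ReadConsistency {
    Strong,
    BoundedStaleness(u64), // staleness bound in seconds
    Eventual,
}

/// Parses the value of SET nodedb.read_consistency; invalid values are
/// rejected with a descriptive error (mapped to SQLSTATE 22023).
pub fn parse_read_consistency(raw: &str) -> Result<ReadConsistency, String> {
    match raw {
        "strong" => Ok(ReadConsistency::Strong),
        "eventual" => Ok(ReadConsistency::Eventual),
        other => match other.strip_prefix("bounded_staleness:") {
            Some(secs) => secs
                .parse::<u64>()
                .map(ReadConsistency::BoundedStaleness)
                .map_err(|_| format!("invalid staleness seconds: {secs}")),
            None => Err(format!("invalid read_consistency value: {other}")),
        },
    }
}
```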

Adds `pgwire/pg_catalog/` with a dispatcher and virtual table
definitions that intercept queries against `pg_catalog` relations.
This allows PostgreSQL-compatible clients and drivers that inspect
catalog tables during connection setup to receive well-formed
responses without requiring a full system catalog implementation.

Introduce ClosedTimestampTracker for advancing the closed timestamp
used by bounded-staleness reads, and FollowerReadGate with ReadLevel
to decide whether a follower can serve a read at a given LSN without
forwarding to the leader.
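
The gate's decision can be sketched as below. This is an assumed semantics, not the PR's exact rule: eventual reads are served locally, bounded-staleness reads require the read point to sit under the closed timestamp and already be applied, and strong reads are assumed to always forward to the leader.

```rust
#[derive(Clone, Copy, Debug)]
pub enum ReadLevel {
    Strong,
    BoundedStaleness,
    Eventual,
}

/// Can this follower serve the read locally, without forwarding?
pub fn can_serve_locally(level: ReadLevel, applied_lsn: u64, closed_lsn: u64, read_lsn: u64) -> bool {
    match level {
        ReadLevel::Eventual => true,
        // Read point must be covered by the closed timestamp and already applied.
        ReadLevel::BoundedStaleness => read_lsn <= closed_lsn && applied_lsn >= read_lsn,
        // Assumed: strong reads always go to the leader.
        ReadLevel::Strong => false,
    }
}
```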

Replace direct index access with bounds-checked alternatives in the
auth user DDL handler, and add early-return validation in GRANT/REVOKE
ROLE handlers to return a well-formed syntax error instead of panicking
when the command has too few parts.

Introduce `nodedb_sql::ddl_ast` — a typed representation of every
NodeDB DDL statement as `NodedbStatement` enum variants, with a
whitespace-token parser that converts raw SQL into the typed form.

The AST gives DDL dispatch a compiler-checked match instead of
string-prefix branching, so adding a new DDL command forces handling
in every affected site at compile time rather than at runtime.
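
An illustrative two-variant slice of such an AST and its whitespace-token parser is sketched below; the real enum covers every NodeDB DDL statement, and these variant and field names are assumptions.

```rust
#[derive(Debug, PartialEq)]
pub enum NodedbStatement {
    CreateCollection { name: String, if_not_exists: bool },
    DropCollection { name: String, if_exists: bool },
}

fn kw(a: &str, b: &str) -> bool {
    a.eq_ignore_ascii_case(b)
}

/// Whitespace-token parse into the typed form: dispatch then gets a
/// compiler-checked match instead of string-prefix branching.
pub fn parse_ddl(sql: &str) -> Option<NodedbStatement> {
    let body = sql.trim().trim_end_matches(';');
    let t: Vec<&str> = body.split_whitespace().collect();
    match t.as_slice() {
        [c, k, i, n, e, name]
            if kw(c, "CREATE") && kw(k, "COLLECTION") && kw(i, "IF") && kw(n, "NOT") && kw(e, "EXISTS") =>
        {
            Some(NodedbStatement::CreateCollection { name: name.to_string(), if_not_exists: true })
        }
        [c, k, name] if kw(c, "CREATE") && kw(k, "COLLECTION") => {
            Some(NodedbStatement::CreateCollection { name: name.to_string(), if_not_exists: false })
        }
        [d, k, i, e, name] if kw(d, "DROP") && kw(k, "COLLECTION") && kw(i, "IF") && kw(e, "EXISTS") => {
            Some(NodedbStatement::DropCollection { name: name.to_string(), if_exists: true })
        }
        [d, k, name] if kw(d, "DROP") && kw(k, "COLLECTION") => {
            Some(NodedbStatement::DropCollection { name: name.to_string(), if_exists: false })
        }
        _ => None,
    }
}
```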
…tion

Add a `Batch` variant to `MetadataEntry` that wraps a sequence of
sub-entries under a single Raft log index. The applier recurses into
batch entries via `cascade_live_state`, and the cache applies each
sub-entry in order. This ensures that a transactional DDL block
(`BEGIN; CREATE ...; COMMIT;`) either commits fully or not at all
across the cluster.
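
The batch shape and its recursive application can be sketched as below, with entry payloads simplified to strings for illustration:

```rust
#[derive(Clone, Debug)]
pub enum MetadataEntry {
    Single(String),
    // Many sub-entries under one Raft log index: applied in order,
    // committed all-or-nothing.
    Batch(Vec<MetadataEntry>),
}

/// Recursively applies an entry, preserving sub-entry order.
pub fn apply(entry: &MetadataEntry, applied: &mut Vec<String>) {
    match entry {
        MetadataEntry::Single(p) => applied.push(p.clone()),
        MetadataEntry::Batch(subs) => {
            for sub in subs {
                apply(sub, applied);
            }
        }
    }
}
```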

Wire transactional DDL semantics through the pgwire session layer:

- `ddl_buffer`: thread-local buffer that intercepts `propose_catalog_entry`
  calls while a BEGIN block is active, accumulating encoded payloads
  instead of proposing them immediately.
- `transaction_cmds`: COMMIT flushes the buffer as a single
  `MetadataEntry::Batch` proposal; ROLLBACK discards without proposing.
- `metadata_proposer`: checks the thread-local buffer before proposing,
  returning early when a transaction is active.
- `ast` + router fast path: parse DDL once into `NodedbStatement`,
  handle `IF [NOT] EXISTS` at dispatch before falling through to
  legacy string-prefix handlers.
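
The thread-local interception can be sketched like this, with proposal payloads simplified to byte vectors; the function names and the `Option`-as-active-flag representation are assumptions.

```rust
use std::cell::RefCell;

thread_local! {
    // Some(buffer) while a BEGIN block is active on this session thread.
    static DDL_BUFFER: RefCell<Option<Vec<Vec<u8>>>> = RefCell::new(None);
}

/// BEGIN: start buffering instead of proposing immediately.
pub fn begin() {
    DDL_BUFFER.with(|b| *b.borrow_mut() = Some(Vec::new()));
}

/// Returns true if the payload was buffered (transaction active);
/// false means the caller should propose it immediately.
pub fn try_buffer(payload: Vec<u8>) -> bool {
    DDL_BUFFER.with(|b| {
        if let Some(buf) = b.borrow_mut().as_mut() {
            buf.push(payload);
            true
        } else {
            false
        }
    })
}

/// COMMIT: take the accumulated payloads for one Batch proposal;
/// ROLLBACK would simply discard the returned value.
pub fn take_on_commit() -> Option<Vec<Vec<u8>>> {
    DDL_BUFFER.with(|b| b.borrow_mut().take())
}
```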

Cover the full create/drop lifecycle for collections, sequences,
triggers, and schedules across a 3-node test cluster. Each test
verifies that DDL executed on the leader becomes visible on all
followers within a bounded window, and that IF [NOT] EXISTS branches
complete without error.
…figurable

Use a HashSet keyed by peer ID in handle_request_vote_response so that
duplicate grants from the same peer cannot inflate the vote count and
cause premature leader promotion.
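
The dedup fix reduces to tallying into a set; the type below is an illustrative sketch, not the PR's code:

```rust
use std::collections::HashSet;

/// Vote tally keyed by peer id: a duplicate grant from the same peer
/// cannot inflate the count.
pub struct VoteTally {
    granted: HashSet<u64>,
}

impl VoteTally {
    pub fn new() -> Self {
        Self { granted: HashSet::new() }
    }

    /// Records a granted vote; returns the distinct-vote count.
    pub fn record_grant(&mut self, peer: u64) -> usize {
        self.granted.insert(peer);
        self.granted.len()
    }

    /// Assumes the candidate's own vote counts toward quorum.
    pub fn has_quorum(&self, cluster_size: usize) -> bool {
        self.granted.len() + 1 > cluster_size / 2
    }
}
```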

Add election_timeout_min/max fields to ClusterConfig and propagate them
through bootstrap, join, and restart paths. Expose matching fields in
ClusterTransportTuning (default 2–5 s) and wire them into MultiRaft so
every startup path honours the configured window rather than using a
hardcoded constant. Adjust network tuning defaults from 60/120 s to 2/5 s
to match real-world cluster behaviour.

Start the HealthMonitor task from start_raft so that periodic pings
run for the lifetime of the node. Without it, topology updates relied
solely on the fire-and-forget broadcast during the join flow; when that
broadcast was lost (peer QUIC server not yet accepting), the peer never
converged to the full topology.

On each successful pong, push the current topology to peers whose
reported version is behind ours. This closes the convergence gap for
peers that missed the initial broadcast.

Bound QUIC handshake attempts by rpc_timeout so a hung connection does
not block for the full 30 s idle timeout when a peer is slow to accept.

Expose the cluster catalog via ClusterHandle so the HealthMonitor can
persist topology changes detected during failure/recovery. Increase the
pgwire retry budget from 3 attempts (350 ms) to 5 attempts (750 ms) to
tolerate longer Raft leader drains under load.
…IM probe test

Replace fixed tokio::time::sleep calls in tests with bounded polling
loops that break as soon as the expected condition is met. This makes
tests deterministic under heavy parallel load (500+ unit tests sharing
the same CPU pool) rather than depending on timing assumptions.
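
The polling pattern looks roughly like this. The tests use tokio's async sleep; a synchronous sketch with an assumed helper name:

```rust
use std::time::{Duration, Instant};

/// Polls `cond` until it holds or the budget expires, breaking as soon
/// as the condition is met instead of sleeping a fixed duration.
pub fn wait_for(mut cond: impl FnMut() -> bool, budget: Duration, step: Duration) -> bool {
    let deadline = Instant::now() + budget;
    loop {
        if cond() {
            return true;
        }
        if Instant::now() >= deadline {
            return false;
        }
        std::thread::sleep(step);
    }
}
```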

In the SWIM indirect_ack_saves_target test, remove start_paused — paused
time auto-advances timeouts before channel-woken tasks are polled, making
the indirect path race its own timeout. With real time the 40 ms probe
window is ample for in-memory delivery. Add a recv loop on the local
endpoint to resolve inflight probes when Acks arrive, and handle both
Ping and PingReq in the helper so the test is correct regardless of
which node the scheduler picks as the direct target.

In the cluster harness, replace the 200 ms sleep before peers join with
a topology-readiness poll, and serialise node 2 joining before node 3 to
avoid overwhelming the bootstrap leader's join handler under load.

Extend the descriptor lease planner wait budget from 3 s to 10 s to
match the extended election timeout range used in tests.
@farhan-syah farhan-syah merged commit edafc6b into main Apr 16, 2026
2 checks passed