feat(cluster): distributed cluster infrastructure #45
Merged
farhan-syah merged 24 commits into main on Apr 16, 2026
Introduce a `MembershipSubscriber` trait that the failure detector calls after every accepted member state transition (insert, Alive → Suspect, Suspect → Dead, any → Left). Subscribers are synchronous and must not block; they are suitable for cheap in-memory bookkeeping.

Key changes:
- `subscriber.rs`: defines `MembershipSubscriber` with a single `on_state_change(node_id, old, new)` method.
- `detector/runner.rs`: adds a `with_subscribers` constructor and an `apply_and_notify` helper that wraps `apply_and_disseminate` with before/after state snapshots; the no-subscriber path is zero-overhead (early return on an empty slice).
- `bootstrap.rs`: adds `spawn_with_subscribers` alongside the existing `spawn` entry point so callers can inject subscribers without breaking the common no-observer signature.
- `mod.rs`: re-exports `MembershipSubscriber` and `spawn_with_subscribers`.
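The subscriber surface described above can be sketched as follows. The names `MemberState`, `NodeId`, the exact method signature, and the `notify_all` helper are assumptions based on this summary, not the crate's real definitions:

```rust
/// Assumed member-state lattice; sketch only, not the crate's real enum.
#[derive(Clone, Copy, PartialEq, Debug)]
pub enum MemberState { Alive, Suspect, Dead, Left }

pub type NodeId = u64;

pub trait MembershipSubscriber {
    /// Called synchronously after every accepted transition; must not block.
    fn on_state_change(&self, node_id: NodeId, old: Option<MemberState>, new: MemberState);
}

/// Mirrors the zero-overhead no-subscriber path: early return on an empty slice.
pub fn notify_all(
    subs: &[&dyn MembershipSubscriber],
    node_id: NodeId,
    old: Option<MemberState>,
    new: MemberState,
) {
    if subs.is_empty() {
        return; // no observers registered: nothing to do
    }
    for s in subs {
        s.on_state_change(node_id, old, new);
    }
}

/// Tiny example subscriber that just counts transitions (cheap in-memory bookkeeping).
pub struct Counting(pub std::cell::Cell<usize>);

impl MembershipSubscriber for Counting {
    fn on_state_change(&self, _id: NodeId, _old: Option<MemberState>, _new: MemberState) {
        self.0.set(self.0.get() + 1);
    }
}
```

Because `on_state_change` takes `&self` and runs inline in the detector, any mutable state a subscriber keeps has to use interior mutability, as `Counting` does with `Cell`.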
Add `RoutingLivenessHook`, a `MembershipSubscriber` that clears the leader hint for every Raft group whose leaseholder transitions to Suspect, Dead, or Left. After the hint is cleared, the next query against those vShards gets `NotLeader`, triggers leader re-discovery, and updates the routing table — limiting clients to at most one retry on node failure.

`NodeIdResolver` is a closure type that maps a SWIM `NodeId` to the numeric routing-table id, keeping the hook storage-agnostic. Seed placeholders and transient learners that the routing table has not yet registered are silently ignored.

The integration test (`tests/swim_routing_invalidation.rs`) runs three real UDP-backed SWIM nodes, shuts down the group leader, and asserts that the hook clears the routing hint within a few suspicion timeouts.
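A minimal sketch of the hook's clearing logic, assuming a string SWIM id, a numeric routing id, and a flat `group id -> leader hint` map standing in for the routing table (all illustrative shapes, not the crate's API):

```rust
use std::cell::RefCell;
use std::collections::HashMap;

#[derive(Clone, Copy, PartialEq)]
pub enum MemberState { Alive, Suspect, Dead, Left }

/// Hypothetical id types for the sketch.
pub type SwimId = String;
pub type RoutingId = u64;

/// `resolve` plays the role of the `NodeIdResolver` closure; `hints` stands
/// in for the per-group leader hints held by the routing table.
pub struct RoutingLivenessHook<F: Fn(&SwimId) -> Option<RoutingId>> {
    pub resolve: F,
    pub hints: RefCell<HashMap<u64, Option<RoutingId>>>, // group id -> leader hint
}

impl<F: Fn(&SwimId) -> Option<RoutingId>> RoutingLivenessHook<F> {
    pub fn on_state_change(&self, node: &SwimId, new: MemberState) {
        // Only liveness-losing transitions clear hints.
        if !matches!(new, MemberState::Suspect | MemberState::Dead | MemberState::Left) {
            return;
        }
        // Seed placeholders and unregistered learners resolve to None: ignore.
        let Some(rid) = (self.resolve)(node) else { return };
        for hint in self.hints.borrow_mut().values_mut() {
            if *hint == Some(rid) {
                *hint = None; // next query gets NotLeader and re-discovers the leader
            }
        }
    }
}
```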
Introduce a `decommission` module with four sub-modules:
- `safety` — `check_can_decommission` validates that removing a
node from every Raft group it belongs to still
satisfies the configured replication factor.
- `flow` — `DecommissionFlow` drives the ordered sequence of
metadata proposals (start → leadership transfers →
member removal → finish) with per-step error
propagation.
- `coordinator` — `DecommissionCoordinator` owns the flow and exposes a
single `run()` entry point for callers that hold the
metadata Raft handle.
- `observer` — `DecommissionObserver` watches committed entries and
unblocks a waiting coordinator when the final
`FinishDecommission` entry is applied.
Supporting changes:
- `metadata_group::entry` — add `RoutingChange::RemoveMember` so the
decommission flow can strip a draining node from every group it
belongs to without a compatibility break.
- `metadata_group::applier` — add `with_live_state()` builder that
attaches live `ClusterTopology` and `RoutingTable` handles; committed
`TopologyChange` and `RoutingChange` entries now mutate them in place
in addition to updating the in-memory history log.
- `routing` — add `remove_group_member()` with unit tests covering
voter removal, learner-only removal, and unknown-group no-ops.
- `lifecycle::plan_decommission` — thin wrapper over
`decommission::plan_full_decommission`; existing call sites are
unchanged.
- `lib` — re-export public decommission surface.
Integration test `decommission_flow` exercises the full round-trip on a
three-node in-process cluster: safety check, coordinator run, and
topology/routing state after commit.
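The ordered sequence that `DecommissionFlow` drives can be sketched as a pure function. The step names and error shape below are illustrative assumptions, not the crate's real types; the point is the per-step error propagation, where the flow stops at the first failing proposal and reports which step failed:

```rust
/// Hypothetical step names mirroring the described sequence.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Step { Start, TransferLeadership, RemoveMember, Finish }

/// Drives the metadata proposals in order. The first failure aborts the flow
/// and is tagged with the step that produced it.
pub fn run_flow(
    mut propose: impl FnMut(Step) -> Result<(), String>,
) -> Result<(), (Step, String)> {
    for step in [Step::Start, Step::TransferLeadership, Step::RemoveMember, Step::Finish] {
        propose(step).map_err(|e| (step, e))?;
    }
    Ok(())
}
```

A coordinator owning this flow only needs the `propose` callback wired to the metadata Raft handle, which matches the single `run()` entry point described above.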
Introduce a `reachability` module that actively probes peers whose circuit breakers are stuck in the Open state. Without periodic probes, a breaker that opens due to a transient failure never transitions back to HalfOpen because there is no outbound traffic to trigger a `check()` call — the node stays blacklisted indefinitely.

New pieces:
- `reachability::prober` — `ReachabilityProber` trait with a `TransportProber` adapter (wraps any `Transport: Clone + Send`) and a `NoopProber` for tests.
- `reachability::driver` — `ReachabilityDriver` runs a Tokio interval loop, calls `circuit_breaker.open_peers()` each tick, and fires one probe per stuck peer. A successful probe resets the breaker to Closed; a failed probe keeps it Open until the next interval. `ReachabilityDriverConfig` controls the probe interval and per-tick peer cap.

Supporting changes:
- `circuit_breaker::open_peers()` — returns the ids of every peer currently in the Open state so the driver can discover them without coupling to breaker internals.
- `routing_liveness` — remove a stale inline checklist from the module doc; the comment now describes what actually happens.
- `lib` — re-export the public reachability surface.

Integration test `reachability_loop` spins up a driver backed by a `NoopProber`, forces three peers into the Open state, advances the interval, and asserts that all three breakers return to Closed.
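One tick of the driver loop can be sketched synchronously, with Tokio and the real transport stripped out. `BreakerState`, the flat breaker map, and the `probe` callback are stand-ins for the crate's types:

```rust
use std::collections::HashMap;

/// Reduced breaker state for the sketch (the real breaker also has HalfOpen).
#[derive(Clone, Copy, PartialEq, Debug)]
pub enum BreakerState { Closed, Open }

/// One interval tick: probe up to `cap` peers stuck in Open. A successful
/// probe resets the breaker to Closed; a failure leaves it Open so the next
/// tick retries.
pub fn tick(
    breakers: &mut HashMap<u64, BreakerState>,
    cap: usize,
    probe: impl Fn(u64) -> bool,
) {
    // Equivalent of circuit_breaker.open_peers(): discover stuck peers
    // without coupling to breaker internals.
    let open: Vec<u64> = breakers
        .iter()
        .filter(|(_, s)| **s == BreakerState::Open)
        .map(|(id, _)| *id)
        .take(cap) // per-tick peer cap from the driver config
        .collect();
    for id in open {
        if probe(id) {
            breakers.insert(id, BreakerState::Closed);
        }
    }
}
```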
The SWIM implementation was built in incremental sub-batches whose labels (E-α, E-β, E-γ, …) were embedded in module docs, inline comments, and test file headers. Now that all layers are present, the labels no longer help a reader navigate the code and are confusing when encountered out of context. Replace each reference with a plain description of what the module or code path actually does.
…driver loop

Introduces a new `rebalancer` module to the cluster crate that complements the existing overload-triggered scheduler with a storage-shape-driven rebalancing path (bytes + qps + vshard count).

Key components:
- `metrics`: `LoadMetrics`, the `LoadMetricsProvider` trait, `LoadWeights`, and `normalized_score` for per-node pressure scoring
- `plan`: `compute_load_based_plan` produces a bounded set of vshard moves from the hottest to the coldest nodes using configurable imbalance thresholds (`RebalancerPlanConfig`)
- `driver`: `RebalancerLoop` orchestrates periodic plan evaluation and dispatches moves through the `MigrationDispatcher` trait; the `ElectionGate` trait lets callers gate sweeps on Raft leadership; `AlwaysReadyGate` is provided for tests

All planning logic is pure and side-effect-free, making it fully unit-testable without spawning any Tokio tasks. An end-to-end integration test (`tests/rebalancer_loop.rs`) wires a `StaticProvider`, a `DirectDispatcher`, and `AlwaysReadyGate` to assert that a full plan → dispatch → routing-table-mutation cycle works correctly.
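A sketch of what a `normalized_score` over bytes, qps, and vshard count could look like. The weighting scheme below (each dimension normalized against the cluster-wide maximum, then weighted) is an assumption for illustration; the real `metrics` module may score differently:

```rust
/// Illustrative metric and weight shapes; field sets are assumptions.
pub struct LoadMetrics { pub bytes: u64, pub qps: f64, pub vshards: u32 }
pub struct LoadWeights { pub bytes: f64, pub qps: f64, pub vshards: f64 }

/// Per-node pressure score: higher means hotter. `max` holds the
/// cluster-wide maxima used for normalization.
pub fn normalized_score(m: &LoadMetrics, max: &LoadMetrics, w: &LoadWeights) -> f64 {
    // Guard against empty clusters: a zero maximum contributes zero pressure.
    let norm = |v: f64, max: f64| if max > 0.0 { v / max } else { 0.0 };
    w.bytes * norm(m.bytes as f64, max.bytes as f64)
        + w.qps * norm(m.qps, max.qps)
        + w.vshards * norm(m.vshards as f64, max.vshards as f64)
}
```

Keeping the score a pure function of its inputs is what makes the planning layer unit-testable without any Tokio tasks, as the commit message notes.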
…e 3 cut-over

Phase 1 previously added the target node directly as a voter via `ConfChangeType::AddNode`. This was incorrect: a node with an empty log cannot safely participate in elections or voting immediately. Phase 1 now adds the target as a learner (`AddLearner`) so Raft replication can bring it up to date without affecting quorum.

Phase 2 detects when the learner's `match_index` reaches the leader's `commit_index` and promotes it to voter via `ConfChangeType::PromoteLearner` before returning. This ensures the Raft group has sufficient replicas for a safe cut-over before Phase 3 begins.

Phase 3 cut-over is updated to route the routing change through the metadata Raft group when a `MetadataProposer` is attached. The `LeadershipTransfer` metadata entry is proposed and waited on so every node applies the routing update atomically at the same commit index. Without a proposer (test environments without a metadata group) the executor falls back to a local-only routing mutation via `set_leader`.

A `with_metadata_proposer` builder method is added to `MigrationExecutor` for production wiring; tests continue to work without it.
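The Phase 1/Phase 2 decision can be condensed into one readiness rule. The function and parameter names below are illustrative, not the executor's real API; only the `match_index >= commit_index` promotion condition is taken from the description above:

```rust
/// Illustrative config-change shape for the sketch.
#[derive(Debug, PartialEq)]
pub enum ConfChange { AddLearner(u64), PromoteLearner(u64) }

/// Decides the next config change for a migration target.
pub fn next_change(
    target: u64,
    is_learner_already: bool,
    match_index: u64,
    commit_index: u64,
) -> Option<ConfChange> {
    if !is_learner_already {
        // Phase 1: never AddNode directly; a node with an empty log
        // must not participate in voting yet.
        Some(ConfChange::AddLearner(target))
    } else if match_index >= commit_index {
        // Phase 2: the learner has fully caught up, so promotion
        // cannot endanger quorum.
        Some(ConfChange::PromoteLearner(target))
    } else {
        None // keep replicating; promotion would be premature
    }
}
```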
Introduce `RebalancerKickHook`, a `MembershipSubscriber` that wakes the rebalancer loop immediately when a node joins or leaves, replacing passive polling for topology changes.
… commands

Implement three new cluster introspection statements:
- SHOW RANGES: lists vshard distribution with leaseholder and replica info
- SHOW ROUTING <key>: resolves a routing hint to the owning vshard
- SHOW SCHEMA VERSION: reports the current schema version across the cluster

Wire dispatch through the pgwire DDL router and the native SQL dispatcher, and exclude all three from the standard DML fast-path guard.
Add two new health endpoints alongside the existing /healthz and /health:
- GET /health/live: unconditional liveness probe (always 200); suitable as a k8s liveness probe since no internal state is checked
- POST /health/drain: signals the shutdown watch to begin a graceful drain, causing /healthz to return 503 so the service mesh stops routing new connections; designed as a k8s preStop hook

Update /healthz to return 503 when the cluster observer reports the node is draining or decommissioned, and fix the startup gate middleware comment to reflect that all /health* paths are exempt from the startup gate.
…lancer test lock scope
Adds a `backpressure_cpu_threshold` field to `RebalancerLoopConfig` (default 0.80). Before executing a rebalance sweep, the driver checks all node CPU utilization snapshots; if any node exceeds the threshold, the sweep is skipped and a STATUS-level log is emitted. This prevents vShard migrations from amplifying cluster load when nodes are already stressed.

- `LoadMetrics` gains a `cpu_utilization: f64` field (0.0–1.0)
- Integration tests updated to populate the new field
- Unit test `sweep_skipped_under_cpu_backpressure` verifies no migration is dispatched when a node reports 95% CPU
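The gate itself reduces to a single predicate over the CPU snapshots. The function name below is a hypothetical stand-in for whatever check the driver actually performs:

```rust
/// Backpressure gate: a sweep runs only if no node's CPU utilization
/// snapshot (0.0..=1.0) exceeds the configured threshold (default 0.80).
pub fn sweep_allowed(cpu_snapshots: &[f64], threshold: f64) -> bool {
    cpu_snapshots.iter().all(|&c| c <= threshold)
}
```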
Introduces `session/read_consistency.rs` with parsing and validation for the `nodedb.read_consistency` SET parameter. Accepted values are `strong`, `bounded_staleness:<secs>`, and `eventual`. Invalid values are rejected at SET time with SQLSTATE 22023 and a descriptive error message rather than silently ignored.
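The accepted grammar can be sketched as a small parser. The enum shape and error strings are assumptions; per the description, the caller maps a parse failure to SQLSTATE 22023 at SET time:

```rust
/// Assumed typed form of the three accepted values.
#[derive(Debug, PartialEq)]
pub enum ReadConsistency { Strong, BoundedStaleness(u64), Eventual }

/// Parses `strong`, `bounded_staleness:<secs>`, or `eventual`; anything else
/// is an error with a descriptive message rather than a silent ignore.
pub fn parse_read_consistency(v: &str) -> Result<ReadConsistency, String> {
    match v {
        "strong" => Ok(ReadConsistency::Strong),
        "eventual" => Ok(ReadConsistency::Eventual),
        other => match other.strip_prefix("bounded_staleness:") {
            Some(secs) => secs
                .parse::<u64>()
                .map(ReadConsistency::BoundedStaleness)
                .map_err(|_| format!("invalid staleness seconds: {secs:?}")),
            None => Err(format!("invalid read_consistency value: {other:?}")),
        },
    }
}
```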
Adds `pgwire/pg_catalog/` with a dispatcher and virtual table definitions that intercept queries against `pg_catalog` relations. This allows PostgreSQL-compatible clients and drivers that inspect catalog tables during connection setup to receive well-formed responses without requiring a full system catalog implementation.
Introduce `ClosedTimestampTracker` for advancing the closed timestamp used by bounded-staleness reads, and `FollowerReadGate` with `ReadLevel` to decide whether a follower can serve a read at a given LSN without forwarding to the leader.
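A plausible shape for the gate decision, assuming `ReadLevel` mirrors the session parameter's three levels and the bounded-staleness rule is an LSN comparison against the closed timestamp (both assumptions; the real gate may consider more state):

```rust
/// Assumed read levels, mirroring the nodedb.read_consistency values.
#[derive(Clone, Copy)]
pub enum ReadLevel { Strong, BoundedStaleness, Eventual }

/// Can this follower serve the read locally, or must it forward to the leader?
pub fn can_serve_locally(level: ReadLevel, closed_lsn: u64, read_lsn: u64) -> bool {
    match level {
        ReadLevel::Strong => false,  // strong reads always go to the leader
        ReadLevel::Eventual => true, // any replica is acceptable
        // Bounded staleness: only if the closed timestamp has already
        // advanced past the LSN the read needs to observe.
        ReadLevel::BoundedStaleness => closed_lsn >= read_lsn,
    }
}
```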
Replace direct index access with bounds-checked alternatives in the auth user DDL handler, and add early-return validation in GRANT/REVOKE ROLE handlers to return a well-formed syntax error instead of panicking when the command has too few parts.
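The pattern described is `.get()` plus early return instead of direct indexing that can panic on short input. The token positions and error text below are illustrative, not the handler's actual layout:

```rust
/// Extracts <role> and <user> from a tokenized `GRANT ROLE <role> TO <user>`.
/// `.get()` returns None on out-of-range access, so too few parts becomes a
/// well-formed syntax error instead of an index panic.
pub fn grant_role_args<'a>(parts: &[&'a str]) -> Result<(&'a str, &'a str), String> {
    let role = parts.get(2).copied().ok_or("syntax error: missing role name")?;
    let user = parts.get(4).copied().ok_or("syntax error: missing grantee")?;
    Ok((role, user))
}
```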
Introduce `nodedb_sql::ddl_ast` — a typed representation of every NodeDB DDL statement as `NodedbStatement` enum variants, with a whitespace-token parser that converts raw SQL into the typed form. The AST gives DDL dispatch a compiler-checked match instead of string-prefix branching, so adding a new DDL command forces handling in every affected site at compile time rather than at runtime.
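The typed-AST idea can be illustrated with two hypothetical variants; the real `NodedbStatement` covers every NodeDB DDL command. Matching on the token slice replaces string-prefix branching, so an unhandled shape falls out in exactly one place:

```rust
/// Two illustrative variants only; the real enum is exhaustive over NodeDB DDL.
#[derive(Debug, PartialEq)]
pub enum NodedbStatement {
    CreateCollection { name: String, if_not_exists: bool },
    DropCollection { name: String, if_exists: bool },
}

/// Whitespace-token parsing sketch: split, then match the whole token slice.
pub fn parse(sql: &str) -> Option<NodedbStatement> {
    let t: Vec<&str> = sql.split_whitespace().collect();
    match t.as_slice() {
        ["CREATE", "COLLECTION", "IF", "NOT", "EXISTS", name] => {
            Some(NodedbStatement::CreateCollection { name: name.to_string(), if_not_exists: true })
        }
        ["CREATE", "COLLECTION", name] => {
            Some(NodedbStatement::CreateCollection { name: name.to_string(), if_not_exists: false })
        }
        ["DROP", "COLLECTION", "IF", "EXISTS", name] => {
            Some(NodedbStatement::DropCollection { name: name.to_string(), if_exists: true })
        }
        ["DROP", "COLLECTION", name] => {
            Some(NodedbStatement::DropCollection { name: name.to_string(), if_exists: false })
        }
        _ => None,
    }
}
```

Once dispatch consumes the enum, adding a variant makes every non-exhaustive `match` a compile error, which is the compile-time guarantee the commit message describes.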
…tion Add a `Batch` variant to `MetadataEntry` that wraps a sequence of sub-entries under a single Raft log index. The applier recurses into batch entries via `cascade_live_state`, and the cache applies each sub-entry in order. This ensures that a transactional DDL block (`BEGIN; CREATE ...; COMMIT;`) either commits fully or not at all across the cluster.
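The batch shape and in-order recursive application can be sketched as follows; `Ddl(String)` is a stand-in for the real sub-entry payloads:

```rust
/// Sketch of the entry shape described above.
pub enum MetadataEntry {
    Ddl(String),
    /// All sub-entries commit under a single Raft log index, so the block is
    /// all-or-nothing across the cluster.
    Batch(Vec<MetadataEntry>),
}

/// Mirrors the applier's recursion: descend into batches, apply leaves in order.
pub fn apply(entry: &MetadataEntry, applied: &mut Vec<String>) {
    match entry {
        MetadataEntry::Ddl(s) => applied.push(s.clone()),
        MetadataEntry::Batch(subs) => {
            for e in subs {
                apply(e, applied);
            }
        }
    }
}
```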
Wire transactional DDL semantics through the pgwire session layer:
- `ddl_buffer`: thread-local buffer that intercepts `propose_catalog_entry` calls while a BEGIN block is active, accumulating encoded payloads instead of proposing them immediately.
- `transaction_cmds`: COMMIT flushes the buffer as a single `MetadataEntry::Batch` proposal; ROLLBACK discards it without proposing.
- `metadata_proposer`: checks the thread-local buffer before proposing, returning early when a transaction is active.
- `ast` + router fast path: parse DDL once into `NodedbStatement`, handle `IF [NOT] EXISTS` at dispatch before falling through to the legacy string-prefix handlers.
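The thread-local interception can be sketched like this; the function names and the `Vec<Vec<u8>>` payload shape are assumptions for illustration:

```rust
use std::cell::RefCell;

thread_local! {
    // Some(buffer) while a BEGIN block is active on this session thread.
    static DDL_BUFFER: RefCell<Option<Vec<Vec<u8>>>> = RefCell::new(None);
}

/// BEGIN: start buffering instead of proposing.
pub fn begin() {
    DDL_BUFFER.with(|b| *b.borrow_mut() = Some(Vec::new()));
}

/// Called where propose_catalog_entry would run: returns true if the encoded
/// payload was buffered because a transaction is active (the proposer then
/// returns early instead of proposing).
pub fn try_buffer(payload: Vec<u8>) -> bool {
    DDL_BUFFER.with(|b| match b.borrow_mut().as_mut() {
        Some(buf) => { buf.push(payload); true }
        None => false,
    })
}

/// COMMIT drains the buffer for one Batch proposal; ROLLBACK just discards
/// the returned value without proposing.
pub fn commit() -> Option<Vec<Vec<u8>>> {
    DDL_BUFFER.with(|b| b.borrow_mut().take())
}
```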
Cover the full create/drop lifecycle for collections, sequences, triggers, and schedules across a 3-node test cluster. Each test verifies that DDL executed on the leader becomes visible on all followers within a bounded window, and that IF [NOT] EXISTS branches complete without error.
…figurable

Use a HashSet keyed by peer ID in handle_request_vote_response so that duplicate grants from the same peer cannot inflate the vote count and cause premature leader promotion.

Add election_timeout_min/max fields to ClusterConfig and propagate them through the bootstrap, join, and restart paths. Expose matching fields in ClusterTransportTuning (default 2–5 s) and wire them into MultiRaft so every startup path honours the configured window rather than using a hardcoded constant. Adjust the network tuning defaults from 60/120 s to 2/5 s to match real-world cluster behaviour.
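The dedup fix amounts to keying grants by peer id, which can be sketched as follows (the tally type and quorum arithmetic are illustrative, not the crate's real election state):

```rust
use std::collections::HashSet;

/// Illustrative vote tally keyed by peer id.
pub struct VoteTally {
    granted: HashSet<u64>,
    quorum: usize,
}

impl VoteTally {
    pub fn new(cluster_size: usize) -> Self {
        VoteTally { granted: HashSet::new(), quorum: cluster_size / 2 + 1 }
    }

    /// Records a grant; `insert` is a no-op for a peer already counted, so a
    /// duplicate RequestVote response cannot inflate the count. Returns true
    /// once a majority of distinct peers have granted.
    pub fn record_grant(&mut self, peer: u64) -> bool {
        self.granted.insert(peer);
        self.granted.len() >= self.quorum
    }
}
```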
Start the HealthMonitor task from start_raft so that periodic pings run for the lifetime of the node. Without it, topology updates relied solely on the fire-and-forget broadcast during the join flow; when that broadcast was lost (peer QUIC server not yet accepting), the peer never converged to the full topology.

On each successful pong, push the current topology to peers whose reported version is behind ours. This closes the convergence gap for peers that missed the initial broadcast.

Bound QUIC handshake attempts by rpc_timeout so a hung connection does not block for the full 30 s idle timeout when a peer is slow to accept.

Expose the cluster catalog via ClusterHandle so the HealthMonitor can persist topology changes detected during failure/recovery.

Increase the pgwire retry budget from 3 attempts (350 ms) to 5 attempts (750 ms) to tolerate longer Raft leader drains under load.
…IM probe test

Replace fixed tokio::time::sleep calls in tests with bounded polling loops that break as soon as the expected condition is met. This makes tests deterministic under heavy parallel load (500+ unit tests sharing the same CPU pool) rather than depending on timing assumptions.

In the SWIM indirect_ack_saves_target test, remove start_paused — paused time auto-advances timeouts before channel-woken tasks are polled, making the indirect path race its own timeout. With real time the 40 ms probe window is ample for in-memory delivery. Add a recv loop on the local endpoint to resolve in-flight probes when Acks arrive, and handle both Ping and PingReq in the helper so the test is correct regardless of which node the scheduler picks as the direct target.

In the cluster harness, replace the 200 ms sleep before peers join with a topology-readiness poll, and serialise node 2 joining before node 3 to avoid overwhelming the bootstrap leader's join handler under load.

Extend the descriptor lease planner wait budget from 3 s to 10 s to match the extended election timeout range used in tests.
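A bounded polling helper in the spirit described above can be sketched with std only (the name and poll granularity are illustrative; the real tests presumably use an async variant):

```rust
use std::time::{Duration, Instant};

/// Polls `cond` until it holds or `budget` elapses. Returns true on success,
/// so a test can break as soon as the expected condition is met instead of
/// sleeping a fixed interval.
pub fn wait_until(budget: Duration, mut cond: impl FnMut() -> bool) -> bool {
    let deadline = Instant::now() + budget;
    loop {
        if cond() {
            return true; // condition met: no further waiting
        }
        if Instant::now() >= deadline {
            return false; // budget exhausted: let the caller fail the test
        }
        std::thread::sleep(Duration::from_millis(5));
    }
}
```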
Summary
- `/health/live`, `/health/drain`, `pg_catalog` virtual tables, `nodedb.read_consistency` session parameter

Test plan
- `cargo nextest run -p nodedb-cluster` — all cluster unit + integration tests pass
- `cargo nextest run --test sql_ddl_cluster` — DDL replication correctness across 3-node cluster
- `cargo nextest run -p nodedb-sql` — SQL parser and planner tests including const-fold
- `cargo clippy --all-targets -- -D warnings` — no warnings
- `cargo fmt --all --check` — formatting clean