Fix deferred freeClient clobbering replication state after replicaof #3719
Conversation
📝 Walkthrough
This PR adds debug infrastructure and fixes replica replication state handling to correctly manage async primary client disconnection during replica repointing.

Changes:
- Replica repointing with async primary free support
- TLS connection input validation
Actionable comments posted: 1
Inline comments:
In `tests/unit/wait.tcl`:
- Around lines 549-581: Add a regression case that issues `$replica replicaof no one` after an established replication link, to exercise `replicationUnsetPrimary` and ensure it doesn't crash: call `$replica replicaof no one`, `wait_for_condition` until `[s 0 master_link_status]` eq `{down}` (or an appropriate non-up state), then assert that `[s 0 master_host]` and `[s 0 master_port]` are cleared/empty. Place this in the same test ("Repoint replica between primaries does not leak connections or crash") or as a sibling test, and mirror the same checks in the other similar block (lines ~583-597) to cover the unset-primary path.
Codecov Report: ✅ All modified and coverable lines are covered by tests.

@@ Coverage Diff @@
## unstable #3719 +/- ##
============================================
+ Coverage 76.65% 76.71% +0.05%
============================================
Files 162 162
Lines 80662 80674 +12
============================================
+ Hits 61830 61887 +57
+ Misses 18832 18787 -45
    } else if (server.repl_state == REPL_STATE_CONNECTED) {
        /* primary_host is NULL: deliberate unset in progress. */
        server.repl_state = REPL_STATE_NONE;
        server.repl_down_since = server.unixtime;
`repl_down_since` represents how long we've been disconnected from our primary, and is only meaningful while the node is configured as a replica. Setting it on the transition to `REPL_STATE_NONE` (i.e. `REPLICAOF NO ONE`) writes a value that is conceptually meaningless, and one that would persist as a stale timestamp if the node later becomes a replica again, until the next genuine disconnect resets it.
Suggest dropping the assignment:

    server.repl_down_since = server.unixtime;
Done. Dropped the repl_down_since assignment on the REPL_STATE_NONE transition.
Retracting my earlier suggestion to drop `server.repl_down_since = server.unixtime;` from the `else if` branch; it was the wrong call.

This branch is reached not only by `REPLICAOF NO ONE` but also by the synchronous `replicationSetPrimary(newhost)` path: `replicationSetPrimary` clears `primary_host = NULL` before calling `freeClient(server.primary)`, so when the function chains synchronously to `replicationHandlePrimaryDisconnection`, we hit `repl_state == CONNECTED && primary_host == NULL`. In the `REPLICAOF newhost` case the node is going to be a replica again immediately, and `repl_down_since` needs to track the disconnection from the old primary so that `clusterHandleReplicaFailover`'s `data_age` check works correctly until the new sync completes.
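The role `repl_down_since` plays in the failover eligibility check can be illustrated with a small standalone sketch. This is a simplified model under stated assumptions: `FakeServer` and the freestanding `data_age` function are hypothetical, and the real cluster code uses the last interaction time while the link is up rather than treating the age as zero.

```c
#include <assert.h>
#include <time.h>

/* Hypothetical, simplified mirror of the state involved. */
typedef struct {
    time_t unixtime;        /* current time */
    time_t repl_down_since; /* when the link to the primary went down */
    int link_up;            /* 1 while the replication link is established */
} FakeServer;

/* Sketch of the data_age idea used by cluster failover: once the link
 * drops, the age grows from repl_down_since. If repl_down_since were not
 * refreshed when the old primary is freed during REPLICAOF newhost, a
 * stale timestamp would inflate data_age and could wrongly disqualify
 * the replica from failover until the new sync completes. */
static time_t data_age(const FakeServer *s) {
    if (s->link_up) return 0; /* simplification; see lead-in above */
    return s->unixtime - s->repl_down_since;
}
```

For example, a replica whose link dropped 10 seconds ago reports a `data_age` of 10, while a stale `repl_down_since` from a disconnect hours earlier would report a much larger, misleading age.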
Since PR valkey-io#3324, freeClient() on a primary client with pending IO is deferred via freeClientAsync. The deferred free eventually chains through replicationCachePrimary() -> replicationHandlePrimaryDisconnection(), which unconditionally sets repl_state = REPL_STATE_CONNECT. This causes two bugs:

1. REPLICAOF NO ONE: primary_host is NULL when the deferred free runs, so replicationCron calls connectWithPrimary(NULL) -> SIGSEGV in connTLSConnect (inet_pton with NULL addr).
2. REPLICAOF newhost newport: the deferred free clobbers the already-progressed repl_state (CONNECTING) back to CONNECT, causing replicationCron to call connectWithPrimary() again, which overwrites server.repl_transfer_s without closing the previous connection (FD leak).

Fix by making replicationHandlePrimaryDisconnection() only transition to REPL_STATE_CONNECT when repl_state is still REPL_STATE_CONNECTED (meaning this is a genuine disconnect, not a stale deferred free). If repl_state has already moved on, the deferred free is stale and should not mutate the state machine.

Additionally:
- Add NULL check for addr in connTLSConnect() as defense in depth.
- Add 10s timeout to the WAITAOF test to prevent indefinite hanging.
- Add dedicated tests for the repoint scenario.

Signed-off-by: Yaron Sananes <yaron.sananes@gmail.com>
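The guarded transition this commit message describes can be sketched as a small standalone state machine. This is illustrative only: the enum is a subset of the real one, and `handle_primary_disconnection` is a hypothetical pure-function stand-in, not the real `replicationHandlePrimaryDisconnection` signature.

```c
#include <assert.h>
#include <stddef.h>

/* Subset of the replica replication state machine. */
typedef enum {
    REPL_STATE_NONE,       /* not a replica */
    REPL_STATE_CONNECT,    /* must connect to primary */
    REPL_STATE_CONNECTING, /* connection in progress */
    REPL_STATE_CONNECTED   /* link established */
} repl_state_t;

/* Sketch of the guarded handler: only a genuine disconnect (state still
 * CONNECTED, primary still configured) schedules a reconnect. If the
 * state has already moved on, the deferred free is stale and must not
 * clobber the state machine. */
static repl_state_t handle_primary_disconnection(repl_state_t state, const char *primary_host) {
    if (state == REPL_STATE_CONNECTED && primary_host != NULL) {
        /* Genuine disconnect: schedule a reconnection attempt. */
        return REPL_STATE_CONNECT;
    } else if (state == REPL_STATE_CONNECTED) {
        /* primary_host is NULL: a deliberate unset is in progress. */
        return REPL_STATE_NONE;
    }
    /* State already progressed (CONNECT/CONNECTING/NONE): stale
     * deferred free; leave the state untouched. */
    return state;
}
```

With this guard, the Bug 2 scenario (state already at CONNECTING for the new primary when the stale free runs) no longer regresses to CONNECT, and the Bug 1 scenario (primary_host already NULL) no longer schedules a reconnect to a NULL address.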
Force-pushed from ad8df8b to ccebe67.
🧹 Nitpick comments (1)
src/replication.c (1)
- Lines 4557-4590: Request `@core-team` review for this replication state-machine change. This patch touches src/replication.c, which the repo treats as an architectural-review area. As per coding guidelines: "src/{cluster*.c,replication.c,rdb.c,aof.c}: Request `@core-team` architectural review for changes to cluster*.c, replication.c, rdb.c, or aof.c".
The test "test wait for new failover" in tests/unit/cluster/failover2.tcl is constantly failing, so I would hold off on this merge until we analyze the issue.
Co-authored-by: Ran Shidlansik <ranshid@amazon.com> Signed-off-by: Ran Shidlansik <ranshid@amazon.com>
Summary
This PR addresses 2 race conditions where deferred `freeClient` (introduced by #3324) clobbers replication state set by `REPLICAOF` commands. Found while triaging the recurring `test-ubuntu-tls-io-threads` daily CI failure in `tests/unit/wait.tcl`.

Root Cause
Since PR #3324 ("Redesign IO threading communication model"), `freeClient()` on a primary client with pending IO is deferred via `freeClientAsync` (gated on `clientHasPendingIO`). When the deferred free eventually executes, it chains through `replicationCachePrimary()` -> `replicationHandlePrimaryDisconnection()`, which unconditionally sets `server.repl_state = REPL_STATE_CONNECT`. This causes two bugs:
Bug 1: REPLICAOF NO ONE (SIGSEGV)
`replicationUnsetPrimary()` sets `primary_host = NULL` before calling `freeClient`. The deferred free runs later and sets `repl_state = REPL_STATE_CONNECT` while `primary_host` is still NULL. `replicationCron` then calls `connectWithPrimary()`, which passes NULL to `connTLSConnect()` -> `inet_pton(AF_INET, NULL, ...)` -> SIGSEGV.

Bug 2: REPLICAOF newhost newport (connection leak)
`replicationSetPrimary()` calls `freeClient(old_primary)` (deferred), then sets `primary_host` to the new IP and progresses `repl_state` to `REPL_STATE_CONNECTING` with a new connection handle in `server.repl_transfer_s`. The deferred free runs later and clobbers `repl_state` back to `REPL_STATE_CONNECT`. `replicationCron` then calls `connectWithPrimary()` again, overwriting `server.repl_transfer_s` without closing the previous connection -- an FD leak.

Fix
Make `replicationHandlePrimaryDisconnection()` only transition to `REPL_STATE_CONNECT` when `repl_state` is still `REPL_STATE_CONNECTED` and `primary_host` is set. This means the disconnection is genuine and no other state transition has already occurred. If `repl_state` has already moved on (CONNECT, CONNECTING, NONE, etc.), the deferred free is stale and the function leaves the state untouched.

Additionally:
- `connTLSConnect()`: Return `C_ERR` if `addr` is NULL (defense in depth).
- `tests/unit/wait.tcl`: Add a 10s timeout to the blocking WAITAOF test, and add dedicated tests for the repoint scenario.

Reproduction
Reproduced locally by establishing TLS replication and executing `REPLICAOF NO ONE`.

Testing
The `unit/wait` test suite passes (51/51) with IO threads enabled.
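The `connTLSConnect()` defense-in-depth check described above can be sketched as follows. `checked_connect_addr` is a hypothetical helper that only demonstrates the guard and address validation; the actual connection setup (sockets, TLS handshake, IPv6/hostname handling) is omitted.

```c
#include <assert.h>
#include <arpa/inet.h>
#include <stddef.h>

#define C_OK 0
#define C_ERR (-1)

/* Reject a NULL address before it can reach inet_pton(), which
 * dereferences its input and would otherwise SIGSEGV. Hypothetical
 * helper, not the real connTLSConnect() code. */
static int checked_connect_addr(const char *addr) {
    if (addr == NULL) return C_ERR; /* defense in depth */
    struct in_addr out;
    /* Simplified validation: accept only a parseable IPv4 address. */
    return inet_pton(AF_INET, addr, &out) == 1 ? C_OK : C_ERR;
}
```

The early return turns the Bug 1 crash path into a recoverable error, independently of the state-machine fix.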