Fix deferred freeClient clobbering replication state after replicaof #3719

Merged
ranshid merged 2 commits into valkey-io:unstable from yaronsananes:fix-deferred-free-repl-state-clobber
May 15, 2026

Contributor

@yaronsananes yaronsananes commented May 14, 2026

Summary

This PR fixes two race conditions in which a deferred freeClient (introduced by #3324) clobbers replication state set by a REPLICAOF command.

Found while triaging the recurring test-ubuntu-tls-io-threads daily CI failure in tests/unit/wait.tcl.

Root Cause

Since PR #3324 ("Redesign IO threading communication model"), freeClient() on a primary client with pending IO is deferred via freeClientAsync (gated on clientHasPendingIO). When the deferred free eventually executes, it chains through replicationCachePrimary() -> replicationHandlePrimaryDisconnection(), which unconditionally sets server.repl_state = REPL_STATE_CONNECT.

This causes two bugs:

Bug 1: REPLICAOF NO ONE (SIGSEGV)

replicationUnsetPrimary() sets primary_host = NULL before calling freeClient. The deferred free runs later and sets repl_state = REPL_STATE_CONNECT while primary_host is still NULL. replicationCron then calls connectWithPrimary(), which passes NULL to connTLSConnect() -> inet_pton(AF_INET, NULL, ...) -> SIGSEGV.
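
The ordering above can be illustrated with a toy model. This is a sketch only: the `toy_*` names and the struct are hypothetical stand-ins and are not the Valkey sources, and the model assumes the unset path ends in REPL_STATE_NONE before the deferred free runs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum {
    REPL_STATE_NONE,
    REPL_STATE_CONNECT,
    REPL_STATE_CONNECTING,
    REPL_STATE_CONNECTED
} repl_state_t;

/* Hypothetical stand-in for the relevant server fields. */
typedef struct {
    repl_state_t repl_state;
    const char *primary_host;
    bool deferred_free_queued; /* models a free postponed by pending IO */
} toy_server;

/* Pre-fix behaviour as described in the PR: unconditional transition. */
static void toy_handle_primary_disconnection(toy_server *s) {
    s->repl_state = REPL_STATE_CONNECT;
}

/* Models REPLICAOF NO ONE while the primary client has pending IO. */
static void toy_unset_primary(toy_server *s) {
    s->primary_host = NULL;
    s->deferred_free_queued = true;    /* the free does not run yet */
    s->repl_state = REPL_STATE_NONE;   /* assumed end state of the unset path */
}

/* Models the event loop running the postponed free later on. */
static void toy_run_deferred_frees(toy_server *s) {
    if (!s->deferred_free_queued) return;
    s->deferred_free_queued = false;
    toy_handle_primary_disconnection(s);
}

/* True when a cron tick would try to connect with no host configured. */
static bool toy_cron_would_connect_with_null_host(const toy_server *s) {
    return s->repl_state == REPL_STATE_CONNECT && s->primary_host == NULL;
}
```

In this model the state is safe right after `toy_unset_primary()` and only becomes dangerous once `toy_run_deferred_frees()` has executed, which matches the timing-dependent nature of the CI failure.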

Bug 2: REPLICAOF newhost newport (connection leak)

replicationSetPrimary() calls freeClient(old_primary) (deferred), then sets primary_host to the new IP and advances repl_state to REPL_STATE_CONNECTING with a new connection handle in server.repl_transfer_s. The deferred free runs later and clobbers repl_state back to REPL_STATE_CONNECT. replicationCron then calls connectWithPrimary() again, overwriting server.repl_transfer_s without closing the previous connection: an FD leak.
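
A minimal sketch of the leak, again with hypothetical `toy_*` names and integer handles standing in for real connections. It assumes, as the description states, that the connect step stores a new handle without closing the one already held.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum {
    REPL_STATE_NONE,
    REPL_STATE_CONNECT,
    REPL_STATE_CONNECTING,
    REPL_STATE_CONNECTED
} repl_state_t;

/* Hypothetical stand-in for the relevant server fields. */
typedef struct {
    repl_state_t repl_state;
    int repl_transfer_s;       /* toy handle id, 0 means none */
    int next_handle;           /* last handle id handed out */
    int open_handles;          /* handles opened and never closed */
    bool deferred_free_queued; /* models a free postponed by pending IO */
} toy_server;

/* Opens a toy connection and overwrites the stored handle. */
static void toy_connect_with_primary(toy_server *s) {
    s->repl_transfer_s = ++s->next_handle;
    s->open_handles++;
    s->repl_state = REPL_STATE_CONNECTING;
}

/* Models REPLICAOF newhost newport with the old primary's free deferred. */
static void toy_set_primary(toy_server *s) {
    s->deferred_free_queued = true;
    s->repl_state = REPL_STATE_CONNECT;
    toy_connect_with_primary(s);
}

/* Pre-fix deferred free: resets the state unconditionally. */
static void toy_run_deferred_frees(toy_server *s) {
    if (!s->deferred_free_queued) return;
    s->deferred_free_queued = false;
    s->repl_state = REPL_STATE_CONNECT;
}

/* Models the cron tick that connects whenever the state is CONNECT. */
static void toy_replication_cron(toy_server *s) {
    if (s->repl_state == REPL_STATE_CONNECT) toy_connect_with_primary(s);
}
```

Running set-primary, then the deferred free, then one cron tick leaves two open handles while only the second is still referenced, which is the leak in miniature.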

Fix

Make replicationHandlePrimaryDisconnection() only transition to REPL_STATE_CONNECT when repl_state is still REPL_STATE_CONNECTED and primary_host is set. This means the disconnection is genuine and no other state transition has already occurred. If repl_state has already moved on (CONNECT, CONNECTING, NONE, etc.), the deferred free is stale and the function leaves the state untouched.
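
The guard can be sketched as follows. This is a simplified model with hypothetical `toy_*` names, not the code in src/replication.c. The REPL_STATE_NONE branch for a NULL primary_host reflects the snippet discussed in the review thread and is included here as an assumption about the final shape.

```c
#include <assert.h>
#include <stddef.h>

typedef enum {
    REPL_STATE_NONE,
    REPL_STATE_CONNECT,
    REPL_STATE_CONNECTING,
    REPL_STATE_CONNECTED
} repl_state_t;

/* Hypothetical stand-in for the relevant server fields. */
typedef struct {
    repl_state_t repl_state;
    const char *primary_host;
} toy_server;

/* Only a disconnect observed while still CONNECTED may change the state. */
static void toy_handle_primary_disconnection(toy_server *s) {
    if (s->repl_state != REPL_STATE_CONNECTED) {
        return; /* stale deferred free: another transition already happened */
    }
    if (s->primary_host != NULL) {
        s->repl_state = REPL_STATE_CONNECT; /* genuine disconnect, retry */
    } else {
        s->repl_state = REPL_STATE_NONE;    /* primary is being unset */
    }
}
```

A stale call that arrives in CONNECTING or NONE becomes a no-op in this model, so neither the NULL-host connect nor the second connect can be triggered by it.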

Additionally:

  • connTLSConnect(): Return C_ERR if addr is NULL (defense in depth).
  • tests/unit/wait.tcl: Add 10s timeout to the blocking WAITAOF test, and add dedicated tests for the repoint scenario.
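
For illustration, the defense-in-depth check might look like the sketch below. `toy_tls_connect` and its `use_sni` output are hypothetical; the real connTLSConnect() has a different signature and does more work. The only points shown are that the NULL test comes before inet_pton() and that a non-literal address is treated as a hostname.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <stddef.h>

#define C_OK 0
#define C_ERR -1

/* Rejects a NULL address before any parsing. On success, *use_sni is set to
 * 1 when addr is not an IPv4 literal (a hostname in this toy) and 0 otherwise. */
static int toy_tls_connect(const char *addr, int *use_sni) {
    struct in_addr v4;

    if (addr == NULL) return C_ERR; /* guard: never hand NULL to inet_pton() */

    *use_sni = (inet_pton(AF_INET, addr, &v4) != 1);
    return C_OK;
}
```

With the guard in place a caller that reaches the connect path without a host gets C_ERR back instead of a crash, independently of the state-machine fix.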

Reproduction

Reproduced locally by establishing TLS replication and executing REPLICAOF NO ONE:

  • Without fix: server crashes with signal 11, accessing address 0x0
  • With fix: server continues operating normally

Testing

  • Full unit/wait test suite passes (51/51) with IO threads enabled
  • New tests "Repoint replica between primaries does not leak connections or crash" and "Rapid repoint does not crash or leak" pass
  • Crash reproduced locally over TLS (SIGSEGV without fix, graceful handling with fix)

coderabbitai Bot commented May 14, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro Plus

Run ID: cdbbd68c-431b-4c06-87c1-b5d9b842b8ff

📥 Commits

Reviewing files that changed from the base of the PR and between ccebe67 and 5512e04.

📒 Files selected for processing (1)
  • src/replication.c
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/replication.c

📝 Walkthrough

This PR adds debug infrastructure and fixes replica replication state handling to correctly manage async primary client disconnection during replica repointing. A new DEBUG force-free-primary-async command enables deterministic async closure of primary connections, and replicationHandlePrimaryDisconnection() now handles stale cached primary references. TLS connection handler also gains NULL address validation.

Changes

Replica Repointing with Async Primary Free Support

| Layer | File(s) | Summary |
|---|---|---|
| Debug infrastructure for async primary free | src/server.h, src/server.c, src/debug.c, src/networking.c | struct valkeyServer gains debug_force_free_primary_async field initialized at startup. New DEBUG force-free-primary-async command sets the flag. freeClient() skips immediate release for primary clients when flag is active, enabling async closure for testing. |
| Replication state transitions on primary disconnection | src/replication.c | replicationHandlePrimaryDisconnection() now conditionally updates server.repl_state only when prior state was REPL_STATE_CONNECTED and primary_host is consistent. Transitions to REPL_STATE_CONNECT or REPL_STATE_NONE with updated repl_down_since. Immediate reconnection gated to require both REPL_STATE_CONNECT and non-NULL primary_host. |
| Tests for replica repointing and primary free scenarios | tests/unit/wait.tcl | WAITAOF timeout adjusted from 0 to 10000 ms in backlog-absent test. New test suite added for replica repointing: forces async primary free via debug command, repoints replica to second primary, and verifies only one connection attempt via replica stdout logs. |

TLS Connection Input Validation

| Layer | File(s) | Summary |
|---|---|---|
| NULL address validation in TLS connection | src/tls.c | connTLSConnect() adds early guard that rejects NULL addr parameter with C_ERR return before address parsing or SNI setup. |

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)


| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title clearly and concisely summarizes the main change: fixing a bug where deferred freeClient clobbers replication state after REPLICAOF commands. |
| Description check | ✅ Passed | The description comprehensively explains the root cause, the two specific bugs addressed, the fix implemented, and reproduction/testing details, all directly related to the changeset. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


@ranshid ranshid marked this pull request as ready for review May 14, 2026 15:32

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tests/unit/wait.tcl`:
- Around line 549-581: Add a regression case that issues "$replica replicaof no
one" after an established replication link to exercise replicationUnsetPrimary
and ensure it doesn't crash: call "$replica replicaof no one",
wait_for_condition until [s 0 master_link_status] eq {down} (or appropriate
non-up state), then assert that [s 0 master_host] and [s 0 master_port] are
cleared/empty; place this in the same test (Repoint replica between primaries
does not leak connections or crash) or as a sibling test and mirror the same
checks in the other similar block (lines ~583-597) to cover the unset-primary
path.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 112ab53b-140d-4476-bab4-6f6b1a238876

📥 Commits

Reviewing files that changed from the base of the PR and between fdf13ca and ad8df8b.

📒 Files selected for processing (3)
  • src/replication.c
  • src/tls.c
  • tests/unit/wait.tcl

Comment thread tests/unit/wait.tcl Outdated

codecov Bot commented May 14, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 76.71%. Comparing base (fdf13ca) to head (5512e04).
⚠️ Report is 2 commits behind head on unstable.

Additional details and impacted files
@@             Coverage Diff              @@
##           unstable    #3719      +/-   ##
============================================
+ Coverage     76.65%   76.71%   +0.05%     
============================================
  Files           162      162              
  Lines         80662    80674      +12     
============================================
+ Hits          61830    61887      +57     
+ Misses        18832    18787      -45     

| Files with missing lines | Coverage Δ |
|---|---|
| src/debug.c | 54.95% <100.00%> (+0.11%) ⬆️ |
| src/networking.c | 92.32% <100.00%> (+0.10%) ⬆️ |
| src/replication.c | 85.89% <100.00%> (-0.28%) ⬇️ |
| src/server.c | 89.48% <100.00%> (+0.02%) ⬆️ |
| src/server.h | 100.00% <ø> (ø) |
| src/tls.c | 17.64% <ø> (ø) |

... and 18 files with indirect coverage changes

Comment thread src/replication.c
    } else if (server.repl_state == REPL_STATE_CONNECTED) {
        /* primary_host is NULL: deliberate unset in progress. */
        server.repl_state = REPL_STATE_NONE;
        server.repl_down_since = server.unixtime;
Member

repl_down_since records when we lost the link to our primary, and is only meaningful while the node is configured as a replica. Setting it on the transition to REPL_STATE_NONE (i.e. REPLICAOF NO ONE) writes a value that is conceptually meaningless, and one that would persist as a stale timestamp if the node later becomes a replica again, until the next genuine disconnect resets it.

Suggest dropping the assignment:

Suggested change
server.repl_down_since = server.unixtime;

Contributor Author

Done. Dropped the repl_down_since assignment on the REPL_STATE_NONE transition.

Member

Retracting my earlier suggestion to drop server.repl_down_since = server.unixtime; from the else if branch. It was the wrong call.

This branch is reached not only by REPLICAOF NO ONE but also by the synchronous replicationSetPrimary(newhost) path: replicationSetPrimary clears primary_host = NULL before calling freeClient(server.primary), so when the function chains synchronously to replicationHandlePrimaryDisconnection, we hit repl_state == CONNECTED && primary_host == NULL. In the REPLICAOF newhost case the node is going to be a replica again immediately, and repl_down_since needs to track the disconnection from the old primary so that clusterHandleReplicaFailover's data_age check works correctly until the new sync completes.
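
A rough sketch of why the timestamp matters. It assumes, as a simplification, that data age is derived as "now minus repl_down_since" while the link is down; the `toy_*` names are hypothetical and the exact expression in clusterHandleReplicaFailover is not reproduced here.

```c
#include <assert.h>

/* Hypothetical stand-in for the two timestamps involved. */
typedef struct {
    long unixtime;        /* current time in seconds */
    long repl_down_since; /* when the link to the primary was last lost */
} toy_server;

/* Models the disconnect path with and without the timestamp assignment. */
static void toy_on_primary_disconnect(toy_server *s, int record_down_since) {
    if (record_down_since) s->repl_down_since = s->unixtime;
}

/* Assumed simplification: age of the replica's data in milliseconds. */
static long toy_data_age_ms(const toy_server *s) {
    return (s->unixtime - s->repl_down_since) * 1000;
}
```

If the assignment is skipped, the age is measured from whatever older outage last set repl_down_since, so a replica repointed five seconds ago can appear to hold data that is many minutes old.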

Contributor Author

Nice catch, thanks @ranshid.

Comment thread tests/unit/wait.tcl
Since PR valkey-io#3324, freeClient() on a primary client with pending IO is
deferred via freeClientAsync. The deferred free eventually chains through
replicationCachePrimary() -> replicationHandlePrimaryDisconnection(),
which unconditionally set repl_state = REPL_STATE_CONNECT.

This causes two bugs:

1. REPLICAOF NO ONE: primary_host is NULL when the deferred free runs,
   so replicationCron calls connectWithPrimary(NULL) -> SIGSEGV in
   connTLSConnect (inet_pton with NULL addr).

2. REPLICAOF newhost newport: the deferred free clobbers the already-
   progressed repl_state (CONNECTING) back to CONNECT, causing
   replicationCron to call connectWithPrimary() again, which overwrites
   server.repl_transfer_s without closing the previous connection (FD leak).

Fix by making replicationHandlePrimaryDisconnection() only transition to
REPL_STATE_CONNECT when repl_state is still REPL_STATE_CONNECTED (meaning
this is a genuine disconnect, not a stale deferred free). If repl_state
has already moved on, the deferred free is stale and should not mutate
the state machine.

Additionally:
- Add NULL check for addr in connTLSConnect() as defense in depth.
- Add 10s timeout to the WAITAOF test to prevent indefinite hanging.
- Add dedicated tests for the repoint scenario.

Signed-off-by: Yaron Sananes <yaron.sananes@gmail.com>
@yaronsananes yaronsananes force-pushed the fix-deferred-free-repl-state-clobber branch from ad8df8b to ccebe67 Compare May 14, 2026 17:34

@coderabbitai coderabbitai Bot left a comment

🧹 Nitpick comments (1)
src/replication.c (1)

4557-4590: Request @core-team review for this replication state-machine change.

This patch touches src/replication.c, which the repo treats as an architectural-review area.

As per coding guidelines, "src/{cluster*.c,replication.c,rdb.c,aof.c}: Request @core-team architectural review for changes to cluster*.c, replication.c, rdb.c, or aof.c"

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/replication.c` around lines 4557 - 4590, This change modifies the
replication state machine in replicationHandlePrimaryDisconnection (in
src/replication.c) and therefore requires an explicit architectural review by
the core team; please add a PR reviewer request to `@core-team` and a brief
rationale in the PR description mentioning the affected symbol
replicationHandlePrimaryDisconnection and the state transitions around
server.repl_state (REPL_STATE_CONNECTED → REPL_STATE_CONNECT/REPL_STATE_NONE) so
reviewers can evaluate correctness and backward-compatibility of the
reconnection logic (including behavior when server.primary_host is NULL and the
immediate reconnect path via connectWithPrimary()).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 45226830-d070-4816-be53-a81f65a66ed3

📥 Commits

Reviewing files that changed from the base of the PR and between ad8df8b and ccebe67.

📒 Files selected for processing (7)
  • src/debug.c
  • src/networking.c
  • src/replication.c
  • src/server.c
  • src/server.h
  • src/tls.c
  • tests/unit/wait.tcl
✅ Files skipped from review due to trivial changes (1)
  • src/debug.c
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/tls.c

Member

@ranshid ranshid left a comment

Overall LGTM

I think we could also add a serverAssert(server.primary_host != NULL) at connectWithPrimary entry as future-proofing, but I would not risk it right now.

Member

ranshid commented May 14, 2026

The "test wait for new failover in tests/unit/cluster/failover2.tcl" test is constantly failing, so I would hold off on this merge until we analyze the issue.

@ranshid ranshid self-requested a review May 14, 2026 19:30
Comment thread src/replication.c
Co-authored-by: Ran Shidlansik <ranshid@amazon.com>
Signed-off-by: Ran Shidlansik <ranshid@amazon.com>
@ranshid ranshid moved this to To be backported in Valkey 9.1 May 15, 2026
@ranshid ranshid added the bug Something isn't working label May 15, 2026
@ranshid ranshid merged commit 0321a69 into valkey-io:unstable May 15, 2026
98 checks passed

Labels

bug Something isn't working

Projects

Status: No status
Status: To be backported

2 participants