Fix deferred freeClient clobbering replication state after replicaof by yaronsananes · Pull Request #3719 · valkey-io/valkey

yaronsananes · 2026-05-14T15:21:26Z

Summary

This PR fixes two race conditions in which the deferred freeClient (introduced by #3324) clobbers replication state set by REPLICAOF commands.

Found while triaging the recurring test-ubuntu-tls-io-threads daily CI failure in tests/unit/wait.tcl.

Root Cause

Since PR #3324 ("Redesign IO threading communication model"), freeClient() on a primary client with pending IO is deferred via freeClientAsync (gated on clientHasPendingIO). When the deferred free eventually executes, it chains through replicationCachePrimary() -> replicationHandlePrimaryDisconnection(), which unconditionally sets server.repl_state = REPL_STATE_CONNECT.

This causes two bugs:

Bug 1: REPLICAOF NO ONE (SIGSEGV)

replicationUnsetPrimary() sets primary_host = NULL before calling freeClient. The deferred free runs later, sets repl_state = REPL_STATE_CONNECT while primary_host is still NULL. replicationCron then calls connectWithPrimary() which passes NULL to connTLSConnect() -> inet_pton(AF_INET, NULL, ...) -> SIGSEGV.
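
The ordering behind Bug 1 can be illustrated with a toy model. This is a minimal sketch, not the Valkey sources: every `toy_*` name is hypothetical, and the model assumes that REPLICAOF NO ONE has already left the state at "none" by the time the deferred free runs.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of the replica state; all toy_* names are hypothetical. */
enum toy_repl_state { TOY_REPL_NONE, TOY_REPL_CONNECT, TOY_REPL_CONNECTED };

struct toy_server {
    enum toy_repl_state repl_state;
    const char *primary_host;
};

/* Models REPLICAOF NO ONE: the host is cleared and the free of the
 * primary client is only queued (assumption: state ends up as NONE). */
static void toy_replicaof_no_one(struct toy_server *s) {
    assert(s != NULL);
    s->primary_host = NULL;
    s->repl_state = TOY_REPL_NONE;
}

/* Models the pre-fix disconnect handler reached by the deferred free:
 * it requests a reconnect regardless of the current state. */
static void toy_deferred_free_prefix(struct toy_server *s) {
    assert(s != NULL);
    s->repl_state = TOY_REPL_CONNECT;
}

/* Returns 1 if the next cron tick would try to connect with a NULL host. */
static int toy_cron_would_connect_to_null(const struct toy_server *s) {
    assert(s != NULL);
    return s->repl_state == TOY_REPL_CONNECT && s->primary_host == NULL;
}
```

Calling `toy_replicaof_no_one()` and then `toy_deferred_free_prefix()` leaves `toy_cron_would_connect_to_null()` returning 1, which corresponds to the NULL host reaching inet_pton() in the sequence described above.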

Bug 2: REPLICAOF newhost newport (connection leak)

replicationSetPrimary() calls freeClient(old_primary) (deferred), then sets primary_host to the new IP and progresses repl_state to REPL_STATE_CONNECTING with a new connection handle in server.repl_transfer_s. The deferred free runs later, clobbers repl_state back to REPL_STATE_CONNECT. replicationCron then calls connectWithPrimary() again, overwriting server.repl_transfer_s without closing the previous connection -- an FD leak.
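
The leak in Bug 2 can be sketched the same way. The model below is an illustration under stated assumptions, not the real code: the `leak_*` names are hypothetical, and integer handle ids stand in for the connection stored in server.repl_transfer_s.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model; all leak_* names are hypothetical. */
enum leak_state { LEAK_CONNECT, LEAK_CONNECTING, LEAK_CONNECTED };

struct leak_replica {
    enum leak_state repl_state;
    int transfer_handle; /* stands in for server.repl_transfer_s */
    int next_handle;     /* next handle id to hand out */
    int open_handles;    /* handles opened so far; nothing here closes them */
};

/* Opens a new handle and overwrites the stored one without closing it. */
static void leak_connect_with_primary(struct leak_replica *r) {
    assert(r != NULL);
    r->transfer_handle = r->next_handle++;
    r->open_handles++;
    r->repl_state = LEAK_CONNECTING;
}

/* Pre-fix deferred free: resets the state unconditionally. */
static void leak_stale_deferred_free(struct leak_replica *r) {
    assert(r != NULL);
    r->repl_state = LEAK_CONNECT;
}

/* Cron tick: connects only when the state asks for it. */
static void leak_cron(struct leak_replica *r) {
    assert(r != NULL);
    if (r->repl_state == LEAK_CONNECT) leak_connect_with_primary(r);
}
```

After one `leak_connect_with_primary()` (the repoint), a `leak_stale_deferred_free()` followed by `leak_cron()` opens a second handle while the first is still counted as open, which is the overwrite-without-close pattern the description calls an FD leak.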

Fix

Make replicationHandlePrimaryDisconnection() only transition to REPL_STATE_CONNECT when repl_state is still REPL_STATE_CONNECTED and primary_host is set. This means the disconnection is genuine and no other state transition has already occurred. If repl_state has already moved on (CONNECT, CONNECTING, NONE, etc.), the deferred free is stale and the function leaves the state untouched.
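
The guard described above can be sketched as follows. This is a simplified illustration, not the actual replicationHandlePrimaryDisconnection(): the `guard_*` names and the return value are hypothetical and exist only to make the behaviour observable.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model; all guard_* names are hypothetical. */
enum guard_state { GUARD_NONE, GUARD_CONNECT, GUARD_CONNECTING, GUARD_CONNECTED };

struct guard_server {
    enum guard_state repl_state;
    const char *primary_host;
};

/* Returns 1 if the disconnection was treated as genuine and a reconnect was
 * requested, 0 if the call was treated as a stale deferred free. */
static int guard_handle_primary_disconnection(struct guard_server *s) {
    assert(s != NULL);
    if (s->repl_state == GUARD_CONNECTED && s->primary_host != NULL) {
        s->repl_state = GUARD_CONNECT;
        return 1;
    }
    return 0; /* state already moved on: leave it untouched */
}
```

With this shape, a handler that fires after the state has moved to CONNECTING or NONE becomes a no-op, so neither the NULL-host reconnect nor the second connection attempt is triggered by the stale free.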

Additionally:

- connTLSConnect(): Return C_ERR if addr is NULL (defense in depth).
- tests/unit/wait.tcl: Add 10s timeout to the blocking WAITAOF test, and add dedicated tests for the repoint scenario.
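
The defense-in-depth guard for connTLSConnect() amounts to rejecting a NULL address before it reaches inet_pton(). A minimal sketch of that pattern on a POSIX system, using a hypothetical helper `sketch_parse_ipv4()` and hypothetical `SKETCH_OK`/`SKETCH_ERR` codes in place of the real function and C_OK/C_ERR:

```c
#include <assert.h>
#include <stddef.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

/* Hypothetical status codes standing in for C_OK / C_ERR. */
#define SKETCH_OK 0
#define SKETCH_ERR -1

/* Parses an IPv4 literal, rejecting NULL before it can reach inet_pton(). */
static int sketch_parse_ipv4(const char *addr, struct in_addr *out) {
    assert(out != NULL);
    if (addr == NULL) return SKETCH_ERR; /* early guard, defense in depth */
    return inet_pton(AF_INET, addr, out) == 1 ? SKETCH_OK : SKETCH_ERR;
}
```

The guard does not fix the state-machine bug on its own; it only turns a would-be crash into an error return if a NULL host ever reaches the connect path again.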

Reproduction

Reproduced locally by establishing TLS replication and executing REPLICAOF NO ONE:

- Without fix: server crashes with signal 11, accessing address 0x0
- With fix: server continues operating normally

Testing

- Full unit/wait test suite passes (51/51) with IO threads enabled
- New tests "Repoint replica between primaries does not leak connections or crash" and "Rapid repoint does not crash or leak" pass
- Crash reproduced locally over TLS (SIGSEGV without fix, graceful handling with fix)

coderabbitai · 2026-05-14T15:21:34Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

📥 Commits

Reviewing files that changed from the base of the PR and between ccebe67 and 5512e04.

📒 Files selected for processing (1)

src/replication.c

🚧 Files skipped from review as they are similar to previous changes (1)

src/replication.c

📝 Walkthrough

Walkthrough

This PR adds debug infrastructure and fixes replica replication state handling to correctly manage async primary client disconnection during replica repointing. A new DEBUG force-free-primary-async command enables deterministic async closure of primary connections, and replicationHandlePrimaryDisconnection() now handles stale cached primary references. TLS connection handler also gains NULL address validation.

Changes

Replica Repointing with Async Primary Free Support

| Layer / File(s) | Summary |
| --- | --- |
| Debug infrastructure for async primary free (`src/server.h`, `src/server.c`, `src/debug.c`, `src/networking.c`) | `struct valkeyServer` gains `debug_force_free_primary_async` field initialized at startup. New `DEBUG force-free-primary-async` command sets the flag. `freeClient()` skips immediate release for primary clients when flag is active, enabling async closure for testing. |
| Replication state transitions on primary disconnection (`src/replication.c`) | `replicationHandlePrimaryDisconnection()` now conditionally updates `server.repl_state` only when prior state was `REPL_STATE_CONNECTED` and `primary_host` is consistent. Transitions to `REPL_STATE_CONNECT` or `REPL_STATE_NONE` with updated `repl_down_since`. Immediate reconnection gated to require both `REPL_STATE_CONNECT` and non-NULL `primary_host`. |
| Tests for replica repointing and primary free scenarios (`tests/unit/wait.tcl`) | WAITAOF timeout adjusted from 0 to 10000 ms in backlog-absent test. New test suite added for replica repointing: forces async primary free via debug command, repoints replica to second primary, and verifies only one connection attempt via replica stdout logs. |

TLS Connection Input Validation

| Layer / File(s) | Summary |
| --- | --- |
| NULL address validation in TLS connection (`src/tls.c`) | `connTLSConnect()` adds early guard that rejects `NULL` `addr` parameter with `C_ERR` return before address parsing or SNI setup. |

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title clearly and concisely summarizes the main change: fixing a bug where deferred freeClient clobbers replication state after REPLICAOF commands. |
| Description check | ✅ Passed | The description comprehensively explains the root cause, the two specific bugs addressed, the fix implemented, and reproduction/testing details, all directly related to the changeset. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tests/unit/wait.tcl`:
- Around line 549-581: Add a regression case that issues "$replica replicaof no
one" after an established replication link to exercise replicationUnsetPrimary
and ensure it doesn't crash: call "$replica replicaof no one",
wait_for_condition until [s 0 master_link_status] eq {down} (or appropriate
non-up state), then assert that [s 0 master_host] and [s 0 master_port] are
cleared/empty; place this in the same test (Repoint replica between primaries
does not leak connections or crash) or as a sibling test and mirror the same
checks in the other similar block (lines ~583-597) to cover the unset-primary
path.

ℹ️ Review info

📥 Commits

Reviewing files that changed from the base of the PR and between fdf13ca and ad8df8b.

📒 Files selected for processing (3)

src/replication.c
src/tls.c
tests/unit/wait.tcl

codecov · 2026-05-14T15:45:21Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 76.71%. Comparing base (fdf13ca) to head (5512e04).
⚠️ Report is 2 commits behind head on unstable.

Additional details and impacted files

@@             Coverage Diff              @@
##           unstable    #3719      +/-   ##
============================================
+ Coverage     76.65%   76.71%   +0.05%     
============================================
  Files           162      162              
  Lines         80662    80674      +12     
============================================
+ Hits          61830    61887      +57     
+ Misses        18832    18787      -45

| Files with missing lines | Coverage Δ |
| --- | --- |
| src/debug.c | `54.95% <100.00%> (+0.11%)` ⬆️ |
| src/networking.c | `92.32% <100.00%> (+0.10%)` ⬆️ |
| src/replication.c | `85.89% <100.00%> (-0.28%)` ⬇️ |
| src/server.c | `89.48% <100.00%> (+0.02%)` ⬆️ |
| src/server.h | `100.00% <ø> (ø)` |
| src/tls.c | `17.64% <ø> (ø)` |

... and 18 files with indirect coverage changes

ranshid · 2026-05-14T15:41:10Z

+    } else if (server.repl_state == REPL_STATE_CONNECTED) {
+        /* primary_host is NULL: deliberate unset in progress. */
+        server.repl_state = REPL_STATE_NONE;
+        server.repl_down_since = server.unixtime;


repl_down_since records when we lost the connection to our primary, and is only meaningful while the node is configured as a replica. Setting it on the transition to REPL_STATE_NONE (i.e. REPLICAOF NO ONE) writes a value that is conceptually meaningless, and it would persist as a stale timestamp if the node later becomes a replica again, until the next genuine disconnect resets it.

Suggest dropping the assignment:

Suggested change

-        server.repl_down_since = server.unixtime;

yaronsananes · 2026-05-14

Done. Dropped the repl_down_since assignment on the REPL_STATE_NONE transition.

ranshid · 2026-05-15

Retracting my earlier suggestion to drop server.repl_down_since = server.unixtime; from the else if branch. It was the wrong call.

This branch is reached not only by REPLICAOF NO ONE but also by the synchronous replicationSetPrimary(newhost) path: replicationSetPrimary sets primary_host = NULL before calling freeClient(server.primary), so when the function chains synchronously to replicationHandlePrimaryDisconnection, we hit repl_state == CONNECTED && primary_host == NULL. In the REPLICAOF newhost case the node is going to be a replica again immediately, and repl_down_since needs to track the disconnection from the old primary so that clusterHandleReplicaFailover's data_age check works correctly until the new sync completes.

yaronsananes · 2026-05-15

Nice catch, thanks @ranshid.
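
The outcome of this thread can be sketched as a handler that records the disconnect time in both branches. This is an illustration assuming the assignment is kept as the retraction argues; the `age_*` names are hypothetical, and `age_seconds_down()` is only a simplified stand-in for the input to the data_age check mentioned above.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Toy model; all age_* names are hypothetical. */
enum age_state { AGE_NONE, AGE_CONNECT, AGE_CONNECTED };

struct age_server {
    enum age_state repl_state;
    const char *primary_host;
    time_t unixtime;        /* cached current time */
    time_t repl_down_since; /* when the link to the primary went down */
};

static void age_handle_primary_disconnection(struct age_server *s) {
    assert(s != NULL);
    if (s->repl_state != AGE_CONNECTED) return; /* stale deferred free */
    s->repl_down_since = s->unixtime;           /* recorded in both branches */
    s->repl_state = s->primary_host != NULL ? AGE_CONNECT : AGE_NONE;
}

/* Seconds since the link went down (simplified data-age input). */
static long age_seconds_down(const struct age_server *s) {
    assert(s != NULL);
    return (long)(s->unixtime - s->repl_down_since);
}
```

If the assignment were skipped when primary_host is NULL, a repoint through replicationSetPrimary(newhost) would leave repl_down_since at whatever value it held before, and the computed age would not reflect the disconnection from the old primary.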

Since PR valkey-io#3324, freeClient() on a primary client with pending IO is deferred via freeClientAsync. The deferred free eventually chains through replicationCachePrimary() -> replicationHandlePrimaryDisconnection(), which unconditionally set repl_state = REPL_STATE_CONNECT.

This causes two bugs:

1. REPLICAOF NO ONE: primary_host is NULL when the deferred free runs, so replicationCron calls connectWithPrimary(NULL) -> SIGSEGV in connTLSConnect (inet_pton with NULL addr).
2. REPLICAOF newhost newport: the deferred free clobbers the already-progressed repl_state (CONNECTING) back to CONNECT, causing replicationCron to call connectWithPrimary() again, which overwrites server.repl_transfer_s without closing the previous connection (FD leak).

Fix by making replicationHandlePrimaryDisconnection() only transition to REPL_STATE_CONNECT when repl_state is still REPL_STATE_CONNECTED (meaning this is a genuine disconnect, not a stale deferred free). If repl_state has already moved on, the deferred free is stale and should not mutate the state machine.

Additionally:
- Add NULL check for addr in connTLSConnect() as defense in depth.
- Add 10s timeout to the WAITAOF test to prevent indefinite hanging.
- Add dedicated tests for the repoint scenario.

Signed-off-by: Yaron Sananes <yaron.sananes@gmail.com>

coderabbitai

🧹 Nitpick comments (1)

src/replication.c (1)
4557-4590: Request @core-team review for this replication state-machine change.

This patch touches src/replication.c, which the repo treats as an architectural-review area.

As per coding guidelines, "src/{cluster*.c,replication.c,rdb.c,aof.c}: Request @core-team architectural review for changes to cluster*.c, replication.c, rdb.c, or aof.c"
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/replication.c` around lines 4557 - 4590, This change modifies the
replication state machine in replicationHandlePrimaryDisconnection (in
src/replication.c) and therefore requires an explicit architectural review by
the core team; please add a PR reviewer request to `@core-team` and a brief
rationale in the PR description mentioning the affected symbol
replicationHandlePrimaryDisconnection and the state transitions around
server.repl_state (REPL_STATE_CONNECTED → REPL_STATE_CONNECT/REPL_STATE_NONE) so
reviewers can evaluate correctness and backward-compatibility of the
reconnection logic (including behavior when server.primary_host is NULL and the
immediate reconnect path via connectWithPrimary()).

ℹ️ Review info

📥 Commits

Reviewing files that changed from the base of the PR and between ad8df8b and ccebe67.

📒 Files selected for processing (7)

src/debug.c
src/networking.c
src/replication.c
src/server.c
src/server.h
src/tls.c
tests/unit/wait.tcl

✅ Files skipped from review due to trivial changes (1)

src/debug.c

🚧 Files skipped from review as they are similar to previous changes (1)

src/tls.c

ranshid

Overall LGTM

I think we could also add a serverAssert(server.primary_host != NULL) at connectWithPrimary entry as future-proofing, but I would not risk it right now.

ranshid · 2026-05-14T19:30:05Z

The "test wait for new failover in tests/unit/cluster/failover2.tcl" is constantly failing, so I would hold off on this merge until we analyze the issue.

Co-authored-by: Ran Shidlansik <ranshid@amazon.com> Signed-off-by: Ran Shidlansik <ranshid@amazon.com>

github-actions Bot assigned yaronsananes May 14, 2026

ranshid marked this pull request as ready for review May 14, 2026 15:32

coderabbitai Bot reviewed May 14, 2026

Comment thread tests/unit/wait.tcl Outdated

ranshid reviewed May 14, 2026

yaronsananes force-pushed the fix-deferred-free-repl-state-clobber branch from ad8df8b to ccebe67 Compare May 14, 2026 17:34

coderabbitai Bot reviewed May 14, 2026

ranshid approved these changes May 14, 2026

ranshid self-requested a review May 14, 2026 19:30

ranshid reviewed May 15, 2026

Comment thread src/replication.c

Apply suggestions from code review

5512e04

Co-authored-by: Ran Shidlansik <ranshid@amazon.com> Signed-off-by: Ran Shidlansik <ranshid@amazon.com>

ranshid added this to Valkey 9.1 May 15, 2026

ranshid moved this to To be backported in Valkey 9.1 May 15, 2026

ranshid added this to Valkey 10 May 15, 2026

ranshid added the bug Something isn't working label May 15, 2026

ranshid merged commit 0321a69 into valkey-io:unstable May 15, 2026
98 checks passed

ranshid mentioned this pull request May 15, 2026

Fix crash in connectWithPrimary when primary_host is NULL with TLS #3695

Closed
