
Fix crash in connectWithPrimary when primary_host is NULL with TLS (#3695)

Open
yaronsananes wants to merge 2 commits into valkey-io:unstable from yaronsananes:fix-null-primary-host-crash-tls-io-threads

Conversation

@yaronsananes
Contributor

@yaronsananes yaronsananes commented May 13, 2026

Summary

This fix addresses a SIGSEGV crash in connectWithPrimary() that occurs when TLS and IO threads are both enabled. The crash was identified while triaging the recurring test-ubuntu-tls-io-threads daily CI failure in tests/unit/wait.tcl (test: "WAITAOF master without backlog, wait is released when the replica finishes full-sync").

Root Cause

Since PR #3324 ("Redesign IO threading communication model"), freeClient() on a primary client with pending IO is deferred via freeClientAsync (gated on clientHasPendingIO(c)). This replaced the earlier synchronous waitForClientIO.

When a replica executes REPLICAOF NO ONE, replicationUnsetPrimary() performs:

  1. Sets server.primary_host = NULL
  2. Calls freeClient(server.primary) - which may be deferred

When the deferred free eventually executes, it chains through replicationCachePrimary() -> replicationHandlePrimaryDisconnection() which unconditionally sets server.repl_state = REPL_STATE_CONNECT. By this time, replicationUnsetPrimary() has already finalized repl_state = REPL_STATE_NONE. The deferred free resurrects REPL_STATE_CONNECT while primary_host remains NULL.

replicationCron then observes repl_state == REPL_STATE_CONNECT and calls connectWithPrimary(), passing NULL to connTLSConnect() where inet_pton(AF_INET, NULL, ...) causes a SIGSEGV.

CI crash evidence (from daily run on 2026-05-10):

10981:M * Connecting to PRIMARY (null):28117
10981:M # valkey 255.255.255 crashed by signal: 11, si_code: 1
10981:M # Accessing address: (nil)

Reproduction

Reproduced locally by establishing TLS replication between master and replica, then executing REPLICAOF NO ONE which triggers the deferred free path calling connectWithPrimary() with primary_host == NULL:

  • Without fix: server crashes with signal 11, accessing address 0x0
  • With fix: server continues operating normally

Fix

The fix targets the single location where the invalid state is created:

  1. src/replication.c - replicationHandlePrimaryDisconnection(): Condition the repl_state transition on primary_host being set. If primary_host is NULL, set REPL_STATE_NONE instead of REPL_STATE_CONNECT, since the disconnection is part of a deliberate unset that has already finalized state.
  2. src/tls.c - connTLSConnect(): Return C_ERR if addr is NULL (defense in depth).
  3. tests/unit/wait.tcl: Change waitaof 0 1 0 (infinite timeout) to waitaof 0 1 10000 (10 second timeout) to prevent the test from hanging indefinitely when the replica cannot complete sync.

Testing

  • Full unit/wait test suite passes (40/40) with IO threads enabled
  • Crash reproduced locally over TLS (SIGSEGV without fix, graceful handling with fix)

@coderabbitai

coderabbitai Bot commented May 13, 2026

No actionable comments were generated in the recent review. 🎉


📥 Commits

Reviewing files that changed from the base of the PR and between add1cf1 and 884e8a5.

📒 Files selected for processing (1)
  • src/replication.c
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/replication.c

📝 Walkthrough

Replication avoids reconnect attempts when primary_host is unset; the TLS connection path returns an error if addr is NULL; a unit test replaces an infinite waitaof timeout with a 10000 ms one.

Changes

  • src/replication.c (primary host null guard): replicationHandlePrimaryDisconnection() sets server.repl_state to REPL_STATE_CONNECT only if server.primary_host remains set; replicationCron() checks server.primary_host before calling connectWithPrimary() and resets state to REPL_STATE_NONE if it's NULL.
  • tests/unit/wait.tcl (WAITAOF test timeout adjustment): the deferring client call changed from $rd waitaof 0 1 0 to $rd waitaof 0 1 10000, replacing the infinite timeout with a 10000 ms one.
  • src/tls.c (connection address null check): connTLSConnect() returns C_ERR immediately when addr is NULL, preventing a null pointer dereference in downstream address parsing and SNI logic.


🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️: coverage is 66.67%, below the required 80.00% threshold. Resolution: write docstrings for the functions missing them.

✅ Passed checks (4 passed)

  • Title check: the title clearly and specifically describes the main crash being fixed (a NULL pointer dereference in connectWithPrimary when TLS is enabled), which matches the primary changes in src/replication.c and src/tls.c.
  • Description check: the description provides a comprehensive explanation directly related to the changeset, including root cause analysis, reproduction steps, the specific fix, and testing results that align with the code changes.
  • Linked Issues check: skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check: skipped because no linked issues were found for this pull request.



@yaronsananes yaronsananes marked this pull request as ready for review May 13, 2026 11:01
Member

@ranshid ranshid left a comment


Thank you @yaronsananes. I think some clarifications to the root cause are needed:

  1. The issue is only reproducible on branches that contain PR #3324 ("Redesign IO threading communication model"): unstable and 9.1. On earlier branches, freeClient() on a primary client with pending IO synchronized via waitForClientIO. In #3324, however, we replaced the synchronous waitForClientIO(c) with an async-free escape hatch gated on clientHasPendingIO(c), which is what allows replicationHandlePrimaryDisconnection() to run on a later main-loop iteration, after replicationUnsetPrimary() has already finalized repl_state = REPL_STATE_NONE.

  2. The PR currently adds three defensive NULL-checks (in connectWithPrimary, in replicationCron, and in connTLSConnect) plus a state-reset fallback. A smaller, more targeted fix is to tighten the invariant directly in replicationHandlePrimaryDisconnection(), the one place where the bad state is actually created:
 void replicationHandlePrimaryDisconnection(void) {
     if (server.repl_state == REPL_STATE_CONNECTED)
         moduleFireServerEvent(VALKEYMODULE_EVENT_PRIMARY_LINK_CHANGE,
                               VALKEYMODULE_SUBEVENT_PRIMARY_LINK_DOWN, NULL);

     server.primary = NULL;
-    server.repl_state = REPL_STATE_CONNECT;
+    /* freeClient(primary) can be deferred via freeClientAsync
+     * when the client has pending IO. By the time we run in that deferred
+     * context, replicationUnsetPrimary()/replicationSetPrimary() may have
+     * already finalized replication state. If primary_host is NULL, a
+     * deliberate unset is in progress (or complete), so don't resurrect
+     * REPL_STATE_CONNECT — that would make replicationCron call
+     * connectWithPrimary() with a NULL host. */
+    server.repl_state = server.primary_host ? REPL_STATE_CONNECT : REPL_STATE_NONE;
     server.repl_down_since = server.unixtime;

     /* Try to re-connect immediately rather than wait for replicationCron
      * waiting 1 second may risk backlog being recycled. */
     if (server.primary_host) {
         serverLog(LL_NOTICE, "Reconnecting to PRIMARY %s:%d",
                   server.primary_host, server.primary_port);
         connectWithPrimary();
     }
 }

Comment thread src/replication.c
@ranshid ranshid requested a review from madolson May 13, 2026 14:46
@yaronsananes yaronsananes force-pushed the fix-null-primary-host-crash-tls-io-threads branch from d452ec9 to 4eddf6d Compare May 14, 2026 07:13
@yaronsananes yaronsananes marked this pull request as draft May 14, 2026 07:14
@yaronsananes
Contributor Author

Thank you for the thorough review @ranshid.

You are right on both points:

  1. The root cause is indeed tied to the async free path introduced in Redesign IO threading communication model #3324. The earlier synchronous waitForClientIO prevented this race. I have updated the PR description to reflect this.

  2. I have adopted your suggested approach. The fix now targets replicationHandlePrimaryDisconnection() directly, conditioning the state transition on primary_host:

server.repl_state = server.primary_host ? REPL_STATE_CONNECT : REPL_STATE_NONE;

I removed the guards from connectWithPrimary() and replicationCron() since they are no longer needed. I kept the addr == NULL check in connTLSConnect() as a general defense-in-depth measure since passing NULL to inet_pton is always a crash regardless of the caller.

@yaronsananes yaronsananes force-pushed the fix-null-primary-host-crash-tls-io-threads branch from 4eddf6d to 543533c Compare May 14, 2026 07:35
@yaronsananes yaronsananes marked this pull request as ready for review May 14, 2026 07:36
When a replica executes `replicaof no one`, `replicationUnsetPrimary()`
sets `server.primary_host = NULL` before calling `freeClient(server.primary)`.
Since PR valkey-io#3324 ("Redesign IO threading communication model"), freeClient()
on a primary client with pending IO is deferred via freeClientAsync. When it
eventually executes, it chains through replicationCachePrimary() ->
replicationHandlePrimaryDisconnection() which unconditionally sets
`server.repl_state = REPL_STATE_CONNECT`.

By that time, replicationUnsetPrimary() has already finalized the state
with `repl_state = REPL_STATE_NONE`. The deferred free resurrects
REPL_STATE_CONNECT while primary_host is NULL. replicationCron then calls
connectWithPrimary() which passes NULL to connTLSConnect(), causing
inet_pton(AF_INET, NULL, ...) to SIGSEGV.

Fix by conditioning the state transition in replicationHandlePrimaryDisconnection()
on primary_host being set. If primary_host is NULL, the disconnection is
part of a deliberate unset that has already finalized state, so we set
REPL_STATE_NONE instead of REPL_STATE_CONNECT.

Additionally:
- Add a NULL check for addr in connTLSConnect() as defense in depth.
- Add a 10s timeout to the WAITAOF test that blocks indefinitely,
  preventing the test from hanging if the replica fails to sync.

Signed-off-by: Yaron Sananes <yaron.sananes@gmail.com>
@yaronsananes yaronsananes force-pushed the fix-null-primary-host-crash-tls-io-threads branch from 543533c to add1cf1 Compare May 14, 2026 07:49
@codecov

codecov Bot commented May 14, 2026

Codecov Report

❌ Patch coverage is 66.66667% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 76.68%. Comparing base (ca9dee3) to head (884e8a5).
⚠️ Report is 7 commits behind head on unstable.

Files with missing lines: src/replication.c (patch coverage 66.66%, 2 lines missing ⚠️)
Additional details and impacted files
@@             Coverage Diff              @@
##           unstable    #3695      +/-   ##
============================================
- Coverage     76.71%   76.68%   -0.03%     
============================================
  Files           162      162              
  Lines         80656    80665       +9     
============================================
- Hits          61872    61861      -11     
- Misses        18784    18804      +20     
Files with missing lines Coverage Δ
src/tls.c 17.64% <ø> (ø)
src/replication.c 86.29% <66.66%> (+0.05%) ⬆️

... and 23 files with indirect coverage changes


@ranshid
Member

ranshid commented May 14, 2026

I think there is still an issue.
After PR #3324 made freeClient(server.primary) async-deferrable via freeClientAsync (gated on clientHasPendingIO), the deferred replicationHandlePrimaryDisconnection() can run on a later main-loop tick and clobber replication state that was established after the async free was scheduled. This PR fixes the REPLICAOF NO ONE instance of this; an analogous instance still exists in REPLICAOF newhost newport and is not addressed.

The race

replicationSetPrimary() (replica being repointed at a new primary) runs in this order on the main thread:

  1. Clear primary_host.
  2. Call freeClient(server.primary) — with IO threads + Redesign IO threading communication model #3324, this returns immediately via freeClientAsync and schedules the actual free for next tick.
  3. Set primary_host = sdsnew(new_ip).
  4. Set repl_state = REPL_STATE_CONNECT.
  5. Call connectWithPrimary() — which sets repl_state = REPL_STATE_CONNECTING and assigns server.repl_transfer_s = <conn1>.
  6. Return.

Next tick, beforeSleep drains the async-free queue. The deferred freeClient(primary) finally chains through replicationCachePrimary() → replicationHandlePrimaryDisconnection(), which unconditionally writes:

server.repl_state = REPL_STATE_CONNECT;     // clobbers CONNECTING from step 5

Then on a subsequent replicationCron tick, repl_state == REPL_STATE_CONNECT triggers another connectWithPrimary() call, which does:

server.repl_transfer_s = connCreate(connTypeOfReplication());   // overwrites <conn1>

without closing the previous handle.

Comment thread src/replication.c Outdated
Signed-off-by: Ran Shidlansik <ranshid@amazon.com>
@ranshid
Member

ranshid commented May 14, 2026

> I think there is still an issue. After PR #3324 made freeClient(server.primary) async-deferrable via freeClientAsync (gated on clientHasPendingIO), the deferred replicationHandlePrimaryDisconnection() can run on a later main-loop tick and clobber replication state that was established after the async free was scheduled. This PR fixes the REPLICAOF NO ONE instance of this; an analogous instance still exists in REPLICAOF newhost newport and is not addressed. [full race description quoted from the previous comment]

I think we can do it in a separate PR.
