
Fix crash in connectWithPrimary when primary_host is NULL with TLS (#3695)

Open
yaronsananes wants to merge 2 commits into valkey-io:unstable from yaronsananes:fix-null-primary-host-crash-tls-io-threads

Conversation

@yaronsananes
Contributor

@yaronsananes yaronsananes commented May 13, 2026

Summary

This fix addresses a SIGSEGV crash in connectWithPrimary() that occurs when TLS and IO threads are both enabled. The crash was identified while triaging the recurring test-ubuntu-tls-io-threads daily CI failure in tests/unit/wait.tcl (test: "WAITAOF master without backlog, wait is released when the replica finishes full-sync").

Root Cause

Since PR #3324 ("Redesign IO threading communication model"), freeClient() on a primary client with pending IO is deferred via freeClientAsync (gated on clientHasPendingIO(c)). This replaced the earlier synchronous waitForClientIO.

When a replica executes REPLICAOF NO ONE, replicationUnsetPrimary() performs:

  1. Sets server.primary_host = NULL
  2. Calls freeClient(server.primary) - which may be deferred

When the deferred free eventually executes, it chains through replicationCachePrimary() -> replicationHandlePrimaryDisconnection() which unconditionally sets server.repl_state = REPL_STATE_CONNECT. By this time, replicationUnsetPrimary() has already finalized repl_state = REPL_STATE_NONE. The deferred free resurrects REPL_STATE_CONNECT while primary_host remains NULL.

replicationCron then observes repl_state == REPL_STATE_CONNECT and calls connectWithPrimary(), passing NULL to connTLSConnect() where inet_pton(AF_INET, NULL, ...) causes a SIGSEGV.

CI crash evidence (from daily run on 2026-05-10):

10981:M * Connecting to PRIMARY (null):28117
10981:M # valkey 255.255.255 crashed by signal: 11, si_code: 1
10981:M # Accessing address: (nil)

Reproduction

Reproduced locally by establishing TLS replication between master and replica, then executing REPLICAOF NO ONE which triggers the deferred free path calling connectWithPrimary() with primary_host == NULL:

  • Without fix: server crashes with signal 11, accessing address 0x0
  • With fix: server continues operating normally

Fix

The fix targets the single location where the invalid state is created:

  1. src/replication.c - replicationHandlePrimaryDisconnection(): Condition the repl_state transition on primary_host being set. If primary_host is NULL, set REPL_STATE_NONE instead of REPL_STATE_CONNECT, since the disconnection is part of a deliberate unset that has already finalized state.
  2. src/tls.c - connTLSConnect(): Return C_ERR if addr is NULL (defense in depth).
  3. tests/unit/wait.tcl: Change waitaof 0 1 0 (infinite timeout) to waitaof 0 1 10000 (10 second timeout) to prevent the test from hanging indefinitely when the replica cannot complete sync.

Testing

  • Full unit/wait test suite passes (40/40) with IO threads enabled
  • Crash reproduced locally over TLS (SIGSEGV without fix, graceful handling with fix)

@coderabbitai

coderabbitai Bot commented May 13, 2026

No actionable comments were generated in the recent review. 🎉


📥 Commits

Reviewing files that changed from the base of the PR and between add1cf1 and 884e8a5.

📒 Files selected for processing (1)
  • src/replication.c
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/replication.c

📝 Walkthrough

Replication avoids reconnect attempts when primary_host is unset; the TLS connection path returns an error if addr is NULL; a unit test replaces an infinite waitaof timeout with a 10000 ms one.

Changes

  • src/replication.c (primary host null guard): replicationHandlePrimaryDisconnection() sets server.repl_state to REPL_STATE_CONNECT only if server.primary_host remains set; replicationCron() checks server.primary_host before calling connectWithPrimary() and resets state to REPL_STATE_NONE if it's NULL.
  • tests/unit/wait.tcl (WAITAOF test timeout adjustment): the deferring client call changed from $rd waitaof 0 1 0 to $rd waitaof 0 1 10000, replacing the infinite timeout with a 10000 ms one.
  • src/tls.c (connection address null check): connTLSConnect() returns C_ERR immediately when addr is NULL, preventing a null pointer dereference in downstream address parsing and SNI logic.


🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️: coverage is 66.67%, below the required 80.00% threshold. Resolution: write docstrings for the functions missing them.

✅ Passed checks (4 passed)

  • Title check: the title clearly and specifically describes the main crash being fixed (a NULL pointer dereference in connectWithPrimary when TLS is enabled), which matches the primary changes in src/replication.c and src/tls.c.
  • Description check: the description provides a comprehensive explanation directly related to the changeset, including root cause analysis, reproduction steps, the specific fix, and testing results that align with the code changes.
  • Linked Issues check: skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check: skipped because no linked issues were found for this pull request.



@yaronsananes yaronsananes marked this pull request as ready for review May 13, 2026 11:01
Member

@ranshid ranshid left a comment


Thank you @yaronsananes. I think some clarifications to the root cause are needed:

  1. The issue is only reproducible on branches that contain PR #3324 ("Redesign IO threading communication model"): unstable and 9.1. On earlier branches, freeClient() on a primary client with pending IO synchronized via waitForClientIO. In #3324, however, we replaced the synchronous waitForClientIO(c) with an async-free escape hatch gated on clientHasPendingIO(c), which is what allows replicationHandlePrimaryDisconnection() to run on a later main-loop iteration, after replicationUnsetPrimary() has already finalized repl_state = REPL_STATE_NONE.

  2. The PR currently adds three defensive NULL-checks (in connectWithPrimary, in replicationCron, and in connTLSConnect) plus a state-reset fallback. A smaller, more targeted fix is to tighten the invariant directly in replicationHandlePrimaryDisconnection(), the one place where the bad state is actually created:
 void replicationHandlePrimaryDisconnection(void) {
     if (server.repl_state == REPL_STATE_CONNECTED)
         moduleFireServerEvent(VALKEYMODULE_EVENT_PRIMARY_LINK_CHANGE,
                               VALKEYMODULE_SUBEVENT_PRIMARY_LINK_DOWN, NULL);

     server.primary = NULL;
-    server.repl_state = REPL_STATE_CONNECT;
+    /* freeClient(primary) can be deferred via freeClientAsync
+     * when the client has pending IO. By the time we run in that deferred
+     * context, replicationUnsetPrimary()/replicationSetPrimary() may have
+     * already finalized replication state. If primary_host is NULL, a
+     * deliberate unset is in progress (or complete), so don't resurrect
+     * REPL_STATE_CONNECT — that would make replicationCron call
+     * connectWithPrimary() with a NULL host. */
+    server.repl_state = server.primary_host ? REPL_STATE_CONNECT : REPL_STATE_NONE;
     server.repl_down_since = server.unixtime;

     /* Try to re-connect immediately rather than wait for replicationCron
      * waiting 1 second may risk backlog being recycled. */
     if (server.primary_host) {
         serverLog(LL_NOTICE, "Reconnecting to PRIMARY %s:%d",
                   server.primary_host, server.primary_port);
         connectWithPrimary();
     }
 }

Comment thread src/replication.c
@ranshid ranshid requested a review from madolson May 13, 2026 14:46
@yaronsananes yaronsananes force-pushed the fix-null-primary-host-crash-tls-io-threads branch from d452ec9 to 4eddf6d Compare May 14, 2026 07:13
@yaronsananes yaronsananes marked this pull request as draft May 14, 2026 07:14
@yaronsananes
Contributor Author

Thank you for the thorough review @ranshid.

You are right on both points:

  1. The root cause is indeed tied to the async free path introduced in Redesign IO threading communication model #3324. The earlier synchronous waitForClientIO prevented this race. I have updated the PR description to reflect this.

  2. I have adopted your suggested approach. The fix now targets replicationHandlePrimaryDisconnection() directly, conditioning the state transition on primary_host:

server.repl_state = server.primary_host ? REPL_STATE_CONNECT : REPL_STATE_NONE;

I removed the guards from connectWithPrimary() and replicationCron() since they are no longer needed. I kept the addr == NULL check in connTLSConnect() as a general defense-in-depth measure since passing NULL to inet_pton is always a crash regardless of the caller.

@yaronsananes yaronsananes force-pushed the fix-null-primary-host-crash-tls-io-threads branch from 4eddf6d to 543533c Compare May 14, 2026 07:35
@yaronsananes yaronsananes marked this pull request as ready for review May 14, 2026 07:36
When a replica executes `replicaof no one`, `replicationUnsetPrimary()`
sets `server.primary_host = NULL` before calling `freeClient(server.primary)`.
Since PR valkey-io#3324 ("Redesign IO threading communication model"), freeClient()
on a primary client with pending IO is deferred via freeClientAsync. When it
eventually executes, it chains through replicationCachePrimary() ->
replicationHandlePrimaryDisconnection() which unconditionally sets
`server.repl_state = REPL_STATE_CONNECT`.

By that time, replicationUnsetPrimary() has already finalized the state
with `repl_state = REPL_STATE_NONE`. The deferred free resurrects
REPL_STATE_CONNECT while primary_host is NULL. replicationCron then calls
connectWithPrimary() which passes NULL to connTLSConnect(), causing
inet_pton(AF_INET, NULL, ...) to SIGSEGV.

Fix by conditioning the state transition in replicationHandlePrimaryDisconnection()
on primary_host being set. If primary_host is NULL, the disconnection is
part of a deliberate unset that has already finalized state, so we set
REPL_STATE_NONE instead of REPL_STATE_CONNECT.

Additionally:
- Add a NULL check for addr in connTLSConnect() as defense in depth.
- Add a 10s timeout to the WAITAOF test that blocks indefinitely,
  preventing the test from hanging if the replica fails to sync.

Signed-off-by: Yaron Sananes <yaron.sananes@gmail.com>
@yaronsananes yaronsananes force-pushed the fix-null-primary-host-crash-tls-io-threads branch from 543533c to add1cf1 Compare May 14, 2026 07:49
@codecov

codecov Bot commented May 14, 2026

Codecov Report

❌ Patch coverage is 66.66667% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 76.68%. Comparing base (ca9dee3) to head (884e8a5).
⚠️ Report is 7 commits behind head on unstable.

Files with missing lines: src/replication.c (patch coverage 66.66%, 2 lines missing ⚠️)
Additional details and impacted files
@@             Coverage Diff              @@
##           unstable    #3695      +/-   ##
============================================
- Coverage     76.71%   76.68%   -0.03%     
============================================
  Files           162      162              
  Lines         80656    80665       +9     
============================================
- Hits          61872    61861      -11     
- Misses        18784    18804      +20     
Files with missing lines Coverage Δ
src/tls.c 17.64% <ø> (ø)
src/replication.c 86.29% <66.66%> (+0.05%) ⬆️

... and 23 files with indirect coverage changes


@ranshid
Member

ranshid commented May 14, 2026

I think there is still an issue.
After PR #3324 made freeClient(server.primary) async-deferrable via freeClientAsync (gated on clientHasPendingIO), the deferred replicationHandlePrimaryDisconnection() can run on a later main-loop tick and clobber replication state that was established after the async free was scheduled. This PR fixes the REPLICAOF NO ONE instance of this; an analogous instance still exists in REPLICAOF newhost newport and is not addressed.

The race

replicationSetPrimary() (replica being repointed at a new primary) runs in this order on the main thread:

  1. Clear primary_host.
  2. Call freeClient(server.primary) — with IO threads + Redesign IO threading communication model #3324, this returns immediately via freeClientAsync and schedules the actual free for next tick.
  3. Set primary_host = sdsnew(new_ip).
  4. Set repl_state = REPL_STATE_CONNECT.
  5. Call connectWithPrimary() — which sets repl_state = REPL_STATE_CONNECTING and assigns server.repl_transfer_s = <conn1>.
  6. Return.

Next tick, beforeSleep drains the async-free queue. The deferred freeClient(primary) finally chains through replicationCachePrimary() → replicationHandlePrimaryDisconnection(), which unconditionally writes:

server.repl_state = REPL_STATE_CONNECT;     // clobbers CONNECTING from step 5

Then on a subsequent replicationCron tick, repl_state == REPL_STATE_CONNECT triggers another connectWithPrimary() call, which does:

server.repl_transfer_s = connCreate(connTypeOfReplication());   // overwrites <conn1>

without closing the previous handle.

Comment thread src/replication.c Outdated
Signed-off-by: Ran Shidlansik <ranshid@amazon.com>
@ranshid
Member

ranshid commented May 14, 2026

> I think there is still an issue. After PR #3324 made freeClient(server.primary) async-deferrable via freeClientAsync (gated on clientHasPendingIO), the deferred replicationHandlePrimaryDisconnection() can run on a later main-loop tick and clobber replication state that was established after the async free was scheduled. This PR fixes the REPLICAOF NO ONE instance of this; an analogous instance still exists in REPLICAOF newhost newport and is not addressed. [full race description quoted from the previous comment]

I think we can do it in a separate PR.
