Fix crash in connectWithPrimary when primary_host is NULL with TLS#3695
yaronsananes wants to merge 2 commits
Conversation
Walkthrough (CodeRabbit): Replication avoids reconnect attempts when primary_host is unset; the TLS connection path returns an error if addr is NULL; a unit test increases a waitaof timeout to 10000 ms.

Changes: replication connectivity null pointer guards; TLS connection address guard.
ranshid left a comment
Thank you @yaronsananes. I think some clarifications to the root cause are needed:
- I think the issue is only reproducible on branches that contain PR #3324 ("Redesign IO threading communication model"): unstable and 9.1. On earlier branches, freeClient() on a primary client with pending IO was synchronized via waitForClientIO. However, in #3324 we replaced the synchronous waitForClientIO(c) with an async-free escape hatch gated on clientHasPendingIO(c), which is what allows replicationHandlePrimaryDisconnection() to run on a later main-loop iteration, after replicationUnsetPrimary() has already finalized repl_state = REPL_STATE_NONE.
- The PR currently adds three defensive NULL-checks (in connectWithPrimary, in replicationCron, and in connTLSConnect) plus a state-reset fallback. A smaller, more targeted fix is to tighten the invariant directly in replicationHandlePrimaryDisconnection() — the one place where the bad state is actually created:
```diff
 void replicationHandlePrimaryDisconnection(void) {
     if (server.repl_state == REPL_STATE_CONNECTED)
         moduleFireServerEvent(VALKEYMODULE_EVENT_PRIMARY_LINK_CHANGE,
                               VALKEYMODULE_SUBEVENT_PRIMARY_LINK_DOWN, NULL);
     server.primary = NULL;
-    server.repl_state = REPL_STATE_CONNECT;
+    /* freeClient(primary) can be deferred via freeClientAsync
+     * when the client has pending IO. By the time we run in that deferred
+     * context, replicationUnsetPrimary()/replicationSetPrimary() may have
+     * already finalized replication state. If primary_host is NULL, a
+     * deliberate unset is in progress (or complete), so don't resurrect
+     * REPL_STATE_CONNECT — that would make replicationCron call
+     * connectWithPrimary() with a NULL host. */
+    server.repl_state = server.primary_host ? REPL_STATE_CONNECT : REPL_STATE_NONE;
     server.repl_down_since = server.unixtime;
     /* Try to re-connect immediately rather than wait for replicationCron;
      * waiting 1 second may risk backlog being recycled. */
     if (server.primary_host) {
         serverLog(LL_NOTICE, "Reconnecting to PRIMARY %s:%d",
                   server.primary_host, server.primary_port);
         connectWithPrimary();
     }
 }
```
Force-pushed d452ec9 to 4eddf6d.
Thank you for the thorough review @ranshid. You are right on both points. I applied the suggested change:

`server.repl_state = server.primary_host ? REPL_STATE_CONNECT : REPL_STATE_NONE;`

and removed the defensive guards from `connectWithPrimary()` and `replicationCron()`.
Force-pushed 4eddf6d to 543533c.
When a replica executes `replicaof no one`, `replicationUnsetPrimary()` sets `server.primary_host = NULL` before calling `freeClient(server.primary)`. Since PR valkey-io#3324 ("Redesign IO threading communication model"), `freeClient()` on a primary client with pending IO is deferred via `freeClientAsync`. When it eventually executes, it chains through `replicationCachePrimary()` -> `replicationHandlePrimaryDisconnection()`, which unconditionally sets `server.repl_state = REPL_STATE_CONNECT`. By that time, `replicationUnsetPrimary()` has already finalized the state with `repl_state = REPL_STATE_NONE`. The deferred free resurrects `REPL_STATE_CONNECT` while `primary_host` is NULL. `replicationCron` then calls `connectWithPrimary()`, which passes NULL to `connTLSConnect()`, causing `inet_pton(AF_INET, NULL, ...)` to SIGSEGV.

Fix by conditioning the state transition in `replicationHandlePrimaryDisconnection()` on `primary_host` being set. If `primary_host` is NULL, the disconnection is part of a deliberate unset that has already finalized state, so we set `REPL_STATE_NONE` instead of `REPL_STATE_CONNECT`.

Additionally:
- Add a NULL check for `addr` in `connTLSConnect()` as defense in depth.
- Add a 10s timeout to the WAITAOF test that blocks indefinitely, preventing the test from hanging if the replica fails to sync.

Signed-off-by: Yaron Sananes <yaron.sananes@gmail.com>
Force-pushed 543533c to add1cf1.
Codecov Report: ❌ Patch coverage is

Additional details and impacted files:

```
@@             Coverage Diff              @@
##           unstable    #3695      +/-   ##
============================================
- Coverage     76.71%   76.68%   -0.03%
============================================
  Files           162      162
  Lines         80656    80665       +9
============================================
- Hits          61872    61861      -11
- Misses        18784    18804      +20
============================================
```
I think there is still an issue. The race: `replicationSetPrimary()` (replica being repointed at a new primary) runs in this order on the main thread:

Next tick, `beforeSleep` drains the async-free queue. The deferred `freeClient(primary)` finally chains through `replicationCachePrimary()` → `replicationHandlePrimaryDisconnection()`, which unconditionally writes:

Then on a subsequent `replicationCron` tick, `repl_state == REPL_STATE_CONNECT` triggers another `connectWithPrimary()` call, which does:

without closing the previous handle.
Signed-off-by: Ran Shidlansik <ranshid@amazon.com>
I think we can do it in a separate PR.
Summary
This fix addresses a SIGSEGV crash in `connectWithPrimary()` that occurs when TLS and IO threads are both enabled. The crash was identified while triaging the recurring `test-ubuntu-tls-io-threads` daily CI failure in `tests/unit/wait.tcl` (test: "WAITAOF master without backlog, wait is released when the replica finishes full-sync").

Root Cause

Since PR #3324 ("Redesign IO threading communication model"), `freeClient()` on a primary client with pending IO is deferred via `freeClientAsync` (gated on `clientHasPendingIO(c)`). This replaced the earlier synchronous `waitForClientIO`.

When a replica executes `REPLICAOF NO ONE`, `replicationUnsetPrimary()` performs:
1. `server.primary_host = NULL`
2. `freeClient(server.primary)` - which may be deferred

When the deferred free eventually executes, it chains through `replicationCachePrimary()` -> `replicationHandlePrimaryDisconnection()`, which unconditionally sets `server.repl_state = REPL_STATE_CONNECT`. By this time, `replicationUnsetPrimary()` has already finalized `repl_state = REPL_STATE_NONE`. The deferred free resurrects `REPL_STATE_CONNECT` while `primary_host` remains NULL. `replicationCron` then observes `repl_state == REPL_STATE_CONNECT` and calls `connectWithPrimary()`, passing NULL to `connTLSConnect()`, where `inet_pton(AF_INET, NULL, ...)` causes a SIGSEGV.

CI crash evidence (from daily run on 2026-05-10):
Reproduction
Reproduced locally by establishing TLS replication between master and replica, then executing `REPLICAOF NO ONE`, which triggers the deferred free path calling `connectWithPrimary()` with `primary_host == NULL`:

Fix
The fix targets the single location where the invalid state is created:

- `src/replication.c` - `replicationHandlePrimaryDisconnection()`: Condition the `repl_state` transition on `primary_host` being set. If `primary_host` is NULL, set `REPL_STATE_NONE` instead of `REPL_STATE_CONNECT`, since the disconnection is part of a deliberate unset that has already finalized state.
- `src/tls.c` - `connTLSConnect()`: Return `C_ERR` if `addr` is NULL (defense in depth).
- `tests/unit/wait.tcl`: Change `waitaof 0 1 0` (infinite timeout) to `waitaof 0 1 10000` (10 second timeout) to prevent the test from hanging indefinitely when the replica cannot complete sync.

Testing
`unit/wait` test suite passes (40/40) with IO threads enabled.