Deflake diskless timeout rdb pipe test by tolerating either timeout-disconnect branch#3772
Deflake diskless timeout rdb pipe test by tolerating either timeout-disconnect branch#3772Taeknology wants to merge 1 commit into
Conversation
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Repository UI Review profile: CHILL Plan: Pro Plus Run ID: 📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 WalkthroughWalkthroughThe PR updates the diskless RDB pipe test in ChangesTimeout disconnection log handling in diskless RDB pipe test
Estimated code review effort🎯 2 (Simple) | ⏱️ ~8 minutes Possibly related PRs
Suggested reviewers
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@tests/integration/replication.tcl`:
- Line 998: Update the failing test assertion so it provides a descriptive error
message: change the assert_equal that compares 1 with the result of
count_log_message(... "Disconnecting timedout replica") to include a clear
message explaining the expectation (e.g., that exactly one "Disconnecting
timedout replica" log entry should be present). Locate the assertion using the
assert_equal call and the count_log_message invocation and add the descriptive
string as the assertion's message argument.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro Plus
Run ID: 183ac616-8bc0-4e1b-a69e-330ce23c468a
📒 Files selected for processing (1)
tests/integration/replication.tcl
…isconnect branch The "diskless timeout replicas drop during rdb pipe" subcase in tests/integration/replication.tcl waits for a "(full sync)" timeout disconnect log. On test-macos-latest the disconnect can surface on the "(streaming sync)" branch instead, because the RDB child exits and clears server.rdb_child_type (src/rdb.c:3698) in the same serverCron tick as the disconnect loop (src/replication.c:5348), closing the (full sync) window before it can fire. The timed-out replica is by then already promoted via replicaPutOnline(), so it falls into the (streaming sync) branch (src/replication.c:5357-5364). Linux CI does not hit this because SIGSTOP back-pressure keeps the RDB child blocked on the pipe write long enough that backgroundSaveDoneHandler does not run in the same tick. Accept either timeout-disconnect message and assert exactly one such disconnect occurred, so the test still rejects a regression that emits zero or multiple timeouts. This follows the existing catch+fallback convention already used in the same test for "all" and "slow" subcases (replication.tcl:1011-1036). Also addresses CodeRabbit feedback by adding a descriptive failure message to the `assert_equal 1 [count_log_message ...]` guard so a regression that emits zero or multiple timeout-disconnect log lines surfaces with context instead of a bare `Expected '1' to be equal to '0'`. Fixes valkey-io#3686. Signed-off-by: Taeknology <20297177+Taeknology@users.noreply.github.com>
93eac07 to
95849c9
Compare
rainsupreme
left a comment
There was a problem hiding this comment.
Looks pretty solid 👍
I might like to see a workflow that ran this 100x to show it isn't flaky any more, but your in-depth analysis convinces me that this is understood and worth fixing
Fixes #3686.
Symptom
tests/integration/replication.tcl"diskless timeout replicas drop during rdb pipe" fails ontest-macos-latest:The master did fire a timeout disconnect, but on the streaming-sync branch instead of the full-sync one:
Root cause: same-tick cron ordering
serverCrondoes two things in the same tick that compete here:checkChildrenDone(src/server.c:1453) reaps the RDB child and callsbackgroundSaveDoneHandler, which clearsserver.rdb_child_type = RDB_CHILD_TYPE_NONEatsrc/rdb.c:3698.src/replication.c:5348runs.The
(full sync)branch (src/replication.c:5369-5377) is gated onrdb_child_type == RDB_CHILD_TYPE_SOCKET. Once step 1 clears it, the branch is unreachable for the rest of that tick. The SIGSTOP'd replica meanwhile was already promoted viareplicaPutOnline()(src/replication.c:2006-2031), so the disconnect lands in the(streaming sync)branch atsrc/replication.c:5357-5364instead.On Linux CI, SIGSTOP back-pressure keeps the RDB child blocked on
write()to the parent pipe long enough that step 1 does not run in the same tick as step 2. On macOS-latest CI the child drains faster (larger loopbackSO_SNDBUF, different scheduler quanta), so the two events collapse into one tick and the test's strict(full sync)wait never matches.Fix
Accept either timeout-disconnect message, then assert exactly one such disconnect occurred so the test still rejects a regression that emits zero or multiple timeouts:
Both branches represent a legitimate timeout-driven disconnect; the test's intent ("the slow replica is dropped because of
repl-timeout") is preserved. Thecount_log_messageassertion guards against silently widening to "any timeout anywhere".Prior art in this test
The same
catch { wait_for_log_messages ... }+ fallback pattern is already the established convention in this very test block to absorb RDB-pipe race variance — see the post-disconnect assertions forall_drop == "all"andall_drop == "slow"attests/integration/replication.tcl:1011-1036. The comment ("RDB finished before replicas were killed; accept either 1 or 2 replicas still up at pipe-read time.") describes the same kind of timing variance this PR addresses for the timeout subcase on the disconnect-emission side.Alternatives considered
pause_processon partial-write evidence: rejected. Oncerdb_child_typeis cleared in the same tick, no test-side synchronization can force the(full sync)path to fire.src/replication.c:5360,5373(state +rdb_child_typeat disconnect): deferred to a separate follow-up PR. Useful for future triage but independent review surface, and this PR is intentionally test-only.Verification
grep -n "Disconnecting timedout replica" src/replication.cconfirms both emitters intact (:5360streaming,:5373full).grep -rn "Disconnecting timedout replica (full sync)" tests/confirms no other test relies on the strict string.integration/replicationon macOS Apple Silicon: 111 passed, 0 failed.diskless * replicas drop during rdb pipesubcases (including timeout): 50/50 pass. The race is only reproducible on CI VM conditions, so local runs validate non-regression rather than fix efficacy.replication.tcl:988-1080(rdb_bgsave_in_progress, replicas_alive cleanup, dbsize parity, debug digest) are all branch-agnostic.