Deflake diskless timeout rdb pipe test by tolerating either timeout-disconnect branch by Taeknology · Pull Request #3772 · valkey-io/valkey

Taeknology · 2026-05-19T09:09:00Z

Symptom

tests/integration/replication.tcl "diskless timeout replicas drop during rdb pipe" fails on test-macos-latest:

log message of '"*Disconnecting timedout replica (full sync)*"' not found in
./tests/tmp/server.13581.2852/stdout after line: 128 till line: 151

The master did fire a timeout disconnect, but on the streaming-sync branch instead of the full-sync one:

01:58:34.346 * Streamed RDB transfer with replica 127.0.0.1:22549 succeeded (socket).
01:58:34.346 * Synchronization with replica 127.0.0.1:22549 succeeded
01:58:37.373 # Disconnecting timedout replica (streaming sync): 127.0.0.1:22548

Root cause: same-tick cron ordering

serverCron does two things in the same tick that compete here:

checkChildrenDone (src/server.c:1453) reaps the RDB child and calls backgroundSaveDoneHandler, which clears server.rdb_child_type = RDB_CHILD_TYPE_NONE at src/rdb.c:3698.
The disconnect loop at src/replication.c:5348 runs.

The (full sync) branch (src/replication.c:5369-5377) is gated on rdb_child_type == RDB_CHILD_TYPE_SOCKET. Once step 1 clears it, the branch is unreachable for the rest of that tick. The SIGSTOP'd replica meanwhile was already promoted via replicaPutOnline() (src/replication.c:2006-2031), so the disconnect lands in the (streaming sync) branch at src/replication.c:5357-5364 instead.

On Linux CI, SIGSTOP back-pressure keeps the RDB child blocked on write() to the parent pipe long enough that step 1 does not run in the same tick as step 2. On macOS-latest CI the child drains faster (larger loopback SO_SNDBUF, different scheduler quanta), so the two events collapse into one tick and the test's strict (full sync) wait never matches.

Fix

Accept either timeout-disconnect message, then assert exactly one such disconnect occurred so the test still rejects a regression that emits zero or multiple timeouts:

if {[catch {
    wait_for_log_messages -2 {"*Disconnecting timedout replica (full sync)*"} $loglines 100 100
}]} {
    wait_for_log_messages -2 {"*Disconnecting timedout replica (streaming sync)*"} $loglines 100 100
}
assert_equal 1 [count_log_message -2 "Disconnecting timedout replica"]

Both branches represent a legitimate timeout-driven disconnect; the test's intent ("the slow replica is dropped because of repl-timeout") is preserved. The count_log_message assertion guards against silently widening to "any timeout anywhere".

Prior art in this test

The same catch { wait_for_log_messages ... } + fallback pattern is already the established convention in this very test block to absorb RDB-pipe race variance — see the post-disconnect assertions for all_drop == "all" and all_drop == "slow" at tests/integration/replication.tcl:1011-1036. The comment ("RDB finished before replicas were killed; accept either 1 or 2 replicas still up at pipe-read time.") describes the same kind of timing variance this PR addresses for the timeout subcase on the disconnect-emission side.

Alternatives considered

Synchronize pause_process on partial-write evidence: rejected. Once rdb_child_type is cleared in the same tick, no test-side synchronization can force the (full sync) path to fire.
Diagnostic logging at src/replication.c:5360,5373 (state + rdb_child_type at disconnect): deferred to a separate follow-up PR. Useful for future triage but independent review surface, and this PR is intentionally test-only.

Verification

grep -n "Disconnecting timedout replica" src/replication.c confirms both emitters intact (:5360 streaming, :5373 full).
grep -rn "Disconnecting timedout replica (full sync)" tests/ confirms no other test relies on the strict string.
Full local run of integration/replication on macOS Apple Silicon: 111 passed, 0 failed.
10-loop run of the five diskless * replicas drop during rdb pipe subcases (including timeout): 50/50 pass. The race is only reproducible on CI VM conditions, so local runs validate non-regression rather than fix efficacy.
Post-disconnect assertions at replication.tcl:988-1080 (rdb_bgsave_in_progress, replicas_alive cleanup, dbsize parity, debug digest) are all branch-agnostic.

coderabbitai · 2026-05-19T09:09:15Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro Plus

Run ID: c8cc11b8-5976-433d-8c04-03e7bfa6a62f

📥 Commits

Reviewing files that changed from the base of the PR and between 93eac07 and 95849c9.

📒 Files selected for processing (1)

tests/integration/replication.tcl

🚧 Files skipped from review as they are similar to previous changes (1)

tests/integration/replication.tcl

📝 Walkthrough

Walkthrough

The PR updates the diskless RDB pipe test in tests/integration/replication.tcl to accept replication-timeout disconnect logs from either full-sync or streaming-sync paths, assert the timedout disconnect occurs exactly once, and restore repl-timeout after the check.

Changes

Timeout disconnection log handling in diskless RDB pipe test

Layer / File(s)	Summary
Timeout disconnection log handling in diskless RDB pipe test `tests/integration/replication.tcl`	The `all_drop == "timeout"` subcase replaces a fixed expectation for the full-sync disconnect log with a conditional wait that first attempts to match the full-sync message, falls back to the streaming-sync message if needed, asserts the "Disconnecting timedout replica" log entry occurs exactly once across both branches, and restores `repl-timeout` to 60 afterwards.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

Possibly related PRs

valkey-io/valkey#3703: Also adjusts repl-timeout in replication-related tests to handle replication-timeout disconnect behavior.

Suggested reviewers

ranshid
zuiderkwast
enjoy-binbin

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title accurately describes the main change: updating the test to tolerate either timeout-disconnect log branch instead of strictly expecting full-sync.
Description check	✅ Passed	The description is comprehensive and directly related to the changeset, explaining the symptom, root cause, fix, and verification for the test update.
Linked Issues check	✅ Passed	The PR fully addresses issue `#3686` by implementing the exact requirements: accepting either disconnect log message and asserting exactly one timeout-driven disconnect to catch regressions.
Out of Scope Changes check	✅ Passed	All changes are strictly within scope: only the replication.tcl test file is modified to fix the flaky timeout test case, with no unrelated changes.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tests/integration/replication.tcl`:
- Line 998: Update the failing test assertion so it provides a descriptive error
message: change the assert_equal that compares 1 with the result of
count_log_message(... "Disconnecting timedout replica") to include a clear
message explaining the expectation (e.g., that exactly one "Disconnecting
timedout replica" log entry should be present). Locate the assertion using the
assert_equal call and the count_log_message invocation and add the descriptive
string as the assertion's message argument.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 183ac616-8bc0-4e1b-a69e-330ce23c468a

📥 Commits

Reviewing files that changed from the base of the PR and between 8274099 and 93eac07.

📒 Files selected for processing (1)

tests/integration/replication.tcl

…isconnect branch The "diskless timeout replicas drop during rdb pipe" subcase in tests/integration/replication.tcl waits for a "(full sync)" timeout disconnect log. On test-macos-latest the disconnect can surface on the "(streaming sync)" branch instead, because the RDB child exits and clears server.rdb_child_type (src/rdb.c:3698) in the same serverCron tick as the disconnect loop (src/replication.c:5348), closing the (full sync) window before it can fire. The timed-out replica is by then already promoted via replicaPutOnline(), so it falls into the (streaming sync) branch (src/replication.c:5357-5364). Linux CI does not hit this because SIGSTOP back-pressure keeps the RDB child blocked on the pipe write long enough that backgroundSaveDoneHandler does not run in the same tick. Accept either timeout-disconnect message and assert exactly one such disconnect occurred, so the test still rejects a regression that emits zero or multiple timeouts. This follows the existing catch+fallback convention already used in the same test for "all" and "slow" subcases (replication.tcl:1011-1036). Also addresses CodeRabbit feedback by adding a descriptive failure message to the `assert_equal 1 [count_log_message ...]` guard so a regression that emits zero or multiple timeout-disconnect log lines surfaces with context instead of a bare `Expected '1' to be equal to '0'`. Fixes valkey-io#3686. Signed-off-by: Taeknology <20297177+Taeknology@users.noreply.github.com>

rainsupreme

Looks pretty solid 👍
I might like to see a workflow that ran this 100x to show it isn't flaky any more, but your in-depth analysis convinces me that this is understood and worth fixing

github-actions Bot assigned Taeknology May 19, 2026

coderabbitai Bot reviewed May 19, 2026

View reviewed changes

Comment thread tests/integration/replication.tcl Outdated

Taeknology force-pushed the fix/diskless-timeout-3686 branch from 93eac07 to 95849c9 Compare May 19, 2026 09:36

rainsupreme approved these changes May 21, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Deflake diskless timeout rdb pipe test by tolerating either timeout-disconnect branch#3772

Deflake diskless timeout rdb pipe test by tolerating either timeout-disconnect branch#3772
Taeknology wants to merge 1 commit into
valkey-io:unstablefrom
Taeknology:fix/diskless-timeout-3686

Taeknology commented May 19, 2026

Uh oh!

coderabbitai Bot commented May 19, 2026 •

edited

Loading

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Suggested reviewers

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

rainsupreme left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

Taeknology commented May 19, 2026

Symptom

Root cause: same-tick cron ordering

Fix

Prior art in this test

Alternatives considered

Verification

Uh oh!

coderabbitai Bot commented May 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Suggested reviewers

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

rainsupreme left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

coderabbitai Bot commented May 19, 2026 •

edited

Loading