Skip to content

Deflake diskless timeout rdb pipe test by tolerating either timeout-disconnect branch#3772

Open
Taeknology wants to merge 1 commit into
valkey-io:unstablefrom
Taeknology:fix/diskless-timeout-3686
Open

Deflake diskless timeout rdb pipe test by tolerating either timeout-disconnect branch#3772
Taeknology wants to merge 1 commit into
valkey-io:unstablefrom
Taeknology:fix/diskless-timeout-3686

Conversation

@Taeknology
Copy link
Copy Markdown

Fixes #3686.

Symptom

tests/integration/replication.tcl "diskless timeout replicas drop during rdb pipe" fails on test-macos-latest:

log message of '"*Disconnecting timedout replica (full sync)*"' not found in
./tests/tmp/server.13581.2852/stdout after line: 128 till line: 151

The master did fire a timeout disconnect, but on the streaming-sync branch instead of the full-sync one:

01:58:34.346 * Streamed RDB transfer with replica 127.0.0.1:22549 succeeded (socket).
01:58:34.346 * Synchronization with replica 127.0.0.1:22549 succeeded
01:58:37.373 # Disconnecting timedout replica (streaming sync): 127.0.0.1:22548

Root cause: same-tick cron ordering

serverCron does two things in the same tick that compete here:

  1. checkChildrenDone (src/server.c:1453) reaps the RDB child and calls backgroundSaveDoneHandler, which clears server.rdb_child_type = RDB_CHILD_TYPE_NONE at src/rdb.c:3698.
  2. The disconnect loop at src/replication.c:5348 runs.

The (full sync) branch (src/replication.c:5369-5377) is gated on rdb_child_type == RDB_CHILD_TYPE_SOCKET. Once step 1 clears it, the branch is unreachable for the rest of that tick. The SIGSTOP'd replica meanwhile was already promoted via replicaPutOnline() (src/replication.c:2006-2031), so the disconnect lands in the (streaming sync) branch at src/replication.c:5357-5364 instead.

On Linux CI, SIGSTOP back-pressure keeps the RDB child blocked on write() to the parent pipe long enough that step 1 does not run in the same tick as step 2. On macOS-latest CI the child drains faster (larger loopback SO_SNDBUF, different scheduler quanta), so the two events collapse into one tick and the test's strict (full sync) wait never matches.

Fix

Accept either timeout-disconnect message, then assert exactly one such disconnect occurred so the test still rejects a regression that emits zero or multiple timeouts:

if {[catch {
    wait_for_log_messages -2 {"*Disconnecting timedout replica (full sync)*"} $loglines 100 100
}]} {
    wait_for_log_messages -2 {"*Disconnecting timedout replica (streaming sync)*"} $loglines 100 100
}
assert_equal 1 [count_log_message -2 "Disconnecting timedout replica"]

Both branches represent a legitimate timeout-driven disconnect; the test's intent ("the slow replica is dropped because of repl-timeout") is preserved. The count_log_message assertion guards against silently widening to "any timeout anywhere".

Prior art in this test

The same catch { wait_for_log_messages ... } + fallback pattern is already the established convention in this very test block to absorb RDB-pipe race variance — see the post-disconnect assertions for all_drop == "all" and all_drop == "slow" at tests/integration/replication.tcl:1011-1036. The comment ("RDB finished before replicas were killed; accept either 1 or 2 replicas still up at pipe-read time.") describes the same kind of timing variance this PR addresses for the timeout subcase on the disconnect-emission side.

Alternatives considered

  • Synchronize pause_process on partial-write evidence: rejected. Once rdb_child_type is cleared in the same tick, no test-side synchronization can force the (full sync) path to fire.
  • Diagnostic logging at src/replication.c:5360,5373 (state + rdb_child_type at disconnect): deferred to a separate follow-up PR. Useful for future triage but independent review surface, and this PR is intentionally test-only.

Verification

  • grep -n "Disconnecting timedout replica" src/replication.c confirms both emitters intact (:5360 streaming, :5373 full).
  • grep -rn "Disconnecting timedout replica (full sync)" tests/ confirms no other test relies on the strict string.
  • Full local run of integration/replication on macOS Apple Silicon: 111 passed, 0 failed.
  • 10-loop run of the five diskless * replicas drop during rdb pipe subcases (including timeout): 50/50 pass. The race is only reproducible on CI VM conditions, so local runs validate non-regression rather than fix efficacy.
  • Post-disconnect assertions at replication.tcl:988-1080 (rdb_bgsave_in_progress, replicas_alive cleanup, dbsize parity, debug digest) are all branch-agnostic.

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 19, 2026

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro Plus

Run ID: c8cc11b8-5976-433d-8c04-03e7bfa6a62f

📥 Commits

Reviewing files that changed from the base of the PR and between 93eac07 and 95849c9.

📒 Files selected for processing (1)
  • tests/integration/replication.tcl
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/integration/replication.tcl

📝 Walkthrough

Walkthrough

The PR updates the diskless RDB pipe test in tests/integration/replication.tcl to accept replication-timeout disconnect logs from either full-sync or streaming-sync paths, assert the timedout disconnect occurs exactly once, and restore repl-timeout after the check.

Changes

Timeout disconnection log handling in diskless RDB pipe test

Layer / File(s) Summary
Timeout disconnection log handling in diskless RDB pipe test
tests/integration/replication.tcl
The all_drop == "timeout" subcase replaces a fixed expectation for the full-sync disconnect log with a conditional wait that first attempts to match the full-sync message, falls back to the streaming-sync message if needed, asserts the "Disconnecting timedout replica" log entry occurs exactly once across both branches, and restores repl-timeout to 60 afterwards.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

Possibly related PRs

  • valkey-io/valkey#3703: Also adjusts repl-timeout in replication-related tests to handle replication-timeout disconnect behavior.

Suggested reviewers

  • ranshid
  • zuiderkwast
  • enjoy-binbin
🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately describes the main change: updating the test to tolerate either timeout-disconnect log branch instead of strictly expecting full-sync.
Description check ✅ Passed The description is comprehensive and directly related to the changeset, explaining the symptom, root cause, fix, and verification for the test update.
Linked Issues check ✅ Passed The PR fully addresses issue #3686 by implementing the exact requirements: accepting either disconnect log message and asserting exactly one timeout-driven disconnect to catch regressions.
Out of Scope Changes check ✅ Passed All changes are strictly within scope: only the replication.tcl test file is modified to fix the flaky timeout test case, with no unrelated changes.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tests/integration/replication.tcl`:
- Line 998: Update the failing test assertion so it provides a descriptive error
message: change the assert_equal that compares 1 with the result of
count_log_message(... "Disconnecting timedout replica") to include a clear
message explaining the expectation (e.g., that exactly one "Disconnecting
timedout replica" log entry should be present). Locate the assertion using the
assert_equal call and the count_log_message invocation and add the descriptive
string as the assertion's message argument.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 183ac616-8bc0-4e1b-a69e-330ce23c468a

📥 Commits

Reviewing files that changed from the base of the PR and between 8274099 and 93eac07.

📒 Files selected for processing (1)
  • tests/integration/replication.tcl

Comment thread tests/integration/replication.tcl Outdated
…isconnect branch

The "diskless timeout replicas drop during rdb pipe" subcase in
tests/integration/replication.tcl waits for a "(full sync)" timeout
disconnect log. On test-macos-latest the disconnect can surface on the
"(streaming sync)" branch instead, because the RDB child exits and
clears server.rdb_child_type (src/rdb.c:3698) in the same serverCron
tick as the disconnect loop (src/replication.c:5348), closing the
(full sync) window before it can fire. The timed-out replica is by then
already promoted via replicaPutOnline(), so it falls into the
(streaming sync) branch (src/replication.c:5357-5364). Linux CI does
not hit this because SIGSTOP back-pressure keeps the RDB child blocked
on the pipe write long enough that backgroundSaveDoneHandler does not
run in the same tick.

Accept either timeout-disconnect message and assert exactly one such
disconnect occurred, so the test still rejects a regression that emits
zero or multiple timeouts. This follows the existing catch+fallback
convention already used in the same test for "all" and "slow"
subcases (replication.tcl:1011-1036).

Also addresses CodeRabbit feedback by adding a descriptive failure
message to the `assert_equal 1 [count_log_message ...]` guard so a
regression that emits zero or multiple timeout-disconnect log lines
surfaces with context instead of a bare `Expected '1' to be equal
to '0'`.

Fixes valkey-io#3686.

Signed-off-by: Taeknology <20297177+Taeknology@users.noreply.github.com>
@Taeknology Taeknology force-pushed the fix/diskless-timeout-3686 branch from 93eac07 to 95849c9 Compare May 19, 2026 09:36
Copy link
Copy Markdown
Contributor

@rainsupreme rainsupreme left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks pretty solid 👍
I might like to see a workflow that ran this 100x to show it isn't flaky any more, but your in-depth analysis convinces me that this is understood and worth fixing

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[TEST-FAILURE] diskless timeout replicas drop during rdb pipe in tests/integration/replication.tcl

2 participants