Skip to content

Set repl-timeout for slotmigrations tests to prevent disconnections#3703

Merged
enjoy-binbin merged 1 commit into
valkey-io:unstablefrom
enjoy-binbin:fixtest
May 15, 2026
Merged

Set repl-timeout for slotmigrations tests to prevent disconnections#3703
enjoy-binbin merged 1 commit into
valkey-io:unstablefrom
enjoy-binbin:fixtest

Conversation

@enjoy-binbin
Copy link
Copy Markdown
Member

This test will insert keys, and as can be seen from the logs, the insertion
somehow is very slow in this CI, eventually causing a repl-timeout disconnection.

50655:M 14 May 2026 00:44:57.812 - DB 0: 5 keys (0 volatile) in 7 slots HT.
50655:M 14 May 2026 00:45:02.884 - DB 0: 6 keys (0 volatile) in 7 slots HT.
50655:M 14 May 2026 00:45:06.014 # Timing out slot migration xxx after not receiving ack for too long

Set repl-timeout for these tests to prevent the timeout disconnections.
Also the tests does not actually require inserting distinct keys, we only
need to fill the replication buffer, so there is no need for the different
keyname, this can save the CI some memory.

Closes #3702.

This test will insert keys, and as can be seen from the logs, the insertion
somehow is very slow in this CI, eventually causing a repl-timeout disconnection.
```
50655:M 14 May 2026 00:44:57.812 - DB 0: 5 keys (0 volatile) in 7 slots HT.
50655:M 14 May 2026 00:45:02.884 - DB 0: 6 keys (0 volatile) in 7 slots HT.
50655:M 14 May 2026 00:45:06.014 # Timing out slot migration xxx after not receiving ack for too long
```

Set repl-timeout for these tests to prevent the timeout disconnections.
Also the tests does not actually require inserting distinct keys, we only
need to fill the replication buffer, so there is no need for the different
keyname, this can save the CI some memory.

Closes valkey-io#3702.

Signed-off-by: Binbin <binloveplay1314@qq.com>
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 14, 2026

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro Plus

Run ID: e05fe207-a6af-47c1-8691-07d2abd1c281

📥 Commits

Reviewing files that changed from the base of the PR and between fdf13ca and 2204760.

📒 Files selected for processing (1)
  • tests/unit/cluster/cluster-migrateslots.tcl

📝 Walkthrough

Walkthrough

Two test cases in the cluster migration test suite are updated: each receives an increased replication timeout override and a modification to how test data is populated. The data generation now writes the same key repeatedly instead of creating unique keys per loop iteration.

Changes

Cluster migration test reliability adjustments

Layer / File(s) Summary
AOF/consistency test replication timeout and data population
tests/unit/cluster/cluster-migrateslots.tcl
Adds repl-timeout 3600 override to the start_cluster call in the AOF/consistency test section and changes the data population loop to write to a single fixed key "$16383_slot_tag:key" instead of generating unique keys with :$j suffix.
Slot migration test replication timeout and data population
tests/unit/cluster/cluster-migrateslots.tcl
Adds repl-timeout 3600 override to the start_cluster call in the slot-migration-max-failover-repl-bytes test section and changes the data population loop to write to a single fixed key "$16383_slot_tag:key" instead of generating unique keys with :$j suffix.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

Possibly related issues

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and specifically describes the main change: setting repl-timeout for slot migration tests to prevent disconnections.
Description check ✅ Passed The description is clearly related to the changeset, explaining the slow key insertion causing repl-timeout issues and the solution implemented.
Linked Issues check ✅ Passed The PR adequately addresses issue #3702 by implementing repl-timeout override (3600 seconds) to prevent replication timeout disconnections that were causing test failures.
Out of Scope Changes check ✅ Passed All changes are scoped to addressing the repl-timeout issue: adding repl-timeout override and simplifying key generation to reduce memory usage, both directly addressing the test failure.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

@codecov
Copy link
Copy Markdown

codecov Bot commented May 14, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 76.69%. Comparing base (ca9dee3) to head (2204760).
⚠️ Report is 7 commits behind head on unstable.

Additional details and impacted files
@@             Coverage Diff              @@
##           unstable    #3703      +/-   ##
============================================
- Coverage     76.71%   76.69%   -0.03%     
============================================
  Files           162      162              
  Lines         80656    80662       +6     
============================================
- Hits          61872    61860      -12     
- Misses        18784    18802      +18     

see 21 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Copy link
Copy Markdown
Contributor

@rainsupreme rainsupreme left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The change looks good to me. Did you verify that this fixes the flakiness? It's weird that it would be so slow in the CI workflow 🤔

Copy link
Copy Markdown
Contributor

@zuiderkwast zuiderkwast left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, if you saw this failure...

Replicating 1MB key takes more than one minute?

@zuiderkwast zuiderkwast self-requested a review May 14, 2026 11:50
Copy link
Copy Markdown
Contributor

@zuiderkwast zuiderkwast left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I see the ASM code uses repl-timeout for this:

            if (last_interaction &&
                (server.unixtime - last_interaction > server.repl_timeout)) {
                serverLog(LL_WARNING,
                          "Timing out slot migration %s "
                          "after not receiving ack for too long",
                          job->description);

Then, I'm convinced it will fix the failure where this warning is logged. 👍

@enjoy-binbin
Copy link
Copy Markdown
Member Author

Did you verify that this fixes the flakiness? It's weird that it would be so slow in the CI workflow

I did not bother to trigger the test in slow CI... I think the change is good enough to merge.

@enjoy-binbin enjoy-binbin merged commit 390a11c into valkey-io:unstable May 15, 2026
63 checks passed
@enjoy-binbin enjoy-binbin deleted the fixtest branch May 15, 2026 03:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[TEST-FAILURE] slot-migration-max-failover-repl-bytes -1 disables repl bytes limit in tests/unit/cluster/cluster-migrateslots.tcl

3 participants