fix(cloneset): clamp updateMaxUnavailable to zero to prevent stuck rollouts#2385
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED. Needs approval from an approver in each of these files.
Pull request overview
This PR aims to prevent CloneSet rolling updates from stalling when the workload is under-replicated by clamping updateMaxUnavailable to a non-negative value.
Changes:
- Clamp `updateMaxUnavailable` to `>= 0` in `calculateDiffsWithExpectation` to avoid negative values.
- Update an existing unit test expectation to reflect the clamped behavior.
- Add a regression test case covering an under-replicated scenario that previously produced a negative `updateMaxUnavailable`.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| pkg/controller/cloneset/sync/cloneset_sync_utils.go | Clamp updateMaxUnavailable to zero and document the rationale. |
| pkg/controller/cloneset/sync/cloneset_sync_utils_test.go | Update expected values and add a regression case asserting non-negative updateMaxUnavailable. |
```go
{
	name: "update not blocked when scaleUpLimit=0 leaves cluster under-replicated",
	set: setScaleStrategy(
		createTestCloneSet(5, intstr.FromInt(0), intstr.FromInt(1), intstr.FromInt(0)),
		intstr.FromInt(1),
	),
	pods: []*v1.Pod{
		createTestPod(oldRevision, appspub.LifecycleStateNormal, true, false),
		createTestPod(oldRevision, appspub.LifecycleStateNormal, false, false),
		createTestPod(oldRevision, appspub.LifecycleStateNormal, false, false),
	},
	expectResult: expectationDiffs{
		scaleUpNum:           2,
		scaleUpLimit:         0,
		updateNum:            3,
		updateMaxUnavailable: 0,
	},
},
```
This regression case (and its name) implies the update should not be blocked, but it only asserts updateMaxUnavailable is clamped to 0—which still causes limitUpdateIndexes to select zero pods to update. If the intended fix is “rollout progresses instead of stalling”, the test should also assert the downstream behavior (e.g., via limitUpdateIndexes or Update) and/or the case name/comment should be adjusted to match what is actually being guaranteed here (non-negative value).
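The reviewer's point can be shown with a minimal, self-contained sketch. The function below is illustrative only, not the actual kruise `limitUpdateIndexes`: with a `>=` guard, a budget of 0 behaves exactly like a negative budget, and zero pods are selected either way.

```go
package main

import "fmt"

// limitBySimpleGuard mimics the shape of the guard discussed above (names are
// hypothetical): it walks candidate pod indexes and stops as soon as
// unavailable + selected >= budget.
func limitBySimpleGuard(podAvailable []bool, budget int) []int {
	unavailable := 0
	for _, ok := range podAvailable {
		if !ok {
			unavailable++
		}
	}
	var selected []int
	for i := range podAvailable {
		if unavailable+len(selected) >= budget {
			break // fires on the first iteration whenever budget <= unavailable
		}
		selected = append(selected, i)
	}
	return selected
}

func main() {
	pods := []bool{false, false, true} // two already-unavailable pods
	fmt.Println(len(limitBySimpleGuard(pods, -2))) // negative budget: 0 pods selected
	fmt.Println(len(limitBySimpleGuard(pods, 0)))  // clamped budget: still 0 pods selected
}
```

So the clamp alone changes the value but not the observable outcome of this loop, which is exactly the concern raised here.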
```go
// updateMaxUnavailable is the ceiling used by limitUpdateIndexes to decide
// how many pods may be simultaneously unavailable during an update pass.
// When len(pods) < replicas (e.g. the cluster is under-replicated because
// ScaleStrategy.MaxUnavailable is throttling new-pod creation to zero),
// the raw formula produces a negative number. limitUpdateIndexes's guard:
//   targetRevisionUnavailableCount + canUpdateCount >= updateMaxUnavailable
// evaluates to 0 >= -N immediately, breaking the loop before any pod is
// examined and silently stalling the rollout. Clamping to zero preserves
// the correct blocking semantic for available pods (0+0 >= 0 is still true)
// without relying on integer-comparison overflow of a negative sentinel.
res.updateMaxUnavailable = integer.IntMax(maxUnavailable+len(pods)-replicas, 0)
```
Clamping to 0 makes updateMaxUnavailable non-negative, but it does not address the “stuck rollout” described in the PR: limitUpdateIndexes breaks immediately when updateMaxUnavailable is 0 as well as when it is negative (the guard uses ">="). If the intent is to allow updates to proceed while under-replicated (e.g., update already-unavailable pods), this likely needs an accompanying change to limitUpdateIndexes logic and/or a different ceiling calculation; otherwise rollouts can still stall with updateMaxUnavailable==0.
```go
// evaluates to 0 >= -N immediately, breaking the loop before any pod is
// examined and silently stalling the rollout. Clamping to zero preserves
// the correct blocking semantic for available pods (0+0 >= 0 is still true)
// without relying on integer-comparison overflow of a negative sentinel.
```
The comment mentions “integer-comparison overflow of a negative sentinel”, but there’s no overflow here—this is just a comparison against a negative value. Consider rewording to avoid implying overflow/UB, and focus on the concrete behavior (the guard condition is immediately true when updateMaxUnavailable <= 0).
Suggested change:

```diff
-// without relying on integer-comparison overflow of a negative sentinel.
+// and avoids a negative limit that would make the guard condition
+// immediately true and stall the rollout.
```
Codecov Report: ✅ All modified and coverable lines are covered by tests.

```
@@ Coverage Diff @@
##           master    #2385      +/- ##
==========================================
+ Coverage   48.66%   48.69%   +0.03%
==========================================
  Files         324      324
  Lines       27920    27921       +1
==========================================
+ Hits        13587    13597      +10
+ Misses      12794    12783      -11
- Partials     1539     1541       +2
```
…llouts

When ScaleStrategy.MaxUnavailable throttles pod creation to zero (scaleUpLimit=0), createPods(0,...) returns (false, nil). syncCloneSet therefore does not skip Update(), which runs with len(pods) < replicas. The raw formula then yields a negative updateMaxUnavailable; limitUpdateIndexes immediately breaks out of its loop and returns no pods to update, silently stalling the rolling update until the scale-up succeeds.

Signed-off-by: aditya-systems-hub <adityakuchekar0077@gmail.com>
…s zero

When ScaleStrategy.MaxUnavailable throttles pod creation so that len(pods) < replicas, calculateDiffsWithExpectation clamps updateMaxUnavailable to zero (previously a negative value caused the same stall). The prior approach was still broken: limitUpdateIndexes's guard

    targetRevisionUnavailableCount + canUpdateCount >= updateMaxUnavailable

evaluates to 0 >= 0 on the very first iteration, so it broke out of the loop before examining any pod, including pods that are already unavailable. Updating an already-unavailable pod does not increase the cluster's unavailability, so those pods must never be blocked by a zero budget.

Fix: when updateMaxUnavailable is exactly zero (i.e. the formula was negative and was clamped), skip the break for pods that are already unavailable. The budget still correctly blocks transitions from available to unavailable (available pods still see the break at 0 >= 0). The fix preserves all existing semantics for positive budgets: the targetRevisionUnavailableCount + canUpdateCount >= N guard continues to limit simultaneous pod updates to N as before.

Also update the comment in calculateDiffsWithExpectation to correctly describe what the zero-clamp achieves, and add a regression test in TestCalculateUpdateCount that exercises the end-to-end path through both calculateDiffsWithExpectation and limitUpdateIndexes.

Signed-off-by: aditya-systems-hub <adityakuchekar0077@gmail.com>
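The zero-budget exception described in this commit message can be sketched as follows. `selectPodsToUpdate` is a hypothetical stand-in for the real `limitUpdateIndexes`, assuming pods are ordered with already-unavailable ones first:

```go
package main

import "fmt"

// selectPodsToUpdate is an illustrative sketch (not the actual kruise code).
// With a zero budget, pods that are already unavailable may still be updated,
// because updating them cannot increase the cluster's unavailability;
// available pods remain blocked by the guard.
func selectPodsToUpdate(podAvailable []bool, targetUnavailable, budget int) []int {
	var selected []int
	for i, available := range podAvailable {
		if targetUnavailable+len(selected) >= budget {
			// A zero budget means the raw formula was negative and was
			// clamped; only available pods are blocked in that case.
			if budget != 0 || available {
				break
			}
		}
		selected = append(selected, i)
	}
	return selected
}

func main() {
	// Two already-unavailable pods plus one available pod, budget clamped to 0:
	// the unavailable pods are still selected, the available one is blocked.
	fmt.Println(selectPodsToUpdate([]bool{false, false, true}, 0, 0)) // [0 1]
}
```

Positive budgets keep the original semantics: the loop still stops once the selected count reaches the budget.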
Signed-off-by: aditya-systems-hub <adityakuchekar0077@gmail.com>
9c4dc49 to 545699e
When ScaleStrategy.MaxUnavailable throttles pod creation to zero (scaleUpLimit=0), createPods(0,...) returns (false, nil). syncCloneSet therefore does not skip Update(), which runs with len(pods) < replicas. The raw formula then yields a negative updateMaxUnavailable; limitUpdateIndexes immediately breaks out of its loop and returns no pods to update, silently stalling the rolling update until the scale-up succeeds.

Describe what this PR does
This fixes a silent rollout stall in CloneSet when the cluster is under-replicated.
The formula `maxUnavailable + len(pods) - replicas` can go negative when there are fewer pods than the desired replica count, for example when ScaleStrategy.MaxUnavailable throttles new pod creation down to zero. When that happens, limitUpdateIndexes sees the negative number and immediately bails out of its loop without picking any pods for update. The rollout just hangs there, no error, no log, nothing; it just waits until scale-up finishes on its own.

The fix is a one-line clamp: wrap the formula with `integer.IntMax(..., 0)` so the value never drops below zero. This way, even when scaling is stuck, the update path doesn't silently deadlock.

Does this pull request fix one issue?
NONE
Describe how to verify it
Unit tests — Run the existing and newly added test:
`go test ./pkg/controller/cloneset/sync/ -run TestCalculateDiffs -v`

The updated case now expects `updateMaxUnavailable: 0` instead of `-2`. The new regression case uses `ScaleStrategy.MaxUnavailable=1`; it confirms the raw value `-1` gets clamped to `0` and the update is not blocked.

Manual / integration check — Create a CloneSet with ScaleStrategy.MaxUnavailable set low enough that scaleUpLimit drops to 0 while pods are still pending. Trigger a rolling update and confirm the rollout progresses instead of hanging indefinitely.

Special notes for reviews
The change is a one-line clamp in `cloneset_sync_utils.go` and a corresponding regression test. `0 + 0 >= 0` still holds true, so the blocking behavior for available pods is preserved as expected.
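For concreteness, a minimal sketch of the clamp arithmetic; `intMax` mirrors the behavior of `integer.IntMax` from `k8s.io/utils/integer`, and the numbers assume the regression scenario of replicas=5, 3 existing pods, maxUnavailable=1:

```go
package main

import "fmt"

// intMax mirrors integer.IntMax from k8s.io/utils/integer.
func intMax(a, b int) int {
	if a > b {
		return a
	}
	return b
}

func main() {
	// Under-replicated cluster: replicas=5, only 3 pods exist, maxUnavailable=1.
	maxUnavailable, pods, replicas := 1, 3, 5
	raw := maxUnavailable + pods - replicas // the raw formula goes negative
	fmt.Println(raw)                        // -1
	fmt.Println(intMax(raw, 0))             // 0 (clamped)
}
```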