
fix(cloneset): clamp updateMaxUnavailable to zero to prevent stuck rollouts#2385

Open
aditya-systems-hub wants to merge 3 commits into openkruise:master from aditya-systems-hub:fix/cloneset-negative-updateMaxUnavailable

Conversation


@aditya-systems-hub aditya-systems-hub commented Mar 1, 2026

When ScaleStrategy.MaxUnavailable throttles pod creation to zero (scaleUpLimit=0), createPods(0,...) returns (false, nil). syncCloneSet therefore does not skip Update(), which runs with len(pods) < replicas. The raw formula then yields a negative updateMaxUnavailable; limitUpdateIndexes immediately breaks out of its loop and returns no pods to update, silently stalling the rolling update until the scale-up succeeds.


Describe what this PR does

This fixes a silent rollout stall in CloneSet when the cluster is under-replicated.

The formula maxUnavailable + len(pods) - replicas can go negative when there are fewer pods than the desired replica count — for example, when ScaleStrategy.MaxUnavailable throttles new pod creation down to zero. When that happens, limitUpdateIndexes sees the negative number and immediately bails out of its loop without picking any pods for update. The rollout just hangs there, no error, no log, nothing — it just waits until scale-up finishes on its own.

The fix is a one-line clamp: wrap the formula with integer.IntMax(..., 0) so the value never drops below zero. This way, even when scaling is stuck, the update path doesn't silently deadlock.
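The arithmetic can be sketched in a few lines. Here intMax is a local stand-in for integer.IntMax from k8s.io/utils, and updateBudget is a hypothetical helper mirroring the formula, not the actual OpenKruise code:

```go
package main

import "fmt"

// intMax is a local stand-in for integer.IntMax from k8s.io/utils.
func intMax(a, b int) int {
	if a > b {
		return a
	}
	return b
}

// updateBudget sketches the clamped formula; name and signature are
// illustrative only.
func updateBudget(maxUnavailable, podCount, replicas int) int {
	// The raw value maxUnavailable + len(pods) - replicas goes negative
	// whenever the cluster is under-replicated; clamp it to zero.
	return intMax(maxUnavailable+podCount-replicas, 0)
}

func main() {
	fmt.Println(updateBudget(1, 3, 5)) // under-replicated: raw -1, clamped to 0
	fmt.Println(updateBudget(1, 5, 5)) // fully scaled: 1
}
```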

Does this pull request fix one issue?

NONE

Describe how to verify it

  1. Unit tests — Run the existing and newly added test:

    go test ./pkg/controller/cloneset/sync/ -run TestCalculateDiffs -v
    • The existing test case now expects updateMaxUnavailable: 0 instead of -2.
    • A new regression test simulates the exact scenario: 5 replicas, 3 pods (1 available, 2 unavailable), ScaleStrategy.MaxUnavailable=1. It confirms the raw value -1 gets clamped to 0 and the update is not blocked.
  2. Manual / integration check — Create a CloneSet with ScaleStrategy.MaxUnavailable set low enough that scaleUpLimit drops to 0 while pods are still pending. Trigger a rolling update and confirm the rollout progresses instead of hanging indefinitely.

Special notes for reviews

  • The change is intentionally minimal — a single-line clamp in cloneset_sync_utils.go and a corresponding regression test.
  • Clamping to zero is safe: the guard 0 + 0 >= 0 still holds true, so the blocking behavior for available pods is preserved as expected.
  • No new dependencies or behavior changes beyond preventing the negative value.

Copilot AI review requested due to automatic review settings March 1, 2026 23:10
@kruise-bot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign furykerry for approval by writing /assign @furykerry in a comment. For more information see: The Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details: Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


Copilot AI left a comment


Pull request overview

This PR aims to prevent CloneSet rolling updates from stalling when the workload is under-replicated by clamping updateMaxUnavailable to a non-negative value.

Changes:

  • Clamp updateMaxUnavailable to >= 0 in calculateDiffsWithExpectation to avoid negative values.
  • Update an existing unit test expectation to reflect clamped behavior.
  • Add a regression test case covering an under-replicated scenario that previously produced a negative updateMaxUnavailable.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File Description
pkg/controller/cloneset/sync/cloneset_sync_utils.go Clamp updateMaxUnavailable to zero and document the rationale.
pkg/controller/cloneset/sync/cloneset_sync_utils_test.go Update expected values and add a regression case asserting non-negative updateMaxUnavailable.


Comment on lines +940 to +955
name: "update not blocked when scaleUpLimit=0 leaves cluster under-replicated",
set: setScaleStrategy(
	createTestCloneSet(5, intstr.FromInt(0), intstr.FromInt(1), intstr.FromInt(0)),
	intstr.FromInt(1),
),
pods: []*v1.Pod{
	createTestPod(oldRevision, appspub.LifecycleStateNormal, true, false),
	createTestPod(oldRevision, appspub.LifecycleStateNormal, false, false),
	createTestPod(oldRevision, appspub.LifecycleStateNormal, false, false),
},
expectResult: expectationDiffs{
	scaleUpNum:           2,
	scaleUpLimit:         0,
	updateNum:            3,
	updateMaxUnavailable: 0,
},

Copilot AI Mar 1, 2026


This regression case (and its name) implies the update should not be blocked, but it only asserts updateMaxUnavailable is clamped to 0—which still causes limitUpdateIndexes to select zero pods to update. If the intended fix is “rollout progresses instead of stalling”, the test should also assert the downstream behavior (e.g., via limitUpdateIndexes or Update) and/or the case name/comment should be adjusted to match what is actually being guaranteed here (non-negative value).

Comment on lines +278 to +288
// updateMaxUnavailable is the ceiling used by limitUpdateIndexes to decide
// how many pods may be simultaneously unavailable during an update pass.
// When len(pods) < replicas (e.g. the cluster is under-replicated because
// ScaleStrategy.MaxUnavailable is throttling new-pod creation to zero),
// the raw formula produces a negative number. limitUpdateIndexes's guard:
// targetRevisionUnavailableCount + canUpdateCount >= updateMaxUnavailable
// evaluates to 0 >= -N immediately, breaking the loop before any pod is
// examined and silently stalling the rollout. Clamping to zero preserves
// the correct blocking semantic for available pods (0+0 >= 0 is still true)
// without relying on integer-comparison overflow of a negative sentinel.
res.updateMaxUnavailable = integer.IntMax(maxUnavailable+len(pods)-replicas, 0)

Copilot AI Mar 1, 2026


Clamping to 0 makes updateMaxUnavailable non-negative, but it does not address the “stuck rollout” described in the PR: limitUpdateIndexes breaks immediately when updateMaxUnavailable is 0 as well as when it is negative (the guard uses ">="). If the intent is to allow updates to proceed while under-replicated (e.g., update already-unavailable pods), this likely needs an accompanying change to limitUpdateIndexes logic and/or a different ceiling calculation; otherwise rollouts can still stall with updateMaxUnavailable==0.
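The reviewer's point can be checked with a few lines. guardBlocks here is a hypothetical helper mirroring only the guard condition from limitUpdateIndexes, not the actual OpenKruise code:

```go
package main

import "fmt"

// guardBlocks mirrors the guard in limitUpdateIndexes:
// targetRevisionUnavailableCount + canUpdateCount >= updateMaxUnavailable.
// The name and signature are illustrative only.
func guardBlocks(unavailable, canUpdate, budget int) bool {
	return unavailable+canUpdate >= budget
}

func main() {
	// On the first loop iteration both counters are zero, so a clamped
	// budget of 0 trips the guard exactly like a negative budget does.
	for _, budget := range []int{-1, 0, 1} {
		fmt.Printf("budget=%d breaks immediately: %v\n", budget, guardBlocks(0, 0, budget))
	}
}
```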

// evaluates to 0 >= -N immediately, breaking the loop before any pod is
// examined and silently stalling the rollout. Clamping to zero preserves
// the correct blocking semantic for available pods (0+0 >= 0 is still true)
// without relying on integer-comparison overflow of a negative sentinel.

Copilot AI Mar 1, 2026


The comment mentions “integer-comparison overflow of a negative sentinel”, but there’s no overflow here—this is just a comparison against a negative value. Consider rewording to avoid implying overflow/UB, and focus on the concrete behavior (the guard condition is immediately true when updateMaxUnavailable <= 0).

Suggested change
// without relying on integer-comparison overflow of a negative sentinel.
// and avoids a negative limit that would make the guard condition
// immediately true and stall the rollout.

@codecov

codecov bot commented Mar 1, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 48.69%. Comparing base (bba2621) to head (545699e).

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #2385      +/-   ##
==========================================
+ Coverage   48.66%   48.69%   +0.03%     
==========================================
  Files         324      324              
  Lines       27920    27921       +1     
==========================================
+ Hits        13587    13597      +10     
+ Misses      12794    12783      -11     
- Partials     1539     1541       +2     
Flag Coverage Δ
unittests 48.69% <100.00%> (+0.03%) ⬆️


☔ View full report in Codecov by Sentry.


…llouts

When ScaleStrategy.MaxUnavailable throttles pod creation to zero
(scaleUpLimit=0), createPods(0,...) returns (false, nil). syncCloneSet
therefore does not skip Update(), which runs with len(pods) < replicas.
The raw formula then yields a negative updateMaxUnavailable; limitUpdateIndexes
immediately breaks out of its loop and returns no pods to update, silently
stalling the rolling update until the scale-up succeeds.

Signed-off-by: aditya-systems-hub <adityakuchekar0077@gmail.com>
…s zero

When ScaleStrategy.MaxUnavailable throttles pod creation so that
len(pods) < replicas, calculateDiffsWithExpectation clamps
updateMaxUnavailable to zero (previously a negative value caused the
same stall).

The prior approach was still broken: limitUpdateIndexes's guard

  targetRevisionUnavailableCount + canUpdateCount >= updateMaxUnavailable

evaluates to 0 >= 0 on the very first iteration, so it broke out of the
loop before examining any pod — including pods that are already
unavailable. Updating an already-unavailable pod does not increase the
cluster's unavailability, so those pods must never be blocked by a zero
budget.

Fix: when updateMaxUnavailable is exactly zero (i.e. the formula was
negative and was clamped), skip the break for pods that are already
unavailable. The budget still correctly blocks transitions from available
to unavailable (available pods still see break at 0 >= 0).
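A minimal sketch of that amended loop, under stated assumptions: the pod type and pickForUpdate name are illustrative stand-ins, not the actual limitUpdateIndexes signature or the controller's real pod state.

```go
package main

import "fmt"

// pod is a minimal stand-in for the controller's per-pod state.
type pod struct {
	available bool
}

// pickForUpdate sketches the amended selection loop: a zero budget no
// longer blocks pods that are already unavailable, since updating them
// cannot increase the cluster's unavailability.
func pickForUpdate(pods []pod, updateMaxUnavailable int) []int {
	var picked []int
	unavailable := 0
	for i, p := range pods {
		if unavailable+len(picked) >= updateMaxUnavailable {
			// Break only for pods whose update would consume budget;
			// already-unavailable pods pass through a zero budget.
			if p.available || updateMaxUnavailable != 0 {
				break
			}
		}
		picked = append(picked, i)
		if !p.available {
			unavailable++
		}
	}
	return picked
}

func main() {
	pods := []pod{{available: false}, {available: false}, {available: true}}
	fmt.Println(pickForUpdate(pods, 0)) // [0 1]: already-unavailable pods still updated
	fmt.Println(pickForUpdate(pods, 1)) // [0]: positive budget limits picks as before
}
```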

The fix preserves all existing semantics for positive budgets: the
targetRevisionUnavailableCount + canUpdateCount >= N guard continues
to limit simultaneous pod updates to N as before.

Also update the comment in calculateDiffsWithExpectation to correctly
describe what the zero-clamp achieves, and add a regression test in
TestCalculateUpdateCount that exercises the end-to-end path through
both calculateDiffsWithExpectation and limitUpdateIndexes.

Signed-off-by: aditya-systems-hub <adityakuchekar0077@gmail.com>
Signed-off-by: aditya-systems-hub <adityakuchekar0077@gmail.com>
@aditya-systems-hub aditya-systems-hub force-pushed the fix/cloneset-negative-updateMaxUnavailable branch from 9c4dc49 to 545699e Compare March 2, 2026 00:58

Labels

size/M: 30-99

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants