
fix(cloneset): clamp updateMaxUnavailable to zero to prevent stuck rollouts#2385

Open
aditya-systems-hub wants to merge 3 commits into openkruise:master from aditya-systems-hub:fix/cloneset-negative-updateMaxUnavailable

Conversation


@aditya-systems-hub aditya-systems-hub commented Mar 1, 2026

When ScaleStrategy.MaxUnavailable throttles pod creation to zero (scaleUpLimit=0), createPods(0,...) returns (false, nil). syncCloneSet therefore does not skip Update(), which runs with len(pods) < replicas. The raw formula then yields a negative updateMaxUnavailable; limitUpdateIndexes immediately breaks out of its loop and returns no pods to update, silently stalling the rolling update until the scale-up succeeds.


Describe what this PR does

This fixes a silent rollout stall in CloneSet when the cluster is under-replicated.

The formula maxUnavailable + len(pods) - replicas can go negative when there are fewer pods than the desired replica count — for example, when ScaleStrategy.MaxUnavailable throttles new pod creation down to zero. When that happens, limitUpdateIndexes sees the negative number and immediately bails out of its loop without picking any pods for update. The rollout just hangs there, no error, no log, nothing — it just waits until scale-up finishes on its own.

The fix is a one-line clamp: wrap the formula with integer.IntMax(..., 0) so the value never drops below zero. This way, even when scaling is stuck, the update path doesn't silently deadlock.
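The arithmetic can be sketched in a few lines. Here intMax is a local stand-in for integer.IntMax from k8s.io/utils, and updateBudget is a hypothetical helper mirroring the formula, not the actual OpenKruise code:

```go
package main

import "fmt"

// intMax is a local stand-in for integer.IntMax from k8s.io/utils.
func intMax(a, b int) int {
	if a > b {
		return a
	}
	return b
}

// updateBudget sketches the clamped formula; name and signature are
// illustrative only.
func updateBudget(maxUnavailable, podCount, replicas int) int {
	// The raw value maxUnavailable + len(pods) - replicas goes negative
	// whenever the cluster is under-replicated; clamp it to zero.
	return intMax(maxUnavailable+podCount-replicas, 0)
}

func main() {
	fmt.Println(updateBudget(1, 3, 5)) // under-replicated: raw -1, clamped to 0
	fmt.Println(updateBudget(1, 5, 5)) // fully scaled: 1
}
```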

Does this pull request fix one issue?

NONE

Describe how to verify it

  1. Unit tests — Run the existing and newly added test:

    go test ./pkg/controller/cloneset/sync/ -run TestCalculateDiffs -v
    • The existing test case now expects updateMaxUnavailable: 0 instead of -2.
    • A new regression test simulates the exact scenario: 5 replicas, 3 pods (1 available, 2 unavailable), ScaleStrategy.MaxUnavailable=1. It confirms the raw value -1 gets clamped to 0 and the update is not blocked.
  2. Manual / integration check — Create a CloneSet with ScaleStrategy.MaxUnavailable set low enough that scaleUpLimit drops to 0 while pods are still pending. Trigger a rolling update and confirm the rollout progresses instead of hanging indefinitely.

Special notes for reviews

  • The change is intentionally minimal — a single-line clamp in cloneset_sync_utils.go and a corresponding regression test.
  • Clamping to zero is safe: the guard 0 + 0 >= 0 still holds true, so the blocking behavior for available pods is preserved as expected.
  • No new dependencies or behavior changes beyond preventing the negative value.

Copilot AI review requested due to automatic review settings March 1, 2026 23:10
@kruise-bot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign furykerry for approval by writing /assign @furykerry in a comment. For more information see: The Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details: Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


Copilot AI left a comment


Pull request overview

This PR aims to prevent CloneSet rolling updates from stalling when the workload is under-replicated by clamping updateMaxUnavailable to a non-negative value.

Changes:

  • Clamp updateMaxUnavailable to >= 0 in calculateDiffsWithExpectation to avoid negative values.
  • Update an existing unit test expectation to reflect clamped behavior.
  • Add a regression test case covering an under-replicated scenario that previously produced a negative updateMaxUnavailable.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File Description
pkg/controller/cloneset/sync/cloneset_sync_utils.go Clamp updateMaxUnavailable to zero and document the rationale.
pkg/controller/cloneset/sync/cloneset_sync_utils_test.go Update expected values and add a regression case asserting non-negative updateMaxUnavailable.


Comment on lines +940 to +955
name: "update not blocked when scaleUpLimit=0 leaves cluster under-replicated",
set: setScaleStrategy(
	createTestCloneSet(5, intstr.FromInt(0), intstr.FromInt(1), intstr.FromInt(0)),
	intstr.FromInt(1),
),
pods: []*v1.Pod{
	createTestPod(oldRevision, appspub.LifecycleStateNormal, true, false),
	createTestPod(oldRevision, appspub.LifecycleStateNormal, false, false),
	createTestPod(oldRevision, appspub.LifecycleStateNormal, false, false),
},
expectResult: expectationDiffs{
	scaleUpNum:           2,
	scaleUpLimit:         0,
	updateNum:            3,
	updateMaxUnavailable: 0,
},

Copilot AI Mar 1, 2026


This regression case (and its name) implies the update should not be blocked, but it only asserts updateMaxUnavailable is clamped to 0—which still causes limitUpdateIndexes to select zero pods to update. If the intended fix is “rollout progresses instead of stalling”, the test should also assert the downstream behavior (e.g., via limitUpdateIndexes or Update) and/or the case name/comment should be adjusted to match what is actually being guaranteed here (non-negative value).

Comment on lines +278 to +288
// updateMaxUnavailable is the ceiling used by limitUpdateIndexes to decide
// how many pods may be simultaneously unavailable during an update pass.
// When len(pods) < replicas (e.g. the cluster is under-replicated because
// ScaleStrategy.MaxUnavailable is throttling new-pod creation to zero),
// the raw formula produces a negative number. limitUpdateIndexes's guard:
// targetRevisionUnavailableCount + canUpdateCount >= updateMaxUnavailable
// evaluates to 0 >= -N immediately, breaking the loop before any pod is
// examined and silently stalling the rollout. Clamping to zero preserves
// the correct blocking semantic for available pods (0+0 >= 0 is still true)
// without relying on integer-comparison overflow of a negative sentinel.
res.updateMaxUnavailable = integer.IntMax(maxUnavailable+len(pods)-replicas, 0)

Copilot AI Mar 1, 2026


Clamping to 0 makes updateMaxUnavailable non-negative, but it does not address the “stuck rollout” described in the PR: limitUpdateIndexes breaks immediately when updateMaxUnavailable is 0 as well as when it is negative (the guard uses ">="). If the intent is to allow updates to proceed while under-replicated (e.g., update already-unavailable pods), this likely needs an accompanying change to limitUpdateIndexes logic and/or a different ceiling calculation; otherwise rollouts can still stall with updateMaxUnavailable==0.
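The reviewer's point can be checked with a few lines. guardBlocks here is a hypothetical helper mirroring only the guard condition from limitUpdateIndexes, not the actual OpenKruise code:

```go
package main

import "fmt"

// guardBlocks mirrors the guard in limitUpdateIndexes:
// targetRevisionUnavailableCount + canUpdateCount >= updateMaxUnavailable.
// The name and signature are illustrative only.
func guardBlocks(unavailable, canUpdate, budget int) bool {
	return unavailable+canUpdate >= budget
}

func main() {
	// On the first loop iteration both counters are zero, so a clamped
	// budget of 0 trips the guard exactly like a negative budget does.
	for _, budget := range []int{-1, 0, 1} {
		fmt.Printf("budget=%d breaks immediately: %v\n", budget, guardBlocks(0, 0, budget))
	}
}
```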

// evaluates to 0 >= -N immediately, breaking the loop before any pod is
// examined and silently stalling the rollout. Clamping to zero preserves
// the correct blocking semantic for available pods (0+0 >= 0 is still true)
// without relying on integer-comparison overflow of a negative sentinel.

Copilot AI Mar 1, 2026


The comment mentions “integer-comparison overflow of a negative sentinel”, but there’s no overflow here—this is just a comparison against a negative value. Consider rewording to avoid implying overflow/UB, and focus on the concrete behavior (the guard condition is immediately true when updateMaxUnavailable <= 0).

Suggested change
// without relying on integer-comparison overflow of a negative sentinel.
// and avoids a negative limit that would make the guard condition
// immediately true and stall the rollout.

@codecov

codecov bot commented Mar 1, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 48.69%. Comparing base (bba2621) to head (545699e).

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #2385      +/-   ##
==========================================
+ Coverage   48.66%   48.69%   +0.03%     
==========================================
  Files         324      324              
  Lines       27920    27921       +1     
==========================================
+ Hits        13587    13597      +10     
+ Misses      12794    12783      -11     
- Partials     1539     1541       +2     
Flag Coverage Δ
unittests 48.69% <100.00%> (+0.03%) ⬆️


☔ View full report in Codecov by Sentry.


…llouts

When ScaleStrategy.MaxUnavailable throttles pod creation to zero
(scaleUpLimit=0), createPods(0,...) returns (false, nil). syncCloneSet
therefore does not skip Update(), which runs with len(pods) < replicas.
The raw formula then yields a negative updateMaxUnavailable; limitUpdateIndexes
immediately breaks out of its loop and returns no pods to update, silently
stalling the rolling update until the scale-up succeeds.

Signed-off-by: aditya-systems-hub <adityakuchekar0077@gmail.com>
…s zero

When ScaleStrategy.MaxUnavailable throttles pod creation so that
len(pods) < replicas, calculateDiffsWithExpectation clamps
updateMaxUnavailable to zero (previously a negative value caused the
same stall).

The prior approach was still broken: limitUpdateIndexes's guard

  targetRevisionUnavailableCount + canUpdateCount >= updateMaxUnavailable

evaluates to 0 >= 0 on the very first iteration, so it broke out of the
loop before examining any pod — including pods that are already
unavailable. Updating an already-unavailable pod does not increase the
cluster's unavailability, so those pods must never be blocked by a zero
budget.

Fix: when updateMaxUnavailable is exactly zero (i.e. the formula was
negative and was clamped), skip the break for pods that are already
unavailable. The budget still correctly blocks transitions from available
to unavailable (available pods still see break at 0 >= 0).
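A minimal sketch of that amended loop, under stated assumptions: the pod type and pickForUpdate name are illustrative stand-ins, not the actual limitUpdateIndexes signature or the controller's real pod state.

```go
package main

import "fmt"

// pod is a minimal stand-in for the controller's per-pod state.
type pod struct {
	available bool
}

// pickForUpdate sketches the amended selection loop: a zero budget no
// longer blocks pods that are already unavailable, since updating them
// cannot increase the cluster's unavailability.
func pickForUpdate(pods []pod, updateMaxUnavailable int) []int {
	var picked []int
	unavailable := 0
	for i, p := range pods {
		if unavailable+len(picked) >= updateMaxUnavailable {
			// Break only for pods whose update would consume budget;
			// already-unavailable pods pass through a zero budget.
			if p.available || updateMaxUnavailable != 0 {
				break
			}
		}
		picked = append(picked, i)
		if !p.available {
			unavailable++
		}
	}
	return picked
}

func main() {
	pods := []pod{{available: false}, {available: false}, {available: true}}
	fmt.Println(pickForUpdate(pods, 0)) // [0 1]: already-unavailable pods still updated
	fmt.Println(pickForUpdate(pods, 1)) // [0]: positive budget limits picks as before
}
```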

The fix preserves all existing semantics for positive budgets: the
targetRevisionUnavailableCount + canUpdateCount >= N guard continues
to limit simultaneous pod updates to N as before.

Also update the comment in calculateDiffsWithExpectation to correctly
describe what the zero-clamp achieves, and add a regression test in
TestCalculateUpdateCount that exercises the end-to-end path through
both calculateDiffsWithExpectation and limitUpdateIndexes.

Signed-off-by: aditya-systems-hub <adityakuchekar0077@gmail.com>
Signed-off-by: aditya-systems-hub <adityakuchekar0077@gmail.com>
@aditya-systems-hub aditya-systems-hub force-pushed the fix/cloneset-negative-updateMaxUnavailable branch from 9c4dc49 to 545699e Compare March 2, 2026 00:58

Labels

size/M: 30-99

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants