feat: support graceful scale-down for AlluxioRuntime using AdvancedStatefulSet (#5805)
Conversation
Code Review
This pull request introduces graceful worker scale-down for Alluxio by decommissioning workers before they are terminated by the StatefulSet controller, ensuring cached blocks can be migrated. This functionality is gated by a new AdvancedStatefulSet feature. The changes include new decommissioning operations in AlluxioFileUtils, integration into the SyncReplicas reconciliation loop, and comprehensive unit tests. Review feedback suggests improving context propagation by replacing context.TODO(), optimizing efficiency by passing existing runtime objects to avoid redundant API lookups, and refining error handling during the draining phase to prevent log noise.
```go
for ord := desiredReplicas; ord < currentReplicas; ord++ {
	podName := fmt.Sprintf("%s-%d", workerStsName, ord)
	pod := &corev1.Pod{}
	if err := e.Client.Get(context.TODO(),
```
Using context.TODO() is generally discouraged in production code because it makes it harder to propagate cancellation and timeouts. Since SyncReplicas is part of a reconciliation loop that provides a context (wrapped in cruntime.ReconcileRequestContext), it is better to use that context here. If it is not directly accessible, consider refactoring to pass the context down rather than defaulting to context.TODO().
```go
func (e *AlluxioEngine) getWorkerRPCPort() int {
	runtime, err := e.getRuntime()
	if err != nil {
		return defaultWorkerRPCPort
	}
	if port, ok := runtime.Spec.Worker.Ports["rpc"]; ok && port > 0 {
		return port
	}
	return defaultWorkerRPCPort
}
```
The getWorkerRPCPort function calls e.getRuntime() which performs a redundant API lookup. Since the runtime object is already fetched in SyncReplicas (line 89), it should be passed as an argument to drainScalingDownWorkers and then to getWorkerRPCPort to improve efficiency.
```go
func (e *AlluxioEngine) getWorkerRPCPort(runtime *data.AlluxioRuntime) int {
	if port, ok := runtime.Spec.Worker.Ports["rpc"]; ok && port > 0 {
		return port
	}
	return defaultWorkerRPCPort
}
```

```go
return fmt.Errorf("workers not yet drained; scale-in to %d replicas will resume on next reconcile",
	runtime.Replicas())
```
Returning a generic error here will cause the controller to log this as a failure every time it reconciles while waiting for workers to drain. This can lead to noisy logs. It is better to use a specific error type that the caller can recognize to log at a lower level, or if the framework supports it, return a requeue result without an error.
Codecov Report

❌ Patch coverage is

```
@@            Coverage Diff             @@
##           master    #5805      +/-   ##
==========================================
- Coverage   58.17%   58.11%   -0.06%
==========================================
  Files         478      480       +2
  Lines       32485    32588     +103
==========================================
+ Hits        18899    18940      +41
- Misses      12042    12103      +61
- Partials     1544     1545       +1
```

☔ View full report in Codecov by Sentry.
…atefulSet (fluid-cloudnative#4193) Signed-off-by: Monika Jakhar <jakharmonika364@gmail.com>
Force-pushed 8c7de5c to b77d0d4
Ⅰ. Describe what this PR does
This PR implements a graceful decommissioning workflow for Alluxio workers during scale-in. It adds a new AdvancedStatefulSet feature gate that allows Fluid to leverage OpenKruise capabilities for finer pod lifecycle management. When enabled, workers are decommissioned and their cached data is migrated before the pods are terminated, ensuring cluster stability and data availability during scaling operations.
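A toy sketch of the gating described above: the new behavior only runs when the AdvancedStatefulSet gate is flipped on, and the gate defaults to disabled while the feature is Alpha. Fluid's real gate plumbing (e.g. via k8s.io/component-base/featuregate) will differ; the map and function names here are assumptions for illustration.

```go
package main

import "fmt"

// gates is a stand-in for a real feature-gate registry.
// AdvancedStatefulSet is Alpha, so it defaults to disabled.
var gates = map[string]bool{
	"AdvancedStatefulSet": false,
}

// enabled reports whether a named gate is switched on;
// unknown gates read as false.
func enabled(name string) bool { return gates[name] }

func main() {
	if enabled("AdvancedStatefulSet") {
		fmt.Println("graceful scale-down: decommission workers before termination")
	} else {
		// Default path until the operator opts in.
		fmt.Println("default StatefulSet scale-down") // prints: default StatefulSet scale-down
	}
}
```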
Ⅱ. Does this pull request fix one issue?
fixes #4193
Ⅲ. List the added test cases (unit test/integration test) if any, please explain if no tests are needed.
Ⅳ. Describe how to verify it
Ⅴ. Special notes for reviews
The feature is currently in Alpha and disabled by default. It provides the necessary infrastructure to support selective pod deletion in later phases.