[WIP] feat(api): KEP-2782: Dynamic Resource Allocation support for TrainJob by Sridhar1030 · Pull Request #3540 · kubeflow/trainer

Sridhar1030 · 2026-05-24T15:12:57Z

Related: #2782
Ref issues: kubernetes/enhancements#5729, kubernetes/enhancements#4671

This is the initial KEP to support DRA (Dynamic Resource Allocation) in Kubeflow Trainer and TrainJob.

DRA graduated to GA in Kubernetes 1.34. This KEP proposes adding a resourceClaims field to PodSpecPatch so that:

Platform admins can pre-configure ResourceClaimTemplates in ClusterTrainingRuntime templates
Data scientists can override or add claims via runtimePatches in TrainJob

The existing strategic merge patch pipeline already handles corev1.PodSpec.ResourceClaims natively, so the implementation requires only an API surface change with no controller logic modifications.

The KEP covers:

Pod-level ResourceClaims via PodSpecPatch (Phase 1)
CEL validation for claim source fields
Known limitations (torch plugin GPU detection, ContainerPatch gap)
Migration path from extended resources
Future work: PodGroup-level claims, ComputeDomain integration (Phase 2, after k8s 1.36+)

google-oss-prow · 2026-05-24T15:13:04Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign tenzen-y for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

github-actions · 2026-05-24T15:13:06Z

🎉 Welcome to the Kubeflow Trainer! 🎉

Thanks for opening your first PR! We're happy to have you as part of our community 🚀

Here's what happens next:

If you haven't already, please check out our Contributing Guide for repo-specific guidelines and the Kubeflow Contributor Guide for general community standards.
Our team will review your PR soon! cc @kubeflow/kubeflow-trainer-team

Join the community:

Slack: Join our #kubeflow-trainer Slack channel.
Meetings: Attend the Kubeflow AutoML and Training Working Group bi-weekly meetings.

Feel free to ask questions in the comments if you need any help or clarification!
Thanks again for contributing to Kubeflow! 🙏

kannon92 · 2026-05-24T15:43:43Z

/assign

kannon92 · 2026-05-24T16:41:16Z

+
+#### E2E tests
+
+E2E tests are deferred until a DRA-capable test cluster (with a DRA driver installed) is


For Kueue, we use the dra example driver to do e2e testing without relying on a GPU. I think integration/unit are probably fine here but just wanted to call this out.

Thanks for the pointer! That's great to know. I'll update the E2E test section to note that the dra-example-driver can be used for E2E testing without real GPUs, similar to how Kueue handles it.

kannon92

LGTM

Really well done proposal. I don't really have any concerns on this one.

abhijeet-dhumal · 2026-05-25T05:16:39Z

+
+1. **PodGroup-level ResourceClaims.** Multi-node topology-aware allocation (e.g., NVL72,
+   ComputeDomains) depends on upstream Kubernetes work (DRAWorkloadResourceClaims feature gate,
+   expected in k8s 1.36+) and the Trainer WAS KEP ([#3219](https://github.com/kubeflow/trainer/pull/3219)).


Just to confirm the dependency chain for Phase 2, is the actual blocker Trainer's WAS KEP (#3219) rather than upstream k8s? IIUC KEP-5729 is already Beta in 1.37, so it seems like once #3219 lands we're good to go?

Good question. Phase 2 needs both: upstream KEP-5729 to stabilize (currently alpha in 1.36, 1.37 is still TBD per the KEP issue) AND Trainer's WAS KEP (#3219) to land.
I'll update the text to make this dependency chain clearer.
Phase 1 is independent of both, which is why we're proceeding with it now.

Actually KEP-5729 Beta PRs for 1.37 are already merged/tracked per the issue, it's not TBD. I think the real remaining gate is #3219 landing in Trainer. Can you mention it clearly

Got it, I'll update the text to make #3219 the primary Trainer-side blocker for Phase 2 and reflect the upstream status more accurately.

abhijeet-dhumal · 2026-05-25T05:19:50Z

+`ResourceClaim` objects and exposed to containers via CDI device injection.
+
+**Impact:** When a training job uses DRA claims exclusively (no extended resources like
+`nvidia.com/gpu`), `GetNumGPUPerNode()` returns 0. If `numProcPerNode` is not explicitly


Is there a plan to add a webhook warning if someone uses DRA claims without setting numProcPerNode explicitly?

That's a good idea. We could add a webhook warning (not rejection) when resourceClaims are present in the merged PodSpec but numProcPerNode is not explicitly set in the Trainer spec. I'll add this to the KEP as a recommended follow-up for Phase 1, since it's a small validation check that could save users from a confusing failure mode.

abhijeet-dhumal · 2026-05-25T05:41:03Z

+  acceptable: wrapping would diverge from Kubernetes semantics, and v1alpha1 allows breaking
+  changes if upstream adds fields that conflict.
+
+#### ContainerPatch and container-level claim references


Is the expectation that every DRA-ready ClusterTrainingRuntime must have resources.claims pre-wired in the container spec?

Is adding Resources to ContainerPatch tracked anywhere, or is it strictly Phase 2?

Yes, for Phase 1, DRA-ready ClusterTrainingRuntime templates need both pod-level resourceClaims AND container-level resources.claims pre-wired by the admin. Kubernetes requires containers to explicitly reference the claims they consume, and since ContainerPatch doesn't expose Resources, users can't add those references via runtimePatches.

Adding Resources to ContainerPatch is mentioned in the KEP under Known Limitations and Future Work. It's not strictly Phase 2 (no dependency on WAS/PodGroups), so it could land as a fast follow-up to Phase 1.

abhijeet-dhumal · 2026-05-25T06:19:17Z

+1. Admin defines `ClusterTrainingRuntime` with `resourceClaims` in the full `corev1.PodSpec`
+   template, and sets container-level `resources.claims` references.
+2. User creates `TrainJob` with `runtimePatches` containing `PodSpecPatch.ResourceClaims`.
+3. Controller calls `mergeRuntimePatches()` in `pkg/runtime/core/trainingruntime.go`:


Have you considered if there are any conflicts with : #3428
I think The DRA merge flow needs to be updated to reflect KEP-2599..according this KEP , the source for mergeRuntimePatches() is now the snapshot ConfigMap, not the live runtime.

specifically whether ValidateObjects() and mergeRuntimePatches() will use the snapshot or the live runtime, and whether Story 1 (admin-configured claims frozen at snapshot time) is called out as expected behavior.

Yes, I've read through KEP-2599. The snapshot doesn't conflict with DRA since mergeRuntimePatches() is source-agnostic; it takes a JobSetTemplateSpec and applies strategic merge patch regardless of whether the input came from the live runtime or the snapshot ConfigMap.

To your specific points:

ValidateObjects() runs at admission time before the TrainJob is reconciled, so it reads the live runtime (no snapshot exists yet at that point)

mergeRuntimePatches() during reconciliation will use the snapshot as its source

Admin-configured DRA claims being frozen at snapshot time is expected and desirable behavior. An admin changing a runtime's claim from A100 to H100 won't affect running jobs.

I'll update the flow diagram and design details to reflect the snapshot-based flow.

This KEP proposes adding a resourceClaims field to PodSpecPatch, enabling pod-level DRA support for training workloads. Related: kubeflow#2782 Signed-off-by: Sridhar1030 <sridharpillai75@gmail.com>

- Clarify Phase 2 dependency: kubeflow#3219 is primary Trainer-side blocker, KEP-5729 beta targeting 1.37 - Update flow diagram and step-by-step for KEP-2599 snapshot mechanism - Add dra-example-driver reference for E2E testing (per kannon92) - Add webhook warning recommendation for DRA + numProcPerNode - Tighten ContainerPatch limitation with MUST language Signed-off-by: Sridhar1030 <sridharpillai75@gmail.com>

Signed-off-by: Sridhar1030 <sridharpillai75@gmail.com> Co-authored-by: Cursor <cursoragent@cursor.com>

Copilot

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds a KEP proposal describing Dynamic Resource Allocation (DRA) support for Kubeflow Trainer by documenting the intended API surface change (PodSpecPatch.resourceClaims), merge/validation approach, limitations, and a test plan.

Changes:

Introduces a new KEP document explaining DRA concepts, user stories, and design details for Trainer integration.
Specifies proposed CRD/CEL validation rules and a framework plugin validation approach.
Documents strategic-merge behavior, operational edge cases, and known limitations (notably Torch GPU auto-detection).

+	// +kubebuilder:validation:XValidation:rule="self.all(c, has(c.resourceClaimName) && c.resourceClaimName != '' || has(c.resourceClaimTemplateName) && c.resourceClaimTemplateName != '')", message="each claim must set either resourceClaimName or resourceClaimTemplateName"
+	// +kubebuilder:validation:XValidation:rule="self.all(c, !(has(c.resourceClaimName) && c.resourceClaimName != '' && has(c.resourceClaimTemplateName) && c.resourceClaimTemplateName != ''))", message="each claim must set only one of resourceClaimName or resourceClaimTemplateName"


+apiVersion: resource.k8s.io/v1beta2
+kind: ResourceClaimTemplate


+graduated to GA in Kubernetes 1.34, establishing a new standard for device scheduling that
+supersedes extended resources for GPUs and accelerators. Users need the ability to configure


Signed-off-by: Sridhar1030 <sridharpillai75@gmail.com>

Sridhar1030 · 2026-06-04T17:56:44Z

Hey @andreyvelich , I have addressed all the above reviews , would love your comments/feedback when you get a moment

google-oss-prow Bot added the do-not-merge/work-in-progress label May 24, 2026

google-oss-prow Bot requested a review from akshaychitneni May 24, 2026 15:13

google-oss-prow Bot requested a review from kuizhiqing May 24, 2026 15:13

google-oss-prow Bot added the size/XL label May 24, 2026

google-oss-prow Bot assigned kannon92 May 24, 2026

kannon92 reviewed May 24, 2026

View reviewed changes

abhijeet-dhumal reviewed May 25, 2026

View reviewed changes

Sridhar1030 changed the title ~~[WIP] feat(api): KEP-2782: Dynamic Resource Allocation support for TrainJob~~ feat(api): KEP-2782: Dynamic Resource Allocation support for TrainJob May 27, 2026

Sridhar1030 changed the title ~~feat(api): KEP-2782: Dynamic Resource Allocation support for TrainJob~~ [WIP] feat(api): KEP-2782: Dynamic Resource Allocation support for TrainJob May 27, 2026

Sridhar1030 and others added 3 commits June 4, 2026 23:06

feat(api): KEP-2782: Dynamic Resource Allocation support for TrainJob

08d753b

This KEP proposes adding a resourceClaims field to PodSpecPatch, enabling pod-level DRA support for training workloads. Related: kubeflow#2782 Signed-off-by: Sridhar1030 <sridharpillai75@gmail.com>

chore: trigger CI re-run

f451edc

Signed-off-by: Sridhar1030 <sridharpillai75@gmail.com> Co-authored-by: Cursor <cursoragent@cursor.com>

Sridhar1030 force-pushed the kep/2782-dra-support branch from 729245d to f451edc Compare June 4, 2026 17:37

Sridhar1030 marked this pull request as ready for review June 4, 2026 17:38

Copilot AI review requested due to automatic review settings June 4, 2026 17:38

Copilot AI reviewed Jun 4, 2026

View reviewed changes

addressed copilot's review

31e2d48

Signed-off-by: Sridhar1030 <sridharpillai75@gmail.com>


		#### E2E tests

		E2E tests are deferred until a DRA-capable test cluster (with a DRA driver installed) is

		// +kubebuilder:validation:XValidation:rule="self.all(c, has(c.resourceClaimName) && c.resourceClaimName != '' \|\| has(c.resourceClaimTemplateName) && c.resourceClaimTemplateName != '')", message="each claim must set either resourceClaimName or resourceClaimTemplateName"
		// +kubebuilder:validation:XValidation:rule="self.all(c, !(has(c.resourceClaimName) && c.resourceClaimName != '' && has(c.resourceClaimTemplateName) && c.resourceClaimTemplateName != ''))", message="each claim must set only one of resourceClaimName or resourceClaimTemplateName"

		apiVersion: resource.k8s.io/v1beta2
		kind: ResourceClaimTemplate

		graduated to GA in Kubernetes 1.34, establishing a new standard for device scheduling that
		supersedes extended resources for GPUs and accelerators. Users need the ability to configure

Conversation

Sridhar1030 commented May 24, 2026

Uh oh!

google-oss-prow Bot commented May 24, 2026

Uh oh!

github-actions Bot commented May 24, 2026

Uh oh!

kannon92 commented May 24, 2026

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

kannon92 left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Sridhar1030 commented Jun 4, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants