Skip to content

[WIP] feat(api): KEP-2782: Dynamic Resource Allocation support for TrainJob#3540

Open
Sridhar1030 wants to merge 4 commits into
kubeflow:masterfrom
Sridhar1030:kep/2782-dra-support
Open

[WIP] feat(api): KEP-2782: Dynamic Resource Allocation support for TrainJob#3540
Sridhar1030 wants to merge 4 commits into
kubeflow:masterfrom
Sridhar1030:kep/2782-dra-support

Conversation

@Sridhar1030
Copy link
Copy Markdown
Contributor

Related: #2782
Ref issues: kubernetes/enhancements#5729, kubernetes/enhancements#4671

This is the initial KEP to support DRA (Dynamic Resource Allocation) in Kubeflow Trainer and TrainJob.

DRA graduated to GA in Kubernetes 1.34. This KEP proposes adding a resourceClaims field to PodSpecPatch so that:

  • Platform admins can pre-configure ResourceClaimTemplates in ClusterTrainingRuntime templates
  • Data scientists can override or add claims via runtimePatches in TrainJob

The existing strategic merge patch pipeline already handles corev1.PodSpec.ResourceClaims natively, so the implementation requires only an API surface change with no controller logic modifications.

The KEP covers:

  • Pod-level ResourceClaims via PodSpecPatch (Phase 1)
  • CEL validation for claim source fields
  • Known limitations (torch plugin GPU detection, ContainerPatch gap)
  • Migration path from extended resources
  • Future work: PodGroup-level claims, ComputeDomain integration (Phase 2, after k8s 1.36+)

@google-oss-prow
Copy link
Copy Markdown

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign tenzen-y for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@google-oss-prow google-oss-prow Bot requested a review from kuizhiqing May 24, 2026 15:13
@github-actions
Copy link
Copy Markdown

🎉 Welcome to the Kubeflow Trainer! 🎉

Thanks for opening your first PR! We're happy to have you as part of our community 🚀

Here's what happens next:

  • If you haven't already, please check out our Contributing Guide for repo-specific guidelines and the Kubeflow Contributor Guide for general community standards.
  • Our team will review your PR soon! cc @kubeflow/kubeflow-trainer-team

Join the community:

Feel free to ask questions in the comments if you need any help or clarification!
Thanks again for contributing to Kubeflow! 🙏

@kannon92
Copy link
Copy Markdown
Contributor

/assign


#### E2E tests

E2E tests are deferred until a DRA-capable test cluster (with a DRA driver installed) is
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For Kueue, we use the dra example driver to do e2e testing without relying on a GPU. I think integration/unit are probably fine here but just wanted to call this out.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the pointer! That's great to know. I'll update the E2E test section to note that the dra-example-driver can be used for E2E testing without real GPUs, similar to how Kueue handles it.

Copy link
Copy Markdown
Contributor

@kannon92 kannon92 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Really well done proposal. I don't really have any concerns on this one.


1. **PodGroup-level ResourceClaims.** Multi-node topology-aware allocation (e.g., NVL72,
ComputeDomains) depends on upstream Kubernetes work (DRAWorkloadResourceClaims feature gate,
expected in k8s 1.36+) and the Trainer WAS KEP ([#3219](https://github.com/kubeflow/trainer/pull/3219)).
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to confirm the dependency chain for Phase 2, is the actual blocker Trainer's WAS KEP (#3219) rather than upstream k8s? IIUC KEP-5729 is already Beta in 1.37, so it seems like once #3219 lands we're good to go?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question. Phase 2 needs both: upstream KEP-5729 to stabilize (currently alpha in 1.36, 1.37 is still TBD per the KEP issue) AND Trainer's WAS KEP (#3219) to land.
I'll update the text to make this dependency chain clearer.
Phase 1 is independent of both, which is why we're proceeding with it now.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually KEP-5729 Beta PRs for 1.37 are already merged/tracked per the issue, it's not TBD. I think the real remaining gate is #3219 landing in Trainer. Can you mention it clearly

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it, I'll update the text to make #3219 the primary Trainer-side blocker for Phase 2 and reflect the upstream status more accurately.

`ResourceClaim` objects and exposed to containers via CDI device injection.

**Impact:** When a training job uses DRA claims exclusively (no extended resources like
`nvidia.com/gpu`), `GetNumGPUPerNode()` returns 0. If `numProcPerNode` is not explicitly
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a plan to add a webhook warning if someone uses DRA claims without setting numProcPerNode explicitly?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good idea. We could add a webhook warning (not rejection) when resourceClaims are present in the merged PodSpec but numProcPerNode is not explicitly set in the Trainer spec. I'll add this to the KEP as a recommended follow-up for Phase 1, since it's a small validation check that could save users from a confusing failure mode.

acceptable: wrapping would diverge from Kubernetes semantics, and v1alpha1 allows breaking
changes if upstream adds fields that conflict.

#### ContainerPatch and container-level claim references
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the expectation that every DRA-ready ClusterTrainingRuntime must have resources.claims pre-wired in the container spec?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is adding Resources to ContainerPatch tracked anywhere, or is it strictly Phase 2?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, for Phase 1, DRA-ready ClusterTrainingRuntime templates need both pod-level resourceClaims AND container-level resources.claims pre-wired by the admin. Kubernetes requires containers to explicitly reference the claims they consume, and since ContainerPatch doesn't expose Resources, users can't add those references via runtimePatches.

Adding Resources to ContainerPatch is mentioned in the KEP under Known Limitations and Future Work. It's not strictly Phase 2 (no dependency on WAS/PodGroups), so it could land as a fast follow-up to Phase 1.

1. Admin defines `ClusterTrainingRuntime` with `resourceClaims` in the full `corev1.PodSpec`
template, and sets container-level `resources.claims` references.
2. User creates `TrainJob` with `runtimePatches` containing `PodSpecPatch.ResourceClaims`.
3. Controller calls `mergeRuntimePatches()` in `pkg/runtime/core/trainingruntime.go`:
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you considered if there are any conflicts with : #3428
I think The DRA merge flow needs to be updated to reflect KEP-2599..according this KEP , the source for mergeRuntimePatches() is now the snapshot ConfigMap, not the live runtime.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

specifically whether ValidateObjects() and mergeRuntimePatches() will use the snapshot or the live runtime, and whether Story 1 (admin-configured claims frozen at snapshot time) is called out as expected behavior.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I've read through KEP-2599. The snapshot doesn't conflict with DRA since mergeRuntimePatches() is source-agnostic; it takes a JobSetTemplateSpec and applies strategic merge patch regardless of whether the input came from the live runtime or the snapshot ConfigMap.

To your specific points:

  • ValidateObjects() runs at admission time before the TrainJob is reconciled, so it reads the live runtime (no snapshot exists yet at that point)
  • mergeRuntimePatches() during reconciliation will use the snapshot as its source
  • Admin-configured DRA claims being frozen at snapshot time is expected and desirable behavior. An admin changing a runtime's claim from A100 to H100 won't affect running jobs.

I'll update the flow diagram and design details to reflect the snapshot-based flow.

@Sridhar1030 Sridhar1030 changed the title [WIP] feat(api): KEP-2782: Dynamic Resource Allocation support for TrainJob feat(api): KEP-2782: Dynamic Resource Allocation support for TrainJob May 27, 2026
@Sridhar1030 Sridhar1030 changed the title feat(api): KEP-2782: Dynamic Resource Allocation support for TrainJob [WIP] feat(api): KEP-2782: Dynamic Resource Allocation support for TrainJob May 27, 2026
Sridhar1030 and others added 3 commits June 4, 2026 23:06
This KEP proposes adding a resourceClaims field to PodSpecPatch,
enabling pod-level DRA support for training workloads.
Related: kubeflow#2782

Signed-off-by: Sridhar1030 <sridharpillai75@gmail.com>
- Clarify Phase 2 dependency: kubeflow#3219 is primary Trainer-side blocker,
  KEP-5729 beta targeting 1.37
- Update flow diagram and step-by-step for KEP-2599 snapshot mechanism
- Add dra-example-driver reference for E2E testing (per kannon92)
- Add webhook warning recommendation for DRA + numProcPerNode
- Tighten ContainerPatch limitation with MUST language

Signed-off-by: Sridhar1030 <sridharpillai75@gmail.com>
Signed-off-by: Sridhar1030 <sridharpillai75@gmail.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
@Sridhar1030 Sridhar1030 force-pushed the kep/2782-dra-support branch from 729245d to f451edc Compare June 4, 2026 17:37
@Sridhar1030 Sridhar1030 marked this pull request as ready for review June 4, 2026 17:38
Copilot AI review requested due to automatic review settings June 4, 2026 17:38
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds a KEP proposal describing Dynamic Resource Allocation (DRA) support for Kubeflow Trainer by documenting the intended API surface change (PodSpecPatch.resourceClaims), merge/validation approach, limitations, and a test plan.

Changes:

  • Introduces a new KEP document explaining DRA concepts, user stories, and design details for Trainer integration.
  • Specifies proposed CRD/CEL validation rules and a framework plugin validation approach.
  • Documents strategic-merge behavior, operational edge cases, and known limitations (notably Torch GPU auto-detection).

Comment on lines +419 to +420
// +kubebuilder:validation:XValidation:rule="self.all(c, has(c.resourceClaimName) && c.resourceClaimName != '' || has(c.resourceClaimTemplateName) && c.resourceClaimTemplateName != '')", message="each claim must set either resourceClaimName or resourceClaimTemplateName"
// +kubebuilder:validation:XValidation:rule="self.all(c, !(has(c.resourceClaimName) && c.resourceClaimName != '' && has(c.resourceClaimTemplateName) && c.resourceClaimTemplateName != ''))", message="each claim must set only one of resourceClaimName or resourceClaimTemplateName"
Comment on lines +113 to +114
apiVersion: resource.k8s.io/v1beta2
kind: ResourceClaimTemplate
Comment on lines +10 to +11
graduated to GA in Kubernetes 1.34, establishing a new standard for device scheduling that
supersedes extended resources for GPUs and accelerators. Users need the ability to configure
Signed-off-by: Sridhar1030 <sridharpillai75@gmail.com>
@Sridhar1030
Copy link
Copy Markdown
Contributor Author

Hey @andreyvelich , I have addressed all the above reviews , would love your comments/feedback when you get a moment

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants