[WIP] feat(api): KEP-2782: Dynamic Resource Allocation support for TrainJob#3540
[WIP] feat(api): KEP-2782: Dynamic Resource Allocation support for TrainJob#3540Sridhar1030 wants to merge 4 commits into
Conversation
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
|
🎉 Welcome to the Kubeflow Trainer! 🎉 Thanks for opening your first PR! We're happy to have you as part of our community 🚀 Here's what happens next:
Join the community:
Feel free to ask questions in the comments if you need any help or clarification! |
|
/assign |
|
|
||
| #### E2E tests | ||
|
|
||
| E2E tests are deferred until a DRA-capable test cluster (with a DRA driver installed) is |
There was a problem hiding this comment.
For Kueue, we use the dra example driver to do e2e testing without relying on a GPU. I think integration/unit are probably fine here but just wanted to call this out.
There was a problem hiding this comment.
Thanks for the pointer! That's great to know. I'll update the E2E test section to note that the dra-example-driver can be used for E2E testing without real GPUs, similar to how Kueue handles it.
kannon92
left a comment
There was a problem hiding this comment.
LGTM
Really well done proposal. I don't really have any concerns on this one.
|
|
||
| 1. **PodGroup-level ResourceClaims.** Multi-node topology-aware allocation (e.g., NVL72, | ||
| ComputeDomains) depends on upstream Kubernetes work (DRAWorkloadResourceClaims feature gate, | ||
| expected in k8s 1.36+) and the Trainer WAS KEP ([#3219](https://github.com/kubeflow/trainer/pull/3219)). |
There was a problem hiding this comment.
Good question. Phase 2 needs both: upstream KEP-5729 to stabilize (currently alpha in 1.36, 1.37 is still TBD per the KEP issue) AND Trainer's WAS KEP (#3219) to land.
I'll update the text to make this dependency chain clearer.
Phase 1 is independent of both, which is why we're proceeding with it now.
There was a problem hiding this comment.
Actually KEP-5729 Beta PRs for 1.37 are already merged/tracked per the issue, it's not TBD. I think the real remaining gate is #3219 landing in Trainer. Can you mention it clearly
There was a problem hiding this comment.
Got it, I'll update the text to make #3219 the primary Trainer-side blocker for Phase 2 and reflect the upstream status more accurately.
| `ResourceClaim` objects and exposed to containers via CDI device injection. | ||
|
|
||
| **Impact:** When a training job uses DRA claims exclusively (no extended resources like | ||
| `nvidia.com/gpu`), `GetNumGPUPerNode()` returns 0. If `numProcPerNode` is not explicitly |
There was a problem hiding this comment.
Is there a plan to add a webhook warning if someone uses DRA claims without setting numProcPerNode explicitly?
There was a problem hiding this comment.
That's a good idea. We could add a webhook warning (not rejection) when resourceClaims are present in the merged PodSpec but numProcPerNode is not explicitly set in the Trainer spec. I'll add this to the KEP as a recommended follow-up for Phase 1, since it's a small validation check that could save users from a confusing failure mode.
| acceptable: wrapping would diverge from Kubernetes semantics, and v1alpha1 allows breaking | ||
| changes if upstream adds fields that conflict. | ||
|
|
||
| #### ContainerPatch and container-level claim references |
There was a problem hiding this comment.
Is the expectation that every DRA-ready ClusterTrainingRuntime must have resources.claims pre-wired in the container spec?
There was a problem hiding this comment.
Is adding Resources to ContainerPatch tracked anywhere, or is it strictly Phase 2?
There was a problem hiding this comment.
Yes, for Phase 1, DRA-ready ClusterTrainingRuntime templates need both pod-level resourceClaims AND container-level resources.claims pre-wired by the admin. Kubernetes requires containers to explicitly reference the claims they consume, and since ContainerPatch doesn't expose Resources, users can't add those references via runtimePatches.
Adding Resources to ContainerPatch is mentioned in the KEP under Known Limitations and Future Work. It's not strictly Phase 2 (no dependency on WAS/PodGroups), so it could land as a fast follow-up to Phase 1.
| 1. Admin defines `ClusterTrainingRuntime` with `resourceClaims` in the full `corev1.PodSpec` | ||
| template, and sets container-level `resources.claims` references. | ||
| 2. User creates `TrainJob` with `runtimePatches` containing `PodSpecPatch.ResourceClaims`. | ||
| 3. Controller calls `mergeRuntimePatches()` in `pkg/runtime/core/trainingruntime.go`: |
There was a problem hiding this comment.
Have you considered if there are any conflicts with : #3428
I think The DRA merge flow needs to be updated to reflect KEP-2599..according this KEP , the source for mergeRuntimePatches() is now the snapshot ConfigMap, not the live runtime.
There was a problem hiding this comment.
specifically whether ValidateObjects() and mergeRuntimePatches() will use the snapshot or the live runtime, and whether Story 1 (admin-configured claims frozen at snapshot time) is called out as expected behavior.
There was a problem hiding this comment.
Yes, I've read through KEP-2599. The snapshot doesn't conflict with DRA since mergeRuntimePatches() is source-agnostic; it takes a JobSetTemplateSpec and applies strategic merge patch regardless of whether the input came from the live runtime or the snapshot ConfigMap.
To your specific points:
ValidateObjects()runs at admission time before the TrainJob is reconciled, so it reads the live runtime (no snapshot exists yet at that point)mergeRuntimePatches()during reconciliation will use the snapshot as its source- Admin-configured DRA claims being frozen at snapshot time is expected and desirable behavior. An admin changing a runtime's claim from A100 to H100 won't affect running jobs.
I'll update the flow diagram and design details to reflect the snapshot-based flow.
This KEP proposes adding a resourceClaims field to PodSpecPatch, enabling pod-level DRA support for training workloads. Related: kubeflow#2782 Signed-off-by: Sridhar1030 <sridharpillai75@gmail.com>
- Clarify Phase 2 dependency: kubeflow#3219 is primary Trainer-side blocker, KEP-5729 beta targeting 1.37 - Update flow diagram and step-by-step for KEP-2599 snapshot mechanism - Add dra-example-driver reference for E2E testing (per kannon92) - Add webhook warning recommendation for DRA + numProcPerNode - Tighten ContainerPatch limitation with MUST language Signed-off-by: Sridhar1030 <sridharpillai75@gmail.com>
Signed-off-by: Sridhar1030 <sridharpillai75@gmail.com> Co-authored-by: Cursor <cursoragent@cursor.com>
729245d to
f451edc
Compare
There was a problem hiding this comment.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Adds a KEP proposal describing Dynamic Resource Allocation (DRA) support for Kubeflow Trainer by documenting the intended API surface change (PodSpecPatch.resourceClaims), merge/validation approach, limitations, and a test plan.
Changes:
- Introduces a new KEP document explaining DRA concepts, user stories, and design details for Trainer integration.
- Specifies proposed CRD/CEL validation rules and a framework plugin validation approach.
- Documents strategic-merge behavior, operational edge cases, and known limitations (notably Torch GPU auto-detection).
| // +kubebuilder:validation:XValidation:rule="self.all(c, has(c.resourceClaimName) && c.resourceClaimName != '' || has(c.resourceClaimTemplateName) && c.resourceClaimTemplateName != '')", message="each claim must set either resourceClaimName or resourceClaimTemplateName" | ||
| // +kubebuilder:validation:XValidation:rule="self.all(c, !(has(c.resourceClaimName) && c.resourceClaimName != '' && has(c.resourceClaimTemplateName) && c.resourceClaimTemplateName != ''))", message="each claim must set only one of resourceClaimName or resourceClaimTemplateName" |
| apiVersion: resource.k8s.io/v1beta2 | ||
| kind: ResourceClaimTemplate |
| graduated to GA in Kubernetes 1.34, establishing a new standard for device scheduling that | ||
| supersedes extended resources for GPUs and accelerators. Users need the ability to configure |
Signed-off-by: Sridhar1030 <sridharpillai75@gmail.com>
|
Hey @andreyvelich , I have addressed all the above reviews , would love your comments/feedback when you get a moment |
Related: #2782
Ref issues: kubernetes/enhancements#5729, kubernetes/enhancements#4671
This is the initial KEP to support DRA (Dynamic Resource Allocation) in Kubeflow Trainer and TrainJob.
DRA graduated to GA in Kubernetes 1.34. This KEP proposes adding a
resourceClaimsfield toPodSpecPatchso that:runtimePatchesin TrainJobThe existing strategic merge patch pipeline already handles
corev1.PodSpec.ResourceClaimsnatively, so the implementation requires only an API surface change with no controller logic modifications.The KEP covers: