diff --git a/docs/proposals/3328-automatic-resource-configuration/README.md b/docs/proposals/3328-automatic-resource-configuration/README.md new file mode 100644 index 0000000000..a55414b85f --- /dev/null +++ b/docs/proposals/3328-automatic-resource-configuration/README.md @@ -0,0 +1,392 @@ +# KEP-3328: Automatic Resource Configuration + + + +## Authors + +- Vassilis Vassiliadis - [@VassilisVassiliadis](https://github.com/VassilisVassiliadis) + +## Summary + + + +An extensible mechanism for automatically configuring TrainJob resource requests +(starting with GPUs, CPUs, memory, and replicas) before a TrainJob becomes eligible +for admission and scheduling. The mechanism delegates recommendations to plugin +external controllers that compute a recommended configuration and write it back to +the TrainJob. + +At a high level: +1. Kubeflow Trainer owns the protocol for mutating the TrainJob (admission gating, + API, guardrails, etc.) +2. Plugin external controllers own the logic for generating the resource + configurations. + +## Motivation + + + +Resource configuration of TrainJobs is not a minor detail. It is a manual decision +that directly affects queueing time, cost, and success rate. ML practitioners guess +GPU count and memory headroom following the recommendations of platform teams, but +these recommendations can go stale. The problem is exacerbated in multi-tenant +Kubernetes clusters where users compete against each other for shared, limited, and +expensive accelerators like GPUs. + +There are two undesired scenarios: +- Under-requesting resources can cause GPU out-of-memory errors, potentially after + the TrainJob has already executed for a while. +- Over-requesting increases queue time and reduces cluster utilization, which can + make the platform feel slower and more expensive than necessary. + +Intuitively, Kubernetes can schedule what the user asks for, but it cannot help the +user size their jobs. Vertical and horizontal pod autoscaling don't address this +scenario out of the box because they mutate only the resource requirement fields of +downstream Pod objects but not the TrainJob object that we want to right-size. What +they are missing is the ability to mutate the TrainJob object itself to ensure +proper integration with frameworks like Kueue, as well as update fields that depend +on the resource requirements. For example, `accelerate launch` has command line +arguments that specify the number of processes to use which typically map to the +number of GPUs. + +Fortunately, it is possible to leverage the characteristics of a TrainJob or even +the state of the cluster to automate the resource configuration of a TrainJob. As +such, this KEP proposes the use of external Kubernetes controllers acting as +plugins that produce resource recommendations while Kubeflow Trainer orchestrates +applying these recommendations to TrainJob objects safely. + + +### Goals + + + +- Automatically configure TrainJob resources before the job is eligible for admission: + initially once, right after creation. Starting with GPU, CPU, memory, and the number + of node replicas. +- Enforce guardrails to the operations of plugins: timeouts, quota caps, fallback + strategies, etc. +- Allow plugins to set TrainJob parameters that depend on resources: + e.g., accelerate/torchrun derived flags. +- Integrate with Kueue so that the job is not eligible for Kueue's admission + flow while it is being auto-configured. +- Support a catalog of plugins: platform engineers can enable/disable + plugins per namespace and users can pick the one they'd like to use for their TrainJob. +- Provide observability: events and TrainJob status fields reflecting plugin activity. + +### Non-Goals + + + +- Implement or standardize recommender policies inside Kubeflow Trainer: Kubeflow owns + the protocol, plugins own the resource recommendation logic/policy. +- Continuously reconfigure resources while the job is pending or after it has started + running: this is useful but more complex. It could be future work. +- Implement a GUI for the catalog: users can discover available plugins using + kubectl, so we need not complicate things now. +- Support TrainJobs that use DRAs: It could be future work. + +## Proposal + + + +Kubeflow Trainer implements the protocol that external controllers acting as plugins +use to mutate the resource requirements of TrainJobs that opt into this process. + +TrainJobs that do not opt into automatic resource mutation are treated by Kubeflow +Trainer as usual. + +The protocol for TrainJobs that opt into automatic resource mutation consists of 4 steps: +1. Kubeflow Trainer gates the admission/scheduling of the TrainJob while the plugin + is working on mutating the resource requirements of the TrainJob. +2. The plugin that the TrainJob opts into updates the resource requirements of the TrainJob + as well as other fields that depend on the resource requirements (e.g., cmdline args, etc.). +3. Kubeflow Trainer applies guardrails on the patch from the plugin and emits + observability events and status updates. +4. Kubeflow Trainer un-gates the TrainJob so that the underlying Kubernetes objects are + eventually scheduled. + +### User Stories (Optional) + + + +#### Story 1 + +As a TrainJob user fine-tuning models on my organization's multi-tenant Kubernetes cluster, +I want to delegate the configuration of my TrainJob's resource requirements to an AI agent. +The platform engineers managing my cluster have already deployed a Kubernetes controller +which has the id `ai-agent-recommender`. + +It is an AI agent powered resource recommender for TrainJobs. The platform engineers +provided the AI agent with Model Context Protocol (MCP) tools that can interact with +the cluster to get more information about it (e.g. the available GPUs, the running +workloads, etc). + +I write my TrainJob like I usually do and simply add the following annotation to it: + +```yaml +metadata: + annotations: + trainer.kubeflow.org/autoconf-plugin: ai-agent-recommender +``` + +Soon after the TrainJob is submitted, the AI agent uses the Kubeflow API to patch my +TrainJob's resource requirements and fields that depend on them. I can use +`kubectl describe trainjob my-trainjob` to see the changes that the agent made. + + +### Notes/Constraints/Caveats (Optional) + + + +1. Mutating the resources may require changing command-line arguments e.g. + the number of processes for `accelerate launch` or `torchrun`. +2. The API that a plugin uses must enable associating the changes to the plugin: we + can use the `spec.runtimePatches` API from KEP-2170; it requires extending it to + support mutating the CLI args and the resource requirements. +3. JobSet objects do not support mutating resource requirements. + We should either not create a JobSet until the plugin controller has finished patching + the TrainJob, or we should delete and recreate the JobSet after the plugin + has finished patching the TrainJob. + + +### Risks and Mitigations + + + +1. Kueue supports an alpha feature for gating the admission of jobs called [AdmissionGatedBy](https://kueue.sigs.k8s.io/docs/reference/labels-and-annotations/#kueuex-k8sioadmission-gated-by). + AdmissionGatedBy is behind a feature gate and is off by default. Thus, Trainer cannot + guarantee that Kueue will respect the `kueue.x-k8s.io/admission-gated-by` annotation. + This means that there is a chance for Kueue to admit a TrainJob before a plugin has + finished patching the TrainJob. To mitigate this risk, Kubeflow Trainer should not + un-suspend the JobSet object before the plugin has finished patching the associated TrainJob. +2. There could be other systems like Volcano managing the TrainJob. The proposal will still + function properly because Trainer will not un-suspend the JobSet object until it is + ready to be scheduled. +3. There is a chance that while the TrainJob is being patched by the plugin, Kueue, Volcano, + or an equivalent framework preempts a different running workload to make room for the + new TrainJob based on the original resource requirements of the job, which the plugin + controller may change. The chances of this happening should be small. Even if it does + happen, this is not a new issue that this proposal introduces. This is something that + would have happened anyway. In fact, an intelligent plugin controller could detect + that the capacity of the cluster is not enough for a TrainJob and size the TrainJob + in a way that avoids preempting other workloads. + +## Design Details + + + +TBD + +### Test Plan + + + +[x] I/we understand the owners of the involved components may require updates to +existing tests to make this code solid enough prior to committing the changes necessary +to implement this enhancement. + +#### Prerequisite testing updates + + + +#### Unit Tests + + + + + +- ``: `` - `` + +#### E2E tests + + + +#### Integration tests + + + +### Graduation Criteria + + + +## Implementation History + + + +- 2026-05-28: KEP Creation + +## Drawbacks + + + +The main disadvantage of this approach is that it requires introducing a different code +path for TrainJobs. It also requires updating the Trainer API to account for changes +to the resource requirements of TrainJobs as well as CLI args. + +## Alternatives + + + +### Proposed approach: External plugin controllers + +- **Pros:** + - **Extensible**: platform engineers can develop and deploy custom plugins without modifying Kubeflow Trainer. + - **Safe**: Trainer enforces guardrails (timeouts, quota caps) on plugin operations. + - **Observable**: `spec.runtimePatches` shows the exact fields that the plugin is mutating, + events and status updates offer an additional layer of visibility into the plugin's actions. +- **Cons:** + - **Adds complexity**: requires gating/un-gating mechanism in Trainer to coordinate with admission/scheduling + frameworks like Kueue and Kubernetes. + - **API extension needed**: requires extending `spec.runtimePatches` to support resource and CLI arg mutations. + +### Bake recommenders into Kubeflow Trainer + +- **Pros:** + - **Reduced complexity**: Kubeflow Trainer wouldn't have to gate and un-gate the TrainJob. + On first reconciliation, it would only apply the policy and then mutate the TrainJob. +- **Cons:** + - **Limited extensibility**: Policies are workload-specific and rapidly evolving. Users would need to fork Kubeflow Trainer + in order to use their custom policies. + +### Use a mutating webhook instead of external controllers to patch the TrainJobs + +- **Pros:** + - **No Trainer changes**: We wouldn't need to update Kubeflow Trainer at all. + - **Guaranteed ordering**: The mutating webhook would ensure that the TrainJob is patched before + it is considered for admission or scheduling. +- **Cons:** + - **Server Stability**: A recommender could take too long to come up with a suggestion (for example, it could + run canary tests that take minutes). We do not want to increase the critical path of + mutating webhooks, as that could compromise the stability of the cluster. + - **Higher complexity**: The barrier of entry for plugin developers would be much higher because it is much more + complex, harder, and riskier to implement a webhook compared to a controller. + +### Users set the resource requirements of jobs following best practices and documentation + +- **Pros:** + - **No Trainer changes**: We wouldn't need to update Kubeflow Trainer at all. +- **Cons:** + - **Manual effort**: Users would have to know the best practices and set the resource requirements correctly. + - **Operational pain**: We would still have repeated operational pain. + +### Introduce a new CRD (e.g., `AutoConfiguredTrainJob`) which wraps a `TrainJob` + +- **Pros:** + - **Reduced complexity**: This would make the Trainer feature less complex because it wouldn't have to gate and + un-gate the TrainJob. It would only need to create the TrainJob once after the plugin has + finished patching the `AutoConfiguredTrainJob`. +- **Cons:** + - **Additional CRD**: There would be yet another CRD to manage. + - **Different API**: Users would need to create a different resource instead of a TrainJob.