
fix(operator): guard against nil trainer PodSet in JAX EnforceMLPolicy#3563

Open
immanuwell wants to merge 1 commit into kubeflow:master from immanuwell:fix/jax-nil-trainer-podset

@immanuwell

What this PR does / why we need it:

EnforceMLPolicy in the JAX plugin calls info.FindPodSetByAncestor(constants.AncestorTrainer), which returns a *PodSet that is nil when no replicatedJob carries the trainer ancestor label. The next line dereferences the result directly (trainerPS.Count != nil), so a missing trainer PodSet causes a nil pointer dereference panic at runtime.
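
A minimal Go sketch of the failure mode, assuming a heavily simplified PodSet type that keeps only a Count *int32 field. The names PodSet, enforceUnguarded, and panics are hypothetical, and this description does not say what the real EnforceMLPolicy does with Count, so the assignment below is only a placeholder. The point it illustrates is that reading a field through a nil pointer panics in Go:

```go
package main

import "fmt"

// PodSet is a hypothetical, heavily simplified stand-in for the runtime
// PodSet type; only the Count field matters for this illustration.
type PodSet struct {
	Count *int32
}

// enforceUnguarded mirrors the shape of the buggy code path: it reads
// trainerPS.Count without first checking trainerPS itself.
// The numNodes assignment is a placeholder, not the real policy logic.
func enforceUnguarded(trainerPS *PodSet, numNodes int32) {
	if trainerPS.Count != nil { // panics when trainerPS == nil
		*trainerPS.Count = numNodes
	}
}

// panics reports whether f panics, recovering so the program keeps running.
func panics(f func()) (didPanic bool) {
	defer func() {
		if r := recover(); r != nil {
			didPanic = true
		}
	}()
	f()
	return false
}

func main() {
	var missing *PodSet // what the lookup yields when no trainer ancestor exists
	fmt.Println(panics(func() { enforceUnguarded(missing, 2) })) // true
}
```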

The fix adds a trainerPS != nil guard, matching the pattern already used in the Torch plugin. The XGBoost plugin uses a similar early-return guard; only the JAX plugin was missing it.
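
The two guard styles mentioned above can be sketched as follows, again with a hypothetical simplified PodSet type and placeholder logic for what happens to Count (enforceGuarded and enforceEarlyReturn are illustrative names, not the plugins' real functions). Because && short-circuits in Go, trainerPS.Count is never evaluated when trainerPS is nil:

```go
package main

import "fmt"

// PodSet is a hypothetical, simplified stand-in for the runtime PodSet type.
type PodSet struct {
	Count *int32
}

// enforceGuarded folds the nil check into the condition, the inline style
// the PR attributes to the Torch plugin. && short-circuits, so Count is only
// read when trainerPS is non-nil.
func enforceGuarded(trainerPS *PodSet, numNodes int32) {
	if trainerPS != nil && trainerPS.Count != nil {
		*trainerPS.Count = numNodes // placeholder for the real policy logic
	}
}

// enforceEarlyReturn shows the early-return style the PR attributes to the
// XGBoost plugin: bail out before touching any field of trainerPS.
func enforceEarlyReturn(trainerPS *PodSet, numNodes int32) {
	if trainerPS == nil {
		return // nothing to enforce without a trainer PodSet
	}
	if trainerPS.Count != nil {
		*trainerPS.Count = numNodes // placeholder for the real policy logic
	}
}

func main() {
	enforceGuarded(nil, 2)     // no panic: the nil check short-circuits
	enforceEarlyReturn(nil, 2) // no panic: returns before reading Count

	var count int32 = 1
	ps := &PodSet{Count: &count}
	enforceGuarded(ps, 4)
	fmt.Println(count) // 4
}
```

Both variants behave the same in the nil case; the inline form keeps the change to a single line, which suits a minimal fix.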

Reproduce:

Create a ClusterTrainingRuntime with a JAX mlPolicy but no replicatedJob labeled trainer.kubeflow.org/trainjob-ancestor-step: trainer, then submit a TrainJob that references it. The controller panics while processing EnforceMLPolicy.
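
To see why the lookup comes back nil in this scenario, here is a sketch of an ancestor lookup keyed on the label from the reproduction steps. ReplicatedJob, PodSet, buildPodSets, and findPodSetByAncestor are hypothetical simplified stand-ins, and the real types and lookup code are not reproduced here; only the label key and the trainer value come from the steps above:

```go
package main

import "fmt"

// ancestorLabel is the label key named in the reproduction steps.
const ancestorLabel = "trainer.kubeflow.org/trainjob-ancestor-step"

// ReplicatedJob and PodSet are hypothetical simplified types used only to
// model the lookup contract described in the PR.
type ReplicatedJob struct {
	Name   string
	Labels map[string]string
}

type PodSet struct {
	Name     string
	Ancestor string // value of ancestorLabel, or "" if the label is absent
}

// buildPodSets records each replicatedJob's ancestor label value.
// Reading a missing key (or from a nil map) yields "".
func buildPodSets(jobs []ReplicatedJob) []PodSet {
	out := make([]PodSet, 0, len(jobs))
	for _, j := range jobs {
		out = append(out, PodSet{Name: j.Name, Ancestor: j.Labels[ancestorLabel]})
	}
	return out
}

// findPodSetByAncestor returns a pointer to the first PodSet whose ancestor
// matches, or nil when none does: the case the JAX plugin must handle.
func findPodSetByAncestor(podSets []PodSet, ancestor string) *PodSet {
	for i := range podSets {
		if podSets[i].Ancestor == ancestor {
			return &podSets[i]
		}
	}
	return nil
}

func main() {
	// A runtime whose only replicatedJob lacks the trainer ancestor label,
	// matching the reproduction scenario.
	jobs := []ReplicatedJob{{Name: "node", Labels: map[string]string{}}}
	ps := findPodSetByAncestor(buildPodSets(jobs), "trainer")
	fmt.Println(ps == nil) // true
}
```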

Alternatively, revert the one-line fix and run the new unit test case "no panic when JAX policy is set but trainer PodSet is absent"; it will panic.

Which issue(s) this PR fixes:
Fixes #

Copilot AI review requested due to automatic review settings May 30, 2026 11:28
@github-actions

🎉 Welcome to the Kubeflow Trainer! 🎉

Thanks for opening your first PR! We're happy to have you as part of our community 🚀

Here's what happens next:

  • If you haven't already, please check out our Contributing Guide for repo-specific guidelines and the Kubeflow Contributor Guide for general community standards.
  • Our team will review your PR soon! cc @kubeflow/kubeflow-trainer-team

Join the community:

Feel free to ask questions in the comments if you need any help or clarification!
Thanks again for contributing to Kubeflow! 🙏

@google-oss-prow

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign terrytangyuan for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Signed-off-by: Immanuel Tikhonov <pchpr.00@list.ru>
Signed-off-by: immanuwell <pchpr.00@list.ru>
immanuwell force-pushed the fix/jax-nil-trainer-podset branch from fcbfac1 to a8d9318 on May 30, 2026 11:29