fix(flytek8s): classify ImagePullBackOff caused by registry 429 as system error #7362

Open
pingsutw wants to merge 1 commit into flyteorg:main from pingsutw:flytek8s-image-pull-429-system-error

@pingsutw
Member

Tracking issue

N/A

Why are the changes needed?

Container runtimes surface registry rate limiting (HTTP 429 Too Many Requests) through the ImagePullBackOff waiting reason. Today, once the ImagePullBackoffGracePeriod elapses, DemystifyPending returns PhaseInfoRetryableFailureWithCleanup, which is classified as a USER error and consumes the user's retry budget.

429s from the registry are an infrastructure problem outside the user's control — typically transient throttling on the registry side. Misclassifying them as user errors means a transient registry rate-limit can exhaust a task's user retries even though the user did nothing wrong.

Example real-world pod waiting message:

Back-off pulling image "us-central1-docker.pkg.dev/.../internal-gnina:...": ErrImagePull: failed to pull and unpack image "...": failed to copy: httpReadSeeker: failed open: unexpected status code https://us-central1-docker.pkg.dev/v2/...: 429 Too Many Requests

What changes were proposed in this pull request?

  • pod_helper.go: in classifyWaitingContainer, after the ImagePullBackOff grace period, detect the 429 Too Many Requests substring in the waiting message and return PhaseInfoSystemRetryableFailureWithCleanup instead of PhaseInfoRetryableFailureWithCleanup.
  • Add a small helper, isRegistryRateLimited(message string) bool.
  • All other ImagePullBackOff messages keep their existing user-error classification.

How was this patch tested?

  • New unit test TestDemystifyPending/ImagePullBackOffOutsideGracePeriod_RegistryRateLimited exercises a realistic 429 waiting message and asserts core.ExecutionError_SYSTEM.
  • Existing ImagePullBackOffOutsideGracePeriod (non-429) test continues to pass, confirming the default classification is unchanged.
  • go test ./go/tasks/pluginmachinery/flytek8s/... passes locally.
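
The two test cases above can be pictured as a small table of waiting messages and expected error kinds. The sketch below is a standalone illustration, not the test in pod_helper_test.go: classify is a hypothetical stand-in for the post-grace-period branch, the case names are borrowed from the test names in this PR, and the non-429 message is invented for the example. The real test goes through DemystifyPending and asserts core.ExecutionError_SYSTEM.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// classify is a hypothetical stand-in for the branch taken once the
// ImagePullBackOff grace period has elapsed.
func classify(message string) string {
	if strings.Contains(message, "429 Too Many Requests") {
		return "SYSTEM"
	}
	return "USER"
}

// testCase pairs a pod waiting message with the expected error kind.
type testCase struct {
	name    string
	message string
	want    string
}

func cases() []testCase {
	return []testCase{
		{
			name:    "ImagePullBackOffOutsideGracePeriod_RegistryRateLimited",
			message: `Back-off pulling image "...": ErrImagePull: failed to pull and unpack image "...": failed to copy: httpReadSeeker: failed open: unexpected status code https://us-central1-docker.pkg.dev/v2/...: 429 Too Many Requests`,
			want:    "SYSTEM",
		},
		{
			name:    "ImagePullBackOffOutsideGracePeriod",
			message: `Back-off pulling image "...": ErrImagePull: ...`, // hypothetical non-429 message
			want:    "USER",
		},
	}
}

// runCases returns one description per failing case; empty means all pass.
func runCases() []string {
	var failures []string
	for _, tc := range cases() {
		if got := classify(tc.message); got != tc.want {
			failures = append(failures, fmt.Sprintf("%s: got %s, want %s", tc.name, got, tc.want))
		}
	}
	return failures
}

func main() {
	failures := runCases()
	for _, f := range failures {
		fmt.Println("FAIL", f)
	}
	if len(failures) > 0 {
		os.Exit(1)
	}
	fmt.Println("all cases pass")
}
```

Keeping the non-429 case next to the new one guards against the match becoming too broad and reclassifying ordinary image pull failures.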

Labels

  • fixed

Setup process

N/A

Screenshots

N/A

Check all the applicable boxes

  • I updated the documentation accordingly.
  • All new and existing tests passed.
  • All commits are signed-off.

Related PRs

N/A

Docs link

N/A

…stem error

Container runtimes surface registry rate limiting (HTTP 429 Too Many
Requests) through the ImagePullBackOff waiting reason. Today, once the
ImagePullBackoffGracePeriod elapses, DemystifyPending returns
PhaseInfoRetryableFailureWithCleanup, which is classified as a USER
error and consumes the user's retry budget.

429s from the registry are an infrastructure problem outside the user's
control — typically a transient throttling event on the registry side.
Detect the "429 Too Many Requests" substring in the waiting message and
route those cases to PhaseInfoSystemRetryableFailureWithCleanup so they
do not exhaust the user's retries.

All other ImagePullBackOff messages keep their existing user-error
classification.

Signed-off-by: Kevin Su <pingsutw@apache.org>
Copilot AI review requested due to automatic review settings May 11, 2026 19:24
Contributor

Copilot AI left a comment

Pull request overview

Adjusts Flyte’s Kubernetes plugin (flytek8s) pending-pod demystification so that ImagePullBackOff events caused by registry HTTP 429 rate limiting are treated as system (infrastructure) retryable failures instead of user retryable failures, preventing accidental consumption of user retry budgets for transient registry throttling.

Changes:

  • In classifyWaitingContainer, after ImagePullBackOff exceeds the grace period, detect registry rate limiting via "429 Too Many Requests" in the waiting message and return PhaseInfoSystemRetryableFailureWithCleanup.
  • Add helper isRegistryRateLimited(message string) bool for the 429 detection.
  • Add a unit test covering the 429 ImagePullBackOff scenario and asserting core.ExecutionError_SYSTEM.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| flyteplugins/go/tasks/pluginmachinery/flytek8s/pod_helper.go | Classifies ImagePullBackOff + registry 429 as a system-retryable failure after the grace period. |
| flyteplugins/go/tasks/pluginmachinery/flytek8s/pod_helper_test.go | Adds a regression test ensuring 429-based ImagePullBackOff maps to ExecutionError_SYSTEM. |
