fix(flytek8s): classify ImagePullBackOff caused by registry 429 as system error #7362
Open
pingsutw wants to merge 1 commit into
…stem error

Container runtimes surface registry rate limiting (HTTP 429 Too Many Requests) through the ImagePullBackOff waiting reason. Today, once the ImagePullBackoffGracePeriod elapses, DemystifyPending returns PhaseInfoRetryableFailureWithCleanup, which is classified as a USER error and consumes the user's retry budget.

429s from the registry are an infrastructure problem outside the user's control, typically a transient throttling event on the registry side. Detect the "429 Too Many Requests" substring in the waiting message and route those cases to PhaseInfoSystemRetryableFailureWithCleanup so they do not exhaust the user's retries. All other ImagePullBackOff messages keep their existing user-error classification.

Signed-off-by: Kevin Su <pingsutw@apache.org>
Pull request overview
Adjusts Flyte’s Kubernetes plugin (flytek8s) pending-pod demystification so that ImagePullBackOff events caused by registry HTTP 429 rate limiting are treated as system (infrastructure) retryable failures instead of user retryable failures, preventing accidental consumption of user retry budgets for transient registry throttling.
Changes:
- In classifyWaitingContainer, after ImagePullBackOff exceeds the grace period, detect registry rate limiting via "429 Too Many Requests" in the waiting message and return PhaseInfoSystemRetryableFailureWithCleanup.
- Add helper isRegistryRateLimited(message string) bool for the 429 detection.
- Add a unit test covering the 429 ImagePullBackOff scenario and asserting core.ExecutionError_SYSTEM.
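The detection described above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: only the signature isRegistryRateLimited(message string) bool and the "429 Too Many Requests" substring come from the description, while the strings.Contains body, the errorKind type, and the classifyImagePullBackOff helper are assumptions made for the sketch. The real change returns Flyte PhaseInfo values rather than a bare enum.

```go
package main

import "strings"

// errorKind is a hypothetical stand-in for Flyte's execution error kinds
// (user vs. system); the real code returns PhaseInfo values instead.
type errorKind int

const (
	userError errorKind = iota
	systemError
)

// rateLimitMarker is the substring the PR description says is matched
// in the container waiting message.
const rateLimitMarker = "429 Too Many Requests"

// isRegistryRateLimited reports whether a waiting message looks like a
// registry rate-limit response. The signature follows the PR description;
// the body is an assumed plain substring check.
func isRegistryRateLimited(message string) bool {
	return strings.Contains(message, rateLimitMarker)
}

// classifyImagePullBackOff is a hypothetical helper showing the branch taken
// once the ImagePullBackOff grace period has elapsed: rate-limited pulls are
// treated as system errors, everything else keeps the user-error default.
func classifyImagePullBackOff(message string) errorKind {
	if isRegistryRateLimited(message) {
		return systemError
	}
	return userError
}
```

A plain substring match keeps the check cheap and independent of any particular container runtime's message format, at the cost of missing rate-limit messages that are worded differently.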
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| flyteplugins/go/tasks/pluginmachinery/flytek8s/pod_helper.go | Classifies ImagePullBackOff + registry 429 as a system-retryable failure after the grace period. |
| flyteplugins/go/tasks/pluginmachinery/flytek8s/pod_helper_test.go | Adds a regression test ensuring 429-based ImagePullBackOff maps to ExecutionError_SYSTEM. |
Tracking issue
N/A
Why are the changes needed?
Container runtimes surface registry rate limiting (HTTP 429 Too Many Requests) through the ImagePullBackOff waiting reason. Today, once the ImagePullBackoffGracePeriod elapses, DemystifyPending returns PhaseInfoRetryableFailureWithCleanup, which is classified as a USER error and consumes the user's retry budget.

429s from the registry are an infrastructure problem outside the user's control, typically transient throttling on the registry side. Misclassifying them as user errors means a transient registry rate limit can exhaust a task's user retries even though the user did nothing wrong.
Example real-world pod waiting message:
What changes were proposed in this pull request?
- pod_helper.go: in classifyWaitingContainer, after the ImagePullBackOff grace period, detect the 429 Too Many Requests substring in the waiting message and return PhaseInfoSystemRetryableFailureWithCleanup instead of PhaseInfoRetryableFailureWithCleanup.
- Add helper isRegistryRateLimited(message string) bool.
- All other ImagePullBackOff messages keep their existing user-error classification.
How was this patch tested?
- TestDemystifyPending/ImagePullBackOffOutsideGracePeriod_RegistryRateLimited exercises a realistic 429 waiting message and asserts core.ExecutionError_SYSTEM.
- The existing ImagePullBackOffOutsideGracePeriod (non-429) test continues to pass, confirming the default classification is unchanged.
- go test ./go/tasks/pluginmachinery/flytek8s/... passes locally.
Labels
Setup process
N/A
Screenshots
N/A
Check all the applicable boxes
Related PRs
N/A
Docs link
N/A