fix: retry transient API errors in wait/poll loops #685
Merged
All four wait.Poll* call sites treated any non-NotFound error from Get as a fatal poll error, causing PollUntilContext* to abort immediately rather than retrying. This meant transient Kubernetes API errors (e.g. Internal, ServerTimeout, network blips) would terminate the poll loop well before the configured timeout was reached.

Change all four sites to treat non-NotFound errors as retryable by returning (false, nil) instead of (false, err). If the error persists, the poll will eventually time out as intended.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Marcin Owsiany <porridge@redhat.com>
vladbologa (Collaborator) reviewed on Apr 30, 2026 and left a comment:
I think it would be useful to surface somehow what the actual error was. Right now all we'd get is DeadlineExceeded, right?
vladbologa approved these changes on Apr 30, 2026
Signed-off-by: Marcin Owsiany <porridge@redhat.com>
Member
Author
@vladbologa good point, I also noticed |
vladbologa approved these changes on May 5, 2026
What this PR does / why we need it:
All four wait.Poll* call sites treated any non-NotFound error from Get as a fatal poll error (return false, err), causing PollUntilContext* to abort immediately. Transient Kubernetes API errors (Internal, ServerTimeout, network blips) would terminate the poll loop well before the configured timeout was reached.

This changes all four sites to return (false, nil) on transient errors so polling continues until the actual timeout expires.

Also:
- kubernetes.WaitForSA is transformed into Harness.waitForSA so that it can take advantage of the logger in Harness (which is its only user).
- WaitForDelete turned out to be unused and was replaced with a very similar implementation from Step.DeleteExistingWaitForDelete, which differs subtly in the types it uses.
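
To illustrate the behavior change described above, here is a minimal, standard-library-only sketch. It is not the code from this PR: pollUntil is a simplified stand-in for the polling helpers in k8s.io/apimachinery/pkg/util/wait, and errNotFound, waitForObject, logf, and runDemo are hypothetical names invented for the example. It assumes the condition logs each non-NotFound error before retrying, which is one way to keep the underlying cause visible when the poll eventually ends with a deadline error.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// errNotFound is a hypothetical stand-in for a Kubernetes NotFound API error.
var errNotFound = errors.New("not found")

// conditionFunc has the (done, err) shape of a poll condition:
// returning a non-nil error aborts the poll immediately.
type conditionFunc func(ctx context.Context) (bool, error)

// pollUntil is a simplified stand-in for a context-aware poll helper.
// It calls cond immediately, then once per interval until cond reports
// done, cond returns an error, or the timeout expires.
func pollUntil(ctx context.Context, interval, timeout time.Duration, cond conditionFunc) error {
	ctx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		done, err := cond(ctx)
		if err != nil {
			return err // fatal: stops polling before the timeout
		}
		if done {
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
		}
	}
}

// waitForObject polls get until it succeeds. NotFound and every other
// error are both treated as "not ready yet"; non-NotFound errors are
// logged so they are not lost if the poll later times out.
func waitForObject(ctx context.Context, get func() error, logf func(string, ...any), interval, timeout time.Duration) error {
	return pollUntil(ctx, interval, timeout, func(ctx context.Context) (bool, error) {
		err := get()
		switch {
		case err == nil:
			return true, nil
		case errors.Is(err, errNotFound):
			return false, nil
		default:
			logf("transient error, will retry: %v", err)
			return false, nil // the old behavior was: return false, err
		}
	})
}

// runDemo simulates a Get that fails with a non-NotFound error for the
// first `failures` calls and then succeeds. It returns the number of
// Get calls made, the logged messages, and the poll result.
func runDemo(failures int) (attempts int, logged []string, err error) {
	get := func() error {
		attempts++
		if attempts <= failures {
			return errors.New("internal server error")
		}
		return nil
	}
	logf := func(format string, args ...any) {
		logged = append(logged, fmt.Sprintf(format, args...))
	}
	err = waitForObject(context.Background(), get, logf, 5*time.Millisecond, 200*time.Millisecond)
	return attempts, logged, err
}
```

Calling runDemo(2) from a main function should succeed on the third Get call with two logged messages, whereas a Get that never recovers runs until the 200ms timeout and returns the context deadline error. With the old (false, err) return, the same scenario would have stopped at the first failed call.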