Skip to content

fix: retry transient API errors in wait/poll loops#685

Merged
porridge merged 3 commits intomainfrom
fix/retry-transient-errors-in-poll-loops
May 6, 2026
Merged

fix: retry transient API errors in wait/poll loops#685
porridge merged 3 commits intomainfrom
fix/retry-transient-errors-in-poll-loops

Conversation

@porridge
Copy link
Copy Markdown
Member

@porridge porridge commented Apr 29, 2026

What this PR does / why we need it:

All four wait.Poll* call sites treated any non-NotFound error from Get as a fatal poll error (return false, err), causing PollUntilContext* to abort immediately. Transient Kubernetes API errors (Internal, ServerTimeout, network blips) would terminate the poll loop well before the configured timeout was reached.

This changes all four sites to return (false, nil) on transient errors so polling continues until the actual timeout expires. Also:

  • kubernetes.WaitForSA is transformed into Harness.waitForSA so that it can take advantage of the logger in Harness (which is its only user)
  • WaitForDelete turned out to be unused and was replaced with a very similar but subtly different w.r.t. used types implementation from Step.DeleteExisting
  • Adds tests for the new WaitForDelete.

@porridge porridge requested a review from kensipe as a code owner April 29, 2026 10:45
All four wait.Poll* call sites treated any non-NotFound error from Get
as a fatal poll error, causing PollUntilContext* to abort immediately
rather than retrying. This meant transient Kubernetes API errors (e.g.
Internal, ServerTimeout, network blips) would terminate the poll loop
well before the configured timeout was reached.

Change all four sites to treat non-NotFound errors as retryable by
returning (false, nil) instead of (false, err). If the error persists,
the poll will eventually time out as intended.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Marcin Owsiany <porridge@redhat.com>
@porridge porridge force-pushed the fix/retry-transient-errors-in-poll-loops branch from 62d025b to d839cc2 Compare April 29, 2026 10:46
Signed-off-by: Marcin Owsiany <porridge@redhat.com>
@porridge porridge force-pushed the fix/retry-transient-errors-in-poll-loops branch from 414fb88 to e8366c3 Compare April 29, 2026 11:28
vladbologa

This comment was marked as duplicate.

Copy link
Copy Markdown
Collaborator

@vladbologa vladbologa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be useful to surface somehow what the actual error was. Right now all we'd get is DeadlineExceeded, right?

Signed-off-by: Marcin Owsiany <porridge@redhat.com>
@porridge porridge force-pushed the fix/retry-transient-errors-in-poll-loops branch from 2c22006 to 9843571 Compare May 4, 2026 08:00
@porridge
Copy link
Copy Markdown
Member Author

porridge commented May 4, 2026

@vladbologa good point, I also noticed kubernetes.WaitForDelete was unused so did some shuffling around such that all cases surface the transient errors in one way or another, at least when they cause an eventual failure.
PTAL

@porridge porridge requested a review from vladbologa May 4, 2026 08:10
@porridge porridge merged commit e260ab9 into main May 6, 2026
6 checks passed
@porridge porridge deleted the fix/retry-transient-errors-in-poll-loops branch May 6, 2026 06:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants