Retry storage operation on idle connection error during raw file uploads #7368
Draft
ddl-rliu wants to merge 2 commits into
147d1b9 to ca37329
Workflows with 100+ large (100MB+) file outputs fail in the sidecar. The failures appear to be caused by transient errors during the upload phase. Any file output larger than 5MB triggers a MultipartUpload, and flytecopilot attempts to parallelize file uploads. However, the parallelism appears to cause idle connections from completed uploads to interfere with in-progress connections for other uploads.

Since the failures are transient, the fix is for copilot to retry (up to 5 retries) when a transient error is seen during a flytecopilot raw file upload. This is consistent with the fix for a similar issue seen previously in Flyte storage writes: flyteorg/flyteadmin#325
ca37329 to 3674473
As part of the previous fix, flytecopilot retries (up to 5 retries) when a transient error is seen during a flytecopilot raw file upload. The retry applied to the following four specific transient network errors:
- "http: server closed idle connection"
- "use of closed network connection"
- "EOF"
- "write: broken pipe"

These four network errors were discovered while reproducing the issue locally. During a live run, however, a new transient network error was observed: "WebidentityErr: failed to retrieve credentials ... status code: 408" (note the status code 408 in particular). To make the retry logic more robust against unanticipated transient errors, we relax it to retry on all errors.
768c882 to 6fd96ae
Tracking issue
Similar issue to flyteorg/flyteadmin#325
Why are the changes needed?
What changes were proposed in this pull request?
How was this patch tested?
Labels
Please add one or more of the following labels to categorize your PR:
This is important to improve the readability of release notes.
Setup process
Screenshots
Check all the applicable boxes
Related PRs
Stack
If you do use git town to manage PR Stacks, the stack relevant to this PR will show below. Otherwise, you can ignore this section.
Docs link