Retry storage operation on idle connection error during raw file uploads #7368

Draft
ddl-rliu wants to merge 2 commits into flyteorg:master from ddl-rliu:rliu.writeraw-retry

Conversation

ddl-rliu (Contributor) commented May 13, 2026

Tracking issue

Similar issue to flyteorg/flyteadmin#325

Why are the changes needed?

What changes were proposed in this pull request?

How was this patch tested?

Labels

Please add one or more of the following labels to categorize your PR:

  • added: For new features.
  • changed: For changes in existing functionality.
  • deprecated: For soon-to-be-removed features.
  • removed: For features being removed.
  • fixed: For any bug fixed.
  • security: In case of vulnerabilities

This is important to improve the readability of release notes.

Setup process

Screenshots

Check all the applicable boxes

  • I updated the documentation accordingly.
  • All new and existing tests passed.
  • All commits are signed-off.

Related PRs

Stack

If you do use git town to manage PR Stacks, the stack relevant to this PR
will show below. Otherwise, you can ignore this section.

Docs link

github-actions bot added the flyte label May 13, 2026
ddl-rliu force-pushed the rliu.writeraw-retry branch 3 times, most recently from 147d1b9 to ca37329 May 14, 2026 00:02

Workflows that produce 100+ large (100MB+) file outputs fail in the sidecar.
The failures appear to be caused by transient errors during the upload phase.
Any file output larger than 5MB triggers a MultipartUpload, and flytecopilot
attempts to parallelize file uploads. The parallelism appears to let idle
connections left over from completed uploads interfere with the connections of
uploads that are still in progress.

Since the failures are transient, the fix has flytecopilot retry a raw file
upload (up to 5 retries) when it sees a transient error. This is consistent
with the fix for a similar issue seen previously in Flyte storage writes:
flyteorg/flyteadmin#325

ddl-rliu changed the title from "Retry storage operation on idle connection error" to "Retry storage operation on idle connection error during raw file uploads" May 14, 2026
ddl-rliu force-pushed the rliu.writeraw-retry branch from ca37329 to 3674473 May 14, 2026 02:10

The previous fix made flytecopilot retry a raw file upload (up to 5 retries)
when it saw a transient error. Specifically, it retried on the following four
transient network errors:
 - "http: server closed idle connection"
 - "use of closed network connection"
 - "EOF"
 - "write: broken pipe"
These four errors were observed while reproducing the issue locally.

However, during a live run, a new transient network error was observed:
"WebIdentityErr: failed to retrieve credentials ... status code: 408" (note the
status code 408, Request Timeout, in particular). To make the retry logic robust
against unanticipated transient errors, we relax it to retry on all errors.

ddl-rliu force-pushed the rliu.writeraw-retry branch from 768c882 to 6fd96ae May 14, 2026 22:30