Handle corrupted Report.pkl gracefully in RetryManager to prevent component crash #12486

Open
Viphava280444 wants to merge 1 commit into dmwm:master from Viphava280444:fixFailedLoadFile

Conversation

@Viphava280444
Contributor

Fixes #12414

Status

ready

Description

When PauseAlgo.isReady() fails to unpickle a corrupted Report.pkl, the exception crashes the entire RetryManagerPoller thread, which blocks retries for all jobs in cooloff, not just the affected one.
This fix wraps the plugin.isReady() call in selectRetryAlgo() in a try/except, so pickle failures are logged as warnings and the job proceeds with the retry (skipping the exit code check) instead of crashing the component.
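
A minimal sketch of the guard described above, not the actual diff. `StubPauseAlgo`, `isJobReady`, the `reportPath` key, and the `isReady(job, cooloffType)` signature are hypothetical stand-ins for illustration; the real WMCore code may catch a narrower set of exceptions than the broad `Exception` assumed here.

```python
import logging
import pickle


class StubPauseAlgo:
    """Hypothetical stand-in for a retry plugin; NOT WMCore's PauseAlgo."""

    def __init__(self, pauseCodes):
        self.pauseCodes = pauseCodes

    def isReady(self, job, cooloffType):
        # Unpickling a corrupted or truncated file raises here.
        with open(job["reportPath"], "rb") as handle:
            report = pickle.load(handle)
        # Not ready for retry if the exit code is one that should pause the job.
        return report["exitCode"] not in self.pauseCodes


def isJobReady(plugin, job, cooloffType):
    """Ask the plugin whether a job may be retried, tolerating unreadable reports."""
    try:
        return plugin.isReady(job=job, cooloffType=cooloffType)
    except Exception as ex:  # assumption: any load failure is treated the same way
        logging.warning(
            "Could not evaluate report for job %s (%s); retrying without exit code check",
            job.get("id"), ex,
        )
        # Proceed with the retry instead of letting the error reach the poller.
        return True
```

With this shape, one unreadable report affects only the decision for its own job, and the loop over the remaining cooloff jobs continues.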

Is it backward compatible (if not, which system it affects?)

YES

Related PRs

None

External dependencies / deployment changes

None

@hassan11196
Member

Hi @Viphava280444 this change is specific to Tier0 right?

@hassan11196
Member

test this please

@hassan11196
Member

test this please

@Viphava280444
Contributor Author

> Hi @Viphava280444 this change is specific to Tier0 right?

Hi @hassan11196,
yes, this issue is related to RetryManager, which is used in Tier 0.

@hassan11196
Member

test this please

@dmwm-bot

Jenkins results:

  • Python3 Unit tests: failed
    • 1 new failures
    • 3 tests no longer failing
    • 3 changes in unstable tests
  • Python3 Pylint check: failed
    • 1 warnings and errors that must be fixed
    • 1 warnings
    • 5 comments to review
  • Pycodestyle check: succeeded
    • 1 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/WMCore-PR-Report/1307/artifact/artifacts/PullRequestReport.html

@hassan11196
Member

test this please

Development

Successfully merging this pull request may close these issues.

RetryManager crashes for every random pkl load failure

3 participants