
Add awareness of Rucio HTTP exceptions and retry up to 10 times over ~20 minutes#9296

Open
nausikt wants to merge 2 commits into dmwm:master from nausikt:fix/tolerate-rucio-http-exceptions

Conversation

Contributor

@nausikt nausikt commented Apr 9, 2026

Resolve #9285

I've released this patch to crab-dev-tw02 via v3.latest and v3-260409-tolerate-rucio-http-exceptions; see also the latest tag in harbour.

How to test

From now on, we should expect reporting/emailing about 503 exceptions only from crab-prod-tw02.

P.S. If it works well, we will promote it to production with a proper release tag.
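For context, the policy in the PR title ("retry up to 10 times over ~20 minutes") can be sketched as a plain retry loop. This is my own illustrative sketch, not the actual patch: `callWithRetries`, `isRetriable`, and the roughly 2-minute spacing (10 × 120 s ≈ 20 min) are assumed names and numbers.

```python
import time


def callWithRetries(action, isRetriable, maxRetries=10, delaySeconds=120):
    """Call action(); sleep and retry on retriable errors, re-raise the rest.

    Defaults approximate the PR title: up to 10 attempts spread over ~20
    minutes, i.e. about a 2-minute pause between tries.
    """
    for attempt in range(maxRetries):
        try:
            return action()
        except Exception as ex:  # pylint: disable=broad-except
            # Non-retriable errors, and the final failed attempt, are fatal.
            if not isRetriable(ex) or attempt == maxRetries - 1:
                raise
            time.sleep(delaySeconds)
```

The predicate is passed in so that the same loop can serve both the CRAB REST case (inspecting `ex.status`) and the Rucio case (inspecting the message text).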

@nausikt nausikt requested a review from belforte April 9, 2026 20:56
@cmsdmwmbot

Jenkins results:

  • Python3 Pylint check: succeeded
    • 10 comments to review
  • Pycodestyle check: succeeded
    • 13 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-CRABServer-PR-test/2782/artifact/artifacts/PullRequestReport.html

Member

@belforte belforte left a comment


IMHO this code is now "complex enough" that a high-level description would be very useful, like a small "design" note at the top of the file.

E.g. it is not clear to me what will happen if we get an HTTP exception other than the ones that you list explicitly. Is that fatal? Ignored? Retried using the default policy?

On the substance, are you sure that all other HTTP exceptions deserve a different treatment? Do we have any evidence of an HTTP exception which is not worth retrying, given that "remote server is down or anyhow in total confusion" is a condition where retrying after a delay is certainly good?

@nausikt
Contributor Author

nausikt commented Apr 10, 2026

Acknowledged @belforte, will revise/add better docs!

E.g. it is not clear to me what will happen if we get an HTTP exception other than the ones that you list explicitly. Is that fatal? Ignored? Retried using the default policy?

Put concretely, RucioException("http status code: 555") (or whatever other unlisted code) will be treated as fatal (raised right away with no retry), just like other native RucioExceptions.

Meanwhile, "http status code: {403|429|500|502|503}" is specifically trapped by a predicate in the retry loop, and is not raised until the maximum number of retries is reached.

On the substance, are you sure that all other HTTP exceptions deserve a different treatment? Do we have any evidence of an HTTP exception which is not worth retrying, given that "remote server is down or anyhow in total confusion" is a condition where retrying after a delay is certainly good?

I mostly followed the standard used on the CRAB REST side. That said, I'm still skeptical about 403 only; maybe it shouldn't be here. The others make sense to me, e.g. 429: too many requests, 50X: server faults. What do you think?

def retriableError(ex):
    """ Return True if the error can be retried
    """
    if isinstance(ex, HTTPException):
        # 403 Authentication failure. When the CMSWEB FrontEnd is restarting
        # 429 Too Many Requests. When the client hits the throttling limit
        # 500 Internal server error. For some errors retrying helps
        # 502 CMSWEB frontend answers with this when the CMSWEB backends are overloaded
        # 503 Usually that's the DatabaseUnavailable error
        return ex.status in [403, 429, 500, 502, 503]
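The analogous Rucio-side predicate described above, trapping "http status code: NNN" messages, might look like the following sketch. This is my guess at the approach, not the actual patch: `retriableRucioError` and the regex are illustrative, a plain `Exception` stands in for RucioException, and I assume the HTTP code is only available in the message text rather than in a `.status` attribute.

```python
import re

# Codes the comment above proposes to retry; everything else is fatal.
RETRIABLE_HTTP_CODES = (403, 429, 500, 502, 503)
STATUS_RE = re.compile(r"http status code:\s*(\d{3})\b")


def retriableRucioError(ex):
    """Return True if the exception message carries a retriable HTTP code.

    The code has to be parsed out of the message text, so a message without
    "http status code: NNN" (a "native" Rucio error) is treated as fatal.
    """
    match = STATUS_RE.search(str(ex))
    if match is None:
        return False  # native Rucio error: raise right away, no retry
    return int(match.group(1)) in RETRIABLE_HTTP_CODES
```

With this shape, "http status code: 555" is correctly fatal (matched, but not in the list), matching the behavior described in the comment above.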

@belforte
Member

RESTInteraction is a bit special because it only talks to our server, where our code is in control of which HTTP code is returned, and the list was also based on experience: in practice we almost never see anything different. Server throttling never really worked (something which Valentin tried to add but somehow never really finished; the main problem was DBS, which was solved by the rewrite in Go and by separating the servers for users and production). So 429 never happens. But we have some HTTP 304 e.g. in the current grafana, which I have no idea what it is, and HTTP 400, which should not be there, because all queries come from our code and therefore should better be good ones. It is also true that those HTTP codes may come from the CMSWEB FrontEnd, not necessarily the CRAB Server...
There would be a lot to investigate there. Simply do not take that code as a gold standard!

You have to choose between adding HTTP errors from Rucio "as they pop up", possibly after having investigated and figured out what they are, or being proactive and saying "retry them all". The former is more work. The latter may potentially obscure some issue worth reporting... But for the CRAB REST we felt we had a duty to fix our server, so we preferred to call any "unknown" error fatal. For Rucio... what do we do if we get some odd errors at a low rate?

@nausikt
Contributor Author

nausikt commented Apr 10, 2026

Thanks for the extreme clarity & history! 🙇‍♂️

Forgive me, I had switched off my brain; the former, investigating & patching "as they pop up", makes much more sense to me!

Will converge to only 503 for now, then 🫡
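The narrowed predicate agreed on here ("only 503 for now") could then be as small as this sketch; the function name is my own, not from the patch, and further codes would be added one by one as they pop up and are understood.

```python
import re


def isRetriable503(ex):
    """Return True only for 'http status code: 503' style messages.

    Everything else, including other HTTP codes, stays fatal until it has
    been investigated and explicitly added to the retry list.
    """
    return re.search(r"http status code:\s*503\b", str(ex)) is not None
```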

@cmsdmwmbot

Jenkins results:

  • Python3 Pylint check: succeeded
    • 10 comments to review
  • Pycodestyle check: succeeded
    • 13 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-CRABServer-PR-test/2798/artifact/artifacts/PullRequestReport.html

@belforte
Copy link
Copy Markdown
Member

@nausikt forgive my failing memory. What is the status of this?



Development

Successfully merging this pull request may close these issues.

RucioException - 503 service temporary unavailable should be retryable and have extra tolerance.
