
Add awareness of Rucio HTTP exceptions and retry up to 10 times over ~20 minutes#9296

Open
nausikt wants to merge 2 commits into dmwm:master from nausikt:fix/tolerate-rucio-http-exceptions

Conversation

Contributor

@nausikt nausikt commented Apr 9, 2026

Resolve #9285

I've released this patch to crab-dev-tw02 via v3.latest and v3-260409-tolerate-rucio-http-exceptions; see also the latest tag in harbour.

How to test

From now on, we should expect reporting/emailing about 503 exceptions only from crab-prod-tw02.

P.S. If it works well, we will promote it to production with a proper release tag.
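For context, the policy in the PR title ("retry up to 10 times over ~20 minutes") can be sketched as a plain retry loop. This is my own illustrative sketch, not the actual patch: `callWithRetries`, `isRetriable`, and the roughly 2-minute spacing (10 × 120 s ≈ 20 min) are assumed names and numbers.

```python
import time


def callWithRetries(action, isRetriable, maxRetries=10, delaySeconds=120):
    """Call action(); sleep and retry on retriable errors, re-raise the rest.

    Defaults approximate the PR title: up to 10 attempts spread over ~20
    minutes, i.e. about a 2-minute pause between tries.
    """
    for attempt in range(maxRetries):
        try:
            return action()
        except Exception as ex:  # pylint: disable=broad-except
            # Non-retriable errors, and the final failed attempt, are fatal.
            if not isRetriable(ex) or attempt == maxRetries - 1:
                raise
            time.sleep(delaySeconds)
```

The predicate is passed in so that the same loop can serve both the CRAB REST case (inspecting `ex.status`) and the Rucio case (inspecting the message text).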

@nausikt nausikt requested a review from belforte April 9, 2026 20:56
@cmsdmwmbot

Jenkins results:

  • Python3 Pylint check: succeeded
    • 10 comments to review
  • Pycodestyle check: succeeded
    • 13 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-CRABServer-PR-test/2782/artifact/artifacts/PullRequestReport.html

Member

@belforte belforte left a comment


IMHO this code is now "complex enough" that a high-level description would be very useful, like a small "design" note at the top of the file.

E.g. it is not clear to me what will happen if we get an HTTP exception other than the ones that you list explicitly. Is that fatal? Ignored? Retried using the default policy?

On the substance, are you sure that all other HTTP exceptions deserve a different treatment? Do we have any evidence of an HTTP exception which is not worth retrying, given that "remote server is down or anyhow in total confusion" is a condition where retrying after a delay is certainly good?

@nausikt
Contributor Author

nausikt commented Apr 10, 2026

Acknowledged @belforte, will revise/add better docs!

E.g. it is not clear to me what will happen if we get an HTTP exception other than the ones that you list explicitly. Is that fatal? Ignored? Retried using the default policy?

Put concretely, RucioException("http status code: 555") (or whatever other unlisted code) will be treated as fatal (raised right away with no retry), just like other native RucioExceptions.

Meanwhile, "http status code: {403|429|500|502|503}" is specifically trapped by a predicate in the retry loop, and is not raised until the maximum number of retries is reached.

On the substance, are you sure that all other HTTP exceptions deserve a different treatment? Do we have any evidence of an HTTP exception which is not worth retrying, given that "remote server is down or anyhow in total confusion" is a condition where retrying after a delay is certainly good?

I mostly followed the standard used on the CRAB REST side. That said, I'm still skeptical about 403 only; maybe it shouldn't be here. The others make sense to me, e.g. 429: too many requests, 50X: server faults. What do you think?

def retriableError(ex):
    """ Return True if the error can be retried
    """
    if isinstance(ex, HTTPException):
        # 403 Authentication failure. When the CMSWEB FrontEnd is restarting
        # 429 Too Many Requests. When the client hits the throttling limit
        # 500 Internal server error. For some errors retrying helps
        # 502 CMSWEB frontend answers with this when the CMSWEB backends are overloaded
        # 503 Usually that's the DatabaseUnavailable error
        return ex.status in [403, 429, 500, 502, 503]
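The analogous Rucio-side predicate described above, trapping "http status code: NNN" messages, might look like the following sketch. This is my guess at the approach, not the actual patch: `retriableRucioError` and the regex are illustrative, a plain `Exception` stands in for RucioException, and I assume the HTTP code is only available in the message text rather than in a `.status` attribute.

```python
import re

# Codes the comment above proposes to retry; everything else is fatal.
RETRIABLE_HTTP_CODES = (403, 429, 500, 502, 503)
STATUS_RE = re.compile(r"http status code:\s*(\d{3})\b")


def retriableRucioError(ex):
    """Return True if the exception message carries a retriable HTTP code.

    The code has to be parsed out of the message text, so a message without
    "http status code: NNN" (a "native" Rucio error) is treated as fatal.
    """
    match = STATUS_RE.search(str(ex))
    if match is None:
        return False  # native Rucio error: raise right away, no retry
    return int(match.group(1)) in RETRIABLE_HTTP_CODES
```

With this shape, "http status code: 555" is correctly fatal (matched, but not in the list), matching the behavior described in the comment above.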

@belforte
Member

RESTInteraction is a bit special because it only talks to our server, where our code is in control of which HTTP code is returned, and the list was also based on experience: in practice we almost never see anything different. Server throttling never really worked (something which Valentin tried to add but somehow never really finished; the main problem was DBS, which was solved by the rewrite in Go and by separating the servers for users and production). So 429 never happens. But we have some HTTP 304 e.g. in the current grafana, which I have no idea what it is, and HTTP 400, which should not be there, because all queries come from our code and therefore should better be good ones. It is also true that those HTTP codes may come from the CMSWEB FrontEnd, not necessarily the CRAB Server...
There would be a lot to investigate there. Simply do not take that code as a gold standard!

You have to choose between adding HTTP errors from Rucio "as they pop up", possibly after having investigated and figured out what they are, or being proactive and saying "retry them all". The former is more work. The latter may potentially obscure some issue worth reporting... But for the CRAB REST we felt we had a duty to fix our server, so we preferred to call any "unknown" error fatal. For Rucio... what do we do if we get some odd errors at a low rate?

@nausikt
Contributor Author

nausikt commented Apr 10, 2026

Thanks for the extreme clarity & history! 🙇‍♂️

Forgive me, I had switched off my brain; the former, investigating & patching "as they pop up", makes much more sense to me!

Will converge to only 503 for now, then 🫡
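The narrowed predicate agreed on here ("only 503 for now") could then be as small as this sketch; the function name is my own, not from the patch, and further codes would be added one by one as they pop up and are understood.

```python
import re


def isRetriable503(ex):
    """Return True only for 'http status code: 503' style messages.

    Everything else, including other HTTP codes, stays fatal until it has
    been investigated and explicitly added to the retry list.
    """
    return re.search(r"http status code:\s*503\b", str(ex)) is not None
```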

@cmsdmwmbot

Jenkins results:

  • Python3 Pylint check: succeeded
    • 10 comments to review
  • Pycodestyle check: succeeded
    • 13 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-CRABServer-PR-test/2798/artifact/artifacts/PullRequestReport.html

@belforte
Copy link
Copy Markdown
Member

@nausikt forgive my failing memory. What is the status of this?



Development

Successfully merging this pull request may close these issues.

RucioException - 503 service temporary unavailable should be retryable and have extra tolerance.
