Add awareness of Rucio HTTP exceptions and retry up to 10 times over ~20 minutes #9296
nausikt wants to merge 2 commits into
Conversation
Jenkins results:
belforte left a comment
IMHO this code is now "complex enough" that a high-level description would be very useful, like a small "design" note at the top of the file.
E.g. it is not clear to me what will happen if we get an HTTP exception other than the ones that you list explicitly. Is that fatal? Ignored? Retried using the default policy?
On the substance, are you sure that all other HTTP exceptions deserve a different treatment? Do we have any evidence of an HTTP exception which is not worth retrying, given that "remote server is down or anyhow in total confusion" is a condition where retrying after a delay is certainly good?
Acknowledged @belforte, will revise and add better docs!
Put concretely, RucioException("http status code: 555" or whatever "....") will be treated as fatal (raised right away with no retry), just like other native RucioExceptions. Meanwhile, "http status code: {403|429|500|502|503}" are specifically trapped by the predicate in the retry loop earlier, and not raised until the maximum number of retries is reached.
I mostly followed the standard used on the CRAB REST side. That said, I'm still skeptical only about whether 403 should be here; the others make sense to me, e.g. 429 (too many requests) and 50X (server faults). What do you think? CRABServer/src/python/RESTInteractions.py Lines 29 to 38 in 216d7f5
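The policy described above can be sketched roughly as follows. This is a hypothetical illustration, not the actual patch: the function names, the delay constant, and the code-extraction helper are assumptions; only the exception message format ("http status code: NNN"), the retriable code list, and the 10-retries/~20-minutes budget come from the discussion.

```python
import re
import time

# Codes worth retrying, per the discussion above; anything else is fatal.
RETRIABLE_HTTP_CODES = {403, 429, 500, 502, 503}
MAX_RETRIES = 10
RETRY_DELAY = 120  # seconds; 10 attempts ~ 20 minutes total (assumed spacing)

def http_code_from_message(message):
    """Extract the HTTP status code embedded in an exception message, if any."""
    match = re.search(r"http status code:\s*(\d{3})", message)
    return int(match.group(1)) if match else None

def is_retriable(message):
    """Retry only the explicitly listed codes; unknown errors are fatal."""
    return http_code_from_message(message) in RETRIABLE_HTTP_CODES

def call_with_retries(operation, sleep=time.sleep):
    """Run operation(); retry transient HTTP failures, re-raise everything else."""
    for attempt in range(MAX_RETRIES):
        try:
            return operation()
        except Exception as exc:  # in the real code: RucioException
            # Fatal message, or retries exhausted: raise right away.
            if not is_retriable(str(exc)) or attempt == MAX_RETRIES - 1:
                raise
            sleep(RETRY_DELAY)
```

With this shape, a "http status code: 555" message falls through `is_retriable` and is raised on the first attempt, while a 503 is retried up to the cap before the exception finally propagates.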
RESTInteractions is a bit special because it only talks to our server, where our code is in control of what HTTP code is returned; it was also based on experience, and in practice we almost never see anything different. Server throttling never really worked (something which Valentin tried to add but somehow never really finished; the main problem was DBS, which was solved by a rewrite in Go and by separating the servers for users and prod), so 429 never happens. But we have some HTTP 304, e.g. in the current Grafana, which I have no idea what it is, and HTTP 400, which should not be there, because all queries come from our code and therefore should better be good ones. It is also true that those HTTP codes may come from the cmsweb FrontEnd, not necessarily the CRAB Server...
You have to choose between adding HTTP errors from Rucio "as they pop up", possibly after having investigated and figured out what they are, or being proactive and saying "retry them all". The former is more work; the latter may potentially obscure some issue worth reporting. For the CRAB REST we felt we had a duty to fix our server, so we preferred to treat any "unknown" error as fatal. For Rucio... what do we do if we get some odd errors at a low rate?
Thanks for the extreme clarity & history! 🙇♂️ Forgive me, I had switched off my brain; the former, investigate & patch "as they pop up", makes much more sense to me! Will converge to only 503 for now then 🫡
Jenkins results:
@nausikt forgive my failing memory. What is the status of this?
Resolves #9285
I've released this patch to crab-dev-tw02 via
v3.latest, v3-260409-tolerate-rucio-http-exceptions. See also the latest tag in harbour.
How to test
From now on, we should expect one-sided reporting/emailing about 503 exceptions from only
crab-prod-tw02.
P.S. If it works well, we will promote it into prod with a proper release tag.