ci: auto-retry infra-caused script failures via exit code 75 sentinel #3812

Merged
Leiyks merged 5 commits into master from leiyks/infra-failure-retry on Apr 22, 2026

@Leiyks Leiyks commented Apr 22, 2026

Service startup timeouts (Kafka, Zookeeper, etc.) exit with code 1, which GitLab classifies as script_failure — not covered by our runner-level retry rules. We don't want blanket script_failure retry as it hides real test flakiness.

GitLab's retry: exit_codes: is a standalone OR condition: jobs are retried if they exit with one of the listed codes, independently of when:. This means exit_codes: [75] retries only on exit code 75 — real test failures (exit 1) are never retried.
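
The rule above can be modelled locally with a small bash loop. This is only a sketch of the intended behaviour, not GitLab's implementation: `run_with_retry`, `infra_failure`, and `test_failure` are hypothetical names, and the retry limit of 2 is taken from this PR's description.

```shell
#!/usr/bin/env bash
# Local model of "retry only on a listed exit code"; not GitLab's implementation.

RETRY_EXIT_CODE=75   # EX_TEMPFAIL, the infra-failure sentinel from this PR
MAX_RETRIES=2        # "retried up to 2x", as stated in the PR description

# run_with_retry CMD [ARGS...]
# Runs CMD and re-runs it only when it exits with RETRY_EXIT_CODE.
# Sets ATTEMPTS to the number of runs and returns the last exit code.
run_with_retry() {
  local rc
  ATTEMPTS=0
  while :; do
    ATTEMPTS=$((ATTEMPTS + 1))
    rc=0
    "$@" || rc=$?
    # Stop on success, on any non-sentinel failure, or once retries are used up.
    if [ "$rc" -ne "$RETRY_EXIT_CODE" ] || [ "$ATTEMPTS" -gt "$MAX_RETRIES" ]; then
      return "$rc"
    fi
  done
}

infra_failure() { return 75; }  # hypothetical stand-in for a service startup timeout
test_failure()  { return 1; }   # hypothetical stand-in for a real test failure

run_with_retry infra_failure || true
echo "exit 75: $ATTEMPTS attempts"   # prints "exit 75: 3 attempts"
run_with_retry test_failure || true
echo "exit 1: $ATTEMPTS attempts"    # prints "exit 1: 1 attempts"
```

In this model a command exiting 75 runs three times in total, while a command exiting 1 runs once.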

This PR introduces exit code 75 (EX_TEMPFAIL) as an infra-failure sentinel:

  • generate-common.php: add exit_codes: [75] to the global retry block — jobs exiting with the infra sentinel are retried up to 2×, all other failures are unaffected
  • wait-for-service-ready.sh: exit 1 → exit 75 on service startup timeout (Kafka, Zookeeper, MySQL, Redis, etc.)

Verified on CI: exit 75 → 3 attempts (retried twice); exit 1 → 1 attempt only.

Any script that fails due to transient infra can adopt the same convention (exit 75) and will be picked up by the global rule automatically.
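
For a script adopting the convention, a minimal sketch could look like the following. It is not the actual contents of wait-for-service-ready.sh, which this page does not show: the `wait_for_ready` function, its arguments, and the one-second polling interval are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the exit-75 convention; not the real wait-for-service-ready.sh.

EX_TEMPFAIL=75  # sysexits.h code for a temporary failure

# wait_for_ready TIMEOUT_SECONDS CMD [ARGS...]
# Polls CMD once per second until it succeeds or the timeout elapses.
# Returns 0 when CMD succeeds and EX_TEMPFAIL when the timeout is reached.
wait_for_ready() {
  local timeout=$1
  shift
  local deadline=$((SECONDS + timeout))
  until "$@" >/dev/null 2>&1; do
    if [ "$SECONDS" -ge "$deadline" ]; then
      echo "timed out after ${timeout}s waiting for: $*" >&2
      return "$EX_TEMPFAIL"
    fi
    sleep 1
  done
}

# Example: a probe that never succeeds ends with the sentinel code.
rc=0
wait_for_ready 1 false || rc=$?
echo "exit code: $rc"   # prints "exit code: 75"
```

A CI script would replace `false` with a real readiness probe for its service and end with something like `wait_for_ready 60 <probe> || exit $?`, so that a timeout surfaces as exit code 75 while any later test failure keeps its own exit code.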

Leiyks added 2 commits April 22, 2026 13:48
Service startup timeouts (Kafka, Zookeeper, MySQL, etc.) exit the job
with code 1, which GitLab classifies as script_failure — not covered by
the existing runner-level retry rules, and the team doesn't want to
enable blanket script_failure retry (hides real flakiness).

GitLab 14.9+ supports `retry: exit_codes:` which fires the
script_failure retry rule only when the exit code matches. We use
EX_TEMPFAIL (75) as the infra-sentinel: wait-for-service-ready.sh now
exits 75 on service timeout instead of 1, and the global default retry
block adds `script_failure` gated on exit code 75.

Effect: Kafka/Zookeeper/other service startup races are retried up to 2
times automatically. Real test failures (exit 1) are never retried.

datadog-prod-us1-3 Bot commented Apr 22, 2026

Tests

🎉 All green!

❄️ No new flaky tests detected
🧪 All tests passed

🎯 Code Coverage (details)
Patch Coverage: 100.00%
Overall Coverage: 60.65% (-0.04%)

This comment will be updated automatically if new data arrives.
🔗 Commit SHA: 930acf6

Leiyks added 3 commits April 22, 2026 15:28
Remove script_failure from the global when: list. Having both
script_failure and exit_codes: [75] retries on any script failure OR
exit code 75 — not exit code 75 only as intended.

exit_codes: [75] alone correctly retries only jobs that exit with
code 75 (the EX_TEMPFAIL sentinel from wait-for-service-ready.sh),
leaving all other script failures (exit 1, real test failures)
unretried.

test-retry-exit-75: exits 75, no job-level retry override → inherits
global default → should be retried twice (3 total attempts)

test-no-retry-exit-1: exits 1, same config → should run exactly once

Both jobs are branch-scoped (leiyks/infra-failure-retry only) and
allow_failure: true. Remove after verification.
@Leiyks Leiyks marked this pull request as ready for review April 22, 2026 14:19
@Leiyks Leiyks requested a review from a team as a code owner April 22, 2026 14:19
@Leiyks Leiyks merged commit 0d88e71 into master Apr 22, 2026
1888 of 1958 checks passed
@Leiyks Leiyks deleted the leiyks/infra-failure-retry branch April 22, 2026 14:54
@github-actions github-actions Bot added this to the 1.19.0 milestone Apr 22, 2026