Commit 0d88e71

ci: auto-retry infra-caused script failures via exit code 75 sentinel (#3812)
* ci: auto-retry infra-caused script failures via exit code 75 sentinel

  Service startup timeouts (Kafka, Zookeeper, MySQL, etc.) exit the job with code 1, which GitLab classifies as script_failure. That is not covered by the existing runner-level retry rules, and the team doesn't want to enable blanket script_failure retry (hides real flakiness).

  GitLab 14.9+ supports `retry: exit_codes:` which fires the script_failure retry rule only when the exit code matches. We use EX_TEMPFAIL (75) as the infra-sentinel: wait-for-service-ready.sh now exits 75 on service timeout instead of 1, and the global default retry block adds `script_failure` gated on exit code 75.

  Effect: Kafka/Zookeeper/other service startup races are retried up to 2 times automatically. Real test failures (exit 1) are never retried.

* ci: remove inline comment on exit 75

* ci: fix retry config: exit_codes is OR, not a filter on script_failure

  Remove script_failure from the global when: list. Having both script_failure and exit_codes: [75] retries on any script failure OR exit code 75, not on exit code 75 only as intended. exit_codes: [75] alone correctly retries only jobs that exit with code 75 (the EX_TEMPFAIL sentinel from wait-for-service-ready.sh), leaving all other script failures (exit 1, real test failures) unretried.

* ci: add temporary test jobs to validate exit_codes retry behaviour

  test-retry-exit-75: exits 75, no job-level retry override → inherits global default → should be retried twice (3 total attempts)
  test-no-retry-exit-1: exits 1, same config → should run exactly once

  Both jobs are branch-scoped (leiyks/infra-failure-retry only) and allow_failure: true. Remove after verification.

* ci: remove temporary retry validation test jobs
1 parent ee25be4 commit 0d88e71

2 files changed

Lines changed: 3 additions & 1 deletion

.gitlab/generate-common.php

Lines changed: 2 additions & 0 deletions
@@ -111,6 +111,8 @@ function windows_git_setup_with_packages() {
   - api_failure
   - stuck_or_timeout_failure
   - job_execution_timeout
+exit_codes:
+  - 75
 
 .all_targets: &all_minor_major_targets
 <?php
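
The retry behaviour described in the commit message (retry only on exit code 75, at most 2 retries) can be modelled with a short shell sketch. This is only an approximation for illustration, not GitLab Runner's implementation; `run_with_retry`, `MAX_RETRIES`, and `RETRY_EXIT_CODE` are hypothetical names that do not appear in the repository.

```shell
#!/bin/sh
# Model only: approximates the behaviour described in the commit message,
# not GitLab Runner's actual retry logic.
MAX_RETRIES=2       # "retried up to 2 times" (3 total attempts)
RETRY_EXIT_CODE=75  # EX_TEMPFAIL sentinel

# Hypothetical helper: runs a command and re-runs it only when it exits
# with the sentinel code. The attempt count is left in $attempts.
run_with_retry() {
    attempts=0
    while :; do
        attempts=$((attempts + 1))
        rc=0
        sh -c "$1" || rc=$?
        # Stop on any non-sentinel status, or once the retries are used up.
        if [ "$rc" -ne "$RETRY_EXIT_CODE" ] || [ "$attempts" -gt "$MAX_RETRIES" ]; then
            return "$rc"
        fi
    done
}

rc75=0; run_with_retry "exit 75" || rc75=$?
attempts75=$attempts
rc1=0; run_with_retry "exit 1" || rc1=$?
attempts1=$attempts

echo "exit 75: ${attempts75} attempts, final status ${rc75}"
echo "exit 1:  ${attempts1} attempts, final status ${rc1}"
```

Under this model a job that always exits 75 runs three times, while a job that exits 1 runs once, which matches what the temporary test-retry-exit-75 and test-no-retry-exit-1 jobs were meant to verify.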

.gitlab/wait-for-service-ready.sh

Lines changed: 1 addition & 1 deletion
@@ -115,7 +115,7 @@ if [ -n "${WAIT_FOR:-}" ]; then
         service_type="$(detect_service_type "${host}")"
 
         if ! wait_for_single_service "${host}" "${port}" "${service_type}" 30 5; then
-            exit 1
+            exit 75
         fi
     done
 fi
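
A minimal sketch of the sentinel pattern this hunk introduces: a readiness wait that times out exits 75 rather than 1, so the caller can tell an infrastructure timeout apart from a real failure. `wait_for_probe` and `run_job` are hypothetical stand-ins (the real `wait_for_single_service` is not shown in this diff), and the attempt count and delay are shortened for illustration.

```shell
#!/bin/sh
# EX_TEMPFAIL from sysexits.h: temporary failure, the caller may retry.
EX_TEMPFAIL=75

# Hypothetical stand-in for wait_for_single_service: runs a probe command
# until it succeeds or the attempts run out.
wait_for_probe() {
    probe="$1"; max_attempts="$2"; delay="$3"
    i=0
    while [ "$i" -lt "$max_attempts" ]; do
        if sh -c "$probe" >/dev/null 2>&1; then
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    return 1
}

# Runs in a subshell so the demo can inspect the exit status
# without terminating the calling shell.
run_job() {
    (
        if ! wait_for_probe "$1" 2 0; then
            exit "$EX_TEMPFAIL"   # service timeout: signal "retry me"
        fi
        exit 0
    )
}

infra_rc=0; run_job "false" || infra_rc=$?
ok_rc=0;    run_job "true"  || ok_rc=$?

echo "service never ready -> exit ${infra_rc}"
echo "service ready       -> exit ${ok_rc}"
```

Any other failure in the job's own script would still surface with its usual exit status (typically 1), so it stays outside the exit_codes retry rule.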
