ci: fix exit 75 retry with FF_ENABLE_BASH_EXIT_CODE_CHECK and FF_USE_NEW_BASH_EVAL_STRATEGY#3827
Draft
🎉 All green! ❄️ No new flaky tests detected. 🔗 Commit SHA: 0d71df6
026d7fd to 5aad6c5
…plates
GitLab CI does not support exit_codes in the default: keyword (job-level only). As a result, the exit_codes: [75] in generate-common.php's default: block was silently ignored, so jobs exiting 75 (e.g. kafka not ready in wait-for-service-ready.sh) were never retried.
Fix:
- Add retry_on_infra_failure() helper to generate-common.php to centralize the retry config (max 2, infra failure conditions, exit_codes: [75])
- Apply it to all child pipeline base templates: .base_test (tracer), .verify_job (package), .appsec_test (appsec), .tea_test (shared)
- PHP Language Tests explicit retry gains exit_codes: [75] back, which it was losing by overriding the default: block
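For readers unfamiliar with the exit 75 convention referenced above, here is a minimal sketch of how a readiness helper could signal a transient failure. It is not the contents of wait-for-service-ready.sh: the function name wait_for_ready, its arguments, and the probe command are all hypothetical. The value 75 matches EX_TEMPFAIL from sysexits.h, which is presumably why it was chosen for "service not ready yet".

```shell
#!/bin/sh
# Hypothetical sketch, not the real wait-for-service-ready.sh.
# Usage: wait_for_ready ATTEMPTS DELAY_SECONDS PROBE_COMMAND [ARGS...]
# Returns 0 as soon as the probe succeeds, or 75 (temporary failure)
# once every attempt has failed, so a retry rule can match on 75.
wait_for_ready() {
    attempts=$1
    delay=$2
    shift 2
    i=0
    while [ "$i" -lt "$attempts" ]; do
        # The probe is any command that succeeds when the service is up.
        if "$@" > /dev/null 2>&1; then
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    return 75
}
```

A job script would then call something like `wait_for_ready 30 2 some-probe-command || exit $?`, so the 75 reaches the runner unchanged instead of being collapsed into a generic failure.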
9329923 to 3f47e21
…t artifact
exit_codes: [75] in GitLab CI only filters retries when paired with when: script_failure. Without it, only runner/infra-level failures are retried, so a job exiting 75 from script was silently not retried.
- Add script_failure to retry_on_infra_failure() so exit_codes: [75] actually restricts retries to exit 75 (not all script failures)
- Revert kafka "exit 75" test artifact in wait-for-service-ready.sh back to return 0 (was temporarily added to confirm exit code behavior)
3f47e21 to 7ddb42b
… Tests
These jobs extend .base_test (via .asan_test and .debug_test) and were overriding the inherited retry config with their own blocks. Now they inherit retry_on_infra_failure() from .base_test uniformly.
…t failures
exit_codes filtering is non-functional in this GitLab environment. Use when: script_failure on .cli_integration_test specifically so integration tests (kafka, mysql, redis, etc.) retry on transient service-not-ready failures. .base_test keeps infra-only retry to avoid retrying language/ASAN test failures.
d29d06d to bcc6d87
4 jobs testing exit_codes: [75] alone (no when:) across:
- script: exit 75 -> should retry
- script: exit 1 -> should NOT retry
- before_script: exit 75 -> should retry
- before_script: exit 1 -> should NOT retry
Combining when: and exit_codes: causes GitLab to silently ignore exit_codes. Using exit_codes: [75] alone correctly retries only on exit 75, confirmed by standalone test jobs.
Standalone test jobs confirmed when: [infra] + exit_codes: [75] correctly retries on exit 75 only (not exit 1). Apply this directly to kafka job definitions via retry_on_infra_failure().
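The retry policy these test jobs were checking (retry at most 2 times, and only when the exit status is 75) can be modelled locally as a small shell function. This is only a sketch of the intended behaviour for reasoning about the test matrix, not how GitLab implements retry; run_with_retry is a hypothetical name.

```shell
#!/bin/sh
# Local model of the intended policy: re-run a command up to 2 extra times,
# but only when it exits with status 75. Any other status is returned as-is.
# run_with_retry is an illustrative helper, not part of this PR.
run_with_retry() {
    max_retries=2
    attempt=0
    while :; do
        status=0
        "$@" || status=$?
        # Stop on success, on any non-75 failure, or when retries are used up.
        if [ "$status" -ne 75 ] || [ "$attempt" -ge "$max_retries" ]; then
            return "$status"
        fi
        attempt=$((attempt + 1))
    done
}
```

Under this model a command that always exits 75 runs three times in total (one initial run plus two retries), while a command that exits 1 runs exactly once, which is the behaviour the four test jobs were meant to confirm.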
…t code tests to ASAN test_c and test_extension_ci
…package pipelines
… superseded by default block
…STRATEGY on test_c and kafka jobs to verify exit code propagation
…RATEGY globally via generate-common.php
…&& cmd patterns on alpine
…in profiler; reset libdatadog to master
Root cause
GitLab Runner does not reliably propagate exit codes from external shell scripts (like wait-for-service-ready.sh) to the job's exit code. Without the fix, a before_script calling .gitlab/wait-for-service-ready.sh that exits 75 would be seen by the runner as exit code 1, so retry: exit_codes: [75] never matched and kafka jobs were never retried on transient infra failures.

This is addressed by two official GitLab Runner feature flags (both disabled by default):
- FF_ENABLE_BASH_EXIT_CODE_CHECK: check exit code after each command rather than relying solely on set -e
- FF_USE_NEW_BASH_EVAL_STRATEGY: run the bash eval call in a subshell for proper exit code detection

Fix
- generate-common.php: enable both feature flags globally via a top-level variables: block so all generated child pipelines inherit them
- generate-common.php: keep exit_codes: [75] in the default: retry block (confirmed working once exit codes propagate correctly)
- generate-tracer.php: remove all explicit per-job/per-template retry_on_infra_failure() and retry_on_script_and_infra_failure() overrides; default: handles everything

Side effect fix
FF_ENABLE_BASH_EXIT_CODE_CHECK adds an explicit exit code check after every YAML script step. This broke command -v switch-php && switch-php in generate-profiler.php: when switch-php is absent, command -v exits 1, the && short-circuits, and the runner treats the non-zero exit as a job failure, even though the intent was "run only if present".

Fixed by rewriting as
if command -v switch-php > /dev/null 2>&1; then switch-php ...; fi
whose false branch always exits 0.

Validation
Confirmed with live CI runs:
wait-for-service-ready.sh) are retried
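
To make the side effect fix concrete, the two guard patterns can be compared in plain shell. This is a minimal sketch: guard_and and guard_if are hypothetical wrappers written for illustration, and any command name that is not installed stands in for an absent switch-php.

```shell
#!/bin/sh
# Old pattern: "cmd present && run cmd". When the command is missing,
# the AND list as a whole ends with a non-zero status.
guard_and() {
    command -v "$1" > /dev/null 2>&1 && "$@"
}

# New pattern: an if statement whose condition is false and which has
# no else branch ends with status 0, so the step is not seen as failed.
guard_if() {
    if command -v "$1" > /dev/null 2>&1; then
        "$@"
    fi
}
```

Calling `guard_and` with a command that does not exist yields a non-zero status, which a per-step exit code check would report as a failure, while `guard_if` with the same argument yields 0. When the command does exist, both wrappers run it and return its status.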