ci: fix exit 75 retry with FF_ENABLE_BASH_EXIT_CODE_CHECK and FF_USE_NEW_BASH_EVAL_STRATEGY #3827

Draft
Leiyks wants to merge 26 commits into master from
leiyks/fix-kafka-wait-retry

@Leiyks Leiyks commented Apr 24, 2026

Root cause

GitLab Runner does not reliably propagate exit codes from external shell scripts (like wait-for-service-ready.sh) to the job's exit code. Without the fix, a before_script calling .gitlab/wait-for-service-ready.sh that exits 75 would be seen by the runner as exit code 1, so retry: exit_codes: [75] never matched and kafka jobs were never retried on transient infra failures.
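As a rough illustration of the intended contract (not the actual contents of wait-for-service-ready.sh), here is a minimal POSIX shell sketch. The function name wait_until_ready, its arguments, and the probe command are hypothetical. The only detail taken from this PR is that unreadiness is signalled with exit status 75, which corresponds to EX_TEMPFAIL in sysexits.h.

```shell
#!/bin/sh
# Hypothetical helper, not the real wait-for-service-ready.sh.
# Usage: wait_until_ready ATTEMPTS DELAY_SECONDS PROBE_COMMAND [ARGS...]
wait_until_ready() {
  wur_attempts=$1
  wur_delay=$2
  shift 2
  wur_try=1
  while [ "$wur_try" -le "$wur_attempts" ]; do
    # The probe is any command that succeeds once the service is ready.
    if "$@" > /dev/null 2>&1; then
      return 0
    fi
    # Sleep between attempts, but not after the last one.
    if [ "$wur_try" -lt "$wur_attempts" ]; then
      sleep "$wur_delay"
    fi
    wur_try=$((wur_try + 1))
  done
  # 75 marks a transient failure so that retry: exit_codes: [75] can match.
  return 75
}
```

For the retry to trigger, this status has to reach the runner unchanged as the job's exit code. If the runner reports 1 instead, the exit_codes filter has nothing to match.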

The exit-code propagation problem is addressed by two official GitLab Runner feature flags, FF_ENABLE_BASH_EXIT_CODE_CHECK and FF_USE_NEW_BASH_EVAL_STRATEGY (both disabled by default).

Fix

  • generate-common.php: enable both feature flags globally via a top-level variables: block so all generated child pipelines inherit them
  • generate-common.php: keep exit_codes: [75] in the default: retry block (confirmed working once exit codes propagate correctly)
  • generate-tracer.php: remove all explicit per-job/per-template retry_on_infra_failure() and retry_on_script_and_infra_failure() overrides — default: handles everything
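For reviewers who want to see the rough shape of the generated YAML, the sketch below prints a fragment from a shell heredoc. It is an assumption-laden illustration, not the output of generate-common.php: the function name emit_ci_defaults is hypothetical, and the two entries under when: are just examples of infra failure conditions, since the exact list used by the generator is not shown in this description. The flag names, max: 2, and exit_codes: [75] come from this PR.

```shell
#!/bin/sh
# Hypothetical stand-in for the YAML that generate-common.php emits.
emit_ci_defaults() {
  cat <<'EOF'
variables:
  # Both flags are off by default and must be enabled explicitly.
  FF_ENABLE_BASH_EXIT_CODE_CHECK: "true"
  FF_USE_NEW_BASH_EVAL_STRATEGY: "true"

default:
  retry:
    max: 2
    # Assumed examples of infra failure conditions.
    when:
      - runner_system_failure
      - stuck_or_timeout_failure
    exit_codes: [75]
EOF
}

emit_ci_defaults
```

Because the variables: block sits at the top level, every job in the generated child pipeline inherits the flags without per-job overrides.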

Side effect fix

FF_ENABLE_BASH_EXIT_CODE_CHECK adds an explicit exit code check after every YAML script step. This broke command -v switch-php && switch-php in generate-profiler.php: when switch-php is absent, command -v exits 1, the && short-circuits, and the runner treats the non-zero exit as a job failure — even though the intent was "run only if present".

Fixed by rewriting as if command -v switch-php > /dev/null 2>&1; then switch-php ...; fi, whose false branch always exits 0.
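The difference between the two forms can be reproduced outside CI with a small shell sketch. The command name no-such-tool-for-demo is a hypothetical placeholder chosen so that it is absent on any machine; it stands in for switch-php on an image that does not ship it.

```shell
#!/bin/sh
# Hypothetical command name that is assumed not to exist on PATH.
tool="no-such-tool-for-demo"

# Old form: command -v fails, && short-circuits, and the step's status is 1.
short_circuit_status() {
  command -v "$tool" > /dev/null 2>&1 && "$tool"
  echo $?
}

# New form: an if statement whose condition is false and that has no else
# branch completes with status 0.
guarded_status() {
  if command -v "$tool" > /dev/null 2>&1; then
    "$tool"
  fi
  echo $?
}

short_circuit_status   # prints 1
guarded_status         # prints 0
```

With FF_ENABLE_BASH_EXIT_CODE_CHECK checking the status after each step, only the second form keeps "tool not installed" from failing the job.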

Validation

Confirmed with live CI runs:

  • Jobs exiting 1 are not retried
  • Jobs exiting 75 (via wait-for-service-ready.sh) are retried

datadog-official Bot commented Apr 24, 2026

Tests

🎉 All green!

❄️ No new flaky tests detected
🧪 All tests passed

🎯 Code Coverage (details)
Patch Coverage: 100.00%
Overall Coverage: 60.64% (-0.04%)

This comment will be updated automatically if new data arrives.
🔗 Commit SHA: 0d71df6

@Leiyks force-pushed the leiyks/fix-kafka-wait-retry branch 6 times, most recently from 026d7fd to 5aad6c5 on April 24, 2026 14:31
@Leiyks changed the title from "ci: exit 75 on kafka port-open to guarantee retry on broker unreadiness" to "ci: fix exit 75 retry by adding explicit retry config to all base templates" on Apr 24, 2026
…plates

GitLab CI does not support exit_codes in the default: keyword (job-level
only). As a result, the exit_codes: [75] in generate-common.php's
default: block was silently ignored, so jobs exiting 75 (e.g. kafka
not ready in wait-for-service-ready.sh) were never retried.

Fix:
- Add retry_on_infra_failure() helper to generate-common.php to
  centralize the retry config (max 2, infra failure conditions,
  exit_codes: [75])
- Apply it to all child pipeline base templates: .base_test (tracer),
  .verify_job (package), .appsec_test (appsec), .tea_test (shared)
- PHP Language Tests explicit retry gains exit_codes: [75] back, which
  it was losing by overriding the default: block
@Leiyks force-pushed the leiyks/fix-kafka-wait-retry branch 3 times, most recently from 9329923 to 3f47e21 on April 24, 2026 15:51
…t artifact

exit_codes: [75] in GitLab CI only filters retries when paired with
when: script_failure. Without it, only runner/infra-level failures
are retried, so a job exiting 75 from script was silently not retried.

- Add script_failure to retry_on_infra_failure() so exit_codes: [75]
  actually restricts retries to exit 75 (not all script failures)
- Revert kafka "exit 75" test artifact in wait-for-service-ready.sh
  back to return 0 (was temporarily added to confirm exit code behavior)
@Leiyks force-pushed the leiyks/fix-kafka-wait-retry branch from 3f47e21 to 7ddb42b on April 24, 2026 20:50
Leiyks added 5 commits April 24, 2026 23:01
… Tests

These jobs extend .base_test (via .asan_test and .debug_test) and were
overriding the inherited retry config with their own blocks. Now they
inherit retry_on_infra_failure() from .base_test uniformly.

…t failures

exit_codes filtering is non-functional in this GitLab environment.
Use when: script_failure on .cli_integration_test specifically so
integration tests (kafka, mysql, redis, etc.) retry on transient
service-not-ready failures. .base_test keeps infra-only retry to
avoid retrying language/ASAN test failures.
@Leiyks force-pushed the leiyks/fix-kafka-wait-retry branch from d29d06d to bcc6d87 on April 24, 2026 21:40
Leiyks added 10 commits April 24, 2026 23:53
4 jobs testing exit_codes: [75] alone (no when:) across:
- script: exit 75  -> should retry
- script: exit 1   -> should NOT retry
- before_script: exit 75  -> should retry
- before_script: exit 1   -> should NOT retry

Combining when: and exit_codes: causes GitLab to silently ignore
exit_codes. Using exit_codes: [75] alone correctly retries only on
exit 75, confirmed by standalone test jobs.

Standalone test jobs confirmed when: [infra] + exit_codes: [75]
correctly retries on exit 75 only (not exit 1). Apply this directly
to kafka job definitions via retry_on_infra_failure().

…t code tests to ASAN test_c and test_extension_ci
@Leiyks changed the title from "ci: fix exit 75 retry by adding explicit retry config to all base templates" to "ci: fix exit 75 retry with FF_ENABLE_BASH_EXIT_CODE_CHECK and FF_USE_NEW_BASH_EVAL_STRATEGY" on Apr 24, 2026