fix(ci): improve K8s nightly stability - readiness check and CrashLoopBackOff detection #4790
zdrapela wants to merge 3 commits into
Skipping CI for Draft Pull Request.
/qodo
/test e2e-eks-helm-nightly
/test e2e-aks-helm-nightly
/agentic_review
Code Review by Qodo
1. Crashloop filter hides failures
/test e2e-eks-helm-nightly
/test e2e-aks-helm-nightly
Codecov Report
✅ All modified and coverable lines are covered by tests.
Additional details and impacted files
@@ Coverage Diff @@
## main #4790 +/- ##
===========================================
+ Coverage 40.88% 69.49% +28.60%
===========================================
Files 119 109 -10
Lines 2228 4710 +2482
Branches 562 513 -49
===========================================
+ Hits 911 3273 +2362
- Misses 1311 1437 +126
+ Partials 6 0 -6
/test e2e-aks-helm-nightly
/test e2e-eks-helm-nightly
/test e2e-aks-helm-nightly
@zdrapela: The following tests failed, say
Add SKIP_TESTS guard to the 3 jobs that bypassed it by calling testing::run_tests directly (ocp-nightly runtime, ocp-operator runtime, auth-providers).

Introduce DEPLOYMENT_TYPE env var (showcase, showcase-rbac, all) for K8s jobs (AKS, EKS, GKE) to allow deploying only one deployment type and keeping it alive for local test re-runs. When set to a single deployment, namespace cleanup and DNS cleanup are skipped.

Add -d/--deployment CLI flag to local-run.sh with interactive prompts for K8s jobs. Update local-test-setup.sh to accept showcase-rbac as primary argument (rbac kept as alias).

CI behavior is unchanged: DEPLOYMENT_TYPE defaults to 'all' when unset.

Assisted-by: OpenCode
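The guard and the DEPLOYMENT_TYPE default described in this commit message could look roughly like the sketch below. It is not the repository's code: the function names (resolve_deployments, should_cleanup, run_tests_guarded) are hypothetical, and treating SKIP_TESTS=true as the skip value is an assumption.

```shell
#!/usr/bin/env bash
# Illustrative sketch only. Function names and the SKIP_TESTS value "true"
# are assumptions, not taken from the repository's scripts.

# Print the deployments to create, separated by spaces.
# DEPLOYMENT_TYPE falls back to 'all' when unset or empty, matching CI.
resolve_deployments() {
  case "${DEPLOYMENT_TYPE:-all}" in
    all) echo "showcase showcase-rbac" ;;
    showcase | showcase-rbac) echo "${DEPLOYMENT_TYPE}" ;;
    *)
      echo "Unsupported DEPLOYMENT_TYPE: ${DEPLOYMENT_TYPE}" >&2
      return 1
      ;;
  esac
}

# Succeed only when every deployment type was requested. A single
# deployment is kept alive for local re-runs, so cleanup is skipped.
should_cleanup() {
  [ "${DEPLOYMENT_TYPE:-all}" = "all" ]
}

# Run the given test command unless SKIP_TESTS asks to skip it.
run_tests_guarded() {
  if [ "${SKIP_TESTS:-false}" = "true" ]; then
    echo "SKIP_TESTS=true, skipping test run"
    return 0
  fi
  "$@"
}
```

A job would then call run_tests_guarded with its usual test command instead of invoking the runner directly, and wrap namespace and DNS cleanup in `if should_cleanup; then ... fi`.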
Update e2e-deploy-rhdh skill to document the new -d/--deployment flag for K8s jobs and remove the OCP-only restriction from deploy-only mode.

Update e2e-parse-ci-failure skill to include -d flag in local-run.sh command output.

Update local-test-setup.sh argument references from 'rbac' to 'showcase-rbac' in reproduce-failure, verify-fix, and diagnose-and-fix skills.

Assisted-by: OpenCode
45ff335 to ab88355
The Helm chart (1.10-114-CI) enables lightspeed by default, which causes two deployment failures on K8s platforms:

1. lightspeed-core sidecar crashes with sqlite3.OperationalError (attempt to write a readonly database), making the pod not ready (1/2 containers) and the ingress return 503
2. lightspeed-backend dynamic plugin triggers 'Zip bomb detected' error in the init container, preventing startup entirely

Disable the lightspeed sidecar (global.lightspeed.enabled: false) and both sets of lightspeed plugin references (chart's registry.access and catalog index's ghcr.io entries) in all K8s diff-values files (AKS, EKS, GKE x showcase, showcase-rbac).

Assisted-by: OpenCode
ab88355 to b7534ee
The container image build workflow finished with status:
Summary
Fixes the root causes of consistent AKS/EKS/GKE nightly E2E job failures (100% failure rate across all 3 K8s platforms for the last 2+ weeks).
Root Causes Identified
1. RBAC phase aborted by CrashLoopBackOff fast-fail: The `lightspeed-core` sidecar (shipped in chart `1.10-114-CI` with `lightspeed-stack:0.5.0`) crashes on all platforms, including GKE. On GKE, the backstage-backend happens to become ready before the detection fires, so tests run. On AKS/EKS, the detection fires first and aborts the deployment, so 0 RBAC tests ever ran.
2. Showcase phase test failures (guest sign-in / 503s): The CI health check used `curl -I <root_url>` (HEAD to the frontend), which returns 200 as soon as the ingress serves the SPA, before the backend API (including auth) is initialized. Tests started too early and hit 503s on `/api/auth/guest/refresh`. Additionally, the ALB/nginx ingress intermittently returns 502 during backend health propagation even after the first successful response.
Changes
- Poll `/.backstage/health/v1/readiness` instead of the root URL. This endpoint returns 503 until all backend plugins (including auth) complete initialization.
- Exclude the `lightspeed-core` sidecar from the fast-fail detection. The sidecar is non-essential for E2E tests (GKE proves this: all 63 RBAC tests pass with lightspeed-core in CrashLoopBackOff). The fallback check is narrowed to `Init:CrashLoopBackOff` only (still catches init container crashes like the install-dynamic-plugins zip bomb).
Verified on EKS