fix(e2e): mobile + guardian iOS E2E green on dedicated xlarge runners; guardian auth read over sync atom; transient-failure hardening #302

Merged: WiktorStarczewski merged 20 commits into main from ios-cdp-eval-timeout on Jun 28, 2026
Conversation

@WiktorStarczewski WiktorStarczewski commented Jun 27, 2026

Collaborator

Makes the mobile + blockchain E2E suites robust to two distinct transient CI failures on the macOS runners that were causing the mobile E2E jobs to fail on every run. Both are root-caused below with CI evidence; the third historical layer (the iOS build itself) was fixed separately in #299.

Why the mobile E2E jobs were always red (layered root cause)

1. (fixed in #299, already on main) The iOS app didn't build. Every mobile job runs test:e2e:mobile:build first; on the macos-26 runner (Xcode 26.5) two as? SecKey downcasts in HotKeyPlugin.swift were rejected as a hard error → BUILD FAILED (exit 65), no tests ran, no artifacts. That's why all four mobile jobs were red and why it "passed locally" (older local Xcode).

2. (this PR) create_wallets hung the full 15-minute timeout. Once #299 let the suite run, the devnet mobile job timed out in create_wallets (the wallet reached its home screen, but the readiness poll never returned). CdpBridge.eval/evaluate awaited the WebKit executeAtom call with no timeout (unlike evalAsync), and pollForCondition only checks its deadline between iterations — so one wedged eval (flaky RWI socket, or the WebView main thread briefly blocked by mobile main-thread WASM on the slower runner) hung the test until Playwright's global kill, and the rest of the serial suite skipped.

  • Fix: eval/evaluate race executeAtom against a 30s hard timeout. A wedge becomes a fast throw → pollForCondition enforces its own budget → --retries restarts on a fresh app + CDP.
  • Verified in CI (run on the first commit of this branch): the failure signature changed from create_wallets / wasTimeout: true / hung 903s → the test now passes create_wallets and runs ~9.4 min deeper into the flow. The hang is gone.
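The timeout race described in layer 2 can be sketched as follows. This is a minimal illustration, not the PR's code: `withTimeout` is a hypothetical helper name, the `pollForCondition` signature here is assumed, and treating a thrown probe as "not ready yet" is also an assumption about how the poll reacts to the fast throw.

```typescript
// Hypothetical helper: reject if `work` has not settled within `ms`.
function withTimeout<T>(work: Promise<T>, ms: number, label: string): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`${label} timed out after ${ms}ms`)), ms);
  });
  // Whichever settles first wins; the timer is cleared on both paths.
  return Promise.race([work, timeout]).then(
    (value) => {
      clearTimeout(timer);
      return value;
    },
    (error) => {
      clearTimeout(timer);
      throw error;
    },
  );
}

// Assumed shape of a poll loop whose deadline is only checked between
// iterations, which is why every probe has to bound itself.
async function pollForCondition(
  probe: () => Promise<boolean>,
  budgetMs: number,
  intervalMs: number,
  probeTimeoutMs: number,
): Promise<boolean> {
  const deadline = Date.now() + budgetMs;
  while (Date.now() < deadline) {
    try {
      if (await withTimeout(probe(), probeTimeoutMs, 'eval')) return true;
    } catch {
      // Assumption: a wedged or failed probe counts as "not ready yet".
    }
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  return false;
}
```

With the per-probe bound in place, a probe that never settles costs at most `probeTimeoutMs` per iteration, so the loop returns once its own budget is spent instead of waiting for the global kill.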

3. (this PR) The miden-client harness CLI failed the mint on a transient prover-connection error. With the hang gone, the devnet job then failed at mint_tokens_to_wallet_b: the CLI's delegated-prover TLS/gRPC handshake flaked (failed to connect to the remote prover → transport error → no native certs found). The failure was intermittent, since a sibling mint in the same test connected fine. The CLI deploy/mint/sync retry loop only classified node-RPC + nonce-lag errors as transient, so this connection error was treated as fatal and failed immediately.

  • Fix: recognize connection-level prover errors as transient and retry with backoff. Also unified the three duplicated transient classifiers into one isTransientCliError helper.
  • The classifier was checked against the real captured error strings: it matches the new transient cases and still rejects deterministic errors (bad asset, parse error, genuine proving-logic failures) so it can't mask a real bug by retrying forever.
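A rough sketch of the classify-then-retry idea from layer 3 is below. Only the name `isTransientCliError` and the three prover error fragments come from the text above; the pattern list, the `retryTransient` helper, and its parameters are illustrative assumptions, and the existing node-RPC and nonce-lag patterns are omitted because their strings are not quoted here.

```typescript
// Illustrative patterns: only these three fragments appear in the captured
// prover error; the real classifier also covers node-RPC and nonce-lag errors.
const TRANSIENT_PATTERNS: RegExp[] = [
  /failed to connect to the remote prover/i,
  /transport error/i,
  /no native certs found/i,
];

function isTransientCliError(message: string): boolean {
  return TRANSIENT_PATTERNS.some((pattern) => pattern.test(message));
}

// Hypothetical retry wrapper: exponential backoff, bounded attempts, and
// non-transient errors rethrown immediately so real bugs are not masked.
async function retryTransient<T>(
  run: () => Promise<T>,
  maxAttempts: number,
  baseDelayMs: number,
): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await run();
    } catch (err) {
      const message = err instanceof Error ? err.message : String(err);
      if (attempt >= maxAttempts || !isTransientCliError(message)) throw err;
      const delayMs = baseDelayMs * 2 ** (attempt - 1);
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
}
```

Keeping the classifier as a single pure function over the error string is what makes it checkable against captured CI output without running the CLI.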

Verification

  • iOS + blockchain Playwright configs compile cleanly (playwright test --list loads both modified helpers).
  • CI lint is src-scoped, so these playwright/ changes aren't lint-gated; style matches the surrounding files.
  • An E2E Blockchain (devnet) run is triggered on this branch to confirm the mint now retries through the transient prover error and the mobile-devnet job goes green. Result posted once it completes.

Not addressed (separate, and the gate already tolerates a single-network failure): testnet's slower commit cadence and its own persistent-state nonce conflicts.


Update: guardian gate (layer 4). The Mobile Guardian gate had a separate, recent regression (from the #227 guardian merge that's now main's tip): the iOS verify_guardian_auth_structure read deadlocked on WASM-lock starvation. useSyncTrigger's 3s in-process sync kept re-grabbing the single-threaded WASM lock faster than the read could progress. A 90s eval budget still timed out, proving contention, not slowness (that attempt was reverted).

Fixed by having __TEST_GUARDIAN_AUTH__ set a test-only __TEST_SYNC_PAUSED__ flag that suspends useSyncTrigger for the duration of the read (always cleared in finally), paired with a bounded 90s budget to wait out any single in-flight sync. The flag is gated on MIDEN_E2E_TEST and tree-shaken out of production. Added unit tests for the pause branches (extension + mobile + the production-ignores-flag case); full coverage stays ≥95%. Verifying on CI.


Final resolution (supersedes the layer-4 note above — the record is kept intact deliberately).

Guardian iOS auth read — the real root cause. The __TEST_SYNC_PAUSED__ / WASM-starvation theory above was wrong. The actual bug: CdpBridge.evalAsync (appium execute_async_script) is broken on the iOS RWI bridge — its completion callback arrives in the arguments[arguments.length - 1] slot as the boolean true, not a function, so cb(result) throws TypeError, the promise rejects unhandled, the callback never fires, and every evalAsync hangs to its timeout regardless of how fast the script ran. (Signature in the CI timeline: Unhandled Promise Rejection: TypeError: d is not a function ... 'd' is true the instant the read ran, then a 60s hang — even with the stash already populated.) getGuardianAuthInfo was its only caller, so it had effectively never worked on iOS. Fix: read the auth structure — captured into a global by the wallet's own balance poll via a pure AccountInspector.fromAccount parse (no WASM, no signing, no client load) — over the reliable synchronous eval atom, polled. Verified on macos-26-xlarge: guardian-devnet passes; the sync read returns {threshold, signerCommitments:[2 entries], procedureThresholds} instantly.
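The polled synchronous read can be sketched roughly as below. Everything named here is an assumption for illustration: `SyncEval` stands in for the synchronous eval atom (taken to accept a JS expression and resolve to its value), `readStashedAuth` is a hypothetical helper, and the global's name is passed in because the real one is not given above.

```typescript
// Assumed shape of the synchronous eval atom: takes a JS expression,
// resolves to whatever the expression evaluated to in the WebView.
type SyncEval = (expression: string) => Promise<unknown>;

// Hypothetical helper: poll a global that the app populates on its own
// schedule, using only short synchronous evals (no async-script callback).
async function readStashedAuth(
  evalSync: SyncEval,
  globalName: string,
  budgetMs: number,
  intervalMs: number,
): Promise<unknown> {
  // Serialize inside the page so only a plain string crosses the bridge.
  const expression = `JSON.stringify(window[${JSON.stringify(globalName)}] || null)`;
  const deadline = Date.now() + budgetMs;
  while (Date.now() < deadline) {
    const raw = await evalSync(expression);
    if (typeof raw === 'string') {
      const parsed: unknown = JSON.parse(raw);
      if (parsed !== null) return parsed;
    }
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error(`${globalName} was not populated within ${budgetMs}ms`);
}
```

Because each eval returns immediately with either the stash or `null`, the test never depends on a completion callback being delivered across the RWI bridge.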

The mobile suite couldn't run at all — degraded shared macos-26 runners. Independently, the shared macos-26 runner pool degraded (noisy-neighbour IO): every simctl op crawled (97 CI samples: per-wallet _simPair setup p90 267s / max 401s vs. <5s healthy), so two-sim setup couldn't finish even in 15 min. Confirmed infra, not code — failure history shows mobile-devnet green at 06-27 19:41, then degraded from ~21:00 for 9h+, and the non-guardian mint test failed _simPair setup before any cap code existed. Fix: moved all four mobile E2E jobs to dedicated macos-26-xlarge runners (Apple Silicon, 2× vCPU/RAM, no noisy neighbours); setup is back to ~2-3 min and the full suite is green. Belt-and-suspenders for any residual slowness: _simPair cap 13 min, per-test timeout 25 min, --retries=2.

Result: Mobile E2E Gate + Mobile Guardian E2E Gate both green on macos-26-xlarge (devnet run 28316908352); both-network validation in progress. The xlarge runner is a 1-line change to revert to standard macos-26 once GitHub's shared pool recovers.

@WiktorStarczewski WiktorStarczewski changed the title from "fix(e2e): hard-timeout iOS CDP eval so a wedged WebView fails fast instead of hanging the test" to "fix(e2e): make mobile/blockchain E2E robust to transient macOS-CI failures (CDP hang + prover-connect retry)" on Jun 27, 2026
90s still times out (proven in CI run 28294249613): the guardian auth read is
WASM-lock-starved by useSyncTrigger's 3s-cadence sync, not merely slow, so no
fixed eval budget fixes it. Reverting to keep this PR to verified-working fixes.
…rage parse) to avoid the OZ multisig load signing-loop
…ount, 60s eval budget) + widen iOS wallet-create timeouts
… so the lone getAccount isn't queued behind a slow sync
…red stash (no WASM call in the test eval path)
…un so a degraded CoreSimulator fails fast and the retry gets a fresh daemon
…tashes before the auth step reads it (was racing fire-and-forget on iOS)
…re when the exact address key differs (stash was populated but keyed differently)
…_script callback arrives as boolean true, hangs every evalAsync
…min, test timeout 15->25min) instead of killing runs that would pass
… (shared macos-26 pool degraded for hours, _simPair setup couldn't finish)
@WiktorStarczewski WiktorStarczewski changed the title from "fix(e2e): make mobile/blockchain E2E robust to transient macOS-CI failures (CDP hang + prover-connect retry)" to "fix(e2e): mobile + guardian iOS E2E green on dedicated xlarge runners; guardian auth read over sync atom; transient-failure hardening" on Jun 28, 2026
@WiktorStarczewski WiktorStarczewski merged commit e8928de into main Jun 28, 2026
14 checks passed