fix(e2e): probe prover via CPU health sentinel, not the GPU service by alexgarden-mnemom · Pull Request #118 · mnemom/deploy

alexgarden-mnemom · 2026-05-08T03:33:44Z

Summary

Repoints tests/e2e/config.js prover.base at the CPU-only health_sentinel URL (https://mnemom--mnemom-prover-health-sentinel.modal.run).
Tightens the per-endpoint timeout from 60s → 10s now that we're off the GPU cold-start path.

Why

The deploy gate's prover health probe was hitting the H100 prover_service, which cold-starts in 60-180s (container boot + SP1 init, see modal_app.py:135-138). Any probe arriving on a cold container busted the 60s timeout and failed the gate, blocking every other repo's production deploy.

mnemom-prover PR #32 (Apr 28, 2026) added a CPU-only health_sentinel specifically for this purpose: it answers in milliseconds from a CPU container (scaledown_window=300s) by querying Postgres for proofs stuck in pending/proving > 10 min. That's the right readiness signal for the deploy gate — real outages flip it red, quiet idle periods leave it green.

Companion change

mnemom-prover PR #40 reverted the GPU prover_service scaledown_window from 5s back to 60s, restoring dispatch + GPU probe paths. This PR removes the deploy gate's dependence on that knob entirely by getting off the GPU probe path.

Verification

Probed the sentinel URL before opening this PR:

$ curl -sS -m 10 -o /dev/null -w "%{http_code} (%{time_total}s)\n" \
    https://mnemom--mnemom-prover-health-sentinel.modal.run/health
200 (9.3s)

9s on a cold sentinel; subsequent warm probes will be sub-second.

Test plan

CI green on this PR
Post-merge: trigger any service deploy (e.g., a no-op mnemom-api redeploy) and confirm the prover health check passes in <10s
Re-run a previously-failed mnemom-api deploy from yesterday's incident and confirm it now succeeds
Check Grafana: prover.dead_letter_breach rate stays at 0 (confirms PR Fix deploy.yml: add missing npm-directory for SDK publish jobs #40 fix is still holding)

Rollback

git revert this commit. Prover-side PR #40 ensures the GPU probe path also works, so rollback is safe — deploys would still pass, just paying GPU cold-start latency on quiet probes.

Follow-ups

Verify BetterStack uptime monitor target is the sentinel URL (separate manual check; PR Fix PR deploy context token usage #32's design assumes it but unconfirmed today).
Larger architectural follow-up: split this shared E2E suite into per-service smoke gates + a scheduled system-health cron, so one service's regression can't block another's deploy. Tracked in the prover repo's incident postmortem plan as PR-05.

ops-impact: high

Changes which endpoint every service's deploy gate probes. Soak: validated via direct curl pre-merge. Rollback path is one-line revert.

🤖 Generated with Claude Code

The deploy gate's prover health probe targets `mnemom--mnemom-prover-prover-service.modal.run/health` — the H100 GPU service, which cold-starts in 60-180s per the prover repo's wait_for_ready (container boot + SP1 init). Any probe arriving on a cold container busts the 60s timeout and fails the gate, blocking every other repo's production deploy. mnemom-prover PR #32 (Apr 28, 2026) added a CPU-only `health_sentinel` function specifically for this purpose: it answers in milliseconds from a CPU container (scaledown_window=300s) by querying Postgres for proofs stuck in pending/proving > 10 min. That's the right readiness signal for the deploy gate — real outages flip it red, quiet idle periods leave it green. This PR repoints `prover.base` at the sentinel URL and tightens the per-endpoint timeout from 60s to 10s (CPU cold-start is sub-5s, so a real outage now surfaces inside the deploy gate's budget instead of riding the full 60s ceiling). Tested: probed `https://mnemom--mnemom-prover-health-sentinel.modal.run/health` before opening this PR; returns 200 in 9s on cold sentinel (scaledown_window=300s means most probes will be sub-second on warm). Companion change in mnemom-prover: PR #40 reverted the GPU prover_service `scaledown_window` from 5s back to 60s, restoring dispatch + GPU probe paths. This PR removes the dependence on that revert by getting the deploy gate off the GPU probe path entirely. ops-impact: high — changes which endpoint every service's deploy gate probes. Soak: validated via direct curl pre-merge. Rollback: revert this commit to repoint at prover_service. Prover-side PR #40 ensures that even the GPU probe works again, so rollback is safe (deploys would still pass, just paying GPU cold-start latency on quiet probes). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

alexgarden-mnemom merged commit 9a0fe3b into main May 8, 2026
7 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(e2e): probe prover via CPU health sentinel, not the GPU service#118

fix(e2e): probe prover via CPU health sentinel, not the GPU service#118
alexgarden-mnemom merged 1 commit intomainfrom
fix/e2e-prover-use-sentinel

alexgarden-mnemom commented May 8, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

alexgarden-mnemom commented May 8, 2026

Summary

Why

Companion change

Verification

Test plan

Rollback

Follow-ups

ops-impact: high

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants