Skip to content

fix(e2e): probe prover via CPU health sentinel, not the GPU service#118

Merged
alexgarden-mnemom merged 1 commit intomainfrom
fix/e2e-prover-use-sentinel
May 8, 2026
Merged

fix(e2e): probe prover via CPU health sentinel, not the GPU service#118
alexgarden-mnemom merged 1 commit intomainfrom
fix/e2e-prover-use-sentinel

Conversation

@alexgarden-mnemom
Copy link
Copy Markdown
Member

Summary

  • Repoints tests/e2e/config.js prover.base at the CPU-only health_sentinel URL (https://mnemom--mnemom-prover-health-sentinel.modal.run).
  • Tightens the per-endpoint timeout from 60s → 10s now that we're off the GPU cold-start path.

Why

The deploy gate's prover health probe was hitting the H100 prover_service, which cold-starts in 60-180s (container boot + SP1 init, see modal_app.py:135-138). Any probe arriving on a cold container busted the 60s timeout and failed the gate, blocking every other repo's production deploy.

mnemom-prover PR #32 (Apr 28, 2026) added a CPU-only health_sentinel specifically for this purpose: it answers in milliseconds from a CPU container (scaledown_window=300s) by querying Postgres for proofs stuck in pending/proving > 10 min. That's the right readiness signal for the deploy gate — real outages flip it red, quiet idle periods leave it green.

Companion change

mnemom-prover PR #40 reverted the GPU prover_service scaledown_window from 5s back to 60s, restoring dispatch + GPU probe paths. This PR removes the deploy gate's dependence on that knob entirely by getting off the GPU probe path.

Verification

Probed the sentinel URL before opening this PR:

$ curl -sS -m 10 -o /dev/null -w "%{http_code} (%{time_total}s)\n" \
    https://mnemom--mnemom-prover-health-sentinel.modal.run/health
200 (9.3s)

9s on a cold sentinel; subsequent warm probes will be sub-second.

Test plan

  • CI green on this PR
  • Post-merge: trigger any service deploy (e.g., a no-op mnemom-api redeploy) and confirm the prover health check passes in <10s
  • Re-run a previously-failed mnemom-api deploy from yesterday's incident and confirm it now succeeds
  • Check Grafana: prover.dead_letter_breach rate stays at 0 (confirms PR Fix deploy.yml: add missing npm-directory for SDK publish jobs #40 fix is still holding)

Rollback

git revert this commit. Prover-side PR #40 ensures the GPU probe path also works, so rollback is safe — deploys would still pass, just paying GPU cold-start latency on quiet probes.

Follow-ups

  • Verify BetterStack uptime monitor target is the sentinel URL (separate manual check; PR Fix PR deploy context token usage #32's design assumes it but unconfirmed today).
  • Larger architectural follow-up: split this shared E2E suite into per-service smoke gates + a scheduled system-health cron, so one service's regression can't block another's deploy. Tracked in the prover repo's incident postmortem plan as PR-05.

ops-impact: high

Changes which endpoint every service's deploy gate probes. Soak: validated via direct curl pre-merge. Rollback path is one-line revert.

🤖 Generated with Claude Code

The deploy gate's prover health probe targets
`mnemom--mnemom-prover-prover-service.modal.run/health` — the H100
GPU service, which cold-starts in 60-180s per the prover repo's
wait_for_ready (container boot + SP1 init). Any probe arriving on a
cold container busts the 60s timeout and fails the gate, blocking
every other repo's production deploy.

mnemom-prover PR #32 (Apr 28, 2026) added a CPU-only `health_sentinel`
function specifically for this purpose: it answers in milliseconds
from a CPU container (scaledown_window=300s) by querying Postgres
for proofs stuck in pending/proving > 10 min. That's the right
readiness signal for the deploy gate — real outages flip it red,
quiet idle periods leave it green.

This PR repoints `prover.base` at the sentinel URL and tightens the
per-endpoint timeout from 60s to 10s (CPU cold-start is sub-5s, so
a real outage now surfaces inside the deploy gate's budget instead
of riding the full 60s ceiling).

Tested: probed `https://mnemom--mnemom-prover-health-sentinel.modal.run/health`
before opening this PR; returns 200 in 9s on cold sentinel
(scaledown_window=300s means most probes will be sub-second on warm).

Companion change in mnemom-prover: PR #40 reverted the GPU
prover_service `scaledown_window` from 5s back to 60s, restoring
dispatch + GPU probe paths. This PR removes the dependence on that
revert by getting the deploy gate off the GPU probe path entirely.

ops-impact: high — changes which endpoint every service's deploy
gate probes. Soak: validated via direct curl pre-merge.

Rollback: revert this commit to repoint at prover_service. Prover-side
PR #40 ensures that even the GPU probe works again, so rollback is
safe (deploys would still pass, just paying GPU cold-start latency on
quiet probes).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@alexgarden-mnemom alexgarden-mnemom merged commit 9a0fe3b into main May 8, 2026
7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants