fix(e2e): probe prover via CPU health sentinel, not the GPU service #118
Merged
alexgarden-mnemom merged 1 commit into main on May 8, 2026
Conversation
The deploy gate's prover health probe targets `mnemom--mnemom-prover-prover-service.modal.run/health` — the H100 GPU service, which cold-starts in 60-180s per the prover repo's wait_for_ready (container boot + SP1 init). Any probe arriving on a cold container busts the 60s timeout and fails the gate, blocking every other repo's production deploy.

mnemom-prover PR #32 (Apr 28, 2026) added a CPU-only `health_sentinel` function specifically for this purpose: it answers in milliseconds from a CPU container (`scaledown_window=300s`) by querying Postgres for proofs stuck in pending/proving > 10 min. That's the right readiness signal for the deploy gate — real outages flip it red, quiet idle periods leave it green.

This PR repoints `prover.base` at the sentinel URL and tightens the per-endpoint timeout from 60s to 10s (CPU cold-start is sub-5s, so a real outage now surfaces inside the deploy gate's budget instead of riding the full 60s ceiling).

Tested: probed `https://mnemom--mnemom-prover-health-sentinel.modal.run/health` before opening this PR; returns 200 in 9s on a cold sentinel (`scaledown_window=300s` means most probes will be sub-second on warm).

Companion change in mnemom-prover: PR #40 reverted the GPU prover_service `scaledown_window` from 5s back to 60s, restoring dispatch + GPU probe paths. This PR removes the dependence on that revert by getting the deploy gate off the GPU probe path entirely.

ops-impact: high — changes which endpoint every service's deploy gate probes. Soak: validated via direct curl pre-merge. Rollback: revert this commit to repoint at prover_service. Prover-side PR #40 ensures that even the GPU probe works again, so rollback is safe (deploys would still pass, just paying GPU cold-start latency on quiet probes).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
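The sentinel's stuck-proof check described above can be sketched roughly as follows. This is a minimal sketch, not the mnemom-prover implementation: the `proofs` table name, column names, and response shape are all assumptions, and SQLite stands in for Postgres so the logic is runnable here.

```python
import sqlite3
import time

STUCK_THRESHOLD_S = 600  # proofs pending/proving for > 10 min count as stuck


def stuck_proof_count(conn: sqlite3.Connection, now: float) -> int:
    """Count proofs sitting in pending/proving longer than the threshold."""
    (count,) = conn.execute(
        "SELECT COUNT(*) FROM proofs"
        " WHERE status IN ('pending', 'proving')"
        " AND updated_at < ?",
        (now - STUCK_THRESHOLD_S,),
    ).fetchone()
    return count


def health(conn: sqlite3.Connection, now: float) -> tuple[int, dict]:
    """Return (200, body) while nothing is stuck; 503 flips the gate red."""
    stuck = stuck_proof_count(conn, now)
    status = 200 if stuck == 0 else 503
    return status, {"stuck": stuck}
```

The key property for the deploy gate is that a quiet, idle prover (no rows in `pending`/`proving`) reads as healthy, while a backlog of old stuck proofs reads as an outage.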
Summary
Point `tests/e2e/config.js` `prover.base` at the CPU-only `health_sentinel` URL (https://mnemom--mnemom-prover-health-sentinel.modal.run).

Why
The deploy gate's prover health probe was hitting the H100
`prover_service`, which cold-starts in 60-180s (container boot + SP1 init, see `modal_app.py:135-138`). Any probe arriving on a cold container busted the 60s timeout and failed the gate, blocking every other repo's production deploy.

mnemom-prover PR #32 (Apr 28, 2026) added a CPU-only `health_sentinel` specifically for this purpose: it answers in milliseconds from a CPU container (`scaledown_window=300s`) by querying Postgres for proofs stuck in `pending`/`proving` > 10 min. That's the right readiness signal for the deploy gate — real outages flip it red, quiet idle periods leave it green.

Companion change
mnemom-prover PR #40 reverted the GPU
`prover_service` `scaledown_window` from 5s back to 60s, restoring dispatch + GPU probe paths. This PR removes the deploy gate's dependence on that knob entirely by getting off the GPU probe path.

Verification
Probed the sentinel URL before opening this PR: 200 in 9s on a cold sentinel; subsequent warm probes will be sub-second.
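The gate-side probe this verifies amounts to an HTTP GET with a hard deadline. A minimal sketch of that semantics — the endpoint URL and the 10s budget come from this PR; the function itself is an illustration, not the actual deploy-gate code:

```python
import time
import urllib.error
import urllib.request

SENTINEL_URL = "https://mnemom--mnemom-prover-health-sentinel.modal.run/health"


def probe(url: str, timeout_s: float = 10.0) -> tuple[bool, float]:
    """GET the health endpoint; pass only on a 200 inside the deadline."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            ok = resp.status == 200
    except (urllib.error.URLError, TimeoutError, OSError):
        # Timeouts, connection refusals, and DNS failures all read as unhealthy.
        ok = False
    return ok, time.monotonic() - start
```

Under this semantics a cold GPU `prover_service` (60-180s boot) can never pass a 10s deadline, while the sentinel's sub-5s cold start keeps even a worst-case healthy probe inside budget.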
Test plan
- Trigger a `mnemom-api` redeploy and confirm the prover health check passes in <10s
- Re-run the `mnemom-api` deploy from yesterday's incident and confirm it now succeeds
- Confirm the `prover.dead_letter_breach` rate stays at 0 (confirms the prover-side PR #40 fix is still holding)

Rollback
`git revert` this commit. Prover-side PR #40 ensures the GPU probe path also works, so rollback is safe — deploys would still pass, just paying GPU cold-start latency on quiet probes.

Follow-ups
ops-impact: high
Changes which endpoint every service's deploy gate probes. Soak: validated via direct curl pre-merge. Rollback path is one-line revert.
🤖 Generated with Claude Code