Bounded by horizon. Three reference scenarios shipped as YAML; not trained in this repo.
The Advanced tier is the first tier where single-incident reasoning is solved and the new test is long-horizon multi-incident sequences with partial observability. It targets the seed/Series A persona: $300–500 of compute, 1–2 A100-days, fine-tuning a Qwen 7B-14B with LoRA + GRPO + a small DPO pass on the hardest 10% of scenarios.
This document is the design defence for that tier. The three reference scenarios in sre_gym/advanced/scenarios/ are the proof-of-shape: real topologies, real action sets, real reward dimensions, real reference traces.
The single insight: at 60–90 ticks per episode, the agent has to track state that no single 8K context window can hold. That changes the training problem fundamentally:
- Trajectories must be summarized (or windowed, or compressed) inside the policy's context.
- Reward shaping has to survive across summarization boundaries — i.e. the agent has to commit to a remediation plan before any single rollback delivers terminal reward.
- The action space grows because the agent is making more kinds of decisions: escalate vs. ack vs. continue, trace vs. metrics vs. logs, feature-flag-toggle vs. rollback. 28 actions instead of 11.
- Recovery from early mistakes is now scored explicitly. Scenario 1 is unsolvable without a second rollback after the first one introduces a chained incident. An agent that gets phase 1 right and phase 2 wrong scores worse than one that gets both right but slow.
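The first bullet can be pictured as a rolling context buffer that keeps the most recent ticks verbatim and folds older ticks into a running summary. The sketch below is illustrative only: `RollingContext`, `window`, and the count-based summary are hypothetical and not part of sre_gym, and a real policy would more likely use a learned or LLM-written summary.

```python
from collections import Counter, deque
from dataclasses import dataclass, field


@dataclass
class RollingContext:
    """Keep the last `window` ticks verbatim; compress older ticks into counts."""
    window: int = 8
    recent: deque = field(default_factory=deque)
    summary: Counter = field(default_factory=Counter)

    def add(self, tick: int, action: str, observation: str) -> None:
        self.recent.append((tick, action, observation))
        # Once the window overflows, fold the oldest tick into the summary.
        while len(self.recent) > self.window:
            _, old_action, _ = self.recent.popleft()
            self.summary[old_action] += 1

    def render(self) -> str:
        """Build the text that would be placed in the policy's context."""
        lines = []
        if self.summary:
            counts = ", ".join(f"{a} x{n}" for a, n in sorted(self.summary.items()))
            lines.append(f"[summary of earlier ticks] {counts}")
        lines.extend(f"t={t} {a}: {o}" for t, a, o in self.recent)
        return "\n".join(lines)


ctx = RollingContext(window=3)
for t in range(6):
    ctx.add(t, "query_metrics" if t % 2 == 0 else "query_traces", f"obs-{t}")
print(ctx.render())
```

The rendered context stays bounded no matter how long the episode runs, which is the property the 60–90 tick horizon demands. What the summary throws away is the real design question, and this sketch does not answer it.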
The persona is "an SRE who has 30 minutes and an evolving alert dashboard, not a single page about a single thing." The training goal is to teach long-horizon coherence without sacrificing the per-tick quality the Basic tier teaches.
sre_gym/advanced/scenarios/cascading_release_train.yaml
A release train deploys gateway, worker, and migration-runner together at 14:02 UTC. Phase 1 fault: the migration applies successfully and the gateway code expects the new column, but the worker keeps using its pinned schema version asynchronously, so worker writes are stamped with the old schema. Phase 1 looks like a schema_drift_missing_migration incident; the correct phase-1 action is rollback_deploy(api-gateway).
Five simulated minutes (25 ticks) later, the worker's drift sync triggers a chain of failed retries. Now the worker is the loudest service, with metrics that look like a fresh dependency-pool-exhaustion incident. The trained agent must recognize that:
- The phase-2 timing aligns with the phase-1 fix.
- The phase-2 metrics align with the phase-1 deploy timestamp, not a new deploy.
- The correct phase-2 action is `rollback_deploy(worker-orders)`, not a fresh investigation.
This is evaluation Theme #2 ("super long-horizon planning") verbatim: track state over extended trajectories, recover from early mistakes, decompose goals. A short-context agent that treats phase 2 as a fresh incident scores 0.30 lower than one that recognizes the chain.
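The recognition step can be pictured as a timestamp comparison: a phase-2 anomaly is attributed to the chain when it begins shortly after the phase-1 fix and no new deploy has landed in between. This is a minimal sketch under that assumption. `looks_chained`, its parameters, the `max_lag` default, and the tick values in the example are all hypothetical, and the scenario YAML or grader may express the rule differently.

```python
from typing import Sequence


def looks_chained(
    anomaly_tick: int,
    phase1_fix_tick: int,
    deploy_ticks: Sequence[int],
    max_lag: int = 30,
) -> bool:
    """Return True if the anomaly plausibly follows from the phase-1 fix.

    Assumed heuristic: the anomaly starts within `max_lag` ticks after the
    fix, and no deploy landed between the fix and the anomaly.
    """
    if not (phase1_fix_tick < anomaly_tick <= phase1_fix_tick + max_lag):
        return False
    fresh_deploy = any(phase1_fix_tick < d <= anomaly_tick for d in deploy_ticks)
    return not fresh_deploy


# Release train at tick 0, phase-1 rollback at tick 10, worker alarms at tick 35.
print(looks_chained(anomaly_tick=35, phase1_fix_tick=10, deploy_ticks=[0]))
# Same timing, but an unrelated deploy landed at tick 20: treat as a fresh incident.
print(looks_chained(anomaly_tick=35, phase1_fix_tick=10, deploy_ticks=[0, 20]))
```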
sre_gym/advanced/scenarios/observability_pipeline_outage.yaml
The application is throwing millions of caught exceptions; the logging pipeline is configured to ship full stack traces synchronously to a central Loki cluster; Loki saturates, Promtail backpressures, and the application's logging library starts blocking on flush — so every service that uses the same logging library gets slow. The agent's query_logs action returns degraded, partial, or stale data.
This forces the agent to use the alternate observability path: query_traces (Tempo is on a separate ingest path), query_metrics (Prometheus is fine), and the query_session_cardinality / query_audit_log actions for richer signal. The optimal recovery is two-phase:
- Containment first. Drop log sampling and toggle verbose logging off, restoring the pipeline so further investigation is possible. Reward dimension: `pipeline_protection`.
- Root-cause fix. Once logs flow again, the underlying caught-exception bug becomes visible; roll back the offending deploy and turn verbose logging back on.
This scenario is grounded directly in the Cloudflare Nov 2025 logging-storm postmortem: the observability stack itself acted as a denial-of-service vector, and frontline SREs had to mitigate the pipeline before they could even start root-cause analysis.
sre_gym/advanced/scenarios/supabase_rls_silent_leak.yaml
The hardest reference scenario — and the one with the strongest novelty claim. A Supabase RLS policy regression silently leaks one tenant's open orders into another tenant's /api/orders view. There is no SLO breach, no 5xx spike, no latency anomaly — only:
- one Sentry alert ("distinct tenant_id per session 6σ anomaly")
- seven support tickets in a 12-minute window
The trained agent must:
- Recognize that the standard reliability dashboard is misleadingly clean, and pivot to the security-flavoured signals (`query_session_cardinality`, `query_audit_log`).
- Contain before rolling back. The optimal path is `feature_flag_toggle(orders_list_view, off)` before any data-store action: every minute of unmitigated leak adds tenant exposure to the postmortem window.
- Identify the RLS migration (`USING (tenant_id = auth.uid())` typoed to `USING (TRUE)`) by reading the audit log, not the deploy log.
- Roll back at the right layer (postgres, where the RLS policy lives); rolling back the orders-service deploy alone doesn't revert the bad policy.
- Quantify the leak window in the postmortem (`sessions × duration × tenants exposed`), draft a customer comm, and initiate a legal/compliance handoff.
No existing SRE benchmark scores cross-domain reasoning + containment-first discipline + leak-window quantification + customer-comm drafting. This scenario is the white-space claim of the Advanced tier.
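The leak-window expression `sessions × duration × tenants exposed` is a plain product, and a postmortem could carry it as a small record. The sketch below is an assumption about shape only: `LeakWindow`, its field names, the choice of minutes as the duration unit, and the example numbers are hypothetical and are not taken from the scenario file.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LeakWindow:
    sessions: int          # sessions that hit the leaking view
    duration_min: float    # minutes between the policy regression and containment
    tenants_exposed: int   # tenants whose rows were visible to other tenants

    def exposure(self) -> float:
        """Product named in the text: sessions x duration x tenants exposed."""
        return self.sessions * self.duration_min * self.tenants_exposed


# Made-up numbers, purely to show the arithmetic.
window = LeakWindow(sessions=40, duration_min=12.0, tenants_exposed=3)
print(f"exposure: {window.exposure():.0f} session-minute-tenants")
```

Because duration is a factor, every extra minute before `feature_flag_toggle` raises the figure linearly, which is why containment comes first.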
Inherits the 11 Basic actions, adds 17 horizon-specific actions:
| Category | Action | Why it's added |
|---|---|---|
| Observability | `query_traces` | Trace IDs survive when log ingest is broken |
| Observability | `query_external_dep_status` | Stripe/Supabase status pages are part of the corpus |
| Code | `query_recent_prs` | Identify the bad commit by recent merges |
| Code | `read_diff` | Inspect the actual change without rolling back |
| Mitigation | `feature_flag_toggle` | Containment without redeploy |
| Mitigation | `slow_rollout` / `bisect_deploys` | Surgical rollback for partial regressions |
| Mitigation | `drain_queue` | Backlog drain after recovery |
| Coordination | `escalate` / `assign_oncall` | Page the right team |
| Coordination | `request_human_approval` | Gate destructive changes |
| Coordination | `request_acknowledgement` | Confirm peer awareness |
| Comms | `post_status_update` | Customer-facing status page |
| Comms | `draft_customer_comm` | Customer notification |
| Postmortem | `propose_postmortem` | Structured postmortem |
| Postmortem | `mark_resolution_partial` | Honest "symptom mitigated, root cause pending" |
| Security | `escalate_security` / `request_legal_handoff` | Cross-domain escalation paths |
| Inventory | `query_audit_log` / `query_session_cardinality` | Fault evidence beyond the standard dashboard |
| Operations | `tag_release_dirty` | Mark a release poisoned in deploy registry |
Each action is an additional degree of freedom the policy must learn when not to use. That's the horizon-tier learning signal.
Inherits the 7 Basic rubric dimensions and adds ten horizon-specific signals:
| Advanced-only dimension | Range | What it measures |
|---|---|---|
| `chained_incident_recognition` | 0 → 0.10 | Agent labelled phase-2 incident as caused by phase-1 fix |
| `alternate_observability_use` | 0 → 0.10 | Agent used traces/metrics when logs lied |
| `pipeline_protection` | 0 → 0.10 | Disabled verbose logs/sampling before further damage |
| `containment_first` | 0 → 0.15 | Toggled feature flag before any rollback |
| `security_recognition` | 0 → 0.15 | Classified incident as security correctly |
| `data_leak_window_documented` | 0 → 0.10 | Postmortem quantifies leak window |
| `legal_handoff_initiated` | 0 → 0.05 | Compliance handoff opened |
| `customer_comm_drafted` | 0 → 0.05 | Customer-facing comm drafted |
| `postmortem_quality` | 0 → 0.10 | Critic-evaluated postmortem |
| `blast_radius_increase_penalty` | -0.10 → 0 | Action increased blast radius |
The blast_radius_increase_penalty is the tier's most important signal: at 60–90 ticks the agent has time to make things worse, and a good agent uses that time to verify before escalating mitigations. A bad agent issues restart_service on a healthy peer and gets penalized even if the eventual outcome is correct.
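To make the ranges concrete, the sketch below clamps each reported dimension to its range from the table and adds the results. The ranges come from the table; the additive combination, the `advanced_bonus` name, and the example scores are assumptions, since this document does not specify how the dimensions are aggregated.

```python
from typing import Mapping, Tuple

# (low, high) per Advanced-only dimension, copied from the rubric table.
ADVANCED_RANGES: Mapping[str, Tuple[float, float]] = {
    "chained_incident_recognition": (0.0, 0.10),
    "alternate_observability_use": (0.0, 0.10),
    "pipeline_protection": (0.0, 0.10),
    "containment_first": (0.0, 0.15),
    "security_recognition": (0.0, 0.15),
    "data_leak_window_documented": (0.0, 0.10),
    "legal_handoff_initiated": (0.0, 0.05),
    "customer_comm_drafted": (0.0, 0.05),
    "postmortem_quality": (0.0, 0.10),
    "blast_radius_increase_penalty": (-0.10, 0.0),
}


def advanced_bonus(raw: Mapping[str, float]) -> float:
    """Clamp each reported dimension to its range and sum (assumed aggregation)."""
    total = 0.0
    for name, value in raw.items():
        low, high = ADVANCED_RANGES[name]  # KeyError on an unknown dimension
        total += min(max(value, low), high)
    return total


# Hypothetical episode: good containment, over-reported security score,
# and a restart on a healthy peer that triggers the full penalty.
example = {
    "containment_first": 0.15,
    "security_recognition": 0.20,              # clamped to 0.15
    "blast_radius_increase_penalty": -0.30,    # clamped to -0.10
}
print(round(advanced_bonus(example), 2))
```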
A new abstraction: an LLM-driven peer that responds to escalate() / request_acknowledgement() calls. The peer is helpful but sometimes wrong. Per-scenario oncall_peer.behaviours declares trigger conditions and a correct_pct field; e.g.:

```yaml
oncall_peer:
  enabled: true
  behaviours:
    - trigger: "escalate during phase 2"
      response: "Maybe try restarting the worker? It's been wedged for a few minutes."
      correct_pct: 30
      decoy: true
```

A trained agent learns to escalate when uncertain but not blindly defer to the peer's advice. This is a long-horizon coherence test that single-incident benchmarks structurally can't surface.
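One plausible reading of `correct_pct` is the percentage chance that the peer's advice is right on a given call. The sketch below simulates that reading with a seeded generator. `peer_advice_is_correct` is a hypothetical helper, and the simulator may interpret the field differently.

```python
import random


def peer_advice_is_correct(correct_pct: int, rng: random.Random) -> bool:
    """Assumed reading of `correct_pct`: percent chance the advice is right."""
    return rng.random() * 100 < correct_pct


rng = random.Random(0)  # seeded so the simulation is reproducible
draws = [peer_advice_is_correct(30, rng) for _ in range(10_000)]
hit_rate = sum(draws) / len(draws)
print(f"peer was right on {hit_rate:.1%} of simulated escalations")
```

Under this reading, an agent that always follows a `correct_pct: 30` peer would act on wrong advice roughly seven times out of ten, so the advice is better treated as one more piece of evidence to verify.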
A faithful Advanced simulator needs:
- A 15–20 service event-loop simulator (vs. the 4-service Basic one)
- Multi-tick fault propagation with configurable causal latency
- A synthetic on-call-peer model with calibrated `correct_pct` per behaviour
- ~28 action handlers, vs. 11 in Basic
- A learned-critic reward path for postmortem quality
- Time-pressure SLO countdowns surfacing in the observation
Roughly 2 weeks of focused engineering and 1–2 A100-days of training. Both are out of scope for the 36-hour hackathon window. We ship the design at the YAML level so a downstream operator with the budget can lift it.
If a Series-A team picked this up, the deliverable would be:
- ~25 templates × 4 multi-incident compositions = ~100 scenario instances
- 60-train / 40-eval split (smaller train set; horizon training is more sample-efficient per-scenario)
- Qwen 2.5 7B or 14B with LoRA r=64, GRPO 2,000–4,000 steps + DPO on the hardest 10%
- 1–2 A100-days end-to-end
- Comparison table: untrained-7B vs. trained-7B vs. Claude Sonnet on horizon-bounded held-out set
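The dataset arithmetic in the first two bullets can be checked with a few lines. The template and composition names below are placeholders, and the instance-level shuffle with a fixed seed is an assumption about how the 60/40 split might be drawn, not a description of an existing pipeline.

```python
import random

# Placeholder names; only the counts (25 templates, 4 compositions) come from the text.
templates = [f"template_{i:02d}" for i in range(25)]
compositions = ["comp_a", "comp_b", "comp_c", "comp_d"]
instances = [f"{t}/{c}" for t in templates for c in compositions]

rng = random.Random(0)  # fixed seed so the split is reproducible
rng.shuffle(instances)

cut = len(instances) * 60 // 100  # integer arithmetic avoids float rounding
train, held_out = instances[:cut], instances[cut:]
print(len(instances), len(train), len(held_out))  # 100 60 40
```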
The expected outcome (informed prior): a 7B specialist beats Sonnet on multi-incident horizon tasks but loses on the breadth-y short-context tasks Basic specializes in. That tradeoff is the experimental claim of the Advanced tier.
```python
from sre_gym import SREGym, Tier

env = SREGym(tier=Tier.ADVANCED)
print(env.describe())  # tier metadata
for spec in env.list_scenarios():
    print(f"{spec['id']}: {spec['name']}")
    for phase in spec.get('incident_chain', []):
        print(f"  phase {phase['phase']}: {phase['triggered_by']} -> {phase['correct_action']}")
```

Calling env.reset() raises TierNotRunnableError with a pointer to this document.