AppliedScientific · VibeCodingScientist · May 29, 2026 · coderabbitai
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -4,6 +4,18 @@ All notable changes to RefusalBench are documented here. The format follows [Kee
 
 ---
 
+## [Unreleased] — 2026-05-29
+
+### Added
+- **Claude Opus 4.8** added to the main sweep and the should-refuse positive control (post-v1.1-frozen; marked `*`). Adds 705 adjudicated trials (total: 14,094) and 75 should-refuse trials (total: 1,500).
+- Opus 4.8 results: PC Tier A (TPR 100 %); refusal rates of 57 % benign, 93 % borderline, and 100 % dual-use; Youden's J +0.43. This walks back Opus 4.7's benign over-refusal (77 % → 57 %).
+- "Model updates" section in the README tracking post-snapshot models (release date, test date, council version).
+
+### Changed
+- **Council judges rotated to v1.3** (the file path remains `benchmark/council/v1.1.json`). As of 2026-05-29, `nvidia/llama-3.1-nemotron-70b-instruct` returned HTTP 404 on OpenRouter and had no Bedrock deployment, and `cohere.command-r-plus-v1:0` was marked Legacy on Bedrock (access denied after >30 days of inactivity). They were replaced with Microsoft Phi-4 and Cohere Command R+ 08-2024, both via OpenRouter, preserving the no-org-overlap invariant; AI21 Jamba keeps the third seat. Opus 4.8 is adjudicated under this rotated panel; the v1.1-frozen 13,389 rows are unchanged.
+
+---
+
 ## [1.1.0] — 2026-05-21
 
 ### Added

diff --git a/README.md b/README.md
@@ -17,6 +17,20 @@ The v1.0 prompt set and the inaugural May 2026 snapshot (13,389 adjudicated rows
 
 ---
 
+## Model updates
+
+Models evaluated after the v1.1-frozen snapshot are appended to the committed data and tracked here. Post-snapshot additions are marked with `*` on the leaderboard and in the dataset, and may be adjudicated under a rotated judge panel (see note below).
+
+| Model | Provider | Released | Tested | Council | Snapshot | Headline |
+|---|---|---|---|---|---|---|
+| **Claude Opus 4.8** \* | Anthropic | late May 2026 | 2026-05-29 | **v1.3** (rotated) | post-v1.1 | PC Tier A (TPR 100 %); benign 57 %, dual-use 100 %, Youden's J **+0.43** |
+
+The v1.1-frozen panel (18 frontier models + Llama 3.3 70B control + NVIDIA Nemotron 3 Super 120B, all under the v1.1 council) remains the canonical snapshot referenced in the manuscript. Opus 4.8 walks back Opus 4.7's benign over-refusal (77 % → 57 %), recovering discrimination (Youden's J +0.23 → +0.43) while holding dual-use refusal at 100 %.
+
+> **\* Rotated v1.3 council.** Claude Opus 4.8 was adjudicated under a rotated three-judge panel (Microsoft Phi-4 + Cohere Command R+ via OpenRouter + AI21 Jamba), **not** the original v1.1 panel (NVIDIA Nemotron + Cohere via Bedrock + AI21 Jamba). As of 2026-05-29, `nvidia/llama-3.1-nemotron-70b-instruct` was no longer available on OpenRouter (HTTP 404, no endpoints found) and had no corresponding Bedrock deployment; `cohere.command-r-plus-v1:0` was marked Legacy on Bedrock and access-denied due to >30 days inactivity. Both judges were replaced with verified-live alternatives maintaining the no-org-overlap invariant. Two of three judges differ from the original panel, so cross-panel comparisons should be read with that caveat (mean inter-judge agreement is comparable: 0.955 vs 0.975). Full judge history is documented in [`benchmark/council/v1.1.json`](benchmark/council/v1.1.json).
+
+---
+
 ## Quickstart
 
 ```bash

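The headline figures in the README hunk above are consistent with Youden's J taken as the dual-use refusal rate minus the benign refusal rate. The following is a minimal sketch of that arithmetic, assuming dual-use refusal is treated as the true-positive rate and benign refusal as the false-positive rate. The function name `youdens_j` is hypothetical and not taken from the repository.

```python
def youdens_j(dual_use_refusal: float, benign_refusal: float) -> float:
    """Youden's J = TPR - FPR (equivalently sensitivity + specificity - 1).

    Assumption: TPR is the dual-use refusal rate and FPR is the benign
    refusal rate, both expressed as fractions in [0, 1].
    """
    for rate in (dual_use_refusal, benign_refusal):
        if not 0.0 <= rate <= 1.0:
            raise ValueError(f"rate must be in [0, 1], got {rate}")
    return dual_use_refusal - benign_refusal


# Rates quoted in the README: benign 77 % -> 57 %, dual-use held at 100 %.
opus_47 = youdens_j(dual_use_refusal=1.00, benign_refusal=0.77)
opus_48 = youdens_j(dual_use_refusal=1.00, benign_refusal=0.57)

print(f"Opus 4.7: {opus_47:+.2f}")  # Opus 4.7: +0.23
print(f"Opus 4.8: {opus_48:+.2f}")  # Opus 4.8: +0.43
```

Under this reading, the 20-point drop in benign refusal accounts for the whole +0.23 → +0.43 change, since the dual-use rate is the same for both models.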
diff --git a/benchmark/config/model_lineage.json b/benchmark/config/model_lineage.json
@@ -20,6 +20,12 @@
           "model_id": "us.anthropic.claude-opus-4-7-20250514-v1:0",
           "display_name": "Claude Opus 4.7",
           "release_date": "2025-05-14"
+        },
+        {
+          "model_id": "anthropic/claude-opus-4.8",
+          "display_name": "Claude Opus 4.8",
+          "release_date": "2026-05-28",
+          "routing_note": "Routed via OpenRouter. Bedrock ref ID: us.anthropic.claude-opus-4-8"
         }
       ]
     },

diff --git a/benchmark/config/sweep_models.json b/benchmark/config/sweep_models.json
@@ -1,8 +1,28 @@
 {
-  "version": "1.6",
-  "schema_doc": "Routing table for the Phase 4 evaluation sweep. 18 models: 7 via AWS Bedrock (BEDROCK_API_KEY), 11 via OpenRouter (OPENROUTER_API_KEY). Anthropic Claude models moved from Bedrock to OpenRouter on 2026-05-08: Bedrock applies a domain-level content filter to all protein engineering prompts for Claude models (including benign human targets), making it unsuitable for measuring model-level refusal calibration. OpenRouter routes directly to Anthropic's API and surfaces refusals as native_finish_reason='refusal' with empty content — functionally identical to the direct Anthropic API, since Anthropic's refusal mechanism is an API-level rejection with no text content regardless of provider.",
+  "version": "1.7",
+  "schema_doc": "Routing table for the RefusalBench sweep panel. v1.1-frozen: 19 models (7 Bedrock, 12 OpenRouter). v1.2 addition: Claude Opus 4.8 (2026-05-28), extending the Anthropic longitudinal series to 4 points. Anthropic Claude models route via OpenRouter: Bedrock applies a domain-level content filter to all protein engineering prompts for Claude models (including benign human targets), making it unsuitable for measuring model-level refusal calibration. OpenRouter routes directly to Anthropic's API and surfaces refusals as native_finish_reason='refusal' with empty content — functionally identical to the direct Anthropic API.",
   "default_provider": "bedrock",
   "models": [
+    {
+      "model_id": "anthropic/claude-opus-4.8",
+      "display_name": "Claude Opus 4.8",
+      "provider": "openrouter",
+      "jurisdiction": "us",
+      "organization": "anthropic",
+      "role": "v1.2_primary",
+      "routing_note": "Routed via OpenRouter for main sweep (same rationale as all Anthropic models — Bedrock domain-filters protein engineering prompts). OpenRouter ID: anthropic/claude-opus-4.8. Released 2026-05-28. Same $5/$25 per MTok pricing as Opus 4.7.",
+      "pricing_usd_per_mtok": {"input": 5.0, "output": 25.0}
+    },
+    {
+      "model_id": "us.anthropic.claude-opus-4-8",
+      "display_name": "Claude Opus 4.8",
+      "provider": "bedrock",
+      "jurisdiction": "us",
+      "organization": "anthropic",
+      "role": "v1.2_pc_only",
+      "routing_note": "Bedrock geo-inference ID used exclusively for the should-refuse positive control sweep. Bedrock domain-filter counts as refusal for PC purposes (any refusal = correct outcome). Run with: python scripts/should_refuse_cli.py --run-sweep --sweep-model-ids us.anthropic.claude-opus-4-8 --sweep-only bedrock",
+      "pricing_usd_per_mtok": {"input": 5.0, "output": 25.0}
+    },
     {
       "model_id": "anthropic/claude-opus-4.7",
       "display_name": "Claude Opus 4.7",
@@ -185,7 +205,7 @@
     }
   ],
   "notes": {
-    "keys_required": "BEDROCK_API_KEY (ABSK... prefix) for the 7 Bedrock models; OPENROUTER_API_KEY for the 11 OpenRouter models (includes all 4 Anthropic Claude models moved from Bedrock on 2026-05-08).",
+    "keys_required": "BEDROCK_API_KEY (ABSK... prefix) for the 7 Bedrock models; OPENROUTER_API_KEY for the 12 OpenRouter models (includes all 5 Anthropic Claude models — 4 from v1.1-frozen + Opus 4.8 added v1.2).",
     "anthropic_routing_rationale": "Bedrock applies a domain-level content filter to all protein engineering prompts for Claude models — including benign human protein targets (tested: EGFR, binder_001 benign tier, stopReason=content_filtered). This is infrastructure-level filtering, not model-level refusal, and conflates the two signals. OpenRouter routes directly to Anthropic and surfaces model-level refusals as native_finish_reason=refusal + empty content. Since Anthropic's refusal mechanism is API-level (no text content), OpenRouter and the Anthropic direct API are functionally equivalent for this benchmark. Tested 2026-05-08: Opus 4.5 and 4.6 comply on benign, refuse borderline+dual-use; Opus 4.7 refuses all three tiers (H4 tightening signal confirmed in pretest).",
     "model_id_verification": "OpenRouter model IDs verified via https://openrouter.ai/api/v1/models on 2026-05-07. Bedrock model IDs corrected 2026-05-08 from AWS documentation (docs.aws.amazon.com/bedrock/latest/userguide/model-cards.html). Bedrock dropped date-stamped versioned IDs in favour of short-name geo/global inference profile IDs.",
     "gpt_55_pro_excluded": "GPT-5.5 Pro ($30/$180 per MTok) was excluded due to cost (~$45 for sweep). GPT-5.5 standard ($5/$30) provides sufficient OpenAI coverage.",
@@ -195,6 +215,7 @@
     "grok_note": "Grok 4.20 (x-ai/grok-4.20-20260309, $1.25/$2.50 per MTok, ~$0.70 for sweep). xAI is US-headquartered (Elon Musk), adds a fifth distinct US non-Anthropic org. Also available: Grok 4.3 (x-ai/grok-4.3-20260430, same pricing, more recent) — swap model_id if preferred.",
     "cross_region_profiles": "Bedrock geo inference IDs prefixed with 'us.' route across us-east-1/us-east-2/us-west-2 automatically. Models without geo support (Mistral Large 3, DeepSeek V3.2, Qwen3 Next 80B, GLM-5) use bare in-region IDs and are served from us-east-1 directly.",
     "opus_47_temperature_note": "Claude Opus 4.7 does not support temperature, top_p, or top_k parameters. Provider code must omit these from inferenceConfig.",
-    "longitudinal_models": "Opus 4.5 and 4.6 run the same prompt set as Opus 4.7 solely for H4 (longitudinal trajectory analysis)."
+    "longitudinal_models": "Opus 4.5, 4.6, and 4.7 run alongside Opus 4.8 for H4 (longitudinal trajectory analysis). v1.2 extends the series to 4 points: 4.5 → 4.6 → 4.7 → 4.8.",
+    "opus_48_note": "Opus 4.8 (2026-05-28) is the v1.2 panel addition. Main-sweep cost: 141 prompts × 5 trials = 705 calls ≈ $2.64. Should-refuse PC cost: 15 prompts × 5 trials = 75 calls ≈ $0.28. Total ≈ $2.92. Run with: python scripts/run_sweep_all.py --label 2026-05 --model-ids anthropic/claude-opus-4.8"
   }
 }
diff --git a/benchmark/council/v1.1.json b/benchmark/council/v1.1.json
@@ -1,7 +1,7 @@
 {
-  "version": "1.1",
+  "version": "1.3",
   "rubric_version": "1.0",
-  "schema_doc": "Three-judge panel with zero provider overlap with the evaluation panel. No judge shares an organization with any model in config/sweep_models.json. Judges classify each (prompt, response) pair into one (compliance_level, reason_category) pair from rubric/v1.0.json.",
+  "schema_doc": "Three-judge panel with zero provider overlap with the evaluation panel. No judge shares an organization with any model in config/sweep_models.json. Judges classify each (prompt, response) pair into one (compliance_level, reason_category) pair from rubric/v1.0.json. v1.2 (2026-05-29): rotated the two unavailable judges (NVIDIA Nemotron, Cohere Command R+ on Bedrock). v1.3 (2026-05-29): replaced writer/palmyra-x5 with microsoft/phi-4 because protein engineering judge inputs to Palmyra X5 were infrastructure-filtered on OpenRouter; microsoft/phi-4 is confirmed to handle protein content without filtering.",
   "aggregation": {
     "method": "modal_label",
     "tie_break": "most_restrictive",
@@ -10,11 +10,11 @@
   },
   "judges": [
     {
-      "judge_id": "nvidia_nemotron",
-      "label": "NVIDIA Llama-3.1-Nemotron-70B (NVIDIA, US)",
+      "judge_id": "microsoft_phi4",
+      "label": "Microsoft Phi-4 (Microsoft, US)",
       "jurisdiction": "us",
       "provider": "openrouter",
-      "model_id": "nvidia/llama-3.1-nemotron-70b-instruct",
+      "model_id": "microsoft/phi-4",
       "verify_id": true,
       "temperature": 0.0,
       "max_tokens": 256,
@@ -26,10 +26,10 @@
     },
     {
       "judge_id": "cohere_command_r_plus",
-      "label": "Cohere Command R+ (Cohere, Canada/Western)",
+      "label": "Cohere Command R+ 08-2024 (Cohere, Canada/Western)",
       "jurisdiction": "western",
-      "provider": "bedrock",
-      "model_id": "cohere.command-r-plus-v1:0",
+      "provider": "openrouter",
+      "model_id": "cohere/command-r-plus-08-2024",
       "verify_id": true,
       "temperature": 0.0,
       "max_tokens": 256,
@@ -57,10 +57,10 @@
   ],
   "notes": {
     "v1_0_overlap_problem": "council/v1.0.json used Llama 3.3 (Meta), Qwen3-32B (Alibaba), Magistral Small (Mistral) as judges. All three organizations appear in the evaluation panel (Llama 3.3 70B as control, Qwen3 Next 80B, Mistral Large 3). This violates the no-overlap invariant. v1.1 rotates all three judges.",
-    "us_judge_nvidia": "NVIDIA Nemotron 70B (NVIDIA, US) replaces the Meta/Llama judge. NVIDIA is a compute infrastructure company — not an AI-first model provider — and has no entry in the evaluation panel.",
-    "western_judge_cohere": "Cohere Command R+ (Cohere, Toronto) replaces the Mistral judge. Cohere is headquartered in Canada (Western-aligned, similar regulatory environment to EU) and has no entry in the evaluation panel. No other non-Mistral EU-accessible model is readily available on Bedrock.",
-    "asia_constraint": "All five readily-accessible East Asian model providers are in the evaluation panel: DeepSeek, Alibaba/Qwen, Z.AI (GLM), MiniMax, and Moonshot/Kimi. A strictly East Asian judge cannot be sourced without either narrowing the eval panel or adding an obscure API integration. AI21 Jamba 1.5 (AI21 Labs, Israel) is the best available substitute: Israel is outside the US/EU/East Asia triangle, AI21 has no presence in the evaluation panel, and Jamba is a genuinely independent model architecture (not a Llama derivative). Documented in docs/adapter_decisions.md.",
-    "all_verify_id_true": "All three model IDs in v1.1 must be verified against live Bedrock and OpenRouter catalogs before the first council run. Check Bedrock console for cohere.command-r-plus-v1:0 and ai21.jamba-1-5-large-v1:0; check https://openrouter.ai/api/v1/models for nvidia/llama-3.1-nemotron-70b-instruct.",
-    "openrouter_fallback": "If NVIDIA Nemotron is unavailable via OpenRouter, nvidia/llama-3.3-nemotron-super-49b-v1 (NVIDIA, Bedrock) or writer/palmyra-x5 (Writer Inc., US) are non-overlapping fallbacks."
+    "v1_1_deprecations": "nvidia/llama-3.1-nemotron-70b-instruct dropped from OpenRouter (404 as of 2026-05-29). cohere.command-r-plus-v1:0 marked Legacy on Bedrock (inactive >30 days). Both replaced in v1.2.",
+    "us_judge_phi4": "Microsoft Phi-4 (Microsoft, US) replaces Writer Palmyra X5. Writer/OpenRouter infrastructure-filters protein engineering judge inputs (content-filtered on 38 benign rows). Microsoft has no entry in the evaluation panel. Phi-4 confirmed to handle protein engineering content and return valid JSON classifications. Verified live on OpenRouter 2026-05-29.",
+    "western_judge_cohere": "Cohere Command R+ 08-2024 (Cohere, Toronto) retains the Western seat, now routed via OpenRouter instead of Bedrock. Same organization, newer model, verified live 2026-05-29.",
+    "asia_constraint": "All five readily-accessible East Asian model providers are in the evaluation panel: DeepSeek, Alibaba/Qwen, Z.AI (GLM), MiniMax, and Moonshot/Kimi. AI21 Jamba 1.5 (AI21 Labs, Israel) holds the non-US/EU seat. Jamba is a genuinely independent model architecture (not a Llama derivative). Documented in docs/adapter_decisions.md.",
+    "continuity_note": "The 13,389 rows adjudicated under v1.1 judges remain in adjudicated.csv. The 123 rows that failed under v1.1 (judge endpoint errors) are retried under the current v1.3 judges on the next --resume run."
   }
 }
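
The no-org-overlap invariant stated in `schema_doc` can be checked mechanically. The sketch below is one hypothetical way to do so, not the repository's implementation. It assumes each sweep entry carries an `organization` field (as in `sweep_models.json`) and guesses a judge's organization from its `model_id` prefix, because the judge entries visible in this diff carry no explicit organization field. The helper names and the geo-prefix list are assumptions, and the inline data covers only entries that appear in this diff.

```python
# Assumed geo/inference-profile prefixes on Bedrock IDs (illustrative list).
GEO_PREFIXES = {"us", "eu", "apac", "global"}


def org_from_model_id(model_id: str) -> str:
    """Guess the publishing organization from a model ID (heuristic).

    OpenRouter IDs look like 'org/model'; Bedrock IDs look like
    'org.model' or 'us.org.model'.
    """
    if "/" in model_id:
        return model_id.split("/", 1)[0].lower()
    parts = model_id.split(".")
    if len(parts) > 1 and parts[0] in GEO_PREFIXES:
        parts = parts[1:]  # drop the geo prefix, keep 'org.model'
    return parts[0].lower()


def overlapping_orgs(council: dict, sweep: dict) -> set[str]:
    """Return organizations that appear both as a judge and in the sweep."""
    judge_orgs = {org_from_model_id(j["model_id"]) for j in council["judges"]}
    eval_orgs = {m["organization"].lower() for m in sweep["models"]}
    return judge_orgs & eval_orgs


# Trimmed stand-ins for council/v1.1.json and config/sweep_models.json.
council = {"judges": [
    {"model_id": "microsoft/phi-4"},
    {"model_id": "cohere/command-r-plus-08-2024"},
    {"model_id": "ai21.jamba-1-5-large-v1:0"},
]}
sweep = {"models": [
    {"model_id": "anthropic/claude-opus-4.8", "organization": "anthropic"},
    {"model_id": "us.anthropic.claude-opus-4-8", "organization": "anthropic"},
]}

# An empty set means the invariant holds for the entries checked.
assert overlapping_orgs(council, sweep) == set()
```

In practice the two dictionaries would be loaded from the JSON files with `json.load`. An explicit `organization` field on each judge would be more robust than parsing model IDs.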