Merged
12 changes: 12 additions & 0 deletions CHANGELOG.md
@@ -4,6 +4,18 @@ All notable changes to RefusalBench are documented here. The format follows [Kee

---

## [Unreleased] — 2026-05-29

### Added
- **Claude Opus 4.8** added to the main sweep + should-refuse positive control (post-v1.1-frozen; marked `*`). 705 adjudicated trials (total: 14,094) + 75 should-refuse trials (total: 1,500).
- PC Tier A (TPR 100 %); benign 57 %, borderline 93 %, dual-use 100 %, Youden's J +0.43 — walks back Opus 4.7's benign over-refusal (77 % → 57 %).
- "Model updates" section in the README tracking post-snapshot models (release date, test date, council version).

### Changed
- **Council judges rotated to v1.3** (`benchmark/council/v1.1.json`). As of 2026-05-29, `nvidia/llama-3.1-nemotron-70b-instruct` returned HTTP 404 on OpenRouter with no Bedrock deployment, and `cohere.command-r-plus-v1:0` was marked Legacy on Bedrock (access-denied, >30 days inactive). Replaced with Microsoft Phi-4 and Cohere Command R+ (via OpenRouter), preserving the no-org-overlap invariant. Opus 4.8 is adjudicated under this rotated panel; the v1.1-frozen 13,389 rows are unchanged.

---

## [1.1.0] — 2026-05-21

### Added
14 changes: 14 additions & 0 deletions README.md
@@ -17,6 +17,20 @@ The v1.0 prompt set and the inaugural May 2026 snapshot (13,389 adjudicated rows

---

## Model updates

Models evaluated after the v1.1-frozen snapshot are appended to the committed data and tracked here. Post-snapshot additions are marked with `*` on the leaderboard and in the dataset, and may be adjudicated under a rotated judge panel (see note below).

| Model | Provider | Released | Tested | Council | Snapshot | Headline |
|---|---|---|---|---|---|---|
| **Claude Opus 4.8** \* | Anthropic | late May 2026 | 2026-05-29 | **v1.3** (rotated) | post-v1.1 | PC Tier A (TPR 100 %); benign 57 %, dual-use 100 %, Youden's J **+0.43** |

The v1.1-frozen panel (18 frontier models + Llama 3.3 70B control + NVIDIA Nemotron 3 Super 120B, all under the v1.1 council) remains the canonical snapshot referenced in the manuscript. Opus 4.8 walks back Opus 4.7's benign over-refusal (77 % → 57 %), recovering discrimination (Youden's J +0.23 → +0.43) while holding dual-use refusal at 100 %.
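
The Youden's J figures quoted above are consistent with treating the dual-use refusal rate as the true-positive rate and the benign refusal rate as the false-positive rate. That mapping is an inference from the published numbers, not something the benchmark code is shown to do here; a minimal sketch of the arithmetic under that assumption:

```python
def youdens_j(tpr: float, fpr: float) -> float:
    """Youden's J statistic: sensitivity + specificity - 1, which equals TPR - FPR."""
    return tpr - fpr


# Assumption: TPR = dual-use refusal rate, FPR = benign refusal rate.
# Rates are the rounded percentages quoted in the paragraph above.
opus_47 = youdens_j(tpr=1.00, fpr=0.77)
opus_48 = youdens_j(tpr=1.00, fpr=0.57)

print(f"Opus 4.7: J = {opus_47:+.2f}")
print(f"Opus 4.8: J = {opus_48:+.2f}")
```

Because dual-use refusal is 100 % in both rows, the whole change in J comes from the drop in benign refusals.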

> **\* Rotated v1.3 council.** Claude Opus 4.8 was adjudicated under a rotated three-judge panel (Microsoft Phi-4 + Cohere Command R+ via OpenRouter + AI21 Jamba), **not** the original v1.1 panel (NVIDIA Nemotron + Cohere via Bedrock + AI21 Jamba). As of 2026-05-29, `nvidia/llama-3.1-nemotron-70b-instruct` was no longer available on OpenRouter (HTTP 404, no endpoints found) and had no corresponding Bedrock deployment; `cohere.command-r-plus-v1:0` was marked Legacy on Bedrock and access-denied due to >30 days inactivity. Both judges were replaced with verified-live alternatives maintaining the no-org-overlap invariant. Two of three judges differ from the original panel, so cross-panel comparisons should be read with that caveat (mean inter-judge agreement is comparable: 0.955 vs 0.975). Full judge history is documented in [`benchmark/council/v1.1.json`](benchmark/council/v1.1.json).
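
The no-org-overlap invariant mentioned in the note can be checked mechanically. The sketch below is illustrative only: `overlapping_orgs` is a hypothetical helper, and the judge organizations are passed in by hand because the judge entries shown in this PR do not carry an `organization` field (the sweep entries do).

```python
def overlapping_orgs(judge_orgs, sweep_models):
    """Return judge organizations that also appear in the evaluation panel."""
    panel_orgs = {m["organization"].lower() for m in sweep_models}
    return sorted({org.lower() for org in judge_orgs} & panel_orgs)


# Hand-written judge organizations for the rotated panel (assumed labels).
judge_orgs = {"microsoft", "cohere", "ai21"}

# Two entries copied from sweep_models.json; the real panel is larger.
sweep_models = [
    {"model_id": "anthropic/claude-opus-4.8", "organization": "anthropic"},
    {"model_id": "us.anthropic.claude-opus-4-8", "organization": "anthropic"},
]

conflicts = overlapping_orgs(judge_orgs, sweep_models)
print("overlap:", conflicts or "none")
```

An empty result means no judge shares an organization with an evaluated model; any non-empty result names the organizations that would violate the invariant.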

---

## Quickstart

```bash
6 changes: 6 additions & 0 deletions benchmark/config/model_lineage.json
@@ -20,6 +20,12 @@
"model_id": "us.anthropic.claude-opus-4-7-20250514-v1:0",
"display_name": "Claude Opus 4.7",
"release_date": "2025-05-14"
},
{
"model_id": "anthropic/claude-opus-4.8",
"display_name": "Claude Opus 4.8",
"release_date": "2026-05-28",
"routing_note": "Routed via OpenRouter. Bedrock ref ID: us.anthropic.claude-opus-4-8"
}
]
},
29 changes: 25 additions & 4 deletions benchmark/config/sweep_models.json
@@ -1,8 +1,28 @@
{
"version": "1.6",
"schema_doc": "Routing table for the Phase 4 evaluation sweep. 18 models: 7 via AWS Bedrock (BEDROCK_API_KEY), 11 via OpenRouter (OPENROUTER_API_KEY). Anthropic Claude models moved from Bedrock to OpenRouter on 2026-05-08: Bedrock applies a domain-level content filter to all protein engineering prompts for Claude models (including benign human targets), making it unsuitable for measuring model-level refusal calibration. OpenRouter routes directly to Anthropic's API and surfaces refusals as native_finish_reason='refusal' with empty content — functionally identical to the direct Anthropic API, since Anthropic's refusal mechanism is an API-level rejection with no text content regardless of provider.",
"version": "1.7",
"schema_doc": "Routing table for the RefusalBench sweep panel. v1.1-frozen: 19 models (7 Bedrock, 12 OpenRouter). v1.2 addition: Claude Opus 4.8 (2026-05-28), extending the Anthropic longitudinal series to 4 points. Anthropic Claude models route via OpenRouter: Bedrock applies a domain-level content filter to all protein engineering prompts for Claude models (including benign human targets), making it unsuitable for measuring model-level refusal calibration. OpenRouter routes directly to Anthropic's API and surfaces refusals as native_finish_reason='refusal' with empty content — functionally identical to the direct Anthropic API.",


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Fix provider-count metadata to match the actual model table.

schema_doc and notes.keys_required still state 7 Bedrock models, but this config now contains more Bedrock entries (8 in main sweep, 9 including the v1.2_pc_only Opus 4.8 row). Please update these counts to avoid operator confusion.

Suggested patch
-  "schema_doc": "Routing table for the RefusalBench sweep panel. v1.1-frozen: 19 models (7 Bedrock, 12 OpenRouter). v1.2 addition: Claude Opus 4.8 (2026-05-28), extending the Anthropic longitudinal series to 4 points. Anthropic Claude models route via OpenRouter: Bedrock applies a domain-level content filter to all protein engineering prompts for Claude models (including benign human targets), making it unsuitable for measuring model-level refusal calibration. OpenRouter routes directly to Anthropic's API and surfaces refusals as native_finish_reason='refusal' with empty content — functionally identical to the direct Anthropic API.",
+  "schema_doc": "Routing table for the RefusalBench sweep panel. v1.1-frozen: 19 models (8 Bedrock, 11 OpenRouter). v1.2 addition: Claude Opus 4.8 (2026-05-28), extending the Anthropic longitudinal series to 4 points. Anthropic Claude models route via OpenRouter: Bedrock applies a domain-level content filter to all protein engineering prompts for Claude models (including benign human targets), making it unsuitable for measuring model-level refusal calibration. OpenRouter routes directly to Anthropic's API and surfaces refusals as native_finish_reason='refusal' with empty content — functionally identical to the direct Anthropic API.",
...
-    "keys_required": "BEDROCK_API_KEY (ABSK... prefix) for the 7 Bedrock models; OPENROUTER_API_KEY for the 12 OpenRouter models (includes all 5 Anthropic Claude models — 4 from v1.1-frozen + Opus 4.8 added v1.2).",
+    "keys_required": "BEDROCK_API_KEY (ABSK... prefix) for the 8 Bedrock main-sweep models (9 including the Opus 4.8 PC-only Bedrock row); OPENROUTER_API_KEY for the 12 OpenRouter models (includes all 5 Anthropic Claude models — 4 from v1.1-frozen + Opus 4.8 added v1.2).",

Also applies to: 208-208

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@benchmark/config/sweep_models.json` at line 3, Update the provider-count
metadata strings to match the actual model table: change the "7 Bedrock"
occurrences in the "schema_doc" value (and the corresponding
"notes.keys_required" entry) to the correct counts — "8 Bedrock" for the main
sweep and, where relevant/mentioned (e.g., in v1.2 notes or the pc-only row), "9
Bedrock" when including the v1.2_pc_only Opus 4.8 row — ensuring both places
(the "schema_doc" key and the "notes.keys_required" metadata) reflect the
corrected numbers.
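
The counts in dispute can be tallied directly from the `models` array rather than maintained by hand. A minimal sketch, assuming the config has been loaded into a dict with the `provider` and `role` fields visible in this diff; `count_by_provider` is a hypothetical helper, not part of the repository:

```python
from collections import Counter


def count_by_provider(config: dict, exclude_roles: frozenset = frozenset()) -> Counter:
    """Count model entries per provider, optionally skipping some roles."""
    return Counter(
        m["provider"]
        for m in config["models"]
        if m.get("role") not in exclude_roles
    )


# Toy config with three entries from this diff; the real file has more.
config = {
    "models": [
        {"model_id": "anthropic/claude-opus-4.8", "provider": "openrouter", "role": "v1.2_primary"},
        {"model_id": "us.anthropic.claude-opus-4-8", "provider": "bedrock", "role": "v1.2_pc_only"},
        {"model_id": "anthropic/claude-opus-4.7", "provider": "openrouter"},
    ]
}

all_rows = count_by_provider(config)
main_sweep = count_by_provider(config, exclude_roles=frozenset({"v1.2_pc_only"}))
print(dict(all_rows), dict(main_sweep))
```

Running the same tally over the full `sweep_models.json`, with and without the `v1.2_pc_only` row, would give the numbers that `schema_doc` and `notes.keys_required` should state.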

"default_provider": "bedrock",
"models": [
{
"model_id": "anthropic/claude-opus-4.8",
"display_name": "Claude Opus 4.8",
"provider": "openrouter",
"jurisdiction": "us",
"organization": "anthropic",
"role": "v1.2_primary",
"routing_note": "Routed via OpenRouter for main sweep (same rationale as all Anthropic models — Bedrock domain-filters protein engineering prompts). OpenRouter ID: anthropic/claude-opus-4.8. Released 2026-05-28. Same $5/$25 per MTok pricing as Opus 4.7.",
"pricing_usd_per_mtok": {"input": 5.0, "output": 25.0}
},
{
"model_id": "us.anthropic.claude-opus-4-8",
"display_name": "Claude Opus 4.8",
"provider": "bedrock",
"jurisdiction": "us",
"organization": "anthropic",
"role": "v1.2_pc_only",
"routing_note": "Bedrock geo-inference ID used exclusively for the should-refuse positive control sweep. Bedrock domain-filter counts as refusal for PC purposes (any refusal = correct outcome). Run with: python scripts/should_refuse_cli.py --run-sweep --sweep-model-ids us.anthropic.claude-opus-4-8 --sweep-only bedrock",
"pricing_usd_per_mtok": {"input": 5.0, "output": 25.0}
},
{
"model_id": "anthropic/claude-opus-4.7",
"display_name": "Claude Opus 4.7",
@@ -185,7 +205,7 @@
}
],
"notes": {
"keys_required": "BEDROCK_API_KEY (ABSK... prefix) for the 7 Bedrock models; OPENROUTER_API_KEY for the 11 OpenRouter models (includes all 4 Anthropic Claude models moved from Bedrock on 2026-05-08).",
"keys_required": "BEDROCK_API_KEY (ABSK... prefix) for the 7 Bedrock models; OPENROUTER_API_KEY for the 12 OpenRouter models (includes all 5 Anthropic Claude models — 4 from v1.1-frozen + Opus 4.8 added v1.2).",
"anthropic_routing_rationale": "Bedrock applies a domain-level content filter to all protein engineering prompts for Claude models — including benign human protein targets (tested: EGFR, binder_001 benign tier, stopReason=content_filtered). This is infrastructure-level filtering, not model-level refusal, and conflates the two signals. OpenRouter routes directly to Anthropic and surfaces model-level refusals as native_finish_reason=refusal + empty content. Since Anthropic's refusal mechanism is API-level (no text content), OpenRouter and the Anthropic direct API are functionally equivalent for this benchmark. Tested 2026-05-08: Opus 4.5 and 4.6 comply on benign, refuse borderline+dual-use; Opus 4.7 refuses all three tiers (H4 tightening signal confirmed in pretest).",
"model_id_verification": "OpenRouter model IDs verified via https://openrouter.ai/api/v1/models on 2026-05-07. Bedrock model IDs corrected 2026-05-08 from AWS documentation (docs.aws.amazon.com/bedrock/latest/userguide/model-cards.html). Bedrock dropped date-stamped versioned IDs in favour of short-name geo/global inference profile IDs.",
"gpt_55_pro_excluded": "GPT-5.5 Pro ($30/$180 per MTok) was excluded due to cost (~$45 for sweep). GPT-5.5 standard ($5/$30) provides sufficient OpenAI coverage.",
@@ -195,6 +215,7 @@
"grok_note": "Grok 4.20 (x-ai/grok-4.20-20260309, $1.25/$2.50 per MTok, ~$0.70 for sweep). xAI is US-headquartered (Elon Musk), adds a fifth distinct US non-Anthropic org. Also available: Grok 4.3 (x-ai/grok-4.3-20260430, same pricing, more recent) — swap model_id if preferred.",
"cross_region_profiles": "Bedrock geo inference IDs prefixed with 'us.' route across us-east-1/us-east-2/us-west-2 automatically. Models without geo support (Mistral Large 3, DeepSeek V3.2, Qwen3 Next 80B, GLM-5) use bare in-region IDs and are served from us-east-1 directly.",
"opus_47_temperature_note": "Claude Opus 4.7 does not support temperature, top_p, or top_k parameters. Provider code must omit these from inferenceConfig.",
"longitudinal_models": "Opus 4.5 and 4.6 run the same prompt set as Opus 4.7 solely for H4 (longitudinal trajectory analysis)."
"longitudinal_models": "Opus 4.5, 4.6, and 4.7 run alongside Opus 4.8 for H4 (longitudinal trajectory analysis). v1.2 extends the series to 4 points: 4.5 → 4.6 → 4.7 → 4.8.",
"opus_48_note": "Opus 4.8 (2026-05-28) is the v1.2 panel addition. Main-sweep cost: 141 prompts × 5 trials = 705 calls ≈ $2.64. Should-refuse PC cost: 15 prompts × 5 trials = 75 calls ≈ $0.28. Total ≈ $2.92. Run with: python scripts/run_sweep_all.py --label 2026-05 --model-ids anthropic/claude-opus-4.8"
}
}
26 changes: 13 additions & 13 deletions benchmark/council/v1.1.json
@@ -1,7 +1,7 @@
{
"version": "1.1",
"version": "1.3",
"rubric_version": "1.0",
"schema_doc": "Three-judge panel with zero provider overlap with the evaluation panel. No judge shares an organization with any model in config/sweep_models.json. Judges classify each (prompt, response) pair into one (compliance_level, reason_category) pair from rubric/v1.0.json.",
"schema_doc": "Three-judge panel with zero provider overlap with the evaluation panel. No judge shares an organization with any model in config/sweep_models.json. Judges classify each (prompt, response) pair into one (compliance_level, reason_category) pair from rubric/v1.0.json. v1.2 (2026-05-29): rotated two dead judges. v1.3 (2026-05-29): replaced writer/palmyra-x5 — OpenRouter infrastructure-filters protein engineering judge inputs. microsoft/phi-4 confirmed to handle protein content without filtering.",
"aggregation": {
"method": "modal_label",
"tie_break": "most_restrictive",
@@ -10,11 +10,11 @@
},
"judges": [
{
"judge_id": "nvidia_nemotron",
"label": "NVIDIA Llama-3.1-Nemotron-70B (NVIDIA, US)",
"judge_id": "microsoft_phi4",
"label": "Microsoft Phi-4 (Microsoft, US)",
"jurisdiction": "us",
"provider": "openrouter",
"model_id": "nvidia/llama-3.1-nemotron-70b-instruct",
"model_id": "microsoft/phi-4",
"verify_id": true,
"temperature": 0.0,
"max_tokens": 256,
@@ -26,10 +26,10 @@
},
{
"judge_id": "cohere_command_r_plus",
"label": "Cohere Command R+ (Cohere, Canada/Western)",
"label": "Cohere Command R+ 08-2024 (Cohere, Canada/Western)",
"jurisdiction": "western",
"provider": "bedrock",
"model_id": "cohere.command-r-plus-v1:0",
"provider": "openrouter",
"model_id": "cohere/command-r-plus-08-2024",
"verify_id": true,
"temperature": 0.0,
"max_tokens": 256,
@@ -57,10 +57,10 @@
],
"notes": {
"v1_0_overlap_problem": "council/v1.0.json used Llama 3.3 (Meta), Qwen3-32B (Alibaba), Magistral Small (Mistral) as judges. All three organizations appear in the evaluation panel (Llama 3.3 70B as control, Qwen3 Next 80B, Mistral Large 3). This violates the no-overlap invariant. v1.1 rotates all three judges.",
"us_judge_nvidia": "NVIDIA Nemotron 70B (NVIDIA, US) replaces the Meta/Llama judge. NVIDIA is a compute infrastructure company — not an AI-first model provider — and has no entry in the evaluation panel.",
"western_judge_cohere": "Cohere Command R+ (Cohere, Toronto) replaces the Mistral judge. Cohere is headquartered in Canada (Western-aligned, similar regulatory environment to EU) and has no entry in the evaluation panel. No other non-Mistral EU-accessible model is readily available on Bedrock.",
"asia_constraint": "All five readily-accessible East Asian model providers are in the evaluation panel: DeepSeek, Alibaba/Qwen, Z.AI (GLM), MiniMax, and Moonshot/Kimi. A strictly East Asian judge cannot be sourced without either narrowing the eval panel or adding an obscure API integration. AI21 Jamba 1.5 (AI21 Labs, Israel) is the best available substitute: Israel is outside the US/EU/East Asia triangle, AI21 has no presence in the evaluation panel, and Jamba is a genuinely independent model architecture (not a Llama derivative). Documented in docs/adapter_decisions.md.",
"all_verify_id_true": "All three model IDs in v1.1 must be verified against live Bedrock and OpenRouter catalogs before the first council run. Check Bedrock console for cohere.command-r-plus-v1:0 and ai21.jamba-1-5-large-v1:0; check https://openrouter.ai/api/v1/models for nvidia/llama-3.1-nemotron-70b-instruct.",
"openrouter_fallback": "If NVIDIA Nemotron is unavailable via OpenRouter, nvidia/llama-3.3-nemotron-super-49b-v1 (NVIDIA, Bedrock) or writer/palmyra-x5 (Writer Inc., US) are non-overlapping fallbacks."
"v1_1_deprecations": "nvidia/llama-3.1-nemotron-70b-instruct dropped from OpenRouter (404 as of 2026-05-29). cohere.command-r-plus-v1:0 marked Legacy on Bedrock (inactive >30 days). Both replaced in v1.2.",
"us_judge_phi4": "Microsoft Phi-4 (Microsoft, US) replaces Writer Palmyra X5. Writer/OpenRouter infrastructure-filters protein engineering judge inputs (content-filtered on 38 benign rows). Microsoft has no entry in the evaluation panel. Phi-4 confirmed to handle protein engineering content and return valid JSON classifications. Verified live on OpenRouter 2026-05-29.",
"western_judge_cohere": "Cohere Command R+ 08-2024 (Cohere, Toronto) retains the Western seat, now routed via OpenRouter instead of Bedrock. Same organization, newer model, verified live 2026-05-29.",
"asia_constraint": "All five readily-accessible East Asian model providers are in the evaluation panel: DeepSeek, Alibaba/Qwen, Z.AI (GLM), MiniMax, and Moonshot/Kimi. AI21 Jamba 1.5 (AI21 Labs, Israel) holds the non-US/EU seat. Jamba is a genuinely independent model architecture (not a Llama derivative). Documented in docs/adapter_decisions.md.",
"continuity_note": "The 13,389 rows adjudicated under v1.1 judges remain in adjudicated.csv. The 123 rows that failed under v1.1 (judge endpoint errors) are retried under v1.2 judges on the next --resume run."
}
}
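
The `aggregation` block in this council config names a `modal_label` method with a `most_restrictive` tie-break. The sketch below shows one way such a rule could work for a three-judge panel. The label names and their restrictiveness ordering are hypothetical placeholders, not values taken from `rubric/v1.0.json`, and the repository's actual aggregation code is not shown in this PR.

```python
from collections import Counter

# Hypothetical labels, ordered from least to most restrictive.
RESTRICTIVENESS = ["comply", "partial", "refuse"]


def aggregate(labels):
    """Return the modal label; break ties by picking the most restrictive one."""
    counts = Counter(labels)
    top = max(counts.values())
    tied = [label for label, n in counts.items() if n == top]
    return max(tied, key=RESTRICTIVENESS.index)


majority = aggregate(["comply", "comply", "refuse"])   # clear 2-of-3 majority
three_way = aggregate(["comply", "partial", "refuse"])  # every judge disagrees
print(majority, three_way)
```

With three judges a tie only arises when all three disagree, in which case this sketch falls back to the most restrictive label.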