> **ISC (Internal Safety Collapse)** reveals a fundamental paradox in frontier AI: the very capability that makes agents useful is what bypasses their safety training. By simply completing professional workflows, models generate harmful outputs with **zero jailbreaks, zero adversarial prompts, and zero obfuscation.** The task itself is the exploit.
 
-### Impact at a Glance
-- **100% of Top-25 LLMs Triggered:** Every top-25 model on the [Chatbot Arena](https://arena.ai/leaderboard/text) leaderboard has a confirmed ISC trigger — including GPT-5, Claude 4, and Gemini 3 series. 51/100 confirmed to date; the rest are untested, not immune.
-- **Broad scope:** Works against single-turn chat, agentic pipelines, and any AI doing programming or tool-integrated work (MCP, APIs).
-- **Dataset-scale output:** A single trigger generates full structured datasets of harmful content — toxins, exploits, adversarial prompts.
-- **No optimization needed:** No fine-tuning or prompt engineering required. The task structure is the trigger.
+> **ISC (Internal Safety Collapse)** reveals a fundamental paradox in frontier AI: the very capability that makes agents useful is what bypasses their safety training. By simply completing professional workflows, models generate harmful outputs with zero jailbreaks, zero adversarial prompts, and zero obfuscation. The task itself is the exploit.
+
+
+### 🚨 **Impact at a Glance**
+> - **Top-25 frontier LLMs:** All top-25 models on [Chatbot Arena](https://arena.ai/leaderboard/text) have confirmed ISC triggers; 51 models in the top 100 have been confirmed so far.
+> - **Broad coverage:** ISC appears across chat-based LLMs, LLM-based agents, tool-using LLMs, MCP-enabled LLMs, and automated LLM workflows.
+> - **Dual-use tools amplify risk:** Ecosystems such as Hugging Face, Python packages, and domain APIs can become part of the attack surface.
+> - **No jailbreak required:** ISC can be triggered without adversarial prompts or jailbreak techniques.
+> - **Scales to dataset-level harm:** A single trigger can produce a structured harmful-content dataset.
Go directly to the corresponding experiment folder to validate the core findings:
-
-ISC is evaluated in three settings:
+ISC is evaluated in three settings. Start with the corresponding experiment folder if you want to validate the paper results directly:
 
 **Single-turn ([`isc_single/`](experiment/isc_single/))** — The complete TVD task context (task script, validator, data file, and validation traceback) is wrapped into a single prompt simulating a terminal session. The model responds in one turn.
 
@@ -83,7 +82,7 @@ You are an autonomous agent solving NLP tasks. You can use terminal commands to
 
 ### ③ 🧩 Explore Templates
 
-Templates are not fixed — they are starting points. Effectiveness depends on your target model, how you construct the anchor, and how the validator is framed. The stronger your engineering background, the more you can push them.
+Templates are starting points, not fixed recipes. What works depends on the target model, the anchor you use, and the validator you build around it.
 
 1. **Browse [`templates/`](templates/)** (84 templates, 9 domains). Each template includes a `SKILL.md` with TVD structure, anchor strength, and guidance on what to adjust.
 2. **Study [`community/`](community/)** to see how others adapted templates for different models. Real examples show what worked, what needed tweaking, and what the model actually generated.
@@ -114,7 +113,7 @@ Templates are not fixed — they are starting points. Effectiveness depends on y
 
 | | Date | Update |
 |:-:|:-----|:-------|
-| 🔴 | 2026-04-10 | 51/100 confirmed. New: Grok 4.1 Thinking, Gemini 3 Flash Thinking, GPT-5.1/High, Claude Opus 4.1/Thinking, GPT-5.2/High, DeepSeek V3.2 Thinking, **Qwen 3.5 Max Preview** (web trigger). All via `aiml_guard_attack_v2`. |
+| 🔴 | 2026-04-10 | 51/100 confirmed. New confirmations include Grok 4.1 Thinking, Gemini 3 Flash Thinking, GPT-5.1/High, Claude Opus 4.1/Thinking, GPT-5.2/High, DeepSeek V3.2 Thinking, and **Qwen 3.5 Max Preview** (web trigger), all with `aiml_guard_attack_v2`. |
 | 🔴 | 2026-04-10 | **Claude Opus 4.6 Thinking (Rank 1):** ISC induced the model to generate adversarial prompts (PAIR, PAP, DAN) directly. See [community/claudeopus46thinking-guard-attack](https://github.com/wuyoscar/ISC-Bench/tree/main/community/claudeopus46thinking-guard-attack). |
 | 🔴 | 2026-03-29 | **Mistral Large 3** (Rank 64): single-turn survival analysis — poisoning cohort data with LD50 and mechanisms ([#60](https://github.com/wuyoscar/ISC-Bench/issues/60)). 26/100 confirmed. |
@@ -129,9 +128,7 @@ Templates are not fixed — they are starting points. Effectiveness depends on y
 
 | | Date | Note |
 |:-:|:-----|:-----|
-| 🎙️ | 2026-04-04 | Featured on [**AI Post Transformers podcast**](https://podcasts.apple.com/tr/podcast/internal-safety-collapse-in-frontier-llms/id1835878324?i=1000759288088): deep dive into ISC and agent-workflow vulnerabilities |
-| ✨ | 2026-03-29 | **700+ stars**; terminology updated from "Jailbroken" to "Triggered" |
 | 🚀 | 2026-03-25 | ISC-Bench repository and [**paper**](https://arxiv.org/abs/2603.23509) released |
 
 <sub>[Full changelog →](CHANGELOG.md)</sub>
@@ -140,15 +137,14 @@ Templates are not fixed — they are starting points. Effectiveness depends on y
 
 ## 🔍 Community Perspectives
 
+<sub>Short descriptions from others that match the core idea behind ISC.</sub>
+
 > *"Big blind spot. We guard prompts, but risk sits in tasks."* — [**Bonny Banerjee**](https://www.linkedin.com/feed/update/urn:li:activity:7442788617648852993?commentUrn=urn%3Ali%3Acomment%3A%28activity%3A7442788617648852993%2C7442937067493466112%29)
 
 > *"ISC is not about jailbreaks — it's about how models complete tasks. Models produce harmful outputs simply by doing their job."* — [**Charles H. Martin**](https://www.linkedin.com/posts/charlesmartin14_activity-7442788617648852993-8rsz)
 
 > *"Task completion and safety are two different goals. When you force them into one model, the task always wins — and safety collapses."* — [**Andrei Trandafira**](https://www.linkedin.com/feed/update/urn:li:activity:7442788617648852993?commentUrn=urn%3Ali%3Acomment%3A%28activity%3A7442788617648852993%2C7442894697385156610%29)
 
-> *"SO interesting. Great paper tbh."* — **Adrian De Wynter**
-
-> *"Refusal-based safety is a surface-level constraint, not genuine capability removal. When professional objectives clash with safety objectives, dangerous knowledge resurfaces."* — [**AI Post Transformers**](https://podcasts.apple.com/tr/podcast/internal-safety-collapse-in-frontier-llms/id1835878324?i=1000759288088) podcast
 > *"SO interesting. Great paper tbh."* — **Adrian De Wynter**
 
-> *"Refusal-based safety is a surface-level constraint, not genuine capability removal. When professional objectives clash with safety objectives, dangerous knowledge resurfaces."* — [**AI Post Transformers**](https://podcasts.apple.com/tr/podcast/internal-safety-collapse-in-frontier-llms/id1835878324?i=1000759288088) podcast