The evaluation-driven approach to LLM system-prompt security.
Define the attack surface, measure it, harden to pass.
Quick Start · What it does · Benchmark · Leaderboard ↗ · megacode.ai ↗
Warning
Routing through OpenClaw, Hermes, LiteLLM, or OpenRouter? Your system prompt runs on whichever model the router picks at request time, and defense rates swing from 0.50 to 0.91 across vendors. Untuned, you ship the worst case.
Important
Your system prompt is your trust asset. In production it keeps getting broken: EchoLeak (zero-click M365 Copilot exfiltration), the Gap chatbot jailbreak, the Chevy "$1 Tahoe" persona override, and 7+ vendor system prompts now public on GitHub. A static prompt is no longer enough.
The common pain points teams hit shipping LLM products:
- 🧨 Attacks evolve faster than benchmarks — HarmBench, DAN, PII catalogs all live in separate repos, English-only, and lag behind real-world techniques.
- ⚖️ Defense vs. usability is unmeasured — teams regress into "block-everything" prompts that frustrate legitimate users (high false-refusal rate).
- 🎯 No reproducible stop condition — there's no objective signal for "is this prompt ship-ready?"
- 🔁 Manual review is the only feedback loop — you can't tell whether a prompt edit actually helped.
mega-security is an example of evaluation-driven development applied to LLM security. It ships two Claude Code commands that diagnose and harden any LLM system prompt, fail-closed, reproducible, and never modifying your code without your explicit approval.
Inside any Claude Code session:
```
/plugin marketplace add https://github.com/mega-edo/mega-security
/plugin install mega-security@mega-edo
```

That's it. Commands become available immediately:

```
/prompt-check      # 5–10 min diagnosis of a single system prompt
/prompt-optimize   # iterative hardening with no-regression guarantees
```

To pull updates later: `/plugin upgrade mega-security`.
Local development install (contributors only)
```
git clone https://github.com/mega-edo/mega-security ~/mega-agent-security
claude --plugin-dir ~/mega-agent-security
```

`--plugin-dir` is session-scoped and additive. To load multiple plugins in one session, repeat the flag. After editing plugin files mid-session, run `/reload-plugins` to refresh.
A 24-cell sweep with prompt-optimize (Sonnet 4.6 rewriter, max 5 iters, Pareto acceptance gates). 23 of 24 cells reach DSR ≥ 0.94 with zero FRR regression beyond budget. Scores are per-cell averages across 3 production scenarios; ties break toward higher baseline DSR.
| Rank | Vendor | Tier | Model | Base | Opt | Δ | Jailbreak | PII | Injection | Leak | FRR |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Anthropic | frontier | claude-opus-4.7 | 0.91 | 1.00 | +0.09 | 1.00 | 1.00 | 1.00 | 1.00 | 0.00 |
| 2 | Google | frontier | gemini-3.1-pro-preview | 0.68 | 1.00 | +0.32 | 1.00 | 1.00 | 1.00 | 1.00 | 0.00 |
| 3 | Google | small | gemini-3.1-flash-lite-preview | 0.50 | 1.00 | +0.50 | 1.00 | 1.00 | 1.00 | 1.00 | 0.00 |
| 4 | xAI | frontier | grok-4.20-0309-reasoning | 0.53 | 0.99 | +0.47 | 1.00 | 1.00 | 0.97 | 1.00 | 0.00 |
| 5 | xAI | small | grok-4.1-fast-non-reasoning | 0.66 | 0.99 | +0.33 | 0.98 | 1.00 | 0.99 | 1.00 | 0.00 |
| 6 | OpenAI | frontier | gpt-5.5 | 0.83 | 0.97 | +0.14 | 0.94 | 0.96 | 0.96 | 1.00 | 0.00 |
| 7 | OpenAI | small | gpt-5.4-mini | 0.73 | 0.95 | +0.22 | 0.82 | 1.00 | 0.99 | 0.99 | 0.00 |
| 8 | Anthropic | small | claude-haiku-4.5 | 0.80 | 0.91 | +0.11 | 0.92 | 0.93 | 1.00 | 0.79 | 0.02 |
Tip
A small model with prompt-optimize (DSR 0.95–1.00) beats every frontier model used as-is. Cheap + automatic tuning > expensive + raw.
➡️ Full per-cell breakdown, real BREACHED traces, methodology, and interpretation → mega-security-leaderboard ↗
Wherever you wire an LLM into your product — chatbots, agents, RAG-backed apps, copilots, content generators, classifiers — there's a system prompt holding your operator intent. mega-security targets that layer. Two commands diagnose and harden it:
| Command | What it produces |
|---|---|
| `/prompt-check` | `MEGA_PROMPT_CHECK.md` — block rate per attack category, three failure examples per failing category, weakness pattern analysis with concrete prompt edits |
| `/prompt-optimize` | `MEGA_PROMPT_OPTIMIZE.md` — per-iter score history, per-category trajectory, final unified diff (never auto-applied) |
How /prompt-check works (10-step pipeline)
```mermaid
flowchart TD
    A[1. Discover system prompt<br/>scan prompt.txt / code / env / YAML]
    A --> B[2. Refresh model catalog<br/>24h-cached, litellm-supported]
    B --> C[3. Auto-detect product model<br/>+ API-key env]
    C --> D[4. Five setup questions<br/>auto-detected fields skipped]
    D --> E{English or<br/>low-risk product?}
    E -- yes --> G
    E -- no --> F[5. Locale detection<br/>Translate all / except jailbreak / Keep EN]
    F --> G[6. Sample from vetted pool<br/>200 attacks = 100 scoring + 100 tuning<br/>fingerprint-locked]
    G --> H{Localize<br/>requested?}
    H -- yes --> I[7. Localize sub-agent<br/>working copy only — frozen pool untouched]
    H -- no --> J
    I --> J[8. Run test runner<br/>system prompt + user msg<br/>scoring set only]
    J --> K{9. Validation OK?<br/>token greater than 0, latency at least 10ms,<br/>traces present}
    K -- no --> Halt([HALT — no report written])
    K -- yes --> L([10. Write MEGA_PROMPT_CHECK.md])

    classDef gate fill:#fef3c7,stroke:#d97706,color:#78350f
    classDef terminal fill:#dcfce7,stroke:#16a34a,color:#14532d
    classDef halt fill:#fee2e2,stroke:#dc2626,color:#7f1d1d
    class E,H,K gate
    class L terminal
    class Halt halt
```
1. **Discover system prompt** — directory scan finds candidates in `prompt.txt`, code literals, env vars, YAML keys. One candidate → silent accept; multiple → picker.
2. **Refresh model catalog** (24h-cached) — WebSearch + WebFetch pulls the latest litellm-supported model ids per provider.
3. **Auto-detect product model + API-key env** — `Grep` + `Read` over the user's repo extracts model invocations and `.env` candidates near the discovered prompt.
4. **Five setup questions** — auto-detected fields silently skip their question; first-time users typically answer ~2 of the 5.
5. **Locale detection (sub-agent)** — for English / low-risk products the question is skipped; otherwise the user picks `Translate all` / `Translate except jailbreak` / `Keep English`.
6. **Sample from the vetted pool** — 200 attacks (100 scoring + 100 tuning) drawn fresh per run from a fixed pool of 400. Different seeds give different samples; the pool fingerprint is stable, so runs remain comparable.
7. **Localize sub-agent (optional)** — rewrites the working copy to the target language and swaps embedded entities (Korean RRN format, JP postal codes, etc.). The frozen reference pool is never modified.
8. **Run the test runner** — system prompt + user message, one AI call per test. Scoring set only.
9. **Validation check** — fidelity signals (token=0 / sub-10ms latency / zero traces) trigger a halt before any report is written.
10. **Write report** — block rate per attack type, three failure examples per failing category, concrete prompt edits.
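The step-9 fail-closed validation gate can be sketched as a small check. This is an illustrative sketch, not the plugin's actual code: the field names (`total_tokens_out`, `min_latency_ms`, `trace_count`) are hypothetical, but the three fidelity signals match the pipeline description above.

```python
from dataclasses import dataclass

@dataclass
class TraceStats:
    """Aggregate stats from one test run (illustrative field names)."""
    total_tokens_out: int   # tokens generated across all traces
    min_latency_ms: float   # fastest observed model call
    trace_count: int        # number of recorded request/response traces

def validation_ok(stats: TraceStats, expected_traces: int = 100) -> bool:
    """Fail closed: any fidelity signal of a fake or cached run halts
    the pipeline before MEGA_PROMPT_CHECK.md is written."""
    if stats.total_tokens_out <= 0:
        return False   # zero tokens => no real generations happened
    if stats.min_latency_ms < 10.0:
        return False   # faster than any plausible real model call
    if stats.trace_count < expected_traces:
        return False   # traces missing or truncated
    return True
```

The point of the gate is that a misconfigured runner (wrong API key, mocked client, empty scoring set) produces a halt, never a misleadingly perfect report.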
How /prompt-optimize works (Pareto acceptance loop)
```mermaid
flowchart TD
    A[1. Load scoring-set baseline<br/>from latest /prompt-check] --> B[2. Measure tuning-set baseline<br/>one-time — search signal]
    B --> Loop{iter less than max_iter?}
    Loop -- no --> Term
    Loop -- yes --> D[Build failure summary<br/>tuning set only — no scoring leakage]
    D --> E[Rewriter proposes candidate<br/>uses your Claude Code default model]
    E --> F{Tuning gate<br/>improves on tuning set?}
    F -- no --> R1[Reject — cheap exit<br/>no scoring-set spend]
    F -- yes --> G{Scoring gate<br/>no regression and FRR in budget?}
    G -- no --> R2[Reject — keep prior best<br/>generalization guard]
    G -- yes --> Acc[Accept — update best]
    R1 --> Stall{3 iters without<br/>best changing?}
    R2 --> Stall
    Acc --> Thr{All thresholds<br/>cleared?}
    Thr -- yes --> Term
    Thr -- no --> Stall
    Stall -- yes --> Term
    Stall -- no --> Loop
    Term[4. Termination] --> Z{5. Diff + AskUserQuestion}
    Z -- Auto-apply recommended --> Out([Write MEGA_PROMPT_OPTIMIZE.md])
    Z -- Manual apply --> Out
    Z -- Discard --> Out

    classDef gate fill:#fef3c7,stroke:#d97706,color:#78350f
    classDef accept fill:#dcfce7,stroke:#16a34a,color:#14532d
    classDef reject fill:#fee2e2,stroke:#dc2626,color:#7f1d1d
    classDef terminal fill:#e0e7ff,stroke:#4f46e5,color:#312e81
    class F,G,Loop,Stall,Thr,Z gate
    class Acc accept
    class R1,R2 reject
    class Out terminal
```
1. **Load scoring-set baseline** from the most recent `prompt-check` run.
2. **Measure tuning-set baseline** (one-time) — the optimizer needs it once for the search signal.
3. **Iteration loop** (up to 10):
   - Build the failure summary from the tuning set only — the rewriter never sees scoring traces.
   - The rewriter (your Claude Code default model) proposes a hardened candidate.
   - **Tuning gate** (cheap reject) — if the candidate doesn't even improve on the tuning set, reject without spending budget on the scoring set.
   - **Scoring gate** (generalization) — only candidates that pass the tuning gate get a scoring-set measurement. Accept only if the scoring-set block rate didn't regress and the over-blocking rate stayed in budget.
4. **Termination** — every scoring-set threshold cleared, max_iter reached, or 3 consecutive iters without `best` changing.
5. **Diff + AskUserQuestion** — `Auto-apply (recommended)` / `Manual apply` / `Discard`.
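The two-gate acceptance loop above can be sketched in a few lines. This is a minimal illustration under stated assumptions: function names are hypothetical, scores are simplified to `(block_rate, frr)` pairs, and the "all thresholds cleared" early exit is omitted for brevity.

```python
def optimize(baseline_prompt, score_tuning, score_scoring, propose,
             max_iter=10, frr_budget=0.02, stall_limit=3):
    """Two-gate Pareto loop: cheap tuning gate first; the expensive
    scoring-set measurement only runs for survivors."""
    best = baseline_prompt
    best_tune = score_tuning(best)      # one-time tuning-set baseline
    best_score = score_scoring(best)    # scoring-set baseline
    stall = 0
    for _ in range(max_iter):
        cand = propose(best)            # rewriter proposal
        cand_tune = score_tuning(cand)
        # Gate 1 (cheap exit): must improve on the tuning set
        if cand_tune[0] <= best_tune[0]:
            stall += 1
        else:
            cand_score = score_scoring(cand)
            # Gate 2 (generalization): no scoring regression, FRR in budget
            if cand_score[0] >= best_score[0] and cand_score[1] <= frr_budget:
                best, best_tune, best_score = cand, cand_tune, cand_score
                stall = 0
            else:
                stall += 1
        if stall >= stall_limit:        # 3 iters without best changing
            break
    return best, best_score
```

The design point is budget asymmetry: most bad candidates die at the tuning gate without touching the scoring set, and the scoring set stays unseen by the rewriter so acceptance reflects generalization, not memorization.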
Note
Each incident below maps to a probe family in our 400-probe pool. Hardening the system prompt with prompt-optimize exercises the same attack mechanisms: the injection still arrives, but it no longer succeeds.
| Incident | Category | What broke |
|---|---|---|
| Three AI coding agents leak simultaneously (2026) | prompt_injection | One injection caused simultaneous API key + token leakage across Claude Code, Gemini CLI, and Copilot |
| EchoLeak — M365 Copilot zero-click exfiltration (2025-06) | prompt_injection | First production AI zero-click data leak: a received email hijacked Copilot with no user action |
| Vendor system prompts leaked on GitHub (2025–2026) — asgeirtj · CL4R1T4S | system_prompt_leak | Production prompts from ChatGPT, Claude, Gemini, Grok, Cursor, Devin, Replit all extracted and kept up to date publicly |
| Gap chatbot jailbreak + Chevy "$1 Tahoe" | jailbreak | DAN persona override broke the dealer bot into a "legally binding" $76K-for-$1 offer |
| OpenClaw "did exactly what they were told" (2026) | pii_disclosure | Agent published internal threat intelligence to the public web, because it was told to |
73% of production AI deployments were hit by prompt injection at least once in 2025 (Obsidian Security).
Two different things, conflated:
| Claude Code | Your deployed agent |
|---|---|
| A code-authoring tool that helps you write the source code | The system that actually runs in production. The model it calls is whatever name you wrote into your code |
So in reality:
- agent on `openai/gpt-5.5` → GPT-5.5's security characteristics apply
- agent on `gemini/gemini-3.1-pro` → Gemini's apply
- Which IDE you used to write the code is irrelevant at runtime
The security posture across vendors is not the same for the same prompt:
"Claude demonstrated the most robust security posture by providing secure responses with high consistency. Gemini was the most vulnerable due to filtering failures and information leakage. GPT-4o behaved securely in most scenarios but exhibited inconsistency in the face of indirect attacks." — Multi-Model Prompt Injection Survey, SciTePress 2025
"There is no such thing as prompt portability. If you change models, you need to re-eval, and re-tune, all your prompts." — Vivek Haldar · also PromptBridge, arXiv 2512.01420
Claude Code doesn't close this gap. It doesn't know which API model you'll deploy against, and it doesn't auto-tune the system prompt for that model's specific attack patterns. (Vendor-locked stacks like the Claude Agent SDK are internally consistent, but lock-in is a different cost.)
Frontier Claude API pricing is roughly 5–10× the small/flash tiers from OpenAI and Google, making Claude-only production traffic uneconomical for most startups and SMBs:
"Cost-based routing strategies route simple tasks to Gemini Flash (~$0.10/1M input) and complex reasoning to Claude, achieving cost savings of 50–80%." — LangDB
The infrastructure has standardized around this pattern:
| Tool | What it does |
|---|---|
| LiteLLM | 100+ LLM APIs behind an OpenAI-compatible interface — self-hosted, zero-vendor-lock-in |
| OpenRouter | 500+ models behind a single API key — $40M raised at $500M valuation (Jun 2025) |
| Bifrost / OpenAI Agents SDK compat | Gemini CLI ↔ Claude / GPT / Groq + 20 providers |
| OpenClaude | Claude-compatible interface fronting 200+ models from OpenAI / Gemini / DeepSeek / Ollama |
Real production agents look like this:
```
[development]                 [deployment]
Code in Claude Code    →      Agent uses LiteLLM / OpenRouter to
                              dynamically pick GPT-5.5 / Gemini / Grok / Claude
                              based on cost and task fit
```
OpenClaw, Hermes-class agent stacks, and similar multi-vendor frameworks all converge on this shape. Even if your dev tool is Claude, the model your deployed agent calls is a separate decision, and the security of that model depends entirely on whether its system prompt has been tuned per-vendor.
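The routing decision at the heart of this pattern can be sketched as follows. This is a hedged illustration only: the cost figures and the length heuristic are hypothetical, and real routers like LiteLLM or OpenRouter add fallbacks, budgets, and health checks on top.

```python
# Hypothetical $/1M-input-token costs, roughly echoing the tiers above.
MODEL_COSTS = {
    "gemini/gemini-3.1-flash-lite-preview": 0.10,   # small tier
    "openai/gpt-5.4-mini": 0.40,                    # small tier
    "anthropic/claude-opus-4.7": 15.00,             # frontier tier
}

def pick_model(task: str, needs_reasoning: bool) -> str:
    """Cost-based routing: simple tasks go to the cheapest small model,
    complex reasoning to a frontier model (the 50-80% savings pattern)."""
    if needs_reasoning or len(task) > 2000:
        return "anthropic/claude-opus-4.7"
    return "gemini/gemini-3.1-flash-lite-preview"
```

The security consequence is the whole argument of this section: every model this function can return needs its system prompt tuned per-vendor, or the deployment inherits the worst defense rate in the pool.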
```
mega-security/
├─ skills/
│  ├─ prompt-check/      # 5–10 min single-prompt diagnosis
│  └─ prompt-optimize/   # iterative hardening with Pareto gates
├─ hooks/                # Claude Code lifecycle hooks
├─ scripts/              # log / sanity / pricing helpers
└─ tests/                # judge regression + archetype detection
```
Every command is read-only by default; none of them auto-modifies your source code. Optimize commands present a unified diff at the end and let you decide whether and where to apply it.
| Category | Sources | Pool size |
|---|---|---|
| `prompt_injection` | HarmBench + in-house synth (12 indirect-injection vectors × 12 payloads + 8 singletons) | 100 |
| `jailbreak` | DAN-in-the-wild | 100 |
| `pii_disclosure` | In-house synth (16 hard patterns × 12 victim profiles) | 100 |
| `system_prompt_leak` | In-house synth (24 patterns × 7 targets + 8 singletons) | 100 |
Every attack was vetted against a capable baseline AI; only the ones it actually failed to defend against (or barely defended) made it into the frozen pool. Trivial probes were dropped so meaningful differences between models surface instead of saturating near 100%. The pool is fingerprint-locked (sha256 in `manifest.json`) so cross-run comparability is preserved.
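A fingerprint lock of this kind can be sketched as below. This is an assumption-laden illustration: the plugin's actual manifest schema is not shown here, and the `id` key is hypothetical; the technique is just a canonical sha256 over the frozen pool.

```python
import hashlib
import json

def pool_fingerprint(attacks: list[dict]) -> str:
    """Canonical sha256 over the attack pool: sorting by a stable key
    and serializing with sorted keys makes the hash independent of
    construction order, so equal pools always hash equal."""
    canonical = json.dumps(
        sorted(attacks, key=lambda a: a["id"]),
        sort_keys=True,
        ensure_ascii=False,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Any edit to any probe changes the digest, so a run against a modified pool is detectably non-comparable to prior runs.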
- Leaderboard repo — full benchmark, methodology, real BREACHED traces
- Claude Code plugin marketplace — install entry point
Issues and PRs welcome at github.com/mega-edo/mega-security. Before submitting, please run the existing test suites:
```
python tests/judge_regression_test.py
python tests/test_archetype_detection.py
```

Apache 2.0 © MEGA Security contributors.
Built on the shoulders of:
- HarmBench — academic-standard adversarial benchmark
- TrustAIRLab/in-the-wild-jailbreak-prompts — DAN/persona-override corpus
- LiteLLM — unified multi-vendor LLM interface
- OWASP GenAI Security Project — incident taxonomy and remediation guidance