Frontier safety for open-weight models
A minimal Python + PyTorch + Transformers research project for Anthropic-style activation probes on open-weight models.
It does four things:
- Collects hidden-state activations from a target model.
- Trains a tiny linear probe to classify safe vs unsafe exchanges.
- Runs a guarded generation loop that probes hidden states during generation and pauses/escalates when risk rises.
- Trains and runs a ModernBERT exchange classifier when the probe pauses. The current best classifier is a three-class model: safe compliance, safe refusal, and unsafe compliance.
This is a research MVP. It is useful enough to benchmark and iterate on, but it should not be treated as production safety infrastructure without more evaluation.
The best guardrail configuration in the Modal output volume (superagentai workspace) is:
probe_dir: probe_out_gemma4_fresh_prod
classifier_dir: classifier_out_modernbert_threeclass_v4_toxicchat_prod_fixed
classifier_block_threshold: 0.459937
final_classifier_check: true
refusal: "I can't help with that request. I can help with safe, lawful alternatives instead."
Why these values:
classifier_out_modernbert_threeclass_v4_toxicchat_prod_fixedis a ModernBERT-base classifier trained with three labels (safe_compliance,safe_refusal,unsafe_compliance) on the optimized v3 dataset plus ToxicChat-derived examples. Validation F1 is0.9602.0.459937came from held-out calibration (/outputs/classifier_calibration_v4_toxicchat_prod_fixed.json): calibration F10.9910with a safe-compliance block rate under1%.- A less aggressive
0.65threshold was also swept; it did not reduce benign blocking and slightly worsened jailbreak safety, so the calibrated threshold is the default.
Run guarded generation with the current baseline:
uv run modal run modal_train.py::generate \
--probe-dir probe_out_gemma4_fresh_prod \
--classifier-dir classifier_out_modernbert_threeclass_v4_toxicchat_prod_fixed \
--classifier-block-threshold 0.459937 \
--final-classifier-check \
--prompt "Explain browser security headers at a high level"Production eval results (OpenRouter openai/gpt-4o-mini judge, 100 examples per split, artifacts at /outputs/evals/prod_v4_toxicchat_*.json):
- JailbreakBench
judge_comparison(jailbreak attacks):- raw unsafe:
20% - guarded unsafe:
1%(about95%relative reduction) - guard block rate:
26%
- raw unsafe:
- JailbreakBench
behaviors, harmful split:- raw unsafe:
0% - guarded unsafe:
0% - guard block rate:
1%
- raw unsafe:
- JailbreakBench
behaviors, benign split:- raw unsafe:
9% - guarded unsafe:
0% - benign block rate:
10%(known caveat, see below)
- raw unsafe:
- WildGuard, harmful split:
- raw unsafe:
14% - guarded unsafe:
0% - guard block rate:
23%
- raw unsafe:
- WildGuard, benign split:
- raw unsafe:
0% - guarded unsafe:
0% - benign block rate:
0%
- raw unsafe:
One evaluation detail matters: OpenRouter sometimes returned both unsafe=true and refusal=true for the same response. The benchmark treats refusal=true as not unsafe and keeps the raw unsafe flag separately for audit.
Known caveat: the guard still over-blocks benign prompts on the JBB benign split (10% block rate). Safety is strong; the remaining production work is reducing benign over-blocking.
Latency with the same baseline config on Modal A10G, 4 prompts, 64 generated tokens (/outputs/evals/prod_v4_toxicchat_latency.json):
- baseline generation:
3737 msmean,17.1 tok/s - probe only:
3723 msmean, no measurable overhead - probe + classifier + final check:
3831 msmean,+94 ms,+2.5% - classifier calls in the guarded run:
26total across 4 prompts; time-to-first-token unchanged
That latency is acceptable for a research guardrail. It is not yet optimized for serving.
Install uv if you do not already have it:
curl -LsSf https://astral.sh/uv/install.sh | shThen install the project dependencies:
uv syncThe default model is Gemma 4 E2B. It is smaller than many frontier models, but still needs substantial disk and memory:
uv run train-probe \
--model_id google/gemma-4-E2B-it \
--data_path data/examples.jsonl \
--layer -4 \
--out_dir ./probe_outThen run guarded generation:
uv run guarded-generate \
--model_id google/gemma-4-E2B-it \
--probe_path ./probe_out/probe.pt \
--config_path ./probe_out/config.json \
--prompt "Explain SQL injection at a high level"Train the second-stage ModernBERT exchange classifier on the Git LFS-backed dataset:
uv run train-exchange-classifier \
--model_id answerdotai/ModernBERT-base \
--data_path data/training_data.jsonl \
--out_dir ./classifier_out \
--epochs 5 \
--batch_size 8 \
--max_length 512 \
--prefix_augmentFor the current three-class baseline, the training source used on Modal was:
/outputs/training_data_classifier_threeclass_v4_toxicchat.jsonl
and the trained checkpoint was written to:
/outputs/classifier_out_modernbert_threeclass_v4_toxicchat_prod_fixed
Then pass the classifier artifact to guarded generation. When the activation probe pauses, the partial exchange is scored by ModernBERT before buffered tokens are released:
uv run guarded-generate \
--model_id google/gemma-4-E2B-it \
--probe_path ./probe_out/probe.pt \
--config_path ./probe_out/config.json \
--classifier_dir ./classifier_out \
--prompt "Explain SQL injection at a high level"Use Modal when the model download or GPU memory is too large for your local machine. The Modal runner uses an A10G GPU by default and persists Hugging Face downloads plus probe outputs in Modal Volumes.
flowchart LR
LocalRepo["Local repo"] --> ModalRun["uv run modal run modal_train.py"]
ModalRun --> ModalImage["Modal image"]
ModalImage --> GpuJob["A10G training job"]
HFSecret["huggingface-secret HF_TOKEN"] --> GpuJob
HFCache["open-constitution-hf-cache"] --> GpuJob
GpuJob --> HFCache
GpuJob --> Outputs["open-constitution-outputs"]
Outputs --> ModalGenerate["uv run modal run modal_train.py::generate"]
HFCache --> ModalGenerate
Outputs --> Download["modal volume get"]
Download --> LocalProbe["local probe_out_gemma4"]
Confirm that Modal is authenticated:
uv run modal profile currentCreate the Hugging Face token secret once so model downloads are authenticated:
uv run modal secret create huggingface-secret HF_TOKEN=$HF_TOKENRun Gemma 4 probe training on Modal:
uv run modal run modal_train.pyRun guarded generation on Modal using the saved probe, without downloading the model or probe locally:
uv run modal run modal_train.py::generate \
--prompt "Explain SQL injection at a high level"Train the ModernBERT exchange classifier on Modal:
uv run modal run modal_train.py::train_classifier \
--data_path data/training_data.jsonl \
--output-dir classifier_out_modernbertClassifier training logs are written to the Modal output volume. Download them with:
uv run modal volume get open-constitution-outputs classifier_out_modernbert/train.log ./train.logRun guarded generation on Modal with both the probe and classifier artifacts:
uv run modal run modal_train.py::generate \
--probe_dir probe_out_gemma4 \
--classifier_dir classifier_out_modernbert \
--prompt "Explain SQL injection at a high level"Download the saved probe outputs after the job finishes:
uv run modal volume get open-constitution-outputs probe_out_gemma4 ./probe_out_gemma4The first run downloads the model into the open-constitution-hf-cache Modal Volume. Later
runs reuse that cache. If your Modal workspace supports larger GPUs and you want extra
headroom, change gpu="A10G" to a larger GPU type (for example gpu="A100") in modal_train.py.
data/training_data.jsonl is tracked with Git LFS and is the original training source for
the ModernBERT classifier. data/training_data_classifier.jsonl adds hard benign examples.
The three-class Modal dataset also appends OpenRouter-labeled teacher rows from benchmark
generations. data/examples.jsonl is only a tiny smoke-test fixture.
Training files expect one JSON object per line:
{"prompt":"...", "response":"...", "label":0}
{"prompt":"...", "response":"...", "label":1}
{"prompt":"...", "response":"...", "label":2}Where:
- binary data:
label: 0= safe / allowedlabel: 1= restricted / unsafe
- three-class data:
label: 0= safe compliancelabel: 1= safe refusallabel: 2= unsafe compliance
For a real probe, you need thousands of examples across allowed and disallowed policy boundaries.
flowchart TD
Exchange["prompt + response"] --> ForwardPass["target model forward pass with output_hidden_states=True"]
ForwardPass --> HiddenState["selected layer hidden state at final token"]
HiddenState --> Probe["linear probe"]
Probe --> RiskScore["risk score"]
During generation:
flowchart TD
Generate["generate N tokens"] --> ReadHiddenState["read selected hidden state"]
ReadHiddenState --> ProbeRisk["probe risk score"]
ProbeRisk --> SmoothScore["smooth score over a small window"]
SmoothScore --> GuardDecision["continue / pause / escalate / refuse"]
This MVP:
- Uses the final token hidden state only.
- Trains a simple linear probe.
- Uses tiny example data.
- Does not implement Anthropic's exact training tricks.
- Does not patch vLLM.
- Uses a lightweight ModernBERT exchange classifier as the second-stage scorer.
Recommended next steps:
- Generate a serious policy dataset.
- Test multiple layers and layer combinations.
- Add token-level labels or soft token weighting.
- Add score smoothing and calibration curves.
- Calibrate the probe and classifier thresholds on held-out data.
- Integrate into vLLM once the probe is validated.
The default model is now:
google/gemma-4-E2B-itGemma 4 on Hugging Face uses AutoProcessor + AutoModelForImageTextToText, while many text-only models use AutoTokenizer + AutoModelForCausalLM. The MVP handles both.
The repo also uses model chat templates by default through:
tokenizer_or_processor.apply_chat_template(...)or, if the processor wraps a tokenizer:
tokenizer_or_processor.tokenizer.apply_chat_template(...)Disable chat templates with:
--no_chat_templateExample Gemma 4 run:
uv run train-probe \
--model_id google/gemma-4-E2B-it \
--data_path data/examples.jsonl \
--layer -4 \
--out_dir ./probe_out_gemma4Then:
uv run guarded-generate \
--model_id google/gemma-4-E2B-it \
--probe_path ./probe_out_gemma4/probe.pt \
--config_path ./probe_out_gemma4/config.json \
--prompt "Explain SQL injection at a high level"Layer sweep:
uv run sweep-layers \
--model_id google/gemma-4-E2B-it \
--data_path data/examples.jsonl \
--layers="-2,-4,-6,-8,-10,-12"Note: Gemma models may require accepting Google's license terms on Hugging Face before download. google/gemma-4-E2B-it downloads about 10 GB of weights, so use a machine or cloud runtime with enough free disk space.