17 deterministic rules | 53K+ RC events | 2 languages | self-scanning
"Every benchmark tells you which model is strongest. NOMOS tells you which one can survive reality."
NOMOS is a code governance engine. Its first domain is C# and Python; the architecture is domain-agnostic — any field where AI generates output and a deterministic verifier exists can plug into the same pipeline.
| Typical tool | NOMOS |
|---|---|
| Rules written once, never change | Rules evolve from Reference Channel data |
| Runs on your code, never on itself | run.py scans dev/ — governance governed |
| Tests models, reports scores | RC events feed QLoRA fine-tuning |
| Single-language | Language profiles (C# via tree-sitter, Python via regex) |
L0 Infrastructure — tree-sitter AST, RC storage (JSONL + SHA-256)
L1 LLM Gateway — multi-model routing, trace
L2 Rule Engine — 17 rules (AST + regex + multi-file), multi-language
L3 Planning — scan strategy
L4 Analysis — constitution extraction, self-inspection
L5 Competition — shadow evaluation, threshold auto-tuning
L6 Builder — Agent instruction injection
Architecture design → | Implementation details →
| Model | Python80 | C#30 | C# Multi-8 | Fuzzy20 | Pollution10 |
|---|---|---|---|---|---|
| coder:6.7b | 72% | 70% | 62% | 100%* | 80% |
| qwen:7b | 80% | 87% | 88% | 95%* | 30% |
| v4-flash (cloud) | 81% | 100% | 88% | 100%* | 100% |
Pollution: 7 cross-domain instruction layers + 200-turn VS Code Agent context injected.
* Fuzzy prompts: all models PASS but default to single-file (f=1/N) — engineering unusable.
- Clean benchmarks lie. qwen: 100% clean → 30% polluted. You'd ship it. It'd crash.
- Size ≠ resilience. 6.7B coder beats cloud v4-flash in pollution resistance.
- Pollution quality > quantity. 100K same-domain noise = no effect. 15K cross-domain instructions = 0-30% destruction.
- Multi-file is the real cliff. L2 two-file: all pass. L3 three-file+interface: coder drops to 0%.
- Even 1.3T fails. v4-pro needed 13 human corrections for 4 tasks under 392K context.
git clone https://github.com/guilingzhouyi-creator/NOMOS.git
cd NOMOS/v0.1/dev
# Scan Small_WarThunder (C#, auto-detect)
python run.py
# Scan Python codebase
python run.py --lang py
# Run benchmarks
cd .. && python _long_prompt_bench.py # Python 80 problems
python _cs_bench.py # C# 30 problems
python _cs_multifile_bench.py # C# multi-file L2-L4
python _pollution_bench.py # Context pollutionv0.1/
├── dev/ — Governance engine (102 modules, 8.3K lines)
│ ├── run.py — Entry point (@step pipeline, multi-language)
│ ├── l0/ — Infrastructure (AST, RC, config, MCP)
│ ├── l1/ — LLM gateway (router, client, trace)
│ ├── l2/ — Rule engine (17 rules, pipeline, constants)
│ ├── l3/ l4/ l5/ l6/ — Higher-order governance layers
│ ├── rc/ — Batch scanners, GitHub integration
│ └── tools/ — Fingerprint, doc check, MCP server
├── .reference_channel/ — 53,000+ SHA-256 verified events
├── _*_bench.py — 5-dimension benchmark suite
├── _qlora_train.py — QLoRA fine-tuning pipeline
├── output/ train_data/ — Generated artifacts & training pairs
└── LEADERBOARD.md — Full multi-model matrix
- How do models degrade under Agent-style context pollution?
- Can a 7B local model beat cloud models when fine-tuned with compliance data?
- What's the multi-file decoupling cliff — and which models survive it?
- Can governance rules self-evolve from accumulated Reference Channel data?
- Does the architecture generalize beyond code to other AI-output domains?
@software{NOMOS2026,
author = {guilingzhouyi},
title = {NOMOS: A Self-Improving AI Code Governance \& Benchmark System},
year = {2026},
url = {https://github.com/guilingzhouyi-creator/NOMOS}
}MIT — use it, fork it, build on it.
Built by a university student + an LLM agent. First discovered the problem, then built the solution.