Skip to content

guilingzhouyi-creator/NOMOS

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

100 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

NOMOS — AI Code Governance & Benchmark · v0.12

17 deterministic rules | 53K+ RC events | 2 languages | self-scanning

"Every benchmark tells you which model is strongest. NOMOS tells you which one can survive reality."

17 rules Python + C# 53K+ RC events MIT


What NOMOS Is

NOMOS is a code governance engine. Its first domain is C# and Python; the architecture is domain-agnostic — any field where AI generates output and a deterministic verifier exists can plug into the same pipeline.

Typical tool NOMOS
Rules written once, never change Rules evolve from Reference Channel data
Runs on your code, never on itself run.py scans dev/ — governance governed
Tests models, reports scores RC events feed QLoRA fine-tuning
Single-language Language profiles (C# via tree-sitter, Python via regex)

The Governance Engine (v0.1/dev/)

L0  Infrastructure    —  tree-sitter AST, RC storage (JSONL + SHA-256)
L1  LLM Gateway       —  multi-model routing, trace
L2  Rule Engine       —  17 rules (AST + regex + multi-file), multi-language
L3  Planning          —  scan strategy
L4  Analysis          —  constitution extraction, self-inspection
L5  Competition       —  shadow evaluation, threshold auto-tuning
L6  Builder           —  Agent instruction injection

Architecture design → | Implementation details →


The Benchmark Matrix (5 Dimensions)

Model Python80 C#30 C# Multi-8 Fuzzy20 Pollution10
coder:6.7b 72% 70% 62% 100%* 80%
qwen:7b 80% 87% 88% 95%* 30%
v4-flash (cloud) 81% 100% 88% 100%* 100%

Pollution: 7 cross-domain instruction layers + 200-turn VS Code Agent context injected.
* Fuzzy prompts: all models PASS but default to single-file (f=1/N) — engineering unusable.

Full leaderboard →


Key Discoveries

  1. Clean benchmarks lie. qwen: 100% clean → 30% polluted. You'd ship it. It'd crash.
  2. Size ≠ resilience. 6.7B coder beats cloud v4-flash in pollution resistance.
  3. Pollution quality > quantity. 100K same-domain noise = no effect. 15K cross-domain instructions = 0-30% destruction.
  4. Multi-file is the real cliff. L2 two-file: all pass. L3 three-file+interface: coder drops to 0%.
  5. Even 1.3T fails. v4-pro needed 13 human corrections for 4 tasks under 392K context.

Quick Start

git clone https://github.com/guilingzhouyi-creator/NOMOS.git
cd NOMOS/v0.1/dev

# Scan Small_WarThunder (C#, auto-detect)
python run.py

# Scan Python codebase
python run.py --lang py

# Run benchmarks
cd .. && python _long_prompt_bench.py     # Python 80 problems
python _cs_bench.py                       # C# 30 problems
python _cs_multifile_bench.py             # C# multi-file L2-L4
python _pollution_bench.py                # Context pollution

Repository Structure

v0.1/
├── dev/                        —  Governance engine (102 modules, 8.3K lines)
│   ├── run.py                  —  Entry point (@step pipeline, multi-language)
│   ├── l0/                     —  Infrastructure (AST, RC, config, MCP)
│   ├── l1/                     —  LLM gateway (router, client, trace)
│   ├── l2/                     —  Rule engine (17 rules, pipeline, constants)
│   ├── l3/ l4/ l5/ l6/         —  Higher-order governance layers
│   ├── rc/                     —  Batch scanners, GitHub integration
│   └── tools/                  —  Fingerprint, doc check, MCP server
├── .reference_channel/         —  53,000+ SHA-256 verified events
├── _*_bench.py                 —  5-dimension benchmark suite
├── _qlora_train.py             —  QLoRA fine-tuning pipeline
├── output/ train_data/         —  Generated artifacts & training pairs
└── LEADERBOARD.md              —  Full multi-model matrix

Research Questions

  • How do models degrade under Agent-style context pollution?
  • Can a 7B local model beat cloud models when fine-tuned with compliance data?
  • What's the multi-file decoupling cliff — and which models survive it?
  • Can governance rules self-evolve from accumulated Reference Channel data?
  • Does the architecture generalize beyond code to other AI-output domains?

Citation

@software{NOMOS2026,
  author = {guilingzhouyi},
  title = {NOMOS: A Self-Improving AI Code Governance \& Benchmark System},
  year = {2026},
  url = {https://github.com/guilingzhouyi-creator/NOMOS}
}

License

MIT — use it, fork it, build on it.


Built by a university student + an LLM agent. First discovered the problem, then built the solution.