xrouter-llm

Stop sending every prompt to your most expensive LLM.

xrouter-llm is a prompt-aware LLM routing-decision service: it predicts which models can complete a prompt, then chooses the cheapest model whose predicted completion clears a configurable threshold. On our test dataset, it cuts realized cost by 53.2% while raising the completion rate by 1.9 percentage points.

It answers "which model should serve this prompt?" and records the choice — it does NOT call the underlying LLMs.

Install

pip install xrouter-llm        # ships a trained router + model registry
# or, for development:
pip install -e ".[dev]"

The wheel bundles a trained router artifact, the model-profile registry, and the router configs, so a fresh install can serve immediately with no extra files.

Serve

The bundled router, registry, and configs are the defaults, so a bare invocation works out of the box:

xrouter-llm serve --port 8080

Override any of them to use your own trained model or registry:

xrouter-llm serve \
  --model artifacts/models/irt_router_350k.joblib \
  --models-dir path/to/models --routers-dir path/to/routers \
  --db artifacts/calls.db --port 8080

GET / serves the single-page UI (prompt box, config picker, decision table, history)
GET /api/configs
POST /api/route with body {prompt, config, task?} (task is optional)
GET /api/history?limit=N
Every decision is logged to SQLite. Because the log holds user prompts, *.db and *.sqlite files are gitignored.
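
As a rough illustration of calling the routing endpoint from a script, here is a minimal standard-library sketch. It assumes the body is sent as JSON with the three fields listed above; the helper names build_route_request and route are hypothetical, and the response is returned as decoded JSON without assuming its shape.

```python
import json
import urllib.request


def build_route_request(base_url, prompt, config, task=None):
    """Build a POST /api/route request; `task` is left out unless given."""
    payload = {"prompt": prompt, "config": config}
    if task is not None:
        payload["task"] = task
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/api/route",
        data=json.dumps(payload).encode("utf-8"),  # assumption: JSON body
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def route(base_url, prompt, config, task=None, timeout=10.0):
    """Send the request and return the decoded JSON body as-is."""
    request = build_route_request(base_url, prompt, config, task)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


# Usage, with a server started via `xrouter-llm serve --port 8080`.
# "my-config" is a placeholder; real config names come from GET /api/configs.
#   decision = route("http://localhost:8080", "Summarize this changelog", "my-config")
```

Keeping request construction separate from the network call makes the payload easy to inspect or unit-test without a running server.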

Model registry

The registry holds one YAML file per supported model, bundled under src/xrouter_llm/resources/config/models/. Each file is a capability profile: provider, costs, context window, and published benchmarks as 0-100 percentages. model_id is the model's canonical OpenRouter slug (e.g. anthropic/claude-opus-4.8). The bundled registry is the default for --benchmark-profiles; point that flag at your own directory or file to extend it. To add a model, add a file.

The bundled router and registry can also be loaded from Python:

from xrouter_llm import IRTRouter, default_model_path, default_models_dir, load_benchmark_profiles

router = IRTRouter.load(default_model_path())
for profile in load_benchmark_profiles(default_models_dir()).profiles():
    router.add_benchmark_profile(profile)

preds = router.predict(
    "Design a distributed consensus algorithm",
    model_ids=["anthropic/claude-opus-4.8", "deepseek/deepseek-v4-pro"],
)
print({p.model_id: round(p.mu, 3) for p in preds})

How it works

Do not train:  prompt -> selected model
Train:         prompt + model -> probability the model completes the prompt
Decide:        predicted completion + cost -> cheapest model that can complete

Completion is factored into two decoupled axes (an IRT-style model):

P(complete) = sigmoid(a * capability(model) + b * difficulty(prompt) + c)

capability(model) = the mean of the model's published gpqa_diamond and livecodebench scores (both have full coverage on the training side). Adding more benchmarks does not help at this data scale: with 37 profiled models, a flat mean dilutes the signal and learned weights overfit (see AGENTS.md, "Capability benchmarks"). The mean is used directly, so a brand-new model's published benchmarks drive its ranking.
difficulty(prompt) = a Ridge regressor on a multilingual embedding (Qwen/Qwen3-Embedding-0.6B), trained on each prompt's empirical pass rate. Because the embedding is multilingual, difficulty learned from English training data transfers to Chinese prompts. The embedding was picked over bge-m3 in a controlled probe (scripts/probe_qwen_difficulty.py): it reaches a higher held-out Pearson correlation and no longer rates trivial prompts ("1+1=?") as maximally hard.
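
To make the formula concrete, the sketch below evaluates it for two made-up models. Everything numeric here is an assumption for illustration: the coefficients A, B, C are not the shipped fit, the benchmark mean is rescaled from 0-100 to 0-1, and difficulty is taken as a 0-1 score where higher means harder (hence the negative B).

```python
import math


def capability(gpqa_diamond, livecodebench):
    """Mean of the two published scores, rescaled from 0-100 to 0-1 (assumption)."""
    return (gpqa_diamond + livecodebench) / 2.0 / 100.0


def p_complete(cap, difficulty, a, b, c):
    """P(complete) = sigmoid(a * capability + b * difficulty + c)."""
    z = a * cap + b * difficulty + c
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical coefficients, chosen only for illustration.
A, B, C = 4.0, -5.0, 1.0

# Two made-up benchmark profiles.
strong = capability(gpqa_diamond=80.0, livecodebench=70.0)
weak = capability(gpqa_diamond=40.0, livecodebench=30.0)

for difficulty in (0.1, 0.5, 0.9):
    p_strong = p_complete(strong, difficulty, A, B, C)
    p_weak = p_complete(weak, difficulty, A, B, C)
    print(f"difficulty={difficulty}: strong={p_strong:.3f} weak={p_weak:.3f}")
```

Because the two axes enter additively, a new model only needs its two benchmark scores to get a prediction for any prompt; the prompt-side term is unchanged.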

This factoring is the key lesson. A single joint classifier could not rank unseen models by their benchmarks: on this data, model capability barely explains completion marginally, but it does once difficulty is controlled for, and that is exactly what the factored model exploits.

Datasets

The production difficulty model is trained on multiple datasets combined (all feed the difficulty axis; only profiled models feed the capability axis):

Source	Type	Scale	In production training set?
`NPULH/LLMRouterBench` (350k stream sample)	single-turn QA / code / math (22 tasks)	37 models x ~13.8k prompts	✅
agent-psychometrics: Terminal-Bench 2.0	terminal agent	89 tasks x 112 subjects	✅ `--dataset agentic:agentic/terminalbench`
agent-psychometrics: SWE-bench Verified	coding agent	500 tasks x 134 subjects	✅ task text joined from `princeton-nlp/SWE-bench_Verified`
`Xorbits/xagent-xrouter-labels`	real xagent internal prompts	100 prompts x 4 OpenRouter models	✅ `--dataset xagent-labels:Xorbits/xagent-xrouter-labels:full`
agent-psychometrics: SWE-bench Pro / GSO	coding agent	730 tasks x 14 subjects / 102 tasks x 15 subjects	⛔ ships no local task text; an external join is needed

The current artifact trains on LLMRouterBench 350k + Terminal-Bench + SWE-bench Verified + xagent labels (378,397 rows / ~14,463 prompts / 287 subjects). The agentic matrices come from agent-psychometrics (MIT-licensed) via agentic.py. In IRTRouter, only the 37 profiled LLMRouterBench models feed the capability axis; agentic subjects feed the difficulty axis only. RouterBench (withmartian/routerbench) remains a smaller legacy baseline. Local datasets and trained artifacts are not committed (data/ and artifacts/ are gitignored).

Adding more agentic prompt types (e.g. your own traffic) is the only way to make difficulty accurate for task mixes outside coding/terminal — see AGENTS.md.

Train

xrouter-llm train-irt \
  --dataset llmrouterbench:data/raw/llmrouterbench_stream_sample_350k \
  --dataset agentic:agentic/terminalbench \
  --dataset agentic:agentic/swebench_verified \
  --dataset xagent-labels:Xorbits/xagent-xrouter-labels:full \
  --benchmark-profiles artifacts/profiles/llmrouterbench_350k_profiles_priority_collected.json,src/xrouter_llm/resources/config/models \
  --output artifacts/models/irt_router_350k.joblib

Diagnostics: sweep-thresholds (cost/completion frontier + calibration) and eval-model-holdout (leave-one-model-out generalization).
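
To illustrate what a cost/completion frontier is, the sketch below sweeps a completion threshold over a tiny hand-made dataset. It is not the sweep-thresholds implementation: the data layout, the function name sweep, and the simplified fallback (pick the best-predicted model when nothing clears the threshold) are all assumptions, and calibration is not covered.

```python
def sweep(rows, costs, thresholds):
    """Return (threshold, mean cost, completion rate) for each threshold.

    rows:  list of (predicted, completed) pairs, where `predicted` maps
           model -> predicted completion probability and `completed` maps
           model -> whether that model actually completed the prompt.
    costs: model -> cost of serving one prompt (hypothetical units).
    """
    frontier = []
    for threshold in thresholds:
        total_cost = 0.0
        total_completed = 0
        for predicted, completed in rows:
            passing = [m for m, p in predicted.items() if p >= threshold]
            if passing:
                # Cheapest model that clears the threshold.
                chosen = min(passing, key=lambda m: costs[m])
            else:
                # Simplified fallback: the best-predicted model.
                chosen = max(predicted, key=lambda m: predicted[m])
            total_cost += costs[chosen]
            total_completed += int(completed[chosen])
        frontier.append((threshold, total_cost / len(rows), total_completed / len(rows)))
    return frontier


# Made-up data: two prompts, two models.
costs = {"large": 10.0, "small": 1.0}
rows = [
    ({"large": 0.95, "small": 0.90}, {"large": True, "small": True}),
    ({"large": 0.85, "small": 0.40}, {"large": True, "small": False}),
]
for point in sweep(rows, costs, [0.3, 0.8, 0.99]):
    print(point)
```

Raising the threshold trades cost for completion: a low threshold sends both prompts to the cheap model and loses the hard one, while a very high threshold sends everything to the expensive model.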

Components

IRTRouter (irt_router.py): conservative production baseline (difficulty x capability).
RoutingPolicy (policy.py): "cheapest model whose predicted completion clears completion_threshold; else the cheapest within fallback_quality_margin of the best predicted completion".
serving.py / server.py: HTTP routing-decision API + single-page web UI.
resources/config/models/: a per-model YAML registry of capability profiles (bundled in the package; resolve with default_models_dir()).
resources/config/routers/: named "auto configs" — a candidate model set + policy (bundled; default_routers_dir()).
resources/models/irt_router_350k.joblib: the trained router shipped with the package (default_model_path()).
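
The RoutingPolicy rule quoted above can be sketched in a few lines. This is an illustration of the rule as worded in this README, not the code in policy.py: the Candidate type, the choose function, and the cost field are hypothetical names, and only completion_threshold and fallback_quality_margin come from the description.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    model_id: str
    p_complete: float  # predicted completion probability
    cost: float        # hypothetical cost of serving the prompt


def choose(candidates, completion_threshold, fallback_quality_margin):
    """Cheapest model clearing the threshold, else cheapest near the best."""
    if not candidates:
        raise ValueError("no candidate models")
    passing = [c for c in candidates if c.p_complete >= completion_threshold]
    if passing:
        return min(passing, key=lambda c: c.cost)
    # Fallback: nothing clears the bar, so stay within the margin of the
    # best predicted completion and take the cheapest of those.
    best = max(c.p_complete for c in candidates)
    near_best = [c for c in candidates if c.p_complete >= best - fallback_quality_margin]
    return min(near_best, key=lambda c: c.cost)


# Made-up candidates for illustration.
candidates = [
    Candidate("large-model", 0.92, 10.0),
    Candidate("mid-model", 0.81, 2.0),
    Candidate("small-model", 0.55, 0.3),
]
print(choose(candidates, 0.80, 0.05).model_id)  # threshold path
print(choose(candidates, 0.95, 0.05).model_id)  # fallback, narrow margin
print(choose(candidates, 0.95, 0.15).model_id)  # fallback, wider margin
```

The margin controls how much predicted quality the fallback is willing to give up for a lower price when no model clears the threshold.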

License

xrouter-llm is released under the Xagent Source License (© Xorbits Inc.) — see LICENSE. It is source-available, not an OSI-approved open source license.

The license text is shared verbatim with Xagent; for this project the licensed "Software" is xrouter-llm, and the "Restricted Functionality" / hosted-service and competitive-use clauses apply to its routing-decision and model-selection capabilities. In short: use, modification, and internal/single-tenant deployment are permitted; offering it as a multi-tenant hosted/managed service, or a directly competing service, is not. See LICENSE for the controlling terms.
