LoRA-fine-tune a small model into a JSON tool-call router, serve it over MCP, and prove the lift with a from-scratch base-vs-tuned eval — plus an observable agent loop with failure recovery.
The arc, end to end: build the dataset → LoRA-SFT a 0.5B model → measure it against base on a hard held-out split → serve the tuned model over MCP so any agent can call it → wrap it in an agent loop that validates, repair-retries, and falls back. Everything here actually ran on a 16 GB Apple-Silicon Mac; the loss curve, the adapter, and the eval numbers are committed real outputs, not placeholders.
Teaching-scale on purpose. The point isn't to claim I trained a frontier model — it's to demonstrate that I can stand up the full PyTorch + PEFT + TRL training loop, build the data, read the loss curve, and prove an improvement with a rigorous eval.
Qwen2.5-0.5B-Instruct, held-out test set of 97 cases (54 easy + 43 hard), graded by code against the exact gold tool + args:
| metric (all 97) | base | LoRA-tuned | Δ |
|---|---|---|---|
| valid JSON | 95.9% | 100.0% | +4.1 |
| schema-valid call | 19.6% | 94.8% | +75.2 |
| correct tool | 36.1% | 85.6% | +49.5 |
| exact args | 4.1% | 74.2% | +70.1 |
| fully correct | 4.1% | 74.2% | +70.1 |
On the hard split (ambiguous wording, near-duplicate tools, distractors) the base model gets 0% fully correct; the tuned model gets 76.7%.
The story is clean and honest: the base 0.5B already knows JSON syntax (95.9% valid) but doesn't follow the tool schema (19.6% schema-valid, 4.1% exact args). LoRA SFT teaches it the schema without touching the syntax it already had.
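For readers who want to see how the five table metrics can be computed without an LLM judge, here is a minimal sketch of a code grader. It is not the repo's grade.py: the two-tool `TOOLS` schema, the `grade` function, and the exact metric definitions (for example, that "exact args" also requires the correct tool) are assumptions made for illustration.

```python
import json

# Hypothetical two-tool schema for illustration; the real toolbox has 8 tools.
TOOLS = {
    "get_weather": {"city": str},
    "send_email": {"to": str, "subject": str},
}

METRICS = ("valid_json", "schema_valid", "correct_tool", "exact_args", "fully_correct")


def grade(raw: str, gold: dict) -> dict:
    """Grade one raw model output against a gold {"tool", "args"} call."""
    result = dict.fromkeys(METRICS, False)
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return result  # not even parseable: every metric stays False
    result["valid_json"] = True
    if not isinstance(call, dict):
        return result
    tool, args = call.get("tool"), call.get("args")
    spec = TOOLS.get(tool) if isinstance(tool, str) else None
    if spec is not None and isinstance(args, dict):
        # Schema-valid: known tool, exactly the expected keys, expected types.
        result["schema_valid"] = set(args) == set(spec) and all(
            isinstance(args[key], typ) for key, typ in spec.items()
        )
    result["correct_tool"] = tool == gold["tool"]
    result["exact_args"] = result["correct_tool"] and args == gold["args"]
    result["fully_correct"] = result["schema_valid"] and result["exact_args"]
    return result


def rates(rows: list) -> dict:
    """Aggregate per-case grades into percentages, one per metric."""
    return {m: round(100 * sum(r[m] for r in rows) / len(rows), 1) for m in METRICS}


gold = {"tool": "get_weather", "args": {"city": "Tokyo"}}
outputs = [
    '{"tool": "get_weather", "args": {"city": "Tokyo"}}',   # fully correct
    '{"tool": "weather", "args": {"location": "Tokyo"}}',   # valid JSON, wrong schema
    "Sure! The weather tool is what you need.",             # not JSON at all
]
graded = [grade(raw, gold) for raw in outputs]
print(rates(graded))
```

The second output is the failure mode the table describes for the base model: syntactically valid JSON that names a tool and arguments the schema does not define.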
The training/test data is templated, so the obvious question is "does it generalize beyond the templates?" data/real_test.jsonl is 12 hand-written, naturalistic requests (e.g. "is it shorts weather in Athens right now or should I bring a jacket", "shoot Priya a message, subject 'Q3 numbers'…") — never seen in any template. Run python -m toolsmith.eval --testfile data/real_test.jsonl --tag _real:
| metric (12 hand-written) | base | tuned | Δ |
|---|---|---|---|
| schema-valid | 25.0% | 100.0% | +75.0 |
| correct tool | 41.7% | 83.3% | +41.6 |
| fully correct | 8.3% | 58.3% | +50.0 |
The lift holds on genuinely out-of-distribution phrasing — tool selection and schema adherence generalize strongly; fully_correct (58.3%) is honestly lower than the templated 74.2%, because exact-arg matching on free-form text (e.g. "next Thursday" → a date string) is harder. That gap is the real generalization cost, reported rather than hidden.
LoRA rank 16 on attention+MLP projections (~8.8M trainable params, 1.75% of the model), 3 epochs, ~6.5 min on MPS. train_loss 4.3 → 0.35.
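The parameter count above can be sanity-checked by hand. The sketch below assumes the Qwen2.5-0.5B dimensions from its published config (hidden size 896, 24 layers, key/value width 128, MLP width 4864), assumes all seven attention and MLP projections are adapted, and uses an approximate base size of 494M parameters; check these against the model's config.json before relying on them.

```python
# Assumed Qwen2.5-0.5B dimensions (verify against the model's config.json).
HIDDEN = 896
KV_DIM = 128            # 2 key/value heads x head_dim 64 (grouped-query attention)
INTERMEDIATE = 4864
LAYERS = 24
RANK = 16
BASE_PARAMS = 494_000_000   # approximate base parameter count (assumption)

# (in_features, out_features) for each adapted projection in one layer.
PROJECTIONS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """LoRA adds A (rank x d_in) and B (d_out x rank) next to a frozen weight."""
    return rank * (d_in + d_out)


per_layer = sum(lora_params(i, o, RANK) for i, o in PROJECTIONS.values())
trainable = per_layer * LAYERS
share = 100 * trainable / (BASE_PARAMS + trainable)   # trainable / all params
adapter_mb = trainable * 4 / 1e6                      # assuming fp32 storage

print(f"trainable params: {trainable:,}")
print(f"share of model:   {share:.2f}%")
print(f"adapter size:     {adapter_mb:.1f} MB")
```

Under these assumptions the arithmetic lands on roughly 8.8M trainable parameters, about 1.75% of the total, and an fp32 adapter of about 35 MB, which is consistent with the figures quoted in this README.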
pip install -e . # MCP server + agent + grader (light deps)
pip install -r requirements-train.txt # torch/transformers/peft/trl/... for training
python -m toolsmith.data.build # -> data/train.jsonl (243), data/test.jsonl (97)
python -m toolsmith.train # LoRA SFT -> artifacts/adapter + artifacts/loss.png
python -m toolsmith.eval # base vs tuned -> artifacts/eval_report.md + eval_chart.png
python -m toolsmith.agent --demo # offline recovery demo -> logs/run-demo.jsonl
python -m toolsmith.mcp_server # stdio; exposes route_to_tool(request) + the 8 tools
# or containerized (installs the inference stack, pulls the base model on first run):
docker build -t tool-smith . && docker run --rm -i tool-smith
mcp.json for Claude Desktop / Cursor:
{
  "mcpServers": {
    "tool-smith": {
      "command": "python",
      "args": ["-m", "toolsmith.mcp_server"],
      "cwd": "/path/to/tool-smith"
    }
  }
}
route_to_tool("What's the weather in Tokyo?") → {"tool": "get_weather", "args": {"city": "Tokyo"}, "valid": true, ...}.
agent.py wraps the router: route → parse → validate against the tool schema → on failure, repair-retry with the error fed back → if still failing, fall back to a frontier/rule router → execute. Every step is appended to logs/run-*.jsonl (raw output, latency, validation verdict, retry count, recovery action). A real model-backed run (logs/run-model.jsonl) exercises all three paths:
ok=True recovery=none | What's the weather in Tokyo? (tuned, 1 attempt)
ok=True recovery=repair_retry | Pack for Berlin? ... rain there. (base model failed, retry fixed it)
ok=True recovery=frontier_fallback | ...what's sitting in refunds... (base failed x3 -> fallback router)
python -m toolsmith.logs_report → success rate, recovery breakdown, latency. The recovery logic is unit-tested with stub routers (tests/test_agent.py), so it's verified without a model.
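The validate / repair-retry / fallback flow can be sketched with stub routers, in the same spirit as the model-free tests. This is an illustrative sketch, not the repo's agent.py: `validate`, `run_agent`, the one-tool `REQUIRED_ARGS` schema, the three-attempt limit, and the log field names are all assumptions, and the final execute step is left out.

```python
import json
import time
from collections import Counter

REQUIRED_ARGS = {"get_weather": {"city"}}   # hypothetical one-tool schema


def validate(raw):
    """Return (call, error); exactly one of the two is None."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, f"invalid JSON: {exc.msg}"
    tool = call.get("tool") if isinstance(call, dict) else None
    if not isinstance(tool, str) or tool not in REQUIRED_ARGS:
        return None, "unknown or missing tool"
    args = call.get("args")
    if not isinstance(args, dict) or set(args) != REQUIRED_ARGS[tool]:
        return None, f"args must be exactly {sorted(REQUIRED_ARGS[tool])}"
    return call, None


def run_agent(request, router, fallback, log, max_attempts=3):
    """Route, validate, retry with the error fed back, then fall back."""
    error = None
    for attempt in range(1, max_attempts + 1):
        start = time.perf_counter()
        raw = router(request, error)          # previous error is the repair hint
        call, error = validate(raw)
        log.append({"request": request, "attempt": attempt, "raw": raw,
                    "error": error, "latency_s": time.perf_counter() - start})
        if call is not None:
            recovery = "none" if attempt == 1 else "repair_retry"
            return {"ok": True, "recovery": recovery, "call": call}
    call, _ = validate(fallback(request))     # last resort: a different router
    return {"ok": call is not None, "recovery": "frontier_fallback", "call": call}


# Stub routers standing in for the tuned model, the base model, and a rule router.
def good_router(request, error):
    return '{"tool": "get_weather", "args": {"city": "Tokyo"}}'

def flaky_router(request, error):
    if error is None:
        return '{"tool": "get_weather"}'      # first try omits the args
    return '{"tool": "get_weather", "args": {"city": "Berlin"}}'

def broken_router(request, error):
    return "not json"

def rule_fallback(request):
    return '{"tool": "get_weather", "args": {"city": "Tokyo"}}'


log = []
results = [
    run_agent("What's the weather in Tokyo?", good_router, rule_fallback, log),
    run_agent("Pack for Berlin?", flaky_router, rule_fallback, log),
    run_agent("Weather, please", broken_router, rule_fallback, log),
]
print(Counter(r["recovery"] for r in results))
```

Passing the previous validation error back into the router is what distinguishes a repair-retry from a blind retry; a real router would fold that error into the prompt.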
toolsmith/
schema.py # the fixed 8-tool toolbox + JSON validator (one source of truth)
data/build.py # deterministic dataset; TRAIN/TEST templates are DISJOINT + a hard split
train.py # PEFT LoRA SFT via TRL SFTTrainer (MPS), saves adapter + loss.png
router.py # load base (+adapter) and turn a request into a tool-call string
eval.py # base vs tuned, code-graded per bucket -> report.md + chart.png + csv
grade.py # exact tool/args grading (no LLM judge needed for routing)
mcp_server.py # FastMCP: route_to_tool + 8 mock tools
agent.py # validate / repair-retry / frontier-fallback loop + JSONL logging
logs_report.py # summarize agent runs
data/ # committed train/test jsonl
artifacts/ # committed: adapter/, loss.png, eval_report.md, eval_chart.png, eval.csv
logs/ # committed real agent traces
tests/ # pytest (grading + agent recovery), model-free
Stated plainly — knowing the limits is part of the work:
- Teaching-scale: 0.5B model · LoRA (PEFT) · SFT-only — an adapter (~35 MB), not a full or from-scratch fine-tune, not algorithm research, not large-scale/distributed training.
- Synthetic data: 243 train / 97 test are templated (though TRAIN/TEST templates are disjoint + a hard split, and the 12 hand-written cases above show real generalization). Real user traffic is messier; the honest free-form fully_correct is 58% vs 74% templated.
- Mock tools: the 8 tool bodies are stubs; the contribution is the routing model + eval + MCP serving + agent loop, not the tools.
- Single base model, no judge in the headline metric (routing has checkable ground truth, so it's code-graded; the optional LLM-judge column needs a key).
- Next steps if taken further: train on real (de-identified) request logs, add tool-arg-type coercion in the agent, compare LoRA ranks / a 1.5B base, add function-calling-format export (OpenAI/Anthropic tool schemas), and a serving latency benchmark.
- Every number here comes from an actual local run, regenerable (fixed seed; requirements-train.txt pins the exact stack). No placeholder figures.
License: MIT

