Tightwad

Mixed-vendor GPU inference cluster manager with speculative decoding proxy. Pools CUDA and ROCm GPUs across machines using llama.cpp RPC, and accelerates inference via application-layer speculative decoding across network-separated servers.

AI Agents: See AGENTS.md for a complete setup playbook — hardware assessment, decision tree, step-by-step config, model selection guide, and troubleshooting. Designed for Claude, ChatGPT, Copilot, OpenClaw, and any AI assistant helping users deploy Tightwad.

How It Works in 10 Seconds

YOUR HARDWARE (any mix works)                         TIGHTWAD
                                                         │
  RTX 4070 Ti Super (16GB, NVIDIA) ──────────┐           │
  RTX 3060 (12GB, NVIDIA) ───────────────────┤           │
  RTX 2070 (8GB, NVIDIA) ────────────────────┤  cluster  │   ┌───────────────┐
  GTX 770 (2GB, yes, really) ────────────────┤──────────►│──►│ OpenAI API    │
  RX 7900 XTX (24GB, AMD!) ──────────────────┤  yaml     │   │ localhost:8088│
  Old Xeon workstation (CPU only) ───────────┤           │   └───────────────┘
  Your laptop (M2, CPU draft) ───────────────┘           │
                                                         │
  CUDA ✓   ROCm ✓   CPU ✓   Mixed ✓            One endpoint.

It's not 2 matching GPUs. It's your entire junk drawer of compute unified into one API. That dusty 770 in your closet? Put it to work.

  Without Tightwad: big model generates every token, one at a time
  With Tightwad:    big model only works on the tokens it disagrees with
  Output quality:   IDENTICAL (greedy) / EQUIVALENT (sampling)
  Speed:            Up to ~1.9x faster in the benchmarks below

The small model is fast but sometimes wrong. The big model is slow but always right. Tightwad uses the small model to do most of the work and the big model to catch mistakes. This pays off because catching mistakes is cheap: verification is one batch operation, not N serial ones.

What Does This Look Like as a User?

You change nothing about your workflow. Tightwad is invisible.

	Before	After
Your chat app	Open WebUI, ChatBot UI, etc.	Same app, no changes
Points at	`http://192.168.1.10:11434` (Ollama on one machine)	`http://192.168.1.10:8088` (Tightwad proxy)
Model you talk to	Qwen3-32B	Qwen3-32B (same model, same output)
What you see	Normal chat responses	Normal chat responses, just faster
The small model	Doesn't exist	Hidden — drafting on a different machine entirely
Other machines	Idle, wasted	RTX 2070, old Xeon, laptop — all contributing

That's it. One URL change. Same UI, same model, same quality. Tightwad handles everything behind the scenes:

1. Your chat app sends a message to Tightwad's port.
2. Behind the curtain, a small model quickly predicts the next several tokens.
3. The big model verifies them all in one shot (instead of generating one at a time).
4. Tightwad streams the verified tokens back to your chat app.
5. You see the response faster, and it's identical (greedy decoding) or statistically equivalent (sampling) to what the big model would have produced alone.

The small model is like autocomplete on your phone — it suggests, the big model accepts or corrects. You only ever see the final, verified output.

Four Modes

1. Speculative Decoding Proxy — Draft + Verify across machines

A fast small model (e.g., 1.7B on any CPU or cheap GPU) drafts candidate tokens, a large model (e.g., 32B-72B) verifies them in batch. Output quality is equivalent to running the large model alone, but up to 2x faster because batch verification is much cheaper than autoregressive generation. Network traffic: bytes (token IDs only).

2. RPC Cluster — Pool GPUs into one endpoint

Combine GPUs from different machines and vendors into a single OpenAI-compatible API. The coordinator distributes model layers across local and remote GPUs. Use this when a model doesn't fit on any single machine.

Note: The coordinator machine needs enough system RAM for the full model file (not just its GPU share). llama.cpp mmaps the entire GGUF before distributing tensors to workers. A 70B Q4_K_M (~40GB) needs ~44GB RAM on the coordinator.
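
The 70B example in the note implies roughly 10% headroom on top of the GGUF file size (~40GB file, ~44GB RAM). A rough sizing helper, assuming that ratio carries over to other models; the `headroom` factor is inferred from that one example and is not a Tightwad setting:

```python
def coordinator_ram_gb(gguf_size_gb: float, headroom: float = 0.10) -> float:
    """Estimate the system RAM the coordinator needs to mmap a GGUF file.

    headroom is an assumed margin inferred from the 70B example
    (~40 GB file -> ~44 GB RAM); it is not read from Tightwad itself.
    """
    return gguf_size_gb * (1.0 + headroom)


# 70B Q4_K_M from the note above
print(f"{coordinator_ram_gb(40):.0f} GB")
```

Treat the result as a lower bound: the OS and any other processes on the coordinator need memory too.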

3. Combined Mode — Speculation Over a Pool (the killer feature)

When a model doesn't fit on one machine, pool the GPUs and speculate on top. On its own, the RPC pool generates slowly (3 tok/s over WiFi) because every token costs a full RPC round-trip. Batch verification amortizes that overhead: 32 tokens per round-trip instead of 1. Result: 1.8x speedup over pool-only, which makes models that don't fit on one machine actually usable.

ANY junk hardware (P400 2GB, GTX 770, laptop CPU, Raspberry Pi)
    │ runs a small 1-2B draft model (~30 tok/s)
    │ sends token IDs (bytes, not megabytes)
    ▼
Tightwad Proxy (:8088)
    │ sends draft to pool for BATCH verification
    ▼
RPC GPU Pool (any mix: CUDA + ROCm + Metal, running 70B)
    │ verifies 32 tokens in ONE forward pass
    │ 1 RPC round-trip for 32 tokens instead of 32 round-trips
    ▼
4.1 tok/s instead of 2.2 tok/s, and the 70B model fits nowhere else

4. Quality Gate — CPU Fleet Drafts, GPU Verifies

A fleet of cheap agents (CPUs, small GPUs) generate full responses using small models. A single powerful GPU reviews each output, then approves, corrects, or rejects it. 60-80% of responses pass unchanged, so the GPU only has to rework the hard 20-40%. Run tightwad gate start to use this mode.

The draft model needs: (1) same model family as the target, (2) llamacpp backend (not Ollama) for prompt-append verification, (3) any hardware that can run a 1-2B model. That's it.

Client (OpenAI API)
        │
        ▼
┌──────────────────────────────┐
│   Tightwad Proxy (:8088)     │  Python async server
│   Speculation Loop:          │
│   1. Draft N tokens (auto)   │──► Draft: Qwen3-8B (fast, local)
│   2. Verify batch            │──► Target: Qwen3-32B (accurate, local or API)
│   3. Accept/reject           │
│   4. Stream to client        │
└──────────────────────────────┘

Why not just use RPC? RPC ships 100-300 MB of tensor data per step over the network. The speculative proxy ships token IDs (bytes). For models that fit on a single machine's VRAM, speculation is dramatically faster.
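
To put the "bytes vs. megabytes" comparison in numbers, here is a back-of-the-envelope sketch. It assumes token IDs travel as 4-byte integers and ignores HTTP and JSON framing, so the real proxy payload will be somewhat larger; the 100 MB figure is the low end of the range quoted above.

```python
BYTES_PER_TOKEN_ID = 4  # assumption: each token ID is sent as a 32-bit integer


def draft_round_bytes(draft_tokens: int) -> int:
    """Payload size for one round of drafted token IDs (framing ignored)."""
    return draft_tokens * BYTES_PER_TOKEN_ID


def rpc_to_proxy_ratio(rpc_mb_per_step: float, draft_tokens: int) -> float:
    """How many times more data one RPC step moves than one draft round."""
    return rpc_mb_per_step * 1_000_000 / draft_round_bytes(draft_tokens)


print(draft_round_bytes(32))        # 128
print(rpc_to_proxy_ratio(100, 32))  # 781250.0
```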

Docker Quick Start

The fastest way to get a speculative decoding proxy running. No config files needed — just set your draft and target server URLs:

# One-liner with Docker
docker run --rm --network host \
  -e TIGHTWAD_DRAFT_URL=http://192.168.1.10:11434 \
  -e TIGHTWAD_DRAFT_MODEL=qwen3:8b \
  -e TIGHTWAD_TARGET_URL=http://192.168.1.20:11434 \
  -e TIGHTWAD_TARGET_MODEL=qwen3:32b \
  ghcr.io/youngharold/tightwad

# Or with Docker Compose (edit docker-compose.yml with your IPs first)
docker compose up
# Logs persist in ./logs/ across restarts

Mac/Docker Desktop: Replace --network host with -p 8088:8088 and use host.docker.internal instead of LAN IPs.

Docker Compose includes a healthcheck (/v1/models every 10s) and mounts ./logs/ for persistent proxy logs.

All TIGHTWAD_* env vars:

Env Var	Default	Description
`TIGHTWAD_DRAFT_URL`	required	Draft server URL
`TIGHTWAD_DRAFT_MODEL`	`draft`	Draft model name
`TIGHTWAD_DRAFT_BACKEND`	`ollama`	`ollama` or `llamacpp`
`TIGHTWAD_TARGET_URL`	required	Target server URL
`TIGHTWAD_TARGET_MODEL`	`target`	Target model name
`TIGHTWAD_TARGET_BACKEND`	`ollama`	`ollama` or `llamacpp`
`TIGHTWAD_PORT`	`8088`	Proxy listen port
`TIGHTWAD_HOST`	`0.0.0.0`	Proxy bind host
`TIGHTWAD_MAX_DRAFT_TOKENS`	`8`	Tokens per draft round (or `auto` for adaptive tuning)
`TIGHTWAD_PROXY_TOKEN`	(unset)	Bearer token for proxy API auth (recommended)
`TIGHTWAD_ALLOW_PRIVATE_UPSTREAM`	`true`	SSRF: `false` = block private/LAN upstream IPs
`TIGHTWAD_MAX_TOKENS_LIMIT`	`16384`	Hard cap on `max_tokens` in requests — rejects higher values with 400 (DoS mitigation)
`TIGHTWAD_MAX_BODY_SIZE`	`10485760`	Max request body size in bytes (10 MB) — rejects oversized payloads with 413 before buffering
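
A client can check a request against the two limits before sending it. The sketch below mirrors the documented defaults and status codes; it is a client-side illustration, not the proxy's own code, and the function name `preflight` is made up for this example.

```python
import json
from typing import Optional

MAX_TOKENS_LIMIT = 16384   # TIGHTWAD_MAX_TOKENS_LIMIT default
MAX_BODY_SIZE = 10485760   # TIGHTWAD_MAX_BODY_SIZE default (10 MB)


def preflight(payload: dict) -> Optional[int]:
    """Return the HTTP status the documented limits would trigger, or None."""
    body = json.dumps(payload).encode("utf-8")
    if len(body) > MAX_BODY_SIZE:
        return 413  # oversized body is rejected first, before buffering
    if payload.get("max_tokens", 0) > MAX_TOKENS_LIMIT:
        return 400  # max_tokens above the hard cap
    return None


print(preflight({"messages": [{"role": "user", "content": "Hello"}], "max_tokens": 50}))
print(preflight({"messages": [], "max_tokens": 20000}))
```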

Proxy Authentication

The proxy API binds to 0.0.0.0:8088 by default, making it reachable by any device on the LAN (or the internet if ports are forwarded). Set a token to prevent unauthorized use of your GPU compute.

Via environment variable (Docker / Docker Compose):

docker run --rm --network host \
  -e TIGHTWAD_DRAFT_URL=http://192.168.1.10:11434 \
  -e TIGHTWAD_TARGET_URL=http://192.168.1.20:11434 \
  -e TIGHTWAD_PROXY_TOKEN=my-secret-token \
  ghcr.io/youngharold/tightwad

Via cluster.yaml:

proxy:
  host: 0.0.0.0
  port: 8088
  auth_token: "${TIGHTWAD_PROXY_TOKEN}"   # or paste the token directly
  draft:
    url: http://127.0.0.1:8081
    model_name: qwen3-1.7b
  target:
    url: http://192.168.1.100:8090
    model_name: qwen3-32b

Making authenticated requests:

curl http://localhost:8088/v1/chat/completions \
  -H "Authorization: Bearer my-secret-token" \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello"}], "max_tokens": 50}'

If no token is configured, the proxy operates in open (unauthenticated) mode for backward compatibility, but it logs a security warning on startup. TIGHTWAD_TOKEN (the swarm seeder token) is also accepted as a fallback alias.
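
The same authenticated request can be built from Python with only the standard library. This is a minimal sketch: `build_chat_request` is a helper name invented here, and the token is the placeholder from the curl example.

```python
import json
import urllib.request


def build_chat_request(base_url: str, token: str, prompt: str,
                       max_tokens: int = 50) -> urllib.request.Request:
    """Build a POST to /v1/chat/completions carrying the bearer token."""
    body = json.dumps({
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=body,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


req = build_chat_request("http://localhost:8088", "my-secret-token", "Hello")

# To send it against a running proxy:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp))
```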

SSRF Protection (upstream URL validation)

Tightwad validates all upstream URLs before opening connections (audit ref: SEC-5).

What is always enforced:

Scheme allowlist — only http:// and https:// are accepted. file://, gopher://, ftp://, and every other scheme are rejected with a clear error.

What is enforced when allow_private_upstream: false:

Private/internal IP blocking — requests to RFC-1918 ranges (10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16), loopback (127.0.0.0/8), link-local / IMDS (169.254.0.0/16), and IPv6 equivalents (::1, fc00::/7, fe80::/10) are blocked.
DNS-rebinding protection — hostnames are resolved via DNS and the resolved IPs are also checked, so a domain that resolves to an internal address is caught even if the URL looks public.

Homelab default (allow_private_upstream: true):

Because Tightwad's most common use case targets LAN servers, the private-IP check defaults to allowed. The scheme check is still always enforced.

proxy:
  # Default: LAN/loopback targets are fine (common homelab setup)
  allow_private_upstream: true   # omit or set true for home/LAN use

  # Strict mode: useful in cloud or multi-tenant environments
  # allow_private_upstream: false

  draft:
    url: http://192.168.1.101:11434   # OK in default mode
    model_name: qwen3-1.7b
  target:
    url: http://192.168.1.100:8080
    model_name: qwen3-32b

Via environment variable:

# Strict mode (block private/internal targets)
export TIGHTWAD_ALLOW_PRIVATE_UPSTREAM=false

# Homelab mode (default)
export TIGHTWAD_ALLOW_PRIVATE_UPSTREAM=true   # or omit entirely

If a URL fails validation, the proxy refuses to start and prints an error explaining why and how to fix it.
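
The checks described in this section can be approximated in a few lines of Python. This is an illustrative sketch, not Tightwad's validator: `check_upstream` and its `resolve` parameter are names invented here, and the standard library's `ipaddress` flags stand in for the explicit range list above.

```python
import ipaddress
import socket
from typing import Callable, List, Optional
from urllib.parse import urlsplit

ALLOWED_SCHEMES = {"http", "https"}


def _system_resolve(host: str) -> List[str]:
    """Resolve a hostname to its IP addresses using the system resolver."""
    return [info[4][0] for info in socket.getaddrinfo(host, None)]


def _is_internal(addr: str) -> bool:
    ip = ipaddress.ip_address(addr.split("%")[0])  # drop any IPv6 zone suffix
    return ip.is_private or ip.is_loopback or ip.is_link_local


def check_upstream(url: str, allow_private: bool = True,
                   resolve: Callable[[str], List[str]] = _system_resolve) -> Optional[str]:
    """Return an error message if the URL should be rejected, else None."""
    parts = urlsplit(url)
    if parts.scheme not in ALLOWED_SCHEMES:  # always enforced
        return f"scheme {parts.scheme!r} is not allowed"
    if not parts.hostname:
        return "URL has no host"
    if allow_private:  # homelab default: skip the address checks
        return None
    try:
        addrs = [str(ipaddress.ip_address(parts.hostname))]
    except ValueError:
        addrs = resolve(parts.hostname)  # hostname: check what it resolves to
    for addr in addrs:
        if _is_internal(addr):
            return f"{parts.hostname} maps to internal address {addr}"
    return None


print(check_upstream("file:///etc/passwd"))
print(check_upstream("http://192.168.1.10:11434", allow_private=False))
```

Passing the resolver in as a parameter keeps the sketch testable without network access.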

Quick Start

# Install
pip install tightwad

# Auto-discover LAN servers and generate config
tightwad init

# Or edit topology manually
vim configs/cluster.yaml

# Verify your setup
tightwad doctor        # check config, binaries, network, versions
tightwad doctor --fix  # show fix suggestions for any issues

Speculative Decoding Proxy

# Start the proxy (draft + target servers must be running)
tightwad proxy start

# Check health and acceptance rate stats
tightwad proxy status

# Test it
curl http://localhost:8088/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello"}], "max_tokens": 50}'

# Detailed stats
curl http://localhost:8088/v1/tightwad/status

# Stop
tightwad proxy stop

RPC Cluster

# Check cluster status
tightwad status

# Start (after rpc-server instances are running on workers)
tightwad start

# View coordinator logs
tightwad logs              # last 50 lines
tightwad logs -f           # live tail

# Hot-swap to a different model (RPC workers persist)
tightwad swap deepseek-r1-70b

# Benchmark
tightwad benchmark

# Stop
tightwad stop

Homelab Recipe

A realistic three-machine setup you can reproduce in ~30 minutes. Start with two machines, add more anytime. The cluster grows.

Hardware:

Machine A (target): Desktop with RTX 4070 Ti Super + RTX 3060 (28GB VRAM combined) — the workhorse
Machine B (draft): Old gaming PC with RTX 2070 (8GB VRAM) — that box you almost sold
Machine C (CPU draft): Server or workstation with no GPU — CPU-only, still contributes

Expected results: 58–64% average token acceptance, up to 88% on reasoning tasks. Machine C adds throughput even without a GPU.

Replace all 192.168.1.x addresses below with your actual machine IPs. Find them with ip addr (Linux), ipconfig (Windows), or ipconfig getifaddr en0 (macOS).

Step 1 — On Machine B: Start the draft model

# Machine B — RTX 2070 (8GB VRAM)
ollama run qwen3:8b
# Confirm it works:
ollama ps
# Should show: qwen3:8b running

Ollama listens on 127.0.0.1:11434 by default, so other machines cannot reach it. Set OLLAMA_HOST=0.0.0.0 on the Ollama server process (before it starts) to expose it on the LAN.

Step 2 — On Machine C: Start a CPU draft model

# Machine C — CPU only, no GPU needed
# llama-server with a tiny model is ideal for CPU drafting
llama-server -m qwen3-1.7b-q4_k_m.gguf --port 8081 --host 0.0.0.0
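# Or with Ollama. OLLAMA_HOST must be set on the server process, not on "ollama run".
# Skip the next line if Ollama already runs as a service with OLLAMA_HOST configured.
OLLAMA_HOST=0.0.0.0 ollama serve &
ollama run qwen3:1.7b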

Even at 15–30 tok/s on CPU, Machine C reduces load on Machine B and adds redundancy.

Step 3 — On Machine A: Start the target model

# Machine A: RTX 4070 Ti Super + RTX 3060 (28GB combined, both GPUs in the same machine)
ollama run qwen3:32b
# Confirm:
ollama ps
# Should show: qwen3:32b running

Same note: ensure Ollama is accessible on the network (OLLAMA_HOST=0.0.0.0).

Step 4 — On whichever machine runs the proxy: Install Tightwad

pip install tightwad

Step 5 — Generate config

Option A: Auto-discover with tightwad init (recommended)

tightwad init
# Scans your LAN for Ollama and llama-server instances
# Shows a table of discovered servers
# You pick target (big model) and draft (small model) by number
# Writes configs/cluster.yaml automatically

Example output:

Scanning LAN for inference servers...

          Discovered Servers (192.168.1.0/24)
┏━━━┳━━━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━┓
┃ # ┃ Host          ┃ Port  ┃ Backend ┃ Models     ┃ Status  ┃
┡━━━╇━━━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━┩
│ 1 │ 192.168.1.10  │ 11434 │ ollama  │ qwen3:32b  │ healthy │
│ 2 │ 192.168.1.20  │ 11434 │ ollama  │ qwen3:8b   │ healthy │
└───┴───────────────┴───────┴─────────┴────────────┴─────────┘

Select TARGET server (big model): 1
Select DRAFT server (small fast model): 2
Write to configs/cluster.yaml? [Y/n] y

If your subnet isn't auto-detected correctly, specify it manually:

tightwad init --subnet 192.168.1.0/24

Option B: Manual config

Edit configs/cluster.yaml directly:

proxy:
  host: 0.0.0.0
  port: 8088
  max_draft_tokens: auto            # auto-tunes based on acceptance rate
  fallback_on_draft_failure: true
  draft:
    url: http://192.168.1.20:11434    # Machine B (RTX 2070) — replace with your IP
    model_name: qwen3:8b
    backend: ollama
  target:
    url: http://192.168.1.10:11434    # Machine A (4070 Ti + 3060) — replace with your IP
    model_name: qwen3:32b
    backend: ollama

Replace all IPs with your actual machine IPs (ip addr on Linux, ipconfig on Windows).

Option C: Docker (no config file at all)

Skip config entirely and use environment variables:

docker run --rm --network host \
  -e TIGHTWAD_DRAFT_URL=http://192.168.1.20:11434 \
  -e TIGHTWAD_DRAFT_MODEL=qwen3:8b \
  -e TIGHTWAD_TARGET_URL=http://192.168.1.10:11434 \
  -e TIGHTWAD_TARGET_MODEL=qwen3:32b \
  ghcr.io/youngharold/tightwad

Adding Machine C later (CPU draft as fallback or parallel drafter):

  # Add to configs/cluster.yaml:
  draft_fallback:
    url: http://192.168.1.30:8081     # Machine C (CPU only) — replace with your IP
    model_name: qwen3:1.7b
    backend: llamacpp

Step 6 — Start the proxy

tightwad proxy start
# Expected output:
# ✓ Draft model healthy  (qwen3:8b @ 192.168.1.20:11434)   — Machine B
# ✓ Target model healthy (qwen3:32b @ 192.168.1.10:11434)  — Machine A
# ✓ Proxy listening on http://localhost:8088

Step 7 — Test it

# Basic test
curl http://localhost:8088/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "What is 17 * 24?"}],
    "max_tokens": 100
  }'

# Check acceptance rate stats
tightwad proxy status
# Expected: Acceptance rate: ~58% | Rounds: N | Tokens saved: N

# Detailed stats
curl http://localhost:8088/v1/tightwad/status

Step 8 — Point your app at it

Any OpenAI-compatible client works. Just change the base URL:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8088/v1",
    api_key="not-needed"  # ignored in open mode; use your TIGHTWAD_PROXY_TOKEN if proxy auth is enabled
)

response = client.chat.completions.create(
    model="tightwad",
    messages=[{"role": "user", "content": "Explain recursion"}]
)

Acceptance rates you can expect with this setup:

Task	Acceptance Rate
Reasoning / math	~88%
Code generation	~73%
Factual Q&A	~52%
Creative writing	~34%
Average	~58%

The cluster grows. Start with Machines A + B. Add Machine C when you're ready. Add a fourth machine (that GTX 770 you haven't thrown out yet) whenever. Each new node contributes without disrupting the existing setup — just add it to cluster.yaml or re-run tightwad init. Tightwad doesn't care what generation or vendor the hardware is from. CUDA, ROCm, Metal, CPU-only — it all pools together. The only thing that matters is that your draft and target models share the same family.

Note on the bigger picture: With Qwen3-8B drafting for Qwen3.5-397B (via API), we've seen 80% acceptance after whitespace normalization — meaning 4 in 5 tokens come from your local GPU, not the cloud. Reasoning tasks hit 88%. The bigger the gap between draft and target quality, the more you save.

Configuration

Edit configs/cluster.yaml (or generate one with tightwad init):

# ⚠️  Replace all IPs and model paths below with your own values.
# Find your IPs: ip addr (Linux), ipconfig (Windows), ipconfig getifaddr en0 (macOS)

# Speculative decoding proxy
proxy:
  host: 0.0.0.0
  port: 8088
  max_draft_tokens: auto            # auto-tunes based on acceptance rate (or pin at 32)
  fallback_on_draft_failure: true
  max_tokens_limit: 16384           # Hard cap on max_tokens per request (DoS mitigation, CQ-1)
  max_body_size: 10485760           # Max request body bytes — 10 MB (memory-exhaustion mitigation, CQ-5)
  draft:
    url: http://192.168.1.50:8081    # ← your draft machine's IP + port
    model_name: qwen3-8b
    backend: llamacpp                  # or "ollama"
  target:
    url: http://192.168.1.100:8080   # ← your target machine's IP + port
    model_name: qwen3-32b
    backend: llamacpp

# RPC cluster (optional, for tensor-parallel across machines)
# Pool GPUs from multiple machines into a single model
coordinator:
  host: 0.0.0.0
  port: 8090
  backend: cuda          # "cuda" (NVIDIA) or "hip" (AMD/ROCm)
  gpus:                  # Local GPUs on the coordinator machine
    - name: "RTX 4070 Ti Super"
      vram_gb: 16
    - name: "RTX 3060"
      vram_gb: 12

workers:                 # Remote machines running rpc-server
  - host: 192.168.1.20   # ← your worker's IP
    gpus:
      - name: "RTX 2070"
        vram_gb: 8
        rpc_port: 50052
  - host: 192.168.1.30   # ← your worker's IP
    gpus:
      - name: "Apple M2 Metal"
        vram_gb: 11       # Use recommendedMaxWorkingSetSize, not total unified memory
        rpc_port: 50052

models:
  qwen3-32b:
    path: /models/Qwen3-32B-Q4_K_M.gguf  # ← absolute path on coordinator machine
    ctx_size: 8192
    flash_attn: true              # boolean (legacy string values "on"/"off" also accepted)
    default: true

Backend Presets & Passthrough

Tightwad auto-injects known-good settings per backend. For example, ROCm multi-GPU setups automatically get HSA_ENABLE_SDMA=0 and GPU_MAX_HW_QUEUES=1 to prevent SDMA hangs — no manual configuration needed.

You can override presets or add your own with extra_args and env in the coordinator section:

coordinator:
  backend: hip
  gpus:
    - name: "7900 XTX #0"
      vram_gb: 24
    - name: "7900 XTX #1"
      vram_gb: 24
  # Additional CLI args passed to llama-server
  extra_args: ["--no-mmap", "--no-warmup"]
  # Environment variables (override auto-injected presets)
  env:
    HSA_ENABLE_SDMA: "1"  # override the default preset
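
The override behavior amounts to "presets first, user values win". A minimal sketch of that merge, assuming one preset dictionary per backend (the `BACKEND_ENV_PRESETS` table and `effective_env` function are illustrative names; only the two ROCm values come from the text above):

```python
# Assumed shape: one dict of environment presets per backend.
BACKEND_ENV_PRESETS = {
    "hip": {"HSA_ENABLE_SDMA": "0", "GPU_MAX_HW_QUEUES": "1"},
    "cuda": {},
}


def effective_env(backend: str, user_env: dict) -> dict:
    """Start from the backend preset, then let user-supplied values override."""
    merged = dict(BACKEND_ENV_PRESETS.get(backend, {}))
    merged.update(user_env)
    return merged


print(effective_env("hip", {"HSA_ENABLE_SDMA": "1"}))
# {'HSA_ENABLE_SDMA': '1', 'GPU_MAX_HW_QUEUES': '1'}
```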

Server Backends

The proxy supports two backend types for draft and target servers:

Backend	Endpoint	Best for
`ollama`	`/api/generate` (raw mode)	Quick setup, any Ollama instance
`llamacpp`	`/v1/completions` (with logprobs)	Best performance, full logprobs support

How Speculative Decoding Works

1. Draft: The small model generates N candidate tokens (fast, ~100+ tok/s).
2. Verify: The large model evaluates all N tokens in a single forward pass.
3. Accept/reject: Keep the drafted tokens up to the first disagreement, then take the large model's token at that position.
4. Repeat until done.

The output is equivalent to running the large model alone — the small model just proposes shortcuts.
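
The loop above can be sketched with toy models. This is a greedy-decoding illustration only, not the proxy's code: each "model" is a plain function from a token context to its next token, and verification runs position by position here, whereas a real target server checks all drafted positions in one batched forward pass. All names are invented for the example.

```python
from typing import Callable, List

Model = Callable[[List[int]], int]  # maps a token context to its greedy next token


def speculative_generate(draft: Model, target: Model, prompt: List[int],
                         max_new: int, k: int) -> List[int]:
    out = list(prompt)
    while len(out) - len(prompt) < max_new:
        # 1. Draft k candidate tokens with the small model.
        ctx, proposed = list(out), []
        for _ in range(k):
            tok = draft(ctx)
            proposed.append(tok)
            ctx.append(tok)
        # 2. Verify each drafted position against the target's greedy choice.
        ctx = list(out)
        for tok in proposed:
            expected = target(ctx)
            if tok != expected:
                out.append(expected)  # 3. first disagreement: take the target's token
                break
            out.append(tok)
            ctx.append(tok)
        else:
            out.append(target(ctx))  # every draft accepted: target adds one more token
    return out[: len(prompt) + max_new]


def target_only(target: Model, prompt: List[int], max_new: int) -> List[int]:
    out = list(prompt)
    for _ in range(max_new):
        out.append(target(out))
    return out


def toy_target(ctx: List[int]) -> int:
    return (sum(ctx) * 31 + len(ctx)) % 50


def toy_draft(ctx: List[int]) -> int:
    guess = toy_target(ctx)
    return guess if len(ctx) % 4 else (guess + 1) % 50  # wrong at every 4th position


spec = speculative_generate(toy_draft, toy_target, [1, 2, 3], max_new=20, k=8)
print(spec == target_only(toy_target, [1, 2, 3], 20))  # True
```

Every token that reaches the output is one the target would have chosen itself, which is why the greedy result matches the target-only run exactly.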

Benchmark Results

Wall-Clock Speedup (Qwen3-8B → Qwen3-32B, cross-machine llama-server)

Draft on RTX 2070 (8GB), target on RTX 4070 Ti Super + RTX 3060 (28GB). Both via llama-server with prompt-append verification.

Prompt	Baseline	Speculative	Speedup
Capital of France	1.17s	0.90s	1.30x
Thermodynamics	12.73s	9.09s	1.40x
Prime checker	12.76s	10.15s	1.28x
Average speed	13.24s	10.95s	1.21x
TCP vs UDP	5.58s	4.88s	1.14x
Total	45.43s	35.96s	1.27x

1.27x overall speedup with max_draft_tokens: 32 (50 rounds, 31.7 tokens/round, 100% acceptance).

Tuning `max_draft_tokens`

Setting	Rounds	Tok/Round	Overall Speedup
8	96	8.8	0.63x (slower)
32	50	31.7	1.27x
64	16	56.5	1.21x

The sweet spot is 32 draft tokens — fewer rounds reduce HTTP overhead, but going too high (64) adds draft latency that outweighs the savings.

Acceptance Rate Details (logprobs verification)

Metric	Value
Acceptance Rate	73.5%
Effective tokens/round	6.6 (at max_draft_tokens=8)
Total rounds	87
Drafted tokens	671
Accepted tokens	493
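
The acceptance rate in this table follows directly from the raw counts. A quick check using only the numbers above:

```python
drafted, accepted, rounds = 671, 493, 87  # counts from the table above

acceptance_rate = accepted / drafted
accepted_per_round = accepted / rounds

print(f"{acceptance_rate:.1%}")     # 73.5%
print(f"{accepted_per_round:.2f}")  # 5.67
```

Adding the one token the target contributes in each round would give about 6.7 tokens per round, close to the reported 6.6. That accounting is an assumption of this sketch, not something the benchmark output states.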

Text-Match Benchmarks (Ollama, for acceptance rate comparison)

Same-family (Qwen3-8B → Qwen3-32B, local Ollama):

Prompt Type	Acceptance Rate	Rounds	Notes
Reasoning	89%	32	Highest — deterministic math answers
Code	76%	34	High — structured syntax overlap
Factual	73%	16	Strong agreement on facts
List	42%	40	Varied phrasing causes divergence
Creative	39%	6	Lowest — many valid outputs
Average	63.8%	25.6

Cloud API Benchmarks (OpenRouter)

Draft	Target	Size Gap	Acceptance
Llama 3.1 8B	Llama 3.1 405B	50x	18.9%
Qwen3 1.7B	Qwen3.5 397B	233x	10.8%
Llama 3.1 8B	Llama 3.1 70B	9x	9.9%
Qwen3 1.7B	Qwen3 235B	138x	6.6%
Qwen3 8B	Llama 3.3 70B	cross-family	~3%

Important: Over cloud APIs, the per-round network latency (~3-8s per API call) makes speculative decoding slower than baseline despite positive acceptance rates. Spec decoding shines when both models are local or very low-latency.

Key findings:

Same-family drafting is critical — cross-family drops to ~3% regardless of model size
Draft model size matters — the 1.7B is too small to predict 200B+ target phrasing
Larger targets don't always mean lower acceptance (405B beat 70B with the same 8B draft)
Cloud API latency negates wall-clock speedup even with decent acceptance rates

CPU Draft Results (Qwen3-1.7B CPU → Qwen3-32B GPU)

Draft Host	Draft Speed	Acceptance	Wall-Clock Speedup
M4 Mac CPU (llama-server)	32.8 tok/s	68%	0.80x
Unraid CPU (Ollama, text-match)	14.9 tok/s	68%	0.14x

CPU drafting with a 1.7B model works but doesn't achieve speedup at max_draft_tokens=8 due to HTTP round-trip overhead. At max_draft_tokens=32, CPU drafting achieves significant speedup (see Combined Mode below).

Combined Mode: Speculation Over RPC Pool

The killer feature. When a model is too large for any single machine, pool GPUs via RPC and use speculative decoding to overcome RPC's per-token latency. The draft model runs on any junk hardware (CPU, 2GB GPU) and the pooled target verifies 32 tokens per batch instead of generating one at a time.

Qwen3-32B (4-GPU pool, Qwen3-1.7B draft on M4 CPU):

Mode	Speed	Notes
RPC pool direct (autoregressive)	3.0 tok/s	Each token = full RPC round-trip to all workers
RPC pool + speculation	5.4 tok/s	32 tokens verified per batch, 100% acceptance
Speedup	1.8x

Llama 3.3 70B (4-GPU pool, Llama 3.1 8B draft on M4 Metal):

Mode	Tokens	Time	Speed
RPC pool direct (autoregressive)	512	231s	2.2 tok/s
RPC pool + speculation	519	127s	4.1 tok/s
Speedup			1.86x

100% acceptance rate, 33 tokens/round. The 70B model doesn't fit on any single machine — it's distributed across 4 GPUs (4070 Ti Super + 3060 + 2070 + M2 Metal = 52GB VRAM) over WiFi. Without speculation: painfully slow. With speculation: usable.

Critical lesson: draft and target MUST be the same model family. Llama 3.2 3B → Llama 3.3 70B got 1.6% acceptance (10x slower than no speculation) despite sharing a tokenizer. Llama 3.1 8B → Llama 3.3 70B gets 100% acceptance because they share the same architecture. Tightwad automatically detects model families at proxy startup and in tightwad doctor, warning loudly on mismatch.

Why this works:

  Pool autoregressive: 1 token → full RPC round-trip → 1 token → full RPC round-trip → ...
                       2-3 tok/s (network latency per token)

  Pool + speculation:  Draft 32 tokens (local GPU, fast, no network)
                       → Verify 32 tokens in ONE batch (one RPC round-trip for 32 tokens)
                       → 4-5 tok/s (network latency amortized over 32 tokens)

This means any model that fits across your pooled GPUs is usable — even over WiFi. The draft model just needs to be from the same model family as the target and small enough to run on your local hardware.
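
The amortization shows up when the Qwen3-32B pool numbers are converted into time per 32 tokens. This sketch uses only the reported throughputs (3.0 and 5.4 tok/s) and makes no claim about how that time splits between drafting and verification:

```python
def seconds_for_tokens(n_tokens: int, tok_per_s: float) -> float:
    """Wall-clock time to produce n_tokens at a given throughput."""
    return n_tokens / tok_per_s


tokens_per_round = 32
autoregressive_s = seconds_for_tokens(tokens_per_round, 3.0)  # 32 separate RPC round-trips
speculative_s = seconds_for_tokens(tokens_per_round, 5.4)     # one draft + verify round

print(f"{autoregressive_s:.1f}s vs {speculative_s:.1f}s per 32 tokens")  # 10.7s vs 5.9s per 32 tokens
print(f"speedup: {autoregressive_s / speculative_s:.1f}x")               # speedup: 1.8x
```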

RPC Pool Without Speculation (for comparison)

Don't do this over WiFi. RPC tensor-parallelism ships 100-300 MB per inference step.

Setup	Speed
Desktop local only (4070+3060, 32B)	17.0 tok/s
4-GPU RPC pool (4070+3060+2070+M2, 32B)	3.0 tok/s
Same pool + speculation (Qwen3-1.7B draft)	5.4 tok/s
4-GPU RPC pool (4070+3060+2070+M2, 70B)	2.2 tok/s
Same pool + speculation (Llama 3.1 8B draft)	4.1 tok/s

RPC pooling is only useful when the model doesn't fit on one machine. When it does fit locally, don't pool — just use speculation with a remote drafter.
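
That rule of thumb can be written as a small decision helper. This is a simplified sketch: it compares model file size against VRAM only and ignores KV cache and context overhead, and `choose_mode` is a name invented for the example. The 20 GB figure for a 32B Q4 model is an assumed round number; the other sizes come from the examples in this document.

```python
from typing import List


def choose_mode(model_gb: float, machine_vram_gb: List[float]) -> str:
    """Pick a mode from model size and the VRAM available on each machine."""
    if model_gb <= max(machine_vram_gb):
        return "speculation with a remote drafter"  # fits on one machine: don't pool
    if model_gb <= sum(machine_vram_gb):
        return "RPC pool + speculation"             # only fits across machines
    return "does not fit"


# Desktop (16 + 12 GB), RTX 2070 box (8 GB), M2 (11 GB usable)
machines = [28, 8, 11]
print(choose_mode(20, machines))  # speculation with a remote drafter
print(choose_mode(40, machines))  # RPC pool + speculation
```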

Use Cases

Models too big for one machine: Pool GPUs via RPC, then speculate on top — the draft model turns 3 tok/s into 5+ tok/s. A 70B model across 4 consumer GPUs becomes usable
Local multi-GPU: Draft on a consumer GPU ($200), verify on a larger GPU/rig
Cloud cost reduction: Draft locally, verify via cloud API — fewer API calls for the same output quality
CPU draft, GPU verify: Run a tiny model (0.6B-2B) on CPU/RAM, verify on GPU. Turns every idle CPU into usable inference compute
Multi-drafter parallelism: Multiple CPUs each run a draft model in parallel, the GPU target picks the best candidate
Legacy GPU revival: A 12-year-old GPU with 2GB VRAM can run a 1-2B draft model for a 70B+ target — turning e-waste into productive infrastructure
Junk drawer inference: Pool ALL your hardware — CUDA, ROCm, Metal, CPU — into one endpoint. The speculative proxy handles the coordination. No GPU left behind

Swarm Transfer — P2P Model Distribution

When you need to get a 40 GB model onto 5 worker machines, rsync from one source = 200 GB of outbound transfer. Swarm transfer splits the model into 64 MB pieces with SHA256 hashes and lets workers pull from any peer that has pieces — including each other.

rsync (single-source):                swarm (P2P):

  Source ──► Worker 1 (40 GB)           Source ──► Worker 1 ──► Worker 3
  Source ──► Worker 2 (40 GB)           Source ──► Worker 2 ──► Worker 4
  Source ──► Worker 3 (40 GB)           Worker 1 ──► Worker 5
  Source ──► Worker 4 (40 GB)           Worker 2 ──► Worker 5
  Source ──► Worker 5 (40 GB)
  Total: 200 GB from source            Total: ~80 GB from source (peers share the rest)

	rsync (`tightwad distribute`)	swarm (`tightwad swarm`)
Transfer pattern	Single source → each worker	Any peer → any peer
Source bandwidth	O(N × model_size)	O(model_size)
Resume on interrupt	Restart from beginning	Continue from last piece
Integrity	Trust the network	SHA256 per piece
Best for	1-2 workers, small models	3+ workers, large models

Note: Swarm transfer is intended for distributing models across your own cluster. Ensure you have the right to distribute any model files you share. Some models (e.g., Llama 3) require license acceptance on HuggingFace.
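
The piece-and-hash idea can be sketched with the standard library. This illustrates the technique only; the manifest layout (a list of dicts with `index`, `size`, and `sha256` keys) is an assumption for the example, not Tightwad's actual manifest format.

```python
import hashlib
import io
from typing import BinaryIO, Dict, List

PIECE_SIZE = 64 * 1024 * 1024  # 64 MB pieces, as described above


def build_manifest(stream: BinaryIO, piece_size: int = PIECE_SIZE) -> List[Dict]:
    """Split a stream into fixed-size pieces and record a SHA256 per piece."""
    pieces = []
    index = 0
    while True:
        chunk = stream.read(piece_size)
        if not chunk:
            break
        pieces.append({
            "index": index,
            "size": len(chunk),
            "sha256": hashlib.sha256(chunk).hexdigest(),
        })
        index += 1
    return pieces


def verify_piece(chunk: bytes, entry: Dict) -> bool:
    """Check a downloaded piece against its manifest entry."""
    return hashlib.sha256(chunk).hexdigest() == entry["sha256"]


# Tiny demo with 4-byte pieces instead of 64 MB
manifest = build_manifest(io.BytesIO(b"a" * 10), piece_size=4)
print([p["size"] for p in manifest])        # [4, 4, 2]
print(verify_piece(b"aaaa", manifest[0]))   # True
```

Per-piece hashes are what make resume and peer-to-peer pulls safe: a worker can accept a piece from any peer and reject it locally if the hash does not match.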

# On the source machine (--token requires auth, --allowed-ips restricts by subnet)
tightwad manifest create ~/models/Qwen3-32B-Q4_K_M.gguf
tightwad swarm seed ~/models/Qwen3-32B-Q4_K_M.gguf \
  --token mysecret \
  --allowed-ips 192.168.1.0/24

# On each worker (can pull from source + other workers)
tightwad swarm pull ~/models/Qwen3-32B-Q4_K_M.gguf \
  --manifest http://192.168.1.10:9080/manifest \
  --peer http://192.168.1.10:9080 \
  --peer http://192.168.1.20:9080 \
  --token mysecret

# Check progress
tightwad swarm status ~/models/Qwen3-32B-Q4_K_M.gguf

See the Swarm Transfer section above for architecture details on rarest-first piece selection and bitfield tracking.

Why Tightwad?

You've probably heard of the other tools. Here's how Tightwad fits in.

vs vLLM

vLLM is excellent production inference software. It's also primarily CUDA-focused; ROCm support is experimental. If you have an AMD GPU, getting it working takes extra effort. Tightwad pools CUDA and ROCm GPUs on the same model, same endpoint.

vLLM does support speculative decoding, but only within a single machine. Tightwad's proxy does it across your network — your draft model can be on a completely different box than your target.

Critically: vLLM cannot pool heterogeneous hardware. You can't mix a GTX 770 with a 4070 Ti in vLLM. You can't combine a CUDA machine with an AMD machine. You can't add a CPU-only node to the cluster. vLLM assumes uniform, high-end CUDA hardware throughout. Tightwad assumes you have a junk drawer.

vLLM is built for ML teams running production workloads at scale. Tightwad is built for anyone with two machines and a network cable.

	vLLM	Tightwad
AMD / ROCm support	Experimental	✓
Cross-machine speculative decoding	✗	✓
Mix old + new GPU generations	✗	✓
CPU nodes in the cluster	✗	✓
Works with Ollama	✗	✓
Target audience	Production ML teams	Homelab / anyone

vs Ollama

Ollama is great. It's the reason most people have local models running at all. But Ollama runs one model on one machine. When you outgrow one GPU, Ollama can't help you — it has no concept of pooling or cross-machine inference.

Ollama cannot combine machines at all. Your RTX 2070 on the old gaming PC and your RTX 4070 on the main rig are completely isolated from each other in Ollama's world. They'll never cooperate on a single request.

Tightwad is the next step after Ollama. Keep using Ollama as the backend on each machine — Tightwad just coordinates between them.

vs llama.cpp RPC

Tightwad is built on top of llama.cpp RPC. We didn't replace it — we added the orchestration layer, YAML configuration, CLI, and speculative decoding proxy that you'd otherwise have to script yourself.

The key difference for speculative decoding: llama.cpp RPC ships 100–300 MB of tensor data over the network per step. Tightwad's proxy ships token IDs — a few bytes. For models that fit on individual machines, the proxy approach is dramatically faster over a standard home network.

vs TGI (HuggingFace Text Generation Inference)

TGI is optimized for the HuggingFace ecosystem and integrates well with their services. It's an excellent tool if you're already in that ecosystem.

Tightwad is MIT licensed, has no vendor affiliation, and works with your existing Ollama or llama.cpp setup without any additional accounts or services. It's backend-agnostic by design.

The honest summary

If you have a single powerful CUDA machine and need production-grade throughput: use vLLM.

If you have one machine and just want to run models: use Ollama.

If you have two or more machines — mixed vendors, mixed GPU generations, mixed budgets, some with no GPU at all — and want them to work together intelligently: that's what Tightwad is for.

That GTX 770 from 2013? Put it to work drafting tokens. The old Xeon server with no GPU? CPU drafting. Your gaming PC, your workstation, your NAS, your laptop — Tightwad doesn't judge what you have. It just pools it.

RAM Management

llama-server mmaps the entire GGUF file into RAM before copying tensors to VRAM. On Windows, the mapped pages stay resident until the working set is trimmed. On Linux, they linger in the page cache. As a result, a machine with 16 GB of RAM can fail to load an 18 GB model even if its GPU has 24 GB of VRAM.

Tightwad solves this with three tools:

`tightwad load` — Pre-warm + Load + Reclaim

For standalone GGUF loading with memory-aware pre-warming:

tightwad load /path/to/model.gguf                    # auto pre-warm if needed
tightwad load /path/to/model.gguf --no-prewarm        # skip pre-warming

When the model exceeds 80% of available RAM, tightwad reads the file sequentially (with posix_fadvise(SEQUENTIAL) on Linux) to warm the page cache before llama-server mmaps it. After /health confirms the model is in VRAM, RAM is reclaimed.

This also happens automatically via tightwad start when ram_reclaim is auto or on.
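The pre-warm step can be sketched in a few lines of Python. This is a minimal illustration of the idea, not Tightwad's actual loader: the function names, chunk size, and return values are hypothetical, and only the 80% threshold and the posix_fadvise(SEQUENTIAL) hint come from the description above.

```python
import os


def needs_prewarm(model_bytes: int, available_ram_bytes: int, threshold: float = 0.8) -> bool:
    """Apply the 80% rule: pre-warm only when the model is large relative to free RAM."""
    return model_bytes > threshold * available_ram_bytes


def prewarm(path: str, chunk_size: int = 8 * 1024 * 1024) -> int:
    """Read a file front to back so its pages land in the OS page cache.

    Returns the number of bytes read.
    """
    total = 0
    with open(path, "rb", buffering=0) as f:
        # Hint sequential access where the platform exposes posix_fadvise (e.g. Linux).
        if hasattr(os, "posix_fadvise"):
            os.posix_fadvise(f.fileno(), 0, 0, os.POSIX_FADV_SEQUENTIAL)
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            total += len(chunk)
    return total
```

The hasattr guard keeps the sketch usable on platforms without posix_fadvise, where it falls back to a plain sequential read.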

`tightwad reclaim` — Free RAM After Loading

After the model is fully loaded to VRAM (/health returns 200), tell the OS to release the file's page cache:

# Auto-reclaim after starting coordinator
tightwad start -m qwen3-32b --ram-reclaim on

# Or reclaim manually for any running llama-server
tightwad reclaim                        # auto-detects coordinator PID
tightwad reclaim --pid 12345            # any llama-server process
tightwad reclaim --pid 12345 --model-path /models/model.gguf

How it works per platform:

Platform	Method	Effect
Linux	`posix_fadvise(DONTNEED)`	Drops the GGUF file's page cache. Targeted — only affects that file.
Windows	`SetProcessWorkingSetSize(-1, -1)`	Trims working set, moves mmap'd pages to standby list.
macOS	No-op	Unified memory — GPU and CPU share physical RAM. Reclaim is unnecessary.
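For the Linux row, the underlying system call is available directly from Python. The sketch below is an illustration under assumptions (the function name and return convention are hypothetical), not the code in reclaim.py. Note that the kernel treats DONTNEED as advice and only drops clean pages.

```python
import os


def drop_page_cache(path: str) -> bool:
    """Ask the kernel to drop cached pages for one file.

    Returns True if the hint was issued, False where posix_fadvise is unavailable.
    """
    if not hasattr(os, "posix_fadvise"):
        return False  # e.g. macOS or Windows
    fd = os.open(path, os.O_RDONLY)
    try:
        # offset=0 and length=0 together mean "the whole file".
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
    finally:
        os.close(fd)
    return True
```

Because the hint is scoped to a single file descriptor, other cached files on the machine are left alone, which matches the "targeted" effect described in the table.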

ram_reclaim modes (in cluster.yaml or --ram-reclaim flag):

Mode	Behavior
`off`	Default llama-server behavior. No reclaim.
`on`	Always reclaim after model loads to VRAM.
`auto`	Reclaim if model > 50% of available RAM. Skip if plenty of headroom. (Default)
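The three modes reduce to a small decision function. The sketch below is hypothetical (the function name and signature are not from Tightwad). It also accepts booleans because YAML 1.1 parsers read a bare on or off as true or false; whether Tightwad normalizes these the same way is an assumption.

```python
def should_reclaim(mode, model_bytes: int, available_ram_bytes: int) -> bool:
    """Decide whether to reclaim RAM after the model has loaded to VRAM."""
    # YAML 1.1 loaders turn unquoted on/off into booleans, so map them back.
    if mode is True:
        mode = "on"
    elif mode is False:
        mode = "off"

    if mode == "off":
        return False
    if mode == "on":
        return True
    if mode == "auto":
        # The 50% rule from the table above.
        return model_bytes > 0.5 * available_ram_bytes
    raise ValueError(f"unknown ram_reclaim mode: {mode!r}")
```

For example, an 18 GB model on a machine with 12 GB available triggers a reclaim in auto mode, while a 4 GB model on the same machine does not.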

`tightwad tune` — System Readiness Check

When the model is larger than RAM, swap must be configured (ideally on NVMe). tightwad tune checks whether the system is ready:

tightwad tune                              # general system check
tightwad tune --model /models/qwen3-32b.gguf  # check against specific model

Example output on a 16 GB machine with no swap:

System Resources:
  RAM:        16.0 GB (12.3 GB available)
  Swap:       0.0 GB (0.0 GB used)
  Swappiness: 60

Model: qwen3-32b-Q4_K_M.gguf (18.1 GB)

  [!] CRITICAL: No swap configured. This model (18.1 GB) exceeds available
      RAM (12.3 GB). Loading will fail. Configure NVMe swap:

      sudo fallocate -l 32G /swapfile
      sudo chmod 600 /swapfile
      sudo mkswap /swapfile
      sudo swapon /swapfile
      echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab

  [i] Tip: After loading, run 'tightwad reclaim' to free RAM.

CLI Reference

Setup & Config

Command	Description
`tightwad init`	Auto-discover LAN servers and generate cluster.yaml
`tightwad init --local`	Auto-detect local GPUs and generate coordinator-only config
`tightwad doctor [--fix] [--json]`	Diagnose config, binaries, network, versions, families, MoE

Speculative Decoding Proxy

Command	Description
`tightwad proxy start`	Start proxy (dashboard at `/dashboard`)
`tightwad proxy stop / status`	Stop proxy or show health + acceptance rate
`tightwad chat [--direct]`	Interactive chat with inline stats (or bypass proxy)
`tightwad bench [--json]`	A/B benchmark: proxy vs direct target comparison

Quality Gate (CPU Fleet + GPU Verifier)

Command	Description
`tightwad gate start`	Start quality gate proxy
`tightwad gate status`	Show approve/correct/reject rates

RPC Cluster

Command	Description
`tightwad start [-m MODEL]`	Start coordinator (enforces version matching)
`tightwad stop`	Stop the coordinator
`tightwad status`	Show RPC cluster status
`tightwad swap MODEL`	Hot-swap model (workers persist)
`tightwad benchmark`	Benchmark the running coordinator

Peer Agent (Cross-Platform Management)

Command	Description
`tightwad peer start [--port N]`	Start peer agent daemon (default port 9191)
`tightwad peer stop / status`	Stop or check peer agent

Models & Distribution

Command	Description
`tightwad pull <model> [--list]`	Download GGUF from HuggingFace (ensure you've accepted the model's license)
`tightwad inspect <model.gguf> [--plan]`	Show model info, MoE details, distribution plan
`tightwad distribute MODEL [--method rsync|swarm] [--dry-run]`	Copy model to workers
`tightwad swarm seed / pull / status`	P2P model distribution
`tightwad manifest create <model.gguf>`	Generate swarm manifest

System

Command	Description
`tightwad logs [coordinator|proxy] [-f] [--clear]`	View or tail logs
`tightwad tune [--model <gguf>]`	Check RAM/swap readiness
`tightwad load <model.gguf> [--no-prewarm]`	Memory-aware model loading
`tightwad reclaim [--pid PID]`	Free RAM after model loads to VRAM
`tightwad start --ram-reclaim auto`	Start with automatic RAM reclaim
`tightwad service install / uninstall / status`	Manage systemd/launchd service
`tightwad deploy <host> [--ssh-user USER]`	Deploy to remote host via SSH

Global option: -c /path/to/cluster.yaml or TIGHTWAD_CONFIG env var. Config is auto-discovered from ./tightwad.yaml, ./configs/cluster.yaml, ~/.tightwad/config.yaml, or the package default if -c is not provided.
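The lookup order can be sketched as follows. This is an illustration only: the function name is hypothetical, and the precedence of -c over TIGHTWAD_CONFIG is an assumption, since the text above does not say which wins when both are set.

```python
import os
from pathlib import Path
from typing import Mapping, Optional

# Search order taken from the paragraph above.
SEARCH_PATHS = ["./tightwad.yaml", "./configs/cluster.yaml", "~/.tightwad/config.yaml"]


def resolve_config(cli_path: Optional[str] = None,
                   env: Optional[Mapping[str, str]] = None) -> Optional[Path]:
    """Return the config path to use, or None to fall back to the package default."""
    env = os.environ if env is None else env
    # Assumption: an explicit -c beats the environment variable.
    explicit = cli_path or env.get("TIGHTWAD_CONFIG")
    if explicit:
        return Path(explicit).expanduser()
    for candidate in SEARCH_PATHS:
        path = Path(candidate).expanduser()
        if path.is_file():
            return path
    return None
```

Passing env explicitly makes the function easy to exercise without touching the real process environment.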

tightwad inspect requires the gguf package: pip install "tightwad[inspect]" (the quotes keep shells such as zsh from interpreting the brackets)

API Endpoints (Proxy)

Endpoint	Method	Description
`/`	GET	Browser-based chat UI
`/dashboard`	GET	Live monitoring dashboard (health, charts, request log)
`/v1/completions`	POST	Text completion (OpenAI-compatible)
`/v1/chat/completions`	POST	Chat completion (OpenAI-compatible)
`/v1/models`	GET	List available models
`/v1/tightwad/status`	GET	Proxy stats: acceptance rate, rounds, throughput
`/v1/tightwad/events`	GET	SSE stream of live stats, health, and request events
`/v1/tightwad/ws`	WebSocket	Bidirectional dashboard updates (stats, health, commands)
`/v1/tightwad/history`	GET	JSON array of recent request records (max 50)
`/metrics`	GET	Prometheus text exposition format (requests, tokens, acceptance rate, uptime)

The completion endpoints (/v1/completions and /v1/chat/completions) support stream: true for SSE streaming. The WebSocket endpoint at /v1/tightwad/ws enables bidirectional communication: live stats push and inbound commands (e.g., adjusting max_draft_tokens in real time). The chat UI at / provides an instant browser-based interface with streaming; no additional software is required.
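Because the completion endpoints are OpenAI-compatible, a standard-library client is enough to talk to the proxy. The sketch below assumes the proxy listens on localhost:8088 as in the diagram at the top of this README. The "model" value is a placeholder, and how the proxy treats it is an assumption.

```python
import json
import urllib.request


def build_chat_request(base_url: str, prompt: str, stream: bool = False) -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible chat endpoint."""
    payload = {
        "model": "tightwad",  # placeholder name (assumption)
        "messages": [{"role": "user", "content": prompt}],
        "stream": stream,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(req: urllib.request.Request, timeout: float = 60.0) -> str:
    """Send a non-streaming request and return the assistant's reply text."""
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the proxy running, send(build_chat_request("http://localhost:8088", "Say hello")) returns the reply text. If proxy authentication is enabled, add whatever header your configuration requires to the request.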

Chat Templates: The /v1/chat/completions endpoint auto-detects the correct chat template from the target model family (Llama 3, Mistral, Gemma, Phi, DeepSeek, Command-R, ChatML/Qwen). Override with proxy.chat_template in cluster.yaml or TIGHTWAD_CHAT_TEMPLATE env var. Use /v1/completions with a pre-formatted prompt if you need full control over the prompt format.

Auto-Tune: Set max_draft_tokens: auto in your config (or TIGHTWAD_MAX_DRAFT_TOKENS=auto) to let the proxy optimize draft count at runtime. It adjusts based on rolling acceptance rates and draft-vs-verify timing — increasing when drafts are cheap and acceptance is high, decreasing when they're not. Most users should use auto.
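To give a feel for what such a controller does, here is a deliberately simplified sketch. It reacts only to a rolling acceptance rate and omits the draft-vs-verify timing signal mentioned above. The class name, thresholds, bounds, and step size are all hypothetical, not Tightwad's actual tuning rule.

```python
from collections import deque


class DraftTuner:
    """Hypothetical controller: widen the draft window when acceptance is high, narrow it when low."""

    def __init__(self, start: int = 8, lo: int = 1, hi: int = 32,
                 window: int = 20, raise_at: float = 0.8, lower_at: float = 0.5):
        self.max_draft_tokens = start
        self.lo, self.hi = lo, hi
        self.raise_at, self.lower_at = raise_at, lower_at
        self.history = deque(maxlen=window)  # per-round acceptance rates

    def record(self, accepted: int, drafted: int) -> int:
        """Record one round's result and return the updated draft count."""
        if drafted > 0:
            self.history.append(accepted / drafted)
        if not self.history:
            return self.max_draft_tokens
        rate = sum(self.history) / len(self.history)
        if rate >= self.raise_at:
            self.max_draft_tokens = min(self.hi, self.max_draft_tokens + 1)
        elif rate < self.lower_at:
            self.max_draft_tokens = max(self.lo, self.max_draft_tokens - 1)
        return self.max_draft_tokens
```

The rolling window keeps a single bad round from collapsing the draft count, and the dead band between the two thresholds avoids oscillation.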

Draft-Verify Pipelining: The proxy can overlap drafting round N+1 with verifying round N. When acceptance is high (73-100% with same-family models), the optimistic draft is ready immediately when verification completes, nearly doubling throughput.
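A simple timing model shows where the "nearly doubling" figure comes from. The sketch assumes the best case, in which every optimistic draft is accepted and so no work is discarded, and fixed per-round draft and verify times. The function names are illustrative.

```python
def sequential_time(rounds: int, draft_s: float, verify_s: float) -> float:
    """Each round drafts, then verifies, with no overlap."""
    return rounds * (draft_s + verify_s)


def pipelined_time(rounds: int, draft_s: float, verify_s: float) -> float:
    """Draft for round N+1 runs while round N is verified (best case: all drafts accepted)."""
    if rounds == 0:
        return 0.0
    # First draft, then overlapped rounds bounded by the slower stage, then the final verify.
    return draft_s + (rounds - 1) * max(draft_s, verify_s) + verify_s


seq = sequential_time(100, 0.05, 0.05)
pipe = pipelined_time(100, 0.05, 0.05)
print(f"sequential {seq:.2f}s, pipelined {pipe:.2f}s, speedup {seq / pipe:.2f}x")
```

In this model the speedup approaches 2x only when drafting and verifying take about the same time. If one stage dominates, the overlap hides only the shorter stage and the gain is smaller.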

Hardware Setup

RPC Workers

Each machine that contributes a GPU runs rpc-server. Download pre-built binaries from the llama.cpp releases or build from source:

Windows (CUDA):

# Pre-built: download llama-b8079-bin-win-cuda-cu12.2.0-x64.zip
# rpc-server.exe is included alongside llama-server.exe
rpc-server.exe --host 0.0.0.0 --port 50052

macOS (Metal):

# Pre-built: download llama-b8079-bin-macos-arm64.tar.gz
# Restrict to Metal GPU only (avoids exposing CPU as second device):
./rpc-server --host 0.0.0.0 --port 50052 --device MTL0

Linux (ROCm/CUDA):

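# Build from source (ROCm/HIP shown; gfx1100 targets RDNA3 cards such as the RX 7900 XTX).
# For an NVIDIA GPU, replace -DGGML_HIP=ON and -DAMDGPU_TARGETS=... with -DGGML_CUDA=ON.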
cmake -B build -DGGML_HIP=ON -DGGML_RPC=ON -DAMDGPU_TARGETS=gfx1100
cmake --build build --config Release -j$(nproc)
./build/bin/rpc-server --host 0.0.0.0 --port 50052

Important: The coordinator's llama-server and all rpc-server instances must be the same build version. Version mismatches cause silent failures — no error, tensors just don't distribute. tightwad start enforces this automatically via SSH version checks and refuses to launch on mismatch. Use --skip-version-check to bypass if needed. tightwad doctor also reports version mismatches as errors.
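The check itself is a straightforward comparison once each host's build string has been collected. The sketch below covers only that comparison step. The function name and the build strings are hypothetical, and gathering the versions over SSH or the peer agent is left out.

```python
from typing import Dict


def find_version_mismatches(coordinator_build: str, worker_builds: Dict[str, str]) -> Dict[str, str]:
    """Return {host: build} for every worker whose build differs from the coordinator's."""
    return {
        host: build
        for host, build in worker_builds.items()
        if build != coordinator_build
    }


# Illustrative values only.
mismatches = find_version_mismatches(
    "b8079",
    {"192.168.1.20": "b8079", "192.168.1.30": "b7990"},
)
if mismatches:
    print("refusing to start; mismatched workers:", mismatches)
```

An empty result means every worker matches the coordinator, so the launch can proceed.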

Coordinator

The coordinator runs llama-server with --rpc pointing to all workers:

llama-server -m model.gguf --host 0.0.0.0 --port 8090 -ngl 999 \
  --rpc 192.168.1.20:50052,192.168.1.30:50052 \
  --tensor-split 0.34,0.26,0.17,0.23 \
  --flash-attn --jinja
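If you script this launch yourself, the command line can be assembled from a list of workers and a list of split ratios. The sketch below derives the ratios in proportion to each device's VRAM, which is an assumption for illustration and not necessarily how Tightwad computes them. The helper names are hypothetical and the flags are the ones shown in the command above.

```python
from typing import List


def tensor_split(vram_gb: List[float]) -> List[float]:
    """Split proportionally to VRAM (assumed heuristic), rounded to two decimals."""
    total = sum(vram_gb)
    return [round(v / total, 2) for v in vram_gb]


def coordinator_argv(model: str, workers: List[str], split: List[float], port: int = 8090) -> List[str]:
    """Build the llama-server argument list used for the coordinator."""
    return [
        "llama-server", "-m", model,
        "--host", "0.0.0.0", "--port", str(port),
        "-ngl", "999",
        "--rpc", ",".join(workers),
        "--tensor-split", ",".join(f"{s:.2f}" for s in split),
        "--flash-attn", "--jinja",
    ]


split = tensor_split([16, 12, 8, 24])  # example VRAM sizes in GB
argv = coordinator_argv("model.gguf", ["192.168.1.20:50052", "192.168.1.30:50052"], split)
print(" ".join(argv))
```

Building the arguments as a list rather than a single string lets you pass them directly to subprocess.run without shell quoting issues.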

Firewall Notes

rpc-server listens on port 50052 by default. You'll need to open this port:

Windows: New-NetFirewallRule -DisplayName 'llama-rpc' -Direction Inbound -Program 'C:\llama\rpc-server.exe' -Action Allow
macOS: Add rpc-server to System Settings → Network → Firewall → Options, or: sudo /usr/libexec/ApplicationFirewall/socketfilterfw --add /path/to/rpc-server && sudo /usr/libexec/ApplicationFirewall/socketfilterfw --unblockapp /path/to/rpc-server
Linux: sudo ufw allow 50052/tcp
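After opening the port, a quick TCP connect from the coordinator machine confirms that a worker is reachable. This is a generic reachability check, not a Tightwad command, and the function name is illustrative.

```python
import socket


def port_open(host: str, port: int = 50052, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

For example, port_open("192.168.1.20") returning False usually points to a firewall rule or to rpc-server not listening on 0.0.0.0.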

Benchmarking

Benchmark scripts measure acceptance rates and wall-clock speedup across model pairs. See the scripts in the benchmarks/ directory and scripts/benchmark_proxy.py for full methodology and reproduction instructions.

# Multi-config benchmark (local Ollama + cloud APIs)
OPENROUTER_API_KEY="sk-or-..." .venv/bin/python scripts/benchmark_proxy.py

# Llama 3.1 8B (local) → Llama 3.1 405B (OpenRouter)
# Requires local llama-server running on port 8081
llama-server -m ~/models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf --port 8081 -ngl 999
OPENROUTER_API_KEY="sk-or-..." .venv/bin/python scripts/benchmark_openrouter.py

Results are saved as JSON in benchmarks/.

Development

git clone https://github.com/youngharold/tightwad.git
cd tightwad
pip install -e ".[dev]"
pytest tests/ --ignore=tests/e2e_loader_test.py -v

Project Structure

tightwad/
├── cli/                # CLI package (split into subcommand modules)
│   ├── __init__.py     # Main Click group, shared helpers
│   ├── proxy.py        # proxy start/stop/status
│   ├── cluster.py      # start, stop, status, swap, benchmark, bench
│   ├── gate.py         # gate start/status (quality gate)
│   ├── peer.py         # peer start/stop/status
│   ├── tools.py        # doctor, init, inspect, distribute, deploy, etc.
│   └── service.py      # service install/uninstall/status
├── proxy.py            # Speculative decoding proxy (+ auto-tune, pipelining, consensus)
├── speculation.py      # Verification algorithms (greedy, stochastic, consensus)
├── config.py           # YAML/env config loader with SSRF validation
├── family.py           # Model family detection (Ollama, llama-server, GGUF)
├── chat_templates.py   # Per-family chat templates (Llama 3, Mistral, Gemma, Phi, etc.)
├── quality_gate.py     # CPU fleet drafts, GPU verifies (approve/correct/reject)
├── peer.py             # Peer agent HTTP daemon (replaces SSH)
├── tree_speculation.py # Tree-based speculation data structures
├── bench.py            # A/B benchmark: proxy vs direct target
├── coordinator.py      # llama-server lifecycle + version enforcement
├── worker.py           # RPC health checks + version detection (SSH + peer)
├── doctor.py           # Diagnostic checks (config, network, versions, families, MoE)
├── dashboard.py        # Live web dashboard (SSE + WebSocket)
├── gguf_reader.py      # Pure-Python GGUF v2/v3 parser (zero deps)
├── gguf_inspect.py     # GGUF analysis + MoE detection + distribution planning
├── ssrf.py             # SSRF protection (scheme, private-IP, DNS-rebinding)
├── validation.py       # Request body validation + DoS limits
├── init_wizard.py      # LAN auto-discovery + config generator
├── reclaim.py          # Cross-platform RAM reclaim after model loading
├── loader.py           # Memory-aware model pre-warming
├── distribute.py       # Model distribution (rsync/scp + swarm P2P)
├── manifest.py         # Swarm manifest + piece bitfield tracking
├── swarm_transfer.py   # P2P seeder + puller
├── gpu_detect.py       # GPU auto-detection (NVIDIA, AMD, Apple)
├── service.py          # systemd/launchd service management
├── model_hub.py        # HuggingFace GGUF downloader
├── deploy.py           # Remote deployment via SSH
└── tune.py             # System RAM/swap diagnostics
tests/                  # 765+ tests (unit + integration)
configs/                # Example cluster.yaml configs

License

Tightwad is provided as-is under the MIT License. Performance claims are based on our testing and may vary with your hardware, model pairing, network, and configuration. All trademarks are the property of their respective owners.
