chariotsolutions/claude-code-ollama-local

Claude Code on local LLMs via Ollama

A working configuration for running Claude Code against a local LLM backend (via Ollama) on consumer Apple Silicon. Tested on an M4 Max MacBook Pro, 48GB unified memory. Should work on any 48GB+ Apple Silicon Mac with minor adjustments.

Contents

  • bin/claude-ollama: launcher wrapper that sets the environment variables Claude Code needs and execs claude
  • bin/ollama-proxy.py: optional logging proxy that sits between Claude Code and Ollama and captures request bodies
  • prompts/: shared prompt files for the performance checks
  • tests/throughput.sh <model>: direct-API throughput check, swap model via CLI arg
  • tests/qualitative.sh <model> <prompt>: run any prompt against any model for manual evaluation

1. Prerequisites

  • macOS Apple Silicon, 48GB+ unified memory
  • Claude Code installed and working against your normal Claude account
  • Terminal access, ~15 minutes

2. Install Ollama

Direct tarball install — no Homebrew required:

cd /tmp
curl -LO https://github.com/ollama/ollama/releases/download/v0.21.0/ollama-darwin.tgz
curl -LO https://github.com/ollama/ollama/releases/download/v0.21.0/ollama-darwin.tgz.sha256
shasum -a 256 -c ollama-darwin.tgz.sha256

mkdir -p ~/.local/opt/ollama ~/.local/bin
tar -xzf ollama-darwin.tgz -C ~/.local/opt/ollama
ln -sf ~/.local/opt/ollama/ollama ~/.local/bin/ollama

# Make sure ~/.local/bin is on PATH (add to ~/.zshrc if not)
grep -qsF '/.local/bin' ~/.zshrc || echo 'export PATH="$HOME/.local/bin:$PATH"' >> ~/.zshrc
source ~/.zshrc
ollama --version

3. Pull a model

Pick one. Approximate numbers measured on an M4 Max 48GB with the settings in §4:

| Model | ~tok/s | Disk | Notes |
|---|---|---|---|
| qwen3:30b-a3b-instruct-2507-q4_K_M | 93 | 18 GB | Most reliable coding agent in my testing. Default pick. |
| qwen3-coder:30b | 100 | 18 GB | Coder-specialized variant, similar speed. |
| gemma4:26b | 80 | 17 GB | Strong single-turn quality, less reliable in multi-turn tool-calling. |
| qwen3.6:35b-a3b | 44 | 23 GB | Newer family, materially slower, handled multi-turn well in my sessions. |

ollama pull qwen3:30b-a3b-instruct-2507-q4_K_M

4. Start Ollama with the four settings that matter

mkdir -p ~/.local/var/log/ollama
OLLAMA_CONTEXT_LENGTH=131072 \
OLLAMA_KEEP_ALIVE=60m \
OLLAMA_KV_CACHE_TYPE=q8_0 \
OLLAMA_FLASH_ATTENTION=1 \
  nohup ollama serve > ~/.local/var/log/ollama/serve.log 2>&1 & \
  disown

curl -s http://localhost:11434/api/version   # sanity check
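
Because the server is started in the background, the sanity check can race it on a slow start. A minimal sketch of a readiness loop, assuming only curl and the /api/version endpoint used above; the function name wait_for_ollama and its arguments are illustrative and not part of this repo:

```shell
# Poll a URL until it answers or the attempts run out.
# Usage: wait_for_ollama [URL] [ATTEMPTS] [DELAY_SECONDS]
# (hypothetical helper; defaults match the setup in this README)
wait_for_ollama() (
  url="${1:-http://localhost:11434/api/version}"
  attempts="${2:-30}"
  delay="${3:-1}"
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # -f: HTTP errors count as failure, -s: quiet, --max-time: never hang
    if curl -fs --max-time 2 -o /dev/null "$url"; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
)
```

Typical use would be `wait_for_ollama || echo "Ollama did not come up; check serve.log" >&2` right after starting the server.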

Why each matters:

  • OLLAMA_CONTEXT_LENGTH=131072 (128K). The default (32K) silently truncates Claude Code's ~33K-token system prompt — you get no error, the agent just behaves badly.
  • OLLAMA_KEEP_ALIVE=60m. The default (5m) unloads the model, and its cached prompt state with it, after five idle minutes, so any turn that follows a longer pause re-pays the cold prompt-eval cost (~70s on a 30B model).
  • OLLAMA_KV_CACHE_TYPE=q8_0. Halves KV cache memory. On a 48GB box at 128K context, this is often the difference between "fits on GPU" and "spills to CPU and runs at single-digit tok/s."
  • OLLAMA_FLASH_ATTENTION=1. On by default in Ollama 0.21 for supported models, but explicit is safer. Required for the q8_0 KV quantization above.
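
The memory effect of the KV cache setting can be estimated with simple arithmetic. The sketch below assumes the usual layout of one K and one V vector per layer, per KV head, per token, 2 bytes per element for f16, and 34 bytes per 32-element block for q8_0. The function name kv_cache_bytes is illustrative, and the layer and head counts you pass in must come from your model's own config:

```shell
# Estimate KV cache size in bytes (rough estimate; ignores runtime overhead).
# Usage: kv_cache_bytes LAYERS KV_HEADS HEAD_DIM CONTEXT_TOKENS f16|q8_0
kv_cache_bytes() (
  layers=$1 kv_heads=$2 head_dim=$3 ctx=$4 type=$5
  # One K and one V vector per layer, per KV head, per token.
  elems=$((2 * layers * kv_heads * head_dim * ctx))
  case "$type" in
    f16)  echo $((elems * 2)) ;;        # 2 bytes per element
    q8_0) echo $((elems * 34 / 32)) ;;  # 32 int8 values + one 2-byte scale per block
    *)    echo "unknown cache type: $type" >&2; return 1 ;;
  esac
)
```

For example, with illustrative figures of 48 layers, 4 KV heads, and a head dimension of 128 at 131072 tokens, `kv_cache_bytes 48 4 128 131072 f16` prints 12884901888 (12 GiB) and the q8_0 variant prints 6845104128 (about 6.4 GiB). On a 48GB machine that also holds roughly 18 GB of weights, that difference is substantial.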

5. Point Claude Code at it

Install the launcher from this repo:

git clone <this-repo-url> ~/dev/claude-code-ollama-local
ln -sf ~/dev/claude-code-ollama-local/bin/claude-ollama ~/.local/bin/

Launch:

claude-ollama                                       # use the default model
CLAUDE_OLLAMA_MODEL=gemma4:26b claude-ollama        # override per session
claude-ollama -c                                    # extra arguments (here -c, continue session) pass through to claude

What the launcher does (a few lines of bash):

#!/usr/bin/env bash
# Launch Claude Code pointed at local Ollama.
set -euo pipefail

MODEL="${CLAUDE_OLLAMA_MODEL:-qwen3:30b-a3b-instruct-2507-q4_K_M}"
BASE_URL="${CLAUDE_OLLAMA_BASE_URL:-http://localhost:11434}"

export ANTHROPIC_BASE_URL="$BASE_URL"
export ANTHROPIC_AUTH_TOKEN="ollama"
export ANTHROPIC_API_KEY=""

exec claude --model "$MODEL" "$@"
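
If the chosen model has not been pulled, the failure only surfaces once Claude Code sends its first request. A pre-flight check along these lines could catch that earlier. This is a sketch rather than part of the repo's launcher: model_in_list is a hypothetical helper, and it assumes `ollama list` prints a header row followed by rows whose first column is the model tag:

```shell
# Succeeds if the given tag appears in the NAME column of `ollama list`
# output supplied on stdin (hypothetical helper, not in bin/claude-ollama).
model_in_list() {
  awk -v tag="$1" 'NR > 1 && $1 == tag { found = 1 } END { exit !found }'
}

# Possible use before launching:
#   ollama list | model_in_list "$MODEL" || {
#     echo "model not found locally; run: ollama pull $MODEL" >&2
#     exit 1
#   }
```

Reading from stdin keeps the check easy to exercise with canned output, without a running Ollama instance.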

6. Optional: logging proxy to see what's on the wire

If you ever want to inspect what Claude Code actually sends to Ollama — useful for debugging tool-call issues or verifying which tools actually fire — run the proxy and point the launcher at it:

ln -sf ~/dev/claude-code-ollama-local/bin/ollama-proxy.py ~/.local/bin/
nohup ~/.local/bin/ollama-proxy.py > /dev/null 2>&1 &

CLAUDE_OLLAMA_BASE_URL=http://localhost:11435 claude-ollama
tail -f ~/.local/var/log/ollama/proxy.log

Requires uv; the inline # /// script metadata header (PEP 723) lets uv install aiohttp automatically on first run.

The proxy listens on 127.0.0.1:11435, forwards to 127.0.0.1:11434, and writes request bodies for /api/generate and the last message of each /v1/messages body to ~/.local/var/log/ollama/proxy.log. Other requests are logged as length-only to keep the file readable.

7. Monitoring

ollama ps                                         # loaded model, memory footprint, GPU/CPU split, context, keep-alive
tail -f ~/.local/var/log/ollama/serve.log         # per-request timing + runner layer-placement events
tail -f ~/.local/var/log/ollama/proxy.log         # request bodies (if the proxy is running)

ollama ps is the first line of defense — one command tells you whether the model is loaded, how much memory it's using, and whether it's fully on GPU or spilling to CPU.

Per-request [GIN] lines in serve.log show which endpoint was hit and how long it took. If you see repeated Operation:fit/alloc/commit/close lines on the runner, something is forcing it to rebuild — usually two clients hitting the same Ollama instance, or the fitter oscillating GPU layer counts under memory pressure.
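
The GPU/CPU check can be scripted. The sketch below assumes the PROCESSOR column of `ollama ps` reads "100% GPU" for a fully offloaded model and something like "48%/52% CPU/GPU" when it spills. That format may differ between Ollama versions, and all_on_gpu is an illustrative name:

```shell
# Reads `ollama ps` output on stdin. Prints every model row that is not
# fully on GPU and returns non-zero if any such row exists.
# (hypothetical helper; column format is an assumption, see above)
all_on_gpu() {
  awk 'NR > 1 && NF > 0 && !/100% GPU/ { print; spilled = 1 }
       END { exit spilled + 0 }'
}

# Possible use:
#   ollama ps | all_on_gpu || echo "model is spilling to CPU" >&2
```

Note that header-only output (no model loaded) also returns success, so this only answers "is anything spilling", not "is the model loaded".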

One important caveat: WebSearch silently returns empty

WebSearch is a server-side Anthropic tool. There's no Anthropic server to answer it against a local Ollama backend — the tool returns empty results. The model, presented with an empty tool result, produces a confident, detailed response that looks search-grounded but is actually drawn from training data. Nothing in the UI flags this.

In one real session the agent emitted six WebSearch calls researching Python libraries and delivered a thorough comparison. Its own summary admitted: "The search tool had issues, but I have current knowledge on all these options." If you'd scrolled past that line, you'd have trusted the output as real research.

Treat any research-style output from a local-backend session as suspect — library recommendations, "current best practice" claims, framework comparisons — unless you've either installed a search-capable MCP server or pinned the agent to WebFetch with specific URLs.

WebFetch, by contrast, runs client-side in the Claude Code CLI process and makes real HTTP requests. It works against local backends. Use it for retrieving specific documentation pages, GitHub files, PyPI pages, etc.

Checking performance on your own setup

Two scripts for reproducible measurement, both taking a model tag via CLI arg so you can swap models without editing code:

./tests/throughput.sh qwen3:30b-a3b-instruct-2507-q4_K_M
# Runs warmup + two canonical prompts (capital-short, memoization-300w)
# with think=false, stream=true. Prints per-prompt tok/s, TTFT, and
# duration. Artifacts to tests/logs/.

./tests/qualitative.sh qwen3:30b-a3b-instruct-2507-q4_K_M flash-attention-explain
# Runs one prompt from prompts/ against the chosen model; pairs with
# the matching .rubric.md for manual evaluation.

Artifacts for each run land in tests/logs/ with filenames including the timestamp, sanitized model tag, and prompt name — multiple runs and multiple models coexist cleanly.
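
For a quick spot check without the scripts, tok/s can be derived from the eval_count and eval_duration (nanoseconds) fields that Ollama reports in the final /api/generate response. This is a sketch of that arithmetic, not a copy of tests/throughput.sh; tok_per_sec is an illustrative name and the commented usage assumes jq is installed:

```shell
# Tokens per second from a token count and a duration in nanoseconds.
# Usage: tok_per_sec EVAL_COUNT EVAL_DURATION_NS
tok_per_sec() {
  awk -v n="$1" -v ns="$2" \
    'BEGIN { if (ns <= 0) exit 1; printf "%.1f\n", n * 1e9 / ns }'
}

# Possible use (non-streaming request, fields extracted with jq):
#   curl -s http://localhost:11434/api/generate \
#     -d '{"model":"qwen3-coder:30b","prompt":"Say hi","stream":false}' |
#     jq -r '"\(.eval_count) \(.eval_duration)"' |
#     { read -r n ns; tok_per_sec "$n" "$ns"; }
```

As a sanity check, `tok_per_sec 930 10000000000` (930 tokens in 10 seconds) prints 93.0.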
