Minimal local CLI for running Gemma 4 with the native orbit server backend.
Orbit is designed for local execution, streaming output, shell tools, and a simple terminal workflow. The normal setup uses Orbit's native backend and does not require an external llama-server process at runtime.
The native backend still depends on native libraries derived from llama.cpp/ggml, built either from Orbit's vendored sources or from a documented developer fallback such as --llama-root. Zero-build packaging remains future work. See docs/NATIVE_PACKAGING_ROADMAP.md.
Linux is the main target environment. macOS may work. Windows is not a target environment.
- chats with a local model through a small terminal CLI
- streams answers and runtime metrics
- exposes unrestricted local shell tools by default, with explicit /tools off, --tools off, or ORBIT_TOOLS=off controls
- keeps the tool loop model-driven
- supports local image and audio input when the backend is started with multimodal support
- supports native MTP when target and draft models are available
- stores lightweight sessions and prompt history under ~/.orbit
Orbit stays model-driven. The runtime enforces safety, size, timeout, and tool-contract boundaries, but it does not deterministically solve user tasks.
- Primary backend path: native orbit server
- Compatibility path: llama-server or another OpenAI-compatible local backend
- CLI default base URL: http://127.0.0.1:12120
If your native server runs on another port, pass --base-url.
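As a minimal sketch of the rule above, the helper below builds a one-shot client invocation and adds --base-url only when the server is not on the default port. The name `client_command` and the choice to place options before the prompt are illustrative assumptions; only the `orbit` command, the `--base-url` flag, and the default URL come from this README.

```python
DEFAULT_BASE_URL = "http://127.0.0.1:12120"  # CLI default stated in this README


def client_command(prompt: str, port: int = 12120, host: str = "127.0.0.1") -> list[str]:
    """Return an argv list for a one-shot `orbit` prompt (hypothetical helper)."""
    base_url = f"http://{host}:{port}"
    argv = ["orbit"]
    # Only pass --base-url when the server is not at the default address.
    if base_url != DEFAULT_BASE_URL:
        argv += ["--base-url", base_url]
    argv.append(prompt)
    return argv


if __name__ == "__main__":
    print(client_command("hello", port=8080))
```

Returning an argv list rather than a shell string avoids quoting problems when the prompt contains spaces or quotes.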
Orbit is designed, tested, and supported primarily for CPU-only local execution.
The vendored native self-build path is CPU-only in this release:
python3 scripts/build_native.py
GPU acceleration is not part of Orbit's supported vendored self-build path yet. Advanced users who want CUDA, Metal, Vulkan, or ROCm should build llama.cpp natively on the target GPU machine and point Orbit to that build through the documented developer fallback:
orbit server --llama-root /path/to/gpu-enabled/llama.cpp
This keeps Orbit's default path portable and stable while still allowing advanced GPU experiments through upstream llama.cpp builds.
- Python 3.11 or newer
- Linux recommended
- a local Gemma 4 target GGUF model
Optional artifacts:
- MTP draft GGUF for native speculative decoding
- matching mmproj GGUF for multimodal image/audio input
git clone https://github.com/guelfoweb/orbit.git
cd orbit
python3 -m venv .venv
. .venv/bin/activate
pip install -e .
This installs the Python package and CLI.
If vendor/lib/ does not already contain the required CPU native libraries, build them explicitly:
python3 scripts/build_native.py
This uses Orbit's vendored llama.cpp sources and writes local build output under src/orbit/native_llama/vendor/.
Developer fallback:
--llama-root /path/to/llama.cpp
ORBIT_LLAMA_ROOT=/path/to/llama.cpp
Download only the target model:
orbit download ggml-org/gemma-4-12B-it-GGUF
Download only the multimodal projector:
orbit download ggml-org/gemma-4-12B-it-GGUF/mmproj-gemma-4-12B-it-Q8_0.gguf
Download only the MTP draft:
orbit download unsloth/gemma-4-12b-it-GGUF/MTP/gemma-4-12b-it-Q8_0-MTP.gguf
Stable default server, with MTP disabled:
orbit server
By default, the native server performs startup route-prefix prewarm for the tools-on route prefix. This shifts the first tools-on route prefill cost to startup. To disable only startup prewarm:
ORBIT_KV_PREFIX_PREWARM=off orbit server
To disable route prefix-anchor and prewarm:
ORBIT_KV_PREFIX_ANCHOR=off orbit server
The terminal client starts with tools enabled by default. To start without local shell tools:
ORBIT_TOOLS=off orbit
orbit --tools off "hello"
To get a reasonable CPU/RAM starting profile, you can first run scripts/suggest-server-profile.sh; it checks local CPU and RAM and prints conservative environment-variable suggestions to review before export.
If native libraries are not packaged inside Orbit yet, use:
orbit server --llama-root /path/to/llama.cpp
Optional experimental MTP mode:
orbit server --mtp
With a multimodal projector:
orbit server \
--port 12120 \
--mmproj models/ggml-org--gemma-4-12B-it-GGUF/mmproj-gemma-4-12B-it-Q8_0.gguf
You can combine MTP and multimodal flags when both artifacts are available:
orbit server \
--port 12120 \
--mtp \
--mmproj models/ggml-org--gemma-4-12B-it-GGUF/mmproj-gemma-4-12B-it-Q8_0.gguf
What this means:
- orbit server starts the stable native backend without MTP.
- orbit server --mtp enables the experimental MTP path explicitly.
- MTP can improve some workloads, but Orbit keeps it off by default because stability has priority.
- if native libs are missing, Orbit exits with a short error telling you to build them with python3 scripts/build_native.py or use --llama-root / ORBIT_LLAMA_ROOT
- native route KV prefix-anchor runs in safe auto mode by default for eligible tools-on route calls; disable it with ORBIT_KV_PREFIX_ANCHOR=off if you need the baseline prefill path
- experimental multi-turn raw MTP chat reuse remains debug-only behind:
ORBIT_MTP_CHAT_REUSE_RAW=1
ORBIT_MTP_CHAT_REUSE_DEBUG=1
After startup, you can verify that the server is healthy:
orbit --health
Inside the interactive client, you can inspect backend state with:
/health
/props
Expected /props values:
- default server: backend_mode=no-mtp, mtp_enabled=false
- with --mtp: backend_mode=mtp-ready, mtp_enabled=true
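The expected values above can be turned into a small check for scripted smoke tests. The sketch assumes the /props output has already been parsed into a Python dict with `mtp_enabled` decoded as a boolean; how Orbit actually serializes these fields is not specified here, and `props_mismatches` is a hypothetical helper.

```python
# Expected /props fields from this README, keyed by whether --mtp was passed.
EXPECTED_PROPS = {
    False: {"backend_mode": "no-mtp", "mtp_enabled": False},
    True: {"backend_mode": "mtp-ready", "mtp_enabled": True},
}


def props_mismatches(props: dict, *, mtp: bool) -> dict:
    """Return {field: (expected, actual)} for every field that differs."""
    expected = EXPECTED_PROPS[mtp]
    return {
        key: (want, props.get(key))
        for key, want in expected.items()
        if props.get(key) != want
    }


if __name__ == "__main__":
    sample = {"backend_mode": "no-mtp", "mtp_enabled": False}
    print(props_mismatches(sample, mtp=True))
```

An empty result means the backend reports the mode you asked for; a non-empty one names each field alongside the expected and actual values.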
orbit
Inside Orbit, shell tools are on by default; if they were turned off, re-enable them with:
/tools on
orbit "Say who you are in one short sentence."
orbit --image workdir/media/image1.jpg "Describe this image."
orbit --audio workdir/media/audio1.wav "Transcribe or summarize this audio."
Orbit supports runtime thinking visibility:
/think off
/think on
- think off: do not request visible reasoning
- think on: request visible reasoning first, then the final answer
The backend and model must actually support visible reasoning for this to appear correctly.
/compact [tools]   Compact memory or old tool results.
/continue          Continue the last answer if it reached max_tokens.
/health            Check backend health.
/help              Show this help.
/max-tokens [n]    Show or set output token limit for following turns.
/think [off|on]    Show or set thinking visibility.
/reset             Clear current conversation and saved session.
/sessions clear    Delete all saved sessions for this workdir.
/status [ctx]      Show runtime status or estimated context usage.
/tools [off|on]    Show or set shell tool access.
/exit              Exit interactive mode.
/max-tokens <n> affects only the current runtime. It does not rewrite config or session files.
Orbit can still talk to compatible local HTTP backends through --base-url. Treat this as a compatibility or comparison path, not as the preferred product path.
- backend unavailable: check orbit --health --base-url ...
- native libraries missing: run python3 scripts/build_native.py, or use --llama-root / ORBIT_LLAMA_ROOT as a developer fallback
- model not found: verify the Orbit models cache or explicit model paths
- multimodal unavailable: ensure the matching mmproj is present and the backend was started with it
- MTP unavailable: ensure both target and draft models are available and the backend was started with the experimental MTP path
- performance notes: docs/PERFORMANCE.md
- native packaging roadmap: docs/NATIVE_PACKAGING_ROADMAP.md
- runtime techniques: docs/TECHNIQUES.md
- manual regression prompts: docs/PROMPTS.md
- release confidence suite: docs/RELEASE_CONFIDENCE.md
orbit bench-core --base-url http://127.0.0.1:12120
orbit release-confidence --base-url http://127.0.0.1:12120 --keep-failed