A high-performance LLM inference engine and image generation pipeline written in C# 14 / .NET 10. Runs GGUF models on CPU (AVX2/AVX-512 SIMD) and GPU (Vulkan compute shaders or CUDA cuBLAS). Includes an OpenAI- and Anthropic-compatible API server and native pipelines for Z-Image-Turbo and FLUX.1.
Requirements: .NET 10 SDK and an x86-64 CPU with AVX2.
Optional: a Vulkan-capable GPU (with drivers), CUDA Toolkit 11.x/12.x for NVIDIA paths, and OpenBLAS in tools/openblas/ for faster batched GEMM.

Build with dotnet build -c Release.

Supported architectures: llama, llama4, qwen3, qwen3moe.

Benchmarked on AMD Zen 4 (12c/24t, DDR4-3200) + RTX 4070 Ti (12 GB) with Q4_K_M weights, --temp 0, -n 80, and the prompt "Write a Python function that sorts a list using the quicksort algorithm:". Decode rate is forward-pass iterations divided by decode time, so it counts thinking-mode tokens too. All outputs were verified coherent (scripts/bench-all.ps1). Cross-engine top-1 parity against llama.cpp b8585 was verified on Qwen3-8B (byte-identical 60-token greedy decode with a matching chat template).
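The decode-rate definition above can be written out as a short calculation. This is a minimal sketch, not the benchmark script itself: the function name and the sample timing are hypothetical and only illustrate that every forward pass counts, including passes that emit thinking-mode tokens.

```python
def decode_rate(forward_passes: int, decode_seconds: float) -> float:
    """Tokens per second: forward-pass iterations divided by decode wall time."""
    if decode_seconds <= 0:
        raise ValueError("decode_seconds must be positive")
    return forward_passes / decode_seconds


# Hypothetical run: 80 forward passes (-n 80) finishing in 2.0 seconds.
# Passes that produce thinking-mode tokens are included in the count,
# so the rate does not depend on how many tokens are shown to the user.
rate = decode_rate(forward_passes=80, decode_seconds=2.0)
print(f"{rate:.1f} t/s")
```

Measured this way, the figure reflects raw engine throughput rather than the speed of visible output.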
| Model | Repo | Size | Backend | Prefill t/s | Decode t/s | Notes |
|---|---|---|---|---|---|---|
| SmolLM2 1.7B Instruct | HuggingFaceTB | 1 GB | CPU | 16.1 | 39.9 | AVX2 fused dequant-matvec |
| SmolLM2 1.7B Instruct | (same) | 1 GB | Vulkan -g -1 | 44.3 | 147.9 | GLSL subgroupAdd reduce |
| SmolLM2 1.7B Instruct | (same) | 1 GB | CUDA -g -1 | 180.2 | 157.8 | NVRTC __dp4a + Q8_1 |
| Qwen3 8B | Qwen | 5 GB | Vulkan -g -1 | 22.8 | 46.8 | 11.4K auto-ctx |
| Qwen3 8B | (same) | 5 GB | Vulkan -g -1 --tq | 21.5 | 45.5 | 3-bit KV → 40 960 ctx |
| Qwen3 8B | (same) | 5 GB | CUDA -g -1 | 65.4 | 58.0 | ~2.9× Vulkan prefill |
| Qwen3 8B | (same) | 5 GB | CUDA -g -1 --tq | 66.1 | 65.4 | 3-bit KV → 40 960 ctx; 17 t/s @ 8K, 10 t/s @ 16K |
| Qwen3-Coder 30B-A3B (MoE) | Qwen | 17 GB | CPU | 13.0 | 20.6 | 128 experts / 8 active |
| Qwen3-Coder 30B-A3B (MoE) | (same) | 17 GB | CPU --tq | 11.3 | 20.6 | 3-bit KV; GPU MoE gated by #2 |
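The "3-bit KV → 40 960 ctx" notes in the table follow from simple cache-size arithmetic. The sketch below is only an estimate under stated assumptions: the layer count, KV-head count, and head dimension are assumed Qwen3-8B-like values that do not come from this README, the uncompressed baseline is assumed to be 16 bits per value, and quantization scales and padding are ignored.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   ctx_len: int, bits_per_value: float) -> float:
    """Approximate K+V cache size, ignoring quantization scales and padding."""
    values = 2 * n_layers * n_kv_heads * head_dim * ctx_len  # 2 = keys and values
    return values * bits_per_value / 8


# Assumed model shape (illustrative; not taken from this README).
shape = dict(n_layers=36, n_kv_heads=8, head_dim=128)
GIB = 1024 ** 3

fp16 = kv_cache_bytes(**shape, ctx_len=40_960, bits_per_value=16)
q3 = kv_cache_bytes(**shape, ctx_len=40_960, bits_per_value=3)

print(f"16-bit KV at 40 960 ctx: {fp16 / GIB:.3f} GiB")
print(f" 3-bit KV at 40 960 ctx: {q3 / GIB:.3f} GiB")
print(f"compression ratio: {fp16 / q3:.2f}x")
```

Under these assumptions the full-context cache drops from several GiB to roughly one GiB, which is why a 12 GB card that auto-limits context without compression can hold the full context with --tq.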
--backend auto (default) picks CUDA for dense models with full offload (-g -1),
Vulkan otherwise. --tq enables 3-bit TurboQuant KV compression (CPU, Vulkan, CUDA;
requires headDim ∈ {128, 256}). MoE on GPU is rejected by default — see #2.
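The selection rules above can be summarized as a small decision function. This is a hedged sketch of the documented behavior, not the engine's actual code: the function names are hypothetical, and treating a GPU layer count of 0 as a CPU-only run, as well as raising an error for MoE on GPU, are assumptions made for illustration.

```python
TQ_HEAD_DIMS = {128, 256}  # head dimensions that --tq is documented to support


def pick_backend(is_moe: bool, gpu_layers: int) -> str:
    """Sketch of '--backend auto'. gpu_layers == -1 means full offload (-g -1)."""
    if gpu_layers == 0:
        return "cpu"  # assumption: no offload requested means a CPU run
    if is_moe:
        # Documented: MoE on GPU is rejected by default (see #2).
        raise ValueError("MoE on GPU is rejected by default")
    if gpu_layers == -1:
        return "cuda"  # dense model with full offload
    return "vulkan"  # everything else, e.g. partial offload


def tq_supported(head_dim: int) -> bool:
    """--tq (3-bit TurboQuant KV) requires headDim in {128, 256}."""
    return head_dim in TQ_HEAD_DIMS


print(pick_backend(is_moe=False, gpu_layers=-1))
print(pick_backend(is_moe=False, gpu_layers=20))
print(tq_supported(128), tq_supported(64))
```

Keeping the head-dimension check separate from backend choice matches the text: --tq is available on CPU, Vulkan, and CUDA alike, so it constrains the model rather than the backend.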
# CPU, single-turn, greedy
dotnet run --project src/SharpInference.Cli -c Release -- \
-m models/SmolLM2-1.7B-Instruct-Q4_K_M.gguf -p "What is 2+2?" --temp 0
# Full GPU offload (auto-picks CUDA on dense + full offload)
dotnet run --project src/SharpInference.Cli -c Release -- \
-m models/Qwen3-8B-Q4_K_M.gguf -p "Write a quicksort in Python" --temp 0 -g -1
# MoE on CPU with 3-bit KV compression (5× smaller KV cache, full ctx)
dotnet run --project src/SharpInference.Cli -c Release -- \
-m models/Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf --tq -p "Implement a BST in C#" --temp 0
# Interactive chat (no -p)
dotnet run --project src/SharpInference.Cli -c Release -- \
-m models/SmolLM2-1.7B-Instruct-Q4_K_M.gguf
# Speculative decoding (~2× faster at temp 0)
dotnet run --project src/SharpInference.Cli -c Release -- \
-m models/Qwen3-8B-Q4_K_M.gguf --draft-model models/SmolLM2-1.7B-Instruct-Q4_K_M.gguf \
-p "Write a binary search in Rust" --temp 0
# API server (OpenAI /v1/chat/completions + Anthropic /v1/messages, port 5000)
SHARPI_MODEL=models/SmolLM2-1.7B-Instruct-Q4_K_M.gguf \
dotnet run --project src/SharpInference.Server -c Release

Two image pipelines are supported, auto-detected from the model filename. Benchmarked on AMD Zen 4 + RTX 4070 Ti (CUDA backend, 4 denoising steps, 512×512 output). The CLI is a one-shot binary, so each invocation pays the full load and text-encoder warmup cost. The "Cached prompt" column is the steady-state cost when the same encoder weights stay resident, e.g. re-rendering inside the server or interactive loop after the first prompt.
| Pipeline | Components (repo • file • size) | Per-run | Cached prompt | Notes |
|---|---|---|---|---|
| Z-Image-Turbo | DiT: jayn7/Z-Image-Turbo-GGUF • z_image_turbo-Q5_K_M.gguf • 5.5 GB<br>Encoder: BennyDaBall/...-AbliteratedV1 • Z-Image-AbliteratedV1.Q5_K_M.gguf • 2.9 GB<br>VAE + tokenizer: Tongyi-MAI/Z-Image-Turbo • vae/ tokenizer/ | ~108 s | ~30 s | Most of the per-run cost is text-encoder warmup (~90 s); DiT ~4 s, VAE ~18 s once weights are hot. Output verified visually. |
| FLUX.1-schnell | DiT: city96/FLUX.1-schnell-gguf • flux1-schnell-Q4_K_S.gguf • ~7 GB<br>Encoders + VAE: comfyanonymous/flux_text_encoders • clip_l.safetensors + t5xxl_fp16.safetensors + ae.safetensors | — | — | 4-step distilled; model not on this benchmark machine |
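The difference between the per-run and cached columns matters most when rendering several images. The sketch below compares the two modes using the Z-Image-Turbo figures from the table. It assumes that the first image in a resident process costs the same as a one-shot run and that every later image costs the cached figure, which is a simplification; the function name is hypothetical.

```python
def total_seconds(n_images: int, per_run_s: float, cached_s: float,
                  resident: bool) -> float:
    """Estimated wall time for n_images, one-shot CLI vs. resident process."""
    if n_images <= 0:
        return 0.0
    if not resident:
        # One-shot CLI: every invocation pays load + text-encoder warmup.
        return n_images * per_run_s
    # Resident process: the first image pays full cost, the rest are cached.
    return per_run_s + (n_images - 1) * cached_s


# Approximate Z-Image-Turbo figures from the table above.
PER_RUN_S, CACHED_S = 108.0, 30.0

for n in (1, 5, 10):
    one_shot = total_seconds(n, PER_RUN_S, CACHED_S, resident=False)
    resident = total_seconds(n, PER_RUN_S, CACHED_S, resident=True)
    print(f"{n:>2} images: one-shot {one_shot:.0f} s, resident {resident:.0f} s")
```

For batches of more than a couple of prompts, keeping the encoder resident (server or interactive loop) dominates the savings, since warmup is the largest single cost.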
Optional 4× upscale via Real-ESRGAN (RealESRGAN_x4plus.safetensors):
runs on CUDA when available, falls back to bicubic.
# Z-Image-Turbo (auto-detects pipeline from filename containing "z_image")
dotnet run --project src/SharpInference.Cli -c Release -- image \
-m models/z_image_turbo-Q5_K_M.gguf \
--vae models/z-image-turbo/vae \
--qwen-encoder models/Z-Image-AbliteratedV1.Q5_K_M.gguf \
--qwen-tokenizer models/z-image-turbo/tokenizer/tokenizer.json \
-p "a serene mountain lake at sunrise" -W 1024 -H 1024 --steps 4 -o landscape.png
# FLUX.1-schnell
dotnet run --project src/SharpInference.Cli -c Release -- image \
-m models/flux1-schnell-Q4_K_S.gguf \
--vae models/flux/ae.safetensors \
--clip-l models/flux/clip_l.safetensors --clip-tokenizer models/flux/tokenizer_clip.json \
--t5xxl models/flux/t5xxl_fp16.safetensors --t5-tokenizer models/flux/tokenizer_t5.json \
-p "a cinematic photograph of a mountain lake" -W 512 -H 512 --steps 4 -o out.png
# With 4× Real-ESRGAN upscale + blend
dotnet run --project src/SharpInference.Cli -c Release -- image \
-m models/z_image_turbo-Q5_K_M.gguf \
--vae models/z-image-turbo/vae \
--qwen-encoder models/Z-Image-AbliteratedV1.Q5_K_M.gguf \
--qwen-tokenizer models/z-image-turbo/tokenizer/tokenizer.json \
--upscaler models/RealESRGAN_x4plus.safetensors --upscale-blend 0.8 \
-p "a fox in autumn forest" -W 512 -H 512 --steps 4 -o fox.png

- Architecture & algorithms: docs/SharpInference-Design.md
- All CLI flags: sharpi-cli --help, sharpi-cli image --help
- Model downloads: scripts/download-model.ps1 -Model <smollm2|qwen3-8b|qwen3-coder-30b-a3b|llama4-scout|z-image-turbo|realesrgan-x4|…>
- Tests: dotnet test (207 tests across 5 projects)
- NativeAOT publish: dotnet publish src/SharpInference.Cli -c Release -r win-x64

Released under the MIT License.