SharpInference

A high-performance LLM inference engine and image generation pipeline written in C# 14 / .NET 10. Runs GGUF models on CPU (AVX2/AVX-512 SIMD) and GPU (Vulkan compute shaders or CUDA cuBLAS). Includes an OpenAI- and Anthropic-compatible API server and native pipelines for Z-Image-Turbo and FLUX.1.

Requirements: .NET 10 SDK, x86-64 CPU with AVX2. Optional: Vulkan-capable GPU (drivers), CUDA Toolkit 11.x/12.x for NVIDIA paths, OpenBLAS in tools/openblas/ for faster batched GEMM. Build with dotnet build -c Release.

Text generation

Supported architectures: llama, llama4, qwen3, qwen3moe. Benchmarked on AMD Zen 4 (12c/24t, DDR4-3200) + RTX 4070 Ti (12 GB), Q4_K_M, --temp 0, -n 80, prompt "Write a Python function that sorts a list using the quicksort algorithm:". Decode rate is forward-pass iterations / decode time, so it counts thinking-mode tokens too. All outputs verified coherent (scripts/bench-all.ps1). Cross-engine top-1 parity vs llama.cpp b8585 verified on Qwen3-8B (byte-identical 60-token greedy decode with matching chat template).

Model	Repo	Size	Backend	Prefill t/s	Decode t/s	Notes
SmolLM2 1.7B Instruct	HuggingFaceTB	1 GB	CPU	16.1	39.9	AVX2 fused dequant-matvec
SmolLM2 1.7B Instruct	(same)	1 GB	Vulkan `-g -1`	44.3	147.9	GLSL `subgroupAdd` reduce
SmolLM2 1.7B Instruct	(same)	1 GB	CUDA `-g -1`	180.2	157.8	NVRTC `__dp4a` + Q8_1
Qwen3 8B	Qwen	5 GB	Vulkan `-g -1`	22.8	46.8	11.4K auto-ctx
Qwen3 8B	(same)	5 GB	Vulkan `-g -1 --tq`	21.5	45.5	3-bit KV → 40 960 ctx
Qwen3 8B	(same)	5 GB	CUDA `-g -1`	65.4	58.0	~2.9× Vulkan prefill
Qwen3 8B	(same)	5 GB	CUDA `-g -1 --tq`	66.1	65.4	3-bit KV → 40 960 ctx; 17 t/s @ 8K, 10 t/s @ 16K
Qwen3-Coder 30B-A3B (MoE)	Qwen	17 GB	CPU	13.0	20.6	128 experts / 8 active
Qwen3-Coder 30B-A3B (MoE)	(same)	17 GB	CPU `--tq`	11.3	20.6	3-bit KV; GPU MoE gated by #2

--backend auto (default) picks CUDA for dense models with full offload (-g -1), Vulkan otherwise. --tq enables 3-bit TurboQuant KV compression (CPU, Vulkan, CUDA; requires headDim ∈ {128, 256}). MoE on GPU is rejected by default — see #2.

CLI examples

# CPU, single-turn, greedy
dotnet run --project src/SharpInference.Cli -c Release -- \
  -m models/SmolLM2-1.7B-Instruct-Q4_K_M.gguf -p "What is 2+2?" --temp 0

# Full GPU offload (auto-picks CUDA on dense + full offload)
dotnet run --project src/SharpInference.Cli -c Release -- \
  -m models/Qwen3-8B-Q4_K_M.gguf -p "Write a quicksort in Python" --temp 0 -g -1

# MoE on CPU with 3-bit KV compression (5× less VRAM, full ctx)
dotnet run --project src/SharpInference.Cli -c Release -- \
  -m models/Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf --tq -p "Implement a BST in C#" --temp 0

# Interactive chat (no -p)
dotnet run --project src/SharpInference.Cli -c Release -- \
  -m models/SmolLM2-1.7B-Instruct-Q4_K_M.gguf

# Speculative decoding (~2× faster at temp 0)
dotnet run --project src/SharpInference.Cli -c Release -- \
  -m models/Qwen3-8B-Q4_K_M.gguf --draft-model models/SmolLM2-1.7B-Instruct-Q4_K_M.gguf \
  -p "Write a binary search in Rust" --temp 0

# API server (OpenAI /v1/chat/completions + Anthropic /v1/messages, port 5000)
SHARPI_MODEL=models/SmolLM2-1.7B-Instruct-Q4_K_M.gguf \
  dotnet run --project src/SharpInference.Server -c Release

Image generation

Two pipelines, auto-detected from model filename. Benchmarked on AMD Zen 4

RTX 4070 Ti (CUDA backend, 4 denoising steps, 512×512 output). The CLI is a one-shot binary, so each invocation pays the full load + text-encoder warmup. The "cached" column is the steady-state cost when the same encoder weights stay resident — e.g., re-rendering inside the server or interactive loop after the first prompt.

Pipeline	Components (repo • file • size)	Per-run	Cached prompt	Notes
Z-Image-Turbo	DiT: jayn7/Z-Image-Turbo-GGUF `z_image_turbo-Q5_K_M.gguf` 5.5 GB Encoder: BennyDaBall/...-AbliteratedV1 `Z-Image-AbliteratedV1.Q5_K_M.gguf` 2.9 GB VAE + tokenizer: Tongyi-MAI/Z-Image-Turbo `vae/` `tokenizer/`	~108 s	~30 s	Most of the per-run cost is text-encoder warmup (~90 s); DiT ~4 s, VAE ~18 s once weights are hot. Output verified visually.
FLUX.1-schnell	DiT: city96/FLUX.1-schnell-gguf `flux1-schnell-Q4_K_S.gguf` ~7 GB Encoders + VAE: comfyanonymous/flux_text_encoders `clip_l.safetensors` + `t5xxl_fp16.safetensors` + `ae.safetensors`	—	—	4-step distilled; model not on this benchmark machine

Optional 4× upscale via Real-ESRGAN (RealESRGAN_x4plus.safetensors): runs on CUDA when available, falls back to bicubic.

CLI examples

# Z-Image-Turbo (auto-detects pipeline from filename containing "z_image")
dotnet run --project src/SharpInference.Cli -c Release -- image \
  -m models/z_image_turbo-Q5_K_M.gguf \
  --vae models/z-image-turbo/vae \
  --qwen-encoder models/Z-Image-AbliteratedV1.Q5_K_M.gguf \
  --qwen-tokenizer models/z-image-turbo/tokenizer/tokenizer.json \
  -p "a serene mountain lake at sunrise" -W 1024 -H 1024 --steps 4 -o landscape.png

# FLUX.1-schnell
dotnet run --project src/SharpInference.Cli -c Release -- image \
  -m models/flux1-schnell-Q4_K_S.gguf \
  --vae models/flux/ae.safetensors \
  --clip-l models/flux/clip_l.safetensors --clip-tokenizer models/flux/tokenizer_clip.json \
  --t5xxl models/flux/t5xxl_fp16.safetensors --t5-tokenizer models/flux/tokenizer_t5.json \
  -p "a cinematic photograph of a mountain lake" -W 512 -H 512 --steps 4 -o out.png

# With 4× Real-ESRGAN upscale + blend
dotnet run --project src/SharpInference.Cli -c Release -- image \
  -m models/z_image_turbo-Q5_K_M.gguf \
  --vae models/z-image-turbo/vae \
  --qwen-encoder models/Z-Image-AbliteratedV1.Q5_K_M.gguf \
  --qwen-tokenizer models/z-image-turbo/tokenizer/tokenizer.json \
  --upscaler models/RealESRGAN_x4plus.safetensors --upscale-blend 0.8 \
  -p "a fox in autumn forest" -W 512 -H 512 --steps 4 -o fox.png

More

Architecture & algorithms: docs/SharpInference-Design.md
All CLI flags: sharpi-cli --help, sharpi-cli image --help
Model downloads: scripts/download-model.ps1 -Model <smollm2|qwen3-8b|qwen3-coder-30b-a3b|llama4-scout|z-image-turbo|realesrgan-x4|…>
Tests: dotnet test (207 tests across 5 projects)
NativeAOT publish: dotnet publish src/SharpInference.Cli -c Release -r win-x64

License

Released under the MIT License.

Name		Name	Last commit message	Last commit date
Latest commit History 25 Commits
.github/workflows		.github/workflows
benchmarks		benchmarks
codebooks		codebooks
docs		docs
samples/SharpInference.Sample.Chat		samples/SharpInference.Sample.Chat
scripts		scripts
shaders		shaders
src		src
tests		tests
.gitignore		.gitignore
CLAUDE.md		CLAUDE.md
Directory.Build.props		Directory.Build.props
LICENSE		LICENSE
README.md		README.md
SharpInference.slnx		SharpInference.slnx

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

SharpInference

Text generation

CLI examples

Image generation

CLI examples

More

License

About

Uh oh!

Releases 4

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

SharpInference

Text generation

CLI examples

Image generation

CLI examples

More

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 4

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages