**⚠ Disclaimer: This is currently only compatible with the Qwen architecture.**
An Out-of-Core Inference Engine for Mixture-of-Experts (MoE) models
Preman is a proof-of-concept inference engine designed to run MoE models larger than available VRAM using tiered memory + prediction.
Current Benchmarks (29GB Qwen1.5-MoE on 15GB T4):
- Speed: 0.1 – 0.5 tokens/sec
- VRAM Hit Rate: 70%+
- Predictor Accuracy: 6% – 55%
Key Features:
- 3-Tier Memory System: VRAM (L1) ↔ Pinned RAM (L2) ↔ NVMe (L3) (a lookup sketch follows this list)
- Predictor System: Learns expert usage patterns and preloads weights
- Challenger System: Competing predictor that can replace the main one dynamically
- Phase Detection: Detects context shifts to prevent stale caching
- Bare-Metal PyTorch Engine: No frameworks, direct weight streaming
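The tier hierarchy drives every expert fetch. Below is a minimal sketch of how such a lookup path could work, assuming a hypothetical one-file-per-expert layout on disk; the class, attribute, and file names are illustrative, not the engine's actual API.

```python
import torch
from safetensors.torch import load_file

class ExpertStore:
    """Illustrative 3-tier store: VRAM dict -> pinned-RAM dict -> NVMe safetensors."""

    def __init__(self, weights_dir, device="cuda"):
        self.weights_dir = weights_dir
        self.device = device
        self.vram_cache = {}   # L1: expert_id -> GPU tensors
        self.ram_cache = {}    # L2: expert_id -> pinned CPU tensors
        self.stats = {"vram_hits": 0, "ram_hits": 0, "disk_loads": 0}

    def get_expert(self, expert_id):
        # L1: already resident in VRAM
        if expert_id in self.vram_cache:
            self.stats["vram_hits"] += 1
            return self.vram_cache[expert_id]

        # L2: pinned host RAM, cheap async copy to the GPU
        if expert_id in self.ram_cache:
            self.stats["ram_hits"] += 1
        else:
            # L3: cold load from NVMe (hypothetical one-file-per-expert layout)
            self.stats["disk_loads"] += 1
            cpu_weights = load_file(f"{self.weights_dir}/expert_{expert_id}.safetensors")
            self.ram_cache[expert_id] = {k: v.pin_memory() for k, v in cpu_weights.items()}

        weights = {k: v.to(self.device, non_blocking=True)
                   for k, v in self.ram_cache[expert_id].items()}
        self.vram_cache[expert_id] = weights
        return weights
```

In practice the VRAM cache would also need an eviction policy (for example, least-recently-used) so L1 does not overflow.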
    git clone https://github.com/GrandFuzard/preman
    cd preman
    pip install torch transformers safetensors huggingface_hub psutil
    python slicer.py

This will download and convert:
- Qwen1.5-MoE-A2.7B-Chat
Output directory: `./qwen_engine_weights/`
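For a rough idea of what "convert" can mean here, the sketch below regroups an MoE checkpoint's expert tensors into one file per expert so they can be streamed individually. The key pattern and output layout are assumptions for illustration, not necessarily what `slicer.py` actually writes.

```python
import os
from collections import defaultdict
from safetensors.torch import load_file, save_file

def slice_experts(shard_paths, out_dir="./qwen_engine_weights"):
    """Regroup checkpoint shards into one safetensors file per (layer, expert)."""
    os.makedirs(out_dir, exist_ok=True)
    per_expert = defaultdict(dict)
    shared = {}

    for path in shard_paths:
        for name, tensor in load_file(path).items():
            if ".experts." in name:
                # e.g. model.layers.3.mlp.experts.17.gate_proj.weight
                prefix, rest = name.split(".experts.", 1)
                idx = rest.split(".", 1)[0]
                per_expert[f"{prefix}.experts.{idx}"][name] = tensor
            else:
                shared[name] = tensor  # attention, embeddings, router, norms, ...

    for key, tensors in per_expert.items():
        save_file(tensors, os.path.join(out_dir, f"{key}.safetensors"))
    save_file(shared, os.path.join(out_dir, "shared.safetensors"))
```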
    python run.py

During generation you will see:
- Token-by-token generation
- Internal telemetry (hit-rate computation sketched below):
  - VRAM hits
  - RAM hits
  - Disk loads
  - Predictor accuracy
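As a rough guide, the reported VRAM hit rate is simply the share of expert fetches served from the L1 cache; with counters like those in the illustrative `ExpertStore` above it would be computed along these lines:

```python
def vram_hit_rate(stats):
    """Fraction of expert fetches that were already resident in VRAM."""
    total = stats["vram_hits"] + stats["ram_hits"] + stats["disk_loads"]
    return stats["vram_hits"] / total if total else 0.0
```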
Notes:
- First run will be slow (cold cache)
- Requires ~30GB of disk space
- GPU recommended (CPU will be extremely slow)
- Designed for experimentation, not production
Preman works by:
- Predicting which experts will be used next
- Preloading them into VRAM
- Falling back to RAM / disk when wrong
This enables running models larger than total VRAM; a minimal sketch of the loop follows.
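The sketch below shows a predict-then-prefetch step, reusing the illustrative `ExpertStore` from above together with a toy frequency predictor; none of these names are the engine's real components.

```python
from collections import Counter

class FrequencyPredictor:
    """Toy predictor: guess the experts seen most often in a recent window."""

    def __init__(self, window=64):
        self.window = window
        self.history = []

    def observe(self, expert_ids):
        self.history.extend(expert_ids)
        self.history = self.history[-self.window:]

    def predict(self, top_k=4):
        return [eid for eid, _ in Counter(self.history).most_common(top_k)]


def generate_step(router_choices, store, predictor):
    """One decode step: use the routed experts, then prefetch predicted ones."""
    # Fetch the experts the router actually selected; misses fall back to RAM/disk.
    experts = [store.get_expert(eid) for eid in router_choices]

    # Learn from what just happened and warm VRAM for the next step.
    predictor.observe(router_choices)
    for eid in predictor.predict():
        store.get_expert(eid)   # a real engine would prefetch asynchronously
    return experts
```

When a prediction is wrong, the miss simply falls through to pinned RAM or NVMe inside `get_expert`, which is why predictor accuracy dominates throughput.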
Main bottleneck:
- Predictor accuracy
If you want to help:
- Improve expert prediction (a possible starting point is sketched below)
- Reduce disk misses
- Optimize prefetching
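If you want to experiment with prediction, one possible starting point (not the engine's internals) is a challenger-style harness that scores the main predictor against an alternative and promotes whichever has been more accurate, in the spirit of the Challenger System listed above.

```python
class ChallengerArena:
    """Run a main predictor and a challenger side by side; promote the better one."""

    def __init__(self, main, challenger, eval_every=256):
        self.main, self.challenger = main, challenger
        self.eval_every = eval_every
        self.scores = {"main": 0, "challenger": 0}
        self.steps = 0

    def step(self, actual_expert_ids):
        # Score each predictor against what the router actually chose this step.
        for name, predictor in (("main", self.main), ("challenger", self.challenger)):
            predicted = set(predictor.predict())
            self.scores[name] += len(predicted & set(actual_expert_ids))
            predictor.observe(actual_expert_ids)

        self.steps += 1
        if self.steps % self.eval_every == 0 and self.scores["challenger"] > self.scores["main"]:
            # Promote the challenger and restart the comparison.
            self.main, self.challenger = self.challenger, self.main
            self.scores = {"main": 0, "challenger": 0}

        return self.main.predict()  # prefetch according to the current main predictor
```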
Please share your ideas or working optimizations in issues or pull requests; your time and effort are appreciated.
MIT License — free to use and modify.