**⚠ Disclaimer: This is currently only compatible with the Qwen architecture.**
An Out-of-Core Inference Engine for Mixture-of-Experts (MoE) models
Preman is a proof-of-concept inference engine designed to run MoE models larger than available VRAM using tiered memory + prediction.
Current Benchmarks (29GB Qwen1.5-MoE on 15GB T4):
- Speed: 0.1 – 0.5 tokens/sec
- VRAM Hit Rate: 70%+
- Predictor Accuracy: 6% – 55%
Key Features:
- 3-Tier Memory System: VRAM (L1) ↔ Pinned RAM (L2) ↔ NVMe (L3) (a lookup sketch follows this list)
- Predictor System: Learns expert usage patterns and preloads weights
- Challenger System: Competing predictor that can replace the main one dynamically
- Phase Detection: Detects context shifts to prevent stale caching
- Bare-Metal PyTorch Engine: No frameworks, direct weight streaming
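The tier hierarchy drives every expert fetch. Below is a minimal sketch of how such a lookup path could work, assuming a hypothetical one-file-per-expert layout on disk; the class, attribute, and file names are illustrative, not the engine's actual API.

```python
import torch
from safetensors.torch import load_file

class ExpertStore:
    """Illustrative 3-tier store: VRAM dict -> pinned-RAM dict -> NVMe safetensors."""

    def __init__(self, weights_dir, device="cuda"):
        self.weights_dir = weights_dir
        self.device = device
        self.vram_cache = {}   # L1: expert_id -> GPU tensors
        self.ram_cache = {}    # L2: expert_id -> pinned CPU tensors
        self.stats = {"vram_hits": 0, "ram_hits": 0, "disk_loads": 0}

    def get_expert(self, expert_id):
        # L1: already resident in VRAM
        if expert_id in self.vram_cache:
            self.stats["vram_hits"] += 1
            return self.vram_cache[expert_id]

        # L2: pinned host RAM, cheap async copy to the GPU
        if expert_id in self.ram_cache:
            self.stats["ram_hits"] += 1
        else:
            # L3: cold load from NVMe (hypothetical one-file-per-expert layout)
            self.stats["disk_loads"] += 1
            cpu_weights = load_file(f"{self.weights_dir}/expert_{expert_id}.safetensors")
            self.ram_cache[expert_id] = {k: v.pin_memory() for k, v in cpu_weights.items()}

        weights = {k: v.to(self.device, non_blocking=True)
                   for k, v in self.ram_cache[expert_id].items()}
        self.vram_cache[expert_id] = weights
        return weights
```

In practice the VRAM cache would also need an eviction policy (for example, least-recently-used) so L1 does not overflow.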
    git clone https://github.com/GrandFuzard/preman
    cd preman
    pip install torch transformers safetensors huggingface_hub psutil
    python slicer.py

This will download and convert:
- Qwen1.5-MoE-A2.7B-Chat
Output directory: `./qwen_engine_weights/`
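For a rough idea of what "convert" can mean here, the sketch below regroups an MoE checkpoint's expert tensors into one file per expert so they can be streamed individually. The key pattern and output layout are assumptions for illustration, not necessarily what `slicer.py` actually writes.

```python
import os
from collections import defaultdict
from safetensors.torch import load_file, save_file

def slice_experts(shard_paths, out_dir="./qwen_engine_weights"):
    """Regroup checkpoint shards into one safetensors file per (layer, expert)."""
    os.makedirs(out_dir, exist_ok=True)
    per_expert = defaultdict(dict)
    shared = {}

    for path in shard_paths:
        for name, tensor in load_file(path).items():
            if ".experts." in name:
                # e.g. model.layers.3.mlp.experts.17.gate_proj.weight
                prefix, rest = name.split(".experts.", 1)
                idx = rest.split(".", 1)[0]
                per_expert[f"{prefix}.experts.{idx}"][name] = tensor
            else:
                shared[name] = tensor  # attention, embeddings, router, norms, ...

    for key, tensors in per_expert.items():
        save_file(tensors, os.path.join(out_dir, f"{key}.safetensors"))
    save_file(shared, os.path.join(out_dir, "shared.safetensors"))
```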
    python run.py

During generation you will see:
- Token-by-token generation
- Internal telemetry (hit-rate computation sketched below):
  - VRAM hits
  - RAM hits
  - Disk loads
  - Predictor accuracy
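As a rough guide, the reported VRAM hit rate is simply the share of expert fetches served from the L1 cache; with counters like those in the illustrative `ExpertStore` above it would be computed along these lines:

```python
def vram_hit_rate(stats):
    """Fraction of expert fetches that were already resident in VRAM."""
    total = stats["vram_hits"] + stats["ram_hits"] + stats["disk_loads"]
    return stats["vram_hits"] / total if total else 0.0
```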
Notes:
- First run will be slow (cold cache)
- Requires ~30GB of disk space
- GPU recommended (CPU will be extremely slow)
- Designed for experimentation, not production
Preman works by:
- Predicting which experts will be used next
- Preloading them into VRAM
- Falling back to RAM / disk when wrong
This enables running models larger than total VRAM; a minimal sketch of the loop follows.
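The sketch below shows a predict-then-prefetch step, reusing the illustrative `ExpertStore` from above together with a toy frequency predictor; none of these names are the engine's real components.

```python
from collections import Counter

class FrequencyPredictor:
    """Toy predictor: guess the experts seen most often in a recent window."""

    def __init__(self, window=64):
        self.window = window
        self.history = []

    def observe(self, expert_ids):
        self.history.extend(expert_ids)
        self.history = self.history[-self.window:]

    def predict(self, top_k=4):
        return [eid for eid, _ in Counter(self.history).most_common(top_k)]


def generate_step(router_choices, store, predictor):
    """One decode step: use the routed experts, then prefetch predicted ones."""
    # Fetch the experts the router actually selected; misses fall back to RAM/disk.
    experts = [store.get_expert(eid) for eid in router_choices]

    # Learn from what just happened and warm VRAM for the next step.
    predictor.observe(router_choices)
    for eid in predictor.predict():
        store.get_expert(eid)   # a real engine would prefetch asynchronously
    return experts
```

When a prediction is wrong, the miss simply falls through to pinned RAM or NVMe inside `get_expert`, which is why predictor accuracy dominates throughput.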
Main bottleneck:
- Predictor accuracy
If you want to help:
- Improve expert prediction (a possible starting point is sketched below)
- Reduce disk misses
- Optimize prefetching
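If you want to experiment with prediction, one possible starting point (not the engine's internals) is a challenger-style harness that scores the main predictor against an alternative and promotes whichever has been more accurate, in the spirit of the Challenger System listed above.

```python
class ChallengerArena:
    """Run a main predictor and a challenger side by side; promote the better one."""

    def __init__(self, main, challenger, eval_every=256):
        self.main, self.challenger = main, challenger
        self.eval_every = eval_every
        self.scores = {"main": 0, "challenger": 0}
        self.steps = 0

    def step(self, actual_expert_ids):
        # Score each predictor against what the router actually chose this step.
        for name, predictor in (("main", self.main), ("challenger", self.challenger)):
            predicted = set(predictor.predict())
            self.scores[name] += len(predicted & set(actual_expert_ids))
            predictor.observe(actual_expert_ids)

        self.steps += 1
        if self.steps % self.eval_every == 0 and self.scores["challenger"] > self.scores["main"]:
            # Promote the challenger and restart the comparison.
            self.main, self.challenger = self.challenger, self.main
            self.scores = {"main": 0, "challenger": 0}

        return self.main.predict()  # prefetch according to the current main predictor
```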
Please share your ideas or working optimizations in issues or pull requests; your time and effort are appreciated.
MIT License — free to use and modify.