**⚠ Disclaimer: This is currently only compatible with the Qwen architecture.**

🚀 Preman (Predictor-Manager) MoE Engine

An Out-of-Core Inference Engine for Mixture-of-Experts (MoE) models


⚠️ Project Status: Under Active Development

Preman is a proof-of-concept inference engine designed to run MoE models larger than available VRAM by combining a tiered memory hierarchy with expert-usage prediction.

Current Benchmarks (29GB Qwen1.5-MoE on 15GB T4):

  • Speed: 0.1 – 0.5 tokens/sec
  • VRAM Hit Rate: 70%+
  • Predictor Accuracy: 6% – 55%

🌟 Key Features

  • 3-Tier Memory System: VRAM (L1) ↔ Pinned RAM (L2) ↔ NVMe (L3); the lookup path is sketched after this list

  • Predictor System: Learns expert usage patterns and preloads weights

  • Challenger System: Competing predictor that can replace the main one dynamically

  • Phase Detection: Detects context shifts to prevent stale caching

  • Bare-Metal PyTorch Engine: No frameworks, direct weight streaming
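
To make the tier fallback concrete, here is a minimal sketch of what the L1/L2/L3 lookup can look like. The cache containers, the `expert_{id}.safetensors` file layout, and the function name are illustrative assumptions, not the engine's actual interfaces.

```python
# Minimal sketch of a 3-tier expert lookup (assumed names and file layout).
from safetensors.torch import load_file

def fetch_expert(expert_id, vram, pinned_ram, disk_dir):
    """Return an expert's weights, falling back VRAM -> pinned RAM -> NVMe."""
    if expert_id in vram:                               # L1 hit: already on the GPU
        return vram[expert_id]
    if expert_id in pinned_ram:                         # L2 hit: async copy to the GPU
        weights = {name: t.to("cuda", non_blocking=True)
                   for name, t in pinned_ram[expert_id].items()}
    else:                                               # L3 miss: stream from NVMe
        weights = load_file(f"{disk_dir}/expert_{expert_id}.safetensors",
                            device="cuda")
    vram[expert_id] = weights                           # promote into the L1 tier
    return weights
```

A hit in pinned (page-locked) RAM matters because `non_blocking=True` host-to-device copies can overlap with compute, which is what makes L2 meaningfully faster than a disk load.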

🛠️ Usage

1. Clone repo

```bash
git clone https://github.com/GrandFuzard/preman
cd preman
```

2. Install dependencies

```bash
pip install torch transformers safetensors huggingface_hub psutil
```

3. Prepare model weights

```bash
python slicer.py
```

This will download and convert:

  • Qwen1.5-MoE-A2.7B-Chat

Output directory:

./qwen_engine_weights/
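
For intuition, the per-expert slicing might look roughly like the sketch below. The tensor-name pattern and the output filenames are assumptions for illustration; the real `slicer.py` may organize the files differently.

```python
# Sketch of slicing a checkpoint into per-expert files (assumed layout).
import os
from safetensors.torch import save_file

def slice_experts(state_dict, out_dir="./qwen_engine_weights"):
    """Group expert tensors by expert id and write one file per expert,
    so the engine can later stream experts from disk individually."""
    os.makedirs(out_dir, exist_ok=True)
    experts = {}
    for name, tensor in state_dict.items():
        if ".experts." in name:       # e.g. "...mlp.experts.12.gate_proj.weight"
            expert_id = name.split(".experts.")[1].split(".")[0]
            experts.setdefault(expert_id, {})[name] = tensor
    for expert_id, tensors in experts.items():
        save_file(tensors, os.path.join(out_dir, f"expert_{expert_id}.safetensors"))
```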

4. Run the engine

```bash
python run.py
```

⚠️ Notes

  • First run will be slow (cold cache)
  • Requires ~30GB of disk space
  • GPU recommended (CPU will be extremely slow)
  • Designed for experimentation, not production

📊 What You’ll See

  • Token-by-token generation

  • Internal telemetry (hit-rate math is sketched after this list):

    • VRAM hits
    • RAM hits
    • Disk loads
    • Predictor accuracy
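
As a rough guide, the reported rates follow from those counters as below; the parameter names are assumptions, not the engine's actual field names.

```python
# How the reported rates can be derived from raw counters (assumed names).
def vram_hit_rate(vram_hits, ram_hits, disk_loads):
    total = vram_hits + ram_hits + disk_loads
    return vram_hits / total if total else 0.0

def predictor_accuracy(correct_preloads, total_preloads):
    return correct_preloads / total_preloads if total_preloads else 0.0
```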


🧠 Architecture Overview

Preman works by:

  1. Predicting which experts will be used next
  2. Preloading them into VRAM
  3. Falling back to RAM / disk when wrong

This enables running models larger than total VRAM. A toy sketch of the prediction step follows below.
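
The class below illustrates steps 1 and 2 only; all names are assumptions, and the engine's real predictor (and its Challenger) learn richer usage patterns than raw frequency.

```python
# Toy frequency-based expert predictor (illustrative, not the real API).
from collections import Counter

class FrequencyPredictor:
    """Predicts the next experts as the most frequently routed so far."""
    def __init__(self, top_k=4):
        self.counts = Counter()
        self.top_k = top_k

    def observe(self, routed_expert_ids):
        # Record which experts the router actually activated this step.
        self.counts.update(routed_expert_ids)

    def predict(self):
        # Expert ids worth preloading into VRAM before the next token.
        return [e for e, _ in self.counts.most_common(self.top_k)]
```

After each token, the engine would observe() the router's actual choices and prefetch predict()'s output into VRAM ahead of the next step; the Challenger System can be read as a second predictor of this kind, competing to replace the incumbent.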


🤝 Contributing

Main bottleneck:

  • Predictor accuracy

If you want to help:

  • Improve expert prediction
  • Reduce disk misses
  • Optimize prefetching

Please leave your ideas or working optimizations in issues or pull requests; your time and effort are appreciated.


📜 License

MIT License — free to use and modify.
