Squish v4 — Wave 15+16 Benchmark Results

CPU/NumPy micro-benchmarks in pure Python; no GPU is required. Measured on an Apple Silicon M-series machine (or an equivalent CPU).
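For readers who want to reproduce numbers of this kind, the following is a minimal sketch of a per-call latency timer. It is not the harness used to produce the tables in this document; `bench_us` and its parameters (`warmup`, `repeats`, `inner`) are hypothetical names chosen for illustration. It relies only on the standard library and reports the median latency in microseconds, amortizing timer overhead over `inner` calls so that sub-microsecond operations remain measurable.

```python
import statistics
import time


def bench_us(fn, *args, warmup=10, repeats=50, inner=100):
    """Return the median per-call latency of fn(*args) in microseconds.

    Hypothetical helper for illustration; not part of Squish.
    """
    # Warm-up calls so one-time costs (lazy imports, caches) are excluded.
    for _ in range(warmup):
        fn(*args)

    samples = []
    for _ in range(repeats):
        start = time.perf_counter_ns()
        # Run the call several times per sample to amortize timer overhead.
        for _ in range(inner):
            fn(*args)
        elapsed_ns = time.perf_counter_ns() - start
        samples.append(elapsed_ns / inner / 1_000.0)  # ns per call -> us

    # The median is less sensitive to scheduler noise than the mean.
    return statistics.median(samples)


if __name__ == "__main__":
    data = list(range(1024))
    print(f"sum() over 1024 ints: {bench_us(sum, data):.2f} us")
```

The median over repeated samples is one reasonable choice; a harness that reports the minimum or the mean would give somewhat different figures for the same operation.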


Wave 15 — Serving Intelligence + KV Architecture Evolution

| Module | Operation | Latency (µs) | Notes |
|---|---|---|---|
| AdaServe | `get_gamma()` tight SLO | 0.91 | SLO-customized gamma selection |
| AdaServe | `get_gamma()` relaxed SLO | 0.83 | |
| ConfSpec | `verify_step()` flat logits | 131.89 | Full verification path |
| ConfSpec | `verify_step()` peaked logits | 127.56 | Auto-accept path (high confidence) |
| SeqPacking | `pack()` 32 short seqs | 2754.3 | 8–64 token sequences |
| SeqPacking | `pack()` 8 long seqs | 43442.3 | 128–512 token sequences |
| MetaReasoner | `compute_entropy()` 32k | 448.64 | Static method |
| MetaReasoner | `step()` 32k vocab | 0.12 | Per-token thinking budget decision |
| YOCO | `append()` seq=64 dim=128 | 0.59 | KV append to shared store |
| YOCO | `get_shared_kv()` | 2230.51 | Retrieve cached KV for cross-decoder layers |
| DiffKV | `get_policy()` | 1.37 | Per-head precision policy lookup |
| DiffKV | `record_attention()` 4×4 | 5.66 | Attention pattern accumulation |
| ParisKV | `encode()` batch=32 dim=128 | 29.4 | Online codebook assignment |
| ParisKV | `decode()` batch=32 | 3.9 | Codebook reconstruction |
| ParisKV | `online_update()` batch=8 | 107.2 | Drift-corrected centroid update |
| KVTuner | `search()` 32 layers | 4212.2 | Sensitivity-aware bit assignment |
| CLA | `CLASchedule.from_config()` | 19.49 | Cross-layer attention schedule generation |
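Several rows above depend on how flat or peaked the logits are (the ConfSpec verification paths and the MetaReasoner entropy step). As a rough illustration, and assuming that the entropy in question is the Shannon entropy of the softmax distribution over the vocabulary (the tables do not specify this), a numerically stable computation could look like the sketch below. `softmax_entropy` is a hypothetical name and is not the Squish implementation, which is NumPy-based.

```python
import math


def softmax_entropy(logits):
    """Shannon entropy (in nats) of softmax(logits).

    Hypothetical helper for illustration. Subtracting the max logit
    before exponentiating avoids overflow for large logits.
    """
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    # With p_i = e_i / z and log p_i = (x_i - m) - log z:
    #   H = -sum_i p_i log p_i = log z - sum_i e_i * (x_i - m) / z
    weighted = sum(e * (x - m) for e, x in zip(exps, logits))
    return math.log(z) - weighted / z


if __name__ == "__main__":
    vocab = 32_000
    flat = [0.0] * vocab                   # uniform distribution
    peaked = [0.0] * vocab
    peaked[123] = 50.0                     # one dominant token
    print(f"flat:   {softmax_entropy(flat):.4f} nats")
    print(f"peaked: {softmax_entropy(peaked):.4e} nats")
```

Flat logits give the maximum entropy, ln(V) for a vocabulary of size V, while strongly peaked logits give an entropy near zero. That is the kind of signal a confidence-based auto-accept path or a thinking-budget decision could threshold on.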

Wave 16 — Heterogeneous Compute + Advanced Spec-Decode

| Module | Operation | Latency (µs) | Notes |
|---|---|---|---|
| Dovetail | `verify_one()` vocab=32k | 602.4 | CPU target verification |
| PIPO | `run_layer()` in=out=4096 | 1376.2 | INT4 dequant + matmul with prefetch |
| MobileMoE | `route()` single 128 experts | 14.88 | Expert selection |
| MobileMoE | `route_batch()` 32 tokens | 486.7 | |
| OnlineSD | `record()` hidden=4096 | 1.40 | Trace buffer append |
| LookaheadReasoning | `run_cycle()` k=4 | 13.7 | Parallel step verification cycle |
| SparseSpec | `PillarAttnCache.update()` cap=4096 | 1.20 | Attention pillar accumulation |
| SparseSpec | `top_k_indices()` k=205 | 24.4 | Sparse position selection |
| FRSpec | `head.forward()` top-25% vocab | 4095.0 | Compressed draft logits |
| FRSpec | `compress_logits()` 32k→subset | 12.6 | Vocab projection |
| FRSpec | `expand_logits()` subset→32k | 21.8 | Full-vocab restore |
| LongSpec | `LongSpecHead.forward()` h=4096 | 12434.7 | Shared-KV draft head |
| ForeLen | `EGTPPredictor.predict()` | 99.12 | Entropy histogram → length |
| ForeLen | `PLPPredictor.update()` | 1.42 | Exponential decay estimate |
| RASD | `CorpusIndex.search()` 1k seqs | 0.72 | Prefix-tree lookup |
| RASD | `build_retrieval_tree()` | 1.83 | Draft tree construction |
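The FRSpec rows describe a projection from the full vocabulary to a subset and back. The sketch below shows one way such a round trip could work, under stated assumptions: the subset is a fixed list of token ids (`keep_ids`, hypothetical), compression is a gather, and expansion scatters the subset logits into a full-size vector filled with negative infinity so that dropped tokens receive zero probability after softmax. The function names mirror the table, but the signatures and the fill value are assumptions and not the Squish API.

```python
import math


def compress_logits(full_logits, keep_ids):
    """Gather the logits of the retained token ids (assumed behavior)."""
    return [full_logits[i] for i in keep_ids]


def expand_logits(sub_logits, keep_ids, vocab_size):
    """Scatter subset logits back into a full-vocabulary vector.

    Tokens outside the subset get -inf (an assumption), which maps to
    probability zero under softmax.
    """
    full = [-math.inf] * vocab_size
    for token_id, value in zip(keep_ids, sub_logits):
        full[token_id] = value
    return full


if __name__ == "__main__":
    vocab_size = 8
    keep_ids = [1, 4, 6]                        # hypothetical retained subset
    full = [0.5, 2.0, -1.0, 0.0, 3.5, 0.1, 1.2, -0.7]
    sub = compress_logits(full, keep_ids)
    restored = expand_logits(sub, keep_ids, vocab_size)
    print(sub)
    print(restored)
```

The round trip is lossless on the retained ids and discards everything else, so draft quality depends on how much probability mass the retained subset covers.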

Reference: Paper-Reported Technique Improvements

Note: These are technique-level estimates derived from published papers. End-to-end validation on Squish with a loaded model on Apple Silicon has not yet been run for this wave. See dev/benchmarks/bench_eoe.py for the real-hardware benchmark harness.

| Technique | Improvement | Module |
|---|---|---|
| KV memory (YOCO) | 50% reduction | YOCO (only cross-decoder layers use KV) |
| KV memory (DiffKV) | 2.7–5.7× compression | DiffKV asymmetric K/V precision |
| KV memory (KVTuner) | vs naive quant | KVTuner mixed-precision calibration |
| CoT decode energy | 44–89% saving | MetaReasoner dynamic thinking budget |
| Batch throughput | 1.8× effective | SeqPacking barrel-effect elimination |
| Spec decode throughput | 2.13× | SparseSpec dynamic sparse self-speculation |
| Reasoning throughput | 2.1× | LookaheadReasoning parallel step verification |
| Offloaded model throughput | 1.7× | PIPO pipelined prefetch offloading |
| Heterogeneous throughput | | Dovetail CPU+GPU spec decode |
| Draft acceptance | +5–8 pp | OnlineSD continuous adaptation |
| Length prediction (MAE) | 29% ↓ vs TRAIL | ForeLen entropy-guided prediction |
| Corpus hit rate | 40–60% | RASD retrieval-augmented spec decode |
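To put the KV memory rows in concrete terms, the arithmetic below applies the paper-reported ratios to a baseline cache size. The model configuration (32 layers, 8 KV heads, head dimension 128, 4096-token context, 2-byte elements) is purely hypothetical and does not describe any model benchmarked here. The resulting figures are illustrative estimates, not measurements, and each ratio is applied independently.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Bytes needed for keys plus values across all layers, one sequence."""
    # The factor 2 accounts for storing both a key and a value tensor.
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem


MIB = 1024 * 1024

# Hypothetical configuration, chosen only to make the arithmetic concrete.
baseline = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128, seq_len=4096)

yoco = baseline * 0.5          # "50% reduction" from the table
diffkv_low = baseline / 2.7    # lower end of "2.7-5.7x compression"
diffkv_high = baseline / 5.7   # upper end of the same range

if __name__ == "__main__":
    print(f"baseline: {baseline / MIB:.1f} MiB")
    print(f"YOCO:     {yoco / MIB:.1f} MiB")
    print(f"DiffKV:   {diffkv_high / MIB:.1f} to {diffkv_low / MIB:.1f} MiB")
```

For this hypothetical configuration the baseline is 512 MiB per sequence, so a 50% reduction corresponds to 256 MiB.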

Accuracy Baseline (unchanged — v4 operates on KV / serving paths)

| Task | Score |
|---|---|
| ARC-Easy (acc_norm) | 73.5% |
| HellaSwag (acc_norm) | 62.0% |
| WinoGrande (acc) | 67.0% |
| PIQA (acc_norm) | 76.5% |