CPU/numpy micro-benchmarks: pure Python, no GPU required.
Measured on Apple Silicon M-series (or equivalent CPU).
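The tables report a latency in microseconds per operation, plus a "best case" (minimum) figure for some of them. A minimal sketch of how such numbers can be collected is shown here. The helper name `bench_us`, the warmup and repeat counts, and the choice of the mean as the headline statistic are all assumptions for illustration; the actual harness used for these tables is not shown in this document.

```python
import statistics
import time


def bench_us(fn, *, warmup=100, repeats=1000):
    """Return (mean_us, min_us) for repeated calls to fn().

    Hypothetical helper: the real benchmark harness may use different
    statistics, repeat counts, or timers.
    """
    # Warm up caches and any lazy initialisation before timing.
    for _ in range(warmup):
        fn()

    samples = []
    for _ in range(repeats):
        start = time.perf_counter_ns()
        fn()
        samples.append((time.perf_counter_ns() - start) / 1_000)  # ns -> µs

    return statistics.fmean(samples), min(samples)


if __name__ == "__main__":
    data = list(range(256))
    mean_us, min_us = bench_us(lambda: sorted(data, reverse=True))
    print(f"mean {mean_us:.2f} µs, min {min_us:.2f} µs")
```

Reporting the minimum alongside the typical value is useful for sub-microsecond operations, where scheduler noise and timer overhead can dominate individual samples.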
Wave 23: Multi-Modal & Long Context Intelligence
| Module | Operation | Latency (µs) | Notes |
|---|---|---|---|
| VisionKVFuse | append() fused image+text KV | 1.43 | Cross-modal KV slot append |
| VisionKVFuse | get_kv() hit | 1.37 | Fused retrieval |
| ImageTokenPrune | prune() 256 tokens → key 64 | 1070 | Importance-score pruning |
| ImageTokenPrune | prune() best case | 980 | Min latency |
| RAGPrefetch | record_access() 8-doc window | 4.89 | Rolling LRU access record |
| RAGPrefetch | get_candidates() top_k=4 | 4.87 | Recency-weighted ranking |
| CoTCompress | compress() 256-token input | 75.8 | BM25 + entropy filtering |
| CoTCompress | compress() 64-token input | 26.0 | Short-context path |
| MultiModalBatch | add_request() text+image | 0.67 | Priority-queue enqueue |
| MultiModalBatch | next_batch() | 0.28 | O(1) dequeue |
| ContextualRerank | rerank() 16 docs | 87.9 | BM25 relevance re-scoring |
| ContextualRerank | rerank() with query embedding | 42.7 | Query-conditioned rerank |
| CrossModalAttn | forward() seq=64 h=4 d=32 | 455 | Cross-attention over modalities |
| CrossModalAttn | forward() best case | 409 | Min latency |
| HierarchicalKV | put() 8 tokens kv=(8,4,32) | 1.74 | L1 → L2 tier write |
| HierarchicalKV | get() hit (33% hit rate) | 0.72 | L1 cache hit path |
| HierarchicalKV | get() miss | 0.25 | L2 fallback lookup |
| StreamRAG | inject() 4 passages | 3.47 | Passage token injection |
| StreamRAG | retrieve() 4 passages | 21.4 | BM25 similarity retrieval |
| CrossDocAttn | forward() seq=64 docs=2 | 548 | Cross-document attention |
| CrossDocAttn | forward() best case | 496 | Min latency |
| VideoFramePrune | prune_temporal() 16 frames | 32.2 | Temporal motion difference |
| VideoFramePrune | prune_spatial() 16 frames | 28.1 | Spatial entropy score |
| EmbeddingGate | gate() 8 tokens hidden=64 | 37.3 | Cosine-threshold gating |
| EmbeddingGate | gate() best case | 26.4 | Min latency |
| LongContextChunk | chunk() seq=2048 chunk=256 | 207 | 8-chunk split |
| LongContextChunk | chunk() seq=256 chunk=256 | 0.65 | Single-chunk fast path |
| ModalityRouter | route()+complete() text | 0.65 | Type-dispatch routing |
| ModalityRouter | route() best case | 0.54 | Min latency |
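The two LongContextChunk rows differ by more than two orders of magnitude (207 µs for an 8-chunk split against 0.65 µs for the single-chunk fast path). The sketch below illustrates why a fast path can be that cheap: when the sequence already fits in one chunk, no slicing is needed. The function `chunk` here is a hypothetical stand-in written for illustration; it is not the LongContextChunk implementation, whose internals are not shown in this document.

```python
from typing import List, Sequence


def chunk(tokens: Sequence[int], chunk_size: int) -> List[Sequence[int]]:
    """Split tokens into consecutive chunks of at most chunk_size items.

    Hypothetical sketch, not the LongContextChunk API.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")

    # Fast path: the input fits in a single chunk, so return it without copying.
    if len(tokens) <= chunk_size:
        return [tokens]

    # General path: one slice per chunk; the last chunk may be shorter.
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]


if __name__ == "__main__":
    print(len(chunk(list(range(2048)), 256)))  # 8
    print(len(chunk(list(range(256)), 256)))   # 1
```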
Wave 24: Quantisation Evolution & Model Surgery
| Module | Operation | Latency (µs) | Notes |
|---|---|---|---|
| TernaryQuant | quantize() 256×256 weight | 719 | 1.58-bit {-1,0,+1} · 3.1% sparsity |
| TernaryQuant | dequantize() 256×256 | 38.5 | Scale multiply |
| BinaryAttn | forward() seq=64 h=4 d=32 | 224 | XNOR/popcount attention |
| BinaryAttn | forward() best case | 199 | Min latency |
| StructuredPrune | prune() 128×256 → 50% sparse | 1255 | Group L2-norm row pruning |
| StructuredPrune | prune() best case | 1156 | Min latency |
| LayerFusion | cosine_similarity() h=256 | 20.1 | Layer redundancy probe |
| LayerFusion | fuse() 2-layer merge 256-dim | 109 | Weighted average merge |
| WeightSharing | get_effective_weight() shared | 25.3 | Offset-based shared tensor |
| WeightSharing | memory_ratio | 0.25 | 4 layers → 1 copy |
| QuantCalib | calibrate_minmax() 128×256 | 606 | Per-channel min/max pass |
| SparseWeight | compress() 128×256 30% sparse | 1316 | CSR format conversion |
| SparseWeight | decompress() CSR → dense | 152 | Dense reconstruction |
| SparseWeight | compression_ratio | 1.33× | vs dense storage |
| DeltaCompress | compress() 256×256 SVD-delta | 9087 | SVD rank-16 delta fit |
| DeltaCompress | decompress() | 23.8 | rank-16 matmul |
| DeltaCompress | compression_ratio | 7.98× | SVD delta vs full weight |
| ModelSurgery | plan() 3-layer graph | 0.59 | Surgery plan generation |
| ModelSurgery | estimate_reduction() | 0.45 | Perf delta estimation |
| ZeroQuantV2 | quantize() 256×256 INT8 | 233 | Outlier-separated quant |
| ZeroQuantV2 | dequantize() INT8 → float | 66.0 | Per-group dequant |
| ZeroQuantV2 | outlier_rate | 1.17% | Weight outlier fraction |
| GPTQLayer | calibrate() 32×220 inputs | 1053 | Hessian-gated GPTQ update |
| SparseMoE | route() 4 experts · 2 top-k | 58.3 | Load-balanced expert gate |
| AWQv2 | calibrate() 128×256 | 73402 | One-time activation calibration |
| AWQv2 | quantize() post-calibration | 64.4 | Per-row INT4 scale |
| IterPrune | prune_step() 5 iterations | 956 | Magnitude prune + regrow |
| IterPrune | prune_step() 10 iterations | 784 | 70% sparsity achieved |
| IterPrune | sparsity_at_step10 | 70% | Target convergence |
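The DeltaCompress rows describe a rank-16 SVD fit of a 256×256 delta with a 7.98× compression ratio. The sketch below shows one way such a scheme can work and one parameter accounting that reproduces that figure: storing U (256×16), S (16) and Vt (16×256) takes 8208 values against 65536 for the dense matrix. The function names and the storage accounting are assumptions for illustration; the module's actual layout is not shown in this document.

```python
import numpy as np


def compress_delta(base, tuned, rank=16):
    """Fit a rank-`rank` approximation of (tuned - base) via truncated SVD."""
    u, s, vt = np.linalg.svd(tuned - base, full_matrices=False)
    return u[:, :rank], s[:rank], vt[:rank, :]


def decompress_delta(base, u, s, vt):
    """Rebuild the tuned weight as base + (U * S) @ Vt."""
    return base + (u * s) @ vt


def compression_ratio(shape, rank=16):
    """Dense element count divided by stored element count (U, S, Vt).

    Assumed accounting: all factors stored at the same width as the dense weight.
    """
    rows, cols = shape
    stored = rows * rank + rank + rank * cols
    return (rows * cols) / stored


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    base = rng.standard_normal((256, 256))
    # Synthetic delta of exact rank 16, so truncation is lossless in this demo.
    delta = rng.standard_normal((256, 16)) @ rng.standard_normal((16, 256))
    tuned = base + delta

    u, s, vt = compress_delta(base, tuned)
    restored = decompress_delta(base, u, s, vt)
    print(round(compression_ratio(base.shape), 2))  # 7.98
    print(np.allclose(restored, tuned))
```

This also matches the asymmetry in the table: compression needs a full SVD, while decompression is a single small matrix multiply. Real fine-tuning deltas are rarely exactly low rank, so truncation normally introduces some reconstruction error.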
Reference: Paper-Reported Technique Improvements
Note: These are technique-level estimates derived from published papers.
End-to-end validation on Squish with a loaded model on Apple Silicon
has not yet been run for this wave.
See dev/benchmarks/bench_eoe.py for the real-hardware benchmark harness.
| Technique | Improvement | Module |
|---|---|---|
| Image token reduction | 50–70% fewer tokens | ImageTokenPrune importance scoring |
| Video token reduction | 60–80% fewer tokens | VideoFramePrune temporal+spatial |
| CoT length reduction | 30–50% fewer tokens | CoTCompress BM25+entropy |
| KV memory (hierarchical) | hot-tier hit rate driven | HierarchicalKV L1/L2 |
| RAG prefetch hit rate | access-pattern driven | RAGPrefetch predictive fetch |
| Cross-modal attention FLOPs | seq × modality_len | CrossModalAttn |
| Weight size (1.58-bit) | ~5× vs FP16 | TernaryQuant {-1,0,+1} |
| Weight size (binary) | ~16× vs FP16 | BinaryAttn 1-bit |
| Structured sparsity | 2:4 or custom ratio | StructuredPrune group L2 |
| Delta compression | 7.98× ratio measured | DeltaCompress SVD rank-16 |
| Iterative pruning sparsity | 70% sparsity converges | IterPrune magnitude prune |
| Model size (AWQ INT4) | 4× vs FP16 | AWQv2 activation-aware |
| Expert routing overhead | <100 µs / dispatch | SparseMoE load-balanced |
| Layer fusion memory | N_fused × layer_size | LayerFusion cosine-redundancy |
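The "2:4" entry in the structured sparsity row refers to a pattern in which every group of four consecutive weights keeps exactly two non-zero values, giving 50% sparsity. A minimal sketch of that pattern follows. It selects survivors by per-element magnitude, which is a common choice but is not the group L2-norm criterion the table lists for StructuredPrune, and the function name `prune_2_of_4` is hypothetical.

```python
import numpy as np


def prune_2_of_4(weight):
    """Zero the two smallest-magnitude entries in each group of four per row.

    Illustrative magnitude-based 2:4 pruning; not the StructuredPrune method.
    """
    rows, cols = weight.shape
    if cols % 4 != 0:
        raise ValueError("column count must be a multiple of 4")

    groups = weight.reshape(rows, cols // 4, 4)
    # Indices of the two smallest |w| within each group of four.
    drop = np.argsort(np.abs(groups), axis=-1)[..., :2]
    mask = np.ones(groups.shape, dtype=bool)
    np.put_along_axis(mask, drop, False, axis=-1)
    return (groups * mask).reshape(rows, cols)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.standard_normal((128, 256))
    pruned = prune_2_of_4(w)
    print((pruned == 0).mean())  # 0.5
```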
Accuracy Baseline
Wave 24 quantisation modules trade off compression ratio against generation quality.
The baseline below is from the Squish INT8 compressed Qwen2.5-1.5B reference run.
Per-module quality impact (perplexity delta, accuracy delta) is measured separately
via dev/benchmarks/bench_wave23_24.py with a loaded model.
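For reference, perplexity delta is conventionally the difference between the perplexity of the modified model and that of the baseline on the same text, where perplexity is the exponential of the mean negative log-likelihood per token. The sketch below shows that standard definition only; the function names are hypothetical, and how dev/benchmarks/bench_wave23_24.py computes its numbers is not shown in this document.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """exp(mean negative log-likelihood), with natural-log token probabilities."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def perplexity_delta(baseline_logprobs: Sequence[float],
                     candidate_logprobs: Sequence[float]) -> float:
    """Positive values mean the candidate (e.g. quantised) model is worse."""
    return perplexity(candidate_logprobs) - perplexity(baseline_logprobs)


if __name__ == "__main__":
    baseline = [math.log(0.5)] * 4    # every token predicted with p = 0.5
    candidate = [math.log(0.25)] * 4  # every token predicted with p = 0.25
    print(perplexity(baseline), perplexity(candidate))
    print(perplexity_delta(baseline, candidate))
```

Both log-probability lists should come from the same evaluation text and tokenisation, otherwise the delta is not comparable.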