# Wave 26 — Distributed Inference & Production Reliability
| Module | Operation | Latency (µs) / Size | Notes |
|---|---|---|---|
| TensorParallelShard | shard() 256×256 weight | 5.95 | Row/column split |
| TensorParallelShard | forward() simulated all-reduce | 15.9 | Linear + all-reduce stub |
| SequenceParallelScatter | scatter() seq=512 h=4 d=32 | 5.96 | Ulysses seq-dim scatter |
| SequenceParallelScatter | gather() seq=512 | 39.1 | All-gather reconstruction |
| SequenceParallelScatter | communication_bytes | 1.5 MB | Per all-gather payload |
| KVMigrator | pack() kv=(8,4,32) | 88.9 | Live KV state serialise |
| KVMigrator | unpack() packed state | 77.2 | Deserialise + restore |
| KVMigrator | packed_bytes | 524 KB | Wire payload size |
| DisaggPrefillNode | prefill() seq=64 h=4 d=32 | 2354 | Prefill-only forward |
| DisaggDecodeNode | decode_step() | 0.41 | Receive KV + single decode step |
| PreemptScheduler | preempt+swap+resume() | 4.28 | Swap-based preempt cycle |
| PreemptScheduler | preempt+recompute+resume() | 1.24 | Recompute-based preempt cycle |
| InferenceGateway | route()+complete() 8 workers | 1.90 | Least-loaded routing |
| InferenceGateway | route() best case | 1.75 | Min latency |
| ModelVersionManager | route_request() canary=10% | 1.45 | Version-split routing |
| ModelVersionManager | route_request() best case | 1.33 | Min latency |
| ProductionProfiler | record() single op span | 0.18 | Zero-alloc APM record |
| ProductionProfiler | get_stats() 1000 ops | 79.5 | p50/p99/p999 computation |
| AdaptiveBatchController | next_batch() 8 pending | 1.91 | SLO-driven batch select |
| AdaptiveBatchController | record_obs() | 0.22 | Latency observation ingest |
| SafetyClassifier | score() seq=64 | 19.4 | Token-level safety scoring |
| SafetyClassifier | score_logits() vocab=8192 | 67.3 | Logit-domain safety gate |
| SemanticResponseCache | lookup() miss | 295 | Cosine-sim scan (256 cache) |
| SemanticResponseCache | store() | 0.81 | Embedding insert |
| TokenBucketRateLimiter | consume() 8 tenants | 0.92 | Token-bucket check |
| TokenBucketRateLimiter | refill() | 0.48 | Bucket refill step |
| SchemaValidator | validate() valid JSON | 7.48 | jsonschema compliance check |
| SchemaValidator | validate() invalid JSON | 4.90 | Early rejection path |
| AuditLogger | log() single entry | 1.92 | SHA-256 chained append |
| AuditLogger | verify_chain() 2010 entries | 2,236,000 (≈2.2 s) | Full chain integrity check |
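The `AuditLogger` rows above benchmark a hash-chained append (`log()`) and a full-chain integrity check (`verify_chain()`). A minimal sketch of the underlying technique follows; the class and method names mirror the table, but the entry format and genesis digest are illustrative assumptions, not the module's real API:

```python
import hashlib
import json

class AuditLogger:
    """Tamper-evident append-only log: each entry's SHA-256 digest
    covers the previous entry's digest, so editing any past entry
    breaks every digest after it. (Illustrative sketch only.)"""

    GENESIS = "0" * 64  # assumed genesis digest

    def __init__(self):
        self.entries = []  # list of (json_body, hex_digest)
        self._prev = self.GENESIS

    def log(self, payload: dict) -> str:
        # Canonical serialisation so identical payloads hash identically.
        body = json.dumps(payload, sort_keys=True)
        digest = hashlib.sha256((self._prev + body).encode()).hexdigest()
        self.entries.append((body, digest))
        self._prev = digest
        return digest

    def verify_chain(self) -> bool:
        # Re-derive every digest from the genesis value; any edited
        # entry invalidates itself and all later digests.
        prev = self.GENESIS
        for body, digest in self.entries:
            if hashlib.sha256((prev + body).encode()).hexdigest() != digest:
                return False
            prev = digest
        return True
```

Because verification rehashes every entry from the genesis digest, `verify_chain()` is linear in log length, which is why the full-chain check dominates the table while a single `log()` append stays microsecond-scale.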
## Reference: Paper-Reported Technique Improvements

Note: these are technique-level estimates derived from published papers. End-to-end validation on Squish with a loaded model on Apple Silicon has not yet been run for this wave. See `dev/benchmarks/bench_eoe.py` for the real-hardware benchmark harness.
| Technique | Improvement | Module |
|---|---|---|
| KV memory (FlashMLA latent) | 4× reduction measured | FlashMLACache compression_ratio |
| Attention FLOPs (NSA sparsity) | sub-quadratic at high sparsity | NativeSparseAttention |
| Sampling overhead | zero intermediate alloc | FusedSampler single-pass |
| KV fragmentation | 0% post-defrag measured | KVDefragmenter in-place compact |
| Long-context FLOPs | O(chunk²) not O(seq²) | DualChunkAttention |
| Peak GPU memory (offload) | layer activation size freed | ActivationOffloader |
| Attention FLOPs (morph) | 25% reduction at seq=2048 | AttentionMorpher |
| Speculative draft candidates | n_heads per step | HydraSpecDecoder multi-head |
| KV memory (seq compaction) | pruned slots reclaimed | SequenceCompactor zero-copy |
| Scheduling latency | sub-µs prediction | LatencyPredictor linear probe |
| Output quality (best-of-n) | diversity-scored selection | ParallelSampler |
| Context length (summarise) | importance+recency pruning | ContextSummarizer |
| Watermark detection | z-score statistical test | TokenWatermarker Kirchenbauer |
| Structured gen validity | 100% schema compliance | SchemaGenEngine FSM |
| Memory scaling (tensor ∥) | linear across devices | TensorParallel |
| Attention FLOPs (seq ∥) | distributed across devices | SequenceParallel |
| KV migration overhead | zero recompute on handoff | KVMigrate pack/unpack |
| Prefill/decode specialisation | separate hardware | DisaggPrefill |
| Priority inversion | SRPT preemption | RequestPreempt |
| Routing latency | <2 µs per request | InferenceGateway |
| Deploy downtime | canary → rollback | ModelVersionSwap |
| APM overhead | 0.18 µs per record | ProductionProfiler |
| Batch efficiency | SLO-objective driven | AdaptiveBatcher |
| Safety gate cost | no extra forward pass | SafetyLayer logit-domain |
| Duplicate inference | cosine-dedup short-circuit | SemanticResponseCache |
| Tenant isolation | hard ceiling per tenant | RateLimiter token-bucket |
| Output validity | 100% compliant outputs | SchemaValidator |
| Audit tamper-evidence | SHA-256 chained log | AuditLogger |
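The cosine-dedup short-circuit behind `SemanticResponseCache` is simple enough to sketch: each response is stored with a request embedding, and a new request whose embedding lies within a similarity threshold of a stored entry returns the cached response instead of triggering inference. The linear scan below matches the O(cache size) `lookup()` cost in the latency table; the class name mirrors the module, but the threshold value and method signatures are illustrative assumptions:

```python
import math

class SemanticResponseCache:
    """Cosine-similarity response dedup (illustrative sketch):
    a lookup close enough to a stored embedding short-circuits
    inference and returns the cached response."""

    def __init__(self, threshold: float = 0.95):  # assumed threshold
        self.threshold = threshold
        self.entries = []  # list of (embedding, response)

    @staticmethod
    def _cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb) if na and nb else 0.0

    def store(self, embedding, response):
        self.entries.append((embedding, response))

    def lookup(self, embedding):
        # Linear scan over the cache; returns None on a miss,
        # which is the 295 µs path benchmarked above.
        best_sim, best_resp = 0.0, None
        for emb, resp in self.entries:
            sim = self._cosine(embedding, emb)
            if sim > best_sim:
                best_sim, best_resp = sim, resp
        return best_resp if best_sim >= self.threshold else None
```

A production variant would replace the linear scan with an approximate nearest-neighbour index once the cache grows, but at the 256-entry size benchmarked above a scan is competitive.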
## Accuracy Baseline

Wave 25–26 modules operate on attention patterns, sampling, and serving infrastructure. They do not modify weight values or quantisation, so generation quality is unchanged from the Squish INT8 baseline.