diff --git a/docs/adr/ADR-132-ruvix-hypervisor-core.md b/docs/adr/ADR-132-ruvix-hypervisor-core.md index e088d52d..e4d02fc5 100644 --- a/docs/adr/ADR-132-ruvix-hypervisor-core.md +++ b/docs/adr/ADR-132-ruvix-hypervisor-core.md @@ -374,7 +374,7 @@ RVM leverages existing RuVector crates rather than reimplementing graph primitiv ## Non-Goals (v1) - Full Linux ABI compatibility -- Large device model surface (USB, GPU, network card diversity) +- Large device model surface (USB, GPU, network card diversity). **Note:** GPU compute was added in ADR-144 as an optional, feature-gated subsystem. - Desktop or workstation use - Full formal verification (deferred to post-v1; seL4-style proofs are multi-year efforts) - Cloud VM replacement (strongest advantage is edge/appliance coherence) diff --git a/docs/adr/ADR-133-partition-object-model.md b/docs/adr/ADR-133-partition-object-model.md new file mode 100644 index 00000000..7f742c5d --- /dev/null +++ b/docs/adr/ADR-133-partition-object-model.md @@ -0,0 +1,121 @@ +# ADR-133: Partition Object Model + +**Status:** Accepted +**Date:** 2026-04-05 +**Authors:** RuVector Contributors +**Supersedes:** None + +--- + +## Context + +ADR-132 establishes partitions as the primary abstraction in RVM -- not VMs. A partition is a coherence domain container for scheduling, isolation, migration, and fault containment. However, ADR-132 specifies partitions at the architectural level without defining the concrete object model: lifecycle states, transition rules, split/merge semantics, communication edges, device leases, or the relationship between logical and physical partition slots. + +Without a precise object model, implementations risk diverging on fundamental questions: What states can a partition occupy? When is a split legal? How are capabilities handled during merge? This ADR answers those questions. + +## Decision + +### Partition Structure + +A partition is a fixed-size kernel object containing: + +- **PartitionId**: Unique identifier (u32), recyclable after destruction. +- **PartitionState**: Lifecycle state enum (see below). +- **PartitionType**: `Agent` (workload), `Infrastructure` (driver domain), or `Root` (bootstrap authority). +- **CoherenceScore**: Current locality metric from the coherence graph. +- **CutPressure**: Graph-derived isolation signal; high pressure triggers migration or split. +- **vCPU count and CPU affinity**: Scheduling parameters. +- **Epoch**: Creation epoch for capability staleness detection. + +### Lifecycle States + +``` +Created --> Running --> Suspended --> Running --> Hibernated --> Created + | | | | | + +----------+------------+------------+----------->Destroyed +``` + +| State | Description | +|-------|-------------| +| `Created` | Allocated, capability table initialized, not yet scheduled | +| `Running` | Actively scheduled on physical CPU(s) | +| `Suspended` | All vCPUs paused, state preserved in-place (Hot/Warm tier) | +| `Hibernated` | State serialized to Cold storage, physical resources released | +| `Destroyed` | Resources reclaimed, ID available for reuse | + +All transitions are validated by `valid_transition()` and emit witness records. Invalid transitions return `RvmError::InvalidPartitionState`. + +### MAX_PARTITIONS = 256 + +The hard limit of 256 active physical partition slots is derived from the ARM VMID width (8 bits). Logical partitions may exceed this limit (DC-12 allows up to 4096 logical partitions); physical slots are multiplexed via TLB flush and VMID reassignment when logical count exceeds physical capacity. + +### Split Semantics + +Partition split divides one partition into two, triggered by high cut pressure. Preconditions: + +1. Source partition must be in `Running` or `Suspended` state. +2. The coherence engine provides a scored region assignment (DC-9): each memory region is assigned to the side with higher `alpha * local_access_fraction + beta * remote_access_cost_avoided + gamma * size_penalty`. +3. Capabilities follow their target objects (DC-8): if a capability's target is on side A, it goes to partition A only. Shared targets get READ_ONLY attenuation in both, with a `CAPABILITY_ATTENUATED_ON_SPLIT` witness. +4. Two new `PartitionId` values are allocated; the original ID is destroyed. + +### Merge Semantics + +Partition merge combines two partitions into one. Seven preconditions must all hold (DC-11): + +1. A shared `CommEdge` exists between the two partitions. +2. Mutual coherence score exceeds the merge threshold (7000 basis points default). +3. No conflicting device leases. +4. No overlapping mutable memory regions. +5. Capability intersection is valid (no escalation). +6. Both partitions are in `Running` or `Suspended` state. +7. P2 proof validates merge authority. + +Failure of any precondition results in rejection plus a witness record. The `merge_preconditions_full()` function checks all seven and returns a typed `MergePreconditionError` identifying which conditions failed. + +### CommEdge Model + +A `CommEdge` is a weighted directed edge in the coherence graph representing inter-partition communication. Each edge carries: + +- **CommEdgeId**: Unique edge identifier. +- Source and destination `PartitionId`. +- Weight (communication volume, decayed per epoch). + +CommEdges are the input to the mincut algorithm. The scheduler uses cut pressure derived from these edges to boost scheduling priority (DC-4: `priority = deadline_urgency + cut_pressure_boost`). + +### IPC + +The `IpcManager` provides fixed-capacity message queues between partitions. Each `MessageQueue` is bounded (default 64 messages) and supports zero-copy semantics via shared memory regions. IPC messages emit `IpcSend` / `IpcReceive` witness records. + +### Device Leases + +A `DeviceLeaseManager` tracks time-bounded, revocable access grants to hardware devices. Each `ActiveLease` records the partition, device, expiry time, and capability used to acquire the lease. Device leases are checked during merge preconditions (condition 3) to prevent conflicts. + +## Consequences + +### Positive + +- **Precise lifecycle** prevents invalid state transitions at compile time via exhaustive match. +- **Scored region assignment** during split avoids oscillation from hotspot access patterns. +- **Seven merge preconditions** prevent authority leaks and resource conflicts. +- **Fixed-size structures** (`MAX_PARTITIONS = 256`) enable fully stack-allocated operation in `no_std`. + +### Negative + +- **256 physical slots** may become a bottleneck for large agent workloads; VMID multiplexing (DC-12) adds TLB flush overhead. +- **Seven merge preconditions** are conservative; some legitimate merges may be rejected by condition 4 (overlapping regions) when the overlap is intentional shared memory. + +### Neutral + +- The split/merge operations are novel for a hypervisor. No existing system provides a direct comparison for performance baseline. + +## References + +- ADR-132: RVM Hypervisor Core (DC-8, DC-9, DC-11, DC-12) +- `crates/rvm-partition/src/lib.rs` -- Partition module root +- `crates/rvm-partition/src/partition.rs` -- Core struct and constants +- `crates/rvm-partition/src/lifecycle.rs` -- State transition validation +- `crates/rvm-partition/src/split.rs` -- Scored region assignment +- `crates/rvm-partition/src/merge.rs` -- Merge precondition checks +- `crates/rvm-partition/src/comm_edge.rs` -- CommEdge model +- `crates/rvm-partition/src/ipc.rs` -- IPC message queues +- `crates/rvm-partition/src/device.rs` -- Device lease management diff --git a/docs/adr/ADR-134-witness-schema-log-format.md b/docs/adr/ADR-134-witness-schema-log-format.md index a2e11132..17a8a576 100644 --- a/docs/adr/ADR-134-witness-schema-log-format.md +++ b/docs/adr/ADR-134-witness-schema-log-format.md @@ -1,6 +1,6 @@ # ADR-134: Witness Schema and Log Format -**Status**: Proposed +**Status**: Accepted **Date**: 2026-04-04 **Authors**: Claude Code (Opus 4.6) **Supersedes**: None diff --git a/docs/adr/ADR-135-proof-verifier-design.md b/docs/adr/ADR-135-proof-verifier-design.md index aa0bdcb6..fd639964 100644 --- a/docs/adr/ADR-135-proof-verifier-design.md +++ b/docs/adr/ADR-135-proof-verifier-design.md @@ -1,6 +1,6 @@ # ADR-135: Proof Verifier Design — Three-Layer Verification for Capability-Gated Mutation -**Status**: Proposed +**Status**: Accepted **Date**: 2026-04-04 **Authors**: Claude Code (Opus 4.6) **Supersedes**: None diff --git a/docs/adr/ADR-136-memory-hierarchy-reconstruction.md b/docs/adr/ADR-136-memory-hierarchy-reconstruction.md new file mode 100644 index 00000000..29694305 --- /dev/null +++ b/docs/adr/ADR-136-memory-hierarchy-reconstruction.md @@ -0,0 +1,113 @@ +# ADR-136: Memory Hierarchy and Reconstruction + +**Status:** Accepted +**Date:** 2026-04-05 +**Authors:** RuVector Contributors +**Supersedes:** None + +--- + +## Context + +ADR-132 introduces a four-tier memory model where page residency is driven by coherence graph signals rather than simple access frequency. Traditional hypervisors use demand paging with a binary resident/swapped model. RVM needs a richer model because: + +1. **Coherence-driven placement**: Pages should remain resident not just because they are accessed, but because the graph structure (cut pressure, locality) justifies the cost of keeping them hot. +2. **Reconstruction from witnesses**: Dormant state can be restored without storing full memory snapshots, by replaying witness-recorded deltas against a compressed checkpoint. +3. **No-std constraint**: The memory manager must operate with zero heap allocation, using only caller-provided fixed-size buffers. + +## Decision + +### Four-Tier Memory Model + +| Tier | Name | Location | Residency Rule | +|------|------|----------|----------------| +| 0 | Hot | Per-core SRAM / L1-adjacent | Always resident during partition execution | +| 1 | Warm | Shared cluster DRAM | Resident if `cut_value + recency_score > eviction_threshold` | +| 2 | Dormant | Compressed in main memory | Checkpoint + witness delta; reconstructed on demand | +| 3 | Cold | RVF-backed persistent archival | Accessed only during recovery; never auto-promoted | + +All tier transitions are **explicit**, not demand-paged. The kernel (or coherence engine, when available) decides when to promote or demote a region. Every transition emits a `RegionPromote` or `RegionDemote` witness record. + +### DC-1 Degraded Mode + +When the coherence engine is absent (DC-1), `cut_value` defaults to 0. Tier placement falls back to static thresholds based on `recency_score` alone: + +| Transition | Threshold (basis points) | +|------------|------------------------| +| Hot -> Warm | 7000 | +| Warm -> Dormant | 4000 | +| Dormant -> Cold | 1000 | +| Warm -> Hot | 8000 | +| Dormant -> Warm | 5000 | + +These conservative defaults prevent aggressive demotion without coherence data. + +### BuddyAllocator + +Physical page allocation uses a power-of-two buddy allocator operating on 4 KiB pages (`PAGE_SIZE = 4096`). The allocator is `no_std` compatible with zero heap allocation, using a fixed-size bitmap for free-list tracking. + +### RegionManager + +The `RegionManager` tracks `OwnedRegion` instances, each containing: + +- Guest physical base address (page-aligned). +- Host physical base address (page-aligned). +- Page count and access permissions (read/write/execute). +- Owning `PartitionId`. +- Current tier state via `RegionTierState`. + +Region overlap checks enforce isolation: guest-physical overlap is valid only within the same partition (separate stage-2 page tables), but host-physical overlap across partitions is a critical isolation violation. + +### ReconstructionPipeline + +Dormant memory is stored as a `CompressedCheckpoint` plus a sequence of `WitnessDelta` entries. Reconstruction proceeds in three steps: + +1. **Load checkpoint**: Decompress the LZ4-compressed snapshot into a caller-provided buffer. +2. **Apply deltas**: Replay each `WitnessDelta` in sequence order, writing the recorded data at the specified offset. +3. **Verify hash**: Validate the final state hash (FNV-1a) against the expected value stored in the checkpoint. + +If verification fails, the reconstruction is aborted and the region remains dormant. A `RecoveryEnter` witness is emitted on failure. + +Each `WitnessDelta` records: +- Witness sequence number (for ordering). +- Byte offset within the region. +- Data length. +- FNV-1a hash of the written data (integrity check per delta). + +### Address Space Validation + +The `validate_region()` function enforces: +- Page alignment of both guest and host base addresses. +- Non-zero page count. +- At least one permission bit set (read, write, or execute). + +The `regions_overlap()` and `regions_overlap_host()` functions detect guest-physical and host-physical overlaps respectively, enabling the kernel to reject mappings that would break isolation. + +## Consequences + +### Positive + +- **Coherence-driven residency** reduces remote memory traffic by keeping pages hot only when graph structure justifies it (target: 20% reduction vs. naive placement). +- **Checkpoint + delta reconstruction** avoids storing full dormant snapshots, reducing cold storage requirements. +- **Zero-allocation design** enables deployment on Seed-class hardware (64 KB - 1 MB RAM). +- **Explicit tier transitions** eliminate demand-paging complexity and make memory behavior deterministic and auditable. + +### Negative + +- **No demand paging** means a page fault on a demoted region is a hard fault, not a transparent recovery. The kernel must proactively manage promotions. +- **Reconstruction latency** depends on the number of deltas since the last checkpoint. Long-running partitions with infrequent checkpoints may have slow reconstruction. +- **Static DC-1 thresholds** are conservative and may over-demote on hardware where DRAM is abundant. + +### Neutral + +- The compression stub currently uses a simple byte-level algorithm. Production deployments will use `lz4_flex` or hardware compression; the interface is abstracted behind the `CompressedCheckpoint` type. + +## References + +- ADR-132: RVM Hypervisor Core (DC-1, DC-6, memory model section) +- ADR-138: Seed Hardware Profile (no_alloc constraints) +- `crates/rvm-memory/src/lib.rs` -- Module root and region validation +- `crates/rvm-memory/src/tier.rs` -- Four-tier model and thresholds +- `crates/rvm-memory/src/allocator.rs` -- BuddyAllocator +- `crates/rvm-memory/src/region.rs` -- OwnedRegion and RegionManager +- `crates/rvm-memory/src/reconstruction.rs` -- ReconstructionPipeline diff --git a/docs/adr/ADR-137-bare-metal-boot-sequence.md b/docs/adr/ADR-137-bare-metal-boot-sequence.md new file mode 100644 index 00000000..a9666703 --- /dev/null +++ b/docs/adr/ADR-137-bare-metal-boot-sequence.md @@ -0,0 +1,105 @@ +# ADR-137: Bare-Metal Boot Sequence + +**Status:** Accepted +**Date:** 2026-04-05 +**Authors:** RuVector Contributors +**Supersedes:** None + +--- + +## Context + +RVM boots bare-metal without KVM or Linux. The boot sequence must initialize hardware, establish the capability discipline, create the witness trail, and hand off to the scheduler -- all in a deterministic, witness-gated sequence. A non-deterministic boot makes debugging impossible and prevents measured boot attestation. + +The boot sequence has evolved through two iterations: + +1. **ADR-137 original (7-phase hardware-centric)**: Reset vector, hardware detect, MMU setup, EL2 entry, kernel object init, first witness, scheduler entry. This maps to the AArch64 hardware bring-up path. +2. **ADR-140 revision (7-phase logical)**: HAL init, memory pool init, capability table init, witness trail init, scheduler init, root partition creation, hand-off. This is the currently implemented sequence. + +Both share the principle: each phase is gated by a witness entry and must complete before the next begins. + +## Decision + +### Boot Phases + +The implemented boot sequence uses 7 phases, executed in strict order: + +| Phase | Name | What It Does | +|-------|------|-------------| +| 0 | HalInit | Initialize hardware abstraction: timer, MMU, interrupts, UART | +| 1 | MemoryInit | Initialize physical page allocator (BuddyAllocator) | +| 2 | CapabilityInit | Create the root capability table | +| 3 | WitnessInit | Initialize the witness log ring buffer; emit genesis attestation | +| 4 | SchedulerInit | Initialize the scheduler with deadline-based priority | +| 5 | RootPartition | Create the root partition with bootstrap authority | +| 6 | Handoff | Transfer control to the root partition's entry point | + +### BootTracker + +The `BootTracker` enforces phase ordering. It maintains: + +- `current: Option` -- the phase that must complete next, or `None` if boot is complete. +- `completed: [bool; 7]` -- which phases have finished. + +Calling `complete_phase(phase)` succeeds only if `phase` matches `current`. Out-of-order completion returns `RvmError::InternalError`. Attempting to complete a phase after boot is done returns `RvmError::Unsupported`. + +### MeasuredBootState + +For TPM-style measured boot, `MeasuredBootState` accumulates a hash chain during boot. Each phase extends the measurement with its completion data, producing a boot attestation hash that can be included in the genesis witness record. This enables remote attestation: a verifier can confirm that the boot sequence executed all phases in order with expected firmware and configuration. + +### BootSequence + +The `BootSequence` type wraps the full boot flow with timing. Each `BootStage` records a `PhaseTiming` (start/end timestamps in nanoseconds) so that boot performance can be profiled. The target is cold boot to first witness in under 250ms on Appliance hardware. + +### HAL Abstraction + +The `HalInit` trait abstracts hardware initialization behind three configuration structures: + +- `MmuConfig` -- page table base, granularity, address space size. +- `InterruptConfig` -- GIC distributor and redistributor addresses. +- `UartConfig` -- serial console base address and baud rate. + +A `StubHal` implementation satisfies the trait for testing without real hardware. On AArch64, the real HAL performs EL2 entry, stage-2 page table setup, and GIC initialization. + +### Witness Gating + +Every phase transition emits a `BootAttestation` witness record before advancing. If witness emission fails (ring buffer unavailable during WitnessInit itself), the boot sequence records the failure in the measured boot state and continues -- witness infrastructure is not yet available in phases 0-2. From phase 3 onward, witness emission failure is fatal. + +### AArch64 Entry + +On AArch64, the reset vector (assembly, <100 LoC) performs: + +1. Disable interrupts, set SCTLR to known state. +2. Read MPIDR to determine core ID; park non-primary cores. +3. Zero BSS, set up initial stack pointer. +4. Branch to Rust `_start` which calls `run_boot_sequence()`. + +The entry module (`crates/rvm-boot/src/entry.rs`) provides `BootContext` and `run_boot_sequence()` as the Rust-side entry point. + +## Consequences + +### Positive + +- **Deterministic ordering** via `BootTracker` prevents initialization races and ensures every phase completes before its dependents start. +- **Measured boot** enables remote attestation for high-assurance deployments. +- **Phased witness gating** catches boot-time failures early with auditable records. +- **HAL abstraction** allows the same boot sequence to run on QEMU, real AArch64, and future RISC-V targets. + +### Negative + +- **Strict phase ordering** means phases cannot be parallelized. On multi-core hardware, phases 0-4 run on the primary core only; secondary cores are parked until phase 6 (Handoff). +- **250ms boot target** is aggressive for bare-metal without firmware fast-path optimizations. Phase 1 (memory init) dominates on systems with large physical memory. + +### Neutral + +- The dual 7-phase numbering (hardware-centric vs. logical) may cause confusion. The implemented logical sequence is canonical; the hardware-centric numbering in the module doc-comments is retained for reference. + +## References + +- ADR-132: RVM Hypervisor Core (success criterion 1: cold boot < 250ms) +- ADR-134: Witness Schema and Log Format (BootAttestation witness kind) +- `crates/rvm-boot/src/lib.rs` -- Module root, BootPhase enum, BootTracker +- `crates/rvm-boot/src/sequence.rs` -- BootSequence with timing +- `crates/rvm-boot/src/measured.rs` -- MeasuredBootState hash chain +- `crates/rvm-boot/src/hal_init.rs` -- HalInit trait and StubHal +- `crates/rvm-boot/src/entry.rs` -- AArch64 entry point diff --git a/docs/adr/ADR-138-seed-hardware-profile.md b/docs/adr/ADR-138-seed-hardware-profile.md new file mode 100644 index 00000000..26dd1b8d --- /dev/null +++ b/docs/adr/ADR-138-seed-hardware-profile.md @@ -0,0 +1,109 @@ +# ADR-138: Seed Hardware Profile + +**Status:** Accepted +**Date:** 2026-04-05 +**Authors:** RuVector Contributors +**Supersedes:** None + +--- + +## Context + +ADR-132 defines three target platforms: Seed, Appliance, and Chip. Seed is the most constrained: hardware-limited MCU-class devices with 64 KB to 1 MB of RAM, no MMU (or a very simple one), and no coherence engine. RVM must run on Seed targets to demonstrate that its core abstractions (partitions, capabilities, proofs, witnesses) are viable on deeply embedded hardware. + +The challenge is that most hypervisor features assume abundant memory, multi-core processors, and hardware virtualization extensions. Seed has none of these. This ADR defines what RVM looks like when stripped to the bone. + +## Decision + +### Hardware Constraints + +| Resource | Seed Range | Implication | +|----------|-----------|-------------| +| RAM | 64 KB - 1 MB | All structures must be statically sized; no heap allocation | +| Cores | 1 (typically) | No SMP scheduler needed; cooperative or timer-preemptive scheduling | +| MMU | None or MPU | No stage-2 page tables; isolation via MPU regions or software checks | +| Persistent storage | Flash (128 KB - 4 MB) | Witness log overflow to flash; checkpoint storage | +| Network | None, UART, or SPI | No network-based migration; local operation only | +| Power | Battery or energy harvesting | Deep sleep between events; witness log survives wake cycles | + +### What Works on Seed + +The entire RVM core (Layer 1) is `#![no_std]` with `#![forbid(unsafe_code)]` and zero heap allocation. On Seed, this means: + +1. **Capability + Proof + Witness**: The full P1+P2 proof system, capability derivation tree, and witness log work on Seed. These are bitmap operations and fixed-size ring buffers -- no allocator needed. +2. **Partitions (limited)**: 1-4 partitions, statically configured. No dynamic split/merge (requires coherence engine, which is absent per DC-1). +3. **Memory tiers (simplified)**: Only Hot and Cold tiers. Warm tier requires shared DRAM (not available on most MCUs). Dormant tier requires compression (too expensive for 64 KB RAM). Static threshold fallback per DC-1. +4. **Witness log (small ring buffer)**: Ring buffer sized to available RAM. On a 64 KB device, a 4 KB ring buffer holds 64 witness records. Overflow drains to flash. +5. **IPC**: Fixed-size message queues between the few partitions. Zero-copy where MPU regions allow. + +### What Does Not Work on Seed + +1. **Coherence engine (DC-1)**: Absent. No graph, no mincut, no cut pressure. Partitions use static affinity. +2. **Dynamic split/merge**: Requires coherence engine and sufficient memory for two partitions' state. +3. **WASM runtime**: Too expensive. Seed runs native code only (DC-13). +4. **Multi-core scheduling**: Single core; the scheduler degenerates to a priority queue with timer preemption. +5. **GPU compute**: Not applicable. +6. **Migration**: No network, no target to migrate to. State survives only via checkpoint-to-flash. + +### Feature Gating + +Seed builds use minimal Cargo features: + +```toml +[features] +default = [] +# Seed profile: no alloc, no coherence, no wasm, no gpu +seed = [] +# Enables alloc (required for Appliance+) +alloc = [] +# Enables coherence engine +coherence = ["alloc"] +# Enables WASM runtime +wasm = ["alloc"] +``` + +The `no_std` + no `alloc` combination compiles only the core kernel objects. All coherence, WASM, and GPU code is excluded. + +### Memory Layout (64 KB Example) + +| Region | Size | Contents | +|--------|------|----------| +| Code + rodata | 24 KB | RVM kernel image | +| Stack | 4 KB | Single execution stack | +| Partition state | 4 KB | 1-2 partition structures, capability tables | +| Witness log | 4 KB | 64-record ring buffer | +| Application data | 28 KB | Available to the root partition | + +### Boot on Seed + +The boot sequence (ADR-137) runs all 7 phases but with simplified HAL init: +- No MMU setup (skip or configure MPU). +- No GIC (direct NVIC interrupt controller on Cortex-M). +- Memory init configures the buddy allocator over the available SRAM. +- Witness init creates a small ring buffer. +- Root partition is the only partition created. + +## Consequences + +### Positive + +- **Proves core abstractions are viable on MCU**: If capabilities, proofs, and witnesses work on 64 KB, they work everywhere. +- **Zero-allocation guarantee** is enforced by the Seed profile, ensuring no hidden `alloc` dependencies creep into the core. +- **Minimal attack surface**: No WASM interpreter, no coherence graph, no network stack. The TCB on Seed is extremely small. + +### Negative + +- **No dynamic partitioning**: Seed targets cannot split, merge, or migrate. The coherence-native features that differentiate RVM are absent on Seed. +- **Limited witness capacity**: 64 records in a 4 KB ring buffer means the drain-to-flash path must be reliable or witness records will be lost. +- **Single-core only**: Cannot demonstrate SMP scheduling on Seed. + +### Neutral + +- Seed is primarily a validation target, not a deployment target. Its purpose is to prove that the core kernel is truly minimal and allocation-free. + +## References + +- ADR-132: RVM Hypervisor Core (DC-1, DC-13, Seed platform profile) +- ADR-136: Memory Hierarchy and Reconstruction (DC-1 fallback thresholds) +- ADR-137: Bare-Metal Boot Sequence +- `crates/rvm-memory/src/lib.rs` -- `MemoryRegion` with ADR-138 compatibility note diff --git a/docs/adr/ADR-139-appliance-deployment-model.md b/docs/adr/ADR-139-appliance-deployment-model.md new file mode 100644 index 00000000..1d752c82 --- /dev/null +++ b/docs/adr/ADR-139-appliance-deployment-model.md @@ -0,0 +1,109 @@ +# ADR-139: Appliance Deployment Model + +**Status:** Accepted +**Date:** 2026-04-05 +**Authors:** RuVector Contributors +**Supersedes:** None + +--- + +## Context + +ADR-132 defines the Appliance as the primary deployment target: an edge hub with 1-32 GB RAM, multi-core processor, and hardware virtualization extensions (ARM VHE or similar). Unlike Seed (ADR-138), the Appliance has sufficient resources to run the full RVM stack: coherence engine, WASM runtime, dynamic partitioning, and GPU compute. + +The Appliance is where RVM's differentiating features are exercised: graph-theoretic partition placement, dynamic split/merge, coherence-weighted scheduling, agent workloads, and measured boot with TEE attestation. + +## Decision + +### Hardware Profile + +| Resource | Appliance Range | Design Point | +|----------|----------------|-------------| +| RAM | 1 - 32 GB | 4 GB typical for edge deployment | +| Cores | 2 - 16 | SMP scheduling with per-core Hot tier memory | +| MMU | Full stage-2 | ARM VHE or RISC-V H-extension; hardware VMID support | +| Storage | 16 GB - 1 TB SSD/eMMC | Witness log persistence, Cold tier storage | +| Network | Ethernet / WiFi | Agent migration between Appliance nodes (post-v1) | +| GPU | Optional (feature-gated) | Compute offload for ML agent workloads | +| TEE | Optional (TrustZone/CCA) | Measured boot, witness signing | + +### Full Feature Set + +The Appliance enables all RVM features: + +1. **Coherence engine**: Full graph state, MinCut computation (Stoer-Wagner builtin or RuVector backend), adaptive recomputation frequency based on CPU load, cut pressure scoring, split/merge recommendations. + +2. **Dynamic partitioning**: Up to 256 physical partition slots (ARM VMID). Dynamic split when cut pressure exceeds threshold (7500 basis points default). Dynamic merge when mutual coherence exceeds threshold (7000 basis points). Scored region assignment during split (DC-9). + +3. **WASM agent runtime**: WebAssembly modules execute inside partitions. 7-state agent lifecycle (Initializing, Running, Suspended, Migrating, Hibernated, Reconstructing, Terminated). 13 host functions including 5 GPU operations. Per-partition resource quotas enforced per epoch. + +4. **SMP scheduling**: Multi-core scheduler with per-core run queues. Two-signal priority: `deadline_urgency + cut_pressure_boost` (DC-4). Flow, Reflex, and Recovery modes. Partition switch target: < 10 microseconds. + +5. **GPU compute** (feature-gated): Optional GPU subsystem for ML inference workloads. Host functions: GpuLaunch, GpuAlloc, GpuFree, GpuTransfer, GpuSync. Capability-checked; requires EXECUTE+WRITE rights for kernel launch. + +6. **Full memory hierarchy**: All four tiers active. Hot tier uses per-core SRAM/L1-adjacent memory. Warm tier in shared DRAM with coherence-driven eviction. Dormant tier with checkpoint + witness delta reconstruction. Cold tier on persistent storage. + +7. **Control partition** (DC-15): First user partition created at boot. Provides witness log queries, partition inspection, health monitoring, and debug console (serial or network). Subject to capability discipline. + +### Resource Quotas + +Each WASM-hosting partition is subject to per-epoch resource budgets: + +| Resource | Default Budget | Enforcement | +|----------|---------------|-------------| +| CPU time | 10 ms per epoch | Lowest-priority agent terminated on exceed | +| Memory pages | 256 pages (16 MiB) | Allocation fails beyond limit | +| IPC messages | 1024 per epoch | Send fails beyond limit | +| Concurrent agents | 32 | Spawn fails beyond limit | + +Quota enforcement uses atomic check-and-record operations to prevent TOCTOU races. Per-epoch counters (CPU, IPC) reset at each scheduler epoch; memory and agent counts persist. + +### Deployment Configuration + +An Appliance deployment is configured via a static manifest specifying: + +- Number and type of initial partitions. +- Per-partition resource quotas. +- Coherence engine parameters (mincut budget, recomputation interval, split/merge thresholds). +- Scheduler mode (Flow default, Reflex for real-time partitions). +- TEE signing configuration (NullSigner, HMAC-SHA256, or Ed25519). +- GPU feature enablement. + +### Success Criteria + +All six ADR-132 success criteria are validated on Appliance: + +1. Cold boot to first witness: < 250ms. +2. Hot partition switch: < 10 microseconds. +3. Remote memory traffic reduction: >= 20% vs. naive placement. +4. Tail latency reduction: >= 20% under mixed partition pressure. +5. Witness completeness: full trail for every privileged action. +6. Fault recovery: recover from injected fault without global reboot. + +## Consequences + +### Positive + +- **Full feature demonstration**: Appliance is the only target where all RVM innovations (coherence, split/merge, WASM agents, GPU) are exercised together. +- **Edge-optimized**: Bounded workloads, deterministic scheduling, and local operation make Appliance ideal for edge AI deployments. +- **Hardware-assisted isolation**: Stage-2 page tables and VMID-based TLB tagging provide strong partition isolation without software overhead. + +### Negative + +- **Complex configuration**: The full feature set requires careful tuning of coherence thresholds, quota budgets, and scheduler parameters. Misconfiguration can cause excessive split/merge churn. +- **GPU adds attack surface**: The GPU subsystem introduces 5 additional host functions and a device lease model. Feature-gating mitigates this for deployments that don't need GPU. +- **Power consumption**: Unlike Seed, Appliance does not support deep sleep. Multi-core operation and continuous coherence recomputation consume significant power. + +### Neutral + +- Appliance is the validation platform for v1. Chip (future Cognitum silicon) will provide hardware-accelerated versions of features that are software-only on Appliance. + +## References + +- ADR-132: RVM Hypervisor Core (DC-4, DC-5, DC-15, success criteria) +- ADR-133: Partition Object Model (split/merge semantics) +- ADR-136: Memory Hierarchy and Reconstruction (four-tier model) +- ADR-138: Seed Hardware Profile (contrast with constrained target) +- ADR-140: Agent Runtime Adapter (WASM lifecycle) +- ADR-141: Coherence Engine Integration (graph and scoring) +- ADR-144: GPU Compute Support diff --git a/docs/adr/ADR-140-agent-runtime-adapter.md b/docs/adr/ADR-140-agent-runtime-adapter.md new file mode 100644 index 00000000..d614c1bb --- /dev/null +++ b/docs/adr/ADR-140-agent-runtime-adapter.md @@ -0,0 +1,141 @@ +# ADR-140: Agent Runtime Adapter + +**Status:** Accepted +**Date:** 2026-04-05 +**Authors:** RuVector Contributors +**Supersedes:** None + +--- + +## Context + +ADR-132 establishes that RVM supports WASM agent workloads as an optional execution adapter (DC-13). Agents are the primary workload type for multi-agent edge deployments: small, fast-switching, communication-heavy processes that benefit from RVM's coherence-driven partitioning. + +The agent runtime must define a precise lifecycle state machine, a bounded set of host functions for kernel interaction, a migration protocol with hard time bounds, and per-partition resource quotas to prevent denial-of-service. All of this must work within the `no_std` constraint with zero heap allocation. + +## Decision + +### Agent Lifecycle (7 States) + +``` +Initializing --> Running --> Suspended --> Running + | | + | +--> Migrating --> Running (at destination) + | | + | +--> Hibernated --> Reconstructing --> Running + | + +--> Terminated +``` + +| State | Description | Valid Transitions | +|-------|-------------|-------------------| +| Initializing | Module loaded, being validated | Running, Terminated | +| Running | Actively executing instructions | Suspended, Terminated | +| Suspended | Paused, state preserved in-place | Running, Migrating, Hibernated, Terminated | +| Migrating | Being transferred to another partition | Running (at dest), Aborted->Suspended | +| Hibernated | State serialized to cold storage | Reconstructing, Terminated | +| Reconstructing | Being restored from hibernation | Running, Terminated | +| Terminated | Resources freed, slot available for reuse | (terminal) | + +Every state transition emits a witness record. The `AgentManager` uses a fixed-size array of `Option` slots; terminated agents free their slots immediately for reuse (preventing resource exhaustion). + +### Agent Identity + +Agents are identified by `AgentId`, a transparent wrapper around a `u32` badge value. Badges are derived from capabilities: the capability used to spawn an agent determines its badge. Duplicate active badges within a partition are rejected. + +### Host Functions (13) + +WASM agents interact with the kernel through a fixed set of host functions, each capability-checked before dispatch: + +| Function | ID | Required Rights | Purpose | +|----------|----|----------------|---------| +| Send | 0 | WRITE | Send IPC message to another agent/partition | +| Receive | 1 | READ | Receive pending IPC message | +| Alloc | 2 | WRITE | Allocate linear memory pages | +| Free | 3 | WRITE | Free allocated memory pages | +| Spawn | 4 | EXECUTE | Spawn child agent within same partition | +| Yield | 5 | READ | Yield current execution quantum | +| GetTime | 6 | READ | Read monotonic timer (nanoseconds) | +| GetId | 7 | READ | Return caller's agent identifier | +| GpuLaunch | 8 | EXECUTE+WRITE | Submit GPU compute kernel | +| GpuAlloc | 9 | WRITE | Allocate GPU buffer memory | +| GpuFree | 10 | WRITE | Free GPU buffer memory | +| GpuTransfer | 11 | READ+WRITE | Copy between CPU and GPU buffers | +| GpuSync | 12 | READ | Wait for GPU operation completion | + +GPU functions (8-12) are feature-gated behind `#[cfg(feature = "gpu")]`. When GPU is disabled, these functions return `RvmError::InternalError`. + +The `dispatch_host_call()` function performs capability checking via `CapToken::has_rights()` before routing to the `HostContext` trait implementation. The `HostContext` trait is pluggable: `StubHostContext` provides default behavior for testing; production kernels implement it to connect to real IPC, memory, and GPU subsystems. + +### Migration Protocol (7 Steps) + +Agent migration moves a WASM agent from one partition to another. The protocol has 7 steps with a hard DC-7 time budget of 100 milliseconds: + +| Step | State | Operation | +|------|-------|-----------| +| 1 | Serializing | Serialize agent state to portable format | +| 2 | PausingComms | Pause inter-partition communication for this agent | +| 3 | TransferringRegions | Transfer memory regions to destination partition | +| 4 | UpdatingEdges | Update CommEdge weights in the coherence graph | +| 5 | UpdatingGraph | Recompute coherence topology with new placement | +| 6 | Verifying | Verify state integrity at destination | +| 7 | Resuming | Resume agent execution at destination | + +The `MigrationTracker` enforces: +- **Strict step ordering**: Each step advances to the next via `advance()`. +- **DC-7 timeout**: If elapsed time exceeds the deadline (100ms default), the migration aborts automatically with `RvmError::MigrationTimeout` and a witness record. +- **Self-migration rejection**: Source and destination partitions must differ; self-migration is rejected at plan creation. +- **Atomic completion**: Migration either completes all 7 steps or aborts entirely. Partial migration is not permitted. + +On abort, the source partition is restored to its pre-migration state. The aborted agent is marked migration-ineligible for a cooldown period. + +### Quota Management + +Per-partition resource quotas prevent agent workloads from monopolizing system resources: + +| Quota | Default | Scope | +|-------|---------|-------| +| CPU time | 10,000 us/epoch | Per-epoch, reset at epoch boundary | +| Memory pages | 256 (16 MiB) | Persistent across epochs | +| IPC messages | 1,024/epoch | Per-epoch, reset at epoch boundary | +| Concurrent agents | 32 | Persistent | + +The `QuotaTracker` provides atomic `check_and_record_*` methods that eliminate TOCTOU races in the deprecated `check_quota()` + `record_usage()` two-step pattern. If a resource increment would exceed the quota, no usage is recorded and `ResourceLimitExceeded` is returned. + +### WASM Module Validation + +Before execution, WASM modules are validated by `validate_module()`: +- Magic number (`\0asm`) and version (1). +- Section structure: valid IDs, declared sizes fit within module, non-decreasing order (except custom sections), no duplicate non-custom sections. +- Size limit: modules cannot exceed `MAX_MODULE_SIZE` (1 MiB default) per DC-7 budget constraint. + +## Consequences + +### Positive + +- **7-state lifecycle** covers all agent operational modes including migration and hibernation, with every transition witnessed. +- **Hard migration timeout** (DC-7) prevents unbounded migration from becoming a liveness hazard. +- **Atomic quota enforcement** eliminates a class of TOCTOU resource-exhaustion attacks. +- **Pluggable HostContext** enables testing without real kernel subsystems and future extension without modifying the dispatch path. + +### Negative + +- **13 host functions is a large surface**: Each function requires capability checking and witness logging. GPU functions add 5 more entry points to the TCB. +- **100ms migration timeout** may be too aggressive for partitions with large memory footprints. The timeout is configurable per partition size class but requires tuning. +- **No streaming migration**: The protocol serializes all state before transfer. Iterative pre-copy (like live VM migration) would reduce downtime but adds significant complexity. + +### Neutral + +- The WASM interpreter itself is not specified in this ADR. The runtime is pluggable via the `HostContext` trait; a `wasmtime`-based or custom interpreter can be swapped without changing the lifecycle or host function model. + +## References + +- ADR-132: RVM Hypervisor Core (DC-7, DC-13, DC-14) +- ADR-133: Partition Object Model (partition state machine) +- ADR-134: Witness Schema (TaskSpawn, TaskTerminate, PartitionMigrate action kinds) +- ADR-144: GPU Compute Support (GpuLaunch through GpuSync) +- `crates/rvm-wasm/src/lib.rs` -- Module root, WASM validation +- `crates/rvm-wasm/src/agent.rs` -- AgentManager and lifecycle +- `crates/rvm-wasm/src/host_functions.rs` -- Host function dispatch +- `crates/rvm-wasm/src/migration.rs` -- 7-step migration protocol +- `crates/rvm-wasm/src/quota.rs` -- QuotaTracker and resource budgets diff --git a/docs/adr/ADR-141-coherence-engine-integration.md b/docs/adr/ADR-141-coherence-engine-integration.md new file mode 100644 index 00000000..9a266e34 --- /dev/null +++ b/docs/adr/ADR-141-coherence-engine-integration.md @@ -0,0 +1,185 @@ +# ADR-141: Coherence Engine Integration + +**Status:** Accepted +**Date:** 2026-04-05 +**Authors:** RuVector Contributors +**Supersedes:** None + +--- + +## Context + +ADR-132 introduces coherence domains as RVM's primary abstraction. The coherence engine is the subsystem that maintains the runtime graph, computes scores and cut pressures, and recommends split/merge actions. However, ADR-132 also mandates that the coherence engine is optional (DC-1): the kernel must boot and run without it. + +This creates a dual requirement: the engine must be powerful enough to drive intelligent partition placement, but cleanly separable so the kernel degrades gracefully when the engine is absent, over-budget, or failed. + +## Decision + +### Architecture + +The coherence engine is implemented as Layer 2 in the RVM stack (above the kernel, below execution adapters). It consists of five collaborating modules: + +``` +CoherenceEngine (unified entry point) + | + +-- CoherenceGraph (fixed-size adjacency structure) + +-- MinCutBackend (pluggable: Stoer-Wagner builtin or RuVector) + +-- CoherenceBackend (pluggable: ratio-based builtin or spectral) + +-- PressureModule (cut pressure and merge signal computation) + +-- AdaptiveEngine (recomputation frequency control) +``` + +### CoherenceGraph + +A fixed-size directed weighted graph with compile-time capacity bounds: +- `MAX_NODES = 32` partitions tracked by the engine. +- `MAX_EDGES = 128` directed communication edges. +- Nodes map to `PartitionId` values; edges carry `u64` weights representing communication volume. +- Adjacency-matrix-backed `find_directed_edge()` provides O(1) existence check. +- Edge weights decay by 5% (500 basis points) per epoch to prevent stale communication patterns from dominating. + +The graph uses `NodeIdx` and `EdgeIdx` (both `u16`) for internal indexing, with linked-list adjacency for edge traversal. Zero heap allocation; fully stack-allocated. + +### EmaFilter + +Coherence scores are smoothed using an exponential moving average (EMA) filter: + +``` +new_score = alpha * sample + (1 - alpha) * old_score +``` + +Alpha is expressed in basis points (0-10000) to avoid floating-point in `no_std`. The `EmaFilter` operates entirely in fixed-point arithmetic. This smooths out transient communication spikes that would otherwise cause premature split/merge decisions. + +### Scoring + +The `compute_coherence_score()` function computes a ratio-based score: + +``` +score = internal_weight / total_weight +``` + +Where `internal_weight` is the sum of edge weights between nodes within the same partition's neighborhood, and `total_weight` includes all edges touching the partition. A score of 10000 basis points means fully internal (no cross-partition traffic); 0 means fully external. + +The `recompute_all_scores()` function batch-recomputes scores for all active partitions. + +### Cut Pressure + +Cut pressure measures how much a partition's communication pattern suggests it should be split: + +``` +pressure = external_weight / total_weight * 10000 (basis points) +``` + +- If pressure exceeds `SPLIT_THRESHOLD_BP` (7500 default), the engine recommends splitting. +- If mutual coherence between two partitions exceeds `MERGE_COHERENCE_THRESHOLD_BP` (7000 default), the engine recommends merging. + +The `evaluate_merge()` function checks bidirectional edge weights to compute mutual coherence. + +### MinCut + +The builtin MinCut uses a budgeted Stoer-Wagner heuristic (`MinCutBridge`). DC-2 constrains mincut computation to 50 microseconds per scheduler epoch. If the budget is exceeded: + +1. The last known cut is used. +2. A `degraded_flag` is set. +3. A `MINCUT_BUDGET_EXCEEDED` witness is emitted. + +The `MinCutResult` reports the two partition sets (left/right), the cut weight, and whether the computation completed within budget. + +### Adaptive Recomputation + +The `AdaptiveCoherenceEngine` adjusts recomputation frequency based on CPU load: + +| CPU Load | Recomputation Interval | +|----------|----------------------| +| < 50% | Every epoch | +| 50-80% | Every 2 epochs | +| > 80% | Every 4 epochs | + +This prevents the coherence engine from consuming excessive CPU during high-load periods. The adaptive engine always computes on the first epoch after creation. + +### Pluggable Backends + +Two trait abstractions enable backend swapping: + +**`MinCutBackend`**: Pluggable minimum cut computation. +- `BuiltinMinCut`: Self-contained Stoer-Wagner (always available). +- `RuVectorMinCut` (feature `ruvector`): Delegates to ruvector-mincut's subpolynomial dynamic mincut. Currently stubs to builtin until ruvector gains `no_std` support. + +**`CoherenceBackend`**: Pluggable coherence scoring. +- `BuiltinCoherence`: Ratio-based `internal_weight / total_weight`. +- `SpectralCoherence` (feature `ruvector`): Will use Fiedler vector / algebraic connectivity from ruvector-coherence. Currently stubs to builtin. + +Type aliases provide convenience: +- `DefaultCoherenceEngine` = builtin Stoer-Wagner + ratio scoring. +- `RuVectorCoherenceEngine` = ruvector mincut + spectral scoring. + +### Engine Lifecycle + +```rust +engine.add_partition(id) // Register partition in graph +engine.record_communication(a, b, w) // Record directed edge or increment weight +engine.tick(cpu_load) // Advance epoch, maybe recompute +engine.score(id) // Read latest coherence score +engine.pressure(id) // Read latest cut pressure +engine.recommend() // Get split/merge recommendation +``` + +Each `tick()` returns a `CoherenceDecision`: +- `NoAction` -- no split or merge warranted. +- `SplitRecommended { partition, pressure }` -- partition should be split. +- `MergeRecommended { a, b, mutual_coherence }` -- two partitions should merge. + +### DC-1 Degraded Mode + +When the coherence engine is not present: +- Split/merge operations are disabled. +- Scheduler uses `deadline_urgency` only (`cut_pressure_boost = 0`). +- Memory tiers use static thresholds (see ADR-136). +- A `DEGRADED_MODE_ENTERED` witness is emitted once. + +This is the **guaranteed baseline**: RVM is always functional without intelligence. + +### RuVector Integration + +The ruvector ecosystem provides three crates relevant to the coherence engine: + +| Crate | Usage | Status | +|-------|-------|--------| +| `ruvector-mincut` | Subpolynomial dynamic MinCut | Stub (pending `no_std`) | +| `ruvector-sparsifier` | Sparse graph representation | Planned | +| `ruvector-solver` | Spectral analysis, coherence scoring | Stub (pending `no_std`) | + +The `bridge.rs` module implements the translation layer: it exports the RVM `CoherenceGraph` adjacency structure to the ruvector API format and converts results back. The backend trait abstraction ensures that switching from builtin to ruvector requires no changes to the engine core. + +## Consequences + +### Positive + +- **Clean separation** (DC-1, DC-5): The coherence engine is a pure optimization layer. The kernel never calls into it on the critical scheduling path; it reads cached scores/pressures that the engine computed asynchronously. +- **Pluggable backends** enable gradual migration from builtin algorithms to ruvector's more sophisticated implementations without modifying the engine core. +- **Adaptive recomputation** prevents the engine from becoming a scheduling bottleneck under high CPU load. +- **Fixed-point arithmetic** throughout (basis points, no floating-point) ensures deterministic behavior on all targets. + +### Negative + +- **32-partition engine limit** is lower than the 256-partition hardware limit. The engine tracks the hottest partitions; the rest use stale or default scores. +- **Stoer-Wagner is O(V^3)**: For 32 nodes this is acceptable, but scaling beyond requires the ruvector subpolynomial algorithm. +- **5% per-epoch edge decay** is a heuristic. Too aggressive and communication patterns are forgotten; too conservative and stale edges dominate. The decay rate is a constant, not tunable at runtime. + +### Neutral + +- The ruvector stubs currently delegate to builtin implementations, so `DefaultCoherenceEngine` and `RuVectorCoherenceEngine` produce identical results. This will diverge when ruvector gains `no_std` support. + +## References + +- ADR-132: RVM Hypervisor Core (DC-1, DC-2, DC-4, DC-5, DC-6) +- ADR-133: Partition Object Model (split/merge preconditions) +- ADR-136: Memory Hierarchy (coherence-driven tier placement) +- `crates/rvm-coherence/src/lib.rs` -- Module root, EmaFilter, SensorReading +- `crates/rvm-coherence/src/graph.rs` -- CoherenceGraph +- `crates/rvm-coherence/src/engine.rs` -- CoherenceEngine, DefaultCoherenceEngine +- `crates/rvm-coherence/src/bridge.rs` -- MinCutBackend, CoherenceBackend traits +- `crates/rvm-coherence/src/scoring.rs` -- Ratio-based coherence scoring +- `crates/rvm-coherence/src/pressure.rs` -- Cut pressure and merge signals +- `crates/rvm-coherence/src/mincut.rs` -- Stoer-Wagner MinCutBridge +- `crates/rvm-coherence/src/adaptive.rs` -- Adaptive recomputation engine diff --git a/docs/adr/ADR-145-ipc-protocol-semantics.md b/docs/adr/ADR-145-ipc-protocol-semantics.md new file mode 100644 index 00000000..c0a2aca2 --- /dev/null +++ b/docs/adr/ADR-145-ipc-protocol-semantics.md @@ -0,0 +1,175 @@ +# ADR-145: IPC Protocol Semantics + +**Status**: Accepted +**Date**: 2026-04-04 +**Authors**: Claude Code (Opus 4.6) +**Supersedes**: None +**Related**: ADR-132 (Hypervisor Core, DC-2 coherence graph), ADR-134 (Witness Schema), ADR-135 (Proof Verifier) + +--- + +## Context + +RVM partitions communicate through weighted `CommEdge` channels whose message +activity feeds the coherence graph used for mincut computation (DC-2). The deep +review identified that while `rvm-partition/src/ipc.rs` provides a working IPC +implementation, the protocol semantics -- ordering guarantees, backpressure +behaviour, capability gating, witness integration, and cross-partition delivery +rules -- are not formally specified. Without a specification, integrators +cannot reason about message delivery guarantees, and the coherence engine cannot +rely on weight accuracy. + +### Problem Statement + +1. **No ordering specification**: FIFO is implemented but not declared as a guarantee. +2. **Backpressure is error-only**: `send()` returns `ResourceLimitExceeded` when the queue is full, but there is no retry protocol or sender notification mechanism. +3. **Capability gating is partial**: `send()` validates `msg.sender == caller_id` and `channel.source == caller_id`, but receive has no caller validation. `send_unchecked()` bypasses all checks. +4. **Witness logging is unspecified**: IPC events are not logged to the witness subsystem. +5. **Coherence weight semantics are implicit**: `ChannelMeta::weight` increments on send but the decay/epoch-reset protocol is not documented. + +--- + +## Decision + +### 1. FIFO Per-Queue Ordering Guarantee + +Each `MessageQueue` provides **strict FIFO** ordering. Messages +enqueued via `send()` are dequeued via `receive()` in the same order. This is +guaranteed by the ring buffer's `head`/`tail` discipline: + +```rust +// MessageQueue::send -- tail advances +self.buffer[self.tail] = Some(msg); +self.tail = (self.tail + 1) & (CAPACITY - 1); + +// MessageQueue::receive -- head advances +let msg = self.buffer[self.head].take(); +self.head = (self.head + 1) & (CAPACITY - 1); +``` + +**Cross-channel ordering is explicitly NOT guaranteed.** Two messages sent on +different `CommEdge`s may arrive in any order. This matches the partition +isolation model: partitions sharing state must coordinate via sequence numbers +in the `IpcMessage::sequence` field. + +### 2. Queue Capacity and Backpressure + +Queue capacity is set at compile time via `QUEUE_SIZE` (must be a power of two, +enforced by the const assertion `_CAPACITY_IS_POWER_OF_TWO`). When the queue +is full: + +- `MessageQueue::send()` returns `Err(RvmError::ResourceLimitExceeded)`. +- The message is **not enqueued** and the caller must retry or drop. +- The `ChannelMeta::weight` is **not incremented** on failure. + +The backpressure policy is **caller-driven**: the sender decides whether to +retry, drop, or escalate. The hypervisor does not buffer overflow messages. +This prevents unbounded memory growth and keeps the IPC path allocation-free. + +Recommended queue sizes by deployment tier: + +| Tier | `QUEUE_SIZE` | Rationale | +|------|-------------|-----------| +| Seed (64KB MCU) | 4 | Minimal RAM budget | +| Edge (1MB--16MB) | 16 | Moderate IPC load | +| Cloud (>128MB) | 64 | High-throughput partition communication | + +### 3. Capability-Gated Send and Receive + +**Send path** (`IpcManager::send`): the caller must satisfy two checks: + +1. `msg.sender == caller_id` -- the declared sender matches the actual caller. +2. `channel.source == caller_id` -- the caller is the source endpoint. + +Violation of either returns `Err(RvmError::InsufficientCapability)`. The +`capability_hash` field in `IpcMessage` records the truncated FNV-1a hash of +the authorising capability token for audit purposes. + +**Receive path** (`IpcManager::receive`): currently accepts any caller with a +valid `CommEdgeId`. A future revision should add destination-endpoint +validation (the caller must be the `dest` partition of the channel) to close +the read-side capability gap. Until then, channel identifiers serve as +unforgeable capabilities: only the partition that received the `CommEdgeId` +from `create_channel()` can call `receive()`. + +**Kernel bypass** (`IpcManager::send_unchecked`): skips caller validation. +This method is restricted to kernel-internal paths where the caller has already +performed authorization. It must never be exposed through the partition syscall +interface. + +### 4. Cross-Partition Delivery Semantics + +IPC channels are **unidirectional**: `create_channel(from, to)` creates a +channel where `from` can send and `to` can receive. Bidirectional communication +requires two channels. This simplifies capability reasoning -- each channel +has exactly one writer and one reader. + +**Delivery guarantee**: messages enqueued via `send()` will be available to +`receive()` unless the channel is destroyed via `destroy_channel()`. There is +no at-most-once or at-least-once abstraction; delivery is **exactly-once within +the channel lifetime**. If the channel is destroyed, any unread messages are +silently dropped. + +### 5. Witness Logging of IPC Events + +The following IPC events must be recorded in the witness log (ADR-134): + +| Event | Trigger | Witness Fields | +|-------|---------|---------------| +| `ChannelCreated` | `create_channel()` returns `Ok(edge_id)` | edge_id, source, dest, epoch | +| `MessageSent` | `send()` or `send_unchecked()` returns `Ok(())` | edge_id, sequence, msg_type, capability_hash | +| `ChannelDestroyed` | `destroy_channel()` returns `Ok(())` | edge_id, messages_remaining, weight_at_destroy | +| `SendRejected` | `send()` returns any `Err` | edge_id, error_variant, caller_id | + +Witness records are batched per epoch (DC-10) rather than per-message. The +IPC manager accumulates event counts per epoch and flushes a summary record +at epoch boundaries. + +### 6. Coherence Graph Integration + +`ChannelMeta::weight` tracks the cumulative number of successful sends on a +channel. This value is exposed via `IpcManager::comm_weight(edge_id)` and +feeds the coherence graph's `CommEdge::weight` field. + +Weight semantics: + +- Incremented by 1 on each successful `send()` or `send_unchecked()` call. +- Uses `saturating_add` to prevent overflow. +- Decay is applied externally by the coherence engine at epoch boundaries + (not by the IPC manager itself). +- The coherence engine reads `comm_weight()` for each active edge when + computing the mincut partition scoring (DC-2, 50us budget). + +--- + +## Consequences + +### Positive + +- FIFO guarantee enables deterministic message-passing protocols between partitions. +- Compile-time capacity prevents runtime allocation and bounds worst-case memory. +- Weight tracking provides accurate coherence data without additional instrumentation. +- Unidirectional channels simplify capability reasoning and prevent confused-deputy attacks. + +### Negative + +- No receive-side capability check: a partition with a stolen `CommEdgeId` can read messages. +- No built-in retry/backpressure protocol: callers must implement their own. +- `send_unchecked` is an escape hatch that bypasses the security model. +- Destroyed channels silently drop unread messages with no notification to the receiver. + +### Risks + +- If witness batching is not implemented, IPC activity is unauditable. +- Queue sizes that are too small under load cause frequent `ResourceLimitExceeded` errors + that may cascade into partition stalls. + +--- + +## References + +- `rvm-partition/src/ipc.rs` -- `IpcMessage`, `MessageQueue`, `IpcManager`, `ChannelMeta` +- `rvm-partition/src/comm_edge.rs` -- `CommEdge`, `CommEdgeId` +- ADR-132, Section DC-2 -- Coherence graph and mincut computation +- ADR-132, Section DC-10 -- Epoch-based witness batching +- ADR-134 -- Witness schema and log format diff --git a/docs/adr/ADR-146-smp-scheduling-model.md b/docs/adr/ADR-146-smp-scheduling-model.md new file mode 100644 index 00000000..76325d45 --- /dev/null +++ b/docs/adr/ADR-146-smp-scheduling-model.md @@ -0,0 +1,174 @@ +# ADR-146: SMP Scheduling Model + +**Status**: Accepted +**Date**: 2026-04-04 +**Authors**: Claude Code (Opus 4.6) +**Supersedes**: None +**Related**: ADR-132 (Hypervisor Core, DC-10 witness batching), ADR-144 (GPU Compute Support) + +--- + +## Context + +The deep review identified that the multi-core scheduling model -- the interplay +between `SmpCoordinator`, `PerCpuScheduler`, `CpuState`, `SwitchContext`, and +`SchedulerMode` -- is implemented across four files in `rvm-sched` but lacks a +unified specification. The relationship between VMID-aware partition assignment, +coherence-driven load balancing, epoch accounting, and GPU context switching is +not documented as a cohesive protocol. + +### Problem Statement + +1. **No CPU lifecycle state machine**: `CpuState` tracks `online` and `idle` booleans, but the valid state transitions (Offline -> Online -> Idle -> Busy -> Idle -> Offline) are not specified. +2. **VMID-to-CPU mapping is implicit**: `SmpCoordinator::assign_partition()` stores the partition, but how VMIDs (encoded in `SwitchContext::vttbr_el2` bits [55:48]) relate to partition assignment is undocumented. +3. **Rebalance hint is advisory only**: `rebalance_hint()` returns `(overloaded_cpu, idle_cpu)` based on epoch_ticks, but the rebalance policy is not specified. +4. **GPU context switching is undocumented**: `SwitchContext` conditionally includes `gpu_queue_head` and `gpu_pt_base` under `#[cfg(feature = "gpu")]`, but the save/restore protocol for GPU state is not specified. +5. **Epoch accounting is underspecified**: `CpuState::epoch_ticks` increments on partition assignment but the relationship to the global epoch counter and witness batching (DC-10) is not defined. + +--- + +## Decision + +### 1. CPU Lifecycle State Machine + +Each physical CPU follows a strict state machine: + +``` + bring_online() assign_partition() + Offline ──────────────► Online/Idle ──────────────────► Online/Busy + ▲ ▲ │ + │ │ release_partition() │ + │ take_offline() └────────────────────────────────┘ + └───────────────────────┘ + (requires idle) +``` + +State transitions and their guards: + +| Transition | Method | Guard | +|-----------|--------|-------| +| Offline -> Online/Idle | `bring_online(cpu_id)` | `cpu_id < cpu_count`, CPU is offline | +| Online/Idle -> Online/Busy | `assign_partition(cpu_id, pid)` | CPU is online and idle | +| Online/Busy -> Online/Idle | `release_partition(cpu_id)` | Always succeeds; returns previous partition | +| Online/Idle -> Offline | `take_offline(cpu_id)` | CPU is online and idle (no active partition) | + +Attempting an invalid transition returns `Err(RvmError::InvalidPartitionState)`. +A CPU with an active partition cannot be taken offline; the caller must +`release_partition()` first. + +### 2. VMID-Aware Partition-to-CPU Assignment + +Each partition is identified by a `PartitionId` (u32) and has an associated +`SwitchContext` whose `vttbr_el2` field encodes a 8-bit VMID in bits [55:48]. +The VMID provides hardware-level TLB isolation on AArch64: the MMU tags all TLB +entries with the VMID, so a context switch only requires a VTTBR_EL2 write and +a VMID-scoped TLB invalidation rather than a full TLB flush. + +Assignment protocol: + +1. `SmpCoordinator::assign_partition(cpu_id, partition_id)` marks the CPU busy. +2. `partition_switch(&mut from_ctx, &to_ctx)` validates the target (`validate_for_switch()`), writes `to_ctx.vttbr_el2` to VTTBR_EL2, performs `TLBI VMALLE1` + `DSB ISH` + `ISB`, restores registers, and returns `SwitchResult { from_vmid, to_vmid, elapsed_ns }`. +3. `SwitchContext::vmid()` extracts `(vttbr_el2 >> 48) as u16`; `s2_table_base()` extracts `vttbr_el2 & 0x0000_FFFF_FFFF_FFFE`. + +`partition_affinity(partition_id)` returns the CPU currently running a given +partition, enabling the coherence engine to collocate communicating partitions. + +### 3. Load Balancing via Coherence Pressure + +The `rebalance_hint()` method returns an advisory `(overloaded_cpu, idle_cpu)` pair: + +- **Overloaded CPU**: the online, busy CPU with the highest `epoch_ticks` (cumulative assignment count). +- **Idle CPU**: any online, idle CPU. + +The coherence engine uses this hint combined with `CommEdge` weights to decide +rebalancing: + +1. Query `IpcManager::comm_weight()` for edges involving the overloaded CPU's partition. +2. If the partition's strongest communication edge connects to a partition on the idle CPU's NUMA domain, migrate. +3. If migration would increase coherence score (DC-2), execute the rebalance. +4. Otherwise, leave the assignment unchanged. + +`epoch_ticks` is incremented by `saturating_add(1)` on each `assign_partition()` call. +It is never reset, serving as a monotonic load indicator. The coherence engine +compares relative tick rates across CPUs to detect imbalance. + +### 4. Per-CPU Scheduler State + +`PerCpuScheduler` is `#[repr(C, align(64))]` to prevent false sharing. Each instance tracks `cpu_id`, `current: Option`, `mode: SchedulerMode`, and `idle: bool`. The three scheduler modes: + +| Mode | Behaviour | Use Case | +|------|-----------|----------| +| `Reflex` | Hard real-time; bounded local execution only; no coherence queries | Interrupt handling, safety-critical partitions | +| `Flow` | Normal execution with coherence-aware placement | General workloads | +| `Recovery` | Stabilization: replay, rollback, split operations | Partition recovery (ADR-148 F2/F3) | + +### 5. GPU Context Fields in SwitchContext + +When compiled with `#[cfg(feature = "gpu")]`, `SwitchContext` includes two +additional fields: + +```rust +#[cfg(feature = "gpu")] +pub gpu_queue_head: u64, // GPU command queue head pointer +#[cfg(feature = "gpu")] +pub gpu_pt_base: u64, // GPU page table base address +``` + +GPU context save/restore protocol: + +1. On `partition_switch()`, the outgoing partition's `gpu_queue_head` and `gpu_pt_base` are saved into `from_ctx` (via `save_from()`). +2. The incoming partition's GPU page table base is written to the GPU IOMMU. +3. The GPU command queue head is restored to resume submitted work. +4. `SwitchContext::init()` zeroes both GPU fields for new partitions. + +GPU context switching adds overhead proportional to the IOMMU flush cost. +On deployments without the `gpu` feature, these fields are compiled out entirely, +adding zero overhead to the hot path. + +### 6. Epoch-Based Witness Batching (DC-10) + +Individual context switches are **not** witnessed. At thousands of switches per +second, per-switch witness records would overwhelm the witness log. Instead: + +- Each CPU accumulates `epoch_ticks` across the epoch. +- At epoch boundary, the scheduler emits an `EpochSummary` witness record containing: + - Per-CPU: partition assignments, total ticks, mode changes. + - Per-channel: accumulated IPC weight (from `IpcManager::comm_weight()`). + - Aggregate: total switches, rebalance events, GPU context switches. +- The witness record is appended to the log as a single batched entry (ADR-134). + +--- + +## Consequences + +### Positive + +- Explicit state machine prevents invalid CPU transitions at compile-documented boundaries. +- VMID encoding in VTTBR_EL2 provides hardware-accelerated TLB isolation. +- Cache-line alignment of `PerCpuScheduler` eliminates false sharing. +- Feature-gated GPU fields add zero cost when GPU support is not compiled. +- Epoch batching keeps witness log growth bounded (DC-10). + +### Negative + +- Rebalance hints are advisory; no automatic migration is implemented in v1. +- `epoch_ticks` is a coarse load indicator -- it does not account for partition execution duration. +- GPU context switching is not yet implemented in the HAL; only the data model is specified. +- No work stealing between CPUs; v1 is a cooperative model. + +### Risks + +- Without coherence-driven rebalance, communicating partitions may remain on distant CPUs. +- Monotonic `epoch_ticks` can bias rebalance hints toward the same CPU, requiring normalization. + +--- + +## References + +- `rvm-sched/src/smp.rs` -- `SmpCoordinator`, `CpuState` +- `rvm-sched/src/per_cpu.rs` -- `PerCpuScheduler` +- `rvm-sched/src/switch.rs` -- `SwitchContext`, `SwitchResult`, `partition_switch()` +- `rvm-sched/src/modes.rs` -- `SchedulerMode` (Reflex, Flow, Recovery) +- ADR-132, Section DC-2 -- Coherence graph and mincut +- ADR-132, Section DC-10 -- Epoch-based witness batching +- ADR-144 -- GPU compute support (GPU context fields) diff --git a/docs/adr/ADR-147-hardware-abstraction-layer.md b/docs/adr/ADR-147-hardware-abstraction-layer.md new file mode 100644 index 00000000..79867549 --- /dev/null +++ b/docs/adr/ADR-147-hardware-abstraction-layer.md @@ -0,0 +1,171 @@ +# ADR-147: Hardware Abstraction Layer Contract + +**Status**: Accepted +**Date**: 2026-04-04 +**Authors**: Claude Code (Opus 4.6) +**Supersedes**: None +**Related**: ADR-132 (Hypervisor Core), ADR-133 (HAL Design -- referenced in rvm-hal docs) + +--- + +## Context + +The `rvm-hal` crate defines the hardware abstraction boundary between RVM's +portable kernel logic and architecture-specific implementations. The deep review +identified that while the four HAL traits (`Platform`, `MmuOps`, `TimerOps`, +`InterruptOps`) are implemented and tested, the contract they establish -- +safety invariants, zero-copy requirements, multi-architecture extensibility, +and the `unsafe` code policy -- is not formally specified. + +### Problem Statement + +1. **Unsafe boundary is implicit**: `rvm-hal/src/lib.rs` uses `#![deny(unsafe_code)]` at the crate level but `#[allow(unsafe_code)]` on the `aarch64` module. The rationale and invariant documentation requirements are not specified. +2. **Stage-2 MMU contract is incomplete**: `MmuOps` methods accept `GuestPhysAddr` and `PhysAddr` but do not specify alignment requirements, page size, or double-mapping behaviour. +3. **Zero-copy principle is stated but not enforced**: the module doc says "pass borrowed slices, never owned buffers" but no trait method accepts slices. +4. **Timer precision and monotonicity are not guaranteed**: `now_ns()` returns `u64` but there is no specification of resolution, monotonicity, or overflow behaviour. +5. **Interrupt semantics vary by architecture**: `acknowledge()` returns `Option` (spurious handling) but the protocol for nested interrupts and priority is unspecified. + +--- + +## Decision + +### 1. The Four-Trait Contract + +The HAL is organized as four independent traits (`Platform`, `MmuOps`, +`TimerOps`, `InterruptOps`), each responsible for one hardware subsystem. +Implementations must satisfy all four traits to constitute a complete platform +port. All fallible methods return `RvmResult` (aliased to +`Result`) for portable error handling across architectures. + +### 2. Stage-2 Page Table Management + +`MmuOps` manages the stage-2 translation from guest physical addresses +(`GuestPhysAddr`) to host physical addresses (`PhysAddr`). The contract: + +- **Page granularity**: all operations work on 4KB-aligned pages. Both `guest` + and `host` must be 4KB-aligned; misaligned addresses return + `Err(RvmError::AlignmentError)`. +- **map_page**: creates a new mapping. Returns `Err(RvmError::MemoryOverlap)` + if the guest page is already mapped. Does not support remap-in-place; the + caller must `unmap_page()` first. +- **unmap_page**: removes an existing mapping. Returns an error if the page is + not currently mapped. +- **translate**: returns the host physical address for a mapped guest page. + Returns an error if unmapped. This is a software table walk, not a TLB lookup. +- **flush_tlb**: invalidates TLB entries for `page_count` pages starting at + `guest`. On AArch64, this issues `TLBI IPAS2E1IS` for each page followed by + `DSB ISH` + `ISB`. + +On AArch64, the stage-2 page table root is stored in VTTBR_EL2, which is +written during `partition_switch()` (see ADR-146). The MMU operates at EL2, +translating IPA (Intermediate Physical Address) to PA. + +### 3. AArch64 Implementation: EL2, PL011, GICv2, Generic Timer + +The `rvm-hal::aarch64` module provides the reference implementation: + +| Component | Hardware | Key Registers | +|-----------|----------|--------------| +| MMU | ARMv8-A Stage-2 | VTTBR_EL2, VTCR_EL2 | +| UART | PL011 | UARTDR, UARTFR, UARTCR | +| Interrupt Controller | GICv2 | GICD_*, GICC_* | +| Timer | ARM Generic Timer | CNTPCT_EL0, CNTHP_TVAL_EL2, CNTHP_CTL_EL2 | + +The implementation targets QEMU `virt` machine (Cortex-A72 profile). Timer +resolution matches the physical counter frequency (typically 62.5 MHz on QEMU, +giving ~16ns resolution). `now_ns()` reads `CNTPCT_EL0` and converts via the +counter frequency from `CNTFRQ_EL0`. + +### 4. Future Architecture Targets + +| Target | Privilege Level | MMU | Timer | Interrupt | +|--------|----------------|-----|-------|-----------| +| RISC-V | HS-mode (hypervisor extension) | Sv48x4 (stage-2) | CLINT mtime/mtimecmp | PLIC | +| x86-64 | VMX root mode | EPT (Extended Page Tables) | PIT/HPET/TSC | APIC (Local + I/O) | + +New ports implement the same four traits. Architecture-specific details +(register names, TLB flush instructions, timer resolution) are encapsulated +within the module. Portable kernel code only interacts through the trait +interface. + +### 5. Unsafe Code Policy + +The crate-level lint configuration establishes a two-layer safety model: + +```rust +// rvm-hal/src/lib.rs (crate root) +#![deny(unsafe_code)] // Trait definitions are safe + +// rvm-hal/src/aarch64/mod.rs (arch module) +#[allow(unsafe_code)] // Hardware access requires unsafe +``` + +**Rationale**: trait definitions contain no `unsafe` because the contract is +expressed through Rust's type system and `RvmResult` error handling. Architecture +implementations require `unsafe` for: + +- Inline assembly (`asm!` for MRS/MSR, TLBI, DSB, ISB). +- MMIO register access (volatile reads/writes to device memory). +- Raw pointer dereference for page table manipulation. + +**Invariant documentation requirement**: every `unsafe` block in an architecture +module must include a `// SAFETY:` comment documenting: + +1. What invariant the unsafe code relies on. +2. Why the invariant holds at this call site. +3. What could go wrong if the invariant is violated. + +The crate uses `#![deny(unsafe_code)]` rather than `#![forbid(unsafe_code)]` +precisely because architecture modules need the ability to `#[allow(unsafe_code)]` +locally. `forbid` would prevent this. + +### 6. Zero-Copy Principle + +The module documentation states: "pass borrowed slices, never owned buffers." +While the current trait methods do not accept slices (they operate on scalar +addresses and IDs), the principle applies to: + +- **MmuOps**: addresses are passed by value (`GuestPhysAddr`, `PhysAddr`), not + by reference to owned memory buffers. Page table memory is managed internally. +- **InterruptOps**: IRQ IDs are passed by value. No interrupt descriptor + allocation is exposed through the trait. +- **Future extensions**: any method that needs to transfer bulk data (e.g., DMA + descriptor lists, firmware blobs) must accept `&[u8]` or `&mut [u8]`, never + `Vec` or `Box<[u8]>`. This keeps the HAL `no_std`-compatible and + allocation-free. + +--- + +## Consequences + +### Positive + +- Four clean trait boundaries enable independent porting of each hardware subsystem. +- `deny(unsafe_code)` at the crate level with local `allow` provides defense-in-depth. +- `RvmResult` return types give portable error handling across all architectures. +- 4KB page granularity matches all three target architectures' minimum page size. +- Zero-copy principle keeps the HAL compatible with `#![no_std]` and zero-heap environments. + +### Negative + +- No support for large pages (2MB, 1GB) in the current `MmuOps` contract. +- Timer resolution is architecture-dependent and not queryable through the trait. +- `InterruptOps` does not model interrupt priorities or nesting. +- Only one architecture (AArch64) is implemented; RISC-V and x86-64 are future work. + +### Risks + +- Without a timer resolution query method, portable code cannot accurately reason + about deadline precision on different platforms. +- The `acknowledge()` -> `end_of_interrupt()` protocol assumes level-triggered + semantics; edge-triggered interrupts may require a different flow. + +--- + +## References + +- `rvm-hal/src/lib.rs` -- Trait definitions: `Platform`, `MmuOps`, `TimerOps`, `InterruptOps` +- `rvm-hal/src/aarch64/` -- AArch64 implementation (EL2, PL011, GICv2, ARM generic timer) +- `rvm-types/src/lib.rs` -- `GuestPhysAddr`, `PhysAddr`, `RvmResult`, `RvmError` +- ADR-132 -- RVM hypervisor core design constraints +- ADR-146 -- SMP scheduling model (VTTBR_EL2 usage in partition_switch) diff --git a/docs/adr/ADR-148-error-model-recovery.md b/docs/adr/ADR-148-error-model-recovery.md new file mode 100644 index 00000000..fc2bf082 --- /dev/null +++ b/docs/adr/ADR-148-error-model-recovery.md @@ -0,0 +1,134 @@ +# ADR-148: Error Model and Recovery State Machine + +**Status**: Accepted +**Date**: 2026-04-04 +**Authors**: Claude Code (Opus 4.6) +**Supersedes**: None +**Related**: ADR-132 (Hypervisor Core, DC-14 escalation), ADR-134 (Witness Schema), ADR-142 (TEE Crypto) + +--- + +## Context + +RVM's error model spans 34 error variants across 9 subsystems, a 4-level failure +classification, and recovery primitives (checkpoints and reconstruction receipts). +The deep review found that while the types are well-defined, the escalation +chain between failure classes, the mapping from `RvmError` variants to failure +classes, and the recovery state machine are not specified. Without this +specification, recovery logic cannot be implemented consistently across crates. + +### Problem Statement + +1. **No error-to-failure-class mapping**: which `RvmError` variants are Transient vs Permanent? +2. **No escalation protocol**: when does a Transient failure become Recoverable? When does Recoverable become Permanent? +3. **No recovery state machine**: how does a partition transition from error detection to checkpoint restore to resumed execution? +4. **No witness integration**: error events and recovery actions are not witnessed. +5. **Retryability is ambiguous**: callers cannot determine which errors are safe to retry. + +--- + +## Decision + +### 1. Failure Classes (F1--F4) + +The `FailureClass` enum defines four severity levels with ordered discriminants +(`#[repr(u8)]`, `PartialOrd` + `Ord` derived). Severity can only escalate, never +de-escalate. + +| Class | Enum | Value | Recovery Action | Scope | +|-------|------|-------|----------------|-------| +| F1 | `Transient` | 0 | Agent restart / retry | Single operation | +| F2 | `Recoverable` | 1 | Partition reconstruct from checkpoint | Single partition | +| F3 | `Permanent` | 2 | Memory rollback, partition destroy + recreate | Partition + dependents | +| F4 | `Catastrophic` | 3 | Kernel reboot | System-wide | + +### 2. Error-to-Failure-Class Mapping + +Each `RvmError` variant maps to a default `FailureClass`. Context may escalate +the class (e.g., repeated Transient failures escalate to Recoverable). + +- **F1 (Transient)**: `ResourceLimitExceeded`, `WitnessLogFull`, `ProofBudgetExceeded`, `MinCutBudgetExceeded`, `MigrationTimeout`, `DeviceLeaseExpired` -- temporary conditions; retry after release or budget reset. +- **F2 (Recoverable)**: `InvalidPartitionState`, `SplitPreconditionFailed`, `MergePreconditionFailed`, `CoherenceBelowThreshold`, `StaleCapability`, `InvalidTierTransition`, `DeviceLeaseConflict` -- state corruption that checkpoint restore can fix. +- **F3 (Permanent)**: `PartitionNotFound`, `PartitionLimitExceeded`, `VcpuNotFound`, `VcpuLimitReached`, all Capability errors (`InsufficientCapability`, `CapabilityTypeMismatch`, `DelegationDepthExceeded`, `CapabilityConsumed`), `WitnessVerificationFailed`, `WitnessChainBroken`, `ProofInvalid`, `ProofTierInsufficient`, `MemoryOverlap`, `AlignmentError`, `OutOfMemory`, `DeviceLeaseNotFound`, `Unsupported` -- unrecoverable without partition destroy and recreate. +- **F4 (Catastrophic)**: `CheckpointNotFound`, `CheckpointCorrupted`, `FailureEscalated`, `InternalError` -- recovery infrastructure itself is compromised; kernel reboot required. + +### 3. Witnessed Escalation Chain + +Failure escalation follows a strict witnessed protocol: + +``` +F1 (Transient) + │ retry count > 3 + ▼ +F2 (Recoverable) + │ checkpoint restore fails + ▼ +F3 (Permanent) + │ reconstruction receipt cannot be created + ▼ +F4 (Catastrophic) +``` + +Each escalation step generates a witness record (ADR-134) containing: + +- `original_error: RvmError` -- the triggering error variant. +- `from_class: FailureClass` -- the class before escalation. +- `to_class: FailureClass` -- the class after escalation. +- `partition_id: PartitionId` -- the affected partition. +- `epoch: u32` -- the epoch at escalation time. +- `retry_count: u32` -- number of retries attempted (F1 only). + +Escalation is monotonic: once a failure reaches F3, it cannot return to F2. +The `FailureClass` enum's `Ord` implementation (`Transient < Recoverable < +Permanent < Catastrophic`) enforces this. + +### 4. Recovery Checkpoint and Reconstruction Receipt + +`RecoveryCheckpoint` captures a partition's restorable state: `partition: PartitionId`, `witness_sequence: u64`, `timestamp_ns: u64`, `epoch: u32`. Checkpoints are taken periodically (every N epochs), before risky operations (split, merge, migration), and on demand. Recovery restores the partition to the checkpoint state and replays witnessed events from `witness_sequence` forward, skipping the events that caused the failure. + +`ReconstructionReceipt` provides an audit trail: `original_id: PartitionId`, `checkpoint: RecoveryCheckpoint`, `was_hibernated: bool`. The `was_hibernated` flag distinguishes Dormant-tier reactivation (`true`) from F3 failure recreation (`false`). The receipt is itself witnessed, creating an unbroken audit chain from failure detection through recovery completion. + +### 5. Error Propagation Rules + +| Property | Rule | +|----------|------| +| Retryability | F1 errors are safe to retry (up to 3 times). F2+ errors must not be retried without recovery. | +| Propagation | Errors propagate upward via `RvmResult`. Callers must not silently discard errors. | +| Conversion | Lower-level errors may be wrapped in higher-level variants (e.g., `OutOfMemory` from the allocator becomes `InvalidPartitionState` at the partition layer if it prevents state transition). | +| Logging | All F2+ errors must be witness-logged before propagation. F1 errors are logged only on escalation. | +| Idempotency | Recovery actions (checkpoint restore, partition destroy) must be idempotent. Re-executing a recovery action on an already-recovered partition is a no-op. | + +--- + +## Consequences + +### Positive + +- Clear mapping from error variants to failure classes enables automated recovery. +- Witnessed escalation creates an auditable chain from error to resolution. +- Monotonic escalation prevents recovery loops (bouncing between F1 and F2). +- `RecoveryCheckpoint` and `ReconstructionReceipt` provide complete audit trail. +- Retryability rules prevent callers from retrying non-retryable errors. + +### Negative + +- 34 error variants make exhaustive matching verbose. +- The 3-retry threshold for F1->F2 escalation is fixed; some errors may benefit from more retries. +- Checkpoint storage is not specified (this ADR covers the data model, not persistence). + +### Risks + +- If checkpoint frequency is too low, recovery may lose significant state. +- If witness logging of escalation events fails (e.g., `WitnessLogFull`), the + escalation itself becomes an F4 event, potentially causing cascading failure. +- F3 recovery (partition destroy + recreate) loses all non-checkpointed state. + +--- + +## References + +- `rvm-types/src/error.rs` -- `RvmError` (34 variants), `RvmResult` +- `rvm-types/src/recovery.rs` -- `FailureClass`, `RecoveryCheckpoint`, `ReconstructionReceipt` +- ADR-132, Section DC-14 -- Failure escalation protocol +- ADR-134 -- Witness schema and log format +- ADR-142 -- TEE-backed cryptographic verification (witness chain integrity) diff --git a/docs/adr/ADR-149-rvf-integration.md b/docs/adr/ADR-149-rvf-integration.md new file mode 100644 index 00000000..c1e5b7a1 --- /dev/null +++ b/docs/adr/ADR-149-rvf-integration.md @@ -0,0 +1,140 @@ +# ADR-149: RVF Integration for RVM + +**Status**: Accepted +**Date**: 2026-04-04 +**Authors**: Claude Code (Opus 4.6) +**Supersedes**: None +**Related**: ADR-132 (Hypervisor Core, memory tiers), ADR-134 (Witness Schema), ADR-142 (TEE Crypto), ADR-148 (Error Model) + +--- + +## Context + +The RuVector ecosystem includes 22 RVF (RuVector Format) crates that provide a +standardized package format for boot images, vector indexes, cryptographic +manifests, and runtime containers. RVM currently has no specification for how it +consumes RVF packages. The deep review identified this as specification gap B5: +how RVM uses RVF for boot images, dormant memory checkpoints, witness archives, +and GPU kernel distribution is undocumented. + +The key RVF crates for RVM integration are: `rvf-types` (core types), `rvf-crypto` (signing/verification), `rvf-manifest` (package manifests), `rvf-index` (HNSW indexing), `rvf-runtime` (execution environment), `rvf-kernel` (kernel integration), and `rvf-quant` (quantized compression). A total of 22 crates are available in `ruvector/crates/rvf/`. + +### Problem Statement + +1. **Boot image format is unspecified**: `rvm-boot` loads partition images but the packaging format is ad hoc. +2. **Dormant memory has no container format**: the `Dormant` tier (tier 2) in `rvm-memory` stores compressed state, but the container format and metadata are not standardized. +3. **Witness archives are not portable**: witness logs are append-only in memory but have no export/archive format. +4. **GPU kernel distribution is undefined**: ADR-144 introduces GPU compute but does not specify how GPU kernels are packaged and distributed to partitions. +5. **No cryptographic binding between RVF signing and RVM verification**: `rvf-crypto` and `rvm-proof`'s `WitnessSigner` both provide signing, but their relationship is not specified. + +--- + +## Decision + +### 1. RVF as the Universal Container Format + +All persistent artifacts in RVM are stored as RVF containers. An RVF container +consists of: + +- **Manifest** (`rvf-manifest`): JSON or CBOR metadata describing contents, + version, dependencies, and cryptographic hashes. +- **Payload**: the raw artifact data (binary image, compressed state, log entries). +- **Signature** (`rvf-crypto`): Ed25519 or HMAC-SHA256 signature over the + manifest + payload hash. + +RVF containers are identified by a content-addressed hash of the manifest, +enabling deduplication and integrity verification without parsing the payload. + +### 2. Boot Image Packaging + +`rvm-boot` loads partition images from RVF containers. The manifest declares `type: "rvm-boot-image"`, `target_arch`, `entry_point`, `stack_size`, `vmid`, and `payload_hash`. The payload is an ELF or flat binary; the signature uses Ed25519. + +Boot sequence: (1) verify signature via `rvf-crypto`, (2) verify payload hash against manifest, (3) extract entry point/stack/VMID and pass to `SwitchContext::init()` (ADR-146), (4) load binary into stage-2 address space via `MmuOps::map_page()` (ADR-147), (5) emit witness record with boot image hash, VMID, and epoch. + +### 3. Dormant Memory Checkpoints + +The `Dormant` tier (tier 2) in `rvm-memory/src/tier.rs` stores compressed partition state. When coherence drops below `TierThresholds::warm_to_dormant`, memory is compressed and stored as an RVF checkpoint container. The manifest declares `type: "rvm-checkpoint"`, `partition_id`, `epoch`, `witness_sequence`, `timestamp_ns`, `compression: "lz4"`, and `payload_hash`. These fields map directly to `RecoveryCheckpoint` (ADR-148). + +Reconstruction (Dormant -> Warm) uses `ReconstructionReceipt` with `was_hibernated = true`. The `ReconstructionPipeline` in `rvm-memory/src/reconstruction.rs` decompresses the payload and remaps pages via `MmuOps::map_page()`. + +### 4. Witness Archive + +Witness logs (ADR-134) are archived to RVF containers for long-term storage. The manifest declares `type: "rvm-witness-archive"`, `sequence_range`, `epoch_range`, `record_count`, `chain_hash` (final hash in the witness chain), and `payload_hash`. The `chain_hash` enables verification that the archive is a contiguous, untampered segment. Archives are stored in Cold tier (tier 3, ADR-132) and accessed only during recovery or audit. + +### 5. GPU Kernel Distribution + +GPU kernels (ADR-144) are distributed as RVF containers with `type: "rvm-gpu-kernel"`, `backend` (webgpu/cuda/opencl), `workgroup_size`, and `required_capabilities`. The payload is SPIR-V or WGSL bytecode. The GPU manager verifies the container signature before loading, maps the kernel into the partition's GPU address space, and updates `SwitchContext::gpu_queue_head`. + +### 6. Cryptographic Alignment: rvf-crypto and WitnessSigner + +`rvf-crypto` provides Ed25519 and HMAC-SHA256 signing. `rvm-proof`'s +`WitnessSigner` trait (ADR-134, ADR-142) provides the same algorithms for +witness chain signing. The alignment: + +| Operation | rvf-crypto | rvm-proof WitnessSigner | +|-----------|-----------|----------------------| +| Ed25519 sign | `rvf_crypto::sign_ed25519()` | `Ed25519Signer::sign()` | +| Ed25519 verify | `rvf_crypto::verify_ed25519()` | `Ed25519Signer::verify()` | +| HMAC-SHA256 | `rvf_crypto::hmac_sha256()` | `HmacSha256Signer::sign()` | + +Both use the same key format (32-byte Ed25519 seed, 32-byte HMAC key). RVM +uses `WitnessSigner` for runtime witness operations and `rvf-crypto` for +container-level signing. A single key pair can be used for both, enabling +end-to-end verification from witness record to archived container. + +For TEE deployments (ADR-142), the signing key is held in the secure enclave +and both `WitnessSigner` and `rvf-crypto` delegate to the TEE signing API. + +### 7. Integration Path + +The RVF integration touches three RVM crates: + +| RVM Crate | RVF Dependency | Integration Point | +|-----------|---------------|-------------------| +| `rvm-boot` | `rvf-types`, `rvf-manifest`, `rvf-crypto` | Boot image loading and verification | +| `rvm-memory` | `rvf-types`, `rvf-manifest`, `rvf-crypto`, `rvf-quant` | Dormant checkpoint storage and reconstruction | +| `rvm-witness` | `rvf-types`, `rvf-manifest`, `rvf-crypto` | Witness log archival to Cold tier | +| `rvm-gpu` | `rvf-types`, `rvf-manifest`, `rvf-crypto` | GPU kernel packaging and distribution | + +All RVF dependencies are optional, gated behind the `rvf` feature flag. Without +the feature, RVM operates with raw binary formats (backwards compatible). + +--- + +## Consequences + +### Positive + +- Unified container format across all persistent artifacts. +- Cryptographic signing provides tamper-evident packaging for all artifacts. +- Content-addressed manifests enable deduplication and integrity checking. +- `rvf-quant` provides quantized compression for Dormant tier, reducing memory. +- Feature-gated integration preserves minimal builds for Seed tier (64KB MCU). + +### Negative + +- RVF adds 3-4 new crate dependencies to each integrating RVM crate. +- Manifest parsing requires either JSON or CBOR support, adding code size. +- Container overhead (manifest + signature) adds latency to boot and checkpoint paths. + +### Risks + +- If `rvf-crypto` and `rvm-proof` key formats diverge, key management becomes fragmented. +- RVF container format changes in upstream ruvector could break RVM compatibility. +- Cold-tier witness archives may grow unbounded without a retention policy. + +--- + +## References + +- `ruvector/crates/rvf/` -- 22 RVF crates +- `rvm-memory/src/tier.rs` -- `Tier::Dormant`, `TierThresholds` +- `rvm-memory/src/reconstruction.rs` -- `ReconstructionPipeline` +- `rvm-types/src/recovery.rs` -- `RecoveryCheckpoint`, `ReconstructionReceipt` +- `rvm-proof` -- `WitnessSigner` trait (Ed25519, HMAC-SHA256, DualHmac) +- ADR-132 -- Memory tier model (Hot, Warm, Dormant, Cold) +- ADR-134 -- Witness schema and log format +- ADR-142 -- TEE-backed cryptographic verification +- ADR-144 -- GPU compute support +- ADR-148 -- Error model and recovery (checkpoint/reconstruction types) +- `docs/RUVECTOR-INTEGRATION.md` -- RVF crate inventory diff --git a/docs/adr/ADR-150-device-lease-lifecycle.md b/docs/adr/ADR-150-device-lease-lifecycle.md new file mode 100644 index 00000000..f8a7b78a --- /dev/null +++ b/docs/adr/ADR-150-device-lease-lifecycle.md @@ -0,0 +1,160 @@ +# ADR-150: Device Lease Lifecycle Protocol + +**Status**: Accepted +**Date**: 2026-04-04 +**Authors**: Claude Code (Opus 4.6) +**Supersedes**: None +**Related**: ADR-132 (Hypervisor Core), ADR-134 (Witness Schema), ADR-144 (GPU Compute Support) + +--- + +## Context + +RVM partitions access hardware devices through time-bounded, revocable leases +managed by `DeviceLeaseManager`. The deep review identified that while the +lease manager is implemented and tested, the full lifecycle protocol -- +registration, grant, revocation, expiry, witness logging, GPU device handling, +and DMA budget integration -- is not formally specified. + +### Problem Statement + +1. **Registration semantics are incomplete**: `register_device()` assigns sequential IDs but does not validate MMIO region overlaps or IRQ conflicts. +2. **Capability gating is by hash only**: `ActiveLease::capability_hash` records the authorizing capability but the grant path does not validate the capability itself. +3. **Single-holder exclusivity is implemented but not specified**: `grant_lease()` checks `device.available` but the exclusivity invariant is not documented. +4. **Lease expiry is passive**: `expire_leases(current_epoch)` must be called explicitly; there is no automatic expiry mechanism. +5. **GPU devices have no special handling**: `DeviceClass::Graphics` exists but GPU-specific MMIO mapping, IOMMU isolation, and command queue management are not addressed. +6. **DMA budget integration is missing**: device DMA transfers should count against the partition's `DmaBudget` but there is no enforcement path. + +--- + +## Decision + +### 1. Device Registration Protocol + +`DeviceLeaseManager::register_device(info)` registers a hardware device. It assigns a sequential `u32` id, sets `available = true`, and stores the `DeviceInfo` descriptor (id, `DeviceClass`, mmio_base, mmio_size, irq). Returns `Err(RvmError::ResourceLimitExceeded)` if the device table (fixed-size, `MAX_DEVICES` slots) is full. Devices are registered at boot time by the HAL (ADR-147). + +### 2. Lease Grant: Capability-Gated, Epoch-Based, Exclusive + +Lease grant follows `grant_lease(device_id, partition, duration_epochs, current_epoch, cap_hash)`: + +**Preconditions**: +1. `device_id` must refer to a registered device (`DeviceLeaseNotFound` otherwise). +2. The device must be available (`DeviceLeaseConflict` if already leased). +3. The lease table must have capacity (`ResourceLimitExceeded` if full). + +**Grant semantics**: +- A new `ActiveLease` is created with `granted_epoch = current_epoch` and + `expiry_epoch = current_epoch + duration_epochs` (saturating add). +- The device is marked `available = false`. +- The lease is assigned a monotonically increasing `DeviceLeaseId`. +- `capability_hash` records the truncated FNV-1a hash of the capability token + that authorized the grant, providing audit linkage. + +**Single-holder exclusivity**: a device may be leased to at most one partition +at any time. This is enforced by the `available` flag on `DeviceInfo`. The +invariant: `device.available == false` if and only if there exists an +`ActiveLease` with `lease.device_id == device.id`. + +### 3. Lease Revocation: Immediate, Cascading + +`revoke_lease(lease_id)` immediately terminates a lease: + +1. The `ActiveLease` is removed from the lease table. +2. The underlying device is marked `available = true`. +3. Returns `Err(RvmError::DeviceLeaseNotFound)` if the lease does not exist. + +**Cascading revocation**: when a partition is destroyed (ADR-148 F3 recovery), +all leases held by that partition must be revoked. The caller iterates +`leases` and revokes each matching `partition_id`. This also revokes any +derived capabilities that reference the lease. + +**Immediate effect**: the partition loses MMIO access at revocation time. +Any in-flight DMA transfers to/from the device continue to completion (hardware +cannot be interrupted mid-transfer), but no new transfers are permitted. + +### 4. Lease Expiry: Automatic Collection + +`expire_leases(current_epoch)` scans the lease table and collects all leases where `current_epoch >= expiry_epoch`, removes them, releases devices back to the available pool, and returns the count of expired leases. The scheduler calls this at each epoch boundary (passive -- no background timer). + +`check_lease(lease_id, current_epoch)` validates a lease: returns `Ok(&ActiveLease)` if valid, `Err(DeviceLeaseExpired)` if expired, or `Err(DeviceLeaseNotFound)` if absent. + +### 5. GPU Devices: DeviceClass::Graphics + +GPU devices are registered with `DeviceClass::Graphics` and have additional +lifecycle requirements (ADR-144): + +| Concern | Protocol | +|---------|----------| +| MMIO mapping | GPU register aperture (`mmio_base`, `mmio_size`) is mapped into the partition's stage-2 address space via `MmuOps::map_page()` at lease grant time. Unmapped at revocation. | +| IOMMU isolation | GPU page table base (`SwitchContext::gpu_pt_base`) is set to the partition's GPU-specific page table. Other partitions cannot access this partition's GPU memory. | +| Command queue | `SwitchContext::gpu_queue_head` tracks the partition's position in the GPU command queue. Saved/restored during context switch. | +| Budget | GPU operations count against both `DmaBudget` (DMA transfers) and `GpuBudget` (compute time, GPU memory). | + +GPU lease duration should be conservative (shorter epochs) because GPU resources +are scarce and high-value. The recommended default for Graphics devices is 10 +epochs, compared to 100+ for Network or Storage devices. + +### 6. Witness Logging of Lease Transitions + +All lease lifecycle events are witness-logged (ADR-134): + +| Event | Trigger | Witness Fields | +|-------|---------|---------------| +| `DeviceRegistered` | `register_device()` returns `Ok(id)` | device_id, class, mmio_base, mmio_size, irq | +| `LeaseGranted` | `grant_lease()` returns `Ok(lease_id)` | lease_id, device_id, partition_id, granted_epoch, expiry_epoch, capability_hash | +| `LeaseRevoked` | `revoke_lease()` returns `Ok(())` | lease_id, device_id, partition_id, epoch_at_revocation | +| `LeaseExpired` | `expire_leases()` collects a lease | lease_id, device_id, partition_id, expiry_epoch | +| `LeaseCheckFailed` | `check_lease()` returns `Err` | lease_id, error_variant, current_epoch | + +Witness records are batched per epoch (DC-10). Lease events are low-frequency +(compared to IPC messages) so the witness overhead is negligible. + +### 7. DMA Budget Integration + +Device DMA transfers count against the partition's `DmaBudget` (`rvm-security/src/budget.rs`). Before each transfer of `N` bytes, the hypervisor calls `DmaBudget::try_consume(N)`: if `used_bytes + N <= max_bytes` the transfer proceeds, otherwise `Err(RvmError::ResourceLimitExceeded)` is returned. The budget resets at each epoch boundary. + +For GPU devices (`DeviceClass::Graphics`), both `DmaBudget` (raw DMA bytes) and `GpuBudget` (compute nanoseconds, GPU memory bytes) must pass. The `ResourceQuota` composite in `rvm-security` bundles `DmaBudget` with other per-partition quotas. + +### 8. Zero-Heap Invariant + +`DeviceLeaseManager` uses const generics for all storage. Both `devices` and `leases` arrays are `[Option; N]` initialized with `None` sentinels. No heap allocation occurs at any lifecycle stage. Expiry uses a stack-allocated `expired_device_ids` buffer of size `MAX_LEASES`. This makes the lease manager suitable for `#![no_std]` bare-metal environments. + +--- + +## Consequences + +### Positive + +- Single-holder exclusivity prevents device sharing conflicts between partitions. +- Epoch-based expiry provides automatic lease cleanup without background threads. +- Zero-heap implementation is compatible with all deployment tiers. +- Witness logging of all transitions provides complete audit trail. +- DMA budget enforcement prevents partition-level DMA bus monopolization. + +### Negative + +- No MMIO overlap validation at registration time; overlapping devices could cause conflicts. +- Passive expiry requires the scheduler to call `expire_leases()` at each epoch boundary. +- No lease renewal mechanism; a partition must re-acquire after expiry. +- `capability_hash` is stored but not verified against the capability subsystem at grant time. + +### Risks + +- If `expire_leases()` is not called regularly, expired leases accumulate and + devices remain unavailable. +- GPU lease revocation during active GPU operations may leave the GPU in an + inconsistent state; a GPU reset protocol is needed (ADR-144 future work). +- DMA budget enforcement adds overhead to every DMA transfer path. + +--- + +## References + +- `rvm-partition/src/device.rs` -- `DeviceLeaseManager`, `DeviceInfo`, `ActiveLease` +- `rvm-types/src/device.rs` -- `DeviceLeaseId`, `DeviceClass`, `DeviceLease`, `GpuMemoryType`, `GpuQueuePriority` +- `rvm-security/src/budget.rs` -- `DmaBudget`, `ResourceQuota` +- `rvm-sched/src/switch.rs` -- `SwitchContext::gpu_queue_head`, `SwitchContext::gpu_pt_base` +- ADR-132 -- Hypervisor core design constraints +- ADR-134 -- Witness schema and log format (DC-10 batching) +- ADR-144 -- GPU compute support (GpuBudget, IOMMU isolation) +- ADR-148 -- Error model and recovery (F3 cascading revocation) diff --git a/docs/adr/ADR-151-gpu-witness-event-registry.md b/docs/adr/ADR-151-gpu-witness-event-registry.md new file mode 100644 index 00000000..25084934 --- /dev/null +++ b/docs/adr/ADR-151-gpu-witness-event-registry.md @@ -0,0 +1,200 @@ +# ADR-151: GPU Witness Event Registry + +**Status**: Accepted +**Date**: 2026-04-04 +**Authors**: Claude Code (Opus 4.6) +**Supersedes**: None +**Related**: ADR-134 (Witness Schema and Log Format), ADR-142 (TEE-Backed Cryptographic Verification), ADR-144 (GPU Compute Support) + +--- + +## Context + +ADR-144 introduces the `rvm-gpu` crate with capability-gated, budget-enforced GPU compute. Design constraint DC-GPU-5 mandates that every GPU operation emits a `WitnessRecord`. However, the `ActionKind` enum defined in ADR-134 does not include GPU-specific variants. The enum currently occupies ranges 0x01-0x0F (partition), 0x10-0x1F (capability), 0x20-0x2F (memory), 0x30-0x3F (communication), 0x40-0x4F (device), 0x50-0x5F (proof), 0x60-0x6F (scheduler), 0x70-0x7F (recovery), 0x80-0x8F (boot), and 0x90-0x9F (vector/graph). The 0xA0-0xAF range is unallocated and reserved for GPU operations. + +### Problem Statement + +1. **Audit gap**: GPU kernel launches, buffer allocations, memory transfers, and context switches are privileged operations that mutate partition state and consume shared resources. Without dedicated `ActionKind` variants, these operations either go unwitnessed (violating INV-3: no witness, no mutation) or are logged under generic device lease variants that lose GPU-specific forensic data. +2. **Payload encoding undefined**: The witness record's 8-byte `payload` field (offset 32) and 8-byte `aux` field (offset 56) must encode GPU-specific data (kernel ID, buffer ID, transfer size, compute duration). Without a canonical encoding, audit tools cannot decode GPU witness records. +3. **Forensic query patterns**: Operators need to query the witness log for GPU-specific events -- "show all kernel launches for partition X", "find all budget exceeded events in the last epoch", "trace all memory transfers above 1MB". These queries require filtering by `ActionKind` in the GPU range. + +### SOTA References + +| Source | Key Contribution | Relevance | +|--------|-----------------|-----------| +| ADR-134 | Witness record format, ActionKind enum, payload encoding conventions | Direct extension point for GPU variants | +| ADR-142 | SHA-256 chain hashing, signed witness records | GPU witnesses use the same chain and signing infrastructure | +| ADR-144 | GPU architecture, DC-GPU-5 (all GPU ops witnessed) | Defines the GPU operations that require witness coverage | +| NVIDIA nvprof | GPU kernel profiling event schema | Informs which GPU events are forensically valuable | +| ARM Streamline | GPU workload trace events | Validates kernel launch + memory transfer as primary audit events | + +--- + +## Decision + +Extend the `ActionKind` enum with 12 GPU-specific variants in the 0xA0-0xAF range. Define canonical payload encodings for each variant. GPU witness records use the same 64-byte format, same hash chain, and same ~17ns emission target as all other witness records. + +### New ActionKind Variants + +```rust +// --- GPU operations (0xA0-0xAF) --- +GpuContextCreate = 0xA0, // Partition creates a GPU context +GpuContextDestroy = 0xA1, // Partition destroys a GPU context +GpuKernelLaunch = 0xA2, // Kernel submitted to GPU queue +GpuKernelComplete = 0xA3, // Kernel execution completed +GpuKernelTimeout = 0xA4, // Kernel exceeded deadline (killed) +GpuBufferAlloc = 0xA5, // GPU buffer allocated +GpuBufferFree = 0xA6, // GPU buffer freed +GpuTransfer = 0xA7, // Host<->GPU memory transfer +GpuSync = 0xA8, // GPU queue synchronization barrier +GpuBudgetExceeded = 0xA9, // GPU budget quota exceeded +GpuIommuViolation = 0xAA, // IOMMU fault on GPU memory access +GpuCompileFail = 0xAB, // Kernel compilation failed +``` + +### Payload Encoding by Variant + +Each GPU witness record packs operation-specific data into the 8-byte `payload` field and optionally the 8-byte `aux` field. All encodings are little-endian. + +| ActionKind | payload (8 bytes) | aux (8 bytes) | flags (2 bytes) | +|------------|------------------|---------------|-----------------| +| `GpuContextCreate` | `context_id: u32 \| backend_id: u32` | `memory_budget_bytes: u64` | backend type in low byte | +| `GpuContextDestroy` | `context_id: u32 \| active_buffers: u32` | `total_compute_ns: u64` | 0 | +| `GpuKernelLaunch` | `kernel_id: u16 \| workgroup_x: u16 \| workgroup_y: u16 \| workgroup_z: u16` | `timeout_ns: u64` | 0 | +| `GpuKernelComplete` | `kernel_id: u16 \| _reserved: u16 \| computation_ns: u32` | `output_bytes: u64` | 0 | +| `GpuKernelTimeout` | `kernel_id: u16 \| _reserved: u16 \| elapsed_ns: u32` | `deadline_ns: u64` | 0 | +| `GpuBufferAlloc` | `buffer_id: u32 \| size_bytes: u32` | `budget_remaining_bytes: u64` | usage flags in low byte | +| `GpuBufferFree` | `buffer_id: u32 \| size_bytes: u32` | `budget_remaining_bytes: u64` | 0 | +| `GpuTransfer` | `buffer_id: u32 \| transfer_bytes: u32` | `direction: u8 \| _pad: [u8; 3] \| bandwidth_remaining: u32` | 0 = H2D, 1 = D2H in flags low byte | +| `GpuSync` | `queue_depth_before: u32 \| queue_depth_after: u32` | `wait_ns: u64` | 0 | +| `GpuBudgetExceeded` | `budget_type: u8 \| _pad: [u8; 3] \| requested: u32` | `limit: u64` | budget_type: 0=compute, 1=memory, 2=transfer, 3=launches | +| `GpuIommuViolation` | `fault_addr_low: u32 \| fault_type: u32` | `fault_addr_high: u32 \| partition_gpu_base: u32` | 0 | +| `GpuCompileFail` | `kernel_id: u16 \| error_code: u16 \| source_hash_low: u32` | `source_hash_high: u64` | 0 | + +### Encoding Conventions + +The payload encoding follows the same conventions as existing `ActionKind` variants (see ADR-134 Section 7): + +1. **IDs in high bits, measurements in low bits**: When a record carries both an identifier and a measurement, the ID occupies the high 32 bits and the measurement occupies the low 32 bits of the `payload` field. +2. **Duration in nanoseconds, truncated to u32**: Compute duration is stored as `u32` nanoseconds, giving a maximum of ~4.29 seconds. Any kernel running longer than this has already been killed by the 100ms deadline (DC-GPU-7 in ADR-144). +3. **Budget remaining after operation**: For allocation and transfer events, the `aux` field carries the budget remaining after the operation. This enables trend analysis without requiring a separate budget query. +4. **Transfer direction in flags**: The `flags` field (offset 18, u16) encodes the transfer direction for `GpuTransfer` records: 0 = host-to-device, 1 = device-to-host. + +### Integration with WitnessEmitter + +The `GpuQueue` in `rvm-gpu` calls the existing `WitnessEmitter::emit()` method. No new emission API is needed. The GPU crate constructs a `WitnessRecord` with the appropriate `ActionKind` variant and payload encoding, then passes it to the emitter. + +```rust +// rvm-gpu/src/queue.rs (sketch) +fn submit_kernel(&mut self, kernel: &GpuKernel, config: LaunchConfig) -> Result<(), GpuError> { + // ... capability check, budget check ... + + self.witness_emitter.emit(WitnessRecord::new( + ActionKind::GpuKernelLaunch, + proof.tier(), + self.partition_id, + kernel.id() as u32, + cap_handle.hash_truncated(), + encode_kernel_launch_payload(kernel.id(), config), + ))?; + + // ... submit to hardware queue ... + Ok(()) +} +``` + +The `target_object_id` field (offset 24, u32) carries the GPU context ID for context operations, the kernel ID for kernel operations, and the buffer ID for memory operations. This enables `scan_by_target()` queries to find all operations on a specific GPU object. + +### Audit Query Patterns + +GPU forensics uses the existing `scan_by_kind()` function from ADR-134 with GPU-specific `ActionKind` filters: + +```rust +/// Find all kernel launches for a given partition. +pub fn query_gpu_kernel_launches( + log: &WitnessLog, + partition_id: u32, +) -> impl Iterator { + log.iter().filter(move |r| + r.action_kind == ActionKind::GpuKernelLaunch + && r.actor_partition_id == partition_id) +} + +/// Find all budget exceeded events in a time range. +pub fn query_gpu_budget_violations( + log: &WitnessLog, + start_ns: u64, + end_ns: u64, +) -> impl Iterator { + log.iter().filter(move |r| + r.action_kind == ActionKind::GpuBudgetExceeded + && r.timestamp_ns >= start_ns + && r.timestamp_ns <= end_ns) +} + +/// Find all memory transfers above a size threshold. +pub fn query_gpu_large_transfers( + log: &WitnessLog, + min_bytes: u32, +) -> impl Iterator { + log.iter().filter(move |r| + r.action_kind == ActionKind::GpuTransfer + && (r.payload as u32) >= min_bytes) +} + +/// Trace the full lifecycle of a GPU context. +pub fn query_gpu_context_lifecycle( + log: &WitnessLog, + context_id: u32, +) -> impl Iterator { + log.iter().filter(move |r| + matches!(r.action_kind, + ActionKind::GpuContextCreate + | ActionKind::GpuContextDestroy + | ActionKind::GpuKernelLaunch + | ActionKind::GpuKernelComplete + | ActionKind::GpuKernelTimeout + | ActionKind::GpuBufferAlloc + | ActionKind::GpuBufferFree + | ActionKind::GpuTransfer + | ActionKind::GpuSync + ) && r.target_object_id == context_id) +} +``` + +### Performance Budget + +GPU witness emission uses the same ring buffer and emission path as all other witness records. The target is ~17ns per emission (well within the 500ns budget from ADR-134). GPU operations themselves are measured in microseconds to milliseconds, so the witness overhead is negligible relative to the GPU operation latency. + +| GPU Operation | Typical Latency | Witness Overhead | Overhead Ratio | +|--------------|----------------|-----------------|----------------| +| Kernel launch | 10us | ~17ns | 0.17% | +| Buffer alloc (4KB) | 2us | ~17ns | 0.85% | +| Memory transfer (1MB) | 50us | ~17ns | 0.03% | +| Context create | 100us | ~17ns | 0.02% | + +--- + +## Consequences + +### Positive + +1. **Complete GPU audit trail**: Every GPU operation is witnessed with forensically decodable payload data. No audit gap for GPU subsystem operations. +2. **Consistent with existing witness infrastructure**: GPU witnesses use the same 64-byte format, same hash chain, same ring buffer, and same emission API. No new infrastructure required. +3. **Efficient forensic queries**: The 0xA0-0xAF range grouping enables efficient range-based filtering for GPU-only audit views. +4. **Budget trend analysis**: The `budget_remaining` field in allocation and transfer records enables operators to detect budget exhaustion patterns without separate monitoring infrastructure. + +### Negative + +1. **ActionKind enum grows by 12 variants**: The enum now uses ~42 of 256 available slots. This is well within capacity but increases the surface area for audit query implementations. +2. **Payload encoding complexity**: Each GPU variant has a different payload layout. Audit tools must implement per-variant decoders. This is the same pattern used by existing variants (PartitionSplit, RegionTransfer, etc.) and is an inherent consequence of the 8-byte fixed payload constraint. + +--- + +## References + +- ADR-134: Witness Schema and Log Format (ActionKind enum, payload encoding conventions) +- ADR-142: TEE-Backed Cryptographic Verification (SHA-256 chain hashing, WitnessSigner) +- ADR-144: GPU Compute Support (DC-GPU-5, GPU operations requiring witness coverage) +- `rvm-witness/src/record.rs`: WitnessRecord struct definition +- `rvm-gpu/src/queue.rs`: GpuQueue command submission (consumer of this registry) diff --git a/docs/adr/ADR-152-gpu-mincut-correctness.md b/docs/adr/ADR-152-gpu-mincut-correctness.md new file mode 100644 index 00000000..0c091805 --- /dev/null +++ b/docs/adr/ADR-152-gpu-mincut-correctness.md @@ -0,0 +1,268 @@ +# ADR-152: GPU MinCut Correctness Model + +**Status**: Accepted +**Date**: 2026-04-04 +**Authors**: Claude Code (Opus 4.6) +**Supersedes**: None +**Related**: ADR-132 (RVM Hypervisor Core, DC-2), ADR-144 (GPU Compute Support), ADR-151 (GPU Witness Event Registry) + +--- + +## Context + +ADR-144 introduces GPU-accelerated mincut computation for the coherence engine. The Stoer-Wagner algorithm operates on an adjacency matrix of up to 32 nodes (MINCUT_MAX_NODES, per ADR-132), mapping merge steps to GPU workgroup parallel reductions. ADR-132 DC-2 imposes a hard 50-microsecond budget per scheduler epoch for mincut computation. The GPU path is an acceleration, providing a 5.6x speedup for 32-node graphs (45us CPU vs 8us GPU target per ADR-144 benchmarks). + +### Problem Statement + +1. **Correctness is non-negotiable**: The mincut result directly drives partition split/merge decisions and migration triggers. An incorrect cut value or partition assignment causes either unnecessary splits (wasting resources) or missed merges (violating coherence locality). The GPU must produce results identical to the CPU path, or the system must detect divergence and fall back. +2. **Floating-point divergence**: GPU hardware typically uses IEEE 754 single-precision (f32) for parallel reductions. The CPU path in `rvm-coherence` uses fixed-point basis points (u16, range 0-10000) for edge weights. Conversion between representations introduces rounding that can change the mincut result on graphs with nearly-equal cut values. +3. **GPU is acceleration, not authority**: The CPU mincut implementation is the reference. If the GPU disagrees with the CPU, the CPU result is canonical. This is a fundamental design principle: the GPU path exists to meet the DC-2 budget on larger graphs, not to replace the correctness guarantee of the CPU path. +4. **Budget enforcement**: If the GPU path exceeds the 50us DC-2 budget, the system must fall back to the last known CPU result and emit a `MinCutBudgetExceeded` witness. The fallback must not itself violate the budget. + +### SOTA References + +| Source | Key Contribution | Relevance | +|--------|-----------------|-----------| +| Stoer & Wagner (1997) | Exact minimum cut algorithm, O(VE + V^2 log V) | Reference algorithm for both CPU and GPU paths | +| ADR-132 DC-2 | 50us hard budget for mincut per epoch | Budget constraint driving GPU acceleration | +| ADR-144 | GPU architecture, MinCut GPU acceleration section | GPU implementation approach (parallel max scan) | +| CUDA parallel reduction | Warp-level primitives for max/sum reduction | Informs GPU workgroup implementation | +| IEEE 754-2008 | Floating-point arithmetic standard | Defines f32 rounding behavior relevant to GPU computation | + +--- + +## Decision + +### Correctness Contract + +The GPU mincut implementation must satisfy three invariants: + +**Invariant MC-1: Cut value equivalence** + +For integer-weighted adjacency matrices (u16 basis points), the GPU-computed minimum cut value must exactly equal the CPU-computed value. No approximation is permitted for N <= MINCUT_MAX_NODES (32). + +``` +gpu_cut_value == cpu_cut_value (for integer weights) +``` + +**Invariant MC-2: Partition assignment equivalence** + +The partition assignment (which nodes are on which side of the cut) must be identical between GPU and CPU paths. Equivalence is verified by comparing partition bitmasks. + +``` +gpu_partition_bitmask == cpu_partition_bitmask +``` + +**Invariant MC-3: Budget compliance** + +The GPU mincut computation must complete within the DC-2 budget (50us). If it exceeds the budget, the system uses the last known CPU result. + +``` +if gpu_compute_time > 50_000ns: + use last_known_cpu_result + emit witness(MinCutBudgetExceeded) +``` + +### Floating-Point vs Fixed-Point Strategy + +RVM edge weights are stored as u16 basis points (0-10000, where 10000 = 100.00%). The GPU parallel reduction must operate on the same integer representation to guarantee MC-1 and MC-2. + +**Decision: GPU operates in u16 integer space, not f32.** + +The adjacency matrix is uploaded to GPU shared memory as `u16[32][32]`. All merge operations (finding the maximum-weight vertex, contracting edges) use integer arithmetic. This eliminates floating-point divergence entirely. + +If a future GPU backend requires f32 (e.g., a backend that only supports float atomics), the following conversion rules apply: + +**f32-to-u16 conversion:** + +```rust +/// Convert GPU f32 result back to u16 basis points. +/// Rounding: round-half-to-even (IEEE 754 default). +/// Clamp to [0, 10000] to prevent overflow. +fn f32_to_basis_points(value: f32) -> u16 { + let rounded = (value + 0.5).floor() as i32; + rounded.clamp(0, 10000) as u16 +} +``` + +**Epsilon tolerance for f32 paths:** + +When the GPU backend uses f32, the cut value comparison uses an epsilon of 1 basis point: + +``` +|gpu_cut_value_bp - cpu_cut_value_bp| <= 1 (for f32 backends only) +``` + +If the cut values differ by more than 1 basis point, the CPU result is used and a witness is emitted. If the cut values are within epsilon but the partition assignments differ, the CPU assignment is used (the cut value may be achieved by multiple valid partitionings). + +### Fallback Strategy + +The GPU path is a performance optimization. The CPU path is always available as a fallback. The fallback strategy has three levels: + +**Level 1: GPU result matches CPU (normal operation)** + +GPU computes mincut within budget. Result matches CPU. GPU result is used. No fallback needed. + +**Level 2: GPU result diverges from CPU** + +GPU completes within budget but produces a different result. CPU result is used. Witness emitted: + +``` +ActionKind::CoherenceRecomputed (0x93) +payload: gpu_cut_value (high 32) | cpu_cut_value (low 32) +flags: 0x01 (GPU_DIVERGED flag) +``` + +**Level 3: GPU exceeds budget** + +GPU does not complete within 50us. Computation is aborted. Last known CPU result is used. Witness emitted: + +``` +ActionKind::MinCutBudgetExceeded (0x74) +payload: elapsed_ns (high 32) | budget_ns (low 32) +flags: 0x02 (GPU_TIMEOUT flag) +``` + +**Level 4: GPU device unavailable** + +No GPU device found at startup or GPU context creation fails. CPU path is the sole path. No fallback needed; this is the baseline. Witness emitted once at boot: + +``` +ActionKind::GpuDeviceNotFound (0xAA) +``` + +### Cross-Validation Protocol + +In debug builds (`#[cfg(debug_assertions)]`), every GPU mincut result is cross-validated against the CPU result: + +```rust +pub fn mincut_with_validation( + adjacency: &AdjacencyMatrix, + gpu_ctx: Option<&GpuContext>, +) -> MinCutResult { + let cpu_result = mincut_cpu(adjacency); + + if let Some(ctx) = gpu_ctx { + match mincut_gpu(adjacency, ctx) { + Ok(gpu_result) => { + #[cfg(debug_assertions)] + { + assert_eq!( + gpu_result.cut_value, cpu_result.cut_value, + "GPU mincut diverged: gpu={} cpu={}", + gpu_result.cut_value, cpu_result.cut_value + ); + assert_eq!( + gpu_result.partition_mask, cpu_result.partition_mask, + "GPU partition assignment diverged" + ); + } + + // In release builds, use GPU result if it matches + if gpu_result.cut_value == cpu_result.cut_value { + gpu_result + } else { + // Fallback to CPU on divergence + witness_emit(ActionKind::CoherenceRecomputed, ...); + cpu_result + } + } + Err(GpuError::Timeout { elapsed_ns }) => { + witness_emit(ActionKind::MinCutBudgetExceeded, ...); + cpu_result + } + Err(_) => cpu_result, + } + } else { + cpu_result + } +} +``` + +In release builds, cross-validation is replaced by a cheaper comparison: only the cut value and partition mask are compared (two integer comparisons). The full adjacency matrix is not recomputed on CPU unless the GPU result diverges. + +### Nightly Benchmark Protocol + +The nightly CI pipeline (ADR-143) runs GPU vs CPU comparison benchmarks: + +1. **Deterministic test suite**: 100 pre-generated adjacency matrices covering edge cases (single-edge graphs, fully connected, bipartite, near-equal cuts, maximum 32 nodes). +2. **Divergence counter**: Count how many of the 100 test cases produce different GPU vs CPU results. Target: 0 divergences for integer weights. +3. **Timing comparison**: Measure GPU and CPU latency for each test case. Track speedup ratio. Alert if GPU is slower than CPU for any graph size (indicates misconfiguration or backend regression). +4. **Budget compliance**: Verify that all 32-node test cases complete within 50us on GPU. Track the 99th percentile latency. + +### Witness Logging for GPU MinCut + +GPU mincut results are witnessed via `ActionKind::CoherenceRecomputed` (0x93) with the `computation_ns` encoded in the payload: + +``` +payload: cut_value (high 32) | computation_ns (low 32) +aux: partition_mask (u64, one bit per node indicating side A vs side B) +flags: 0x00 = CPU path, 0x01 = GPU path, 0x02 = GPU fallback to CPU +``` + +This enables operators to distinguish GPU-accelerated recomputations from CPU-only ones and to track GPU performance over time. + +--- + +## Design Constraints + +### DC-MC-1: Integer Arithmetic on GPU + +The GPU mincut kernel operates on u16 integer weights. No f32 intermediate values are used unless the GPU backend has no integer atomics, in which case the f32 conversion rules above apply. This constraint eliminates the primary source of GPU/CPU divergence. + +### DC-MC-2: CPU Is Always Available + +The CPU mincut implementation must remain compiled and callable even when the GPU path is enabled. The GPU path does not replace the CPU path; it accelerates it. The `rvm-coherence` crate must not have a hard dependency on `rvm-gpu`. + +### DC-MC-3: Fallback Must Not Exceed Budget + +The fallback from GPU to CPU must itself complete within the remaining DC-2 budget. If the GPU times out at 50us and the CPU path would take 45us, the total would be 95us, exceeding the budget. To prevent this: + +- The GPU path has an internal deadline of 40us (not 50us), leaving 10us for fallback overhead. +- The fallback uses the last known CPU result (cached from the previous epoch), not a fresh computation. +- A fresh CPU computation is scheduled for the next epoch. + +### DC-MC-4: No Silent Degradation + +Every fallback event must emit a witness record. An operator reviewing the witness log must be able to determine: (a) how often the GPU path is used vs CPU, (b) how often the GPU diverges, (c) how often the GPU times out. + +--- + +## Consequences + +### Positive + +1. **Correctness guarantee**: The GPU path is validated against the CPU reference. Divergence is detected, logged, and corrected automatically. +2. **Performance within budget**: The 40us GPU deadline with 10us fallback margin ensures the DC-2 50us budget is never exceeded, even on fallback. +3. **Forensic visibility**: Operators can audit GPU mincut behavior through witness records, tracking divergence rates and performance trends. +4. **Zero risk from GPU errors**: The worst case of a GPU failure is a single-epoch use of a stale (but previously validated) cut result, followed by a fresh CPU computation. + +### Negative + +1. **Debug build overhead**: Cross-validation in debug builds runs both GPU and CPU paths, approximately doubling mincut latency. This is acceptable for development but must not be enabled in production. +2. **Stale fallback risk**: The last known CPU result may be one epoch old. If the coherence graph changed significantly in one epoch, the stale result may produce suboptimal placement decisions for that single epoch. This is acceptable because mincut staleness is already the documented degradation behavior (ADR-132 DC-2). +3. **GPU internal deadline (40us) is tighter than DC-2 (50us)**: The GPU has less time than the full budget to account for fallback overhead. For very large graphs near MINCUT_MAX_NODES, this may cause unnecessary fallbacks. Mitigation: 32-node GPU mincut targets 8us, well within the 40us internal deadline. + +--- + +## Testing Strategy + +| Category | Tests | Pass Criteria | +|----------|-------|---------------| +| Equivalence | 100 deterministic matrices, all sizes 2-32 | GPU result == CPU result for all test cases | +| Edge cases | Single-edge, fully connected, bipartite, equal-weight | No divergence | +| Timeout | Artificially slow GPU backend | Fallback to CPU, witness emitted | +| Budget | 32-node worst-case timing | GPU completes within 40us | +| Fallback chain | GPU unavailable | CPU path produces correct result | +| Nightly regression | Criterion benchmarks with 5% regression threshold | No performance regression | + +--- + +## References + +- Stoer, M. & Wagner, F. "A Simple Min-Cut Algorithm." Journal of the ACM, 1997. +- ADR-132: RVM Hypervisor Core (DC-2: 50us mincut budget) +- ADR-144: GPU Compute Support (MinCut GPU Acceleration section, benchmark targets) +- ADR-151: GPU Witness Event Registry (ActionKind variants for GPU operations) +- `rvm-coherence/src/mincut.rs`: CPU MinCut implementation, MINCUT_MAX_NODES=32 +- `rvm-gpu/src/acceleration.rs`: GPU mincut acceleration entry point diff --git a/docs/adr/ADR-153-multi-node-mesh-protocol.md b/docs/adr/ADR-153-multi-node-mesh-protocol.md new file mode 100644 index 00000000..e66c87ff --- /dev/null +++ b/docs/adr/ADR-153-multi-node-mesh-protocol.md @@ -0,0 +1,244 @@ +# ADR-153: Multi-Node Mesh Protocol + +**Status**: Draft +**Date**: 2026-04-04 +**Authors**: Claude Code (Opus 4.6) +**Supersedes**: None +**Related**: ADR-132 (RVM Hypervisor Core), ADR-134 (Witness Schema), ADR-135 (Proof Verifier), ADR-142 (TEE-Backed Verification), ADR-144 (GPU Compute Support) + +--- + +## Context + +ADR-132 describes RVM as a standalone bare-metal hypervisor. The current design assumes a single-node deployment: one RVM instance owns one set of hardware resources, manages partitions locally, and maintains a single coherence graph and witness chain. However, the Appliance deployment model (ADR-139) envisions edge clusters where multiple RVM instances cooperate to provide a larger coherence domain spanning multiple physical machines. + +This ADR captures the design space for multi-node RVM operation. It does not propose a specific protocol. Its purpose is to document requirements, identify sub-problems, evaluate candidate approaches, and establish evaluation criteria for future work. + +### Problem Statement + +1. **Partition migration across nodes**: A single-node RVM can migrate partitions between physical CPU slots (DC-7, 100ms budget). Cross-node migration adds network serialization, transfer latency, and state rebuild on the destination. The existing RVF container format (ADR-149) provides a serialization vehicle, but the protocol for coordinating the source, network, and destination is undefined. +2. **Cross-node coherence graph**: The coherence graph currently resides in a single `rvm-coherence` instance. In a multi-node mesh, inter-node communication edges (CommEdge) have different latency and bandwidth characteristics than intra-node edges. The graph must span nodes while reflecting the higher cost of cross-node communication. +3. **Distributed capability delegation**: Capabilities are currently validated by the local kernel's P1/P2/P3 verifier. Cross-node delegation requires either (a) a shared capability namespace with distributed revocation, or (b) per-node capability spaces with cross-node attestation. +4. **Witness chain federation**: Each node maintains its own hash-chained witness log. For a global audit trail, these per-node chains must be merged into a consistent, tamper-evident structure. This is a non-trivial distributed systems problem. + +### Non-Goals of This ADR + +- Specifying a concrete protocol (this is a design-space document) +- Defining wire formats or network packet layouts +- Committing to a specific consensus algorithm +- Setting implementation timelines + +--- + +## Design Space + +### Sub-Problem 1: Partition Migration Between Nodes + +**Approach A: RVF Container Transfer** + +Serialize the partition state into an RVF container (as used for cold-tier checkpoints), transfer the container over the network, and rebuild the partition on the destination node. + +| Aspect | Details | +|--------|---------| +| Serialization | RVF manifest + memory regions + capability table + GPU context (if active) | +| Transfer | TCP/RDMA between nodes; size bounded by partition memory footprint | +| Rebuild | Destination node creates partition from RVF, maps memory, restores capabilities | +| Witness | Source emits `PartitionMigrate` (0x09) with target_node_id in aux field | +| Consistency | Source partition is suspended during transfer; no concurrent mutation | + +**Approach B: Live Migration with Witness Replay** + +Transfer a checkpoint plus the witness log delta since the checkpoint. The destination replays the delta to reconstruct current state. + +| Aspect | Details | +|--------|---------| +| Checkpoint | Pre-existing cold-tier checkpoint of the partition | +| Delta | Witness records from checkpoint sequence to current sequence | +| Replay | Destination applies replay protocol (ADR-134 Section 7) to reconstruct state | +| Downtime | Minimal: only the final state synchronization requires suspension | +| Complexity | Requires that all witness records are sufficient for state reconstruction | + +**Approach C: Incremental Page Transfer (Pre-copy)** + +Transfer memory pages iteratively while the partition runs. Track dirty pages. Final round suspends the partition and transfers remaining dirty pages. + +| Aspect | Details | +|--------|---------| +| Pre-copy rounds | Configurable (default: 3 rounds) | +| Dirty tracking | Stage-2 page table write-protect + fault handler marks dirty pages | +| Convergence | Migration converges when dirty page rate < transfer rate | +| Abort | If convergence not reached within budget, abort and keep partition on source | +| Overhead | Memory write-protect faults add latency during pre-copy | + +**Evaluation Criteria**: Migration time (target: <500ms for a 16MB partition over 10Gbps network), downtime (target: <10ms), correctness (destination state must be byte-identical to source state at suspension point). + +### Sub-Problem 2: Cross-Node Coherence Graph + +The coherence graph must span multiple nodes. Inter-node CommEdge weights must reflect network latency and bandwidth, which are orders of magnitude worse than intra-node shared memory. + +**Approach A: Gossip-Based Edge Weight Propagation** + +Each node maintains a local view of the global graph. Nodes periodically exchange edge weight updates via a gossip protocol. + +| Aspect | Details | +|--------|---------| +| Consistency | Eventual consistency; local views may differ by one gossip round | +| Latency | Gossip round: configurable (default: 100ms) | +| Bandwidth | O(E_cross) per gossip round, where E_cross = cross-node edge count | +| Partition decisions | Based on local view; may differ across nodes during convergence | + +**Approach B: Centralized Graph Coordinator** + +One node is elected as the graph coordinator. All edge weight updates are sent to the coordinator, which computes the global mincut and distributes partition assignments. + +| Aspect | Details | +|--------|---------| +| Consistency | Strong consistency via single-writer | +| Single point of failure | Coordinator failure requires re-election | +| Latency | All mincut decisions have network round-trip to coordinator | +| Scalability | Coordinator becomes bottleneck at large cluster sizes | + +**Approach C: Hierarchical Graph (Two-Level)** + +Each node computes intra-node mincut locally. A separate inter-node graph captures node-to-node communication. A cluster-level mincut on the inter-node graph determines node-level partition placement. + +| Aspect | Details | +|--------|---------| +| Consistency | Intra-node: local, strong. Inter-node: eventual via gossip | +| Scalability | Intra-node graph: up to 32 nodes (existing). Inter-node graph: up to N_nodes | +| Locality | Most decisions are local; cross-node decisions are rarer and coarser | +| Complexity | Two mincut computations per epoch instead of one | + +**Evaluation Criteria**: Consistency guarantees, convergence time, bandwidth overhead, fault tolerance (what happens when a node fails mid-update). + +### Sub-Problem 3: Distributed Capability Delegation + +Capabilities currently live in a per-node capability table. Cross-node delegation requires extending the capability model. + +**Approach A: Shared Capability Namespace** + +All nodes share a single logical capability table, replicated via consensus. + +| Aspect | Details | +|--------|---------| +| Namespace | Global: CapHandle is unique across all nodes | +| Revocation | Distributed: revoking a capability requires consensus across all nodes | +| Latency | Every capability operation requires network round-trip for consensus | +| Consistency | Strong (via Raft or similar) | + +**Approach B: Per-Node Namespaces with Cross-Node Attestation** + +Each node maintains its own capability table. Cross-node operations carry a signed attestation from the source node. + +| Aspect | Details | +|--------|---------| +| Namespace | Local: CapHandle is node-scoped. Cross-node references include node_id | +| Revocation | Local: each node revokes its own capabilities. Cross-node stale detection via epoch + node attestation | +| Latency | Local operations are unchanged. Cross-node operations add attestation overhead | +| Trust | Relies on TEE-backed node attestation (ADR-142) for cross-node trust | + +**Approach C: Delegated Capabilities with Network Tokens** + +Capabilities can be delegated cross-node by minting a "network capability token" that includes the source node's signature and a bounded validity period. + +| Aspect | Details | +|--------|---------| +| Token format | CapRights + object_id + source_node_id + expiry_ns + Ed25519 signature | +| Validation | Destination node verifies signature against source node's public key | +| Revocation | Expiry-based; no explicit cross-node revocation protocol | +| Delegation depth | Network delegation counts toward the depth-8 limit | + +**Evaluation Criteria**: Latency impact on local operations (must remain <1us for P1), revocation propagation time, trust model compatibility with ADR-142 TEE infrastructure. + +### Sub-Problem 4: Witness Chain Federation + +Each node produces a linear hash-chained witness log (ADR-134). A multi-node deployment needs a global audit trail. + +**Approach A: Merged Linear Chain** + +A designated node collects witness records from all nodes and merges them into a single linear chain, ordered by timestamp. + +| Aspect | Details | +|--------|---------| +| Ordering | Timestamp-based (requires synchronized clocks) | +| Single point of failure | Merge node failure halts global chain | +| Causality | May not preserve causal ordering across nodes | +| Simplicity | Simple; same verification as single-node chain | + +**Approach B: Merkle DAG (Directed Acyclic Graph)** + +Each node's chain is a branch. Cross-node events (migration, cross-node IPC) create merge points linking two branches. + +| Aspect | Details | +|--------|---------| +| Structure | DAG of witness records; each record references its local predecessor and optionally a cross-node predecessor | +| Ordering | Partial order; causal ordering preserved by cross-references | +| No single point of failure | Each node maintains its own chain independently | +| Verification | Requires DAG traversal instead of linear chain walk | + +**Approach C: Cross-Node Checkpoint Anchors** + +Each node maintains its own independent chain. Periodically, all nodes produce a signed checkpoint summary. These summaries are collected into a "federation anchor" record that references all node chains at a specific sequence number. + +| Aspect | Details | +|--------|---------| +| Independence | Nodes operate independently between anchors | +| Anchor frequency | Configurable (default: every 1000 epochs or 1 second) | +| Verification | Verify any node's chain independently; cross-node verification via anchor comparison | +| Offline tolerance | Nodes can be disconnected between anchors | + +**Evaluation Criteria**: Tamper evidence (can a compromised node forge records without detection?), offline tolerance (can nodes operate independently during network partitions?), verification complexity, clock synchronization requirements. + +--- + +## Network Partition Tolerance + +A critical requirement for multi-node RVM is correct behavior during network partitions. When nodes cannot communicate: + +1. **Each node operates autonomously**: Local partitions continue to run, local coherence graph is maintained, local witness chain continues. +2. **Cross-node edges are marked stale**: Inter-node CommEdge weights decay to zero after a configurable timeout (default: 10 seconds without gossip update). +3. **Cross-node migrations are suspended**: No partition migration is attempted while the target node is unreachable. +4. **Capabilities with cross-node attestation expire naturally**: Network capability tokens have bounded validity; expiry handles the partition case without explicit revocation. +5. **Witness chains diverge**: Each node's chain continues independently. Reconciliation occurs when connectivity is restored. +6. **Reconciliation on reconnect**: When connectivity is restored, nodes exchange witness chain summaries to detect the divergence point and resume cross-node operations. + +The system must never enter a state where a network partition causes data loss, authority leakage, or witness chain corruption. The worst case is temporary degradation to independent single-node operation. + +--- + +## Open Questions + +| # | Question | Impact | Potential Resolution | +|---|----------|--------|---------------------| +| 1 | What clock synchronization precision is required for timestamp-based witness ordering? | High | NTP (~1ms), PTP (~1us), or logical clocks (Lamport/vector) | +| 2 | What is the maximum cluster size (number of nodes)? | Medium | Determines whether gossip or consensus is appropriate | +| 3 | Should cross-node mincut use a different algorithm than intra-node? | Medium | The hierarchical two-level approach avoids this question | +| 4 | How are GPU contexts handled during cross-node migration? | High | GPU context serialization is GPU-vendor-specific; may require context rebuild | +| 5 | What is the acceptable cross-node migration budget? | High | DC-7 says 100ms for local; cross-node likely needs 500ms-1s | +| 6 | Is Byzantine fault tolerance required? | High | If any node can be compromised, BFT consensus is needed; if all nodes are trusted, Raft suffices | + +--- + +## Preliminary Recommendations + +Based on the analysis above, the following approach is recommended for initial prototyping (not a final decision): + +1. **Migration**: Approach A (RVF container transfer) for simplicity. Pre-copy (Approach C) is a future optimization. +2. **Coherence graph**: Approach C (hierarchical two-level) for scalability and locality. +3. **Capabilities**: Approach B (per-node namespaces with attestation) for minimal latency impact on local operations. +4. **Witness chain**: Approach C (cross-node checkpoint anchors) for offline tolerance and simplicity. + +These recommendations will be refined into a concrete protocol specification when multi-node work begins (post-v1). + +--- + +## References + +- ADR-132: RVM Hypervisor Core (DC-7 migration budget, DC-12 logical partition limit) +- ADR-134: Witness Schema and Log Format (hash chain, replay protocol) +- ADR-135: Proof Verifier Design (capability model, P1/P2/P3 layers) +- ADR-142: TEE-Backed Cryptographic Verification (Ed25519, node attestation) +- ADR-144: GPU Compute Support (GPU context save/restore) +- Lamport, L. "Time, Clocks, and the Ordering of Events in a Distributed System." CACM, 1978. +- Ongaro, D. & Ousterhout, J. "In Search of an Understandable Consensus Algorithm (Raft)." USENIX ATC, 2014. +- Castro, M. & Liskov, B. "Practical Byzantine Fault Tolerance." OSDI, 1999. diff --git a/docs/adr/ADR-154-formal-verification-roadmap.md b/docs/adr/ADR-154-formal-verification-roadmap.md new file mode 100644 index 00000000..2ccbb92b --- /dev/null +++ b/docs/adr/ADR-154-formal-verification-roadmap.md @@ -0,0 +1,283 @@ +# ADR-154: Formal Verification Roadmap + +**Status**: Draft +**Date**: 2026-04-04 +**Authors**: Claude Code (Opus 4.6) +**Supersedes**: None +**Related**: ADR-132 (RVM Hypervisor Core), ADR-135 (Proof Verifier Design), ADR-134 (Witness Schema), ADR-142 (TEE-Backed Verification) + +--- + +## Context + +ADR-132 explicitly lists formal verification as a non-goal for v1: "Full formal verification (deferred to post-v1; seL4-style proofs are multi-year efforts)." This is the correct phasing -- seL4's initial verification effort took approximately 20 person-years and targeted a much smaller codebase (~10,000 lines of C). RVM is substantially larger and more complex. However, the absence of formal verification does not mean the absence of a plan for it. + +This ADR establishes which subsystems are candidates for formal verification, in what priority order, using which tools, and through what incremental approach. It is a roadmap, not a commitment to any specific verification timeline. + +### Problem Statement + +1. **Safety-critical adoption requires verification evidence**: Domains such as automotive (ISO 26262), aerospace (DO-178C), and medical devices (IEC 62304) require evidence of correctness beyond testing. Formal verification provides the strongest such evidence. +2. **Not all subsystems need verification equally**: The capability system's monotonic attenuation property is a simple enough invariant to verify with bounded model checking. The scheduler's interaction with the coherence engine is too complex for current tools. Prioritization is essential. +3. **Rust tooling is maturing**: Kani (Amazon), Prusti (ETH Zurich), and MIRAI (Facebook) provide Rust-specific verification capabilities that did not exist when seL4 was verified. The barrier to entry is lower than it was for C-based kernels. +4. **Property-based testing is a stepping stone**: Before investing in full model checking, property-based testing (via `proptest` or `bolero`) can discover invariant violations cheaply. Properties that survive extensive proptest campaigns are good candidates for formal proof. + +### SOTA References + +| Source | Key Contribution | Relevance | +|--------|-----------------|-----------| +| seL4 (Klein et al., 2009) | First formally verified OS kernel | Gold standard; informs what is achievable and what it costs | +| Kani (Amazon) | Rust model checker using CBMC | Primary tool candidate for bounded verification of RVM | +| Prusti (ETH Zurich) | Rust verifier based on Viper framework | Pre/post-condition verification for Rust functions | +| MIRAI (Meta) | Abstract interpretation for Rust | Tag analysis and reachability checking | +| Ferrocene (AdaCore + Ferrous) | Qualified Rust compiler for safety-critical systems | Provides the compiler qualification needed for certified deployments | +| CertiKOS (Gu et al., 2016) | Verified concurrent OS kernel in Coq | Demonstrates verification of concurrent kernel with hardware interaction | + +--- + +## Decision + +Establish a four-priority verification roadmap. Begin with property-based testing for all priorities. Graduate Priority 1 to model checking (Kani) as a pilot. Expand to other priorities based on pilot results. + +### Priority 1: Capability Attenuation Monotonicity (rvm-cap) + +**Property**: A derived capability never has more rights than its parent. Formally: + +``` +forall cap_parent, cap_child: + derive(cap_parent, new_rights) = Some(cap_child) + implies + cap_child.rights.is_subset_of(cap_parent.rights) +``` + +**Why Priority 1**: This is the foundational security invariant. If attenuation is violated, a partition can escalate its privileges. The property is simple (subset relation on a u8 bitmap), the code is small (~200 lines in `rvm-cap/src/capability.rs`), and the state space is bounded (7 rights, depth <= 8). + +**Verification approach**: + +| Phase | Tool | What It Proves | Effort | +|-------|------|----------------|--------| +| Phase A | proptest | No counterexample found in 10M random derivation chains | 1 week | +| Phase B | Kani | Bounded proof for all u8 right combinations, depth 1-8 | 2 weeks | +| Phase C | Prusti (optional) | Pre/post-condition annotation on `derive()` | 1 week | + +**Kani harness sketch**: + +```rust +#[kani::proof] +fn verify_monotonic_attenuation() { + let parent_rights: u8 = kani::any(); + let child_rights: u8 = kani::any(); + let parent = Capability { + rights: CapRights(parent_rights), + // ... other fields symbolic + }; + + if let Some(child) = derive(&parent, CapRights(child_rights), 0, &tree) { + assert!(child.rights.is_subset_of(parent.rights)); + } +} +``` + +### Priority 2: Witness Chain Integrity (rvm-witness) + +**Property**: If any witness record in the chain is modified after emission, the chain verification detects the modification. Formally: + +``` +forall chain[0..N], i in 0..N: + tamper(chain[i]) implies + verify_chain(chain) = Err(ChainBreak { index: i }) +``` + +**Why Priority 2**: The witness chain is the audit backbone. If chain integrity can be silently violated, the entire auditability guarantee collapses. The property involves SHA-256 hash chaining (ADR-142), which is deterministic and bounded. + +**Verification approach**: + +| Phase | Tool | What It Proves | Effort | +|-------|------|----------------|--------| +| Phase A | proptest | Tampering any byte in any record is detected for chains of length 1-1000 | 1 week | +| Phase B | Kani | Bounded proof for chains of length 1-16 with symbolic record contents | 3 weeks | + +**Challenges**: SHA-256 computation is expensive for symbolic execution. Kani may require abstraction of the hash function (replace SHA-256 with a symbolic collision-resistant function) to keep verification tractable. + +### Priority 3: Partition Lifecycle State Machine (rvm-partition) + +**Property**: Partitions transition only through valid states, and no transition produces a state that violates isolation. Formally: + +``` +forall partition, event: + transition(partition.state, event) = new_state + implies + new_state in VALID_STATES + and + isolation_invariant(partition, new_state) holds +``` + +The valid states are: `Created`, `Running`, `Suspended`, `Hibernated`, `Migrating`, `Splitting`, `Merging`, `Destroyed`. The isolation invariant requires that no two partitions share a mutable memory region without an explicit CommEdge and capability grant. + +**Why Priority 3**: The partition lifecycle is the core abstraction. Invalid transitions (e.g., `Destroyed` -> `Running`) could resurrect a partition with stale capabilities, violating isolation. + +**Verification approach**: + +| Phase | Tool | What It Proves | Effort | +|-------|------|----------------|--------| +| Phase A | proptest + state machine testing | No invalid transition reachable from any state via random event sequences | 2 weeks | +| Phase B | Kani | Bounded proof that the transition function preserves the isolation invariant for all state/event combinations | 4 weeks | + +### Priority 4: Memory Isolation (rvm-memory) + +**Property**: No two partitions can access the same physical page unless they share a memory region with appropriate capabilities. Formally: + +``` +forall p1, p2, page: + p1 != p2 + and maps(p1, page) and maps(p2, page) + implies + exists region, cap: + shared_region(region, p1, p2) + and region.contains(page) + and cap.rights.contains(READ or WRITE) +``` + +**Why Priority 4**: Memory isolation is a hardware-enforced property (stage-2 page tables on ARM, EPT on x86). The verification target is the page table construction logic in `rvm-memory`, not the hardware itself. This is the most complex verification target because it involves page table walks, physical address arithmetic, and IOMMU configuration. + +**Verification approach**: + +| Phase | Tool | What It Proves | Effort | +|-------|------|----------------|--------| +| Phase A | proptest | `regions_overlap_host()` returns true for all overlapping regions, false for all non-overlapping | 2 weeks | +| Phase B | Kani | Bounded proof for page tables with up to 64 entries | 6 weeks | +| Phase C | MIRAI | Abstract interpretation of the full page table construction path | 4 weeks | + +--- + +## Tool Evaluation + +### Kani (Primary Recommendation) + +Kani is Amazon's Rust model checker, built on CBMC (C Bounded Model Checker). It verifies Rust code by translating it to a verification IR and exhaustively checking all paths up to a bounded depth. + +| Aspect | Assessment | +|--------|-----------| +| Rust support | Native; understands ownership, borrowing, lifetimes | +| no_std compatibility | Yes; works with `#![no_std]` crates | +| Bounded verification | Up to configurable depth; sufficient for RVM's bounded data structures | +| Performance | Practical for functions with up to ~1000 lines and bounded loops | +| Maturity | Production use at Amazon; actively maintained | +| Limitations | Cannot verify unbounded loops; SHA-256 symbolic execution is expensive | + +### Prusti (Secondary) + +Prusti is ETH Zurich's Rust verifier, built on the Viper verification framework. It verifies pre/post-conditions annotated on Rust functions. + +| Aspect | Assessment | +|--------|-----------| +| Annotation style | Rust attributes (`#[requires(...)]`, `#[ensures(...)]`) | +| Verification power | Function-level pre/post-conditions, loop invariants | +| no_std compatibility | Partial; some standard library annotations missing | +| Maturity | Research tool; less production-hardened than Kani | +| Best for | Annotating and verifying individual functions (derive, verify_p1) | + +### MIRAI (Supplementary) + +MIRAI is Meta's abstract interpretation tool for Rust. It performs tag analysis and reachability checking at the MIR level. + +| Aspect | Assessment | +|--------|-----------| +| Analysis type | Abstract interpretation (over-approximation) | +| False positives | Expected; requires manual triage | +| Speed | Fast; can analyze entire crates in seconds | +| Best for | Finding unreachable code, tag-based security analysis | + +--- + +## Relationship to seL4 + +seL4's formal verification provides a useful reference point, but RVM's approach differs in several important ways: + +| Aspect | seL4 | RVM | +|--------|------|-----| +| Language | C + Isabelle/HOL | Rust + Kani/Prusti | +| Verification scope | Full functional correctness of kernel | Targeted properties of critical subsystems | +| Effort | ~20 person-years (initial) | Estimated 2-6 person-months per priority | +| Compiler trust | Verified C-to-ARM compilation chain | Ferrocene qualified compiler (future) | +| Hardware model | Formal ARM model | Hardware assumed correct; verify software logic | +| Concurrency | Single-core only (initial) | Single-core properties first; SMP deferred | + +RVM does not aim to replicate seL4's full functional correctness proof. Instead, it targets the four specific properties above, which together cover the critical security and integrity invariants. This is a pragmatic choice: targeted verification of critical properties provides high assurance value at a fraction of the cost of full verification. + +--- + +## Incremental Approach + +The verification roadmap follows a "test, then prove" methodology: + +``` +Step 1: Property-based testing (proptest/bolero) + - Define the property as a proptest strategy + - Run 10M+ test cases + - Fix any violations found + - Commit the proptest as a regression test + +Step 2: Bounded model checking (Kani) + - Write a Kani proof harness for the same property + - Run Kani with increasing bounds until verification completes or times out + - If Kani times out, identify the bottleneck and abstract it + - Commit the Kani harness as a CI verification step + +Step 3: Annotation-based verification (Prusti, optional) + - Annotate the function with pre/post-conditions + - Verify with Prusti + - Annotations serve as machine-checked documentation + +Step 4: Continuous verification + - Kani harnesses run in nightly CI (ADR-143) + - Any code change that breaks a verified property fails the pipeline + - New properties are added incrementally as subsystems mature +``` + +--- + +## Consequences + +### Positive + +1. **Clear prioritization**: Four priorities ordered by security impact and verification tractability. Prevents the "verify everything or nothing" trap. +2. **Incremental approach**: Property-based testing provides immediate value. Formal verification builds on tested properties, reducing wasted verification effort. +3. **Tool diversity**: Three complementary tools (Kani, Prusti, MIRAI) cover different verification niches. No single-tool dependency. +4. **CI integration**: Verification harnesses become regression tests, preventing future regressions of verified properties. + +### Negative + +1. **Not full functional correctness**: The roadmap verifies four specific properties, not the entire kernel. Bugs outside these properties remain undetected by formal methods. +2. **Tool maturity risk**: Kani and Prusti are actively developed but not yet as mature as Isabelle/HOL (seL4's verifier). Tool bugs could produce false confidence. +3. **Ongoing maintenance cost**: Verification harnesses must be updated when the verified code changes. This adds friction to development. + +--- + +## Timeline Estimate + +| Priority | Phase A (proptest) | Phase B (Kani) | Phase C (Optional) | Total | +|----------|-------------------|----------------|-------------------|-------| +| P1: Capability attenuation | 1 week | 2 weeks | 1 week (Prusti) | 4 weeks | +| P2: Witness chain integrity | 1 week | 3 weeks | -- | 4 weeks | +| P3: Partition lifecycle | 2 weeks | 4 weeks | -- | 6 weeks | +| P4: Memory isolation | 2 weeks | 6 weeks | 4 weeks (MIRAI) | 12 weeks | + +**Total estimated effort**: 26 person-weeks (~6 person-months) for all four priorities. This assumes one engineer with verification experience. + +--- + +## References + +- Klein, G., et al. "seL4: Formal Verification of an OS Kernel." SOSP 2009. +- Gu, R., et al. "CertiKOS: An Extensible Architecture for Building Certified Concurrent OS Kernels." OSDI 2016. +- Kani Rust Verifier: https://model-checking.github.io/kani/ +- Prusti: https://www.pm.inf.ethz.ch/research/prusti.html +- MIRAI: https://github.com/facebookexperimental/MIRAI +- Ferrocene: https://ferrocene.dev/ +- ADR-132: RVM Hypervisor Core (non-goals: formal verification deferred) +- ADR-135: Proof Verifier Design (capability model, monotonic attenuation) +- ADR-134: Witness Schema and Log Format (hash chain integrity) +- `rvm-cap/src/capability.rs`: Capability derive() implementation +- `rvm-witness/src/chain.rs`: Witness chain hash computation +- `rvm-partition/src/lifecycle.rs`: Partition state machine +- `rvm-memory/src/regions.rs`: regions_overlap_host() check diff --git a/docs/research/README.md b/docs/research/README.md new file mode 100644 index 00000000..2796785e --- /dev/null +++ b/docs/research/README.md @@ -0,0 +1,73 @@ +# RVM Research Topics + +**Last Updated**: 2026-04-04 + +This directory contains research briefs for open questions, comparative analyses, and security studies related to the RVM hypervisor. Each document identifies the research question, its relevance to RVM design decisions, known prior work, and suggested investigation approaches. + +--- + +## Directory Structure + +``` +docs/research/ + README.md # This file + specification-improvement-plan.md # Full gap analysis and GOAP action plan + theoretical-foundation.md # 6 research topics on RVM's theoretical basis + comparative-analysis.md # 4 comparison studies against existing systems + security-analysis.md # 4 security research topics +``` + +--- + +## Research Topics by Category + +### Theoretical Foundation + +| ID | Topic | Related ADRs | Priority | +|----|-------|-------------|----------| +| R1 | Coherence convergence guarantees | ADR-132, ADR-141 | High | +| R2 | MinCut budget analysis | ADR-132 DC-2, ADR-144 | High | +| R3 | Reconstruction fidelity | ADR-134, ADR-136 | High | +| R4 | Proof tier latency bounds | ADR-135 DC-3 | Medium | +| R5 | Memory tier transition optimality | ADR-136 | Medium | +| R6 | GPU acceleration speedup model | ADR-144, ADR-152 | Medium | + +### Comparative Analysis + +| ID | Topic | Comparison Target | Priority | +|----|-------|------------------|----------| +| R7 | RVM vs seL4 | Capability models, verification gap | High | +| R8 | RVM vs Firecracker | Boot time, partition switch, memory | Medium | +| R9 | RVM vs CFS/EEVDF | Scheduling algorithms | Medium | +| R10 | RVM vs zswap/zram | Memory compression vs reconstruction | Low | + +### Security Analysis + +| ID | Topic | Threat Class | Priority | +|----|-------|-------------|----------| +| R11 | Constant-time audit | Side-channel | High | +| R12 | GPU covert channels | Information leakage | High | +| R13 | TEE collateral expiry | Availability/trust | Medium | +| R14 | Witness truncation attacks | Denial of audit | Medium | + +--- + +## How to Use These Documents + +Each research brief follows a common structure: + +1. **Question**: The specific research question +2. **Relevance**: Why this matters for RVM +3. **Prior Work**: Known results from literature or related systems +4. **Approach**: Suggested investigation methodology +5. **Expected Outcome**: What a successful investigation would produce + +Research topics may be promoted to full ADRs when they produce concrete design recommendations. For example, R6 (GPU speedup model) directly informs ADR-152 (GPU MinCut Correctness Model). + +--- + +## Cross-References + +- ADR index: `docs/adr/` +- Specification improvement plan: `docs/research/specification-improvement-plan.md` +- RuVector integration: `docs/RUVECTOR-INTEGRATION.md` diff --git a/docs/research/comparative-analysis.md b/docs/research/comparative-analysis.md new file mode 100644 index 00000000..d2b3e872 --- /dev/null +++ b/docs/research/comparative-analysis.md @@ -0,0 +1,110 @@ +# Comparative Analysis Research Topics + +**Date**: 2026-04-04 +**Scope**: Four comparison studies between RVM and existing systems + +--- + +## R7: RVM vs seL4 — Capability Models and Formal Verification Gap + +**Question**: How does RVM's capability model map onto seL4's CNode/CSlot model, and what security properties does RVM lack due to the absence of formal verification? + +**Relevance**: RVM's capability system (ADR-135) is explicitly inspired by seL4. Both use unforgeable tokens with monotonic attenuation and bounded delegation. However, RVM diverges from seL4 in several ways that have security implications. + +**Key Comparison Points**: + +| Aspect | seL4 | RVM | Implication | +|--------|------|-----|-------------| +| Capability storage | CNode/CSlot (tree-structured) | Flat capability table (per-partition) | RVM's flat table is simpler but lacks seL4's hierarchical revocation efficiency | +| Rights model | 18+ rights (platform-specific) | 7 rights (u8 bitmap) | RVM is more constrained; some seL4 operations have no RVM equivalent | +| Delegation depth | Unbounded (CDT tracks all derivations) | Bounded at 8 (ADR-135) | RVM trades flexibility for bounded revocation cost | +| Revocation | CDT walk (O(n) in derivation tree) | Epoch-based (O(d) in descendants) | RVM's epoch-based approach is faster but less precise | +| Formal verification | Full functional correctness (Isabelle/HOL) | Deferred (ADR-154 roadmap) | RVM cannot make the same security claims as seL4 until verification is complete | +| Concurrency | Single-core verified; multicore is recent | SMP from v1 (ADR-146) | seL4's multicore verification is partial; RVM's is not started | + +**Approach**: (1) Map each seL4 CNode operation (mint, copy, mutate, delete, revoke) to its RVM equivalent. (2) Identify operations that seL4 supports but RVM does not (e.g., badge-based identification, endpoint capabilities). (3) For each gap, assess whether the gap is a security risk or a deliberate simplification. (4) Quantify the verification gap: which seL4-verified properties does RVM's proptest suite cover vs not cover? + +**Expected Outcome**: A mapping table of seL4-to-RVM capability operations with security impact annotations, plus a gap analysis identifying properties that require formal verification before safety-critical deployment. + +--- + +## R8: RVM vs Firecracker — Boot Time, Partition Switch, and Memory Model + +**Question**: How does RVM's performance compare to Firecracker across the critical path metrics (boot, switch, memory efficiency)? + +**Relevance**: Firecracker (AWS) is the closest existing system to RVM in terms of minimalism and use case (lightweight isolation for serverless/edge workloads). RVM targets different abstraction (coherence domains vs microVMs), but the performance characteristics must be competitive or the coherence advantage is moot. + +**Key Comparison Points**: + +| Metric | Firecracker | RVM Target | Notes | +|--------|------------|------------|-------| +| Cold boot to first user code | ~125ms (KVM + minimal Linux guest) | <250ms (bare-metal, no KVM) | RVM has no Linux dependency but also no KVM fast path | +| VM/partition switch | ~10us (KVM VMRUN/VMEXIT) | <10us (bare-metal EL2 switch) | RVM avoids KVM ioctl overhead but must manage stage-2 tables directly | +| Memory overhead per instance | ~5MB (minimal initrd) | TBD (partition metadata + capability table) | RVM targets smaller per-partition overhead | +| Max instances | ~4000 microVMs per host (Firecracker paper) | DC-12: 4096 logical, 256 physical | RVM multiplexes logical partitions over VMID slots | +| Network I/O | virtio-net (KVM passthrough) | Not yet specified (ADR-153 scope) | Firecracker has mature I/O; RVM defers to post-v1 | +| GPU support | None | ADR-144 (capability-gated) | RVM advantage | +| Formal verification | None | ADR-154 roadmap | Neither is verified today | + +**Approach**: (1) Set up identical QEMU AArch64 virt environments for Firecracker and RVM. (2) Measure cold boot time with PMU cycle counters. (3) Measure partition/VM switch latency under identical workloads (ping-pong IPC between two instances). (4) Measure memory overhead with increasing instance counts (1, 10, 100, 1000). (5) Document the methodology for reproducibility (ADR-143 benchmark framework). + +**Expected Outcome**: A benchmark comparison table with methodology documentation, identifying where RVM beats Firecracker (switch latency, GPU, per-partition overhead) and where it trails (boot time, I/O maturity). + +--- + +## R9: RVM vs CFS/EEVDF — Scheduling Algorithms + +**Question**: How does RVM's coherence-aware 2-signal scheduler (deadline_urgency + cut_pressure_boost) compare to Linux CFS and EEVDF under equivalent workloads? + +**Relevance**: ADR-132 DC-4 defines RVM's v1 scheduler as deliberately simple: two signals only. Linux CFS (Completely Fair Scheduler) uses virtual runtime-based fairness, and EEVDF (Earliest Eligible Virtual Deadline First) is the newer replacement. RVM's scheduler optimizes for coherence locality, not fairness. This is a fundamentally different objective, and the comparison must account for this. + +**Key Comparison Points**: + +| Aspect | CFS | EEVDF | RVM | +|--------|-----|-------|-----| +| Objective | CPU fairness (virtual runtime) | Fairness + latency (eligible virtual deadline) | Coherence locality (cut pressure + deadline) | +| Fairness guarantee | Proportional share | Bounded latency | None (not a goal) | +| Context switch frequency | Adaptive (granularity knob) | Adaptive | Per-epoch (configurable, default 1ms) | +| Workload awareness | Weight-based nice levels | Deadline-aware | Graph-structure-aware | +| Tail latency | Depends on load | Bounded by deadline | Depends on cut pressure distribution | +| Multi-core | Per-CPU runqueue + load balancing | Same | Per-CPU scheduler + coherence CPU (ADR-146) | + +**Approach**: (1) Implement a workload simulator that generates partition communication patterns (uniform, hotspot, producer-consumer, graph-structured). (2) For each pattern, simulate CFS, EEVDF, and RVM scheduling decisions. (3) Measure: tail latency (p99), throughput (completed operations per second), remote memory traffic (cross-partition accesses that could have been local). (4) RVM should win on remote traffic reduction; CFS/EEVDF should win on fairness. Quantify the tradeoff. + +**Expected Outcome**: Simulation results showing the coherence-locality vs fairness tradeoff, with recommendations for workload classes where RVM's scheduler is advantageous. + +--- + +## R10: RVM vs zswap/zram — Memory Compression vs Witness Reconstruction + +**Question**: How does RVM's 4-tier memory model (with dormant-tier compression and cold-tier reconstruction) compare to Linux zswap/zram for memory-constrained workloads? + +**Relevance**: Both RVM and zswap/zram address the problem of fitting more workload into limited physical memory. zswap compresses pages in a kernel-managed pool before writing to swap. zram compresses pages into a RAM-backed block device. RVM's approach is fundamentally different: dormant-tier pages are compressed, but cold-tier pages are not stored at all -- they are reconstructed from checkpoints plus witness replay (ADR-134 Section 7, ADR-136). + +**Key Comparison Points**: + +| Aspect | zswap | zram | RVM Dormant/Cold | +|--------|-------|------|------------------| +| Compression algorithm | LZO, LZ4, zstd (configurable) | LZO, LZ4, zstd | LZ4 (dormant tier) | +| Decompression latency | ~1-10us per page | ~1-10us per page | ~1-10us (dormant), ~1-100ms (cold, replay) | +| Memory savings | 2-3x compression ratio | 2-3x compression ratio | 2-3x (dormant), unbounded (cold, reconstructed) | +| Correctness guarantee | Byte-identical decompression | Byte-identical decompression | Depends on replay fidelity (R3) | +| Eviction policy | LRU | LRU | Cut-value + recency (coherence-aware) | +| Recovery from eviction | Read from swap device | Decompress from RAM | Replay from checkpoint | +| CPU cost | Compression + decompression | Compression + decompression | Compression (dormant) + SHA-256 replay (cold) | + +**Approach**: (1) Set up identical memory-constrained environments (256MB, 512MB, 1GB) running equivalent workloads. (2) For zswap/zram: measure throughput and tail latency with varying compression algorithms. (3) For RVM: measure throughput and tail latency with coherence-aware demotion to dormant and cold tiers. (4) Measure the reconstruction latency distribution for cold-tier pages (this is unique to RVM and has no zswap equivalent). (5) Compare total memory efficiency: how many concurrent partitions can run within the memory budget? + +**Expected Outcome**: A benchmark comparison showing RVM's advantage in total memory efficiency (cold-tier reconstruction allows more partitions than compression alone) and disadvantage in cold-tier access latency (replay is slower than decompression). + +--- + +## Cross-References + +- ADR-132: RVM Hypervisor Core (scheduler, memory model, design constraints) +- ADR-134: Witness Schema and Log Format (replay protocol for cold-tier reconstruction) +- ADR-135: Proof Verifier Design (capability model for seL4 comparison) +- ADR-136: Memory Hierarchy and Reconstruction (4-tier model) +- ADR-144: GPU Compute Support (GPU advantage over Firecracker) +- ADR-146: SMP and Per-CPU Scheduling (multi-core scheduler design) +- ADR-154: Formal Verification Roadmap (verification gap vs seL4) diff --git a/docs/research/security-analysis.md b/docs/research/security-analysis.md new file mode 100644 index 00000000..78cab72d --- /dev/null +++ b/docs/research/security-analysis.md @@ -0,0 +1,123 @@ +# Security Analysis Research Topics + +**Date**: 2026-04-04 +**Scope**: Four security research topics covering side channels, covert channels, trust expiry, and audit integrity + +--- + +## R11: Constant-Time Audit — Verify All P1/P2 Checks Are Truly Constant-Time + +**Question**: Do the P2 verification checks in `rvm-proof` remain constant-time after compiler optimization on AArch64 and x86-64? + +**Relevance**: ADR-135 mandates constant-time execution for P2 policy validation. ADR-142 requires constant-time comparison (`subtle::ConstantTimeEq`) for all hash and signature checks. However, writing constant-time Rust code is necessary but not sufficient -- the compiler may introduce variable-time behavior through optimizations such as branch-based lowering of conditional moves, auto-vectorization with early exits, or dead code elimination of seemingly redundant operations. + +**Threat Model**: An attacker with precise timing measurements (e.g., cache-line flush + reload, branch predictor priming) observes the latency of P2 verification calls with varying proof inputs. If the latency varies based on which check fails first, the attacker learns which checks pass, enabling incremental proof forgery. The attack requires local timing measurements (same machine or same cache partition), which is realistic in a multi-partition hypervisor. + +**Prior Work**: (1) The `dudect` tool (Reparaz et al., 2017) performs statistical constant-time testing using Welch's t-test on execution time distributions. (2) The `ct-verif` tool (Almeida et al., 2016) formally verifies constant-time properties via symbolic execution. (3) `subtle` crate documentation warns that constant-time guarantees depend on compiler behavior and recommends assembly audit. + +**Approach**: + +1. **Assembly audit**: Compile `verify_p2()` with `--release` optimization for both `aarch64-unknown-none` and `x86_64-unknown-none` targets. Inspect the generated assembly for: (a) conditional branches on secret-dependent values, (b) memory access patterns that vary with secret values, (c) early-exit paths that bypass remaining checks. +2. **Statistical testing**: Use `dudect` methodology -- run `verify_p2()` with two input classes (all checks pass, first check fails) for 10M iterations. Apply Welch's t-test to the timing distributions. If |t| > 4.5, reject the constant-time hypothesis. +3. **Compiler fence verification**: Verify that `core::hint::black_box()` barriers in the P2 path survive optimization. Check that the compiler does not eliminate the "execute all checks" pattern by proving a later check subsumes an earlier one. +4. **Cross-platform comparison**: Run the same analysis on Cortex-A72 (Appliance target) and QEMU virt (development target). Compiler backends may produce different code for different targets. + +**Expected Outcome**: An assembly audit report for P2 verification on both target architectures. If constant-time violations are found, concrete patches (e.g., replacing `if` with conditional select, adding `black_box` barriers, using inline assembly for critical comparisons). A regression test that runs the `dudect` statistical check in nightly CI (ADR-143). + +--- + +## R12: GPU Covert Channels — Can Partitions Leak Data via GPU Timing? + +**Question**: Can two RVM partitions communicate through GPU timing side channels despite IOMMU isolation? + +**Relevance**: ADR-144 provides per-partition GPU isolation via IOMMU page tables. However, IOMMU isolation protects memory access, not execution timing. If two partitions share a GPU (time-multiplexed via `GpuContext` save/restore), the execution time of one partition's kernel may depend on the cache state left by the other partition's kernel. This creates a timing covert channel. + +**Threat Model**: Partition A (sender) modulates its GPU kernel execution time to encode information. Partition B (receiver) measures its own GPU kernel execution time, which varies based on A's GPU cache pollution. The channel bandwidth depends on the GPU context switch frequency and the timing measurement precision. + +**Known GPU Covert Channel Classes**: + +| Channel | Mechanism | Bandwidth Estimate | +|---------|-----------|-------------------| +| GPU L1 cache | Sender fills/evicts L1 lines; receiver measures L1 hit/miss | ~100 Kbps (Naghibijouybari et al., 2018) | +| GPU L2 cache | Same as L1 but across compute units | ~50 Kbps | +| GPU memory bus contention | Sender issues high-bandwidth transfers; receiver measures transfer latency | ~10 Kbps | +| GPU execution unit contention | Sender occupies ALUs; receiver measures kernel dispatch latency | ~5 Kbps | +| GPU power throttling | Sender induces thermal throttling; receiver observes clock frequency drop | ~1 Kbps | + +**Approach**: + +1. **Channel characterization**: Implement sender and receiver kernels for each channel class. Measure achievable bandwidth and error rate on the target GPU backend (WASM-SIMD for Seed, WebGPU/CUDA for Appliance). +2. **Mitigation evaluation**: For each channel, evaluate mitigations: (a) GPU cache flush on context switch (eliminates L1/L2 channels but adds latency), (b) GPU memory bus partitioning (if hardware supports it), (c) noise injection (randomize kernel launch timing), (d) GPU time-padding (pad all kernel launches to a fixed duration, eliminating timing variation). +3. **Cost-benefit analysis**: For each mitigation, measure the performance overhead against ADR-144 benchmark targets. Determine which mitigations are acceptable for the Appliance profile. +4. **Witness-based detection**: Evaluate whether GPU witness records (ADR-151) can detect covert channel activity through anomalous kernel timing patterns (e.g., bimodal execution time distribution). + +**Expected Outcome**: A threat model document with measured channel bandwidths, a recommended mitigation strategy for each channel class, and performance overhead measurements. If GPU cache flush is recommended, provide an updated budget for `GpuContext` save/restore (ADR-144 DC-GPU-4). + +--- + +## R13: TEE Collateral Expiry — What Happens When TEE Quotes Expire? + +**Question**: What is the blast radius when TEE attestation collateral expires, and can an attacker force expiry? + +**Relevance**: ADR-142 specifies that Intel TDX collateral expires after 30 days (per Intel's TDX Enabling Guide). When collateral expires, the `TeeQuoteVerifier::collateral_valid()` check returns false, and `TeeWitnessSigner` can no longer produce verified quotes. This affects all witness signing and P3 deep proof verification on the affected node. + +**Threat Model**: An attacker blocks network access to the Intel Provisioning Certification Service (PCS) or AMD Key Distribution Service (KDS). After the collateral expiry window (30 days for TDX), the node can no longer refresh collateral. All TEE-backed signing degrades to software fallback (`Ed25519WitnessSigner`), which provides cryptographic signing but not hardware-bound attestation. + +**Scenarios**: + +| Scenario | Trigger | Impact | Duration | +|----------|---------|--------|----------| +| Normal expiry | 30 days without refresh | Degraded to software signing | Until connectivity restored | +| DNS poisoning | Attacker redirects PCS DNS | Stale collateral; quotes may be rejected by remote verifiers | Until DNS corrected | +| PCS outage | Intel/AMD service unavailable | All nodes with expired collateral degrade simultaneously | Until service restored | +| Clock manipulation | Attacker skews system clock forward | Premature collateral expiry | Until clock corrected | + +**Approach**: + +1. **Blast radius analysis**: Map which RVM operations depend on TEE collateral validity. Determine whether P1 and P2 verification (which do not use TEE quotes) continue to function. Verify that the fallback to `Ed25519WitnessSigner` is seamless and does not require operator intervention. +2. **Refresh strategy**: Evaluate proactive collateral refresh (refresh at 50% of expiry window, i.e., 15 days for TDX). Implement a health check that warns operators when collateral is within 7 days of expiry. +3. **Clock security**: Evaluate NTP authentication (NTS, RFC 8915) as a mitigation for clock manipulation attacks. Determine whether RVM should refuse to degrade if the clock jumps forward by more than a configurable threshold. +4. **Multi-platform resilience**: If the deployment uses multiple TEE platforms (e.g., TDX on some nodes, SEV-SNP on others), evaluate whether collateral expiry on one platform affects the others. Design for independent collateral lifecycles. + +**Expected Outcome**: A blast radius table showing which RVM operations degrade on collateral expiry, a recommended refresh strategy, and a health monitoring specification for collateral validity. + +--- + +## R14: Witness Truncation Attacks — Ring Buffer Overflow as Denial of Audit + +**Question**: Can an attacker exploit the witness ring buffer overflow to erase evidence of malicious activity? + +**Relevance**: ADR-134 Section 4 defines the witness log as a ring buffer with background drain. When the drain task cannot keep up, the oldest un-drained records are overwritten. The system detects this via a sequence gap and logs a `RecoveryEnter` witness when the drain catches up. However, if an attacker can intentionally cause ring buffer overflow, they can erase specific witness records by timing their malicious operations to coincide with the overflow window. + +**Threat Model**: A malicious partition with high privilege (but not kernel-level) floods the witness log with legitimate but high-volume operations (e.g., rapid capability grants and revokes, or rapid GPU buffer alloc/free cycles using ADR-151 variants). The flood rate exceeds the drain rate, causing the ring buffer to wrap and overwrite records from before the flood. If the attacker times the flood to coincide with their malicious operation (e.g., an unauthorized migration), the malicious operation's witness record is overwritten before it can be drained to persistent storage. + +**Attack Steps**: + +1. Attacker identifies the ring buffer capacity (262,144 records per ADR-134) and drain rate. +2. Attacker performs the malicious operation (e.g., privilege escalation attempt). +3. Attacker immediately floods the witness log with 262,144+ innocuous operations. +4. The ring buffer wraps, overwriting the malicious operation's witness record. +5. When the drain catches up, it sees a sequence gap but the overwritten records are gone. + +**Approach**: + +1. **Rate limiting**: Evaluate per-partition witness emission rate limits. A partition that exceeds N witness records per epoch (e.g., 1000) triggers a `WitnessFloodDetected` event and is throttled. +2. **Priority draining**: Evaluate draining high-severity events (proof rejections, IOMMU violations, budget exceeded) before low-severity events (normal IPC, epoch summaries). This ensures security-critical records survive overflow. +3. **Dual-buffer architecture**: Evaluate a separate, smaller ring buffer for security-critical events (proof rejections, capability escalation attempts, IOMMU violations). This buffer has its own drain path and is not affected by floods in the main buffer. +4. **Immediate drain trigger**: When a security-critical event is emitted, trigger an immediate drain of at least the security buffer. This bounds the window during which the event could be overwritten. +5. **Witness hash anchoring**: Periodically write a hash anchor (SHA-256 of the last N records) to persistent storage. Even if records are overwritten, the anchor detects that the chain has been disrupted. + +**Expected Outcome**: A concrete mitigation recommendation (rate limiting, priority draining, or dual buffer), with performance impact analysis. The mitigation must not violate the 500ns witness emission budget (ADR-134 Section 6) or add allocation to the fast path. + +--- + +## Cross-References + +- ADR-134: Witness Schema and Log Format (ring buffer, emission protocol, 500ns budget) +- ADR-135: Proof Verifier Design (P2 constant-time verification) +- ADR-142: TEE-Backed Cryptographic Verification (TEE pipeline, collateral refresh, constant-time) +- ADR-144: GPU Compute Support (GPU isolation, IOMMU, context switch) +- ADR-151: GPU Witness Event Registry (GPU-specific ActionKind variants) +- Naghibijouybari, H. et al. "Rendered Insecure: GPU Side Channel Attacks are Practical." CCS 2018. +- Reparaz, O. et al. "Dude, is my code constant time?" DATE 2017. +- Almeida, J.B. et al. "Verifying Constant-Time Implementations." USENIX Security 2016. diff --git a/docs/research/specification-improvement-plan.md b/docs/research/specification-improvement-plan.md new file mode 100644 index 00000000..41660b5c --- /dev/null +++ b/docs/research/specification-improvement-plan.md @@ -0,0 +1,610 @@ +# RVM Specification Improvement Plan + +**Date**: 2026-04-04 +**Scope**: Complete specification gap analysis and GOAP action plan +**Inputs**: 6 RVM ADRs (132, 134, 135, 142, 143, 144), ruvector submodule ADRs (001-141), 14 crates, 945 tests + +--- + +## 1. Current State Assessment + +### 1.1 What Exists + +**RVM Main Repository (6 ADRs)** + +| ADR | Title | Status | Scope | +|-----|-------|--------|-------| +| ADR-132 | RVM Hypervisor Core | Proposed | Master architecture, 15 design constraints, 4-phase roadmap | +| ADR-134 | Witness Schema and Log Format | Proposed | 64-byte record, hash chaining, ring buffer, replay, audit | +| ADR-135 | Proof Verifier Design | Proposed | 3-layer proof system, capability model, nonce tracker | +| ADR-142 | TEE-Backed Cryptographic Verification | Accepted | SHA-256 upgrade, WitnessSigner, TEE pipeline, constant-time | +| ADR-143 | Nightly Verified Release Pipeline | Accepted | CI/CD, benchmark regression, Claude Code version tracking | +| ADR-144 | GPU Compute Support | Accepted | rvm-gpu crate, 3-layer architecture, capability-gated GPU | + +**RuVector Submodule (follow-on ADRs referenced but not copied into RVM)** + +| ADR | Title | Status | Location | +|-----|-------|--------|----------| +| ADR-133 | Partition Object Model | Proposed | ruvector/docs/adr/ only | +| ADR-136 | Memory Hierarchy and Reconstruction | Proposed | ruvector/docs/adr/ only | +| ADR-137 | Bare-Metal Boot Sequence | Proposed | ruvector/docs/adr/ only | +| ADR-138 | Seed Hardware Bring-Up | Proposed | ruvector/docs/adr/ only | +| ADR-139 | Appliance Deployment Model | Proposed | ruvector/docs/adr/ only | +| ADR-140 | Agent Runtime Adapter | Proposed | ruvector/docs/adr/ only | +| ADR-141 | Coherence Engine Kernel Integration | Accepted | ruvector/docs/adr/ only | + +**Implementation Crates (14)** + +| Crate | ADR Coverage | Spec Gap | +|-------|-------------|----------| +| rvm-types | ADR-132, 133, 134 | Well-covered | +| rvm-boot | ADR-137 | ADR-137 not in RVM docs/ | +| rvm-cap | ADR-135 | Well-covered | +| rvm-coherence | ADR-141 | ADR-141 not in RVM docs/ | +| rvm-gpu | ADR-144 | New, well-specified | +| rvm-hal | ADR-137 | No dedicated HAL ADR | +| rvm-kernel | ADR-132 | Integration layer, needs integration spec | +| rvm-memory | ADR-136 | ADR-136 not in RVM docs/ | +| rvm-partition | ADR-133 | ADR-133 not in RVM docs/ | +| rvm-proof | ADR-135, 142 | Well-covered | +| rvm-sched | ADR-132 DC-4 | No dedicated scheduler ADR | +| rvm-security | ADR-142 | SecurityGate documented | +| rvm-wasm | ADR-140 | ADR-140 not in RVM docs/ | +| rvm-witness | ADR-134, 142 | Well-covered | + +### 1.2 What Does Not Exist + +**Missing from RVM docs/adr/**: +- ADR-133 through ADR-141 (except 134, 135) live only in the ruvector submodule +- No dedicated HAL specification +- No IPC protocol specification +- No SMP/multi-core specification +- No error model specification +- No formal security model document +- No performance budget specification +- No RVF integration specification for RVM +- No multi-node / cluster specification +- No upgrade/migration specification + +--- + +## 2. Specification Gaps (Prioritized) + +### Gap Category A: Critical -- ADRs exist in submodule but not in RVM + +These specifications are written and referenced by implementation code, but a reader of the RVM repository alone cannot find them. This is a documentation hygiene problem that creates confusion for contributors. + +| Priority | Gap | Impact | +|----------|-----|--------| +| A1 | ADR-133 (Partition Object Model) not in RVM docs/ | Partition is the core abstraction; its spec must be findable | +| A2 | ADR-136 (Memory Hierarchy) not in RVM docs/ | rvm-memory references ADR-136; readers cannot find it | +| A3 | ADR-137 (Boot Sequence) not in RVM docs/ | rvm-boot and rvm-hal reference it | +| A4 | ADR-140 (Agent Runtime) not in RVM docs/ | rvm-wasm references ADR-140 | +| A5 | ADR-141 (Coherence Engine Integration) not in RVM docs/ | rvm-coherence references it | +| A6 | ADR-138 (Seed Hardware Bring-Up) not in RVM docs/ | Hardware target spec | +| A7 | ADR-139 (Appliance Deployment Model) not in RVM docs/ | Primary deployment target | + +### Gap Category B: Missing Specifications -- No ADR exists + +These are areas where implementation exists but no specification captures the design rationale, constraints, or invariants. + +| Priority | Gap | Affected Crates | Impact | +|----------|-----|----------------|--------| +| B1 | IPC Protocol | rvm-partition (ipc module) | IPC semantics (zero-copy, notification, queue) are implemented but not formally specified | +| B2 | SMP and Per-CPU Scheduling | rvm-sched (smp, per_cpu modules) | Multi-core scheduling exists but no ADR defines core affinity, IPI protocol, or load balancing | +| B3 | Hardware Abstraction Layer | rvm-hal | HAL trait boundaries, platform detection, and fallback semantics are unspecified | +| B4 | Error Model and Recovery Semantics | rvm-types (error module), rvm-kernel | F1-F4 failure classes are described in ADR-132 DC-14 but not formally specified with state machines | +| B5 | RVF Integration for RVM | rvm-boot, rvm-wasm | How RVM uses RVF packages for boot images, agent modules, and checkpoints is undocumented | +| B6 | Device Lease Protocol | rvm-partition (device module) | Device lease lifecycle (grant, renew, expire, revoke) exists in code but has no ADR | +| B7 | Coherence Graph Semantics | rvm-coherence (graph module) | Edge weight meaning, decay constants, graph capacity limits are implementation choices without specification | +| B8 | Security Gate Composition | rvm-security (gate module) | The SecurityGate integrates P1+P2+P3+witness but no ADR describes the composition rules | +| B9 | Agent Lifecycle State Machine | rvm-wasm (agent module) | 7-state machine referenced but not formally specified with transition guards | +| B10 | Benchmark Methodology | rvm-benches | Performance claims (10us switch, 50us mincut budget) need documented measurement methodology | + +### Gap Category C: Inconsistencies Between Specs and Implementation + +| Priority | Inconsistency | Details | +|----------|--------------|---------| +| C1 | ADR-132 says "5 ADRs" but 6 exist | ADR-143 was added after the master ADR was written | +| C2 | ADR-132 lists ADR-133 as follow-on but it is not in RVM docs/ | Planned ADRs (133-140) are listed but only 134, 135 are in RVM docs/ | +| C3 | ADR-134 addendum and ADR-142 partially overlap | ADR-134's addendum summarizes ADR-142 changes; the canonical source of truth is ambiguous for WitnessSigner defaults | +| C4 | ADR-135 addendum says P3 is implemented per ADR-142, but ADR-135 status is still "Proposed" | Status should be updated to "Accepted" with amendments | +| C5 | Partition limit: ADR-132 says DC-12 logical 4096, rvm-partition says MAX_PARTITIONS = 256 | The code enforces the physical limit; the logical multiplexing from DC-12 may not be implemented | +| C6 | GPU ActionKind variants not in ADR-134 witness enum | ADR-144 defines GPU witness events but ADR-134's ActionKind enum does not include them | +| C7 | rvm-coherence references ADR-139 in doc comment but ADR-139 is the Appliance model, not the coherence spec | Likely should reference ADR-141 | +| C8 | ADR-132 non-goals says "no GPU" but ADR-144 adds GPU | ADR-132 was written before GPU support was decided; non-goals section is now stale | + +### Gap Category D: Specifications Needed for New Subsystems + +| Priority | Specification | Justification | +|----------|--------------|---------------| +| D1 | GPU Witness Integration | ADR-144 defines GPU ActionKind variants but they are not formally registered in the ADR-134 witness schema | +| D2 | GPU-Accelerated MinCut Formal Model | ADR-144 describes the acceleration but does not specify correctness invariants (GPU result must match CPU result within epsilon) | +| D3 | Nightly Pipeline Verification Criteria | ADR-143 lists tests but does not specify the acceptance criteria for "verified" (what exactly must pass?) | +| D4 | Multi-Node Mesh Protocol | ADR-132 mentions mesh clustering but it is explicitly deferred; a placeholder ADR should capture the design space | +| D5 | Formal Verification Roadmap | ADR-132 says formal verification is deferred; a research ADR should capture what properties to verify and what tools to use | + +--- + +## 3. GOAP Action Plan + +### Goal State + +``` +{ + all_adrs_in_rvm_docs: true, // Category A resolved + missing_specs_written: true, // Category B resolved + inconsistencies_fixed: true, // Category C resolved + new_subsystem_specs: true, // Category D resolved + research_topics_documented: true, + docs_structure_established: true +} +``` + +### Current State + +``` +{ + all_adrs_in_rvm_docs: false, // 7 ADRs in submodule only + missing_specs_written: false, // 10 specs missing + inconsistencies_fixed: false, // 8 inconsistencies + new_subsystem_specs: false, // 5 new specs needed + research_topics_documented: false, + docs_structure_established: false +} +``` + +### Action Plan (Ordered by Dependencies) + +--- + +#### Phase 1: Documentation Hygiene (Category A) -- Precondition: None + +**Action 1.1: Copy Follow-On ADRs into RVM docs/adr/** + +``` +Action: copy_follow_on_adrs + Preconditions: { ruvector_submodule_accessible: true } + Effects: { all_adrs_in_rvm_docs: true } + Cost: 1 (low -- file operations only) + Tool: Bash (cp) +``` + +Copy ADR-133, 136, 137, 138, 139, 140, 141 from `ruvector/docs/adr/` into `docs/adr/`. Add a header note indicating the canonical source is the ruvector submodule. This makes the complete specification browsable from the RVM repository alone. + +**Files to copy:** +- `ruvector/docs/adr/ADR-133-partition-object-model.md` -> `docs/adr/` +- `ruvector/docs/adr/ADR-136-memory-hierarchy-reconstruction.md` -> `docs/adr/` +- `ruvector/docs/adr/ADR-137-bare-metal-boot-sequence.md` -> `docs/adr/` +- `ruvector/docs/adr/ADR-138-seed-hardware-bring-up.md` -> `docs/adr/` +- `ruvector/docs/adr/ADR-139-appliance-deployment-model.md` -> `docs/adr/` +- `ruvector/docs/adr/ADR-140-agent-runtime-adapter.md` -> `docs/adr/` +- `ruvector/docs/adr/ADR-141-coherence-engine-kernel-integration.md` -> `docs/adr/` + +--- + +#### Phase 2: Fix Inconsistencies (Category C) -- Precondition: Phase 1 + +**Action 2.1: Update ADR-132 to reflect current state** + +``` +Action: update_adr_132 + Preconditions: { all_adrs_in_rvm_docs: true } + Effects: { adr_132_current: true } + Cost: 2 + Tool: Edit +``` + +Changes needed: +- Update ADR count (now 13 ADRs: 132-144, minus the gap) +- Add ADR-142, 143, 144 to the follow-on table +- Remove "no GPU" from non-goals (now contradicted by ADR-144) +- Update DC-3 addendum to reference ADR-142 implementation +- Note 14 crates (was implied 13) + +**Action 2.2: Update ADR-134 to reconcile with ADR-142** + +``` +Action: reconcile_adr_134_142 + Preconditions: { adr_132_current: true } + Effects: { witness_schema_consistent: true } + Cost: 2 + Tool: Edit +``` + +Changes needed: +- Add GPU ActionKind variants (0xA0-0xAF range) to the enum +- Mark ADR-134 status as "Accepted (amended by ADR-142)" +- Add cross-reference to ADR-144 for GPU witness events + +**Action 2.3: Update ADR-135 status** + +``` +Action: update_adr_135_status + Preconditions: { witness_schema_consistent: true } + Effects: { proof_verifier_status_current: true } + Cost: 1 + Tool: Edit +``` + +Change status from "Proposed" to "Accepted (P3 implemented per ADR-142)". + +**Action 2.4: Fix rvm-coherence doc comment** + +``` +Action: fix_coherence_doc_reference + Preconditions: {} + Effects: { doc_references_correct: true } + Cost: 1 + Tool: Edit +``` + +Change "ADR-139" reference in `rvm-coherence/src/lib.rs` to "ADR-141". + +--- + +#### Phase 3: Write Missing Specifications (Category B) -- Precondition: Phase 2 + +**Action 3.1: ADR-145 IPC Protocol Specification** + +``` +Action: write_adr_145_ipc + Preconditions: { adr_132_current: true, witness_schema_consistent: true } + Effects: { ipc_specified: true } + Cost: 5 + Tool: Write +``` + +**Action 3.2: ADR-146 SMP and Per-CPU Scheduling** + +``` +Action: write_adr_146_smp + Preconditions: { ipc_specified: true } + Effects: { smp_specified: true } + Cost: 5 + Tool: Write +``` + +**Action 3.3: ADR-147 Hardware Abstraction Layer** + +``` +Action: write_adr_147_hal + Preconditions: {} + Effects: { hal_specified: true } + Cost: 4 + Tool: Write +``` + +**Action 3.4: ADR-148 Error Model and Recovery State Machine** + +``` +Action: write_adr_148_error_model + Preconditions: {} + Effects: { error_model_specified: true } + Cost: 4 + Tool: Write +``` + +**Action 3.5: ADR-149 RVF Integration for RVM** + +``` +Action: write_adr_149_rvf_integration + Preconditions: {} + Effects: { rvf_integration_specified: true } + Cost: 3 + Tool: Write +``` + +**Action 3.6: ADR-150 Device Lease Protocol** + +``` +Action: write_adr_150_device_lease + Preconditions: { hal_specified: true } + Effects: { device_lease_specified: true } + Cost: 3 + Tool: Write +``` + +--- + +#### Phase 4: New Subsystem Specifications (Category D) -- Precondition: Phase 3 + +**Action 4.1: ADR-151 GPU Witness Event Registry** + +``` +Action: write_adr_151_gpu_witness + Preconditions: { witness_schema_consistent: true } + Effects: { gpu_witness_registered: true } + Cost: 2 + Tool: Write +``` + +**Action 4.2: ADR-152 GPU MinCut Correctness Model** + +``` +Action: write_adr_152_gpu_mincut_correctness + Preconditions: { gpu_witness_registered: true } + Effects: { gpu_mincut_correctness_specified: true } + Cost: 3 + Tool: Write +``` + +**Action 4.3: ADR-153 Multi-Node Mesh Protocol (Design Space)** + +``` +Action: write_adr_153_mesh_protocol + Preconditions: { ipc_specified: true, smp_specified: true } + Effects: { mesh_protocol_space_documented: true } + Cost: 4 + Tool: Write +``` + +**Action 4.4: ADR-154 Formal Verification Roadmap** + +``` +Action: write_adr_154_formal_verification + Preconditions: { error_model_specified: true } + Effects: { formal_verification_roadmap: true } + Cost: 4 + Tool: Write +``` + +--- + +#### Phase 5: Research and Documentation Infrastructure -- Precondition: Phase 2 + +**Action 5.1: Establish docs/research/ directory structure** + +``` +Action: create_research_structure + Preconditions: {} + Effects: { docs_structure_established: true } + Cost: 1 + Tool: Bash (mkdir) +``` + +**Action 5.2: Write research topic briefs** + +``` +Action: write_research_briefs + Preconditions: { docs_structure_established: true } + Effects: { research_topics_documented: true } + Cost: 5 + Tool: Write +``` + +--- + +## 4. New ADR Proposals + +### ADR-145: Inter-Partition Communication Protocol + +**Summary**: Specifies the three IPC mechanisms available in RVM: synchronous message passing via `MessageQueue`, asynchronous notification via `NotificationSignal`, and zero-copy shared region via `ZeroCopyShare`. Defines message format, queue depth limits, backpressure semantics, and the requirement that every IPC operation updates the coherence graph edge weight (the auto-feeding mechanism from ADR-141). Establishes that IPC is always capability-gated (WRITE right on the source CommEdge) and that all sends emit a witness record. Specifies the ordering guarantees: FIFO within a single CommEdge, no global ordering across edges. + +### ADR-146: Symmetric Multi-Processing and Per-CPU Scheduling + +**Summary**: Specifies how RVM manages multiple CPU cores. Defines the per-CPU scheduler state (`PerCpuScheduler`), the IPI (inter-processor interrupt) protocol for cross-core partition migration, core affinity semantics for partitions, and the SMP boot sequence where the BSP (bootstrap processor) initializes the kernel and then wakes APs (application processors). Establishes that each CPU runs its own scheduler tick independently, coherence engine recomputation is delegated to a single "coherence CPU" to avoid contention, and partition switch latency targets apply per-core. Defines the relationship between `PartitionId`, `VcpuId`, and physical CPU index. + +### ADR-147: Hardware Abstraction Layer + +**Summary**: Specifies the HAL trait boundaries that isolate platform-specific code from kernel logic. Defines traits for `Timer` (monotonic nanosecond clock), `Uart` (debug console), `Mmu` (stage-2 page table operations), `Interrupt` (GIC/PLIC/APIC abstraction), `Iommu` (DMA protection), and `PowerManagement` (core enable/disable, frequency scaling). Establishes that all HAL implementations are `no_std`, that the HAL is selected at compile time via feature flags (`hal-aarch64`, `hal-riscv64`, `hal-x86_64`), and that the HAL provides a `PlatformInfo` struct populated from DTB/ACPI at boot. Specifies the fallback behavior when hardware features are unavailable (e.g., no IOMMU degrades to software-enforced DMA bounds). + +### ADR-148: Error Model and Recovery State Machine + +**Summary**: Formalizes the F1-F4 failure classification from ADR-132 DC-14 as a state machine with explicit transition guards and escalation rules. Specifies that F1 (agent failure) allows 3 restart attempts before escalating to F2, F2 (partition failure) triggers checkpoint-based reconstruction, F3 (memory corruption) triggers region rollback from the dormant tier, and F4 (kernel failure) triggers A/B image failover. Defines the `RecoveryStateMachine` with states `Normal`, `Degraded`, `Recovering`, and `FailSafe`, along with the witness events emitted at each transition. Establishes invariants: no recovery action may proceed without a witness record, no escalation may skip a level (F1 cannot jump to F3), and the system must reach a stable state within a bounded time (configurable per failure class). + +### ADR-149: RVF Integration for RVM + +**Summary**: Specifies how the RVF (RuVector Format) package system integrates with RVM across three use cases: (1) Boot images -- the RVF manifest defines the Appliance image layout including kernel binary, DTB overlay, initial partitions, and agent modules; (2) Agent deployment -- WASM agent modules are distributed as RVF packages with signed manifests, capability declarations, and resource quota requirements; (3) Checkpoint persistence -- the cold tier (Tier 3) serializes recovery checkpoints as RVF containers with content-addressed storage. Specifies the RVF manifest fields required for each use case, the signature verification flow at boot (ML-DSA-65 per ADR-042), and the relationship between RVF component IDs and partition IDs. + +### ADR-150: Device Lease Protocol + +**Summary**: Specifies the complete lifecycle of a device lease: request (partition submits a capability-gated lease request), grant (kernel assigns the device with IOMMU isolation and time bounds), renew (partition extends the lease before expiry), expire (kernel reclaims the device on timeout), and revoke (kernel forces device reclamation on security event or partition termination). Defines the `DeviceLease` struct fields (device ID, partition ID, grant time, expiry time, IOMMU context), the capability requirements (READ for query, EXECUTE+WRITE for grant), the witness events for each lifecycle transition, and the interaction with GPU device leases introduced in ADR-144. + +### ADR-151: GPU Witness Event Registry + +**Summary**: Extends the ADR-134 `ActionKind` enum with GPU-specific witness events in the 0xA0-0xAF range: `GpuContextCreate` (0xA0), `GpuContextDestroy` (0xA1), `GpuKernelLaunch` (0xA2), `GpuKernelComplete` (0xA3), `GpuKernelTimeout` (0xA4), `GpuBufferAlloc` (0xA5), `GpuBufferFree` (0xA6), `GpuTransfer` (0xA7), `GpuBudgetExceeded` (0xA8), `GpuIommuViolation` (0xA9), `GpuDeviceNotFound` (0xAA), `GpuCompileFail` (0xAB). Specifies the payload encoding for each event (kernel ID, buffer size, compute duration, budget remaining) and the audit query patterns for GPU forensics. + +### ADR-152: GPU MinCut Correctness Model + +**Summary**: Specifies the correctness contract for GPU-accelerated mincut. The GPU result must satisfy: (1) the cut value equals the exact Stoer-Wagner result (no approximation for N <= 32), (2) the partition assignment is identical to the CPU path (verified by comparing partition bitmasks), and (3) if the GPU path exceeds the DC-2 budget (50us), it falls back to the last known CPU result with a `MinCutBudgetExceeded` witness. Defines the testing strategy: every GPU mincut result is cross-validated against the CPU result in debug builds, and a nightly benchmark tracks GPU vs CPU divergence. Specifies the acceptable numerical tolerance (zero for integer weights, epsilon=1e-6 for floating-point weights). + +### ADR-153: Multi-Node Mesh Protocol (Design Space) + +**Summary**: A design-space ADR (status: Draft) that captures the requirements and open questions for RVM multi-node mesh operation. Identifies four sub-problems: (1) Partition migration across nodes (requires serializing partition state, transferring via network, rebuilding on destination), (2) Cross-node coherence graph maintenance (requires gossip or consensus protocol for edge weight propagation), (3) Distributed capability delegation (requires cross-node capability tokens with node-specific attestation), (4) Witness chain federation (requires merging per-node witness chains into a globally consistent audit trail). Does not propose a solution; establishes the design space and evaluation criteria for future work. + +### ADR-154: Formal Verification Roadmap + +**Summary**: A research ADR that outlines the path toward formal verification of critical RVM properties. Identifies four verification targets: (1) Capability monotonic attenuation (property: derived capabilities never exceed parent rights), (2) Witness chain integrity (property: any chain break is detectable), (3) Partition isolation (property: no stage-2 mapping allows cross-partition memory access without capability), (4) Scheduler liveness (property: every runnable partition is eventually scheduled). Evaluates three verification approaches: Kani (Rust model checker, suitable for bounded verification of capability operations), Prusti (Rust verifier, suitable for pre/post-condition checking on proof system), and Coq/Lean (theorem provers for deep properties like isolation). Proposes starting with Kani on `rvm-cap` as a pilot. + +--- + +## 5. Research Topics + +### 5.1 Theoretical Foundation Strengthening + +| Topic | Question | Relevance | Priority | +|-------|----------|-----------|----------| +| R1: Spectral gap and coherence convergence | Under what conditions does the coherence score converge as the communication graph evolves? What is the mixing time? | Validates that coherence scoring is stable, not oscillatory | High | +| R2: Mincut budget analysis | For the Stoer-Wagner algorithm on N nodes, what is the exact worst-case time as a function of N and edge count? What is the largest N that fits within 50us on Cortex-A72? | Validates DC-2 budget on target hardware | High | +| R3: Capability delegation depth and authority propagation | What is the maximum authority surface reachable through depth-8 delegation? Can we bound the number of active capabilities? | Validates that bounded delegation prevents authority explosion | Medium | +| R4: Witness chain FNV-to-SHA migration formal properties | Does the chain migration (FNV tail + SHA-256 head) preserve the tamper-evidence property? Under what adversary model? | Validates ADR-142 migration strategy | Medium | +| R5: Dormant reconstruction fidelity | Under what conditions does checkpoint + delta replay produce byte-identical state to the original? What happens with concurrent mutations during checkpoint? | Validates the "memory time travel" claim from ADR-136 | High | +| R6: GPU determinism for integer mincut | Is the GPU parallel reduction provably bit-identical to the sequential CPU path for integer adjacency matrices? | Validates ADR-152 correctness claim | Medium | + +### 5.2 Comparative Analysis + +| Topic | Comparison | Deliverable | +|-------|-----------|-------------| +| R7: RVM vs seL4 capability model | Map RVM's 7-right, depth-8 model onto seL4's CNode/CSlot model. Identify where RVM diverges and why. | Comparison document with security implications | +| R8: RVM vs Firecracker boot path | Trace both boot sequences step by step. Quantify where RVM adds latency (no KVM) and where it saves (no Linux). | Benchmark comparison document | +| R9: Coherence scheduling vs CFS/EEVDF | Compare RVM's 2-signal scheduler against Linux CFS and EEVDF under equivalent workloads. | Simulation or microbenchmark results | +| R10: Memory tier efficiency vs zswap/ZRAM | Compare RVM's 4-tier model against Linux zswap for memory-constrained workloads. | Benchmark document with methodology | + +### 5.3 Security Research + +| Topic | Threat | Deliverable | +|-------|--------|-------------| +| R11: Side-channel resistance of P2 constant-time path | Verify that compiler optimizations do not break constant-time semantics on AArch64 and x86-64. | Assembly audit of P2 verify path | +| R12: GPU covert channel analysis | Can two partitions communicate through GPU timing side channels despite IOMMU isolation? | Threat model document with mitigations | +| R13: TEE collateral expiry attack surface | If TDX collateral expires (30-day window), what is the blast radius? Can an attacker force expiry? | Attack tree analysis | +| R14: Witness chain truncation attack | If the ring buffer overwrites un-drained records, can an attacker exploit the sequence gap? | Formal analysis of gap detection | + +--- + +## 6. Recommended docs/research/ Directory Structure + +``` +docs/ + adr/ # Architecture Decision Records + ADR-132-ruvix-hypervisor-core.md + ADR-133-partition-object-model.md # (copy from ruvector) + ADR-134-witness-schema-log-format.md + ADR-135-proof-verifier-design.md + ADR-136-memory-hierarchy-reconstruction.md # (copy from ruvector) + ADR-137-bare-metal-boot-sequence.md # (copy from ruvector) + ADR-138-seed-hardware-bring-up.md # (copy from ruvector) + ADR-139-appliance-deployment-model.md # (copy from ruvector) + ADR-140-agent-runtime-adapter.md # (copy from ruvector) + ADR-141-coherence-engine-kernel-integration.md # (copy from ruvector) + ADR-142-tee-backed-cryptographic-verification.md + ADR-143-nightly-verified-release-pipeline.md + ADR-144-gpu-compute-support.md + ADR-145-ipc-protocol.md # (new) + ADR-146-smp-per-cpu-scheduling.md # (new) + ADR-147-hardware-abstraction-layer.md # (new) + ADR-148-error-model-recovery.md # (new) + ADR-149-rvf-integration.md # (new) + ADR-150-device-lease-protocol.md # (new) + ADR-151-gpu-witness-registry.md # (new) + ADR-152-gpu-mincut-correctness.md # (new) + ADR-153-multi-node-mesh-protocol.md # (new, Draft status) + ADR-154-formal-verification-roadmap.md # (new, Draft status) + + research/ + specification-improvement-plan.md # (this document) + + theory/ + coherence-convergence.md # R1 + mincut-budget-analysis.md # R2 + capability-authority-bounds.md # R3 + chain-migration-properties.md # R4 + reconstruction-fidelity.md # R5 + gpu-determinism-proof.md # R6 + + comparisons/ + rvm-vs-sel4-capabilities.md # R7 + rvm-vs-firecracker-boot.md # R8 + coherence-vs-cfs-scheduling.md # R9 + tier-memory-vs-zswap.md # R10 + + security/ + p2-constant-time-audit.md # R11 + gpu-covert-channel-analysis.md # R12 + tee-collateral-expiry.md # R13 + witness-chain-truncation.md # R14 + + benchmarks/ + measurement-methodology.md # B10: How we measure + baseline-results.md # Criterion JSON baseline reference + regression-thresholds.md # What constitutes a regression + + RUVECTOR-INTEGRATION.md # (existing) +``` + +--- + +## 7. Execution Summary + +### Phase Dependency Graph + +``` +Phase 1 (Copy ADRs) Cost: 1 Preconditions: None + | + v +Phase 2 (Fix Inconsistencies) Cost: 6 Preconditions: Phase 1 + | + v +Phase 3 (Missing Specs) Cost: 24 Preconditions: Phase 2 + | + v +Phase 4 (New Subsystem Specs) Cost: 13 Preconditions: Phase 3 + | + v +Phase 5 (Research + Docs) Cost: 6 Preconditions: Phase 2 +``` + +**Total estimated cost units: 50** + +### Priority Matrix + +| Priority | Actions | Rationale | +|----------|---------|-----------| +| **P0 (Do Now)** | 1.1, 2.1, 2.4 | Documentation hygiene -- zero-risk, high visibility | +| **P1 (This Week)** | 2.2, 2.3, 5.1 | Fix inconsistencies before writing new specs | +| **P2 (Next Sprint)** | 3.1, 3.2, 3.3, 3.4 | Core missing specs that block contributor onboarding | +| **P3 (Following Sprint)** | 3.5, 3.6, 4.1, 4.2, 4.3 | Completeness specs and new subsystem formalization | +| **P4 (Ongoing)** | 4.4, 5.2 | Research topics and formal verification -- long-term | + +### Replanning Triggers + +The plan should be re-evaluated if any of these conditions arise: + +1. **New crate added** -- any new RVM crate requires a specification check +2. **ADR-142 TEE providers become hardware-testable** -- may require ADR updates +3. **Multi-node work begins** -- ADR-153 must be promoted from Draft to Proposed +4. **Formal verification pilot completes** -- ADR-154 must be updated with results +5. **Performance regression detected by nightly pipeline** -- may indicate spec budget violations +6. **Security audit findings** -- any new audit may invalidate ADR-142 assumptions + +### Success Criteria + +The specification improvement is complete when: + +1. All 13+ ADRs are browsable from `docs/adr/` without requiring submodule navigation +2. Every RVM crate has at least one ADR that covers its design rationale +3. No ADR status field contradicts the implementation state +4. The `ActionKind` witness enum covers all privileged actions including GPU operations +5. The `docs/research/` directory contains at least one document per research category +6. The benchmark methodology document exists and is referenced by ADR-143 +7. A contributor can understand the full RVM architecture by reading only `docs/adr/` files in numeric order + +--- + +## Appendix A: ADR Numbering Inventory + +| Number | Title | Location | Status | +|--------|-------|----------|--------| +| 132 | RVM Hypervisor Core | RVM docs/adr/ | Proposed | +| 133 | Partition Object Model | ruvector only | Proposed | +| 134 | Witness Schema and Log Format | RVM docs/adr/ | Proposed (needs update) | +| 135 | Proof Verifier Design | RVM docs/adr/ | Proposed (needs update) | +| 136 | Memory Hierarchy and Reconstruction | ruvector only | Proposed | +| 137 | Bare-Metal Boot Sequence | ruvector only | Proposed | +| 138 | Seed Hardware Bring-Up | ruvector only | Proposed | +| 139 | Appliance Deployment Model | ruvector only | Proposed | +| 140 | Agent Runtime Adapter | ruvector only | Proposed | +| 141 | Coherence Engine Kernel Integration | ruvector only | Accepted | +| 142 | TEE-Backed Cryptographic Verification | RVM docs/adr/ | Accepted | +| 143 | Nightly Verified Release Pipeline | RVM docs/adr/ | Accepted | +| 144 | GPU Compute Support | RVM docs/adr/ | Accepted | +| 145-154 | (Proposed new ADRs) | Not yet written | -- | + +## Appendix B: Crate-to-ADR Coverage Matrix + +| Crate | Primary ADR | Secondary ADRs | Missing Coverage | +|-------|------------|----------------|-----------------| +| rvm-types | 132, 133, 134 | 135, 136 | None | +| rvm-boot | 137 | 132, 138, 139 | RVF boot manifest (proposed ADR-149) | +| rvm-cap | 135 | 132 | None | +| rvm-coherence | 141 | 132 DC-1/DC-2 | Graph semantics (part of ADR-141) | +| rvm-gpu | 144 | 132 DC-2 | GPU witness (ADR-151), correctness (ADR-152) | +| rvm-hal | 137 | 132 | **Dedicated HAL ADR missing (proposed ADR-147)** | +| rvm-kernel | 132 | All | Integration spec adequate in ADR-132 | +| rvm-memory | 136 | 132 | None | +| rvm-partition | 133 | 132, 135, 141 | IPC (ADR-145), device lease (ADR-150) | +| rvm-proof | 135, 142 | 132, 134 | None | +| rvm-sched | 132 DC-4 | 141 | **SMP ADR missing (proposed ADR-146)** | +| rvm-security | 142 | 135 | Security gate composition (covered by ADR-142) | +| rvm-wasm | 140 | 132, 133 | Agent lifecycle (documented in ADR-140) | +| rvm-witness | 134, 142 | 132 | GPU events (ADR-151) | diff --git a/docs/research/theoretical-foundation.md b/docs/research/theoretical-foundation.md new file mode 100644 index 00000000..2df6d51d --- /dev/null +++ b/docs/research/theoretical-foundation.md @@ -0,0 +1,100 @@ +# Theoretical Foundation Research Topics + +**Date**: 2026-04-04 +**Scope**: Six open research questions on the theoretical underpinnings of the RVM hypervisor + +--- + +## R1: Coherence Convergence Guarantees + +**Question**: Does the EMA-filtered coherence score converge for all input distributions, and what is the convergence rate? + +**Relevance**: The coherence engine (ADR-132, ADR-141) uses an exponential moving average (EMA) filter over inter-partition communication weights to compute coherence scores. If the filter does not converge under certain workload patterns (e.g., periodic bursts, adversarial access patterns), the mincut-based placement decisions may oscillate, causing unnecessary partition migrations. + +**Prior Work**: EMA convergence is well-understood for stationary stochastic processes. The open question is whether RVM's communication patterns are sufficiently stationary, or whether the coherence graph's topology changes (partition creation, destruction, split, merge) introduce non-stationarities that break convergence assumptions. + +**Approach**: (1) Model the communication weight update as a discrete-time dynamical system. (2) Identify conditions on the update rate and graph topology change rate under which the EMA filter converges to a fixed point or limit cycle. (3) Simulate with synthetic workload traces (periodic, bursty, adversarial) and measure convergence time and oscillation amplitude. (4) If oscillation is detected, evaluate damping strategies (e.g., increasing EMA alpha, adding hysteresis to split/merge thresholds). + +**Expected Outcome**: A theorem or empirical bound on convergence time as a function of EMA alpha and graph mutation rate, plus a recommendation for default alpha values. + +--- + +## R2: MinCut Budget Analysis + +**Question**: For the Stoer-Wagner algorithm on N nodes with E edges, what is the exact worst-case execution time on the target hardware, and what is the optimal iteration count for the DC-2 budget? + +**Relevance**: ADR-132 DC-2 mandates a 50-microsecond hard budget for mincut per scheduler epoch. The Stoer-Wagner algorithm has complexity O(VE + V^2 log V). For MINCUT_MAX_NODES=32 (ADR-132), the worst case is a fully connected graph with E = 32*31/2 = 496 edges. The question is whether this worst case fits within 50us on the target hardware (ARM Cortex-A72 at 1.5GHz for Appliance, Cortex-M33 for Seed). + +**Prior Work**: Stoer-Wagner runtime on modern hardware has been benchmarked for large graphs (thousands of nodes) but not characterized at the micro-level for small, cache-resident graphs. RVM's use case (N <= 32, cache-hot adjacency matrix) is unusual. + +**Approach**: (1) Implement a cycle-accurate benchmark of the CPU mincut path on Cortex-A72 (QEMU + PMU counters). (2) Measure worst-case latency for fully connected 32-node graphs. (3) If the worst case exceeds 50us, determine the maximum N that fits within the budget. (4) Evaluate whether the GPU path (ADR-144, ADR-152) extends the feasible N. + +**Expected Outcome**: A table mapping N to worst-case mincut latency on each target platform, plus a recommendation for MINCUT_MAX_NODES per hardware profile. + +--- + +## R3: Reconstruction Fidelity + +**Question**: Under what conditions does checkpoint + witness replay produce byte-identical state to the original execution? + +**Relevance**: ADR-134 Section 7 defines the replay protocol: given a checkpoint and a witness log segment, replaying the segment deterministically reconstructs the kernel state. This is the foundation of RVM's recovery model (F2 failure class, ADR-132 DC-14). If reconstruction is not byte-identical, recovered partitions may exhibit different behavior than the original, violating deterministic recovery. + +**Prior Work**: Deterministic replay systems (e.g., rr, PANDA) achieve byte-identical replay for single-threaded programs. Multi-threaded and hardware-interacting systems introduce non-determinism (interrupt timing, device state, MMIO responses) that breaks replay fidelity. + +**Approach**: (1) Enumerate sources of non-determinism in RVM's privileged action path: timer interrupt timing, MMIO device responses, random number generation. (2) For each source, determine whether the witness record captures sufficient information to reproduce the non-determinism during replay. (3) Construct a test: checkpoint, run N privileged actions, replay from checkpoint, compare final state. (4) Identify and document any actions that are not replay-faithful and propose mitigation (e.g., recording timer values in the witness payload). + +**Expected Outcome**: A proof (or counterexample) that checkpoint + witness replay is byte-identical for all privileged actions, plus a list of actions requiring additional replay data. + +--- + +## R4: Proof Tier Latency Bounds + +**Question**: Can the P1 <1us, P2 <100us, and P3 <10ms latency budgets (ADR-135 DC-3) be formally justified, and are they achievable on all target hardware? + +**Relevance**: These budgets are currently engineering estimates based on implementation analysis (ADR-135 provides a cost breakdown for P1 and P2). A formal argument would strengthen confidence that the budgets are not arbitrary but are derived from the operations required at each tier. + +**Prior Work**: seL4 provides worst-case execution time (WCET) analysis for its syscall paths on ARM. RVM's P1 (table lookup + bitmap AND) is structurally similar to seL4's capability lookup. P2 (8 constant-time checks) is more complex but still bounded. + +**Approach**: (1) For P1: count the maximum number of instructions in the verify_p1() path. Multiply by the worst-case cycles-per-instruction on each target. Verify <1us. (2) For P2: same analysis, accounting for the constant-time execution of all 8 checks (no early exit). Include SHA-256 cost for mutation hash comparison (ADR-142). (3) For P3: bound the cost of SHA-256 preimage check, Merkle path verification (log N hashes for depth-N tree), and Ed25519 signature verification. (4) Validate on QEMU with PMU cycle counters. + +**Expected Outcome**: WCET bounds for P1, P2, P3 on each target platform, with a formal argument that the bounds hold for all input sizes. + +--- + +## R5: Memory Tier Transition Optimality + +**Question**: When is demotion (moving a page to a colder tier) better than eviction (freeing the page entirely), and what is the optimal transition threshold? + +**Relevance**: ADR-136 defines a 4-tier memory model (hot/warm/dormant/cold). The transition decision between tiers is driven by cut-value and recency score. However, the optimal threshold for demotion vs eviction depends on the probability that the page will be re-accessed, the cost of reconstruction from a colder tier, and the memory pressure from other partitions. + +**Prior Work**: Traditional page replacement algorithms (LRU, CLOCK, LRU-K) optimize for hit rate. RVM's model adds two dimensions: (1) the coherence graph's cut-value, which predicts cross-partition access patterns, and (2) the reconstruction cost, which varies by tier (warm: decompress, dormant: decompress + replay, cold: full checkpoint restore). + +**Approach**: (1) Model the tier transition decision as a cost-minimization problem: minimize expected_access_latency + memory_pressure_penalty. (2) Derive the optimal demotion threshold as a function of re-access probability and reconstruction cost. (3) Simulate with synthetic and real workload traces. (4) Compare RVM's coherence-aware demotion against LRU and LRU-K baselines. + +**Expected Outcome**: A formula for the optimal demotion threshold as a function of cut-value, recency, and tier-specific reconstruction cost, plus simulation results comparing against baselines. + +--- + +## R6: GPU Acceleration Speedup Model + +**Question**: What is the theoretical speedup from GPU-accelerated mincut and batch scoring, modeled via Amdahl's law? + +**Relevance**: ADR-144 claims 5.6x speedup for 32-node mincut and up to 32x for batch scoring of 1024 partitions. These targets are based on algorithmic analysis (O(N^3) vs O(N^2 log N) for mincut, O(P) vs O(1) for scoring). A formal Amdahl's law analysis would determine the actual system-level speedup, accounting for the serial fraction (data upload, result download, GPU launch overhead). + +**Prior Work**: Amdahl's law provides the upper bound on speedup: S = 1 / (s + p/N), where s is the serial fraction, p is the parallel fraction, and N is the number of parallel processors. For GPU workloads, the serial fraction includes kernel launch overhead (~10us per ADR-144), host-device memory transfer, and result readback. + +**Approach**: (1) Decompose the mincut and scoring workloads into serial and parallel fractions. (2) Measure the serial overhead (launch, transfer) on target GPU hardware. (3) Compute the Amdahl's law speedup bound for each workload at varying N. (4) Compare against the empirical benchmarks from ADR-144 Table 9. (5) Determine the crossover point: the minimum N at which GPU acceleration beats the CPU path. + +**Expected Outcome**: Amdahl's law speedup curves for mincut and scoring as a function of N, plus the crossover N for each workload. This informs the GPU auto-selection threshold in the coherence engine. + +--- + +## Cross-References + +- ADR-132: RVM Hypervisor Core (DC-2 mincut budget, DC-3 proof tiers, DC-14 failure classes) +- ADR-134: Witness Schema and Log Format (replay protocol) +- ADR-135: Proof Verifier Design (P1/P2/P3 latency budgets) +- ADR-136: Memory Hierarchy and Reconstruction (4-tier model) +- ADR-141: Coherence Engine Kernel Integration (EMA filter, coherence scoring) +- ADR-144: GPU Compute Support (GPU acceleration targets) +- ADR-152: GPU MinCut Correctness Model (GPU/CPU equivalence)