2 changes: 1 addition & 1 deletion docs/adr/ADR-132-ruvix-hypervisor-core.md
@@ -374,7 +374,7 @@ RVM leverages existing RuVector crates rather than reimplementing graph primitiv
## Non-Goals (v1)

- Full Linux ABI compatibility
-- Large device model surface (USB, GPU, network card diversity)
+- Large device model surface (USB, GPU, network card diversity). **Note:** GPU compute was added in ADR-144 as an optional, feature-gated subsystem.
- Desktop or workstation use
- Full formal verification (deferred to post-v1; seL4-style proofs are multi-year efforts)
- Cloud VM replacement (strongest advantage is edge/appliance coherence)
121 changes: 121 additions & 0 deletions docs/adr/ADR-133-partition-object-model.md
@@ -0,0 +1,121 @@
# ADR-133: Partition Object Model

**Status:** Accepted
**Date:** 2026-04-05
**Authors:** RuVector Contributors
**Supersedes:** None

---

## Context

ADR-132 establishes partitions as the primary abstraction in RVM -- not VMs. A partition is a coherence domain container for scheduling, isolation, migration, and fault containment. However, ADR-132 specifies partitions at the architectural level without defining the concrete object model: lifecycle states, transition rules, split/merge semantics, communication edges, device leases, or the relationship between logical and physical partition slots.

Without a precise object model, implementations risk diverging on fundamental questions: What states can a partition occupy? When is a split legal? How are capabilities handled during merge? This ADR answers those questions.

## Decision

### Partition Structure

A partition is a fixed-size kernel object containing:

- **PartitionId**: Unique identifier (u32), recyclable after destruction.
- **PartitionState**: Lifecycle state enum (see below).
- **PartitionType**: `Agent` (workload), `Infrastructure` (driver domain), or `Root` (bootstrap authority).
- **CoherenceScore**: Current locality metric from the coherence graph.
- **CutPressure**: Graph-derived isolation signal; high pressure triggers migration or split.
- **vCPU count and CPU affinity**: Scheduling parameters.
- **Epoch**: Creation epoch for capability staleness detection.

### Lifecycle States

```
Created --> Running --> Suspended --> Running --> Hibernated --> Created
| | | | |
+----------+------------+------------+----------->Destroyed
```

| State | Description |
|-------|-------------|
| `Created` | Allocated, capability table initialized, not yet scheduled |
| `Running` | Actively scheduled on physical CPU(s) |
| `Suspended` | All vCPUs paused, state preserved in-place (Hot/Warm tier) |
| `Hibernated` | State serialized to Cold storage, physical resources released |
| `Destroyed` | Resources reclaimed, ID available for reuse |

All transitions are validated by `valid_transition()` and emit witness records. Invalid transitions return `RvmError::InvalidPartitionState`.
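
The transition rules can be sketched as follows. This is a minimal illustration, not the code in `lifecycle.rs`: the edge set is one reading of the diagram above (in particular, that `Hibernated` is entered from `Running` and that every live state may move to `Destroyed`), and the `transition()` helper is hypothetical.

```rust
/// Lifecycle states from the table above.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum PartitionState {
    Created,
    Running,
    Suspended,
    Hibernated,
    Destroyed,
}

/// Only the error variant named in this ADR is modeled here.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum RvmError {
    InvalidPartitionState,
}

/// Returns true if `from -> to` is an edge in the lifecycle diagram.
/// The outer match is exhaustive, so adding a state forces this
/// function to be revisited at compile time.
pub fn valid_transition(from: PartitionState, to: PartitionState) -> bool {
    use PartitionState::*;
    match from {
        Created => matches!(to, Running | Destroyed),
        Running => matches!(to, Suspended | Hibernated | Destroyed),
        Suspended => matches!(to, Running | Destroyed),
        Hibernated => matches!(to, Created | Destroyed),
        // Destroyed is terminal (assumption: the ID is reused, not the object).
        Destroyed => false,
    }
}

/// Hypothetical helper: apply a transition or report the ADR's error.
/// A real implementation would also emit a witness record here.
pub fn transition(
    from: PartitionState,
    to: PartitionState,
) -> Result<PartitionState, RvmError> {
    if valid_transition(from, to) {
        Ok(to)
    } else {
        Err(RvmError::InvalidPartitionState)
    }
}
```

A caller would use `transition(state, PartitionState::Suspended)?` and store the returned state only on success.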

### MAX_PARTITIONS = 256

The hard limit of 256 active physical partition slots is derived from the ARM VMID width (8 bits). Logical partitions may exceed this limit (DC-12 allows up to 4096 logical partitions); physical slots are multiplexed via TLB flush and VMID reassignment when logical count exceeds physical capacity.
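
To make the slot arithmetic concrete, the sketch below derives the limit from the VMID width and multiplexes logical partitions onto physical slots. The `SlotTable` type, its round-robin eviction policy, and the returned flush flag are illustrative assumptions; the ADR does not specify how victims are chosen.

```rust
/// VMID width stated in this ADR.
pub const VMID_BITS: u32 = 8;
/// 2^8 = 256 physical partition slots.
pub const MAX_PARTITIONS: usize = 1 << VMID_BITS;
/// Logical partition ceiling from DC-12.
pub const MAX_LOGICAL_PARTITIONS: usize = 4096;

/// Hypothetical table mapping physical slots (VMIDs) to logical partition IDs.
pub struct SlotTable {
    owner: [Option<u32>; MAX_PARTITIONS],
    next_victim: usize,
}

impl SlotTable {
    pub const fn new() -> Self {
        Self { owner: [None; MAX_PARTITIONS], next_victim: 0 }
    }

    /// Binds a logical partition to a physical slot.
    /// Returns `(vmid, tlb_flush_required)`.
    pub fn bind(&mut self, logical: u32) -> (u8, bool) {
        // Already resident: reuse the slot, no flush needed.
        if let Some(i) = self.owner.iter().position(|o| *o == Some(logical)) {
            return (i as u8, false);
        }
        // A free slot exists: take it.
        if let Some(i) = self.owner.iter().position(|o| o.is_none()) {
            self.owner[i] = Some(logical);
            return (i as u8, false);
        }
        // All slots busy: evict round-robin (assumed policy) and flush,
        // because the VMID now names a different partition.
        let i = self.next_victim;
        self.next_victim = (i + 1) % MAX_PARTITIONS;
        self.owner[i] = Some(logical);
        (i as u8, true)
    }
}
```

The flush flag is where the TLB overhead of multiplexing shows up: it is only set once the logical count exceeds the 256 physical slots.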

### Split Semantics

Partition split divides one partition into two, triggered by high cut pressure. Preconditions:

1. Source partition must be in `Running` or `Suspended` state.
2. The coherence engine provides a scored region assignment (DC-9): each memory region is assigned to the side with higher `alpha * local_access_fraction + beta * remote_access_cost_avoided + gamma * size_penalty`.
3. Capabilities follow their target objects (DC-8): if a capability's target is on side A, it goes to partition A only. Shared targets get READ_ONLY attenuation in both, with a `CAPABILITY_ATTENUATED_ON_SPLIT` witness.
4. Two new `PartitionId` values are allocated; the original ID is destroyed.
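
The scoring rule in precondition 2 can be written out as a small function. This is a sketch under assumptions: the `Weights` and `SideMetrics` types are hypothetical, floating point is used for readability (a `no_std` kernel may prefer fixed-point basis points), and ties are broken toward side A arbitrarily.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Side {
    A,
    B,
}

/// Tunable coefficients from the DC-9 formula.
#[derive(Clone, Copy)]
pub struct Weights {
    pub alpha: f64,
    pub beta: f64,
    /// Typically negative if `size_penalty` is expressed as a positive cost (assumption).
    pub gamma: f64,
}

/// Per-side measurements for one memory region (hypothetical container).
#[derive(Clone, Copy)]
pub struct SideMetrics {
    pub local_access_fraction: f64,
    pub remote_access_cost_avoided: f64,
    pub size_penalty: f64,
}

/// alpha * local_access_fraction + beta * remote_access_cost_avoided + gamma * size_penalty
pub fn score(w: Weights, m: SideMetrics) -> f64 {
    w.alpha * m.local_access_fraction
        + w.beta * m.remote_access_cost_avoided
        + w.gamma * m.size_penalty
}

/// Assigns the region to the side with the higher score; ties go to A.
pub fn assign_region(w: Weights, a: SideMetrics, b: SideMetrics) -> Side {
    if score(w, a) >= score(w, b) {
        Side::A
    } else {
        Side::B
    }
}
```

For example, with `alpha = 1.0`, `beta = 0.5`, `gamma = -0.25`, a region whose side-A metrics are `(0.8, 0.2, 0.4)` scores about 0.8, while side-B metrics of `(0.2, 0.6, 0.0)` score about 0.5, so the region goes to side A.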

### Merge Semantics

Partition merge combines two partitions into one. Seven preconditions must all hold (DC-11):

1. A shared `CommEdge` exists between the two partitions.
2. Mutual coherence score exceeds the merge threshold (7000 basis points default).
3. No conflicting device leases.
4. No overlapping mutable memory regions.
5. Capability intersection is valid (no escalation).
6. Both partitions are in `Running` or `Suspended` state.
7. P2 proof validates merge authority.

Failure of any precondition results in rejection plus a witness record. The `merge_preconditions_full()` function checks all seven and returns a typed `MergePreconditionError` identifying which conditions failed.
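
One way to report every failed condition at once is a bitmask, sketched below. Only the names `merge_preconditions_full` and `MergePreconditionError` come from this ADR; the signature, the `MergeInputs` struct, and the bitmask representation are assumptions made for illustration.

```rust
/// Default merge threshold in basis points.
pub const DEFAULT_MERGE_THRESHOLD_BP: u16 = 7000;

/// Pre-computed facts about the two partitions (hypothetical input struct).
pub struct MergeInputs {
    pub shared_comm_edge: bool,
    pub mutual_coherence_bp: u16,
    pub conflicting_device_leases: bool,
    pub overlapping_mutable_regions: bool,
    pub capability_escalation: bool,
    pub both_running_or_suspended: bool,
    pub p2_proof_valid: bool,
}

/// Bit `i - 1` is set when precondition `i` failed (assumed encoding).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct MergePreconditionError {
    pub failed: u8,
}

/// Checks all seven preconditions without short-circuiting,
/// so the caller learns about every failure in one call.
pub fn merge_preconditions_full(
    i: &MergeInputs,
    threshold_bp: u16,
) -> Result<(), MergePreconditionError> {
    let checks = [
        i.shared_comm_edge,                   // 1
        i.mutual_coherence_bp > threshold_bp, // 2: must exceed, not equal
        !i.conflicting_device_leases,         // 3
        !i.overlapping_mutable_regions,       // 4
        !i.capability_escalation,             // 5
        i.both_running_or_suspended,          // 6
        i.p2_proof_valid,                     // 7
    ];
    let mut failed = 0u8;
    for (idx, ok) in checks.iter().enumerate() {
        if !*ok {
            failed |= 1 << idx;
        }
    }
    if failed == 0 {
        Ok(())
    } else {
        Err(MergePreconditionError { failed })
    }
}
```

A rejection would then be logged as a witness record carrying the `failed` mask.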

### CommEdge Model

A `CommEdge` is a weighted directed edge in the coherence graph representing inter-partition communication. Each edge carries:

- **CommEdgeId**: Unique edge identifier.
- Source and destination `PartitionId`.
- Weight (communication volume, decayed per epoch).

CommEdges are the input to the mincut algorithm. The scheduler uses cut pressure derived from these edges to boost scheduling priority (DC-4: `priority = deadline_urgency + cut_pressure_boost`).
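
A minimal sketch of the edge record and the DC-4 priority formula follows. The field types, the halving decay per epoch, and the saturating arithmetic are assumptions; the ADR only states that weights decay per epoch and that priority is the sum of the two terms.

```rust
/// Weighted directed edge between two partitions.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct CommEdge {
    pub id: u32,
    pub src: u32,
    pub dst: u32,
    /// Communication volume, decayed once per epoch.
    pub weight: u64,
}

impl CommEdge {
    /// Accounts for newly observed traffic on this edge.
    pub fn record(&mut self, bytes: u64) {
        self.weight = self.weight.saturating_add(bytes);
    }

    /// Applies one epoch of decay. Halving is an assumed decay function.
    pub fn decay_epoch(&mut self) {
        self.weight >>= 1;
    }
}

/// DC-4: priority = deadline_urgency + cut_pressure_boost.
/// Saturating addition avoids wrap-around at the top of the range.
pub fn priority(deadline_urgency: u32, cut_pressure_boost: u32) -> u32 {
    deadline_urgency.saturating_add(cut_pressure_boost)
}
```

Decay keeps old traffic from dominating the mincut input: an edge that goes quiet loses half its weight each epoch under this assumed policy.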

### IPC

The `IpcManager` provides fixed-capacity message queues between partitions. Each `MessageQueue` is bounded (default 64 messages) and supports zero-copy semantics via shared memory regions. IPC messages emit `IpcSend` / `IpcReceive` witness records.
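
A bounded queue of this kind can be built as a ring buffer over a fixed array, as sketched below. The `Message` layout is a hypothetical descriptor (region, offset, length) standing in for zero-copy payloads; the actual `MessageQueue` in `ipc.rs` may differ.

```rust
/// Default queue bound from this ADR.
pub const QUEUE_CAPACITY: usize = 64;

/// Descriptor pointing into a shared memory region, so the payload
/// itself is never copied (assumed layout).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Message {
    pub region_id: u32,
    pub offset: u32,
    pub len: u32,
}

/// Fixed-capacity FIFO with no heap allocation.
pub struct MessageQueue {
    slots: [Option<Message>; QUEUE_CAPACITY],
    head: usize,
    len: usize,
}

impl MessageQueue {
    pub const fn new() -> Self {
        Self { slots: [None; QUEUE_CAPACITY], head: 0, len: 0 }
    }

    /// Enqueues a message, or hands it back if the queue is full.
    pub fn send(&mut self, m: Message) -> Result<(), Message> {
        if self.len == QUEUE_CAPACITY {
            return Err(m);
        }
        let tail = (self.head + self.len) % QUEUE_CAPACITY;
        self.slots[tail] = Some(m);
        self.len += 1;
        Ok(())
    }

    /// Dequeues the oldest message, if any.
    pub fn receive(&mut self) -> Option<Message> {
        if self.len == 0 {
            return None;
        }
        let m = self.slots[self.head].take();
        self.head = (self.head + 1) % QUEUE_CAPACITY;
        self.len -= 1;
        m
    }
}
```

Returning the rejected message from `send` lets the sender retry or drop it explicitly; a full implementation would emit `IpcSend` and `IpcReceive` witness records at these two points.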

### Device Leases

A `DeviceLeaseManager` tracks time-bounded, revocable access grants to hardware devices. Each `ActiveLease` records the partition, device, expiry time, and capability used to acquire the lease. Device leases are checked during merge preconditions (condition 3) to prevent conflicts.
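
The lease record and the merge-time conflict check might look like the sketch below. The field types are assumptions, and so is the definition of a conflict used here: two partitions each holding an unexpired lease on the same device.

```rust
/// One time-bounded grant of a device to a partition.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct ActiveLease {
    pub partition: u32,
    pub device: u32,
    /// Absolute expiry time in nanoseconds (assumed unit).
    pub expiry_ns: u64,
    /// Handle of the capability used to acquire the lease (assumed type).
    pub capability: u64,
}

impl ActiveLease {
    /// A lease is active strictly before its expiry time.
    pub fn is_active(&self, now_ns: u64) -> bool {
        now_ns < self.expiry_ns
    }
}

/// Returns true if partitions `a` and `b` both hold an active lease
/// on the same device (assumed meaning of "conflicting").
pub fn leases_conflict(leases: &[ActiveLease], a: u32, b: u32, now_ns: u64) -> bool {
    leases
        .iter()
        .filter(|la| la.partition == a && la.is_active(now_ns))
        .any(|la| {
            leases.iter().any(|lb| {
                lb.partition == b && lb.device == la.device && lb.is_active(now_ns)
            })
        })
}
```

Because expired leases are ignored, a merge that is rejected now may succeed once one side's lease lapses or is revoked.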

## Consequences

### Positive

- **Precise lifecycle** rejects invalid state transitions at run time via `valid_transition()`; the exhaustive match lets the compiler verify that every state is handled.
- **Scored region assignment** during split avoids oscillation from hotspot access patterns.
- **Seven merge preconditions** prevent authority leaks and resource conflicts.
- **Fixed-size structures** (`MAX_PARTITIONS = 256`) enable fully stack-allocated operation in `no_std`.

### Negative

- **256 physical slots** may become a bottleneck for large agent workloads; VMID multiplexing (DC-12) adds TLB flush overhead.
- **Seven merge preconditions** are conservative; some legitimate merges may be rejected by condition 4 (overlapping regions) when the overlap is intentional shared memory.

### Neutral

- The split/merge operations are novel for a hypervisor, so no existing system offers a direct performance baseline for comparison.

## References

- ADR-132: RVM Hypervisor Core (DC-8, DC-9, DC-11, DC-12)
- `crates/rvm-partition/src/lib.rs` -- Partition module root
- `crates/rvm-partition/src/partition.rs` -- Core struct and constants
- `crates/rvm-partition/src/lifecycle.rs` -- State transition validation
- `crates/rvm-partition/src/split.rs` -- Scored region assignment
- `crates/rvm-partition/src/merge.rs` -- Merge precondition checks
- `crates/rvm-partition/src/comm_edge.rs` -- CommEdge model
- `crates/rvm-partition/src/ipc.rs` -- IPC message queues
- `crates/rvm-partition/src/device.rs` -- Device lease management
2 changes: 1 addition & 1 deletion docs/adr/ADR-134-witness-schema-log-format.md
@@ -1,6 +1,6 @@
# ADR-134: Witness Schema and Log Format

-**Status**: Proposed
+**Status**: Accepted
**Date**: 2026-04-04
**Authors**: Claude Code (Opus 4.6)
**Supersedes**: None
2 changes: 1 addition & 1 deletion docs/adr/ADR-135-proof-verifier-design.md
@@ -1,6 +1,6 @@
# ADR-135: Proof Verifier Design — Three-Layer Verification for Capability-Gated Mutation

-**Status**: Proposed
+**Status**: Accepted
**Date**: 2026-04-04
**Authors**: Claude Code (Opus 4.6)
**Supersedes**: None
113 changes: 113 additions & 0 deletions docs/adr/ADR-136-memory-hierarchy-reconstruction.md
@@ -0,0 +1,113 @@
# ADR-136: Memory Hierarchy and Reconstruction

**Status:** Accepted
**Date:** 2026-04-05
**Authors:** RuVector Contributors
**Supersedes:** None

---

## Context

ADR-132 introduces a four-tier memory model where page residency is driven by coherence graph signals rather than simple access frequency. Traditional hypervisors use demand paging with a binary resident/swapped model. RVM needs a richer model because:

1. **Coherence-driven placement**: Pages should remain resident not just because they are accessed, but because the graph structure (cut pressure, locality) justifies the cost of keeping them hot.
2. **Reconstruction from witnesses**: Dormant state can be restored without storing full memory snapshots, by replaying witness-recorded deltas against a compressed checkpoint.
3. **No-std constraint**: The memory manager must operate with zero heap allocation, using only caller-provided fixed-size buffers.

## Decision

### Four-Tier Memory Model

| Tier | Name | Location | Residency Rule |
|------|------|----------|----------------|
| 0 | Hot | Per-core SRAM / L1-adjacent | Always resident during partition execution |
| 1 | Warm | Shared cluster DRAM | Resident if `cut_value + recency_score > eviction_threshold` |
| 2 | Dormant | Compressed in main memory | Checkpoint + witness delta; reconstructed on demand |
| 3 | Cold | RVF-backed persistent archival | Accessed only during recovery; never auto-promoted |

All tier transitions are **explicit**, not demand-paged. The kernel (or coherence engine, when available) decides when to promote or demote a region. Every transition emits a `RegionPromote` or `RegionDemote` witness record.

### DC-1 Degraded Mode

When the coherence engine is absent (DC-1), `cut_value` defaults to 0. Tier placement falls back to static thresholds based on `recency_score` alone:

| Transition | Threshold (basis points) |
|------------|------------------------|
| Hot -> Warm | 7000 |
| Warm -> Dormant | 4000 |
| Dormant -> Cold | 1000 |
| Warm -> Hot | 8000 |
| Dormant -> Warm | 5000 |

These conservative defaults prevent aggressive demotion without coherence data.
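
The degraded-mode table can be read as a single-step decision function, sketched below. The comparison directions are assumptions: demotion when `recency_score` falls below the threshold, promotion when it reaches the threshold. The gap between the demote and promote thresholds then acts as hysteresis. Function names are hypothetical.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Tier {
    Hot,
    Warm,
    Dormant,
    Cold,
}

/// Warm-tier residency rule: `cut_value + recency_score > eviction_threshold`.
/// In DC-1 degraded mode the caller passes `cut_value = 0`.
pub fn warm_resident(cut_value: u32, recency_score: u32, eviction_threshold: u32) -> bool {
    cut_value.saturating_add(recency_score) > eviction_threshold
}

/// Returns the tier a region should occupy after one evaluation in
/// degraded mode, moving at most one tier per call.
pub fn degraded_next_tier(current: Tier, recency_bp: u16) -> Tier {
    match current {
        Tier::Hot if recency_bp < 7000 => Tier::Warm,
        Tier::Warm if recency_bp >= 8000 => Tier::Hot,
        Tier::Warm if recency_bp < 4000 => Tier::Dormant,
        Tier::Dormant if recency_bp >= 5000 => Tier::Warm,
        Tier::Dormant if recency_bp < 1000 => Tier::Cold,
        // Cold is never auto-promoted; everything else stays put.
        other => other,
    }
}
```

Under this reading, a region with a score of 7500 stays in whichever of Hot or Warm it already occupies, which avoids flapping between the two tiers.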

### BuddyAllocator

Physical page allocation uses a power-of-two buddy allocator operating on 4 KiB pages (`PAGE_SIZE = 4096`). The allocator is `no_std` compatible with zero heap allocation, using a fixed-size bitmap for free-list tracking.
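
The index arithmetic behind a power-of-two buddy scheme is short enough to show directly. These helpers are a generic sketch of the technique, not the `BuddyAllocator` API; the function names are hypothetical and block positions are expressed as page indices.

```rust
pub const PAGE_SIZE: usize = 4096;

/// Smallest order `k` such that a block of 2^k pages holds `bytes`.
pub fn order_for(bytes: usize) -> u32 {
    let pages = (bytes + PAGE_SIZE - 1) / PAGE_SIZE;
    pages.max(1).next_power_of_two().trailing_zeros()
}

/// Page index of the buddy of the block that starts at `page_index`
/// and has the given order. Blocks are aligned to their size, so the
/// buddy differs only in bit `order`.
pub fn buddy_of(page_index: usize, order: u32) -> usize {
    page_index ^ (1usize << order)
}

/// Start of the merged block formed when a block and its buddy coalesce.
pub fn parent_of(page_index: usize, order: u32) -> usize {
    page_index & !(1usize << order)
}
```

A 5-page request rounds up to order 3 (8 pages), and freeing the order-2 block at page 8 would check whether its buddy at page 12 is also free before merging them into the order-3 block at page 8.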

### RegionManager

The `RegionManager` tracks `OwnedRegion` instances, each containing:

- Guest physical base address (page-aligned).
- Host physical base address (page-aligned).
- Page count and access permissions (read/write/execute).
- Owning `PartitionId`.
- Current tier state via `RegionTierState`.

Region overlap checks enforce isolation. Guest-physical overlap matters only within a single partition, because each partition has its own stage-2 page tables and different partitions may reuse the same guest-physical ranges. Host-physical overlap across partitions is a critical isolation violation.

### ReconstructionPipeline

Dormant memory is stored as a `CompressedCheckpoint` plus a sequence of `WitnessDelta` entries. Reconstruction proceeds in three steps:

1. **Load checkpoint**: Decompress the snapshot into a caller-provided buffer (LZ4 in production; the current stub uses a simple byte-level algorithm).
2. **Apply deltas**: Replay each `WitnessDelta` in sequence order, writing the recorded data at the specified offset.
3. **Verify hash**: Validate the final state hash (FNV-1a) against the expected value stored in the checkpoint.

If verification fails, the reconstruction is aborted and the region remains dormant. A `RecoveryEnter` witness is emitted on failure.

Each `WitnessDelta` records:
- Witness sequence number (for ordering).
- Byte offset within the region.
- Data length.
- FNV-1a hash of the written data (integrity check per delta).
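
Steps 2 and 3 can be sketched as below, assuming the checkpoint has already been decompressed into `buf`. The 64-bit FNV-1a variant, the `WitnessDelta` field layout (a borrowed data slice instead of a separate length), and the `ReconstructError` type are assumptions; the ADR does not state the hash width or the error types.

```rust
/// Standard 64-bit FNV-1a parameters.
pub const FNV_OFFSET: u64 = 0xcbf2_9ce4_8422_2325;
pub const FNV_PRIME: u64 = 0x0000_0100_0000_01b3;

/// FNV-1a: XOR each byte into the hash, then multiply by the prime.
pub fn fnv1a(data: &[u8]) -> u64 {
    data.iter()
        .fold(FNV_OFFSET, |h, &b| (h ^ b as u64).wrapping_mul(FNV_PRIME))
}

/// One recorded write (hypothetical layout).
pub struct WitnessDelta<'a> {
    pub sequence: u64,
    pub offset: usize,
    pub data: &'a [u8],
    pub data_hash: u64,
}

#[derive(Debug, PartialEq, Eq)]
pub enum ReconstructError {
    OutOfOrder,
    OutOfBounds,
    DeltaHashMismatch,
    StateHashMismatch,
}

/// Replays `deltas` over `buf` in sequence order, then checks the final hash.
/// On error `buf` may be partially updated and should be discarded.
pub fn reconstruct(
    buf: &mut [u8],
    deltas: &[WitnessDelta<'_>],
    expected_state_hash: u64,
) -> Result<(), ReconstructError> {
    let mut last_seq: Option<u64> = None;
    for d in deltas {
        // Sequence numbers must strictly increase.
        if let Some(prev) = last_seq {
            if d.sequence <= prev {
                return Err(ReconstructError::OutOfOrder);
            }
        }
        last_seq = Some(d.sequence);
        // Per-delta integrity check before touching the buffer.
        if fnv1a(d.data) != d.data_hash {
            return Err(ReconstructError::DeltaHashMismatch);
        }
        let end = d
            .offset
            .checked_add(d.data.len())
            .ok_or(ReconstructError::OutOfBounds)?;
        let dst = buf
            .get_mut(d.offset..end)
            .ok_or(ReconstructError::OutOfBounds)?;
        dst.copy_from_slice(d.data);
    }
    if fnv1a(buf) == expected_state_hash {
        Ok(())
    } else {
        Err(ReconstructError::StateHashMismatch)
    }
}
```

The function works entirely in the caller-provided buffer, which matches the zero-allocation constraint. Note that FNV-1a detects accidental corruption only; it is not a cryptographic hash.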

### Address Space Validation

The `validate_region()` function enforces:
- Page alignment of both guest and host base addresses.
- Non-zero page count.
- At least one permission bit set (read, write, or execute).

The `regions_overlap()` and `regions_overlap_host()` functions detect guest-physical and host-physical overlaps respectively, enabling the kernel to reject mappings that would break isolation.
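
The checks above reduce to alignment tests and half-open interval intersection. The sketch below reuses the function names from this section, but the `Region` struct, the permission bit values, and the signatures are assumptions. It also assumes that guest-physical ranges are compared only within one partition, while host-physical ranges are compared regardless of owner.

```rust
pub const PAGE_SIZE: u64 = 4096;
pub const PERM_READ: u8 = 1;
pub const PERM_WRITE: u8 = 2;
pub const PERM_EXEC: u8 = 4;

/// Simplified stand-in for `OwnedRegion`.
#[derive(Clone, Copy)]
pub struct Region {
    pub guest_base: u64,
    pub host_base: u64,
    pub page_count: u64,
    pub perms: u8,
    pub partition: u32,
}

/// Page-aligned bases, non-zero size, at least one permission bit.
pub fn validate_region(r: &Region) -> bool {
    r.guest_base % PAGE_SIZE == 0
        && r.host_base % PAGE_SIZE == 0
        && r.page_count > 0
        && (r.perms & (PERM_READ | PERM_WRITE | PERM_EXEC)) != 0
}

/// Half-open interval intersection: [a, a_end) and [b, b_end).
fn ranges_overlap(a_base: u64, a_pages: u64, b_base: u64, b_pages: u64) -> bool {
    let a_end = a_base.saturating_add(a_pages.saturating_mul(PAGE_SIZE));
    let b_end = b_base.saturating_add(b_pages.saturating_mul(PAGE_SIZE));
    a_base < b_end && b_base < a_end
}

/// Guest-physical overlap, reported only within one partition (assumption).
pub fn regions_overlap(a: &Region, b: &Region) -> bool {
    a.partition == b.partition
        && ranges_overlap(a.guest_base, a.page_count, b.guest_base, b.page_count)
}

/// Host-physical overlap, reported regardless of owning partition.
pub fn regions_overlap_host(a: &Region, b: &Region) -> bool {
    ranges_overlap(a.host_base, a.page_count, b.host_base, b.page_count)
}
```

Adjacent regions do not overlap under the half-open convention: a one-page region at `0x1000` and another at `0x2000` share a boundary but no bytes.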

## Consequences

### Positive

- **Coherence-driven residency** reduces remote memory traffic by keeping pages hot only when graph structure justifies it (target: 20% reduction vs. naive placement).
- **Checkpoint + delta reconstruction** avoids storing full dormant snapshots, reducing cold storage requirements.
- **Zero-allocation design** enables deployment on Seed-class hardware (64 KB - 1 MB RAM).
- **Explicit tier transitions** eliminate demand-paging complexity and make memory behavior deterministic and auditable.

### Negative

- **No demand paging** means a page fault on a demoted region is a hard fault, not a transparent recovery. The kernel must proactively manage promotions.
- **Reconstruction latency** depends on the number of deltas since the last checkpoint. Long-running partitions with infrequent checkpoints may have slow reconstruction.
- **Static DC-1 thresholds** are conservative and may over-demote on hardware where DRAM is abundant.

### Neutral

- The compression stub currently uses a simple byte-level algorithm. Production deployments will use `lz4_flex` or hardware compression; the interface is abstracted behind the `CompressedCheckpoint` type.

## References

- ADR-132: RVM Hypervisor Core (DC-1, DC-6, memory model section)
- ADR-138: Seed Hardware Profile (no_alloc constraints)
- `crates/rvm-memory/src/lib.rs` -- Module root and region validation
- `crates/rvm-memory/src/tier.rs` -- Four-tier model and thresholds
- `crates/rvm-memory/src/allocator.rs` -- BuddyAllocator
- `crates/rvm-memory/src/region.rs` -- OwnedRegion and RegionManager
- `crates/rvm-memory/src/reconstruction.rs` -- ReconstructionPipeline
105 changes: 105 additions & 0 deletions docs/adr/ADR-137-bare-metal-boot-sequence.md
@@ -0,0 +1,105 @@
# ADR-137: Bare-Metal Boot Sequence

**Status:** Accepted
**Date:** 2026-04-05
**Authors:** RuVector Contributors
**Supersedes:** None

---

## Context

RVM boots bare-metal without KVM or Linux. The boot sequence must initialize hardware, establish the capability discipline, create the witness trail, and hand off to the scheduler -- all in a deterministic, witness-gated sequence. A non-deterministic boot makes debugging impossible and prevents measured boot attestation.

The boot sequence has evolved through two iterations:

1. **ADR-137 original (7-phase hardware-centric)**: Reset vector, hardware detect, MMU setup, EL2 entry, kernel object init, first witness, scheduler entry. This maps to the AArch64 hardware bring-up path.
2. **ADR-140 revision (7-phase logical)**: HAL init, memory pool init, capability table init, witness trail init, scheduler init, root partition creation, hand-off. This is the currently implemented sequence.

Both share the principle: each phase is gated by a witness entry and must complete before the next begins.

## Decision

### Boot Phases

The implemented boot sequence uses 7 phases, executed in strict order:

| Phase | Name | What It Does |
|-------|------|-------------|
| 0 | HalInit | Initialize hardware abstraction: timer, MMU, interrupts, UART |
| 1 | MemoryInit | Initialize physical page allocator (BuddyAllocator) |
| 2 | CapabilityInit | Create the root capability table |
| 3 | WitnessInit | Initialize the witness log ring buffer; emit genesis attestation |
| 4 | SchedulerInit | Initialize the scheduler with deadline-based priority |
| 5 | RootPartition | Create the root partition with bootstrap authority |
| 6 | Handoff | Transfer control to the root partition's entry point |

### BootTracker

The `BootTracker` enforces phase ordering. It maintains:

- `current: Option<BootPhase>` -- the phase that must complete next, or `None` if boot is complete.
- `completed: [bool; 7]` -- which phases have finished.

Calling `complete_phase(phase)` succeeds only if `phase` matches `current`. Out-of-order completion returns `RvmError::InternalError`. Attempting to complete a phase after boot is done returns `RvmError::Unsupported`.
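
The ordering rule can be sketched as a small state machine. The phase names, the two fields, and the error variants come from this ADR; the `next()` helper, the constructor, and `is_complete()` are illustrative assumptions about the surrounding API.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum BootPhase {
    HalInit = 0,
    MemoryInit,
    CapabilityInit,
    WitnessInit,
    SchedulerInit,
    RootPartition,
    Handoff,
}

impl BootPhase {
    /// Phase that follows this one, or `None` after `Handoff`.
    fn next(self) -> Option<BootPhase> {
        use BootPhase::*;
        match self {
            HalInit => Some(MemoryInit),
            MemoryInit => Some(CapabilityInit),
            CapabilityInit => Some(WitnessInit),
            WitnessInit => Some(SchedulerInit),
            SchedulerInit => Some(RootPartition),
            RootPartition => Some(Handoff),
            Handoff => None,
        }
    }
}

/// Only the variants named in this section are modeled.
#[derive(Debug, PartialEq, Eq)]
pub enum RvmError {
    InternalError,
    Unsupported,
}

pub struct BootTracker {
    current: Option<BootPhase>,
    completed: [bool; 7],
}

impl BootTracker {
    pub const fn new() -> Self {
        Self { current: Some(BootPhase::HalInit), completed: [false; 7] }
    }

    /// Marks `phase` complete only if it is the phase expected next.
    pub fn complete_phase(&mut self, phase: BootPhase) -> Result<(), RvmError> {
        match self.current {
            // Boot already finished.
            None => Err(RvmError::Unsupported),
            // Out-of-order completion.
            Some(expected) if expected != phase => Err(RvmError::InternalError),
            Some(expected) => {
                self.completed[expected as usize] = true;
                self.current = expected.next();
                Ok(())
            }
        }
    }

    pub fn is_complete(&self) -> bool {
        self.current.is_none()
    }
}
```

A failed call leaves the tracker unchanged, so the boot code can report the error without corrupting the recorded progress.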

### MeasuredBootState

For TPM-style measured boot, `MeasuredBootState` accumulates a hash chain during boot. Each phase extends the measurement with its completion data, producing a boot attestation hash that can be included in the genesis witness record. This enables remote attestation: a verifier can confirm that the boot sequence executed all phases in order with expected firmware and configuration.

### BootSequence

The `BootSequence` type wraps the full boot flow with timing. Each `BootStage` records a `PhaseTiming` (start/end timestamps in nanoseconds) so that boot performance can be profiled. The target is cold boot to first witness in under 250ms on Appliance hardware.

### HAL Abstraction

The `HalInit` trait abstracts hardware initialization behind three configuration structures:

- `MmuConfig` -- page table base, granularity, address space size.
- `InterruptConfig` -- GIC distributor and redistributor addresses.
- `UartConfig` -- serial console base address and baud rate.

A `StubHal` implementation satisfies the trait for testing without real hardware. On AArch64, the real HAL performs EL2 entry, stage-2 page table setup, and GIC initialization.

### Witness Gating

Every phase transition attempts to emit a `BootAttestation` witness record before advancing. In phases 0-2 the witness ring buffer does not exist yet (it is created in phase 3, WitnessInit), so a failed emission is recorded in the measured boot state and boot continues. From phase 3 onward, a witness emission failure is fatal.
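
The failure policy reduces to a single comparison on the phase index, sketched below. The `WitnessFailurePolicy` enum and the function name are hypothetical; only the phase boundary (witness log initialized in phase 3) comes from this ADR.

```rust
/// Index of the WitnessInit phase in the boot table.
pub const WITNESS_INIT_PHASE: u8 = 3;

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum WitnessFailurePolicy {
    /// Record the failure in the measured boot state and keep booting.
    RecordAndContinue,
    /// Abort the boot sequence.
    Fatal,
}

/// Decides how to react when witness emission fails during `phase_index`.
pub fn on_witness_failure(phase_index: u8) -> WitnessFailurePolicy {
    if phase_index < WITNESS_INIT_PHASE {
        // Phases 0-2: the witness log is not available yet.
        WitnessFailurePolicy::RecordAndContinue
    } else {
        WitnessFailurePolicy::Fatal
    }
}
```

Keeping the boundary in one constant makes the rule easy to audit if the phase table is ever renumbered.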

### AArch64 Entry

On AArch64, the reset vector (assembly, <100 LoC) performs:

1. Disable interrupts, set SCTLR to known state.
2. Read MPIDR to determine core ID; park non-primary cores.
3. Zero BSS, set up initial stack pointer.
4. Branch to Rust `_start` which calls `run_boot_sequence()`.

The entry module (`crates/rvm-boot/src/entry.rs`) provides `BootContext` and `run_boot_sequence()` as the Rust-side entry point.

## Consequences

### Positive

- **Deterministic ordering** via `BootTracker` prevents initialization races and ensures every phase completes before its dependents start.
- **Measured boot** enables remote attestation for high-assurance deployments.
- **Phased witness gating** catches boot-time failures early with auditable records.
- **HAL abstraction** allows the same boot sequence to run on QEMU, real AArch64, and future RISC-V targets.

### Negative

- **Strict phase ordering** means phases cannot be parallelized. On multi-core hardware, phases 0-5 run on the primary core only; secondary cores are parked until phase 6 (Handoff).
- **250ms boot target** is aggressive for bare-metal without firmware fast-path optimizations. Phase 1 (memory init) dominates on systems with large physical memory.

### Neutral

- The dual 7-phase numbering (hardware-centric vs. logical) may cause confusion. The implemented logical sequence is canonical; the hardware-centric numbering in the module doc-comments is retained for reference.

## References

- ADR-132: RVM Hypervisor Core (success criterion 1: cold boot < 250ms)
- ADR-134: Witness Schema and Log Format (BootAttestation witness kind)
- `crates/rvm-boot/src/lib.rs` -- Module root, BootPhase enum, BootTracker
- `crates/rvm-boot/src/sequence.rs` -- BootSequence with timing
- `crates/rvm-boot/src/measured.rs` -- MeasuredBootState hash chain
- `crates/rvm-boot/src/hal_init.rs` -- HalInit trait and StubHal
- `crates/rvm-boot/src/entry.rs` -- AArch64 entry point