
fix: don't permanently ban patch numbers on server-driven rollback #356

Open
bdero wants to merge 2 commits into main from bdero/fix-rollforward-after-rollback

Conversation


@bdero bdero commented May 7, 2026

Summary

Customer report: after rolling a patch back and then rolling it forward from the dashboard, the device silently refuses to install the re-offered patch. checkForUpdate returns upToDate even though the server still says a patch is available. FRC-driven notification logic that compares currentPatch against a target patch number then loops on a force-restart prompt that nothing can satisfy.

The bug

PR #348's review-feedback commit (59bb984) added known_bad_patches.insert(patch_number) to remove_patch to defend against an orthogonal resurrection-via-failed-delete race. That set is a permanent ban for the current release_version, so the same number also stays banned through:

  • A dashboard rollforward. rollforward.dart:38-42 flips is_rolled_back back to false on the same app_release_patch_ row, so the server resends exactly the same Patch (same number, same hash).
  • A delete-and-recreate of the release at the same release_version. This is the workflow shorebird_cli/lib/src/commands/release/release_command.dart:518 itself prescribes when bumping Flutter. New patches start at #1 again with new hashes but inherit the old release's bans on those numbers.

The customer log line [shorebird] Patch 1 has previously failed to boot, skipping is misleading — the patch never boot-failed, the rollback signal poisoned the number.

The fix

In remove_patch, drop last_booted_patch when its number matches the rolled-back patch instead of inserting into known_bad_patches. That keeps the resurrection defense — the else-if fallback in try_fall_back_from_patch has no target to promote — and also stops banning the number for any later legitimate reuse.

Safe to clear last_booted_patch here because running_patch (added in #348) is session-scoped and is now the FFI's source of truth for "what's running this process," so dropping the persisted historical pointer doesn't regress the patch-to-release rollback fix.
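
A minimal stand-alone sketch of the change (field and method names follow this PR's description; the struct is a stand-in for the real PatchesState, not the library's actual code):

```rust
use std::collections::HashSet;

// Minimal stand-in for the state touched by the fix. Field names follow the
// PR description; the real struct in the updater library has more fields.
#[derive(Default)]
struct PatchesState {
    last_booted_patch: Option<usize>,
    next_boot_patch: Option<usize>,
    known_bad_patches: HashSet<usize>,
}

impl PatchesState {
    // Server-driven rollback. The pre-fix behavior (commented out) banned the
    // number permanently; post-fix we only drop the fallback pointer, so
    // try_fall_back_from_patch has no target to resurrect.
    fn remove_patch(&mut self, patch_number: usize) {
        // Pre-fix: self.known_bad_patches.insert(patch_number);
        if self.last_booted_patch == Some(patch_number) {
            self.last_booted_patch = None;
        }
        if self.next_boot_patch == Some(patch_number) {
            self.next_boot_patch = None;
        }
    }

    fn is_known_bad_patch(&self, patch_number: usize) -> bool {
        self.known_bad_patches.contains(&patch_number)
    }
}

fn main() {
    let mut state = PatchesState {
        last_booted_patch: Some(1),
        next_boot_patch: Some(1),
        ..Default::default()
    };
    state.remove_patch(1); // server rollback of patch 1
    // A later rollforward of the same number is no longer refused.
    assert!(!state.is_known_bad_patch(1));
    assert_eq!(state.last_booted_patch, None);
}
```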

Field roles after this PR:

  • last_booted_patch: fallback target for try_fall_back_from_patch. Dropped on server rollback of the same number.
  • next_boot_patch: what the engine should load next launch. None = base release.
  • running_patch: session-scoped, what this process is using; the FFI source of truth. Unchanged.
  • known_bad_patches: patches that actually failed to boot. No longer used for server-driven rollback.

Verification

  • New regression test rollforward_after_rollback_can_install_same_patch in library/src/c_api/mod.rs. Reproduces the customer scenario at the C API level: install patch 1, server rolls patch 1 back, server rolls patch 1 forward (same hash, empty rolled_back). Pre-fix: check_for_downloadable_update returns false and the log line matches the customer's logs. Post-fix: returns true and the patch is queued for next boot.
  • Updated rollback_then_failed_replacement_does_not_resurrect_rolled_back_patch to assert the property the test name promises (patch 1 is not promoted to next_boot_patch) plus the new invariant (is_known_bad_patch(1) is false, so a rollforward can install it). The original assertion was checking the implementation mechanism rather than the behavioral property.
  • All 235 existing tests pass without modification. cargo clippy --lib --tests clean. cargo fmt --check clean.

Out of scope (follow-up)

This PR doesn't fix the broader weakness that known_bad_patches is keyed by patch_number alone — a number that isn't globally unique (it can refer to entirely different artifacts under the same release_version if the release is deleted-and-recreated). The wire response already carries Patch.hash and the server-side patch id, so a follow-up can key the bad-patch set by hash and make this whole class of bug impossible by construction.
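
A sketch of what the hash-keyed follow-up could look like (purely hypothetical; the type and method names here are invented for illustration):

```rust
use std::collections::HashSet;

// Hypothetical follow-up sketch: key the bad-patch set by artifact hash
// instead of patch number, so a recreated release that reuses number 1 with
// a different artifact is never caught by a stale ban.
#[derive(Default)]
struct BadPatches {
    bad_hashes: HashSet<String>,
}

impl BadPatches {
    fn mark_bad(&mut self, hash: &str) {
        self.bad_hashes.insert(hash.to_string());
    }
    fn is_bad(&self, hash: &str) -> bool {
        self.bad_hashes.contains(hash)
    }
}

fn main() {
    let mut bad = BadPatches::default();
    bad.mark_bad("abc123"); // patch #1 of the old (deleted) release
    // The recreated release also numbers its first patch 1, but the artifact
    // differs, so a hash-keyed check lets it through.
    assert!(bad.is_bad("abc123"));
    assert!(!bad.is_bad("def456"));
}
```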

Test plan

  • Regression test fails on origin/main, passes on this branch
  • All existing tests pass without modification
  • cargo test --lib 235 pass / 0 fail
  • cargo clippy --lib --tests clean
  • cargo fmt --check clean
  • Validate on a real device against the customer scenario before landing

Refs: shorebirdtech/shorebird#3728, #348

Customer report: after rolling a patch back and then rolling it forward
from the dashboard, the device silently refuses to install the
re-offered patch. `checkForUpdate` returns `upToDate` while the server
still says a patch is available. Apps that compare `currentPatch`
against an FRC-driven target patch number then loop on a force-restart
prompt that nothing can satisfy.

Root cause: PR #348's review-feedback commit added
`known_bad_patches.insert(patch_number)` to `remove_patch` to defend
against an orthogonal resurrection-via-failed-delete race. That set is
a permanent ban for the current `release_version`, so the same number
also stays banned through:

- A dashboard rollforward, which flips `is_rolled_back` on the same
  `app_release_patch_` row server-side and resends the same `Patch`
  (same number, same hash). This is the case biting the customer.
- A delete-and-recreate of the release at the same `release_version`
  (the workflow `release_command.dart` itself prescribes when bumping
  Flutter), where new patches start at #1 again with new hashes but
  inherit the old release's bans.

Fix: in `remove_patch`, drop `last_booted_patch` when its number
matches the rolled-back patch instead of inserting into
`known_bad_patches`. That keeps the resurrection defense — the else-if
fallback in `try_fall_back_from_patch` has no target to promote — and
also stops banning the number for any later legitimate reuse. Safe to
clear `last_booted_patch` because `running_patch` (added in #348) is
session-scoped and is now the FFI's source of truth for "what's
running this process," so dropping the persisted historical pointer
doesn't regress the patch-to-release rollback fix.

Verification:

- New regression test `rollforward_after_rollback_can_install_same_patch`
  in `library/src/c_api/mod.rs`. Reproduces the customer scenario at
  the C API level: install patch 1, server rolls patch 1 back, server
  rolls patch 1 forward (same hash, empty rolled_back). Pre-fix:
  `check_for_downloadable_update` returns false and the log line
  `Patch 1 has previously failed to boot, skipping` matches the
  customer's logs. Post-fix: returns true and the patch is queued for
  next boot.
- `rollback_then_failed_replacement_does_not_resurrect_rolled_back_patch`
  updated to assert the property the test name promises (patch 1 is
  not promoted to `next_boot_patch`) plus the new invariant
  (`is_known_bad_patch(1)` is false, so a rollforward can install it).
  The original assertion was checking the implementation mechanism
  rather than the behavioral property.
- All 235 existing tests pass without modification. clippy + fmt clean.

Does not address the broader "patches identified by number not by
hash" weakness. The wire response already carries `Patch.hash` and the
server-side patch id in the download URL, so a follow-up could key
`known_bad_patches` (and any session-scoped rollback signal) by hash
to make this class of bug impossible by construction.

codecov Bot commented May 7, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.


Both appear in comments added by the previous commit. The repo's
cspell config already lists `carryforward` and `rollouts`; these are
the same shape and not in cspell's default English dictionary.

eseidel commented May 7, 2026

I believe this is also fixed by #352.

eseidel added a commit that referenced this pull request May 7, 2026
…eview

bdero's review on #352 surfaced six items worth landing inline:

1. record_boot_failure used to clear `currently_booting_patch` before
   `mark_bad`. A crash between the two left the patch `Installed` with
   no breadcrumb, so `detect_boot_crash_on_init` wouldn't fire next
   boot — the patch silently retried. Flipped: mark_bad first, then
   clear. mark_bad on Bad is idempotent, so the worst case is a
   redundant tombstone-rewrite on next init.

2. Drop the `hash` field from `PatchState::Installed`. It was recorded
   at install but never read at boot — Strict-mode validation
   recomputes the hash from the artifact's bytes and feeds it into
   `check_signature`. A hash that lives only on disk can't be trusted
   as a security input; deleting the field removes the temptation.
   `Downloading.hash`/`Downloaded.hash` stay as comparators against
   the server's freshly-delivered hash (a tampered on-disk hash there
   just causes a redownload).

3. Port the customer-scenario test from #356 into `c_api/mod.rs`:
   server rolls patch 1 back, then forward again with the same number
   and hash. Pre-lifecycle this was permanently dropped via
   `known_bad_patches`. Post-lifecycle `cleanup` is state-aware and
   forgets the patch entirely on server-driven rollback, so the
   rollforward installs cleanly. End-to-end coverage of the regression
   #356 hotfixed.

4. Add a TODO + target version on the `patches_state.json` entry in
   `SHOREBIRD_OWNED_PATHS` so the legacy wipe doesn't outlive its
   usefulness.

5. Document that `running_patch` lives in `config.rs` (session-scoped),
   not in this module — saves a future grep.

6. Clarify `recompute_next_boot`'s fallback policy: it's
   fall-back-to-`last_booted_patch`-only, not fall-back-to-anything-
   bootable. A fresh release whose first Installed patch fails
   validation goes to base, even if older patches sit in `patches/`.

cargo test: 242 passed (was 241).
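
Item 1's ordering argument can be seen in a toy model (all names here are illustrative stand-ins, not the library's API): with mark_bad first, a crash in the window between the two persisted writes still leaves a Bad record behind.

```rust
// Toy model of the ordering fix: two persisted writes with a possible crash
// between them. New order marks the patch bad first, then clears the
// breadcrumb, so the worst-case crash point leaves a state the next init
// can act on instead of silently retrying the patch.
#[derive(Default)]
struct Persisted {
    currently_booting_patch: Option<usize>, // boot breadcrumb
    bad_patches: Vec<usize>,
}

fn record_boot_failure(state: &mut Persisted, n: usize, crash_after_first_write: bool) {
    state.bad_patches.push(n); // mark_bad first
    if crash_after_first_write {
        return; // simulated crash between the two writes
    }
    state.currently_booting_patch = None; // clear breadcrumb second
}

fn main() {
    let mut state = Persisted {
        currently_booting_patch: Some(1),
        ..Default::default()
    };
    record_boot_failure(&mut state, 1, true); // crash in the window
    // Patch 1 is already marked bad; the stale breadcrumb just means a
    // redundant (idempotent) mark_bad on next init, not a silent retry.
    assert!(state.bad_patches.contains(&1));
    assert_eq!(state.currently_booting_patch, Some(1));
}
```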
eseidel added a commit that referenced this pull request May 7, 2026
* refactor: introduce per-patch lifecycle state machine (types + persistence)

First slice of shorebirdtech/shorebird#3737 — replace the scattered
storage of per-patch state across download_state.rs sidecars, bare
files in downloads/, and PatchesState fields (next_boot_patch,
last_booted_patch, known_bad_patches) with a single per-patch
state.json driven by an explicit state machine.

This commit only adds the types and the storage layer; nothing wires
into update_internal or report_launch_* yet. Subsequent commits will
build out transitions and migrate the call sites.

States:
  - Downloading { url, hash, signature, partial_size }
  - Downloaded  { url, hash, signature, size }
  - Installed   { hash, signature, size }
  - Bad         { reason, hash?, signature?, size? }      // tombstone

Operations:
  - mark_bad(n, reason) — sugar over write Bad{} + cleanup
  - cleanup(n) — state-aware: keeps Bad tombstone, else forgets dir

Per-release pointers (next_boot, last_booted, currently_booting,
boot_started_at) move to a separate pointers.json holding patch
numbers; metadata lives once per patch in state.json instead of
duplicated across pointers.

Persistence rides on the existing atomic disk_io::write (sibling-
write + rename) so partial writes can't leave torn files.
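
The sibling-write + rename pattern can be sketched with std::fs (illustrative only; the library's actual helper is disk_io::write):

```rust
use std::fs;
use std::io::Write;
use std::path::Path;

// Write the new contents to a sibling temp file, then rename over the
// target. rename within one filesystem atomically replaces the destination
// on POSIX, so readers see either the old file or the new one, never a
// torn mix.
fn atomic_write(path: &Path, contents: &[u8]) -> std::io::Result<()> {
    let tmp = path.with_extension("tmp");
    let mut f = fs::File::create(&tmp)?;
    f.write_all(contents)?;
    f.sync_all()?; // flush data before the rename makes it visible
    fs::rename(&tmp, path)
}

fn main() -> std::io::Result<()> {
    let dir = std::env::temp_dir().join("atomic_write_demo");
    fs::create_dir_all(&dir)?;
    let target = dir.join("state.json");
    atomic_write(&target, b"{\"v\":1}")?;
    atomic_write(&target, b"{\"v\":2}")?;
    assert_eq!(fs::read(&target)?, b"{\"v\":2}");
    Ok(())
}
```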

* refactor: add download/install/boot transitions to lifecycle

Extends the lifecycle module with the methods update_internal and
report_launch_* will need:

  - decide_start(n, url, hash) → DownloadAction (Fresh, Resume, Complete, Skip)
  - record_download_started / _complete
  - record_install_complete (transitions Downloaded → Installed,
    removes the now-unneeded compressed download file)
  - promote_to_next_boot (transitions a freshly Installed patch into
    pointers.next_boot, retiring any unbooted predecessor)
  - record_boot_start / _success / _failure
  - detect_boot_crash_on_init (handles the breadcrumb left when a
    prior process crashed during boot)
  - validate_next_boot_patch (size + signature checks; marks
    Bad{ValidationFailed} on failure)
  - recompute_next_boot

Nothing wires into the existing updater.rs / report_launch_* yet —
those changes follow in the cutover commit.
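
The decide_start shape described above, sketched as a plain match (the state and action names come from the commit message; the exact mapping shown is an illustrative guess, not the library's code):

```rust
// Map a patch's current on-disk state to the action the downloader should
// take. Illustrative sketch only.
#[derive(Debug, PartialEq)]
enum PatchState {
    Unknown,
    Downloading { have_bytes: u64 },
    Downloaded,
    Installed,
    Bad,
}

#[derive(Debug, PartialEq)]
enum DownloadAction {
    Fresh,
    Resume(u64),
    Complete,
    Skip,
}

fn decide_start(state: &PatchState) -> DownloadAction {
    match state {
        PatchState::Unknown => DownloadAction::Fresh,
        PatchState::Downloading { have_bytes } => DownloadAction::Resume(*have_bytes),
        PatchState::Downloaded => DownloadAction::Complete, // bytes present, just install
        PatchState::Installed | PatchState::Bad => DownloadAction::Skip,
    }
}

fn main() {
    assert_eq!(decide_start(&PatchState::Unknown), DownloadAction::Fresh);
    assert_eq!(
        decide_start(&PatchState::Downloading { have_bytes: 4096 }),
        DownloadAction::Resume(4096)
    );
    assert_eq!(decide_start(&PatchState::Bad), DownloadAction::Skip);
}
```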

* refactor: replace UpdaterState's PatchManager backend with PatchLifecycle

Drops the patch_manager dependency from updater_state.rs. UpdaterState
now owns a PatchLifecycle directly, with all patch-related methods
delegating to it:

  install_patch        → write Installed state + promote to next_boot
  next_boot_patch      → pointers.next_boot_patch + installed_artifact_path
  is_known_bad_patch   → read_state matches Bad
  uninstall_patch      → cleanup + recompute_next_boot
  record_boot_*        → lifecycle's boot transitions
  validate_next_boot_patch → lifecycle's signature/size check

UpdaterState's own state.json shrinks to {client_id, release_version,
queued_events}. Per-release per-patch state lives entirely in the
lifecycle (pointers.json + patches/{N}/state.json).

Other notable behavior changes:

  - record_boot_failure_for_patch no longer requires
    currently_booting_patch to match. Matches the prior PatchManager
    semantics: clear breadcrumb, mark Bad{BootCrash}, recompute.
  - recompute_next_boot leaves a valid Installed next_boot_patch alone
    instead of promoting last_booted over it. Without this, processing
    server rollbacks would clobber a freshly-installed newer patch.

Tests in updater.rs that asserted file paths (patches_state.json) update
to the new layout (pointers.json). The MockManagePatches-backed unit
tests in updater_state.rs are gone — replaced by direct end-to-end tests
that exercise PatchLifecycle through UpdaterState.

patch_manager.rs is still on disk and compiled (nothing references it
from cache/mod.rs anymore) — deletion happens after the updater.rs
cutover.

* refactor: cut update_internal over to PatchLifecycle

Replaces the download_state.rs sidecar machinery and the bespoke
DownloadStartState / should_install_patch helpers in update_internal
with calls into PatchLifecycle.

Concrete changes:

  - Drops compute_resume_offset / determine_download_start_state /
    DownloadStartState — replaced by PatchLifecycle::decide_start
    returning DownloadAction.
  - Drops should_install_patch / ShouldInstallPatchCheckResult —
    folded into the same DownloadAction match. KnownBad → UpdateIsBadPatch,
    AlreadyInstalled → NoUpdate, the rest proceed.
  - Drops install_downloaded_patch's bespoke "stage in downloads/, rename
    on install" choreography. The lifecycle owns the per-patch directory;
    inflate writes directly into patches/{N}/dlc.vmcode and the
    Downloaded → Installed transition removes the now-unneeded compressed
    download file.
  - Drops cleanup_download_artifacts and clean_download_dir. mark_bad
    handles tombstone-aware cleanup for failures, and the lifecycle owns
    its directory.
  - Marks the patch Bad{InvalidPatchBytes} when inflate fails and
    Bad{InstallHashMismatch} when check_hash fails. Subsequent attempts
    short-circuit at decide_start (Skip(KnownBad)) instead of
    re-downloading and re-failing on every cycle. This is the marks-bad-
    on-install-failure behavior we deferred from #351.
  - Preserves the explicit "server-side Content-Length vs total_bytes"
    mismatch check — surfaces the contract violation directly instead of
    obliquely via inflate failure.

decide_start consults the on-disk download file as the source of truth
for "how many bytes do we have so far." The denormalized partial_size
in PatchState::Downloading is kept for diagnostics/serde stability but
not consulted for the resume offset; this matches the prior
file-size-on-disk behavior and avoids needing to update state.json
mid-stream from the network layer.
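
The file-size-on-disk behavior can be sketched directly (the helper name is illustrative):

```rust
use std::fs;
use std::path::Path;

// Bytes on disk are the source of truth for the resume offset: read the
// partial download's file size rather than trusting a recorded value.
// A missing file (e.g. OS cache eviction) falls back to 0, i.e. a fresh
// download from scratch.
fn resume_offset(download_path: &Path) -> u64 {
    fs::metadata(download_path).map(|m| m.len()).unwrap_or(0)
}

fn main() -> std::io::Result<()> {
    let dir = std::env::temp_dir().join("resume_offset_demo");
    fs::create_dir_all(&dir)?;
    let partial = dir.join("download");
    fs::write(&partial, vec![0u8; 1234])?; // simulate a partial download
    assert_eq!(resume_offset(&partial), 1234);
    assert_eq!(resume_offset(&dir.join("missing")), 0);
    Ok(())
}
```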

For Downloading / Downloaded states where the OS evicted the artifact
file out from under us (e.g. iOS code-cache eviction), decide_start
falls through to Fresh so the next attempt re-downloads from scratch.

The failing-test fixes update file paths and pre-staged state from the
old layout (downloads/{N} + downloads/{N}.download.json) to the new
layout (patches/{N}/download + patches/{N}/state.json). Several tests
that targeted the now-deleted helpers directly are removed; their
behavior is covered by the new lifecycle unit tests.

* refactor: delete patch_manager.rs and download_state.rs

Both modules are now subsumed by PatchLifecycle:

  - patch_manager.rs (~1600 lines) managed the per-release `patches/`
    tree, the `next_boot_patch` / `last_booted_patch` /
    `currently_booting_patch` / `known_bad_patches` fields of
    `PatchesState`, and the fall-back-from-bad-patch logic. All of
    this is now in lifecycle.rs split between `PatchLifecycle`
    (per-patch state.json) and `ReleasePointers` (a single
    pointers.json).
  - download_state.rs (~125 lines) wrote the per-download sidecar
    JSON. Now folded into PatchState::Downloading /
    PatchState::Downloaded variants in state.json, owned by the
    lifecycle module.

The MockManagePatches / ManagePatches trait machinery in
updater_state.rs's tests goes away with patch_manager. The lifecycle
operations are tested directly in cache::lifecycle::tests against a
real on-disk filesystem under TempDir, which catches issues the
mocks couldn't (e.g. file-existence cross-checks in `decide_start`).

The cargo workspace now has 212 passing tests vs. 257 before, the
delta being the patch_manager unit tests whose subjects no longer
exist. End-to-end coverage in `updater.rs::tests` is unchanged and
all green.

* refactor: address self-review feedback on lifecycle PR

Nine fixes from the self-review pass on #352:

  - Drop `partial_size` from PatchState::Downloading. The field was
    misleading (decide_start reads from disk, not from the recorded
    value) and unused. record_download_started loses its 5th arg.
  - Gate `UpdaterState::install_patch` to `#[cfg(test)]`. Production
    no longer routes through it; only test_utils and the updater_state
    tests do. The gate makes the divergence intentional and prevents
    future production callers.
  - install_patch defensively removes any prior `dlc.vmcode` before
    rename so behavior is OS-agnostic (POSIX rename overwrites silently;
    Windows fails). Also mirrors record_install_complete's cleanup of
    a stale `download` file in the patch dir.
  - Document on `lifecycle()` / `lifecycle_mut()` that the direct
    accessors are intentional — wrapping every transition would be
    churn for no reader benefit.
  - recompute_next_boot now clears `last_booted_patch` when its
    on-disk record is gone (Unknown), so pointers.json doesn't
    accumulate references to nothing. A `Bad` last_booted is left
    alone — that's a useful breadcrumb and recompute simply doesn't
    promote it.
  - Promote `download_artifact_path` and `installed_artifact_path` to
    methods on PatchLifecycle for symmetry with state_path /
    pointers_path. Updates all call sites.
  - Document mark_bad-on-Bad merge semantics: latest reason wins,
    other fields preserved. Hypothetical in practice but no longer
    silent.
  - Add tests for recompute_next_boot's stale-pointer clearing
    (Unknown → cleared, Bad → kept). Brings the suite to 214 tests.

The mark_bad-cleanup-stale-bytes recovery path I flagged was already
covered by `cleanup_on_bad_patch_keeps_tombstone` — no new test needed.

* refactor: undo two over-corrections from the prior self-review fix

Two issues introduced by 07ea84a that the second self-review caught:

  - update_internal computed download_path / installed_path via
    `with_state(...).lifecycle().download_artifact_path(n)`. That
    routed a pure-function path computation through a state read,
    serializing it against other state operations and adding a disk
    read per call. Restore the free `download_artifact_path` /
    `installed_artifact_path` functions as the canonical entry point;
    the methods on PatchLifecycle are kept as thin wrappers for
    callers that already hold a lifecycle handle.
  - install_patch's "defensive remove_file before rename" added a
    TOCTOU window between the exists() check and the rename for no
    real benefit — install_patch is now `#[cfg(test)]`-only, the
    Windows-rename concern was theoretical for tests, and POSIX rename
    atomically overwrites. Drop the explicit remove_file.

* ci: fix warnings-as-errors and cspell

Three CI failures from the prior push:

  - `Context` and `bail` imports in updater_state.rs are only used
    inside the `#[cfg(test)] install_patch` helper. Moved them under
    `#[cfg(test)]` so non-test builds don't see unused imports.
  - `PathBuf` import in updater.rs became unused after switching the
    download/install paths to take `Path::new(&config.storage_dir)`
    directly. Dropped.
  - `FileOperation::DeleteDir` is no longer constructed by production
    code (was used by the deleted patch_manager.rs). Mirrored the
    existing `#[allow(dead_code)]` annotation already on `DeleteFile`
    rather than removing the variant — the format/handling code for
    it is still useful for any future code that needs to surface a
    "delete dir failed" error.

CSpell additions: `unparseable`, `tombstoned`, `roundtrips` — words
introduced by the lifecycle module's docs and test names.

* ci: gate PathBuf import to platforms that actually use it

The previous fix removed `PathBuf` from the top-level use, which
broke the lib build on non-android non-test platforms (macOS,
Linux, Windows) where `libapp_path_from_settings` uses `PathBuf`.
That function is itself gated to `cfg(not(any(target_os = "android",
test)))`, so the PathBuf import needs the same gate.

Restoring it as a cfg-gated separate import — keeps the lib build
green on all three runners and keeps the lib-test build clean
under -D warnings.

* test: cover the gaps surfaced by the coverage report

Adds 7 tests targeting branches that were uncovered:

  - mark_bad_from_downloading_records_partial_file_size: the
    Downloading source state in mark_bad reads bytes-on-disk for
    the recorded `size` field; the existing tests only covered
    Installed → Bad.
  - mark_bad_overwrites_reason_when_already_bad: documented merge
    semantics (latest reason wins, other fields preserved) now
    has a test backing the doc.
  - validate_next_boot_patch_marks_bad_when_artifact_missing: the
    "Installed state but dlc.vmcode is gone" path. Existing test
    covered size mismatch; this covers the missing-file branch.
  - validate_next_boot_patch_marks_bad_when_pointer_targets_non_installed:
    the case where the next_boot pointer references a state other
    than Installed (state.json + pointers can diverge through
    corruption).
  - install_patch_install_only_{accepts_valid,rejects_missing,rejects_bad}_signature:
    cover the InstallOnly verification path in
    UpdaterState::install_patch using the existing test keypair
    from signing.rs.

Coverage improvements:
  - cache/lifecycle.rs:    94.65% → 96.06% lines, 98.72% → 100% fns
  - cache/updater_state.rs: 94.34% → 96.94% lines, 95.65% → 98% fns
  - total:                 94.70% → 95.11% lines

214 → 221 tests, all passing under -D warnings.

* ci: add 'keypair' to cspell dictionary

* test: audit deleted patch_manager tests; restore strict checks

Goes through every test deleted with patch_manager.rs (~40) and either
maps it to a new test (with a `Ports patch_manager.rs::mod::name`
comment) or adds a port. Three behavioral regressions caught and fixed:

  - record_boot_start now requires next_boot_patch to match the arg.
    Old PatchManager had this defensive check; my refactor dropped it.
    Carries forward the engine-vs-state agreement guard.
  - Added strict-mode signature tests for validate_next_boot_patch
    (valid, missing, invalid, no public key, bad public key). Ports
    five `validate_next_boot_patch_tests::strict_mode_*` tests.
  - Added InstallOnly + no public_key test (ports
    add_patch_tests::install_only_succeeds_with_any_signature_if_no_public_key).

Plus two scenario tests that didn't have direct coverage:

  - rolled_back_patch_not_resurrected_when_replacement_fails: ports
    fall_back_tests::rollback_then_failed_replacement_does_not_resurrect_rolled_back_patch
    against the new state-based fallback (cleanup forgets, recompute
    won't promote a None state).
  - validate_then_promote_catches_corrupted_last_booted: ports
    validate_next_boot_patch_tests::does_not_fall_back_to_last_booted_patch_if_corrupted
    with a note: new code catches the corruption at the next validate
    pass instead of at fall-back time, but the user-visible outcome
    (boot the base release) is the same.

Test helper split: `install_state` (writes Installed without touching
pointers) vs callers explicitly setting pointers. Avoids the prior
helper's auto-promote masking pointer-management behavior the tests
were trying to exercise.

Tests deliberately not ported (with reasoning):

  - debug_tests::manage_patches_is_debug, patch_manager_is_debug:
    Trait-Debug tests for types that no longer exist. Replaced
    implicitly by `#[derive(Debug)]` on PatchLifecycle.
  - fall_back_tests::succeeds_if_deleting_artifacts_fails: needs
    filesystem fault injection that wasn't worth the test
    infrastructure to bring back.
  - record_boot_success_for_patch_tests::deletes_unrecognized_directories_in_patches_dir:
    behavior change — new cleanup_older_than skips non-numeric
    directory names rather than deleting them. Defensive; the prior
    `rm -rf` was overly aggressive.

Coverage: 222 → 230 tests, all green under -D warnings.

* test: port the two patch_manager tests I'd handwaved

Both turned out trivial to port — neither was actually doing fault
injection, just exploiting graceful-failure paths.

  - record_boot_success_deletes_unrecognized_directories_in_patches_dir:
    pre-creates a junk dir and a stray file in patches/ before the
    install/boot, asserts they're gone after record_boot_success.
    Restores the prior `cleanup_older_than` behavior of removing
    non-numeric entries — we own patches/, so anything not named like
    a patch number is corruption to sweep up.
  - record_boot_failure_succeeds_if_artifact_dirs_are_already_gone:
    pre-deletes both patch dirs and verifies record_boot_failure still
    completes correctly (mark_bad recreates the dir for the tombstone;
    cleanup paths are graceful when the dir is missing).

Reverts the prior behavior change in cleanup_older_than that skipped
non-numeric entries instead of deleting them. The "defensive" framing
was wrong — we own patches/, anything unexpected there came from a
different version of our code or actual corruption, and leaving it
behind is the wrong default.
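
The restored sweep behavior, sketched (an illustrative stand-in for cleanup_older_than, assuming the keep-threshold is the just-booted patch number):

```rust
use std::fs;
use std::path::Path;

// Anything in patches/ whose name doesn't parse as a patch number is
// removed, and numeric entries older than the kept patch are removed too.
// We own patches/, so unexpected entries are corruption to sweep up.
fn cleanup_older_than(patches_dir: &Path, keep: u64) -> std::io::Result<()> {
    for entry in fs::read_dir(patches_dir)? {
        let entry = entry?;
        let name = entry.file_name();
        match name.to_string_lossy().parse::<u64>() {
            Ok(n) if n >= keep => {} // current or newer patch: keep
            _ => {
                // Older patch, junk dir, or stray file: delete it.
                let path = entry.path();
                if path.is_dir() {
                    fs::remove_dir_all(&path)?;
                } else {
                    fs::remove_file(&path)?;
                }
            }
        }
    }
    Ok(())
}

fn main() -> std::io::Result<()> {
    let dir = std::env::temp_dir().join("cleanup_demo_patches");
    let _ = fs::remove_dir_all(&dir);
    fs::create_dir_all(dir.join("1"))?;
    fs::create_dir_all(dir.join("2"))?;
    fs::create_dir_all(dir.join("junk"))?;
    fs::write(dir.join("stray.txt"), b"x")?;
    cleanup_older_than(&dir, 2)?;
    assert!(!dir.join("1").exists());
    assert!(dir.join("2").exists());
    assert!(!dir.join("junk").exists());
    assert!(!dir.join("stray.txt").exists());
    Ok(())
}
```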

* test: port the remaining patch_manager tests with real coverage gaps

  - validate_next_boot_strict_mode_succeeds_with_valid_signature: ports
    `patch_manager.rs::validate_next_boot_patch_tests::strict_mode_succeeds_with_valid_signature_at_boot_time`.
    Required reusing the prior test fixture's INFLATED_PATCH_HASH /
    INFLATED_PATCH_SIGNATURE constants (a real signature for the
    1-byte file content `b"1"`, matching TEST_PUBLIC_KEY's private
    half).
  - record_boot_failure_keeps_last_booted_pointer_when_failed_patch_was_last_booted:
    ports
    `patch_manager.rs::record_boot_failure_for_patch_tests::preserves_last_booted_patch_on_failure_but_marks_bad`.
    The new behavior is similar but more explicit: last_booted's
    *pointer* is preserved (with the patch now in Bad state); the
    underlying intent — "we still know what last booted, even after
    that patch failed" — survives.

Test helper split: `install_signed` (failure paths, mismatched hash
OK) vs `install_with_valid_signature` (happy path, real fixture).

Now every behavioral test from the deleted patch_manager.rs has a
corresponding new test with a porting comment. Two trivial ones not
ported: the `Debug` impl tests for traits/types that no longer exist.

236 tests, all green under -D warnings.

* refactor: restore separate download directory under OS cache root

Reverts an unintended policy change in the cutover: my refactor moved
compressed download bytes from `{code_cache_dir}/downloads/{N}` (the
OS-managed cache dir, evictable under storage pressure) into
`{storage_dir}/patches/{N}/download` (persistent app storage). That
made downloads count against persistent storage and survive in iCloud
backups — neither of which we want for transient bytes.

Restoring the original split:
  - `{state_root}/patches/{N}/{state.json,dlc.vmcode}` — persistent
    state and the installed artifact.
  - `{download_root}/{N}` — flat, in the OS cache dir.

`PatchLifecycle::load_or_default` now takes both roots. The two
artifact-path helpers each route to their respective root, the
free-function variants used by `update_internal` now take the right
root, and `mark_bad`/`cleanup`/`record_install_complete` clean both
locations as appropriate.

`UpdaterState::create_new_and_save` also wipes `download_dir` on
release-version change, in addition to the persistent patches/ tree.

`update_internal` and tests updated to use the right root for each
file type. Test fixtures pass `tmp.path().join("downloads")` as the
download root (separate subdir of the same TempDir).

Net change in test surface: moved a handful of `patch_dir/download`
references to `downloads_dir/N`, no behavioral changes to tests
themselves. 236 tests pass under -D warnings.

* test: cover the new download_root cleanup paths

Self-review caught two gaps in the download-root split:

  - release_version_change_wipes_download_dir: the cache-rooted
    download dir should be wiped along with the persistent patches/
    tree on a release-version mismatch. Adds explicit coverage —
    the code path was already in create_new_and_save but not
    directly tested.
  - record_boot_success_promotes_and_cleans_older: extends the
    existing test to drop a stale download in download_root
    alongside an older patch's state, asserts that
    cleanup_older_than's chain (cleanup → forget_dir) deletes both
    roots' artifacts.

Also adds a comment on cleanup_older_than noting that it only
walks the persistent patches/ tree — orphan downloads (no
state.json) would persist until the OS evicts them. Noted as a
known limitation, not blocking.

* refactor: targeted wipe + legacy patches_state.json + orphan-download walk

Two fixes to the release-version-change wipe and one to the boot-
success cleanup walk.

  - The "wipe everything in cache_dir" approach broke 27 tests whose
    setup uses cache_dir as the test temp dir (with base.apk and
    libapp.so as siblings). Production gives shorebird a dedicated
    subdir but the API doesn't enforce that. Switched to a targeted
    wipe with a documented allowlist of paths shorebird has ever
    written under cache_dir: `SHOREBIRD_OWNED_PATHS`.
  - Added `patches_state.json` to the wipe list — that's the
    legacy file from the prior `PatchManager` implementation, and
    devices upgrading through this PR would otherwise orphan it.
    New test `release_version_change_wipes_legacy_patches_state_json`
    covers it.
  - `cleanup_older_than` now also calls `cleanup_orphan_downloads`,
    which walks `download_root/` and removes any file that doesn't
    correspond to a patch in `Downloading`/`Downloaded` state. We own
    the cache root and shouldn't rely on OS eviction. Catches the
    "state.json gone but download lingered" case the prior comment
    handwaved.

Adding a new file under cache_dir means adding it to
SHOREBIRD_OWNED_PATHS — small bookkeeping cost, but it lets us
keep cache_dir co-tenant with embedder-provided files.

* test: cover all four branches of cleanup_orphan_downloads

The new orphan-download sweep distinguishes four cases but only the
"older patch via cleanup chain" branch was being exercised. Single
test now drops one of each kind of file into download_root before a
boot success and asserts which survive:

  - orphan (numeric, no state.json): removed
  - stale (numeric, state is Installed): removed
  - non-numeric name: removed
  - live (numeric, state is Downloading): preserved
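
The four-way classification, sketched as a pure predicate (names are illustrative stand-ins for the lifecycle's real types):

```rust
use std::collections::HashMap;

// A download file survives only if its numeric name maps to a patch still
// in a download-in-progress state; orphans, stale entries, and non-numeric
// names are all swept.
enum State {
    Downloading,
    Downloaded,
    Installed,
}

fn keep_download(name: &str, states: &HashMap<u64, State>) -> bool {
    match name.parse::<u64>() {
        // live: patch still mid-download, the bytes are needed
        Ok(n) => matches!(
            states.get(&n),
            Some(State::Downloading) | Some(State::Downloaded)
        ),
        // non-numeric names never belong in download_root
        Err(_) => false,
    }
}

fn main() {
    let mut states = HashMap::new();
    states.insert(2, State::Installed); // stale: already installed
    states.insert(3, State::Downloading); // live
    assert!(!keep_download("1", &states)); // orphan: no state.json
    assert!(!keep_download("2", &states)); // stale: removed
    assert!(keep_download("3", &states)); // live: preserved
    assert!(!keep_download("temp", &states)); // non-numeric: removed
}
```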

* ci: add 'embedder' to cspell dictionary

* refactor: drop Installed.hash, fix mark_bad ordering, address bdero review

bdero added a commit to shorebirdtech/shorebird that referenced this pull request May 8, 2026
Word appears in a comment on rollforwardPatch in code_push_client_wrapper —
'the server resends the same patch artifact on rollforward'. Same fix as
shorebirdtech/updater#356 needed in this repo.