
fix: don't permanently ban patch numbers on server-driven rollback #356

Open
bdero wants to merge 2 commits into main from bdero/fix-rollforward-after-rollback

Conversation


@bdero bdero commented May 7, 2026

Summary

Customer report: after rolling a patch back and then rolling it forward from the dashboard, the device silently refuses to install the re-offered patch. checkForUpdate returns upToDate even though the server still says a patch is available. FRC-driven notification logic that compares currentPatch against a target patch number then loops on a force-restart prompt that nothing can satisfy.

The bug

PR #348's review-feedback commit (59bb984) added known_bad_patches.insert(patch_number) to remove_patch to defend against an orthogonal resurrection-via-failed-delete race. That set is a permanent ban for the current release_version, so the same number also stays banned through:

  • A dashboard rollforward. rollforward.dart:38-42 flips is_rolled_back back to false on the same app_release_patch_ row, so the server resends exactly the same Patch (same number, same hash).
  • A delete-and-recreate of the release at the same release_version. This is the workflow shorebird_cli/lib/src/commands/release/release_command.dart:518 itself prescribes when bumping Flutter. New patches start at #1 again with new hashes but inherit the old release's bans on those numbers.

The customer log line [shorebird] Patch 1 has previously failed to boot, skipping is misleading — the patch never boot-failed, the rollback signal poisoned the number.

The fix

In remove_patch, drop last_booted_patch when its number matches the rolled-back patch instead of inserting into known_bad_patches. That keeps the resurrection defense — the else-if fallback in try_fall_back_from_patch has no target to promote — and also stops banning the number for any later legitimate reuse.

Safe to clear last_booted_patch here because running_patch (added in #348) is session-scoped and is now the FFI's source of truth for "what's running this process," so dropping the persisted historical pointer doesn't regress the patch-to-release rollback fix.
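
A minimal stand-alone sketch of the change (field and method names follow this PR's description; the struct is a stand-in for the real PatchesState, not the library's actual code):

```rust
use std::collections::HashSet;

// Minimal stand-in for the state touched by the fix. Field names follow the
// PR description; the real struct in the updater library has more fields.
#[derive(Default)]
struct PatchesState {
    last_booted_patch: Option<usize>,
    next_boot_patch: Option<usize>,
    known_bad_patches: HashSet<usize>,
}

impl PatchesState {
    // Server-driven rollback. The pre-fix behavior (commented out) banned the
    // number permanently; post-fix we only drop the fallback pointer, so
    // try_fall_back_from_patch has no target to resurrect.
    fn remove_patch(&mut self, patch_number: usize) {
        // Pre-fix: self.known_bad_patches.insert(patch_number);
        if self.last_booted_patch == Some(patch_number) {
            self.last_booted_patch = None;
        }
        if self.next_boot_patch == Some(patch_number) {
            self.next_boot_patch = None;
        }
    }

    fn is_known_bad_patch(&self, patch_number: usize) -> bool {
        self.known_bad_patches.contains(&patch_number)
    }
}

fn main() {
    let mut state = PatchesState {
        last_booted_patch: Some(1),
        next_boot_patch: Some(1),
        ..Default::default()
    };
    state.remove_patch(1); // server rollback of patch 1
    // A later rollforward of the same number is no longer refused.
    assert!(!state.is_known_bad_patch(1));
    assert_eq!(state.last_booted_patch, None);
}
```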

Field roles after this PR:

  • last_booted_patch: fallback target for try_fall_back_from_patch. Dropped on server rollback of the same number.
  • next_boot_patch: what the engine should load next launch. None = base release.
  • running_patch: session-scoped, what this process is using; the FFI source of truth. Unchanged.
  • known_bad_patches: patches that actually failed to boot. No longer used for server-driven rollback.

Verification

  • New regression test rollforward_after_rollback_can_install_same_patch in library/src/c_api/mod.rs. Reproduces the customer scenario at the C API level: install patch 1, server rolls patch 1 back, server rolls patch 1 forward (same hash, empty rolled_back). Pre-fix: check_for_downloadable_update returns false and the log line matches the customer's logs. Post-fix: returns true and the patch is queued for next boot.
  • Updated rollback_then_failed_replacement_does_not_resurrect_rolled_back_patch to assert the property the test name promises (patch 1 is not promoted to next_boot_patch) plus the new invariant (is_known_bad_patch(1) is false, so a rollforward can install it). The original assertion was checking the implementation mechanism rather than the behavioral property.
  • All 235 existing tests pass without modification. cargo clippy --lib --tests clean. cargo fmt --check clean.

Out of scope (follow-up)

This PR doesn't fix the broader weakness that known_bad_patches is keyed by patch_number alone — a number that isn't globally unique (it can refer to entirely different artifacts under the same release_version if the release is deleted-and-recreated). The wire response already carries Patch.hash and the server-side patch id, so a follow-up can key the bad-patch set by hash and make this whole class of bug impossible by construction.
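
A sketch of what the hash-keyed follow-up could look like (purely hypothetical; the type and method names here are invented for illustration):

```rust
use std::collections::HashSet;

// Hypothetical follow-up sketch: key the bad-patch set by artifact hash
// instead of patch number, so a recreated release that reuses number 1 with
// a different artifact is never caught by a stale ban.
#[derive(Default)]
struct BadPatches {
    bad_hashes: HashSet<String>,
}

impl BadPatches {
    fn mark_bad(&mut self, hash: &str) {
        self.bad_hashes.insert(hash.to_string());
    }
    fn is_bad(&self, hash: &str) -> bool {
        self.bad_hashes.contains(hash)
    }
}

fn main() {
    let mut bad = BadPatches::default();
    bad.mark_bad("abc123"); // patch #1 of the old (deleted) release
    // The recreated release also numbers its first patch 1, but the artifact
    // differs, so a hash-keyed check lets it through.
    assert!(bad.is_bad("abc123"));
    assert!(!bad.is_bad("def456"));
}
```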

Test plan

  • Regression test fails on origin/main, passes on this branch
  • All existing tests pass without modification
  • cargo test --lib 235 pass / 0 fail
  • cargo clippy --lib --tests clean
  • cargo fmt --check clean
  • Validate on a real device against the customer scenario before landing

Refs: shorebirdtech/shorebird#3728, #348

Customer report: after rolling a patch back and then rolling it forward
from the dashboard, the device silently refuses to install the
re-offered patch. `checkForUpdate` returns `upToDate` while the server
still says a patch is available. Apps that compare `currentPatch`
against an FRC-driven target patch number then loop on a force-restart
prompt that nothing can satisfy.

Root cause: PR #348's review-feedback commit added
`known_bad_patches.insert(patch_number)` to `remove_patch` to defend
against an orthogonal resurrection-via-failed-delete race. That set is
a permanent ban for the current `release_version`, so the same number
also stays banned through:

- A dashboard rollforward, which flips `is_rolled_back` on the same
  `app_release_patch_` row server-side and resends the same `Patch`
  (same number, same hash). This is the case biting the customer.
- A delete-and-recreate of the release at the same `release_version`
  (the workflow `release_command.dart` itself prescribes when bumping
  Flutter), where new patches start at #1 again with new hashes but
  inherit the old release's bans.

Fix: in `remove_patch`, drop `last_booted_patch` when its number
matches the rolled-back patch instead of inserting into
`known_bad_patches`. That keeps the resurrection defense — the else-if
fallback in `try_fall_back_from_patch` has no target to promote — and
also stops banning the number for any later legitimate reuse. Safe to
clear `last_booted_patch` because `running_patch` (added in #348) is
session-scoped and is now the FFI's source of truth for "what's
running this process," so dropping the persisted historical pointer
doesn't regress the patch-to-release rollback fix.

Verification:

- New regression test `rollforward_after_rollback_can_install_same_patch`
  in `library/src/c_api/mod.rs`. Reproduces the customer scenario at
  the C API level: install patch 1, server rolls patch 1 back, server
  rolls patch 1 forward (same hash, empty rolled_back). Pre-fix:
  `check_for_downloadable_update` returns false and the log line
  `Patch 1 has previously failed to boot, skipping` matches the
  customer's logs. Post-fix: returns true and the patch is queued for
  next boot.
- `rollback_then_failed_replacement_does_not_resurrect_rolled_back_patch`
  updated to assert the property the test name promises (patch 1 is
  not promoted to `next_boot_patch`) plus the new invariant
  (`is_known_bad_patch(1)` is false, so a rollforward can install it).
  The original assertion was checking the implementation mechanism
  rather than the behavioral property.
- All 235 existing tests pass without modification. clippy + fmt clean.

Does not address the broader "patches identified by number not by
hash" weakness. The wire response already carries `Patch.hash` and the
server-side patch id in the download URL, so a follow-up could key
`known_bad_patches` (and any session-scoped rollback signal) by hash
to make this class of bug impossible by construction.

codecov Bot commented May 7, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.


Both appear in comments added by the previous commit. The repo's
cspell config already lists `carryforward` and `rollouts`; these are
the same shape and not in cspell's default English dictionary.

eseidel commented May 7, 2026

I believe this is also fixed by #352.

eseidel added a commit that referenced this pull request May 7, 2026
…eview

bdero's review on #352 surfaced six items worth landing inline:

1. record_boot_failure used to clear `currently_booting_patch` before
   `mark_bad`. A crash between the two left the patch `Installed` with
   no breadcrumb, so `detect_boot_crash_on_init` wouldn't fire next
   boot — the patch silently retried. Flipped: mark_bad first, then
   clear. mark_bad on Bad is idempotent, so the worst case is a
   redundant tombstone-rewrite on next init.

2. Drop the `hash` field from `PatchState::Installed`. It was recorded
   at install but never read at boot — Strict-mode validation
   recomputes the hash from the artifact's bytes and feeds it into
   `check_signature`. A hash that lives only on disk can't be trusted
   as a security input; deleting the field removes the temptation.
   `Downloading.hash`/`Downloaded.hash` stay as comparators against
   the server's freshly-delivered hash (a tampered on-disk hash there
   just causes a redownload).

3. Port the customer-scenario test from #356 into `c_api/mod.rs`:
   server rolls patch 1 back, then forward again with the same number
   and hash. Pre-lifecycle this was permanently dropped via
   `known_bad_patches`. Post-lifecycle `cleanup` is state-aware and
   forgets the patch entirely on server-driven rollback, so the
   rollforward installs cleanly. End-to-end coverage of the regression
   #356 hotfixed.

4. Add a TODO + target version on the `patches_state.json` entry in
   `SHOREBIRD_OWNED_PATHS` so the legacy wipe doesn't outlive its
   usefulness.

5. Document that `running_patch` lives in `config.rs` (session-scoped),
   not in this module — saves a future grep.

6. Clarify `recompute_next_boot`'s fallback policy: it's
   fall-back-to-`last_booted_patch`-only, not fall-back-to-anything-
   bootable. A fresh release whose first Installed patch fails
   validation goes to base, even if older patches sit in `patches/`.

cargo test: 242 passed (was 241).
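
Item 1's ordering argument can be seen in a toy model (all names here are illustrative stand-ins, not the library's API): with mark_bad first, a crash in the window between the two persisted writes still leaves a Bad record behind.

```rust
// Toy model of the ordering fix: two persisted writes with a possible crash
// between them. New order marks the patch bad first, then clears the
// breadcrumb, so the worst-case crash point leaves a state the next init
// can act on instead of silently retrying the patch.
#[derive(Default)]
struct Persisted {
    currently_booting_patch: Option<usize>, // boot breadcrumb
    bad_patches: Vec<usize>,
}

fn record_boot_failure(state: &mut Persisted, n: usize, crash_after_first_write: bool) {
    state.bad_patches.push(n); // mark_bad first
    if crash_after_first_write {
        return; // simulated crash between the two writes
    }
    state.currently_booting_patch = None; // clear breadcrumb second
}

fn main() {
    let mut state = Persisted {
        currently_booting_patch: Some(1),
        ..Default::default()
    };
    record_boot_failure(&mut state, 1, true); // crash in the window
    // Patch 1 is already marked bad; the stale breadcrumb just means a
    // redundant (idempotent) mark_bad on next init, not a silent retry.
    assert!(state.bad_patches.contains(&1));
    assert_eq!(state.currently_booting_patch, Some(1));
}
```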
eseidel added a commit that referenced this pull request May 7, 2026
* refactor: introduce per-patch lifecycle state machine (types + persistence)

First slice of shorebirdtech/shorebird#3737 — replace the scattered
storage of per-patch state across download_state.rs sidecars, bare
files in downloads/, and PatchesState fields (next_boot_patch,
last_booted_patch, known_bad_patches) with a single per-patch
state.json driven by an explicit state machine.

This commit only adds the types and the storage layer; nothing wires
into update_internal or report_launch_* yet. Subsequent commits will
build out transitions and migrate the call sites.

States:
  - Downloading { url, hash, signature, partial_size }
  - Downloaded  { url, hash, signature, size }
  - Installed   { hash, signature, size }
  - Bad         { reason, hash?, signature?, size? }      // tombstone

Operations:
  - mark_bad(n, reason) — sugar over write Bad{} + cleanup
  - cleanup(n) — state-aware: keeps Bad tombstone, else forgets dir

Per-release pointers (next_boot, last_booted, currently_booting,
boot_started_at) move to a separate pointers.json holding patch
numbers; metadata lives once per patch in state.json instead of
duplicated across pointers.

Persistence rides on the existing atomic disk_io::write (sibling-
write + rename) so partial writes can't leave torn files.
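
The sibling-write + rename pattern can be sketched with std::fs (illustrative only; the library's actual helper is disk_io::write):

```rust
use std::fs;
use std::io::Write;
use std::path::Path;

// Write the new contents to a sibling temp file, then rename over the
// target. rename within one filesystem atomically replaces the destination
// on POSIX, so readers see either the old file or the new one, never a
// torn mix.
fn atomic_write(path: &Path, contents: &[u8]) -> std::io::Result<()> {
    let tmp = path.with_extension("tmp");
    let mut f = fs::File::create(&tmp)?;
    f.write_all(contents)?;
    f.sync_all()?; // flush data before the rename makes it visible
    fs::rename(&tmp, path)
}

fn main() -> std::io::Result<()> {
    let dir = std::env::temp_dir().join("atomic_write_demo");
    fs::create_dir_all(&dir)?;
    let target = dir.join("state.json");
    atomic_write(&target, b"{\"v\":1}")?;
    atomic_write(&target, b"{\"v\":2}")?;
    assert_eq!(fs::read(&target)?, b"{\"v\":2}");
    Ok(())
}
```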

* refactor: add download/install/boot transitions to lifecycle

Extends the lifecycle module with the methods update_internal and
report_launch_* will need:

  - decide_start(n, url, hash) → DownloadAction (Fresh, Resume, Complete, Skip)
  - record_download_started / _complete
  - record_install_complete (transitions Downloaded → Installed,
    removes the now-unneeded compressed download file)
  - promote_to_next_boot (transitions a freshly Installed patch into
    pointers.next_boot, retiring any unbooted predecessor)
  - record_boot_start / _success / _failure
  - detect_boot_crash_on_init (handles the breadcrumb left when a
    prior process crashed during boot)
  - validate_next_boot_patch (size + signature checks; marks
    Bad{ValidationFailed} on failure)
  - recompute_next_boot

Nothing wires into the existing updater.rs / report_launch_* yet —
those changes follow in the cutover commit.
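
The decide_start shape described above, sketched as a plain match (the state and action names come from the commit message; the exact mapping shown is an illustrative guess, not the library's code):

```rust
// Map a patch's current on-disk state to the action the downloader should
// take. Illustrative sketch only.
#[derive(Debug, PartialEq)]
enum PatchState {
    Unknown,
    Downloading { have_bytes: u64 },
    Downloaded,
    Installed,
    Bad,
}

#[derive(Debug, PartialEq)]
enum DownloadAction {
    Fresh,
    Resume(u64),
    Complete,
    Skip,
}

fn decide_start(state: &PatchState) -> DownloadAction {
    match state {
        PatchState::Unknown => DownloadAction::Fresh,
        PatchState::Downloading { have_bytes } => DownloadAction::Resume(*have_bytes),
        PatchState::Downloaded => DownloadAction::Complete, // bytes present, just install
        PatchState::Installed | PatchState::Bad => DownloadAction::Skip,
    }
}

fn main() {
    assert_eq!(decide_start(&PatchState::Unknown), DownloadAction::Fresh);
    assert_eq!(
        decide_start(&PatchState::Downloading { have_bytes: 4096 }),
        DownloadAction::Resume(4096)
    );
    assert_eq!(decide_start(&PatchState::Bad), DownloadAction::Skip);
}
```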

* refactor: replace UpdaterState's PatchManager backend with PatchLifecycle

Drops the patch_manager dependency from updater_state.rs. UpdaterState
now owns a PatchLifecycle directly, with all patch-related methods
delegating to it:

  install_patch        → write Installed state + promote to next_boot
  next_boot_patch      → pointers.next_boot_patch + installed_artifact_path
  is_known_bad_patch   → read_state matches Bad
  uninstall_patch      → cleanup + recompute_next_boot
  record_boot_*        → lifecycle's boot transitions
  validate_next_boot_patch → lifecycle's signature/size check

UpdaterState's own state.json shrinks to {client_id, release_version,
queued_events}. Per-release per-patch state lives entirely in the
lifecycle (pointers.json + patches/{N}/state.json).

Other notable behavior changes:

  - record_boot_failure_for_patch no longer requires
    currently_booting_patch to match. Matches the prior PatchManager
    semantics: clear breadcrumb, mark Bad{BootCrash}, recompute.
  - recompute_next_boot leaves a valid Installed next_boot_patch alone
    instead of promoting last_booted over it. Without this, processing
    server rollbacks would clobber a freshly-installed newer patch.

Tests in updater.rs that asserted file paths (patches_state.json) update
to the new layout (pointers.json). The MockManagePatches-backed unit
tests in updater_state.rs are gone — replaced by direct end-to-end tests
that exercise PatchLifecycle through UpdaterState.

patch_manager.rs is still on disk and compiled (nothing references it
from cache/mod.rs anymore) — deletion happens after the updater.rs
cutover.

* refactor: cut update_internal over to PatchLifecycle

Replaces the download_state.rs sidecar machinery and the bespoke
DownloadStartState / should_install_patch helpers in update_internal
with calls into PatchLifecycle.

Concrete changes:

  - Drops compute_resume_offset / determine_download_start_state /
    DownloadStartState — replaced by PatchLifecycle::decide_start
    returning DownloadAction.
  - Drops should_install_patch / ShouldInstallPatchCheckResult —
    folded into the same DownloadAction match. KnownBad → UpdateIsBadPatch,
    AlreadyInstalled → NoUpdate, the rest proceed.
  - Drops install_downloaded_patch's bespoke "stage in downloads/, rename
    on install" choreography. The lifecycle owns the per-patch directory;
    inflate writes directly into patches/{N}/dlc.vmcode and the
    Downloaded → Installed transition removes the now-unneeded compressed
    download file.
  - Drops cleanup_download_artifacts and clean_download_dir. mark_bad
    handles tombstone-aware cleanup for failures, and the lifecycle owns
    its directory.
  - Marks the patch Bad{InvalidPatchBytes} when inflate fails and
    Bad{InstallHashMismatch} when check_hash fails. Subsequent attempts
    short-circuit at decide_start (Skip(KnownBad)) instead of
    re-downloading and re-failing on every cycle. This is the marks-bad-
    on-install-failure behavior we deferred from #351.
  - Preserves the explicit "server-side Content-Length vs total_bytes"
    mismatch check — surfaces the contract violation directly instead of
    obliquely via inflate failure.

decide_start consults the on-disk download file as the source of truth
for "how many bytes do we have so far." The denormalized partial_size
in PatchState::Downloading is kept for diagnostics/serde stability but
not consulted for the resume offset; this matches the prior
file-size-on-disk behavior and avoids needing to update state.json
mid-stream from the network layer.
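
The file-size-on-disk behavior can be sketched directly (the helper name is illustrative):

```rust
use std::fs;
use std::path::Path;

// Bytes on disk are the source of truth for the resume offset: read the
// partial download's file size rather than trusting a recorded value.
// A missing file (e.g. OS cache eviction) falls back to 0, i.e. a fresh
// download from scratch.
fn resume_offset(download_path: &Path) -> u64 {
    fs::metadata(download_path).map(|m| m.len()).unwrap_or(0)
}

fn main() -> std::io::Result<()> {
    let dir = std::env::temp_dir().join("resume_offset_demo");
    fs::create_dir_all(&dir)?;
    let partial = dir.join("download");
    fs::write(&partial, vec![0u8; 1234])?; // simulate a partial download
    assert_eq!(resume_offset(&partial), 1234);
    assert_eq!(resume_offset(&dir.join("missing")), 0);
    Ok(())
}
```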

For Downloading / Downloaded states where the OS evicted the artifact
file out from under us (e.g. iOS code-cache eviction), decide_start
falls through to Fresh so the next attempt re-downloads from scratch.

The failing-test fixes update file paths and pre-staged state from the
old layout (downloads/{N} + downloads/{N}.download.json) to the new
layout (patches/{N}/download + patches/{N}/state.json). Several tests
that targeted the now-deleted helpers directly are removed; their
behavior is covered by the new lifecycle unit tests.

* refactor: delete patch_manager.rs and download_state.rs

Both modules are now subsumed by PatchLifecycle:

  - patch_manager.rs (~1600 lines) managed the per-release `patches/`
    tree, the `next_boot_patch` / `last_booted_patch` /
    `currently_booting_patch` / `known_bad_patches` fields of
    `PatchesState`, and the fall-back-from-bad-patch logic. All of
    this is now in lifecycle.rs split between `PatchLifecycle`
    (per-patch state.json) and `ReleasePointers` (a single
    pointers.json).
  - download_state.rs (~125 lines) wrote the per-download sidecar
    JSON. Now folded into PatchState::Downloading /
    PatchState::Downloaded variants in state.json, owned by the
    lifecycle module.

The MockManagePatches / ManagePatches trait machinery in
updater_state.rs's tests goes away with patch_manager. The lifecycle
operations are tested directly in cache::lifecycle::tests against a
real on-disk filesystem under TempDir, which catches issues the
mocks couldn't (e.g. file-existence cross-checks in `decide_start`).

The cargo workspace now has 212 passing tests vs. 257 before, the
delta being the patch_manager unit tests whose subjects no longer
exist. End-to-end coverage in `updater.rs::tests` is unchanged and
all green.

* refactor: address self-review feedback on lifecycle PR

Nine fixes from the self-review pass on #352:

  - Drop `partial_size` from PatchState::Downloading. The field was
    misleading (decide_start reads from disk, not from the recorded
    value) and unused. record_download_started loses its 5th arg.
  - Gate `UpdaterState::install_patch` to `#[cfg(test)]`. Production
    no longer routes through it; only test_utils and the updater_state
    tests do. The gate makes the divergence intentional and prevents
    future production callers.
  - install_patch defensively removes any prior `dlc.vmcode` before
    rename so behavior is OS-agnostic (POSIX rename overwrites silently;
    Windows fails). Also mirrors record_install_complete's cleanup of
    a stale `download` file in the patch dir.
  - Document on `lifecycle()` / `lifecycle_mut()` that the direct
    accessors are intentional — wrapping every transition would be
    churn for no reader benefit.
  - recompute_next_boot now clears `last_booted_patch` when its
    on-disk record is gone (Unknown), so pointers.json doesn't
    accumulate references to nothing. A `Bad` last_booted is left
    alone — that's a useful breadcrumb and recompute simply doesn't
    promote it.
  - Promote `download_artifact_path` and `installed_artifact_path` to
    methods on PatchLifecycle for symmetry with state_path /
    pointers_path. Updates all call sites.
  - Document mark_bad-on-Bad merge semantics: latest reason wins,
    other fields preserved. Hypothetical in practice but no longer
    silent.
  - Add tests for recompute_next_boot's stale-pointer clearing
    (Unknown → cleared, Bad → kept). Brings the suite to 214 tests.

The mark_bad-cleanup-stale-bytes recovery path I flagged was already
covered by `cleanup_on_bad_patch_keeps_tombstone` — no new test needed.

* refactor: undo two over-corrections from the prior self-review fix

Two issues introduced by 07ea84a that the second self-review caught:

  - update_internal computed download_path / installed_path via
    `with_state(...).lifecycle().download_artifact_path(n)`. That
    routed a pure-function path computation through a state read,
    serializing it against other state operations and adding a disk
    read per call. Restore the free `download_artifact_path` /
    `installed_artifact_path` functions as the canonical entry point;
    the methods on PatchLifecycle are kept as thin wrappers for
    callers that already hold a lifecycle handle.
  - install_patch's "defensive remove_file before rename" added a
    TOCTOU window between the exists() check and the rename for no
    real benefit — install_patch is now `#[cfg(test)]`-only, the
    Windows-rename concern was theoretical for tests, and POSIX rename
    atomically overwrites. Drop the explicit remove_file.

* ci: fix warnings-as-errors and cspell

Three CI failures from the prior push:

  - `Context` and `bail` imports in updater_state.rs are only used
    inside the `#[cfg(test)] install_patch` helper. Moved them under
    `#[cfg(test)]` so non-test builds don't see unused imports.
  - `PathBuf` import in updater.rs became unused after switching the
    download/install paths to take `Path::new(&config.storage_dir)`
    directly. Dropped.
  - `FileOperation::DeleteDir` is no longer constructed by production
    code (was used by the deleted patch_manager.rs). Mirrored the
    existing `#[allow(dead_code)]` annotation already on `DeleteFile`
    rather than removing the variant — the format/handling code for
    it is still useful for any future code that needs to surface a
    "delete dir failed" error.

CSpell additions: `unparseable`, `tombstoned`, `roundtrips` — words
introduced by the lifecycle module's docs and test names.

* ci: gate PathBuf import to platforms that actually use it

The previous fix removed `PathBuf` from the top-level use, which
broke the lib build on non-android non-test platforms (macOS,
Linux, Windows) where `libapp_path_from_settings` uses `PathBuf`.
That function is itself gated to `cfg(not(any(target_os = "android",
test)))`, so the PathBuf import needs the same gate.

Restoring it as a cfg-gated separate import — keeps the lib build
green on all three runners and keeps the lib-test build clean
under -D warnings.

* test: cover the gaps surfaced by the coverage report

Adds 7 tests targeting branches that were uncovered:

  - mark_bad_from_downloading_records_partial_file_size: the
    Downloading source state in mark_bad reads bytes-on-disk for
    the recorded `size` field; the existing tests only covered
    Installed → Bad.
  - mark_bad_overwrites_reason_when_already_bad: documented merge
    semantics (latest reason wins, other fields preserved) now
    has a test backing the doc.
  - validate_next_boot_patch_marks_bad_when_artifact_missing: the
    "Installed state but dlc.vmcode is gone" path. Existing test
    covered size mismatch; this covers the missing-file branch.
  - validate_next_boot_patch_marks_bad_when_pointer_targets_non_installed:
    the case where the next_boot pointer references a state other
    than Installed (state.json + pointers can diverge through
    corruption).
  - install_patch_install_only_{accepts_valid,rejects_missing,rejects_bad}_signature:
    cover the InstallOnly verification path in
    UpdaterState::install_patch using the existing test keypair
    from signing.rs.

Coverage improvements:
  - cache/lifecycle.rs:    94.65% → 96.06% lines, 98.72% → 100% fns
  - cache/updater_state.rs: 94.34% → 96.94% lines, 95.65% → 98% fns
  - total:                 94.70% → 95.11% lines

214 → 221 tests, all passing under -D warnings.

* ci: add 'keypair' to cspell dictionary

* test: audit deleted patch_manager tests; restore strict checks

Goes through every test deleted with patch_manager.rs (~40) and either
maps it to a new test (with a `Ports patch_manager.rs::mod::name`
comment) or adds a port. Three behavioral regressions caught and fixed:

  - record_boot_start now requires next_boot_patch to match the arg.
    Old PatchManager had this defensive check; my refactor dropped it.
    Carries forward the engine-vs-state agreement guard.
  - Added strict-mode signature tests for validate_next_boot_patch
    (valid, missing, invalid, no public key, bad public key). Ports
    five `validate_next_boot_patch_tests::strict_mode_*` tests.
  - Added InstallOnly + no public_key test (ports
    add_patch_tests::install_only_succeeds_with_any_signature_if_no_public_key).

Plus two scenario tests that didn't have direct coverage:

  - rolled_back_patch_not_resurrected_when_replacement_fails: ports
    fall_back_tests::rollback_then_failed_replacement_does_not_resurrect_rolled_back_patch
    against the new state-based fallback (cleanup forgets, recompute
    won't promote a None state).
  - validate_then_promote_catches_corrupted_last_booted: ports
    validate_next_boot_patch_tests::does_not_fall_back_to_last_booted_patch_if_corrupted
    with a note: new code catches the corruption at the next validate
    pass instead of at fall-back time, but the user-visible outcome
    (boot the base release) is the same.

Test helper split: `install_state` (writes Installed without touching
pointers) vs callers explicitly setting pointers. Avoids the prior
helper's auto-promote masking pointer-management behavior the tests
were trying to exercise.

Tests deliberately not ported (with reasoning):

  - debug_tests::manage_patches_is_debug, patch_manager_is_debug:
    Trait-Debug tests for types that no longer exist. Replaced
    implicitly by `#[derive(Debug)]` on PatchLifecycle.
  - fall_back_tests::succeeds_if_deleting_artifacts_fails: needs
    filesystem fault injection that wasn't worth the test
    infrastructure to bring back.
  - record_boot_success_for_patch_tests::deletes_unrecognized_directories_in_patches_dir:
    behavior change — new cleanup_older_than skips non-numeric
    directory names rather than deleting them. Defensive; the prior
    `rm -rf` was overly aggressive.

Coverage: 222 → 230 tests, all green under -D warnings.

* test: port the two patch_manager tests I'd handwaved

Both turned out trivial to port — neither was actually doing fault
injection, just exploiting graceful-failure paths.

  - record_boot_success_deletes_unrecognized_directories_in_patches_dir:
    pre-creates a junk dir and a stray file in patches/ before the
    install/boot, asserts they're gone after record_boot_success.
    Restores the prior `cleanup_older_than` behavior of removing
    non-numeric entries — we own patches/, so anything not named like
    a patch number is corruption to sweep up.
  - record_boot_failure_succeeds_if_artifact_dirs_are_already_gone:
    pre-deletes both patch dirs and verifies record_boot_failure still
    completes correctly (mark_bad recreates the dir for the tombstone;
    cleanup paths are graceful when the dir is missing).

Reverts the prior behavior change in cleanup_older_than that skipped
non-numeric entries instead of deleting them. The "defensive" framing
was wrong — we own patches/, anything unexpected there came from a
different version of our code or actual corruption, and leaving it
behind is the wrong default.
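
The restored sweep behavior, sketched (an illustrative stand-in for cleanup_older_than, assuming the keep-threshold is the just-booted patch number):

```rust
use std::fs;
use std::path::Path;

// Anything in patches/ whose name doesn't parse as a patch number is
// removed, and numeric entries older than the kept patch are removed too.
// We own patches/, so unexpected entries are corruption to sweep up.
fn cleanup_older_than(patches_dir: &Path, keep: u64) -> std::io::Result<()> {
    for entry in fs::read_dir(patches_dir)? {
        let entry = entry?;
        let name = entry.file_name();
        match name.to_string_lossy().parse::<u64>() {
            Ok(n) if n >= keep => {} // current or newer patch: keep
            _ => {
                // Older patch, junk dir, or stray file: delete it.
                let path = entry.path();
                if path.is_dir() {
                    fs::remove_dir_all(&path)?;
                } else {
                    fs::remove_file(&path)?;
                }
            }
        }
    }
    Ok(())
}

fn main() -> std::io::Result<()> {
    let dir = std::env::temp_dir().join("cleanup_demo_patches");
    let _ = fs::remove_dir_all(&dir);
    fs::create_dir_all(dir.join("1"))?;
    fs::create_dir_all(dir.join("2"))?;
    fs::create_dir_all(dir.join("junk"))?;
    fs::write(dir.join("stray.txt"), b"x")?;
    cleanup_older_than(&dir, 2)?;
    assert!(!dir.join("1").exists());
    assert!(dir.join("2").exists());
    assert!(!dir.join("junk").exists());
    assert!(!dir.join("stray.txt").exists());
    Ok(())
}
```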

* test: port the remaining patch_manager tests with real coverage gaps

  - validate_next_boot_strict_mode_succeeds_with_valid_signature: ports
    `patch_manager.rs::validate_next_boot_patch_tests::strict_mode_succeeds_with_valid_signature_at_boot_time`.
    Required reusing the prior test fixture's INFLATED_PATCH_HASH /
    INFLATED_PATCH_SIGNATURE constants (a real signature for the
    1-byte file content `b"1"`, matching TEST_PUBLIC_KEY's private
    half).
  - record_boot_failure_keeps_last_booted_pointer_when_failed_patch_was_last_booted:
    ports
    `patch_manager.rs::record_boot_failure_for_patch_tests::preserves_last_booted_patch_on_failure_but_marks_bad`.
    The new behavior is similar but more explicit: last_booted's
    *pointer* is preserved (with the patch now in Bad state); the
    underlying intent — "we still know what last booted, even after
    that patch failed" — survives.

Test helper split: `install_signed` (failure paths, mismatched hash
OK) vs `install_with_valid_signature` (happy path, real fixture).

Now every behavioral test from the deleted patch_manager.rs has a
corresponding new test with a porting comment. Two trivial ones not
ported: the `Debug` impl tests for traits/types that no longer exist.

236 tests, all green under -D warnings.

* refactor: restore separate download directory under OS cache root

Reverts an unintended policy change in the cutover: my refactor moved
compressed download bytes from `{code_cache_dir}/downloads/{N}` (the
OS-managed cache dir, evictable under storage pressure) into
`{storage_dir}/patches/{N}/download` (persistent app storage). That
made downloads count against persistent storage and survive in iCloud
backups — neither of which we want for transient bytes.

Restoring the original split:
  - `{state_root}/patches/{N}/{state.json,dlc.vmcode}` — persistent
    state and the installed artifact.
  - `{download_root}/{N}` — flat, in the OS cache dir.

`PatchLifecycle::load_or_default` now takes both roots. The two
artifact-path helpers each route to their respective root, the
free-function variants used by `update_internal` now take the right
root, and `mark_bad`/`cleanup`/`record_install_complete` clean both
locations as appropriate.

`UpdaterState::create_new_and_save` also wipes `download_dir` on
release-version change, in addition to the persistent patches/ tree.

`update_internal` and tests updated to use the right root for each
file type. Test fixtures pass `tmp.path().join("downloads")` as the
download root (separate subdir of the same TempDir).

Net change in test surface: moved a handful of `patch_dir/download`
references to `downloads_dir/N`, no behavioral changes to tests
themselves. 236 tests pass under -D warnings.

* test: cover the new download_root cleanup paths

Self-review caught two gaps in the download-root split:

  - release_version_change_wipes_download_dir: the cache-rooted
    download dir should be wiped along with the persistent patches/
    tree on a release-version mismatch. Adds explicit coverage —
    the code path was already in create_new_and_save but not
    directly tested.
  - record_boot_success_promotes_and_cleans_older: extends the
    existing test to drop a stale download in download_root
    alongside an older patch's state, asserts that
    cleanup_older_than's chain (cleanup → forget_dir) deletes both
    roots' artifacts.

Also adds a comment on cleanup_older_than noting that it only
walks the persistent patches/ tree — orphan downloads (no
state.json) would persist until the OS evicts them. Noted as a
known limitation, not blocking.

* refactor: targeted wipe + legacy patches_state.json + orphan-download walk

Two fixes to the release-version-change wipe and one to the boot-
success cleanup walk.

  - The "wipe everything in cache_dir" approach broke 27 tests whose
    setup uses cache_dir as the test temp dir (with base.apk and
    libapp.so as siblings). Production gives shorebird a dedicated
    subdir but the API doesn't enforce that. Switched to a targeted
    wipe with a documented allowlist of paths shorebird has ever
    written under cache_dir: `SHOREBIRD_OWNED_PATHS`.
  - Added `patches_state.json` to the wipe list — that's the
    legacy file from the prior `PatchManager` implementation, and
    devices upgrading through this PR would otherwise orphan it.
    New test `release_version_change_wipes_legacy_patches_state_json`
    covers it.
  - `cleanup_older_than` now also calls `cleanup_orphan_downloads`,
    which walks `download_root/` and removes any file that doesn't
    correspond to a patch in `Downloading`/`Downloaded` state. We own
    the cache root and shouldn't rely on OS eviction. Catches the
    "state.json gone but download lingered" case the prior comment
    handwaved.

Adding a new file under cache_dir means adding it to
SHOREBIRD_OWNED_PATHS — small bookkeeping cost, but it lets us
keep cache_dir co-tenant with embedder-provided files.

* test: cover all four branches of cleanup_orphan_downloads

The new orphan-download sweep distinguishes four cases but only the
"older patch via cleanup chain" branch was being exercised. Single
test now drops one of each kind of file into download_root before a
boot success and asserts which survive:

  - orphan (numeric, no state.json): removed
  - stale (numeric, state is Installed): removed
  - non-numeric name: removed
  - live (numeric, state is Downloading): preserved
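
The four-way classification, sketched as a pure predicate (names are illustrative stand-ins for the lifecycle's real types):

```rust
use std::collections::HashMap;

// A download file survives only if its numeric name maps to a patch still
// in a download-in-progress state; orphans, stale entries, and non-numeric
// names are all swept.
enum State {
    Downloading,
    Downloaded,
    Installed,
}

fn keep_download(name: &str, states: &HashMap<u64, State>) -> bool {
    match name.parse::<u64>() {
        // live: patch still mid-download, the bytes are needed
        Ok(n) => matches!(
            states.get(&n),
            Some(State::Downloading) | Some(State::Downloaded)
        ),
        // non-numeric names never belong in download_root
        Err(_) => false,
    }
}

fn main() {
    let mut states = HashMap::new();
    states.insert(2, State::Installed); // stale: already installed
    states.insert(3, State::Downloading); // live
    assert!(!keep_download("1", &states)); // orphan: no state.json
    assert!(!keep_download("2", &states)); // stale: removed
    assert!(keep_download("3", &states)); // live: preserved
    assert!(!keep_download("temp", &states)); // non-numeric: removed
}
```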

* ci: add 'embedder' to cspell dictionary

* refactor: drop Installed.hash, fix mark_bad ordering, address bdero review

bdero added a commit to shorebirdtech/shorebird that referenced this pull request May 8, 2026
Word appears in a comment on rollforwardPatch in code_push_client_wrapper —
'the server resends the same patch artifact on rollforward'. Same fix as
shorebirdtech/updater#356 needed in this repo.