fix: don't permanently ban patch numbers on server-driven rollback#356
Open
bdero wants to merge 2 commits into
Open
fix: don't permanently ban patch numbers on server-driven rollback#356bdero wants to merge 2 commits into
bdero wants to merge 2 commits into
Conversation
Customer report: after rolling a patch back and then rolling it forward from the dashboard, the device silently refuses to install the re-offered patch. `checkForUpdate` returns `upToDate` while the server still says a patch is available. Apps that compare `currentPatch` against an FRC-driven target patch number then loop on a force-restart prompt that nothing can satisfy. Root cause: PR #348's review-feedback commit added `known_bad_patches.insert(patch_number)` to `remove_patch` to defend against an orthogonal resurrection-via-failed-delete race. That set is a permanent ban for the current `release_version`, so the same number also stays banned through: - A dashboard rollforward, which flips `is_rolled_back` on the same `app_release_patch_` row server-side and resends the same `Patch` (same number, same hash). This is the case biting the customer. - A delete-and-recreate of the release at the same `release_version` (the workflow `release_command.dart` itself prescribes when bumping Flutter), where new patches start at #1 again with new hashes but inherit the old release's bans. Fix: in `remove_patch`, drop `last_booted_patch` when its number matches the rolled-back patch instead of inserting into `known_bad_patches`. That keeps the resurrection defense — the else-if fallback in `try_fall_back_from_patch` has no target to promote — and also stops banning the number for any later legitimate reuse. Safe to clear `last_booted_patch` because `running_patch` (added in #348) is session-scoped and is now the FFI's source of truth for "what's running this process," so dropping the persisted historical pointer doesn't regress the patch-to-release rollback fix. Verification: - New regression test `rollforward_after_rollback_can_install_same_patch` in `library/src/c_api/mod.rs`. Reproduces the customer scenario at the C API level: install patch 1, server rolls patch 1 back, server rolls patch 1 forward (same hash, empty rolled_back). Pre-fix: `check_for_downloadable_update` returns false and the log line `Patch 1 has previously failed to boot, skipping` matches the customer's logs. Post-fix: returns true and the patch is queued for next boot. - `rollback_then_failed_replacement_does_not_resurrect_rolled_back_patch` updated to assert the property the test name promises (patch 1 is not promoted to `next_boot_patch`) plus the new invariant (`is_known_bad_patch(1)` is false, so a rollforward can install it). The original assertion was checking the implementation mechanism rather than the behavioral property. - All 235 existing tests pass without modification. clippy + fmt clean. Does not address the broader "patches identified by number not by hash" weakness. The wire response already carries `Patch.hash` and the server-side patch id in the download URL, so a follow-up could key `known_bad_patches` (and any session-scoped rollback signal) by hash to make this class of bug impossible by construction.
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
Both appear in comments added by the previous commit. The repo's cspell config already lists `carryforward` and `rollouts`; these are the same shape and not in cspell's default English dictionary.
Contributor
|
I believe this is also fixed by #352. |
4 tasks
eseidel
added a commit
that referenced
this pull request
May 7, 2026
…eview bdero's review on #352 surfaced six items worth landing inline: 1. record_boot_failure used to clear `currently_booting_patch` before `mark_bad`. A crash between the two left the patch `Installed` with no breadcrumb, so `detect_boot_crash_on_init` wouldn't fire next boot — the patch silently retried. Flipped: mark_bad first, then clear. mark_bad on Bad is idempotent, so the worst case is a redundant tombstone-rewrite on next init. 2. Drop the `hash` field from `PatchState::Installed`. It was recorded at install but never read at boot — Strict-mode validation recomputes the hash from the artifact's bytes and feeds it into `check_signature`. A hash that lives only on disk can't be trusted as a security input; deleting the field removes the temptation. `Downloading.hash`/`Downloaded.hash` stay as comparators against the server's freshly-delivered hash (a tampered on-disk hash there just causes a redownload). 3. Port the customer-scenario test from #356 into `c_api/mod.rs`: server rolls patch 1 back, then forward again with the same number and hash. Pre-lifecycle this was permanently dropped via `known_bad_patches`. Post-lifecycle `cleanup` is state-aware and forgets the patch entirely on server-driven rollback, so the rollforward installs cleanly. End-to-end coverage of the regression #356 hotfixed. 4. Add a TODO + target version on the `patches_state.json` entry in `SHOREBIRD_OWNED_PATHS` so the legacy wipe doesn't outlive its usefulness. 5. Document that `running_patch` lives in `config.rs` (session-scoped), not in this module — saves a future grep. 6. Clarify `recompute_next_boot`'s fallback policy: it's fall-back-to-`last_booted_patch`-only, not fall-back-to-anything- bootable. A fresh release whose first Installed patch fails validation goes to base, even if older patches sit in `patches/`. cargo test: 242 passed (was 241).
eseidel
added a commit
that referenced
this pull request
May 7, 2026
* refactor: introduce per-patch lifecycle state machine (types + persistence) First slice of shorebirdtech/shorebird#3737 — replace the scattered storage of per-patch state across download_state.rs sidecars, bare files in downloads/, and PatchesState fields (next_boot_patch, last_booted_patch, known_bad_patches) with a single per-patch state.json driven by an explicit state machine. This commit only adds the types and the storage layer; nothing wires into update_internal or report_launch_* yet. Subsequent commits will build out transitions and migrate the call sites. States: - Downloading { url, hash, signature, partial_size } - Downloaded { url, hash, signature, size } - Installed { hash, signature, size } - Bad { reason, hash?, signature?, size? } // tombstone Operations: - mark_bad(n, reason) — sugar over write Bad{} + cleanup - cleanup(n) — state-aware: keeps Bad tombstone, else forgets dir Per-release pointers (next_boot, last_booted, currently_booting, boot_started_at) move to a separate pointers.json holding patch numbers; metadata lives once per patch in state.json instead of duplicated across pointers. Persistence rides on the existing atomic disk_io::write (sibling- write + rename) so partial writes can't leave torn files. * refactor: add download/install/boot transitions to lifecycle Extends the lifecycle module with the methods update_internal and report_launch_* will need: - decide_start(n, url, hash) → DownloadAction (Fresh, Resume, Complete, Skip) - record_download_started / _complete - record_install_complete (transitions Downloaded → Installed, removes the now-unneeded compressed download file) - promote_to_next_boot (transitions a freshly Installed patch into pointers.next_boot, retiring any unbooted predecessor) - record_boot_start / _success / _failure - detect_boot_crash_on_init (handles the breadcrumb left when a prior process crashed during boot) - validate_next_boot_patch (size + signature checks; marks Bad{ValidationFailed} on failure) - recompute_next_boot Nothing wires into the existing updater.rs / report_launch_* yet — those changes follow in the cutover commit. * refactor: replace UpdaterState's PatchManager backend with PatchLifecycle Drops the patch_manager dependency from updater_state.rs. UpdaterState now owns a PatchLifecycle directly, with all patch-related methods delegating to it: install_patch → write Installed state + promote to next_boot next_boot_patch → pointers.next_boot_patch + installed_artifact_path is_known_bad_patch → read_state matches Bad uninstall_patch → cleanup + recompute_next_boot record_boot_* → lifecycle's boot transitions validate_next_boot_patch → lifecycle's signature/size check UpdaterState's own state.json shrinks to {client_id, release_version, queued_events}. Per-release per-patch state lives entirely in the lifecycle (pointers.json + patches/{N}/state.json). Other notable behavior changes: - record_boot_failure_for_patch no longer requires currently_booting_patch to match. Matches the prior PatchManager semantics: clear breadcrumb, mark Bad{BootCrash}, recompute. - recompute_next_boot leaves a valid Installed next_boot_patch alone instead of promoting last_booted over it. Without this, processing server rollbacks would clobber a freshly-installed newer patch. Tests in updater.rs that asserted file paths (patches_state.json) update to the new layout (pointers.json). The MockManagePatches-backed unit tests in updater_state.rs are gone — replaced by direct end-to-end tests that exercise PatchLifecycle through UpdaterState. patch_manager.rs is still on disk and compiled (nothing references it from cache/mod.rs anymore) — deletion happens after the updater.rs cutover. * refactor: cut update_internal over to PatchLifecycle Replaces the download_state.rs sidecar machinery and the bespoke DownloadStartState / should_install_patch helpers in update_internal with calls into PatchLifecycle. Concrete changes: - Drops compute_resume_offset / determine_download_start_state / DownloadStartState — replaced by PatchLifecycle::decide_start returning DownloadAction. - Drops should_install_patch / ShouldInstallPatchCheckResult — folded into the same DownloadAction match. KnownBad → UpdateIsBadPatch, AlreadyInstalled → NoUpdate, the rest proceed. - Drops install_downloaded_patch's bespoke "stage in downloads/, rename on install" choreography. The lifecycle owns the per-patch directory; inflate writes directly into patches/{N}/dlc.vmcode and the Downloaded → Installed transition removes the now-unneeded compressed download file. - Drops cleanup_download_artifacts and clean_download_dir. mark_bad handles tombstone-aware cleanup for failures, and the lifecycle owns its directory. - Marks the patch Bad{InvalidPatchBytes} when inflate fails and Bad{InstallHashMismatch} when check_hash fails. Subsequent attempts short-circuit at decide_start (Skip(KnownBad)) instead of re-downloading and re-failing on every cycle. This is the marks-bad- on-install-failure behavior we deferred from #351. - Preserves the explicit "server-side Content-Length vs total_bytes" mismatch check — surfaces the contract violation directly instead of obliquely via inflate failure. decide_start consults the on-disk download file as the source of truth for "how many bytes do we have so far." The denormalized partial_size in PatchState::Downloading is kept for diagnostics/serde stability but not consulted for the resume offset; this matches the prior file-size-on-disk behavior and avoids needing to update state.json mid-stream from the network layer. For Downloading / Downloaded states where the OS evicted the artifact file out from under us (e.g. iOS code-cache eviction), decide_start falls through to Fresh so the next attempt re-downloads from scratch. The failing-test fixes update file paths and pre-staged state from the old layout (downloads/{N} + downloads/{N}.download.json) to the new layout (patches/{N}/download + patches/{N}/state.json). Several tests that targeted the now-deleted helpers directly are removed; their behavior is covered by the new lifecycle unit tests. * refactor: delete patch_manager.rs and download_state.rs Both modules are now subsumed by PatchLifecycle: - patch_manager.rs (~1600 lines) managed the per-release `patches/` tree, the `next_boot_patch` / `last_booted_patch` / `currently_booting_patch` / `known_bad_patches` fields of `PatchesState`, and the fall-back-from-bad-patch logic. All of this is now in lifecycle.rs split between `PatchLifecycle` (per-patch state.json) and `ReleasePointers` (a single pointers.json). - download_state.rs (~125 lines) wrote the per-download sidecar JSON. Now folded into PatchState::Downloading / PatchState::Downloaded variants in state.json, owned by the lifecycle module. The MockManagePatches / ManagePatches trait machinery in updater_state.rs's tests goes away with patch_manager. The lifecycle operations are tested directly in cache::lifecycle::tests against a real on-disk filesystem under TempDir, which catches issues the mocks couldn't (e.g. file-existence cross-checks in `decide_start`). The cargo workspace now has 212 passing tests vs. 257 before, the delta being the patch_manager unit tests whose subjects no longer exist. End-to-end coverage in `updater.rs::tests` is unchanged and all green. * refactor: address self-review feedback on lifecycle PR Nine fixes from the self-review pass on #352: - Drop `partial_size` from PatchState::Downloading. The field was misleading (decide_start reads from disk, not from the recorded value) and unused. record_download_started loses its 5th arg. - Gate `UpdaterState::install_patch` to `#[cfg(test)]`. Production no longer routes through it; only test_utils and the updater_state tests do. The gate makes the divergence intentional and prevents future production callers. - install_patch defensively removes any prior `dlc.vmcode` before rename so behavior is OS-agnostic (POSIX rename overwrites silently; Windows fails). Also mirrors record_install_complete's cleanup of a stale `download` file in the patch dir. - Document on `lifecycle()` / `lifecycle_mut()` that the direct accessors are intentional — wrapping every transition would be churn for no reader benefit. - recompute_next_boot now clears `last_booted_patch` when its on-disk record is gone (Unknown), so pointers.json doesn't accumulate references to nothing. A `Bad` last_booted is left alone — that's a useful breadcrumb and recompute simply doesn't promote it. - Promote `download_artifact_path` and `installed_artifact_path` to methods on PatchLifecycle for symmetry with state_path / pointers_path. Updates all call sites. - Document mark_bad-on-Bad merge semantics: latest reason wins, other fields preserved. Hypothetical in practice but no longer silent. - Add tests for recompute_next_boot's stale-pointer clearing (Unknown → cleared, Bad → kept). Brings the suite to 214 tests. The mark_bad-cleanup-stale-bytes recovery path I flagged was already covered by `cleanup_on_bad_patch_keeps_tombstone` — no new test needed. * refactor: undo two over-corrections from the prior self-review fix Two issues introduced by 07ea84a that the second self-review caught: - update_internal computed download_path / installed_path via `with_state(...).lifecycle().download_artifact_path(n)`. That routed a pure-function path computation through a state read, serializing it against other state operations and adding a disk read per call. Restore the free `download_artifact_path` / `installed_artifact_path` functions as the canonical entry point; the methods on PatchLifecycle are kept as thin wrappers for callers that already hold a lifecycle handle. - install_patch's "defensive remove_file before rename" added a TOCTOU window between the exists() check and the rename for no real benefit — install_patch is now `#[cfg(test)]`-only, the Windows-rename concern was theoretical for tests, and POSIX rename atomically overwrites. Drop the explicit remove_file. * ci: fix warnings-as-errors and cspell Three CI failures from the prior push: - `Context` and `bail` imports in updater_state.rs are only used inside the `#[cfg(test)] install_patch` helper. Moved them under `#[cfg(test)]` so non-test builds don't see unused imports. - `PathBuf` import in updater.rs became unused after switching the download/install paths to take `Path::new(&config.storage_dir)` directly. Dropped. - `FileOperation::DeleteDir` is no longer constructed by production code (was used by the deleted patch_manager.rs). Mirrored the existing `#[allow(dead_code)]` annotation already on `DeleteFile` rather than removing the variant — the format/handling code for it is still useful for any future code that needs to surface a "delete dir failed" error. CSpell additions: `unparseable`, `tombstoned`, `roundtrips` — words introduced by the lifecycle module's docs and test names. * ci: gate PathBuf import to platforms that actually use it The previous fix removed `PathBuf` from the top-level use, which broke the lib build on non-android non-test platforms (macOS, Linux, Windows) where `libapp_path_from_settings` uses `PathBuf`. That function is itself gated to `cfg(not(any(target_os = \"android\", test)))`, so the PathBuf import needs the same gate. Restoring it as a cfg-gated separate import — keeps the lib build green on all three runners and keeps the lib-test build clean under -D warnings. * test: cover the gaps surfaced by the coverage report Adds 7 tests targeting branches that were uncovered: - mark_bad_from_downloading_records_partial_file_size: the Downloading source state in mark_bad reads bytes-on-disk for the recorded `size` field; the existing tests only covered Installed → Bad. - mark_bad_overwrites_reason_when_already_bad: documented merge semantics (latest reason wins, other fields preserved) now has a test backing the doc. - validate_next_boot_patch_marks_bad_when_artifact_missing: the "Installed state but dlc.vmcode is gone" path. Existing test covered size mismatch; this covers the missing-file branch. - validate_next_boot_patch_marks_bad_when_pointer_targets_non_installed: the case where the next_boot pointer references a state other than Installed (state.json + pointers can diverge through corruption). - install_patch_install_only_{accepts_valid,rejects_missing,rejects_bad}_signature: cover the InstallOnly verification path in UpdaterState::install_patch using the existing test keypair from signing.rs. Coverage improvements: - cache/lifecycle.rs: 94.65% → 96.06% lines, 98.72% → 100% fns - cache/updater_state.rs: 94.34% → 96.94% lines, 95.65% → 98% fns - total: 94.70% → 95.11% lines 214 → 221 tests, all passing under -D warnings. * ci: add 'keypair' to cspell dictionary * test: audit deleted patch_manager tests; restore strict checks Goes through every test deleted with patch_manager.rs (~40) and either maps it to a new test (with a `Ports patch_manager.rs::mod::name` comment) or adds a port. Three behavioral regressions caught and fixed: - record_boot_start now requires next_boot_patch to match the arg. Old PatchManager had this defensive check; my refactor dropped it. Carries forward the engine-vs-state agreement guard. - Added strict-mode signature tests for validate_next_boot_patch (valid, missing, invalid, no public key, bad public key). Ports five `validate_next_boot_patch_tests::strict_mode_*` tests. - Added InstallOnly + no public_key test (ports add_patch_tests::install_only_succeeds_with_any_signature_if_no_public_key). Plus two scenario tests that didn't have direct coverage: - rolled_back_patch_not_resurrected_when_replacement_fails: ports fall_back_tests::rollback_then_failed_replacement_does_not_resurrect_rolled_back_patch against the new state-based fallback (cleanup forgets, recompute won't promote a None state). - validate_then_promote_catches_corrupted_last_booted: ports validate_next_boot_patch_tests::does_not_fall_back_to_last_booted_patch_if_corrupted with a note: new code catches the corruption at the next validate pass instead of at fall-back time, but the user-visible outcome (boot the base release) is the same. Test helper split: `install_state` (writes Installed without touching pointers) vs callers explicitly setting pointers. Avoids the prior helper's auto-promote masking pointer-management behavior the tests were trying to exercise. Tests deliberately not ported (with reasoning): - debug_tests::manage_patches_is_debug, patch_manager_is_debug: Trait-Debug tests for types that no longer exist. Replaced implicitly by `#[derive(Debug)]` on PatchLifecycle. - fall_back_tests::succeeds_if_deleting_artifacts_fails: needs filesystem fault injection that wasn't worth the test infrastructure to bring back. - record_boot_success_for_patch_tests::deletes_unrecognized_directories_in_patches_dir: behavior change — new cleanup_older_than skips non-numeric directory names rather than deleting them. Defensive; the prior `rm -rf` was overly aggressive. Coverage: 222 → 230 tests, all green under -D warnings. * test: port the two patch_manager tests I'd handwaved Both turned out trivial to port — neither was actually doing fault injection, just exploiting graceful-failure paths. - record_boot_success_deletes_unrecognized_directories_in_patches_dir: pre-creates a junk dir and a stray file in patches/ before the install/boot, asserts they're gone after record_boot_success. Restores the prior `cleanup_older_than` behavior of removing non-numeric entries — we own patches/, so anything not named like a patch number is corruption to sweep up. - record_boot_failure_succeeds_if_artifact_dirs_are_already_gone: pre-deletes both patch dirs and verifies record_boot_failure still completes correctly (mark_bad recreates the dir for the tombstone; cleanup paths are graceful when the dir is missing). Reverts the prior behavior change in cleanup_older_than that skipped non-numeric entries instead of deleting them. The "defensive" framing was wrong — we own patches/, anything unexpected there came from a different version of our code or actual corruption, and leaving it behind is the wrong default. * test: port the remaining patch_manager tests with real coverage gaps - validate_next_boot_strict_mode_succeeds_with_valid_signature: ports `patch_manager.rs::validate_next_boot_patch_tests::strict_mode_succeeds_with_valid_signature_at_boot_time`. Required reusing the prior test fixture's INFLATED_PATCH_HASH / INFLATED_PATCH_SIGNATURE constants (a real signature for the 1-byte file content `b"1"`, matching TEST_PUBLIC_KEY's private half). - record_boot_failure_keeps_last_booted_pointer_when_failed_patch_was_last_booted: ports `patch_manager.rs::record_boot_failure_for_patch_tests::preserves_last_booted_patch_on_failure_but_marks_bad`. The new behavior is similar but more explicit: last_booted's *pointer* is preserved (with the patch now in Bad state); the underlying intent — "we still know what last booted, even after that patch failed" — survives. Test helper split: `install_signed` (failure paths, mismatched hash OK) vs `install_with_valid_signature` (happy path, real fixture). Now every behavioral test from the deleted patch_manager.rs has a corresponding new test with a porting comment. Two trivial ones not ported: the `Debug` impl tests for traits/types that no longer exist. 236 tests, all green under -D warnings. * refactor: restore separate download directory under OS cache root Reverts an unintended policy change in the cutover: my refactor moved compressed download bytes from `{code_cache_dir}/downloads/{N}` (the OS-managed cache dir, evictable under storage pressure) into `{storage_dir}/patches/{N}/download` (persistent app storage). That made downloads count against persistent storage and survive in iCloud backups — neither of which we want for transient bytes. Restoring the original split: - `{state_root}/patches/{N}/{state.json,dlc.vmcode}` — persistent state and the installed artifact. - `{download_root}/{N}` — flat, in the OS cache dir. `PatchLifecycle::load_or_default` now takes both roots. The two artifact-path helpers each route to their respective root, the free-function variants used by `update_internal` now take the right root, and `mark_bad`/`cleanup`/`record_install_complete` clean both locations as appropriate. `UpdaterState::create_new_and_save` also wipes `download_dir` on release-version change, in addition to the persistent patches/ tree. `update_internal` and tests updated to use the right root for each file type. Test fixtures pass `tmp.path().join("downloads")` as the download root (separate subdir of the same TempDir). Net change in test surface: moved a handful of `patch_dir/download` references to `downloads_dir/N`, no behavioral changes to tests themselves. 236 tests pass under -D warnings. * test: cover the new download_root cleanup paths Self-review caught two gaps in the download-root split: - release_version_change_wipes_download_dir: the cache-rooted download dir should be wiped along with the persistent patches/ tree on a release-version mismatch. Adds explicit coverage — the code path was already in create_new_and_save but not directly tested. - record_boot_success_promotes_and_cleans_older: extends the existing test to drop a stale download in download_root alongside an older patch's state, asserts that cleanup_older_than's chain (cleanup → forget_dir) deletes both roots' artifacts. Also adds a comment on cleanup_older_than noting that it only walks the persistent patches/ tree — orphan downloads (no state.json) would persist until the OS evicts them. Noted as a known limitation, not blocking. * refactor: targeted wipe + legacy patches_state.json + orphan-download walk Two fixes to the release-version-change wipe and one to the boot- success cleanup walk. - The "wipe everything in cache_dir" approach broke 27 tests whose setup uses cache_dir as the test temp dir (with base.apk and libapp.so as siblings). Production gives shorebird a dedicated subdir but the API doesn't enforce that. Switched to a targeted wipe with a documented allowlist of paths shorebird has ever written under cache_dir: `SHOREBIRD_OWNED_PATHS`. - Added `patches_state.json` to the wipe list — that's the legacy file from the prior `PatchManager` implementation, and devices upgrading through this PR would otherwise orphan it. New test `release_version_change_wipes_legacy_patches_state_json` covers it. - `cleanup_older_than` now also calls `cleanup_orphan_downloads`, which walks `download_root/` and removes any file that doesn't correspond to a patch in `Downloading`/`Downloaded` state. We own the cache root and shouldn't rely on OS eviction. Catches the "state.json gone but download lingered" case the prior comment handwaved. Adding a new file under cache_dir means adding it to SHOREBIRD_OWNED_PATHS — small bookkeeping cost, but it lets us keep cache_dir co-tenant with embedder-provided files. * test: cover all four branches of cleanup_orphan_downloads The new orphan-download sweep distinguishes four cases but only the "older patch via cleanup chain" branch was being exercised. Single test now drops one of each kind of file into download_root before a boot success and asserts which survive: - orphan (numeric, no state.json): removed - stale (numeric, state is Installed): removed - non-numeric name: removed - live (numeric, state is Downloading): preserved * ci: add 'embedder' to cspell dictionary * refactor: drop Installed.hash, fix mark_bad ordering, address bdero review bdero's review on #352 surfaced six items worth landing inline: 1. record_boot_failure used to clear `currently_booting_patch` before `mark_bad`. A crash between the two left the patch `Installed` with no breadcrumb, so `detect_boot_crash_on_init` wouldn't fire next boot — the patch silently retried. Flipped: mark_bad first, then clear. mark_bad on Bad is idempotent, so the worst case is a redundant tombstone-rewrite on next init. 2. Drop the `hash` field from `PatchState::Installed`. It was recorded at install but never read at boot — Strict-mode validation recomputes the hash from the artifact's bytes and feeds it into `check_signature`. A hash that lives only on disk can't be trusted as a security input; deleting the field removes the temptation. `Downloading.hash`/`Downloaded.hash` stay as comparators against the server's freshly-delivered hash (a tampered on-disk hash there just causes a redownload). 3. Port the customer-scenario test from #356 into `c_api/mod.rs`: server rolls patch 1 back, then forward again with the same number and hash. Pre-lifecycle this was permanently dropped via `known_bad_patches`. Post-lifecycle `cleanup` is state-aware and forgets the patch entirely on server-driven rollback, so the rollforward installs cleanly. End-to-end coverage of the regression #356 hotfixed. 4. Add a TODO + target version on the `patches_state.json` entry in `SHOREBIRD_OWNED_PATHS` so the legacy wipe doesn't outlive its usefulness. 5. Document that `running_patch` lives in `config.rs` (session-scoped), not in this module — saves a future grep. 6. Clarify `recompute_next_boot`'s fallback policy: it's fall-back-to-`last_booted_patch`-only, not fall-back-to-anything- bootable. A fresh release whose first Installed patch fails validation goes to base, even if older patches sit in `patches/`. cargo test: 242 passed (was 241).
bdero
added a commit
to shorebirdtech/shorebird
that referenced
this pull request
May 8, 2026
Word appears in a comment on rollforwardPatch in code_push_client_wrapper — 'the server resends the same patch artifact on rollforward'. Same fix as shorebirdtech/updater#356 needed in this repo.
bdero
added a commit
to shorebirdtech/shorebird
that referenced
this pull request
May 8, 2026
Word appears in a comment on rollforwardPatch in code_push_client_wrapper — 'the server resends the same patch artifact on rollforward'. Same fix as shorebirdtech/updater#356 needed in this repo.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Customer report: after rolling a patch back and then rolling it forward from the dashboard, the device silently refuses to install the re-offered patch.
checkForUpdatereturnsupToDatewhile the server still says a patch is available, and FRC-driven notification logic that comparescurrentPatchagainst a target patch number loops on a force-restart prompt that nothing can satisfy.The bug
PR #348's review-feedback commit (59bb984) added
known_bad_patches.insert(patch_number)toremove_patchto defend against an orthogonal resurrection-via-failed-delete race. That set is a permanent ban for the currentrelease_version, so the same number also stays banned through:rollforward.dart:38-42flipsis_rolled_backback tofalseon the sameapp_release_patch_row, so the server resends exactly the samePatch(same number, same hash).release_version. This is the workflowshorebird_cli/lib/src/commands/release/release_command.dart:518itself prescribes when bumping Flutter. New patches start at ci: add rust workflow #1 again with new hashes but inherit the old release's bans on those numbers.The customer log line
[shorebird] Patch 1 has previously failed to boot, skippingis misleading — the patch never boot-failed, the rollback signal poisoned the number.The fix
In
remove_patch, droplast_booted_patchwhen its number matches the rolled-back patch instead of inserting intoknown_bad_patches. That keeps the resurrection defense — the else-if fallback intry_fall_back_from_patchhas no target to promote — and also stops banning the number for any later legitimate reuse.Safe to clear
last_booted_patchhere becauserunning_patch(added in #348) is session-scoped and is now the FFI's source of truth for "what's running this process," so dropping the persisted historical pointer doesn't regress the patch-to-release rollback fix.last_booted_patchtry_fall_back_from_patch. Dropped on server rollback of the same number.next_boot_patchNone= base release.running_patchknown_bad_patchesVerification
rollforward_after_rollback_can_install_same_patchinlibrary/src/c_api/mod.rs. Reproduces the customer scenario at the C API level: install patch 1, server rolls patch 1 back, server rolls patch 1 forward (same hash, emptyrolled_back). Pre-fix:check_for_downloadable_updatereturns false and the log line matches the customer's logs. Post-fix: returns true and the patch is queued for next boot.rollback_then_failed_replacement_does_not_resurrect_rolled_back_patchto assert the property the test name promises (patch 1 is not promoted tonext_boot_patch) plus the new invariant (is_known_bad_patch(1)is false, so a rollforward can install it). The original assertion was checking the implementation mechanism rather than the behavioral property.cargo clippy --lib --testsclean.cargo fmt --checkclean.Out of scope (follow-up)
This PR doesn't fix the broader weakness that
known_bad_patchesis keyed bypatch_numberalone — a number that isn't globally unique (it can refer to entirely different artifacts under the samerelease_versionif the release is deleted-and-recreated). The wire response already carriesPatch.hashand the server-side patch id, so a follow-up can key the bad-patch set by hash and make this whole class of bug impossible by construction.Test plan
origin/main, passes on this branchcargo test --lib235 pass / 0 failcargo clippy --lib --testscleancargo fmt --checkcleanRefs: shorebirdtech/shorebird#3728, #348