fix: [SDK-4336] guard IndexedDB Options writes from iOS Safari PWA wedge #1468
sherwinski wants to merge 9 commits into
iOS 26 Safari PWA can leave the `ONE_SIGNAL_SDK_DB.Options` object store in a state where every `readwrite` request stalls indefinitely after the user navigates back into the PWA following a push subscription. The request never fires `success`, `error`, `abort`, or `complete`, so `OneSignal.init()` blocks on the first Options `put` until WebKit's internal transaction watchdog finally aborts it ~30 minutes later.

Reads on the same connection still work, `readwrite` on other stores still works, and reopening the database does not clear the wedge: only the `Options` store readwrite path is poisoned. A separate IDB at the same origin is unaffected, so this is a per-database WebKit bug, not the NetworkProcess crash family already tracked in WebKit bugs 273827 / 277615 / 309386.

Wrap `db.put`/`db.delete` on `Options` with a 1.5s hard timeout. On timeout, log a `[SDK-4336]` warning and resolve the promise as a no-op so init can continue. Other stores keep their existing behavior.

The values written to `Options` are non-critical session metadata (`pageTitle`, `persistNotification`, webhook URLs, click-handler config, `lastPushToken`, `isPushEnabled`, etc.) that the service worker reads with sensible fallbacks if missing or stale, so push delivery remains unaffected.
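A minimal sketch of the timeout wrapper described above, under stated assumptions: `withWriteTimeout`, its parameters, and the injectable `warn` callback are hypothetical names chosen for illustration, not the SDK's actual helper. It resolves to `undefined` when the write does not settle in time and ignores any late settlement of the underlying request.

```typescript
// Hypothetical helper for illustration; the name and signature are not the SDK's.
const OPTIONS_WRITE_TIMEOUT_MS = 1500;

function withWriteTimeout<T>(
  op: () => Promise<T>,
  label: string,
  timeoutMs: number = OPTIONS_WRITE_TIMEOUT_MS,
  warn: (msg: string) => void = console.warn,
): Promise<T | undefined> {
  return new Promise<T | undefined>((resolve, reject) => {
    let settled = false;

    // If the write has not settled by the deadline, resolve as a no-op.
    const timer = setTimeout(() => {
      if (settled) return;
      settled = true;
      warn(`[SDK-4336] db.${label} timed out after ${timeoutMs}ms`);
      resolve(undefined);
    }, timeoutMs);

    op().then(
      (value) => {
        if (settled) return; // late settlement after the timeout: ignore
        settled = true;
        clearTimeout(timer);
        resolve(value);
      },
      (err) => {
        if (settled) return; // late rejection after the timeout: ignore
        settled = true;
        clearTimeout(timer);
        reject(err);
      },
    );
  });
}
```

A caller would wrap only the `Options` writes, for example `await withWriteTimeout(() => db.put('Options', entry), 'put(Options)')`, and treat an `undefined` result as "not persisted".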
Once a single Options `readwrite` request times out we know the store is poisoned for the rest of the page's lifetime: fresh connections inherit the same WebKit lock state, and we have no signal that would let us probe whether the wedge has cleared mid-session. Today every remaining Options write in `initSaveState` + `saveInitOptions` still arms its own 1.5s timer and walks to the timeout independently, which adds up to ~12s of init latency on the first navigation back into a wedged PWA.

Add a module-scoped `optionsWriteWedged` flag. When the first Options write times out, set the flag and resolve subsequent Options writes as no-ops immediately, logging a `[SDK-4336]` warning so the skip is visible in telemetry. The flag is page-scoped (resets on navigation), so a subsequent navigation will probe the wedge fresh with the regular timeout.

With this in place, init on a wedged page completes in ~1.5s instead of ~12s.
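The circuit breaker can be sketched as follows. This is an illustration only: `createOptionsWriteGuard` is a hypothetical factory (the commit itself describes a module-scoped flag), used here so the page-scoped state is easy to reset. The ~12s figure is consistent with eight sequential writes each paying 1.5s; with the breaker, only the first write waits for the timer.

```typescript
// Hypothetical factory for illustration; the commit describes a module-scoped flag instead.
function createOptionsWriteGuard(
  timeoutMs: number = 1500,
  warn: (msg: string) => void = console.warn,
) {
  let wedged = false; // page-scoped: a fresh page load starts un-wedged

  return function guard<T>(label: string, op: () => Promise<T>): Promise<T | undefined> {
    if (wedged) {
      // Breaker already tripped: skip the write without arming another timer.
      warn(`[SDK-4336] db.${label} skipped (Options store wedged)`);
      return Promise.resolve(undefined);
    }
    let timer: ReturnType<typeof setTimeout> | undefined;
    const timeout = new Promise<undefined>((resolve) => {
      timer = setTimeout(() => {
        wedged = true; // trip the breaker for the rest of the page's lifetime
        warn(`[SDK-4336] db.${label} timed out`);
        resolve(undefined);
      }, timeoutMs);
    });
    // Whichever settles first wins; the timer is always cleared afterwards.
    return Promise.race([op(), timeout]).finally(() => clearTimeout(timer));
  };
}
```

With eight hung writes issued in sequence, the first one waits `timeoutMs` and the remaining seven resolve immediately without ever invoking the underlying write.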
The first SDK-4336 commit only protected `Options` writes, but on-device
verification (logs12.txt) showed that once init completes, the
`OperationRepo` queue still wedges: `_executeOperations` awaits a
`db.put('operations', ...)` (or a downstream model-store `_persist`)
that never settles, leaving `runningOps = true` forever and spamming
`Ops in progress` every 500ms. This is the same iOS 26 Safari PWA
WebKit lock poisoning we saw on `Options`, just affecting different
stores once init is no longer the first thing to write.
Generalize the workaround:
- Rename `optionsWriteWedged` → `readwriteWedged` and apply the timeout
+ circuit breaker to every readwrite op (`put`, `delete`, `clear`),
not just `Options`.
- Once any readwrite times out, mark the DB readwrite path wedged for
the rest of the page's lifetime. All subsequent readwrites
short-circuit to a no-op resolve, with a `[SDK-4336]` warning logged
for telemetry.
- Reads (`get`, `getAll`) and `objectStoreNames`/`close` are unchanged.
The values we drop on a wedged page are either session metadata the
service worker re-derives from network state on the next visit, or
queued operations whose effects (subscription create/update/delete,
identity changes, etc.) are idempotent server-side and will be
re-attempted on the next page load. The alternative is letting the
operation queue spin forever, which is materially worse.
This reverts commit 3c41181.
…erge)

Adds the reproducible demo we used to verify the SDK-4336 fix on a real iOS Safari PWA. Lets a reviewer reproduce the original 30-minute init hang on `main` and confirm the fix branch resolves it.

What's included:
- `preview/pageA.html`, `preview/pageB.html`: minimal two-page sandbox with a Register button on Page A, designed to exercise the navigation-after-push-subscription flow described in the ticket. Page A persists `app_id` to `localStorage` so subsequent in-page navigations don't lose it (the in-page links don't carry the query string, which would otherwise produce `InvalidAppIdError`).
- Both pages set `apple-mobile-web-app-capable` so they install as a standalone PWA when added to the Home Screen.
- `preview/OneSignalSDKWorker.js` and `preview/push/onesignal/OneSignalSDKWorker.js` now `importScripts` from `self.location.origin` instead of a hardcoded `https://localhost:4001`, so the worker resolves correctly when the dev server is exposed via an ngrok / Cloudflare tunnel for on-device testing.
- `preview/vite.config.ts` disables HMR (the WebSocket can't reach a device through a tunnel and floods the console with unhandled-rejection spam from `ws.send`), strips Vite's auto-injected `/@vite/client` from HTML responses for the same reason, and sends `Cache-Control: no-store` for SDK assets and HTML so iOS Safari / PWA doesn't pin a stale build during a debug session.

This commit is intentionally **not** part of the SDK fix; it should be reverted before merging the PR. Kept in branch history so we can re-introduce it if SDK-4336 surfaces again.
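As a rough illustration of the worker change, the script URL can be derived from the worker's own origin rather than a hardcoded host. The helper name `workerScriptUrl` and the path `/sdk/worker.js` are placeholders invented for this sketch; the real SDK asset path is not shown in this commit message.

```typescript
// Placeholder path for illustration; the preview worker's real SDK path is not shown here.
const SDK_WORKER_PATH = '/sdk/worker.js';

// Resolve the SDK script against whatever origin served the worker
// (localhost, an ngrok host, a Cloudflare tunnel, ...).
function workerScriptUrl(origin: string, path: string = SDK_WORKER_PATH): string {
  return new URL(path, origin).href;
}

// Inside a service worker this would be used roughly as:
//   importScripts(workerScriptUrl(self.location.origin));
```

Because the origin is read at runtime, the same worker file works unchanged whether the dev server is reached directly or through a tunnel.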
Testing instructions for reviewers
The third commit ( You'll need:
One-time setup
# Repo
git checkout sherwin/sdk-4336
# Build the dev SDK against your tunnel host so the iOS device fetches assets
# from the right origin. Replace the value with your own ngrok host.
BUILD_ORIGIN=<your-tunnel-host>.ngrok-free.app NO_DEV_PORT=true vp run build:dev-prod
# In two separate terminals:
SDK_ENV=dev vp dev --filter @onesignal/preview # serves on https://localhost:4001
ngrok http https://localhost:4001 # exposes that to the public
Reproduce the bug on
Shortens the two `Log._warn` strings in `withOptionsWriteTimeout` (still tagged with `[SDK-4336]`) and bumps the `page.es6.js` and `sw.js` size-limit entries to fit the circuit-breaker code added by the SDK-4336 fix.
Note: we'll also roll back the
@claude review
// webhook URLs, click-handler config, `lastPushToken`, `isPushEnabled`,
// etc.) that the service worker reads with sensible fallbacks if
// missing or stale.
const OPTIONS_WRITE_TIMEOUT_MS = 1500;
How'd you settle on this value?
Just by tinkering with it and trying to find a balance between letting it hang too long and letting a non-poisoned state operate normally. We can increase it, since either way it will be a lot better than the current ~30 min it takes to resolve itself.
So what happens if it does fail, do operations just linger?
On timeout we resolve the wrapper promise with undefined and trip the breaker, so the JS caller proceeds. The underlying IDBRequest itself isn't cancellable — it stays queued in WebKit's transaction queue and eventually settles (~30 min later, when the OS watchdog aborts the wedged transaction). By then the page is usually unloaded. Our op().then(...) handler has an if (settled) return; guard, so the eventual settlement is a no-op for us — no unhandled rejection, no late state mutation. Net effect on the page: zero; one zombie IDB request lingers on the WebKit-internal side until the OS reaps it.
Combine `withOptionsWriteTimeout` and the per-method `if (storeName === 'Options')` branches into a single `guardOptionsWrite(storeName, label, op)` helper, and condense the explanatory comment block. Also drop the `[SDK-4336]` prefix from runtime warnings — the messages stand on their own and the ticket is captured in the commit log.
Two refinements to the Options-store guard:
1. `db.put`/`db.delete` now `await dbPromise` *before* invoking `guardOptionsWrite`, so the timeout scopes only to the readwrite request itself. Previously the 1500ms budget covered both DB open/upgrade and the put, so a slow `open()` (cross-tab `blocked` event during a schema upgrade, `terminated()` callback re-opening, or v5/v6 migrations on slow hardware) could false-trip the breaker and silently drop subsequent Options writes for the page lifetime.
2. Export `isOptionsWriteWedged()` and use it from `initSaveState` to defer the new-appId commit when the Options reset got circuit-broken mid-flight. Without this, the `Ids.appId` write (unguarded, since the guard is Options-only) would succeed while the previous app's `isPushEnabled` / `lastPushId` / `lastPushToken` / `lastOptedIn` stayed put, and the `previousAppId !== appId` gate would keep the reset branch from re-entering on later loads, leaving cross-app contamination permanent. Skipping the commit lets a future non-wedged load complete the reset.
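Both refinements can be sketched together. Everything below is a simplified stand-in written for illustration: `guardOptionsWrite` and `isOptionsWriteWedged` borrow the names from this commit but use a reduced signature, and `Db`, `putOption`, and `commitAppIdAfterReset` are hypothetical.

```typescript
// Simplified stand-ins for illustration; signatures differ from the SDK's.
type Db = { put(store: string, value: unknown): Promise<void> };

let optionsWriteWedged = false;
const isOptionsWriteWedged = (): boolean => optionsWriteWedged;

function guardOptionsWrite<T>(op: () => Promise<T>, timeoutMs: number): Promise<T | undefined> {
  if (optionsWriteWedged) return Promise.resolve(undefined);
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<undefined>((resolve) => {
    timer = setTimeout(() => {
      optionsWriteWedged = true;
      resolve(undefined);
    }, timeoutMs);
  });
  return Promise.race([op(), timeout]).finally(() => clearTimeout(timer));
}

// Refinement 1: await the (possibly slow) open first, so the timer
// covers only the readwrite request itself.
async function putOption(dbPromise: Promise<Db>, value: unknown, timeoutMs: number = 1500) {
  const db = await dbPromise;
  return guardOptionsWrite(() => db.put('Options', value), timeoutMs);
}

// Refinement 2: if the Options reset was circuit-broken, skip the appId
// commit so a later non-wedged load re-enters the reset branch.
async function commitAppIdAfterReset(
  resetOptions: () => Promise<unknown>,
  commitAppId: () => Promise<void>,
): Promise<boolean> {
  await resetOptions();
  if (isOptionsWriteWedged()) return false;
  await commitAppId();
  return true;
}
```

In this sketch a database that takes longer than the timeout to open does not trip the breaker, while a hung `Options` write does, and the appId commit is then skipped.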
// on later loads, leaving the stale values permanent. Skipping the
// appId commit instead lets a future non-wedged load complete the reset.
if (isOptionsWriteWedged()) {
  Log._warn('App ID change reset deferred; will retry on next non-wedged load');
avoid overusing logs, they bloat sdk size
async delete<K extends IDBStoreName>(storeName: K, key: IndexedDBSchema[K]['key']) {
  return (await dbPromise).delete(storeName, key);
  const _db = await dbPromise;
  return guardOptionsWrite(storeName, `delete(${storeName}/${String(key)})`, () =>
isnt key always string
): Promise<T | undefined> {
  if (storeName !== 'Options') return op();
  if (optionsWriteWedged) {
    Log._warn(`db.${label} skipped (Options store wedged)`);
maybe dont need this log
// request itself: DB open/upgrade time is awaited by callers before they
// hand us a sync closure, so a slow `open()` (e.g. cross-tab `blocked` event,
// schema migration on a slow device) won't false-trip the breaker.
function guardOptionsWrite<T>(
could do to optimize for bundle size:
function guardOptionsWrite<T>(
  storeName: IDBStoreName,
  label: string,
  op: () => Promise<T>,
): Promise<T | undefined> {
  if (storeName !== 'Options') return op();
  if (optionsWriteWedged) {
    return Promise.resolve(undefined);
  }
  let timer: ReturnType<typeof setTimeout>;
  const timeout = new Promise<undefined>((resolve) => {
    timer = setTimeout(() => {
      optionsWriteWedged = true;
      Log._warn(`db.${label} timed out`);
      resolve(undefined);
    }, OPTIONS_WRITE_TIMEOUT_MS);
  });
  return Promise.race([op(), timeout]).finally(() => clearTimeout(timer));
}
would fadi/sdk-4336-options-write-timeout be enough?
Description
1 Line Summary
Stops `OneSignal.init()` from hanging for ~30 minutes on iOS 26 Safari PWAs by capping `Options`-store readwrite operations with a 1.5s hard timeout and tripping a page-scoped circuit breaker once the wedge is detected.
Note: A bug report (315804) was filed upstream with WebKit to investigate this.
Details
Root cause
On iOS 26 Safari running as a Home-Screen PWA (`display: standalone`), the navigation back into the app after a successful push subscription leaves the `ONE_SIGNAL_SDK_DB` IndexedDB in a poisoned state where every `readwrite` request on the `Options` object store stalls indefinitely. The `IDBRequest` is created and goes to `pending`, but no event ever fires: no `success`, no `error`, no transaction `complete`/`abort`, no `IDBDatabase.onclose`. WebKit's internal transaction watchdog only forcibly aborts the wedged transaction after roughly 30 minutes; until then `await db.put('Options', ...)` never settles.

`OneSignal.init()` always writes to `Options` very early: `initSaveState` writes `pageTitle`, `saveInitOptions` writes 7+ entries (webhook URLs, persistNotification, click-handler config, `lastPushToken`, `isPushEnabled`, etc.). Without this guard, the very first of those writes blocks init forever and the support team observes "init hanging for 30 minutes, then eventually recovers", which is exactly the watchdog timer.

What we ruled out and how:
- … `UnknownError: Connection to Indexed Database server lost` and fire `IDBDatabase.onclose`. We never see either.
- … `TransactionInactiveError` synchronously. We throw nothing.
- … `IDBRequest` is dead.
- … `get`, `getAll`) on the same DB connection still work. `readwrite` on other stores still works. Closing and re-opening the database returns a fresh `IDBDatabase` whose first `readwrite` on `Options` wedges identically.
- … `readwrite put` completed in 11 ms while the real DB was hung. The wedge is per-database, not origin-wide.

Filed upstream as WebKit bug 315804 ("A readwrite IDBTransaction never fires oncomplete, onerror, or onabort after the user subscribes to web push in an installed PWA") with a link back to this PR for steps on how to reproduce. The source comment in `client.ts` also references the bug so the workaround can be removed once WebKit ships a fix.
Fix
Three commits, each independently revertable:
- `81415fdd` (fix: fail-fast Options writes). Wrap `db.put('Options', ...)` and `db.delete('Options', ...)` with a 1500ms hard timeout. On timeout, log a `[SDK-4336]` warning and resolve the promise as a no-op so init proceeds. Other stores keep their existing behavior. The values that don't get persisted are session metadata that the SW reads with sensible fallbacks if missing or stale, so push delivery is unaffected.
- `1ae8abdf` (perf: short-circuit after first wedge). Once a single `Options` `readwrite` times out we know the store is poisoned for the rest of this page's lifetime. Add a module-scoped `optionsWriteWedged` flag so the remaining 7 Options writes in `initSaveState` + `saveInitOptions` short-circuit immediately instead of each independently paying the 1500ms timeout. Cuts init latency on a wedged page from ~12s to ~1.5s. The flag is page-scoped (resets on navigation), so a subsequent navigation will probe the wedge fresh.
- `88e4cd59` (chore(preview): repro sandbox, will be reverted before merge). Two-page demo + dev-server hardening that lets a reviewer reproduce the original 30-minute hang on `main` and confirm this branch resolves it. Detailed testing instructions in a separate PR comment.

A 4th commit (`3c41181b`) generalizing the timeout to all readwrite stores was tried and reverted (`83cbff87`) because we couldn't validate it on device: every subsequent on-device repro happened with an empty operation queue. Parked in branch history with the validation steps captured in a Linear comment on SDK-4336.
Systems Affected
Validation
Tests
Info
- … `client.test.ts` round-trip tests (every `Options` write goes through the new wrapper, so all 12 client tests are effectively also covering the timeout path's happy case).
- … `[SDK-4336] db.put(Options) timed out … Tripping circuit breaker` warning and 7 follow-up `db.put(Options) skipped` warnings, then `internalInit` → SW handshake → `sessionInit` proceed normally.
- `client.test.ts` uses `fake-indexeddb`, which doesn't reproduce the WebKit-specific wedge; a synthetic timeout test would only verify the timer plumbing, which is straightforward enough that the on-device evidence is more meaningful.
Checklist
Programming Checklist
Interfaces:
Functions:
Typescript:
Other:
… `elem of array` syntax. Prefer `forEach` or use `map` … `context` if possible.
Screenshots
Info
N/A — runtime correctness fix, no UI changes. See on-device console logs in the SDK-4336 Linear ticket and chat history.
Checklist
Related Tickets
SDK-4336