
fix: [SDK-4336] guard IndexedDB Options writes from iOS Safari PWA wedge#1468

Open
sherwinski wants to merge 9 commits into main from sherwin/sdk-4336

Conversation

@sherwinski
Contributor

@sherwinski sherwinski commented May 28, 2026

Description

1 Line Summary

Stops OneSignal.init() from hanging for ~30 minutes on iOS 26 Safari PWAs by capping Options-store readwrite operations with a 1.5s hard timeout and tripping a page-scoped circuit breaker once the wedge is detected.

Note: WebKit bug 315804 was filed upstream to investigate this.

Details

Root cause

On iOS 26 Safari running as a Home-Screen PWA (display: standalone), the navigation back into the app after a successful push subscription leaves the ONE_SIGNAL_SDK_DB IndexedDB in a poisoned state where every readwrite request on the Options object store stalls indefinitely. The IDBRequest is created and goes to pending, but no event ever fires — no success, no error, no transaction complete / abort, no IDBDatabase.onclose. WebKit's internal transaction watchdog only forcibly aborts the wedged transaction after roughly 30 minutes; until then await db.put('Options', ...) never settles.

OneSignal.init() always writes to Options very early: initSaveState writes pageTitle, and saveInitOptions writes 7+ entries (webhook URLs, persistNotification, click-handler config, lastPushToken, isPushEnabled, etc.). Without this guard, the very first of those writes blocks init until the watchdog fires, which is why the support team observes "init hanging for 30 minutes, then eventually recovers".

What we ruled out and how:

  • NetworkProcess crash family (WebKit bugs 273827 / 277615 / 309386). Those throw UnknownError: Connection to Indexed Database server lost and fire IDBDatabase.onclose. We never see either.
  • Process suspension (202705). That throws TransactionInactiveError synchronously. We throw nothing.
  • PWA total freeze (211018). All other JS keeps running — timers fire, fetch works, the SW heartbeat keeps ticking — only this one IDBRequest is dead.
  • Connection-level wedge. Reads (get, getAll) on the same DB connection still work. readwrite on other stores still works. Closing and re-opening the database returns a fresh IDBDatabase whose first readwrite on Options wedges identically.
  • Origin-wide IDB wedge. A diagnostic probe that opened a different IndexedDB name at the same origin and issued a readwrite put completed in 11 ms while the real DB was hung. The wedge is per-database, not origin-wide.
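The origin-wide check in the last bullet can be sketched as a timed probe. This is a minimal sketch, not the diagnostic that was actually run: `timedProbe` and `ProbeResult` are hypothetical names, and the operation is injected so the same helper could wrap a real readwrite `put` on a scratch database or a stub.

```typescript
// Hypothetical diagnostic helper; names are illustrative, not from the SDK.
type ProbeResult =
  | { status: 'settled'; elapsedMs: number }
  | { status: 'timeout'; elapsedMs: number };

// Runs `op` and reports whether it settled (fulfilled or rejected)
// within `timeoutMs`.
function timedProbe(
  op: () => Promise<unknown>,
  timeoutMs: number,
): Promise<ProbeResult> {
  const start = Date.now();
  return new Promise<ProbeResult>((resolve) => {
    const timer = setTimeout(() => {
      resolve({ status: 'timeout', elapsedMs: Date.now() - start });
    }, timeoutMs);
    const done = (): void => {
      clearTimeout(timer);
      // If the timeout already resolved the promise, this call is ignored.
      resolve({ status: 'settled', elapsedMs: Date.now() - start });
    };
    // A rejection still counts as "settled": the wedge is defined by silence.
    // Starting from Promise.resolve() also routes synchronous throws to `done`.
    Promise.resolve().then(op).then(done, done);
  });
}
```

A healthy store should report `settled` within a few milliseconds, while a wedged one reports `timeout`; running the probe against both the real database and a scratch database separates a per-database wedge from an origin-wide one.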

Filed upstream as WebKit bug 315804 ("A readwrite IDBTransaction never fires oncomplete, onerror, or onabort after the user subscribes to web push in an installed PWA") with a link back to this PR for steps on how to reproduce. The source comment in client.ts also references the bug so the workaround can be removed once WebKit ships a fix.

Fix

Three commits, each independently revertable:

  1. 81415fdd — fix: fail-fast Options writes. Wrap db.put('Options', ...) and db.delete('Options', ...) with a 1500ms hard timeout. On timeout, log a [SDK-4336] warning and resolve the promise as a no-op so init proceeds. Other stores keep their existing behavior. The values that don't get persisted are session metadata that the SW reads with sensible fallbacks if missing or stale, so push delivery is unaffected.

  2. 1ae8abdf — perf: short-circuit after first wedge. Once a single Options readwrite times out we know the store is poisoned for the rest of this page's lifetime. Add a module-scoped optionsWriteWedged flag so the remaining 7 Options writes in initSaveState + saveInitOptions short-circuit immediately instead of each independently paying the 1500ms timeout. Cuts init latency on a wedged page from ~12s to ~1.5s. The flag is page-scoped (resets on navigation), so a subsequent navigation will probe the wedge fresh.

  3. 88e4cd59 — chore(preview): repro sandbox (will be reverted before merge). Two-page demo + dev-server hardening that lets a reviewer reproduce the original 30-minute hang on main and confirm this branch resolves it. Detailed testing instructions in a separate PR comment.
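The first two commits can be sketched together as a timeout wrapper with a one-way breaker. This is an illustration under stated assumptions rather than the code in client.ts: `createWriteGuard`, `WriteGuard`, and the injected `warn` callback are hypothetical, and a factory stands in for the module-scoped flag so the behavior can be exercised in isolation.

```typescript
// Sketch only: hypothetical names, not the SDK's actual helper.
interface WriteGuard {
  guard<T>(label: string, op: () => Promise<T>): Promise<T | undefined>;
  isWedged(): boolean;
}

function createWriteGuard(
  timeoutMs: number,
  warn: (msg: string) => void,
): WriteGuard {
  let wedged = false; // page-scoped in the PR: a new page load starts fresh

  function guard<T>(label: string, op: () => Promise<T>): Promise<T | undefined> {
    if (wedged) {
      // Breaker already tripped: skip without arming another timer.
      warn(`${label} skipped; store is known wedged for this page`);
      return Promise.resolve(undefined);
    }
    return new Promise<T | undefined>((resolve, reject) => {
      let settled = false;
      const timer = setTimeout(() => {
        if (settled) return;
        settled = true;
        wedged = true; // trip the breaker for all later writes
        warn(`${label} timed out after ${timeoutMs}ms`);
        resolve(undefined); // resolve as a no-op so the caller proceeds
      }, timeoutMs);
      Promise.resolve().then(op).then(
        (value) => {
          if (settled) return; // late settlement after a timeout is ignored
          settled = true;
          clearTimeout(timer);
          resolve(value);
        },
        (error: unknown) => {
          if (settled) return;
          settled = true;
          clearTimeout(timer);
          reject(error); // ordinary failures still propagate
        },
      );
    });
  }

  return { guard, isWedged: (): boolean => wedged };
}
```

In this sketch only a timeout trips the breaker; a write that rejects promptly is passed through to the caller unchanged, which matches the description above of other failure modes keeping their existing behavior.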

A 4th commit (3c41181b) generalizing the timeout to all readwrite stores was tried and reverted (83cbff87) because we couldn't validate it on device — every subsequent on-device repro happened with an empty operation queue. Parked in branch history with the validation steps captured in a Linear comment on SDK-4336.

Systems Affected

  • WebSDK
  • Backend
  • Dashboard

Validation

Tests

Info

  • Full suite: 512/512 pass, lint clean, formatter clean.
  • The fix path is exercised by the existing client.test.ts round-trip tests (every Options write goes through the new wrapper, so all 12 client tests are effectively also covering the timeout path's happy case).
  • On-device verification on iPhone running iOS 26.4 (logs11.txt → logs14.txt in the chat): pre-fix init hung indefinitely; post-fix init completes within ~1.5s on the wedged navigation, with a [SDK-4336] db.put(Options) timed out … Tripping circuit breaker warning and 7 follow-up db.put(Options) skipped warnings, then internalInit → SW handshake → sessionInit proceed normally.
  • I haven't added a unit test specifically for the timeout path: the existing client.test.ts uses fake-indexeddb, which doesn't reproduce the WebKit-specific wedge. A synthetic timeout test would only verify the timer plumbing, so the on-device evidence is more meaningful.

Checklist

  • All the automated tests pass or I explained why that is not possible
  • I have personally tested this on my machine or explained why that is not possible
  • I have included test coverage for these changes or explained why they are not needed

Programming Checklist

Interfaces:

  • Don't use default export
  • New interfaces are in model files

Functions:

  • Don't use default export
  • All function signatures have return types
  • Helpers should not access any data but rather be given the data to operate on.

Typescript:

  • No Typescript warnings
  • Avoid silencing null/undefined warnings with the exclamation point

Other:

  • Iteration: refrain from using elem of array syntax. Prefer forEach or use map
  • Avoid using global OneSignal accessor for context if possible.

Screenshots

Info

N/A — runtime correctness fix, no UI changes. See on-device console logs in the SDK-4336 Linear ticket and chat history.

Checklist

  • I have included screenshots/recordings of the intended results or explained why they are not needed

Related Tickets

SDK-4336

iOS 26 Safari PWA can leave the `ONE_SIGNAL_SDK_DB.Options` object store
in a state where every `readwrite` request stalls indefinitely after the
user navigates back into the PWA following a push subscription. The
request never fires `success`, `error`, `abort`, or `complete`, so
`OneSignal.init()` blocks on the first Options `put` until WebKit's
internal transaction watchdog finally aborts it ~30 minutes later. Reads
on the same connection still work, `readwrite` on other stores still
works, and reopening the database does not clear the wedge — only the
`Options` store readwrite path is poisoned. A separate IDB at the same
origin is unaffected, so this is a per-database WebKit bug, not the
NetworkProcess crash family already tracked in WebKit bugs 273827 /
277615 / 309386.

Wrap `db.put`/`db.delete` on `Options` with a 1.5s hard timeout. On
timeout, log a `[SDK-4336]` warning and resolve the promise as a no-op
so init can continue. Other stores keep their existing behavior. The
values written to `Options` are non-critical session metadata
(`pageTitle`, `persistNotification`, webhook URLs, click-handler
config, `lastPushToken`, `isPushEnabled`, etc.) that the service worker
reads with sensible fallbacks if missing or stale, so push delivery
remains unaffected.

Once a single Options `readwrite` request times out we know the store
is poisoned for the rest of the page's lifetime — fresh connections
inherit the same WebKit lock state, and we have no signal that would
let us probe whether the wedge has cleared mid-session. Today every
remaining Options write in `initSaveState` + `saveInitOptions` still
arms its own 1.5s timer and walks to the timeout independently, which
adds up to ~12s of init latency on the first navigation back into a
wedged PWA.

Add a module-scoped `optionsWriteWedged` flag. When the first Options
write times out, set the flag and resolve subsequent Options writes as
no-ops immediately, logging a `[SDK-4336]` warning so the skip is
visible in telemetry. The flag is page-scoped (resets on navigation),
so a subsequent navigation will probe the wedge fresh with the regular
timeout. With this in place, init on a wedged page completes in ~1.5s
instead of ~12s.

The first SDK-4336 commit only protected `Options` writes, but on-device
verification (logs12.txt) showed that once init completes, the
`OperationRepo` queue still wedges: `_executeOperations` awaits a
`db.put('operations', ...)` (or a downstream model-store `_persist`)
that never settles, leaving `runningOps = true` forever and spamming
`Ops in progress` every 500ms. This is the same iOS 26 Safari PWA
WebKit lock poisoning we saw on `Options`, just affecting different
stores once init is no longer the first thing to write.

Generalize the workaround:

- Rename `optionsWriteWedged` → `readwriteWedged` and apply the timeout
  + circuit breaker to every readwrite op (`put`, `delete`, `clear`),
  not just `Options`.
- Once any readwrite times out, mark the DB readwrite path wedged for
  the rest of the page's lifetime. All subsequent readwrites
  short-circuit to a no-op resolve, with a `[SDK-4336]` warning logged
  for telemetry.
- Reads (`get`, `getAll`) and `objectStoreNames`/`close` are unchanged.

The values we drop on a wedged page are either session metadata the
service worker re-derives from network state on the next visit, or
queued operations whose effects (subscription create/update/delete,
identity changes, etc.) are idempotent server-side and will be
re-attempted on the next page load. The alternative is letting the
operation queue spin forever, which is materially worse.

…erge)

Adds the reproducible demo we used to verify the SDK-4336 fix on a real
iOS Safari PWA. Lets a reviewer reproduce the original 30-minute init
hang on `main` and confirm the fix branch resolves it.

What's included:

- `preview/pageA.html`, `preview/pageB.html` — minimal two-page sandbox
  with a Register button on Page A, designed to exercise the
  navigation-after-push-subscription flow described in the ticket.
  Page A persists `app_id` to `localStorage` so subsequent in-page
  navigations don't lose it (the in-page links don't carry the query
  string, which would otherwise produce `InvalidAppIdError`).
- Both pages set `apple-mobile-web-app-capable` so they install as a
  standalone PWA when added to the Home Screen.
- `preview/OneSignalSDKWorker.js` and `preview/push/onesignal/OneSignalSDKWorker.js`
  now `importScripts` from `self.location.origin` instead of a
  hardcoded `https://localhost:4001`, so the worker resolves correctly
  when the dev server is exposed via an ngrok / Cloudflare tunnel for
  on-device testing.
- `preview/vite.config.ts` disables HMR (the WebSocket can't reach a
  device through a tunnel and floods the console with
  unhandled-rejection spam from `ws.send`), strips Vite's auto-injected
  `/@vite/client` from HTML responses for the same reason, and sends
  `Cache-Control: no-store` for SDK assets and HTML so iOS Safari /
  PWA doesn't pin a stale build during a debug session.
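
The HTML rewrite and cache headers described in the last bullet could look roughly like the helpers below. This is a sketch under assumptions, not the contents of `preview/vite.config.ts`: `stripViteClient` and `devCacheHeaders` are hypothetical names, the injected tag is assumed to have the shape shown in the comment, and the SDK asset prefix is passed in because the real path is not stated here. Disabling HMR itself is a plain config setting (`server.hmr: false` in Vite) and is not shown.

```typescript
// Hypothetical helpers; the real config may be structured differently.

// Removes the dev-client script tag, assuming Vite injects it in this shape:
//   <script type="module" src="/@vite/client"></script>
function stripViteClient(html: string): string {
  return html.replace(
    /<script\b[^>]*\bsrc="\/@vite\/client"[^>]*><\/script>\s*/g,
    '',
  );
}

// Returns extra response headers for a request path. `sdkPrefix` is an
// assumed path prefix for SDK build output, supplied by the caller.
function devCacheHeaders(
  path: string,
  sdkPrefix: string,
): Record<string, string> {
  const pathname = path.split('?')[0]; // ignore the query string
  const isHtml = pathname.endsWith('.html') || pathname.endsWith('/');
  const isSdkAsset = pathname.startsWith(sdkPrefix);
  return isHtml || isSdkAsset ? { 'Cache-Control': 'no-store' } : {};
}
```

Keeping both as pure functions makes them easy to call from a dev-server middleware or plugin hook and to check without starting the server.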

This commit is intentionally **not** part of the SDK fix; it should be
reverted before merging the PR. Kept in branch history so we can
re-introduce it if SDK-4336 surfaces again.

@sherwinski
Contributor Author

Testing instructions for reviewers

The third commit (88e4cd59 chore(preview): add iOS PWA repro sandbox) is included specifically so you can reproduce the bug end-to-end on a real iOS device. It will be reverted before merge — please don't ship it.

You'll need:

  • A real iOS device running iOS 26.x (the bug doesn't surface in the iOS Simulator's WebKit). Mine: iPhone running iOS 26.4.
  • An ngrok / Cloudflare tunnel account so the iOS device can reach your dev machine's HTTPS server.
  • A OneSignal app whose Site URL matches the tunnel host you'll be using, with a custom service-worker integration pointing to push/onesignal/OneSignalSDKWorker.js. (A pre-existing app pointing to https://localhost:4000 will not work — Apple/iOS Web Push enforces an exact origin match against your tunnel host. Easiest path: create a fresh disposable test app in the dashboard.)

One-time setup

# Repo
git checkout sherwin/sdk-4336

# Build the dev SDK against your tunnel host so the iOS device fetches assets
# from the right origin. Replace the value with your own ngrok host.
BUILD_ORIGIN=<your-tunnel-host>.ngrok-free.app NO_DEV_PORT=true vp run build:dev-prod

# In two separate terminals:
SDK_ENV=dev vp dev --filter @onesignal/preview     # serves on https://localhost:4001
ngrok http https://localhost:4001                  # exposes that to the public

Reproduce the bug on main (control)

To confirm the bug is real before validating the fix:

git checkout main
# Repeat the build + dev server commands above against the same tunnel host

  1. On the iPhone, open https://<your-tunnel-host>.ngrok-free.app/pageA.html?app_id=<APP_ID> in Safari.
  2. Tap the Share icon → Add to Home Screen → Add. The web inspector won't follow it once installed; that's fine, we don't need it for the control.
  3. Open the installed PWA from the Home Screen.
  4. Tap Go to Page B → tap Go to Page A → tap Register → accept the system prompt.
  5. Tap Go to Page B → tap Go to Page A.
  6. Page A's body should still appear, but OneSignal.init() will not complete. If you switch your Mac to Safari → Develop → [device name] → [the PWA's web view] before tapping that final Page A link, you'll see the console freeze with no OneSignal initialized (Nms) log line. Leave it for ~30 minutes and it will eventually resume by itself — that's the WebKit watchdog firing.

Verify the fix

git checkout sherwin/sdk-4336
# Rebuild against the same tunnel host
BUILD_ORIGIN=<your-tunnel-host>.ngrok-free.app NO_DEV_PORT=true vp run build:dev-prod

Reset state on the device (delete the PWA from the Home Screen and reinstall, or remove the tunnel host's data in the iOS Settings app under Safari → Advanced → Website Data), then run the same 6-step sequence above.

On the final B → A navigation you should see, in roughly this order:

[Log]    !!!! [SDK-4336 PAGE A] OneSignal initialize
[Debug]  init()
...
[Warning] [SDK-4336] db.put(Options) timed out after 1500ms; IndexedDB Options
          store is wedged (likely iOS Safari PWA after push subscription).
          Tripping circuit breaker; subsequent Options writes on this page
          will be skipped immediately.
[Info]    Set pageTitle to be 'OneSignalWeb'.
[Warning] [SDK-4336] db.put(Options) skipped; Options store is known wedged
          for this page.    (× 7-ish times)
[Debug]   internalInit()
[Info]    Checking SW version...
[Debug]   sessionInit()
[Log]     !!!! [SDK-4336 PAGE A] OneSignal initialized (~1500ms)

The total init time on the wedged navigation should be ~1.5s instead of indefinite. No Ops in progress runaway loop, no exceptions, push delivery still works.

What I confirmed on my device

  • Pre-fix: init never completed in 30+ minutes of waiting.
  • Post-fix (81415fdd only): init completes after ~12s (8 separate Options writes each timing out individually).
  • Post-fix (81415fdd + 1ae8abdf): init completes after ~1.5s (one timeout, the rest short-circuit). This is the shipped behavior.
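
The ~12s and ~1.5s figures follow from simple arithmetic, which can be written out as a back-of-envelope model. It assumes the Options writes are awaited one after another and that every unguarded write on a wedged store runs to the full timeout; `wedgedInitDelayMs` is a hypothetical name used only for this illustration.

```typescript
// Back-of-envelope model, not SDK code.
function wedgedInitDelayMs(
  writeCount: number,
  timeoutMs: number,
  circuitBreaker: boolean,
): number {
  if (writeCount <= 0) return 0;
  // With the breaker, only the first write pays the timeout; the rest skip.
  const timeouts = circuitBreaker ? 1 : writeCount;
  return timeouts * timeoutMs;
}

console.log(wedgedInitDelayMs(8, 1500, false)); // 12000
console.log(wedgedInitDelayMs(8, 1500, true)); // 1500
```

The eight writes are the one that times out plus the seven that are skipped in the log sequence above.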

Captured logs are in the SDK-4336 Linear ticket if you want to compare to your own run.

Pre-merge

Before merging, please drop the sandbox commit:

git rebase -i origin/main      # drop 88e4cd59
git push --force-with-lease

The remaining commits (81415fdd, 1ae8abdf) are the only ones that need to ship.

Comment thread preview/vite.config.ts Dismissed

Shortens the two `Log._warn` strings in `withOptionsWriteTimeout` (still
tagged with `[SDK-4336]`) and bumps the `page.es6.js` and `sw.js`
size-limit entries to fit the circuit-breaker code added by the
SDK-4336 fix.

@sherwinski
Contributor Author

Note: we'll also roll back the [SDK-4336]... logs before merging.

@sherwinski sherwinski requested a review from fadi-george May 28, 2026 21:27
@sherwinski
Contributor Author

@claude review

Comment thread src/shared/database/client.ts Outdated
Comment thread src/shared/database/client.ts Outdated
// webhook URLs, click-handler config, `lastPushToken`, `isPushEnabled`,
// etc.) that the service worker reads with sensible fallbacks if
// missing or stale.
const OPTIONS_WRITE_TIMEOUT_MS = 1500;
Contributor

How'd you settle on this value?

Contributor Author

Just by tinkering with it and trying to find a balance between letting it hang too long and letting a non-poisoned state operate normally. We can increase it, since either way it will be a lot better than the current ~30 min it takes to resolve itself.

Contributor

So what happens if it does fail, do operations just linger?

Contributor Author

On timeout we resolve the wrapper promise with undefined and trip the breaker, so the JS caller proceeds. The underlying IDBRequest itself isn't cancellable: it stays queued in WebKit's transaction queue and eventually settles (~30 min later, when WebKit's transaction watchdog aborts the wedged transaction). By then the page is usually unloaded. Our op().then(...) handler has an if (settled) return; guard, so the eventual settlement is a no-op for us: no unhandled rejection, no late state mutation. Net effect on the page: zero; one zombie IDB request lingers on the WebKit-internal side until the watchdog reaps it.

Combine `withOptionsWriteTimeout` and the per-method `if (storeName ===
'Options')` branches into a single `guardOptionsWrite(storeName, label,
op)` helper, and condense the explanatory comment block. Also drop the
`[SDK-4336]` prefix from runtime warnings — the messages stand on their
own and the ticket is captured in the commit log.

@sherwinski sherwinski requested a review from fadi-george May 28, 2026 23:26
Comment thread src/shared/database/client.ts
Comment thread src/shared/database/client.ts

Two refinements to the Options-store guard:

1. `db.put`/`db.delete` now `await dbPromise` *before* invoking
   `guardOptionsWrite`, so the timeout scopes only to the readwrite
   request itself. Previously the 1500ms budget covered both DB
   open/upgrade and the put, so a slow `open()` (cross-tab `blocked`
   event during a schema upgrade, `terminated()` callback re-opening,
   or v5/v6 migrations on slow hardware) could false-trip the breaker
   and silently drop subsequent Options writes for the page lifetime.

2. Export `isOptionsWriteWedged()` and use it from `initSaveState` to
   defer the new-appId commit when the Options reset got circuit-broken
   mid-flight. Without this, the `Ids.appId` write (unguarded — the
   guard is Options-only) would succeed while the previous app's
   `isPushEnabled` / `lastPushId` / `lastPushToken` / `lastOptedIn`
   stayed put, and the `previousAppId !== appId` gate would keep the
   reset branch from re-entering on later loads — leaving cross-app
   contamination permanent. Skipping the commit lets a future
   non-wedged load complete the reset.
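
The first refinement is about where the timer starts. The contrast can be sketched as below, with the caveat that `withTimeout`, `FakeDb`, `putBudgetIncludesOpen`, and `putBudgetExcludesOpen` are hypothetical names for illustration and not the SDK's helpers.

```typescript
// Sketch only: contrasts the two orderings with a stand-in database type.
function withTimeout<T>(
  op: () => Promise<T>,
  ms: number,
): Promise<T | 'timeout'> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<'timeout'>((resolve) => {
    timer = setTimeout(() => resolve('timeout'), ms);
  });
  return Promise.race([op(), timeout]).finally(() => clearTimeout(timer));
}

interface FakeDb {
  put(value: string): Promise<string>;
}

// Before: the budget covers open + put, so a slow open looks like a wedge.
function putBudgetIncludesOpen(
  dbPromise: Promise<FakeDb>,
  value: string,
  ms: number,
): Promise<string | 'timeout'> {
  return withTimeout(async () => (await dbPromise).put(value), ms);
}

// After: wait for the connection first, then time only the write request.
async function putBudgetExcludesOpen(
  dbPromise: Promise<FakeDb>,
  value: string,
  ms: number,
): Promise<string | 'timeout'> {
  const db = await dbPromise;
  return withTimeout(() => db.put(value), ms);
}
```

With a connection that takes longer to open than the write budget, the first ordering reports a timeout even though the write itself is healthy, while the second lets the write complete.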
// on later loads — leaving the stale values permanent. Skipping the
// appId commit instead lets a future non-wedged load complete the reset.
if (isOptionsWriteWedged()) {
Log._warn('App ID change reset deferred; will retry on next non-wedged load');
Contributor

Avoid overusing logs; they bloat the SDK size.

async delete<K extends IDBStoreName>(storeName: K, key: IndexedDBSchema[K]['key']) {
return (await dbPromise).delete(storeName, key);
const _db = await dbPromise;
return guardOptionsWrite(storeName, `delete(${storeName}/${String(key)})`, () =>
Contributor

Isn't key always a string?

): Promise<T | undefined> {
if (storeName !== 'Options') return op();
if (optionsWriteWedged) {
Log._warn(`db.${label} skipped (Options store wedged)`);
Contributor

@fadi-george fadi-george May 29, 2026

Maybe we don't need this log.

// request itself — DB open/upgrade time is awaited by callers before they
// hand us a sync closure, so a slow `open()` (e.g. cross-tab `blocked` event,
// schema migration on a slow device) won't false-trip the breaker.
function guardOptionsWrite<T>(
Contributor

To optimize for bundle size, we could do:

function guardOptionsWrite<T>(
  storeName: IDBStoreName,
  label: string,
  op: () => Promise<T>,
): Promise<T | undefined> {
  // Only the Options store is guarded; other stores run unchanged.
  if (storeName !== 'Options') return op();
  // Breaker already tripped: skip silently instead of logging per write.
  if (optionsWriteWedged) {
    return Promise.resolve(undefined);
  }
  let timer: ReturnType<typeof setTimeout>;
  const timeout = new Promise<undefined>((resolve) => {
    timer = setTimeout(() => {
      optionsWriteWedged = true;
      Log._warn(`db.${label} timed out`);
      resolve(undefined);
    }, OPTIONS_WRITE_TIMEOUT_MS);
  });
  // Whichever settles first wins; the timer is always cleared afterwards.
  return Promise.race([op(), timeout]).finally(() => clearTimeout(timer));
}

@fadi-george
Contributor

Would fadi/sdk-4336-options-write-timeout be enough?



3 participants