Hotfix: DB/pooler connection resilience (production)#2331

Merged
ae2079 merged 2 commits into master from hotfix/db-connection-resilience-master
Jun 11, 2026
@ae2079 ae2079 commented Jun 11, 2026

Production hotfix → master

Pushes only the DB/pooler connection-resilience fix to production. These commits are cherry-picked from #2330 (already merged to staging); the other in-flight staging changes (Stellar #2329, QF Round actions #2328, etc.) are intentionally excluded.

Why

Prod went down during a DigitalOcean managed-Postgres / PgBouncer connectivity blip ("server login has been failing … (server_login_retry)"). Several issues in the API amplified that transient blip into a hard outage that needed a manual redeploy.

What's included

| File | Change |
| --- | --- |
| `src/utils/globalErrorHandlers.ts` (new) + `src/index.ts` | Keep the API alive on `unhandledRejection`; clean exit on `uncaughtException` (Docker `restart: always` recreates a fresh process) |
| `src/orm.ts` | `idleTimeoutMillis` 500→30000 + `connectionTimeoutMillis` 10000 on both data sources (shared `poolerExtraConfig`); removed dead no-op pool keys |
| `src/server/bootstrap.ts` | Exit on startup failure (skipped under tests) so a DB-unreachable start self-heals via restart instead of becoming a zombie with no HTTP listener |
| `src/sentryLogger.ts` | Disable Sentry's built-in global handlers so ours are the single source of truth (no double-capture / exit race) |
| `config/example.env` | Document pool-size sizing |

⚠️ Paired ops change (not in code)

The over-sized pool lives in the server's vault/production.env: TYPEORM_DATABASE_POOL_SIZE was 97 (~485 connections across 5 processes). Already lowered to 20 in the prod vault — prod must be redeployed/restarted to load it.
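The figures above are pool size multiplied by process count. A minimal sketch of that arithmetic, assuming every process fills its pool at the same time (the function name is hypothetical, and the result is an upper bound that ignores any pool not driven by TYPEORM_DATABASE_POOL_SIZE):

```typescript
// Hypothetical helper, not part of this PR: worst-case number of
// connections the pooler can see if every process fills its pool.
function worstCaseConnections(poolSizePerProcess: number, processCount: number): number {
  const isCount = (n: number): boolean => n >= 0 && Math.floor(n) === n;
  if (!isCount(poolSizePerProcess) || !isCount(processCount)) {
    throw new RangeError('pool size and process count must be non-negative integers');
  }
  return poolSizePerProcess * processCount;
}

const beforeFix = worstCaseConnections(97, 5); // 485, the old prod setting
const afterFix = worstCaseConnections(20, 5); // 100, after lowering to 20
```

Comparing the result against the pooler's connection limit before changing the variable is one way to keep the pool from being over-sized again.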

Testing

tsc, ESLint, Prettier, and the full build pass. Runtime smoke tests confirm that the process stays alive on unhandledRejection, exits with code 1 on uncaughtException, Sentry registers 0 global handlers (filter works), handler registration is idempotent, and the bootstrap startup-exit path works. Reviewed via adversarial multi-dimension review + CodeRabbit.
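The handler behaviour those smoke tests check can be sketched as follows. This is an illustration under assumptions, not the contents of src/utils/globalErrorHandlers.ts: the `ProcessLike` and `HandlerDeps` interfaces, the `report` callback (standing in for logging plus Sentry capture), and the function name are all hypothetical.

```typescript
// Hypothetical sketch of process-level handlers; names are illustrative.
type Listener = (err: unknown) => void;

interface ProcessLike {
  on(event: string, listener: Listener): unknown;
  exit(code?: number): void;
}

interface HandlerDeps {
  proc: ProcessLike;
  // Stand-in for "log + Sentry capture".
  report: (kind: string, err: unknown) => void;
}

// Tracks which process objects already have handlers attached.
const registeredProcs: ProcessLike[] = [];

function registerGlobalErrorHandlers(deps: HandlerDeps): boolean {
  if (registeredProcs.indexOf(deps.proc) !== -1) {
    return false; // idempotent: a second call attaches nothing
  }
  registeredProcs.push(deps.proc);

  // A rejected promise (for example a failed DB query) is reported,
  // but the process keeps serving requests.
  deps.proc.on('unhandledRejection', (reason) => {
    deps.report('unhandledRejection', reason);
  });

  // After an uncaught exception the process state is unknown, so report
  // and exit non-zero; the container restart policy starts a fresh one.
  deps.proc.on('uncaughtException', (err) => {
    deps.report('uncaughtException', err);
    deps.proc.exit(1);
  });

  return true;
}
```

In a real service `proc` would be Node's `process` object, possibly behind a thin adapter; injecting it here keeps the sketch testable without terminating the test runner.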

🤖 Generated with Claude Code

ae2079 and others added 2 commits June 11, 2026 22:40
Production and staging went down during a DigitalOcean managed Postgres /
PgBouncer connectivity blip ("server login has been failing ...
(server_login_retry)"). Several issues combined to turn a transient DB
problem into a hard outage that needed a manual redeploy:

- No process-level error handlers, so a rejected DB query during a blip
  crashed the whole Node process (on Node >= 15 an unhandledRejection
  terminates the process by default).
- orm.ts used idleTimeoutMillis: 500, recycling idle connections twice a
  second and hammering the pooler with reconnect/login churn.
- pool.connect() had no connectionTimeoutMillis, so during a pooler stall
  requests hung indefinitely instead of failing fast.
- bootstrap()'s catch only logged; if the DB was unreachable at startup the
  process stayed up with no HTTP listener (a zombie that `restart: always`
  never recovers, since the policy only fires on process exit).
- Sentry's built-in OnUncaughtException/OnUnhandledRejection integrations
  would double-handle errors alongside any new handlers (double capture +
  exit race).

Changes:
- Add src/utils/globalErrorHandlers.ts: keep the process alive on
  unhandledRejection (log + Sentry), exit cleanly on uncaughtException so
  Docker (restart: always) recreates a fresh process. Registered first in
  index.ts.
- orm.ts: idleTimeoutMillis 500 -> 30000 and add connectionTimeoutMillis:
  10000 on both AppDataSource and CronDataSource; drop the no-op
  maxWaitingClients/evictionRunIntervalMillis keys (node-postgres ignores
  them).
- bootstrap(): exit on startup failure (skipped under tests) so
  restart: always self-heals once the DB is reachable again.
- sentryLogger.ts: disable Sentry's global handlers so ours are the single
  source of truth.
- example.env: document pool-size sizing to prevent recurrence.
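The startup-exit change in the list above can be sketched like this. It is a hedged illustration, not the code in bootstrap(): `StartupFailureDeps`, `handleStartupFailure`, and `runBootstrap` are hypothetical names, and `start` stands in for whatever initialises the data sources and HTTP listener.

```typescript
// Hypothetical sketch of "exit on startup failure, skipped under tests".
interface StartupFailureDeps {
  logError: (message: string, err: unknown) => void;
  exit: (code: number) => void;
  isTestEnv: boolean;
}

function handleStartupFailure(err: unknown, deps: StartupFailureDeps): void {
  deps.logError('bootstrap failed', err);
  if (deps.isTestEnv) {
    return; // never kill the test runner
  }
  // Exiting lets a `restart: always` policy retry with a fresh process
  // instead of leaving one running with no HTTP listener.
  deps.exit(1);
}

// Wiring: `start` is assumed to be an async function, so every startup
// error arrives as a rejection rather than a synchronous throw.
function runBootstrap(start: () => Promise<void>, deps: StartupFailureDeps): Promise<void> {
  return start().catch((err: unknown) => handleStartupFailure(err, deps));
}
```

The restart policy only reacts to a process exit, which is why logging alone leaves the container stuck when the DB is unreachable at startup.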

NOTE: the over-sized production pool (TYPEORM_DATABASE_POOL_SIZE=97 per
process x 5 processes) lives in the gitignored config/production.env and
must be reduced on the server separately.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… pool docs)

- orm.ts: extract the duplicated `extra` pool config (idleTimeoutMillis,
  connectionTimeoutMillis) into a shared `poolerExtraConfig` constant used by
  both AppDataSource and CronDataSource (CodeRabbit nitpick).
- example.env: clarify that the jobs process's CronDataSource pool does NOT
  honor TYPEORM_DATABASE_POOL_SIZE — it uses node-postgres' default of ~10.
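A rough sketch of the shared constant described above, using plain objects rather than real TypeORM imports. The option names `idleTimeoutMillis` and `connectionTimeoutMillis` come from the commit message; the `SketchDataSourceOptions` shape, `parsePoolSize`, and the `exampleEnv` object are assumptions made for illustration and do not reproduce src/orm.ts.

```typescript
// Shared pool settings (values from this PR), defined once so both data
// sources cannot drift apart.
const poolerExtraConfig = {
  idleTimeoutMillis: 30000, // keep idle connections 30 s instead of 0.5 s
  connectionTimeoutMillis: 10000, // fail after 10 s instead of hanging
} as const;

// Assumed, simplified stand-in for the real data source options.
interface SketchDataSourceOptions {
  name: string;
  poolSize?: number;
  extra: typeof poolerExtraConfig;
}

// Hypothetical parser: returns undefined for missing or invalid values.
function parsePoolSize(raw: string | undefined): number | undefined {
  if (raw === undefined) return undefined;
  const parsed = parseInt(raw, 10);
  return isNaN(parsed) || parsed <= 0 ? undefined : parsed;
}

// Stand-in for the process environment.
const exampleEnv: { [key: string]: string | undefined } = {
  TYPEORM_DATABASE_POOL_SIZE: '20',
};

const appDataSourceOptions: SketchDataSourceOptions = {
  name: 'app',
  poolSize: parsePoolSize(exampleEnv['TYPEORM_DATABASE_POOL_SIZE']),
  extra: poolerExtraConfig,
};

// No poolSize here, mirroring the note that the cron pool does not honor
// the env variable and falls back to the driver default.
const cronDataSourceOptions: SketchDataSourceOptions = {
  name: 'cron',
  extra: poolerExtraConfig,
};
```

Sharing one object means a later timeout change lands on both data sources at once, which is the point of the CodeRabbit nitpick.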

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@ae2079 ae2079 merged commit c973ee9 into master Jun 11, 2026
3 of 4 checks passed
@ae2079 ae2079 deleted the hotfix/db-connection-resilience-master branch June 11, 2026 19:12