Hotfix: DB/pooler connection resilience (production) #2331
Merged
Production and staging went down during a DigitalOcean managed Postgres /
PgBouncer connectivity blip ("server login has been failing ...
(server_login_retry)"). Several issues combined to turn a transient DB
problem into a hard outage that needed a manual redeploy:
- No process-level error handlers, so a rejected DB query during a blip
crashed the whole Node process (an unhandledRejection terminates the
process by default on Node >= 15).
- orm.ts used idleTimeoutMillis: 500, recycling idle connections twice a
second and hammering the pooler with reconnect/login churn.
- pool.connect() had no connectionTimeoutMillis, so during a pooler stall
requests hung indefinitely instead of failing fast.
- bootstrap()'s catch only logged; if the DB was unreachable at startup the
process stayed up with no HTTP listener (a zombie that `restart: always`
never recovers, since the policy only fires on process exit).
- Sentry's built-in OnUncaughtException/OnUnhandledRejection integrations
would double-handle errors alongside any new handlers (double capture +
exit race).
Changes:
- Add src/utils/globalErrorHandlers.ts: keep the process alive on
unhandledRejection (log + Sentry), exit cleanly on uncaughtException so
Docker (restart: always) recreates a fresh process. Registered first in
index.ts.
- orm.ts: idleTimeoutMillis 500 -> 30000 and add connectionTimeoutMillis:
10000 on both AppDataSource and CronDataSource; drop the no-op
maxWaitingClients/evictionRunIntervalMillis keys (node-postgres ignores
them).
- bootstrap(): exit on startup failure (skipped under tests) so
restart: always self-heals once the DB is reachable again.
- sentryLogger.ts: disable Sentry's global handlers so ours are the single
source of truth.
- example.env: document pool-size sizing to prevent recurrence.
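For reviewers who want the handler policy in code form, here is a minimal sketch of what the description of globalErrorHandlers.ts implies. It is not the file from this PR: the `ProcessLike` and `ErrorReporter` types, the function signature, and the log messages are assumptions made so the sketch runs without dependencies.

```typescript
// Hypothetical minimal types; the real module presumably uses Node's
// `process` and the project's logger/Sentry wrapper directly.
type ProcessErrorListener = (err: unknown) => void;

interface ProcessLike {
  on(
    event: 'unhandledRejection' | 'uncaughtException',
    listener: ProcessErrorListener,
  ): unknown;
  exit(code?: number): void;
}

interface ErrorReporter {
  // Stand-in for "log + Sentry capture".
  error(message: string, err: unknown): void;
}

// Tracks which process objects already have handlers, so a second call
// is a no-op instead of attaching duplicate listeners.
const registeredProcesses = new WeakSet<object>();

function registerGlobalErrorHandlers(
  proc: ProcessLike,
  reporter: ErrorReporter,
): boolean {
  if (registeredProcesses.has(proc)) return false;
  registeredProcesses.add(proc);

  // A rejected promise (e.g. a failed query during a pooler blip) is
  // reported, but the process keeps serving other requests.
  proc.on('unhandledRejection', (reason) => {
    reporter.error('unhandledRejection (process kept alive)', reason);
  });

  // After an uncaught exception the process state is not trustworthy:
  // report, then exit non-zero so the restart policy starts a fresh one.
  proc.on('uncaughtException', (err) => {
    reporter.error('uncaughtException (exiting)', err);
    proc.exit(1);
  });

  return true;
}
```

In an entry point this would be called once, before anything else is started, with Node's `process` (which should satisfy `ProcessLike` structurally) and a reporter that forwards to the logger and Sentry. Flushing Sentry before exiting is deliberately left out of the sketch.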
NOTE: the over-sized production pool (TYPEORM_DATABASE_POOL_SIZE=97 per
process x 5 processes) lives in the gitignored config/production.env and
must be reduced on the server separately.
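The sizing arithmetic behind this note can be checked with a small helper. This is only an illustrative sketch: the function names and the idea of a per-pooler "connection budget" are assumptions, not values or code from this repository.

```typescript
// Worst-case number of DB connections if every process fills its pool.
function totalPoolConnections(
  poolSizePerProcess: number,
  processCount: number,
): number {
  return poolSizePerProcess * processCount;
}

// Largest per-process pool size that keeps the total within a budget.
// `connectionBudget` is a hypothetical figure; use whatever limit the
// managed pooler actually enforces.
function maxPoolSizePerProcess(
  connectionBudget: number,
  processCount: number,
): number {
  return Math.floor(connectionBudget / processCount);
}

// The production setting described above: 97 per process x 5 processes.
const worstCaseTotal = totalPoolConnections(97, 5); // 485
```

With 5 processes, every unit added to TYPEORM_DATABASE_POOL_SIZE costs 5 potential connections, which is why the per-process value has to be derived from the pooler's limit rather than chosen in isolation.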
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… pool docs)
- orm.ts: extract the duplicated `extra` pool config (idleTimeoutMillis,
connectionTimeoutMillis) into a shared `poolerExtraConfig` constant used by
both AppDataSource and CronDataSource (CodeRabbit nitpick).
- example.env: clarify that the jobs process's CronDataSource pool does NOT
honor TYPEORM_DATABASE_POOL_SIZE; it uses node-postgres' default of ~10.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
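A rough sketch of the shared constant described in this commit, assuming the two data sources are configured from plain option objects. Only `poolerExtraConfig`, `idleTimeoutMillis`, `connectionTimeoutMillis`, and the two values come from the PR text; the option-object names and the rest of their shape are hypothetical.

```typescript
// Pool settings forwarded to node-postgres via TypeORM's `extra` key.
const poolerExtraConfig = {
  // Keep idle connections for 30 s instead of recycling them every 500 ms.
  idleTimeoutMillis: 30_000,
  // Fail fast after 10 s instead of hanging while the pooler is stalled.
  connectionTimeoutMillis: 10_000,
} as const;

// Hypothetical option objects standing in for the AppDataSource and
// CronDataSource configuration; the real ones carry many more fields.
const appDataSourceOptions = {
  type: 'postgres',
  extra: poolerExtraConfig,
};

const cronDataSourceOptions = {
  type: 'postgres',
  extra: poolerExtraConfig,
};
```

Sharing one object means a future timeout change cannot be applied to one data source and forgotten on the other.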
Production hotfix → master
Pushes only the DB/pooler connection-resilience fix to production. These commits are cherry-picked from #2330 (already merged to staging); the other in-flight staging changes (Stellar #2329, QF Round actions #2328, etc.) are intentionally excluded.
Why
Prod went down during a DigitalOcean managed-Postgres / PgBouncer connectivity blip (server login has been failing … (server_login_retry)), amplified into a hard outage that needed a manual redeploy.
What's included
- src/utils/globalErrorHandlers.ts (new) + src/index.ts: process stays alive on unhandledRejection; clean exit on uncaughtException (Docker restart: always recreates a fresh process)
- src/orm.ts: idleTimeoutMillis 500 → 30000 + connectionTimeoutMillis 10000 on both data sources (shared poolerExtraConfig); removed dead no-op pool keys
- src/server/bootstrap.ts
- src/sentryLogger.ts
- config/example.env
The over-sized pool lives in the server's vault/production.env: TYPEORM_DATABASE_POOL_SIZE was 97 (~485 connections across 5 processes). Already lowered to 20 in the prod vault; prod must be redeployed/restarted to load it.
Testing
tsc, ESLint, Prettier, full build pass. Runtime smoke tests confirm: unhandledRejection stays alive, uncaughtException exits 1, Sentry registers 0 global handlers (filter works), idempotent registration, and the bootstrap startup-exit path.
Reviewed via adversarial multi-dimension review + CodeRabbit.
🤖 Generated with Claude Code
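To make the bootstrap startup-exit path concrete, here is a minimal sketch of the pattern, not the code in src/server/bootstrap.ts. The function name, the injected `exit` callback, and the `isTestEnv` flag are assumptions; the PR does not show how test runs are actually detected.

```typescript
// Runs the startup routine and, if it fails, exits non-zero so the
// container restart policy can retry once the DB is reachable again.
async function bootstrapOrExit(
  startServer: () => Promise<void>,
  exit: (code: number) => void,
  isTestEnv: boolean,
): Promise<void> {
  try {
    await startServer();
  } catch (err) {
    console.error('bootstrap failed', err);
    // Logging alone would leave a live process with no HTTP listener.
    // Exiting is skipped under tests so a failed startup does not kill
    // the test runner.
    if (!isTestEnv) {
      exit(1);
    }
  }
}
```

In production the `exit` argument would be `process.exit`; injecting it keeps the exit path easy to smoke-test without terminating the test process.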