Hotfix: DB/pooler connection resilience (production) by ae2079 · Pull Request #2331 · Giveth/impact-graph

ae2079 · 2026-06-11T19:12:23Z

Production hotfix → `master`

Pushes only the DB/pooler connection-resilience fix to production. These commits are cherry-picked from #2330 (already merged to staging); the other in-flight staging changes (Stellar #2329, QF Round actions #2328, etc.) are intentionally excluded.

Why

Prod went down during a DigitalOcean managed-Postgres / PgBouncer connectivity blip (`server login has been failing … (server_login_retry)`). Several issues in the app amplified that transient blip into a hard outage that needed a manual redeploy.

What's included

| File | Change |
| --- | --- |
| `src/utils/globalErrorHandlers.ts` (new) + `src/index.ts` | Keep the API alive on `unhandledRejection`; clean exit on `uncaughtException` (Docker `restart: always` recreates a fresh process) |
| `src/orm.ts` | `idleTimeoutMillis 500→30000` + `connectionTimeoutMillis 10000` on both data sources (shared `poolerExtraConfig`); removed dead no-op pool keys |
| `src/server/bootstrap.ts` | Exit on startup failure (skipped under tests) so a DB-unreachable start self-heals via restart instead of becoming a zombie with no HTTP listener |
| `src/sentryLogger.ts` | Disable Sentry's built-in global handlers so ours are the single source of truth (no double-capture / exit race) |
| `config/example.env` | Document pool-size sizing |
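
For reviewers who want the handler behaviour spelled out, here is a minimal sketch of the policy in the first table row. It is not the contents of `src/utils/globalErrorHandlers.ts`; the names `ProcessLike`, `HandlerDeps` and `registerGlobalErrorHandlers` are hypothetical, and the logger and exit function are injected so the sketch can be exercised without a real process.

```typescript
// Hypothetical sketch of the policy only; the real file may differ.
type ErrorListener = (err: unknown) => void;

// Minimal subset of Node's `process` that the sketch needs.
interface ProcessLike {
  on(event: string, listener: ErrorListener): unknown;
}

interface HandlerDeps {
  log: (message: string, err: unknown) => void; // e.g. logger + Sentry capture
  exit: (code: number) => void; // e.g. process.exit
}

// Remember which process objects already have handlers (idempotent registration).
const registeredProcesses = new WeakSet<object>();

function registerGlobalErrorHandlers(
  proc: ProcessLike,
  deps: HandlerDeps,
): boolean {
  if (registeredProcesses.has(proc)) return false; // already registered
  registeredProcesses.add(proc);

  // A rejected promise (e.g. a DB query during a pooler blip) is logged,
  // but the process keeps serving requests.
  proc.on('unhandledRejection', err => {
    deps.log('unhandledRejection (process kept alive)', err);
  });

  // After an uncaught exception the process state is unknown, so exit
  // non-zero and let the container restart policy create a fresh process.
  proc.on('uncaughtException', err => {
    deps.log('uncaughtException (exiting)', err);
    deps.exit(1);
  });

  return true;
}
```

In an entry point this would be called once, before anything else, for example `registerGlobalErrorHandlers(process, { log: console.error, exit: process.exit })`.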

⚠️ Paired ops change (not in code)

The over-sized pool lives in the server's `vault/production.env`: `TYPEORM_DATABASE_POOL_SIZE` was 97 (~485 connections across 5 processes). It has already been lowered to 20 in the prod vault; prod must be redeployed/restarted to load it.
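
The sizing arithmetic above can be checked with a few lines. This is only a rough upper bound that assumes every process opens its full pool; the helper name `maxPoolConnections` is hypothetical and not part of the codebase.

```typescript
// Rough upper bound: every process fills its pool at the same time.
function maxPoolConnections(
  poolSizePerProcess: number,
  processCount: number,
): number {
  return poolSizePerProcess * processCount;
}

const beforeHotfix = maxPoolConnections(97, 5); // 485, the figure quoted above
const afterHotfix = maxPoolConnections(20, 5); // 100 with the lowered setting

console.log({ beforeHotfix, afterHotfix });
```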

Testing

`tsc`, ESLint, Prettier, and the full build pass. Runtime smoke tests confirm that `unhandledRejection` keeps the process alive, `uncaughtException` exits with code 1, Sentry registers 0 global handlers (the filter works), handler registration is idempotent, and the bootstrap startup-exit path works. Reviewed via adversarial multi-dimension review + CodeRabbit.
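
The "Sentry registers 0 global handlers" check relies on filtering the SDK's default integrations. A minimal sketch of such a filter follows. The helper name `withoutGlobalHandlers` is hypothetical, and this is an assumption about the approach rather than the actual code in `src/sentryLogger.ts`. It relies on `Sentry.init` accepting a function for its `integrations` option, which receives the default integrations and returns the ones to install.

```typescript
// Only the `name` field of an integration is needed for filtering.
interface NamedIntegration {
  name: string;
}

// The two built-in integrations that install process-level handlers.
const GLOBAL_HANDLER_INTEGRATIONS = new Set([
  'OnUncaughtException',
  'OnUnhandledRejection',
]);

// Drop Sentry's own global handlers and keep every other default integration.
function withoutGlobalHandlers<T extends NamedIntegration>(defaults: T[]): T[] {
  return defaults.filter(i => !GLOBAL_HANDLER_INTEGRATIONS.has(i.name));
}

// Assumed usage (not executed here):
//   Sentry.init({ dsn, integrations: withoutGlobalHandlers });
```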

🤖 Generated with Claude Code

Production and staging went down during a DigitalOcean managed Postgres / PgBouncer connectivity blip ("server login has been failing ... (server_login_retry)"). Several issues combined to turn a transient DB problem into a hard outage that needed a manual redeploy:

- No process-level error handlers, so a rejected DB query during a blip crashed the whole Node process (unhandledRejection, Node >= 15 exits).
- orm.ts used idleTimeoutMillis: 500, recycling idle connections twice a second and hammering the pooler with reconnect/login churn.
- pool.connect() had no connectionTimeoutMillis, so during a pooler stall requests hung indefinitely instead of failing fast.
- bootstrap()'s catch only logged; if the DB was unreachable at startup the process stayed up with no HTTP listener (a zombie that `restart: always` never recovers, since the policy only fires on process exit).
- Sentry's built-in OnUncaughtException/OnUnhandledRejection integrations double-handled with any new handlers (double capture + exit race).

Changes:

- Add src/utils/globalErrorHandlers.ts: keep the process alive on unhandledRejection (log + Sentry), exit cleanly on uncaughtException so Docker (restart: always) recreates a fresh process. Registered first in index.ts.
- orm.ts: idleTimeoutMillis 500 -> 30000 and add connectionTimeoutMillis: 10000 on both AppDataSource and CronDataSource; drop the no-op maxWaitingClients/evictionRunIntervalMillis keys (node-postgres ignores them).
- bootstrap(): exit on startup failure (skipped under tests) so restart: always self-heals once the DB is reachable again.
- sentryLogger.ts: disable Sentry's global handlers so ours are the single source of truth.
- example.env: document pool-size sizing to prevent recurrence.

NOTE: the over-sized production pool (TYPEORM_DATABASE_POOL_SIZE=97 per process x 5 processes) lives in the gitignored config/production.env and must be reduced on the server separately.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
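
The bootstrap() change in the commit message above can be illustrated with a small sketch. It assumes a startup function that rejects when the DB is unreachable; `StartupDeps` and `bootstrapOrExit` are hypothetical names, and the real `src/server/bootstrap.ts` may be structured differently.

```typescript
interface StartupDeps {
  start: () => Promise<void>; // connect to the DB, then open the HTTP listener
  logError: (message: string, err: unknown) => void;
  exit: (code: number) => void; // e.g. process.exit
  isTestEnv: boolean; // the exit is skipped under tests
}

async function bootstrapOrExit(deps: StartupDeps): Promise<void> {
  try {
    await deps.start();
  } catch (err) {
    deps.logError('bootstrap failed', err);
    // Logging alone leaves a live process with no HTTP listener, which a
    // restart policy never recovers because it only fires on process exit.
    if (!deps.isTestEnv) {
      deps.exit(1);
    }
  }
}
```

Exiting non-zero hands recovery to the restart policy, so the service comes back on its own once the DB is reachable again.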

… pool docs)

- orm.ts: extract the duplicated `extra` pool config (idleTimeoutMillis, connectionTimeoutMillis) into a shared `poolerExtraConfig` constant used by both AppDataSource and CronDataSource (CodeRabbit nitpick).
- example.env: clarify that the jobs process's CronDataSource pool does NOT honor TYPEORM_DATABASE_POOL_SIZE; it uses node-postgres' default of ~10.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
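
A minimal sketch of the shared-constant idea from this commit, using plain option objects rather than real TypeORM `DataSource` instances. Only `poolerExtraConfig`, the two timeout keys and their values come from the PR; the `DataSourceOptionsSketch` type, the `build…` helpers and the option layout are assumptions for illustration.

```typescript
// Values passed through TypeORM's `extra` to the node-postgres pool.
const poolerExtraConfig = {
  idleTimeoutMillis: 30000, // was 500, which recycled idle connections constantly
  connectionTimeoutMillis: 10000, // fail fast instead of hanging during a pooler stall
};

interface DataSourceOptionsSketch {
  name: string;
  poolSize?: number;
  extra: typeof poolerExtraConfig;
}

// Hypothetical: the app data source takes its pool size from configuration.
function buildAppOptions(poolSize: number): DataSourceOptionsSketch {
  return { name: 'app', poolSize, extra: { ...poolerExtraConfig } };
}

// Hypothetical: the cron data source sets no pool size, so the driver default applies.
function buildCronOptions(): DataSourceOptionsSketch {
  return { name: 'cron', extra: { ...poolerExtraConfig } };
}
```

Spreading the constant into each `extra` gives every data source its own copy of the same timeouts, so a change to one value applies to both.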

ae2079 and others added 2 commits June 11, 2026 22:40

ae2079 merged commit c973ee9 into master Jun 11, 2026
3 of 4 checks passed

ae2079 deleted the hotfix/db-connection-resilience-master branch June 11, 2026 19:12
