fix: make API resilient to DB/pooler connection loss#2330
Conversation
Production and staging went down during a DigitalOcean managed Postgres /
PgBouncer connectivity blip ("server login has been failing ...
(server_login_retry)"). Several issues combined to turn a transient DB
problem into a hard outage that needed a manual redeploy:
- No process-level error handlers, so a rejected DB query during a blip
crashed the whole Node process (unhandledRejection, Node >= 15 exits).
- orm.ts used idleTimeoutMillis: 500, recycling idle connections twice a
second and hammering the pooler with reconnect/login churn.
- pool.connect() had no connectionTimeoutMillis, so during a pooler stall
requests hung indefinitely instead of failing fast.
- bootstrap()'s catch only logged; if the DB was unreachable at startup the
process stayed up with no HTTP listener (a zombie that `restart: always`
never recovers, since the policy only fires on process exit).
- Sentry's built-in OnUncaughtException/OnUnhandledRejection integrations
double-handled with any new handlers (double capture + exit race).
Changes:
- Add src/utils/globalErrorHandlers.ts: keep the process alive on
unhandledRejection (log + Sentry), exit cleanly on uncaughtException so
Docker (restart: always) recreates a fresh process. Registered first in
index.ts.
- orm.ts: idleTimeoutMillis 500 -> 30000 and add connectionTimeoutMillis:
10000 on both AppDataSource and CronDataSource; drop the no-op
maxWaitingClients/evictionRunIntervalMillis keys (node-postgres ignores
them).
- bootstrap(): exit on startup failure (skipped under tests) so
restart: always self-heals once the DB is reachable again.
- sentryLogger.ts: disable Sentry's global handlers so ours are the single
source of truth.
- example.env: document pool-size sizing to prevent recurrence.
NOTE: the over-sized production pool (TYPEORM_DATABASE_POOL_SIZE=97 per
process x 5 processes) lives in the gitignored config/production.env and
must be reduced on the server separately.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Caution Review failedThe pull request is closed. ℹ️ Recent review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (2)
WalkthroughRegisters idempotent process-level handlers for unhandledRejection and uncaughtException with safe Sentry capture and flush-on-exit, integrates handlers at startup before bootstrap, disables Sentry’s built-in uncaught/unhandled integrations, and centralizes TypeORM pooler-focused extras with env docs about effective DB connections. ChangesError Handling and Database Connection Infrastructure
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Warning There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure. 🔧 ESLint
ESLint install timed out. The project may have too many dependencies for the sandbox. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (1)
src/orm.ts (1)
93-106: ⚡ Quick winConsider extracting duplicated connection pool configuration.
The
extraconfiguration block and its comments are duplicated identically betweenAppDataSource(lines 55-68) andCronDataSource(lines 93-106). Consider extracting this into a shared constant to reduce duplication and ensure consistency.♻️ Refactor to reduce duplication
+// Shared connection pool configuration for all DataSources running behind +// a Postgres connection pooler (DigitalOcean managed Postgres / PgBouncer). +const poolerExtraConfig = { + // Recycling idle connections every 500ms (the previous idleTimeoutMillis) + // caused constant reconnect + login churn against the pooler, surfacing in + // production as "server login has been failing ... (server_login_retry)" errors. + idleTimeoutMillis: 30000, + // Fail fast instead of hanging forever when a connection cannot be acquired + // during a pooler stall, so requests error out quickly and the pool can recover. + connectionTimeoutMillis: 10000, + // (maxWaitingClients / evictionRunIntervalMillis were generic-pool options + // that node-postgres ignores, so they were removed.) +}; + export class AppDataSource { private static datasource: DataSource; @@ -52,17 +64,7 @@ }, }, poolSize, - extra: { - // The service runs behind a Postgres connection pooler (DigitalOcean - // managed Postgres / PgBouncer). Recycling idle connections every - // 500ms (the previous idleTimeoutMillis) caused constant reconnect + - // login churn against the pooler, surfacing in production as - // "server login has been failing ... (server_login_retry)" errors. - idleTimeoutMillis: 30000, - // Fail fast instead of hanging forever when a connection cannot be - // acquired during a pooler stall, so requests error out quickly and - // the pool can recover. - connectionTimeoutMillis: 10000, - // (maxWaitingClients / evictionRunIntervalMillis were generic-pool - // options that node-postgres ignores, so they were removed.) - }, + extra: poolerExtraConfig, }); @@ -90,17 +92,7 @@ entities: [CronJob], synchronize: false, dropSchema: false, - extra: { - // The service runs behind a Postgres connection pooler (DigitalOcean - // managed Postgres / PgBouncer). Recycling idle connections every - // 500ms (the previous idleTimeoutMillis) caused constant reconnect + - // login churn against the pooler, surfacing in production as - // "server login has been failing ... (server_login_retry)" errors. - idleTimeoutMillis: 30000, - // Fail fast instead of hanging forever when a connection cannot be - // acquired during a pooler stall, so requests error out quickly and - // the pool can recover. - connectionTimeoutMillis: 10000, - // (maxWaitingClients / evictionRunIntervalMillis were generic-pool - // options that node-postgres ignores, so they were removed.) - }, + extra: poolerExtraConfig, });🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/orm.ts` around lines 93 - 106, Extract the duplicated extra connection-pool config into a shared constant (e.g., SHARED_POOL_EXTRA or commonPoolExtra) and reuse it in both AppDataSource and CronDataSource by spreading that constant into their existing config objects; update references to the existing properties idleTimeoutMillis and connectionTimeoutMillis so the two data source definitions consume the shared constant instead of copy-pasting the same block, keeping the original comments with the constant for context.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@config/example.env`:
- Around line 10-18: Update the example.env comment to clarify that
CronDataSource (defined in src/orm.ts) does not read TYPEORM_DATABASE_POOL_SIZE
and therefore defaults to node-postgres’s pool max (~10) under TypeORM 0.3.20;
explicitly state this is independent of AppDataSource’s
TYPEORM_DATABASE_POOL_SIZE setting so readers understand the jobs process will
open ~10 separate CronDataSource connections on top of the app pool.
In `@src/server/bootstrap.ts`:
- Around line 428-440: The catch block in bootstrap() currently logs and sends
the error to Sentry but swallows it in test mode (isTestEnv), preventing callers
from seeing startup failures; change the handler so after logger.fatal(...) and
SentryLogger.captureException(...) you rethrow the original error when isTestEnv
is true (e.g., if (isTestEnv) throw err), and keep the existing
flushSentryAndExit() behavior for non-test runs so production still exits
cleanly.
---
Nitpick comments:
In `@src/orm.ts`:
- Around line 93-106: Extract the duplicated extra connection-pool config into a
shared constant (e.g., SHARED_POOL_EXTRA or commonPoolExtra) and reuse it in
both AppDataSource and CronDataSource by spreading that constant into their
existing config objects; update references to the existing properties
idleTimeoutMillis and connectionTimeoutMillis so the two data source definitions
consume the shared constant instead of copy-pasting the same block, keeping the
original comments with the constant for context.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 3b0514ff-2d54-4a1f-bc0b-29225f94f7cd
📒 Files selected for processing (6)
config/example.envsrc/index.tssrc/orm.tssrc/sentryLogger.tssrc/server/bootstrap.tssrc/utils/globalErrorHandlers.ts
… pool docs) - orm.ts: extract the duplicated `extra` pool config (idleTimeoutMillis, connectionTimeoutMillis) into a shared `poolerExtraConfig` constant used by both AppDataSource and CronDataSource (CodeRabbit nitpick). - example.env: clarify that the jobs process's CronDataSource pool does NOT honor TYPEORM_DATABASE_POOL_SIZE — it uses node-postgres' default of ~10. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Problem
Production (
mainnet) and staging both went down and required a manual full redeploy to recover. Logs showed:error: server login has been failing, cached error: connect failed (server_login_retry)— a PgBouncer message; the app talks to DigitalOcean managed Postgres via the PgBouncer pool (...ondigitalocean.com:25061), and the pooler couldn't log in to the backend.query failed: SELECT 1— the failing health-check query (TypeORMadvanced-consolelogger).SyntaxError: ... is not valid JSON/BadRequestError: request aborted— noise (handled 400s from garbage/aborted requests; printed by Express's default error handler becauseNODE_ENV=production !== 'test'). Not crashes.Root cause
A transient DB/pooler problem was amplified into a hard outage by several app-side issues:
unhandledRejection; Node ≥15 exits when there is no listener). (Note: Sentry's default integrations partially masked this, but withOnUncaughtExceptionalso force-exiting.)idleTimeoutMillis: 500inorm.tsrecycled idle connections twice a second, hammering the pooler with reconnect/login churn.connectionTimeoutMillis→ during a pooler stall,pool.connect()waits forever, so requests hang instead of failing fast (matches the stagingjobshealth-check 10s timeouts).bootstrap()swallowed startup failures — itscatchonly logged. If the DB was unreachable at startup, the process stayed up with no HTTP listener on :4000 — a zombie thatrestart: alwaysnever restarts (the policy only fires on process exit). This is the most likely reason a manual redeploy was needed.OnUncaughtException/OnUnhandledRejectionintegrations double-handled with any new handlers (double capture + racingprocess.exit).Changes
src/utils/globalErrorHandlers.ts(new)unhandledRejection→ log + Sentry, keep process alive (a transient DB error must not kill the API).uncaughtException→ log + Sentry, flush + exit(1) so Docker (restart: always) recreates a clean process. Idempotent.src/index.tsbootstrap().src/orm.tsidleTimeoutMillis: 500 → 30000and addconnectionTimeoutMillis: 10000on bothAppDataSourceandCronDataSource. RemovedmaxWaitingClients/evictionRunIntervalMillis(generic-pool options that node-postgres ignores).src/server/bootstrap.tsrestart: alwaysself-heals once the DB is reachable.src/sentryLogger.tsconfig/example.envTYPEORM_DATABASE_POOL_SIZEsizing to prevent recurrence.The over-sized pool lives in the gitignored
config/production.envon the server, so it can't be changed here:production.envhasTYPEORM_DATABASE_POOL_SIZE=97→ with 4ql+ 1jobs= ~485 connections to the DO PgBouncer pool. Reduce it sopool_size × num_processesstays well under the DO pool/cluster connection limit (≈ 15–20 per process is a sane starting point; verify against the DO pool size & confirm pool mode =transaction).Testing (local)
tsc --noEmit,eslint,prettier --check, fullnpm run build— all pass.unhandledRejection→ process stays alive (exit 0), error logged. ✅uncaughtException→ exits 1 after Sentry flush. ✅flushSentryAndExit(bootstrap startup-failure path) exits 1. ✅Reviewed
Ran an adversarial multi-dimension review (correctness / resilience / conventions / side-effects). All confirmed findings are addressed in this PR.
Suggested follow-ups (out of scope here)
/healthz, so a non-serving-but-alive process is detectable.SIGTERM/SIGINT(drain in-flight requests,httpServer.close(),DataSource.destroy()) to avoid abrupt connection teardown on deploys.🤖 Generated with Claude Code
Summary by CodeRabbit
Bug Fixes
Chores