fix: make API resilient to DB/pooler connection loss#2330

Merged
ae2079 merged 2 commits into staging from fix/db-connection-resilience
Jun 11, 2026
Conversation

@ae2079 ae2079 commented Jun 11, 2026

Collaborator

Problem

Production (mainnet) and staging both went down and required a manual full redeploy to recover. Logs showed:

  • error: server login has been failing, cached error: connect failed (server_login_retry) — a PgBouncer message; the app talks to DigitalOcean managed Postgres via the PgBouncer pool (...ondigitalocean.com:25061), and the pooler couldn't log in to the backend.
  • query failed: SELECT 1 — the failing health-check query (TypeORM advanced-console logger).
  • SyntaxError: ... is not valid JSON / BadRequestError: request aborted: noise (handled 400s from garbage/aborted requests, printed by Express's default error handler because NODE_ENV=production !== 'test'). These are not crashes.

Root cause

A transient DB/pooler problem was amplified into a hard outage by several app-side issues:

  1. No process-level error handlers → a rejected DB query during the blip crashed the whole Node process (unhandledRejection; Node ≥15 exits when there is no listener). (Note: Sentry's default integrations partially masked this, although its OnUncaughtException integration also force-exits the process.)
  2. idleTimeoutMillis: 500 in orm.ts recycled idle connections twice a second, hammering the pooler with reconnect/login churn.
  3. No connectionTimeoutMillis → during a pooler stall, pool.connect() waits forever, so requests hang instead of failing fast (matches the staging jobs health-check 10s timeouts).
  4. bootstrap() swallowed startup failures — its catch only logged. If the DB was unreachable at startup, the process stayed up with no HTTP listener on :4000 — a zombie that restart: always never restarts (the policy only fires on process exit). This is the most likely reason a manual redeploy was needed.
  5. Sentry's built-in OnUncaughtException/OnUnhandledRejection integrations double-handled with any new handlers (double capture + racing process.exit).
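
To make root cause 3 concrete, here is a minimal sketch of the difference between waiting forever and failing fast. It is not node-postgres code: `acquireWithTimeout` and `stalledConnect` are hypothetical names used only for illustration, and the real behaviour comes from the pool's own connectionTimeoutMillis option.

```typescript
// Hypothetical helper: rejects if `acquire` has not settled within `timeoutMs`.
// It mimics the effect of a connection timeout; it is not node-postgres code.
function acquireWithTimeout<T>(
  acquire: () => Promise<T>,
  timeoutMs: number,
): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(
      () => reject(new Error(`timeout acquiring connection after ${timeoutMs}ms`)),
      timeoutMs,
    );
    acquire().then(
      value => {
        clearTimeout(timer);
        resolve(value);
      },
      err => {
        clearTimeout(timer);
        reject(err);
      },
    );
  });
}

// Simulates a pooler stall: a connect attempt that never settles.
const stalledConnect = (): Promise<never> => new Promise<never>(() => {});
```

Without the timer, a caller awaiting `stalledConnect()` would hang for as long as the stall lasts; with it, the request gets an error it can turn into a fast failure.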

Changes

| File | Change |
| --- | --- |
| src/utils/globalErrorHandlers.ts (new) | unhandledRejection → log + Sentry, keep process alive (a transient DB error must not kill the API). uncaughtException → log + Sentry, flush + exit(1) so Docker (restart: always) recreates a clean process. Idempotent. |
| src/index.ts | Register the handlers first thing, before bootstrap(). |
| src/orm.ts | idleTimeoutMillis: 500 → 30000 and add connectionTimeoutMillis: 10000 on both AppDataSource and CronDataSource. Removed maxWaitingClients/evictionRunIntervalMillis (generic-pool options that node-postgres ignores). |
| src/server/bootstrap.ts | On startup failure, capture to Sentry and exit (skipped under tests) so restart: always self-heals once the DB is reachable. |
| src/sentryLogger.ts | Disable Sentry's built-in global handlers so ours are the single source of truth (no double capture / exit race). |
| config/example.env | Document TYPEORM_DATABASE_POOL_SIZE sizing to prevent recurrence. |
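
For readers who want the shape of the handler registration without opening the diff, the following is a rough sketch under stated assumptions, not the contents of src/utils/globalErrorHandlers.ts. The `HandlerDeps` interface and its injected functions are hypothetical; in the service the target would presumably be Node's `process`, with Sentry's captureException and flush behind `capture` and `flush`.

```typescript
// Minimal shapes so the sketch stays dependency-free (all names hypothetical).
type ErrorListener = (reason: unknown) => void;

interface HandlerDeps {
  target: { on(event: string, listener: ErrorListener): unknown };
  log: (message: string, error: unknown) => void;
  capture: (error: unknown) => void;
  flush: () => Promise<unknown>;
  exit: (code: number) => void;
}

// Targets that already have handlers, which makes registration idempotent.
const registeredTargets = new WeakSet<object>();

// Returns true when handlers were installed, false when they already existed.
function registerGlobalErrorHandlers(deps: HandlerDeps): boolean {
  if (registeredTargets.has(deps.target)) return false;
  registeredTargets.add(deps.target);

  // A rejected promise (for example a failed DB query) is reported, but the
  // process keeps serving.
  deps.target.on('unhandledRejection', reason => {
    deps.log('unhandledRejection', reason);
    deps.capture(reason);
  });

  // After an uncaught exception the process state is suspect: report, flush,
  // then exit non-zero so the restart policy creates a fresh process.
  deps.target.on('uncaughtException', error => {
    deps.log('uncaughtException', error);
    deps.capture(error);
    deps
      .flush()
      .catch(() => undefined) // a failed flush must not block the exit
      .then(() => deps.exit(1));
  });
  return true;
}
```

Injecting the dependencies keeps the two policies (stay alive vs. exit) easy to exercise in a unit test without touching the real process.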

⚠️ Required ops change (NOT in this PR)

The over-sized pool lives in the gitignored config/production.env on the server, so it can't be changed here:

  • Prod production.env has TYPEORM_DATABASE_POOL_SIZE=97, which across 4 ql + 1 jobs processes (5 × 97) gives ~485 connections to the DO PgBouncer pool. Reduce it so pool_size × num_processes stays well under the DO pool/cluster connection limit (≈ 15–20 per process is a sane starting point; verify against the DO pool size and confirm pool mode = transaction).
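
The sizing arithmetic above can be written down as a small sketch. The function names, the 80% headroom and the limit of 100 are assumptions for illustration only; the actual DO pool limit has to be read from the cluster.

```typescript
// Total connections opened against the pooler if every process fills its pool.
function totalPoolConnections(poolSize: number, processCount: number): number {
  return poolSize * processCount;
}

// Largest per-process pool size that keeps the total under a percentage of the
// pooler's connection limit. `headroomPercent` is an assumption; it leaves
// slots for migrations, admin sessions and other clients.
function maxPoolSizePerProcess(
  connectionLimit: number,
  processCount: number,
  headroomPercent = 80,
): number {
  return Math.floor((connectionLimit * headroomPercent) / 100 / processCount);
}

// Figures from this PR: 97 per process across 4 ql + 1 jobs processes.
const currentTotal = totalPoolConnections(97, 5); // 485

// Hypothetical limit of 100 pooled connections, for illustration only.
const suggestedPoolSize = maxPoolSizePerProcess(100, 5); // 16
```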

Testing (local)

  • tsc --noEmit, eslint, prettier --check, full npm run build — all pass.
  • Runtime smoke tests:
    • unhandledRejection → process stays alive (exit 0), error logged. ✅
    • uncaughtException → exits 1 after Sentry flush. ✅
    • Verified Sentry registers 0 global handlers after init (filter works); our handlers are the only app listeners; registration is idempotent. ✅
    • flushSentryAndExit (bootstrap startup-failure path) exits 1. ✅
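
The "filter" mentioned in the smoke tests presumably refers to filtering Sentry's default integrations by name. A minimal sketch of that idea, assuming the two integration names quoted in this PR and a hypothetical helper name `withoutGlobalHandlers` (the code in src/sentryLogger.ts may differ):

```typescript
// Names of the Sentry integrations that install process-level handlers,
// as quoted in this PR.
const GLOBAL_HANDLER_INTEGRATIONS = new Set([
  'OnUncaughtException',
  'OnUnhandledRejection',
]);

// Drops the built-in global handlers from a list of default integrations.
// Intended as the `integrations` callback of Sentry.init, for example:
//   Sentry.init({ dsn, integrations: withoutGlobalHandlers })
function withoutGlobalHandlers<T extends { name: string }>(defaults: T[]): T[] {
  return defaults.filter(
    integration => !GLOBAL_HANDLER_INTEGRATIONS.has(integration.name),
  );
}
```

Filtering by name keeps every other default integration intact, so only the two process-level listeners are handed over to the app's own handlers.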

Reviewed

Ran an adversarial multi-dimension review (correctness / resilience / conventions / side-effects). All confirmed findings are addressed in this PR.

Suggested follow-ups (out of scope here)

  • Add a Docker healthcheck to the production/staging compose services hitting /healthz, so a non-serving-but-alive process is detectable.
  • Graceful shutdown on SIGTERM/SIGINT (drain in-flight requests, httpServer.close(), DataSource.destroy()) to avoid abrupt connection teardown on deploys.
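
As a starting point for the graceful-shutdown follow-up, here is a hedged sketch. `shutdownGracefully` and the two interfaces are hypothetical names; they only assume that the HTTP server has a close(callback) method (as Node's http.Server does) and that the data source has an async destroy() (as TypeORM's DataSource does).

```typescript
// Minimal shapes for the two resources named above (hypothetical names).
interface ClosableServer {
  close(callback: (err?: Error) => void): unknown;
}

interface DestroyableDataSource {
  destroy(): Promise<void>;
}

// Stops accepting new connections, waits for in-flight requests to finish,
// then releases the DB pool. Resolves once both steps are done.
async function shutdownGracefully(
  httpServer: ClosableServer,
  dataSource: DestroyableDataSource,
): Promise<void> {
  await new Promise<void>((resolve, reject) => {
    httpServer.close(err => (err ? reject(err) : resolve()));
  });
  await dataSource.destroy();
}

// Wiring sketch (not executed here):
// for (const signal of ['SIGTERM', 'SIGINT'] as const) {
//   process.once(signal, () => {
//     shutdownGracefully(httpServer, dataSource).then(
//       () => process.exit(0),
//       () => process.exit(1),
//     );
//   });
// }
```

Closing the server before destroying the data source matters: in-flight requests may still need a DB connection while they drain.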

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Bug Fixes

    • Improved startup and process-level error handling to avoid zombie processes and ensure orderly shutdown.
    • Prevented duplicate error captures to monitoring by coordinating global handlers and SDK integrations.
  • Chores

    • Tuned database connection pooling behavior for pooler environments.
    • Added documentation for connection-pool sizing and notes about extra cron connections.

Production and staging went down during a DigitalOcean managed Postgres /
PgBouncer connectivity blip ("server login has been failing ...
(server_login_retry)"). Several issues combined to turn a transient DB
problem into a hard outage that needed a manual redeploy:

- No process-level error handlers, so a rejected DB query during a blip
  crashed the whole Node process (unhandledRejection, Node >= 15 exits).
- orm.ts used idleTimeoutMillis: 500, recycling idle connections twice a
  second and hammering the pooler with reconnect/login churn.
- pool.connect() had no connectionTimeoutMillis, so during a pooler stall
  requests hung indefinitely instead of failing fast.
- bootstrap()'s catch only logged; if the DB was unreachable at startup the
  process stayed up with no HTTP listener (a zombie that `restart: always`
  never recovers, since the policy only fires on process exit).
- Sentry's built-in OnUncaughtException/OnUnhandledRejection integrations
  double-handled with any new handlers (double capture + exit race).

Changes:
- Add src/utils/globalErrorHandlers.ts: keep the process alive on
  unhandledRejection (log + Sentry), exit cleanly on uncaughtException so
  Docker (restart: always) recreates a fresh process. Registered first in
  index.ts.
- orm.ts: idleTimeoutMillis 500 -> 30000 and add connectionTimeoutMillis:
  10000 on both AppDataSource and CronDataSource; drop the no-op
  maxWaitingClients/evictionRunIntervalMillis keys (node-postgres ignores
  them).
- bootstrap(): exit on startup failure (skipped under tests) so
  restart: always self-heals once the DB is reachable again.
- sentryLogger.ts: disable Sentry's global handlers so ours are the single
  source of truth.
- example.env: document pool-size sizing to prevent recurrence.

NOTE: the over-sized production pool (TYPEORM_DATABASE_POOL_SIZE=97 per
process x 5 processes) lives in the gitignored config/production.env and
must be reduced on the server separately.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@coderabbitai

coderabbitai Bot commented Jun 11, 2026

Contributor

Caution

Review failed

The pull request is closed.

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 8af730a9-e928-4e62-9734-0085f855951d

📥 Commits

Reviewing files that changed from the base of the PR and between 2462ee9 and e65c563.

📒 Files selected for processing (2)
  • config/example.env
  • src/orm.ts

Walkthrough

Registers idempotent process-level handlers for unhandledRejection and uncaughtException with safe Sentry capture and flush-on-exit, integrates handlers at startup before bootstrap, disables Sentry’s built-in uncaught/unhandled integrations, and centralizes TypeORM pooler-focused extras with env docs about effective DB connections.

Changes

Error Handling and Database Connection Infrastructure

| Layer / File(s) | Summary |
| --- | --- |
| Global error handler registration and Sentry setup: src/utils/globalErrorHandlers.ts, src/sentryLogger.ts | New registerGlobalErrorHandlers() registers idempotent listeners for unhandledRejection (log + Sentry capture, keep alive) and uncaughtException (log + Sentry capture, flush, exit). Sentry.init disables built-in OnUncaughtException/OnUnhandledRejection integrations. |
| Error handler startup integration: src/index.ts, src/server/bootstrap.ts | Calls registerGlobalErrorHandlers() before bootstrap() and updates bootstrap’s outer error handler to capture startup exceptions to Sentry and, when not in tests, flush Sentry and exit to avoid leaving the process running without an HTTP listener. |
| Database connection pooling configuration: src/orm.ts, config/example.env | Adds poolerExtraConfig with idleTimeoutMillis: 30000 and connectionTimeoutMillis: 10000, applies it to AppDataSource and CronDataSource, and documents how TYPEORM_DATABASE_POOL_SIZE scales effective connections across processes (note: cron datasource adds extra connections). |
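
The startup-failure path described in the walkthrough can be sketched roughly as follows. This is an illustration under assumptions, not the code in src/server/bootstrap.ts: `runBootstrap`, `StartupFailureDeps` and its members are hypothetical names, with `flushAndExit` standing in for the PR's flushSentryAndExit.

```typescript
// Injected dependencies so the sketch runs without Sentry or a real process.
interface StartupFailureDeps {
  isTestEnv: boolean;
  logFatal: (message: string, error: unknown) => void;
  capture: (error: unknown) => void;
  flushAndExit: (code: number) => Promise<void>;
}

// Runs `start`; on failure reports the error and, outside tests, exits so the
// container restart policy can retry once the DB is reachable again.
// Returns true when startup succeeded, false when it failed.
async function runBootstrap(
  start: () => Promise<void>,
  deps: StartupFailureDeps,
): Promise<boolean> {
  try {
    await start();
    return true;
  } catch (error) {
    deps.logFatal('bootstrap failed', error);
    deps.capture(error);
    if (!deps.isTestEnv) {
      // Exiting is what turns a silent zombie into a restartable failure.
      await deps.flushAndExit(1);
    }
    return false;
  }
}
```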

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 I scoped my ears to catch the crash and hiss,
I log, I send to Sentry, then I kiss
the sleepy process shorter, tidy, neat—
flushes done, no doubles, no repeat.
Hop, hop — connections pooled, the garden’s sweet.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 20.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The PR title clearly summarizes the main objective: improving API resilience when database/pooler connections are lost. The title directly reflects the primary changes across multiple files focused on connection handling, error recovery, and process lifecycle management. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 ESLint

If the error stems from missing dependencies, add them to the package.json file. For unrecoverable errors (e.g., due to private dependencies), disable the tool in the CodeRabbit configuration.

ESLint install timed out. The project may have too many dependencies for the sandbox.



@coderabbitai coderabbitai Bot left a comment

Contributor


Actionable comments posted: 2

🧹 Nitpick comments (1)
src/orm.ts (1)

93-106: ⚡ Quick win

Consider extracting duplicated connection pool configuration.

The extra configuration block and its comments are duplicated identically between AppDataSource (lines 55-68) and CronDataSource (lines 93-106). Consider extracting this into a shared constant to reduce duplication and ensure consistency.

♻️ Refactor to reduce duplication
+// Shared connection pool configuration for all DataSources running behind
+// a Postgres connection pooler (DigitalOcean managed Postgres / PgBouncer).
+const poolerExtraConfig = {
+  // Recycling idle connections every 500ms (the previous idleTimeoutMillis)
+  // caused constant reconnect + login churn against the pooler, surfacing in
+  // production as "server login has been failing ... (server_login_retry)" errors.
+  idleTimeoutMillis: 30000,
+  // Fail fast instead of hanging forever when a connection cannot be acquired
+  // during a pooler stall, so requests error out quickly and the pool can recover.
+  connectionTimeoutMillis: 10000,
+  // (maxWaitingClients / evictionRunIntervalMillis were generic-pool options
+  // that node-postgres ignores, so they were removed.)
+};
+
 export class AppDataSource {
   private static datasource: DataSource;
 
@@ -52,17 +64,7 @@
         },
       },
       poolSize,
-      extra: {
-        // The service runs behind a Postgres connection pooler (DigitalOcean
-        // managed Postgres / PgBouncer). Recycling idle connections every
-        // 500ms (the previous idleTimeoutMillis) caused constant reconnect +
-        // login churn against the pooler, surfacing in production as
-        // "server login has been failing ... (server_login_retry)" errors.
-        idleTimeoutMillis: 30000,
-        // Fail fast instead of hanging forever when a connection cannot be
-        // acquired during a pooler stall, so requests error out quickly and
-        // the pool can recover.
-        connectionTimeoutMillis: 10000,
-        // (maxWaitingClients / evictionRunIntervalMillis were generic-pool
-        // options that node-postgres ignores, so they were removed.)
-      },
+      extra: poolerExtraConfig,
     });
 
@@ -90,17 +92,7 @@
       entities: [CronJob],
       synchronize: false,
       dropSchema: false,
-      extra: {
-        // The service runs behind a Postgres connection pooler (DigitalOcean
-        // managed Postgres / PgBouncer). Recycling idle connections every
-        // 500ms (the previous idleTimeoutMillis) caused constant reconnect +
-        // login churn against the pooler, surfacing in production as
-        // "server login has been failing ... (server_login_retry)" errors.
-        idleTimeoutMillis: 30000,
-        // Fail fast instead of hanging forever when a connection cannot be
-        // acquired during a pooler stall, so requests error out quickly and
-        // the pool can recover.
-        connectionTimeoutMillis: 10000,
-        // (maxWaitingClients / evictionRunIntervalMillis were generic-pool
-        // options that node-postgres ignores, so they were removed.)
-      },
+      extra: poolerExtraConfig,
     });
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/orm.ts` around lines 93 - 106, Extract the duplicated extra
connection-pool config into a shared constant (e.g., SHARED_POOL_EXTRA or
commonPoolExtra) and reuse it in both AppDataSource and CronDataSource by
spreading that constant into their existing config objects; update references to
the existing properties idleTimeoutMillis and connectionTimeoutMillis so the two
data source definitions consume the shared constant instead of copy-pasting the
same block, keeping the original comments with the constant for context.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@config/example.env`:
- Around line 10-18: Update the example.env comment to clarify that
CronDataSource (defined in src/orm.ts) does not read TYPEORM_DATABASE_POOL_SIZE
and therefore defaults to node-postgres’s pool max (~10) under TypeORM 0.3.20;
explicitly state this is independent of AppDataSource’s
TYPEORM_DATABASE_POOL_SIZE setting so readers understand the jobs process will
open ~10 separate CronDataSource connections on top of the app pool.

In `@src/server/bootstrap.ts`:
- Around line 428-440: The catch block in bootstrap() currently logs and sends
the error to Sentry but swallows it in test mode (isTestEnv), preventing callers
from seeing startup failures; change the handler so after logger.fatal(...) and
SentryLogger.captureException(...) you rethrow the original error when isTestEnv
is true (e.g., if (isTestEnv) throw err), and keep the existing
flushSentryAndExit() behavior for non-test runs so production still exits
cleanly.

---

Nitpick comments:
In `@src/orm.ts`:
- Around line 93-106: Extract the duplicated extra connection-pool config into a
shared constant (e.g., SHARED_POOL_EXTRA or commonPoolExtra) and reuse it in
both AppDataSource and CronDataSource by spreading that constant into their
existing config objects; update references to the existing properties
idleTimeoutMillis and connectionTimeoutMillis so the two data source definitions
consume the shared constant instead of copy-pasting the same block, keeping the
original comments with the constant for context.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 3b0514ff-2d54-4a1f-bc0b-29225f94f7cd

📥 Commits

Reviewing files that changed from the base of the PR and between 9f307f2 and 2462ee9.

📒 Files selected for processing (6)
  • config/example.env
  • src/index.ts
  • src/orm.ts
  • src/sentryLogger.ts
  • src/server/bootstrap.ts
  • src/utils/globalErrorHandlers.ts

Comment thread config/example.env
Comment thread src/server/bootstrap.ts
… pool docs)

- orm.ts: extract the duplicated `extra` pool config (idleTimeoutMillis,
  connectionTimeoutMillis) into a shared `poolerExtraConfig` constant used by
  both AppDataSource and CronDataSource (CodeRabbit nitpick).
- example.env: clarify that the jobs process's CronDataSource pool does NOT
  honor TYPEORM_DATABASE_POOL_SIZE — it uses node-postgres' default of ~10.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@ae2079 ae2079 merged commit c2047ad into staging Jun 11, 2026
12 checks passed
@ae2079 ae2079 deleted the fix/db-connection-resilience branch June 11, 2026 19:07
@ae2079 ae2079 restored the fix/db-connection-resilience branch June 11, 2026 19:07
@ae2079 ae2079 deleted the fix/db-connection-resilience branch June 11, 2026 19:07