fix: make API resilient to DB/pooler connection loss by ae2079 · Pull Request #2330 · Giveth/impact-graph

ae2079 · 2026-06-11T15:53:11Z

Problem

Production (mainnet) and staging both went down and required a manual full redeploy to recover. Logs showed:

- `error: server login has been failing, cached error: connect failed (server_login_retry)`: a PgBouncer message. The app talks to DigitalOcean managed Postgres via the PgBouncer pool (...ondigitalocean.com:25061), and the pooler couldn't log in to the backend.
- `query failed: SELECT 1`: the failing health-check query (TypeORM advanced-console logger).
- `SyntaxError: ... is not valid JSON` / `BadRequestError: request aborted`: noise (handled 400s from garbage/aborted requests, printed by Express's default error handler because NODE_ENV=production is not 'test'). Not crashes.

Root cause

A transient DB/pooler problem was amplified into a hard outage by several app-side issues:

- No process-level error handlers → a rejected DB query during the blip crashed the whole Node process (unhandledRejection; Node ≥15 exits when there is no listener). Note: Sentry's default integrations partially masked this, but its OnUncaughtException integration also force-exits the process.
- `idleTimeoutMillis: 500` in orm.ts closed any connection that sat idle for 500 ms, so idle connections were recycled up to twice a second, hammering the pooler with reconnect/login churn.
- No `connectionTimeoutMillis` → during a pooler stall, pool.connect() waits forever, so requests hang instead of failing fast (matches the 10s health-check timeouts seen on the staging jobs service).
- bootstrap() swallowed startup failures: its catch only logged. If the DB was unreachable at startup, the process stayed up with no HTTP listener on :4000, a zombie that `restart: always` never restarts (the policy only fires on process exit). This is the most likely reason a manual redeploy was needed.
- Sentry's built-in OnUncaughtException/OnUnhandledRejection integrations would double-handle errors alongside any new handlers (double capture + racing process.exit).

Changes

| File | Change |
| --- | --- |
| `src/utils/globalErrorHandlers.ts` (new) | `unhandledRejection` → log + Sentry, keep process alive (a transient DB error must not kill the API). `uncaughtException` → log + Sentry, flush + exit(1) so Docker (`restart: always`) recreates a clean process. Idempotent. |
| `src/index.ts` | Register the handlers first thing, before `bootstrap()`. |
| `src/orm.ts` | `idleTimeoutMillis: 500 → 30000` and add `connectionTimeoutMillis: 10000` on both `AppDataSource` and `CronDataSource`. Removed `maxWaitingClients`/`evictionRunIntervalMillis` (generic-pool options that node-postgres ignores). |
| `src/server/bootstrap.ts` | On startup failure, capture to Sentry and exit (skipped under tests) so `restart: always` self-heals once the DB is reachable. |
| `src/sentryLogger.ts` | Disable Sentry's built-in global handlers so ours are the single source of truth (no double capture / exit race). |
| `config/example.env` | Document `TYPEORM_DATABASE_POOL_SIZE` sizing to prevent recurrence. |
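
For readers who want to see the handler policy in code, here is a minimal sketch of what an idempotent `registerGlobalErrorHandlers()` could look like. It is not the code from this PR: the `ErrorReporter` interface is a hypothetical stand-in for the project's Sentry wrapper, and the 2000 ms flush timeout and the injectable `exit` parameter are assumptions made for illustration.

```typescript
// Hypothetical reporter interface; stands in for the project's Sentry wrapper.
interface ErrorReporter {
  captureException(err: unknown): void;
  flush(timeoutMs: number): Promise<boolean>;
}

let handlersRegistered = false; // guards against double registration

function registerGlobalErrorHandlers(
  reporter: ErrorReporter,
  exit: (code: number) => void = code => process.exit(code),
): boolean {
  if (handlersRegistered) return false; // idempotent: later calls are no-ops
  handlersRegistered = true;

  // A rejected promise (for example a failed DB query) is logged and
  // reported, but the process keeps serving requests.
  process.on('unhandledRejection', reason => {
    console.error('unhandledRejection', reason);
    reporter.captureException(reason);
  });

  // After an uncaught exception the process state is unknown: report,
  // flush (assumed 2000 ms budget), then exit non-zero so the supervisor
  // (e.g. Docker `restart: always`) starts a clean process.
  process.on('uncaughtException', err => {
    console.error('uncaughtException', err);
    reporter.captureException(err);
    const done = () => exit(1);
    reporter.flush(2000).then(done, done);
  });

  return true;
}
```

Calling the function a second time returns `false` and adds no listeners, which is the property the "Idempotent" note in the table refers to.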

⚠️ Required ops change (NOT in this PR)

The over-sized pool lives in the gitignored config/production.env on the server, so it can't be changed here:

Prod production.env has TYPEORM_DATABASE_POOL_SIZE=97 → with 4 ql + 1 jobs = ~485 connections to the DO PgBouncer pool. Reduce it so pool_size × num_processes stays well under the DO pool/cluster connection limit (≈ 15–20 per process is a sane starting point; verify against the DO pool size & confirm pool mode = transaction).
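
The arithmetic behind that number can be checked with a small helper. This is only a sketch: the function names, the 20% headroom and the connection limit of 100 in the usage comment are hypothetical, and the real limit has to be read from the DO pool settings.

```typescript
interface PoolPlan {
  poolSizePerProcess: number; // value of TYPEORM_DATABASE_POOL_SIZE
  processes: number; // e.g. 4 ql + 1 jobs = 5
  extraConnections?: number; // e.g. a separate cron pool, if any
}

// Worst-case number of connections the app can open against the pooler.
function totalConnections(plan: PoolPlan): number {
  return plan.poolSizePerProcess * plan.processes + (plan.extraConnections ?? 0);
}

// Largest per-process pool size that keeps the total under a given limit,
// leaving a percentage of the limit unused as headroom (assumed default 20%).
function maxPoolSizePerProcess(
  connectionLimit: number,
  processes: number,
  headroomPct = 20,
  extraConnections = 0,
): number {
  const usable = Math.floor((connectionLimit * (100 - headroomPct)) / 100) - extraConnections;
  return Math.max(0, Math.floor(usable / processes));
}

// The production figure from the text: 97 per process across 5 processes.
console.log(totalConnections({ poolSizePerProcess: 97, processes: 5 })); // 485
```

With a hypothetical limit of 100 connections and 5 processes, `maxPoolSizePerProcess(100, 5)` yields 16, which falls inside the 15–20 range suggested above.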

Testing (local)

tsc --noEmit, eslint, prettier --check, full npm run build — all pass.
Runtime smoke tests:
- unhandledRejection → process stays alive (exit 0), error logged. ✅
- uncaughtException → exits 1 after Sentry flush. ✅
- Verified Sentry registers 0 global handlers after init (filter works); our handlers are the only app listeners; registration is idempotent. ✅
- flushSentryAndExit (bootstrap startup-failure path) exits 1. ✅
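
The "filter" mentioned in the Sentry check can be pictured as a function that drops the two global-handler integrations from the SDK's defaults. The sketch below is self-contained and does not import Sentry. Whether the PR passes such a function as the `integrations` option of `Sentry.init` is an assumption, and `withoutGlobalHandlers` is a hypothetical name.

```typescript
// Minimal shape shared by Sentry integrations: each one carries a name.
interface NamedIntegration {
  name: string;
}

// The two built-in integrations that install process-level listeners.
const GLOBAL_HANDLER_INTEGRATIONS = new Set([
  'OnUncaughtException',
  'OnUnhandledRejection',
]);

// Returns the default integrations minus the global handlers, so the
// application's own listeners are the only ones that capture and exit.
function withoutGlobalHandlers<T extends NamedIntegration>(defaults: T[]): T[] {
  return defaults.filter(i => !GLOBAL_HANDLER_INTEGRATIONS.has(i.name));
}

// Example with stand-in integration objects.
const kept = withoutGlobalHandlers([
  { name: 'Http' },
  { name: 'OnUncaughtException' },
  { name: 'OnUnhandledRejection' },
]);
console.log(kept.map(i => i.name)); // [ 'Http' ]
```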

Reviewed

Ran an adversarial multi-dimension review (correctness / resilience / conventions / side-effects). All confirmed findings are addressed in this PR.

Suggested follow-ups (out of scope here)

- Add a Docker healthcheck to the production/staging compose services hitting /healthz, so a non-serving-but-alive process is detectable.
- Graceful shutdown on SIGTERM/SIGINT (drain in-flight requests, httpServer.close(), DataSource.destroy()) to avoid abrupt connection teardown on deploys.
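
As a rough illustration of the graceful-shutdown follow-up, the sketch below wires SIGTERM/SIGINT to a drain sequence. It is not part of this PR. `installGracefulShutdown`, the `ClosableServer` interface and the `closeDb` callback are hypothetical names; in the real service `closeDb` would presumably wrap `DataSource.destroy()`, and Node's `http.Server` satisfies the `ClosableServer` shape.

```typescript
// Structural type satisfied by Node's http.Server: close() stops accepting
// new connections and calls back once existing connections have ended.
interface ClosableServer {
  close(callback: (err?: Error) => void): unknown;
}

function installGracefulShutdown(
  server: ClosableServer,
  closeDb: () => Promise<void>, // assumption: wraps DataSource.destroy()
  exit: (code: number) => void = code => process.exit(code),
): (signal: string) => Promise<void> {
  let shuttingDown = false;

  const shutdown = async (signal: string): Promise<void> => {
    if (shuttingDown) return; // ignore repeated signals
    shuttingDown = true;
    console.log(`${signal} received, draining`);
    try {
      await new Promise<void>((resolve, reject) =>
        server.close(err => (err ? reject(err) : resolve())),
      );
      await closeDb(); // release pooled DB connections last
      exit(0);
    } catch (err) {
      console.error('graceful shutdown failed', err);
      exit(1);
    }
  };

  process.once('SIGTERM', () => void shutdown('SIGTERM'));
  process.once('SIGINT', () => void shutdown('SIGINT'));
  return shutdown;
}
```

Closing the HTTP server before the database lets in-flight requests finish their queries. A production version would likely also add a hard deadline so a stuck connection cannot block the deploy.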

🤖 Generated with Claude Code

Summary by CodeRabbit

Bug Fixes
- Improved startup and process-level error handling to avoid zombie processes and ensure orderly shutdown.
- Prevented duplicate error captures to monitoring by coordinating global handlers and SDK integrations.
Chores
- Tuned database connection pooling behavior for pooler environments.
- Added documentation for connection-pool sizing and notes about extra cron connections.

Production and staging went down during a DigitalOcean managed Postgres / PgBouncer connectivity blip ("server login has been failing ... (server_login_retry)"). Several issues combined to turn a transient DB problem into a hard outage that needed a manual redeploy:
- No process-level error handlers, so a rejected DB query during a blip crashed the whole Node process (unhandledRejection, Node >= 15 exits).
- orm.ts used idleTimeoutMillis: 500, recycling idle connections twice a second and hammering the pooler with reconnect/login churn.
- pool.connect() had no connectionTimeoutMillis, so during a pooler stall requests hung indefinitely instead of failing fast.
- bootstrap()'s catch only logged; if the DB was unreachable at startup the process stayed up with no HTTP listener (a zombie that `restart: always` never recovers, since the policy only fires on process exit).
- Sentry's built-in OnUncaughtException/OnUnhandledRejection integrations double-handled with any new handlers (double capture + exit race).

Changes:
- Add src/utils/globalErrorHandlers.ts: keep the process alive on unhandledRejection (log + Sentry), exit cleanly on uncaughtException so Docker (restart: always) recreates a fresh process. Registered first in index.ts.
- orm.ts: idleTimeoutMillis 500 -> 30000 and add connectionTimeoutMillis: 10000 on both AppDataSource and CronDataSource; drop the no-op maxWaitingClients/evictionRunIntervalMillis keys (node-postgres ignores them).
- bootstrap(): exit on startup failure (skipped under tests) so restart: always self-heals once the DB is reachable again.
- sentryLogger.ts: disable Sentry's global handlers so ours are the single source of truth.
- example.env: document pool-size sizing to prevent recurrence.

NOTE: the over-sized production pool (TYPEORM_DATABASE_POOL_SIZE=97 per process x 5 processes) lives in the gitignored config/production.env and must be reduced on the server separately.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

coderabbitai · 2026-06-11T16:03:22Z

Caution

Review failed

The pull request is closed.

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 8af730a9-e928-4e62-9734-0085f855951d

📥 Commits

Reviewing files that changed from the base of the PR and between 2462ee9 and e65c563.

📒 Files selected for processing (2)

config/example.env
src/orm.ts

Walkthrough

Registers idempotent process-level handlers for unhandledRejection and uncaughtException with safe Sentry capture and flush-on-exit, integrates handlers at startup before bootstrap, disables Sentry’s built-in uncaught/unhandled integrations, and centralizes TypeORM pooler-focused extras with env docs about effective DB connections.

Changes

Error Handling and Database Connection Infrastructure

| Layer / File(s) | Summary |
| --- | --- |
| Global error handler registration and Sentry setup `src/utils/globalErrorHandlers.ts`, `src/sentryLogger.ts` | New `registerGlobalErrorHandlers()` registers idempotent listeners for `unhandledRejection` (log + Sentry capture, keep alive) and `uncaughtException` (log + Sentry capture, flush, exit). `Sentry.init` disables built-in OnUncaughtException/OnUnhandledRejection integrations. |
| Error handler startup integration `src/index.ts`, `src/server/bootstrap.ts` | Calls `registerGlobalErrorHandlers()` before `bootstrap()` and updates bootstrap’s outer error handler to capture startup exceptions to Sentry and, when not in tests, flush Sentry and exit to avoid leaving the process running without an HTTP listener. |
| Database connection pooling configuration `src/orm.ts`, `config/example.env` | Adds `poolerExtraConfig` with `idleTimeoutMillis: 30000` and `connectionTimeoutMillis: 10000`, applies it to `AppDataSource` and `CronDataSource`, and documents how `TYPEORM_DATABASE_POOL_SIZE` scales effective connections across processes (note: cron datasource adds extra connections). |
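
The startup-failure behaviour described for `src/server/bootstrap.ts` can be sketched as a small handler. This is an illustration, not the PR's code: `handleStartupFailure` and the `StartupDeps` interface are hypothetical, with `flushAndExit` standing in for the `flushSentryAndExit` helper named in the testing notes.

```typescript
// Hypothetical dependencies, injected so the policy can be exercised in isolation.
interface StartupDeps {
  captureException(err: unknown): void;
  flushAndExit(code: number): void; // stand-in for flushSentryAndExit
  isTestEnv: boolean;
}

// Intended to be called from the outer catch of bootstrap().
function handleStartupFailure(err: unknown, deps: StartupDeps): void {
  console.error('bootstrap failed', err);
  deps.captureException(err);
  // Under tests the exit is skipped so the test runner keeps control of
  // the process; everywhere else a failed startup must end the process so
  // `restart: always` can try again once the DB is reachable.
  if (deps.isTestEnv) return;
  deps.flushAndExit(1);
}
```

Exiting instead of only logging is what prevents a zombie process: the restart policy reacts to process exit, not to a missing HTTP listener.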

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 I scoped my ears to catch the crash and hiss,
I log, I send to Sentry, then I kiss
the sleepy process shorter, tidy, neat—
flushes done, no doubles, no repeat.
Hop, hop — connections pooled, the garden’s sweet.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 20.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The PR title clearly summarizes the main objective: improving API resilience when database/pooler connections are lost. The title directly reflects the primary changes across multiple files focused on connection handling, error recovery, and process lifecycle management. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 ESLint

If the error stems from missing dependencies, add them to the package.json file. For unrecoverable errors (e.g., due to private dependencies), disable the tool in the CodeRabbit configuration.

ESLint install timed out. The project may have too many dependencies for the sandbox.

coderabbitai

Actionable comments posted: 2

🧹 Nitpick comments (1)

src/orm.ts (1)

93-106: ⚡ Quick win

Consider extracting duplicated connection pool configuration.

The extra configuration block and its comments are duplicated identically between AppDataSource (lines 55-68) and CronDataSource (lines 93-106). Consider extracting this into a shared constant to reduce duplication and ensure consistency.

♻️ Refactor to reduce duplication

+// Shared connection pool configuration for all DataSources running behind
+// a Postgres connection pooler (DigitalOcean managed Postgres / PgBouncer).
+const poolerExtraConfig = {
+  // Recycling idle connections every 500ms (the previous idleTimeoutMillis)
+  // caused constant reconnect + login churn against the pooler, surfacing in
+  // production as "server login has been failing ... (server_login_retry)" errors.
+  idleTimeoutMillis: 30000,
+  // Fail fast instead of hanging forever when a connection cannot be acquired
+  // during a pooler stall, so requests error out quickly and the pool can recover.
+  connectionTimeoutMillis: 10000,
+  // (maxWaitingClients / evictionRunIntervalMillis were generic-pool options
+  // that node-postgres ignores, so they were removed.)
+};
+
 export class AppDataSource {
   private static datasource: DataSource;
 
@@ -52,17 +64,7 @@
         },
       },
       poolSize,
-      extra: {
-        // The service runs behind a Postgres connection pooler (DigitalOcean
-        // managed Postgres / PgBouncer). Recycling idle connections every
-        // 500ms (the previous idleTimeoutMillis) caused constant reconnect +
-        // login churn against the pooler, surfacing in production as
-        // "server login has been failing ... (server_login_retry)" errors.
-        idleTimeoutMillis: 30000,
-        // Fail fast instead of hanging forever when a connection cannot be
-        // acquired during a pooler stall, so requests error out quickly and
-        // the pool can recover.
-        connectionTimeoutMillis: 10000,
-        // (maxWaitingClients / evictionRunIntervalMillis were generic-pool
-        // options that node-postgres ignores, so they were removed.)
-      },
+      extra: poolerExtraConfig,
     });
 
@@ -90,17 +92,7 @@
       entities: [CronJob],
       synchronize: false,
       dropSchema: false,
-      extra: {
-        // The service runs behind a Postgres connection pooler (DigitalOcean
-        // managed Postgres / PgBouncer). Recycling idle connections every
-        // 500ms (the previous idleTimeoutMillis) caused constant reconnect +
-        // login churn against the pooler, surfacing in production as
-        // "server login has been failing ... (server_login_retry)" errors.
-        idleTimeoutMillis: 30000,
-        // Fail fast instead of hanging forever when a connection cannot be
-        // acquired during a pooler stall, so requests error out quickly and
-        // the pool can recover.
-        connectionTimeoutMillis: 10000,
-        // (maxWaitingClients / evictionRunIntervalMillis were generic-pool
-        // options that node-postgres ignores, so they were removed.)
-      },
+      extra: poolerExtraConfig,
     });

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/orm.ts` around lines 93 - 106, Extract the duplicated extra
connection-pool config into a shared constant (e.g., SHARED_POOL_EXTRA or
commonPoolExtra) and reuse it in both AppDataSource and CronDataSource by
spreading that constant into their existing config objects; update references to
the existing properties idleTimeoutMillis and connectionTimeoutMillis so the two
data source definitions consume the shared constant instead of copy-pasting the
same block, keeping the original comments with the constant for context.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@config/example.env`:
- Around line 10-18: Update the example.env comment to clarify that
CronDataSource (defined in src/orm.ts) does not read TYPEORM_DATABASE_POOL_SIZE
and therefore defaults to node-postgres’s pool max (~10) under TypeORM 0.3.20;
explicitly state this is independent of AppDataSource’s
TYPEORM_DATABASE_POOL_SIZE setting so readers understand the jobs process will
open ~10 separate CronDataSource connections on top of the app pool.

In `@src/server/bootstrap.ts`:
- Around line 428-440: The catch block in bootstrap() currently logs and sends
the error to Sentry but swallows it in test mode (isTestEnv), preventing callers
from seeing startup failures; change the handler so after logger.fatal(...) and
SentryLogger.captureException(...) you rethrow the original error when isTestEnv
is true (e.g., if (isTestEnv) throw err), and keep the existing
flushSentryAndExit() behavior for non-test runs so production still exits
cleanly.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 3b0514ff-2d54-4a1f-bc0b-29225f94f7cd

📥 Commits

Reviewing files that changed from the base of the PR and between 9f307f2 and 2462ee9.

📒 Files selected for processing (6)

config/example.env
src/index.ts
src/orm.ts
src/sentryLogger.ts
src/server/bootstrap.ts
src/utils/globalErrorHandlers.ts

… pool docs)
- orm.ts: extract the duplicated `extra` pool config (idleTimeoutMillis, connectionTimeoutMillis) into a shared `poolerExtraConfig` constant used by both AppDataSource and CronDataSource (CodeRabbit nitpick).
- example.env: clarify that the jobs process's CronDataSource pool does NOT honor TYPEORM_DATABASE_POOL_SIZE; it uses node-postgres' default of ~10.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

coderabbitai Bot reviewed Jun 11, 2026

Comment thread config/example.env

Comment thread src/server/bootstrap.ts

ae2079 merged commit c2047ad into staging Jun 11, 2026
12 checks passed

ae2079 deleted the fix/db-connection-resilience branch June 11, 2026 19:07

ae2079 restored the fix/db-connection-resilience branch June 11, 2026 19:07

ae2079 deleted the fix/db-connection-resilience branch June 11, 2026 19:07

This was referenced Jun 11, 2026

Hotfix: DB/pooler connection resilience (production) #2331

Merged

fix: handle malformed/aborted requests cleanly (fail loud, not silent) #2333

Merged

ae2079 merged 2 commits into staging from fix/db-connection-resilience
