PMM-15050 Hi-load performance improvements by ademidoff · Pull Request #5338 · percona/pmm

ademidoff · 2026-05-08T05:48:56Z

Summary

When PMM Server is restarted (for example during a version migration), all PMM Clients lose their long-lived gRPC Connect streams and reconnect at roughly the same instant. With the previous defaults a fleet of 800 agents would issue 800 simultaneous Grafana lookups, exhaust Grafana's request capacity, fan out into hundreds of Postgres backends, and trip Postgres' max_connections — turning the migration into a reconnect storm and a "too many clients" livelock.

This PR addresses the storm at four layers.

`pmm-managed` auth cache

Longer auth cache TTL (3 s → 60 s) in AuthServer. Once an agent's credentials have been validated, the next reconnect within a minute is a free cache hit instead of another Grafana round-trip.
Longer auth timeout (3 s → 15 s). A temporarily slow Grafana no longer immediately translates to mass 401s in nginx auth_request.
Larger HTTP transport pool for the Grafana client: MaxIdleConns 50 → 200 and explicit MaxIdleConnsPerHost 100 (Go default is 2). Reconnect bursts no longer force a fresh TCP/TLS handshake for almost every request.

Postgres connection pressure

Bound Grafana's Postgres pool in grafana.ini: max_open_conn = 100, max_idle_conn = 25. Grafana 11's default for max_open_conn is 0 (unlimited), so a synchronized burst of token validations can ask Grafana to open hundreds of Postgres backends and exhaust max_connections. Capping it at 100 contains the blast radius without compromising steady-state performance.
Raise pmm-managed's Postgres pool from 5 idle / 10 open to 20 idle / 50 open. The previous limit caused DB-bound auth paths (LBAC role lookup, settings reads, agent-state writes) to queue at 10 during the burst, on the same Postgres backends Grafana was contending for.
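
A rough sketch of the pmm-managed pool change, assuming the limits are applied through the standard `database/sql` setters. The `poolLimiter` interface, the `recorder` type and `applyPoolLimits` are hypothetical names used so the example runs without a database driver; `*sql.DB` satisfies the same interface.

```go
package main

import (
	"database/sql"
	"fmt"
)

// poolLimiter is the subset of *sql.DB used here (hypothetical interface,
// introduced only so the sketch runs without a real Postgres connection).
type poolLimiter interface {
	SetMaxOpenConns(n int)
	SetMaxIdleConns(n int)
}

// Compile-time check that *sql.DB really provides both setters.
var _ poolLimiter = (*sql.DB)(nil)

// applyPoolLimits sets the limits quoted in the PR: 50 open / 20 idle
// (previously 10 open / 5 idle).
func applyPoolLimits(db poolLimiter) {
	db.SetMaxOpenConns(50)
	db.SetMaxIdleConns(20)
}

// recorder is a stand-in that just remembers what was configured.
type recorder struct{ open, idle int }

func (r *recorder) SetMaxOpenConns(n int) { r.open = n }
func (r *recorder) SetMaxIdleConns(n int) { r.idle = n }

func main() {
	r := &recorder{}
	applyPoolLimits(r)
	fmt.Printf("open=%d idle=%d\n", r.open, r.idle)
}
```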

`pmm-agent` reconnect

Wider reconnect backoff jitter (±25 % → ±50 %) in the shared agent/utils/backoff package. Spreads the first retry of 800 agents across a 1 s window instead of 0.5 s. The cap (backoffMaxDelay) stays at 15 s so an agent in repeated failure does not hold up to a minute of metrics in its local buffer.

Drive-bys

Replaces deprecated errors.Cause with errors.As in auth_server.go.
Replaces interface{} with any in the Grafana client to clear pre-existing lint warnings in files touched here.

Notes

The delayJitter constant lives in the shared agent/utils/backoff package, so wider jitter also applies to the slowlog reader retry and the exporter restart loop. Both are local to a single agent process; wider jitter is strictly more decorrelating there and has no downside.
The Grafana PostgreSQL datasource (used by dashboards) is a separate connection pool that this PR does not bound. It is typically a steady-state load rather than a reconnect-burst contributor, and can be revisited if dashboards turn out to contribute.

🤖 Generated with Claude Code

When PMM Server restarts (e.g. during a version migration), all agents lose their long-lived gRPC Connect streams and reconnect at roughly the same time. Each reconnect goes through nginx auth_request -> pmm-managed AuthServer, which on cache miss calls Grafana /api/auth/serviceaccount. With the previous defaults a fleet of 800 agents would issue 800 simultaneous Grafana lookups, exhaust Grafana, and time out at the 3-second auth deadline.

This change reduces the load on Grafana along three independent axes:

- Singleflight around the cache miss in retrieveRole: concurrent calls for the same hashed credentials now collapse into a single Grafana request; followers wait for the leader and pick up the cached entry.
- Longer auth cache TTL (3s -> 60s) and longer auth timeout (3s -> 15s), so a fresh agent's role no longer needs to be re-fetched every few seconds and a temporarily slow Grafana does not immediately translate to mass 401s.
- Larger HTTP transport pool for the Grafana client (MaxIdleConns 50 -> 200, explicit MaxIdleConnsPerHost 100 instead of the Go default of 2), so reconnect bursts no longer force a fresh TCP/TLS handshake for almost every request.

Drive-by: replaces interface{} with any in the Grafana client to clear pre-existing lint warnings in the files touched here.

codecov · 2026-05-08T05:59:56Z

Codecov Report

❌ Patch coverage is 92.85714% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 42.20%. Comparing base (2396f7d) to head (c3dfc02).

Files with missing lines	Patch %	Lines
managed/services/grafana/client.go	90.00%	1 Missing ⚠️

Additional details and impacted files

@@           Coverage Diff           @@
##               v3    #5338   +/-   ##
=======================================
  Coverage   42.20%   42.20%           
=======================================
  Files         410      410           
  Lines       41995    41997    +2     
=======================================
+ Hits        17723    17725    +2     
  Misses      22488    22488           
  Partials     1784     1784

Flag	Coverage Δ
admin	`34.89% <ø> (ø)`
agent	`49.23% <ø> (ø)`
managed	`40.72% <92.85%> (+<0.01%)`	⬆️
vmproxy	`72.41% <ø> (ø)`


errors.Cause from github.com/pkg/errors does not walk standard Unwrap chains. errors.As is the supported replacement, also used elsewhere in this file, and removes the need for the //nolint:errorlint suppression. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Each PMM Client authenticates with its own per-node service token, so the cache key (a hash of the Authorization/Cookie headers) is unique per client. Singleflight only coalesces concurrent calls sharing the same key, so for the migration scenario that motivated this PR — hundreds of clients reconnecting at the same moment — it provided essentially no deduplication. The longer cache TTL and bigger HTTP transport pool do the actual load-shedding. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

When PMM Server restarts, all clients lose their gRPC streams at the same instant and start reconnecting against a small randomized window. With min=1s / max=15s and ±25% jitter, the first retry of a fleet of 800 agents was concentrated in roughly a 0.5-second window — a worst case of ~1600 reconnects/sec hitting auth_request and Grafana. Doubling the jitter (±25% -> ±50%) and raising the cap from 15s to 60s spreads later retries over a wider interval, so a server outage longer than 15s no longer keeps every agent hammering at the cap simultaneously. The jitter constant lives in the shared backoff package, so the change also applies to slowlog and process restart loops; wider jitter is strictly more decorrelating and has no downside for those local loops. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Reverting the cap bump from the previous commit. With a 60s cap, an agent in repeated failure can hold up to a minute of metrics in its local buffer between reconnect attempts, which is a worse trade-off than the server-side benefit of spreading retries further apart. The wider jitter (±50%) introduced in the previous commit is kept; it still meaningfully spreads later retries within the 15s cap. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Grafana 11 defaults max_open_conn to 0 (unlimited), so a burst of concurrent auth lookups - e.g. a fleet of agents reconnecting at the same time - can ask Grafana to open hundreds of fresh Postgres backends in parallel and saturate Postgres' max_connections, returning "too many clients" and tipping the reconnect loop into a livelock. Cap the pool at 100 open / 25 idle. Token validation is a fast query, so 100 concurrent backends is plenty even for thousands of agents while staying well within PMM Server's max_connections=2000 ceiling. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
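
For reference, the cap described above corresponds to a grafana.ini fragment along these lines. The two values are the ones from this PR; placing them under the `[database]` section follows Grafana's documented configuration layout, and the rest of PMM's grafana.ini is not shown.

```ini
[database]
# 0 (the Grafana default) means unlimited open connections
max_open_conn = 100
max_idle_conn = 25
```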

The pool was hard-coded to 5 idle / 10 open. Under a reconnect storm from a fleet of agents, every DB-bound auth path (LBAC role lookup, settings read, agent-state write) queues at 10, and that queue sits on the same Postgres backends that Grafana's auth flow is competing for. Bump to 20 idle / 50 open. Stays well within PMM Server's max_connections=2000 and gives auth paths enough headroom to drain the burst instead of stalling. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

14400s (4h) is already Grafana's default for conn_max_lifetime, so setting it explicitly is just clutter. The values that actually change behaviour are max_open_conn and max_idle_conn. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

ademidoff requested a review from a team as a code owner May 8, 2026 05:48

ademidoff requested review from 4nte and maxkondr and removed request for a team May 8, 2026 05:48

ademidoff and others added 4 commits May 8, 2026 09:01

ademidoff mentioned this pull request May 10, 2026

PMM-15050 Hi-load performance improvements Percona-Lab/pmm-submodules#4350 (Draft)

ademidoff changed the title ~~PMM-15050 Reduce Grafana auth load when fleets of agents reconnect~~ PMM-15050 Hi-load performance improvements May 10, 2026

ademidoff and others added 4 commits May 10, 2026 12:28

Merge branch 'v3' into PMM-15050-perf-improvements

c3dfc02

ademidoff marked this pull request as draft May 10, 2026 10:17

ademidoff wants to merge 9 commits into v3 from PMM-15050-perf-improvements
