PMM-15050 Hi-load performance improvements #5338

Draft
ademidoff wants to merge 9 commits into v3 from PMM-15050-perf-improvements

@ademidoff
Member

@ademidoff ademidoff commented May 8, 2026

PMM-15050

FB: SUBMODULES-4350

Summary

When PMM Server is restarted (for example during a version migration), all PMM Clients lose their long-lived gRPC Connect streams and reconnect at roughly the same instant. With the previous defaults a fleet of 800 agents would issue 800 simultaneous Grafana lookups, exhaust Grafana's request capacity, fan out into hundreds of Postgres backends, and trip Postgres' max_connections — turning the migration into a reconnect storm and a "too many clients" livelock.

This PR addresses the storm at four layers.

pmm-managed auth cache

  • Longer auth cache TTL (3 s → 60 s) in AuthServer. Once an agent's credentials have been validated, the next reconnect within a minute is a free cache hit instead of another Grafana round-trip.
  • Longer auth timeout (3 s → 15 s). A temporarily slow Grafana no longer immediately translates to mass 401s in nginx auth_request.
  • Larger HTTP transport pool for the Grafana client: MaxIdleConns 50 → 200 and explicit MaxIdleConnsPerHost 100 (Go default is 2). Reconnect bursts no longer force a fresh TCP/TLS handshake for almost every request.
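
A minimal sketch of the transport settings listed above, using only the standard library. The constructor name `newGrafanaHTTPClient` and the helper `idleLimits` are hypothetical, and putting the 15 s auth timeout on `http.Client.Timeout` is an assumption; the PR may apply it elsewhere (for example via a request context).

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// Values taken from the PR description; the constant names are illustrative.
const (
	authTimeout         = 15 * time.Second
	maxIdleConns        = 200
	maxIdleConnsPerHost = 100 // Go's default (http.DefaultMaxIdleConnsPerHost) is 2
)

// newGrafanaHTTPClient is a hypothetical constructor, not the PR's actual code.
func newGrafanaHTTPClient() *http.Client {
	transport := &http.Transport{
		MaxIdleConns:        maxIdleConns,
		MaxIdleConnsPerHost: maxIdleConnsPerHost,
	}
	// Assumption: the auth timeout is enforced as the client-wide timeout.
	return &http.Client{Transport: transport, Timeout: authTimeout}
}

// idleLimits reports the idle-connection limits configured on the client.
func idleLimits(c *http.Client) (total, perHost int) {
	t, ok := c.Transport.(*http.Transport)
	if !ok {
		return 0, 0
	}
	return t.MaxIdleConns, t.MaxIdleConnsPerHost
}

func main() {
	total, perHost := idleLimits(newGrafanaHTTPClient())
	fmt.Println(total, perHost)
}
```

Because all requests go to a single Grafana host, the per-host limit is the one that decides how many connections survive between bursts; raising only `MaxIdleConns` would leave the effective limit at 2.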

Postgres connection pressure

  • Bound Grafana's Postgres pool in grafana.ini: max_open_conn = 100, max_idle_conn = 25. Grafana 11's default for max_open_conn is 0 (unlimited), so a synchronized burst of token validations can ask Grafana to open hundreds of Postgres backends and exhaust max_connections. Capping it at 100 contains the blast radius without compromising steady-state performance.
  • Raise pmm-managed's Postgres pool from 5 idle / 10 open to 20 idle / 50 open. The previous limit caused DB-bound auth paths (LBAC role lookup, settings reads, agent-state writes) to queue at 10 during the burst, on the same Postgres backends Grafana was contending for.

pmm-agent reconnect

  • Wider reconnect backoff jitter (±25 % → ±50 %) in the shared agent/utils/backoff package. Spreads the first retry of 800 agents across a 1 s window instead of 0.5 s. The cap (backoffMaxDelay) stays at 15 s so an agent in repeated failure does not hold up to a minute of metrics in its local buffer.
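
A rough sketch of symmetric jitter on a capped delay. `delayJitter` and `backoffMaxDelay` are the names used in this PR; `applyJitter`, `nextDelay` and `jitterMillis` are hypothetical helpers, and applying the cap before the jitter is an assumption about the real backoff package.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

const (
	delayJitter     = 0.5              // ±50 %, the new value from this PR
	backoffMaxDelay = 15 * time.Second // unchanged cap
)

// applyJitter scales base by a factor in [1-jitter, 1+jitter).
// r is a random sample in [0, 1); passing it in keeps the function testable.
func applyJitter(base time.Duration, jitter, r float64) time.Duration {
	factor := 1 - jitter + 2*jitter*r
	return time.Duration(float64(base) * factor)
}

// nextDelay caps the base delay and then applies random jitter.
// Assumption: the real package may order these two steps differently.
func nextDelay(base time.Duration) time.Duration {
	if base > backoffMaxDelay {
		base = backoffMaxDelay
	}
	return applyJitter(base, delayJitter, rand.Float64())
}

// jitterMillis exposes applyJitter in whole milliseconds for quick checks.
func jitterMillis(baseMs int64, jitter, r float64) int64 {
	return applyJitter(time.Duration(baseMs)*time.Millisecond, jitter, r).Milliseconds()
}

func main() {
	// Lower bound and midpoint of the first retry with a 1 s base delay.
	fmt.Println(jitterMillis(1000, delayJitter, 0), jitterMillis(1000, delayJitter, 0.5))
	fmt.Println(nextDelay(time.Second))
}
```

With a 1 s base delay the first retry lands between 0.5 s and 1.5 s, which is the 1 s window mentioned above.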

Drive-bys

  • Replaces errors.Cause (from github.com/pkg/errors, which does not walk standard Unwrap chains) with errors.As in auth_server.go.
  • Replaces interface{} with any in the Grafana client to clear pre-existing lint warnings in files touched here.

Notes

  • The delayJitter constant lives in the shared agent/utils/backoff package, so wider jitter also applies to the slowlog reader retry and the exporter restart loop. Both are local to a single agent process; wider jitter is strictly more decorrelating there and has no downside.
  • The Grafana PostgreSQL datasource (used by dashboards) is a separate connection pool that this PR does not bound. It typically carries steady-state load rather than contributing to reconnect bursts; this can be revisited if dashboards turn out to contribute.

🤖 Generated with Claude Code

When PMM Server restarts (e.g. during a version migration), all agents
lose their long-lived gRPC Connect streams and reconnect at roughly the
same time. Each reconnect goes through nginx auth_request -> pmm-managed
AuthServer, which on cache miss calls Grafana /api/auth/serviceaccount.
With the previous defaults a fleet of 800 agents would issue 800
simultaneous Grafana lookups, exhaust Grafana, and time out at the
3-second auth deadline.

This change reduces the load on Grafana along three independent axes:

- Singleflight around the cache miss in retrieveRole: concurrent calls
  for the same hashed credentials now collapse into a single Grafana
  request; followers wait for the leader and pick up the cached entry.

- Longer auth cache TTL (3s -> 60s) and longer auth timeout (3s -> 15s),
  so a fresh agent's role no longer needs to be re-fetched every few
  seconds and a temporarily slow Grafana does not immediately translate
  to mass 401s.

- Larger HTTP transport pool for the Grafana client (MaxIdleConns
  50 -> 200, explicit MaxIdleConnsPerHost 100 instead of the Go default
  of 2), so reconnect bursts no longer force a fresh TCP/TLS handshake
  for almost every request.

Drive-by: replaces interface{} with any in the Grafana client to clear
pre-existing lint warnings in the files touched here.
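
To illustrate the cache-TTL point in the commit message above, here is a simplified TTL cache keyed by hashed credentials. Every name in it (`roleCache`, `cacheKey`, `lookupsFor`) is hypothetical, and holding the lock across the upstream call is a simplification; this is not the AuthServer implementation.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sync"
	"time"
)

const cacheTTL = 60 * time.Second // new TTL from this PR (previously 3 s)

type entry struct {
	role    string
	expires time.Time
}

// roleCache is a hypothetical stand-in for the auth cache.
type roleCache struct {
	mu      sync.Mutex
	entries map[string]entry
	lookups int // number of upstream (Grafana) calls made
	fetch   func(credentials string) string
}

func newRoleCache(fetch func(string) string) *roleCache {
	return &roleCache{entries: make(map[string]entry), fetch: fetch}
}

// cacheKey hashes the credentials so raw tokens are not used as map keys.
func cacheKey(credentials string) string {
	sum := sha256.Sum256([]byte(credentials))
	return hex.EncodeToString(sum[:])
}

// role returns the cached role, or fetches and caches it on a miss or expiry.
func (c *roleCache) role(credentials string, now time.Time) string {
	key := cacheKey(credentials)
	c.mu.Lock()
	defer c.mu.Unlock()
	if e, ok := c.entries[key]; ok && now.Before(e.expires) {
		return e.role
	}
	c.lookups++
	r := c.fetch(credentials)
	c.entries[key] = entry{role: r, expires: now.Add(cacheTTL)}
	return r
}

// lookupsFor replays requests from one client at the given second offsets
// and returns how many of them reached the upstream.
func lookupsFor(offsetsSeconds ...int) int {
	c := newRoleCache(func(string) string { return "viewer" })
	start := time.Unix(0, 0)
	for _, s := range offsetsSeconds {
		c.role("Bearer token-a", start.Add(time.Duration(s)*time.Second))
	}
	return c.lookups
}

func main() {
	fmt.Println(lookupsFor(0, 5, 59), lookupsFor(0, 61)) // prints "1 2"
}
```
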
@ademidoff ademidoff requested a review from a team as a code owner May 8, 2026 05:48
@ademidoff ademidoff requested review from 4nte and maxkondr and removed request for a team May 8, 2026 05:48
@codecov
codecov Bot commented May 8, 2026

Codecov Report

❌ Patch coverage is 92.85714% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 42.20%. Comparing base (2396f7d) to head (c3dfc02).

Files with missing lines              Patch %   Lines
managed/services/grafana/client.go    90.00%    1 Missing ⚠️

Additional details and impacted files
@@           Coverage Diff           @@
##               v3    #5338   +/-   ##
=======================================
  Coverage   42.20%   42.20%           
=======================================
  Files         410      410           
  Lines       41995    41997    +2     
=======================================
+ Hits        17723    17725    +2     
  Misses      22488    22488           
  Partials     1784     1784           
Flag      Coverage Δ
admin     34.89% <ø> (ø)
agent     49.23% <ø> (ø)
managed   40.72% <92.85%> (+<0.01%) ⬆️
vmproxy   72.41% <ø> (ø)


ademidoff and others added 4 commits May 8, 2026 09:01
errors.Cause from github.com/pkg/errors does not walk standard Unwrap
chains. errors.As is the supported replacement, also used elsewhere in
this file, and removes the need for the //nolint:errorlint suppression.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
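
For illustration, a small standard-library example of `errors.As` finding a typed error through a `%w` wrap chain. The `authError` type and both helper functions are hypothetical and do not come from auth_server.go.

```go
package main

import (
	"errors"
	"fmt"
)

// authError is a hypothetical typed error used only for this example.
type authError struct{ code int }

func (e *authError) Error() string {
	return fmt.Sprintf("auth failed with code %d", e.code)
}

// authCode extracts the code from any *authError in err's Unwrap chain.
func authCode(err error) (int, bool) {
	var ae *authError
	if errors.As(err, &ae) {
		return ae.code, true
	}
	return 0, false
}

// wrappedAuthError wraps the typed error the standard way, with %w.
func wrappedAuthError(code int) error {
	return fmt.Errorf("checking credentials: %w", &authError{code: code})
}

func main() {
	code, ok := authCode(wrappedAuthError(401))
	fmt.Println(code, ok) // prints "401 true"
}
```
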
Each PMM Client authenticates with its own per-node service token, so
the cache key (a hash of the Authorization/Cookie headers) is unique per
client. Singleflight only coalesces concurrent calls sharing the same
key, so for the migration scenario that motivated this PR — hundreds of
clients reconnecting at the same moment — it provided essentially no
deduplication. The longer cache TTL and bigger HTTP transport pool do
the actual load-shedding.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
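
The reasoning above can be checked with a small counting exercise: even ideal per-key coalescing leaves one upstream call per distinct key. The header format and the function names below are made up for the example.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// coalescedCalls returns how many upstream requests remain if concurrent
// calls with the same hashed header are perfectly collapsed into one.
func coalescedCalls(authHeaders []string) int {
	seen := make(map[[32]byte]struct{})
	for _, h := range authHeaders {
		seen[sha256.Sum256([]byte(h))] = struct{}{}
	}
	return len(seen)
}

// fleetHeaders builds n Authorization headers. With shared=false every
// client has its own token, as in the per-node service token setup.
func fleetHeaders(n int, shared bool) []string {
	headers := make([]string, 0, n)
	for i := 0; i < n; i++ {
		token := i
		if shared {
			token = 0
		}
		headers = append(headers, fmt.Sprintf("Bearer hypothetical-token-%d", token))
	}
	return headers
}

func main() {
	fmt.Println(coalescedCalls(fleetHeaders(800, false))) // prints 800
	fmt.Println(coalescedCalls(fleetHeaders(800, true)))  // prints 1
}
```
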
When PMM Server restarts, all clients lose their gRPC streams at the
same instant and start reconnecting against a small randomized window.
With min=1s / max=15s and ±25% jitter, the first retry of a fleet of
800 agents was concentrated in roughly a 0.5-second window — a worst
case of ~1600 reconnects/sec hitting auth_request and Grafana.

Doubling the jitter (±25% -> ±50%) and raising the cap from 15s to 60s
spreads later retries over a wider interval, so a server outage longer
than 15s no longer keeps every agent hammering at the cap simultaneously.

The jitter constant lives in the shared backoff package, so the change
also applies to slowlog and process restart loops; wider jitter is
strictly more decorrelating and has no downside for those local loops.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
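
The window and rate figures quoted in the commit message above follow from simple arithmetic, sketched here under the assumption that first retries are spread uniformly over [base·(1−jitter), base·(1+jitter)]. The function names are illustrative.

```go
package main

import "fmt"

// retryWindowSeconds is the width of the interval in which first retries
// land, given a base delay and a symmetric jitter fraction.
func retryWindowSeconds(baseSeconds, jitter float64) float64 {
	return 2 * baseSeconds * jitter
}

// averageRate is the reconnect rate if all agents retry uniformly
// within the window.
func averageRate(agents int, windowSeconds float64) float64 {
	return float64(agents) / windowSeconds
}

func main() {
	for _, jitter := range []float64{0.25, 0.5} {
		w := retryWindowSeconds(1, jitter)
		fmt.Printf("jitter=%.2f window=%.1fs rate=%.0f/s\n", jitter, w, averageRate(800, w))
	}
}
```

For 800 agents and a 1 s base delay this gives about 1600 reconnects/sec at ±25 % and about 800 reconnects/sec at ±50 %.
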
Reverting the cap bump from the previous commit. With a 60s cap, an
agent in repeated failure can hold up to a minute of metrics in its
local buffer between reconnect attempts, which is a worse trade-off
than the server-side benefit of spreading retries further apart.

The wider jitter (±50%) introduced in the previous commit is kept;
it still meaningfully spreads later retries within the 15s cap.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@ademidoff ademidoff changed the title from "PMM-15050 Reduce Grafana auth load when fleets of agents reconnect" to "PMM-15050 Hi-load performance improvements" May 10, 2026
ademidoff and others added 4 commits May 10, 2026 12:28
Grafana 11 defaults max_open_conn to 0 (unlimited), so a burst of
concurrent auth lookups - e.g. a fleet of agents reconnecting at the
same time - can ask Grafana to open hundreds of fresh Postgres backends
in parallel and saturate Postgres' max_connections, returning
"too many clients" and tipping the reconnect loop into a livelock.

Cap the pool at 100 open / 25 idle. Token validation is a fast query,
so 100 concurrent backends is plenty even for thousands of agents while
staying well within PMM Server's max_connections=2000 ceiling.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
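
As a config fragment, the cap described above would look roughly like this in grafana.ini. It assumes the settings live in the `[database]` section, as in stock Grafana; the rest of PMM's shipped file is not shown here.

```ini
[database]
# 0 (the default) means unlimited open connections
max_open_conn = 100
max_idle_conn = 25
```
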
The pool was hard-coded to 5 idle / 10 open. Under a reconnect storm
from a fleet of agents, every DB-bound auth path (LBAC role lookup,
settings read, agent-state write) queues at 10, and that queue sits on
the same Postgres backends that Grafana's auth flow is competing for.

Bump to 20 idle / 50 open. Stays well within PMM Server's
max_connections=2000 and gives auth paths enough headroom to drain the
burst instead of stalling.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
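
A minimal sketch of the pool change with `database/sql`. The stub connector exists only so the example runs without a Postgres driver; `applyPoolLimits` and `maxOpenAfterLimits` are hypothetical names, not pmm-managed functions.

```go
package main

import (
	"context"
	"database/sql"
	"database/sql/driver"
	"errors"
	"fmt"
)

// stubConnector lets the sketch build a *sql.DB without a real database.
type stubConnector struct{}

func (stubConnector) Connect(context.Context) (driver.Conn, error) {
	return nil, errors.New("stub connector: no real database")
}

func (stubConnector) Driver() driver.Driver { return stubDriver{} }

type stubDriver struct{}

func (stubDriver) Open(string) (driver.Conn, error) {
	return nil, errors.New("stub driver: no real database")
}

// applyPoolLimits sets the idle and open connection limits on the pool.
func applyPoolLimits(db *sql.DB, maxIdle, maxOpen int) {
	db.SetMaxIdleConns(maxIdle)
	db.SetMaxOpenConns(maxOpen)
}

// maxOpenAfterLimits applies the limits to a throwaway pool and reads
// the effective open-connection limit back from the pool statistics.
func maxOpenAfterLimits(maxIdle, maxOpen int) int {
	db := sql.OpenDB(stubConnector{})
	defer db.Close()
	applyPoolLimits(db, maxIdle, maxOpen)
	return db.Stats().MaxOpenConnections
}

func main() {
	// New values from this PR: 20 idle / 50 open (previously 5 / 10).
	fmt.Println(maxOpenAfterLimits(20, 50)) // prints 50
}
```
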
14400s (4h) is already Grafana's default for conn_max_lifetime, so
setting it explicitly is just clutter. The values that actually change
behaviour are max_open_conn and max_idle_conn.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@ademidoff ademidoff marked this pull request as draft May 10, 2026 10:17