PMM-15050 Hi-load performance improvements #5338
Draft
ademidoff wants to merge 9 commits into
When PMM Server restarts (e.g. during a version migration), all agents lose their long-lived gRPC Connect streams and reconnect at roughly the same time. Each reconnect goes through nginx auth_request -> pmm-managed AuthServer, which on cache miss calls Grafana /api/auth/serviceaccount. With the previous defaults a fleet of 800 agents would issue 800 simultaneous Grafana lookups, exhaust Grafana, and time out at the 3-second auth deadline.
This change reduces the load on Grafana along three independent axes:
- Singleflight around the cache miss in retrieveRole: concurrent calls for the same hashed credentials now collapse into a single Grafana request; followers wait for the leader and pick up the cached entry.
- Longer auth cache TTL (3s -> 60s) and longer auth timeout (3s -> 15s), so a fresh agent's role no longer needs to be re-fetched every few seconds and a temporarily slow Grafana does not immediately translate to mass 401s.
- Larger HTTP transport pool for the Grafana client (MaxIdleConns 50 -> 200, explicit MaxIdleConnsPerHost 100 instead of the Go default of 2), so reconnect bursts no longer force a fresh TCP/TLS handshake for almost every request.
Drive-by: replaces interface{} with any in the Grafana client to clear pre-existing lint warnings in the files touched here.
Codecov Report❌ Patch coverage is
Additional details and impacted files
@@           Coverage Diff           @@
##               v3    #5338   +/-   ##
=======================================
  Coverage   42.20%   42.20%
=======================================
  Files         410      410
  Lines       41995    41997    +2
=======================================
+ Hits        17723    17725    +2
  Misses      22488    22488
  Partials     1784     1784
errors.Cause from github.com/pkg/errors does not walk standard Unwrap chains. errors.As is the supported replacement, also used elsewhere in this file, and removes the need for the //nolint:errorlint suppression. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Each PMM Client authenticates with its own per-node service token, so the cache key (a hash of the Authorization/Cookie headers) is unique per client. Singleflight only coalesces concurrent calls sharing the same key, so for the migration scenario that motivated this PR — hundreds of clients reconnecting at the same moment — it provided essentially no deduplication. The longer cache TTL and bigger HTTP transport pool do the actual load-shedding. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
When PMM Server restarts, all clients lose their gRPC streams at the same instant and start reconnecting within a small randomized window. With min=1s / max=15s and ±25% jitter, the first retry of a fleet of 800 agents was concentrated in roughly a 0.5-second window (0.75s to 1.25s), a worst case of ~1600 reconnects/sec hitting auth_request and Grafana. Doubling the jitter (±25% -> ±50%) widens that first window to about 1 second, and raising the cap from 15s to 60s spreads later retries over a wider interval, so a server outage longer than 15s no longer keeps every agent hammering at the cap simultaneously. The jitter constant lives in the shared backoff package, so the change also applies to slowlog and process restart loops; wider jitter is strictly more decorrelating and has no downside for those local loops. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Reverting the cap bump from the previous commit. With a 60s cap, an agent in repeated failure can hold up to a minute of metrics in its local buffer between reconnect attempts, which is a worse trade-off than the server-side benefit of spreading retries further apart. The wider jitter (±50%) introduced in the previous commit is kept; it still meaningfully spreads later retries within the 15s cap. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Grafana 11 defaults max_open_conn to 0 (unlimited), so a burst of concurrent auth lookups - e.g. a fleet of agents reconnecting at the same time - can ask Grafana to open hundreds of fresh Postgres backends in parallel and saturate Postgres' max_connections, returning "too many clients" and tipping the reconnect loop into a livelock. Cap the pool at 100 open / 25 idle. Token validation is a fast query, so 100 concurrent backends is plenty even for thousands of agents while staying well within PMM Server's max_connections=2000 ceiling. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The pool was hard-coded to 5 idle / 10 open. Under a reconnect storm from a fleet of agents, every DB-bound auth path (LBAC role lookup, settings read, agent-state write) queues at 10, and that queue sits on the same Postgres backends that Grafana's auth flow is competing for. Bump to 20 idle / 50 open. Stays well within PMM Server's max_connections=2000 and gives auth paths enough headroom to drain the burst instead of stalling. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
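A minimal sketch of the new limits, assuming the pool in question is a database/sql pool (the commit message does not say which library is used). The stubConnector type and newPool helper are hypothetical; the stub exists only so the example runs without a Postgres driver.

```go
package main

import (
	"context"
	"database/sql"
	"database/sql/driver"
	"errors"
	"fmt"
)

// stubConnector stands in for a real Postgres driver so the sketch runs
// without a database. It never yields a usable connection.
type stubConnector struct{}

func (stubConnector) Connect(context.Context) (driver.Conn, error) {
	return nil, errors.New("stub connector: no database behind it")
}

func (stubConnector) Driver() driver.Driver { return nil }

// newPool applies the limits named in the commit message: 20 idle / 50 open.
func newPool(c driver.Connector) *sql.DB {
	db := sql.OpenDB(c) // does not connect until the pool is first used
	db.SetMaxIdleConns(20)
	db.SetMaxOpenConns(50)
	return db
}

func main() {
	db := newPool(stubConnector{})
	defer db.Close()
	fmt.Println(db.Stats().MaxOpenConnections) // prints 50
}
```

Once the open limit is reached, further queries block until a connection is returned to the pool, which is the queueing behaviour the commit message describes at the old limit of 10.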
14400s (4h) is already Grafana's default for conn_max_lifetime, so setting it explicitly is just clutter. The values that actually change behaviour are max_open_conn and max_idle_conn. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
PMM-15050
FB: SUBMODULES-4350
Summary
When PMM Server is restarted (for example during a version migration), all PMM Clients lose their long-lived gRPC Connect streams and reconnect at roughly the same instant. With the previous defaults, a fleet of 800 agents would issue 800 simultaneous Grafana lookups, exhaust Grafana's request capacity, fan out into hundreds of Postgres backends, and trip Postgres' max_connections, turning the migration into a reconnect storm and a "too many clients" livelock.
This PR addresses the storm at four layers.
pmm-managed auth cache
- Cache TTL 3 s → 60 s in AuthServer. Once an agent's credentials have been validated, the next reconnect within a minute is a free cache hit instead of another Grafana round-trip.
- Auth timeout 3 s → 15 s for auth_request.
- Grafana HTTP transport pool: MaxIdleConns 50 → 200 and explicit MaxIdleConnsPerHost 100 (Go default is 2). Reconnect bursts no longer force a fresh TCP/TLS handshake for almost every request.
Postgres connection pressure
- grafana.ini: max_open_conn = 100, max_idle_conn = 25. Grafana 11's default for max_open_conn is 0 (unlimited), so a synchronized burst of token validations can ask Grafana to open hundreds of Postgres backends and exhaust max_connections. Capping it at 100 contains the blast radius without compromising steady-state performance.
pmm-agent reconnect
- Wider jitter (±25 % → ±50 %) in the shared agent/utils/backoff package. Spreads the first retry of 800 agents across a 1 s window instead of 0.5 s. The cap (backoffMaxDelay) stays at 15 s so an agent in repeated failure does not hold up to a minute of metrics in its local buffer.
Drive-bys
- Replace errors.Cause with errors.As in auth_server.go.
- Replace interface{} with any in the Grafana client to clear pre-existing lint warnings in files touched here.
Notes
- The delayJitter constant lives in the shared agent/utils/backoff package, so wider jitter also applies to the slowlog reader retry and the exporter restart loop. Both are local to a single agent process; wider jitter is strictly more decorrelating there and has no downside.
🤖 Generated with Claude Code