Skip to content

Add ElastiCache auto-discovery for Valkey/Redis services#5303

Open
angelvilardellperez wants to merge 16 commits into
percona:v3from
angelvilardellperez:pmm-autodiscover
Open

Add ElastiCache auto-discovery for Valkey/Redis services#5303
angelvilardellperez wants to merge 16 commits into
percona:v3from
angelvilardellperez:pmm-autodiscover

Conversation

@angelvilardellperez
Copy link
Copy Markdown

Summary

Adds ElastiCache auto-discovery to PMM, following the same pattern as the existing RDS discovery feature. ElastiCache replication groups tagged with pmm_enable=true are automatically discovered and registered as Valkey services in PMM inventory.

Features

  • Background auto-discovery: Reconciler runs every 5 minutes, scanning AWS regions for tagged ElastiCache clusters
  • API endpoint: POST /v1/management/services:discoverElastiCache for manual discovery via Swagger/UI
  • Cluster Mode support: Uses ConfigurationEndpoint for Cluster Mode Enabled, PrimaryEndpoint + ReaderEndpoint for Cluster Mode Disabled
  • Auth filtering: Automatically skips clusters with AUTH tokens or ACL user groups (no credentials support yet)
  • Environment tag: Reads the AWS Environment tag to populate the PMM environment label
  • Stale removal: Services are removed from PMM when the pmm_enable=true tag is removed from the cluster
  • New node type: RemoteElastiCacheNode for inventory tracking (same pattern as RemoteRDSNode)

New files

File Description
api/management/v1/elasticache.proto Discovery API messages (DiscoverElastiCache, AddElastiCacheService)
managed/services/management/elasticache.go API discovery + add service handler
managed/services/management/elasticache_discovery.go Background auto-discovery reconciler

Modified files

  • api/management/v1/service.proto — Added DiscoverElastiCache RPC, elasticache to AddService/Response
  • api/inventory/v1/nodes.proto — Added RemoteElastiCacheNode type
  • managed/models/node_model.go — Added RemoteElastiCacheNodeType
  • managed/services/converters.go — Node type conversion for ElastiCache
  • managed/services/inventory/ — Node list/get/add support
  • managed/services/management/service.go — AddService routing + RemoveService cleanup
  • managed/cmd/pmm-managed/main.go — Background job startup

How it works

PMM starts → 30s delay → reconciler starts (every 5 min):
  1. DescribeReplicationGroups across all AWS regions (parallel)
  2. Filter: engine=redis|valkey, status=available, no AUTH/ACL
  3. ListTagsForResource → keep only pmm_enable=true
  4. Diff with existing PMM services (labeled managed_by=elasticache-autodiscovery)
  5. Add new → RemoteElastiCacheNode + ValkeyService + ValkeyExporter
  6. Remove stale → service + agents + node cleanup

AWS permissions required

  • elasticache:DescribeReplicationGroups
  • elasticache:ListTagsForResource

Uses the default AWS credential chain (IRSA on EKS, env vars, ~/.aws).

Testing

  • Tested locally with PMM dev container against AWS staging ElastiCache clusters
  • Verified: discovery, tag filtering, auth filtering, cluster mode enabled/disabled, add, remove, environment tag extraction

Labels applied to auto-discovered services

managed_by: elasticache-autodiscovery
source: elasticache
engine: valkey|redis
role: primary|reader|cluster

@barryib
Copy link
Copy Markdown

barryib commented Apr 29, 2026

Resolves #5300

@angelvilardellperez angelvilardellperez marked this pull request as ready for review April 30, 2026 07:35
@angelvilardellperez angelvilardellperez requested a review from a team as a code owner April 30, 2026 07:35
@angelvilardellperez angelvilardellperez requested review from JiriCtvrtka and maxkondr and removed request for a team April 30, 2026 07:35
@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 30, 2026

Codecov Report

❌ Patch coverage is 18.63118% with 428 lines in your changes missing coverage. Please review.
✅ Project coverage is 42.90%. Comparing base (85387c6) to head (3aba34e).

Files with missing lines Patch % Lines
...naged/services/management/elasticache_discovery.go 24.22% 209 Missing and 10 partials ⚠️
managed/services/management/elasticache.go 1.07% 184 Missing ⚠️
managed/services/converters.go 0.00% 11 Missing ⚠️
managed/cmd/pmm-managed/main.go 0.00% 5 Missing ⚠️
managed/services/inventory/grpc/nodes_server.go 0.00% 4 Missing ⚠️
managed/services/inventory/nodes.go 92.30% 1 Missing and 1 partial ⚠️
managed/services/management/service.go 33.33% 2 Missing ⚠️
managed/models/agent_helpers.go 0.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##               v3    #5303      +/-   ##
==========================================
- Coverage   43.21%   42.90%   -0.31%     
==========================================
  Files         413      415       +2     
  Lines       42302    42826     +524     
==========================================
+ Hits        18280    18376      +96     
- Misses      22155    22572     +417     
- Partials     1867     1878      +11     
Flag Coverage Δ
admin 34.94% <ø> (ø)
agent 49.22% <ø> (ø)
managed 41.84% <18.63%> (-0.46%) ⬇️
vmproxy 72.41% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@ademidoff
Copy link
Copy Markdown
Member

@angelvilardellperez Run make init && make gen to pass linters.

@angelvilardellperez
Copy link
Copy Markdown
Author

@ademidoff

@angelvilardellperez Run make init && make gen to pass linters.

pushed

angelvilardellperez and others added 6 commits May 5, 2026 16:15
Add background auto-discovery that periodically scans AWS ElastiCache
replication groups tagged with pmm_enable=true and registers them as
Valkey services in PMM inventory.

Features:
- DiscoverElastiCache API endpoint for manual discovery via Swagger/UI
- Background reconciler running every 5 minutes
- Supports Cluster Mode Enabled (ConfigurationEndpoint) and Disabled
  (Primary + Reader endpoints per shard)
- Filters out clusters with AUTH/ACL enabled (no credentials support)
- Uses AWS "Environment" tag for PMM environment label
- Auto-removes services when pmm_enable=true tag is removed
- New RemoteElastiCacheNode type for inventory tracking

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Fix nil response on empty discovery (return empty response, not nil)
- Add RemoteElastiCacheNodeType to compatibleNodeAndAgent map
- Add RemoteElastiCacheNodeType to inventory-layer RemoveService cleanup
- Add instance_id field to AddRemoteElastiCacheNodeParams proto
- Fix misleading "unchanged" count when add failures occur

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Run make init && make gen to regenerate all derived files
after ElastiCache auto-discovery changes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Cover uncovered lines flagged by Codecov:
- nodes.go: AddRemoteElastiCacheNode lifecycle and uniqueness
- service.go: RemoveService/ListServices with ElastiCache node type
- elasticache.go: region listing and engine map
- elasticache_discovery.go: findManagedServices, addInstance, removeService

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Regenerated after rebase onto latest v3 to align with updated
protoc tooling and resolve generated file conflicts.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@CLAassistant
Copy link
Copy Markdown

CLAassistant commented May 15, 2026

CLA assistant check
All committers have signed the CLA.

@JiriCtvrtka
Copy link
Copy Markdown
Contributor

@angelvilardellperez could you run
make gen
make format
go mod tidy

to solve conflicts? Then we can proceed, thank you.

angelvilardellperez and others added 2 commits May 19, 2026 13:01
@angelvilardellperez
Copy link
Copy Markdown
Author

@angelvilardellperez could you run make gen make format go mod tidy

to solve conflicts? Then we can proceed, thank you.

pushed after running the 3 commands. Conflicts still present so i manually accepted the ones for v3

@JiriCtvrtka
Copy link
Copy Markdown
Contributor

Yeah I can still see bunch of changed files due to formatting. Let me try to fix it.

@JiriCtvrtka
Copy link
Copy Markdown
Contributor

@angelvilardellperez Could you check if now PR contains only your changes? If yes I can do final review. Thank you.

@angelvilardellperez
Copy link
Copy Markdown
Author

@JiriCtvrtka all good

Comment on lines +166 to +176
// Remove stale.
var removed int
for addr, svc := range managedByAddr {
if _, exists := expectedByAddr[addr]; exists {
continue
}
if err := d.removeService(ctx, svc); err != nil {
d.l.Warnf("Failed to remove %s (%s): %v", svc.ServiceName, addr, err)
continue
}
removed++
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Region and tag lookup failures are currently treated as successful empty results. That means reconciliation can run with partial discovery data and remove existing managed services that were only missing because AWS calls failed. We should skip stale-removal unless the discovery scan completed successfully.

What do you think?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

commit pushed

angelvilardellperez and others added 4 commits May 20, 2026 12:32
Region and tag-lookup failures were treated as successful empty
results, so reconcile could remove managed services whose AWS calls
had only failed transiently. Now propagate a scanComplete flag from
checkTags through discoverRegionTagged to discoverTaggedInstances,
and gate the stale-removal loop on it. Adds still proceed on partial
scans since they only act on clusters we did successfully observe as
tagged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
buf treats "ElastiCache" as two words, so the ENUM_VALUE_PREFIX rule
expected DISCOVER_ELASTI_CACHE_ENGINE_*. Renamed the three enum
values and propagated to the regenerated pb.go (including rawDesc
length prefixes), swagger JSONs, json client, and Go callers.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@JiriCtvrtka JiriCtvrtka self-requested a review May 21, 2026 07:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants