Add ElastiCache auto-discovery for Valkey/Redis services #5303
angelvilardellperez wants to merge 16 commits into
Resolves #5300
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## v3 #5303 +/- ##
==========================================
- Coverage 43.21% 42.90% -0.31%
==========================================
Files 413 415 +2
Lines 42302 42826 +524
==========================================
+ Hits 18280 18376 +96
- Misses 22155 22572 +417
- Partials 1867 1878 +11
@angelvilardellperez Run
pushed
e4b42b2 to 22d0be5
Add background auto-discovery that periodically scans AWS ElastiCache replication groups tagged with pmm_enable=true and registers them as Valkey services in PMM inventory.

Features:
- DiscoverElastiCache API endpoint for manual discovery via Swagger/UI
- Background reconciler running every 5 minutes
- Supports Cluster Mode Enabled (ConfigurationEndpoint) and Disabled (Primary + Reader endpoints per shard)
- Filters out clusters with AUTH/ACL enabled (no credentials support)
- Uses AWS "Environment" tag for PMM environment label
- Auto-removes services when pmm_enable=true tag is removed
- New RemoteElastiCacheNode type for inventory tracking

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
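The filtering and endpoint selection listed in this commit message can be sketched roughly as below. This is only an illustration: `ReplicationGroup`, `Target`, and `targetsFor` are hypothetical names, not the types used in the PR or in the AWS SDK, and the real code may order or label endpoints differently.

```go
package main

import "fmt"

// ReplicationGroup is a hypothetical, simplified view of an ElastiCache
// replication group. It is NOT the AWS SDK type.
type ReplicationGroup struct {
	ID                    string
	Tags                  map[string]string
	AuthEnabled           bool     // AUTH/ACL enabled: skipped, no credentials support
	ConfigurationEndpoint string   // set only for Cluster Mode Enabled
	PrimaryEndpoints      []string // Cluster Mode Disabled: primary endpoint per shard
	ReaderEndpoints       []string // Cluster Mode Disabled: reader endpoint per shard
}

// Target is one address that would be registered as a Valkey service.
type Target struct {
	Address     string
	Environment string
}

// targetsFor returns the addresses to register for one replication group,
// or nil when the group is not tagged pmm_enable=true or uses AUTH/ACL.
func targetsFor(rg ReplicationGroup) []Target {
	if rg.Tags["pmm_enable"] != "true" || rg.AuthEnabled {
		return nil
	}
	env := rg.Tags["Environment"] // AWS tag reused as the PMM environment label

	// Cluster Mode Enabled: a single configuration endpoint.
	if rg.ConfigurationEndpoint != "" {
		return []Target{{Address: rg.ConfigurationEndpoint, Environment: env}}
	}

	// Cluster Mode Disabled: primary and reader endpoints.
	var out []Target
	for _, addr := range rg.PrimaryEndpoints {
		out = append(out, Target{Address: addr, Environment: env})
	}
	for _, addr := range rg.ReaderEndpoints {
		out = append(out, Target{Address: addr, Environment: env})
	}
	return out
}

func main() {
	rg := ReplicationGroup{
		ID:               "cache-a",
		Tags:             map[string]string{"pmm_enable": "true", "Environment": "prod"},
		PrimaryEndpoints: []string{"primary.cache-a.example:6379"},
		ReaderEndpoints:  []string{"reader.cache-a.example:6379"},
	}
	for _, t := range targetsFor(rg) {
		fmt.Println(t.Address, t.Environment)
	}
}
```

Keeping the selection in a pure function like this makes the tag and AUTH rules easy to unit test without calling AWS.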
- Fix nil response on empty discovery (return empty response, not nil)
- Add RemoteElastiCacheNodeType to compatibleNodeAndAgent map
- Add RemoteElastiCacheNodeType to inventory-layer RemoveService cleanup
- Add instance_id field to AddRemoteElastiCacheNodeParams proto
- Fix misleading "unchanged" count when add failures occur

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Run make init && make gen to regenerate all derived files after ElastiCache auto-discovery changes. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Cover uncovered lines flagged by Codecov:
- nodes.go: AddRemoteElastiCacheNode lifecycle and uniqueness
- service.go: RemoveService/ListServices with ElastiCache node type
- elasticache.go: region listing and engine map
- elasticache_discovery.go: findManagedServices, addInstance, removeService

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Regenerated after rebase onto latest v3 to align with updated protoc tooling and resolve generated file conflicts. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
646879d to f4f6a68
@angelvilardellperez could you run to solve conflicts? Then we can proceed, thank you.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pushed after running the 3 commands. Conflicts were still present, so I manually accepted the ones for v3.
Yeah, I can still see a bunch of changed files due to formatting. Let me try to fix it.
@angelvilardellperez Could you check whether the PR now contains only your changes? If yes, I can do the final review. Thank you.
@JiriCtvrtka all good
// Remove stale.
var removed int
for addr, svc := range managedByAddr {
	if _, exists := expectedByAddr[addr]; exists {
		continue
	}
	if err := d.removeService(ctx, svc); err != nil {
		d.l.Warnf("Failed to remove %s (%s): %v", svc.ServiceName, addr, err)
		continue
	}
	removed++
}
Region and tag lookup failures are currently treated as successful empty results. That means reconciliation can run with partial discovery data and remove existing managed services that were only missing because AWS calls failed. We should skip stale-removal unless the discovery scan completed successfully.
What do you think?
Region and tag-lookup failures were treated as successful empty results, so reconcile could remove managed services whose AWS calls had only failed transiently. Now propagate a scanComplete flag from checkTags through discoverRegionTagged to discoverTaggedInstances, and gate the stale-removal loop on it. Adds still proceed on partial scans since they only act on clusters we did successfully observe as tagged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
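The gating described in this commit can be sketched as follows. The function name `plan` and the use of plain address sets are assumptions made for illustration; the PR itself works with service models and passes the flag through checkTags, discoverRegionTagged, and discoverTaggedInstances.

```go
package main

import (
	"fmt"
	"sort"
)

// plan decides which addresses to add and which to remove.
// managed:      addresses currently registered by auto-discovery
// expected:     addresses observed as tagged in this scan
// scanComplete: false if any region or tag lookup failed
func plan(managed, expected map[string]bool, scanComplete bool) (add, remove []string) {
	// Adds are safe on partial scans: they only cover clusters
	// that were actually observed as tagged.
	for addr := range expected {
		if !managed[addr] {
			add = append(add, addr)
		}
	}
	sort.Strings(add)

	// Stale removal is skipped unless the scan completed, because a
	// missing address may only mean that an AWS call failed.
	if !scanComplete {
		return add, nil
	}
	for addr := range managed {
		if !expected[addr] {
			remove = append(remove, addr)
		}
	}
	sort.Strings(remove)
	return add, remove
}

func main() {
	managed := map[string]bool{"a:6379": true, "b:6379": true}
	expected := map[string]bool{"a:6379": true, "c:6379": true}

	fmt.Println(plan(managed, expected, false)) // [c:6379] []
	fmt.Println(plan(managed, expected, true))  // [c:6379] [b:6379]
}
```

With a partial scan, `b:6379` stays registered until a later scan completes and confirms that it is really gone.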
buf treats "ElastiCache" as two words, so the ENUM_VALUE_PREFIX rule expected DISCOVER_ELASTI_CACHE_ENGINE_*. Renamed the three enum values and propagated to the regenerated pb.go (including rawDesc length prefixes), swagger JSONs, json client, and Go callers. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
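To illustrate why the prefix comes out as DISCOVER_ELASTI_CACHE_ENGINE, here is a naive CamelCase to UPPER_SNAKE_CASE conversion. It is not buf's implementation, and the enum name `DiscoverElastiCacheEngine` is inferred from the expected prefix rather than quoted from the proto file.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// upperSnake inserts an underscore before every uppercase letter that
// follows a non-uppercase character, then upper-cases the result.
// It is deliberately naive and does not handle acronyms such as "RDS".
func upperSnake(s string) string {
	var b strings.Builder
	runes := []rune(s)
	for i, r := range runes {
		if i > 0 && unicode.IsUpper(r) && !unicode.IsUpper(runes[i-1]) {
			b.WriteByte('_')
		}
		b.WriteRune(unicode.ToUpper(r))
	}
	return b.String()
}

func main() {
	// The capital C in "ElastiCache" starts a new word.
	fmt.Println(upperSnake("DiscoverElastiCacheEngine")) // DISCOVER_ELASTI_CACHE_ENGINE
	// With a lowercase c, the name would be treated as a single word.
	fmt.Println(upperSnake("DiscoverElasticacheEngine")) // DISCOVER_ELASTICACHE_ENGINE
}
```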
Summary
Adds ElastiCache auto-discovery to PMM, following the same pattern as the existing RDS discovery feature. ElastiCache replication groups tagged with pmm_enable=true are automatically discovered and registered as Valkey services in PMM inventory.

Features
- POST /v1/management/services:discoverElastiCache for manual discovery via Swagger/UI
- Supports ConfigurationEndpoint for Cluster Mode Enabled, PrimaryEndpoint + ReaderEndpoint for Cluster Mode Disabled
- Uses the AWS Environment tag to populate the PMM environment label
- Auto-removes services when the pmm_enable=true tag is removed from the cluster
- New RemoteElastiCacheNode type for inventory tracking (same pattern as RemoteRDSNode)

New files
- api/management/v1/elasticache.proto
- managed/services/management/elasticache.go
- managed/services/management/elasticache_discovery.go

Modified files
- api/management/v1/service.proto: Added DiscoverElastiCache RPC, elasticache to AddService/Response
- api/inventory/v1/nodes.proto: Added RemoteElastiCacheNode type
- managed/models/node_model.go: Added RemoteElastiCacheNodeType
- managed/services/converters.go: Node type conversion for ElastiCache
- managed/services/inventory/: Node list/get/add support
- managed/services/management/service.go: AddService routing + RemoveService cleanup
- managed/cmd/pmm-managed/main.go: Background job startup

How it works

AWS permissions required
- elasticache:DescribeReplicationGroups
- elasticache:ListTagsForResource

Uses the default AWS credential chain (IRSA on EKS, env vars, ~/.aws).

Testing

Labels applied to auto-discovered services