Skip to content

PLT-2119: Replace per-tag PSLs with single tag-agnostic PrometheusServiceLevel#620

Open
eddiebarth1 wants to merge 4 commits into
mainfrom
plt-2119/stabilize-slo-tag-agnostic-psl
Open

PLT-2119: Replace per-tag PSLs with single tag-agnostic PrometheusServiceLevel#620
eddiebarth1 wants to merge 4 commits into
mainfrom
plt-2119/stabilize-slo-tag-agnostic-psl

Conversation

@eddiebarth1

@eddiebarth1 eddiebarth1 commented Apr 30, 2026

Copy link
Copy Markdown
Contributor

Closes PLT-2119

Why

Every deploy created a per-tag PSL ({app}-{target}-{tag}-servicelevels). Sloth regenerates burn-rate rules whenever a PSL changes, so every deploy reset the burn-rate windows and emitted a fresh alert series per revision — noisy duplicate SLO alerts, untrustworthy burn-rate trends.

What

  • One stable PSL per app/target ({app}-{target}-servicelevels), built by SyncServiceLevels mirroring the structure of the original SyncTaggedServiceLevels.
  • SLI queries keep per-revision granularity: sum by (tag) (rate(...)) replaces the per-tag sum(rate(...{tag="X"})). Numerically identical per tag (verified below).
  • Single writer: ResourceSyncer.syncServiceLevels upserts when the leading releasable revision has enabled SLOs, deletes otherwise. Steady state never writes — CreateOrUpdate no-ops on identical specs, so the PSL only changes when SLO definitions change (rare, and the one case where windows still reset, since the rules genuinely changed).
  • Teardown deletes the shared PSL before the ReleaseManager self-deletes. Service-wide objects normally die with the target namespace; the shared PSL is the exception — it lives in the cross-app service-levels namespace, where no namespace cascade or OwnerReference (cross-namespace) reaches it, and the RM's final reconcile is the last code that ever looks at this app/target. Without this, decommissioned services leak PSLs and Sloth keeps alerting on dead apps.
  • DeleteServiceLevels keeps the list/selector form, minus the bug. The old selector included LabelTag: "", which never matches an absent label — cleanup was a silent no-op. The selector is now {app, target}, which matches the shared PSL and also sweeps any lingering legacy per-tag PSLs once observation is off.
  • SLI queries honor serviceLevelIndicator.tagKey. No service overrides it today (kbfd defaults it to tag), but a custom key (e.g. destination_workload) is label_replace'd into the canonical tag label so the rollback pipeline keeps matching.
  • Legacy per-tag PSLs are scrubbed per-incarnation (Releasing/Canarying/Retiring/Deleting/Failing). Self-limiting; each app converges on its next deploy.
  • Review feedback is codified as architecture. The scope model from Sofie's review — syncer.go owns app/target-lifetime objects, incarnation.go/state handlers own per-revision objects, and shared-resource cleanup needs full-collection visibility — is now written down as the PICCHU-INV-SCOPE-* invariants in CLAUDE.md, so the lesson outlives this PR.

Behavior changes

  • Enabled: false now also removes the SLO's recording rules and ServiceMonitor (previously only the PSL honored it). Disabled uniformly means "do not observe".
  • Automated canary rollback is untouched: IsRevisionTriggered matches picchu's own per-tag canary rules (canary="true"/slo="true"syncCanaryRules, not in this diff). The PSL feeds human burn-rate alerting only.

Verified against prod Thanos (2026-06-09, pre-ship)

  • Recording rules ({app}:{slo}:errors|total, e.g. medium2:authentication_account_login_success:total) carry tag on all 4 production clusters.
  • New SLI = old SLI per tag: 0.00782 vs 0.00786 (query-time jitter).
  • Sloth alert exprs use max without (sloth_window)tag survives into firing alerts.
  • Fleet scan for rules lacking tag: only clientele (aggregates by source_canonical_revision) and drizzle (metric has no tag label) — both already have no per-revision SLI today; not regressions. No is_grpc SLOs exist fleet-wide.

Test plan

  • make test / make fmt vet; unit tests for the lifecycle (Enabled filtering, sync-vs-delete decision, teardown deletes the PSL before RM self-delete)
  • Prod Thanos input verification (above)
  • Staging: migration window — legacy + shared PSL briefly coexist with the same sloth_id (values are identical; watch prometheus-slo for duplicate-sample warnings until scrub completes)
  • Staging: burn-rate windows survive a definition-unchanged deploy; one multi-SLO PSL renders a full Sloth rule group
  • Staging: forced canary violation rolls back; Enabled: false flip removes rules + ServiceMonitor; decommission deletes the shared PSL

🤖 Generated with Claude Code

@eddiebarth1 eddiebarth1 force-pushed the plt-2119/stabilize-slo-tag-agnostic-psl branch from 257e2a9 to fd86181 Compare May 19, 2026 20:57
@eddiebarth1 eddiebarth1 force-pushed the plt-2119/stabilize-slo-tag-agnostic-psl branch from fd86181 to 6c13ce6 Compare June 3, 2026 20:53
Comment thread controllers/syncer.go
Eduardo Barth and others added 2 commits June 9, 2026 17:32
…viceLevel

Every deploy created a per-tag PSL ({app}-{target}-{tag}-servicelevels);
Sloth regenerates burn-rate rules on any PSL change, so each deploy reset
burn-rate windows and emitted duplicate alert series per revision.

- One stable PSL per app/target ({app}-{target}-servicelevels); the name
  comes from one helper shared by the sync and delete paths.
- SLI queries preserve per-revision granularity: sum by (tag) (rate(...))
  replaces the per-tag sum(rate(...{tag=X})). Verified numerically
  identical per tag against prod Thanos.
- Single writer, modeled on syncSLORules: ResourceSyncer.syncServiceLevels
  upserts while the leading releasable revision has enabled SLOs, deletes
  otherwise. Steady state never writes (CreateOrUpdate no-ops on identical
  specs); the PSL only changes when SLO definitions change.
- Teardown deletes the shared PSL before the ReleaseManager self-deletes.
  Unlike namespace-scoped service-wide objects, the PSL lives in the
  cross-app service-levels namespace where neither namespace cascade nor
  cross-namespace OwnerReferences can reach it.
- DeleteServiceLevels deletes by deterministic name; the old LabelTag=""
  selector never matched an absent label (silent no-op).
- Enabled=false is filtered once at the source (prepareServiceLevelObjectives)
  so PSL, SLO rules, and ServiceMonitors uniformly stop observing.
- Legacy per-tag PSLs are scrubbed per-incarnation via
  deleteTaggedServiceLevels; SyncTaggedServiceLevels and its helpers are
  removed as dead code.
- Tests: plan-level sync/delete units plus controller-level coverage for
  prepareServiceLevelObjectives filtering, the sync-vs-delete decision,
  and a teardown regression test.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Codify the scope model from Sofie's review: syncer.go owns
app/target-lifetime objects, incarnation.go and state handlers own
per-revision objects, and shared-resource cleanup requires visibility of
the full incarnation collection (PICCHU-INV-SCOPE-1/2/3).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@eddiebarth1 eddiebarth1 force-pushed the plt-2119/stabilize-slo-tag-agnostic-psl branch from e32f633 to 2787ee9 Compare June 9, 2026 22:32
@eddiebarth1 eddiebarth1 requested a review from sofiegonzalez June 9, 2026 22:32

@sofiegonzalez sofiegonzalez left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

more comments!!

Comment thread controllers/syncer.go
Comment thread controllers/syncer.go Outdated
Comment thread controllers/state.go Outdated
Comment thread controllers/releasemanager_controller.go Outdated
Comment thread controllers/plan/deleteServiceLevels.go
Comment thread controllers/plan/syncServiceLevels_test.go
Comment thread controllers/plan/sloConfig.go Outdated
Comment thread controllers/syncer.go
Comment thread controllers/syncer.go
Comment thread controllers/plan/syncServiceLevels.go Outdated
Eduardo Barth and others added 2 commits June 10, 2026 16:37
Per Sofie's review:
- serviceLevelIndicator.tagKey lets a service expose its revision under a
  label other than "tag" (kbfd defaults it to "tag"; no service overrides
  it today). The shared SLI source now label_replace's a custom tagKey
  into the canonical "tag" label so Sloth recording rules and burn-rate
  alerts keep the label the rollback pipeline matches on, preserving the
  adjustability the old per-tag queries had. Covered by a fixture SLO
  using tagKey: destination_workload.
- Restore the skip log when service-levels-namespace is unset, matching
  the old syncTaggedServiceLevels behavior.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…review

- SyncServiceLevels now mirrors the original SyncTaggedServiceLevels
  structure exactly: list-shaped builder with (value, error) return,
  Enabled checked in the plan, name built by a serviceLevelName method.
- DeleteServiceLevels returns to the list/selector form. The only change
  from the original is dropping LabelTag:"" from the selector - equality
  selectors never match an absent label, which made the old delete a
  silent no-op. The {app, target} selector also sweeps lingering legacy
  per-tag PSLs once observation is off; test covers shared + legacy
  deletion and other-app isolation.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants