feat: add upgrade-lock label to skip node drain during rolling updates #2062
samuelstolicny merged 9 commits into master
Conversation
Operators can label nodes with claudie.io/upgrade-lock to have Claudie skip the drain phase until they remove the label. This prevents data loss on StatefulSets whose replication may take longer than the drain timeout.
Walkthrough: Adds a coordination-only `claudie.io/upgrade-lock` node label; nodes carrying it are skipped during the DELETE_NODES drain phase until the label is removed.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Deleter as Deleter (kuber worker)
    participant Kubectl as internal/kubectl
    participant KubeAPI as Kubernetes API
    participant Operator as Operator / Human

    Deleter->>Kubectl: KubectlNodeHasLabel(nodeName, "claudie.io/upgrade-lock")
    Kubectl->>KubeAPI: kubectl get node <nodeName> -o json
    KubeAPI-->>Kubectl: node JSON (includes metadata.labels)
    Kubectl-->>Deleter: bool (label present?) / error
    alt label present
        Deleter->>Deleter: log "node skipped (upgrade lock)"
        Deleter-->>Deleter: mark locked, continue without cordon/drain/delete
    else label absent
        Deleter->>Deleter: proceed with cordon/drain/delete
    end
    Note over Operator,Deleter: Operator removes label after replication/verification
    Operator->>KubeAPI: kubectl label node <node> claudie.io/upgrade-lock-
    Deleter->>Kubectl: subsequent run detects no label and deletes node
```
🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed (warning)
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
services/kuber/internal/worker/service/internal/nodes/delete.go (1)
141-215: ⚠️ Potential issue | 🔴 Critical: locked master nodes are still removed from etcd.
The `continue` at Line 154 skips cordon/drain/delete for a locked master, but any unlocked master processed later in the loop calls `d.deleteFromEtcd(...)` at Line 207, which iterates over all of `d.masterNodes` (see `deleteFromEtcd` at Lines 342–362) and issues `etcdctl member remove` for every entry it finds, including the ones that were supposed to be protected by the upgrade-lock.

Effects:
- The locked master is evicted from etcd while still running, breaking quorum semantics and potentially corrupting the control plane — the exact data-loss class this feature is meant to prevent.
- On retry, after the operator removes the label, the node is drained normally, but in the meantime its etcd membership has already been torn down.
Suggested fix: build a filtered view of masters to act on (excluding locked ones) and pass it to `deleteFromEtcd`, or short-circuit `deleteFromEtcd` per-node. For example, collect locked names up front and skip them both in the drain loop and inside `deleteFromEtcd`.

🔧 Sketch

```diff
-	var locked bool
+	var locked bool
+	lockedNames := map[string]struct{}{}
@@
-	} else if hasLock {
+	} else if hasLock {
 		logger.Info().Msgf("node %s has upgrade-lock label, skipping drain", master.k8sName)
 		locked = true
+		lockedNames[master.k8sName] = struct{}{}
 		continue
 	}
@@
-	if err := d.deleteFromEtcd(logger, kubectl); err != nil {
+	if err := d.deleteFromEtcd(logger, kubectl, lockedNames); err != nil {
```

And in `deleteFromEtcd`, skip entries whose `k8sName` is in `lockedNames` before issuing `member remove`.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@services/kuber/internal/worker/service/internal/nodes/delete.go` around lines 141 - 215, The loop currently skips cordon/drain for locked masters (d.masterNodes) but still calls d.deleteFromEtcd(...) which iterates all masters; collect locked master k8sNames (e.g., when kubectl.KubectlNodeHasLabel returns true) into a set (lockedNames), skip any locked master in the main loop (continue as now) and pass lockedNames into deleteFromEtcd (or alter deleteFromEtcd to accept the set) so deleteFromEtcd ignores entries whose master.k8sName is in lockedNames; likewise ensure deleteNodesByName is only called for non-locked masters or is given the filtered list to avoid removing locked members from etcd or deleting nodes protected by upgrade-lock.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@internal/kubectl/kubectl.go`:
- Around line 152-169: KubectlNodeHasLabel currently treats any runWithOutput
error as "label absent", which allows deletes to proceed on transient failures;
change KubectlNodeHasLabel (and its use of runWithOutput) to differentiate
definitive "node not found / label absent" from transient/unknown errors: if the
kubectl command returns empty output or an explicit "not found" indication,
return (false, nil); otherwise return a retryable sentinel error (reuse
ErrUpgradeLocked or define a new ErrCouldNotDetermineLabel) so callers in
delete.go (which already handle ErrUpgradeLocked) will defer deletion and retry
instead of proceeding. Ensure references include KubectlNodeHasLabel,
runWithOutput, and ErrUpgradeLocked (or the new sentinel) so the retry path is
triggered on transient failures.
- Around line 152-169: KubectlNodeHasLabel currently treats empty label values
as absent and attempts to query a dotted jsonpath (only escaping dots) which
fails for keys with slashes; change KubectlNodeHasLabel to call getKubeconfig as
before but run kubectl with "-o jsonpath='{.metadata.labels}'" via
runWithOutput, unmarshal the returned JSON into a map[string]string and then
check presence with a map lookup (_, ok := labels[labelKey]) to return
true/false; keep existing error handling and cleanup from getKubeconfig and
return any runWithOutput/unmarshal errors.
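Taken together, the two comments above suggest a label check that (a) decides presence with a map lookup on `metadata.labels` rather than a jsonpath on the dotted key, and (b) only answers "no label" when kubectl actually succeeded. Below is a minimal standalone sketch of that shape, shelling out to `kubectl get node <name> -o json` as the sequence diagram describes; `nodeHasLabel` and `ErrCouldNotDetermineLabel` are illustrative names, not Claudie's actual `internal/kubectl` API.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"os/exec"
	"strings"
)

// ErrCouldNotDetermineLabel signals that the label state is unknown (API
// unreachable, auth failure, timeout, ...). Callers should treat the node as
// locked and retry later instead of draining it.
var ErrCouldNotDetermineLabel = errors.New("could not determine upgrade-lock label state")

// nodeHasLabel reports whether nodeName carries labelKey. Presence is decided
// by a map lookup on metadata.labels, so labels with empty values and keys
// containing "/" or "." are handled correctly.
func nodeHasLabel(nodeName, labelKey string) (bool, error) {
	out, err := exec.Command("kubectl", "get", "node", nodeName, "-o", "json").CombinedOutput()
	if err != nil {
		if strings.Contains(string(out), "NotFound") {
			// The node no longer exists; there is nothing left to protect.
			return false, nil
		}
		// Anything else is a transient/unknown failure: surface it as a
		// retryable error rather than silently reporting "no label".
		return false, fmt.Errorf("%w: %v: %s", ErrCouldNotDetermineLabel, err, out)
	}

	var node struct {
		Metadata struct {
			Labels map[string]string `json:"labels"`
		} `json:"metadata"`
	}
	if err := json.Unmarshal(out, &node); err != nil {
		return false, fmt.Errorf("%w: %v", ErrCouldNotDetermineLabel, err)
	}
	_, ok := node.Metadata.Labels[labelKey]
	return ok, nil
}

func main() {
	locked, err := nodeHasLabel("worker-1", "claudie.io/upgrade-lock")
	switch {
	case errors.Is(err, ErrCouldNotDetermineLabel):
		fmt.Println("label state unknown; deferring drain and retrying later")
	case err != nil:
		fmt.Println("unexpected error:", err)
	case locked:
		fmt.Println("node is upgrade-locked; skipping drain")
	default:
		fmt.Println("no upgrade-lock; safe to drain")
	}
}
```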
In `@services/kuber/internal/worker/service/internal/nodes/delete.go`:
- Around line 149-155: The error branch when calling KubectlNodeHasLabel
(checking master.k8sName) must not fall through to drain; treat any error as an
assumed locked node: set locked = true, log the error at Warn/Info that we are
treating the node as locked, and continue so the node is retried instead of
drained; apply the same change to the analogous worker loop check around lines
226–232 where KubectlNodeHasLabel is used. Ensure you only modify the error
branch (the else-if hasLock branch stays the same) and reference
KubectlNodeHasLabel, master.k8sName, logger, and the locked variable when making
the change.
- Around line 306-309: The code in deleteNodes currently returns
errors.Join(errDel, ErrUpgradeLocked) when locked, which causes the pipeline to
treat the result as a fatal error; instead, stop propagating ErrUpgradeLocked as
part of the returned error so the stage can be retried: when locked is true,
return only errDel (or nil if errDel is nil) and record the upgrade-locked
condition via a diagnostic/logging path (do not include ErrUpgradeLocked in the
returned error). Update the branch handling the locked variable in deleteNodes
so ErrUpgradeLocked is not joined into the returned error but still
logged/tracked separately.
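The two comments above describe a single control-flow change in the delete loop: an inconclusive label check should defer the node rather than drain it, and the upgrade-locked condition should be recorded rather than joined into the error returned to the pipeline. The following self-contained sketch shows that flow under stated assumptions; `hasLabelFn`, `drainFn` and the plain `log` package are stand-ins, not Claudie's actual `delete.go` types or logger.

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// ErrUpgradeLocked mirrors the sentinel used by the PR; in this variant it is
// only logged, never joined into the error returned to the pipeline.
var ErrUpgradeLocked = errors.New("node has upgrade-lock label")

// hasLabelFn stands in for kubectl.KubectlNodeHasLabel.
type hasLabelFn func(node string) (bool, error)

// drainFn stands in for the cordon/drain/delete sequence.
type drainFn func(node string) error

// deleteNodes sketches the reviewer-suggested control flow: nodes whose lock
// state is unknown or locked are skipped (and retried on the next run), while
// only genuine drain failures are returned as errors.
func deleteNodes(nodes []string, hasLabel hasLabelFn, drain drainFn) ([]string, error) {
	var skipped []string
	var errDel error
	for _, n := range nodes {
		locked, lErr := hasLabel(n)
		if lErr != nil {
			// Unknown state: do NOT fall through to drain. Treat as locked.
			log.Printf("could not check upgrade-lock on %s, assuming locked: %v", n, lErr)
			skipped = append(skipped, n)
			continue
		}
		if locked {
			log.Printf("node %s has upgrade-lock label, skipping drain", n)
			skipped = append(skipped, n)
			continue
		}
		if dErr := drain(n); dErr != nil {
			errDel = errors.Join(errDel, fmt.Errorf("deleting %s: %w", n, dErr))
		}
	}
	if len(skipped) > 0 {
		// Record the locked condition for diagnostics, but keep it out of the
		// returned error so the stage is retried rather than failed.
		log.Printf("%d node(s) deferred: %v", len(skipped), ErrUpgradeLocked)
	}
	return skipped, errDel
}

func main() {
	skipped, err := deleteNodes(
		[]string{"worker-1", "worker-2"},
		func(n string) (bool, error) { return n == "worker-1", nil },
		func(n string) error { return nil },
	)
	fmt.Println("deferred nodes:", skipped, "error:", err)
}
```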
---
Outside diff comments:
In `@services/kuber/internal/worker/service/internal/nodes/delete.go`:
- Around line 141-215: The loop currently skips cordon/drain for locked masters
(d.masterNodes) but still calls d.deleteFromEtcd(...) which iterates all
masters; collect locked master k8sNames (e.g., when kubectl.KubectlNodeHasLabel
returns true) into a set (lockedNames), skip any locked master in the main loop
(continue as now) and pass lockedNames into deleteFromEtcd (or alter
deleteFromEtcd to accept the set) so deleteFromEtcd ignores entries whose
master.k8sName is in lockedNames; likewise ensure deleteNodesByName is only
called for non-locked masters or is given the filtered list to avoid removing
locked members from etcd or deleting nodes protected by upgrade-lock.
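The comment and the diff sketch in the review both point at passing a set of locked node names into the etcd-removal step. A small sketch of what that filter could look like is below; `etcdMember`, `lockedSet` and `removeMemberFn` are hypothetical stand-ins and do not reflect Claudie's actual `deleteFromEtcd` signature.

```go
package main

import "fmt"

// lockedSet holds the k8s node names protected by the upgrade-lock label,
// collected while iterating the masters in the drain loop.
type lockedSet map[string]struct{}

type etcdMember struct {
	k8sName  string
	memberID string
}

// removeMemberFn stands in for running "etcdctl member remove <id>".
type removeMemberFn func(memberID string) error

// deleteFromEtcd sketches the suggested filter: every master whose name is in
// locked is left untouched, so its etcd membership (and quorum) survives until
// the operator releases the lock.
func deleteFromEtcd(masters []etcdMember, locked lockedSet, remove removeMemberFn) error {
	for _, m := range masters {
		if _, ok := locked[m.k8sName]; ok {
			fmt.Printf("master %s is upgrade-locked, keeping etcd member %s\n", m.k8sName, m.memberID)
			continue
		}
		if err := remove(m.memberID); err != nil {
			return fmt.Errorf("removing etcd member %s (%s): %w", m.memberID, m.k8sName, err)
		}
	}
	return nil
}

func main() {
	masters := []etcdMember{{"master-1", "a1b2"}, {"master-2", "c3d4"}}
	locked := lockedSet{"master-1": {}}
	_ = deleteFromEtcd(masters, locked, func(id string) error {
		fmt.Println("etcdctl member remove", id)
		return nil
	})
}
```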
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 9e07f3c4-f959-4f3d-b228-3270f232bb3e
📒 Files selected for processing (3)
- docs/update/update.md
- internal/kubectl/kubectl.go
- services/kuber/internal/worker/service/internal/nodes/delete.go
If there's only a subset of the nodes that have the label, what happens to the rest?
All nodes with the `claudie.io/upgrade-lock` label are skipped. The workflow then repeats until the labels are removed.
Summary
Adds an operator-controlled "pause button" for Claudie rolling updates. Any node labeled with `claudie.io/upgrade-lock` is skipped during the DELETE_NODES drain phase, and Claudie keeps retrying until the operator removes the label.

Closes #2021.
Solution
When the operator applies the label `claudie.io/upgrade-lock` to a node, Claudie's DeleteNodes loop skips cordon/drain on that node, returns `ErrUpgradeLocked`, and relies on the existing task-retry mechanism to reschedule DELETE_NODES until the label is removed.
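To illustrate the retry side of that interplay, here is a hedged sketch of how a task handler could react to the sentinel: an upgrade-locked result reschedules the stage instead of failing it. The `requeue`/`fail` helpers and the handler itself are hypothetical and do not depict Claudie's actual scheduler; only `ErrUpgradeLocked` and the ~25s cadence come from this PR.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// ErrUpgradeLocked is the sentinel returned by the DELETE_NODES stage when at
// least one node still carries claudie.io/upgrade-lock.
var ErrUpgradeLocked = errors.New("upgrade locked")

// requeue and fail stand in for whatever the real task scheduler does.
func requeue(task string, after time.Duration) { fmt.Printf("requeue %s in %s\n", task, after) }
func fail(task string, err error)              { fmt.Printf("fail %s: %v\n", task, err) }

// handleDeleteNodesResult shows the intended flow: an upgrade-locked result is
// not a failure, it simply reschedules DELETE_NODES until the operator removes
// the label; any other error fails the task as usual.
func handleDeleteNodesResult(err error) {
	switch {
	case err == nil:
		fmt.Println("DELETE_NODES finished")
	case errors.Is(err, ErrUpgradeLocked):
		requeue("DELETE_NODES", 25*time.Second) // matches the ~25s retries observed in testing
	default:
		fail("DELETE_NODES", err)
	}
}

func main() {
	handleDeleteNodesResult(ErrUpgradeLocked)
	handleDeleteNodesResult(nil)
}
```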
Why a label and not a taint

The original design used a taint, but testing on a live cluster showed that Claudie's PATCH_NODES stage uses `"replace"` on `/spec/taints`, which wipes the entire taints array on every reconcile. The operator-applied taint never survived long enough for DELETE_NODES to see it.

Labels use a per-key JSON patch (`replace /metadata/labels/<key>`), so operator-applied labels are preserved across reconciles. Using a label also matches the semantics better: the upgrade-lock is a coordination signal with Claudie, not a scheduling rule. Operators who want additional scheduling protection can apply their own taints independently.

Operator workflow
Testing
End-to-end validated on a live Hetzner cluster (1 control-plane + 2 workers, template tag bumped v0.10.0 -> v0.11.0):
- Observed `node <name> has upgrade-lock label, skipping drain` in the logs
- `ErrUpgradeLocked` rescheduled the task; observed retries every ~25s

Test plan
- `go build ./...`

Summary by CodeRabbit
Documentation
New Features
Chores