feat: add upgrade-lock label to skip node drain during rolling updates#2062

Merged
samuelstolicny merged 9 commits into master from feat/upgrade-lock-taint
Apr 20, 2026

Conversation

@samuelstolicny
Contributor

@samuelstolicny samuelstolicny commented Apr 20, 2026

Summary

Adds an operator-controlled "pause button" for Claudie rolling updates. Any node labeled with claudie.io/upgrade-lock is skipped during the DELETE_NODES drain phase, and Claudie keeps retrying until the operator removes the label.

Closes #2021.

Solution

When the operator applies the label claudie.io/upgrade-lock to a node, Claudie's DeleteNodes loop skips cordon/drain on that node, returns ErrUpgradeLocked, and relies on the existing task-retry mechanism to reschedule DELETE_NODES until the label is removed.

Why a label and not a taint

The original design used a taint, but testing on a live cluster showed that Claudie's PATCH_NODES stage uses "replace" on /spec/taints, which wipes the entire taints array on every reconcile. The operator-applied taint never survived long enough for DELETE_NODES to see it.

Labels use a per-key JSON patch (replace /metadata/labels/<key>), so operator-applied labels are preserved across reconciles. Using a label also matches the semantics better: the upgrade-lock is a coordination signal with Claudie, not a scheduling rule. Operators who want additional scheduling protection can apply their own taints independently.
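The per-key patch path relies on JSON Pointer escaping (RFC 6901): the `/` inside a label key like claudie.io/upgrade-lock must be encoded as `~1` so it is not read as a path separator. A minimal sketch of building that path — the helper name is hypothetical, as the PR does not show Claudie's patch-building code:

```go
package main

import (
	"fmt"
	"strings"
)

// labelPatchPath builds the JSON-Pointer path for a per-key label patch.
// Per RFC 6901, "~" must be escaped to "~0" before "/" is escaped to
// "~1", so slashes inside the key survive as part of the key.
func labelPatchPath(key string) string {
	escaped := strings.ReplaceAll(key, "~", "~0")
	escaped = strings.ReplaceAll(escaped, "/", "~1")
	return "/metadata/labels/" + escaped
}

func main() {
	// A patch like {"op":"replace","path":<this>,"value":"true"} touches
	// only one label, unlike "replace /spec/taints", which overwrites
	// the whole taints array.
	fmt.Println(labelPatchPath("claudie.io/upgrade-lock"))
}
```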

Operator workflow

# Before triggering an update
kubectl label node <node-name> claudie.io/upgrade-lock=true

# Apply updated InputManifest
kubectl apply -f manifest.yaml

# Claudie drains unlabeled nodes, skips labeled ones, and retries
# Verify replication/health on your workload

# Release the node when safe
kubectl label node <node-name> claudie.io/upgrade-lock-

Testing

End-to-end validated on a live Hetzner cluster (1 control-plane + 2 workers, template tag bumped v0.10.0 -> v0.11.0):

  • Label survived all reconcile cycles (TERRAFORMER, ANSIBLER, KUBE_ELEVEN, PATCH_NODES)
  • DELETE_NODES logged node <name> has upgrade-lock label, skipping drain
  • ErrUpgradeLocked rescheduled the task, observed retries every ~25s
  • Unlabeled peer worker was drained and deleted normally
  • After removing the label, the next retry cleanly drained and deleted the locked node
  • No cluster impact from the skip; DONE state reached after unlock

Test plan

  • Builds clean (go build ./...)
  • End-to-end rolling update on live Hetzner cluster with one worker locked
  • Verified unlabeled peer drained, labeled peer skipped
  • Verified label removal triggers drain on next retry

Summary by CodeRabbit

  • Documentation

    • Added guidance on using an upgrade-lock node label: rollout behavior, expected logs, and operator steps to apply/remove the label.
  • New Features

    • Nodes labeled with the upgrade-lock are skipped during drain/deletion and deferred for retry until the label is removed.
  • Chores

    • Updated manifest image tags for multiple components.

Operators can label nodes with claudie.io/upgrade-lock to have Claudie
skip the drain phase until they remove the label. Prevents data loss on
StatefulSets whose replication may exceed the drain timeout.
@samuelstolicny samuelstolicny requested a review from Despire April 20, 2026 07:50
@coderabbitai
Contributor

coderabbitai Bot commented Apr 20, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


Walkthrough

Adds a coordination-only claudie.io/upgrade-lock node label and support to skip draining/deleting labeled nodes during rolling updates; implements KubectlNodeHasLabel to read node labels from kubectl get node -o json, integrates the check into node deletion to defer nodes and return ErrUpgradeLocked when any were skipped, adds docs, and updates manifest image tags.
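The label check described above boils down to parsing the JSON that `kubectl get node <name> -o json` emits and testing for key presence in metadata.labels. A self-contained sketch — the real `KubectlNodeHasLabel` shells out to kubectl, which is replaced here with a static JSON document:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// hasLabel reports whether metadata.labels in a node's JSON document
// contains labelKey. Mirrors the behavior the walkthrough describes
// for KubectlNodeHasLabel; the function name and input are illustrative.
func hasLabel(nodeJSON []byte, labelKey string) (bool, error) {
	var node struct {
		Metadata struct {
			Labels map[string]string `json:"labels"`
		} `json:"metadata"`
	}
	if err := json.Unmarshal(nodeJSON, &node); err != nil {
		return false, fmt.Errorf("parsing node JSON: %w", err)
	}
	// A map lookup handles keys containing "/" (like claudie.io/upgrade-lock)
	// without any jsonpath escaping concerns.
	_, ok := node.Metadata.Labels[labelKey]
	return ok, nil
}

func main() {
	nodeJSON := []byte(`{"metadata":{"labels":{"claudie.io/upgrade-lock":"true"}}}`)
	got, err := hasLabel(nodeJSON, "claudie.io/upgrade-lock")
	fmt.Println(got, err)
}
```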

Changes

Cohort / File(s): Summary

  • Documentation (docs/update/update.md): New docs describing claudie.io/upgrade-lock semantics, operator workflow (label before verification, remove after), skip-on-label during drain, retry/defer behavior, and expected log messages.
  • Kubectl Utility (internal/kubectl/kubectl.go): Adds KubectlNodeHasLabel(nodeName, labelKey string) (bool, error), which runs kubectl get node <nodeName> -o json, parses metadata.labels, and returns label presence or an error.
  • Node Deletion Logic (services/kuber/internal/worker/service/internal/nodes/delete.go): Introduces upgradeLockLabelKey (claudie.io/upgrade-lock) and exported ErrUpgradeLocked; DeleteNodes() checks the label via the new kubectl helper, skips cordon/drain/delete for labeled nodes (logs and marks the run as locked), and returns errors.Join(errDel, ErrUpgradeLocked) if any nodes were skipped.
  • Manifests - Claudie (manifests/claudie/kustomization.yaml): Updated multiple images[].newTag values from d9754c7-4083 to ccc8651-4088.
  • Manifests - Testing Framework (manifests/testing-framework/kustomization.yaml): Updated the image tag override for ghcr.io/berops/claudie/testing-framework from d9754c7-4083 to ccc8651-4088.

Sequence Diagram(s)

sequenceDiagram
    participant Deleter as Deleter (kuber worker)
    participant Kubectl as internal/kubectl
    participant KubeAPI as Kubernetes API
    participant Operator as Operator / Human

    Deleter->>Kubectl: KubectlNodeHasLabel(nodeName, "claudie.io/upgrade-lock")
    Kubectl->>KubeAPI: kubectl get node <nodeName> -o json
    KubeAPI-->>Kubectl: node JSON (includes metadata.labels)
    Kubectl-->>Deleter: bool (label present?) / error

    alt label present
        Deleter->>Deleter: log "node skipped (upgrade lock)"
        Deleter-->>Deleter: mark locked, continue without cordon/drain/delete
    else label absent
        Deleter->>Deleter: proceed with cordon/drain/delete
    end

    Note over Operator,Deleter: Operator removes label after replication/verification
    Operator->>KubeAPI: kubectl label node <node> claudie.io/upgrade-lock-
    Deleter->>Kubectl: subsequent run detects no label and deletes node

Suggested labels

test-set-rolling-update, refresh-docs

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Out of Scope Changes check (⚠️ Warning): Image tag updates in kustomization files appear administrative and unrelated to the core upgrade-lock feature. These changes should be justified or separated into a distinct maintenance PR. Resolution: either document the reason for the image tag updates (ccc8651-4088) or move them to a separate commit/PR focused on dependency updates.
✅ Passed checks (4 passed)
  • Description Check (✅ Passed): Check skipped; CodeRabbit's high-level summary is enabled.
  • Title check (✅ Passed): The title accurately describes the main feature added: an upgrade-lock label mechanism to skip node drain during rolling updates, which is the primary objective of this PR.
  • Linked Issues check (✅ Passed): All PR objectives directly address the linked issue #2021: the upgrade-lock label gives operators a mechanism to prevent premature node draining/deletion during Claudie upgrades, preventing data loss during replication.
  • Docstring Coverage (✅ Passed): No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.




@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
services/kuber/internal/worker/service/internal/nodes/delete.go (1)

141-215: ⚠️ Potential issue | 🔴 Critical

Critical: locked master nodes are still removed from etcd.

The continue at Line 154 skips cordon/drain/delete for a locked master, but any unlocked master processed later in the loop calls d.deleteFromEtcd(...) at Line 207, which iterates over all of d.masterNodes (see deleteFromEtcd at Lines 342–362) and issues etcdctl member remove for every entry it finds — including the ones that were supposed to be protected by the upgrade-lock.

Effects:

  • The locked master is evicted from etcd while still running, breaking quorum semantics and potentially corrupting the control plane — the exact data-loss class this feature is meant to prevent.
  • On retry, after the operator removes the label, the node is drained normally, but in the meantime its etcd membership has already been torn down.

Suggested fix: build a filtered view of masters to act on (excluding locked ones) and pass it to deleteFromEtcd, or short-circuit deleteFromEtcd per-node. For example, collect locked names up front and skip them both in the drain loop and inside deleteFromEtcd.

🔧 Sketch
-	var locked bool
+	var locked bool
+	lockedNames := map[string]struct{}{}
@@
-		} else if hasLock {
+		} else if hasLock {
 			logger.Info().Msgf("node %s has upgrade-lock label, skipping drain", master.k8sName)
 			locked = true
+			lockedNames[master.k8sName] = struct{}{}
 			continue
 		}
@@
-		if err := d.deleteFromEtcd(logger, kubectl); err != nil {
+		if err := d.deleteFromEtcd(logger, kubectl, lockedNames); err != nil {

And in deleteFromEtcd, skip entries whose k8sName is in lockedNames before issuing member remove.
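The filtering the review suggests can be illustrated in isolation. The function name and string-based master list are hypothetical; the real deleteFromEtcd shells out to etcdctl per member:

```go
package main

import "fmt"

// membersToRemove returns the masters whose etcd membership may be torn
// down, skipping any name present in lockedNames — the set collected in
// the drain loop for nodes carrying the upgrade-lock label.
func membersToRemove(masters []string, lockedNames map[string]struct{}) []string {
	var out []string
	for _, m := range masters {
		if _, locked := lockedNames[m]; locked {
			continue // keep the locked master's etcd membership intact
		}
		out = append(out, m)
	}
	return out
}

func main() {
	locked := map[string]struct{}{"master-2": {}}
	fmt.Println(membersToRemove([]string{"master-1", "master-2"}, locked))
}
```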

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@services/kuber/internal/worker/service/internal/nodes/delete.go` around lines
141 - 215, The loop currently skips cordon/drain for locked masters
(d.masterNodes) but still calls d.deleteFromEtcd(...) which iterates all
masters; collect locked master k8sNames (e.g., when kubectl.KubectlNodeHasLabel
returns true) into a set (lockedNames), skip any locked master in the main loop
(continue as now) and pass lockedNames into deleteFromEtcd (or alter
deleteFromEtcd to accept the set) so deleteFromEtcd ignores entries whose
master.k8sName is in lockedNames; likewise ensure deleteNodesByName is only
called for non-locked masters or is given the filtered list to avoid removing
locked members from etcd or deleting nodes protected by upgrade-lock.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@internal/kubectl/kubectl.go`:
- Around line 152-169: KubectlNodeHasLabel currently treats any runWithOutput
error as "label absent", which allows deletes to proceed on transient failures;
change KubectlNodeHasLabel (and its use of runWithOutput) to differentiate
definitive "node not found / label absent" from transient/unknown errors: if the
kubectl command returns empty output or an explicit "not found" indication,
return (false, nil); otherwise return a retryable sentinel error (reuse
ErrUpgradeLocked or define a new ErrCouldNotDetermineLabel) so callers in
delete.go (which already handle ErrUpgradeLocked) will defer deletion and retry
instead of proceeding. Ensure references include KubectlNodeHasLabel,
runWithOutput, and ErrUpgradeLocked (or the new sentinel) so the retry path is
triggered on transient failures.
- Around line 152-169: KubectlNodeHasLabel currently treats empty label values
as absent and attempts to query a dotted jsonpath (only escaping dots) which
fails for keys with slashes; change KubectlNodeHasLabel to call getKubeconfig as
before but run kubectl with "-o jsonpath='{.metadata.labels}'" via
runWithOutput, unmarshal the returned JSON into a map[string]string and then
check presence with a map lookup (_, ok := labels[labelKey]) to return
true/false; keep existing error handling and cleanup from getKubeconfig and
return any runWithOutput/unmarshal errors.

In `@services/kuber/internal/worker/service/internal/nodes/delete.go`:
- Around line 149-155: The error branch when calling KubectlNodeHasLabel
(checking master.k8sName) must not fall through to drain; treat any error as an
assumed locked node: set locked = true, log the error at Warn/Info that we are
treating the node as locked, and continue so the node is retried instead of
drained; apply the same change to the analogous worker loop check around lines
226–232 where KubectlNodeHasLabel is used. Ensure you only modify the error
branch (the else-if hasLock branch stays the same) and reference
KubectlNodeHasLabel, master.k8sName, logger, and the locked variable when making
the change.
- Around line 306-309: The code in deleteNodes currently returns
errors.Join(errDel, ErrUpgradeLocked) when locked, which causes the pipeline to
treat the result as a fatal error; instead, stop propagating ErrUpgradeLocked as
part of the returned error so the stage can be retried: when locked is true,
return only errDel (or nil if errDel is nil) and record the upgrade-locked
condition via a diagnostic/logging path (do not include ErrUpgradeLocked in the
returned error). Update the branch handling the locked variable in deleteNodes
so ErrUpgradeLocked is not joined into the returned error but still
logged/tracked separately.

---

Outside diff comments:
In `@services/kuber/internal/worker/service/internal/nodes/delete.go`:
- Around line 141-215: The loop currently skips cordon/drain for locked masters
(d.masterNodes) but still calls d.deleteFromEtcd(...) which iterates all
masters; collect locked master k8sNames (e.g., when kubectl.KubectlNodeHasLabel
returns true) into a set (lockedNames), skip any locked master in the main loop
(continue as now) and pass lockedNames into deleteFromEtcd (or alter
deleteFromEtcd to accept the set) so deleteFromEtcd ignores entries whose
master.k8sName is in lockedNames; likewise ensure deleteNodesByName is only
called for non-locked masters or is given the filtered list to avoid removing
locked members from etcd or deleting nodes protected by upgrade-lock.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 9e07f3c4-f959-4f3d-b228-3270f232bb3e

📥 Commits

Reviewing files that changed from the base of the PR and between ca5b777 and 3413123.

📒 Files selected for processing (3)
  • docs/update/update.md
  • internal/kubectl/kubectl.go
  • services/kuber/internal/worker/service/internal/nodes/delete.go

@samuelstolicny samuelstolicny requested a review from Despire April 20, 2026 09:34

@Despire Despire left a comment


LGTM

@samuelstolicny samuelstolicny added this pull request to the merge queue Apr 20, 2026
Merged via the queue into master with commit 1a0e274 Apr 20, 2026
@samuelstolicny samuelstolicny deleted the feat/upgrade-lock-taint branch April 20, 2026 11:40
@bernardhalas
Member

If only a subset of the nodes have the claudie.io/upgrade-lock label, will DELETE_NODES always pause when the workflow reaches such a node? Or will the workflow skip ahead to the next node that doesn't have the upgrade-lock label set?

@Despire
Contributor

Despire commented Apr 20, 2026

All nodes with the claudie.io/upgrade-lock label will be skipped; the other nodes will be drained and removed.

The workflow then repeats until the labels are removed.


Development

Successfully merging this pull request may close these issues.

Node drain while Claudie rolling upgrade

3 participants