fix(contexts): refresh registry against disk on each list call (SKY-834 bug 52) #595

Open
eliran-ops wants to merge 3 commits into main from feat/investigate-stale-junk-clusters-lingering-in-cluster-list

Conversation


@eliran-ops (Contributor) commented Apr 30, 2026

Linear: SKY-834

Companion to #590 (bug 54 — cert-health namespace mismatch).

Why

Bug 52: junk / stale clusters lingering in the cluster dropdown.

In multi-file (isolated-load) mode, contextRegistry + perFileConfigs were built once in setupIsolatedLoad and never refreshed. Single-file mode reloads on every GetAvailableContexts call (it walks RawConfig() through DeferredLoadingClientConfig), but the multi-file path served whatever was in memory.

So the dropdown kept showing entries for:

  • kubeconfig files removed from a watched directory,
  • kubeconfig files rewritten by kubectl config delete-context,
  • kubeconfig files written then deleted by CAPI when a managed cluster was destroyed.

All three look identical to the user: a row that's there but errors out on switch.
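For contrast, a sketch of why the single-file path never goes stale. It assumes the deferred config is rebuilt on each call, which is what makes every call hit disk; this is illustrative, not Radar's actual wiring:

```go
package k8s

import (
	"k8s.io/client-go/tools/clientcmd"
)

// listContextsSingleFile mirrors the single-file behavior described above:
// a fresh deferred config is built per call, so RawConfig() re-reads the
// kubeconfig from disk and the result can never be stale.
func listContextsSingleFile(kubeconfigPath string) ([]string, error) {
	rules := &clientcmd.ClientConfigLoadingRules{ExplicitPath: kubeconfigPath}
	deferred := clientcmd.NewNonInteractiveDeferredLoadingClientConfig(
		rules, &clientcmd.ConfigOverrides{})
	raw, err := deferred.RawConfig() // hits disk now, not at construction time
	if err != nil {
		return nil, err
	}
	names := make([]string, 0, len(raw.Contexts))
	for name := range raw.Contexts {
		names = append(names, name)
	}
	return names, nil
}
```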

What

New helper refreshContextRegistry — surgical, mtime-driven reconciliation against disk.

| Disk state | Action |
| --- | --- |
| File missing | Drop every registry entry pointing at it. Drop cached config + mtime. |
| File mtime moved | Re-parse *that* file only. Drop entries for removed contexts; add entries for new ones. |
| File mtime unchanged | No-op (avoids re-parsing N files on every dropdown open). |
| File present but parse fails | Keep prior entries (user might be mid-edit; silent pruning would be worse). |

Wired into GetAvailableContexts (multi-file branch only) under clientMu write lock. Same per-file isolation guarantees as buildContextRegistry — colliding user/cluster names across files remain structurally impossible.
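A minimal sketch of that reconciliation, assuming the registry maps context name → source kubeconfig path (the real entry type is richer) and that `clientcmd.LoadFromFile` does the per-file parse; it also folds in the follow-up fixes discussed further down (seeding the file set from the mtime cache, nil-safety):

```go
package k8s

import (
	"os"
	"time"

	"k8s.io/client-go/tools/clientcmd"
	clientcmdapi "k8s.io/client-go/tools/clientcmd/api"
)

// refreshContextRegistry reconciles the in-memory state against disk.
// Caller must hold clientMu for writing. Returns true if anything changed.
// The map shapes here are illustrative assumptions, not the repo's types.
func refreshContextRegistry(
	registry map[string]string, // context name -> kubeconfig file path
	fileConfigs map[string]*clientcmdapi.Config,
	fileMtimes map[string]time.Time,
) bool {
	if fileMtimes == nil {
		return false // nil-safe no-op; callers are expected to seed the map
	}
	changed := false

	// Seed the file set from BOTH registry entries and tracked mtimes, so a
	// file whose contexts were all removed is still re-stat'd on later calls.
	byFile := map[string]struct{}{}
	for _, path := range registry {
		byFile[path] = struct{}{}
	}
	for path := range fileMtimes {
		byFile[path] = struct{}{}
	}

	for path := range byFile {
		info, err := os.Stat(path)
		if err != nil {
			// File missing: drop every entry pointing at it, plus its caches.
			for name, p := range registry {
				if p == path {
					delete(registry, name)
					changed = true
				}
			}
			delete(fileConfigs, path)
			delete(fileMtimes, path)
			continue
		}
		if prev, ok := fileMtimes[path]; ok && prev.Equal(info.ModTime()) {
			continue // mtime unchanged: skip the re-parse entirely
		}
		cfg, err := clientcmd.LoadFromFile(path)
		if err != nil {
			continue // parse failure: keep prior entries, user may be mid-edit
		}
		// mtime moved: drop contexts removed from this file, add new ones.
		for name, p := range registry {
			if p == path {
				if _, still := cfg.Contexts[name]; !still {
					delete(registry, name)
					changed = true
				}
			}
		}
		for name := range cfg.Contexts {
			if _, ok := registry[name]; !ok {
				registry[name] = path
				changed = true
			}
		}
		fileConfigs[path] = cfg
		fileMtimes[path] = info.ModTime()
	}
	return changed
}
```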

Out of scope (deliberately deferred)

  1. Re-scanning kubeconfig DIRECTORIES for newly-added files. Today, a freshly-created kubeconfig still requires a Radar restart to appear. Same workaround as before. Worth a follow-up but raises a syscall-budget question (every dropdown open shouldn't ReadDir all watched dirs without throttling).
  2. fsnotify-based eager refresh. Same data, more moving parts; on-demand mtime check on dropdown open is enough to clear the junk-cluster complaint.

Test plan

  • `go build ./...`
  • `go test ./internal/k8s/` (all green; 5 new subtests for refresh)
  • Manual: start Radar with two kubeconfig files; run `kubectl config delete-context foo` in one of them; reopen the dropdown — foo is gone, no restart needed.
  • Manual: `rm` one of the watched kubeconfig files; reopen the dropdown — every context from that file is gone.

Made with Cursor


Note

Medium Risk
Touches concurrency and shared state around kubeconfig context enumeration; incorrect locking or refresh logic could cause deadlocks, performance regressions, or missing/incorrect context listings.

Overview
Prevents stale/junk kubeconfig contexts from lingering in the multi-file (isolated-load) context list by adding an mtime-backed refresh path that drops deleted files, updates rewritten files, and picks up newly added/removed contexts without requiring a restart.

GetAvailableContexts now refreshes contextRegistry/perFileConfigs on each call under a single clientMu write lock to avoid concurrent map access, and CAPI temp-kubeconfig updates/promotions now seed and maintain the new perFileMtimes cache. Adds targeted unit tests covering file deletion, in-place rewrites, new contexts, parse failures, and nil/empty-edge cases.

Reviewed by Cursor Bugbot for commit 5f2654d. Bugbot is set up for automated code reviews on this repo.

Bug 52: junk / stale clusters lingering in the cluster dropdown.

Root cause: in multi-file (isolated-load) mode, contextRegistry +
perFileConfigs were built ONCE in setupIsolatedLoad and never
refreshed. Single-file mode reloads on every GetAvailableContexts
call (it walks RawConfig() through DeferredLoadingClientConfig),
but the multi-file path served whatever was in memory. So the
dropdown kept showing entries for:

  - kubeconfig files removed from a watched directory,
  - kubeconfig files rewritten by 'kubectl config delete-context',
  - kubeconfig files written then deleted by CAPI when a managed
    cluster was destroyed.

All three look identical to the user: a row that's there, but
errors out when they try to switch to it.

Fix: introduce refreshContextRegistry — surgical, mtime-driven
reconciliation of registry + perFileConfigs against what's on
disk now.

  - File missing on disk → drop every registry entry pointing at
    it AND its cached config + mtime. (CAPI-destroyed case.)
  - File mtime moved → re-parse THAT file only. Drop entries for
    contexts removed from the file; add entries for newly-added
    contexts. (kubectl config delete-context / set-context case.)
  - File mtime unchanged → no-op (avoids re-parsing N files on
    every dropdown open).
  - File present but parse-fails → keep prior entries. The user
    might be mid-edit; silently pruning the dropdown while they
    save would be more confusing than a momentarily stale entry.

Wire it into GetAvailableContexts (multi-file branch only) under
clientMu write lock. Same per-file isolation guarantees as
buildContextRegistry — colliding user/cluster names across files
remain structurally impossible.

Out of scope (deliberately deferred):
  - Re-scanning kubeconfig DIRECTORIES for newly-added files.
    Today, a freshly-created kubeconfig won't appear without a
    Radar restart. Same workaround as before (restart). Worth a
    follow-up but adds a syscall budget question (every dropdown
    open shouldn't ReadDir all watched dirs).
  - fsnotify-based eager refresh. Same data, more moving parts;
    on-demand mtime check on dropdown open is enough to clear the
    junk-cluster complaint.

Tests pin: removed-file drop, in-place context delete, in-place
context add, no-op stability, and the parse-failure defensive
case (keep stale, don't prune).

Made-with: Cursor
Bugbot HIGH x2 on PR #595.

1. Concurrent map read/write panic.

   v1 released clientMu before iterating the registry. The local
   variables 'registry' and 'fileConfigs' still pointed at the LIVE
   maps, so a second concurrent GetAvailableContexts call would
   acquire the lock and mutate them via refreshContextRegistry
   while the first caller iterated. Two HTTP /contexts requests
   in flight at once = 'fatal error: concurrent map read and map
   write'.

   Hold the lock through the iteration. The contexts slice is
   private to the caller, so we can release just before the
   return.

2. Nil perFileMtimes after CAPI promotion.

   MergeAndSwitchContext promotes single-file → isolated-load
   mode by writing contextRegistry and perFileConfigs but never
   touched perFileMtimes. After a CAPI cluster was registered,
   perFileMtimes stayed nil and the next refresh would write to
   it ('assignment to entry in nil map' panic).

   Seed perFileMtimes alongside the other globals on both promo
   paths. Add a defensive lazy-init in GetAvailableContexts so a
   future caller that forgets to seed it doesn't crash either.

   Test pins the contract that refresh populates an empty mtime
   map without panicking and writes the mtime back.

Made-with: Cursor
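A sketch of the corrected shape described in the two fixes above (struct fields beyond the names the commit mentions are assumptions; refreshContextRegistry is the earlier sketch):

```go
package k8s

import (
	"sync"
	"time"

	clientcmdapi "k8s.io/client-go/tools/clientcmd/api"
)

// Minimal client shape for illustration only; the real struct has more.
type Client struct {
	clientMu        sync.RWMutex
	contextRegistry map[string]string
	perFileConfigs  map[string]*clientcmdapi.Config
	perFileMtimes   map[string]time.Time
}

// GetAvailableContexts holds the write lock through BOTH the refresh and the
// iteration, so a concurrent caller can't mutate the live maps mid-read.
func (c *Client) GetAvailableContexts() []string {
	c.clientMu.Lock()
	defer c.clientMu.Unlock()

	if c.perFileMtimes == nil {
		c.perFileMtimes = map[string]time.Time{} // defensive lazy-init
	}
	refreshContextRegistry(c.contextRegistry, c.perFileConfigs, c.perFileMtimes)

	contexts := make([]string, 0, len(c.contextRegistry))
	for name := range c.contextRegistry { // iterate while still locked
		contexts = append(contexts, name)
	}
	return contexts // the slice is private to the caller, safe after unlock
}
```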

Bugbot v3 follow-ups on PR #595.

1. Refresh loses track of files with all contexts removed.

   byFile was built solely from registry entries. If a previous
   refresh dropped every context for a file (e.g. user did
   'kubectl config delete-context' for each one), the file's path
   stayed in fileMtimes/fileConfigs but vanished from byFile on
   subsequent calls. Any new contexts later added to that file
   were never discovered until restart.

   Seed byFile from BOTH registry AND fileMtimes keys so an
   emptied-but-still-tracked file gets re-stat'd on every refresh.

2. CAPI fast path omits perFileMtimes update.

   The MergeAndSwitchContext re-use branch overwrote the existing
   temp file (changing its mtime) and refreshed perFileConfigs,
   but didn't bump perFileMtimes. The next refresh would see a
   stale mtime and uselessly re-parse a file we'd just parsed in
   the same critical section.

   Bump perFileMtimes alongside perFileConfigs in the fast path.

3. Nil-mtime test was hand-waving.

   v3 declared nilMtimes, asserted it was nil, but then passed
   emptyMtimes (a non-nil map) to refresh. The function was NOT
   actually nil-safe — production seeded the map before calling.
   Make the function itself nil-safe (returns false / no-op) so
   callers that forget can't panic, and rewrite the test to pass
   an actual nil. Add a second regression that exercises the
   byFile-from-mtimes seed: emptied registry + rewritten file
   must surface the new contexts.

Made-with: Cursor
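The nil-mtimes contract pins down to roughly this test shape (a hypothetical test written against the earlier refreshContextRegistry sketch, not the repo's actual subtests):

```go
package k8s

import (
	"testing"
	"time"

	clientcmdapi "k8s.io/client-go/tools/clientcmd/api"
)

// Pins the nil-safety contract: a truly nil mtime map (not an empty one)
// must make refresh a no-op instead of panicking on map assignment.
func TestRefreshContextRegistryNilMtimes(t *testing.T) {
	registry := map[string]string{"foo": "/tmp/kubeconfig-a"}
	fileConfigs := map[string]*clientcmdapi.Config{}
	var nilMtimes map[string]time.Time // actually nil, not an empty map

	if changed := refreshContextRegistry(registry, fileConfigs, nilMtimes); changed {
		t.Fatal("refresh with nil mtimes must be a no-op")
	}
	if len(registry) != 1 {
		t.Fatal("no-op refresh must not touch the registry")
	}
}
```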

@cursor (Bot) left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.




In-place map mutation creates concurrent read/write crash

High Severity

refreshContextRegistry mutates contextRegistry and perFileConfigs in place (delete, insert), but SwitchContext and WriteKubeconfigForCurrentContext still snapshot these maps as bare references under RLock and read from them after releasing the lock. A concurrent GetAvailableContexts call can trigger refreshContextRegistry under a write lock, mutating the same live maps — causing Go's fatal "concurrent map read and map write" panic. Before this change the maps were never mutated in place (only atomically replaced), so the reference-snapshot pattern was safe.

Additional Locations (2)
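Reduced to a sketch, the racy pattern and its fix look like this (method names and bodies are illustrative; they reuse the Client shape sketched earlier):

```go
package k8s

// lookupUnsafe shows the racy reference-snapshot pattern. The bare map
// reference escapes the RLock; once refreshContextRegistry mutates the live
// map in place under the write lock, this post-unlock read races with it.
func (c *Client) lookupUnsafe(name string) (string, bool) {
	c.clientMu.RLock()
	registry := c.contextRegistry // reference to the live map, not a copy
	c.clientMu.RUnlock()
	path, ok := registry[name] // read after unlock: concurrent map read/write
	return path, ok
}

// lookupSafe copies what it needs while the lock is still held.
func (c *Client) lookupSafe(name string) (string, bool) {
	c.clientMu.RLock()
	defer c.clientMu.RUnlock()
	path, ok := c.contextRegistry[name]
	return path, ok
}
```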
