fix(contexts): refresh registry against disk on each list call (SKY-834 bug 52)#595
eliran-ops wants to merge 3 commits into main
Bug 52: junk / stale clusters lingering in the cluster dropdown.
Root cause: in multi-file (isolated-load) mode, contextRegistry +
perFileConfigs were built ONCE in setupIsolatedLoad and never
refreshed. Single-file mode reloads on every GetAvailableContexts
call (it walks RawConfig() through DeferredLoadingClientConfig),
but the multi-file path served whatever was in memory. So the
dropdown kept showing entries for:
- kubeconfig files removed from a watched directory,
- kubeconfig files rewritten by 'kubectl config delete-context',
- kubeconfig files written then deleted by CAPI when a managed
cluster was destroyed.
All three look identical to the user: a row that's there, but
errors out when they try to switch to it.
Fix: introduce refreshContextRegistry — surgical, mtime-driven
reconciliation of registry + perFileConfigs against what's on
disk now.
- File missing on disk → drop every registry entry pointing at
it AND its cached config + mtime. (CAPI-destroyed case.)
- File mtime moved → re-parse THAT file only. Drop entries for
contexts removed from the file; add entries for newly-added
contexts. (kubectl config delete-context / set-context case.)
- File mtime unchanged → no-op (avoids re-parsing N files on
every dropdown open).
- File present but parse-fails → keep prior entries. The user
might be mid-edit; silently pruning the dropdown while they
save would be more confusing than a momentarily stale entry.
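
The four cases above can be sketched in a few lines of Go. This is a minimal illustration, not the PR's code: `parseContexts`, `dropFile`, `refresh`, and `demo` are hypothetical names, the registry is simplified to a context-name to file-path map, and the "parser" is a toy that treats each word in a file as a context name.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"path/filepath"
	"sort"
	"strings"
	"time"
)

// parseContexts is a hypothetical stand-in for a real kubeconfig parser:
// each whitespace-separated word is a context name; "!" marks a broken file.
func parseContexts(path string) ([]string, error) {
	data, err := os.ReadFile(path)
	if err != nil {
		return nil, err
	}
	if strings.Contains(string(data), "!") {
		return nil, fmt.Errorf("parse %s: malformed", path)
	}
	return strings.Fields(string(data)), nil
}

// dropFile removes every registry entry that points at path.
func dropFile(registry map[string]string, path string) {
	for name, p := range registry {
		if p == path {
			delete(registry, name)
		}
	}
}

// refresh reconciles registry (context -> file) and mtimes (file -> last
// seen mtime) against disk and reports whether it changed anything.
func refresh(registry map[string]string, mtimes map[string]time.Time) bool {
	changed := false
	for path, seen := range mtimes {
		info, err := os.Stat(path)
		switch {
		case errors.Is(err, os.ErrNotExist):
			dropFile(registry, path) // file gone: drop its contexts
			delete(mtimes, path)     // and its cached mtime
			changed = true
			continue
		case err != nil || info.ModTime().Equal(seen):
			continue // stat error or unchanged mtime: leave entries alone
		}
		names, err := parseContexts(path)
		if err != nil {
			continue // parse failure: keep the prior (possibly stale) entries
		}
		dropFile(registry, path) // file rewritten: replace only its entries
		for _, n := range names {
			registry[n] = path
		}
		mtimes[path] = info.ModTime()
		changed = true
	}
	return changed
}

// demo starts from a cached view that disk has since diverged from.
func demo() ([]string, error) {
	dir, err := os.MkdirTemp("", "kubeconfigs")
	if err != nil {
		return nil, err
	}
	defer os.RemoveAll(dir)
	a, b := filepath.Join(dir, "a"), filepath.Join(dir, "b") // b is never created
	// On disk, a was rewritten after caching: staging is gone, qa is new.
	if err := os.WriteFile(a, []byte("dev qa"), 0o600); err != nil {
		return nil, err
	}
	old := time.Unix(1700000000, 0)
	registry := map[string]string{"dev": a, "staging": a, "prod": b}
	mtimes := map[string]time.Time{a: old, b: old}
	refresh(registry, mtimes)
	names := make([]string, 0, len(registry))
	for n := range registry {
		names = append(names, n)
	}
	sort.Strings(names)
	return names, nil
}

func main() {
	fmt.Println(demo())
}
```

In the demo, the missing file takes its context with it, the rewritten file swaps `staging` for `qa`, and a second `refresh` call would be a no-op because the cached mtime now matches disk.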
Wire it into GetAvailableContexts (multi-file branch only) under
clientMu write lock. Same per-file isolation guarantees as
buildContextRegistry — colliding user/cluster names across files
remain structurally impossible.
Out of scope (deliberately deferred):
- Re-scanning kubeconfig DIRECTORIES for newly-added files.
Today, a freshly-created kubeconfig won't appear without a
Radar restart. Same workaround as before (restart). Worth a
follow-up but adds a syscall budget question (every dropdown
open shouldn't ReadDir all watched dirs).
- fsnotify-based eager refresh. Same data, more moving parts;
on-demand mtime check on dropdown open is enough to clear the
junk-cluster complaint.
Tests pin: removed-file drop, in-place context delete, in-place
context add, no-op stability, and the parse-failure defensive
case (keep stale, don't prune).
Made-with: Cursor
Bugbot HIGH x2 on PR #595.

1. Concurrent map read/write panic.
   v1 released clientMu before iterating the registry. The local variables 'registry' and 'fileConfigs' still pointed at the LIVE maps, so a second concurrent GetAvailableContexts call would acquire the lock and mutate them via refreshContextRegistry while the first caller iterated. Two HTTP /contexts requests in flight at once = 'fatal error: concurrent map read and map write'.
   Hold the lock through the iteration. The contexts slice is private to the caller, so we can release just before the return.

2. Nil perFileMtimes after CAPI promotion.
   MergeAndSwitchContext promotes single-file → isolated-load mode by writing contextRegistry and perFileConfigs but never touched perFileMtimes. After a CAPI cluster was registered, perFileMtimes stayed nil and the next refresh would write to it ('assignment to entry in nil map' panic).
   Seed perFileMtimes alongside the other globals on both promo paths. Add a defensive lazy-init in GetAvailableContexts so a future caller that forgets to seed it doesn't crash either.

Test pins the contract that refresh populates an empty mtime map without panicking and writes the mtime back.

Made-with: Cursor
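
Both fixes in this commit can be illustrated together. The sketch below is an assumption-laden stand-in, not the PR's code: `store`, `list`, and `demo` are hypothetical names, and the refresh step is passed in as a function so the example stays self-contained. It holds one write lock across refresh and iteration, and lazily initializes the mtime map before anything writes to it.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// store is a hypothetical stand-in for the globals that clientMu guards.
type store struct {
	mu       sync.RWMutex
	registry map[string]string // context name -> kubeconfig path
	mtimes   map[string]int64  // kubeconfig path -> cached mtime; may be nil
}

// list refreshes and enumerates under ONE write-lock hold. The returned
// slice is private to the caller, so it stays safe to use after unlock.
func (s *store) list(refresh func(reg map[string]string, mt map[string]int64)) []string {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.mtimes == nil {
		s.mtimes = make(map[string]int64) // lazy init: writing to a nil map panics
	}
	refresh(s.registry, s.mtimes)
	names := make([]string, 0, len(s.registry))
	for name := range s.registry {
		names = append(names, name)
	}
	sort.Strings(names)
	return names
}

// demo runs 50 concurrent list calls whose refresh step mutates both maps.
func demo() (int64, []string) {
	s := &store{registry: map[string]string{"dev": "/a"}} // mtimes left nil on purpose
	churn := func(reg map[string]string, mt map[string]int64) {
		mt["/a"]++
		if _, ok := reg["tmp"]; ok {
			delete(reg, "tmp")
		} else {
			reg["tmp"] = "/a"
		}
	}
	var wg sync.WaitGroup
	for i := 0; i < 50; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			s.list(churn)
		}()
	}
	wg.Wait()
	names := s.list(func(map[string]string, map[string]int64) {})
	return s.mtimes["/a"], names
}

func main() {
	fmt.Println(demo())
}
```

Releasing the lock between the refresh and the `range` loop would reintroduce the race described in item 1; running a variant like that under `go run -race` is one way to see it reported.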
…ath mtime

Bugbot v3 follow-ups on PR #595.

1. Refresh loses track of files with all contexts removed.
   byFile was built solely from registry entries. If a previous refresh dropped every context for a file (e.g. user did 'kubectl config delete-context' for each one), the file's path stayed in fileMtimes/fileConfigs but vanished from byFile on subsequent calls. Any new contexts later added to that file were never discovered until restart.
   Seed byFile from BOTH registry AND fileMtimes keys so an emptied-but-still-tracked file gets re-stat'd on every refresh.

2. CAPI fast path omits perFileMtimes update.
   The MergeAndSwitchContext re-use branch overwrote the existing temp file (changing its mtime) and refreshed perFileConfigs, but didn't bump perFileMtimes. The next refresh would see a stale mtime and uselessly re-parse a file we'd just parsed in the same critical section.
   Bump perFileMtimes alongside perFileConfigs in the fast path.

3. Nil-mtime test was hand-waving.
   v3 declared nilMtimes, asserted it was nil, but then passed emptyMtimes (a non-nil map) to refresh. The function was NOT actually nil-safe; production seeded the map before calling.
   Make the function itself nil-safe (returns false / no-op) so callers that forget can't panic, and rewrite the test to pass an actual nil.

Add a second regression that exercises the byFile-from-mtimes seed: emptied registry + rewritten file must surface the new contexts.

Made-with: Cursor
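
The byFile seeding in item 1 might look roughly like the sketch below. `groupByFile` and `demo` are hypothetical names, and the registry is again simplified to a context-name to file-path map; the point is only that paths are collected from both maps, so a file with zero remaining contexts still gets an entry and is re-stat'd.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// groupByFile lists the contexts each tracked file owns. Paths come from BOTH
// maps, so a file whose contexts were all removed still gets an (empty) entry.
func groupByFile(registry map[string]string, mtimes map[string]time.Time) map[string][]string {
	byFile := make(map[string][]string, len(mtimes))
	for path := range mtimes { // ranging over a nil map is a no-op
		byFile[path] = nil
	}
	for name, path := range registry {
		byFile[path] = append(byFile[path], name)
	}
	for _, names := range byFile {
		sort.Strings(names)
	}
	return byFile
}

// demo models a file "/b" that was emptied by an earlier refresh: it has no
// registry entries left but is still present in the mtime cache.
func demo() map[string][]string {
	now := time.Now()
	registry := map[string]string{"dev": "/a", "qa": "/a"}
	mtimes := map[string]time.Time{"/a": now, "/b": now}
	return groupByFile(registry, mtimes)
}

func main() {
	byFile := demo()
	_, tracked := byFile["/b"]
	fmt.Println(len(byFile), byFile["/a"], tracked)
}
```

Seeding from the registry alone would leave `/b` out of `byFile`, which is the failure mode item 1 describes.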
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit 5f2654d.
        }
    }
    return changed
}
In-place map mutation creates concurrent read/write crash
High Severity
refreshContextRegistry mutates contextRegistry and perFileConfigs in place (delete, insert), but SwitchContext and WriteKubeconfigForCurrentContext still snapshot these maps as bare references under RLock and read from them after releasing the lock. A concurrent GetAvailableContexts call can trigger refreshContextRegistry under a write lock, mutating the same live maps — causing Go's fatal "concurrent map read and map write" panic. Before this change the maps were never mutated in place (only atomically replaced), so the reference-snapshot pattern was safe.
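
One way to make the reader side safe once the maps are mutated in place is to copy under the read lock rather than keep a bare reference. The thread does not say which remedy was adopted, so the sketch below is only an illustration of that option; `store`, `snapshot`, `drop`, and `demo` are hypothetical names.

```go
package main

import (
	"fmt"
	"sync"
)

// store is a hypothetical stand-in for the lock-guarded registry.
type store struct {
	mu       sync.RWMutex
	registry map[string]string // context name -> kubeconfig path
}

// snapshot copies the registry while holding RLock, so the caller can keep
// reading the copy after the lock is released.
func (s *store) snapshot() map[string]string {
	s.mu.RLock()
	defer s.mu.RUnlock()
	out := make(map[string]string, len(s.registry))
	for name, path := range s.registry {
		out[name] = path
	}
	return out
}

// drop mutates the live map in place under the write lock, as a refresh would.
func (s *store) drop(name string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	delete(s.registry, name)
}

// demo takes a snapshot, mutates the live map, and reports where "prod"
// is still visible.
func demo() (inSnapshot, inLive bool) {
	s := &store{registry: map[string]string{"dev": "/a", "prod": "/b"}}
	snap := s.snapshot()
	s.drop("prod")
	_, inSnapshot = snap["prod"]
	s.mu.RLock()
	_, inLive = s.registry["prod"]
	s.mu.RUnlock()
	return inSnapshot, inLive
}

func main() {
	fmt.Println(demo())
}
```

The alternative the comment alludes to is keeping the old behavior of building a new map and swapping it in atomically, which leaves reference snapshots valid at the cost of rebuilding on every change.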


Linear: SKY-834
Companion to #590 (bug 54 — cert-health namespace mismatch).
Why
Bug 52: junk / stale clusters lingering in the cluster dropdown.
In multi-file (isolated-load) mode, contextRegistry + perFileConfigs were built once in setupIsolatedLoad and never refreshed. Single-file mode reloads on every GetAvailableContexts call (it walks RawConfig() through DeferredLoadingClientConfig), but the multi-file path served whatever was in memory.
So the dropdown kept showing entries for:
- kubeconfig files removed from a watched directory,
- kubeconfig files rewritten by kubectl config delete-context,
- kubeconfig files written then deleted by CAPI when a managed cluster was destroyed.
All three look identical to the user: a row that's there but errors out on switch.
What
New helper refreshContextRegistry: surgical, mtime-driven reconciliation against disk.
Wired into GetAvailableContexts (multi-file branch only) under clientMu write lock. Same per-file isolation guarantees as buildContextRegistry; colliding user/cluster names across files remain structurally impossible.
Out of scope (deliberately deferred)
- Re-scanning kubeconfig directories for newly-added files (… ReadDir all watched dirs without throttling).
Test plan
- go build ./...
- go test ./internal/k8s/ (all green; 5 new subtests for refresh)
- … kubectl config delete-context foo in one of them; reopen the dropdown: foo is gone, no restart needed.
- rm one of the watched kubeconfig files; reopen the dropdown: every context from that file is gone.
Made with Cursor
Note
Medium Risk
Touches concurrency and shared state around kubeconfig context enumeration; incorrect locking or refresh logic could cause deadlocks, performance regressions, or missing/incorrect context listings.
Overview
Prevents stale/junk kubeconfig contexts from lingering in the multi-file (isolated-load) context list by adding an mtime-backed refresh path that drops deleted files, updates rewritten files, and picks up newly added/removed contexts without requiring a restart.
GetAvailableContexts now refreshes contextRegistry/perFileConfigs on each call under a single clientMu write lock to avoid concurrent map access, and CAPI temp-kubeconfig updates/promotions now seed and maintain the new perFileMtimes cache. Adds targeted unit tests covering file deletion, in-place rewrites, new contexts, parse failures, and nil/empty-edge cases.
Reviewed by Cursor Bugbot for commit 5f2654d.