fix(contexts): refresh registry against disk on each list call (SKY-834 bug 52) #595

Open
eliran-ops wants to merge 3 commits into main from feat/investigate-stale-junk-clusters-lingering-in-cluster-list

Conversation


@eliran-ops (Contributor) commented Apr 30, 2026

Linear: SKY-834

Companion to #590 (bug 54 — cert-health namespace mismatch).

Why

Bug 52: junk / stale clusters lingering in the cluster dropdown.

In multi-file (isolated-load) mode, contextRegistry + perFileConfigs were built once in setupIsolatedLoad and never refreshed. Single-file mode reloads on every GetAvailableContexts call (it walks RawConfig() through DeferredLoadingClientConfig), but the multi-file path served whatever was in memory.

So the dropdown kept showing entries for:

  • kubeconfig files removed from a watched directory,
  • kubeconfig files rewritten by kubectl config delete-context,
  • kubeconfig files written then deleted by CAPI when a managed cluster was destroyed.

All three look identical to the user: a row that's there but errors out on switch.
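For contrast, a sketch of why the single-file path never goes stale. It assumes the deferred config is rebuilt on each call, which is what makes every call hit disk; this is illustrative, not Radar's actual wiring:

```go
package k8s

import (
	"k8s.io/client-go/tools/clientcmd"
)

// listContextsSingleFile mirrors the single-file behavior described above:
// a fresh deferred config is built per call, so RawConfig() re-reads the
// kubeconfig from disk and the result can never be stale.
func listContextsSingleFile(kubeconfigPath string) ([]string, error) {
	rules := &clientcmd.ClientConfigLoadingRules{ExplicitPath: kubeconfigPath}
	deferred := clientcmd.NewNonInteractiveDeferredLoadingClientConfig(
		rules, &clientcmd.ConfigOverrides{})
	raw, err := deferred.RawConfig() // hits disk now, not at construction time
	if err != nil {
		return nil, err
	}
	names := make([]string, 0, len(raw.Contexts))
	for name := range raw.Contexts {
		names = append(names, name)
	}
	return names, nil
}
```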

What

New helper refreshContextRegistry — surgical, mtime-driven reconciliation against disk.

| Disk state | Action |
| --- | --- |
| File missing | Drop every registry entry pointing at it. Drop cached config + mtime. |
| File mtime moved | Re-parse *that* file only. Drop entries for removed contexts; add entries for new ones. |
| File mtime unchanged | No-op (avoids re-parsing N files on every dropdown open). |
| File present but parse fails | Keep prior entries (user might be mid-edit; silent pruning would be worse). |

Wired into GetAvailableContexts (multi-file branch only) under clientMu write lock. Same per-file isolation guarantees as buildContextRegistry — colliding user/cluster names across files remain structurally impossible.
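A minimal sketch of that reconciliation, assuming the registry maps context name → source kubeconfig path (the real entry type is richer) and that `clientcmd.LoadFromFile` does the per-file parse; it also folds in the follow-up fixes discussed further down (seeding the file set from the mtime cache, nil-safety):

```go
package k8s

import (
	"os"
	"time"

	"k8s.io/client-go/tools/clientcmd"
	clientcmdapi "k8s.io/client-go/tools/clientcmd/api"
)

// refreshContextRegistry reconciles the in-memory state against disk.
// Caller must hold clientMu for writing. Returns true if anything changed.
// The map shapes here are illustrative assumptions, not the repo's types.
func refreshContextRegistry(
	registry map[string]string, // context name -> kubeconfig file path
	fileConfigs map[string]*clientcmdapi.Config,
	fileMtimes map[string]time.Time,
) bool {
	if fileMtimes == nil {
		return false // nil-safe no-op; callers are expected to seed the map
	}
	changed := false

	// Seed the file set from BOTH registry entries and tracked mtimes, so a
	// file whose contexts were all removed is still re-stat'd on later calls.
	byFile := map[string]struct{}{}
	for _, path := range registry {
		byFile[path] = struct{}{}
	}
	for path := range fileMtimes {
		byFile[path] = struct{}{}
	}

	for path := range byFile {
		info, err := os.Stat(path)
		if err != nil {
			// File missing: drop every entry pointing at it, plus its caches.
			for name, p := range registry {
				if p == path {
					delete(registry, name)
					changed = true
				}
			}
			delete(fileConfigs, path)
			delete(fileMtimes, path)
			continue
		}
		if prev, ok := fileMtimes[path]; ok && prev.Equal(info.ModTime()) {
			continue // mtime unchanged: skip the re-parse entirely
		}
		cfg, err := clientcmd.LoadFromFile(path)
		if err != nil {
			continue // parse failure: keep prior entries, user may be mid-edit
		}
		// mtime moved: drop contexts removed from this file, add new ones.
		for name, p := range registry {
			if p == path {
				if _, still := cfg.Contexts[name]; !still {
					delete(registry, name)
					changed = true
				}
			}
		}
		for name := range cfg.Contexts {
			if _, ok := registry[name]; !ok {
				registry[name] = path
				changed = true
			}
		}
		fileConfigs[path] = cfg
		fileMtimes[path] = info.ModTime()
	}
	return changed
}
```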

Out of scope (deliberately deferred)

  1. Re-scanning kubeconfig DIRECTORIES for newly-added files. Today, a freshly-created kubeconfig still requires a Radar restart to appear. Same workaround as before. Worth a follow-up but raises a syscall-budget question (every dropdown open shouldn't ReadDir all watched dirs without throttling).
  2. fsnotify-based eager refresh. Same data, more moving parts; on-demand mtime check on dropdown open is enough to clear the junk-cluster complaint.

Test plan

  • `go build ./...`
  • `go test ./internal/k8s/` (all green; 5 new subtests for refresh)
  • Manual: start Radar with two kubeconfig files; run `kubectl config delete-context foo` in one of them; reopen the dropdown — foo is gone, no restart needed.
  • Manual: `rm` one of the watched kubeconfig files; reopen the dropdown — every context from that file is gone.

Made with Cursor


Note

Medium Risk
Touches concurrency and shared state around kubeconfig context enumeration; incorrect locking or refresh logic could cause deadlocks, performance regressions, or missing/incorrect context listings.

Overview
Prevents stale/junk kubeconfig contexts from lingering in the multi-file (isolated-load) context list by adding an mtime-backed refresh path that drops deleted files, updates rewritten files, and picks up newly added/removed contexts without requiring a restart.

GetAvailableContexts now refreshes contextRegistry/perFileConfigs on each call under a single clientMu write lock to avoid concurrent map access, and CAPI temp-kubeconfig updates/promotions now seed and maintain the new perFileMtimes cache. Adds targeted unit tests covering file deletion, in-place rewrites, new contexts, parse failures, and nil/empty-edge cases.

Reviewed by Cursor Bugbot for commit 5f2654d. Bugbot is set up for automated code reviews on this repo.

Bug 52: junk / stale clusters lingering in the cluster dropdown.

Root cause: in multi-file (isolated-load) mode, contextRegistry +
perFileConfigs were built ONCE in setupIsolatedLoad and never
refreshed. Single-file mode reloads on every GetAvailableContexts
call (it walks RawConfig() through DeferredLoadingClientConfig),
but the multi-file path served whatever was in memory. So the
dropdown kept showing entries for:

  - kubeconfig files removed from a watched directory,
  - kubeconfig files rewritten by 'kubectl config delete-context',
  - kubeconfig files written then deleted by CAPI when a managed
    cluster was destroyed.

All three look identical to the user: a row that's there, but
errors out when they try to switch to it.

Fix: introduce refreshContextRegistry — surgical, mtime-driven
reconciliation of registry + perFileConfigs against what's on
disk now.

  - File missing on disk → drop every registry entry pointing at
    it AND its cached config + mtime. (CAPI-destroyed case.)
  - File mtime moved → re-parse THAT file only. Drop entries for
    contexts removed from the file; add entries for newly-added
    contexts. (kubectl config delete-context / set-context case.)
  - File mtime unchanged → no-op (avoids re-parsing N files on
    every dropdown open).
  - File present but parse-fails → keep prior entries. The user
    might be mid-edit; silently pruning the dropdown while they
    save would be more confusing than a momentarily stale entry.

Wire it into GetAvailableContexts (multi-file branch only) under
clientMu write lock. Same per-file isolation guarantees as
buildContextRegistry — colliding user/cluster names across files
remain structurally impossible.

Out of scope (deliberately deferred):
  - Re-scanning kubeconfig DIRECTORIES for newly-added files.
    Today, a freshly-created kubeconfig won't appear without a
    Radar restart. Same workaround as before (restart). Worth a
    follow-up but adds a syscall budget question (every dropdown
    open shouldn't ReadDir all watched dirs).
  - fsnotify-based eager refresh. Same data, more moving parts;
    on-demand mtime check on dropdown open is enough to clear the
    junk-cluster complaint.

Tests pin: removed-file drop, in-place context delete, in-place
context add, no-op stability, and the parse-failure defensive
case (keep stale, don't prune).

Made-with: Cursor
Bugbot HIGH x2 on PR #595.

1. Concurrent map read/write panic.

   v1 released clientMu before iterating the registry. The local
   variables 'registry' and 'fileConfigs' still pointed at the LIVE
   maps, so a second concurrent GetAvailableContexts call would
   acquire the lock and mutate them via refreshContextRegistry
   while the first caller iterated. Two HTTP /contexts requests
   in flight at once = 'fatal error: concurrent map read and map
   write'.

   Hold the lock through the iteration. The contexts slice is
   private to the caller, so we can release just before the
   return.

2. Nil perFileMtimes after CAPI promotion.

   MergeAndSwitchContext promotes single-file → isolated-load
   mode by writing contextRegistry and perFileConfigs but never
   touched perFileMtimes. After a CAPI cluster was registered,
   perFileMtimes stayed nil and the next refresh would write to
   it ('assignment to entry in nil map' panic).

   Seed perFileMtimes alongside the other globals on both promo
   paths. Add a defensive lazy-init in GetAvailableContexts so a
   future caller that forgets to seed it doesn't crash either.

   Test pins the contract that refresh populates an empty mtime
   map without panicking and writes the mtime back.

Made-with: Cursor
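A sketch of the corrected shape described in the two fixes above (struct fields beyond the names the commit mentions are assumptions; refreshContextRegistry is the earlier sketch):

```go
package k8s

import (
	"sync"
	"time"

	clientcmdapi "k8s.io/client-go/tools/clientcmd/api"
)

// Minimal client shape for illustration only; the real struct has more.
type Client struct {
	clientMu        sync.RWMutex
	contextRegistry map[string]string
	perFileConfigs  map[string]*clientcmdapi.Config
	perFileMtimes   map[string]time.Time
}

// GetAvailableContexts holds the write lock through BOTH the refresh and the
// iteration, so a concurrent caller can't mutate the live maps mid-read.
func (c *Client) GetAvailableContexts() []string {
	c.clientMu.Lock()
	defer c.clientMu.Unlock()

	if c.perFileMtimes == nil {
		c.perFileMtimes = map[string]time.Time{} // defensive lazy-init
	}
	refreshContextRegistry(c.contextRegistry, c.perFileConfigs, c.perFileMtimes)

	contexts := make([]string, 0, len(c.contextRegistry))
	for name := range c.contextRegistry { // iterate while still locked
		contexts = append(contexts, name)
	}
	return contexts // the slice is private to the caller, safe after unlock
}
```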

Bugbot v3 follow-ups on PR #595.

1. Refresh loses track of files with all contexts removed.

   byFile was built solely from registry entries. If a previous
   refresh dropped every context for a file (e.g. user did
   'kubectl config delete-context' for each one), the file's path
   stayed in fileMtimes/fileConfigs but vanished from byFile on
   subsequent calls. Any new contexts later added to that file
   were never discovered until restart.

   Seed byFile from BOTH registry AND fileMtimes keys so an
   emptied-but-still-tracked file gets re-stat'd on every refresh.

2. CAPI fast path omits perFileMtimes update.

   The MergeAndSwitchContext re-use branch overwrote the existing
   temp file (changing its mtime) and refreshed perFileConfigs,
   but didn't bump perFileMtimes. The next refresh would see a
   stale mtime and uselessly re-parse a file we'd just parsed in
   the same critical section.

   Bump perFileMtimes alongside perFileConfigs in the fast path.

3. Nil-mtime test was hand-waving.

   v3 declared nilMtimes, asserted it was nil, but then passed
   emptyMtimes (a non-nil map) to refresh. The function was NOT
   actually nil-safe — production seeded the map before calling.
   Make the function itself nil-safe (returns false / no-op) so
   callers that forget can't panic, and rewrite the test to pass
   an actual nil. Add a second regression that exercises the
   byFile-from-mtimes seed: emptied registry + rewritten file
   must surface the new contexts.

Made-with: Cursor
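The nil-mtimes contract pins down to roughly this test shape (a hypothetical test written against the earlier refreshContextRegistry sketch, not the repo's actual subtests):

```go
package k8s

import (
	"testing"
	"time"

	clientcmdapi "k8s.io/client-go/tools/clientcmd/api"
)

// Pins the nil-safety contract: a truly nil mtime map (not an empty one)
// must make refresh a no-op instead of panicking on map assignment.
func TestRefreshContextRegistryNilMtimes(t *testing.T) {
	registry := map[string]string{"foo": "/tmp/kubeconfig-a"}
	fileConfigs := map[string]*clientcmdapi.Config{}
	var nilMtimes map[string]time.Time // actually nil, not an empty map

	if changed := refreshContextRegistry(registry, fileConfigs, nilMtimes); changed {
		t.Fatal("refresh with nil mtimes must be a no-op")
	}
	if len(registry) != 1 {
		t.Fatal("no-op refresh must not touch the registry")
	}
}
```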

@cursor (Bot) left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.




In-place map mutation creates concurrent read/write crash

High Severity

refreshContextRegistry mutates contextRegistry and perFileConfigs in place (delete, insert), but SwitchContext and WriteKubeconfigForCurrentContext still snapshot these maps as bare references under RLock and read from them after releasing the lock. A concurrent GetAvailableContexts call can trigger refreshContextRegistry under a write lock, mutating the same live maps — causing Go's fatal "concurrent map read and map write" panic. Before this change the maps were never mutated in place (only atomically replaced), so the reference-snapshot pattern was safe.

Additional Locations (2)
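Reduced to a sketch, the racy pattern and its fix look like this (method names and bodies are illustrative; they reuse the Client shape sketched earlier):

```go
package k8s

// lookupUnsafe shows the racy reference-snapshot pattern. The bare map
// reference escapes the RLock; once refreshContextRegistry mutates the live
// map in place under the write lock, this post-unlock read races with it.
func (c *Client) lookupUnsafe(name string) (string, bool) {
	c.clientMu.RLock()
	registry := c.contextRegistry // reference to the live map, not a copy
	c.clientMu.RUnlock()
	path, ok := registry[name] // read after unlock: concurrent map read/write
	return path, ok
}

// lookupSafe copies what it needs while the lock is still held.
func (c *Client) lookupSafe(name string) (string, bool) {
	c.clientMu.RLock()
	defer c.clientMu.RUnlock()
	path, ok := c.contextRegistry[name]
	return path, ok
}
```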
