OCPBUGS-84534: fix concurrent map race in project authorization cache#642
sanchezl wants to merge 2 commits into openshift:main
addSubjectsToNamespace and deleteNamespaceFromSubjects mutate subjectRecord.namespaces (a sets.String / map) in place while List() iterates the same map from HTTP request goroutines. This causes a fatal "concurrent map iteration and map write" panic that crashes openshift-apiserver pods intermittently. Use copy-on-write: create a new subjectRecord with a copied namespaces set and replace it in the store, so any in-flight List() holding the old record iterates an immutable snapshot. cache.Store is internally thread-safe (ThreadSafeStore), so the replacement is safe without additional locking, and List() never blocks.
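A minimal sketch of the copy-on-write idea described above, not the code from this PR. It assumes a single writer goroutine (as with `synchronize()`); `recordStore`, `addNamespace`, `deleteNamespace`, and the field names are hypothetical stand-ins, with a small mutex-guarded map playing the role of the thread-safe `cache.Store`.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// subjectRecord is a simplified, hypothetical version of the record type:
// a subject plus the set of namespaces it can see.
type subjectRecord struct {
	subject    string
	namespaces map[string]struct{}
}

// recordStore is a hypothetical stand-in for a thread-safe cache.Store.
type recordStore struct {
	mu    sync.RWMutex
	items map[string]*subjectRecord
}

func newRecordStore() *recordStore {
	return &recordStore{items: map[string]*subjectRecord{}}
}

func (s *recordStore) get(subject string) (*subjectRecord, bool) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	r, ok := s.items[subject]
	return r, ok
}

func (s *recordStore) put(r *subjectRecord) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.items[r.subject] = r
}

// copyNamespaces returns a fresh set so the published one is never touched.
func copyNamespaces(src map[string]struct{}) map[string]struct{} {
	dst := make(map[string]struct{}, len(src)+1)
	for ns := range src {
		dst[ns] = struct{}{}
	}
	return dst
}

// addNamespace copies the current set, adds to the copy, and replaces the
// record in the store. Readers holding the old record are unaffected.
func addNamespace(s *recordStore, subject, namespace string) {
	namespaces := map[string]struct{}{}
	if old, ok := s.get(subject); ok {
		namespaces = copyNamespaces(old.namespaces)
	}
	namespaces[namespace] = struct{}{}
	s.put(&subjectRecord{subject: subject, namespaces: namespaces})
}

// deleteNamespace follows the same pattern: copy, modify the copy, replace.
func deleteNamespace(s *recordStore, subject, namespace string) {
	old, ok := s.get(subject)
	if !ok {
		return
	}
	namespaces := copyNamespaces(old.namespaces)
	delete(namespaces, namespace)
	s.put(&subjectRecord{subject: subject, namespaces: namespaces})
}

// sortedNamespaces returns the set as a sorted slice for stable printing.
func sortedNamespaces(r *subjectRecord) []string {
	out := make([]string, 0, len(r.namespaces))
	for ns := range r.namespaces {
		out = append(out, ns)
	}
	sort.Strings(out)
	return out
}

func main() {
	s := newRecordStore()
	addNamespace(s, "alice", "dev")
	snapshot, _ := s.get("alice") // what an in-flight reader would hold

	addNamespace(s, "alice", "prod")
	deleteNamespace(s, "alice", "dev")
	current, _ := s.get("alice")

	fmt.Println(sortedNamespaces(snapshot), sortedNamespaces(current)) // [dev] [prod]
}
```

The trade-off in this sketch is one extra map copy per update in exchange for readers that never observe a map being written. The read-then-replace in `addNamespace` is not atomic across writers, so it relies on the single-writer assumption stated above.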
During full cache invalidation, synchronize() swaps three store pointers (userSubjectRecordStore, groupSubjectRecordStore, reviewRecordStore) in sequence without synchronization. A concurrent List() call can observe stores from different points in time, producing silently incorrect results. Group the three stores in an authorizationCacheStores struct behind atomic.Pointer so they are swapped as a single unit. List() snapshots the pointer once, ensuring it reads from a consistent set of stores.
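The single-unit swap can be sketched as follows, assuming Go 1.19+ for the generic `atomic.Pointer`. This is an illustration, not the PR's code: `store`, `cacheStores`, `authCache`, `rebuild`, and `generations` are hypothetical names, and each store carries a `generation` marker only so that the consistency of a snapshot is visible.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// store is a hypothetical placeholder for one record store. It is treated
// as read-only once published.
type store struct {
	generation int
}

// cacheStores groups the three stores so they are always replaced together.
type cacheStores struct {
	users   *store
	groups  *store
	reviews *store
}

// authCache holds the current set of stores behind one atomic pointer.
type authCache struct {
	stores atomic.Pointer[cacheStores]
}

func newAuthCache() *authCache {
	c := &authCache{}
	c.rebuild(0)
	return c
}

// rebuild constructs all three stores first and publishes them with a
// single atomic Store, so no reader can see a mix of old and new.
func (c *authCache) rebuild(generation int) {
	c.stores.Store(&cacheStores{
		users:   &store{generation: generation},
		groups:  &store{generation: generation},
		reviews: &store{generation: generation},
	})
}

// generations loads the pointer exactly once and reads all three stores
// from that one snapshot.
func (c *authCache) generations() [3]int {
	s := c.stores.Load()
	return [3]int{s.users.generation, s.groups.generation, s.reviews.generation}
}

func main() {
	c := newAuthCache()
	fmt.Println(c.generations()) // [0 0 0]
	c.rebuild(1)
	fmt.Println(c.generations()) // [1 1 1]
}
```

The key point is that a reader calls `Load()` once per request and uses only that snapshot. Loading the pointer separately for each store would reintroduce the mixed-generation problem.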
@sanchezl: This pull request references Jira Issue OCPBUGS-84534, which is invalid:
The bug has been updated to refer to the pull request using the external bug tracker.
Walkthrough
Changes: Cache Atomicity Refactor
Estimated code review effort: 4 (Complex), ~45 minutes
Pre-merge checks: 11 passed, 1 failed (1 warning)
/verified by "TestAuthorizationCacheRace"
[APPROVALNOTIFIER] This PR is NOT APPROVED
@sanchezl: This PR has been marked as verified.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
pkg/project/auth/cache.go (1)
451-471: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Move `lastCacheInvalidation` to after the atomic store swap.
On the full-rebuild path, Line 453 resets the expiry timer before the rebuilt stores are visible. `List()` keeps serving the old snapshot until Line 467, so a slow rebuild can leave stale data live longer than `maxCacheLifespan`, and the next expiry window starts too early.
Suggested fix

     invalidateCache := ac.invalidateCache(expired)
     if invalidateCache {
    -    ac.lastCacheInvalidation = ac.clock.Now()
         userSubjectRecordStore = cache.NewStore(subjectRecordKeyFn)
         groupSubjectRecordStore = cache.NewStore(subjectRecordKeyFn)
         reviewRecordStore = cache.NewStore(reviewRecordKeyFn)
     }
    @@
     if invalidateCache {
         ac.stores.Store(&authorizationCacheStores{
             userSubjectRecordStore:  userSubjectRecordStore,
             groupSubjectRecordStore: groupSubjectRecordStore,
             reviewRecordStore:       reviewRecordStore,
         })
    +    ac.lastCacheInvalidation = ac.clock.Now()
     }

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/project/auth/cache.go` around lines 451 - 471, The cache expiry timestamp ac.lastCacheInvalidation is being set when invalidateCache is true before swapping in the rebuilt stores, which can extend stale-serving time; move the assignment of ac.lastCacheInvalidation to after the atomic swap (the ac.stores.Store call that installs the new authorizationCacheStores with userSubjectRecordStore, groupSubjectRecordStore, reviewRecordStore) so the expiry timer starts only once the new stores are visible to List()/readers.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: f932d9a3-ce8c-40cc-a062-02fafaee0b7a
📒 Files selected for processing (2)
pkg/project/auth/cache.go
pkg/project/auth/cache_test.go
/retest-required
@sanchezl: all tests passed!
Summary
Fix a fatal `concurrent map iteration and map write` panic in `AuthorizationCache.List()` that intermittently crashes openshift-apiserver pods. The race has existed for years and was never successfully fixed.

Commit 1: Copy-on-write subjectRecords
`addSubjectsToNamespace` and `deleteNamespaceFromSubjects` now create new `subjectRecord` objects with copied namespaces sets instead of mutating the underlying map in place. Any concurrent `List()` holding the old record iterates an immutable snapshot. `cache.Store` is internally thread-safe, so the replacement is safe without locking, and `List()` never blocks.

Commit 2: Atomic store pointer swap
During full cache invalidation, three store pointers were swapped non-atomically. A concurrent `List()` could read stores from different points in time. This commit wraps all three stores in a struct behind `atomic.Pointer` so they swap as a single unit.

Root Cause
`List()` (called from HTTP request goroutines via `proxy.(*REST).List`) reads `subjectRecord.namespaces` (`sets.String` = `map[string]Empty`). Meanwhile, `synchronize()` (background goroutine) mutates the same maps in place:
- `addSubjectsToNamespace()`: `item.namespaces.Insert(namespace)`
- `deleteNamespaceFromSubjects()`: `delete(subjectRecord.namespaces, namespace)`

Go's runtime detects the concurrent map read+write and kills the process.
Fix History
- `sync.RWMutex` to synchronize access
- `List()` requests (goroutine dumps showed 3-4 minute waits on `RLock`)

This fix avoids locks entirely via copy-on-write. `List()` never blocks, regardless of how long `synchronize()` takes.
Related Issues
QA Validation
Test 1: Race condition is fixed
Test 2: No regression on large clusters
- `oc get projects` latency before and after: should not regress
Test 3: Cache freshness
- `oc projects` output within ~15 seconds
Verification
/verified by "TestAuthorizationCacheRace"
Summary by CodeRabbit
Bug Fixes
Tests