Fix concurrent map access race in project authorization cache #643
gangwgr wants to merge 1 commit into openshift:main
Conversation
The openshift-apiserver was experiencing repeated crashes due to concurrent map iteration and write operations on `subjectRecord.namespaces`.

The issue occurred when:
- HTTP request handlers called `AuthorizationCache.List()`, which read from `subjectRecord.namespaces.List()` at lines 517 and 524
- Concurrently, the `synchronize()` goroutine (line 286) called `deleteNamespaceFromSubjects()` and `addSubjectsToNamespace()`, which modified the same `sets.String` (a Go map) without synchronization
- The Go runtime detected the concurrent map access and panicked with "fatal error: concurrent map iteration and map write"

This fix adds proper synchronization using `sync.RWMutex` to protect all accesses to the `subjectRecord.namespaces` field:
- Added a `mu sync.RWMutex` field to the `subjectRecord` struct
- Protected all `.List()` reads with `RLock`/`RUnlock`
- Protected all `.Insert()` and `.Delete()` writes with `Lock`/`Unlock`

Impact:
- Eliminates openshift-apiserver CrashLoopBackOff
- Fixes high pod restart counts (14, 119, and 131 restarts observed)
- Resolves OCPBUGS-XXXXX

Testing:
- Added comprehensive race detector tests
- All new tests pass with the `-race` flag
- All existing tests pass without regression
- Stress tested with 20 concurrent goroutines for multiple seconds

Stack trace from the bug report:

```
k8s.io/apimachinery/pkg/util/sets.List[...](0xc0668e6600)
    k8s.io/apimachinery@v0.31.1/pkg/util/sets/set.go:203 +0xb7
k8s.io/apimachinery/pkg/util/sets.String.List(...)
    k8s.io/apimachinery@v0.31.1/pkg/util/sets/string.go:121
```

Fixes: OCPBUGS-XXXXX
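The locking scheme described above can be sketched in plain Go. This is a simplified stand-in for the real struct in pkg/project/auth/cache.go, not the actual patch: a plain map plays the role of `sets.String`, and the method names are illustrative.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// subjectRecord is a simplified stand-in for the struct in
// pkg/project/auth/cache.go: a subject plus the set of namespaces
// (a Go map, like sets.String) that the subject can see.
type subjectRecord struct {
	mu         sync.RWMutex // guards namespaces
	subject    string
	namespaces map[string]struct{}
}

// list returns a sorted snapshot of the namespace set. The read lock
// makes the map iteration safe against concurrent inserts and deletes,
// which is exactly the iteration that used to panic.
func (r *subjectRecord) list() []string {
	r.mu.RLock()
	defer r.mu.RUnlock()
	out := make([]string, 0, len(r.namespaces))
	for ns := range r.namespaces {
		out = append(out, ns)
	}
	sort.Strings(out)
	return out
}

// insert and delete take the write lock, mirroring the Insert/Delete
// call sites that synchronize() drives in the real cache.
func (r *subjectRecord) insert(ns string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.namespaces[ns] = struct{}{}
}

func (r *subjectRecord) delete(ns string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	delete(r.namespaces, ns)
}

func main() {
	rec := &subjectRecord{subject: "alice", namespaces: map[string]struct{}{}}
	rec.insert("default")
	rec.insert("openshift-apiserver")
	rec.delete("default")
	fmt.Println(rec.list()) // sorted snapshot of remaining namespaces
}
```

Multiple readers can hold `RLock` simultaneously, so the hot `List()` path stays cheap; only the background `synchronize()` writers serialize on the full `Lock`.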
Skipping CI for Draft Pull Request.
Walkthrough: This PR adds concurrency safety to the authorization cache.

Changes: Authorization Cache Concurrency Safety
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes. 🚥 Pre-merge checks: ✅ 12 passed.
[APPROVALNOTIFIER] This PR is NOT APPROVED. Needs approval from an approver in each of these files. The full list of commands accepted by this bot can be found here.
Actionable comments posted: 1
🧹 Nitpick comments (1)

pkg/project/auth/cache_race_test.go (1)

125-155: ⚡ Quick win: exercise `AuthorizationCache.List()` directly here. This loop reimplements `List()` instead of calling it, so the test won't catch regressions in the real method, especially anything around store snapshotting or future changes above the per-record lock. A small fake `NamespaceLister`/`ClusterRoleLister` would make this a real regression test.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/project/auth/cache_race_test.go` around lines 125 - 155, The test currently reimplements the logic of AuthorizationCache.List() using direct store lookups (userSubjectRecordStore / groupSubjectRecordStore and subjectRecord locks), which won't catch regressions; update the test to call AuthorizationCache.List() directly and assert on its output instead. To exercise snapshotting and lister behaviour, replace the inline namespace/clusterrole access with small fakes for NamespaceLister and ClusterRoleLister (or a fake NamespaceLister/ClusterRoleLister implementation) wired into the AuthorizationCache under test so List() sees realistic listers; keep the concurrent goroutines running and synchronize with done/wg the same way so you exercise the real List() locking and snapshot semantics rather than reimplementing them.
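A fake-lister setup of the kind this prompt describes can be sketched in plain Go. The interfaces and names below are simplified stand-ins for the real NamespaceLister/ClusterRoleLister types, not the project's actual API:

```go
package main

import "fmt"

// namespaceLister is a simplified stand-in for the lister interface
// the authorization cache depends on.
type namespaceLister interface {
	ListNamespaces() []string
}

// fakeNamespaceLister returns a fixed set of namespaces, letting a
// test drive the real List() code path instead of reimplementing it.
type fakeNamespaceLister struct {
	namespaces []string
}

func (f *fakeNamespaceLister) ListNamespaces() []string {
	return f.namespaces
}

// authCache is a toy cache wired with a lister, mirroring how the
// real AuthorizationCache would be constructed with fakes in a test.
type authCache struct {
	lister namespaceLister
}

// List delegates to the injected lister, so a test asserting on its
// output exercises the real method, snapshotting and locking included.
func (c *authCache) List() []string {
	return c.lister.ListNamespaces()
}

func main() {
	c := &authCache{lister: &fakeNamespaceLister{
		namespaces: []string{"default", "kube-system"},
	}}
	fmt.Println(c.List())
}
```

The point of the fake is dependency injection: the test asserts on the method under test rather than on a hand-rolled copy of its logic, so regressions in the real method surface as test failures.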
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@pkg/project/auth/cache.go`:
- Around line 518-520: The store pointer (ac.userSubjectRecordStore /
ac.groupSubjectRecordStore) must be read under the cache-level lock to avoid
concurrent swaps during synchronize(); fix by acquiring ac.mu.RLock(), snapshot
the appropriate store pointer into a local variable, then call that store's
List() while still holding the read lock (and continue to use
subjectRecord.mu.RLock() around subjectRecord.namespaces.List()); release locks
after the snapshot/List completes. Apply the same pattern for the other
occurrence around lines with subjectRecord.namespaces.List() (the 527-529
occurrence).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 6eaf07f6-6eea-4a68-ab1e-0c15040e8caf
📒 Files selected for processing (2)

- pkg/project/auth/cache.go
- pkg/project/auth/cache_race_test.go
```go
subjectRecord.mu.RLock()
keys.Insert(subjectRecord.namespaces.List()...)
subjectRecord.mu.RUnlock()
```
Guard the store pointers as well as the per-subject maps.
This fixes the subjectRecord.namespaces race, but List() still reads ac.userSubjectRecordStore / ac.groupSubjectRecordStore concurrently with synchronize() swapping those fields during cache invalidation. That leaves an unsynchronized read/write on the store references and can also give List() a mixed old/new snapshot.
Suggested fix

```diff
 type AuthorizationCache struct {
+	storeMu sync.RWMutex
+
 	reviewRecordStore       cache.Store
 	userSubjectRecordStore  cache.Store
 	groupSubjectRecordStore cache.Store
 	...
 }

 func (ac *AuthorizationCache) List(userInfo user.Info, selector labels.Selector) (*corev1.NamespaceList, error) {
+	ac.storeMu.RLock()
+	userStore := ac.userSubjectRecordStore
+	groupStore := ac.groupSubjectRecordStore
+	ac.storeMu.RUnlock()
+
 	keys := sets.String{}
 	user := userInfo.GetName()
 	groups := userInfo.GetGroups()
-	obj, exists, _ := ac.userSubjectRecordStore.GetByKey(user)
+	obj, exists, _ := userStore.GetByKey(user)
 	if exists {
 		subjectRecord := obj.(*subjectRecord)
 		subjectRecord.mu.RLock()
 		keys.Insert(subjectRecord.namespaces.List()...)
 		subjectRecord.mu.RUnlock()
 	}
 	for _, group := range groups {
-		obj, exists, _ := ac.groupSubjectRecordStore.GetByKey(group)
+		obj, exists, _ := groupStore.GetByKey(group)
 		if exists {
 			subjectRecord := obj.(*subjectRecord)
 			subjectRecord.mu.RLock()
 			keys.Insert(subjectRecord.namespaces.List()...)
 			subjectRecord.mu.RUnlock()
 		}
 	}
 }

 // inside synchronize()
 if invalidateCache {
+	ac.storeMu.Lock()
 	ac.userSubjectRecordStore = userSubjectRecordStore
 	ac.groupSubjectRecordStore = groupSubjectRecordStore
 	ac.reviewRecordStore = reviewRecordStore
+	ac.storeMu.Unlock()
 }
```

Also applies to: 527-529