Fix concurrent map access race in project authorization cache #643

Draft
gangwgr wants to merge 1 commit into openshift:main from gangwgr:fix-concurrent-map-race

Conversation

@gangwgr
Contributor

@gangwgr gangwgr commented May 8, 2026

The openshift-apiserver was experiencing repeated crashes due to concurrent map iteration and write operations on subjectRecord.namespaces.

The issue occurred when:

  • HTTP request handlers called AuthorizationCache.List(), which reads subjectRecord.namespaces via .List() at lines 517 and 524
  • Concurrently, the synchronize() goroutine (line 286) called deleteNamespaceFromSubjects() and addSubjectsToNamespace(), which modified the same sets.String (a Go map) without synchronization
  • The Go runtime detected the concurrent map access and terminated the process with "fatal error: concurrent map iteration and map write", a fatal error that cannot be caught with recover()

This fix adds proper synchronization using sync.RWMutex to protect all accesses to the subjectRecord.namespaces field:

  • Added mu sync.RWMutex field to subjectRecord struct
  • Protected all .List() reads with RLock/RUnlock
  • Protected all .Insert() and .Delete() writes with Lock/Unlock
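
The locking pattern described above can be sketched as follows. This is a minimal, standalone illustration and not the code from this PR: the real subjectRecord stores a sets.String, while this sketch assumes a plain map so it needs only the standard library, and the method names list, insert, and remove are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// subjectRecord is a simplified stand-in for the cache's record type.
// Assumption: a plain map replaces sets.String to avoid external imports.
type subjectRecord struct {
	subject string

	mu         sync.RWMutex // guards namespaces
	namespaces map[string]struct{}
}

func newSubjectRecord(subject string) *subjectRecord {
	return &subjectRecord{subject: subject, namespaces: map[string]struct{}{}}
}

// list returns a sorted copy of the namespaces while holding the read lock.
func (r *subjectRecord) list() []string {
	r.mu.RLock()
	defer r.mu.RUnlock()
	out := make([]string, 0, len(r.namespaces))
	for ns := range r.namespaces {
		out = append(out, ns)
	}
	sort.Strings(out)
	return out
}

// insert adds a namespace while holding the write lock.
func (r *subjectRecord) insert(ns string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.namespaces[ns] = struct{}{}
}

// remove deletes a namespace and reports whether the record is now empty.
// Both steps run under one write lock, so the emptiness check cannot
// interleave with a concurrent insert.
func (r *subjectRecord) remove(ns string) (empty bool) {
	r.mu.Lock()
	defer r.mu.Unlock()
	delete(r.namespaces, ns)
	return len(r.namespaces) == 0
}

func main() {
	rec := newSubjectRecord("alice")
	var wg sync.WaitGroup
	for i := 0; i < 20; i++ {
		wg.Add(2)
		go func(i int) { // writer
			defer wg.Done()
			rec.insert(fmt.Sprintf("ns-%d", i))
		}(i)
		go func() { // reader
			defer wg.Done()
			_ = rec.list()
		}()
	}
	wg.Wait()
	fmt.Println(len(rec.list())) // prints 20
}
```

Running a program like this with `go run -race` is one way to check that readers and writers no longer touch the map without holding the lock.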

Impact:

  • Eliminates openshift-apiserver CrashLoopBackOff
  • Fixes high pod restart counts (14, 119, 131 restarts observed)
  • Resolves OCPBUGS-XXXXX

Testing:

  • Added comprehensive race detector tests
  • All new tests pass with -race flag
  • All existing tests pass without regression
  • Stress tested with 20 concurrent goroutines for multiple seconds

Stack trace from bug report:
k8s.io/apimachinery/pkg/util/sets.List[...](0xc0668e6600)
  k8s.io/apimachinery@v0.31.1/pkg/util/sets/set.go:203 +0xb7
k8s.io/apimachinery/pkg/util/sets.String.List(...)
  k8s.io/apimachinery@v0.31.1/pkg/util/sets/string.go:121

Fixes: OCPBUGS-XXXXX

Summary by CodeRabbit

  • Bug Fixes

    • Resolved potential concurrency issues in authorization cache operations to ensure thread-safe access to cached authorization data.
  • Tests

    • Added comprehensive concurrency tests to verify authorization cache stability under concurrent access scenarios.

@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 8, 2026
@openshift-ci
Contributor

openshift-ci Bot commented May 8, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@coderabbitai

coderabbitai Bot commented May 8, 2026

Walkthrough

This PR adds concurrency safety to the AuthorizationCache by introducing a read-write mutex to the subjectRecord struct and protecting namespace read/write operations. Three test cases are added to verify that concurrent access does not trigger race conditions.

Changes

Authorization Cache Concurrency Safety

| Layer | File(s) | Summary |
| --- | --- | --- |
| Data Shape | pkg/project/auth/cache.go | subjectRecord struct gains mu sync.RWMutex field to protect concurrent access to namespaces. |
| Core Implementation | pkg/project/auth/cache.go | List() acquires read locks when accessing subjectRecord.namespaces for both user and group subject records. deleteNamespaceFromSubjects() locks before deleting namespaces and checking emptiness. addSubjectsToNamespace() locks before inserting namespaces into subject records. |
| Concurrency Tests | pkg/project/auth/cache_race_test.go | Three tests added: TestConcurrentMapAccess_NoRace verifies concurrent reads/writes don't panic over 2 seconds; TestAuthorizationCache_ListConcurrentAccess simulates concurrent writers and readers aggregating namespaces; TestSubjectRecordConcurrentModification stress-tests a single record with many concurrent operations and captures goroutine panics. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 12
✅ Passed checks (12 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The pull request title accurately describes the main change: fixing a concurrent map access race in the project authorization cache by adding mutex protection to subjectRecord.namespaces. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Stable And Deterministic Test Names | ✅ Passed | All test names are static and descriptive. Three test functions use clear names with no dynamic information like timestamps, UUIDs, or generated suffixes. |
| Test Structure And Quality | ✅ Passed | The custom check is specific to Ginkgo test code. The PR adds standard Go tests using *testing.T, not Ginkgo tests. The check is not applicable to this codebase's testing framework. |
| Microshift Test Compatibility | ✅ Passed | No Ginkgo e2e tests added. PR adds only standard Go unit tests using testing.T, not subject to MicroShift compatibility checks. |
| Single Node Openshift (Sno) Test Compatibility | ✅ Passed | The PR adds three standard Go unit tests, not Ginkgo e2e tests. The custom check applies only to Ginkgo e2e tests. Since no Ginkgo tests are added, the SNO compatibility check is not applicable. |
| Topology-Aware Scheduling Compatibility | ✅ Passed | PR modifies only Go source files fixing concurrency bugs. No deployment manifests, operator code, or scheduling constraints introduced. Custom check not applicable. |
| Ote Binary Stdout Contract | ✅ Passed | No OTE Binary Stdout Contract violations. cache.go has no stdout writes. cache_race_test.go has only t.Log() calls inside test functions, which are intercepted by the testing framework. |
| Ipv6 And Disconnected Network Test Compatibility | ✅ Passed | Added tests are standard Go unit tests, not Ginkgo e2e tests. No Ginkgo patterns detected. Custom check only applies to Ginkgo e2e tests with IPv4 or external connectivity issues. |


@openshift-ci
Contributor

openshift-ci Bot commented May 8, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign derekwaynecarr for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
pkg/project/auth/cache_race_test.go (1)

125-155: ⚡ Quick win

Exercise AuthorizationCache.List() directly here.

This loop reimplements List() instead of calling it, so the test won't catch regressions in the real method—especially anything around store snapshotting or future changes above the per-record lock. A small fake NamespaceLister / ClusterRoleLister would make this a real regression test.
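
A rough sketch of what "exercise List() directly" could look like is shown below. Everything here is hypothetical and simplified: authCache, record, and stress are illustrative names, the List signature takes plain strings instead of user.Info and a label selector, and no fake NamespaceLister or ClusterRoleLister is modelled. The point is only that the reader goroutines call the method under test rather than a copy of its logic.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// record is a hypothetical stand-in for subjectRecord.
type record struct {
	mu         sync.RWMutex // guards namespaces
	namespaces map[string]struct{}
}

func (r *record) insert(ns string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.namespaces[ns] = struct{}{}
}

// authCache is a hypothetical, simplified cache. Its two maps are fixed
// after construction, so only the records themselves need locking here.
type authCache struct {
	users  map[string]*record
	groups map[string]*record
}

func newAuthCache(user, group string) *authCache {
	return &authCache{
		users:  map[string]*record{user: {namespaces: map[string]struct{}{}}},
		groups: map[string]*record{group: {namespaces: map[string]struct{}{}}},
	}
}

// List returns the sorted union of namespaces for the user and groups.
func (c *authCache) List(user string, groups []string) []string {
	seen := map[string]struct{}{}
	collect := func(r *record) {
		if r == nil {
			return
		}
		r.mu.RLock()
		defer r.mu.RUnlock()
		for ns := range r.namespaces {
			seen[ns] = struct{}{}
		}
	}
	collect(c.users[user])
	for _, g := range groups {
		collect(c.groups[g])
	}
	out := make([]string, 0, len(seen))
	for ns := range seen {
		out = append(out, ns)
	}
	sort.Strings(out)
	return out
}

// stress runs n writers and n readers; the readers call List itself.
func stress(c *authCache, user, group string, n int) []string {
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(2)
		go func(i int) { // writer
			defer wg.Done()
			c.users[user].insert(fmt.Sprintf("user-ns-%d", i))
			c.groups[group].insert(fmt.Sprintf("group-ns-%d", i))
		}(i)
		go func() { // reader exercising the real method
			defer wg.Done()
			_ = c.List(user, []string{group})
		}()
	}
	wg.Wait()
	return c.List(user, []string{group})
}

func main() {
	c := newAuthCache("alice", "devs")
	fmt.Println(len(stress(c, "alice", "devs", 20))) // prints 40
}
```

In a real test the same shape would sit inside a `func TestXxx(t *testing.T)` and run under `go test -race`, asserting on the value List() returns.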

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/project/auth/cache_race_test.go` around lines 125 - 155, The test
currently reimplements the logic of AuthorizationCache.List() using direct store
lookups (userSubjectRecordStore / groupSubjectRecordStore and subjectRecord
locks), which won't catch regressions; update the test to call
AuthorizationCache.List() directly and assert on its output instead. To exercise
snapshotting and lister behaviour, replace the inline namespace/clusterrole
access with small fakes for NamespaceLister and ClusterRoleLister (or a fake
NamespaceLister/ClusterRoleLister implementation) wired into the
AuthorizationCache under test so List() sees realistic listers; keep the
concurrent goroutines running and synchronize with done/wg the same way so you
exercise the real List() locking and snapshot semantics rather than
reimplementing them.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 6eaf07f6-6eea-4a68-ab1e-0c15040e8caf

📥 Commits

Reviewing files that changed from the base of the PR and between 999dd5a and 21382cd.

📒 Files selected for processing (2)
  • pkg/project/auth/cache.go
  • pkg/project/auth/cache_race_test.go

Comment thread pkg/project/auth/cache.go
Comment on lines +518 to +520
subjectRecord.mu.RLock()
keys.Insert(subjectRecord.namespaces.List()...)
subjectRecord.mu.RUnlock()

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Guard the store pointers as well as the per-subject maps.

This fixes the subjectRecord.namespaces race, but List() still reads ac.userSubjectRecordStore / ac.groupSubjectRecordStore concurrently with synchronize() swapping those fields during cache invalidation. That leaves an unsynchronized read/write on the store references and can also give List() a mixed old/new snapshot.

Suggested fix
 type AuthorizationCache struct {
+	storeMu sync.RWMutex
+
 	reviewRecordStore       cache.Store
 	userSubjectRecordStore  cache.Store
 	groupSubjectRecordStore cache.Store
 	...
 }

 func (ac *AuthorizationCache) List(userInfo user.Info, selector labels.Selector) (*corev1.NamespaceList, error) {
+	ac.storeMu.RLock()
+	userStore := ac.userSubjectRecordStore
+	groupStore := ac.groupSubjectRecordStore
+	ac.storeMu.RUnlock()
+
 	keys := sets.String{}
 	user := userInfo.GetName()
 	groups := userInfo.GetGroups()

-	obj, exists, _ := ac.userSubjectRecordStore.GetByKey(user)
+	obj, exists, _ := userStore.GetByKey(user)
 	if exists {
 		subjectRecord := obj.(*subjectRecord)
 		subjectRecord.mu.RLock()
 		keys.Insert(subjectRecord.namespaces.List()...)
 		subjectRecord.mu.RUnlock()
 	}

 	for _, group := range groups {
-		obj, exists, _ := ac.groupSubjectRecordStore.GetByKey(group)
+		obj, exists, _ := groupStore.GetByKey(group)
 		if exists {
 			subjectRecord := obj.(*subjectRecord)
 			subjectRecord.mu.RLock()
 			keys.Insert(subjectRecord.namespaces.List()...)
 			subjectRecord.mu.RUnlock()
 		}
 	}
 }

 // inside synchronize()
 if invalidateCache {
+	ac.storeMu.Lock()
 	ac.userSubjectRecordStore = userSubjectRecordStore
 	ac.groupSubjectRecordStore = groupSubjectRecordStore
 	ac.reviewRecordStore = reviewRecordStore
+	ac.storeMu.Unlock()
 }

Also applies to: 527-529
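
The snapshot idea in the suggested fix can be tried out in isolation with a sketch like the one below. It is an illustration under stated assumptions, not the repository's code: store is a hypothetical stand-in for cache.Store (a map that is treated as immutable once published), and snapshot and swap are hypothetical helper names. Only storeMu mirrors a name from the suggestion.

```go
package main

import (
	"fmt"
	"sync"
)

// store is a hypothetical stand-in for cache.Store: subject -> namespaces.
// A published store is never mutated in this sketch, only replaced.
type store map[string][]string

type authCache struct {
	storeMu    sync.RWMutex // guards the two store references below
	userStore  store
	groupStore store
}

// snapshot copies both references under one read lock, so a caller
// always gets a pair that was published together.
func (c *authCache) snapshot() (users, groups store) {
	c.storeMu.RLock()
	defer c.storeMu.RUnlock()
	return c.userStore, c.groupStore
}

// swap publishes a new pair of stores, as synchronize() would on invalidation.
func (c *authCache) swap(users, groups store) {
	c.storeMu.Lock()
	defer c.storeMu.Unlock()
	c.userStore, c.groupStore = users, groups
}

// List reads only from the snapshot, never from the struct fields directly.
func (c *authCache) List(user string, groups []string) []string {
	users, groupStore := c.snapshot()
	out := append([]string(nil), users[user]...)
	for _, g := range groups {
		out = append(out, groupStore[g]...)
	}
	return out
}

func main() {
	c := &authCache{
		userStore:  store{"alice": {"gen0"}},
		groupStore: store{"devs": {"gen0"}},
	}
	var wg sync.WaitGroup
	wg.Add(1)
	go func() { // plays the role of synchronize() swapping stores
		defer wg.Done()
		for gen := 1; gen <= 100; gen++ {
			tag := fmt.Sprintf("gen%d", gen)
			c.swap(store{"alice": {tag}}, store{"devs": {tag}})
		}
	}()
	mixed := 0
	for i := 0; i < 1000; i++ {
		got := c.List("alice", []string{"devs"})
		if got[0] != got[1] { // user and group entries from different generations
			mixed++
		}
	}
	wg.Wait()
	fmt.Println("mixed snapshots:", mixed) // prints "mixed snapshots: 0"
}
```

Because both references are read under a single RLock and written under a single Lock, a reader cannot observe the user store from one generation and the group store from another.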

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/project/auth/cache.go` around lines 518 - 520, The store pointers
(ac.userSubjectRecordStore / ac.groupSubjectRecordStore) must be read under a
cache-level lock to avoid concurrent swaps during synchronize(); fix by
acquiring the cache-level read lock (storeMu in the suggested fix above),
snapshotting both store pointers into local variables, releasing the lock, and
then calling GetByKey() on the snapshots (continue to use
subjectRecord.mu.RLock() around subjectRecord.namespaces.List()). Take the
corresponding write lock in synchronize() while swapping the stores. Apply the
same pattern for the other occurrence around lines with
subjectRecord.namespaces.List() (the 527-529 occurrence).
