OCPBUGS-84534: fix concurrent map race in project authorization cache#642

Open
sanchezl wants to merge 2 commits into openshift:main from sanchezl:bugfix/project-auth-cache-race
Conversation

@sanchezl
Contributor

@sanchezl sanchezl commented May 7, 2026

Summary

Fix a fatal concurrent map iteration and map write panic in AuthorizationCache.List() that intermittently crashes openshift-apiserver pods. The race has existed for years and was never successfully fixed.

Commit 1 — Copy-on-write subjectRecords: addSubjectsToNamespace and deleteNamespaceFromSubjects now create new subjectRecord objects with copied namespaces sets instead of mutating the underlying map in place. Any concurrent List() holding the old record iterates an immutable snapshot. cache.Store is internally thread-safe, so the replacement is safe without locking — List() never blocks.

Commit 2 — Atomic store pointer swap: During full cache invalidation, three store pointers were swapped non-atomically. A concurrent List() could read stores from different points in time. Wraps all three stores in a struct behind atomic.Pointer so they swap as a single unit.

Root Cause

List() (called from HTTP request goroutines via proxy.(*REST).List) reads subjectRecord.namespaces (sets.String = map[string]Empty). Meanwhile, synchronize() (background goroutine) mutates the same maps in place:

  • addSubjectsToNamespace(): item.namespaces.Insert(namespace)
  • deleteNamespaceFromSubjects(): delete(subjectRecord.namespaces, namespace)

Go's runtime detects the concurrent map read+write and kills the process.

Fix History

  1. PR #267, "projects: add rw mutex to auth cache" (Jan 2022): added a sync.RWMutex to synchronize access.
  2. PR #326, "OCPBUGS-2803: Revert "projects: add rw mutex to auth cache"" (Oct 2022): reverted the mutex — on clusters with high namespace/RBAC counts, sync took multiple minutes, blocking all List() requests (goroutine dumps showed 3–4 minute waits on RLock).
  3. PR #547, "OCPBUGS-57474: ensure cache invalidation after a time" (Sep 2025): added timer-based cache invalidation every 15s, but no locking — the race remained.
  4. PR #530, "WIP: OCPBUGS-57474: Authorization Cache V2": a full rewrite of the cache — abandoned.

This fix avoids locks entirely via copy-on-write. List() never blocks, regardless of how long synchronize() takes.

Related Issues

QA Validation

Test 1: Race condition is fixed

  • Deploy a build with the fix
  • Run the cluster under load with concurrent project list requests
  • Verify no panics in openshift-apiserver logs and no pod restarts with exitCode 2

Test 2: No regression on large clusters

  • Provision a cluster with 2000+ namespaces and substantial RBAC
  • Measure oc get projects latency before and after — should not regress
  • Monitor openshift-apiserver memory usage for unexpected growth

Test 3: Cache freshness

  • Grant/revoke a user's access to a namespace
  • Verify reflected in oc projects output within ~15 seconds

Verification

/verified by "TestAuthorizationCacheRace"

Summary by CodeRabbit

  • Bug Fixes

    • Enhanced authorization cache reliability under concurrent load through improved internal consistency mechanisms.
  • Tests

    • Added comprehensive stress testing for authorization cache operations under concurrent access patterns.

sanchezl added 2 commits May 7, 2026 10:04
addSubjectsToNamespace and deleteNamespaceFromSubjects mutate
subjectRecord.namespaces (a sets.String / map) in place while
List() iterates the same map from HTTP request goroutines. This
causes a fatal "concurrent map iteration and map write" panic
that crashes openshift-apiserver pods intermittently.

Use copy-on-write: create a new subjectRecord with a copied
namespaces set and replace it in the store, so any in-flight
List() holding the old record iterates an immutable snapshot.
cache.Store is internally thread-safe (ThreadSafeStore), so the
replacement is safe without additional locking, and List() never
blocks.
During full cache invalidation, synchronize() swaps three store
pointers (userSubjectRecordStore, groupSubjectRecordStore,
reviewRecordStore) in sequence without synchronization. A
concurrent List() call can observe stores from different points
in time, producing silently incorrect results.

Group the three stores in an authorizationCacheStores struct
behind atomic.Pointer so they are swapped as a single unit.
List() snapshots the pointer once, ensuring it reads from a
consistent set of stores.
@openshift-ci-robot openshift-ci-robot added jira/severity-critical Referenced Jira bug's severity is critical for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels May 7, 2026
@openshift-ci-robot

@sanchezl: This pull request references Jira Issue OCPBUGS-84534, which is invalid:

  • expected the bug to target the "5.0.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@coderabbitai

coderabbitai Bot commented May 7, 2026

Walkthrough

AuthorizationCache refactors its internal storage to use an atomically-swapped pointer to a grouped stores structure, replacing separate mutable cache fields. Read operations snapshot the stores pointer for consistency; write operations perform incremental updates with atomic swaps. Namespace-to-subject mutations adopt immutable clone-and-update semantics. A race condition test was added.

Changes

Cache Atomicity Refactor

  • Data Shape — pkg/project/auth/cache.go (lines 177–198): New authorizationCacheStores struct groups three cache.Store instances; AuthorizationCache adds an atomic.Pointer[authorizationCacheStores] field named stores and removes the separate reviewRecordStore, userSubjectRecordStore, and groupSubjectRecordStore fields.
  • Initialization — pkg/project/auth/cache.go (lines 274–278): NewAuthorizationCache constructs the three cache stores and atomically installs them via ac.stores.Store(...) instead of setting individual fields.
  • Synchronization — pkg/project/auth/cache.go (lines 435–472): synchronize() snapshots the current stores pointer, performs incremental updates against the snapshot, and atomically swaps the rebuilt stores set via ac.stores.Store(...) on full rebuild.
  • Read Path — pkg/project/auth/cache.go (lines 525–538): List() loads the stores pointer once at the start and reads both user and group subject data from the snapshot for a consistent view throughout the request.
  • Immutable Updates — pkg/project/auth/cache.go (lines 614–641): deleteNamespaceFromSubjects and addSubjectsToNamespace replace in-place mutations with clone-and-update patterns that create new sets.String and subjectRecord instances.
  • Concurrency Tests — pkg/project/auth/cache_test.go (lines 6, 440–554): New TestAuthorizationCacheRace runs concurrent writer goroutines (mutating and calling synchronize()) and reader goroutines (calling List()) for 2 seconds to detect race conditions; the sync package is imported for coordination primitives.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 11 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage — ⚠️ Warning: docstring coverage is 50.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions that are missing them.

✅ Passed checks (11 passed)

  • Description Check — Passed: check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check — Passed: the PR title "OCPBUGS-84534: fix concurrent map race in project authorization cache" directly and accurately describes the main change.
  • Linked Issues Check — Passed: check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes Check — Passed: check skipped because no linked issues were found for this pull request.
  • Stable And Deterministic Test Names — Passed: the new TestAuthorizationCacheRace uses a static, descriptive function name with no dynamic values; existing tests use static names in table-driven structures.
  • Test Structure And Quality — Passed: the custom check reviews Ginkgo test code structure; TestAuthorizationCacheRace is standard Go testing, not Ginkgo, so the check is not applicable.
  • Microshift Test Compatibility — Passed: the new test is a standard Go unit test, not a Ginkgo e2e test, so the check is not applicable.
  • Single Node Openshift (SNO) Test Compatibility — Passed: no Ginkgo e2e tests added; TestAuthorizationCacheRace is a standard Go unit test using testing.T, so the check is not applicable.
  • Topology-Aware Scheduling Compatibility — Passed: the PR modifies internal Go cache code (pkg/project/auth/), not deployment manifests or controllers; no pod scheduling constraints, affinity rules, replicas, or topology configurations are introduced.
  • OTE Binary Stdout Contract — Passed: this modifies internal library code, not the OTE binary; the klog.V(5).Info() call is in a library method, not process-level code like main() or init().
  • IPv6 And Disconnected Network Test Compatibility — Passed: the PR adds a unit test using standard Go testing (func Test*); the custom check applies only to Ginkgo e2e tests, so it is not applicable.

Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci openshift-ci Bot requested review from deads2k and derekwaynecarr May 7, 2026 14:27
@sanchezl
Contributor Author

sanchezl commented May 7, 2026

/verified by "TestAuthorizationCacheRace"

@openshift-ci
Contributor

openshift-ci Bot commented May 7, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign derekwaynecarr for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the verified Signifies that the PR passed pre-merge verification criteria label May 7, 2026
@openshift-ci-robot

@sanchezl: This PR has been marked as verified by "TestAuthorizationCacheRace".

Details

In response to this:

/verified by "TestAuthorizationCacheRace"




@coderabbitai coderabbitai Bot left a comment


Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
pkg/project/auth/cache.go (1)

451-471: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Move lastCacheInvalidation to after the atomic store swap.

On the full-rebuild path, Line 453 resets the expiry timer before the rebuilt stores are visible. List() keeps serving the old snapshot until Line 467, so a slow rebuild can leave stale data live longer than maxCacheLifespan, and the next expiry window starts too early.

Suggested fix
 	invalidateCache := ac.invalidateCache(expired)
 	if invalidateCache {
-		ac.lastCacheInvalidation = ac.clock.Now()
 		userSubjectRecordStore = cache.NewStore(subjectRecordKeyFn)
 		groupSubjectRecordStore = cache.NewStore(subjectRecordKeyFn)
 		reviewRecordStore = cache.NewStore(reviewRecordKeyFn)
 	}
@@
 	if invalidateCache {
 		ac.stores.Store(&authorizationCacheStores{
 			userSubjectRecordStore:  userSubjectRecordStore,
 			groupSubjectRecordStore: groupSubjectRecordStore,
 			reviewRecordStore:       reviewRecordStore,
 		})
+		ac.lastCacheInvalidation = ac.clock.Now()
 	}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/project/auth/cache.go` around lines 451 - 471, The cache expiry timestamp
ac.lastCacheInvalidation is being set when invalidateCache is true before
swapping in the rebuilt stores, which can extend stale-serving time; move the
assignment of ac.lastCacheInvalidation to after the atomic swap (the
ac.stores.Store call that installs the new authorizationCacheStores with
userSubjectRecordStore, groupSubjectRecordStore, reviewRecordStore) so the
expiry timer starts only once the new stores are visible to List()/readers.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: f932d9a3-ce8c-40cc-a062-02fafaee0b7a

📥 Commits

Reviewing files that changed from the base of the PR and between 999dd5a and 10ef6dd.

📒 Files selected for processing (2)
  • pkg/project/auth/cache.go
  • pkg/project/auth/cache_test.go

@sanchezl
Contributor Author

sanchezl commented May 8, 2026

/retest-required

@openshift-ci
Contributor

openshift-ci Bot commented May 8, 2026

@sanchezl: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

