Skip to content

fix(gitops-engine): Handle Deleted Namespaces Gracefully During Sync (#24709)#24739

Closed
tricktron wants to merge 1 commit into
argoproj:masterfrom
tricktron:fix-argocd-cluster-sync-after-ns-deletion
Closed

fix(gitops-engine): Handle Deleted Namespaces Gracefully During Sync (#24709)#24739
tricktron wants to merge 1 commit into
argoproj:masterfrom
tricktron:fix-argocd-cluster-sync-after-ns-deletion

Conversation

@tricktron
Copy link
Copy Markdown

Summary

Migrated from argoproj/gitops-engine#785

Fix infinite sync failure loops when managed namespaces are deleted by implementing automatic namespace validation and cleanup during cluster cache synchronization.

Fixes: #24709

Problem

When namespaces managed by ArgoCD are deleted without first removing the managed-by label, the GitOps Engine enters an infinite failure loop during cluster cache sync operations. The processApi() function attempts to list resources in deleted namespaces, resulting in 403 Forbidden errors from the Kubernetes API. This causes:

Complete sync failures every 10 minutes (default cache sync interval)
ArgoCD becomes unresponsive until manual controller restart
No automatic recovery mechanism exists
Root Cause: The sync() process iterates through c.namespaces slice containing deleted namespace names but has no validation to check if those namespaces still exist before attempting API operations.

Solution

Implement namespace validation with automatic cleanup:

Key Changes

namespaceExists() function - Validates namespace existence using canonical apierrors.IsNotFound() detection
Enhanced processApi() - Skip deleted namespaces during resource processing using thread-safe tracking
Post-sync cleanup in sync() - Remove deleted namespaces from configuration after parallel processing completes
I also added a test for the scenario called TestSyncWithDeletedNamespace and added the default namespace in other tests to not break them.

Checklist:

  • Either (a) I've created an enhancement proposal and discussed it with the community, (b) this is a bug fix, or (c) this does not need to be in the release notes.
  • The title of the PR states what changed and the related issues number (used for the release note).
  • The title of the PR conforms to the Title of the PR
  • I've included "Closes [ISSUE #]" or "Fixes [ISSUE #]" in the description to automatically close the associated issue.
  • I've updated both the CLI and UI to expose my feature, or I plan to submit a second PR with them.
  • Does this PR require documentation updates?
  • I've updated documentation as required by this PR.
  • I have signed off all my commits as required by DCO
  • I have written unit and/or e2e tests for my change. PRs without these are unlikely to be merged.
  • My build is green (troubleshooting builds).
  • My new feature complies with the feature status guidelines.
  • I have added a brief description of why this PR is necessary and/or what this PR solves.
  • Optional. My organization is added to USERS.md.
  • Optional. For bug fixes, I've indicated what older releases this fix should be cherry-picked into (this may or may not happen depending on risk/complexity).

@tricktron tricktron requested a review from a team as a code owner September 25, 2025 15:20
@bunnyshell
Copy link
Copy Markdown

bunnyshell Bot commented Sep 25, 2025

❌ Preview Environment deleted from Bunnyshell

Available commands (reply to this comment):

  • 🚀 /bns:deploy to deploy the environment

@tricktron tricktron force-pushed the fix-argocd-cluster-sync-after-ns-deletion branch from 94d1cfc to 4ff775d Compare September 26, 2025 19:03
@codecov
Copy link
Copy Markdown

codecov Bot commented Sep 26, 2025

Codecov Report

❌ Patch coverage is 90.00000% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 63.56%. Comparing base (f397bf6) to head (8d31111).
⚠️ Report is 6 commits behind head on master.

Files with missing lines Patch % Lines
gitops-engine/pkg/cache/cluster.go 90.00% 2 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master   #24739      +/-   ##
==========================================
+ Coverage   62.16%   63.56%   +1.39%     
==========================================
  Files         417      417              
  Lines       70283    57133   -13150     
==========================================
- Hits        43691    36316    -7375     
+ Misses      23192    17413    -5779     
- Partials     3400     3404       +4     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@github-actions
Copy link
Copy Markdown
Contributor

This pull request has been marked as stale because it has had no activity for 90 days. Please comment if this is still relevant.

@github-actions github-actions Bot added the Stale No activity for over 90 days label Feb 11, 2026
@simonkrenger
Copy link
Copy Markdown

#24709 is still open, so this is still relevant

Copy link
Copy Markdown
Member

@ranakan19 ranakan19 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your contribution.
Looks like a neat solution to detect deleted namespaces and avoid sync failure caused by accessing resources of deleted namespaces.
This would only be an in-memory fix to avoid sync failures, admins would still need to update the cluster secret. One consideration: with sync now able to proceed normally, there's a possibility this could make such configuration drift less visible. While deleted namespaces are logged, without additional monitoring, cluster secrets might not be maintained promptly. This might be worth tracking in a separate issue.

For this change, in addition to the sync-time detection that you're alreasy testing, could you add another test scenario for processApi() changes in startMissingChanges i.e the function completes successfully without starting watches for the deleted namespace?

Comment thread gitops-engine/pkg/cache/cluster_test.go
Implement automatic detection and removal of deleted namespaces to
prevent infinite sync failure loops when namespaces are deleted without
removing the managed-by label first.

Signed-off-by: Thibault Gagnaux <thibault.gagnaux@bit.admin.ch>
@tricktron tricktron force-pushed the fix-argocd-cluster-sync-after-ns-deletion branch from 4ff775d to 8d31111 Compare April 23, 2026 08:11
@tricktron
Copy link
Copy Markdown
Author

Thanks for the review @ranakan19!

  • On the test for processApi() in startMissingWatches: Added as TestStartMissingWatchesWithDeletedNamespace.
  • On the silent drift concern: You're right. This is an in-memory-only fix. The pruned namespaces are never written back to the cluster secret. Currently, the argocd-operator manages this secret. Otherwise, it is managed manually by an admin adding or removing namespaces via the argocd cli.

Note: I'm reworking the implementation approach. Instead of pre-checking namespace existence with an extra GET call per namespace per API, I'm going to handle the Forbidden/NotFound errors directly in processApi() where they already occur. > This is simpler (no extra API calls on the happy path) and handles the actual error from the bug report (403 Forbidden, not 404 NotFound).

@agaudreault agaudreault self-assigned this Apr 23, 2026
@tricktron
Copy link
Copy Markdown
Author

tricktron commented Apr 23, 2026

Superseded by #27528 which takes a different approach: Instead of pre-checking namespace existence before each callback, it handles the NotFound/Forbidden error inline in processApi() and does NOT prune the namespace afterwards as this is the responsibility of the cluster secret owner (Argocd operator if used or admin). This avoids the extra API calls on the happy path and does not modify any state. It just focuses on graceful degradation and keeps the sync running while logging the error so that the responsible owner can fix it.

@ranakan19 @agaudreault If you agree that the #27528 is the better approach then I'll close this pr.

Copy link
Copy Markdown
Member

@agaudreault agaudreault left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same general comment as the other PR.

@tricktron both PR approach are problematic due to the parallelization of processApi. I would suggest closing either one of them, so we can focus on one PR and get it right.

In this approach, the namespace is not pruned form the secret. However, it is removed from the cluster cache. The impact is that a new CRD discovered in the cluster will not start a watch on this namespace. It also means that this namespace will only be retried when the secret namespace field is modified (causing SetNamespace to be called) before invalidating the cache.

In my opinion, the namespaces should not be removed from the cluster cache so it is retried until the secret is correctly updated. However, this means that new CRDs (check startMissingWatches) should also validate the namespaces.

I think the process could be

  • EnsureSynced is called
  • sync is called
  • sync validates the ns permissions
  • Sets c.invalidNamespaces[ny_namespace] = fmt.Error("Namespace X is not accessible")
  • Call processAPI for all resources
  • processAPI skips all invalidNamespaces without error/logs
  • sync returns
  • EnsureSynced sets syncStatus.syncWarnings = invalidNamespaces.Values()

// call the callback. If we're managing the whole cluster, we call the callback with the client and an empty namespace.
// If we're managing specific namespaces, we call the callback for each namespace.
func (c *clusterCache) processApi(client dynamic.Interface, api kube.APIResourceInfo, callback func(resClient dynamic.ResourceInterface, ns string) error) error {
func (c *clusterCache) processApi(client dynamic.Interface, api kube.APIResourceInfo, deletedNamespaces *sync.Map, callback func(resClient dynamic.ResourceInterface, ns string) error) error {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The main problem with this function is that it is run asynchronously. deletedNamespaces might be empty for all calls. If you have 100 kinds, this will cause 100 calls to validate the namespace.

It seems better to check for namespace existence before processing the APIs in parallel.

@Qodo-Free-For-OSS
Copy link
Copy Markdown

Hi, processApi() calls namespaceExists(context.Background(), ...), so namespace validation can continue even after the per-API context has been canceled, potentially hanging sync/watch startup on slow or wedged API calls.

Severity: remediation recommended | Category: reliability

How to fix: Thread ctx through namespaceExists

Agent prompt to fix - you can give this to your LLM of choice:

Issue description

Namespace validation uses context.Background() instead of the caller’s cancelable context. This can cause namespace GET calls to outlive the API processing context.

Issue Context

sync() creates ctx, cancel := context.WithCancel(...) per API and uses it for list/watch. Namespace validation should respect that same lifecycle.

Fix Focus Areas

  • gitops-engine/pkg/cache/cluster.go[947-975]
  • gitops-engine/pkg/cache/cluster.go[290-306]

Implementation notes

  • Pass the per-API ctx into processApi (or into the callback) and then into namespaceExists(ctx, ...).
  • Avoid creating new Background contexts in hot paths.

Found by Qodo. Free code review for open-source maintainers.

@tricktron
Copy link
Copy Markdown
Author

Closing in favor of #27528 which incorporates the feedback from both PRs.

@tricktron tricktron closed this Apr 27, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Cluster Cache Sync Fails When Managed Namespaces are Deleted Without Label Removal

5 participants