fix(gitops-engine): Handle Deleted Namespaces Gracefully During Sync (#24709) #24739
tricktron wants to merge 1 commit into
This pull request has been marked as stale because it has had no activity for 90 days. Please comment if this is still relevant.
#24709 is still open, so this is still relevant.
ranakan19 left a comment
Thanks for your contribution.
Looks like a neat solution to detect deleted namespaces and avoid sync failure caused by accessing resources of deleted namespaces.
This would only be an in-memory fix to avoid sync failures; admins would still need to update the cluster secret. One consideration: with sync now able to proceed normally, this could make such configuration drift less visible. While deleted namespaces are logged, cluster secrets might not be updated promptly without additional monitoring. This might be worth tracking in a separate issue.
For this change, in addition to the sync-time detection that you're already testing, could you add another test scenario for the processApi() changes in startMissingWatches, i.e. one where the function completes successfully without starting watches for the deleted namespace?
Implement automatic detection and removal of deleted namespaces to prevent infinite sync failure loops when namespaces are deleted without removing the managed-by label first.
Signed-off-by: Thibault Gagnaux <thibault.gagnaux@bit.admin.ch>
Thanks for the review @ranakan19!
Note: I'm reworking the implementation approach. Instead of pre-checking namespace existence with an extra GET call per namespace per API, I'm going to handle the Forbidden/NotFound errors directly in processApi() where they already occur. This is simpler (no extra API calls on the happy path) and handles the actual error from the bug report (403 Forbidden, not 404 NotFound).
Superseded by #27528, which takes a different approach: instead of pre-checking namespace existence before each callback, it handles the NotFound/Forbidden error inline in processApi() and does NOT prune the namespace afterwards, as that is the responsibility of the cluster secret owner (the Argo CD operator, if used, or an admin). This avoids the extra API calls on the happy path and does not modify any state. It focuses on graceful degradation: the sync keeps running and the error is logged so that the responsible owner can fix it. @ranakan19 @agaudreault If you agree that #27528 is the better approach, I'll close this PR.
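To make the inline approach described above concrete, here is a minimal, stdlib-only sketch. It is not the code from #27528: `processNamespaces`, `isInaccessible`, and the sentinel errors are hypothetical stand-ins, and real code would more likely classify errors with `apierrors.IsNotFound` and `apierrors.IsForbidden` from `k8s.io/apimachinery`.

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// Hypothetical sentinel errors standing in for Kubernetes API errors.
var (
	errNotFound  = errors.New("namespace not found (404)")
	errForbidden = errors.New("access forbidden (403)")
	errTimeout   = errors.New("server timeout")
)

// isInaccessible reports whether err indicates a namespace that can no
// longer be listed (assumption: 404 and 403 are treated the same way).
func isInaccessible(err error) bool {
	return errors.Is(err, errNotFound) || errors.Is(err, errForbidden)
}

// processNamespaces calls callback once per namespace. Inaccessible
// namespaces are logged and skipped without modifying any state; any
// other error still aborts processing.
func processNamespaces(namespaces []string, callback func(ns string) error) ([]string, error) {
	var skipped []string
	for _, ns := range namespaces {
		err := callback(ns)
		switch {
		case err == nil:
			continue
		case isInaccessible(err):
			log.Printf("skipping inaccessible namespace %q: %v", ns, err)
			skipped = append(skipped, ns)
		default:
			return skipped, fmt.Errorf("processing namespace %q: %w", ns, err)
		}
	}
	return skipped, nil
}

func main() {
	list := func(ns string) error {
		if ns == "deleted-ns" {
			return fmt.Errorf("list pods: %w", errForbidden)
		}
		return nil
	}
	skipped, err := processNamespaces([]string{"default", "deleted-ns", "prod"}, list)
	fmt.Println("skipped:", skipped, "err:", err)

	// An unrelated error is not swallowed.
	_, err = processNamespaces([]string{"default"}, func(string) error { return errTimeout })
	fmt.Println("err:", err)
}
```

Note that the namespace list itself is never changed here, which mirrors the stated intent of leaving the cluster secret to its owner.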
Same general comment as the other PR.
@tricktron the approaches in both PRs are problematic because processApi runs in parallel. I would suggest closing one of them, so we can focus on one PR and get it right.
In this approach, the namespace is not pruned from the secret. However, it is removed from the cluster cache. The impact is that a new CRD discovered in the cluster will not start a watch on this namespace. It also means that this namespace will only be retried when the secret's namespace field is modified (causing SetNamespace to be called) before invalidating the cache.
In my opinion, the namespaces should not be removed from the cluster cache so it is retried until the secret is correctly updated. However, this means that new CRDs (check startMissingWatches) should also validate the namespaces.
I think the process could be
1. EnsureSynced is called
2. sync is called
3. sync validates the ns permissions
   - Sets c.invalidNamespaces[ny_namespace] = fmt.Error("Namespace X is not accessible")
4. Call processAPI for all resources
5. processAPI skips all invalidNamespaces without error/logs
6. sync returns
7. EnsureSynced sets syncStatus.syncWarnings = invalidNamespaces.Values()
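The flow proposed above could be sketched roughly as follows. This is a simplified, stdlib-only illustration and not the gitops-engine code: the `clusterCache` fields, the `canAccess` probe, and the method signatures are assumptions, and API processing is sequential here for brevity, whereas the real processApi runs in parallel.

```go
package main

import (
	"fmt"
	"sort"
)

// clusterCache is a reduced, hypothetical stand-in for the real type.
type clusterCache struct {
	namespaces        []string
	invalidNamespaces map[string]error
	syncWarnings      []string
	canAccess         func(ns string) bool // hypothetical permission probe
}

// validateNamespaces runs once per sync, before any API is processed.
func (c *clusterCache) validateNamespaces() {
	c.invalidNamespaces = map[string]error{}
	for _, ns := range c.namespaces {
		if !c.canAccess(ns) {
			c.invalidNamespaces[ns] = fmt.Errorf("namespace %s is not accessible", ns)
		}
	}
}

// processAPI silently skips namespaces recorded as invalid.
func (c *clusterCache) processAPI(callback func(ns string) error) error {
	for _, ns := range c.namespaces {
		if _, invalid := c.invalidNamespaces[ns]; invalid {
			continue
		}
		if err := callback(ns); err != nil {
			return err
		}
	}
	return nil
}

// sync validates namespaces first, then processes every API.
func (c *clusterCache) sync(apis []string, callback func(api, ns string) error) error {
	c.validateNamespaces()
	for _, api := range apis {
		if err := c.processAPI(func(ns string) error { return callback(api, ns) }); err != nil {
			return err
		}
	}
	return nil
}

// ensureSynced surfaces invalid namespaces as warnings instead of errors.
func (c *clusterCache) ensureSynced(apis []string, callback func(api, ns string) error) error {
	err := c.sync(apis, callback)
	c.syncWarnings = c.syncWarnings[:0]
	for _, nsErr := range c.invalidNamespaces {
		c.syncWarnings = append(c.syncWarnings, nsErr.Error())
	}
	sort.Strings(c.syncWarnings) // map iteration order is random
	return err
}

func main() {
	c := &clusterCache{
		namespaces: []string{"default", "deleted-ns"},
		canAccess:  func(ns string) bool { return ns != "deleted-ns" },
	}
	err := c.ensureSynced([]string{"pods", "deployments"}, func(api, ns string) error {
		fmt.Printf("listing %s in %s\n", api, ns)
		return nil
	})
	fmt.Println("err:", err, "warnings:", c.syncWarnings)
}
```

In this sketch the invalid namespace stays in `c.namespaces`, so it is re-validated on every sync until the secret is fixed. Because `invalidNamespaces` is written only before the APIs are processed, parallel workers would only ever read it.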
  // call the callback. If we're managing the whole cluster, we call the callback with the client and an empty namespace.
  // If we're managing specific namespaces, we call the callback for each namespace.
- func (c *clusterCache) processApi(client dynamic.Interface, api kube.APIResourceInfo, callback func(resClient dynamic.ResourceInterface, ns string) error) error {
+ func (c *clusterCache) processApi(client dynamic.Interface, api kube.APIResourceInfo, deletedNamespaces *sync.Map, callback func(resClient dynamic.ResourceInterface, ns string) error) error {
The main problem with this function is that it is run asynchronously. deletedNamespaces might be empty for all calls. If you have 100 kinds, this will cause 100 calls to validate the namespace.
It seems better to check for namespace existence before processing the APIs in parallel.
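The cost difference described in this comment can be illustrated with a small counting sketch. It models the worst case, in which every parallel worker starts before any other has recorded the namespace as deleted; `checksPerKind`, `checksUpfront`, and `fanOut` are hypothetical helpers, not gitops-engine functions.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// fanOut runs one worker per kind and waits for all of them.
func fanOut(kinds int, work func()) {
	var wg sync.WaitGroup
	for i := 0; i < kinds; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			work()
		}()
	}
	wg.Wait()
}

// checksPerKind validates the namespace inside every parallel worker and
// returns how many existence checks were issued.
func checksPerKind(kinds int, ns string, exists func(string) bool) int64 {
	var checks int64
	fanOut(kinds, func() {
		atomic.AddInt64(&checks, 1)
		if !exists(ns) {
			return // skip the deleted namespace
		}
		// list and watch resources of this kind here
	})
	return atomic.LoadInt64(&checks)
}

// checksUpfront validates the namespace once, before fanning out.
func checksUpfront(kinds int, ns string, exists func(string) bool) int64 {
	checks := int64(1)
	if !exists(ns) {
		return checks
	}
	fanOut(kinds, func() {
		// list and watch resources of this kind here
	})
	return checks
}

func main() {
	exists := func(ns string) bool { return ns != "deleted-ns" }
	fmt.Println(checksPerKind(100, "deleted-ns", exists)) // prints 100
	fmt.Println(checksUpfront(100, "deleted-ns", exists)) // prints 1
}
```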
Severity: remediation recommended | Category: reliability
How to fix: Thread ctx through namespaceExists
Found by Qodo.
Closing in favor of #27528, which incorporates the feedback from both PRs.
Summary
Migrated from argoproj/gitops-engine#785
Fix infinite sync failure loops when managed namespaces are deleted by implementing automatic namespace validation and cleanup during cluster cache synchronization.
Fixes: #24709
Problem
When namespaces managed by ArgoCD are deleted without first removing the managed-by label, the GitOps Engine enters an infinite failure loop during cluster cache sync operations. The processApi() function attempts to list resources in deleted namespaces, resulting in 403 Forbidden errors from the Kubernetes API. This causes:
- Complete sync failures every 10 minutes (default cache sync interval)
- ArgoCD becomes unresponsive until manual controller restart
- No automatic recovery mechanism exists
Root Cause: The sync() process iterates through the c.namespaces slice, which still contains the names of deleted namespaces, and never checks whether those namespaces exist before attempting API operations.
Solution
Implement namespace validation with automatic cleanup:
Key Changes
- namespaceExists() function - Validates namespace existence using canonical apierrors.IsNotFound() detection
- Enhanced processApi() - Skip deleted namespaces during resource processing using thread-safe tracking
- Post-sync cleanup in sync() - Remove deleted namespaces from configuration after parallel processing completes
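As a rough illustration of how these three pieces fit together, here is a stdlib-only sketch. It is not the PR's diff: the reduced `clusterCache`, the `exists` probe standing in for namespaceExists(), and the simplified signatures are all assumptions.

```go
package main

import (
	"fmt"
	"sync"
)

// clusterCache is a reduced, hypothetical stand-in for the real type.
type clusterCache struct {
	namespaces []string
	exists     func(ns string) bool // stand-in for namespaceExists()
}

// processApi skips namespaces already known to be deleted and records
// newly detected ones in the shared, thread-safe map.
func (c *clusterCache) processApi(deleted *sync.Map, callback func(ns string) error) error {
	for _, ns := range c.namespaces {
		if _, gone := deleted.Load(ns); gone {
			continue
		}
		if !c.exists(ns) {
			deleted.Store(ns, true)
			continue
		}
		if err := callback(ns); err != nil {
			return err
		}
	}
	return nil
}

// sync processes all APIs in parallel, then prunes deleted namespaces.
func (c *clusterCache) sync(apis []string, callback func(api, ns string) error) error {
	var deleted sync.Map
	var wg sync.WaitGroup
	errs := make([]error, len(apis))
	for i, api := range apis {
		wg.Add(1)
		go func(i int, api string) {
			defer wg.Done()
			errs[i] = c.processApi(&deleted, func(ns string) error { return callback(api, ns) })
		}(i, api)
	}
	wg.Wait()
	for _, err := range errs {
		if err != nil {
			return err
		}
	}
	// Post-sync cleanup: safe here because all workers have finished.
	kept := make([]string, 0, len(c.namespaces))
	for _, ns := range c.namespaces {
		if _, gone := deleted.Load(ns); !gone {
			kept = append(kept, ns)
		}
	}
	c.namespaces = kept
	return nil
}

func main() {
	c := &clusterCache{
		namespaces: []string{"default", "deleted-ns", "prod"},
		exists:     func(ns string) bool { return ns != "deleted-ns" },
	}
	err := c.sync([]string{"pods", "deployments"}, func(api, ns string) error { return nil })
	fmt.Println(c.namespaces, err)
}
```

Deferring the slice mutation until after `wg.Wait()` is what keeps the parallel readers of `c.namespaces` race-free in this sketch.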
I also added a test for this scenario, TestSyncWithDeletedNamespace, and added the default namespace to the other tests so they don't break.