fix(gitops-engine): Handle Deleted Namespaces Gracefully During Sync (#24709) by tricktron · Pull Request #24739 · argoproj/argo-cd

tricktron · 2025-09-25T15:20:40Z

Summary

Migrated from argoproj/gitops-engine#785

Fix infinite sync failure loops when managed namespaces are deleted by implementing automatic namespace validation and cleanup during cluster cache synchronization.

Fixes: #24709

Problem

When namespaces managed by ArgoCD are deleted without first removing the managed-by label, the GitOps Engine enters an infinite failure loop during cluster cache sync operations. The processApi() function attempts to list resources in deleted namespaces, resulting in 403 Forbidden errors from the Kubernetes API. This causes:

Complete sync failures every 10 minutes (default cache sync interval)
ArgoCD becomes unresponsive until manual controller restart
No automatic recovery mechanism exists
Root Cause: The sync() process iterates through c.namespaces slice containing deleted namespace names but has no validation to check if those namespaces still exist before attempting API operations.

Solution

Implement namespace validation with automatic cleanup:

Key Changes

namespaceExists() function - Validates namespace existence using canonical apierrors.IsNotFound() detection
Enhanced processApi() - Skip deleted namespaces during resource processing using thread-safe tracking
Post-sync cleanup in sync() - Remove deleted namespaces from configuration after parallel processing completes
I also added a test for the scenario called TestSyncWithDeletedNamespace and added the default namespace in other tests to not break them.
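The three changes above can be sketched roughly as follows. This is a minimal, standard-library-only illustration and not the code from the PR: `fakeCluster`, `errNotFound`, and `syncNamespaces` are hypothetical names, and the sentinel error stands in for the check that the real code performs with `apierrors.IsNotFound()`.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errNotFound stands in for a Kubernetes 404 response (assumption for this
// sketch; the PR detects this case with apierrors.IsNotFound).
var errNotFound = errors.New("namespace not found")

// fakeCluster is a hypothetical stand-in for the Kubernetes API.
type fakeCluster struct{ existing map[string]bool }

func (f fakeCluster) getNamespace(ns string) error {
	if !f.existing[ns] {
		return errNotFound
	}
	return nil
}

// namespaceExists reports false only for a definite "not found" answer.
// Any other error is returned to the caller unchanged.
func namespaceExists(c fakeCluster, ns string) (bool, error) {
	err := c.getNamespace(ns)
	if errors.Is(err, errNotFound) {
		return false, nil
	}
	if err != nil {
		return false, err
	}
	return true, nil
}

// syncNamespaces processes every (API, namespace) pair in parallel, records
// deleted namespaces in a sync.Map, and prunes them after all workers finish.
func syncNamespaces(c fakeCluster, apis, namespaces []string) (kept []string, processed int) {
	var (
		deleted sync.Map
		wg      sync.WaitGroup
		mu      sync.Mutex
	)
	for range apis {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for _, ns := range namespaces {
				ok, err := namespaceExists(c, ns)
				if err != nil {
					continue // a real implementation would surface this error
				}
				if !ok {
					deleted.Store(ns, struct{}{})
					continue
				}
				mu.Lock()
				processed++ // stands in for listing and watching resources
				mu.Unlock()
			}
		}()
	}
	wg.Wait()

	// Post-sync cleanup: keep only namespaces that were not marked deleted.
	for _, ns := range namespaces {
		if _, gone := deleted.Load(ns); !gone {
			kept = append(kept, ns)
		}
	}
	return kept, processed
}

func main() {
	c := fakeCluster{existing: map[string]bool{"default": true}}
	kept, processed := syncNamespaces(c,
		[]string{"pods", "services", "configmaps"},
		[]string{"default", "team-a"})
	fmt.Println(kept, processed)
}
```

Note that every worker performs its own existence check in this sketch, so the number of validation calls grows with the number of API kinds.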


bunnyshell · 2025-09-25T15:20:46Z

❌ Preview Environment deleted from Bunnyshell


codecov · 2025-09-26T19:38:51Z

Codecov Report

❌ Patch coverage is 90.00000% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 63.56%. Comparing base (f397bf6) to head (8d31111).
⚠️ Report is 6 commits behind head on master.

Files with missing lines	Patch %	Lines
gitops-engine/pkg/cache/cluster.go	90.00%	2 Missing and 1 partial ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master   #24739      +/-   ##
==========================================
+ Coverage   62.16%   63.56%   +1.39%     
==========================================
  Files         417      417              
  Lines       70283    57133   -13150     
==========================================
- Hits        43691    36316    -7375     
+ Misses      23192    17413    -5779     
- Partials     3400     3404       +4


github-actions · 2026-02-11T00:12:54Z

This pull request has been marked as stale because it has had no activity for 90 days. Please comment if this is still relevant.

simonkrenger · 2026-02-11T07:14:04Z

#24709 is still open, so this is still relevant

ranakan19

Thanks for your contribution.
Looks like a neat solution to detect deleted namespaces and avoid sync failure caused by accessing resources of deleted namespaces.
This would only be an in-memory fix to avoid sync failures; admins would still need to update the cluster secret. One consideration: with sync now able to proceed normally, there's a possibility this could make such configuration drift less visible. While deleted namespaces are logged, without additional monitoring, cluster secrets might not be maintained promptly. This might be worth tracking in a separate issue.

For this change, in addition to the sync-time detection that you're already testing, could you add another test scenario for the processApi() changes in startMissingWatches, i.e. that the function completes successfully without starting watches for the deleted namespace?

Implement automatic detection and removal of deleted namespaces to prevent infinite sync failure loops when namespaces are deleted without removing the managed-by label first.

Signed-off-by: Thibault Gagnaux <thibault.gagnaux@bit.admin.ch>

tricktron · 2026-04-23T13:15:42Z

Thanks for the review @ranakan19!

On the test for processApi() in startMissingWatches: Added as TestStartMissingWatchesWithDeletedNamespace.
On the silent drift concern: You're right. This is an in-memory-only fix. The pruned namespaces are never written back to the cluster secret. Currently, the argocd-operator manages this secret. Otherwise, it is managed manually by an admin adding or removing namespaces via the argocd cli.

Note: I'm reworking the implementation approach. Instead of pre-checking namespace existence with an extra GET call per namespace per API, I'm going to handle the Forbidden/NotFound errors directly in processApi() where they already occur. This is simpler (no extra API calls on the happy path) and handles the actual error from the bug report (403 Forbidden, not 404 NotFound).

tricktron · 2026-04-23T20:42:52Z

Superseded by #27528 which takes a different approach: Instead of pre-checking namespace existence before each callback, it handles the NotFound/Forbidden error inline in processApi() and does NOT prune the namespace afterwards as this is the responsibility of the cluster secret owner (Argocd operator if used or admin). This avoids the extra API calls on the happy path and does not modify any state. It just focuses on graceful degradation and keeps the sync running while logging the error so that the responsible owner can fix it.

@ranakan19 @agaudreault If you agree that #27528 is the better approach, then I'll close this PR.

agaudreault

Same general comment as the other PR.

@tricktron both PRs' approaches are problematic due to the parallelization of processApi. I would suggest closing either one of them, so we can focus on one PR and get it right.

In this approach, the namespace is not pruned from the secret. However, it is removed from the cluster cache. The impact is that a new CRD discovered in the cluster will not start a watch on this namespace. It also means that this namespace will only be retried when the secret namespace field is modified (causing SetNamespace to be called) before invalidating the cache.

In my opinion, the namespaces should not be removed from the cluster cache so it is retried until the secret is correctly updated. However, this means that new CRDs (check startMissingWatches) should also validate the namespaces.

I think the process could be

EnsureSynced is called
sync is called
sync validates the ns permissions
Sets c.invalidNamespaces[my_namespace] = fmt.Errorf("Namespace X is not accessible")
Call processAPI for all resources
processAPI skips all invalidNamespaces without error/logs
sync returns
EnsureSynced sets syncStatus.syncWarnings = invalidNamespaces.Values()
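The proposed flow could be sketched like this. It is a simplified illustration with the standard library only: the `clusterCache` type here is a toy stand-in, `canAccess` is a hypothetical hook replacing the real permission check, and the field names follow the steps above rather than any existing code.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// clusterCache is a toy stand-in for the real cache type.
type clusterCache struct {
	namespaces        []string
	invalidNamespaces map[string]error
	syncWarnings      []string
	canAccess         func(ns string) bool // hypothetical permission check
}

// sync validates every namespace once, then processes all APIs in parallel.
func (c *clusterCache) sync(apis []string) map[string]int {
	c.invalidNamespaces = map[string]error{}
	for _, ns := range c.namespaces {
		if !c.canAccess(ns) {
			c.invalidNamespaces[ns] = fmt.Errorf("namespace %s is not accessible", ns)
		}
	}

	listed := make(map[string]int)
	var (
		mu sync.Mutex
		wg sync.WaitGroup
	)
	for _, api := range apis {
		wg.Add(1)
		go func(api string) {
			defer wg.Done()
			n := c.processAPI()
			mu.Lock()
			listed[api] = n
			mu.Unlock()
		}(api)
	}
	wg.Wait()
	return listed
}

// processAPI silently skips invalid namespaces. The map is only read while
// the workers run, so no lock is needed for it.
func (c *clusterCache) processAPI() int {
	n := 0
	for _, ns := range c.namespaces {
		if _, bad := c.invalidNamespaces[ns]; bad {
			continue
		}
		n++ // stands in for list and watch calls in this namespace
	}
	return n
}

// ensureSynced runs sync and publishes the invalid namespaces as warnings.
// c.namespaces is left untouched, so each namespace is retried next time.
func (c *clusterCache) ensureSynced(apis []string) map[string]int {
	listed := c.sync(apis)
	c.syncWarnings = c.syncWarnings[:0]
	for _, err := range c.invalidNamespaces {
		c.syncWarnings = append(c.syncWarnings, err.Error())
	}
	sort.Strings(c.syncWarnings)
	return listed
}

func main() {
	c := &clusterCache{
		namespaces: []string{"default", "team-a"},
		canAccess:  func(ns string) bool { return ns != "team-a" },
	}
	fmt.Println(c.ensureSynced([]string{"pods", "deployments"}), c.syncWarnings)
}
```

Validation cost here depends only on the number of namespaces, not on the number of API kinds.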

agaudreault · 2026-04-24T18:24:41Z

 // call the callback. If we're managing the whole cluster, we call the callback with the client and an empty namespace.
 // If we're managing specific namespaces, we call the callback for each namespace.
-func (c *clusterCache) processApi(client dynamic.Interface, api kube.APIResourceInfo, callback func(resClient dynamic.ResourceInterface, ns string) error) error {
+func (c *clusterCache) processApi(client dynamic.Interface, api kube.APIResourceInfo, deletedNamespaces *sync.Map, callback func(resClient dynamic.ResourceInterface, ns string) error) error {


The main problem with this function is that it is run asynchronously. deletedNamespaces might be empty for all calls. If you have 100 kinds, this will cause 100 calls to validate the namespace.

It seems better to check for namespace existence before processing the APIs in parallel.

Qodo-Free-For-OSS · 2026-04-25T06:42:06Z

Hi, processApi() calls namespaceExists(context.Background(), ...), so namespace validation can continue even after the per-API context has been canceled, potentially hanging sync/watch startup on slow or wedged API calls.

Severity: remediation recommended | Category: reliability

How to fix: Thread ctx through namespaceExists

Agent prompt to fix - you can give this to your LLM of choice:

Issue description

Namespace validation uses context.Background() instead of the caller’s cancelable context. This can cause namespace GET calls to outlive the API processing context.

Issue Context

sync() creates ctx, cancel := context.WithCancel(...) per API and uses it for list/watch. Namespace validation should respect that same lifecycle.

Fix Focus Areas

gitops-engine/pkg/cache/cluster.go[947-975]

gitops-engine/pkg/cache/cluster.go[290-306]

Implementation notes

Pass the per-API ctx into processApi (or into the callback) and then into namespaceExists(ctx, ...).

Avoid creating new Background contexts in hot paths.

Found by Qodo. Free code review for open-source maintainers.

tricktron · 2026-04-27T10:36:20Z

Closing in favor of #27528 which incorporates the feedback from both PRs.

tricktron requested a review from a team as a code owner September 25, 2025 15:20

tricktron force-pushed the fix-argocd-cluster-sync-after-ns-deletion branch from 94d1cfc to 4ff775d Compare September 26, 2025 19:03

github-actions Bot added the Stale No activity for over 90 days label Feb 11, 2026

github-actions Bot removed the Stale No activity for over 90 days label Feb 12, 2026

tricktron mentioned this pull request Mar 16, 2026

Cluster Cache Sync Fails When Managed Namespaces are Deleted Without Label Removal #24709

Open

3 tasks

ranakan19 reviewed Apr 21, 2026


Comment thread gitops-engine/pkg/cache/cluster_test.go

tricktron force-pushed the fix-argocd-cluster-sync-after-ns-deletion branch from 4ff775d to 8d31111 Compare April 23, 2026 08:11

agaudreault self-assigned this Apr 23, 2026

agaudreault requested changes Apr 24, 2026


agaudreault mentioned this pull request Apr 24, 2026

feat: Namespace Selectors Support In Cluster Secrets #21846

Open

14 tasks

tricktron closed this Apr 27, 2026
