
fix: improve teleport operator timing and eliminate race conditions #291

Open

ssyno wants to merge 3 commits into main from fix-teleport-timing-race-conditions

Conversation

@ssyno (Contributor) commented Jul 30, 2025

Towards: https://github.com/giantswarm/giantswarm/issues/33940


What this PR does / why we need it

- Replace periodic reconciliation with event-driven watchers for kubeconfig secrets and tbot configmaps
- Add watchers for teleport-{cluster}-kubeconfig secrets and teleport-tbot-{cluster}-config configmaps
- Increase resource limits (CPU: 250m -> 500m, Memory: 500Mi -> 1Gi) for better performance on busy clusters
- Add timing logs to help debug future performance issues
- Eliminate the 3-minute delay window that caused cluster test failures on garm

This fixes race conditions where cluster tests would fail if the teleport operator took too long to create kubeconfig secrets, especially on the garm management cluster.

Checklist

- Update changelog in CHANGELOG.md.
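The "3-minute delay window" the PR eliminates is the worst case of periodic resync: an event that lands just after a tick waits almost a full period, while a watch reacts immediately. A minimal sketch of that arithmetic (numbers illustrative, not measurements from the operator):

```go
package main

import "fmt"

// nextPeriodicHandling returns the first resync tick at or after the
// event, i.e. when a purely periodic reconciler would notice a change
// that happened at eventAt (both in seconds).
func nextPeriodicHandling(eventAt, period int) int {
	if eventAt%period == 0 {
		return eventAt
	}
	return (eventAt/period + 1) * period
}

func main() {
	// Event 10s after a tick, 180s (3 min) resync: handled ~170s late.
	// An event-driven watcher would handle it at t=10s instead.
	e := 10
	fmt.Println(nextPeriodicHandling(e, 180)-e, "seconds of avoidable delay")
}
```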
@ssyno ssyno requested a review from a team as a code owner July 30, 2025 08:39
Comment on lines +274 to +283
		Watches(
			&source.Kind{Type: &corev1.Secret{}},
			handler.EnqueueRequestsFromMapFunc(r.findClustersForKubeconfigSecret),
			builder.WithPredicates(predicate.NewPredicateFuncs(r.isKubeconfigSecret)),
		).
		Watches(
			&source.Kind{Type: &corev1.ConfigMap{}},
			handler.EnqueueRequestsFromMapFunc(r.findClustersForTbotConfigMap),
			builder.WithPredicates(predicate.NewPredicateFuncs(r.isTbotConfigMap)),
		).
Contributor
Why is it necessary to watch the resources we create?

Contributor Author
I guess we have to, if we want to use the watcher architecture.
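The map functions referenced above (e.g. findClustersForKubeconfigSecret) translate an event on a watched object back into a reconcile request for the owning Cluster. A pure sketch of that name-mapping core; the real function returns []reconcile.Request and the namespace handling here is an assumption for illustration:

```go
package main

import (
	"fmt"
	"strings"
)

// requestForKubeconfigSecret sketches the logic behind
// handler.EnqueueRequestsFromMapFunc: given a watched Secret's name,
// derive the namespaced name of the Cluster CR to re-reconcile.
// Returns false if the name does not match teleport-{cluster}-kubeconfig.
func requestForKubeconfigSecret(secretName, clusterNamespace string) (string, bool) {
	if !strings.HasPrefix(secretName, "teleport-") || !strings.HasSuffix(secretName, "-kubeconfig") {
		return "", false
	}
	cluster := strings.TrimSuffix(strings.TrimPrefix(secretName, "teleport-"), "-kubeconfig")
	return clusterNamespace + "/" + cluster, true
}

func main() {
	// "org-giantswarm" is a hypothetical namespace for the example.
	req, ok := requestForKubeconfigSecret("teleport-garm-kubeconfig", "org-giantswarm")
	fmt.Println(req, ok) // org-giantswarm/garm true
}
```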

// SetupWithManager sets up the controller with the Manager.
func (r *ClusterReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&capi.Cluster{}).
Contributor
IMO this line suggests that the operator should be reacting immediately to Cluster CR creation, so the 3-minute lag time is external to the operator.

Contributor Author
Yeah, this should be the issue: the watcher means the operator responds immediately to new clusters, so the delay is happening in the tbot deployment or in the Teleport API calls.

Contributor
When a new Cluster CR is created, do we spin up or re-deploy tbot? If so, then I think the solution here is to immediately create the placeholder secret so that e2e can wait for it to be populated, and then fill it in once tbot is done
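The placeholder-first flow suggested here can be sketched in plain Go. secretWriter and both function names are hypothetical stand-ins for the real Kubernetes client and operator helpers; the point is only the two-phase write:

```go
package main

import "fmt"

// secretWriter is a minimal stand-in for a Kubernetes client that
// creates or updates a Secret's data (e.g. via server-side apply).
type secretWriter interface {
	Apply(name string, data map[string][]byte) error
}

type fakeWriter struct{ store map[string]map[string][]byte }

func (f *fakeWriter) Apply(name string, data map[string][]byte) error {
	f.store[name] = data
	return nil
}

// ensurePlaceholder creates the kubeconfig secret with empty data as
// soon as the Cluster CR appears, so e2e can wait on population.
func ensurePlaceholder(c secretWriter, cluster string) error {
	return c.Apply("teleport-"+cluster+"-kubeconfig", map[string][]byte{})
}

// populate fills in the kubeconfig once tbot has produced it.
func populate(c secretWriter, cluster string, kubeconfig []byte) error {
	return c.Apply("teleport-"+cluster+"-kubeconfig", map[string][]byte{"kubeconfig": kubeconfig})
}

func main() {
	w := &fakeWriter{store: map[string]map[string][]byte{}}
	_ = ensurePlaceholder(w, "garm")
	fmt.Println(len(w.store["teleport-garm-kubeconfig"])) // 0: placeholder exists, still empty
	_ = populate(w, "garm", []byte("apiVersion: v1"))
	fmt.Println(len(w.store["teleport-garm-kubeconfig"])) // 1: now populated
}
```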

Comment on lines +288 to +317
// isKubeconfigSecret checks if a secret is a teleport kubeconfig secret we should watch
func (r *ClusterReconciler) isKubeconfigSecret(obj client.Object) bool {
	secret, ok := obj.(*corev1.Secret)
	if !ok {
		return false
	}

	// Check if it's in the teleport bot namespace
	if secret.Namespace != key.TeleportBotNamespace {
		return false
	}

	// Check if it matches teleport kubeconfig secret naming pattern: teleport-{cluster}-kubeconfig
	return strings.HasPrefix(secret.Name, "teleport-") && strings.HasSuffix(secret.Name, "-kubeconfig")
}

// isTbotConfigMap checks if a configmap is a tbot configmap we should watch
func (r *ClusterReconciler) isTbotConfigMap(obj client.Object) bool {
	cm, ok := obj.(*corev1.ConfigMap)
	if !ok {
		return false
	}

	// Check if it's in the teleport bot namespace
	if cm.Namespace != key.TeleportBotNamespace {
		return false
	}

	// Check if it matches tbot configmap naming pattern: teleport-tbot-{cluster}-config
	return strings.HasPrefix(cm.Name, "teleport-tbot-") && strings.HasSuffix(cm.Name, "-config")
}
Contributor
If these are truly necessary, they should be key functions, not instance methods
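Expressed as package-level helpers (names hypothetical, mirroring the predicates in the diff), the naming checks reduce to pure string functions the reconciler's predicates could call after the type assertion and namespace check:

```go
package main

import (
	"fmt"
	"strings"
)

// IsKubeconfigSecretName reports whether a name matches the
// teleport-{cluster}-kubeconfig convention.
func IsKubeconfigSecretName(name string) bool {
	return strings.HasPrefix(name, "teleport-") && strings.HasSuffix(name, "-kubeconfig")
}

// IsTbotConfigMapName reports whether a name matches the
// teleport-tbot-{cluster}-config convention.
func IsTbotConfigMapName(name string) bool {
	return strings.HasPrefix(name, "teleport-tbot-") && strings.HasSuffix(name, "-config")
}

func main() {
	fmt.Println(IsKubeconfigSecretName("teleport-garm-kubeconfig")) // true
	fmt.Println(IsTbotConfigMapName("teleport-garm-kubeconfig"))   // false
}
```

Pure functions like these are also trivially unit-testable, unlike methods that drag in the reconciler.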

}

log.Info("Reconciling cluster", "cluster", cluster)
log.Info("Reconciling cluster", "cluster", cluster, "creation_time", cluster.CreationTimestamp, "reconcile_start", start)
Contributor
I'm fine with measuring the reconciliation time. If we do, it would be nice to have it as a metric.

Contributor Author
Sure, we'll go with metrics.
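A minimal sketch of the timing-as-a-metric idea: in the real operator the observation would go into a Prometheus histogram registered with controller-runtime's metrics.Registry; here a plain callback stands in for the metric so the wrapper itself is self-contained:

```go
package main

import (
	"fmt"
	"time"
)

// timedReconcile runs a reconcile function and reports its wall-clock
// duration in seconds to observe (a stand-in for Histogram.Observe).
func timedReconcile(observe func(seconds float64), reconcile func() error) error {
	start := time.Now()
	err := reconcile()
	observe(time.Since(start).Seconds())
	return err
}

func main() {
	var seconds float64
	_ = timedReconcile(func(s float64) { seconds = s }, func() error {
		time.Sleep(5 * time.Millisecond) // stand-in for real reconcile work
		return nil
	})
	fmt.Println(seconds >= 0.005) // true: at least the simulated work time
}
```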
