
fix: improve teleport operator timing and eliminate race conditions #291

Open

ssyno wants to merge 3 commits into main from fix-teleport-timing-race-conditions

Conversation

@ssyno (Contributor) commented Jul 30, 2025

Towards: https://github.com/giantswarm/giantswarm/issues/33940


What this PR does / why we need it

- Replace periodic reconciliation with event-driven watchers for kubeconfig secrets and tbot configmaps
- Add watchers for teleport-{cluster}-kubeconfig secrets and teleport-tbot-{cluster}-config configmaps
- Increase resource limits (CPU: 250m -> 500m, Memory: 500Mi -> 1Gi) for better performance on busy clusters
- Add timing logs to help debug future performance issues
- Eliminate the 3-minute delay window that caused cluster test failures on garm

This fixes race conditions where cluster tests would fail if the teleport operator took too long to create kubeconfig secrets, especially on the garm management cluster.

Checklist

- Update changelog in CHANGELOG.md.
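The "3-minute delay window" the PR eliminates is the worst case of periodic resync: an event that lands just after a tick waits almost a full period, while a watch reacts immediately. A minimal sketch of that arithmetic (numbers illustrative, not measurements from the operator):

```go
package main

import "fmt"

// nextPeriodicHandling returns the first resync tick at or after the
// event, i.e. when a purely periodic reconciler would notice a change
// that happened at eventAt (both in seconds).
func nextPeriodicHandling(eventAt, period int) int {
	if eventAt%period == 0 {
		return eventAt
	}
	return (eventAt/period + 1) * period
}

func main() {
	// Event 10s after a tick, 180s (3 min) resync: handled ~170s late.
	// An event-driven watcher would handle it at t=10s instead.
	e := 10
	fmt.Println(nextPeriodicHandling(e, 180)-e, "seconds of avoidable delay")
}
```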
@ssyno ssyno requested a review from a team as a code owner July 30, 2025 08:39
Comment on lines +274 to +283
		Watches(
			&source.Kind{Type: &corev1.Secret{}},
			handler.EnqueueRequestsFromMapFunc(r.findClustersForKubeconfigSecret),
			builder.WithPredicates(predicate.NewPredicateFuncs(r.isKubeconfigSecret)),
		).
		Watches(
			&source.Kind{Type: &corev1.ConfigMap{}},
			handler.EnqueueRequestsFromMapFunc(r.findClustersForTbotConfigMap),
			builder.WithPredicates(predicate.NewPredicateFuncs(r.isTbotConfigMap)),
		).
Contributor
Why is it necessary to watch the resources we create?

Contributor Author
I guess we have to, if we want to use the watcher architecture.
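The map functions referenced above (e.g. findClustersForKubeconfigSecret) translate an event on a watched object back into a reconcile request for the owning Cluster. A pure sketch of that name-mapping core; the real function returns []reconcile.Request and the namespace handling here is an assumption for illustration:

```go
package main

import (
	"fmt"
	"strings"
)

// requestForKubeconfigSecret sketches the logic behind
// handler.EnqueueRequestsFromMapFunc: given a watched Secret's name,
// derive the namespaced name of the Cluster CR to re-reconcile.
// Returns false if the name does not match teleport-{cluster}-kubeconfig.
func requestForKubeconfigSecret(secretName, clusterNamespace string) (string, bool) {
	if !strings.HasPrefix(secretName, "teleport-") || !strings.HasSuffix(secretName, "-kubeconfig") {
		return "", false
	}
	cluster := strings.TrimSuffix(strings.TrimPrefix(secretName, "teleport-"), "-kubeconfig")
	return clusterNamespace + "/" + cluster, true
}

func main() {
	// "org-giantswarm" is a hypothetical namespace for the example.
	req, ok := requestForKubeconfigSecret("teleport-garm-kubeconfig", "org-giantswarm")
	fmt.Println(req, ok) // org-giantswarm/garm true
}
```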

// SetupWithManager sets up the controller with the Manager.
func (r *ClusterReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&capi.Cluster{}).
Contributor
IMO this line suggests that the operator should be reacting immediately to Cluster CR creation, so the 3-minute lag time is external to the operator.

Contributor Author
Yeah, this should be the issue: the watcher means the operator responds immediately to new clusters, so the delay is happening in the tbot deployment or in the Teleport API calls.

Contributor
When a new Cluster CR is created, do we spin up or re-deploy tbot? If so, then I think the solution here is to immediately create the placeholder secret so that e2e can wait for it to be populated, and then fill it in once tbot is done
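The placeholder-first flow suggested here can be sketched in plain Go. secretWriter and both function names are hypothetical stand-ins for the real Kubernetes client and operator helpers; the point is only the two-phase write:

```go
package main

import "fmt"

// secretWriter is a minimal stand-in for a Kubernetes client that
// creates or updates a Secret's data (e.g. via server-side apply).
type secretWriter interface {
	Apply(name string, data map[string][]byte) error
}

type fakeWriter struct{ store map[string]map[string][]byte }

func (f *fakeWriter) Apply(name string, data map[string][]byte) error {
	f.store[name] = data
	return nil
}

// ensurePlaceholder creates the kubeconfig secret with empty data as
// soon as the Cluster CR appears, so e2e can wait on population.
func ensurePlaceholder(c secretWriter, cluster string) error {
	return c.Apply("teleport-"+cluster+"-kubeconfig", map[string][]byte{})
}

// populate fills in the kubeconfig once tbot has produced it.
func populate(c secretWriter, cluster string, kubeconfig []byte) error {
	return c.Apply("teleport-"+cluster+"-kubeconfig", map[string][]byte{"kubeconfig": kubeconfig})
}

func main() {
	w := &fakeWriter{store: map[string]map[string][]byte{}}
	_ = ensurePlaceholder(w, "garm")
	fmt.Println(len(w.store["teleport-garm-kubeconfig"])) // 0: placeholder exists, still empty
	_ = populate(w, "garm", []byte("apiVersion: v1"))
	fmt.Println(len(w.store["teleport-garm-kubeconfig"])) // 1: now populated
}
```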

Comment on lines +288 to +317
// isKubeconfigSecret checks if a secret is a teleport kubeconfig secret we should watch
func (r *ClusterReconciler) isKubeconfigSecret(obj client.Object) bool {
	secret, ok := obj.(*corev1.Secret)
	if !ok {
		return false
	}

	// Check if it's in the teleport bot namespace
	if secret.Namespace != key.TeleportBotNamespace {
		return false
	}

	// Check if it matches teleport kubeconfig secret naming pattern: teleport-{cluster}-kubeconfig
	return strings.HasPrefix(secret.Name, "teleport-") && strings.HasSuffix(secret.Name, "-kubeconfig")
}

// isTbotConfigMap checks if a configmap is a tbot configmap we should watch
func (r *ClusterReconciler) isTbotConfigMap(obj client.Object) bool {
	cm, ok := obj.(*corev1.ConfigMap)
	if !ok {
		return false
	}

	// Check if it's in the teleport bot namespace
	if cm.Namespace != key.TeleportBotNamespace {
		return false
	}

	// Check if it matches tbot configmap naming pattern: teleport-tbot-{cluster}-config
	return strings.HasPrefix(cm.Name, "teleport-tbot-") && strings.HasSuffix(cm.Name, "-config")
}
Contributor
If these are truly necessary, they should be key functions, not instance methods
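Expressed as package-level helpers (names hypothetical, mirroring the predicates in the diff), the naming checks reduce to pure string functions the reconciler's predicates could call after the type assertion and namespace check:

```go
package main

import (
	"fmt"
	"strings"
)

// IsKubeconfigSecretName reports whether a name matches the
// teleport-{cluster}-kubeconfig convention.
func IsKubeconfigSecretName(name string) bool {
	return strings.HasPrefix(name, "teleport-") && strings.HasSuffix(name, "-kubeconfig")
}

// IsTbotConfigMapName reports whether a name matches the
// teleport-tbot-{cluster}-config convention.
func IsTbotConfigMapName(name string) bool {
	return strings.HasPrefix(name, "teleport-tbot-") && strings.HasSuffix(name, "-config")
}

func main() {
	fmt.Println(IsKubeconfigSecretName("teleport-garm-kubeconfig")) // true
	fmt.Println(IsTbotConfigMapName("teleport-garm-kubeconfig"))   // false
}
```

Pure functions like these are also trivially unit-testable, unlike methods that drag in the reconciler.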

}

log.Info("Reconciling cluster", "cluster", cluster)
log.Info("Reconciling cluster", "cluster", cluster, "creation_time", cluster.CreationTimestamp, "reconcile_start", start)
Contributor
I'm fine with measuring the reconciliation time. If we do, it would be nice to have it as a metric.

Contributor Author
Sure, we'll go with metrics.
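A minimal sketch of the timing-as-a-metric idea: in the real operator the observation would go into a Prometheus histogram registered with controller-runtime's metrics.Registry; here a plain callback stands in for the metric so the wrapper itself is self-contained:

```go
package main

import (
	"fmt"
	"time"
)

// timedReconcile runs a reconcile function and reports its wall-clock
// duration in seconds to observe (a stand-in for Histogram.Observe).
func timedReconcile(observe func(seconds float64), reconcile func() error) error {
	start := time.Now()
	err := reconcile()
	observe(time.Since(start).Seconds())
	return err
}

func main() {
	var seconds float64
	_ = timedReconcile(func(s float64) { seconds = s }, func() error {
		time.Sleep(5 * time.Millisecond) // stand-in for real reconcile work
		return nil
	})
	fmt.Println(seconds >= 0.005) // true: at least the simulated work time
}
```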
