fix: improve teleport operator timing and eliminate race conditions #291
ssyno wants to merge 3 commits into
Conversation
- Replace periodic reconciliation with event-driven watchers for kubeconfig secrets and tbot configmaps
- Add watchers for teleport-{cluster}-kubeconfig secrets and teleport-tbot-{cluster}-config configmaps
- Increase resource limits (CPU: 250m->500m, Memory: 500Mi->1Gi) for better performance on busy clusters
- Add timing logs to help debug future performance issues
- Eliminate 3-minute delay window that caused cluster test failures on garm
This fixes race conditions where cluster tests would fail if teleport operator
took too long to create kubeconfig secrets, especially on the garm management cluster.
```go
		Watches(
			&source.Kind{Type: &corev1.Secret{}},
			handler.EnqueueRequestsFromMapFunc(r.findClustersForKubeconfigSecret),
			builder.WithPredicates(predicate.NewPredicateFuncs(r.isKubeconfigSecret)),
		).
		Watches(
			&source.Kind{Type: &corev1.ConfigMap{}},
			handler.EnqueueRequestsFromMapFunc(r.findClustersForTbotConfigMap),
			builder.WithPredicates(predicate.NewPredicateFuncs(r.isTbotConfigMap)),
		).
```
Why is it necessary to watch the resources we create?
I guess we have to, if we want to use the watcher architecture.
```go
// SetupWithManager sets up the controller with the Manager.
func (r *ClusterReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&capi.Cluster{}).
```
IMO this line suggests that the operator should be reacting immediately to Cluster CR creation, so the 3 minute lag time is external to the operator
Yeah, this should be the issue: the watcher means the operator responds immediately to new clusters, so the delay is happening in the tbot deployment or in the Teleport API calls.
When a new Cluster CR is created, do we spin up or re-deploy tbot? If so, then I think the solution here is to immediately create the placeholder secret so that e2e can wait for it to be populated, and then fill it in once tbot is done
```go
func (r *ClusterReconciler) isKubeconfigSecret(obj client.Object) bool {
	secret, ok := obj.(*corev1.Secret)
	if !ok {
		return false
	}

	// Check if it's in the teleport bot namespace and matches our naming pattern
	if secret.Namespace != key.TeleportBotNamespace {
		return false
	}

	// Check if it matches teleport kubeconfig secret naming pattern: teleport-{cluster}-kubeconfig
	return strings.HasPrefix(secret.Name, "teleport-") && strings.HasSuffix(secret.Name, "-kubeconfig")
}

// isTbotConfigMap checks if a configmap is a tbot configmap we should watch
func (r *ClusterReconciler) isTbotConfigMap(obj client.Object) bool {
	cm, ok := obj.(*corev1.ConfigMap)
	if !ok {
		return false
	}

	// Check if it's in the teleport bot namespace and matches our naming pattern
	if cm.Namespace != key.TeleportBotNamespace {
		return false
	}

	// Check if it matches tbot configmap naming pattern: teleport-tbot-{cluster}-config
	return strings.HasPrefix(cm.Name, "teleport-tbot-") && strings.HasSuffix(cm.Name, "-config")
}
```
If these are truly necessary, they should be key functions, not instance methods
```diff
 	}

-	log.Info("Reconciling cluster", "cluster", cluster)
+	log.Info("Reconciling cluster", "cluster", cluster, "creation_time", cluster.CreationTimestamp, "reconcile_start", start)
```
I'm fine to measure the reconciliation time. If we do so, it would be nice to have it as a metric
sure, we will go with metrics
Towards: https://github.com/giantswarm/giantswarm/issues/33940