[BUG] os.Exit in goroutine bypasses graceful shutdown, leaking leader election lease #2401

@abhaygoudannavar

What happened:

I was looking at the startup flow in main.go and noticed a couple of related problems around lines 257–269.

The goroutine that waits for webhook readiness and sets up controllers calls os.Exit(1) directly if either step fails:

// main.go:257-269
go func() {
    setupLog.Info("wait webhook ready")
    if err = webhook.WaitReady(); err != nil {
        setupLog.Error(err, "unable to wait webhook ready")
        os.Exit(1)
    }

    setupLog.Info("setup controllers")
    if err = controller.SetupWithManager(mgr); err != nil {
        setupLog.Error(err, "unable to setup controllers")
        os.Exit(1)
    }
}()

os.Exit kills the process immediately — no deferred functions run, the manager's graceful shutdown path is skipped, and critically, the leader election lease is never released. The context from ctrl.SetupSignalHandler() (line 167) is meant to handle this, but os.Exit bypasses it entirely.

With the default leaseDuration of 15 seconds, this means no other kruise-manager replica can acquire leadership until the lease expires on its own. Every time this failure path is hit, all Kruise controllers and webhooks are effectively down for that window.

The second part of the problem is WaitReady() itself in pkg/webhook/server.go:

func WaitReady() error {
    startTS := time.Now()
    var err error
    for {
        duration := time.Since(startTS)
        if err = Checker(nil); err == nil {
            return nil
        }
        if duration > time.Second*5 {
            klog.ErrorS(err, "Failed to wait webhook ready", "duration", duration)
        }
        time.Sleep(time.Second * 2)
    }
}

This is an infinite loop with no timeout and no context cancellation. If the health checker never passes (say there's a TLS cert issue, or the webhook server fails to bind), this goroutine spins forever. The os.Exit(1) path never fires, controllers never get set up, but the manager keeps running and renewing the leader lease — so you end up with a leader that holds the lock but does nothing. Liveness probes against /healthz would still pass since that's just a ping check. The readyz check would catch it, but that depends on the probe configuration.

What you expected to happen:

1. On failure, the goroutine should cancel the manager's context to trigger a graceful shutdown (releasing the leader lease, draining queues, etc.) rather than calling os.Exit directly.
2. WaitReady should accept a context.Context so it can be cancelled on shutdown, and ideally have a timeout so it doesn't spin forever.

How to reproduce it (as minimally and precisely as possible):

For the os.Exit issue:

1. Deploy kruise-manager with leader election enabled (default).
2. Cause controller.SetupWithManager to fail (e.g., by breaking a CRD definition so the controller can't set up a watch).
3. Observe that the process exits immediately and the leader lease remains held until TTL expires. Other replicas can't take over during that window.

For the infinite WaitReady loop:

1. Deploy kruise-manager with a misconfigured webhook TLS cert (e.g., wrong ca-cert.pem).
2. The health checker will never pass, and WaitReady will loop forever.
3. The manager holds the leader lease and passes healthz, but no controllers are running.

Anything else we need to know?:

A possible fix would be something like:

ctx, cancel := context.WithCancel(ctrl.SetupSignalHandler())
// ...
go func() {
    if err := webhook.WaitReady(ctx); err != nil {
        setupLog.Error(err, "unable to wait webhook ready")
        cancel() // triggers graceful manager shutdown
        return
    }
    if err := controller.SetupWithManager(mgr); err != nil {
        setupLog.Error(err, "unable to setup controllers")
        cancel()
        return
    }
}()

And updating WaitReady to respect context cancellation:

func WaitReady(ctx context.Context) error {
    startTS := time.Now()
    for {
        select {
        case <-ctx.Done():
            return ctx.Err()
        default:
        }
        if err := Checker(nil); err == nil {
            return nil
        }
        // ...
    }
}

Environment:
Kruise version: master (HEAD)
Kubernetes version: all versions affected
Install details: N/A — code-level issue
Others: N/A

Labels: kind/bug (Something isn't working)
