fix: replace os.Exit in goroutine with context cancellation for graceful shutdown#2402
Conversation
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
There was a problem hiding this comment.
Pull request overview
This PR fixes two critical issues in the startup path that prevent graceful shutdown of the manager:
-
os.Exit(1) in goroutine leaks leader election lease: When webhook readiness check or controller setup fails, the goroutine was calling
os.Exit(1)directly, which bypasses all deferred cleanup including the manager's graceful shutdown. This prevented the leader election lease from being released, blocking other replicas from taking over leadership until the TTL expired (default 15 seconds). -
WaitReady() infinite loop with no timeout: The webhook readiness check function was an infinite loop with no context cancellation support. If the health checker never passed, the goroutine would spin forever, and the manager would keep renewing the leader lease while doing nothing useful.
Changes:
- Wraps the signal handler context with
context.WithCancel()to enable graceful cancellation - Replaces
os.Exit(1)calls in the background goroutine withcancel()+returnto trigger manager shutdown - Updates
WaitReady()to accept a context parameter and respect context cancellation - Leaves
os.Exit(1)calls in the initialization phase (before manager startup) unchanged as they predate leader election
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| main.go | Wraps signal handler with cancellable context, updates WaitReady call to pass context, replaces os.Exit calls in goroutine with graceful cancellation |
| pkg/webhook/server.go | Updates WaitReady signature to accept context, adds context cancellation check in the polling loop |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #2402 +/- ##
==========================================
+ Coverage 48.77% 49.00% +0.22%
==========================================
Files 324 324
Lines 27928 27932 +4
==========================================
+ Hits 13623 13689 +66
+ Misses 12775 12705 -70
- Partials 1530 1538 +8
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
25eeb8a to
afd2ee7
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.
Comments suppressed due to low confidence (1)
pkg/webhook/server.go:1
- Cancellation is only checked before
time.Sleep, so shutdown may be delayed by up to 2 seconds per loop iteration. Consider replacingtime.Sleepwith a ticker +selectonctx.Done()so cancellation is responsive even during the wait interval.
/*
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| go func() { | ||
| setupLog.Info("wait webhook ready") | ||
| if err = webhook.WaitReady(); err != nil { | ||
| if err = webhook.WaitReady(ctx); err != nil { | ||
| setupLog.Error(err, "unable to wait webhook ready") | ||
| os.Exit(1) | ||
| cancel() | ||
| return | ||
| } | ||
|
|
||
| setupLog.Info("setup controllers") | ||
| if err = controller.SetupWithManager(mgr); err != nil { | ||
| setupLog.Error(err, "unable to setup controllers") | ||
| os.Exit(1) | ||
| cancel() | ||
| return | ||
| } | ||
| }() |
There was a problem hiding this comment.
The goroutine assigns to the outer err variable (if err = ...) which is likely also accessed on the main goroutine path, creating a data race. Use a goroutine-local variable (e.g., if err := ...; err != nil) for both calls so no shared state is mutated across goroutines.
| func TestWaitReadyCancel(t *testing.T) { | ||
| ctx, cancel := context.WithCancel(context.Background()) | ||
| cancel() // instantly cancel the context | ||
|
|
||
| err := WaitReady(ctx) | ||
| if err == nil { | ||
| t.Fatalf("expected error, got nil") | ||
| } | ||
| if !strings.Contains(err.Error(), "context cancelled while waiting for webhook ready") { | ||
| t.Fatalf("unexpected error message: %v", err) | ||
| } | ||
| } |
There was a problem hiding this comment.
This test asserts on a substring of the error message, which makes it brittle to wording changes. Prefer checking the wrapped cause (e.g., errors.Is(err, context.Canceled)) and only assert on message content if absolutely necessary.
| aHourAgo := metav1.NewTime(time.Unix(time.Now().Add(-time.Hour).Unix(), 0)) | ||
| Clock = testingclock.NewFakeClock(aHourAgo.Time) |
There was a problem hiding this comment.
The test mutates the package-level Clock and does not restore it, which can leak state into other tests and cause ordering-dependent failures. Capture the previous clock and defer restoring it at the end of the test.
| aHourAgo := metav1.NewTime(time.Unix(time.Now().Add(-time.Hour).Unix(), 0)) | |
| Clock = testingclock.NewFakeClock(aHourAgo.Time) | |
| aHourAgo := metav1.NewTime(time.Unix(time.Now().Add(-time.Hour).Unix(), 0)) | |
| oldClock := Clock | |
| Clock = testingclock.NewFakeClock(aHourAgo.Time) | |
| defer func() { | |
| Clock = oldClock | |
| }() |
afd2ee7 to
b58ba06
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
Comments suppressed due to low confidence (1)
pkg/webhook/server.go:1
- Cancellation responsiveness is delayed by up to 2 seconds because
time.Sleepis not interruptible; ifctxis canceled during the sleep, shutdown will wait for the sleep to finish before exiting. Consider replacingtime.Sleepwith aselectthat waits on eitherctx.Done()or a timer/ticker so cancellation returns promptly.
/*
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| go func() { | ||
| setupLog.Info("wait webhook ready") | ||
| if err = webhook.WaitReady(); err != nil { | ||
| if err = webhook.WaitReady(ctx); err != nil { | ||
| setupLog.Error(err, "unable to wait webhook ready") | ||
| os.Exit(1) | ||
| cancel() | ||
| return | ||
| } | ||
|
|
||
| setupLog.Info("setup controllers") | ||
| if err = controller.SetupWithManager(mgr); err != nil { | ||
| setupLog.Error(err, "unable to setup controllers") | ||
| os.Exit(1) | ||
| cancel() | ||
| return | ||
| } | ||
| }() |
There was a problem hiding this comment.
err is assigned inside a goroutine (err = ...) which can introduce a data race if err is also accessed from the main goroutine (common pattern in main). Use a goroutine-local variable instead (e.g., if err := ...; err != nil { ... }) for both WaitReady and SetupWithManager to avoid sharing mutable state across goroutines.
pkg/webhook/server.go
Outdated
| for { | ||
| select { | ||
| case <-ctx.Done(): | ||
| return fmt.Errorf("context cancelled while waiting for webhook ready: %w", ctx.Err()) |
There was a problem hiding this comment.
The standard Go spelling in context errors is “canceled” (one “l”). Consider changing the message to “context canceled …” for consistency (and update the corresponding test assertion).
| return fmt.Errorf("context cancelled while waiting for webhook ready: %w", ctx.Err()) | |
| return fmt.Errorf("context canceled while waiting for webhook ready: %w", ctx.Err()) |
…ful shutdown The goroutine in main.go that waits for webhook readiness and sets up controllers was calling os.Exit(1) on failure, which kills the process immediately without releasing the leader election lease. Other replicas cannot acquire leadership until the lease TTL expires (default 15s). Additionally, WaitReady() was an infinite loop with no context or timeout, risking a zombie leader state if the webhook health check never passes. This commit: - Wraps the signal handler context with context.WithCancel - Replaces os.Exit(1) in the goroutine with cancel() + return, triggering the manager's graceful shutdown path - Updates WaitReady to accept a context.Context so it can exit when the context is cancelled Fixes openkruise#2401 Signed-off-by: abhaygoudannavar <abhaysgoudnvr@gmail.com>
b58ba06 to
195d5e3
Compare
Ⅰ. Describe what this PR does
Two related problems in the startup path:
1.
os.Exit(1)in goroutine leaks the leader election leaseThe goroutine at
main.go:257that waits for webhook readiness and sets up controllers callsos.Exit(1)on failure. This kills the process immediately — deferred cleanup never runs, the manager's graceful shutdown is skipped, and the leader election lease is never released. With the defaultleaseDurationof 15s, no other replica can take over until the lease expires on its own.2.
WaitReady()is an infinite loop with no exit conditionWaitReadyinpkg/webhook/server.goloops forever pollingChecker()with a 2-second sleep. If the health check never passes (e.g. bad TLS cert, webhook server failed to bind), this goroutine spins indefinitely. Controllers never get set up, but the manager keeps running and renewing the leader lease — so you get a leader that holds the lock but does nothing useful.What this PR changes:
context.WithCancelto get acancelfunctionos.Exit(1)in the goroutine withcancel()+return, which triggers the manager's graceful shutdown path (releasing the leader lease, draining queues, etc.)WaitReadyto accept acontext.Contextso it can exit cleanly when the context is cancelledⅡ. Does this pull request fix one issue?
fixes #2401
Ⅲ. Describe how to verify it
main.goandpkg/webhook/server.go.os.Exit(1)calls in the goroutine are replaced withcancel()+return, whileos.Exit(1)calls in the main goroutine (pre-manager startup) are intentionally left unchanged since those run before leader election.WaitReadynow checksctx.Done()at the top of each loop iteration.controller.SetupWithManagerto fail and confirm the manager exits cleanly and the leader lease is released immediately (rather than held for the full TTL).Ⅳ. Special notes for reviews
os.Exit(1)calls outside the goroutine (lines 179–247) are intentionally left as-is. Those run in the main goroutine beforemgr.Start(), so leader election hasn't started yet and there's no lease to leak.WaitReady's signature changed fromWaitReady()toWaitReady(ctx context.Context)— the only caller is inmain.go, which is updated in the same commit.