
fix: replace os.Exit in goroutine with context cancellation for graceful shutdown#2402

Open
abhaygoudannavar wants to merge 1 commit into openkruise:master from abhaygoudannavar:fix/graceful-shutdown-on-goroutine-failure

Conversation

@abhaygoudannavar

Ⅰ. Describe what this PR does

Two related problems in the startup path:

1. os.Exit(1) in goroutine leaks the leader election lease

The goroutine at main.go:257 that waits for webhook readiness and sets up controllers calls os.Exit(1) on failure. This kills the process immediately — deferred cleanup never runs, the manager's graceful shutdown is skipped, and the leader election lease is never released. With the default leaseDuration of 15s, no other replica can take over until the lease expires on its own.

2. WaitReady() is an infinite loop with no exit condition

WaitReady in pkg/webhook/server.go loops forever polling Checker() with a 2-second sleep. If the health check never passes (e.g. bad TLS cert, webhook server failed to bind), this goroutine spins indefinitely. Controllers never get set up, but the manager keeps running and renewing the leader lease — so you get a leader that holds the lock but does nothing useful.

What this PR changes:

  • Wraps the signal handler context with context.WithCancel to get a cancel function
  • Replaces os.Exit(1) in the goroutine with cancel() + return, which triggers the manager's graceful shutdown path (releasing the leader lease, draining queues, etc.)
  • Updates WaitReady to accept a context.Context so it can exit cleanly when the context is cancelled

Ⅱ. Does this pull request fix one issue?

fixes #2401

Ⅲ. Describe how to verify it

  1. Review the diff — the changes are straightforward and localized to main.go and pkg/webhook/server.go.
  2. Verify the os.Exit(1) calls in the goroutine are replaced with cancel() + return, while os.Exit(1) calls in the main goroutine (pre-manager startup) are intentionally left unchanged since those run before leader election.
  3. Verify WaitReady now checks ctx.Done() at the top of each loop iteration.
  4. To test the graceful shutdown path: cause controller.SetupWithManager to fail and confirm the manager exits cleanly and the leader lease is released immediately (rather than held for the full TTL).

Ⅳ. Special notes for reviews

  • The os.Exit(1) calls outside the goroutine (lines 179–247) are intentionally left as-is. Those run in the main goroutine before mgr.Start(), so leader election hasn't started yet and there's no lease to leak.
  • WaitReady's signature changed from WaitReady() to WaitReady(ctx context.Context) — the only caller is in main.go, which is updated in the same commit.

Copilot AI review requested due to automatic review settings April 8, 2026 21:48
@kruise-bot kruise-bot requested review from furykerry and zmberg April 8, 2026 21:48
@kruise-bot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign fei-guo for approval by writing /assign @fei-guo in a comment. For more information see: The Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


Copilot AI left a comment


Pull request overview

This PR fixes two critical issues in the startup path that prevent graceful shutdown of the manager:

  1. os.Exit(1) in goroutine leaks leader election lease: When webhook readiness check or controller setup fails, the goroutine was calling os.Exit(1) directly, which bypasses all deferred cleanup including the manager's graceful shutdown. This prevented the leader election lease from being released, blocking other replicas from taking over leadership until the TTL expired (default 15 seconds).

  2. WaitReady() infinite loop with no timeout: The webhook readiness check function was an infinite loop with no context cancellation support. If the health checker never passed, the goroutine would spin forever, and the manager would keep renewing the leader lease while doing nothing useful.

Changes:

  • Wraps the signal handler context with context.WithCancel() to enable graceful cancellation
  • Replaces os.Exit(1) calls in the background goroutine with cancel() + return to trigger manager shutdown
  • Updates WaitReady() to accept a context parameter and respect context cancellation
  • Leaves os.Exit(1) calls in the initialization phase (before manager startup) unchanged as they predate leader election

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

File | Description
main.go | Wraps signal handler with cancellable context, updates WaitReady call to pass context, replaces os.Exit calls in goroutine with graceful cancellation
pkg/webhook/server.go | Updates WaitReady signature to accept context, adds context cancellation check in the polling loop


@codecov

codecov bot commented Apr 8, 2026

Codecov Report

❌ Patch coverage is 80.00000% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 49.00%. Comparing base (749e8f2) to head (b58ba06).

Files with missing lines | Patch % | Lines
pkg/webhook/server.go | 80.00% | 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #2402      +/-   ##
==========================================
+ Coverage   48.77%   49.00%   +0.22%     
==========================================
  Files         324      324              
  Lines       27928    27932       +4     
==========================================
+ Hits        13623    13689      +66     
+ Misses      12775    12705      -70     
- Partials     1530     1538       +8     
Flag | Coverage Δ
unittests | 49.00% <80.00%> (+0.22%) ⬆️


@abhaygoudannavar abhaygoudannavar force-pushed the fix/graceful-shutdown-on-goroutine-failure branch from 25eeb8a to afd2ee7 April 13, 2026 05:42
@kruise-bot kruise-bot added the size/M (30-99) label and removed the size/S (10-29) label Apr 13, 2026

Copilot AI left a comment


Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

Comments suppressed due to low confidence (1)

pkg/webhook/server.go:1

  • Cancellation is only checked before time.Sleep, so shutdown may be delayed by up to 2 seconds per loop iteration. Consider replacing time.Sleep with a ticker + select on ctx.Done() so cancellation is responsive even during the wait interval.

Comment on lines 260 to 274
  go func() {
      setupLog.Info("wait webhook ready")
-     if err = webhook.WaitReady(); err != nil {
+     if err = webhook.WaitReady(ctx); err != nil {
          setupLog.Error(err, "unable to wait webhook ready")
-         os.Exit(1)
+         cancel()
+         return
      }

      setupLog.Info("setup controllers")
      if err = controller.SetupWithManager(mgr); err != nil {
          setupLog.Error(err, "unable to setup controllers")
-         os.Exit(1)
+         cancel()
+         return
      }
  }()

Copilot AI Apr 13, 2026


The goroutine assigns to the outer err variable (if err = ...) which is likely also accessed on the main goroutine path, creating a data race. Use a goroutine-local variable (e.g., if err := ...; err != nil) for both calls so no shared state is mutated across goroutines.

Comment on lines +9 to +20
func TestWaitReadyCancel(t *testing.T) {
    ctx, cancel := context.WithCancel(context.Background())
    cancel() // instantly cancel the context

    err := WaitReady(ctx)
    if err == nil {
        t.Fatalf("expected error, got nil")
    }
    if !strings.Contains(err.Error(), "context cancelled while waiting for webhook ready") {
        t.Fatalf("unexpected error message: %v", err)
    }
}

Copilot AI Apr 13, 2026


This test asserts on a substring of the error message, which makes it brittle to wording changes. Prefer checking the wrapped cause (e.g., errors.Is(err, context.Canceled)) and only assert on message content if absolutely necessary.

Comment on lines +653 to +654
aHourAgo := metav1.NewTime(time.Unix(time.Now().Add(-time.Hour).Unix(), 0))
Clock = testingclock.NewFakeClock(aHourAgo.Time)

Copilot AI Apr 13, 2026


The test mutates the package-level Clock and does not restore it, which can leak state into other tests and cause ordering-dependent failures. Capture the previous clock and defer restoring it at the end of the test.

Suggested change
- aHourAgo := metav1.NewTime(time.Unix(time.Now().Add(-time.Hour).Unix(), 0))
- Clock = testingclock.NewFakeClock(aHourAgo.Time)
+ aHourAgo := metav1.NewTime(time.Unix(time.Now().Add(-time.Hour).Unix(), 0))
+ oldClock := Clock
+ Clock = testingclock.NewFakeClock(aHourAgo.Time)
+ defer func() {
+     Clock = oldClock
+ }()


Copilot AI left a comment


Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

Comments suppressed due to low confidence (1)

pkg/webhook/server.go:1

  • Cancellation responsiveness is delayed by up to 2 seconds because time.Sleep is not interruptible; if ctx is canceled during the sleep, shutdown will wait for the sleep to finish before exiting. Consider replacing time.Sleep with a select that waits on either ctx.Done() or a timer/ticker so cancellation returns promptly.

Comment on lines 260 to 274
  go func() {
      setupLog.Info("wait webhook ready")
-     if err = webhook.WaitReady(); err != nil {
+     if err = webhook.WaitReady(ctx); err != nil {
          setupLog.Error(err, "unable to wait webhook ready")
-         os.Exit(1)
+         cancel()
+         return
      }

      setupLog.Info("setup controllers")
      if err = controller.SetupWithManager(mgr); err != nil {
          setupLog.Error(err, "unable to setup controllers")
-         os.Exit(1)
+         cancel()
+         return
      }
  }()

Copilot AI Apr 13, 2026


err is assigned inside a goroutine (err = ...) which can introduce a data race if err is also accessed from the main goroutine (common pattern in main). Use a goroutine-local variable instead (e.g., if err := ...; err != nil { ... }) for both WaitReady and SetupWithManager to avoid sharing mutable state across goroutines.

for {
    select {
    case <-ctx.Done():
        return fmt.Errorf("context cancelled while waiting for webhook ready: %w", ctx.Err())

Copilot AI Apr 13, 2026


The standard Go spelling in context errors is “canceled” (one “l”). Consider changing the message to “context canceled …” for consistency (and update the corresponding test assertion).

Suggested change
- return fmt.Errorf("context cancelled while waiting for webhook ready: %w", ctx.Err())
+ return fmt.Errorf("context canceled while waiting for webhook ready: %w", ctx.Err())

…ful shutdown

The goroutine in main.go that waits for webhook readiness and sets up
controllers was calling os.Exit(1) on failure, which kills the process
immediately without releasing the leader election lease. Other replicas
cannot acquire leadership until the lease TTL expires (default 15s).
Additionally, WaitReady() was an infinite loop with no context or
timeout, risking a zombie leader state if the webhook health check
never passes.
This commit:
- Wraps the signal handler context with context.WithCancel
- Replaces os.Exit(1) in the goroutine with cancel() + return,
  triggering the manager's graceful shutdown path
- Updates WaitReady to accept a context.Context so it can exit
  when the context is cancelled
Fixes openkruise#2401

Signed-off-by: abhaygoudannavar <abhaysgoudnvr@gmail.com>
@abhaygoudannavar abhaygoudannavar force-pushed the fix/graceful-shutdown-on-goroutine-failure branch from b58ba06 to 195d5e3 April 13, 2026 06:55
@kruise-bot kruise-bot added the size/L (100-499) label and removed the size/M (30-99) label Apr 13, 2026

Labels

size/L size/L: 100-499

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG] os.Exit in goroutine bypasses graceful shutdown, leaking leader election lease

3 participants