Skip to content

fix(executor): bound consecutive Plugin.Handle errors with a system-failure counter#7289

Merged
pingsutw merged 3 commits into
v2from
fix/admission-denied-permanent-failure
Apr 27, 2026
Merged

fix(executor): bound consecutive Plugin.Handle errors with a system-failure counter#7289
pingsutw merged 3 commits into
v2from
fix/admission-denied-permanent-failure

Conversation

@pingsutw
Copy link
Copy Markdown
Member

@pingsutw pingsutw commented Apr 27, 2026

Why are the changes needed?

When the k8s API rejects pod creation via an admission webhook (e.g. flyte-pod-webhook denies the request because a referenced secret cannot be found in any configured secret manager), launchResource in executor/pkg/plugin/k8s/plugin_manager.go returned the error as a raw Go error with pluginsCore.UnknownTransition and no phase transition.

In taskaction_controller.go, when Plugin.Handle returns a Go error, the controller only logs "Plugin Handle failed", emits a FailedPluginHandle event, and requeues with the default duration. It does not record a phase, increment any system-failure counter, or convert to PermanentFailure. As a result, the TaskAction stays in Queued forever — the webhook will keep denying the same request, and there is no max-system-failure handling on this path.

Repro:

@env.task
def fn(x: int) -> int:
    print("my secret is: ", os.getenv("TEST1"))   # secret "test1" does not exist
    ...

Before the fix, the TaskAction loops forever with the webhook denial message in the controller log:

admission webhook "flyte-pod-webhook.flyte.org" denied the request:
  none of the secret managers injected secret [key:"test1" ...]

What changes were proposed in this pull request?

In launchResource, before the generic system-error fall-through:

  1. Detect admission-webhook denials via a small helper isAdmissionWebhookDenial(err) that matches the canonical error string ("admission webhook" + "denied the request").
  2. Also detect k8serrors.IsInvalid(err) (invalid pod specs are equally non-transient).
  3. Return pluginsCore.PhaseInfoFailure("AdmissionDenied", err.Error(), nil) — a permanent failure phase — so the TaskAction terminates immediately with the underlying webhook message preserved for the user.

How was this patch tested?

Screenshot 2026-04-27 at 11 56 45 AM

Manual end-to-end test using the bundled devbox:

  1. Rebuilt flyte-devbox:latest with this change (make -C docker/devbox-bundled build) and restarted the devbox.
  2. Ran a task that references a non-existent secret (secrets=["test1"]).
  3. Confirmed the TaskAction transitions to PermanentFailure in ~8s (single attempt) instead of getting stuck:
NAMESPACE                 NAME                      RUN                    TASKTYPE   STATUS             AGE
flytesnacks-development   rf5v6h9d79n2ndnnqkpr-a0   rf5v6h9d79n2ndnnqkpr   python     PermanentFailure   8s

Error State:

Code:    AdmissionDenied
Kind:    USER
Message: admission webhook "flyte-pod-webhook.flyte.org" denied the request: ...

Labels

  • fixed

Check all the applicable boxes

  • I updated the documentation accordingly.
  • All new and existing tests passed.
  • All commits are signed-off.

@pingsutw pingsutw force-pushed the fix/admission-denied-permanent-failure branch from 0b45c96 to 4326c7a Compare April 27, 2026 06:47
@pingsutw pingsutw changed the title fix(executor): treat admission webhook denials as permanent task failures fix(executor): bound consecutive Plugin.Handle errors with a system-failure counter Apr 27, 2026
@pingsutw pingsutw force-pushed the fix/admission-denied-permanent-failure branch 9 times, most recently from 9089e3b to 7fc0c2a Compare April 27, 2026 07:40
…ailure counter

When Plugin.Handle returns a Go error (no phase transition), the TaskAction
reconciler logs "Plugin Handle failed", emits an event, and requeues with
the default duration. There was no system-failure attempt counter on this
path, so a deterministic error — e.g. an admission webhook denying pod
creation because a referenced secret cannot be found in any configured
secret manager — would loop forever, leaving the TaskAction stuck in Queued.

Mirror v1 propeller's MaxSystemFailures behaviour:

- Add Status.SystemFailures (uint32) to TaskAction.
- Increment it on every Plugin.Handle Go-error return, persist via Status update.
- Reset to 0 on a successful Handle so transient errors don't accumulate
  across the lifetime of a long-running task.
- Once the count exceeds DefaultMaxSystemFailures (30), synthesize a
  PhasePermanentFailure with code "MaxSystemFailuresExceeded" so the
  TaskAction terminates cleanly with the underlying error message preserved.
- Threshold is overridable per-reconciler via TaskActionReconciler.MaxSystemFailures.

This keeps the system tolerant of genuinely transient k8s API hiccups
while ensuring deterministic failures eventually surface to the user.

Signed-off-by: Kevin Su <pingsutw@apache.org>
@pingsutw pingsutw force-pushed the fix/admission-denied-permanent-failure branch from 7fc0c2a to 908efd8 Compare April 27, 2026 07:43
Signed-off-by: Kevin Su <pingsutw@apache.org>
Comment thread executor/pkg/controller/taskaction_controller.go
"Plugin %q system error: %v", pluginID, handleErr)
}

taskAction.Status.SystemFailures++
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are we resetting this back to 0? Suppose we hit 2 system errors and then a success, shouldn't this be reset?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure actually. we don't reset it in v1.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wouldnt it be an issue for long running tasks that may have transient failures followed by successes in between. Task could silently die after accumulating nonconsecutive maxSystemFailures

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that makes sense, let me update it

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated it

Signed-off-by: Kevin Su <pingsutw@apache.org>
@pingsutw pingsutw merged commit 36edb1e into v2 Apr 27, 2026
19 checks passed
@pingsutw pingsutw deleted the fix/admission-denied-permanent-failure branch April 27, 2026 20:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants