fix(executor): bound consecutive Plugin.Handle errors with a system-failure counter #7289
Merged
Conversation
When Plugin.Handle returns a Go error (no phase transition), the TaskAction reconciler logs "Plugin Handle failed", emits an event, and requeues with the default duration. There was no system-failure attempt counter on this path, so a deterministic error (e.g. an admission webhook denying pod creation because a referenced secret cannot be found in any configured secret manager) would loop forever, leaving the TaskAction stuck in Queued.

Mirror v1 propeller's MaxSystemFailures behaviour:

- Add Status.SystemFailures (uint32) to TaskAction.
- Increment it on every Plugin.Handle Go-error return; persist via Status update.
- Reset to 0 on a successful Handle so transient errors don't accumulate across the lifetime of a long-running task.
- Once the count exceeds DefaultMaxSystemFailures (30), synthesize a PhasePermanentFailure with code "MaxSystemFailuresExceeded" so the TaskAction terminates cleanly with the underlying error message preserved.
- The threshold is overridable per reconciler via TaskActionReconciler.MaxSystemFailures.

This keeps the system tolerant of genuinely transient k8s API hiccups while ensuring deterministic failures eventually surface to the user.

Signed-off-by: Kevin Su <pingsutw@apache.org>
AdilFayyaz reviewed Apr 27, 2026
    "Plugin %q system error: %v", pluginID, handleErr)
    }
    // …
    taskAction.Status.SystemFailures++
Contributor
Are we resetting this back to 0? Suppose we hit 2 system errors and then a success, shouldn't this be reset?
Member
Author
I'm not sure, actually. We don't reset it in v1.
Contributor
Wouldn't it be an issue for long-running tasks that may have transient failures followed by successes in between? The task could silently die after accumulating nonconsecutive MaxSystemFailures.
Member
Author
That makes sense, let me update it.
AdilFayyaz approved these changes Apr 27, 2026
Why are the changes needed?
When the k8s API rejects pod creation via an admission webhook (e.g. `flyte-pod-webhook` denies the request because a referenced secret cannot be found in any configured secret manager), `launchResource` in `executor/pkg/plugin/k8s/plugin_manager.go` returned the error as a raw Go error with `pluginsCore.UnknownTransition` and no phase transition.

In `taskaction_controller.go`, when `Plugin.Handle` returns a Go error, the controller only logs `"Plugin Handle failed"`, emits a `FailedPluginHandle` event, and requeues with the default duration. It does not record a phase, increment any system-failure counter, or convert to `PermanentFailure`. As a result, the TaskAction stays in `Queued` forever: the webhook will keep denying the same request, and there is no max-system-failure handling on this path.

Repro:
Before the fix, the TaskAction loops forever with the webhook denial message in the controller log:
What changes were proposed in this pull request?
In `launchResource`, before the generic system-error fall-through:

- Add a check `isAdmissionWebhookDenial(err)` that matches the canonical error string (`"admission webhook"` + `"denied the request"`).
- Also treat `k8serrors.IsInvalid(err)` as permanent (invalid pod specs are equally non-transient).
- Return `pluginsCore.PhaseInfoFailure("AdmissionDenied", err.Error(), nil)`, a permanent failure phase, so the TaskAction terminates immediately with the underlying webhook message preserved for the user.

How was this patch tested?
Manual end-to-end test using the bundled devbox:

- Built `flyte-devbox:latest` with this change (`make -C docker/devbox-bundled build`) and restarted the devbox.
- Submitted a task referencing a secret not present in any configured secret manager (`secrets=["test1"]`).
- The TaskAction now reaches `PermanentFailure` in ~8s (single attempt) instead of getting stuck; the Error State preserves the webhook denial message.