
fix: print "job timed out"/"job cancelled" instead of "context cancelled"#3870

Open
mastermanas805 wants to merge 5 commits into
buildkite:main from
mastermanas805:fix/3384-job-timeout-error

Conversation

@mastermanas805
Contributor

@mastermanas805 mastermanas805 commented Apr 29, 2026

Description

Bare context canceled was leaking into user-facing job logs from the Kubernetes runner (kubernetes/runner.go) and the executor's command phase. This adds an internal/job.FormatJobError helper that maps context.DeadlineExceeded → "job timed out" and context.Canceled → "job cancelled", routed through the three call sites that surface this error.

Fixes #3384

Server-side, a UI cancel and a job-level timeout both arrive as context.Canceled (the agent can't distinguish them locally), so Canceled → "job cancelled" is honest; only context.DeadlineExceeded (set by the agent itself) maps to "job timed out".

…led"

When a job is cancelled or times out, the agent surfaced bare Go
context errors ("context canceled", "context deadline exceeded") in
the job log. These come from a couple of places, primarily the
kubernetes runner returning ctx.Err() directly and the executor
emitting %v on a context-wrapped command error.

This adds an internal/job.FormatJobError helper that rescues
context.DeadlineExceeded as "job timed out" and context.Canceled as
"job cancelled", and routes the three leakiest log sites through it:

  - agent/run_job.go     "Error running job: ..."
  - internal/job/executor.go  "user command error: ..."
  - internal/job/executor.go  CommandPhase default branch

The agent cannot reliably distinguish a server-driven cancel (UI
button) from a server-driven job timeout -- both transition the job
to the same canceling state and arrive as context.Canceled. The
helper deliberately uses the umbrella term "job cancelled" rather
than claiming a timeout it can't confirm. Only context.DeadlineExceeded
(set explicitly within the agent, e.g. WithGracePeriod) maps to
"job timed out".

Fixes buildkite#3384.
@mastermanas805 mastermanas805 marked this pull request as ready for review April 29, 2026 11:46
@mastermanas805 mastermanas805 requested review from a team as code owners April 29, 2026 11:46
@moskyb
Contributor

moskyb commented May 5, 2026

@mastermanas805 can you go into a bit more detail about how/where you're seeing context cancelled/deadline exceeded errors in the logs? when i cancel jobs i get the standard "terminated" message:

[Screenshot: job log showing the standard "terminated" message]

i agree that better error messages are a very good thing, so if you can provide some steps to reproduce the errors you're seeing, i'd really appreciate it.

@mastermanas805
Contributor Author

mastermanas805 commented May 5, 2026

You're seeing "terminated" because the non-k8s path works fine — the leak only shows up under the kubernetes runner, which is why this has been hard to pin down historically.

Added kubernetes/repro_3384_test.go (552bd2a) that spins up a real kubernetes.Runner, cancels its context, and prints the resulting job log line.

On main:

user-visible job log line on main: "Error running job: context canceled"

On this branch:

POST-FIX job log line: "Error running job: job cancelled"

Only the agent/run_job.go call site differs.

The two paths diverge because of how process.Run is implemented. *process.Process.Run (regular agent) blocks on command.Wait(), which returns *exec.ExitError with signal info — so on cancel you land in isExitError && isExitSignaled at executor.go:1135 and get the "interrupted by a signal: terminated" line from your screenshot. *kubernetes.Runner.Run is a different beast — it's a socket server, not a wrapped subprocess, so when its context cancels it just returns ctx.Err() raw (kubernetes/runner.go:122-123). That used to flow straight through agent/run_job.go:333 into the log. The existing test at kubernetes_test.go:292 even documents this with an errors.Is(err, context.Canceled) tolerance.

To reproduce live: run anything via agent-stack-k8s (or --kubernetes-exec) and cancel from the UI. Job-level timeouts come through the same way server-side.

The two executor.go changes are defensive — the executor (bootstrap) runs inside the user's container in k8s and as a child process otherwise, so the same wrapping covers any ctx error that slips into the default branch of the command-phase switch, e.g. an early setup failure while the context is already cancelled. I don't have a clean repro for those, so happy to drop them and keep the fix to just the k8s call site if you'd rather.

Drives a real kubernetes.Runner, cancels its context, and verifies both
that runner.Run returns context.Canceled bare and that
job.FormatJobError translates it to "job cancelled" at the
agent/run_job.go call site.

Locks in the user-visible message so that buildkite#3384 cannot regress, and
documents the leak path that pre-dates this PR for any reviewer who
can't reproduce against the non-k8s flow.
@mastermanas805 mastermanas805 force-pushed the fix/3384-job-timeout-error branch from 6450f8a to 552bd2a Compare May 5, 2026 06:00
@mastermanas805
Contributor Author

@moskyb

@moskyb
Contributor

moskyb commented May 5, 2026

heya, the test above was performed on a live instance of the agent-stack-k8s. if you could produce evidence of these errors making it to logs in live agent instances — not in tests where you're manually cancelling the context — i would really appreciate it. context cancelled/deadline exceeded errors will only make it into job logs if the agent's context gets cancelled, which is a code path that i don't think actually gets exercised in production workloads.


Development

Successfully merging this pull request may close these issues.

Print better error message when job times out