fix: print "job timed out"/"job cancelled" instead of "context cancelled" #3870
mastermanas805 wants to merge 5 commits into
When a job is cancelled or times out, the agent surfaced bare Go
context errors ("context canceled", "context deadline exceeded") in
the job log. These come from a couple of places, primarily the
kubernetes runner returning ctx.Err() directly and the executor
emitting %v on a context-wrapped command error.
This adds an internal/job.FormatJobError helper that maps
context.DeadlineExceeded to "job timed out" and context.Canceled to
"job cancelled", and routes the three leakiest log sites through it:
- agent/run_job.go "Error running job: ..."
- internal/job/executor.go "user command error: ..."
- internal/job/executor.go CommandPhase default branch
The agent cannot reliably distinguish a server-driven cancel (UI
button) from a server-driven job timeout -- both transition the job
to the same canceling state and arrive as a context.Canceled. The
helper deliberately uses the umbrella term "job cancelled" rather
than claiming a timeout it can't confirm. Only context.DeadlineExceeded
(set explicitly within the agent, e.g. WithGracePeriod) maps to
"job timed out".
Fixes buildkite#3384.

@mastermanas805 can you go into a bit more detail about how/where you're seeing
i agree that better error messages are a very good thing, so if you can provide some steps to reproduce the errors you're seeing, i'd really appreciate it.

You're seeing "terminated" because the non-k8s path works fine; the leak only shows up under the kubernetes runner, which is why this has been hard to pin down historically.
Added
On main:
On this branch:
Only the
The two paths diverge because of how
To reproduce live: run anything via agent-stack-k8s (or
The two
Drives a real kubernetes.Runner, cancels its context, and verifies both that runner.Run returns context.Canceled bare and that job.FormatJobError translates it to "job cancelled" at the agent/run_job.go call site. Locks in the user-visible message so that buildkite#3384 cannot regress, and documents the leak path that pre-dates this PR for any reviewer who can't reproduce against the non-k8s flow.
6450f8a to 552bd2a

heya, the test above was performed on a live instance of the agent-stack-k8s. if you could produce evidence of these errors making it to logs in live agent instances (not in tests where you're manually cancelling the context) i would really appreciate it. context cancelled/deadline timeout errors will only make it into job logs if the agent's context gets cancelled, which is a code path that i don't think actually gets exercised in production workloads.

Description
Bare "context canceled" was leaking to user-facing job logs from the Kubernetes runner (kubernetes/runner.go) and the executor's command phase. Adds internal/job.FormatJobError: context.DeadlineExceeded → "job timed out", context.Canceled → "job cancelled". Routed through the three sites that surface this error.

Fixes #3384

Server-side, a UI cancel and job-level timeout both arrive as context.Canceled (the agent can't distinguish locally), so Canceled → "cancelled" is honest; only DeadlineExceeded (set by the agent itself) → "timed out".