Log premature-terminal and missing-state job failures (observability for #1337)#1343
Conversation
✅ Deploy Preview for antenna-preview canceled.
|
📝 WalkthroughWalkthroughAdds a new helper ChangesMissing Redis State & Premature Terminal Diagnostics
Estimated code review effort🎯 2 (Simple) | ⏱️ ~10 minutes Possibly related issues
Possibly related PRs
Suggested labels
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
a8bce8d to
aaa5365
Compare
…gnosis Observation-only follow-up to #1338/#1342. Now that terminal status transitions are irreversible, surface the two cases where a terminal verdict may have been wrong, instead of letting them disappear silently: 1. When work completes for a job the guard finds already terminal/CANCELING, log a warning. Often legitimate (cancel/reaper won the race) but, if frequent, the signal of a premature terminal verdict. 2. When a result is failed because the job's Redis state is missing, log the job age/status/dispatch first. A small age points to a not-yet-seeded or redelivered-run_job race rather than genuine cleanup. No behaviour change — both warnings sit on existing code paths. Lets us confirm the trigger before adding grace/idempotency logic (see PR body follow-up). Refs #1337, #1219, #1324. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…y terminal
The missing-state diagnostic logged a WARNING saying 'Failing job' for every
in-flight result that arrived after a job finished — but _fail_job no-ops on a
terminal job, so after a cancel (which deletes the Redis state) this fired once
per in-flight batch and misdescribed normal cleanup as a failure. Now: a
terminal job logs at info ('ignoring in-flight result for already-terminal
job'); only a NON-terminal job with missing state logs the warning, which is the
case actually worth investigating.
Refs #1337.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
42d127b to
0f52c17
Compare
✅ Deploy Preview for antenna-ssec canceled.
|
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@ami/jobs/tasks.py`:
- Around line 461-487: The condition checking `if row["status"] in
JobState.final_states()` at line 461 does not include the CANCELING state, but
_fail_job treats CANCELING as terminal/no-op. This causes cancel-in-flight
cleanup to incorrectly trigger the non-terminal warning log with misleading
"Failing job" message. Modify the condition to also treat CANCELING as terminal,
either by adding CANCELING to the final_states check or by explicitly including
it in the condition, so that expected cancel races are properly classified as
expected cleanup rather than anomalies.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
…eaner The missing-state and completed-after-terminal logs read like insider notes — ticket numbers and race-theory in the runtime message. Move the rationale and the issue reference into code comments and make the log lines plain operational statements an operator can act on without chasing a ticket. Also drop the redundant dispatch_mode field and the extra status re-query. Refs #1337. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
Adds targeted logging to improve observability around two “job reached a terminal state unexpectedly” scenarios in the async results processing pipeline, without changing the underlying guard/fail behavior. This helps distinguish legitimate terminal transitions (cancel/reaper) from premature/incorrect ones that would otherwise be silent.
Changes:
- Emit missing-Redis-state context logs before ack+fail in both
stage="process"andstage="results"missing-state branches. - Emit a warning when
_update_job_progressdetects completion but the guarded terminal transition does not apply (job already terminal/CANCELING).
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
- Treat CANCELING as terminal-like in the missing-state classification so a cancel-in-flight result logs the benign info line instead of the misleading 'still running / marking it failed' warning (matches _fail_job's no-op set). Caught by CodeRabbit and Copilot. - Rename the values() dict from 'row' to 'job_values' (per review). - Log the completed-after-terminal case via job.logger and include the stage and attempted terminal state, without an extra status re-query (per Copilot). Refs #1337. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Summary
This is an observability-only change (no behaviour change) that makes two classes of "the job ended up in a terminal state" event visible in the logs, so we can tell legitimate terminal verdicts from premature ones before adding any corrective logic.
It follows #1338 (now merged), which made the result handler's terminal status transition atomic and non-regressing (the companion #1342 extends the same guard to the cancel and signal-handler writers). A side effect of that hardening is that a terminal verdict is now irreversible: a late-arriving completion can no longer pull a job back out of REVOKED/FAILURE. That is correct when the terminal verdict was right (a user cancel, a real crash), but it cements the verdict when it was wrong (for example, the stale-job reaper revoked a slow-but-alive job and its results then landed, or a result arrived while the job's Redis state was momentarily absent). These logs surface those cases instead of letting them disappear silently.
List of Changes
_update_job_progress(ami/jobs/tasks.py), when the guarded terminal transition does not fire because the job is already terminal/CANCELING, emit a warning via the per-job log, naming the stage and the terminal state that was not applied. This is often legitimate (a cancel or the reaper genuinely won the race), but a frequent occurrence is the signal of a premature terminal verdict. Observation only — the guard behaviour is unchanged._fail_job). That single condition conflates three very different situations: state genuinely cleaned up (end of life), state never seeded yet (startup race), and state wiped by a duplicate/redeliveredrun_jobre-runninginitialize_job. A new_log_missing_state_contexthelper records the job's age and status at both missing-state branches and splits the log by job state: a terminal/CANCELING job logs an info line (a late result after the job already finished — benign, e.g. cancel cleanup, which_fail_jobno-ops on anyway), while a still-running job with missing state logs a warning (the case worth investigating). Behaviour unchanged — the job is still failed where it was before; we just log why first, at the right severity.The log messages are plain operational statements (no ticket numbers or internal jargon in the runtime strings); the rationale and issue reference live in code comments.
Why observation before correction
We have a report of Redis state appearing "missing for a moment at the beginning" of jobs, and the result handler currently fails a job on the first missing-state read with no second chance. Before adding grace/retry logic we want to confirm the actual trigger (a small
agein the new log, on a still-running job, would point to a not-yet-seeded or redispatch race rather than genuine cleanup). Instrument, confirm, then fix.Follow-up (NOT in this PR — proposed)
Once the logs confirm the trigger, the corrective changes to make are:
STARTED/ recently dispatched, do not ack-and-fail; re-raise so NATS redelivers and the brief gap self-heals, and only fail after a grace window. Today it fails immediately on the first read.initialize_jobnon-clobbering / idempotent. It currentlydeletes the pending sets before re-adding, so a secondrun_job(a re-run, or anacks_lateredelivery) wipes a job's live in-flight state. Refuse or no-op re-initialization of a job that already has pending state unless it's an explicit reset. This is a prerequisite for Improve celery task dispatch and cancellation to prevent stuck jobs #1324'sacks_lateredelivery, which can re-triggerrun_job.How to Test the Changes
ami/jobs/tests/test_tasks.pyandami/jobs/tests/test_jobs.pypass locally (no behaviour change).Checklist
Builds on #1338 (merged). Sibling fast-follow of #1342 (not a dependency). Refs #1337, #1219, #1324.
Summary by CodeRabbit