Log premature-terminal and missing-state job failures (observability for #1337) by mihow · Pull Request #1343 · RolnickLab/antenna

mihow · 2026-06-19T06:59:29Z

Summary

This is an observability-only change (no behaviour change) that makes two classes of "the job ended up in a terminal state" event visible in the logs, so we can tell legitimate terminal verdicts from premature ones before adding any corrective logic.

It follows #1338 (now merged), which made the result handler's terminal status transition atomic and non-regressing (the companion #1342 extends the same guard to the cancel and signal-handler writers). A side effect of that hardening is that a terminal verdict is now irreversible: a late-arriving completion can no longer pull a job back out of REVOKED/FAILURE. That is correct when the terminal verdict was right (a user cancel, a real crash), but it cements the verdict when it was wrong (for example, the stale-job reaper revoked a slow-but-alive job and its results then landed, or a result arrived while the job's Redis state was momentarily absent). These logs surface those cases instead of letting them disappear silently.

List of Changes

Log when work completes for a job that is already terminal. In _update_job_progress (ami/jobs/tasks.py), when the guarded terminal transition does not fire because the job is already terminal/CANCELING, emit a warning via the per-job log, naming the stage and the terminal state that was not applied. This is often legitimate (a cancel or the reaper genuinely won the race), but if it occurs frequently it signals a premature terminal verdict. Observation only: the guard behaviour is unchanged.
Log context when a result arrives for missing Redis state. The result handler treats a missing total-images key as fatal (ack + _fail_job). That single condition conflates three very different situations: state genuinely cleaned up (end of life), state never seeded yet (startup race), and state wiped by a duplicate/redelivered run_job re-running initialize_job. A new _log_missing_state_context helper records the job's age and status at both missing-state branches and splits the log by job state: a terminal/CANCELING job logs an info line (a late result after the job already finished — benign, e.g. cancel cleanup, which _fail_job no-ops on anyway), while a still-running job with missing state logs a warning (the case worth investigating). Behaviour unchanged — the job is still failed where it was before; we just log why first, at the right severity.
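
The severity split described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code from ami/jobs/tasks.py: the `JobState` members, the `TERMINAL_LIKE` set, the function names, and the message wording are all hypothetical.

```python
import logging
from datetime import datetime, timedelta, timezone
from enum import Enum


class JobState(str, Enum):
    # Hypothetical subset of job states, for illustration only.
    STARTED = "STARTED"
    CANCELING = "CANCELING"
    SUCCESS = "SUCCESS"
    FAILURE = "FAILURE"
    REVOKED = "REVOKED"


# States in which a late result is expected cleanup rather than an anomaly.
TERMINAL_LIKE = {JobState.SUCCESS, JobState.FAILURE, JobState.REVOKED, JobState.CANCELING}


def missing_state_log_level(status: JobState) -> int:
    """Return INFO for terminal-like jobs and WARNING for still-running ones."""
    return logging.INFO if status in TERMINAL_LIKE else logging.WARNING


def describe_missing_state(status: JobState, created_at: datetime, now: datetime, stage: str):
    """Build the (level, message) pair to log before the existing ack/fail path runs."""
    age_seconds = (now - created_at).total_seconds()
    level = missing_state_log_level(status)
    message = f"Redis state missing at stage={stage}: status={status.value}, age={age_seconds:.0f}s"
    return level, message


# A still-running job that is only 5 seconds old: the case worth investigating.
now = datetime(2026, 6, 19, 12, 0, 0, tzinfo=timezone.utc)
level, message = describe_missing_state(JobState.STARTED, now - timedelta(seconds=5), now, "process")
logging.getLogger("jobs").log(level, message)
```

Keeping the level decision in a small pure function makes it easy to test without a database or Redis, which suits a change that must not alter control flow.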

The log messages are plain operational statements (no ticket numbers or internal jargon in the runtime strings); the rationale and issue reference live in code comments.

Why observation before correction

We have a report of Redis state appearing "missing for a moment at the beginning" of jobs, and the result handler currently fails a job on the first missing-state read with no second chance. Before adding grace/retry logic we want to confirm the actual trigger (a small age in the new log, on a still-running job, would point to a not-yet-seeded or redispatch race rather than genuine cleanup). Instrument, confirm, then fix.

Follow-up (NOT in this PR — proposed)

Once the logs confirm the trigger, the corrective changes to make are:

Grace on missing-state in the result handler. If state is missing but the job is young / not yet STARTED / recently dispatched, do not ack-and-fail; re-raise so NATS redelivers and the brief gap self-heals, and only fail after a grace window. Today it fails immediately on the first read.
Make initialize_job non-clobbering / idempotent. It currently deletes the pending sets before re-adding, so a second run_job (a re-run, or an acks_late redelivery) wipes a job's live in-flight state. Refuse or no-op re-initialization of a job that already has pending state unless it's an explicit reset. This is a prerequisite for the acks_late redelivery in #1324 (Improve celery task dispatch and cancellation to prevent stuck jobs), which can re-trigger run_job.

How to Test the Changes

Full ami/jobs/tests/test_tasks.py and ami/jobs/tests/test_jobs.py pass locally (no behaviour change).
The new log lines sit on the existing missing-state and already-terminal code paths; no new branches in control flow.
Verified live on a dev deployment: cancelling a job mid-process produces the expected info line ("result arrived after the job already finished, status=REVOKED ... ignoring") for each in-flight result, rather than a misleading failure warning.

Checklist

I have tested these changes appropriately.
I have added and/or modified relevant tests.
I updated relevant documentation or comments.
I have verified that this PR follows the project's coding standards.
Any dependent changes have already been merged to main.

Builds on #1338 (merged). Sibling fast-follow of #1342 (not a dependency). Refs #1337, #1219, #1324.

Summary by CodeRabbit

Chores
- Enhanced diagnostic logging for job processing to improve visibility into edge cases, including better detection of missing state conditions and unexpected job state transitions for troubleshooting purposes.

netlify · 2026-06-19T06:59:36Z

✅ Deploy Preview for antenna-preview canceled.

Name	Link
🔨 Latest commit	`7b848cb`
🔍 Latest deploy log	https://app.netlify.com/projects/antenna-preview/deploys/6a35b8ea18f8ce0008a56815

coderabbitai · 2026-06-19T06:59:38Z

Walkthrough

Adds a new helper _log_missing_state_context(job_id, stage) in ami/jobs/tasks.py that fetches job metadata and logs an info or warning message depending on whether the job is already terminal or not. This helper is called at both the "process" and "results" missing-state paths in process_nats_pipeline_result. A warning log is also added in _update_job_progress when became_complete=True but the guarded DB transition is skipped because the job is already terminal.

Changes

Missing Redis State & Premature Terminal Diagnostics

Layer / File(s)	Summary
`_log_missing_state_context` helper and call sites `ami/jobs/tasks.py`	Adds `_log_missing_state_context(job_id, stage)` (lines 439–488) that queries job status, dispatch mode, and age, then logs info if the job is already terminal or a warning if it is non-terminal with missing Redis state. Wires this call into both the "process"-stage (line 285) and "results"-stage (line 368) missing-state branches of `process_nats_pipeline_result`, before the existing ACK/fail path.
Warning log for skipped completion transition in `_update_job_progress` `ami/jobs/tasks.py`	Adds an `else` branch (lines 708–722) for the `became_complete=True` / zero-row-updated case, fetching the current job status and logging a warning that completion was not applied because the job was already terminal or CANCELING.
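
The zero-row-updated case summarised in the table can be illustrated with a small in-memory sketch. It assumes a guard that reports how many rows it updated (in Django, a filtered `QuerySet.update()` returns the number of rows matched); the `Job` dataclass, `BLOCKING_STATES`, `guarded_transition`, and `finish_stage` are hypothetical stand-ins, not the project's code.

```python
import logging
from dataclasses import dataclass

logger = logging.getLogger("jobs")

# Hypothetical state names. CANCELING blocks the transition like a terminal state.
BLOCKING_STATES = {"SUCCESS", "FAILURE", "REVOKED", "CANCELING"}


@dataclass
class Job:
    id: int
    status: str


def guarded_transition(job: Job, new_status: str) -> int:
    """Apply new_status only if the job is not already terminal-like.

    Returns the number of rows updated (0 or 1), like a filtered UPDATE would.
    """
    if job.status in BLOCKING_STATES:
        return 0
    job.status = new_status
    return 1


def finish_stage(job: Job, stage: str, became_complete: bool, new_status: str = "SUCCESS") -> bool:
    """Try to apply the terminal status; log a warning when the guard skips it."""
    if not became_complete:
        return False
    if guarded_transition(job, new_status):
        return True
    # Zero rows updated: completion arrived after the job was already terminal.
    logger.warning(
        "Job %s: stage %s completed but status %s was not applied (job is already %s)",
        job.id, stage, new_status, job.status,
    )
    return False


revoked = Job(id=1, status="REVOKED")
running = Job(id=2, status="STARTED")
applied_revoked = finish_stage(revoked, "results", became_complete=True)
applied_running = finish_stage(running, "results", became_complete=True)
```

The warning sits only on the branch where the guard declined to write, so the happy path and the guard itself behave exactly as before.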

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Possibly related issues

Review how the state and progress of jobs are tracked #1285: This PR directly implements forensic logging for missing Redis progress state and premature terminal verdicts across process_nats_pipeline_result and _update_job_progress, which aligns with the framework proposed in #1285 for triangulating job state across DB, Redis, and NATS.
Review and simplify job logs #1236: Both touch logging behavior in process_nats_pipeline_result; this PR adds structured diagnostic logging at the same missing-state ACK/fail path discussed in #1236.

Possibly related PRs

RolnickLab/antenna#1234: Directly modifies the same process_nats_pipeline_result and _update_job_progress control flow paths — including the missing-state ACK/fail path — that this PR now augments with diagnostic logging.

Suggested labels

PSv2

Poem

🐇 Hoppity-hop through the pipeline I go,
When Redis goes missing, I now let you know!
A warning for non-terminals, info for the done,
No silent failures — diagnostics are fun!
The job's age and status, all logged with great care,
So no sneaky transitions slip by unaware. 🌟

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 60.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title accurately summarizes the main change: adding diagnostic logging for premature-terminal and missing-state job failures. It is clear, concise, and specific about the observability improvements being made.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Description check	✅ Passed	The PR description provides comprehensive coverage of all required template sections including summary, list of changes, detailed description with rationale, testing instructions, and a completed checklist.

…gnosis

Observation-only follow-up to #1338/#1342. Now that terminal status transitions are irreversible, surface the two cases where a terminal verdict may have been wrong, instead of letting them disappear silently:

1. When work completes for a job the guard finds already terminal/CANCELING, log a warning. Often legitimate (cancel/reaper won the race) but, if frequent, the signal of a premature terminal verdict.
2. When a result is failed because the job's Redis state is missing, log the job age/status/dispatch first. A small age points to a not-yet-seeded or redelivered-run_job race rather than genuine cleanup.

No behaviour change: both warnings sit on existing code paths. Lets us confirm the trigger before adding grace/idempotency logic (see PR body follow-up).

Refs #1337, #1219, #1324.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…y terminal

The missing-state diagnostic logged a WARNING saying 'Failing job' for every in-flight result that arrived after a job finished, but _fail_job no-ops on a terminal job, so after a cancel (which deletes the Redis state) this fired once per in-flight batch and misdescribed normal cleanup as a failure.

Now: a terminal job logs at info ('ignoring in-flight result for already-terminal job'); only a NON-terminal job with missing state logs the warning, which is the case actually worth investigating.

Refs #1337.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

netlify · 2026-06-19T21:21:16Z

✅ Deploy Preview for antenna-ssec canceled.

Name	Link
🔨 Latest commit	`7b848cb`
🔍 Latest deploy log	https://app.netlify.com/projects/antenna-ssec/deploys/6a35b8ea46c9560008d9737a

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@ami/jobs/tasks.py`:
- Around line 461-487: The condition checking `if row["status"] in
JobState.final_states()` at line 461 does not include the CANCELING state, but
_fail_job treats CANCELING as terminal/no-op. This causes cancel-in-flight
cleanup to incorrectly trigger the non-terminal warning log with misleading
"Failing job" message. Modify the condition to also treat CANCELING as terminal,
either by adding CANCELING to the final_states check or by explicitly including
it in the condition, so that expected cancel races are properly classified as
expected cleanup rather than anomalies.
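
The finding above can be illustrated with a before/after predicate. This is a sketch only: `JobState.final_states()` is named in the review comment, but its membership here and the two helper functions are assumptions made for the example.

```python
from enum import Enum


class JobState(str, Enum):
    # Hypothetical subset of states; the real enum lives in the project.
    STARTED = "STARTED"
    CANCELING = "CANCELING"
    SUCCESS = "SUCCESS"
    FAILURE = "FAILURE"
    REVOKED = "REVOKED"

    @classmethod
    def final_states(cls) -> set:
        # Assumed membership: CANCELING is not a final state.
        return {cls.SUCCESS, cls.FAILURE, cls.REVOKED}


def is_benign_before(status: JobState) -> bool:
    """Condition as flagged by the review: CANCELING falls through to the warning."""
    return status in JobState.final_states()


def is_benign_after(status: JobState) -> bool:
    """Suggested condition: treat CANCELING as terminal-like, matching the no-op set."""
    return status in JobState.final_states() or status == JobState.CANCELING


cancel_in_flight = JobState.CANCELING
before = is_benign_before(cancel_in_flight)
after = is_benign_after(cancel_in_flight)
```

Only the CANCELING case changes classification; a STARTED job with missing state still takes the warning path under both versions.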

📥 Commits

Reviewing files that changed from the base of the PR and between df04cf5 and 0f52c17.

📒 Files selected for processing (1)

ami/jobs/tasks.py

…eaner

The missing-state and completed-after-terminal logs read like insider notes, with ticket numbers and race-theory in the runtime message. Move the rationale and the issue reference into code comments and make the log lines plain operational statements an operator can act on without chasing a ticket. Also drop the redundant dispatch_mode field and the extra status re-query.

Refs #1337.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Copilot

Pull request overview

Adds targeted logging to improve observability around two “job reached a terminal state unexpectedly” scenarios in the async results processing pipeline, without changing the underlying guard/fail behavior. This helps distinguish legitimate terminal transitions (cancel/reaper) from premature/incorrect ones that would otherwise be silent.

Changes:

Emit missing-Redis-state context logs before ack+fail in both stage="process" and stage="results" missing-state branches.
Emit a warning when _update_job_progress detects completion but the guarded terminal transition does not apply (job already terminal/CANCELING).

- Treat CANCELING as terminal-like in the missing-state classification so a cancel-in-flight result logs the benign info line instead of the misleading 'still running / marking it failed' warning (matches _fail_job's no-op set). Caught by CodeRabbit and Copilot.
- Rename the values() dict from 'row' to 'job_values' (per review).
- Log the completed-after-terminal case via job.logger and include the stage and attempted terminal state, without an extra status re-query (per Copilot).

Refs #1337.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

mihow marked this pull request as ready for review June 19, 2026 21:21

mihow force-pushed the fix/1337-terminal-transition-chokepoint branch from a8bce8d to aaa5365 Compare June 19, 2026 21:21

mihow and others added 2 commits June 19, 2026 14:21

mihow changed the base branch from fix/1337-terminal-transition-chokepoint to main June 19, 2026 21:21

mihow force-pushed the fix/1337-terminal-anomaly-logs branch from 42d127b to 0f52c17 Compare June 19, 2026 21:21

coderabbitai Bot reviewed Jun 19, 2026

Comment thread ami/jobs/tasks.py Outdated

Copilot AI review requested due to automatic review settings June 19, 2026 21:36

Copilot started reviewing on behalf of mihow June 19, 2026 21:37 View session

Copilot AI reviewed Jun 19, 2026

Comment thread ami/jobs/tasks.py

Comment thread ami/jobs/tasks.py

mihow commented Jun 19, 2026

Comment thread ami/jobs/tasks.py Outdated

mihow added the PSv2 label (Async & distributed ML backend (PSv2): job state, NATS dispatch, result handling. Umbrella #515.) Jun 19, 2026

mihow merged commit 702de1e into main Jun 19, 2026
7 checks passed

mihow deleted the fix/1337-terminal-anomaly-logs branch June 19, 2026 23:24

mihow merged 4 commits into main from fix/1337-terminal-anomaly-logs
