Skip to content

Improve celery task dispatch and cancellation to prevent stuck jobs#1324

Open
mihow wants to merge 2 commits into
mainfrom
fix/celery-stuck-jobs
Open

Improve celery task dispatch and cancellation to prevent stuck jobs#1324
mihow wants to merge 2 commits into
mainfrom
fix/celery-stuck-jobs

Conversation

@mihow

@mihow mihow commented May 27, 2026

Copy link
Copy Markdown
Collaborator

Closes #1323.

Last week we hit the symptom Issue #1323 describes: a run_job was sitting inside filter_processed_images() for ~9 minutes against a huge collection, and a fresh run_job queued behind it sat in RESERVED on the same worker container the entire time — even though 15 of 16 children on that container were idle and the entire sibling container was idle. SIGKILL'ing the blocker let the queued job start in the same second. The job that finally ran was almost certainly fine, but it wasted nine minutes of wall clock for no good reason, and any user clicking "run" during that window saw nothing happen.

This PR fixes the three reinforcing causes the issue identified, plus an orthogonal cancel-path bug I noticed while reading the code paths. Each one is a small config or decorator change; the value is in stacking them.

What changes for users

  • Stuck run_job no longer blocks other jobs in the queue. A slow first job releases the worker slot for the next one as soon as a sibling child is idle, rather than holding 15 idle slots hostage.
  • A worker crash mid-job no longer silently drops the job. Before: the celery message was already acked at delivery, so a SIGKILL/OOM/deploy roll meant the job stayed in STARTED forever until the reaper found it. After: broker holds the message, redelivers it when a worker comes back. The job either resumes or — if it had already settled — exits cleanly.
  • Cancelling an async ML job actually cancels. Today's Job.cancel calls revoke(terminate=True) on the local run_job task. For ASYNC_API that task has almost always already finished (queue_images_to_nats returns fast) — terminating it does nothing about the actual work running on the remote ADC worker, and on the rare occasion the bootstrap is still running, the SIGTERM kills it without redelivery. The real cancel mechanism for async is tearing down the NATS stream + Redis state, which cleanup_async_job_if_needed already does. We now skip terminate for ASYNC_API.

Plain-language summary

Behavior Before After
Long run_job blocking sibling jobs on same container Yes — pre-assignment pins messages to busy child No — fair scheduler hands messages only to idle children
Long run_job blocking sibling jobs on sibling container Sometimes — broker spills late Reduced — pairs naturally with smaller per-container reservation window (see below)
Worker SIGKILL/OOM mid-run_job Message lost; job stranded in STARTED Broker redelivers; early-guard handles redelivery cleanly
Cancel of ASYNC_API job Terminates local bootstrap (mostly a no-op, occasionally a SIGTERM mid-flight); remote workers keep going Tears down NATS stream + Redis state; remote work stops naturally
Cancel of SYNC_API / INTERNAL job Terminates celery task Unchanged — terminate is still the only way

What's in this PR

  1. config/settings/base.pyCELERY_WORKER_POOL_OPTIMIZATION = "fair". One line. Applies to all queues; the value is largely on jobs.
  2. ami/jobs/tasks.pyacks_late=True, reject_on_worker_lost=True on run_job, plus an early-guard at the top of the task body that returns cleanly when job.status is in final_states() or CANCELING. The guard is what makes redelivery and cancel-race safe.
  3. ami/jobs/models.pyJob.cancel no longer passes terminate=True when dispatch_mode == ASYNC_API. For other dispatch modes, behavior is unchanged.
  4. docker-compose.worker.yml — added a commented-out CELERY_WORKER_CONCURRENCY: "4" on celeryworker_jobs with a TODO referencing this issue, in case we want to enable cause C later.

A note on the counter-intuitive concurrency knob

The issue suggests lowering per-container concurrency on the jobs queue as a third fix. I've left this commented out for now (just a discoverability hint in docker-compose.worker.yml) because it reads as backwards and I'd rather watch the first two fixes in production before pulling this lever too.

The reasoning, briefly: celery's prefetch reserves concurrency × prefetch_multiplier(=1) messages per container at the broker level. With concurrency=16, that's 16 messages held in the container's local buffer. When one of those messages is a stuck task, the container still tells the broker "I have 15 free slots" — and the broker keeps offering new messages to that container instead of spilling to a fully-idle sibling. Lowering concurrency to 4 shrinks the reservation window so the broker spills sooner.

The reason this isn't a meaningful capacity cut: run_job spends nearly all its time waiting on NATS results to come back, not burning CPU. The 16 was originally raised (#1228) for ml_results and antenna, which are DB/Redis-bound and benefit from oversubscription. The jobs queue inherited the high number incidentally.

What the three fixes are actually doing

There are three different head-of-line problems, with three different mechanisms:

  • Inside a child (one slow task blocks its own pipe): can't be fixed, that's just how prefork works.
  • Inside a container, across children (default pre-assignment scheduler pins messages to specific child pipes regardless of which child becomes idle first): fixed by -O fair.
  • Across containers (broker prefetch reservation: a 1-task container still looks like it has 15 free slots): mitigated by lower per-container concurrency (the deferred knob).

acks_late is orthogonal to all three — it's about surviving worker death, not about scheduling. But it's a precondition for both the cancel fix to be safe (cancellation can now terminate a worker without losing the redelivered message that the early-guard will short-circuit) and for the deferred concurrency change to be safe (smaller pools mean each child handles more tasks, and a single SIGKILL hurts more).

What's verified

  • New unit tests on Job.cancel for ASYNC_API / SYNC_API / no-task-id paths.
  • New tests on the run_job early-guard covering REVOKED, CANCELING, SUCCESS, and the contract pair (PENDING still runs).
  • Full ami.jobs.tests (118 tests) and ami.ml.tests + ami.ml.orchestration.tests (81 tests) green.

What still needs verifying in staging/prod

From the issue's "what we still need to verify" section, two of three are now testable:

  1. Re-running the symptom against a real -O fair worker — confirm RESERVED → ACTIVE happens immediately when a sibling child is idle. I'd want to do this on dev box (queue two long run_jobs back-to-back, sleep one).
  2. -O fair interaction with max-tasks-per-child=100 — should be fine but worth watching for the first day after deploy.
  3. acks_late redelivery in practice — the early-guard makes a redelivered run_job a no-op when the job is already settled, so this is covered by the tests, but worth eyeballing the celery logs after deploy for unexpected Skipping run_job messages.

The cancel fix is the one I'd most like a second pair of eyes on — the docstring is the long version, but the gist is "for ASYNC_API the celery task isn't where the work is, so terminating it doesn't cancel; cleanup does."

Related

Co-Authored-By: Claude noreply@anthropic.com

Summary by CodeRabbit

  • Bug Fixes

    • Job cancellation now handles different job types more reliably, ensuring consistent state transitions and proper cleanup.
    • Task execution is more resilient when workers are unexpectedly lost, with automatic re-delivery of in-flight tasks to maintain reliability.
  • Chores

    • Optimized worker pool scheduling configuration to enable fair task distribution and reduce processing delays.

Review Change Stack

Addresses three reinforcing causes of run_job head-of-line blocking on the
jobs queue (#1323) plus an orthogonal cancel bug exposed
by the same investigation.

- Enable fair scheduling (CELERY_WORKER_POOL_OPTIMIZATION = "fair") so the
  master process holds prefetched messages in a shared buffer instead of
  pre-assigning them to specific prefork children. Long heterogeneous
  tasks (notably run_job inside filter_processed_images) no longer block
  newer messages stuck behind them on the same child.

- Add acks_late=True + reject_on_worker_lost=True to run_job so a worker
  SIGKILL/OOM mid-task triggers broker redelivery instead of silently
  dropping the job. Pairs with an early-guard at the top of run_job that
  returns cleanly if the Job is already in a terminal state or being
  cancelled, so redelivery never re-runs side effects.

- Fix Job.cancel for ASYNC_API: skip terminate=True on the (likely-done)
  run_job task — the actual work runs on remote ADC workers via NATS, and
  cleanup_async_job_if_needed is what stops it. Terminating the local
  bootstrap was a no-op at best and SIGTERM'd a still-bootstrapping child
  at worst. INTERNAL / SYNC_API keep terminate=True since their celery
  task body owns the entire job lifecycle.

- Document the optional CELERY_WORKER_CONCURRENCY=4 override on the
  celeryworker_jobs container (commented out for now) so operators can
  opt in once -O fair is observed in production.

Co-Authored-By: Claude <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 27, 2026 23:16
@netlify

netlify Bot commented May 27, 2026

Copy link
Copy Markdown

Deploy Preview for antenna-ssec canceled.

Name Link
🔨 Latest commit c8e6c8a
🔍 Latest deploy log https://app.netlify.com/projects/antenna-ssec/deploys/6a17971be3a452000810b632

@netlify

netlify Bot commented May 27, 2026

Copy link
Copy Markdown

Deploy Preview for antenna-preview canceled.

Name Link
🔨 Latest commit c8e6c8a
🔍 Latest deploy log https://app.netlify.com/projects/antenna-preview/deploys/6a17971b5a4ff00008e75cd4

@coderabbitai

coderabbitai Bot commented May 27, 2026

Copy link
Copy Markdown
Contributor

Warning

Review limit reached

@mihow, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 46 minutes and 41 seconds. Learn how PR review limits work.

Your organization has run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 9597a54d-4d1a-4e33-b4f8-6d679c8a0109

📥 Commits

Reviewing files that changed from the base of the PR and between f772b71 and c8e6c8a.

📒 Files selected for processing (4)
  • ami/jobs/models.py
  • ami/jobs/tasks.py
  • ami/jobs/tests/test_jobs.py
  • ami/jobs/tests/test_tasks.py
📝 Walkthrough

Walkthrough

This PR addresses blocking behavior in the Celery jobs queue by unifying job cancellation semantics across dispatch modes, adding broker-safe redelivery handling to the run_job task with early-return guards, and configuring fair worker pool scheduling to prevent head-of-line blocking.

Changes

Job Cancellation and Celery Task Resilience

Layer / File(s) Summary
Job cancellation refactor for dispatch modes
ami/jobs/models.py, ami/jobs/tests/test_jobs.py
Job.cancel() now revokes Celery tasks with terminate=not is_async_api, meaning ASYNC_API jobs skip worker termination while SYNC_API/INTERNAL jobs use SIGTERM. Status unconditionally transitions to REVOKED and cleanup is called. Three test cases verify behavior for ASYNC_API, SYNC_API, and jobs without task_id.
Celery task reliability: late acks and early guards
ami/jobs/tasks.py, ami/jobs/tests/test_tasks.py
run_job task now uses acks_late=True and reject_on_worker_lost=True for broker redelivery safety. An early-return guard skips execution if the job is already terminal (REVOKED, SUCCESS) or canceling (CANCELING), preventing side-effect re-runs on redelivered messages. Post-execution refreshes the job and logs conditionally by dispatch mode. Regression tests confirm the guard prevents Job.run() from executing in terminal/canceling states.
Worker pool scheduling and configuration
config/settings/base.py, docker-compose.worker.yml
Introduces CELERY_WORKER_POOL_OPTIMIZATION = "fair" to enable fair scheduling in prefork pools, reducing head-of-line blocking from long heterogeneous tasks. Documentation comments in docker-compose.worker.yml explain CELERY_WORKER_CONCURRENCY tuning rationale for the jobs queue.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • RolnickLab/antenna#1118: Introduced JobDispatchMode enum and dispatch_mode field on Job, which this PR's cancellation logic now uses to determine Celery task termination behavior.

Suggested labels

backend

Suggested reviewers

  • carlosgjs

Poem

🐰 A job was stuck, and workers sat idle,

Fair scheduling and acks_late made the queue less spiteful.

With early guards and redelivery's care,

No more head-of-line blocking—freedom in the air! 🎉

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 70.59% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The PR title accurately summarizes the main objective: improving celery task dispatch and cancellation to prevent stuck jobs, which directly addresses the root causes identified in issue #1323.
Linked Issues check ✅ Passed The code changes directly implement all three primary objectives from issue #1323: (A) adds CELERY_WORKER_POOL_OPTIMIZATION='fair' for fair scheduling, (B) adds acks_late/reject_on_worker_lost with status guards for safe redelivery, and (C) provides commented concurrency guidance for future tuning.
Out of Scope Changes check ✅ Passed All code changes are tightly scoped to addressing the three causes of stuck jobs identified in #1323: scheduler config, celery task decorators/guards, cancel semantics, and worker config comments. No unrelated refactoring or feature additions present.
Description check ✅ Passed The PR description is comprehensive, well-structured, and covers all required sections from the template.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/celery-stuck-jobs

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR improves Celery run_job scheduling and cancellation behavior to reduce stuck-job blast radius and make worker-loss redelivery safer.

Changes:

  • Enables fair Celery prefork scheduling globally.
  • Adds late acknowledgement/redelivery settings and an early status guard to run_job.
  • Changes Job.cancel() behavior for ASYNC_API jobs and adds regression tests.
  • Adds a documented optional jobs-worker concurrency override.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
config/settings/base.py Adds fair worker pool optimization for Celery workers.
ami/jobs/tasks.py Adds late ack/reject-on-worker-lost and early short-circuit logic for run_job.
ami/jobs/models.py Updates cancellation behavior for async vs sync/internal jobs.
ami/jobs/tests/test_tasks.py Adds early-guard regression tests for run_job.
ami/jobs/tests/test_jobs.py Adds cancellation behavior tests.
docker-compose.worker.yml Documents a possible future jobs-worker concurrency override.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread ami/jobs/tasks.py Outdated
Comment on lines +165 to +170
if job.status in JobState.final_states() or job.status == JobState.CANCELING:
job.logger.info(
f"Skipping run_job for job {job.pk}: already in status {job.status} "
f"(redelivery or cancellation in flight)"
)
return
Comment thread ami/jobs/models.py
Comment on lines 1097 to +1101
task = run_job.AsyncResult(self.task_id)
if task:
task.revoke(terminate=True)
if self.dispatch_mode == JobDispatchMode.ASYNC_API:
# For async jobs we need to set the status to revoked here since the task already
# finished (it only queues the images).
self.status = JobState.REVOKED
self.save()
else:
self.status = JobState.REVOKED
self.save()
task.revoke(terminate=not is_async_api)

self.status = JobState.REVOKED

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@ami/jobs/tasks.py`:
- Around line 161-170: The pre-run guard using job.status /
JobState.final_states() is insufficient for ASYNC_API jobs because cancellation
may occur after the initial check but before dispatch; to fix, add a second
status refresh and guard immediately before the async dispatch call (right
before queue_images_to_nats) by reloading the Job from the DB (e.g., call the
model refresh/get by PK) and aborting the task (return) if the reloaded
job.status is JobState.CANCELING or in JobState.final_states(), logging a
similar skip message; ensure you reference the same job PK/logger and perform
this check right before queue_images_to_nats to avoid enqueuing work for
canceled jobs.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: ed9ca82f-4b76-42e9-ae6e-290f6abf61b2

📥 Commits

Reviewing files that changed from the base of the PR and between f585ddc and f772b71.

📒 Files selected for processing (6)
  • ami/jobs/models.py
  • ami/jobs/tasks.py
  • ami/jobs/tests/test_jobs.py
  • ami/jobs/tests/test_tasks.py
  • config/settings/base.py
  • docker-compose.worker.yml

Comment thread ami/jobs/tasks.py
Comment on lines +161 to +170
# Early-guard: under acks_late, the broker may redeliver this message after a
# worker SIGKILL/OOM, and Job.cancel() may also flip status to CANCELING /
# REVOKED while the message sits in the prefetch buffer. Don't re-run a job
# that's already settled or being torn down.
if job.status in JobState.final_states() or job.status == JobState.CANCELING:
job.logger.info(
f"Skipping run_job for job {job.pk}: already in status {job.status} "
f"(redelivery or cancellation in flight)"
)
return

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Entry-only cancel guard is race-prone for ASYNC_API jobs.

Lines 165-170 guard only before job.run(). If cancel happens after that check, the task can still reach async dispatch and enqueue work under a canceled job because ASYNC_API cancel no longer terminates the worker process. Add a second DB refresh/status check immediately before async dispatch (e.g., right before queue_images_to_nats) and abort when status is CANCELING/terminal.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ami/jobs/tasks.py` around lines 161 - 170, The pre-run guard using job.status
/ JobState.final_states() is insufficient for ASYNC_API jobs because
cancellation may occur after the initial check but before dispatch; to fix, add
a second status refresh and guard immediately before the async dispatch call
(right before queue_images_to_nats) by reloading the Job from the DB (e.g., call
the model refresh/get by PK) and aborting the task (return) if the reloaded
job.status is JobState.CANCELING or in JobState.final_states(), logging a
similar skip message; ensure you reference the same job PK/logger and perform
this check right before queue_images_to_nats to avoid enqueuing work for
canceled jobs.

Review on #1324 surfaced two races that left the early-guard non-functional
in production:

1. ``task_prerun`` (``pre_update_job_status``) wrote PENDING to the row
   before the ``run_job`` body inspected status. A canceled or redelivered
   message therefore had its REVOKED/CANCELING overwritten with PENDING,
   and the early-guard added in the parent commit never tripped. The
   existing tests passed only because they invoked ``run_job.apply(args=[…])``
   while production uses ``kwargs={"job_id": …}`` — under args, the prerun
   handler raised ``KeyError`` and exited silently. Switching the tests to
   ``kwargs=`` reproduces the production code path; the prerun handler now
   short-circuits when ``Job.is_settled()`` is true, preserving the status
   the early-guard reads next.

2. For ASYNC_API jobs ``Job.cancel()`` revokes without ``terminate=True``,
   marks the row REVOKED, and tears down the NATS stream + Redis state.
   ``MLJob.run`` running in a worker that's still inside ``collect_images``
   (slow for large collections) would then proceed to ``queue_images_to_nats``
   and recreate the stream the cancel just deleted, dispatching real GPU
   work to ADC for a revoked job; the results came back to no Redis state
   and ``_fail_job`` silently overwrote REVOKED with FAILURE. The bootstrap
   now checks ``Job.status`` (via a values-only read so the in-memory
   ``progress`` mutations don't clobber the cancel's REVOKED) right after
   the collect stage and bails out before any dispatch.

Adds ``Job.is_settled()`` to centralize the "terminal or being torn down"
predicate that ``run_job``'s early-guard, the prerun handler, ``_fail_job``,
and the bootstrap guard all needed. Adds two regression tests: one for the
prerun-then-guard chain, one for the cancel-during-bootstrap race.

Co-Authored-By: Claude <noreply@anthropic.com>
@mihow

mihow commented Jun 12, 2026

Copy link
Copy Markdown
Collaborator Author

Claude says: (posted on Michael's behalf — this is part of a structural review pass over the four open PSv2-readiness PRs: #1276, #1279, #1312, #1324, toward the umbrella #910.)

First, the parts of this PR that look solid: the Job.cancel() rewrite is a genuine root-cause fix — for async_api jobs the celery task is not where the work lives, so cleanup-then-REVOKED is the right cancel semantics, and it also fixes a real pre-existing bug where sync jobs with a task_id could sit in CANCELING forever. The new tests exercise the signals via run_job.apply rather than mocking them away, and the mid-bootstrap cancellation test writes a real status change mid-collect. That's the strongest part of the diff and we'd love to see it merge soon.

A few things seem worth discussing before the rest merges:

  1. CELERY_WORKER_POOL_OPTIMIZATION appears to be a no-op. We could not find this setting anywhere in the celery package (checked against the pinned 5.4.x line and current releases): fair scheduling seems to be CLI-only (celery worker -O fair), wired through WorkController.setup_defaults as a kwarg from the CLI parser, with no app.conf lookup. Since the worker start scripts under compose/ aren't modified here, the head-of-line blocking incident this PR targets (Stuck job blocks other jobs even when workers are free #1323) would likely recur after merge. Suggestion: add -O fair to the worker start command(s) and drop the settings entry. Easy to verify on staging by watching task distribution across prefork children.

  2. acks_late interacts badly with RabbitMQ's consumer_timeout in some of our environments. run_job has a multi-day time limit, and with acks_late=True the message stays unacked for the task's whole runtime. RabbitMQ's default consumer_timeout is 30 minutes — past that it closes the channel and redelivers the message while the original task is still running. We checked our deployments: the production broker already has consumer_timeout raised well above the task limit, but the demo and staging rabbitmq containers (from docker-compose.staging.yml) run the 30-minute default — which is also exactly where the concurrent-load e2e validation for PSv2 runs. Two options: raise consumer_timeout in the compose file before this lands anywhere, or scope acks_late to the async_api path where the task body is short.

  3. Redelivery while a job is STARTED isn't guarded. is_settled() covers final states + CANCELING, so a redelivered message for a STARTED job (consumer timeout above, broker hiccup, worker loss) passes both guards and re-runs MLJob.run concurrently with in-flight work — double collect_images, double NATS enqueue for not-yet-processed images. And since reject_on_worker_lost=True requeues without any delivery cap on the celery/RabbitMQ side, a job that reliably OOMs its worker becomes a redeliver loop. A test for the redelivered-while-STARTED case (or an explicit idempotency argument in the PR body) would settle this.

  4. Bookkeeping: this PR closes Stuck job blocks other jobs even when workers are free #1323 — worth stating explicitly that CANCELLED jobs leak through /next filter, starve newer async_api jobs #1282 (cancelled jobs served by /next) and Clean up task queue when job is revoked or re-started, fix duplicate tasks #1283 (revoke cleanup) remain open and aren't covered here, so they don't get lost. The temporary-hack markers in ami/jobs/views.py that reference them are untouched by this diff.

Smaller cleanups, low priority: the is_settled() docstring says the mid-bootstrap check uses it, but that check re-implements the predicate inline — worth unifying since centralizing the predicate is the point; and Copilot's note about update_job_status (task_postrun) lacking the same guard symmetry still looks open.

Given how much of this is good, one path worth considering: split the cancel() fix into its own PR and merge it now, and let the dispatch/acks_late half follow once items 1–2 are resolved. Happy to help verify the -O fair behavior on staging.

@mihow mihow added the PSv2 Async & distributed ML backend (PSv2): job state, NATS dispatch, result handling. Umbrella #515. label Jun 16, 2026
mihow added a commit that referenced this pull request Jun 19, 2026
…gnosis

Observation-only follow-up to #1338/#1342. Now that terminal status
transitions are irreversible, surface the two cases where a terminal verdict
may have been wrong, instead of letting them disappear silently:

1. When work completes for a job the guard finds already terminal/CANCELING,
   log a warning. Often legitimate (cancel/reaper won the race) but, if frequent,
   the signal of a premature terminal verdict.
2. When a result is failed because the job's Redis state is missing, log the
   job age/status/dispatch first. A small age points to a not-yet-seeded or
   redelivered-run_job race rather than genuine cleanup.

No behaviour change — both warnings sit on existing code paths. Lets us confirm
the trigger before adding grace/idempotency logic (see PR body follow-up).

Refs #1337, #1219, #1324.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
mihow added a commit that referenced this pull request Jun 19, 2026
…for #1337) (#1343)

* feat(jobs): log premature-terminal and missing-state failures for diagnosis

Observation-only follow-up to #1338/#1342. Now that terminal status
transitions are irreversible, surface the two cases where a terminal verdict
may have been wrong, instead of letting them disappear silently:

1. When work completes for a job the guard finds already terminal/CANCELING,
   log a warning. Often legitimate (cancel/reaper won the race) but, if frequent,
   the signal of a premature terminal verdict.
2. When a result is failed because the job's Redis state is missing, log the
   job age/status/dispatch first. A small age points to a not-yet-seeded or
   redelivered-run_job race rather than genuine cleanup.

No behaviour change — both warnings sit on existing code paths. Lets us confirm
the trigger before adding grace/idempotency logic (see PR body follow-up).

Refs #1337, #1219, #1324.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(jobs): downgrade missing-state log to info when the job is already terminal

The missing-state diagnostic logged a WARNING saying 'Failing job' for every
in-flight result that arrived after a job finished — but _fail_job no-ops on a
terminal job, so after a cancel (which deletes the Redis state) this fired once
per in-flight batch and misdescribed normal cleanup as a failure. Now: a
terminal job logs at info ('ignoring in-flight result for already-terminal
job'); only a NON-terminal job with missing state logs the warning, which is the
case actually worth investigating.

Refs #1337.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* refactor(jobs): make the diagnostic log lines operator-readable and leaner

The missing-state and completed-after-terminal logs read like insider notes —
ticket numbers and race-theory in the runtime message. Move the rationale and
the issue reference into code comments and make the log lines plain operational
statements an operator can act on without chasing a ticket. Also drop the
redundant dispatch_mode field and the extra status re-query.

Refs #1337.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* refactor(jobs): address review on missing-state diagnostics

- Treat CANCELING as terminal-like in the missing-state classification so a
  cancel-in-flight result logs the benign info line instead of the misleading
  'still running / marking it failed' warning (matches _fail_job's no-op set).
  Caught by CodeRabbit and Copilot.
- Rename the values() dict from 'row' to 'job_values' (per review).
- Log the completed-after-terminal case via job.logger and include the stage and
  attempted terminal state, without an extra status re-query (per Copilot).

Refs #1337.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@mihow

mihow commented Jun 20, 2026

Copy link
Copy Markdown
Collaborator Author

Claude says: Recommendation now that the #1337 work has landed — this PR is mostly superseded and worth closing once its one unique win is salvaged. Flagging rather than closing since it's your PR.

Net: once the skip-terminate fold in #1342 merges, this can be closed. Happy to open the small acks_late-gated PR separately if you want to keep that piece.

mihow added a commit that referenced this pull request Jun 20, 2026
…e into cancel

- The reaper (check_stale_jobs) is a 6th terminal-status writer, lock-based, not
  routed through _guarded_status_update. Correct the docstring's false 'single
  chokepoint' claim: this helper is the chokepoint for the lock-free writers;
  _fail_job and the reaper enforce the same no-resurrect invariant under
  select_for_update (the reaper deliberately keeps a broader from-set so it can
  still force a stuck CANCELING/UNKNOWN job terminal as last resort).
- Fold the one useful change from #1324: cancel() of an ASYNC_API job now revokes
  the local run_job WITHOUT terminate. That task only queues images and has
  usually finished; the remote ADC work is stopped by the NATS/Redis teardown,
  not by SIGTERM-ing the bootstrap. Sync/internal jobs still terminate.

Refs #1337. Supersedes the cancel rewrite in #1324.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

PSv2 Async & distributed ML backend (PSv2): job state, NATS dispatch, result handling. Umbrella #515.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Stuck job blocks other jobs even when workers are free

2 participants