Improve celery task dispatch and cancellation to prevent stuck jobs by mihow · Pull Request #1324 · RolnickLab/antenna

mihow · 2026-05-27T23:16:29Z

Closes #1323.

Last week we hit the symptom Issue #1323 describes: a run_job was sitting inside filter_processed_images() for ~9 minutes against a huge collection, and a fresh run_job queued behind it sat in RESERVED on the same worker container the entire time — even though 15 of 16 children on that container were idle and the entire sibling container was idle. SIGKILL'ing the blocker let the queued job start in the same second. The job that finally ran was almost certainly fine, but it wasted nine minutes of wall clock for no good reason, and any user clicking "run" during that window saw nothing happen.

This PR fixes the three reinforcing causes the issue identified, plus an orthogonal cancel-path bug I noticed while reading the code paths. Each one is a small config or decorator change; the value is in stacking them.

What changes for users

Stuck run_job no longer blocks other jobs in the queue. A slow first job releases the worker slot for the next one as soon as a sibling child is idle, rather than holding 15 idle slots hostage.
A worker crash mid-job no longer silently drops the job. Before: the celery message was already acked at delivery, so a SIGKILL/OOM/deploy roll meant the job stayed in STARTED until the reaper found it. After: the broker holds the message and redelivers it when a worker comes back. The job either resumes or, if it had already settled, exits cleanly.
Cancelling an async ML job actually cancels. Today's Job.cancel calls revoke(terminate=True) on the local run_job task. For ASYNC_API that task has almost always already finished (queue_images_to_nats returns fast), so terminating it does nothing about the actual work running on the remote ADC worker, and on the rare occasion the bootstrap is still running, the SIGTERM kills it without redelivery. The real cancel mechanism for async is tearing down the NATS stream + Redis state, which cleanup_async_job_if_needed already does. We now skip terminate for ASYNC_API.
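
The cancel behaviour described above can be sketched roughly as follows. This is a minimal, self-contained model and not the code in ami/jobs/models.py: `FakeTask`, `FakeJob`, the enum values, and the `cleaned_up` flag are hypothetical stand-ins. Only the `terminate=not is_async_api` decision, the REVOKED transition, and the cleanup call are taken from this PR.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class JobDispatchMode(Enum):
    # Member names come from the PR; the string values are assumptions.
    INTERNAL = "internal"
    SYNC_API = "sync_api"
    ASYNC_API = "async_api"


@dataclass
class FakeTask:
    """Hypothetical stand-in for a Celery AsyncResult; records revoke() calls."""

    revoke_calls: list = field(default_factory=list)

    def revoke(self, terminate: bool = False) -> None:
        self.revoke_calls.append(terminate)


@dataclass
class FakeJob:
    """Hypothetical stand-in for the Job model."""

    dispatch_mode: JobDispatchMode
    task: Optional[FakeTask] = None
    status: str = "STARTED"
    cleaned_up: bool = False

    def cancel(self) -> None:
        is_async_api = self.dispatch_mode == JobDispatchMode.ASYNC_API
        if self.task is not None:
            # ASYNC_API: revoke without SIGTERM, the local task is not where the work is.
            self.task.revoke(terminate=not is_async_api)
        self.status = "REVOKED"
        self.cleaned_up = True  # stands in for cleanup_async_job_if_needed
```

In this model an ASYNC_API job records `revoke(terminate=False)`, while SYNC_API and INTERNAL jobs record `revoke(terminate=True)`; a job without a task still ends up REVOKED.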

Plain-language summary

Behavior	Before	After
Long `run_job` blocking sibling jobs on same container	Yes — pre-assignment pins messages to busy child	No — fair scheduler hands messages only to idle children
Long `run_job` blocking sibling jobs on sibling container	Sometimes — broker spills late	Reduced — pairs naturally with smaller per-container reservation window (see below)
Worker SIGKILL/OOM mid-`run_job`	Message lost; job stranded in STARTED	Broker redelivers; early-guard handles redelivery cleanly
Cancel of ASYNC_API job	Terminates local bootstrap (mostly a no-op, occasionally a SIGTERM mid-flight); remote workers keep going	Tears down NATS stream + Redis state; remote work stops naturally
Cancel of SYNC_API / INTERNAL job	Terminates celery task	Unchanged — terminate is still the only way

What's in this PR

config/settings/base.py: CELERY_WORKER_POOL_OPTIMIZATION = "fair". One line. Applies to all queues; the benefit is mostly on the jobs queue.
ami/jobs/tasks.py: acks_late=True, reject_on_worker_lost=True on run_job, plus an early-guard at the top of the task body that returns cleanly when job.status is in final_states() or CANCELING. The guard is what makes redelivery and cancel-race safe.
ami/jobs/models.py: Job.cancel no longer passes terminate=True when dispatch_mode == ASYNC_API. For other dispatch modes, behavior is unchanged.
docker-compose.worker.yml: added a commented-out CELERY_WORKER_CONCURRENCY: "4" on celeryworker_jobs with a TODO referencing this issue, in case we want to apply the fix for cause C later.
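
The early-guard condition can be illustrated with a small standalone sketch. The `JobState` enum below is a simplified stand-in: the member names appear in this PR, but the exact contents of `final_states()` are an assumption, and `should_skip_run` is a hypothetical helper that only mirrors the guard's boolean test.

```python
from enum import Enum


class JobState(str, Enum):
    PENDING = "PENDING"
    STARTED = "STARTED"
    CANCELING = "CANCELING"
    SUCCESS = "SUCCESS"
    FAILURE = "FAILURE"
    REVOKED = "REVOKED"

    @classmethod
    def final_states(cls):
        # Assumption: the real enum may list more or different terminal states.
        return [cls.SUCCESS, cls.FAILURE, cls.REVOKED]


def should_skip_run(status: JobState) -> bool:
    """Mirror of the guard: skip when the job is settled or being torn down."""
    return status in JobState.final_states() or status == JobState.CANCELING
```

Under these assumptions REVOKED, SUCCESS, and CANCELING are skipped, while PENDING and STARTED are not.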

A note on the counter-intuitive concurrency knob

The issue suggests lowering per-container concurrency on the jobs queue as a third fix. I've left this commented out for now (just a discoverability hint in docker-compose.worker.yml) because it reads as backwards and I'd rather watch the first two fixes in production before pulling this lever too.

The reasoning, briefly: celery's prefetch reserves concurrency × prefetch_multiplier(=1) messages per container at the broker level. With concurrency=16, that's 16 messages held in the container's local buffer. When one of those messages is a stuck task, the container still tells the broker "I have 15 free slots" — and the broker keeps offering new messages to that container instead of spilling to a fully-idle sibling. Lowering concurrency to 4 shrinks the reservation window so the broker spills sooner.
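
The reservation arithmetic in that paragraph, as a tiny sketch. `reserved_messages` is a hypothetical helper written for illustration; it only restates the concurrency × prefetch_multiplier product and does not call into Celery.

```python
def reserved_messages(concurrency: int, prefetch_multiplier: int = 1) -> int:
    """Messages a single worker container may hold unacked at the broker."""
    return concurrency * prefetch_multiplier


# The two configurations discussed above.
current_window = reserved_messages(16)   # concurrency=16, multiplier=1
proposed_window = reserved_messages(4)   # concurrency=4, multiplier=1
```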

The reason this isn't a meaningful capacity cut: run_job spends nearly all its time waiting on NATS results to come back, not burning CPU. The 16 was originally raised (#1228) for ml_results and antenna, which are DB/Redis-bound and benefit from oversubscription. The jobs queue inherited the high number incidentally.

What the three fixes are actually doing

There are three different head-of-line problems, with three different mechanisms:

Inside a child (one slow task blocks its own pipe): can't be fixed, that's just how prefork works.
Inside a container, across children (default pre-assignment scheduler pins messages to specific child pipes regardless of which child becomes idle first): fixed by -O fair.
Across containers (broker prefetch reservation: a 1-task container still looks like it has 15 free slots): mitigated by lower per-container concurrency (the deferred knob).
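
To make the second case concrete, here is a toy model of the two scheduling styles. It is a deliberate simplification and not Celery's actual algorithm: the pre-assigned variant hands task i to child i mod n regardless of whether that child is busy, and the fair variant hands each task to whichever child becomes idle first. All function names are hypothetical.

```python
import heapq


def start_times_preassigned(durations, n_children):
    """Each task is pinned to a child up front (round-robin), busy or not."""
    free_at = [0.0] * n_children
    starts = []
    for i, duration in enumerate(durations):
        child = i % n_children
        start = free_at[child]
        starts.append(start)
        free_at[child] = start + duration
    return starts


def start_times_fair(durations, n_children):
    """Each task goes to the child that becomes idle first."""
    free_at = [0.0] * n_children
    heapq.heapify(free_at)
    starts = []
    for duration in durations:
        start = heapq.heappop(free_at)
        starts.append(start)
        heapq.heappush(free_at, start + duration)
    return starts
```

With two children and durations of 540, 1, and 1 seconds, the third task waits 540 seconds behind the slow one in the pre-assigned model and starts after 1 second in the fair model.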

acks_late is orthogonal to all three: it's about surviving worker death, not about scheduling. But it's a precondition for two other changes being safe: the cancel fix (cancellation can now terminate a worker without losing the redelivered message that the early-guard will short-circuit) and the deferred concurrency change (smaller pools mean each child handles more tasks, and a single SIGKILL hurts more).

What's verified

New unit tests on Job.cancel for ASYNC_API / SYNC_API / no-task-id paths.
New tests on the run_job early-guard covering REVOKED, CANCELING, SUCCESS, and the contract pair (PENDING still runs).
Full ami.jobs.tests (118 tests) and ami.ml.tests + ami.ml.orchestration.tests (81 tests) green.

What still needs verifying in staging/prod

From the issue's "what we still need to verify" section, two of three are now testable:

Re-running the symptom against a real -O fair worker: confirm RESERVED → ACTIVE happens immediately when a sibling child is idle. I'd want to do this on a dev box (queue two long run_jobs back-to-back, sleep one).
-O fair interaction with max-tasks-per-child=100: should be fine but worth watching for the first day after deploy.
acks_late redelivery in practice: the early-guard makes a redelivered run_job a no-op when the job is already settled, so this is covered by the tests, but it is worth eyeballing the celery logs after deploy for unexpected Skipping run_job messages.

The cancel fix is the one I'd most like a second pair of eyes on — the docstring is the long version, but the gist is "for ASYNC_API the celery task isn't where the work is, so terminating it doesn't cancel; cleanup does."

Stuck job blocks other jobs even when workers are free #1323: this issue.
Preparing jobs takes too long preparing large collections #1321: filter_processed_images slowness, the upstream cause of the long run_job that exposed this. Fixing #1321 reduces the frequency of stuck jobs; this PR reduces the blast radius when they happen.
fix(celery): update worker concurrency defaults #1228: the PR that raised CELERY_WORKER_CONCURRENCY to 16 for ml/antenna queues. Context for why the jobs queue inherited the same number.
fix(jobs): prevent jobs from hanging in STARTED state with no progress #1234 / fix(jobs): don't fail jobs prematurely, but fail stalled jobs sooner (3 days -> 10 min) #1235: the prior async-job fix pair (ACK/SREM ordering, Bug C guard, stale-job cutoff). The cancel fix here is in similar territory.

Co-Authored-By: Claude noreply@anthropic.com

Summary by CodeRabbit

Bug Fixes
- Job cancellation now handles different job types more reliably, ensuring consistent state transitions and proper cleanup.
- Task execution is more resilient when workers are unexpectedly lost, with automatic re-delivery of in-flight tasks to maintain reliability.
Chores
- Optimized worker pool scheduling configuration to enable fair task distribution and reduce processing delays.

Addresses three reinforcing causes of run_job head-of-line blocking on the jobs queue (#1323) plus an orthogonal cancel bug exposed by the same investigation.

- Enable fair scheduling (CELERY_WORKER_POOL_OPTIMIZATION = "fair") so the master process holds prefetched messages in a shared buffer instead of pre-assigning them to specific prefork children. Long heterogeneous tasks (notably run_job inside filter_processed_images) no longer block newer messages stuck behind them on the same child.
- Add acks_late=True + reject_on_worker_lost=True to run_job so a worker SIGKILL/OOM mid-task triggers broker redelivery instead of silently dropping the job. Pairs with an early-guard at the top of run_job that returns cleanly if the Job is already in a terminal state or being cancelled, so redelivery never re-runs side effects.
- Fix Job.cancel for ASYNC_API: skip terminate=True on the (likely-done) run_job task. The actual work runs on remote ADC workers via NATS, and cleanup_async_job_if_needed is what stops it. Terminating the local bootstrap was a no-op at best and SIGTERM'd a still-bootstrapping child at worst. INTERNAL / SYNC_API keep terminate=True since their celery task body owns the entire job lifecycle.
- Document the optional CELERY_WORKER_CONCURRENCY=4 override on the celeryworker_jobs container (commented out for now) so operators can opt in once -O fair is observed in production.

Co-Authored-By: Claude <noreply@anthropic.com>

coderabbitai · 2026-05-27T23:16:56Z

📥 Commits

Reviewing files that changed from the base of the PR and between f772b71 and c8e6c8a.

📒 Files selected for processing (4)

ami/jobs/models.py
ami/jobs/tasks.py
ami/jobs/tests/test_jobs.py
ami/jobs/tests/test_tasks.py

Walkthrough

This PR addresses blocking behavior in the Celery jobs queue by unifying job cancellation semantics across dispatch modes, adding broker-safe redelivery handling to the run_job task with early-return guards, and configuring fair worker pool scheduling to prevent head-of-line blocking.

Changes

Job Cancellation and Celery Task Resilience

Layer / File(s)	Summary
Job cancellation refactor for dispatch modes `ami/jobs/models.py`, `ami/jobs/tests/test_jobs.py`	`Job.cancel()` now revokes Celery tasks with `terminate=not is_async_api`, meaning ASYNC_API jobs skip worker termination while SYNC_API/INTERNAL jobs use SIGTERM. Status unconditionally transitions to REVOKED and cleanup is called. Three test cases verify behavior for ASYNC_API, SYNC_API, and jobs without task_id.
Celery task reliability: late acks and early guards `ami/jobs/tasks.py`, `ami/jobs/tests/test_tasks.py`	`run_job` task now uses `acks_late=True` and `reject_on_worker_lost=True` for broker redelivery safety. An early-return guard skips execution if the job is already terminal (REVOKED, SUCCESS) or canceling (CANCELING), preventing side-effect re-runs on redelivered messages. Post-execution refreshes the job and logs conditionally by dispatch mode. Regression tests confirm the guard prevents `Job.run()` from executing in terminal/canceling states.
Worker pool scheduling and configuration `config/settings/base.py`, `docker-compose.worker.yml`	Introduces `CELERY_WORKER_POOL_OPTIMIZATION = "fair"` to enable fair scheduling in prefork pools, reducing head-of-line blocking from long heterogeneous tasks. Documentation comments in docker-compose.worker.yml explain CELERY_WORKER_CONCURRENCY tuning rationale for the jobs queue.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

RolnickLab/antenna#1118: Introduced JobDispatchMode enum and dispatch_mode field on Job, which this PR's cancellation logic now uses to determine Celery task termination behavior.

Suggested labels

backend

Suggested reviewers

carlosgjs

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 70.59% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The PR title accurately summarizes the main objective: improving celery task dispatch and cancellation to prevent stuck jobs, which directly addresses the root causes identified in issue `#1323`.
Linked Issues check	✅ Passed	The code changes directly implement all three primary objectives from issue `#1323`: (A) adds CELERY_WORKER_POOL_OPTIMIZATION='fair' for fair scheduling, (B) adds acks_late/reject_on_worker_lost with status guards for safe redelivery, and (C) provides commented concurrency guidance for future tuning.
Out of Scope Changes check	✅ Passed	All code changes are tightly scoped to addressing the three causes of stuck jobs identified in `#1323`: scheduler config, celery task decorators/guards, cancel semantics, and worker config comments. No unrelated refactoring or feature additions present.
Description check	✅ Passed	The PR description is comprehensive, well-structured, and covers all required sections from the template.

Copilot

Pull request overview

This PR improves Celery run_job scheduling and cancellation behavior to reduce stuck-job blast radius and make worker-loss redelivery safer.

Changes:

Enables fair Celery prefork scheduling globally.
Adds late acknowledgement/redelivery settings and an early status guard to run_job.
Changes Job.cancel() behavior for ASYNC_API jobs and adds regression tests.
Adds a documented optional jobs-worker concurrency override.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.

Show a summary per file

File	Description
`config/settings/base.py`	Adds fair worker pool optimization for Celery workers.
`ami/jobs/tasks.py`	Adds late ack/reject-on-worker-lost and early short-circuit logic for `run_job`.
`ami/jobs/models.py`	Updates cancellation behavior for async vs sync/internal jobs.
`ami/jobs/tests/test_tasks.py`	Adds early-guard regression tests for `run_job`.
`ami/jobs/tests/test_jobs.py`	Adds cancellation behavior tests.
`docker-compose.worker.yml`	Documents a possible future jobs-worker concurrency override.

+    if job.status in JobState.final_states() or job.status == JobState.CANCELING:
+        job.logger.info(
+            f"Skipping run_job for job {job.pk}: already in status {job.status} "
+            f"(redelivery or cancellation in flight)"
+        )
+        return


            task = run_job.AsyncResult(self.task_id)
            if task:
-                task.revoke(terminate=True)
-            if self.dispatch_mode == JobDispatchMode.ASYNC_API:
-                # For async jobs we need to set the status to revoked here since the task already
-                # finished (it only queues the images).
-                self.status = JobState.REVOKED
-                self.save()
-        else:
-            self.status = JobState.REVOKED
-            self.save()
+                task.revoke(terminate=not is_async_api)
+
+        self.status = JobState.REVOKED


coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@ami/jobs/tasks.py`:
- Around line 161-170: The pre-run guard using job.status /
JobState.final_states() is insufficient for ASYNC_API jobs because cancellation
may occur after the initial check but before dispatch; to fix, add a second
status refresh and guard immediately before the async dispatch call (right
before queue_images_to_nats) by reloading the Job from the DB (e.g., call the
model refresh/get by PK) and aborting the task (return) if the reloaded
job.status is JobState.CANCELING or in JobState.final_states(), logging a
similar skip message; ensure you reference the same job PK/logger and perform
this check right before queue_images_to_nats to avoid enqueuing work for
canceled jobs.

📥 Commits

Reviewing files that changed from the base of the PR and between f585ddc and f772b71.

📒 Files selected for processing (6)

ami/jobs/models.py
ami/jobs/tasks.py
ami/jobs/tests/test_jobs.py
ami/jobs/tests/test_tasks.py
config/settings/base.py
docker-compose.worker.yml

coderabbitai · 2026-05-27T23:22:27Z

+    # Early-guard: under acks_late, the broker may redeliver this message after a
+    # worker SIGKILL/OOM, and Job.cancel() may also flip status to CANCELING /
+    # REVOKED while the message sits in the prefetch buffer. Don't re-run a job
+    # that's already settled or being torn down.
+    if job.status in JobState.final_states() or job.status == JobState.CANCELING:
+        job.logger.info(
+            f"Skipping run_job for job {job.pk}: already in status {job.status} "
+            f"(redelivery or cancellation in flight)"
+        )
+        return


⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Entry-only cancel guard is race-prone for ASYNC_API jobs.

Lines 165-170 guard only before job.run(). If cancel happens after that check, the task can still reach async dispatch and enqueue work under a canceled job because ASYNC_API cancel no longer terminates the worker process. Add a second DB refresh/status check immediately before async dispatch (e.g., right before queue_images_to_nats) and abort when status is CANCELING/terminal.
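
The suggested second check can be sketched in isolation. This is a minimal model under stated assumptions: `db` is a plain dict standing in for the jobs table, `collect_images` and `queue_images` are caller-supplied callables, and `bootstrap_async_job` and the returned strings are hypothetical. It only shows the shape of re-reading status between collection and dispatch.

```python
# Assumption: these status names approximate "terminal or being torn down".
FINAL_OR_CANCELING = {"SUCCESS", "FAILURE", "REVOKED", "CANCELING"}


def bootstrap_async_job(job_id, db, collect_images, queue_images):
    """Collect images, then re-read the job status before dispatching anything."""
    if db[job_id] in FINAL_OR_CANCELING:
        return "skipped-at-entry"

    images = collect_images()  # may be slow; a cancel can land during this call

    # Re-read from the store instead of trusting the status seen at entry.
    if db[job_id] in FINAL_OR_CANCELING:
        return "skipped-before-dispatch"

    queue_images(images)
    return "dispatched"
```

If the status flips to REVOKED while `collect_images` is running, the function returns before `queue_images` is called, so no work is enqueued for the canceled job.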

Review on #1324 surfaced two races that left the early-guard non-functional in production:

1. ``task_prerun`` (``pre_update_job_status``) wrote PENDING to the row before the ``run_job`` body inspected status. A canceled or redelivered message therefore had its REVOKED/CANCELING overwritten with PENDING, and the early-guard added in the parent commit never tripped. The existing tests passed only because they invoked ``run_job.apply(args=[…])`` while production uses ``kwargs={"job_id": …}``; under args, the prerun handler raised ``KeyError`` and exited silently. Switching the tests to ``kwargs=`` reproduces the production code path; the prerun handler now short-circuits when ``Job.is_settled()`` is true, preserving the status the early-guard reads next.
2. For ASYNC_API jobs ``Job.cancel()`` revokes without ``terminate=True``, marks the row REVOKED, and tears down the NATS stream + Redis state. ``MLJob.run`` running in a worker that's still inside ``collect_images`` (slow for large collections) would then proceed to ``queue_images_to_nats`` and recreate the stream the cancel just deleted, dispatching real GPU work to ADC for a revoked job; the results came back to no Redis state and ``_fail_job`` silently overwrote REVOKED with FAILURE. The bootstrap now checks ``Job.status`` (via a values-only read so the in-memory ``progress`` mutations don't clobber the cancel's REVOKED) right after the collect stage and bails out before any dispatch.

Adds ``Job.is_settled()`` to centralize the "terminal or being torn down" predicate that ``run_job``'s early-guard, the prerun handler, ``_fail_job``, and the bootstrap guard all needed. Adds two regression tests: one for the prerun-then-guard chain, one for the cancel-during-bootstrap race.

Co-Authored-By: Claude <noreply@anthropic.com>
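
The args-versus-kwargs gap in the first race can be reproduced with a toy handler. This is a sketch, not the project's ``pre_update_job_status``: the body is an assumption that the handler looks up ``kwargs["job_id"]``, and ``job_id_seen_by_handler`` is a hypothetical wrapper that models the error being swallowed.

```python
def pre_update_job_status(sender=None, task_id=None, args=None, kwargs=None, **_):
    """Toy prerun handler: finds the job by keyword argument only."""
    job_id = kwargs["job_id"]  # KeyError when the task was invoked positionally
    return job_id


def job_id_seen_by_handler(args=None, kwargs=None):
    """Return the job id the handler sees, or None if it failed with KeyError."""
    try:
        return pre_update_job_status(args=args or [], kwargs=kwargs or {})
    except KeyError:
        # Models the handler exiting silently, as described for the old tests.
        return None
```

Called with ``args=[7]`` the handler never sees the job, so it cannot touch the status; called with ``kwargs={"job_id": 7}`` it does, which is the production path the tests were switched to.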

mihow · 2026-06-12T20:04:16Z

Claude says: (posted on Michael's behalf — this is part of a structural review pass over the four open PSv2-readiness PRs: #1276, #1279, #1312, #1324, toward the umbrella #910.)

First, the parts of this PR that look solid: the Job.cancel() rewrite is a genuine root-cause fix — for async_api jobs the celery task is not where the work lives, so cleanup-then-REVOKED is the right cancel semantics, and it also fixes a real pre-existing bug where sync jobs with a task_id could sit in CANCELING forever. The new tests exercise the signals via run_job.apply rather than mocking them away, and the mid-bootstrap cancellation test writes a real status change mid-collect. That's the strongest part of the diff and we'd love to see it merge soon.

A few things seem worth discussing before the rest merges:

1. CELERY_WORKER_POOL_OPTIMIZATION appears to be a no-op. We could not find this setting anywhere in the celery package (checked against the pinned 5.4.x line and current releases): fair scheduling seems to be CLI-only (celery worker -O fair), wired through WorkController.setup_defaults as a kwarg from the CLI parser, with no app.conf lookup. Since the worker start scripts under compose/ aren't modified here, the head-of-line blocking incident this PR targets (#1323) would likely recur after merge. Suggestion: add -O fair to the worker start command(s) and drop the settings entry. Easy to verify on staging by watching task distribution across prefork children.
2. acks_late interacts badly with RabbitMQ's consumer_timeout in some of our environments. run_job has a multi-day time limit, and with acks_late=True the message stays unacked for the task's whole runtime. RabbitMQ's default consumer_timeout is 30 minutes; past that it closes the channel and redelivers the message while the original task is still running. We checked our deployments: the production broker already has consumer_timeout raised well above the task limit, but the demo and staging rabbitmq containers (from docker-compose.staging.yml) run the 30-minute default, which is also exactly where the concurrent-load e2e validation for PSv2 runs. Two options: raise consumer_timeout in the compose file before this lands anywhere, or scope acks_late to the async_api path where the task body is short.
3. Redelivery while a job is STARTED isn't guarded. is_settled() covers final states + CANCELING, so a redelivered message for a STARTED job (consumer timeout above, broker hiccup, worker loss) passes both guards and re-runs MLJob.run concurrently with in-flight work: double collect_images, double NATS enqueue for not-yet-processed images. And since reject_on_worker_lost=True requeues without any delivery cap on the celery/RabbitMQ side, a job that reliably OOMs its worker becomes a redeliver loop. A test for the redelivered-while-STARTED case (or an explicit idempotency argument in the PR body) would settle this.
4. Bookkeeping: this PR closes #1323. It is worth stating explicitly that #1282 (cancelled jobs served by /next) and #1283 (revoke cleanup) remain open and aren't covered here, so they don't get lost. The temporary-hack markers in ami/jobs/views.py that reference them are untouched by this diff.
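
The consumer_timeout interaction reduces to a comparison between task runtime and the broker's timeout. The helper below is a hypothetical sketch of that reasoning only; it does not query RabbitMQ or Celery, and the 30-minute constant is the default quoted in the comment above.

```python
RABBITMQ_DEFAULT_CONSUMER_TIMEOUT_S = 30 * 60  # 30 minutes, as quoted above


def redelivered_mid_run(task_runtime_s, consumer_timeout_s, acks_late=True):
    """True if the broker would time out the unacked delivery before the task ends."""
    if not acks_late:
        # Early ack: the message is acked before the long-running body executes,
        # so there is no long unacked window for the timeout to hit.
        return False
    return task_runtime_s > consumer_timeout_s
```

For example, a 45-minute run_job against a broker on the 30-minute default would be redelivered mid-run, while the same job against a broker whose timeout exceeds the task limit would not.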

Smaller cleanups, low priority: the is_settled() docstring says the mid-bootstrap check uses it, but that check re-implements the predicate inline — worth unifying since centralizing the predicate is the point; and Copilot's note about update_job_status (task_postrun) lacking the same guard symmetry still looks open.

Given how much of this is good, one path worth considering: split the cancel() fix into its own PR and merge it now, and let the dispatch/acks_late half follow once items 1–2 are resolved. Happy to help verify the -O fair behavior on staging.

…gnosis Observation-only follow-up to #1338/#1342. Now that terminal status transitions are irreversible, surface the two cases where a terminal verdict may have been wrong, instead of letting them disappear silently: 1. When work completes for a job the guard finds already terminal/CANCELING, log a warning. Often legitimate (cancel/reaper won the race) but, if frequent, the signal of a premature terminal verdict. 2. When a result is failed because the job's Redis state is missing, log the job age/status/dispatch first. A small age points to a not-yet-seeded or redelivered-run_job race rather than genuine cleanup. No behaviour change — both warnings sit on existing code paths. Lets us confirm the trigger before adding grace/idempotency logic (see PR body follow-up). Refs #1337, #1219, #1324. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…for #1337) (#1343)

* feat(jobs): log premature-terminal and missing-state failures for diagnosis

Observation-only follow-up to #1338/#1342. Now that terminal status transitions are irreversible, surface the two cases where a terminal verdict may have been wrong, instead of letting them disappear silently: 1. When work completes for a job the guard finds already terminal/CANCELING, log a warning. Often legitimate (cancel/reaper won the race) but, if frequent, the signal of a premature terminal verdict. 2. When a result is failed because the job's Redis state is missing, log the job age/status/dispatch first. A small age points to a not-yet-seeded or redelivered-run_job race rather than genuine cleanup. No behaviour change: both warnings sit on existing code paths. Lets us confirm the trigger before adding grace/idempotency logic (see PR body follow-up). Refs #1337, #1219, #1324. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(jobs): downgrade missing-state log to info when the job is already terminal

The missing-state diagnostic logged a WARNING saying 'Failing job' for every in-flight result that arrived after a job finished, but _fail_job no-ops on a terminal job, so after a cancel (which deletes the Redis state) this fired once per in-flight batch and misdescribed normal cleanup as a failure. Now: a terminal job logs at info ('ignoring in-flight result for already-terminal job'); only a NON-terminal job with missing state logs the warning, which is the case actually worth investigating. Refs #1337. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* refactor(jobs): make the diagnostic log lines operator-readable and leaner

The missing-state and completed-after-terminal logs read like insider notes: ticket numbers and race-theory in the runtime message. Move the rationale and the issue reference into code comments and make the log lines plain operational statements an operator can act on without chasing a ticket. Also drop the redundant dispatch_mode field and the extra status re-query. Refs #1337. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* refactor(jobs): address review on missing-state diagnostics

- Treat CANCELING as terminal-like in the missing-state classification so a cancel-in-flight result logs the benign info line instead of the misleading 'still running / marking it failed' warning (matches _fail_job's no-op set). Caught by CodeRabbit and Copilot.
- Rename the values() dict from 'row' to 'job_values' (per review).
- Log the completed-after-terminal case via job.logger and include the stage and attempted terminal state, without an extra status re-query (per Copilot).

Refs #1337. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

mihow · 2026-06-20T00:58:11Z

Claude says: Recommendation now that the #1337 work has landed — this PR is mostly superseded and worth closing once its one unique win is salvaged. Flagging rather than closing since it's your PR.

cancel() rewrite (skip revoke(terminate=True) for ASYNC_API): the genuinely useful part, and it isn't anywhere else. I've folded it into #1342 ("Guard the cancel and signal-handler job-status writes too", the fast-follow on #1338), which now owns cancel() after #1338 ("Stop a finished job from being pulled back to running by a slower worker") and #1342, so the two PRs no longer edit the same method in conflicting ways. Your reasoning carried over verbatim: the local run_job only queues images and has usually finished, so terminate does nothing about the remote ADC work, which is stopped by the NATS/Redis teardown.
CELERY_WORKER_POOL_OPTIMIZATION = "fair": this is a no-op as a Django setting; fair scheduling is the worker CLI flag -O fair (in the worker start script), not a settings value. So it doesn't deliver the queue-unblocking the PR describes.
acks_late / reject_on_worker_lost crash-resume: the one piece left, and it needs gating before it's safe. Demo/staging RabbitMQ consumer_timeout is 30 min vs prod's 7 days, so a run_job longer than 30 min on those boxes gets channel-killed and redelivered, which means duplicate concurrent jobs (and it overlaps the 10-min reaper). The stranded-STARTED recovery it targets is already handled by the reaper today. Suggest splitting acks_late into its own PR gated behind a consumer_timeout fix plus an idempotent initialize_job (a follow-up to #1337, "Saving job progress concurrently is the root of multiple issues related to incorrect job statuses"), or dropping it.
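The gating condition for acks_late described above can be written as a small predicate. This is a sketch under stated assumptions, not code from the repository: `BrokerEnv`, `acks_late_is_safe`, and the 45-minute example runtime are hypothetical. The 30-minute and 7-day consumer_timeout values are the ones quoted in the comment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BrokerEnv:
    # Hypothetical description of one deployment's RabbitMQ settings.
    name: str
    consumer_timeout_s: int


def acks_late_is_safe(
    env: BrokerEnv,
    longest_expected_run_s: int,
    initialize_is_idempotent: bool,
) -> bool:
    """Late acks are only safe if a job cannot outlive the broker's
    consumer_timeout (otherwise the channel is closed and the message is
    redelivered while the first run is still going) and if a redelivered
    run can start again without corrupting job state."""
    fits_in_timeout = longest_expected_run_s < env.consumer_timeout_s
    return fits_in_timeout and initialize_is_idempotent


STAGING = BrokerEnv("staging", consumer_timeout_s=30 * 60)
PROD = BrokerEnv("prod", consumer_timeout_s=7 * 24 * 60 * 60)
```

With a hypothetical 45-minute job, `acks_late_is_safe(STAGING, 45 * 60, True)` is false while the same call against `PROD` is true, which is why the comment asks for the consumer_timeout fix before enabling the option everywhere. The result could then decide whether to pass Celery's `acks_late=True` and `reject_on_worker_lost=True` task options.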

Net: once the skip-terminate fold in #1342 merges, this can be closed. Happy to open the small acks_late-gated PR separately if you want to keep that piece.

…e into cancel

- The reaper (check_stale_jobs) is a 6th terminal-status writer, lock-based, not routed through _guarded_status_update. Correct the docstring's false 'single chokepoint' claim: this helper is the chokepoint for the lock-free writers; _fail_job and the reaper enforce the same no-resurrect invariant under select_for_update (the reaper deliberately keeps a broader from-set so it can still force a stuck CANCELING/UNKNOWN job terminal as last resort).
- Fold the one useful change from #1324: cancel() of an ASYNC_API job now revokes the local run_job WITHOUT terminate. That task only queues images and has usually finished; the remote ADC work is stopped by the NATS/Redis teardown, not by SIGTERM-ing the bootstrap. Sync/internal jobs still terminate.

Refs #1337. Supersedes the cancel rewrite in #1324.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
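The cancel behaviour described in the commit message above can be sketched as follows. This is an illustration only: the `DispatchMode` members other than ASYNC_API, the `cancel_job` function, and its `revoke` and `teardown_async` parameters are hypothetical stand-ins. In a real worker, `revoke` would plausibly be Celery's `app.control.revoke`, which accepts a `terminate` keyword.

```python
from enum import Enum
from typing import Callable


class DispatchMode(str, Enum):
    # Only ASYNC_API is named in the thread; the others are assumed names.
    INTERNAL = "internal"
    SYNC_API = "sync_api"
    ASYNC_API = "async_api"


def should_terminate(mode: DispatchMode) -> bool:
    """Async jobs do their real work remotely, so killing the local
    bootstrap task achieves nothing. Sync/internal jobs still terminate."""
    return mode is not DispatchMode.ASYNC_API


def cancel_job(
    task_id: str,
    mode: DispatchMode,
    revoke: Callable[..., None],
    teardown_async: Callable[[], None],
) -> None:
    # Always revoke so a still-queued task is never started.
    revoke(task_id, terminate=should_terminate(mode))
    if mode is DispatchMode.ASYNC_API:
        # Stand-in for tearing down the NATS stream and Redis state.
        teardown_async()


# Usage with recording stubs in place of Celery and the cleanup helper.
calls: list = []
cancel_job(
    "task-1",
    DispatchMode.ASYNC_API,
    revoke=lambda task_id, terminate: calls.append((task_id, terminate)),
    teardown_async=lambda: calls.append("teardown"),
)
```

Passing the revoke and teardown callables in keeps the decision testable without a broker; the design choice here is for illustration, not a claim about how the project structures cancel().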

Copilot AI review requested due to automatic review settings May 27, 2026 23:16

Copilot started reviewing on behalf of mihow May 27, 2026 23:16

Copilot AI reviewed May 27, 2026

coderabbitai Bot reviewed May 27, 2026

This was referenced Jun 12, 2026

Don't mark images as processed too soon #1312 (Open)
feat(ml): propagate pipeline config through NATS pull-mode tasks #1279 (Open)
fix(jobs): fix dangling jobs from going to revoked #1276 (Open)

mihow added the PSv2 label Jun 16, 2026 (label description: Async & distributed ML backend (PSv2): job state, NATS dispatch, result handling. Umbrella #515.)

mihow mentioned this pull request Jun 20, 2026

Guard the cancel and signal-handler job-status writes too (fast-follow on #1338) #1342 (Draft, 5 tasks)

mihow wants to merge 2 commits into main from fix/celery-stuck-jobs

netlify Bot · May 27, 2026

✅ Deploy Preview for antenna-ssec canceled.

netlify Bot · May 27, 2026

✅ Deploy Preview for antenna-preview canceled.
