You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Several open and closed issues about jobs showing the wrong status, finished jobs being marked failed, and the task queue starving all trace back to one mechanism: concurrent result handlers update a job's progress/status with an unlocked read-modify-write of the whole JSONB blob, so two handlers that read the same snapshot silently lose one of the two updates. This issue is meant to be the single place that names that root cause, separates the bugs it produces into "wrong appearance" vs "actually fails work" (they overlap — see below), and decides on a fix that does not reintroduce the row-lock contention that #1261 removed.
_update_job_progress (ami/jobs/tasks.py) does, per result batch:
read the Job row including the progress JSONB (stages, counts, status),
mutate it in Python (increment processed/failed counts, recompute stage percentages, possibly set a terminal status),
write the whole blob back with job.save(update_fields=["progress", "status", ...]).
Under async_api, many workers POST results concurrently, so step 1→3 interleaves across processes. Classic lost-update: two handlers read the same snapshot, both write their own version, one increment (or one status transition) is lost. The same unlocked full-object save also happens for the logs field and from the task signals.
There was a lock here — _update_job_progress previously used select_for_update. It was removed in #1261 because under concurrent load every result handler serialized on the single job row, causing slow POSTs, timeouts, and NATS redeliveries (contention analysis in #1256). So the fix must make these writes atomic without reintroducing a transaction-length row lock — re-adding select_for_update would trade this race straight back for the contention problem.
Two failure shapes — appearance vs. lost work
The point of this split is to triage: some of these only make the UI lie, some actually destroy or stall work, and a few start as the former and become the latter.
A. Appearance only (work is fine, the status shown is wrong)
False REVOKE of a completed job. A lost final processed == total + SUCCESS write leaves the job at ~99.9% STARTED; the stale-job reaper reads progress.is_complete() == False and REVOKEs a job that actually finished. fix(jobs): fix dangling jobs from going to revoked #1276 mitigates this by having the reaper consult Redis first, but its fallback path still trusts the clobbered JSONB when Redis state is gone.
Status resurrection. Handler A sees completion → writes SUCCESS → fires async-state cleanup (Redis keys deleted); handler B, holding a stale pre-A snapshot with status=STARTED, commits afterward, resurrecting a finished job to STARTED after its Redis state is already gone.
C. Crossover — an appearance bug that turns into a real failure
This is the important one, because it's why "just a display bug" isn't safe to defer: status is also an input to scheduling, not only an output to the UI.
A job left wrongly in STARTED (or any non-terminal state) stays in active_states(), so the worker claim endpoint keeps handing it out. Workers spend cycles on a job that is already done, and newer jobs are starved behind it.
The claim endpoint itself is being reshaped in Add GET /jobs/next action; deprecate list(ids_only=1) claim-semantics #1265 (dedicated GET /jobs/next/, deprecating the list(ids_only=1) claim-hack). The state-machine semantics fixed here and the eligibility filter there need to agree on exactly which states are "claimable", or the crossover reopens.
Prior and in-flight attempts (consolidate, don't duplicate)
One statement; a stale writer can no longer resurrect SUCCESS → STARTED, and the rowcount doubles as an exactly-once "who finalizes" claim. Closes the resurrection arm (B) and, combined with the /next eligibility filter, the crossover (C). Does not by itself fix the lost increment.
2. Atomic counter columns — current preference for the core fix.
Move the contended counters out of the JSONB into integer columns updated with SQL-side arithmetic:
Atomicity lives in the statement; the lock is held for microseconds (statement-scope), a different regime from the transaction-length select_for_update that caused #1261's contention. Completion becomes a conditional update (processed_count >= F("total_count") + the status guard from #1), giving exactly-once finalization with no distributed lock and no dependence on Redis being alive. JSONB keeps the static stage structure; percentages computed on read. Cost: a migration; per-stage live granularity needs extra columns or accepting a coarse overall percentage during the run.
3. Single-writer rendering from the Redis tracker — no migration.
Result handlers stop writing progress (they already maintain atomic per-task state in AsyncJobStateManager); one writer (throttled heartbeat or a periodic task) renders the JSONB from Redis. Exactly-once finalize via a SETNX-style claim. Least code, but keeps Redis load-bearing for correctness — the failure mode #1276 already wrestles with.
4. Compute-on-read.
No progress writes during the run; the API derives live progress from Redis for active async jobs and serves a stored snapshot once terminal. Freshest UI, zero write contention, but couples API latency to Redis and needs airtight fallbacks when keys are absent.
Directions 1 + 2 together remove both race arms with no new infrastructure and no broker/Redis dependency for correctness; 3 or 4 could layer on for per-stage display freshness.
Measure hot-row update cost under realistic result-handler concurrency to confirm statement-scope F() updates don't reintroduce meaningful contention.
Audit every writer of progress/status/logs (_update_job_progress, task signals, reaper, cancel path, JobLogHandler) so none keeps a full-blob save that can clobber the counters.
Summary
Several open and closed issues about jobs showing the wrong status, finished jobs being marked failed, and the task queue starving all trace back to one mechanism: concurrent result handlers update a job's
progress/statuswith an unlocked read-modify-write of the whole JSONB blob, so two handlers that read the same snapshot silently lose one of the two updates. This issue is meant to be the single place that names that root cause, separates the bugs it produces into "wrong appearance" vs "actually fails work" (they overlap — see below), and decides on a fix that does not reintroduce the row-lock contention that #1261 removed.Part of the PSv2 effort (#515).
Root cause
_update_job_progress(ami/jobs/tasks.py) does, per result batch:Jobrow including theprogressJSONB (stages, counts, status),job.save(update_fields=["progress", "status", ...]).Under
async_api, many workers POST results concurrently, so step 1→3 interleaves across processes. Classic lost-update: two handlers read the same snapshot, both write their own version, one increment (or one status transition) is lost. The same unlocked full-object save also happens for thelogsfield and from the task signals.There was a lock here —
_update_job_progresspreviously usedselect_for_update. It was removed in #1261 because under concurrent load every result handler serialized on the single job row, causing slow POSTs, timeouts, and NATS redeliveries (contention analysis in #1256). So the fix must make these writes atomic without reintroducing a transaction-length row lock — re-addingselect_for_updatewould trade this race straight back for the contention problem.Two failure shapes — appearance vs. lost work
The point of this split is to triage: some of these only make the UI lie, some actually destroy or stall work, and a few start as the former and become the latter.
A. Appearance only (work is fine, the status shown is wrong)
STARTED; the status bar can't show the real state. This is the most-reported user-facing face of the bug.B. Actually causes lost / stalled / failed work
processed == total+SUCCESSwrite leaves the job at ~99.9%STARTED; the stale-job reaper readsprogress.is_complete() == Falseand REVOKEs a job that actually finished. fix(jobs): fix dangling jobs from going to revoked #1276 mitigates this by having the reaper consult Redis first, but its fallback path still trusts the clobbered JSONB when Redis state is gone.SUCCESS→ fires async-state cleanup (Redis keys deleted); handler B, holding a stale pre-A snapshot withstatus=STARTED, commits afterward, resurrecting a finished job toSTARTEDafter its Redis state is already gone.max_deliverwithout posting results), bug(jobs/adc): exactly one worker batch's worth of images stranded after long-running jobs #1247 (one worker batch's worth of images stranded after long-running jobs).C. Crossover — an appearance bug that turns into a real failure
This is the important one, because it's why "just a display bug" isn't safe to defer: status is also an input to scheduling, not only an output to the UI.
STARTED(or any non-terminal state) stays inactive_states(), so the worker claim endpoint keeps handing it out. Workers spend cycles on a job that is already done, and newer jobs are starved behind it.CANCELLEDjobs leak through the/nextfilter and starve newerasync_apijobs. Same shape — wrong/again-served status becomes a queue-starvation (effective DoS) bug.GET /jobs/next/, deprecating thelist(ids_only=1)claim-hack). The state-machine semantics fixed here and the eligibility filter there need to agree on exactly which states are "claimable", or the crossover reopens.Prior and in-flight attempts (consolidate, don't duplicate)
fix/job-clobbering) — "Don't overwrite logs & status in concurrent background tasks", pulled from Update status of disconnected and stale jobs #981. This is the most direct existing attempt at the root cause but is currently a stub. Proposal: either flesh out [Draft] Don't overwrite logs & status in concurrent background tasks #1026 against the design below, or close it in favour of a fresh implementation — but the spec should live here.jsonb_setsingle-statement update for thelogsfield; same atomic-write idea, scoped to logs, deferred in Refactor job logging to use separate table #1259.Directions to discuss (ordered by effort, not yet decided)
1. Conditional status transitions — small, worth doing regardless.
One statement; a stale writer can no longer resurrect
SUCCESS → STARTED, and the rowcount doubles as an exactly-once "who finalizes" claim. Closes the resurrection arm (B) and, combined with the/nexteligibility filter, the crossover (C). Does not by itself fix the lost increment.2. Atomic counter columns — current preference for the core fix.
Move the contended counters out of the JSONB into integer columns updated with SQL-side arithmetic:
Atomicity lives in the statement; the lock is held for microseconds (statement-scope), a different regime from the transaction-length
select_for_updatethat caused #1261's contention. Completion becomes a conditional update (processed_count >= F("total_count")+ the status guard from #1), giving exactly-once finalization with no distributed lock and no dependence on Redis being alive. JSONB keeps the static stage structure; percentages computed on read. Cost: a migration; per-stage live granularity needs extra columns or accepting a coarse overall percentage during the run.3. Single-writer rendering from the Redis tracker — no migration.
Result handlers stop writing progress (they already maintain atomic per-task state in
AsyncJobStateManager); one writer (throttled heartbeat or a periodic task) renders the JSONB from Redis. Exactly-once finalize via aSETNX-style claim. Least code, but keeps Redis load-bearing for correctness — the failure mode #1276 already wrestles with.4. Compute-on-read.
No progress writes during the run; the API derives live progress from Redis for active async jobs and serves a stored snapshot once terminal. Freshest UI, zero write contention, but couples API latency to Redis and needs airtight fallbacks when keys are absent.
Directions 1 + 2 together remove both race arms with no new infrastructure and no broker/Redis dependency for correctness; 3 or 4 could layer on for per-stage display freshness.
What we still need to verify
TransactionTestCase(the fix(jobs): fix dangling jobs from going to revoked #1276 suite has a clobber repro to build from) and confirm direction 2 closes it.F()updates don't reintroduce meaningful contention.progress/status/logs(_update_job_progress, task signals, reaper, cancel path,JobLogHandler) so none keeps a full-blob save that can clobber the counters./next(Add GET /jobs/next action; deprecate list(ids_only=1) claim-semantics #1265) matches the terminal/active semantics decided here, so the crossover (C) can't reopen.