Saving job progress concurrently is the root of multiple issues related to incorrect job statuses

## Summary

Several open and closed issues about jobs showing the wrong status, finished jobs being marked failed, and the task queue starving all trace back to one mechanism: concurrent result handlers update a job's `progress`/`status` with an unlocked read-modify-write of the whole JSONB blob, so two handlers that read the same snapshot silently lose one of the two updates. This issue is meant to be the single place that names that root cause, separates the bugs it produces into "wrong appearance" vs "actually fails work" (they overlap — see below), and decides on a fix that does not reintroduce the row-lock contention that #1261 removed.

Part of the PSv2 effort (#515).

## Root cause

`_update_job_progress` (`ami/jobs/tasks.py`) does, per result batch:

1. read the `Job` row including the `progress` JSONB (stages, counts, status),
2. mutate it in Python (increment processed/failed counts, recompute stage percentages, possibly set a terminal status),
3. write the whole blob back with `job.save(update_fields=["progress", "status", ...])`.

Under `async_api`, many workers POST results concurrently, so step 1→3 interleaves across processes. Classic lost-update: two handlers read the same snapshot, both write their own version, one increment (or one status transition) is lost. The same unlocked full-object save also happens for the `logs` field and from the task signals.

There was a lock here — `_update_job_progress` previously used `select_for_update`. It was removed in #1261 because under concurrent load every result handler serialized on the single job row, causing slow POSTs, timeouts, and NATS redeliveries (contention analysis in #1256). So **the fix must make these writes atomic without reintroducing a transaction-length row lock** — re-adding `select_for_update` would trade this race straight back for the contention problem.

## Two failure shapes — appearance vs. lost work

The point of this split is to triage: some of these only make the UI lie, some actually destroy or stall work, and a few start as the former and *become* the latter.

### A. Appearance only (work is fine, the status shown is wrong)

- **#721** — job has finished (or errored), but overall status and/or per-stage status still reads `STARTED`; the status bar can't show the real state. This is the most-reported user-facing face of the bug.
- **#1155** — log says "Finished job" when only the first stage finished.
- Progress bar stuck at ~99.9% on a job that actually completed (the lost final-increment interleaving).
- Counts that visibly go up and down during a run (older sighting: #370, closed 2023 — same mechanism).

### B. Actually causes lost / stalled / failed work

- **False REVOKE of a completed job.** A lost final `processed == total` + `SUCCESS` write leaves the job at ~99.9% `STARTED`; the stale-job reaper reads `progress.is_complete() == False` and REVOKEs a job that actually finished. #1276 mitigates this by having the reaper consult Redis first, but its fallback path still trusts the clobbered JSONB when Redis state is gone.
- **Status resurrection.** Handler A sees completion → writes `SUCCESS` → fires async-state cleanup (Redis keys deleted); handler B, holding a stale pre-A snapshot with `status=STARTED`, commits afterward, resurrecting a finished job to `STARTED` after its Redis state is already gone.
- Adjacent stalls that share the "job stuck in a non-terminal state" surface: **#1168** (jobs hang forever when NATS tasks exhaust `max_deliver` without posting results), **#1247** (one worker batch's worth of images stranded after long-running jobs).

### C. Crossover — an appearance bug that turns into a real failure

This is the important one, because it's why "just a display bug" isn't safe to defer: **status is also an input to scheduling, not only an output to the UI.**

- A job left wrongly in `STARTED` (or any non-terminal state) stays in `active_states()`, so the worker claim endpoint keeps handing it out. Workers spend cycles on a job that is already done, and **newer jobs are starved** behind it.
- #1282 is exactly this failure with a different trigger: `CANCELLED` jobs leak through the `/next` filter and starve newer `async_api` jobs. Same shape — wrong/again-served status becomes a queue-starvation (effective DoS) bug.
- The claim endpoint itself is being reshaped in **#1265** (dedicated `GET /jobs/next/`, deprecating the `list(ids_only=1)` claim-hack). The state-machine semantics fixed here and the eligibility filter there need to agree on exactly which states are "claimable", or the crossover reopens.

## Prior and in-flight attempts (consolidate, don't duplicate)

- **#981** (closed) — added a periodic check that re-derives status for incomplete jobs. This treats the symptom downstream (sweep and correct) rather than preventing the clobber. #721 is still open, so it did not fully resolve it.
- **#1026** (open, draft, branch `fix/job-clobbering`) — "Don't overwrite logs & status in concurrent background tasks", pulled from #981. This is the most direct existing attempt at the root cause but is currently a stub. **Proposal: either flesh out #1026 against the design below, or close it in favour of a fresh implementation — but the spec should live here.**
- **#1256** (option D) — `jsonb_set` single-statement update for the `logs` field; same atomic-write idea, scoped to logs, deferred in #1259.
- **#1285** — diagnostic triangulation across DB/Redis/NATS; names this race as the root cause of #1276 but scopes itself to forensics, not the fix.
- Mitigations touching the same hot path: **#1276**, **#1324**, **#1287** (savepoint around the log insert), **#1288** (defer aggregate count refresh).

## Directions to discuss (ordered by effort, not yet decided)

**1. Conditional status transitions — small, worth doing regardless.**
```python
Job.objects.filter(pk=job_id, status__in=JobState.active_states()).update(status=final_state)
```
One statement; a stale writer can no longer resurrect `SUCCESS → STARTED`, and the rowcount doubles as an exactly-once "who finalizes" claim. Closes the resurrection arm (B) and, combined with the `/next` eligibility filter, the crossover (C). Does not by itself fix the lost *increment*.

**2. Atomic counter columns — current preference for the core fix.**
Move the contended counters out of the JSONB into integer columns updated with SQL-side arithmetic:
```python
Job.objects.filter(pk=job_id).update(processed_count=F("processed_count") + n)
```
Atomicity lives in the statement; the lock is held for microseconds (statement-scope), a different regime from the transaction-length `select_for_update` that caused #1261's contention. Completion becomes a conditional update (`processed_count >= F("total_count")` + the status guard from #1), giving exactly-once finalization with no distributed lock and no dependence on Redis being alive. JSONB keeps the static stage structure; percentages computed on read. Cost: a migration; per-stage live granularity needs extra columns or accepting a coarse overall percentage during the run.

**3. Single-writer rendering from the Redis tracker — no migration.**
Result handlers stop writing progress (they already maintain atomic per-task state in `AsyncJobStateManager`); one writer (throttled heartbeat or a periodic task) renders the JSONB from Redis. Exactly-once finalize via a `SETNX`-style claim. Least code, but keeps Redis load-bearing for correctness — the failure mode #1276 already wrestles with.

**4. Compute-on-read.**
No progress writes during the run; the API derives live progress from Redis for active async jobs and serves a stored snapshot once terminal. Freshest UI, zero write contention, but couples API latency to Redis and needs airtight fallbacks when keys are absent.

Directions 1 + 2 together remove both race arms with no new infrastructure and no broker/Redis dependency for correctness; 3 or 4 could layer on for per-stage display freshness.

## What we still need to verify

- Reproduce the lost-increment interleaving in a `TransactionTestCase` (the #1276 suite has a clobber repro to build from) and confirm direction 2 closes it.
- Measure hot-row update cost under realistic result-handler concurrency to confirm statement-scope `F()` updates don't reintroduce meaningful contention.
- Audit every writer of `progress`/`status`/`logs` (`_update_job_progress`, task signals, reaper, cancel path, `JobLogHandler`) so none keeps a full-blob save that can clobber the counters.
- Confirm the claimable-state set used by `/next` (#1265) matches the terminal/active semantics decided here, so the crossover (C) can't reopen.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Saving job progress concurrently is the root of multiple issues related to incorrect job statuses #1337

Summary

Root cause

Two failure shapes — appearance vs. lost work

A. Appearance only (work is fine, the status shown is wrong)

B. Actually causes lost / stalled / failed work

C. Crossover — an appearance bug that turns into a real failure

Prior and in-flight attempts (consolidate, don't duplicate)

Directions to discuss (ordered by effort, not yet decided)

What we still need to verify

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Saving job progress concurrently is the root of multiple issues related to incorrect job statuses #1337

Description

Summary

Root cause

Two failure shapes — appearance vs. lost work

A. Appearance only (work is fine, the status shown is wrong)

B. Actually causes lost / stalled / failed work

C. Crossover — an appearance bug that turns into a real failure

Prior and in-flight attempts (consolidate, don't duplicate)

Directions to discuss (ordered by effort, not yet decided)

What we still need to verify

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions