Skip to content

Saving job progress concurrently is the root of multiple issues related to incorrect job statuses #1337

@mihow

Description

@mihow

Summary

Several open and closed issues about jobs showing the wrong status, finished jobs being marked failed, and the task queue starving all trace back to one mechanism: concurrent result handlers update a job's progress/status with an unlocked read-modify-write of the whole JSONB blob, so two handlers that read the same snapshot silently lose one of the two updates. This issue is meant to be the single place that names that root cause, separates the bugs it produces into "wrong appearance" vs "actually fails work" (they overlap — see below), and decides on a fix that does not reintroduce the row-lock contention that #1261 removed.

Part of the PSv2 effort (#515).

Root cause

_update_job_progress (ami/jobs/tasks.py) does, per result batch:

  1. read the Job row including the progress JSONB (stages, counts, status),
  2. mutate it in Python (increment processed/failed counts, recompute stage percentages, possibly set a terminal status),
  3. write the whole blob back with job.save(update_fields=["progress", "status", ...]).

Under async_api, many workers POST results concurrently, so step 1→3 interleaves across processes. Classic lost-update: two handlers read the same snapshot, both write their own version, one increment (or one status transition) is lost. The same unlocked full-object save also happens for the logs field and from the task signals.

There was a lock here — _update_job_progress previously used select_for_update. It was removed in #1261 because under concurrent load every result handler serialized on the single job row, causing slow POSTs, timeouts, and NATS redeliveries (contention analysis in #1256). So the fix must make these writes atomic without reintroducing a transaction-length row lock — re-adding select_for_update would trade this race straight back for the contention problem.

Two failure shapes — appearance vs. lost work

The point of this split is to triage: some of these only make the UI lie, some actually destroy or stall work, and a few start as the former and become the latter.

A. Appearance only (work is fine, the status shown is wrong)

B. Actually causes lost / stalled / failed work

C. Crossover — an appearance bug that turns into a real failure

This is the important one, because it's why "just a display bug" isn't safe to defer: status is also an input to scheduling, not only an output to the UI.

  • A job left wrongly in STARTED (or any non-terminal state) stays in active_states(), so the worker claim endpoint keeps handing it out. Workers spend cycles on a job that is already done, and newer jobs are starved behind it.
  • CANCELLED jobs leak through /next filter, starve newer async_api jobs #1282 is exactly this failure with a different trigger: CANCELLED jobs leak through the /next filter and starve newer async_api jobs. Same shape — wrong/again-served status becomes a queue-starvation (effective DoS) bug.
  • The claim endpoint itself is being reshaped in Add GET /jobs/next action; deprecate list(ids_only=1) claim-semantics #1265 (dedicated GET /jobs/next/, deprecating the list(ids_only=1) claim-hack). The state-machine semantics fixed here and the eligibility filter there need to agree on exactly which states are "claimable", or the crossover reopens.

Prior and in-flight attempts (consolidate, don't duplicate)

Directions to discuss (ordered by effort, not yet decided)

1. Conditional status transitions — small, worth doing regardless.

Job.objects.filter(pk=job_id, status__in=JobState.active_states()).update(status=final_state)

One statement; a stale writer can no longer resurrect SUCCESS → STARTED, and the rowcount doubles as an exactly-once "who finalizes" claim. Closes the resurrection arm (B) and, combined with the /next eligibility filter, the crossover (C). Does not by itself fix the lost increment.

2. Atomic counter columns — current preference for the core fix.
Move the contended counters out of the JSONB into integer columns updated with SQL-side arithmetic:

Job.objects.filter(pk=job_id).update(processed_count=F("processed_count") + n)

Atomicity lives in the statement; the lock is held for microseconds (statement-scope), a different regime from the transaction-length select_for_update that caused #1261's contention. Completion becomes a conditional update (processed_count >= F("total_count") + the status guard from #1), giving exactly-once finalization with no distributed lock and no dependence on Redis being alive. JSONB keeps the static stage structure; percentages computed on read. Cost: a migration; per-stage live granularity needs extra columns or accepting a coarse overall percentage during the run.

3. Single-writer rendering from the Redis tracker — no migration.
Result handlers stop writing progress (they already maintain atomic per-task state in AsyncJobStateManager); one writer (throttled heartbeat or a periodic task) renders the JSONB from Redis. Exactly-once finalize via a SETNX-style claim. Least code, but keeps Redis load-bearing for correctness — the failure mode #1276 already wrestles with.

4. Compute-on-read.
No progress writes during the run; the API derives live progress from Redis for active async jobs and serves a stored snapshot once terminal. Freshest UI, zero write contention, but couples API latency to Redis and needs airtight fallbacks when keys are absent.

Directions 1 + 2 together remove both race arms with no new infrastructure and no broker/Redis dependency for correctness; 3 or 4 could layer on for per-stage display freshness.

What we still need to verify

  • Reproduce the lost-increment interleaving in a TransactionTestCase (the fix(jobs): fix dangling jobs from going to revoked #1276 suite has a clobber repro to build from) and confirm direction 2 closes it.
  • Measure hot-row update cost under realistic result-handler concurrency to confirm statement-scope F() updates don't reintroduce meaningful contention.
  • Audit every writer of progress/status/logs (_update_job_progress, task signals, reaper, cancel path, JobLogHandler) so none keeps a full-blob save that can clobber the counters.
  • Confirm the claimable-state set used by /next (Add GET /jobs/next action; deprecate list(ids_only=1) claim-semantics #1265) matches the terminal/active semantics decided here, so the crossover (C) can't reopen.

Metadata

Metadata

Assignees

No one assigned

    Labels

    PSv2Async & distributed ML backend (PSv2): job state, NATS dispatch, result handling. Umbrella #515.

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions