Skip to content

bug(jobs/adc): exactly one worker batch's worth of images stranded after long-running jobs #1247

@mihow

Description

@mihow

Symptom

Long-running ML jobs (>5 min observed) consistently end with exactly worker.batch_size images stuck in Redis pending sets — 8 if the ADC worker is configured for batch_size=8, 16 if 16. Short jobs (<5 min) don't show this.

Most recent example: demo job 88 (project 5), 544 images total, 16 stranded after the NATS stream drained. PR #1244's reconciler (now live on demo) marked them failed at the next jobs_health_check tick (~22 min idle), and the job landed in SUCCESS (16/544 = 2.9% < FAILURE_THRESHOLD):

Update: I am also seeing cases where just 1 image remains, instead of the batch size.

[04:30:00] WARNING jobs_health_check: marked 16 image(s) as failed (job idle past cutoff;
  NATS consumer num_pending=0 num_ack_pending=0 num_redelivered=16). IDs: [...]

num_redelivered=16 matching batch size = strong signal that one full ADC pull was the unit lost. Without #1244, check_stale_jobs would have REVOKED the job and lost 528 successful results.

Why this matters

PR #1244 is the right safety net but the underlying bug is deterministic. A job big enough to trigger it loses one batch; failure rate scales with 1/N images per job — invisible at small scale, painful at production scale. We should fix the root cause.

Refined hypothesis (based on code reading, not yet measured)

ADC's last in-flight batch is abandoned by the result-POST thread pool when it stalls or fails, leaving the NATS messages un-ACK'd because the ACK chain (Antenna SREM → ACK) only fires on a successful POST landing.

Architectural facts established by investigation:

  1. ADC does not poll NATS state. The _worker_loop in worker.py:93-129 is an unconditional while True. The pull loop terminates via the DataLoader iterator (datasets.py:275-290), which breaks when Antenna's /jobs/{id}/tasks/ REST endpoint returns an empty batch — not from NATS num_pending inspection.
  2. ADC is not the ACKer. _process_batch calls result_poster.post_async() (worker.py:501), which queues an HTTP POST to a thread pool with backpressure (max 5 pending) and returns immediately. Antenna receives the POST, runs process_nats_pipeline_result, and only then sends the NATS ACK on the reply_subject (tasks.py:292).
  3. The ACK chain is gated on the POST landing. Order in process_nats_pipeline_result: parse → SREM pending:process → save detections → SREM pending:results → progress update → ACK. If the POST itself never lands at Antenna, none of those steps run for that batch.
  4. The last-batch handoff is the smoking gun. When /jobs/{id}/tasks/ returns empty, the DataLoader iterator breaks and _process_job calls result_poster.wait_for_all_posts() with a bounded timeout (~210s = 60 + 5*30). If any pending POST in the thread pool times out, errors out, or is force-cancelled (result_posting.py:114-122), the batch is abandoned silently — the worker logs "Done, detections:..." at INFO and moves on. NATS never sees the ACK; messages enter num_ack_pending, get redelivered up to max_deliver, then sit there until cleanup.

This explains the "long jobs only" pattern: more batches → more chances for one to hit a transient HTTP blip → exactly-one-batch-worth of images stranded. Short jobs finish before the probability accumulates.

Verification gaps

The above is from code reading, not measurement. To confirm:

  • Reproduce on demand (dispatch a 100+ image job, force a transient error in Antenna's /result/ endpoint mid-job, confirm exactly batch_size remains)
  • Search ADC worker logs for the wait_for_all_posts timeout / result_posting cancellation lines around the moment of "Done, detections:..." in a known-stranded job
  • Confirm the abandoned POST does not retry (whether result_poster has retry logic, and if so, what its budget is)
  • Determine whether Antenna logs anything when the POST fails to land vs. when it lands but ACK fails — the diagnostic loudness matters for next-time triage

Fix directions (rank-ordered, no decision yet)

  1. Make wait_for_all_posts retries explicit and bounded. If a POST fails after retries, the batch should be marked "POST failure" in ADC's own logs at WARNING+ so the failure is at least visible. (Cheapest, doesn't reach NATS at all.)
  2. Have ADC's worker shutdown explicitly NACK or terminate any in-flight messages so NATS gets a fast redeliver instead of waiting ack_wait × max_deliver. (Requires NATS-aware shutdown path in ADC.)
  3. Have Antenna track POST receipt and emit a periodic "POST never landed for image X" log distinct from the existing "Job state keys not found" path. (Helps diagnose, doesn't fix.)

PR #1244's reconciler stays as the safety net regardless of which fix lands.

Related

Metadata

Metadata

Assignees

No one assigned

    Labels

    PSv2Async & distributed ML backend (PSv2): job state, NATS dispatch, result handling. Umbrella #515.bugSomething isn't working

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions