Improve the filter stage during job preparation, fix for large capture sets #1322
filter_processed_images() iterated images one at a time, issuing 3-5 ORM round trips per image against Detection / Classification. On a ~147k-image collection that produces ~500k single-row queries inside one Celery task, which silences the AMQP heartbeat long enough that the reaper flips the job from STARTED to REVOKED before any forward progress is reported. Rewrite to chunk the input via itertools.islice and run at most two bulk queries per chunk: one over Detection rows, one over Classification rows for the real detections in that chunk. Semantics for all five branches (no detections, null-only, unclassified real, partial classifier coverage, fully classified) are preserved. Adds tests covering empty input, mixed batches across all branches, an assertion that query count scales with batch count rather than image count, and ordering preservation across batch boundaries. Fixes #1321. Co-Authored-By: Claude <noreply@anthropic.com>
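The chunk-then-bulk-query shape described in this commit can be sketched without Django. This is a minimal illustration, not the code in `ami/ml/models/pipeline.py`: `chunked`, `filter_unprocessed`, and `fake_bulk_lookup` are hypothetical names, and the fake lookup stands in for a single bulk ORM query per chunk.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator


def chunked(items: Iterable[int], batch_size: int) -> Iterator[list[int]]:
    """Yield consecutive lists of at most batch_size items, preserving order."""
    iterator = iter(items)
    while batch := list(islice(iterator, batch_size)):
        yield batch


def filter_unprocessed(
    image_ids: Iterable[int],
    bulk_lookup: Callable[[list[int]], set[int]],
    batch_size: int = 1000,
) -> Iterator[int]:
    """Yield ids that still need processing, using one bulk lookup per chunk."""
    for batch in chunked(image_ids, batch_size):
        already_done = bulk_lookup(batch)  # stands in for one bulk query
        for image_id in batch:
            if image_id not in already_done:
                yield image_id


# Hypothetical stand-in for the database: even ids count as processed.
calls: list[int] = []


def fake_bulk_lookup(batch: list[int]) -> set[int]:
    calls.append(len(batch))  # record one "query" per chunk
    return {i for i in batch if i % 2 == 0}


remaining = list(filter_unprocessed(range(10), fake_bulk_lookup, batch_size=4))
```

Ten inputs at `batch_size=4` produce three lookups rather than ten, which is the property the query-count test in this PR is described as locking in: cost scales with the number of batches, not the number of images.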
… dicts
Takeaway-review followups on the bulk filter_processed_images rewrite:
- Strip Detection's Meta.ordering on the bulk select via .order_by(). The per-batch query was sorting all matching rows on (frame_num, timestamp) before returning, wasted work since results land in a dict.
- Extract bbox_is_null() helper in ami.main.models next to NULL_DETECTIONS_FILTER so the SQL filter and the in-memory check share a single source of truth.
- Rename images_with_any_detection -> images_with_pipeline_detection and real_detections_per_image -> real_pipeline_detections_per_image to make clear the dicts only cover detections from this pipeline's algorithms.
Co-Authored-By: Claude <noreply@anthropic.com>
📝 Walkthrough
This PR addresses collection-preparation performance bottlenecks by replacing per-image ORM queries with batch-based DB queries in …
Changes: Collection Preparation and NATS Queueing Performance
collect_images() returned a SourceImage queryset without select_related on the deployment/data_source FK chain. Downstream queue_images_to_nats() calls image.url() per image, which traverses self.deployment.data_source — two FK lookups per call. On a ~97k-image collection that's ~4.6ms × 97k = ~7 minutes of pure ORM round-trips before NATS publishes start, well inside the 10-minute reaper window (see issue #1321 follow-up comment). Add .select_related("deployment__data_source") to both queryset branches in collect_images so the joins ride along on the initial SourceImage fetch and image.url() in the queue loop stays a pure-Python operation. Test asserts that accessing .deployment.data_source on the returned rows triggers zero additional queries. Co-Authored-By: Claude <noreply@anthropic.com>
The pre-NATS prep loop in queue_images_to_nats() called image.url() twice per iteration — once in the truthy conditional, once for assignment. image.url() goes through SourceImage.public_url() which touches deployment and data_source; doubling it doubles the FK chain cost. On ~97k-image collections the redundant call alone adds ~7 minutes (see issue #1321 follow-up). Cache the call in a local once and drop the now-redundant hasattr/url double-check. The companion select_related in collect_images keeps the single remaining call cheap. Co-Authored-By: Claude <noreply@anthropic.com>
Each publish_task awaits a JetStream ack round-trip (~1.3ms measured on Serbia 2026-05-27 for the 105k-image incident-shape collection). Sequential awaits stack linearly, so 105k images took 139s and 700k would push past the 10-min reaper threshold on its own. Switch the inner loop to chunked asyncio.gather batches of 200 so the client can pipeline ack roundtrips. _ensure_stream / _ensure_consumer are hoisted out of the loop and pre-warmed once on entry — they were already cache-skipped after first call (nats_queue.py:310,358) but each call still serialised inside its publish coroutine. JetStream still assigns ack.seq atomically server-side, so messages remain ordered in the stream by arrival; only the order *within* a fanout chunk is non-deterministic from the caller's POV. ADC consumers fetch in seq order and treat each image-task independently — no per-image ordering dependency. Adds 5 unit tests in ami/ml/orchestration/tests/test_jobs.py mocking TaskQueueManager: warm-up count, chunk-boundary publish count, partial publish failure path, stream-setup failure short-circuit, and a sentinel test locking the chunk constant. Co-Authored-By: Claude <noreply@anthropic.com>
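The chunked fan-out described above can be sketched with plain `asyncio`. This is an illustration under stated assumptions, not the orchestrator's code: `fake_publish` is a hypothetical stand-in for a publish that awaits a server ack, and the failure rule is invented so the partial-failure path has something to count.

```python
import asyncio

CHUNK_SIZE = 200  # mirrors the chunk size quoted above; illustrative only


async def fake_publish(task_id: int) -> bool:
    """Hypothetical publish: waits for a simulated ack, then reports success."""
    await asyncio.sleep(0.001)  # simulated ack round-trip
    return task_id % 250 != 0   # pretend every 250th publish fails


async def publish_all(task_ids: list[int], chunk_size: int = CHUNK_SIZE) -> tuple[int, int]:
    """Publish in bounded chunks; return (succeeded, failed) counts."""
    succeeded = failed = 0
    for start in range(0, len(task_ids), chunk_size):
        chunk = task_ids[start:start + chunk_size]
        # return_exceptions=True collects exceptions as results instead of
        # raising the first one, so every publish in the chunk is accounted for.
        results = await asyncio.gather(
            *(fake_publish(t) for t in chunk), return_exceptions=True
        )
        for result in results:
            if result is True:
                succeeded += 1
            else:
                failed += 1
    return succeeded, failed


succeeded, failed = asyncio.run(publish_all(list(range(1, 501))))
```

`asyncio.gather` returns results in the order the awaitables were passed in, even though the coroutines inside one chunk may complete in any order. Chunks themselves run strictly one after another, which matches the ordering argument in the commit message: non-determinism is confined to a single chunk.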
Pull request overview
This PR addresses a large-collection performance bottleneck in ML job preparation (Collect → filter → queue) that previously caused jobs to appear stalled and get reaped, by reducing per-image ORM work and improving NATS publish throughput.
Changes:
- Rewrite `filter_processed_images` to bulk-query detections/classifications in batches instead of per-image ORM calls.
- Ensure `collect_images` returns `SourceImage` rows with `deployment__data_source` joined to avoid N+1s during URL generation.
- Speed up NATS queueing by caching `image.url()` per iteration and publishing JetStream messages concurrently in bounded chunks.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| ami/ml/models/pipeline.py | Batch/bulk implementation of filter_processed_images and select_related("deployment__data_source") in collect_images. |
| ami/ml/orchestration/jobs.py | Cache image.url() once per image and fan out JetStream publishes via chunked asyncio.gather. |
| ami/main/models.py | Add bbox_is_null() helper to share null-bbox semantics between ORM and in-memory logic. |
| ami/ml/tests.py | Add tests for bounded query counts, batch order preservation, and collect_images prefetch behavior. |
| ami/ml/orchestration/tests/test_jobs.py | Add unit tests for fanout chunking/warm-up behavior and failure handling in queue_images_to_nats. |
…mages filter_processed_images now accepts optional `job` and `total` kwargs and emits a fractional `collect` progress update at most once per COLLECT_PROGRESS_SAVE_INTERVAL_SECONDS (5s wall time), capped at COLLECT_PROGRESS_MAX_FRACTION (0.99). collect_images wires both through. Reaper at ami/jobs/tasks.py:929 triggers on (status in running_states) AND (updated_at < cutoff, default 10min). Before this commit, the collect stage held a worker for several seconds to ~minutes on large collections without ever saving the Job row, so updated_at went stale and the reaper revoked otherwise-healthy jobs. The throttled save now keeps updated_at fresh. Cap at 0.99 so the caller's terminal status=SUCCESS, progress=1 transition still owns the final flip. Legacy callers without `job` see zero job.save() calls — the throttle block is fully gated on both new kwargs. Two new tests in ami/ml/tests.py: - mocked time.monotonic verifies the >=5s gate fires exactly the expected number of times across a multi-chunk run. - omitting `job` keeps the legacy code path silent. Co-Authored-By: Claude <noreply@anthropic.com>
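The throttle-and-cap behaviour described in this commit can be sketched as a small helper. This is a hedged illustration, not the PR's implementation: `ThrottledProgress` and `FakeClock` are hypothetical names, the two constants are assumed to mirror `COLLECT_PROGRESS_SAVE_INTERVAL_SECONDS` and `COLLECT_PROGRESS_MAX_FRACTION`, and `save` stands in for whatever persists the job row.

```python
import time
from typing import Callable

SAVE_INTERVAL_SECONDS = 5.0  # assumed to mirror COLLECT_PROGRESS_SAVE_INTERVAL_SECONDS
MAX_FRACTION = 0.99          # assumed to mirror COLLECT_PROGRESS_MAX_FRACTION


class ThrottledProgress:
    """Call save(fraction) at most once per interval, never reporting 1.0."""

    def __init__(self, total: int, save: Callable[[float], None],
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.total = total
        self.save = save
        self.clock = clock
        self.last_save = clock()

    def update(self, processed: int) -> None:
        now = self.clock()
        if now - self.last_save < SAVE_INTERVAL_SECONDS:
            return  # too soon since the last save
        self.last_save = now
        fraction = processed / self.total if self.total else 0.0
        self.save(min(fraction, MAX_FRACTION))


class FakeClock:
    """Deterministic clock for the demo: advances by `step` on every reading."""

    def __init__(self, step: float) -> None:
        self.now = 0.0
        self.step = step

    def __call__(self) -> float:
        value = self.now
        self.now += self.step
        return value


saved: list[float] = []
progress = ThrottledProgress(total=10, save=saved.append, clock=FakeClock(step=2.0))
for done in range(1, 11):
    progress.update(done)
```

With the fake clock advancing 2 s per reading, only every third update crosses the 5 s gate, so ten updates produce three saves. Capping below 1.0 leaves the terminal `progress=1` transition to the caller, as the commit message describes.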
Three small fixes flagged on PR #1322:
- Typo in the debug log message ("do yet have" → "do not yet have").
- Reword the `collect_images` source-selection comment so it accurately describes when the `deployment__data_source` join is applied: the `source_images` branch hands the caller's iterable through unchanged.
- Fail fast on non-positive `batch_size` instead of silently filtering out every image when an `islice` of 0/negative returns an empty batch.
Co-Authored-By: Claude <noreply@anthropic.com>
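Why a zero `batch_size` silently drops everything can be seen in a few lines. This is a sketch with hypothetical helper names (`chunk_unchecked`, `chunk_checked`), not the PR's code. It demonstrates only the zero case; in CPython a negative stop makes `itertools.islice` itself raise `ValueError`.

```python
from itertools import islice
from typing import Iterable, Iterator


def chunk_unchecked(items: Iterable[int], batch_size: int) -> Iterator[list[int]]:
    """Chunk without validation: a zero batch_size ends the loop immediately."""
    iterator = iter(items)
    while batch := list(islice(iterator, batch_size)):
        yield batch


def chunk_checked(items: Iterable[int], batch_size: int) -> Iterator[list[int]]:
    """Validate eagerly (this is a plain function, not a generator), then chunk."""
    if batch_size <= 0:
        raise ValueError(f"batch_size must be positive, got {batch_size}")
    return chunk_unchecked(items, batch_size)


# islice(iterator, 0) yields nothing, so the first batch is empty and falsy.
silently_empty = list(chunk_unchecked([1, 2, 3], 0))

try:
    chunk_checked([1, 2, 3], 0)
    failed_fast = False
except ValueError:
    failed_fast = True
```

Keeping the check in a non-generator wrapper means the error surfaces at call time rather than on the first `next()`, which is one way to get the fail-fast behaviour the commit describes.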
…Manager Per Copilot review on PR #1322: calling _ensure_stream / _ensure_consumer directly from the orchestrator coupled queue_images_to_nats to private TaskQueueManager internals. Adds a thin public wrapper that calls both, keeps the per-instance "already-warmed" caches behind the manager's API, and gives future refactors a single seam to evolve. Switches the publish-loop warm-up in queue_images_to_nats to the new public method and updates the fanout unit tests accordingly. Co-Authored-By: Claude <noreply@anthropic.com>
Follow-up to the takeaway review on PR #1322:
- Drop `test_filter_processed_images_preserves_input_order_across_batches`: `test_filter_processed_images_mixed_batch` already asserts the output order with `[unprocessed, unclassified]`. There is no shuffle/reorder path in the function for a cross-batch ordering test to exercise.
- Drop `test_default_chunk_size_is_explicit`: locking the constant value catches no behavior regression; the explanatory comment at `ami/ml/orchestration/jobs.py:14-19` is the real documentation.
- Replace the `iter([...])` + `next(...)` clock stub in `test_filter_processed_images_emits_throttled_collect_progress` with a counter-based stub. The previous pattern would crash with `StopIteration` if a future change in `filter_processed_images` added another `time.monotonic()` call; the counter form fails with a clear cadence-mismatch assertion instead.
No coverage lost.
Co-Authored-By: Claude <noreply@anthropic.com>
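The difference between the two clock stubs can be shown on a toy loop. This is an illustrative sketch only: `count_saves` and `CounterClock` are hypothetical and much simpler than the real test, but the failure mode of the finite iterator stub is the same.

```python
import time
from unittest import mock


def count_saves(chunks: int, interval: float = 5.0) -> int:
    """Toy loop: reads the clock once at start and once per chunk."""
    last = time.monotonic()
    saves = 0
    for _ in range(chunks):
        now = time.monotonic()
        if now - last >= interval:
            saves += 1
            last = now
    return saves


class CounterClock:
    """Clock stub that never runs out: each call advances by a fixed step."""

    def __init__(self, step: float) -> None:
        self.calls = 0
        self.step = step

    def __call__(self) -> float:
        value = self.calls * self.step
        self.calls += 1
        return value


clock = CounterClock(step=3.0)
with mock.patch("time.monotonic", clock):
    saves = count_saves(chunks=4)

# A finite iterator stub breaks as soon as the code reads the clock more
# often than the list anticipated.
finite = iter([0.0, 3.0])
iterator_stub_error = None
with mock.patch("time.monotonic", lambda: next(finite)):
    try:
        count_saves(chunks=4)
    except StopIteration as exc:
        iterator_stub_error = type(exc).__name__
```

With the counter stub, an extra clock read changes the number of saves, so a cadence assertion fails with a readable message. With the iterator stub, the same change surfaces as a bare `StopIteration` from inside the code under test.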
Actionable comments posted: 3
🧹 Nitpick comments (1)
ami/ml/orchestration/jobs.py (1)
103-107: 💤 Low value. Consider whether the `hasattr` check is necessary.
The `hasattr(image, "url")` guard is defensive, but `SourceImage` should always have a `url()` method since `images` is typed as `list[SourceImage]`. If there's a specific case where the method might be missing (e.g., a proxy object or a mock), the guard is justified; otherwise, simplifying to `image_url = image.url()` would be cleaner.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@ami/ml/orchestration/jobs.py` around lines 103 - 107, The hasattr guard around image.url() is unnecessary because images is typed as list[SourceImage]; replace the conditional assignment with a direct call (image_url = image.url()) in the loop where image_url is defined, and remove the comment justifying the guard; if there are tests or mocks that caused the guard to be added, instead update those tests/mocks to implement url() rather than keeping the runtime check (see the collect_images() path and the SourceImage type for where to validate).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@ami/ml/models/pipeline.py`:
- Around line 112-114: The bug is that pipeline_classifier_ids being empty makes
pipeline_classifier_ids.issubset(classifier_ids_per_detection) vacuously true,
skipping images incorrectly; fix by short-circuiting: explicitly check if
pipeline_classifier_ids is empty (the set built from pipeline_algorithms and
detection_type_keys) and when empty immediately take the "reprocess all images"
branch (log via task_logger and skip the subset logic), and apply the same
explicit-empty check to the other similar blocks referencing
pipeline_classifier_ids and classifier_ids_per_detection around the areas you
noted (lines ~147-153 and ~170-182) so the subset test is only evaluated when
pipeline_classifier_ids is non-empty.
- Around line 203-208: The heartbeat save for the Job progress currently calls
job.save(update_fields=["progress"]) which prevents Django from auto-updating
Job.updated_at; modify the save to include "updated_at" (e.g.,
job.save(update_fields=["progress", "updated_at"])) or simply call job.save() so
updated_at is bumped; locate the block around the collect heartbeat (references:
job, processed_count, total, COLLECT_PROGRESS_SAVE_INTERVAL_SECONDS,
last_progress_save_monotonic) and update the save call accordingly.
In `@ami/ml/tests.py`:
- Line 569: The test currently uses a hardcoded secret literal secret_key="y"
which triggers Ruff S106; to fix, replace the inline literal with a generated
variable (e.g. import secrets and assign a runtime value like secret_key =
secrets.token_urlsafe(...) or derive a variable above the test) and pass that
variable into the fixture/call instead of the literal, or alternatively add an
explicit noqa comment (S106) next to the parameter with a short justification;
update the usage of the symbol secret_key in the test to reference the new
variable (and add the necessary import) so Ruff no longer flags the hardcoded
secret.
📒 Files selected for processing (6)
- ami/main/models.py
- ami/ml/models/pipeline.py
- ami/ml/orchestration/jobs.py
- ami/ml/orchestration/nats_queue.py
- ami/ml/orchestration/tests/test_jobs.py
- ami/ml/tests.py
Three correctness fixes and one nit cleanup from the CodeRabbit review on PR #1322:
1. **`Job.save` heartbeat now bumps `updated_at`.** The throttled progress save in `filter_processed_images` was calling `job.save(update_fields=["progress"])`. Django's `auto_now=True` pre_save hook only fires for fields listed in `update_fields`, so `Job.updated_at` was never refreshed. The reaper at `ami/jobs/tasks.py:929-944` keys off `Job.updated_at < cutoff`, so the heartbeat did not actually defeat the reaper as intended. Including `updated_at` in `update_fields` restores the auto_now behavior. Test assertion updated to enforce the contract.
2. **Short-circuit the "no classifier" path.** When a pipeline has zero classifier algorithms registered, `pipeline_classifier_ids` is the empty set and `set().issubset(anything)` is vacuously True, so the subset check below would skip every image with existing detections, directly contradicting the warning "Will reprocess all images." Now `yield from images; return` after the warning so the warning's stated behavior actually holds.
3. **Drop the `hasattr(image, "url")` guard in `queue_images_to_nats`.** `images` is typed `list[SourceImage]` and `SourceImage` always defines `url()`. The guard masked nothing real and confused readers.
4. **`# noqa: S106` on the fixture `secret_key="y"` values** in two test files so future runs under Ruff's bandit ruleset stay clean. These are clearly fixture values, never real credentials.
Co-Authored-By: Claude <noreply@anthropic.com>
Add a behavioural test that pins the fix from 51a7fff: a pipeline with no classifier algorithms registered must yield every input image, matching the "Will reprocess all images" warning. Without the short-circuit added in 51a7fff, the empty `pipeline_classifier_ids` set makes the `set().issubset(observed)` check vacuously True and images with detections get silently skipped. Co-Authored-By: Claude <noreply@anthropic.com>
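The vacuous-truth trap this test pins is easy to reproduce in isolation. This is a minimal sketch with hypothetical function names, not the pipeline code: `required` plays the role of `pipeline_classifier_ids` and `observed` the classifier ids already seen on an image's detections.

```python
def is_fully_classified(required: set[int], observed: set[int]) -> bool:
    """Naive check: vacuously True when `required` is empty."""
    return required.issubset(observed)


def needs_processing(required: set[int], observed: set[int]) -> bool:
    """Guarded check: with no classifiers required, always reprocess."""
    if not required:
        return True  # short-circuit before the subset test
    return not required.issubset(observed)


# With an empty required set, the naive check says "done" for any image.
naive_says_done = is_fully_classified(set(), {7, 8})

# The guarded version reprocesses instead, and still skips real full coverage.
guarded_empty = needs_processing(set(), {7, 8})
guarded_covered = needs_processing({7}, {7, 8})
guarded_partial = needs_processing({7, 9}, {7, 8})
```

The guard has to come before the subset test, because no choice of `observed` can make an empty `required` set fail `issubset`.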
Operators running ML jobs on large Capture Sets (`SourceImageCollection`) were seeing the job sit in `STARTED` for ~20 minutes, with no Collect-stage progress and no errors in the log, and then flip to `REVOKED` once the reaper noticed no forward progress. From their side the platform looked like it had silently dropped the job. The Celery worker was actually still running, stuck in the Collect → queue path doing per-image SQL, but no `job.progress` updates were being saved, so the reaper's 10-minute "no forward progress" heuristic fired.

Three independent O(N)-per-image sites in that path were each big enough on their own to push past the 10-minute reaper threshold on collections in the ~100k-image range. This PR fixes all three.
Collect → queue → NATS-stream-created on a ~100k-image job that previously hung for ~19 minutes and got reaped now completes in 42 seconds end-to-end (measured on the Serbia dev box on the same shape of collection, see numbers below). That's roughly a 20–25× drop in absolute wall time on the path that was getting killed; the per-image cost goes from ~7.7 ms (147k incident wall time ÷ image count) down to ~0.4 ms. Smaller jobs see the same shape but the absolute win is too small to notice.
Before (collecting images takes more than 15 minutes and gets revoked)

After (collecting images takes 30 seconds or so)

What this PR changes
- Bulk `filter_processed_images` (`ami/ml/models/pipeline.py`). Was 3–5 ORM round-trips per image; now chunks via `itertools.islice` at `batch_size=1000` and does at most 2 bulk queries per chunk. Semantics for all five processed/unprocessed branches preserved. 147k images → ~295 queries instead of ~500k. Measured: 9.4 s for 105k images, down from a reaper-killed run on 147k.
- `select_related` on `deployment__data_source` in `collect_images` (`ami/ml/models/pipeline.py`). The two queryset branches (collection-backed and deployment-backed) now `.select_related("deployment__data_source")` so the FK chain rides along on the initial fetch. Saves a query per image on the downstream `image.url()` call in `queue_images_to_nats`.
- Call `image.url()` once per iteration in `queue_images_to_nats` (`ami/ml/orchestration/jobs.py`). The pre-NATS prep loop called it twice, once in the truthy check and once for assignment, doubling the FK chain cost on every iteration. Now called once.
- Fan out NATS publishes via chunked `asyncio.gather` (`ami/ml/orchestration/jobs.py`, new constant `NATS_PUBLISH_FANOUT_CHUNK_SIZE=200`). Sequential `await self.js.publish(...)` stacked the per-message JetStream ack round-trip (~1.3 ms each, including TCP) linearly. Chunked `gather` lets the NATS client pipeline acks back to us. `_ensure_stream` / `_ensure_consumer` are also pre-warmed once before the loop; they were already a cached no-op after the first call, but pre-warming prevents 200 concurrent coroutines from all trying to create the stream on the first chunk. Measured: 139 s → 15 s on 105k images (~9× speedup on the publish loop alone).
- `bbox_is_null()` helper next to `NULL_DETECTIONS_FILTER` in `ami/main/models.py` so the SQL filter and the in-memory check share a single source of truth. Used inside the bulk filter.
- Strip Detection's `Meta.ordering` on the bulk select via `.order_by()`. The rows land in a dict, so there is no point sorting them in Postgres.
- Rename the result dicts (`images_with_pipeline_detection`, `real_pipeline_detections_per_image`) so it's clear they only cover detections from this pipeline's algorithms.
- Short-circuit the no-classifier path (`ami/ml/models/pipeline.py`). When a pipeline has no classifier algorithms registered, the empty classifier set made `set().issubset(observed)` vacuously true, so images with existing detections were silently skipped, contradicting the "Will reprocess all images" warning. Now yields all images and returns. This is a pre-existing bug surfaced during review; fixed here because the PR rewrote that exact code path.

Tests
- Query count at `batch_size=5` → 4 queries (1 pipeline + 2 detection batches + 1 classification batch). Locks in the O(batches) shape.
- Regression test for the vacuous `issubset` bug fixed in review.
- `collect_images` returns SourceImage rows with `deployment` + `deployment.data_source` joined; accessing them triggers 0 additional queries.
- Fanout tests (`ami/ml/orchestration/tests/test_jobs.py`, mocking `TaskQueueManager`): warm-up runs exactly once, `publish_task` await count equals image count across a chunk boundary, partial-failure path returns `False` without raising, and a stream-setup exception short-circuits and yields zero publishes.

Ordering note for the fanout change
JetStream still assigns `ack.seq` atomically on the server side as messages arrive, so the stream is still strictly ordered by arrival. What changes is the order within a 200-message gather chunk: two images dispatched in the same chunk can land in the stream in either order. Consumers fetch via `consumer.fetch`, which yields in seq order, and the ADC worker processes each image-task independently with no cross-task ordering dependency, so this is safe. Globally the chunked order is still preserved: chunk N's messages all land in the stream before chunk N+1's.

Performance: what each fix bought
The original incident on a 147k-image collection ran for ~19 minutes before the reaper killed it. The job never reached the NATS publish loop (it stalled inside `filter_processed_images` doing per-image SQL), so we have direct evidence for one fix (the filter rewrite) and arithmetic / measurement for the rest. Per-fix attribution on a 105k-image workload:

| Row | Site |
|---|---|
| 1 | `filter_processed_images` |
| 2 | `image.url()` × 2 + uncached `deployment.data_source` FK chain in queue prep loop |
| 3 | `await self.js.publish(...)` per image |

So the 20× headline isn't a single fix: the filter rewrite alone gets it under the reaper threshold, but the `url()` cache and the fanout each remove a stage that would have hit the threshold once the previous one stopped being the bottleneck. With only the filter fixed, the same job-prep on 700k images would have stalled inside the publish loop at ~15 minutes wall. With all three fixed, the publish loop scales to 700k in ~100 s.

The A/B below is the direct measurement that backs row 3 (sequential vs fanout publish) and shows what each stage contributes to the final 42 s.
Measured end-to-end on the Serbia dev box (2026-05-27)
Re-ran the incident path against the closest-shape collection on hand: project 20, collection 165 ("All images"), 105,091 `SourceImage` rows, pipeline `quebec_vermont_moths_2023`, `dispatch_mode=async_api`. Stack was this branch on top of Postgres + Redis + JetStream as on a fresh deploy. Pipeline 3's algorithms have already classified ~6k of the 105k, so the filter does meaningful work (all 105,091 images returned as "to process" after dedup).

Headline comparison
- Before: job ended `REVOKED`.
- After: `collect: SUCCESS`, `process` queued, no reaper firing.

That's the ~20× headline win that closes #1321. The per-image cost on the job-prep path drops from ~7.7 ms (incident wall time ÷ image count) to ~0.4 ms.
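The per-image figures follow from dividing wall time by image count. A quick check of that arithmetic, using the rounded inputs quoted in this description (~19 minutes on ~147k images before, 42 s on 105,091 images after), so the results are approximate:

```python
incident_seconds = 19 * 60   # ~19 minutes before the reaper fired
incident_images = 147_000    # ~147k-image incident collection
after_seconds = 42           # measured end-to-end on the dev box
after_images = 105_091

before_ms = incident_seconds / incident_images * 1000
after_ms = after_seconds / after_images * 1000
ratio = before_ms / after_ms

print(f"before: {before_ms:.2f} ms/image")
print(f"after:  {after_ms:.2f} ms/image")
print(f"ratio:  {ratio:.1f}x")
```

The two runs use different collections, so the ratio compares per-image cost rather than identical workloads.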
Where the time went: A/B of the publish-loop change inside this PR
The performance work is a stack of 5 commits (`4c089c24`–`59d05619`). The first four cover the filter rewrite, the `select_related`, and the `url()` caching; the fifth adds the `asyncio.gather` fanout. (Later commits in the PR add the collect-progress emission, the review fixes, and tests; they don't change the perf path.) To check that the fanout commit specifically pulled its weight, I ran the same job twice: once with the first four commits applied (sequential publish loop still), once with all five (fanout publish loop). Stages timed in each run:
- `collect_images`
- `filter_processed_images` (105,091 images)
- `state_manager.initialize_job` (Redis SADD)
- `queue_images_to_nats` publish loop
- `run_job`

Both runs are on the same code stack as this PR for the filter changes, so the 9 s filter is post-fix in both. The "before" 19-minute baseline above was never re-measured directly because the pre-fix code would have stalled the worker the same way it did in the incident. The 9 s filter and 7.7 ms → 0.4 ms per-image numbers are derived from the same 105k vs 147k extrapolation, which is the closest 1:1 comparison the dev box could give without rolling back to a known-broken state.
Throughput on the publish loop: 105,091 / 139 ≈ 755 msg/s sequential vs 105,091 / 15 ≈ 7,000 msg/s with fanout. The fanout rate was confirmed against NATS `/varz` samples taken during the run (in_msgs delta of ~45k over a 9 s window mid-publish, ~60k over the following 10 s window).

Extrapolating linearly at the measured fanout rate: a 700k-image job (the largest collection currently on Serbia) would spend ~100 s in the publish loop and ~60 s in the filter, totaling well under 5 minutes. Sequential publish on the same 700k collection would have taken ~15 minutes for the publish alone, so the fanout change is load-bearing for the largest collections, not just nice-to-have.
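The extrapolation can be reproduced from the measured numbers. This is a linear-scaling sketch only; it assumes throughput stays flat from 105k to 700k images, which the measurements above do not establish.

```python
images = 105_091
sequential_rate = images / 139   # msg/s with one await per publish
fanout_rate = images / 15        # msg/s with chunked gather

large = 700_000
fanout_publish_s = large / fanout_rate          # publish loop with fanout
sequential_publish_s = large / sequential_rate  # publish loop without fanout
filter_s = 9.4 * large / images                 # filter scaled from 9.4 s at 105k

REAPER_CUTOFF_S = 10 * 60
print(f"fanout publish:     {fanout_publish_s:.0f} s")
print(f"sequential publish: {sequential_publish_s / 60:.1f} min")
print(f"filter:             {filter_s:.0f} s")
print(f"fanout + filter under cutoff: {fanout_publish_s + filter_s < REAPER_CUTOFF_S}")
```

Under that assumption the sequential publish loop alone exceeds the 10-minute cutoff at 700k images, while filter plus fanout publish stays well below it.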
What the e2e proves: the job-prep path (Collect → filter → Redis init → NATS publish) on a real ~100k-image production-shape collection completes in 42 seconds and leaves the job in a healthy `STARTED` state with `collect: SUCCESS` and `process` queued. The 10-minute reaper threshold isn't approached at any stage.

Downstream consumption confirmed (2026-05-28 re-run)
The 2026-05-27 runs above queued all 105,091 messages but could not confirm the ADC worker actually consumed them: the dev box's ADC worker was in HTTP-poll mode with a 401 auth issue, so the `process` stage stayed at 0% after queueing. That criterion was satisfied only by inference (queue-success log + Redis sets initialised + 105,091 JetStream messages in flight).

After the auth was fixed, a fresh job (2547, same project 20 / collection 165 / `quebec_vermont_moths_2023`) on the latest branch ran the full loop. `collect` reached `SUCCESS (1.0)`, then `process` advanced past 0 (`STARTED`, climbing) with `results` also moving, and `Job.updated_at` bumped steadily throughout (observed 00:03→00:06 and on), so the reaper never approached its threshold. This closes the one gap from the earlier runs: the full Collect → filter → publish → consume → results path is now observed working end-to-end, not inferred. Absolute process-stage throughput is bounded by ADC inference speed on 105k images, which is orthogonal to this PR.

Collect-stage progress emission (also in this PR)
`filter_processed_images` now accepts optional `job` and `total` kwargs and emits a throttled fractional `collect` progress update at most once every 5 s of wall time (`COLLECT_PROGRESS_SAVE_INTERVAL_SECONDS`), capped at 0.99 (`COLLECT_PROGRESS_MAX_FRACTION`) so the caller's terminal `status=SUCCESS, progress=1` flip still owns the final value. `collect_images` wires both through. Legacy callers without `job` stay silent: the throttle block is fully gated on both new kwargs.

This makes the Collect stage tick `Job.save(update_fields=["progress", "updated_at"])` regularly, which keeps `Job.updated_at` fresh against the reaper at `ami/jobs/tasks.py:929-944` (cutoff `Job.STALLED_JOBS_MAX_MINUTES = 10`). `updated_at` is listed explicitly in `update_fields` because Django's `auto_now` only fires its `pre_save` hook for fields named in `update_fields`; omitting it (as an earlier revision of this PR did) would update the `progress` JSONB but leave `updated_at` stale, defeating the reaper-shielding entirely. The structural fix for the reaper false-positive lands here rather than in a follow-up because the wiring is cheap and there's no good reason to leave the gap open.

Two new tests in `ami/ml/tests.py`:
- `test_filter_processed_images_emits_throttled_collect_progress`: a counter-based clock stub verifies the ≥5 s gate fires exactly the expected number of saves across a multi-chunk run, all with `update_fields=["progress", "updated_at"]`, and that the final emitted fraction is capped at 0.99.
- `test_filter_processed_images_skips_progress_emission_without_job`: omitting `job` keeps the legacy code path silent (zero `Job.save()` calls).

The earlier follow-up note about "celery_state=PENDING" was a misleading paraphrase: during the Collect stage Django `Job.status` is `STARTED` (set by `MLJob.run()` at `ami/jobs/models.py:462`) and Celery's `AsyncResult(task_id).state` is also `STARTED`; PENDING only applies between enqueue and pickup. The actual reaper trigger is `(status ∈ running_states) AND (updated_at < cutoff)`. STARTED is in `running_states()`. The 5 s throttle directly addresses the real trigger; nothing PENDING-specific remains to chase.

Not in this PR (follow-ups)
`queue_images_to_nats` still has no incremental progress emission during the publish loop. The 42 s end-to-end wall time on 105k images measured above (and ~140 s extrapolated on 700k) is well under the 10-minute reaper cutoff, so this isn't load-bearing. Wiring it would require `sync_to_async`-bridged updates inside the `asyncio.gather` chunks and a careful pass on the `Job.progress` JSONB clobber race with the NATS result handler (`_update_job_progress` in `ami/jobs/tasks.py`).

Closes #1321.