[Draft] Don't overwrite logs & status in concurrent background tasks by mihow · Pull Request #1026 · RolnickLab/antenna

mihow · 2025-10-31T00:49:48Z

Pulled from #981

More coming soon

netlify · 2025-10-31T00:49:53Z

✅ Deploy Preview for antenna-preview ready!

Name	Link
🔨 Latest commit	`1b17592`
🔍 Latest deploy log	https://app.netlify.com/projects/antenna-preview/deploys/690407afd194380008e4798f
😎 Deploy Preview	https://deploy-preview-1026--antenna-preview.netlify.app
Lighthouse	1 path audited. Performance: 30 (🔴 down 1 from production); Accessibility: 80 (no change from production); Best Practices: 100 (no change from production); SEO: 92 (no change from production); PWA: 80 (no change from production)


coderabbitai · 2025-10-31T00:49:58Z

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.



…ues (#1257)

* feat(celery): split tasks across three queues to prevent cross-task starvation

  Previously all Celery tasks shared a single 'antenna' queue, so a burst of high-volume tasks could block lower-volume ones on the same worker pool. Observed scenario: ~740-image async_api job emitted ~180 process_nats_pipeline_result tasks per 5 min and starved run_job invocations behind it, leaving newly submitted jobs stuck in PENDING for many minutes. Long-running run_job tasks can similarly hold worker slots and delay beat / housekeeping tasks.

  Split into three queues, each with its own worker service:

  - antenna (default): beat tasks, cache refresh, sync, housekeeping
  - jobs: run_job (can hold a slot for hours)
  - ml_results: process_nats_pipeline_result + save_results bursts

  Worker start script now takes CELERY_QUEUES as env var (default: antenna) so one image serves all three services. Worker-only hosts (ami-worker-2, ami-worker-3) consume all three queues as spillover capacity via docker-compose.worker.yml.

  Relates to #1256 (job logging bottleneck) and #1026 (concurrent job log updates): those two tackle the write-path; this change tackles the dispatch-path.

* docs(celery): add rollout plan for queue split branch

* chore(celery): parameterize local dev worker queues via CELERY_QUEUES

  Match the production start script: read the queue list from $CELERY_QUEUES, defaulting to all three queues (antenna, jobs, ml_results) so the single local worker keeps consuming everything by default. Lets devs override for isolation testing if they want.

  Co-Authored-By: Claude <noreply@anthropic.com>

* feat(celery): route create_detection_images to ml_results queue

  This task is emitted from save_results (pipeline.py:990, one delay per batch of source images) and does heavy image cropping + S3 writes. Left unrouted, it defaults to the antenna queue, the opposite of what the queue-split is trying to achieve, since a single large job's cropping fan-out can then starve beat/housekeeping.

  Co-Authored-By: Claude <noreply@anthropic.com>

* refactor(celery): move jobs + ml_results workers off the app host

  Production topology previously put all three worker services (antenna, jobs, ml_results) on ami-live, which meant the bursty ML pool was competing with Django/beat/flower for CPU and RAM. With CELERY_WORKER_CONCURRENCY=16 inherited per service, that's 48 prefork processes before any dedicated worker VM spins up.

  Now:

  - docker-compose.production.yml runs only the antenna worker (alongside Django + beat + flower on the app host).
  - docker-compose.worker.yml runs three dedicated services (antenna / jobs / ml_results) per worker VM, so isolation holds there too: a burst on one class can't saturate a shared pool and starve another.

  Rollout doc updated to reflect the new topology.

  Co-Authored-By: Claude <noreply@anthropic.com>

* fix(celery): address review comments on queue split PR

  - Standardize rollout doc script name to reset_demo_to_branch.sh
  - Clarify settings comment: only staging/production/worker composes run per-queue dedicated workers; local/CI use a single all-queues worker

  Co-Authored-By: Claude <noreply@anthropic.com>

* docs(celery): use reset_to_branch.sh (the generic script name)

  The script is used on staging, demo, and single-box deploys, not demo-specific. Standardize both mentions to reset_to_branch.sh, which matches the actual filename on the hosts.

  Co-Authored-By: Claude <noreply@anthropic.com>

* docs(celery): scrub internal hostnames and update rollout doc accuracy

  - Generalize ami-live / ami-worker-2 / ami-worker-3 hostnames: this doc lives in a public repo and shouldn't reference deployment-specific names
  - Drop stale commit SHA; branch name is sufficient after further commits
  - Clarify that the "scp three files" list is the demo-path subset, not the full changeset on the branch

  Co-Authored-By: Claude <noreply@anthropic.com>

---------

Co-authored-by: Claude <noreply@anthropic.com>
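For readers unfamiliar with the queue split described in the commit message above, here is a minimal sketch of how such routing is commonly expressed. The queue names (antenna, jobs, ml_results) and the CELERY_QUEUES env var come from the commit message; the dotted task paths and the helper functions are hypothetical placeholders, not the project's actual code. The setting names assume Celery is loaded from Django settings with the `CELERY` namespace, where `CELERY_TASK_ROUTES` corresponds to Celery's `task_routes` setting. The lookup helper only handles exact task names, which is a simplification of what Celery itself supports.

```python
import os

# Queue names are taken from the commit message. The dotted task paths are
# HYPOTHETICAL placeholders; the real module paths are not given in the thread.
CELERY_TASK_DEFAULT_QUEUE = "antenna"
CELERY_TASK_ROUTES = {
    "jobs.tasks.run_job": {"queue": "jobs"},
    "ml.tasks.process_nats_pipeline_result": {"queue": "ml_results"},
    "ml.tasks.save_results": {"queue": "ml_results"},
    "ml.tasks.create_detection_images": {"queue": "ml_results"},
}


def queue_for(task_name: str) -> str:
    """Illustrative lookup: routed tasks go to their queue, the rest to the default."""
    route = CELERY_TASK_ROUTES.get(task_name)
    return route["queue"] if route else CELERY_TASK_DEFAULT_QUEUE


def worker_queues(env=None) -> list[str]:
    """Parse a comma-separated CELERY_QUEUES value, defaulting to 'antenna'."""
    env = os.environ if env is None else env
    raw = env.get("CELERY_QUEUES", "antenna")
    return [name.strip() for name in raw.split(",") if name.strip()]
```

A start script could then pass the parsed list to the worker's `-Q` option (for example `-Q antenna,jobs,ml_results`), so the same image can serve any combination of queues.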

mihow · 2026-06-16T20:21:24Z

Claude says: Opened #1337 to capture the root cause this draft targets — concurrent unlocked read-modify-write of Job.progress/status — with the full lineage (#721 → #981 (closed) → this draft), the appearance-vs-lost-work breakdown, and four candidate fix directions. The key constraint it spells out: any fix has to stay atomic without re-adding the select_for_update that #1261 removed for contention reasons.

Worth deciding: either flesh this draft out against the #1337 design, or close it in favour of a fresh implementation and keep the spec in #1337. Happy to help compare the #981-derived approach here against the conditional-write / counter-column directions in #1337 before picking one.
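To make the "unlocked read-modify-write" failure mode concrete, the sketch below simulates it with the standard-library `sqlite3` module. It is an illustration only: the `job` table and `processed` column are hypothetical stand-ins, not the real Job model, and the interleaving is forced deterministically rather than produced by real concurrent workers. The second function shows the general idea behind a counter-column style fix, where the increment happens inside a single UPDATE statement instead of in application code.

```python
import sqlite3


def make_counter_db() -> sqlite3.Connection:
    """Create an in-memory table with one job row (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE job (id INTEGER PRIMARY KEY, processed INTEGER NOT NULL)")
    conn.execute("INSERT INTO job (id, processed) VALUES (1, 0)")
    return conn


def current_count(conn: sqlite3.Connection) -> int:
    return conn.execute("SELECT processed FROM job WHERE id = 1").fetchone()[0]


def read_modify_write(conn: sqlite3.Connection, workers: int) -> int:
    """All simulated workers read first, then each writes back its stale value + 1."""
    snapshots = [current_count(conn) for _ in range(workers)]
    for seen in snapshots:
        # Every worker saw the same starting value, so the writes overwrite each other.
        conn.execute("UPDATE job SET processed = ? WHERE id = 1", (seen + 1,))
    return current_count(conn)


def atomic_increment(conn: sqlite3.Connection, workers: int) -> int:
    """Each simulated worker increments in the database, so no update is lost."""
    for _ in range(workers):
        conn.execute("UPDATE job SET processed = processed + 1 WHERE id = 1")
    return current_count(conn)
```

With three simulated workers, `read_modify_write` leaves the counter at 1 while `atomic_increment` leaves it at 3. In Django terms the second form roughly corresponds to an `update()` with an `F()` expression, though whether that matches the directions proposed in #1337 is an assumption here.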

mihow · 2026-06-20T00:58:13Z

Claude says: Closing as superseded. The status half of this concurrent-write problem is now fixed on main:

- "Stop a finished job from being pulled back to running by a slower worker" #1338 (merged) makes the result handler's terminal transition a guarded statement-scope UPDATE ... WHERE status IN finalizable_states(): no row lock, can't be clobbered by a stale read-modify-write. It ships with a real concurrency regression test that reproduces the lost-update on main (red) and passes on the fix (green).
- "Guard the cancel and signal-handler job-status writes too (fast-follow on #1338)" #1342 extends the same guard to the other lock-free terminal writers (cancel() and the celery task-signal handlers).

The periodic-check approach here (carried over from #981) is no longer the direction. Remaining work is tracked under the #1337 root-cause hub (counter-accumulation "Layer 2", the /next starvation filter, reaper hardening). Thanks for the early groundwork — it framed the problem.
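As a rough illustration of the guarded, statement-scope UPDATE pattern mentioned for #1338, the sketch below uses the standard-library `sqlite3` module. It is not the merged implementation: the table layout, the status names, and the contents of `FINALIZABLE_STATES` are assumptions made for the example, since the thread does not say what `finalizable_states()` returns. The point is only that the state check and the write happen in one statement, so no row lock is needed and a competing writer that arrives late simply matches zero rows.

```python
import sqlite3

# HYPOTHETICAL: the real finalizable_states() values are not listed in the thread.
FINALIZABLE_STATES = ("PENDING", "STARTED")


def make_status_db() -> sqlite3.Connection:
    """Create an in-memory table with one running job (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE job (id INTEGER PRIMARY KEY, status TEXT NOT NULL)")
    conn.execute("INSERT INTO job (id, status) VALUES (1, 'STARTED')")
    return conn


def finalize(conn: sqlite3.Connection, job_id: int, new_status: str) -> bool:
    """Set a terminal status only if the job is still in a finalizable state.

    Returns True if this caller performed the transition, False if the guard
    matched no row (the job was already finalized by someone else).
    """
    placeholders = ",".join("?" for _ in FINALIZABLE_STATES)
    cursor = conn.execute(
        f"UPDATE job SET status = ? WHERE id = ? AND status IN ({placeholders})",
        (new_status, job_id, *FINALIZABLE_STATES),
    )
    return cursor.rowcount == 1


def status_of(conn: sqlite3.Connection, job_id: int) -> str:
    return conn.execute("SELECT status FROM job WHERE id = ?", (job_id,)).fetchone()[0]
```

In this sketch the first `finalize(conn, 1, "SUCCESS")` call returns True, and a later `finalize(conn, 1, "FAILURE")` from a slower caller returns False and leaves the status untouched. A Django equivalent would plausibly be a `filter(..., status__in=...).update(...)` call, checked by its returned row count.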

feat: don't overwrite logs & status in concurrent background tasks

1b17592

mihow mentioned this pull request Apr 20, 2026

feat(job): refactor job logging so it isn't a bottleneck #1256

Open

mihow added the PSv2 Async & distributed ML backend (PSv2): job state, NATS dispatch, result handling. Umbrella #515. label Jun 16, 2026

This was referenced Jun 16, 2026

Saving job progress concurrently is the root of multiple issues related to incorrect job statuses #1337

Open

fix(jobs): fix dangling jobs from going to revoked #1276

Open

mihow mentioned this pull request Jun 16, 2026

New async & distributed ML backend (aka "PSv2") #515

Open

20 tasks

mihow closed this Jun 20, 2026

mihow wants to merge 1 commit into main from fix/job-clobbering
