[Draft] Don't overwrite logs & status in concurrent background tasks #1026
mihow wants to merge 1 commit into
…ues (#1257)

* feat(celery): split tasks across three queues to prevent cross-task starvation

  Previously all Celery tasks shared a single 'antenna' queue, so a burst of high-volume tasks could block lower-volume ones on the same worker pool.

  Observed scenario: a ~740-image async_api job emitted ~180 process_nats_pipeline_result tasks per 5 min and starved the run_job invocations behind it, leaving newly submitted jobs stuck in PENDING for many minutes. Long-running run_job tasks can similarly hold worker slots and delay beat / housekeeping tasks.

  Split into three queues, each with its own worker service:

    antenna (default)   beat tasks, cache refresh, sync, housekeeping
    jobs                run_job (can hold a slot for hours)
    ml_results          process_nats_pipeline_result + save_results bursts

  Worker start script now takes CELERY_QUEUES as env var (default: antenna) so one image serves all three services. Worker-only hosts (ami-worker-2, ami-worker-3) consume all three queues as spillover capacity via docker-compose.worker.yml.

  Relates to #1256 (job logging bottleneck) and #1026 (concurrent job log updates). Those two tackle the write-path; this change tackles the dispatch-path.

* docs(celery): add rollout plan for queue split branch

* chore(celery): parameterize local dev worker queues via CELERY_QUEUES

  Match the production start script: read the queue list from $CELERY_QUEUES, defaulting to all three queues (antenna, jobs, ml_results) so the single local worker keeps consuming everything by default. Lets devs override for isolation testing if they want.

  Co-Authored-By: Claude <noreply@anthropic.com>

* feat(celery): route create_detection_images to ml_results queue

  This task is emitted from save_results (pipeline.py:990, one delay per batch of source images) and does heavy image cropping + S3 writes. Left unrouted, it defaults to the antenna queue, the opposite of what the queue split is trying to achieve, since a single large job's cropping fan-out can then starve beat/housekeeping.

  Co-Authored-By: Claude <noreply@anthropic.com>

* refactor(celery): move jobs + ml_results workers off the app host

  Production topology previously put all three worker services (antenna, jobs, ml_results) on ami-live, which meant the bursty ML pool was competing with Django/beat/flower for CPU and RAM. With CELERY_WORKER_CONCURRENCY=16 inherited per service, that's 48 prefork processes before any dedicated worker VM spins up.

  Now:
  - docker-compose.production.yml runs only the antenna worker (alongside Django + beat + flower on the app host).
  - docker-compose.worker.yml runs three dedicated services (antenna / jobs / ml_results) per worker VM, so isolation holds there too: a burst on one class can't saturate a shared pool and starve another.

  Rollout doc updated to reflect the new topology.

  Co-Authored-By: Claude <noreply@anthropic.com>

* fix(celery): address review comments on queue split PR

  - Standardize rollout doc script name to reset_demo_to_branch.sh
  - Clarify settings comment: only staging/production/worker composes run per-queue dedicated workers; local/CI use a single all-queues worker

  Co-Authored-By: Claude <noreply@anthropic.com>

* docs(celery): use reset_to_branch.sh (the generic script name)

  The script is used on staging, demo, and single-box deploys, so it is not demo-specific. Standardize both mentions to reset_to_branch.sh, which matches the actual filename on the hosts.

  Co-Authored-By: Claude <noreply@anthropic.com>

* docs(celery): scrub internal hostnames and update rollout doc accuracy

  - Generalize ami-live / ami-worker-2 / ami-worker-3 hostnames: this doc lives in a public repo and shouldn't reference deployment-specific names
  - Drop stale commit SHA; branch name is sufficient after further commits
  - Clarify that the "scp three files" list is the demo-path subset, not the full changeset on the branch

  Co-Authored-By: Claude <noreply@anthropic.com>

---------

Co-authored-by: Claude <noreply@anthropic.com>
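The queue split described in the commit message above can be sketched in a few lines of plain Python. This is only an illustration, not the code from #1257: the dict mirrors the shape of Celery's `task_routes` setting, but the short task names stand in for the real dotted task paths (which this thread does not show), treating `save_results` as a routable task is an assumption, and `queue_for` and `worker_queues` are hypothetical helpers.

```python
import os

# Assumption: short names stand in for the real dotted task paths.
TASK_ROUTES = {
    "run_job": {"queue": "jobs"},
    "process_nats_pipeline_result": {"queue": "ml_results"},
    "save_results": {"queue": "ml_results"},  # assumed to be a task
    "create_detection_images": {"queue": "ml_results"},
}

# Anything not listed above (beat tasks, cache refresh, sync, housekeeping)
# falls through to the default queue.
DEFAULT_QUEUE = "antenna"


def queue_for(task_name):
    """Return the queue a task would land on under the routing table above."""
    return TASK_ROUTES.get(task_name, {}).get("queue", DEFAULT_QUEUE)


def worker_queues(env=None, default="antenna"):
    """Parse a comma-separated CELERY_QUEUES value, as a start script might."""
    env = os.environ if env is None else env
    raw = env.get("CELERY_QUEUES", default)
    return [name.strip() for name in raw.split(",") if name.strip()]
```

A worker started with `CELERY_QUEUES=jobs` would then consume only `run_job`, while a worker with the variable unset consumes only `antenna`. That separation is what keeps a burst of result tasks from holding the slots that job dispatch needs.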
Claude says: Opened #1337 to capture the root cause this draft targets: concurrent unlocked read-modify-write of

Worth deciding: either flesh this draft out against the #1337 design, or close it in favour of a fresh implementation and keep the spec in #1337. Happy to help compare the #981-derived approach here against the conditional-write / counter-column directions in #1337 before picking one.
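For readers unfamiliar with the failure mode named above, here is a minimal sketch of a lost update from an unlocked read-modify-write, next to a counter-column style increment done inside the database. It uses the standard-library `sqlite3` module only so that it runs anywhere. The `job` table, the `processed` column, and the helper names are hypothetical, and the sketch does not reproduce the #1337 design.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE job (id INTEGER PRIMARY KEY, processed INTEGER NOT NULL)")
conn.execute("INSERT INTO job (id, processed) VALUES (1, 0), (2, 0)")


def read_processed(job_id):
    """Read the current counter value for one job."""
    row = conn.execute("SELECT processed FROM job WHERE id = ?", (job_id,)).fetchone()
    return row[0]


def unsafe_write(job_id, stale_value, delta):
    """Read-modify-write: the new total is computed from a value read earlier."""
    conn.execute("UPDATE job SET processed = ? WHERE id = ?", (stale_value + delta, job_id))


def atomic_add(job_id, delta):
    """Counter-column style: the database applies the increment itself."""
    conn.execute("UPDATE job SET processed = processed + ? WHERE id = ?", (delta, job_id))


# Job 1: two workers both read 0 before either one writes.
seen_by_a = read_processed(1)
seen_by_b = read_processed(1)
unsafe_write(1, seen_by_a, 5)  # worker A records 5 processed items
unsafe_write(1, seen_by_b, 3)  # worker B overwrites A's result with 0 + 3
lost_update_total = read_processed(1)  # 3: worker A's 5 items are lost

# Job 2: the same two updates expressed as in-database increments.
atomic_add(2, 5)
atomic_add(2, 3)
atomic_total = read_processed(2)  # 8: both updates survive
```

The interleaving is forced by hand here so the outcome is deterministic. With real concurrent workers the same overwrite happens only when the reads and writes happen to overlap, which is why the bug shows up intermittently.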
Claude says: Closing as superseded. The status half of this concurrent-write problem is now fixed on
The periodic-check approach here (carried over from #981) is no longer the direction. Remaining work is tracked under the #1337 root-cause hub (counter-accumulation "Layer 2", the

Pulled from #981
More coming soon