fix(scheduler): retry failed nodes when resuming a job#146

Draft
SavinRazvan wants to merge 1 commit into main from cursor/fix-scheduler-resume-failed-nodes-ce8a

Conversation

@SavinRazvan

Owner

Problem

BackgroundRuntime.resume_job() is documented to recover failed and cancelled jobs, but the scheduler treated FAILED checkpoints as terminal on resume. Failed nodes were added to the failed set and never re-executed, so a resumed job would immediately finish as failed without retrying the node.

The background-runtime test suite worked around this by swapping the graph handler before calling resume_job().

Fix

With resume=True, the scheduler now skips only COMPLETED checkpoints. FAILED, RUNNING, and CANCELLED checkpoints no longer block re-execution: their handlers run again, so operators can recover failed jobs.
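The skip rule described above can be sketched as follows. This is a minimal illustration, not the scheduler's actual code: `CheckpointStatus`, `nodes_to_skip_on_resume`, and `nodes_to_run` are hypothetical names, and checkpoints are assumed to be a plain mapping from node id to status.

```python
from enum import Enum


class CheckpointStatus(Enum):
    # Hypothetical status enum; the real scheduler's type may differ.
    COMPLETED = "completed"
    FAILED = "failed"
    RUNNING = "running"
    CANCELLED = "cancelled"


def nodes_to_skip_on_resume(checkpoints, resume):
    """Return the node ids whose handlers should not run again."""
    if not resume:
        # A fresh run ignores existing checkpoints entirely.
        return set()
    # Only COMPLETED checkpoints are skipped; FAILED, RUNNING, and
    # CANCELLED nodes stay eligible for re-execution.
    return {
        node_id
        for node_id, status in checkpoints.items()
        if status is CheckpointStatus.COMPLETED
    }


def nodes_to_run(all_nodes, checkpoints, resume):
    """Return the nodes to execute, preserving the original order."""
    skipped = nodes_to_skip_on_resume(checkpoints, resume)
    return [node_id for node_id in all_nodes if node_id not in skipped]
```

Under these assumptions, a node with no checkpoint at all is also executed, since only an explicit COMPLETED status causes a skip.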

Tests

  • Updated test_scheduler_resume_retries_failed_checkpoint_nodes to assert that a failed checkpoint is retried and completes on resume.
  • All scheduler checkpoint-resume and background-runtime cancel/resume tests pass (21 tests).
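
The shape of such a retry assertion might look like the sketch below. It is self-contained and does not use the project's scheduler: `run_graph` is a toy stand-in, and `make_flaky_handler` and the node names are hypothetical. The point it illustrates is that the same handler is reused across both runs, with no handler swap before the resume.

```python
def run_graph(handlers, checkpoints, resume=False):
    """Toy scheduler: run handlers in order and record a status per node."""
    for node_id, handler in handlers.items():
        # On resume, skip only nodes that already completed.
        if resume and checkpoints.get(node_id) == "COMPLETED":
            continue
        try:
            handler()
            checkpoints[node_id] = "COMPLETED"
        except Exception:
            checkpoints[node_id] = "FAILED"
            return "FAILED"
    return "COMPLETED"


def make_flaky_handler():
    """Build a handler that fails on its first call and succeeds afterwards."""
    calls = {"count": 0}

    def handler():
        calls["count"] += 1
        if calls["count"] == 1:
            raise RuntimeError("transient failure")

    return handler, calls


def test_failed_checkpoint_is_retried_on_resume():
    ran = []
    flaky, calls = make_flaky_handler()
    handlers = {"load": lambda: ran.append("load"), "transform": flaky}
    checkpoints = {}

    # First run: "load" completes, "transform" fails.
    assert run_graph(handlers, checkpoints) == "FAILED"
    assert checkpoints == {"load": "COMPLETED", "transform": "FAILED"}

    # Resume: "load" is skipped, "transform" is retried and completes.
    assert run_graph(handlers, checkpoints, resume=True) == "COMPLETED"
    assert ran == ["load"]
    assert calls["count"] == 2
```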

Verification

python3 -m pytest tests/modules/core/test_scheduler_checkpoint_resume.py \
  tests/modules/core/test_background_runtime_cancel_resume.py -q

On resume, FAILED checkpoints were added to the terminal failed set,
so BackgroundRuntime.resume_job() could not re-execute failed nodes.
Only COMPLETED checkpoints are now treated as skipped on resume.

Author: Savin I. Razvan
GitHub-User: @SavinRazvan
Assisted-by: Cursor

Co-authored-by: Razvan <SavinRazvan@users.noreply.github.com>