fix(scheduler): retry failed nodes when resuming a job by SavinRazvan · Pull Request #146 · SavinRazvan/eXo-brain

SavinRazvan · 2026-06-20T12:56:33Z

Problem

BackgroundRuntime.resume_job() is documented to recover failed and cancelled jobs, but the scheduler treated FAILED checkpoints as terminal on resume. Failed nodes were added to the failed set and never re-executed, so a resumed job would immediately finish as failed without retrying the node.

The background-runtime test suite worked around this by swapping the graph handler before calling resume_job().

Fix

With resume=True, only COMPLETED checkpoints are skipped. FAILED, RUNNING, and CANCELLED checkpoints no longer block re-execution: their handlers run again, so operators can recover failed jobs.
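
The skip rule can be sketched as a filter over stored checkpoints. This is a minimal illustration, not the scheduler code changed in this PR: the CheckpointStatus enum, the nodes_to_skip and run helpers, and the dict-based checkpoint store are hypothetical names assumed for the example.

```python
from enum import Enum
from typing import Callable, Dict, List, Set


class CheckpointStatus(Enum):
    # Hypothetical enum; member names mirror the states named in this PR.
    COMPLETED = "completed"
    FAILED = "failed"
    RUNNING = "running"
    CANCELLED = "cancelled"


def nodes_to_skip(checkpoints: Dict[str, CheckpointStatus], resume: bool) -> Set[str]:
    """Return the nodes whose handlers should not run again."""
    if not resume:
        # A fresh run ignores stored checkpoints entirely.
        return set()
    # On resume, only COMPLETED work is skipped; FAILED, RUNNING and
    # CANCELLED checkpoints are eligible for re-execution.
    return {
        node
        for node, status in checkpoints.items()
        if status is CheckpointStatus.COMPLETED
    }


def run(
    nodes: List[str],
    handlers: Dict[str, Callable[[], None]],
    checkpoints: Dict[str, CheckpointStatus],
    resume: bool = False,
) -> Dict[str, CheckpointStatus]:
    """Run each node's handler in order and record the resulting status."""
    skipped = nodes_to_skip(checkpoints, resume)
    for node in nodes:
        if node in skipped:
            continue
        try:
            handlers[node]()
            checkpoints[node] = CheckpointStatus.COMPLETED
        except Exception:
            checkpoints[node] = CheckpointStatus.FAILED
    return checkpoints
```

Under these assumptions, a checkpoint store holding one COMPLETED and one FAILED node yields a skip set containing only the COMPLETED node, so the failed node's handler is invoked again on resume.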

Tests

Updated test_scheduler_resume_retries_failed_checkpoint_nodes to assert a failed checkpoint is retried and completes on resume.
All scheduler checkpoint-resume and background-runtime cancel/resume tests pass (21 tests).
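
For readers without the repository at hand, the shape of such a test might look like the sketch below. It is an assumption-laden stand-in, not the actual test: run_job is a toy scheduler written for this example, the node names are invented, and statuses are plain strings. The point it illustrates is that the same handler is retried on resume, with no handler swap.

```python
from typing import Callable, Dict


def run_job(
    handlers: Dict[str, Callable[[], None]],
    checkpoints: Dict[str, str],
    resume: bool = False,
) -> str:
    """Toy stand-in scheduler (hypothetical): runs handlers in insertion order."""
    for node, handler in handlers.items():
        # On resume, skip only nodes that already completed.
        if resume and checkpoints.get(node) == "COMPLETED":
            continue
        try:
            handler()
            checkpoints[node] = "COMPLETED"
        except RuntimeError:
            checkpoints[node] = "FAILED"
            return "FAILED"
    return "COMPLETED"


def test_resume_retries_failed_node() -> None:
    attempts = {"load": 0, "transform": 0}

    def load() -> None:
        attempts["load"] += 1

    def transform() -> None:
        # Fails on the first attempt only, simulating a transient error.
        attempts["transform"] += 1
        if attempts["transform"] == 1:
            raise RuntimeError("transient failure")

    handlers = {"load": load, "transform": transform}
    checkpoints: Dict[str, str] = {}

    # First run: "transform" fails and its checkpoint is recorded as FAILED.
    assert run_job(handlers, checkpoints) == "FAILED"
    assert checkpoints == {"load": "COMPLETED", "transform": "FAILED"}

    # Resume: "load" is skipped, "transform" is retried and completes.
    assert run_job(handlers, checkpoints, resume=True) == "COMPLETED"
    assert attempts == {"load": 1, "transform": 2}
    assert checkpoints["transform"] == "COMPLETED"
```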

Verification

python3 -m pytest tests/modules/core/test_scheduler_checkpoint_resume.py \
  tests/modules/core/test_background_runtime_cancel_resume.py -q

@SavinRazvan

On resume, FAILED checkpoints were added to the terminal failed set, so BackgroundRuntime.resume_job() could not re-execute failed nodes. Only COMPLETED checkpoints are now treated as skipped on resume.

Author: Savin I. Razvan
GitHub-User: @SavinRazvan
Assisted-by: Cursor
Co-authored-by: Razvan <SavinRazvan@users.noreply.github.com>

SavinRazvan wants to merge 1 commit into main from cursor/fix-scheduler-resume-failed-nodes-ce8a


2 participants
