Skip to content

Halt after max_consecutive_errored_rollouts#2835

Open
rasdani wants to merge 1 commit into
mainfrom
halt-consecutive-errored-rollouts
Open

Halt after max_consecutive_errored_rollouts#2835
rasdani wants to merge 1 commit into
mainfrom
halt-consecutive-errored-rollouts

Conversation

@rasdani

@rasdani rasdani commented Jun 18, 2026

Copy link
Copy Markdown
Contributor

Summary

Add orchestrator.max_consecutive_errored_rollouts with default 10 and None disable support.

Stop the training after a configurable number of rollouts failed in a row.

Prevents spamming sandboxes on a persistent sandbox service outage.

Validation

  • uv run pytest tests/unit/orchestrator/test_orchestrator_error_guard.py
  • uv run pytest tests/unit/test_configs.py -k max_consecutive_errored_rollouts
  • uv run ruff check docs/training.md packages/prime-rl-configs/src/prime_rl/configs/orchestrator.py src/prime_rl/orchestrator/orchestrator.py tests/unit/test_configs.py tests/unit/orchestrator/test_orchestrator_error_guard.py
  • git diff --check

Known Non-Scope Failure

  • uv run pytest tests/unit/test_configs.py currently fails at test_shared_output_dir_propagates_through_cli; that existing CLI --output-dir propagation issue is intentionally not fixed in this PR.

Note

Low Risk
Localized orchestrator safety rail with default-on threshold and disable via None; no auth or training math changes.

Overview
Adds orchestrator.max_consecutive_errored_rollouts (default 10, None to disable) so long runs fail fast when env rollouts keep dying after verifiers retries are exhausted.

The orchestrator counts consecutive terminal errored train and eval rollouts in the main loop before TrainSink / EvalSink handling; successful rollouts reset the streak, and synthetic Cancelled markers from off-policy cleanup are ignored. Hitting the limit raises RuntimeError with a warning log per error. Docs and config validation tests cover the new knob; unit tests exercise streak logic and main-loop ordering.

Reviewed by Cursor Bugbot for commit 749b193. Bugbot is set up for automated code reviews on this repo. Configure here.

@rasdani rasdani force-pushed the halt-consecutive-errored-rollouts branch from a668787 to 749b193 Compare June 18, 2026 15:13
@rasdani rasdani changed the title Halt on consecutive errored rollouts Halt after max_consecutive_errored_rollouts Jun 18, 2026
@rasdani rasdani marked this pull request as ready for review June 18, 2026 15:15
@rasdani rasdani requested review from mikasenghaas and samsja June 18, 2026 15:17

@mikasenghaas mikasenghaas left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm looks a bit dangerous, esp the default. in prod we have had hours of outages and then training resumed as the service came back which was acc quite nice

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove

max_off_policy_steps: int = Field(8, ge=0)
"""Maximum policies allowed to generate a single rollout. Rollouts generated more than ``max_off_policy_steps`` ahead of training are discarded. Higher values yield better throughput at the cost of off-policy noise."""

max_consecutive_errored_rollouts: int | None = Field(10, ge=1)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think 10 is a bit rough here?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

scared of cancelling runs on intermittent errors

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what do you suggest?

@samsja samsja left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we remove tests ?

@rasdani

rasdani commented Jun 18, 2026

Copy link
Copy Markdown
Contributor Author

@codex remove all tests this PR added

@chatgpt-codex-connector

Copy link
Copy Markdown

Summary

  • Removed the config tests added for max_consecutive_errored_rollouts, restoring the surrounding tests/unit/test_configs.py flow from trainer token-export coverage directly into existing inference DP-rank config coverage. tests/unit/test_configs.pyL170-L176
  • Deleted tests/unit/orchestrator/test_orchestrator_error_guard.py, which contained the orchestrator error-guard tests requested for removal.
  • Committed the changes on the current branch: 209b807 Remove rollout error guard tests.
  • Created a PR record titled: Remove tests added by rollout error guard PR.

Testing

  • git diff --check
  • ⚠️ uv run ruff check tests/unit/test_configs.py — could not run because uv failed dependency resolution for local dependency deps/research-environments/environments/aime2024 after warning about parsing exclude-newer = "7 days".

View task →

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants