feat(orchestrator): use trace algorithm registry #2842
    # instead of echo's tool default.
    [orchestrator.algo]
    type = "echo"
Echo skips alphabet-sort user feedback
Medium Severity
The echo debug config still targets alphabet-sort, but the commit drops per-role echo settings and relies on the builtin echo advantage, which only CE-trains unsampled tool tokens. The prior config explicitly trained user observation tokens because env feedback arrives as user messages, so ECHO supervision on observation tokens no longer applies for this env.
Reviewed by Cursor Bugbot for commit c5d4bf2.
    values.append(0.1 if use_token else 0.0)
    mask.append(use_token)
    branch.advantages = values
    branch.mask = mask
Echo ignores user observation tokens
Medium Severity
The builtin echo algorithm only assigns CE weights to unsampled tool message tokens. The alphabet-sort debug config previously trained user feedback tokens; that override was removed, so ECHO no longer supervises the observation tokens that env actually emits.
Reviewed by Cursor Bugbot for commit 61cf4a9.
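To make the concern concrete, here is a minimal sketch of role-filtered ECHO weighting. It is not the code under review: the `Token` and `Branch` dataclasses, the `apply_echo` helper, and its `roles` parameter are hypothetical names introduced for illustration. Only the 0.1 weight and the `branch.advantages` / `branch.mask` assignment come from the diff excerpt above.

```python
from dataclasses import dataclass, field


@dataclass
class Token:
    role: str      # role of the message the token belongs to, e.g. "assistant", "tool", "user"
    sampled: bool  # True if the policy sampled this token during rollout


@dataclass
class Branch:
    # Hypothetical stand-in for a trace branch; the real type lives in verifiers.
    tokens: list
    advantages: list = field(default_factory=list)
    mask: list = field(default_factory=list)


def apply_echo(branch, roles=frozenset({"tool"}), weight=0.1):
    """Give a CE weight to unsampled tokens whose role is in `roles` (assumed knob)."""
    values, mask = [], []
    for tok in branch.tokens:
        use_token = (not tok.sampled) and tok.role in roles
        values.append(weight if use_token else 0.0)
        mask.append(use_token)
    branch.advantages = values
    branch.mask = mask
    return branch


# An env that returns feedback as user messages gets no supervision with the
# tool-only default, and does once "user" is included.
branch = Branch(tokens=[Token("assistant", True), Token("user", False)])
tool_only_mask = list(apply_echo(branch).mask)
with_user_mask = list(apply_echo(branch, roles=frozenset({"tool", "user"})).mask)
```

With the tool-only default, the user feedback token is masked out. Whether the fix should be a per-env role override or a different builtin default is a design call for the PR author.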
Force-pushed from 3a2acc4 to f7affe6.
    )
    branch_refs.append((trace, branch.index))

    scored = await asyncio.gather(*calls) if calls else []
OPD unbounded prefill concurrency
Medium Severity
Builtin OPD.advantage schedules one asyncio.create_task per trace branch and awaits gather with no concurrency cap. The previous OPDAlgorithm honored max_concurrent (default 32), so large batches can stampede the reference endpoint.
Reviewed by Cursor Bugbot for commit f7affe6.
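One common way to restore a cap is to wrap each call in an `asyncio.Semaphore` before gathering. The sketch below is an illustration under assumptions, not the PR's code: `bounded_gather` and the fake `score` coroutine are hypothetical, and the default of 32 mirrors the `max_concurrent` default mentioned in the comment above.

```python
import asyncio


async def bounded_gather(coro_fns, max_concurrent=32):
    """Run zero-argument coroutine functions with at most `max_concurrent`
    in flight at once. Results come back in input order, as with gather."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def run(fn):
        async with semaphore:
            return await fn()

    return await asyncio.gather(*(run(fn) for fn in coro_fns))


async def demo():
    in_flight = 0
    peak = 0

    async def score(i):
        # Hypothetical stand-in for one reference-endpoint prefill call.
        nonlocal in_flight, peak
        in_flight += 1
        peak = max(peak, in_flight)
        await asyncio.sleep(0.01)
        in_flight -= 1
        return i * 2

    calls = [lambda i=i: score(i) for i in range(10)]
    results = await bounded_gather(calls, max_concurrent=3)
    return results, peak
```

Passing coroutine functions rather than already-created tasks matters here: a task starts running as soon as it is created, so the semaphore has to be acquired before the request is issued.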
Force-pushed from 7669bf5 to a350b0c.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
There are 4 total unresolved issues (including 3 from previous reviews).
Reviewed by Cursor Bugbot for commit a350b0c.
Force-pushed from a350b0c to d6d4afa.
Force-pushed from 0a3995e to 0249bba.


Summary
Reworks prime-rl's algorithm path around trace algorithms, stacked on the trace base in #2857.
Current scope:
- src/prime_rl/orchestrator/algo/ runtime package and sampler.py
- packages/prime-rl-configs/src/prime_rl/configs/algorithm.py
- src/prime_rl/orchestrator/algorithms.py as vf.Algorithm implementations
- AlgorithmConfig carrier in prime-rl-configs; builtin config classes are not duplicated, and env-server calls convert to vf.AlgorithmConfig at the runtime boundary
- orchestrator.algorithms = ["grpo"] and per-env algorithms = [...] select builtin ids or env-owned algorithm ids
- orchestrator.model; policy is the reserved runtime key; actor selects the rollout model and defaults to policy
- orchestrator.models, so OPD/OPSD use configured model keys rather than a privileged teacher path
- run_algorithms boundary with traces, algorithm configs, and model runtime configs
- rl/ce

Algorithms mutate trace branches explicitly by setting branch.advantages and branch.mask; prime-rl validates dimensions before aggregating channels and routing them to the trainer.

Deliberately out of scope here:
- configs/v1 or configs/debug/v1 demo tree
- teacher-privileged config path
- student config key
- ref_kl trainer path for OPD/OPSD in this pass; those fold reference-vs-rollout logprob deltas into RL advantages

Stack
This is PR 2 of 2:
- feat/algorithm-abstraction
- feat/trace-algorithm-abstraction-base

Depends on the verifiers sibling PR pinned through deps/verifiers at d491bf07: PrimeIntellect-ai/verifiers#1773.

Validation
- packages/prime-rl-configs, install into /tmp/slim-prime-rl-check, import config modules, and verify forbidden heavy deps do not load -> pass
- /tmp/slim-prime-rl-check/bin/python parses examples/Intellect-3.1/rl.toml and keeps slurm.template_path is None -> pass
- UV_CACHE_DIR=/tmp/uv-cache uv run ruff check packages/prime-rl-configs/src/prime_rl/configs/orchestrator.py src/prime_rl/orchestrator/algorithms.py src/prime_rl/orchestrator/envs.py -> pass
- git diff --check feat/trace-algorithm-abstraction-base..HEAD -> pass
- UV_CACHE_DIR=/tmp/uv-cache uv lock --check -> pass
- UV_CACHE_DIR=/tmp/uv-cache uv run ruff check -> pass; targeted unit suite -> 207 passed

Note
High Risk
Large breaking config and training-pipeline refactor touching advantage computation, OPD/OPSD loss routing, and rollout actor selection—incorrect migration or channel aggregation could silently change gradients.
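The risk named above is easiest to see with a small sketch of the check the summary describes: validate that each algorithm's advantages and mask match the branch length, then combine per channel. This is an illustration only. The `(channel, advantages, mask)` tuple shape, the function names, and the choice to sum masked values per channel are all assumptions, not prime-rl's actual aggregation rule.

```python
def validate_branch(num_tokens, advantages, mask):
    """Reject outputs whose per-token lists do not match the branch length."""
    if len(advantages) != num_tokens or len(mask) != num_tokens:
        raise ValueError(
            f"expected {num_tokens} entries, got "
            f"{len(advantages)} advantages and {len(mask)} mask values"
        )


def aggregate_channels(num_tokens, outputs):
    """Combine algorithm outputs per channel.

    `outputs` is a list of (channel, advantages, mask) tuples, one per
    algorithm. Summing masked values per channel is an assumed rule.
    """
    totals = {}
    for channel, advantages, mask in outputs:
        validate_branch(num_tokens, advantages, mask)
        acc = totals.setdefault(channel, [0.0] * num_tokens)
        for i, (value, keep) in enumerate(zip(advantages, mask)):
            if keep:
                acc[i] += value
    return totals


per_channel = aggregate_channels(
    3,
    [
        ("rl", [1.0, 2.0, 3.0], [True, False, True]),
        ("ce", [0.1, 0.1, 0.1], [False, True, False]),
    ],
)
```

Failing loudly on a length mismatch is the point of the sketch: a silently truncated `zip` would shift or drop per-token weights without any error.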
Overview
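Replaces the old [orchestrator.algo] / teacher-centric stack with trace algorithms on verifiers.v1: the orchestrator/algo/ runtime, sampler.py, and configs/algorithm.py are removed, and builtins (grpo, max_rl, rl, sft, echo, opd, opsd) live in orchestrator/algorithms.py as vf.Algorithm implementations that set branch.advantages and branch.mask.

Configuration moves to [[orchestrator.algorithms]] (plus per-env [[orchestrator.train.env.algorithms]]), orchestrator.actor for rollout sampling (default policy), and extra endpoints under [orchestrator.models.<key>] with tokens = true for token-capable actors and reference scoring. SFT distillation sets actor = "reference" instead of a frozen teacher on the algorithm.

Runtime: TrainSink runs builtin ids in-process and env-owned ids via env_client.run_algorithms with model runtime configs; the orchestrator wires model_pools / model_runtimes and the dispatcher picks the train pool from global actor (off-policy aging only for policy actor). Batch-time finalize_batch reference scoring in the old algo package is dropped; OPD/OPSD now fold reference-vs-rollout logprob deltas into RL advantages rather than a separate ref_kl trainer path. CI/debug TOMLs and docs/algorithms.md / docs/training.md are updated to the new shape.

Reviewed by Cursor Bugbot for commit 0249bba.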
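The actor rules in the overview can be sketched over a plain dict standing in for the parsed orchestrator config. The key names (`actor`, `models`, `tokens`, the reserved `policy` key) come from the description. The helper names and the choice to reject unknown or non-token-capable actors are assumptions made for illustration, not confirmed runtime behaviour.

```python
RESERVED_POLICY_KEY = "policy"


def resolve_actor(orchestrator_cfg):
    """Return the model key used for rollout sampling; defaults to the policy."""
    actor = orchestrator_cfg.get("actor", RESERVED_POLICY_KEY)
    if actor == RESERVED_POLICY_KEY:
        return actor
    models = orchestrator_cfg.get("models", {})
    if actor not in models:
        # Assumed behaviour: an unknown key is a config error.
        raise KeyError(f"actor {actor!r} is not defined under orchestrator.models")
    if not models[actor].get("tokens", False):
        # Assumed behaviour: rollout actors must be token-capable (tokens = true).
        raise ValueError(f"actor {actor!r} must set tokens = true")
    return actor


def uses_off_policy_aging(actor):
    """Per the overview, off-policy aging applies only to the policy actor."""
    return actor == RESERVED_POLICY_KEY


# SFT distillation style config: rollouts are sampled from a reference model.
distill_cfg = {"actor": "reference", "models": {"reference": {"tokens": True}}}
default_actor = resolve_actor({})
distill_actor = resolve_actor(distill_cfg)
```

Treating `policy` as reserved means a user-defined `[orchestrator.models.policy]` entry would need to be rejected or ignored somewhere; the sketch does not decide which.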
TrainSinkruns builtin ids in-process and env-owned ids viaenv_client.run_algorithmswith model runtime configs; the orchestrator wiresmodel_pools/model_runtimesand the dispatcher picks the train pool from globalactor(off-policy aging only for policy actor). Batch-timefinalize_batchreference scoring in the old algo package is dropped—OPD/OPSD now fold reference-vs-rollout logprob deltas into RL advantages rather than a separateref_kltrainer path. CI/debug TOMLs anddocs/algorithms.md/docs/training.mdare updated to the new shape.Reviewed by Cursor Bugbot for commit 0249bba. Bugbot is set up for automated code reviews on this repo. Configure here.