feat(orchestrator): use trace algorithm registry by willccbb · Pull Request #2842 · PrimeIntellect-ai/prime-rl

willccbb · 2026-06-20T03:18:58Z

Summary

Reworks prime-rl's algorithm path around trace algorithms, stacked on the trace base in #2857.

Current scope:

deletes the old src/prime_rl/orchestrator/algo/ runtime package and sampler.py
deletes packages/prime-rl-configs/src/prime_rl/configs/algorithm.py
adds builtin trace algorithms in src/prime_rl/orchestrator/algorithms.py as vf.Algorithm implementations
keeps one slim-compatible local AlgorithmConfig carrier in prime-rl-configs; builtin config classes are not duplicated, and env-server calls convert to vf.AlgorithmConfig at the runtime boundary
orchestrator.algorithms = ["grpo"] and per-env algorithms = [...] select builtin ids or env-owned algorithm ids
the trained model is orchestrator.model; policy is the reserved runtime key; actor selects the rollout model and defaults to policy
extra model endpoints are unprivileged keys under orchestrator.models, so OPD/OPSD use configured model keys rather than a privileged teacher path
unknown/env-owned algorithms run across the env-server run_algorithms boundary with traces, algorithm configs, and model runtime configs
trainer transport carries one sample with parallel advantage channels; same-loss channels are aggregated and currently route through rl / ce

Algorithms mutate trace branches explicitly by setting branch.advantages and branch.mask; prime-rl validates dimensions before aggregating channels and routing them to the trainer.
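
The contract above can be sketched roughly as follows. This is a minimal illustration rather than prime-rl's code: `Branch`, `validate_branch`, and `aggregate_channels` are hypothetical names, and the elementwise sum used to combine same-loss channels is an assumption, since the description only says those channels are aggregated.

```python
from dataclasses import dataclass, field


@dataclass
class Branch:
    """Hypothetical stand-in for a trace branch; not the verifiers type."""

    token_ids: list[int]
    advantages: list[float] = field(default_factory=list)
    mask: list[bool] = field(default_factory=list)


def validate_branch(branch: Branch) -> None:
    """Reject channels whose length does not match the branch's token count."""
    n = len(branch.token_ids)
    if len(branch.advantages) != n or len(branch.mask) != n:
        raise ValueError(
            f"expected {n} entries, got {len(branch.advantages)} advantages "
            f"and {len(branch.mask)} mask values"
        )


def aggregate_channels(channels: list[tuple[str, Branch]]) -> dict[str, list[float]]:
    """Combine channels that share a loss name ("rl" or "ce").

    Assumption: aggregation is an elementwise sum of masked advantages.
    """
    totals: dict[str, list[float]] = {}
    for loss, branch in channels:
        validate_branch(branch)
        masked = [a if m else 0.0 for a, m in zip(branch.advantages, branch.mask)]
        if loss not in totals:
            totals[loss] = masked
            continue
        if len(totals[loss]) != len(masked):
            raise ValueError(f"channel length mismatch for loss {loss!r}")
        totals[loss] = [x + y for x, y in zip(totals[loss], masked)]
    return totals
```

Validating before aggregation means a malformed channel fails loudly instead of being silently truncated by `zip`.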

Deliberately out of scope here:

no separate configs/v1 or configs/debug/v1 demo tree
no teacher-privileged config path
no student config key
no ref_kl trainer path for OPD/OPSD in this pass; those fold reference-vs-rollout logprob deltas into RL advantages
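
One possible reading of "fold reference-vs-rollout logprob deltas into RL advantages" is sketched below. The sign convention (reference minus rollout), the `scale` parameter, and the helper name `logprob_delta_advantages` are all assumptions for illustration; the PR text does not specify them.

```python
def logprob_delta_advantages(
    reference_logprobs: list[float],
    rollout_logprobs: list[float],
    sampled: list[bool],
    scale: float = 1.0,
) -> tuple[list[float], list[bool]]:
    """Return per-token (advantages, mask) from logprob deltas.

    Assumption: a token the reference model finds more likely than the
    rollout model did receives a positive advantage. Unsampled tokens
    get zero advantage and are masked out.
    """
    if not (len(reference_logprobs) == len(rollout_logprobs) == len(sampled)):
        raise ValueError("all inputs must have one entry per token")
    advantages = [
        scale * (ref - roll) if is_sampled else 0.0
        for ref, roll, is_sampled in zip(reference_logprobs, rollout_logprobs, sampled)
    ]
    return advantages, list(sampled)
```

Under this reading the trainer needs no dedicated reference-KL term, because the reference signal arrives as ordinary per-token advantages.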

Stack

This is PR 2 of 2:

feat(orchestrator): add trace support on algorithm abstraction #2857: trace support into feat/algorithm-abstraction
feat(orchestrator): use trace algorithm registry #2842: trace algorithm registry rewrite into feat/trace-algorithm-abstraction-base

Depends on the verifiers sibling PR pinned through deps/verifiers at d491bf07: PrimeIntellect-ai/verifiers#1773.

Validation

slim wheel import reproduction: build packages/prime-rl-configs, install into /tmp/slim-prime-rl-check, import config modules, and verify forbidden heavy deps do not load -> pass
slim real-config parse reproduction: /tmp/slim-prime-rl-check/bin/python parses examples/Intellect-3.1/rl.toml and confirms slurm.template_path stays None -> pass
targeted cleanup lint: UV_CACHE_DIR=/tmp/uv-cache uv run ruff check packages/prime-rl-configs/src/prime_rl/configs/orchestrator.py src/prime_rl/orchestrator/algorithms.py src/prime_rl/orchestrator/envs.py -> pass
git diff --check feat/trace-algorithm-abstraction-base..HEAD -> pass
UV_CACHE_DIR=/tmp/uv-cache uv lock --check -> pass
Earlier on this branch before the rebase/cleanup: UV_CACHE_DIR=/tmp/uv-cache uv run ruff check -> pass; targeted unit suite -> 207 passed

Note

High Risk
Large breaking config and training-pipeline refactor touching advantage computation, OPD/OPSD loss routing, and rollout actor selection—incorrect migration or channel aggregation could silently change gradients.

Overview
Replaces the old [orchestrator.algo] / teacher-centric stack with trace algorithms on verifiers.v1: the orchestrator/algo/ runtime, sampler.py, and configs/algorithm.py are removed, and builtins (grpo, max_rl, rl, sft, echo, opd, opsd) live in orchestrator/algorithms.py as vf.Algorithm implementations that set branch.advantages and branch.mask.

Configuration moves to [[orchestrator.algorithms]] (plus per-env [[orchestrator.train.env.algorithms]]), orchestrator.actor for rollout sampling (default: policy), and extra endpoints under [orchestrator.models.<key>] with tokens = true for token-capable actors and reference scoring. SFT distillation sets actor = "reference" instead of configuring a frozen teacher on the algorithm.

Runtime: TrainSink runs builtin ids in-process and env-owned ids via env_client.run_algorithms with model runtime configs; the orchestrator wires model_pools / model_runtimes, and the dispatcher picks the train pool from the global actor (off-policy aging applies only to the policy actor). Batch-time finalize_batch reference scoring in the old algo package is dropped; OPD/OPSD now fold reference-vs-rollout logprob deltas into RL advantages rather than using a separate ref_kl trainer path. CI/debug TOMLs and docs/algorithms.md / docs/training.md are updated to the new shape.

Reviewed by Cursor Bugbot for commit 0249bba. Bugbot is set up for automated code reviews on this repo.

cursor · 2026-06-20T03:20:37Z

-# instead of echo's tool default.
-[orchestrator.algo]
-type = "echo"
-


Echo skips alphabet-sort user feedback

Medium Severity

The echo debug config still targets alphabet-sort, but the commit drops per-role echo settings and relies on the builtin echo advantage, which only CE-trains unsampled tool tokens. The prior config explicitly trained user observation tokens because env feedback arrives as user messages, so ECHO supervision on observation tokens no longer applies for this env.

Additional Locations (1)

src/prime_rl/orchestrator/advantages.py#L54-L68

Reviewed by Cursor Bugbot for commit c5d4bf2.

cursor · 2026-06-22T23:49:10Z

+                        values.append(0.1 if use_token else 0.0)
+                        mask.append(use_token)
+                branch.advantages = values
+                branch.mask = mask


Echo ignores user observation tokens

Medium Severity

The builtin echo algorithm only assigns CE weights to unsampled tool message tokens. The alphabet-sort debug config previously trained user feedback tokens; that override was removed, so ECHO no longer supervises the observation tokens that env actually emits.
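
A role-configurable variant is one way to close the gap this comment describes. The sketch below is illustrative only: `Token`, `echo_channel`, and the `roles` parameter are hypothetical and not part of this PR, while the 0.1 weight mirrors the value in the quoted diff.

```python
from dataclasses import dataclass


@dataclass
class Token:
    """Hypothetical per-token record: message role plus whether it was sampled."""

    role: str
    sampled: bool


def echo_channel(
    tokens: list[Token],
    roles: tuple[str, ...] = ("tool",),
    weight: float = 0.1,
) -> tuple[list[float], list[bool]]:
    """Build CE weights and a mask over unsampled tokens from the given roles."""
    values: list[float] = []
    mask: list[bool] = []
    for token in tokens:
        use_token = (not token.sampled) and token.role in roles
        values.append(weight if use_token else 0.0)
        mask.append(use_token)
    return values, mask
```

With the default `roles=("tool",)`, user-message feedback is skipped, which is the behavior reported here; passing `roles=("tool", "user")` would also supervise observation tokens that arrive as user messages.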

Additional Locations (1)

configs/debug/algorithms/echo.toml#L17-L24

Reviewed by Cursor Bugbot for commit 61cf4a9.

cursor · 2026-06-23T00:08:13Z

+                )
+                branch_refs.append((trace, branch.index))
+
+        scored = await asyncio.gather(*calls) if calls else []


OPD unbounded prefill concurrency

Medium Severity

Builtin OPD.advantage schedules one asyncio.create_task per trace branch and awaits gather with no concurrency cap. The previous OPDAlgorithm honored max_concurrent (default 32), so large batches can stampede the reference endpoint.
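
A common way to restore a concurrency cap is to gate each call behind an `asyncio.Semaphore`. The helper below is a generic sketch, not the previous OPDAlgorithm implementation: `gather_bounded` is a hypothetical name, and the default of 32 only echoes the `max_concurrent` default mentioned above.

```python
import asyncio
from collections.abc import Awaitable, Callable
from typing import TypeVar

T = TypeVar("T")


async def gather_bounded(
    factories: list[Callable[[], Awaitable[T]]],
    max_concurrent: int = 32,
) -> list[T]:
    """Run the awaitables with at most `max_concurrent` in flight.

    Takes zero-argument factories rather than coroutines so that no
    request starts until its semaphore slot is acquired.
    """
    semaphore = asyncio.Semaphore(max_concurrent)

    async def run(factory: Callable[[], Awaitable[T]]) -> T:
        async with semaphore:
            return await factory()

    # gather preserves input order, so results line up with branch refs.
    return await asyncio.gather(*(run(f) for f in factories))
```

Because `asyncio.gather` returns results in input order, a parallel list such as `branch_refs` can still be zipped against the scored output.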

Reviewed by Cursor Bugbot for commit f7affe6.

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 4 total unresolved issues (including 3 from previous reviews).

Reviewed by Cursor Bugbot for commit a350b0c.

willccbb mentioned this pull request Jun 20, 2026

feat(v1): add trace algorithm classes PrimeIntellect-ai/verifiers#1773

Draft

cursor Bot reviewed Jun 20, 2026

Comment thread src/prime_rl/orchestrator/train_sink.py

Comment thread src/prime_rl/orchestrator/advantages.py Outdated

cursor Bot reviewed Jun 20, 2026

Comment thread packages/prime-rl-configs/src/prime_rl/configs/orchestrator.py

willccbb changed the title from "feat(orchestrator): replace algorithms with advantage functions" to "feat(orchestrator): use trace algorithm registry" on Jun 22, 2026

cursor Bot reviewed Jun 22, 2026

Comment thread packages/prime-rl-configs/src/prime_rl/configs/orchestrator.py Outdated

cursor Bot reviewed Jun 22, 2026

willccbb force-pushed the feat/trace-advantage-functions-v1-clean branch 2 times, most recently from 3a2acc4 to f7affe6, on June 23, 2026 00:06

cursor Bot reviewed Jun 23, 2026

This was referenced Jun 23, 2026

feat(orchestrator): add trace support on algorithm abstraction #2857

Draft

Trace-native advantage functions #2840

Closed

willccbb force-pushed the feat/trace-advantage-functions-v1-clean branch 3 times, most recently from 7669bf5 to a350b0c, on June 23, 2026 02:33

cursor Bot reviewed Jun 23, 2026

Comment thread src/prime_rl/orchestrator/algorithms.py

willccbb force-pushed the feat/trace-advantage-functions-v1-clean branch from a350b0c to d6d4afa on June 23, 2026 02:38

willccbb added 9 commits June 23, 2026 02:46

6a8b400 feat(orchestrator): replace algorithms with advantage functions
fb516e3 fix(monitor): log trace rollout advantages
28d360a feat(orchestrator): use trace algorithm registry
da1d9d8 fix(configs): keep algorithm config slim-safe
3489c4b chore(deps): update verifiers algorithm pin
3565636 chore(deps): refresh verifiers algorithm pin
67dd980 chore(config): slim algorithm config carrier
20bd3fe chore(deps): refresh v1 plugin package lock
0249bba fix(configs): keep algorithm carrier slim-compatible

willccbb force-pushed the feat/trace-advantage-functions-v1-clean branch from 0a3995e to 0249bba on June 23, 2026 02:46

willccbb wants to merge 9 commits into feat/trace-algorithm-abstraction-base from feat/trace-advantage-functions-v1-clean
