feat(orchestrator): use trace algorithm registry #2842

Draft
willccbb wants to merge 9 commits into feat/trace-algorithm-abstraction-base from feat/trace-advantage-functions-v1-clean

willccbb (Member) commented Jun 20, 2026

Summary

Reworks prime-rl's algorithm path around trace algorithms, stacked on the trace base in #2857.

Current scope:

  • deletes the old src/prime_rl/orchestrator/algo/ runtime package and sampler.py
  • deletes packages/prime-rl-configs/src/prime_rl/configs/algorithm.py
  • adds builtin trace algorithms in src/prime_rl/orchestrator/algorithms.py as vf.Algorithm implementations
  • keeps one slim-compatible local AlgorithmConfig carrier in prime-rl-configs; builtin config classes are not duplicated, and env-server calls convert to vf.AlgorithmConfig at the runtime boundary
  • orchestrator.algorithms = ["grpo"] and per-env algorithms = [...] select builtin ids or env-owned algorithm ids
  • the trained model is orchestrator.model; policy is the reserved runtime key; actor selects the rollout model and defaults to policy
  • extra model endpoints are unprivileged keys under orchestrator.models, so OPD/OPSD use configured model keys rather than a privileged teacher path
  • unknown/env-owned algorithms run across the env-server run_algorithms boundary with traces, algorithm configs, and model runtime configs
  • trainer transport carries one sample with parallel advantage channels; same-loss channels are aggregated and currently route through rl / ce
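The actor and model-key rules in the list above can be sketched as follows. This is a minimal illustration, not the prime-rl implementation: `resolve_actor`, the `ORCHESTRATOR` dict, and the rule that `policy` may not appear under `models` are assumptions made for the example; only the key names (`actor`, `models`, `policy`, `tokens`) come from this description.

```python
# Hypothetical dict mirroring the TOML keys named in this PR description.
ORCHESTRATOR = {
    "algorithms": ["grpo"],
    "actor": "reference",  # rollout model; defaults to "policy" when omitted
    "models": {"reference": {"tokens": True}},  # extra, unprivileged endpoint
}

RESERVED_POLICY_KEY = "policy"


def resolve_actor(orchestrator: dict) -> str:
    """Return the model key used for rollouts, defaulting to the policy."""
    actor = orchestrator.get("actor", RESERVED_POLICY_KEY)
    models = orchestrator.get("models", {})
    # Assumption: the reserved key cannot be redefined as an extra endpoint.
    if RESERVED_POLICY_KEY in models:
        raise ValueError("'policy' is reserved and cannot be an extra model key")
    if actor != RESERVED_POLICY_KEY and actor not in models:
        raise ValueError(f"actor {actor!r} is not a configured model key")
    return actor


print(resolve_actor(ORCHESTRATOR))
```

With this shape, a distillation setup only changes `actor` to a configured model key, and no teacher-specific field is needed.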

Algorithms mutate trace branches explicitly by setting branch.advantages and branch.mask; prime-rl validates dimensions before aggregating channels and routing them to the trainer.

Deliberately out of scope here:

  • no separate configs/v1 or configs/debug/v1 demo tree
  • no teacher-privileged config path
  • no student config key
  • no ref_kl trainer path for OPD/OPSD in this pass; those fold reference-vs-rollout logprob deltas into RL advantages
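To illustrate the last point, here is one plausible reading of "folding logprob deltas into RL advantages". The function name, the sign convention (reference minus rollout), and the zeroing of masked tokens are assumptions for this sketch; the actual builtin may scale or combine the deltas differently.

```python
def logprob_delta_advantages(
    reference_logprobs: list[float],
    rollout_logprobs: list[float],
    mask: list[bool],
) -> list[float]:
    """Per-token advantage = reference logprob - rollout logprob (assumed sign).

    Tokens the reference model rates higher than the rollout policy did get a
    positive advantage; masked-out tokens get 0.0.
    """
    if not (len(reference_logprobs) == len(rollout_logprobs) == len(mask)):
        raise ValueError("reference, rollout, and mask must have equal length")
    return [
        (ref - roll) if keep else 0.0
        for ref, roll, keep in zip(reference_logprobs, rollout_logprobs, mask)
    ]


print(logprob_delta_advantages([-1.0, -2.0, -0.5], [-1.5, -1.0, -0.5], [True, True, False]))
```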

Stack

This is PR 2 of 2:

  1. feat(orchestrator): add trace support on algorithm abstraction #2857: trace support into feat/algorithm-abstraction
  2. feat(orchestrator): use trace algorithm registry #2842: trace algorithm registry rewrite into feat/trace-algorithm-abstraction-base

Depends on the verifiers sibling PR pinned through deps/verifiers at d491bf07: PrimeIntellect-ai/verifiers#1773.

Validation

  • slim wheel import reproduction: build packages/prime-rl-configs, install into /tmp/slim-prime-rl-check, import config modules, and verify forbidden heavy deps do not load -> pass
  • slim real-config parse reproduction: /tmp/slim-prime-rl-check/bin/python parses examples/Intellect-3.1/rl.toml and confirms that slurm.template_path stays None -> pass
  • targeted cleanup lint: UV_CACHE_DIR=/tmp/uv-cache uv run ruff check packages/prime-rl-configs/src/prime_rl/configs/orchestrator.py src/prime_rl/orchestrator/algorithms.py src/prime_rl/orchestrator/envs.py -> pass
  • git diff --check feat/trace-algorithm-abstraction-base..HEAD -> pass
  • UV_CACHE_DIR=/tmp/uv-cache uv lock --check -> pass
  • Earlier on this branch before the rebase/cleanup: UV_CACHE_DIR=/tmp/uv-cache uv run ruff check -> pass; targeted unit suite -> 207 passed

Note

High Risk
Large breaking config and training-pipeline refactor touching advantage computation, OPD/OPSD loss routing, and rollout actor selection. Incorrect migration or channel aggregation could silently change gradients.

Overview
Replaces the old [orchestrator.algo] / teacher-centric stack with trace algorithms on verifiers.v1: the orchestrator/algo/ runtime, sampler.py, and configs/algorithm.py are removed, and builtins (grpo, max_rl, rl, sft, echo, opd, opsd) live in orchestrator/algorithms.py as vf.Algorithm implementations that set branch.advantages and branch.mask.

Configuration moves to [[orchestrator.algorithms]] (plus per-env [[orchestrator.train.env.algorithms]]), orchestrator.actor for rollout sampling (default policy), and extra endpoints under [orchestrator.models.<key>] with tokens = true for token-capable actors and reference scoring. SFT distillation sets actor = "reference" instead of frozen teacher on the algorithm.

Runtime: TrainSink runs builtin ids in-process and env-owned ids via env_client.run_algorithms with model runtime configs. The orchestrator wires model_pools / model_runtimes, and the dispatcher picks the train pool from the global actor (off-policy aging applies only when the actor is policy). The batch-time finalize_batch reference scoring from the old algo package is dropped: OPD/OPSD now fold reference-vs-rollout logprob deltas into RL advantages rather than using a separate ref_kl trainer path. CI/debug TOMLs and docs/algorithms.md / docs/training.md are updated to the new shape.
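The builtin versus env-owned split described above could look roughly like this. The builtin id set is taken from the overview; `partition_algorithms` is a hypothetical helper, not a function in TrainSink.

```python
# Builtin ids as listed in the overview above.
BUILTIN_IDS = frozenset({"grpo", "max_rl", "rl", "sft", "echo", "opd", "opsd"})


def partition_algorithms(ids: list[str]) -> tuple[list[str], list[str]]:
    """Split configured ids into (run in-process, send to the env server)."""
    builtin = [algo_id for algo_id in ids if algo_id in BUILTIN_IDS]
    env_owned = [algo_id for algo_id in ids if algo_id not in BUILTIN_IDS]
    return builtin, env_owned


builtin, env_owned = partition_algorithms(["grpo", "my_env_algo", "opd"])
print(builtin)    # ids handled by the orchestrator itself
print(env_owned)  # ids forwarded across the run_algorithms boundary
```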

Reviewed by Cursor Bugbot for commit 0249bba. Bugbot is set up for automated code reviews on this repo. Configure here.

# instead of echo's tool default.
[orchestrator.algo]
type = "echo"

Echo skips alphabet-sort user feedback

Medium Severity

The echo debug config still targets alphabet-sort, but the commit drops per-role echo settings and relies on the builtin echo advantage, which only CE-trains unsampled tool tokens. The prior config explicitly trained user observation tokens because env feedback arrives as user messages, so ECHO supervision on observation tokens no longer applies for this env.

Reviewed by Cursor Bugbot for commit c5d4bf2. Configure here.

Comment thread src/prime_rl/orchestrator/advantages.py Outdated
Comment thread src/prime_rl/orchestrator/metrics.py Outdated
Comment thread src/prime_rl/orchestrator/train_sink.py
Comment thread src/prime_rl/orchestrator/advantages.py Outdated
Comment thread packages/prime-rl-configs/src/prime_rl/configs/orchestrator.py
willccbb changed the title from "feat(orchestrator): replace algorithms with advantage functions" to "feat(orchestrator): use trace algorithm registry" on Jun 22, 2026
Comment thread packages/prime-rl-configs/src/prime_rl/configs/orchestrator.py Outdated
values.append(0.1 if use_token else 0.0)
mask.append(use_token)
branch.advantages = values
branch.mask = mask

Echo ignores user observation tokens

Medium Severity

The builtin echo algorithm only assigns CE weights to unsampled tool message tokens. The alphabet-sort debug config previously trained user feedback tokens; that override was removed, so ECHO no longer supervises the observation tokens that env actually emits.

Reviewed by Cursor Bugbot for commit 61cf4a9. Configure here.

willccbb force-pushed the feat/trace-advantage-functions-v1-clean branch 2 times, most recently from 3a2acc4 to f7affe6, on June 23, 2026 00:06
)
branch_refs.append((trace, branch.index))

scored = await asyncio.gather(*calls) if calls else []

OPD unbounded prefill concurrency

Medium Severity

Builtin OPD.advantage schedules one asyncio.create_task per trace branch and awaits gather with no concurrency cap. The previous OPDAlgorithm honored max_concurrent (default 32), so large batches can stampede the reference endpoint.
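A common way to restore a cap is to wrap each call in an `asyncio.Semaphore`. The sketch below is generic and assumes nothing about the OPD internals: `gather_bounded` is a hypothetical helper, and the default of 32 simply mirrors the previous `max_concurrent` default mentioned above.

```python
import asyncio
from collections.abc import Awaitable, Iterable
from typing import TypeVar

T = TypeVar("T")


async def gather_bounded(coros: Iterable[Awaitable[T]], max_concurrent: int = 32) -> list[T]:
    """Like asyncio.gather, but at most max_concurrent awaitables run at once."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def run(coro: Awaitable[T]) -> T:
        # Each awaitable only starts once a semaphore slot is free.
        async with semaphore:
            return await coro

    # gather preserves input order in its result list.
    return await asyncio.gather(*(run(c) for c in coros))


async def fake_score(i: int) -> int:
    """Stand-in for a reference-endpoint prefill call."""
    await asyncio.sleep(0)
    return i * 2


print(asyncio.run(gather_bounded([fake_score(i) for i in range(5)], max_concurrent=2)))
```

Results keep the order of the input awaitables, so a parallel list such as `branch_refs` would still line up with `scored`.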

Reviewed by Cursor Bugbot for commit f7affe6. Configure here.

willccbb force-pushed the feat/trace-advantage-functions-v1-clean branch 3 times, most recently from 7669bf5 to a350b0c, on June 23, 2026 02:33

cursor (bot) left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 4 total unresolved issues (including 3 from previous reviews).

Reviewed by Cursor Bugbot for commit a350b0c. Configure here.

Comment thread src/prime_rl/orchestrator/algorithms.py
willccbb force-pushed the feat/trace-advantage-functions-v1-clean branch from a350b0c to d6d4afa on June 23, 2026 02:38
willccbb force-pushed the feat/trace-advantage-functions-v1-clean branch from 0a3995e to 0249bba on June 23, 2026 02:46