Trace-native advantage functions by willccbb · Pull Request #2840 · PrimeIntellect-ai/prime-rl

willccbb · 2026-06-19T01:42:19Z

Summary

Stacked on the intermediate prime-rl review base feat/trace-algorithm-abstraction-base, which combines prime-rl trace support (feat/nano-as-v1, #2742) with the old algorithm-abstraction branch (feat/algorithm-abstraction, #2746). That base now intentionally restores unrelated dependency, inference, template, and lockfile paths to the trace-support branch state, so this PR is the focused rewrite from the old algorithm-class runtime to trace-native advantage functions.

Replaces algorithm classes / RolloutView with decorated trace advantage functions: @vf.advantage(loss="rl" | "ce", scope="rollout" | "group").
Adds the sparse public config surface: orchestrator.advantage.name, orchestrator.actor, and unprivileged keyed orchestrator.models.
Stamps completed traces with trace.actor and trace.models, runs the selected advantage function, then builds training samples from branch-level advantages and mask.
Ports built-ins to self-contained advantage functions: grpo, max_rl, sft, echo, opd, and opsd.
Routes active training through generic rl and sft/CE trainer paths; OPD/OPSD write fixed token advantages and no longer require teacher/ref-logprob trainer fields.
Removes privileged teacher config, old algorithm config/runtime files, and teacher-logprob transport/trainer plumbing.
Pins deps/verifiers to af3fd017 from the sibling draft PR "Trace advantage authoring surface" (verifiers#1760).
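To make the decorated-function shape above concrete, here is a minimal, self-contained sketch. It does not import verifiers: the `advantage` decorator, the `Trace` and `Branch` classes, and the function signature are all hypothetical stand-ins, since this PR only shows the `loss` and `scope` arguments. The mean-baseline rule is a GRPO-style illustration, not necessarily the exact built-in.

```python
from dataclasses import dataclass, field
from statistics import mean


def advantage(loss: str, scope: str):
    """Hypothetical stand-in for vf.advantage; only loss/scope come from the PR text."""
    if loss not in {"rl", "ce"}:
        raise ValueError(f"unsupported loss: {loss!r}")
    if scope not in {"rollout", "group"}:
        raise ValueError(f"unsupported scope: {scope!r}")

    def wrap(fn):
        # Attach metadata so a caller could later pick the trainer path.
        fn.loss = loss
        fn.scope = scope
        return fn

    return wrap


@dataclass
class Branch:
    """Hypothetical branch of a trace holding per-token advantages and a mask."""
    num_tokens: int
    advantages: list[float] = field(default_factory=list)
    mask: list[bool] = field(default_factory=list)


@dataclass
class Trace:
    """Hypothetical completed trace with a scalar reward."""
    reward: float
    branches: list[Branch]


@advantage(loss="rl", scope="group")
def grpo_like(group: list[Trace]) -> None:
    # Assumption: subtract the group-mean reward and broadcast to every token.
    baseline = mean(trace.reward for trace in group)
    for trace in group:
        value = trace.reward - baseline
        for branch in trace.branches:
            branch.advantages = [value] * branch.num_tokens
            branch.mask = [True] * branch.num_tokens


group = [Trace(1.0, [Branch(3)]), Trace(0.0, [Branch(2)])]
grpo_like(group)
```

Under this sketch, a group whose rewards are all equal receives all-zero advantages, which is one plausible way a tiny debug setup could produce batches with nothing to train on.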

Validation

The latest stack cleanup was diff-only. I verified that the top PR no longer contains the unrelated old-branch dependency/inference reversions; the only dependency-looking path left in the top diff is the expected deps/verifiers pin.

Validation previously run on this implementation branch:

uv run --no-sync ruff check ... on touched Python files
uv run --no-sync pytest tests/unit/orchestrator/test_advantage.py tests/unit/orchestrator/test_batch.py tests/unit/orchestrator/test_orchestrator_setup.py -q
uv run --no-sync pytest tests/unit/train/rl/test_loss.py -q with GPU access
uv run --no-sync pytest tests/unit/train/models/test_nemotron_h_kl.py -q with GPU access
uv run --no-sync rl @ configs/v1/training_mode/{rl,opd,sft}.toml --dry-run ...
uv run --no-sync rl @ configs/ci/integration/reverse_text_rl_{opd,sft}/start.toml --dry-run ...
One-step CE smoke passed on GPUs 0/1
One-step RL smoke passed on GPUs 0/1 with the zero-advantage post-filter disabled
After removing the raw reward built-in: uv run --no-sync pytest tests/unit/orchestrator/test_advantage.py -q and focused ruff check

Default GRPO smoke on the tiny debug setup hit the existing consecutive zero-trainable-batch guard; that looked like task/filter behavior rather than an integration failure.

Note

High Risk
Large behavioral refactor across orchestrator config validation, rollout dispatch, advantage computation, and trainer loss routing; OPD semantics move from trainer KL to orchestrator token weights, so misconfigured models/actor or custom advantages can silently change training.

Overview
Replaces orchestrator.training_mode (rl / opd / sft) and privileged orchestrator.teacher with orchestrator.advantage.name, orchestrator.actor (default policy), and keyed orchestrator.models for auxiliary token endpoints.

Orchestrator pipeline: Completed traces get trace.actor and trace.models; @vf.advantage functions (built-ins: grpo, max_rl, sft, echo, opd, opsd) mutate per-branch advantages and mask before samples are built. OPD/OPSD fetch teacher/policy logprobs inside the advantage step instead of a separate batch-time teacher-logprob pass. Train rollouts can use a non-policy actor (e.g. SFT with actor = "teacher").
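The PR does not spell out the formula OPD/OPSD use for their fixed token advantages. The sketch below shows one common choice for on-policy distillation, assumed here purely for illustration: the per-token difference between teacher and policy logprobs, zeroed outside the mask. The function name and arguments are hypothetical.

```python
def opd_token_advantages(
    teacher_logprobs: list[float],
    policy_logprobs: list[float],
    mask: list[bool],
) -> list[float]:
    """Assumed rule: advantage_t = teacher_logprob_t - policy_logprob_t on masked-in tokens."""
    if not (len(teacher_logprobs) == len(policy_logprobs) == len(mask)):
        raise ValueError("logprob and mask sequences must have equal length")
    advantages = []
    for teacher_lp, policy_lp, keep in zip(teacher_logprobs, policy_logprobs, mask):
        # Masked-out tokens (for example, prompt tokens) get zero weight.
        advantages.append(teacher_lp - policy_lp if keep else 0.0)
    return advantages


token_advantages = opd_token_advantages(
    teacher_logprobs=[-1.0, -2.0, -0.5],
    policy_logprobs=[-1.5, -1.0, -0.5],
    mask=[False, True, True],
)
```

Because the weights are computed once in the orchestrator, the trainer in this picture only sees ordinary per-token advantages and needs no teacher logprobs of its own.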

Trainer / transport: Drops teacher_logprobs and the dedicated opd loss path; batches are rl (DPPO + configured loss) or sft (masked CE) based on advantage loss metadata. Per-token advantages replace a single scalar advantage. Configs, docs, and tests are updated accordingly; built-in length-penalty advantage shaping is removed from this path.
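A minimal sketch of the routing described above, assuming the advantage function's `loss` metadata ("rl" or "ce") selects the batch type, with "ce" mapping to the sft path. The helper names are hypothetical, and the masked CE shown is the textbook mean negative log-likelihood over kept tokens, not necessarily the trainer's exact normalization.

```python
def route_batch(loss: str) -> str:
    """Map advantage loss metadata to a trainer path (assumed mapping)."""
    routes = {"rl": "rl", "ce": "sft"}
    if loss not in routes:
        raise ValueError(f"unsupported loss metadata: {loss!r}")
    return routes[loss]


def masked_ce(token_logprobs: list[float], mask: list[bool]) -> float:
    """Mean negative log-likelihood over tokens where mask is True."""
    kept = [lp for lp, keep in zip(token_logprobs, mask) if keep]
    if not kept:
        # Nothing trainable in this sample; return a zero loss.
        return 0.0
    return -sum(kept) / len(kept)


path = route_batch("ce")
loss_value = masked_ce([-1.0, -3.0, -2.0], [True, False, True])
```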

Reviewed by Cursor Bugbot for commit 76f1c77.

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

willccbb · 2026-06-23T00:40:31Z

Closing this draft as superseded by the cleaned two-PR stack: base trace support in #2857, algorithm registry rewrite in #2842.

cursor Bot reviewed Jun 19, 2026

Comment thread src/prime_rl/orchestrator/filters.py

Comment thread src/prime_rl/orchestrator/filters.py

cursor Bot reviewed Jun 19, 2026

Comment thread src/prime_rl/orchestrator/train_sink.py Outdated

Comment thread packages/prime-rl-configs/src/prime_rl/configs/orchestrator.py Outdated

willccbb changed the base branch from main to feat/nano-as-v1 June 19, 2026 03:50

willccbb force-pushed the feat/trace-advantage-functions-v1 branch from d1748b2 to 6e868ef on June 19, 2026 03:55

cursor Bot reviewed Jun 19, 2026

Comment thread packages/prime-rl-configs/src/prime_rl/configs/orchestrator.py Outdated

feat: add trace advantage functions

b3778db

willccbb force-pushed the feat/trace-advantage-functions-v1 branch from 6e868ef to b3778db on June 19, 2026 05:33

chore: drop raw reward advantage

76f1c77

cursor Bot reviewed Jun 19, 2026

Comment thread src/prime_rl/orchestrator/filters.py

chore: retarget onto trace algorithm base

0bf0381

willccbb changed the base branch from feat/nano-as-v1 to feat/trace-algorithm-abstraction-base June 19, 2026 06:52

chore: refresh trace algorithm base

ea42243

willccbb closed this Jun 23, 2026

willccbb wants to merge 4 commits into feat/trace-algorithm-abstraction-base from feat/trace-advantage-functions-v1

1 participant