Trace-native advantage functions#2840
Closed
willccbb wants to merge 4 commits into
Closed
Conversation
d1748b2 to
6e868ef
Compare
6e868ef to
b3778db
Compare
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit 76f1c77. Configure here.
Member
Author
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.

Summary
Stacked on the intermediate prime-rl review base
feat/trace-algorithm-abstraction-base, which combines prime-rl trace support (feat/nano-as-v1, #2742) with the old algorithm-abstraction branch (feat/algorithm-abstraction, #2746). That base now intentionally restores unrelated dependency, inference, template, and lockfile paths to the trace-support branch state, so this PR is the focused rewrite from the old algorithm-class runtime to trace-native advantage functions.RolloutViewwith decorated trace advantage functions:@vf.advantage(loss="rl" | "ce", scope="rollout" | "group").orchestrator.advantage.name,orchestrator.actor, and unprivileged keyedorchestrator.models.trace.actorandtrace.models, runs the selected advantage function, then builds training samples from branch-leveladvantagesandmask.grpo,max_rl,sft,echo,opd, andopsd.rlandsft/CE trainer paths; OPD/OPSD write fixed token advantages and no longer require teacher/ref-logprob trainer fields.deps/verifierstoaf3fd017from sibling draft PR Trace advantage authoring surface verifiers#1760.Validation
The latest stack cleanup was diff-only. I verified that the top PR no longer contains the unrelated old-branch dependency/inference reversions; the only dependency-looking path left in the top diff is the expected
deps/verifierspin.Validation previously run on this implementation branch:
uv run --no-sync ruff check ...on touched Python filesuv run --no-sync pytest tests/unit/orchestrator/test_advantage.py tests/unit/orchestrator/test_batch.py tests/unit/orchestrator/test_orchestrator_setup.py -quv run --no-sync pytest tests/unit/train/rl/test_loss.py -qwith GPU accessuv run --no-sync pytest tests/unit/train/models/test_nemotron_h_kl.py -qwith GPU accessuv run --no-sync rl @ configs/v1/training_mode/{rl,opd,sft}.toml --dry-run ...uv run --no-sync rl @ configs/ci/integration/reverse_text_rl_{opd,sft}/start.toml --dry-run ...rewardbuilt-in:uv run --no-sync pytest tests/unit/orchestrator/test_advantage.py -qand focusedruff checkDefault GRPO smoke on the tiny debug setup hit the existing consecutive zero-trainable-batch guard; that looked like task/filter behavior rather than an integration failure.
Note
High Risk
Large behavioral refactor across orchestrator config validation, rollout dispatch, advantage computation, and trainer loss routing; OPD semantics move from trainer KL to orchestrator token weights, so misconfigured
models/actoror custom advantages can silently change training.Overview
Replaces
orchestrator.training_mode(rl/opd/sft) and privilegedorchestrator.teacherwithorchestrator.advantage.name,orchestrator.actor(defaultpolicy), and keyedorchestrator.modelsfor auxiliary token endpoints.Orchestrator pipeline: Completed traces get
trace.actorandtrace.models;@vf.advantagefunctions (built-ins:grpo,max_rl,sft,echo,opd,opsd) mutate per-branchadvantagesandmaskbefore samples are built. OPD/OPSD fetch teacher/policy logprobs inside the advantage step instead of a separate batch-time teacher-logprob pass. Train rollouts can use a non-policy actor (e.g. SFT withactor = "teacher").Trainer / transport: Drops
teacher_logprobsand the dedicatedopdloss path; batches arerl(DPPO + configured loss) orsft(masked CE) based on advantagelossmetadata. Per-tokenadvantagesreplace a single scalaradvantage. Configs, docs, and tests are updated accordingly; built-in length-penalty advantage shaping is removed from this path.Reviewed by Cursor Bugbot for commit 76f1c77. Bugbot is set up for automated code reviews on this repo. Configure here.