Trace-native advantage functions #2840

Closed
willccbb wants to merge 4 commits into feat/trace-algorithm-abstraction-base from feat/trace-advantage-functions-v1

@willccbb (Member) commented Jun 19, 2026

Summary

Stacked on the intermediate prime-rl review base feat/trace-algorithm-abstraction-base, which combines prime-rl trace support (feat/nano-as-v1, #2742) with the old algorithm-abstraction branch (feat/algorithm-abstraction, #2746). That base now intentionally restores unrelated dependency, inference, template, and lockfile paths to their state on the trace-support branch, so this PR contains only the focused rewrite from the old algorithm-class runtime to trace-native advantage functions.

  • Replaces algorithm classes / RolloutView with decorated trace advantage functions: @vf.advantage(loss="rl" | "ce", scope="rollout" | "group").
  • Adds the sparse public config surface: orchestrator.advantage.name, orchestrator.actor, and unprivileged keyed orchestrator.models.
  • Stamps completed traces with trace.actor and trace.models, runs the selected advantage function, then builds training samples from branch-level advantages and mask.
  • Ports built-ins to self-contained advantage functions: grpo, max_rl, sft, echo, opd, and opsd.
  • Routes active training through generic rl and sft/CE trainer paths; OPD/OPSD write fixed token advantages and no longer require teacher/ref-logprob trainer fields.
  • Removes privileged teacher config, old algorithm config/runtime files, and teacher-logprob transport/trainer plumbing.
  • Pins deps/verifiers to af3fd017 from sibling draft PR Trace advantage authoring surface verifiers#1760.
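
A minimal sketch of what a group-scoped advantage function might look like under this design. The real `@vf.advantage` decorator lives in verifiers and its function signature is not shown in this description, so the `advantage` stand-in decorator, the `Branch` and `Trace` dataclasses, and the `reward` field below are all hypothetical. They only illustrate the idea of a decorated function that writes per-branch `advantages` and `mask`, here with a GRPO-style group-mean baseline.

```python
from dataclasses import dataclass, field
from statistics import mean


def advantage(loss: str, scope: str):
    """Hypothetical stand-in for @vf.advantage: it only records metadata."""
    if loss not in ("rl", "ce") or scope not in ("rollout", "group"):
        raise ValueError(f"unsupported loss/scope: {loss!r}/{scope!r}")

    def decorate(fn):
        fn.loss, fn.scope = loss, scope
        return fn

    return decorate


@dataclass
class Branch:  # hypothetical: one token sequence inside a trace
    num_tokens: int
    advantages: list[float] = field(default_factory=list)
    mask: list[bool] = field(default_factory=list)


@dataclass
class Trace:  # hypothetical: one completed rollout with a scalar reward
    reward: float
    branches: list[Branch]


@advantage(loss="rl", scope="group")
def grpo_like(group: list[Trace]) -> None:
    """Write (reward - group mean) onto every token of every branch."""
    baseline = mean(t.reward for t in group)
    for trace in group:
        for branch in trace.branches:
            branch.advantages = [trace.reward - baseline] * branch.num_tokens
            branch.mask = [True] * branch.num_tokens


group = [Trace(1.0, [Branch(3)]), Trace(0.0, [Branch(2)])]
grpo_like(group)
print(group[0].branches[0].advantages)  # [0.5, 0.5, 0.5]
print(grpo_like.loss, grpo_like.scope)  # rl group
```

Keeping the loss and scope as metadata on the function is one way an orchestrator could pick the trainer path without a class hierarchy; whether verifiers stores it this way is an assumption.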

Validation

The latest stack cleanup was diff-only. I verified that the top PR no longer contains the unrelated old-branch dependency/inference reversions; the only dependency-looking path left in the top diff is the expected deps/verifiers pin.

Validation previously run on this implementation branch:

  • uv run --no-sync ruff check ... on touched Python files
  • uv run --no-sync pytest tests/unit/orchestrator/test_advantage.py tests/unit/orchestrator/test_batch.py tests/unit/orchestrator/test_orchestrator_setup.py -q
  • uv run --no-sync pytest tests/unit/train/rl/test_loss.py -q with GPU access
  • uv run --no-sync pytest tests/unit/train/models/test_nemotron_h_kl.py -q with GPU access
  • uv run --no-sync rl @ configs/v1/training_mode/{rl,opd,sft}.toml --dry-run ...
  • uv run --no-sync rl @ configs/ci/integration/reverse_text_rl_{opd,sft}/start.toml --dry-run ...
  • One-step CE smoke passed on GPUs 0/1
  • One-step RL smoke passed on GPUs 0/1 with the zero-advantage post-filter disabled
  • After removing the raw reward built-in: uv run --no-sync pytest tests/unit/orchestrator/test_advantage.py -q and focused ruff check

Default GRPO smoke on the tiny debug setup hit the existing consecutive zero-trainable-batch guard; that looked like task/filter behavior rather than an integration failure.
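
To illustrate why this can be task/filter behavior, here is a hedged sketch: with a mean-baselined group advantage, a group whose rollouts all receive the same reward yields only zero advantages, so a zero-advantage filter leaves nothing to train on. The `ZeroBatchGuard` class, its threshold, and the filter logic are hypothetical and are not taken from prime-rl.

```python
from statistics import mean


def group_advantages(rewards: list[float]) -> list[float]:
    """Mean-baselined advantages for one group of rollouts."""
    baseline = mean(rewards)
    return [r - baseline for r in rewards]


def count_trainable(rewards: list[float]) -> int:
    """Hypothetical zero-advantage filter: keep only nonzero advantages."""
    return sum(1 for a in group_advantages(rewards) if a != 0.0)


class ZeroBatchGuard:
    """Hypothetical guard: fail after too many empty batches in a row."""

    def __init__(self, max_consecutive: int = 3):
        self.max_consecutive = max_consecutive
        self.streak = 0

    def observe(self, num_trainable: int) -> None:
        self.streak = self.streak + 1 if num_trainable == 0 else 0
        if self.streak >= self.max_consecutive:
            raise RuntimeError(f"{self.streak} consecutive zero-trainable batches")


guard = ZeroBatchGuard(max_consecutive=3)
tripped = False
try:
    for _ in range(3):
        # A tiny task where every rollout fails: identical rewards.
        guard.observe(count_trainable([0.0, 0.0, 0.0, 0.0]))
except RuntimeError:
    tripped = True
print(tripped)  # True
```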


Note

High Risk
Large behavioral refactor across orchestrator config validation, rollout dispatch, advantage computation, and trainer loss routing; OPD semantics move from trainer KL to orchestrator token weights, so misconfigured models/actor or custom advantages can silently change training.

Overview
Replaces orchestrator.training_mode (rl / opd / sft) and privileged orchestrator.teacher with orchestrator.advantage.name, orchestrator.actor (default policy), and keyed orchestrator.models for auxiliary token endpoints.

Orchestrator pipeline: Completed traces get trace.actor and trace.models; @vf.advantage functions (built-ins: grpo, max_rl, sft, echo, opd, opsd) mutate per-branch advantages and mask before samples are built. OPD/OPSD fetch teacher/policy logprobs inside the advantage step instead of a separate batch-time teacher-logprob pass. Train rollouts can use a non-policy actor (e.g. SFT with actor = "teacher").
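
The description does not give the rule OPD uses to turn logprobs into token advantages. As a hedged illustration, the sketch below uses a common on-policy distillation choice, the per-token difference between teacher and policy logprobs on sampled tokens, which is an assumption rather than the PR's confirmed formula. The function name and argument layout are also hypothetical.

```python
def opd_like_token_advantages(
    teacher_logprobs: list[float],
    policy_logprobs: list[float],
    completion_mask: list[bool],
) -> list[float]:
    """Assumed rule: advantage_t = teacher_logprob_t - policy_logprob_t.

    Tokens outside the completion (mask False) get a zero advantage.
    """
    if not (len(teacher_logprobs) == len(policy_logprobs) == len(completion_mask)):
        raise ValueError("all inputs must have one entry per token")
    return [
        (t - p) if keep else 0.0
        for t, p, keep in zip(teacher_logprobs, policy_logprobs, completion_mask)
    ]


# One prompt token followed by two completion tokens.
adv = opd_like_token_advantages(
    teacher_logprobs=[-0.5, -1.0, -2.0],
    policy_logprobs=[-0.5, -2.0, -1.0],
    completion_mask=[False, True, True],
)
print(adv)  # [0.0, 1.0, -1.0]
```

Because the values are computed once in the advantage step and stored per token, the trainer can treat them as fixed weights and needs no teacher logprobs of its own.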

Trainer / transport: Drops teacher_logprobs and the dedicated opd loss path; batches are rl (DPPO + configured loss) or sft (masked CE) based on advantage loss metadata. Per-token advantages replace a single scalar advantage. Configs, docs, and tests are updated accordingly; built-in length-penalty advantage shaping is removed from this path.
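
A small sketch of the routing idea: the advantage function's loss metadata selects the batch type, and the sft path averages cross-entropy over unmasked tokens only. The `route_batch` and `masked_ce` helpers are hypothetical names, written in plain Python rather than the trainer's tensor code.

```python
def route_batch(loss: str) -> str:
    """Map advantage loss metadata ("rl" or "ce") to a trainer batch type."""
    routes = {"rl": "rl", "ce": "sft"}
    if loss not in routes:
        raise ValueError(f"unknown loss metadata: {loss!r}")
    return routes[loss]


def masked_ce(token_logprobs: list[float], mask: list[bool]) -> float:
    """Mean negative log-likelihood over tokens where mask is True."""
    kept = [-lp for lp, keep in zip(token_logprobs, mask) if keep]
    return sum(kept) / len(kept) if kept else 0.0


print(route_batch("ce"))  # sft
print(masked_ce([-1.0, -2.0, -3.0], [False, True, True]))  # 2.5
```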

Reviewed by Cursor Bugbot for commit 76f1c77.

Comment thread src/prime_rl/orchestrator/filters.py
Comment thread src/prime_rl/orchestrator/filters.py
Comment thread src/prime_rl/orchestrator/train_sink.py Outdated
Comment thread packages/prime-rl-configs/src/prime_rl/configs/orchestrator.py Outdated
@willccbb willccbb changed the base branch from main to feat/nano-as-v1 June 19, 2026 03:50
@willccbb willccbb force-pushed the feat/trace-advantage-functions-v1 branch from d1748b2 to 6e868ef Compare June 19, 2026 03:55
Comment thread packages/prime-rl-configs/src/prime_rl/configs/orchestrator.py Outdated
@willccbb willccbb force-pushed the feat/trace-advantage-functions-v1 branch from 6e868ef to b3778db Compare June 19, 2026 05:33

@cursor cursor Bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.


Comment thread src/prime_rl/orchestrator/filters.py
@willccbb willccbb changed the base branch from feat/nano-as-v1 to feat/trace-algorithm-abstraction-base June 19, 2026 06:52
@willccbb (Member, Author) commented

Closing this draft as superseded by the cleaned two-PR stack: base trace support in #2857, algorithm registry rewrite in #2842.

@willccbb willccbb closed this Jun 23, 2026