Codex/orchestrator r3 memory #2853
Open
samsja wants to merge 8 commits into
…-memory # Conflicts: # src/prime_rl/trainer/batch.py # tests/unit/orchestrator/test_advantage.py # tests/unit/orchestrator/test_batch.py
Note
Medium Risk
Touches orchestrator batching, transport, filters, and monitoring on compacted trajectories; production paths change memory/release behavior but debug flags are opt-in.
Overview
Adds orchestrator debug (`orchestrator.debug`) so you can stress the rollout pipeline without full inference/trainer stacks: optional no-op inference, no trainer (noop rollout transport + local policy version bumps), fake tokenizer, and RSS logging. The RL launcher skips inference/trainer processes and GPU allocation when those flags are set.

Memory-focused changes for heavy R3 / long trajectories: after interleaving, raw trajectory token arrays and routed-expert payloads are compacted to length/metadata summaries; batches drop held references after send/logging; malloc_trim and optional process memory logging run per step. Training samples can carry a scalar `completion_temperature` instead of per-token lists.

Compatibility updates so behavior stays correct on compacted payloads: rollout filters read from `rollout.samples` when raw tokens are pruned; monitors and length helpers understand `*_len` fields; trainer packing accepts compact temperatures.

Ships a `fake-r3-trajectory` debug env (deterministic long multi-turn rollouts with optional routed experts) wired into the `envs` extra for memory repro/testing.

Reviewed by Cursor Bugbot for commit 10ed89d. Bugbot is set up for automated code reviews on this repo.
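To make the compaction idea above concrete, here is a minimal sketch, not the PR's actual code. It assumes a trajectory step is a plain dict and that the token fields are called `prompt_ids` and `completion_ids`; those names, and the helpers `compact_step`, `field_len`, and `expand_temperature`, are hypothetical and chosen only for illustration. It shows a raw token array being replaced by a `*_len` summary, a length helper that works on both raw and compacted steps, and a scalar temperature being broadcast to a per-token list at packing time.

```python
from __future__ import annotations

from typing import Any

# Hypothetical field names: the PR description only says that raw token arrays
# are replaced by length/metadata summaries and that helpers read `*_len` fields.
TOKEN_FIELDS = ("prompt_ids", "completion_ids")


def compact_step(step: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of `step` with token arrays replaced by `<field>_len` ints."""
    compacted = {k: v for k, v in step.items() if k not in TOKEN_FIELDS}
    for field in TOKEN_FIELDS:
        if field in step:
            compacted[f"{field}_len"] = len(step[field])
    return compacted


def field_len(step: dict[str, Any], field: str) -> int:
    """Length of `field`, whether the step is still raw or already compacted."""
    if field in step:
        return len(step[field])
    return int(step.get(f"{field}_len", 0))


def expand_temperature(temperature: float | list[float], n_tokens: int) -> list[float]:
    """Broadcast a scalar temperature to one value per token; validate lists."""
    if isinstance(temperature, (int, float)):
        return [float(temperature)] * n_tokens
    if len(temperature) != n_tokens:
        raise ValueError("per-token temperature list does not match token count")
    return list(temperature)


raw_step = {"prompt_ids": [1, 2, 3], "completion_ids": [4, 5], "reward": 1.0}
compact = compact_step(raw_step)

# Downstream code sees the same lengths before and after compaction.
assert field_len(raw_step, "completion_ids") == field_len(compact, "completion_ids")
```

Under these assumptions, filters and monitors that only need lengths can call `field_len` without caring whether the token arrays were already pruned, and a single float per sample stands in for a list that would otherwise grow with the completion length.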