[codex] add sparse filesystem weight sync #2823
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit da54fc0.
}
if "lm_head.weight" not in self._pulse_state_dict and "model.embed_tokens.weight" in self._pulse_state_dict:
    self._pulse_state_dict["lm_head.weight"] = self._pulse_state_dict["model.embed_tokens.weight"].clone()
self._pulse_step = getattr(self, "_pulse_last_full_weight_step", None) or 0
Sparse patches need warm cache
High Severity
Sparse broadcast directories store deltas against a prior step, not a full HF checkpoint. On resume or after an inference restart, the worker rebuilds its PULSE cache from the base model with _pulse_step 0 and applies only the latest patch from get_weight_dir. That patch’s base_step is usually the previous training step, so apply_sparse_value_patch raises a base-step mismatch and weight sync fails unless a full checkpoint weight tree exists for that step.
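The cold-cache failure described above can be reproduced in a few lines. This is a minimal sketch, not the PR's code: ReceiverCache, BaseStepMismatch, and the patch dict layout are hypothetical stand-ins, plain Python lists replace BF16 tensors, and the real apply_sparse_value_patch may be structured differently.

```python
from dataclasses import dataclass


class BaseStepMismatch(RuntimeError):
    """Raised when a patch was diffed against a step the cache does not hold."""


@dataclass
class ReceiverCache:
    # Hypothetical stand-in for the worker's dense CPU cache: one flat list per tensor.
    weights: dict[str, list[float]]
    step: int = 0

    def apply_patch(self, patch: dict) -> None:
        # A delta is only meaningful on top of the exact state it was computed against.
        if patch["base_step"] != self.step:
            raise BaseStepMismatch(
                f"patch expects base step {patch['base_step']}, cache is at step {self.step}"
            )
        for name, (indices, values) in patch["tensors"].items():
            target = self.weights[name]
            for i, v in zip(indices, values):
                target[i] = v
        self.step = patch["step"]


# Warm cache: patches arrive in order and apply cleanly.
warm = ReceiverCache({"w": [0.0, 0.0, 0.0]})
warm.apply_patch({"base_step": 0, "step": 1, "tensors": {"w": ([1], [0.5])}})
warm.apply_patch({"base_step": 1, "step": 2, "tensors": {"w": ([2], [-0.25])}})

# Cold cache after a restart: rebuilt from the base model at step 0,
# then handed only the latest patch, whose base_step is 1.
cold = ReceiverCache({"w": [0.0, 0.0, 0.0]})
try:
    cold.apply_patch({"base_step": 1, "step": 2, "tensors": {"w": ([2], [-0.25])}})
    cold_error = None
except BaseStepMismatch as exc:
    cold_error = str(exc)
```

In this sketch the cold receiver can only recover from a dense weight tree for step 1 or from a replay of every patch since step 0, which is the gap the report points at.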
sparse: bool = False
"""Use sparse BF16 value patches for filesystem weight broadcast."""
Can we just make this a param in the filesystem config and remove it from here?
da54fc0 to fd59b02


Summary
Adds delta-only sparse filesystem weight broadcast behind an explicit trainer filesystem config field, trainer.weight_broadcast.sparse.
The trainer now gathers the HF-compatible BF16 view, writes changed-index/value safetensor sparse updates, and logs sparse_update/* metrics including sparsity, changed values, patch bytes, and patched tensor count. The vLLM filesystem worker keeps a dense CPU receiver cache, applies sparse updates in order, and validates each update's base_step against the cached step so a missed or misordered update fails loudly.
NCCL broadcast remains unchanged and dense.
sparse lives only on the filesystem trainer weight-broadcast config; the old shared weight_broadcast.sparse surface is not accepted. Sparse filesystem broadcast is blocked for LoRA and multi-run training.
Docs and the monitor-run skill were updated for the new sparse setting and W&B metrics.
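The trainer-side step (diff the current BF16 view against the baseline, keep only changed indices and values, report metrics) can be sketched as follows. This is illustrative only: build_sparse_update, the metric key names, and the 4-byte index width are assumptions, and plain Python lists stand in for the tensors that the PR writes as safetensors.

```python
VALUE_BYTES = 2  # BF16 values are 2 bytes each
INDEX_BYTES = 4  # assumed int32 indices; the PR's actual index dtype is not stated


def build_sparse_update(baseline, current, base_step, step):
    """Diff `current` against `baseline` and return (patch, metrics)."""
    tensors = {}
    total = 0
    changed = 0
    for name, new in current.items():
        old = baseline[name]
        # Keep only the positions whose value differs from the baseline.
        idx = [i for i, (a, b) in enumerate(zip(old, new)) if a != b]
        total += len(new)
        if idx:
            tensors[name] = (idx, [new[i] for i in idx])
            changed += len(idx)
    metrics = {
        "sparsity": (1.0 - changed / total) if total else 1.0,
        "changed_numel": changed,
        "patch_bytes": changed * (INDEX_BYTES + VALUE_BYTES),
        "patched_tensors": len(tensors),
    }
    patch = {"base_step": base_step, "step": step, "tensors": tensors}
    return patch, metrics


baseline = {"a": [1.0, 2.0, 3.0, 4.0], "b": [0.0, 0.0]}
current = {"a": [1.0, 2.5, 3.0, 4.0], "b": [0.0, 0.0]}
patch, metrics = build_sparse_update(baseline, current, base_step=7, step=8)
```

Recording base_step in the patch is what lets the receiver refuse an update that was computed against a state it does not hold.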
Validation
uv run pytest tests/unit/utils/test_sparse_update.py tests/unit/test_configs.py -q (110 passed, 1 warning)
uv run ruff check src/prime_rl/utils/sparse_update.py src/prime_rl/trainer/rl/broadcast/base.py src/prime_rl/trainer/rl/broadcast/filesystem.py src/prime_rl/inference/vllm/worker/filesystem.py src/prime_rl/trainer/rl/train.py packages/prime-rl-configs/src/prime_rl/configs/trainer.py packages/prime-rl-configs/src/prime_rl/configs/rl.py tests/unit/utils/test_sparse_update.py
uv run rl @ configs/ci/integration/reverse_text/start.toml --dry-run --clean-output-dir --trainer.weight-broadcast.sparse --output-dir /tmp/prime-rl-sparse-api-dry-run-pr2
uv run rl @ configs/ci/integration/reverse_text/start.toml --dry-run --clean-output-dir --weight-broadcast.sparse --output-dir /tmp/prime-rl-sparse-api-old-flag-dry-run-pr2 failed with the expected extra-field validation error
uv run rl @ configs/ci/integration/reverse_text/start.toml --dry-run --clean-output-dir --weight-broadcast.type nccl --trainer.weight-broadcast.sparse --output-dir /tmp/prime-rl-sparse-api-invalid-dry-run-pr2 failed with the expected filesystem-only validation error
Note
Medium Risk
Changes the trainer-to-inference weight sync path; incorrect or misordered patches could desync policies, though the feature is opt-in with base-step checks and explicit exclusions for LoRA and multi-run.
Overview
Adds an opt-in sparse filesystem weight sync path via trainer.weight_broadcast.sparse = true, so policy updates can ship BF16 index/value patches instead of full HF checkpoints on each broadcast.
The trainer captures a baseline BF16 state after load or checkpoint resume (prepare_baseline), then each broadcast writes sparse_update_manifest.json plus changed values under the usual broadcasts/step_N/ layout and logs sparse_update/* metrics (sparsity, changed numel, patch bytes). The vLLM filesystem worker keeps a dense CPU receiver cache, applies patches in order, and fails on base_step mismatch so skipped or out-of-order updates are not silently applied.
Config validation blocks sparse with NCCL, LoRA adapter broadcast, and multi-run training; shared weight_broadcast.type = filesystem no longer overwrites an already-set trainer sparse flag. NCCL dense broadcast is unchanged.
Reviewed by Cursor Bugbot for commit fd59b02.
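The validation rules listed in the overview can be written out as a small check. This is a sketch under assumptions: check_sparse_broadcast and its parameter names are hypothetical, and the PR implements these rules inside its config classes rather than in a free function like this one.

```python
def check_sparse_broadcast(
    broadcast_type: str,
    sparse: bool,
    lora: bool = False,
    multi_run: bool = False,
) -> None:
    """Raise ValueError for the combinations the PR description says are blocked."""
    if not sparse:
        return  # dense broadcast, including NCCL, is unaffected
    if broadcast_type != "filesystem":
        raise ValueError("sparse weight broadcast is filesystem-only")
    if lora:
        raise ValueError("sparse weight broadcast is not supported with LoRA")
    if multi_run:
        raise ValueError("sparse weight broadcast is not supported with multi-run training")


def rejected(**kwargs) -> bool:
    """Return True if the given combination fails validation."""
    try:
        check_sparse_broadcast(**kwargs)
    except ValueError:
        return True
    return False


fs_sparse_ok = not rejected(broadcast_type="filesystem", sparse=True)
nccl_sparse_blocked = rejected(broadcast_type="nccl", sparse=True)
```

The second case mirrors the dry run above that combines --weight-broadcast.type nccl with --trainer.weight-broadcast.sparse and is expected to fail.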