[codex] add sparse filesystem weight sync #2823

Open
samsja wants to merge 1 commit into main from feat/sparse-filesystem-weight-sync

@samsja samsja commented Jun 15, 2026

Member

Summary

Adds delta-only sparse filesystem weight broadcast behind an explicit trainer filesystem config field:

[trainer.weight_broadcast]
type = "filesystem"
sparse = true

The trainer now gathers the HF-compatible BF16 view, writes changed-index/value safetensor sparse updates, and logs sparse_update/* metrics including sparsity, changed values, patch bytes, and patched tensor count. The vLLM filesystem worker keeps a dense CPU receiver cache, applies sparse updates in order, and validates each update's base_step against the cached step so a missed or misordered update fails loudly.
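
As a rough illustration of the changed-index/value idea, the sketch below diffs a baseline against updated weights and records only the positions that changed. It uses plain Python lists instead of BF16 tensors, and `SparsePatch`, `make_sparse_patch`, and `sparsity` are hypothetical names for this example, not the prime-rl API or its on-disk format.

```python
from dataclasses import dataclass


@dataclass
class SparsePatch:
    """Hypothetical container: positions that changed and their new values."""
    base_step: int
    step: int
    indices: list
    values: list


def make_sparse_patch(old, new, base_step, step):
    """Record only the entries of `new` that differ from `old`."""
    if len(old) != len(new):
        raise ValueError("baseline and current weights must have the same length")
    indices = [i for i, (a, b) in enumerate(zip(old, new)) if a != b]
    values = [new[i] for i in indices]
    return SparsePatch(base_step, step, indices, values)


def sparsity(patch, numel):
    """Fraction of entries the patch leaves untouched."""
    return 1.0 - len(patch.indices) / numel


baseline = [0.5, -1.0, 2.0, 0.0]
updated = [0.5, -1.25, 2.0, 0.125]
patch = make_sparse_patch(baseline, updated, base_step=3, step=4)
```

Here two of four entries changed, so the patch carries two index/value pairs and the remaining half of the tensor is never written.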

NCCL broadcast remains unchanged and dense. The sparse field lives only on the trainer's filesystem weight-broadcast config; the old shared weight_broadcast.sparse option is rejected. Sparse filesystem broadcast is blocked for LoRA and multi-run training.

Docs and the monitor-run skill were updated for the new sparse setting and W&B metrics.

Validation

  • uv run pytest tests/unit/utils/test_sparse_update.py tests/unit/test_configs.py -q (110 passed, 1 warning)
  • uv run ruff check src/prime_rl/utils/sparse_update.py src/prime_rl/trainer/rl/broadcast/base.py src/prime_rl/trainer/rl/broadcast/filesystem.py src/prime_rl/inference/vllm/worker/filesystem.py src/prime_rl/trainer/rl/train.py packages/prime-rl-configs/src/prime_rl/configs/trainer.py packages/prime-rl-configs/src/prime_rl/configs/rl.py tests/unit/utils/test_sparse_update.py
  • uv run rl @ configs/ci/integration/reverse_text/start.toml --dry-run --clean-output-dir --trainer.weight-broadcast.sparse --output-dir /tmp/prime-rl-sparse-api-dry-run-pr2
  • uv run rl @ configs/ci/integration/reverse_text/start.toml --dry-run --clean-output-dir --weight-broadcast.sparse --output-dir /tmp/prime-rl-sparse-api-old-flag-dry-run-pr2 (failed with the expected extra-field validation error)
  • uv run rl @ configs/ci/integration/reverse_text/start.toml --dry-run --clean-output-dir --weight-broadcast.type nccl --trainer.weight-broadcast.sparse --output-dir /tmp/prime-rl-sparse-api-invalid-dry-run-pr2 (failed with the expected filesystem-only validation error)

Note

Medium Risk
Changes the trainer-to-inference weight sync path; incorrect or misordered patches could desync policies, though the feature is opt-in with base-step checks and explicit exclusions for LoRA and multi-run.

Overview
Adds an opt-in sparse filesystem weight sync path via trainer.weight_broadcast.sparse = true, so policy updates can ship BF16 index/value patches instead of full HF checkpoints on each broadcast.

The trainer captures a baseline BF16 state after load or checkpoint resume (prepare_baseline), then each broadcast writes sparse_update_manifest.json plus changed values under the usual broadcasts/step_N/ layout and logs sparse_update/* metrics (sparsity, changed numel, patch bytes). The vLLM filesystem worker keeps a dense CPU receiver cache, applies patches in order, and fails on base_step mismatch so skipped or out-of-order updates are not silently applied.
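
The receiver-side check described above can be sketched roughly as follows. This is a simplified model with plain lists and dict patches; `ReceiverCache`, `BaseStepMismatch`, and the patch keys are assumptions made for the example and do not reflect the actual manifest schema or worker code.

```python
class BaseStepMismatch(RuntimeError):
    """Raised when a patch was built against a step the cache does not hold."""


class ReceiverCache:
    """Hypothetical dense cache: full weights plus the step they correspond to."""

    def __init__(self, weights, step=0):
        self.weights = list(weights)
        self.step = step

    def apply(self, patch):
        # Refuse patches whose baseline differs from what is cached.
        if patch["base_step"] != self.step:
            raise BaseStepMismatch(
                f"patch expects base_step={patch['base_step']}, "
                f"cache is at step={self.step}"
            )
        for index, value in zip(patch["indices"], patch["values"]):
            self.weights[index] = value
        # Only advance the cached step once the whole patch is applied.
        self.step = patch["step"]


cache = ReceiverCache([0.5, -1.0, 2.0, 0.0], step=0)
cache.apply({"base_step": 0, "step": 1, "indices": [1], "values": [-1.25]})

try:
    # Step 2 was skipped, so this patch must not be applied.
    cache.apply({"base_step": 2, "step": 3, "indices": [0], "values": [9.0]})
    skipped_patch_rejected = False
except BaseStepMismatch:
    skipped_patch_rejected = True
```

Failing before any value is written keeps the cache at a known step, which is what makes a missed update loud rather than silently corrupting.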

Config validation blocks sparse with NCCL, LoRA adapter broadcast, and multi-run training; shared weight_broadcast.type = filesystem no longer overwrites an already-set trainer sparse flag. NCCL dense broadcast is unchanged.
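
A minimal sketch of these validation rules, assuming a flattened settings object: `BroadcastSettings`, its field names, and the error messages are hypothetical and stand in for the real config classes, which are not shown in this PR description.

```python
from dataclasses import dataclass


@dataclass
class BroadcastSettings:
    """Hypothetical flattened view of the options the validation looks at."""
    type: str = "filesystem"
    sparse: bool = False
    lora: bool = False
    multi_run: bool = False


def validate(settings):
    """Raise ValueError for combinations the PR description says are blocked."""
    if not settings.sparse:
        return settings  # dense broadcast is unaffected
    if settings.type != "filesystem":
        raise ValueError("sparse weight broadcast is filesystem-only")
    if settings.lora:
        raise ValueError("sparse weight broadcast does not support LoRA")
    if settings.multi_run:
        raise ValueError("sparse weight broadcast does not support multi-run training")
    return settings


def is_allowed(settings):
    """Convenience wrapper that turns validation errors into a boolean."""
    try:
        validate(settings)
        return True
    except ValueError:
        return False
```

For example, `is_allowed(BroadcastSettings(type="nccl", sparse=True))` is false, while the same settings with `sparse=False` pass.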

Reviewed by Cursor Bugbot for commit fd59b02.

@samsja samsja marked this pull request as ready for review June 15, 2026 01:26

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit da54fc0.

}
if "lm_head.weight" not in self._pulse_state_dict and "model.embed_tokens.weight" in self._pulse_state_dict:
    self._pulse_state_dict["lm_head.weight"] = self._pulse_state_dict["model.embed_tokens.weight"].clone()
self._pulse_step = getattr(self, "_pulse_last_full_weight_step", None) or 0

Sparse patches need warm cache

High Severity

Sparse broadcast directories store deltas against a prior step, not a full HF checkpoint. On resume or after an inference restart, the worker rebuilds its PULSE cache from the base model with _pulse_step 0 and applies only the latest patch from get_weight_dir. That patch’s base_step is usually the previous training step, so apply_sparse_value_patch raises a base-step mismatch and weight sync fails unless a full checkpoint weight tree exists for that step.
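
To make the failure mode concrete, the sketch below walks base_step links backwards from the target step to the cached step. It is only a way to reason about the report, not something this PR implements; `replay_chain` and the step-to-base_step mapping are hypothetical.

```python
def replay_chain(patches, cached_step, target_step):
    """Return the steps to apply, in order, to move the cache to target_step.

    `patches` maps step -> base_step for the patches available on disk.
    Returns None when no unbroken chain back to cached_step exists.
    """
    chain = []
    step = target_step
    while step != cached_step:
        base = patches.get(step)
        # A missing patch or a non-decreasing base means the chain is broken.
        if base is None or base >= step:
            return None
        chain.append(step)
        step = base
    return list(reversed(chain))


# Warm cache: every intermediate patch is available.
warm = replay_chain({1: 0, 2: 1, 3: 2}, cached_step=0, target_step=3)

# Cold restart: cache is back at step 0 but only the latest patch is read.
cold = replay_chain({3: 2}, cached_step=0, target_step=3)
```

In the cold case no chain exists, which corresponds to the base-step mismatch described above: the latest patch alone cannot bring a step-0 cache up to date.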

Comment on lines +120 to +122
sparse: bool = False
"""Use sparse BF16 value patches for filesystem weight broadcast."""

Member Author

Can we just make this a param in the filesystem config and remove it from here?

@samsja samsja force-pushed the feat/sparse-filesystem-weight-sync branch from da54fc0 to fd59b02 Compare June 15, 2026 01:35
@samsja samsja marked this pull request as draft June 15, 2026 01:35
@samsja samsja marked this pull request as ready for review June 15, 2026 02:00