[codex] add sparse filesystem weight sync by samsja · Pull Request #2823 · PrimeIntellect-ai/prime-rl

samsja · 2026-06-15T01:23:30Z

Summary

Adds delta-only sparse filesystem weight broadcast behind an explicit trainer filesystem config field:

[trainer.weight_broadcast]
type = "filesystem"
sparse = true

The trainer now gathers the HF-compatible BF16 view, writes changed-index/value safetensor sparse updates, and logs sparse_update/* metrics including sparsity, changed values, patch bytes, and patched tensor count. The vLLM filesystem worker keeps a dense CPU receiver cache, applies sparse updates in order, and validates each update's base_step against the cached step so a missed or misordered update fails loudly.
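The changed-index/value idea on the trainer side can be pictured with a minimal sketch. This is not the PR's code: it uses plain Python lists in place of BF16 tensors, and `SparsePatch`, `make_sparse_patch`, `patch_metrics`, and the exact metric keys are hypothetical names chosen for illustration. Sparsity is assumed here to mean the fraction of unchanged elements.

```python
from dataclasses import dataclass


@dataclass
class SparsePatch:
    """Hypothetical container for one tensor's delta (not the PR's on-disk format)."""
    base_step: int       # step the receiver must already hold
    step: int            # step the receiver holds after applying the patch
    indices: list[int]   # flat positions whose value changed
    values: list[float]  # new values at those positions


def make_sparse_patch(baseline, current, base_step, step):
    """Record only the positions where `current` differs from `baseline`."""
    if len(baseline) != len(current):
        raise ValueError("baseline and current must have the same number of elements")
    indices = [i for i, (old, new) in enumerate(zip(baseline, current)) if old != new]
    values = [current[i] for i in indices]
    return SparsePatch(base_step=base_step, step=step, indices=indices, values=values)


def patch_metrics(patch, numel):
    """Illustrative metrics; the key names are assumptions, not the PR's exact keys."""
    changed = len(patch.indices)
    return {
        "sparse_update/changed_numel": changed,
        "sparse_update/sparsity": 1.0 - changed / numel,
    }


baseline = [0.0, 1.0, 2.0, 3.0]
current = [0.0, 1.5, 2.0, 3.25]
patch = make_sparse_patch(baseline, current, base_step=0, step=1)
metrics = patch_metrics(patch, numel=len(current))
```

In this toy case two of four values changed, so the patch carries two index/value pairs instead of the full tensor.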

NCCL broadcast remains unchanged and dense. sparse lives only on the filesystem trainer weight-broadcast config; the old shared weight_broadcast.sparse surface is not accepted. Sparse filesystem broadcast is blocked for LoRA and multi-run training.

Docs and the monitor-run skill were updated for the new sparse setting and W&B metrics.

Validation

uv run pytest tests/unit/utils/test_sparse_update.py tests/unit/test_configs.py -q (110 passed, 1 warning)
uv run ruff check src/prime_rl/utils/sparse_update.py src/prime_rl/trainer/rl/broadcast/base.py src/prime_rl/trainer/rl/broadcast/filesystem.py src/prime_rl/inference/vllm/worker/filesystem.py src/prime_rl/trainer/rl/train.py packages/prime-rl-configs/src/prime_rl/configs/trainer.py packages/prime-rl-configs/src/prime_rl/configs/rl.py tests/unit/utils/test_sparse_update.py
uv run rl @ configs/ci/integration/reverse_text/start.toml --dry-run --clean-output-dir --trainer.weight-broadcast.sparse --output-dir /tmp/prime-rl-sparse-api-dry-run-pr2
uv run rl @ configs/ci/integration/reverse_text/start.toml --dry-run --clean-output-dir --weight-broadcast.sparse --output-dir /tmp/prime-rl-sparse-api-old-flag-dry-run-pr2 (failed with the expected extra-field validation error)
uv run rl @ configs/ci/integration/reverse_text/start.toml --dry-run --clean-output-dir --weight-broadcast.type nccl --trainer.weight-broadcast.sparse --output-dir /tmp/prime-rl-sparse-api-invalid-dry-run-pr2 (failed with the expected filesystem-only validation error)

Note

Medium Risk
Changes the trainer-to-inference weight sync path; incorrect or misordered patches could desync policies, though the feature is opt-in with base-step checks and explicit exclusions for LoRA and multi-run.

Overview
Adds an opt-in sparse filesystem weight sync path via trainer.weight_broadcast.sparse = true, so policy updates can ship BF16 index/value patches instead of full HF checkpoints on each broadcast.

The trainer captures a baseline BF16 state after load or checkpoint resume (prepare_baseline), then each broadcast writes sparse_update_manifest.json plus changed values under the usual broadcasts/step_N/ layout and logs sparse_update/* metrics (sparsity, changed numel, patch bytes). The vLLM filesystem worker keeps a dense CPU receiver cache, applies patches in order, and fails on base_step mismatch so skipped or out-of-order updates are not silently applied.
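The receiver behaviour described above (dense cache, in-order application, hard failure on a `base_step` mismatch) might look roughly like the following sketch. It is an illustration under assumptions: `ReceiverCache`, `BaseStepMismatch`, and the patch dictionary layout are hypothetical, and plain lists stand in for CPU tensors.

```python
class BaseStepMismatch(RuntimeError):
    """Raised when a patch was built against a step the cache does not hold."""


class ReceiverCache:
    """Hypothetical dense cache that only accepts patches built on its current step."""

    def __init__(self, weights, step=0):
        # Copy the lists so the cache owns its dense state.
        self.weights = {name: list(values) for name, values in weights.items()}
        self.step = step

    def apply(self, patch):
        # Refuse skipped or out-of-order updates instead of silently applying them.
        if patch["base_step"] != self.step:
            raise BaseStepMismatch(
                f"patch expects base_step={patch['base_step']}, cache is at step={self.step}"
            )
        for name, (indices, values) in patch["tensors"].items():
            target = self.weights[name]
            for index, value in zip(indices, values):
                target[index] = value
        self.step = patch["step"]


cache = ReceiverCache({"w": [0.0, 1.0, 2.0, 3.0]}, step=0)
cache.apply({"base_step": 0, "step": 1, "tensors": {"w": ([1, 3], [1.5, 3.25])}})

# A patch built on step 2 cannot be applied to a cache that is at step 1.
try:
    cache.apply({"base_step": 2, "step": 3, "tensors": {"w": ([0], [9.0])}})
    rejected = False
except BaseStepMismatch:
    rejected = True
```

The rejected patch leaves the cache untouched because the step check runs before any value is written.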

Config validation blocks sparse with NCCL, LoRA adapter broadcast, and multi-run training; shared weight_broadcast.type = filesystem no longer overwrites an already-set trainer sparse flag. NCCL dense broadcast is unchanged.
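The validation rules listed above can be summarized in a small sketch. The class and function names and the `lora_enabled` / `multi_run` parameters are hypothetical stand-ins; the actual config models in `prime_rl.configs` may be structured differently.

```python
from dataclasses import dataclass


@dataclass
class WeightBroadcastSketch:
    """Hypothetical stand-in for the trainer weight-broadcast config."""
    type: str = "filesystem"
    sparse: bool = False


def validate_sparse(weight_broadcast, *, lora_enabled, multi_run):
    """Reject the combinations the PR description says are blocked."""
    if not weight_broadcast.sparse:
        return  # dense broadcast: nothing extra to check
    if weight_broadcast.type != "filesystem":
        raise ValueError("sparse weight broadcast is only supported with type = 'filesystem'")
    if lora_enabled:
        raise ValueError("sparse weight broadcast is not supported with LoRA")
    if multi_run:
        raise ValueError("sparse weight broadcast is not supported with multi-run training")


def is_rejected(weight_broadcast, **kwargs):
    """Helper for the examples below: True if validation raises."""
    try:
        validate_sparse(weight_broadcast, **kwargs)
    except ValueError:
        return True
    return False


ok = is_rejected(WeightBroadcastSketch("filesystem", True), lora_enabled=False, multi_run=False)
nccl = is_rejected(WeightBroadcastSketch("nccl", True), lora_enabled=False, multi_run=False)
lora = is_rejected(WeightBroadcastSketch("filesystem", True), lora_enabled=True, multi_run=False)
multi = is_rejected(WeightBroadcastSketch("filesystem", True), lora_enabled=False, multi_run=True)
```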

^{Reviewed by Cursor Bugbot for commit fd59b02.}

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

^{Reviewed by Cursor Bugbot for commit da54fc0.}

cursor · 2026-06-15T01:29:12Z

+        }
+        if "lm_head.weight" not in self._pulse_state_dict and "model.embed_tokens.weight" in self._pulse_state_dict:
+            self._pulse_state_dict["lm_head.weight"] = self._pulse_state_dict["model.embed_tokens.weight"].clone()
+        self._pulse_step = getattr(self, "_pulse_last_full_weight_step", None) or 0


Sparse patches need warm cache

High Severity

Sparse broadcast directories store deltas against a prior step, not a full HF checkpoint. On resume or after an inference restart, the worker rebuilds its PULSE cache from the base model with _pulse_step 0 and applies only the latest patch from get_weight_dir. That patch’s base_step is usually the previous training step, so apply_sparse_value_patch raises a base-step mismatch and weight sync fails unless a full checkpoint weight tree exists for that step.

Additional Locations (1)

src/prime_rl/utils/pulse.py#L148-L155
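The restart scenario in this comment can be reproduced with a toy step check. This is only an illustration of the reported failure mode, not the code in `pulse.py`; `check_base_step` and the step numbers are hypothetical.

```python
def check_base_step(cached_step, patch_base_step):
    """Toy version of the guard: a patch applies only on top of its base step."""
    if cached_step != patch_base_step:
        raise RuntimeError(
            f"base_step mismatch: cache at step {cached_step}, "
            f"patch built on step {patch_base_step}"
        )


# Warm worker: it applied every patch so far, so the latest patch lines up.
check_base_step(cached_step=41, patch_base_step=41)

# Restarted worker: the cache is rebuilt from the base model at step 0,
# but the latest broadcast directory still holds a delta built on step 41.
try:
    check_base_step(cached_step=0, patch_base_step=41)
    cold_start_error = None
except RuntimeError as exc:
    cold_start_error = str(exc)
```

The guard behaves as designed in both cases; the issue raised here is that a cold cache has no dense state for the step the latest patch was built on.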

samsja · 2026-06-15T01:29:41Z

+    sparse: bool = False
+    """Use sparse BF16 value patches for filesystem weight broadcast."""
+


Can we just make this a param in the filesystem config and remove it from here?

samsja marked this pull request as ready for review June 15, 2026 01:26

cursor Bot reviewed Jun 15, 2026

samsja commented Jun 15, 2026

add sparse filesystem weight sync

fd59b02

samsja force-pushed the feat/sparse-filesystem-weight-sync branch from da54fc0 to fd59b02 June 15, 2026 01:35

samsja marked this pull request as draft June 15, 2026 01:35

samsja marked this pull request as ready for review June 15, 2026 02:00

samsja wants to merge 1 commit into main from feat/sparse-filesystem-weight-sync
