Fix checkpoint directory race condition in DDP training #969

Open
0xadvait wants to merge 1 commit into Physical-Intelligence:main from 0xadvait:fix/ddp-checkpoint-race

@0xadvait

Problem

Multi-GPU training with torchrun and --overwrite crashes intermittently: every rank executes the checkpoint-directory setup in train_loop, so all ranks call shutil.rmtree on the same directory at once, and ranks that lose the race die with FileNotFoundError (full traceback in #868).

Fix

  • Only the main process deletes (--overwrite) and creates the checkpoint directory.
  • All ranks synchronize on dist.barrier() after directory setup, so no rank proceeds before the directory state is settled.
  • Resume detection stays on all ranks (read-only), and the resume error branches still raise on every rank, so non-main ranks cannot continue into a broken run.
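The guarded setup described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the function name `setup_checkpoint_dir`, its parameters, and the "directory is non-empty" resume check are all assumptions. The barrier is passed in as a callable so the sketch runs without a process group; in DDP training it would be `torch.distributed.barrier`.

```python
import shutil
from pathlib import Path
from typing import Callable


def setup_checkpoint_dir(
    ckpt_dir: Path,
    *,
    overwrite: bool,
    resume: bool,
    is_main: bool,
    barrier: Callable[[], None],
) -> bool:
    """Hypothetical helper: returns True when training should resume."""
    resuming = False
    if resume:
        # Read-only check, so it is safe on every rank. Nothing mutates the
        # directory on this path, so all ranks see the same state and either
        # all raise or all continue.
        if not ckpt_dir.is_dir() or not any(ckpt_dir.iterdir()):
            raise FileNotFoundError(f"No checkpoint to resume from in {ckpt_dir}")
        resuming = True
    elif is_main:
        # Only the main process mutates the filesystem.
        if overwrite and ckpt_dir.exists():
            shutil.rmtree(ckpt_dir)
        ckpt_dir.mkdir(parents=True, exist_ok=True)

    # Every rank waits here until the main process has finished the setup.
    barrier()
    return resuming
```

Raising before the barrier is safe in this sketch only because every rank takes the same branch; if a single rank could raise on its own, the remaining ranks would block in the barrier.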

Verification

  • ruff check and ruff format --check pass.
  • Simulated the race standalone with 8 processes against a shared directory: the unguarded pattern hit FileNotFoundError 140 times across 20 attempts, while the guarded pattern produced 0 errors across 50 attempts.
  • save_checkpoint already guards on is_main, so no other rank-unsafe filesystem operations remain in the script.
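For readers who want to reproduce the race locally, a simulation along these lines is one option. It is not the script used for the numbers above: it uses threads instead of processes to stay self-contained, `threading.Barrier` stands in for `dist.barrier()`, and the names `run_attempt` and `simulate` are hypothetical. Error counts for the unguarded pattern vary from run to run.

```python
import shutil
import tempfile
import threading
from pathlib import Path

WORLD_SIZE = 8  # number of simulated ranks


def run_attempt(root: Path, guarded: bool) -> int:
    """Run one simulated directory setup and return the number of OSErrors."""
    ckpt_dir = root / "ckpt"
    ckpt_dir.mkdir()
    # Populate the directory so rmtree takes long enough for ranks to collide.
    for i in range(50):
        (ckpt_dir / f"file_{i}").write_text("x")

    start = threading.Barrier(WORLD_SIZE)  # release all ranks at once
    done = threading.Barrier(WORLD_SIZE)   # stand-in for dist.barrier()
    errors: list[OSError] = []             # list.append is thread-safe

    def worker(rank: int) -> None:
        start.wait()
        try:
            # Guarded: only rank 0 mutates. Unguarded: every rank does.
            if not guarded or rank == 0:
                if ckpt_dir.exists():
                    shutil.rmtree(ckpt_dir)
                ckpt_dir.mkdir(parents=True, exist_ok=True)
        except OSError as exc:  # FileNotFoundError is a subclass of OSError
            errors.append(exc)
        done.wait()

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(WORLD_SIZE)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return len(errors)


def simulate(guarded: bool, attempts: int) -> int:
    """Total errors over several attempts, each in a fresh temporary directory."""
    total = 0
    for _ in range(attempts):
        with tempfile.TemporaryDirectory() as tmp:
            total += run_attempt(Path(tmp), guarded)
    return total


if __name__ == "__main__":
    print("unguarded errors:", simulate(guarded=False, attempts=20))
    print("guarded errors:", simulate(guarded=True, attempts=50))
```

With the guard in place only one thread ever touches the directory, so the guarded count is zero by construction; the unguarded count depends on scheduling and filesystem timing.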

Fixes #868

With torchrun and --overwrite, every rank deleted and recreated the
checkpoint directory simultaneously, so ranks could crash with
FileNotFoundError when another rank removed the directory first. Only
the main process now performs the rmtree and mkdir, and all ranks
synchronize on a barrier before training continues.

Fixes Physical-Intelligence#868

Development

Successfully merging this pull request may close these issues.

Race Condition in DDP Checkpoint Directory Operations (train_pytorch.py)