Wider ResUNet #127
Draft
forklady42 wants to merge 1 commit into
Width sweep on the 3K dataset_4 anchor (W32/W48/W64, matched recipe) shows monotonic improvement with width and a stable 0.33% absolute NMAE gap that holds across resumes.

train_workers reduced from 8 to 4 to mitigate the pymatgen/loky semaphore leak that drove host-RAM OOMs at higher widths.

use_checkpoint enabled because activation checkpointing is required for n_channels=64 to fit on an A100 80GB at batch_size=1.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
- n_channels from 32 to 64 in config_resunet.yaml, based on a width-ablation dry-run that shows monotonic improvement and a stable lead for W64.
- train_workers from 8 to 4: at higher widths the pymatgen/loky-driven semaphore leak in dataloader workers drove host-RAM OOMs even on a 256G allocation.
- use_checkpoint: False → True: activation checkpointing is required for n_channels=64 to fit on an A100 80GB at batch_size=1, depth=2, kernel=5.

Findings
Width sweep on chg_datasets/dataset_4 (3K, 128³ uniform), matched recipe (lr=0.01, precision=32, depth=2, kernel_size=5), 4× A100 80GB DDP, batch_size=1.

Read at epoch ~17/50 (resumed across multiple SLURM submissions, ~25 min/epoch). The W64 vs W32 gap was 0.32% absolute at epoch 13 and 0.33% at epoch 17: stable, not noise. W32's Δtrain/epoch dropped 4× over those four epochs while W64's barely moved, indicating that W32 is hitting its capacity ceiling first; the gap is likely to widen with more training rather than collapse.
W&B project: https://wandb.ai/PrinceOA/mp-resunet-width-ablation
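For context on why the sweep had to be resumed across several SLURM submissions, here is a rough wall-clock estimate. It is only a sketch: it takes the approximate figures quoted above (about 25 min/epoch, 50 epochs, read at epoch 17) at face value and assumes the per-epoch time stays constant. The function name is illustrative and not part of the repository.

```python
# Rough wall-clock budget for one width-sweep run.
# All inputs are the approximate figures quoted in this PR; the
# per-epoch time is assumed constant, which may not hold in practice.

MINUTES_PER_EPOCH = 25  # "~25 min/epoch"
TOTAL_EPOCHS = 50
EPOCHS_DONE = 17        # "read at epoch ~17/50"


def hours_for(epochs: int, minutes_per_epoch: float = MINUTES_PER_EPOCH) -> float:
    """Wall-clock hours needed to train the given number of epochs."""
    return epochs * minutes_per_epoch / 60


full_run_h = hours_for(TOTAL_EPOCHS)
remaining_h = hours_for(TOTAL_EPOCHS - EPOCHS_DONE)

print(f"full run:  ~{full_run_h:.1f} h")
print(f"remaining: ~{remaining_h:.1f} h")
```

Under these assumptions a full 50-epoch run needs roughly 21 hours per width, with about 14 hours still to go from epoch 17.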
Why these specific OOM mitigations
- train_workers: 4. With 4 DDP ranks × 8 workers = 32 dataloader processes, the loky pool leaks accumulate across an epoch and push host RAM past the SLURM allocation. Halving the worker count is the cheapest config-level fix; the upstream leak is a separate concern.
- use_checkpoint: True. This was off in the prior default. At width 32 it didn't matter; at width 64 the activation memory at batch_size=1 with depth=2 doesn't fit without it. Turning it on by default removes a footgun for anyone who picks up this config and changes nothing else.

Test plan
- mp/dataset_2/mp_filelist.txt, note: NOT chg_datasets/dataset_2) with this config and confirm training proceeds past epoch 1 without GPU or host OOM
- uv sync + a smoke training run exits cleanly with the new defaults

🤖 Generated with Claude Code
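For reviewers weighing the two OOM mitigations described above, the following back-of-the-envelope sketch shows how the relevant quantities scale. It rests on assumptions not stated in the PR: it sizes a single fp32 feature map at the full 128³ resolution (the real network keeps many such tensors alive for the backward pass, and how many is not given here), and it assumes all four DDP ranks share one host. Function names are illustrative only.

```python
# Back-of-the-envelope scaling for the two OOM mitigations.
# Assumptions (not from the PR): one fp32 tensor at full resolution,
# and all DDP ranks running on the same host.

BYTES_FP32 = 4   # precision=32
GRID = 128       # voxels per side of the 128^3 uniform grid


def feature_map_mib(n_channels: int, batch_size: int = 1) -> float:
    """Size in MiB of one fp32 activation tensor at full resolution."""
    n_elements = batch_size * n_channels * GRID ** 3
    return n_elements * BYTES_FP32 / 2 ** 20


def dataloader_processes(ddp_ranks: int, workers_per_rank: int) -> int:
    """Total dataloader worker processes across all ranks on the host."""
    return ddp_ranks * workers_per_rank


for width in (32, 48, 64):
    print(f"W{width}: {feature_map_mib(width):.0f} MiB per full-resolution feature map")

print("dataloader processes, train_workers=8:", dataloader_processes(4, 8))
print("dataloader processes, train_workers=4:", dataloader_processes(4, 4))
```

Each stored full-resolution feature map doubles from 256 MiB to 512 MiB when going from W32 to W64, which is consistent with activation checkpointing becoming necessary at the larger width. Halving train_workers cuts the number of processes that can accumulate the leak from 32 to 16.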