Wider ResUNet #127

Draft
forklady42 wants to merge 1 commit into main from betsy/resunet-default-width-64
@forklady42 forklady42 commented May 1, 2026
Collaborator

Summary

  • Bump default n_channels from 32 to 64 in config_resunet.yaml based on a width-ablation dry-run that shows monotonic improvement and a stable lead for W64.
  • Reduce train_workers from 8 to 4 — at higher widths the pymatgen/loky-driven semaphore leak in dataloader workers drove host-RAM OOMs even on a 256G allocation.
  • Flip use_checkpoint: False → True. Activation checkpointing is required for n_channels=64 to fit on an A100 80GB at batch_size=1, depth=2, kernel_size=5.

Findings

Width sweep on chg_datasets/dataset_4 (3K, 128³ uniform), matched recipe (lr=0.01, precision=32, depth=2, kernel_size=5), 4× A100 80GB DDP, batch_size=1.

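| width | best val NMAE% | last Δtrain%/ep | trend            |
|-------|----------------|-----------------|------------------|
| 32    | 4.082          | -0.056          | plateauing fast  |
| 48    | 3.889          | -0.110          | slowing          |
| 64    | 3.751          | -0.137          | still descending |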

Numbers were read at epoch ~17 of 50 (resumed across multiple SLURM submissions, ~25 min/epoch). The W64 vs W32 gap was 0.32% absolute at epoch 13 and 0.33% at epoch 17, so it is stable rather than noise. W32's Δtrain/epoch dropped 4× over those four epochs while W64's barely moved, indicating that W32 is hitting its capacity ceiling first; the gap is likely to widen with more training rather than collapse.
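
The gap figures can be re-derived from the sweep table. The following is a minimal sketch that uses only the best-val NMAE values reported above; the names `best_val_nmae` and `absolute_gap` are illustrative and do not come from the repository.

```python
# Best validation NMAE (%) per width, copied from the width-sweep table.
best_val_nmae = {32: 4.082, 48: 3.889, 64: 3.751}


def absolute_gap(wide: int, narrow: int) -> float:
    """Absolute NMAE improvement (percentage points) of `wide` over `narrow`."""
    return round(best_val_nmae[narrow] - best_val_nmae[wide], 3)


widths = sorted(best_val_nmae)

# Monotonic improvement: every wider model has a strictly lower NMAE.
is_monotonic = all(
    best_val_nmae[a] > best_val_nmae[b] for a, b in zip(widths, widths[1:])
)

print(is_monotonic)          # True
print(absolute_gap(64, 32))  # 0.331
print(absolute_gap(48, 32))  # 0.193
print(absolute_gap(64, 48))  # 0.138
```

The W64 vs W32 value of 0.331 matches the 0.33% absolute gap quoted for epoch 17.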

W&B project: https://wandb.ai/PrinceOA/mp-resunet-width-ablation

Why these specific OOM mitigations

  • train_workers: 4 — with 4 DDP ranks × 8 workers = 32 dataloader processes, the loky pool leaks accumulate across an epoch and push host RAM past the SLURM allocation. Halving worker count is the cheapest config-level fix; the upstream leak is a separate concern.
  • use_checkpoint: True — was off in the prior default. At width 32 it didn't matter; at width 64 the activation memory at batch_size=1 with depth=2 doesn't fit without it. Setting it on by default removes a footgun for anyone who picks up this config and changes nothing else.
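
The process-count arithmetic behind the train_workers change can be sketched as below. This assumes each DDP rank spawns its own pool of training dataloader workers and ignores any validation workers; the function name `dataloader_processes` is hypothetical.

```python
def dataloader_processes(ddp_ranks: int, workers_per_rank: int) -> int:
    """Total training dataloader worker processes on the host.

    Assumption: every DDP rank starts its own `workers_per_rank` workers.
    """
    return ddp_ranks * workers_per_rank


before = dataloader_processes(ddp_ranks=4, workers_per_rank=8)  # old default
after = dataloader_processes(ddp_ranks=4, workers_per_rank=4)   # new default

print(before, after)  # 32 16
```

Halving train_workers halves the number of processes in which the leak can accumulate; it does not address the leak itself.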

Test plan

  • Launch a full-MP run (113K, mp/dataset_2/mp_filelist.txt — note: NOT chg_datasets/dataset_2) with this config and confirm training proceeds past epoch 1 without GPU or host OOM
  • Sanity-check that a fresh checkout + uv sync + a smoke training run exits cleanly with the new defaults
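
As a possible companion to the smoke run, the new defaults could be checked programmatically. This is a hedged sketch: it assumes the three keys sit at the top level of an already-parsed config dictionary, which may not match the actual layout of config_resunet.yaml, and the helper `find_mismatches` is hypothetical.

```python
# Defaults introduced by this PR (values taken from the Summary).
EXPECTED_DEFAULTS = {
    "n_channels": 64,
    "train_workers": 4,
    "use_checkpoint": True,
}


def find_mismatches(cfg: dict) -> dict:
    """Return {key: (expected, actual)} for each default that differs.

    Assumption: `cfg` is a flat dict; a missing key is reported as None.
    """
    return {
        key: (expected, cfg.get(key))
        for key, expected in EXPECTED_DEFAULTS.items()
        if cfg.get(key) != expected
    }


old_cfg = {"n_channels": 32, "train_workers": 8, "use_checkpoint": False}
new_cfg = {"n_channels": 64, "train_workers": 4, "use_checkpoint": True}

print(find_mismatches(new_cfg))  # {}
print(sorted(find_mismatches(old_cfg)))
# ['n_channels', 'train_workers', 'use_checkpoint']
```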

🤖 Generated with Claude Code

Width sweep on the 3K dataset_4 anchor (W32/W48/W64, matched recipe)
shows monotonic improvement with width and a stable 0.33% absolute
NMAE gap that holds across resumes.

train_workers reduced from 8 to 4 to mitigate the pymatgen/loky
semaphore leak that drove host-RAM OOMs at higher widths.
use_checkpoint enabled because activation checkpointing is required
for n_channels=64 to fit on an A100 80GB at batch_size=1.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@forklady42 forklady42 marked this pull request as draft May 1, 2026 16:23
@forklady42 forklady42 changed the title Default ResUNet width 64 + OOM-safety changes Wider ResUNet May 1, 2026