Wider ResUNet by forklady42 · Pull Request #127 · Quantum-Accelerators/electrai

forklady42 · 2026-05-01T16:21:07Z

Summary

Bump default n_channels from 32 to 64 in config_resunet.yaml, based on a width-ablation dry run that shows monotonic improvement with width and a stable lead for W64.
Reduce train_workers from 8 to 4: at higher widths the pymatgen/loky-driven semaphore leak in dataloader workers caused host-RAM OOMs even on a 256 GB allocation.
Flip use_checkpoint from False to True: activation checkpointing is required for n_channels=64 to fit on an A100 80GB at batch_size=1, depth=2, kernel_size=5.
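
For a rough sense of why width matters for GPU memory, the sketch below sizes a single full-resolution activation tensor. It assumes float32 (precision=32) and a 128³ grid, the size used in the width sweep, and it assumes the full-resolution feature maps carry n_channels channels. The helper name activation_bytes is hypothetical and not part of the repository. A real training step holds many such tensors for the backward pass, which is the memory that activation checkpointing trades for recomputation.

```python
# Back-of-the-envelope estimate only; not a measurement of the actual model.
BYTES_PER_FLOAT32 = 4


def activation_bytes(n_channels: int, grid: int = 128, batch_size: int = 1) -> int:
    """Bytes in one (batch, channels, grid, grid, grid) float32 tensor."""
    return batch_size * n_channels * grid**3 * BYTES_PER_FLOAT32


for width in (32, 48, 64):
    mib = activation_bytes(width) / 2**20
    print(f"n_channels={width}: {mib:.0f} MiB per full-resolution activation")
```

Under these assumptions each stored full-resolution activation doubles from 256 MiB to 512 MiB when going from width 32 to width 64.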

Findings

Width sweep on chg_datasets/dataset_4 (3K, 128³ uniform), matched recipe (lr=0.01, precision=32, depth=2, kernel_size=5), 4× A100 80GB DDP, batch_size=1.

width (n_channels)	best val NMAE (%)	last Δtrain (%/epoch)	trend
32	4.082	-0.056	plateauing fast
48	3.889	-0.110	slowing
64	3.751	-0.137	still descending

Values were read at roughly epoch 17 of 50 (runs were resumed across multiple SLURM submissions, ~25 min/epoch). The W64 vs W32 gap in best val NMAE was 0.32 percentage points at epoch 13 and 0.33 at epoch 17, so it held steady across those four epochs rather than fluctuating. Over the same span W32's Δtrain/epoch dropped 4× while W64's barely moved, which suggests W32 is hitting its capacity ceiling first; the gap is likely to widen with more training rather than collapse.
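
The gaps quoted above can be re-derived from the table. The snippet below is a small sketch that uses only the three best val NMAE values reported in this PR; the names best_val_nmae, gap_vs_baseline, and is_monotonic are illustrative.

```python
# Best validation NMAE (%) per width, copied from the table in this PR.
best_val_nmae = {32: 4.082, 48: 3.889, 64: 3.751}

baseline = best_val_nmae[32]


def gap_vs_baseline(width: int) -> float:
    """Absolute NMAE improvement (percentage points) relative to W32."""
    return round(baseline - best_val_nmae[width], 3)


# Monotonic improvement means NMAE strictly decreases as width grows.
widths = sorted(best_val_nmae)
is_monotonic = all(
    best_val_nmae[a] > best_val_nmae[b] for a, b in zip(widths, widths[1:])
)

for width in widths[1:]:
    gap = gap_vs_baseline(width)
    relative = 100 * gap / baseline
    print(f"W{width} vs W32: {gap:.3f} points absolute, {relative:.1f}% relative")
print("monotonic:", is_monotonic)
```

This gives 0.193 points for W48 and 0.331 points for W64, matching the roughly 0.33 gap reported at epoch 17.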

W&B project: https://wandb.ai/PrinceOA/mp-resunet-width-ablation

Why these specific OOM mitigations

train_workers: 4 — with 4 DDP ranks × 8 workers = 32 dataloader processes, the loky pool leaks accumulate across an epoch and push host RAM past the SLURM allocation. Halving worker count is the cheapest config-level fix; the upstream leak is a separate concern.
use_checkpoint: True — was off in the prior default. At width 32 it didn't matter; at width 64 the activation memory at batch_size=1 with depth=2 doesn't fit without it. Setting it on by default removes a footgun for anyone who picks up this config and changes nothing else.
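
The worker arithmetic above can be written out as a short sketch. It assumes train_workers is a per-rank setting, as the 4 × 8 = 32 figure implies; the function name total_dataloader_workers is hypothetical.

```python
def total_dataloader_workers(ddp_ranks: int, workers_per_rank: int) -> int:
    """Each DDP rank starts its own dataloader workers, so the counts multiply."""
    return ddp_ranks * workers_per_rank


before = total_dataloader_workers(ddp_ranks=4, workers_per_rank=8)
after = total_dataloader_workers(ddp_ranks=4, workers_per_rank=4)
print(f"worker processes per node: {before} -> {after}")
```

Halving train_workers halves the number of processes in which the leak can accumulate, from 32 to 16 on a 4-GPU node. It does not fix the leak itself.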

Test plan

Launch a full-MP run (113K, mp/dataset_2/mp_filelist.txt — note: NOT chg_datasets/dataset_2) with this config and confirm training proceeds past epoch 1 without GPU or host OOM
Sanity-check that a fresh checkout + uv sync + a smoke training run exits cleanly with the new defaults

🤖 Generated with Claude Code

Width sweep on the 3K dataset_4 anchor (W32/W48/W64, matched recipe) shows monotonic improvement with width and a stable 0.33% absolute NMAE gap that holds across resumes. train_workers reduced from 8 to 4 to mitigate the pymatgen/loky semaphore leak that drove host-RAM OOMs at higher widths. use_checkpoint enabled because activation checkpointing is required for n_channels=64 to fit on an A100 80GB at batch_size=1.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

forklady42 marked this pull request as draft May 1, 2026 16:23

forklady42 changed the title ~~Default ResUNet width 64 + OOM-safety changes~~ Wider ResUNet May 1, 2026

forklady42 wants to merge 1 commit into main from betsy/resunet-default-width-64
