TL/CUDA: fix NVLS fallback and EC persistent hangs #1320
🤖 CI Triage Agent: The logs are crystal clear. Here is the complete diagnosis:
Summary: Lint (codestyle) check failed because all 4 commit titles in PR #1320 exceed the 50-character limit enforced by the commit title linter.
Root cause: The codestyle workflow enforces a strict 50-character maximum for non-merge commit titles. All four commits on the
The linter script checks
Implicated commit:
File:
Suggested fix: Shorten each of the four commit titles to ≤ 50 characters and amend/rebase the commits. Suggested rewrites (all ≤ 50 chars):
Then force-push the branch:

    git rebase -i origin/master   # edit the four commit messages
    git push --force-with-lease

Related: none
🤖 CI Triage Agent: The logs are crystal clear. Here is the complete diagnosis:
Summary: All 5 commits in PR #1320 fail the codestyle lint check because their titles exceed the 50-character limit enforced by
Root cause: The project's commit title linter enforces a hard 50-character maximum (unless the title starts with
The check logic (from

    if [ ${#msg} -gt 50 ]
    then
        if ! echo $msg | grep -qP '^Merge'
        then
            echo "Commit title is too long: ${#msg}"
            return 1
        fi
    fi

All five titles match a valid header prefix (
Implicated commit:
File:
Suggested fix: Rewrite (via
Then force-push the branch:
Related: none
🤖 CI Triage Agent: The logs are clear. I have a complete picture of the failure. No further tool calls are needed.
Summary: Lint (codestyle) check failed: 6 out of 7 commits in PR #1320 have non-conforming commit titles (wrong header format and/or exceeding the 50-character limit).
Root cause: The UCC codestyle CI enforces two rules for every commit title in a PR:
Six commits in this PR violate one or both rules:
Only
Implicated commit:
File:
Suggested fix: Rewrite the offending commit titles via interactive rebase (
Then force-push the branch to re-trigger CI.
Related: none
6c2b119 to 3463e98
| Filename | Overview |
|---|---|
| src/components/ec/cuda/ec_cuda_executor_persistent.c | Replaces ucc_memory_cpu_store_fence() with ucc_memory_bus_store_fence() in task_post and persistent_stop to fix GPU-visible ordering on weakly-ordered CPUs (aarch64/Grace); well-reasoned comments explain the DMB OSH vs DSB ISH distinction |
| src/components/tl/cuda/tl_cuda_nvls.c | Adds STATE_SYNC_STATUS to allgather each rank's import result so a per-rank pidfd_getfd failure causes all ranks to fall back together rather than deadlocking in cuMulticastBindAddr; improves diagnostic logging for permission-denied cases |
| src/components/tl/cuda/tl_cuda_nvls.h | Adds STATE_SYNC_STATUS to the state enum and three new fields to ucc_tl_cuda_nvls_t: init_ready, init_sync_data (temporary allgather buffer), and enabled (gates collective dispatch) |
| src/components/tl/cuda/tl_cuda_team.c | Replaces the static ucc_tl_cuda_nvls_check_support() call with a runtime team->nvls.enabled check in ucc_tl_cuda_get_supported_colls(), and downgrades NVLS init failure from tl_error to tl_debug |
| src/components/tl/cuda/allreduce/allreduce.c | Adds a cuda_team->nvls.enabled guard at the top of allreduce_init as a safety net, preventing dispatch to uninitialized NVLS resources if NVLS fell back |
Reviews (5): Last reviewed commit: "TL/CUDA: warn when NVLS peer fd import d..."
/build
3463e98 to 633c122
214eeee to c578ecc
When a rank fails to import the multicast handle during NVLS team initialization (e.g. pidfd_getfd returns EPERM in a container without CAP_SYS_PTRACE or with a restrictive seccomp filter), it would bail out and fall back while the other ranks proceeded into the collective cuMulticastBindAddr barrier and blocked forever, deadlocking the team (observed as a hang in the first DDP collective in test_c10d_ucc).

Add a STATE_SYNC_STATUS step that allgathers each rank's import status after STATE_IMPORT_HANDLE. If any rank failed, all ranks disable NVLS together and fall back symmetrically, avoiding the deadlock.

Also downgrade the expected NVLS init/fallback failures (peer fd import, multicast object creation, top-level NVLS init failure) from tl_error to tl_debug, matching how TL/SHARP reports failed initialization, so a supported fallback does not emit a spurious error.
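The symmetric decision described above can be sketched in a few lines of C. This is only an illustration, not the code in tl_cuda_nvls.c: `import_status_t` and `nvls_team_can_enable` are hypothetical names, and the array stands in for whatever buffer the allgather fills. The point is that every rank evaluates the same gathered data, so every rank reaches the same enable/disable answer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-rank result of the multicast handle import step. */
typedef enum {
    IMPORT_OK     = 0,
    IMPORT_FAILED = 1
} import_status_t;

/*
 * Decide whether NVLS stays enabled, given the allgathered status of every
 * rank. All ranks run this on identical input, so no rank can enter the
 * collective bind while another rank has already fallen back.
 */
static bool nvls_team_can_enable(const import_status_t *gathered,
                                 size_t team_size)
{
    assert(gathered != NULL && team_size > 0);
    for (size_t i = 0; i < team_size; i++) {
        if (gathered[i] != IMPORT_OK) {
            return false; /* one failure disables NVLS for the whole team */
        }
    }
    return true;
}
```

A rank would call this once the allgather completes and either continue to the bind step or skip NVLS setup entirely, depending on the result.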
ucc_cuda_executor_persistent_stop() signals the persistent GPU kernel to exit by writing eee->pidx = -1 and the SHUTDOWN state into device-mapped (cudaHostAllocMapped, zero-copy) memory, then busy-waits for the kernel to write SHUTDOWN_ACK back.

The shutdown flag was published without any memory barrier that orders the store against the GPU. On strongly-ordered CPUs (x86) this happens to work, but on weakly-ordered CPUs (aarch64, e.g. Grace/GB200/VR200) the inner-shareable CPU fences used elsewhere do not order against the GPU's shareability domain, so the persistent kernel may never observe pidx == -1. It then never exits and never acknowledges the shutdown, leaving the CPU spinning forever in the stop loop. This stalls the UCC progress thread and, with the PyTorch UCC backend, manifests as a hang in the first/teardown collective (e.g. test_ddp_checkpointing_dynamic_module hanging in an all_gather/barrier).

Publish the shutdown flag with ucc_memory_bus_store_fence() (outer-shareable on aarch64, sfence on x86), which is defined specifically to synchronize write-back and device-mapped memory. The ack is a volatile flag in coherent device-mapped memory, so it is observed without a load fence in the wait loop.
task_post() writes task args into device-mapped (zero-copy) memory and then publishes them by advancing pidx, which the persistent GPU kernel polls. The ordering "task args visible before pidx" was enforced with the inner-shareable CPU store fence, which does not order stores against the GPU's shareability domain on weakly-ordered CPUs (aarch64/Grace). Use ucc_memory_bus_store_fence() (outer-shareable), consistent with the shutdown path, so the kernel cannot observe an advanced pidx with stale task args.
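A minimal sketch of the "payload, fence, then index" ordering follows. It is not the executor code: `sketch_queue_t`, `sketch_task_post` and `sketch_bus_store_fence` are hypothetical names, the queue lives in ordinary host memory rather than device-mapped memory, and the `dmb oshst` choice for aarch64 is an assumption about what an outer-shareable store fence looks like (the commit message only states "outer-shareable on aarch64, sfence on x86").

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative stand-in for ucc_memory_bus_store_fence(). The instructions
 * chosen here are assumptions for this sketch, not copied from UCC. */
#if defined(__aarch64__)
#define sketch_bus_store_fence() __asm__ volatile("dmb oshst" ::: "memory")
#elif defined(__x86_64__) || defined(__i386__)
#define sketch_bus_store_fence() __asm__ volatile("sfence" ::: "memory")
#else
#define sketch_bus_store_fence() __sync_synchronize()
#endif

#define SKETCH_MAX_TASKS 8

/* Hypothetical queue shared with a polling consumer (the GPU kernel in
 * the PR; plain host memory in this sketch). */
typedef struct {
    volatile int32_t pidx;                   /* producer index, polled */
    volatile int32_t args[SKETCH_MAX_TASKS]; /* task payload slots */
} sketch_queue_t;

/* Write the payload, fence, then advance pidx, so the consumer cannot
 * see an advanced index paired with a stale payload. */
static void sketch_task_post(sketch_queue_t *q, int32_t arg)
{
    int32_t cur = q->pidx;
    assert(cur >= 0); /* a negative index means shutdown was requested */
    q->args[cur % SKETCH_MAX_TASKS] = arg;
    sketch_bus_store_fence();
    q->pidx = cur + 1;
}
```

The only difference from an inner-shareable variant is the fence macro itself, which is why the change in the PR is a one-call substitution on each publish path.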
ucc_tl_cuda_get_supported_colls() advertised NVLS ALLREDUCE based on the static hardware capability check (cuMulticast attributes), not on whether NVLS actually initialized for the team. When NVLS init falls back for a single-node team (e.g. peer fd import denied, team size over the NVLS peer limit, non-uniform ppn), the TL/CUDA team is still created, so allreduce was routed to ucc_tl_cuda_allreduce_nvls_init with no NVLS resources set up, producing wrong results / crashes instead of falling back.

Track whether NVLS finished initializing (nvls.enabled, set only after the final NVLS barrier) and:
- gate advertising ALLREDUCE in get_supported_colls on nvls.enabled, so the score map routes allreduce to another TL when NVLS is unavailable;
- defensively return UCC_ERR_NOT_SUPPORTED from ucc_tl_cuda_allreduce_init when NVLS is not enabled.
When single-node NVLS falls back because importing a peer process file descriptor is denied (EPERM/EACCES from pidfd_open/pidfd_getfd due to Yama ptrace_scope, missing CAP_SYS_PTRACE, or a seccomp filter), emit a single per-process warning that explains the cause and how to enable NVLS (host sysctl kernel.yama.ptrace_scope=0, docker --cap-add=SYS_PTRACE / --security-opt seccomp=unconfined, enroot --container-remap-root). The per-occurrence detail stays at debug level to avoid spam.
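One way to get "a single per-process warning" is a process-wide atomic flag, sketched below. This is an assumption about the mechanism, not the PR's implementation; `sketch_report_import_failure` and the message text are hypothetical, and a real version would go through the project's logging macros instead of `fprintf`.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

/* Process-wide "already warned" flag. */
static atomic_flag sketch_warned = ATOMIC_FLAG_INIT;

/*
 * Report a failed peer fd import. Returns true only for the one call that
 * actually printed the warning; other errors and repeat occurrences are
 * left to debug-level logging (omitted here).
 */
static bool sketch_report_import_failure(int err)
{
    assert(err > 0); /* errno values are positive */
    if (err != EPERM && err != EACCES) {
        return false; /* not a permission problem, no hint to give */
    }
    if (atomic_flag_test_and_set(&sketch_warned)) {
        return false; /* this process has already warned */
    }
    fprintf(stderr,
            "NVLS disabled: peer fd import denied (errno %d); check "
            "kernel.yama.ptrace_scope, CAP_SYS_PTRACE or the seccomp "
            "profile\n", err);
    return true;
}
```

Because the flag is atomic, concurrent callers in the same process still produce exactly one warning.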
c578ecc to cef80e2
/build
What
Fix TL/CUDA NVLS hangs/failures that broke test_c10d_ucc.py (e.g. test_ddp_checkpointing_dynamic_module) on aarch64 (Grace/GB200/VR200) and in containers without ptrace permission:
- When a rank failed to import the multicast handle during NVLS init (e.g. pidfd_getfd EPERM in a restricted container), it fell back while the other ranks blocked forever in the collective cuMulticastBindAddr. All ranks now exchange import status and disable NVLS together.
- The EC persistent executor's shutdown and task-publish flags in device-mapped memory used inner-shareable fences, which don't order against the GPU on aarch64, so the persistent kernel never saw the update. Use the bus (outer-shareable) fences.
- TL/CUDA advertised the NVLS allreduce algorithm (based on a static HW capability check) even when NVLS did not initialize. Advertise/route NVLS allreduce only when NVLS is actually enabled.
- Emit a single per-process warning when a peer fd import is denied (ptrace_scope / CAP_SYS_PTRACE / docker / enroot hints); keep the rest at debug so a supported fallback is not noisy.
Why ?
On aarch64 (GB200/VR200) and permission-restricted containers, NVLS either
deadlocked at team creation (hang in the first DDP collective) or silently
produced wrong results after falling back. NCCL worked, so it was UCC-specific.
Root-caused on a VR200 node: the progress thread was stuck in ucc_cuda_executor_persistent_stop, and the peer-fd EPERM caused an asymmetric NVLS init.
Fixes: RM 5113172
How ?
- tl_cuda_nvls: add STATE_SYNC_STATUS that allgathers each rank's import result; disable NVLS team-wide on any failure. Track nvls.enabled (set only after the final NVLS barrier).
- ec_cuda_executor_persistent: publish/poll shutdown and task-post flags with ucc_memory_bus_store_fence()/ucc_memory_bus_load_fence().
- tl_cuda_team/allreduce: gate advertising and dispatch of NVLS allreduce on nvls.enabled.

Validated on VR200 (aarch64) and H100 (x86): test_c10d_ucc.py passes (50 passed / 14 skipped), single-node and multinode NVLS allreduce correct, and the forced-EPERM path falls back cleanly instead of hanging.