[Issue] CI failure: rccl-UnitTests failing Register.ProcessIsolatedRegisterTests with NCCL_LOCAL_REGISTER disabled #4729

@chiranjeevipattigidi

Description:
rccl-UnitTests fails in the Register test suite when NCCL_LOCAL_REGISTER is disabled. The enabled variants pass, but multiple *_Disabled isolated tests fail due to an unexpected non-NULL registration handle.

Tests Failed:

CommRegisterDeregister_Disabled
MultipleBufferRegistration_Disabled
VariableSizeBuffers_Disabled

Error snippets:

2026-04-06T12:14:59.9595502Z Expected equality of these values:
2026-04-06T12:14:59.9595973Z   regHandles[i]
2026-04-06T12:14:59.9596365Z     Which is: 0x794820a03c50
2026-04-06T12:14:59.9596767Z   nullptr
2026-04-06T12:14:59.9597117Z     Which is: (nullptr)
2026-04-06T12:14:59.9597723Z Expected NULL handle for buffer 3
2026-04-06T12:14:59.9598061Z 
2026-04-06T12:14:59.9598694Z [ INFO     ] Test 'MultipleBufferRegistration_Disabled' (PID: 5247) FAILED with exit code 1 after 2955 ms
2026-04-06T12:14:59.9600142Z [ INFO     ] Running isolated test 'MultipleBufferRegistration_Enabled' (PID: 5259) with env: NCCL_LOCAL_REGISTER=1
2026-04-06T12:15:02.1466077Z [12:15:02Z] Mem: 96.0/3023.4GB (3%) | CPU: 0% | Jobs: ~1/384 | Disk: 784GB free
2026-04-06T12:15:02.9117231Z [ INFO     ] Test 'MultipleBufferRegistration_Enabled' PASSED (2953 ms)
2026-04-06T12:15:02.9126660Z [ INFO     ] Running isolated test 'VariableSizeBuffers_Disabled' (PID: 5272)
2026-04-06T12:15:05.8646421Z /__w/TheRock/TheRock/rocm-systems/projects/rccl/test/RegisterTests.cpp:164: Failure
2026-04-06T12:15:11.7703016Z [ INFO     ] Process-Isolated Tests: 4 passed, 3 failed, 0 skipped (20669 ms total)
2026-04-06T12:15:11.7704030Z [ INFO     ]   Failed: CommRegisterDeregister_Disabled - Test failed with exit code 1
2026-04-06T12:15:11.7705069Z [ INFO     ]   Failed: MultipleBufferRegistration_Disabled - Test failed with exit code 1
2026-04-06T12:15:11.7706003Z [ INFO     ]   Failed: VariableSizeBuffers_Disabled - Test failed with exit code 1
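
For triage, the failing test names can be pulled straight out of the runner's summary lines. This is a minimal sketch that assumes the log format shown in the snippet above; the helper name `failed_isolated_tests` is hypothetical and not part of RCCL or the CI tooling.

```python
import re

# Matches the runner's summary lines, e.g.
#   "... [ INFO     ]   Failed: VariableSizeBuffers_Disabled - Test failed with exit code 1"
# Assumption: the "[ INFO ]" prefix and wording match the snippet in this issue.
FAILED_LINE = re.compile(
    r"\[ INFO\s+\]\s+Failed: (\S+) - Test failed with exit code (\d+)"
)


def failed_isolated_tests(log_text):
    """Return (test_name, exit_code) pairs found in the summary lines."""
    return [(m.group(1), int(m.group(2))) for m in FAILED_LINE.finditer(log_text)]


# Summary lines copied from the error snippet in this issue.
SAMPLE = """\
2026-04-06T12:15:11.7703016Z [ INFO     ] Process-Isolated Tests: 4 passed, 3 failed, 0 skipped (20669 ms total)
2026-04-06T12:15:11.7704030Z [ INFO     ]   Failed: CommRegisterDeregister_Disabled - Test failed with exit code 1
2026-04-06T12:15:11.7705069Z [ INFO     ]   Failed: MultipleBufferRegistration_Disabled - Test failed with exit code 1
2026-04-06T12:15:11.7706003Z [ INFO     ]   Failed: VariableSizeBuffers_Disabled - Test failed with exit code 1
"""

for name, code in failed_isolated_tests(SAMPLE):
    print(f"{name}: exit code {code}")
```

On the sample above this lists the three `*_Disabled` tests, which matches the "Tests Failed" list in this report.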

Full Logs:

Impact:

These failures were detected in the Rock - Multiarch CI check for the rocm-libraries submodule bump "Bump rocm-libraries from f000f77 to a9ff799".

They are currently blocking its promotion.
